Abstract
Celiac disease (CD) is an autoimmune gastrointestinal disorder characterized by immune-mediated enteropathy in response to gluten. Gluten immunogenic peptides can trigger immune responses that damage the small intestine. HLA-DQ2 and HLA-DQ8 are the major alleles that bind to epitopes (antigenic regions) of gluten and induce celiac disease. There is a need to identify CD-associated epitopes in protein-based foods and therapeutics. In addition, prediction of CD-associated epitopes and peptides is required for developing antigen-based immunotherapy against celiac disease. In this study, computational tools have been developed to predict CD-associated epitopes and motifs. The dataset used for training, testing and evaluation contains experimentally validated CD-associated and non-CD-associated peptides. Our analysis supports the existing hypothesis that proline (P) and glutamine (Q) are highly abundant in CD-associated peptides. A model based on the density of P and Q in peptides was developed for predicting CD-associated peptides and achieves a maximum AUROC of 0.98. We discovered CD-associated motifs (e.g., QPF, QPQ, PYP) that occur specifically in CD-associated peptides. We also developed machine learning based models using peptide composition, which achieved a maximum AUROC of 0.99. We then developed an ensemble method that combines the motif-based approach with the machine learning based models. The ensemble model predicts CD-associated motifs with 100% accuracy on an independent dataset that was not used for training. Finally, the best models and motifs have been integrated into a web server and standalone software package, “CDpred”. We hope this server will assist the scientific community in the prediction, design and scanning of CD-associated peptides as well as CD-associated motifs in a protein/peptide sequence (https://webs.iiitd.edu.in/raghava/cdpred/).
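As a rough illustration of the two sequence-level signals named in the abstract, the sketch below computes the P and Q density of a peptide and lists where the example motifs (QPF, QPQ, PYP) occur. It is a minimal sketch and not the CDpred implementation: the function names `pq_density` and `motif_hits` and the example peptide are assumptions made here for illustration, and no decision threshold is implied.

```python
# Example motifs quoted in the abstract; the full CDpred motif set is not given there.
MOTIFS = ("QPF", "QPQ", "PYP")


def pq_density(peptide: str) -> float:
    """Return the fraction of residues that are proline (P) or glutamine (Q)."""
    peptide = peptide.upper()
    if not peptide:
        raise ValueError("peptide must not be empty")
    return sum(residue in "PQ" for residue in peptide) / len(peptide)


def motif_hits(peptide: str, motifs=MOTIFS) -> dict:
    """Map each motif found in the peptide to its 0-based start positions.

    Overlapping occurrences are reported; motifs that do not occur are omitted.
    """
    peptide = peptide.upper()
    hits = {}
    for motif in motifs:
        positions = [
            start
            for start in range(len(peptide) - len(motif) + 1)
            if peptide.startswith(motif, start)
        ]
        if positions:
            hits[motif] = positions
    return hits


if __name__ == "__main__":
    example = "PQPQLPYPQ"  # illustrative sequence only, not taken from the dataset
    print(f"P+Q density: {pq_density(example):.2f}")
    print(f"motif hits: {motif_hits(example)}")
```

A density score of this kind could be thresholded or fed to a classifier alongside other composition features; how CDpred combines these signals is described only at a high level in the abstract.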
Key Points
Celiac disease is one of the prominent autoimmune diseases
Gluten immunogenic peptides are responsible for celiac disease
Mapping of celiac disease associated epitopes and motifs on proteins
Identification of proline and glutamine rich regions
A web server and software package for predicting CD-associated peptides
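Identifying proline and glutamine rich regions in a longer protein can be pictured as a sliding-window scan. The following is a minimal sketch under stated assumptions: the function name `pq_rich_windows`, the window length of 9 and the cutoff of 0.5 are hypothetical choices for illustration and are not parameters reported for CDpred.

```python
def pq_rich_windows(sequence: str, window: int = 9, min_density: float = 0.5) -> list:
    """Return (start, segment, density) for each window rich in P and Q.

    `window` and `min_density` are illustrative defaults, not published values.
    `start` is the 0-based position of the window in the sequence.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    sequence = sequence.upper()
    regions = []
    # If the sequence is shorter than the window, the range is empty and [] is returned.
    for start in range(len(sequence) - window + 1):
        segment = sequence[start:start + window]
        density = sum(residue in "PQ" for residue in segment) / window
        if density >= min_density:
            regions.append((start, segment, density))
    return regions


if __name__ == "__main__":
    protein = "AAAAPQPQLPYPQAAAA"  # made-up sequence for demonstration
    for start, segment, density in pq_rich_windows(protein, window=9, min_density=0.7):
        print(start, segment, round(density, 2))
```

Windows flagged this way would only be candidates; each would still need to be scored by a trained model or checked against known motifs.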
Authors’ Biographies
Ritu Tomer is currently pursuing a Ph.D. in Computational Biology at the Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India.
Sumeet Patiyal is currently pursuing a Ph.D. in Computational Biology at the Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India.
Anjali Dhall is currently pursuing a Ph.D. in Computational Biology at the Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India.
Gajendra P. S. Raghava is currently working as Professor and Head of Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India.
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
Mailing Address of Authors
Ritu Tomer (RT): ritut{at}iiitd.ac.in
Sumeet Patiyal (SP): sumeetp{at}iiitd.ac.in
Anjali Dhall (AD): anjalid{at}iiitd.ac.in
Gajendra P. S. Raghava (GPSR): raghava{at}iiitd.ac.in
Abbreviations
- CD: Celiac Disease
- HLA: Human leukocyte antigens
- CXCR3: Chemokine receptor 3
- tTG: Tissue transglutaminase
- sIgA: Secretory Immunoglobulin A
- IEDB: Immune Epitope Database
- AUROC: Area under receiver operator curve
- DT: Decision Tree
- RF: Random Forest
- SVC: Support Vector Classifier
- XGB: XGBoost
- LR: Logistic Regression
- ET: Extra Tree classifier
- KNN: k-Nearest Neighbors
- GNB: Gaussian Naive Bayes