In silico model for predicting IL-2 inducing peptides in human

Interleukin-2 (IL-2) based immunotherapy has already been approved to treat certain types of cancer, as IL-2 plays a vital role in the immune system. It is therefore important to discover new peptides or epitopes that can induce IL-2 with high efficiency. We analyzed experimentally validated IL-2 inducing and non-inducing peptides and observed differences in average amino acid composition, motifs, length, and positional preference of amino acid residues at the N- and C-termini. In this study, 2528 IL-2 inducing and 2104 non-IL-2 inducing peptides were used for training, testing, and validation of our models. A large number of machine learning techniques and around 10,000 peptide features were used for developing prediction models. The Random Forest-based model using hybrid features achieved a maximum accuracy of 73.25% with an AUC of 0.73 on the training set, and an accuracy of 72.89% with an AUC of 0.72 on the validation dataset. A web server, IL2pred, has been developed for predicting IL-2 inducing peptides, scanning IL-2 inducing regions in a protein, and designing IL-2 specific epitopes by ranking peptide analogs (https://webs.iiitd.edu.in/raghava/il2pred/).


Introduction
Conventional cancer treatment strategies include chemotherapy, radiotherapy, and a combination of both. Mounting experimental and clinical evidence suggests that conventional therapies have many side effects, ranging from patient discomfort to toxicity and the development of resistance [1]. In this regard, modulation of the body's immune system via immunotherapy is considered the safest and most advanced option for the treatment of cancer [2]. Recent clinical studies have confirmed the efficacy and safety of immunotherapy in managing large cohorts of patients [3]. Thus, immunotherapeutic agents have become a staple of modern oncology treatment regimens [4]. Immunotherapy employs therapeutic modalities such as interleukins, oncolytic viruses, and vaccines. Interleukins are small glycoproteins that can have a pleiotropic effect on the immune system. They bind to receptors on the cell surface and can aid in the differentiation, survival, and proliferation of immune cells [5].
Currently, two cytokine-based drugs are approved by the FDA for the treatment of human malignancies: the interferon-alpha based drug Roferon-A and the interleukin-2 (IL-2) based drug aldesleukin (Proleukin). IL-2 was an early cytokine approved by the FDA to treat metastatic renal carcinoma and metastatic melanoma. IL-2 is a 15.5 kDa alpha-helical cytokine that belongs to the gamma chain family of immune modulators. CD4+ T-cells chiefly produce IL-2, but it can also be produced by CD8+ T-cells, dendritic cells, and natural killer cells (NK-cells) [6]. IL-2 acts via the JAK-STAT pathway and helps maintain and differentiate CD4+ T-cells into various subsets, including Th1, Th2, and Th17. It can also promote CD8+ T-cell and NK-cell cytotoxic activity and support the generation of memory cells [7].
The strong immune-stimulating activity of IL-2 has led to the development of several treatment regimens. IL-2 has been used in clinical settings as a monotherapy agent and also in combination with several other therapeutic regimens. Low-dose IL-2 combined with interferon-alpha shows more significant therapeutic benefits for patients with renal cell carcinoma [8]. IL-2 in combination with activated killer cells has been evaluated for the treatment of melanoma patients [9]. IL-2 combined with chemotherapeutic agents, including cisplatin and dacarbazine, has been extensively investigated in patients with metastatic melanoma [10]. IL-2 combined with targeted therapies such as gefitinib showed a positive response in NSCLC patients [11]. The combination of IL-2 with a therapeutic cancer vaccine also indicates a synergistic effect in advanced melanoma [12].
Undoubtedly, IL-2 has shown great potential in treating cancers. However, its application in the clinic remains relatively restricted due to several shortcomings. The major limitation of IL-2 therapy is its toxicity when administered exogenously in high doses. IL-2 also suffers from a short half-life of several minutes. High doses of IL-2 sometimes lead to vascular leakage and hypotension [13]. Clinicians and researchers have employed several strategies to overcome the challenges faced by IL-2 therapy. One such approach is the generation of IL-2 mutants that can activate immune cells without pro-inflammatory activity and are thus free from toxicity-related problems [14]. The mutant version of IL-2 is known as "Superkine" because of its enhanced anti-tumor property.
These points highlight that the mutant version of natural IL-2 possesses a high therapeutic index.
Thus the generation of an improved version of IL-2 has become an important area of research.
However, the identification and generation of mutants in a clinical setup is time-consuming and cost-intensive. In silico computational methods can help scientists and clinicians in this regard. Several methods have been developed in the past that target and utilize the therapeutic potential of interleukin-based therapy for the management of various human malignancies. CytoPred is one such method, developed for the identification and classification of cytokines [15]. In addition, methods have been developed for predicting peptides that induce specific types of cytokines; these include IL4pred [16], IL10Pred [17], IFNepitope [18], and IL6pred [19] for IL-4, IL-10, IFN-gamma, and IL-6, respectively. Despite the huge therapeutic potential of IL-2, no method has been reported in the literature for the identification and generation of IL-2 inducing peptides. In the present study, we have attempted to develop a computational method for the prediction and identification of peptides that have the potential to induce the IL-2 cytokine.
To serve the scientific community, the developed computational method has been incorporated in a web server, which is freely available at https://webs.iiitd.edu.in/raghava/il2pred/.

Dataset Preparations and Pre-processing
We extracted all the experimentally validated IL-2 inducing and non-IL-2 inducing peptides from the Immune Epitope Database (IEDB), the largest repository of immune epitopes. We specifically extracted a total of 5427 peptides that show MHC class-II binding affinity and can trigger IL-2 secretion as measured by different immunological assays. These epitopes were termed IL-2 inducing peptides and grouped under the positive dataset. The negative dataset consists of a total of 3568 epitopes that do not trigger IL-2 secretion and are termed non-inducers. Only peptides with human as the host species were considered in both the positive and negative datasets. Literature evidence suggests that peptides of length 8-25 are most suitable for MHC antigen processing and presentation.
Thus, we removed all peptides with lengths below 8 or above 25. We also removed all redundant peptides and were finally left with 2528 IL-2 inducing and 2104 non-IL-2 inducing peptides. These 2528 and 2104 peptides constitute our final dataset for further analysis.
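The filtering step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the function name filter_peptides and the example sequences are hypothetical, and redundancy is assumed to mean exact duplicate sequences.

```python
def filter_peptides(peptides, min_len=8, max_len=25):
    """Keep unique peptides whose length lies within [min_len, max_len].

    Assumption: "redundant" means an exact duplicate sequence
    (case-insensitive); the first occurrence is kept.
    """
    seen = set()
    kept = []
    for pep in peptides:
        seq = pep.strip().upper()
        if not (min_len <= len(seq) <= max_len):
            continue  # outside the 8-25 residue range
        if seq in seen:
            continue  # duplicate of a peptide already kept
        seen.add(seq)
        kept.append(seq)
    return kept


# Hypothetical input: one too short, one duplicate, one too long.
raw = [
    "ACDEFGH",                     # 7 residues, removed
    "ACDEFGHIK",                   # kept
    "acdefghik",                   # duplicate, removed
    "ACDEFGHIKLMNPQRSTVWYACDEFG",  # 26 residues, removed
    "LLSAGKTYW",                   # kept
]
clean = filter_peptides(raw)
print(clean)
```

The same function would be applied separately to the positive and negative sets.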

Length Distribution and Residue Conservation Analysis
The positive and negative datasets were analyzed for preferred peptide length distribution using an in-house R script. We computed the average composition of the different amino acids present in both the positive and negative datasets. The two-sample logo was generated using the R package "ggseqlogo" for both datasets to understand the preference for specific amino acid residues at particular positions [20]. This package processes only fixed-length peptide sequences as input, and since the shortest peptide is eight residues long, we combined the 8-residue segments from the N- and C-termini of each peptide for plotting the two-sample logo. The following example shows the protocol for generating the 16-residue sequences.

Peptide of 16 residues: N-terminus residues (N->C) + C-terminus residues (C->N)
Single peptide of 16 residues: ARGCGHTRLKRTHGCG
The first eight positions of the logo represent the N-terminus of the peptides, and the last eight positions represent the C-terminus. We used all the peptides present in our positive and negative datasets for creating the two-sample logo.
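As an illustration of this protocol, the sketch below builds the fixed-length 16-residue string from a peptide of any length of at least eight residues. The function name terminal_window and the reverse_c_terminus flag are hypothetical; the flag exists because the text does not make clear whether the C-terminal segment is written in its natural order or reversed, so both readings are supported.

```python
def terminal_window(peptide, n=8, reverse_c_terminus=False):
    """Join the first n and last n residues of a peptide.

    For peptides shorter than 2*n the two segments overlap, which is
    expected because the shortest peptide in the dataset has n residues.
    """
    if len(peptide) < n:
        raise ValueError("peptide is shorter than the window size")
    n_part = peptide[:n]    # N-terminal segment, read N->C
    c_part = peptide[-n:]   # C-terminal segment, natural order
    if reverse_c_terminus:
        c_part = c_part[::-1]  # alternative reading: C->N order
    return n_part + c_part


print(terminal_window("ACDEFGHIKLMNPQRSTVWY"))
print(terminal_window("ACDEFGHIKLMNPQRSTVWY", reverse_c_terminus=True))
```

The resulting equal-length strings are the kind of input a fixed-length logo tool expects.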

Motif Analysis
Identification of motifs within peptides is crucial for annotating the function of these peptides. We used the MERCI software to identify specific motifs in both the positive and negative datasets. We ran MERCI in two steps. First, we extracted the motifs for the positive dataset by providing the IL-2 inducing peptides as the positive set and the non-IL-2 inducing peptides as the negative set. Second, we retrieved the motifs for the negative dataset by providing the non-IL-2 inducing peptides as the positive set and the IL-2 inducing peptides as the negative set. Both MERCI runs used the NONE classification so that motifs were identified in a mutually exclusive manner. We then screened the peptides containing unique motifs from both sets to estimate the overall coverage of the various motifs in the complete data.
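The idea of mutually exclusive motifs and their coverage can be illustrated with a much simpler stand-in than MERCI. The sketch below only considers contiguous k-mers, whereas MERCI also handles gaps and residue classes, so it should be read as a toy approximation of the two-run procedure; all function names, parameters, and sequences are hypothetical.

```python
from collections import Counter


def kmers(seq, k):
    """Return the set of contiguous k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def exclusive_motifs(target, background, k=3, min_support=2):
    """k-mers seen in at least min_support target peptides and in no
    background peptide (a simplified stand-in for a MERCI run)."""
    background_kmers = set()
    for seq in background:
        background_kmers |= kmers(seq, k)
    support = Counter()
    for seq in target:
        support.update(kmers(seq, k) - background_kmers)
    return {m: c for m, c in support.items() if c >= min_support}


def coverage(peptides, motifs):
    """Number of peptides containing at least one of the motifs."""
    return sum(any(m in p for m in motifs) for p in peptides)


# Hypothetical toy data.
positive = ["ALSGKTWY", "QALSGHHH", "MMALSPPP"]
negative = ["GKTWYQQQ", "HHHPPPMM"]

pos_motifs = exclusive_motifs(positive, negative)   # first run
neg_motifs = exclusive_motifs(negative, positive)   # second run, sets swapped
print(pos_motifs, coverage(positive, pos_motifs))
print(neg_motifs, coverage(negative, neg_motifs))
```

Swapping the two sets between runs mirrors the two-step use of the positive and negative datasets described above.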

Feature Estimation and Selection
We computed peptide descriptors for both the positive and negative datasets. Literature evidence suggests that not all descriptors are useful for developing machine learning models. Thus, to remove unwanted features, we employed the random forest feature selection function available in the R caret package. After the feature selection step, all the features were tested independently as well as in combination with other features for their ability to classify IL-2 inducing and non-inducing peptides.
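To make the notion of a peptide descriptor concrete, the sketch below computes amino acid composition (AAC), one of the feature types discussed later in the Results. It is a minimal illustration under the assumption that AAC is expressed as the percentage of each of the 20 standard residues; it is not the descriptor code used in the study.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, fixed order


def aac(peptide):
    """Amino acid composition as a 20-element vector of percentages."""
    if not peptide:
        raise ValueError("empty peptide")
    length = len(peptide)
    return [100.0 * peptide.count(aa) / length for aa in AMINO_ACIDS]


vector = aac("AALS")
# Pair each residue with its percentage and show the non-zero entries.
print({aa: v for aa, v in zip(AMINO_ACIDS, vector) if v > 0})
```

Each peptide becomes one row of such values, and the rows from both datasets form the matrix passed to feature selection.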

Machine Learning-based Classification Model
We implemented various machine learning algorithms for the classification of IL-2 inducing and non-IL-2 inducing peptides, using the caret package of R. The algorithms include decision trees (DT), random forest (RF), multi-layer perceptron (MLP), eXtreme gradient boosting (XGBoost), K-nearest neighbors (KNN), support vector machine with a radial basis kernel (SVR), neural network (NN), Ridge, Lasso, and Elastic Net. Different parameters were optimized using the "expand.grid" functionality with the "caret" package of R. DT is a non-parametric supervised algorithm; RF is an ensemble-based method that fits numerous decision trees to predict the outcome of the dependent variable; KNN is an instance-based learning algorithm; and XGBoost is a tree boosting classification algorithm that builds the final prediction through an iterative search approach.
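A tuning grid of this kind is simply the Cartesian product of the candidate values for each parameter. The Python sketch below shows the idea; the helper name expand_grid and the parameter names n_trees and max_features are hypothetical placeholders, not the parameters tuned in the study.

```python
from itertools import product


def expand_grid(**params):
    """Return every combination of the supplied parameter values.

    Note: itertools.product varies the last parameter fastest, whereas
    R's expand.grid varies the first; the set of combinations is the same.
    """
    names = list(params)
    return [dict(zip(names, combo))
            for combo in product(*(params[name] for name in names))]


# Hypothetical candidate values for a tree ensemble.
grid = expand_grid(n_trees=[100, 500], max_features=[2, 5, 10])
for candidate in grid:
    print(candidate)
```

Each dictionary in the grid corresponds to one model configuration to be scored by cross-validation.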

Cross-validation
For developing the machine learning-based models, we followed standard protocols used in previous studies [22]. We divided our dataset in an 80:20 proportion into training and testing sets, respectively, similar to several past studies. The training data comprises 2020 IL-2 inducing peptides and 1683 non-IL-2 inducing peptides, whereas the testing set has 505 IL-2 inducing peptides and 420 non-IL-2 inducing peptides. The different classifiers were trained and evaluated using ten-fold cross-validation, a well-accepted technique for optimizing the parameters and assessing the performance of models. In ten-fold cross-validation, the whole training dataset was divided into ten equal parts; iteratively, nine parts were used for training and one for validation to obtain the optimized parameters for the models. This process was repeated ten times so that every part was used for both training and validation. All these classifiers were implemented using an in-house R script. The performance of the developed models was evaluated using metrics such as specificity, sensitivity, and accuracy.
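The fold construction and the threshold-dependent metrics can be sketched as follows. This is an illustrative Python outline rather than the in-house R script: the function names are hypothetical, labels are assumed to be 1 for IL-2 inducers and 0 for non-inducers, and the metrics use the standard confusion-matrix definitions.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (training, validation) index lists for k-fold cross-validation.

    Indices are shuffled once, dealt into k near-equal folds, and each
    fold serves as the validation part exactly once.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        training = [idx for j, fold in enumerate(folds) if j != i
                    for idx in fold]
        yield training, folds[i]


def binary_metrics(y_true, y_pred):
    """Sensitivity, specificity, and accuracy from 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "accuracy": (tp + tn) / len(pairs),
    }


splits = list(k_fold_indices(25, k=10))
print(len(splits), [len(val) for _, val in splits])
print(binary_metrics([1, 1, 1, 0, 0], [1, 1, 0, 0, 1]))
```

In practice a model would be fitted on each training part and scored on the matching validation part, with the metrics averaged over the ten folds.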

Results
In this study, we have utilized the 2528 IL-2 inducing peptides and 2104 non-IL-2 inducing peptides for developing the prediction algorithm. All the analysis and model building was done on these datasets. Based on the overall analysis, different prediction models were built, trained, and tested for their performance in predicting IL-2 inducing and IL-2 non-inducing peptides.

Datasets Length Distribution Analysis
We

Positional Preference Analysis
In this analysis, we have generated the two-sample logo of both positive and negative datasets.

Motif Analysis
We used the MERCI software to mine the motifs present exclusively in the IL-2 inducing peptides and absent from the non-IL-2 inducing peptides. Similarly, we computed the motifs that are exclusive to the non-IL-2 inducing peptides. The motif analysis reveals that A, L, S, and G are the more prominent residues in the positive dataset, which was also found in the two-sample logo analysis. The results of the motif analysis are presented in Table 1.

Development of Prediction Models
We developed various machine-learning-based prediction models, such as DT, RF, KNN, MLP, Ridge, Lasso, and ElasticNet. First, we computed the features of IL-2 inducing and non-IL-2 inducing peptides using Pfeature. Feature selection was done using SVC-L1 and the random forest functionality available in the "caret" package of R. Apart from these selected features, classification models were also built and tested separately on individual feature types, as is widely done in past studies [17]. The feature selection process yielded sets of the 10 and 100 most relevant features. With these feature sets, we built models using different machine-learning algorithms. The statistical details of each feature, along with the best classifiers, are provided in Table 1.

*Bold faced are the best models
The AAC-based RF model also performs comparably to the DPC-based model. This analysis makes evident the importance of amino acid composition in the classification of IL-2 inducing and non-IL-2 inducing peptides.

Models based on Hybrid Features
The length of a peptide plays a crucial role in MHC antigen presentation and thus in the induction of IL-2. We therefore developed models using hybrid features that combine DPC and the length of the peptide. The statistical details of the different models using various classifiers are presented in Table 1.3. Our RF-based model developed using hybrid features achieved a maximum accuracy of 73.25%, with an AUC of 0.73 on the training set.
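A hybrid feature vector of this kind can be sketched as the dipeptide composition (DPC, 400 values) with the peptide length appended as one extra value. The code below is a minimal illustration under that assumption; DPC is taken to be the percentage of each ordered residue pair, no scaling is applied to the length, and the function names are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered residue pairs in a fixed order: AA, AC, ..., YY.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dpc(peptide):
    """Dipeptide composition as percentages of overlapping residue pairs."""
    total = len(peptide) - 1  # number of overlapping pairs
    if total < 1:
        raise ValueError("peptide must have at least two residues")
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(total):
        pair = peptide[i:i + 2]
        if pair in counts:  # pairs with non-standard residues are skipped
            counts[pair] += 1
    return [100.0 * counts[d] / total for d in DIPEPTIDES]


def hybrid_features(peptide):
    """DPC vector (400 values) followed by the peptide length."""
    return dpc(peptide) + [float(len(peptide))]


features = hybrid_features("AALSAALS")
print(len(features), features[-1], features[DIPEPTIDES.index("AA")])
```

Because the length is on a different scale from the percentages, scaling may matter for distance-based classifiers, although tree-based models such as RF are largely insensitive to it.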

Modules and Functionalities of IL2pred Server
The web server is compatible with all sorts of devices, viz. desktops, tablets, and phones, and hence provides an interactive and better experience to users. The web server has three main modules: 1) Prediction; 2) Analogue; 3) Protein Scan. The "Prediction" module allows users to identify peptides that have IL-2 inducing potential; results can be obtained in CSV format. The "Analogue" module lets users generate all possible mutants of a given query peptide sequence and then rank them based on their IL-2 inducing potential. The "Protein Scan" module allows users to predict IL-2 inducing regions within a given protein sequence.
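The logic behind the Analogue and Protein Scan modules can be sketched as below. This is not the server's code: "all possible mutants" is assumed here to mean all single-residue substitutions, the window length is a free parameter, and score_fn is a hypothetical placeholder for whatever trained model assigns an IL-2 inducing score.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def single_substitution_analogs(peptide):
    """All analogs differing from the peptide at exactly one position."""
    analogs = []
    for pos, original in enumerate(peptide):
        for aa in AMINO_ACIDS:
            if aa != original:
                analogs.append(peptide[:pos] + aa + peptide[pos + 1:])
    return analogs


def scan_windows(protein, window=9):
    """Overlapping fragments of a protein with 1-based start positions."""
    return [(start + 1, protein[start:start + window])
            for start in range(len(protein) - window + 1)]


def rank_by_score(peptides, score_fn):
    """Sort peptides from highest to lowest score (score_fn is assumed)."""
    return sorted(peptides, key=score_fn, reverse=True)


analogs = single_substitution_analogs("ALSG")
print(len(analogs))  # 19 substitutions at each of 4 positions
print(scan_windows("ACDEFGHIKL", window=8))
# Toy scoring function used only to demonstrate ranking.
print(rank_by_score(["AAA", "ALA", "LLL"], lambda p: p.count("L")))
```

A peptide of length L yields 19 x L single-substitution analogs, so ranking remains cheap even for the longest peptides considered here.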

Discussion
High-throughput studies have significantly enriched our knowledge of tumor onset and progression. However, patient survival remains dismal overall, and finding novel therapeutics remains a major challenge. Recently, evidence from pre-clinical and clinical studies has supported activation of the immune system for the treatment of human malignancies [5].
Immunotherapy based on cytokines is considered a novel means of activating and manipulating the immune system. The present study is a systematic attempt to build an in silico model for predicting the IL-2 inducing capability of a peptide or antigenic region. The model is developed on the sequence features of non-redundant IL-2 inducing and non-inducing peptides extracted from IEDB, the largest repository of experimentally validated immune epitopes. Using a non-redundant dataset prevents multiple instances of the same peptide from introducing overfitting, bias, or noise into the model. Literature evidence suggests that the length of a peptide can affect its binding to the MHC complex and thus immune activation [23]. The length distribution analysis reveals that IL-2 inducing peptides predominantly have lengths of 12-20 residues, whereas non-inducers mostly range from 15-17 residues. The compositional analysis reveals that IL-2 inducing peptides are enriched in A, L, S, and Y residues. Past studies also reveal that peptides enriched in S, L, and A residues have increased anti-tumor capabilities [24], as they can induce apoptosis in cancer cells via modulation of autophagy [25]. Thus, sequence composition can also be used, to some extent, to discriminate IL-2 inducing peptides from non-IL-2 inducing peptides.
In addition, IL-2 prediction models were developed using several other sequence features with multiple machine learning algorithms. The DPC-based RF model can discriminate IL-2 inducing and non-inducing peptides with good accuracy. However, the hybrid model developed on DPC and peptide length performed better than the DPC-based model alone in terms of balanced model performance measures. This may be due to the increased vector size used while developing the prediction model. The external validation of the developed model further indicates that its performance is not due to over-optimization or over-fitting. The best-performing machine-learning model was incorporated in the web server. We believe that the developed tool will help the scientific community better understand IL-2 inducing and non-inducing peptides. With an increased number of organism-specific datasets, the applicability of the developed model can be improved further. We anticipate that the developed tools may be of great use to the scientific community for identifying IL-2 inducing and non-inducing peptides from proteomes to improve current cancer immunotherapeutics. The tools can also be used for generating mutant peptides that are predicted to have IL-2 inducing potential.

Conclusion
Past studies strongly indicate that direct administration of IL-2 may have several toxic effects. The capability of an antigenic region or mutant peptide to induce the IL-2 response is of great significance for activating the immune response towards malignant tissue. However, identifying such mutant peptides with conventional experimental techniques is time-consuming and costly.
Therefore, a computational tool in the form of a web server is provided to the scientific community for predicting and identifying IL-2 inducing and non-inducing peptides from given proteomes. The web server is freely available at https://webs.iiitd.edu.in/raghava/il2pred/. We expect that the developed tools will be widely used by the scientific community for identifying and prioritizing potential immunotherapeutic candidates.

Funding
There is no funding available for this article to support open access.

Data Availability
All the data are freely available in the download section of the IL2pred website.

Declaration of competing interest
There is no potential conflict of interest among the authors of the manuscript.