A generalized machine‐learning aided method for targeted identification of industrial enzymes from metagenome: A xylanase temperature dependence case study

Growing industrial utilization of enzymes and the increasing availability of metagenomic data highlight the demand for effective methods of targeted identification and verification of novel enzymes from various environmental microbiota. Xylanases are a class of enzymes with numerous industrial applications and are involved in the degradation of xylose, a component of lignocellulose. The optimum temperature of enzymes is an essential factor to be considered when choosing appropriate biocatalysts for a particular purpose. Therefore, in silico prediction of this attribute is a significant cost and time‐effective step in the effort to characterize novel enzymes. The objective of this study was to develop a computational method to predict the thermal dependence of xylanases. This tool was then implemented for targeted screening of putative xylanases with specific thermal dependencies from metagenomic data and resulted in the identification of three novel xylanases from sheep and cow rumen microbiota. Here we present thermal activity prediction for xylanase, a new sequence‐based machine learning method that has been trained using a selected combination of various protein features. This random forest classifier discriminates non‐thermophilic, thermophilic, and hyper‐thermophilic xylanases. The model's performance was evaluated through multiple iterations of sixfold cross‐validations as well as holdout tests, and it is freely accessible as a web‐service at arimees.com.

One of the essential attributes of enzymes is their optimum temperature, in which they exhibit their maximum relative activity.
An enzyme's optimum temperature is the indicator of its thermal activity (Liu et al., 2012). On the one hand, high temperature increases substrates' solubility and bioavailability, accelerates molecular dynamics, and decreases the probability of microbial contamination significantly (Kumar et al., 2013. On the other hand, numerous biological processes are carried out at mild or cold temperatures (Collins et al., 2005). Therefore, various research studies have been focused on determining enzymes' optimum temperature To identify appropriate novel biocatalysts for specific purposes (Li et al., 2019;Yan & Wu, 2012).
Unlike culture-dependent methods that are unable to cultivate more than 90% of microorganisms, culture-independent methods such as metagenomics enable the extended exploration of the natural diversity within an environmental sample (Schloss & Handelsman, 2003;Thomas et al., 2012). As the availability of metagenomic data is rapidly increasing, employing computational methods instead of wetlab experiments to identify new enzymes with specific properties can effectively reduce the costs and make the process much faster Maleki et al., 2020;Motahar et al., 2020).
Microbial communities adjust to the requirements of their environment. For example, to enhance the digestion of plant matter and plant-derived complex polysaccharides, such as xylose and cellulose, which are the primary components of the ruminant diet, rumen microbiome has acquired a rich hydrolyzing enzyme profile (Stewart et al., 2019).
The strong relationship between enzymes' functional properties and their amino acid sequences suggests the importance of the primary structure of proteins in their different attributes. Many studies have used sequence alignment-based methods to predict different properties of proteins. However, there are many cases where sequence similarity does not directly correlate with functional resemblance, as proteins with a nearly similar primary structure can sometimes depict unique properties or, reversely, functional similarity can be observed among proteins with different amino acid sequences (Sadowski & Jones, 2009). Machine learning approaches have been successfully applied to predict various properties of proteins such as tertiary structure (Cheng et al., 2008), function (Kulmanov et al., 2018), localization (Armenteros et al., 2017, thermal stability (Wu et al., 2009), and so forth. These computational methods are capable of learning more complex relationships between the primary structure of proteins and their different properties (Shen & Chou, 2006).
Numerous studies have presented in silico methods for predicting enzymatic attributes. Discrimination between thermophilic and mesophilic proteins by using machine learning methods was the focus of a study by Gromiha and Suresh (2008). Similarly, Tang et al. (2017) used support vector machines (SVM) to develop a two-step method for discriminating thermophilic proteins, and amino acid compositions (AACs) have been the basis of a statistical method for a similar task (Zhang, 2013). Pucci et al. (2014) presented a statistical approach to predict thermostability, and Jia et al. (2015) designed a thermostability predictor tool. AcalPred is another similar study that utilizes SVM to classify acidic and alkaline enzymes based on their primary structure (Lin et al., 2013). In another research, Ariaeenejad et al. applied a regression model based on a pseudo amino acid composition (PAAC; Chou, 2009) to predict the optimum temperature and pH of xylanase in strains of Bacillus subtilis (Ariaeenejad et al., 2018). Genetic algorithm-artificial neural network has been employed for the optimization of xylanase production for industrial purposes (Kumar, Chhabra, et al., 2017;. In a 2010 study, to find features with the highest correlation with the thermostability of proteins, a K-means clustering algorithm, and a decision tree have been employed (Ebrahimi & Ebrahimie, 2010). Panja et al. (2015) found that the prevalence of smaller nonpolar and hydrophobic amino-acids, as well as salt-bridges, are some shared characteristics among most thermophilic proteins.
The increasing usage of new high-throughput technologies can rapidly produce massive amounts of data, while the processes of getting access to the protein's tertiary and quaternary structures are much slower. Therefore, an accurate and agile sequence-based approach is in demand to facilitate the targeted screening of high throughput data to find enzymes with the properties of interest. The objective of this study was to design and implement a multistep method for the classification of the thermal activity of xylanases from GHs families 10 and 11 based on their optimum temperature. Since most of the available data in the literature belong to GH10 and GH11 families, we focused on the members of these two protein families.
To the best of our knowledge, this is the first time that a combination of different protein descriptors is calculated, selected, and used to train a machine-learning classification model to make predictions on xylanase temperature dependence in three classes. The implemented tool has been successfully exploited for targeted isolation of three high performance xylanases from metagenomics samples. Moreover, we presented thermal activity prediction for xylanase (TAXyl), a prediction web-server for the thermal activity of xylanases which was evaluated through multiple cross-validations (CVs) as well as holdout test on unseen data.

| Data set preparation
A new data set of GH families 10 and 11, which constitute a considerable ratio of known xylanases, had to be collected. Even though this makes the scope narrower, it can help the final estimation to be more accurate and reliable. The National Center for Biotechnology Information (NCBI) database was explored by searching for thermophilic or thermostable xylanases, and 254 results were found and records without the exact optimum temperature report were removed. Afterward, the BRENDA (Jeske et al., 2019) and UniProt (Bateman, 2019) databases were explored for xylanases with reported optimum temperature, and newly collected data were added to the previous data set.
Redundant or highly similar samples were removed using the CD-Hit with a 0.9 cut-off (Huang et al., 2010). The CD-Hit clusters highly-homologous sequences and retains one representative sample from each cluster, in some protein sequences, a few amino acids are unknown. These residues are represented by the character "X." Since the existence of such noise could potentially interfere with the learning process and since the feature extraction tools are designed for 20 amino acid residues, all unknown amino acids residues were removed from the sequences.
The final data set consisted of 145 different xylanases from GH families 10 and 11 with optimum temperature ranging from 25°C to 95°C. These samples were labeled accordingly into three different classes: "Non-thermophilic" with optimum temperature below 50°C, "thermophilic" with the optimum temperature between 50°C and 75°C, and "hyper-thermophilic" with the optimum temperature above 75°C. Figure 1 provides a general summary of the collected data set by showing the number of samples exisiting in each thermal class and each GH family.

| Feature extraction and selection
Using an appropriate set of features is undoubtedly one of the most crucial steps in creating an efficient classifier. Because of the lack of precise evidence regarding the most related features to thermal activity, in this study, various protein descriptors were computed using the PyDPI python package (Cao et al., 2013).
The PyDPI-computed protein features are in 15 descriptor types, which are from six main groups. AAC, dipeptide composition (2AAC), and tripeptide composition (3AAC) represent their fractions in the protein sequence. Unlike previous feature groups, PAAC and amphiphilic PAAC, try to avoid missing the sequence-order information and reflect it in the composition data (Chou, 2005(Chou, , 2009). Conjoint triad features (CTF) cluster twenty amino acids into several classes based on their dipoles and the number of side chains (Shen et al., 2007). CTF considers the properties of an amino acid and its neighboring ones while regarding any amino acid triads as a unit. Other features are calculated by taking structural and physiochemical properties into account. Three autocorrelation descriptors, including normalized Moreau−Broto, Geary, and Moran autocorrelations, attempt to describe the amount of correlation among peptide or protein sequences. Composition, transition, distribution descriptors; sequence-order coupling number; and quasi-sequence-order describe the distribution patterns of amino acids along a protein sequence in terms of structural and physicochemical attributes.
As a result, a feature vector with 10,074 descriptors was calculated for each protein sequence. Many of these features were duplicates and were removed afterward. Due to different ranges of descriptors, all raw values had to be re-scaled into the same range.
All features do not contribute equally to the modeled response. Thus, a process of feature selection is necessary to find the most relevant descriptors for optimum temperature classification in xylanases and to remove ones of less importance. Filter feature selection methods use a statistical measure to rank features based on their score. For this step, an F test was applied as the filter method, and the best features were chosen using the SelectKBest method while the K value was chosen to be 400 due to better performance.

| Model selection and training
Numerous classification algorithms with different capabilities and weaknesses are currently available. To find the best methods for our problem, various classification algorithms were tested, including multilayer perceptron, random forests (RFs), Gaussian Naïve Bayes, AdaBoost, and SVM with radial basis kernel function.
Results of the model selection process suggested that RF is the most suitable classification algorithm for this problem. RF is an ensemble machine learning method consisting of multiple decision trees which results in reduced variance and better generalization in comparison to single decision trees (Breiman, 2001). To build each tree of the RF, a bootstrap sample of the data set was drawn randomly and the information gain was chosen as the splitting criterion.
In computational biology, RF is a popular method due to several advantages over other algorithms such as dealing with highdimensional feature space, a small number of samples, and complex data structures (Qi, 2012).
The above-mentioned pipeline required several hyper-parameter tuning steps, all of which were performed through grid searching.
There are several approaches to detect the best combination of hyper-parameters for a machine learning model. The grid searching is a method which given multiple options for each hyper-parameter, tries every possible combination to build and evaluate the desired model. This method can thoroughly inspect the hyper-parameterspace to find the best configuration for the model to achieve the best performance.
F I G U R E 1 Summary of collected training data set. The data set is divided into three categories. Samples in the data set are from GH10 and GH11 families with optimum temperatures ranging from 25°C to 95°C. These enzyme samples are divided into three thermal activity classes. This figure illustrates the number of instances in each class and GH family [Color figure can be viewed at wileyonlinelibrary.com] FOROOZANDEH SHAHRAKI ET AL. | 761

| Evaluation criteria
For the evaluation step, samples were randomly split into two subsample (90% as the training set and 10% as the holdout test set) ten times and each time 15 iterations of sixfold CV were performed on the training set and after that, models were tested on the unseen subsample (holdout test set). This means that our classifiers were evaluated through 150 iterations of sixfold CV and they were tested on unseen data 10 times to assure that the models are robust. In the sixfold CV step, the training set was again randomly split into six equal subsample, five of which were used as training sets, and one subsample was then used for testing the models with different metrics. This validation process is executed six times leaving out one subsample each time for testing to ensure that all samples in the set were tested once. Figure 2 illustrates the evaluation process.
Since this is a multiclass classification problem, accuracy, macrorecall, macro-precision, the macro-F1 score, which are among the most commonly used metrics for classification tasks, were considered for evaluation of the model's performance. These metrics were calculated using the following formulas:  2.5 | Cloning, expression, purification, and characterization of xylanases with predicted thermal activity To verify the TAXyl's predictive ability, three candidates were nominated among the putative xylanases from cow and sheep rumen metagenomic data for cloning, expression, and purification. After that, the purified xylanases which were named PersiXyn5, PersiXyn6, and PersiXyn7 were subjected to the optimal temperature investigation. Moreover, two xylanases from our previous studies, named PersiXyn1 and PersiXyn2 , were also used to validate the TAXyl's ability at the correct prediction of xylanases' thermal activity.
To acquire xylanase genes, metagenomic DNA templates from sheep and cow rumen were used for polymerase chain reaction (PCR) amplification with two pairs of primers. To obtain the full-length coding genes, a specific primer set, the 5′-and 3′-ends of genes (Table 3)  method (Miller, 1959). For this purpose, after the incubation time, 120 μl of DNS reagent was added to the reaction mixture followed by boiling for 5 min to terminate the reaction. The absorbance was measured at 540 nm using the UV-VIS spectrophotometer. One unit of xylanase activity was defined as the amount of enzyme that released 1 µmol of reducing sugars per minute under the assay conditions according to the glucose standard graph and the specific activity of enzymes was defined as unit per mg of protein. Protein concentrations were determined through the Bradford method using bovine serum albumin as the standard (Bradford, 1976

| Model selection and evaluation
Our final model was chosen to be the RF, with information gain as the splitting criterion, due to better classification accuracy. Reported | 763 performance metrics in Table 2 were computed through 150 sixfold CV tests and ten holdout tests using the unseen test data (10% of the initial data set, which was reserved at the beginning). In each CV and holdout test iteration, the samples were shuffled with different random seeds before splitting. Table 2 shows the comparison of TAXyl's evaluation results in CV and holdout tests. Figure 5 illustrates TAXyl's performance through ten holdout tests on unseen data.

| Cloning, expression, purification, and characterization of xylanases with predicted thermal activity
Aiming at targeted enzyme isolation and screening of the sheep and cow rumen metagenomic sources, three probable xylanases were cloned, expressed, and purified and named PersiXyn5, PersiXyn6, and PersiXyn7. The SDS-PAGE was applied for checking the purification step and the clear bands of the PersiXyn5, PersiXyn6, and PersiXyn7 were visible with a molecular weight of 42.47, 42.52, and 38.81 kDa, correspondingly. Figure 6 presents the SDS-PAGE profiles of three purified enzymes.
The thermal activity of these purified xylanases was predicted with TAXyl, and also their optimum temperature was investigated through the experimental analysis and measuring the relative enzymatic activity in a wide range of temperatures from 30°C to 100°C. Figure 7, which illustrates the results of the experimental determination of xylanases' optimum temperatures, shows that the maximum relative activities (100%) of PersiXyn5, PersiXyn6, and PersiXyn7 were found at 80°C, 50°C, and 90°C, respectively. Also, the PersiXyn5, PersiXyn6, and PersiXyn7 could retain 74%, 57%, and 90% of their maximum activities at 100°C, respectively.
The in silico predictions of thermal activity for these xylanases were obtained from TAXyl and are presented in Table 3 to be compared with the experimentally identified optimum temperatures. Moreover, to further validate the predictive capability of the TAXyl, experimentally obtained optimum temperatures of two xylanases from our previous studies, namely PersiXyn1 and Per-siXyn2, were compared with the TAXyl's computational prediction (Table 3; . Table 3 contains the list of enzymes, primers, and restrictions used in the cloning process, the prediction of TAXyl, and their experimentally identified optimum temperature.

| Online server
TAXyl is freely available at http://arimees.com. This web service is capable of receiving inputs in the forms of FASTA file, amino acid sequence, or protein entry of xylanases from GH families 10 and 11 and returns their probable thermal activity status. TAXyl also enables users to export the selected features for their inserted protein sequences in CSV format (with the purpose of reproducibility of results or providing features for other machine learning tasks). This webservice is also accessible from the CBB lab website (http://cbb.ut.ac.ir), under the databases and tools submenu.

| DISCUSSION
Xylanases are carbohydrate-active enzymes with multiple industrial applications and high commercial value .
Determining the optimum temperature of activity plays an important role in choosing the appropriate enzymes for specific purposes. Since the advent of next-generation sequencing and its accelerating improvements, getting access to metagenomic data is becoming increasingly easier and more affordable, and the only possible approach to analyze these extensive amounts of data is by fast and accurate computational methods.
One of the challenges of this study has been the limited number of xylanases with a reported optimum temperature in the literature and thus a smaller data set. Therefore, we explored different machine learning methods and various sequence-derived features to find the one with an acceptable interpretation of the data. Our results indicated the competence of computational methods in addressing common problems in bioinformatics and the capability of sequencebased and length-independent protein descriptors for training supervised learning algorithms to effectively predict general enzymatic attributes (Pucci et al., 2014).
Most importantly, this methodology ( Figure 3)  Note: Except for PersiXyn1 and PersiXyn2 which were from previous studies and were used only for validation purposes, TAXyl's predictions were made before cloning and production of three remaining enzymes and facilitated the targeted isolation of these xylanases. Abbreviation: TAXyl, thermal activity prediction for xylanase.
agile, and inexpensive screening of high-throughput data can be effectively addressed.
This tool helps to significantly reduce the number of potential candidate enzymes with a specific thermal activity profile before engaging the wet-lab experiments. Another potential usage for such tools in bioinformatics is in facilitating the engineering of enzymes through directed evolution to obtain biocatalysts with specific thermal dependence targeted at particular industrial purposes.
Unlike other studies designed to classify the status of enzymes' thermal activity and stability (Gromiha & Suresh, 2008;Tang et al., 2017;Zhang, 2013), our model extends the classification ability to the third class which is the hyper-thermophilic enzymes focusing on xylanase families. In comparison with our previous two studies, in which xylanase enzymes from the metagenomic source were identified and characterized by experimental techniques , TAXyl enabled the identification of more putative thermophilic xylanases and enhanced the extended exploration of the metagenome with significantly less labor and cost. This tool was used to estimate the thermal activity of putative xylanases identified within the metagenomic samples and wet-lab experiments validated the predictions which were made previously.

| CONCLUSION
In this study, we presented a novel method based on a supervised machine learning algorithm to predict the thermal activity of GH10 and GH11 xylanases. TAXyl uses sequence-based and lengthindependent protein descriptors to train a RF classifier. TAXyl facilitated the targeted mining of three xylanases with specific thermal dependencies from cow and sheep rumen metagenome. These novel enzymes were subjected to cloning, expression, and purification.
Besides, the activity of enzymes was measured in a wide range of temperature and the optimum temperatures were determined which were in accordance with the TaXyl predictions.
The presented methodology for the development of this prediction tool can be generalized and implemented to model the functional properties and predict various attributes of a variety of enzymes from different families. In case of the availability of sufficient training data, a possible direction for future works would undoubtedly be developing similar tools to predict the structural, functional, and thermodynamic properties of other enzyme families. The TAXyl web-service is available and provides users with a reasonably accurate prediction of any GH10 or GH11 xylanase thermal activity status.

ACKNOWLEDGMENTS
This study was supported by a grant from the Agricultural Biotechnology Research Institute of Iran (ABRII), Institute of Biochemistry and Biophysics (IBB).