Molecular Prediction of Drought Tolerance in Maize Inbred Lines by Machine Learning Approaches

Drought is one of the major abiotic stresses worldwide. Among the new technologies available to speed up the release of new drought-tolerant genotypes is the emerging discipline of machine learning. This study applies machine learning to the identification, classification and prediction of drought-tolerant maize inbred lines based on SSR genetic marker datasets generated from PCR reactions. A total of 356 reproducible SSR fragment alleles were detected across 71 polymorphic SSR loci. A dataset of 12 inbred lines, with these fragments as attributes, was prepared and imported into RapidMiner software. After removal of duplicate, useless and correlated features, 311 attributes remained; these were polymorphic and ranged in size from 1500 to 3500 bp. The most important fragment alleles were selected under different attribute weighting models, and ten datasets were created using attribute selection (weighting) algorithms. Different classification algorithms were then applied to these datasets. These algorithms can be used to identify groups of alleles with similar patterns of expression, and they produced models that were applied successfully to prediction, classification and pattern recognition in drought stress. Some unsupervised models were able to differentiate tolerant inbred lines from susceptible ones. Four models produced different decision trees with roots and leaves. The most important attribute alleles in almost all models were phi033a3, bnlg1347a1 and bnlg172a2, in that order, which can help to identify tolerant maize inbred lines with high precision.


Introduction
Drought is a main cause of reduced yields in cereal crops (Edmeades et al., 2000) and, after low soil fertility, the second most important cause of yield loss for maize (Dariano et al., 2016). It is a major limitation in maize production, particularly under climatic changes (Lobell et al., 2014), leading to grain yield reductions of 25-30% and even to total harvest failure during extremely severe drought events (Khazaei et al., 2013; Singh et al., 2016). The effect of drought remains large despite improvements in cultivars and agronomic management. Climatic warming is expected to further enhance the harmful impact of droughts, potentially leading to a significant decrease in maize yields (Lobell et al., 2014). Thus, research on maize drought tolerance and its mechanisms remains a hotspot under the pressure of worsening environmental conditions driven by human activities (Lobell et al., 2014; Khazaei et al. 2013; Ribaut et al., 2009). Drought has even been thought of as a "cancer" to plants, owing to its complexity and destructiveness. Hence, there is enormous interest in and demand for improving maize drought tolerance through biotechnology (Wang et al., 2016).

Maize responds to drought stress at the physiological, biochemical, and molecular levels to adapt to changing environmental conditions (Jaglo-Ottosen et al., 1998; Shinozaki et al., 2003; Xiong et al., 2002). The identification and use of molecular markers to assist in selection of quantitative trait loci, genome-wide selection, and association mapping have become common areas of research. In the future, the destructive impact of drought may grow, as the spectre of climate change becomes a reality. With the development of PCR-based DNA markers such as RAPD (Raibut 2009), SSR (Ercisli et al., 2011), AFLPs (Pafundo et al., 2005) and single nucleotide polymorphism (Reale et al. 2006), marker technology today offers a palette of powerful tools to analyse the plant genome (Tardieu 2012; Wang et al. 2011; Yadav et al., 2011). These tools have enabled the identification of genes and genomic regions associated with the expression of qualitative and quantitative traits and have led to a better understanding of the complex genomes of various plants (Shinozaki and Yamaguchi-Shinozaki, 2007; Tao et al., 2011), besides helping to identify the desired species at any growth stage of the plant. Despite numerous published reports of molecular markers for drought-related traits, practical applications of such results in maize improvement are scarce (Benesova et al., 2012), the results are not completely satisfying, and more research on methodologies is needed (Ornella et al., 2012).

Machine learning is an emerging, inherently multidisciplinary approach to data analysis with a revolutionary impact on a variety of areas. It refers to a group of computerized modelling approaches that can learn patterns from data and make automatic decisions without explicitly programmed rules (Dror et al., 2005; Hajiloo et al., 2013; Hepworth et al., 2012; Hor et al., 2013; Mutka and Bart, 2015; Smith et al., 2013; Tran, 2014; Yan et al., 2013). The main idea of machine learning is to use experience effectively to discover the main underlying structures, similarities, or dissimilarities present in data in order to explain or classify a new experience properly. A key ability of machine learning tools, in their most fundamental form, is to generalize complex patterns and make intelligent decisions from data (Singh et al., 2016; Forghani and Yazdi, 2014; Ma et al., 2012). Machine learning will also enhance our understanding of pathogen-plant interactions as well as the interaction of plants with other stresses (Kuska et al. 2015; Singh et al. 2016; Romer et al., 2012). One of the major advantages of machine-learning approaches for plant breeders and biologists is the opportunity to search datasets to discover patterns and govern discovery by simultaneously looking at a combination of factors instead of analysing each feature individually (Mutka and Bart, 2015; Singh et al., 2016). Due to their high generalization capabilities and distribution-free properties, they are presented as a valuable alternative to the traditional statistical techniques applied in maize breeding, even the more recently introduced linear mixed models (Maenhout et al., 2010; Maenhout et al., 2007; Ornella, Cervigni and Tapia, 2012

Attribute selection
After the attribute weighting models were run on the dataset, each allele attribute (feature) gained a value between 0 and 1, which revealed the importance of that attribute with regard to the target attribute (susceptible or tolerant cultivar). All variables with weights higher than 0.50 were selected, and 10 new datasets were created and named according to their attribute weighting models (Information Gain, Information Gain Ratio, Rule, Deviation, Chi Squared, Gini Index, Uncertainty, Relief, SVM and PCA). These newly formed datasets were then passed to the subsequent models (supervised and unsupervised). Each supervised or unsupervised model was run 11 times: first on the main dataset (FCdb) and then on the 10 newly formed datasets from attribute weighting and selection.
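As an illustration only, and not the RapidMiner operators used in this study, the weighting and thresholding step can be sketched in Python. The sketch assumes presence/absence (1/0) allele columns, uses information gain as the single weighting scheme, and rescales the weights so the best attribute scores 1.0; the allele names and data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(column, labels):
    """Reduction in label entropy after splitting on one allele column."""
    total = len(labels)
    remainder = 0.0
    for value in set(column):
        subset = [lab for v, lab in zip(column, labels) if v == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

def select_attributes(data, labels, threshold=0.5):
    """Weight every attribute, rescale weights to [0, 1], keep those above threshold."""
    raw = {name: information_gain(col, labels) for name, col in data.items()}
    top = max(raw.values()) or 1.0  # avoid dividing by zero if nothing is informative
    weights = {name: w / top for name, w in raw.items()}
    return {name: w for name, w in weights.items() if w > threshold}

# Hypothetical toy data: 6 inbred lines, 1 = fragment present, 0 = absent.
labels = ["T", "T", "T", "S", "S", "S"]
data = {
    "allele_A": [1, 1, 1, 0, 0, 0],  # separates the two classes perfectly
    "allele_B": [1, 1, 0, 0, 0, 0],  # partly informative
    "allele_C": [1, 0, 1, 0, 1, 0],  # nearly uninformative
}
selected = select_attributes(data, labels)
print(selected)
```

Repeating the same thresholding with a different weighting function in place of `information_gain` would yield one reduced dataset per weighting scheme, which mirrors how the ten datasets are described above.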

Unsupervised clustering algorithms
The ten datasets generated by the different attribute weighting algorithms, along with FCdb, were submitted to the K-Means, K-Medoids, Support Vector Clustering (SVC) and Expectation Maximization (EM) clustering algorithms. K-Means, in its kernel form, uses kernels to estimate the distance between objects and clusters; because of the nature of kernels, it is necessary to sum over all elements of a cluster to calculate one distance. The K-Medoids operator is an implementation of the k-medoids algorithm and creates a cluster attribute if one is not yet present. The SVC operator is an implementation of the support vector clustering algorithm and likewise creates a cluster attribute if none is present. The EM operator is an implementation of the expectation-maximization algorithm.
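To make the clustering step concrete, the following is a minimal k-medoids sketch in Python. It is not the RapidMiner operator: it assumes Hamming distance between presence/absence profiles, a naive initialisation from the first k profiles, and hypothetical toy data.

```python
def hamming(a, b):
    """Number of marker positions at which two presence/absence profiles differ."""
    return sum(x != y for x, y in zip(a, b))

def assign(points, medoids):
    """Map each medoid index to the indices of the points closest to it."""
    clusters = {m: [] for m in medoids}
    for i, p in enumerate(points):
        nearest = min(medoids, key=lambda m: hamming(p, points[m]))
        clusters[nearest].append(i)
    return clusters

def k_medoids(points, k=2, max_iter=100):
    """Alternate assignment and medoid update until the medoids stop changing."""
    medoids = list(range(k))  # naive start: the first k points (an assumption)
    for _ in range(max_iter):
        clusters = assign(points, medoids)
        updated = []
        for medoid, members in clusters.items():
            if not members:  # keep the old medoid if its cluster emptied
                updated.append(medoid)
                continue
            # New medoid: the member with the smallest total distance to the others.
            updated.append(min(
                members,
                key=lambda c: sum(hamming(points[c], points[j]) for j in members),
            ))
        if sorted(updated) == sorted(medoids):
            break
        medoids = updated
    return list(assign(points, medoids).values())

# Hypothetical profiles: even rows resemble one class, odd rows the other.
profiles = [
    [1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1],
    [1, 1, 1, 1, 0],
    [0, 0, 0, 1, 0],
]
clusters = k_medoids(profiles, k=2)
print(clusters)
```

Because clustering never sees the tolerant/susceptible labels, the resulting groups have to be compared with the known classes afterwards to judge how many lines fall into the correct cluster.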

Supervised Classification
Three classes of supervised classification (Decision Trees, SVM and Bayesian models) were applied as follows. To calculate the accuracy of each model, 10-fold cross validation was used to train and test the models on all patterns. To perform cross validation, all the records were randomly divided into five parts; four parts were used for training and the fifth for testing. The process was repeated five times, and the true, false and total accuracies were calculated. The final accuracy is the average of the accuracies in all five tests.
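The cross-validation procedure can be sketched as follows, with the number of folds k left as a parameter. This is a generic illustration rather than the RapidMiner validation operator; the `fit` and `predict` callables, the single-marker rule and the toy records are hypothetical.

```python
import random

def k_fold_splits(n_records, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k shuffled, disjoint folds."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]

def cross_validated_accuracy(records, labels, fit, predict, k=5, seed=0):
    """Train on k-1 folds, test on the held-out fold, and average the k accuracies."""
    scores = []
    for train, test in k_fold_splits(len(records), k, seed):
        model = fit([records[i] for i in train], [labels[i] for i in train])
        hits = sum(predict(model, records[i]) == labels[i] for i in test)
        scores.append(hits / len(test))
    return sum(scores) / k

# Hypothetical classifier: a fixed rule on the first marker, nothing to train.
def fit(rows, row_labels):
    return None

def predict(model, row):
    return "tolerant" if row[0] == 1 else "susceptible"

records = [[1]] * 6 + [[0]] * 6
labels = ["tolerant"] * 6 + ["susceptible"] * 6
mean_accuracy = cross_validated_accuracy(records, labels, fit, predict, k=5)
print(mean_accuracy)
```

With only 12 inbred lines each test fold holds two or three records, so a single misclassification moves the fold accuracy by a large step; averaging over folds smooths this only partly.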

Decision Trees
Six tree induction models (Decision Tree, Decision Tree Parallel, Decision Stump, Random Tree, ID3 Numerical and Random Forest) were run on the main dataset (FCdb). Each tree induction model was run with four different criteria: Gain Ratio, Information Gain, Gini Index and Accuracy. In addition, a weight-based parallel decision tree model, which learns a pruned decision tree based on an arbitrary feature relevance test (an attribute weighting scheme as inner operator), was run with 13 different weighting criteria (SVM, Gini Index, Uncertainty, PCA, Chi Squared, Rule, Relief, Information Gain, Information Gain Ratio, Deviation, Correlation, Value Average, and Tree Importance). The accuracy of each tree was computed as explained above.
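As a hedged illustration of how a split criterion selects an attribute, the sketch below fits a one-level tree (a decision stump) using the Gini index. It is not one of the RapidMiner tree operators; the marker names and data are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def gini_gain(column, labels):
    """Impurity reduction obtained by splitting on one attribute column."""
    total = len(labels)
    weighted = 0.0
    for value in set(column):
        subset = [lab for v, lab in zip(column, labels) if v == value]
        weighted += len(subset) / total * gini(subset)
    return gini(labels) - weighted

def fit_stump(data, labels):
    """Pick the attribute with the largest Gini gain; label each branch by majority."""
    best = max(data, key=lambda name: gini_gain(data[name], labels))
    rule = {}
    for value in set(data[best]):
        subset = [lab for v, lab in zip(data[best], labels) if v == value]
        rule[value] = Counter(subset).most_common(1)[0][0]
    return best, rule

# Hypothetical toy data: 1 = fragment present, 0 = absent.
labels = ["tolerant"] * 3 + ["susceptible"] * 3
data = {
    "marker_X": [1, 1, 1, 0, 0, 0],
    "marker_Y": [1, 0, 1, 0, 1, 0],
}
root, rule = fit_stump(data, labels)
print(root, rule)
```

Swapping `gini_gain` for an information gain or gain ratio function changes only the ranking of candidate attributes, which is why the same tree model can yield different trees under different criteria.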

Support Vector Machine Approach
Support Vector Machines (SVMs) are popular and powerful techniques for supervised data classification and prediction, so SVM, LibSVM, SVM Linear and SVME were used here to implement different models to predict maize cultivars based on susceptible-tolerant features. Briefly, the main database (FCdb) was transformed to SVM format and scaled (to avoid attributes in greater numeric ranges dominating those in smaller numeric ranges), and a grid search was used to find the optimal values for the operator parameters. To prevent overfitting, 5-fold cross validation was applied: the dataset was divided into 5 parts, with 4 parts used as the training set and the last part as the testing set. The procedure was repeated for 10 different testing sets and the average accuracy was computed. The RBF kernel, which nonlinearly maps samples into a higher dimensional space and can handle the case where the relation between class labels and attributes is nonlinear, was used to run the model. Other kernels, such as linear, polynomial, sigmoid and pre-computed, were also applied to the dataset to find the best accuracy.
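The scaling step and the RBF kernel can be illustrated with a short Python sketch. It assumes min-max scaling to [0, 1] and the usual form exp(-gamma * ||a - b||^2) for the kernel; the `gamma` value and the numeric rows are hypothetical, and no SVM training is performed here.

```python
import math

def min_max_scale(rows):
    """Rescale every column to [0, 1] so wide-range attributes do not dominate."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(x - lo) / (hi - lo) if hi > lo else 0.0
         for x, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]

def rbf_kernel(a, b, gamma=0.5):
    """RBF similarity: 1.0 for identical vectors, approaching 0 as they move apart."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * squared_distance)

# Hypothetical rows mixing a wide-range attribute with a 0/1 attribute.
rows = [[1500, 0], [2500, 1], [3500, 1]]
scaled = min_max_scale(rows)
print(scaled)
print(rbf_kernel(scaled[0], scaled[2]))
```

A grid search would repeat the cross-validated training over a small grid of `gamma` and cost values and keep the pair with the highest average accuracy.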

Naïve Bayes
Naïve Bayes, based on Bayes' conditional probability rule, was used for performing classification tasks (Wang and Tseng 2013). Naïve Bayes assumes the predictors are statistically independent, which makes it an effective classification tool that is easy to interpret. Two models were used, Naïve Bayes (which returns a classification model using estimated normal distributions) and Naïve Bayes Kernel (which returns a classification model using estimated kernel densities), and the accuracy of each model in predicting the correct class (tolerant or susceptible) was computed as stated before.
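A minimal sketch of the normal-distribution variant is given below. It is an illustration under stated assumptions, not the RapidMiner operator: the variance floor `var_floor` is added here so that markers with no variation inside a class do not cause division by zero, and the toy rows are hypothetical.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(rows, labels, var_floor=1e-3):
    """Estimate a class prior plus a per-attribute mean and variance for each class."""
    grouped = defaultdict(list)
    for row, lab in zip(rows, labels):
        grouped[lab].append(row)
    model = {}
    for lab, members in grouped.items():
        stats = []
        for col in zip(*members):
            mean = sum(col) / len(col)
            var = sum((x - mean) ** 2 for x in col) / len(col)
            stats.append((mean, max(var, var_floor)))  # floor is an assumption
        model[lab] = (len(members) / len(rows), stats)
    return model

def predict_gaussian_nb(model, row):
    """Return the class with the highest log posterior under attribute independence."""
    def log_posterior(lab):
        prior, stats = model[lab]
        score = math.log(prior)
        for x, (mean, var) in zip(row, stats):
            score += -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)
        return score
    return max(model, key=log_posterior)

# Hypothetical toy data: the first marker tracks the class, the second is noisy.
rows = [[1, 1], [1, 0], [1, 1], [0, 0], [0, 1], [0, 0]]
labels = ["tolerant"] * 3 + ["susceptible"] * 3
model = fit_gaussian_nb(rows, labels)
print(predict_gaussian_nb(model, [1, 0]))
```

The kernel variant would replace the single normal density per attribute with a kernel density estimate built from the training values of each class.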

SSR alleles differing by several repeat units can often be distinguished on agarose gels. The high level of allelic variation makes SSRs valuable as genetic attributes (Fig. 1). SSRs are used for understanding the evolutionary genetics of maize and the sequences controlling traits of economic interest (Xu et al., 2009).

As mentioned in Materials and Methods, the initial dataset contained 12 maize inbred lines with 356 SSR fragment attributes. Following data cleaning (removal of duplicates, useless attributes and correlated features), 311 features remained, meaning these fragment attributes were polymorphic, ranging in size from 1500 to 3500 bp.

Attribute weighting
The numbers of attributes that gained weights higher than 0.5 in each weighting model were as follows: Relief 168, Rule 168, Deviation 147, SVM 134, PCA 93, Info Gain Ratio 20, Uncertainty 20, Chi Squared 9, Gini Index 8 and Info Gain 8 (Table 2). The most important attribute fragment allele across the models was phi033a3, which received a weight of 1.0 in all weighting models except PCA and can help to distinguish tolerant maize inbred lines from susceptible ones.

In weighting by the Deviation operator, 46 attribute alleles, including bnlg2323a1, bnlg1730a4, phi088a3, bnlg1655a2, mage05a3 and bnlg210a3, received a weight of 1.0. In the Weighting by Rule model five attributes (phi033a3, bnlg172a3, phi078a2, bnlg381a3 and bnlg172a2), in the SVM model three attributes (phi033a3, bnlg1347a1 and bnlg1347a2), and in the PCA model five attributes (bnlg1138a1, bnlg18a3, phi102a3 and phi102a2) received a weight of 1.0. Five of the ten attribute weighting models (Uncertainty, Info Gain Ratio, Gini Index, Information Gain and Chi Squared) selected the following attribute alleles with weights above 0.7: phi033a3, bnlg172a2, bnlg1347a1, bnlg1347a2, umc1572a4, bnlg172a3 and bnlg381a3. In the Weighting by Relief model only the phi033a3 attribute allele received a weight of 1.0. These were the most important attributes selected when the models were applied to the dataset from which correlated features had been removed (Table 3). Almost all of these attributes form the main branches of the decision trees (Fig. 3 and Fig. 4).

Unsupervised Clustering Algorithms
Three different unsupervised clustering algorithms (K-Means, K-Medoids and SVC) were applied to the ten datasets created using attribute selection (weighting) algorithms. Some models, such as the SVC algorithm applied to all datasets, were able to differentiate tolerant inbred lines from susceptible ones. With this algorithm all tolerant inbred lines were predicted in the correct tolerant class (Fig. 2-A). Application of K-Means to the SVM, PCA and Deviation datasets assigned 66.7%, 33.3% and 50% of the drought-tolerant inbred lines, respectively, to their correct class, while application of K-Medoids to the same datasets assigned, respectively, 50%, 83.3% (

Gain and Gini Index datasets were able to produce different decision trees with overall accuracy above 50% (Table 5). The decision trees of the Random Forest model with the Accuracy criterion run on the Information Gain dataset (Fig. 4-A) and with the Gini Index criterion run on the Gini Index dataset (Fig. 4-B) show 100% overall accuracy and precision. Figure 4-A shows that the bnlg1347a2 fragment allele was the sole fragment allele used to build this tree: if the bnlg1347a2 allele is present, the maize inbred line is drought tolerant; otherwise, the inbred line is susceptible. Figure 4-B shows that the bnlg381a3 fragment allele was the sole fragment allele used to build this tree: if the bnlg381a3 allele is present, the maize inbred line is tolerant. The Random Forest model with the Chi Square dataset was able to produce different decision trees with overall accuracy of 90% and overall precision of 100% when run with the Accuracy (Fig. 5-A), Gain Ratio (

SVM approach
The results of the SVM prediction system based on 10-fold cross validation show that overall accuracies ranged from 0.0% to 100%. The overall accuracy, the prediction precision for the susceptible and tolerant classes, and the true recall for the susceptible and tolerant classes were 100% for 4 algorithms on the Chi Square, Gini Index, Information Gain, Relief and Uncertainty datasets and for the LibSVM algorithm on the Uncertainty dataset (Table 6). The number of support vectors was 10-12 in total and 5-6 for each of the susceptible and tolerant cultivars in these algorithms. These results demonstrate a clear improvement in prediction accuracy when the generated datasets are used with the Support Vector Machine model.
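For readers unfamiliar with the reported measures, overall accuracy, class precision and class recall can be computed from actual and predicted labels as in the following sketch. The function name and the example predictions are hypothetical and are not taken from Table 6.

```python
def classification_report(actual, predicted, positive="tolerant"):
    """Overall accuracy plus precision and recall for the chosen positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Precision: share of lines predicted tolerant that really are tolerant.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        # Recall: share of truly tolerant lines that were predicted tolerant.
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Hypothetical outcome: one susceptible line is misclassified as tolerant.
actual = ["tolerant"] * 6 + ["susceptible"] * 6
predicted = ["tolerant"] * 7 + ["susceptible"] * 5
report = classification_report(actual, predicted)
print(report)
```

Calling the function again with `positive="susceptible"` gives the precision and recall for the other class, so both class-wise values can be tabulated next to the overall accuracy.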

Naïve Bayes
The overall accuracy, the prediction precision for the susceptible and tolerant classes, and the true recall for the susceptible and tolerant classes of Naïve Bayes and Naïve Bayes Kernel were 100% on all datasets except the Deviation, FCdb, PCA and Rule datasets (Table 7). These results show a clear improvement in prediction accuracy when the generated datasets are used with the Naïve Bayes and Naïve Bayes Kernel models. Current practices rely heavily on the classical Naïve Bayes algorithm due to its simplicity and robustness, and the results from these algorithms here were satisfactory.

Discussion
The worldwide production of maize (Zea mays L.) is frequently impacted by water scarcity and, as a result, increased drought tolerance is a priority target in maize breeding programs (Liu et al., 2013). Understanding the response of a crop to drought is the first step in the breeding of tolerant genotypes (Benesova et al., 2012). according to growth stage, plant organ or even time of day (Blum, 2011). Identifying the function of candidate genes in drought resistance is difficult. Inter-disciplinary scientists have been trying to understand and dissect the mechanisms of plant tolerance to drought stress using a variety of approaches (Mir et al., 2012). In order to achieve useful results, researchers require methods that consolidate, store and query combinations of structured and unstructured data sets efficiently and effectively. Herein, we combined molecular biology with biological knowledge discovery to determine the most important features that contribute to the clustering, classification and prediction of drought-tolerant versus susceptible inbred lines based on SSR data.

The first and most important step in any data processing task is to verify that the data values are correct or, at the very least, conform to a set of rules. Data cleaning deals with detecting and removing errors and inconsistencies from data; it was used here to remove redundant, collinear, useless or duplicated attributes in order to improve the quality of the data, which results in a smaller database (Ashrafi et al., 2011; Beiki et al., 2012). About 13% of the attribute alleles (45 of 356) were discarded when these algorithms were applied to the original dataset. Each attribute weighting system uses a specific pattern to define the most important SSR fragment alleles.
Thus, the results may differ (Baumgartner et al., 2010), as has been highlighted in previous studies (Ashrafi et al., 2011; Beiki et al., 2012; Ebrahimi et al., 2011). The results showed that attribute subset selection can benefit both processing time and accuracy. It reduces the dimensionality of the data and enables data mining algorithms to predict drought-tolerant maize inbred lines faster and more effectively.

Attribute selection had an important effect on the capability to classify and identify drought-tolerant inbred lines. Attribute weighting and selection methods based on SSR molecular marker data can classify drought-tolerant and susceptible inbred lines (Table 2 and Table 3). The phi033a3, bnlg1347a1, bnlg1347a2, bnlg172a2, bnlg381a3, bnlg381a2 and bnlg172a3 SSR fragments were the most important features for distinguishing drought-tolerant from susceptible inbred lines, as defined by all the attribute weighting algorithms (Table 2 and Table 3). These alleles can help to distinguish tolerant maize inbred lines from susceptible ones.

The goal of unsupervised pattern is to identify small subsets of alleles that

Here we quantify the performance of a given unsupervised clustering algorithm applied to given molecular marker data in terms of its ability to produce biologically meaningful clusters using a reference set of functional classes. We used three different unsupervised clustering methods (K-Means, K-Medoids and SVC) on 11 datasets created from SSR fragment allele attributes, which were assigned high weights. The performance of these algorithms varied significantly; such algorithms usually work well when the number of classes to be clustered is small.
Here we have only two classes, tolerant and susceptible, so these algorithms are expected to be suitable and there is no need for more complex clustering.
The results showed that the combination of the K-Medoids clustering method with Deviation attribute weighting was able to assign the drought-tolerant inbred lines to the right classes (Fig. 2-B), while the combination of the K-Means clustering method with PCA attribute weighting selected the right classes for the drought-susceptible inbred lines (Fig. 2-D).

The main objective of decision analysis is to offer a theoretical representation of choices made in an environment of uncertainty (Berry and Linoff 2004; Senthil Kumar et al. 2013). The attractiveness of decision trees lies in the fact that they represent rules, and rules can readily be expressed so that humans can understand them (Berry and Linoff 2004). Decision trees provide information about which attribute alleles are most important for prediction or classification. As shown in Figures 3, 4 and 5, different decision tree algorithms were able to produce different decision trees with roots and leaves. The Random Forest model with the Accuracy criterion on the Information Gain and Gini Index datasets was able to produce decision trees with 100% overall accuracy and precision (Fig. 4). The most important attribute alleles used to build these trees are bnlg1347a2 and bnlg381a3, which can be used to visually and explicitly represent decisions and decision making for susceptible and tolerant cultivars. Therefore, we are able to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. This means that drought-tolerant inbred lines can be predicted simply by using 2 SSR markers, bnlg1347 and bnlg381. can then be applied for developing an artificial intelligence system to classify a new allele or fragment into the member or non-member class.
Our results suggested that the SVM Evolution algorithm, with 100% overall accuracy and prediction (Table 6), would be the best candidate algorithm to predict drought-tolerant inbred lines when applied to the Chi Square dataset.

Naïve Bayes, based on Bayes' conditional probability rule, is used for performing classification tasks. It assumes the predictors are statistically independent, which makes it an effective classification tool that is easy to interpret (Paoin 2011; Prabhakara and Acharya 2012). Two models were used, Naïve Bayes (which returns a classification model using estimated normal distributions) and Naïve Bayes Kernel (which returns a classification model using estimated kernel densities), and the accuracy of each model in predicting the correct class (tolerant or susceptible) was computed as stated before.

The need to accelerate breeding for better adaptation to drought and other abiotic stresses is an issue of increasing urgency (Araus et al. 2008; Bänziger and Cooper 2001). Hence, as traditional breeding appears to be reaching a plateau, several approaches that complement traditional with analytical selection methodologies may be required for further improvement (Araus, Slafer, Royo and Serret 2008). The molecular approach has great potential, but actual results and delivery for water-limited environments are meager. Although the emergence of new molecular techniques such as transcriptomics and proteomics promises a revolutionary impact on analytical breeding, DNA marker technology is still advantageous in terms of cost/benefit and is a potential partner for this recently introduced discipline. Inter-disciplinary scientists have been trying to understand and dissect the mechanisms of plant tolerance to drought stress using a variety of approaches.
Application fields such as molecular genetics, combined with increasing computing power and supervised and unsupervised machine learning, can be used to identify groups of alleles with similar patterns of expression. This can help to answer questions of how different alleles are affected by various traits and which alleles are responsible for specific hereditary characters. We were able to create models that were applied successfully to prediction, classification, estimation and pattern recognition in abiotic stress. The molecular genetics road towards drought resistance is complex, but we know that the destination is much simpler. One of the objectives of this article was to bring to the molecular genetics community an increased understanding of knowledge discovery from data, so that these robust computing paradigms may be used even more successfully in future molecular biology applications, especially in the areas of abiotic and biotic stress.

Acknowledgement
The author greatly appreciates support from the Department of Biology, Faculty of Science, University of Qom.

Figure 2
Application of the SVC algorithm to the ten datasets (A) was unable to categorize inbred lines into correct clusters; application of the K-Medoids algorithm to the Deviation (B) and PCA (C) datasets was able to categorize drought-tolerant inbred lines into correct clusters; and application of the K-Means algorithm to the PCA dataset (D) was able to categorize susceptible inbred lines into correct clusters.