Machine learning based prediction of functional capabilities in metagenomically assembled microbial genomes

The increasing popularity of genome resolved meta genomics - the binning of genomes of potentially uncultured organisms direct from the environmental DNA - has resulted in a deluge of draft genomes. There is a pressing need to develop methods to interpret this data. Here, we used machine learning to predict functional and metabolic traits of microbes from their genomes. We collated an extensive database of 84 phenotypic traits associated with 9407 prokaryotic genomes and trained different machine learning models on this data. We found that a lasso logistic regression based on the frequency of gene orthologs had the best combination of functional prediction performance and interpretability. This model was able to classify 65 phenotypic traits with greater than 90

Predicting phenotype from genotype remains one of the major challenges in biology 2 [1,2]. Addressing this challenge is particularly relevant for understanding microbial 3 communities, the study of which had been boosted by an increased ability to extract 4 sequence data directly from communities. Technological improvements in DNA 5 sequencing have led to an explosion in the amount of such data generated. In the 6 context of microbial ecology, large-scale metagenomic studies such as the Human 7 Microbiome Project [3], the Earth Microbiome Project [4] and the Tara Oceans 8 Project [5] have systematically sequenced the microbial communities in a huge variety of 9 environments at great depth. Amplicon sequencing, such as of the 16S rRNA gene, 10 allows detailed study of the taxonomic makeup of these communities, while shotgun 11 metagenomic sequencing allows characterisation of all genes present in an environment. 12 Increasing depth of coverage and improvements in genome binning algorithms for has so far mostly focussed on phylogenetic assignments using the 16S rRNA gene. This 28 highly conserved gene can provide a phylogenetic assignment at the species (or higher) 29 level, which can then be used to infer general functional traits. While this approach has 30 been commonly used to study ecological distribution of microbial functions e.g. [11][12][13], 31 its premise of a direct association of function with phylogenetic assignment (i.e. 32 'functional coherence of microbial taxa') is questionable (e.g. [14]). The level of 33 taxonomic coherence of function is not clear even for strains of the same species, where 34 functions can show high variability either due to a few genetic changes or even 35 regulatory changes [2]; [15] discusses this for E. coli. It is also a common problem that 36 when certain taxonomic groups in a microbial community are found to show direct 37 associations with certain ecological factors (or health state of an host), these groups are 38 so broad that assigning specific functions to them is hard or impossible [16]. Indeed, a 39 study of specific functional traits across microbial taxa has found that many of these 40 traits are dispersed across the phylogenetic tree [2,17]. Even where a specific functional 41 trait is taxonomically coherent, the phylogenetic approach is limited by our ability to 42 assign taxonomy based on the 16s rRNA gene. The extensive accumulation of 43 metagenomics data indicates that we have sampled only a fraction of microbial diversity, 44 and it is not uncommon for such data to result in many unassigned taxa, as for example 45 by the aquifer study described above which described dozens of apparently novel phyla. 46 While reconstruction of genomes from metagenomics data reveals the limitation of 47 taxonomic functional assignment, it also opens up a new route to functional assignments. 48 In particular, several methods and bioinformatics pipelines are now emerging that aim 49 to go from raw metagenomics sequence data to binned (i.e. predicted) genomes to 50 functions, e.g. [18]. The functional annotation steps in these tools usually considers 51 specific gene(s) that are known to associate with specific functions or metabolic 52 pathways [18], as identified for example in databases such as the Kyoto Encyclopedia of 53 Genes and Genomes (KEGG) [19], the Pfam database of protein families [20] and the 54 NCBI COG database of orthologous genes [21]. This approach circumvents the problems 55 of taxonomic coherence and 16S rRNA gene based assignment to taxa, yet relies heavily 56 on existing categorization of genes into functions as done in the above databases. While 57 these functional gene groupings are mostly based on accumulated knowledge and 58 experimental data on metabolic pathways, they might miss the full set of genes 59 associated with a given function and do not consider functions that cannot be assigned 60 to a few genes or seemingly well-organised pathways. One route to overcome such 61 limitations is to develop extensive databases of phenotypic traits of microbes without 62 necessarily using a pathway-centric view. These functional assignments could then serve 63 as a source to apply statistical approaches to 'learn' genetic drivers of those functions 64 using genomes of associated microbes. Efforts in this direction have recently resulted in 65 the compilation of literature-based assignment of functions in microbes, either covering 66 a large selection of functions and organisms [22,23] or specific ones such as 67 methanogenesis [24]. The FAPROTAX database [22], which we focus on here, is based 68 on an extensive survey of the scientific literature. The aim of creating this database was 69 to allow microbes found from 16S rRNA amplicon sequencing to be assigned into 70 functional and metabolic groups, so that functional variation across environemnts could 71 be studied and compared to taxonomic variation. The authors found that the 72 abundance of functional groups was strongly influenced by environmental conditions in 73 a variety of ocean environments [12]. The bulk of the classifications in the FAPROTAX 74 database come from Bergey's Manual of Systematic Bacteriology [25], and it currently the features and combines these predictions by averaging, avoiding overfitting and much 167 improving performance.
168 Support vector machines 169 Finally, we used support vector machines (SVMs) [?]. An SVM essentially tries to find 170 surfaces in the high-dimensional feature space, which separate the different classes as positive prediction [40,41]. An AUROC score much greater than 0.5 (the score for 186 random predictions) indicates a good classifier. In particular, a score of 1 indicates that 187 all positive cases have been assigned a higher probability than all negative cases.

188
Prediction of MAG phenotypes 189 Once classifiers have been trained on the NCBI data, it is possible to use them to make 190 predictions about unseen genomes, such as MAGs generated from shotgun sequencing 191 studies. The MAGs must first be processed to give a matrix of the KEGG ortholog copy 192 numbers associated with them, using the same pipeline as applied above to the NCBI  [42]. These 660 MAGs were those which had at least 202 75% of the 36 single-copy prokaryotic core genes identified in Alneberg et al. [6] in 203 a single-copy and can thus be considered reasonably complete and pure MAGs that were constructed by co-assembly and binning of 95 metagenome 212 samples taken from three replicate laboratory anaerobic digestion (AD) 213 bioreactors converting distillery waste into biogas. They were assembled with Ray 214 using a kmer size of 41  these functions, as we might hope. In particular, consider the prediction of 257 methanogens, a relatively easy task since it is known that methanogens must possess 258 the mcrA gene, this being a necessary and sufficient condition for methanogenesis [45]. 259 Indeed, subunits of this gene have the highest weight, and a total of only 9 genes are 260 used by the classifier.

7/22
Looking at some more complex traits, for example sulfate respiration (i.e. 262 disimmilatory sulfate reduction to H 2 S), the model assigns a lot of weight to subunits of 263 a quinone-modifying oxidoreductase, which is indeed associated with sulfur 264 metabolism [46]. Interestingly, however, none of the genes picked out by the classifier 265 are directly part of the metabolic pathway for this process as described in the KEGG 266 module for dissimilatory sulfate reduction, see Figure 2. The situation is similar with 267 hydrogenotrophic methanogenesis, with classification mostly determined by components 268 of energy-converting hydrogenases which are not directly part of the autotrophic 269 methanogenesis pathway, along with mcr genes indicating that the microbe is a 270 methanogen.

318
To see if we can develop classifiers that are less affected by taxonomic signals, we 319 trained the models on organisms from a subset of taxa and tested its performance on 320 another, unrelated taxa. If a classifier can predict phenotype based on genes in a 321 distantly-related, unseen set of organisms, it is likely the genes it is using have a real 322 association with the function. In particular, we tried training our logistic regression 323 models on the Proteobacteria, a large phylum of bacteria, and testing on the rest of the 324 taxonomy. Some functions did not have significant numbers of species in each of these 325 sets; we used only functions with at least 5 species in the training set and 5 in the test, 326 leaving 59 functions out of 84.

327
As might be expected, the resulting classifiers from this approach performed 328 significantly worse in this case, compared to being trained on a random selection of 329 species from throughout the prokaryotic part of the tree of life, see Figure 3b. However, 330 for a significant number of functions the performance of the classsifier is still fairly good, 331 indicating an ability to make predictions which are generalizable to significantly 332 different unseen groups of organisms. 19 functions have an AUROC score greater than 333 80%, and 9 greater than 90%.
334 Figure 3b shows a scatter plot of classifier complexity against performance, as in 335 Figure 3a. Notable is that the group of classifiers achieving high accuracy while using a 336 lot of genes is gone: functions such as fermentation and nitrate reduction, which were in 337 this group of classifiers, are now much less accurate. Classifiers which work well in the 338 cross-taxa case all use a relatively small number of genes, less than 150 or so. This 339 suggests that the classifiers using a large number of genes to make predictions in the 340 randomized case may have been using a range of genes found in different closely-related 341 clusters of organisms which all have the target trait, but which may not have a causal 342 relationship with the function.

343
Prediction of MAG phenotypes 344 A possible major benefit of developing machine learning approaches to predicting 345 function from genomes is that the resulting classifiers can be applied to novel genomes 346 predicted from meta genomes. We therefore used the classifiers trained above to classify 347 metagenomically assembled genomes (MAGs) from a few different environments. These 348 were laboratory anaerobic digesters, the ocean (from the Tara oceans project [5]), and 349 MAGs from a groundwater aquifer assigned to be members of the so-called 'canditate 350 phyla radioation' (CPR) [8]. The CPR is a set of bacterial lineages discovered from 351 metagenomic studies consisting of a very large number of proposed novel phyla. These 352 organisms have very small genomes, and may typically live in symbiosis with other 353 organisms [47].

354
To perform the functional assigments, we used the 1-regularized LR classifier 355 described above, with a random train-test splitting and the regularization parameter 356 C = 0.05, trained using KEGG orthologs on the full NCBI genomes. Figure 5 shows the 357 predicted functions by this classifier for the MAGs assembled from anaerobic digesters 358 and from the global oceans. There are noticeable differences, such as more AD MAGs 359 having fermentation and sulfate-metabolism-related functions and fewer having aerobic 360 chemoheterophy.

361
To make these differences clearer, Figure S3 is a bar chart showing the proportion of 362 MAGs from the different environments having a function, for some of the most common 363 functions. For many functions, the differences are very significant.

364
These differences in function might well be expected between these environents. For 365 example, fermentation is very important in the AD process, and aerobic 366 chemoheterotrophy obviously is not as the environment is aerobic. This indicates that 367 the method is capable of producing useful information about MAGs. The results for the 368 CPR MAGs indicate that these organisms possess significantly fewer functions than 369 those from the other two environments, as would be expected from their very small 370 genome sizes. A few functions do however have significant incidence in this group.

371
Apart from 'chemoheterotrophy' and 'aerobic chemoheterotrophy', which are very broad 372 categories encompassing a large proportion of all organisms, a few functions associated 373 with nitrogen metabolism, especially nitrate reduction, are noticeably present in the 374 group. That the CPR are involved in nitrate reduction was recently proposed in 375 Danczak et al. [47].

376
Some of the functional assignments seem strange, for example organisms being 377 classified as acetoclastic or hydrogenotrophic methanogens but not as methanogens, 378 including a significant proportion of CPR organisms (about 3%). Looking at the gene 379 orthologs present in these organisms and their taxonomic assignments sheds some light 380 on what is going on here, see Table 3. For example, some of the acetoclastic 381 methanogens which are misclassified as not being methanogens are missing the mcrA 382 ortholog K00399, presumably because the MAGs are incomplete and this gene has been 383 missed. Another example is an organism classified as being a hydrogenotrophic 384 methanogen but not a methanogen. This MAG appears to be similar to the genome of 385 the bacterium Caldisericum exile, which is not a methanogen and does not possess 386 mcrA (it is an anaerobic, thermophilic bacterium which respires by thiosulfate 387 reduction ??). However, it does possess genes for subunits of the energy-converting 388 hydrogenase A, which is indicative of hydrogenotrophic methanogenesis (see Table 1).

389
Therefore, these discrepancies may be the result either of incomplete MAGs, or of 390 combinations of genes which are rare or unseen in the training set.

391
For the Tara dataset, metadata for different samples was available, including depth, 392 temperature and salinity, among other measurements. Combining the coverages of the 393 different MAGs across these samples with the functional assignments of the MAGs, we 394 can calculate the proportion of a given function in different sample groups. Figure 6 395 shows the mean of this metric for all the functions for samples from different oceans.

396
The functional differences do not seem to be very significant between oceans, with a few 397 exceptions, notably an abundance of nitrate reduction in the Southern Ocean, and a 398 strong negative correlation between temperature and fermenter abundance (Pearson 399 correlation r = −0.43, p < 0.001). Figure 7.  mcrA  echA  methanogenesis  hydrogenotrophic  acetoclastic  Caldisericum exile  0  1  0  1  0  Methanomassiliicoccus luminyensis  1  0  1  1  0  Methanosaeta concilii  2  0  1  1  1  Methanosaeta concilii  0  0  1  0  1  Methanosaeta harundinacea  1  0  1  0  0  Methanoregula formicica  1  0  1  0  0  Methanoregula formicica  1  0  1  0  0  Methanolinea tarda  1  0  1  0  0   Table 3. Key genes and predicted functions for MAGs predicted to be methanogens. Gene copy numbers for the mcrA methanogenesis gene and the energy-converting hyrdogenase A, along with functional predictions, for AD MAGs predicted to be methanogenic by our algorithm. proportion of the functions we tested achieved very good classification accuracy, with 406 AUROC scores greater than 90%. Of the machine learning algorithms we tested, we 407 found that 1-regularized logistic regression gave the best combination of accuracy, 408 computational efficiency and interpretability. The results did not depend very strongly 409 on whether KEGG orthologs or Pfam domains were used to characterise genes in the 410 genomes, although the KEGG scheme performed slightly better on average over the 411 functions we considered here. The logistic regression models we generate can be 412 inspected, and the genes most associated with a given phenotype in the model 413 enumerated, which allows for validation of the models by comparison to what is known 414 about the function of these genes. This raises the potential for this method to discover 415 new associations between orthologous groups and phenotypes. For example, we found 416 here that the presence of subunits of the energy-converting hydrogenase A are more 417 predictive of an organism's performing hydrogenotrophic methanogensis than the genes 418 directly involved in the process as described in the KEGG functional module for it. For 419 nitrate reduction, in addition to expected genes such as nitrate reductase, there are 420 multiple KEGG orthologs listed as 'uncharacterised protein' which are highly predictive 421 of this function.

422
To check the robustness of the models we generated, we tried training the models on 423 one section of the microbial taxonomic groups, the Proteobacteria, and testing its 424 accuracy on organisms from the rest of the tree, which it had not encountered at all in 425 training. This did significantly reduce model accuracy for many phenotypes. This is to 426 be expected, as the training and test sets in this case are so different. However, some of 427 the functions still achieved good accuracy. This would suggest that the logistic 428 regression model is identifying genes functionally involved with the phenotype in the 429 training stage, such that their presence even in distantly related organisms is indicative 430 of the presence of the phenotype. Phenotype predictions that had good accuracy under 431 this scheme tended to produce models involving only a few genes (i.e. only a few genes 432 had nonzero weights in the logistic regression model), less than 100, supporting the idea 433 that these models are picking out genes directly involved with the phenotype.

434
As with any machine learning approach, the presented study is limited by the 435 accuracy and availability of the data on which models can be trained. We consider this 436 to be the key bottleneck in the genotype-phenotype prediction problem and expect 437 future improvements to come from better data curation and collection (on existing 438 organsisms), rather than development of new improved classification algorithms. To this 439 end, extended versions of databases such as FAPROTAX, and listing detailed 440 phenotypic features of cultured organisms would provide a highly valuable resource for 441 machine learning approaches. Such data would allow development of finer grained and 442 more accurate classifiers, which can then be applied to unknown genomes and MAGs.

443
The results of such applications will also be improved by better MAG assembly. In 444 particular, we note the key limitation of classifiers on MAGs being the accuracy of their 445 gene content assignments. Increasing this accuracy will also improve the accuracy of 446 their machine learning based functional assignment. As these limitations are addressed, 447 we expect classifiers to become a key tool in gaining insights into the functional 448 capabilities both of microbiomes as a whole and the individual species making up those 449 communitites, even when those species cannot be cultivated.