An encoding of genome content for machine learning

An ever-growing number of metagenomes can be used for biomining and the study of microbial functions. The use of learning algorithms in this context has been hindered because they often need input in the form of low-dimensional, dense vectors of numbers. We propose such a representation for genomes called nanotext that scales to very large data sets. The underlying model is learned from a corpus of nearly 150 thousand genomes spanning 750 million protein domains. We treat the protein domains in a genome like words in a document, assuming that protein domains in a similar context have similar “meaning”. This meaning can be distributed by a neural net over a vector of numbers. The resulting vectors efficiently encode function, preserve known phylogeny, capture subtle functional relationships and are robust against genome incompleteness. The “functional” distance between two vectors complements nucleotide-based distance, so that genomes can be identified as similar even though their nucleotide identity is low. nanotext can thus encode (meta)genomes for direct use in downstream machine learning tasks. We show this by predicting plausible culture media for metagenome assembled genomes (MAGs) from the Tara Oceans Expedition using their genome content only. nanotext is freely released under a BSD licence (https://github.com/phiweger/nanotext).

Introduction

…haemolyticus (Fig. 1, A) is a human gastrointestinal pathogen that lives in warm brackish waters. Subpopulations are associated with different geographic locations 32.

Table 1: Accuracy of selected models on the SOMO and ecotypes task. The core model performs best on the SOMO task, while the accessory model excels in ecotype separation. The SOMO and ecotype tasks helped select three models. Models 93 and 22 correspond to the core and accessory models.

(A) In the most granular nanotext clusters, taxa labels are homogeneous at higher ranks. Homogeneity is defined as the fraction of the most abundant taxon within a cluster. At the species level, clusters contain heterogeneous labels. This indicates that species fill specific functional guilds. The larger heterogeneity of archaea is an artifact of the low number of archaeal genomes (< 5%) in the GTDB. Given more data, the model could separate these genomes better. (B) The mean pairwise cosine similarity was calculated across all rank labels in the GTDB, i.e. each estimate is calculated "intra-rank" for a given rank label such as "class Clostridia". We downsampled each rank randomly to n = 200 genomes. As expected, lower ranks have high self-similarity compared to higher ranks, approaching zero at the domain level. Again, this effect differs for archaea, due to far fewer genomes in the corpus.

Near-perfect genomes from the GTDB (n = 89) were truncated in silico in 10% increments (facets). Each time, both nanotext and sourmash lca inferred taxa labels across all ranks (x-axis). Either only nanotext (red) or only sourmash lca (blue) was correct, both (violet) or neither (white). This resulted in a score for cumulative accuracy (y-axis). Note that about 20% of genomes in the test set had no species label (left-most facet, right-most bar). nanotext is more correct in its assignment of taxa up to about 70% incompleteness. Its accuracy declines thereafter. sourmash lca assigns fewer taxa in general, but does so for up to 90% truncation.
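The cluster homogeneity measure described in the caption above (the fraction of the most abundant taxon label within a cluster) is straightforward to compute. A minimal sketch, with purely illustrative taxon labels:

```python
from collections import Counter

def homogeneity(labels):
    """Fraction of the most abundant taxon label within a cluster."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

# A cluster of five genomes, four of which share a class label:
print(homogeneity(["Clostridia", "Clostridia", "Clostridia",
                   "Clostridia", "Bacilli"]))  # 0.8
```

A perfectly homogeneous cluster scores 1.0; the score shrinks toward 1/k as labels become evenly mixed over k taxa.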
…complements nucleotide-based distance metrics like Jaccard similarity 44 and average nucleotide identity.

Members of the understudied phylum Gemmatimonadota were identified from a large Tara Oceans MAG collection 4. We overlaid vectors for all Gemmatimonadota in the GTDB (black, n = 64) with vectors inferred from Tara MAGs (grey, n = 957). The GTDB genomes form two distinct clusters. (B) Detail of A (black points in box now in grey). Putative Gemmatimonadota in the Tara collection were identified using functional and nucleotide similarity. nanotext and sourmash both identified the same five MAGs (red). Additionally, the nearest neighbors identified by sourmash based on nucleotide similarity using the MinHash algorithm are shown in blue. Note how these are further from the identified MAGs (red) than are the closest GTDB genomes in vector space (grey). nanotext can relate these closest genomes with similar metabolic potential even when their nucleotide similarity is low. (C, D) Pairwise nucleotide distance between reference and putative Gemmatimonadota shows large variance. Yet functional similarity remains stable even past common thresholds of nucleotide-based relatedness (ANI < 0.8, Jaccard similarity < 0.5).

Figure 6: Genome embeddings allow accurate predictions when used as input to learning algorithms. (A) Prediction of culture media from genome content. We linked GTDB genomes to DSMZ media at the level of genus. We then trained a fully-connected neural net to learn the genotype-phenotype mapping. Media (x-axis) can be predicted with high accuracy (y-axis, white bars count correct cases, grey bars count wrong cases). A classification was correct if the target medium was among the 10 nearest neighbors of the predicted medium vector. Only the 20 most common media in the database are displayed. (B) Predicted top media for Tara MAGs. The most common media (excluding their variants) in the prediction set are no. 514 ("Bacto Marine Broth"), no.
1120 ("PYSE Medium") - e.g. used to study Colwellia maris isolated from seawater 54, no. 830 ("R2A Medium") - developed to study bacteria which inhabit potable water 55, no. 1066 ("Marinobacter Lutaoensis Medium"), no. 878 ("Thermus 162 Medium"), no. 269 ("Acidiphilium Medium") and no. 607 ("M13 Verrucomicrobium Medium"), which includes artificial seawater as an ingredient. All these media correspond to marine isolates. They are plausible starting media for the query MAGs.

Model training

We used all roughly 150 thousand genomes from the Genome Taxonomy Database (GTDB, release r89) 26 for model training. Because the associated taxonomic assignments for release r89 were not yet available at the time of writing, we used the metadata from the previous r86 release. The taxonomy is largely consistent with the expected r89 one (personal communication). Genomes were annotated using pfam_scan.pl (v1.6, https://bit.ly/2CHXlVI), which resulted in a corpus of 750 million domains. Each line in the corpus is the sequence of Pfam protein domains on a contig. Strand information is not preserved. We did not filter any protein domains. The vocabulary has 10,879 domains, about 60% of the domains in Pfam (v32) 19.

We obtained word and document vectors using the Word2Vec algorithm 22,23. We trained over a grid of hyperparameter combinations (see below) with a linearly decreasing learning rate (0.025 to 0.0001) over 10 epochs using the distributed bag of words (PV-DBOW) training option as implemented in Gensim (v3.4.0, https://radimrehurek.com/gensim/). The result was a 100-dimensional vector for each domain and genome. The similarity of any two vectors A and B can be calculated using cosine similarity (Eq. 1), with a range from -1 (no similarity) to 1 (identical):

cos(A, B) = (A · B) / (‖A‖ ‖B‖)   (1)

Cosine distance, defined as 1 - cos(A, B) (Eq. 2), ranges over (0, 2).

Hyperparameter optimization

The Word2Vec algorithm has many tunable parameters, and they matter 67. We performed grid search over a range of plausible parameters and trained 96 different models. The most relevant parameters were the subsampling threshold t, the negative sampling exponent, the number of negative samples per frame and the context window size (Supplementary table 1).

The subsampling parameter determines how likely a word with frequency f(w) in the corpus is kept during training, expressed as a probability P(w) (Eq. 3).
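Cosine similarity and the derived cosine distance are simple to compute. A minimal sketch with illustrative vectors (the real vectors are the 100-dimensional genome embeddings):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity (Eq. 1): dot(A, B) / (|A| * |B|), in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """Cosine distance: 1 - cosine similarity, ranging over [0, 2]."""
    return 1.0 - cosine_similarity(a, b)

# Identical vectors have similarity 1 (distance 0);
# opposite vectors have similarity -1 (distance 2).
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # 1.0
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))    # 2.0
```

Because the embedding dimensionality is fixed (100), both quantities cost O(d) per pair, which is what makes nearest-neighbor queries over ~150 thousand genome vectors cheap.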
Based on this formula, we determined how many of the most frequent domains were affected by subsampling given t (Figure S1, B). We set t to (3e-5, 1.4e-4), which translates into the most frequent (100, 1000) domains being affected by downsampling.
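Eq. 3 did not survive extraction in this copy; the sketch below uses the standard word2vec subsampling rule (as implemented in the original word2vec code and in Gensim), which we assume the paper follows, with the threshold values quoted above:

```python
import math

def p_keep(freq, t):
    """Probability that a word (here: a protein domain) with corpus
    frequency `freq` (fraction of all tokens) is kept during training,
    given subsampling threshold `t`. Standard word2vec rule (an
    assumption, see lead-in); probabilities are capped at 1."""
    p = (math.sqrt(freq / t) + 1) * (t / freq)
    return min(p, 1.0)

# A rare domain is always kept ...
print(p_keep(1e-6, t=3e-5))   # 1.0
# ... while a very frequent one is aggressively downsampled.
print(p_keep(1e-2, t=3e-5))
```

Raising t from 3e-5 to 1.4e-4 keeps more of the frequent domains, which is why only the top ~100 versus ~1000 domains are affected at the two settings.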
…contrib/hdbscan) using default parameters and selecting clusters using the conservative leaf method.

Training of media embeddings

To quantify media similarity, we created a media embedding. We excluded water (SEED-cpd00001###) and agar (SEED-cpd13334###) because these ingredients are non-informative. We trained with a window size of 5 and a learning rate as described above over 100 epochs. The resulting model can suggest similar media via nearest neighbor search. We visualized the media vector space using t-SNE (Figure S6). Media vectors cluster and enable learning algorithms to discriminate between media classes.
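The nearest neighbor search over media vectors can be sketched in a few lines. The medium identifiers and three-dimensional vectors below are toy values, not real media embeddings:

```python
import math

def nearest_media(query, media_vectors, k=3):
    """Rank media by cosine similarity to a query medium vector and
    return the k nearest medium identifiers."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a))
                      * math.sqrt(sum(y * y for y in b)))
    ranked = sorted(media_vectors,
                    key=lambda m: cos(query, media_vectors[m]),
                    reverse=True)
    return ranked[:k]

# Toy vectors; no. 830c is modeled as a near-identical variant of no. 830.
media = {
    "DSMZ_830":  [0.90, 0.10, 0.00],
    "DSMZ_830c": [0.88, 0.12, 0.01],
    "DSMZ_514":  [0.10, 0.90, 0.20],
}
print(nearest_media([0.9, 0.1, 0.0], media, k=2))
```

Near-identical media end up adjacent in vector space, mirroring the behavior of no. 830 and no. 830c in Figure S6.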

379
Linking GTDB genome assemblies to BacDive culture media

To predict a medium from a genome we needed to create a training set that matches the two.

Figure S2: Effect of window size parameter on ecotype task accuracy. Across all negative sampling exponent values (facets), a larger window (context) size performs better for all ecotype labels.

Figure S3: Effect of negative sampling exponent on ecotype task accuracy. Lower exponents lead to more accurate models on the ecotype task across all ecotype labels.
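The evaluation rule used for media prediction (Figure 6: a prediction counts as correct if the target medium is among the 10 nearest neighbors of the predicted medium vector) can be sketched as follows. Medium names and vectors are toy values for illustration:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a))
                  * math.sqrt(sum(y * y for y in b)))

def is_correct(predicted_vec, target_medium, media_vectors, k=10):
    """Top-k evaluation: the prediction is correct if the target medium
    is among the k media whose vectors lie nearest (by cosine
    similarity) to the vector the neural net predicted."""
    ranked = sorted(media_vectors,
                    key=lambda m: cosine(predicted_vec, media_vectors[m]),
                    reverse=True)
    return target_medium in ranked[:k]

media = {
    "DSMZ_830":  [0.90, 0.10, 0.00],
    "DSMZ_830c": [0.88, 0.12, 0.01],
    "DSMZ_514":  [0.10, 0.90, 0.20],
}
# The predicted vector lands closest to no. 830; its variant no. 830c
# is the second-nearest neighbor, so a top-2 check already succeeds.
print(is_correct([0.9, 0.1, 0.0], "DSMZ_830c", media, k=2))  # True
```

This top-k criterion is more forgiving than exact-match accuracy, which is appropriate here because many DSMZ media are minor variants of one another.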

bioRxiv preprint doi: https://doi.org/10.1101/524280. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.

Figure S4: Effect of number of negative samples on ecotype task accuracy. For the hardest to separate and thus most informative ecotypes (x-axis), a smaller count increases accuracy for smaller values of the negative sampling exponent (facets). For exponents ≥ 0.3, the effect is negligible.

Figure S6: Visualisation of medium embedding space. We used t-SNE to project medium vectors into the plane (grey points). All media with more than 0.95 cosine similarity to any of the top 16 most common DSMZ media were colored. We observe clear clusters of similar media. These clusters can be used by learning algorithms to discriminate media classes. Note how near-identical media such as no. 830 and no. 830c are embedded in near-identical vector space.
