RT Journal Article
SR Electronic
T1 Distributed representations of protein domains and genomes and their compositionality
JF bioRxiv
FD Cold Spring Harbor Laboratory
SP 524280
DO 10.1101/524280
A1 A. Viehweger
A1 S. Krautwurst
A1 B. König
A1 M. Marz
YR 2019
UL http://biorxiv.org/content/early/2019/01/20/524280.abstract
AB Learning algorithms have at their disposal an ever-growing number of metagenomes for biomining and the study of microbial functions. We propose a novel representation of function called nanotext that scales to very large data sets while capturing precise functional relationships. These relationships are learned from a corpus of 32 thousand genome assemblies with 145 million protein domains. We treat the protein domains in a genome like words in a document, assuming that protein domains in a similar context have similar “meaning”. This meaning can be distributed by the Word2Vec embedding algorithm over a vector of numbers. These vectors not only encode function but can be used to predict even complex genomic features and phenotypes. We apply nanotext to data from the Tara ocean expedition to predict plausible culture media and growth temperatures for microorganisms from their metagenome assembled genomes (MAGs) alone. nanotext is freely released under a BSD licence (https://github.com/phiweger/nanotext).