RT Journal Article
SR Electronic
T1 MicroPheno: Predicting environments and host phenotypes from 16S rRNA gene sequencing using a k-mer based representation of shallow sub-samples
JF bioRxiv
FD Cold Spring Harbor Laboratory
SP 255018
DO 10.1101/255018
A1 Ehsaneddin Asgari
A1 Kiavash Garakani
A1 Alice C. McHardy
A1 Mohammad R.K. Mofrad
YR 2018
UL http://biorxiv.org/content/early/2018/01/28/255018.abstract
AB Motivation: Microbial communities play important roles in the function and maintenance of various biosystems, ranging from the human body to the environment. A major challenge in microbiome research is the classification of microbial communities of different environments or host phenotypes. The most common and cost-effective approach for such studies to date is 16S rRNA gene sequencing. Recent falls in sequencing costs have increased the demand for simple, efficient, and accurate methods for rapid detection or diagnosis with proven applications in medicine, agriculture, and forensic science. Results: We describe a reference- and alignment-free approach for predicting the environment or host phenotype from microbial community samples based on k-mer distributions in 16S rRNA data. In addition, we propose a bootstrapping framework to investigate the sufficiency of a shallow sub-sample for prediction. We study the use of deep learning methods as well as classic machine learning approaches for distinguishing among human body-sites, diagnosing Crohn's disease, and predicting the environments (18 ecological and 5 organismal environments) from representative 16S gene sequences. Furthermore, we explore the use of unsupervised dimensionality reduction methods as well as supervised deep representation learning for visualizing microbial data of different environments and host phenotypes.
We demonstrated that k-mer representations outperform Operational Taxonomic Unit (OTU) features in distinguishing among 5 major body-sites, as well as in predicting Crohn's disease, using 16S rRNA sequencing samples. We also showed that a shallow sub-sample of 16S rRNA samples alone can be sufficient to produce a proper k-mer representation of the data. Aside from being more accurate, using k-mer features in shallow sub-samples provided the following benefits: (i) skipping the computationally costly sequence alignments required in OTU-picking, and (ii) providing a proof of concept for the sufficiency of shallow, short-length 16S rRNA sequencing for environment/host phenotype prediction. In addition, k-mer features were able to accurately predict representative sequences of 18 ecological and 5 organismal environments with relatively high macro-F1 scores. The Deep Neural Network outperformed Random Forest and Support Vector Machine classifiers in the classification of large datasets. Availability: The MicroPheno code and datasets will be available at https://llp.berkeley.edu/micropheno. Frequent abbreviations used are: DKL, Kullback-Leibler divergence; DNN-l, Deep Neural Network with l layers; RF, Random Forests; SVM, Support Vector Machine (here, linear SVM).
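
The abstract's two core ideas, a k-mer distribution as the sample representation and resampling to judge whether a shallow sub-sample suffices, can be illustrated with a minimal Python sketch. This is not the MicroPheno implementation: the function names, the default parameters, and the choice of Kullback-Leibler divergence as the comparison between a sub-sample and the full sample are assumptions made here for illustration (the divergence is suggested only by the abbreviation list above).

```python
import itertools
import math
import random
from collections import Counter


def kmer_distribution(reads, k, alphabet="ACGT"):
    """Return the normalized k-mer frequency vector pooled over all reads.

    The vector has one entry per possible k-mer over `alphabet`, in
    lexicographic order. k-mers containing other symbols (e.g. N) are ignored.
    """
    kmers = ["".join(p) for p in itertools.product(alphabet, repeat=k)]
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    total = sum(counts[km] for km in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[km] / total for km in kmers]


def kl_divergence(p, q, eps=1e-9):
    """Approximate D_KL(p || q); eps (an assumed smoothing constant)
    avoids division by zero when q has empty bins."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def subsample_divergence(reads, k, n, resamples=10, seed=0):
    """Average divergence between n-read sub-samples (drawn with
    replacement) and the full sample's k-mer distribution.

    Hypothetical helper: a small value suggests that n reads already
    reproduce the full representation.
    """
    rng = random.Random(seed)
    full = kmer_distribution(reads, k)
    divergences = []
    for _ in range(resamples):
        sample = [rng.choice(reads) for _ in range(n)]
        divergences.append(kl_divergence(kmer_distribution(sample, k), full))
    return sum(divergences) / len(divergences)
```

Under these assumptions, calling `subsample_divergence(reads, k=6, n=1000)` for increasing `n` and watching the value level off would give a rough indication of how shallow a sub-sample can be before the k-mer representation degrades.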