RT Journal Article
SR Electronic
T1 Light into the darkness: Unifying the known and unknown coding sequence space in microbiome analyses
JF bioRxiv
FD Cold Spring Harbor Laboratory
SP 2020.06.30.180448
DO 10.1101/2020.06.30.180448
A1 Chiara Vanni
A1 Matthew Schechter
A1 Silvia Acinas
A1 Albert Barberán
A1 Pier Luigi Buttigieg
A1 Emilio O. Casamayor
A1 Tom O. Delmont
A1 Carlos M. Duarte
A1 A. Murat Eren
A1 Rob Finn
A1 Alex Mitchell
A1 Pablo Sanchez
A1 Kimmo Siren
A1 Martin Steinegger
A1 Frank Oliver Glöckner
A1 Antonio Fernandez-Guerra
YR 2020
UL http://biorxiv.org/content/early/2020/07/01/2020.06.30.180448.abstract
AB Bridging the gap between the known and the unknown coding sequence space is one of the biggest challenges in molecular biology today. This challenge is especially extreme in microbiome analyses, where between 40% and 60% of the coding sequences detected are of unknown function, and ignoring this fraction limits our understanding of microbial systems. Discarding the uncharacterized fraction is no longer an option. Here, we present an in-depth exploration of the microbial unknown fraction through the lenses of a conceptual framework and a computational workflow we developed to unify the microbial known and unknown coding sequence space. Our approach partitions the coding sequence space into gene clusters and contextualizes them with genomic and environmental information. We analyzed 415,971,742 genes predicted from 1,749 metagenomes and 28,941 bacterial and archaeal genomes, putting into perspective the extent of the unknown fraction, its diversity, and its relevance in a genomic and environmental context. With the identification of a target gene of unknown function for antibiotic resistance, we demonstrate how a contextualized unknown coding sequence space provides a robust framework for the generation of hypotheses that can be used to augment experimental data. Competing Interest Statement: The authors have declared no competing interest.