Light into the darkness: Unifying the known and unknown coding sequence space in microbiome analyses

Chiara Vanni; Matthew Schechter; Silvia Acinas; Albert Barberán; Pier Luigi Buttigieg; Emilio O. Casamayor; Tom O. Delmont; Carlos M. Duarte; A. Murat Eren; Rob Finn; Alex Mitchell; Pablo Sanchez; Kimmo Siren; Martin Steinegger; Frank Oliver Glöckner; Antonio Fernandez-Guerra

doi:10.1101/2020.06.30.180448

Abstract

Bridging the gap between the known and the unknown coding sequence space is one of the biggest challenges in molecular biology today. This challenge is especially acute in microbiome analyses, where between 40% and 60% of the coding sequences detected are of unknown function, and ignoring this fraction limits our understanding of microbial systems. Discarding the uncharacterized fraction is no longer an option. Here, we present an in-depth exploration of the microbial unknown fraction through the lens of a conceptual framework and a computational workflow we developed to unify the microbial known and unknown coding sequence space. Our approach partitions the coding sequence space into gene clusters and contextualizes them with genomic and environmental information. We analyzed 415,971,742 genes predicted from 1,749 metagenomes and 28,941 bacterial and archaeal genomes, putting into perspective the extent of the unknown fraction, its diversity, and its relevance in a genomic and environmental context. By identifying a target gene of unknown function for antibiotic resistance, we demonstrate how a contextualized unknown coding sequence space provides a robust framework for generating hypotheses that can be used to augment experimental data.

Competing Interest Statement

The authors have declared no competing interest.

Footnotes

The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.