Abstract
One of the biggest challenges in molecular biology is bridging the gap between the known and the unknown coding sequence space. This challenge is especially acute in microbial systems, where between 40% and 60% of the predicted genes are of unknown function. Discarding this uncharacterized fraction should no longer be an option. Here, we present a conceptual framework and a computational workflow that bridge this gap and provide a powerful strategy to contextualize investigations of genes of unknown function. Our approach partitions the coding sequence space, removing the known-unknown dichotomy; unifies genomic and metagenomic data; and provides a framework to expand those investigations across environments and organisms. By analyzing 415,971,742 genes predicted from 1,749 metagenomes and 28,941 bacterial and archaeal genomes, we showcase our approach and its application in ecological, evolutionary and biotechnological investigations. As a result, we put into perspective the extent of the unknown fraction, its diversity, and its relevance in genomic and environmental contexts. By identifying a target gene of unknown function for antibiotic resistance, we demonstrate how a contextualized unknown coding sequence space enables the generation of hypotheses that can be used to augment experimental data.
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
Title updated. Abstract and main text modified. Supplemental files updated.