Abstract
A database of curated genomes is needed to better assess soil microbial communities and their processes associated with differing land management and environmental impacts. Interpreting soil metagenomic datasets with existing sequence databases is challenging because these datasets are biased towards medical and biotechnology research and can result in misleading annotations. We have curated a database of 922 genomes of soil-associated organisms (888 bacteria and 34 archaea). Using this database, we evaluated phyla and functions that are enriched in soils as well as those that may be underrepresented in RefSoil. Our comparison of RefSoil to soil amplicon datasets allowed us to identify targets that if cultured or sequenced would significantly increase the biodiversity represented within RefSoil. To demonstrate the opportunities to access these underrepresented targets, we employed single cell genomics in a pilot experiment to sequence 14 genomes. This effort demonstrates the value of RefSoil in the guidance of future research efforts and the capability of single cell genomics as a practical means to fill the existing genomic data gaps.