Abstract
Our ability to reconstruct genomes from metagenomic datasets has rapidly evolved over the past decade, leading to publications presenting thousands, and in some cases more than 100,000, metagenome-assembled genomes (MAGs) recovered from thousands of samples. While this wealth of genomic data is critical to expanding our understanding of microbial diversity, evolution, and ecology, various issues observed in some of these datasets risk obfuscating scientific inquiry. In this perspective we focus on the issue of identical or highly similar genomes assembled from independent datasets. While obtaining multiple genomic representatives for a species is highly valuable, multiple copies of the same or highly similar genomes complicate downstream analyses. We analyzed data from recent studies to quantify the levels of redundancy within these datasets, to demonstrate the highly variable performance of commonly used dereplication tools, and to point to existing approaches that account for and leverage repeated sampling of the same or similar populations.