New insights from 33,813 publicly available metagenome-assembled genomes (MAGs) assembled from the rumen microbiome

Recent rumen microbiome metagenomics papers 1,2 have published hundreds of metagenome-assembled genomes (MAGs), comparing them to 4,941 MAGs published by Stewart et al 3 in order to define novelty. However, there are many more publicly available MAGs from ruminants. In this paper, for the first time, all available resources are combined, catalogued and de-replicated to define putative species-level bins. As well as providing new insights into the constitution of the rumen microbiome, including an updated estimate of the number of microbial species in the rumen, this work demonstrates that a lack of community-adopted standards for the release and annotation of MAGs hinders progress in microbial ecology and metagenomics.


Background
Recent advances in metagenomic assembly and binning have given rise to very large collections of metagenome-assembled genomes (MAGs), which often represent the only genomic information about a microbial strain or species that has not yet been cultured. These MAGs provide essential insight into the functional potential of individual strains and species, as well as the microbiomes they inhabit.
As globally important food-producing animals, ruminants are the subject of intense research, particularly as the microbiome in the rumen is primarily responsible for the breakdown of recalcitrant plant material into nutrients that the host can absorb. The 719 MAGs from Peng et al 1 and the 538 MAGs from Gharechahi et al were compared against the full set of existing 32,557 rumen MAGs that were publicly available at the time of publication, in addition to 460 cultured genomes from the Hungate collection. Using permissive settings designed to reduce the number of false-positive species-level bins, the entire set of rumen microbial genomes was de-replicated to produce a set of putative microbial species-level bins. In addition to providing new taxonomic insights into the rumen microbiome, this work also demonstrates that several factors are barriers to future research in metagenomics: the lack of a single, community-adopted repository for metagenomic bins and MAGs; the use of non-INSDC data repositories; a lack of community-adopted standards for MAG annotation; and a lack of enforcement of standards by journals.

Dereplication
The software dRep 15 was used to de-replicate 33,813 publicly available FASTA files representing MAGs and isolate genomes from ruminants. The parameters used were: These settings place two genomes in the same species-level bin only if they share at least 95% ANI across at least 30% of their length; otherwise they are assigned to separate bins. The low coverage threshold (30%) recognizes that incomplete genomes of the same species may overlap along only a small proportion of their sequence; using a higher value may split genomes that belong to the same species and so over-estimate the number of species. A 30% coverage threshold has been used in previous large-scale MAG studies which involved 50% complete genomes 16 .
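The grouping rule described above can be illustrated with a minimal Python sketch. It is not dRep's implementation, which performs its own multi-step clustering; the function names, the toy genome labels and the pairwise values are hypothetical, and single-linkage merging is assumed purely for simplicity. Only the 95% ANI and 30% coverage thresholds come from the text.

```python
# Thresholds taken from the text; everything else here is illustrative.
ANI_THRESHOLD = 0.95   # minimum average nucleotide identity
MIN_COVERAGE = 0.30    # minimum fraction of genome length aligned


def same_species(ani, coverage):
    """Return True if a genome pair meets both thresholds."""
    return ani >= ANI_THRESHOLD and coverage >= MIN_COVERAGE


def species_bins(genomes, comparisons):
    """Group genomes by single linkage (an assumption, not dRep's method).

    comparisons maps (genome_a, genome_b) -> (ani, coverage).
    """
    parent = {g: g for g in genomes}

    def find(g):
        # Follow parent links to the representative, compressing the path.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for (a, b), (ani, coverage) in comparisons.items():
        if same_species(ani, coverage):
            parent[find(a)] = find(b)

    bins = {}
    for g in genomes:
        bins.setdefault(find(g), []).append(g)
    return sorted(sorted(members) for members in bins.values())


# Hypothetical toy data: A and B pass both thresholds, A and C fail on
# coverage, C and D fail on ANI.
toy_genomes = ["mag_A", "mag_B", "mag_C", "mag_D"]
toy_comparisons = {
    ("mag_A", "mag_B"): (0.97, 0.45),
    ("mag_A", "mag_C"): (0.96, 0.20),
    ("mag_C", "mag_D"): (0.93, 0.80),
}
toy_bins = species_bins(toy_genomes, toy_comparisons)
print(toy_bins)
```

In this toy case the pair with high ANI but only 20% coverage is kept apart, which shows how raising the coverage threshold would tend to split more pairs and increase the number of bins.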

Taxonomic classification and phylogeny
MAGpy 17 and GTDB-Tk 18 (with the "classify_wf" option) were used to assign a taxonomy to the "winning" MAGs from dRep. To create a representative phylogenetic tree, genomes that were at least 90% complete and less than 5% contaminated were used as input to PhyloPhlAn 19 . Proteins were predicted for the genomes with Prodigal 20 and PhyloPhlAn was run with options: A plot of the phylogenetic tree was created using GraPhlAn 21 , with genomes coloured by the phylum assigned by GTDB-Tk.
Figure 1: Phylogenetic tree of the 2,696 MAGs with high completeness (at least 90%) and low contamination (<5%). The tree was created using PhyloPhlAn and drawn using GraPhlAn, with taxonomic assignments from GTDB-Tk.
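The completeness and contamination cut-offs used to select genomes for the tree can be expressed as a simple filter. The sketch below assumes per-genome estimates are already available as percentages; the bin names, the values and the function name are hypothetical.

```python
# Cut-offs from the text: at least 90% complete, less than 5% contaminated.
MIN_COMPLETENESS = 90.0
MAX_CONTAMINATION = 5.0


def is_tree_quality(completeness, contamination):
    """Return True if a genome passes both cut-offs."""
    return completeness >= MIN_COMPLETENESS and contamination < MAX_CONTAMINATION


# Hypothetical estimates: name -> (completeness %, contamination %).
toy_estimates = {
    "bin_001": (98.2, 1.1),
    "bin_002": (90.0, 4.9),   # passes: exactly 90% complete
    "bin_003": (89.9, 0.5),   # fails: below the completeness cut-off
    "bin_004": (95.0, 5.0),   # fails: contamination is not below 5%
}

selected = sorted(
    name
    for name, (completeness, contamination) in toy_estimates.items()
    if is_tree_quality(completeness, contamination)
)
print(selected)
```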

Results
After de-replication using the parameters above, 7,533 putative species-level bins remained. Treating all data released by a single publication as a single dataset (n=10), 4,794 bins were singletons, that is, putative microbial species found in only one dataset. Conversely, 2,739 putative microbial species were seen in more than one dataset. No species-level bin was present in all datasets. The maximum number of datasets in which any species-level bin was present was 7, and only three bins were present in 7 datasets: two from Stewart
Of the 7,533 species-level bins, GTDB-Tk identified 155 as Archaea. The majority of these were assigned to the phylum Methanobacteriota (119; 77%), with 31 assigned to Thermoplasmatota and 5 to Halobacteriota. GTDB-Tk was unable to assign a genus to eight (5%) of the archaeal MAGs, and was unable to assign a species to 110 (71%).
GTDB-Tk assigned 7,378 species-level bins to the domain Bacteria. These spanned 28 different phyla, the most common being Firmicutes_A (3339; 45%), Bacteroidota (1671; 23%), Firmicutes (807; 11%), Proteobacteria (299; 4%) and Verrucomicrobiota (248; 3%). Of particular interest as highly active fibre-degrading microbes, the dataset also contains 63 members assigned to the phylum Fibrobacterota. Of the 7,378 bacterial species-level bins, GTDB-Tk was unable to assign a class to one, an order to 11, a family to 99, a genus to 1,087 (15%) and a species to 5,796 (78%). The full results of the GTDB-Tk taxonomic assignments can be found in supplementary data 2.
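The singleton and shared counts reported above follow from tallying, for each species-level bin, how many datasets contribute at least one genome to it. A minimal sketch of that tally is shown below; the bin and study names are hypothetical toy data, and the function is an illustration rather than the code used in this work. The final check only confirms that the reported counts sum to the reported total.

```python
from collections import Counter


def occupancy_summary(bin_to_datasets):
    """Summarise how many datasets each species-level bin appears in."""
    counts = Counter(len(datasets) for datasets in bin_to_datasets.values())
    return {
        "singletons": counts.get(1, 0),
        "shared": sum(n for k, n in counts.items() if k > 1),
        "max_datasets": max(counts) if counts else 0,
    }


# Hypothetical toy data: bin name -> set of datasets containing it.
toy_bins = {
    "bin_1": {"study_A"},
    "bin_2": {"study_A", "study_B"},
    "bin_3": {"study_C"},
    "bin_4": {"study_A", "study_B", "study_C"},
}
summary = occupancy_summary(toy_bins)
print(summary)

# Consistency check on the figures reported in the text:
# singletons plus shared bins should equal the total number of bins.
assert 4794 + 2739 == 7533
```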
A phylogenetic tree of the 2,696 highly complete MAGs can be seen in Figure 1. The tree is dominated by large clades of Firmicutes_A and Bacteroidota, though other significant clades exist with the MAGs spread across 26 different phyla.

Conclusions and discussion
Assembly of genomes from metagenomic sequencing is becoming routine, with new MAG datasets published frequently. In order for comparisons to be made between new and existing MAG catalogues, it should be easy to find and retrieve all MAGs from a particular biome, alongside metadata in a standardised format. The MIMAG standard attempts to define standards for metadata around MAGs 22 , but these have not been universally adopted, and cannot be applied retrospectively to historical datasets. MAGs should be deposited in INSDC databases alongside metadata adhering to the MIMAG standard, and with completeness and contamination estimates as a minimum. In addition, it would be beneficial to submit all of the following: the raw reads; the raw metagenome assemblies; all metagenome bins created during the binning process; and the final set of metagenome-assembled genomes. The European Nucleotide Archive (ENA) provides a suitable repository and guidance for submitting all of these 23,24,25 , and has metadata checklists for both binned genomes 26 and metagenome-assembled genomes 27 that allow for the storage of essential metadata needed to interpret and compare metagenomics data. EBI's MGnify 28 also provides added-value services on these datasets.
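As a small illustration of the minimum recommended above, the following sketch flags MAG records that lack a completeness or contamination estimate. The field names and records are hypothetical and do not reproduce the MIMAG standard or any ENA checklist; they only show the kind of check a catalogue could run before release.

```python
# Hypothetical field names for the two estimates named in the text.
REQUIRED_FIELDS = ("completeness", "contamination")


def missing_fields(record):
    """Return the required fields that are absent or empty in a record."""
    return [field for field in REQUIRED_FIELDS if record.get(field) is None]


# Hypothetical toy records.
toy_records = [
    {"name": "mag_1", "completeness": 92.5, "contamination": 1.2},
    {"name": "mag_2", "completeness": 71.0},
    {"name": "mag_3"},
]

problems = {
    record["name"]: missing_fields(record)
    for record in toy_records
    if missing_fields(record)
}
print(problems)
```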
There are approximately 4 billion food-producing ruminants on the planet at any one time (FAOSTAT), and as such, the rumen and its microbiome are a priority area for research globally.
Recent advances in metagenomics mean we can now study the structure and function of the rumen microbiome in unprecedented detail. Here, for the first time, all data resources representing binned metagenomes from ruminants were combined and de-replicated to produce 7,533 putative species-level bins. However, combining and de-replicating partial and contaminated genomes is an inexact science, and it is difficult to effectively delineate species given the amount of missing information. Relatively permissive parameters were chosen so as to avoid over-inflation of the number of species-level bins. However, it is still possible that this is an over-estimate due to the inclusion of incomplete genomes.
The fact that the majority of the species-level bins for ruminants are singletons suggests that we are yet to sample the entire species-level sequencing space of global ruminants, and the lack of a species-level core microbiota suggests a large amount of variation in the constitution of the rumen microbiome. We update the estimate of the total number of microbial species in the rumen microbiome to 13,616 and provide taxonomic labels for all known species to date, which span 26 different microbial phyla. It is essential that researchers are funded to culture this microbial diversity, and study the role of these species in ruminant productivity and health, in climate change and sustainability, and in global food security.
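This section does not state how the estimate of 13,616 was derived. Purely as an illustration of how a richness estimate can exceed the observed count when singletons dominate, the sketch below implements the classic Chao1 estimator, S_obs + F1^2 / (2 * F2), where F1 and F2 are the numbers of species seen once and twice. Chao1 is an assumption here, not a confirmed description of the method used, and the input counts are toy values.

```python
def chao1(s_obs, f1, f2):
    """Classic Chao1 richness estimate (assumed estimator, for illustration).

    s_obs: number of species observed
    f1:    number of species observed exactly once (singletons)
    f2:    number of species observed exactly twice (doubletons)
    """
    if f2 > 0:
        return s_obs + (f1 * f1) / (2.0 * f2)
    # Bias-corrected form, used when there are no doubletons.
    return s_obs + f1 * (f1 - 1) / 2.0


# Toy values only: 10 observed species, 4 singletons, 2 doubletons.
toy_estimate = chao1(s_obs=10, f1=4, f2=2)
print(toy_estimate)
```

The more singletons there are relative to doubletons, the larger the estimated number of unseen species, which is consistent with the argument that a singleton-dominated catalogue has not yet sampled the full diversity.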