Insights into rumen microbial biosynthetic gene cluster diversity through genome-resolved metagenomics

Ruminants are critical to global food security as they transform lignocellulosic biomass into high-quality protein products. The rumen microbes ferment feed to provide necessary energy and nutrients for the ruminant host. However, we still lack insight into the metabolic processes encoded by most rumen microbial populations. In this study, we implemented metagenomic binning approaches to recover 2,809 microbial genomes from cattle, sheep, moose, deer, and bison. By clustering genomes based on average nucleotide identity, we demonstrate approximately one-third of the metagenome-assembled genomes (MAGs) to represent species not present in current reference databases and rumen microbial genome collections. Combining these MAGs with other rumen genomic datasets permitted a phylogenomic characterization of the biosynthetic gene clusters (BGCs) from 8,160 rumen microbial genomes, including the identification of 5,346 diverse gene clusters for nonribosomal peptide biosynthesis. A subset of Prevotella and Selenomonas BGCs had higher expression in steers with lower feed efficiency. Moreover, the microdiversity of BGCs was fairly constant across types of BGCs and cattle breeds. The reconstructed genomes expand the genomic representation of rumen microbial lineages, improve the annotation of multi-omics data, and link microbial populations to the production of secondary metabolites that may constitute a source of natural products for manipulating rumen fermentation.

With the expected population growth and changes in food consumption patterns, ruminant agriculture is critical to meeting global demands for animal products 1 . The rumen microbial community is central to the conversion of indigestible plant biomass into food products via conversion of complex carbohydrates to volatile fatty acids (VFAs) that provides the ruminant animal with approximately 70% of its caloric requirements 2 . Consequently, rumen microbes are paramount to ruminant health and productivity. Advancing the understanding of the structure-function relationship of the rumen microbiome is critical to improving ruminant agriculture.
Recent investigations of the rumen microbiome have expanded rumen microbial genomic databases 3-5 ; however, the genomic characterization of rumen microbes is far from complete. The Hungate1000 project provided high-quality reference genomes for hundreds of cultured rumen microbial strains 3 .
Nevertheless, the majority of rumen microbial populations remain elusive to current culturing strategies. In a recent cultivation experiment with defined and undefined media, 23% of rumen microbial operational taxonomic units were recovered 6  However, a notable fraction of metagenomic reads from previous studies did not map to genomes from Stewart et al. and the Hungate1000 collection 4, 5 , suggesting many rumen microbial species are yet to be characterized. Increasing the number of reference genomes for rumen microbes by identifying MAGs across different ruminant species would enhance our understanding of structurefunction relationships within the rumen microbiome and improve metagenomic inference.
Secondary metabolites are involved in a broad range of functions, such as antimicrobial agents and mediating microbial interactions 9 . Given the evidence linking the transmission of antibiotic resistance from livestock to humans 10,11 , there is a need to reduce the use of antimicrobial feed additives by identifying alternatives 12,13 . Due to the intense competition for nutrient resources, the rumen microbiome may provide novel opportunities to develop alternatives using endogenous antimicrobial peptides and probiotic microbial species 14 . A previous analysis found 45.4% of 229 rumen genomes encoded at least one bacteriocin gene cluster 15 . Secondary metabolites also have ecological roles in intercellular communication 9,16 . Thus, gaining fundamental knowledge on secondary metabolism in the rumen is critical to reduce the use of antimicrobial feed additives and expand our understanding of host-microbe and microbe-microbe interactions.
Here, we used publicly available metagenomes from ruminants (cattle, deer, moose, bison, and sheep) in combination with new cattle rumen metagenomic datasets to reconstruct 2,809 MAGs.
The MAGs expand the genomic representation of rumen microbial lineages and provide unique genomic insights into rumen microbial physiology. Moreover, we present a phylogenetic characterization of the secondary metabolite biosynthetic gene clusters (BGCs) of rumen microbial genomes and demonstrate the vast potential present within the rumen microbiome for the discovery of novel metabolites and probiotics to improve animal health and productivity.  Table 1). The metagenomes were from cattle (335 samples, 2.2 Tbp), sheep (75 samples, 888.4 gigabase pairs (Gbp)), moose (9 samples, 108.8 Gbp), deer (8 samples, 62.9 Gbp), and bison (8 samples,52.3 Gbp). Metagenomes were assembled independently to reduce the influence of strain variation and improve the recovery of closely related genomes 17,18 . Following refinement, dereplication, and filtering of resulting population genomes, we identified 2,809 non-redundant MAGs satisfying the following criteria: dRep 19 genome quality score ≥60, ≥75% complete, ≤10% contamination, N50 ≥5 kbp, and ≤500 contigs.

2,809 draft MAGs from the rumen ecosystem
The median estimated completeness and contamination of the MAGs were 89.7% and 0.9%, respectively ( Fig. 1a and Supplementary Table 2). Further, recovered MAGs had a median genome size of 2.2 Mbp, a median of 131 contigs, and a median N50 of 28.3 kbp (Fig. 1b). The proposed minimum information about a MAG (MIMAG) specifies high-quality draft genomes to have an estimated ≥90% completeness, ≤5% contamination, and contain 23S, 16S, and 5S rRNA genes 20 .
It remains challenging to reconstruct rRNA genes from short metagenomic reads due to the high sequence similarity of rRNA genes in closely related species. As a result, despite high estimated completeness and low contamination rates, only 20 MAGs meet the MIMAG standards for a highquality draft genome. We identified a 16S rRNA gene in 197 of the MAGs. The remaining MAGs are characterized as medium-quality MAGs under the MIMAG standards. The size of the point on the scatter plot corresponds to the dRep genome quality score, where Quality = Completeness − (5 · Contamination) + (Contamination · (Strain Heterogeneity / 100)) + 0.5 · (log(N50). The reported MAGs meet the following minimum criteria: genome quality score ≥60, ≥75% complete, ≤10% contamination, N50 ≥5 kbp, and ≥500 contigs. b) The frequency distribution of the number of contigs and genome sizes of reconstructed MAGs.
The majority of bacterial MAGs belonged to phyla Firmicutes or Bacteroidota (2,326; Fig. 2a and Supplementary Table 2). Additionally, we assembled 12 bacterial genomes from the superphylum Patescibacteria. At lower taxonomic ranks, Lachnospiraceae (415) and Prevotella (398) were the dominant family and genus identified among the assembled bacterial genomes. The most prevalent archaeal family and genus were Methanobacteriaceae (45) and Methanobrevibacter (35), respectively ( Fig. 2b). The recovered MAGs represent several new taxonomic lineages, as four genomes could not be classified at the rank of order, 16 at the rank of family, and 243 at the genus rank. Species-level overlap between reference genomes, the Hungate1000 Collection,

and rumen MAGs
To further characterize the assembled genomes, we compared the MAGs to other rumen specific genome collections, specifically genomes generated from the Hungate1000 project 3  Approximately one-third of the MAGs (1,007) did not exhibit ≥95% ANI with a genome in any of the three genomic collections (Fig. 3a). When considering the pairwise intersections between the datasets, 98 (3.5%), 933 (33.2%), and 1,438 (51.2%) of the MAGs in the current study had ≥95% ANI with a genome in the Hungate1000 Collection 3 , GTDB 21 , and Stewart et al. 4,5 , respectively. One hundred twenty-one (29.5%), 552 (2.5%), and 3,125 (63.2%) of the genomes from the Hungate1000 Collection 3 , GTDB 21 , and Stewart et al. 4,5 displayed ≥95% ANI with a MAG from the current study. Together, these results indicate that we recovered a majority of previous rumen genomic diversity with additional new lineages not previously identified from the rumen ecosystem. Fig. 3: Genomes sharing ≥95% ANI between databases and the characterization of rumenspecific 95% ANI clusters. a) The approximate number of species overlapping amongst rumen-specific and reference genomic datasets. Genomes demonstrating ≥95% ANI were considered to be shared between two datasets. Presented are a subset of intersections in which a MAG from the current study was the query genome. b) The number of genomes comprising each of the 3,541 95% ANI clusters generated from 8,160 rumen microbial genomes in the current study, the Hungate1000 Collection 3 , and Stewart et al. studies 4,5 . c) Rarefaction analysis based on subsampling 95% ANI clusters at steps of 500 genomes indicates the 8,160 genomes from recently published rumen genomic collections still only represent a fraction of expected microbial species diversity in the rumen ecosystem. Phylogenomic relationships of the 1,781 near-complete bacterial (d) and 35 near-complete archaeal (e) representative genomes with the highest dRep genome quality score from the 3,541 95% ANI clusters generated from 8,160 rumen-specific genomes. Near-complete genomes were defined as being ≥90% complete, having ≤5% contamination, and contig N50 ≥15 kbp. Layers surrounding the genomic trees indicate the bacterial phyla or archaeal genera and the log normalized number of genomes from each rumen genomic collection belonging to the same 95% ANI cluster. The bacterial and archaeal phylogenetic trees are provided as Supplementary File 3 and Supplementary File 4, respectively.
We applied an additional clustering approach to identify the approximate number of species represented by the rumen-specific genomes assembled in this study, in the Hungate1000 Collection 3 , and Stewart et al. 4,5 . A 95% ANI threshold yielded 3,541 clusters from the combination of the datasets (Supplementary Table 3). Of the 3,541 clusters, 2,024 contained a MAG from the current study, and 1,135 were composed exclusively of MAGs from the current study. In comparison, 2,175 and 286 clusters were comprised of genomes from Stewart et al. 4,5 and the Hungate1000 Collection 3 , respectively. The majority of 95% ANI clusters (2,166) are only comprised of a single genome (Fig.   3b). Furthermore, a rarefaction curve suggests the 8,160 genomes from the genomic collections analyzed here only represent a fraction of the estimated microbial species diversity in the rumen (Fig.   3c). The genome with the best dRep score from each cluster was used to generate a phylogenetic tree highlighting the species diversity within each rumen genomic collection and represents the vast MAGs from the current study. Each individual and collective database was used for classification of sample reads that underpinned metagenomic binning and from a rumen metagenomic dataset not used in the reconstruction of MAGs 25 . MAGs from the current work classified more reads from deer, moose, and sheep metagenomes, while the more numerous MAGs from Stewart et al. 4,5 classified more reads from bison and cattle metagenomes (Fig. 4a). The addition of MAGs improves classification relative to databases primarily based on cultured isolates, like the Hungate1000 Collection 3 (Fig. 4b). Using the combination of all reference and rumen-specific genomes, the median classification rate on an independent set of cattle metagenomes was 62.6%.  4,5 , and the current study as databases. The four genomic databases were utilized to classify reads independently (a) or used to incrementally build larger databases for classification (b). A database including rumen MAGs from the Stewart et al. 4,5 studies and the current study improved classifications rates for bison, cattle, deer, moose, sheep, and independent cattle metagenomes a median 33.3%, 42.1%, 40.9%, 40.1%, 45.0%, and 46.8% compared to a database of mainly isolate genomes from RefSeq and the Hungate1000 collection. The lines denote the median proportions of sample reads classified by the dataset or combinations of datasets.

Phylogenetic characterization of biosynthetic gene clusters
Microbial genome mining is a powerful tool for natural product discovery. We sought to explore the extent of secondary metabolite diversity coded by the MAGs in the current study, the  Table 5). As expected, the network analysis depicted high inter-and intra-phylum genetic diversity among the NRPS gene clusters. The median intra-phylum, -family, and -genus similarity was 0.40, 0.44, and 0.46, respectively, while the median inter-phylum, -family, and -genus similarity was 0.32, 0.34, and 0.34, respectively. Further, only 2.6% of edges were inter-phylum and 69.0% were intra-family. Of the 6,594 Euryarchaeota edges, 8.1% were Euryarchaeota-Firmicutes (median similarity of 0.32) and 2.0% of edges were Euryarchaeota-Bacteroidota (median similarity of 0.31).
To further examine the phylogenetic relationships of rumen Euryarchaeota NRPS, we clustered 265 NRPS gene clusters (≥10 kbp) from 85 near-complete Euryarchaeota genomes at a higher similarity threshold of 0.75, yielding 57 NRPS clusters (Fig. 5d). The distribution of NRPS clusters amongst the genomes suggests there exists a strong relationship between methanogen phylogeny and NRPS similarity. Only Methanobrevibacter genomes contain NRPS gene clusters, and genomes of the same species often possessed many of the same NRPS clusters (see genomes highlighted in blue in Fig. 5d). However, there are instances in which closely related methanogens code for a contrasting pattern of NRPS clusters or no NRPS clusters at all (see genomes highlighted in red in Fig. 5d).
We aligned previously published rumen metatranscriptome data from steers characterized as having high and low feed efficiency to the BGCs to demonstrate if the identified BGCs are active and to explore potential ecological roles of secondary metabolites. Despite data from the metatranscriptome study not being applied to reconstruct genomes in the current study, we identified the expression of 554 gene clusters from rumen-specific genomes in the 20 metatranscriptomes (≥100 aligned reads). Metatranscriptome read count data were normalized independently for each genome to better account for the variation in taxonomic composition across samples 29 . Genome-specific normalization resulted in the identification of 17 differentially expressed gene clusters between steers with high and low feed efficiency (DESeq2 30 false discovery rate adjusted P <0.05; Supplementary Table 6). Of the 17 differentially expressed BGCs, 16 exhibited higher expression levels in the rumen samples from less efficient steers with higher residual feed intake. Further, Prevotella and Selenomonas coded for 12 of the differentially expressed BGCs (70.6%). All of the differentially expressed Selenomonas BGCs were sactipeptides (n = 7), while the Prevotella BGCs were more diverse and included NRPS and aryl polyenes.  4,5 , and the current study. b) Phylogenomic analysis of 1,766 near-complete Firmicutes genomes inferred from the concatenation of phylogenetically informative proteins. The inner layer surrounding the genomic tree designates taxonomic annotations, while the remaining layers depict the log normalized number of BGCs in the genome with the ascribed function. Bacterial class and order labels are displayed for those lineages in which more than 50 genomes were identified. Near-complete genomes were defined as being ≥90% complete, having ≤5% contamination, and contig N50 ≥15 kbp. The phylogenetic tree is provided as Supplementary File 5. c) A relational network of NRPS gene cluster similarity in Firmicutes, Bacteroidota, and Euryarchaeota. Edge weight represents the similarity of two BGCs, as determined by BiG-SCAPE. Edges are only shown for BGCs with ≥0.3 BiG-SCAPE similarity. Nodes from each phylum are duplicated to illustrate intra-phylum relationships and nodes along a given axis are ordered alphabetically by taxonomic family. d) The association between genome phylogeny and the similarity of NRPS gene clusters coded by near-complete Euryarchaeota genomes. BGCs designated as NRPS were clustered with BiG-SCAPE. The relationship between NRPS clusters was portrayed through the hierarchical clustering of pairwise inter-cluster similarities. The number of NRPS clusters coded by each genome (range of 0-3) is presented alongside the assigned genus. A group of Methanobrevibacter genomes, likely of the same species (≥95% ANI), possessed very similar NRPS clusters (highlighted in blue). Yet, phylogenetically closely related genomes, belonging to two different 95% ANI clusters, did not code for any identified NRPS gene clusters (highlighted in red). The phylogenetic tree is based on the concatenation of 122 phylogenetically informative archaeal proteins and is available as Supplementary File 6.

Microdiversity of BGCs and MAGs
Phylogenetic analyses of BGC often revealed high inter-species diversity (i.e, methanogen NRPS in Fig. 5d). We next investigated patterns of sub-species microdiversity in rumen BGCs. In order to reduce the influence of study-to-study effects, we focused on the microdiversity of MAGs across 282 metagenomes in the Stewart et al. studies 4,5 . MAGs with ≥50% of its genome covered by at least 5 reads were considered as detected in a sample and used for microdiversity analyses. The withinsample microdiversity of genes and genomes were assessed using InStrain 31 . Our phylogenetic analysis identified that different classes of BGCs are enriched in certain lineages ( Fig. 5a and Fig.   5b). As a result, the nucleotide diversity values for genes were normalized using the mean genomewide nucleotide diversity for each MAG to account for lineage-specific evolutionary processes and more accurately compare patterns of microdiversity in BGCs across lineages. There were significant differences in the nucleotide diversity of genes from the four major classes of BGCs identified in rumen-specific genomes (Kruskal-Wallis H = 1795.5, ε 2 = 0.001, P <2.2 × 10 −16 ; Fig. 6a), but the effect size (ε 2 ) between BGC types was negligible. Outliers with high microdiversity were bacteriocin genes from RC9 and UBA3207 sp. as well as NRPS genes from CAG-710 and  Fig. 6c) were both significantly different between the four breeds. The effect size (ε 2 ) of microdiversity difference between breeds was much larger for the genome-wide comparison than for genes from BGCs. This finding raised the question if genes from BGCs have different nucleotide diversity relative to other genes. We found that genes across all BGCs had lower normalized nucleotide diversity compared to all other genes from investigated MAGs (Wilcoxon rank-sum W = 6.11 × 10 13 , Vargha and Delaney's A = 0.507, P <2.2 × 10 −16 ; Fig. 6d). The raw nucleotide diversity values were higher for genes in BGCs than other genes (Wilcoxon rank-sum W = 5.801 × 10 13 , Vargha and Delaney's A = 0.481, P <2.2 × 10 −16 ). Regardless, again we find the effect size of the difference to be very small though.
Together, microdiversity analyses suggest rumen microbial BGC diversity is comparable across the prevalent BGC classes, breeds, and similar to other genes.  4,5 . The within-sample nucleotide diversity of BGCs was statistically different between BGC types, but the effect size of the difference was small (ε 2 = 0.001) (a). The difference in nucleotide diversity across breeds was greater for MAGs (ε 2 = 0.0265) (b) than for genes in BGCs (ε 2 = 0.0003) (c). Additionally, the effect size of the difference between the normalized nucleotide diversity of genes from BGCs and other genes was small (Vargha and Delaney's A = 0.507) (d). Genome-wide normalized nucleotide diversity is the nucleotide diversity of a gene relative to the mean nucleotide diversity of the MAG. The genome-wide normalized nucleotide diversity metric was used to reduce the influence of lineage-specific evolutionary processes, allowing for a more accurate comparison of gene nucleotide diversity across microbial populations. The same conclusions were identified using the raw nucleotide diversity in the place of genome-wide normalized nucleotide diversity. The outlier points have been removed from the boxplots for clarity.

Discussion
Ruminant agriculture is critical to the global food system. However, with land constraints and associated environmental impacts, ruminant production systems will need to become more efficient and sustainable to feed a growing population. Due to the importance of microbial processes in ruminant health and production, rumen microbes are central to nearly all aspects of ruminant agriculture 32 . Actionable insights into the roles of rumen microbes have lagged though, partly due to a lack of genomic references that underpin analyses and contextualize community data.
We reconstructed 2,809 metagenome-assembled population genomes from several ruminant species to advance our understanding of the structure-function relationship of the rumen microbial ecosystem. Nearly half of the MAGs are estimated to be ≥90% complete with minimal contamination.
Based on pairwise ANI comparisons, the MAGs in this study constitute approximately 2,024 species (95% ANI clusters), greatly expanding the genomic representation of rumen microbial lineages.
Moreover, clustering the genomic data reported in this study with genomes from the Hungate1000 Collection 3 and Stewart et al. studies 4,5 suggest there are at least 3,541 rumen microbial species with a draft reference genome now. It is worth emphasizing that some of the MAGs reported in this study may have been reported as part of metagenomic binning efforts in previous studies (Supplementary Table 1). However, the aggregating of data from multiple studies and contrasting assembly and binning approaches implemented in the current study may have yielded improved bins.
Approximately one-third of the resolved MAGs did not have a species-level representative in the compared genomic databases. Among the fraction of genomes that did exhibit high similarity, only Selenomonas populations are often linked to feed efficiency. Our approach using genome-resolved metagenomics and organism-specific normalization suggests secondary metabolites may play a role in this association. Further, the findings fit the emerging hypothesis that inefficient cattle have higher microbial diversity and produce a broader range of less usable metabolites for the animal's energy needs 38,39 .
Inter-species diversity of BGCs appeared to be high in the rumen, while sub-species microdiversity analyses suggest strain-level BGC diversity may be more constant across samples. The majority of genes within BGCs had similar nucleotide diversity as other genes, with a few outliers that displayed very high diversity. We know little regarding the relationship between genetic and functional diversity of BGCs in the rumen. As such, future work may focus on obtaining a better understanding of the evolutionary processes shaping the microdiversity patterns of BGCs. The mean genome-wide nucleotide diversity of sub-species MAGs was more different across breeds than it was for genes of BGCs, suggesting host genetics may influence microdiversity.
In this study, we have provided a phylogenomic characterization of rumen-specific genomes that may serve as a foundation for future in silico and laboratory experiments to better explore the rumen as a source for alternative peptides and metabolites to modulate rumen fermentation. The genomes reported here and in other recent genetic explorations of the rumen microbiome appear to only provide a glimpse into rumen microbial diversity. Moving forward, we anticipate using the combination of cultured and uncultured genomes to populate a bottom-up systems biology framework that advances towards mechanistic understandings and modeling dynamics of the rumen microbial ecosystem.

Rumen metagenomic datasets
We used 435 metagenomes for assembly and metagenomic binning (Supplementary Table 1). Rumen metagenomic studies with sufficient depth and quality were identified from the Sequence Read Archive, European Nucleotide Archive, and MG-RAST in early 2018. All publicly available metagenomes were sequenced on Illumina next-generation sequencing platforms. The remaining metagenomic datasets were previously unpublished.
The first two unpublished metagenomic datasets were from an 84-day growing study utilizing 120 steers and subsequent 125-day finishing study with 60 steers at the University of Nebraska Agriculture Research and Development Center, as described previously 40 Table 1).

Assembly and metagenomic binning
Paired-end and single-end sequences from each sample were assembled independently with MEGAHIT (version 1.1.1; parameters: -min-contig-len 1000, -k-min 27, -k-step 10) 44 . No co-assemblies were performed. We applied a maximum k-mer size of 87 (-k-max 87) for samples in which the longest read length was ≤100 bp. For samples with longer read lengths, we employed a maximum k-mer size of 127 (-k-max 127). The single-sample assemblies were input for both single-sample and multi-sample binning strategies with MetaBAT followed by re-assembly and dereplication 45 . Reads from each sample were mapped to assembled contigs (minimap2, parameters: -ax sr) 46 . The resulting alignments were used to bin contigs with a minimum length of 2,000 bp for single-sample binning and 2,500 for multi-sample binning strategies. Due to the total size of the collected datasets, the multi-sample binning was conducted independently for cattle (335 metagenomes) and other ruminant metagenomic datasets (100 metagenomes). Estimates of the completeness and contamination of the resulting bins were assessed using the lineage-specific workflow (lineage wf) of CheckM (version 1.0.11) 47 . Bins ≥50% complete were re-assembled with SPAdes (version 3.13.0; -careful parameter) 48 . MAGs stemming from the single-sample binning pipeline were re-assembled only with reads from that same sample. MAGs reconstructed through multi-sample binning were re-assembled from the sample with the most reads aligning to the bin and from all reads aligning to the bin. The quality of re-assembled bins was assessed with score, where Genome Quality = Completeness − (5 · Contamination) + (Contamination · (Strain Heterogeneity / 100)) + 0.5 · (log(N50) 19 . Contigs with divergent genomic properties (GC content and tetranucleotide frequency) were identified and removed with RefineM to reduce genome bin contamination 22 . Refined genomes from single-sample and multi-sample binning strategies were pooled and dereplicated with dRep at a threshold of 99% ANI 19 . Genomes meeting the following thresholds were retained: dRep quality score ≥60; N50 ≥5 kbp; ≤500 contigs; genome size ≥500 kbp; CheckM contamination estimate ≤10%; and CheckM completeness estimate ≥75%. Nearcomplete genomes were defined as MAGs with CheckM completeness estimate ≥90%, CheckM contamination estimate ≤5%, and N50 ≥15 kbp.  Table 2).

Inference of genome trees
Phylogenetic trees were inferred with near-complete genomes (CheckM completeness estimate ≥90%, CheckM contamination estimate ≤5%, and N50 ≥15 kbp) using the GTDB-Tk (default parameters for the identify, align, and infer commands). Anvi'o was used to visualize the resulting Newick trees and associated metadata (version 5.5) 50 . We estimated how well a MAG was represented in a sample by calculating the percent of a MAG's bases with at least 1X coverage in the sample. The mean number of bases in a MAG with at least 1X coverage is presented for each metagenomic study and was used to compute the hierarchical clustering of rumen metagenomic datasets (Euclidean distance and Ward linkage).
Similarity of reconstructed MAGs to GTDB reference genomes, the Hungate1000 Collection, and rumen-specific MAGs Recent analyses support a 95% ANI threshold to delineate microbial species 51,52 . The ANI values of MAGs from the current study and genomes from the GTDB (a curated and dereplicated collection of 22,441 genomes in the GTDB-Tk FastANI database 21 ), Hungate1000 project 3 , and Stewart et al. 4,5 were compared in a pairwise fashion with FastANI (version 1.1) 51 . Genome pairs with ≥95% ANI were denoted as overlapping species between the datasets. We visualized the number of overlapping genomes between each pair of datasets with UpSetR 53, 54 . Additionally, genomes from the current study, the Hungate1000 project 3 , and Stewart et al. 4,5 were clustered at 95% ANI thresholds with dRep 19 to approximate the number of microbial species represented across the rumen genomic collections. The number of genomes belonging to each 95% ANI cluster was used to calculate rarefaction curves in which cluster counts were subsampled without replacement at steps of 500 genomes with 10 replications at each step (QIIME version 1.9) 55 .
The average genome size of reconstructed MAGs was smaller than was observed in the Hungate1000 Collection. In order to provide a better comparison of genome sizes across similar species, we evaluated the adjusted genome sizes of MAGs and Hunagte1000 Collection genomes that belonged to the same 95% ANI cluster based on Pearson correlation and linear regression, where Adjusted Genome Size = Genome Size / (Completeness + Contamination).

Classification of Metagenomic Reads
Reads from the 435 rumen metagenomes used to assemble MAGs and reads from 16 samples of an independent cattle metagenomic dataset 25 not used in binning were classified with different databases to assess the value of the reconstructed MAGs to improve metagenomic read classification.
Reads were classified with Kraken2 (version 2.0.7; default parameters) 24 using a combination of the Kraken2 standard database containing bacterial, archaeal, fungal, and protozoa RefSeq genomes, MAGs from the current study.

Phylogenetic analysis of biosynthetic gene clusters
BGCs were identified within MAGs, the Hungate1000 collection 3 , and the Stewart et al. MAGs 4,5 using antiSMASH 4.0) 26 . A network was constructed based on the BiG-SCAPE calculated distances between two BGCs (version "20190604") 28 . In short, BiG-SCAPE combines three approaches to measure the similarity of BGC pairs: 1) the Jaccard Index, which measures the percentage of shared domain types; 2) the Domain Sequence Similarity index that takes into account differences in Pfam domain copy number and sequence identity; 3) the Adjacency Index, a measure of the pairs of adjacent domains that are shared between BGCs. The raw BiG-SCAPE distances were converted to similarities for all analyses. Only NRPS ≥10 kbp (71.6% were ≥10 kbp) were evaluated and the network analysis was limited to Bacteroidota, Firmicutes, and Euryarchaeota phyla because these three phyla coded for 96.4% of NRPS gene clusters. Two BGCs (nodes in the network) were connected with an edge if the pairwise similarity was ≥0.3. We visualized the network as a hive plot with the R tidygraph package to demonstrate the inter-and intra-phylum diversity of NRPS BGCs.
Nodes on an axis were ordered by the family of the genome coding the NRPS. Archaeal NRPS were further evaluated by placing BGCs into clusters based on a BiG-SCAPE glocal similarity threshold of 0.75. The distance between clusters was calculated as the mean pairwise similarity between the BGCs of two clusters. The resulting distance matrix was clustered with hierarchical clustering to produce a Newick tree (Euclidean distance and Ward linkage). The number of NRPS from near-complete archaeal genomes (CheckM completeness estimate ≥90%, CheckM contamination estimate ≤5%, and N50 ≥15 kbp) that belong to each BiG-SCAPE cluster were tabulated and visualized alongside a phylogenetic tree inferred with the GTDB-Tk (default parameters for the identify, align, and infer commands) 21,47 . The data were visualized with Anvi'o (version 5.5) 50 .
Rumen metatranscriptomic data 56 sequenced from steers with high (10 samples) and low (10 samples) residual feed intake were used to assess the expression of rumen microbial BGCs. ORF abundances for all rumen genomes were quantified with kallisto (version 0.45.0; default parameters) 57 .
Kallisto generates pseudo-alignments based on exact k-mer matches. Differences in expression may be attributed to both variations in organism abundance and changes in microbial behavior under different conditions. Taxon-specific scaling of count data should reduce the influence of taxonomic composition changes 29 . Thus, to account for variations in taxonomic composition, count data for each genome were first partitioned and normalized separately with DESeq2 (version 1.24.0) 30 .
The genome-specific normalization factors were used to scale raw BGC abundances from the same genome. Normalized BGC counts from each genome were re-combined to identify differentially expressed clusters between steers with high and low feed efficiency with DESeq2 30 . Only genomes with at least one read in all 20 samples (6,630 genomes) and BGCs with a minimum count of 100 reads were included in the analysis (648 BGCs).
Microdiversity analyses were carried out with InStrain (version 1.2.4) 31 . Reads from the 282 Illumina metagenomes described in Stewart et al. were mapped to the 4,941 MAGs previously recovered 4, 5 . MAGs with an unmasked breadth ≥0.5 (i.e., ≥50% of the genome has 5X coverage) in a sample were considered to be present in that sample. That is, only genes from detected MAGs were used in subsequent analyses. Of the 4,941 MAGs, 2,926 had an unmasked breadth ≥0.5 in at least one sample. Further, genes were only considered present if they had ≥5X coverage in a sample in which the MAG was detected. The profile module of InStrain calculates the nucleotide diversity of scaffolds within a given sample. InStrain can use this profile to calculate the nucleotide diversity of genes (profile genes module) and the mean nucleotide diversity of the genome (genome wide module). The nucleotide diversity of detected genes was normalized based on the mean genome-wide microdiversity of the MAG (gene nucleotide diversity / genome-wide microdiversity) to reduce lineage-specific effects when comparing the microdiversity of BGCs. The normalized gene nucleotide diversity represents the nucleotide diversity of the gene relative to the nucleotide diversity of the rest of the genome. Statistical differences were assessed with Kruskal-Wallis and Wilcoxon rank-sum tests. All statistical comparisons were also carried out using raw nucleotide diversity values.

Data availability
The accessions for all metagenomes analyzed are available in Supplementary Table 1