A global survey of specialized metabolic diversity encoded in bacterial genomes

Bacterial secondary metabolites have been studied for decades for their usefulness as drugs, such as antibiotics. However, the identification of new structures has been decelerating, while multi-resistant pathogens continue to emerge. It is unclear how much chemical diversity exists in Nature and whether discovery efforts should be focused on established antibiotic producers or rather on understudied taxa. Here, we surveyed around 170,000 bacterial genomes and several thousand of Metagenome Assembled Genomes (MAGs). Our results indicate that only 3% of the genomic potential for natural products has been experimentally discovered. Our analysis connects the emergence of most biosynthetic diversity in evolutionary history close to the taxonomic rank of genus, identifies Streptomyces as by far the most biosynthetically diverse, followed by producers from genera such as Amycolatopsis, Kutzneria and Micromonospora, and highlights multiple promising high-producing taxa that have thus far been less investigated, such as Weeksellaceae (Bacteroidota), Myxococcaceae (Myxococcota), Pleurocapsa, Nostocaceae (Cyanobacteria). Background Secondary metabolites (also called specialized metabolites) are biomolecules that are not essential for life but rather offer specific ecological or physiological advantages to their producers allowing them to thrive in particular niches. These Natural Products (NPs) are more chemically diverse than the molecules of primary metabolism, varying in both structure and mode of action among different organisms1. Historically, microbial NPs and their derivatives have contributed and continue to contribute a substantial part of new chemical entities brought to the clinic, especially as anticancer compounds and antibiotics2–6. Regrettably, the emergence of antibiotic-resistant pathogens5,7–10 concomitant to a stagnation of antimicrobial discovery pipelines3,6,10 is leading to a global public health crisis5,7–10. Nonetheless, genomics-based approaches to NP discovery11–13 have revealed a largely untapped and much more diverse source of biosynthetic potential within genomes5,14,15. These findings were possible following the discovery that bacterial genes encoding the biosynthesis of secondary metabolites are usually located in close proximity to each other, forming recognizable Biosynthetic Gene Clusters (BGCs). However, while the numbers and kinds of BGCs clearly differ across microbial genomes14,16 and metabolomic data indicate that some biosynthetic pathways are unique to specific taxa17, a systematic analysis of the taxonomic distribution of BGCs has not yet been performed. Similarly, while useful estimates of the chemical diversity of specific taxa have been provided16, methodical comparisons across taxa are lacking. Because of this, the scientific community appears undecided on the best strategy for natural product discovery: should the established known NP producers be studied further or should the community be investigating underexplored taxa14,18?


Background
Secondary metabolites (also called specialized metabolites) are biomolecules that are not essential for life but rather offer specific ecological or physiological advantages to their producers allowing them to thrive in particular niches. These Natural Products (NPs) are more chemically diverse than the molecules of primary metabolism, varying in both structure and mode of action among different organisms 1 . Historically, microbial NPs and their derivatives have contributed and continue to contribute a substantial part of new chemical entities brought to the clinic, especially as anticancer compounds and antibiotics [2][3][4][5][6] . Regrettably, the emergence of antibiotic-resistant pathogens 5,7-10 concomitant to a stagnation of antimicrobial discovery pipelines 3,6,10 is leading to a global public health crisis 5,[7][8][9][10] .
Nonetheless, genomics-based approaches to NP discovery [11][12][13] have revealed a largely untapped and much more diverse source of biosynthetic potential within genomes 5,14,15 . These findings were possible following the discovery that bacterial genes encoding the biosynthesis of secondary metabolites are usually located in close proximity to each other, forming recognizable Biosynthetic Gene Clusters (BGCs). However, while the numbers and kinds of BGCs clearly differ across microbial genomes 14,16 and metabolomic data indicate that some biosynthetic pathways are unique to specific taxa 17 , a systematic analysis of the taxonomic distribution of BGCs has not yet been performed. Similarly, while useful estimates of the chemical diversity of specific taxa have been provided 16 , methodical comparisons across taxa are lacking. Because of this, the scientific community appears undecided on the best strategy for natural product discovery: should the established known NP producers be studied further or should the community be investigating underexplored taxa 14,18 ?
Here, we harnessed recent advances in computational genomic analysis of BGCs to survey the enormous amount of genome data accumulated by the scientific community so far. Using a global approach based on more than 170,000 publicly available genomes, we created a comprehensive overview of the biosynthetic diversity found across the entire bacterial kingdom. We clustered 1,185,995 BGCs into 62,449 Gene Cluster Families (GCFs), and calibrated the granularity of the clustering to make it directly comparable to chemical classes as defined in NP Atlas 19 . This facilitated an analysis of the variance of diversity across major taxonomic ranks, which showed the genus rank to be the most appropriate to compare biosynthetic diversity across homogeneous groups. This finding allowed us to conduct comparisons within the bacterial kingdom. Evident patterns emerged from our analysis, revealing popular taxa as prominent sources of both actual and potential biosynthetic diversity, and multiple yet uncommon taxa as promising producers.

Biosynthetic diversity of the bacterial kingdom
To assess the global number of Gene Cluster Families found in sequenced bacterial strains, we ran AntiSMASH on~170,000 genomes from the NCBI RefSeq database 20 (Supplementary Table 1), spanning 48 bacterial phyla containing 464 families (according to the Genome Taxonomy DataBase classification -GTDB 21 ). We also included almost 50,000 bacterial Metagenome Assembled Genomes (MAGs) from 6 metagenomic projects of various origins [22][23][24][25][26][27] (Table 1 and Supplementary Table 1). To accurately group similar BGCs -which likely encode pathways towards the production of similar compounds -into Gene Cluster Families (GCFs) across such a large dataset, we used a slightly modified version of the BiG-SLiCE tool 28 , which has been calibrated to output GCFs that match the grouping of known compounds in the NP Atlas database 19 (see Methods: Quantification of biosynthetic diversity with BiG-SLiCE). The resulting GCFs were then used to measure biosynthetic diversity across taxa. Table 1. Overview of input datasets and overall biosynthetic diversity under various BiG-SLiCE thresholds. The "Complete Dataset" was used for the computation of the actual and potential biosynthetic diversity found in all cultured (and some uncultured) bacteria. The dataset "RefSeq bacteria with known species taxonomy" was used for pinpointing the emergence of biosynthetic diversity, for which accurate taxonomic information was needed, and for identifying groups of promising producers. The "T"s under Gene Cluster Families represent different BiG-SLiCE thresholds; BGC to GCF assignment for each threshold can be found in Supplementary Tables 2-5 The number of GCFs in RefSeq ranged from 19,152 to 51,052 depending on the threshold used by BiG-SLiCE (Table 1). While, as expected, the pure numbers of the analysis changed based on the threshold, the overall tendencies observed remained the same (Figure 1a, Supplementary Figure 1). The effect that the chosen threshold has on these results presented a challenge to our investigation, as previous estimations have also shown great heterogeneity when different thresholds were used 14,16 , precluding direct comparisons of their predictions. As each BGC can be considered a proxy for its encoded pathways and their products, differing thresholds will result in different degrees of granularity in the grouping of compound structures (Supplementary Figure 2). Nevertheless, linear relationships are not always applicable, as shown previously 29 , and a specific threshold will need to be set anyway to make comparisons possible. For this, we sought to directly relate the choice of our BGC clustering threshold to the clustering of their compound structures. NPAtlas, a database of known microbial small molecules, provides hierarchical clustering of the compound structures via Morgan fingerprinting and Dice similarity scoring 19 . As many as 947 compounds in NPAtlas are mapped to a known BGC in MIBiG repository 30 , giving us the opportunity to use them as an anchor for choosing our clustering threshold. After mapping the BiG-SLiCE groupings of known BGCs from the MIBiG to the compound clusters in NPAtlas (Supplementary Figure  3), we chose a threshold of 0.4, as it provided the most congruent agreements between the two groupings, with v-score=0.94 (out of 1.00) and ΔGCF=-17.  This calibration of thresholds of GCFs to families of chemical structures allowed us to perform a rarefaction analysis to assess how genomically encoded biochemical diversity (expressed as the number of distinct GCFs) increases with the number of sequenced and screened genomes (Figure 1b). The curve appears far from saturated, while the slope is steeper still if the bacterial MAGs are included in the analysis. When compared to the number of chemical classes documented in the NPAtlas 19 database (Figure 1b), it appears that, to date, only about 3% of the kingdom's biosynthetic diversity has been experimentally accessed.
In an attempt to evaluate the potential contribution of metagenomic data to Natural Product (NP) discovery, we studied how many of the GCFs found in the MAGs datasets were unique to this dataset ( Figure 1c). Around 53,4% of GCFs in the MAGs were not found in the RefSeq strains or in the Minimum Information about a Biosynthetic Gene cluster database (MIBiG 30 ). Paradoxically, in Figure 1b Table 7).The latter finding is concordant with recent proof that most genes have a strong biogeography signal 31 .

Genus as the most appropriate taxonomic rank to compare biosynthetic diversity
To identify the most promising bacterial producers, it is important to compare them at a specific taxonomic level. Several studies indicate that there is significant discontinuity in how BGCs are distributed across taxonomy: 'lower' taxonomic ranks like species within a genus carry more similar biosynthetic diversity, than 'higher' taxonomic ranks like phyla within a kingdom. To assess which taxonomic rank is the most appropriate to evaluate biosynthetic potential, we aimed to determine up to which taxonomic level the biosynthetic diversity remains homogeneous within that taxon. For this analysis, from our initial dataset, we left out the MAGs and only used the RefSeq bacterial strains as taxonomic assignment up to species rank (based on GTDB 21 ) was available only for the latter dataset (Table 1).
We first decorated the GTDB 21 bacterial tree with GCF values from the BiG-SLiCE analysis (Figure 2a), revealing the biosynthetic diversity found within currently sequenced genomes at the phylum rank. It immediately stood out that biosynthetic diversity was differently dispersed among the bacterial phyla, in accordance with published data 14,32 . As expected for known NP producers, the phyla Proteobacteria and Actinobacteria appeared particularly diverse 16,33,34 . However, these phyla are amongst the most studied and therefore the most sequenced 16,33,34 , a bias that was addressed later in the study.
Next, we examined whether the diversity of each phylum contributed to the domain's total diversity, or if there was overlap among them. For this reason, we depicted the number of unique GCFs within each phylum, as well as the pairwise overlaps ( Figure  2b). In most phyla, the vast majority (on average 73.81 ± 20.35%) of their GCFs appeared to be unique to them and not found anywhere else. This is coherent with the fact that HGT events, although relatively frequent for BGCs 35 , are much more common among closely related taxa [36][37][38][39] . Once we obtained information on the diversity of different phyla, as well as the rest of the major taxonomic ranks (classes, orders, families, genera, species), we proceeded to determine from which taxonomic rank biosynthetic diversity levels no longer show high variability. Therefore, we conducted a variance analysis that included each taxonomic rank, from phylum to species. For each rank, the variance value was computed based on the #GCFs values of immediately lower-ranked taxa (see Methods: Variance Analysis). The distribution of these variance values for each rank is visualized in Figure 3a.
There is a noticeable drop in the range of variance values for each rank, while diversity becomes highly homogeneous at the species level (Figures 3a,b). The plunge is most striking from the family to the genus level ( Figure 3a), with even the outliers all falling under the 10 3 -line in the genus rank. Different species within a genus are likely to display uniform biosynthetic diversity, while much dissimilarity is observed between different genera belonging to the same family ( Figure 3b). Additional statistical analysis confirmed the significance of this observation (Supplementary Figure 5) thus pinpointing, for the first time, the genus rank as the most appropriate for comparative analyses.  (2), Streptomycetaceae genera (3), top 50 most diverse Streptomyces species (4). The difference in variance is visible in the graphs 1,2,3, but becomes homogeneous at the species level as is shown in graph 4.

Well-known as well as overlooked taxa as sources of biosynthetic diversity
The identification of the genus level as the most informative rank to measure biosynthetic diversity across taxonomy paved the way for a comprehensive comparative analysis of biosynthetic potential across the bacterial tree of life. However, to be able to systematically compare diversity values among groups, said groups need to be uniform. In this case, a common phylogenetic metric was necessary. We chose Relative Evolutionary Divergence (RED) and a specific threshold that was based on the GTDB's range of RED values for the genus rank 21 to define REDgroups: groups of bacteria analogous to genera but characterized by equal evolutionary distance (see Methods: Definition of REDgroups). Our classification revealed the inequalities in within-taxon phylogenetic similarities among the genera, with some being divided into multiple REDgroups (for example the Streptomyces genus was split into 21 REDgroups: Streptomyces_RG1, Streptomyces_RG2 etc.) and some being joined together with other genera to form mixed REDgroups (for example Burkholderiaceae_mixed_RG1 includes the genera Paraburkholderia, Paraburkholderia_A, Paraburkholderia_B, Burkholderia, Paraburkholderia_E and Caballeronia). This disparity among the genera reaffirmed the importance of defining the REDgroups as a technique that allowed for fair comparisons among bacterial producers.
The resulting 3,779 REDgroups showed huge differences in biosynthetic diversity as measured by the numbers of GCFs found in genomes sequenced from these groups so far, with the maximum diversity at 3,339 GCFs, average at 17 GCFs and minimum at 1 GCF. Nevertheless, the variance of diversity within the REDgroups was even more uniform than in the genera (Supplementary Figure 6). Some of the top groups (Supplementary Table 8) included known rich NP producers, such as Streptomyces, Pseudomonas_E and Nocardia 30,33,34,43 .
Although very informative, this analysis is biased because of large differences in the number of sequenced strains among the groups, with the economically or medically important strains having been sequenced more systematically than others. To overcome this bias, rarefaction analyses were conducted for each REDgroup (Figure  4b, Supplementary Table 8), as performed in previous studies 44,45 . Additionally, to examine how effectively this method overcomes the sequencing bias, a random sampling approach was taken (see Methods: Random sampling), which showed comparable results to the original analysis (Supplementary Table 9). With all the information on REDgroups, and in order to provide a global overview of the actual biosynthetic diversity and the potential number of GCFs, we modified and complemented the bacterial tree from Parks et. al. 21 , as shown in Figure 4a. The dispersion of these values across the various phyla can also be seen, with the exceptional outliers standing out: Streptomyces_RG1, Streptomyces_RG2, Amycolatopsis_RG1, Kutzneria, and Micromonospora. All these are groups known for their NP producers 16,33,34,46 and they remain in the top (Extended Data Table 1,  Supplementary Table 1), seemingly having much unexplored biosynthetic potential.
To ensure that our conclusions are not the product of algorithmic artifacts, we reran the analysis using an alternative method of quantifying biosynthetic diversity, which was developed independently, yet for the same purpose. This alternative approach, called clust-o-matic, is based on a sequence similarity all-versus-all distance matrix of BGCs and subsequent agglomerative hierarchical clustering in order to form GCFs (see Methods: Random sampling). Like for BiG-SLiCE, we calibrated the threshold for clust-o-matic based on NP Atlas clusters. When comparing the results (Figure 4c,d, Supplementary Table 8), despite slight differences in absolute numbers, the two algorithms appeared to identify very similar trends. Streptomyces, even when split into multiple REDgroups, is in the top groups both based on the known biosynthetic diversity and based on the estimated potential values. 5,908 (+103 Streptomyces_B, +39 Streptomyces_C, +16 Streptomyces_D) GCFs appear to be unique to the group, even among other phyla (Figure 5a). This is in agreement with previous studies investigating how much overlap there is among the main groups of producers 47 . What is more, streptomycetes appear to be the source of a good percentage of the biosynthetic diversity attributed to the Actinobacteria phylum, as seen in Figure 5b.
However, taxa less popular for NP discovery also show promise, as was evidenced by a comparison of our results with data from the NPASS database of Natural Products 48 (Figure 5c). Among the 20 overall most promising REDgroups we found at least 6 groups that show promise but whose members are either not catalogued in the database as NP sources or are connected to few (<15) known compounds:

Discussion
Using two different independent algorithms, we mined the large amounts of available sequencing data to identify microbial Biosynthetic Gene Clusters (BGCs) and group them into gene cluster families (GCFs), making sure that they mirror chemical families of encoded compounds. We located the emergence of the highest biosynthetic diversity close to the genus rank and chose to further investigate analogous taxonomic groups (REDgroups). Rarefaction analysis identified the highest biosynthetic potential and the most promising bacterial taxa among many known diverse groups as well as multiple promising understudied producers. To the best of our knowledge, this is not only the largest investigation of this sort including all available genomes so far, as well as published Metagenome Assembled Genomes (MAGs) from diverse environments, but it also provides a reproducible pipeline to determine the metabolic diversity of comparable groups of bacteria and propounds a rationale for drug discovery efforts.
The full biosynthetic capacity of the bacterial kingdom has been assessed in a previous study by Cimermancic et al. 14 , already providing a glimpse into the huge untapped biosynthetic potential within bacterial genomes. However, besides their much smaller dataset (about 33,000 BGCs vs 1,185,995 BGCs here), they used ClusterFinder as an identification tool, a more exploratory software algorithm which can include false positive gene clusters. Projects that exploit publicly available genomic data are reliant on the quality of genomes sequenced as well as the efficiency of available genome mining methods, which -though potent as ever -still have some limitations 49 . For instance, the study of GCF uniqueness among taxa may be affected by antiSMASH's imperfect BGC boundary prediction 50 . Even though BiG-SLiCE converts BGCs into features based only on domains related to biosynthesis 28 , genomic context unrelated to the biosynthetic pathway of a BGC could still play a role in the GCF assignment; this issue cannot be fully addressed yet with currently available tools. However, antiSMASH's ability to discern cluster limits and detect BGCs from cultured strains and MAGs is comparable to alternative tools, while its ability to predict so many different BGC types is unparalleled 51 , as is apparent from its common use in Natural Product (NP) research 14,17,32,43,45,52,53 . What is more, the fact that it is rule-based 50 implies the possibility of undetected types of clusters and increases the likelihood that our calculations have underestimated the true biosynthetic potential of bacterial organisms, which could be even more impressive.
Furthermore, our pipeline was the first that used GTDB 21 taxonomy for studying bacterial biosynthetic diversity globally. This enabled us to avoid the established misclassifications of the NCBI taxonomic placement [54][55][56][57] . The use of rarefaction curves allowed us to infer the biosynthetic potential of bacterial groups as done in some smaller-scaled projects 14,16,44,45 . This method aims to enable fair comparisons among incomplete samples 58 . However, while overestimation is not expected to happen, for those groups that contain very few genomes, there is a tendency to underestimate their potential capacity 58 . Hence, sequencing bias of popular taxa still affects our results. We tried to minimize the bias within the pipeline as much as possible while retaining high diversity of bacterial taxa; therefore, we decided not to exclude REDgroups with very few members from the dataset. We also ran an additional random sampling analysis using the most populated REDgroups and confirmed the reproducibility of our results. Nonetheless, the remaining bias will only be eliminated with the inclusion of increased biodiversity in sequencing projects 24,27 .
Our analysis identified a plethora of unexplored taxonomic groups with high biosynthetic potential, rendering their characterization essential 9,17,18,59-63 . At the same time, it revealed that undiscovered biosynthetic diversity hides among the popular targets for NP discovery. Indicatively, multiple Proteobacteria taxa were identified among the top producers: Pseudomonas, Pseudoalteromonas, Paracoccus, Serratia among others. This is in accordance with the known biosynthetic potential of the Proteobacteria phylum 46 . Furthermore, even undersequenced phyla included promising groups, like the myxobacterial genera Cystobacter, Melittangium, Archangium, Vitiosangium, Sorangium and Myxococcus 17,43,64 , or the genera Chryseobacterium and Chryseobacterium_A [65][66][67] from the Bacteroidota phylum. However, the majority of most diverse groups comprise actinobacterial strains of well-known and well-studied NP producers from genera such as Actinoplanes, Amycolatopsis, Micromonospora, Mycobacterium, Nocardia and Streptomyces 16,33,34,60,68 . These bacteria produce the majority of Natural Products used as antibiotics 33,60 and our analysis confirms the notion that, contrary to some estimates, there is still much more natural product diversity to be discovered within this group as more diversified strains get sequenced 16,33,34 .
Streptomyces is a genus of the Actinobacteria phylum that contains some of the most complex bacteria that we know of, though by far not the most sequenced in our dataset (Supplementary Figure 9). These bacteria have been known as NP producers for a long time 68 , as single strains containing a high number of BGCs have been discovered, taking up to 10% of their genome 69 . However, members of other genera contain comparable absolute numbers of BGCs. This is the first time that a systematic comparison of the diversity of the encoded compounds within bacterial genera has been conducted, revealing how diverse Streptomyces are compared to all others 68 . The factors that cause this taxonomic group to stand out are not completely clear but probably related to their sophisticated lifestyle. Many observations suggest that NP biosynthesis drives speciation within the Streptomyces genus 16 . The exploration of factors that led to the rise of biosynthetic diversity in Streptomyces to such an impressive degree will be the subject of further investigations in the future.
Having the genomic capacity for the biosynthesis of secondary metabolites does not always herald the discovery of a new compound though 70,71 . Sometimes, the bacterium in question cannot be grown or BGCs are not expressed in laboratory conditions 33,61,63,70,71 . This issue is related to the complexity of BGCs -we have only just scratched the surface of their intricate regulation and connection to primary metabolism 12,61,70,72 . However, efforts to decode biosynthetic mechanisms for the activation of silent clusters need to be tailored to specific producer groups 33,34,71,73 , such as groups phylogenetically related to promising producers, e.g. members of the Pseudonocardiaceae family (Figure 4, Supplementary Table 8), partly on the grounds that each phylum has unique diversity (Figure 2b).
Original approaches to the prioritization issue of NP research continue to emerge, fueled by the advances in metagenomics and computational tools that enable the use of the biosynthetic potential of unculturable bacteria from environmental samples 74,75 . Furthermore, apart from the few metagenomic projects whose MAGs we incorporated in the first part of our analysis, there are multiple such projects publicly available, some of which have been the focus of NP studies 76,77 . Although the reconstruction of genomes from metagenomes remains a challenge [78][79][80] and the assembly will often miss BGCs [81][82][83][84] , which has indirectly prevented their comparison to the cultured bacteria in the current project, metagenomics is proving a promising source of information on NPs and their producers 14,47,61,74,76,85,86 , as made apparent in the present investigation. We expect the effect of this field on NP research to become more evident in the following years.
The collection of microbial data from increasingly new habitats points to another interesting aspect, namely the relation between the biome of origin of the producers and the uniqueness of their biosynthetic diversity. Although this connection has been investigated to some extent 31,32,45,46,63 drawing more definitive conclusions will require the use of a wider-scale dataset and the availability of more detailed and standardized metadata of producers' genomes.
Our analysis provides a global overview of diverse known and promising understudied NP-producing taxa. We expect this to greatly help overcome one of the main bottlenecks of Natural Product discovery: the prioritization of producers for research 74 .

BGC data set
We obtained 170,585 complete and draft bacterial genomes (Table 1) from RefSeq 20 on 27 March 2020. Furthermore, a dataset of 47,098 MAGs was included in the first part of the analysis (see Results: Biosynthetic diversity of the bacterial kingdom). For the rest of the study, we used only 161,290 RefSeq bacterial genomes whose taxonomic classification up to the species level was known ( Table 1). All genomes were analyzed with antiSMASH (version 5) 50 , which identified their BGCs (Extended Data Table 1, Supplementary Table 1).

Taxonomic classification
Due to multiple indications regarding a lack of accuracy of NCBI's taxonomic classification of bacterial genomes 54-57 , we chose to use the Genome Taxonomy Database (GTDB 21 ) instead. The bacterial tree of 120 concatenated proteins (GTDB release 89), as well as the classifications of organisms up to the species level, were included in the analysis.

Quantification of biosynthetic diversity with BiG-SLiCE
For a bacterium to be regarded as biosynthetically diverse, we considered not the number of BGCs important, but rather how different these BGCs are to each other. In order to quantify this diversity, we analyzed all BGCs with the new BiG-SLiCE tool 28 , which groups similar clusters into Gene Cluster Families (GCFs). However, the first version of this tool has an inherent bias towards multi-protein families BGCs, producing uneven coverage between BGCs of different classes (i.e., due to their lack of biosynthetic domain diversity, all lanthipeptide BGCs may be grouped together using the Euclidean threshold of T=900, which in contrast is ideal for clustering Type-I Polyketide BGCs). To alleviate this issue and provide a fair measurement of biosynthetic diversity between the taxa, we modified the original distance measurement by normalizing the BGC features under L^2-norm, which will produce a cosine-like distance when processed by the Euclidean-based BIRCH algorithm. This usage of cosine-like distance will virtually balance the measured distance between BGCs with "high" and "low" feature counts (Supplementary Figure 10a), in the end providing an improved clustering performance when measured using the reference data of manually-curated MIBiG GCFs (Supplementary Figure 10b).
The GTDB 21 (release 89) bacterial tree was pruned so that it included only the organisms that are part of our dataset. Then, having both the taxonomic classification of all bacteria, as well as how many GCFs their BGCs group into, the pruned GTDB tree was decorated with #GCFs values at each node. This allowed for the evaluation of the biosynthetic diversity of any clade, including the main taxonomic ranks. To pick a single threshold for subsequent taxonomy richness analysis, we compared BiG-SLiCE results on 947 MIBiG BGCs versus the compound-based clustering provided by the NPAtlas database 19 (Supplementary  Figure 3). A final threshold of T=0.4 was chosen based on its similarity to NPAtlas's compound clusters (V-score=0.9X, GCF counts difference=+XX).

Quantification of biosynthetic diversity with clust-o-matic
We aimed to repeat and evaluate the reproducibility of the BGC-to-GCF quantification step of BiG-SLiCE with an alternative, independently derived algorithm. For that instead of grouping BGCs into GCFs based on biosynthetic domain diversity, we developed an algorithm that considers full core biosynthetic genes. Biosynthetic gene clusters that were detected in the input data by antiSMASH 5.1 were parsed to deliver core biosynthetic protein sequences. Those protein sequences were subjected to all-against-all multi-gene sequence similarity search with DIAMOND 87 2.0 using default settings. Only one best hit per query core gene per BGC was allowed divided by a total core protein length, resulting in the final pairwise BGC score always being within range of 0 to 1. Pairwise BGC similarity scores were used to build a distance matrix that was later subjected to agglomerative hierarchical clustering in python programming language (package scipy.cluster.hierarchy). The same process as described in the paragraph above (for BiG-SLiCE in that case) was performed for identification of the most suitable threshold for the clust-o-matic algorithm. The determined optimal threshold of 0.5 was then used to generate GCFs, which were then fed into the next steps in parallel to the original set of GCFs obtained from BiG-SLiCE.

Biogeography Analysis
One 27 of the MAGs datasets was accompanied by sufficient metadata that allowed for a study of a potential connection between biosynthetic diversity patterns and the biomes of origin of the corresponding MAGs. The GCFs for each ecosystem type were collected by combining information from Supplementary Tables 1, 4 of this project and from the Nayfach paper 27 Supplementary Information. This led to the creation of Supplementary Table 7. Then, the largest occurring intersections were computed and visualised in Supplementary Figure 4 using the UpSet 88 visualisation technique.

Variance Analysis
In order to pinpoint the emergence of biosynthetic diversity, the within-taxon homogeneity was compared among the main taxonomic ranks. For each rank, the variance value was computed (with NumPy 89 ) based on the #GCFs values of immediately lower-ranked taxa, as long as there were at least two such taxa. For example, a phylum that includes only one class in our dataset was omitted from this computation. But a phylum with two or more classes would be assigned a variance value computed from its classes' #GCFs values. The distribution of these variance values was plotted for each rank in Figure 3a. We noticed a significant reduction in variance from the family to the genus rank, which was confirmed with an additional statistical test (Supplementary Figure 5). A similar variance analysis was performed to compare genera and REDgroups (Supplementary Figure 6) but in this case variance was calculated based on the strains' biosynthetic diversity.

Definition of REDgroups
To study the biosynthetic diversity of genera, we attempted to achieve uniform taxa. The creators of GTDB used Relative Evolutionary Divergence (RED) for taxonomic rank normalization 21 ; it is a metric that relies heavily on the branch length of a phylogenetic tree and is consequently dependent on the rooting. The GTDB developers provided us with a bacterial tree decorated with the average RED values of all plausible rootings at each node. Since GTDB accepts a range of RED values for each taxonomic rank placement 21 , we chose the median of GTDB genus RED values, namely 0.934, as a cutoff threshold. Any clade in the GTDB bacterial tree with an assigned RED value higher than the threshold was considered one group (Supplementary Figure 11) that we named "REDgroup". For REDgroup naming conventions, see Supplementary Figure 11.

Rarefaction analysis
The extrapolation of potential #GCFs values was achieved by conducting rarefaction analyses, by use of the iNEXT R package 90 . A GCF presence/absence table (GCF-by-strain matrix) was constructed for each group considered and was then used as "incidence-raw" data in the iNEXT main function, where 500 points were inter-or extrapolated with an endpoint of 5000 for the REDgroups, and of 1.6 million (about 8 times the number of strains in the Complete Dataset) in each group for the RefSeq analyses (where 2000 points were inter-or extrapolated). By default, the number of bootstrap replications is 50.

Random sampling
In order to test whether the above methods (creation of REDgroups and the subsequent rarefaction analyses) overcome the inherent sequencing bias in our dataset, a random sampling technique was used. A reduced dataset was tested that included only those REDgroups containing at least 20 members. For each REDgroup, a sample of 20 genomes was randomly chosen (using the Python "random" module), while preserving the species diversity of the group. The latter was achieved by ensuring that genomes belonging to as many species as possible are included in each sample; if all species of a REDgroup were included but the genomes were fewer than 20, the remaining "spots" were distributed evenly among a random sample of the REDgroup's species. One hundred iterations of this process were calculated for all REDgroups in this reduced dataset and rarefaction analyses were conducted for the random samples in each iteration. Finally, the average potential GCFs (pGCFs) value for each REDgroup from all iterations was calculated and reported in Supplementary Table 9.

Identification of unknown producers
We investigated the genera included in the most promising REDgroups, to find out whether they include species that are producers of known compounds. Hence, the species names were cross-referenced with the species named as producers in the NPASS depository 48 (accessed on 15 October 2020), taking care to match the GTDB-given names to the NCBI-given names that the database uses. Table 6 BGC IDs, MiBIG IDs, producer GTDB-based taxonomic information and GCF assignment (for BiG-SLiCE T=0.4) for all MiBIG BGCs included in the creation of Figure 1C.

Supplementary Table 7
Biogeography analysis of the Nayfach MAGs dataset 27 . Number of genomes, BGCs, GCFs and unique GCFs per ecosystem type (as defined in the corresponding paper 27 ).

Supplementary Table 8
REDgroup full metadata: Node IDs (can be used in the exploration of the tree in Supplementary Figure 7), labels, number of members, number of BGCs, number of GCFs and potential GCFs (pGCFs) as defined by BiG-SLiCE (T=0.4) and clust-o-matic (T=0.5), GTDB taxonomic information and number of products in the NPASS database whose producer is a member of the REDgroup (NPASS_hits).

Supplementary Table 9
Comparison of random sampling analysis results to original results. The table includes: Node IDs, labels, number of members, number of BGCs, number of GCFs and potential GCFs (pGCFs) as defined by BiG-SLiCE (T=0.4), the original ranking based on the pGCFs, the average pGCFs from all random sampling iterations, the ranking based on the random sampling and GTDB taxonomic information.
(Note to reviewers: These data in Supplementary Tables 1 to 5, which are especially large files, would be publicly released upon completion of peer review of this manuscript under the following link: https://doi.org/10.5281/zenodo.5159210)

Code availability
The clust-o-matic code is available here: https://github.com/Helmholtz-HIPS The modified BiG-SLiCE script (that accepts as input a regular BiG-SLiCE output folder, then outputs the GCF membership in a tsv file) can be found in this link: https://github.com/medema-group/bigslice/blob/master/misc/useful_scripts/perform_l 2norm_clustering.py     Kutzneria, E: Pseudomonas_E. Phyla known to be enriched in NP producers are immediately visible (Actinobacteriota, Protobacteriota), with the most promising groups coming from the Actinobacteriota phylum (the highest peak belongs to a REDgroup containing Streptomyces strains). Simultaneously, within the underexplored phyla, there seems to be significant biosynthetic diversity and potential. An interactive version of Figure 4a can be accessed online (Supplementary Figure 7). Panel b: Rarefaction curves of REDgroups (BiG-SLiCE T=0.4). The solid lines represent interpolated and actual data, while the dotted lines represent extrapolated data. The letters "L", "M" and "H" on the right edge of the figure correspond to Low-(0-389 pGCFs), Medium-(390-649 pGCFs) and High-diversity (more than 650 pGCFs) producers. The "L" range includes 3,737 REDgroups (shades of green), the "M" range includes 22 (shades of yellow/orange), while the "H" range includes 20 REDgroups (shades of red). The vast majority of REDgroups belong to the low-diversity producers (the mean of all REDgroups' pGCFs is 29, drawn in the "L" range of the graph). The genera included in the most promising REDgroups are indicated (the letters a-e correspond to the peaks in panel a. Streptomyces strains are included in several of the top most promising REDgroups. Panel c: Rarefaction curves of the most promising REDgroups (BiG-SLiCE T=0.4). The solid lines represent interpolated and actual data, while the dotted lines represent extrapolated data. Panel d: Rarefaction curves of the most promising REDgroups (clust-o-matic T=0.5). Again, the solid lines represent interpolated and actual data, while the dotted lines represent extrapolated data. Though the exact numbers differ, the similarities between the two methods are apparent.  41 . Each taxon has a distinct color. The smaller shapes and ribbons represent smaller phyla that can be seen in Supplementary Figure 8. The genus Streptomyces appears to have a very high amount of unique GCFs comparable to entire phyla, such as Proteobacteria. Panel b: Unique GCFs as defined by BiG-SLiCE (T=0.4), of non-streptomycete Actinobacteriota and all Streptomyces genera (solid shapes) and pairwise overlaps between Actinobacteriota and Streptomyces (ribbons), visualized with circlize 41 . The Streptomyces genus, only one of many belonging to the Actinobacteriota phylum, appears to be responsible for a big percentage of the phylum's unique diversity (see big pink ribbon). Panel c: Left: Potential (pGCFs) and actual (GCFs) number of Gene Cluster Families as defined by BiG-SLiCE (T=0.4), of top 20 most promising REDgroups. Right: number of Natural Products (NPs) found in the NPASS database 48 , that originate from species included in each REDgroup. The REDgroups with few (< 15) to no known NPs associated with them are marked with red stars on the right side of the graph. Several of the displayed groups are in the latter category: Amycolatopsis_RG1, Kutzneria, Xanthobacteraceae_mixed_RG1 (containing the genera Bradyrhizobium, Rhodopseudomonas, Tardiphaga and Nitrobacter), Mycolicibacterium_RG1, Nonomuraea, Kitasatospora_RG1. Tables   Table 1. Overview of input datasets and overall biosynthetic diversity under various BiG-SLiCE thresholds. The "Complete Dataset" was used for the computation of the actual and potential biosynthetic diversity found in all cultured (and some uncultured) bacteria. The dataset "RefSeq bacteria with known species taxonomy" was used for pinpointing the emergence of biosynthetic diversity, for which accurate taxonomic information was needed, and for identifying groups of promising producers. The "T"s under Gene Cluster Families represent different BiG-SLiCE thresholds; BGC to GCF assignment for each threshold can be found in Supplementary Tables 2-5. *MAG sources: bovine rumen 22