Mining for a New Class of Fungal Natural Products: The Evolution, Diversity, and Distribution of Isocyanide Synthase Biosynthetic Gene Clusters

The products of non-canonical isocyanide synthase (ICS) biosynthetic gene clusters (BGCs) have notable bioactivities that mediate pathogenesis, microbial competition, and metal-homeostasis through metal-associated chemistry. We sought to enable research into this class of compounds by characterizing the biosynthetic potential and evolutionary history of these BGCs across the Fungal Kingdom. We developed the first genome-mining pipeline to identify ICS BGCs, locating 3,800 ICS BGCs in 3,300 genomes. Genes in these clusters share promoter motifs and are maintained in contiguous groupings by natural selection. ICS BGCs are not evenly distributed across fungi, with evidence of gene-family expansions in several Ascomycete families. We show that the ICS dit1/2 gene cluster family (GCF), which was thought to only exist in yeast, is present in ~30% of all Ascomycetes, including many filamentous fungi. The evolutionary history of the dit GCF is marked by deep divergences and phylogenetic incompatibilities that raise questions about convergent evolution and suggest selection or horizontal gene transfers have shaped the evolution of this cluster in some yeast and dimorphic fungi. Our results create a roadmap for future research into ICS BGCs. We developed a website (www.isocyanides.fungi.wisc.edu) that facilitates the exploration, filtering, and downloading of all identified fungal ICS BGCs and GCFs.

The y-axis is a log-scale of the count, and the x-axis represents the number of genera each GCF was found in. The bars are colored by the number of genera a given GCF contained (see key).    **The dit GCF is characterized by the highly conserved dit1 (ICS) and dit2 (p450) genes.

Calculating the proportion of the fungal biosynthetic potential that ICS BGCs account for
After running antiSMASH v5 (1)

Identifying putative expansions and contractions
The species tree created in this publication (Fig. 2) was subset to only contain Ascomycota species and a single Basidiomycete outgroup (Amanita muscaria). A custom Python program was developed to identify potential expansions and contractions in the number of ICS BGCs between sister clades using independent t-tests (2). The program is designed to start its search from a node of choice provided by the user. Every branch point beneath said node is then explored. For each branch, the program extracts the number of ICS BGCs in the sister clades and conducts a t-test to detect significant differences below the user-defined threshold e-value. The program then generates an output file containing the t-test results, and produces visualizations of the expansions and contractions as can be seen in Fig. S6. The script to run this Python pipeline is included in the publications Reproducible Scripts file, and the data from the t-tests on ICS BGCs along with the tree are available on the publication's GitHub repository.
While this type of analysis has its merits, it is important to note its limitations. One major concern is that statistical tests comparing the equivalence of two groups does not take into account the time elapsed since the divergence of two taxa, which can result in misinterpretation of results due to gene drift rather than selection (3). To address this issue, stochastic birth and death models have been proposed, which compare selection to the null model of random gene birth and loss (4,5). However, accurate time-dated species trees are required for such models, which are difficult to construct for fungi due to low number of fossil records and high-quality genomic sequences (6,7). Such programs also assume the most recent common ancestor contains a single copy of the studied gene or cluster of genes. It remains uncertain whether assuming the most recent common ancestor of all Ascomycetes contained an ICS BGC is a valid assumption given the evidence of convergent evolution between two separate clades of ICS backbone synthases (as depicted in Fig. 5). It was because of these limitations that we developed our own program using traditional statistical tests. While our approach does provide a crude identification of key points on the tree where large differences exist, further research is needed to determine the cause of such putative expansions or contractions.
We performed a Kendall's tau correlation analysis (12) to examine the relationship between the number of ICS BGCs identified in fungi and the corresponding number of canonical clusters. The SciPy v1.5.2 (8) command 'kendalltau' was used for this analysis with the default 'tau-b' variant.

Determining what proteins were biosynthetic, unknown, or other
We adopted the same criteria for defining "biosynthetic" protein domains as the widely-used BGC predictive software antiSMASH v5 (1). This included all enzymatic, regulatory, and transportation domains (e.g. MFSs) that have been previously reported to function within BGCs. Using such criteria, we categorized each gene into one of three groups: (i) "Biosynthetic" if it contained at least one biosynthetic domain, (ii) "Other" if it contained protein domains not previously associated with SM biosynthesis, or (iii) "Unknown" if no predictable protein domain was detected.

Identifying homologous sets of co-localized genes using cBlaster
A representative BGC prediction was hand-selected for each of the 31 analyzed GCFs. This selection was based on a BGC containing all the highly conserved orthologs for a given GCF. To make more informed choices for the selection of a representative BGC for each GCF, we used Orthofinder v2.5.2 (13) to identify the shared set of orthologous genes within each GCF. This enabled us to locate which orthogroups were highly conserved across all the BGCs within a GCF. The representative BGC for each GCF was chosen if it contained all the highly conserved orthologs for said GCF.
We utilized the rapid identification of homologous gene clusters tool cblaster v1.3.15 (14) to search for homologous loci to the query sequence. A local database from genomic GenBank files was created using the 'makedb' module, which implements the alignment software DIAMOND (15). The database was filtered down to include a maximum of 3 isolates for any given species to increase run speeds. This selection was done at random. The following pipeline was then employed for all 31 GCFs: (ii) The session for the default run was used to create gene neighborhood estimations using the 'gne' module, allowing for the optimal intergenic distance value to be calculated. (iii) An optimized search was re-run, using the calculated intergenic distance threshold, a list of backbone ICS genes the hit must include, and the minimum number of hits a species should have for the result to return the putative cluster (was adjusted for unique size of each GCF). Overall, this second approach to identifying BGCs is independent of regulatory elements and allowed us to fine-tune our CASSIS predictions around the highly conserved GCF cores.

Refining the list of species with the dit cluster
We used a phylogenomic approach to verifying if species had a homolog to the dit cluster. Briefly, all ICSs that fell within the dit clade on the ICS protein phylogeny were extracted. If such gene was not originally predicted to be within an ICS BGC using the CASSIS algorithm, we searched to check if a copy of dit2 was within 2 genes on the dit1 homolog. In cases where the dit2 gene was fragmented from dit1, it was documented but not included in any subsequent analyses. The results of the verification step, in addition to the protein names of any hand identified dit clusters can be found in Supplementary Table   S11.