Genomics-accelerated discovery of diverse fungicidal bacteria

Sorghum Anthracnose and Black Sigatoka of bananas are problematic fungal diseases worldwide, with a particularly devastating impact on small-holder farmers in Sub-Saharan Africa. We screened a total of 1,227 bacterial isolates for antifungal activity against these pathogens using detached-leaf methods and identified 72 isolates with robust activity against one or both of these pathogens. These bacterial isolates represent a diverse set of five phyla, 14 genera and 22 species, including taxa for which this is the first observation of fungal disease suppression. We identified biosynthetic gene clusters associated with activity against each pathogen. Through a machine learning workflow we discovered additional active isolates, including an isolate from a genus that had not been included in previous screening or model training. Machine-learning improved the discovery rate of our screen by 3-fold. This work highlights the wealth of biocontrol mechanisms available in the microbial world for management of fungal pathogens, generates opportunities for future characterization of novel fungicidal mechanisms, and provides a set of genomic features and models for discovering additional bacterial isolates with activity against these two pathogens.


Introduction
Sorghum ( Sorghum bicolor (L . ) Moench) is one of the top five cereal crops in the world agricultural economy, and second only to maize as a staple for food-insecure people in Sub-Saharan Africa (1) . Production of sorghum in Africa is constrained by both abiotic and biotic factors. The major biotic constraint is the fungal disease sorghum anthracnose (SA), caused by Colletotrichum sublineolum P. Henn. Kabat et Bub. Many sorghum landraces and improved varieties are susceptible to the foliar stage of the disease and yield losses ranging from 41% to 67% have been reported in Africa (2) .
Bananas and plantains (Musa spp.) are also of critical importance to food security and income generation for over 100 million people in Sub-Saharan Africa (3,4) . As they produce fruits year round, these perennial plants are the backbone of many farming systems . They also protect the soil from erosion and are resistant to floods and drought. Black Sigatoka disease (BS), also known as black leaf streak, caused by the fungal species Mycosphaerella fijiensis , is regarded as the most economically important leaf disease that affects bananas and plantains worldwide (5) . The disease results in heavy losses and, in highly susceptible varieties, can lead to the total collapse of the plant.
Bacteria provide a rich resource for the discovery of novel tools to manage plant diseases. Bacteria have co-inhabited many environments with fungal competitors for millions of years and have evolved mechanisms to antagonize and exclude a diverse range of fungal pathogens (6,7) . To identify bacteria with these naturally-occurring antifungal activities, we recently completed a large screen of more than 1,227 bacterial isolates against SA and BS. We discovered 72 bacterial isolates that significantly and reproducibly reduced infection of leaf tissue (53 isolates in four phyla that controlled SA in sorghum leaf tissue, and 31 isolates in four phyla that controlled BS in banana leaf tissue). These novel isolates provide potential new biological tools for managing the destructive effects of these pathogens.
Our search for the best fungicidal candidate isolates was accelerated using genomics and machine learning to strategically select isolates from a large microbial culture collection.
Through this extensive screening program, we used two distinct methods to identify genomic features (Biosynthetic Gene Cluster families or "BGC families") that are predictive of bacterial activity against SA and BS. These features suggest an expanded pathogen spectrum for some known fungicidal molecules, include entirely novel molecules with unknown modes of action (MOA), and provide a foundation for further research. We provide representative sequences for the most predictive BGC families.
Finally, to demonstrate the predictive value of the BGC families discovered in this work, we used machine learning models. We evaluated the strengths and limitations of multiple modeling frameworks in silico (including neural networks, random forests, and feature enrichment). We found that random forest models trained on oversampled data performed the best, followed closely by deep neural networks. In conjunction with these computational evaluations, we selected 176 additional bacterial isolates using two of the machine learning methods and screened them for antifungal activities. This additional round of model-driven screening led to the identification of a new genus (which had not been included in previous screening or model training) with activity against both pathogens and a 3-fold increase in our screening discovery rate. We provide the resulting machine learning models as tools for the community.

Materials and Methods
Microbial isolation and genome sequencing centrifuge tube (Fisher Scientific, Cat No.14-59-53A). The treated leaf pieces were then plated on 1% water agar amended with 6-Benzylaminopurine (BAP) and incubated at room temperature in the dark ( Figure 1B). 24 hours post treatment, the leaf pieces were inoculated with a 30 µL droplet of C. sublineolum spore suspension, applied on each side of the mid-rib.
The plates were then incubated in a growth chamber (Percival Scientific, Inc) set to a 12-hour photoperiod, maintained at 25°C and 95% relative humidity. The experimental design was a randomized complete block design with 3 replications.
Seven days post inoculation, the leaf pieces were assessed for anthracnose severity on a scale of 0-4 according to (8) , with few modifications ( Figures 1B and 1C). 0-No symptoms or chlorotic flecks, 1-hypersensitive reaction with no acervuli, 2-lesions with minute and few acervuli, 3-lesions with minute and few acervuli ≤25% of the leaf tissue and 4-lesions with acervuli covering ≥25% of the leaf surface. To identify isolates with robust, reproducible activity, we performed a "confirmation screen" on all active isolates from the primary screen. The confirmation screen was conducted using the same protocol as the primary. Data was analyzed No.14-59-53A) . Petri dishes with leaf pieces were incubated at room temperature in the dark.
24 hours later, leaf pieces in petri dishes were inoculated with a mycelial suspension of M.
fijiensis (120 μl per leaf piece) using a fingertip sprayer fitted to a 50 ml conical centrifuge tube .
The suspension contained mycelial fragments scraped from growing cultures that were cut into smaller mycelial tips in sterile distilled water in 50 ml conical tubes using a homogenizer (Omni International, Kennesaw, GA). The suspension was filtered through two layers of cheesecloth and then stirred. 0.05% Tween 20 and 0.02% Silwet L-77 (Loveland Industries Inc., Greeley, CO) were added. Using a hemocytometer, the suspension was adjusted with sterile distilled water to a concentration of 1 × 10 6 mycelial fragments/ml. A day after inoculation, plates were incubated in a growth chamber (Percival Scientific, Inc) set at 14 hours photoperiod, maintained at 25°C and 90% relative humidity.
Data recorded were the most progressed stage on inoculated leaves at the time of data collection (9) . Data were analyzed using analysis of variance (ANOVA) in PROC GLM of SAS (version 9.4; SAS Institute Inc., Cary, NC) and significant differences ( P < 0.05) were observed among treatments. To identify isolates with robust, reproducible activity, we performed a "confirmation screen" on all active isolates from the primary screen. The confirmation screen was conducted using the same protocol as the primary. A high level overview of the leaf disk screens(comparable in principle between SA and BS screens). Leaf disks were embedded in agar, and treated with resuspended bacterial cells. After 24 hours, the leaf disks were treated with fungal spores and incubated for multiple weeks. Scores were assigned by comparing leaf disk disease progression with inoculated and non-inoculated controls. Disease control greater than 70% was considered "active". Bacterial genomes were assigned taxonomic identifiers using the Genome Taxonomy Database Toolkit version 1.0.2 (10)(11)(12) . Biosynthetic gene clusters (BGCs) were annotated using antiSMASH version 5 (13) with the following options enabled: taxon "bacteria", KnownClusterBlast against MIBiG (14) and General ClusterBlast, Active Site Finder, "pfam2go" mapping, and Prodigal (15) as the gene finding tool. We used BiG-SCAPE with default settings (16) to cluster BGCs into families (the default alignment is "glocal" and the default cutoff for clustering into a family is a distance metric of 0.3).

Enrichment analysis and machine learning
Isolate feature enrichment and BGC family enrichment were calculated using a hypergeometric distribution. Isolates were considered "active" if they controlled disease progression by 70% or greater. Specifically, we estimated the cumulative probability that given M isolates screened, with n of the total having a particular attribute (e.g. are derived from soil), and N of the total displaying activity, that we would observe o or more active isolates with that attribute by chance alone. Smaller probabilities indicate "enriched" attributes that are unlikely to be so highly represented given chance alone, and that potentially have a meaningful biological explanation. We leveraged the hypergeometric module in SciPy to calculate enrichment p-values (17) .
We used SciKit-Learn to split training and test sets, train random forest models and generate predictions, perform cross validation, estimate permutation feature importance, and evaluate model performance (18) . Permutation feature importance (PFI) is a measure of how model performance is affected by randomly shuffling feature columns. PFI is a more robust measure of feature importance than random forest feature importance (19) . Before implementing the random forest models, we tested several configurations of random forest hyperparameters, and selected ensembles of 5 trees with a maximum depth of 32 nodes to maximize recall in a 50-fold cross validation analysis. Oversampling was performed by replicating all the active observations in the training data, thereby increasing the ratio of active to inactive observations. Precision, recall and F1 scores were calculated using the SciKit-Learn functions.
We used PyTorch (20) paired with Skorch ( https://github.com/skorch-dev/skorch ) to train a deep neural network and generate predictions. We tested several configurations and selected a 3-layer design with 10 fully-connected nodes and reLu activation in the first layer, 30% dropout between the first and second layers during training, 10 fully-connected nodes and reLu activation in the second layer, and a single sigmoid output node to predict the probability of activity. We used a learning rate of 0.02, trained each model for 10 epochs using the Adam optimizer and the binary cross entropy loss function.
To select additional isolates for screening, we selected representative "query" BGCs for each enriched or important BGC family and clustered them with batches of BGCs from the remainder of our collection using BiG-SCAPE with the same parameters. The BGCs that clustered with a query BGC were assigned to the same family as the query.
The data and code developed for this work are publicly available on GitHub: https://github.com/mbi2gs/fungicidal_bact_genomics

Results
We initially screened 1,051 cultivable bacterial isolates against SA, and 559 of those same isolates against BS. These 1,051 strains were isolated from 27 states in the United States and five districts in Uganda, and were derived from 11 sample types (e.g. soil, plant tissue, insects). We balanced taxonomic breadth and depth by screening more deeply among taxa with well-known fungicidal activity (e.g. Bacilli and Pseudomonads ) while also ensuring that we sampled from the broader diversity of the collection. We refer to this first pool of tested bacterial isolates as "diversity" screening (546 isolates against SA; 145 isolates against BS). In addition, we included isolates that showed activity in other screens against different fungal pathogens, referred to this pool as "spectrum" screening (291 isolates against SA; 259 isolates against BS).
Finally, in several cases when an isolate displayed activity against a pathogen, we followed up by screening additional isolates from the collection with high degrees of genomic similarity to that first active isolate. We refer to this pool of isolates as "genomic similarity" screening (214 against SA; 155 against BS). to each discovery, including predictive models. The "false negative" isolate in purple was predicted to be inactive by all modeling approaches and yet was found to be reproducibly active against SA.
Viewing the results from a taxonomic perspective, we discovered 156 isolates that displayed primary activity against SA, BS or both, representing five phyla, 27 genera, and 37 species (based on Genome Taxonomy Database labels (11) ). We found 72 reproducibly (active again in the confirmation screen) active isolates against SA or BS in five phyla, 14 genera, and 22 species (Figure 2). Isolates that reproducibly controlled both SA and BS came from two phyla, specifically, Proteobacteria and Firmicutes. The active isolate from the phylum Viewing the results from a discovery rate perspective, of the 546 isolates from the diversity screen against SA, 28 controlled SA in the primary (a discovery rate of 5.0%), and eight repeated the control in the confirmation screen (a discovery rate of 1.5%). Of the 145 isolates from the diversity screen against BS, 13 controlled the pathogen in the primary (a discovery rate of 9.0%) and two repeated the control in the confirmation screen (a discovery rate of 1.4%).
We evaluated the association between fungicidal activity (against SA or BS), and the associated taxonomy, source environment, isolation protocol, and geography. The metric used was an "enrichment score" (see Materials and Methods). After False Discovery Rate (FDR) correction, no p-values were below the chosen threshold of 0.05 (see Table 1). However, the variables with the strongest association were taxonomic, specifically members of the class Bacilli, where the species Bacillus velezensis were most enriched for activity against either pathogen, The strongest non-taxonomic associations were with isolates from the soil environment, and isolates from Uganda. Co-variances between each category are presented in Supplemental Figure 2. To identify biosynthetic pathways associated with control of Colletotrichum and Mycosphaerella , we generated genome assemblies for all isolates. We annotated the genomes using antiSMASH, which produced multiple BGCs and BGC fragments per genome assembly.
We clustered the annotated BGCs and fragments into 2,770 families using BiG-SCAPE (see Materials and Methods). We performed an enrichment analysis to associate isolate activity in the SA primary screen with the presence or absence of each BGC family (  Table 1). There were seven BGC families with homology to well-known fungicidal pathways, including Pyrrolnitrin, and Zwittermycin A (25) . There were also two BGC families associated with BS control that did not show significant homology to well-characterized BGCs, and these were distinct from the four novel BGC families associated with SA control. In this case, the two novel BGC families were associated with the family Pseudomonadaceae and the genus Pseudomonas_E and were described as a bacteriocin and a N-acetylglutaminylglutamine amide (NAGGN). These BGC families represent potentially novel biocontrol pathways and molecules with relevant activity against BS .
As an additional metric for associating BGC family with pathogen control, we built a random forest classifier to predict isolate activity in each primary screen and calculated the permutation feature importance of each BGC family (see Materials and Methods). There were 28 BGC families in the top 99th percentile for predicting SA control, and 19 for predicting BS control ( Table 2 and Supplemental Table 1). Among those associated with SA control, four BGC families were selected by both the feature importance and enrichment methods. 15 of the 28 selected by feature importance were homologous to well-characterized BGCs, many of which are known to be fungicidal or generally involved in microbial competition, including Zwittermycin A (25,26) and the siderophore pyoverdine (27) .13 of the 28 did not share significant homology to well-characterized BGCs, and are described as bacteriocins, NRPSs, lanthipeptides and PKSs. We found that many of the BGC families associated with activity against SA were co-occurring throughout the isolates screened (Supplemental Figure 3). For example, we observed clear clusters of BGC families with strong associations with Bacillus_A . Among the BGC families associated with controlling BS (Supplemental Table 1), 11 of the 19 were homologous to well-characterized BGCs, including Bacilysin (28) . 8 of the 19 were not significantly homologous to well-characterized BGCs, and are categorized as beta lactones, linear azole-containing peptides (LAPs), bacteriocins and NRPSs. We found that many of these BGC families associated with activity against BS were co-occurring throughout the isolates screened (Supplemental Figure 3). For example, we observed clusters of BGC families strongly affiliated with the species Pseudomonas_E protegens . These BGC families without significant homology to well-described BGCs represent potentially novel biocontrol pathways and molecules with relevant activity against SA and BS . To demonstrate the predictive value of the activity-associated BGC families, we developed computational tools that would increase the discovery rate of fungicidal isolates ( Figure 3A). We compared an enrichment-based prioritization method "ER" with random forest models (both standard "RF" and trained on over-sampled data "RFOS"; see Materials and Methods), and a deep neural network "NN" ( Figure 3A). We selected an oversampling factor of 13 for the RFOS model based on cross-validation (Supplemental Figure 5). We used only the diversity screening isolates data to train the models. During 100-fold cross-validation with 75/25% training/test split of the data, we repeated the process of feature selection described above (i.e. enrichment and permutation feature importance) using only the subset of training data. After feature selection with the RF, the same features were used to train the NN. As a null model, random isolates were predicted to be active at the same discovery rate as the diversity screen (i.e. 5%). All models were trained to predict activity in the primary screen against SA. We found that all four prioritization tools performed better than the null model in terms of the F1 statistic (p-value of 2×10 -7 by Kruskal-Wallis test, followed by one-sided Mann-Whitney U tests with p-values < 3×10 -4 ) ( Figure 3B). The ER method was the poorest performer of the four methods tested, while the RFOS and the deep neural network were the best performers. The RFOS average F1 score was significantly better than ER (p-value < 0.05 by one-sided Mann-Whitney U tests), largely due to the RFOS precision scores. The NN method displayed the best recall, increasing the average recall by 6.5-fold compared to the null. Comparing prioritization methods by 100-fold cross validation on the primary screening data. Each iteration of the cross-validation split the data randomly into training and test sets, selected features and trained models on the training data, then predicted activity in the test data. ER (enrichment-based prioritization, where any isolate with one or more enriched BGCs was predicted to be active), RF (random forest model), RFOS (random forest with oversampling), NN (deep neural network), Rand (a random, null model based on the discovery rate from the training data). All prioritization methods outperformed the null model. Both RFOS and NN outperformed the ER and RF models. The mean (μ) and median ( ) values for each set of simulations are indicated above μ each distribution. C. Each iteration of this cross-validation held out all the members of a particular genus as the test set. In this case, all methods were indistinguishable from the null (Kruskal-Wallis test), indicating that generally speaking, the BGC families defined here are only predictive within a genus. The mean (μ) and median ( ) values for each set of μ simulations are indicated above each distribution.
To determine the extent to which the best performing cross validation iterations were due to taxonomic similarity between the training set and the test set (assuming that more related isolates between the two would improve observed model performance), we compared the Jaccard similarity of the set of genera in the training and test data to the model performance and found no significant correlation (correlation coefficient of 0.097 and a p-value of 0.34 by Spearman's Rank Correlation; see Supplemental Figure 6). While this result would suggest that taxonomic overlap between training does not affect model performance (which is unlikely), we also repeated the cross-validation evaluation. This time, instead of random training/test splits, we kept out a single genus at each iteration as the test set. We found that under these conditions, the models failed to outperform the null model (p-value 0.13 by Kruskal-Wallis test), suggesting that the BGC families used to train the models tend to be predictive at the genus level or lower ( Figure 3C).
In parallel to the in silico validation of the predictive value of the enriched BGC families, we also completed an in vivo validation. In contrast to the in silico validation where we could simulate hundreds of experiments, we performed a single in vivo validation. We applied two computational models to predict control of SA from among the larger AgBiome bacterial culture collection. We selected BGC features using enrichment and random forest feature importance, based on an older version of the SA screen data set (the BGC features used in this in vivo validation exercise are listed in Supplemental Table 2 and representative sequences are provided in the supplemental materials). To query the collection for isolates with BGCs in the enriched families, we annotated BGCs using antiSMASH across the entire collection, and then used BIG-SCAPE to cluster batches of 200 BGCs together with the queries, using the same clustering parameters as the original clustering job that defined the families. We identified 17,163 isolates that contained at least one representative of the enriched BGC families.
Using two predictive tools (ER and RFOS), we selected 176 isolates from the 17,163 enriched isolates. First, we selected isolates that contained at least one member from the 12 enriched BGC families (ignoring the BGC families identified by permutation feature importance).
A large proportion of those isolates belonged to only a few genera, particularly Bacillus_A . In order to increase the diversity of the validation set, we required that half of the predicted active isolates come from non-Bacillus_A genera. Second, we applied the RFOS model to predict control of SA. We used all fungal-control-associated BGC families as features, trained with 13-fold oversampling, and required that half of the predicted isolates come from non-Bacillus_A was applied post-hoc . 20 of the 176 isolates were predicted to be inactive by all three models, one of which was found to be active (this "false negative" isolate was classified as a Bacillus altitudinis ; see Figure 2).

Figure 4.
Predictive models improved discovery rate . The hit rate in the primary and confirmation screens is shown by the bars, and the fold improvement compared to the diversity screen (a random set of microbes from the collection) is displayed above each bar. RFOS improved the discovery rate by 3-fold in the primary screen, and by 5.4-fold in the confirmation screen.
We found that the RFOS primary discovery rate (precision) was 3-fold greater than the historical discovery rate of the diversity screen (p-value of 3.7×10 -6 by one-sided two-proportions Z test) ( Figure 4). In other words, using the RFOS model, we discovered 3-fold more active isolates for the effort than we would have by screening a random, diverse set of isolates from the collection. The confirmation discovery rate was even higher: 5.4-fold greater than the diversity screen (p-value of 6.2×10 -6 by one-sided two-proportions Z test) ( Figure 4).
Additionally, the RFOS model achieved 72% recall, correctly predicting 23 of the 32 total active isolates (p-value of 0.002, compared to 1,000 draws using a random model based on the baseline diversity screen discovery rate of 5.0%) (Supplemental Figure 7). Notably, the RFOS method did not discover any novel taxa that were not already represented in the training set.
Most of the true positives were classified as members of the genus Pseudomonas_E , while most of the true negatives, false negatives and false positives were classified as members of Bacillus_A . Taking into account the SA confirmation screen results (which were not used to train the model), the model precision was not statistically different from the null model, but the recall was exceptionally high, where the RFOS model correctly predicted 11 of the 12 (92%) isolates that reproducibly controlled SA (p-value < 0.001, compared to 1,000 random draws).
The NN model did not perform as well as the RFOS model in the context of this post-hoc analysis. We found that the NN primary discovery rate was 1.9-fold greater than the diversity screen discovery rate (p-value of 0.01 by one-sided two-proportions Z test), and the confirmation discovery rate was 2.5-fold greater than the diversity screen (p-value of 0.03 by one-sided two-proportions Z test) (Figure 4). The precision was not statistically different from the null model, but the recall was better than the null (p-value of 0.098 compared to 1,000 random draws) (Supplemental Figure 7). Recall was not significantly better than the null model when predicted the results of the SA confirmation screen-the NN model correctly predicted two of 12 (17%) isolates that reproducibly controlled SA (p-value of 0.12, compared to 1,000 random draws). In contrast to the RFOS model, the true and false positives consisted primarily of members of the genera Bacillus_A , while most true negatives were classified as Bacillus, and most false negatives were classified as Pseudomonas_E .
Finally, we found that the ER primary discovery rate was 2.4-fold greater than the diversity screen discovery rate (p-value of 7.4x10 -5 by one-sided two-proportions Z test), and the confirmation discovery rate was 2.9-fold greater than the diversity screen (p-value of 6.1x10 -3 by one-sided two-proportions Z test) (Figure 4). The ER method outperformed the null model in terms of recall (Supplemental Figure 7). It correctly predicted 19 of the 32 (59.4%) SA-controlling isolates in the primary screen (p-value of 0.007 compared to 1,000 random draws), and three of the 12 actives against SA in the confirmation screen (p-value of 0.031 compared to 1,000 random draws). To describe model performance in taxonomic terms, most of the true and false positives were classified as Bacillus_A , most of the true negatives as Bacillus , and most of the false negatives as Pseudomonas_E and Lysobacter .
The ER method (uniquely among the three computational methods used) discovered active isolates from a novel genus which was not present in the training data-Bacillus_I endophyticus . These Bacillus_I isolates were prioritized by the enrichment method based on the presence of the BGC family "nrps2550". The BGC family "nrps2550" shares homology with the BGC known to produce Elsinochrome A. Elsinochrome A is a perylenequinone, a family of molecules which are known to exhibit fungicidal activity by production of reactive oxygen species (29,30) .

Discussion
Many of the isolates with confirmed activity against SA and BS are from taxonomic groups that have long been known to exhibit fungicidal properties, specifically Lysobacter (31) , and a variety of Pseudomonads and Bacilli (  (33) . The diversity also suggests that bacteria yet harbor many untapped opportunities for fungal control, both in agricultural and medical applications.
The gene clusters we observed to be enriched and predictive of bacterial anti-SA and anti-BS activity were a mix of homologs to well-studied BGCs and novel, uncharacterized BGCs.
Many of the BGC families are highly homologous to the characterized BGCs for known fungicidal compounds (e.g. Pyrrolnitrin, Zwittermycin, Fengycin, and Surfactin). This is a strong validation of the approaches (enrichment and permutation feature importance) we used to rank BGC families, and is encouraging of future work to characterize the novel BGCs we observed. It is unclear, based on genome assemblies alone, to what extent these BGCs with homology to known BGCs produce identical, or merely related, compounds. It is also likely that some of the BGCs we found to be associated with antifungal activity are simply correlated with the taxonomic group that exhibited activity and are unrelated to the mode of action. A similar challenge is that singleton BGCs representing novel antifungal pathways cannot be detected by the approach used here. The nature of the training data make it impossible to determine which BGC families actually cause anti-SA or -BS activity. Therefore, it is necessary to do more in-depth studies to demonstrate causation, and to elucidate potential MOAs associated with BGC families discovered here.
Regardless of whether the BGCs discovered here represent the actual source of antifungal activity or are merely correlated with it, they are predictive of antifungal activity against SA and BS. Interestingly, we observed in silico that it is rare for BGC families to be predictive of activity outside the genus level ( Figure 3C). In some cases we observed that the predictive models could correctly predict activity among isolates from a genus not represented in the training data, but In silico , these cases were rare and were not common enough to differentiate from random chance. It is unclear whether the genus-level limit to BGC predictive value is reflective of biology (i.e. antifungal MOAs tend to be limited to within a bacterial genus) or whether it is an artifact of the way we defined and ranked BGC families. For example, despite the in silico results, in the validation experiment the enrichment method did in fact lead us to discover a novel genus. The discovery of a novel genus was possibly a stroke of luck, but it is also likely that some antifungal BGC families span and are predictive across broader taxonomic groups. An example in our training data of such a BGC family is "others2810" (Table 2), which can be found (in our data set) across the class Gammaproteobacteria. In the validation experiment, the BGC family "nrps2550"-which was limited to the genus Bacillus_A in the training set-was the clue that led to the discovery of activity against SA by a member of the genus Bacillus_I . By predicting activity based on genomic features rather than whole-genome homology or taxonomic relatedness, it becomes possible to anchor predictions on features that have been broadly inherited or horizontally dispersed. In this way, genomic features act as stepping stones to new taxa with potentially novel properties for SA or BS control.
Broader and deeper data sets are needed if we wish to predict SA or BS control with more confidence. While every computational approach we tested performed significantly better than the random, null model for predicting activity in the primary assay ( Figure 3B), none of the modeling approaches achieved average F1 scores greater than 20%, or maximum F1 scores greater than 40%. The relatively low model confidence highlights the complex nature of antifungal interactions, which can be influenced by subtle differences between isolates. Small changes in BGC enzymes may result in the production of different metabolites, which may or may not be antifungal. Small changes to promoters, signalling molecules, or growth conditions can all have a large impact on the observed antifungal activity against SA or BS. Furthermore, it is likely that many active isolates employed unique modes of action which could not be used to predict activity in any other isolates. All of this simply points to the need for more screening data, and the use of input features capable of capturing more subtle differences between isolates. Regardless, we found that RFOS and NN models performed the best in silico ( Figure   3), and that predictive models improved the efficiency of our screening efforts (Figure 4). There is a lot left to learn and discover about bacterial control of SA and BS, and computational models will continue to accelerate the discovery effort. 20