Accurate de novo identification of biosynthetic gene clusters with GECCO

Biosynthetic gene clusters (BGCs) are enticing targets for (meta)genomic mining efforts, as they may encode novel, specialized metabolites with potential uses in medicine and biotechnology. Here, we describe GECCO (GEne Cluster prediction with COnditional random fields; https://gecco.embl.de), a high-precision, scalable method for identifying novel BGCs in (meta)genomic data using conditional random fields (CRFs). Based on an extensive evaluation of de novo BGC prediction, we found GECCO to be more accurate and over 3x faster than a state-of-the-art deep learning approach. When applied to over 12,000 genomes, GECCO identified nearly twice as many BGCs compared to a rule-based approach, while achieving higher accuracy than other machine learning approaches. Introspection of the GECCO CRF revealed that its predictions rely on protein domains with both known and novel associations to secondary metabolism. The method developed here represents a scalable, interpretable machine learning approach, which can identify BGCs de novo with high precision.

To construct a benchmark data set in a way that guarantees that training and test data 109 are disjoint, we partitioned MIBiG v2.0 17 into (i) BGCs for training that were already contained 110 in an earlier MIBiG version (v1.3, which was also originally used to train DeepBGC and 111 DeepBGC's re-trained implementation of ClusterFinder) 9 , and (ii) selected BGCs for testing 112 that were newly added in subsequent updates of MIBiG; from this test set we also removed 113 BGCs that were very similar in architecture to any instance contained in MIBiG v1.3 (see 114 section "Data acquisition and feature construction" below for additional details). This yielded 115 a final test set of 376 prokaryotic contigs which each had an embedded BGC that was 116 exclusively present in MIBiG v2.0 (referred to hereafter as the "376-genome test set"). We also 117 used two additional, previously constructed test sets containing thoroughly annotated genomes 118 of well-studied BGC producer organisms (the "six genome test set" used by Hannigan,et al. 9 119 to evaluate DeepBGC and the "nine genome bootstrapping set" used by Hannigan,et al. 9 for 120 hyperparameter tuning, validation, and testing of DeepBGC) and removed all instances of 121 BGCs similar to those in these additional test sets from the MIBiG v1.3-based training set. To 122 ensure a fair comparison of BGC detection methods, we retrained DeepBGC and GECCO on 123 this very same training set, using BGCs from MIBiG v1.3 which were absent from all test sets. 124 We evaluated the performance of both methods using (i) the 376-genome test set (the main test

154
We moreover used the training data to optimize GECCO's feature space. We found that  Figure S8). To determine the MIBiG biosynthetic class for each newly 168 predicted BGC, a separate random forest (RF) classifier was trained and evaluated using five-169 fold CV, as has been previously proposed 9 (see section "Prediction of biosynthetic class" 170 below; Fig. 1). Using the domain composition associated with each BGC as features, the RF 171 classifier achieved AUROC scores > 0.90 for all classes (Supplementary Figure S9).

172
In a final benchmark, we compared the runtime between GECCO, DeepBGC, and 173 antiSMASH using the three test sets, as well as all representative genomes in the proGenomes2   Table S1). Among these (CRF weight > 4.0) were (i) terpene synthase family 2, C-terminal  Table S1). Interestingly, among the highest-weighted, BGC-associated domains (CRF weight 214 > 2.0) that were not members of the core biosynthetic set were three domains of unknown  (InterPro ID IPR035210; Supplementary Table S1). Their importance for 220 BGC prediction with GECCO suggests that functional studies of these domains in the context 221 of secondary metabolism are warranted.

222
To be able to observe coherent biological functions among the domain weights learned 223 by the GECCO CRF, beyond the most strongly associated domains, we used Gene Ontology  Figure S12). Collectively, these results indicate that the GECCO CRF relies on domains 234 associated with secondary metabolite production for BGC inference.

236
ML approaches have revolutionized numerous disciplines and are being increasingly 237 employed to solve problems in biological and medical realms. 23-25 Models that can account for 238 sequential data are particularly attractive when leveraging genomic data to make predictions, 239 as feature context and order (e.g., for genes, domains) may be important. 9 CRFs specifically 240 have played a crucial role in sequential modeling tasks and have been used extensively in areas 241 such as natural language processing (NLP), where they frequently outperform their generative 242 counterparts. 14,15 Recently, deep learning approaches have become popular methods for processing 244 sequential data. However, these models often require a great deal of training data and/or pre-245 training efforts to show marked improvements over classical ML models. 16 This is relevant for 246 BGC identification, as the need for experimental characterization of "true" BGCs limits the 247 amount of training data for these approaches; for example, the current version of MIBiG (v2.0) 248 contains only 1,923 experimentally validated BGCs (with some being very closely related to 249 one another, and thus of limited value as training data). 17 Here, we showed that, with the 250 relatively limited amount of known BGCs available, the linear CRF implemented in GECCO 251 outperforms DeepBGC's BiLSTM approach, achieving higher accuracy at reduced training 252 and prediction time.

253
An additional advantage of CRFs over deep learning approaches is that the former are 254 inherently "simpler" and thus more interpretable (whereas "black box" RNNs require 255 substantial additional efforts to "explain" their behavior). 16,26,27 In the context of BGC mining, 256 an interpretable model can provide insights into genomic mechanisms of secondary 257 metabolism; here, introspection of GECCO's CRF identified numerous intuitive biological and 258 molecular characteristics that were highly associated with BGC presence. The highly BGC-259 enriched GO:0042742 and CL0400 terms (corresponding to "defense response to bacterium" 260 and "Double-Glycine leader-peptide cleavage motif", respectively), for example, are typical of 261 bacteriocin RiPPs often exported by ABC transporters, 28 while BGC-enriched CL0149 ("CoA-262 dependent acyltransferase superfamily") and GO:0008299 ("isoprenoid biosynthetic process") 263 are associated with polyketide synthases and terpenes, respectively. 29,30 Furthermore, we 264 identified numerous BGC-associated domains, which had not been included among domain 265 sets previously associated with secondary metabolism, including three highly BGC-associated 266 domains of unknown function. These results not only provide insight into BGC architecture 267 and function, but may be leveraged in the future to improve BGC annotation and identify "high-   To avoid training leakage into the 376-genome test set, the GECCO CRF (see section 366 "CRF training and cross-validation" above) was re-trained on BGCs available in MIBiG v1.3 367 and was used to predict BGC presence/absence in each genome in the three test sets (i.e., the 368 six-, nine-, and 376-genome test sets; Fig. 2 and Supplementary Figures S4-S7). The ability of 369 each of the following methods to predict BGC presence/absence was additionally evaluated on 370 each of the three test sets (Fig. 2a- Table S1). Three independent, two-group Mann-422 Whitney U tests were used to compare CRF weights associated with the following GECCO 423 domain sets, using the "wilcox.test" function in R, with parameters set to perform an unpaired 424 (paired = F), two-sided (alternative = "two.sided") test using a normal approximation (exact =   Step 4: Assign BGCs to biosynthetic classes using GECCO's random forest (RF) classifier MIBiG biosynthetic class is predicted from the domain vectors of the entire BGC ...

NRP Polyketide
RiPP Terpene Saccharide Alkaloid Step 3: Segment genome into BGC and non-BGC regions using GECCO's Conditional Random Field Step 2: annotate protein domains within ORFs to create domain vectors Step 1