Integration of machine learning and pan-genomics expands the biosynthetic landscape of RiPP natural products

Most clinical drugs are based on microbial natural products, with compound classes including polyketides (PKS), non-ribosomal peptides (NRPS), fluoroquinones and ribosomally synthesized and post-translationally modified peptides (RiPPs). While variants of biosynthetic gene clusters (BGCs) for known classes of natural products are easy to identify in genome sequences, BGCs for new compound classes escape attention. In particular, evidence is accumulating that for RiPPs, subclasses known thus far may only represent the tip of an iceberg. Here, we present decRiPPter (Data-driven Exploratory Class-independent RiPP TrackER), a RiPP genome mining algorithm aimed at the discovery of novel RiPP classes. DecRiPPter combines a Support Vector Machine (SVM) that identifies candidate RiPP precursors with pan-genomic analyses to identify which of these are encoded within operon-like structures that are part of the accessory genome of a genus. Subsequently, it prioritizes such regions based on the presence of new enzymology and based on patterns of gene cluster and precursor peptide conservation across species. We then applied decRiPPter to mine 1,295 Streptomyces genomes, which led to the identification of 42 new candidate RiPP families that could not be found by existing programs. One of these was studied further and elucidated as a novel subfamily of lanthipeptides, designated Class V. Two previously unidentified modifying enzymes are proposed to create the hallmark lanthionine bridges. Taken together, our work highlights how novel natural product families can be discovered by methods going beyond sequence similarity searches to integrate multiple pathway discovery criteria. Code and data availability The source code of DecRiPPter is freely available online at https://github.com/Alexamk/decRiPPter. Results of the data analysis are available online at http://www.bioinformatics.nl/~medem005/decRiPPter_strict/index.html and http://www.bioinformatics.nl/~medem005/decRiPPter_mild/index.html (for the strict and mild filters, respectively). All training data and code used to generate these, as well as outputs of the data analyses, are available on Zenodo at doi:10.5281/zenodo.3834818.


40
The introduction of antibiotics in the 20 th century contributed hugely to extend the human life span. 41 However, the increase in antibiotic resistance and the concomitant steep decline in the number of 42 new compounds discovered via high-throughput screening 1,2 , means that we again face huge 43 challenges to treat infections by multi-drug resistant bacteria 3 . The low return of investment of high 44 throughput screening is due to dereplication, in other words, the rediscovery of bioactive 45 compounds that have been identified before 4,5 . A revolution in our understanding was brought about 46 by the development of next-generation sequencing technologies. Actinobacteria are the most prolific 47 producers of bioactive compounds, including some two-thirds of the clinical antibiotics 6,7 . Mining of 48 the genome sequences of these bacteria revealed a huge repository of previously unseen 49 biosynthetic gene clusters (BGCs), highlighting that their potential as producers of bioactive 50 molecules had been grossly underestimated 6,8,9 . However, these BGCs are often not expressed under 51 laboratory conditions, most likely because the environmental cues that activate their expression in 52 their original habitat are missing 10,11 . To circumvent these issues, a common strategy is to select a 53 candidate BGC and force its expression by expression of the pathway-specific activator or via 54 expression of the BGC in a heterologous host 12 . However, these methods are time-consuming, while 55 it is hard to predict the novelty and utility of the compounds they produce. 56 To improve the success of genome mining-based drug discovery, many bioinformatic tools have been 57 developed for identification and prioritization of BGCs. These tools often rely on conserved genetic 58 markers present in BGCs of certain natural products, such as polyketides (PKS), non-ribosomal 59 peptide synthetases (NRPS) and terpenes [13][14][15] . While these methods have unearthed vast amounts of 60 uncharacterized BGCs, they further expand on previously characterized classes of natural products. 61 This raises the question of whether entirely novel classes of natural products could still be 62 discovered. A few genome mining methods, such as ClusterFinder 16 and EvoMining 17,18 , have tried to 63 tackle this problem. These methods either use criteria true of all BGCs or build around the 64 evolutionary properties of gene families found in BGCs, rather than using specific BGC-class-specific 65 genetic markers. While the lack of clear genetic markers may result in a higher number of false 66 positives, these methods have indeed charted previously uncovered biochemical space and led to the 67 discovery of new natural products. 68 One class of natural products whose expansion has been fueled by the increased amount of genomic 69 sequences available is that of the ribosomally synthesized and post-translationally modified peptides 70 (RiPPs) 19 . RiPPs are characterized by a unifying biosynthetic theme: a small gene encodes a short 71 precursor peptide, which is extensively modified by a series of enzymes that typically recognize the 72 N-terminal part of the precursor called the leader peptide, and finally cleaved to yield the mature 73 product 20 . Despite this common biosynthetic logic, RiPP modifications are highly diverse. The latest 74 comprehensive review categorizes RiPPs into roughly 20 different classes 19 , such as lanthipeptides, 75 lasso peptides and thiopeptides. Each of these classes is characterized by one or more specific 76 modifications, such as the thioether bridge in lanthipeptides or the knot-like structure of lasso 77 peptides. Despite the extensive list of known classes and modifications, new RiPP classes are still 78 being found. Newly identified RiPP classes often carry unusual modifications, such as D-amino acids 21 , 79 addition of unnatural amino acids 22,23 , β-amino acids 24 , or new variants of thioether crosslinks 25,26 . 80 These discoveries strongly indicate that the RiPP genomic landscape remains far from completely 81 charted, and that novel types of RiPPs with new and unique biological activities may yet be 82 uncovered. However, RiPPs pose a unique and major challenge to genome-based pathway 83 identification attempts: unlike in the case of NRPSs Here, we present decRiPPter (Data-driven Exploratory Class-independent RiPP TrackER), an 100 integrative algorithm for the discovery of novel classes of RiPPs, without requiring prior knowledge of 101 their specific modifications or core enzymatic machinery. DecRiPPter employs a Support Vector 102 Machine (SVM) classifier that predicts RiPP precursors regardless of RiPP subclass, and combines this 103 with pan-genomic analysis to identify which putative precursor genes are located within specialized 104 genomic regions that encode multiple enzymes and are part of the accessory genome of a genus. 105 Sequence similarity networking of the resulting precursors and gene clusters then facilitates further 106 prioritization. Applying this method to the gifted natural product producer genus Streptomyces, we 107 identified 42 new RiPP family candidates. Experimental characterization of a widely distributed 108 candidate RiPP BGC led to the discovery of a novel lanthipeptide that was produced by a previously 109 unknown enzymatic machinery. 110

RiPP BGC discovery by detection of genomic islands with characteristics typical of RiPP BGCs 112
Given the promise of RiPPs as a source for novel natural products, we set out to construct a platform 113 to facilitate identification of novel RiPP classes. Since no criteria could be used that are specific for 114 individual RiPP classes, we used three criteria that generally apply to RiPP BGCs: 1) they contain one 115 or more ORFs for a precursor peptide; 2) they contain genes encoding modifying machinery in an 116 operon-like gene cluster together with precursor gene(s); 3) they have a sparse distribution within 117 the wider taxonomic group in which they are found. To focus on novel RiPP classes, we added a 118 fourth criterion: 4) they have no direct similarity to BGCs of known classes ( Figure 1). 119 For the first criterion, we trained an SVM classifier to distinguish between RiPP precursors and other 120 peptides. A collection of 175 known RiPP precursors, gathered from RiPP clusters from the MIBiG 121 repository 32 was used as a positive training set. For the negative training set, we generated a set of 122 20,000 short non-precursor sequences, consisting of 10,000 randomly selected short proteins (<175 123 amino acids long) from Uniprot without measurable similarity to RiPP precursors (representative of 124 gene encoding proteins but not RiPP precursors), and 10,000 translated intergenic sequences 125 between a stop codon and the next start codon of sizes 30-300 nt taken from 10 genomes across the 126 bacterial tree of life (representative of spurious ORFs that do not encode proteins). From both 127 positive and negative training set sequences, 36 different features were extracted describing the 128 amino acid composition and physicochemical properties of the protein/peptide sequences, as well as 129 localized enrichment of amino acids prone to modification by RTEs. Based on these, a support vector 130 machine was trained (see details in Methods section). To make sure that this classifier could predict 131 precursors independent of RiPP subclass, we trained it on all possible subsets of the positive training 132 set in which one of the RiPP subclasses was entirely left out (a strategy we termed leave-one-class-133 out cross-validation). Typically, the classifier was still capable of predicting the class that was left out, 134 with an area-under-receiver operating characteristics curve of 0.955. 135 For the second criterion, we made use of the fact that the majority of RiPP BGCs appear to contain 136 the genes encoding the precursor and the core biosynthetic enzymes in the same strand orientation 137 within close intergenic distance (81.6% of MIBiG RiPPs). Therefore, candidate gene clusters are 138 formed from the genes that appear to reside in an operon with predicted precursor genes, based on 139 intergenic distance and the COG scores calculated (see description of third criterion below, the 140 Methods section and Figure S1). These gene clusters are then analyzed for protein domains that 141 could constitute the modifying machinery ( Figure 1b). Rather than restricting ourselves to specific 142 protein domains, we constructed a broad dataset of Pfam and TIGRFAM domains that are linked to 143 an E.C. number using InterPro mappings 33 . This dataset was extended with a previously curated set of 144 Pfam domains found to be prevalent in the positive training set of the ClusterFinder algorithm 34 , and 145 manually curated, resulting in a set of 4,131 protein domains. We also constructed Pfam 35 and 146 TIGRFAM 36 domain datasets of transporters, regulators and peptidases, as well as a dataset 147 consisting of known RiPP modifying domains to provide more detailed annotation and allow specific 148 filtering of RiPP BGCs based on the presence of each of these types of Pfam domains (Supplemental 149 Document 2). 150 For the third criterion, we sought to distinguish specialized genomic regions from conserved genomic 151 regions. Indeed, most BGCs are sparingly distributed among genomes, with even closely related 152 strains showing differences in their BGC repertoires 37-39 . We therefore developed an algorithm that 153 separates the 'core' genome from the 'accessory' genome, by comparing all genes in a group of 154 query genomes from the same taxon (typically a genus), and identifying the frequency of occurrence 155 of each gene within that group of genomes (Figures 1c and S2). For the purpose of comparing genes 156 between genomes, we reasoned that it was more straightforward to identify groups of functionally 157 closely related genes that also include recent paralogues, due to the complexities of dealing with 158 orthology relationships across large numbers of genomes (especially for biosynthetic genes that are 159 known to have a discontinuous taxonomic distribution and may undergo frequent duplications 40 ). 160 Therefore, decRiPPter first identifies the distribution of sequence identity values of protein-coding 161 genes that can confidently be assigned to be orthologues, and uses this distribution to find groups of 162 genes across genomes with orthologue-like mutual similarity. To identify a set of high-confidence 163 orthologues, decRiPPter looks for genomic loci between which at least three contiguous genes are 164 each other's bidirectional best hits (BBHs, using DIAMOND 41 ) between all possible genome pairs of 165 the group of genomes analyzed, and assigned the center genes of these loci orthologue status, 166 termed a true conserved orthologous gene (trueCOG) 42 . Since many orthologues are missed by only 167 considering orthologues based on BBHs 43 , and to also include recent paralogues, we then further 168 expanded the list of homologues with orthologue-like similarity by dynamically determining a cutoff 169 between each genome pair based on the similarity of the trueCOGs shared between those genomes. 170 This cutoff is used to find all highly similar gene pairs, which are then clustered with the Markov 171 Clustering Algorithm (MCL 44 ) into 'clusters of orthologous genes' (COGs). The number of COG 172 members found for each gene is divided by the number of genomes in the query to get a COG score 173 ranging from 0 to 1, reflecting how widespread the gene is across the set of query genomes. To 174 validate our calculations, we analyzed the COG-scores of the highly conserved single-copy BUSCO 175 gene set from OrthoDB 45,46 , as well as the COG-scores of the genes in the gene clusters predicted by 176 antiSMASH. In line with our expectations, homologs of the BUSCO gene set averaged COG-scores of 177 0.95 ( Figure S5), while the COG-scores of the antiSMASH gene clusters were much lower, averaging 178 0.311 +-0.249 for all BGCs, and 0.234 +-0.166 for RiPP BGCs ( Figure S6). While the COG-scoring 179 method requires a group of genomes to be analyzed rather than a single genome, we believe that 180 the extra calculation significantly contributes in filtering false positives (see Table 1 and Figure S4). In 181 addition, the COG scores aid in the gene cluster identification based on the assumption that gene 182 clusters are generally sets of genes with similar absence/presence patterns across species (see 183 Methods section). 184 For the final criterion, the algorithm dereplicates the identified clusters by comparing them to known 185 RiPP BGCs. All putative BGCs are clustered based on domain content and precursor similarity using 186 sequence similarity networking 47 , and compared to known RiPP BGCs from MIBiG 48 . In addition, the 187 overlap between predicted RiPP BGCs and gene clusters found by antiSMASH 49 is determined ( Figure  188 1). 189

decRiPPter identifies 42 candidate novel RiPP classes in Streptomyces 190
While RiPPs are found in many different microorganisms, their presence in streptomycetes reflects 191 perhaps the most diverse array of RiPP classes within a single genus. Streptomycetes produce a 192 broad spectrum of RiPPs, namely lanthipeptides 50 , lasso peptides 27 , linear azol(in)e-containing 193 peptides (LAPs) 51 , thiopeptides 52 , thioamide-containing peptides 29 and bottromycins 53 . Their 194 potential as RiPP producers is further highlighted by a recent study showcasing the diversity of 195 lanthipeptide BGCs in Streptomyces and other actinobacteria 54 . Given the large variety of different 196 families of natural products produced by this genus, we hypothesized it to be a likely source of novel 197 RiPP classes, and sought to exhaustively mine it. 198 We started by running the pipeline described above on all publicly available Streptomyces genomes 199 (1,295 genomes) from NCBI (Supplemental Document 3). Due to computational limits, the genomes 200 were split into ten randomly selected groups to calculate the frequency of distribution of each gene 201 (COG-scores). In general, the number of genomes that could be grouped together and the resulting 202 cutoffs were found to vary with the amount of minimum trueCOGs required ( Figure S3A). To make 203 sure that as many genomes as possible could be compared at once, we set the cutoff for minimum 204 number of trueCOGs at 10. Despite the low cutoff, the distribution of similarity scores between 205 genome pairs still resembled a Gaussian distribution ( Figure S3B). The bimodal distribution of the 206 resulting COG-scores showed that the majority of the genes were either conserved in only a small 207 portion of the genomes, or present in almost all genomes ( Figure S4). 208 We then scanned all predicted products of genes as well as predicted ORFs in intergenic regions 209 shorter than 100 amino acids (total 7.19 * 10 7 ) with the SVM classifier. While by far most of the 210 queries scored below 0.5, a peak of queries scoring from 0.9 to 1.0 was observed ( Figure S7). Seeking 211 to be inclusive at this stage, we set the cutoff at 0.9, resulting in 1.32*10 6 candidate precursors 212 passing this initial filter, thus filtering out 98.2 % of all candidates. Eliminating candidate precursors 213 whose genes were completely overlapping reduced the number to 8.17*10 5 precursors (1.1 %). 214 While, most probably, the vast majority of these are not RiPP precursors, it provides a suitably sized 215 set of candidates to then enter the next stages of the decRiPPter workflow. 216 In our analyses, we found that the majority of RiPP BGCs contain the majority of biosynthetic genes 217 on the same strand orientation as the precursor (MIBiG: 81.6%; antiSMASH RiPP BGCs: 73.1%). We 218 therefore formed gene clusters using only the genes on the same strand as the predicted precursor. 219 To create a training set, we divided all known RiPP BGCs and all antiSMASH RiPP BGCs found in the 220 analyzed genome sequences into sections where each section contained only genes on the same 221 strand. The core section was defined as the section that contained the most biosynthetic genes as 222 detected by antiSMASH or as annotated in the MIBiG database. These sections were used as training 223 sets to finetune distance and COG cutoffs for our gene cluster methods. 224 In a simple gene cluster method, genes were joined only using the intergenic distances as a cutoff. 225 Using this method, we found that at a distance of 750 nucleotides, all MIBiG core sections were 226 covered, and 91% of all antiSMASH core sections ( Figure S8AB). However, using only distance may 227 cause the gene cluster formation to overshoot into regions not associated with the BGC (e.g. Figure  228 S2). We therefore created an alternative method, called the 'island method'. In this method, each 229 gene is first joined with immediately adjacent genes that lie in the same strand orientation and have 230 very small intergenic regions (<=50 nucleotides), to form islands. These islands may subsequently be 231 combined if they have similar average COG-scores (see materials and methods). We found that with 232 this method, we could confidently cover our validation set, while slightly reducing the average size of 233 the gene clusters (3.73 ± 3.75 vs 3.44 ± 3.53; Figure S8CDE). In addition the variation of the COG 234 scores within the gene clusters decreased, suggesting that fewer housekeeping genes would be 235 added to detected biosynthetic gene clusters ( Figure S8F). 236 Overlapping gene clusters were fused, resulting in 7.18 *10 5 gene clusters. To organize the results, all 237 clusters were paired if their protein domain content was similar (Jaccard index of protein domains; 238 cutoff: 0.5) and at least one of their predicted precursors showed sequence similarity (NCBI blastp; 239 bitscore cutoff: 30). These cutoffs were used to distinguish between different RiPP subclasses ( Figure  240 S9 BGCs were filtered out in the process (and, by extension, many unknown RiPP BGCs were likely also 251 filtered out this way), we felt our odds of discovering novel RiPP families were highest when focusing 252 on the dataset with the highest fraction of RiPP BGCs, and therefore applied the strict filter. The 253 remaining 2,471 clusters of genes were clustered as described above. Since our efforts were aimed at 254 finding new gene cluster families, we discarded groups of clusters with fewer than three members, 255 leaving 1,036 gene clusters in 187 families. Families in which more than half of the gene clusters 256 overlapped with antiSMASH non-RiPP BGCs were discarded as well, leaving only known RiPP families 257 and new candidate RiPP families (893 gene clusters, 151 families; Figure 2). 258 Roughly a third (272) of the remaining gene clusters were members of known families of RiPPs, 259 including lasso peptides, lanthipeptides, thiopeptides, bacteriocins and microcins. In addition, many 260 of the other candidate clusters (55) contained genes common to known RiPP BGCs, such as those 261 encoding YcaO cyclodehydratases and radical SAM-utilizing proteins ( Figure 2). These gene clusters 262 were not annotated as RiPP gene clusters by antiSMASH, but the presence of these genes alone or in 263 combination with a suitable precursor can be used as a lead to find novel RiPP gene clusters 24,29 . 264 Each remaining family of gene clusters was manually investigated to filter out likely false positives 265 from the candidates. Common reasons to discard gene clusters were functional annotations of 266 candidate precursors as having a non-precursor function (e.g. homologous to ferredoxin or LysW 55 ), 267 annotations of the genes within a gene cluster related to primary metabolism (e.g. genes for cell-wall 268 modifying enzymes), or other abnormalities (e.g. large intergenic gaps or very large gene cluster of 269 more than 40 genes). Several modifying enzymes belonging to the candidate families were 270 homologous to gene products involved in primary metabolism, such as 6-pyruvoyltetrahydropterin 271 synthase or phosphoglycerate mutase. Given the low distribution (COG scores) of the genes encoding 272 these enzymes, it seemed more likely to us that they were adapted from primary metabolism to play 273 a role in secondary metabolism 17 . We therefore only discarded a gene cluster family if multiple clear 274 relations to a known pathway were found. The remaining 42 candidate families were further grouped 275 together into broader classes depending on whether a common enzyme was found (Figure 2). 276 A large group of families all contained one or more genes for ATP-grasp enzymes. ATP-grasp enzymes 277 are all characterized by a typical ATP-grasp-fold, which binds ATP, which is hydrolyzed to catalyze a 278 number of different reactions. As such, these enzymes have a wide variety of functions in both 279 primary and secondary metabolism, and their genes are present in a many different genomic 280 contexts 56 . Involvement of ATP-grasp enzymes in RiPP biosynthesis has been reported for both 281 microviridin 57 and pheganomycin 23 , where they catalyze macrocyclization and peptide ligation, 282 respectively. The ATP-grasp enzymes involved in the biosynthesis of these products did not show 283 direct similarity to any of the ATP-grasp ligases of these candidates, however, suggesting that these 284 belong to yet to be uncovered biosynthetic pathways. 285 Among the candidate families were three families that contained homologs to mauE, and one that 286 additionally contained a homolog of mauD. The proteins encoded by these genes, along with other 287 proteins encoded in the mau gene cluster, are known to be involved in the maturation of of 288 methylamine dehydrogenase, which is required for methylamine metabolism. MauE in particular has 289 been speculated to play a role in the formation of disulfide bridges in the β-subunit of the protein, 290 while the exact function of MauD remains unclear 58 . As no other orthologs of the mau cluster were 291 found within the genomes of Streptomyces sp. 2112.3, Streptomyces viridosporus T7A or 292 Streptomyces sp. CS081A, it is unlikely that these proteins carry out this function. Rather, the 293 presence of these genes in a putative RiPP BGC suggests that they play a role in modification of RTEs 294 or RiPP precursors. Supporting this hypothesis, each of these gene clusters contained a gene 295 predicted to a encode for a precursor containing at least eight cysteine residues (Table S3). 296 Similarly, homologs of hypE and hypF were detected in a gene cluster containing another gene 297 encoding an ATP-grasp ligase. Genes encoding these proteins are typically part of the hyp operon, 298 which is involved in the maturation of hydrogenase. Specifically, the two proteins cooperate to 299 synthesize a thiocyanate ligand, which is transferred onto an iron center and used as a catalyst 59 . No 300 other homologs of genes in the hyp operon were detected, however, suggesting that these protein-301 coding genes have adopted a novel function. 302 The remaining 18 families could not be grouped under a single denominator, nor could any single 303 enzyme be found that clearly distinguished these groups as RiPP or non-RiPP BGCs. A wide variety of 304 enzymes was found to be encoded by these gene clusters, including p450 oxidoreductases, 305 flavoproteins, aminotransferases, methyltransferases and phosphatases. In addition (and in line with 306 features dominant in the positive training set), the predicted precursor peptides were often rich in 307 cysteine, serine and threonine residues (Table S3), which contain reactive hydroxyl and sulfide 308 moieties and are present in precursors of various known RiPP subclasses. 309 All candidate gene clusters presented here carry the features we selected, typical of RiPP BGCs: a low 310 frequency of occurrence among the scanned genomes, a suitable precursor peptide, candidate 311 modifying enzymes, transporters, regulators and peptidases. However, many known RiPP BGCs were 312 removed, suggesting that there may be more uncharacterized RiPP families among the gene clusters 313 we discarded. While the complete dataset could not be covered here, the command-line application 314 of decRiPPter has been set up to allow users to set their own filters. In addition, decRiPPter runs are 315 visualized in an HTML output, in which the results can be further browsed and filtered by Pfam 316 domains and other criteria, allowing users to find candidate families according to their preferences. 317 The results from this analysis of the strict and the mild filter is available at 318 http://www.bioinformatics.nl/~medem005/decRiPPter_strict/index.html and 319 http://www.bioinformatics.nl/~medem005/decRiPPter_mild/index.html, respectively. 320

Discovery of a novel family of lanthipeptides 321
To validate the capacity of decRiPPter to find novel RiPP subclasses, we set out to experimentally 322 characterize one of the candidate families (Figure 2; Other; red marker). Gene clusters belonging to 323 this family shared several genes encoding flavoproteins, methyltransferases, oxidoreductases and 324 occasionally a phosphotransferase. Importantly, the predicted precursor peptides encoded by these 325 putative BGCs showed clear conservation of the N-terminal region, while varying more in the C-326 terminal region ( Figure S10). This distinction is typical of RiPP precursors, as the N-terminal leader 327 peptide is used as a recognition site for modifying enzymes, while the C-terminal core peptide can be 328 more variable 20 . 329 One of the gene clusters belonging to this candidate family was identified in Streptomyces 330 pristinaespiralis ATCC 25468 ( fig 3A; Table 2). S. pristinaespiralis is known for the production of 331 pristinamycin, and was selected for experimental work since the strain is genetically tractable 60,61 . 332 The gene cluster was named after its origin (spr: Streptomyces pristinaespiralis RiPP), and the genes 333 were named after their putative function. 334 The gene cluster contains four genes encoding putative precursor peptides, although only three of 335 the peptides (SprA1-A3) showed similarity to each other and to the other peptides in the same family 336 ( Figure S10). The fourth predicted precursor peptide (encoded by sprX) did not align with any of the 337 other peptides and was assumed to be a false positive. The products encoded by sprA1 and sprA2 338 were highly similar to one another compared to the sprA3 gene product. Occurrence of two distinct 339 genes for precursors within a single RiPP BGC is typical for two-component lanthipeptides 62 . have been reported with this modification, including lanthipeptides, cypemycins and thioviridamides, 346 although they are only consistently present in cypemycins and thioviridamides. This type of 347 modification is less common among lanthipeptides, with only nine out of 120 lanthipeptide gene 348 clusters in MIBiG encoding the required decarboxylase. Cysteine-decarboxylating genes are also 349 present in non-RiPP gene clusters (Table S4) and are also associated with other metabolic 350 pathways 64 . 351 A more detailed comparison with the gene clusters in MIBiG showed that two more genes from the 352 thioviridamide gene cluster were homologous to two genes encoding a predicted 353 phosphotransferase (sprPT) and a hypothetical protein (sprH3), respectively. Taken together with the 354 homologous cysteine decarboxylase, it appeared that our gene cluster was distantly related to the 355 thioviridamide gene cluster 65 . Thioviridamide-like compounds are primarily known for thioamide 356 residues, for which a TfuA-associated YcaO is thought to be responsible 29,66 . However, a YcaO 357 homologue was not encoded by the gene cluster, making it unlikely that this gene cluster should 358 produce thioamide-containing RiPPs. 359 Two strains were created to help determine the natural product specified by the BGC. For the first 360 strain, the entire gene cluster was replaced by an apramycin resistance cassette (aac3(IV)) by 361 homologous recombination with the pWHM3 vector 67 (spr::apra). In case the gene cluster was 362 natively expressed, this strain should allow for easy identification of the natural product by 363 comparative metabolomics. In the second approach, we sought to activate the BGC in case it was not 364 natively expressed. To this end, we targeted the cluster-situated luxR-family transcriptional 365 regulatory gene sprR. The sprR gene was expressed from the strong and constitutive gapdh promoter 366 from S. coelicolor (p gapdh ) on the integrative vector pSET152 68 . The resulting construct (pAK1) was 367 transformed to S. pristinaespiralis by protoplast transformation. 368 To assess the expression of the gene cluster in the transformants, we analyzed changes in the global 369 expression profiles in 2 days and 7 days old samples of NMMP-grown cultures using quantitative 370 proteomics ( Figure 3B). Aside from the regulator itself, six out of the sixteen other proteins were 371 detected in the strain containing expression construct pAK1, while only SprPT could be detected in 372 the strain carrying the empty vector pSET152. SprPT was also detected in the proteome of spr::apra, 373 however, indicating a false positive. In the wild-type strain, SprT3 and SprR were detected, but only 374 in a single replicate and at a much lower level. Overall, these results suggest that under the chosen 375 growth conditions the gene cluster was expressed at very low amounts in wild-type cells, and was 376 activated when the expression of the likely pathway-specific regulatory gene was enhanced. This 377 makes spr a likely cryptic BGC. 378 To see if a RiPP was produced, the same cultures used for proteomics were separated into mycelial 379 biomass and supernatant. The biomass was extracted with methanol, while HP20 beads were added 380 to the supernatants to absorb secreted natural products. Analysis of the crude methanol extracts and 381 the HP20 eluents with HPLC-MS revealed several peaks eluting between 5.5 and 7 minutes in the 382 methanol extracts (fig 3C), which were not found in extracts from wild-type strain or the strain 383 containing the empty vector. Feature detection with MZMine followed by statistical analysis with 384 MetaboAnalyst revealed seven unique peaks, with m/z between 707.3534 and 918.0807 ( Figure S11). 385 The isotope patterns of these peaks showed that the six of the corresponding compounds were triply 386 charged. Careful analysis of derivative peaks with mass increases consistent with Na-or K-addition, 387 led to the conclusion that these peaks corresponded to the [M+3H] 3+ adduct, suggesting a 388 monoisotopic masses in the range of 2,604.273 and 2,754.242 Da . The highest signal came from the 389 compound with monoisotopic mass of 2,703.245. Four of the other masses seemed to be related to 390 this mass, as they were different in increments of 4, 14, or 16 Da (Table S5). We therefore reasoned 391 that the mass of 2,703.245 Da was the final product, while others were incompletely processed 392 peptides. 393 To further verify that the identified masses indeed belonged to the RiPP precursors in our gene 394 cluster, we first removed the apramycin resistance cassette from Spr::apra using the pUWLCRE 395 vector 69 , creating strain Δspr. The expression construct pAK1 and an empty pSET152 vector were 396 transformed to the strain Δspr. When these strains were grown under the same conditions, the 397 aforementioned peaks were not detected, further suggesting that indeed they belonged to products 398 of this gene cluster ( Figure S12). 399 Most masses were detected in only low amounts. In order to resolve this, we created a similar 400 construct as pAK1, but this time using the low-copy shuttle vector pHJL401 as the vector 70 . The 401 plasmid pAK2 was introduced into S. pristinaespiralis and the transformants grown in NMMP for 7 402 days. Extraction of the mycelial biomass with methanol resulted in a higher abundance of the masses 403 previously detected ( Figure S13). Consistent with the MS profiles of pAK1 transformants, also pAK2 404 transformants produced an abundant peak corresponding to a monoisotopic mass of 2,703.245 Da, 405 as well as a second peak corresponding to a monoisotopic mass of 2,553.260 Da. Most of the other 406 masses could be related to one of these two masses, suggesting these are the final products, related 407 to two distinct precursors (Tables S5 and S6). 408 We then performed MS-MS analysis of the extracts of the pAK2 transformants to identify the 409 metabolites and their expected modifications, such as Avi(Me)Cys moieties. The fragmentation 410 pattern of the mass of 2,703.245 Da could be assigned to the sprA3 precursor, when several 411 modifications were applied ( Figure 3D, Table S7). Similarly, fragments with a mass of 2,553.260 could 412 be matched to the SprA2 precursors considering the same modifications ( Figure S14; Table S8). 413 Among the predicted modifications were N-terminal methylation, which was supported by the 414 presence of the methyltransferase sprMe in the gene cluster. Secondly, the C-terminal cysteine was 415 predicted to have undergone oxidative decarboxylation (-46 Da), as expected based on the presence 416 of the gene sprF2 in the gene cluster. In addition, many of the serines and threonines could only be 417 matched when their masses were altered by -16 or -18 Da. These mass differences are typical of 418 dehydration (-18 Da) of the residues to dehydroalanine and dehydrobutyric acid. Reduction of these 419 dehydrated amino acids (+2 Da) would then give rise to alanine and butyric acid residues, a 420 modification which has been reported for lanthipeptides 71 . 421 To test for the presence of dehydrated serines and threonines, we treated the purified product with 422 dithiothreitol (DTT), which covalently attaches to these residues via 1,4 nucleophilic addition 72 . 423 Treatment with DTT resulted in the addition of up to two adducts, showing the presence of 424 dehydrated residues, although one fewer than expected ( Figure S15). The fact that two of the 425 dehydrated residues are adjacent to one another may have resulted in steric hindrance, preventing 426 full conversion. 427 Surprisingly, no fragments were found of the residues S -18 S -18 T -18 WC in the center of SprA3, or for the 428 N-terminal T -18, +28 T -18 PVC region. Considering the other modifications typical of lanthipeptides, we 429 hypothesized the presence of thioether crosslinks between the dehydrobutyric acids and cysteines. 430 To find further support for this hypothesis, we treated the purified product of SprA3 with 431 iodoacetamide (IAA). Iodoacetamide alkylates free cysteines, while cysteines in thioether bridges 432 remain unmodified 73 . In agreement with our hypothesis, treatment with iodoacetamide did not 433 affect the observed masses, despite the presence of three cysteines in the peptide ( Figure S10). In 434 addition, we hydrolyzed the purified peptide with 6M HCl at 110°C for 24h. Under these conditions, 435 the amide bond should be hydrolyzed, while the thioether bond should be unaffected 74 . The 436 resulting mixture of amino acids both contained masses corresponding to a cysteine linked to either 437 a dehydrated serine, or to a twice methylated, dehydrated threonine (Table S10). The C-terminal 438 predicted AviMeCys was not detected, although this may be explained by the presence of the alkene 439 in the moiety, which are likely to react under acidic conditions. 440 Many of the other masses found were higher when compared to the product of SprA3 by increments 441 of 16 Da, suggesting that the peptide was incompletely processed. The fragmentation patterns of 442 these masses could not be unambiguously resolved ( Figure S16). An unmodified serine or threonine 443 could occur at several places within the precursor, and each of the possible outcomes would likely 444 give rise to compounds with identical mass and very similar hydrophobic properties, which would not 445 be separated properly. Overall, these results further reinforce the idea that the compound with 446 monoisotopic mass of 2,703.245 Da belongs to the fully modified product, while the others are 447 derived from it. 448

The sprH3/sprPT gene pair is present in a wide variety of RiPP-like contexts 449
Taken together, we have shown that the SprA3 precursor contained a number of posttranslational 450 modifications that are typical of lantibiotics. The conversion of serine/threonine to alanine/butyric 451 acid via reduction, the creation of an AviCys moiety and the crosslinks to form thioether bridges are 452 all found in lanthipeptides, and are dependent on dehydration of serine and threonine residues. Four 453 different sets of enzymes, called LanBC, LanM, LanKC and LanKL can catalyze these reactions in the 454 biosynthesis of lanthipeptides and are used to designate the lanthipeptide type. 455 As stated before, no members of any of these enzyme families were found to be encoded by the 456 gene cluster studied. However, sprH3 and sprPT showed homology to two uncharacterized genes of 457 the thioviridamide BGC. Thioviridamide contains an AviCys moiety, the formation of which requires a 458 dehydrated serine residue. The enzymes responsible for dehydration and subsequent cyclization 459 have not been identified yet 65,75 . Since both gene clusters share a common modification for which 460 the enzyme is unknown, we hypothesized that sprH3 and sprPT should be responsible for 461 dehydration and cyclization, and thus are hallmarks for a new lanthipeptide subtype, which we 462 designate type V. 463 Lanthipeptide core modifying enzymes catalyze the most prominent reaction in lanthipeptide 464 maturation, and as such, are present in many different genetic contexts 54 . To validate that SprH3 and 465 SprPT are the sought-after modifying enzymes, we studied the distribution of the SprH3/PT gene pair 466 across Streptomyces genomes analyzed by decRiPPter. Using CORASON 76 with the sprPT gene as a 467 query yielded 195 homologs in various gene clusters (Figure 4). The sprPT/sprH3 gene pair was 468 completely conserved across all gene clusters for which an uninterrupted contig of DNA was 469 available. , strongly supporting their functional interaction and joint involvement. Using the sprH3 470 gene as a query yielded similar results (data not shown). A total of 391 orthologs of the gene pair 471 were found outside Streptomyces, particularly in Actinobacteria (219) and Firmicutes (161; Figure  472 S17). Distantly similar homologs of the gene pair were also identified in Cyanobacteria, 473 Plantomycetes and Proteobacteria. 474 Among the 195 identified gene clusters in Streptomyces, the majority (131) overlapped with a gene 475 cluster detected by decRiPPter, indicating that the gene pair was within short intergenetic distance 476 from predicted precursor gene in the same strand orientation. A large fraction (80) also passed the 477 strictest filtering (see Table I), showing that among these gene clusters were many encoding 478 biosynthetic machinery, peptidases and regulators. In contrast, only nine of the gene clusters 479 overlapped with a BGC identified by antiSMASH. Four of these showed the gene pair in apparent 480 operative linkage with a bacteriocin gene cluster, marked as such by the presence of a DUF692 481 domain, which is often associated with small prepeptides such as methanobactins. Another four gene 482 clusters detected by decRiPPter were only overlapping due to the gene pair being on the edge of a 483 neighboring gene cluster. 484 The genetic context of the gene pairs showed a wide variation (Figure 4, right side). While some gene 485 clusters were mostly homologous to the spr gene cluster (Figure 4, group g-h), others shared only a 486 few genes (groups a and d), and some only shared the gene pair itself (groups b, c and e). Many other 487 predicted enzyme families were found to be encoded inside these gene clusters, including YcaO-like  488 proteins, glycosyltransferases, sulfotransferases and aminotransferases. alanine residues as a result of conversion of dehydrated serine residues 77 . While all the genes, 508 including the precursors, were well conserved between the two gene clusters, the ltnJ gene had been 509 replaced by an sprOR homolog, suggesting that their gene products catalyze similar functions ( Figure  510 S18). 511

Conclusion and final perspectives 512
The continued expansion of available genomic sequence data has allowed for discovery of large 513 reservoirs of natural product BGCs, fueled by sophisticated genome mining methods. These methods 514 must make tradeoffs between novelty and accuracy 12 . Tools primarily aimed at accuracy reliably 515 discover large numbers of known natural product BGCs, but are limited by specific genetical markers. 516 On the other hand, while tools aimed at novelty may discovery new natural products, these tools 517 have to sacrifice on accuracy, resulting in a larger amount of false positives. 518 Here, we take a new approach to natural product genome mining, aimed specifically at the discovery 519 of novel types of RiPPs. here inevitably gives rise to a higher number of false positives, we feel that such a 'low-confidence / 527 high novelty' approach 12 is necessary for the discovery of completely novel RiPP families. 528 Additionally, users are able to set their own filters for the identified gene clusters, allowing them to 529 search candidate RiPP families containing specific enzymes or enzyme types within a much more 530 confined search space compared to manual genome browsing. 531 The product of one of the candidate classes was characterized as the first member of a new class of 532 lanthipeptides (termed 'type V') that was not detected by any other RiPP genome mining tool. 533 Variants of this gene cluster are widespread across Streptomyces species, further expanding one of 534 the most widely studied RiPP families. In addition, two proposed core genes were used to expand the 535 family by finding additional homologs in Actinobacteria and Firmicutes. Taken together, this work 536 shows that known RiPP families only cover part of the complete genomic landscape, and that many 537 more RiPP families likely remain to be discovered, especially when expanding the search space to the 538 broader bacterial tree of life. genomes, all genes smaller than 100 amino acids are analyzed by the SVM classifier, which finds 544 candidate precursors. The gene clusters formed around the precursors are analyzed for specific 545 protein domains. In addition, all COG scores are calculated to act as an additional filter, and to aid in 546 gene cluster detection. The remaining gene clusters are clustered together and with MIBiG gene 547 clusters to dereplicate and organize the results. In addition, overlap with antiSMASH detected BGCs 548 is analyzed (4). 549 550 551 Figure 2. decRiPPter finds 42 candidate RiPP families with a large variety of encoded modifying 552 enzymes and precursors . Gene clusters found in 1,295 Streptomyces genomes were passed through 553 a strict filter and grouped together (see main text). Arrow colors indicate enzyme family of the 554 product, and the description of gene products is given below the arrows. Roughly a third of the 555 remaining candidates overlapped with or were similar to RiPP BGCs predicted by antiSMASH. 556 Another third of the remaining candidates were discarded as likely false positives (see main text). Of 557 the remaining 42 candidate RiPP families, 15 example gene clusters are displayed. 558 559 560 Figure 3. The Streptomyces pristinaespiralis RiPP (spr) gene cluster produces a highly modified 561 RiPP. A) The spr gene cluster encodes three putative precursors, three transporters, a peptidase and 562 an assortment of modifying enzymes (see Table 1). B) Protein abundance of the products of the spr 563 gene cluster in S. pristinaespiralis ATCC 25468 and derived strains. Strains were grown in NMMP and 564 samples were taken after 2 and 7 days. Enhanced expression of the regulator (from construct pAK1) 565 resulted in the partial activation of the gene cluster. Genes that could not be detected are not 566 illustrated. C) Chromatogram of crude extracts from strains grown under the same conditions as 567 under A), samples after 7 days. Several peaks were detected in the extract from the strain with 568 expression construct pAK1 between 7 and 8 minutes. C) b and y ions detected from one of the 569 predominant peaks found in the crude extract (corresponding to monoisotopic mass of 2703.235 570 Da). The fragmentation pattern could be matched to the sprA3 precursor. Phylogenetic tree of gene clusters containing homologs of sprPT and sprH3, visualized by CORASON 76 . 575 A red dot indicates that the genes were present in a gene cluster found by decRiPPter, a yellow dot 576 that it passed the strict filter (see Table 1 for details). A blue dot indicates overlap with a BGC 577 identified by antiSMASH. (Right side) Several gene clusters with varying genetic contexts are 578 displayed. Group (g) represents the query gene cluster. The genetic context varies, while the gene 579 pair itself is conserved. Color indicates predicted enzymatic activity of the gene products as described 580 in the legend. 581 582 583