Abstract
Compared to eukaryotes, repetitive sequences are rare in bacterial genomes and usually do not persist for long. Yet, there is at least one class of persistent prokaryotic mobile genetic elements: REPINs. REPINs are non-autonomous transposable elements replicated by single-copy transposases called RAYTs. REPIN-RAYT systems are mostly vertically inherited and have persisted in individual bacterial lineages for millions of years. Discovering and analyzing REPIN populations and their corresponding RAYT transposases in bacterial species can be rather laborious, hampering progress in understanding REPIN-RAYT biology and evolution. Here we present RAREFAN, a webservice that identifies REPIN populations and their corresponding RAYT transposase in a given set of bacterial genomes. We demonstrate RAREFAN’s capabilities by analyzing a set of 49 Stenotrophomonas maltophilia genomes, containing nine different REPIN-RAYT systems. We guide the reader through the process of identifying and analyzing REPIN-RAYT systems across S. maltophilia, highlighting erroneous associations between REPIN and RAYTs, and provide solutions on how to find correct associations. RAREFAN enables rapid, large-scale detection of REPINs and RAYTs, and provides insight into the fascinating world of intragenomic sequence populations in bacterial genomes.
Introduction
Repetitive sequences in bacteria are rare compared to most eukaryotic genomes. In eukaryotic genomes, repetitive sequences are the result of the activities of persistent parasitic transposable elements. In bacteria, in contrast, parasitic transposable elements cannot persist for long periods of time (Park et al. 2021; van Dijk et al. 2022). To persist in the gene pool, transposable elements have to constantly infect novel hosts (Sawyer et al. 1987; Lawrence et al. 1992; Bichsel et al. 2010; Rankin et al. 2010; Wu et al. 2015; Park et al. 2021). Yet, there is at least one exception: a class of transposable elements called REPINs.
REPINs are short (~100 bp) nested palindromic sequences (Figure 1) that consist of two inverted REP (repetitive extragenic palindromic (Higgins et al. 1982)) sequences that can be present hundreds of times per genome (Bertels, Rainey 2011a). Most REPINs are symmetric where the 5’ REP sequences is identical to the 3’ REP sequences, with the occasional substitution (Bertels, Rainey 2011a; b). However, there are also asymmetric REPINs where the 5’ REP sequence differs from the 3’ REP sequence by a point deletion or insertion (Bertels, Rainey 2011a, 2022), which makes the analysis and detection significantly more difficult (e.g., Escherichia coli REPINs). Isolated REP sequences, REP singlets can also be found in the genome. These sequences are decaying remnants of REPINs that are not mobile anymore (Bertels, Rainey 2011a). REPINs are non-autonomous mobile genetic elements, which means they require a RAYT (REP Associated tYrosine Transposase) transposase gene (also referred to as tnpAREP) to replicate inside the genome (Nunvar et al. 2010; Bertels, Rainey 2011a; Ton-Hoang et al. 2012).
Within a genome, each REPIN population is usually only associated to a single RAYT gene. Hence, RAYT genes occur only in single copies per genome and do not copy themselves, unlike for example insertion sequences where often multiple identical sequences are present inside the genome. Unlike insertion sequences RAYT genes are also only inherited vertically, meaning they are host-beneficial transposases that are coopted by the host (Bertels, Gallie, et al. 2017; Bertels, Rainey 2022). The fact that REPINs and their corresponding RAYT genes are confined to a single bacterial lineage makes them very special, in comparison to all other parasitic mobile genetic elements in bacterial genomes (Bertels, Rainey 2022).
Of a total of five different RAYT families, there are only two RAYT families that are associated with REPINs: Group 2 and Group 3 RAYTs (Bertels, Gallie, et al. 2017). Group 2 RAYTs are present in most Enterobacteria and usually occur only once per genome associated with a single REPIN population. In contrast, Group 3 RAYTs are found in most Pseudomonas species and are usually present in multiple divergent copies per genome, each copy associated with a specific REPIN population (Bertels, Gallie, et al. 2017).
REPINs and their corresponding RAYT genes occur exclusively in bacterial genomes and are absent in eukaryotic or archaeal genomes (Bertels, Gallie, et al. 2017; Bertels, Rainey 2022). Within bacterial genomes REPINs and RAYTs have been evolving in single bacterial lineages for millions maybe even for a billion years (Bertels, Gallie, et al. 2017). The long term persistence of REPINs in single bacterial lineages can also be observed when analyzing REPIN populations (Bertels, Gokhale, et al. 2017; Bertels, Rainey 2022).
Parasitic insertion sequences usually occur in identical copies in bacterial genomes reflecting the fact that insertion sequences persist only briefly in the genome before they are eradicated from the genome or kill their host (Park et al. 2021). REPINs in contrast are only conserved at the ends of the sequence (presumably due to selection for function), the rest of the sequence is highly variable and only the hairpin structure is conserved (Bertels, Rainey 2011a). The sequence variability of REPINs within the same genome reflects their long-term persistence in single bacterial lineages (Bertels, Rainey 2022). REPINs cannot simply reinfect another bacterial lineage since they rely for mobility on their corresponding RAYT, which itself is immobile.
RAYTs and REPINs are distinct from typical parasitic insertion sequences, yet we know very little about their evolution or biology. Currently, it is completely unclear what kind of beneficial function maintains REPINs and RAYTs for millions of years in the genome. The reason for our lack of knowledge is not because REPINs and RAYTs are rare. They are ubiquitously found in many important and well-studied model bacteria such as Enterobacteria, Pseudomonads, Neisseriads, Xanthomonads. Microbial molecular biologists presumably encounter REPINs quite frequently. However, connecting the presence or absence of REPINs/RAYTs with phenotypes is difficult if we do not know when it is a REPIN that is present close to a gene of interest or a different type or repeat sequence. Even if the scientist knows about the presence of a REPIN it is probably also important to know whether a corresponding RAYT is present, since the function of REPINs largely depends on the function of the presence of a corresponding RAYT gene (Bertels, Rainey 2022).
Yet, the identification of REPIN populations and their corresponding RAYTs can be rather cumbersome if done from scratch. This is particularly true if the microbial molecular biologist is not aware of all the ins and outs of REPIN and RAYT biology. Identifying REPINs starts with an analysis of short repetitive sequences in the genome. If there are excessively abundant short sequences present in the genome, the distribution of these sequences is analyzed next. If they are exclusively identical tandem repeats without sequence variation and present in only one or two loci in the genome then these sequences are probably part of a CRISPR array and not REPINs.
Here, we present RAREFAN (RAYT/REPIN Finder and Analyzer), a webservice that automates the identification of REPINs and their corresponding RAYTs. RAREFAN is publicly accessible at http://rarefan.evolbio.mpg.de and identifies REPIN populations and RAYTs inside a set of bacterial genomes. RAREFAN also generates graphs to visualize the population dynamics of REPINs, and assigns RAYT genes to their corresponding REPIN groups. Here we will demonstrate RAREFAN’s functionality by analyzing REPIN-RAYT systems in the bacterial species Stenotrophomonas maltophilia.
Methods
Implementation
RAREFAN is a modular webservice. It consists of a web frontend written in the python programming language (Van Rossum, Drake Jr 1995) using the flask framework (Grinberg 2018), a java (Arnold et al. 2005) backend for genomic sequence analysis and an R (R Core Team 2016) shiny app (RStudio, Inc 2013) for data visualization. The software is developed and tested on the Debian GNU/Linux operating system (Kleinmann et al. 2021). All components are released under the MIT opensource license (Initiative 2021) and can be obtained from our public GitHub repository at https://github.com/mpievolbio-scicomp/rarefan.
The public RAREFAN instance at http://rarefan.evolbio.mpg.de runs on a virtual cloud server with 4 single-threaded CPUs and 16GB of shared memory provided and maintained by the Gesellschaft für Wissenschaftliche Datenverarbeitung Göttingen (GWDG) and running the Debian GNU/Linux Operating System (Kleinmann et al. 2021).
The java backend drives the sequence analysis. It makes system calls to TBLASTN (Altschul et al. 1990) to identify RAYT homologs and to MCL (Van Dongen 2000) for clustering REPIN sequences in order to determine REPIN populations.
Jobs submitted through the web server are queued and executed as soon as the required resources become available. Users are informed about the status of their jobs. After job completion, the user can trigger the R shiny app to visualize the results.
The java backend can also be run locally via the command line interface (available for download at https://github.com/mpievolbio-scicomp/rarefan/releases).
Usage of the webservice
The front page of our webservice allows users to upload their bacterial genomes in FASTA (.fas) format (Figure 2A). Optionally, users may also provide RAYT protein FASTA sequences (.faa) or phylogenies in NEWICK (.nwk) format. After successful completion of the upload process, the user fills out a web form to specify the parameters of the algorithm:
Reference sequence: Which of the uploaded genome sequences will be designated as reference genome (see below for explanations). Defaults to the first uploaded filename in alphabetical order.
Query RAYT: The RAYT gene that is used to identify homologous RAYTs in the query genomes.
Tree file: A phylogenetic tree of the reference genomes that can be provided by the user, otherwise the tree will be calculated using andi (Haubold et al. 2015).
Minimum seed sequence frequency: Lower limit on seed sequence frequency in the reference genome to be considered as a REP candidate. Default is 55.
Seed sequence length: The seed sequence length (in base pairs) is used to identify REPIN candidates from the input genomes. Default is 21 bp.
Distance group seeds: The maximum distance between a single occurrence of short repetitive sequences to still be sorted into the same sequence group.
Association distance REPIN-RAYT: The maximum distance at which a REP sequence can be located from a RAYT gene to be linked to that RAYT gene.
e-value cut-off: Alignment e-value cut-off for identifying RAYT homologs with TBLASTN. Default is 1e-30.
Analyse REPINs: Ticked REPINs will be analysed (two inverted REP sequences found at a distance of less than 130 bp), if not ticked only short repetitive 21 bp long sequence will be analysed.
User email (optional): If provided, then the user will be notified by email upon run completion.
The job is then ready for submission to the job queue. Upon job completion, links to browse and to download the results, as well as a link to a visualization dashboard are provided. If a job runs for a long time then users may also come back to RAREFAN at a later time, query their job status and eventually retrieve their results by entering the run ID into the search field at http://rarefan.evolbio.mpg.de/results. Relevant links and the run ID are communicated either on the status site or by email if the user provided their email address during run configuration. Runs are automatically deleted from the server after six months.
Identification of REPs and REPINs
The algorithm to determine REP sequence groups has been described in previous papers and is slightly improved (Bertels, Rainey 2011a, 2022; Bertels, Gokhale, et al. 2017). In our current implementation REPs/REPIN populations are now automatically linked to RAYT genes.
First, all N bp (21 bp by default) long seed sequences that occur more than M times (55 by default) are extracted from the reference genome. N and M are the seed sequence length and minimum seed sequence frequency, respectively (Figure 2B). All sequences occurring within the reference genome at least once within 15 bp of each other are then grouped together into n REP sequence groups (numbered 0-(n-1)). The most common sequence in each group, named REP seed sequence, is used for further analyses in each input genome.
In the next step all possible point mutants of the seed sequences are generated and searched for in the genome (Figure 2C). If a sequence is found in the genome, then all possible point mutations are generated for this sequence as well and so on until no more sequences can be identified. If two sequences are found within 130 bp of each other in inverted orientation, then these are designated REPINs.
Among all identified REP and REPIN sequences REPIN populations can be isolated. REPIN populations are determined by applying MCL using an inflation parameter of 1.2 (Van Dongen 2000) to a network of REP/REPIN sequences where all sequences that differ by exactly one nucleotide are connected. The clustering results are stored in a file ending in .mcl. The sequences of the largest REPIN population (excluding REP singlets) are isolated in a file ending in largestCluster.nodes. The largest REPIN populations are shown in the REPIN population plot and the master sequence correlation plot (Figure 4).
Identification of RAYTs
RAYTs are identified using TBLASTN (Camacho et al. 2009) with either a protein sequence provided by the user, a Group 2 RAYT from E. coli (yafM, Uniprot accession Q47152) or a Group 3 RAYT from P. fluorescens SBW25 (yafM, Uniprot accession C3JZZ6). The presence of RAYTs in the vicinity (default 200 bp) of a particular REPIN can be used to establish the association between the RAYT gene and a REPIN group (Figure 2D). All positions of all REPINs and REP sequences of a REPIN group are checked whether they occur within 200 bp (by default) of a RAYT gene. If so then the RAYT gene is linked to the REPIN group in the file repin_rayt_association.txt.
Visualizations
For each REPIN-RAYT group summary plots are generated. These include plots showing the RAYT phylogeny (calculated from a nucleotide alignment using MUSCLE (Edgar 2004) and PHYML (Guindon et al. 2010) to generate a phylogeny), REPIN population sizes in relation to the genome phylogeny (provided by the user or if not provided calculated by andi (Haubold et al. 2015)) as well as the proportion of master sequences (most common REPIN in a REPIN population) in relation to REPIN population size (Figure 2E).
Other outputs
Identified REPINs, REP singlets as well as RAYTs are written to FASTA formatted sequence files and to tab formatted annotation files that can be read with the Artemis genome browser (Rutherford et al. 2000). The REPIN-RAYT associations as well as the number of RAYT copies per genome are written to tabular data files. A detailed description of all output files is provided in the manual (http://rarefan.evolbio.mpg.de/manual) and in the file “readme.md” in the output directory.
Sequence analysis and annotation
For verification of RAREFAN results, REPIN-RAYT-systems were analysed in their corresponding genomes using Geneious Prime version 2022.2.2 (Kearse et al. 2012). Nucleotide sequences and positions of REP singlets, REPINs, and RAYTs were extracted from output files generated by RAREFAN and mapped in the relevant S. maltophilia genome. Complete RAREFAN data used for analysis can be accessed by using the run IDs listed in Table 1.
Results
RAREFAN can identify REPINs and their corresponding RAYTs in a set of fully sequenced bacterial genomes. The RAREFAN algorithm has been used in previous analyses to identify and characterize REPINs and RAYTs in Pseudomonads (Bertels, Rainey 2011a, 2022), Neisseriads (Bertels, Rainey 2022), and Enterobacteria (Bertels, Gallie, et al. 2017; Park et al. 2021). To demonstrate RAREFAN’s capabilities, we are presenting an analysis of 49 strains belonging to the opportunistic pathogen S. maltophilia.
S. maltophilia strains contain Group 3 RAYTs, which are also commonly found in plant-associated Pseudomonas species such as P. fluorescens or P. syringae (Bertels, Rainey 2011a, 2022). Similar to Group 3 RAYTs in other species, S. maltophilia contains multiple REPIN-RAYT systems per genome. Group 2 RAYTs, in contrast, contain only ever one REPIN-RAYT system per genome (Bertels, Rainey 2022).
Nine different REPIN-RAYT systems in S. maltophilia
REPIN-RAYT systems in S. maltophilia are surprisingly diverse compared to other species. For example, Pseudomonas chlororaphis contains three separate REPIN populations that are present in all P. chlororaphis strains, each associated with its cognate RAYT gene (Bertels, Rainey 2022). S. maltophilia, in contrast, contains only one REPIN-RAYT system that is present across almost the entire species (green clade in Figure 3), and at least eight REPIN-RAYT systems that are present in subsets of strains (nine clades in Figure 5).
The patchy presence-absence pattern of REPIN-RAYT systems in S. maltophilia, makes the dataset quite challenging to analyse. If a REPIN population is not present in the reference strain then RAREFAN will not be able to detect it in any other strain. Yet, it is possible to detect RAYT genes in all strains of a species independent of the reference strain selection. RAYT genes that are not associated to a REPIN population are displayed in grey (Figure 3A). While these RAYT genes are not associated to REPIN populations detected in the reference strain, they might still be associated with a yet unidentified REPIN type present in the genome the unassociated RAYT gene is located in.
In order to identify all REPIN populations across a species, we suggest to perform multiple RAREFAN runs with different reference strains. The RAREFAN web interface supports re-launching a given job with modified parameters. To identify as many different REPIN-RAYT systems as possible in each subsequent run the reference should be set to a genome that contains RAYTs that were not associated with a REPIN population previously (i.e., genomes containing grey RAYTs in Figure 3). However, this strategy may also fail when the REPIN population size falls below the RAREFAN seed sequence frequency threshold.
For example, S. maltophilia Sm53 contains three RAYTs only one of which is associated with a REPIN population (RAYT genes indicated in Figure 3A). However, the remaining two RAYTs are indeed associated with a REPIN population, but these REPIN populations are too small to be detected in S. maltophilia Sm53 (the seed sequence frequency threshold is set to 55 by default). In other S. maltophilia strains the REPIN populations are large enough to exceed the threshold. For example, if S. maltophilia AB550 is set as reference, RAYT number 1 from Sm53 (Figure 3A) is associated with the pink REPIN population (Figure 3D). If S. maltophilia 649 is set as reference RAYT number 3 from Sm53 (Figure 3A) is associated with the light green REPIN population (Figure 3C). RAYTs from the bottom clade are only associated with REPIN populations when S. maltophilia AA1 is chosen as reference (Figure 3B). While lower thresholds can guarantee that all REPINs will be identified in the genome, the number of sequence groups that are not REPINs quickly explodes. Especially for genomes that contain large numbers of mobile genetic elements or CRISPRs (Bertels, Rainey 2022).
RAREFAN visualizes REPIN population size and potential replication rate
The RAREFAN webserver visualizes REPIN population size and RAYT numbers in barplots. Barplots are ordered by the phylogenetic relationship of the submitted bacterial strains (Yu et al. 2018). RAREFAN detects three populations when S. maltophilia Sm53 is selected as reference strain (Figure 3A). The largest REPIN population (calculated by MCL from all REPINs of that type) has a corresponding RAYT gene in almost all strains (first barplot in Figure 4A) and most REPIN populations contain more than 100 REPINs (second barplot in Figure 4A). The second largest REPIN population in Sm53 (orange population in Figure 4B) is significantly smaller and contains no more than 61 REPINs in any strain and most strains do not contain a corresponding RAYT for this population.
RAREFAN also provides information on REPIN replication rate (Figure 4C and D). REPIN replication rate can be estimated by dividing the number of the most common REPIN sequence (master sequence) by the REPIN population size if the population is in mutation selection balance (Bertels, Gokhale, et al. 2017). If a REPIN replicates very fast most of the population will consist of identical sequences because mutations do not have enough time to accumulate between replication events. Hence, the proportion of master sequences will be high in populations that have a high replication rate. Transposable elements such as insertion sequences consist almost exclusively of identical master sequences because the time between replication events is not sufficient to accumulate mutations and because quick extinction of the element usually prevents the accumulation of mutations after replication (Park et al. 2021; Bertels, Rainey 2022). REPIN populations in contrast replicate slowly and persist for long periods of time, which means that a high proportion of master sequences suggests a high REPIN replication rate.
In S. maltophilia the proportion of master sequences in the population does not seem to correlate well with REPIN population size, both in the green and the orange population (Figures 4C and D). Similar observations have been made in P. chlororaphis (Bertels, Rainey 2022), and may suggest that an increase in population size is not caused by an increase in replication rate. Population size is likely to be more strongly affected by other factors such as the loss of the corresponding RAYT gene, which leads to the decay of the REPIN population. One could even speculate that high REPIN replication rates are more likely to lead to the eventual demise of the population due to the negative fitness effect on the host (Bertels, Rainey 2022).
The presence of RAYTs and the size of the corresponding REPIN population do correlate surprisingly well (Figure 4A and B, p-Value = 0.008 of the linear model of independent contrasts (Felsenstein 1985) of green RAYT and REPIN number, p-Value = 0.003 for orange REPIN populations). Green RAYTs are absent from an entire S. maltophilia clade (middle of Figure 4A). This clade has also lost a significant amount of green REPINs, and the proportion of the master sequences in these populations is low (Figure 4C). Similarly, genomes without orange RAYTs have smaller REPIN populations in the orange population than genomes with the corresponding RAYT (Figure 4D). A similar observation has been made previously in E. coli, P. chlororaphis, N. meningitidis and N. gonorrhoeae where the loss of the RAYT gene is followed by a decay of the associated REPIN population (Park et al. 2021; Bertels, Rainey 2022).
Linking REPIN populations with RAYT genes can be challenging
Unfortunately, RAREFAN is not always able to link the correct REPIN population with the correct RAYT gene. In some RAREFAN runs associations between RAYTs and REPINs are not monophyletic, as for example the red RAYTs in Figure 3A. However, the same clade of RAYTs is uniformly coloured in yellow in Figure 3D, suggesting that the entire RAYT clade is associated with the same REPIN group.
An analysis of all REPIN groups that were identified by RAREFAN across four different RAREFAN runs (Table 1, one additional analysis was performed with ISMMS3) showed that there are a total of nine different REPIN groups, each defined by an individual central palindrome (Table 2). Each REPIN group is associated with a monophyletic RAYT group (Figure 5). Only a single RAYT is not associated with a REPIN population (ISMMS3_1).
RAREFAN could not link a REPIN to the RAYT gene ISMMS3_1 (Figure 5, grey box). While there is a sequence that resembles the A palindrome as well as variants of the C palindrome flanking both sides of the RAYT gene (Supplementary Figure 2), none of the sequences formed REPIN populations large enough to be identified by RAREFAN. Presumably the RAYT ISMMS3_1, which is only present in a single S. maltophilia strain, is at the early stages of establishing a REPIN population, and the REPIN population has not spread to a considerable size yet.
There are two more cases where RAREFAN failed to link RAYT genes with any REPINs (ISSMS2_ and ISSMS2R_1, Supplementary Figure 1 D and E). Detailed sequence analyses showed that the respective REPINs are located at a distance of more than 130 bp (an adjustable parameter in RAREFAN). These REPINs are ignored by RAREFAN by default. However, this parameter can be adjusted manually and when set to a distance of 200 bp, RAREFAN correctly links these REPINs to the RAYT gene.
In three cases the wrong REPIN population was linked to a RAYT gene. In our dataset this can happen when RAYTs are flanked by seed sequences from two different REPIN populations (Supplementary Figure 1 A-C). A single REP sequence from the “wrong” (non-monophyletic RAYT) clade occurs together with multiple REP or REPIN sequences from the “right” (monophyletic in a different RAREFAN run) clade. REPINs are linked to the “wrong” RAYT when the correct REPIN population is absent in the chosen reference genome. This problem can be alleviated by performing analyses with multiple reference genomes and comparing the results.
REPIN groups may be lost when the seed distance is too large
The seed distance parameter determines whether two highly abundant sequences are sorted into the same or different REPIN groups (Figure 2B). If two REPINs from two different groups occur next to each other, at a distance of less than the seed distance parameter, then the two seeds are erroneously sorted into the same group. If two different REPIN groups are sorted into the same group then one of the groups will be ignored by RAREFAN, because only the most abundant seed in each group will be used to identify REPINs.
A manual analysis (e.g., multiple sequence alignment) of sequences in the groupSeedSequences folder of the RAREFAN output can identify erroneously merged REPIN groups. In S. maltophilia, groups are separated well when the distance parameter is set to 15 bp and Sm53 is used as a reference. When the parameter is set to 30 bp instead, one of the REPIN groups will be missed by RAREFAN.
A small seed distance parameter will separate seed sequences belonging to the same REPIN group into different groups. Hence, RAREFAN will analyse the same REPIN group multiple times. While this will lead to increased RAREFAN runtimes, these errors, are easy to spot, because (1) the same RAYT gene will be associated to multiple REPIN groups, (2) the central palindrome between the group is identical and (3) the master sequence between the groups will be very similar.
Closely related REPIN groups may be merged into a single group by RAREFAN
Incorrect merging of REPIN groups can occur when two REPIN groups are closely related. We identified merged REPIN groups in S. maltophilia because RAREFAN linked some REPIN groups with two RAYT genes in the same genome (Figure 6A). While REPIN groups linked to two RAYTs has been observed before in Neisseria meningitidis (Bertels, Rainey 2022), it is particularly unusual in S. maltophilia due to some key differences between REPIN-RAYT in the two bacterial species. First, RAYTs in N. meningitidis belong to Group 2 and RAYTs in S. maltophilia belong to Group 3 (Bertels, Gallie, et al. 2017), two very divergent RAYT groups. Second, RAYTs that are associated to the same REPIN group in N. meningitidis are almost identical, since they are copied by an insertion sequence in trans (Bertels, Rainey 2022), something that is not the case for S. maltophilia, where the two RAYTs are very distinct from each other (green and red clade in Figure 3A, or clade A and C in Figure 5).
A closer inspection of all sequences identified in REPIN group 2 shows that it also contains sequences belonging to REPIN group 0 (palindromes linked to clade A and C in Table 2). The relationship between the sequences shows that there is a chain of sequences that all differ by at most a single nucleotide between the most abundant sequence in group 2 to the most abundant sequence in group 0 (Figure 6B and C). Hence, the reason group 0 and group 2 are merged is that they are too closely related to each other and hybrids of the two REPIN groups exist. Because sequence groups are built by identifying all related sequences in the genome recursively, closely related groups (the REPIN group 0 seed only differs in four nucleotides from the REPIN group 2 seed sequence) can be merged into a single REPIN group. REPIN population size and RAYT number are the sum of REPIN group 0 and 2. There are various possibilities to resolve this issue: (1) subtract sequences from group 0 (which does not contain group 2) from REPIN group 2; (2) use a different sequence seed from the group 2 seed collection in the seed sequence file (groupSeedSequences/Group_Smalt_Sm53_2.out); (3) sometimes it may be possible to rerun RAREFAN with a different reference strain where the issue does not occur; or (4) increase the length of the seed sequence.
Performance
We measured the elapsed time for a complete RAREFAN run for three different species and for 5, 10, 20, and 40 genomes with randomly selected reference strains and the two query RAYTs (yafM_Ecoli and yafM_SBW25). For a given number N of submitted genomes of average sequence length L (in megabases), a RAREFAN run completes in approximately T = (8-10 seconds) * N * L on our moderate server hardware (4 CPU cores, 16 GB shared RAM) (Supplementary Figure 3 and 4).
Discussion
RAREFAN allows users to quickly detect REPIN populations and RAYT transposases inside bacterial genomes. It also links the RAYT transposase genes to the REPIN population it duplicates. These data help the user to study REPIN-RAYT dynamics in their strains of interest without a dedicated bioinformatician, and hence will render REPIN-RAYT systems widely accessible.
One limitation of RAREFAN is that REPINs can only be identified in genomes when they are symmetric (Figure 1). Symmetric REPINs have seed sequences that can morph into each other by a series of single substitutions (intermediate sequences need to be present in the genome). A REPIN consists of a 5’ and a 3’ REP sequence. If one of these REP sequences contains an insertion or deletion, which the other REP sequence does not contain then RAREFAN will not recognize the second repeat of the seed sequence. In this case, RAREFAN will not be able to identify REPINs but can still be used to analyze REP singlet populations. To date, the only known asymmetric REPIN populations are found in E. coli. However, it is likely that asymmetric REPINs also exist in other microbial species.
RAREFAN sometimes cannot correctly divide REPINs into REPIN groups. Either because REPINs from different groups occur in close proximity in the genome, an issue that can easily be solved by adjusting a RAREFAN parameter, or because two REPIN groups are very closely related (Figure 6). Unfortunately, RAREFAN is not able to automatically detect and resolve the assignment of closely related REPINs into groups yet. Hence it is advisable to manually check associations between REPIN groups and RAYT genes by analyzing the composition of REPIN groups.
In the future we aim to make RAREFAN even more versatile and easier to use by, for example, automatically integrating data from public databases such as Genbank, and integrating RAREFAN into workflows such as Galaxy (Afgan et al. 2018).
RAREFAN makes the study of REPIN-RAYT systems more accessible to any biologist or bioinformatician interested in studying intragenomic sequence populations. Our tool will help understand the purpose and evolution of REPIN-RAYT systems in bacterial genomes.
Supplementary Figures
Acknowledgements
We would like to thank Prajwal Bharadwaj for assisting us with the sequence analysis and Jenna Gallie for valuable feedback on the manuscript.
Footnotes
We added a more extensive introduction and made significant changes to all sections. We also modified most of the figures.