Long identical sequences found in multiple bacterial genomes reveal frequent and widespread exchange of genetic material between distant species

Horizontal transfer of genomic elements is an essential force that shapes microbial genome evolution. Horizontal Gene Transfer (HGT) occurs via various mechanisms and has been studied in detail for a variety of systems. However, a coarse-grained, global picture of HGT in the microbial world is still missing. One reason is the difficulty to process large amounts of genomic microbial data to find and characterise HGT events, especially for highly distant organisms. Here, we exploit the fact that HGT between distant species creates long identical DNA sequences in genomes of distant species, which can be found efficiently using alignment-free methods. We analysed over 90 000 bacterial genomes and thus identified over 100 000 events of HGT. We further developed a mathematical model to analyse the statistical properties of those long exact matches and thus estimate the transfer rate between any pair of taxa. Our results demonstrate that long-distance gene exchange (across phyla) is very frequent, as more than 8% of the bacterial genomes analysed have been involved in at least one such event. Finally, we confirm that the function of the transferred sequences strongly impact the transfer rate, as we observe a 3.5 order of magnitude variation between the most and the least transferred categories. Overall, we provide a unique view of horizontal transfer across the bacterial tree of life, illuminating a fundamental process driving bacterial evolution.


Introduction
material between distant species -because discovering such long-distance transfers requires the application 23 of computationally costly methods to very large numbers of genomes -are rarely studied. 24 In this study we use a novel approach to address these questions. Our method is based on the analysis 25 of long exact sequence matches found in the genomes of distant bacteria. Exact matches can be identified 26 very efficiently using alignment-free algorithms [17], which makes the method much faster than previous 27 methods that rely on alignment tools. This allows us to study transfer events between 1 343 042 bacterial 28 contigs, belonging to 93 481 genomes, encompassing a total of 0.4 Tbp. We identified all long exact matches 29 shared between bacterial genomes from different genera. Such long matches are unlikely to be vertically 30 inherited, and we therefore assume that they result from HGT. 31 In a quarter of all bacterial genomes, we detected HGT across family borders, and 8% participated 32 in HGT across phyla. This shows that genetic material frequently crosses distant taxonomic borders. The 33 length distribution of exact matches can be accounted for by a simple model that assumes that exact matches 34 are continuously produced by transfer of genetic material and subsequently degraded by mutation. Fitting 35 this model to empirical data allow us to estimate the effective rate at which HGT generates long sequence 36 matches in distant organisms. Furthermore, the large number of transfer events identified allows us to 37 conduct a functional analysis of horizontally transferred genes. 38 2. Results 39 2.1. HGT detection using exact sequence matches 40 We identified HGT events between distant bacterial taxa by detecting long exact sequence matches 41 shared by pairs of genomes. We exploit that, in phylogenetically distant genome pairs, sequences that 42 are shared by both genomes due to linear descent (orthologous sequences) have low sequence identity. 43 Therefore long sequence matches in such orthologs are exceedingly rare. Generally, the average nucleotide 44 sequence identity between bacterial genomes selected from different genera is at most 60 to 70% [61]. 45 In the absence of HGT, the probability of observing an exact match longer than 300 bp between a given 46 pair of genomes is then extremely small (of the order of 10 −40 if we assume uniform divergence along the genomes). Thus, even if millions of genome pairs with such divergence are analysed, the probability to 48 observe even one such a match remains extremely low, such that one does not expect to find a single hit of 49 this size between any two bacterial genomes by chance. Escherichia coli and Salmonella enterica (Fig 1A), we observe numerous exact matches smaller than 300bp 52 along the diagonal, revealing a conservation of the genomic architecture at the family level. Filtering out 53 matches shorter than 300bp (Fig 1B) completely removes the diagonal line, confirming that exact matches 54 in the orthologous sequences of these genomes are invariably short. 55 Because very long exact sequence matches are extremely unlikely in orthologs, those that do occur 56 are most likely xenologs: sequences that are shared due to relatively recent events of HGT. As an example,  While the vast majority of matches is very short (< 25 bp), matches with a length of at least 300 bp 85 do occur and contribute a thick tail to the MLD (Fig. 2). Strikingly, over many decades this tail is well 86 described by a power law with exponent -3:  The same power law was found if the analysis was restricted to matches between genomes from two partic-88 ular genera (Fig. 2).

89
Note that the number of long matches found in a single pair of genomes is usually very small, prohibit-90 ing a statistical analysis of their match length distribution. Hence, in this study we conduct all statistical 91 analyses at the level of genera. The MLD for a pair of genera G 1 and G 2 is defined as the normalized length 92 distribution of the matches found in all pairwise comparisons of a contig from G 1 and a contig from G 2 (see  shorter matches, resulting from all past HGT events, converges to a steady state that for 1 bp r < K is given by the power law m(r) = A/r −3 , with prefactor: The analysis above has allowed us to identify a large number of HGT events. In addition, the derivations 117 in the previous section provide a method to quantify the effective HGT rate between any two taxa by 118 measuring the prefactor A. As Supp. file 1 (resp. Supp. file 2), we provide the value of A for all pairs 119 of families (genera). Using these methods, we then studied the HGT rate between all pairs of bacterial 120 families in detail. visual inspection of the heatmap reveals that the HGT rate varies drastically (from 10 − 16 to 10 − 8) from one 125 pair to another (Fig. 3). First, the large squares on the diagonal of the heatmap indicate that HGT occurs 126 more frequently between taxonomically related families. This is especially apparent for well-represented 127 phyla including Bacteriodetes, Proteobacteria, Firmicutes, and Actinobacteria. Yet, we also observe a high 128 transfer rate between many families belonging to very distant phyla, indicating that transfer events across 129 phyla are also frequent. Notably, we find that some families present a highly elevated HGT rate across

141
The data also unveil that the propensity of species to exchange genetic material is very heterogeneous,  First, we assessed the impact of the taxonomic distance between genera on the HGT rate. To do so, 151 we computed the prefactor A for pairs of genera at various taxonomical distances (Fig 5). On average this 152 prefactor decreases by orders of magnitude as the taxonomic distance between the genera increases (inset 153 of Fig 5). In particular, the average prefactor obtained when considering genera from the same family is 154 more than three orders of magnitude higher than when considering genera from different phyla. These   To further explore the factors that influence the value of A we calculated MLDs for sets of genera from 160 different ecological environments: gut, soil, or marine (Fig. S3), regardless of their taxonomic distance. Our 161 results suggest that the effective rate of HGT is about 1 000 times higher among gut bacteria than among 162 marine bacteria. This pattern is observed both for the rates of HGT within ecological environments (i.e., A similar analysis demonstrates that the HGT propensity among gram-positive bacteria and among 169 gram-negative bacteria is much larger than between these groups (see Fig. S4). The groups of bacteria with 170 GC poor and GC rich genomes exhibit a similar pattern (see Fig. S5). We note however, that all these 171 factors correlate with each other [30]. From our analysis, the contribution of each factor to the effective rate 172 of HGT therefore remains unclear. 173 2.6. Large variation in the HGT rates between different categories of genes.

174
To better understand the factors that explain variations in observed HGT rates, we next conducted a 175 functional analysis of transferred sequences. To functionally annotate the transferred sequences, we first 176 queried twelve databases, each specifically dedicated to genes associated with a particular function (see 177   Table S1). Comparing to a randomised set of sequences (see Methods) reveals that the gene functions of 178 the transferred sequences strongly impact the transfer rate, as we observe a 3.5 order of magnitude variation 179 between the most and the least transferred categories ( Fig. 6 and Table S1).

180
More specifically, antibiotic and metal resistance genes are among the most widely transferred classes of genes (resp. 37× and 4× enrichment compared to random expectation), in good agreement with previous

218
In this study, we developed a computationally efficient method to identify recent HGT events. This  One striking result of our analysis is the finding that HGT is also common between very distant organ-226 isms. Indeed, 8% of the organisms we studied have been involved in a transfer of genetic material with 227 at least one bacteria from another phylum in the last ∼ 1000 years. The molecular mechanisms at play 228 in these long distance transfer events remain to be elucidated, for instance via a dedicated study targeting 229 families with very high exchange rate we identified. Analysing the statistical properties of the exact se-230 quence matches in distantly related genera, we were able to quantify the effective rate of HGT for different 231 comparisons (See Supp. File 1 and 2 for an estimation of all pairwise HGT rate at the family and at the 232 genera level). Doing so, we find that the HGT rate varies dramatically between families (Fig. 3)  Finally, our functional analysis of the transferred sequences shows that the function of a gene also 239 strongly influences its chance of being exchanged (Fig. 4). As expected, genes conferring antibiotic resis-240 tance are the most widely transferred. In contrast, some functional categories are strongly underrepresented 241 in the pool of transferred genes. For instance, genes that are involved in transcription, translation, and re-242 lated processes as well as those involved in metabolism are all depleted in our HGT database. One potential 243 explanation could be that these genes generally co-evolve with their binding partners [33]. As such, their 244 transfer would be beneficial to the host species only if both the effector and its binding partner were to be 245 transferred together. As simultaneous HGT of several genes from different genome loci is very unlikely 246 (unless they are co-localized), these genes are not prone to HGT. In addition, transcription, translation, and 247 related processes are core functions that are ubiquitous in bacterial species. As such, transfer of such genes 248 is unlikely to grant the receiving species a new function. Hence, transfers of housekeeping gene are less 249 likely to confer a strong evolutionary advantage, which could explain their under-representation in the HGT 250 dataset. 251 We found that the tail of the MLD follows a power law with exponent -3. This observation is particularly

283
In Fig. 2, 5 and S2, S3, S4, S5 we show MLDs based on the matches found between pairs of sequences from two sets of genera. These MLDs were calculated as follows: where the index i runs over the genera from the first set and the index j runs over the genera from the second 286 set. is expected to be broken into shorter ones due to random mutations, which we assume occur at a constant 293 effective rate µ = (µ 1 + µ 2 )/2 at each base pair, where µ 1 and µ 2 are the mutation rates of genus 1 and 2.

294
Suppose that we now sample n 1 genomes from genus 1 and n 2 from genus 2 and calculate the MLD which yields the observed power-law with exponent -3.

303
The prefactor in Eq. (1) can be interpreted as an effective transfer rate per genome length. It depends on several parame-305 ters: the transfer rate from one species to another per genome length ρ/(L 1 L 2 ), the length of the transferred 306 sequences K, the degree to which the sequence is establishment in the population of the two genera f 1 and f 2 , and the effective mutation rate µ.

308
To fit the power law (1) to the empirical data, we binned the tail (r > 300) of the empirical MLD (using 309 logarithmic binning), and then applied a linear regression with a fixed regression slope of -3 and a single 310 fitting parameter, i.e., the intercept ln(A).
where the angular brackets denote the expectation value. The power law remains, except that the prefactor 319 now represents an average over all possible parameter values. Second, we can relax the assumption that the 320 divergence time t is uniformly distributed (i.e., that HGT events were equally likely at any time in the past).

321
In general, equation 6 should then be replaced by 322 m 12 (r) =ˆ∞ 0 P d (t) ρ m 12 (r|t) dt, in which P d (t) is the divergence-time distribution . Previously, this distribution was assumed to equal 1, 323 but other possibilities can be explored. For example, if instead we assume that xenologous sequences are 324 slowly removed from genomes due to deletions, the divergence times may be exponentially suppressed, in which case equation 9 becomes: This MLD again has the familiar power-law tail in the regime r λ/(2µ). Generally, if the divergence-time 327 distribution can be written as a Taylor series equation 9 evaluates to The tail of this distribution is dominated by the first nonzero term in the series, because it has the largest The most likely time t ML is found by setting the time-derivative of Eq 14 to zero, which results in Above, we considered exact matches with a length r > 300 bp. Only in sequences involved in rather 10 hours [28], this corresponds to approximately 1000 years. That is to say, we estimate that the HGT 343 events we detect date back to the past 1000 years. We stress, however, that both the mutation rate and the 344 generation time can strongly vary from one species to the next; hence this estimate is highly uncertain.     For each set of genes from a database, using the blast toolkit [1], we calculate the total number of 387 unique match-gene hit pairs. We weighted each hit to the database by w i to obtain a total number of hits H: Assuming random sampling or organisms, the standard error of H is given by