Abstract
In 2013, we discovered that encounters between the replication and transcription machineries allow bacteria to evolve at an accelerated rate by promoting mutagenesis in lagging strand genes. Though we proposed that this process is adaptive, it is also possible that the increased mutation frequency in lagging strand genes could be the result of reduced purifying selection (neutral selection). Due to the low number of available genome sequences at the time of publication, we were unable to distinguish between these two models with a high level of confidence. Here, we utilized the wealth of newly available bacterial genome sequences to examine these two possibilities. To test the adaptive hypothesis, we analyzed convergent mutation patterns. To test the neutral hypothesis, we performed in silico modeling. Our results clearly demonstrate that the neutral model cannot explain the increased mutagenesis of lagging strand genes. Additionally, our evolutionary convergence data strongly support the adaptive hypothesis. We conclude that encounters between the replication and transcription machineries in lagging strand genes accelerate the discovery of beneficial mutations.
Introduction
Replication-transcription conflicts cause pervasive replisome collapse across highly divergent bacterial species1,2. Conflicts also occur in eukaryotes, including yeast and humans3–6. Hence, replication-transcription conflicts are a universal problem.
In bacteria, the majority of genes are encoded on the leading strand. A gene’s orientation (encoding on the leading or lagging strand) can have major effects on cellular fitness through co-directional or head-on replication-transcription conflicts, respectively7–11. The two types of conflicts lead to very different outcomes. In leading strand genes, co-directional conflicts between the faster moving replisome and RNA polymerases have a modest effect on replisome stalling12. Conversely, in lagging strand genes, head-on conflicts can cause severe replisome stalling and DNA damage10,11,13,14. As a result, head-on conflicts almost certainly confer strong negative selection pressure against the majority of lagging strand alleles8,15. Yet in every bacterial species, 20-45% of genes remain in the lagging strand orientation16–18. The reasons for this phenomenon are not entirely clear.
In 2013, we made a key discovery that could at least partially explain the retention of lagging strand genes despite the detrimental outcomes of head-on conflicts. We found that lagging strand genes evolve at a faster rate than leading strand genes in Bacillus subtilis, both experimentally and in nature. Our findings were the first to demonstrate the existence of a temporal (transcription-dependent) and spatial (gene-specific) control mechanism for promoting evolution. Some of our results also suggested that this mechanism may be adaptive. However, because our analyses were performed using only 5 genomes, the support we presented for the adaptive hypothesis was somewhat weak. Consequently, reduced negative selection remained a valid alternative to the adaptive hypothesis.
Following our study, Chen and Zhang claimed that no support exists for the adaptive hypothesis15. Their argument was based on two primary forms of evidence. 1) In silico simulations suggested that some of the convergent mutations we identified, multi-hit site mutations (defined in Fig. 1A), could have formed by chance; they therefore accepted only the remaining parallel mutations as convergent. 2) Though the authors ultimately found that parallel mutations are also more common in lagging strand genes, the difference between the two groups was not statistically significant (p=0.14, Figure S1), and on this basis they rejected the validity of the adaptive hypothesis. Though the issue of statistical significance could have been resolved through an investigation of additional bacterial genomes, none were available at the time.
Today many more bacterial genomes have been sequenced, allowing us to test the two models with high resolution. To this end, we conducted a new analysis of convergent evolution using 50 B. subtilis genomes. This increased our statistical power, and allowed us to identify extremely rare classical convergent mutations which are among the most reliable forms of evidence of evolutionary convergence (Fig 1A). Below, our new analyses show that indeed, each type of convergent mutation is more common in lagging strand genes relative to leading strand genes. We further show that their abundance is both gene length and orientation-dependent, supporting our original model that lagging strand convergent mutations arose largely through mutagenesis caused by head-on replication-transcription conflicts. Our in silico modeling experiments further demonstrate that the observed mutations cannot be explained by chance alone, overturning the neutral hypothesis. Finally, we repeat these analyses in two other species, and observe the same patterns of convergence. As such, we conclude that lagging strand encoding is a broadly conserved mechanism capable of benefiting the cell through head-on conflict-mediated mutagenesis and the accelerated discovery of adaptive mutations.
Materials and Methods
Bacterial genome files were downloaded from NCBI in GenBank format and are listed in Table S1. Genomic sequences were analyzed using the program TimeZone19, version 1.0. TimeZone conducts mutational analyses (e.g. dN, dS, dN/dS, etc.) of a subset of genes based upon only two customizable settings: the level of gene conservation in terms of 1) length and 2) amino acid content. We selected 95% conservation for both settings, meaning that only genes that are highly conserved in terms of both length and amino acid content were analyzed. These are referred to as core genes, which are effectively essential in nature20. Core genes represent an equal fraction of leading and lagging strand genes, suggesting that the results of our analyses should be equally representative of both groups21. This may also help control for differences in essentiality/mutability that might otherwise skew our results. Convergent mutations were parsed from TimeZone output files using a custom script, Analysis_conv5.py.
Simulations were performed as previously described, with one exception: all simulated multi-hit sites were mutated to a second amino acid15. As a result, a subset of our multi-hit site mutations were identical, and thus represent parallel mutations. All code used for these simulations is publicly available at https://github.com/lh64/MultihitSimulation. All additional data are available upon request.
Results
To detect signatures of convergent evolution in B. subtilis, we conducted a mutational analysis of 50 fully assembled genomes. This data set provided an order of magnitude more information than was used in our original analysis. To identify all point mutations that occurred since the divergence of the individual isolates, we used the program TimeZone19. This program is optimized for identifying recent adaptive changes in bacteria19. We then parsed out the convergent mutations identified by TimeZone into three groups: classical convergent, parallel, or multi-hit site mutations, and calculated their frequencies in leading or lagging strand core genes (Figure 1A, 1B). Our analysis identified a higher frequency of all three types of convergent mutations in lagging strand genes (Fig 1B). These results are highly statistically significant, strongly supporting the validity of our original analysis, as well as our inference that positive selection acts more frequently on lagging strand genes.
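The three-way grouping described above can be illustrated with a minimal Python sketch. It is not the Analysis_conv5.py script, and because Fig. 1A is not reproduced in the text, the definitions used here are assumed from common usage: parallel (same ancestral residue, same derived residue), classical convergent (different ancestral residues, same derived residue), and multi-hit site (same site hit independently, different derived residues). The function names and the input layout are hypothetical.

```python
from collections import Counter, defaultdict


def classify_site(changes):
    """Classify the independent amino acid changes observed at one site.

    `changes` is a list of (ancestral, derived) residue pairs, one pair per
    independent lineage. The category definitions are assumptions (see the
    lead-in); mixed cases are simplified to "multi_hit".
    """
    if len(changes) < 2:
        return "single"  # only one lineage changed: not a convergence candidate
    ancestral = {a for a, _ in changes}
    derived = {d for _, d in changes}
    if len(derived) == 1 and len(ancestral) == 1:
        return "parallel"
    if len(derived) == 1:
        return "classical_convergent"
    return "multi_hit"


def tally_by_strand(sites):
    """Count each convergent category separately for each strand.

    `sites` is an iterable of (strand, changes) pairs, where strand is a
    label such as "leading" or "lagging".
    """
    counts = defaultdict(Counter)
    for strand, changes in sites:
        category = classify_site(changes)
        if category != "single":
            counts[strand][category] += 1
    return counts
```

Turning these counts into frequencies requires a normalization (for example, per core gene or per site), which the text does not specify and which is therefore left out of the sketch.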
To determine if our findings in B. subtilis are indicative of a broader pattern of evolution, we conducted the same analysis in a related species, Mycoplasma gallisepticum, and an unrelated species from a second phylum, Mycobacterium tuberculosis (Fig. 1C). For each species, we observed the same trend identified in B. subtilis. This strongly suggests that the elevated frequency of convergent evolution is a conserved feature of lagging strand gene evolution.
Most convergent mutations are retained due to positive selection, not chance
Classical convergent and parallel mutations are considered standard indicators of adaptation because they are extremely unlikely to occur by chance15,22. However, it is theoretically possible that multi-hit site mutations could arise at random with significant frequency15. Accordingly, they represent a lower confidence indicator of convergence. To determine if the multi-hit sites we observed could have been produced by chance, we conducted in silico simulations, following the protocol established by Chen and Zhang15. Their method simulates a neutral reassortment of the observed non-synonymous mutations in each gene to estimate the frequency with which non-adaptive multi-hit site mutations could arise by chance. Accordingly, we drew random variable sites within each core leading or lagging strand gene with replacement until the total number of sites equaled the number of observed amino acid changes15. In our method, we added a second step: for any site drawn twice, the original residue was randomly mutated to one of the 19 other amino acids, yielding either a multi-hit or parallel mutation. We performed simulations for all leading and lagging strand core genes using nonsynonymous substitution data from either our original 5 genome or new 50 genome study of B. subtilis. We repeated these simulations 10,000 times for all leading or lagging strand core genes, yielding a distribution of values (Fig. 2). We found that the observed numbers of both parallel and multi-hit site mutations are greatly in excess of even the most extreme simulated data (Fig. 2). Therefore, the data strongly suggest that the multi-hit site and parallel mutations we observed in both our original and current studies could not have arisen by chance (Fig. 2). Instead, they most likely arose through positive selection, consistent with the idea that both multi-hit site and parallel mutations are indicative of evolutionary convergence.
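The simulation procedure described above can be sketched in Python as follows. This is an illustrative reading of the text, not the code in the MultihitSimulation repository. It assumes that every drawn site receives a random replacement residue, and that a site drawn more than once counts as parallel when all of its replacements are identical and as multi-hit otherwise. The function names, the representation of a gene as a string of residues at its candidate sites, and the fixed seed are hypothetical choices.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def simulate_gene(residues, n_changes, rng):
    """One neutral reassortment of the observed changes within a single gene.

    residues:  ancestral residues at the gene's candidate sites (non-empty str)
    n_changes: observed number of amino acid changes in the gene
    Returns (n_parallel_sites, n_multi_hit_sites) for this replicate.
    """
    hits = {}
    for _ in range(n_changes):
        site = rng.randrange(len(residues))  # draw sites with replacement
        choices = [aa for aa in AMINO_ACIDS if aa != residues[site]]
        hits.setdefault(site, []).append(rng.choice(choices))
    parallel = multi_hit = 0
    for derived in hits.values():
        if len(derived) < 2:
            continue  # site drawn once: no repeated hit
        if len(set(derived)) == 1:
            parallel += 1  # repeated hit, same replacement residue
        else:
            multi_hit += 1  # repeated hit, different replacement residues
    return parallel, multi_hit


def simulate_strand(genes, n_reps=10_000, seed=0):
    """Repeat the reassortment over all core genes of one strand.

    genes: list of (residues, n_changes) pairs for that strand.
    Returns one (parallel_total, multi_hit_total) pair per replicate.
    """
    rng = random.Random(seed)
    totals = []
    for _ in range(n_reps):
        p_total = m_total = 0
        for residues, n_changes in genes:
            p, m = simulate_gene(residues, n_changes, rng)
            p_total += p
            m_total += m
        totals.append((p_total, m_total))
    return totals
```

The observed count for a strand can then be compared with the resulting distribution, for example by asking what fraction of replicates meet or exceed it.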
Despite having conducted a simulation ostensibly identical to that of Chen and Zhang, our results clearly and directly oppose their conclusion that the “observation on multi-hit site containing genes is fully expected under the neutral model”15. To determine how the same experiment could support opposing conclusions, we re-analyzed our raw simulation data (presented in Figure 2) using the ratio-of-ratios method employed by Chen and Zhang. Here we calculated the ratio of observed multi-hit site mutations in leading strand genes to those in lagging strand genes, and compared this number to the distribution of simulated ratios (Fig. S2). As with the previous report, we found that the observed ratio is roughly in the middle of the distribution of simulated ratios (Fig. S2, Chen and Zhang’s Fig. 2). First, this shows that both simulations, and presumably the underlying data sets, are equivalent. (Unfortunately, Chen and Zhang did not publish their raw data or computer code, and did not respond to our request for this information, so this conclusion is inferred.) Second, this indicates that the authors’ analysis method is the root of the conflict: even though the raw data show conclusively that the observed multi-hit site mutations cannot be explained by chance, the ratio-of-ratios method obfuscates this information, erroneously appearing to suggest the opposite. In fact, Chen and Zhang’s method makes even parallel mutations appear to be random events, which is exceptionally unlikely, as discussed by the authors themselves (Supplementary Figure 3)15. Therefore, we conclude that the ratio-of-ratios data presented by Chen and Zhang are highly misleading, and that the conclusions of their manuscript are directly opposed by their own data. As such, the modeling data strongly support the adaptive hypothesis for lagging strand gene encoding.
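The difference between the two ways of reading the same simulation output can be illustrated with a short Python sketch. The function names are hypothetical, and the numbers in the usage note are toy values invented purely for illustration; they are not data from this study or from Chen and Zhang.

```python
def upper_tail_fraction(observed, simulated):
    """Fraction of simulated counts that are at least as large as the observed count."""
    return sum(s >= observed for s in simulated) / len(simulated)


def ratio_percentile(obs_lead, obs_lag, sim_lead, sim_lag):
    """Fraction of simulated leading/lagging ratios that are <= the observed ratio.

    Replicates with zero simulated lagging strand hits are skipped to avoid
    division by zero (an assumption of this sketch).
    """
    obs_ratio = obs_lead / obs_lag
    sim_ratios = [a / b for a, b in zip(sim_lead, sim_lag) if b > 0]
    return sum(r <= obs_ratio for r in sim_ratios) / len(sim_ratios)
```

With toy simulated counts of 8, 10, 12, 9 and 11 for the leading strand and 5 in every replicate for the lagging strand, toy observed counts of 40 and 20 exceed every simulated value on both strands (upper-tail fraction 0.0 for each), yet the observed ratio of 2.0 falls inside the simulated ratio distribution (percentile 0.6). A proportional excess on both strands cancels in the ratio, which is how a mid-distribution ratio can coexist with counts far beyond the neutral expectation.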
Head-on replication-transcription conflicts increase the frequency of convergent mutations in lagging strand genes
Our previous work indicated that head-on replication-transcription conflicts are the mechanistic basis for the increased mutation rate of head-on genes7,21,23. Evidence for this hypothesis includes our observation of a gene length and orientation-dependent increase in both dN/dS ratios and mutation frequency7. This result was consistent with the idea that head-on conflict severity should increase in direct relation to gene length, whereas co-directional conflicts (leading strand genes) should not7. If head-on replication-transcription conflicts are responsible for promoting the formation of the observed convergent mutations identified here, their abundance should follow the same pattern. To test this, we calculated the number of convergent mutations per core gene, then assessed the relationship with gene length (Fig. 3). We found that all three types of convergent mutations increase in frequency in a gene length-dependent manner. We also found that this effect is more pronounced in lagging strand genes, strongly suggesting that head-on conflicts are indeed responsible for the increased frequency of convergent mutations in lagging strand genes.
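One simple way to quantify a gene length and orientation-dependent trend is to fit a separate least-squares line for each strand and compare the slopes. The Python sketch below does this under stated assumptions: the text does not specify which statistical procedure underlies Fig. 3, so ordinary linear regression is a stand-in, and the function names and input layout are hypothetical.

```python
def least_squares_slope(lengths, counts):
    """Ordinary least-squares slope of counts against gene length."""
    n = len(lengths)
    mean_x = sum(lengths) / n
    mean_y = sum(counts) / n
    sxx = sum((x - mean_x) ** 2 for x in lengths)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lengths, counts))
    return sxy / sxx  # requires at least two distinct gene lengths


def slopes_by_strand(genes):
    """Fit one slope per strand.

    genes: iterable of (strand, length_bp, n_convergent) tuples, one per core gene.
    Returns {strand: convergent mutations gained per additional base pair}.
    """
    grouped = {}
    for strand, length_bp, n_convergent in genes:
        xs, ys = grouped.setdefault(strand, ([], []))
        xs.append(length_bp)
        ys.append(n_convergent)
    return {strand: least_squares_slope(xs, ys) for strand, (xs, ys) in grouped.items()}
```

Under the conflict model, the lagging strand slope would be expected to exceed the leading strand slope; a formal comparison would also need confidence intervals or a test for the difference in slopes, which this sketch omits.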
We then repeated these analyses in M. gallisepticum and M. tuberculosis and identified the same pattern, demonstrating that this mechanism is conserved in other species (Fig. S3). (Raw data for all three species are shown in Fig. S4.)
Discussion
In summary, our new analyses show that lagging strand genes accumulate convergent mutations at a higher rate than leading strand genes, and that this effect is broadly conserved. Importantly, our modeling experiments demonstrate that the observed parallel and multi-hit site mutations cannot be explained by chance. This overturns the neutral hypothesis, strongly suggesting that these mutations are beneficial. Even if one disagrees with the idea that multi-hit site mutations are indicative of evolutionary convergence, the higher frequency of both parallel and classical convergent mutations in lagging strand genes is powerful and independent evidence that the lagging strand encoding can be adaptive. The gene length and orientation-dependence of these mutation patterns strongly suggests head-on replication-transcription conflicts are a key molecular mechanism driving their formation.
The results we present here are consistent with those of our previous study, in which we identified a higher frequency of genes with a dN/dS ratio significantly above 1 encoded on the lagging strand in six diverse species21. This metric represents a second standard indicator of positive selection24. In the same study, we identified pervasive, recent, leading-to-lagging strand gene inversion events in every bacterial species tested. (Though our method could not detect “reversion” events in which a newly inverted head-on gene flips back to the leading strand, such events almost certainly occur as part of a dynamic equilibrium.) Together, these observations provided strong support for the notion that lagging strand encoding can be beneficial.
Interestingly, both our new empirical data (Fig. 1B) and modeling experiments (Fig. 2) directly contradict the conclusions of the Chen and Zhang manuscript in which they claimed to have overturned the adaptive hypothesis15. In fact, this is the second time we have discovered that a critique of our work by the Zhang lab was based almost entirely on inaccurate, erroneous, or misleading data25.
In summary, the convergent evolution analysis presented here, together with our prior investigations of gene-specific mutation rates, shows that lagging strand genes gain adaptive mutations at a faster rate than leading strand genes in species across phyla. In combination with our identification of pervasive leading-to-lagging strand gene inversions, our findings paint a consistent picture: lagging strand encoding for some genes can confer a net benefit to the cell through head-on replication-transcription conflicts, despite their detrimental effects. As such, we conclude that gene orientation provides the cell with a high precision mechanism for temporally and spatially controlling adaptive evolution.
Acknowledgements
H.M. and C.M. were supported by the National Institutes of Health, R01‐AI‐127422.