Prisoner of War dynamics explains the time-dependent pattern of substitution rates in viruses

Molecular clock dating is widely used to estimate timescales of phylogenetic histories and to infer rates at which species evolve. One of the major challenges with inferring rates of molecular evolution is the observation of a strong correlation between estimated rates and the timeframe of their measurements. Recent empirical analysis of virus evolutionary rates suggest that a power-law rate decay best explains the time-dependent pattern of substitution rates and that the same pattern is observed regardless of virus type (e.g. groups I-VII in the Baltimore classification). However there exists no explanation for this trend based on molecular evolutionary mechanisms. We provide a simple predictive mechanistic model of the time-dependent rate phenomenon, incorporating saturation and host constraints on the evolution of some sites. Our model recapitulates the ubiquitous power-law rate decay with a slope of −0.65 (95% HPD: −0.72, −0.52) and can satisfactorily account for the variation in inferred molecular evolutionary rates over a wide range of timeframes. We show that once the saturation of sites starts - typically after hundreds of years in RNA viruses and thousands of years in DNA viruses - standard substitution models fail to correctly estimate divergence times among species, while our model successfully re-creates the observed pattern of rate decay. We apply our model to re-date the diversification of genotypes of hepatitis C virus (HCV) to 396,000 (95% HPD: 326,000 - 425,000) years before present, a time preceding the dispersal of modern humans out of Africa, and also showed that the most recent common ancestor of sarbecoviruses dates back to 23,500 (95% HPD: 21,100 - 25,300) years ago, nearly thirty times older than previous estimates. This not only creates a radical new perspective for our understanding the origins of HCV but also suggests a substantial revision of evolutionary timescales of other viruses can be similarly achieved.


Main
The timescale over which viruses evolve and how this process is connected to host adaptation has been an area of considerable research and methodological progress in recent decades. Mammalian RNA viruses, in particular, exhibit extraordinarily rapid genomic change 1-3 and analyses of their genetic variation has enabled detailed reconstruction of the emergence of viruses such as HIV-1 4 , hepatitis C virus 5 and influenza A virus 6,7 . RNA viruses display evolutionary change over short-timescales (weeks to months) and can re-model a substantial part of their genomes following a host switch [8][9][10][11][12][13] . Well characterised examples in both RNA and DNA viruses include the emergence of HIV-1 in humans from a chimpanzee reservoir 4,14,15 , and the adaptation of myxomatosis in rabbits 16 .
These rapid rates of virus sequence change stand in striking contrast with evidence for extreme conservation of virus genome sequences over longer periods of evolution and at higher taxonomic levels 17,18 . Inferred short term rates of virus sequence change should create completely unrecognisable genome sequences if they were extrapolated over thousands or even hundreds of years, yet endogenous viral elements (EVEs) that integrated into host genomes throughout mammalian evolution are recognisably similar to contemporary genera and families of Bornaviridae, Parvoviridae and Circoviridae amongst many examples [19][20][21][22][23][24] . This observation is complemented by increasing evidence from studies of virus / host co-evolution [25][26][27] , and more recently from analysis of viruses recovered from ancient DNA in archaeological remains [28][29][30][31][32] , that together indicate a remarkable degree of conservatism in viral genome sequences and their inter-relationships at genus and family levels. This conundrum has been attributed to the time-dependent rate phenomenon (TDRP), which is the observation that apparent rates of evolution are dependent on timescale of measurement. The TDRP has been explained by processes such as sequence site saturation, shortsighted within-host evolution, short-term changes in selection pressure and potential errors in estimation of short-term substitution rates 17,18,[33][34][35] . Empirically, substitution rates show a striking linear relationship between log-transformed rates and timescale of measurement, across RNA and DNA viruses, despite the large differences among viruses in their initial short term substitution rates 33 . The gradient of regressions of observation time to estimated evolutionary rates is consistently around -0.65, for all virus groups for which long term substitution rates can be calculated or inferred, implying a common underlying evolutionary process.
We recently developed a model of virus evolution in which the primary driver of sequence change over long evolutionary time-scales was host adaptation, in which virus sequence change is severely curtailed by stringent fitness constraints 36 . Viruses exist within a tightly-constraining host niche to which they rapidly adapt; paradoxically, their high mutation rates, large population sizes and consequent ability to adapt rapidly serve to restrict their long-term diversification and sustained sequence change, rendering them evolutionary "Prisoners of War" (PoW). Ultimately over longer timescales, rates of viral evolution will be bounded by the rate of evolution of their hosts 36 .
In the current study, we develop the PoW model to explain the longer-term evolutionary trajectories of viruses and the time-dependence of their inferred substitution rates. The model accounts for genetic saturation of rapidly-evolving sites, and host constraints on site evolution, with the proportion of fast-to slow-evolving sites being exponentially distributed. These sites will saturate chronologically from the fastest evolving to those that evolve epistatically, to those that evolve at the host substitution rate. This model can reproduce effectively the empirically observed TDRP patterns and the inflection points where time-dependent rate changes become manifest. We demonstrate that the model predictions are robust to intrinsic and marked differences in substitution rates among different virus groups or assumptions about the relative proportion of sites evolving at different rates.

Power-law rate decay due to site-saturation
First, we show how a time-dependent rate effect emerges when estimating the sequence divergence using a standard evolutionary model. Suppose that a sequence has diverged from its ancestor t generations ago under a constant and uniform substitution rate,  . The proportion of pairwise differences between the sequence and its ancestor, () pt , reaches its maximum value, hereafter called the saturation frequency,  , at time ). As the ancestral and derived sequence continue to evolve beyond the saturation point,

Saturation under rate heterogeneity
In the presence of among-site rate heterogeneity, a fraction of sites may evolve at a rate that is much slower (or faster) than some other sites. Under such circumstances, if we apply a measure of genetic distance based on a constant and homogeneous substitution rate per site, the substitution model may reliably recover the expected (mean) rate up to and before the fastest-evolving sites reach saturation, beyond which point the time-dependent pattern of rate decay emerges while the remaining sites continue to accumulate changes until the slowest-evolving sites also reach saturation (i.e. the point at which the entire sequence space has been fully explored) and a power-law rate decay with slope 1 − emerges (see Figure S1 b).
Although our focus so far has been on the saturation of observed pairwise differences and how it can create a time-dependent rate effect, the same holds true when tracking the evolutionary changes of a large number of sequences through time. Using a standard Jukes-Cantor substitution model on a set of simulated sequences (see Methods section), both in the absence and presence of rate heterogeneity, we can recreate similar patterns of time-dependent rate decay and show that, over longer timescales, i.e. when the divergence time between two populations is much longer than the typical coalescent times

Saturation under the Prisoner of War model
The PoW model of virus evolution is based on the principle that site saturation over long timescales dominates the time-dependency of virus substitution rates. Building upon a collection of virus evolutionary rate estimates from >130 publications 33 , we use 389 nucleotide substitution rate estimates across six major viral groups to find the line of best fit between our predicted timedependent substitution rate (the PoW model) and the evolutionary rate estimates for each viral group (data) using the geometric least squares method 37 (Figure 1 a). Assuming a fixed incremental difference between any two consecutive rate groups, i.e.

 + =
, the substitution rate at the fastest-evolving sites is determined by the total number of groups, M , which, in turn, sets the inflection point for when the time-dependent rate decay emerges. Once the fastest-evolving sites diverge to saturation, other rate groups that evolve more slowly (e.g. via epistatic and compensatory substitutions), saturate sequentially as the timespan of rate measurement, t , increases. This chronological saturation effect continues until the inferred rate decays to the host substitution rate, min  (Figure 1 b). Therefore, the time-dependent rate curve, according to the PoW model is given by where the observed genetic distance between the derived and ancestral sequences is assumed to increase with the timespan of rate measurement, t . While over short timescales, i.e.  40 has shown that substitution models with Gamma-distributed rate heterogeneities across sites may perform better at estimating the mean substitution rate over longer timescales, thereby delaying the emergence of the inflection point in the power-law rate decay, compared to models with a strict molecular clock.
We find that the mean substitution rates and fastest-evolving rate groups in double-stranded DNA viruses (dsDNA) is the lowest among all viral groups and that, together with reverse-transcribing DNA (RT-DNA) and single-stranded DNA viruses (ssDNA), dsDNA viruses have typically 1-2 orders of magnitude slower rates than RNA viruses (see Table 1). Conversely, the estimated mean and fastest rates are very similar among the groups of positive-strand RNA viruses (+ssRNA), negative-strand RNA viruses (-ssRNA), and RNA retroviruses (RT-RNA). The estimated substitution rates for rapidly evolving sites in RNA viruses, i.e. To ensure that our model predictions are not biased towards a particular virus family with more evolutionary rate estimates (i.e. more data points to fit to the PoW model in Equation 1), we remove all the short-term rate estimates ( 100  years) within each viral group except for one virus family or genus to re-calibrate the mean substitution rates (see Table S1). We find that despite the broad range of evolutionary rate estimates across all viral groups, the estimated parameters of the PoW model are robust to such changes and are not an artifact of systematic biases in selecting rate estimates from a particular virus family. We note that, in group VI, the stark difference in evolutionary rates between Lentivirus and Deltaretrovirus families over short timescales results in a noticeably different pattern of rate decay (see Figure S3 e) and the long-term rates are more aligned with the predictions based on the Deltaretrovirus re-calibration. The predicted mean and maximum substitution rate of Lentivirus families are 1-2 orders of magnitude higher than the Deltaretrovirus family. The latter evolves at rates similar to RT-DNA viruses. We also see that a larger fraction of sites in DNA viruses tend to evolve at rates closer to the host substitution rates (i.e. have a sharper negative gradient,  ) compared to RNA viruses, which largely have an equal fraction of sites across all rate groups. It is worth noting that all rate groups, including the ones with lowest proportion of sites, min m , have representative proportions across the genome, i.e.
where is a typical genome size of an RNA virus. We also carried out a similar sensitivity analysis at the level of virus genera which further confirms that the rate curves predicted by the PoW model are still accurate at this level and are not an artefact of measured rates at the level of Baltimore groups (see Figure S4).
To illustrate the radical effect of applying the PoW model to virus evolutionary timescales, we analysed an alignment of complete hepatitis C virus (HCV) genome sequences that represent its component genotypes and subtypes (Figure 2). Using the trajectory of the PoW-transformed evolutionary rate decay for Group IV and the expected (short-term) substitution rate of 1.2 x 10 -3 substitutions/site/year for HCV (Figure 2 a, b), the model demonstrates a clear separation of timescales for the diversification of variants within genotypes (~50 500 − years), among subtypes (~1, 000 20, 000 − years), and among genotypes (~40, 000 200, 000 − years) with an estimated root age for HCV of 396,000 (95% HDP: 326,000 -425,000) years before present (BP). While the predicted divergence times for withingenotype variants using the PoW model is similar to those obtained using a standard substitution model, the latter models estimate the root age of HCV to be only 970 (95% HDP: 850 -1100) years BP with no clear separation of timescales for among-genotype diversifications (Figure 2 c). These results contrast with estimates of 500 -2,000 years of genotype diversification by simple extrapolation from short term rates 43 , while among-subtype divergence times of 1,000 -20,000 years are up to 50 times higher than the 300 -500 years estimated in previous molecular epidemiological analysis [44][45][46] . The revised, very early evolutionary origin of HCV genotypes (326,000years, 425,000 years 95% HPD) predicted by our model is striking. While these early dates still fit currently proposed hypotheses for multiple and potentially relatively recent zoonotic sources of HCV in humans, associated with different genotypes 47,48 , the existence of a common ancestor of HCV before human migration of Africa (150,000BP) support alternative scenarios where HCV diversified within anatomically modern humans. HCV genotypes may have arisen from geographical separation in Africa (genotypes 1, 2, 4, 5, 7) and migrational separation of human populations migrating out of Africa into Asia (genotypes 3, 6 and 8).
We also carried out a similar analysis to investigate the origins of the SARS-CoV-2 sarbecovirus lineage (Figure 3). While the PoW-transformed phylogeny recovers the previous estimates for SARS-CoV-1 and SARS-CoV-2 diversification from their most closely related bat virus over short timescales (i.e. less than hundreds of years BP), it extends the root age back to 23,000 (21,000 -25,000) years BP, nearly 20 times older than previous estimates 49 . Our results indicate that humanity may have been exposed to these viruses since the Paleolithic if they had come into contact with their natural hosts. Also, our date estimates of the origin of the sarbecovirus lineage are in remarkable concordance with signatures of selection on human genomic datasets that indicate an arms race with corona-like viruses dating back 25,000 years 50 , providing an external comparator for our methodology.
The PoW model creates an over-arching evolutionary framework that can reconcile, and incorporate timescales derived from co-evolutionary and ancient DNA studies. Further radical re-evaluations of timescales of other RNA and DNA viruses using this approach will provide new insights into their origins and evolutionary dynamics 27,51 . Application of the PoW will place ancestors of more divergent virus sequences far further back into the past than conventional reconstructions. The good fit between modelled and observed substitution rates and the gradient of rate decay over time were based on a minimal number of assumptions about mutational fitness effects, proportion of sites evolving at a particular rate, and robust to substantial differences in substitution rates across different viral groups. By finding the short-term substitution rate (the flat part of the modelled rate decay) and the value of the fastest-evolving rate group (which sets the inflection point of the curve), the PoW model can reconstruct corrected substitution rates for virus genotypes with increasingly divergent nucleotide sequences.    is assumed to be the same as the one inferred for Group IV (see Table 1). (c) Shows the estimated divergence times within and between various HCV genotypes and subtypes using a general time-reversible substitution model with a four-bin gamma rate distribution (GTR+G) and PoW-transformed tree. Error bars represent the 95%HDP for the inferred maximum clade credibility tree.

Time-dependent rate decay under a uniform substitution rate
Suppose that a sequence has diverged from its ancestor t generations ago under a substitution rate  that is constant over time and equal across all sites. In this case, the proportion of pairwise differences, () pt , is given by 53 such that  is the maximum proportion of pairwise differences and is determined by

Time-dependent rate decay in the presence of rate heterogeneity
In the previous scenario, Equation S1, we assumed there is no rate heterogeneity across sites. However, if the actual evolutionary process involves M group of sites with each group i evolving at rate i  and occupying a fraction mi of sites, the proportion of pairwise differences would be given by Suppose that a sequence has a fraction of sites m1 evolving at rate 1 = and the remaining sites (1 − 1 ) evolving epistatically such that a pair of sites need to mutate simultaneously for the offspring to recover wild-type fitness, i.e. 2 = 2 . Assuming the saturation frequency across all sites are equal and that the model correctly identifies their frequency, i.e. = = , we can use Equation S2 to recover the expected substitution rate 2 11 (1 ) mm . As the fast-evolving sites approach the saturation point at 1 ≈ / , the inflection point of rate decay emerges and a sharp decline in estimated substitution rate follows while the remaining fraction of sites, (1 − 1 ), keep accumulating new substitutions at rate 2 , slowing down the slope of rate decay until those sites also reach saturation at 2 ≈ / 2 beyond which point the entire genome reaches saturation and the powerlaw rate decay with slope −1 emerges. We can also see from Figure S1 b that as the proportion of slow-evolving sites increases, the mean substitution rate goes down and the slope of the timedependent rate decay becomes less steep.

Simulating a Wright-Fisher population to infer substitution rates
We simulate a neutral haploid Wright-Fisher population of size with evolving sites under a constant mutation rate per site such that every nucleotide (A, C, G, and T) can mutate to any other nucleotide at the same rate /3 -mutation rate is equal to substitution rate under neutrality. We then sample from the entire population at two time points with an increasingly wider time gap, * . Initially, we allow the population to evolve for 10 generations before taking the first sample to ensure that neutral coalescent events reach their steady state distribution and that the population, on average, coalesce every 2 generations. We then take the second sample * generations later and repeat this process 100 times to generate replicate sequences at both time points and run each set of simulations in BEAST 1.10 to estimate the substitution rate ( Figure S2). We load the simulated sequences (along with their sampling times) on BEAST and use a strict molecular clock with a continuous-time Markov chain reference prior on substitution rates, a constant population coalescent prior, and a Jukes-Cantor substitution model. For every simulated set, the Markov chain Monte Carlo was run for 10,000,000 steps and parameter convergence was inspected visually.
In Figure S2, we recreate the time-dependent pattern of rate decay both in the absence and presence of rate heterogeneity across sites, using a standard substitution model on simulated data. We find that while the inferred substitution rates exhibit a power-law rate decay with slope 1 − over longer time intervals (see Figure S2 (a) and (b)), the inferred TMRCAs tend to be overestimated with a similar (inverse) power-law trend, i.e. ˆ1 / t  (see Figure S2 (c) and (d)). We also find an unexpected timedependent rate effect over short timescales. This occurs when the observation gap, * , is much shorter than the expected coalescent time of the population, i.e. * ≪ 2 . This also results in the underestimation of true TMRCAs which systematically makes worse predictions for higher substitution rates. The expected rate curves (dashed lines shown in Figure S2 (a) and (b)) can be approximated by replacing ( ) from Equation S1 into ̂ from Equation S2 which is given by   is the floor function which represents the finite size effect of having evolving sites on saturation frequency. The mean divergence time between the two populations is approximately ≈ * + 2 and the inferred divergence time is ̂≈ * + -this resembles the mis-calibration effects reported elsewhere (see Equation 2 in ref 55 ). The reason why Equation S4 only works as an approximate is that the median inferred TMRCA from simulation results, , also varies with respect to observation gap * (see Figure S2 (c) and (d)). However, for * ≫ 2 , the variation in becomes negligible compared to * and only has second-order effects on inferred substitution rates. Figure S2 (e) and (f) show the tree topology under the two extremes, * ≪ 2 and * ≫ 2 , respectively. It indicates that, over long timescales, the time-dependent rate effects are dominated by the very long (and saturated) branch connecting the two populations that are * generations apart. As a result, the decay dynamics looks very similar to the analytical results in Figure S1 where we estimate the substitution rate between a pair of sequences separated by a very long branch.     *Only short-term rate estimates (measured over time scales of <100 years) -along with the long-term rate estimates (>100 years) from the entire data set -from this particular virus family or genus (rather than the entire viral group) is used for the calibration.