Examining the molecular clock hypothesis for the contemporary evolution of the rabies virus

The molecular clock is a method for measuring the rate of virus evolution but the key assumption that mutations accumulate on the genome at a constant rate over time does not always hold true. While modelling approaches exist to accommodate deviations from a strict molecular clock, assumptions about rate variation may not fully represent the underlying evolutionary processes and can affect the accuracy of clock calibration and divergence time estimates. There is considerable variability in rabies virus (RABV) incubation periods, ranging from days to over a year, during which viral replication may be reduced. This prompts the question of whether modelling RABV on a per infection generation basis might be more appropriate. We investigate how variable incubation periods affect root-to-tip divergence under per-unit time and per-generation models of mutation. Additionally, we assess how well these models represent root-to-tip divergence in time-stamped RABV sequences. We find that at low mutation rates (<1 substitution per genome per generation) divergence patterns between these models are difficult to distinguish, while above this threshold differences become apparent across a range of sampling rates. Using a Tanzanian dataset, we calculate the mean substitution rate to be 0.17 substitutions per genome per generation. At RABV’s substitution rate, the per-generation substitution model is unlikely to represent rabies evolution substantially differently than the molecular clock model when examining contemporary outbreaks; over enough generations for any divergence to accumulate, extreme incubation periods average out. However, measuring substitution rates per-generation holds potential in applications such as inferring transmission trees and predicting lineage emergence.


Introduction
The molecular clock is a method of measuring the rate of evolution of organisms, based on the assumption that their genomes accumulate neutral mutations at a constant rate over time, either across all lineages (the "strict molecular clock") or within each individual lineage but with variation between them (the "relaxed molecular clock") (1,2). The ability to sample viral sequences through time, and the application of the molecular clock hypothesis to these sequences, has led to massive advances in using viral genetic data to investigate disease outbreaks (3). The clock rate, measured in substitutions per site per unit time, can be used to estimate how long ago pathogens diverged (4), and the date of infection of individual infected hosts (5). Combining the analysis of epidemiological and genetic data has allowed further insights into the history of outbreaks (6), and the introduction of geographic data provides predictions as to rates of spread and the frequency and source of introductions (7,8). However, in order to conduct these phylogenetic analyses, genetic divergence must measurably increase over time in the dataset under investigation (9).

The rabies virus (RABV) is a negative-strand RNA virus, with a genome size of approximately 12 kilobases. While RNA viruses generally have high mutation rates due to a lack of proofreading by RNA polymerases, RABV has a relatively low clock rate of between 1 × 10⁻⁴ and 5 × 10⁻⁴ substitutions per site per year (10-12). This may be due to strong purifying selection (10), or to an unusual feature of RABV: that infections can exhibit extended incubation periods within the host. The median generation interval (the time between one individual becoming infected and then infecting another) is estimated to be 17.3 days in domestic dogs (13), with other studies estimating mean serial intervals of 26.3 days (14) and 45.0 days (15).
Symptoms, infectivity, and death from rabies, however, can occasionally occur years after the initial infection event (16). The length of this incubation period is influenced by the route of exposure, with bites to the head and neck leading to more rapid disease progression than bites to lower extremities (17). RABV can remain in the muscle at the bite site for significant lengths of time before invading the host's motor neurons and progressing through the nervous system, with limited, if any, infection of other muscle fibres (18). While some replication in the muscle cells has been observed (19), RABV replication at the inoculation site is not necessary for neural invasion (20). It is currently unknown precisely how the RABV replication rate in the host muscle cells and peripheral nervous system compares to the massive replication rate within the cells of the central nervous system and brain. However, work suggests that RABV replication in muscle cells may be reduced (21), and RABV replication in cultured rat sensory neurons may be 10- to 100-fold lower than replication rates in rat and mouse CNS neurons (22). Rabies infections that involve long incubation periods may, therefore, not lead to more accumulated mutations than those with shorter incubation periods, as viral mutation is strongly influenced by the replication process (24).

Changes in mutation rates through time due to long incubation periods may affect how we analyse RABV sequence data and interpret these analyses. A relaxed molecular clock is usually required to carry out phylogenetic analyses on rabies datasets, and it is not uncommon for there to be difficulties in applying these analyses due to "insufficient temporal signal", usually referring to either no relationship or a negative relationship between genetic divergence and time, or to this relationship having a very low R² (25-29). RABV clock rates can differ significantly between hosts (30), geographic locations (31) and circulating lineages (12), which may be driven in part by differences in incubation periods. If the variable incubation period of rabies infections does cause deviation from the molecular clock model (exceeding the variation captured by relaxed or modified clock models), this may negatively affect the accuracy of time-scaled phylogenetic trees and emergence date predictions. Conversely, if mutation does continue at a consistent rate during the incubation period, attention should be paid to extremely long incubators which could drive the emergence of new variants, as seen recently in chronic SARS-CoV-2 infections (32,33).

We hypothesised that reduced replication (and thus mutation) during the incubation period could cause rabies evolution to be better represented by a per-generation model of mutation than by the molecular clock model. We aim to clarify the nature of contemporary RABV evolution using in silico methods, comparing the root-to-tip divergence of sequences generated from synthetic outbreaks under per-unit time or per-generation mutation models, and comparing these to RABV genomic data from Tanzania. We also aim to calculate a per-generation substitution rate for RABV.

We investigate two contrasting mutational models for RABV (i.e., substitutions occurring on a per-generation vs. per-unit-time basis) using a simulation approach. We first generated synthetic RABV outbreaks using a branching process model (13) and then simulated these two mutation processes over the resulting transmission trees. From the resulting synthetic sequences, we examined root-to-tip divergence and calculated variance explained (R²) from linear regressions, and compared these to the root-to-tip divergence of a set of RABV whole genome sequences from Tanzania. Finally, we developed a method to estimate the per-generation substitution rate for RABV and tested this on synthetic data before applying it to the real RABV dataset.

Rabies outbreak simulation
We simulated RABV mutation on branching-process simulations of rabies outbreaks.

Outbreaks were simulated 100 times over a spatially explicit representation of Mara Region in northern Tanzania. In Serengeti District, where contact tracing data were available, the model was initialised with the three cases that occurred in the mean generation interval (g=27

To investigate patterns of temporal divergence under the mutation models described above, we generated synthetic data with values of substitution rates ranging from 0.05 to 3 substitutions per genome per generation (or the per unit time substitution rate equivalent) and four population sampling regimes (from 1% of cases to 20%, informed by a previous study that estimated that routine surveillance for rabies rarely confirms more than 10% of circulating cases (35)). We calculated the genetic divergence as the number of nucleotide differences from the index case to each sampled case. For each of the nine transmission trees, we then compared genetic divergence with time under each scenario (substitution rate and sampling regime combination), using linear regression through the origin.
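The comparison described above can be illustrated with a minimal sketch. It is not the branching-process model used in the study: it follows a single linear transmission chain, draws generation intervals from a gamma distribution with an arbitrary shape, and assumes Poisson-distributed substitution counts. The function names, the 25-day mean interval and the gamma shape are all hypothetical choices made for illustration. The R² reported for the no-intercept fit uses the uncentred total sum of squares, which is one common convention and may differ from the one used in the original analysis.

```python
import numpy as np

def simulate_chain(n_generations, rate_per_generation, mean_interval_days, rng):
    """Cumulative divergence from the index case along one transmission chain,
    under a per-generation and a per-unit-time mutation model."""
    shape = 1.5  # hypothetical gamma shape, chosen only to give variable intervals
    intervals = rng.gamma(shape, mean_interval_days / shape, size=n_generations)
    times = np.cumsum(intervals)  # days since the index case was infected

    # Per-generation model: each transmission adds Poisson(rate) substitutions,
    # however long the infection lasted.
    per_generation = np.cumsum(rng.poisson(rate_per_generation, size=n_generations))

    # Per-unit-time model: substitutions scale with elapsed time; the daily rate
    # is set so that the mean per generation matches the per-generation model.
    daily_rate = rate_per_generation / mean_interval_days
    per_unit_time = np.cumsum(rng.poisson(daily_rate * intervals))
    return times, per_generation, per_unit_time

def fit_through_origin(times, divergence):
    """Least-squares slope with no intercept, plus uncentred R^2 (an assumption)."""
    times = np.asarray(times, dtype=float)
    divergence = np.asarray(divergence, dtype=float)
    slope = np.sum(times * divergence) / np.sum(times ** 2)
    residuals = divergence - slope * times
    ss_total = np.sum(divergence ** 2)
    r_squared = 1.0 - np.sum(residuals ** 2) / ss_total if ss_total > 0 else float("nan")
    return slope, r_squared

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # 0.17 substitutions per genome per generation is the estimate quoted in the abstract;
    # the 25-day mean interval is illustrative only.
    times, per_gen, per_time = simulate_chain(200, 0.17, 25.0, rng)
    for label, divergence in (("per-generation", per_gen), ("per-unit-time", per_time)):
        slope, r2 = fit_through_origin(times, divergence)
        print(f"{label}: slope={slope:.5f} substitutions/day, R^2={r2:.3f}")
```

Raising `rate_per_generation` towards the upper end of the simulated range (3 substitutions per genome per generation) is a quick way to explore when the two models start to produce visibly different divergence patterns in this simplified setting.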

In order to compare our synthetic patterns of divergence over time to real rabies data, a root-to-tip divergence plot was also generated for a dataset of real RABV sequences (data from (36); Figure 1A) using TempEst (v1.5.3 (37)), with the best-fit root located (Figure 1B).

Calculating the per-generation substitution rate
We developed a method of calculating the per-generation substitution rate using the clock rate, the genome length, and the mean generation interval. We assessed this method's accuracy using the synthetic outbreak sequence data, before applying it to the aforementioned set of RABV whole genome sequences.
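As a rough illustration of how these three quantities can be combined, the sketch below converts a clock rate in substitutions per site per year into substitutions per genome per generation by simple unit conversion. This is an assumption about the form of the calculation, not a reproduction of the method developed here; the function name, the 365.25-day year and the example inputs are hypothetical.

```python
DAYS_PER_YEAR = 365.25  # assumed year length for the unit conversion

def per_generation_rate(clock_rate_per_site_per_year,
                        genome_length_sites,
                        mean_generation_interval_days):
    """Convert a clock rate to substitutions per genome per generation."""
    # Substitutions expected across the whole genome in one year.
    subs_per_genome_per_year = clock_rate_per_site_per_year * genome_length_sites
    # Number of infection generations that fit into one year on average.
    generations_per_year = DAYS_PER_YEAR / mean_generation_interval_days
    return subs_per_genome_per_year / generations_per_year

if __name__ == "__main__":
    # Illustrative inputs only: a clock rate within the 1e-4 to 5e-4 range quoted
    # for RABV, a genome of roughly 12 kilobases, and a 25-day mean interval.
    rate = per_generation_rate(2e-4, 12_000, 25.0)
    print(f"{rate:.3f} substitutions per genome per generation")
```

Because the conversion is a plain product of the inputs, any uncertainty in the clock rate or in the mean generation interval carries through proportionally to the per-generation estimate.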

To estimate the mean per-generation substitution rate, we analysed sequence data with

To evaluate the accuracy of this method in predicting the mean per-generation substitution rate, we also applied it to synthetic sequence data generated from outbreaks using the per-

Lognormal) for predicting substitution rates and selected the best-fitting distribution by AIC. We also calculated the probabilities of between 0 and 10 SNP differences occurring across 1,

It can be difficult to get sufficient temporal signal for RABV sequence datasets, potentially due to its variable incubation periods. We hypothesised that a per-generation model of mutation may be more representative of RABV evolution than a purely time-based model. We found that substantial differences in root-to-tip divergence patterns between synthetic outbreaks using generation-based and time-based models of mutation could be observed only at high underlying substitution rates. The substitution rate for the Tanzanian

where sampling is opportunistic.

The observation of little difference between root-to-tip divergence plots derived from the two

to affect the per-generation rate. The low per-generation substitution rate seen in rabies is therefore likely due to mutation being constrained by other factors, such as strong purifying selection (10).

We can predict from the estimated per-generation substitution rate that identical sequences

became apparent that our one-size-fits-all approach of using a BEAST chain length of 10,000,000 was insufficient for such large amounts of data. While it is unlikely that this number of sequences will be used for this purpose in the future, the prediction accuracy is likely to be increased simply by using an appropriate chain length.
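The idea that a per-generation substitution rate can be used to predict how often sequences will be identical can be illustrated with a simple calculation. The sketch below assumes that the number of substitutions per generation is Poisson distributed and independent between generations; this is a simplifying assumption and may not match the distribution selected by AIC in the analysis. The function name and its parameters are hypothetical.

```python
import math

def snp_difference_probabilities(rate_per_generation, n_generations, max_snps=10):
    """Probability of observing 0..max_snps SNP differences between two cases
    separated by n_generations, under an assumed Poisson model."""
    # Independent Poisson counts add, so the expected total is rate * generations.
    expected = rate_per_generation * n_generations
    return [
        math.exp(-expected) * expected ** k / math.factorial(k)
        for k in range(max_snps + 1)
    ]

if __name__ == "__main__":
    # 0.17 substitutions per genome per generation is the estimate quoted in the abstract.
    for generations in (1, 5, 10):
        probabilities = snp_difference_probabilities(0.17, generations)
        print(f"{generations} generation(s): P(identical) = {probabilities[0]:.3f}")
```

Under this assumption the probability of identical sequences falls exponentially with the number of generations separating two cases, which is why a low per-generation rate implies that directly linked cases will often share identical genomes.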