Empirical estimates of the mutation rate for an alphabaculovirus

Mutation rates are of key importance for understanding evolutionary processes and predicting their outcomes. Empirical mutation rate estimates are available for a number of RNA viruses, but few are available for DNA viruses, which tend to have larger genomes. Whilst some viruses have very high mutation rates, lower mutation rates are expected for viruses with large genomes to ensure genome integrity. Alphabaculoviruses are insect viruses with large genomes and often have high levels of polymorphism, suggesting high mutation rates despite evidence of proofreading activity by the replication machinery. Here, we report an empirical estimate of the mutation rate per base per strand copying (s/n/r) of Autographa californica multiple nucleopolyhedrovirus (AcMNPV). To avoid biases due to selection, we analyzed mutations that occurred in a stable, non-functional genomic insert after five serial passages in Spodoptera exigua larvae. Our results highlight that viral demography and the stringency of mutation calling affect mutation rate estimates, and that using a population genetic simulation model to make inferences can mitigate the impact of these processes on estimates of mutation rate. We estimated a mutation rate of μ = 1×10−7 s/n/r when applying the most stringent criteria for mutation calling, and estimates of up to μ = 5×10−7 s/n/r when relaxing these criteria. The rates at which different classes of mutations accumulate provide good evidence for neutrality of mutations occurring within the inserted region. We therefore present a robust approach for mutation rate estimation for viruses with stable genomes, and strong evidence of a much lower alphabaculovirus mutation rate than supposed based on the high levels of polymorphism observed.


21
Virus populations can evolve rapidly, driven by the large number of mutations that occur during 22 virus replication. It is challenging to measure mutation rates because selection will affect which 23 mutations are observed: beneficial mutations are overrepresented in virus populations, while 24 deleterious mutations are selected against and therefore underrepresented. Few mutation rates 25 have been estimated for viruses with large DNA genomes, and there are no estimates for any 26 insect virus. Here, we estimate the mutation rate for an alphabaculovirus, a virus that infects 27 caterpillars and has a large, 134 kilobase pair DNA genome. To ensure that selection did not bias 28 our estimate of mutation rate, we studied which mutations occurred in a large artificial region 29 inserted into the virus genome, where mutations did not affect viral fitness. We deep sequenced 30 evolved virus populations, and compared the distribution of observed mutants to predictions from 31 a simulation model to estimate mutation rate. We found evidence for a relatively low mutation 32 rate, of one mutation in every 10 million bases replicated. This estimate is in line with expectations 33 for a virus with self-correcting replication machinery and a large genome. 34

35
Mutation rates are of key importance for understanding and predicting evolutionary patterns, as 36 the mutation rate modulates the mutation supply of a population [1]. Large mutation supplies can 37 fuel rapid and repeatable adaptation [2,3], but also increase the mutational load on a population 38 [4]. By contrast, low mutation supplies can limit the rate of adaptation [5], but also result in a lower 39 mutational load [4]. The impact of mutational supply also depends on the topography of the fitness 40 landscape. Small mutational supplies can have advantages for evolution on rugged fitness 41 landscapes: although adaptation will be slower and in most cases less fit genotypes will be 42 selected, some populations can avoid becoming trapped on local fitness peaks [6]. Mutation rates 43 are not only relevant to understanding basic evolutionary processes, but they also impinge on 44 real world outcomes, such as the efficacy of prophylactic or therapeutic interventions to infectious 45 diseases [7,8]. 46 Viruses have high mutation rates [7,9,10], with estimates of mutations per site per strand 47 copying (s/n/r) ranging from 2×10 -8 for Enterobacteria phage T2 [11] to 2×10 -4 for Influenza A virus 48 [12]. Whilst these high mutation rates are thought to contribute to the rapid adaptation of viruses, 49 the majority of mutations are typically deleterious [13,14]. Many viruses with large genomes 50 belong to Group I of the Baltimore classification (dsDNA viruses) [15], and typically have 51 polymerases with proofreading activity, which should enhance the fidelity of replication [16]. A 52 general expectation is therefore that viruses with relatively large genomes have lower mutation 53 rates [9]. An inverse relationship between genome size and mutation rate indeed has been found. 54 Small genomes can tolerate higher mutation rates as a larger proportion of mutation-free 55 genomes are generated in each round of replication, due to their small size [9,17]. 56 The alphabaculoviruses are a large group of insect baculoviruses that have been studied 57 because of their biocontrol and biotechnological potential [18,19]. Alphabaculoviruses [13,14], and selection will 67 remove these mutations from populations. By contrast, beneficial mutations will increase in 68 frequency due to selection. These opposing effects of purifying and directional selection make it 69 problematic to derive information on mutation rates directly from mutation accumulation in a viral 70 genome. Many different approaches have been developed to remove the bias introduced by 71 selection [7,9]. For example, some studies considered the frequency of lethal mutations in a 72 population, since these variants cannot replicate autonomously and therefore represent a 73 snapshot of genetic variation [25]. Others have evolved viruses in hosts expressing a viral gene, 74 and then restricting their analysis to the sequence of this redundant viral gene [26]. Another 75 strategy reported recently has been to incorporate fluorescent markers with inactivating mutations 76 into a viral genome, and then performing fluctuation tests based on recovering fluorescence [12]. 77 In the current study, we report the first empirical estimate of mutation rate for a large 78 dsDNA virus, the alphabaculovirus Autographa californica multiple nucleopolyhedrovirus 79 (AcMNPV). To ensure selection did not bias our estimates, we analyzed virus populations carrying 80 a large, nonfunctional genomic insert that was stably maintained [27], exploiting the genomic 81 stability of the Group I viruses [28]. For our analysis, we assumed that mutations in this region 82 are neutral due to the absence of known viral genes and regulatory sequences, and verified this 83 assumption. We also developed a population genetics model to estimate mutation rates from 84 empirical data, that incorporates the effects of population bottlenecks and different modes of virus 85 replication on the occurrence and maintenance of mutations. Using this approach, we made 86 robust estimates of the mutation rate for a baculovirus for the first time. 87

Serial passage and detection of mutations 89
To estimate the mutation rate of a large dsDNA virus, we experimentally evolved a variant of 90 alphabaculovirus AcMNPV containing a stable, non-functional genomic region. The AcMNPV 91 variant used was a so-called bacmid: an infectious clone that also contains the AcMNPV genome 92 (~134 kb). It also contains bacterial sequences that enable propagation as a low copy number 93 plasmid in Escherichia coli (~12.5 kb) and the acceptance of expression cassettes by 94 transposition [27]. The specific variant used here contains an expression cassette from the 95 pFastBac-Dual vector to restore expression of complete and functional polyhedrin ( Figure 1). We 96 consider the inserted bacterial sequences to be non-functional and therefore neutral in insects, 97 except for the polyhedrin promoter and open reading frame (ORF) sequences derived from the 98 pFastBac Dual vector. This renders an 11,646 bp neutral region for a mutation accumulation 99 experiment ( Figure 1). By contrast, the remainder of the AcMNPV genome is intact and unaltered, 100 making this system representative for this virus. We will hereafter refer to this bacmid-derived 101 AcMNPV variant with restored polyhedrin expression as "BAC". The virus could be reconstituted 102 from the infectious clone by transfection of the BAC genome into fourth instar (L4) Spodoptera 103 exigua (Hübner) (see Materials and Methods). Mutations were allowed to accumulate across the 104 BAC genome by experimentally evolving five replicate BAC lineages (referred to as lineages A, 105 B, C, D and E) for five passages in S. exigua L2. For each replicate lineage, passaging was 106 performed in five larvae exposed to a high viral dose of occlusion bodies (OBs) sufficient to kill all 107 larvae. Upon death larval cadavers were collected and pooled prior to the isolation of OBs, which 108 were used to inoculate larvae orally for the next passage. Further details on the experimental 109 evolution experimental setup can be found in the Materials and Methods. 110 Evolved lineages A -E, as well as the ancestral BAC were sequenced using Illumina 111 HiSeq to detect mutations in both the non-functional genomic insert, as well as across the whole 112 baculovirus genome. When mapping reads to the BAC reference genome we noticed a correlation 113 between low sequencing coverage and the calling of mutations, and we therefore removed 114 regions with relatively low coverage ( Figures S1-S4, see Materials and Methods). We show that 115 mutations indeed accumulated across lineages and passages (Table 1)

Low mutation rate for AcMNPV with bacmid and whole-genome mutation data 150
To make robust inferences on the mutation rate (μ) from these experimental data, we developed 151 a population genetic model. Briefly, we generated a stochastic model simulating neutral evolution 152 in a virus genome, modelled as mutation frequencies per base per viral generation. We fitted this 153 model to the experimental data by considering the number of bases with a frequency of mutations 154 above the threshold τ, using a maximum likelihood approach. We estimated the viral mutation 155 rate to be ~1 × 10 -7 s/n/r, with all estimates for the neutral bacmid region and the whole genome 156 data being similar ( Figure 2). Below we describe these results in more detail. 157 The simulation model was run for a range of model parameter values for mutation rate,  Figure S5-S6). We also 164 estimated the mutation rate with established models for sequencing data from clones [9], adapting 165 these methods to use the frequency of mutations determined by high-throughput sequencing data 166 instead of Sanger sequences from clones (see Materials and Methods). These estimates were 167 similar to those obtained with the first approach, although they tended to be about a factor 2.5 168 smaller (μ ≤ 4 ×10 -8 s/n/r, Figure S7). This difference was expected, as this alternative approach 169 does not take into consideration the effects of the mutation detection threshold τ, but rather 170 assumes all mutations will be detected irrespective of their frequency. As the sensitivity limits of 171 the deep-sequencing data is not taken into account, this approach will underestimate the mutation 172 rate. 173 One of the purported strengths of the first approach is that we have a large region (11.6 174 kb) in which point mutations will not affect fitness, given there are no known viral genes or 175 regulatory sequences (Figure 1 given that the number of mutations we detected was small and that the model fitting only considers 202 the number of sites with a mutation frequency above τ, and not the frequency of individual 203 mutations. For baculoviruses, the mode of replication has not been described formally, to the best 204 of or our knowledge. However, as these viruses probably employ rolling circle amplification [32], 205 replication is likely to follow the "stamping machine" scenario and to be described best by high 206 values of ρ. 207

AcMNPV mutation rate estimate is congruent with estimates for other viruses 208
Mutation rates (s/n/r) have been estimated for a number of viruses, allowing for a comparison with 209 our baculovirus estimate. For this comparison, we included a collection of mutation rate data 210 We considered whether our data shed light on AcMNPV's mutation spectrum, the occurrence of 228 the different kinds of single-nucleotide mutations, focusing on the transition to transversion ratio 229 (Table 2; see also Table S1). For the neutral region, the number of observed mutations is too 230 small to be informative, even for the lowest mutation detection threshold of τ = 0.5%.  We estimated the mutation rate for the alphabaculovirus AcMNPV, using a novel approach 264 that depends on the insertion of an artificial neutral region in the viral genome, and which starts 265 with a single genotype and exploits the genome stability of group I dsDNA viruses. Such an 266 approach requires mutations in this 'artificial' region to be neutral, and dI/dS results suggest this 267 assumption is met. Whereas others have employed rolling circle amplification of sequencing 268 templates to reduce sequencing error [38], here we developed an approach to analyzing regular 269 Illumina sequencing data that were not gathered specifically with the intent of determining 270 mutation rates. Our approach relies on a detailed analysis of the sequencing data to eliminate 271 obvious sequencing biases and a comparison of sequencing data to simulation-model predictions. 272 The use of models allows us to take into consideration the effects of viral demography on mutation 273 rate estimates, and also limit the impact of choosing thresholds for mutation detection, as the 274 same threshold is applied in the model. 275  Minimum frequency for the detection of mutations, indicated as a percentage.

Materials and Methods
Fitted ρ 1, 3, 10 Parameter that describes the viral mode of replication, with 1 being equivalent to equivalent to "geometric growth" by fission and large values (i.e., 10) representing "stamping machine" replication kinetics.

Mutation calling and filtering 303
Because the number of viral reads was not equal across samples, fastq files were subsampled to 304 ensure an approximately equal mean coverage across the reference genome for each isolate 305 using seqtk sample [41]. NGS data was analyzed using CLC Genomics Workbench 20.0 [42]. 306 Reads were trimmed (quality limit = 0.05) and mapped to the reference genome (see Online 307 Materials for sequence and annotation files). Mutations were called using the "low frequency 308 variant detection tool" (minimum frequency = 0.5%). An overview of parameter settings can be 309 found in the Supplementary Online Materials (Appendix 1). 310 Due to the small likelihood of the same mutation occurring independently in two or more 311 populations, we assume mutations found in two or more evolved populations were present in the 312 ancestral population (having been present at low frequency in the BAC DNA or occurring de novo 313 after reconstitution of the virus. Therefore, mutations called in multiple lineages and/or in the 314 ancestral BAC isolate were filtered out. Additional filtering criteria were: forward-reverse balance 315 > 0.05, read count > 10, and the type of mutation is "SNV" (single nucleotide variant). Additionally, 316 positions with extreme coverage values were excluded. To this end, we ranked the coverage 317 value per position for each lineage and excluded the upper and lower 1%. Analyses were done 318 with three thresholds (τ) for mutation frequency: 0.5%, 1% and 2% (see Supplementary Online 319 materials, Appendices 2-4). Finally, mutations were tallied per isolate, both across the whole 320 genome and for the neutral bacmid insert. 321

Mutation model and mutation rate estimation 322
We generated a stochastic model that predicts the distribution of mutation frequencies per base 323 in an evolving virus genome, and then fitted this model to our empirical data with a maximum 324 likelihood approach to obtain mutation rate estimates. The model was implemented in R 4.0.3 325 [43] and all code is available (see Supplementary Online Material, Appendices 5-6). We model 326 the genome region under consideration as a vector with g elements for each nucleotide position, 327 with each element representing the total frequency f of mutated bases at position i. For simplicity, 328 we do not consider the identity of the mutated bases and we do not allow for reversions, as we 329 are considering scenarios in which mutations are rare and the probability of a reversion occurring 330 and reaching high frequency is very low. We assume that all mutations are strictly neutral, and 331 that all changes in mutation frequency result from the occurrence of de novo mutations or neutral 332 processes like stochastic changes in allele frequencies due to population bottlenecks (i.e. genetic 333 drift). All model parameter values are given in Table 3 Table 3). The virus population then expands within the host 342 exponentially with a replication factor ⍴ per cycle of virus replication within the host, such that 343 1 where N is the number of genomes present at a time t, measured in generations of viral 344 replication within the host. It is unknown what the mode of replication [30,31] is for a baculovirus. 345 We therefore used values of 1 ("geometric" or "symmetric" replication with a doubling of the 346 number of copies per cycle -one original genome copy and one replicated copy), 3 ("mixed" 347 replication -one original genome copy and three replicated copies) and 10 ("stamping machine" 348 or "asymmetric" replication) for ⍴. Replication proceeds until the carrying capacity κ of a host is To obtain c we estimate the number of generations assuming different values for ⍴ (i.e., 1, 3 and 391 10 as for the simulation-based model fitting), such that ⋅ / / 1 , where θ is the 392 number of passages. Finally, we can drop α because we only consider mutations in the neutral 393 bacmid region. One-thousand bootstrapped datasets were used to obtain fiducial limits for the 394 mutation rate estimates. 395

dI/dS and dN/dS analyses 396
Estimates of dN/dS (i.e., the normalized rate of nonsynonymous mutations, here made for 397 authentic viral genes) were made using standard methods [46,47]. As we are not aware of any 398 estimates of the mutation spectrum for insect DNA viruses and our own data suggest these biases 399 may not be very strong (Table 2), we assume no mutation bias is present (i.e., a transition to 400 transversion ratio of 1). For the dI/dS [3], the dS term is the same as for the dN/dS analysis, 401 derived from the results for authentic viral genes. The dI term is determined for mutations in the 402 bacmid neutral region. Ninety-five percent confidence intervals of the dI/dS and dN/dS were 403 obtained using 1000 bootstrapped datasets, and data were tested for significance with a one- isolates. The peak observed at around 10000 bp for the BAC isolate is due to the presence of 538 empty bacmid vectors in sequencing data and is omitted from mutation calling. 539 isolates. The peak observed at around 10000 bp for the BAC isolate is due to the presence of 544 empty bacmid vectors in sequencing data and is omitted from mutation calling. 545 clearly effect model fit, as they have an effect on the number of mutations that will be detected. 556 By contrast, assumptions on the value for the parameter that determines the mode of virus 557 replication (ρ) had little effect on model fit. This result is not surprising however, given that our 558 model does not consider the frequency of mutations, but simply the number of bases with a 559 mutation frequency greater than τ. 560 Figure S6: Results of the fitting of the simulation model to the experimental data, for making 562 mutation rate estimates. For all panels, the abscissa is the log10 mutation rate and the originate 563 the negative log likelihood. Panels on the left correspond to the neutral bacmid region data, and 564 panels on the right to the whole genome data. Different thresholds for the detection of mutations 565 (τ) were also employed, ranging from 0.5% (bottom row) to 1% (middle row) to 2% (top row). 566 The different colors and line types indicate model fit for different values of ρ, the parameter that 567 describes the viral mode of replication. Whereas higher values of ρ lead to lower mutation rate 568 estimates, this model parameter does not impact the minimum for the NLL. 569 Figure S7: Estimate of mutation rate (s/n/r) using established methods applied to deep 571 sequencing data (blue bars, samples categorized as "Classic"), as an alternative to our approach 572 using a simulation-based model (red bars, samples categorized as "Simulation model"). Mutation 573 rates were estimated for different values of the viral mode of replication (⍴) and with different 574 values for the threshold of mutation detection (τ). Error bars represent the 95% fiducial limits, as 575 determined by bootstrapping. When the lower fiducial limit extends beyond the lower limit of the 576 axis, this indicates a lower fiducial limit of zero. Overall, these estimates were lower than those 577 obtained with the approach employing a simulation model. As baculoviruses most likely employ 578 rolling circle amplification, replication is likely to have a high value of ⍴. Furthermore, this method 579 is clearly sensitive to differences in the mutation detection threshold, as the cumulative frequency 580 of mutations above this value is used required to estimate mutations and no correction is made 581 for this threshold. Therefore, the best estimates with this approach assume the highest value of 582 ⍴ and the lowest mutation detection thresholds (τ), provided all mutations are assumed to be bona 583 fide. The estimate for these conditions (τ = 0.5% and ⍴ = 10), renders an estimate of μ = 4 x 10 -8 584 which is lower but roughly similar than for our simulation-based approach (μ ~ 10 -7 ). 585