A mutational gradient drives somatic mutation accumulation in mitochondrial DNA and influences germline polymorphisms and genome composition

Background Mutations in the mitochondrial genome (mtDNA) can cause devastating maternally inherited diseases, while the accumulation of somatic mtDNA mutations is linked to common diseases of aging. Although mtDNA mutations impact human health, the process(es) that give rise to these mutations are unclear and are under considerable debate. We analyzed the distribution of naturally occurring somatic mutations across the mouse and human mtDNA obtained by Duplex Sequencing to provide clues to the mechanism by which de novo mutations arise as well as how the genome is replicated. Results We observe two distinct mutational gradients in G→A and T→C transitions, but not their complements, that are delimited by the light-strand origin and the control region (CR). The gradients increase with age and are lost in the absence of DNA polymerase γ proofreading activity. A nearly identical pattern is present in human mtDNA somatic mutations. The distribution of mtDNA single nucleotide polymorphisms (SNPs) in the human population and genome base composition across >3,000 vertebrate species mirror this gradient pattern, pointing to evolutionary conservation of this phenomenon. Lastly, high-resolution analysis of the mtDNA control region highlights mutational ‘hotspots’ and ‘cold-spots’ that strongly align with important regulatory regions. Conclusions Collectively, these patterns support an asymmetric strand-displacement mechanism with key regulatory structures in the CR and argue against alternative replication models. The mutational gradient is a fundamental consequence of mtDNA replication that drives somatic mutation accumulation and influences inherited polymorphisms and, over evolutionary timescales, genome composition.

Introduction oocytes shows a similar bias towards the strand-asymmetric accumulation of transitions, establishing a 125 mechanistic link between the dominant mutagenic process in somatic tissues and what is seen in 126 population genetics [27]. However, to date, no mutational gradient has been reported outside the context 127 of phylogenetic analyses and it remains an open question if it is an active process or a byproduct of 128 selective pressure over time.

129
In this report, we have taken advantage of several large high accuracy mtDNA mutation data     Table 1). Notably, the G→C gradient is >10-fold smaller than the transition-based 177 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

183
Our mouse mtDNA data set combined mutation profiles of 8 unique tissue types from 6 organ 184 systems, therefore we sought to validate our analysis by accounting for the tissue-specific effects and 185 local differences in sequence contexts identified in these data (Sanchez-Contreras & Sweetwyne et al.,

186
in preparation). To address the possibility that one tissue type in our data was driving the observed 187 gradient, we performed a leave-one-out approach by eliminating one tissue and then performed the 188 same analysis on the reduced data set, repeating this analysis for each tissue type. As expected, the 189 removal of data of any one tissue type did not alter our findings (Supplemental Table 2). These results 190 strongly point to the gradient not being an artifact of any single tissue type in our data. We next 191 addressed the potential impact of different local sequence contexts within each bin by performing Monte-

192
Carlo simulations that randomly redistributed each mutation observed in the 153 bins corresponding to 193 the non-CR portion of the genome (genome positions 1-15,400) using a weighted probability for each 194 bin based on its trinucleotide base composition. After redistribution, the mutation frequency was then 195 recalculated for each bin and the procedure repeated 10,000 times. As expected, we observe no  Table 3).

197
Our analysis confirms that the strong positive mutational gradients in G→A and T→C transitions in both 198 the major and minor arcs of the mouse mtDNA are not artifacts and is most consistent with a strand 199 asynchronous replication mechanism and inconsistent with a conventional leading/lagging strand CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made Effects of age and Pol-γ fidelity on the mutational gradient support an asynchronous replication 203 model.

204
A key question is the identity of biological process giving rise to observed gradient. As noted 205 previously, the classic asynchronous replication model hypothesizes a long-lived ssDNA intermediate.

206
The consequence of this model is that the portions of the mtDNA closest to their initiating origin should 207 disproportionately accumulate G→A and T→C L-strand mutations during the aging process due to more 208 time in the single-stranded state and should manifest as an increase in the gradient slope over time. To 209 test this hypothesis, we made use of the young (4-5 mo; n=9,093 mutations) and old age (26 mo; 210 n=25,020 mutations) cohorts in our data set to evaluate the interaction between aging and genome 211 position on the gradient slope. Both major arc G→A and T→C L-strand gradients, as well as T→C 212 mutations in the minor arc, exhibit a significant increase in their respective slopes during aging (Major 213 arc: G→A interaction=4.21±0.80x10 -8 , p=1.54x10 -7 ; T→C interaction=1.03±0.11x10 -8 ; p=1.31x10 -21 ;

219
While the non-uniform increase in mutations with age is most consistent with deamination, it is 220 possible that some other aspect of mtDNA replication could lead to this pattern. For example, Pol-γ is 221 thought to exist both with and without its p55 accessory subunit that has been reported to affect fidelity    Pol-γ exopoints to a mechanism that is extrinsic to the polymerase itself and is, again, most consistent 243 with a DNA replication intermediate with a long-lived single-stranded state.

246
The CR contains several important regulatory elements, including both transcriptional promoters, 247 the OriH, several highly conserved sequence blocks (CSB), and extended termination-associated 248 sequences (ETAS), whose specific regulatory functions are incompletely understood (Reviewed in [35]).

249
We and others have noted a distinctly different mutation frequency and spectrum in the CR compared 250 to the coding portion of the genome in both humans and mice [7,27,36], suggesting that the unique 251 function and structure may strongly influence CR mutagenesis, but high-resolution mapping of mutations 252 has not been reported.

253
The CR lies at the extreme 3' terminal end of the murine reference genome, which presents 254 issues during data alignment that gives rise to significant biases in sequence depth and mutation calls.

255
. CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made To address this potential bias, we modified the mouse mtDNA reference to place the CR in the middle 256 of the sequence and realigned our data to this modified reference. In addition, we decreased our bin 257 size to 50bp to allow for a higher resolution mapping of mutations. The CR exhibits prominent spikes   cold-spots that correspond to highly conserved regulatory elements responsible for the distinctive 279 mutational bias previously noted in the CR. Additionally, our data suggest the presence of unique DNA 280 structures within these sequence blocks that differently affect DNA damage and/or replication fidelity 281 and also suggest that some regions important for mtDNA replication may poorly tolerate mutagenesis.

282
. CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted June 18, 2021. ; https://doi.org/10.1101/2021.06.18.448725 doi: bioRxiv preprint 283 A mutational gradient is conserved in human mtDNA.

284
We next determined the evolutionary conservation of the patterns we observe in our mouse data.

285
To do so, we made use of prior reported Duplex-Seq data sets for human mtDNA [10,28]. As with the 286 mouse data, we performed a binned mutation frequency analysis with bin size of 200bp due to the 287 reduced number of mutations compared to our mouse data. Consistent with our mouse data, we observe 288 a gradient for both G→A and T→C mutations in the major arc (G→A: slope=4.86±0.93x10 -7 , p=1.93x10 -289 7 ; T→C: slope=1.84±0.26x10 -7 , p=1.69x10 -12 ) that is bounded by the OriL and CR ( Figure 4A Table 6; Supplemental Data 6). Unlike the mouse data, the minor arc did not exhibit an 291 apparent gradient and no other mutation types exhibited a significant increase in either the major or 292 minor arcs (Supplemental Figure 6; Supplemental Table 6).

293
Our analysis points to a somatic mutational gradient as an evolutionarily conserved feature of in Duplex-Seq data. Importantly for our purpose, this characteristic of the tumor data is expected to 305 largely eliminate the potential confounder of low frequency artifacts giving rise the observed gradient.

306
We divided the genome into 100bp bins and, for each mutation type, calculated the mutation 307 density (i.e. mean number of detected mutations per wild-type base) in each bin (Supplemental Data 7).

308
. CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made   Table 7). These data confirm both the presence of a mutational 314 gradient in the major arc and that our results are unlikely to be due to an unknown issue with Duplex-

315
Seq. Taken together, both our Duplex-Seq data and the PCAWG data recapitulate our findings in mouse 316 mtDNA, pointing to the strong evolutionary conservation of G→A and T→C gradients among vertebrate 317 species.

319
A mutational gradient is mirrored in germline SNPs and genome base composition.

320
Previous work has noted similarities in the strand orientation and simple mutational spectra

333
Consistent with our somatic data, we observe a significant positive gradient in G→A and T→C ( Figure   334 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made our initial data set, we observe a significant correlation between SNP density and genome position of 339 G→A SNPs (ρ=0.26; p=0.046; Spearman correlation). We also observe a significant correlation between 340 G→C SNP and genome position (ρ=0.307; p=0.019; Spearman correlation) similar to the MITOMAP 341 based dataset. No other significant correlations were observed (Supplemental Table 9). Thus, we are 342 able to confirm that, at the very least, a G→A gradient is present in human polymorphisms, consistent 343 with our somatic mtDNA data, and further supports the idea that the mechanism of mutagenesis in the 344 somatic tissue is likely the direct driver of human mtDNA variation.

345
The strong conservation of the somatic gradient between mice and humans and the presence of 346 the gradient in human SNP data suggest that this unusual mutational pressure is likely a major driver of 347 sequence diversity across species. Our somatic data point to a sustained G→A and T→C mutational

355
We sought determine the generality of the gradient phenomenon by expanding these findings to   (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted June 18, 2021. ; https://doi.org/10.1101/2021.06.18.448725 doi: bioRxiv preprint associated G→T/C→A mutations. In addition, these studies have shown a strong strand bias, with 388 G→A/C→T mutation being more prevalent when the dG base is in the L-strand, as well as a notable 389 difference in the mutational frequency and spectrum in the CR. While these studies have provided a 390 broad understanding of mtDNA mutagenesis, the very low frequency of mutations (<1x10 -5 ) means that, 391 for any given sample, only a few dozen to a few hundred mutations are typically detected, leaving 392 conclusions about how these mutations are distributed unclear beyond broad regional differences (i.e.

393
CR vs coding or between genes). In this study, we aggregated several pre-existing Duplex-Seq data 394 sets to better asses the distribution of mutations across the mtDNA molecule at significantly higher 395 resolution than what has been previously reported.

396
In addition to recapitulating previous findings showing a strong bias towards transition mutations 397 over transversions and a higher mutation load in the CR, our analysis shows a strikingly non-uniform 398 gradational distribution of G→A and T→C transitions, but not their complement, along the coding portion 399 of the mtDNA. The totality of our data is most consistent with an asynchronous strand displacement 400 mechanism with a long lived, deamination prone, single-stranded H-strand. A key aspect of our data 401 that supports this hypothesis is the increased slope of G→A and T→C mutations with advancing age.

433
We can clearly discern the location of the OriL in both mouse and human data sets. These data 434 also argue against the proposed use of other tRNAs as L-strand priming sites, as well as a large 435 'initiation zone' for replication, as these models predict either multiple discontinuous gradients or lack a 436 gradient entirely. Instead, our data are consistent with a single L-strand origin in mammalian mtDNA.

437
Moreover, with our high-density mouse data set, we mapped areas of mutation over-and under-438 abundance in the CR that correspond to sequence blocks essential for mtDNA H-strand replication.

454
The strength of this gradient, as highlighted by the variation in anticorrelation between complementary 455 bases and the presence of species that deviate significantly from the trend line, can vary significantly.

456
An important aspect of these observations is that they provide a feasible opportunity to mechanistically 457 study the processes that give rise to genetic variation at the population and taxonomic levels. This is

493
. CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted June 18, 2021. ; https://doi.org/10.1101/2021.06.18.448725 doi: bioRxiv preprint After obtaining the raw data, we processed all Duplex-Seq data using v1.1.3 of our in house 494 developed Snakemake-based[53] Duplex-Seq-Pipeline to ensure that all data were uniformly processed 495 with the exception that different data sets had different unique molecular identifier (UMI) and read 496 lengths. Briefly, we perform a reference free consensus approach for error correction similar to that

519
An exception to this process was made for mouse mtDNA due to the presence of a ~5000bp nuclear 520 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted June 18, 2021. ; https://doi.org/10.1101/2021.06.18.448725 doi: bioRxiv preprint pseudogene with perfect identity to mm10 chrM [59]. In this case, any ambiguous BLAST alignments 521 mapping to this region were assumed to be mitochondrial in origin and kept. The mutated reads passing 522 our BLAST filter are merged back with the non-mutated reads and mutation frequencies calculated (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted June 18, 2021. ; https://doi.org/10.1101/2021.06.18.448725 doi: bioRxiv preprint regression model with the addition of an interaction term between age and bin number ( = + * 572 + * + ( * )), with age being the categorical classifier with value 0 (young) 573 or 1 (old), was used. (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

870
. CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted June 18, 2021. ; https://doi.org/10.1101/2021.06.18.448725 doi: bioRxiv preprint