Genome-wide analysis of mitochondrial DNA copy number reveals multiple loci implicated in nucleotide metabolism, platelet activation, and megakaryocyte proliferation

Blood-derived mitochondrial DNA copy number (mtDNA-CN) is a minimally invasive proxy measure of mitochondrial function that exhibits both inter-individual and intercellular variation. While mtDNA-CN has been previously associated with various aging-related diseases, little is known about the genetic factors that may modulate this phenotype. We performed a genome-wide association study (GWAS) in 465,809 individuals of White (European) ancestry from the Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) consortium and the UK Biobank (UKB). We identified 129 SNPs with statistically significant, independent effects associated with mtDNA-CN across 96 loci. A combination of fine-mapping, variant annotation, co-localization, and gene set enrichment analyses were used to prioritize genes within each of the 129 independent sites. Putative causal genes were enriched for known mitochondrial DNA depletion syndromes (p = 3.09 × 10−15) and the gene ontology (GO) terms for mtDNA metabolism (p = 1.43 × 10−8) and mtDNA replication (p = 1.2 × 10−7). A clustering approach leveraged pleiotropy between mtDNA-CN associated SNPs and 42 mtDNA-CN associated phenotypes to identify functional domains, revealing five distinct groups, including platelet activation, megakaryocyte proliferation, and mtDNA metabolism. In conclusion, in a GWAS of mtDNA-CN conducted in >450,000 individuals, we identified SNPs within loci that implicate novel pathways that provide a framework for defining the underlying mechanisms involved in genetic control of mtDNA-CN.


100
Mitochondria are the cellular organelles primarily responsible for producing the chemical energy 101 required for metabolism, as well as signaling the apoptotic process, maintaining homeostasis, 102 and synthesizing several macromolecules such as lipids, heme and iron-sulfur clusters 1 has been shown to be positively correlated with oxidative stress 6 , energy reserves, and 112 mitochondrial membrane potential 7 . As a minimally invasive proxy measure of mitochondrial 113 dysfunction 8 , decreased blood-derived mtDNA-CN has been previously associated with aging-114 related disease states including frailty 9 , cardiovascular disease 10-12 , chronic kidney disease 13 , 115 neurodegeneration 14,15 , and cancer 16 .

117
Although mtDNA-CN measured from whole blood presents itself as an easily accessible and 118 minimally invasive biomarker, cell type composition has been shown to be an important 119 confounder, complicating analyses 17, 18 . For example, while platelets generally have fewer 120 mtDNA molecules than leukocytes, the lack of a platelet nuclear genome drastically skews 121 mtDNA-CN estimates. As a result, not only is controlling for cell composition extremely vital for 122 accurate mtDNA-CN estimation, but interpreting the results in relation to the impact of cell 123 composition becomes a necessity [18][19][20] .

125
Although the comprehensive mechanism through which mtDNA-CN is modulated is largely

202
Briefly, mtDNA-CN estimates derived from whole exome sequencing data, available on ~50,000 203 individuals, were generated first using customized Perl scripts to aggregate the number of 204 mapped sequencing reads and correct for covariates through both linear and spline regression 205 models. Concurrently, mitochondrial probe intensities from the Affymetrix Axiom arrays, 206 available on the full ~500,000 UKB cohort, were adjusted for technical artifacts through principal 207 components generated from nuclear probe intensities. Probe intensities were then regressed 208 onto the whole exome sequencing mtDNA-CN metric, and beta estimates from that regression 209 were used to estimate mtDNA-CN in the full UKB cohort. Finally, we used a 10-fold cross 210 validation method to select the cell counts to include in the final model (Supplemental Table 2).

211
The final UKB mtDNA-CN phenotype is the standardized residuals (mean = 0, standard 212 deviation = 1) from a linear model adjusting for covariates (age, sex, cell counts) as described in

Identification of Independent GWAS Loci
To identify the initial genome-wide significant (lead) SNPs in each locus, the most significant 234 SNP that passed genome-wide significance (p < 5 x 10 -8 ) within a 1 Mb window was selected.

235
To avoid Type I error, SNPs were only retained for further analyses if there were either (a) at

258
First, variants were converted to an ANNOVAR-ready format using the dbSNP version 150 database 41 . Then, variants were annotated with ANNOVAR using the RefSeq Gene database 42 .

260
The annotation for each variant includes the associated gene and region (e.g., exonic, intronic, 261 intergenic). For intergenic variants, ANNOVAR provides flanking genes and the distance to 262 each gene. For exonic variants, annotations also include likely functional consequences (e.g.,

263
synonymous/nonsynonymous, insertion/deletion), the gene affected by the variant, and the 264 amino acid sequence change (Supplemental Table 4).

267
Co-localization analyses were performed using the approximate Bayes factor method in the R 268 package coloc 43 . Briefly, coloc utilizes eQTL data and GWAS summary statistics to evaluate 269 the probability that gene expression and GWAS data share a single causal SNP (colocalize).

298
and there were no other co-localized genes with a PP within 5%, the gene with the highest 299 posterior probability was assigned. If there was still no assigned gene, the most significant 300 DEPICT gene was assigned. If there was no co-localization or DEPICT evidence, the nearest 301 gene was assigned.

304
Using the finalized gene list from the prioritization pipeline, GO and KEGG pathway enrichment 305 analyses were performed using the "goana" and "kegga" functions from the R package limma 46 , 306 treating all known genes as the background universe 47 . Only one gene per locus was used for 307 "goana" and "kegga" gene set enrichment analysis, prioritizing genes assigned to primary 308 independent hits. If there were multiple assigned genes, one gene was randomly selected to 309 avoid biasing results through loci with multiple genes. To identify an appropriate p-value cutoff,

319
We used metaXcan, which employs gene expression prediction models to evaluate associations 320 between phenotypes and gene expression 48 . We obtained pre-calculated expression prediction 321 models and SNP covariance matrices, computed using whole blood from European ancestry 322 individuals in version 7 of the Genotype-Tissue expression (GTEx) database 49 . Using prediction 323 performance p-values of less than 0.05, a total of 6,285 genes were predicted. Of these genes, 324 74 passed Bonferroni correction of p < 7.95 x 10 -6 . Gene set enrichment analyses were 325 performed on Bonferroni-significant genes as previously described. REVIGO 50 was used on the 326 "medium" setting (allowed similarity = 0.7) to visualize significantly enriched GO terms.

328
We used a one-sided Fisher's exact test to test for enrichment of genes that have been

403
The most significant SNP associated with mtDNA-CN was a missense mutation in 404 LONP1 (p = 3.00x10 -141 ), a gene that encodes a mitochondrial protease that can directly bind 405 mtDNA, and has been shown to regulate TFAM, a transcription factor involved in mtDNA

409
To identify additional independent SNPs within novel loci whose effects were masked by the    Figure 2). For 20 loci, multiple genes were assigned as analyses could not identify a single 434 priority gene (Supplemental Table 9). We noted the identification of a number of mtDNA 435 depletion syndrome genes in the priority list and tested for enrichment of these known causal 436 genes using a one-sided Fisher's exact test. For this analysis, all genes for loci assigned to 437 multiple genes were used, and genes for all primary and secondary loci were considered. Our 438 gene prioritization approach identified 7 of 16 mtDNA depletion genes (Supplemental Table 10 Table 13). Cluster 5 contains a highly diverse set of SNP-trait associations. A 495 secondary clustering only with SNPs mapped to this cluster identified 9 sub-clusters 496 (Supplemental Figure 6A), 3 having clear evidence for enrichment of specific phenotypes.

497
Cluster 6 is enriched for decreased MCV (Supplemental Figure 6B). Cluster 7 is enriched for 498 decreased MPV (Supplemental Figure 6C), PDW, serum phosphate, serum calcium, and 499 PLTCRT. Cluster 10 is enriched for decreased apoA, apoB, total cholesterol (Supplemental 500 Figure 6D), LDL, and HDL, and increased vitamin D, pulse rate, and direct bilirubin. SNPs that 501 were not assigned to a cluster were enriched for decreased fasting blood glucose and maximum 502 workload from fitness testing (Supplemental Table 14).

505
We conducted a GWAS for mtDNA-CN using 465,809 individuals from the CHARGE consortium 506 and the UKB. In addition to replicating two previously reported loci, we discovered 94 novel loci 507 and report multiple independent hits for 26 loci. Examining our GWS SNPs in a Black 508 population, we observed a concordant signal, suggesting that the genetic etiology of mtDNA-CN 509 may be broadly similar across populations. Using several functional follow-up methods, genes 510 were assigned for each identified independent hit and significant enrichment was observed for 511 genes involved in mitochondrial DNA metabolism, homeostasis, cell activation, and amyloid-512 beta clearance. In total, we assigned 124 unique genes to independent GWAS signals 513 associated with mtDNA-CN. We also identified 8 additional genes whose predicted gene 514 expression is associated with mtDNA-CN that could not be mapped back to GWS loci. Finally, 515 using a clustering approach based on SNP associations with various mtDNA-CN associated 516 phenotypes, we were able to functionally categorize SNPs, providing insight into biological 517 pathways that impact mtDNA-CN. We note that during the preparation of this manuscript, a GWAS including 295,150 unrelated individuals from the UK Biobank was published, which 519 reported 50 genome-wide significant regions 33 . While many of our loci overlap with their 520 findings, our study reports twice as many loci largely due to the increased power of our study.

521
We were able to identify a substantial proportion of the genes involved in mtDNA 522 depletion syndromes (7/16, p = 3.09 x 10 -15 for enrichment), including TWNK, TFAM, DGUOK, 523 MGME1, RRM2B, TYMP, and POLG. mtDNA depletion syndromes can be broken down into 5 524 subtypes based on their constellation of phenotypes 65 , and with the exception of 525 cardiomyopathic subtypes (associated with mutations in AGK and SLC25A4), we were able to 526 identify at least 1 gene from the other 4 subtypes, suggesting that our mtDNA-CN measurement 527 in blood-derived DNA can identify genes widely relevant to non-blood phenotypes. This finding 528 is consistent with a large body of work showing that mtDNA-CN measured in blood is 529 associated with numerous aging-related phenotypes for which the primary tissue of interest is 530 not blood (e.g. chronic kidney disease 13 , heart failure 11 , and diabetes 66 ). Also consistent with this 531 finding is recent work demonstrating that mtDNA-CN measured in blood is associated with 532 mtRNA expression across numerous non-blood tissues, suggesting a link between 533 mitochondrial function measured in blood and other tissues 67 .

534
In addition to identifying the mtDNA depletion syndrome genes directly linked to 535 mitochondrial DNA metabolic processes, DNA replication, and genome maintenance, we also 536 identify genes which play a role in mitochondrial function. The top GWAS hit is a missense 537 mutation in LONP1, which encodes a mitochondrial protease that has been shown to cause 538 mitochondrial cytopathy and reduced respiratory chain activity 68,69 . Interestingly, this missense 539 mutation was recently found to be associated with mitochondrial tRNA methylation levels 70 .

540
Additional genes known to impact mitochondrial function include MFN1, which encodes a 541 mediator of mitochondrial fusion 71,72 , STMP1, which plays a role in mitochondrial respiration 73 , 542 and MRPS35, which encodes a ribosomal protein involved in protein synthesis in the 543 mitochondrion 74,75 .
Using a combination of gene-based tests and gene prioritization using functional 545 annotation, pathway analyses reveal enrichment for numerous mitochondrial related pathways, 546 as well as those involved in regulation of cell activation (p < 3.65 x 10 -5 ), homeostatic processes 547 (p < 1.82 x 10 -5 ), and regulation of immune system processes (p < 2.75 x 10 -5 ) (Supplemental 548   Table 11). These results provide additional evidence for the broad role played by mitochondria 549 in numerous aspects of cellular function. Of particular interest, the GO term for amyloid beta is

565
Given the integral role of mitochondria in cellular function, not just with ATP 566 formation/energy production, but signaling through ROS, and its key role in apoptosis, there is a 567 strong reason to a priori assume that genetic variants associated with mtDNA-CN are likely to 568 be highly pleiotropic. Indeed, mtDNA-CN itself is associated with numerous phenotypes 569 (Supplemental Table 5). Through our PHEWAS-based clustering approach using 42 mtDNA-CN associated phenotypes, we uncovered phenotypic associations between five distinct clusters of 571 GWS mtDNA-CN associated SNPs. Cluster 1-3 were characterized by increased MPV, PDW,

591
Cluster 5 was particularly challenging to interpret, given that no particular phenotype was 592 enriched relative to the non-cluster 5 SNPs. We note that this cluster appeared to be enriched 593 for the mtDNA depletion syndrome genes, containing 6/7 genes identified in the GWAS, and 594 significantly enriched for GO Terms related mitochondrial DNA. Examining sub-clusters within 595 cluster 5 identified MCV as an important phenotype (while only sub-cluster 6 was formally associated, looking at Figure 3B suggests more widespread clustering based on MCV). MCV is 597 a measure of the average volume of a red blood corpuscle, and the link between red blood 598 volume and mtDNA-CN is unlikely to be direct, given that red blood cells contain neither nuclei 599 nor mitochondria. Sub-cluster 7 is surprisingly enriched for variants associated with decreased