mity: A highly sensitive mitochondrial variant analysis pipeline for whole genome sequencing data

Motivation Mitochondrial diseases (MDs) are the most common group of inherited metabolic disorders and are often challenging to diagnose due to extensive genotype-phenotype heterogeneity. MDs are caused by mutations in the nuclear or mitochondrial genome, where pathogenic mitochondrial variants are usually heteroplasmic and typically at much lower allelic fraction in the blood than affected tissues. Both genomes can now be readily analysed using unbiased whole genome sequencing (WGS), but most nuclear variant detection methods fail to detect low heteroplasmy variants in the mitochondrial genome. Results We present mity, a bioinformatics pipeline for detecting and interpreting heteroplasmic SNVs and INDELs in the mitochondrial genome using WGS data. In 2,980 healthy controls, we observed on average 3,166× coverage in the mitochondrial genome using WGS from blood. mity utilises this high depth to detect pathogenic mitochondrial variants, even at low heteroplasmy. mity enables easy interpretation of mitochondrial variants and can be incorporated into existing diagnostic WGS pipelines. This could simplify the diagnostic pathway, avoid invasive tissue biopsies and increase the diagnostic rate for MDs and other conditions caused by impaired mitochondrial function. Availability mity is available from https://github.com/KCCG/mityunder an MIT license. Contact clare.puttick@crick.ac.uk, carolyn.sue@sydney.edu.au, MCowley@ccia.org.au

. The high average depth across the MT genome in WGS data means that mity can detect very low heteroplasmy variants. A The average depth (black; red = interquartile range) across 13 replicates of NA12878, with BQ≥24 and MQ≥30. B The VAF of all three possible non-reference bases, at each position in the MT genome is typically far lower than the VAF corresponding to 10 high-quality reads (black). Spikes of alternate reads with VAF>0.002 outside the D-loop (grey) correspond to true genetic variants.
The default variant quality score in FreeBayes penalises low VAF variants, due to the overwhelmingly high numbers of reference reads. However, we reasoned that the evidence to support low heteroplasmy MT variants should a) only consider the alternate read count, b) scale with the alternate read count, and c) be reported using a similar scale to other variant quality methods. To achieve this, we used a binomial model to implement a Phred-scaled variant quality score, q. Assuming a noise level p, q is the Phred-scaled probability of observing at least n alternate reads by chance, given the total number of reads covering the variant position (Supp. Results). The calculation of q is fast, VAF independent, and has a default threshold of q≥30, which can be tuned to favour sensitivity or specificity. mity includes three additional filters: a strand bias filter to exclude variants with >90% or <10% alternative reads from one strand, a read depth filter set to <15× depth, and a region filter to exclude variants in the homopolymeric regions at m.302-319, or m.3105-3109, where there is an 'N' at m.3107, in the GRCh37 version of the mitochondrial sequence.

mity-report
End-users of mity include genome researchers and clinicians, so mity-report was developed to produce easily interpretable spreadsheet reports containing comprehensively annotated MT variant lists. Variants are tiered by VAF to aid prioritisation: tier 1, VAF ≥0.01; tier 2, VAF <0.01 with >10 supporting reads; and tier 3 are the remaining variants. Variant impact is estimated using Variant Effect Predictor (VEP) (McLaren et al., 2016), with additional annotation from MITOMAP (Brandon et al., 2005), allele frequency from 2,980 WGS healthy individuals from MGRB and additional VCF fields. The mitochondrial haplogroup is determined using PhyloTree (van Oven and Kayser, 2009).

mity-merge
In order to integrate mity into existing WGS analysis pipelines, mity-merge replaces the MT variants from a genome-wide VCF (e.g., GATK), with those from the mity VCF. The merged VCF can then be used with existing downstream tools for annotation and filtering.

NUMT homology
Nuclear mitochondrial DNA (NUMT) are fragments of the mitochondrial genome integrated into the nuclear genome (Lopez et al., 1994) that can potentially confound heteroplasmy (Parr et al., 2006;Santibanez-Koref et al., 2019). Informed by our previous work in pseudogene homology (Mallawaarachchi et al., 2016), and sequence homology analysis (Supp. Methods & Supp. Results), we conclude that NUMT:MT homology to known NUMT is highly unlikely to cause falsepositive tier 1 mity variants (Supp. Figure 7). It is challenging to rule out the presence of very rare, or patient-specific NUMT, so we recommend low heteroplasmy variants, particularly tiers 2 and 3 be validated using orthogonal approaches.

2.5
Performance mity can operate on either a WGS or MT-only BAM file (<2.1Gb), with a run-time of <10 minutes using a single-core and <8Gb RAM.

Conclusion
mity overcomes many of the challenges of accurate low heteroplasmy variant identification in the MT genome. Additional work is needed to identify MT structural variation, variant phasing, and prioritisation of novel MT variants. mity can be easily incorporated into existing high-throughput analysis pipelines, while simultaneously producing user-friendly reports. mity was developed on 242 MD patients (Davis et al, in prep)

Patient recruitment
We recruited 10 adult patients reviewed at the Mitochondrial Disease Clinic at Royal North Shore Hospital, Sydney, Australia, between 2013-2015. The research was approved by the Northern Sydney Local Health District Human Research Ethics Committee (HREC/10/HAWKE/132). Total genomic DNA was isolated from peripheral blood using standard methods. All participants provided written informed consent. NA12878 reference material was sourced from Genome in a Bottle.

Sequencing and read alignment
Sequencing libraries were created from nine patients in singlicate, one patient in duplicate, and 13 replicates from NA12878, using Illumina TruSeq Nano HT v2.5 library preparation kits, using Hamilton Star instruments. Sequencing was performed on Illumina HiSeq X instruments following the manufacturers specifications at the Kinghorn Centre for Clinical Genomics, Sydney. Sequence reads were aligned to the human genome reference assembly GRCh37 decoy genome (hs37d5) using BWA-MEM (v0.7.12-r1039, settings -M; (Li, 2013)). Reads were further processed using GATK Indel Realignment, and GATK Base Recalibration (version 3.3; (Van der Auwera et al., 2013)). Depth of coverage was performed using bedtools genomecov (Quinlan, 2014), and depth of alternate alleles in Figure 1B with samtools mpileup (Li et al., 2009).

Evaluation of variant callers
GATK HaplotypeCaller (Poplin et al., 2018) is a popular genome-wide variant caller, but it has three major limitations for MT variant calling: 1) it down-samples the reads to a maximum of 500× depth, 2) it uses a diploid model by default, which is insensitive to low heteroplasmy variants and 3) does not provide a minimum VAF setting. LoFreq (Wilm et al., 2012) was developed specifically for detecting low-frequency variants in somatic cancer data. FreeBayes ) is a popular haplotype-aware, genome-wide variant caller, which allows for control over the minimum VAF and the minimum number of alternative-reads. Whilst FreeBayes and HaplotypeCaller both have ploidy parameters that can theoretically be tuned to prioritise low VAF variants, from a high-ploidy sample, the execution runtime becomes exponentially slower and computationally infeasible in practice for MT analysis.
We assessed the performance of GATK HaplotypeCaller, FreeBayes, and LoFreq for MT variant detection. We first sequenced ten patient genomes using a single lane of Illumina HiSeq X, resulting in 30-40x average nuclear coverage. We ran each of the three callers, on each of the ten samples, and identified a median of 28, 41 and 56 MT variants, respectively (Supp. Fig 1A). We manually inspected every variant and found LoFreq to be overly susceptible to systematic sequencing artefacts, and thus too sensitive. As expected, we found HaplotypeCaller to be insensitive to low VAF variants (Supp. Fig 1B). FreeBayes produced very few false positives and was selected as the mity-call variant caller.

mity-call: variant normalisation
As FreeBayes is a haplotype-aware variant caller, it can merge two nearby variants in cis into a longer multi-nucleotide variant (MNV). In practice, sequencing noise from high MT coverage can artificially inflate the number of MNVs and reduce the number of reads supporting a variant of interest (Supp. Figure 4). Existing variant decomposition methods including vt normalize (Tan et al., 2015) and vcflib (Garrison, 2016) do not decompose all of the INFO and FORMAT annotations of MNVs, which are critical components for the mity-report, and downstream analysis tools. We thus implemented mity-normalise to decompose and normalise all variants, as well as the variant metadata within the INFO and FORMAT fields in the VCF (Supp. Figure 4c).

mity-call: assessing variant quality
We implemented a variant quality score, q, that is fast and VAF independent, similar to the challenge of identifying rare variants within tumours (Rheinbay et al., 2017). q is defined as the Phred-scaled probability of seeing at least the observed number of alternate reads by chance, given a noise threshold and assuming a binomial distribution. That is, given a noise threshold p and position i, and ni alternate bases, from a total depth of Ni, the variant quality qi is: where F is the binomial cumulative distribution function. The noise level parameter p may vary for each dataset and so should be separately estimated for each dataset.
Estimating p in the 13 NA12878 replicates.
To estimate p, we used samtools mpileup (Li et al., 2009) to calculate the VAF of all three alternate bases at every position, in each of the 13 replicates of NA12878 with MQ>30 and BQ>24 (Supp. Figure 5a). There remained a persistent set of alternate reads, with VAF up to 0.001. Thus, taking a conservative approach, we set p=0.002 to reduce the impact of sequencing noise. At q≥30, we identified 18 variants in all 13 replicates of NA12878 (Supp. Figure 5b). However, there were 5 additional variants seen in a subset of replicates, all of which failed manual review.
Estimating p in the two replicates of the MD patient In the same way as above, in two replicates of an MD patient, we estimated a higher noise floor of p=0.003 (Supp. Figure  6a) and found 59 variants in both replicates (Supp . Fig 6b), including a known pathogenic m.3243A>G variant (VAF=0.15% in both samples). There were one and three variants private to each replicate, of which two passed manual review, but which just failed the q threshold in the other sample.
As with any classification problem, when using mity there is a need to balance sensitivity with specificity. We have set the default threshold as q≥30, which favours sensitivity over specificity. In larger cohorts, the estimation of p could be fine-tuned by estimating a base or region-specific p, similar to the ideas presented by Gerstung and colleagues (Gerstung et al., 2014).

Nuclear insertions of mitochondrial origin (NUMTs)
There have been a number of reports of nuclear mitochondrial DNA (NUMT), which are fragments of the mitochondrial genome integrated into the nuclear genome (Lopez et al., 1994). The presence of NUMT can potentially confound heteroplasmy (Parr et al., 2006;Santibanez-Koref et al., 2019) by causing reads to align to the wrong genome, which would change the variant allele frequency, and thereby the estimation of variant heteroplasmy. Many known NUMT were integrated into the human genome over evolutionary timescales, allowing the accumulation of genetic changes to make it possible to resolve these sequences. To our knowledge, the rates of rare (patient-or family-specific) NUMT have not been systematically investigated and will remain challenging to resolve using short-read sequencing.
Here, we investigated whether any of the known NUMT could create false-positive MT SNV and INDEL variants caused by misalignment of known NUMT reads.
We have previously shown that when using the same sequencing technology (Illumina HiSeq X), and read aligner (BWA-MEM), that 150bp paired-end WGS can correctly align sequencing reads to the PKD1 gene, despite PKD1 having six pseudogenes with up-to 97.7% sequence homology (Mallawaarachchi et al., 2016). In a follow-up study of 145 clinical patients, no false positives due to sequencing read misalignment were detected using the same approach (Mallawaarachchi et al, in review). These results suggest that any NUMT with homology <97.7% should be readily resolved using 150bp paired-end sequencing reads. We further reasoned that short NUMT less than 300bp, i.e. less than the size of a typical read-pair of 300-450bp, would be easily resolved using existing read alignment. Furthermore, mity relies on even more stringent read mapping thresholds than reported above, of MQ>=30, meaning a 1:1000 chance of read misalignment.
Based on the RHNumtS.2 catalogue of NUMT (Ramos et al., 2011), just a single NUMT is longer than 300bp with sequence homology higher than 97.7%: HSA_NumtS_001. This NUMT is 5,839bp long and aligns to chr1:564,464-