Haplotracker: a web application for simple and accurate mitochondrial haplogrouping using short DNA fragments

Mitochondrial DNA (mtDNA) haplogrouping is widely used in population genetics, forensics, and medical research to study evolutionary questions in human populations, examine degraded remains, and search for mtDNA-associated diseases. Next-generation sequencing methods have become a revolutionary tool for mtDNA analysis, particularly for degraded DNA, but they remain costly, laborious, and time-consuming. If an accurate and simple haplogroup-tracking tool were available, haplogrouping could be performed easily and rapidly. Here, we present Haplotracker, a web application for highly accurate and simple haplogroup tracking using a few sequence fragments. Haplotracker offers a unique user-friendly interface and estimates highly probable haplogroups from control-region sequences using a novel algorithm based on Phylotree Build 17 and our haplotype database (n=118,869). Haplotracker provides a simple, novel HG tracking solution, which was established through repeated blind simulation tests. It narrows down potential haplogroups and identifies their differential coding-region variants to confirm the haplogroups or to track sub-haplogroups. Haplotracker gives detailed information on sample variants, including their frequency in a large mtGenome database, which may give researchers an insight into common, rare, and potentially pathogenic mutations. It also offers a conserved region mapping tool for PCR primer design for successful tracking. Haplotracker produced high haplogroup prediction accuracy using 8,216 control-region Phylotree-provided sequences. It estimated top-ranking haplogroups with a higher concordance rate (56.6%, p<0.0001) than the similar tools MitoTool (29.4%) and HaploGrep 2 (33.9%). The significance persisted up to Rank 30. Haplotracker accurately estimated super-HGs from 94% of the control-region sequences at Rank 1. Further evaluation of the accuracy with 46,322 control-region sequences was significant. 
Laboratory application of Haplotracker to an ancient DNA extract demonstrated its practical usefulness. These results highlight the potential for the use of our web application as an alternative to full genome sequencing for easy haplogrouping, which may be useful in related fields. Free access: https://haplotracker.cau.ac.kr.

Author summary

Mitochondrial haplogroup (HG) classification is required in the search for answers to evolutionary questions about human populations, in the investigation of forensic samples, and in the search for disease-associated mutations. The sequencing of mitochondrial DNA (mtDNA) at the genome level is now frequently used due to the falling cost of next-generation sequencing. It offers accurate and detailed analysis of mtDNA variation, but it is still costly, laborious, and time-consuming. If an accurate and simple HG tracking tool were available, haplogrouping could be performed easily and rapidly. Here, we developed Haplotracker, a simple and accurate HG tracking web application with a novel algorithm and a novel tracking tool. The highly accurate prediction performance of Haplotracker was demonstrated in a series of tests. Using only control-region sequences, it accurately predicted HGs in more than half of the total Phylotree-provided mtGenome sequences and approximately 80% up to Rank 5; for super-HGs, it accurately predicted 94% at Rank 1 and more than 98% up to Rank 3. We demonstrated simple tracking for HG confirmation using Haplotracker. These results show that mtDNA haplogrouping with our web application may be useful in related fields.


Haplogrouping of fresh DNA samples, from which the sequence information for large mtDNA fragments can be obtained, is much easier than that of degraded DNA samples, such as those [...] Haplotracker. In contrast, when the CR sequence is used, the most probable HG is identified as "N9b2a" by MitoTool and "N9b2" by the other servers. MitoTool did not predict "O" as an option at all, while the other tools ranked it second.

These results demonstrate that subsequent confirmation is required for HG prediction when using CR sequences. This confirmation and sub-haplogrouping are usually conducted using differential variants in the coding region. If many indeterminable HGs are highly ranked in the haplogrouping process, it will be laborious and complicated. A simple tracking method is thus required to definitively and accurately determine the HGs because higher prediction accuracy results in a less laborious process. Several [...]

In the present study, we develop Haplotracker, a web application that offers simple and highly accurate tracking of mtDNA HGs using CR and coding-region sequence fragments. Haplotracker offers a unique user-friendly interface with input options for multiple sequence fragments without the need to input the ranges from an mtDNA sample. The server first uses several short sequence fragments in the CR of mtDNA to produce a list of the closest-ranked HGs using a novel algorithm based on Phylotree definitions and our haplotype database (DB). To confirm the predicted HGs, Haplotracker employs a novel HG tracking tool. It minimizes the number of tests for HG tracking by narrowing down the HGs through integration to their most recent common ancestor (MRCA) HGs. It then identifies highly specific and conserved differential variants among these HGs. To confirm the HGs or identify the sub-HGs, researchers are guided to re-track the haplogrouping with additional sequence fragments across the positions of the differential variants. The server also provides manual options and tools for researchers to select and compare HGs to narrow down and differentiate HGs. These include HG selection options, options to determine the level of sub-HGs ranging from MRCAs to terminal sub-HGs, access to the HG DB, and a separate tool for HG differentiation.
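The idea of narrowing candidate HGs to their MRCA and then listing the variants that distinguish them can be illustrated with a minimal sketch. This is not Haplotracker's implementation: the tree, the haplogroup names (`hgA`, `hgA1`, ...), and the variant labels (`v10`, `v20`, ...) are invented for illustration, and the sketch ignores back mutations and variant specificity or conservation scoring.

```python
from typing import Dict, List, Optional, Set

# Hypothetical toy tree (NOT Phylotree data): child -> parent, None marks the root.
PARENT: Dict[str, Optional[str]] = {
    "hgA": None,
    "hgA1": "hgA",
    "hgA1a": "hgA1",
    "hgA1b": "hgA1",
    "hgA2": "hgA",
}

# Hypothetical variants gained on the branch leading to each haplogroup.
DEFINING: Dict[str, Set[str]] = {
    "hgA": {"v10"},
    "hgA1": {"v20"},
    "hgA1a": {"v30"},
    "hgA1b": {"v40"},
    "hgA2": {"v50"},
}


def lineage(hg: str) -> List[str]:
    """Return the path from hg up to the root (hg first, root last)."""
    path: List[str] = []
    node: Optional[str] = hg
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def mrca(hgs: List[str]) -> str:
    """Return the most recent common ancestor of the given haplogroups."""
    other_lineages = [set(lineage(h)) for h in hgs[1:]]
    # Walk from the first haplogroup toward the root; the first node that
    # lies on every other lineage is the deepest shared ancestor.
    for node in lineage(hgs[0]):
        if all(node in other for other in other_lineages):
            return node
    raise ValueError("haplogroups do not share a root")


def profile(hg: str) -> Set[str]:
    """Union of the variants accumulated along the lineage (no back mutations)."""
    variants: Set[str] = set()
    for node in lineage(hg):
        variants |= DEFINING[node]
    return variants


def differential_variants(hgs: List[str]) -> Dict[str, List[str]]:
    """For each haplogroup, list the variants not shared by all candidates."""
    profiles = {h: profile(h) for h in hgs}
    shared = set.intersection(*profiles.values())
    return {h: sorted(p - shared) for h, p in profiles.items()}


print(mrca(["hgA1a", "hgA1b"]))
print(differential_variants(["hgA1a", "hgA1b"]))
```

In this toy setting, sequencing a fragment that covers `v30` or `v40` would be enough to separate the two candidates, which mirrors the re-tracking step described above.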

Ambiguous HGs, even those in the top ranks, can be compared to determine differential variants. To re-track the HGs, the presence of HG-differential variants in the sample DNA needs to be examined. This requires PCR amplification of DNA fragments located across the variant positions. Hybridizable primer designs for the fragments are essential for successful PCR, particularly for degraded DNA. Haplotracker provides a mapping tool showing the other variants present across the target variants to survey possible conserved regions for the primers to minimize the chance of potential primer mismatches. In addition, Haplotracker provides information about the sample variants, including their frequencies observed in a large mtGenome DB, which may be helpful for researching commonly found, HG-defining, rare, and/or potential disease-associated variants within a sample. Because the QC of mtDNA sequences is also important to ensure sequencing reliability [35,36] [...]

We designed a user-friendly interface for the input of multiple sequence fragments or variant profiles from a sample (S1 Fig.). It includes DNA sequence or variant fields for the fragments and the following additional fields: sample name, an option for the score DB (Haplotracker or HelixMTdb), super-HG level, rank group level, an option for accepting "N" nucleobases as variants, and selection buttons for the control or coding region in which the dominant part of the fragment sequence is positioned. Researchers can input up to 100 sequence fragments regardless of any overlap in the separate fields without needing to input the ranges of the fragments. Researchers can increase the number of input fields for the fragments by clicking the "+" buttons beneath the field or reduce them by clicking the "-" buttons. The rank group level (1-4) is used to set the lowest HG rank group to display. "N" nucleobases in sequences are interpreted as "not sequenced" by default but can be selected to be interpreted as "A, C, G, or T" by ticking the checkbox. In the latter case, more than four tandem-repeated "N"s are interpreted as "not sequenced." Ticking the button "Control" or "Coding" for the fragment region is important for correct processing. All of the sequence fragments in a sample are aligned with the revised Cambridge Reference Sequence (rCRS) using the implemented Gotoh algorithm. There is no need to input the nucleotide positional ranges of the fragments.
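The rule for interpreting "N" nucleobases can be made concrete with a short sketch. The function name `classify_n_runs`, its parameters, and the labels it returns are hypothetical and are not taken from the Haplotracker server; the sketch only encodes the behavior described above, in which "N" means "not sequenced" by default and, when the option is ticked, runs of up to four "N"s are treated as an unknown base while longer runs remain "not sequenced."

```python
import re
from typing import List, Tuple


def classify_n_runs(
    seq: str, n_as_variant: bool = False, max_run: int = 4
) -> List[Tuple[int, int, str]]:
    """Label each run of 'N' in seq.

    Returns (start, end, label) tuples with 0-based, end-exclusive
    coordinates. The labels are illustrative names, not server output.
    """
    runs: List[Tuple[int, int, str]] = []
    for match in re.finditer(r"N+", seq.upper()):
        length = match.end() - match.start()
        if n_as_variant and length <= max_run:
            # Option ticked and the run is short: treat as "A, C, G, or T".
            label = "any_base"
        else:
            # Default behavior, or a run of more than four tandem Ns.
            label = "not_sequenced"
        runs.append((match.start(), match.end(), label))
    return runs


fragment = "ACGNNTTNNNNNAC"  # invented example fragment
print(classify_n_runs(fragment))
print(classify_n_runs(fragment, n_as_variant=True))
```

With the option enabled, the two-base run is kept as an ambiguous base and the five-base run is still excluded, so a long stretch of unread sequence does not generate spurious variants.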

Once the variants and ranges of the fragments are obtained using the first tool, a second tool, "HG tracking by fragment variant profiles," can be used with these data rather than the sequences themselves for HG tracking (S2 Fig.).

[...] sequences, particularly using the three HV region sequences. Re-tracking of the HGs is required to find diagnostic differential variants in the coding regions for HG confirmation or sub-haplogrouping. For the simple tracking of HGs, we suggest the strategy outlined in Fig. 1. If more than one HG is present in Rank Group 1, we propose narrowing down the HGs in this group first. This is based on our observation that most of the definitive HGs (95%) that were identified by Haplotracker from the CR regions (n=8,216) of Phylotree mtGenome sequences were found in Rank Group 1 (Table 1). If more than one HG is scored (i.e., has a score >0), it is preferable to first track the HGs with scores. This is [...]

The recommended variants for further HG testing will be the highest specific variants as the first priority, highly specific variants as the second priority, and highly conserved variants as the third priority.

[...] (HV1 and HV2), Haplotracker predicted the HG of MNX3 to be "U2e1a1" as the top rank (Table 6). The MRCA of Rank Group 1 was predicted to be "U2e1," and this group had nine sub-HGs. Of these, HG U2e1a1 was top-ranked with the highest score. To confirm U2e1a1, we followed the steps suggested in the haplogrouping flowchart (Fig. 1). The first question in the flowchart is "Do you want to continue tracking?" The answer was "Yes." Following the arrow, the second question is "Is there only one HG in Rank Group 1?" The answer was "No" because there were nine HGs in Rank Group 1. We followed the subsequent instruction and narrowed down the HGs of Rank Group 1 by clicking the button [Rank Group 1]. As a result, the HGs were narrowed down from nine to six. The next arrow arrives at the question "Is there more than one scored HG?" The answer was "Yes" because there were three scored HGs. We followed the arrow and arrived at "Select [HGs with scores] and differentiate between them." The results showed that there were three HGs (U2e1a1, U2e1a, and U2e1) and one HG differential variant "3116" for U2e1a1. We followed the next arrow and arrived at "[Add fragment(s)] of the differential variant(s) to the previous ones and predict HGs." We added the sequence of the fragment for the "3116" coding-region position that was obtained from the PCR experiment in the present study and pressed the submit button.
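The part of the flowchart walked through above can be written as a small decision function. This is only a sketch of the branches described in the text: the function name `next_step`, the `narrowed` flag, the returned strings, and the scores in the example are assumptions made for illustration, and branches of Fig. 1 that are not described here are returned as "consult Fig. 1" rather than guessed.

```python
from typing import List, Tuple

STOP = "stop tracking"
NARROW = "narrow down the HGs of Rank Group 1"
DIFFERENTIATE = (
    "select HGs with scores, differentiate between them, then add "
    "fragment(s) of the differential variant(s) and predict HGs"
)
CONSULT = "branch not described in the text; consult Fig. 1"


def next_step(
    rank_group_1: List[Tuple[str, float]],
    continue_tracking: bool = True,
    narrowed: bool = False,
) -> str:
    """Return the next action for a list of (haplogroup, score) pairs."""
    # Question 1: "Do you want to continue tracking?"
    if not continue_tracking:
        return STOP
    # Question 2: "Is there only one HG in Rank Group 1?"
    if len(rank_group_1) > 1 and not narrowed:
        return NARROW
    # Question 3: "Is there more than one scored HG?" (score > 0)
    scored = [hg for hg, score in rank_group_1 if score > 0]
    if len(scored) > 1:
        return DIFFERENTIATE
    return CONSULT


# Three scored HGs as in the MNX3 example; the scores are invented, and the
# unscored entries are placeholders for HGs not named in the text.
group = [("U2e1a1", 3.0), ("U2e1a", 2.0), ("U2e1", 1.0)]
group += [(f"unscored_{i}", 0.0) for i in range(1, 7)]

print(next_step(group))
print(next_step(group, narrowed=True))
```

Calling the function again after each round of added fragments reproduces the iterative character of the tracking procedure.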

We also used other web servers to haplogroup the same sequence fragments from MNX3.

Our server programs and hardware environment will be constantly updated in parallel with the future accumulation of mtDNA haplotype data. [...] The example summarized in Table 7 [...]

To confirm the predicted HGs, we designed PCR primers for four fragments to identify coding-region variants that Haplotracker proposed for confirmation and differentiation (S8 Table). Conserved [...]

For the second round of real-time PCR (nested PCR), 2 μl of the diluted products was used for the template, and nested primers for each target coding-region fragment were added to the four separate reactions. The nested PCR reaction mixture (40 μl) consisted of 1X PCR buffer for Ex Taq [...]