Evaluation of haplotype callers for next-generation sequencing of viruses

Currently, the standard practice for assembling next-generation sequencing (NGS) reads of viral genomes is to summarize thousands of individual short reads into a single consensus sequence, thus confounding useful intra-host diversity information for molecular phylodynamic inference. It is hypothesized that a few viral strains may dominate the intra-host genetic diversity with a variety of lower frequency strains comprising the rest of the population. Several software tools currently exist to convert NGS sequence variants into haplotypes. However, previous studies suggest that current approaches of haplotype reconstruction greatly underestimate intra-host diversity. Here, we tested twelve NGS haplotype reconstruction methods using viral populations simulated under realistic evolutionary dynamics. Parameters for the simulated data spanned known fast evolving viruses (e.g., HIV-1) diversity estimates to test the limits of the haplotype reconstruction methods and ensured coverage of predicted intra-host viral diversity levels. Using those parameters, we simulated HIV-1 viral populations of 216-1,185 haplotypes per host at a frequency <7%. All twelve investigated haplotype callers showed variable performance and produced drastically different results that were mainly driven by differences in mutation rate and, to a lesser extent, in effective population size. Most methods were able to accurately reconstruct haplotypes when genetic diversity was low. However, under higher levels of diversity (e.g., those seen intra-host HIV-1 infections), haplotype reconstruction accuracy was highly variable and, on average, poor. High diversity levels led to severe underestimation of, with a few tools greatly overestimating, the true number of haplotypes. PredictHaplo and PEHaplo produced estimates close to the true number of haplotypes, although their haplotype reconstruction accuracy was worse than that of the other ten tools. We conclude that haplotype reconstruction from NGS short reads is unreliable due to high genetic diversity of fast-evolving viruses. Local haplotype reconstruction of longer reads to phase variants may provide a more reliable estimation of viral variants within a population. Highlights Haplotype callers for NGS data vary greatly in their performance. Haplotype callers performance is mainly determined by mutation rate. Haplotype callers performance is less sensitive to effective population size. Most haplotype callers perform well with low diversity and poorly with high diversity. PredictHaplo performs best if genetic diversity is in the range of HIV diversity.

is further used to reconstruct haplotypes ( Fig. 1 с # , с % , d " , d # ). De novo assemblers, however, do not 113 rely on reference genomes, and haplotype sequences are usually reconstructed directly from the reads 114 ( Fig. 1 с ( , d % ). De novo assembly often requires more computational resources, but reference-based 115 assembly requires a closely related genome, which is not always available in high quality. Each 116 strategy has advantages and disadvantages and have been implemented in several software programs, 117 but the performance of these assembly tools has not been comprehensively examined yet. In this study, 118 we simulated realistic, coalescent based intra-host viral diversity with diversity measurements 119 encompassing known variation from fast-evolving viruses such as HIV-1 for empirical grounding. We 120 then used these simulated populations to assess the performance and accuracy of three de novo and 121 nine reference-based sequence variant reconstruction tools or haplotype callers. consensus and strain-specific assemblers. The main goal of consensus-based tools is, generally 145 speaking, to construct a suitable consensus reference genome that may be further used as a template 146 for more fine-grained studies. VICUNA (Yang et al., 2012) and IVA (Hunt et al., 2015) represent this 147 subcategory of methods. VICUNA is the most popular software among them, as it generates full-length 148 consensus and detects polymorphisms. VICUNA merges NGS reads into contigs, and those into a 149 bigger contig, by calculating "good" prefix-suffix overlap between sequences. During this process, 150 contigs are also clustered and validated to reach a better quality of consensus. IVA follows the same 151 approach with only one difference, the tool starts from k-mers that are sorted in decreasing order with 152 respect to their abundance and then extends sequences into a bigger sequence by using reads that have 153 perfect overlap with initial sequences. VICUNA also has an additional option for contig merging if a 154 reference genome exists. 155 Contrary to de novo consensus approaches, de novo strain-specific assemblers aim to 156 reconstruct sequences at the strain resolution level (Table 1). It is worth mentioning that the de novo 157 viral variant reconstruction problem is quite similar to the assembly effort of multiple genomes in 158 microbial communities at once using shotgun metagenomic reads (e.g., Bishara  constructs an overlap graph instead of creating a De Bruijn graph during the initial steps 2 ( Fig. 1 с ( ). 174 PEHaplo also has a more careful path finding algorithm based on paired-end connection information. 175 SAVAGE uses overlap graphs as a key data structure, but the pipeline is different from those in 176 PEHaplo and MLEHaplo. After constructing an overlap graph ( Fig. 1 с  If a haplotype caller produces haplotypes as an output, it given a plus sign. If a haplotype caller reports corresponding frequencies for the sequences produced, it is given a plus sign. Savage and PEHaplo claim they produce contigs not haplotypes, which is why we did not deem that they produce haplotypes.

182
While the final sequences produced by MLEHaplo, PEHaplo, and SAVAGE are strain-183 specific, the obtained sequences, in general, do not represent full-length haplotypes ( Fig. 1 d % ). 184 Recently, Virus-VG and VG-flow have been developed for completing strain-specific assemblies 185 produced by the aforementioned de novo strain-specific assemblers (Baaijens et al., 2018(Baaijens et al., , 2019. 186 Virus-VG and VG-flow try to convert strain-specific contigs into full-length haplotypes taking into 187 account their abundances. The difference between Virus-VG and VG-flow is that the former uses a 188 brute-force exact approach, while the latter utilizes a heuristic algorithm. Therefore, VG-flow is faster 189 than Virus-VG, but less accurate. The main goal for both tools is to find and select maximum-length 190 paths in a variation graph. 191 The main advantage of reference-based viral variant reconstruction methods prior to de novo 192 haplotype assemblers is the potential ability to reconstruct full-length haplotypes ( Fig. 1 196 In this case, the required reference genome can be potentially assembled from sequencing reads by 197 first using de novo consensus assembly tools. Nevertheless, the reference genome is often available 198 for common pathogenic viruses, such as HIV, HCV, polyomavirus or influenza. 199 Currently there are nine commonly used state-of-the-art reference-based tools (Table 1). All 200 these tools claim to be global haplotype inference methods, i.e., able to infer the sequences and 201 frequencies of the underlying viral strains over a longer region than the average read length. ShoRAH 202 (Short Read Assembly into Haplotypes) is, historically, the first publicly available software (Zagordi 203 et al., 2011). ShoRAH uses a probabilistic clustering algorithm for short haplotype sequence 204 reconstruction ( Fig. 1 с " , с % , d # ). Then, it computes a minimal set of haplotypes using the principle of 205 parsimony that provides the best explanation for the given a set of error corrected sequencing reads 206 (Eriksson et al., 2008). The tool then uses an expectation minimization algorithm for haplotype 207 frequency estimation. 208 The next important milestone in the reference-based viral variant reconstruction tool 209 development was the release of QuRe (Prosperi and Salemi, 2012). QuRe uses the combinatorial 210 method proposed in Prosperi and Salemi (2012) for inferring genetic variants in local windows that do 211 not exceed read lengths. After that, the obtained genetic variants are clustered by a probabilistic 212 algorithm (Zagordi et al., 2010) ( Fig. 1 с " , с % , d # ). Finally, haplotypes with their frequencies are 213 obtained by utilizing a genome reference and clustered variants. Later, the same probabilistic clustering 214 and combinatorial algorithms were used for developing the reference-assisted assembly pipeline 215 ViQuas (Jayasundara et al., 2015). The main difference between QuRe and ViQuas is that the latter 216 tool assembles reads into contigs using the SSAKE assembler (Warren et al., 2007) and then iteratively 217 extends obtained contigs by connecting overlapping pairs without using any sequence information 218 from the reference. 219 The variations and then identifies true viral variants by merging cliques in that graph ( Fig. 1 с " , с # , d " ). 242 RegressHaplo, in turn, is based on a regression-based approach specifically designed low diversity and 243 convergent evolutions. This tool implements penalized regression to assess the haplotype interactions 244 that belong to different unlinked regions. aBayesQR employs a maximum-likelihood approach to infer 245 viral sequences. The search of most likely viral sequence is conducted on long contigs, which enables 246 identification of closely related haplotypes in a population and provides computational tractability of 247 the Bayesian method. It should be noted that aBayesQR is designed for reconstructing viral haplotypes 248 that are near genetically identical. 249 Each haplotype reconstruction tool in Table 1  PredictHaplo artificially mutated ten haplotypes from a single HIV-1 reference genome at varying 259 proportions ( originated from an infection of one strain. All of these studies conditioned their simulations on HIV-1 272 data sets, but we also want to explore the general utility across a broader parameter space that 273 encompasses more fast-evolving viral populations. 274 In our simulations, we used parameters and settings under the coalescent theory (Kingman, 275 2000(Kingman, 275 , 1982Rodrigo and Felsenstein, 1999;Rosenberg and Nordborg, 2002) to more accurately reflect 276 viral intra-host diversity and evolution as seen in empirical studies (see (Crandall and Templeton,  277 1993)). We simulated viral intra-host evolutionary histories and the constituent haplotype sequences 278 (tips) using the coalescent simulator CoalEvol v. 7.3.5 (Arenas and Posada, 2014). We set the mutation 279 rate (µ) between 1e-3 and 5e-8 per-site to span past known viral mutation rates to test the limits of the 280 reconstruction algorithms and number of haplotypes present using the human genome mutation rate as 281 an upper limit and other retroviruses' mutation rates as a lower limit. These parameters encapsulated 282 the empirical mutation rate of 2.5e-5 and 3. First, many of the haplotype programs do not account for recombination. Second, we assume that 297 approaches that fail on a simplified model without recombination will not perform well on a more 298 complex that includes recombination. 299 We 10,000, so we varied the effective population size between 500 and 10,000. We also denoted the ploidy 312 as diploid (Coffin, 1992). Wherever possible, we varied the parameters to be above and below 313 estimated HIV-1 estimates to ensure we adequately represented viral intra-host diversity and to 314 examine the limits of the haplotype reconstruction programs. Expanding our parameter set allowed us 315 to gain insights into other viral species with different evolutionary and population characteristics. For 316 example, the Ne for influenza is considerably smaller than HIV-1 at around 20-100 viral sequences 317 ( In order to evaluate the quality of haplotype assembly provided by different tools, we used common 343 statistical measures of precision and recall, as well as weighted normalized UniFrac distance 344 (Lozupone and Knight, 2005), which is widely used to compare microbial communities. Our simulated 345 data can be represented as = {(ℎ . , . ), = 1, 2, … } -the ground truth haplotypes ℎ . and their 346 associated abundances . (∑ . = 1), and = {( . , . ), = 1, 2, … } -the set of predicted haplotypes 347 . together with their predicted abundances . . 348 We define precision as In the case of reference-based methods, we define TP as the total frequency of those haplotypes 353 ℎ in the ground truth set which have an accurate enough prediction in (which means that the 354 edit distance (ℎ, ) is less than some threshold = ( )); we also define FP as the total frequency 355 of those haplotypes f in the predicted set which do not match any haplotype from the ground truth 356 set (which means that (ℎ, ) ≥ for all ∈ ). We choose the threshold = 12 because 12 bp is 357 about 1% of the haplotypes' length. We consider the haplotype ℎ ∈ to be reconstructed correctly if 358 there exists a haplotype ∈ such that the edit distance between them (ℎ, ) ≤ 12.

359
For de novo methods, we define TP as follows: We say that a contig from is proper if there 360 exists such a ground truth haplotype ℎ and its substring s ∈ ℎ so that the edit distance between and 361 is small (less than 1% of 's length). Then, for each ground truth haplotype ℎ . , we define . -the  362 proportion of its part which is properly covered by contigs from . We then define TP as a weighted 363 total frequency of properly predicted haplotypes ∑ . . I J ∈ K . It is important to note that the definition 364 of TP is a generalization of the TP definition for reference-based tools. Indeed, in the latter case, all 365 the . are equal to either 0 or 1. We define FP as the total frequency of non-proper contigs in . 366 While these measures are standard and they show how good the haplotype reconstruction is, 367 they are not very sensitive to the errors in frequency prediction. In order to address this issue, we also 368 computed the UniFrac distance ( , ) using the EMDUniFrac algorithm (McClelland and  369 Koslicki, 2018). The UniFrac distance takes into account both the phylogenetic structure of the 370 haplotype set and their frequency distribution, which makes it ideal for incorporating sensitivity to 371 errors in frequency prediction. The UniFrac EMD method makes the following steps: 372 373 • construct a tree with branch length P on the set of all haplotypes ℎ . ∈ and . ∈ 374 • for each tree branch and its descendant subtree P , estimates the imbalance P : 375 376 .∶I J ∈: W − U .
.:Y J ∈: W T ; 377 378 • evaluate the weighted imbalance with respect to the branch lengths 379 380 ≔ U P P P∈: . 381

382
As a baseline for thе UniFrac EMD comparison, we evaluate the UniFrac distance between 383 reference or, more formally, a set of haplotypes Q containing only one haplotype -the reference at a 384 frequency of 1. 385 386 3. Results and Discussion 387 3.1 True haplotypes from simulated data 388 All analyses were completed using the simulated dataset developed under the coalescent 389 framework. For each mutation rate μ ∈{1e-8, 3e-8, 5e-8, 1e-7, 5e-7, 1e-6, 5e-6, 1e-5, 3e-5, 5e-5, 1e-4, 390 3e-4, 5e-4, 5e-3, 5e-8} and effective population size P = {500, 1000, 2500, 5000, 7500, 10000}, 391 there were five simulated haplotype populations = {(ℎ . , . ), = 1, 2, … } used as replicates for each 392 parameter set. Under the coalescent model, the number of true haplotypes ranged from 1 to 1,993 with 393 a median of 342 haplotypes for a parameter set (Fig. 2) MLEHaplo and ViQuas did not produce any valid results within the given time limit, whereas 425 QuRe crashed in all analyses because of memory limitations. While HaploClique produced results 426 within our time limit (Fig. S1), we excluded this tool from final comparisons because the length of the 427 reconstructed viral sequences was always significantly shorter than the length of the ground truth 428 haplotypes (Fig. S2). Such behavior is atypical among reference-based methods. Moreover, since 429 SAVAGE can be considered as the next installment of HaploClique, it provides an additional argument 430 for excluding HaploClique from our comparison. 431 In addition to the two de novo tools assessed (i.e., SAVAGE and PEHaplo), we also ran the 432 VG-flow tool to complete contigs produced by those methods. We selected VG-flow over Virus-VG 433 because VG-flow is faster and almost as accurate ( CliqueSNV, ShoRAH, PredictHaplo, and QuasiRecomb. Precision (Fig. 3) and recall (Fig. 4) values 454 were calculated for each tool. The quality of obtained results did not seem to depend much on the 455 effective population size (Ne). This is a positive finding, as determining the effective population size 456 for intra-host viral infections is often difficult and can vary between studies. All the tools, except 457 ShoRAH, performed very well (i.e., both precision and recall are close to one) if the mutation rate was 458 relatively small (µ ≤ 1e − 5), which is an estimated mutation rate for influenza. For higher values of μ 459 ( ≥ 1 − 4), such as those seen in HCV and HIV-1, all the tools performed poorly (i.e., both 460 precision and recall were close to zero). For the values of μ seen in HIV-1 (3 − 5 ≤ ≤ 1 − 4), 461 PredictHaplo was able to produce better results than the other tools; PredictHaplo's precision and recall 462 decreased with µ ∈ (3e−5, 1e−4) but stayed positive. It also should be noted that CliqueSNV 463 outperformed all other tools for = 1 − 6, but did not produce any results for ∈ (1 − 5, 1 − 4). 464 Such behavior looks promising and it is possible that in future releases, if run-time is increased, 465 CliqueSNV will exceed PredictHaplo in precision and recall performance.  for all considered Ne. The shaded light blue and shaded light red regions correspond to HIV-1 and 480 HCV diversity levels, respectively. For all pairs of parameters μ and Ne, we report the mean estimates 481 of recall over all valid outputs produced by each software tool for five haplotype populations. If a tool 482 did not produce any output for some pair of parameters, we included a gap in the corresponding plot. 483 484 We calculated UniFrac distance values for the aforementioned tools (Fig. 5) HIV-1 and HCV diversity levels, respectively. For all pairs of parameters μ and Ne, we report the mean 499 estimates of UniFrac distances over all valid outputs produced by each software tool for five haplotype 500 populations. If a tool did not produce any output for some pair of parameters, we included a gap in the 501 corresponding plot. 502 503 For large values of μ ( ≥ 1 − 4), ShoRAH, QuasiRecomb, RegressHaplo and PredictHaplo 504 rarely produced a valid output within the given time limit. aBayesQR and CliqueSNV produced results 505 that were worse than or comparable to the baseline. For large values of the effective population size 506 ( P ≥ 5000) and low values of μ, all the tools except ShoRAH showed better results than the baseline. 507 However, for mutation values larger than 5 − 4, none of the tools made a better prediction of the set 508 of haplotypes than just a reference. It is important to note that HCV, HIV, and influenza do not have 509 population at high genetic diversity levels or overestimated it at low genetic diversity levels (Fig. 6), 512 compared to the true number of haplotypes across the same levels of underlying genetic diversity 513 obtained from the simulated datasets (Fig. S3). PredictHaplo, RegressHaplo, aBayesQR, and 514 CliqueSNV underestimated haplotype numbers in the HIV intra-host diversity range (shaded in 515 yellow). HaploClique and QuasiRecomb, on the other hand, overestimated haplotype numbers, 516 whereas ShoRAH provided the closest estimate to the true number of haplotypes in the HIV-1 diversity 517 range. aBayesQR and CliqueSNV did not produce results for any dataset within the HIV-1 diversity 518 range. 519 520 521 522 Figure 6. Reference-based haplotype callers: number of predicted haplotypes across levels of 523 underlying genetic diversity. Intra-host HIV-1 and HCV diversity levels are highlighted shaded light 524 blue and shaded light red regions, respectively. If a software tool did not complete haplotype 525 reconstruction within the given time frame, we included a gap in the corresponding plot (see Fig. S1  526 for more information on dataset completions). 527 528

De Novo Program Performances 529
We analyzed the behavior of two de novo haplotype callers: SAVAGE and PEHaplo. The 530 output of both tools usually contained shorter contigs, so we completed the assembly using the VG-531 flow tool. PEHaplo itself produced valid output for all the datasets, while SAVAGE or PEHaplo with 532 VG-flow failed to produce results for some datasets (Fig. S1). Moreover, the length of PEHaplo output 533 haplotypes was usually closer to the ground truth haplotype length, while the SAVAGE+VG-flow 534 produced shorter contigs (see N50 statistics plot on Fig. S4). Thus, we only further considered 535 PEHaplo, PEHaplo + VG-flow and SAVAGE + VG-flow. 536 We compared the de novo tools using our modified versions of precision and recall ( Fig. 7 and 537 Fig. 8). VG-flow usually improved slightly the performance of PEHaplo, while PEHaplo usually 538 performed better than SAVAGE+VG-flow. Although the quality of results of SAVAGE+VG-flow did 539 not seem to depend on the effective population size, Ne played a role in the quality of obtained results 540 by PEHaplo. For example, both precision and recall were close to zero for = 1 − 8 and P ∈ 541 {500, 1000, 2500), but significantly higher for P ∈ {5000, 7500, 10000} and = 1 − 8. It is also 542 important to note the behavior of recall values for the obtained results in PEHaplo; those values, in 543 general, were close to one for small values, close to zero for values near 1 − 5, and stayed positive 544 for higher values. 545 546 547 548 Figure 7. De novo haplotype callers: variation of precision values with mutation rate (log-scaled) for 549 all considered Ne. The shaded light blue and shaded light red regions correspond to HIV-1 and HCV 550 diversity levels, respectively. For all pairs of parameters μ and Ne, we report the mean estimates of 551 precision over all valid outputs produced by each software tool for five haplotype populations. If a tool 552 did not produce any output for some pair of parameters, we included a gap in the corresponding plot. 553 554 555 556 Figure 8. De novo haplotype callers: variation of recall values with the mutation rate (log-scaled) for 557 all considered Ne. The shaded light blue and shaded light red regions correspond to HIV-1 and HCV 558 diversity levels, respectively. For all pairs of parameters μ and Ne, we report the mean estimates of 559 recall over all valid outputs produced by each software tool for five haplotype populations. If a tool 560 did not produce any output for some pair of parameters, we included a gap in the corresponding plot. 561 562 De novo tools performed very well, in both precision and recall values, if the mutation rate was 563 less than 1e − 6 (in contrast to µ ≤ 1e − 5 for reference-based tools). Additionally, recall values for 564 PEHaplo when ≥ 1 − 4 were usually better than those seen for any reference-based approaches. 565 De novo tools did not produce results with a positive precision for HIV-1 and HCV mutation rates. 566 The UniFrac distance further confirmed our previous observation that VG-flow slightly improved the 567 performance of PEHaplo (Fig. 9). Moreover, the performance of SAVAGE + VG-flow did not depend 568 on the mutation rate or the effective population size Ne. It is important to note that all UniFrac distance 569 values were, in general, higher than baseline values. We also compared UniFrac distances between 570 both categories of assemblers (Fig. S5); as we expected, reference-based tools largely outperformed 571 de novo tools. At the same time, PEHaplo performed better than ShoRAH for some datasets. Moreover, 572 SAVAGE + VG-flow showed the worst performance based on UniFrac distances. 573 574 575 576 Figure 9. De novo haplotype callers: variation of UniFrac distance (EMD) with mutation rate (log-577 scaled) for all considered Ne. The shaded light blue and shaded light red regions correspond to HIV-1 578 and HCV diversity levels, respectively. For all pairs of parameters μ and Ne, we report the mean 579 estimates of UniFrac distances over all valid outputs produced by each software tool for five haplotype 580 populations. If a tool did not produce any output for some pair of parameters, we included a gap in the 581 corresponding plot. 582 583 Although the de novo methods produced more haplotypes in the HIV-1 diversity range 584 compared to reference-based methods, they all still underestimated the true number of haplotypes in a 585 population at higher diversity levels. They also overestimated true haplotype numbers at lower genetic 586 diversity levels (Fig. 10) compared to the true number of haplotypes from the simulated datasets (Fig.  587 S3). When extending the contigs into scaffolds with VG-flow, the number of haplotypes reconstructed 588 decreased considerably and remained below the number of true haplotypes estimated for the varying 589 genetic diversity levels. PEHaplo reconstructed the lower limit of the true number of haplotypes within 590 HIV-1 diversity levels, but like other tools, including aBayes, CliqueSNV and QuasiRecomb, PEHaplo 591 and SAVAGE, had trouble reconstructing viral sequences at higher diversity levels. We compared twelve of the most commonly used software tools to reconstruct haplotypes from 603 viral NGS data. We simulated coalescent-based populations that spanned past known levels of viral 604 diversity, including mutation rates, sample size, and effective population size. We focused our 605 empirical comparisons on the intra-host diversity levels of fast-evolving RNA viruses such as HIV-1 606 because parameter value ranges are well established and a better understanding of viral dynamics is 607 important for drug and vaccine development. Additionally, the majority of haplotype tool developers 608 used HIV-1 to validate their own programs. In our analyses of HIV-1 intra-host diversity, we estimated 609 between 216 and 1,185 haplotypes with a <7% frequency for a single haplotype. 610 Overall, reference-based assemblers produced more accurate haplotypes than de novo-based 611 assemblers for all performance indices (precision, recall, UniFrac, and number of reconstructed 612 haplotypes) across HIV-1 diversity levels. This performance could be attributed to the availability of 613 high-quality reference sequences for HIV-1, HIV-2, HCV, influenza and other viruses. Furthermore, 614 using a reference sequence reduces the computational time and power needed to reconstruct 615 haplotypes. Reference-based assemblers likely performed better than de novo assemblers because of 616 the high variation within viral populations, especially HIV-1, where the reference sequence may have 617 provided needed guidance to orient the highly diverse NGS sequences into a haplotype sequence. 618 Our results show that PredictHaplo offers the best tradeoff between statistical performance and 619 computational efficiency within HIV-1 diversity ranges. PredictHaplo was found to have the highest 620 precision, recall, and lowest UniFrac distance values. CliqueSNV followed closely and may actually 621 outperform PredictHaplo if more computational resources were made available. An important caveat 622 for both these approaches, however, is that the number of true haplotypes is greatly underestimated. If 623 it is important to identify the true number of haplotypes (as in rare haplotype discovery) approaches 624 such as ShoRAH or PEHaplo may be more appropriate. The haplotype programs also varied greatly 625 in terms of their ease-of-use. This variation is due to differences in coding language, program 626 dependencies, availability of executable files, absence of comprehensive documentation and lack of 627 example datasets. For example, SAVAGE, PEHaplo, ShoRAH can be easily installed by package 628 managers, and CliqueSNV and QuasiRecomb are distributed as executable files. In contrast, Virus-629 VG and VG-flow requires installment of proprietary software, which has an academic license. 630 Installation and usage of PredictHaplo is challenging because of the lack of description and instructions 631 regarding the config file. While CliqueSNV is easier to install and use, there are no example datasets. 632 It is important to note that our study represents an initial attempt of comprehensive comparison 633 of available haplotype reconstruction tools. For example, we focused HIV-1 diversity estimates for the 634 polymerase gene, which is less variable than the envelope gene. Moreover, almost all developers of 635 the aforementioned tools used the polymerase gene as a source of simulating sequencing data for 636 assessing performance of their programs and rarely used the envelope gene for the same purposes. 637 Given the envelope gene has a higher mutation rate and the haplotype reconstruction tools -de novo 638 or reference-based -seem to be dependent on mutation rate, it is likely that the tools available here 639 would not be successful in reconstructing envelope haplotypes for HIV-1 accurately. However, we 640 chose polymerase as our gene of interest because of research focus on this gene as the target for drug 641 resistance mutations. The same concept of lower mutation rates in conserved genes and higher 642 mutation in less conserved genes can be seen in other fast-evolving viruses. For example, in HCV the 643 core protein is more conserved compared to the E1/E2 region. Thus, users should target methods for 644 haplotype calling that best match the mutation rate of their target gene. 645 Another limitation of our study is coverage. It is well-known that coverage plays a crucial role 646 in all algorithms for distinguishing between errors and rare sequence variants. We chose 100x coverage 647 because it represents a reasonable amount of data that can be obtained without intensive labor or money 648 consuming procedures. Contrary to our simulations, the developers of haplotype reconstruction tools 649 usually test their methods on datasets with higher coverage than ours. For example, the famous golden-650 standard benchmark HIV dataset (labmix dataset (Di Giallonardo et al., 2014)) on which all tools have 651 been tested by developers, consisted of an average of 20,000x coverage. Thus, our study represents an 652 attempt to measure the performance of the haplotype reconstruction tools on datasets that are more 653 likely to be seen and produced in laboratories. Moreover, according to our results for higher mutation 654 rates, many tools did not produce any results within the time limit. Considering that higher coverage 655 implies a larger amount of data and thus requiring more computational time to process these data, it is 656 expected that the tools available here would require extensive computational resources. 657 We also considered error-free and recombination-free data in our study. Only a few tools 658 explicitly took into account the presence of errors or recombination in their models (e.g., only 659 QuasiRecomb explicitly assumes the presence of recombination events). By not simulating 660 recombination and sequencing errors, we removed nuisance parameters that would impact haplotype 661 reconstruction. Moreover, since almost all tools have been tested on ultra-deep data, our comparison 662 study by error-free data is giving an advantage to these methods by removing errors in sequence data 663 (i.e., one does not need deep coverage to distinguish between rare variants and errors). Furthermore, 664 Zanini et al. (2015) found evidence that recombination likely interrupts haplotypes, specifically in 665 HIV-1, every 100-200bp, so, the concept of haplotypes in HIV-1, and maybe other fast-evolving 666 viruses with high recombination rates, may not exist or be feasible to study with frequent 667 recombination events. Together these facts imply that the performance of the aforementioned tools 668 would be even worse than observed here. 669 Overall, results and limitations of our study indicate the importance of creating broad and 670 diverse golden-standard datasets that must include several different genes, diverse parameters of 671 mutation rates and effective population sizes, different average coverages, presence or absence of 672 recombination events or/and error prone data. Moreover, future simulation studies should address 673 error-prone data using haplotype callers that can handle sequencing errors and investigate the effect of 674 recombination and average coverage on the reconstruction of haplotype. In addition to simulation 675 studies, some theoretical work similar to DNA sequencing theory should be done for laying analytical 676 foundations for determining coverage depending on the mutation rate, effective population size, error-677 rate of a sequencing technology, and so on. Finally, there are still a lot of opportunities for developing 678 new haplotype callers that can process a wide range of data with different mutation rates, average 679 coverage, and presence or absence of recombination events. Moreover, since the reconstructed 680 haplotypes are often used for reconstructing phylogeny, the future tools may also consider the problem 681 of reconstructing haplotype sequences together with their phylogeny. Considering the possibility that 682 the reconstructing haplotype sequences from short-read sequencing technologies may represent an 683 intractable problem, focusing on reconstructing haplotype phylogeny directly from short-reads may 684 lead to better results after all. In addition to mentioned future directions, the advances and price-685 decreasing of long-read sequencing technologies (e.g., Nanopore, PacBio, 10X Genomics) poses a 686 whole new set of challenges for haplotype reconstructions including the development of new 687 sequencing protocols and haplotype reconstruction tools. This new technology has the power to 688 sequence long amplicons or even entire viral genomes in a single pass, i.e., no need to assemble 689 sequencing reads. However, this type of data requires development of new methods that can 690 distinguish between rare variants and sequencing errors. Therefore, the application of long-read 691 sequencing technology may be more beneficial for studying global, or entire genome, haplotypes. 692 693 Acknowledgements 694 The Competing interest 720 The authors declare that they have no competing interests. 721 HCV diversity levels, respectively. For all pairs of parameters μ and Ne, we report the mean estimates 995 of recall over all valid outputs produced by each software tool for five haplotype populations. If a tool 996 did not produce any output for some pair of parameters, we included a gap in the corresponding plot. 997