Abstract
Whole-genome data has become significantly more accessible over the last two decades. This can largely be attributed both to reduced sequencing costs and to imputation models that make it possible to obtain nearly whole-genome data from less expensive genotyping methods, such as microarray chips. Although there are many different approaches to imputation, the Hidden Markov Model remains the most widely used. In this study, we compared the latest versions of the most popular Hidden Markov Model-based tools for phasing and imputation: Beagle 5.2, Eagle 2.4.1, ShapeIT 4, Impute 5 and Minimac 4. We benchmarked them on three input datasets with three levels of chip density. We assessed each imputation software package on the basis of accuracy, speed and memory usage, and showed how the choice of imputation accuracy metric can result in different interpretations. The highest average concordance rate was achieved by Beagle 5.2, followed by Impute 5 and Minimac 4, using a reference-based approach during phasing and the highest density chip. The IQS and R2 metrics revealed that Impute 5 obtained better results for low-frequency markers, while Beagle 5.2 remained more accurate for common markers (MAF>5%). Computational load as measured by run time was lower for Beagle 5.2 than for Impute 5 and Minimac 4, while Minimac 4 utilized the least memory of the imputation tools we compared. ShapeIT 4 used the least memory of the phasing tools examined, even with the highest density chip. Finally, we determined the combinations of phasing software, imputation software, and reference panel best suited to different situations and analysis needs, and created an automated pipeline that provides a way for users to create customized chips designed to optimize their imputation results.
Introduction
Genome-wide association studies (GWAS) remain one of the most critical and powerful methods of identifying key genes and variants that play a role in many common human diseases (The Wellcome Trust Case Control Consortium 2007, Uffelmann et al. 2021). Identification of disease-associated variants in GWAS is dependent on successful tagging of millions of common variants in the human genome, and the ability to make inferences about genotypes of rare variants which are often not in linkage disequilibrium (LD) with common variants (The Wellcome Trust Case Control Consortium 2007, Uffelmann et al. 2021). Commercial single nucleotide polymorphism (SNP) genotyping arrays can contain up to 2.5 million markers, but none provide complete coverage of the human genome (Schurz et al. 2019). Despite the advances of the last two decades, which have led to increasingly rapid and extensive genotyping, it is still prohibitively expensive to obtain whole genome sequencing (WGS) for the tens of thousands of individuals in GWAS (Peterson et al. 2017, Quick et al. 2020). Individual GWAS may also use distinct chips with different markers. To combine these GWAS for meta-analysis, we require a method by which to identify genotypes at all markers utilized in each of these studies (Zaitlen and Eskin 2010). Thus, we continue to rely on imputation, the process of probabilistically estimating non-genotyped alleles for individuals in GWAS samples.
Genotype imputation is a method that infers the alleles of un-genotyped single-nucleotide polymorphisms (SNPs) based on the linkage disequilibrium (LD) with directly genotyped markers using a suitable reference population (Marchini and Howie 2010). It is predicated on the idea that seemingly unrelated individuals from the human population sampled at random can share short stretches of DNA within chromosomes derived from a shared ancestor (Scheet and Stephens 2006). Imputation can be used to improve SNP coverage and increase the statistical power of GWAS (Pei et al. 2010; Malhotra et al. 2014). Genotype imputation also facilitates fine mapping of causal variants, plays a key role in the meta-analyses of GWAS, and can be utilized in downstream applications of GWAS such as estimation of disease risk (Das, Abecasis, and Browning 2018). However, an important limitation of imputation is that only variants that were previously observed in a reference panel can be imputed (Das, Abecasis, and Browning 2018). Furthermore, rare variants are often poorly represented in reference panels making accurate imputation of rare and infrequent variants difficult. In addition, the choice of whether to pre-phase the data can impact imputation. Finally, imputation accuracy, sensitivity and computational efficiency are greatly affected by the choice of imputation software or tool (Das, Abecasis, and Browning 2018).
Over the last twenty years, multiple research groups have developed and published a number of phasing and imputation models, the majority of which are based on the Li and Stephens Hidden Markov Model (HMM) (Li and Stephens 2003). First described in 2003, it was applied to haplotype estimation methods, termed “phasing”, and used to handle large stretches of chromosome where individual haplotypes share contiguous, mosaic stretches with other haplotypes in the sample (Scheet and Stephens 2006, Das, Abecasis, and Browning 2018). Unlike previous coalescent approaches, it was computationally tractable, and methods based on the Li & Stephens HMM were soon shown to be more accurate and efficient than other methods (Lunter 2019, Scheet and Stephens 2006). Landmark and popular phasing algorithms are listed in Table 1 as a brief tabular history of the field. Currently, the most commonly used Li and Stephens HMM-based software packages are BEAGLE, EAGLE, and SHAPEIT for phasing, and BEAGLE, IMPUTE and MINIMAC for imputation.
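To illustrate the copying model these tools share, the following is a minimal sketch of the scaled forward pass of a Li & Stephens-style HMM, in which the hidden state at each marker is the reference haplotype being copied. The function name, the uniform switch model, and the constant switch and error rates are simplifying assumptions for illustration, not the implementation used by any of the tools above.

```python
import numpy as np

def li_stephens_forward(obs, ref_haps, rho=0.01, eps=0.001):
    """Scaled forward pass of a Li & Stephens haplotype-copying HMM.

    obs      : length-M array of observed alleles (0/1)
    ref_haps : K x M array of reference haplotypes (0/1); the hidden
               state at each marker is the haplotype being copied
    rho      : per-interval switch (recombination) probability
    eps      : per-site copying error (mutation) probability
    """
    K, M = ref_haps.shape
    # emission: high probability when the copied haplotype matches obs
    emit = np.where(ref_haps.T == obs[:, None], 1.0 - eps, eps)  # M x K
    fwd = np.zeros((M, K))
    fwd[0] = emit[0] / K                 # uniform prior over haplotypes
    fwd[0] /= fwd[0].sum()
    for m in range(1, M):
        # stay on the current haplotype with prob 1-rho, or switch to
        # a uniformly chosen haplotype with prob rho
        fwd[m] = emit[m] * ((1.0 - rho) * fwd[m - 1]
                            + rho * fwd[m - 1].sum() / K)
        fwd[m] /= fwd[m].sum()           # rescale to avoid underflow
    return fwd
```

The real tools add haplotype clustering (Beagle), state-space reduction (Impute5, Minimac4) and position-specific recombination rates from a genetic map, but the recursion above is the common core.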
Imputation accuracy is measured by several key sets of metrics which can be classified into two overarching types: statistics that compare imputed genotypes to ‘gold standard’ genotyped data and statistics produced without reference to true genotypes (Ramnarine et al. 2015). Concordance rate, squared correlation R2, and Imputation Quality Score (IQS) are examples of the first type (Candelaria Vergara 2018, Ramnarine et al. 2015). In practice, the purpose of imputation is to predict SNPs for which we do not have genotyped data; statistics of the second type are typically relied upon during imputation, and generally output by the various imputation programs. Although the rapid increase in the number of deeply sequenced individuals will soon make it possible to assemble increasingly large reference panels that greatly increase the number of imputable variants, the choice of phasing and imputation software currently has a significant impact on accuracy (Herzig et al. 2018). While several studies have evaluated and compared imputation models, or phasing models, or imputation models in combination with different reference panels, no recent studies have compared imputation and phasing algorithms in combination with different reference panels, in tandem, and evaluated the relative computational efficiency and accuracy of each combination (Sariya et al. 2019, Herzig et al. 2018).
In this study, we evaluate the latest versions of the most commonly used tools for phasing and imputation in terms of accuracy, computational speed and memory usage, using 2 different versions of the 1kG project as reference panels and three different microarray chip datasets as inputs. We combine each tool for phasing with a method for imputation to understand which combination achieves the best overall results and which method is the best at imputing rare variants. Our goal was to determine the combination of phasing and imputation software and reference panel that is best suited for different situations and needs.
Methods
Chip Data
We used three different chip datasets with differing marker density and input dataset sizes. The first chip dataset (Affymetrix) was composed of 3450 unrelated individuals from The 1000 Genomes Project genotyped with the Affymetrix 6.0 900K array (Affymetrix, ThermoFisher), the second (Omni) of 2318 unrelated individuals from the 1000 Genomes Project genotyped with the Illumina Omni 2.5 chip, which assays 2.4 million unphased SNP markers, and the third (Customized) was a subset of the first two chips, consisting of the intersection of the first two chips with another chip, GSA version 3 with direct-to-consumer booster by Illumina (Fig. 1). This Customized chip is the intersection of commonly used chips, resulting in a low-density chip with fewer overall sites, which allowed us to assess imputation and phasing accuracy when the input data is limited to a relatively small number of SNPs.
Fig. 2 describes the preparation of chip datasets for analysis. Data from the Affymetrix and Omni chips were normalized using BCFtools (Petr Danecek 2021). Chip data was processed separately for each chromosome, and chromosome names were given the ‘chr’ prefix to match the reference panel. Chromosome 20 was chosen for use in all downstream analyses as it is generally representative of autosomal chromosomes. Sample data was converted to GRCh38 with Picard liftover (Picard Toolkit 2019) to match the assembly of the reference panels. We split multiallelic sites to record them as biallelic, left-normalized the variants to the reference genome, and removed duplicate variants. Finally, because Beagle does not allow skipping imputation of sporadic missing data, variants with missing genotype information were removed from the chip datasets and the WGS reference panels.
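The normalization steps above map to BCFtools `norm` invocations. The helper below only assembles the command strings; the file names are placeholders and the commands reflect BCFtools' documented options rather than the pipeline's verbatim calls.

```python
def bcftools_norm_cmds(vcf_in, ref_fasta, vcf_out):
    """Build the BCFtools commands for the chip QC described above:
    split multiallelic records into biallelic ones, left-normalize
    against the reference genome, and drop exact duplicate variants."""
    return [
        # -m -any: split multiallelic sites; -f: left-align to the reference
        f"bcftools norm -m -any -f {ref_fasta} {vcf_in} -Ob -o step1.bcf",
        # -d exact: remove records with identical position and alleles
        f"bcftools norm -d exact step1.bcf -Ob -o {vcf_out}",
    ]
```

A sketch of the invocation, with placeholder file names:

```python
cmds = bcftools_norm_cmds("chip.chr20.vcf.gz", "GRCh38.fa", "chip.chr20.norm.bcf")
```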
Reference Panel Collection and Sample Selection
We drew our reference panels for imputation and phasing from The 1000 Genomes Project (1000GP). We used the Phase 3 low-coverage WGS, which has a mean depth of 7x, as one reference panel and the high-coverage WGS, with a mean depth of 30x, as a second reference panel (1000 Genomes Project Consortium et al. 2010, 467; 2010, 491; 2015, 526; Sudmant et al. 2015). We refer to these as the 1000GP-Phase3 and 1000GP-30x reference panels.
We randomly selected 190 unrelated individuals from the set of 1686 individuals found in all three collections (the Omni, Affymetrix and WGS 1000 Genomes Project sample collections; Sudmant et al. 2015), as shown in Fig. 3. Our sample consisted of 5 males and 5 females per population, for 19 different populations and 5 super-populations (Fig. 4). These 190 individuals, and their relatives, were removed from the reference panels and used to create chip datasets for testing. Imputation accuracy was assessed by looking at the concordance between the imputed chip data and the whole-genome sequences for these 190 samples.
Quality Control of Reference Panels
For both reference panels, we used BCFtools (Petr Danecek 2021) to split multiallelic sites, remove duplicates and missing data, and align variants to the reference genome. Both the 1000GP-30x and 1000GP-Phase3 panels were preprocessed by prepending the contig name with the prefix ‘chr’. Two additional steps were performed for the 1000GP-Phase3 panel: converting it to GRCh38 with Picard liftover (Picard Toolkit 2019), and discarding rare variant singletons and doubletons to evaluate whether their removal increased imputation accuracy for common variants (MAF>5%). The workflow for the quality control and pre-processing of the reference panels is shown in Fig. 2.
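The singleton/doubleton filter can be sketched as follows, assuming a haplotype matrix coded 0/1; the function name and data layout are illustrative, not the pipeline's actual code.

```python
import numpy as np

def drop_singletons_doubletons(haps, max_mac=2):
    """Drop variants whose minor allele count is at most max_mac
    (singletons and doubletons by default).

    haps : variants x haplotypes array of 0/1 alleles
    """
    haps = np.asarray(haps)
    ac = haps.sum(axis=1)                      # alternate allele count
    mac = np.minimum(ac, haps.shape[1] - ac)   # minor allele count
    return haps[mac > max_mac]
```

Note the minor allele count is taken as the smaller of the alternate and reference counts, so a variant where nearly every haplotype carries the alternate allele is also filtered.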
Phasing and Imputation Pipeline
The Affymetrix, Omni and Customized chips were used as inputs for 9 combinations of phasing and imputation tools to assess which combination performed best for our sample set (Fig. 5), using one of the two reference panels. Phasing was performed using both reference-free and reference-based approaches for each method, to compare their respective resultant imputation accuracy. This yielded a total of 108 combinations of input chip dataset, phasing tool, reference-based or reference-free phasing, imputation reference panel, and imputation tool (Supplementary Table 1).
The haplotype phasing software packages we compared are: Eagle2 v2.4.1 (Loh et al. 2016), Beagle5 v5.2 (Browning, Zhou, and Browning 2018), and Shapeit4 v4.2.1 (Delaneau et al. 2019). All phasing software was launched with default parameters, using 4 cores for each analysis on a 64-bit machine (Intel 82371AB/EB/MB PIIX4 chipset) with 32 GB of RAM, and the saved log file was used to evaluate the total run time. The imputation methods we tested are: Beagle5 v5.2 (Browning, Zhou, and Browning 2018), Impute5 v1.1.5 (Rubinacci, Delaneau, and Marchini 2020) and Minimac4 v1.0.0 (Das et al. 2016).
Each input chip dataset was processed using selphi.sh (SELfdecode PHasing and Imputation), an automated pipeline built in bash that combines the phasing and imputation software and evaluates accuracy at each step to speed up the process of analysis and comparison. The inputs to the pipeline are the chip data file, a reference panel, the number of threads to use and the chromosome to process. The pipeline first checks that the correct version of the reference panel already exists for each imputation software to use and that the input file is available in both BCF and VCF format. This means that the original reference panel is converted to bref3 for imputation with Beagle5.2 using bref3.29May21.d6d.jar, to m3vcf for Minimac4 using Minimac3, and to imp5 for Impute5 using imp5Converter_1.1.5_static. If any of these files do not exist, they are automatically created by the pipeline. After this initial check, the pipeline phases the haplotypes using Eagle2.4.1, Beagle5.2 and ShapeIT4. Each of these tools was run twice with default parameters, once with the reference panel and once without, using 4 threads on chromosome 20 with recombination rates drawn from the genetic map. This step generated 2 phased VCF files per tool, yielding a total of 6 phased VCF files. After phasing, the VCF files were passed to imputation with Beagle5.2, Minimac4 and Impute5. All were run using default parameters with a genetic map for the recombination rate and 4 threads. There are options to speed up both Minimac4 and Impute5, but these tend to reduce accuracy. To maximize the accuracy of each tool and preserve the validity of the comparison, we ran them with the default parameters, avoiding the steps required to optimize for computational load.
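The pre-flight panel-format check can be sketched as below. The converter invocations follow each tool's documented usage, but the exact flags and file names here are illustrative assumptions, not the pipeline's verbatim commands; the function only returns the commands for files that are missing.

```python
import os

def panel_conversion_cmds(panel_vcf, prefix="panel.chr20"):
    """Map each tool-specific reference panel file to the command that
    creates it, returning only the commands whose target is missing."""
    conversions = {
        # Beagle5.2 reads bref3 panels
        f"{prefix}.bref3":
            f"java -jar bref3.29May21.d6d.jar {panel_vcf} > {prefix}.bref3",
        # Minimac4 reads m3vcf panels produced by Minimac3
        f"{prefix}.m3vcf.gz":
            f"Minimac3 --refHaps {panel_vcf} --processReference --prefix {prefix}",
        # Impute5 reads imp5 panels
        f"{prefix}.imp5":
            f"imp5Converter_1.1.5_static --h {panel_vcf} --r chr20 --o {prefix}.imp5",
    }
    return {target: cmd for target, cmd in conversions.items()
            if not os.path.exists(target)}
```

Running the returned commands (e.g. via `subprocess.run`) then brings the panel directory up to date before phasing begins.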
Accuracy Measurement
Accuracy was assessed by comparing the imputation data resulting from each of the different combinations of phasing tool, imputation tool, and choice of reference, against the WGS dataset of the chosen 190 target samples. Variables considered were population/ancestry, sex, choice of tools, choice of reference, use of a reference panel, chip density, and the effect of MAF. We also looked at computational efficiency and memory usage. To check the effects of MAF on imputation accuracy, we used r2 as the metric of choice as it can distinguish between different MAF stratifications and is the most widely used metric for assessing imputation accuracy (Liu et al. 2013).
Accuracy was evaluated using a custom, faster version of the imputation accuracy calculation software available on GitHub (Chen et al. 2020) that summarizes the accuracy metrics described by Ramnarine et al. (Ramnarine et al. 2015). A detailed report with the concordance ratio (Po), F-measure score, squared correlation (R2) and imputation quality score (IQS) was generated and written to the output file. To accurately assess IQS and R2 results, we removed from the analysis all variants with MAF equal to 0 (allele count equal to 0) in our target population of 190 individuals; IQS is zero when MAF is equal to zero, and is not indicative of accuracy or imputation quality. The entire code for the accuracy metrics can be found in the script simpy.py (see section Data Available).
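As a sketch of how these metrics relate, the function below computes Po, IQS and R2 from hard genotype calls coded 0/1/2. It is a simplified illustration (real pipelines typically work from genotype dosages and handle more edge cases), not the code in simpy.py.

```python
import numpy as np

def genotype_metrics(true_gt, imputed_gt):
    """Concordance (Po), IQS and squared correlation (R2) for one
    variant, from hard genotype calls coded 0/1/2."""
    true_gt = np.asarray(true_gt)
    imputed_gt = np.asarray(imputed_gt)
    n = len(true_gt)
    # 3x3 contingency table of imputed (rows) vs true (columns) genotypes
    table = np.zeros((3, 3))
    for t, i in zip(true_gt, imputed_gt):
        table[i, t] += 1
    po = np.trace(table) / n
    # chance agreement from the marginals (Cohen's kappa-style correction)
    pc = (table.sum(axis=0) * table.sum(axis=1)).sum() / n ** 2
    iqs = (po - pc) / (1 - pc) if pc < 1 else 0.0
    r2 = np.corrcoef(true_gt, imputed_gt)[0, 1] ** 2
    return po, iqs, r2
```

Because IQS subtracts the agreement expected by chance, it stays low for a variant whose high concordance merely reflects a skewed genotype distribution, which is exactly why it is preferred for rare variants.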
Results
Genotyping Data
After performing quality control on chromosome 20, 18,279 variants with a genotyping call rate of 100% remained in the Affymetrix chip dataset, and 37,334 variants with a genotyping call rate of 100% remained in the Omni Illumina dataset. In total, 5065 SNP markers overlapped between the two chips. The customized chip had 5913 markers shared between the Eurofins and the Affymetrix and Omni chips. The number of variants shared between the chip datasets and the 1000GP-30x panel (WGS) is shown in Fig. 6.
Imputation Accuracy
Minor Allele Frequency (MAF) And Reference Panel
We stratified variants based on MAF and assessed imputation accuracy for common, infrequent, and rare variants to obtain a more nuanced understanding of how well each combination of phasing-imputation tools performed (Table 2).
Table 2 compares the sensitivity and false positive rate (FPR) of the imputation results for each phasing-imputation combination, stratified by MAF.
Based on the accuracy metric, the false positive rate (FPR) and the sensitivity, Beagle5.2 outperformed the other phasing tools when MAF was greater than 5%, with ShapeIT4 a close second. However, for uncommon variants (MAF≤5%), ShapeIT4 was the better phasing tool, irrespective of imputation tool choice. For the imputation of uncommon variants (MAF≤5%), Impute5 outperformed Beagle5.2 and Minimac4 for each phasing tool combination. However, for common variants (MAF>5%), Beagle5.2 was superior. Similar results were obtained using r2 as the metric (Fig. 7). The best combination overall was ShapeIT4-Beagle5.2 imputing from the Omni chip dataset, with a reference-based phasing approach and imputing using the 1000GP-Phase3 reference panel, resulting in an average imputation r2 of 0.839 (S1 Table 1). On the other hand, for the 1000GP-30x reference panel, the best phasing and imputation tool combination was ShapeIT4-Impute5 using an Omni chip with reference-based phasing, resulting in an average imputation r2 of 0.728 (S1 Table 1).
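The MAF stratification used above can be reproduced with a small helper. The bin edges (1% and 5%) follow the thresholds quoted in the text, while the function name and diploid 0/1/2 genotype coding are illustrative assumptions.

```python
import numpy as np

def maf_stratum(genotypes):
    """Compute minor allele frequency from diploid genotypes (0/1/2)
    and assign the variant to a MAF stratum."""
    gt = np.asarray(genotypes)
    af = gt.sum() / (2 * gt.size)   # alternate allele frequency
    maf = min(af, 1 - af)
    if maf < 0.01:
        return maf, "rare"
    if maf <= 0.05:
        return maf, "infrequent"
    return maf, "common"
```

Applying this per variant, then averaging an accuracy metric within each stratum, yields tables in the shape of Table 2.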
Fig. 8 depicts an increase in IQS with increasing MAF. Impute5 produced better results at lower MAF than either Beagle5.2 or Minimac4, while Beagle5.2 imputed better above 5% allele frequency. Ultra-rare variants were imputed poorly by all of the software tested.
Choosing ShapeIT4 as the phasing tool for reference-based phasing, followed by any choice of imputation tool, resulted in the highest r2 for either imputation reference panel (S1 Table 1). For the Affymetrix and customized chips, ShapeIT4 remained the best choice of phasing tool for reference-free phasing, with respect to r2; for Omni, Beagle was the superior phasing tool. However, when we instead considered IQS as the metric of choice, both Beagle and ShapeIT4 performed equally well for reference-based phasing for higher density input chip datasets, but ShapeIT4 outperformed Beagle for the customized chip dataset, which had low chip density. For reference-free phasing, with respect to IQS, there was no clear winner between ShapeIT4 and Beagle (S1 Table 1).
To get a better overall representation of how MAF affects imputation accuracy and error rates, we plotted IQS against error rate (Fig. 9), where each dot represents an imputed variant. The markers cluster according to their MAF and follow a waterfall trend: IQS is generally higher, and error rates lower, for more common variants, while rare variants (MAF<1%) tend to have lower IQS and higher error rates.
Population, Sex, Chip Density, and Phasing Approach
Accuracy as measured by concordance (Po) was lowest in individuals of African ancestry, and highest in individuals of European and American populations, groups which both have significant recent European ancestry (Table 3). Furthermore, despite reaching similar average imputation accuracy, a greater proportion of EUR individuals had very high imputation accuracy, with progressively smaller proportions of target individuals reaching high concordance for East Asian, American, African and South Asian ancestry, respectively (Fig. 10B). Thus, although we were able to reach similar mean imputation concordance for each of the different populations, imputation tools performed best when applied to EUR populations and worst for AFR and South Asian populations.
Differences in imputation accuracy by population and phasing approach are shown in Figure 10. The reference-based approach produced better results than the reference-free approach for most combinations of imputation and phasing algorithms, based on a comparison of IQS across all combinations (Fig. 10D). There was also a clear relationship between chip density and imputation accuracy, as measured by concordance: as chip density increased, imputation accuracy improved. The Omni chip had the greatest chip density and accuracy, and the customized chip the lowest (Figs. 10C, 11). From the shape of the chip distributions, we see that the vast majority of the Omni dataset was imputed with very high concordance, whereas less of the Affymetrix input dataset, and much less of the Customized chip dataset, was imputed with similar accuracy. We also compared imputation accuracy by sex as a check to ensure our QC process did not introduce any artificial differences. Sex had no effect on imputation accuracy for autosomal chromosome 20 (Fig. 10A): accuracy for females was on average 0.9907 ± 0.0078, while for males it was 0.9906 ± 0.0080.
Speed and Memory usage
Of the imputation tools, Minimac4 was the most computationally efficient in terms of memory, followed by Beagle5.2 and Impute5, but had the slowest run time (Fig. 12B). Memory usage for Impute5 increased drastically with the size of the input dataset used, while Beagle and Minimac4 were not significantly affected (Fig. 12D). Beagle5.2 had the shortest run time, followed by Impute5 and Minimac4 (Fig. 12A). During phasing, Eagle2.4.1 and ShapeIT4 used less memory than Beagle5.2 and were less affected by the input size of the chip (Fig. 12C). Averaged across the datasets, Eagle2.4.1 was the slowest phasing software while ShapeIT4 was the fastest. Figure 13 shows the average computational run time for each combination. Phasing with ShapeIT4 and imputing with Beagle5.2 was the fastest combination, while phasing with Eagle2.4.1 and imputing with Minimac4 was the slowest.
Discussion
We performed a rigorous comparison of the most popular phasing and imputation tools currently used by genomics research groups to examine how the process of genotype imputation is affected by different factors, including the choice of reference panel, population, chip density, and allele frequency, with the factor of sex as a control on our process. We also compared the computational load of these different tools and software combinations.
Factors Affecting Imputation Accuracy
Imputation accuracy decreased with decreasing chip density: the Affymetrix chip resulted in lower accuracy than the Omni chip, and the customized chip had the lowest imputation accuracy. While this was expected, it also shows how our processing and comparison pipeline may help researchers design better chips by choosing the number and distribution of SNPs for each specific population, and assessing the impact of density and SNP choice on phasing and imputation accuracy; it can also be used to determine whether different sets of chips are likely to perform better with certain combinations of phasing and imputation tools.
Next, we assessed both reference-free and reference-based phasing. Although reference-free phasing was less accurate, we found that increasing chip density mitigates the effect that the lack of a reference panel has on phasing. The difference between reference-free and reference-based phasing was not extreme, suggesting that reference-free phasing may be acceptable in the absence of a representative reference panel. Previous studies comparing phasing accuracy with and without the use of a reference panel have shown that reference-free phasing, such as with Eagle2, can even lead to higher accuracy in cases where the reference panel ancestry and populations do not match well with those of the sampled individuals (Loh et al. 2016).
Furthermore, the choice of the reference panel affects imputation accuracy, across all imputation metrics utilized. We note that using the 30X reference panel results in slightly lower imputation accuracy for uncommon variants; this was due to the panel containing more rare SNPs. As variants with lower MAF are more difficult to impute, and are imputed with greater uncertainty and reduced accuracy, these results are expected.
Accuracy was further affected by population but not by sex. Different populations are characterized by differences in LD as a result of differences in genealogical history, and thus have different characteristic LD blocks and LD block sizes, which affect imputation accuracy (David M. Evans 2005). We expect that lower imputation accuracy seen in individuals of AFR ancestry is attributable to the smaller LD blocks characteristic of AFR ancestry, which make it more difficult to correctly impute genotypes.
In agreement with previous research (Shi et al. 2018), we found that variants with low allele frequency are generally imputed poorly. In general, imputation works poorly for variants with low MAF as a function of both bias in the reference panels and bias in the software (Shi et al. 2018). We can address reference-associated bias by significantly increasing the size of the chosen reference panel and including sufficient population-specific samples in the reference. However, addressing software bias would require developing improved imputation algorithms.
Finally, the choice of statistics is important when examining the imputation accuracy of rare and low frequency variants. We found that IQS and squared correlation produced similar means and standard deviations, though this does not necessarily represent similarity of values for particular SNPs. For rare and low frequency variants, concordance rates produce inflated assessments of accuracy (Lin et al. 2010). The higher concordance rate values could mislead a researcher into assuming that these variants were imputed well. However, accuracy for less common variants is best measured using IQS and R2 (Ramnarine et al. 2015).
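A small worked example illustrates the inflation: for a variant with MAF of 1%, an "imputation" that simply calls every sample as the major homozygote still achieves 98% concordance, yet its IQS, which corrects for chance agreement, is zero. The numbers below are synthetic and purely illustrative.

```python
import numpy as np

true_gt = np.array([0] * 98 + [1] * 2)   # 100 samples, MAF = 1%
imp_gt = np.zeros(100, dtype=int)        # always call the major homozygote

po = (true_gt == imp_gt).mean()          # naive concordance: 0.98

# chance concordance Pc from the genotype-class marginals
n = len(true_gt)
pc = sum(np.sum(true_gt == g) * np.sum(imp_gt == g) for g in (0, 1, 2)) / n ** 2
iqs = (po - pc) / (1 - pc)               # 0.0: no information beyond chance
```

An accuracy report built on concordance alone would rate this variant as imputed almost perfectly, whereas IQS correctly flags the call set as uninformative.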
Choice of Phasing and Imputation Tools
There was a discrepancy in accuracy based on the different metrics. The highest average concordance rate was achieved by Beagle5.2, at 0.986, followed by Impute5 and Minimac4, using a reference-based approach during phasing with the highest density chip dataset as input. In general, choosing Beagle5.2 for imputation and ShapeIT4 for phasing tends to yield highly accurate results and is computationally faster. When looking to improve the imputation of rare variants, however, researchers may want to use a mix of Beagle5.2 and Impute5 by applying Beagle5.2 to common variants and Impute5 to rare ones. Impute5 tends to perform better on rare variants because, unlike Beagle5.2, which computes clusters of haplotypes and performs its calculations on them, Impute5 searches the whole space of haplotypes. This is more effective when imputing uncommon variants, but at the cost of increased computational load.
On the other hand, imputation accuracy for Beagle5.2 is better than that of Impute5 for the filtered Phase 3 reference panel; this is also expected, since the Phase 3 panel has fewer rare alleles. Beagle5.2 was also the most stable tool across different input sizes. Minimac4 requires the least amount of memory but takes the longest time, which can be a good tradeoff depending on the purpose of the imputation. If memory is limited, and the loss of accuracy is acceptable, then Minimac4 may be the optimal choice of imputation software. It is also important to note that default parameters were used for all software. For example, we could reduce the computational load of Impute5 by using parallel processing, but this can negatively affect accuracy; the impact is sufficient to reduce Impute5’s accuracy to below that of Beagle5.2. Overall, Beagle might offer the best tradeoff between imputation quality and computational efficiency.
In conclusion, differences in imputation and phasing performance may be useful in determining the choice of imputation and phasing tool, depending on the intended downstream usage of the imputed results. However, this study also highlights that current tools are not accurate enough to impute rare and ultra-rare variants, showing that, when corrected for chance concordance and MAF bias, they result in only acceptable imputation accuracy and that there is significant scope for improvement.