bioRxiv
New Results

Rapid, Reference-Free Human Genotype Imputation with Denoising Autoencoders

Raquel Dias, Doug Evans, Shang-Fu Chen, Kai-Yu Chen, Leslie Chan, Ali Torkamani
doi: https://doi.org/10.1101/2021.12.01.470739
Raquel Dias
1Scripps Research Translational Institute, Scripps Research, La Jolla, CA, 92037, USA
2Department of Integrative Structural and Computational Biology, Scripps Research, La Jolla, CA, 92037, USA
Doug Evans
1Scripps Research Translational Institute, Scripps Research, La Jolla, CA, 92037, USA
2Department of Integrative Structural and Computational Biology, Scripps Research, La Jolla, CA, 92037, USA
Shang-Fu Chen
1Scripps Research Translational Institute, Scripps Research, La Jolla, CA, 92037, USA
2Department of Integrative Structural and Computational Biology, Scripps Research, La Jolla, CA, 92037, USA
Kai-Yu Chen
1Scripps Research Translational Institute, Scripps Research, La Jolla, CA, 92037, USA
2Department of Integrative Structural and Computational Biology, Scripps Research, La Jolla, CA, 92037, USA
Leslie Chan
1Scripps Research Translational Institute, Scripps Research, La Jolla, CA, 92037, USA
2Department of Integrative Structural and Computational Biology, Scripps Research, La Jolla, CA, 92037, USA
Ali Torkamani
2Department of Integrative Structural and Computational Biology, Scripps Research, La Jolla, CA, 92037, USA
  • For correspondence: atorkama@scripps.edu

Abstract

Genotype imputation is a foundational tool for population genetics. Standard statistical imputation approaches rely on the co-location of large whole-genome sequencing-based reference panels, powerful computing environments, and potentially sensitive genetic study data. This results in computational resource and privacy-risk barriers to access to cutting-edge imputation techniques. Moreover, the accuracy of current statistical approaches is known to degrade in regions of low and complex linkage disequilibrium.

Artificial neural network-based imputation approaches may overcome these limitations by encoding complex genotype relationships in easily portable inference models. Here we demonstrate an autoencoder-based approach for genotype imputation, using a large, commonly-used reference panel, and spanning the entirety of human chromosome 22. Our autoencoder-based genotype imputation strategy achieved superior imputation accuracy across the allele-frequency spectrum and across genomes of diverse ancestry, while delivering at least 4-fold faster inference run time relative to standard imputation tools.

Introduction

The human genome is inherited in large blocks from parental genomes, generated through a DNA-sequence-dependent shuffling process called recombination. The non-random nature of recombination breakpoints producing these genomic blocks results in correlative genotype relationships across genetic variants, known as linkage disequilibrium. Thus, genotypes for a small subset (1% – 10%) of observed common genetic variants can be used to infer the genotype status of unobserved but known genetic variation sites across the genome (on the order of ∼1M of >10M sites) (Li et al., 2009; Marchini and Howie, 2010). This process, called genotype imputation, allows for the generation of nearly the full complement of known common genetic variation at a fraction of the cost of direct genotyping or sequencing. Given the massive scale of genotyping required for genome-wide association studies or implementation of genetically-informed population health initiatives, genotype imputation is an essential approach in population genetics.

Standard approaches to genotype imputation utilize Hidden Markov Models (HMMs) (Browning et al., 2018; Das et al., 2016a; Rubinacci et al., 2020) distributed alongside large WGS-based reference panels (Browning and Browning, 2016). In general terms, these imputation algorithms use genetic variants shared between to-be-imputed genomes and the reference panel and apply an HMM to impute the missing genotypes per sample (Das et al., 2018). Genotyped variants are the observed states of the HMM, whereas the to-be-imputed genetic variants present in the reference panel are the hidden states. The HMM parameters depend on recombination rates, mutation rates, and/or genotype error rates that must be fit by a Markov Chain Monte Carlo (MCMC) algorithm or an expectation-maximization algorithm. Thus, HMM-based imputation is a computationally intensive process, requiring access to both high-performance computing environments and large, privacy-sensitive WGS reference panels (Kowalski et al., 2019). Often, investigators outside of large consortia will resort to submitting genotype data to imputation servers (Das et al., 2016a), resulting in privacy and scalability concerns (Sarkar et al., 2021).

Recently, artificial neural networks, especially autoencoders, have attracted attention in functional genomics for their ability to fill in missing data from genomic assays with significant dropout events, like single-cell RNAseq and ChIP-seq (Arisdakessian et al., 2019; Koh et al., 2017; Lal et al., 2021). Autoencoders are neural networks tasked with the problem of simply reconstructing the original input data, with constraints applied to the network architecture or transformations applied to the input data in order to achieve a desired goal like dimensionality reduction or compression, and de-noising or de-masking (Abouzid et al., 2019; Liu et al., 2020; Voulodimos et al., 2018). In the de-noising and de-masking case, stochastic noise or masking is used to modify or remove data inputs, training the autoencoder to reconstruct the original uncorrupted data from corrupted inputs (Tian et al., 2020). These autoencoder characteristics are well-suited for genotype imputation and may address some of the limitations of HMM-based imputation by eliminating the need for dissemination of reference panels and allowing the capture of non-linear relationships in genomic regions with complex linkage disequilibrium structures. Some attempts at genotype imputation using neural networks have been previously reported, though for specific genomic contexts (Naito et al., 2021) or at genotype masking levels (5% – 20%) not applicable in typical real-world population genetics scenarios (Chen and Shi, 2019; Islam et al., 2021; Kojima et al., 2020; Sun and Kardia, 2008).

Here we present a generalized approach to unphased human genotype imputation using sparse, denoising autoencoders capable of highly accurate genotype imputation at genotype masking levels (98+%) appropriate for array-based genotyping and low-pass sequencing-based population genetics initiatives. We describe the initial training and implementation of autoencoders spanning all of human chromosome 22, achieving equivalent to superior accuracy relative to modern HMM-based methods, and dramatically improving computational efficiency at deployment without the need to distribute reference panels.

Materials and Methods

Overview

Sparse, de-noising autoencoders spanning all bi-allelic SNPs observed in the Haplotype Reference Consortium were developed and optimized. Each bi-allelic SNP was encoded as two binary input nodes, representing the presence or absence of each allele (Figure 1A, 1D). This encoding allows for straightforward extension to multi-allelic architectures and non-binary allele presence probabilities. A data augmentation approach using modeled recombination events and offspring formation coupled with random masking at an escalating rate drove our autoencoder training strategy (Figure 1B). Because of the extreme skew of the allele frequency distribution toward rarely present alleles (Auton et al., 2015), a focal-loss-based approach was essential to genotype imputation performance. The basic architecture of the template fully-connected autoencoder before optimization to each genomic segment is depicted in Figure 1C. Individual autoencoders were designed to span genomic segments with boundaries defined by computationally identified recombination hotspots (Figure 1E). The starting points for model hyperparameters were randomly selected from a grid of possible combinations and were further tuned using a battery of features describing the complexity of the linkage-disequilibrium structure of each genomic segment.

Figure 1. Schematic overview of the autoencoder training workflow.

A) Ground truth whole genome sequencing data is encoded as binary values representing the presence (1) or absence (0) of the reference allele (blue) and alternative allele (red). B) Variant masking (setting both alleles as absent, represented by 0) corrupts data inputs at a gradually increasing masking rate. Example masked variants are outlined. C) Fully-connected autoencoders spanning segments defined as shown in panel E are then trained to reconstruct the original uncorrupted data from corrupted inputs. D) The reconstructed outputs (imputed data) are compared to the ground truth states for loss calculation and are decoded back to genotypes. E) Tiling of autoencoders across the genome is achieved by E.1) calculating an n × n matrix of pairwise SNP correlations and thresholding them at 0.45 (selected values are shown in red background, excluded values in gray), and E.2) quantifying the overall local LD strength centered at each SNP by computing their local correlation box counts and splitting the genome into approximately independent segments by identifying local minima (recombination hotspots). The red arrow illustrates minima between strong LD regions.

Genotype Encoding

Genotypes for all bi-allelic SNPs were converted to binary values representing the presence (1) or absence (0) of the reference allele A and alternative allele B, respectively, as shown in Equation 1.

$$x_i = \begin{cases} [1, 0] & \text{if } G_i = AA \\ [1, 1] & \text{if } G_i = AB \\ [0, 1] & \text{if } G_i = BB \\ [0, 0] & \text{if } G_i \text{ is missing or masked} \end{cases} \tag{1}$$

Here x_i is a vector containing the two allele presence input nodes to the autoencoder and their encoded allele presence values derived from the original genotype, G_i, of variant i. The output nodes of the autoencoder, regardless of activation function, are similarly rescaled to 0 - 1. The scaled outputs can also be regarded as probabilities and can be combined for the calculation of alternative allele dosage and/or genotype probabilities. This representation maintains the interdependencies among classes, is extensible to other classes of genetic variation, and allows for the use of probabilistic loss functions.
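The allele-presence encoding described above can be illustrated with a minimal sketch. The function names are hypothetical, and the rule used to combine the two output probabilities into genotype probabilities and a dosage (treating the two nodes as independent and renormalizing over the three valid genotypes) is an assumption made for illustration; the text does not specify how the outputs are combined.

```python
from typing import Optional, Tuple


def encode_genotype(alt_count: Optional[int]) -> Tuple[int, int]:
    """Encode one bi-allelic genotype as (reference present, alternative present).

    alt_count is the number of alternative alleles (0, 1 or 2);
    None stands for a missing or masked genotype.
    """
    if alt_count is None:
        return (0, 0)  # null case: both alleles absent
    if alt_count not in (0, 1, 2):
        raise ValueError("alt_count must be 0, 1, 2 or None")
    ref_present = 1 if alt_count < 2 else 0
    alt_present = 1 if alt_count > 0 else 0
    return (ref_present, alt_present)


def genotype_probabilities(p_ref: float, p_alt: float) -> Tuple[float, float, float]:
    """Assumed combination rule: independent presence probabilities,
    renormalized over AA, AB and BB."""
    p_aa = p_ref * (1.0 - p_alt)
    p_ab = p_ref * p_alt
    p_bb = (1.0 - p_ref) * p_alt
    total = p_aa + p_ab + p_bb
    if total == 0.0:
        # No allele predicted present: fall back to a uniform distribution.
        return (1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0)
    return (p_aa / total, p_ab / total, p_bb / total)


def alt_dosage(p_ref: float, p_alt: float) -> float:
    """Expected number of alternative alleles under the assumed rule."""
    _, p_ab, p_bb = genotype_probabilities(p_ref, p_alt)
    return p_ab + 2.0 * p_bb
```

Because heterozygotes set both nodes to 1, the two outputs are not mutually exclusive classes, which is why a per-node (binary) loss is a natural fit for this representation.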

Training Data, Masking, and Data Augmentation

Training Data

Whole-genome sequence data from the Haplotype Reference Consortium (HRC) was used for training and as the reference panel for comparison to HMM-based imputation (McCarthy et al., 2016). The dataset consists of 27,165 samples and 39,235,157 biallelic SNPs generated using whole-genome sequence data from 20 studies of predominantly European ancestry (HRC Release 1.1): 83.92% European, 2.33% East Asian, 1.63% Native American, 2.17% South Asian, 2.96% African, and 6.99% admixed ancestry individuals. Genetic ancestry was determined using continental population classification from the 1000 Genomes Phase3 v5 (1000G) reference panel and a 95% cutoff using Admixture software (Alexander et al., 2009). Genotype imputation autoencoders were trained for all 510,442 unique SNPs observed in HRC on human chromosome 22.

Validation and Testing Data

A balanced (50%:50% European and African genetic ancestry) subset of 796 whole genome sequences from the Atherosclerosis Risk in Communities cohort (ARIC) (Mou et al., 2018) was used for model validation and selection. The Wellderly (Erikson et al., 2016), Human Genome Diversity Panel (HGDP) (Cann, 2002), and Multi-Ethnic Study of Atherosclerosis (MESA) (Bild, 2002) cohorts were used for model testing. The Wellderly cohort consisted of 961 whole genomes of predominantly European genetic ancestry. HGDP consisted of 929 individuals across multiple ancestries: 11.84% European, 14.64% East Asian, 6.57% Native American, 10.98% African, and 55.97% admixed. MESA consisted of 5,370 whole genomes across multiple ancestries: 27.62% European, 11.25% East Asian, 4.99% Native American, 5.53% African, and 50.61% admixed.

GRCh38-mapped cohorts (HGDP and MESA) were converted to hg19 using Picard v2.25 (“Picard toolkit,” 2019). All other datasets were originally mapped and called against hg19. Multi-allelic SNPs, SNPs with >10% missingness, and SNPs not observed in HRC were removed with bcftools v1.10.2 (Danecek et al., 2021). Mock genotype array data was generated from these WGS cohorts by restricting genotypes to those present on commonly used genotyping arrays (Affymetrix 6.0, UKB Axiom, and Omni 1.5M). For chromosome 22, intersection with HRC and this array-like masking respectively resulted in: 9,025, 10,615, and 14,453 out of 306,812 SNPs observed in ARIC; 8,630, 10,325, and 12,969 out of 195,148 SNPs observed in the Wellderly; 10,176, 11,086, and 14,693 out of 341,819 SNPs observed in HGDP; 9,237, 10,428, and 13,677 out of 445,839 SNPs observed in MESA.
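The filtering and array-like masking steps can be pictured with a small in-memory sketch. This is not the authors' bcftools workflow; the data layout (a dictionary from site identifier to per-sample genotypes) and all names are hypothetical, chosen only to show the set logic.

```python
from typing import Dict, Hashable, List, Optional, Set, Tuple

# Per-sample alternative-allele counts for one site; None marks a missing call.
Genotypes = List[Optional[int]]


def missing_fraction(genotypes: Genotypes) -> float:
    """Fraction of samples with a missing genotype at one site."""
    return sum(g is None for g in genotypes) / len(genotypes)


def build_mock_array(
    wgs: Dict[Hashable, Genotypes],
    reference_sites: Set[Hashable],
    array_sites: Set[Hashable],
    max_missing: float = 0.10,
) -> Tuple[Dict[Hashable, Genotypes], Dict[Hashable, Genotypes]]:
    """Return (ground-truth sites, mock-array sites).

    Ground truth keeps sites seen in the reference panel with at most
    max_missing missingness; the mock array keeps the subset of those
    sites that is also present on the genotyping array.
    """
    truth = {
        site: g
        for site, g in wgs.items()
        if site in reference_sites and missing_fraction(g) <= max_missing
    }
    observed = {site: g for site, g in truth.items() if site in array_sites}
    return truth, observed
```

The sites in the first returned dictionary but not in the second are the ones an imputation method has to reconstruct.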

Data Augmentation

We employed two strategies for data augmentation – random variant masking and simulating further recombination with offspring formation. During training, random masking of input genotypes was performed at escalating rates, starting with a relatively low masking rate (80% of variants) that is gradually incremented in subsequent training rounds until as few as 5 variants remain unmasked per autoencoder. Masked variants are encoded as the null case in Equation 1. During fine-tuning we used sim1000G (Dimitromanolakis et al., 2019) to simulate offspring formation using the default genetic map and HRC genomes as parents. A total of 30,000 offspring genomes were generated and merged with the original HRC dataset, for a total of 57,165 genomes.
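A minimal sketch of the escalating random-masking scheme follows. The 80% starting rate and the floor of 5 unmasked variants come from the text; the 5% increment per round and the function names are assumptions for illustration, since the increment is not stated.

```python
import random
from typing import Iterator, List, Optional, Tuple

Encoded = List[Tuple[int, int]]  # one (ref present, alt present) pair per variant


def masking_schedule(
    n_variants: int,
    start_rate: float = 0.80,
    step: float = 0.05,  # assumed increment; not given in the text
    min_unmasked: int = 5,
) -> Iterator[float]:
    """Yield masking rates that rise until only min_unmasked variants stay visible."""
    cap = max(0.0, 1.0 - min_unmasked / n_variants)
    rate = start_rate
    while True:
        yield min(rate, cap)
        if rate >= cap:
            break
        rate += step


def mask_variants(
    encoded: Encoded, mask_rate: float, rng: Optional[random.Random] = None
) -> Encoded:
    """Set both allele nodes to 0 (the null case) for a random subset of variants."""
    rng = rng or random.Random()
    n = len(encoded)
    n_keep = min(n, max(0, round(n * (1.0 - mask_rate))))
    keep = set(rng.sample(range(n), n_keep))
    return [pair if i in keep else (0, 0) for i, pair in enumerate(encoded)]
```

Drawing a fresh mask for every batch, rather than fixing one mask per sample, is what makes masking act as data augmentation: the same genome is seen under many different observed-variant subsets.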

Loss Function

In order to account for the overwhelming abundance of rare variants, the accuracy of allele presence reconstruction was scored using an adapted version of focal loss (FL) [32], shown in Equation 2.

$$FL = -\alpha_t (1 - p_t)^{\gamma} \left[ x_t \log(p_t) + (1 - x_t) \log(1 - p_t) \right] \tag{2}$$

Here the classic cross entropy (shown as binary log loss in brackets) of the truth class (x_t) predicted probability (p_t) is weighted by the class imbalance factor α_t and a modulating factor (1 - p_t)^γ. The modulating factor is the standard focal loss factor with hyperparameter γ, which amplifies the focal loss effect by down-weighting the contributions of well-classified alleles to the overall loss (especially abundant reference alleles for rare variant sites). α_t is an additional balancing hyperparameter set to the truth class frequency.
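To make the effect of the modulating factor concrete, here is a minimal per-node sketch. It follows the standard focal-loss definition, in which p_t is the probability the model assigns to the true state of the node; the clipping constant `eps` and the function name are illustrative additions, not taken from the text.

```python
import math


def focal_loss(x_true: int, p_present: float, alpha_t: float, gamma: float,
               eps: float = 1e-7) -> float:
    """Focal loss for one allele-presence node.

    x_true    : 1 if the allele is truly present, else 0
    p_present : predicted probability that the allele is present
    alpha_t   : class-balancing weight for the true class
    gamma     : focusing hyperparameter (gamma = 0 gives weighted cross entropy)
    """
    p = min(max(p_present, eps), 1.0 - eps)  # avoid log(0)
    p_t = p if x_true == 1 else 1.0 - p      # probability given to the true state
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# A confidently correct call contributes far less than a confidently wrong one.
easy = focal_loss(1, 0.95, alpha_t=1.0, gamma=2.0)
hard = focal_loss(1, 0.05, alpha_t=1.0, gamma=2.0)
```

With γ = 2, a node predicted at 0.95 for a present allele has its cross entropy scaled by 0.05² = 0.0025, so the many easy reference-allele calls at rare variant sites no longer dominate the total loss.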

This base focal loss function is further penalized and regularized to encourage simple and sparse models in terms of edge-weight and hidden layer activation complexity. These additional penalties result in our final loss function as shown in Equation 3.

$$L_{final} = FL + \lambda_1 L_1 + \lambda_2 L_2 + \beta S \tag{3}$$

Here L_1 and L_2 are the standard L1 and L2 norms of the autoencoder weight matrix, with their contributions mediated by the hyperparameters λ_1 and λ_2. S is a sparsity penalty, with its contribution mediated by the hyperparameter β, which penalizes deviation from a target hidden node activation set by the hyperparameter ρ vs the observed mean activation $\hat{\rho}_j$ over a training batch j, summed over total batches n, as shown in Equation 4.
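The way the penalty terms combine can be sketched as follows. The functional form of the sparsity penalty is an assumption: the Kullback-Leibler divergence between the target and observed mean activations is a common choice for sparse autoencoders and is used here only for illustration. Function and argument names are likewise hypothetical.

```python
import math
from typing import Sequence


def kl_sparsity(rho: float, rho_hat_batches: Sequence[float], eps: float = 1e-7) -> float:
    """Assumed sparsity penalty: KL divergence between the target activation rho
    and the observed mean activation of each batch, summed over batches."""
    total = 0.0
    for rho_hat in rho_hat_batches:
        rh = min(max(rho_hat, eps), 1.0 - eps)  # keep the logarithms finite
        total += rho * math.log(rho / rh) + (1.0 - rho) * math.log((1.0 - rho) / (1.0 - rh))
    return total


def final_loss(
    focal: float,
    weights: Sequence[float],
    rho: float,
    rho_hat_batches: Sequence[float],
    lambda1: float,
    lambda2: float,
    beta: float,
) -> float:
    """Focal loss plus L1, L2 and sparsity penalties on a flattened weight list."""
    l1 = sum(abs(w) for w in weights)
    l2 = math.sqrt(sum(w * w for w in weights))
    return focal + lambda1 * l1 + lambda2 * l2 + beta * kl_sparsity(rho, rho_hat_batches)
```

Setting λ_1, λ_2 and β to zero recovers the plain focal loss, which is a convenient sanity check when tuning these hyperparameters.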

Genome Tiling

All model training tasks were distributed across a diversified set of NVIDIA graphics processing units (GPUs) with different video memory limits: 5x Titan Vs (12GB), 8x A100s (40GB), 60x V100s (32GB). Given computational complexity and GPU memory limitations, individual autoencoders were designed to span approximately independent genomic segments with boundaries defined by computationally identified recombination hotspots (Figure 1E). These segments were defined using an adaptation of the LDetect algorithm [33]. First, we calculated an n × n matrix of pairwise SNP correlations using all common genetic variation (≥5% minor allele frequency) from HRC. Correlation values were thresholded at 0.45. For each SNP, we calculated a box count of all pairwise SNP correlations spanning 500 common SNPs upstream and downstream of the index SNP. This moving box count quantifies the overall local LD strength centered at each SNP. Local minima in this moving box count were used to split the genome into approximately independent genomic segments of two types – large segments of high LD interlaced with short segments of weak LD corresponding to recombination hotspot regions. Individual autoencoders were designed to span the entirety of a single high LD segment plus its adjacent upstream and downstream weak LD regions. Thus, adjacent autoencoders overlap at their weak LD ends. If an independent genomic segment exceeded the threshold number of SNPs amenable to deep learning given GPU memory limitations, internal local minima within the high LD regions were used to split the genomic segments further to a maximum of 6000 SNPs per autoencoder. Any remaining genomic segments still exceeding 6000 SNPs were further split into 6000 SNP segments with large overlaps of 2500 SNPs given the high degree of informative LD split across these regions. This tiling process resulted in 256 genomic segments: 188 independent LD segments, 32 high LD segments resulting from internal local minima splits, and 36 segments further split due to GPU memory limitations.
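The box-count segmentation can be sketched on a toy correlation matrix. Several details are assumptions: the threshold is applied to the absolute correlation, the box count at a SNP counts thresholded pairs (a, b) within the window that span the SNP (a ≤ i ≤ b), and plateaus are resolved by taking their right-most point as the minimum. The weak-LD overlaps and the 6000-SNP cap described above are omitted, so this is not the authors' adaptation of LDetect.

```python
from typing import List, Sequence, Tuple


def box_counts(corr: Sequence[Sequence[float]], window: int = 500,
               threshold: float = 0.45) -> List[int]:
    """Count thresholded SNP pairs within +/- window that span each index SNP."""
    n = len(corr)
    counts = []
    for i in range(n):
        lo, hi = max(0, i - window), min(n - 1, i + window)
        c = 0
        for a in range(lo, i + 1):
            for b in range(max(i, a + 1), hi + 1):
                if abs(corr[a][b]) >= threshold:
                    c += 1
        counts.append(c)
    return counts


def local_minima(counts: Sequence[int]) -> List[int]:
    """Interior indices where the box count stops falling and starts rising."""
    return [i for i in range(1, len(counts) - 1)
            if counts[i] <= counts[i - 1] and counts[i] < counts[i + 1]]


def split_segments(n_snps: int, minima: Sequence[int]) -> List[Tuple[int, int]]:
    """Half-open [start, end) segments bounded by the detected minima."""
    bounds = [0] + list(minima) + [n_snps]
    return list(zip(bounds[:-1], bounds[1:]))


# Toy example: two blocks of four perfectly correlated SNPs, no LD between blocks.
toy = [[1.0 if a // 4 == b // 4 else 0.0 for b in range(8)] for a in range(8)]
toy_counts = box_counts(toy, window=3)
toy_segments = split_segments(8, local_minima(toy_counts))
```

In the toy matrix the box count dips at the boundary between the two blocks, so the split lands exactly where LD breaks down, mirroring how recombination hotspots delimit the real segments.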

Hyperparameter Initialization and Grid Search

We first used a random grid search approach to define initial hyperparameter combinations producing generally accurate genotype imputation results. The hyperparameters and their potential starting values are listed in Table 1. This coarse-grain grid search was performed on all genomic segments of chromosome 22 (256 genomic segments), each tested with 100 randomly selected hyperparameter combinations per genomic segment, with a batch size of 256 samples, training for 500 epochs without any stop criteria, and validating on an independent dataset (ARIC). To evaluate the performance of each hyperparameter combination, we calculated the average coefficient of determination (r-squared) comparing the predicted and observed alternative allele dosages per variant. Concordance and F1-score were also calculated to screen for anomalies but were not ultimately used for model selection.
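The per-variant accuracy metric can be sketched as below, assuming (as is conventional for imputation benchmarks) that r-squared is the squared Pearson correlation between predicted and observed alternative allele dosages across samples. Skipping variants with no variance in either vector is a further assumption; the text does not say how such variants are handled.

```python
from typing import List, Optional, Sequence


def r_squared(predicted: Sequence[float], observed: Sequence[float]) -> Optional[float]:
    """Squared Pearson correlation across samples for one variant.

    Returns None when either vector is constant, because the correlation
    is undefined in that case.
    """
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    if var_p == 0.0 or var_o == 0.0:
        return None
    return (cov * cov) / (var_p * var_o)


def mean_r_squared(predicted_by_variant: Sequence[Sequence[float]],
                   observed_by_variant: Sequence[Sequence[float]]) -> Optional[float]:
    """Average r-squared over all variants where it is defined."""
    values: List[float] = []
    for pred, obs in zip(predicted_by_variant, observed_by_variant):
        r2 = r_squared(pred, obs)
        if r2 is not None:
            values.append(r2)
    return sum(values) / len(values) if values else None
```

Unlike concordance, this metric is not inflated by rare variants where predicting the reference allele for everyone is almost always correct, which is one reason it suits model selection here.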

Table 1.

Description and values of hyperparameters tested in grid search. λ1: scaling factor for Least Absolute Shrinkage and Selection Operator (LASSO or L1) regularization; λ2: scaling factor for Ridge (L2) regularization; β: scaling factor for the sparsity penalty described in equation (4); ρ: target hidden layer activation described in equation (4); Activation function type: defines how the output of a hidden neuron will be computed given a set of inputs; Learning rate: step size at each learning iteration while moving toward the minimum of the loss function; γ: amplifying factor for focal loss described in equation (2); Optimizer type: algorithms utilized to minimize the loss function and update the model weights in backpropagation [34]; Loss type: algorithms utilized to calculate the model error (equation (2)); Number of hidden layers: how many layers of artificial neurons to be implemented between input layer and output layer; Hidden layer size ratio: scaling factor to resize the next hidden layer with reference to the size of its previous layer; Learning rate decay ratio: scaling factor for updating the learning rate value every 500 epochs.

Hyperparameter Tuning

In order to avoid local optimal solutions and reduce the hyperparameter search space, we developed an ensemble-based machine learning approach (Extreme Gradient Boosting - XGBoost) to predict the expected performance (r-squared) of each hyperparameter combination per genomic segment using the results of the coarse-grid search and predictive features calculated for each genomic segment. These features include the number of variants, average recombination rate and average pairwise Pearson correlation across all SNPs, proportion of rare and common variants across multiple minor allele frequency (MAF) bins, number of principal components necessary to explain at least 90% of variance, and the total variance explained by the first 2 principal components. The observed accuracies of the coarse-grid search, numbering 25,600 training inputs, were used to predict the accuracy of 500,000 new hyperparameter combinations selected from Table 1 without training. All categorical predictors (activation function name, optimizer type, loss function type) were one-hot encoded. The model was implemented using XGBoost package v1.4.1 in Python v3.8.3 with 10-fold cross-validation and default settings.

We then ranked all hyperparameter combinations by their predicted performance and selected the top 10 candidates per genomic segment, along with the single best initially tested hyperparameter combination per genomic segment, for further consideration. All other hyperparameter combinations were discarded. Genomic segments with sub-optimal performance relative to Minimac were subjected to tuning with simulated offspring formation. For tuning, the maximum number of epochs was increased to 35,000 with an automatic stop criterion: if there is no improvement in the average loss value of the current masking/training cycle versus the previous one, training is interrupted; otherwise training continues until the maximum epoch limit is reached. Each masking/training cycle consisted of 500 epochs. Final hyperparameter selection was based on performance on the validation dataset (ARIC).
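The stop criterion can be expressed as a short control loop. The callback `run_cycle` is a hypothetical stand-in for one masking/training cycle that returns its average loss; the epoch limits are the values given in the text.

```python
from typing import Callable


def train_with_stop_criterion(
    run_cycle: Callable[[int], float],
    max_epochs: int = 35000,
    epochs_per_cycle: int = 500,
) -> int:
    """Run masking/training cycles until the average loss stops improving.

    run_cycle(cycle_index) trains for one cycle and returns its average loss.
    Returns the total number of epochs actually trained.
    """
    previous_loss = None
    epochs_done = 0
    cycle = 0
    while epochs_done < max_epochs:
        loss = run_cycle(cycle)
        epochs_done += epochs_per_cycle
        cycle += 1
        if previous_loss is not None and loss >= previous_loss:
            break  # no improvement over the previous cycle
        previous_loss = loss
    return epochs_done


# Illustrative loss trajectory: the fourth cycle is worse than the third.
example_losses = [1.00, 0.80, 0.70, 0.75, 0.60]
example_epochs = train_with_stop_criterion(lambda i: example_losses[i])
```

Comparing cycle-level averages rather than individual epochs makes the criterion less sensitive to noise within a cycle, at the cost of always training at least two full cycles.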

Performance Testing and Comparisons

Performance was compared to Minimac4 (Das et al., 2016b), Beagle5 (Browning et al., 2018), and Impute5 (Rubinacci et al., 2020) using default parameters. Population level reconstruction accuracy is quantified by measuring r-squared across multiple strata of data: per genomic segment, at whole chromosome level, and stratified across multiple minor allele frequency bins: [0.001-0.005), [0.005-0.01), [0.01-0.05), [0.05-0.1), [0.1-0.2), [0.2-0.3), [0.3-0.4), [0.4-0.5). While r-squared is our primary comparison metric, sample-level and population-level model performance is also evaluated with concordance and the F1-score. Wilcoxon rank-sum testing was used to assess the significance of the accuracy differences observed. Spearman correlations were used to evaluate the relationships between genomic segment features and observed imputation accuracy differences. Standard errors for per-variant imputation accuracy (r-squared) are equal to or less than 0.001 where not specified. Performance is reported only for the independent test datasets (Wellderly, MESA, and HGDP).
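Stratifying accuracy by minor allele frequency amounts to assigning each variant to one of the half-open bins listed above and averaging within bins. A minimal sketch, with hypothetical function names:

```python
from bisect import bisect_right
from collections import defaultdict
from typing import Dict, Optional, Sequence, Tuple

# Bin edges taken from the text; every bin is half-open: [low, high).
MAF_EDGES = [0.001, 0.005, 0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5]


def maf_bin(maf: float) -> Optional[Tuple[float, float]]:
    """Return the (low, high) bin containing maf, or None if it falls outside."""
    if maf < MAF_EDGES[0] or maf >= MAF_EDGES[-1]:
        return None
    i = bisect_right(MAF_EDGES, maf) - 1
    return (MAF_EDGES[i], MAF_EDGES[i + 1])


def mean_r2_by_bin(mafs: Sequence[float],
                   r2_values: Sequence[float]) -> Dict[Tuple[float, float], float]:
    """Average per-variant r-squared within each MAF bin."""
    grouped = defaultdict(list)
    for maf, r2 in zip(mafs, r2_values):
        b = maf_bin(maf)
        if b is not None:
            grouped[b].append(r2)
    return {b: sum(v) / len(v) for b, v in grouped.items()}
```

As written, a MAF of exactly 0.5 falls outside the last half-open bin; folding it into [0.4-0.5) would be a reasonable alternative if such variants need to be counted.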

We used the MESA cohort for inference runtime comparisons. Runtime was determined using the average and standard error of three imputation replicates. Two hardware configurations were used for the tests: 1) a low-end environment: 16-core Intel Xeon CPU (E5-2640 v2 2.00GHz), 250GB RAM, and one GPU (NVIDIA GTX 1080); 2) a high-end environment: 24-Core AMD CPU (EPYC 7352 2.3GHz), 250GB RAM, using one NVIDIA A100 GPU. We report computation time only, input/output (I/O) reading/writing times are excluded as separately optimized functions.

Data availability

The data that support the findings of this study are available from dbGAP and European Genome-phenome Archive (EGA), but restrictions apply to the availability of these data, which were used under ethics approval for the current study, and so are not openly available to the public. The computational pipeline for autoencoder training and validation is available at https://github.com/TorkamaniLab/Imputation_Autoencoder/tree/master/autoencoder_tuning_pipeline. The python script for calculating imputation accuracy is available at https://github.com/TorkamaniLab/imputation_accuracy_calculator.

Results

Untuned Performance and Model Optimization

A preliminary comparison of the best performing autoencoder per genomic segment vs HMM-based imputation was made after the initial grid search (Minimac4: Figure 2, Beagle5 and Impute5: Supplemental Figures S1-S2). Untuned autoencoder performance was equivalent or inferior to all tested HMM-based methods, except on the European ancestry-rich Wellderly dataset when masked using the Affymetrix 6.0 and UKB Axiom marker sets (but not the Omni 1.5M markers). HMM-based imputation was consistently superior across the more ancestrally diverse test datasets (MESA and HGDP) (two proportion test, p ≤ 8.77×10⁻⁶). Overall, when performance across genomic segments, test datasets, and test array marker sets was combined, the autoencoders exhibited an average r-squared per variant of 0.352±0.008 in reconstruction of WGS ground truth genotypes versus an average r-squared per variant of 0.374±0.007, 0.364±0.007, and 0.357±0.007 for HMM-based imputation methods (Minimac4, Beagle5, and Impute5, respectively) (Table 2). This difference was statistically significant only relative to Minimac4 (Minimac4: Wilcoxon rank-sum test p=0.037, Beagle5 and Impute5: p≥0.66).

Figure 2. HMM-based (y-axis) versus autoencoder-based (x-axis) imputation accuracy prior to tuning.

Minimac4 and untuned autoencoders were tested across three independent datasets - MESA (top), Wellderly (middle), and HGDP (bottom) - and across three genotyping array platforms - Affymetrix 6.0 (left), UKB Axiom (middle), Omni1.5M (right). Each data point represents the imputation accuracy (average r-squared per variant) for an individual genomic segment relative to its WGS-based ground truth. The numerical values presented on the left side and below the identity line (dashed line) indicate the number of genomic segments in which Minimac4 outperformed the untuned autoencoder (left of identity line) and the number of genomic segments in which the untuned autoencoder surpassed Minimac4 (below the identity line). Statistical significance was assessed through two-proportion Z-test p-values.
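For readers unfamiliar with the test named in the caption, a pooled two-proportion Z-test can be sketched as below. How the authors set up the two proportions is not stated; comparing the fraction of genomic segments won by each method is one plausible reading, and the counts used in the example are purely illustrative rather than taken from the figure.

```python
import math
from typing import Tuple


def two_proportion_z_test(k1: int, n1: int, k2: int, n2: int) -> Tuple[float, float]:
    """Pooled two-proportion Z-test.

    k1/n1 and k2/n2 are the two observed proportions.
    Returns (z statistic, two-sided p-value).
    """
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    if se == 0.0:
        raise ValueError("test is undefined when the pooled proportion is 0 or 1")
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail
    return z, p_value


# Illustrative counts only: one method wins 160 of 256 segments, the other 96.
z_stat, p_val = two_proportion_z_test(160, 256, 96, 256)
```

The test relies on a normal approximation, which is reasonable at the scale of a few hundred segments but would be less so for small counts.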

Table 2.

Performance comparisons between untuned autoencoder (AE) and HMM-based imputation tools (Minimac4, Beagle5, and Impute5). Average r-squared per variant was extracted from each genomic segment of chromosome 22. We applied Wilcoxon rank-sum tests to compare the HMM-based tools to the reference untuned autoencoder (AE). * represents p-values ≤ 0.05, ** indicates p-values ≤ 0.001, and *** indicates p-values ≤ 0.0001.

In order to understand the relationship between genomic segment features, hyperparameter values, and imputation performance, we calculated predictive features (see Methods) for each genomic segment and determined their Spearman correlation with the differences in r-squared observed for the autoencoder vs Minimac4 (Supplemental Figure S3). We observed that the autoencoder had superior performance when applied to the genomic segments with the most complex LD structures: those with larger numbers of observed unique haplotypes, unique diplotypes, and heterozygosity, as well as high average MAF and low average pairwise Pearson correlation across all SNPs (average LD) (Spearman ρ ≥ 0.22, p ≤ 9.8×10⁻⁴). Similarly, we quantified genomic segment complexity by the proportion of variance explained by the first two principal components as well as the number of principal components needed to explain at least 90% of the variance of HRC genotypes from each genomic segment. Concordantly, superior autoencoder performance was associated with a low proportion of variance explained by the first two components and positively correlated with the number of components required to explain 90% of variance (Spearman ρ ≥ 0.22, p ≤ 8.3×10⁻⁴). These observations informed our tuning strategy.

We then used the genomic features significantly correlated with imputation performance to predict the performance of, and select, the hyperparameter values to advance to fine-tuning. An ensemble model inference approach was able to predict the genomic segment-specific performance of hyperparameter combinations with high accuracy (Supplemental Figure S4, mean r-squared = 0.935±0.002 of predicted vs observed autoencoder accuracies via 10-fold cross validation). The top 10 best performing hyperparameter combinations were advanced to fine-tuning (Table 3). Autoencoder tuning with simulated offspring formation was then executed as described in Methods.

Table 3.

Top 10 best performing hyperparameter combinations that advanced to fine-tuning. See Methods and Table 1 for a detailed description of the hyperparameters.

Tuned Performance

After tuning, autoencoder performance surpassed HMM-based imputation performance across all imputation methods, independent test datasets, and genotyping array marker sets. At a minimum, autoencoders surpassed HMM-based imputation performance in >62% of chromosome 22 genomic segments (two proportion test p=1.02×10⁻¹¹) (Minimac4: Figure 3, Beagle5 and Impute5: Supplemental Figures S5-S6). Overall, the optimized autoencoders exhibited superior performance with an average r-squared of 0.395±0.007 vs 0.374±0.007 for Minimac4 (Wilcoxon rank-sum test p=0.007), 0.364±0.007 for Beagle5 (Wilcoxon rank-sum test p=1.53×10⁻⁴), and 0.358±0.007 for Impute5 (Wilcoxon rank-sum test p=2.01×10⁻⁵) (Table 4). This superiority was robust to the marker sets tested, with the mean r-squared per genomic segment for autoencoders being 0.373±0.008, 0.399±0.007, and 0.414±0.008 versus 0.352±0.008, 0.370±0.006, and 0.400±0.007 for Minimac4 using Affymetrix 6.0, UKB Axiom, and Omni 1.5M marker sets (Wilcoxon rank-sum test p=0.029, 1.99×10⁻⁴, and 0.087, respectively). Detailed comparisons to Beagle5 and Impute5 are presented in Supplemental Figures S5-S6.

Figure 3. HMM-based (y-axis) versus autoencoder-based (x-axis) imputation accuracy after tuning.

Minimac4 and tuned autoencoders were validated across three independent datasets - MESA (top), Wellderly (middle), and HGDP (bottom) - and across three genotyping array platforms - Affymetrix 6.0 (left), UKB Axiom (middle), Omni1.5M (right). Each data point represents the imputation accuracy (average r-squared per variant) for an individual genomic segment relative to its WGS-based ground truth. The numerical values presented on the left side and below the identity line (dashed line) indicate the number of genomic segments in which Minimac4 outperformed the tuned autoencoder (left of identity line) and the number of genomic segments in which the tuned autoencoder surpassed Minimac4 (below the identity line). Statistical significance was assessed through two-proportion Z-test p-values.

Table 4.

Performance comparisons between tuned autoencoder (AE) and HMM-based imputation tools (Minimac4, Beagle5, and Impute5). Average r-squared per variant was extracted from each genomic segment of chromosome 22. We applied Wilcoxon rank-sum tests to compare the HMM-based tools to the reference tuned autoencoder (AE). * represents p-values ≤ 0.05, ** indicates p-values ≤ 0.001, and *** indicates p-values ≤ 0.0001.

Tuning improved performance of the autoencoders across all genomic segments, generally increasing the advantage of autoencoders over HMM-based approaches in genomic segments with complex haplotype structures while equalizing performance relative to HMM-based approaches in genomic segments with simpler LD structures (as described in Methods, by the number of unique haplotypes: Supplemental Figure S7, diplotypes: Supplemental Figure S8, average pairwise LD: Supplemental Figure S9, proportion of variance explained: Supplemental Figure S10). Concordantly, genomic segments with higher recombination rates exhibited the largest degree of improvement with tuning (Supplemental Figure S11). Use of the augmented reference panel did not improve HMM-based imputation, having no significant influence on Minimac4 performance (original overall r-squared of 0.374±0.007 versus 0.363±0.007 after augmentation, Wilcoxon rank-sum test p=0.0917), and significantly degrading performance of Beagle5 and Impute5 (original r-squared of 0.364±0.007 and 0.358±0.007 versus 0.349±0.006 and 0.324±0.007 after augmentation, p=0.026 and p=1.26×10⁻⁴, respectively). Summary statistics for these comparisons are available in Supplemental Table S1.

Overall Chromosome 22 Imputation Accuracy

After merging the results from all genomic segments, the whole-chromosome accuracy of autoencoder-based imputation remained superior to all HMM-based imputation tools, across all independent test datasets and all genotyping array marker sets (Wilcoxon rank-sum test p≤5.55×10⁻⁶⁷). The autoencoder’s mean r-squared per variant ranged from 0.363 for HGDP to 0.605 for Wellderly vs 0.340 to 0.557 for Minimac4, 0.326 to 0.549 for Beagle5, and 0.314 to 0.547 for Impute5, respectively. Detailed comparisons are presented in Table 5 and Supplemental Table S2.

Table 5. Whole chromosome level comparisons between autoencoder (AE) and HMM-based imputation tools (Minimac4, Beagle5, and Impute5). Average r-squared per variant was extracted at whole chromosome level. We applied Wilcoxon rank-sum tests to compare the HMM-based tools to the reference tuned autoencoder (AE). * represents p-values ≤ 0.05, ** indicates p-values ≤ 0.001, and *** indicates p-values ≤ 0.0001. Standard errors that are equal to or less than 0.001 are not shown.

Further, when imputation accuracy is stratified by MAF bins, the autoencoders maintain superiority across all MAF bins for nearly all test datasets and genotyping array marker sets (Figure 4 and Supplemental Table S3). Concordantly, autoencoder imputation accuracy is similarly superior when measured with F1-scores (Supplemental Figure S12) and concordance (Supplemental Figure S13), though these metrics are less sensitive at capturing differences in rare variant imputation accuracy.

Figure 4. HMM-based versus autoencoder-based imputation accuracy across MAF bins.

Autoencoder-based (red) and HMM-based (Minimac4 (blue), Beagle5 (green), and Impute5 (purple)) imputation accuracy was validated across three independent datasets - MESA (top), Wellderly (middle), and HGDP (bottom) - and across three genotyping array platforms – Affymetrix 6.0 (left), UKB Axiom (middle), Omni1.5M (right). Each data point represents the imputation accuracy (average r-squared per variant) relative to WGS-based ground truth across MAF bins. Error bars represent standard errors. We applied Wilcoxon rank-sum tests to compare the HMM-based tools to the tuned autoencoder (AE). * represents p-values ≤ 0.05, ** indicates p-values ≤ 0.001, and *** indicates p-values ≤ 0.0001, ns represents non-significant p-values.
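The MAF-stratified accuracy shown here can be sketched as below. The sketch assumes that per-variant r-squared is the squared Pearson correlation between true genotypes (0/1/2) and imputed dosages, and that MAF is computed from the ground-truth genotypes; neither detail is specified in this section. The bin edges, data, and function names are hypothetical.

```python
from collections import defaultdict

def pearson_r2(truth, imputed):
    """Squared Pearson correlation; None if either vector has zero variance."""
    n = len(truth)
    mean_t, mean_i = sum(truth) / n, sum(imputed) / n
    cov = sum((t - mean_t) * (d - mean_i) for t, d in zip(truth, imputed))
    var_t = sum((t - mean_t) ** 2 for t in truth)
    var_i = sum((d - mean_i) ** 2 for d in imputed)
    if var_t == 0 or var_i == 0:
        return None  # undefined for monomorphic variants
    return cov * cov / (var_t * var_i)

def maf(genotypes):
    """Minor allele frequency from diploid genotypes coded 0/1/2."""
    af = sum(genotypes) / (2 * len(genotypes))
    return min(af, 1 - af)

def assign_bin(value, edges):
    """Return the (low, high] bin containing value, or None if out of range."""
    for low, high in zip(edges[:-1], edges[1:]):
        if low < value <= high:
            return (low, high)
    return None

def mean_r2_by_maf_bin(truth_by_variant, imputed_by_variant, edges):
    """Average per-variant r-squared within each MAF bin."""
    per_bin = defaultdict(list)
    for truth, imputed in zip(truth_by_variant, imputed_by_variant):
        r2 = pearson_r2(truth, imputed)
        maf_bin = assign_bin(maf(truth), edges)
        if r2 is not None and maf_bin is not None:
            per_bin[maf_bin].append(r2)
    return {b: sum(v) / len(v) for b, v in per_bin.items()}

# Three made-up variants genotyped in four samples (illustrative only)
truth = [[0, 0, 0, 1], [0, 1, 1, 2], [0, 0, 1, 1]]
imputed = [[0.1, 0.0, 0.2, 0.9], [0.0, 1.0, 1.0, 2.0], [0.0, 0.0, 1.0, 1.0]]
edges = [0.0, 0.2, 0.5]  # hypothetical MAF bin edges
result = mean_r2_by_maf_bin(truth, imputed, edges)
print(result)
```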

Ancestry-Specific Chromosome 22 Imputation Accuracy

Finally, we evaluated ancestry-specific imputation accuracy. As before, autoencoder-based imputation maintains overall superiority across all continental populations present in MESA (Figure 5, Wilcoxon rank-sum test p=5.39×10⁻¹⁹). The autoencoders’ mean r-squared ranged from 0.357 for African ancestry to 0.614 for East Asian ancestry vs 0.328 to 0.593 for Minimac4, 0.330 to 0.544 for Beagle5, and 0.324 to 0.586 for Impute5, respectively. Note that East Asian ancestry exhibits a slightly higher overall imputation accuracy relative to European ancestry due to improved rare variant imputation. Autoencoder superiority replicates when HGDP is split into continental populations (Supplemental Figure S14).

Further stratification of ancestry-specific imputation accuracy results by MAF continues to support autoencoder superiority across all ancestries and MAF bins, and across nearly all test datasets and genotyping array marker sets (Figure 5, Supplemental Figure S14). Minimum and maximum accuracies across MAF by ancestry bins ranged from 0.177 to 0.937 for the autoencoder, 0.132 to 0.907 for Minimac4, 0.147 to 0.909 for Beagle5, and 0.115 to 0.903 for Impute5, with a maximum standard error of ±0.004.

Thus, autoencoder performance was superior across all variant allele frequencies and ancestries, with the primary source of superiority arising from hard-to-impute regions with complex LD structures.

Inference Speed

Inference runtimes for the autoencoder vs HMM-based methods were compared in a low-end and a high-end computational environment, as described in Methods. In the low-end environment, the autoencoder’s inference time is at least ∼4X faster than all HMM-based inference times: summing the inference times from all genomic segments of chromosome 22, the inference time for the autoencoder was 2.4±1.1×10⁻³ seconds versus 1,754±3.2, 583.3±0.01, and 8.4±4.3×10⁻³ seconds for Minimac4, Beagle5, and Impute5, respectively (Figure 6A). In the high-end environment, this difference narrows to a ∼3X advantage of the autoencoder vs HMM-based methods (2.1±8.0×10⁻⁴ versus 374.3±1.2, 414.3±0.01, and 6.1±2.1×10⁻⁴ seconds for Minimac4, Beagle5, and Impute5, respectively; Figure 6B). These unoptimized results indicate that autoencoder-based imputation can be executed rapidly, without a reference cohort, and without the need for a high-end server or high-performance computing (HPC) infrastructure.
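The fold differences quoted above follow directly from the reported mean runtimes. As a worked check, the short sketch below divides each HMM-based runtime by the autoencoder runtime in the same environment; the dictionary layout and the function name `speedups` are illustrative choices, and standard errors are ignored.

```python
# Mean inference times in seconds, as reported in the text (standard errors omitted)
runtimes = {
    "low-end": {"autoencoder": 2.4, "Minimac4": 1754.0, "Beagle5": 583.3, "Impute5": 8.4},
    "high-end": {"autoencoder": 2.1, "Minimac4": 374.3, "Beagle5": 414.3, "Impute5": 6.1},
}

def speedups(times):
    """Ratio of each HMM-based runtime to the autoencoder runtime."""
    ae = times["autoencoder"]
    return {tool: t / ae for tool, t in times.items() if tool != "autoencoder"}

low = speedups(runtimes["low-end"])
high = speedups(runtimes["high-end"])
for env, ratios in (("low-end", low), ("high-end", high)):
    for tool, ratio in ratios.items():
        print(f"{env}: autoencoder is {ratio:.1f}x faster than {tool}")
```

The smallest ratios, 3.5 in the low-end environment and about 2.9 in the high-end environment (both against Impute5), correspond to the ∼4X and ∼3X figures in the text.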

Figure 6. HMM-based versus autoencoder-based inference runtimes.

We plot the average time and standard error of three imputation replicates. Two hardware configurations were used for the tests: A) a low-end environment: 16-core Intel Xeon CPU (E5-2640 v2 2.00GHz), 250GB RAM, and one GPU (NVIDIA GTX 1080); B) a high-end environment: 24-Core AMD CPU (EPYC 7352 2.3GHz), 250GB RAM, using one NVIDIA A100 GPU.

Figure 5. Autoencoder-based (red) and HMM-based (Minimac4 (blue), Beagle5 (green), and Impute5 (purple)) imputation accuracy was validated across individuals of diverse ancestry from the MESA cohort (EUR: European (top); EAS: East Asian (2nd row); AMR: Native American (3rd row); AFR: African (bottom)) and multiple genotype array platforms (Affymetrix 6.0 (left), UKB Axiom (middle), Omni1.5M (right)). Each data point represents the imputation accuracy (average r-squared per variant) relative to WGS-based ground truth across MAF bins. Error bars represent standard errors. We applied Wilcoxon rank-sum tests to compare the HMM-based tools to the tuned autoencoder (AE). * represents p-values ≤ 0.05, ** indicates p-values ≤ 0.001, and *** indicates p-values ≤ 0.0001, ns represents non-significant p-values.

Discussion

Artificial neural network-based data mining techniques are revolutionizing biomedical informatics and analytics (Dias and Torkamani, 2019; Jumper et al., 2021). Here, we have demonstrated the potential for these techniques to execute a fundamental analytical task in population genetics, genotype imputation, producing superior results in a computationally efficient and portable framework. The trained autoencoders can be transferred easily and execute their functions rapidly, even in modest computing environments, obviating the need to transfer private genotype data to external imputation servers or services. Furthermore, our fully trained autoencoders robustly surpass the performance of all modern HMM-based imputation approaches across all tested independent datasets, genotyping array marker sets, minor allele frequency spectra, and diverse ancestry groups. This superiority was most apparent in genomic regions with low LD and/or high complexity in their linkage disequilibrium structure.

Superior imputation accuracy is expected to improve GWAS power, enable more complete coverage in meta-analyses, and improve causal variant identification through fine-mapping. Moreover, superior imputation accuracy in low LD regions may enable more accurate interrogation of specific classes of genes that are under a greater degree of selective pressure and involved in environmental sensing. For example, promoter regions of genes associated with inflammatory immune responses, response to pathogens, environmental sensing, and neurophysiological processes (including sensory perception genes) are often located in regions of low LD (Dias and Torkamani, 2019; Frazer et al., 2007). These are known disease-associated biological processes that are critical to interrogate accurately in GWAS. Thus, the autoencoder-based imputation approach improves both the statistical power and the biological coverage of individual GWAS and downstream meta-analyses.

HMM-based imputation tools depend on large reference panels or datasets to impute a single genome, whereas pre-trained autoencoder models eliminate that dependency. However, further development is required before this approach is ready for broad adoption in practice. Autoencoders must be pre-trained and validated across all segments of the human genome; here, we performed training only for chromosome 22. Autoencoder training is computationally intensive, shifting the computational burden to model trainers and driving performance gains for end-users. As a result, inference time scales only with the number of variants to be imputed, whereas HMM-based inference time depends on both the reference panel and the number of variants to be imputed. This allows autoencoder-based imputation to extend to millions of genomes but introduces some challenges in the continuous re-training and fine-tuning of the pre-trained models as larger reference panels are made available. In addition, our current encoding approach does not use phasing information, even though phasing leads to substantial improvements in imputation accuracy. Future models will need to address the need for phasing and continuous fine-tuning of models for application to modern, ever-growing genomic datasets.

Ideas and Speculation

After expanding this approach across the whole genome, our work will provide a more efficient genotype imputation platform at whole-genome scale, which should benefit genome-wide association studies and clinical applications in precision medicine. In addition to the speed, cost, and accuracy benefits, our proposed approach can potentially improve automation for downstream analyses. The autoencoder naturally generates a hidden encoding with latent features representative of the original data. This latent representation acts as an automatic feature extraction and dimensionality reduction technique for downstream tasks such as genetic risk prediction. Moreover, the autoencoder-based imputation approach requires a reference panel only during training; only the neural network needs to be distributed for implementation. Thus, the neural network is portable and avoids privacy issues associated with standard statistical imputation. This privacy-preserving feature will allow developers to deploy real-time data-driven algorithms on personal devices (edge computing). These new features will expand the clinical applications of genomic imputation, as well as its role in preventive healthcare.

Competing interests

The authors declare no competing interests.

Supplemental Figures

Fig. S1. Beagle5 (y-axis) versus autoencoder-based (x-axis) imputation accuracy prior to tuning.

Beagle5 and untuned autoencoders were tested across three independent datasets - MESA (top), Wellderly (middle), and HGDP (bottom) - and across three genotyping array platforms - Affymetrix 6.0 (left), UKB Axiom (middle), Omni1.5M (right). Each data point represents the imputation accuracy (average r-squared per variant) for an individual genomic segment relative to its WGS-based ground truth. The numerical values presented on the left side and below the identity line (dashed line) indicate the number of genomic segments in which Beagle5 outperformed the untuned autoencoder (left of identity line) and the number of genomic segments in which the untuned autoencoder surpassed Beagle5 (below the identity line). Statistical significance was assessed through two-proportion Z-test p-values.

Fig. S2. Impute5 (y-axis) versus autoencoder-based (x-axis) imputation accuracy prior to tuning.

Impute5 and untuned autoencoders were tested across three independent datasets - MESA (top), Wellderly (middle), and HGDP (bottom) - and across three genotyping array platforms – Affymetrix 6.0 (left), UKB Axiom (middle), Omni1.5M (right). Each data point represents the imputation accuracy (average r-squared per variant) for an individual genomic segment relative to its WGS-based ground truth. The numerical values presented on the left side and below the identity line (dashed line) indicate the number of genomic segments in which Impute5 outperformed the untuned autoencoder (left of identity line) and the number of genomic segments in which the untuned autoencoder surpassed Impute5 (below the identity line). Statistical significance was assessed through two-proportion Z-test p-values.

Fig. S3. Relationship between genomic segment features and autoencoder performance.

Spearman correlations (ρ) between genomic segment features and autoencoder performance metrics are presented. An “X” denotes Spearman correlations that are not statistically significant (p>0.05). The performance metrics include the difference in mean validation accuracy between the autoencoder and Minimac4 (R2_AE_MINUS_MINIMAC), the autoencoder’s improvement in accuracy observed after offspring formation (AE_IMPROVEMENT_SIM), and the autoencoder’s improvement in accuracy after fine-tuning of hyperparameters (AE_IMPROVEMENT_TUNING). The genomic features include the total number of variants per genomic segment in HRC (NVAR_HRC), proportion of rare variants at MAF≤0.5% threshold (RARE_VAR_PROP), proportion of common variants at MAF>0.5% threshold (COMMON_VAR_PROP), number of components needed to explain at least 90% of variance after running Principal Component Analysis (NCOMP), proportion of heterozygous genotypes (PROP_HET), proportion of unique haplotypes (PROP_UNIQUE_HAP) and diplotypes (PROP_UNIQUE_DIP), sum of ratios of explained variance from the first two (EXP_RATIO_C1_C2) and three (EXP_RATIO_C1_C2_C3) components from Principal Component Analysis, recombination rate per variant (REC_PER_SITE), mean pairwise correlation across all variants in each genomic segment (MEAN_LD), mean MAF (MEAN_MAF), GC content of reference alleles (GC_CONT_REF), and GC content of alternate alleles (GC_CONT_ALT).
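A Spearman correlation between one genomic feature and one performance metric can be computed as the Pearson correlation of average ranks, as sketched below with the standard library only. The feature and metric names echo the labels in this caption, but the values are made up and the helper functions are hypothetical; significance testing is omitted.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Made-up per-segment values (illustrative only)
mean_ld = [0.10, 0.25, 0.40, 0.55, 0.70]              # MEAN_LD
r2_ae_minus_minimac = [0.05, 0.04, 0.02, 0.03, 0.01]  # R2_AE_MINUS_MINIMAC
rho = spearman_rho(mean_ld, r2_ae_minus_minimac)
print(f"Spearman rho = {rho:.2f}")
```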

Fig. S4. Projecting autoencoder performance from hyperparameters and genomic features.

We developed an ensemble-based machine learning approach (Extreme Gradient Boosting - XGBoost) to predict the expected performance (r-squared) of each hyperparameter combination per genomic segment using the results of the coarse-grid search and predictive features calculated for each genomic segment (see Methods). We plot the observed accuracy of trained autoencoders versus the accuracy predicted by the XGBoost model after 10-fold cross-validation. Each subplot shows one iteration of the 10-fold validation process and its respective Pearson correlation between the predicted and observed accuracy values.

Fig. S5. Beagle5 (y-axis) versus autoencoder-based (x-axis) imputation accuracy after tuning.

Beagle5 and tuned autoencoders were validated across three independent datasets - MESA (top), Wellderly (middle), and HGDP (bottom) - and across three genotyping array platforms – Affymetrix 6.0 (left), UKB Axiom (middle), Omni1.5M (right). Each data point represents the imputation accuracy (average r-squared per variant) for an individual genomic segment relative to its WGS-based ground truth. The numerical values presented on the left side and below the identity line (dashed line) indicate the number of genomic segments in which Beagle5 outperformed the tuned autoencoder (left of identity line) and the number of genomic segments in which the tuned autoencoder surpassed Beagle5 (below the identity line). Statistical significance was assessed through two-proportion Z-test p-values.

Fig. S6. Impute5 (y-axis) versus autoencoder-based (x-axis) imputation accuracy after tuning.

Impute5 and tuned autoencoders were validated across three independent datasets - MESA (top), Wellderly (middle), and HGDP (bottom) - and across three genotyping array platforms – Affymetrix 6.0 (left), UKB Axiom (middle), Omni1.5M (right). Each data point represents the imputation accuracy (average r-squared per variant) for an individual genomic segment relative to its WGS-based ground truth. The numerical values presented on the left side and below the identity line (dashed line) indicate the number of genomic segments in which Impute5 outperformed the tuned autoencoder (left of identity line) and the number of genomic segments in which the tuned autoencoder surpassed Impute5 (below the identity line). Statistical significance was assessed through two-proportion Z-test p-values.

Fig. S7. Imputation accuracy as a function of unique haplotype abundance.

Minimac4 and tuned and untuned autoencoders (AE) were tested across three independent datasets - MESA (top), Wellderly (middle), and HGDP (bottom) - and across three genotyping array platforms - Affymetrix 6.0 (left), UKB Axiom (middle), Omni1.5M (right). “Many” vs “Few” haplotypes are defined by splitting genomic segments into those with greater than vs less than the median number of unique haplotypes per genomic segment. We applied Wilcoxon rank-sum tests to compare the untuned and tuned autoencoder to Minimac4. The validation datasets consist of: A) MESA Affymetrix 6.0; B) MESA UKB Axiom; C) MESA Omni 1.5M; D) Wellderly Affymetrix 6.0; E) Wellderly UKB Axiom; F) Wellderly Omni 1.5M; G) HGDP Affymetrix 6.0; H) HGDP UKB Axiom; I) HGDP Omni 1.5M.
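The median split described in this caption can be sketched as follows. Segments whose feature value equals the median are left out of both groups, which is one literal reading of "greater than vs less than the median" and therefore an assumption. The field names (`n_unique_hap`, `r2_ae`, `r2_minimac`), the values, and the helper functions are hypothetical.

```python
from statistics import mean, median

def median_split(segments, feature):
    """Split segments into those above and below the median of one feature."""
    cutoff = median(s[feature] for s in segments)
    many = [s for s in segments if s[feature] > cutoff]
    few = [s for s in segments if s[feature] < cutoff]
    return cutoff, many, few

def mean_advantage(group):
    """Mean autoencoder-minus-Minimac4 r-squared difference in a group."""
    return mean(s["r2_ae"] - s["r2_minimac"] for s in group)

# Made-up genomic segments (illustrative only)
segments = [
    {"n_unique_hap": 10, "r2_ae": 0.50, "r2_minimac": 0.50},
    {"n_unique_hap": 20, "r2_ae": 0.48, "r2_minimac": 0.47},
    {"n_unique_hap": 30, "r2_ae": 0.35, "r2_minimac": 0.30},
    {"n_unique_hap": 40, "r2_ae": 0.33, "r2_minimac": 0.29},
]

cutoff, many, few = median_split(segments, "n_unique_hap")
adv_many = mean_advantage(many)
adv_few = mean_advantage(few)
print(f"median = {cutoff}; advantage (Many) = {adv_many:.3f}; advantage (Few) = {adv_few:.3f}")
```

The same split applies to the other features used in Supplemental Figures S8 to S11 by changing the `feature` argument.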

Fig. S8. Imputation accuracy as a function of unique diplotype abundance.

Minimac4 and tuned and untuned autoencoders (AE) were tested across three independent datasets - MESA (top), Wellderly (middle), and HGDP (bottom) - and across three genotyping array platforms - Affymetrix 6.0 (left), UKB Axiom (middle), Omni1.5M (right). “Many” vs “Few” diplotypes are defined by splitting genomic segments into those with greater than vs less than the median number of unique diplotypes per genomic segment. We applied Wilcoxon rank-sum tests to compare the untuned and tuned autoencoder to Minimac4. The validation datasets consist of: A) MESA Affymetrix 6.0; B) MESA UKB Axiom; C) MESA Omni 1.5M; D) Wellderly Affymetrix 6.0; E) Wellderly UKB Axiom; F) Wellderly Omni 1.5M; G) HGDP Affymetrix 6.0; H) HGDP UKB Axiom; I) HGDP Omni 1.5M.

Fig. S9. Imputation accuracy as a function of linkage disequilibrium (LD).

Minimac4 and tuned and untuned autoencoders (AE) were tested across three independent datasets - MESA (top), Wellderly (middle), and HGDP (bottom) - and across three genotyping array platforms – Affymetrix 6.0 (left), UKB Axiom (middle), Omni1.5M (right). “High” vs “Low” LD is defined by splitting genomic segments into those with greater than vs less than the average pairwise LD strength per genomic segment. We applied Wilcoxon rank-sum tests to compare the untuned and tuned autoencoder to Minimac4. The validation datasets consist of: A) MESA Affymetrix 6.0; B) MESA UKB Axiom; C) MESA Omni 1.5M; D) Wellderly Affymetrix 6.0; E) Wellderly UKB Axiom; F) Wellderly Omni 1.5M; G) HGDP Affymetrix 6.0; H) HGDP UKB Axiom; I) HGDP Omni 1.5M.

Fig. S10. Imputation accuracy as a function of data complexity.

Minimac4 and tuned and untuned autoencoders (AE) were tested across three independent datasets - MESA (top), Wellderly (middle), and HGDP (bottom) - and across three genotyping array platforms – Affymetrix 6.0 (left), UKB Axiom (middle), Omni1.5M (right). “High” vs “Low” data complexity is defined by splitting genomic segments into those with greater than vs less than the median proportion of variance explained by first two components of Principal Component Analysis per genomic segment (PCA C1+C2). We applied Wilcoxon rank-sum tests to compare the untuned and tuned autoencoder to Minimac4. The validation datasets consist of: A) MESA Affymetrix 6.0; B) MESA UKB Axiom; C) MESA Omni 1.5M; D) Wellderly Affymetrix 6.0; E) Wellderly UKB Axiom; F) Wellderly Omni 1.5M; G) HGDP Affymetrix 6.0; H) HGDP UKB Axiom; I) HGDP Omni 1.5M.

Fig. S11. Imputation accuracy as a function of recombination rate.

Minimac4 and tuned and untuned autoencoders (AE) were tested across three independent datasets - MESA (top), Wellderly (middle), and HGDP (bottom) - and across three genotyping array platforms – Affymetrix 6.0 (left), UKB Axiom (middle), Omni1.5M (right). “High” vs “Low” recombination rate is defined by splitting genomic segments into those with greater than vs less than the median recombination rate per variant per genomic segment. We applied Wilcoxon rank-sum tests to compare the untuned and tuned autoencoder to Minimac4. The validation datasets consist of: A) MESA Affymetrix 6.0; B) MESA UKB Axiom; C) MESA Omni 1.5M; D) Wellderly Affymetrix 6.0; E) Wellderly UKB Axiom; F) Wellderly Omni 1.5M; G) HGDP Affymetrix 6.0; H) HGDP UKB Axiom; I) HGDP Omni 1.5M.

Fig. S12. HMM-based versus autoencoder-based imputation accuracy across MAF bins (F1 score).

Autoencoder-based (red) and HMM-based (Minimac4 (blue), Beagle5 (green), and Impute5 (purple)) imputation accuracy was validated across three independent datasets - MESA (top), Wellderly (middle), and HGDP (bottom) - and across three genotyping array platforms - Affymetrix 6.0 (left), UKB Axiom (middle), Omni1.5M (right). Each data point represents the imputation accuracy (mean F1-score per variant) relative to WGS-based ground truth across MAF bins. Error bars represent standard errors. We applied Wilcoxon rank-sum tests to compare the HMM-based tools to the tuned autoencoder (AE). * represents p-values ≤ 0.05, ** indicates p-values ≤ 0.001, and *** indicates p-values ≤ 0.0001, ns represents non-significant p-values.

Fig. S13. HMM-based versus autoencoder-based imputation accuracy across MAF bins (concordance).

Autoencoder-based (red) and HMM-based (Minimac4 (blue), Beagle5 (green), and Impute5 (purple)) imputation accuracy was validated across three independent datasets - MESA (top), Wellderly (middle), and HGDP (bottom) - and across three genotyping array platforms - Affymetrix 6.0 (left), UKB Axiom (middle), Omni1.5M (right). Each data point represents the imputation accuracy (mean concordance per variant) relative to WGS-based ground truth across MAF bins. Error bars represent standard errors. We applied Wilcoxon rank-sum tests to compare the HMM-based tools to the tuned autoencoder (AE). * represents p-values ≤ 0.05, ** indicates p-values ≤ 0.001, and *** indicates p-values ≤ 0.0001, ns represents non-significant p-values.

Fig. S14. HMM-based versus autoencoder-based imputation accuracy across ancestry groups.

Autoencoder-based (red) and HMM-based (Minimac4 (blue), Beagle5 (green), and Impute5 (purple)) imputation accuracy was validated across individuals of diverse ancestry from the HGDP cohort (EUR: European (top); EAS: East Asian (2nd row); AMR: Native American (3rd row); AFR: African (bottom)) and multiple genotype array platforms (Affymetrix 6.0 (left), UKB Axiom (middle), Omni1.5M (right)). Each data point represents the imputation accuracy (average r-squared per variant) relative to WGS-based ground truth across MAF bins. Error bars represent standard errors. We applied Wilcoxon rank-sum tests to compare the HMM-based tools to the tuned autoencoder (AE). * represents p-values ≤ 0.05, ** indicates p-values ≤ 0.001, and *** indicates p-values ≤ 0.0001, ns represents non-significant p-values.

Supplemental Tables

Table S1.

Performance comparisons between tuned autoencoder (AE) and HMM-based imputation tools (Minimac4, Beagle5, and Impute5) after applying data augmentation to HMM-based tools. We applied Wilcoxon rank-sum tests to compare the HMM-based tools to the reference tuned autoencoder (AE). * represents p-values ≤ 0.05, ** indicates p-values ≤ 0.001, and *** indicates p-values ≤ 0.0001.

Table S2.

Detailed performance comparisons between tuned autoencoder (AE) and HMM-based imputation tools (Minimac4, Beagle5, and Impute5). Validation accuracies were stratified by dataset (MESA, Wellderly, HGDP) and genotype array platform (Affymetrix 6.0, UKB Axiom, Omni 1.5M). We applied Wilcoxon rank-sum tests to compare the HMM-based tools to the reference tuned autoencoder (AE). * represents p-values ≤ 0.05, ** indicates p-values ≤ 0.001, and *** indicates p-values ≤ 0.0001.

Table S3.

Detailed performance comparisons between tuned autoencoder (AE) and HMM-based imputation tools (Minimac4, Beagle5, and Impute5). Validation accuracies were stratified by dataset (MESA, Wellderly, HGDP), genotype array platform (Affymetrix 6.0, UKB Axiom, Omni 1.5M), and MAF bin. We applied Wilcoxon rank-sum tests to compare the HMM-based tools to the reference tuned autoencoder (AE). * represents p-values ≤ 0.05, ** indicates p-values ≤ 0.001, and *** indicates p-values ≤ 0.0001.

Acknowledgments

This work is supported by KL2TR002552 to RD, by R01HG010881 to AT as well as grants U24TR002306 and UL1TR002550. We would like to thank J.C. Ducom and Lisa Dong from the Scripps High Performance Computing center, as well as Fernanda Foertter, Johnny Israeli, Ohad Mosafi, and Joyjit Daw from NVIDIA for their technical support and collaboration in this project. A portion of this research was conducted using a startup account at the Summit supercomputer from Oak Ridge National Laboratory (ORNL).

References

  1. Abouzid H, Chakkor O, Reyes OG, Ventura S. 2019. Signal speech reconstruction and noise removal using convolutional denoising audioencoders with neural deep learning. Analog Integrated Circuits and Signal Processing 100:501–512. doi:10.1007/s10470-019-01446-6
  2. Alexander DH, Novembre J, Lange K. 2009. Fast model-based estimation of ancestry in unrelated individuals. Genome Research 19:1655–1664. doi:10.1101/gr.094052.109
  3. Arisdakessian C, Poirion O, Yunits B, Zhu X, Garmire LX. 2019. DeepImpute: An accurate, fast, and scalable deep neural network method to impute single-cell RNA-seq data. Genome Biology 20:211. doi:10.1186/s13059-019-1837-6
  4.
    Auton A, Abecasis GR, Altshuler DM, Durbin RM, Bentley DR, Chakravarti A, Clark AG, Donnelly P, Eichler EE, Flicek P, Gabriel SB, Gibbs RA, Green ED, Hurles ME, Knoppers BM, Korbel JO, Lander ES, Lee C, Lehrach H, Mardis ER, Marth GT, McVean GA, Nickerson DA, Schmidt JP, Sherry ST, Wang J, Wilson RK, Boerwinkle E, Doddapaneni H, Han Y, Korchina V, Kovar C, Lee S, Muzny D, Reid JG, Zhu Y, Chang Y, Feng Q, Fang X, Guo X, Jian M, Jiang H, Jin X, Lan T, Li G, Li J, Li Yingrui, Liu S, Liu Xiao, Lu Y, Ma X, Tang M, Wang B, Wang G, Wu H, Wu R, Xu X, Yin Y, Zhang D, Zhang W, Zhao J, Zhao M, Zheng X, Gupta N, Gharani N, Toji LH, Gerry NP, Resch AM, Barker J, Clarke L, Gil L, Hunt SE, Kelman G, Kulesha E, Leinonen R, McLaren WM, Radhakrishnan R, Roa A, Smirnov D, Smith RE, Streeter I, Thormann A, Toneva I, Vaughan B, Zheng-Bradley X, Grocock R, Humphray S, James T, Kingsbury Z, Sudbrak R, Albrecht MW, Amstislavskiy VS, Borodina TA, Lienhard M, Mertes F, Sultan M, Timmermann B, Yaspo ML, Fulton L, Ananiev V, Belaia Z, Beloslyudtsev D, Bouk N, Chen C, Church D, Cohen R, Cook C, Garner J, Hefferon T, Kimelman M, Liu C, Lopez J, Meric P, O’Sullivan C, Ostapchuk Y, Phan L, Ponomarov S, Schneider V, Shekhtman E, Sirotkin K, Slotta D, Zhang H, Balasubramaniam S, Burton J, Danecek P, Keane TM, Kolb-Kokocinski A, McCarthy S, Stalker J, Quail M, Davies CJ, Gollub J, Webster T, Wong B, Zhan Y, Campbell CL, Kong Y, Marcketta A, Yu F, Antunes L, Bainbridge M, Sabo A, Huang Z, Coin LJM, Fang L, Li Q, Li Z, Lin H, Liu B, Luo R, Shao H, Xie Y, Ye C, Yu C, Zhang F, Zheng H, Zhu H, Alkan C, Dal E, Kahveci F, Garrison EP, Kural D, Lee WP, Leong WF, Stromberg M, Ward AN, Wu J, Zhang M, Daly MJ, DePristo MA, Handsaker RE, Banks E, Bhatia G, del Angel G, Genovese G, Li H, Kashin S, McCarroll SA, Nemesh JC, Poplin RE, Yoon SC, Lihm J, Makarov V, Gottipati S, Keinan A, Rodriguez-Flores JL, Rausch T, Fritz MH, Stütz AM, Beal K, Datta A, Herrero J, Ritchie GRS, Zerbino D, Sabeti PC, Shlyakhter I, 
Schaffner SF, Vitti J, Cooper DN, Ball E v., Stenson PD, Barnes B, Bauer M, Cheetham RK, Cox A, Eberle M, Kahn S, Murray L, Peden J, Shaw R, Kenny EE, Batzer MA, Konkel MK, Walker JA, MacArthur DG, Lek M, Herwig R, Ding L, Koboldt DC, Larson D, Ye Kai, Gravel S, Swaroop A, Chew E, Lappalainen T, Erlich Y, Gymrek M, Willems TF, Simpson JT, Shriver MD, Rosenfeld JA, Bustamante CD, Montgomery SB, de La Vega FM, Byrnes JK, Carroll AW, DeGorter MK, Lacroute P, Maples BK, Martin AR, Moreno-Estrada A, Shringarpure SS, Zakharia F, Halperin E, Baran Y, Cerveira E, Hwang J, Malhotra A, Plewczynski D, Radew K, Romanovitch M, Zhang C, Hyland FCL, Craig DW, Christoforides A, Homer N, Izatt T, Kurdoglu AA, Sinari SA, Squire K, Xiao C, Sebat J, Antaki D, Gujral M, Noor A, Ye Kenny, Burchard EG, Hernandez RD, Gignoux CR, Haussler D, Katzman SJ, Kent WJ, Howie B, Ruiz-Linares A, Dermitzakis ET, Devine SE, Kang HM, Kidd JM, Blackwell T, Caron S, Chen W, Emery S, Fritsche L, Fuchsberger C, Jun G, Li B, Lyons R, Scheller C, Sidore C, Song S, Sliwerska E, Taliun D, Tan A, Welch R, Wing MK, Zhan X, Awadalla P, Hodgkinson A, Li Yun, Shi X, Quitadamo A, Lunter G, Marchini JL, Myers S, Churchhouse C, Delaneau O, Gupta-Hinch A, Kretzschmar W, Iqbal Z, Mathieson I, Menelaou A, Rimmer A, Xifara DK, Oleksyk TK, Fu Yunxin, Liu Xiaoming, Xiong M, Jorde L, Witherspoon D, Xing J, Browning BL, Browning SR, Hormozdiari F, Sudmant PH, Khurana E, Tyler-Smith C, Albers CA, Ayub Q, Chen Y, Colonna V, Jostins L, Walter K, Xue Y, Gerstein MB, Abyzov A, Balasubramanian S, Chen J, Clarke D, Fu Yao, Harmanci AO, Jin M, Lee D, Liu J, Mu XJ, Zhang J, Zhang Yan, Hartl C, Shakir K, Degenhardt J, Meiers S, Raeder B, Casale FP, Stegle O, Lameijer EW, Hall I, Bafna V, Michaelson J, Gardner EJ, Mills RE, Dayama G, Chen K, Fan X, Chong Z, Chen T, Chaisson MJ, Huddleston J, Malig M, Nelson BJ, Parrish NF, Blackburne B, Lindsay SJ, Ning Z, Zhang Yujun, Lam H, Sisu C, Challis D, Evani US, Lu J, Nagaswamy U, Yu J, Li W, 
Habegger L, Yu H, Cunningham F, Dunham I, Lage K, Jespersen JB, Horn H, Kim D, Desalle R, Narechania A, Sayres MAW, Mendez FL, Poznik GD, Underhill PA, Mittelman D, Banerjee R, Cerezo M, Fitzgerald TW, Louzada S, Massaia A, Yang F, Kalra D, Hale W, Dan X, Barnes KC, Beiswanger C, Cai H, Cao H, Henn B, Jones D, Kaye JS, Kent A, Kerasidou A, Mathias R, Ossorio PN, Parker M, Rotimi CN, Royal CD, Sandoval K, Su Y, Tian Z, Tishkoff S, Via M, Wang Y, Yang H, Yang L, Zhu J, Bodmer W, Bedoya G, Cai Z, Gao Y, Chu J, Peltonen L, Garcia-Montero A, Orfao A, Dutil J, Martinez-Cruzado JC, Mathias RA, Hennis A, Watson H, McKenzie C, Qadri F, LaRocque R, Deng X, Asogun D, Folarin O, Happi C, Omoniwa O, Stremlau M, Tariyal R, Jallow M, Joof FS, Corrah T, Rockett K, Kwiatkowski D, Kooner J, Hien TT, Dunstan SJ, ThuyHang N, Fonnie R, Garry R, Kanneh L, Moses L, Schieffelin J, Grant DS, Gallo C, Poletti G, Saleheen D, Rasheed A, Brooks LD, Felsenfeld AL, McEwen JE, Vaydylevich Y, Duncanson A, Dunn M, Schloss JA. 2015. A global reference for human genetic variation. Nature. doi:10.1038/nature15393
  5. Berisa T, Pickrell JK. 2016. Approximately independent linkage disequilibrium blocks in human populations. Bioinformatics 32:283–285. doi:10.1093/bioinformatics/btv546
  6. Bild DE. 2002. Multi-Ethnic Study of Atherosclerosis: Objectives and Design. American Journal of Epidemiology 156:871–881. doi:10.1093/aje/kwf113
  7. Browning BL, Browning SR. 2016. Genotype Imputation with Millions of Reference Samples. American Journal of Human Genetics 98:116–126. doi:10.1016/j.ajhg.2015.11.020
  8. Browning BL, Zhou Y, Browning SR. 2018. A One-Penny Imputed Genome from Next-Generation Reference Panels. American Journal of Human Genetics 103:338–348. doi:10.1016/j.ajhg.2018.07.015
  9. Cann HM. 2002. A Human Genome Diversity Cell Line Panel. Science 296:261b–262. doi:10.1126/science.296.5566.261b
  10. Chen J, Shi X. 2019. Sparse Convolutional Denoising Autoencoders for Genotype Imputation. Genes 10:652. doi:10.3390/genes10090652
  11. Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V, Pollard MO, Whitwham A, Keane T, McCarthy SA, Davies RM, Li H. 2021. Twelve years of SAMtools and BCFtools. GigaScience 10:1–4. doi:10.1093/gigascience/giab008
  12. Das S, Abecasis GR, Browning BL. 2018. Genotype Imputation from Large Reference Panels. Annual Review of Genomics and Human Genetics 19:73–96. doi:10.1146/annurev-genom-083117-021602
  13. Das S, Forer L, Schönherr S, Sidore C, Locke AE, Kwong A, Vrieze SI, Chew EY, Levy S, McGue M, Schlessinger D, Stambolian D, Loh PR, Iacono WG, Swaroop A, Scott LJ, Cucca F, Kronenberg F, Boehnke M, Abecasis GR, Fuchsberger C. 2016a. Next-generation genotype imputation service and methods. Nature Genetics 48:1284–1287. doi:10.1038/ng.3656
  14. Das S, Forer L, Schönherr S, Sidore C, Locke AE, Kwong A, Vrieze SI, Chew EY, Levy S, McGue M, Schlessinger D, Stambolian D, Loh PR, Iacono WG, Swaroop A, Scott LJ, Cucca F, Kronenberg F, Boehnke M, Abecasis GR, Fuchsberger C. 2016b. Next-generation genotype imputation service and methods. Nature Genetics 48:1284–1287. doi:10.1038/ng.3656
  15. Dias R, Torkamani A. 2019. Artificial intelligence in clinical and genomic diagnostics. Genome Medicine. doi:10.1186/s13073-019-0689-8
  16. Dimitromanolakis A, Xu J, Krol A, Briollais L. 2019. sim1000G: A user-friendly genetic variant simulator in R for unrelated individuals and family-based designs. BMC Bioinformatics 20:26. doi:10.1186/s12859-019-2611-1
  17. Erikson GA, Bodian DL, Rueda M, Molparia B, Scott ER, Scott-Van Zeeland AA, Topol SE, Wineinger NE, Niederhuber JE, Topol EJ, Torkamani A. 2016. Whole-Genome Sequencing of a Healthy Aging Cohort. Cell 165:1002–1011. doi:10.1016/j.cell.2016.03.022
  18.
    Frazer KA, Ballinger DG, Cox DR, Hinds DA, Stuve LL, Gibbs RA, Belmont JW, Boudreau A, Hardenbol P, Leal SM, Pasternak S, Wheeler DA, Willis TD, Yu F, Yang H, Zeng C, Gao Y, Hu H, Hu W, Li C, Lin W, Liu S, Pan H, Tang X, Wang J, Wang W, Yu J, Zhang B, Zhang Q, Zhao Hongbin, Zhao Hui, Zhou J, Gabriel SB, Barry R, Blumenstiel B, Camargo A, Defelice M, Faggart M, Goyette M, Gupta S, Moore J, Nguyen H, Onofrio RC, Parkin M, Roy J, Stahl E, Winchester E, Ziaugra L, Altshuler D, Shen Yan, Yao Z, Huang W, Chu X, He Y, Jin L, Liu Y, Shen Yayun, Sun W, Wang Haifeng, Wang Yi, Wang Ying, Xiong X, Xu L, Waye MMY, Tsui SKW, Xue H, Wong JTF, Galver LM, Fan JB, Gunderson K, Murray SS, Oliphant AR, Chee MS, Montpetit A, Chagnon F, Ferretti V, Leboeuf M, Olivier JF, Phillips MS, Roumy S, Sallée C, Verner A, Hudson TJ, Kwok PY, Cai D, Koboldt DC, Miller RD, Pawlikowska L, Taillon-Miller P, Xiao M, Tsui LC, Mak W, You QS, Tam PKH, Nakamura Y, Kawaguchi T, Kitamoto T, Morizono T, Nagashima A, Ohnishi Y, Sekine A, Tanaka T, Tsunoda T, Deloukas P, Bird CP, Delgado M, Dermitzakis ET, Gwilliam R, Hunt S, Morrison J, Powell D, Stranger BE, Whittaker P, Bentley DR, Daly MJ, de Bakker PIW, Barrett J, Chretien YR, Maller J, McCarroll S, Patterson N, Pe’Er I, Price A, Purcell S, Richter DJ, Sabeti P, Saxena R, Schaffner SF, Sham PC, Varilly P, Stein LD, Krishnan L, Smith AV, Tello-Ruiz MK, Thorisson GA, Chakravarti A, Chen PE, Cutler DJ, Kashuk CS, Lin S, Abecasis GR, Guan W, Li Y, Munro HM, Qin ZS, Thomas DJ, McVean G, Auton A, Bottolo L, Cardin N, Eyheramendy S, Freeman C, Marchini J, Myers S, Spencer C, Stephens M, Donnelly P, Cardon LR, Clarke G, Evans DM, Morris AP, Weir BS, Johnson TA, Mullikin JC, Sherry ST, Feolo M, Skol A, Zhang H, Matsuda I, Fukushima Y, MacEr DR, Suda E, Rotimi CN, Adebamowo CA, Ajayi I, Aniagwu T, Marshall PA, Nkwodimmah C, Royal CDM, Leppert MF, Dixon M, Peiffer A, Qiu R, Kent A, Kato K, Niikawa N, Adewole IF, Knoppers BM, Foster MW, Clayton EW, Watkin J, 
Muzny D, Nazareth L, Sodergren E, Weinstock GM, Yakub I, Birren BW, Wilson RK, Fulton LL, Rogers J, Burton J, Carter NP, Clee CM, Griffiths M, Jones MC, McLay K, Plumb RW, Ross MT, Sims SK, Willey DL, Chen Z, Han H, Kang L, Godbout M, Wallenburg JC, L’Archevêque P, Bellemare G, Saeki K, Wang Hongguang, An D, Fu H, Li Q, Wang Z, Wang R, Holden AL, Brooks LD, McEwen JE, Guyer MS, Wang VO, Peterson JL, Shi M, Spiegel J, Sung LM, Zacharia LF, Collins FS, Kennedy K, Jamieson R, Stewart J. 2007. A second generation human haplotype map of over 3.1 million SNPs. Nature 449:851–861. doi:10.1038/nature06258
  19. Islam T, Kim CH, Iwata H, Shimono H, Kimura A, Zaw H, Raghavan C, Leung H, Singh RK. 2021. A Deep Learning Method to Impute Missing Values and Compress Genome-wide Polymorphism Data in Rice. In: Lorenz R, Fred ALN, Gamboa H, editors. Proceedings of the 14th International Joint Conference on Biomedical Engineering Systems and Technologies, BIOSTEC 2021, Volume 3: BIOINFORMATICS, Online Streaming, February 11-13, 2021. SCITEPRESS. pp. 101–109.
  20. Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, Tunyasuvunakool K, Bates R, Žídek A, Potapenko A, Bridgland A, Meyer C, Kohl SAA, Ballard AJ, Cowie A, Romera-Paredes B, Nikolov S, Jain R, Adler J, Back T, Petersen S, Reiman D, Clancy E, Zielinski M, Steinegger M, Pacholska M, Berghammer T, Bodenstein S, Silver D, Vinyals O, Senior AW, Kavukcuoglu K, Kohli P, Hassabis D. 2021. Highly accurate protein structure prediction with AlphaFold. Nature 1–7. doi:10.1038/s41586-021-03819-2
  21. Koh PW, Pierson E, Kundaje A. 2017. Denoising genome-wide histone ChIP-seq with convolutional neural networks. Bioinformatics 33:i225–i233. doi:10.1093/bioinformatics/btx243
  22. Kojima K, Tadaka S, Katsuoka F, Tamiya G, Yamamoto M, Kinoshita K. 2020. A genotype imputation method for de-identified haplotype reference information by using recurrent neural network. PLoS Computational Biology 16:e1008207. doi:10.1371/journal.pcbi.1008207
  23.
    Kowalski MH, Qian H, Hou Z, Rosen JD, Tapia AL, Shan Y, Jain D, Argos M, Arnett DK, Avery C, Barnes KC, Becker LC, Bien SA, Bis JC, Blangero J, Boerwinkle E, Bowden DW, Buyske S, Cai J, Cho MH, Choi SH, Choquet H, Adrienne Cupples L, Cushman M, Daya M, de Vries PS, Ellinor PT, Faraday N, Fornage M, Gabriel S, Ganesh SK, Graff M, Gupta N, He J, Heckbert SR, Hidalgo B, Hodonsky CJ, Irvin MR, Johnson AD, Jorgenson E, Kaplan R, Kardia SLR, Kelly TN, Kooperberg C, Lasky-Su JA, Loos RJF, Lubitz SA, Mathias RA, McHugh CP, Montgomery C, Moon JY, Morrison AC, Palmer ND, Pankratz N, Papanicolaou GJ, Peralta JM, Peyser PA, Rich SS, Rotter JI, Silverman EK, Smith JA, Smith NL, Taylor KD, Thornton TA, Tiwari HK, Tracy RP, Wang T, Weiss ST, Weng LC, Wiggins KL, Wilson JG, Yanek LR, Zöllner S, North KE, Auer PL, Raffield LM, Reiner AP, Li Y. 2019. Use of >100,000 NHLBI Trans-Omics for Precision Medicine (TOPMed) Consortium whole genome sequences improves imputation quality and detection of rare variant associations in admixed African and Hispanic/Latino populations. PLoS Genetics 15:e1008500. doi:10.1371/journal.pgen.1008500
  24. Lal A, Chiang ZD, Yakovenko N, Duarte FM, Israeli J, Buenrostro JD. 2021. Deep learning-based enhancement of epigenomics data with AtacWorks. Nature Communications 12. doi:10.1038/s41467-021-21765-5
  25. Li Y, Willer C, Sanna S, Abecasis G. 2009. Genotype Imputation. Annual Review of Genomics and Human Genetics 10:387–406. doi:10.1146/annurev.genom.9.081307.164242
  26. Lin T-Y, Goyal P, Girshick R, He K, Dollár P. 2017. Focal Loss for Dense Object Detection.
  27. Liu Y, Gu J, Goyal N, Li X, Edunov S, Ghazvininejad M, Lewis M, Zettlemoyer L. 2020. Multilingual Denoising Pre-training for Neural Machine Translation. Transactions of the Association for Computational Linguistics 8:726–742. doi:10.1162/tacl_a_00343
  28. Marchini J, Howie B. 2010. Genotype imputation for genome-wide association studies. Nature Reviews Genetics. doi:10.1038/nrg2796
  29.
    McCarthy S, Das S, Kretzschmar W, Delaneau O, Wood AR, Teumer A, Kang HM, Fuchsberger C, Danecek P, Sharp K, Luo Y, Sidore C, Kwong A, Timpson N, Koskinen S, Vrieze S, Scott LJ, Zhang H, Mahajan A, Veldink J, Peters U, Pato C, van Duijn CM, Gillies CE, Gandin I, Mezzavilla M, Gilly A, Cocca M, Traglia M, Angius A, Barrett JC, Boomsma D, Branham K, Breen G, Brummett CM, Busonero F, Campbell H, Chan A, Chen S, Chew E, Collins FS, Corbin LJ, Smith GD, Dedoussis G, Dorr M, Farmaki AE, Ferrucci L, Forer L, Fraser RM, Gabriel S, Levy S, Groop L, Harrison T, Hattersley A, Holmen OL, Hveem K, Kretzler M, Lee JC, McGue M, Meitinger T, Melzer D, Min JL, Mohlke KL, Vincent JB, Nauck M, Nickerson D, Palotie A, Pato M, Pirastu N, McInnis M, Richards JB, Sala C, Salomaa V, Schlessinger D, Schoenherr S, Slagboom PE, Small K, Spector T, Stambolian D, Tuke M, Tuomilehto J, van den Berg LH, van Rheenen W, Volker U, Wijmenga C, Toniolo D, Zeggini E, Gasparini P, Sampson MG, Wilson JF, Frayling T, de Bakker PIW, Swertz MA, McCarroll S, Kooperberg C, Dekker A, Altshuler D, Willer C, Iacono W, Ripatti S, Soranzo N, Walter K, Swaroop A, Cucca F, Anderson CA, Myers RM, Boehnke M, McCarthy MI, Durbin R, Abecasis G, Marchini J. 2016. A reference panel of 64,976 haplotypes for genotype imputation. Nature Genetics 48:1279–1283. doi:10.1038/ng.3643
  30. Mou L, Norby FL, Chen LY, O’Neal WT, Lewis TT, Loehr LR, Soliman EZ, Alonso A. 2018. Lifetime Risk of Atrial Fibrillation by Race and Socioeconomic Status: ARIC Study (Atherosclerosis Risk in Communities). Circulation: Arrhythmia and Electrophysiology 11. doi:10.1161/CIRCEP.118.006350
  31. Naito T, Suzuki K, Hirata J, Kamatani Y, Matsuda K, Toda T, Okada Y. 2021. A deep learning method for HLA imputation and trans-ethnic MHC fine-mapping of type 1 diabetes. Nature Communications 12:1–14. doi:10.1038/s41467-021-21975-x
  32. Okewu E, Adewole P, Sennaike O. 2019. Experimental Comparison of Stochastic Optimizers in Deep Learning. Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). Springer Verlag. pp. 704–715. doi:10.1007/978-3-030-24308-1_55
  33. Picard toolkit. 2019.
  34. Rubinacci S, Delaneau O, Marchini J. 2020. Genotype imputation using the Positional Burrows Wheeler Transform. PLOS Genetics 16:e1009049. doi:10.1371/journal.pgen.1009049
  35. Sarkar E, Chielle E, Gürsoy G, Mazonka O, Gerstein M, Maniatakos M. 2021. Fast and scalable private genotype imputation using machine learning and partially homomorphic encryption. IEEE Access.
  36. Sun YV, Kardia SLR. 2008. Imputing missing genotypic data of single-nucleotide polymorphisms using neural networks. European Journal of Human Genetics 16:487–495. doi:10.1038/sj.ejhg.5201988
  37. Tian C, Fei L, Zheng W, Xu Y, Zuo W, Lin CW. 2020. Deep learning on image denoising: An overview. Neural Networks. doi:10.1016/j.neunet.2020.07.025
  38. Voulodimos A, Doulamis N, Doulamis A, Protopapadakis E. 2018. Deep Learning for Computer Vision: A Brief Review. Computational Intelligence and Neuroscience. doi:10.1155/2018/7068349
Posted December 02, 2021.