Estimation of Genetic Admixture Proportions via Haplotypes

Estimation of ethnic admixture is essential for creating personal genealogies, studying human history, and conducting genome-wide association studies (GWAS). Three broad approaches exist for estimating admixture coefficients. The frequentist approach directly maximizes the binomial loglikelihood. The Bayesian approach adds a reasonable prior and samples the posterior distribution. Finally, the nonparametric approach decomposes the genotype matrix algebraically. Each approach scales successfully to data sets with a million individuals and a million single nucleotide polymorphisms (SNPs). Despite their variety, all current approaches assume independence between SNPs. Achieving independence requires linkage disequilibrium (LD) filtering before analysis. Unfortunately, this tactic loses valuable information and usually retains many SNPs still in LD. The present paper explores the option of explicitly incorporating haplotypes in ancestry estimation. Our program, HaploADMIXTURE, operates on adjacent SNP pairs and jointly estimates their haplotype frequencies along with admixture coefficients. This more complex strategy takes advantage of the rich information available in haplotypes and ultimately yields better admixture estimates and better clustering of real populations in curated data sets.


Introduction
Estimation of genetic admixture is key to reconstructing personal genealogies and understanding population histories 1. Adjusting for genetic ancestry is also a necessary prelude to genome-wide association studies (GWAS) of medical and anthropological traits 2. Failure to account for ancestry can lead to false positives due to population stratification 3,4,5. In these analyses, admixture coefficients serve as covariates adjusting for ancestry. Because admixture coefficients represent the proportions of a person's ancestry derived from different founding populations, they are more interpretable than principal components (PCs).
Admixture coefficients can be estimated simultaneously with allele frequencies in known or latent populations. ADMIXTURE 6 is the most widely used likelihood-based software. It directly maximizes the binomial likelihood of the admixture coefficients and allele frequencies via alternating sequential quadratic programming 7. Our recent Julia version, OpenADMIXTURE 8, incorporates time-saving software enhancements and preselection of AIMs (ancestry informative markers) via sparse K-means clustering 9. STRUCTURE 10 and its extensions fastStructure 11 and TeraStructure 12 rely on Bayesian inference. SCOPE 13 replaces the genotype matrix by a low-rank matrix, which is delivered by alternating least squares 14 and randomized linear algebra. Each of the recent versions of these programs (OpenADMIXTURE, TeraStructure, and SCOPE) scales to biobank-size data sets of a million people and a million single nucleotide polymorphisms (SNPs).
A regrettable limitation of most of these programs is their assumption of independence of the alleles at neighboring SNPs. To avoid this patently false assumption, data sets are filtered to remove SNPs in linkage disequilibrium (LD). Filtering must strike a balance between LD elimination and the loss of valuable AIMs. Unfortunately, the LD-aware program fineSTRUCTURE 15 scales poorly on large data sets 13. Yet the evidence is clear that ancestry informative haplotypes pack more information than ancestry informative SNPs 16,17,18.
The current paper demonstrates the value of haplotypes in admixture estimation and population clustering. Given the combinatorial and computational complexities encountered, we consider only haplotypes formed from adjacent SNP pairs. Even with this limitation, haplotype models offer substantial improvements in estimation and clustering in simulated and real data sets on well-separated ancestral populations. Our new program, HaploADMIXTURE, builds on the high-performance computing (HPC) techniques pioneered in OpenADMIXTURE 8. By leveraging the parallel processing capabilities of graphics processing units (GPUs), HaploADMIXTURE runs in reasonable time. In practice only a minority of haplotypes are informative. To select ancestry informative haplotypes, we exploit unsupervised sparse K-means clustering with feature ranking 8,9. This generic method, denoted by the acronym SKFR, selects the informative features (haplotypes) driving cluster formation. Our experience suggests that the SKFR-HaploADMIXTURE pipeline delivers the best admixture results currently available within reasonable computing times.

Admixture Likelihoods
Consider a sample of I unrelated individuals, B haplotype blocks, and S SNPs per block.
For our purposes S equals 1 or 2. Let $x_{ib}$ denote the length-S genotype vector for haplotype block b of individual i. Each genotype counts the number of reference alleles carried by i at a SNP and is coded as a number from the set {0, 1, 2}. Haplotypes are coded as sequences of 0's and 1's, and every genotype vector $x_{ib} = h_{ib1} + h_{ib2}$ equals the sum of a maternal and a paternal haplotype.
The blocks are taken to be contiguous, non-overlapping, and exhaustive. Haplotypes may be chosen through feature selection as discussed in Section 2.4. Let $p_{kbh} > 0$ be the frequency of haplotype h of haplotype block b in population k, and let $q_{ki} > 0$ denote the fraction of i's genome coming from population k, where $1 \le k \le K$. Under a binomial distribution and independence of haplotype blocks, the loglikelihood of the sample is

$$L(Q, P) = \sum_{i=1}^{I} \sum_{b=1}^{B} \log \Bigg[ \sum_{h_1 + h_2 = x_{ib}} \Big( \sum_{k=1}^{K} q_{ki} p_{kbh_1} \Big) \Big( \sum_{k=1}^{K} q_{ki} p_{kbh_2} \Big) \Bigg], \tag{1}$$

where the inner sum ranges over ordered pairs $(h_1, h_2)$ of a maternal and a paternal haplotype compatible with the observed genotype vector $x_{ib}$. There are $2^S$ possible haplotypes per block, and the constraints $\sum_{k=1}^{K} q_{ki} = 1$ and $\sum_h p_{kbh} = 1$ hold for each individual i and each combination (k, b). The loglikelihood (1) simplifies by symmetry if any entry of $x_{ib}$ equals 1 (a heterozygous SNP). Because maternal and paternal haplotypes are interchangeable, the number of summands can be halved if the remaining sum of products is doubled. When S = 1 and i and b are fixed, the heterozygous genotype 1 has probability $2(\sum_k q_{ki} p_{kb0})(\sum_k q_{ki} p_{kb1})$, which the log function splits into a sum of logarithms.
In fact, this simplification replicates the binomial likelihood employed in ADMIXTURE and STRUCTURE. When S = 2, the doubly heterozygous genotype has probability

$$2 \Big( \sum_k q_{ki} p_{kb(00)} \Big) \Big( \sum_k q_{ki} p_{kb(11)} \Big) + 2 \Big( \sum_k q_{ki} p_{kb(01)} \Big) \Big( \sum_k q_{ki} p_{kb(10)} \Big),$$

which no longer splits under the log function. In addition, there are cases where one genotype of a pair is observed, but the other is missing. For example, if the first genotype is heterozygous and the second is missing, the probability equals

$$2 \Big( \sum_k q_{ki} [p_{kb(00)} + p_{kb(01)}] \Big) \Big( \sum_k q_{ki} [p_{kb(10)} + p_{kb(11)}] \Big),$$

and the loglikelihood (1) should be adjusted accordingly. Nonetheless, as described in the next subsection, the whole loglikelihood remains amenable to maximization.
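To make the bookkeeping concrete, the following Julia sketch evaluates one block's contribution to the loglikelihood (1) for S = 2 by enumerating ordered haplotype pairs. The function and variable names are illustrative and do not come from the HaploADMIXTURE code base.

```julia
# Illustrative evaluation of one block's term in loglikelihood (1) for S = 2.
# q is individual i's admixture vector (length K); p is the K × 4 matrix of
# haplotype frequencies for block b, with columns ordered (00, 01, 10, 11).
const HAPS = [(0, 0), (0, 1), (1, 0), (1, 1)]

function block_loglik(x::NTuple{2,Int}, q::Vector{Float64}, p::Matrix{Float64})
    mix(h) = sum(q[k] * p[k, h] for k in eachindex(q))  # Σ_k q_ki p_kbh
    prob = 0.0
    for h1 in 1:4, h2 in 1:4                            # ordered pairs (h1, h2)
        if HAPS[h1] .+ HAPS[h2] == x                    # compatible with genotype
            prob += mix(h1) * mix(h2)
        end
    end
    return log(prob)
end

# For the doubly heterozygous genotype (1, 1), the loop reproduces
# 2 mix(00) mix(11) + 2 mix(01) mix(10), as derived above.
```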

Maximum Likelihood Estimation
Estimation in HaploADMIXTURE and OpenADMIXTURE is similar. The optimization machinery in both programs alternates estimation of the per-population haplotype frequencies $p_{kbh}$ and the per-individual admixture coefficients $q_{ki}$. To allow easy parallelization with graphics processing units (GPUs), we invoke the minorization-maximization (MM) principle 19,20 to split the sums appearing in the arguments of the logarithms of the haplotype loglikelihood (1). The operative inequality is

$$\log(u + v) \ge \frac{u^{(n)}}{u^{(n)} + v^{(n)}} \log u + \frac{v^{(n)}}{u^{(n)} + v^{(n)}} \log v + c_n,$$

which reduces to an equality when $u = u^{(n)}$ and $v = v^{(n)}$. Here the irrelevant constant $c_n$ depends only on the current values $u^{(n)}$ and $v^{(n)}$ of u and v. The function $\frac{u^{(n)}}{u^{(n)}+v^{(n)}} \log u + \frac{v^{(n)}}{u^{(n)}+v^{(n)}} \log v$ becomes a surrogate for the function $\log(u + v)$ it replaces. For example, when S = 2 and individual i presents a doubly heterozygous genotype, we take $u = 2(\sum_k q_{ki} p_{kb(00)})(\sum_k q_{ki} p_{kb(11)})$ and $v = 2(\sum_k q_{ki} p_{kb(01)})(\sum_k q_{ki} p_{kb(10)})$. Most genotype probabilities (all homozygous and singly heterozygous genotypes) reduce to a single product where log splitting is unnecessary. For haplotypes involving more than two SNPs, phase combinations become more complex, code is harder to write, and computation slows. For these reasons we venture no further than two-SNP haplotypes. Maximization of the surrogate function created by minorization enjoys the ascent property of steadily increasing the loglikelihood. This ascent property is the essence of the MM principle 19,20.
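As a sanity check on the minorization, the short Julia snippet below verifies numerically that the surrogate touches log(u + v) at the current iterate and lies below it elsewhere; the helper function is purely illustrative.

```julia
# Numerical check of the log-splitting minorization
# log(u + v) >= w_u log(u) + w_v log(v) + c_n,
# with weights w_u = u_n / (u_n + v_n) and w_v = v_n / (u_n + v_n).

function surrogate(u, v, u_n, v_n)
    s = u_n + v_n
    w_u, w_v = u_n / s, v_n / s
    c_n = log(s) - w_u * log(u_n) - w_v * log(v_n)   # forces tangency at (u_n, v_n)
    return w_u * log(u) + w_v * log(v) + c_n
end

u_n, v_n = 0.3, 0.2                                      # current iterates
@assert surrogate(u_n, v_n, u_n, v_n) ≈ log(u_n + v_n)  # equality at the anchor
for (u, v) in [(0.1, 0.5), (0.4, 0.1), (0.25, 0.25)]
    @assert surrogate(u, v, u_n, v_n) ≤ log(u + v)       # minorization elsewhere
end
```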
Minorization creates a surrogate function involving nonnegative weights $w^{(n)}_{ikbh}$, many of which are 0 because they correspond to haplotypes incompatible with observed genotypes. Except for revising the weights $w^{(n)}_{ikbh}$ at each iteration n, the surrogate loglikelihood (2) is simpler to deal with than the actual loglikelihood. Updating the admixture matrix $Q = (q_{ki})$ can be done simultaneously over its columns (individuals i). Updating the haplotype frequency tensor $P = (p_{kbh})$ can be done simultaneously over its middle index (blocks b). Each such maximization must respect the nonnegativity constraints on the proportions and their sum-to-one constraints. Very simple multinomial updates of the $p_{kbh}$ can be achieved by splitting the argument $\sum_{k=1}^{K} q_{ki} p_{kbh}$ of the log function, but this second minorization slows convergence dramatically.
The parallel updates of P and Q are structured around functions of the form $G(r) = \sum_j w_j \log\big(\sum_k c_{jk} r_k\big)$ subject to nonnegativity and sum-to-one constraints. The method of recursive quadratic programming replaces G(r) by its local quadratic approximation and maximizes this approximation subject to the constraints. The required gradient and Hessian are

$$\nabla G(r) = \sum_j \frac{w_j}{c_j^\top r} \, c_j, \qquad d^2 G(r) = -\sum_j \frac{w_j}{(c_j^\top r)^2} \, c_j c_j^\top,$$

where $c_j^\top$ is the jth row of the matrix C of nonnegative constants $c_{jk}$.
Computation of the gradients and Hessians of $G(Q, P \mid Q^{(n)}, P^{(n)})$ with respect to Q separates across individuals i, while computation with respect to P separates across blocks b. These independent subproblems map naturally onto parallel threads and GPU kernels.
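The Julia sketch below computes the gradient and Hessian displayed above for a generic weight vector w and constant matrix C. The production code tiles this loop across threads and GPU kernels, but the arithmetic is identical.

```julia
# Gradient and Hessian of G(r) = sum_j w_j * log(dot(C[j, :], r)),
# the building block of the recursive quadratic programming updates.
using LinearAlgebra

function grad_hess(w::Vector{Float64}, C::Matrix{Float64}, r::Vector{Float64})
    J, K = size(C)
    g = zeros(K)
    H = zeros(K, K)
    for j in 1:J
        cj = @view C[j, :]
        inner = dot(cj, r)                    # c_j' r
        g .+= (w[j] / inner) .* cj            # gradient term
        H .-= (w[j] / inner^2) .* (cj * cj')  # Hessian term (negative semidefinite)
    end
    return g, H
end

# toy usage with K = 3 proportions
w = [2.0, 1.0, 3.0]
C = [0.2 0.5 0.3; 0.6 0.1 0.3; 0.4 0.4 0.2]
g, H = grad_hess(w, C, fill(1/3, 3))
```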

Selection of K
We employ two devices to select the number of ancestral populations K. First, the cross-validation method introduced in ADMIXTURE 21 partitions the sample individuals into v folds. Each of the folds is held out in turn as a validation set, and the model is fit on the remaining training individuals. Fitting on a training set is fast because it warm starts parameter values from the estimates garnered under the full data set. Given the haplotype frequencies $P_{\text{train}}$ estimated on the training set, we estimate the admixture fractions $Q_{\text{test}}$ on the validation set. This fitting step is also fast because it qualifies as a straightforward convex supervised learning problem. Given $P_{\text{train}}$ and $Q_{\text{test}}$, we predict the genotype matrix of the validation individuals. The deviance residual under a binomial model yields the prediction error

$$E(X, Y) = 2 \sum_{i,j} \left[ x_{ij} \log \frac{x_{ij}}{y_{ij}} + (2 - x_{ij}) \log \frac{2 - x_{ij}}{2 - y_{ij}} \right],$$

where X is the $I \times SB$ matrix of true genotypes, Y is the matrix of predicted genotypes, and terms with $x_{ij} = 0$ or $x_{ij} = 2$ are handled by the convention $0 \log 0 = 0$. This error is then averaged across the different folds. We choose the most parsimonious model whose prediction error is no more than one standard error above the error of the best model (the one standard error rule).
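A minimal Julia sketch of the deviance-based prediction error follows. The clamping of boundary predictions and the skipping of missing genotypes are our own illustrative choices, not necessarily those of the released software.

```julia
# Binomial deviance prediction error for hold-out genotypes.
# X holds true genotypes in {0, 1, 2} (may contain `missing`);
# Y holds predicted mean genotypes in (0, 2).

xlogx(t) = t <= 0 ? 0.0 : t * log(t)   # convention 0 log 0 = 0

function deviance_error(X, Y)
    err = 0.0
    for idx in eachindex(X)
        x = X[idx]
        ismissing(x) && continue
        y = clamp(Y[idx], 1e-10, 2 - 1e-10)  # guard the logarithms
        err += 2 * (xlogx(float(x)) - x * log(y) +
                    xlogx(2.0 - x) - (2 - x) * log(2 - y))
    end
    return err
end
```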
The second device for selecting K is the Akaike information criterion (AIC) 22. In the current setting

$$\mathrm{AIC}(K) = -2 \hat{L}_K + 2 \left[ BK(2^S - 1) + I(K - 1) \right],$$

where $\hat{L}_K$ is the loglikelihood evaluated at the maximum likelihood estimates given K. The term $BK(2^S - 1) + I(K - 1)$ counts the number of free parameters in the model with K ancestral populations. We fit the model for several different values of K and choose the K with the lowest value of AIC. The virtue of AIC is that it requires far less computation than full cross-validation.
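In code, the criterion is a one-line transcription of the parameter count above; the helper below is illustrative rather than part of the released software.

```julia
# AIC for the haplotype admixture model: B*K*(2^S - 1) free haplotype
# frequencies plus I*(K - 1) free admixture proportions.
aic(loglik, I, B, S, K) = -2 * loglik + 2 * (B * K * (2^S - 1) + I * (K - 1))

# e.g. given maximized loglikelihoods ll[K] for K in 2:8,
# pick K minimizing aic(ll[K], I, B, 2, K).
```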

Sparse K-Means with Feature Ranking for Haplotypes
To select AIMs, sparse K-means with feature ranking (SKFR) 8,9 has proved ideal. SKFR ranks and selects a predetermined number of features based on their importance in K-means clustering. HaploADMIXTURE requires input blocks of SNPs rather than individual SNPs. The center for cluster j is a vector $c_j = (c_{jg})$. The loss in K-means clustering is

$$\sum_{j} \sum_{i \in C_j} \| x_i - c_j \|^2,$$

where $C_j$ denotes the set of individuals belonging to cluster j, and each raw feature vector $x_i = h_m + h_p$ is a sum of unknown maternal and paternal haplotypes. (If everyone is haplotyped, then SKFR should operate on haplotypes.) In practice, the $x_i$ are standardized to have a mean of 0 for each feature across the entire sample. A missing genotype $x_{ig}$ in $x_i$ is imputed by $c_{jg}$ when i is assigned to cluster j 23. To model haplotypes, the feature vector $x_i$ is broken into vector blocks $x_{ib}$. Lloyd's algorithm 24 alternates updating cluster centers and reassigning feature vectors to clusters. At each iteration of Lloyd's algorithm, the S blocks giving the largest reduction in the loss are selected (here S denotes the sparsity level, the number of blocks retained) based on the decomposition

$$\sum_{j} \sum_{i \in C_j} \| x_{ib} - c_{jb} \|^2 = \sum_{j} \sum_{i \in C_j} \| x_{ib} \|^2 - \sum_{j} |C_j| \, \| c_{jb} \|^2.$$

The reduction attributable to block b is therefore $\sum_j |C_j| \|c_{jb}\|^2$. The mean for a selected block is cluster-specific. The mean for a non-selected block is taken to be 0. Lloyd's algorithm converges when the cluster centers and ancestry informative blocks stabilize.
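A minimal sketch of the block-scoring step follows, assuming standardized genotypes stored feature-by-individual with two SNPs per block; the names and data layout are illustrative.

```julia
# Score each two-SNP block by sum_j |C_j| * ||c_jb||^2 and keep the s best.
# X is features × individuals (standardized), `labels` assigns individuals
# to clusters 1:K, and s is the sparsity level.

function top_blocks(X::Matrix{Float64}, labels::Vector{Int}, K::Int, s::Int)
    G, I = size(X)
    B = G ÷ 2                               # two SNPs per block
    score = zeros(B)
    for j in 1:K
        members = findall(==(j), labels)
        isempty(members) && continue
        c = vec(sum(X[:, members]; dims = 2)) ./ length(members)  # center of cluster j
        for b in 1:B
            rows = (2b - 1):(2b)
            score[b] += length(members) * sum(abs2, c[rows])  # |C_j| * ||c_jb||^2
        end
    end
    return partialsortperm(score, 1:s; rev = true)  # indices of the s best blocks
end
```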

Computational Tactics
Most of the computational tactics introduced in OpenADMIXTURE carry over to HaploADMIXTURE. For instance, HaploADMIXTURE significantly reduces memory demands by converting the bit-packed genotypes stored in PLINK BED format 25 into numbers on the fly through the OpenMendel 26 package SnpArrays 27. Multithreading is employed throughout HaploADMIXTURE. Multithreading not only promotes parallelism, but also reduces memory usage by tiling the computation of gradients and Hessians. CUDA GPU kernels are implemented for the EM updates and for computing gradients and Hessians. When running SKFR over multiple sparsity levels S, we start with the highest level and warm start Lloyd's algorithm at each lower level with the converged values from the previous, higher level. We refer readers to Ko et al. 8 for further details.

Permutation Matching of Clusters
A promising similarity metric proposed by Behr et al. 28 is effective in matching clusters defined by two admixture matrices $Q^1$ and $Q^2$. This metric faithfully matches similar clusters and is invariant when cluster labels are permuted. The metric quantifies the similarity between cluster m in $Q^1$ and cluster n in $Q^2$ as

$$J(q^1_m, q^2_n) = \frac{1}{|N|} \sum_{i \in N} \frac{\min(q^1_{mi}, q^2_{ni})}{\max(q^1_{mi}, q^2_{ni})},$$

where N is the set of indices i for which $q^1_{mi} + q^2_{ni} > 0$, and |N| is the cardinality of N.
To match the clusters delivered by two algorithms, we solve the assignment problem that maximizes the criterion $\sum_m \sum_n y_{mn} J(q^1_m, q^2_n)$ subject to the constraints $y_{mn} \in \{0, 1\}$, $\sum_{m=1}^{K} y_{mn} = 1$ for each n, and $\sum_{n=1}^{K} y_{mn} = 1$ for each m, where K is the number of clusters. In practice, we relax the domain of $y_{mn}$ to the unit interval and solve the simplified problem by linear programming via JuMP 29, a mathematical optimization package in Julia.
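The sketch below sets up the relaxed assignment problem in JuMP, assuming a precomputed K × K similarity matrix J and substituting the open-source HiGHS solver, since the text does not name the solver employed.

```julia
# Relaxed assignment problem for cluster matching via linear programming.
using JuMP, HiGHS

function match_clusters(J::Matrix{Float64})
    K = size(J, 1)
    model = Model(HiGHS.Optimizer)
    set_silent(model)
    @variable(model, 0 <= y[1:K, 1:K] <= 1)           # relaxed 0/1 assignment
    @constraint(model, [n in 1:K], sum(y[:, n]) == 1) # each column matched once
    @constraint(model, [m in 1:K], sum(y[m, :]) == 1) # each row matched once
    @objective(model, Max, sum(J[m, n] * y[m, n] for m in 1:K, n in 1:K))
    optimize!(model)
    # An optimal basic solution of this LP is a permutation matrix
    # (Birkhoff-von Neumann), so rounding is safe.
    return [findmax(value.(y[m, :]))[2] for m in 1:K]  # cluster n matched to each m
end
```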

Silhouette Coefficient
The silhouette index $s_i$ is a measure of how similar object i is to its own cluster (cohesion) compared to other clusters (separation) 30. If i belongs to cluster $C_k$, then the index $s_i$ reflects the two averages $a_i$ and $b_i$, where $a_i$ is the average distance of sample i from the other points in $C_k$, and $b_i$ is the minimum over the other clusters of the average distance of sample i from the points of that cluster. Given these values we define

$$s_i = \frac{b_i - a_i}{\max(a_i, b_i)}.$$

Note that $s_i$ ranges from −1 to 1; the higher $s_i$ is, the better separated the clusters are. Thus, the average silhouette value serves as a sensitive measure of clustering quality.
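A compact Julia implementation of these formulas, adequate for small samples, might look as follows; it assumes Euclidean distances and clusters with at least two members.

```julia
# Silhouette indices for columns of X under cluster assignment `labels`.
using LinearAlgebra, Statistics

function silhouettes(X::Matrix{Float64}, labels::Vector{Int})
    I = size(X, 2)                       # columns are samples
    K = maximum(labels)
    dist(i, j) = norm(X[:, i] - X[:, j])
    s = zeros(I)
    for i in 1:I
        # average distance from i to each cluster (excluding i itself)
        avg = [mean(dist(i, j) for j in 1:I if labels[j] == k && j != i)
               for k in 1:K]
        a = avg[labels[i]]                                  # cohesion
        b = minimum(avg[k] for k in 1:K if k != labels[i])  # separation
        s[i] = (b - a) / max(a, b)
    end
    return s    # report mean(s) as the clustering-quality summary
end
```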

Real Data Sets
To evaluate its performance, we applied HaploADMIXTURE to three different real-world data sets: the 1000 Genomes Project (TGP), the Human Genome Diversity Project (HGDP), and the Human Origins (HO) data sets.

Simulations
A common approach for simulating genetic admixture with S = 1 is the Pritchard-Stephens-Donnelly (PSD) model 10, with allele frequencies sampled from the Balding-Nichols model 36 following the beta distribution

$$p \sim \mathrm{Beta}\Big( \frac{1 - F_{ST}}{F_{ST}} \, p_A, \; \frac{1 - F_{ST}}{F_{ST}} (1 - p_A) \Big),$$

where $p_A$ is the ancestral allele frequency and $F_{ST}$ is the fixation index. Chiu et al. 13 introduced an extra level of Dirichlet sampling to simulate populations gathered around regional centers. This is accomplished by first sampling T regional centers $s_t$ from the $\mathrm{Dirichlet}(\alpha 1_K)$ distribution. Then, for each regional center, I/T of the admixture vectors $q_{\cdot i}$ are sampled around the center $s_t$ with a high value of the concentration parameter $\gamma = 50$. To model pairs of SNPs (S = 2), we simulated data according to a variant of the model for S = 1, with the beta distributions for allele frequencies replaced by Dirichlet distributions for haplotype frequencies.
Specifically, we generated regional centers $s_t \stackrel{iid}{\sim} \mathrm{Dirichlet}(\alpha 1_K)$ and, for each individual i assigned to center t, an admixture vector $q_{\cdot i} \stackrel{iid}{\sim} \mathrm{Dirichlet}(\gamma s_t)$. Haplotype frequencies for each population were then drawn as

$$p_{kb} \sim \mathrm{Dirichlet}\Big( \frac{1 - F_{ST}}{F_{ST}} \, [p_{(00)}, p_{(01)}, p_{(10)}, p_{(11)}] \Big),$$

given the initial haplotype frequencies $[p_{(00)}, p_{(01)}, p_{(10)}, p_{(11)}]$ and fixation indices $F_{ST}$. To specify these underlying parameters, we randomly sampled blocks of SNPs of length S from real genotype data.
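The following Julia sketch, built on Distributions.jl with illustrative parameter values, draws admixture vectors around regional centers and haplotype frequencies from the Dirichlet analog of the Balding-Nichols model, then emits a genotype vector for one block.

```julia
# Illustrative two-SNP simulation; all parameter values are toy choices.
using Distributions, Random

Random.seed!(1)
K, T, I, γ, α, Fst = 4, 2, 6, 50.0, 0.1, 0.01

centers = [rand(Dirichlet(fill(α, K))) for _ in 1:T]                   # s_t
Q = hcat([rand(Dirichlet(γ .* centers[mod1(i, T)])) for i in 1:I]...)  # q_·i

p0 = [0.4, 0.3, 0.2, 0.1]   # initial haplotype frequencies (00, 01, 10, 11)
P = hcat([rand(Dirichlet((1 - Fst) / Fst .* p0)) for k in 1:K]...)     # per population

hap = [0 0; 0 1; 1 0; 1 1]
function genotype(i)
    mix = P * Q[:, i]          # mixture haplotype frequencies for individual i
    mix ./= sum(mix)           # guard against floating-point drift
    h1, h2 = rand(Categorical(mix)), rand(Categorical(mix))
    return hap[h1, :] .+ hap[h2, :]   # genotype entries in {0, 1, 2}
end
```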

Simulation Studies
We simulated independent data sets with different numbers of SNPs and different values of the concentration parameter $\alpha \in \{0.1, 0.05, 0.02\}$, as described in Section 2.8. Tables 1, S3, and S2 display root-mean-square errors (RMSE) for 10,000, 100,000, and 1,000,000 simulated SNPs. RMSE is estimated by

$$\mathrm{RMSE} = \sqrt{ \frac{1}{IK} \sum_{k=1}^{K} \sum_{i=1}^{I} \left( \hat{q}_{ki} - q_{ki} \right)^2 },$$

where $\hat{Q} = (\hat{q}_{ki})$ is the estimated admixture matrix after matching its clusters to those of the true matrix $Q = (q_{ki})$.
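Once clusters have been matched, the RMSE computation is a one-liner; the helper below is illustrative.

```julia
# RMSE between true and estimated admixture matrices, assuming the
# columns of Qhat have already been permuted to match Q.
rmse(Q::Matrix{Float64}, Qhat::Matrix{Float64}) =
    sqrt(sum(abs2, Qhat .- Q) / length(Q))
```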

Evaluation of Estimated Admixture
Silhouette coefficients offer another way of quantifying performance. These are based on the ancestry labels implicit in the estimated Q matrix. The average silhouette coefficient is preferable to the training errors of linear classifiers and their cross-entropies 13,8 because training error is discrete, and a single individual can unduly influence cross-entropy. We additionally matched clusters as discussed in Section 2.6.1 and computed root-mean-square error (RMSE) from the SKFR clusters derived from all SNPs.
Tables 4, S6, and S7 display mean silhouette coefficients for HaploADMIXTURE, OpenADMIXTURE, SCOPE, and TeraStructure. Since one of the continental groups, admixed Americans, is known to be admixed, we also provide results without them in Table S5. Tables S8-S10 show continent-by-continent mean silhouette coefficients, and Tables S11-S14 show region-by-region mean silhouette coefficients. HaploADMIXTURE generally performs well in grouping populations by both continent and region. OpenADMIXTURE performs equally well in grouping by continent but falls behind in grouping by region. For TGP and HGDP, TeraStructure is the best at distinguishing continental labels but falters in distinguishing regional labels, particularly in the TGP data, where Middle-Eastern and European populations are lumped. For HGDP, SCOPE is the best at distinguishing the 32 regional labels but struggles compared to HaploADMIXTURE and OpenADMIXTURE in distinguishing African continental populations from each other. For the HO data set, HaploADMIXTURE and OpenADMIXTURE perform similarly in distinguishing the 163 regional labels, followed by SCOPE and TeraStructure.
When the analysis is based on AIMs, HaploADMIXTURE usually performs better than OpenADMIXTURE. In the single instance of 5000 AIMs for TGP, HaploADMIXTURE suffers in distinguishing regional subpopulations. In the case of HGDP under AIM selection, OpenADMIXTURE has trouble distinguishing between Middle-Eastern and European populations and assigns an extra population to Africa. This anomaly is visible in Figure S4. HaploADMIXTURE with AIMs retains the power to distinguish the Middle-Eastern and European populations. For the HO data set, HaploADMIXTURE's performance with AIMs mimics its performance with all SNPs more closely than OpenADMIXTURE's does.
Tables 3, S15, and S16 display RMSE from the baseline of all SNPs for the TGP, HGDP, and HO data sets, respectively. For the TGP data set, as we choose fewer AIMs, the mean silhouette tends to decrease, except for 10,000 and 5000 SNPs in OpenADMIXTURE. However, these exceptional cases yield poorer separation of populations than HaploADMIXTURE with all SNPs. This suggests that parsimony alone is an imperfect criterion for judging admixture estimation.

Computational Efficiency
Given the computational improvements incorporated in HaploADMIXTURE, the analyses reported here finish in a reasonable amount of time. HaploADMIXTURE's cost per iteration with S = 2 SNPs per haplotype block is less than eight times that of OpenADMIXTURE.
Given that the number of frequency parameters quadruples, it takes four times longer to compute gradients and Hessians. While the time for solving quadratic programs is still negligible for Q, quadratic programming for P takes longer, comparable to the time needed to compute gradients and Hessians on a GPU. Since 16-threaded ADMIXTURE was 16 times slower than GPU-accelerated OpenADMIXTURE 8 , HaploADMIXTURE's per-iteration performance is still faster than that of ADMIXTURE. Balanced against these gains is the fact that the number of iterations until convergence increases. This reflects the greater complexity of the likelihood, the increased number of parameters, and the cost of parameter splitting by the MM principle.

Discussion
This paper introduces a technique for global ancestry estimation that converts linkage disequilibrium from a liability to an asset. Our program HaploADMIXTURE exploits multithreading, GPU acceleration, and sparse K-means clustering to identify ancestry informative haplotypes. Although these advances also appear in OpenADMIXTURE, our earlier upgrade of the ADMIXTURE 6 software, they require substantial modification to handle haplotypes.
For instance, in the construction of AIMs, sparse K-means must now operate on haplotypes rather than SNPs. Likelihood calculation becomes more complicated because of increased phase ambiguity. Nevertheless, these technical hurdles can be overcome while keeping computational speed and memory demands on a par with, or better than, those of the original ADMIXTURE. Computation times scale linearly in the number of haplotype blocks. To keep computational costs in check, our haplotypes span just two SNPs. Even with this limitation, we see substantial gains in the precision of ancestry estimation. Extending haplotype blocks to include more than two SNPs is theoretically possible and would further increase information content, but extension quickly hits a combinatorial wall in computing the $2^S$ haplotype frequencies given S SNPs per block. The greater phase ambiguity encountered would complicate computer code and slow the convergence of recursive quadratic programming, the optimization engine of HaploADMIXTURE.
The admixture coefficients delivered by HaploADMIXTURE demonstrate a good separation of populations at the continental and regional levels in both real and simulated data sets.
The other admixture programs tested often perform well on one level and poorly on the other.
The admixture estimates from HaploADMIXTURE are more accurate than the competition as measured by mean square prediction error. In our experience, cross-validation and AIC produce reasonable estimates of the number of ancestral populations K. AIC is much faster than cross-validation. It will be interesting to see whether Bayesian or algebraic methods can be adapted to exploit haplotypes. The algebraic program SCOPE relies on alternating least squares. Adaptation would require passing from matrix to tensor decompositions.
Estimation of admixture proportions given known populations and known haplotype frequencies is possible with HaploADMIXTURE. One simply fixes P and updates only Q.
This simplification is invoked in the time-consuming process of cross-validation. For optimal general use, partial maximization would require curating the most informative pairs of SNPs in population panels. Partial maximization is a convex problem with parameter separation across individuals, and it therefore scales easily to biobank-size cohorts.

Declaration of Interests
The authors declare no competing interests.