Summix: A method for detecting and adjusting for population structure in genetic summary data

Publicly available genetic summary data have high utility in research and the clinic including prioritizing putative causal variants, polygenic scoring, and leveraging common controls. However, summarizing individual-level data can mask population structure resulting in confounding, reduced power, and incorrect prioritization of putative causal variants. This limits the utility of publicly available data, especially for understudied or admixed populations where additional research and resources are most needed. While several methods exist to estimate ancestry in individual-level data, methods to estimate ancestry proportions in summary data are lacking. Here, we present Summix, a method to efficiently deconvolute ancestry and provide ancestry-adjusted allele frequencies from summary data. Using continental reference ancestry, African (AFR), Non-Finnish European (EUR), East Asian (EAS), Indigenous American (IAM), South Asian (SAS), we obtain accurate and precise estimates (within 0.1%) for all simulation scenarios. We apply Summix to gnomAD v2.1 exome and genome groups and subgroups finding heterogeneous continental ancestry for several groups including African/African American (∼84% AFR, ∼14% EUR) and American/Latinx (∼4% AFR, ∼5% EAS, ∼43% EUR, ∼46% IAM). Compared to the unadjusted gnomAD AFs, Summix’s ancestry-adjusted AFs more closely match respective African and Latinx reference samples. Even on modern, dense panels of summary statistics, Summix yields results in seconds allowing for estimation of confidence intervals via block bootstrap. Given an accompanying R package, Summix increases the utility and equity of public genetic resources, empowering novel research opportunities.


Introduction
Genetic summary data is a cornerstone of modern analyses. Allele frequencies (AFs) from publicly available data such as the genome Aggregation Database (gnomAD) 1,2 and Allele Frequency Aggregator (ALFA) from dbSNP 3 can be used to prioritize putative causal variants for rare diseases and as pseudo controls in case-control analysis 4-7 . Compared to individual-level data, genetic summary data often has fewer barriers to access promoting open science and the broad use of valuable resources. However, summary-level genetic data frequently contain fine-scale and continental-level population structure. For instance, unquantified continental ancestry exists in gnomAD's African/African American, American/Latinx, and Other groups as well as in other publicly available data (e.g. BRAVO server for TopMED) 2,8 . Using public data without accounting for the underlying population structure can lead to confounded associations and incorrect prioritization of putative rare causal variants.
The use of mixture models to estimate population structure has a history going back over two decades beginning with IMMANC 9 and later the commonly-used STRUCTURE 10 , originally using Dirichlet priors for multinomial modeling. Inference was performed via Markov Chain Monte Carlo, which limits tractability to datasets with thousands to tens of thousands of markers. As datasets grew with the advent of genome-wide arrays, and later with sequencing, new methods were designed with improved convergence characteristics such as the maximum-likelihood methods FRAPPE 11 and Admixture 12 , as well as the variational method FastSTRUCTURE 13 . Along the way, methods were developed to leverage pooled data 14 , e.g. iAdmix 15 , with improvements to enhance supervised analysis in the Admixture framework 16 . However, no method was designed explicitly and efficiently to model mixtures with genome-wide, summary statistic data that is common in modern genomics.
Individuals and samples from understudied or admixed populations are most likely to lack large public resources with precisely matched ancestry data. As a result, researchers and clinicians working with these populations are often left with a suboptimal choice: use the closest, but still poorly matched ancestral group or do not use the publicly available and highly useful resource 7 . The former has the potential to produce biased results in the very populations where additional high-quality research is needed, while the latter is likely to suffer from smaller sample sizes and thus a loss of statistical power. This choice exacerbates inequities in research on understudied and admixed populations 17,18 . These issues are magnified in the context of precision medicine where genetic summary data will likely not be sufficiently matched for the majority of people who themselves are a mixture of continental or finescale ancestries. Thus, methods to estimate and adjust for population structure within publicly available genetic summary data are needed.
Here, we present Summix, an efficient method that identifies, estimates, and adjusts for the proportion of continental reference ancestry in publicly available summary genetic data. We demonstrate the effectiveness of Summix in over 5,000 simulation scenarios and in gnomAD v2.1, including the ability to produce ancestry-adjusted AFs to tailor analyses to less-studied populations. Ultimately, Summix, and the accompanying R, Python, and Shiny app software, help to increase the efficacy and, importantly, the equity of valuable publicly available resources, especially for understudied and admixed samples.

Summix
Estimating ancestry proportions
An observed single nucleotide polymorphism (SNP) AF can be described as a mixture of AFs across unobserved subgroups (e.g. continental ancestral populations). We estimate the group-specific mixing proportions, $\pi_k$, by minimizing the least-squares difference between vectors of N SNPs for the observed AF, $AF_{observed}$, and the AF generated from a mixture of K reference ancestry groups with reference AFs $AF_{ref,k}$ and mixing proportions $\pi_k$, as shown in Equation (1).
$$\min_{\pi} \sum_{n=1}^{N} \left( AF_{observed,n} - \sum_{k=1}^{K} \pi_k \, AF_{ref,k,n} \right)^2 \quad \text{subject to} \quad \sum_{k=1}^{K} \pi_k = 1, \;\; \pi_k \ge 0 \qquad (1)$$
This objective function is quadratic and, as such, is continuous, convex, and easily differentiable; the inequality constraints are linear. Hence, a feasible minimizer will fulfill the Karush-Kuhn-Tucker (KKT) conditions for optimality. We use Sequential Quadratic Programming (SQP) 19,20 , a gradient-based, iterative algorithm for constrained, nonlinear optimization, to efficiently estimate the proportion of each reference group. We obtain confidence intervals for the continental ancestry proportions using block bootstrapping as described below.
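The constrained least-squares problem above can be illustrated with a minimal Python sketch. This is not the Summix package code and it does not use SQP: it swaps in projected gradient descent, which solves the same convex problem using only the standard library. The function names, the toy AFs, and the iteration count are assumptions made for illustration.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {p : p_k >= 0, sum(p) = 1}."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for j, u_j in enumerate(u, start=1):
        cumulative += u_j
        t = (cumulative - 1.0) / j
        if u_j - t > 0:
            theta = t  # keep the threshold from the last index that passes
    return [max(x - theta, 0.0) for x in v]


def estimate_proportions(observed, reference, n_iter=2000):
    """Minimize sum_n (observed[n] - sum_k pi[k] * reference[k][n])**2 on the simplex.

    observed:  list of N observed AFs.
    reference: list of K lists, each holding N reference AFs.
    """
    n_groups = len(reference)
    # Conservative step: 1 / (2 * sum of squared reference AFs) is never
    # larger than 1 / L, where L is the Lipschitz constant of the gradient.
    step = 1.0 / (2.0 * sum(x * x for ref in reference for x in ref))
    pi = [1.0 / n_groups] * n_groups  # start from equal proportions
    for _ in range(n_iter):
        residual = [
            sum(pi[k] * reference[k][n] for k in range(n_groups)) - observed[n]
            for n in range(len(observed))
        ]
        gradient = [
            2.0 * sum(r * x for r, x in zip(residual, reference[k]))
            for k in range(n_groups)
        ]
        pi = project_to_simplex([p - step * g for p, g in zip(pi, gradient)])
    return pi


# Toy example (hypothetical AFs): an exact 85% / 15% mixture of two groups.
afr = [0.1, 0.8, 0.3, 0.6]
eur = [0.7, 0.2, 0.9, 0.4]
observed = [0.85 * a + 0.15 * e for a, e in zip(afr, eur)]
pi_hat = estimate_proportions(observed, [afr, eur])
```

Because the objective is convex and the feasible set is the probability simplex, any method that converges to a KKT point reaches the same minimizer; SQP, as used by the authors, would be expected to get there in far fewer iterations on genome-scale data.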

Ancestry-adjusted allele frequencies
Using estimated continental ancestry proportions, we update the AFs in the observed data to match the continental ancestry proportions of an individual or sample as shown in Equation (2). To estimate the ancestry-adjusted AF, K-1 homogeneous reference ancestries are used. The ancestry not used is indexed as $l$. In theory, ancestry $l$ can be any of the reference ancestry groups with a non-zero estimated proportion. Here, we choose $l$ to be the most common ancestry present in the summary data.
$$AF^{*} = \frac{\pi_{l}^{target}}{\hat{\pi}_{l}} \left( AF_{observed} - \sum_{k \neq l} \hat{\pi}_{k} \, AF_{ref,k} \right) + \sum_{k \neq l} \pi_{k}^{target} \, AF_{ref,k} \qquad (2)$$
where
$AF^{*}$ is the ancestry-adjusted allele frequency,
$l$ is the ancestry group for which the reference allele frequency data is not used,
$k$ is the ancestry group index,
$\hat{\pi}_{k}$ is the estimated proportion of ancestry $k$ in the observed data, and
$\pi_{k}^{target}$ is the proportion of ancestry $k$ in the target individual or sample.
African Ancestry in SW USA (ASW) and African Caribbeans in Barbados (ACB). We merged the 1000 Genomes and IAM data keeping the subset of SNPs in both datasets. We limited further to biallelic nonpalindromic SNPs with minor allele frequency (MAF) >1% in at least one continental ancestral group, resulting in 613,298 SNPs.

gnomAD v2.1
Variants from gnomAD v2.1 (data accessed April 2019) were limited to biallelic PASS variants, resulting in 13,742,683 and 196,606,976 SNPs in the exome and genome gnomAD samples respectively. After limiting further to MAF >1% in at least one gnomAD group and merging with the reference data, 9,835 and 582,550 SNPs remained in the exome and genome data respectively.
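The SNP filters described above (biallelic, nonpalindromic, MAF >1% in at least one group) can be sketched as follows. The function names and argument layout are hypothetical and are not taken from the authors' pipeline; the sketch only encodes the three stated criteria.

```python
# A/T and C/G SNPs read the same on both strands, so strand cannot be resolved.
PALINDROMIC = {frozenset("AT"), frozenset("CG")}


def maf(af):
    """Minor allele frequency for an alternate-allele frequency."""
    return min(af, 1.0 - af)


def keep_snp(ref_allele, alt_alleles, group_afs, maf_threshold=0.01):
    """True for a biallelic, nonpalindromic SNP with MAF > threshold in any group."""
    if len(alt_alleles) != 1:
        return False  # multi-allelic site
    alt_allele = alt_alleles[0]
    if len(ref_allele) != 1 or len(alt_allele) != 1:
        return False  # indel rather than a single nucleotide change
    if frozenset((ref_allele, alt_allele)) in PALINDROMIC:
        return False
    return any(maf(af) > maf_threshold for af in group_afs)
```

For example, `keep_snp("A", ["G"], [0.005, 0.2])` passes because one group has MAF above 1%, while `keep_snp("A", ["T"], [0.3])` is dropped as palindromic.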

ClinVar
ClinVar VCFs (GRCh37/hg19) were downloaded on September 25, 2020 from ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/. gnomAD v2.1 AF and ClinVar variants were merged by chromosome, base pair, and alleles. For three variants, there were multiple ClinVar allele IDs in the same position with the same alleles. All duplicate positions were retained. After merging and restricting to ClinVar classifications of "Uncertain Pathogenicity" and "Conflicting reports of pathogenicity", 42 and 122 variants were present in the exome and genome data, respectively. Ancestry-adjusted AFs were estimated for an African sample from the African/African American gnomAD control AFs using gnomAD EUR as the reference sample.
The American College of Medical Genetics (ACMG) guidelines 23 designate >5% MAF in a control population as stand-alone strong evidence of a variant having benign impact for a rare Mendelian disorder. As such, for the classifications of "Uncertain Pathogenicity" and "Conflicting reports of pathogenicity", we identified variants where the ancestry-adjusted AF and the unadjusted AF fell on opposite sides of the MAF > 5% threshold. Additionally, we identified variants with the classification "Pathogenic", "Likely_pathogenic", or "Pathogenic/Likely_pathogenic" with either adjusted or unadjusted AF above 5%.
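The threshold comparison can be illustrated with a short sketch. It assumes the AF is compared directly against 5%; the variant IDs and AF values are invented for illustration and are not results from the paper.

```python
def crosses_threshold(unadjusted_af, adjusted_af, threshold=0.05):
    """True when exactly one of the two AFs exceeds the threshold."""
    return (unadjusted_af > threshold) != (adjusted_af > threshold)


# Hypothetical variants: (unadjusted AF, ancestry-adjusted AF)
variants = {
    "var_a": (0.046, 0.053),  # moves above 5% after adjustment
    "var_b": (0.051, 0.047),  # moves below 5% after adjustment
    "var_c": (0.200, 0.230),  # above 5% either way
}
flagged = sorted(
    name for name, (raw, adj) in variants.items() if crosses_threshold(raw, adj)
)
```

Here only `var_a` and `var_b` are flagged, since adjustment changes which side of the threshold they fall on.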

Simulations
Using the 1000 Genomes as reference data, we simulated SNP genotypes for all combinations and subsets of the five continental populations. SNP genotypes were simulated using the rmultinom R function with genotype probabilities defined from the AFs for the continental reference ancestral populations assuming Hardy-Weinberg Equilibrium. Ancestry proportions were chosen randomly within an assigned proportion bin to ensure coverage across the range of possible values, especially at the edges of the distribution: 0-0.015, 0.010-0.055, 0.05-0.105, 0.10-0.255, 0.25-0.505. The simulated proportion for the Kth ancestry group was chosen so that the ancestry proportions summed to one.
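A minimal sketch of this kind of simulation for a single SNP is shown below. It uses Python's `random.choices` in place of R's `rmultinom`, and the way individuals are allocated to ancestry groups (rounding each proportion times the sample size) is an assumption; the authors' simulation code is on their GitHub site.

```python
import random


def simulate_allele_frequency(p, n_individuals, rng):
    """Draw diploid genotypes under Hardy-Weinberg Equilibrium; return the sample AF."""
    # P(0 alt alleles), P(1 alt allele), P(2 alt alleles)
    weights = [(1 - p) ** 2, 2 * p * (1 - p), p ** 2]
    genotypes = rng.choices([0, 1, 2], weights=weights, k=n_individuals)
    return sum(genotypes) / (2 * n_individuals)


def simulate_mixture_af(reference_afs, proportions, n_individuals, rng):
    """Pool simulated individuals from each ancestry group and return the pooled AF.

    reference_afs: AF at one SNP for each ancestry group.
    proportions:   mixing proportion for each group (summing to one).
    """
    alt_alleles, total_alleles = 0.0, 0
    for p, share in zip(reference_afs, proportions):
        n_k = round(share * n_individuals)
        if n_k == 0:
            continue  # group too small to contribute any individuals
        alt_alleles += simulate_allele_frequency(p, n_k, rng) * 2 * n_k
        total_alleles += 2 * n_k
    return alt_alleles / total_alleles


rng = random.Random(2021)  # fixed seed so the sketch is reproducible
# Degenerate check: alleles fixed at 0 and 1 give exactly the mixing proportion.
af_mix = simulate_mixture_af([0.0, 1.0], [0.8, 0.2], 100, rng)
# A more realistic draw with hypothetical reference AFs.
af_random = simulate_mixture_af([0.12, 0.45], [0.85, 0.15], 1000, rng)
```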
Simulating across all combinations of continental ancestry groups and the ancestry proportion bins resulted in 5,360 simulation scenarios. We used 1000 replicates within each simulation scenario to assess accuracy and precision. For each simulation replicate, we randomly sampled 100,000 SNPs. We define accuracy as the difference between the mean estimated ancestry proportion and simulated ancestry proportion. We define precision as the standard deviation of the simulation replicates. The simulation code is provided on the method's GitHub site: https://github.com/hendriau/Summix_manuscript.
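The accuracy and precision definitions above translate directly into code. The sketch below assumes the sample standard deviation (n-1 denominator); the replicate values are invented for illustration.

```python
import statistics


def accuracy_and_precision(estimates, simulated_proportion):
    """Accuracy: mean estimate minus simulated value. Precision: SD across replicates."""
    accuracy = statistics.mean(estimates) - simulated_proportion
    precision = statistics.stdev(estimates)  # sample SD; an assumption of this sketch
    return accuracy, precision


# Hypothetical replicate estimates for a simulated proportion of 0.15
accuracy, precision = accuracy_and_precision([0.149, 0.150, 0.151], 0.15)
```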

Real Data Application
Estimating ancestry proportions
Continental reference ancestry proportions were estimated using the summix R function for gnomAD v2.1 African/African American, American/Latinx, East Asian, Other, Non-Finnish European, and South Asian exome and genome including all subgroups (e.g. controls, non-neuro). To estimate the ancestry group proportions, the freezes of 9,830 and 582,421 SNPs in the exome and genome samples respectively were used. To assess stability of the estimates over different numbers of SNPs, we estimated the ancestry proportions from 1000 random samples of sets of N SNPs: 10, 50, 100, 500, 1000, 2500, 5000, 10k, 50k, 100k for genomes, and 10, 50, 100, 500, 1000, 2500, 5000, 7500, 9000 for exomes. Finally, estimates of continental ancestry proportions from Summix using summary AF data were compared to estimates from ADMIXTURE 12 using individual-level data for a sample of N=85 unrelated individuals from the Peruvian 1000 Genomes data.

Block bootstrapping
We used block bootstrapping to estimate uncertainty for the ancestry proportion estimates. We used the sex-averaged centimorgan (cM) map created from Bherer et al 24 to define 1 cM blocks throughout the genome. The na.approx function 9,25 from the zoo R package 25 was used to linearly interpolate cM for SNPs in our dataset that were not observed in Bherer et al. This resulted in 3357 1cM blocks across the genome. Five and 129 SNPs in the exome and genome data respectively that were outside the genetic regions contained in the Bherer et al data set were not linearly interpolated. This resulted in 9,830 and 582,421 SNPs in 2206 and 3353 1cM blocks for the exome and genome gnomAD samples respectively. This final sample was used for all real data analysis. We used 1,000 bootstrap replicates to estimate 95% block bootstrap confidence intervals. The lower and upper confidence intervals were estimated from the 2.5 and 97.5 percentiles of the block bootstrap distribution.
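The block bootstrap can be sketched as below, under simplifying assumptions: SNPs are already assigned to blocks, and a closed-form two-group estimator stands in for the full K-group fit. All names and toy values are hypothetical.

```python
import random
import statistics


def two_group_proportion(snps):
    """Closed-form least-squares proportion of group 1 when K = 2.

    snps: list of (observed_af, ref1_af, ref2_af) tuples.
    """
    num = sum((obs - r2) * (r1 - r2) for obs, r1, r2 in snps)
    den = sum((r1 - r2) ** 2 for _, r1, r2 in snps)
    return min(max(num / den, 0.0), 1.0)  # clip to the feasible range [0, 1]


def block_bootstrap_ci(blocks, estimator, n_boot=1000, seed=0):
    """Percentile 95% CI from resampling whole blocks with replacement.

    blocks: dict mapping a block ID (e.g. a 1 cM bin) to its list of SNP records.
    """
    rng = random.Random(seed)
    block_ids = sorted(blocks)
    estimates = []
    for _ in range(n_boot):
        resampled = rng.choices(block_ids, k=len(block_ids))
        snps = [snp for block_id in resampled for snp in blocks[block_id]]
        estimates.append(estimator(snps))
    # n=40 gives cut points every 2.5%; first and last are 2.5% and 97.5%.
    cuts = statistics.quantiles(estimates, n=40, method="inclusive")
    return cuts[0], cuts[-1]


# Toy blocks built from an exact 85% / 15% mixture (hypothetical AFs).
blocks = {
    0: [(0.19, 0.1, 0.7), (0.71, 0.8, 0.2)],
    1: [(0.39, 0.3, 0.9)],
    2: [(0.57, 0.6, 0.4)],
}
lower, upper = block_bootstrap_ci(blocks, two_group_proportion, n_boot=200)
```

Resampling whole blocks rather than individual SNPs keeps linked SNPs together, which is the motivation for defining blocks on the genetic map. With noise-free toy data every replicate returns the same estimate, so the interval collapses to a point; real data would give a non-degenerate interval.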

Estimating ancestry-adjusted allele frequencies
We estimated and assessed ancestry-adjusted AFs for two samples: (1) an African sample (100% African) estimated from the African/African American gnomAD AFs and (2) an admixed Peruvian sample with average ancestry proportions of 74.6% Indigenous American, 19.9% European, 2.8% African ancestry, 2.7% East Asian ancestry estimated from the American/Latinx gnomAD AFs. The continental ancestry proportions for the admixed Peruvian population were estimated from a subset of unrelated individuals (N = 85) from the 1000 Genomes Peruvian sample using ADMIXTURE 12 . For both the African sample and the Peruvian sample, ancestry-adjusted AFs were estimated using reference AF from either 1000 Genomes or gnomAD. We used reference groups with ≥2% estimated ancestry proportion in the observed gnomAD v2.1 group and normalized the estimated ancestry proportions to total 1. This resulted in K=2 ancestry groups for the African/African American group and K=4 for the American/Latinx group. For the African/African American group with K=2 reference groups, non-Finnish European reference AFs were used (excluding African reference). For the Peruvian population where K=4, non-Finnish European, African, and East Asian reference AFs were used (excluding Indigenous American reference). To estimate the Peruvian ancestry-adjusted AFs using gnomAD reference populations, ancestry-adjusted AFs for a 100% African sample were estimated from the gnomAD African/African American reference group.
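The procedure above (drop reference groups below 2%, renormalize, then adjust) can be sketched for a single SNP. This is a hypothetical illustration, not the adjAF function from the Summix package: the function names and reference AFs are invented, and the proportions are only approximate values for the African/African American group.

```python
def normalize_proportions(proportions, min_proportion=0.02):
    """Drop groups below the cutoff and rescale the rest to sum to 1."""
    kept = {g: p for g, p in proportions.items() if p >= min_proportion}
    total = sum(kept.values())
    return {g: p / total for g, p in kept.items()}


def adjusted_af(observed_af, observed_props, target_props, reference_afs, left_out):
    """Ancestry-adjusted AF for one SNP.

    reference_afs holds AFs for every group except `left_out`; the AF of the
    left-out group is implied by the observed AF and the other references.
    """
    others = [g for g in observed_props if g != left_out]
    implied = observed_af - sum(observed_props[g] * reference_afs[g] for g in others)
    af = (target_props.get(left_out, 0.0) / observed_props[left_out]) * implied
    af += sum(target_props.get(g, 0.0) * reference_afs[g] for g in others)
    return af


# K = 2 after dropping groups under 2% (illustrative proportions).
props = normalize_proportions({"AFR": 0.84, "EUR": 0.14, "EAS": 0.01, "SAS": 0.01})
# Hypothetical SNP with AF 0.30 in AFR and 0.10 in EUR.
observed_af = props["AFR"] * 0.30 + props["EUR"] * 0.10
af_star = adjusted_af(observed_af, props, {"AFR": 1.0}, {"EUR": 0.10}, left_out="AFR")
```

For a 100% African target the adjustment removes the European contribution and rescales by the estimated African proportion, so `af_star` recovers the African AF of 0.30 in this noise-free example. The sketch does not bound the result to [0, 1]; with real, noisy data an implementation would need to decide how to handle values outside that range.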
To assess the accuracy of the ancestry-adjusted estimates, we compared the ancestry-adjusted and unadjusted gnomAD AFs to 1000 Genomes AFs for the target ancestral population (i.e. African or Peruvian Latinx). For these comparisons, we filtered out variants that were called in fewer than 25% of the gnomAD group. This removed 120 and 128 variants for African/African American and American/Latinx gnomAD exome groups and 8 and 11 variants for African/African American and American/Latinx gnomAD genome groups, respectively. We calculated both the absolute and relative difference as shown in Equation (3).
$$\text{absolute difference} = \left| AF^{*} - AF_{target} \right|, \qquad \text{relative difference} = \frac{\left| AF^{*} - AF_{target} \right|}{AF_{target}} \qquad (3)$$
where $AF^{*}$ is the ancestry-adjusted or unadjusted AF and $AF_{target}$ is the 1000 Genomes AF for the target ancestral population. To test whether differences varied by adjustment group, we used a linear mixed effects model with SNP and cM block as random effects, fit with the lmer function from the lme4 package 26 .
The Summix R package enables both estimation of reference ancestry groups using the summix function and ancestry-adjusted AFs using the adjAF function. More details including example data and implementation are available in our package, which is hosted on Bioconductor and our GitHub site: https://github.com/hendriau/Summix.

Shiny app
The Shiny app can be found here: http://shiny.clas.ucdenver.edu/Summix. Within the Shiny app, users can estimate and visualize ancestry proportions for three gnomAD ancestry groups (i.e. African/African American, American/Latinx, Other) for both the exome and genome data.

Simulations
Summix achieved accuracy within 0.001% and precision within 0.1% across all simulation scenarios (Tables S1-S5, Figures 1, S1-S4). Accuracy of the proportion estimates was consistent across the range of simulated mixture proportions with a slight increase in bias near 0 and 1. While bias and variability in the estimates were small for all ancestral groups, AFR had the lowest variability followed by IAM, EAS, EUR, and then SAS with the highest variability across simulation replicates (Tables S1-S5).

Application to gnomAD
Estimating ancestry proportions
We estimated the proportion of reference continental ancestry groups in gnomAD v2.1 African/African American, American/Latinx, East Asian, Other, Non-Finnish European, and South Asian groups and subgroups (e.g. controls, non-TopMED) for both the exome and genome data (Tables 1, S6, S7). As expected, the African/African American groups have primarily AFR (>80%) and EUR ancestry (~15%), likely due to contribution from African American individuals. The exome and genome American/Latinx gnomAD groups had high proportions of both EUR and IAM ancestry (i.e. >35%) and ancestry proportion estimates between 1-6% for the other reference groups. Interestingly, >1% SAS ancestry was estimated in both the exome and genome American/Latinx gnomAD groups, perhaps due to misspecification from a limited number of reference groups. The exome and genome Other gnomAD groups were primarily EUR reference ancestry (>77%). The estimated reference proportions for Non-Finnish Europeans and East Asian were very homogeneous with >96% EUR and 100% EAS estimated reference ancestries, respectively. The South Asian gnomAD exome group had 85% estimated SAS ancestry and ~15% estimated EUR reference ancestry, as expected due to the known ancient admixture events in the region 30-32 .
Despite large differences in sample sizes, the estimated proportion of reference ancestry groups was similar (i.e. within 2%) between exome and genome samples for all groups except American/Latinx, where the exome and genome estimates differed by >5% for the EUR and IAM reference proportions. The reference ancestry proportion estimates for the gnomAD v2.1 subgroups (i.e. controls, non_cancer, non_neuro, and non_topmed) were mostly similar, within ~2% of each other and of the overall gnomAD v2.1 group estimates. The exception was for the American/Latinx and Other genome groups, which sometimes varied by 5-10%. These groups had small sample sizes (N<600) and thus were likely more susceptible to the inclusion or exclusion of subsamples of individuals. Complete results are shown in Tables S6-S7.
We evaluated Summix's ability to estimate reference ancestry proportions using smaller numbers of randomly chosen SNPs. We find that the ancestry proportion estimates stay unbiased regardless of the number of SNPs used to estimate ancestry, while the precision decreases as the number of SNPs decreases. The precision remains fairly tight down to ~500 SNPs, especially when estimating the African/African American gnomAD samples (exomes: SD_AFR=0.0039, SD_IAM=0.0052, SD_EAS=0.0054, SD_EUR=0.0065, SD_SAS=0.0067). This suggests that far fewer SNPs are likely needed to arrive at sample estimates of ancestry proportions (Figure 2, Figure S5).

Figure 2. Precision in ancestry estimates for AFR and AMR gnomAD groups by number of SNPs.
Number of SNPs (x-axis), estimated ancestry proportion (y-axis) for 1,000 replicates; A) AFR exome. B) AMR exome.

Ancestry-adjusted allele frequencies
For both exome and genome data, we estimated the ancestry-adjusted AF for gnomAD African/African American for a target sample with 100% continental African ancestry and for gnomAD American/Latinx for a target Peruvian sample with 74.6% Indigenous American, 19.9% European, 2.8% African, and 2.7% East Asian ancestry proportions. Compared to the unadjusted AFs, the ancestry-adjusted AFs had significantly smaller absolute and relative differences with the target group AF (Table 2, Tables S8-S9, p < 1E-16) regardless of reference data used (i.e. 1000 Genomes or gnomAD). The relative difference was greatest at small MAF while the absolute difference increased as AF increased, consistent with expectation. For SNPs with an alternative AF > 0.9 in 1000 Genomes, the absolute and relative difference between the unadjusted gnomAD and target 1000 Genomes AFs was especially large. This is likely due to both a relatively small number of SNPs in these groups and differences between 1000 Genomes samples and the reference genome. The absolute and relative differences in SNPs with AF > 0.9 are considerably reduced for the ancestry-adjusted AFs (Figures 3-4, Tables 2, S8-S9).

Figure 3. B) absolute difference between target sample AF (1000 Genomes African ancestry) and unadjusted or ancestry-adjusted gnomAD AF by 1000 Genomes AF category; C) relative difference between target 1000 Genomes African ancestry AF and unadjusted or ancestry-adjusted gnomAD AF by 1000 Genomes AF category; unzoomed figures B and C are available in the supplemental (Figure S10). D) scatter plot of target sample 1000 Genomes AF (y-axis) and unadjusted (left), ancestry-adjusted with gnomAD reference (center), and ancestry-adjusted with 1000 Genomes reference (right) gnomAD AF (x-axis).

Figure 4. B) absolute difference between target 1000 Genomes Peruvian ancestry AF and unadjusted or ancestry-adjusted gnomAD AF by 1000 Genomes AF category; C) relative difference between target 1000 Genomes Peruvian ancestry AF and unadjusted or ancestry-adjusted gnomAD AF by 1000 Genomes AF category; unzoomed figures B and C are available in the supplemental (Figure S11). D) scatter plot of target 1000 Genomes AF (y-axis) and unadjusted (left), ancestry-adjusted with gnomAD reference (center), and ancestry-adjusted with 1000 Genomes reference (right) gnomAD AF (x-axis).

The ancestry-adjusted AFs using 1000 Genomes or gnomAD as reference data were very similar. In the exome data, we observed no systematic significant absolute or relative differences between reference datasets (i.e. gnomAD reference vs. 1000 Genomes reference) with the exception of relative difference for the African 0-0.01 bin. Likely due to larger numbers of SNPs in the genome data, we do see significant, albeit very small, differences by reference ancestry between the ancestry-adjusted AF and the AFs in the target 1000 Genomes sample (Tables S8-S9). These differences, while statistically significant, were 10 to 100 times smaller compared to the unadjusted AF.
Lin's concordance correlation coefficient (CCC) was used to estimate agreement between the target sample and the gnomAD AF. In gnomAD exomes, Lin's CCC estimates were higher for the ancestry-adjusted compared to the unadjusted AFs regardless of reference data for both the African and Peruvian groups. We found similar results for the gnomAD genome data. Both the ancestry-adjusted and unadjusted AFs differ more from the target 1000 Genomes AFs in the gnomAD American/Latinx to Peruvian comparison than in the African comparison. This is perhaps due to a larger number of reference ancestry groups or more heterogeneity in Latinx compared to African American samples. Within the gnomAD genome data, we identified some SNPs with a large mismatch between 1000 Genomes and gnomAD AF (Figures S6-S9). These SNPs appear to be mostly multi-allelic with one or more indels as the additional alleles (Tables S10-S11). We recommend caution when using these multi-allelic variants as more research and potentially external validation are likely needed to estimate the true AF of all alleles present.
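For readers unfamiliar with Lin's CCC, a minimal sketch of the standard estimator follows. It uses 1/n variances and covariance as in Lin's original definition; whether the authors used this exact variant or a packaged implementation is not stated, so treat the details as an assumption.

```python
def lins_ccc(x, y):
    """Lin's concordance correlation coefficient between paired values x and y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    return 2 * cov_xy / (var_x + var_y + (mean_x - mean_y) ** 2)


# Perfect agreement gives 1; a constant shift lowers CCC even though
# the Pearson correlation stays at 1.
ccc_identical = lins_ccc([1, 2, 3], [1, 2, 3])
ccc_shifted = lins_ccc([1, 2, 3], [2, 3, 4])
```

Unlike Pearson correlation, CCC penalizes systematic offsets between the two sets of AFs, which is why it suits the question of whether adjusted AFs match the target AFs rather than merely track them.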

Reanalysis of PADI3
Summix can be used to estimate and adjust for ancestry in analyses that use gnomAD and other summary data as external controls. As an exemplar, we repeated the case-control analysis of PADI3 from Malki et al 7 . We found the p-values were higher for the ancestry-adjusted allele counts (ACs) (chi-squared p-value = 0.114, Fisher's exact test p-value = 0.101) compared to the unadjusted gnomAD v2.1 African/African American ACs (chi-squared p-value = 0.029, Fisher's exact test p-value = 0.031) (Tables S12-S13). As expected, this shows that association results are not robust to differences in ancestry. It is likely that the cases used by Malki et al 7 were not 100% African ancestry. Summix could be used to estimate ancestry-adjusted AC in gnomAD given the exact ancestry proportions in the cases. We expect the p-values of association would lie between the unadjusted and adjusted p-values provided, given that the proportion of African ancestry in the cases is likely between the gnomAD unadjusted (0.852) and adjusted (1) values.

ClinVar
As an exemplar for the potential utility of ancestry-adjusted AF in clinical settings, we compare the ancestry-adjusted AF for 100% African ancestry to the unadjusted AF for the gnomAD African/African American sample for a subset of ClinVar variants labeled as pathogenic, uncertain pathogenicity, or conflicting reports (Methods).
Based on previous ACMG guidelines 23 , we focused on variants labeled as either "Uncertain Pathogenicity" or "Conflicting reports of pathogenicity." We identified 68 unique ClinVar variants at 67 positions in the exome and genome data (including 11 variants that were identified in both the exome and genome data) where the ancestry-adjusted AF was above 5% and the unadjusted AF was below 5%; and 42 variants at 41 positions where the ancestry-adjusted AF was below 5% and the unadjusted AF was above 5% (5 of these variants were identified in both the exome and genome data). Overall, we find minor differences of less than 0.05 between ancestry-adjusted and unadjusted AF. Eleven variants had differences greater than 0.05 in AF (Table S14). Some of these variants likely warrant further follow-up.
We identified 29 variants with unique ClinVar IDs (18 in the genome data, 2 in the exome data, and 9 in both the exome and genome data) listed as pathogenic or likely pathogenic in ClinVar with AF>5% for either the unadjusted or adjusted gnomAD African/African American samples. All but three of these variants had adjusted and unadjusted AF>5% with most variants being very common (e.g. MAF>20%) (Table S15). Further inspection of these variants indicated little to no support for pathogenicity. Most of these (N=24) do not have assertion criteria provided and 13 were submitted to the ClinVar repository well before the 2015 update to the ACMG guidelines prompted by increased use of high-throughput sequencing 33 . These variants likely require further review. One of the variants identified as pathogenic and having a high AF is rs429358, one of two variants that define the APOE-ε4 allele. The high AF for rs429358 is expected. The APOE-ε4 allele confers increased risk of Alzheimer's disease in various ancestries including European 34 and African 35,36 , although heterogeneity is observed in effect size and allele frequency across ancestries 37 .

Discussion
Here we describe Summix, a fast, accurate, and precise method to estimate and adjust for population structure within publicly available genetic summary data. We evaluate Summix in over 5,000 simulation scenarios, showing that the accuracy and precision are within 0.001% and 0.1% respectively. In gnomAD, we find heterogeneous ancestry similar to what is expected in African/African American, American/Latinx, Other, and South Asian groups. We provide ancestry proportion estimates for all gnomAD groups and subgroups as well as ancestry-adjusted AFs for an African sample and for a Peruvian sample for others to use.
Using the estimated proportion of continental ancestry groups, we produce ancestry-adjusted AFs for target samples with either continental African ancestry or Peruvian ancestry. When comparing to a sample with matching ancestry, we find that the unadjusted AFs differ significantly more than the ancestry-adjusted AFs regardless of reference ancestry data used (i.e. gnomAD vs 1000 Genomes). The African ancestry-adjusted AFs are more similar to the target 1000 Genomes African AFs than the Peruvian ancestry-adjusted AFs are to the target 1000 Genomes Peruvian sample. The increased dissimilarity for the Peruvian sample may be due to more than two predominant reference ancestry groups, or ancestral differences (including admixture) between gnomAD American/Latinx and 1000 Genomes Peruvian.
While the AFs of putative causal variants from a breadth of ancestral populations in public databases are useful for assessing evidence for clinical pathogenicity of a genetic variant, checking the AF in an ancestral sample that matches the ancestry of the person with the disease is most useful. Summix can be used to provide ancestry-adjusted AFs that precisely match ancestry, increasing the clinical utility of public datasets that may not have an ancestry group that matches the patient. Additionally, Summix can produce ancestry-adjusted AFs matching the ancestry proportions for an external control sample. Here, we evaluate the potential utility of Summix by producing ancestry-adjusted AFs for ClinVar variants and for a case-control analysis of PADI3, a gene where gnomAD was used as external controls to identify association with Central Centrifugal Cicatricial Alopecia in women with African ancestry. While we find mostly minor discrepancies between the unadjusted and ancestry-adjusted AFs, we note that these differences can result in changed levels of evidence of association for case-control analysis and for prioritization of putative causal variants for follow-up. This emphasizes the importance of matching by or adjusting for ancestry differences whenever possible.
Here, we estimate genome-wide continental ancestry proportions. While the mean local ancestry proportions for a sample often approach genome-wide ancestry proportions 38 , there may exist regions of the genome, e.g. regions of selection, where the local ancestry for the sample differs substantially from genome-wide ancestry proportions 39,40 . We expect the ancestry-adjusted AF estimates may be less accurate in regions where the average local ancestry proportions differ from the genome-wide estimates. Our results suggest that Summix can estimate ancestry proportions accurately, although much less precisely, using a relatively small number of randomly chosen SNPs (e.g. ~100). This suggests that Summix may be able to estimate the proportion of local continental ancestry. Here, we evaluate subsets of randomly chosen SNPs, albeit reflecting genome-wide coverage. It could be that using ancestry informative markers (AIMs) 41 or removing uninformative markers could increase the precision of Summix further enabling the estimation of local ancestry proportions.
There are several drawbacks to our current method and implementation. First, our method is currently only able to estimate the proportion of provided reference ancestry groups. We recommend the user include all expected ancestral populations in the reference dataset. Second, here we only evaluate the ability of Summix to estimate five broad continental ancestries. We are actively working on an extension to identify and estimate the proportion of ancestry not in the reference data and are evaluating the performance of Summix on a broader reference panel including fine-scale ancestry. Lastly, differences in ancestry are not the only aspect of public databases that can cause confounding in analyses using external public controls. Differences in sequencing technology and computational variant calling pipelines can also cause biases in allele frequencies due to non-exchangeability of individuals. Many methods have been and are being developed to adjust for this bias when using external controls 4-6,42,43 .
There are many extensions and applications for Summix beyond those evaluated here. First, Summix can be used with any reference ancestry data, needing only AFs. While we provide ancestry-adjusted AFs for a sample, the ancestry-adjusted AFs could be used for individuals, providing potential utility in the clinic. Summix has the potential to be applied to other summary datasets where AFs are provided or can be derived, such as GWAS summary statistics, which are widely available online 44 . Lastly, Summix has similarities to deconvolution methods used in single-cell and other 'omics data types 45-47 , suggesting paths of future development and application.
We provide an R package, Python function, Shiny app, and GitHub site to encourage reproducibility, broad use, and further development of our method. We hope that the methods presented here will be used and extended to improve the utility of valuable publicly available resources especially for individuals and studies with admixed or understudied ancestry.

Supplemental Data
Supplemental Data include eleven figures and fifteen tables.

Declaration of Interests
The authors declare no competing interests.