Ancestry inference and grouping from principal component analysis of genetic data

Here we propose a simple, robust and effective method for global ancestry inference and grouping from Principal Component Analysis (PCA) of genetic data. The proposed approach is particularly useful for methods that need to be applied in homogeneous samples. First, we show that Euclidean distances in the PCA space are proportional to FST between populations. Then, we show how to use this PCA-based distance to infer ancestry in the UK Biobank and the POPRES datasets. We propose two solutions, either relying on projection of PCs to reference populations such as from the 1000 Genomes Project, or by directly using the internal data. Finally, we conclude that our method and the community would benefit from having an easy access to a reference dataset with an even better coverage of the worldwide genetic diversity than the 1000 Genomes Project.


Introduction
Principal Component Analysis (PCA) has been widely used to correct for population structure in association studies and has been shown to mirror geography in Europe (Price et al. 2006;Novembre et al. 2008). Due to its popularity, many methods has been developed for efficiently performing PCA (Abraham et al. 2017; not the closest ones. For example, when using 4 PCs, there are pairs of populations with an F ST of~0.02 while their PC centers are superimposed ( Figure S5). When using more PCs (8,16 or 25) to compute the distances, results remain similar.

PCA-based ancestry inference
We project the dataset of interest onto the PCA space of the 1000G data using the fast tools developed in Privé et al. (2020). We recall that this uses an automatic removal of LD when computing PCA and a correction for shrinkage in projected PC scores, which has been shown to be particularly important when using PC scores for ancestry estimation (Zhang et al. 2020). Based on the results from the previous section, we propose to assign individual ancestry to one of the 26 1000G populations based on the Euclidean distance to these reference population centers (geometric medians † ) in the PCA space. Since we showed previously that (squared) distances in the PCA space are proportional to F ST , we can set a threshold on these distances that would correspond approximately to an F ST of e.g. 0.002. This threshold is close to the dissimilarity between Spanish and Italian people (F ST (IBS, TSI) of 0.0015). When an individual is not close enough to any of the 26 1000G populations, we leave its ancestry inference as unknown, otherwise we assign this individual to the closest reference population center.
We first perform ancestry estimation for the individuals in the UK Biobank † . These individuals seem to originate from many parts of the world when we project onto the PCA space of the 1000G ( Figure S8).
Self-reported ancestry (Field 21000) is available for almost all individuals, with only 1.6% with unknown or mixed ancestry. When using the threshold defined before, we could not infer ancestry for 4.6% of all 488,371 individuals. More precisely, among "British", "Irish" and "White" ancestries, this represented respectively 2.2%, 3.3% and 7.9% (Tables 1 and S7). This also represented 3.3% for "Chinese", 13.8% for "Indian" and 17.8% for "African" ancestries. Finally, mixed ancestries were particularly difficult to match to any of the 1000G populations, e.g. 97.3% unmatched within "White and Black Africa" and 93.0% within "White and Asian" ancestries. Only 47 individuals were misclassified in "super" population of the 1000G; e.g. six "British" were classified as South Asians, one "Chinese" as European and 25 "Caribbean" as South Asian by our method (Table 1). However, when comparing the location of these mismatched individuals to the rest of individuals on the PCA space computed within the UK Biobank (Bycroft et al. 2018), it seems more probable that our genetic ancestry estimate is exact while the self-reported ancestry is not matching the underlying genetic ancestry for these individuals ( Figure S9). This possible discrepancy between selfreported ancestry and genetic ancestry has been reported before (Mersha and Abebe 2015).
We also test the approach proposed in Zhang et al. (2020) which consists in finding the 20 nearest neighbors in 1000G and computing the frequency of (super) population membership, weighted by the inverse distance to these 20 closest 1000G individuals. When this probability is less than 0.875, they leave the ancestry as unknown, aiming at discarding admixed individuals. Less than 0.5% could not be matched by their method (Table S6). Of note, they could match much more admixed individuals, whereas they set a high probability threshold aiming at discarding such admixed individuals. Morever, there are many more discrepancies between their method and the self-reported ancestry in the UK Biobank (Table S6) compared to the previous results with our method (Table 1). Finally, our method is able to accurately differentiate between sub-continental populations such as differentiating between Pakistani, Bangladeshi and Chinese people (Table S7). We also applied our ancestry detection technique to the European individuals of the POPRES data . Only 16 out of the 1385 individuals (1.2%) could not be matched, of which 11 were from East or South-East Europe (Table S8). Note that all individuals that we could match were identified as of European ancestry. We could also identify accurately sub-regions of Europe; e.g. 261 out of 264 Spanish and Portugese individuals were identified as "Iberian Population in Spain" (EUR_IBS ,   Table S8).

PCA-based ancestry grouping
Finally, we show several ways how to use our ancestry inference method for grouping genetically homogeneous individuals. One first possible approach is to simply match individuals that are close enough to one of the 1000G populations, as described previously. Alternatively, one could use the internal PC scores and the self-reported ancestries or countries of birth, e.g. available in the UK Biobank (Fields 21000 and 20115).
This solution does not require projecting individuals to the 1000G, but does require computing PC scores within the dataset instead. In the UK Biobank data, we can define centers of the seven self-reported ancestry groups: British, Indian, Pakistani, Bangladeshi, Chinese, Caribbean and African; then match all individuals to one of these centers (or none if an individual is far from all centers). This enables e.g. to capture a larger set of individuals who are close enough to British people (e.g. Irish people), while discarding individuals whose genetic ancestry is not matching the self-reported ancestry (Table S9). Only 3.7% of all individuals could not be matched. The resulting clusters are presented in the PCA space in figure 2.
One could do the same using the countries of birth instead of the self-reported ancestries. Again, the country of birth may sometime not reflect the ancestral origin. Therefore, we first compute the robust centers (geometric medians) of all countries with at least 300 individuals. Then, we cluster these countries based on their distance in the PCA space to make sure of their validity as proxies for genetic ancestry and to choose a small subset of centers with good coverage of the overall dissimilarities ( Figure S10). Based on the previous clustering and the available sample sizes, we chosed to use the centers from the following eight countries as reference: the United Kingdom, Poland, Iran, Italy, India, China, "Caribbean" and Nigeria. Only 2.8% of all individuals could not be matched (Table S10). The resulting clusters are presented in the PCA space in figure S11. Note that these clusters probably include individuals from nearby countries as well.
Finally, when we know that the data is composed of a predominant ancestry, we can define a single homogeneous cluster by simply restricting to individuals who are close enough to the overall center of all individuals ( Figure S12). When doing so, we can cluster 91% of the data into one cluster composed of 421,871 British, 12,039 Irish, 8351 "Other White", 1814 individuals of unknown ancestry, 467 "White" and 41 individuals of other self-reported ancestries. This is made possible because we use the geometric median which is robust to outliers.

Discussion
Here we propose a PCA-based method for ancestry inference and grouping individuals into genetically homogeneous clusters. We show how the PCA-based distance is related to the F ST , which allows to compute distances based on PC scores directly. This relation between F ST and (squared) Euclidean distances in the PCA space has been previously shown for two populations only (McVean 2009). Previously, we and others proposed to use (robust) Mahalanobis distances to infer ancestry or identify a single homogeneous group of individuals (Peterson et al. 2017;Privé et al. 2020). When looking at distances between two populations, this corresponds to using the Bhattacharyya distance, which appears suboptimal here compared to using a simple Euclidean distance. We hypothesize that the main issue with this approach is that an admixed population covers a large volume in the PCA space, therefore all distances to this population cluster are small because of the covariance component from the Mahalanobis distance. In contrast, here we propose to directly use the global scale of the PC scores, which is invariant from the cluster scattering. This global scale makes it also more robust to infer ancestry with our method as compared to using relative proportions from k=20 nearest neighbors (kNN, Zhang et al. (2020)). Indeed, consider e.g. an admixed individual of say 25% European ancestry and 75% African ancestry. The kNN-based method is likely to identify this individual as of African ancestry, while our method will probably be unable to match it, which is a beneficial feature when we are interested in defining genetically homogeneous groups. We also believe our proposed method to be more robust than machine learning methods, because a machine learning method would try e.g. to differentiate between GBR and CEU 1000G populations, which are two very close populations of Northwest Europe (F ST of 0.0002). In other words, our distance-based method should benefit from the inclusion of any new reference population while it would make it increasingly complex to apply machine learning methods.
Yet, our proposed method also has limitations. First, since we match target individuals to 1000G populations, if individuals are far from all 26 1000G populations, then they would not be matched. When looking at the POPRES data, more individuals from East Europe could not be matched. This is not surprising because there are no East European population in the 1000G data. Moreover, if we look at the location of the 1000G populations on a map, we can see that it lacks representation of many parts of the world ( Figure   3). This issue has also been reported e.g. for Asian populations (Lu and Xu 2013). Therefore more diverse populations should be aggregated to better cover the worldwide genome diversity, which would also improve our proposed method. Nevertheless, we also show how to define homogeneous ancestry groups without using the 1000G data, either by using self-reported ancestries or countries of birth. When a predominant genetic ancestry is present in the data, such as British in the UK Biobank (Bycroft et al. 2018) or Danish in the iPSYCH data (Pedersen et al. 2018), we also show how to directly restrict to a homogeneous subset of the data.
A second potential limitation of our method is that it has two hyper-parameters: the number of PCs used to compute the distances and the threshold on the minimum distance to any cluster center above which the ancestry is not matched. Several studies have used only the first two PCs for ancestry inference. We have shown here that using two PCs (or even four) is not enough for distinguishing between populations at the sub-continental level ( Figure S5). As in Privé et al. (2020), we recommend to use all PCs that visually separate some populations. Moreover, we believe our proposed method to be robust to increasing the number of PCs used because contribution to the Euclidean distance is smaller for later PCs than for first PCs. As for the distance limit, we have shown here how to define it to approximately correspond to an F ST of 0.002.
Alternatively, a threshold can be chosen based on the visual inspection of the histogram of distances (on a log scale). This threshold can also be adjusted depending on how homogeneous one want each cluster to be.
In conclusion, we believe our proposed approach to be a simple and robust way to infer global ancestry and to define groups of homogeneous ancestry. It is also very fast, allowing to infer ancestry for 488,371 individuals in 20 minutes using 16 cores. . Countries in grey contain less than 30 individuals, therefore their percentages are not represented. Red points represent the locations of the 1000G populations, accessed from https://www.internationalgenome.org/data-portal/ population. Note that "Gujarati Indian from Houston, Texas" were manually moved to Gujarat (22.309425, 72.136230), "Sri Lankan Tamil from the UK" to Sri Lanka (6.927079, 79.861244), and "Indian Telugu from the UK" to (16.5, 79.5) to better reflect the location of their ancestors. Also note that "Utah Residents with Northern and Western European Ancestry", "Americans of African Ancestry in SW USA", "African Caribbeans in Barbados" and "Mexican Ancestry from Los Angeles USA" are probably not located at their ancestral location.

Software and code availability
The newest version of R package bigsnpr can be installed from GitHub (see https://github.com/ privefl/bigsnpr). The code used in this paper is available at https://github.com/privefl/ paper-ancestry-matching/tree/master/code.

Supplementary Materials
Definitions † and methods • The F ST measures the relative amount of genetic variance between populations compared to the total genetic variance within these populations (Wright 1965). We use the weighted average formula proposed in Weir and Cockerham (1984), which we now implement in our package bigsnpr (Privé et al. 2018).
• The Principal Component (PC) scores are defined as U ∆, where U ∆V T is the singular value decomposition of the (scaled) genotype matrix (Privé et al. 2020). They are usually truncated, e.g.
corresponding to the first 20 principal dimensions only.
• The Bhattacharyya distance between two multivariate normal distributions N (µ 1 , and |M | is the absolute value of the determinant of matrix M (Bhattacharyya 1943;Fukunaga 1990). The mean and covariance parameters for each population are computed using the robust location and covariance parameters as proposed in Privé et al. (2020).
• The geometric median of points is the point that minimizes the sum of all Euclidean distances to these points. We now implement this as function geometric_median in our R package bigutilsr.
• The UK Biobank is a large cohort of half a million individuals from the UK, for which we have access to both genotypes and multiple phenotypes (https://www.ukbiobank.ac.uk/). We apply some quality control filters to the genotyped data; we remove individuals with more than 10% missing values, variants with more than 1% missing values, variants having a minor allele frequency < 0.01, variants with P-value of the Hardy-Weinberg exact test < 10 −50 , and non-autosomal variants.
This results in 488,371 individuals and 504,139 genetic variants.      Figure S8: First 18 PC scores of the 1000G data (in black), onto which the UK Biobank data has been projected (in red). Figure S9: PC scores (computed in the UK Biobank) colored by self-reported ancestry. On the left, these are 50,000 random individuals. On the right, these are the 47 individuals with some discrepancy between their self-reported-ancestry and our ancestry estimation.