## Abstract

Searching for genetic variants with unusual differentiation between subpopulations is an established approach for identifying signals of natural selection. However, existing methods generally require discrete subpopulations. We introduce a method that infers selection using principal components (PCs) by identifying variants whose differentiation along top PCs is significantly greater than the null distribution of genetic drift. To enable the application of this method to large data sets, we developed the FastPCA software, which employs recent advances in random matrix theory to accurately approximate top PCs while reducing time and memory cost from quadratic to linear in the number of individuals, a computational improvement of many orders of magnitude. We apply FastPCA to a cohort of 54,734 European Americans, identifying 5 distinct subpopulations spanning the top 4 PCs. Using the PC-based test for natural selection, we replicate previously known selected loci and identify three new genome-wide significant signals of selection, including selection in Europeans at the ADH1B gene. The derived allele of the coding variant rs1229984 has previously been associated to a decreased risk of alcoholism and shown to be under selection in East Asians; we show that it is a rare example of independent evolution on two continents. We also detect new selection signals at IGFBP3 and IGH, which have also previously been associated to human disease.

## Introduction

Searching for genetic variants with unusual differentiation between subpopulations is an established approach for identifying signals of natural selection^{1–6}. We and others have employed this approach to identify signals of selection in a wide range of settings, informing our understanding of genes under evolutionary adaptation^{7–23}. Examples include genes linked to lactase persistence^{9,11}, red blood cell abundance^{17}, hypoxia response^{18}, alcoholism^{14}, kidney disease^{21}, malaria^{7,13,19,23}, HIV/AIDS^{16}, autoimmune disease^{20}, cancer^{19}, cystic fibrosis^{8} and hypertension^{23}. However, the signals of selection identified thus far may represent “only the tip of the iceberg^{24}”, implying that further research on selection will provide additional insights about human disease. Unlike extended haplotype homozygosity (EHH) or allele frequency spectrum based tests for selection, the population differentiation approach is able to detect older selection events and selection on standing variation^{1,3}. In addition, signals of selection detected using population differentiation can flag stratified genetic variants that are susceptible to false-positive associations in genome-wide association studies^{15}.

The population differentiation approach has greatest power when comparing very closely related populations with very large sample size^{19}. The increasing availability of very large population cohorts for genetic analysis provides strong prospects for analyzing subtle differences in ancestry in large sample sizes, but raises the challenge of how to select subpopulations to compare; a population cohort with a single continental ancestry may be better represented by continuous clines rather than discrete clusters^{25–27}, and/or may contain a large number of discrete subpopulations corresponding to a large number of possible population comparisons^{28,29}. Principal components analysis (PCA)^{25,30} offers an appealing alternative to model-based clustering methods^{31,32} for modeling human genetic diversity, and has been applied to infer population structure in many settings^{26,27,30,33–39}. One advantage of PCA is that results for top PCs are not sensitive to the number of PCs analyzed, whereas results of model-based clustering methods often vary with the number of clusters. Another advantage of PCA is its low computational cost, as top PCs can be inferred in time only linear in the number of samples by drawing upon recent advances in random matrix theory^{40–42}, implemented in the FastPCA software that we introduce here. We thus developed a test for selection based on unusual population differentiation along top PCs. Our PC-based test is able to detect novel signals at genome-wide significance, a key consideration in genome scans for selection^{19}.

We ran FastPCA on 54,734 individuals of European descent from the Genetic Epidemiology Research on Adult Health and Aging (GERA) cohort; FastPCA required only 57 minutes of compute time and 2.6GB of RAM for this analysis, orders of magnitude better than any other publicly available software. We detected evidence of population structure along the top 4 PCs, which separated samples into Northern European, Eastern European, Southeast European, Irish and Ashkenazi Jewish subpopulations. Using our PC-based test for selection, we replicate previously known selected loci *LCT, HLA, OCA2* and *IRF4* and identify three new signals of selection at *IGH, IGFBP3* and *ADH1B*. The signal in *ADH1B* was at coding variant rs1229984, which has previously been associated to alcoholism^{43–46} and shown to be under selection in East Asians^{14,45,47,48}; we show that it is a rare example of independent evolution on two continents^{11,12}.

## Methods

### Overview of methods

We first describe the FastPCA algorithm, which is an implementation of the *blanczos* method from Rokhlin *et al.*^{40–42}. As with our previous work on PCA^{25,30}, FastPCA makes use of existing computational literature and does not contain any new computational ideas; nonetheless, we anticipate that the software will be widely used, since to our knowledge it is the only publicly available software for computing top PCs on genetic data in linear time. The algorithm generalizes the method of power iteration^{49}, a technique to estimate the largest eigenvalue and corresponding eigenvector of a matrix. Multiplying a random vector by a square matrix projects that vector onto the eigenvectors of that matrix and then scales it according to the respective eigenvalues of that matrix. After repeating, the projection along the eigenvector with the largest eigenvalue grows faster than the rest and the repeated matrix-by-vector product converges to this eigenvector. Additional eigenvectors can be found by repeating this process and orthogonalizing to previously-found PCs. The *blanczos* method improves on this method by estimating additional PCs. The original estimates are perturbed from the true PCs, but this missing variation is captured by estimating the extra PCs. The original matrix is then projected onto this set of eigenvectors, reducing its dimension while preserving the variation along the top PCs. Traditional PCA methods are applied to this reduced matrix to find accurate estimates of the top PCs of the original matrix.
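Plain power iteration, the starting point that the *blanczos* method generalizes, can be illustrated with a minimal sketch. The function name `power_iteration`, the iteration count and the small test matrix are illustrative assumptions and are not part of the FastPCA software.

```python
import numpy as np

def power_iteration(A, n_iter=200, seed=0):
    """Estimate the largest eigenvalue and eigenvector of a symmetric matrix A."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[0])   # random starting vector
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        w = A @ v                         # scales each eigen-component by its eigenvalue
        v = w / np.linalg.norm(w)         # renormalize to avoid overflow
    eigenvalue = v @ A @ v                # Rayleigh quotient of the converged vector
    return eigenvalue, v

# Small symmetric example matrix (illustrative only).
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
lam, vec = power_iteration(A)
```

Convergence depends on the ratio of the two largest eigenvalues, which is one reason methods that estimate extra components and then refine them can need fewer passes over the data.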

We next describe our PC-based selection statistic, which generalizes a previous selection statistic developed for discrete subpopulations^{19}. We detect unusual allele frequency differences along inferred PCs by making use of the fact that the squared correlation of each SNP to a PC, rescaled to account for genetic drift (see derivation below), follows a chi-square (1 d.o.f.) distribution under the null hypothesis of no selection. We have released open-source software implementing the FastPCA algorithm and PC-based selection statistic (see Web Resources).

### FastPCA algorithm

We are given an input *M* × *N* genotype matrix ***X***, where *M* is the number of SNPs and *N* is the number of individuals (i.e., each row is a SNP, each column is a sample). Each entry in this matrix takes its values from {0,1,2} indicating the count of variant alleles for a sample at a SNP. From this matrix we can generate the normalized *M* × *N* genomic matrix ***Y***, where each row *y*_{i} has approximately mean 0 and variance 1 for SNPs in Hardy-Weinberg equilibrium.
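A normalization consistent with these definitions can be written as follows. This is a sketch under a stated assumption: the allele frequency estimate is taken to be the plain sample frequency over the non-missing genotypes of SNP *i*, whose number is written |x_i|. The estimator used in the software may differ slightly, for example by including a pseudocount.

```latex
\hat{p}_i = \frac{\sum_{j} x_{ij}}{2\,|x_i|},
\qquad
y_{ij} = \frac{x_{ij} - 2\hat{p}_i}{\sqrt{2\hat{p}_i\,(1-\hat{p}_i)}}
```

Under Hardy-Weinberg equilibrium a genotype has mean 2p and variance 2p(1 − p), which is why this centering and scaling gives rows with approximately mean 0 and variance 1.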

Here, *x*_{i} is the row vector of genotypes for SNP *i* and *y*_{i} is the normalized row vector. *x*_{ij} and *y*_{ij} are the genotype and normalized genotype at SNP *i* for sample *j*. |*x*_{i}| is the number of valid genotypes at SNP *i*. All this is used to calculate *p̂*_{i}, the sample allele frequency for SNP *i*, which is used to normalize the genotypes. In practice, the genotype matrix is normalized through the use of a lookup table mapping from genotypes (stored as 0, 1 or 2 copies of the alternate allele, or missing data) to normalized genotypes (using the above formula, with missing data having a normalized value of 0).
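The lookup-table normalization can be sketched as below. The missing-data code `MISSING = 3`, the function name and the use of the plain sample allele frequency are assumptions made for illustration; the sketch also assumes every SNP is polymorphic so that the scale is nonzero.

```python
import numpy as np

MISSING = 3  # hypothetical integer code for a missing genotype

def normalize_with_lookup(X):
    """Normalize an M x N genotype matrix coded 0/1/2, with MISSING for no-calls."""
    X = np.asarray(X)
    Y = np.empty(X.shape, dtype=float)
    for i, row in enumerate(X):
        valid = row[row != MISSING]
        p_hat = valid.sum() / (2.0 * valid.size)      # sample allele frequency
        scale = np.sqrt(2.0 * p_hat * (1.0 - p_hat))  # HWE standard deviation
        # One table per SNP, indexed by genotype code; missing data maps to 0.
        table = np.array([(0 - 2 * p_hat) / scale,
                          (1 - 2 * p_hat) / scale,
                          (2 - 2 * p_hat) / scale,
                          0.0])
        Y[i] = table[row]
    return Y

X = np.array([[0, 1, 2, 1, MISSING],
              [2, 2, 1, 0, 1]])
Y = normalize_with_lookup(X)
```

Because each SNP takes only four possible stored values, the table replaces a per-entry subtraction and division with a single indexed read.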

We are seeking the top *K* PCs for the normalized genomic matrix ***Y***. Traditional PCA algorithms compute the PCs by performing the eigendecomposition of the genetic relationship matrix (*GRM* = ***Y***^{T}***Y***/*M*), a costly procedure which returns all the principal components. FastPCA, which makes use of recent advances in random matrix theory^{40–42}, speeds this process up by only approximating the top *K* PCs.

FastPCA is seeded with a random *N* × *L* matrix ***G***_{0} composed of values drawn from a standard Gaussian distribution. *L* affects the accuracy of the result and should be greater than *K*; for *K* = 10, *L* = 20 is a good choice. Then, for *I* iterations, ***H***_{i} = ***Y*** × ***G***_{i} and ***G***_{i+1} = ***Y***^{T} × ***H***_{i}/*M*. In simulated samples with discrete subpopulations, *I* = 3 was sufficient, but in real datasets, *I* = 10 was found to provide accurate results.

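After the iterative step completes, the singular value decomposition of the matrix ***H*** = (***H***_{0}|***H***_{1}| … |***H***_{I}) is taken: ***H*** = ***U***_{H}**∑**_{H}***V***_{H}^{T}. ***U***_{H} is a low-rank approximation to the column space of ***Y***, where ***Y*** ≈ ***U***_{H}***U***_{H}^{T}***Y***. The SVD of ***T*** = ***U***_{H}^{T}***Y*** = ***U***_{T}**∑**_{T}***V***_{T}^{T} can be computed efficiently and approximates the SVD of ***Y*** since ***Y*** ≈ ***U***_{H}***T*** = (***U***_{H}***U***_{T})**∑**_{T}***V***_{T}^{T}. For the PCA, we are only interested in the left *K* columns of ***V***_{T} and the first *K* entries along the diagonal of **∑**_{T}.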
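The iteration and the final projection can be sketched in NumPy as follows. This is an illustration of the steps described above, not the released FastPCA code: the function name `fast_pca`, the default *L* = 2*K*, the definition of the reduced matrix as `T = U_H.T @ Y`, and the toy three-population data are all assumptions.

```python
import numpy as np

def fast_pca(Y, K, L=None, I=10, seed=0):
    """Approximate the top K PCs of a normalized M x N matrix Y."""
    M, N = Y.shape
    L = 2 * K if L is None else L
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((N, L))          # G_0: random N x L Gaussian seed
    blocks = []
    for _ in range(I + 1):                   # collect H_0, ..., H_I
        H = Y @ G                            # H_i = Y G_i
        blocks.append(H)
        G = Y.T @ H / M                      # G_{i+1} = Y^T H_i / M
    H_all = np.hstack(blocks)                # M x L(I+1)
    U_H, _, _ = np.linalg.svd(H_all, full_matrices=False)
    T = U_H.T @ Y                            # reduced L(I+1) x N matrix
    _, s_T, Vt_T = np.linalg.svd(T, full_matrices=False)
    return Vt_T[:K].T, s_T[:K]               # PCs (N x K) and singular values

# Toy data: three populations with crudely perturbed allele frequencies.
rng = np.random.default_rng(1)
M, N, K = 2000, 90, 2
labels = np.repeat(np.arange(3), N // 3)
p_anc = rng.uniform(0.2, 0.8, size=M)
p_pop = np.clip(p_anc[:, None] + 0.15 * rng.standard_normal((M, 3)), 0.1, 0.9)
X = rng.binomial(2, p_pop[:, labels])
p_hat = X.mean(axis=1, keepdims=True) / 2
Y = (X - 2 * p_hat) / np.sqrt(2 * p_hat * (1 - p_hat))

V_fast, s_fast = fast_pca(Y, K, L=4, I=3)

# Compare against an exact SVD; overlap is the mean squared cosine of the
# principal angles between the two K-dimensional subspaces.
_, s_exact, Vt_exact = np.linalg.svd(Y, full_matrices=False)
overlap = np.linalg.norm(Vt_exact[:K] @ V_fast) ** 2 / K
```

Only products with ***Y*** and ***Y***^{T} touch the full data, so the cost per iteration grows linearly with the number of individuals; the two SVDs operate on matrices with only *L*(*I* + 1) columns or rows.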

### Selection statistic

Previous work^{19} shows that for a SNP *i* genotyped in two populations, the difference in allele frequency estimates approximately follows a normal distribution.

Here, *p̂*_{iq} is the allele frequency estimate of SNP *i* for a sample of size *N*_{q} from population *q* and *F*_{ST} is the measure of differentiation between the two populations. Our goal is to extend this formula to individuals with fractional ancestries, and then to continuous-valued PCs.

First, consider the case with two discrete subpopulations. Rather than treating the subpopulations separately, we define a vector α where α_{j} indicates the ancestry in population 1 (e.g. α_{j} = 1 if sample *j* is in population 1 and 0 if sample *j* is in population 2). *D*_{i} can be rewritten as:

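If we run PCA on this sample, we would ideally get an eigenvector ***ν*** that has value *ν*_{1} for individuals in population 1 and −*ν*_{2} for individuals in population 2, where (since ***ν***^{T}**1** = 0 and ***ν***^{T}***ν*** = 1) *ν*_{1} = √(*N*_{2}/(*N*_{1}*N*)) and *ν*_{2} = √(*N*_{1}/(*N*_{2}*N*)), with *N* = *N*_{1} + *N*_{2}.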

In this case, *D*_{i} can be rewritten as:

In the limiting case where *F*_{ST} approaches 0, the statistic becomes:

Thus, the square of the SNP weight follows a chi-square 1-d.o.f. distribution in the case where *F*_{ST} → 0. In the case where *F*_{ST} ≠ 0, then the scaling parameter has to be changed, but *D*_{i} still follows a normal distribution.
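The limiting result can be checked directly from the two constraints on the eigenvector. The derivation below is a sketch under assumptions not spelled out in the text: *N*_1 and *N*_2 count diploid individuals with *N* = *N*_1 + *N*_2, the frequency estimate in population *q* has binomial sampling variance p_i(1 − p_i)/(2N_q), and genotypes are normalized by the pooled frequency estimate.

```latex
\begin{aligned}
\nu^{T}\mathbf{1}=0 \;&\Rightarrow\; N_1\nu_1 = N_2\nu_2,
\qquad
\nu^{T}\nu=1 \;\Rightarrow\; N_1\nu_1^{2}+N_2\nu_2^{2}=1,\\[4pt]
\nu_1 &= \sqrt{\frac{N_2}{N_1 N}}, \qquad \nu_2=\sqrt{\frac{N_1}{N_2 N}},\\[4pt]
\mathbf{y}_i\,\nu
&= \frac{2N_1\nu_1\,(\hat p_{i1}-\hat p_i) - 2N_2\nu_2\,(\hat p_{i2}-\hat p_i)}{\sqrt{2\hat p_i(1-\hat p_i)}}
 = \sqrt{\frac{2N_1N_2}{N}}\;\frac{\hat p_{i1}-\hat p_{i2}}{\sqrt{\hat p_i(1-\hat p_i)}},\\[4pt]
\operatorname{Var}\!\left(\mathbf{y}_i\,\nu\right)\Big|_{F_{ST}\to 0}
&\approx \frac{2N_1N_2}{N}\left(\frac{1}{2N_1}+\frac{1}{2N_2}\right) = 1 .
\end{aligned}
```

With unit variance and an approximately normal numerator, the squared projection follows a chi-square distribution with 1 d.o.f., in agreement with the statement above.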

In the case with fractional ancestry (α_{j} ∈ [0,1]), , and *D*_{i} can still be estimated using equation (3). The individual allele frequency estimates will still asymptotically follow a normal distribution (because of the Lyapunov central limit theorem^{50}), but will be correlated due to individuals with fractional ancestry contributing to both estimates. Thus, *D*_{i} will still follow a normal distribution, but the variance of equation (2) will not hold.

Now consider the case where we do not have fractional ancestries, but rather an eigenvector that separates individuals along some axis of variation. We can treat the eigenvector as a linear transformation of the ancestry vector:

Substituting these values into (3), we find:

Thus, our new selection statistic *D*_{i} is based on the dot product of the normalized genotypes and the eigenvector. Since the variance of *D*_{i} is not known, it will need to be rescaled in order to follow a *N* (0,1^{2}) distribution.

(1) If we are operating on the same set of SNPs that we used for PCA, then the rescaling of ***y***_{i}***ν*** is straightforward. Because PCA is the same as SVD, we see that ***Y*** = ***U*****∑*****V***^{T}. Here, ***V*** contains the right singular vectors which are equivalent to the PCs, ***U*** contains the left singular vectors which are rescaled SNP weights and **∑** contains the singular values which are the square roots of the eigenvalues of the GRM. ***V*** and ***U*** are unitary, so the columns of ***U*** are guaranteed to have a norm of 1. Multiplying ***U*** by √*M* will then produce a properly normalized vector of differences ***D*** = (*D*_{1}, *D*_{2}, …, *D*_{M})^{T}. In other words:
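The rescaling can be sketched numerically. Writing the SVD of the normalized matrix as Y = UΣV^T, the statistic for SNP *i* along PC *k* is taken here to be √M times the corresponding entry of U, which equals √M (y_i · v_k)/σ_k. The function name, the random stand-in data and the use of `math.erfc` for the chi-square (1 d.o.f.) tail are illustrative assumptions.

```python
import math
import numpy as np

def pc_selection_stats(Y, K):
    """Chi-square (1 d.o.f.) statistics for each SNP along the top K PCs.

    Assumes Y is the same normalized M x N matrix used to compute the PCs.
    """
    M = Y.shape[0]
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    D = math.sqrt(M) * U[:, :K]              # columns of U have unit norm
    chi2 = D ** 2
    # Upper tail of chi-square with 1 d.o.f.: P(Z^2 > x) = erfc(sqrt(x / 2)).
    pvals = np.vectorize(math.erfc)(np.sqrt(chi2 / 2.0))
    return D, chi2, pvals

rng = np.random.default_rng(0)
Y = rng.standard_normal((500, 40))           # toy stand-in for normalized genotypes
D, chi2, pvals = pc_selection_stats(Y, K=2)
```

Since each column of U has norm 1, the squared statistics for a PC average exactly 1 across the SNPs used to compute that PC, which is the calibration property the rescaling is meant to deliver.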

In the case where we are computing selection statistics on a different set of SNPs than the one for which we computed PCs, then the above property is not guaranteed to hold. Specifically, inflation can occur if SNPs with higher differentiation tend to have higher LD, which can occur as a consequence of true selection signals^{51}.

One assumption underlying the statistic is that the true minor allele frequency is not extremely small; otherwise the assumption of normality will not hold^{19}. For this reason, the selection statistic was only computed for those SNPs with minor allele frequency greater than 1% in our sample.

### Simulation framework

Simulated genotypes at a particular SNP were generated for *P* populations by first generating an ancestral population allele frequency *p* from a *Uniform*(0.05, 0.95) distribution. Population allele frequencies were generated by simulating random drift in populations of fixed size *N*_{e}. The number of alternate alleles *z*_{qt} in population *q* at generation *t* was sampled from a *Binomial*(2*N*_{e}, *p*_{q,t−1}) distribution. The population allele frequency at this generation was then calculated as *p*_{qt} = *z*_{qt}/(2*N*_{e}). Population allele frequency simulations were run for 200 total generations (τ) and population size was calculated for a target *F*_{ST} by using the formula *F*_{ST} ≈ τ/(2*N*_{e}). For *F*_{ST} ≈ 0.1, 0.01 and 0.001, *N*_{e} = 1,000, 10,000 and 100,000 respectively. To detect the effect of population bottlenecks at the same level of *F*_{ST}, simulations were also run for τ = 20 and *N*_{e} = 100, 1,000 and 10,000. We also considered simulations with admixed samples. In these simulations, ancestral proportions were sampled from a *Dirichlet*(***a***) distribution, where ***a*** is a vector containing ancestry weightings. Multiplying the ancestry matrix (***A***, dimension *N* × *Q*) by the population allele frequencies (***p***, length *Q*) generated allele frequencies for each admixture fraction (***p***′, length *N*). Individual genotypes were generated from a *Binomial*(2, *p*′_{i}) distribution.
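The simulation steps can be sketched as follows. Function names and the toy parameter values are assumptions for illustration; the drift check at the end uses the first-order relation between the drift variance and τ/(2*N*_{e}), which agrees with the parameter combinations listed above.

```python
import numpy as np

def simulate_drift(M, n_pops, N_e, tau, rng):
    """Wright-Fisher drift from a shared ancestral frequency.

    Returns ancestral frequencies (M,) and population frequencies (M, n_pops).
    """
    p_anc = rng.uniform(0.05, 0.95, size=M)
    p = np.repeat(p_anc[:, None], n_pops, axis=1)
    for _ in range(tau):
        z = rng.binomial(2 * N_e, p)          # alternate-allele counts z_qt
        p = z / (2.0 * N_e)                   # p_qt = z_qt / (2 N_e)
    return p_anc, p

def sample_genotypes(p_pop, n_per_pop, rng):
    """Genotypes for discrete populations with equal sample sizes."""
    labels = np.repeat(np.arange(p_pop.shape[1]), n_per_pop)
    return rng.binomial(2, p_pop[:, labels]), labels

def sample_admixed_genotypes(p_pop, a, N, rng):
    """Genotypes for admixed individuals with Dirichlet(a) ancestry proportions."""
    A = rng.dirichlet(a, size=N)                  # N x Q ancestry matrix
    p_ind = np.clip(p_pop @ A.T, 0.0, 1.0)        # M x N individual frequencies
    return rng.binomial(2, p_ind), A

rng = np.random.default_rng(0)
p_anc, p_pop = simulate_drift(M=2000, n_pops=2, N_e=1000, tau=20, rng=rng)
X, labels = sample_genotypes(p_pop, n_per_pop=50, rng=rng)
X_adm, A = sample_admixed_genotypes(p_pop, a=[1.0, 1.0], N=30, rng=rng)

# Drift variance relative to p(1 - p); expected to be close to tau / (2 N_e) = 0.01.
F_hat = np.mean((p_pop - p_anc[:, None]) ** 2) / np.mean(p_anc * (1 - p_anc))
```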

To assess running time, the simulated datasets had *F*_{ST} = 0.01, *M* = 100*k* SNPs and *N* ≈ {1*k*, 1.5*k*, 2*k*, 3*k*, 5*k*, 7*k*, 10*k*, 15*k*, 20*k*, 30*k*, 50*k*, 70*k*, 100*k*} individuals (since we used 6 populations of equal sample size, we rounded *N* to multiples of 6). Throughout this paper we report CPU time, but due to multi-threading present in the GSL^{52} and OpenBLAS^{53} libraries, run time was about 60% of CPU time. FastPCA accuracy was assessed using *M* = 50*k* SNPs and *N* ≈ 10*k* individuals at *F*_{ST} = {0.001,0.002, …, 0.010}. Calibration and power of the selection statistic were assessed in 2 populations at *F*_{ST} = {0.001,0.002,0.005,0.01,0.1} with *M* = 60*k*, the effective number of independent SNPs in genotype array data^{54}. SNPs under selection were generated in a similar manner as the above, except ancestral allele frequencies were simulated at a fixed allele frequency difference (*D*) by having the sample allele frequencies be .

### Assessing PC accuracy

Accuracy was assessed via the Mean of Explained Variances (MEV) of eigenvectors. Two different sets of *K* *N*-dimensional principal components each produce a column space. A metric for the performance of a PCA algorithm against some baseline is to see how much the column spaces overlap. This is done by projecting the eigenvectors of one subspace onto the other and finding the mean length of the projected eigenvectors. If we have a reference set of PCs (*ν*_{1}, *ν*_{2}, …, *ν*_{K}) against which we wish to evaluate the performance of a set of computed PCs (*u*_{1}, *u*_{2}, …, *u*_{K}), then the performance calculation becomes:

Here, ***U*** is a matrix whose column vectors are the PCs which we are testing. The test matrix can either be the result of another computation or the truth for a simulated sample.

*K* eigenvectors can describe the population structure in a dataset with *K* + 1 populations. They can be constructed by first creating, for each population *k*, an indicator vector whose *j*-th entry is 1 if individual *j* is in population *k* and 0 otherwise. The set of eigenvectors {***ν***_{1}, ***ν***_{2}, …, ***ν***_{K}} is constructed by taking *K* of these vectors, normalizing them to have mean 0, and scaling/orthogonalizing them via the Gram-Schmidt process.
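Both the accuracy metric and the construction of the true PCs can be sketched in a few lines. The sketch takes the MEV to be the mean length of each reference PC after projection onto the span of the tested PCs, as described in words above; the function names are assumptions, and a QR factorization is used in place of an explicit Gram-Schmidt loop.

```python
import numpy as np

def mev(V_ref, U_test):
    """Mean length of the reference PCs (columns of V_ref) projected onto span(U_test)."""
    Q, _ = np.linalg.qr(U_test)                  # orthonormal basis of the tested subspace
    return float(np.mean(np.linalg.norm(Q.T @ V_ref, axis=0)))

def true_pcs(labels, K):
    """K orthonormal, mean-zero vectors built from K population indicator vectors."""
    labels = np.asarray(labels)
    indicators = np.stack([(labels == k).astype(float) for k in range(K)], axis=1)
    centered = indicators - indicators.mean(axis=0)   # mean 0 across individuals
    Q, _ = np.linalg.qr(centered)                     # scaling and orthogonalization
    return Q

labels = np.repeat(np.arange(3), 4)              # 3 populations, 4 individuals each
V = true_pcs(labels, K=2)
perfect = mev(V, V)                              # identical subspaces
within = np.zeros((12, 1))
within[0, 0], within[1, 0] = 1.0, -1.0           # contrast inside population 0
unrelated = mev(V, within)                       # orthogonal to population structure
```

The metric is invariant to rotations within the tested subspace, so two algorithms that return different but equivalent bases for the same top PCs score identically.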

### GERA data set

The GERA dataset includes 62,318 individuals from Northern California typed on a European-specific 670,176-SNP array^{55}. Individuals were filtered to remove those with missing sex information, individuals related according to the provided pedigree data or with observed genomic relatedness greater than 0.05 in the GRM^{56} and individuals with less than 90% European ancestry as predicted by SNPweights^{57} using a worldwide dataset containing European, African, and Asian ancestry. After filtering, 54,734 individuals remained.

SNPs were initially filtered to remove non-autosomal SNPs, SNPs with minor allele frequency less than 1%, and SNPs with >1% missing data, leaving 608,981 SNPs. The second stage of filtering removed SNPs that failed PLINK’s Hardy-Weinberg Equilibrium test^{56} with *p* < 10^{−6}, and performed LD-pruning using PLINK. Due to regions of long-range LD, LD persisted even after one filtering run. Multiple rounds of LD filtering were performed using an *r* ^{2} cutoff of 0.2 until additional rounds of LD filtering did not remove additional SNPs, leaving 162,335 SNPs. FastPCA was run on the pruned set of 162,335 SNPs. Selection statistics were computed on the full set of 608,981 SNPs, prior to H-W filtering and LD-pruning. We note that many of the SNPs producing signals of selection generated significant H-W *p*-values (e.g. H-W *p* =1.37 × 10^{−79} for LCT SNP rs6754311), which is an expected consequence of unusual population differentiation.

SNPweights^{57} was used to predict fractional Northwest European, Southeast European, and Ashkenazi Jewish ancestry for each individual. For plotting purposes, percentage ancestry in each of these three populations was mapped to an integer in [0,255], which was then used for the RGB color value for that sample, so a NW sample would appear red, SE would appear green and AJ would appear blue.

### PC Projection

POPRES^{58} individuals were projected onto these PCs. The left singular vectors (***U***) were generated by multiplying normalized genotypes for all SNPs in GERA (***Y***_{GERA}) by the PCs (***V***) and scaling by the singular values (**∑**), the number of SNPs used to calculate the PCs (*M*) and the number of SNPs used for projection . Projected PCs were then calculated by multiplying the corresponding set of SNPs in POPRES by these singular vectors and scaling again by the singular values: . The projected individuals were overlaid on the PCA plot of GERA individuals and colored according to population membership and consistently with population assignment from SNPweights^{57}.
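The projection idea can be sketched for the simplest case, in which the same SNP set is used for computing the PCs and for projection. The additional rescaling by the two SNP counts mentioned above is omitted, the function names are hypothetical, and the new samples are assumed to be normalized with the reference allele frequencies.

```python
import numpy as np

def snp_weights(Y_ref, K):
    """Left singular vectors (SNP weights), singular values and PCs for the top K PCs."""
    U, s, Vt = np.linalg.svd(Y_ref, full_matrices=False)
    return U[:, :K], s[:K], Vt[:K].T

def project(Y_new, U, s):
    """Project new samples (columns of Y_new, same SNPs as the reference) onto the PCs."""
    return (Y_new.T @ U) / s

rng = np.random.default_rng(0)
Y_ref = rng.standard_normal((300, 20))    # toy stand-in for reference genotypes
U, s, V = snp_weights(Y_ref, K=3)
V_back = project(Y_ref, U, s)             # projecting the reference recovers its PCs
```

Projecting the reference samples themselves returns their original PC coordinates, which gives a quick consistency check before projecting an external panel.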

## Results

### FastPCA Simulations

We used simulated data to compare the running time and memory usage of FastPCA to three previous algorithms: smartpca^{25,30}, PLINK2-pca^{56}, and flashpca^{59} (see Web Resources). We simulated genotype data from six populations with a star-shaped phylogeny using 100k SNPs (typical for real data after LD-pruning) and up to 100k individuals (see Methods). For each run, running time was capped at 100 hours and memory usage was capped at 40GB. The running time and memory usage of FastPCA scaled linearly with simulated dataset size (Figure 1), compared with quadratically or cubically for other methods. The computation became intractable at 50k-70k individuals for smartpca, PLINK2-pca and flashpca. The largest dataset, with 100k SNPs and 100k individuals, required only 56 minutes and 3.2GB of memory with FastPCA (Table S1). (We also note that shellfish (see Web Resources), a parallel PCA implementation, requires *O* (*MN*^{2} + *N*^{3}) and is not computationally tractable on large data sets, as previously demonstrated^{59}). Thus, FastPCA—unlike other publicly available software packages for analyzing genetic data—enables rapid principal components analysis without specialized computing facilities.

We next assessed the accuracy of FastPCA, using PLINK2-pca^{56} as a benchmark. We used the same simulation framework as before, with 10k individuals (1,667 individuals per population) and 50k SNPs. We varied the divergence between populations, as quantified by *F*_{ST}^{60}. We assessed accuracy using the Mean of Explained Variances (MEV) of the 5 population structure PCs (see Methods). We determined that the results of FastPCA and PLINK2-pca were virtually identical (Figure 2). This indicates that FastPCA performs comparably to standard PCA algorithms while running much faster.

### PC-based Selection Statistic Simulations

We evaluated the calibration and power of the PC-based selection statistic. To evaluate calibration, we simulated 60k SNPs undergoing random drift with up to *N* = 50k individuals from two populations differentiated by *F*_{ST} = {0.1,0.01,0.001}. At all values of *N* and *F*_{ST}, the proportion of truly null SNPs reported as significant was well-calibrated at *p*-value thresholds ranging from 10^{−1} to 10^{−5}, and similar results were obtained for simulations with admixture (Table S2). The median of the selection statistic was slightly inflated at *F*_{ST} = 0.1 due to a deficiency in the tail (Table S2 and Figure S1), but well-calibrated at the small values of *F*_{ST} that correspond to our analyses of real data. The selection statistic in the presence of a population bottleneck performed identically to populations differentiated by the same *F*_{ST} level (Table S2).

We evaluated power using the same number of SNPs and samples but at *F*_{ST} = {0.1,0.01,0.005,0.002,0.001} and using a separate set of SNPs under selection where the allele frequency difference between the two populations was varied (*D* = |*p*_{1} − *p*_{2}|). The significance threshold was set to 8.3 × 10^{−7} based on 60k SNPs tested. There was no power to detect selection at *F*_{ST} = 0.1. We observed a phase change at smaller values of *F*_{ST}, where there was no power to detect selection below a specified allele frequency difference threshold, but there was complete power to detect selection at a slightly higher threshold (Figure 3a). We examined this effect in more depth using a range of sample sizes, and determined that the transition from no power to complete power was more sample size dependent at *F*_{ST} = 0.001 (Figure 3b) than at *F*_{ST} = 0.01 (Figure 3c), indicating that sample size is more important when analyzing more closely related populations. We also assessed the effect of admixture on power by sampling ancestry for individuals between the two populations using a *Beta*(*a*, *a*) distribution. We determined that increasing the admixture parameter *a* (which reduces the variation in ancestry across samples) had a similar effect to reducing sample size (Figure S2).
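The significance threshold quoted above matches a Bonferroni correction for 60,000 tests. The family-wise error rate of 0.05 is an assumption here, since the text states only the number of SNPs tested.

```latex
\alpha_{\text{per SNP}} = \frac{0.05}{60{,}000} \approx 8.3 \times 10^{-7}
```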

### Application of FastPCA to a European American Cohort

We ran FastPCA on the GERA cohort (see Web Resources), a large European American dataset containing 54,734 individuals and 162,335 SNPs after QC filtering and LD-pruning (see Methods). This computation took 57 minutes and 2.6GB of RAM. PC1 and PC2 separated individuals along the canonical Northwest European (NW), Southeast European (SE) and Ashkenazi Jewish (AJ) axes^{15}, as indicated by labeling the individuals by predicted fractional ancestry from SNPweights^{57} (Figure 4). These results are consistent with Banda *et al*. 2015^{61} which also examined this dataset. PC3 and PC4 detected additional population structure within the NW population.

To further investigate this subtle structure, we projected POPRES individuals from throughout Europe^{58} onto these PCs^{30} (see Methods). This analysis recapitulated the position of SE populations via the placement of the Italian individuals, and determined that PC3 and PC4 separate the NW individuals into Irish (IR), Eastern European (EE) and Northern European (NE) populations (Figure 5). This visual subpopulation clustering was confirmed via k-means clustering on the top 4 PCs, which consistently grouped the AJ, SE, NE, IR and EE populations separately (Figure S3).

### Application of PC-based Selection Statistic to a European American Cohort

For each of the top PCs, we computed our PC-based selection statistic for 608,981 non-LD-pruned SNPs (see Methods). The resulting Manhattan plots for PCs 1-4 are displayed in Figure 6 (QQ plots are displayed in Figure S4). Analyses of PCs 5-10 indicated that these PCs do not represent true population structure (Figure S5), but are either dominated by a small number of long-range LD loci^{33,62,63} or correlated with the missing data rate across individuals. Selection statistics for PCs 1-4 exhibited little or no inflation, particularly after removing Table 1 regions (Table S3).

Genome-wide significant signals (listed in Table 1) included several known selection regions^{9,64–67} and novel signals at ADH1B, IGFBP3 and IGH (see below). Suggestive signals were observed at additional known selection regions^{66,68} (Table S4). After removing the regions in Table 1, rerunning FastPCA and recalculating selection statistics, all of these regions remained significant except for a region on chromosome 8 with a known chromosomal inversion^{33,62} (Figure S6, Table S5). Thus, the remaining regions are not due to PC artifacts caused by SNPs inside these regions. Detecting subtle signals of selection benefited from the large sample size, as subsampling the GERA data set at smaller sample sizes and recomputing PCs and selection statistics generally led to less significant signals (Table S6).

We identified a genome-wide significant signal of selection at rs1229984, a coding SNP (Arg47His) in the ADH1B alcohol dehydrogenase gene (Table 1). The derived allele has been shown to have a protective effect on alcoholism risk^{43–46} and to produce an REHH signal in East Asians^{14,45,47,48}, but was not previously known to be under selection in Europeans. (Previous studies noted the higher frequency of the derived T allele in western Asia compared to Europe, but indicated that selection or random drift were both plausible explanations^{69,70}.) We examined the allele frequency of the derived T allele in the five subpopulations: AJ, SE, NE, IR and EE (Table S7). We observed derived allele frequencies (DAF) of 0.21 in AJ, 0.10 in SE, and 0.05 or lower in other subpopulations, consistent with the higher frequency of the derived allele in western Asia. A comparison of NE to the remaining subpopulations using the discrete subpopulation selection statistic^{19} also produced a genome-wide significant signal after correcting for all hypotheses tested (Table S8); this is not an independent experiment, but indicates that this finding is not due to assay artifacts affecting PCs.

To further understand the selection at this locus, we examined the allele frequency of rs1229984 in 1000 Genomes project^{71} populations (see Web Resources), along with the allele frequency of the regulatory SNP rs3811801 that may also have been a target of selection in Asian populations^{45}. The haplotype carrying the derived allele at rs3811801 (and corresponding haplotype H7) was absent in populations outside of East Asia (Table S9). This indicates that if natural selection acted on this SNP in Asian populations, selection acted independently at this locus in Europeans. One possible explanation for these findings is that rs1229984 is an older SNP under selection in Europeans, while rs3811801 is a newer SNP under strong selection in Asian populations leading to the common haplotype found in those populations.

The IGFBP3 insulin-like growth factor-binding protein gene had two SNPs reaching genome-wide significance. Genetic variation in IGFBP3 has been associated with breast cancer^{72}, height^{73}, blood pressure^{74} and hypertension^{75}, although the published associated SNPs are not in LD with the two SNPs we detected. The IGH immunoglobulin heavy locus had one genome-wide-significant SNP and two suggestive SNPs with *p*-value < 10^{−6}. Genetic variation in IGH has been associated with multiple sclerosis^{76}, although the published associated SNPs are not in LD with the three SNPs we detected. The IGFBP3 and IGH SNPs each had substantially higher minor allele frequencies in Eastern Europeans (Table S7), but were not genome-wide significant under the discrete subpopulation selection statistic^{19} (Tables S10–S11). The existence of multiple SNPs at each of these loci with *p* < 10^{−6} for the PC-based selection statistic suggests that these findings are not the result of assay artifacts.

## Discussion

We have detected new, genome-wide significant signals of selection by applying a PC-based selection statistic to top PCs computed using FastPCA, a computationally efficient (linear-time and linear-memory) algorithm. Although mixed model association methods are increasingly appealing for conducting genetic association studies^{54,77}, we anticipate that PCA will continue to prove useful in population genetic studies, in characterizing population stratification when present in association studies, in supplementing mixed model association methods by including PCs as fixed effects in studies with extreme stratification, and in correcting for stratification in analyses of components of heritability^{78,79}. Our PC-based selection statistic extends previous statistics developed for discrete populations^{19}. In contrast to previous work on detecting selection using PCs^{63,80} or using the spatial ancestry analysis (SPA) method^{81}, our statistic is able to detect signals at genome-wide significance, a key consideration in genome scans for selection^{82}. In particular, we detected genome-wide significant evidence of selection in Europeans at the ADH1B locus, which was previously reported to be under selection in East Asian populations^{14,45,47,48} using REHH^{51} (which can only detect relatively recent signals and does not work on standing variation^{3}). We also detected genome-wide significant evidence of selection at the disease-associated IGFBP3 and IGH loci. On the other hand, loci with suggestive signals of selection that do not reach genome-wide significance could potentially be used to increase the power of disease mapping^{83}.

We note that our work has several limitations. First, top PCs do not always reflect population structure, but may instead reflect assay artifacts^{84} or regions of long-range LD^{33}; however, PCs 1-4 in GERA data reflect true population structure and not assay artifacts. Second, common variation may not provide a complete description of population structure, which may be different for rare variants^{85}; we note that based on analysis of real sequencing data with known structure, we recommend that LD-pruning and removal of singletons (but not all rare variants) be applied in data sets with pervasive LD and large numbers of rare variants (see Appendix). Third, our selection statistic is only capable of detecting that selection occurred, but not when or where it occurred; indeed, top PCs may not perfectly represent the geographic regions in which selection occurred. Fourth, our selection statistic performs best when allele frequencies vary linearly along a PC; the SPA method^{81} (see above) models allele frequency as a logistic function and is not constrained by this limitation. Despite these limitations, we anticipate that FastPCA and our PC-based selection statistic will prove valuable in analyzing the very large data sets of the future.

## Appendix

Inferring ancestry from genetic data is a common problem in both population and medical genetic studies, and many methods exist to address it^{30,31,86}. Principal components analysis (PCA)^{30} has been shown to be effective at elucidating geographic structure from genetic data^{87} and correcting for confounding due to population stratification in association mapping^{25}. These uses of PCA depend critically on its ability to separate genetically disparate subpopulations when analyzing data from commercial genotyping arrays. However, as high-throughput sequence data becomes more common, enabling ancestry inference from this new class of data is becoming increasingly relevant.

As sequence data contains more variants, and many more population-specific variants^{88}, it may be reasonable to expect that PCA applied to high-throughput sequence data will be substantially more effective than the corresponding analysis on genotype data. However, our results suggest the opposite. Specifically, PCA makes assumptions about marker independence that are violated by the pervasive linkage disequilibrium in sequence data. In addition, assumptions about genetic drift that are reasonable for common SNPs on genotyping arrays are less so when applied to the numerous rare variants in sequence data^{85}.

## Methods

Principal Components Analysis (PCA) is generally applied to a genetic relationship matrix (GRM) that is computed as:
where *x*_{s} is a vector of genotypes for SNP *s* and *p*_{s} is the minor allele frequency of SNP *s*. We propose modifications to standard PCA to deal with two challenges that are present in sequence data but absent from genotype data: pervasive linkage disequilibrium and rare variants. Specifically, we recommend that LD pruning be applied to sequence data and that singleton variants be removed. While we evaluated more sophisticated approaches to handling these issues, they did not improve our results beyond these simpler approaches. Importantly, we recommend against the commonly used strategy of removing all low-frequency and rare variants, as these variants contain significant information for detecting population structure.
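The GRM computation described at the start of this section can be sketched as follows. This is a minimal illustration that assumes the commonly used standardized form, in which each SNP is centered by 2*p*_{s}, scaled by the square root of 2*p*_{s}(1 − *p*_{s}), and the products are averaged over SNPs. The function names `compute_grm` and `top_pcs` are hypothetical and are not part of EIGENSOFT. The frequency of the counted allele is used in place of the minor allele frequency; the scaling term is unchanged because *p*(1 − *p*) is symmetric.

```python
import numpy as np

def compute_grm(genotypes):
    """Return an N x N genetic relationship matrix from an N x M genotype matrix.

    genotypes[i, s] holds the 0/1/2 allele count of individual i at SNP s.
    Assumed form: average over SNPs of the outer product of standardized genotypes.
    """
    x = np.asarray(genotypes, dtype=float)
    p = x.mean(axis=0) / 2.0                  # per-SNP frequency of the counted allele
    keep = (p > 0.0) & (p < 1.0)              # monomorphic SNPs have zero variance
    x, p = x[:, keep], p[keep]
    z = (x - 2.0 * p) / np.sqrt(2.0 * p * (1.0 - p))  # center and scale each SNP
    return z @ z.T / z.shape[1]

def top_pcs(grm, k):
    """Return the top-k eigenvectors (as columns) of a symmetric GRM."""
    vals, vecs = np.linalg.eigh(grm)          # eigh returns ascending eigenvalues
    order = np.argsort(vals)[::-1][:k]        # indices of the k largest
    return vecs[:, order]
```

This exact eigendecomposition is quadratic in memory and is shown only to make the definitions concrete; it is not the randomized FastPCA algorithm.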

### Linkage Disequilibrium

It is well known that application of PCA to regions of the genome containing long-range LD blocks can confound PCA’s ability to separate disparate populations^{30,63}. As a result, these LD blocks are often simply excluded from analysis. However, in sequence data, many regions of the genome outside of previously identified long-range LD blocks contain sufficient LD to bias results. We therefore examine three methods to deal with LD: (1) LD Pruning, (2) LD Shrinkage^{63}, and (3) LD Regression^{30,89}.

LD Pruning is a commonly applied approach to removing correlated SNPs from a dataset. To produce a data set pruned for LD above a threshold *T*, one SNP of any pair of SNPs in LD (*r*^{2} > *T*) is removed from the data.
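A greedy version of this pruning rule can be sketched as below. The function name `ld_prune`, the `window` parameter, and the choice to measure the window in kept SNPs rather than in base pairs are assumptions made for illustration; production tools such as PLINK use their own windowing schemes.

```python
import numpy as np

def ld_prune(genotypes, threshold, window=50):
    """Greedy LD pruning: return the indices of SNPs to keep.

    A SNP is dropped when its r^2 with any of the last `window` kept SNPs
    exceeds `threshold` (the T in the text).
    """
    x = np.asarray(genotypes, dtype=float)
    x = x - x.mean(axis=0)                    # center each SNP
    norms = np.sqrt((x ** 2).sum(axis=0))
    kept = []
    for s in range(x.shape[1]):
        if norms[s] == 0.0:                   # skip monomorphic SNPs
            continue
        in_ld = False
        for t in kept[-window:]:
            r = x[:, s] @ x[:, t] / (norms[s] * norms[t])
            if r * r > threshold:
                in_ld = True
                break
        if not in_ld:
            kept.append(s)
    return kept
```

Because the scan is greedy, the earlier SNP of a correlated pair is always the one retained; other tie-breaking rules (for example, keeping the SNP with the higher MAF) are equally valid.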

LD Shrinkage is a more sophisticated method of correcting for LD proposed by Zou et al. (2012)^{63}. In LD shrinkage, each SNP *s* is weighted by its LD to surrounding SNPs before inclusion in the genetic relationship matrix.

We note that *t* ∈ *window*(*s*) refers to SNPs *t* that are within some region of the genome surrounding SNP *s*. Intuitively, this is a heuristic to correct for the overrepresentation in the GRM of some SNPs that are redundant with respect to nearby SNPs.
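The down-weighting idea can be illustrated with the sketch below. The specific weight used here, the reciprocal of one plus the summed *r*^{2} between SNP *s* and the other SNPs in its window, is an assumption chosen to match the intuition above and should not be read as the exact formula of Zou et al.^{63}. The function name `ld_shrinkage_weights` and the `half_window` parameter (counted in SNPs) are likewise hypothetical.

```python
import numpy as np

def ld_shrinkage_weights(genotypes, half_window=10):
    """Assumed per-SNP weight: 1 / (1 + sum of r^2 with window neighbours)."""
    x = np.asarray(genotypes, dtype=float)
    x = x - x.mean(axis=0)                    # center each SNP
    norms = np.sqrt((x ** 2).sum(axis=0))
    m = x.shape[1]
    weights = np.zeros(m)
    for s in range(m):
        if norms[s] == 0.0:
            continue                          # monomorphic SNPs get weight 0
        lo, hi = max(0, s - half_window), min(m, s + half_window + 1)
        total = 1.0                           # r^2 of the SNP with itself
        for t in range(lo, hi):
            if t == s or norms[t] == 0.0:
                continue
            r = x[:, s] @ x[:, t] / (norms[s] * norms[t])
            total += r * r
        weights[s] = 1.0 / total
    return weights
```

Under this assumed scheme, a SNP in no LD with its neighbours keeps weight 1, while a block of *k* perfectly correlated SNPs contributes roughly as much as a single SNP once each SNP's term in the GRM is multiplied by its weight.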

LD Regression was originally proposed by Patterson et al. (2006)^{30} and utilized extensively by Gusev et al. (2013)^{89}. In this approach, only the residual of a SNP, after regressing out other SNPs in LD, is included in the GRM.
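A minimal sketch of the residualization step follows. It assumes that each SNP is regressed on a fixed number of immediately preceding SNPs; the function name `ld_regression_residuals` and the `n_prev` parameter are hypothetical, and the choice of predictors is an illustrative assumption.

```python
import numpy as np

def ld_regression_residuals(genotypes, n_prev=5):
    """Replace each SNP by its residual after regressing on the preceding SNPs."""
    x = np.asarray(genotypes, dtype=float)
    x = x - x.mean(axis=0)                    # center each SNP
    resid = np.empty_like(x)
    for s in range(x.shape[1]):
        lo = max(0, s - n_prev)
        if lo == s:
            resid[:, s] = x[:, s]             # the first SNP has no predictors
            continue
        predictors = x[:, lo:s]
        # Least-squares fit; lstsq also handles collinear predictors.
        coef, *_ = np.linalg.lstsq(predictors, x[:, s], rcond=None)
        resid[:, s] = x[:, s] - predictors @ coef
    return resid
```

The residual matrix would then take the place of the centered genotypes when the GRM is formed, so that a SNP fully predicted by its neighbours contributes nothing.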

### Rare Variants

In considering how to optimally include rare variants in the genome, we examined three strategies. The first strategy was to include all rare variants as described in the computations above without any modifications. The second strategy was to exclude all variants below a frequency threshold, which is a standard strategy used in several recent papers. We compared these two simple strategies to a third strategy based on reweighting rare variants to optimize the separation between populations.
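The two filtering-based options (frequency thresholding, and the singleton removal evaluated later in this Appendix) can be sketched as below. The function names are hypothetical, and genotypes are assumed to be coded as 0/1/2 allele counts with no missing data.

```python
import numpy as np

def minor_allele_counts(genotypes):
    """Return the minor allele count of each SNP in an N x M genotype matrix."""
    x = np.asarray(genotypes)
    alt = x.sum(axis=0)                       # copies of the counted allele
    return np.minimum(alt, 2 * x.shape[0] - alt)

def remove_singletons(genotypes):
    """Indices of SNPs whose minor allele is observed at least twice."""
    return np.flatnonzero(minor_allele_counts(genotypes) >= 2)

def remove_below_maf(genotypes, maf_threshold):
    """Indices of SNPs with minor allele frequency >= maf_threshold."""
    n = np.asarray(genotypes).shape[0]
    maf = minor_allele_counts(genotypes) / (2.0 * n)
    return np.flatnonzero(maf >= maf_threshold)
```

Note that `remove_singletons` as written also drops monomorphic SNPs, which carry no information for PCA.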

We considered a particular scenario to optimize. Specifically, we imagine that two populations that split from one another *t* generations ago are equally represented in our GRM. We would like to optimize the proportion of variance in our GRM that is explained by the true population labels. That is, our figure of merit is:
where *pop* (*i*) refers to the subpopulation from which individual *i* came.

Now, considering the population split, our data contains two classes of variants: those variants that are the result of mutations predating the population split (pre-split SNPs), and those variants arising after the population split (post-split SNPs). For pre-split SNPs we invoke the normal approximation to genetic drift. That is, the difference between allele frequencies *p*_{1}, *p*_{2} (for populations 1 and 2, respectively) is:
where *p* is the allele frequency in the ancestral population prior to the split and *F*_{ST} quantifies the genetic drift that has occurred since the split. We note that this approximation is reasonable for common SNPs and for small values of *F*_{ST}. If we assume that our data contains only pre-split SNPs, then our figure of merit is optimized by the standard computation of the GRM given above. On the other hand, rare, post-split SNPs have the property that
where is the allele frequency estimated from the sample. This difference implies that the optimal weighting for pre-split SNPs is identically:
but the optimal weighting for post-split SNPs is .

However, this modification requires knowledge of the *F*_{ST} between studied subpopulations and, more dauntingly, which SNPs are post-split. We believe it is reasonable to iterate over several values of *F*_{ST} (and find that in real data results are relatively robust to choice of *F*_{ST}). In order to deal with uncertainty over the set of post-split SNPs, we propose that a SNP be considered post-split if

We examine the effect of both of these modifications on the effectiveness of PCA to separate genetically disparate subpopulations.

### Analysis of Northern vs. Southern Europe in POPRES Targeted Sequencing Data

We analyzed 531 individuals from the UK, referred to as Northern European, and 146 Italian, 134 Portuguese, 100 Spanish, and 7 Swiss Italian individuals, collectively referred to as Southern European^{10}. We excluded 25.9 kb of sequence data from genes on the X chromosome, focusing solely on the autosomes. In total, 8,469 SNPs were polymorphic in either the Northern or the Southern European samples. These variants were overwhelmingly rare, with 81.5% of variants having a MAF < 1% in the combined sample.

We tested various methods to correct for LD and better handle rare variants (see Methods). The results are summarized in Table S12. These results indicate that handling of both rare variants and LD is critical to maximizing the performance of PCA on this class of data. Applying standard PCA, the top 5 PCs explained only 2.3% of the variance (r^{2}=0.023) of the true population labels. This was improved substantially by removing or reweighting rare variants: r^{2}=0.287 when removing variants with MAF < 0.02, r^{2}=0.341 when removing singletons, and r^{2}=0.352 when reweighting. This indicates that rare variants, particularly singletons, may be problematic when analyzed using PCA. However, the difference between removing variants with MAF < 0.02 and reweighting (r^{2}=0.287 vs. 0.352) suggests that these variants do contain useful information for ancestry inference and should not be universally excluded.
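The r^{2} values reported above measure how much of the variance in the true population labels is explained by the top PCs. One plausible way to compute such a quantity is sketched below, as the coefficient of determination from a least-squares regression of a 0/1 label on the PCs plus an intercept. The function name `label_r2` is hypothetical, and the exact procedure used to produce Table S12 may differ.

```python
import numpy as np

def label_r2(pcs, labels):
    """R^2 from regressing 0/1 population labels on the given PCs plus an intercept.

    pcs is an N x k array of PC coordinates; labels has length N.
    """
    y = np.asarray(labels, dtype=float)
    design = np.column_stack([np.ones(len(y)), np.asarray(pcs, dtype=float)])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    total = ((y - y.mean()) ** 2).sum()       # total sum of squares of the labels
    return 1.0 - (resid ** 2).sum() / total
```

For example, passing the top 5 PCs of a GRM together with a Northern/Southern indicator would yield a single value between 0 and 1, with larger values indicating better separation of the two groups.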

Additionally, applying a method to correct for LD significantly improved the performance of PCA when performed in conjunction with singleton exclusion or rare variant reweighting. With rare variant reweighting, LD shrinkage^{63} (r^{2}=0.563) performed slightly better than LD regression^{89} (r^{2}=0.528) and LD pruning (r^{2}=0.534). While LD pruning performed well, this may be because LD is already broken up, since the dataset contains sequence data from separated chunks of the genome.

### Recommendations

In data sets that do not include pervasive LD or large numbers of rare variants (i.e. genotyping data), standard techniques are likely to be successful in detecting population structure. However, in data sets that have pervasive LD and large numbers of rare variants, we recommend that LD pruning and singleton removal be applied. While more sophisticated methods for dealing with these issues were assessed, we did not observe significant improvements above and beyond these simpler approaches. Importantly, we do not recommend that all low frequency and rare variants (MAF < 0.02) be removed as these variants do significantly improve detection of population structure.

## Web Resources

EIGENSOFT version 6.1, including open-source implementation of FastPCA and smartpca and the PC-based selection statistic: https://data.broadinstitute.org/alkesgroup/EIG6.1/

PLINK2: https://www.cog-genomics.org/plink2/

flashpca: https://github.com/gabraham/flashpca

GERA cohort: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000674.v1.p1

1000 Genomes: http://www.1000genomes.org/

Shellfish: http://www.stats.ox.ac.uk/~davison/software/shellfish/shellfish.ph

## Description of Supplemental Data

Supplemental Data include six figures and twelve tables.

## Acknowledgements

We are grateful to D. Reich for helpful discussions and S. Pollack for assistance with FastPCA software. This research was funded by NIH grant R01 HG006399. SM is funded by NSF grants DMS-1209155 and DMS-1418261.
