Principal components analysis (PCA) is a widely used tool for inferring population structure and correcting confounding in genetic data^{1–8}. We introduce a new algorithm, FastPCA, that leverages recent advances in random matrix theory^{9–11} to accurately approximate top PCs while reducing time and memory cost from quadratic to linear in the number of individuals, a computational improvement of many orders of magnitude. We apply FastPCA to a cohort of 54,734 European Americans, identifying 5 distinct subpopulations spanning the top 4 PCs. Using a new test for natural selection based on population differentiation along these PCs, we replicate previously known selected loci and identify three new signals of selection, including selection in Europeans at the ADH1B gene. The coding variant rs1229984 has previously been associated with alcoholism^{12–14} and shown to be under selection in East Asians^{13,15–17}; we show that it is a rare example of independent evolution on two continents^{18,19}.

The FastPCA method generalizes the method of power iteration^{20}, a technique to estimate the largest eigenvalue and corresponding eigenvector of a matrix. A random vector is repeatedly multiplied by a target matrix and normalized. Each multiplication scales the vector's projection onto each eigenvector of the matrix by the corresponding eigenvalue, so the projection along the eigenvector with the largest eigenvalue grows faster than the rest, and the product converges to this eigenvector. The method of power iteration can be combined with the Gram-Schmidt orthogonalization process to produce an orthonormal basis of the top eigenvectors, by repeating this process and orthogonalizing subsequent vectors against previously estimated eigenvectors^{20}. In genetic data sets, it is of interest to estimate the top eigenvectors of a genetic relationship matrix (GRM) between individuals^{1,2}. However, this matrix requires time *O*(*MN*^{2}) to compute (where *M* is the number of SNPs and *N* is the number of individuals) and time *O*(*N*^{3}) to decompose, a time cost that may be prohibitive in large data sets. Instead, FastPCA uses a block-Lanczos process to construct an accurate estimate for the top PCs; accuracy is improved by estimating additional PCs and using them to create a low-rank approximation of the genotype matrix^{9–11}. Singular value decomposition is then applied to the low-rank genotype matrix approximation to approximate the top eigenvectors of the GRM (see Online Methods), reducing time cost and memory usage to *O*(*MN*), which is much more tractable than other methods (see below). In addition, we generalize a previous selection statistic developed for discrete subpopulations^{21} to detect unusual allele frequency differences along inferred PCs. This is based on the fact that the squared correlation of each SNP to a PC, rescaled to account for genetic drift, follows a chi-square (1 d.o.f.) distribution under the null hypothesis of no selection.
We have released open-source software implementing the methods (see Web Resources).
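As an illustration of the power iteration idea that FastPCA generalizes, the following is a minimal sketch (not the released software) of power iteration combined with Gram-Schmidt orthogonalization. The function name `top_eigenvectors`, the iteration count and the use of NumPy are assumptions made for this example, and the input is assumed to be a symmetric positive semi-definite matrix such as a GRM.

```python
import numpy as np

def top_eigenvectors(A, k, n_iter=500, seed=0):
    """Estimate the top-k eigenpairs of a symmetric PSD matrix by power iteration."""
    rng = np.random.default_rng(seed)
    vecs = []
    for _ in range(k):
        v = rng.standard_normal(A.shape[0])      # random starting vector
        for _ in range(n_iter):
            v = A @ v                            # multiply by the target matrix
            for u in vecs:                       # Gram-Schmidt: remove components
                v -= (u @ v) * u                 # along eigenvectors already found
            v /= np.linalg.norm(v)               # normalize
        vecs.append(v)
    V = np.column_stack(vecs)
    eigvals = np.array([v @ A @ v for v in vecs])  # Rayleigh quotients
    return eigvals, V
```

Each additional eigenvector requires its own sequence of matrix-vector products, which is one motivation for block methods that process several vectors at once.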

We used simulated data to compare the running time and memory usage of FastPCA to three previous methods: smartpca^{1,}^{2}, PLINK2-pca^{22}, and flashpca^{23} (see Web Resources). We simulated genotype data from six populations with a star-shaped phylogeny using 100k SNPs (typical for real data after LD-pruning) and up to 100k individuals (see Online Methods). For each run, running time was capped at 100 hours and memory usage was capped at 40GB. The running time and memory usage of FastPCA scaled linearly with simulated dataset size (Figure 1), compared with quadratically or cubically for other methods. The computation became intractable at 50k-70k individuals for smartpca, PLINK2-pca and flashpca. The largest dataset, with 100k SNPs and 100k individuals, required only 56 minutes and 3.2GB of memory with FastPCA (Supplementary Table 1). Thus, FastPCA enables rapid principal components analysis without specialized computing facilities.

We next assessed the accuracy of FastPCA, using PLINK2-pca^{22} as a benchmark. We used the same simulation framework as before, with 10k individuals (1,667 individuals per population) and 50k SNPs. We varied the divergence between populations, as quantified by *F*_{ST}^{24}. We assessed accuracy using the Mean of Explained Variances (MEV) of the 5 population structure PCs (see Online Methods). We determined that the results of FastPCA and PLINK2-pca were virtually identical (Figure 2). This indicates that FastPCA performs comparably to standard PCA algorithms while running much faster.

We ran FastPCA on the GERA cohort (see Web Resources), a large European American dataset containing 54,734 individuals and 162,335 SNPs after QC filtering and LD-pruning (see Online Methods). This computation took 57 minutes and 2.6GB of RAM. PC1 and PC2 separated individuals along the canonical Northwest European (NW), Southeast European (SE) and Ashkenazi Jewish (AJ) axes^{25}, as indicated by labeling the individuals by predicted fractional ancestry from SNPweights^{26} (Figure 3). PC3 and PC4 detected additional population structure within the NW population.

To further investigate this subtle structure, we projected POPRES individuals from throughout Europe^{27} onto these PCs^{2} (see Online Methods). This analysis recapitulated the position of SE populations via the placement of the Italian individuals, and determined that PC3 and PC4 separate the NW individuals into Irish (IR), Eastern European (EE) and Northern European (NE) populations (Figure 4). This visual subpopulation clustering was confirmed via k-means clustering on the top 4 PCs, which consistently grouped the AJ, SE, NE, IR and EE populations separately (Supplementary Figure 1).

Population differentiation between closely related populations can be valuable in detecting signals of natural selection^{21,25,28,29}. We generalized a previous method for detecting selection across discrete subpopulations^{21} to detect unusual allele frequency differences along inferred PCs by analyzing the squared correlations of the genotypes at each SNP to a PC. These squared correlations, rescaled to account for population differences due to genetic drift, follow a chi-square (1 d.o.f.) distribution under the null hypothesis of no selection (see Online Methods), as confirmed by simulations (Supplementary Figure 2, Supplementary Table 2). Using the PCs computed on the 162,335 LD-pruned SNPs, we calculated these selection statistics for 608,981 non-LD-pruned SNPs (see Online Methods). The resulting Manhattan plots for PCs 1-4 are displayed in Figure 5 (QQ plots are displayed in Supplementary Figure 3). Analyses of PCs 5-10 indicated that these PCs do not represent true population structure (Supplementary Figure 4), but are either dominated by a small number of long-range LD loci^{30–32} or are correlated with the missing genotyping rate in individuals.

Genome-wide significant signals (listed in Table 1) included several known selection regions^{33–37} and novel signals at ADH1B, IGFBP3 and IGH (see below). Suggestive signals were observed at additional known selection regions^{36,}^{38} (Supplementary Table 3). After removing the regions in Table 1, rerunning FastPCA and recalculating selection statistics, all of these regions remained significant except for a chromosomal inversion on chromosome 8^{30,}^{31} (Supplementary Figure 5, Supplementary Table 4). Thus, the remaining regions are not due to PC artifacts caused by SNPs inside these regions. Detecting subtle signals of selection benefitted from the large sample size, as subsampling the GERA data set at smaller sample sizes and recomputing PCs and selection statistics generally led to less significant signals (Supplementary Table 5).

We identified a genome-wide significant signal of selection at rs1229984, a coding SNP (Arg47His) in the ADH1B alcohol dehydrogenase gene (Table 1). The derived allele has been shown to have a protective effect on alcoholism^{39} and to produce an REHH signal^{40} in East Asians^{16}, but was not previously known to be under selection in Europeans. (Previous studies noted the higher frequency of the derived T allele in western Asia compared to Europe, but indicated that selection or random drift were both plausible explanations^{41,}^{42}.) We examined the allele frequency of the derived T allele in the five subpopulations: AJ, SE, NE, IR and EE (Supplementary Table 6). We observed derived allele frequencies (DAF) of 0.21 in AJ, 0.10 in SE, and 0.05 or lower in other subpopulations, consistent with the higher frequency of the derived allele in western Asia. A comparison of NE to the remaining subpopulations using the discrete subpopulation selection statistic^{21} also produced a genome-wide significant signal after correcting for all hypotheses tested (Supplementary Table 7); this is not an independent experiment, but indicates that this finding is not due to assay artifacts affecting PCs.

To further understand the selection at this locus, we examined the allele frequency of rs1229984 in 1000 Genomes project^{43} populations (see Web Resources), along with the allele frequency of the regulatory SNP rs3811801 that may also have been a target of selection in Asian populations^{13}. The haplotype carrying the derived allele at rs3811801 (and corresponding haplotype H7) was absent in populations outside of East Asia (Supplementary Table 8). This indicates that if natural selection acted on this SNP in Asian populations, selection acted independently at this locus in Europeans. One possible explanation for these findings is that rs1229984 is an older SNP under selection in Europeans, while rs3811801 is a newer SNP under strong selection in Asian populations leading to the common haplotype found in those populations.

The IGFBP3 insulin-like growth factor-binding protein gene had two SNPs reaching genome-wide significance. Genetic variation in IGFBP3 is associated with increased risk of breast cancer^{44} and is also associated with pulse pressure^{45}, blood pressure and hypertension^{46}. The IGH immunoglobulin heavy locus had one genome-wide-significant SNP and two suggestive SNPs with *p*-value < 10^{-6}. Genetic variation in IGH is associated with multiple sclerosis^{47}. The IGFBP3 and IGH SNPs each had substantially higher minor allele frequencies in Eastern Europeans (Supplementary Table 6) but were not genome-wide significant under the discrete subpopulation selection statistic^{21} (Supplementary Tables 9-10). However, the existence of multiple SNPs at each of these loci with *p* < 10^{-6} for the PC-based selection statistic suggests that these findings are not the result of assay artifacts.

We have presented FastPCA, a computationally efficient (linear-time and linear-memory) algorithm for accurately estimating top PCs. Although mixed model association methods are increasingly appealing for conducting genetic association studies^{48,49}, we anticipate that PCA will continue to prove useful in population genetic studies, in characterizing population stratification when present in association studies, in supplementing mixed model association methods by including PCs as fixed effects in studies with extreme stratification, and in correcting for stratification in analyses of components of heritability^{50,51}. We have also presented a new method to detect selection along top PCs in datasets with subtle population structure. This method can detect selection at genome-wide significance, an important consideration in genome-wide selection scans. In particular, we detected genome-wide significant evidence of selection in Europeans at the ADH1B locus, which was previously reported to be under selection in East Asian populations^{13,15–17} using REHH^{40} (which can only detect relatively recent signals and does not work on standing variation^{52}), and at the disease-associated IGFBP3 and IGH loci.

We note that our work has several limitations. First, top PCs do not always reflect population structure, but may instead reflect assay artifacts^{53} or regions of long-range LD^{31}; however, PCs 1-4 in GERA data reflect true population structure and not assay artifacts. Second, common variation may not provide a complete description of population structure, which may be different for rare variants^{54}; based on analysis of real sequencing data with known structure, we recommend that LD-pruning and removal of singletons (but not all rare variants) be applied in data sets with pervasive LD and large numbers of rare variants (see Supplementary Note). Third, our selection statistic is only capable of detecting that selection occurred, but not when or where it occurred; indeed, top PCs may not perfectly represent the geographic regions in which selection occurred. Despite these limitations, we anticipate that the methods introduced here will prove valuable in analyzing the very large data sets of the future.

## Web Resources

EIGENSOFT version 6.0.1, including open-source implementation of FastPCA and smartpca: http://www.hsph.harvard.edu/alkes-price/software/

PLINK2: https://www.cog-genomics.org/plink2/

flashpca: https://github.com/gabraham/flashpca

GERA cohort: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000674.v1.p1

1000 Genomes: http://www.1000genomes.org/

## Online Methods

### Description of FastPCA method

We are given an input *M* × *N* genotype matrix **X**, where *M* is the number of SNPs and *N* is the number of individuals (i.e. each row is a SNP, each column is a sample). Each entry in this matrix takes its values from {0,1,2}, indicating the count of variant alleles for a sample at a SNP. From this matrix we can generate the normalized *M* × *N* genomic matrix **Y**, where each row *y*_{i} has approximately mean 0 and variance 1 for SNPs in Hardy-Weinberg equilibrium.

Here, *x*_{i} is the row vector of genotypes for SNP *i* and *y*_{i} is the normalized row vector. *x*_{ij} and *y*_{ij} are the genotype and normalized genotype at SNP *i* for sample *j*. *N*_{i} is the number of valid genotypes at SNP *i*. These quantities are used to calculate the sample allele frequency for SNP *i*, which is used to normalize the genotypes. In practice, the genotype matrix is normalized through the use of a lookup table mapping from genotypes (stored as 0, 1 or 2 copies of the alternate allele, or missing data) to normalized genotypes (using the normalization formula, with missing data having a normalized value of 0).
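The normalization step can be sketched as follows. This is an illustrative example only: it assumes the common normalization (*x*_{ij} − 2*p*_{i})/√(2*p*_{i}(1 − *p*_{i})) with *p*_{i} estimated as the mean allele count divided by 2, which may differ in detail from the estimator used in the software. The function name `normalize_genotypes` and the missing-data code `MISSING = 3` are hypothetical choices for this sketch, and every SNP is assumed to be polymorphic.

```python
import numpy as np

MISSING = 3  # hypothetical integer code for a missing genotype call

def normalize_genotypes(X):
    """Normalize an M x N genotype matrix (rows = SNPs) one SNP at a time."""
    X = np.asarray(X, dtype=np.int64)
    Y = np.zeros(X.shape, dtype=float)
    for i, x in enumerate(X):
        valid = x != MISSING
        n_i = valid.sum()                        # N_i: number of valid genotypes
        p = x[valid].sum() / (2.0 * n_i)         # assumed allele frequency estimator
        sd = np.sqrt(2.0 * p * (1.0 - p))        # assumes 0 < p < 1
        # Lookup table indexed by genotype code: 0, 1, 2, then missing -> 0
        table = np.array([(g - 2.0 * p) / sd for g in (0, 1, 2)] + [0.0])
        Y[i] = table[x]
    return Y
```

Because each SNP has only four possible stored values, the table is computed once per SNP and applied to all samples by indexing, so normalized genotypes never need to be stored for the full matrix at once.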

We are seeking the top *K* PCs for the normalized genomic matrix **Y**. Traditional PCA algorithms compute the PCs by performing the eigendecomposition of the genetic relationship matrix (*GRM* = **Y**^{T}**Y**/*M*), a costly procedure which returns all the principal components. FastPCA speeds this process up by only approximating the top *K* PCs.

FastPCA is seeded with a random *N* × *L* matrix *G*_{0} composed of values drawn from a standard Gaussian distribution. *L* affects the accuracy of the result and should be greater than *K*; for *K* = 10, *L* = 20 is a good choice. Then, for *I* iterations, *H*_{i} = **Y** × *G*_{i} and *G*_{i+1} is found by taking the QR-decomposition of **Y**^{T} × *H*_{i}, where **Y**^{T} × *H*_{i} = *G*_{i+1}*R*_{i+1}. This step normalizes *G*_{i} to prevent rounding errors during the computation.

After the iterative step completes, the singular value decomposition of the matrix **H** = (*H*_{0}|*H*_{1}|…|*H*_{I}) is taken: *U*_{H} is a low-rank approximation to the column space of **Y**, where **H** = *U*_{H}**Σ**_{H}*V*_{H}^{T}. The SVD of **T** = *U*_{H}^{T}**Y** = *U*_{T}**Σ**_{T}*V*_{T}^{T} can be computed efficiently and approximates the SVD of **Y**, since **Y** ≈ *U*_{H}*U*_{H}^{T}**Y** = *U*_{H}**T**. For the PCA, we are only interested in the first *K* columns of *V*_{T} and the first *K* entries along the diagonal of **Σ**_{T}.
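The procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions rather than the released implementation: the function name `fastpca_sketch`, the default *L* = 2*K*, the default of 10 iterations, and the choice to apply the QR-decomposition to **Y**^{T}*H*_{i} are assumptions of this example, and the whole matrix is held in memory as a dense array.

```python
import numpy as np

def fastpca_sketch(Y, K, L=None, n_iter=10, seed=0):
    """Approximate the top-K PCs of an M x N normalized genotype matrix Y."""
    M, N = Y.shape
    L = 2 * K if L is None else L
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((N, L))            # G_0: random N x L Gaussian seed
    blocks = []
    for _ in range(n_iter):
        H_i = Y @ G                            # M x L
        blocks.append(H_i)
        # QR keeps the columns of the next G orthonormal (numerical stability)
        G, _ = np.linalg.qr(Y.T @ H_i)         # N x L
    H = np.hstack(blocks)                      # M x (L * n_iter)
    U_H, _, _ = np.linalg.svd(H, full_matrices=False)
    T = U_H.T @ Y                              # small matrix with N columns
    _, S_T, Vt_T = np.linalg.svd(T, full_matrices=False)
    pcs = Vt_T[:K].T                           # N x K matrix of approximate PCs
    eigvals = S_T[:K] ** 2 / M                 # eigenvalues of Y^T Y / M
    return pcs, eigvals
```

Every step touches **Y** only through products with thin matrices, so the cost per iteration is proportional to *MNL* and no *N* × *N* matrix is ever formed.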

### Simulation framework

Simulated genotypes at a particular SNP were generated for multiple populations separated by a given fixation index (*F*_{ST}) by first generating an ancestral population allele frequency *p* from a *Uniform*(0.05,0.95) distribution, and then generating individual population frequencies from a truncated *Beta* distribution, where allele frequencies outside of [0.01,0.99] are discarded^{21,55}. This approach was chosen to facilitate generation of more complicated population structures; a descendant population frequency could be plugged into the above equation to generate additional population frequencies separated by a different *F*_{ST}. Truncation was needed because the method used to generate the beta random variate would crash when the minor allele frequency approached 0. Once a population allele frequency *p*_{i} was established, *N*_{i} individual genotypes were generated from a *Binomial*(2,*p*_{i}) distribution.
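A simulation of this kind can be sketched as below. The Beta parameterization used here, *Beta*(*p*(1 − *F*_{ST})/*F*_{ST}, (1 − *p*)(1 − *F*_{ST})/*F*_{ST}), is the standard Balding-Nichols form and is an assumption of this example, as is the choice to redraw (rather than drop) frequencies that fall outside [0.01,0.99]. The function name `simulate_genotypes` is hypothetical.

```python
import numpy as np

def simulate_genotypes(n_snps, pop_sizes, fst, seed=0):
    """Return an M x N genotype matrix and a length-N vector of population labels."""
    rng = np.random.default_rng(seed)
    p_anc = rng.uniform(0.05, 0.95, size=n_snps)       # ancestral allele frequencies
    scale = (1.0 - fst) / fst
    blocks, labels = [], []
    for k, n_k in enumerate(pop_sizes):
        p_k = rng.beta(p_anc * scale, (1.0 - p_anc) * scale)
        # Truncation: redraw frequencies outside [0.01, 0.99] (assumed handling)
        bad = (p_k < 0.01) | (p_k > 0.99)
        while bad.any():
            p_k[bad] = rng.beta(p_anc[bad] * scale, (1.0 - p_anc[bad]) * scale)
            bad = (p_k < 0.01) | (p_k > 0.99)
        # Genotypes are Binomial(2, p_k) draws for each individual in population k
        blocks.append(rng.binomial(2, p_k[:, None], size=(n_snps, n_k)))
        labels += [k] * n_k
    return np.hstack(blocks), np.array(labels)
```

All populations here descend directly from the ancestral frequencies, which corresponds to a star-shaped phylogeny; a nested structure could be obtained by passing a descendant population's frequencies in place of `p_anc`.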

To assess running time, the simulated datasets had *F*_{ST} = 0.01, *M* = 100*k* SNPs and *N* ≈ {1*k*, 1.5*k*, 2*k*, 3*k*, 5*k*, 7*k*, 10*k*, 15*k*, 20*k*, 30*k*, 50*k*, 70*k*, 100*k*} individuals (since there were 6 populations, ). Throughout this paper we report CPU time, but due to multithreading present in the GSL^{56} and OpenBLAS^{57} libraries, run time was about 60% of CPU time. Accuracy was assessed using *M* = 50*k* SNPs and *N* ≈ 10*k* individuals at *F*_{ST} = {0.001,0.002,…,0.010}.

### Assessing accuracy

Accuracy was assessed via the Mean of Explained Variances (MEV) of eigenvectors. Two different sets of *K* *N*-dimensional principal components each span a column space. A metric for the performance of a PCA algorithm against some baseline is how much these column spaces overlap. This is measured by projecting the eigenvectors of one subspace onto the other and finding the mean length of the projected eigenvectors. If we have a reference set of PCs (*v*_{1},*v*_{2},…,*v*_{K}) against which we wish to evaluate the performance of a set of computed PCs (*u*_{1},*u*_{2},…,*u*_{K}), then the performance calculation becomes:

Here, **U** is a matrix whose column vectors are the PCs which we are testing. The test matrix can either be the result of another computation or the truth for a simulated sample.

*K* eigenvectors can describe the population structure in a dataset with *K* + 1 populations. They can be constructed by first creating, for each population *k*, an indicator vector whose *j*th entry is 1 if individual *j* is in population *k* and 0 otherwise. The set of eigenvectors {*v*_{1},*v*_{2},…,*v*_{K}} is constructed by taking *K* of these vectors, normalizing them to have mean 0, and scaling/orthogonalizing them via the Gram-Schmidt process.
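The accuracy metric and the construction of the true population-structure eigenvectors can be sketched as follows. This example assumes that the MEV is the average, over reference PCs, of the length of each reference PC after projection onto the span of the tested PCs, as described in words above; the function names `truth_pcs` and `mev` are hypothetical, and a QR-decomposition is used in place of an explicit Gram-Schmidt loop since both yield an orthonormal basis of the same span.

```python
import numpy as np

def truth_pcs(labels):
    """K orthonormal, mean-zero vectors describing structure among K + 1 populations."""
    labels = np.asarray(labels)
    pops = np.unique(labels)
    # Indicator vectors for K of the K + 1 populations
    ind = np.column_stack([(labels == k).astype(float) for k in pops[:-1]])
    ind -= ind.mean(axis=0)                 # normalize each vector to mean 0
    Q, _ = np.linalg.qr(ind)                # scale and orthogonalize
    return Q

def mev(V_ref, U_test):
    """Mean length of the unit-norm reference PCs projected onto span(U_test)."""
    Q, _ = np.linalg.qr(U_test)             # orthonormal basis of the tested PCs
    return float(np.mean(np.linalg.norm(Q.T @ V_ref, axis=0)))
```

Under this definition the value is 1 when the two sets of PCs span the same subspace and 0 when the subspaces are orthogonal.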

### GERA data set

The GERA dataset comprises 670,176 SNPs and 62,318 individuals of European descent from Northern California^{58}. Individuals were filtered to remove those with missing sex information, individuals related according to the provided pedigree data or with observed genomic relatedness greater than 0.05 in the GRM^{22} and individuals with less than 90% European ancestry as predicted by SNPweights^{26} using a worldwide dataset containing European, African, Asian and Native American ancestry. After filtering, 54,734 individuals remained.

SNPs were initially filtered to remove non-autosomal SNPs, SNPs with minor allele frequency less than 1%, and SNPs with > 1% missing data, leaving 608,981 SNPs. The second stage of filtering removed SNPs that failed PLINK’s Hardy-Weinberg Equilibrium test^{22} with *p* < 10^{-6}, and performed LD-pruning using PLINK. Due to regions of long-range LD, LD persisted even after one filtering run. Multiple rounds of LD filtering were performed using an *r*^{2} cutoff of 0.2 until additional rounds of LD filtering did not remove additional SNPs, leaving 162,335 SNPs. Selection statistics (see below) were computed on the set of 608,981 SNPs, prior to H-W filtering and LD-pruning. We note that many of the SNPs producing signals of selection generated significant H-W *p*-values (e.g. H-W *p* = 1.37 × 10^{-79} for LCT SNP rs6754311), which is an expected consequence of unusual population differentiation.

SNPweights^{26} was used to predict fractional Northwest European, Southeast European, and Ashkenazi Jewish ancestry for each individual. In Figure 3, percentage ancestry in each of these three populations was mapped to an integer in [0,255], which was then used for the RGB color value for that sample, so a NW sample would appear red, SE would appear green and AJ would appear blue.

### PC Projection

POPRES^{27} individuals were projected onto these PCs. The left singular vectors (**U**) were generated by multiplying normalized genotypes for all SNPs in GERA (*Y*_{GERA}) by the PCs (**V**) and scaling by the singular values (**Σ**), the number of SNPs used to calculate the PCs (*M*) and the number of SNPs used for projection (*M*_{GERA}): **U** = *Y*_{GERA}**V****Σ**^{-1}*M*/*M*_{GERA}. Projected PCs were then calculated by multiplying the corresponding set of SNPs in POPRES by these singular vectors and scaling again by the singular values. The projected individuals were overlaid on the PCA plot of GERA individuals and colored according to population membership and consistently with population assignment from SNPweights^{26}.

### Selection statistic

Previous work^{21} shows that for a SNP *i* genotyped in two populations, the difference in allele frequency estimates approximately follows a normal distribution.

Here, the allele frequency of SNP *i* is estimated from a sample of size *N*_{q} from population *q*, and *F*_{ST} is the measure of differentiation between the two populations. Our goal is to extend this formula to individuals with fractional ancestries, and then to continuous-valued PCs.

First, consider the case with two discrete subpopulations. Rather than treating the subpopulations separately, we define a vector **α** where *α*_{j} indicates the ancestry in population 1 (e.g. *α*_{j} = 1 if sample *j* is in population 1 and 0 if sample *j* is in population 2). *D*_{i} can be rewritten as:

If we run PCA on this sample, we would ideally get an eigenvector **v** that has value *v*_{1} for individuals in population 1 and −*v*_{2} for individuals in population 2, where (since **v**^{T}**1** = 0, **v**^{T}**v** = 1)

In this case, *D*_{i} can be rewritten as:

In the limiting case where *F*_{ST} approaches 0, the statistic becomes:

Thus, the square of the SNP weight follows a chi-square 1-d.o.f. distribution in the case where *F*_{ST} → 0. In the case where *F*_{ST} ≠ 0, then the scaling parameter has to be changed, but *D*_{i} still follows a normal distribution.

In the case with fractional ancestry (*α*_{j} ∈ [0,1]), *D*_{i} can still be estimated using equation (4). The individual allele frequency estimates will still asymptotically follow a normal distribution (because of the Lyapunov central limit theorem^{59}), but will be correlated due to individuals with fractional ancestry contributing to both estimates. Thus, *D*_{i} will still follow a normal distribution, but the variance of equation (3) will not hold.

Now consider the case where we do not have fractional ancestries, but rather an eigenvector that separates individuals along some axis of variation. We can treat the eigenvector as a linear transformation of the ancestry vector:

Substituting these values into (4), we find:

Thus, our new selection statistic *D*_{i} is based on the dot product of the normalized genotypes and the eigenvector. Since the variance of *D*_{i} is not known, it will need to be rescaled in order to follow a *N*(0,1^{2}) distribution.

(1) If we are operating on the same set of SNPs that we used for PCA, then the rescaling of *y*_{i}**v** is straightforward. Because PCA is the same as SVD, we see that:

**Y** = **UΣV**^{T}

Here, **V** contains the right singular vectors, which are equivalent to the PCs; **U** contains the left singular vectors, which are rescaled SNP weights; and **Σ** contains the singular values, which are the square roots of the eigenvalues of the GRM. **U** and **V** are unitary, so the columns of **U** are guaranteed to have a norm of 1. Multiplying **U** by √*M* will then produce a properly normalized vector of differences **D** = (*D*_{1},*D*_{2},…,*D*_{M})^{T}. In other words:

In the case where we are calculating PCs on a different set of SNPs than the one for which we are calculating weights, the above property is not guaranteed to hold. In this case, (2) a properly normalized **D** can be obtained by scaling **Yv** so that it has norm √*M*, i.e. scaling *y*_{i}**v** so it has variance 1. This is the approach used in all of our analyses. When rescaling the weights in GERA using equation (11), the variances for PCs 1-4 were 1.03-1.07, while the variances for PCs 5-10 ranged 0.93-8.12.

One assumption underlying the statistic is that the true minor allele frequency is not extremely small, otherwise the assumption of normality will not hold^{21}. For this reason, the selection statistic was only computed for those SNPs with minor allele frequency greater than 1%.
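The second rescaling approach can be illustrated with a short sketch. It assumes a normalized genotype matrix with SNPs in rows and a unit-length PC, rescales the SNP weights so that their mean square is 1 (equivalently, so that the vector of weights has norm √*M*), and converts the squared weights to *p*-values using the chi-square (1 d.o.f.) survival function, which equals erfc(√(*x*/2)). The function name `pc_selection_stats` is hypothetical.

```python
import math
import numpy as np

def pc_selection_stats(Y, v):
    """Chi-square (1 d.o.f.) selection statistic for each SNP along one PC."""
    d = Y @ v                                  # y_i . v for every SNP i
    d = d / np.sqrt(np.mean(d ** 2))           # rescale so the weights have variance 1
    chi2 = d ** 2
    # Survival function of a chi-square with 1 d.o.f.: P(X > x) = erfc(sqrt(x / 2))
    pvals = np.array([math.erfc(math.sqrt(x / 2.0)) for x in chi2])
    return chi2, pvals
```

A genome-wide significance threshold would then be applied to `pvals`, accounting for the number of SNPs and PCs tested.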

## Acknowledgements

We are grateful to D. Reich for helpful discussions and S. Pollack for assistance with FastPCA software. This research was funded by NIH grant R01 HG006399. SM is funded by NSF grants DMS-1209155 and DMS-1418261.