## Abstract

We apply the statistical framework for genome-wide association studies (GWAS) to eigenvector decomposition (EigenGWAS), which is commonly used in population genetics to characterise the structure of genetic data. The approach does not require discrete sub-populations and can therefore be applied to any genetic data where the underlying population structure is unknown, or where the interest is in assessing divergence along a gradient. Through theory and simulation, we show that our approach can identify regions under selection along gradients of ancestry. In real data, we confirm this by demonstrating *LCT* to be under selection between the HapMap CEU-TSI cohorts, and validate this selection signal across European countries in the POPRES samples. *HERC2* was also found to be differentiated both between the CEU-TSI cohorts and within the POPRES sample, reflecting the likely anthropological differences in skin and hair colour between northern and southern European populations. Controlling for population stratification is of great importance in any quantitative genetic study, and our approach also provides a simple, fast, and accurate way of predicting principal components in independent samples. With ever increasing sample sizes across many fields, this approach is likely to be widely used to obtain individual-level eigenvectors while avoiding the computational challenges associated with conducting singular value decomposition in large datasets. We have developed freely available software to facilitate the application of the methods.

## Introduction

In population genetics, eigenvectors have been routinely used to quantify genetic differentiation across populations and to infer demographic history (Cavalli-Sforza *et al.*, 1996; Novembre *et al.*, 2008; Reich *et al.*, 2009). More recently, eigenvectors have commonly been used as covariates in genome-wide association studies (GWAS) to adjust for population stratification (Price *et al*., 2006). Eigenvectors are usually estimated for each individual (individual-level eigenvectors, involving the inversion of an *N* × *N* matrix, where *N* is the sample size). Theoretical studies have suggested that individual-level primary eigenvectors are measures of population differentiation reflecting *F*_{st} among subpopulations (Patterson *et al*., 2006; McVean, 2009; Bryc *et al.*, 2013) and can be interpreted as the divergence of individuals from their most recent common ancestor. Eigenvectors can also be estimated for each SNP (SNP-level eigenvectors, which involve inversion of an *M* × *M* matrix, where *M* is the number of SNPs), and these SNP-level eigenvectors can be interpreted as *F*_{st} metrics of each SNP (Weir, 1996). SNP-level eigenvectors from a reference population are useful for revealing the population structure of independent samples (Zhu *et al*., 2008) as they can be used to project, or predict, the eigenvector values of individuals. However, due to the high-dimensional nature of GWAS data (commonly expressed as *M* ≫ *N*), direct estimation of SNP-level eigenvectors is nearly impossible when using millions of single nucleotide polymorphisms (SNPs).

Singular value decomposition (SVD) enables SNP-level eigenvectors to be obtained in a computationally efficient manner for any set of genotype data (Chen *et al*., 2013). However, it does not by itself determine which SNPs contribute most to the leading eigenvector, nor does it test whether specific SNPs are differentiated along the genetic gradient described by the eigenvector. Here, we propose an alternative, simple and fast approach for the estimation of SNP-level eigenvectors. By using individual-level eigenvectors as phenotypes in a linear regression, we demonstrate that the regression coefficients generated by single-SNP regression are equivalent to the SVD SNP effects proposed by Chen *et al*. (2013). As the single-SNP regression resembles the popular single-marker GWAS method, as implemented in PLINK (Purcell *et al*., 2007), we call this method EigenGWAS. We show that the EigenGWAS framework represents an alternative way of identifying regions under selection along gradients of ancestry.

## Results

### Properties of the estimated SNP effects for eigenvectors

We applied EigenGWAS to the HapMap cohort, a known structured population. Eigenvectors were estimated via principal component analysis based on the **A** matrix using all 919,133 SNPs. We conducted EigenGWAS for HapMap, using *E*_{k}, the *k*^{th} eigenvector, as the phenotype, and investigated the performance of EigenGWAS from *E*_{1} to *E*_{10}. We found 546,716 significant signals (231,677 quasi-independent signals after clumping) on *E*_{1}, which gradually reduced to 236 (163 after clumping) signals on *E*_{10} (**Fig. 1**). The large number of genome-wide significant loci is likely because HapMap3 comprises samples from different ethnicities, and these loci can be interpreted as ancestry informative markers (AIMs). Across the *E*_{k}, the associated eigenvalue was highly correlated with *λ*_{GC} from the corresponding EigenGWAS, where *λ*_{GC} is the genomic inflation factor commonly used to adjust for population stratification in GWAS (Devlin and Roeder, 1999). The top five eigenvalues for the HapMap samples were 100.14, 47.66, 7.168, 5.92, and 4.40, and the corresponding *λ*_{GC} values of EigenGWAS were 103.72, 44.69, 6.47, 5.17, and 3.96, respectively (**Table 1**). The large eigenvalues observed were consistent with previous theory that the magnitude of the eigenvalues indicates population structure (Patterson *et al*., 2006). The connection between *λ*_{GC} and eigenvalues provides a straightforward interpretation: a large *λ*_{GC} indicates underlying population structure (Devlin and Roeder, 1999). Therefore, correction for *λ*_{GC} will filter out signals due to population stratification, allowing loci under selection to be identified. These observations agreed well with our theory (see Methods & Materials).

We demonstrate theoretically that for EigenGWAS, the estimated SNP effects using single-marker GWAS are equivalent to the estimates from BLUP. Empirically, the correlation between the estimates from these two methods was very high (greater than 0.98 on average) (**Fig. 2**), even in HapMap samples that consist of a mix of ethnicities, where the off-diagonal elements of the **A** matrix are non-zero (**Supplementary Fig. 1**). This confirms that our EigenGWAS approach provides an accurate representation of the SNP effects on eigenvectors.

We also conducted EigenGWAS on the POPRES samples, from which we selected 2,466 European samples. On *E*_{1}, there were 10,885 genome-wide significant signals (3,004 quasi-independent signals after clumping), which reduced to 1,639 (90 after clumping) on *E*_{10} (**Table 1**). As in the HapMap sample, we observed a concordance between eigenvalues and *λ*_{GC} in POPRES. The top five eigenvalues were 5.104, 2.207, 2.157, 2.077, and 1.971, and their associated EigenGWAS *λ*_{GC} values were 5.005, 1.929, 1.910, 1.464, and 1.866, respectively (**Table 1**), indicating population structure. The genetic relationship matrix (GRM) estimated from the POPRES data resembled a diagonal matrix, with off-diagonal elements close to zero, suggesting that POPRES is a more homogeneous sample than HapMap (**Supplementary Fig. 1**). Correlations between the estimates from EigenGWAS and BLUP were high, with an average greater than 0.999 from *E*_{1} to *E*_{10} (**Supplementary Fig. 2**), close to one as expected.

The chi-square statistics of the estimated SNP effects on eigenvectors from EigenGWAS were correlated with *F*_{st} for each SNP, consistent with the previously established relationship between eigenvectors and *F*_{st} (Patterson *et al*., 2006; McVean, 2009). Using a naïve threshold of *E*_{k} > 0, the 2,466 POPRES samples were divided into two nearly even groups, which served as the two subgroups for calculating *F*_{st}. *E*_{1} > 0 split the POPRES samples into North and South Europe; samples from UK, Ireland, Germany, Austria, and Australia were in one group, and samples from Italy, Spain, and Portugal were in the other group; samples from Switzerland and France were nearly evenly split between the two groups. *F*_{st} for each SNP was then calculated based on these two groups. For every eigenvector up to *E*_{10}, we observed strong correlations between *F*_{st} and the chi-square test statistics for EigenGWAS signals (**Fig. 3**), and the averaged correlation was 0.925 (S.D., 0.067). For example, the correlation was 0.89 (*p*-value < 1e-16) between chi-square test statistics and *F*_{st} for *E*_{1} in POPRES (**Supplementary Table 1**). This correlation is consistent with our theory, where *F*_{st} has a strong linear relationship with its EigenGWAS chi-square test statistic.

We also validated our results in simulation scheme I, in which there was neither selection nor population stratification. Given 2,000 simulated samples, each of which had 500,000 unlinked SNPs, EigenGWAS showed few GWAS signals (2 genome-wide significant signals on *E*_{1}; **Supplementary Fig. 4**). After splitting the samples into 2 groups depending on *E*_{i} > 0, the correlation between chi-square test statistics and *F*_{st} was about 0.67 from *E*_{1} to *E*_{10} (**Supplementary Fig. 5**). As expected, *λ*_{GC} ranged from around 1.124 to 1.130, with a mean of 1.124 for EigenGWAS on the top 10 eigenvectors, indicating little population stratification in the simulated data. Furthermore, we validated the theory in simulation scheme II, in which there was population stratification. We wanted to know whether adjusting the test statistic by the greatest eigenvalue could render the distribution of the test statistics immune to population stratification. Given various sample sizes for the two subdivisions, after adjustment by the largest eigenvalue the test statistic followed the null distribution, a chi-square distribution with 1 degree of freedom (**Supplementary Fig. 6**), indicating good control of population stratification after correction. The statistical power of EigenGWAS was also evaluated. As demonstrated, the power of EigenGWAS to detect a locus under selection was determined by the ratio between the specific *F*_{st} of a locus and the averaged population stratification in the sample (**Supplementary Fig. 7**).
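
The link between per-SNP *F*_{st} and the chi-square statistic reported in this section can be illustrated with a small simulation. The sketch below is a toy example under stated assumptions rather than the analysis used in the paper: a 0/1 group label stands in for the sign of *E*_{1}, *F*_{st} is computed with a simple two-group variance estimator, and the chi-square is taken as *N* times the squared genotype-label correlation. All names and parameter values are hypothetical.

```python
import random

def pearson(a, b):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

rng = random.Random(7)
n_per_group, n_snps, fst_true = 200, 300, 0.02   # assumed toy values
labels = [0.0] * n_per_group + [1.0] * n_per_group  # stand-in for E_1 > 0

fst_values, chi2_values = [], []
for _ in range(n_snps):
    p = rng.uniform(0.2, 0.8)                     # ancestral frequency
    a = p * (1 - fst_true) / fst_true
    b = (1 - p) * (1 - fst_true) / fst_true
    geno = []
    for _group in range(2):
        ps = rng.betavariate(a, b)                # subpopulation frequency
        geno += [int(rng.random() < ps) + int(rng.random() < ps)
                 for _ in range(n_per_group)]
    p1 = sum(geno[:n_per_group]) / (2 * n_per_group)
    p2 = sum(geno[n_per_group:]) / (2 * n_per_group)
    pbar = (p1 + p2) / 2
    # Two-group F_st: variance of group frequencies over pbar * (1 - pbar)
    fst_values.append(((p1 - pbar) ** 2 + (p2 - pbar) ** 2) / 2 / (pbar * (1 - pbar)))
    # 1-df chi-square approximated as N * r^2 between genotype and label
    chi2_values.append(len(geno) * pearson(geno, labels) ** 2)

r = pearson(fst_values, chi2_values)
print(f"correlation between F_st and chi-square: {r:.3f}")
```

With two equal-sized groups the two quantities are close to proportional, so the correlation in this toy setting is near one; in real data the eigenvector is continuous and the agreement is looser.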

### Using EigenGWAS to identify loci under selection in structured populations

We propose EigenGWAS as a method of finding loci differentiated among populations, or across a gradient of ancestry. Intuitively, every EigenGWAS hit is an AIM, which differs in allele frequency along an eigenvector due to genetic drift or selection. A locus under selection should be more differentiated across populations than genetic drift alone can produce. Thus, correction for *λ*_{GC} controls for background population structure, providing a test of whether an AIM shows greater allelic differentiation than expected under genetic drift.
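
The genomic control step described above can be sketched in a few lines. This is a minimal illustration, assuming the usual definition of *λ*_{GC} as the median observed chi-square divided by the median of a 1-df chi-square distribution (about 0.455); the function names and the toy statistics are hypothetical and not taken from the EigenGWAS software.

```python
import random
import statistics

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364

def genomic_inflation(chi2_stats):
    """lambda_GC: median observed chi-square over the null median."""
    return statistics.median(chi2_stats) / CHI2_1DF_MEDIAN

def gc_correct(chi2_stats):
    """Divide every statistic by lambda_GC (genomic control)."""
    lam = genomic_inflation(chi2_stats)
    return [c / lam for c in chi2_stats]

# Toy data: null 1-df chi-squares inflated five-fold to mimic drift,
# plus one strongly differentiated locus (hypothetical values).
rng = random.Random(1)
background = [5.0 * rng.gauss(0.0, 1.0) ** 2 for _ in range(20000)]
chi2_stats = background + [400.0]

lam = genomic_inflation(chi2_stats)
corrected = gc_correct(chi2_stats)
print(f"lambda_GC = {lam:.2f}; outlier after correction = {corrected[-1]:.1f}")
```

After correction the bulk of the statistics follows the null distribution again, while a locus that is much more differentiated than the background remains an outlier.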

We pooled together CEU (112 individuals) and TSI (88 individuals), which represent Northwestern and Southern European populations in HapMap. EigenGWAS was conducted on *E*_{1}, which partitioned CEU and TSI into two groups accurately when *E*_{1} > 0 was used as the threshold (**Supplementary Fig. 8**). We corrected for *λ*_{GC}, which was 1.723 for CEU&TSI. Adjustment for *λ*_{GC} significantly reduced population stratification (**Supplementary Fig. 9**), and it was consequently possible to filter out the baseline difference between these two cohorts. After correction, we found evidence of selection at the lactase persistence locus, *LCT* (*p*-value = 1.21e-20). Due to the hitchhiking effect, the region near *LCT* also showed divergent allele frequencies. For example, the *DARS* gene, 0.15 Mb away from *LCT*, was also significantly associated with *E*_{1} (*p*-value = 1.51e-23). *HERC2* was slightly below the genome-wide significance level (*p*-value = 8.22e-08), indicating that it reflects an anthropological difference tied to the geographic locations of the two cohorts but is not under selection as strong as that at *LCT*.

We then conducted EigenGWAS in the POPRES sample by treating *E*_{1} as a quantitative trait, and calculated the approximate *F*_{st} for each SNP given two groups split by the threshold of *E*_{1} > 0 (**Supplementary Fig. 10**). Given 643,995 SNPs, the genome-wide threshold was *p*-value < 7.76e-08 for the significance level of *α* = 0.05. *λ*_{GC} was 5.00, which indicated substantial population stratification, as expected for POPRES. Correcting for *λ*_{GC} systematically reduced the EigenGWAS *χ*^{2} test statistics (**Supplementary Fig. 11**), and we replicated the significance of *LCT* (*p*-value = 1.23e-22) and *DARS* (*p*-value = 8.99e-22) (**Table 2**), suggesting selection at these regions. *HERC2* was also replicated, with a *p*-value of 8.15e-09 and an *F*_{st} of 0.041.

### Prediction accuracy for projected eigenvector

We investigated three aspects of EigenGWAS prediction: 1) the number of loci needed to achieve high accuracy for the projected eigenvectors; 2) the required sample size of the training set; 3) the importance of matching the population structure between the training and the test sets.

Using the POPRES samples, we split 5% (125 individuals), 10% (250 individuals), 20% (500 individuals), 30% (750 individuals), 40% (1000 individuals), and 50% (1250 individuals) of the sample as the training set, and used the remainder of the samples as the test set. Eigenvectors were estimated using all markers in each training set. As predicted by our theory (Eq 7), the prediction accuracy of the projected eigenvector was consistent with *R*^{2} = *M*/(*M* + *N*_{e}), where *M* is the number of SNPs used as predictors and *N*_{e} = 1,000 for *E*_{1} empirically. If only 100 and 1,000 random SNPs were sampled as predictors, the expected maximal *R*^{2} = 0.091 and 0.5, respectively, and accuracy reached almost 1 if more than 100,000 SNPs were sampled. In agreement with our theory (**Fig. 6**), if the number of predictors was too small the prediction accuracy was poor, and prediction accuracy increased with the addition of more markers for *E*_{1}. When the sample size of the discovery set was 1,000 or above, maximal prediction accuracy was achieved, as predicted by our theory. Therefore, a discovery set with a sample size greater than 1,000 should be sufficient to predict the first eigenvector of an independent set, provided that population structure is the same across the discovery and prediction samples (**Fig. 6**). In contrast, the prediction accuracy for projected eigenvectors decreased quickly (**Fig. 6**) for eigenvectors other than *E*_{1}. For example, the prediction accuracy was *R*^{2} < 0.2 for *E*_{2} and *R*^{2} < 0.15 for *E*_{3}. For *E*_{4} to *E*_{10}, the prediction accuracy dropped to nearly zero. This is consistent with the top 2 to 3 eigenvectors explaining the majority of variation (McVean, 2009), provided the training and the test sets had matched population structure.
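
The expected accuracies quoted above (0.091 and 0.5 for 100 and 1,000 random SNPs, approaching 1 beyond 100,000 SNPs) can be reproduced with a one-line function. The sketch assumes the accuracy takes the form *R*^{2} = *M*/(*M* + *N*_{e}) with *N*_{e} = 1,000, which matches the quoted values; the function name is hypothetical.

```python
def expected_r2(n_markers, n_effective=1000):
    """Expected maximal accuracy, assuming R^2 = M / (M + N_e)."""
    return n_markers / (n_markers + n_effective)

for m in (100, 1_000, 100_000):
    print(m, round(expected_r2(m), 3))
```

Under this form the accuracy reaches one half when the number of predictors equals *N*_{e}, and the gain from additional markers flattens once *M* is much larger than *N*_{e}.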

If EigenGWAS SNPs of low *p*-value were likely to be AIMs, we would hypothesise that AIM markers would be more efficient in giving high accuracy for the predicted eigenvectors (**Fig. 6**). For *E*_{1}, the prediction accuracy reached 1 more quickly by using markers selected by *p*-value thresholds. The prediction accuracy for projected *E*_{2} was dependent upon the threshold. For projected *E*_{2} given a 50:50 split of POPRES sample, applying the threshold of *p*-value < 1e-6 (927 SNPs), *R*^{2} = 0.136, as high as using all markers. For other projected eigenvectors, the pattern of accuracy did not change much after applying *p*-value thresholds because in general, the prediction accuracy was low. This indicated that eigenvectors other than the first two eigenvectors capture little replicable population structure in POPRES.

In practice, the training and the test sets may not match perfectly in population structure, and this will likely lead to a reduction in prediction accuracy. To demonstrate this, we split the POPRES samples into two sets: Swiss (991 samples) and French (96 samples) samples were pooled into one group (SF), and the rest of the samples formed the other group (NSF). We used SF as the training set and NSF as the test set. As SF was almost an average of North European and South European gene flow, making it a less stratified population, its EigenGWAS effects would consequently be small and less “heritable”. When using all SNP effects estimated from the SF set, the observed prediction accuracy for the NSF set was *R*^{2} = 0.33 and 0.005 for *E*_{1} and *E*_{2}, respectively. These results indicate that matched training and test sets are important for the prediction accuracy of the projected eigenvectors.

Ancestry information may still be elucidated well even if the training set and the test set do not match well in their population structure. Using HapMap3 as the training set, we also tried to infer the ancestry of the Puerto Rican cohort (PUR, 105 individuals) and the Pakistani cohort (PJL, 95 individuals) from the 1000 Genomes project (The 1000 Genomes Project Consortium, 2012). On chromosome 1, 74,500 common SNPs were found between HapMap3 and the 1000 Genomes project. As illustrated, using only these 74,500 common markers on chromosome 1, the projected eigenvectors accurately revealed the demographic history of the Puerto Rican cohort, an admixture of African and European gene flows, and the Pakistani cohort, an admixture of Asian and European gene flows (**Fig. 7**).

As a negative control, we replicated the prediction study for the simulated data used in the previous section. The simulated data were split into two sets of equal sample size. As there was no population structure in the simulated data, the prediction accuracy was poor, *R*^{2} = 0.01 from *E*_{1} to *E*_{10}. This demonstrates that prediction can be used to validate whether population structure exists within a genotype sample.

We conclude that high prediction accuracy of projected eigenvectors for independent samples depends on several conditions: 1) the training set should harbour sufficient population stratification; 2) the sample size of the training set should be sufficiently large; 3) the training and test sets should be as concordant as possible in their population structure. In addition, 4) when there is no real population structure, the prediction accuracy is close to zero; and 5) depending on the population, high prediction accuracy is largely achievable for the projected *E*_{1}.
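
The projection procedure evaluated in this section can be sketched on simulated data. The example below is a toy illustration under stated assumptions, not the authors' pipeline: two subpopulations are simulated with a Beta model, the training and test sets are an interleaved split (so their structure matches), SNP effects on *E*_{1} come from single-marker least squares, and accuracy is measured against the known population label. It relies on NumPy, and all names and parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_per_pop, n_snps, fst = 200, 2000, 0.02         # assumed toy values

# Two subpopulations drawn around shared ancestral allele frequencies.
p_anc = rng.uniform(0.1, 0.9, n_snps)
a, b = p_anc * (1 - fst) / fst, (1 - p_anc) * (1 - fst) / fst
geno = np.vstack(
    [rng.binomial(2, rng.beta(a, b), (n_per_pop, n_snps)) for _ in range(2)]
).astype(float)
pop = np.repeat([0, 1], n_per_pop)

# Interleaved split so both sets share the same population structure.
train = np.arange(0, 2 * n_per_pop, 2)
test = np.arange(1, 2 * n_per_pop, 2)

f = geno[train].mean(axis=0) / 2                 # training allele frequencies
keep = (f > 0.05) & (f < 0.95)                   # drop near-monomorphic SNPs
f, geno = f[keep], geno[:, keep]

def standardize(x):
    # Centre and scale with the *training* frequencies.
    return (x - 2 * f) / np.sqrt(2 * f * (1 - f))

z_train, z_test = standardize(geno[train]), standardize(geno[test])

# Leading eigenvector of the training genetic relationship matrix.
grm = z_train @ z_train.T / z_train.shape[1]
e1 = np.linalg.eigh(grm)[1][:, -1]               # eigh sorts ascending

# Single-marker regression of E_1 on each SNP (columns are already centred).
beta = z_train.T @ e1 / (z_train ** 2).sum(axis=0)

# Project the SNP effects onto the held-out individuals.
projected = z_test @ beta
r2 = np.corrcoef(projected, pop[test])[0, 1] ** 2
print(f"R^2 between projected E_1 and population label: {r2:.3f}")
```

Reusing the training allele frequencies to standardize the test genotypes mirrors the practical situation in which only SNP effects and frequencies from a reference sample are available.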

## Discussion

Eigenvectors have been routinely employed in population genetics, and various approaches have been proposed to offer interpretation and efficient algorithms (Patterson *et al*., 2006; Rokhlin *et al.*, 2009; McVean, 2009; Chen *et al.*, 2013; Galinsky *et al.*, 2015). In this study, we created a GWAS framework for studying and validating population structure, and offer an interpretation of eigenvectors within this framework. The EigenGWAS framework (least square) identifies ancestry informative markers and loci under selection across gradients of ancestry.

We integrated SVD, BLUP, and single-marker regression into a unified framework for the estimation of SNP-level eigenvectors. SVD is a special case of BLUP when the heritability of the trait is 1 and the target phenotype is an eigenvector. Furthermore, BLUP is equivalent to the commonly used GWAS method for estimating SNP effects. As demonstrated, the correlation between BLUP and GWAS estimates of the SNP effects is almost 1. EigenGWAS offers an alternative way of estimating *F*_{st} that can replace conventional *F*_{st} when population labels are unknown, populations are admixed, or differentiation occurs across a gradient. As demonstrated for the CEU&TSI samples, EigenGWAS gives nearly identical estimates of *F*_{st} compared with conventional estimation.

Unlike conventional GWAS, which requires conventional phenotypes, the proposed EigenGWAS provides a novel method for finding loci under selection based on eigenvectors, which are generated from the genotype data itself. An EigenGWAS hit may reflect the consequence of either process, selection or drift, and thus additional evidence is needed to differentiate selection from drift. *LCT* is a known locus under selection, which differs in its allele frequency, as indicated by the *F*_{st} statistic, between Northern and Southern Europeans (Bersaglieri *et al*., 2004). We replicated the significance of *LCT* in the CEU&TSI samples and the POPRES European samples. *DARS* has been found in association with hypomyelination with brainstem and spinal cord involvement and leg spasticity (Taft *et al*., 2013). In addition, we also found the *HERC2* locus independently, which may indicate the existence of anthropological differences in certain characters, such as hair, skin, or eye colour, across European nations (Voight *et al*., 2006; Visser *et al.*, 2012).

Although by definition selection and genetic drift are different biological processes, both lead to allele frequency differentiation across populations and it is often difficult to tell them apart. In this study, comparing EigenGWAS results with and without adjustment for *λ*_{GC} offers a straightforward way to filter out population stratification. For example, with adjustment for *λ*_{GC}, *LCT* and *DARS* were still significant in both EigenGWAS analyses, while *HERC2* was only significant in POPRES. If adjustment for *λ*_{GC} removed the average genetic drift since the most recent common ancestor for the whole sample, it might indicate that *HERC2* reflected the anthropological difference between subsamples but was not under selection as strong as that for *LCT*. Nevertheless, *LCT* was differentiated due to selection on top of genetic drift, and *DARS* might be significant due to the hitchhiking effect. So, *LCT*, *DARS*, and *HERC2* were significant in EigenGWAS through different mechanisms.

The EigenGWAS application provides a clear scenario in which correction for *λ*_{GC} is necessary if genetic drift/population stratification is to be filtered out. It has been debated whether correction for *λ*_{GC} is necessary for GWAS (Devlin and Risch, 1995; Yang, Weedon, *et al.*, 2011). If the inflation is due to population stratification, as initially introduced, it seems necessary to control for it. In contrast, if it is due to polygenic genetic architecture, then correction for *λ*_{GC} will overcorrect GWAS signals. Interestingly, Patterson *et al*. (2006) found that the top eigenvalues reflect population stratification, and in our study we found that *λ*_{GC} from EigenGWAS was numerically very similar to its corresponding eigenvalue. This indicates, from another angle, that *λ*_{GC} captures population stratification. So, in concept and implementation, the correction for *λ*_{GC} is technically reasonable. Of note, Galinsky *et al*. (2015) also proposed a similar procedure to filter out population stratification in a study similar to ours, but we believe our framework is much easier to understand and implement in practice.

Once EigenGWAS SNP effects have been estimated, it is straightforward to project those effects onto an independent sample. The prediction of population structure was similar to that of recent studies (Chen *et al*., 2013). We found that the prediction accuracy for the top eigenvector could be almost 1. Given a training set of about 1,000 samples, the prediction accuracy could be very high if there were a reasonable number of common markers, in the order of 100,000. This number, which needs to be available in both the reference set and the target set, is achievable. Further investigation may be needed to check whether this number of markers is related to the effective number of markers after correction for linkage disequilibrium in GWAS data. When the population structure of the test sample resembles that of the training sample, high accuracy will be achieved for the leading projected eigenvectors. Therefore, this approach is likely to be highly beneficial for extremely large samples, such as UK Biobank and 23andMe, both of which have more than half a million samples and for which direct eigenvector analysis may be infeasible. Our results suggest that sampling about 1,000 individuals from the whole sample as the training set and subsequently projecting EigenGWAS SNP effects onto the remaining samples will be sufficient to reach a reasonably high resolution of the population structure.

Many improvements to the inference of ancestry using projected eigenvectors have been suggested (Chen *et al*., 2013). As the concordance of population structure between the training and test sets is often unknown (the definition of population structure, whether from a genetic or a socio-cultural perspective, can be difficult or controversial), improvement of the inference of ancestry may or may not be achieved, depending on the scale of precision required for a sample. However, for classification of samples at the ethnicity level, projected eigenvectors are likely to have high accuracy, as demonstrated in the Puerto Rican cohort and the Pakistani cohort. Therefore, when identifying ethnic outliers, using projected eigenvectors from HapMap is likely to be sufficient in practice.

Eigenvector analysis of GWAS data is an important and widely used technique, and here we show that its interpretation depends on many factors, such as the proportion of different subpopulations and *F*_{st} between subpopulations. Our EigenGWAS approach provides an intuitive interpretation of population structure, enabling ancestry informative markers (AIMs), and potentially loci under selection, to be identified. To facilitate the use of projected eigenvectors, we provide estimated SNP effects from the HapMap and POPRES samples, together with software that largely reduces the logistics involved in the conventional way of generating eigenvectors, such as reference allele matching and strand flips.

## Methods and Materials

**HapMap3 samples.** HapMap3 samples were collected globally to represent the genetic diversity of the human population (Altshuler *et al*., 2010). HapMap3 contains representative samples from many continents: CEU and TSI represent populations from northern and southern Europe, CHB and JPT from East Asia, and CHD Chinese collected in Denver, Colorado. Loci with palindromic alleles (A/T alleles, or G/C alleles) were excluded, and 919,133 HapMap3 SNPs were used for the analysis.

**1000 Genomes project.** 1000 Genomes project samples were used as a prediction set for projecting eigenvectors (The 1000 Genomes Project Consortium, 2012). We selected the Puerto Rico cohort (PUR, 105 samples) and the Pakistan cohort (Punjabi from Lahore, Pakistan, 95 samples) for analysis.

**POPRES samples.** POPRES (Nelson *et al*., 2008) is a reference population of over 6,000 samples from Asian, African, and European nations. In this study, we selected 2,466 samples of European descent. The POPRES genotype sample was imputed to a 1000 Genomes reference panel (The 1000 Genomes Project Consortium, 2012). Imputation for POPRES was performed in two stages. First, the target data were haplotyped using HAPI-UR (Williams *et al*., 2012). Second, Impute2 was used to impute the haplotypes to the 1000 Genomes reference panel (Howie *et al*., 2011). We then selected SNPs that were present across all datasets at an imputation information score of >0.8. The full imputation procedure is described at https://github.com/CNSGenomics/impute-pipe. After quality control and removal of loci with palindromic alleles (A/T alleles, or G/C alleles), 643,995 SNPs remained for POPRES. In addition, we also conducted the analysis using 234,127 non-imputed markers common to POPRES and HapMap3. As the results from these two datasets were very similar, this report focuses on the results from the 643,995 SNPs, which were more informative.

**Simulation scheme I: null model without population structure.** 2,000 unrelated samples with 500,000 biallelic markers, which were in linkage equilibrium to each other, were simulated. The minor allele frequencies ranged from 0.01~0.5, and Hardy-Weinberg equilibrium was assumed for each locus. All individuals were simulated from a homogeneous population, with no population stratification. In order to calculate *F _{st}* at each locus, we divided the sample into sub-populations based upon eigenvectors that were estimated from a genetic relationship matrix calculated using all 500,000 markers (see below).

**Simulation scheme II: null model with population structure.** In general, this simulation scheme followed Price *et al*. (2006). 2,000 unrelated samples with 10,000 biallelic markers, which were in linkage equilibrium with each other, were generated. For each marker, its ancestral allele frequency *p* was sampled from a uniform distribution between 0.05 and 0.95, and its frequency in a subpopulation was sampled from a Beta distribution with parameters *p*(1 − *F*_{st})/*F*_{st} and (1 − *p*)(1 − *F*_{st})/*F*_{st}. The Beta distribution had a mean of *p* and a sampling variance of *p*(1 − *p*)*F*_{st}. Once the allele frequency for a subpopulation at a locus was determined as *p*_{s}, individuals were generated from a binomial distribution *Binomial*(2, *p*_{s}). Here *F*_{st} agrees with the quantity that measures the genetic distance between a pair of subpopulations (Cavalli-Sforza *et al.*, 1996).
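
Simulation scheme II can be sketched with the standard library alone. The block below is a minimal illustration under stated assumptions: the Beta parameters are the pair that yields mean *p* and variance *p*(1 − *p*)*F*_{st}, sample sizes are scaled down from the 2,000 samples and 10,000 markers described above, and the function names are hypothetical.

```python
import random

def beta_params(p, fst):
    """Beta parameters giving mean p and variance p * (1 - p) * fst."""
    scale = (1.0 - fst) / fst
    return p * scale, (1.0 - p) * scale

def simulate_structured(pop_sizes, n_snps, fst, seed=0):
    """Return one genotype list (0/1/2 per SNP) per individual."""
    rng = random.Random(seed)
    genotypes = [[] for _ in range(sum(pop_sizes))]
    for _ in range(n_snps):
        p = rng.uniform(0.05, 0.95)              # ancestral allele frequency
        a, b = beta_params(p, fst)
        row = 0
        for size in pop_sizes:
            p_s = rng.betavariate(a, b)          # subpopulation frequency
            for _ in range(size):
                # Binomial(2, p_s) as the sum of two Bernoulli draws
                genotypes[row].append(
                    int(rng.random() < p_s) + int(rng.random() < p_s)
                )
                row += 1
    return genotypes

# Scaled-down toy run: two subpopulations of 100 individuals, 200 SNPs.
genotypes = simulate_structured(pop_sizes=(100, 100), n_snps=200, fst=0.01)
print(len(genotypes), len(genotypes[0]))
```

Because markers are drawn independently, the simulated loci are in linkage equilibrium, as the scheme requires.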

### Calculating individual-level eigenvectors

We assume that there is a reference sample consisting of *N* unrelated individuals and *M* markers. *x*_{i} = (*x*_{i1}, *x*_{i2}, …, *x*_{iM})^{T} is a vector of the *i*^{th} individual’s genotypes along *M* loci, with *x* the number of reference alleles. An *N* × *N* genetic relatedness (correlation) matrix **A** (matrices are in bold font) for each pair of individuals is defined as $A_{ij} = \frac{1}{M}\sum_{l=1}^{M}\frac{(x_{il}-2f_l)(x_{jl}-2f_l)}{2f_l(1-f_l)}$, in which *f*_{l} is the frequency of the reference allele at locus *l*. Principal component analysis (PCA) is then implemented on the **A** matrix (Price *et al*., 2006), generating an *N* × *K* (*K* ≤ *N*) matrix of eigenvectors, in which *E*_{k} is the eigenvector corresponding to the *k*^{th} largest eigenvalue.

### Unified framework for BLUP, SVD, and EigenGWAS

Theoretically, PCA can also be implemented on an *M* × *M* matrix, but this is often infeasible because the *M* × *M* matrix is very large. However, for individual *i*, eigenvector *k* can also be written as:

$$E_{ik} = x_i^{T}\beta_k \qquad (1)$$

in which *β*_{k} is an *M* × 1 SNP-level vector of the SNP effects on *E*_{k}, and *x*_{i} is the genotype of the *i*^{th} individual across *M* loci. In the text below, we refer to the individual-level eigenvector (an *N* × 1 vector) as the eigenvector, and to the SNP-level eigenvector (*M* × 1) as the SNP effects.
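
The statement that an individual-level eigenvector can be written as a linear combination of genotypes can be checked numerically. The sketch below assumes the standardized-genotype definition of the relatedness matrix (each genotype centred by twice the allele frequency and scaled by the binomial standard deviation) and uses NumPy; the toy sizes and variable names are hypothetical, and the scaling of the SNP effects by 1/(*Mλ*_{1}) is one convenient choice rather than necessarily the paper's.

```python
import numpy as np

rng = np.random.default_rng(42)
n, m = 50, 400                                   # toy sizes
f_true = rng.uniform(0.1, 0.9, m)
x = rng.binomial(2, f_true, (n, m)).astype(float)

f = x.mean(axis=0) / 2                           # reference-allele frequencies
keep = (f > 0) & (f < 1)                         # drop monomorphic loci
x, f = x[:, keep], f[keep]
m = x.shape[1]

z = (x - 2 * f) / np.sqrt(2 * f * (1 - f))       # standardized genotypes
grm = z @ z.T / m                                # N x N relatedness matrix A

eigvals, eigvecs = np.linalg.eigh(grm)           # eigenvalues in ascending order
lam1, e1 = eigvals[-1], eigvecs[:, -1]           # leading eigenvalue and E_1

beta1 = z.T @ e1 / (m * lam1)                    # M x 1 SNP effects on E_1
reconstructed = z @ beta1                        # genotype-weighted sum per individual
print(np.allclose(reconstructed, e1))
```

The reconstruction is exact because multiplying the relatedness matrix by one of its eigenvectors only rescales that eigenvector by the eigenvalue.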

We review three possible methods to estimate *β* given eigenvectors. The first method is best linear unbiased prediction (BLUP), which is commonly used in animal breeding and has recently been introduced to human genetics for prediction (Henderson, 1975; Goddard *et al.*, 2009). The second method is to convert an individual-level eigenvector to a SNP-level eigenvector using SVD, as proposed by Chen *et al*. (2013). The third method is the approach outlined here, EigenGWAS, which is a single-marker regression, as commonly used in GWAS analysis.

### Method 1 and 2: BLUP and SVD

For a quantitative trait, $y = \mu + \mathbf{X}\boldsymbol{\beta} + e$, in which *y* is the phenotype, *μ* is the grand mean, *β* is the vector of additive effects, ***X*** is the genotype matrix, and *e* is the residual. Without loss of generality, the BLUP equation can be expressed as:

$$\hat{\boldsymbol{\beta}} = \tilde{\mathbf{X}}^{T}\mathbf{V}^{-1}\mathbf{y} \qquad (2)$$

in which $\hat{\boldsymbol{\beta}}$ is the vector of estimated SNP effects, $\tilde{\mathbf{X}}$ is the standardized genotype matrix, ***V*** is the variance-covariance matrix with $\mathbf{V} = \sigma_{g}^{2}\mathbf{A} + \sigma_{e}^{2}\mathbf{I}$, and ***y*** is the trait of interest (Henderson, 1975). Replacing *y* with an individual-level eigenvector (*E*_{k}), Eq 2 can be written as:

$$\hat{\boldsymbol{\beta}}_{k} = \tilde{\mathbf{X}}^{T}\mathbf{A}^{-1}E_{k} \qquad (3)$$

in which $\hat{\boldsymbol{\beta}}_{k}$ is the BLUP estimate of the SNP effects and *E*_{k} is the *k*^{th} eigenvector estimated from the reference sample. The ***V*** matrix can be replaced with ***A*** because the eigenvector has no residual error (i.e. *h*^{2} = 1). This method has also been proposed as an equivalent computing algorithm for genomic prediction (Maier *et al*., 2015).
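The step from ***V*** to ***A*** can be made explicit under the usual mixed-model variance decomposition. The sketch below assumes the standard form of ***V***, with $\sigma_g^{2}$ and $\sigma_e^{2}$ denoting the genetic and residual variances; that form and those symbols are assumptions of this sketch.

```latex
\begin{aligned}
\mathbf{V} &= \sigma_g^{2}\mathbf{A} + \sigma_e^{2}\mathbf{I},
\qquad h^{2} = \frac{\sigma_g^{2}}{\sigma_g^{2} + \sigma_e^{2}} \\
h^{2} = 1 \;&\Rightarrow\; \sigma_e^{2} = 0
\;\Rightarrow\; \mathbf{V} = \sigma_g^{2}\mathbf{A}
\;\Rightarrow\; \mathbf{V}^{-1} = \sigma_g^{-2}\mathbf{A}^{-1}
\end{aligned}
```

Up to the constant $\sigma_g^{-2}$, which rescales all SNP effects equally, inverting ***V*** is then the same as inverting ***A***.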

In addition, the connection between PCA and SVD can be established through the transformation from the *N* × *N* matrix to the *M* × *M* matrix (McVean, 2009). Let $\mathbf{A} = \mathbf{P}\mathbf{D}\mathbf{P}^{-1}$, in which ***D*** is an *N* × *N* diagonal matrix of the eigenvalues *λ*_{k}, and ***P*** is an *N* × *N* matrix of the eigenvectors. Then $\mathbf{B} = \mathbf{X}^{T}(\mathbf{P}\mathbf{D}\mathbf{P}^{-1})^{-1}\mathbf{P} = \mathbf{X}^{T}\mathbf{P}\mathbf{D}^{-1}$, in which ***B*** is an *M* × *N* matrix. This is equivalent to the equation used in Chen *et al*. (2013), where $\mathbf{B}^{T} = \mathbf{D}^{-1}(\mathbf{X}^{T}\mathbf{P})^{T}$. Thus, eigenvector transformation can be viewed as a special case of BLUP in which the heritability is 1 (Eq 3). However, under SVD another analysis step is then required to evaluate the significance of the estimated SNP effects. In the EigenGWAS framework, an empirical *p*-value is produced when estimating the regression coefficient.
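The transformation from an individual-level eigenvector to SNP-level effects can be checked numerically on toy data. The sketch below is a minimal illustration, not the GCTA or GEAR implementation: it assumes the relationship matrix is computed as $\tilde{\mathbf{X}}\tilde{\mathbf{X}}^{T}/M$ (so an extra factor 1/*M* appears next to 1/*λ*), it uses power iteration in place of a full eigendecomposition, and all function names and the simulated data are hypothetical.

```python
import math
import random

random.seed(1)

def simulate_genotypes(n_per_pop=15, m=200):
    """Toy data: two diverged subpopulations, genotypes coded 0/1/2."""
    freqs = [random.uniform(0.1, 0.4) for _ in range(m)]
    G = []
    for pop in (0, 1):
        for _ in range(n_per_pop):
            row = []
            for p in freqs:
                q = p if pop == 0 else 1.0 - p
                row.append((random.random() < q) + (random.random() < q))
            G.append(row)
    return G

def standardize(G):
    """Centre and scale each SNP column; monomorphic SNPs are dropped."""
    n = len(G)
    cols = []
    for l in range(len(G[0])):
        col = [G[i][l] for i in range(n)]
        mean = sum(col) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
        if sd > 0:
            cols.append([(v - mean) / sd for v in col])
    return [[c[i] for c in cols] for i in range(n)]

def grm(X):
    """Assumed relationship matrix A = X X^T / M for standardized X (N x M)."""
    n, m = len(X), len(X[0])
    return [[sum(a * b for a, b in zip(X[i], X[j])) / m for j in range(n)]
            for i in range(n)]

def top_eigenpair(A, iters=500):
    """Largest eigenvalue and unit eigenvector of A by power iteration."""
    n = len(A)
    v = [random.gauss(0, 1) for _ in range(n)]
    for _ in range(iters):
        w = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    Av = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
    lam = sum(a * b for a, b in zip(Av, v))
    return lam, v

def snp_effects(X, E, lam):
    """beta = X^T E / (M * lambda): SNP-level counterpart of eigenvector E."""
    n, m = len(X), len(X[0])
    return [sum(X[i][l] * E[i] for i in range(n)) / (m * lam) for l in range(m)]

X = standardize(simulate_genotypes())
lam1, E1 = top_eigenpair(grm(X))
beta1 = snp_effects(X, E1, lam1)

# Back-projection: X beta should reproduce the individual-level eigenvector.
E1_back = [sum(x * b for x, b in zip(row, beta1)) for row in X]
max_err = max(abs(a - b) for a, b in zip(E1, E1_back))
```

Here `max_err` measures how closely the genotype-weighted sum of SNP effects recovers the original eigenvector; it yields effect sizes only, with no test of significance.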

### Method 3: estimating SNP effects on eigenvectors with EigenGWAS

Given the realized genetic relationship matrix ***A***, for unrelated homogeneous (i.i.d.) samples, *E*(*A*_{ij}) = 0 (*i* ≠ *j*), and consequently *E*(***A***) = ***I***, an identity matrix. Due to sampling variance of the genetic relationship matrix ***A***, the off-diagonal elements differ slightly from zero even for unrelated samples (Chen, 2014). If we replace the ***A*** matrix with its mathematical expectation, the identity matrix, Equation 3 can be further reduced to $\hat{\boldsymbol{\beta}}_{k} = \tilde{\mathbf{X}}^{T}E_{k}$, which is equivalent to the single-marker regression *E*_{k} = *a* + *bx* + *e*, as implemented in PLINK (Purcell *et al*., 2007). Furthermore, standardization of ***X*** is not required because it does not affect the *p*-value. Thus, SNP effects can be estimated using single-marker regression, which is computationally much easier in practice and is implemented in many software packages. Each SNP effect, $\hat{\beta}$, is estimated independently, and the *p*-value of each marker is obtained directly, whereas BLUP and SVD require additional steps to obtain it.
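The single-marker regression described above can be sketched in a few lines. This is an illustrative implementation rather than the PLINK or GEAR code: the function names and the toy data are hypothetical, and the *p*-value uses a 1-df chi-square approximation to the Wald test, which is an assumption suited to large samples.

```python
import math

def single_marker_regression(e, x):
    """Regress eigenvector values e on genotypes x: e = a + b*x + residual.

    Returns (b, se, chi2, p), where chi2 = (b / se)^2 is compared with a
    1-df chi-square distribution (a large-sample approximation).
    """
    n = len(x)
    mx, me = sum(x) / n, sum(e) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxe = sum((xi - mx) * (ei - me) for xi, ei in zip(x, e))
    b = sxe / sxx
    a = me - b * mx
    rss = sum((ei - a - b * xi) ** 2 for xi, ei in zip(x, e))
    se = math.sqrt(rss / (n - 2) / sxx)
    chi2 = (b / se) ** 2
    p = math.erfc(math.sqrt(chi2 / 2.0))  # upper tail of chi-square, 1 df
    return b, se, chi2, p

def eigengwas(E, genotypes):
    """Scan SNPs one at a time; genotypes is a list of per-SNP columns."""
    results = []
    for x in genotypes:
        if len(set(x)) < 2:  # monomorphic SNP: no regression possible
            results.append(None)
            continue
        results.append(single_marker_regression(E, x))
    return results

# Toy example: six individuals, one eigenvector, three SNPs (0/1/2 coding).
E_k = [-0.52, -0.31, -0.20, 0.18, 0.35, 0.50]
snps = [
    [0, 0, 1, 1, 2, 2],  # tracks the eigenvector closely
    [1, 0, 2, 0, 1, 2],  # weakly related
    [1, 1, 1, 1, 1, 1],  # monomorphic
]
scan = eigengwas(E_k, snps)

# Standardizing the genotype changes b but not the p-value.
x = snps[0]
mx = sum(x) / len(x)
sd = math.sqrt(sum((v - mx) ** 2 for v in x) / len(x))
x_std = [(v - mx) / sd for v in x]
p_raw = single_marker_regression(E_k, x)[3]
p_std = single_marker_regression(E_k, x_std)[3]
```

Because each SNP is handled independently inside `eigengwas`, the list of SNPs can be split by chromosome or region and the pieces processed separately.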

We summarise the properties of SVD, BLUP, and EigenGWAS, and the transformations between them, as follows:

- *E*_{k} is determined by the ***A*** matrix; in other words, it is determined completely by the genotypes. If we consider each *E*_{k} as the trait of interest, a quantitative trait, its heritability is 1.
- *h*^{2} = 1. SVD and BLUP are both computational tools for converting a vector from the *N* × *N* matrix to the *M* × *M* matrix. SVD is a special case of BLUP when *h*^{2} = 1.
- *h*^{2} = 1 and *E*(***A***) = ***I***. When these two conditions are set, BLUP is further reduced to single-marker association analysis, which is EigenGWAS as proposed in this study.

Recently, in independent work, Galinsky *et al*. (2015) introduced an approximation to find the proper scaling for SNP effects (“SNP weight” in Galinsky’s terminology) estimated from SVD, in order to produce accurate *p*-values. In our EigenGWAS framework, *p*-values for the SNP effects on each individual-level eigenvector are generated automatically. In practice, it is conceptually easier to conduct EigenGWAS on eigenvectors than to conduct BLUP/SVD. Also, if computational speed is of concern, EigenGWAS can easily be parallelized by chromosome, by region, or even by locus.

### Interpretation for EigenGWAS

We can write a linear regression model *E*_{k} = *a* + *βx* + *e*, in which both *E*_{k} and *x* are standardized. Assuming that a sample has two subdivisions, which have sample sizes *n*_{1} and *n*_{2}, and the sampling variance for *β* is . A chi-square test for *β* is in which is Nei’s estimator of genetic difference for a biallelic locus (Nei, 1973).
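A short derivation, offered as a sketch rather than as the exact form of Eq 4, shows why the regression chi-square tracks allele-frequency divergence. It assumes that *E*_{k} separates the two subdivisions perfectly (so it is proportional to a membership indicator *z*), Hardy-Weinberg proportions within each subdivision, and weak divergence, so that the squared correlation is small. The symbols $z$, $\omega_1$, $\omega_2$, $p_1$, $p_2$ and $\bar{p}$ are introduced here for illustration.

```latex
\begin{aligned}
\omega_1 &= n_1/n, \quad \omega_2 = n_2/n, \quad n = n_1 + n_2, \quad
\bar{p} = \omega_1 p_1 + \omega_2 p_2 \\
\operatorname{cov}(x, z) &= 2\,\omega_1\omega_2\,(p_1 - p_2), \qquad
\operatorname{var}(z) = \omega_1\omega_2, \qquad
\operatorname{var}(x) \approx 2\,\bar{p}\,(1 - \bar{p}) \\
r^{2} &= \frac{\operatorname{cov}(x, z)^{2}}{\operatorname{var}(x)\,\operatorname{var}(z)}
\approx \frac{2\,\omega_1\omega_2\,(p_1 - p_2)^{2}}{\bar{p}\,(1 - \bar{p})} \\
\chi^{2}_{1} &\approx n\,r^{2}
= \frac{2\,n\,\omega_1\omega_2\,(p_1 - p_2)^{2}}{\bar{p}\,(1 - \bar{p})}
\end{aligned}
```

With both variables standardized, the regression coefficient equals the correlation *r*. The ratio $(p_1 - p_2)^{2} / \{\bar{p}(1 - \bar{p})\}$ is, up to a constant that depends on the estimator, an *F*_{st}-type measure of divergence at the locus, so the test statistic grows with both sample size and locus-specific differentiation.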

In principal component analysis, the proportion of the variance explained by the largest eigenvalue is equal to (McVean, 2009), in which for a pair of subpopulations as defined in Weir (Weir, 1996). So , in which characterizes the average divergence for a pair of subpopulations. When the test statistic, Eq 4, is adjusted by the largest eigenvalue *λ*_{1}, an equivalent technique in GWAS for the correction of population stratification, . For a population with a pair of subdivisions . So after adjustment by the largest eigenvalue, the test statistic is immune to population stratification, at least for a divergent sample.

A locus under selection should have a greater *F*_{st} than the background divergence. The statistical power for detecting whether a locus is under selection is therefore determined by the strength of selection, which can be defined as the ratio between the *F*_{st} of a particular locus and the average divergence in the sample. This is analogous to a chi-square test with non-centrality parameter (NCP), .

Unless otherwise specified, in this study *F*_{st} refers to the one defined in Weir (Weir, 1996).

### Validation and prediction for population structure

Once *β*_{k} is estimated, it is straightforward to obtain the genealogical profile for an independent target sample. In general, this is equivalent to genomic prediction, and the theory for prediction can be applied (Daetwyler *et al*., 2008; Dudbridge, 2013). The predicted genealogical score can be generated as $E_{k} = \mathbf{X}\hat{\boldsymbol{\beta}}_{k}$, in which *E*_{k} is the predicted *k*^{th} eigenvector, $\hat{\boldsymbol{\beta}}_{k}$ is the vector of estimated SNP effects, and ***X*** is the genotype matrix for the target sample. We focus on the correlation between the predicted eigenvectors and the direct eigenvectors, and thus it does not matter whether ***X*** or $\tilde{\mathbf{X}}$ is used.

In contrast to conventional prediction studies, which focus on a metric phenotype of interest, prediction of population structure is focussed on a “latent” variable. This latent variable is the genetic structure of the population, which is shaped by the allele frequencies and linkage disequilibrium of markers. Thus, expectations of prediction accuracy differ from what has been established for conventional prediction (Daetwyler *et al*., 2008; Dudbridge, 2013). We therefore assess the prediction accuracy for *E*_{1} across markers, using different prediction thresholds (Purcell *et al*., 2009).
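The projection of a target sample onto an eigenvector, and the comparison with directly computed eigenvectors, can be sketched as follows. The SNP effects, genotypes, and direct eigenvector values are made-up numbers for illustration, and the function names are hypothetical. The second half assumes that the same per-SNP means and standard deviations are used to rescale both the genotypes and the effects; under that assumption the raw and standardized versions give the same correlation.

```python
import math

def predict_eigenvector(X_target, beta):
    """Predicted score for each target individual: sum over SNPs of x * beta."""
    return [sum(x * b for x, b in zip(row, beta)) for row in X_target]

def pearson(u, v):
    """Pearson correlation between two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

def column_stats(X):
    """Per-SNP mean and standard deviation of a genotype matrix."""
    n = len(X)
    stats = []
    for l in range(len(X[0])):
        col = [row[l] for row in X]
        m = sum(col) / n
        sd = math.sqrt(sum((v - m) ** 2 for v in col) / n)
        stats.append((m, sd))
    return stats

# Hypothetical SNP effects estimated in a reference sample (three SNPs).
beta_hat = [0.40, -0.25, 0.10]
# Target genotypes (0/1/2) and eigenvector values computed directly in the target.
X_target = [[0, 2, 1], [1, 1, 0], [2, 0, 1], [2, 1, 2], [0, 1, 1]]
E_direct = [-0.61, 0.02, 0.55, 0.43, -0.39]

E_pred = predict_eigenvector(X_target, beta_hat)
r_raw = pearson(E_pred, E_direct)

# Standardized genotypes with effects moved to the matching scale.
stats = column_stats(X_target)
X_std = [[(x - m) / sd for x, (m, sd) in zip(row, stats)] for row in X_target]
beta_std = [b * sd for b, (m, sd) in zip(beta_hat, stats)]
r_std = pearson(predict_eigenvector(X_std, beta_std), E_direct)
```

Centring shifts every predicted score by the same constant and the scaling cancels between genotypes and effects, which is why `r_raw` and `r_std` agree.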

Here we propose an equation for prediction accuracy, especially for *E*_{1}
when there is no heritability, the predictor can be simplified to , meaning that as the number of markers increases, prediction accuracy should rapidly reach 1. Here *h*^{2} is interpreted as the genetic difference in the source population, or real ancestry informative markers. For a homogeneous population, the genetic difference is largely due to genetic drift, and *h*^{2} ≈ 0.

For this study, the genetic relationship matrix (***A*** matrix), principal component analysis, and BLUP estimation were conducted using GCTA software (Yang, Lee, *et al.*, 2011). Single-marker GWAS was conducted using PLINK (Purcell *et al*., 2007) or GEAR (https://github.com/gc5k/GEAR/wiki/EigenGWAS; https://github.com/gc5k/GEAR/wiki/ProPC).

## Web resource and data availability

GEAR is available at http://cnsgenomics.com/

GCTA is available at http://cnsgenomics.com/

PLINK is available at http://pngu.mgh.harvard.edu/~purcell/plink/index.shtml

1000 Genomes Project: http://www.1000genomes.org/

## Author contributions

GBC, SHL, and BB conceived the study. GBC, SHL, BB, and MRR designed the experiment. GBC and SHL developed the theory and methods. BB conducted the quality control for HapMap data, and MRR conducted quality control for POPRES data. GBC performed the analyses of the study. GBC and ZXZ developed the GEAR software. GBC, MRR, SHL, and BB wrote the paper.

## Acknowledgements

This research was funded by ARC (DE130100614 to SHL), NHMRC (APP1080157 to SHL, APP1084417 and APP1079583 to BB, and APP1050218 to MRR), and GBC was supported by IAP P7/43-BeMGI from the Belgian Science Policy Office Interuniversity Attraction Poles (BELSPO-IAP) program. We thank Peter M. Visscher for discussion, helpful comments, and for proposing the name EigenGWAS. Robert Maier assisted with ggplot, and Alex Holloway helped with GitHub. We also thank the Information Technology group at the Queensland Brain Institute. The POPRES datasets were obtained from dbGaP at http://www.ncbi.nlm.nih.gov/gap through accession number phs000145.v4.p2.