Testing for differences in polygenic scores in the presence of confounding

Polygenic scores have become an important tool in human genetics, enabling the prediction of individuals’ phenotypes from their genotypes. Understanding how the pattern of differences in polygenic score predictions across individuals intersects with variation in ancestry can provide insights into the evolutionary forces acting on the trait in question, and is important for understanding health disparities. However, because most polygenic scores are computed using effect estimates from population samples, they are susceptible to confounding by both genetic and environmental effects that are correlated with ancestry. The extent to which this confounding drives patterns in the distribution of polygenic scores depends on patterns of population structure in both the original estimation panel and in the prediction/test panel. Here, we use theory from population and statistical genetics, together with simulations, to study the procedure of testing for an association between polygenic scores and axes of ancestry variation in the presence of confounding. We use a general model of genetic relatedness to describe how confounding in the estimation panel biases the distribution of polygenic scores in a way that depends on the degree of overlap in population structure between panels. We then show how this confounding can bias tests for associations between polygenic scores and important axes of ancestry variation in the test panel. Specifically, for any given test, there exists a single axis of population structure in the GWAS panel that needs to be controlled for in order to protect the test. Based on this result, we propose a new approach for directly estimating this axis of population structure in the GWAS panel. We then use simulations to compare the performance of this approach to the standard approach in which the principal components of the GWAS panel genotypes are used to control for stratification.


S1 Full Model
To model the distribution of genotypes in both panels, we assume that each individual's expected genotype at each site can be modelled as a linear combination of contributions from a potentially large number of ancestral populations, which are themselves related via an arbitrary demographic model.Natural selection, genetic drift, and random sampling each independently contribute to the distribution of genotypes across panels, and we make the approximation that these three effects can be combined linearly.We begin by first developing the population level model, before extending it to individuals.

S1.1 Population model
We assume that for each site , the vector of allele frequencies across K populations can be decomposed as where a is the allele frequency in the original ancestral population, while the deviations ∆p ,D and ∆p ,S capture variation in allele frequency across populations due to genetic drift and natural selection respectively.We approximate the effect of drift via a multivariate Normal model, , where F pop describes the covariance structure imposed by the population model [1,2,3,4].Selection induces an additional deviation of ∆p ,S = ηβ a (1 − a ) A pop , where β is site 's effect (in standardized units) on the phenotype of interest, η is the strength of selection on that phenotype, and A pop is a vector recording the extent to which each population inherits from the ancestral population in which selection occurred.Under the null, where A pop A pop is a rank one matrix accounting for the impact of selection on the joint distribution of allele frequencies.We assume the trait is sufficiently polygenic and/or that selection is sufficiently weak that so that for any individual variant, the effects of drift dominate over those of selection (i.e.∆p ,D >> ∆p ,S for all ).This condition is met, for example, if there are at least hundreds of loci and σ 2 η is on the order of one.

S1.2 Sampling individuals for GWAS and test panels
Next, we consider taking two samples, one to compose the test panel and one to compose the GWAS panel.Individuals in each panel are created as mixtures of the underlying populations.There are N test panel individuals and the deviation of their genotypes from the expected genotype in the ancestral population (2a ) is where X ,D = 2W X ∆p ,D , X ,S = 2W X ∆p ,S .The matrix W X has dimensions N × K, with rows specifying the fraction of ancestry that each of the N test panel individuals inherit from each of the K populations.We can think of the quantity 2a + X ,D + X ,S as giving a set of expected genotypes given the evolutionary history of the population, while X ,B contains the binomial sampling deviations across individuals given these expected genotypes.
Similarly, for the M GWAS panels individuals, the deviation of their genotypes can be decomposed as where

S1.3 Individual level model
Individuals in the two panels can draw ancestry from the same populations, or from related populations, which induces the joint covariance structure where the matrix contains the within and between panel relatedness coefficients.These are in turn related to the population level covariance matrix as Similarly, the genotypic deviations due to selection can be written at the individual level as describe the extent to which individuals in each of the panels inherit from the selection event.
The joint covariance structure of the individual genotypes is therefore where and B is a diagonal matrix with entries accounting for over-dispersion of the binomial sampling step due to evolutionary variation of the underlying allele frequencies (i.e. from drift and selection).
The exact details of this matrix are not important other than that it is diagonal, i.e. it does not contribute to covaraince among individuals.

S1.4 Phenotypes
As in main text Equation ( 5), we assume that the vector of mean-centered phenotypes for the M individuals in the GWAS panel can be written where u = S β G is the combined genetic effect of all S causal variants, and e represents the combination of all environmental effects.The distribution of environmental effects is e ∼ M V N c, σ 2 e I , where c is the vector of expected environmental effects.We consider c to be fixed (as opposed to random).
The genetic effect, u, can be broken down into the contributions from drift, selection, and sampling.
In this case, u S = S β G ,S is the vector of expected values of the genetic contributions to the phenotype, given the ancestries of the individuals in the GWAS panels.Both u D and u B have expectation zero and u D has a correlation structure determined by relatedness matrix F GG (i.e is the additive genetic variance in the original ancestral population.The full covariance structure of u is therefore where We are ultimately interested in the behavior of polygenic scores given that some stratification effect exists in the phenotype.To this end, from here forward we will generally condition on the strength of selection on the phenotype, η, treating it as a fixed effect alongside c.Doing so gives us the condition expectations and covaraince structures:

S1.5 Polygenic scores
We consider a vector of mean centered polygenic scores, computed in the test panel.If the causal effects were known, then the contribution of the S causal sites included in the polygenic score would be We continue now to condition on the effects of selection.Doing so, Z is a vector of random variables with the randomness coming from X ,D + X ,B , neutral variation in the test panel genotypes due drift and binomial sampling, respectively.Therefore, the vector of expected polygenic scores for the N individuals in the test panel is Here the expected polygenic scores depend on the strength of selection on the phenotype, the ancestral additive genetic variance, and the extent to which each individual in the test panel inherits from the selection event.

S1.6 Polygenic score association tests
We want to test the hypothesis that the polygenic scores are associated with some test vector, T , more than is expected due to drift alone.We do not necessarily assume that T is equal to A X , the axis along which selection has actually perturbed the polygenic scores.
To test for association of polygenic scores with the test vector, we consider the linear model where ε is i.i.d.Normal across individuals.A more powerful test is available by modeling the correlation structure among individuals, but the simpler i.i.d.model is sufficient for our purposes (see section S8).We assume that the test vector, T , is scaled to have a variance of one, so the slope is given by and its conditional expectation by, Under the null model, η = 0, so E[q] = 0, reflecting the fact that genetic drift is directionless (though we may also have E [q] = 0 if the test vector is not aligned with the axis along which selection has perturbed the polygenic scores, i.e. if A X T = 0).We compare this null hypothesis to an alternative in which η = 0 (assuming also that A X T = 0), reflecting the possibility that the polygenic score may have either a positive or negative association with the test vector.We refer to this as a "polygenic score association test".
We also re-frame this test as a statement about the association between the effect sizes and a set of genotype contrasts, r = 1 N X T , which measure the association between the test vector and the genotypes at each site [5].Similar to the decomposition of the test panel genotypes above, we can decompose the genotype contrasts as which capture contributions from drift, selection, and sampling, respectively.Writing β and r for the vectors of effect sizes and genotype contrasts across loci, we can write the test statistic as q = β r. (S20)

S2 Expectation of marginal effects
Conditional on the GWAS panel genotypes and the genetic and environmental components of the phenotype, the marginal association at site is given by (equation ( 7) in the main text).Now, taking the expectation over the randomness due to drift and sampling in the GWAS panel genotypes and the total genetic effect, and over the random component of the environmental effect on the phenotype, the expected effect size estimate at site is where 1 + F G = 1 + 1 M M m=1 f mm is the average level of self relatedness in the GWAS panel.The approximation at line (S26) comes from approximating the expectation of the ratio as the ratio of expectations.There is an additional approximation in line (S27) that comes from assuming that , consistent with our assumption that the effects of selection are small at the level of individual loci, relative to the effects of drift and sampling.
Therefore, consistent with results from [6], the expected marginal effect estimate for a given site is expected to be equal to the causal effect if the null holds (i.e. if η = 0).In contrast, under our alternative model, the focal site and the genetic background have a positive empirical covariance as they are both influenced by selection.As a result, the expected marginal effect is biased away from zero (i.e.further in the direction of the causal effect), and may be biased in either direction depending on whether the environmental confounder is positively or negatively associated with the axis along which selection has perturbed the focal allele.For the following analyses concerning bias in the distribution of polygenic scores and in polygenic score association tests (see section S3 and S4) we assume that the above biases in marginal effect estimates induced by selection are small relative to those induced by shared drift between the GWAS and test panels [7,8,9,10,5].

S3 The expected polygenic scores
Given the marginal effect estimates, the vector of N polygenic scores in the test panel is given by, We are interested in the expected polygenic scores in settings where both genetic and environmental confounders exist.We can write this expectation as The first sum is the expected "true" polygenic scores given the non-neutral effects on the test panel genotypes ( while the second sum captures contributions to the polygenic scores due to associations between the genotypes at individual focal sites and the genetic component of the phenotype from all other sites (u − = = β G ).The third captures contributions due to associations between the genotypes and the environmental component.We consider each of the second two in turn, beginning with the environment, which is more straightforward.
For each site , we can write The approximation in line (S34) arises from assuming that G ,S and X ,S are small compared to G ,D and X ,D , and can therefore be ignored (see S2). Alternately, (S34) is exact under the null that η = 0.In line (S38), we make a ratio of expectations approximation.FGX is the expected kinship matrix computed on standardized genotypes, which is approximately equal to F GX (1+FG) , with the approximation being better if F G is small.Summing across sites, we have where S is the number of causal sites.
For the contributions to the polygenic score arising from associations between genotypes and the genetic component of the phenotype, we have Here S58 shows how the axis captured by FGr in the GWAS panel is related to the underlying populations in our model, and to the pattern of population structure among them.The vector T pop = W X T has length K and captures the axis of the test in terms of the underlying populations, while F pop T pop similarly has length K, and captures the extent to which genetic drift along the path to each population is associated with the axis of population structure identified by the test vector.F pop T pop therefore describes the axis of confounding in terms of populations.Multiplying this vector by W G then rotates this axis into the individual space of the GWAS panel to give FGr .

S5 Expectation of polygenic scores while controlling for FGr
Now, controlling for FGr , the marginal effects are where P = I − 1 FGr FGr F Gr Thus, conditional on the variables in the GWAS panel, the polygenic scores are so the expected polygenic scores can be written as Following the same steps as above (see S34, S40, and S45), we take the expectation of each of the above terms separately.The first term is again the expected "true" polygenic scores ( ).Then we consider the association between the environment and the residual genotypes, such that the expected bias is Here we highlight a reasonable assumption we make regarding our ability to detect true signal when controlling for FGr .As discussed in the main text, including FGr as a covariate removes any correlation between β and r under the null hypothesis where η = 0. Notably, under the alternative hypothesis, our ability to detect signal is preserved as we use variation along other axes in the GWAS panel to estimate effect sizes which in turn can still be correlated with r S in the test panel.
We therefore rely on the assumption that for site there is enough remaining variation in G after regressing out FGr to estimate β .If all variation in G lies along FGr , y PG ≈ 0 and we will be unable to estimate β .For example, if FGr represented individuals on opposite sides of a population split this would only apply to sites with fixed differences.This is unlikely to be a concern in practice as F ST is low in human populations [11] and most variants will not be perfectly correlated with a single axis of structure.Therefore, it is reasonable to assume that it will still be possible to accurately estimate β , albeit with slightly larger standard error.

S7 Downward bias with true signal S7.1 Expected bias
The direct estimator approach (i.e computing FGr and including it as covariate in the GWAS) proposes to use the test panel genotype data twice: once when controlling for stratification in the GWAS panel, and a second time when testing for an association between the polygenic scores and the test vector.Shouldn't this remove the signal we are trying to detect?While the answer is yes, at least for naive applications, the effect will be small so long as the number of SNPs used to compute the correction is large relative to the number included in the polygenic score.
To see why, we can rewrite the regression eq.23 from the main text in terms of a sum of the contribution from our focal site and all other sites as where FGr,− is the estimate of FGr that one would obtain using all sites other than site .Thus, because our FGr includes a contribution from the focal site, controlling for it induces a slight bias in the estimated effect size, the sign of which depends on the sign of r .If r is positive, β has a slight negative bias, whereas if r is negative, the bias will be positive.Similar effects are noted elsewhere in the statistical genetics literature, for example in correcting for PCs of gene expression data [12], and in the use of linear mixed models in GWAS [13], where it has been termed "proximal contamination".Assuming that the variance of r √ G G /M across sites included in the score is similar to those used to compute the correction, the product r β will be biased toward 0 by a factor of approximately 1 − 1 L for each site, owing to the fact that the focal site contributes approximately 1 L of our estimate FGr .Our test statistic is a sum over contributions of r β from S independent sites, so including FGr as a covariate when estimating effect sizes induces a downward bias of