Population stratification in GWAS meta-analysis should be standardized to the best available reference datasets

Population stratification has recently been demonstrated to bias genetic studies even in relatively homogeneous populations such as within the British Isles. A key component to correcting for stratification in genome-wide association studies (GWAS) is accurately identifying and controlling for the underlying structure present in the sample. Meta-analysis across cohorts is increasingly important for achieving very large sample sizes, but comes with the major disadvantage that each individual cohort corrects for different population stratification. Here we demonstrate that correcting for structure against an external reference adds significant value to meta-analysis. We treat the UK Biobank as a collection of smaller studies, each of which is geographically localised. We provide software to standardize an external dataset against a reference, provide the UK Biobank principal component loadings for this purpose, and demonstrate the value of this with an analysis of the geographically sampled ALSPAC cohort.

Genome-wide association studies (GWAS) are increasingly being used to identify biological pathways 25 underlying complex traits and diseases. They have become an essential part of making direct links 26 between genetics and phenotypes (Visscher et al. 2017) and have facilitated causal inference 27 through Mendelian Randomization (Paternoster, Tilling, and Smith 2017; Zhu et al. 2018). However, 28 detecting and interpreting associations remains a challenge because genetic associations tend to be 29 tiny (particularly for polygenic traits) and other associations may be large. 30 Many groups have joined efforts to create large consortia that assemble results from multiple 31 GWAS, providing aggregated sample sizes that are now in excess of a million individuals (Linnér et al. 32 2019; Lee et al. 2018). Meta-analysis of consortia datasets improves the power necessary to detect 33 many genotype-phenotype associations. However, where population structure exists in a dataset 34 but is insufficiently controlled for, it can lead to spurious or inflated genotype-phenotype 35 associations (Lawson et al. 2020; Peterson et al. 2017 (Warrington et al. 2015) and ten (Okbay et al. 50 2016) to 40 -the default provided by UK Biobank (Bycroft et al. 2018) -to 100 or more (Abdellaoui 51 et al. 2019). Yet even 100 PCs are insufficient (Lawson et al. 2020) as important structures may 52 explain less variation than noise and hence remain uncorrected, which can lead to uneven correction 53 and bias in meta-analysis. 54 We propose a simple solution. Correction can be improved and standardized using a large external 55 reference dataset to define "all human genetic variation", against which local variation within a 56 single study can be compared. Thus, whilst population stratification might act as a source of 57 covariance between genotypes and phenotypes, this can be corrected for. We demonstrate that 58 meta-analyses corrected for population stratification using a large external reference dataset 59 ("global" ancestry correction) performs better than meta-analyses corrected for population 60 stratification using the same dataset ("local" ancestry) in the UK Biobank. 61 There are alternative methodologies for stratification correction that go beyond PC correction. 62 Linear Mixed Models (LMMs) have gained popularity since their introduction in genetics (Yu et al. 63 2006) through easy-to-use software such as GCTA (Yang et al. 2011 analysis. These findings provide evidence that similar correction will lead to changes in findings for 76 large-scale meta-analysis. 77 Software and appropriate reference data are provided (see Code Availability) to allow others to 78 easily apply this to their own data.

80
Identification of population structure required for correction Population Structure in the UK Biobank 99 We restricted our stratification analyses to 331,890 UK Biobank participants of UK ancestry 100 excluding Northern Ireland, and ~12M SNPs after quality-control filtering and LD pruning (see 101 Materials and Methods). For illustration purposes, we clustered individuals using k-means (see 102 Materials and Methods) into 5 clusters ( Figure 1a). The largest cluster represented southern and 103 eastern England, with northern England, Scotland, North Wales, and South Wales each being 104 represented (Galinsky et al. 2016) . We are not attempting to infer actual ancestry from these PCs. 105 PCs are ordered by the total variation explained in the data. Major variation directions are 106 associated with deep historical splits between populations such as African vs Eurasians (PC1-2), 107 Europeans vs East Asians (PC1-3), Central Asia (PC3-4), and Europe (PCs 5,8). This contrasts regional 108 variation within the UK for which the main PCs are 5 and 9 describing variation between English, 109 Scottish and Welsh ancestry, as we as PCs 11 and 14 which further separate structure within Wales 110 and England. This is strongly structured by study centre, which captures current living location 111 (Figure 1b). These and other PCs (Supplementary Figure S2-3) correspond to known historical and 112 geographical areas (Leslie et al. 2015). 113 To assess how much of this variation is captured by local PCs, we performed PC projection, i.e. a 114 regression analysis for each global PC using all local PCs as predictors (see Materials and Methods). 115 Local PCs capture global variation with varying veracity (Figure 1c). ancestry, and the inverse prediction of explaining local PCs using Global PCs shows that the local 122 analyses typically contain only 2-4 ancestry related PCs (Figure 1d). 123 This observed population structure within the UK provides a source of covariance between 124 genotypes and phenotypes that can bias epidemiological inference from genetic data. The following 125 sections establish consequences of unexplained covariance for understanding complex disease. 126

127
The most straightforward measure of stratification is of the total variation in phenotypes explained 128 by genetic PCs, without attributing this to individual SNPs. Both educational attainment (EA) and 129 Body Mass Index (BMI) vary by region (Supplementary Figure S4) and show large systematic 130 differences between local ancestry and global ancestry correction ( Figure 2). Several study centres 131 explain dramatically less variation with local PCs than global, for example for EA in Croydon (0.6% 132 local vs 3.2% global) and Hounslow (0.8% local vs 3% global). Figure 1c-d explains this as a failure to 133 identify components corresponding to Scottish, Welsh and other ancestries that are individually rare 134 but nevertheless important when considered together. Conversely others, especially centres with 135 small sample sizes such as Wrexham and Swansea, explain more variation in local than global 136 ancestry. 137 We tested 24 disease statuses for the amount of variance explained by Local or Global PC correction, 138 and found that Psoriasis, Hyperthyroidism, and Hypothyroidism were all significant different ( Figure  139 S5) and Multiple Sclerosis and Asthma are implicated though not significant after correcting for 140 multiple testing. 141 Our analyses demonstrate two competing effects. Firstly, local PCs in small studies "overfit", as they 142 are able to explain much of the variance present regardless of whether it describes real ancestry or 143 noise. This is why the number of PCs corrected for is often thresholded using a noise-level 144 approximation  and justifies the small number of PCs used in early GWAS. 145 Secondly, some ancestry components will not be recovered in a small dataset due to lack of 146 statistical power. Mathematically, PC analysis displays a transition as sample size decreases, in which 147 a particular population structure is identified when enough variation exists for it, and rather abruptly 148 becomes indistinguishable from noise (McVean 2009). Importantly, local PCs perform worse not 149 solely in small studies, but in larger but genetically more homogenous populations of the South-East 150 of England. It is rare shared variation, regardless of the size of the study, that local PCs fail to identify 151 and hence correct for. 152

153
Meta-analysis is a statistical tool for combining results from coherent studies on different samples. A 154 fundamental principle in GWAS meta-analysis is that all studies included examined the same 155 hypothesis, had similar study design and analyzed study-level SNPs in a near-identical way (Zeggini 156 and These results are consistent with the proportion of variance in different phenotypes (e.g. education 169 attainment and BMI) being larger when corrected for global PCs than local PCs (Supplementary 170 Figure S7). The magnitude of the difference varies between phenotypes, and depends on the causal 171 model i.e. the relationships between phenotype, genotype, ancestry, and geography (Lawson et al. 172 2019). 173 Reference PCs can be used to identify structure: a case study in ALSPAC 174 To test our hypothesis that uncorrected population structure may lead to misleading inference, we 175 examined the ALSPAC cohort. Local variation is lost when effective sample size for a particular 176 ancestry reduces beyond a threshold. We compare two studies in Bristol, the UK Biobank (N=27,503) 177 and ALSPAC (N=7,927 mothers in our analysis). When constructing global ancestry using the entire 178 UK Biobank variation, the two datasets have very similar genetic variation profiles across all PCs 179 (Figure 4) As ALSPAC is a relatively small cohort, the uncertainty involved in SNP effect estimation dominates 196 the results. However, we found that the more robust estimates (higher Z-scores) changed 197 systematically between correction models (Figure 5b, Supplementary Figure S8). Intriguingly, the 198 direction is not the same for all phenotypes; local correction results in relatively larger estimates (i.e. 199 under-correction) for EA, whilst it results in smaller estimates for BMI, which could imply subtle 200 relatedness or improved power from correcting for ancestry. 201 Constructing a Genetic Score using this procedure leads to a similar picture, with systematic biases in 202 prediction (Figure 5c, Supplementary Figure S9). Whilst there is statistical power to detect some 203 differences in the scaled scores (e.g. in EA and CRP) these are unlikely to be practically significant 204 changes. We therefore view the ordering of individuals to have been robust in this example. 205 However, the raw scores are strongly skewed, again with biases in both directions, and further, the 206 bias direction appears unrelated to whether SNPs were individually over or under predicted. 207 Providing an appropriate set of ancestry covariates 208 The primary barrier to using the UK Biobank PCs is a lack of access to a) SNP loadings, and b) 209 reference information to scale SNPs and perform QC carefully. We provide the key 18 ancestry PCs 210 plus SNP information in an R package and script (see Code Availability) which allows trivial access to 211 for all datasets in plink bim/bed/fam format of any size (e.g. runs on all 500k UK Biobank individuals 212 in 6 hours). We further provide up to 200 UK Biobank PCs. 213 Users with access to UK Biobank data should consider the bigsnpr R package (Privé et al. 2020) which 214 allows translation of any dataset into UK Biobank PCs with careful quality control assured due to 215 comparison with the original raw data. Advanced users who do their own quality control and 216 imputation may wish to directly apply the flashpca software (Abraham, Qiu, and Inouye 2017) to our 217 provided reference data. Our package provides strand and build checks, automatically merges data 218 coded with different minor alleles, and accounts for a moderate amount of non-overlapping SNPs. 219 Above, our UK Biobank results used BoltLMM (Loh et al. 2015). We confirm that these results are not 220 meaningfully different to what we would have seen using linear regression correcting for PCs with 221 PLINK (Supplementary Figure S10). The ALSPAC results also used PLINK. Therefore the effects 222 describe are confirmed to apply to both linear regression and linear mixed models using the 223 BoltLMM approximation. 224

225
Population stratification in association studies has received much attention. However, it has typically 226 been considered as a problem of unintended correlations within the dataset, leading to correction in 227 the form of a within-sample analysis (using PCA or other approaches). We provide evidence that this 228 framing is insufficient. Whilst it is indeed unintended correlations that we wish to correct for, 229 population structure is not always detectible from the dataset being studied. This hard-to-quantify 230 population structure can be structurally related to phenotypes. 231 We demonstrated that within the UK Biobank's individual study centres with samples of tens of 232 thousands, as well as in the independent ALSPAC cohort, correcting for population stratification with 233 a high-quality, external measure of population structure is necessary. Population structure exists at 234 the within-city level and it is not correctly quantified within geographically clustered datasets. We 235 found considerable residual correlation with phenotypes and identified that the SNP-level estimates 236 were systematically biased. This resulted in appreciable error at the genome-wide level for the 237 construction of Genetic Scores. 238 We identified that, were the UK Biobank to have been analysed as independent study centres 239 subject to meta-analysis, then Educational Attainment, BMI, Psoriasis, Hyperthyroidism and 240 Hypothyroidism would all have led to biased inference. This is likely to be the tip of the iceberg in 241 meta-analyses, since the UK is a rather homogeneous population and the power in rare diseases is 242 low. 243 Because global PCs are unarguably a better measure of population structure, it is tempting to 244 interpret the effect size for the global PC correction as "more correct" than that for the local PC 245 correction, and therefore the difference as a bias with the traditional approach. However, it is not 246 that simple. We found little consistency in the direction of the bias; for example, EA for ALSPAC 247 children appears to be "undercorrected" by local PCs, whereas the mothers EA appeared 248 "overcorrected". The reality is that confounding is caused by many sources, and shared ancestry is 249 just one. Here we suspect that cryptic relatedness may exist, which is captured only by the local PCs. 250 The informed reader may find these results self-evident. However, the evidence that we provide of 251 the importance and ease of improved stratification correction has clear implications. Future meta-252 analyses and association studies should adopt a new protocol for quantifying population 253 stratification. Further, every analysis of small to medium sized cohorts whose association outputs 254 remain of value should be re-considered. Large meta-analyses are particularly valuable and yet 255 vulnerable to the biases identified here. Similarly, phenotypes with a non-trivial social or 256 environmental component (Morris et al. 2020) are likely to be influenced by this or other hidden 257 structural biases. 258 The new protocol should continue to adjust for relatedness within the cohort, but it must also add 259 the confounding covariates of ancestry as quantified by a large and hence statistically powerful 260 external resource. We provide such "genetic measures" for the UK Biobank reference in the form of 261 PC loadings that can project any individual into this worldwide quantification of genetic variation. 262 Yet for non-UK individuals, even in the UK Biobank, this may be insufficient. There is no reason that 263 institutions with access to large limited-access databases could not make and share independent PC 264 loadings, for every region of the world, that smaller association studies with less power individually 265 can apply. Although this is a partial solution because a nuanced quantification of ancestry is not 266 linear, these sharable PCs will improve stratification correction with trivial cost, so the genetics 267 community can and should implement this immediately. 268 Individuals with sex-mismatch (derived by comparing genetic sex and reported sex) or individuals with 327 sex-chromosome aneuploidy were excluded from the analysis (n=814). 328 We restricted the sample to individuals of 'european' ancestry as defined by an in-house k-means 329 cluster analysis performed using the first 4 principal components provided by UK Biobank in the 330 statistical software environment R. The current analysis includes the largest cluster from this analysis 331 (n=464,708) ( Japanese and Yoruba reference populations; all individuals with non-European ancestry were 343 removed. SNPs with a minor allele frequency of < 1%, a call rate of < 95% or evidence for violations of 344 Hardy-Weinberg equilibrium (P < 5x10-7) were removed. Cryptic relatedness was measured as 345 proportion of identity by descent (IBD) > 0.1. Related subjects that passed all other quality control 346 thresholds were retained during subsequent phasing and imputation. 9,115 participants and 500,527 347 SNPs passed these quality control filters. ALSPAC mothers were genotyped using the Illumina 348 human660W-quad array at Centre National de Génotypage (CNG) and genotypes were called with 349 Illumina GenomeStudio. PLINK (v1.07) was used to carry out quality control measures on an initial set 350 of 10,015 subjects and 557,124 directly genotyped SNPs. SNPs were removed if they displayed more 351 than 5% missingness or a Hardy-Weinberg equilibrium P value of less than 1.0e-06. Additionally, SNPs 352 with a minor allele frequency of less than 1% were removed. Samples were excluded if they displayed 353 more than 5% missingness, had indeterminate X chromosome heterozygosity or extreme autosomal 354 heterozygosity. Samples showing evidence of population stratification were identified by 355 multidimensional scaling of genome-wide identity by state pairwise distances using the four HapMap 356 populations as a reference, and then excluded. Cryptic relatedness was assessed using an IBD estimate 357 of more than 0.125 which is expected to correspond to roughly 12.5% alleles shared IBD or a 358 relatedness at the first cousin level. Related subjects that passed all other quality control thresholds 359 were retained during subsequent phasing and imputation. 9,048 subjects and 526,688 SNPs passed 360 these quality control filters. 361 We combined 477,482 SNP genotypes in common between the sample of mothers and sample of 362 children. We removed SNPs with genotype missingness above 1% due to poor quality (11,396 SNPs 363 removed) and removed a further 321 subjects due to potential ID mismatches. intervals on the liability scale using a Taylor transformation expansion series (Loh et al. 2015). 383 (Viechtbauer 2010) using the normal distribution approximation. P-values for the difference in R 2 were 386 calculated by computing the difference in the estimates, and the variance of the difference (the sum 387 of the individual variances) and using the null that the R 2 =0 again using rma. For binary traits, only 388 study centres with at least 20 cases were considered. We also implemented a bootstrap procedure 389 that did not make the normal distribution approximation, in which study centres were resampled 500 390 times with replacement. However, the results were not qualitatively different (not shown). 391 Polygenic scoring 392 Genetic scores were created in the ALPAC cohort using PLINK (Purcell et al. 2007 recalculated to be comparable to those that would have been obtained had the full test been 417 administered and then age-scaled to give a total overall score combined from the performance and 418 verbal subscales. BMI was measured during the direct assessments at ages 7, 8, 9, 10 and 11. In order 419 to increase sample size, where BMI data were not available at age 7 we used BMI measured at the 420 earliest available subsequent measurement. C-reactive protein (CRP) was measured from non-fasting 421 blood assays taken during direct assessment when the offspring were aged 9. 422 Detecting bias in scores and SNP effects 423 To assess statistical power, we work with z-scores, i.e. ! = $ ! / " ' , where $ ! is the estimate of the 424 effect of SNP I and " ' is the estimate of the standard deviation of this estimate. To compare the 425 global (g) and local (l) effects we consider the mean estimate ̅ ! = ) #,! + %,! +/2 and difference ! = 426 ) #,! − %,! + for each SNP i. To prevent the large number of barely-significant estimates from 427 dominating the signal, we assign a weight to each SNP ! = 1/ ! where ! is the density estimate 428 taken from a 5 nearest-neighbour estimate using "knnDE" from the R Package "TDA". We then 429 perform robust regression for ~ and report the regression estimate and confidence interval. We 430 further checked that our conclusions are not impacted by these choices by performing regular 431 unweighted regression for ~. 432 UK Biobank trait definition 433 Years of education was determined by recoding highest level of education reported in a questionnaire. 434 Response were coded as basic formal education (7 years), O-levels/GCSEs/CSEs or equivalent (10 435 years), A-level/AS levels or equivalent (13 years), NVQ or HND or HNC or equivalent (19 years) and 436 College/University degree (20 years). We also binary studied educational attainment (EA), which is 437 measured as 1 for people who have obtained a College or University degree. 438 Height and weight were measured during the participants' baseline visit to a UK Biobank assessment 439 center. 440 Heel bone mineral density (eBMD) was estimated based on an ultrasound measurement of the 441 calcaneus by UK Biobank. The T-score is the number of standard deviations for bone mineral density 442 relative to the mean. Consistent with the criteria established by Kemp et al., individuals were 443 excluded that exceeded the following thresholds for eBMD: males, ≤0.18 or ≥1.06 g/cm2; females 444 ≤0.12 or ≥1.025 g/cm2. 445 Other traits were self-reported at the verbal interview and coded as yes/no. If the participant was 446 uncertain of the type of illness they had had, then they described it to the interviewer (a trained 447 nurse) who attempted to place it within the coding tree. If the illness could not be located in the 448 coding tree then the interviewer entered a free-text description of it. These free-text descriptions 449 were subsequently examined by a doctor and, where possible, matched to entries in the coding tree. 450