EraSOR: Erase Sample Overlap in polygenic score analyses

Background Polygenic risk score (PRS) analyses are now routinely applied in biomedical research, with great hope that they will aid in our understanding of disease aetiology and contribute to personalized medicine. The continued growth of multi-cohort genome-wide association studies (GWASs) and large-scale biobank projects has provided researchers with a wealth of GWAS summary statistics and individual-level data suitable for performing PRS analyses. However, as the size of these studies increase, the risk of inter-cohort sample overlap and close relatedness increases. Ideally sample overlap would be identified and removed directly, but this is typically not possible due to privacy laws or consent agreements. This sample overlap, whether known or not, is a major problem in PRS analyses because it can lead to inflation of type 1 error and, thus, erroneous conclusions in published work. Results Here, for the first time, we report the scale of the sample overlap problem for PRS analyses by generating known sample overlap across sub-samples of the UK Biobank data, which we then use to produce GWAS and target data to mimic the effects of inter-cohort sample overlap. We demonstrate that inter-cohort overlap results in a significant and often substantial inflation in the observed PRS-trait association, coefficient of determination (R2) and false-positive rate. This inflation can be high even when the absolute number of overlapping individuals is small if this makes up a notable fraction of the target sample. We develop and introduce EraSOR (Erase Sample Overlap and Relatedness), a software for adjusting inflation in PRS prediction and association statistics in the presence of sample overlap or close relatedness between the GWAS and target samples. A key component of the EraSOR approach is inference of the degree of sample overlap from the intercept of a bivariate LD score regression applied to the GWAS and target data, making it powered in settings where both have sample sizes over 1,000 individuals. Through extensive benchmarking using UK Biobank and HapGen2 simulated genotype-phenotype data, we demonstrate that PRSs calculated using EraSOR-adjusted GWAS summary statistics are robust to inter-cohort overlap in a wide range of realistic scenarios and are even robust to high levels of residual genetic and environmental stratification. Conclusion The results of all PRS analyses for which sample overlap cannot be definitively ruled out should be considered with caution given high type 1 error observed in the presence of even low overlap between base and target cohorts. Given the strong performance of EraSOR in eliminating inflation caused by sample overlap in PRS studies with large (>5k) target samples, we recommend that EraSOR be used in all future such PRS studies to mitigate the potential effects of inter-cohort overlap and close relatedness.


23
Here, for the first time, we report the scale of the sample overlap problem for PRS analyses by generating 24 known sample overlap across sub-samples of the UK Biobank data, which we then use to produce GWAS 25 and target data to mimic the effects of inter-cohort sample overlap. We demonstrate that inter-cohort 26 overlap results in a significant and often substantial inflation in the observed PRS-trait association, 27 coefficient of determination (R 2 ) and false-positive rate. This inflation can be high even when the absolute 28 number of overlapping individuals is small if this makes up a notable fraction of the target sample. We 29 develop and introduce EraSOR (Erase Sample Overlap and Relatedness), a software for adjusting 30 inflation in PRS prediction and association statistics in the presence of sample overlap or close 31 relatedness between the GWAS and target samples. A key component of the EraSOR approach is 32 inference of the degree of sample overlap from the intercept of a bivariate LD score regression applied to 33 the GWAS and target data, making it powered in settings where both have sample sizes over 1,000 34 individuals. Through extensive benchmarking using UK Biobank and HapGen2 simulated genotype-35 Introduction 45 Polygenic risk scores (PRSs) are proxies of individuals' genetic liability to a trait or disease [1]  [11], as well as GWAS resources from large consortia such as the Psychiatric Genomic Consortium 52 (PGC) [12], GIANT [13] and the Global Lipids Genetics Consortium (GLGC) [14] have provided 53 unprecedented opportunity to perform highly-powered PRS analyses. 54 However, expansion in data size does not come without a cost in this setting: as sample sizes increase, 55 there is greater risk that samples are recruited into multiple cohorts or that entire cohorts are included in 56 multiple consortia. For PRS analyses, which typically test for association between PRS and a trait(s) or 57 outcome of interest, overlapping samples between the GWAS and target data samples can result in 58 spurious inflation of the coefficient of determination (R 2 ) and association P-values, leading to false-59 positive and exaggerated findings [15]. Overlapping samples should ideally be removed from either the 60 GWAS or target data to avoid misinterpretation of results, but participant privacy agreements usually 61 limit access to raw genotyping data, meaning that this is generally not an option. 62 Here we first evaluate the extent to which different degrees of sample overlap and relatedness between 63 GWAS and target samples generates biased PRS-trait associations. Next, to overcome the sample overlap 64 problem, we develop and introduce EraSOR (Erase Sample Overlap and Relatedness), a python software 65 that adjusts GWAS summary statistics [1] 161 Quantitative phenotypes (Y) with heritability (h 2 ) of 0, 0.1, and 0.5 were simulated using the UK Biobank 162 genotype data (post QC; see above) as input. Quantitative traits were simulated as: 163 where X is the standardized genotype matrix corresponding to all samples and 10,000 randomly 164 selected SNPs with effect size β following a standard normal distribution. Xβ were adjusted such that it 165 has mean 0 and variance of h 2 ; and ε represents the random error, which follows . To 166 ensure EraSOR works for distribution that are not only standard normal, we included α as the phenotypic 167 mean randomly sampled from a normal distribution with mean 0 and standard deviation of 1, and δ as the 168 phenotypic variable randomly sampled from 1 to 100 to simulate phenotypes that does not follow the 169 standard normal distribution. 170 To model polygenic risk score analyses with sample overlap, we randomly selected either 120k or 250k 171 individuals from the sample of 387,392 individuals available to us (see above) to generate two different 172 sizes of base GWAS data. Next, we randomly sampled 1,000, 5,000 or 10,000 individuals from the 173 remaining sample to act as three different sizes of target data, of which 0%, 5%, 10%, 50% or 100% were 174 randomly selected from the base data sample so that there was a known degree of sample overlap between 175 the base and target data. In addition, we generated an "overlap-free" base cohort in which the overlapping 176 samples were removed from the base cohort so that we could compare the result of applying EraSOR 177 against results of physically removing overlapped samples from the base cohort. 178 In order to search a feasible parameter space in sufficient depth, we only simulate phenotype with 179 heritability of 0.5, with a base cohort of 250k and target cohort of 5,000; only simulate base cohort with 180 120k samples when the phenotypic heritability is ≤ 0.1 and target cohort has 5,000 samples; and only 181 simulate target cohort with 1,000 and 10k samples when the base cohort contain 250k samples and the 182 phenotypic heritability is ≤ 0.1. The entire set of simulations were repeated 100 times. 183

Binary Trait
184 Binary traits were simulated under the liability threshold model [20], simulating a normally distributed 185 liability using Eq. 12 with α = 0, δ = 1, and cases defined as samples with disease liability higher than 186 liability thresholds of 0.9, 0.7 and 0.5, corresponding to population prevalences of 0.1, 0.3 and 0.5, 187 respectively. To limit the complexity of our simulations, the sample prevalence of our cohorts follows the 188 population prevalence. 189 In the binary trait setting, overlap can be ascertained such that the overlap is among cases, or among 190 controls, or among both. To investigate the effect of case-only or control-only overlap, we randomly 191 selected 120k effective samples (effective samples defined as as the base cohort, and then randomly selected 5,000 effective samples as the target cohort, where 0%, 193 5%, 10%, 30% or 50% of the cases or of the controls in the target cohort were sampled from the base 194 cohort. We also performed simulations where the overlapping samples were selected at random among 195 cases and controls. An "overlap-free" base cohort was generated with all overlapping samples removed. 196 In order to search a feasible parameter space in sufficient depth, we only vary the trait heritability when 197 the population prevalence is 0.1, and only vary the population prevalence when the trait heritability is ≤ 198 0.1. These simulations were repeated 100 times. 199 200 Spurious inflation in PRS analysis test statistics may also be observed when there are closely related 201 individuals between the base and target cohorts. To investigate the effects of relatedness on PRS results, 202

Related samples
we repeated the quantitative trait simulations with a modified Eq. 12: 203  have any first-degree relatives to form a base cohort containing 250k samples. We then generate target 210 cohorts containing 5,000 samples, with either 0%, 30%, 60% or 100% of the target samples being first-211 degree relatives of samples in the base cohort. We also generated a reference cohort from the base cohort 212 where all the related samples in the target cohort were replaced by unrelated individuals for 213 benchmarking the performance of EraSOR. The entire set of simulations were repeated 100 times. 214 215 An assumption of the EraSOR algorithm is that the environmental stratification (σ ) may be introduced. We devised two 218 strategies for simulating data with both environmental and genetic stratification to test the sensitivity of 219

Samples with population stratification
EraSOR to deviations of each from 0. In the first, we partitioned the UK Biobank into European and non-220 European ancestries, while in the second we used the simulation software HapGen2 [22]. 221 In the first simulation strategy, the UK Biobank samples were divided into European and non-European 222 ancestries based on 4-mean clustering on PC1 and PC2 (see above). Quantitative traits with 223 environmental stratification were then simulated as: 224 with the environmental stratification term (S) defined as 225 can take a value of 0, 0.
being the 227 covariance between X β and S. To investigate the effect of sample overlap in the presence of 228 environmental and genetic stratification, we randomly selected either 120k or 250k individuals from the 229 sample of 409,144 individuals available to us (see above) to generate two different sizes of base GWAS 230 data. Next, we randomly sampled 5,000 or 10,000 individuals from the remaining sample to act as two 231 different sizes of target data, of which 0%, 10%, 50% or 100% were randomly selected from the base data 232 sample. To ensure that the genetic and environmental stratification is the same within the base and target 233 data, the same ancestry ratio was maintained in all simulated data sets, matching the ratio in the full data 234 set (~5% non-European ancestry). In addition, we generated an "overlap-free" base cohort in which the 235 overlapping samples were remove from the base cohort to allow benchmarking the performance of 236 EraSOR. The entire set of simulations were repeated 25 times. 237 Given that only ~5% of the UK biobank samples correspond to individuals of non-European ancestry, the 238 effect of genetics and environmental stratification may be limited. We first estimated the false-positive rate induced by sample overlap by simulating non-heritable traits and 284 recording the fraction of significant PRS-trait association ( Supplementary Fig. 1). Highly significant 285 associations between PRS and non-heritable phenotypes were observed when even limited inter-cohort 286 sample overlap was present (Fig. 1). Specifically, for non-heritable quantitative traits, the inflation in 287 association (e.g. p-value of association) is highly positively correlated with the degree of overlap (Pearson 288 Correlation coefficient ሺ ߛ ሻ = 0.96, P-value < 2.2x10 -16 ). For example, when there is a base cohort of 250k 289 samples, target cohort of 5,000 samples and 250 overlapping samples (5% of target sample; degree of 290 overlap = 0.0071) the false positive rate is 17%, while this increases to 94% when there are 500 291 overlapping samples (10% of target sample; degree of overlap = 0.014) (Fig 1a). In the binary trait setting, sample overlap may be among cases only, controls only, or be among both. 295 These alternatives were investigated by first simulating binary traits with different population prevalence 296 using the liability threshold model [20]. Cohorts with effective sample sizes of 120k in the base data and 297 5000 in the target data were generated with different degrees and scenarios of sample overlap. We 298 observed extreme inflation associated with case-only overlap when population prevalence is lower than 299 0.5. For a binary trait with population prevalence 0.1, a false positive rate of 43% is observed when the 300 degree of overlap is 0.001, which corresponds to 5% of the cases from the target cohort also present in the 301 base cohort (Fig 1b). When the degree of sample overlap is doubled (~0.002), the false positive rate is 302 100%. The inflation in PRS-trait association is not as sensitive to control-only sample overlap when the 303 population prevalence is small. We observe a false positive rate of ~47% when the degree of overlap is as 304 high as 0.092, which corresponds to 50% of the controls from the target cohort also present in the base 305 cohort (Fig 1b). This discrepancy between the effect of case and control overlap is a result of the 306 differential contribution of cases and controls to the PRS-trait association in our simulations. Cases are 307 sampled from the extreme upper tail of the liability distribution at a frequency corresponding to the 308 . ce nd e an he he is he as se he re he disease prevalence, which is typically low: this gives each case greater weight in the calculation of the 309 PRS-trait association and, thus, an overlapping case will generate greater inflation than an overlapping 310 control. This was consistent with our simulation results (Fig 1c, 1d) have the same impact on the inflation (Fig 1d). 315 Given that complete overlap of individuals in the base and target data can generate PRS-trait associations 316 that are severely inflated, closely related individuals independently enrolled into the base and target 317 cohorts may induce some inflation considering their shared genetics and environment. Here we tested the 318 effect of relatedness between the base and target cohorts on PRS-trait associations in non-heritable traits 319 in a similar way to that for sample overlap (see Methods), where inter-cohort relatedness is defined as 320 In the next section we extend these investigations to consider the effects of sample overlap on PRS-trait 328 associations on heritable traits, but we present these findings in conjunction with results based on the 329 application of our method EraSOR, which is designed to resolve the problem. 330

331
To tackle the problem of inflation caused by inter-cohort overlap and relatedness, we developed the Erase 332 Sample Overlap and Relatedness (EraSOR) method. Using GWAS summary statistics generated from the 333 base and target cohorts, EraSOR implements univariate and bivariate LD score regression [17,18] to 334 estimate several parameters that are then used to perform a de-correlation calculation of the base GWAS 335 test statistics (see Methods). These adjusted base GWAS summary statistics can then be used for 336 downstream PRS analyses, with sample overlap or relatedness corrected for. 337 In order to evaluate the performance of EraSOR, we conducted an extensive set of simulations covering a 338 range of scenarios of inter-cohort sample overlap and relatedness (see below and Methods). 339 EraSOR also performs extremely well for binary traits (Figure 2b, 2c, 2d). In binary traits with heritability 352 0.1, population prevalence 0.1, a base cohort of with 120k effective samples, and a target cohort of 5,000 353 effective samples, the mean for the adjusted PRS in relation to case-only overlap is 1.61x10 -4 354 (standard error = 2.11x10 -4 ) (Fig 2b) and the mean for the adjusted PRS in relation to control-only 355 overlap is -3.77x10 -5 (standard error = 1.98x10 -4 ) (Fig 2c). On the other hand, when unadjusted, the mean 356 in relation to case only overlap is as high as 0.161 (standard error = 8.93x10 -4 ), whereas the mean 357 is 6.28x10 -3 (standard error = 2.59x10 -4 ) in relation to control-only overlap. Therfore, when the trait is non-heritable, Eq 9 may be unstable and lead to an error in the ErasOR 366 adjustment. However, we recommend that polygenic risk score analyses should not be performed on traits 367

Simulations using UK Biobank data
with estimated h 2 < 0.05 (see [1]) and, thus, in sufficiently powered applications of PRS, EraSOR should 368 have strong performance. 369 One potential reason for the robustness of EraSOR may be due to the simplistic population structure of 399 the simulated genotype. As we simulated the environmental stratification according to the population 400 label, it is possible that by adjusting for PCs, the environmental stratification was fully adjusted for. 401 While it is highly unlikely that environmental stratification is orthogonal to population genetic structure, 402 we performed an additional simulation in which the population label was randomly assigned to the 403 simulated genotype. This ensured the simulation of environmental stratification independent of population 404 genetic structure and, thus, should not be capture by PCA adjustment (see Supplementary Methods and 405 Supplementary Fig. 8). 406 Even when environmental stratification is simulated independently of the population genetic structure, 407 EraSOR adjustments are still robust to different environmental structure. For quantitative traits with 408 heritability of 0.1, base cohort size of 250k, target cohort size of 5,000 and environmental stratification of 409 0.3, the mean for the adjusted PRS is -6.08x10 -4 (standard error = 9.68x10 -4 ). 410 Figure 3. Performance of EraSOR from the HapGen2 simulation. Y-axis represents the mean difference between the observed R 2 and the expected R 2 and X-axis represents the degree of overlap. Range shows the 95% confidence interval. Based on simulation results, it seems like EraSOR are robust against different environmental stratification ( ). 411

412
The recent advent of large-scale national and regional biobank projects, such as the UK relatedness is a function of the phenotypic correlation. While this suggests that the impact of inter-cohort 442 overlap and relatedness are likely to be smaller for cross-trait analyses, EraSOR adjustments may still be 443 beneficial in these scenarios. Investigation of the performance of EraSOR in cross-trait analyses should be 444 the subject of future work. Further research is also required to understand the performance and biases of 445 EraSOR for applications in cross-trait studies and in its potential application to GWAS meta-analyses. 446 Our algorithm is not without any limitations. First, as EraSOR depends on the LD score intercept 447 estimates for the adjustment, all assumptions of LD score regression also apply to EraSOR. For example, 448 LD score regression assumes the level of genetic and environmental stratification is similar between the 449 two cohorts [13], and if this assumption is violated, then it is likely that the bivariate LD score equation 450 does not hold, which will lead to bias in EraSOR estimates. Moreover, due to reliance on LD score 451 regression estimates, EraSOR only produces sufficiently accurate adjustments for application when both 452 base and target cohorts have sample sizes greater than 1,000 and is only consistently accurate when both 453 cohorts are greater than 5,000 samples. Nonetheless, despite its limitations, EraSOR is an ideal tool for 454 application in settings in which there is known overlap in relation to large target samples and for 455 sensitivity analyses in PRS studies. If the performance of PRS using the unadjusted and EraSOR adjusted 456 summary statistics differs substantially, then this will act as a warning to the possible presence of inter-457 cohort overlap or close relatedness that should be adjusted for in order to obtain reliable PRS analysis 458 results. 459  X-axis shows the degree of overlap, and the Y-axis shows the mean difference between the observed R 2 575 and the expected R 2 . Mean difference in R 2 = 0 is represented by the black dotted line. Each row 576 corresponds to different base cohort sizes, each column corresponds to different target cohort size and 577 different colors correspond to different trait heritability. Performance of the adjusted PRS is indicated 578

Availability of supporting source code and requirements
with triangle and performance of the unadjusted PRS is indicated with circle. Shaded area represents the 579 95% confidence interval, which tends to be small. stratifications. The X-axis shows the degree of overlap, and the Y-axis shows the mean difference 614 between the observed R 2 and the expected R 2 . Shaded area shows the 95% confidence interval. Our result 615 suggest EraSOR are robust against different environmental stratification ( ) even when they are agnostic 616 to the principal components. 617