Medical data and machine learning improve power of stroke genome-wide association studies

Genome-wide association studies (GWAS) may require enrollment of up to millions of participants to power variant discovery. This requires manual curation of cases and controls with large-scale collaborations. Biobanks connected to electronic health records (EHR) can facilitate these studies by using data from clinical care systems, like billing diagnosis codes, as phenotypes. These systems, however, do not define adjudicated cases and controls. We developed QTPhenProxy, a machine learning model that adds nuance to cohort classification by assigning everyone in a cohort a probability of having the study disease. We then ran a GWAS using the probabilities as a quantitative trait. With an order of magnitude fewer cases than the largest stroke GWAS, our method outperformed previous methods at replicating known variants in stroke and discovered a novel variant in ABCG8 associated with intracerebral hemorrhage in the UK Biobank that replicated in the MEGASTROKE GWA meta-analysis. QTPhenProxy expands traditional phenotyping to improve the power of GWAS.


Introduction
1 Large genetic repositories connected to electronic health records (EHR) promise 2 the ability to perform thousands of genetic studies using data routinely captured 3 in clinical practice. High-throughput identi cation of cases and controls can 4 be di cult, however, due to time-consuming chart review and incompleteness 5 of medical records. Current disease genome-wide association studies develop 6 case-control sets in which power relies on a large amount of pure cases, and 7 the missingness of EHR data could prevent some cases from discovery in a high-8 throughput manner (McCarthy et al., 2008;Weiskopf et al., 2013;DeBoever et al., 9 2019). In addition, extreme case-control imbalance in biobanks can lead to in-10 creased type 1 error when running linear mixed model genome-wide association 11 analysis (Zhou et al., 2018). An incorporation of additional accessible EHR data 12 could improve case curation sensitivity. In addition, many diseases such as stroke 13 result from a combination of gene and environmental interactions, and there 14 is signi cant overlap with co-morbidities in genome-wide signi cant variants 15 (Malik et al., 2018a). Therefore, it is di cult to con rm every person without 16 the event a control, suggesting the utility of a disease likelihood assignment 17 (McCarthy et al., 2008;Yang et al., 2009). 18 The de nition of the disease phenotype in uences the success of detecting a 19 genetic signal since power is generally calculated by number of cases and controls 20 (Maher, 2008;Zaykin and Zhivotovsky, 2005). We propose that including EHR 21 information about comorbidities (Kohane, 2011) and other health information 22 about our subject to trait assignment can improve the power of GWAS. Past 23 studies have shown that incorporating diagnosis count improved the power of 24 genetic studies, and the addition of patient questionnaires and genetic correlations 25 to hospital records improved detection of cases for genome-wide association 26 studies (Sinnott et al., 2018;DeBoever et al., 2019;Liu et al., 2017;Ruderfer et al., 27 2019). Incorporating EHR data to develop a probability of suicide attempt also 28 improved the power of its genome-wide association study (Ruderfer et al., 2019). 29 We argue that including several modalities of health data to estimate assignment 30 of case probability can improve the power of genomic studies. For example, the 31 most successful genome-wide association study for stroke required 40,585 cases 32 and 406,111 controls (Malik et al., 2018a). We hypothesize that we can discover 33 genome-wide signi cant variants associated with stroke with a fraction of those 34 cases (4,354) by incorporating EHR information of 337,147 participants in the UK 35 Biobank Cohort into a stroke phenotype score to be used as quantitative trait. 36 In this study, we use machine learning methods to expand sample cohorts by 37 assigning every patient a probability of disease. We hypothesize that the output 38 of a supervised machine learning classi er, trained on the EHR data of a small 39 number of known cases and controls, can be used as a proxy variable for stroke 40 and will be an e cient strategy for expanding cohort size (See Graphical Abstract).

Mapping variants to known disease variant marker sets and mapping marker
117 sets to disease systems.

118
The EBI-GWAS catalogue has a database of all published GWAS (Buniello et al.,119 2019). We extracted over 2,000 disease marker sets conducted on populations with 120 European ancestry. MedDRA is a standardized medical vocabulary developed by 121 International Council for Harmonisation of Technical Requirements for Pharma-122 ceuticals for Human Use (ICH) (Brown, 1999). All terms in the vocabulary can be 123 mapped to its highest system level, which includes 27 di erent organ systems 124 and other general and lab studies such as social circumstance and investigations. 125 Using the NCBO annotator, we mapped the names of the EBI-GWAS disease 126 marker sets to the MedDRA System Organ Classes level (Whetzel et al., 2011). 127 2.6. Assessing the speci city of the QTPhenProxy-derived variants. 128 To assess the disease speci city of the genome-wide signi cant variants, we 129 rst calculated the proportion of genome-wide signi cant variants in each of the 130 EBI-GWAS disease marker sets. We then aggregated the marker sets together by 131 System Organ Class to evaluate the systems enriched for genome-wide signi cant 132 variants. We ordered the marker sets in each class by proportion of genome-wide 133 signi cant variants and divided by the number of marker sets in each class. We 134 also compared the proportion of variants of varying signi cance of the marker 135 sets related to each disease with 1) the other marker sets related to the same 136 System Organ Class as the disease and 2) all other marker sets. We strati ed each 137 comparison by signi cance value, between 0.05 and 5E-08. For each disease, we gathered EBI-GWAS marker sets that contained the 140 disease in its name. These represent known variants of each disease. We then 141 extracted the p-value from either the binary trait logistic regression or QTPhen-142 Proxy linear regression. We then ran a t-test comparing the negative log base 143 10 p-values of the binary trait with QTPhenProxy GWAS. We also ran a t-test 144 comparing the di erence between the binary and QTPhenProxy log base 10 p val-145 ues for the known ischemic stroke variants and an equivalent number of random 146 variants. 147 2.8. Re nement of discovered variants by QTPhenProxy using conditional analysis 148 At each LD-independent locus, the SNP with lowest p-value may not be the 149 variant that causes the most phenotypic variation within the area (Yang et al., 2012). 150 Therefore, we applied GCTA-COJO, a conditional analysis that takes into account 151 lead SNPs and the LD structure of a sample of the population, to our genome-152 wide association results (Yang et al., 2012). We randomly sampled 10,000 subjects 153 from the UKBiobank for the linkage disequilibrium calculation (Wells et al., 2019). 154 From the GCTA-COJO results, we then mapped each locus to its nearest gene 155 using dbSNP and the UCSC Genome Browser accessed at http://genome.ucsc.edu/ 156 (National Center for Biotechnology Information, National Library of Medicine; Based on the scale di erence between binary and quantitative traits, the odds 169 ratios are not directly comparable. We decided to simulate a method to convert 170 beta coe cients from QTPhenProxy to Odds ratios. We did this by simulating a 171 quantitative trait, converting it to a binary trait, and testing the correlation of a 172 simulated marker variant's signi cance between the two methods. Using SOLAR, 173 a software package for estimating heritability using identity by descent calcula-174 tions, we simulated a quantitative trait with one quantitative trait locus with two 175 alleles and a nearby marker locus with two alleles (Almasy and Blangero, 1998). 176 We rst removed all related individuals with a resulting cohort of 4,195 subjects.

177
For the simulation, we varied the frequency for the causal minor allele and a 178 marker minor allele from 0.05-0.45 in increments of 0.010, the mean quantitative 179 trait value for the heterozygous genotype from 5-45 in increments of 10 and the 180 homozygous genotypes' mean ± 50, the standard deviation of the quantitative 181 trait from 5-20 in increments of 4, and the recombinant fraction from 0.01-0.10 in 182 increments of 0.02. After simulating the quantitative trait distribution, we then 183 normalized the trait to several distributions: standard normal, normal distribution 184 9 with mean 0 and standard deviation 10, and mean 50 and standard deviation 10. 185 We compared distributions because we quantile normalized our QTPhenProxy 186 trait values before running the genome-wide associations studies. We then con-187 verted each simulation to a binary trait using liability thresholding (Risch, 1990).

188
Liability thresholding was implemented as follows: We determined a quantitative 189 trait value as a threshold based o the prevalence of the simulated trait, which 190 we varied from 2.5-20% in increments of 5%. Any subjects above this threshold is 191 labeled a case, and the rest controls in the binary trait phenotype. We then ran 192 linear or logistic regressions using the python package statsmodels between the 193 simulated quantitative trait or the binary trait and the subjects'genotypes for the 194 marker and causal loci (Seabold and Perktold, 2010). We developed a conversion 195 formula for the beta coe cients to odds ratios by linearly regressing the correla-196 tion between the simulated e ect sizes. We then converted the beta-coe cients 197 to odds ratios of the UK Biobank GWAS results by multiplying the beta coe cient 198 by the average slope and intercept and then taking the exponential of the result. In order to con rm the cases were well distributed within the data and to 201 determine the number of principal components to use as covariates, we conducted 202 PCA. For each of the four diseases, we rst pruned the variants used for PCA by 203 running a sliding window of size 100 kbps, 5 variant step size, and r 2 threshold of 204 0.1. We then combined the chromosomes, extracting only the pruned variants, 205 using plink2 and cat-bgen software (Band and Marchini, 2018). We then ran pca 206 with plink2 (Chang et al., 2015) and evaluated the PCs to be used as covariates 207 using a skree plot. Within the main PCs, we plotted them against each other, 208 highlighting the distribution of the cases. 209 2.12. LD Score Regression and evaluation of genomic in ation 210 We determined the lambda genomic correction and LD score regression coef-211 cient using the software LDSC (Bulik-Sullivan et al., 2015). We used the disease 212 GWAS summary statistics and European LD scores pre-computed from 1000 213 genomes by the Alkes group (Bulik-Sullivan et al., 2015). QQ plots were plotted 214 using qqman (Turner, 2014 We measured the genetic correlation of the QTPhenProxy EN stroke and 221 stroke subtype models using both QC1 and QC2 quality control and the results 222 from the binary classi cation of stroke with a large stroke study as well as other 223 stroke-related risk factors using LDSC software (Bulik-Sullivan et al., 2015). We 224 gathered summary statistics from the MEGASTROKE meta-analysis of GWA data 225 on stroke and stroke subtypes, including Acute Ischemic Stroke (AIS) (Malik et al.,226 2018a). We compared our results to these summary statistics and measured its 227 genetic correlation with the same risk factor GWAS compared in MEGASTROKE 228 (Malik et al., 2018a). Speci cally, we measured the genetic correlation with the 229 following risk factors: Coronary Artery Disease (CAD) (Nikpay et al., 2015), T2D 230 (Xue et al., 2018), Atrial Fibrillation (AF) (Nielsen et al., 2018), White matter 231 hyperintensities (WMH) (Traylor et al., 2019), and Low-density lipoprotein (LDL) 232 (Willer et al., 2013). Further information about the resources of these risk factors 233 can be found in Supplementary Information. Finally, we used LD Hub (Zheng et     3.1. QTPhenProxy has high model performance. 243 We trained models with 5 di erent classi er types and all stroke, ischemic 244 stroke, intracerebral hemorrhage, and subarachnoid hemorrhage as de ned by 245 the UK Biobank Algorithmically De ned Outcomes rubric (Schnier et al., 2017).  The binary trait logistic regression genome-wide association study for stroke, 257 ischemic stroke, and subarachnoid hemorrhage recovered no variants with genome-258 wide signi cance. Binary trait logistic regression genome-wide association study 259 for intracerebral hemorrhage recovered 2 variants with genome-wide signi cance. SNPs with 8 LD-independent loci, and 1059 genome-wide signi cant SNPs with 269 54 LD-independent loci using the EN and RF classi ers, respectively (Table 1). 270 We show the comparison of genome-wide signi cant variants between the QT-271 PhenProxy EN model and Binary trait GWAS for stroke in a Hudson plot ( Figure   272 2) (Lucas)   The binary trait logistic regression genome-wide association study for stroke, 298 ischemic stroke, and subarachnoid hemorrhage recovered no variants with genome-299 wide signi cance. Binary trait logistic regression genome-wide association study 300 for intracerebral hemorrhage recovered 2 variants with genome-wide signi cance.

301
Quantitative trait linear regression using QTPhenProxy probabilities for stroke 302 recovered 7 LD-independent loci, 3 LD-independent loci, for ischemic stroke, 3 303 LD-independent loci for subarachnoid hemorrhage, and 3 LD-independent loci 304 for intracerebral hemorrhage using the EN classi er. We show the comparison 305 of genome-wide signi cant variants between the EN and Binary in a Hudson 306 plot ( Figure 2) (Lucas). Out of known ischemic stroke variants, both models also 307 recovered 3 known variants with genome-wide signi cance, and the EN model    (Table 3) 349 (Malik et al., 2018a).  (Table 4).

377
From the over 2,000 EBI-GWAS disease variant marker sets mapped to organ 378 systems, we calculated the proportion of markers in each set that was found to be 379 at genome-wide signi cance by our QTPhenProxy model with QC2 markers and 380 principal components. We found that the organ systems with markers sets with    found that more common variants, or those with higher minor allele frequencies, 406 had higher genomic in ation than rarer variants (Figure 4).   For all stroke and its subtype ischemic stroke, the machine learning models 458 trained to assign QTPhenProxy probabilities performed well (greater than 90% 459 AUROC, greater than 30% maximum F1 score, 74-97% precision at 50). The    Our simulations showed high correlation between quantitative trait beta 485 coe cients and binary trait log odds ratio, which is similar to our empirical 486 ndings in the UK Biobank. We also show that the correlation slope is relatively 487 stable across all the di erent simulation parameters, suggesting that there is a 488 set correlation between quantitative traits and binary traits created from them.

489
Further simulation will need to determine the e ects of multiple loci on correlation 490 between quantitative and corresponding binary trait.  traits that are not stroke-related, such as bone disorders, cancer, and cataracts, 515 while the most signi cant were cardiovascular-related. These ndings suggest 516 that our models do highlight genetic associations related to the pathophysiology 517 of stroke. Although there is concern that this method may highlight genetic 518 associations related to risk factors of stroke rather than stroke itself, we show 519 that our models' GWAS are highly correlated with the MEGASTROKE GWAS and 520 have similar correlations with risk factor GWAS highlighted by the study. These 521 results suggest that QTPhenProxy models develop a phenotypic score ascertained 522 for stroke and its subtypes.  Results using QC1 showed more variants with genome-wide signi cance and 544 discovered more new variants for stroke than the QC2 GWAS. This may lead one 545 to believe that more stringent quality control led to reduced in ation of p-values.   (Schoonjans et al., 2016). LDLR gene codes for the low-density lipoprotein re-570 ceptor, which is involved in cholesterol production (Brown and Goldstein, 1979), 571 and the APOC1/APOC1P1 genes code for parts of apoliprotein C1, which are 572 involved in high density lipoprotein metabolism (Erqou et al., 2010). ARHGAP1, a 573 gene coding for Rho GTPase activating protein 1 has been associated with cancer 574 phenotypes and activation of hypoxic and in ammatory pathways (Fessler et al.,575 2004; Hashimoto et al., 2018;Zhang et al., 2019). ABCG8 gene has been associated 576 with combined gwas of lipids and in ammation, lipid levels, and gallstone disease 577 (Ligthart et al., 2016). In addition, it is knownto code for a cholesterol transporter  ization. In addition, the GWAS results from QTPhenProxy using the RF model 607 resulted in many more hits than the EN model, and reduced sensitivity to known 608 disease variants. Further study will be required to understand why one model 609 gave more sensitive and speci c results than the other, and whether there was 610 p-value in ation in the random forest models. QTPhenProxy also may be par-611 ticularly suited for stroke because the training model for phenotyping patients 612 was optimized for stroke (Thangaraj et al., 2019). Since stroke is an acute event 613 than can be identi ed with high accuracy in the electronic health record, this 614 method may not translate as well to other diseases, such as chronic illnesses.

615
A nal limitation discussion point is whether QTPhenProxy models highlight 616 genetic associations with risk factors for stroke rather than stroke itself. We