Polygenic risk modeling with latent trait-related genetic components

Polygenic risk models have led to significant advances in understanding complex diseases and their clinical presentation. While polygenic risk scores (PRS) can effectively predict outcomes, they do not generally account for disease subtypes or pathways which underlie within-trait diversity. Here, we introduce a latent factor model of genetic risk based on components from Decomposition of Genetic Associations (DeGAs), which we call the DeGAs polygenic risk score (dPRS). We compute DeGAs using genetic associations for 977 traits and find that dPRS performs comparably to standard PRS while offering greater interpretability. We show how to decompose an individual’s genetic risk for a trait across DeGAs components, with examples for body mass index (BMI) and myocardial infarction (heart attack) in 337,151 white British individuals in the UK Biobank, with replication in a further set of 25,486 non-British white individuals. We find that BMI polygenic risk factorizes into components related to fat-free mass, fat mass, and overall health indicators like physical activity. Most individuals with high dPRS for BMI have strong contributions from both a fat-mass component and a fat-free mass component, whereas a few “outlier” individuals have strong contributions from only one of the two components. Overall, our method enables fine-scale interpretation of the drivers of genetic risk for complex traits.


29
Common diseases like diabetes and heart disease are leading causes of death and financial 30 burden in the developed world 1 . Polygenic risk scores (PRS), which sum effects from many 31 risk loci for a trait, have been used to identify individuals at high risk for conditions like 32 cancer, diabetes, heart disease, and obesity 2-5 . Although many versions of PRS can be used 33 to estimate risk 6-8 , previous work suggests that a "palette" model which decomposes genetic 34 risk into pathways could better describe the clinical manifestations of complex disease 9 . 35 Here, we present a polygenic model based on latent trait-related genetic components 36 identified using Decomposition of Genetic Associations (DeGAs) 10 . Where standard PRS for 37 a trait models genetic risk as a sum of effects from genetic variants, the DeGAs polygenic 38 risk score (dPRS) models risk as a sum of contributions from DeGAs components 10 . Each 39 DeGAs component consists of a set of variants which affect a subset of the traits (Figure 1). 40 The component's genetic loading is a component PRS (cPRS) that approximates risk for a 41 weighted combination of relevant traits. Instantiated cPRS values for an individual can then 42 be used to create a profile that describes disease risk and informs genetic subtyping. 43 As proof of concept, we compute DeGAs using summary statistics generated from 44 genome-wide associations between 977 traits and 469,341 independent common variants in 45 a subset of unrelated white British individuals (n=236,005) in the UK Biobank 11 (Methods). 46 We then develop a series of dPRS models and evaluate their performance in additional 47 independent samples of unrelated white British individuals (n=33,716 validation set; 48 n=67,430 test set), and in UK Biobank non-British whites (n=25,486 extra test set). We 49 highlight results for body mass index (BMI), myocardial infarction (MI/heart attack), and gout, 50 motivated by their high prevalence among older individuals in this cohort 12 . 51

52
Study Population: 53 The UK Biobank is a large longitudinal cohort study consisting of 502,560 individuals aged 54 37-73 at recruitment during 2006-2010 11 . The data acquisition and study development 55 Biobank Engine (Web Resources). In this work, we highlight results for body mass index 111 (GBE ID: INI21001), myocardial infarction (HC326), and gout (HC328). 112

Risk modeling using Decomposition of Genetic Associations (DeGAs): 113
Given GWAS summary statistics, we computed DeGAs as previously described 10 . First, a 114 sparse matrix of genetic associations W (n ⨉ m) was populated with effect size estimates (or 115 z-statistics) between the n=977 traits and m=469,341 variants. Only variants with at least 116 two associations were used (p < 10 -6 ; Figure S1 has additional cutoffs). After filtration, rows 117 of W were standardized to zero mean and unit variance, to give traits equal relative weight. 118 Next, we performed a truncated singular value decomposition (TSVD) on W using the 119 TruncatedSVD function in the scikit-learn python module 19,20 to identify the top c=500 trait-120 related genetic components. TSVD outputs three matrices whose product approximates W:  (Figure 1a). W is approximated by U, S, and V as below: 123 The DeGAs risk profile is therefore a normalized measure which, for high risk individuals with 142 positive dPRS, is the fraction of risk owing to driving components. Analogously, for low risk 143 individuals with negative dPRS, it measures the contribution from protective components. 144 Computing polygenic risk scores: 145 As a baseline model for dPRS, we computed single-trait polygenic risk scores (PRS) with a 146 pruning and thresholding approach using the same summary statistics which were input to 147 DeGAs. As DeGAs requires variants filtered on LD independence and for p < p* based on a 148 critical value p* (see above), we simply used the variants present in the DeGAs input matrix 149 W. Specifically, the PRS weights for trait j were taken from the j-th row of W. The PRS was 150 then computed with PLINK v1.90b4.4 [21 May 2017] using the --score flag, with the `sum`, 151 `center`, and `double-dosage` modifiers. These correspond to the assumptions that variants 152 make additive contributions across sites; that the mean distribution of risk is zero; and that 153 the effect alleles have additive effects. These are the same assumptions used in our GWAS. 154 In a similar fashion, polygenic scores (cPRS) for all DeGAs components were 155 computed with PLINK2 v2.00a2 (2 Apr 2019) using the --score flag, with `center` and 156 `cols=scoresums` modifiers. These modifiers correspond to the same assumptions as in the 157 PRS: that genetic effects are additive across sites (this is the default genotype model for --158 score); that each component is zero-centered; and that alleles make additive contributions. 159 Given population-wide estimates of cPRS for every component, we computed dPRS and 160 DeGAs risk profiles for each trait using the formulas listed above. 161 In some analysis, PRS and dPRS were further adjusted by age, sex, and four genetic 162 principal components from UK Biobank's PCA calculation. Covariate adjustment was 163 performed by fitting a multiple regression model with dPRS (or PRS) and covariates in the 164 validation population. We also fit a covariate-only model using the same procedure (without 165 either polygenic score) and used its performance as baseline for the joint models (Figure 2). 166

Model validation: 167
To select DeGAs hyperparameters (the input p-value filter, and whether to use GWAS betas 168 or z-statistics as weights) we performed a grid search over a range of filtering p-values for 169 both betas and z-statistics. Performance of a DeGAs instance was assessed using the 170 average correlation between its dPRS models and their respective traits. For all traits used in 171 the decomposition, we computed Spearman's rho (rank correlation) between dPRS and 172 covariate-adjusted trait residuals in the training and validation sets. Residual traits are the 173 result of regressing out age, sex, and four genetic principal components. To avoid overfitting, 174 we further required modest correlation (Pearson's r 2 > 0.5) between training and validation 175 set performance. With this constraint, we find optimal performance using betas, a p-value 176 cutoff of 10 -6 , and 500 components (Figure S1), and note that this model explains nearly all 177 (>99%) of the variance of its corresponding input matrix W ( Figure S2). 178 For this best-performing DeGAs instance, we present several assessment metrics for 179 each polygenic score (dPRS and PRS) in each study population (Figure 2). For each score 180 and population, we estimated disease prevalence and mean quantitative trait values at 181 various population risk strata. We also estimated the effect of dPRS (or PRS) quantiles on 182 traits using a two-step approach. First, in the training set we compute the below regression: with PC i the i'th population-genetic PC, not the i'th DeGAs component. Second, in the test/ 184 validation set we estimate the effect β due to PRS quantile using the above parameters: is an indicator function that equals 1 if an individual is in the quantile of 186 interest q (e.g. 0-2%), and 0 if the individual is in the baseline group (40-60% risk quantile). 187 Individuals in neither the quantile of interest nor the baseline group were excluded; if 188 individuals were in both q and the baseline group (e.g. if q were 45-40%) they were counted 189 in q and removed from the baseline group. 190 We then assessed how well dPRS and PRS predicted quantitative trait values, and 191 how well they performed binary classification on disease status. For quantitative traits, we 192 report Pearson's r between each score and residual trait values. For binary traits, we report 193 the area under the receiver operating curve (AUROC/AUC) for each score as a classifying 194 metric, both alone and in a joint model with covariates. As baseline, we also report AUC for a 195 covariate-only model (see above for a description of this model and covariate adjustment). 196

Classifying genetic risk profiles from DeGAs components: 197
In order to assess whether our method could identify subtypes of genetic risk, we analyzed 198 the DeGAs risk profiles of high-risk individuals whose dPRS is driven by an "atypical" 199 combination of DeGAs components. We used the Mahalanobis distance (D M ) to identify 200 outlier individuals whose z-scored distance from the mean DeGAs risk profile exceeded 1: 201 where x is the DeGAs risk profile; ߤ is the mean profile; and S is the identity matrix. 202 Traditionally, S is taken to be the covariance matrix for each of the features across all x. 203 However, for this measure we treat each component as having equal variance. This results 204 in the above formula reducing to the Euclidean distance between a profile x and the mean 205 profile ߤ , which can be used to identify "atypical" individuals rather than statistical outliers. 206 We then intersected this set of outlier individuals with the top 5% of dPRS to create 207 the "high risk outlier" group. Here, we define the mean risk profile for a trait as the 208 component-wise mean across all individuals' DeGAs risk profiles in a high risk set (top 5% of 209 dPRS). To identify subtypes among high risk outliers, we performed a k-means clustering of 210 their DeGAs risk profiles using the KMeans function from the python scikit-learn module 19 . 211 The number of clusters k was determined by optimizing a statistic over putative values of k 212 ranging from 1 to 20. Specifically, we used the gap-star statistic in the python "gap-statistic" 213 module 21,22 as the statistic for selecting a value k. We then evaluated which components 214 drove risk in each cluster by computing a mean risk profile for the group (defined as above). 215 The mean risk profile was then renormalized to one for visualization (Figure 4g-i). 216

217
Genome-wide associations between 977 traits and 469,341 independent human leukocyte 218 antigen (HLA) allelotypes, copy-number variants 14 , and array-genotyped variants were 219 computed in a training set of 236,005 unrelated white British individuals from the UK Biobank 220 study 11 (Methods). We applied DeGAs 10 to scaled beta-or z-statistics from these GWAS 221 with varying p-value thresholds for input (Figure 1a). We then defined polygenic risk scores 222 for each DeGAs component (cPRS; Figure 1b) and used them to build the DeGAs polygenic 223 risk score (dPRS; Figure 1c). The model with optimal out-of-sample prediction ( Figure S1) 224 corresponded to DeGAs on beta values with significant (p < 10 -6 ) associations. the first 4 genotype principal components from UK Biobank's PCA calculation 11 . This trend 230 was most pronounced at the highest risk quantile (2%) for each trait. At this stratum we 231 observed 1.40 kg/m 2 higher BMI (95% CI: 1.14-1.67); 1.73-fold increased odds of MI (CI: 232 1.38-2.16); and 2.53-fold increased odds of gout (CI: 1.93-3.33) over the population average 233 in the test set (overall n=67,235 individuals for BMI; 2,812 MI cases; and 1,484 gout cases). 234 Further, we found dPRS to be comparable to prune-and threshold-based PRS using 235 the same input data ( Figure S2). Although there was some discrepancy between the 236 individuals considered high risk by each model (Figure S3,  (Figure 2a-c) using the same covariate adjustment as 240 dPRS. Population-wide predictive measures were also similar, with BMI residual r=0.21, and 241 PRS AUC (not adjusted for covariates) 0.54 for MI and 0.63 for gout (Figure 2d-f). We also 242 note similar performance for BMI dPRS predicting obesity (defined as BMI > 30; Figure S5), 243 with OR=1.7 at the 2% tail, and AUC=0.56. On balance, despite the reduced rank of the 244 DeGAs risk models -the input matrix W is reduced from ~1,000 traits to a 500-dimensional 245 representation -we achieve performance equivalent to traditional PRS for these example 246 traits and observe a similar trend for the other traits ( Figure S2, Data S1). 247 However, we note that dPRS and PRS add little population-wide predictive value 248 over factors such as age, sex, and demographic effects that are captured by genomic PCs 249 (Figure 2d-f). At the population level, we found r=0.12 between covariate-adjusted dPRS 250 and residualized BMI. For binary traits, the area-under the receiver operating curve (AUC) 251 was 0.55 for MI and 0.63 for gout, with unadjusted dPRS as the classifying score. Adjusting 252 for covariates, the marginal increase in AUC is modest: 0.005 for MI and 0.03 for gout. 253

Characterizing DeGAs Components: 254
We describe the latent structure identified through DeGAs by annotating each component 255 with its contributing traits and variants, aggregated by gene. The relative importance of traits 256 to components is measured using the trait contribution score 10 , which corresponds to a 257 squared column of the trait singular matrix U. The relative importance of components to each 258 trait is measured using the trait squared cosine score 10 , which is a normalized squared row 259 of US. The contribution and squared cosine score are defined analogously for variants and 260 genes using the variant singular matrix V. For each example trait, we highlight 5 components 261 of interest (ranked by the trait squared cosine score) and describe them by their respective 262 trait contribution scores (Figure 3) and gene contribution scores. The trait and gene 263 contribution scores for all components can be found in Data S2 and Data S3, respectively. 264 Body mass index is a polygenic trait with associated genetic variation relevant to 265 adipogenesis, insulin secretion, energy metabolism, and synaptic function 10,23 . Here, the 266 DeGAs trait squared cosine score (Figure 3) indicates strong contribution from components 267 related to body size and fat-free mass (PC1; 23.6%), fat mass (PC2; 35.9%), as well as risk 268 factors for obesity like body size at age 10 and trunk fat percentage (PC363; 4,5%). 269 Components related to exercise (PC206; 2.4%) and diabetes (PC6; 1.7%) also contribute. 270 Genetic variation proximal to STC2 and MC4R contribute strongly to both PC1 and 271 PC2 (Data S3). STC2 is a stanniocalcin-related protein most highly expressed in 272 cardiomyocytes and skeletal muscle. It has previously been associated with lean mass traits 273 in humans 24,25 and has been shown to restrict post-natal growth in mouse 26 . MC4R is a 274 melanocortin receptor in the G-protein coupled receptor family. It is primarily expressed in 275 the brain, is known to play a role in energy homeostasis and somatic growth 27,28 , and has 276 been associated with fat-mass and obesity-related traits in humans 29 . Both components also 277 have contribution from variation proximal to FTO and DLEU1, both of which associate with 278 traits affecting body size in adults 30,31 . FTO is an alpha-ketoglutarate dependent dioxygenase 279 whose causal role in BMI has been questioned 32 ; DLEU1 is a tumor-suppressing lncRNA 280 named for its frequent deletion in patients with chronic lymphocytic leukemia 33 . These Two of these components, PC11 and PC12, have contribution from variation 294 proximal to the lipoprotein gene LPA, at the 9p21.3 susceptibility locus (CDKN2B), and in the 295 brain-expressed solute carrier SLC22A3 34 (Data S3). Variation in these three genes also 296 contributes to PC32, as does variation proximal to the transcription factor STAT6 (which has 297 been associated with adult-onset asthma and inflammatory response to mosquito bites 35,36 . 298 PC30 also has contribution from STAT6, as well as the phosphatase and actin regulator 299 PHACTR1, which has been identified in prior coronary artery disease GWAS 37 . These 300 components reflect the diversity of conditions and risk factors which are comorbid with MI. 301 Gout is a heritable (h 2 =17.0-35.1%) common complex form of arthritis characterized 302 by severe sudden onset joint pain and tenderness, which is believed to arise due to 303 excessive blood uric acid which crystallizes and forms deposits in the joints 38 . The top 304 component for gout (Figure 3) has strong contribution from covariate-adjusted blood urate 13 305 (PC473; 80.4%). This component also explains most of the variance in the input genetic 306 associations for gout. Meanwhile, two other important components measure dietary intake. 307 One of them (PC7; 6.3%) is driven by measures of alcohol use and abuse; the other 308 (PC362; 0.8%) is related to coffee, water, and tea intake. Two other components are related 309 to covariate and statin-adjusted biomarkers 13 , namely, cystatin C (PC471; 3.1%) and gamma 310 glutamyltransferase (PC470; 2.5%). The key gene for PC473 is SLC2A9, which is involved 311 in uric acid transport and is associated with gout 39 . The alcohol dehydrogenase ADH1B is 312 key to PC7 and is associated with alcoholism and blood urea nitrogen (BUN) 40,41 . Indeed, 313 alcohol use has been identified as a lifestyle risk factor for gout 42 . These components reflect 314 the clinical pathogenesis of gout by urate buildup, as well as some of its lifestyle risk factors. 315

Painting DeGAs Risk Profiles: 316
To further characterize the genetic architecture of these traits, we "painted" the genetic risk 317 profiles of each high-risk individual (top 5% of dPRS). For this, we decomposed each 318 individual's dPRS across DeGAs components into a vector we call the DeGAs risk profile 319 (Methods). The DeGAs risk profile is an individual-level measure over the (in this case) 320 c=500 DeGAs components and is normalized such that the entries in the vector sum to one. 321 For individuals with higher than average risk (dPRS > 0), it describes the contributions from 322 components which contribute positively to risk; for individuals with lower than average risk 323 (dPRS < 0) it describes the contributions from components which have protective effects. As 324 an individual rather than population-level measure, the DeGAs risk profile can be used to 325 further examine the underlying genetic diversity among high risk individuals (Figure 4a-c) in 326 a way which complements the trait and gene squared scores from DeGAs. 327 We therefore investigated the diversity of components which drive risk among high 328 risk individuals, using their DeGAs risk profiles. We used the Mahalanobis criterion 329 (Methods) to find individuals in the test population whose risk profiles significantly differed 330 from average. We then found high-risk individuals (top 5% of dPRS) among these outliers (z-331 scored Mahalanobis distance > 2) to identify a group of "high-risk outliers". For these 332 individuals (Figure 4d-f), genetic risk is often driven by the same components as for other 333 high-risk individuals (Figure 4a-c), but the degree to which certain components contribute 334 can differ. For example, while the trait squared cosine score for gout identifies PC473 (urate) 335 as the top component (Figure 3), the DeGAs risk profile suggests PC7 (alcohol use traits) 336 has a key role in driving genetic risk for some individuals with high gout dPRS (Figure 4c). 337 This suggests that the DeGAs risk profile can identify individuals with high genetic risk 338 whose pathology may differ from "typical". 339 To better describe genetic diversity among outlying individuals, we attempted to 340 identify genetic subtypes of each example trait in the high-risk outlier population. We 341 performed a k-means clustering of this group using DeGAs risk profiles as the input; k was 342 chosen by optimizing the gap-star statistic across an array of potential values (Methods). 343 We described each cluster using its mean risk profile (Figure 4a-c) and noticed that cluster 344 membership divides individuals based on cPRS for relevant components (Figure 4d-f). 345 For body mass index, we identify two risk clusters (Figure 4g identified as important for gout by its trait cosine score. Furthermore, genetic variation in 370 ADH1B (one of the key genes for PC7) has been associated with gout in prior study 43 , 371 suggesting there may be shared genetic risk between both traits. The other cluster is harder 372 to interpret, due to the number of relevant components. Several components related to BMI 373 (PC1, PC2, PC227) are highly represented in the mean DeGAs risk profile, and high BMI is 374 a risk factor for gout 44,45 , but the data are not conclusive enough for definitive interpretation. 375 Instead, we note challenges in interpreting components as a limitation of our approach. 376

377
In this study, we describe a novel technique to model polygenic traits using components of 378 genetic associations. We build an example model using data from unrelated white British 379 individuals in the UK Biobank to show that our method adds an interpretable dimension to 380 traditional polygenic risk models by expressing disease, lifestyle, and biomarker-level 381 elements in trait-related genetic components. Predicting genetic risk with these components 382 led us to infer disease pathology beyond variant-trait associations without loss of predictive 383 power from reducing model rank (Figure S2). 384 For three phenotypes of interest (BMI, MI, and gout), we showed that the DeGAs risk 385 profile offers meaningful insight into the genetic drivers of trait risk for an individual. We then 386 used this measure to identify clusters of high-risk individuals who share similar genetic risk 387 profiles for each of the traits. We find, as in previous work 10 , that genetic risk for BMI can be 388 decomposed into fat-mass and fat-free mass related components. We also show that while 389 many individuals have risk for BMI driven by a combination of the two components, there 390 exist "outlier" individuals who have strong contributions from only one of them. Our results 391 further indicate that this diversity of contributory genetic risk is not limited to BMI. However, 392 extracting biological insights for other traits will likely require deeper phenotyping, or other 393 rich resources like single cell data. 394 We further demonstrated the generalizability of dPRS by assessing its performance 395 in independent test sets of white British and non-British white individuals (Figure S4; all 396 traits in Data S1) from the UK Biobank. Among non-British whites, the top 2% of dPRS 397 carries OR=1.9 for MI and 5.1 for gout (Figure S4), compared to 1.7 and 2.5 in the test set 398 individuals (Figure 2). Likewise, the top 2% of dPRS risk has 1.63 kg/m 2 higher BMI in non-399 British whites (1.40 kg/m 2 ) in the test set. Though we find similar performance for these traits 400 across these two groups, concerns about the generalizability of traditional clump-and-401 threshold PRS across groups also apply to dPRS. Though methods exist to identify 402 suspected causal variants via fine-mapping, we decided to LD-prune variants prior to 403 analysis with DeGAs. One benefit of this approach is that it is agnostic to fine-scale patterns 404 of association within LD blocks, which avoids the problem of having distinct (but highly 405 correlated) causal variants across traits. However, LD pruning may leave dPRS slightly more 406 vulnerable to overfitting patterns of LD in the GWAS population compared to approaches 407 which use fine mapped variants. This may be worth revisiting in future work. 408 We also note that our analysis of subtypes may not be robust to different choices of 409 input traits or study population. Taking gout as an example, our study finds two clusters of 410 outliers (Figure 4c), one of which is due to a component related to a clinical risk factor for 411 the trait (namely, alcohol use). Our ability to identify such clusters is clearly limited by the 412 inclusion or exclusion of related traits and their degree of correlation in our analysis cohort. 413 The components of genetic associations which can be identified by DeGAs also depend on 414 trait selection. Here, we excluded traits which may have noisy or confounded patterns of 415 genetic associations: specifically, rare conditions (n < 1000 in the UK Biobank) or traits 416 which correlate with social measures like socioeconomic status. In future work, careful 417 selection and curation of phenotypes may provide further insights than those we offer in this 418 study. We encourage replication efforts using similar methods and have made all DeGAs 419 risk models from this work available on the Global Biobank Engine 18 (Web Resources).  (dPRS) for trait j are recovered by taking a weighted sum of cPRSi, with weights from U (j,i-588 th entry). We also compute DeGAs risk profiles for each individual (Methods), which 589 measure the relative contribution of each component to genetic risk. We "paint" the dPRS 590 high risk individuals with these profiles and label them "typical" or "outliers" based on 591 similarity to the mean risk profile (driven by PC1, in blue). Outliers are clustered on their 592 profiles to find additional genetic subtypes: this identifies "Type 2" and "Type 3", with risk 593 driven by PC4 (red) and PC5 (tan). Clusters visually separate each subtype along relevant 594 cPRS (below). Image credit: VectorStock.com/1143365.