Polygenic transcriptome risk scores improve portability of polygenic risk scores across ancestries

Polygenic risk scores (PRS) are on course to translate the results of genome-wide association studies (GWAS) into clinical practice. To date, most GWAS have been based on individuals of European-ancestry, meaning that the utility of PRS for non-European populations is limited because SNP effects and LD patterns may not be conserved across populations. We hypothesized that cross population prediction at the level of genes rather than SNPs would be more effective, since the effect of genes on traits is likely to be more highly conserved. Therefore, we developed a framework to convert effect sizes at SNPs into effect sizes for genetically predicted transcript abundance, which we used for prediction in non-European populations. We compared this approach, which we call polygenic transcriptome risk scores (PTRS), to PRS, using data from 17 quantitative traits that were measured in multiple ancestries (European, African, East Asian, and South Asian) by UK Biobank. On average, PTRS using whole blood predicted transcriptome had lower absolute prediction accuracy than PRS, as we expected since not all regulatory processes were captured by a single tissue. However, as hypothesized, we found that in the African target set, the portability (prediction accuracy relative to the European reference set) was significantly higher for PTRS than PRS (p=0.03) with additional gain when transcriptomic prediction models ancestry matched the target population (p=0.021). Taken together, our results suggest that using PTRS can improve prediction in underrepresented populations and that increasing the diversity of transcriptomic data may be an effective way to improve portability of GWAS results between populations and help reduce health disparities.

study. 96 Experimental setup 97 An overview of the experimental setup describing the discovery, training, and target sets used in the paper is 98 shown in Fig.1. We randomly selected 356,476 unrelated Europeans in the UK Biobank for the discovery set. 99 For testing the performance of risk scores, we constructed 5 target sets with participants of various ancestries 100 in the UK Biobank. We used 2,835 African, 1,326 East Asian, and 4,789 South Asian individuals for the non 101 European target sets. We also reserved two randomly selected sets of 5,000 Europeans as additional target 102 sets. One was selected as the EUR reference set and the second European target was used as a test set to 103 assess the the variability of the results within the same ancestry. 104 For predicting the transcriptome, we downloaded prediction weights from multiple ancestries collected in 105 predictdb.org. The first set of models had been trained in European individuals from the GTEx v8 release sets where the weights for the prediction of transcriptomes were computed are shown in green. We downloaded the weights trained previously from predictdb.org. We sampled 5 target sets from the UK Biobank for testing the risk scores: two randomly sampled sets of European-, one African-, one East Asian-, and one South Asian-descent individuals. For each of the 5 target sets, predicted transcriptomes were calculated using the weights trained in each of the three training sets: GTEx EUR, MESA-EUR, MESA-AFHI.
Predicted transcriptome captures a significant portion of trait variability 112 To assess the feasibility and to quantify the potential for PTRS to predict human traits, we calculated the 113 proportion of variance explained by the predicted transcriptome assuming random effects of gene expression 114 levels. The approach is analogous to standard SNP-heritability estimation (Yang et al., 2010) where the 115 "predicted expression relatedness matrix" is used instead of the genetic relatedness matrix. In this section, 116 we calculated the predicted transcriptome using the GTEx EUR weights using the European target set 117 genotype data. Using these predicted expression levels, we calculated the "predicted expression relatedness 118 matrix" (instead of the genetic relatedness matrix) and applied the standard restricted maximum likelihood 119 estimation to calculate the proportion of variance explained by the predicted transcriptome.

120
Since the PVE varies depending on the underlying heritability of the trait, we also calculated the propor-121 tion of SNP heritability explained by the predicted transcriptome as the ratio of PVE and heritability. We 122 term this values "regulability" (Barbeira et al., 2020a). For the SNP-heritability we used publicly available As shown in Fig.2A, in the European target set, GTEx EUR whole blood based predicted transcriptome 126 captured on average 20.6% (s.e.=2.1%) of the trait variability. This result is largely consistent to the 127 estimates reported previously using a different approach (Yao et al., 2020). 128 Aggregating predicted transcriptomes in multiple tissues increases the PVE

129
To explore ways to increase the proportion of variance explained (PVE), we calculated the proportion 130 explained collectively by the transcriptome predicted in 10 tissues, including muscle, adipose, tibial artery, 131 breast, lung, fibroblast, lung, tibial nerve, and skin, with sample sizes ranging from 337 to 602 (Supplementary   132   Table S3). As anticipated, we found that, collectively, the predicted transcriptomes in 10 tissues explained 133 a larger portion of heritability: on average 34.4% (s.e. = 3.3%) of the heritability corresponding to a 48% 134 increase relative to whole blood alone. This result suggests that adding transcriptomes from multiple tissues 135 will improve predictions in general.    Table S3). We decided to use the combined (African American and Hispanic) transcriptome prediction since the similarity of the sample 143 sizes (578 vs 585) would make the comparison with the European trained models more fair. 144 We found that (Fig.2B)  After having determined that it is possible to capture a significant portion of trait variability using predicted 150 transcriptome and that matching the training and target set ancestries can increase the portion explained, 151 we proceeded to build the PRS and PTRS in our discovery set (356K Europeans from the UK biobank).

152
We built PTRS weights using elastic net, a regularized linear regression approach, which selects a sparse 153 set of predicted expression features to make up the PTRS. For PRS weights, we used the standard LD 154 clumping and p-value thresholding approach (see details Methods section section 1.3). 155 We quantified the prediction accuracy in each target set using the partial R 2 ( R 2 ), which provides a 156 measure of correlation between predicted and observed outcomes with the added benefit of taking covariates 157 into account (see details in section 1.10).

158
All the weights were calculated in the discovery set, however, to boost the prediction performance across  Finally, we used the MESA EUR and AFHI models to assess the potential improvement in portability 199 when matching training and target set ancestries. As shown in Fig.4B, PTRS based on AFHI transcriptome 200 has significantly higher portability than the MESA-EUR transcriptome-based PTRS (paired t-test p=0.021).

201
As an additional evidence of improved portability of PTRS in general, we replicated the higher portability 202 of the EUR-based PTRS compared to PRS using an independent training set (MESA vs GTEx as described 203 above). PRS, which can be improved by using ancestry matching transcriptomes. Also they suggests that adding 206 transcriptomes predicted in other tissues and other omics data can further improve PRS generalizability.

208
In this paper, we showed that informing genetic risk score building using genetically determined gene ex-  We found that the total trait variation that can be explained via predicted transcriptomes range from 212 20.6% (using whole blood) to 34.4% (with a broader sets of tissues) of the SNP heritability, i.e. the total 213 variation that can be explained using common SNPs. Promisingly, the actual predictors built on predicted 214 transcriptomes had performances that were much higher than the expected 20.6% of the PRS performance.

215
The predicted transcriptome tended to be more predictive if it was trained with individuals from the same 216 ancestry stressing the advantages of collecting transcriptome reference data in diverse populations.

217
We found that the portability of PTRS was significantly higher than the portability of PRS in the 218 African target set, the most affected by the Eurocentric bias in GWAS studies, with further gains when the 219 transcriptome was trained in matched ancestry samples.

220
Our results show that investing in multi-omic studies of diverse populations may be a cost effective way 221 to reduce current genomic disparities by taking better advantage of existing GWAS studies. Developments 222 of methods to optimally combine PRS and PTRS should be encouraged.

223
Our study points to promising strategies to improve risk prediction in general but it also has several 224 limitations. First, PTRS are based on prediction models of gene expression traits which we estimated to 225 account for less than a third of the total variability in the complex traits considered here. We expect this 226 limitation to be mitigated as additional transcriptome reference sets as well as other omics data covering 227 mediating mechanisms missed in current models. Second, we used single tissue prediction models for most 228 of the analysis in this paper, which captured a fifth of the variation in the complex traits here. We will 229 develop approaches to integrate multiple tissue models. Third, weights for PRS were calculated using  Supplementary Table S1) across all available instances and arrays were retrieved.

248
The data retrieval described above was performed using ukbREST ( If one individual has multiple measurements for the same phenotype (in more than one instances and/or 252 more than one arrays), we collapsed multiple arrays by taking the average and we aggregated measurements To ensure the quality of ancestry label, we removed individuals who deviate substantially from the popu-257 lation that they were assigned to. Specifically, for population k among the 4 populations (EUR, S.ASN, 258 E.ASN, and AFR), we treated the distribution of the individuals, in the space of the first 10 PCs, as 259 multivariate normal. And we calculated the observed population mean µ k and covariance Σ k accord-260 ingly. Then, for each individual i in population k, we evaluated the "similarity" S ik to the population k 261 as S ik = log Pr(PC 1 i , · · · , PC 10 i ; µ k , Σ k ). Intuitively, if an individual has genetic background differing from is 262 the assigned population, the corresponding S ik will be much larger than others. So, we filtered out individ-263 uals with S ik ≤ −50 in the assigned population k. This cutoff was picked such that S ik for any un-assigned 264 population k has S ik ≤ −50 for all individuals.

265
The number of individuals remained after data retrieval and ancestry quality control is listed Supple-266 mentary Table S2. 268 We built PRS using the genotypes and phenotypes of the individuals in the discovery data set (the details of 269 data splitting is described in section 1.11). We performed GWAS (linear regression) using linear_regression_rows 270 in hail v0.2 where we included covariates: first 20 genetic PCs, age, sex, age 2 , sex × age, and sex × age 2 .

271
In the GWAS run, we excluded variants with minor allele frequency < 0.001 and variants that significantly 272 deviate from Hardy-Weinberg equilibrium (p-value < 10 −10 ). And the phenotype in their original scales 273 were used.

274
To obtain relatively independent associations for PRS construction, we run LD clumping using plink1.9

283
The sample size and tissue informations of the prediction models are listed in Supplementary Table S3. To get a sense on the predictive power of predicted transcriptome on the phenotypes of interest, we estimated the proportion of phenotypic variation that could be explained by the predicted transcriptome in aggregate.
Specifically, we assume the following mixed effect model (for individual i).
where M denotes the number of genes, C il is the lth covariate, T ig is the inverse normalized predicted 286 expression for gene g, and Y i is the observed phenotype. By inverse normalization, we converted the predicted   Briefly, the eigen-predicted expression traits were calculated as follows. For each gene g, let T j g = ( T j 1g , · · · , T j N g ) denote the predicted expression (with standardization) of g in tissue j across individual i = 1, · · · , N . By collecting T j g for all tissues which have prediction model for gene g (suppose there are J of them), we have a matrix T g ∈ R N ×J where columns correspond to tissues. To remove the colinearity in the columns of T g by keeping linearly independent predictors, we used the first K left singular vectors of T g with K selected as follows. We performed PCA on T t g T g and any kth PC V k g was excluded if λ k /λ max ≤ 1/30. The leading K left singular vectors (up to a scaling factor) of T g was reconstructed as follow.
We further inverse normalized U k g (as described in section 1.5) resulting in U k g , which were plugged into 305 eq. (2) the procedure described in the previous subsection 1. and h2_observed_se as the estimated value and standard error of the heritability estimation.
where β 0 is the intercept, M is the number of genes, L is the number of covariates, β 2 2 is the l 2 norm and 315 β 1 is the l 1 norm of the effect size vector. Here, α controls the relative contribution of the l 1 penalization 316 term and λ controls the overall strength of regularization.

317
The model fitting procedure was implemented using tensorflow v2 with mini-batch proximal gradient method and the code is available at https://github.com/liangyy/ptrs-tf. We trained models at α = 0.1 (α = 0.5 and 0.9 show similar performance). And fixing the α value, as suggested in (Friedman et al., 2010){friedman:2010}, we trained a series of models for a sequence of λ's starting from the highest. The maximum λ value, λ max , was determined as the smallest λ such that eq. (8) is satisfied.
where the gradient is evaluated at So, at λ = λ max , eq. (9) is the solution to eq. (6), which could serve as the initial points for the subsequent 318 fittings of λ's. We estimated λ max using the first 1000 individuals of the data. And the sequence of λ was 319 set to be 20 equally spaced points in log scale with the maximum value being 1.5λ max and the minimum 320 value being λ max /10 4 . Among these PTRS models generated at different λ values, we only kept the first 321 11 non-degenerate PTRS models so that we have the same number of candidate models for both PRS and 322 PTRS.

323
Transcriptome prediction models for PTRS construction 324 The predicted transcriptome in the discovery set (UKB EUR) was calculated using models from GTEx  At the testing stage, given the standardized (within the population) predicted transcriptome of an individual, we calculated the PTRS of the individual using the following: To measure the predictive performance of PRS and PTRS, we calculated the partial R 2 of PRS/PTRS against the observed phenotype accounting for the set of covariates listed in section 1.3. Specifically, for individual i, letŷ i denote the predicted phenotype which could be either PRS or PTRS and y i denote the observed phenotype. Partial R 2 (denoted as R 2 below) is defined as the relative difference in sum of squared error (SSE) between two linear models: 1) y ∼ 1 + covariates (null model); and 2) y ∼ 1 + covariates +ŷ (full model), i.e. R 2 = 1 − SSE full SSE null . To enable fast computation, we calculated R 2 using an equivalent formula shown in eq. (11) which relies on the projection matrix of the null model.