Fine-mapping and QTL tissue-sharing information improve causal gene identification and transcriptome prediction performance

The integration of transcriptomic studies and GWAS (genome-wide association studies) via imputed expression has seen extensive application in recent years, enabling the functional characterization and causal gene prioritization of GWAS loci. However, the techniques for imputing transcriptomic traits from DNA variation remain underdeveloped. Furthermore, associations found when linking eQTL studies to complex traits through methods like PrediXcan can lead to false positives due to linkage disequilibrium between distinct causal variants. Therefore, the best prediction performance models may not necessarily lead to more reliable causal gene discovery. With the goal of improving discoveries without increasing false positives, we develop and compare multiple transcriptomic imputation approaches using the most recent GTEx release of expression and splicing data on 17,382 RNA-sequencing samples from 948 post-mortem donors in 54 tissues. We find that informing prediction models with posterior causal probability from fine-mapping (dap-g) and borrowing information across tissues (mashr) lead to better performance in terms of number and proportion of significant associations that are colocalized and the proportion of silver standard genes identified as indicated by precision-recall and ROC (Receiver Operating Characteristic) curves. All prediction models are made publicly available at predictdb.org. Author summary Integrating molecular biology information with genome-wide association studies (GWAS) sheds light on the mechanisms tying genetic variation to complex traits. However, associations found when linking eQTL studies to complex traits through methods like PrediXcan can lead to false positives due to linkage disequilibrium of distinct causal variants. By integrating fine-mapping information into the models, and leveraging the widespread tissue-sharing of eQTLs, we improve the proportion of likely causal genes among significant gene-trait associations, as well as the prediction of “ground truth” genes.

Transcriptome studies with whole genome interrogation characterize genetic effects on 2 gene expression traits. These mechanisms help elucidate the function of loci identified in 3 genome-wide association studies (GWAS) by identifying potential causal genes that link 4 genetic variation with complex traits [1][2][3][4][5]{Albert:2015fx, Aguet2019, Huckins:2019ix, 5 Mancuso:2018fv, Gusev:2018dy}. 6 In particular, the Genotype-Tissue Expression (GTEx) project [2]{Aguet2019} has 7 sequenced whole genomes from 948 organ donors and generated RNA-seq data across 52 8 tissues and 2 cell lines. Results and tools derived from this comprehensive catalog of 9 transcriptome variation have enabled a myriad of applications such as drug repurposing 10 [6]{So2017} and clinical discoveries in cancer susceptibility genes [7]{Wu2018}, to name 11 a few. 12 The general consensus that many noncoding variants associated with complex traits 13 exercise their action via gene expression regulation has motivated the development of 14 imputed transcriptome association approaches such as PrediXcan [8,9]{Gamazon2015, 15 Barbeira2018}, TWAS/FUSION [10]{Gusev:2016ey} and UTMOST [11]{Hu2019}. In 16 essence, these methods predict gene expression traits based on individuals' genotypes 17 and test how these predictions correlate with complex traits. 18 Reliable prediction models for gene expression traits are key components of imputed 19 transcriptome association studies. Given the predominantly sparse genetic architecture 20 of gene expression traits [12]{Wheeler2016} and overall robustness and performance 21 [3,13]{Huckins:2019ix, Fryett:2018bg}, Elastic Net [14]{Friedman2010GLMNET} has 22 become the algorithm of choice for predicting transcriptome variation. 23 Despite Elastic Net's many advantages such as robustness and sparcity, we hypothe- 24 sized that transcriptome imputation can be improved by leveraging biologically-informed 25 methods. Recent efforts [11]{Hu2019} have exploited the high degree of eQTL sharing 26 across tissues [15]{GTEx2017} by leveraging cross-tissue patterns in the broad GTEx 27 panel to improve prediction performance, more notably in tissues with small sample sizes. 28 Also, important methodological progress in fine-mapping [16,17]{Wen2017, Wang2018} 29 and an adaptive shrinkage method that improves effect size estimates across multi- 30 ple experiments [18]{Urbut2019} provide opportunities to further improve quality of 31 downstream associations. 32 In this article, we analyze different transcriptome prediction strategies and compare 33 their strengths both in prediction performance and downstream phenotypic associations. 34 Proximity and linkage disequilibrium (LD) between distinct causal variants can 35 lead to non causal associations between predicted expression and complex traits [9, 36 19]{Barbeira2018, Wainberg:2019kq}. Since the ultimate goal of imputed transcriptome 37 studies is to identify causal genes, our main focus here is to improve discoveries with less 38 emphasis on expression prediction performance. We also applied the same model building 39 techniques to alternative splicing traits quantified with Leafcutter [20]{Li:2018cy}. We 40 make all results, prediction models and software available to the research community.

42
To identify optimal techniques for transcriptomic imputation, we have built models 43 to predict genetically regulated expression (GREx) using four different approaches on 44 GTEx expression and splicing data (release version 8). To reduce LD misspecification 45 problems, most apparent when applying summary statistics-based versions of PrediXcan 46 on GWAS of European populations, we used only European samples. 47 We restricted the analysis to genes that are annotated as protein coding, lncRNA, 48 and pseudogenes in GENCODE version 26 [21]{Frankish2019}. We included 49 different 49 tissues with sample sizes ranging from 65 (Kidney Cortex) to 602 (Muscle Skeletal). 50 The first strategy used the Elastic Net [14]{Friedman2010GLMNET} algorithm to 51 compute predictions as described previously in [8,12]{Gamazon2015, Wheeler2016}. For 52 every gene available in each tissue, this strategy used variants from the HapMap CEU 53 track in a window ranging from 1Mb upstream of the transcription start site to 1MB 54 downstream of the transcription end site as explanatory variables. Only those models 55 achieving thresholds of cross-validated correlation ρ > 0.1 and prediction performance 56 p-value < 0.05 were kept. We will refer to this family as the EN-M models.

57
The second strategy used CTIMP (Cross Tissue gene expression IMPutation) [11]{Hu2019}. 58 CTIMP uses a regularized, generalized linear regression algorithm to fit expression from 59 different tissues simultaneously. CTIMP optimizes a cost function including a within-60 tissue Lasso penalty and a cross-tissue group Lasso penalty, thus inheriting Lasso-like 61 behaviour that is less sparse than Elastic Net. We used the same variants from the 62 EN-M strategy (HapMap CEU track, same windows around each gene), and identical 63 correlation threshold (ρ > 0.1) and cross-validated prediction performance threshold 64 (p < 0.05) to accept models. We will refer to this family as the CTIMP-M family. 65 We verified that this method's performance is not significantly improved by using all 66 available GTEx variants, as explained in the supplementary material.

67
The third strategy used the posterior inclusion probability (PIP) of a variant being 68 causal for gene expression as estimated by the Bayesian fine mapping method dap-g 69 (Deterministic Approximation of Posteriors) [22]{Wen2016}. First, for every gene, we 70 restricted to variants with posterior inclusion probabilities PIP > 0.01. Since dap-g 71 clusters variants by their LD, we kept the variant with highest PIP from each cluster to 72 avoid redundant explanatory variables. Then, the selected variants were fed into the 73 Elastic Net algorithm, scaling each variant's effect size penalty by a factor of 1−PIP 74 (i.e. more likely variants are less penalized). Only those models achieving good enough 75 cross-validated prediction performance (p-value< 0.05) and correlation (ρ > 0.1) were 76 kept. We will refer to this family as DAPGW-M (dap-g weighted). As discussed later, 77 the cross-validated prediction performance of this approach can't be fairly compared to 78 EN-M and CTIMP-M because the pre-selection of fine-mapped variants is based on the 79 same underlying data.

80
The fourth strategy used mashr (Multivariate Adaptive Shrinkage in R) [18]{Urbut2019} 81 effect sizes from variants selected by dap-g as in the DAPGW-M approach. More 82 specifically, fine-mapped variants were selected as in the DAPGW-M approach but 83 the weights were obtained by applying mashr to the marginal effect sizes and standard 84 errors from the GTEx eQTL analysis [2]{Aguet2019}. Unlike the previous methods, 85 this approach does not fit into a cross-validation strategy and therefore lacks a natural 86 prediction performance measure. Only eGenes with at least one cluster of variants 87 achieving dap-g PIP> 0.1 were kept. We will refer to this family as MASHR-M. 88 We did not consider the BSLMM family of methods for transcriptome prediction. 89 These models contain both a sparse and a polygenic component. The latter is likely to 90 induce LD contamination [9]{Barbeira2018} and doesn't reflect the sparse architecture 91 of expression traits [12]{Wheeler2016}. cation from LeafCutter [20]{Li:2018cy} and made them readily available to the research 94 community. These models were extensively used in [2]{Aguet2019} and [23]{GTEx-95 GWAS-Companion}. 96 Summary of models 97 Given the differences in computational approach, not all prediction strategies generated 98 models for every available gene-tissue pair. As can be seen in Fig. 1-A, EN-M yielded 99 the smallest number of valid models, for 281,848 gene-tissue pairs. CTIMP-M produced 100 340,104 valid models, 21% more than EN-M, as expected from its integration of multiple 101 tissues' information.

102
Fine-mapping-based methods generated even more models: 518,537 from DAPGW-M 103 (84% more than EN-M) and 686,241 from MASHR-M (143% more than EN-M). Please 104 note that given the different criteria used to accept a model as valid, simple counts of 105 available models should not be considered a measure of performance. 106 We show the distribution of cross-validated prediction performances in Fig. 1-B. 107 We include 5 representative tissues ordered by increasing sample size (kidney, brain -108 hippocampus, brain -cerebellum, breast, skeletal muscle). In order to perform a uniform 109 comparison, we used only gene-tissue pairs available to all model families. CTIMP-M 110 showed better prediction performance than EN-M on tissues with smaller sample size, but 111 performed similarly on tissues with larger sample sizes. We attribute this to CTIMP's 112 design, which leveraged all existing samples' genotypes in the tissues of smaller expression 113 sample size. MASHR-M models had no natural prediction performance measure and 114 thus are excluded from these panels. DAPGW-M is presented for completeness but its 115 comparison to EN-M and CTIMP-M is unfair. We show in Sup.  Finemapping improves expression prediction in independent dataset118 Next, we sought to validate the models' predictions in an independent RNA-seq dataset. 119 We analyzed data from the the GEUVADIS project [24]{Lappalainen2013}, which 120 includes 341 samples of European ancestry with genotype and LCL (lymphoblastoid cell 121 lines) expression data. We predicted expression using GTEx LCL models from the 4 122 strategies, and compared with measured expression levels. Fig. 2-A shows the number of 123 genes that each family was able to predict. DAPGW-M and MASHR-M had the largest 124 number of predictable genes, followed by CTIMP-M and EN-M.

125
To compare prediction performances, we used Spearman's rank correlation coefficient 126 ρ as a robust measure that handles the scale and complexity differences between real 127 GEUVADIS expression data and predicted expression levels. Fig. 2-B shows the 128 distribution of prediction performance (Spearman's ρ) for genes present in all four 129 methods on the LCL tissue. We observed that all four families achieved similar levels of 130 performance, with DAPGW-M and MASHR-M faring slightly but consistently better 131 than EN-M and CTIMP-M. 132 We attribute the smaller performance differences to low power, since GTEx LCL 133 tissue has a sample size of n= 115 individuals, much lower than the 341 available in 134 GEUVADIS. We attributed the small differences to the GTEX LCL tissue having a small sample size (n=115 individuals), much lower than the 341 available in GEUVADIS. Also, the intersection of genes available to all 4 strategies is dominated by those present in Elastic Net, the smallest set; and genes that can be modelled with Elastic Net tend to be the ones with less complicated patterns of variation.

Fine-mapping improves number and colocalization of associations 136
Next, we assessed whether any of these models perform better at identifying causal genes. 137 We considered the number and proportion of colocalized genes among the significant 138 ones as measures of association quality. 139 We used the four families of models to correlate predicted expression with 87 phe-140 notypes through 49 tissues using the summary version of PrediXcan. Results of ap-141 plying the EN-M models to GWAS summary statistics, harmonized and imputed to 142 GRCh38 [25]{Schneider2017hg}, were presented in [2]{Aguet2019}. In this section, we 143 say that a gene-tissue pair is significant if it achieves a p-value below the Bonferroni-144 corrected threshold (0.05/number of gene-tissue pairs) within each trait. 145 We used enloc [16]{Wen2017} results published in [2]{Aguet2019} to assess the colo-146 calization status of GWAS and transcriptomic traits as evidence for a shared underlying 147 mechanism. Briefly, enloc computes the "regional colocalization probability" (rcp) that a 148 trait shares causal variants with a gene's expression (or an intron's splicing quantification), 149 within a GWAS region and the overlapping gene's cis-window. We say that a gene-tissue 150 pair is "colocalized" with a trait if it achieves an enloc regional colocalization probability 151 rcp > 0.5. Note that rcp <= 0.5 should not be interpreted as a false association; rather, 152 it only means that there is not enough evidence of colocalization. See discussion on the 153 conservative nature of colocalization approaches in [23]{GTEx-GWAS-Companion}. 154 We say that a gene-tissue pair that is both significant and colocalized is a "prioritized" 155 detection or candidate. To simplify interpretation of results across multiple tissues, we 156 count the number of unique genes among the prioritized gene-tissue pairs for each trait. 157 We found that MASHR-M tipically yields more candidate genes. We display the 158 numbers of detections for each trait in . 166 We were thus led to favor MASHR-M, which produced the largest number of models, 167 with superior number of colocalized, significant associations as well as higher proportions 168 of colocalized associations among significant genes. We say a gene is significant if it achieves a Bonferroni-adjusted threshold of 0.05/number of available gene-tissue pairs, in at least one tissue. Likewise, we say a gene is colocalized if it achieves enloc rcp > 0.5 in any tissue. We say a gene is a candidate or "prioritized" detection if it is both significant and colocalized in any tissue.
Enloc relies on the dap-g algorithm itself as a component, so that the fraction of 170 colocalized genes could have been biased towards dap-g informed methods. To make sure 171 that the use of dap-g is not driving the improved colocalization rate of MASHR-M over 172 th other strategies, we verified the performance using another colocalization method, 173 coloc [26]{Giambartolomei2014}. 174 We observed that MASHR-M still had a better rate of colocalization among significant 175 associations, albeit with smaller differences as can be seen in Supplementary Figure 3. 176 This is probably in part due to coloc's reduced power and limiting assumption of a single 177 causal variant (see [23]{GTEx-GWAS-Companion} for details).

178
Finemapping improves identification of silver standard genes 179 As an independent way to assess each prediction strategy's ability to identify causal genes, 180 we framed the problem as one of causal gene prediction and use standard prediction 181 performance measures such as Receiver Operating Charasteristc (ROC) and Precision-182 Recall (PR). This avoids using an ad-hoc significance or colocalization thresholds.

193
Using absolute values of z-scores as association score for each strategy, we assessed 194 their ability to 'predict' the silver standard gene-trait associations. We show in Fig. 4 195 the ROC and Precision-Recall curves on OMIM-and rare variant-based silver standards. 196 Using the OMIM-based silver standard (Fig. 4-A and -C), we observed that MASHR-197 M strategy outperforms the other strategies, with DAPGW-M a close second. Thus we 198 concluded that MASHR-M models are better equipped for detecting known genes in the 199 extreme regulatory case of Mendelian diseases, reinforcing our choice of MASHR-M as 200 the best option. 201 Using the rare-variant-based silver standard (Fig. 4-B and -D), we observed that all 202 four strategies are able to detect known causal genes. However, the limited size of this 203 standard did not allow us to distinguish between the four families. Panel A shows the Receiver Operating Characteristc (ROC, plotting true-positive ratio to false-positive ratio) curve for the OMIM silver standard. We observe that MASHR-M models outperforms the other strategies, with DAPGW-M a close second. Panel B shows the ROC curve for the rare-variant-based silver standard. We observe that all strategies perform better than taking a random choice. However, this silver standard is too limited to properly distinguish between strategies. Panel C shows the Precision-Recall (PR) curve for the OMIM silver standard. MASHR-M performs better than the other strategies in general, but precision becomes a noisy measure towards lower recall ranges. Panel D shows the PR curve for the rare-variant-based silver standard. The precision measure is too unstable to draw any conclusions. The OMIM silver standard not only validates the 4 proposed model strategies as a consistent approach to detect causal genes, but provides additional evidence of MASHR-M's superiority. The second silver standard, based on rare variants, is too limited to conclude anything beyond a high-level validation of all 4 families.
Importance of harmonization and imputation of missing sum-205 mary statistics 206 The prediction models' usefulness depends on the availability of their variants in the 207 GWAS of interest. Publicly available GWAS use different sequencing and genotyping 208 techniques, based on different genotype imputation panels and human genome release 209 versions, so that the lists of available variants vary wildly across traits. Thus, a GWAS 210 might lack particular variants from a prediction model, so that the model can't properly 211 infer variation patterns as shown in [32]{Barbeira2019}. Since many fine-mapped variants 212 in the GRCh38-based GTEx study can be absent in a typical GWAS, we sought to assess 213 the impact of variant compatibility in real applications. 214 We compared S-PrediXcan results from MASHR-M models on 69 publicly available 215 GWAS with two preprocessing schemes: 216 1. Harmonization of variants by mapping genomic coordinates between human genome 217 assemblies, and filtering for matching alleles ("Harmonization" for short) 218 2. Imputation of missing summary statistics ("Imputation" for short) on harmonized 219 GWAS.

220
The 69 traits included in this analysis are those among the 87 traits not belonging to 221 the Rapid GWAS project, to prevent the highly homogeneous Rapid GWAS datasets 222 from dominating comparisons. 223 We show in Fig. 5 the effect of these preprocessing schemes on various performance 224 metrics, segregated by human genome release version (hg17, hg18, hg19). 238 Therefore, we recommend to always perform variant harmonization due to its low 239 complexity and time requirements, followed by summary-statistics imputation if pos-240 sible. For newer GWAS with modern sequencing and genotyping, summary-statistics 241 imputation may not be as critical depending on their intersection with model variants. 242 Fig 5. Effect of imputation on association quality. We display here a comparison of S-PrediXcan results from MASHR-M models on 69 GWAS traits using two different pre-processing schemes: simple harmonization of GWAS variants to GTEx's, and additional imputation of missing summary statistics. Results are grouped by the different human genome release versions underlying each GWAS: 2 traits were defined on hg17, 13 on hg18, and 54 on hg19. Panel A shows the distribution of number of associations per trait-tissue pair that can be computed; imputation dramatically increased the number of associations for hg17and hg18-based traits. Some hg19-based traits exhibited a good number of computable associations after just a simple harmonization. Panel B shows, per trait-tissue pair, the distribution of median fraction of model SNPs present in the GWAS. It is nearly 1 for most trait-tissue pairs in the imputation scheme, ranging between 0.5 and 1 with the harmonization scheme. Panel C shows the number of colocalized, significantly associated genes that can be found after applying imputation and harmonization schemes. The gain of imputation for hg19 is less dramatic that in the other comparisons in this figure, given the conservative nature of the colocalization filter.

243
Through extensive analysis of different model training schemes, we conclude that using 244 fine-mapping information (from dap-g) and cross-tissue patterns (from mashr ) improve 245 transcriptome prediction models both for causal gene detection and prediction perfor-246 mance. These models (MASHR-M) yield more detections when integrating GWAS and 247 eQTL studies, and show better prediction performance on an independent expression 248 cohort. This method also exhibits superior performance when validating results in a 249 silver standard of known gene-to-trait associations (OMIM database). We make all 250 prediction models and results publicly available.

251
Special consideration must be paid to how well each model's variants intersect GWAS' 252 variants. Fine-mapping-informed models are sparse and parsimonious. This could be 253 a hurdle when the fine-mapped variants of import are missing or have low imputation 254 quality in a GWAS, as is often the case with older studies. In this scenario we recommend 255 harmonizing and imputing summary statistics to the models' variants. The alternative 256 is falling back to models such as CTIMP-M, defined on a robust set of variants available 257 to most GWAS, at the cost of decreased performance (detection and prediction). EN-M 258 additionally features some "built-in" redundacy: for a set of variants in LD among each 259 other, they all tend to be included in a model with the effect spread between them.

260
While our recommended MASHR-M method offers several benefits compared to 261 existing approaches, there is still room for improvement. Potential developments could 262 rely on fine-mapping methods that jointly incorporate cross-tissue patterns, or consensus 263 between different fine-mapping approaches. Also, epigenetic information has been 264 shown to improve transcriptome prediction [33]{Zhang2019EpiXcan} as well. Future 265 improvements should incorporate this epigenetic information and other biologically-266 informed annotations jointly.

267
Our validation in silver standards, especially our difficulty interpreting the results 268 from the rare-variant-based silver standard, also illustrates the need for well-curated, 269 large databases of known gene-to-phenotype associations to assess performance of either 270 new or improved methods.

271
In conclussion, we present here a method for predicting the genetically regulated 272 component of transcriptomic traits with superior performance both in terms of prediction 273 performance and gene-trait association detection.   301 We thank the International Genomics of Alzheimer's Project (IGAP) for providing 302 summary results data for these analyses. The investigators within IGAP contributed to 303 the design and implementation of IGAP and/or provided data but did not participate in 304 analysis or writing of this report. IGAP was made possible by the generous participation 305 of the control subjects, the patients, and their families. The i-Select chips was funded 306 by the French National Foundation on Alzheimer's disease and related disorders. Genotype-Tissue Expression (GTEx) project's raw whole transcriptome and genome 337 sequencing data are available via dbGaP accession number phs000424.v8.p1. All pro-338 cessed GTEx data are available via GTEx portal. Imputed summary results, enloc, coloc, 339 PrediXcan, MultiXcan, dap-g, prediction models, and reproducible analysis are available 340 in https://github.com/hakyimlab/gtex-gwas-analysis and links therein.

372
We executed all methods using open source software running in a high performance cluster. 373 We release all of our code and the data analyzed in this paper to ease reproducibility 374 and accessibility.

375
GTEx data processing 376 We downloaded GTEx data for version 8 release from dbGAP (accession number 377 phs000424.v8.p1). This data arises from 17382 RNA-seq samples from 54 tissues of 948 378 post-mortem subjects, aligned to the GRCh38 assembly. Primary and extended results 379 generated by consortium members are available on the Google Cloud Platform storage 380 accessible via the GTEx Portal (see URLs). 381 899 whole-genome sequencing (WGS) samples were analyzed, 68 of them at an average 382 coverage of 30x on HiSeq200, and the rest on HiSeqX. 866 GTEx donors' samples were 383 included in the downstream variant call files (VCF), after excluding one each from 30 384 duplicate samples and 3 donors. Among these, 838 subjects with RNA-seq data were 385 included for QTL mapping and analysis.

405
GTEx expression and splicing modelling 406 We used the same genotypes, phenotypes, covariates, gene annotations and variant 407 annotations from the main GTEx analysis.

408
When building prediction models, we imposed an additional restriction: we used only 409 samples of European ancestry for the sake of leveraging a well defined population LD 410 structure. Only variants with MAF> 0.01 in these samples were included. We used 49 411 tissues with sample sizes ranging from 65 (Kidney Cortex) to 602 (Muscle Skeletal).

412
This ancestry restriction mitigated problems due to LD mismatch when integrating 413 with most publicly available GWAS summary statistics, which are conducted on pre-414 dominantly European populations. Prediction models in other ancestries are important, 415 and we are currently dedicating substantial effort to creating and analyzing such models. 416 However, non-European models are beyond the scope of this paper. 417 We only generated models for genes annotated in GENCODE v26 as protein coding, 418 lncRNA or pseudogenes.

419
Elastic Net models 420 We fitted an Elastic Net model for each gene-tissue pair with available adjusted expression 421 data. We restricted the set of variants to those present in the HapMap 3 CEU track 422 [40]{InternationalHapMap3Consortium2010} with MAF> 0.01. The motivation behind 423 this choice was to restrict the analysis to a robust set of SNPs that has significant 424 intersection with most publicly available GWAS summary statistics. For every gene, 425 variants within 1Mb upstream of the gene's transcription start site and 1Mb downstream 426 of the transcription end site where used as explanatory variables for gene expression. 427 We used the R package glmnet [14]{Friedman2010GLMNET}, with mixing parameter 428 α = 0.5 and penalty parameter chosen through 10-fold cross validation.

429
Prediction performance was estimated using a nested cross-validation approach. 430 Expression was predicted out-of-sample for each fold, with Elastic Net parameters 431 estimated only within training data, and the correlations to observed values at each 432 fold were combined via Fisher's transformation and Stouffer's method. Only those 433 models with mean Pearson correlation across 10 folds ρ > 0.1 and nested cross-validated 434 correlation test p < 0.05 were kept. 435 We refer to these models as EN-M. 436 CTIMP models 437 We employed the CTIMP [11]{Hu2019} framework on the same data from EN-M models 438 in the previous section. This method fits expression for a gene in multiple tissues 439 simultaneously through a regularized linear model, using a Lasso penalty within each 440 tissue and a group-Lasso penalty for cross-tissue patterns. As it internally uses genotypes 441 from all samples available across all tissues, we expect improvements over EN-M to be 442 larger for tissues of smaller sample size where EN-M deals with a less informative LD 443 structure among variants. 444 We performed five-fold cross validation for model tuning and evaluation following 445 the authors' description. We computed cross-validated correlation measures across 446 folds as in the previous method, and kept those models achieving the thresholds of 447 cross-validated correlation ρ > 0.1 and p-value p < 0.05. As in EN-M, we restricted 448 the model training to variants in the HapMap 3 CEU track with MAF> 0.01; this 449 became necessary because using all variants proved too computationally expensive, since 450 CTIMP consumes large amounts of memory and processing time. We briefly show in 451 the Supplement (Supplementary Figures 4, 5, 6) that this additional restriction brings 452 negligible effects in model training performance and prediction. 453 We refer to these models as CTIMP-M.

454
Elastic Net informed by dap-g results 455 We also trained models via the Elastic Net algorithm using fine-mapping information 456 to refine the list of variants to be used as explanatory variables, and lent more weight 457 to variants with higher chances of affecting expression phenotypes. To this aim, we 458 used dap-g's posterior inclusion probability (PIP) of a variant affecting gene expression 459 to select explanatory variables, without restricting to variants in the HapMap CEU 460 track. For every gene, we used all variants in the gene's cis-window with MAF> 0.01 461 and PIP> 0.01. Since dap-g groups variants in clusters according to LD, we kept the 462 top variant (by PIP) per cluster to avert variable redundancy. Since we reasoned that 463 more probable variants should bear more impact in the model's outcome, we multiplied 464 each variant's penalty term in the Elastic Net regularization by a factor of 1 − PIP. We 465 used the same thresholds from the previous subsections (ρ > 0.1 and p-value p < 0.05) 466 to select models with acceptable prediction performance. 467 We refer to these models as DAPGW-M. 468 mashr -based models 469 Finally we explored an entirely different algorithm to determine the prediction models. 470 We executed multivariate adaptive shrinkage in R (mashr ) [18]{Urbut2019} to estimate 471 the models' effect sizes by leveraging cross-tissue variations while allowing for sparse and 472 possibly correlated effects in a Bayesian framework. We used mashr on the same set 473 of variants from DAPGW-M models. We kept models only for eGenes and effect sizes 474 only for variants with P IP > 0.01 (from dap-g) at each gene-tissue pair. Unfortunately, 475 there is no natural prediction performance measure in this scenario as cross-validation 476 was not performed. 477 We refer to these models as MASHR-M.

478
GEUVADIS data processing 479 We used GEUVADIS LCL expression study for an independent validation of prediction 480 performance. We obtained GEUVADIS expression data and sample information from 481 the European Bioinformatics Institute web portal at https://www.ebi.ac.uk/. We 482 obtained genotype data aligned to GRCh38 assembly from the International Genome 483 Sample Resource web portal http://www.internationalgenome.org. We restricted 484 data to individuals of European ancestry, yielding 341 samples. 485 For each one of the four previous model training schemes (EN-M, CTIMP-M, DAPGW-486 M, MASHR-M) we predicted expression through PrediXcan [8]{Gamazon2015} on GEU-487 VADIS genotypes using GTEX LCL models, and correlated predictions to observations. 488 GWAS processing and integration 489 We examined 87 GWAS from a heterogeneous set of traits first presented in the GTEx 490 v8 study [2,23]{GTEx-GWAS-Companion, Aguet2019}. These traits were selected to 491 support a phenome-wide study of the impact of gene regulation. Given the heterogeneous 492 landscape of the GWAS, with intricate differences in data processing protocols and 493 underlying human genome reference versions, it was necessary to make the GWAS 494 variants homogeneous and compatible with those from the GTEx study. 495 First, the GWAS' variants were harmonized to the GTEx study's variants by mapping 496 genomic coordinates via liftover [41]{haeussler:20019} (https://pypi.org/project/ 497 pyliftover) and keeping only variants with matching alleles. Then, GTEx variants 498 with missing summary statistics for any GWAS were imputed with the BLUP method, 499 a standard in the field [42]{Lee2013}. 500 We executed S-PrediXcan for each of 4 families of models (EN-M, CTIMP-M, 501 DAPGW-M and MASHR-M) using 49 tissues, for a total of 17,052 (trait, model family, 502 tissue) tuples. We integrated with enloc and coloc results published in [2]{Aguet2019}. 503 When analying versatility of the models and GWAS preprocessing schemes, we used 504 GWAS studies not belonging to the rapid GWAS study. This was decided because 505 the rapid GWAS project has a common, homogeneous variant set that could dominate 506 comparisons.

508
Restricting CTIMP variants to CEU HapMap track 509 When first attempting to run CTIMP package on GTEx data, we observed a significantly 510 larger computational burden than the related Elastic Net method, up to two orders of 511 magnitude larger in memory, and up to three in compute time for many genes. This 512 scheme was too prohibitive and would had taken months to complete on the high 513 performance cluster available to us.

514
To reduce the computational burden, we decided to restrict the explanatory variables 515 to variants in the HapMap 3 CEU track [40]{InternationalHapMap3Consortium2010} 516 with MAF> 0.01. This brought down the memory and processing consumption to only 517 one order of magnitude larger than Elastic Net.

518
To verify that this technical restriction did not degrade performance too severely, we 519 computed CTIMP models using all variants for chromosome 1, which took over 5 weeks 520 to complete in our computation resource. We refer to these models as CTIMP-M-AS, 521 which allowed comparison to CTIMP-M. On the other hand, DAPGW-M and MASHR-M models's intersection with a typical 541 public GWAS is lower (see Supplementary Note). Therefore a tradeoff arises when 542 integrating models with GWAS, posing the superior but more demanding MASHR-M 543 models against the underperforming but robust EN-M or CTIMP-M models.

544
As an example of the importance of harmonizing and imputing GWAS summary 545 statistics to the transcriptome reference data set (GTEx), consider two relatively new 546 GWAS studies: blood traits from UK Biobank/INTERVAL and height from UK Biobank, 547 without harmonizing nor imputing missing summary statistics. EN-M and CTIMP-M 548 have over 97% of their variants present in said GWAS, while this percenteage drops 549 to around 80% for DAPGW-M and MASHR-M. DAPGW-M shows a slightly higher 550 intersection of variants than MASHR-M. Table 1 summarizes the number for variants 551 present in each family, the subset among them present in the GWAS traits, and the 552 fraction of these two numbers. In our experience, 80% is acceptable for running tools 553 such as PrediXcan in some applications. However, when applying to older studies 554 with potentially lower intersection between GWAS variants and the models', imputing 555 summary statistics for missing variants becomes necessary.  Table 1. Intersection between prediction model and two GWAS studies. family is the model strategy, gwas is the study, percent measures the proportion of model snps present in the GWAS, present is the number of model snps present in the gwas, n is the total number of unique variants in the models.
To fully exploit the power available to the superior MASHR-M models, harmonizing 557 becomes necessary in newer GWAS; imputation of missing sumamry statistics is also 558 necessary in older GWAS.  Supplementary Figure 1. Cross-validated prediction performance is shown for all available models, on sample tissues ordered from smallest sample size to largest sample size. As sample size increased, we observed similar performances on the strategies shown. DAPGW-M is presented for illustration purposes; since it included an additional variable selection step using the same underlying data, it cannot be fairly compared to EN-M and CTIMP-M. We note that EN-M models produced the smallest number of models ( Fig. 1-A), and 82% of them are in the intersection of models available to all strategies. These tend to be genes with higher heritability and thus easier to predict. In other words, EN-M models are generally available to CTIMP-M and DAPGW-M, and the intersection of all strategies is dominated by genes in EN-M. On the other hand, CTIMP-M and DAPGW-M yield viable models for additional genes that are harder to predict, where EN-M couldn't converge to a proper model. In conclusion, CTIMP-M's and DAPGW-M's performance summary on all available genes was penalized by their convergence on genes with less signal or more complicated expression patterns.