## Abstract

To gain biological insight into the discoveries made by GWAS and meta-analysis studies, effective integration of functional data generated by large-scale efforts such as the GTEx Project is needed. PrediXcan is a gene-level approach that addresses this need by estimating the genetically determined component of gene expression. These predicted expression traits can then be tested for association with phenotype in order to test for mediating role of gene expression levels. Furthermore, due to the polygenic nature of many complex traits, efforts to aggregate multiple GWAS studies and conduct meta-analyses have successfully increased our ability to identify variants of small effect sizes. To take advantage of the results generated by these efforts and to avoid the problems associated with accessing and handling individual-level data (e.g. consent limitations, large computational/storage costs) we have developed an extension of PrediXcan. The new method, MetaXcan, infers the results of PrediXcan using only summary statistics from large-scale GWAS or meta-analyses. Here we show that the concordance between PrediXcan and MetaXcan is excellent when the right reference population is used (*R*^{2} > 0.95) and robust to population mismatches (*R*^{2} > 0.85). We provide open source local and web-based software for easy implementation through https://github.com/hakyimlab/MetaXcan.

## Introduction

Over the last decade, GWAS have been successful in identifying genetic loci that robustly associate with multiple complex traits. However, the mechanistic understanding of these discoveries is still limited, hampering the translation of this knowledge into actionable targets. Studies of enrichment of expression quantitative trait loci (eQTLs) among trait-associated variants [1,2] show the importance of gene expres-sion regulation. Direct quantification of the contribution of different functional classes of genetic variants showed that 80% of phenotype variability (in 12 diseases) can be attributed to DNAase I hypersensitivity sites, further highlighting the importance of transcript regulation in determining phenotypes [3].

Many transcriptome studies have been conducted where genotype and expression levels are assayed for a large number of individuals [4–7]. The most comprehensive transcriptome dataset, in terms of tissues covered, is the GTEx Project, a large-scale effort where DNA and RNA are collected from multiple tissue samples from nearly 1000 deceased individuals and sequenced to high coverage [8]. This remarkable resource provides a comprehensive cross-tissue survey of the functional consequences of genetic variation at the transcript level.

To integrate knowledge generated from these large-scale transcriptome studies and shed light on disease biology, we developed PrediXcan [9], a gene-level association approach that tests the mediating effects of gene expression levels on phenotypes. This is implemented on GWAS/sequencing studies (i.e. studies with genome-wide interrogation of DNA variation and phenotypes) where transcriptome levels are imputed with models trained in measured transcriptome datasets (e.g. GTEx). These predicted expression levels are then correlated with the phenotype and provides the basis for a gene-level association test that addresses some of the key limitations of GWAS [9].

Other groups have also proposed methods based on similar ideas [10]. Comparison with our method will be discussed.

On the other hand, meta-analysis efforts that aggregate results from multiple GWAS studies have been able to identify an increasing number of phenotype associations that were not detected with smaller sample sizes. In order to harness the power of these increased sample sizes while keeping the computational burden manageable, we have extended the PrediXcan method so that only summary statistics from meta-analysis studies are needed rather than individual level genotype and phenotype data.

We will show here that our new method, termed MetaXcan, is a fast, accurate, and efficient way to scale up implementation of PrediXcan and take advantage of the large sample sizes made available through meta-analysis of GWAS.

## Results

We have derived an analytic expression that allows us to compute the outcome of PrediXcan using only summary statistics from genetic association studies. Details of the derivation are shown in the Methods section. In Figure 1, we illustrate the mechanics of MetaXcan in relation to traditional GWAS and our recently published PrediXcan method.

For both GWAS and PrediXcan, the input is the genotype matrix and phenotype vector. GWAS computes the regression coefficient of the phenotype on each marker in the genotype matrix and gen-erates SNP-level results. PrediXcan starts by estimating the genetically-regulated component of the transcriptome (using weights from the publicly available PredictDB database) and then computes regression coefficients of the phenotype on each predicted gene expression level generating gene-level results. MetaXcan, on the other hand, can be viewed as a shortcut that uses the output from a GWAS study to generate the output from PrediXcan. Since MetaXcan only depends summary statistics, it can effectively take advantage of large-scale meta analysis results, avoiding the computational and regulatory burden of handling large amounts of protected individual level data.

### MetaXcan formula

Figure 2 shows the main analytic expression used by MetaXcan for the Z-score (effect size divided by its standard error) of the association between predicted gene expression and the phenotype. The input variables are the weights used to predict the expression of a given gene *w _{lg}*, the variance and covariances of the markers included in the prediction of the expression level of the gene, and the GWAS coefficient for each marker. The last factor in the formula can be computed exactly in principle, but we would need some additional information that is unavailable in typical GWAS output. Fortunately, we have found that this factor is very close to 1 and dropping it from the formula does not affect the accuracy of the results.

The approximate formula we will use is as follows: where

*w*is the weight of SNP / in the prediction of the expression of gene_{lg}*g*,is the GWAS regression coefficients for SNP

*l*,se(

*β*) is standard error of ,_{l}is the estimated variance of SNP

*l*, andis the estimated variance of the predicted expression of gene

*g*.

The inputs are based, in general, on data from three different sources:

study set,

training set,

population reference set.

The study set is the main dataset of interest from which the genotype and phenotypes of interest are gathered. The regression coefficients and standard errors are computed based on individual-level data from the study set. Training sets are the reference transcriptome datasets used for the training of the prediction models (GTEx, DGN, Framingham, etc.) thus the weights *w _{lg}* are computed from this set. Finally, the reference sets (e.g. 1000 Genomes) are used to derive variance and covariance (LD) properties of genetic markers, which will usually be different from the study sets.

In the most common use scenario, the user will only need to provide GWAS results using his/her study set. The remaining parameters are pre-computed, and download information can be found at the https://github.com/hakyimlab/MetaXcan resource.

Next we will show the performance of the method, measured as the concordance (R^{2}) between PrediX-can and MetaXcan results.

### Performance in simulated data

We first compared MetaXcan and PrediXcan using simulated phenotypes generated from a normal distribution, using a single transcriptome model trained on Depression Genes and Network’s (DGN) Whole Blood data set [4] downloaded from PredictDB (http://predictdb.org). As genotypes we used three ancestral subsets of the 1000 Genomes project: Africans (n=662), East Asians (n=504), and Europeans (n=503). Each set was taken in turn as reference and study set yielding a total of 9 combinations as shown in Figure 3. For each population combination, we computed PrediXcan association results for the simulated phenotype and compared them with results generated from our MetaXcan approach in a scatter plot. This allowed us to assess the effect of ancestral differences between study and reference sets.

As expected, when the study and reference sets are the same, the concordance between MetaXcan and PrediXcan is 100% whereas for sets of different ancestral origin the R^{2} drops a few percentage points, with the biggest loss (down to 85%) when the study set is African and the reference set is Asian. This confirmed that our formula works as expected and that the approach is robust to ethnic differences between study and reference sets.

### Performance in cellular growth phenotype from 1000 genomes cell lines

Next we tested with an actual cellular phenotype. Intrinsic growth, a cellular phenotype, was computed based on multiple growth asays for over 500 cell lines from the 1000 Genomes project [11]. We used a subset of values for Europeans (EUR), Africans (AFR), Asians (EAS) individuals.

We compared Z-scores for intrinsic growth generated by PrediXcan and MetaXcan for different com-binations of reference and study sets, using whole blood prediction model trained in the DGN cohort. The results are shown in Figure 4. Consistent with our simulation study, the MetaXcan results closely match the PrediXcan results. Again, the best concordance occurs when reference and study sets share similar continental ancestry while differences in population slightly reduce concordance. Compared to the plots for the simulated phenotypes, the diagonal concordance is slightly lower than 1. This is due to the fact that more individuals were included in the reference set than in the study set, thus the study and reference sets were not identical for MetaXcan.

### Performance on disease phenotypes from WTCCC

We show the comparison of MetaXcan and PrediXcan results for two diseases: Bipolar Disorder (BD) and Type 1 Diabetes (T1D) from the WTCCC in Figure 5. Other disease phenotypes exhibited similar performance (data not shown). Concordance between MetaXcan and PrediXcan is over 95% in for both diseases (BD *R*^{2} = 0.956 and T1D *R*^{2} = 0.958). The very small discrepancies are explained by differences in allele frequencies and LD between the reference set (1000 Genomes) and the study set (WTCCC). Given this high concordance, we do not expect much improvement when using a reference set that is more similar to the study set. We verified this and, as expected, found that using control individuals from WTCCC as reference set improved the concordance only marginally (0.1%).

It is worth noting that the PrediXcan results for diseases were obtained using logistic regression whereas MetaXcan formula is based on linear regression properties. As observed before [12], when the number of cases and controls are relatively well balanced (roughly, at least 25% of cases and controls), linear regression approximation yields very similar results to logistic regression.

This high concordance also shows that the approximation where we drop the term does not significantly affect the results.

### Software

We make our software publicly available on a GitHub repository: https://github.com/hakyimlab/MetaXcan. Instructions for obtaining the weights and covariances for different tissues can be found there. A short working example can be found on the GitHub page; more extensive documentation can be found on the project’s wiki page.

## Discussion

Here we present MetaXcan, a scalable, accurate, and efficient method for integrating reference transcrip-tome studies to learn about the biology of complex traits and diseases. Our method extends PrediXcan, which maps genes to phenotypes by testing the mediating effects of gene expression levels. This is imple-mented by predicting gene expression levels and correlating these traits with phenotypes. MetaXcan is a shortcut that uses SNP-level association results and combines them to reproduce the results of PrediXcan, without the need to use individual level data.

MetaXcan shares most of the benefits of PrediXcan: a) it directly tests the regulatory mechanism through which genetic variants affect phenotype; b) it provides gene-level results which are better functionally characterized than genetic variants, easier to validate within model systems, and carry a smaller multiple testing burden; c) the direction of the effects are known, facilitating identification of therapeu-tic targets; d) reverse causality is largely avoided since predicted expression levels are based on germline variation, which are not affected by onset of disease; e) it can be systematically applied to existing GWAS studies; f) tissue-specific analysis can be performed using all the models we have made available through PredictDB (http://predictdb.org).

The difference between the reference sets (used to estimate LD and allele frequencies) and study set (used to compute GWAS/meta analysis summary statistics) is the main cause of the small differences between MetaXcan and PrediXcan results. We have shown here that even when the populations are quite different, the concordance is very high. Thus, MetaXcan is robust to ancestral differences between study and reference sets.

Even though the method was derived with linear regression in mind, in case-control designs, the approximation generates results that are in almost full concordance with exact results generated with PrediXcan and logistic regression.

Methods similar in spirit to PrediXcan have been reported [10]. Gusev et al also propose a method comparable to MetaXcan that is based only on summary statistics. Their method, called Transcriptome-Wide Association Study (TWAS), imputes the SNP level z-scores into gene level z-scores using the method Pasaniuc and others have published [13]. This approach is equivalent to predicting expression levels using BLUP/Ridge Regression, which has been shown to be suboptimal for prediction. This is due to the fact that the local architecture of gene expression traits is sparse so that highly polygenic models underperform more sparse prediction models such as LASSO or Elastic Net with mixing parameters 0.5 or greater [14].

In contrast, MetaXcan is not restricted to one imputation or prediction scheme. It infer the results of PrediXcan using summary statistics through an analytic formula. Thus it can be applied to linear models based on SNP data.

In summary, we present an accurate and computationally efficient gene-level association method that integrates functional information from reference transcriptome dataset into GWAS and large scale meta-analysis results to inform the biology of complex traits.

## Methods

### Derivation of MetaXcan Formula

The goal of MetaXcan is to infer the results of PrediXcan using only GWAS summary statistics. Individual level data are not needed for this algorithm. We will define some notations for the derivation of the analytic expressions of MetaXcan.

#### Notation and Preliminaries

*Y* is the *n*-dimensional vector of phenotype for individuals *i* = 1, *n*.

*X _{l}* is the allelic dosage for SNP

*l*.

*T _{g}* is the predicted expression (or estimated GREx, genetically regulated expression).

We model the phenotype as linear functions of *X _{l}* and

*T*where and are the estimated regression coefficients of

_{g}*Y*regressed on

*T*and

_{g}*X*, respectively. is the result (effect size for gene

_{l}*g*) we get from PrediXcan whereas is the result from a GWAS for SNP

*l*.

We will denote as Var and Cov the operators that computes the sample variance and covariances, i.e.
where is the *n* x *p* matrix of SNP data and is a *n* x *p* matrix where column *l* has the column mean of X_{l} (*p* being the number of SNPS in the model for gene *g*).

With this notation, our goal is to infer PrediXcan results ( and its standard error) using only GWAS results ( and se), estimated variances of SNPs , covariances between SNPs in each gene model (Γ_{g}), and prediction model weights *wl _{g}*.

**Input:** . **Output:** .

Next we list the properties and definitions used in the derivation: and

The proportion of variance explained by the covariate (*T _{g}* or

*X*) can be expressed as

_{l}By definition

can be computed as

#### Calculation of regression coefficient γ_{g}

can be expressed as

#### Calculation of standard error of γ_{g}

Also from the properties of linear regression we know that

In this equation, is not necessarily known but can be estimated using the analogous equation (7) for beta

Thus

Notice that the right hand side of (9) is dependent on the SNP *l* while the left hand side is not. This equality will hold only approximately in our implementation since we will be using approximate values for , i.e. from reference population, not the actual study population.

#### Calculation of Z score

To assess the significance of the association, we need to compute the ratio of the effect size γ_{g} and standard error se(γ_{g}), or Z score,
with which we can compute the p value as

Based on results with actual and simulated data we have found that the last approximation does reduce power since the deviation is only noticeable when the correlation between the SNP or the predicted expression and the phenotype is large, i.e. large effect sizes. When the effects are large the loss of power is compensated by the large effect size.

## Acknowledgments

### Grants

We acknowledge the following US National Institutes of Health grants: R01MH107666 (H.K.I.), K12 CA139160 (H.K.I.), T32 MH020065 (K.P.S.), R01 MH101820 (GTEx), P30 DK20595 and P60 DK20595 (Diabetes Research and Training Center), P50 DA037844 (Rat Genomics), P50 MH094267 (Conte). H.E.W. was supported in part by start-up funds from Loyola University Chicago.

## Footnotes

↵* E-mail: Corresponding haky{at}uchicago.edu