Reversing Transcriptome-Wide Association Studies to improve expression Quantitative Trait Loci associations

Transcriptome-Wide Association Studies discover SNP effects mediated by gene expression through a two-stage process: a typically small reference panel is used to infer SNP-expression effects, and then these are applied to discover associations between imputed expression and phenotypes. We investigate whether the accuracy of SNP-expression and expression-phenotype associations can be increased by performing inference on both the reference panel and independent GWAS cohorts simultaneously. We develop EMBER (Estimation of Mediated Binary Effects in Regression) to re-estimate these effects using a liability threshold model with an adjustment to variance components accounting for imputed expression from GWAS data. In simulated data with only gene-mediated effects, EMBER more than doubles the performance of SNP-expression linear regression, increasing mean r2 from 0.3 to 0.65 with a gene-mediated variance explained of 0.01. EMBER also improves estimation accuracy when the fraction of cis-SNP variance mediated by genes is as low as 30%. We apply EMBER to genotype and gene expression data in schizophrenia by combining 512 samples from the CommonMind Consortium and 56,081 samples from the Psychiatric Genomic Consortium. We evaluate performance of EMBER in 36 genes suggested by TWAS by concordance of inferred effects with effects reported independently for frontal cortex expression. Applying the EMBER framework to a baseline linear regression model increases performance in 26 out of 36 genes (sign test p-value .0020) with an increase in mean r2 from 0.200 to 0.235.

: For a small reference panel cohort of size N , we observe SNPs X, transcripts Z, and binary trait Y . Assume we have a large GWAS cohort of size N with only X and Y observed. EMBER attempts to improve inference of β 1 (the SNP-gene effects) given β 2 (the TWAS gene-trait effect).
We assume we have access to reference panel data comprising N individuals with genotype matrix X, gene 69 expression matrix Z, and case/control phenotype vector Y . The conventional method for calculating eQTLs 70 involves a linear model over the observed X and Z. We also assume that we have a large GWAS data set like to improve estimates ofβ 1 . This relies on an assumption thatβ 2 , likely estimated initially from the 85 same data, is robust to errors inβ 1 to a degree that allows its use to improve re-estimation of β 1 , provided 86 that the gene in question mediates a significant fraction of the observed SNP-trait effects. Improving the 87 accuracy of SNP-gene effects has the potential to improve TWAS estimates further through more accurate 88 gene imputation. Additionally, re-estimating significance of eQTLs may provide more insights into regulation   Therefore, the probability of disease Y i for an individual with predictors Z i is given by Wray and Goddard 106 [15] 107 Our objective is to infer values of the coefficients β such that the probability of observed disease statuses 108 across all individuals is maximized given their PRS. We assume the prior for all β to be a normal distribution 109 with standard deviation 1. Therefore, the joint log likelihood of the liability threshold model is . As in quantitative traits, this label correlation can be regressed against the 129 correlation between pairs of samples in Z. Specifically, Golan et al. [20] show that correlations between 130 labels and between predictors are related by the following linear function where σ 2 Z is the variance explained by Z. Therefore, we can regress the two sets of correlations over pairs 132 of samples to obtain a slope, and then solve for σ 2 Z given observed population parameters of P , K, and T . intractability. Therefore, we can perform inference by the conventional probit expectation maximization.

136
First, we introduce a hidden variable l denoting the unobserved quantitative liability which determines 137 case/control status Y = P (l > T ). This results in the following EM equation for the joint log likelihood: We assume l follows a standard normal distribution, with a fraction of variance explained by SNP effects: 139 therefore l ∼ N (Zβ, 1 − σ 2 Z ). Furthermore, the distribution of l i is dependent on disease label Y i , as l must 140 fall on the correct side of the threshold T to impart the correct case/control status. Therefore, we have the 141 following for the distribution q(l) where T N Y (·) is the truncated normal distribution over values greater than or less than T depending on 143 whether Y i indicates a case or control, respectively. We can then plug this distribution into the EM equation 144 with joint log likelihood LL. By setting q(l) = p(l|Y, Z, β), the KL divergence term cancels, allowing us to 145 maximize β with respect to the expected log likelihood over q(l). This term can be expanded as follows where C denotes terms which are constant with respect to β. Therefore, we have the following EM 147 procedure, where in the E-step we update expected values of the hidden liability, and in the M-step we 148 maximize β with respect to the likelihood informed by those expected liabilities.
At each step the joint log likelihood is evaluated for convergence, after which the inferred β is returned. However, we also investigated whether the re-scaling procedure can instead be incorporated directly into 157 the EM algorithm as a constraint. Having obtained an estimate of σ 2 Z from the sample data, we add a 158 Lagrangian multiplier to the maximization step to enforce the inequality constraint M j=1 β 2 j ≤ σ 2 Z . We now 159 maximize the modified log likelihood We notice that this modification conveniently adds only a constant term to the coefficient of β T β.

161
Therefore, the new update equation has the following form Note that the addition of lambda effectively re-scales the variance of the prior distribution of β, which 163 has until now been assumed to be 1. At every iteration, we would like to solve for lambda such that the 164 constraint M j=1 β 2 j ≤ σ 2 Z is satisfied. However, this operation does not appear to have an analytical solution, 165 and we would like to limit the number of matrix inverse calculations required in a numeric solution. To accomplish this, we define the eigendecomposition for the covariance matrix 167 We also substitute λ = λ + 1. Therefore, the matrix inverse in equation 11 can be avoided by retaining the 168 decomposition and only updating the diagonal matrix Λ, which requires only a vector division.
We can then express the predictors β in equation 11 in terms of their eigendecomposition, and plug in 170 this term into the Lagrangian constraint.
To our knowledge there is still no analytical solution for λ such that β T β is close to σ 2 Z . We instead take 172 the numeric inverse to obtain f −1 (σ 2 Z ) = λ. We perform this using the Python package pynverse, which  To combine both observed expression Z and imputed expression Z = X β 1 into a single model, we introduce 187 a hidden variable ω ∼ N (X β 1 β 2 , 1−σ 2 X β 2 2 ) representing the distribution of the imputed gene effect on disease 188 liability, where σ 2 X is the total fraction of variance in Z explained by predictors in X. We can now redefine 189 disease risk with respect to X and β 1 by marginalizing over the distribution of ω.
We can simplify this expression by applying the following lemma described by Owen [21] We define a standardized gene expression variable ω and substitute this into equation 15 We can plug this expression into equation 16 with b = 1 and a = Therefore the only change from the liability model with observed Z instead of imputed X β 1 is the non-  Our goal is to infer SNP-gene effects given access to both observed genotype and gene expression data from a small reference panel as well as genotype and trait data only from a large GWAS cohort. Therefore, the log joint likelihood now comprises a sum between the normal prior on β, a linear regression over observed gene expression in the reference panel, and liability threshold regression over imputed gene expression in a GWAS cohort. We first estimate σ 2 X , the variance explained of SNPs on gene expression, using either H-E regression on the reference panel or the liability method by Golan et al. [20], making sure to divide by β 2 The inclusion of linear regression terms can be accommodated easily in the EM algorithm. The new 199 updates are M step: The inferred β can be corrected for effect size inflation by either re-scaling according to σ 2 X after EM 203 inference, or since the form of the EM algorithm is largely unchanged, the Lagrangian can be incorporated

227
This also suggests that an estimate of β 2 from data is sufficient to improve estimation of eQTLs (β 1 ) in one 228 step, rather than requiring repeated updates of one set of effects given newer estimates of the other. method produces nearly identical estimates to those of the slower rescaling method (Fig S2B). where σ 2 res denotes the non-gene mediated residual SNP variance explained, and σ 2 X and σ 2 Y denote the SNP-gene and gene-trait variances explained by β 1 and β 2 , respectively. 266 We applied EMBER to study eQTLs for genes relevant to schizophrenia. We obtained a gene expression that the performance plateaus between 5 and 10 PCs and begins to decline afterward, which prompted us 298 to select 10 PCs for this procedure. 299 We evaluated the accuracy of baseline linear regression and EMBER, both using the variable reduction 300 preprocessing step, by the concordance of their inferred SNP-expression effects with those reported by 301 TWAS/FUSION in GTEx v7 data. As TWAS/FUSION also reports effect size calculations from the CMC 302 data for 26 out of the 36 studied genes, we compare the r 2 of these results against the GTEx effects when 303 possible, along with our own implementation of the BLUP method for all genes. We assume in both the these methods across all 36 genes are shown in Figure 5. Across all genes, EMBER achieves an average r 2 306 of 0.235 ± 0.145, improving over an r 2 of 0.200 ± 0.129 obtained by linear regression. EMBER also improves 307 r 2 in 26 out of 36 genes ( Figure 5B), with a one-sided sign test p-value of .00197. However, we do not 308 observe a significant relationship between improvement in performance and the fraction of gene-mediated 309 variance explained in Figure 5B. Given the generally low fraction of variance explained across all genes (see 310 Figure 4), it is possible that cis-SNP effects not mediated by the studied genes contribute to a reduction in 311 estimation accuracy. The average r 2 achieved by EMBER is slightly lower that of our implementation of the 312 BLUP model at 0.240 ± 0.123 ( Figure 5A), though both methods perform significantly better than the same 313 BLUP model on CMC data as reported by TWAS/FUSION, with an r 2 of 0.201 ± 0.109. As EMBER is 314 currently applied to a baseline linear regression model with a variable reduction step, further improvements 315 in performance may be possible by accessing the full SNP covariance matrix as in the BLUP method. Figure   316 6 displays effect size comparisons for the gene GLT8D1, where we observed the largest increase in r 2 in 317 EMBER compared to linear regression. Additional plots for all genes are shown in Figure S4.  GWAS data simultaneously to improve estimation of effect sizes over a two-stage regression approach. In 321 developing EMBER, we introduced also a variance correction to incorporate both measured and imputed 322 gene expression as predictors into a single regression, and a constrained EM method to rapidly re-scale effect 323 sizes based on expected variance explained. 324 We mapped the parameter space where EMBER is effective. Through simulations, we found that imputed 325 gene-trait associations are quite robust to low reference panel sample sizes, but that significant improvements 326 in eQTL estimates are possible by including GWAS data with no gene expression data observed. Additionally, 327 while EMBER focuses on gene mediated effects, we found that accuracy of inferred effects still improved 328 when the fraction of gene-mediated variance explained was as low as 0.3. 329 We observed significant improvement in concordance of inferred eQTL effects when we performed a

339
EMBER is not without its limitations. After variable reduction, for several genes the improvement 340 observed in CMC and PGC data is quite small. This is likely because the fraction of variance explained for 341 these genes is lower than the threshold of 0.2-0.3 for gene-mediated fraction of SNP variance, and EMBER 342 is not benchmarked to improve results in this range. Even so, we do not observe significant reductions in 343 performance in EMBER relative to linear regression, with the largest drop in r 2 being −0.05 in the gene 344 MAPK3.

345
Going forward, EMBER demonstrates a more general approach to the investigation of genotype-to-346 phenotype association. Instead of relying on a single study with its traded-off weaknesses of sample size vs.