Abstract
We developed a framework for identifying trait-associated genes in rats and facilitating the transfer of polygenic evidence across species by expanding the transcriptome-wide association (TWAS) approach to rats. Our analysis successfully trained transcript predictors for over 8000 genes in each of the five brain regions of rats, revealing several shared properties of gene regulation with humans. Moreover, mirroring trends observed in humans, our findings showed that sparse predictors using variants in cis are more effective than polygenic predictors and that gene expression prediction in rats is highly correlated across brain regions. Importantly, our analysis also identified a significant overlap between genes associated with rat and human body length and BMI, indicating rat models may be useful for studying the genetic basis of complex traits in humans. RatXcan represents a valuable tool for uncovering shared biological mechanisms of complex traits across species, with potential applications in a wide range of research fields.
Introduction
Over the last decade, genome-wide association studies (GWAS) have identified numerous genetic loci that contribute to biomedically important traits [Visscher et al., 2017]. GWAS have demonstrated that most traits have a highly polygenic architecture, meaning that numerous genetic variants with individually small effects confer risk [Loos, 2020]. However, translating these results into meaningful biological discoveries remains extremely challenging [Lewis and Vassos, 2020, Martin et al., 2019, Alliance et al., 2021].
Model organisms provide a system in which the effect of genotype, genetic manipulations, and environmental exposures can be experimentally tested. Whereas the tools for using model organisms to study individual genes are well established, there are no satisfactory methods for studying the polygenic signals obtained from GWAS in model organisms.
To start addressing this problem, we extend the TWAS framework [Gamazon et al., 2015] to rats so that the units of analysis are genes rather than rats. We call this approach RatXcan. Following our human pipeline, we investigate the genetic architecture of gene expression traits in rats and compare it with that of humans. Then, we train genetic predictors of gene expression traits in rats and test the association between the predicted expression and rat body size traits.
Results
Experimental setup
To build a framework for translating genetic results between species, we followed the experimental setup illustrated in Fig. 1. In the training stage (Fig. 1a), we investigated the genetic architecture of gene expression and built prediction models of gene expression in rats. We used genotype and transcriptome data from five brain regions sampled from 88 heterogeneous stock (HS) rats, generated by the NIDA Center for GWAS for Outbred rats (Fig. 1a). We selected HS rats because they are a well characterized, outbred mammalian population for which dense genotype, phenotype, and gene expression data are available in thousands of subjects [Solberg Woods and Palmer, 2019, Chitre et al., 2020, Keele et al., 2018, Crouse et al., 2022]. In the association stage (Fig. 1b), we used genotype data to predict the transcriptome in a non-overlapping target set of 3,407 rats. We tested for associations between the genetically predicted gene expression and body length by adapting the PrediXcan software, which was originally developed for use in humans [Gamazon et al., 2015], to rats (‘RatXcan’).
Genetic Architecture of Gene Expression across Brain Tissues
To inform the optimal prediction model training, we examined the genetic architecture of gene expression in HS rats by quantifying heritability and polygenicity for five areas of brain tissue. Because the results for each tissue are similar, in the main text we summarize results for all tissues, highlighting the results for nucleus accumbens core; we present the remaining tissues in more detail in the supplement.
We calculated the heritability of expression for each gene by estimating the proportion of variance explained (PVE) using a Bayesian Sparse Linear Mixed Model (BSLMM) [Zhou et al., 2013]. We restricted the feature set to variants within 1 Mb of the transcription start site of each gene since this is expected to capture most cis-eQTLs. Among the 15,216 genes considered, 3,438 genes were heritable (defined as having a 95% credible set lower boundary greater than 1%) in the nucleus accumbens core. The mean heritability ranged from 8.86% to 10.12% for all brain tissues tested (Table 1). Fig. 2a shows the heritability estimates for gene expression in the nucleus accumbens core, while Fig. S1 shows heritability estimates for other tissues. We identified a similar heritability distribution in humans (Fig. 2b, Fig. S2) based on whole blood samples from GTEx.
Next, to evaluate the polygenicity of gene expression levels, we examined whether predictors with more polygenic or sparse architecture correlate better with observed expression. We fitted elastic net regression models using a range of mixing parameters from 0 to 1 (Fig. 2c). The leftmost parameter value of 0 corresponds to ridge regression, which is fully polygenic and uses all cis-variants. Larger values of the mixing parameters yield more sparse predictors, with the number of variants decreasing as the mixing parameter increases. The rightmost value of 1 corresponds to lasso regression, which yields the most sparse predictor within the elastic net family.
We used the 10-fold cross-validated Pearson correlation (R) between predicted and observed values as a measure of performance (Spearman correlation yielded similar results). We observed a substantial drop in performance towards the more polygenic end of the mixing parameter spectrum (Fig. 2c). We observed similar results using human gene expression data from whole blood samples in GTEx individuals (Fig. 2d). Overall, these results indicate that the genetic architecture of gene expression in HS rats (detectable with the currently available sample size) is sparse, similar to that of humans [Wheeler et al., 2016].
Generation of Prediction Models of Gene Expression in Rats
We trained elastic net predictors for all genes in all five brain regions. Based on the relative performance across different elastic net mixing parameters, we chose a parameter value of 0.5, which yielded slightly less sparse predictors than lasso but provided robustness to missing or low quality variants; this is the same value that we have chosen in the past for human datasets [Gamazon et al., 2015]. The procedure yielded 8,244 to 8,856 genes per tissue across the five brain tissues, out of the 15,216 available genes (Table 1). The 10-fold cross-validated prediction performance (R²) ranged from 0 to 80% with a mean of 8.51% in the nucleus accumbens core. As shown in Table 1, mean prediction R² was consistently lower than mean heritability for all tissues, as expected, since genetic prediction performance is bounded by heritability. Prediction performance values followed the heritability curve, confirming that genes with highly heritable expression tend to be better predicted than genes with low heritability in both HS rats and humans (Fig. 2a-b). Interestingly, we identified better prediction performance in HS rats than in humans (Fig. S3), despite heritability of gene expression being similar across species (Fig. 2a-b).
In Fig. 3a-b, we show the prediction performance of the best predicted genes in HS rats (Mgmt, R² = 0.72) and humans (RPS26, R² = 0.74). Across all genes, we found that the prediction performance in HS rats was correlated with that of humans (R = 0.061, P = 8.03 × 10⁻⁶; Fig. 3c). Furthermore, performance per gene in different tissues was similar in both HS rats (Fig. 3d) and humans (Fig. 3e); that is, genes that were well predicted in one tissue were also well predicted in another tissue. Correlation of prediction performance across tissues ranged from 58% to 84% in HS rats and from 42% to 69% in humans.
Having established the similarity of the genetic architecture of gene expression between rats and humans, we transitioned to the association stage.
PrediXcan/TWAS Implementation in Rats (RatXcan)
To extend the PrediXcan/TWAS framework to rats, we developed RatXcan. We used the predicted weights from the training stage to estimate the genetically regulated expression in the target set of 3,407 densely genotyped HS rats. We then tested the association between predicted expression and body length in the target set.
We identified 90 Bonferroni-significant genes (P < 0.05/5,388 = 9.28 × 10⁻⁶) in 57 distinct loci separated by ±1 Mb for rat body length (Fig. 4a; Supplementary Table 1). Among the 90 significant genes, 30.46% had human orthologs previously associated with height in GWAS. For example, Tgfa, which is related to growth pathways, including epidermal growth factor, was associated with body length in rats (P = 1.18 × 10⁻⁹) and nominally associated with height in humans [Comuzzie et al., 2012] (P = 8.00 × 10⁻⁶). To evaluate whether trait-associated genes identified in HS rats were more significantly associated with the corresponding traits in humans, we performed an enrichment analysis. Specifically, we selected genes that were nominally associated with HS rat body length (P < 0.05) and compared their p-values for the analogous human trait (height) against the background distribution of p-values for height-associated genes identified in GWAS. Given the large sample size of human height GWAS, we expected the p-values of height-associated genes (shown in pink, Fig. 4b) to depart substantially from the identity line (in gray). The subset of genes that were associated with rat body length (in blue, Fig. 4b) showed a major departure from the background distribution, indicating that body-length genes in rats were more significantly associated with human height than expected. To quantify the enrichment, we compared the p-value distribution of all the genes with the distribution of the subset of genes that were nominally significantly associated with rat body length (P = 6.55 × 10⁻¹⁰).
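The Bonferroni threshold quoted above follows directly from the number of genes tested. The short sketch below reproduces that arithmetic; the helper name `bonferroni_significant` and the example p-values passed to it are illustrative assumptions, not part of the RatXcan code.

```python
# Bonferroni threshold for the gene-level tests reported in the text.
ALPHA = 0.05            # family-wise error rate
N_GENES_TESTED = 5388   # number of genes tested, as stated in the text

threshold = ALPHA / N_GENES_TESTED
print(f"{threshold:.2e}")

def bonferroni_significant(p_values, alpha=ALPHA):
    """Return the p-values that pass alpha / len(p_values) (hypothetical helper)."""
    cutoff = alpha / len(p_values)
    return [p for p in p_values if p < cutoff]

# Illustrative p-values only; with three tests the cutoff is 0.05 / 3.
print(bonferroni_significant([1.18e-9, 0.02, 0.4]))
```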
Discussion
Overwhelming evidence demonstrates that most complex diseases are extremely polygenic; however, there is an unmet need for methods that translate polygenic results to other species.
A critical first step to achieve the transfer of polygenic scores is the development of RatXcan, which is the rat version of PrediXcan [Gamazon et al., 2015], a well-established statistical tool that is used in human genetics. We showed that the genetic architecture of gene expression in rats is broadly similar to that in humans: gene expression traits are heritable and sparse, and the degree of heritability is preserved across tissues; some of these observations are consistent with another recent publication that mapped eQTLs in HS rats [Munro et al., 2022]. Interestingly, despite the smaller sample sizes used to train our prediction models, rats showed better prediction than humans. This might reflect the fact that HS rats have a preponderance of common alleles [Chitre et al., 2020] whereas humans have numerous rare alleles that influence gene expression but are difficult to capture in prediction models. The superior prediction may also reflect the longer haplotype blocks that are present in HS rats relative to humans [Chitre et al., 2020], which reduce the multiple testing burden when mapping cis-eQTLs and likely facilitate predictor training.
Using RatXcan, we tested gene-level associations of body length, which had been previously measured in rats. We chose height because of the availability of large human GWAS, the relatively large genotyped HS rat cohort in which body length was known, and the relatively unambiguous similarity between human height and rat body length. We found substantial enrichment of trait-associated genes among orthologous human trait-associated genes.
There are several limitations in the current study. First, the sample size of the reference transcriptome data in rats was limited. We would expect better predictability estimates in our elastic-net trained models with larger sample sizes. Furthermore, we used gene expression data from human blood and rat nucleus accumbens core because they were convenient datasets, but these tissues are not likely to be major mediators of height or body length. Second, we suspect that in both humans and rats, some gene-level associations may be confounded by linkage disequilibrium contamination and co-regulation. This problem is likely to be more serious in model organisms where even longer range LD exists. Finally, integration of other omic data types (e.g., protein, methylation, metabolomics) and the use of cell-specific data may improve prediction accuracy and cross-species portability. It is worth noting that while we have shown success with humans and HS rats, it is still not clear whether more distantly related species, such as nonmammalian vertebrates or even insects, might also lend themselves to ortholog analysis and ultimately a cross-species transcriptome-based polygenic risk score.
Despite these limitations, we have developed a methodology for effectively and efficiently identifying orthologs between rats and humans, which should support new and transformative experimental designs involving model organisms and enable the future development of a transcriptome-based polygenic risk score that is portable across species. Moreover, the RatXcan methodology provides a method to empirically validate traits that are intended to model or recapitulate aspects of human diseases in model systems. While the validity of these animal models has been a source of passionate debate, empirical evidence has been limited. Our polygenic approach provides an empirical approach to this debate that has been urgently needed.
Methods
Resource availability
Lead contact
Requests for further information, resources, and reagents should be directed to and will be fulfilled by one of the lead contacts, Hae Kyung Im (haky{at}uchicago.edu) or Abraham Palmer (aapalmer{at}ucsd.edu).
Material availability
This study did not generate new unique reagents.
Experimental model and subject details
The rats used for this study are part of a large multi-site project focused on genetic analysis of complex traits (www.ratgenes.org). N/NIH heterogeneous stock (HS) outbred rats are the most highly recombinant rat intercross available and are a powerful tool for genetic studies [Solberg Woods and Palmer, 2019, Chitre et al., 2020]. HS rats were created in 1984 by interbreeding eight inbred rat strains (ACI/N, BN/SsN, BUF/N, F344/N, M520/N, MR/N, WKY/N and WN/N) and have been maintained as an outbred population for almost 100 generations.
Method details
Genotype and expression data in the training rat set
For training the gene expression predictors, we used RNAseq and genotype data pre-processed for Munro et al. [2022]. We used 88 HS male and female adult rats, for which whole genome and RNA-sequencing information was available across five brain tissues [nucleus accumbens core (NAcc), infralimbic cortex (Il), prelimbic cortex (PL), orbitofrontal cortex (OFC), and lateral habenula (Lhb); Table 1]. Mean age was 85.7 ± 2.2 for males and 87.0 ± 3.8 for females. All rats were group housed under standard laboratory conditions and had not been through any previous experimental protocols. Genotypes were determined using genotyping-by-sequencing, as described previously in [Parker et al., 2016], [Chitre et al., 2020] and [Gileta et al., 2020]. Bulk RNA-sequencing was performed using Illumina HiSeq 4000 with polyA libraries, 100 bp single-end reads, and mean library size of 27M. Read alignment and gene expression quantification were performed using RSEM and counts were upper-quartile normalized, followed by additional quality-control filtering steps as described in Munro et al. [2022]. Gene-expression levels refer to transcript abundance for reads aligned to the gene’s exons using the Ensembl Rat Transcriptome.
For each gene, we inverse normalized the TPM values to account for outliers and fit a normal distribution. We then performed PEER factor analysis [Stegle et al., 2010]. We regressed out sex, batch number, batch center and 7 PEER factors from the gene expression and saved the residuals for all downstream analyses.
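The two preprocessing steps just described (inverse normalization, then regressing out covariates and keeping the residuals) can be sketched as follows. This is a minimal illustration on simulated data, not the authors' pipeline: the rank offset of 0.5, the function names, and the simulated covariates (sex plus seven stand-ins for PEER factors; batch number and batch center are omitted for brevity) are all assumptions.

```python
import numpy as np
from statistics import NormalDist

def inverse_normal_transform(values):
    """Rank-based inverse normal transform of one gene's expression values."""
    values = np.asarray(values, dtype=float)
    n = values.size
    ranks = np.empty(n)
    # Ranks 1..n; ties are broken by order of appearance (a simplification).
    ranks[np.argsort(values, kind="stable")] = np.arange(1, n + 1)
    # The 0.5 offset (an assumption) keeps quantiles strictly inside (0, 1).
    quantiles = (ranks - 0.5) / n
    inv_cdf = NormalDist().inv_cdf
    return np.array([inv_cdf(q) for q in quantiles])

def residualize(expression, covariates):
    """Residuals of expression after least-squares regression on covariates."""
    design = np.column_stack([np.ones(len(expression)), covariates])
    coef, *_ = np.linalg.lstsq(design, expression, rcond=None)
    return expression - design @ coef

rng = np.random.default_rng(0)
n_rats = 88
tpm = rng.lognormal(mean=2.0, sigma=1.0, size=n_rats)  # simulated TPM, one gene
sex = rng.integers(0, 2, size=n_rats)                  # simulated sex indicator
peer = rng.normal(size=(n_rats, 7))                    # stand-ins for 7 PEER factors
covariates = np.column_stack([sex, peer])

normalized = inverse_normal_transform(tpm)
residuals = residualize(normalized, covariates)
print(residuals[:5])
```

By construction, the residuals are uncorrelated with every covariate in the design matrix, which is the property the downstream analyses rely on.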
Genotype and phenotype data in the target rat set
We used genotype and phenotype data from 3,407 HS rats (i.e., target set) reported in Chitre et al. [2020]. We used phenotypic information on body length (including tail), and fasting glucose. For each trait, sex, age, batch number and site were regressed out if they were significant and if they explained more than 2% of the variance, as described in [Chitre et al., 2020].
Querying human gene-trait association results
To retrieve analogous human gene–trait association results, we queried PhenomeXcan, a web-based tool that serves gene-level association results for 4,091 traits based on predicted expression in 49 GTEx tissues [Pividori et al., 2020]. Orthologous genes (N = 22,777) were mapped with Ensembl annotation using the biomaRt R package and were matched one-to-one.
Estimating gene expression heritability
We calculated the cis-heritability of gene expression from the training set using a Bayesian sparse linear mixed model, BSLMM [Zhou et al., 2013], as implemented in GEMMA. We used variants within the ±1Mb window up- and down-stream of the transcription start and end of each gene annotated by Gencode v26 [Frankish et al., 2021]. We used the proportion of variance explained (PVE) generated by GEMMA as the measure of cis-heritability of gene expression. We then display only the PVE estimates of 10,268 genes that were also present in the human gene expression data.
Heritability of human gene expression, which was also calculated with BSLMM, was downloaded from the database generated by Wheeler et al. [2016]. Genes were also limited to the same 10,268 as above.
Examining polygenicity versus sparsity of gene expression
To examine the polygenicity versus sparsity of gene expression in rats, we identified the optimal elastic net mixing parameter α, as described in Wheeler et al. [2016]. Briefly, we compared the prediction performance of a range of elastic net mixing parameters spanning from 0 to 1 (11 values from 0 to 1, with steps of 0.1). If the optimal mixing parameter was closer to 0, corresponding to ridge regression, we deemed the gene expression trait to be polygenic. In contrast, if the optimal mixing parameter was closer to 1, corresponding to lasso, then the gene expression trait was considered to be more sparse. We also restricted the number of genes in the pipeline to the 10,268 orthologous genes.
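The mixing-parameter comparison can be illustrated with a small self-contained sketch. It is not the pipeline used in this study (which relies on glmnet in R): the coordinate-descent solver, the fixed penalty `lam = 0.1`, the simulated genotypes, and all function names are assumptions made for illustration, and a real analysis would also tune the penalty by cross-validation.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * max(abs(z) - t, 0.0)

def elastic_net(X, y, lam, alpha, n_iter=100):
    """Coordinate descent for
    (1/2n)||y - Xb||^2 + lam * (alpha*||b||_1 + (1-alpha)/2*||b||_2^2).
    Assumes columns of X have mean 0 and variance 1, and y is centered."""
    n, p = X.shape
    b = np.zeros(p)
    resid = y.copy()
    for _ in range(n_iter):
        for j in range(p):
            resid += X[:, j] * b[j]        # remove variant j's contribution
            rho = X[:, j] @ resid / n
            b[j] = soft_threshold(rho, lam * alpha) / (1.0 + lam * (1.0 - alpha))
            resid -= X[:, j] * b[j]        # add the updated contribution back
    return b

def cv_pearson(X, y, lam, alpha, n_folds=10, seed=0):
    """Pearson R between out-of-fold predictions and observed values."""
    n = len(y)
    folds = np.random.default_rng(seed).permutation(n) % n_folds
    pred = np.empty(n)
    for k in range(n_folds):
        test = folds == k
        train = ~test
        mu, sd = X[train].mean(axis=0), X[train].std(axis=0)
        sd[sd == 0] = 1.0                  # guard against monomorphic variants
        y_mean = y[train].mean()
        b = elastic_net((X[train] - mu) / sd, y[train] - y_mean, lam, alpha)
        pred[test] = y_mean + ((X[test] - mu) / sd) @ b
    return np.corrcoef(pred, y)[0, 1]

# Simulated cis-genotypes (0/1/2 dosages) with a sparse architecture:
# two causal variants out of twenty.
rng = np.random.default_rng(42)
n_rats, n_variants = 88, 20
X = rng.binomial(2, 0.3, size=(n_rats, n_variants)).astype(float)
y = 0.8 * X[:, 3] + 0.5 * X[:, 17] + rng.normal(scale=0.5, size=n_rats)

results = {}
for alpha in np.linspace(0.0, 1.0, 11):    # 0 = ridge, 1 = lasso
    results[round(float(alpha), 1)] = cv_pearson(X, y, lam=0.1, alpha=alpha)

for alpha, r in results.items():
    print(f"alpha={alpha:.1f}  cross-validated R={r:.3f}")
```

Here `alpha` plays the role of the mixing parameter α in the text; the fixed denominator `1 + lam * (1 - alpha)` in the update is valid only because each training fold is standardized before fitting.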
Training gene expression prediction in rats
To train prediction models for gene expression in rats, we used the training set of 88 rats described above and followed the elastic net pipeline from predictdb.org. Briefly, for each gene, we fitted an elastic net regression using the glmnet package in R. We only included variants in the cis region (i.e., 1Mb up and downstream of the transcription start and end). The regression coefficients from the best penalty parameter (chosen via glmnet’s internal 10-fold cross validation [Zou and Hastie, 2005]) served as the weights for each gene. The calculated weights (ws) are available at predictdb.org. For the comparison of number of predictable genes across species, we ran the same cross-validated elastic net pipeline in four GTEx tissues with sample sizes similar to that of the rats: Substantia Nigra, Kidney Cortex, Uterus and Ovary. To ensure fair comparison, we used the same number of genes that were orthologous across all four human tissues and rat tissues.
Estimating overlap and enrichment of genes between rats and humans
For human transcriptome prediction used in the comparison with rats, we simply downloaded elastic net predictors trained in GTEx whole blood samples from the PredictDB portal, as previously done in humans [Barbeira et al., 2021]. This model was different from the ones used in the UK Biobank for calculating the PTRS weights (See Calculating PTRS in a rat target set).
We quantified the accuracy of the prediction models using a 10-fold cross validated correlation (R) and correlation squared (R²) between predicted and observed gene expression [Zou and Hastie, 2005]. For the rat prediction models, we only included genes whose prediction performance was greater than 0.01 and had a non-negative correlation coefficient, as these genes were considered well predicted.
We tested the prediction performance of our elastic net model trained in nucleus accumbens core in an independent rat reference transcriptome set. We predicted expression in the reference set of 188 individuals and compared it to observed gene expression in the nucleus accumbens core.
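The inclusion rule for well-predicted genes can be written as a simple filter. In the sketch below, the gene names and correlation values are made up, and we assume that "prediction performance" refers to the cross-validated squared correlation.

```python
# Hypothetical cross-validated correlations (R) per gene; values are made up.
cv_correlation = {
    "GeneA": 0.85,
    "GeneB": 0.05,   # squared correlation 0.0025, below the 0.01 cutoff
    "GeneC": -0.20,  # squared correlation 0.04, but the correlation is negative
    "GeneD": 0.30,
}

def is_well_predicted(r, min_r2=0.01):
    """Keep genes with squared correlation above min_r2 and non-negative R."""
    return r * r > min_r2 and r >= 0

kept = sorted(gene for gene, r in cv_correlation.items() if is_well_predicted(r))
print(kept)  # ['GeneA', 'GeneD']
```

The non-negativity condition matters because squaring alone would retain genes whose predictions are anti-correlated with observed expression.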
Quantification and Statistical Analysis
Implementing RatXcan
We developed RatXcan, based on PrediXcan [Gamazon et al., 2015, Barbeira et al., 2018] in humans. RatXcan uses the elastic net prediction models generated in the training set. In the prediction stage, we generated a predicted expression matrix for all genes in the rat target set by fitting an additive genetic model:
$$Y_g = \sum_{k} w_{k,g} X_k + \epsilon$$
where $Y_g$ is the predicted expression of gene $g$, $w_{k,g}$ is the effect size of marker $k$ for gene $g$, $X_k$ is the number of reference alleles of marker $k$, and $\epsilon$ is the contribution of other factors that determine the predicted gene expression, assumed to be independent of the genetic component.
We then tested the association between the predicted expression matrix and body length. We fitted a linear regression of the phenotype on the predicted expression of each gene, which generated gene-level association results for all gene trait pairs.
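The prediction and association stages can be sketched on simulated data as follows. This is an illustration under stated assumptions rather than the RatXcan implementation: the dosages, weights, and phenotype are simulated, the function name `gene_association` is hypothetical, and the p-value uses a normal approximation to the t distribution, which is reasonable only for large samples.

```python
import math
import numpy as np

rng = np.random.default_rng(1)
n_rats, n_variants, n_genes = 500, 40, 5

# Simulated allele counts X_k (0/1/2) and sparse weights w_{k,g}.
dosage = rng.binomial(2, 0.4, size=(n_rats, n_variants)).astype(float)
weights = np.zeros((n_variants, n_genes))
for g in range(n_genes):
    idx = rng.choice(n_variants, size=3, replace=False)
    weights[idx, g] = rng.normal(scale=0.5, size=3)

# Prediction stage: genetically predicted expression, one column per gene.
predicted = dosage @ weights

# Simulated phenotype that depends on the first gene only.
phenotype = 0.3 * predicted[:, 0] + rng.normal(size=n_rats)

def gene_association(expr, pheno):
    """Simple linear regression of phenotype on predicted expression.
    Returns slope, test statistic, and two-sided p-value (normal approximation)."""
    x = expr - expr.mean()
    y = pheno - pheno.mean()
    sxx = x @ x
    slope = (x @ y) / sxx
    resid = y - slope * x
    se = math.sqrt((resid @ resid) / (len(y) - 2) / sxx)
    stat = slope / se
    p_value = math.erfc(abs(stat) / math.sqrt(2))
    return slope, stat, p_value

# Association stage: one regression per gene.
results = [gene_association(predicted[:, g], phenotype) for g in range(n_genes)]
for g, (slope, stat, p_value) in enumerate(results):
    print(f"gene {g}: slope={slope:.3f}  stat={stat:.2f}  p={p_value:.3g}")
```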
Estimating overlap and enrichment of genes between rats and humans
We queried PhenomeXcan to identify genes associated with human height. PhenomeXcan provides gene-level associations aggregated across all available GTEx tissues, as calculated by MultiXcan (an extension of PrediXcan) [Barbeira et al., 2019]. To this aim, we adapted MultiXcan to similarly aggregate our results across the 5 tested brain tissues in rats. We used a Q-Q plot to inspect the level of enrichment across rat and human findings. To quantify enrichment, we used a Mann-Whitney test as implemented in R to discern whether the distribution of the p-values for genes in humans was the same for the genes that were and were not nominally significant in rats.
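The enrichment test can be illustrated with a small Python stand-in for the R implementation. The sketch below uses the large-sample normal approximation without a tie correction, and the two sets of human p-values are simulated; the function name `mann_whitney_less` is an assumption for this example.

```python
import math
import numpy as np

def mann_whitney_less(x, y):
    """One-sided Mann-Whitney U test (normal approximation, no tie correction).
    Returns U for x and a p-value that is small when x tends to be smaller than y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n1, n2 = len(x), len(y)
    combined = np.concatenate([x, y])
    ranks = np.empty(n1 + n2)
    ranks[np.argsort(combined, kind="stable")] = np.arange(1, n1 + n2 + 1)
    u = ranks[:n1].sum() - n1 * (n1 + 1) / 2       # pairs with x_i > y_j
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    return u, 0.5 * math.erfc(-z / math.sqrt(2))   # P(Z <= z)

rng = np.random.default_rng(7)
# Simulated human p-values: genes not nominally significant in rats (uniform)
# and genes nominally significant in rats (shifted toward smaller p-values).
human_p_background = rng.uniform(size=2000)
human_p_rat_significant = rng.uniform(size=300) ** 2

u_stat, p_enrichment = mann_whitney_less(human_p_rat_significant, human_p_background)
print(f"U={u_stat:.0f}  one-sided p={p_enrichment:.3g}")
```

A one-sided alternative is used here because enrichment corresponds to smaller human p-values among the rat-significant genes; whether the original analysis was one-sided or two-sided is not stated in the text.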
Calculating PTRS weights in the UK Biobank
We calculated human-derived height PTRS weights using elastic net with a mixing parameter of 0.5, as described in Liang et al. [2022]. We predicted expression levels in 356,476 UK Biobank unrelated participants of European descent using whole blood prediction models trained in GTEx. We used the prediction models trained with UTMOST based on grouped lasso, which borrows information across tissues to improve prediction performance [Barbeira et al., 2020, Hu et al., 2019]. The predicted expression was generated using high quality SNPs from Hapmap2 [McCarthy et al., 2016]. We performed elastic net regression with height as the response variable and the predicted expression matrix from the 356,476 UK Biobank unrelated individuals of European descent as the predictors. More specifically, for each regularization parameter λ, we selected weight parameters $\gamma_g$ that minimized the penalized mean squared difference between the response variable $Y$ and the prediction model $X\gamma + \gamma_0$:
$$(\hat{\gamma}_0, \hat{\gamma}, \hat{\beta}) = \arg\min_{\gamma_0, \gamma, \beta} \; \frac{1}{2N} \left\| Y - \gamma_0 - \sum_{g=1}^{m} X_g \gamma_g - \sum_{l=1}^{L} C_l \beta_l \right\|_2^2 + \lambda \left( \alpha \|\gamma\|_1 + \frac{1-\alpha}{2} \|\gamma\|_2^2 \right)$$
where $X_g$ is the standardized predicted expression level of gene $g$ across $N$ individuals, $C_l$ is the observed value of the $l$th standardized covariate with coefficient $\beta_l$, $\gamma_0$ is the intercept, $m$ is the number of genes, $L$ is the number of covariates, $\|\gamma\|_2$ is the $l_2$ norm and $\|\gamma\|_1$ is the $l_1$ norm of the effect size vector. α denotes the elastic net mixing parameter and λ is the regularization parameter. We used 37 different values of λ, generating 37 different sets of predictors. Covariates included age at recruitment (Data-Field 21022), sex (Data-Field 31), and the first 20 genetic PCs. The values of the regularization parameters were chosen in a region likely to cover a wide range of sparsity in the resulting models, from very sparse, containing a couple of genes, to dense, containing all genes. For more details, see Liang et al. [2022].
Code and Data Availability
The code used for this work is available at https://github.com/hakyimlab/Rat_Genomics_Paper_Pipeline. Genotype and expression data are available through [Munro et al., 2022]. Prediction models for gene expression in all five brain tissues in rats are available at predictdb.org.
Author contributions
A.A.P. and H.K.I. conceived the cross-species PTRS and supervised the work. N.S. and Y.L. performed a large portion of the analyses. N.S. and S.S-R. analyzed and interpreted the results and wrote the initial draft of the manuscript. M.P. and F.N. performed analysis of some of the PTRS results. S.M., D.M., A.C., D.C., L.S-W, and O.P. pre-processed and analyzed the RNAseq, genotype, and phenotype data. R.C., J.G., A.M.G., A.G., K.H., A.H., C.P.K., C.L.S-P., J.T., T.W., H.C., S.F., K.I., P.M., L.S. were involved in various aspects of the collection of the rat physiological traits. All authors read, edited and approved the final version of the manuscript.
Competing interests
The authors declare no conflict of interest.
Ethics declaration
Not applicable.
Supplementary information
Acknowledgments
This research has been conducted using the UK Biobank Resource under Application Number 19526. We thank Natalia Gonzales and Christian Jones for help editing the paper. The abstract’s style was improved by using ChatGPT iteratively. This work was partially supported by DP1DA054394 (SSR), P30DK020595 and R01CA242929 (HKI, NS, MP), P30DA044223 and R24 AA013162 (LS), and P50DA037844 (AAP).
Footnotes
We have removed some of the previous version's results that were not reproduced in a larger dataset.