Abstract
For many complex traits, gene regulation is likely to play a crucial mechanistic role. How the genetic architectures of complex traits vary between populations and subsequent effects on genetic prediction are not well understood, in part due to the historical paucity of GWAS in populations of non-European ancestry. We used data from the MESA (Multi-Ethnic Study of Atherosclerosis) cohort to characterize the genetic architecture of gene expression within and between diverse populations. Genotype and monocyte gene expression were available in individuals with African American (AFA, n=233), Hispanic (HIS, n=352), and European (CAU, n=578) ancestry. We performed expression quantitative trait loci (eQTL) mapping in each population and show genetic correlation of gene expression depends on shared ancestry proportions. Using elastic net modeling with cross-validation to optimize genotypic predictors of gene expression in each population, we show the genetic architecture of gene expression is sparse across populations. We found the best predicted gene, HLA-DRB5, was the same across populations with R2 > 0.81 in each population. However, there were 1094 (11.3%) well predicted genes in AFA and 372 (3.8%) well predicted genes in HIS that were poorly predicted in CAU. Using genotype weights trained in MESA to predict gene expression in 1000 Genomes populations showed that a training set with ancestry similar to the test set is better at predicting gene expression in test populations, demonstrating an urgent need for diverse population sampling in genomics. Our predictive models in diverse cohorts are made publicly available for use in transcriptome mapping methods at http://predictdb.hakyimlab.org/.
Author summary
Most genome-wide association studies (GWAS) have been conducted in populations of European ancestry leading to a disparity in understanding the genetics of complex traits between populations. For many complex traits, gene regulation is likely to play a critical mechanistic role given the consistent enrichment of regulatory variants among trait-associated variants. However, it is still unknown how the effects of these key variants differ across populations. We used data from MESA to study the underlying genetic architecture of gene expression by optimizing gene expression prediction within and across diverse populations. The populations with genotype and gene expression data available are from individuals with African American (AFA, n=233), Hispanic (HIS, n=352), and European (CAU, n=578) ancestry. After calculating the prediction performance, we found that there are many genes that were well predicted in AFA and HIS that were poorly predicted in CAU. We further showed that a training set with ancestry similar to the test set resulted in better gene expression predictions, demonstrating the need to incorporate diverse populations in genomic studies. Our gene expression prediction models are publicly available to facilitate future transcriptome mapping studies in diverse populations.
Introduction
For over a decade, genome-wide association studies (GWAS) have facilitated the discovery of thousands of genetic variants associated with complex traits and new insights into the biology of these traits [1]. Most of these studies involved individuals of primarily European descent, which can lead to disparities when attempting to apply this information across populations [2–4]. Continued increases in GWAS sample sizes and new integrative methods will lead to more clinically relevant and applicable results. Non-European populations need to be included in these studies to avoid further contribution to health care disparities [5]. A recent study shows that the lack of diversity in large GWAS skews the prediction accuracy across non-European populations [6]. This discrepancy in predictive accuracy demonstrates that adding ethnically diverse populations is critical for the success of precision medicine, genetic research, and understanding the biology behind genetic variation [6–8].
Gene regulation is likely to play a critical role for many complex traits as trait-associated variants are enriched in regulatory, not protein-coding, regions [9–13]. Numerous expression quantitative trait loci (eQTL) studies have provided insight into how genetic variation affects gene expression [14–17]. While eQTL can act at a great distance, or in trans, the largest effect sizes are consistently found near the transcription start sites of genes [14–17]. Because gene expression shows a more sparse genetic architecture than many other complex traits, gene expression is amenable to genetic prediction with relatively modest sample sizes [18, 19]. This has led to new mechanistic methods for gene mapping that integrate transcriptome prediction, including PrediXcan [20] and TWAS [21]. These methods have provided useful tools for understanding the genetics of complex traits; however, most of the models have been built using predominantly European populations.
How the key variants involved in gene regulation differ among populations has not been fully explored. While the vast majority of eQTL mapping studies have been performed in populations of European descent, increasing numbers of transcriptome studies in non-European populations make the necessary comparisons between populations feasible [14, 22, 23]. An eQTL study across eight diverse HapMap populations (∼100 individuals/population) showed that the directions of effect sizes were usually consistent when an eQTL was present in two populations [14]. However, the impact of a particular genetic variant on population gene expression differentiation is also dependent on allele frequencies, which often vary between populations. A better understanding of the degree of transferability of gene expression prediction models across populations is essential for broad application of methods like PrediXcan in the study of the genetic architecture of complex diseases and traits in diverse populations.
Here, in order to better define the genetic architecture of gene expression across populations, we combine genotype [24] and monocyte gene expression [25] data from the Multi-Ethnic Study of Atherosclerosis (MESA) for the first time. We perform eQTL mapping and optimize multi-SNP predictors of gene expression in three diverse populations. The MESA populations studied herein comprise 233 African American (AFA), 352 Hispanic (HIS), and 578 European (CAU) self-reported ancestry individuals. Using elastic net regularization and Bayesian sparse linear mixed modeling, we show sparse models outperform polygenic models in each population. We show the genetic correlation of SNP effects and the predictive performance correlation is highest between populations with the most overlapping admixture proportions. We found a subset of genes that are well predicted in the AFA and/or HIS cohorts that are poorly predicted, if predicted at all, in the CAU cohort. We also test our predictive models trained in MESA cohorts in independent cohorts from the HapMap Project [14] and show the correlation between predicted and observed gene expression is highest when the ancestry of the test set is similar to that of the training set. By diversifying our model-building populations, new genes may be implicated in complex trait mapping studies that were not previously interrogated. Models built here have been added to PredictDB http://predictdb.hakyimlab.org/ for use in PrediXcan [20] and other studies.
Results
Common and unique eQTLs across populations in MESA
We surveyed each MESA population (AFA, HIS, CAU) and two combined populations (AFHI, ALL) for cis-eQTLs. SNPs within 1Mb of each of 10,143 genes were tested for association with monocyte gene expression levels using a linear additive model. We used 10 genotype principal components in each model (Fig. 1) and compared models that included a range of PEER factors (0, 10, 20, 30, 50, 100) to adjust for hidden confounders in the expression data [26]. As expected, the sample size of the data influences the number of eQTLs mapped (Fig. 2A). We found that using at least 20 PEER factors was best at finding the optimal number of eQTLs with an FDR < 0.05 for each population (Fig. 2A). For the remainder of this work, all models were adjusted for 10 genotype principal components and 20 PEER factors. Hundreds of thousands to millions of SNPs were found to associate with gene expression (eSNPs) and most genes had at least one associated variant (eGenes) at FDR < 0.05 (Table 1). We quantified the number of eSNPs and eGenes as well as the percentage of common and unique eSNPs found for each population. Common eSNPs met FDR < 0.05 in all three self-identified populations (AFA, HIS, CAU) or, in the case of the combined AFHI population, common eSNPs met FDR < 0.05 in both AFHI and CAU. Unique eSNPs met FDR < 0.05 in only the designated population. While the AFA population has a sample size of less than half of the CAU population, the two populations have a similar proportion of unique eSNPs (Table 1). SNPs discovered in the CAU population were less likely to be replicated in the other populations than those discovered in the AFA population (Fig. 2B).
Pairwise comparisons between populations show CAU and HIS are the most correlated
We estimated the local heritability (h2) for each gene and the genetic correlation (rG) between genes in each MESA population using GCTA [28]. The sample sizes are not large enough to estimate genetic correlation for individual genes, but since there are a large number of genes, we can estimate the mean rG across genes [29]. The population pair with the highest mean rG was CAU and HIS, followed by AFA and HIS, and the least correlated pair was AFA and CAU (Table 2, Fig. 3). As the heritability threshold within a population increases, the mean rG between populations also increases (Fig. 3B).
Sparse models outperform polygenic models for gene expression
We examined the prediction performance of a range of models using elastic net regularization [30] to characterize the genetic architecture of gene expression in each population. The mixing parameter that gives the largest prediction performance R2 indicates the degree of sparsity or polygenicity of the gene expression trait. If the highest R2 occurs when α = 0.05, then the gene expression trait exhibits a more polygenic architecture. However, if the optimal R2 occurs when α = 1 then the trait has a sparse architecture [18]. We performed 10-fold cross-validation across three mixing parameters (α = 0.05, 0.5, 1). We found that the highest R2 predictive performance occurred when α = 0.5 or α = 1, whereas the R2 was smaller when α = 0.05, indicating that the sparse model outperformed the polygenic model. Figure 4 shows that models with α = 0.5 and α = 1 had similar predictive power while α = 0.05 was suboptimal for gene expression prediction in each of the populations. The number of genes that converged when α = 0.5 was 9695 for each population.

Table 2.
pop pair    mean rG    SE rG     genes that converged
AFA-CAU     0.48       0.0080    9227
AFA-HIS     0.57       0.0076    9269
CAU-HIS     0.62       0.0071    9480
rG was estimated using a bivariate restricted maximum likelihood (REML) model implemented in GCTA.
In addition to elastic net, we also used Bayesian Sparse Linear Mixed Modeling (BSLMM) [31] to estimate whether the local genetic contribution to gene expression is more polygenic or sparse. This approach models the genetic contribution of the trait as the sum of a sparse component and a polygenic component. The parameter PGE represents the proportion of the genetic variance explained by sparse effects. We also estimated heritability (h2) using GCTA, a linear mixed model approach [28]. The PVE is the BSLMM equivalent of h2 that is estimated from GCTA. We found that BSLMM PVE, GCTA h2, and elastic net R2 are highly correlated in each population (S1 Fig). Using BSLMM, we also found that for highly heritable genes, the sparse component (PGE) is large; however, for genes with low PVE, we are unable to determine whether the sparse or polygenic component is predominant (S1 Fig).
A subset of well-predicted genes in AFA and HIS were missed in CAU
We then compared each population’s gene expression predictive performance. Higher correlation values indicate similar accuracy in prediction performance of gene expression models between two populations. The correlation between CAU and HIS is highest (R2 = 0.853), followed by AFA and HIS (R2 = 0.702), and the lowest correlation between two populations was AFA and CAU with R2 = 0.678 (Fig. 5A-C). These correlation relationships mirror the European and African admixture proportions in the MESA HIS and AFA cohorts (Fig. 1). There are many genes that are well predicted in both populations and there are some that are poorly predicted between populations. We found the best predicted gene, HLA-DRB5, was the same across each population with an R2 > 0.81 in each population. On the other hand, there are some genes that are well predicted in one population, but poorly predicted in the other and vice versa (Fig. 5D-E). There were 1094 (11.3%) well predicted genes in AFA that were poorly predicted in CAU with an R2 difference greater than 0.2 between AFA and CAU (Table 3). When comparing HIS and CAU, there were 372 (3.8%) genes that were well predicted in HIS and poorly predicted in CAU with an R2 difference greater than 0.2. In contrast, a much smaller proportion of genes were well predicted in CAU and poorly predicted in AFA or HIS, 2.8% and 0.61%, respectively (Table 3).
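The comparison described above can be sketched as a simple filter on per-gene R2 values. This is only an illustration: the function name, the gene labels, and the choice to treat a gene with no model in the second population as R2 = 0 are assumptions, and the exact criteria behind Table 3 may differ.

```python
def better_predicted_genes(r2_a, r2_b, min_diff=0.2):
    """Genes whose R2 in population A exceeds their R2 in population B by more than min_diff.

    r2_a, r2_b: dicts mapping gene ID -> cross-validated R2 (hypothetical inputs).
    A gene without a model in population B is treated as R2 = 0 (an assumption).
    """
    return sorted(
        gene for gene, r2 in r2_a.items()
        if r2 - r2_b.get(gene, 0.0) > min_diff
    )

# Toy values, not results from the study.
r2_afa = {"geneA": 0.45, "geneB": 0.30, "geneC": 0.25}
r2_cau = {"geneA": 0.10, "geneB": 0.28}  # geneC has no CAU model
afa_specific = better_predicted_genes(r2_afa, r2_cau)
```

Swapping the two arguments gives the reverse comparison (genes better predicted in the second population).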
Predictive performance improves when training set has similar ancestry to test set
In order to further compare the predictive performance between populations, using each of the MESA populations as training sets, we predicted gene expression in two populations, Mexican ancestry individuals in Los Angeles (MXL) and Yoruba individuals in Ibadan, Nigeria (YRI), from the HapMap and 1000 Genomes Projects (Table 4, Fig. 6). The mean predicted vs. observed Pearson correlation (R) for YRI was 0.081 when using the AFA population as a training set, while mean R = 0.051 when using the CAU training set (Table 4). The MXL population had a mean R = 0.092 using the HIS population as a training set, whereas the mean R was 0.090 when CAU was the training set (Table 4). The AFA training set is suboptimal across models with varying predictive performance R2 when tested in MXL (Fig. 6A). Similarly, the CAU training set is suboptimal across models when used to predict expression in YRI (Fig. 6B). When using the currently available DGN training set that consists of 922 European individuals [20], both YRI and MXL are more poorly predicted than when the MESA training sets are used (Table 4). After combining the AFA and HIS populations (AFHI), we see that YRI expression is better predicted than with the HIS or AFA training set alone (Table 4). When all of the MESA populations are combined, the MXL and YRI mean predicted vs. observed correlation is optimized across models (Fig. 6). This demonstrates that when comparing predicted expression levels to the observed, a balance of the training population with ancestry most similar to the test population and total sample size leads to optimal predicted gene expression.
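As a minimal sketch of the evaluation described above, predicted expression for one gene can be computed as a weighted sum of SNP dosages and then compared to observed expression with the Pearson correlation. The SNP IDs, weights, dosages, and observed values below are invented for illustration, and dropping SNPs that are missing from the test genotypes is an assumption of this sketch.

```python
from math import sqrt

def predict_expression(weights, dosages):
    """Predicted expression per individual: sum over SNPs of weight * dosage.

    weights: dict SNP ID -> model weight for one gene (hypothetical values)
    dosages: dict SNP ID -> list of per-individual allele dosages (0 to 2)
    """
    n = len(next(iter(dosages.values())))
    predicted = [0.0] * n
    for snp, w in weights.items():
        if snp in dosages:  # SNPs absent from the test genotypes contribute nothing
            for j, d in enumerate(dosages[snp]):
                predicted[j] += w * d
    return predicted

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Toy data, not values from the study.
weights = {"rs1": 0.5, "rs2": -0.25, "rs3": 0.1}  # rs3 is missing from the test set
dosages = {"rs1": [0, 1, 2, 1], "rs2": [2, 0, 1, 1]}
observed = [-0.4, 0.6, 0.7, 0.3]

predicted = predict_expression(weights, dosages)
r = pearson_r(predicted, observed)
```

Averaging `r` over all genes with a model would give a summary comparable in spirit to the mean R values reported in Table 4.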
Discussion
We used three MESA populations (AFA, HIS, and CAU) to better understand the genetic architecture of gene expression in diverse populations. We optimized predictors of gene expression using elastic net regularization and found that sparse models outperform polygenic models. The genetic correlation of gene expression is highest when continental ancestry overlaps between populations. We identified genes that are better predicted in the AFA and/or HIS models that are either absent or poorly predicted in the CAU model. We tested our predictors developed in MESA in independent cohorts and found that the best prediction of gene expression occurred when the training set included individuals with similar ancestry to the test set.
As seen in other studies [18, 21, 32], we show sparse models outperform polygenic models for gene expression prediction across diverse populations. Thus, the genetic architecture of gene expression for well predicted genes has a substantial sparse component. Larger sample sizes may reveal an additional polygenic component that may improve prediction for some genes.
We estimated the genetic correlation between each population pair for each gene. Populations with more shared ancestry as defined by clustering of genotypic principal components showed higher mean correlation across genes (Fig. 1, Table 2). As estimated heritability of genes increases, the mean genetic correlation between populations also increases (Fig. 3B), which indicates the genetic architecture underlying gene expression is similar for the most heritable genes. However, even though prediction across populations is possible for some of the most heritable genes, we define a class of genes where predictive performance drops substantially between populations.
There were several genes with high predictive performance (R2 > 0.2) in AFA or HIS that were poorly predicted or not predicted at all in the CAU population (Fig. 5, S1 Table, S2 Table). Of the 372 genes that were better predicted in HIS, 153 overlapped with the 1094 genes found for AFA (S3 Table). Almost all of these well predicted genes in AFA and HIS populations also had biological implications in at least one study in the GWAS Catalog (S4 Table). Examples of such genes include COMMD1 (ENSG00000147905.13), which has been associated with blood cell volume and elevated iron levels, and ZCCHC7 (ENSG00000173163.6), which has been linked to HIV susceptibility [33–35].
We tested our predictive gene expression models built in the MESA cohorts in two HapMap/1000 Genomes data sets (MXL and YRI) [14, 36] using the MESA population predictors we generated. As expected, the YRI gene expression prediction was best when using the AFA, AFHI, or ALL training sets, which each include individuals with African-ancestry admixture (Table 4, Fig. 6). The best gene expression prediction for MXL was with the ALL training set, which indicates that admixed populations like MXL benefit from a pooled training set containing individuals of diverse ancestries. Thus, increasing the sample sizes of non-European populations in genomic studies will not only benefit the source population, but will also increase predictive power in admixed populations.
Predictive models of gene expression developed in this study are made publicly available at http://predictdb.hakyimlab.org/ for use in future studies of complex trait genetics across diverse populations. Inclusion of diverse populations in complex trait genetics is crucial for equitable implementation of precision medicine.
Materials and methods
The Loyola University Chicago Institutional Review Board (IRB) reviewed our application for confirmation of exemption (IRB project number 2014). The IRB determined that this human subject research project is exempt from the IRB oversight requirements according to 45 CFR 46.101.
Genomic and transcriptomic data
The Multi-Ethnic Study of Atherosclerosis (MESA)
MESA includes 6814 individuals consisting of 53% females and 47% males between the ages of 45–84 [24]. The individuals were recruited from 6 sites across the US (Baltimore, MD; Chicago, IL; Forsyth County, NC; Los Angeles County, CA; northern Manhattan, NY; St. Paul, MN). MESA cohort population demographics were 39% Caucasian (CAU), 22% Hispanic (HIS), 28% African American (AFA), and 12% Chinese (CHN). Of those individuals, RNA was collected from CD14+ monocytes from 1264 individuals across three populations (AFA, HIS, CAU) and quantified on the Illumina Ref-8 BeadChip [25]. Individuals with both genotype (dbGaP: phs000209.v13.p3) and expression data (GEO: GSE56045) included 234 AFA, 386 HIS, and 582 CAU. Illumina IDs were matched to Ensembl IDs using the RefSeq IDs from MESA and gencode.v18 (gtf and metadata files). If there were multiple Illumina IDs corresponding to an Ensembl ID, the average of those values was used as the expression level.
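The probe-averaging step can be illustrated with a short sketch. The probe IDs, gene IDs, and data layout here are hypothetical, and skipping probes without a gene mapping is an assumption rather than a documented detail of the pipeline.

```python
from collections import defaultdict
from statistics import fmean

def average_probes_by_gene(probe_to_gene, probe_expression):
    """Average expression across probes that map to the same gene ID.

    probe_to_gene: dict probe ID -> gene ID (hypothetical mapping)
    probe_expression: dict probe ID -> list of per-sample expression values
    """
    grouped = defaultdict(list)
    for probe, values in probe_expression.items():
        gene = probe_to_gene.get(probe)
        if gene is not None:  # probes with no gene mapping are skipped
            grouped[gene].append(values)
    # zip(*rows) iterates over samples; fmean averages the probes for each sample
    return {gene: [fmean(col) for col in zip(*rows)] for gene, rows in grouped.items()}

# Toy data: two probes map to ENSG_X, one to ENSG_Y.
probe_to_gene = {"ILMN_A": "ENSG_X", "ILMN_B": "ENSG_X", "ILMN_C": "ENSG_Y"}
probe_expression = {
    "ILMN_A": [1.0, 3.0],
    "ILMN_B": [3.0, 5.0],
    "ILMN_C": [2.0, 2.0],
}
gene_expression = average_probes_by_gene(probe_to_gene, probe_expression)
```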
HapMap and 1000 Genomes data
We obtained genotype data from the 1000 Genomes Project [36] for populations of interest where lymphoblastoid cell line (LCL) gene expression data were also available [14]. Transcriptome data from Stranger et al. [14] included 45 Mexican ancestry individuals in Los Angeles, CA, USA (MXL) and 107 Yoruba individuals in Ibadan, Nigeria (YRI).
Quality control of genomic and transcriptomic data
MESA populations were previously imputed with IMPUTE 2.2.2 using the 1000 Genomes Phase I variant set and NCBI build 37/hg19 for a final SNP count of at least 39 million variants [24, 37, 38]. Quality control and cleaning of the genotype data was done using PLINK (https://www.cog-genomics.org/plink2). SNPs were filtered by call rates less than 99%. Prior to IBD and principal component analysis (PCA), SNPs were LD pruned by removing 1 SNP in a 50 SNP window if r2 > 0.3. One of each pair of related individuals (IBD > 0.05) was removed. Pruned genotypes were merged with HapMap populations and EIGENSTRAT [39] was used to perform PCA (Fig. 1). Final sample sizes for each population post quality control are AFA = 233, HIS = 352, and CAU = 578. We used 5–7 million non-LD pruned SNPs per population post quality control. PEER factor analysis was performed on the expression data using the peer R package in order to correct for potential batch effects and experimental confounders [40]. A range of PEER factors (0, 10, 20, 30, 50, and 100) were calculated after 10 genotypic PC adjustment in each population to determine how many were required to maximize eQTL discovery. HapMap genotypes in individuals not sequenced through the 1000 Genomes Project were imputed using the Michigan Imputation Server for a total of 6–13 million SNPs per population, after undergoing PLINK quality control [41]. These imputed samples were then merged back with the individuals that were previously sequenced, filtering the SNPs (imputation R2 > 0.8, MAF > 0.01, HWE p > 1e-06). HapMap expression data sets were adjusted by ten PEER factors.
eQTL analysis
We used Matrix eQTL [42] to perform a genome-wide cis-eQTL analysis in each population separately (AFA, HIS, CAU), in the AFA and HIS combined (AFHI), and in all three populations combined (ALL). We used SNPs with MAF >0.05 and defined cis-acting as SNPs within 1 Mb of the transcription start site (TSS). The linear regression models included 10 genotype principal component covariates and a range of PEER factors (0, 10, 20, 30, 50, or 100) [26]. The false discovery rate (FDR) for each SNP was calculated using the Benjamini-Hochberg procedure. Similar to the approach recently taken by the GTEx Project Consortium to compare tissues, we estimate the pairwise population eQTL replication rates with π1 statistics (π1 = 1 − π0; π0 is the proportion of false positives) using the qvalue method [17, 27].
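For readers unfamiliar with the Benjamini-Hochberg procedure, a minimal sketch of the adjustment is given below. It is an illustration of the general procedure on toy p-values, not the Matrix eQTL implementation, and the function name is our own.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p-value
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Toy p-values for four SNP-gene tests.
pvals = [0.01, 0.04, 0.03, 0.005]
fdr = benjamini_hochberg(pvals)
significant = [q < 0.05 for q in fdr]
```

In this toy example all four tests pass the FDR < 0.05 threshold used throughout the paper.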
Genetic correlation analysis
eQTL effect size comparisons between populations were performed using Genome-wide Complex Trait Analysis (GCTA) software [28]. We performed a bivariate restricted maximum likelihood (REML) analysis to estimate the genetic correlation (rG) between each pair of MESA cohorts for each gene [43]. We also used GCTA to estimate the proportion of variance explained by all cis-region SNPs (local h2) for each gene in each population using restricted maximum likelihood (REML).
Prediction model optimization
We used the glmnet R package [30] to fit an elastic net model to predict gene expression from cis-region SNP genotypes. The elastic net regularization penalty is controlled by the mixing parameter alpha, which can vary between ridge regression (α = 0) and LASSO (α = 1, default). We quantified the predictive performance of each model via 10-fold cross-validated Pearson R2 (predicted vs. observed gene expression). A gene with the optimal predictive performance when α = 0 has a polygenic architecture, whereas a gene with optimal performance when α = 1 has a sparse genetic architecture. In the MESA cohort we tested three values of the mixing parameter (0.05, 0.5, and 1) for optimal prediction of gene expression of 10,143 genes for each population alone, AFA and HIS combined, and all three populations combined. We used the PredictDB pipeline developed by the Im lab to preprocess, train, and compile elastic net results into database files to use as weights for gene expression prediction. See https://github.com/hakyimlab/PredictDBPipeline and https://github.com/lmogil/run_PredictDB_with_pops.
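For reference, the elastic net objective minimized by glmnet for a continuous outcome can be written as below. The notation is our own: y_i is the expression of a gene in individual i, x_i is the vector of cis-region SNP dosages, N is the number of individuals, and λ is the overall penalty strength, which we assume is tuned by cross-validation.

```latex
\min_{\beta_0,\,\beta}\;
\frac{1}{2N}\sum_{i=1}^{N}\bigl(y_i-\beta_0-x_i^{\top}\beta\bigr)^{2}
+\lambda\left[\frac{1-\alpha}{2}\,\lVert\beta\rVert_2^{2}+\alpha\,\lVert\beta\rVert_1\right]
```

With α = 1 only the L1 term remains (LASSO), which sets many SNP weights exactly to zero and yields a sparse model. With α = 0 only the squared L2 term remains (ridge regression), which shrinks weights but keeps all SNPs in the model.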
We also used the software GEMMA [44] to implement Bayesian Sparse Linear Mixed Modeling (BSLMM) [31] for each gene with 100K sampling steps per gene. BSLMM estimates the PVE (the proportion of variance in phenotype explained by the additive genetic model, analogous to the heritability estimated in GCTA) and PGE (the proportion of genetic variance explained by the sparse effects terms where 0 means that genetic effect is purely polygenic and 1 means that the effect is purely sparse). From the second half of the sampling iterations for each gene, we report the median and the 95% credible sets of the PVE and PGE.
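The posterior summary described above can be sketched as follows. This is an illustration only: it assumes an equal-tailed interval with nearest-rank quantiles, which may differ from how the credible sets were computed in the study, and the sample values are synthetic.

```python
from statistics import median

def summarize_posterior(samples, burn_in_fraction=0.5, lower=0.025, upper=0.975):
    """Median and equal-tailed credible interval after discarding burn-in samples.

    samples: sampled values of one parameter (e.g. PVE or PGE), in sampling order.
    """
    start = int(len(samples) * burn_in_fraction)
    kept = sorted(samples[start:])  # keep only the second part of the chain
    last = len(kept) - 1
    # Nearest-rank quantiles; other interpolation rules shift the bounds slightly.
    lo = kept[round(lower * last)]
    hi = kept[round(upper * last)]
    return median(kept), (lo, hi)

# Synthetic chain: 201 burn-in draws followed by 201 evenly spaced values in [0, 1].
samples = [0.9] * 201 + [i / 200 for i in reversed(range(201))]
post_median, (ci_low, ci_high) = summarize_posterior(samples)
```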
Acknowledgments
This work is supported by the NIH National Human Genome Research Institute Academic Research Enhancement Award R15 HG009569 (HEW), start-up funds from Loyola University Chicago (HEW), the Loyola Carbon Undergraduate Research Fellowship (AA), the Loyola Biology Summer Research Fellowship (AB), and the Loyola Mulcahy Scholars Program (AB). MESA and the MESA SHARe project are conducted and supported by the National Heart, Lung, and Blood Institute (NHLBI) in collaboration with MESA investigators. Support for MESA is provided by contracts HHSN268201500003I, N01-HC-95159, N01-HC-95160, N01-HC-95161, N01-HC-95162, N01-HC-95163, N01-HC-95164, N01-HC-95165, N01-HC-95166, N01-HC-95167, N01-HC-95168, N01-HC-95169, UL1-TR-000040, UL1-TR-001079, UL1-TR-001420, UL1-TR-001881, and DK063491. Funding for SHARe genotyping was provided by NHLBI Contract N02-HL-64278. Genotyping was performed at Affymetrix (Santa Clara, California, USA) and the Broad Institute of Harvard and MIT (Boston, Massachusetts, USA) using the Affymetrix Genome-Wide Human SNP Array 6.0. The MESA Epigenomics & Transcriptomics Study was funded by NIA grant 1R01HL101250–01 to Wake Forest University Health Sciences (YL). DGN gene expression prediction models were obtained from PredictDB at http://predictdb.hakyimlab.org/.
Footnotes
* hwheeler1{at}luc.edu