ABSTRACT
Background African Americans (AAs) are an admixed population with portions of their genome derived from West Africans and Europeans. In AAs, the proportion of West African ancestry (WAA) can vary widely and may explain the genetic drivers of disease, specifically those that disproportionately affect this understudied population. To examine the relationship between the proportion of WAA and gene expression, we used high dimensional data obtained from AA primary hepatocytes, a tissue important in disease and drug response.
Methods RNA sequencing (Illumina HiSeq Platform) was conducted on 60 AA-derived primary hepatocytes, with methylation profiling (Illumina MethylationEPIC BeadChip) of 44 overlapping samples. WAA for each sample was calculated using fastSTRUCTURE and correlated to both gene expression and DNA methylation. The GTEx consortium (n = 15) was used for replication and a second cohort (n = 206) was using used for validation using differential gene expression between AAs and European-Americans.
Results We identified 131 genes associated with WAA (FDR< 0.1), of which 28 gene expression traits were replicated (FDR<0.1) and enriched in angiogenesis and inflammatory pathways (FDR<0.1). These 28 replicated gene expression traits represented 257 GWAS catalog phenotypes. Among the PharmGKB pharmacogenes, VDR, PTGIS, ALDH1A1, CYP2C19 and P2RY1 were associated with WAA (p < 0.05) with replication of CYP2C19 and VDR in GTEx. Association of DNA methylation with WAA identified 1037 differentially methylated regions (FDR<0.05), with hypomethylated genes enriched in drug response pathways. Overlapping of differentially methylated regions with the 131 significantly correlated gene expression traits identified 5 genes with concordant directions of effect: COL26A1, HIC1, MKNK2, RNF135, SNAI1 and TRIM39.
Conclusions We conclude that WAA contributes to variability in hepatic gene expression and DNA methylation with identified genes indicative of diseases disproportionately affecting AAs. Specifically, WAA-associated genes were linked to previously identified loci in cardiovascular disease (PTGIS, PLAT), renal disease (APOL1) and drug response (CYP2C19).
Background
African Americans (AAs) are an admixed population, having varying proportions of African and European ancestry between individuals [1]. As a consequence of their West African ancestry (WAA), AAs have more genetic variation and shorter extent of linkage disequilibrium (LD) than European Ancestry populations, with the proportion of WAA varying greatly between self-identified AAs [2]. It is this genetically driven variability that may aid in explaining differences in hepatic gene expression and DNA methylation patterns, which cannot be elucidated in homogenous populations such as those of European-only ancestry. For example, WAA has been shown to predict a stronger inflammatory response to pathogens compared to those of European ancestry due to recent selective pressures specific to this population [3].
Furthermore, AAs suffer disproportionately from many chronic diseases and adverse drug reactions, as compared to other populations [4, 5] as well as are protected from some conditions. AAs have a higher risk of cardiovascular events and negative outcomes therapy [6]. They have higher incidences of death and disability from cardiovascular diseases (CVDs), thrombosis, renal dysfunction and pathologies, diabetes, cancers, and other metabolic disorders [7-15]. Conversely, they have lower prevalence of disease such as testicular cancer [16]. Differences in gene expression may help explain these observed differences.
Due to the key role of the liver in biosynthesis, drug metabolism and complex human diseases, genetic and epigenetic differences in the liver may uncover the underlying causal genes responsible for chronic diseases that disproportionately affect AAs [17, 18]. The first comprehensive mapping of liver expression quantitative trait loci (eQTLs) proposed several candidate susceptibility genes associated with type I diabetes, coronary artery disease and plasma cholesterol levels in a white cohort [19]. More recently, finer-resolution mapping of liver eQTLs, combining both gene expression data with histone modification-based annotation of putative regulatory elements, identified 77 loci found to associate with at least one complex phenotype [20]. In addition, our group has previously shown that studies specifically investigating the unique genetic variants of AAs can reveal population-specific risk factors that may explain differences in drug response, such as African ancestry-specific genetic risk factors associated to a higher risk of bleeding from warfarin therapy in AAs [12] as well as population specific variants associated with increased risk of thrombotic disease [21].
Rather than identifying disease susceptibilities through genome-wide association studies (GWAS), here we use variability in genetic ancestry to uncover potential drivers of disease and drug response that may potentially explain differences in disease and drug response in AAs. In this study we correlated WAA to both gene expression and differentially methylated regions to determine the contribution of WAA admixture in hepatic gene expression traits. We identify genes are known to be dysregulated in diseases, which affect AAs disproportionately and may also be responsible for adverse drug responses in AA populations.
Materials and methods
Cohorts
A total of 68 African American (AA) primary hepatocyte cultures were used for this study. Cells were either purchased from commercial companies (BioIVT, TRL, Life technologies, Corning and Xenotech), or isolated from cadaveric livers using a modified two-step collagenase perfusion procedure [77]. Liver specimens were obtained through collaborations with Gift of Hope, which supplies non-transplantable organs to researchers. In addition, we used GWAS data for 153 subjects from Genotype-Tissue Expression Project (GTEx) release version 7, of which 15 samples were used as a replication set.
Primary Hepatocyte Isolation
Briefly, cadaveric livers obtained from Gift of Hope were transferred to the perfusion vessel Büchner funnel (Carl Roth) and the edge was carefully examined to locate the various vein and artery entries that were used for perfusion. Curved irrigation cannulae with olive tips (Kent Scientific) were inserted into the larger blood vessels on the cut surface of the liver piece. The liver was washed by perfusion of 1 L Solution 1 (HEPES buffer, Sigma-Aldrich), flow rate 100-400 mL/min, with no recirculation, followed by perfusion with 1 L of Solution 2 (EGTA buffer, Sigma-Aldrich), flow rate 100-400 mL/min, with no recirculation. The tissue was washed to remove the EGTA compound by perfusion of 1 L Solution 1, flow rate 100-400 mL/min, with no recirculation. The liver was digested by perfusion with Solution 3 (collagenase buffer, Sigma-Aldrich), flow rate 100-400mL/min, with recirculation. Following perfusion, liver section was placed in a crystallizing dish (Omnilab) containing 100-200 ml of Solution 4 (Bovine Serum Albumin, Sigma-Aldrich). The Glisson’s capsule was carefully removed and the tissue was gently shaken to release hepatocytes. The cell suspension was then filtered by a 70 μm nylon mesh (Fisher Scientific), and centrifuged at 72 x g for 5 min at 4 °C. The pellets contained hepatocytes that was washed twice with solution 4 and resuspended in plating medium (Fisher Scientific).
For primary hepatocyte cultures, cell viability was determined by trypan blue (Lonza) exclusion using a hemocytometer (Fisher Scientific) [78]. If viability was low, Percoll gradient (Sigma-Aldrich) centrifugation of cell suspensions was carried out to improve yield of viable cells. Cell were plated at a density of 0.6 × 106 cells/well in 500 μL InVitroGro-CP media (BioIVT, Baltimore, MD) in collagen-coated plates with matrigel (Corning, Bedford, MA) overlay and incubated overnight at 37° C. Cultures were maintained in InVitroGro-HI media (BioIVT) supplemented with Torpedo antibiotic mix (BioIVT) per the manufacturer’s instructions. The media was replaced every 24 hours for three days. RNA was extracted after three days using the RNAeasy Plus mini-kit (Qiagen) per manufacturer’s instructions.
Genotyping and quality control
DNA was extracted from each hepatocyte line using Gentra Puregene Blood kit (Qiagen) as per manufacturer’s instructions from 1-2 million cells. All DNA samples were then bar-coded for genotyping. SNP genotyping was conducted using the Illumina Multi-Ethnic Genotyping array (MEGA) at the University of Chicago’s Functional Genomics Core using standard protocols.
Quality control (QC) steps were applied as previously described including an w imputation info metric threshold of > 0.4 [21]. Briefly, a sex check was performed on genotypes using PLINK (version 1.9) to identify individuals with discordant sex information. Duplicated or related individuals were identified using identity-by-descent (IBD) method with a cutoff score of 0.125 indicating third-degree relatedness. A total of five individuals were excluded after genotyping QC analysis. SNPs located on the X and Y chromosomes and mitochondrial SNPs were excluded. SNPs with a missing rate of > 5% or those that failed Hardy-Weinberg equilibrium (HWE) tests (p < .00001) were also excluded.
African ancestry measurement
The genotypes of 68 primary hepatocytes and 153 GTEx subjects were merged with HapMap phase 3 reference data from four global populations; Yoruba in Ibadan, Nigeria (YRI); Utah residents with Northern and Western European ancestry (CEU); Han Chinese in Beijing, China (CHB); and Japanese in Tokyo, Japan (JPT) [79]. Population structure of the merged data was inferred by the Bayesian clustering algorithm STRUCTURE deployed within fastStructure v1.0 and performed without any prior population assignment. We employed the admixture model, and the burn in period and number of Markov Chain Monte Carlo repetition was set to 20,000 and 100,000, respectively [80]. The number of parental populations (K) was set to 3, purporting three main continental groups (African, European, or Asian). WAA percentages of the primary hepatocytes and GTEx subjects were calculated as the probability of being grouped as Yoruba African, Caucasian, and East Asian, respectively [80]. All individuals in our 60 AA cohort had WAA greater than 40%.
GTEx replication liver cohort
Of the 144 GTEx liver IDs extracted from “GTEX_Sample_Attributes” file, five were replicates and were removed from the analysis. Among the remaining 139 unique liver IDs, 118 had available genotype information for ancestry determination. Individuals with WAA greater than 40% were included in the GTEx AA Liver cohort. After WAA estimation, normalized gene expression reads for 15 subjects meeting the ancestry inclusion criteria were extracted from GTEx expression file (GTEx_Liver.v7.normalized_expression.bed). Age and sex information of these subjects were extracted from the subject phenotype file on the GTEx Portal site (GTEx_v7_Annotations_SubjectPhenotypesDS.txt).
Validation analysis in an independent liver transcriptome dataset
Gene expression profiling in liver had previously been conducted in 206 samples using Agilent-014850 4×44 k arrays (GPL4133) [23, 81]. These samples had come from donor livers not intended for organ transplantation. Genotyping on these samples had been done using the Illumina Human 610 quad beadchip platform (GPL8887) at the Northwestern University Center for Genetic Medicine Genomics Core Facility and imputation was subsequently performed using bimbam [82]. Principal component analysis was used to quantify ancestry using the Human Genome Diversity Panel with African and European populations as reference, as previously described [23].
We conducted differential expression analysis between the European samples and the African American samples using Linear Models for Microarray Data (limma) as implemented in the Bioconductor package [83]. This Bayesian methodology uses a “moderated” t-statistic from the posterior variance assuming an inverse chi-squared prior for the unknown variance for a gene. We used Bonferroni-adjusted p<0.05 based on the total number of genes that were tested for replication.
RNA-sequencing and quality control
Total RNA was extracted from each primary hepatocyte culture after three days in culture using the Qiagen Rneasy Plus mini-kit per manufacturer’s instructions. RNA-QC was performed using an Agilent Bio-analyzer and samples with RNA integrity number (RIN) scores above 8 were used in subsequent sequencing. RNA-Seq libraries were prepared for sequencing using Illumina mRNA TruSeq RNA Sample Prep Kit, Set A (Illumina catalog # FC-122-1001) according to manufacturer’s instructions. The cDNA libraries were prepared and sequenced using both Illumina HiSeq 2500 and HiSeq 4000 machines by the University of Chicago’s Functional Genomics Core to produce single-end 50 bp reads with approximately 50 million reads per sample (accession number GSE124076). Batch effects were corrected for in quality control below.
Quality of the raw reads was assessed by FastQC (version 0.11.2). Fastq files with a per base sequence quality score > 20 across all bases were included in downstream analysis. Reads were aligned to human Genome sequence GRCh38 and comprehensive gene annotation (GENCODE version 25) was performed using STAR 2.5. Only uniquely mapped reads were retained and indexed by SAMTools (version 1.2). Nucleotide composition bias, GC content distribution and coverage skewness of the mapped reads were further assessed using read_NVC.py, read_GC.py and geneBody_coverage.py scripts, respectively, from RseQC (version 2.6.4). Samples without nucleotide composition bias or coverage skewness and with normally distributed GC content were reserved. Lastly, Picard CollectRnaSeqMetrics (version 2.1.1) was applied to evaluate the distribution of bases within transcripts. Fractions of nucleotides within specific genomic regions were measured for QC and samples with > 80% of bases aligned to exons and UTRs regions were considered for subsequent analysis.
RNA-seq data analysis
Post alignment and QC, reads were mapped to genes referenced with comprehensive gene annotation (GENCODE version 25) by HTSeq (version 0.6.1p1) with union mode and minaqual = 20 [84]. HTSeq raw counts were supplied for gene expression analysis using Bioconductor package DESeq2 (version1.20.0) [85]. Counts were normalized using regularized log transformation and principal component analysis (PCA) was performed in DESeq2. PC1 and PC2 were plotted to visualize samples expression pattern. Three samples with distinct expression patterns were excluded as outliers resulting in 60 samples used in RNA-seq analysis. We calculated TPM (transcript per million) by first normalizing the counts by gene length and then normalizing by read depth [86]. Gene expression values were filtered based on the expression thresholds > 0.1 TPM in at least 20% of samples and ≥ 6 reads in at least 20% of samples as done in the GTEx consortium (gtexportal.org, Analysis Methods, V7, updated 09/05/2017).
WAA percentage, gender, age, platform and batch were used as covariates for downstream analysis. Probabilistic estimation of expression residuals (PEER) method v1.3 was used to identify PEER factors and linear regression was run on inverse normal transformed expression data using five PEER factors, based on GTEx’s determination of number of factors as a function of sample size [88, 89]. WAA-associated genes were identified from a genome-wide list of 18,854 genes for our hepatocyte-derived AA cohort samples and replicated in 21,730 genes from the GTEx-derived AA cohort at FDR cutoff of 0.10. The top 131 genes with an FDR < 0.10 were also replicated in the independent Replication cohort obtained from Innocenti et al. (GEO accession number GSE26106) [23]. The 28 replicated genes were used to identify overlap with the NHGRI-EBI GWAS catalog reported genes to identify GWAS traits link to our replicated genes, v1.0.2 [24]. FDR calculations from linear regressions on gene expression were conducted with the “p.adjust” function in R and the default method of “fdr” was used.
In addition, we also conducted a subset analysis with 64 PharmGKB “very important pharmacogenes”. These genes have extensive literature support for association to drug responses. We analyzed this subset for association with WAA using the same linear regression method at a nominal p-value cutoff of 0.05 due to the smaller number of genes being tested, 64, our smaller sample sizes of 60 in our AA cohort and 15 in the GTEx replication cohort.
Methylation sample preparation and data analysis
DNA was isolated from hepatocytes or liver tissue. Liver tissue was homogenized in a bead mill (Fisher Scientific) using 2.8 mm ceramic beads. Then, DNA from hepatocytes or liver was extracted with the Gentra Puregene Blood kit (Qiagen) as per manufacturer instructions. One microgram of DNA was bisulfite converted at the University of Chicago Functional Genomics Core using standard protocols. CT-conversion was performed using Zymo-Research EZ DNA Methylation kits and further processed for array hybridization using Illumina provided array reagents. Following hybridization the arrays were stained pre manufacturer’s protocol and analysed using an Illumina HiSCAN. Of the 60 available hepatocyte cultures only 56 produced sufficient bisulfite converted DNA for analysis.
Illumina MethylationEPIC BeadChip microarray (San Diego, California, USA), consisting of approximately 850,000 probes, predefined and annotated [90], and containing 90% of CpGs on the HumanMethylation450 chip and with more than 350,000 CpGs regions identified as potential enhancers by FANTOM5 [91] and the ENCODE project [92], was used for methylation profiling of DNA extracted from 56 AA hepatocytes, which overlapped the samples used for gene expression analysis (accession number GSE123995). Raw probe data was analyzed using the ChAMP R package for loading and base workflow [26], which included the following R packages: BMIQ for type-2 probe correction method [93]; ComBat for correction of multiple batch effects including Sentrix ID, gender, age, slide and array [94, 95]; svd for singular value decomposition analysis after correction [96]; limma to calculate differential methylation using linear regression on each CpG site with WAA as a numeric, continuous variable [97]; DMRcate for identification of DMRs, using default parameters, and the corresponding number of CpGs, minimum FDR (minFDR); Stouffer scores, and maximum and mean Beta fold change values [98]; minfi for loading and normalization [99]; missMethyl for gometh function for GSEA analysis [100]; and FEM for detecting differentially methylated gene modules [28].
Methylation data quality control in ChAMP’s champ.load() and champ.filter() function removed the following probes: 9204 probes for any sample which did not have a detection p-value < 0.01, and thus considered as a failed probe, 1043 probes with a bead count < 3 in at least 5% of samples, 2975 probes with no CG start sites, 78,753 probes containing SNPs [101], 49 probes that align to multiple locations as identified in by Nordlund et al. [102], and 17,235 probes located on X and Y chromosomes. Threshold for significantly differentially methylated probes was set at an adjusted p-value of 0.05 using Benjamini-Hochberg correction for multiple testing in the limma package. Significance for DMRs was set at adjusted FDR < 0.05 in the DMRcate package using default parameters. Analysis was performed with R statistical software (version 3.4.3 and version 3.5 for ChAMP (version 2.10.1)). Three outliers and nine samples from young subjects, less than five years old were excluded due to known differences in methylation profiles associated with age [103] leaving 44 samples in the analysis.
Gene ontology analysis was performed using g:Profiler (biit.cs.ut.ee/gprofiler) using g:GOSt to provide statistical enrichment analysis of our significantly HypoM and HyperM genes, and significantly expressed genes associated with African ancestry [104]. We filtered for significance at a Benjamini-Hochberg adjusted p-value < 0.05, we set hierarchical sorting by dependencies between terms on the strongest setting, “best per parent group”, and only annotated genes were used as the statistical domain size parameter to determine hypergeometrical probability of enrichment. We considered ontology terms to be statistically significance at a BH-adjusted p-value < 0.01.
Correlation of log fold change in DNA methylation of DM CpG sites by genomic feature was performed on significantly DM probes identified by limma and plotted using ggplot2 v3.1.0. Genomic feature and transcriptionally-regulated regions were subset for terms “island”, “opensea”, “shelf”, “shore” and “5’UTR”, “TSS1500”, “Body”, “3’UTR”, respectively. A chi square test was used for all categorical comparisons. Correlation with methylation and gene expression was performed using the ggscatter function in ggpubr v0.2, with the correlation method set to Pearson, confidence interval set at 95% and regression calculations included for the following subsets: HypoM, HyperM and cumulative CpG sites. Gene name conversions and annotated gene attributes were determined by BioMart v2.38.0. The Circos plot was created using Circos tool v0.69-6.
Results
Cohorts
60 primary hepatocyte cultures, procured from self-identified AAs, passed all quality control steps and were used for RNA-sequencing (RNA-seq) analysis. 56 hepatocyte cultures were assayed for DNA methylation, with 44 used in the final methylation analysis. We obtained genome-wide genotyping data of 153 subjects from GTEx liver cohort, version 7 [22]. All 60 of our AA hepatocyte cohort were confirmed as having WAA ranging from 41.8% to 93.7%, and 15 subjects in the GTEx liver cohort met the WAA inclusion criteria with WAA ranging from 75.6% to 99.9% (Supplementary Figure 1). Table 1 shows the demographics of each cohort. In general, the GTEx and the replication cohorts were older and had a lower percentage of females than the primary hepatocyte cohort.
Hepatic genes and pharmacogenes associated with African ancestry
Analysis of RNA-seq gene expression traits in the AA hepatocyte cohort with percentage WAA identified 131 genes, in which gene expression traits are significantly associated with WAA (Supplementary Table 1; Figure 1A, FDR < 0.10). We were able to replicate 28 of these genes in an independent dataset as differentially expressed (DE) between AAs and European-Americans (EAs) in liver [23] (Table 2; Figure 1D, p < 0.05). These 28 replicated gene expression traits are reported as trait-associated genes in 257 previous GWAS studies within the NCBI GWAS catalog [24]. These represent 119 unique traits (Supplementary Table 2) [24]. The phenotypic disease and biological GWAS traits associated with WAA include blood and blood pressure measures, coronary heart and artery disease, diabetic blood measures, chronic inflammatory disease, chronic kidney disease and various cancers.
In the GTEx cohort of 15 AAs, we were able to replicate 8 of the 131 significant WAA-associated genes: COL26A1 (effect size = −7.13, p = 0.037), DHODH (effect size = 3.19, p = 0.048), GPI (effect size = −3.46, p = 0.048), HSD17B7P2 (effect size = −9.22, p = 0.027), PLCL2 (effect size = −5.76, p = 0.046), SLC2A3 (effect size = −5.95, p = 0.032), TRIM39 (effect size = 5.53, p = 0.030) and VEGFA (effect size = −4.74, p = 0.032) in the GTEx liver cohort (Supplementary Table 2).
Due to the importance of the liver in pharmacologic drug response, we also tested the association of gene expression levels with percentage West African ancestry in a subset of genes belonging to the very important pharmacogenes (VIP) in PharmGKB, consisting of 64 genes known to be expressed in hepatocytes. These represent drug-metabolizing enzyme, transporters and drug target genes that are well-established for their role in drug response [25]. Testing the association between WAA and gene expression in the VIP genes identified five genes which were significantly associated to WAA: VDR (effect size = 0.35, p = 0.003), PTGIS (effect size = −0.57, p = 0.002), ALDH1A1 (effect size = 1.03, p = 0.032), CYP2C19 (effect size = −1.36, p = 0.032) and P2RY1 (effect size = 1.61, p = 0.002 (Figure 1B).
DNA methylation patterns are associated with African ancestry in human liver
To identify differentially methylated (DM) regions (DMRs) and CpGs associated with WAA, we conducted linear regression on each CpG site to find African ancestry-related differential methylation [26]. We identified 23,317 significant DM CpG sites, out of a total of approximately 867,531 probes on the Illumina EPIC BeadChip microarray, annotated to 11,151 unique genes (Supplementary Figure 2A; BH-adjusted p < 0.05). These DM CpGs correspond to 1037 DMRs annotated to 1034 unique genes (Supplementary Figure 2B; minimum FDR < 0.0001). Each DM CpG site was categorized into hyper-(HyprM) and hypomethylated (HypoM) sites: 15,404 HyprM CpG sites constituted 435 HyprM DMRs, mapping to 432 unique genes, 7913 HypoM CpG sites constituted HypoM 602 DMRs, mapping to 602 unique genes, and seven annotated genes had both HyprM and HypoM DMRs.
As compared to all CpG sites tested, the gene body (45.0% vs 40.9%, p < 0.0001, chi-square test) and shelf (8.4% vs 6.9%, p = 0.011, chi-square test) had significantly higher proportion of DM CpGs, while the promoter (18.4% vs 20.3%, p < 0.002, chi-square test), intergenic regions (IGR) (24.2% vs 27.7%, p < 0.0001, chi-square test) and shore regions (16.3% vs 18.2%, p < 0.0022, chi-square test) had significantly lower proportion of DM CpGs. In our analysis of DM loci associated with WAA, 75.9% of DM CpG sites within islands were HypoM, while the shore, shelf and open sea were predominantly HyperM (65.5%, 81.7% and 78.9% respectively) (Supplementary Figure 2C). Within transcriptionally-regulated promoter regions, 54.0% of CpG sites were HypoM. Within the promoter, 71.5% of DM CpGs were HypoM 200 kb upstream of the transcription start site (TSS), while 42.9% of DM CpGs were HypoM 1500 kb upstream of the TSS. (Supplementary Figure 2D). DNA methylation around TSS is an established predictor of gene expression, with increased methylation leading to decreased expression [27, 28]. In the gene body, 73.0% of DM sites were HyprM (Supplementary Figure 2D). Intergenic, 5’-UTR and 3’-UTR regions were predominantly HyperM (67.5%, 64.0%, 79.6% respectively).
Next, we characterized the locations of WAA-associated CpGs by genomic features and gene annotations. Within the 7913 HypoM CpG sites associated with WAA, there was a greater proportion of CpGs located in islands. Within the 15,404 HyperM CpGs associated with WAA, there was a greater proportion in the shelf and open sea regions (Supplementary Figure 2E), 5’-UTR, promoter regions and gene body (Supplementary Figure 2F).
Gene expression trait correlation to DNA methylation patterns
To determine the relationship of DMRs associated to WAA to gene expression traits, we next looked at the association of the 1034 unique genes corresponding to the 1037 DMRs with their respective gene expression profiles. Although there was no correlation of WAA-associated HyprM gene regions with gene expression (Figure 1C, Pearson’s r = 0.036, p = 0.55), HypoM gene regions associated with WAA were negatively correlated with gene expression (Figure 1C, Pearson’s r = −0.14, p = 0.009). In general, all genes within DMR associated with WAA also showed negative correlation between gene expression and methylation (Pearson’s r = −0.1, p = 0.013). From the 131 genes with gene expression significantly associated with WAA identified, we identified an overlap of ten DM gene regions: COL26A1 (minFDR = 3.99 × 10−45, mean Beta = −0.0816, consisting of 16 CpGs), HIC1 (minFDR = 5.35 × 10−8, mean Beta = 0.0011, consisting of 35 CpGs), MKNK2 (minFDR = 8.40 × 10−6, mean Beta = 0.0404, consisting of 9 CpGs), RNF135 (minFDR = 3.17 × 10−84, mean Beta = −0.2504, consisting of 18 CpGs) and TRIM39 (minFDR = 4.20 × 10−8, mean Beta = 0.0444, consisting of 14 CpGs), with concordant directions of effect (e.g. increased methylation leading to decreased gene expression). A comprehensive Circos plot summarizes the 23,317 HypoM and HyprM CpG sites, the 1037 DMRs, the 131 genes significantly associated with WAA and the 257 GWAS studies associated with the 28 genes replicated gene expression traits found in the validation dataset of Innocenti et al. (Figure 2, Supplemental Table 2).
Functional representation of WAA associated genes and DM genes associated with African ancestry
To understand the biological relevance of differentially HyprM and HypoM genes associated with African ancestry, we performed Gene Ontology (GO) analysis of the 432 unique genes comprised of the 435 HyprM DMRs and 602 unique genes comprised of the 602 HypoM DMRs using a gene panel of all annotated genes in the GO database. HyprM genes are enriched for “cell development” (BH-adjusted p = 0.0016) and “apoptotic process” (BH-adjusted p = 0.0076) within the category of biological processes (Figure 3A). HypoM genes were enriched for “system development” (BH-adjusted p = 4.0 × 10−6), “response to drug” (BH-adjusted p = 1.7 × 10−4), and “response to hypoxia” (BH-adjusted p = 0.0068) within the category of biological processes (Figure 3B). In addition, HypoM genes are enriched for “sequence-specific DNA binding” (BH-adjusted p = 4.4 × 10−5) and “RNA polymerase II transcription factor activity” (BHadjusted p = 4.4 × 10−5) within the category of molecular function (Figure 3B).
With respect to GO analysis of the 131 WAA-associated gene expression traits (FDR < 0.10), “angiogenesis” (BH-adjusted p = 4.5 × 10−5), “leukocyte activation involved in immune response” (BH-adjusted p = 4.5 × 10−4), “acute-phase response” (BH-adjusted p = 0.0011), “positive regulation of endothelial cell proliferation” (BH-adjusted p = 0.0055), “zymogen activation” (BH-adjusted p = 0.0059) and “cell proliferation” (BH-adjusted p = 0.0085) were biological processes that are enriched in hepatocytes (Figure 3C). Molecular functions, such as “oxidoreductase activity” (BH-adjusted p = 9.6 × 10−4), “signaling receptor binding” (BH-adjusted p = 0.0023), “glucocorticoid receptor binding” (BH-adjusted p = 0.0045) and the KEGG pathway “HIF-1 signaling pathway” (BH-adjusted p = 8.2 × 10−4) were also enriched (Figure 3C).
Discussion
Several studies have shown that the first several principal components of methylation data can capture population structure in cohorts composed of European and African individuals [29]. Recently, it was shown that genetic ancestry can be used as a proxy, not only for uncovering unknown covariates contributing to epistatic and gene-environment interactions from gene expression data, but also from DNA methylation data [30]. Indeed, approximately 75% of variation in methylation was attributable to shared genomic ancestry [31]. Moreover, clinically meaningful measures can be associated to the proportion of WAA as has been shown for lung-function prediction [32].
In our study, we investigated population-specific gene expression and DNA methylation in AAs, an admixed population. We found HypoM hepatic genes, which indicate increased gene expression, are enriched for “system development” and “response to drug”. HypoM, or demethylation, may be a better predictor of gene expression than methylation which, in contrast to demethylation, may or may not affect gene expression, depending on the gene region methylated. Methylation within the TSS of the promoter is well known to repress gene expression while methylation within the gene body results in more variable expression [28, 33].
Among gene expression traits associated with WAA, we identified hepatic genes that are enriched for “angiogenesis” as the primary ontology term, followed by inflammatory response categories including “leukocyte activation in immune response”, “acute-phase response”, “positive regulation of vascular endothelial proliferation”, “zymogen activation” and “cell proliferation”. Angiogenesis and inflammatory response terms may underlie conditions that AAs may be more susceptible to, such as CVD and other chronic inflammatory disease. In particular, APOL1, PTGIS and PLAT expression, which we show to be associated with WAA, have been shown to increase CVD risk and renal disease in AAs [13, 34, 35]. Other genes associated with WAA include ALDH1A1, which is involved in alcohol and aldehyde metabolism disorders and cancer risk [11, 36, 37], IL-33, which is involved in beneficial immune response [38-40], and VEGFA, which has been linked to renal disease and microvascular complications of diabetes [7, 14, 41, 42].
Of the 131 genes associated with WAA, we were able to replicate a quarter within our replication and validation cohorts (Table 2). We also identified an overlap with five significantly DM genes in concordant directions of effect, COL26A1, HIC1, MKNK2, RNF135, SNAI1 and TRIM39. HIC1 is a potential tumor suppressor that has been linked to poorer outcomes in laryngeal cancer in AAs [43, 44]. RNF135 is a ring finger protein that is regulated by several population-specific variants [45]. RNF135, itself, then regulates other genes at distant loci and has been implicated in glioblastomas and autism [45-47]. Of particular interest for African ancestry populations, RNF135 has been found to be under selective pressure specifically in African populations [48, 49].
In PharmGKB VIP genes, we identified five genes associated with WAA: ALDH1A1, CYP2C19, P2RY1, PTGIS and VDR [22, 25]. Of particular importance, we found that for every 1% increment in African ancestry, there was a corresponding 1.36% decrement in CYP2C19 expression and 1.61% increment in P2RY1 expression. CYP2C19 is involved in the metabolism of many commonly prescribed drugs including clopidogrel, an antiplatelet therapy widely used for thrombo-prophylaxis of CVDs and linked to substantial inter-patient differences in drug response. It is also known to inhibit the P2RY family of receptor proteins on the surface of platelets [34, 35, 50, 51]. By inhibiting P2RY12 function, clopidogrel indirectly suppresses platelet clustering and clot formation and prevents clots contributing to heart attack, stroke and deep vein thrombosis [52-54]. P2RY1 works in concert with P2RY12 to promote platelet activation and aggregation [55]. Consequently, P2YR1 variants have been associated with increased platelet response to adenosine 5’-diphosphate (ADP) stimulation [56] and increased expression may be linked to thrombotic disease [57].
Clopidogrel requires CYP2C19-mediated conversion to its active form, but it has been shown that different populations have different levels of CYP2C19 activity [12, 58-61]. The underlying mechanism is multifactorial and genetic polymorphisms are one of the main causes of variable drug response within an individual and across populations [6, 62, 63]. A study conducted across 24 U.S. hospitals showed one-year mortality rates of 7.2% in clopidogreltreated AAs, compared to 3.6% for Caucasians on clopidogrel [6]. This study also found that AAs were at a higher risk of cardiovascular events and mortality from poor antiplatelet response to clopidogrel. Our finding that CYP2C19 expression is reduced with WAA while P2RY1 expression is inversely increased, suggests that clopidogrel resistance and susceptibility of AAs to thrombotic disease may be due to ancestry-associated gene expression.
Another WAA-associated gene was PLAT, for which expression is decreased with increased WAA. The PLAT gene, which is involved in plasminogen activation and encodes tissue plasminogen activator (t-PA), is linked to thrombosis and increased risk of CVD [64, 65]. In AAs, p olymorphisms in PLAT have been linked to CVD and higher levels of t-PA antigen seen in both myocardial infarction and venous thromboembolism [66]. In addition, increased plasma fibrinogen level, which is involved in the fibrinolytic pathway and regulated by t-PA, has been linked to increased venous thrombosis risk in AAs [67, 68].
VDR is also important in health and disease because Vitamin D and its active metabolite 1,25(OH)2D are exogenous hormones created by sun exposure or through diet, with established deficiencies in both Vitamin D and its bioactive metabolites in AAs [69, 70]. Those of African ancestry are known to have lower plasma 1,25(OH)2D levels and our findings support this, suggesting that this may lead to an upregulation of VDR with increased WAA that may compensate for these lower levels. Additionally, VDR single nucleotide polymorphisms (SNPs) have been implicated in CVDs and various cancer susceptibilities in AAs [8, 9, 15, 71, 72]. SNPs within VDR may exacerbate an already deficient vitamin D environment in AAs.
Several limitations exist in this study. First, we only had 60 primary hepatocyte cultures included in this analysis, which limits our power to detect small changes in gene expression associated with ancestry. Second, the GTEx replication liver cohort, with 15 AA livers, also severely lacked power to replicate our findings. Third, while most of the genes found in the PharmGKB VIP subset analysis were not statistically significant in the complete genome-wide set, we found ALDH1A1 and PTGIS to be significantly associated to WAA in both the genome-wide and in the 64 PharmGKB VIP analysis, and we replicated 8 of the 131 significant WAA-associated genes in a GTEx liver cohort of 15 AAs. More importantly, and the reason for the low power of our dataset and any other dataset available for AAs, there are very few genome-wide datasets of both genotype and gene expression in AAs, and those that exist have similarly under-powered cohort sizes.
Another differentiation, though not considered a limitation, was that we used cultured hepatocytes as opposed to frozen liver tissues, as was the case with GTEx. Primary cultures may show differences in gene expression profiles from those seen in the intact organ, which consist of only 60% hepatocytes [73]. However, our study design has the advantage of only assaying the gene expression of a single cell type as opposed to the multiple cell types found in liver. Previous studies have shown that primary human hepatocytes show similar gene expression levels for both Phase I (CYPs) and Phase II (e.g. UGTs) drug-metabolizing enzymes to those obtained from frozen liver tissue [74, 75]. Also, the use of primary hepatocyte cultures reduces the effect of the environmental confounders inherent in liver tissue (i.e. there is less effect of previous drug/disease exposure) through the controlled tissue culture processes following hepatocyte isolation). The artifact of previous disease/drug exposures is present in all transcriptome studies conducted in post-mortem human liver tissue.
Conclusion
Currently, genome-wide genetic, epigenetic, and multi-omics datasets of AAs are lacking in both the scientific literature and in public databases and repositories. Since genomics studies should be inclusive of all populations to comprehensively unravel disease etiology, the dearth of genetic data in minorities and underrepresented populations poses a major scientific and medical dilemma in new drug development, precision medicine, and public health policies. An archetypal example is the rs12777823 SNP in CYP2C9, which was found to associate with decreased warfarin dose requirement, but only in AAs [76]. Findings, such as these, were only made possible due to focused studies in a minority patient population.
In conclusion, our study has important implications in the use of genetic ancestry in understanding phenotypic differences and health disparities in AAs. Our study also has implications in determining inter-individual genetic factors and drug response outcomes in admixed populations for precision medicine. As evidenced by limited genome-wide data in AAs within public databases and biobanks, such as in GTEx, our study further stresses the need for genome-wide studies in minority and under-represented populations.
List of Abbreviations
- AA
- African Americans
- WAA
- West African ancestry
- CVD
- Cardiovascular Disease
- IBD
- Identity-by-Descent
- QC
- Quality control
- GWAS
- Genome-wide Association Studies
- eQTLs
- expression quantitative trait loci
- GTEx
- Genotype-Tissue Expression Project
- PCA
- Principal Component Analysis
- PEER
- Probabilistic estimation of expression residuals
- DE
- Differentially Expressed
- DMR
- Differentially Expressed Region
- HyprM
- – Hypermethylated
- HypoM
- – Hypomethylated
Declarations
Ethical Approval and consent to participate
The Institutional Review Board of Northwestern University has waiver the need for approval as this study used human samples obtained from deceased individuals and it thus not considered Human Subjects research.
Consent for publication
Consent for publication is not applicable.
Availability of data and materials
The dataset supporting the conclusion of this article are include with the article and is additional files. All raw gene expression and methylation array data has been submitted to the Gene Expression Omnibus (GEO accession number GSE124076) and (GEO accession number GSE123995), respectively.
Competing interests
The authors declare no competing interests.
Funding
This work was supported by NIH National Institute on Minority Health and Health Disparities (NIMHD) Research Project 1R01MD009217-01 (R01).
Authors’ contributions
CSP and MAP Conception and Design, Analysis and interpretation of data, Drafting or revising the article; YX Design, Analysis and interpretation of data; TD Design and Acquisition of data; YZ Analysis; EG Analysis; ES, CA Acquisition of data.
Supplementary Files
Supplemental Table 2. Phenotypic traits associated with replicated genes in the GWAS Catalog
Refer to Excel datasheet: Supp_Table_2_gwas_catalog.xls
Acknowledgements
We would like to acknowledge Dr. Heather Wheeler for her invaluable scientific input. ERG wishes to thank the President and Fellows of Clare Hall, University of Cambridge for providing a stimulating intellectual home and for the generous support during the Easter and Lent terms.
References
- 1.↵
- 2.↵
- 3.↵
- 4.↵
- 5.↵
- 6.↵
- 7.↵
- 8.↵
- 9.↵
- 10.
- 11.↵
- 12.↵
- 13.↵
- 14.↵
- 15.↵
- 16.↵
- 17.↵
- 18.↵
- 19.↵
- 20.↵
- 21.↵
- 22.↵
- 23.↵
- 24.↵
- 25.↵
- 26.↵
- 27.↵
- 28.↵
- 29.↵
- 30.↵
- 31.↵
- 32.↵
- 33.↵
- 34.↵
- 35.↵
- 36.↵
- 37.↵
- 38.↵
- 39.
- 40.↵
- 41.↵
- 42.↵
- 43.↵
- 44.↵
- 45.↵
- 46.
- 47.↵
- 48.↵
- 49.↵
- 50.↵
- 51.↵
- 52.↵
- 53.
- 54.↵
- 55.↵
- 56.↵
- 57.↵
- 58.↵
- 59.
- 60.
- 61.↵
- 62.↵
- 63.↵
- 64.↵
- 65.↵
- 66.↵
- 67.↵
- 68.↵
- 69.↵
- 70.↵
- 71.↵
- 72.↵
- 73.↵
- 74.↵
- 75.↵
- 76.↵
- 77.↵
- 78.↵
- 79.↵
- 80.↵
- 81.↵
- 82.↵
- 83.↵
- 84.↵
- 85.↵
- 86.↵
- 87.
- 88.↵
- 89.↵
- 90.↵
- 91.↵
- 92.↵
- 93.↵
- 94.↵
- 95.↵
- 96.↵
- 97.↵
- 98.↵
- 99.↵
- 100.↵
- 101.↵
- 102.↵
- 103.↵
- 104.↵