Abstract
Improved understanding of genetic regulation of the proteome can facilitate the identification of causal mechanisms for complex traits. We analyzed data on 4,657 plasma proteins from 7,213 European American (EA) and 1,871 African American (AA) individuals from the ARIC study, and further replicated findings on 467 AA individuals from the AASK study. We identified 2,004 plasma proteins in EA and 1,618 in AA, with the majority overlapping, which showed significant genetic associations with common variants in cis-regions. Availability of the AA sample led to smaller credible sets and identification of a significant number of population-specific cis-pQTLs. Estimates of cis-heritability for proteins were similar across EA and AA (median cis-h2=0.09 for EA and 0.10 for AA) and tended to be lower than those of gene expression. Elastic-net-based algorithms produced high accuracy for protein prediction in each population, but models developed in AA were more transportable to EA than conversely. An illustrative application of proteome-wide association studies (PWAS) to serum urate and gout implicated several proteins, including IL1RN, revealing the promise of the drug anakinra to treat acute gout flares. Our study demonstrates the value of large studies of diverse ancestry for understanding genetic mechanisms of molecular phenotypes and their relationship with complex traits.
Introduction
Genome-wide association studies (GWAS) to date have cumulatively mapped tens of thousands of loci containing common genetic variants associated with complex traits 1, 2. As the majority of the variants are in non-coding regions 3, 4, researchers have focused on understanding the role of gene-expression regulation as a mechanism for complex trait genetic association 5-9. There is known to be substantial overlap between genetic variations regulating gene expression and those influencing complex traits 7-9, but only a small fraction of GWAS heritability of complex traits can be explained by mediating effects through bulk gene-expression 10, 11. While it is likely that future studies with more extensive cell-type specific gene expression measurements will lead to additional insights, comprehensive understanding of causal mechanisms for complex traits will ultimately require the integration of data from various types of genomic and molecular traits 11. Proteins, the ultimate product of the transcripts, are subject to post-translational modifications and processing, and contain additional information that cannot be detected at the level of the transcriptome.
Recently, major opportunities have arisen to substantially increase our understanding of the causal role of proteins in complex traits due to availability of an accurate high throughput technology for measuring proteins in different types of samples 12, 13. The plasma proteome has received particular attention as it can capture a wide variety of proteins that are active in different biological processes, including but not limited to circulation 14. The proteome is often dysregulated by diseases, and it is highly amenable for drug targeting 15, 16. A number of genetic studies have identified protein quantitative trait loci (pQTL), for plasma 15-20 as well as some other tissues 21-23, and noted that pQTLs are enriched for GWAS associations across an array of complex traits 15-23. Studies have used pQTLs as instruments in conducting Mendelian randomization (MR) analysis to identify causative proteins, and hence potential therapeutic targets, across diverse phenotypes 24-26, including COVID-19 related outcomes 19, 20.
In spite of substantial progress, understanding of the genetic architecture of the proteome and its overlap with those of gene expressions and complex traits remains limited. While the sample size for some studies 16-19 of the plasma proteome has involved thousands of individuals, it is likely that identification of pQTLs remains incomplete, both due to inadequate sample size or/and lack of comprehensive protein measurements. Further, existing proteomic studies have been mostly restricted to samples of European ancestry, and thus cannot inform potential heterogeneity by ancestry. Additionally, advanced tools for incorporating pQTL information for exploring causal effects of proteins, such as those available for analysis of gene-expression 27, 28, are lacking.
In this article, we report results from a comprehensive set of analyses of cis-genetic regulation of the plasma proteome in the large European and African American cohorts of the Atherosclerosis Risk in Communities (ARIC) study 29. We focus on the identification of cis-associations, which, compared to trans-associations, have been shown to be more replicable across different proteomic platforms 30 and are less likely to be affected by horizontal pleiotropy that could pose an additional challenge for downstream Mendelian randomization analyses 31. We carry out a set of association and fine-mapping analyses to identify common (minor allele frequency (MAF) > 1%) cis-pQTLs and compare results across ethnic groups to explore shared and unique genetic architecture. For each ethnic group, we characterize cis-heritability of the proteome due to common variants and build models for genetically predicting levels of plasma proteins. Using these models, we then conduct a proteome-wide association study (PWAS) of serum urate 32, an important biomarker of purine metabolism with high heritability and large available GWAS summary statistics, and of the complex disease gout, which can result from high urate levels 32. We create several data resources for using our results to inform future studies (http://nilanjanchatterjeelab.org/pwas).
Results
Identification of cis-pQTLs Across European and African American Populations
We performed separate cis-pQTL analyses for the African American (AA) and European American (EA) populations in the ARIC study, with total sample sizes of N=1,871 and N=7,213, respectively (see Methods). We performed analyses based on plasma samples collected during the third visit of the cohort 29 (see Supplementary Table 1 for sample characteristics). Relative concentrations of plasma proteins or protein complexes were measured by modified aptamers (‘SOMAmer reagents’, hereafter referred to as SOMAmers) 12, 13.
We defined cis-regions to be +/- 500Kb of the transcription start site (TSS) in the cis-pQTL analysis. After quality control (see Methods), we analyzed 4,657 SOMAmers, which tagged proteins or protein complexes encoded by 4,435 genes, and 204 of them were tagged by more than one SOMAmer. In the cis-regions, we analyzed 10,961,088 common (MAF>1%) single-nucleotide polymorphisms (SNPs) for AA and 6,181,856 for EA with imputed or genotyped data after quality filtering (see Methods). For identification of cis-pQTLs, we performed regression analyses of protein levels after residualizing by a number of potential confounders, including sex, age, 10 genetic principal components (PCs) and the study sites at v3. In addition, similar to eQTL analyses 8, we adjusted for Probabilistic Estimation of Expression Residuals (PEER) factors 33, 34 to account for hidden confounders that may influence clusters of proteins. We observed that the inclusion of PEER factors substantially improved power for cis-pQTL studies due to reduced residual variance (Fig. 1a, Supplementary Table 2). In all subsequent analyses, protein levels measured by SOMAmers were residualized with respect to these sets of PEER factors and then normalized by quantile-quantile transformation (see Methods).
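The normalization step described above can be sketched as follows. This is a minimal illustration, assuming that the "quantile-quantile transformation" is a rank-based inverse normal transformation applied to residualized protein levels; the function name, the rank offset of 0.5 and the toy residuals are illustrative assumptions and are not taken from the study.

```python
from statistics import NormalDist

def inverse_normal_transform(values, offset=0.5):
    """Map values to standard-normal quantiles by rank (ties share the average rank).

    Assumption: `values` are protein levels already residualized on covariates
    and PEER factors; the offset of 0.5 is one common convention, not
    necessarily the one used by the authors.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over a run of tied values.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank for the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    std_normal = NormalDist()
    return [std_normal.inv_cdf((r - offset) / n) for r in ranks]

# Hypothetical residualized levels for five individuals (two are tied).
residuals = [0.3, -1.2, 2.5, 0.3, 0.0]
transformed = inverse_normal_transform(residuals)
```

The transformed values keep the ordering of the residuals while forcing an approximately normal marginal distribution, which limits the influence of outlying protein measurements on the regression.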
In the ARIC study, we identified a total of 2,004 and 1,618 significant SOMAmers, i.e. SOMAmers with a significant (at false discovery rate (FDR)<5%) 9 cis-pQTL near the putative protein’s gene, in the EA and AA populations, respectively, with 1,447 of these overlapping across the populations (Fig. 1b, Supplementary Tables 3.1 and 3.2). Compared to plasma pQTL studies conducted in the past in European ancestry samples 16, 17, we almost tripled the number of significant SOMAmers with known cis-pQTLs 17, 18 (1,465 vs. 508 using the same Bonferroni-corrected genome-wide threshold for significance) (Supplementary Table 3.1) and we successfully replicated 99% (504/508) of previously identified cis-pQTLs (Supplementary Table 4).
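To make the FDR<5% criterion concrete, the sketch below applies the Benjamini-Hochberg step-up procedure to a list of per-SOMAmer p-values. This section does not spell out which FDR procedure was used, so Benjamini-Hochberg is an assumption chosen purely for illustration, and the p-values are invented.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return booleans marking which hypotheses are rejected at FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= (k / m) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below the cutoff (step-up rule).
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected

# Hypothetical top cis p-values, one per SOMAmer.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
significant = benjamini_hochberg(p_values)
```

In this toy input only the two smallest p-values fall under their rank-specific thresholds, so only those two SOMAmers would be declared significant.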
We found that 10% of the sentinel cis-pQTLs identified in EA were non-existent or rare, defined as two or fewer individuals carrying the variant, in the 1000Genome AA sample. In contrast, nearly one third of the variants identified in the AA population were non-existent or rare in the 1000Genome EA population, signifying the value of diverse ancestry data to identify population-specific cis-pQTLs (Supplementary Tables 3.1 and 3.2). For cis-pQTLs which were identified through either of the two populations but were common (MAF>1%) in both, the effect sizes showed a high degree of concordance across the populations, with a correlation coefficient above 0.9 (Supplementary Fig. 1). We further carried out a replication study using data available on an additional 467 individuals from the African American Study of Kidney Disease and Hypertension (AASK) 35, which also ascertained proteins using the SOMAScan platform (see Methods and Supplementary Table 1). Among 1,398 sentinel cis-SNPs which were identified through the ARIC AA sample and which were genotyped or imputed in AASK, we found that 93% showed effects in the same direction and 69% showed statistical significance at FDR<5% in the replication analysis (Supplementary Tables 5.1 and 5.2).
Genotypic effect sizes for cis-pQTLs were inversely associated with minor allele frequencies even after accounting for bias due to power for detection 36 (Fig. 1c). The cis-pQTLs appeared to be more concentrated near the TSS of corresponding pGenes in the AA than the EA population, and the genotypic effect sizes for cis-pQTLs decreased with distance from the TSS in both populations (Fig. 1d). Using stepwise regression 37, 38, we identified multiple conditionally independent cis-SNPs for 1,398 (70%) and 1,021 (63%) of the significant SOMAmers in EA and AA populations, respectively (Fig. 1e, Supplementary Tables 6.1 and 6.2).
Protein altering variants (PAVs) may result in apparent cis-pQTLs owing to altered epitope binding effects 16. Following a procedure recommended earlier 16, we found that in the EA population up to 65% (1,299 out of 2,004) of the sentinel pQTLs could be affected by LD with known PAVs, while in the AA population the corresponding proportion drops to 47% (765 out of 1,618) (see Supplementary Tables 3.1, 3.2 and 7). However, the large overlap observed between eQTLs and pQTLs in colocalization analysis (see below) indicates that they are driven by underlying causal variants and reduces concerns about any large-scale effect of epitope artifacts in the detection of pQTLs.
Cis-eQTL Overlap and Functional Enrichment
To evaluate the extent to which the cis-pQTL variants were also involved in modulating transcriptional levels, we cross-referenced the cis-pQTLs with significant cis-eQTLs (at FDR<5%) from the Genotype-Tissue Expression project (GTEx V8) 9 across 49 different tissues (see Methods). Since the GTEx cohort is primarily of European ancestry (85.3% EA), we restricted the analysis to the top cis-pQTLs identified in the EA cohort only. We found that approximately 73.9% of the sentinel cis-pQTLs, or variants in high LD (r2 > 0.8) with them, were also significant cis-eQTLs for the same gene in at least one tissue (Supplementary Fig. 2a). Further, pairwise colocalization indicated that for 49.4% of the pGenes, cis-pQTLs colocalize with cis-eQTLs in at least one of the GTEx tissues with high posterior probability (PP.H4 ⩾ 80%) (Supplementary Fig. 2b, Supplementary Tables 8.1 and 8.2). Further, cis-pQTLs tended to be reported as significant cis-eQTLs across multiple tissues, possibly because plasma protein levels might contain signatures from multiple tissues (Supplementary Fig. 3).
Results from the association analysis of molecular phenotypes, like plasma proteins, integrated with the functional and regulatory annotation of the genome offer a powerful way to understand the molecular mechanisms and consequences of genetic regulatory effects. The functional annotations were curated from several sources like variant effect predictor (VEP) 39, Loss-Of-Function Transcript Effect Estimator (LOFTEE) 40 and Ensembl Regulatory Build 41. We found that cis-pQTLs were enriched for several protein altering functions which may be caused by epitope binding effects noted earlier (Supplementary Fig. 4a-b). After adjusting for protein altering variants (PAVs), we found that independent top cis-pQTLs were enriched in a large spectrum of functional annotations including untranslated regions (5’ and 3’), promoters and transcription factor binding sites, with a pattern that was consistent across the EA and AA populations (See Methods, Supplementary Fig. 4c-d and Supplementary Table 9).
Fine Mapping
To identify the causal variants underlying the significant cis-pQTLs for plasma proteins, we first conducted population-specific fine-mapping for the 1,447 significant SOMAmers that had at least one cis-pQTL in both EA and AA using SuSiE 42 (Supplementary Tables 10.1 and 10.2). Comparing the 95% credible sets, we found that the average number of variants in the credible sets was significantly smaller in AA than in EA (21.29 in EA vs 12.11 in AA; p-value = 8.43×10−27; Fig. 2 a-b). This is possibly driven in part by the lower average LD in AA, but could also be due to the lower sample size in AA compared to EA, resulting in lower statistical power. To demonstrate the added value of including two ethnic populations in identifying possibly shared causal variants, we further conducted a trans-ethnic meta-analysis using MANTRA 43, which accounts for effect heterogeneity among populations, and constructed a 95% credible set of shared causal variants by ranking variants according to their Bayes factor (see Methods).
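The construction of a credible set by ranking variants on their Bayes factors can be sketched as follows. This is a simplified illustration, assuming a single shared causal variant per region and equal prior probability for every variant; the log10 Bayes factors are invented and the function is not part of MANTRA or SuSiE.

```python
def credible_set(log10_bf, coverage=0.95):
    """Return (indices in the credible set, posterior probabilities).

    Assumptions: exactly one causal variant in the region and a flat prior,
    so each posterior probability is proportional to the Bayes factor.
    """
    # Subtract the maximum before exponentiating for numerical stability.
    max_lbf = max(log10_bf)
    weights = [10 ** (x - max_lbf) for x in log10_bf]
    total = sum(weights)
    posterior = [w / total for w in weights]
    # Rank variants from most to least probable and accumulate until coverage.
    ranked = sorted(range(len(posterior)), key=lambda i: posterior[i], reverse=True)
    chosen, cumulative = [], 0.0
    for idx in ranked:
        chosen.append(idx)
        cumulative += posterior[idx]
        if cumulative >= coverage:
            break
    return chosen, posterior

# Hypothetical log10 Bayes factors for four variants in one cis-region.
one_dominant, _ = credible_set([12.0, 10.5, 8.0, 3.0])
two_tied, _ = credible_set([5.0, 5.0, 3.0])
```

When one variant dominates the evidence the set shrinks to that single variant, whereas two variants with equal support (for example, in perfect LD) both remain in the set. This is the mechanism by which lower LD in the AA population can yield smaller credible sets.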
As an example of the fine mapping analysis, we illustrate the fine-mapped cis-region (+/-500Kb) for the HBZ pGene on chromosome 16p13.3 corresponding to the Hemoglobin subunit zeta protein (HBAZ; Uniprot ID: P02008), which is involved in oxygen transport and metal-binding mechanisms 44, 45 and has been associated with thalassemia 46. Single variant pQTL association results show that there are several significant cis-pQTL associations both in EA and AA populations (Fig. 2c and 2e). Fine-mapping within the EA individuals identifies a 95% credible set of seven variants (Fig. 2d) while that within the AA individuals identifies a smaller credible set of two variants only (Fig. 2f). Trans-ethnic meta-analysis using MANTRA further points to a single variant rs2541645 (16:161106 G>T) as the possible shared causal variant between EA and AA. This variant was in fact the most significantly associated cis-pQTL for HBZ pGene in AA but not in EA, and had some evidence of differences in minor allele frequency across the populations (MAF = 0.32 in EA vs 0.18 in AA). This SNP is a strong eQTL for HBZ expression in GTEx V8 whole blood (p-value = 6.7×10−80), and associated with several erythrocyte related outcomes in the UK Biobank including mean corpuscular hemoglobin (p-value=1.1×10−14) and reticulocyte fraction of red cells (p-value=3.2×10−9) 47, 48. Together, these findings suggest that rs2541645 might be a regulatory variant for HBZ protein levels and possibly warrant further study on downstream phenotypic consequences especially in the context of blood related mechanisms and thalassemia.
Analysis of Cis-Heritability of Proteins and Building Protein Imputation Model
We estimated cis-heritability (cis-h2) of plasma proteins, i.e. the proportion of variance of protein levels that could be explained by all SNPs in cis-regions of their encoded genes, using the GCTA software 49. We found 1,350 and 1,394 SOMAmers to have significant cis-h2 (p-value < 0.01) for the EA and AA populations, respectively, and 1,109 of them overlapped (Supplementary Table 11). The majority of those significant cis-heritable SOMAmers also had cis-pQTLs identified in our study (96% for AA and 99% for EA, Supplementary Table 12). The cis-h2 for significant SOMAmers (median cis-h2 = 0.10 for AA, and 0.09 for EA) tended to be substantially smaller than those reported for gene expression 50 in two related tissues 14, liver and whole blood, in GTEx V7 (Fig. 3a), and similar patterns were also observed in GTEx V8 (Supplementary Fig. 5). The pattern is expected given the closer relationship of genetic variation to transcripts than to the encoded proteins, which are subject to additional processing including post-translational modifications.
Next, we built protein imputation models for cis-heritable SOMAmers using an elastic net machine learning method with 5-fold cross-validation, as has been used for modeling gene expression 27. The median accuracy for the elastic-net models for protein predictions, evaluated as the prediction R2 standardized by cis-heritability (R2/cis-h2), was 0.79 and 0.69 for the EA and AA populations, respectively. Compared with imputation models built only with the top cis-pQTL, the elastic net models gained 36% and 40% in accuracy for the EA and AA populations, respectively (Fig. 3b, Supplementary Table 13). In cross-ethnic analysis, we found that models trained in the EA population performed worse in the AA population than the converse, in spite of a much smaller sample size in AA, again indicating the advantage of the latter population for identifying causal pQTLs which are more likely to have robust effects across ethnic groups (Fig. 3c).
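For readers unfamiliar with the elastic net, the sketch below fits the penalized regression by cyclic coordinate descent. It is a minimal illustration rather than the software used in the study: it assumes centered genotype columns and a centered phenotype, fixes the penalty parameters `lam` and `alpha` by hand instead of selecting them by 5-fold cross-validation, and uses invented toy data.

```python
def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma (the lasso proximal operator)."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0

def elastic_net(X, y, lam=0.1, alpha=0.5, n_iter=200):
    """Minimize (1/2n)||y - Xb||^2 + lam * (alpha*|b|_1 + (1-alpha)/2 * |b|_2^2).

    Assumes the columns of X and the vector y are already centered.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X @ beta, kept up to date
    col_ss = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_ss[j] == 0.0:
                continue  # monomorphic SNP carries no information
            # Covariance of SNP j with the partial residual that excludes SNP j.
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_ss[j] * beta[j]
            new_b = soft_threshold(rho, lam * alpha) / (col_ss[j] + lam * (1 - alpha))
            delta = new_b - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new_b
    return beta

# Toy centered genotypes (3 individuals x 2 SNPs) and centered protein levels.
X = [[-1.0, 0.5], [0.0, -1.0], [1.0, 0.5]]
y = [-2.0, 0.1, 1.9]
weights = elastic_net(X, y, lam=0.05, alpha=0.5)
```

The L1 part of the penalty sets the weights of uninformative SNPs exactly to zero, while the L2 part stabilizes estimates among SNPs in LD, which is one reason such models can outperform prediction from the top cis-pQTL alone.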
Cis-regulated Genetic Correlation between Plasma Proteome and Transcriptome across a variety of tissues
We then explored cis-regulated genetic correlation between plasma proteins and expression levels for the underlying genes across a variety of tissues. We used genotype data for Europeans from the Phase-3 1000 Genome Project (1000Genome) 51 to evaluate Pearson’s correlation coefficients between genotypically-imputed protein levels and genotypically-imputed expression levels, with the latter being computed based on models that have been previously built and published by Gusev et al. 28 based on data from the GTEx (V7) consortium (Supplementary Tables 13 and 14). We also used models based on GTEx (V8) developed by the same group (available through personal communication), but because of their preliminary nature we present all the analyses involving imputed gene expression using the V7 models and present preliminary results from the V8 models as supplementary data. The analysis was restricted to the European population due to the lack of gene-expression imputation models for the AA population. Overall, genetically imputed plasma proteins are only moderately correlated with those for gene expression levels (Fig. 3d). Consistent with previous studies 52, we find that plasma proteins show the strongest genetic correlations with gene expression levels in the liver, the organ responsible for the synthesis of many highly abundant plasma proteins. The lowest genetic correlations were seen for brain-related tissues, which may be due to the blood-brain barrier. In GTEx (V8), which included a larger number of overlapping genes with our SOMAmers, we observed a similar pattern for high-/low-rank tissues for expression-protein genetic correlations, although the absolute levels of these correlations were a bit lower (Supplementary Table 15.1).
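The correlation between genotypically-imputed protein and expression levels can be sketched as below. The example assumes that both imputation models are expressed as per-SNP weights over the same ordered set of cis-SNPs and that genotypes are coded as allele dosages; the reference-panel genotypes and weights are invented for illustration.

```python
def impute(genotypes, weights):
    """Genetically predicted level per individual: sum over SNPs of dosage x weight."""
    return [sum(g * w for g, w in zip(row, weights)) for row in genotypes]

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical reference panel: 5 individuals x 3 cis-SNPs (allele dosages).
panel = [
    [0, 1, 2],
    [1, 1, 0],
    [2, 0, 1],
    [0, 2, 1],
    [1, 0, 0],
]
protein_weights = [0.4, 0.0, -0.2]     # hypothetical protein imputation model
expression_weights = [0.3, 0.1, 0.0]   # hypothetical expression imputation model

imputed_protein = impute(panel, protein_weights)
imputed_expression = impute(panel, expression_weights)
genetic_correlation = pearson_r(imputed_protein, imputed_expression)
```

Because both predicted quantities are functions of genotype only, their correlation reflects shared cis-genetic regulation and excludes non-genetic variability in the measured protein and expression levels.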
The correlations between direct plasma protein measurements and imputed gene expression levels in ARIC showed a similar trend but had generally much lower values, as they account for additional variability of protein measurements due to non-genetic factors (Supplementary Fig. 6).
Proteome-wide Association Study (PWAS) of Complex Traits
We illustrate an application of the protein imputation model by conducting proteome-wide association studies for two related complex traits: (1) serum urate, a highly heritable biomarker of health representing the end product of purine metabolism in humans, and (2) gout, a complex disease caused by urate crystal deposition in the setting of elevated urate levels and the resulting inflammatory response. We obtained GWAS summary-statistics data for these traits generated by the CKDGen Consortium 32 involving a total sample size of N=288,649 and N=754,056, respectively. As this GWAS was conducted primarily in EA population, we carried out the PWAS analysis using the model generated for the EA population.
We used a computational pipeline previously developed for conducting TWAS based on GWAS summary statistics 28, 53 to carry out an analogous PWAS analysis. Simulation studies showed that the type 1 error of PWAS analysis based on our protein imputation weights is well controlled (Supplementary Fig. 7). Among SOMAmers that showed significant cis-heritability, we identified 10 and 4 distinct loci containing genes for which the encoded proteins or protein complexes were found to be significantly (p-value < 3.7×10−5) associated with serum urate and gout, respectively. We further examined whether the PWAS signals could be explained by cis-genetic regulation of the expression of nearby (1Mb region around) genes, and vice versa, by performing bivariate analysis conditioning on imputed expression values for nearby genes that were found to be significantly associated based on the TWAS analysis. Main results were based on GTEx V7 models (Fig. 4, Table 1, Supplementary Fig. 8), and were further validated using GTEx V8 preliminary models (Supplementary Table 16). For the nearby TWAS analysis, we considered significance of genes based on two trait-relevant tissues available in GTEx V7, namely whole blood and liver, but also explored other tissues more broadly (see Methods).
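The summary-statistics-based association test used in TWAS-style pipelines can be sketched as follows: the PWAS z-score is the weighted sum of GWAS z-scores for the SNPs in the imputation model, scaled by the standard deviation implied by the LD among those SNPs, Z = w'z / sqrt(w'Rw). The weights, z-scores and LD matrix below are invented, and the function names are illustrative rather than part of the FUSION software. The threshold of 3.7×10−5 is taken from the text; it is numerically consistent with a Bonferroni correction of 0.05 over roughly 1,350 tested SOMAmers, although that derivation is an inference and is not stated in this section.

```python
from statistics import NormalDist

def pwas_z(weights, gwas_z, ld):
    """PWAS z-score from imputation weights, GWAS z-scores and an LD matrix.

    Z = (w . z) / sqrt(w' R w), where R is the SNP correlation (LD) matrix.
    """
    numerator = sum(w * z for w, z in zip(weights, gwas_z))
    p = len(weights)
    variance = sum(weights[i] * ld[i][j] * weights[j]
                   for i in range(p) for j in range(p))
    return numerator / variance ** 0.5

def two_sided_p(z):
    """Two-sided p-value for a standard-normal test statistic."""
    return 2 * NormalDist().cdf(-abs(z))

# Hypothetical two-SNP imputation model for one protein.
weights = [0.5, 0.5]
gwas_z = [4.0, 4.0]
ld = [[1.0, 0.8],
      [0.8, 1.0]]

z_score = pwas_z(weights, gwas_z, ld)
p_value = two_sided_p(z_score)
is_significant = p_value < 3.7e-5  # threshold quoted in the text
```

Accounting for LD in the denominator matters: treating the two correlated SNPs as independent would understate the variance of the weighted sum and inflate the test statistic.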
The conditional PWAS analysis of serum urate revealed several interesting patterns (Table 1a). First, there were PWAS signals that could be largely explained by nearby TWAS signals for the corresponding transcript in relevant tissues (e.g., INHBB in liver, and SNUPN in whole blood). This may be indicative of genetic loci influencing serum urate through altered gene expression and corresponding protein levels 54. Second, there were also PWAS signals that could be largely explained by the TWAS signal of the corresponding transcript in other tissues (e.g. B3GAT3 in brain), but not in whole blood or liver. Such examples support the notion that the evaluation of diverse potential tissues of action may be important to characterize these genetic loci. However, the effects of TWAS of B3GAT3 in brain tissues are negative whereas the effect of its PWAS is positive. We found that the opposite directions are consistent with the negative genetic correlation between plasma protein and gene expression in those tissues. Further investigation of these complicated directions of effects is worthwhile. Third, for the locus around INHBC, the plasma PWAS signal for INHBC explains the most significant nearby TWAS signal, R3HDM2 in thyroid (conditional p-value of TWAS signal = 4.12×10−1), but not vice versa (conditional p-value of PWAS signal = 6.84×10−34). We also found that the top TWAS gene-tissue signals detected using the V7 models show similar levels of significance, direction and magnitude of association when the analyses were repeated using the V8 models, and the corresponding conditional PWAS-TWAS analyses show qualitatively similar results (Supplementary Table 16). For the significant PWAS signals, we further examined evidence of colocalization with gene expression across tissues (Supplementary Tables 17.1 and 17.2) and observed that whenever there was strong genetic correlation between plasma protein and gene expression there was also strong evidence of colocalization (e.g. INHBB in liver, and B3GAT3 in brain), which gives the most confidence in those findings.
Finally, the PWAS of gout revealed a finding illustrating the potential to detect drug targets for treating gout, based on the significant association with the Interleukin 1 Receptor Antagonist protein (IL1RN, p-value = 2.22×10−5) (Table 1b). IL1RN binds to its target, the cell surface interleukin-1 receptor (IL1R1), thereby inhibiting the pro-inflammatory effect of interleukin-1 signaling. Anakinra, an anti-inflammatory drug approved to treat rheumatoid arthritis, is a recombinant, slightly modified version of the IL1RN protein examined in our study that binds to IL1R1, blocking its actions (Supplementary Fig. 9). The observed association between higher levels of IL1RN protein and lower odds of gout is consistent with the beneficial effect of its synthetic analogue anakinra on other inflammatory diseases and suggests a repurposing opportunity for anakinra to treat acute gout flares. In fact, such evaluations are ongoing, with a recent randomized, double-blind, placebo-controlled trial of acute gout flares showing anakinra to be non-inferior to usual treatment 55. While drug delivery to plasma proteins and their cell surface receptors is easier than to other molecules such as intra-nuclear proteins, druggability of any implicated protein in our study depends on various factors such as protein structure and biological functions, and needs to be evaluated on a case-by-case basis. A systematic connection of all cis-heritable proteins to active drug candidates is provided as an additional resource (Supplementary Table 18).
Discussion
We present a comprehensive analysis of cis-genetic regulation of the plasma proteome based on a large discovery study that includes both EA and AA individuals and an additional replication study based on AA individuals. Our study almost tripled the number of genes with identified cis-pQTLs compared to previous reports 16, 17 and led, for the first time, to an understanding of the unique genetic architecture of the plasma proteome in the AA population. We developed models for plasma protein imputation separately for EA and AA populations and make them publicly available to facilitate future proteome-wide association studies. Using large-scale GWAS summary statistics from two complex traits, we illustrate how PWAS can complement TWAS for the identification of causal genes and protein products and can inform potential drug targets. We have created a web resource for downloading summary-statistics data and PWAS models with searchable options for exploring/viewing various results from our analyses (http://nilanjanchatterjeelab.org/pwas).
Our analysis provides several important insights into the cis-genetic architecture of plasma proteome. We observe that cis-heritability of protein levels tends to be smaller compared to those of gene expression levels in related tissues (Fig. 3a), a pattern consistent with the central dogma of DNA regulating the proteome through the transcriptome and the widespread presence of post-translational modification. Further, while cis-heritability of plasma proteome is fairly comparable across EA and AA populations, we observe important heterogeneity. We found nearly 30% of the sentinel pQTLs detected in the AA population were non-existent or extremely rare in the EA population, but the converse proportion was much more modest (∼10%). We further observe that the predictive performance of protein imputation model for the AA population, in spite of its much smaller sample size, is comparable to that for the EA population (Fig. 3b), and cross-population performance of such model is better from AA to EA population than the converse (Fig. 3c). Further, fine-mapping analysis using SuSiE indicated that the size of “credible set” for many genes is substantially smaller in the AA than the EA population. Taken all together, our analysis demonstrates that similar to what has been reported earlier for more complex traits 56, there are distinct advantages of including ethnically diverse samples in genetic studies of molecular phenotypes.
While we increased the number of known cis-pQTLs by a large margin, some of the patterns of associations we see have been noted earlier. For example, similar to ours, a prior study 16 has reported large overlap between eQTLs and pQTLs. Further, the distributions of cis-pQTLs we observe in relationship to distance from the gene transcription start site and across various functional annotations have also been noted earlier. A study 25 has previously shown that pQTLs identified in the EA population largely replicate in non-EA Arabic and Asian populations. Similarly, we found a high degree of correlation in effect sizes for cis-pQTLs which are common across both EA and AA populations. However, we also showed that discovery analysis in the AA population itself leads to the identification of many unique cis-pQTLs and that further fine-mapping analysis in this population leads to better resolution for the identification of causal variants.
We demonstrate applications of protein imputation models for conducting proteome-wide association studies (PWAS) for two related complex traits, resulting in the exemplary identification of the IL1RN protein, which indicates potential promise for drug repurposing of anakinra to treat acute gout flares. Through multivariate analysis, we further explored the relationship between plasma PWAS signals and those detected at the transcriptome level through the complementary TWAS approach across various tissues. We found that while TWAS signals often exist in the same region, the underlying genes for which the strongest signals are seen can differ and/or the underlying tissue may not be closely related to plasma. As plasma proteins are an easier target for drug delivery, we created an additional resource connecting all cis-heritable proteins to active drug candidates (Supplementary Table 18). In general, we believe the most promising target genes could be those where both PWAS and TWAS signals exist with underlying evidence of genetic correlation and colocalization.
Our study has several limitations. First, while the platform we used included SOMAmers for close to 5,000 proteins or protein complexes, it does not provide coverage for the entire plasma proteome. In the future, more comprehensive protein measurements across different tissues will be needed to further pinpoint target genes and tissues of action. Second, the power of our PWAS analysis conditional on TWAS signals may be affected by the small sample size of the underlying eQTL datasets. Third, in this study, we have not carried out a joint analysis of the data across the two populations and thus may have incurred some loss of power for the identification of shared pQTLs. Fourth, we have not explored effects of uncommon and rare variants, or complex trans-associations, all of which could have a significant impact in explaining heritability, but substantial discovery is likely to need an even larger sample size.
In conclusion, our study provides comprehensive and cross-population insight into cis-genetic architecture of plasma proteome. We generate several resources for utilizing our results for the mapping of causal protein-regulating variants, investigating the causal role of plasma proteins on complex traits and their drug repurposing potential. Future studies are merited to obtain more comprehensive coverage of proteome across different tissues and to comprehensively explore the role of rare variants and trans-effects on the variation of the proteome.
Author Contribution
J.Z., J.C. and N.C. conceived the project. J.Z. and D.D. carried out all data analyses with supervision from N.C. B.H. developed online resources for data visualization and sharing. J.Z., D.D., A.K. and N.C. drafted the manuscript, and A.T., P.S., M.G. and B.Y. provided comments. All co-authors reviewed and approved the final version of the manuscript.
Competing interests
Proteomic assays in ARIC were conducted free of charge as part of a data exchange agreement with SomaLogic.
URLs
Public data used in this study: 1000 Genomes Phase 3, http://www.internationalgenome.org/category/phase-3/; UK Biobank, https://www.ukbiobank.ac.uk/; GTEx, https://gtexportal.org/home/datasets; CKDGen, http://ckdgen.imbi.uni-freiburg.de/; biomaRt, https://www.bioconductor.org/packages/release/bioc/vignettes/biomaRt/inst/doc/biomaRt.html; GWAS Atlas, https://atlas.ctglab.nl/PheWAS.
Publicly available software used in this study: R statistical software, http://www.R-project.org/; PLINK 2.0, https://www.cog-genomics.org/plink/2.0/; GCTA, https://cnsgenomics.com/software/gcta/; QTLtools, https://qtltools.github.io/qtltools/; VEP, https://useast.ensembl.org/info/docs/tools/vep/; LOFTEE, https://github.com/konradjk/loftee; SuSiE, https://github.com/stephenslab/susieR; FUSION/TWAS, http://gusevlab.org/projects/fusion/; TORUS, https://github.com/xqwen/torus/; coloc, https://github.com/chr1swallace; MANTRA, available on request from Professor Andrew P. Morris.
Data availability
Genome-wide summary statistics for all single-SNP cis-pQTL analyses, irrespective of significance level, are available at http://nilanjanchatterjeelab.org/pwas. Additional data and code required to perform PWAS are also available from the website.
Plasma protein data availability: Pre-existing data access policies for each of the parent cohort studies (ARIC and AASK) specify that research data requests can be submitted to each steering committee; these will be promptly reviewed for confidentiality or intellectual property restrictions and will not unreasonably be refused. Please refer to the data sharing policies of these studies. Individual level patient or protein data may further be restricted by consent, confidentiality or privacy laws/considerations. These policies apply to both clinical and proteomic data.
Code availability
Example code to perform PWAS is available at http://nilanjanchatterjeelab.org/pwas. All final code used in this study for the analysis of the protein data, including pQTL and PWAS analyses, will be posted on GitHub upon manuscript publication at https://github.com/nchatterjeelab/PlasmaProtein.
Methods
Study population
Our study was conducted using individual-level data from the Atherosclerosis Risk in Communities (ARIC) study 29. The ARIC study is an ongoing community-based cohort study that initially enrolled 15,792 participants between 1987 and 1989 from four communities across the US: Washington County, Maryland; suburbs of Minneapolis, Minnesota; Forsyth County, North Carolina; and Jackson, Mississippi. The third visit (v3) occurred in 1993-1995, when the blood samples used for the measurement of the proteome were collected. A total of 9,084 participants with cleaned plasma protein data (1,871 African Americans (AA) and 7,213 European Americans (EA)) were retained in the current study after the exclusion of participants without genotype data (see below).
Plasma protein data and genetic data
The relative concentrations of plasma proteins or protein complexes from the blood samples were measured by SomaLogic Inc. (Boulder, Colorado, US) using an aptamer (SOMAmer)-based approach 12, 13. Details of this approach and the SomaLogic normalization pipeline can be found in a technical white paper on the manufacturer’s website, http://somalogic.com/wp-content/uploads/2017/06/SSM-002-Technical-White-Paper_010916_LSM1.pdf, and https://somalogic.com/wp-content/uploads/2017/06/SSM-071-Rev-0-Technical-Note-SOMAscan-Data-Standardization.pdf. Of the 4,877 SOMAmers measuring 4,697 unique proteins or protein complexes, we excluded 43 SOMAmers that mapped to multiple gene targets, 9 SOMAmers whose target proteins’ encoding genes have no position record in the biomaRt database 57, and 8 SOMAmers without any SNPs in the cis region. By restricting analysis to plasma proteins or protein complexes encoded by autosomal genes, we further excluded 158 genes on the X chromosome and 2 genes on the Y chromosome. In total, 4,657 SOMAmers measuring 4,483 unique proteins or protein complexes encoded by 4,435 autosomal genes passed quality control and were retained in the current study.
Genotyping of ARIC samples was performed on the Affymetrix 6.0 DNA microarray and imputed to the TOPMed reference panel (Freeze 5b) 58, 59. SNPs with imputation quality R2 < 0.8, call rates <90%, Hardy-Weinberg equilibrium p-values < 10⁻⁶, or minor allele frequencies <1% were excluded. Genetic principal components show that the two self-reported ethnic subgroups, European Americans (EA) and African Americans (AA), are well distinguished in terms of genetic ancestry (Supplementary Fig. 10) 60.
Plasma protein data processing
Variation in high-throughput gene expression data that is not due to genetic variants has been found to affect the power of eQTL discovery 8, 9. Fluctuations of the internal environment, experimental deviations, and batch effects can all have a large influence on high-throughput measurements 33. To study whether this type of variance exists in our high-throughput plasma protein data measured by the SOMAmers, we used analysis of variance (ANOVA) to test the association of non-genetic factors with the first 10 principal components (PCs) of the log-transformed relative abundance of SOMAmers. The non-genetic factors included common covariates (age, sex, and study site at v3) as well as batch effects (plate run date, scanner ID, plate position, and subarray) (Supplementary Table 19).
To account for this non-genetic variance, which may obscure genetic association signals, we used the Probabilistic Estimation of Expression Residuals (PEER) method to estimate a set of latent covariates and included them as linear terms in the model 34. The number of PEER factors for each ancestry was selected to maximize the number of significant SOMAmers, i.e. SOMAmers with a significant cis-pQTL near the putative protein’s gene.
The log-transformed relative abundance of each SOMAmer was adjusted in a linear regression model including PEER factors and the covariates sex, age, study site, and 10 genetic principal components (PCs). The residuals from this linear regression were then rank-inverse normalized to avoid the influence of extreme values and were used as the corrected-protein quantification in the analysis. Analyzing up to 200 PEER factors in increments of 10, the maximum number of significant SOMAmers was achieved at 90 and 80 PEER factors for EA and AA, respectively (Fig. 1a). Thus, the corrected-protein quantifications adjusted for 90 and 80 PEER factors were used as phenotypes in the analysis of the EA and AA populations, respectively.
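The residualization and rank-inverse normalization described above can be illustrated with a minimal sketch. It is not the study's code: the covariate matrix is a simulated stand-in for the PEER factors, sex, age, study site and genetic PCs, and the rank offset of 0.5 is an assumed convention that the text does not specify.

```python
import numpy as np
from scipy.stats import norm, rankdata

def residualize(y, covariates):
    """Residuals of y after least-squares regression on covariates plus an intercept."""
    X = np.column_stack([np.ones(len(y)), covariates])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ coef

def rank_inverse_normal(x, offset=0.5):
    """Replace each value by the normal quantile of its rank (offset is an assumed convention)."""
    ranks = rankdata(x)  # ranks 1..n, ties share the average rank
    return norm.ppf((ranks - offset) / (len(x) - 2 * offset + 1))

# Toy data: three simulated covariates and a heavy-tailed log-abundance.
rng = np.random.default_rng(0)
n = 500
covs = rng.normal(size=(n, 3))
log_abundance = covs @ np.array([0.5, -0.2, 0.1]) + rng.standard_t(df=3, size=n)

resid = residualize(log_abundance, covs)
corrected = rank_inverse_normal(resid)  # phenotype passed to the pQTL analysis
```

The transform preserves the ordering of the residuals while forcing their marginal distribution to be approximately standard normal, which is why extreme values no longer dominate the association tests.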
Significant SOMAmers discovery
A significant SOMAmer is defined as a SOMAmer with a significant cis-pQTL near the putative protein’s gene. For all primary analyses, we defined the mapping window as 500 kb upstream and downstream of the transcription start site (TSS) of the target protein’s coding gene. In a secondary analysis, we found the cis-heritability of SNPs within +/- 500Kb and +/- 1Mb of the TSS to be quite similar, indicating that the vast majority of cis-pQTLs in the larger region are concentrated within the +/- 500Kb window (Supplementary Table 20). Gene positions in the GRCh38 reference genome were obtained from the Ensembl BioMart database 57. Applying the Bonferroni correction to p-values from standard linear regression association tests usually proves overly stringent and results in many false negatives 38. To overcome this issue, we applied the adaptive permutation approach implemented in QTLtools 37. We used one hundred permutations to empirically characterize the null distribution of the strongest signal, which was then fitted by a Beta distribution. The p-values of association adjusted for the number of variants tested in cis, as given by the fitted Beta distribution, were used to calculate gene-level q-values. Significant SOMAmers were identified by controlling the false discovery rate (FDR) at 5%.
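The Beta approximation step can be sketched as follows. This is an illustration of the idea only, not the QTLtools implementation: the permutation null is simulated here as the minimum of independent uniform p-values, and the function name and toy inputs are hypothetical.

```python
import numpy as np
from scipy.stats import beta

def beta_adjusted_pvalue(best_nominal_p, null_best_p):
    """Fit Beta(a, b) on [0, 1] to the permutation minima and evaluate its CDF
    at the observed best nominal p-value."""
    a, b, _, _ = beta.fit(null_best_p, floc=0, fscale=1)
    return beta.cdf(best_nominal_p, a, b), (a, b)

rng = np.random.default_rng(1)
n_variants, n_perm = 200, 100
# Toy null: smallest of n_variants independent uniform p-values per permutation.
# Real cis variants are in LD, so the effective number of tests is smaller.
null_best_p = rng.uniform(size=(n_perm, n_variants)).min(axis=1)

p_adj, (a, b) = beta_adjusted_pvalue(1e-6, null_best_p)
print(f"fitted Beta({a:.2f}, {b:.1f}); adjusted p = {p_adj:.2e}")
```

The adjusted p-value is larger than the nominal one because it accounts for having taken the best of many correlated tests, while the fitted Beta allows values far smaller than 1/100 to be resolved from only one hundred permutations.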
Comparison with previously identified cis-pQTLs
A list of existing pQTL studies has been compiled by Karsten Suhre (http://www.metabolomix.com/a-table-of-all-published-gwas-with-proteomics/) 25. We focused on two recent European-ancestry pQTL studies with large sample sizes and proteins assayed by SOMAscan. The first was performed in the INTERVAL study of UK blood donors 16. The other was performed in the AGES-RS cohort 17. To make a fair comparison, we compared the cis-pQTLs identified across the analyses using the same standard: sentinel cis-associations (+/-500Kb) for common SNPs (MAF>0.01) and a Bonferroni-corrected genome-wide threshold for significance. Using these criteria, the two previous studies identified a total of 508 unique significant SOMAmers (304 and 422, respectively) and we identified 1,465 significant SOMAmers. We then tested replication of their sentinel SNPs in our ARIC EA sample at a Bonferroni-corrected threshold of p-value < 0.05/726 = 6.89×10⁻⁵, where 726 = 218×2 + 204 + 86: there were 218 SOMAmers discovered in both studies, 204 discovered only in AGES-RS and 86 discovered only in INTERVAL. If a significant SOMAmer’s sentinel SNP was not available in ARIC, we used its LD proxies, with r2 calculated from the 1000 Genomes European individuals.
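The counts behind the replication threshold can be checked with a few lines of arithmetic; the variable names are illustrative only.

```python
# SOMAmer counts as stated in the text.
both, ages_only, interval_only = 218, 204, 86

n_interval = both + interval_only             # 304 significant SOMAmers in INTERVAL
n_ages = both + ages_only                     # 422 significant SOMAmers in AGES-RS
n_unique = both + ages_only + interval_only   # 508 unique SOMAmers overall

# SOMAmers found in both studies contribute one sentinel SNP per study.
n_tests = 2 * both + ages_only + interval_only  # 726 replication tests
threshold = 0.05 / n_tests
print(f"{n_tests} tests, threshold = {threshold:.2e}")  # 726 tests, threshold = 6.89e-05
```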
Replication of cis-pQTL identified in AA
We replicated cis-pQTLs discovered in the ARIC AA sample in the African American Study of Kidney Disease and Hypertension (AASK), a clinical trial of alternative blood-pressure-lowering regimens and goals 35. Enrollment occurred from 1995 to 1998, with the original trial population consisting of 1,094 African American participants with chronic kidney disease. Blood samples used for the measurement of the proteome were collected at baseline. A total of 467 participants with serum protein data and genotype data were retained in the current study. Proteomic profiling was performed with the SomaScan V4.1 platform. Genotyping was conducted using the Infinium Multi-Ethnic Global BeadChip array (Illumina, GenomeStudio) and imputed to the TOPMed reference panel (Freeze 5 on GRCh38).
Independent cis-pQTL mapping
It is likely that significant SOMAmers have multiple proximal cis-SNPs with independent effects. To identify these independent signals, we performed independent cis-pQTL mapping using the conditional pass implemented in QTLtools 37. The algorithm first uses permutations to derive a nominal p-value threshold per SOMAmer, and then uses forward-backward stepwise regression to select conditionally independent signals. In this process, it automatically learns the number of independent signals per SOMAmer using forward selection, and then determines the best candidate SNP per signal using backward selection controlling for the remaining signals. If no SNP is significant at the nominal p-value threshold, the candidate signal is dropped; otherwise, the SNP with the smallest backward p-value is chosen as the lead SNP for this candidate signal. In some cases, the same SNP in the backward selection can explain multiple independent signals that were detected during the forward selection. In reporting our results (Supplementary Tables 6.1 and 6.2), we show the ranks of all the SNPs selected in the forward selection step that are explained by a given lead SNP selected in the final backward selection step.
To account for detection power in Fig. 1c, we weighted the SNP effect sizes by the inverse of the statistical power, which can be derived as follows.
The test statistic for the SNP effect is chi-square distributed with one degree of freedom (df). It follows a central chi-square distribution under the null hypothesis and a non-central chi-square distribution under the alternative hypothesis. The non-centrality parameter (NCP), λ, is λ = 2Nf(1 − f)β², where N is the number of samples in the study, f is the MAF of the SNP, and β is the SNP effect 61, 62. The significance threshold for the test statistic under the central chi-square distribution with df 1 and the SOMAmer’s nominal p-value cut-off, p0, is t0 = F⁻¹(1 − p0, 1), where F(⋅, 1) is the cumulative distribution function (CDF) of a central chi-square distribution with df 1. The statistical power can be computed as Pr(T > t0 | Ha) = 1 − G(t0, λ, 1), where T is the test statistic and G(⋅, λ, 1) is the CDF of the non-central chi-square distribution with NCP λ and df 1. The weight assigned to the SNP effect is (1 − G(t0, λ, 1))⁻¹.
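The power calculation can be sketched with SciPy's chi-square distributions. The sketch assumes the common form of the NCP for a standardized phenotype, λ = 2Nf(1 − f)β²; the function names and the example inputs (sample sizes from this study, an arbitrary MAF, effect and cut-off) are illustrative only.

```python
from scipy.stats import chi2, ncx2

def detection_power(n, maf, beta, p0):
    """Power of a 1-df chi-square test at nominal cut-off p0.
    Assumes NCP = 2 * N * f * (1 - f) * beta^2 (phenotype with unit variance)."""
    lam = 2.0 * n * maf * (1.0 - maf) * beta**2
    t0 = chi2.isf(p0, df=1)            # same as F^-1(1 - p0, 1), numerically stabler
    return ncx2.sf(t0, df=1, nc=lam)   # 1 - G(t0, lambda, 1)

def power_weight(n, maf, beta, p0):
    """Inverse-power weight applied to a SNP effect size."""
    return 1.0 / detection_power(n, maf, beta, p0)

for n in (7213, 1871):  # EA and AA sample sizes
    pw = detection_power(n, maf=0.3, beta=0.1, p0=1e-5)
    print(f"N={n}: power={pw:.3f}, weight={1.0 / pw:.2f}")
```

Effects that are hard to detect (small N, low MAF or small β) receive large weights, which compensates for their under-representation among the discovered signals.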
Investigation of epitope-binding effects
The SOMAscan assay relies on aptamer binding, which may be influenced by changes in protein structure. Protein-altering variants (PAVs) may produce cis-pQTLs by altering binding affinity rather than protein abundance. Following a procedure recommended earlier 16, we cataloged all cis-pQTLs that were not in LD (r2<0.1) with any PAV in the cis region, as well as those in LD (0.1⩽r2⩽0.9) with a PAV that remained significant in a conditional analysis adjusting for PAVs. We annotated variants with the Variant Effect Predictor (VEP) 39, the Loss-Of-Function Transcript Effect Estimator (LOFTEE) 40 and the Ensembl Regulatory Build 41. Variants were considered to be PAVs if annotated as coding sequence, frameshift, in-frame deletion, in-frame insertion, missense, splice acceptor, splice donor, splice region, start lost, stop gained, or stop lost variants. LD-pruned (r2>0.9) PAVs were included as covariates for association testing.
Cis-eQTL overlap
We cross-referenced the identified cis-pQTLs against cis-eQTLs identified in the overall analysis of GTEx (V8) data across different tissues. For each SOMAmer, we first extracted the sentinel cis-pQTLs, meaning the variant with the most significant association for a pGene along with all variants in high LD with it (r2 > 0.8). Using this list of variants across the 2,004 SOMAmers that had at least one cis-pQTL in EA, we calculated the percentage overlap with the set of significant cis-eQTLs (at FDR<5%, as defined by the GTEx consortium) for the same gene identified in each tissue of GTEx V8 9. Since the GTEx cohort is primarily of European ancestry, we restricted this analysis to EA only.
Colocalization
Colocalization analysis was performed to investigate whether the same variants were likely to be causal for variation in protein levels and gene expression levels. We used publicly available overall cis-eQTL summary statistics from the GTEx consortium (V8). To test whether cis-eQTL and cis-pQTL associations for the same gene colocalize, we used the coloc package in R with default settings 63. Evidence for colocalization was assessed using the posterior probability (PP) for the hypothesis that there is an association for both protein levels and gene expression levels and that they are driven by the same causal variant (PP.H4). Since we tested across a large number of tissues, we chose a stringent cut-off of 0.8, and pGenes with PP.H4 > 0.8 were identified as likely to have a shared causal variant for the cis-eQTL and cis-pQTL associations. As before, we restricted our analysis to the 2,004 pGenes identified in EA.
Functional annotation enrichment
We performed an enrichment analysis of the cis-pQTLs for known regulatory elements in the genome to identify the broad functions of the cis-pQTLs. The functional annotations were curated from the Variant Effect Predictor (VEP) 39, the Loss-Of-Function Transcript Effect Estimator (LOFTEE) 40 and the Ensembl Regulatory Build, as reported in the recent GTEx analysis. For each SOMAmer, we used the sentinel cis-pQTLs, meaning the variant with the most significant association and the variants in high LD with it (r2 > 0.8), to evaluate functional enrichment. With these annotations, we used torus 64 to perform functional enrichment for each functional category. To remove the influence of potential epitope-binding effects associated with the PAVs, we also investigated functional enrichment among sentinel cis-pQTLs (and variants in high LD) that showed significant effects independent of the PAVs (see previous section for details).
Fine-mapping analysis
To identify the set of possibly causal variants regulating plasma protein levels, we performed fine-mapping 65 with SuSiE 42 using the cis-variants for each of the 1,447 SOMAmers that had at least one cis-pQTL in both EA and AA. For a given SOMAmer and the corresponding variants in the cis-regulatory region, SuSiE outputs a number of single effect components, or credible sets, each of which has 95% probability of containing a variant with a non-zero causal effect. We set the maximum number of such single effect components to 10, meaning that we allow a SOMAmer to be regulated by at most 10 causal variants. SuSiE also outputs the posterior inclusion probability for each variant, which corresponds to the probability that the variant is included in one of the credible sets.
To perform trans-ethnic meta-analysis, we used MANTRA 43, which is based on a computationally intensive Bayesian partition model that accounts for the shared similarity of closely related populations by assuming the same underlying allelic effect. It models effect heterogeneity among distant populations by clustering according to shared ancestry and allelic effects. MANTRA outputs the Bayes factor for association of a variant across ancestries. Using this, we constructed the posterior probability 66 of the kth variant (πk) as πk = δk / Σj δj, where δk is the Bayes factor for association of the kth variant obtained from the trans-ethnic meta-analysis in MANTRA and the sum in the denominator is across all the variants in the cis-region. We ran MANTRA on the variants common to EA and AA and subsequently calculated the posterior probabilities.
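The posterior probability of each variant, πk = δk / Σj δj, can be computed from Bayes factors as in the sketch below. The sketch assumes the Bayes factors are supplied on the log10 scale and normalizes them in log space to avoid overflow. The 95% credible-set helper is an added illustration of how such posteriors are commonly summarized, not a step stated in the text, and all input values are made up.

```python
import numpy as np

def posterior_from_log10_bf(log10_bf):
    """pi_k = delta_k / sum_j delta_j, computed stably from log10 Bayes factors."""
    log_bf = np.asarray(log10_bf, dtype=float) * np.log(10.0)
    w = np.exp(log_bf - log_bf.max())  # rescale so the largest term is 1
    return w / w.sum()

def credible_set(post, coverage=0.95):
    """Smallest set of variants, ranked by posterior, whose total reaches coverage."""
    order = np.argsort(post)[::-1]
    cum = np.cumsum(post[order])
    k = int(np.searchsorted(cum, coverage)) + 1
    return order[:k]

log10_bf = np.array([0.2, 6.0, 5.0, 1.0, -0.3])  # hypothetical cis-region of 5 variants
post = posterior_from_log10_bf(log10_bf)
cs = credible_set(post)
print(np.round(post, 4), cs)
```

In this toy region the second and third variants carry almost all of the posterior mass, so the 95% credible set contains just those two variants.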
Cis-SNP heritability estimation
Cis-SNP heritability (cis-h2) of SOMAmers was estimated using the REML algorithm implemented in GCTA 49. Genotypes of SNPs in a cis-window around the gene encoding the target protein of a SOMAmer were used to estimate the genetic relatedness matrix (GRM). Corrected-protein quantifications and the estimated GRM were input to GCTA to estimate cis-h2 using the REML algorithm (option --reml --reml-no-constrain). The maximum number of iterations for convergence of the estimation algorithm was set to 100. Non-zero cis-heritability was tested using a likelihood-ratio test for the first genetic variance component (option --reml-lrt 1) at a significance level of 0.01. Plasma protein SOMAmers with negative cis-h2 estimates were excluded. Cis-window sizes of +/- 500Kb and +/- 1Mb were examined, and there were no significant differences between the heritability estimates (Supplementary Table 20). Therefore, throughout the paper, we used a +/- 500Kb window, which is the same as the window of the TWAS models we used.
Imputation models trained jointly with cis-SNPs
Using the TWAS / FUSION software 28, we built imputation models for 1,394 (AA) and 1,350 (EA) SOMAmers with significant non-zero cis-h2. The imputation model for a SOMAmer was trained jointly by elastic net using cis-SNPs within +/-500Kb of the TSS of the gene encoding the target protein. The performance of the models was evaluated by the adjusted prediction accuracy, defined as the 5-fold cross-validated R2 between predicted and true values standardized by cis-h2. Imputation models built with only the top cis-pQTL were used as a baseline comparison.
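The adjusted prediction accuracy can be illustrated with a small self-contained sketch. It is not the FUSION pipeline: the elastic net below is a plain coordinate-descent implementation with a fixed, assumed penalty (lam) and mixing parameter (alpha) rather than a cross-validated one, the genotypes are simulated, and the true simulated heritability stands in for the GCTA estimate of cis-h2.

```python
import numpy as np

def elastic_net(X, y, lam=0.05, alpha=0.5, n_iter=200):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam*(alpha*|b|_1 + (1-alpha)/2*|b|_2^2).
    X and y are assumed to be centered; lam and alpha are illustrative values."""
    n, p = X.shape
    b = np.zeros(p)
    col_ss = (X**2).sum(axis=0) / n
    r = y.copy()  # current residual y - X @ b
    for _ in range(n_iter):
        for j in range(p):
            rho = X[:, j] @ r / n + col_ss[j] * b[j]
            new = np.sign(rho) * max(abs(rho) - lam * alpha, 0.0) / (col_ss[j] + lam * (1 - alpha))
            r += X[:, j] * (b[j] - new)
            b[j] = new
    return b

def cv_adjusted_r2(X, y, h2, n_folds=5, seed=0):
    """5-fold cross-validated R2 between predicted and observed values, divided by h2."""
    idx = np.random.default_rng(seed).permutation(len(y))
    pred = np.empty(len(y))
    for fold in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, fold)
        mu_x, mu_y = X[train].mean(axis=0), y[train].mean()
        b = elastic_net(X[train] - mu_x, y[train] - mu_y)
        pred[fold] = (X[fold] - mu_x) @ b + mu_y
    return np.corrcoef(pred, y)[0, 1] ** 2 / h2

# Toy cis-region: 30 SNPs, 3 of them causal, heritability 0.3.
rng = np.random.default_rng(2)
n, p, h2 = 600, 30, 0.3
X = rng.binomial(2, 0.3, size=(n, p)).astype(float)
g = X[:, :3].sum(axis=1)
g = (g - g.mean()) / g.std() * np.sqrt(h2)
y = g + rng.normal(scale=np.sqrt(1 - h2), size=n)

acc = cv_adjusted_r2(X, y, h2)
print(f"adjusted prediction accuracy = {acc:.2f}")
```

Dividing by cis-h2 expresses accuracy relative to the maximum that any cis-genetic predictor could achieve, so values near 1 indicate a model that captures most of the heritable signal.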
Trans-ethnic prediction capacity
To study trans-ethnic prediction performance, we applied the genetic imputation models trained in one population to the genotypes of individuals from the other population in ARIC. The cross-ethnic prediction performance was evaluated by the R2 between predicted and true values standardized by cis-h2.
Cis-regulated genetic correlation between plasma proteome and transcriptome across a variety of tissues
To study the cis-regulated genetic correlation between plasma protein levels and expression levels of the underlying genes across a variety of tissues, we computed Pearson’s correlation coefficients between genotypically imputed plasma proteins and genotypically imputed gene expression for the same gene in individuals from the Phase 3 1000 Genomes Project (1000Genome) 51 by applying the weights of the imputation models to the genotype data. For primary analyses, we used established gene expression imputation models based on the GTEx V7 dataset across different tissues (http://gusevlab.org/projects/fusion/#reference-functional-data; see Supplementary Table 13 for the full list and Supplementary Table 14 for their prediction accuracies). Here we studied only genes that were significantly cis-heritable (p-value of cis-h2 from GCTA < 0.01) for both gene expression levels and plasma protein levels (Supplementary Tables 15.1 and 15.2). Since the gene expression imputation models were derived using GTEx V7 participants of predominantly European ancestry, the plasma protein imputation models here were restricted to those derived in EA. If multiple transcripts or SOMAmers were measured for the same gene, the sum of their imputed levels was used to represent “the total level of the gene” in terms of gene expression or plasma protein level. We also obtained preliminary gene-expression imputation models trained on the GTEx V8 dataset (through personal communication with the Gusev lab) and used them to conduct several secondary/validation analyses for comparison of results with V7.
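The cis-regulated genetic correlation amounts to imputing both molecular traits in a reference panel and correlating the imputed values. A minimal sketch follows, assuming the two weight vectors have already been aligned to the same SNPs and effect alleles as the genotype matrix; the genotypes and weights are simulated, not taken from any real model.

```python
import numpy as np

def imputed_correlation(G, w_protein, w_expr):
    """Pearson correlation between imputed protein and imputed expression levels.
    G: individuals x SNPs dosage matrix; weights aligned to the columns of G."""
    return np.corrcoef(G @ w_protein, G @ w_expr)[0, 1]

rng = np.random.default_rng(3)
G = rng.binomial(2, 0.25, size=(500, 40)).astype(float)  # toy reference genotypes

w_p = np.zeros(40)
w_p[[2, 10]] = [0.4, -0.2]        # sparse protein model (hypothetical weights)
w_e = np.zeros(40)
w_e[[2, 25]] = [0.3, 0.1]         # expression model sharing one SNP with the protein model

rho = imputed_correlation(G, w_p, w_e)
print(f"cis-regulated genetic correlation = {rho:.2f}")
```

When several SOMAmers or transcripts map to one gene, their imputed levels can be summed before the correlation is taken, which is equivalent to summing the aligned weight vectors.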
Proteome-wide association studies (PWAS)
As an analog of TWAS, the weights in the imputation models of SOMAmers can be applied to summary-level data using the test statistics derived in TWAS / FUSION. The mathematical derivation can be found in the original paper 28. The type 1 error of PWAS is well controlled in simulations using null phenotypes simulated from 337,484 unrelated European-ancestry individuals in UK Biobank 67. Note that the enet model coefficients for 9 proteins in AA and 2 proteins in EA were all zero. These proteins were excluded from the PWAS analysis, and therefore 1,385 (AA) and 1,348 (EA) imputation models were available for PWAS. The significance level for the identification of PWAS loci was Bonferroni-adjusted for the total number of imputation models for significantly cis-heritable plasma proteins or protein complexes (p-value < 0.05/1,348=3.7×10⁻⁵ in EA, which was used in our PWAS of serum urate and gout). As discussed in a recent TWAS paper 50, multiple SOMAmers whose target proteins or protein complexes are encoded by genes located close together in a locus were sometimes identified at the same time. To identify distinct loci, a 1Mb region (+/- 500Kb of the TSS) was defined around the gene encoding the target protein of each significant SOMAmer, and overlapping regions were merged. The sentinel association in each locus was selected as the top PWAS candidate hit for this region (Supplementary Tables 21.1 and 21.2).
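The locus-definition step (Bonferroni filtering, window merging and sentinel selection) can be sketched as below. The gene names, positions and p-values are hypothetical, and treating windows that merely touch as overlapping is an assumption.

```python
def merge_loci(hits, flank=500_000):
    """Merge overlapping +/- flank windows around each TSS into distinct loci.
    hits: iterable of (gene, chrom, tss, p_value). Returns a list of locus dicts."""
    loci = []
    for gene, chrom, tss, p in sorted(hits, key=lambda h: (h[1], h[2])):
        start, end = max(tss - flank, 0), tss + flank
        if loci and loci[-1]["chrom"] == chrom and start <= loci[-1]["end"]:
            locus = loci[-1]
            locus["end"] = max(locus["end"], end)
            locus["genes"].append(gene)
            if p < locus["sentinel_p"]:  # keep the most significant association
                locus["sentinel"], locus["sentinel_p"] = gene, p
        else:
            loci.append({"chrom": chrom, "start": start, "end": end,
                         "genes": [gene], "sentinel": gene, "sentinel_p": p})
    return loci

threshold = 0.05 / 1348  # Bonferroni threshold for the EA models, about 3.7e-5
pwas = [("GENE_A", "1", 1_000_000, 1e-8), ("GENE_B", "1", 1_800_000, 1e-10),
        ("GENE_C", "1", 5_000_000, 1e-6), ("GENE_D", "2", 1_200_000, 2e-7),
        ("GENE_E", "2", 9_000_000, 1e-3)]  # GENE_E fails the threshold
loci = merge_loci([h for h in pwas if h[3] < threshold])
for locus in loci:
    print(locus["chrom"], locus["start"], locus["end"], locus["genes"], locus["sentinel"])
```

Here GENE_A and GENE_B fall in one merged locus with GENE_B as sentinel, while GENE_C and GENE_D each form their own locus.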
We obtained a standardized estimate of the causal effect of the underlying proteins on the complex traits (Y), together with its standard error and thereby confidence intervals, by slightly extending S-PrediXcan 68. The estimators are functions of SNP l’s summary statistics for the complex trait, SNP l’s weight wPl in the imputation model for protein P, the variance of SNP l, which can be computed from its allele frequency, and Γ, the LD (correlation) matrix of all M SNPs in the imputation model. We used the same formulae to derive the corresponding causal effects, standard errors and confidence intervals for results from TWAS analyses.
Druggability of PWAS genes
PWAS genes were annotated based on the Therapeutic Target Database 69. Only drugs that were actively pursued were retained; discontinued, terminated or withdrawn drugs were excluded. Additionally, druggability tiers from Finan et al. 70 were mapped via gene symbols (Supplementary Table 18).
Bivariate conditional analysis for PWAS and TWAS
For each significant PWAS locus, we searched all nearby TWAS genes, i.e. those whose TSS is located within 500Kb of the TSS of the sentinel PWAS gene, and selected the one with the smallest TWAS p-value. The positions of genes in TWAS (GTEx V7, genome build GRCh37) and PWAS (genome build GRCh38) were matched using the UCSC genome browser webtool (https://genome.ucsc.edu/cgi-bin/hgLiftOver) 71.
We first performed the nearby TWAS in two trait-relevant tissues, whole blood and liver, for serum urate and gout. Note that kidney is also a trait-relevant tissue, but there is no imputation model trained with GTEx V7 data available on TWAS / FUSION for kidney. The significance of the nearby TWAS hit was determined by significance level after Bonferroni Correction (0.05/ ∑relevant tissues #transcripts with imputation models).
Using the z-scores (zP for the PWAS gene and zT for the TWAS gene) and the cis-regulated genetic correlation (ρ) between each PWAS gene and the most significant nearby TWAS gene, we performed conditional analysis 72 to study whether the underlying mechanism is more likely to act through gene expression in tissue or through proteins in plasma. The cis-regulated genetic correlation was computed as Pearson’s correlation coefficient between genotypically imputed plasma proteins and genotypically imputed gene expression in individuals from 1000Genome by applying the weights of the imputation models to the genotype data. The least-squares estimate of the PWAS z-score conditional on the TWAS z-score is zP − ρzT and its variance is 1 − ρ², so the conditional z-score of the PWAS gene is zP|T = (zP − ρzT)/√(1 − ρ²). Similarly, the conditional z-score of the nearby TWAS gene is zT|P = (zT − ρzP)/√(1 − ρ²). We then performed the same procedure for all nearby TWAS genes in all GTEx V7 tissues. Using the Bonferroni correction for the total number of transcripts with imputation models (0.05/ ∑all GTEx tissues #transcripts with imputation models), we identified the tissues that have at least one significant TWAS gene in the significant PWAS loci. The most significant TWAS gene in each region and its corresponding tissue were recorded and then used to perform the conditional analysis (Supplementary Tables 22.1 and 22.2). We further validated the top gene-tissue combinations identified through the V7 TWAS models using preliminary models based on V8 that were available to us.
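The bivariate conditional analysis reduces to a short calculation on the two marginal z-scores and their correlation. The sketch below uses the standard conditional form for two unit-variance, correlated z-scores, (zP − ρzT)/√(1 − ρ²) and its mirror image; the input values are made up for illustration.

```python
import numpy as np
from scipy.stats import norm

def conditional_z(z_p, z_t, rho):
    """Return (z_P|T, z_T|P) for a PWAS and a TWAS z-score with correlation rho."""
    denom = np.sqrt(1.0 - rho**2)
    return (z_p - rho * z_t) / denom, (z_t - rho * z_p) / denom

z_p, z_t, rho = 6.0, 4.0, 0.5   # hypothetical PWAS z, TWAS z, cis-regulated genetic correlation
z_p_cond, z_t_cond = conditional_z(z_p, z_t, rho)
p_p_cond = 2 * norm.sf(abs(z_p_cond))   # two-sided p-values
p_t_cond = 2 * norm.sf(abs(z_t_cond))
print(f"z_P|T = {z_p_cond:.2f} (p = {p_p_cond:.1e}), z_T|P = {z_t_cond:.2f} (p = {p_t_cond:.2f})")
```

In this toy case the protein signal remains strong after conditioning on expression while the expression signal largely disappears, the pattern that would favor the plasma protein as the more proximal mediator.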
Acknowledgements
The Atherosclerosis Risk in Communities study has been funded in whole or in part with Federal funds from the National Heart, Lung, and Blood Institute, National Institutes of Health, Department of Health and Human Services, under Contract nos. (HHSN268201700001I, HHSN268201700002I, HHSN268201700003I, HHSN268201700005I, HHSN268201700004I). The authors thank the staff and participants of the ARIC study for their important contributions. SomaLogic Inc. conducted the SomaScan assays in exchange for use of ARIC data. The UK Biobank data were obtained under UK Biobank resource application 17712. This work was supported in part by NIH/NHLBI grant R01 HL134320, NIH/NIDDK grant R01 DK124399, and NIH/NIDDK grant R01 DK108803. Research of J.Z., D.D. and N.C. was supported by an R01 grant from the National Human Genome Research Institute [1 R01 HG010480-01]. The work of A.K. was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – Project-ID 431984000 – SFB 1453. The work of P.S. was funded by the EQUIP Program for Medical Scientists, Faculty of Medicine, University of Freiburg. We acknowledge Dr. Nicholas Mancuso and Dr. Alexander Gusev for providing preliminary TWAS models built with GTEx V8 data.
Footnotes
Major revisions include: (1) Unifying the definition of the cis region to be +/-500Kb in both pQTL and PWAS analyses. (2) Addition of data from a second African American (AA) study for replication. The analysis shows a high replication rate (>90% concordance rate for the direction of effect estimates and 70% detected to be statistically significant at a stringent multiple-testing-adjusted threshold) of our original findings from the ARIC AA population. (3) Cataloging of AA-specific pQTLs that are non-existent or extremely rare in the European American (EA) population. In fact, we found that 30% of the cis-pQTLs detected in the AA population were population-specific. (4) Cataloging of pQTLs that may be influenced by potential epitope effects associated with protein-altering variants and carrying out downstream sensitivity analyses. (5) Large-scale pQTL-eQTL colocalization analyses across all GTEx tissues. (6) Providing estimates of effect sizes and confidence intervals associated with PWAS and TWAS findings, and validating the top gene-tissue combinations identified through TWAS models in V7 using preliminary models that were available to us based on V8. (7) A systematic connection of all cis-heritable proteins to active drug candidates, provided as an additional resource. (8) Release of the full summary statistics for a total of 4,657 proteins analyzed in our paper on our website, http://nilanjanchatterjeelab.org/pwas/. The manuscript, supplemental figures, and tables were mainly revised based on these eight points.