Genetic determinants of plasma protein levels in the Estonian population

The proteome holds great potential as an intermediate layer between the genome and phenome. Previous protein quantitative trait locus studies have focused mainly on describing the effects of common genetic variations on the proteome. Here, we assessed the impact of the common and rare genetic variations as well as the copy number variants (CNVs) on 326 plasma proteins measured in up to 500 individuals. We identified 184 cis and 94 trans signals for 157 protein traits, which were further fine-mapped to credible sets for 101 cis and 87 trans signals for 151 proteins. Rare genetic variation contributed to the levels of 7 proteins, with 5 cis and 14 trans associations. CNVs were associated with the levels of 11 proteins (7 cis and 5 trans), examples including a 3q12.1 deletion acting as a hub for multiple trans associations; and a CNV overlapping NAIP, a sensor component of the NAIP-NLRC4 inflammasome which is affecting pro-inflammatory cytokine interleukin 18 levels. In summary, this work presents a comprehensive resource of genetic variation affecting the plasma protein levels and provides the interpretation of identified effects.


51
During the last decade, genome-wide association studies (GWASs) have successfully linked 52 genetic variants to complex traits [1]. However, the mechanisms underlying many of these 53 associations often remain unknown, as most of the associated genetic variants are located in 54 non-coding regions of the genome, suggesting that they have regulatory effects on phenotypes 55 [2]. To fill this knowledge gap, molecular traits are routinely used as intermediate phenotypes 56 in association studies. The study of molecular phenotypes enables the assessment of the direct 57 effects of genetic variants on, for example, the alteration of protein levels, and the potential 7 148 RNA sequencing data 149 RNA was extracted from samples in thawed Tempus tubes using TRIzol reagent (Invitrogen, 150 Waltham, MA, USA) and further purified using an RNeasy Mini Kit (Qiagen, Hilden,  Reads that mapped to each genomic feature were counted with STAR using the same algorithm  Genome-wide SNP pQTL discovery 168 Protein trait levels were rank-based inverse normal transformed. We regressed out the effects 169 of age, sex, the season of sample collection, smoking status, blood sample processing time 170 (days), plasma sample storage time (in days) and protein analysis batch using a custom R script.

171
The residuals were used in a single-variant pQTL analysis performed with the EMMAX linear 172 mixed model [30] and the EPACTS software (version 3.3.0, q.emmax function; 173 https://genome.sph.umich.edu/wiki/EPACTS). To account for population structure, a kinship 174 matrix was generated in EPACTS using genetic variants with MAF > 0.01 and call rate > 95%. 175 Depending on the panel, we tested between 8,856,032 and 8,891,303 autosomal genetic 176 variants against 341 plasma protein traits. 177 We classified associated variants into two categories based on their positions in relation to the 178 protein-coding genes. We defined cis-pQTLs as SNPs located within 1Mb upstream or 179 downstream of the transcription start sites (TSSs) of the corresponding protein-coding genes, 180 and trans-pQTLs as SNPs located >1 Mb upstream or downstream of the TSS or on a different 181 chromosome. Heterodimers were classified based on the protein subunit gene closest to the 182 associated variant. In the case of proteins that were present on multiple panels, weaker signals 183 were omitted from the analyses. 184 To retain independent signals, associated variants were clumped in PLINK (version 1.9) [31], 185 using a 1 Mb window with the LD thresholds of R 2 = 0.1 and P < 5 × 10 -8 . To flag potential 186 'pseudo-pQTL' signals caused by the epitope effect, i.e. altered assay binding affinity due to a 187 change in protein structure instead of an actual change in protein expression level, we followed 188 the strategy described by Folkersen et al, 2020 [6]. Briefly, we determined whether any lead 189 cis variant was a protein-altering variant (PAV) or in high LD (R² ≥ 0.8) with one, by using 190 2,230 WGS samples as the reference for the LD calculations (S2 Table). Missense, frameshift, 191 splice donor region and stop gain variants were flagged as PAVs. Lead pQTL variants were 192 queried for evidence of location in a regulatory region using RegulomeDB [32].

193
Corresponding eQTL discovery 194 In order to overlap the genome-wide significant (P < 5 × 10 -8 ) pQTLs with eQTLs, we used 195 the RNA sequencing data from the overlapping samples of the same cohort [29]. We tested the 196 eQTL effects on the genes encoding corresponding proteins by using a linear mixed model  Multiple testing correction for the pQTL analysis 206 From primary analyses, effects reaching per-protein genome-wide significance (P < 5 × 10 -8 ) 207 were interpreted. To also provide the more conservative results accounting for the number of 208 tested proteins, we used a strategy described by Gao  Fine-mapping analysis 239 We conducted a fine-mapping analysis to pinpoint causal variants for protein level-significant 240 (P < 5 × 10 -8 ) SNV-pQTL associations. We excluded the LTA and MICA-MICB proteins 241 associated with variants in the major histocompatibility complex region on chromosome 6, due 242 to the complexity of the associated HLA region. The fine-mapping procedure was based on the   In an pQTL-eQTL colocalisation analysis, we compared our significant pQTL loci to all eQTL  Table). We lifted the pQTL summary statistics over 299 to an hg38 build to match with the eQTL Catalogue.

300
The region-wide associations for GWAS traits enrolled into the colocalisation analyses were 301 extracted from the MRC IEU OpenGWAS database and were examined using the ieugwasr 302 (version 0.1.5) R package (https://github.com/MRCIEU/ieugwasr; S4 Table). Since proteins 303 were selected based on associated traits from the PheWAS, they were all associated with 304 clinical traits (i.e. drugs, surgeries, diseases/conditions). In addition, all selected proteins except 305 IL6R had primary pQTLs that did not include nonsynonymous variants, to minimise the 306 possibility of association due to the epitope effect. IL6R was selected because it has been 307 widely reported by previous pQTL studies as an example of the successful linking of molecular 308 traits and diseases to discover drug targets [45,54]. The input data consisted of region-based 309 summary statistics for six protein traits and 61 complex clinical traits.
14 310 Two-sample MR 311 We conducted a two-sample MR analysis using protein levels with significant colocalisation 312 (PP₄ ≥ 0.8) as exposures and complex traits as outcomes, using the TwoSampleMR (version 313 0.5.6) R package [55,56]. Independent variants obtained previously by clumping served as 314 instrumental variables. We conducted the analysis using an inverse variance weighted fixed-315 effects method and a single instrument-based Wald ratio test. To correct for multiple testing, 316 we adjusted P-values using false discovery rate (FDR) correction; results were considered 317 significant at Benjamini-Hochberg FDR ≤ 0.05.

319
To determine whether any of the examined proteins are genetically regulated by larger 320 structural variants, we conducted a pQTL mapping using CNV data. Description of the used 321 CNV data is in the Methods section for CNV detection and quality control. Associations For each significantly associated CNV, all SNP markers within a 500-kbp proximity were 332 tested for potential tagging effects. For this purpose, the SNP pQTL analysis using EPACTS 333 was repeated for these regions with the CNVs included as covariates.

334
The same CNVs were tested against the expression levels of 12,619 genes [29], and the CNV 335 pQTL results were then cross-referenced with eQTLs identified from the same set of   345 CNV pQTLs from primary mapping that reached genome-wide significance (P < 1.12 × 10 -7 ) 346 or the suggestive significance threshold (P < 2 × 10 -5 ) were included in a PheWAS, resulting 347 in the inclusion of 38 CNV regions. All data included in the PheWAS were obtained using the 348 lm function with custom R scripts from 2,115 unrelated Estonian Genome Centre samples for 349 which WGS data were available, and were corrected for age, sex and six genotype principal

386
To provide a comparison with previous research, we compared  (Benjamini-Hochberg FDR < 0.05) and all the significant pQTLs were also directionally 398 concordant with the current study (S2 Table). Concordance with previous studies demonstrates 399 the robustness of our results. The total numbers of associated proteins were similar for all panels and ranged from 38 to 43 404 (S2 Table) Table). 77% (48/62) of them 448 were directionally concordant with corresponding pQTLs. 449 We found that 95% CSs for 151 proteins were linked to 131 independent genomic loci (S6 450   Table). LDLR, TNFRSF11B, TNFRSF6B, WISP1, CXCL1 and PLAU proteins showed   Table).

Trans-pQTL colocalising with eQTL
Gene expression (RNAseq) 710 Since the protein measurements originated from blood, the most widely studied tissue, the 483 largest fraction of pQTLs colocalised with blood eQTLs. However, while using the GTEx 484 dataset, we also found 739 cases of pQTL-eQTL colocalisation in multiple tissues (Fig. 3, S8 485   The queries for the 268 unique lead variants and high-LD proxies led to the identification of 511 135 (50.4%) variants with 5,046 significant associations for 432 complex traits (S11 Table).

512
Of these associations, 1583 (31.4%) were with various blood cell traits from the study

529
For example, based on pQTL-eQTL colocalisation analysis, IL6R pQTL signal was also an 530 eQTL of the IL6R gene in macrophages, monocytes, T cells, whole blood and pancreatic islets.

531
A previous study has shown a link between IL6R and CAD [64]. We also identified 532 associations between IL6R pQTLs and CAD, rheumatoid arthritis and 7 other disease traits 533 (S11 Table), thereby supporting the findings of the study [63]. As another example, IL1RL1 534 pQTLs colocalised with IL1RL1, IL18R1 and IL18RAP eQTLs detected in multiple cell types 535 with direct effects on the immune system (e.g. T-cells; S8 Table); these variants were 536 associated with asthma and allergic reactions in the PheWAS.

552
We identified 46 significant colocalisation events (S12 Table). were not supported by the colocalisation results.

560
MR findings 561 We conducted MR analyses using 46 significant (FDR-corrected) pQTL-complex trait pairs 562 from the colocalisation analysis (Fig. 4, S13 Table). We found a causal relationship between  the levels of 7 proteins (S14 Table). The majority of identified rare variant effects (13, 68.4%) 587 were with the level of GDF-15. The most significant rare variant association was a trans signal 588 between JAKMIP1 on chromosome 4 and the level of GDF-15 (P = 5.41 × 10 -12 ). We also 589 assessed if rare nonsynonymous SNPs affect the expression of same genes encoding the 590 corresponding pQTL proteins, however we did not detect any nominally significant 591 (Benjamini-Hochberg FDR < 0.05) associations (S14 Table). 592 We next conducted GeneMANIA network analysis [37, 38] to identify functional connections 593 between genes harbouring rare SNPs and proteins affected by trans associations. First, we 594 studied the potential connection between rare variant genes associated with the GDF-15 level.

693
To further interpret the of CNV-pQTL results, we examined additional pQTLs for proteins that 694 were not measured in our study. For that, we leveraged LD between the EstBB CNVs and 695 previously reported pQTL SNPs and prioritised CNVs which could underlie the previously 696 reported pQTL associations (R 2 between SNP and CNV >0.8). We identified eight CNVs with 697 possible effects on protein levels ( We also detected 76 tagging SNP-CNV pairs for 33 unique CNVs and 72 proteins (S17 Table) 705 from a more recent Sun et al. 2022 study [9]. Twenty-nine (40.3%) of the proteins were also 706 measured in the EstBB cohort, of which six proteins had significant CNV pQTLs (P < 1.12 ×    results to a molecular level is complicated by the nature of associated disease phenotype.

773
Plasma proteins are potentially more relevant for circulatory diseases where the blood is in 774 contact with the affected tissue, such as in the FGF5 example, rather than for conditions with 775 a limited number of affected tissues.

776
The availability of the high-quality WGS data also gave us a unique opportunity to investigate  tagging SNPs could be used as a proxy method to assess the effect of CNVs on complex traits.

841
In conclusion, we have generated a comprehensive pQTL resource and interpreted it by using