Abstract
An increasing number of gene expression quantitative trait locus (QTL) studies have made summary statistics publicly available, which can be used to gain insight into human complex traits by downstream analyses such as fine-mapping and colocalisation. However, differences between these datasets in their variants tested, allele codings, and in the transcriptional features quantified are a barrier to their widespread use. Here, we present the eQTL Catalogue, a resource which contains quality controlled, uniformly re-computed QTLs from 19 eQTL publications. In addition to gene expression QTLs, we have also identified QTLs at the level of exon expression, transcript usage, and promoter, splice junction and 3ʹ end usage. Our summary statistics can be downloaded by FTP or accessed via a REST API and are also accessible via the Open Targets Genetics Portal. We demonstrate how the eQTL Catalogue and GWAS Catalog APIs can be used to perform colocalisation analysis between GWAS and QTL results without downloading and reformatting summary statistics. New datasets will continuously be added to the eQTL Catalogue, enabling systematic interpretation of human GWAS associations across a large number of cell types and tissues. The eQTL Catalogue is available at https://www.ebi.ac.uk/eqtl/.
Introduction
Gene expression and splicing QTLs are a powerful tool to link disease-associated genetic variants to putative target genes. Despite efforts by large-scale consortia such as GTEx [1] and eQTLGen [2] to provide comprehensive eQTL annotations for a large number of human tissues, most eQTL datasets are still scattered across individual publications. Multiple databases have been developed that collect eQTL summary statistics [3–9]; however, these efforts have relied on the heterogeneous set of summary statistics calculated by the original authors.
Relying on publicly available eQTL summary statistics has several limitations. First, many downstream use cases such as fine-mapping [10,11] and colocalisation [12,13] require full summary statistics from the region of interest, but some studies have only released either eQTL lead variants or variants below a certain p-value threshold. Second, studies often test a different subset of variants in the cis region of each gene, meaning that variants tested in one study might be missing from another study. Third, even though the eQTL effect direction relative to a GWAS signal is critical for interpreting disease associations, information about the effect allele is often either missing or ambiguous. Finally, even though both splicing [1,14] and other transcript-level QTLs [15] contribute to complex traits, these analyses have not been performed on many earlier RNA-seq-based eQTL datasets. Where splicing or transcript-level QTL summary statistics have been released, these are still difficult to compare between studies due to large differences in analysis strategy and the types of transcript-level changes captured by different methods [15].
To overcome these limitations, we have reprocessed the raw data from 19 eQTL studies. We have applied uniform data analysis and quality control procedures to all of these datasets. In addition to gene expression QTLs, we have identified QTLs at the level of exons, transcripts and transcriptional events covering alternative promoters, splicing events and transcript 3ʹ ends. This allowed us to detect novel QTLs in existing datasets that would have otherwise remained hidden. Our full summary statistics are available on the eQTL Catalogue FTP server and via a REST API. As an example, we use the eQTL Catalogue and GWAS Catalog APIs to identify a transcript usage QTL in stimulated macrophages at the CD40 locus which colocalises with a rheumatoid arthritis GWAS signal. Access to individual-level data will enable us to recompute QTL summary statistics as improved RNA-seq analysis methods become available.
Results
Data analysis workflow
To uniformly process a large number of eQTL studies, we designed a modular and robust data analysis workflow (Figure 1). We first downloaded the raw gene expression and genotype data and converted the data to common input formats (VCF for genotypes and fastq for RNA-seq). We performed extensive quality control of genotypes (see Methods) and imputed them to the 1000 Genomes Phase 3 [16] reference panel. For RNA sequencing data, we started with the nf-core [17] RNA-seq pipeline written in the Nextflow [18] framework and modified it to support the quantification of four different molecular phenotypes: gene expression, exon expression [19], transcript usage, and promoter, splicing and 3ʹ end usage events defined by txrevise [15] (Supplementary Figure 3). Using the same quantification workflow ensured that molecular phenotype identifiers (genes, transcripts, exons and events) were consistent between individual studies. Furthermore, we harmonised sample metadata between studies and mapped all biological samples (cell types and tissues) to a common set of 24 distinct ontology terms from UBERON [20], Cell Ontology [21] and Experimental Factor Ontology [22]. This will allow users to easily find out whether the same cell types or tissues have been profiled in multiple studies (Table 1). The normalised molecular phenotype matrices and imputed genotypes were fed into our QTL mapping workflow that was also developed using the Nextflow framework. The full association summary statistics have been made publicly available via the eQTL Catalogue FTP site as well as the REST API. All our data analysis workflows have been released under a permissive licence (see Software Availability).
Datasets included in the eQTL Catalogue
We downloaded raw gene expression and genotype data from 14 RNA-seq and 5 microarray studies from various repositories. This included 8,115 RNA-seq samples and 4,631 microarray samples from 4,685 unique donors (Table 1), covering 24 cell types or tissues (Table 1) and 13 stimulated conditions (Supplementary Material 1) (called ‘biological contexts’). Even though these samples were profiled in different laboratories using a wide range of RNA-seq protocols (Supplementary Tables 2 and 3) and sequencing depths (Supplementary Figure 2), they predominantly clustered by cell type or tissue of origin in multidimensional scaling (MDS) analysis (Figure 2A). Projecting the genotype data of the donors onto the 1000 Genomes Phase 3 [16] reference panel, we found that although 88% of the donors were of European origin, the datasets also included 487 (~10%) donors from African populations and a small number of samples from other populations (Table 2, Supplementary Table 1).
Our quality control of the gene expression and genotype datasets included removing outlier samples from the gene expression datasets, ascertaining the genetic sex of the samples using the expression of sex-specific genes, detecting genotype concordance between RNA-seq and genotype samples, and detecting cross-contamination between samples within a study using both sex-specific gene expression as well as genotype data (see Methods, Supplementary Figure 4). We excluded a total of 2,418 samples during the quality control procedure (Supplementary Table 3).
For RNA-seq datasets, we performed QTL mapping for four different molecular phenotypes described above (Figure 1, Supplementary Figure 3). The QTL analysis was performed separately in each biological context of each study. In general, we found the largest number of QTLs at the level of gene expression, but for all molecular phenotypes, the number of significant associations scaled approximately linearly with the sample size (Figure 2B, Supplementary Material 1). For microarray datasets, we performed the analysis only at the gene level, but found the same linear trend (Figure 2B, Supplementary Material 1).
Example use case
To demonstrate the utility of the eQTL Catalogue and REST API for interpreting disease-associated genetic variants, we explored the CD40 locus associated with rheumatoid arthritis (RA) [42]. We have previously demonstrated that the RA GWAS signal at this locus colocalises with a promoter usage QTL for CD40 in macrophages stimulated with interferon-gamma [34]. To assess whether this association could be detected in other tissues and cell types, we queried the eQTL Catalogue API using the GWAS lead variant from the CD40 RA locus (rs4239702). We found a number of molecular phenotypes strongly associated with the lead variant (nominal p-value < 10⁻⁴) (Figure 3A). In particular, there was a strong association with the total expression level of CD40 in four independent monocyte eQTL studies covering both RNA-seq and microarray studies [23,24,26,27] (Figure 3A).
To test if these eQTLs are likely to share the same causal variant with the RA GWAS signal, we used colocalisation analysis [12]. We fetched the full association summary statistics from the CD40 locus (GRCh38 chr20:45,980,000-46,200,000). This analysis replicated the previously reported colocalisation with CD40 promoter usage in stimulated macrophages [15] (Figure 3B); however, the same analysis applied to monocyte-specific eQTLs strongly supported a model of distinct causal variants underlying the eQTL and GWAS association in all four studies (Figure 3C). This was consistent with the low linkage disequilibrium (LD) of r² = 0.13 between the monocyte eQTL (rs745307) and RA GWAS lead variants (rs4239702). This highlights the importance of having access to full summary statistics from the region. Although the GWAS variant was strongly associated with CD40 expression in monocytes, this was likely due to a very strong independent eQTL signal nearby (nominal p-value < 10⁻⁵⁰ in the Fairfax_2014 dataset) that was in low LD with the GWAS lead variant. It is possible that the promoter usage QTL detected in stimulated macrophages (Figure 3B) is a weak secondary eQTL in the monocyte samples, but this would still indicate that CD40 expression in naive monocytes does not directly contribute to RA disease risk, because a much stronger eQTL in that context is not associated with the disease [43]. The complete RMarkdown document to reproduce this analysis is available from GitHub (see Software Availability).
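As a minimal illustration of restricting summary statistics to the locus used above, the following Python sketch parses a region string and filters rows by position. It is not the code used in the analysis; the function names and the row fields ("chromosome", "position") are assumptions made for this example, and the rows are toy values rather than real associations.

```python
import re


def parse_region(region):
    """Parse a string such as 'chr20:45,980,000-46,200,000'.

    Returns (chromosome without the 'chr' prefix, start, end).
    """
    match = re.fullmatch(r"(?:chr)?(\w+):([\d,]+)-([\d,]+)", region.strip())
    if match is None:
        raise ValueError(f"Unrecognised region string: {region!r}")
    chrom, start, end = match.groups()
    start = int(start.replace(",", ""))
    end = int(end.replace(",", ""))
    if start > end:
        raise ValueError("Region start must not exceed region end")
    return chrom, start, end


def in_region(row, region):
    """True if a summary-statistics row falls inside the parsed region."""
    chrom, start, end = region
    return row["chromosome"] == chrom and start <= row["position"] <= end


# Toy rows (hypothetical field names and values, for illustration only).
rows = [
    {"variant": "toy_1", "chromosome": "20", "position": 46_000_000},
    {"variant": "toy_2", "chromosome": "20", "position": 47_000_000},
    {"variant": "toy_3", "chromosome": "19", "position": 46_000_000},
]
cd40_region = parse_region("chr20:45,980,000-46,200,000")
selected = [row["variant"] for row in rows if in_region(row, cd40_region)]
```

Keeping every variant in the window, rather than only those passing a p-value threshold, is what makes the subsequent colocalisation step possible.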
Comparison with existing databases
The largest collection of eQTLs is currently hosted by the QTLbase [5] database. Although QTLbase contains some splicing QTLs, this is limited to the summary statistics provided by the original study authors. Secondly, although QTLbase has harmonised variant identifiers and effect sizes across studies, these are not accessible programmatically and the downloadable files only contain p-values of nominally significant associations (p < 0.05) without the effect sizes. Thus, QTLbase summary statistics are not suitable for fine-mapping and colocalisation applications. Both QTLizer [4] and PhenoScanner [7] provide programmatic access to their summary statistics, but the QTLizer summary statistics have not been harmonised and PhenoScanner contains data from only ten studies. Finally, both FUMA [6] and ImmuneRegulation [8] provide access to some eQTL summary statistics via their web interface, but the full data cannot be downloaded for local computational analyses.
All eQTL Catalogue summary statistics are available under the Creative Commons Attribution 4.0 International License, enabling third parties to build their own tools and services on top of the released summary statistics and the REST API. To avoid downloading large text files, slices of the summary statistics can be accessed using tabix [44] (see Data Availability).
Discussion
The eQTL Catalogue provides a resource of uniformly processed human gene-level and transcript-level QTL summary statistics, with the aim of supporting biomedical genetic research. This resource will be progressively expanded to all accessible human datasets. We are currently analysing raw data from GTEx v8 [1], the CommonMind Consortium [45] and the FUSION study [46]. We are also setting up data access agreements for additional datasets on an ongoing basis.
We have paid particular attention to making the summary statistics as usable as possible. By mapping cell types and tissues to common ontology terms, we make it easy for users to discover which studies contain the tissues and cell types of interest. This will also enable summary-level meta-analysis [2] across studies containing the same cell types and tissues. We have imputed most genotype datasets to the same reference panel and reference genome version, ensuring that a similar set of genetic variants is present in most studies. Finally, we use a consistent set of molecular phenotype identifiers (genes, exons, transcripts, events) across all datasets, ensuring that genetic effects can be compared directly across datasets. Our summary statistics have already been used to interpret GWAS associations for Alzheimer’s disease [47].
We welcome feedback on ways to improve our methods. In the next release planned for June 2020, we plan to include LeafCutter [48] splice junction usage QTLs as the fifth molecular phenotype quantified from RNA-seq data. We are also exploring ways to systematically fine-map [10,11] the QTL signals to identify multiple independent associations for each gene and make the credible sets of causal variants publicly available. This can help to further characterise loci with multiple independent signals, such as the CD40 locus described above (Figure 3).
Finally, we are exploring approaches to handle related samples and population stratification by using either linear mixed models or performing eQTL analysis in each population separately. These modifications would not be possible without access to individual-level genotype and RNA-seq data.
We are always looking for additional datasets to be included in the eQTL Catalogue. Unfortunately, we were unable to obtain access to all of the datasets that we would have liked to include in the analysis due to consent limitations or restrictions on sharing individual-level genetic data (Supplementary Table 4). These limitations could be overcome in the future by federated data analysis approaches, where the eQTL analysis is performed at remote sites using our analysis workflows, and only summary statistics are shared with the eQTL Catalogue. To this end, we will continue to improve the usability and portability of our data analysis workflows and will make them available via community efforts such as the nf-core [17] repository. Researchers interested in contributing their datasets to the eQTL Catalogue should contact us at eqtlcatalogue{at}ebi.ac.uk.
Methods
Data access and informed consent
Gene expression and genotype data from two studies (GEUVADIS and CEDAR) were available for download without restrictions from ArrayExpress. For all other datasets, we applied for access via the relevant Data Access Committees. The database accessions and contact details of the individual Data Access Committees can be found on the eQTL Catalogue website (http://www.ebi.ac.uk/eqtl/Datasets/). In our applications, we explained the project and our intent to publicly share the association summary statistics. Although this was acceptable for the 19 studies currently included in the eQTL Catalogue, some of our data access requests were rejected either because informed consent obtained from the study participants did not allow the sharing of genotype data with other researchers or the data were restricted for research into specific diseases (Supplementary Table 4). Ethical approval for the project was obtained from the Research Ethics Committee of the University of Tartu (approval 287/T-14).
Genotype data
Pre-imputation quality control
We aligned the strands of the genotyped variants to the 1000 Genomes Phase 3 reference panel using Genotype Harmonizer [49]. We excluded genetic variants with Hardy-Weinberg p-value < 10⁻⁶, missingness > 0.05 or minor allele frequency < 0.01 from further analysis. We also excluded samples with more than 5% of their genotypes missing.
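The pre-imputation thresholds above can be written as two simple predicates. This is only a restatement of the stated cut-offs in Python, not the tooling used in the workflow; the function and parameter names are hypothetical.

```python
def passes_variant_qc(hwe_p, missingness, maf,
                      min_hwe_p=1e-6, max_missingness=0.05, min_maf=0.01):
    """Keep a variant only if it violates none of the three exclusion criteria."""
    return (hwe_p >= min_hwe_p
            and missingness <= max_missingness
            and maf >= min_maf)


def passes_sample_qc(missing_fraction, max_missing=0.05):
    """Keep a sample unless more than 5% of its genotypes are missing."""
    return missing_fraction <= max_missing


# Toy variants as (HWE p-value, missingness, MAF); values are illustrative.
toy_variants = {
    "kept": (0.3, 0.01, 0.20),
    "fails_hwe": (1e-8, 0.01, 0.20),
    "fails_missingness": (0.3, 0.10, 0.20),
    "fails_maf": (0.3, 0.01, 0.005),
}
retained = [name for name, stats in toy_variants.items()
            if passes_variant_qc(*stats)]
```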
Genotype imputation and QC
We imputed the genotypes to the 1000 Genomes Phase 3 reference panel [16] using a local installation of the Michigan Imputation Server v1.0.4 [50]. After imputation, we converted the coordinates of genetic variants from the GRCh37 reference genome to GRCh38 using CrossMap v0.2.8 [51]. We used bcftools v1.9.0 to exclude variants with minor allele frequency (MAF) < 0.01 or imputation quality score R² < 0.4 from downstream analysis.
Assigning individuals to reference populations
We used PLINK [52] v1.9.0 to perform LD pruning of the genetic variants and LDAK [53] to project new samples to the principal components of the 1000 Genomes Phase 3 reference panel [16]. To assign each genotyped sample to one of four superpopulations, we calculated the Euclidean distance in the principal component space from the genotyped individual to all individuals in the reference dataset. The distance from a sample to a reference superpopulation cluster is defined as the mean of the distances from the sample to each reference sample in that superpopulation cluster. We explored the distances between samples and reference superpopulation clusters using different numbers of PCs and found that using 3 PCs worked best for inferring the superpopulation of a sample. We then assigned each sample to a superpopulation if the distance to the closest superpopulation cluster was at least 1.7 times smaller than the distance to the second closest one (Supplementary Figure 5). We used this relatively relaxed threshold because our aim was to get an approximate estimate of the number of individuals belonging to each superpopulation. Performing a population-specific eQTL analysis would probably require a much more stringent assignment of individuals to populations.
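The assignment rule described above can be sketched in a few lines of Python. This is an illustrative reimplementation under stated assumptions, not the workflow code: the function name, the input layout (a list of labelled reference PC coordinates) and the toy coordinates are all hypothetical.

```python
import math
from collections import defaultdict


def assign_superpopulation(sample_pcs, reference, ratio_threshold=1.7):
    """Assign a sample to the closest reference superpopulation, or None.

    sample_pcs: coordinates of the sample on the first few PCs.
    reference: iterable of (superpopulation_label, pcs) for reference samples.
    The sample is assigned only if the mean distance to the closest cluster is
    at least `ratio_threshold` times smaller than to the second closest one.
    """
    distances = defaultdict(list)
    for label, ref_pcs in reference:
        distances[label].append(math.dist(sample_pcs, ref_pcs))
    if len(distances) < 2:
        raise ValueError("At least two reference superpopulations are needed")
    # Mean distance from the sample to every reference sample of each cluster.
    ranked = sorted((sum(d) / len(d), label) for label, d in distances.items())
    (closest, label), (second, _) = ranked[0], ranked[1]
    return label if closest * ratio_threshold <= second else None


# Toy reference panel with two well-separated clusters on 3 PCs.
toy_reference = [
    ("EUR", (0.0, 0.0, 0.0)), ("EUR", (0.1, 0.0, 0.0)),
    ("AFR", (5.0, 5.0, 5.0)), ("AFR", (5.1, 5.0, 5.0)),
]
clear_case = assign_superpopulation((0.05, 0.0, 0.0), toy_reference)
ambiguous_case = assign_superpopulation((2.5, 2.5, 2.5), toy_reference)
```

A sample lying roughly midway between two clusters is left unassigned, which is the behaviour the ratio threshold is meant to provide.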
Microarray data
Data normalisation
All five microarray datasets currently included in the eQTL Catalogue (CEDAR, Fairfax_2012, Fairfax_2014, Naranbhai_2015, Kasela_2017) used the same Illumina HumanHT-12 v4 gene expression microarray. The database accessions for the raw data can be found on the eQTL Catalogue website (http://www.ebi.ac.uk/eqtl/Datasets/). Batch effects, where applicable, were adjusted for with the function removeBatchEffect from the limma v.3.40.6 R package [54]. The batch adjusted log2 intensity values were quantile normalized using the lumiN function from the lumi v.2.36.0 R package [55]. Only the intensities of 30,353 protein-coding probes were used. The raw intensity values for the five microarray datasets have been deposited to Zenodo (doi: https://doi.org/10.5281/zenodo.3565554).
Detecting sample mixups
We used Genotype Harmonizer [49] v1.4.20 to convert the imputed genotypes into TRITYPER format. We used MixupMapper [56] v1.4.7 to detect sample swaps between gene expression and genotype data. We detected 155 sample swaps in the CEDAR dataset, most of which affected the neutrophil samples. We also detected one sample swap in the Naranbhai_2015 dataset.
RNA-seq data
Pre-processing
For each study, we downloaded the raw RNA-seq data from one of six databases (European Genome-phenome Archive (EGA), European Nucleotide Archive (ENA), ArrayExpress, Gene Expression Omnibus (GEO), Database of Genotypes and Phenotypes (dbGaP), Synapse). If the data were already in fastq format, we proceeded directly to quantification. If the raw data were shared in BAM or CRAM format, we used samtools v1.6 [57] to first collate paired-end reads with samtools collate and then used the samtools fastq command with ‘-F 2816 -c 6’ flags to convert the CRAM or BAM files to fastq. Since samples from GEO and dbGaP were stored in SRA format, we used the fastq-dump command with ‘--split-files --gzip --skip-technical --readids --dumpbase --clip’ flags to convert those to fastq. The pre-processing scripts are available from the rnaseq quantification pipeline GitHub repository (https://github.com/eQTL-Catalogue/rnaseq).
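The value 2816 passed to ‘-F’ is a bitmask over the FLAG field defined in the SAM format specification. The short Python sketch below (an illustration only, with a hypothetical helper name) decodes such a mask and shows that 2816 combines the secondary-alignment, QC-fail and supplementary-alignment bits, so reads carrying any of these bits are excluded from the fastq output.

```python
# FLAG bits as defined in the SAM format specification.
SAM_FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "read1",
    0x80: "read2",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_sam_flag(flag):
    """Return the names of all SAM FLAG bits set in an integer mask."""
    return [name for bit, name in SAM_FLAG_BITS.items() if flag & bit]


excluded = decode_sam_flag(2816)
```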
Quantification
We quantified transcription at four different levels: (1) gene expression, (2) exon expression, (3) transcript usage and (4) transcriptional event usage. Quantification was performed with a Nextflow-based [18] pipeline that we developed by adding new quantification methods to the nf-core rnaseq pipeline [17]. Before quantification, we used Trim Galore v0.5.0 to remove sequencing adapters from the fastq files.
For gene expression quantification, we used HISAT2 v2.1.0 [58] to align reads to the GRCh38 reference genome (Homo_sapiens.GRCh38.dna.primary_assembly.fa file downloaded from Ensembl). We counted the number of reads overlapping the genes in the GENCODE V30 [59] reference transcriptome annotations with featureCounts v1.6.4 [60]. To quantify exon expression, we first created an exon annotation file (GFF) using the GENCODE V30 reference transcriptome annotations and the dexseq_prepare_annotation.py script from the DEXSeq [19] package. We then used the aligned RNA-seq BAM files from the gene expression quantification and featureCounts with flags ‘-p -t exonic_part -s ${direction} -f -O’ to count the number of reads overlapping each exon.
We quantified transcript and event expression with Salmon v0.13.1 [61]. For transcript quantification, we used the GENCODE V30 (GRCh38.p12) reference transcript sequences (fasta) file to build the Salmon index. For transcriptional event usage, we downloaded pre-computed txrevise [15] alternative promoter, splicing and alternative 3ʹ end annotations corresponding to Ensembl version 96 from Zenodo (https://doi.org/10.5281/zenodo.3232932) in GFF format. We then used gffread to generate fasta sequences from the event annotations and built Salmon indexes for each event set as we did for transcript usage. Finally, we quantified transcript and event expression using salmon quant with ‘--seqBias --useVBOpt --gcBias --libType’ flags. All expression matrices were merged using csvtk v0.17.0. The pipeline is publicly available at https://github.com/eQTL-Catalogue/rnaseq. Our reference transcriptome annotations are available from Zenodo (https://doi.org/10.5281/zenodo.3366280).
Detecting outliers from gene expression data
We performed the quality control measures using only the gene expression count matrix. In all downstream analyses, we only included 35,367 protein coding and non-coding RNA genes belonging to one of the following Ensembl gene types: lincRNA, protein_coding, IG_C_gene, IG_D_gene, IG_J_gene, IG_V_gene, TR_C_gene, TR_D_gene, TR_J_gene, TR_V_gene, 3prime_overlapping_ncrna, known_ncrna, processed_transcript, antisense, sense_intronic, sense_overlapping. For PCA and MDS analyses, we first filtered out invalid gene types (23,458) and genes on sex chromosomes (1,247), TPM normalised [62] the gene counts, filtered out genes with a median normalised expression value of less than 1 and log2 transformed the matrix. We performed principal component analysis with the prcomp function from the R stats package (center = TRUE, scale. = TRUE). For multidimensional scaling (MDS) analysis, we used the isoMDS method from the MASS R package with k = 2 dimensions. As a distance metric for isoMDS, we used 1 - Pearson correlation as recommended previously [63]. We plotted these two-dimensional scatter plots to visually identify outliers (Supplementary Figure 4A-B).
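The normalisation and distance steps that feed the MDS analysis can be sketched as follows. This is a simplified Python illustration, not the R code used in the workflow; the function names are hypothetical, the gene lengths are taken as given inputs, and the pseudocount of 1 in the log2 transformation is an assumption, since the text does not specify one.

```python
import math
import statistics


def tpm(counts, lengths):
    """Transcripts per million for one sample, given counts and gene lengths."""
    rates = [count / length for count, length in zip(counts, lengths)]
    total = sum(rates)
    return [rate / total * 1e6 for rate in rates]


def passes_median_filter(values_across_samples, min_median=1.0):
    """Keep a gene if its median normalised expression is at least 1."""
    return statistics.median(values_across_samples) >= min_median


def log2_transform(values, pseudocount=1.0):
    """log2 transform; the pseudocount is an assumption of this sketch."""
    return [math.log2(value + pseudocount) for value in values]


def pearson(x, y):
    """Pearson correlation between two equally long numeric vectors."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_distance(x, y):
    """Distance between two samples, defined as 1 - Pearson correlation."""
    return 1.0 - pearson(x, y)


# Toy counts for three genes in one sample (illustrative values only).
toy_tpm = tpm([10, 20, 30], [1000, 2000, 1500])
```

The pairwise distances computed this way would form the matrix passed to an MDS routine such as isoMDS.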
Sex-specific gene expression analysis
Previous studies have successfully used the expression of XIST and Y chromosome genes to ascertain the genetic sex of RNA samples [64]. In our analysis, we extracted the expression values of all protein coding genes from the Y chromosome and of the XIST gene (ENSG00000229807) and TPM normalised them. Then, we calculated the mean expression of the Y chromosome genes. Finally, we plotted XIST gene expression (X axis) against the mean expression of Y chromosome genes (Y axis) on a log2 scale (Supplementary Figure 4C). In addition to detecting samples with incorrectly labelled genetic sex, this analysis also allowed us to identify cross-contamination between samples (XIST and Y chromosome genes expressed simultaneously, Supplementary Figure 4C).
Concordance between genotype data and RNA-seq samples
We used the Match BAM to VCF (MBV) method from QTLtools [65], which directly compares the sample genotypes in a VCF file to an aligned RNA-seq BAM file. MBV can detect sample swaps, genotypes from the same donor and cross-contaminated genotypes in the VCF file. In some cases, such cross-contamination was confirmed by both the sex-specific gene expression and MBV analyses (Supplementary Figure 4D).
Normalisation
We filtered out samples that failed the QC step. We normalised the gene and exon-level read counts using the conditional quantile normalisation (cqn) R package v1.30.0 [66]. We downloaded the gene GC content estimates from Ensembl biomaRt and calculated the exon-level GC content using bedtools v2.19.0 [67]. We also excluded lowly expressed genes, where 95 per cent of the samples within a biological context had TPM normalised expression less than 1. To calculate transcript and transcriptional event usage values, we obtained the TPM normalised transcript (event) expression estimates from Salmon and divided those by the total expression of all transcripts (events) from the same gene (event group). Subsequently, we used the inverse normal transformation to standardise the transcript and event usage estimates. Normalisation scripts together with containerised software are publicly available at https://github.com/eQTL-Catalogue/qtl_norm_qc.
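The usage calculation and the inverse normal transformation can be illustrated with a short Python sketch. It is not the code from the qtl_norm_qc repository: the function names are hypothetical, and the rank offset of 0.5 and the averaging of tied ranks are assumptions of this example, since several conventions for the rank-based transformation exist.

```python
from statistics import NormalDist


def usage_ratios(tpm_by_transcript):
    """Divide each transcript's TPM by the total TPM of its gene.

    tpm_by_transcript: {transcript_id: TPM} for one gene in one sample.
    Returns None for every transcript if the gene is not expressed at all.
    """
    total = sum(tpm_by_transcript.values())
    if total == 0:
        return {tx: None for tx in tpm_by_transcript}
    return {tx: value / total for tx, value in tpm_by_transcript.items()}


def inverse_normal_transform(values):
    """Rank-based inverse normal transformation across samples.

    Ties receive their average rank; quantiles are (rank - 0.5) / n, which is
    one common convention and an assumption of this sketch.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # 1-based average rank of the tie block
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    standard_normal = NormalDist()
    return [standard_normal.inv_cdf((rank - 0.5) / n) for rank in ranks]


# Toy example: one gene with two transcripts, then usage across three samples.
toy_usage = usage_ratios({"tx1": 30.0, "tx2": 10.0})
toy_transformed = inverse_normal_transform([0.2, 0.9, 0.5])
```

Because only ranks enter the transformation, the transformed values are robust to outlying usage estimates.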
Metadata harmonisation
We mapped all RNA-seq and microarray samples to a minimal metadata model. This included consistent sample identifiers, information about the cell type or tissue of origin, biological context (e.g. stimulation), genetic sex, experiment type (RNA-seq or microarray) and properties of the RNA-seq protocol (paired-end vs single-end; stranded vs unstranded; poly(A) selection vs total RNA). To ensure that cell type and tissue names were consistent between studies and to facilitate easier integration of additional studies, we used Zooma (https://www.ebi.ac.uk/spot/zooma/) to map cell types and tissues to a controlled vocabulary of ontology terms from Uber-anatomy ontology (Uberon) [20], Cell Ontology [21] or Experimental Factor Ontology (EFO) [22]. We opted to use an ad hoc controlled vocabulary to represent biological contexts, as those often included terms and combinations of terms that were missing from ontologies.
Association testing
We developed a Nextflow-based pipeline which takes a normalised phenotype expression matrix, a genotype VCF file and metadata files and produces association summary statistics for all molecular phenotypes. We performed association testing separately in each biological context (also known as ‘qtl group’) and used a +/− 1 megabase cis window centered around the start of each gene. First, we excluded molecular phenotypes that had fewer than 5 genetic variants in their cis window, as these were likely to reside in regions with poor genotyping coverage. We also excluded molecular phenotypes with zero variance across all samples and calculated phenotype principal components using the prcomp function from the R stats package (center = TRUE, scale. = TRUE). We calculated genotype principal components using plink2 v1.90b3.35. We used the first six genotype and phenotype principal components as covariates in QTL mapping. For association testing, we used QTLtools v1.1 [68] nominal and permutation passes in cis. For the nominal pass, we used the ‘--window 1000000 --nominal 1’ flags to find all associations in the 1 Mb cis window. For the permutation pass, we used the ‘--window 1000000 --permute 1000 --grp-best’ flags in order to calculate empirical p-values based on 1000 permutations. The ‘--grp-best’ option ensured that the permutations were performed across all phenotypes within the same ‘group’ (e.g. multiple probes per gene in microarray data or multiple transcripts or exons per gene in the exon-level and transcript-level analysis) and the empirical p-value was calculated at the group level.
Colocalisation
We used the GWAS Catalog [69] API (https://www.ebi.ac.uk/gwas/docs/api) to download the rheumatoid arthritis [42] GWAS summary statistics (accession GCST002318) from the CD40 locus (GRCh38 coordinates: chr20:45,980,000-46,200,000). We downloaded the eQTL summary statistics from the eQTL Catalogue API and performed colocalisation using the coloc R package [12] with default prior probabilities.
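For readers unfamiliar with the method, the following Python sketch outlines the approximate Bayes factor calculation that underlies this type of colocalisation analysis. It is a simplified illustration and not the coloc R package itself: the function names are hypothetical, the prior standard deviation of 0.15 for effect sizes is an assumption of this sketch, the prior probabilities p1 = p2 = 10⁻⁴ and p12 = 10⁻⁵ are the values commonly used with this approach, and the input effect sizes are toy numbers.

```python
import math


def log_abf(beta, se, prior_sd=0.15):
    """Wakefield log approximate Bayes factor for one variant."""
    z = beta / se
    r = prior_sd ** 2 / (prior_sd ** 2 + se ** 2)
    return 0.5 * (math.log(1.0 - r) + r * z * z)


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) that tolerates -inf entries."""
    largest = max(values)
    if largest == float("-inf"):
        return largest
    return largest + math.log(sum(math.exp(v - largest) for v in values))


def logdiffexp(a, b):
    """log(exp(a) - exp(b)) for a >= b; -inf if the difference vanishes."""
    gap = -math.expm1(b - a)
    return a + math.log(gap) if gap > 0 else float("-inf")


def coloc_posteriors(stats1, stats2, p1=1e-4, p2=1e-4, p12=1e-5):
    """Posterior probabilities of hypotheses H0-H4 for two traits.

    stats1, stats2: lists of (beta, standard error) for the same variants,
    in the same order. H3 = distinct causal variants, H4 = shared variant.
    """
    l1 = [log_abf(beta, se) for beta, se in stats1]
    l2 = [log_abf(beta, se) for beta, se in stats2]
    l12 = [a + b for a, b in zip(l1, l2)]
    log_h = {
        "H0": 0.0,
        "H1": math.log(p1) + logsumexp(l1),
        "H2": math.log(p2) + logsumexp(l2),
        "H3": math.log(p1) + math.log(p2)
              + logdiffexp(logsumexp(l1) + logsumexp(l2), logsumexp(l12)),
        "H4": math.log(p12) + logsumexp(l12),
    }
    total = logsumexp(list(log_h.values()))
    return {h: math.exp(value - total) for h, value in log_h.items()}


# Toy data: three variants, standard error 0.1 throughout.
trait = [(0.60, 0.1), (0.05, 0.1), (0.02, 0.1)]
shared = coloc_posteriors(trait, [(0.55, 0.1), (0.03, 0.1), (0.01, 0.1)])
distinct = coloc_posteriors(trait, [(0.03, 0.1), (0.55, 0.1), (0.01, 0.1)])
```

When both traits peak at the same variant, most of the posterior mass falls on H4; when they peak at different variants, H3 is favoured over H4, mirroring the contrast between the macrophage and monocyte results at the CD40 locus.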
Software availability
Data analysis pipelines:
RNA-seq quantification: https://github.com/eQTL-Catalogue/rnaseq
Normalisation and QC: https://github.com/eQTL-Catalogue/qtl_norm_qc
Genotype QC: https://github.com/eQTL-Catalogue/genotype_qc
Association testing: https://github.com/eQTL-Catalogue/qtlmap
Example use cases:
Colocalisation in R using GWAS Catalog and eQTL Catalogue APIs: https://github.com/eQTL-Catalogue/eQTL-Catalogue-resources/blob/master/scripts/eQTL_API_usecase.Rmd
Python example for querying the HDF5 files: https://github.com/eQTL-Catalogue/eQTL-SumStats/blob/master/querying_hdf5_basics.ipynb
Data availability
The full association summary statistics in HDF5 and TSV format can be downloaded from the eQTL Catalogue website (https://www.ebi.ac.uk/eqtl/Data_access/). Slices of the TSV files can be accessed using tabix. All of the summary statistics are also available via the REST API (https://www.ebi.ac.uk/eqtl/api-docs/). Database accessions for the raw gene expression and genotype datasets are listed on the eQTL Catalogue website (https://www.ebi.ac.uk/eqtl/Datasets/). Our summary statistics have also been integrated into the Open Targets Genetics Portal (https://genetics.opentargets.org/) and gene expression matrices will be made available via the EMBL-EBI Expression Atlas [70].
Author contributions
NK and KA developed the data analysis and quality control workflows and performed quality control of the data. NK processed the RNA-seq datasets and performed the QTL analysis. JH developed and implemented the eQTL Catalogue API. JM processed the gene expression data for the Expression Atlas. LK performed microarray gene expression data normalisation and quality control. KP and MS developed the initial version of the population assignment workflow. TB, SJ, IP, DZ and KA supervised the work. NK and KA wrote the manuscript with input from all authors.
Funding
NK, JH and JM were supported by a grant from Open Targets (OTAR2-046). TB, SJ, IP, HP and DZ were supported by core EMBL funds. KA was supported by the European Regional Development Fund and the programme Mobilitas Pluss (MOBJD67). KA also received funding from the European Union’s Horizon 2020 research and innovation programme (grant number 825775) and Estonian Research Council (grants IUT34-4 and PSG415). LK was supported by the Estonian Research Council grant PSG59. KA, NK and LK were also supported by Estonian Centre of Excellence in ICT Research (EXCITE) funded by the European Regional Development Fund.
Funding for datasets in the eQTL Catalogue
BLUEPRINT
This study makes use of data generated by the Blueprint Consortium. A full list of the investigators who contributed to the generation of the data is available from www.blueprint-epigenome.eu. Funding for the project was provided by the European Union’s Seventh Framework Programme (FP7/2007-2013) under grant agreement no 282510 - BLUEPRINT.
Fairfax_2012, Fairfax_2014 and Naranbhai_2015
Funding for the project was provided by the Wellcome Trust under awards 088891 [B.P.F.], 074318 [J.C.K.] and 075491/Z/04 to the core facilities at the Wellcome Trust Centre for Human Genetics, the European Research Council under the European Union’s Seventh Framework Programme (FP7/2007-2013) (281824 to J.C.K.), the Medical Research Council (98082, J.C.K.) and the National Institute for Health Research (NIHR) Oxford Biomedical Research Centre.
TwinsUK
TwinsUK is funded by the Wellcome Trust, Medical Research Council, European Union, the National Institute for Health Research (NIHR)-funded BioResource, Clinical Research Facility and Biomedical Research Centre based at Guy’s and St Thomas’ NHS Foundation Trust in partnership with King’s College London.
BrainSeq
This research was supported by the Intramural Research Program of the NIMH (NCT00001260, 900142).
Schmiedel_2018
This work was funded by the William K. Bowes Jr Foundation (P.V.) and NIH grants R24AI108564 (P.V., B.P., A.R., M.K.), S10RR027366 (BD FACSAria II), and S10OD016262 (Illumina HiSeq 2500).
ROSMAP
Study data were provided by the Rush Alzheimer’s Disease Center, Rush University Medical Center, Chicago. Data collection was supported through funding by NIA grants P30AG10161, R01AG15819, R01AG17917, R01AG30146, R01AG36836, U01AG32984, U01AG46152, the Illinois Department of Public Health, and the Translational Genomics Research Institute.
GENCORD
Emmanouil T Dermitzakis was supported by grants from the European Research Council (260927), Swiss National Science Foundation (31003A_130342, CRSI33_130326), Louis-Jeantet Foundation, and the Blueprint Consortium. Stylianos E Antonarakis was supported by grants from the European Research Council (249968), Swiss National Science Foundation (144082), and the Blueprint Consortium.
van_de_Bunt_2015
MvdB is supported by a Novo Nordisk postdoctoral fellowship run in partnership with the University of Oxford. ALG is a Wellcome Trust Senior Research Fellow in Basic Biomedical Science (095010/Z/10/Z). MIM is a Wellcome Trust Senior Investigator (WT098381) and a National Institute of Health Research Senior Investigator. PEM holds the Canada Research Chair in Islet Biology. This work was supported in part in Oxford, UK, by grants from the Medical Research Council (MRC; MR/L020149/1) and National Institutes of Health (NIH; R01 MH090941), and in Edmonton, Canada, by operating grants to PEM from the Canadian Institutes of Health Research (CIHR; MOP244739) and the ADI/Johnson & Johnson Diabetes Research Fund. Human islet isolations at the Alberta Diabetes Institute IsletCore were funded by the Alberta Diabetes Foundation and the University of Alberta. The National Institute for Health Research, Oxford Biomedical Research Centre funded islet provision at the Oxford Human Islet Isolation facility. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Supplementary Materials
Acknowledgements
The RNA-seq quantification and QTL analyses were performed at the High Performance Computing Center, University of Tartu. We thank Eleri Vako from the Grant Office of the University of Tartu, and Holly Foster and Paris Litterick from Open Targets for assistance in setting up data access agreements. We thank Jeremy Schwartzentruber for his helpful comments on the manuscript and Daniel Gaffney for guidance in setting up this project.
Footnotes
* These authors jointly supervised this work.