Abstract
Intelligence is highly heritable1 and a major determinant of human health and well-being2. Recent genome-wide meta-analyses have identified 24 genomic loci linked to intelligence3–7, but much about its genetic underpinnings remains to be discovered. Here, we present the largest genetic association study of intelligence to date (N=279,930), identifying 206 genomic loci (191 novel) and implicating 1,041 genes (963 novel) via positional mapping, expression quantitative trait locus (eQTL) mapping, chromatin interaction mapping, and gene-based association analysis. We find enrichment of genetic effects in conserved and coding regions and identify 89 nonsynonymous exonic variants. Associated genes are strongly expressed in the brain and specifically in striatal medium spiny neurons and cortical and hippocampal pyramidal neurons. Gene-set analyses implicate pathways related to neurogenesis, neuron differentiation and synaptic structure. We confirm previous strong genetic correlations with several neuropsychiatric disorders, and Mendelian Randomization results suggest protective effects of intelligence for Alzheimer’s dementia and ADHD, and bidirectional causation with strong pleiotropy for schizophrenia. These results are a major step forward in understanding the neurobiology of intelligence as well as genetically associated neuropsychiatric traits.
We performed a genome wide meta-analysis of 16 independent cohorts totaling 279,930 participants of European ancestry and 9,398,186 genetic variants passing quality control (Online Methods; Supplementary Table 1; Supplementary Figure 1). All genome-wide analyses were corrected for cohort-specific ancestry and covariates (Supplementary Information). Various measures of intelligence were used in each study, yet genetic correlations between cohorts (rg, Online Methods), were considerable (mean=0.63), warranting meta-analysis (Supplementary Table 2; Supplementary Results 2.1). Separate meta-analyses for children, young adults, and adults (Online Methods) indicated high genetic correlations between age groups (rg>0.62), and comparable single nucleotide polymorphism (SNP)-based heritability across age (h2SNP=0.19-0.22) (Supplementary Table 3; Supplementary Results 2.2). The inflation factor of the full meta-analysis was λGC=1.95 (Supplementary Table 4; Supplementary Figure 2), with h2SNP=0.18 (SE=0.01), in line with previous findings4,5, and an LD score intercept8 of 1.09 (SE=0.02) indicated that most of the inflation could be explained by polygenic signal and large sample size6.
In the meta-analysis, 12,701 variants indexed by 531 independently significant SNPs (r2<0.6) and 246 lead SNPs in approximate linkage equilibrium (r2<0.1; Online Methods) reached genome-wide significance (GWS; P<5×10−8) (Figure 1a; Supplementary Table 5; Supplementary Figure 3). These were located in 206 distinct genomic loci, 191 of which are novel associations (Supplementary Results 2.3). Proxy replication with the correlated phenotype of educational attainment (EA; rg=0.73) in an independent sample (Online Methods) indicated sign concordance for 94% of GWS SNPs (P<1×10−300) and evidence of replication for 51 loci (Supplementary Results 2.3.2; Supplementary Table 6). Using polygenic score prediction9,10 (Online Methods) we show that the current results explain up to 5.4% of the variance in four independent samples (Supplementary Table 7, Supplementary Results 2.3.3).
(a) Manhattan plot showing the ─log10 transformed P-value of each SNP on the y-axis and base pair positions along the chromosomes on the x-axis. The dotted red line indicates genome-wide significance (P<5×10−8), the blue line the threshold for suggestive associations (P<1×10−5). Independent lead SNPs are indicated by a diamond. (b) Heritability enrichment of 28 functional SNP annotations calculated with stratified LD score regression TSS = Transcription Start Site; CTCF=CCCTC-binding factor; DHS=DNase Hypersensitive Site. (c) Distribution of functional consequences of SNPs in genomic risk loci in the meta-analysis. (d) Distribution of RegulomeDB score for SNPs in genomic risk loci, with a low score indicating a higher likelihood of having a regulatory function (Online methods). (e) The minimum chromatin state across 127 tissue and cell types for SNPs in genomic risk loci, with lower states indicating higher accessibility and states 1-7 referring to open chromatin states (Online Methods).
We observed strong enrichment for heritability (Online Methods; Supplementary Results 2.3.4) of SNPs located in conserved regions of the genome (P=1.84×10−12), coding regions (P=7.88×10−7), H3K9ac histone regions/peaks (P<6.06×10−5), and super-enhancers (P=9.61×10−5) (Figure 1b; Supplementary Table 8). Conserved regions have previously been implicated for intelligence11 but coding regions have not. Heritability was disproportionately found among common variants (Supplementary Figure 5) with greatest enrichment for SNPs with a minor allele frequency (MAF) between 0.4 and 0.5 (P=5.81×10−12), but was distributed proportionally across chromosomes (Supplementary Figure 6).
Functional annotation of all SNPs (n=23,552) in the associated loci was performed using FUMA12 (Online Methods). SNPs were mostly located in intronic (n=12,171; 51.7%) and intergenic areas (n=9,923; 42.1%) (Supplementary Table 9; Figure 1c), yet 6.3% (1,473 SNPs) were annotated to functional genic regions, with 1.4% (318 SNPs) being exonic. Of these, 89 (41 GWS) SNPs were exonic non-synonymous (ExNS) (Table 1, Supplementary Results 2.3.5). Convergent evidence of strong association (Z=9.74) and the highest observed probability of a deleterious protein effect (CADD13 score=34) was found for rs13107325. This missense mutation (MAF=0.065) in SLC39A8 was the lead SNP in locus 71 and the ancestral allele C was associated with higher intelligence scores. The effect sizes for ExNS were individually small, with each effect allele 0.01 to 0.05 standard deviations. Table 1, Supplementary Table 9 and Supplementary Results 2.3.5 present a detailed catalog of the functional impact of variants in the genomic risk loci. Apart from protein consequences, the implicated SNPs also showed some evidence of indirect functional effects: 4.4% had a RegulomeDB14 score of 1a-1f (Figure 1d), suggesting a regulatory function, and the majority of SNPs (81.6%) were in open chromatin regions15,16, as indicated by a minimum chromatin state of 1-7 (Figure 1e).
To link the associated genetic variants to genes, we applied three gene-mapping strategies as implemented in FUMA12 (Online Methods). Positional gene-mapping aligned SNPs to 514 genes by location, eQTL (expression quantitative trait loci) gene-mapping matched cis-eQTL SNPs to 709 genes whose expression levels they influence, and chromatin interaction mapping annotated SNPs to 226 genes based on three-dimensional DNA-DNA interactions between the SNPs’ genomic region and nearby or distant genes (Figure 2; Supplementary Figure 7-8; Supplementary Table 10-12). Of 882 total unique genes, 438 genes were implicated by at least two mapping strategies and 129 by all 3 (Figure 3). Of these, 15 genes are particularly notable as they are implicated via chromatin interactions between two independent genomic risk loci (Supplementary Table 11). VAMP4 (locus 14), shows interactions in 6 tissue types including interactions with locus 15 in the left ventricle (Figure 2a). SATB2 (locus 44) is linked by interaction in liver tissue to locus 43 (Figure 2b). MEF2 (locus 82) shows interactions with locus 83 in 5 tissues (Figure 2c). FBXL17 and MAN2A1 are in two independent loci (87 and 88 respectively); they are mapped by eQTL associations and chromatin interactions between the two loci in the left ventricle (Figure 2c). Loci 102 and 103 show multiple interactions in one of 7 tissue types that are mapped to 8 genes encoding zinc finger proteins or histones (Figure 2d). ELAVL2 (locus 130) interacts with locus 129 in the left ventricle and is also mapped by intra-locus interactions in other tissues (Figure 2e). ATF4 (locus 212) is mapped by eQTLs in 3 tissue types and chromatin interactions in 7 tissue types, and interacts with locus 213 in the left ventricle (Figure 2f).
Circos plots showing genes on chromosomes 1 (a), 2 (b) 5 (c) 6 (d) 9 (e) and 22 (f) that were implicated as genomic risk loci (blue regions) by positional mapping, eQTL mapping (green lines connecting an eQTL SNP to its associated gene), and/or chromatin interaction (orange lines connecting two interacting regions) and showed evidence of interaction across two independent genomic risk loci. Genes implicated by both eQTL and chromatin interactions mapping are in red. The outer layer shows a Manhattan plot containing the ─log10 transformed P-value of each SNP in the GWAS meta-analysis, with genome-wide significant SNPs in color corresponding to linkage disequilibrium patterns with the lead SNP. Circos plots for all chromosomes are provided in Supplementary Fig. 7.
(a) Manhattan plot of the genome-wide gene-based association analysis (GWGAS). The y-axis shows the ─log10 transformed P-value of each gene, and the chromosomal position on the x-axis. The red dotted line indicates the threshold for genome-wide significance of the gene-based test (P<2.76×10−6; 0.05/18,128), and the blue line indicates the suggestive threshold (P<2.76×10−5; 0.5/18,128) (b) Venn diagram showing overlap of genes implicated by positional mapping, eQTL mapping, chromatin interaction mapping, and GWGAS. (c) Gene expression profiles of identified genes for 53 tissue types. Expression data were extracted from the Genotype-Tissue Expression (GTEx) database. Expression values (RPKM) were log2 transformed with pseudocount 1 after winsorization at 50 and averaged per tissue. (d) Single-cell gene-expression analysis of genes related to intelligence in 24 cell-types. The dotted blue line indicates the Bonferroni-corrected significance threshold (P=0.05/7,323=6.83×10−6).
We performed genome-wide gene-based association analysis (GWGAS) using MAGMA17 (Online Methods). This approach provides aggregate association P-values based on all SNPs in a gene, whereas FUMA annotates individually significant SNPs to genes. GWGAS identified 524 associated genes (467 novel) (Figure 3a; Supplementary Table 13; Supplementary Results 2.4.1), of which 159 were outside of the GWAS risk loci, and 365 were also mapped by FUMA (Figure 3b). In total, 92 genes were implicated by all four strategies (Supplementary Table 14).
In gene-set analysis using the GWGAS results (Online Methods), six Gene Ontology18 gene-sets were significantly associated with intelligence: neurogenesis (Beta=0.153, SE=0.030, P=1.55×10−7), neuron differentiation (Beta=0.178, SE=0.038, P=1.36×10−6), central nervous system neuron differentiation (Beta=0.398, SE=0.089, P=3.97×10−6), regulation of nervous system development (Beta=0.187, SE=0.040, P=1.54×10−6), positive regulation of nervous system development (Beta=0.242, SE=0.052, P=1.93×10−6), and regulation of synapse structure or activity (Beta=0.153, SE=0.030, P=5.87×10−6) (Supplementary Table 15). Conditional analysis indicated that there were three independent associations, for the neurogenesis, central nervous system neuron differentiation, and regulation of synapse structure or activity processes, which together accounted for the associations of the other three sets (Supplementary Results 2.4.2).
Linking gene-based P-values to tissue-specific gene-sets (Online Methods), we observed strong associations across various brain areas (Figure 3c; Supplementary Table 16; Supplementary Results 2.4.2), most strongly with the cortex (P=5.12×10−9), and specifically frontal cortex (P=4.94×10−9). In brain single-cell expression gene-set analyses (Online Methods), we found significant associations of striatal medium spiny neurons (P=1.47×10−13) and pyramidal neurons in the CA1 hippocampal (P=4×10−11) and cortical somatosensory regions (P=3×10−9), (Figure 3d; Supplementary Table 17). Conditional analysis showed that the independent association signal in brain cells was driven by medium spiny neurons, neuroblasts, and pyramidal CA1 neurons (Supplementary Results 2.4.2).
Intelligence has been associated with a wide variety of human behaviors19 and brain anatomy20. Confirming previous reports5,6, we observed negative genetic correlations (Online methods) with ADHD (rg=-0.36, P=1.97×10−24), depressive symptoms (rg=-0.27, P=5.77×10−10), Alzheimer’s disease (rg=-0.26, P=5.77×10−10), and schizophrenia (rg=-0.22, P=2.58×10−18) and positive correlations with EA (rg=0.70, P<1×10−200) and longevity (rg=0.43, P=4.91×10−8) (Supplementary Table 18; Supplementary Figure 9). Comparison of our results with the contents of the NHGRI-21 EBI catalog21 supported these correlations, showing numerous shared genetic variants (Supplementary Information 2.5; Supplementary Table 19-20). Low enrichment (91 of 1,518 genes, hypergeometric P=0.03) was found for genes previously linked to intellectual disability or developmental delay (see URLs; Online Methods). However, our results replicate and add to previous genetic research on normal variation in intelligence, as catalogued in Supplementary Tables 21-22.
We used Mendelian Randomization (Online Methods) to test for potential credible causal associations between intelligence and genetically correlated traits (Supplementary Table 23; Supplementary Figures 10-11). We observed a strong effect of intelligence on EA (bxy=0.531, SE=0.006, P<1×10−320), that was bidirectional and showed a similar strong effect of EA on intelligence (bxy=0.517, SE=0.025, P=1.06×10−96), with only a small proportion of SNPs showing pleiotropic effects. Our result also suggested a protective effect of intelligence on ADHD (OR=0.46, bxy=-0.778, SE=0.051, P=3.80×10−45) and Alzheimer’s disease (OR=0.66, bxy=-0.411, SE=0.058, P=1.75×10−12). In line with a positive genetic correlation, we observed that intelligence was associated with higher risk of autism (OR=1.47, bxy=0.382, SE=0.099, P=1.10×10−4). There was evidence of a bidirectional association between intelligence and schizophrenia including a strong protective effect of intelligence on schizophrenia (OR=0.58, bxy=-0.551, SE=0.043, P=3.50×10−30), and a relatively smaller reverse effect (bxy= -0.195, SE= 0.012, P=2.02×10−57), with additional evidence for pleiotropy (Supplementary Results 2.5.3).
In conclusion, we conducted a large-scale genome-wide meta-analysis of intelligence in 279,930 individuals, resulting in the identification of 191 novel loci and 963 novel genes, and replicating previous associations with 15 loci and 78 genes. The applied combined strategies of functional annotation and gene-mapping and the use of unique biological data resources provide extensive information on functional consequences of relevant genetic variants and novel insight into underlying neurobiological pathways, and point towards the involvement of specific cell types. We also found suggestive evidence of causal associations between intelligence and neuropsychiatric traits. These results are important not only for understanding the biological underpinnings of individual differences in intelligence, but also contribute to our understanding of cognitive and related psychiatric disorders.
http://www.biorxiv.org/content/early/2017/07/20/166298
http://hrsonline.isr.umich.edu
https://www.ncbi.nlm.nih.gov/gap
http://ctg.cncr.nl/software/magma
http://genome.sph.umich.edu/wiki/METAL_Program
http://ldsc.broadinstitute.org/
https://data.broadinstitute.org/alkesgroup/LDSCORE/
http://www.med.unc.edu/pgc/results-and-downloads
http://software.broadinstitute.org/gsea/msigdb/collections.jsp
https://github.com/ivankosmos/RegionAnnotator
http://cnsgenomics.com/software/gsmr/
Author Contributions
D.P., J.E.S. and P.R.J. performed the analyses. D.P. conceived the idea of the study and supervised analyses. S.St. performed QC on the UK Biobank data and wrote the analysis pipeline. K.W. constructed and applied the FUMA pipeline for performing follow-up analyses. J.B. conducted the single cell enrichment analyses. C.L, M.N., A.R.H., T.J.C.P., and S.v.d.S. assisted with pipeline development and data analysis. S.A., P.B.B., J.R.I.C., K.L.G., J.K., R.K., E.K., M.L., M.N., C.A.R., J.W.T., H.Y., D.Z., S.H., N.K.H., I.K.K., S.L., G.W.M., A.B.M.-M., E.B.Q., G.S., N.S., B.T.W., T.W., D.E.A., D.K.A., D.A., R.M.B., P.B., K.E.B., T.D.C., O.C.-F., A.C., E.T.C., E.C., A.C., G.D., I.J.D., P.D., D.D., S.D., G.D., E.D.C., J.G.E., T.E., A.R.H., A.H., M.C.K., E.K., B.K., J.L., S.L., T.L., D.C.L., E.L., A.J.L., A.K.M., I.M., D.M., A.C.N., W.O., A.Palotie., A.Payton., N.P., R.A.P., K.R., I.R., P.R., D.R., F.W.S., M.A.S., O.B.S., N.S., J.M.S., V.M.S., N.C.S., R.E.S., K.S., A.N.V., D.R.W., E.W., J.Y., G.A., O.A.A., G.B., L.C., B.D., D.M.D., A.H., J.H.-L., M.A.I., K.S.K., N.G.M., S.E.M., N.L.P., R.P., T.J.C.P., S.R., P.F.S., H.T., S.I.V., and M.J.W. contributed data. D.P., J.E.S. and P.R.J. wrote the paper. All authors critically reviewed the paper.
Author Information
PF Sullivan reports the following potentially competing financial interests: Lundbeck (advisory committee), Pfizer (Scientific Advisory Board member), and Roche (grant recipient, speaker reimbursement). G Breen reports consultancy and speaker fees from Eli Lilly and Illumina and grant funding from Eli Lilly. All other authors declare no financial interests or potential conflicts of interest.
Correspondence and requests for materials should be addressed to d.posthuma@vu.nl.
Online methods
Study Cohorts
The meta-analysis included new and previously reported GWAS summary statistics from 16 cohorts: UK Biobank (UKB), Cognitive Genomics Consortium (COGENT), Rotterdam Study (RS), Generation R Study (GENR), Swedish Twin Registry (STR), Spit for Science (S4S), High-IQ/Health and Retirement Study (HiQ/HRS), Twins Early Development Study (TEDS), Danish Twin Registry (DTR), IMAGEN, Brisbane Longitudinal Twin Study (BLTS), Netherlands Study of Cognition, Environment and Genes (NESCOG), Genes for Good (GfG), Swedish Twin Studies of Aging (STSA), Atherosclerosis Risk in Communities Study (ARIC), and the Multi-Ethnic Study of Atherosclerosis (MESA). Detailed descriptions of the samples, measures, genotyping, quality control, and analysis procedures for each cohort are provided in Supplementary Information 1.1 and Supplementary
Meta-analysis
Stringent quality control measures were applied to the summary statistics for each GWAS cohort before combining. All files were checked for data integrity and accuracy. SNPs were filtered from further analysis if they met any of the following criteria: imputation quality (INFO/R2) score < 0.6, Hardy-Weinberg equilibrium (HWE) P < 5×10−6, study-specific minor allele frequency (MAF) corresponding to a minor allele count (MAC) < 100, and mismatch of alleles or allele frequency difference greater than 20% from the Haplotype Reference Consortium (HRC) genome reference panel16. Some cohorts used more stringent criteria (see Supplementary Information 1.1). Indels and SNPs that were duplicated, multi-allelic, monomorphic, or ambiguous (A/T or C/G) with a MAF >0.4 were also excluded. Visual inspection of the distribution of the summary statistics was completed, and Manhattan plots and QQ plots were created for the cleaned statistics from each cohort (Supplementary Figure 1).
The SNP association P-values from the GWAS cohorts were meta-analyzed with METAL22 (see URLs) in two phases. First, we meta-analyzed all cohorts with quantitative phenotypes (all except HiQ/HRS) using a sample-size weighted scheme. In the second phase, we added the HiQ/HRS study results to the first phase results, weighting each set of summary statistics by their respective non-centrality parameter (NCP). This method improves power when using an extreme case sampling design such as HiQ23. NCPs were estimated using the Genetic Power Calculator, as described by Coleman et al.25. After combining all data, meta-analysis results were further filtered to exclude any variants with N < 50,000.
The X chromosome was treated separately in the meta-analysis because imputed genotypes were not available for the X chromosome in the largest cohort (UKB), and there was little overlap between the UKB called genotypes and imputed data from other cohorts (NSNPs < 500). We therefore included only the called X chromosome variants in UKB for these analyses after performing X-specific quality control steps26.
We conducted a series of meta-analyses on subsets of the full sample using the same methods as above. Age group-specific meta-analyses were run in the cohorts of children (age < 17; GENR, TEDS, IMAGEN, BLTS; N=9,814), young adults (age ∼17-18; S4S, STR; N=6,033), and adults (age < 18, primarily middle-aged or older: UKB, RS, DTR, NESCOG, STSA, ARIC, MESA; N=214,291), excluding studies whose samples overlapped multiple age groups (COGENT, HiQ/HRS, GfG; N=49,792). To create independent discovery samples for use in polygenic score validation, we also conducted meta-analyses with a “leave-one-out” strategy in which summary statistics from four validation datasets were, respectively, excluded from the meta-analysis (see Polygenic Scoring, below).
Cohort Heritability and Genetic Correlation
LD score regression8 was used to estimate genomic inflation and heritability of the intelligence phenotypes in each of the 16 cohorts using their post-quality control summary statistics, and to estimate the cross-cohort genetic correlations27. Pre-calculated LD scores from the 1000 Genomes European reference population were obtained from https://data.broadinstitute.org/alkesgroup/LDSCORE/. Genetic correlations were calculated on HapMap3 SNPs only. LD score regression was also used on the age subgroup meta-analyses to estimate heritability and cross-age genetic correlations.
Genomic Risk Loci Definition
Independently associated loci from the meta-analysis were defined using FUMA12 (http://fuma.ctglab.nl/), an online platform for functional mapping of genetic variants. We first identified independent significant SNPs which have genome-wide significant P-value (<5×10−8) and represented signals that are independent from each other at r2 <0.6. These SNPs were further represented by lead SNPs, which are a subset of the independent significant SNPs that are in approximate linkage equilibrium with each other at r2 <0.1. We then defined associated genomic risk loci by merging any physically overlapping lead SNPs (linkage disequilibrium [LD] blocks <250kb apart). Borders of the genomic risk loci were defined by identifying all SNPs in LD (r2≧0.6) with one of the independent significant SNPs in the locus, and the region containing all of these candidate SNPs was considered to be a single independent genomic risk locus. All LD information was calculated from UK Biobank genotype data.
Proxy-replication with Educational Attainment (EA)
We conducted GWAS of EA, an outcome with a high genetic correlation with intelligence5, in a non-overlapping European subset of the UKB sample (N=188,435) who did not complete the intelligence measure. EA was coded as maximum years of education completed, using the same methods as earlier analyses28 and GWAS was conducted using the same quality control and analytic procedures as described for the UKB intelligence phenotype (Supplementary Information 1.1). To test replication of the SNPs with this proxy phenotype, we performed a sign concordance test for all GWS SNPs from the meta-analysis using the exact binomial test. For each independent genomic locus, we considered it to be evidence for replication if the lead SNP or another correlated SNP in the region was sign concordant with the corresponding SNP in the intelligence meta-analysis and had a P-value of association with EA smaller than 0.05/246=0.0002.
Polygenic Scoring
We calculated polygenic scores (PGS) based on the SNP effect sizes of the leave-one-out metaanalyses, from which four cohorts were (separately) excluded and reserved for score validation. These included a child (GENR), young adult (S4S), and adult sample (RS). We also included the UKB-wb sample to test for validation in a very large (N = 53,576) cohort with the greatest phenotypic similarity to the largest contributor to the meta-analysis statistics (UKB-ts), in order to maximize potential predictive power. PGS were calculated on the genotype data using LDpred10, a Bayesian PGS method that utilizes a prior on effect size distribution to remodel the SNP effect size and account for LD, and PRSice9, a PLINK29-based program that automates optimization of the set of SNPs included in the PGS based on a high-resolution filtering of the GWAS P-value threshold. LDpred PGS were applied to the called, cleaned, genotyped variants in each of the validation cohorts with UK Biobank as the LD reference panel. PRSice PGS were calculated on hard-called imputed genotypes using P-value thresholds from 0.0 to 0.5 in steps of 0.001. The explained variance (ΔR2) was derived from a linear model in which the GWAS intelligence phenotype was regressed on each PGS while controlling for the same covariates as in each cohort-specific GWAS, compared to a linear model with GWAS covariates only.
Stratified Heritability
We partitioned SNP heritability using stratified LD Score regression30 in three ways: 1) by functional annotation category, 2) by minor allele frequency (MAF) in six percentile bins, and 3) by chromosome. Annotations for 22 binary categories of functional genomic characteristics (e.g. coding or regulatory regions) were obtained from the LD score website (https://github.com/bulik/ldsc). The Bonferroni-corrected significance threshold was .05/50 annotations=.001.
Functional Annotation of SNPs
Functional annotation of SNPs implicated in the meta-analysis was performed using FUMA12 (http://fuma.ctglab.nl/). We selected all candidate SNPs in genomic risk loci having an r2≧0.6 with one of the independent significant SNPs (see above), a P-value (P<1e-5) and a MAF>0.0001 for annotations. Functional consequences for these SNPs were obtained by matching SNPs’ chromosome, base-pair position, and reference and alternate alleles to databases containing known functional annotations, including ANNOVAR31 categories, Combined Annotation Dependent Depletion (CADD) scores13, RegulomeDB14 (RDB) scores, and chromatin states15,16. ANNOVAR categories identify the SNP’s genic position (e.g. intron, exon, intergenic) and associated function. CADD scores predict how deleterious the effect of a SNP is likely to be for a protein structure/function, with higher scores referring to higher deleteriousness. A CADD score above 12.37 is the threshold to be potentially pathogenic13. The RegulomeDB score is a categorical score based on information from expression quantitative trait loci (eQTLs) and chromatin marks, ranging from 1a to 7 with lower scores indicating an increased likelihood of having a regulatory function. Scores are as follows: 1a=eQTL + Transciption Factor (TF) binding + matched TF motif + matched DNase Footprint + DNase peak; 1b=eQTL + TF binding + any motif + DNase Footprint + DNase peak; 1c=eQTL + TF binding + matched TF motif + DNase peak; 1d=eQTL + TF binding + any motif + DNase peak; 1e=eQTL + TF binding + matched TF motif; 1f=eQTL + TF binding / DNase peak; 2a=TF binding + matched TF motif + matched DNase Footprint + DNase peak; 2b=TF binding + any motif + DNase Footprint + DNase peak; 2c=TF binding + matched TF motif + DNase peak; 3a=TF binding + any motif + DNase peak; 3b=TF binding + matched TF motif; 4=TF binding + DNase peak; 5=TF binding or DNase peak; 6=other;7=Not available. The chromatin state represents the accessibility of genomic regions (every 200bp) with 15 categorical states predicted by a hidden Markov model based on 5 chromatin marks for 127 epigenomes in the Roadmap Epigenomics Project16. A lower state indicates higher accessibility, with states 1-7 referring to open chromatin states. We annotated the minimum chromatin state across tissues to SNPs. The 15-core chromatin states as suggested by Roadmap are as follows: 1=Active Transcription Start Site (TSS); 2=Flanking Active TSS; 3=Transcription at gene 5’ and 3’; 4=Strong transcription; 5= Weak Transcription; 6=Genic enhancers; 7=Enhancers; 8=Zinc finger genes & repeats; 9=Heterochromatic; 10=Bivalent/Poised TSS; 11=Flanking Bivalent/Poised TSS/Enh; 12=Bivalent Enhancer; 13=Repressed PolyComb; 14=Weak Repressed PolyComb; 15=Quiescent/Low. Standardized SNP effect sizes were calculated for the most impactful SNPs by transforming the sample size-weighted meta-analysis Z score, as described in Zhu et al., 201632.
Gene-mapping
Genome-wide significant loci obtained by GWAS were mapped to genes in FUMA12 using three strategies:
Positional mapping maps SNPs to genes based on physical distance (within a 10kb window) from known protein coding genes in the human reference assembly (GRCh37/hg19).
eQTL mapping maps SNPs to genes with which they show a significant eQTL association (i.e. allelic variation at the SNP is associated with the expression level of that gene). eQTL mapping uses information from 45 tissue types in 3 data repositories (GTEx33, Blood eQTL browser34, BIOS QTL browser35), and is based on cis-eQTLs which can map SNPs to genes up to 1Mb apart. We used a false discovery rate (FDR) of 0.05 to define significant eQTL associations.
Chromatin interaction mapping was performed to map SNPs to genes when there is a three-dimensional DNA-DNA interaction between the SNP region and another gene region. Chromatin interaction mapping can involve long-range interactions as it does not have a distance boundary. FUMA currently contains Hi-C data of 14 tissue types from the study of Schmitt et al36. Since chromatin interactions are often defined in a certain resolution, such as 40kb, an interacting region can span multiple genes. If a SNPs is located in a region that interacts with a region containing multiple genes, it will be mapped to each of those genes. To further prioritize candidate genes, we selected only interaction-mapped genes in which one region involved in the interaction overlaps with a predicted enhancer region in any of the 111 tissue/cell types from the Roadmap Epigenomics Project16 and the other region is located in a gene promoter region (250bp up and 500bp downstream of the transcription start site and also predicted by Roadmap to be a promoter region). This method reduces the number of genes mapped but increases the likelihood that those identified will have a plausible biological function. We used a FDR of 1×10−5 to define significant interactions, based on previous recommendations36 modified to account for the differences in cell lines used here.
Functional annotation of mapped genes
Genes implicated by mapping of significant GWAS SNPs were further investigated using the GENE2FUNC procedure in FUMA12, which provides hypergeometric tests of enrichment of the list of mapped genes in 53 GTEx33 tissue-specific gene expression sets, 7,246 MSigDB gene-sets37, and 2,195 GWAS catalog gene-sets21. The Bonferroni-corrected significance threshold was 0.05/9,494 gene-sets=5.27×10−6.
Gene-based analysis
SNP-based P-values from the meta-analysis were used as input for the gene-based genome-wide association analysis (GWGAS). 18,128 protein-coding genes (each containing at least 1 GWAS SNP) from the NCBI 37.3 gene definitions were used as basis for GWGAS in MAGMA (http://ctg.cncr.nl/software/magma)17. The Bonferroni-corrected genome-wide significance threshold was .05/18,128 genes=2.76×10−6.
Gene-set analysis
Results from the GWGAS analyses were used to test for association in three types of predefined gene-sets:
7,246 curated gene-sets representing known biological and metabolic pathways were derived from 9 data resources, catalogued by and obtained from the MsigDB version 5.229 (http://software.broadinstitute.org/gsea/msigdb/collections.jsp)
gene expression values from 53 tissues obtained from GTEx33, log2 transformed with pseudocount 1 after winsorization at 50 and averaged per tissue
cell-type specific expression in 24 types of brain cells, which were calculated following the method described in Skene et al38. and Coleman et al25. Briefly, brain cell-type expression data was drawn from single-cell RNA sequencing data from mouse brains. For each gene, the value for each cell-type was calculated by dividing the mean Unique Molecular Identifier (UMI) counts for the given cell type by the summed mean UMI counts across all cell types. Single-cell gene-sets were derived by grouping genes into 40 equal bins by specificity of expression.
These gene-sets were tested using MAGMA. We computed competitive P-values, which represent the test of association for a specific gene-set compared to other gene-sets. This method is more robust to Type I error than self-contained tests that only test for association of a gene-set against the null hypothesis of no association39. The Bonferroni-corrected significance threshold was 0.05/7,323 gene-sets=6.83×10−6. Conditional analyses were performed as a followup using MAGMA to test whether each significant association observed was independent of all others. The association between each gene-set was tested conditional on the most strongly associated set, and then - if any substantial (p<.05/number of gene-sets) associations remained - by conditioning on the first and second most strongly associated set, and so on until no associations remained. Gene-sets that retained their association after correcting for other sets were considered to be independent signals. We note that this is not a test of association per se, but rather a strategy to identify, among gene-sets with known significant associations whose defining genes may overlap, which set(s) are responsible for driving the observed association.
Cross-Trait Genetic Correlation
Genetic correlations (rg) between intelligence and 38 phenotypes were computed using LD score regression27, as described above, based on GWAS summary statistics obtained from publicly available databases (http://www.med.unc.edu/pgc/results-and-downloads; http://ldsc.broadinstitute.org/; Supplementary Table 18). The Bonferroni-corrected significance threshold was 0.05/38 traits=1.32×10−3.
GWAS catalog lookup
We used FUMA to identify SNPs with previously reported (P < 5×10−5) phenotypic associations in published GWAS listed in the NHGRI-EBI catalog21 which overlapped with the genomic risk loci identified in the meta-analysis. As an additional relevant phenotype of interest, we examined whether the genes associated with intelligence in this study (by FUMA mapping or GWGAS) were overrepresented in a set of 1,518 genes linked to intellectual disability and/or developmental delay, as compiled by RegionAnnotater (https://github.com/ivankosmos/RegionAnnotator). Many of these have been identified by non-GWAS sources and are not represented in the NHGRI catalog. We tested for enrichment using a hypergeometric test with a background set of 19,283 genomic protein-coding genes, as in FUMA. Manual lookups were also performed to identify overlapping loci/genes with known previous GWAS of intelligence.
Mendelian Randomization
To infer credible causal associations between intelligence and traits that are genetically correlated with intelligence, we performed Generalised Summary-data based Mendelian Randomization40 (GSMR; http://cnsgenomics.com/software/gsmr/). This method utilizes summary-level data to test for causal associations (bxy) between a risk factor and an outcome by using genome-wide significant SNPs as instrumental variables. HEIDI-outlier detection was used to filter genetic instruments that show clear pleiotropic effects on the exposure phenotype and the outcome phenotype. We used a threshold p-value of 0.01 for the outlier detection analysis in HEIDI which removes 1% of SNPs by chance if there is no pleiotropic effect. To test for a causal effect of intelligence (bzx) on an outcome (bzy) we selected traits in non-overlapping samples that showed significant genetic correlations (rg) with intelligence. We tested for bi-directional causation by repeating the analyses using independent GWS SNPs related to the outcome phenotypes as exposure and intelligence as the outcome phenotype. For each trait, we selected independent (r2=<0.1), GWS lead SNPs as instrumental variables in the analyses. For traits with less than 10 lead SNPs (i.e. the minimum number of SNPs on which GSMR can perform a reliable analysis) we selected independent SNPs (r2=<0.1), with a GWS P-value (<5×10−8), except for ADHD for which the threshold was lowered to 1×10−5 due to the small number of GWS SNPs. The estimated bzx and bzy are approximately equal to the natural log odds ratio (OR)40. An OR of 2 can be interpreted as a doubled risk compared to the population prevalence of a binary trait for every SD increase in the exposure trait. For quantitative traits the bzx and bzy can be interpreted as a one standard deviation increase explained in the outcome trait for every SD increase in the exposure trait.
Data availability
Summary statistics will be made available for download upon publication (https://ctg.cncr.nl).
Acknowledgments
This work was funded by The Netherlands Organization for Scientific Research (NWO VICI 453-14-005 and NWO VIDI 452-12-014) and the Sophia Foundation for Scientific Research (grant nr: S14-27). The analyses were carried out on the Genetic Cluster Computer, which is financed by the Netherlands Scientific Organization (NWO: 480-05-003), by the VU University, Amsterdam, The Netherlands, and by the Dutch Brain Foundation, and is hosted by the Dutch National Computing and Networking Services SurfSARA. This research has been conducted using the UK Biobank resource under application number 16406 and the Database of Genotypes and Phenotypes (dbGaP) under accession numbers phs000280.v3.p1 and phs000209.v13.p3. We thank the numerous participants, researchers, and staff from many studies who collected and contributed to the data. Additional acknowledgments can be found in the Supplementary Information file. Summary statistics will be made available for download upon publication from http://ctglab.vu.nl.