Abstract
Schizophrenia (SCZ) and bipolar disorder (BD) are highly heritable disorders that share a significant proportion of common risk variation. Understanding the genetic factors underlying the specific symptoms of these disorders will be crucial for improving diagnosis, intervention and treatment. In case-control data consisting of 53,555 cases (20,129 BD, 33,426 SCZ) and 54,065 controls, we identified 114 genome-wide significant (GWS) loci when comparing all cases to controls, of which 41 represented novel findings. Two genome-wide significant loci were identified when comparing SCZ to BD and a third was found when directly incorporating functional information. Regional joint association identified a genomic region of overlapping association in BD and SCZ with disease-independent causal variants, indicating a fourth region contributing to differences between these disorders. Regional SNP-heritability analyses demonstrated that the estimated heritability of BD based on the SCZ GWS regions was significantly higher than that based on the average genomic region (91 regions, p = 1.2×10−6) while the inverse was not significant (19 regions, p=0.89). Using our BD and SCZ GWAS we calculated polygenic risk scores and identified several significant correlations with: 1) SCZ subphenotypes: negative symptoms (SCZ, p=3.6×10−6) and manic symptoms (BD, p=2×10−5), 2) BD subphenotypes: psychotic features (SCZ p=1.2×10−10, BD p=5.3×10−5) and age of onset (SCZ p=7.9×10−4). Finally, we show that psychotic features in BD have significant SNP-heritability (h2snp=0.15, SE=0.06) and a significant genetic correlation with SCZ (rg=0.34). In addition, there is a significant sign test result between the SCZ GWAS and a GWAS of BD cases contrasting those with and without psychotic features (p=0.028, one-sided binomial test).
For the first time, we have identified specific loci pointing to a potential role of 4 genes (DARS2, ARFGEF2, DCAKD and GATAD2A) that distinguish between BD and SCZ, providing an opportunity to understand the biology contributing to clinical differences of these disorders. Our results provide the best evidence so far of genomic components distinguishing between BD and SCZ that contribute directly to specific symptom dimensions.
Introduction
Bipolar disorder (BD) and schizophrenia (SCZ) are severe psychiatric disorders and among the leading causes of disability worldwide1. Both disorders have significant genetic components with heritability estimates ranging from 60-80%2. A genetic-epidemiological study demonstrated a substantial overlap between these two disorders with a genetic correlation from common variation near 0.6-0.7 and high relative risks (RR) among relatives of both BD and SCZ patients (RRs for parent/offspring: BD/BD: 6.4, BD/SCZ: 2.4; SCZ/BD: 5.2, SCZ/SCZ: 9.9)3. Despite shared genetics and symptomatology, the current diagnostic systems4,5 represent BD and SCZ as distinct categorical entities differentiated on the basis of their clinical presentation, with BD characterized by predominant mood symptoms, mood-congruent delusions and an episodic disease course and SCZ considered a prototypical psychotic disorder. Further, premorbid cognitive impairment and reduced intelligence are more frequent and severe in SCZ than BD6. The genetic contributors to these phenotypic distinctions have yet to be elucidated and could aid in understanding the underlying biology of their unique clinical presentation.
While the shared genetic component is large, studies to date have identified key genetic architecture differences between these two disorders. A polygenic risk score created from a case-only SCZ vs BD genome-wide association study (GWAS) significantly correlated with SCZ vs BD diagnosis in an independent sample7, providing evidence that differences between the disorders also have a genetic basis. An enrichment of rare, moderate to highly penetrant copy number variants (CNVs) and de novo CNVs is seen in SCZ patients8–12, while the involvement of CNVs in BD is much less clear13. Although the role of de novo single nucleotide variants in BD and SCZ has been investigated in only a handful of studies so far, enrichment in pathways associated with the postsynaptic density has been reported for SCZ, but not BD14,15. Identifying disorder-specific variants or quantifying the contribution of variation to specific symptom dimensions remains an open question. For example, previous work by this group has demonstrated that SCZ patients with greater manic symptoms had higher polygenic risk for BD7. Here, we utilize the largest collection of genotyped samples of BD and SCZ along with 28 subphenotypes to assess variants and genomic regions that contribute differentially to the disorders and to specific symptom dimensions or subphenotypes within them.
Methods
Sample Description
SCZ samples are those analyzed previously16. BD samples are the newest collection from the Psychiatric Genomics Consortium Bipolar Disorder Working Group (Stahl et al. submitted). To ensure independence of the data sets, individuals were excluded until no individual showed a relatedness (pihat) value greater than 0.2 to any other individual in the collection, while preferentially keeping the case over the control for case-control related pairs. In total 2,181 BD cases, 1,604 SCZ cases and 27,308 controls were removed (most of which were previously known), leaving 20,129 BD cases, 33,426 SCZ cases and 54,065 controls for the final meta-analysis.
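The relatedness exclusion described above can be illustrated with a minimal sketch. It assumes a simple greedy pass over related pairs, most related first; the function name, the input layout and the tie-breaking rule are hypothetical and are not taken from the consortium pipeline.

```python
def prune_related(pairs, is_case, threshold=0.2):
    """Return the set of sample IDs to remove.

    pairs   : iterable of (id_a, id_b, pihat) tuples (hypothetical layout)
    is_case : dict mapping sample ID -> True for cases, False for controls
    """
    removed = set()
    # Visit the most closely related pairs first.
    for a, b, pihat in sorted(pairs, key=lambda p: -p[2]):
        if pihat <= threshold or a in removed or b in removed:
            continue  # pair is unrelated or already resolved
        if is_case[a] and not is_case[b]:
            removed.add(b)  # keep the case, drop the control
        elif is_case[b] and not is_case[a]:
            removed.add(a)
        else:
            removed.add(b)  # same status: arbitrary choice (assumption)
    return removed
```

Any sample left in the data after this pass has no remaining partner with pihat above the threshold, since every such pair loses at least one member.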
For analyses directly comparing BD and SCZ, we matched cases from both phenotypes on genotyping platform and ancestry, resulting in 15,270 BD cases versus 23,585 SCZ cases. In other words, we were able to match 76% of BD cases and 71% of SCZ cases.
Sub-phenotype description
BD sub-phenotypes were collected by each study site using a combination of diagnostic instruments, case records and participant interviews. Ascertainment details for each study site are described in the supplementary data of the PGC Bipolar Working Group paper (Stahl et al. submitted). The selection of phenotypes for collection by this group was determined by literature searches in order to determine phenotypes with prior evidence for heritability. It was further refined dependent on the availability of phenotype data across a range of study sites and the consistency by which the phenotypes were defined. Schizophrenia subphenotypes are the same as described previously but in a larger proportion of patients7.
Quality Control, Imputation, Association Analysis and Polygenic Risk Scoring
Quality control and imputation were performed on each of the study cohort datasets (n=81), according to standards established by the Psychiatric Genomics Consortium (PGC). The quality control parameters for retaining SNPs and subjects were: SNP missingness < 0.05 (before sample removal); subject missingness < 0.02; autosomal heterozygosity deviation (| Fhet | < 0.2); SNP missingness < 0.02 (after sample removal); difference in SNP missingness between cases and controls < 0.02; and SNP Hardy-Weinberg equilibrium (p > 10−6 in controls or p > 10−10 in cases). Genotype imputation was performed using the pre-phasing/imputation stepwise approach implemented in IMPUTE217 / SHAPEIT18 (chunk size of 3 Mb and default parameters). The imputation reference set consisted of 2,186 phased haplotypes from the full 1000 Genomes Project dataset (August 2012, 30,069,288 variants, release “v3.macGT1”). After imputation, we used the best-guess genotypes for further robust relatedness testing and population structure analysis. Here we required very high imputation quality (INFO > 0.8) and low missingness (<1%) for further quality control. After linkage disequilibrium (LD) pruning (r2 < 0.02) and frequency filtering (MAF > 0.05), there were 14,473 autosomal SNPs in the data set. Relatedness testing was done with PLINK19; pairs of subjects with pihat > 0.2 were identified and one member of each pair was removed, preferentially retaining cases over controls and otherwise choosing at random. Principal component estimation was done with the same collection of autosomal SNPs. We tested the first 20 principal components for phenotype association (using logistic regression with study indicator variables included as covariates) and evaluated their impact on the genome-wide test statistics using λ. Thirteen principal components (1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 15, 18 and 20) were included in all association analyses (λ=1.45). Analytical steps were repeated for the SCZ vs BD analysis.
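As an illustration only, the retention thresholds listed above can be written as two predicates. This is a sketch, not the PGC pipeline: the class and field names are hypothetical, and it assumes that a SNP must satisfy both Hardy-Weinberg thresholds (controls and cases) to be retained.

```python
from dataclasses import dataclass


@dataclass
class SnpQc:
    """Per-SNP QC metrics (hypothetical field names)."""
    missing_pre: float      # missingness before sample removal
    missing_post: float     # missingness after sample removal
    missing_cases: float
    missing_controls: float
    hwe_p_controls: float
    hwe_p_cases: float


def snp_passes(s: SnpQc) -> bool:
    # Assumption: both HWE conditions must hold for the SNP to be kept.
    return (
        s.missing_pre < 0.05
        and s.missing_post < 0.02
        and abs(s.missing_cases - s.missing_controls) < 0.02
        and s.hwe_p_controls > 1e-6
        and s.hwe_p_cases > 1e-10
    )


def subject_passes(missingness: float, f_het: float) -> bool:
    # Subject missingness < 0.02 and |Fhet| < 0.2.
    return missingness < 0.02 and abs(f_het) < 0.2
```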
We performed four main association analyses, i.e. (i) GWAS of BD and SCZ as a single combined case phenotype, as well as disorder-specific GWAS using independent control sets in (ii) BD cases vs BD controls and (iii) SCZ cases vs SCZ controls, and (iv) association analysis of SCZ cases vs BD cases.
Summary-data-based Mendelian Randomization (SMR)20
We used SMR as a statistical fine-mapping tool applied to the SCZ vs BD GWAS results to identify loci with strong evidence of causality via gene expression. SMR analysis is limited to significant (FDR < 0.05) cis SNP-expression quantitative trait loci (eQTLs) with MAF > 0.01. eQTLs passing these thresholds were combined with GWAS results in the SMR test, with significance (pSMR) reported at a Bonferroni-corrected threshold for each eQTL data set. The eQTL architecture may differ between genes. Through LD, many SNPs can generate significant associations with the same gene, but in some instances multiple SNPs may be independently associated with the expression of a gene. After identification of significant SNP-expression-trait association through the SMR test, a follow-up heterogeneity test aims to prioritize variants by excluding regions for which there is conservative evidence for multiple causal loci (pHET < 0.05). SMR analyses were conducted using eQTL data from whole peripheral blood21, dorsolateral prefrontal cortex generated by the CommonMind Consortium8, and 11 brain sub-regions from the GTEx consortium22.
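For orientation, the core SMR test can be sketched from two z-scores. The sketch assumes the approximate statistic given in the SMR method paper20, T = z_GWAS² z_eQTL² / (z_GWAS² + z_eQTL²), referred to a chi-square distribution with one degree of freedom; the function names are ours, and the follow-up heterogeneity test is not covered.

```python
import math


def smr_test(z_gwas: float, z_eqtl: float):
    """Approximate SMR statistic and p-value from GWAS and eQTL z-scores."""
    t = (z_gwas ** 2 * z_eqtl ** 2) / (z_gwas ** 2 + z_eqtl ** 2)
    # Survival function of a 1-df chi-square: P(X > t) = erfc(sqrt(t / 2)).
    p = math.erfc(math.sqrt(t / 2.0))
    return t, p


def bonferroni_threshold(n_probes: int, alpha: float = 0.05) -> float:
    """Per-data-set significance threshold (alpha = 0.05 is an assumption)."""
    return alpha / n_probes
```

Because T is bounded above by the smaller of the two squared z-scores, a SNP with an extremely strong eQTL effect yields an SMR p-value close to, but never smaller than, its GWAS p-value.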
Regional joint GWAS
Summary statistic Z-scores were calculated for each marker in each of the four main GWAS results, using the logistic regression coefficient and its standard error. Rare SNPs (MAF < 0.01) and SNPs with a low INFO score (< 0.3) in either dataset were removed. The causal variant relationships between SCZ and BD were investigated using the Bayesian method software pw-gwas (v0.2.1), with quasi-independent regions (n=1,702) determined by estimated LD blocks in an analysis of European individuals23,24. Briefly, pw-gwas takes a Bayesian approach to determine the probability of five independent models of association: (1) there is no causal variant in BD or SCZ; (2) a causal variant in BD, but not SCZ; (3) a causal variant in SCZ, but not BD; (4) a shared causal variant influencing both BD and SCZ; (5) two causal variants where one influences BD, and one influences SCZ. The posterior probability of each model is calculated using model priors, estimated empirically within pw-gwas. Regions were considered to support a particular model when the posterior probability of the model was greater than 0.5.
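The pre-processing and the decision rule described above can be sketched as follows. This does not reproduce pw-gwas itself: the posterior probabilities are taken as given, and all names are hypothetical.

```python
MODELS = (
    "no causal variant",
    "BD only",
    "SCZ only",
    "shared causal variant",
    "two distinct causal variants",
)


def z_score(beta: float, se: float) -> float:
    """Z-score from a logistic regression coefficient and its standard error."""
    return beta / se


def keep_snp(maf: float, info_bd: float, info_scz: float) -> bool:
    """Drop rare SNPs and SNPs with a low INFO score in either dataset."""
    return maf >= 0.01 and min(info_bd, info_scz) >= 0.3


def supported_model(posteriors, cutoff=0.5):
    """Return the model whose posterior exceeds the cutoff, else None."""
    if len(posteriors) != len(MODELS):
        raise ValueError("expected one posterior per model")
    best = max(range(len(MODELS)), key=lambda i: posteriors[i])
    return MODELS[best] if posteriors[best] > cutoff else None
```

Since the five posteriors sum to one, at most one model can exceed 0.5, so a region is either assigned to a single model or left unassigned.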
Regional SNP-heritability estimation
We calculated local SNP-heritability independently for SCZ and BD using the Heritability Estimator from Summary Statistics (HESS) software25 for each of the independent regions defined above. The sum of these regional estimates is the total SNP-heritability of the trait. To calculate local SNP-heritability HESS requires reference LD matrices representative of the population from which the GWAS samples were drawn. We utilized the 1000 genomes European individuals as the reference panel26. Unlike pw-gwas23, HESS does not assume that only one causal variant can be present in each region.
Results
GWAS
We performed association analysis of BD and SCZ as a combined phenotype, totaling 53,555 cases (20,129 BD, 33,426 SCZ) and 54,065 controls on 15.5 million dosages imputed from 1000 Genomes phase 326. Logistic regression was performed controlling for 13 components of ancestry, study sites and genotyping platform. One hundred and fourteen regions contained at least one variant for which the p-value was lower than our genome-wide significance (GWS) threshold of p < 5×10−8. Among these 114 loci, 41 had non-overlapping LD regions (r2 > 0.6) with the largest and most recently performed single disease GWAS of SCZ16 and BD (Stahl et al. submitted). Establishing independent controls (see Methods) allowed us to perform disorder-specific GWAS in 20,129 BD cases vs 21,524 BD controls and 33,426 SCZ cases vs 32,541 SCZ controls. Using these results, we compared effect sizes of these 114 loci across each disorder independently (Figure 1a), showing that subsets of variants have larger effects in SCZ vs BD or vice versa.
To identify loci with divergent effects on BD and SCZ, we performed an association analysis on 23,585 SCZ cases and 15,270 BD cases matched for shared ancestry and genotyping platform (see Methods, Figure 1b, Supplementary Figures 1-5, Supplementary Table 1). Two genome-wide significant loci were identified, the most significant of which was rs56355601 located on chromosome 1 at position 173,811,455 within an intron of DARS2. The second most significant locus was a four base indel on chromosome 20 at position 47,638,976 in an intron of ARFGEF2. For both variants, the minor allele frequency was higher in BD cases than SCZ cases and disease-specific GWAS showed opposite directions of effect. We sought to identify additional disease-specific loci by incorporating expression information with association results to perform fine-mapping and identify novel variants27–30. Here, we applied the summary-data-based Mendelian randomization (SMR) method20 (see Methods) utilizing the cis-QTLs derived from peripheral blood21, human dorsolateral prefrontal cortex (DLPFC)31 from the CommonMind Consortium and 11 brain regions from the GTEx consortium22. We identified one SNP-probe combination that surpassed the threshold for genome-wide significance in blood and was also the most significant finding in brain. We found that SNP rs4793172 in gene DCAKD is associated in the SCZ vs BD analysis (pGWAS = 2.8×10−6) and is an eQTL for probe ILMN_1811648 (peQTL = 2.9×10−168), resulting in pSMR = 4.1×10−6 in blood (peQTL = 2.9×10−25, pSMR = 2.0×10−5 in DLPFC, and peQTL = 4.6×10−15, pSMR = 6.0×10−5 in GTEx cerebellar hemisphere) (Supplementary Table 2, Supplementary Figure 6). The association shows no evidence of heterogeneity (pHET = 0.66), which implies only a single causal variant in the region.
Regional joint association
We expanded our efforts to identify disorder specific genomic regions by jointly analyzing independent GWAS results from BD and SCZ23. Among 1,702 regions genome-wide (see Methods), 223 had a posterior probability of greater than 0.5 of having a causal variant in at least one disorder. Of these, 132 best fit the model of a shared causal variant influencing both BD and SCZ, 88 were most likely specific to SCZ, 3 demonstrated evidence of two independent variants (with one impacting each of the two disorders) and zero were BD specific. Of note, the data estimated prior probability of having a BD specific region was 0.1% compared to 15% for SCZ, potentially a result of increased power from the larger SCZ sample size.
The 114 GWS SNPs from the combined BD and SCZ GWAS localized into 99 independent regions, of which 78 (79%) were shared with a posterior probability of greater than 0.5. Sixty regions had at least one GWS SNP in the independent SCZ GWAS, of which 30 (50%) are shared, and 8 regions contained a GWS SNP in the independent BD GWAS, of which 6 (75%) are shared using the same definition. For the three regions showing evidence for independent variants, two had highly non-overlapping association signals in the same region stemming from independent variants. The third, on chromosome 19, presented a different scenario where association signals were overlapping (Supplementary Figure 7). The most significant variant in BD was rs111444407 (chr19:19358207, p = 8.67×10−10) and for SCZ was rs2315283 (chr19:19480575, p = 4.41×10−7). After conditioning on the most significant variant in the other disorder, the association signals of the most significant variant in BD and SCZ were largely unchanged (BD rs111444407 p = 1.3×10−9, SCZ rs2315283 p = 6.7×10−5). We further calculated the probability of each variant in the region being causal for both BD and SCZ32 and found no correlation (r = -0.00016). The most significant variants had the highest posterior probability of being causal (SCZ: rs2315283, prob = 0.02, BD: rs111444407, prob = 0.16). Both variants most significantly regulate the expression of GATAD2A in brain31 but in opposite directions (rs111444407 peQTL = 6×10−15, beta = 0.105; rs2315283 peQTL = 1.5×10−28, beta = -0.11).
Regional SNP-heritability estimation
Across the genome, regional SNP-heritabilities (h2snp) were estimated separately for SCZ and BD25 and were found to be moderately correlated (r=0.25). We next defined risk regions as those containing the most associated SNP for each GWS locus. In total, there were 101 SCZ risk regions from the 105 autosomal GWS loci reported previously16 and 29 BD risk regions from 30 GWS loci reported in a companion paper (Stahl et al. submitted). Ten regions were risk regions for both BD and SCZ, comprising 33% of BD risk regions and 10% of SCZ risk regions. We further stratified regional h2snp by whether a region was a risk region in one disorder, none or both (Figure 2). Since the discovery data for the regions overlapped with the data used for the heritability estimation, we expected within-disorder analyses to show significant results. In risk regions specific to SCZ (n=91) there was a significant increase in regional h2snp in SCZ, as expected (p = 1.1×10−22), but also in BD (p = 1.2×10−6). In risk regions specific to BD (n=19), significantly increased regional h2snp was observed in BD, as expected (p = 0.0007), but not in SCZ (p = 0.89). Risk regions shared by both disorders had significantly higher h2snp in both disorders, as expected (BD p = 5.3×10−5, SCZ p = 0.006), compared to non-risk regions. However, we observed a significant increase in BD h2snp in shared risk regions compared to BD risk regions (BD p = 0.003) but not in SCZ h2snp for shared risk regions compared to SCZ risk regions (p = 0.62). Using a less stringent p-value threshold for defining risk regions (p < 5×10−6), thereby substantially increasing the number of regions, gave similar results (Supplementary Figure 8). Seven regions contributed to substantially higher h2snp in SCZ compared to BD but no region showed the inverse pattern. Of these regions, all but one were in the major histocompatibility complex (MHC) region; the sole novel region was chr10:104380410-106695047 with regional h2snp=0.0019 in SCZ and h2snp=0.00063 in BD.
Polygenic dissection of subphenotypes
Subphenotypes were collected for a subset of patients in both BD and SCZ (see Methods). For SCZ, we had clinical quantitative measurements of manic, depressive, positive and negative symptoms generated from factor analysis of multiple instruments as described previously7 but in larger sample sizes (n=6908, 6907, 8259, 8355 respectively). For BD, 24 subphenotypes were collected among nearly 13,000 cases in distinct categories including comorbidities, clinical information such as rapid cycling and psychotic features, as well as additional disease course data such as age of onset and number of hospitalizations. For each BD and SCZ patient, we calculated a polygenic risk score (PRS) using all SNPs, from each of the four main GWAS analyses (BD+SCZ, BD, SCZ and SCZvsBD). We then used regression analysis including principal components and site to assess the relationship between each subphenotype and the 4 PRS. We applied a significance cutoff of p < 0.0004 based on Bonferroni correction for 112 tests. In total, we identified 6 significant results after correction (Figure 3, Table 1). For BD PRS we see a significant positive correlation between PRS and manic symptoms in SCZ cases as seen previously7 (p=2×10−5, t=4.26) and psychotic features in BD patients (p=5.3×10−5, t=4.04). For SCZ PRS, we see a significant increase in PRS for BD cases with versus without psychotic features (p=1.2×10−10, t=6.45) and negative symptoms in SCZ patients (p=3.60×10−6, t=4.64). As with the SCZ PRS, BD+SCZ PRS is also significantly associated with psychotic features in BD (p=7.9×10−13, t=7.17) and negative symptoms in SCZ (p=1.5×10−5, t=4.33). While not surpassing conservative correction, the next two most significant results are both indicative of a more severe course in BD: increased BD+SCZ PRS with increased numbers of hospitalizations in BD cases (p=4.2×10−4, t=3.53) and increased SCZ PRS with earlier onset of BD (p=7.9×10−4, t=-3.36).
We assessed the role of BD subtype on correlation between SCZ PRS and psychotic features and identified significant correlation when restricted to only BD type I cases (BDI: 3,763 with psychosis, 2,629 without, p=1.55×10−5, Supplementary Table 3).
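The multiple-testing cutoff used for the subphenotype analyses can be reproduced with a short calculation. The sketch assumes a family-wise alpha of 0.05, which the text does not state explicitly, and uses only the p-values reported above; the labels are ours.

```python
N_SUBPHENOTYPES = 28   # 4 SCZ + 24 BD subphenotypes
N_PRS = 4              # BD+SCZ, BD, SCZ and SCZvsBD
ALPHA = 0.05           # assumed family-wise error rate

n_tests = N_SUBPHENOTYPES * N_PRS
unrounded_threshold = ALPHA / n_tests   # about 4.5e-4
REPORTED_CUTOFF = 0.0004                # cutoff applied in the text

reported_p = {
    "BD PRS ~ manic symptoms (SCZ)": 2e-5,
    "BD PRS ~ psychotic features (BD)": 5.3e-5,
    "SCZ PRS ~ psychotic features (BD)": 1.2e-10,
    "SCZ PRS ~ negative symptoms (SCZ)": 3.6e-6,
    "BD+SCZ PRS ~ psychotic features (BD)": 7.9e-13,
    "BD+SCZ PRS ~ negative symptoms (SCZ)": 1.5e-5,
    "BD+SCZ PRS ~ hospitalizations (BD)": 4.2e-4,
    "SCZ PRS ~ age of onset (BD)": 7.9e-4,
}

significant = [name for name, p in reported_p.items() if p < REPORTED_CUTOFF]
```

The hospitalization result (p=4.2×10−4) lies between the applied cutoff of 0.0004 and the unrounded value of 0.05/112, which is why it is described as not surpassing conservative correction.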
For all 8 quantitative subphenotypes and 9 binary subphenotypes having at least 1,000 cases, we performed a GWAS within cases to calculate heritability and genetic correlation with BD and SCZ. Only two subphenotypes had significant h2snp estimates using LD-score regression33: psychotic features in BD (h2snp=0.15, SE=0.06) and suicide attempt (h2snp=0.25, SE=0.1). Only psychotic features demonstrated significant genetic correlation with SCZ (rg=0.34, SE=0.13, p=0.009). While the genetic correlation demonstrates a genome-wide relationship between common variants contributing to SCZ and those contributing to psychotic features in BD cases, we sought to assess whether this could be demonstrated among the most significantly associated SCZ loci. Of the 105 autosomal genome-wide significant SCZ loci previously published16, 100 variants were present in our dataset after QC, and 60 of these demonstrated the same direction of effect for psychotic features in BD (p=0.028, one-sided binomial test).
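The sign test above is an exact one-sided binomial test against a null probability of 0.5 for a concordant direction of effect. A minimal sketch using only the standard library (the function name is ours):

```python
from math import comb


def binomial_upper_tail(k: int, n: int, p: float = 0.5) -> float:
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))


# 60 of 100 SCZ loci share the direction of effect for psychotic features in BD.
p_sign = binomial_upper_tail(60, 100)
```

With 60 concordant variants out of 100, the upper-tail probability is approximately 0.028, matching the value reported in the text.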
Discussion
Here we present a genetic dissection of bipolar disorder and schizophrenia from over 100,000 genotyped subjects. As previously shown34, we found an extensive degree of genetic sharing between these two disorders. We identified 114 genome-wide significant loci contributing to both disorders, of which 41 are novel to this analysis. Despite the high degree of sharing, we identified several loci that significantly differentiated between the two disorders, having opposite directions of effect, and polygenic components that significantly correlated from one disorder to symptoms of the other.
Two GWS loci were identified from the case-only SCZ versus BD analysis, providing opportunities to inform the underlying biological distinctions between BD and SCZ. The most significant locus is in DARS2 (coding for the mitochondrial aspartate-tRNA ligase), which is highly expressed in the brain and significantly regulated by the most significant SNP rs56355601 (peQTL=2.5×10−11). Homozygous mutations in DARS2 are responsible for leukoencephalopathy with brainstem and spinal cord involvement and lactate elevation (LBSL), which is characterized by neurological symptoms such as psychomotor developmental delay, cerebellar ataxia and delayed mental development35. Interestingly, based on methylation analysis from the prefrontal cortex of stress models (rats and monkeys) and from peripheral samples (in monkeys and human newborns), DARS2, among others, has been suggested as a potential molecular marker of early-life stress and vulnerability to psychiatric disorders36. The second most significant locus maps to ARFGEF2, which codes for ADP Ribosylation Factor Guanine Nucleotide Exchange Factor 2 (also known as BIG2), a protein involved in vesicular trafficking from the trans-Golgi network. Mutations in ARFGEF2 have been shown to underlie an autosomal recessive condition characterized by microcephaly and periventricular heterotopia, a disorder caused by abnormal neural proliferation and migration37. Although not genome-wide significant, the third most significant locus implicates ARNTL (Aryl Hydrocarbon Receptor Nuclear Translocator Like), which is a core component of the circadian clock. ARNTL has been previously hypothesized to be relevant to bipolar disorder38, although human genetic evidence is limited39. Incorporating transcriptional data identified a third genome-wide significant finding in DCAKD. The gene codes for Dephospho-CoA Kinase Domain Containing, a member of the human postsynaptic density proteome from human neocortex40. In the mouse cortical synaptoproteome, DCAKD has been found to be among the proteins with the highest changes between juvenile postnatal days and adult stage, which suggests a putative role in brain development41,42.
We further assessed the contribution of regions of the genome to each disorder through joint regional association and regional heritability estimation. These results point to two additional loci that may contribute differentially to liability to BD and SCZ. The region on chr19 shows overlapping association peaks that are driven by independent causal variants for each disorder. Both variants significantly regulate the same gene, GATAD2A, but in opposite directions. GATAD2A is a transcriptional repressor, which is targeted by MBD2 and is involved in methylation-dependent gene silencing. The protein is part of the large NuRD (nucleosome remodeling and deacetylase) complex, for which HDAC1/2 are also essential components. NuRD complex proteins have been associated with autism43. Their members, including GATAD2A, display preferential expression in fetal brain development43 and in recent work have been implicated in SCZ through open chromatin44. Further, p66α (mouse GATAD2A) was recently shown to participate in memory preservation through long-lasting histone modification in hippocampal memory-activated neurons45. The region on chromosome 10 appears to be shared across both disorders; however, there are additional independent variants contributing to SCZ and not BD, indicating another region of interest, although its biological interpretation remains unknown.
More broadly, SNP-heritability appears to be consistently shared across regions and chromosomes between these two disorders. Regions with GWS loci often explain higher proportions of heritability as expected. When looking at the effect on heritability of the presence of a GWS locus in the other disorder, we identified a significant increase in BD heritability for regions containing a GWS locus for SCZ but no significant increase in SCZ heritability in regions having a BD one. This result suggests a directionality to the genetic sharing of these disorders with a larger proportion of BD loci being specific to BD. However, we cannot exclude that the asymmetry of results may reflect less power of discovery for BD than SCZ. The degree to which power and subphenotypes contribute to this result requires further examination.
We have now identified multiple genomic signatures that correlate between one disorder and a clinical symptom in the other disorder, demonstrating that there are genetic components underlying particular symptom dimensions within these disorders. As previously shown, we find a significant positive correlation between PRS of BD and manic symptoms in SCZ. We also demonstrate that BD cases with psychotic features carry a significantly higher SCZ PRS than BD cases without psychotic features and that this result is not driven by the schizoaffective BD subtype. Further, we show evidence that increased PRS is associated with more severe illness. This is true for BD with psychotic features having increased SCZ PRS, earlier onset BD having higher SCZ PRS and cases with higher BD+SCZ PRS having a larger number of hospitalizations. We demonstrated that the presence of psychotic features within BD is an independently heritable trait and that GWS loci for SCZ have a consistent direction of effect in psychotic features in BD, demonstrating the potential to study psychosis more directly to identify variants contributing to that symptom dimension. All in all, this work illustrates the utility of genetic data to dissect symptom heterogeneity among correlated disorders and suggests that further work could potentially aid in defining subgroups of patients for more personalized treatment.