Abstract
Integrating rare variation from family and case/control studies has successfully implicated specific genes contributing to risk of autism spectrum disorder (ASD). In schizophrenia (SCZ), however, while sets of genes have been implicated through study of rare variation, very few individual risk genes have been identified. Here, we apply hierarchical Bayesian modeling of rare variation in schizophrenia and describe the proportion of risk genes and distribution of risk variant effect sizes across multiple variant annotation categories. Briefly, we developed a pipeline, based on previous work in ASD studies, to jointly estimate genetic parameters for one or multiple combined populations of any disease. We applied this method to the largest available collection of rare variants in schizophrenia (1,077 families, 6,699 cases and 13,028 controls). We defined five variant annotation categories: disruptive (nonsense, frameshift, essential splice site mutations), damaging (predicted to be damaging by seven algorithms) and silentFCPk (silent mutations within frontal cortex-derived DHS peaks) de novo mutations, and disruptive and damaging missense case/control singletons. We estimated that 8.01% of genes are risk genes (95% credible interval, CI, 4.59-12.9%), with mean effect sizes (95% CIs) of 12.25 (4.8-22.22) for disruptive de novos, 1.44 (1-3.16) for missense damaging de novos, and 1.22 (1-2.16) for silentFCPk de novos. The mean effect sizes of damaging and disruptive singleton variants for three case-control populations were 2.09 (1.04-3.54), 2.44 (1.04-5.73) and 1.04 (1-1.19), respectively. Our analysis identified only two known SCZ risk genes with FDR < 0.05: SETD1A and TAF13; and two other genes with FDR < 0.1: RB1CC1 and PRRC2A. We further used FDRs to directly analyze candidate gene sets for the enrichment of Bayesian support.
Significant enrichments were observed for essential genes, which were found enriched among autism genes in a recent study, and central nervous system (CNS) related genes, in addition to gene sets previously found to be enriched (including in these data). We conduct power analyses under our inferred model for SCZ, estimating the number of risk gene discoveries as more data become available, and quantifying the greater value of case/control over trio samples for novel rare variant risk gene discovery. We also applied the method to four other neurodevelopmental disorders: autism spectrum disorder (ASD), intellectual disability (ID), developmental disorder (DD) and epilepsy (EPI), in total 10,792 families, and 4,058 cases and controls. The predicted proportions of risk genes in these diseases were smaller than that in SCZ: 4.6% in ASD, and < 3% for the other disorders. We report 164 and 58 genes with FDR < 0.05 for DD and ID, respectively, 101 and 15 of which are novel. Overall, replication of previous results confirms the robustness of our approach, and our method is able to identify novel risk genes for SCZ as well as for other diseases.
1 Introduction
Schizophrenia (SCZ) is a complex psychiatric disorder characterized by psychosis, and by positive, negative and cognitive symptoms, with severe medical and social-functioning comorbidities and high public health costs. Despite a marked reduction in fecundity, a lifetime risk of 0.7% and a very high heritability of 60-80% are observed for the disease (Lichtenstein et al., 2009; Sullivan et al., 2003). The genetic architecture of SCZ is highly polygenic with contributions of common, rare and de novo genetic variants (Purcell et al., 2014; Fromer et al., 2014; Singh et al., 2016; Stefansson et al., 2009; Purcell et al., 2009). With the production of high-quality next-generation sequencing data, the genetics of schizophrenia and other diseases can be characterized increasingly well, especially for rarer variants.
Rare variants in case/control samples and de novo mutations have been successfully leveraged to implicate biologically relevant gene sets for this disease (Purcell et al., 2014; Fromer et al., 2014; Genovese et al., 2016), and to identify a handful of specific SCZ risk genes (Singh et al., 2016; Takata et al., 2016). However, the genetic architecture of SCZ for rare variants and de novo mutations remains unknown. Rare variant genetic architecture analyses could help gain further insights into this disease, for example by using the estimated number of risk genes to calibrate gene discovery false discovery rates, or by using the distribution of effect sizes to estimate power for rare variant association studies. A better understanding of our certainty in sets of risk genes for SCZ will provide a better picture of biological pathways specific to the disease.
Here, we aim to develop a pipeline for integrative analysis of case-control rare variants and de novo mutations in order to infer rare-variant genetic architecture and identify risk genes for SCZ as well as other diseases. To do this, we extend a hierarchical Bayesian modeling framework (TADA, Transmission And De novo Association) which was developed for autism spectrum disorder (ASD) (He et al., 2013). The new framework (extTADA, extended Transmission And De novo Association) can be used to analyze only de novo data, only case-control data or the combination of both. extTADA uses all variant classes to jointly estimate genetic parameters (therefore it assumes that all classes play important roles in the genetic architecture of the tested disease). In extTADA, a conditional model for case-control sample frequency allows rapid analysis without population frequency parameters (which are very poorly estimated for rare variants), facilitating estimation of parameters via Markov Chain Monte Carlo (MCMC). In addition, we designed extTADA for the analysis of data from multiple population samples. The pipeline is publicly available at https://github.com/hoangtn/extTADA.
In this study, we used extTADA to analyze the largest available exome-sequence data, including 19,727 (6,699+13,028) case+control samples and 1,077 trio/quad families for SCZ. We estimated mean relative risks (RRs) of different variant annotation categories as well as the proportion of risk genes for the disease. Based on this analysis, SCZ risk gene sets determined with different false discovery rate (FDR) thresholds were tested for enrichment in known and novel gene sets. Analysis of separate classes of variants/mutations in terms of annotation and rarity helps provide a detailed picture of the disease’s rare variant genetic architecture, allowing, for example, power analyses for risk gene discovery as more data become available. Finally, we used available data for four other neurodevelopmental diseases: intellectual disability (ID), autism spectrum disorder (ASD), epilepsy (EPI) and developmental disorder (DD), totaling 10,792 trios and 4,058 case/control samples. We are able to identify additional new significant genes for ID and DD based on extTADA results.
2 Results
The extTADA pipeline and its comparison with TADA are described in Figure S1. Figure S2 summarises the workflow of analyses of the current study. As presented in Figure S2, variants/mutations in this study were divided into categories: synonymous, missense, loss-of-function (LoF), missense damaging (MiD), and silent mutations within frontal cortex-derived DHS (silentFCPk); three main categories were then used in the analysis: MiD, LoF and silentFCPk.
2.1 The extTADA pipeline
We used a Bayesian approach to integrate de novo (DN) and case-control (CC) rare variant data, to infer genetic architecture parameters and to identify risk genes under a model with additive to dominant deleterious risk alleles. The framework is extended from the Transmission And De novo Association (TADA) model proposed by He et al. (2013) and De Rubeis et al. (2014), as shown in Figure S1. Primary extensions to the TADA model facilitate joint Bayesian inference of rare variant genetic architecture model parameters (including the risk gene mixture proportion π, which is fixed in TADA), and include a likelihood formulation in which all variant categories contribute to the inference, which also allows inference based on multiple samples. extTADA also uses an approximate expression for case-control data probability that eliminates population allele frequency parameters, and controls the proportion of protective variants by constraining effect size distribution scale parameters. We used the same symbols for parameters as those used in He et al. (2013) and De Rubeis et al. (2014) in the following sections. For comparison, we also described in detail methods originally presented in the TADA papers (He et al., 2013; De Rubeis et al., 2014).
In summary, for a given gene, all variants of a given annotation category (e.g. loss-of-function) were collapsed and considered as a single count. Let q, γ and μ be the population frequency of rare heterozygous genotypes for case/control (equivalently, transmitted/nontransmitted) data, the mean relative risk (RR) of the variants, and sum of mutation rates of de novo variants, respectively.
At each gene, two hypotheses H0: γ = 1 and H1: γ ≠ 1 were compared. A fraction of the genes π, assumed to be risk genes, were represented by the H1 model. Under this model, mean relative risks (γ) were assumed to follow a probability distribution across genes. The model H0 described non-risk genes, for which relative risks (γ) equal 1. As in He et al. (2013), we modeled de novo (xd), case (xca) and control (xcn) data as Poisson distributions, with Gamma distribution priors on their hyperparameters. In addition, in extTADA, we used a Beta distribution prior for π and constrained π to be less than 0.5, and used a nonlinear function for the variance parameter of γ to constrain mean RRs above 1 (i.e. so that variants are not implied by the model to be protective). Model parameters for TADA are shown in Table 1.
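As an illustration of the Poisson-Gamma structure described above, the following Python sketch computes a single-category de novo Bayes factor for one gene. It assumes the TADA parameterization of He et al. (2013), in which the de novo count is Poisson with mean 2Nμγ and γ follows a Gamma distribution with shape γ̄β and rate β, so that the marginal likelihood under H1 is negative binomial. The function names and the example values of μ and β are hypothetical and are not taken from the extTADA code.

```python
import math


def log_poisson_pmf(x, lam):
    """Log P(X = x) for X ~ Poisson(lam)."""
    return x * math.log(lam) - lam - math.lgamma(x + 1)


def log_negbin_pmf(x, shape, rate, exposure):
    """Log marginal P(X = x) when X | g ~ Poisson(exposure * g)
    and g ~ Gamma(shape, rate); this marginal is negative binomial."""
    p = rate / (rate + exposure)
    return (math.lgamma(x + shape) - math.lgamma(shape) - math.lgamma(x + 1)
            + shape * math.log(p) + x * math.log(1.0 - p))


def bf_de_novo(x_dn, n_families, mu, mean_rr, beta):
    """Bayes factor (H1 vs H0) for one gene and one de novo category.

    Assumed parameterization (He et al., 2013): gamma ~ Gamma(mean_rr * beta, beta).
    """
    exposure = 2.0 * n_families * mu  # expected de novo count when RR = 1
    log_h1 = log_negbin_pmf(x_dn, mean_rr * beta, beta, exposure)
    log_h0 = log_poisson_pmf(x_dn, exposure)
    return math.exp(log_h1 - log_h0)


# Hypothetical gene: mutation rate 2e-5, 1,077 families, mean RR 12.25, beta = 1.
for count in (0, 1, 3):
    print(count, bf_de_novo(count, 1077, 2e-5, 12.25, 1.0))
```

With these illustrative values, a gene with three de novo mutations yields a Bayes factor well above 1, whereas a gene with none yields a value below 1.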
At each gene, a Bayes Factor (BFgene) can be calculated for each category to compare models H1 and H0 (BF = P(data | H1)/P(data | H0)). BFgene can be calculated as the product of BFs across multiple variant categories, either DN and CC data or multiple annotation categories. Data could be from heterogeneous population samples; therefore, we extended TADA’s BFgene as the product of BFs of all variant categories including population samples as in Equation 1, in which Ndnpop and Nccpop are the numbers of DN and CC population samples, and Cdn and Ccc are the numbers of annotation categories in DN and CC data. To infer significant genes, BFs were converted to false discovery rates (FDRs) using the approach of Newton et al. (2004).
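Equation 1 itself is not reproduced in this text. A form consistent with the verbal description (a product over DN population samples and DN categories, times a product over CC population samples and CC categories) is sketched below; the index names h, k, a and b and the superscripts dn and cc are assumed notation rather than the authors' own.

```latex
\mathrm{BF}_{\mathrm{gene}} =
\left[ \prod_{h=1}^{N_{\mathrm{dnpop}}} \prod_{k=1}^{C_{\mathrm{dn}}}
       \mathrm{BF}^{\mathrm{dn}}_{hk} \right]
\left[ \prod_{a=1}^{N_{\mathrm{ccpop}}} \prod_{b=1}^{C_{\mathrm{cc}}}
       \mathrm{BF}^{\mathrm{cc}}_{ab} \right]
```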
To calculate BFs in Equation 1, hyper parameters for different categories in Table 1 are needed in advance. These were jointly estimated based on a mixture model of the two hypotheses as in Equation 2, where P1i and P0i at the ith gene were calculated across populations and categories as follows:
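Equation 2 and the per-gene terms are not reproduced in this text. A form consistent with the description above, offered as a reconstruction rather than the authors' exact notation, is given below; m (the number of genes), φ1 and φ0 (the parameter sets under H1 and H0) and the indices h, k, a and b are assumed symbols.

```latex
P(x \mid \phi_1, \phi_0) = \prod_{i=1}^{m} \left[ \pi P_{1i} + (1 - \pi) P_{0i} \right],
\qquad
P_{ji} =
\left[ \prod_{h=1}^{N_{\mathrm{dnpop}}} \prod_{k=1}^{C_{\mathrm{dn}}}
       P\!\left(x_{d,i,hk} \mid H_j\right) \right]
\left[ \prod_{a=1}^{N_{\mathrm{ccpop}}} \prod_{b=1}^{C_{\mathrm{cc}}}
       P\!\left(x_{ca,i,ab},\, x_{cn,i,ab} \mid H_j\right) \right],
\quad j \in \{0, 1\}
```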
To simplify the estimation process in Equation 2, we approximated the original TADA model for CC data, P(xca, xcn | Hj), using a new model in which case counts were conditioned on total counts: P(xca | xca + xcn, Hj) (see Methods and Figure S1).
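The reason conditioning removes the population frequency parameter can be checked numerically. The sketch below assumes the TADA count model of He et al. (2013), with case counts Poisson with mean N1·q·γ and control counts Poisson with mean N0·q, and a fixed relative risk γ; under those assumptions the case count given the total is binomial with success probability N1γ/(N1γ + N0), which does not involve q. Function names and example counts are hypothetical, and the treatment of the Gamma prior on γ in extTADA is not shown.

```python
import math


def log_poisson_pmf(x, lam):
    """Log P(X = x) for X ~ Poisson(lam)."""
    return x * math.log(lam) - lam - math.lgamma(x + 1)


def conditional_case_prob(x_ca, x_cn, n_case, n_control, rr):
    """P(x_ca | x_ca + x_cn) for a fixed relative risk: a binomial pmf."""
    total = x_ca + x_cn
    p = n_case * rr / (n_case * rr + n_control)
    return math.comb(total, x_ca) * p ** x_ca * (1.0 - p) ** x_cn


def conditional_from_joint(x_ca, x_cn, n_case, n_control, rr, q):
    """Same quantity derived from the joint Poisson model with frequency q."""
    total = x_ca + x_cn
    log_joint = (log_poisson_pmf(x_ca, n_case * q * rr)
                 + log_poisson_pmf(x_cn, n_control * q))
    log_total = log_poisson_pmf(total, q * (n_case * rr + n_control))
    return math.exp(log_joint - log_total)


# The two routes agree for any q, i.e. q cancels after conditioning.
direct = conditional_case_prob(3, 1, 3157, 4672, 2.09)
for q in (1e-5, 1e-4, 1e-3):
    print(q, direct, conditional_from_joint(3, 1, 3157, 4672, 2.09, q))
```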
extTADA used Markov Chain Monte Carlo (MCMC) for Bayesian analysis. We extracted posterior density samples from at least two MCMC chains. Posterior modes were reported as parameter estimates for all analyses, with 95% credible intervals (CIs).
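For illustration, the sketch below summarizes pooled posterior draws by a mode and a 95% interval. It makes two assumptions that the text does not specify: the mode is taken as the midpoint of the fullest histogram bin, and the interval is an equal-tailed quantile interval. The function names are hypothetical and do not come from the extTADA package.

```python
def credible_interval(samples, level=0.95):
    """Equal-tailed interval from sorted posterior draws (assumed interval type)."""
    ordered = sorted(samples)
    lo_idx = int((1.0 - level) / 2.0 * (len(ordered) - 1))
    hi_idx = int((1.0 + level) / 2.0 * (len(ordered) - 1))
    return ordered[lo_idx], ordered[hi_idx]


def histogram_mode(samples, n_bins=50):
    """Crude posterior mode: midpoint of the histogram bin with the most draws."""
    lo, hi = min(samples), max(samples)
    if hi == lo:
        return lo
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for value in samples:
        idx = min(int((value - lo) / width), n_bins - 1)
        counts[idx] += 1
    best = counts.index(max(counts))
    return lo + (best + 0.5) * width


# Pool draws from two (hypothetical) chains before summarizing.
chain_1 = [0.07, 0.08, 0.08, 0.09, 0.10]
chain_2 = [0.06, 0.08, 0.08, 0.09, 0.12]
pooled = chain_1 + chain_2
print(histogram_mode(pooled, n_bins=6), credible_interval(pooled))
```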
2.2 Evaluating extTADA on simulated data
In order to assess extTADA in a realistic use case, we analyzed the main model used in this study as described in Equation 2 on simulated DN and CC data with one variant category each. We also analyzed simulated CC data with one or two variant categories, to examine inference on a single variant class as well as to assess the conditional probability approximation for CC data (Figures S3, S4, S5 and S6, Supplementary Results 6.3). Trinucleotide context dependent mutation rate estimates (Samocha et al., 2014; Fromer et al., 2014; De Rubeis et al., 2014) were used for de novo data for both simulation and estimation. We tested sample sizes ranging from that of the available data, 1,077 trios and 3,157 cases (equal controls) (see below), to larger sample sizes of up to 20,000 cases (see Supplementary Results 6.3).
We saw little bias in parameter estimation (Tables S1 and S2). Slight under- and overestimation were observed for risk gene proportions and CC mean RRs, respectively, specifically for large simulated CC mean RRs; we note that these conditions appear outside the range of our SCZ analyses. Some bias can be expected in Bayesian analysis and is not expected to have a large effect on the risk gene identification results (He et al., 2013). We assessed this directly by calculating observed FDR (oFDR, i.e. the proportion of genes meeting a given FDR significance threshold that are not true simulated risk genes). We observed high correlations between oFDR and the FDR significance thresholds over wide parameter ranges (Figure 1). Only for small π (e.g., π = 0.02) were oFDRs higher than FDRs when de novo mean RRs were small (∼5). We also saw that oFDRs were equal to zero for some cases with small FDR, when very small numbers of FDR-significant genes were all true risk genes. We also ran extTADA on null data (π = 0) for both DN and CC data (Table S3). MCMC chains tended not to converge, π estimates trended to very small values, and Bayes factors and FDRs identified almost no FDR-significant genes, as expected (Table S3).
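The oFDR check can be sketched in a few lines of Python. Here oFDR is read as the share of genes called at a threshold that are not simulated risk genes, which is the reading under which oFDR is zero when every called gene is a true risk gene. The function name, gene labels and values are hypothetical.

```python
def observed_fdr(fdr_by_gene, true_risk_genes, threshold):
    """Fraction of genes called at `threshold` that are not simulated risk genes.

    Returns None when no gene meets the threshold, since the ratio is undefined.
    """
    called = [gene for gene, fdr in fdr_by_gene.items() if fdr <= threshold]
    if not called:
        return None
    false_calls = sum(1 for gene in called if gene not in true_risk_genes)
    return false_calls / len(called)


# Toy simulation output: model-based FDRs per gene and the simulated truth.
simulated_fdr = {"geneA": 0.01, "geneB": 0.04, "geneC": 0.20, "geneD": 0.03}
simulated_truth = {"geneA", "geneB"}
for cutoff in (0.02, 0.05, 0.3):
    print(cutoff, observed_fdr(simulated_fdr, simulated_truth, cutoff))
```

A well-calibrated analysis would show oFDR values close to the chosen thresholds when averaged over many simulated data sets.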
2.3 extTADA Analyses of Schizophrenia
We next applied extTADA to available DN and CC SCZ data (Figure S2), for inference of rare variant genetic architecture parameters, and for genic association. In total, 6,699 cases, 13,028 controls and 1,077 trio/quad families were used in this analysis (Table S12). Primary analyses included three variant categories for DN data, LoF, MiD and silentFCPk, and a single category of CC singletons (Purcell et al., 2014; Genovese et al., 2016) not present in the Exome Aggregation Consortium (ExAC) (Lek et al., 2015) (termed NoExAC), LoF+MiD. An array of secondary extTADA analyses was conducted to help validate and dissect our results.
2.3.1 SCZ data
De novo mutations and case-control variants were tested to select classes and samples for the extTADA pipeline. Since extTADA currently requires integer count data, adjustment for ancestry and technical covariates is not possible. For case-control data, there were multiple population samples and sequencing centers; therefore, the data were restricted to non-heterogeneous population samples. First, for the 4,929 cases and 6,232 controls of the Sweden population sample, we clustered all cases and controls into different groups and then tested for case-control differences with and without adjustment for covariates. We aimed to generate clusters yielding very similar results with and without adjustment for covariates. The clustering process divided the data set into three groups as in Figure S7: Group 1, 3,157 cases + 4,672 controls; Group 2, 681 cases + 367 controls; and Group 3, 1,091 cases + 1,193 controls. Only Groups 1 and 3 were used in the next stage because Group 2 showed some difference between adjusted and unadjusted results and was relatively small. As in Genovese et al. (2016), NoExAC variants showed significant case-control differences and InExAC variants did not (Figure S7). Second, only UK and Finnish sample case/control summary counts were available from the UK10K project data (Singh et al., 2016), and we used only the larger UK population sample. Again, significant case-control differences were observed only for NoExAC singleton variants; therefore, we used only NoExAC singletons in primary extTADA analyses, although we also used all singletons in secondary analyses for comparison.
For de novo mutations, we calculated the sample-adjusted ratios of mutation counts between 1,077 cases and 731 controls (Table S12). Similar to Takata et al. (2016), the highest ratio was observed for silentFCPk (2.57), followed by MiD (2.3), LoF (1.83), and missense and silent (∼1.3) mutations (Figure S8). Three classes (LoF, MiD and silentFCPk) were used in extTADA analyses.
2.3.2 Rare variant genetic architecture of SCZ
Three categories of de novo mutations and one category of case/control variants were used in integrative analysis using extTADA. They included LoF, MiD and silentFCPk de novo mutations, and LoF+MiD case-control variants. LoF and MiD variants showed similar enrichment in our case-control data analysis (Figure S7); we pooled them in order to maximize the case-control information. There were four population samples in total: one de novo population, and three case-control populations including two Sweden clusters and the UK data from the UK10K project.
extTADA generated samples from the joint posterior density of all genetic parameters for SCZ. All MCMC chains showed convergence (Figure S9). The estimated proportion of risk genes was 8.01% (95% CI = (4.59%, 12.9%)). LoF de novo variants had the highest estimated mean RR, 12.25 (4.78, 22.22). The two other de novo classes had estimated mean RRs of 1.22 (1, 2.16) for silentFCPk and 1.44 (1, 3.16) for MiD. For MiD+LoF case-control variants, the two Sweden populations had nearly equal mean RRs, 2.09 (1.04, 3.54) and 2.44 (1.04, 5.73); however, the signal was weak for the UK population, with mean RR 1.04 (1, 1.19) (Table 2, Figure 2).
To test the performance of the pipeline on individual data types and to assess their contribution to the overall results, we ran extTADA separately on each of four single variant classes: silentFCPk, MiD and LoF de novo mutations, and MiD+LoF case-control variants (Table S4). All parameter estimates were consistent with the integrative analysis, but with broader credible intervals. The much larger CIs than in the integrative analysis demonstrate extTADA’s borrowing of information across data types (also observed in simulation, Figure S4).
We also assessed the sensitivity of genetic parameter inference in several secondary analyses. We observed that synonymous de novo mutation counts were lower than expected, suggesting that mutation rates may be systematically overestimated. Adjusting mutation rates by a factor of 0.81, DNM mean RR estimates slightly increased as expected, and the estimated proportion of risk genes increased slightly to 9.37% (5.47-15.12%), while case-control parameters were highly similar (Table S5). Above we assumed that different case-control population samples may have different mean RRs, which could be due to clinical ascertainment, stratification or population specific genetic architectures. Analysis using a single mean RR parameter for all three case-control samples yielded similar π and DNM mean RRs and an intermediate CC MiD+LoF mean RR of 1.93 (1.08-3.21), with a relatively narrower credible interval (Table S6, Figure S11). Considering all CC singleton variants (not just those absent from ExAC) in extTADA also generated similar genetic parameter estimates, with predictably slightly lower case-control mean RRs (Table S7). We note that these alternative analyses also slightly impact support for individual genes as described below.
2.3.2.1 Identifying SCZ risk genes using extTADA
extTADA also generates Bayes factors for all genes, from which we calculated posterior probabilities of association (PPAs) (Stephens and Balding, 2009) and false discovery rates (FDRs) (Benjamini and Hochberg, 1995) (Table S8, which includes supporting data as well as association results). Four genes achieved PPA > 0.8 and FDR < 0.1 (SETD1A, TAF13, PRRC2A, RB1CC1). Two genes, SETD1A (FDR = 0.0033) and TAF13 (FDR = 0.026), were individually significant at FDR < 0.05. SETD1A has been confirmed as the most statistically significant SCZ gene in previous studies (Singh et al., 2016; Takata et al., 2016), while TAF13 was only reported as a potential risk gene in the study of Fromer et al. (2014). Interestingly, for the RB1CC1 gene, rare duplications were reported to be associated with SCZ with a very high odds ratio (8.58) in the study of Degenhardt et al. (2013), but this gene has not been reported in other studies since. In addition, as discussed by the authors, duplications at this gene were also observed by Cooper et al. (2011) with an odds ratio of 5.29 in a study of 15,767 children with ID and/or DD. If we increase the FDR threshold to 0.3 as in the previous ASD study of De Rubeis et al. (2014), we identify 24 candidate SCZ risk genes (SETD1A, TAF13, RB1CC1, PRRC2A, VPS13C, MKI67, RARG, ITSN1, KIAA1109, DARC, URB2, HSPA8, KLHL17, ST3GAL6, SHANK1, EPHA5, LPHN2, NIPBL, KDM5B, TNRC18, ARFGEF1, MIF, HIST1H1E, BLNK). Of these, EPHA5, KDM5B and ARFGEF1 did not have any de novo mutations (Table S8). We note that still more genes showed substantial support for the alternative hypothesis over the null under the model (Jeffreys, 1998) (58 genes with PPA > 0.5, corresponding to BF > 11.49, FDR < 0.391; Table S8).
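The conversion from Bayes factors to PPAs and FDRs can be sketched as follows. The PPA formula is the standard posterior odds calculation with prior risk gene proportion π. The FDR step assumes the direct posterior approach referred to in Section 2.1 (Newton et al., 2004), in which the FDR for a gene is the mean posterior null probability over all genes ranked at or above it. Function names are hypothetical and the extTADA implementation may differ in detail.

```python
def bf_to_ppa(bf, pi):
    """Posterior probability of association from a Bayes factor and prior pi."""
    return pi * bf / (pi * bf + 1.0 - pi)


def bayesian_fdr(bfs, pi):
    """Per-gene FDR: running mean of (1 - PPA) over genes sorted by decreasing PPA."""
    ppas = [bf_to_ppa(bf, pi) for bf in bfs]
    order = sorted(range(len(bfs)), key=lambda i: ppas[i], reverse=True)
    fdr = [0.0] * len(bfs)
    running_null = 0.0
    for rank, i in enumerate(order, start=1):
        running_null += 1.0 - ppas[i]
        fdr[i] = running_null / rank
    return fdr


pi_hat = 0.0801  # estimated SCZ risk gene proportion from the text
bf_at_half = (1.0 - pi_hat) / pi_hat  # Bayes factor at which PPA equals 0.5
print(bf_at_half, bf_to_ppa(bf_at_half, pi_hat))
print(bayesian_fdr([1000.0, bf_at_half, 0.1], pi_hat))
```

With π = 0.0801, PPA = 0.5 corresponds to a Bayes factor of (1 − π)/π, which is about 11.48 and so close to the threshold of 11.49 quoted above.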
Secondary extTADA analyses had predictable effects on risk gene identification. Considering all CC singleton variants (not just those absent from ExAC) decreased the impact of CC data and yielded slightly fewer significant genes (three and seventeen genes with FDR < 0.1 and < 0.3, respectively). Using a single CC parameter for the model also resulted in four and 22 significant genes for FDR < 0.1 and < 0.3, respectively. Mutation rate adjustment increased support for individual genes with DNMs, increasing the findings to three and six genes at FDR < 0.05 and < 0.1, respectively (Table S9). Generally the top genes were consistent across analyses; specifically, SETD1A and TAF13 were always the top significant genes (FDR < 0.05 in all analyses).
2.3.3 Enrichment of gene sets in extTADA SCZ risk gene candidates
From extTADA, we extracted the FDR of each gene to test the enrichment of gene sets. We used gene set mean FDR to test for significant enrichment in comparison to random gene sets, and empirical P-values were FDR corrected (Benjamini and Hochberg, 1995).
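A minimal sketch of this kind of test is shown below: the mean FDR of a gene set is compared with the mean FDR of random gene sets of the same size, and the resulting empirical P values are adjusted with the Benjamini-Hochberg procedure. The random sets here are uniform draws from all genes, which is an assumption; the study's actual sampling scheme (see Methods) may match on other gene properties. All names are hypothetical.

```python
import random


def enrichment_p_value(gene_fdr, gene_set, n_permutations=10000, seed=0):
    """Empirical P value: how often a random set has mean FDR <= the observed mean."""
    rng = random.Random(seed)
    members = [gene for gene in gene_set if gene in gene_fdr]
    if not members:
        raise ValueError("gene set has no genes with an FDR value")
    observed = sum(gene_fdr[g] for g in members) / len(members)
    all_genes = sorted(gene_fdr)
    hits = 0
    for _ in range(n_permutations):
        draw = rng.sample(all_genes, len(members))
        if sum(gene_fdr[g] for g in draw) / len(draw) <= observed:
            hits += 1
    return (hits + 1) / (n_permutations + 1)  # add-one avoids P = 0


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy data: 100 genes, five of which have strong support (low FDR).
toy_fdr = {f"g{i}": 0.9 for i in range(100)}
for i in range(5):
    toy_fdr[f"g{i}"] = 0.01
p_top = enrichment_p_value(toy_fdr, {"g0", "g1", "g2", "g3", "g4"}, 2000)
p_bottom = enrichment_p_value(toy_fdr, {"g95", "g96", "g97", "g98", "g99"}, 2000)
print(benjamini_hochberg([p_top, p_bottom]))
```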
2.3.3.1 Top SCZ significant genes from extTADA are enriched in known gene sets
We first tested 161 gene sets previously implicated in SCZ genetics or with strong genetic evidence relevant to SCZ rare variation (Table S10) (Purcell et al., 2014; Genovese et al., 2016; Pardinas et al., 2017; Ji et al., 2016; Epi4K Consortium and Epilepsy Phenome/Genome Project, 2013; Lin et al., 2012). FDR-significant results were observed for 61 gene sets including those reported using these data (Purcell et al., 2014; Fromer et al., 2014; Genovese et al., 2016) (Table 3). The most significant gene sets were genes harboring de novo SNPs and Indels in DD and ASD, missense constrained and loss-of-function intolerant (pLI09) genes, targets of the fragile X mental retardation protein (FMRP) and CELF4 genes, targets of RBFOX1/3 and RBFOX2 splicing factors, CHD8 promoter targets, and post-synaptic density activity-regulated cytoskeleton-associated (ARC), NMDA-receptor (NMDAR) and mGluR5 complexes (all P < 8.0e-04, FDR < 4.5e-03; Table 3). Genes exhibiting allelic bias in neuronal RNA-seq data (Lin et al., 2012) were also strongly enriched in SCZ extTADA results (P = 1.1e-05, FDR = 1.4e-04). Significant enrichments were also obtained for several gene sets enriched in the recent SCZ GWAS of Pardinas et al. (2017), including the mouse mutant gene sets with psychiatric-relevant phenotypes including abnormal behavior, and abnormal nervous system morphology and physiology, as well as genome-wide significant genes from the SCZ gene-level GWAS itself (Pardinas et al., 2017) (P = 9.4e-03, FDR = 5.0e-03), showing convergence with common-variant genetic signal in genes hit by rare variation in SCZ. In addition, novel results were observed for essential genes, and known epilepsy genes (P ≤ 2.0e-04, FDR ≤ 1.6e-03; Table 3). The essential gene set was recently reported by Ji et al. (2016) to be enriched for ASD risk genes.
De novo genes for other neurodevelopmental diseases (see below) were also strongly enriched in SCZ (DD, P = 1.0e-07, FDR = 2.3e-06; ASD, P = 2.1e-06, FDR = 3.4e-05; ID, P = 7.9e-04, FDR = 4.4e-03).
2.3.3.2 Top SCZ genes are enriched in other gene sets from a data-driven approach
To test more novel gene sets for enrichment in the SCZ extTADA results, we tested 1,878 gene sets from several databases, and FDR-adjusted for the full set of 1,717 + 161 = 1,878 gene sets tested (Table S11). We used GO, KEGG, REACTOME and C3 sets from MSigDB (http://software.broadinstitute.org/gsea/msigdb), filtered for sets including greater than 100 genes (see Methods for details).
Significant results were observed in 103 gene sets including 36 gene sets in the above 161 gene sets. The top known gene sets still had the lowest P values in these results. We observed significant enrichment of several C3 conserved non-coding motif gene sets showing brain specific expression (Xie et al., 2005): GGGAGGRR V$MAZ Q6, genes containing the conserved M24 GGGAGGRR motif, a MAZ transcription factor binding site; ACAGGGT,MIR-10A,MIR-10B, including microRNA MIR10A/B targets; M12 CAGGTG V$E12 Q6, E12/TCF3 targets; M17 AACTTT UNKNOWN, IRF1 targets; and M13 CTTTGT V$LEF1 Q2, LEF1 targets (P ≤ 1.5e-04, FDR < 0.01; Table S11). Relatively specific significant GO gene sets included GO:0045202/synapse and GO:0043005/neuron projection (P ≤ 2e-04, FDR 0.01). GO:0051179/localization (P = 6.4e-05, FDR = 5.2e-03) was reported by Murphy and Benítez-Burraco (2016) in a study relating to language evolution and SCZ.
2.3.4 Power analysis for SCZ exome sequencing studies across sample sizes
We simulated risk gene discovery with extTADA using the genetic architecture of SCZ inferred from the current data. Different sample sizes from 500-20,000 trio families and 1,000-50,000 cases (controls = cases) were simulated as in our validation analyses, using parameters from the posterior distribution samples given the SCZ data. The number of risk genes with FDR ≤ 0.05 ranged from 0 to 238. Based on this analysis, we expect > 50 risk genes with total sample sizes of trio families plus case-control pairs of ∼24,000 (Figure 3). The results imply that, assuming sequencing costs are proportional to the number of individuals, generating case-control data is more efficient than trio data despite the larger relative risks of de novo mutations.
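The logic of such a power analysis can be illustrated with a much reduced simulation. The sketch below uses de novo data only, a single variant category, a common mutation rate for all genes and fixed parameters, none of which matches the full analysis (which also simulates case-control data and draws parameters from the posterior). It assumes the Poisson-Gamma parameterization of He et al. (2013) and the direct posterior FDR; every name and default value is hypothetical.

```python
import math
import random


def sample_poisson(rng, lam):
    """Knuth's multiplication method; adequate for the small rates used here."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def log_bf_de_novo(x, exposure, mean_rr, beta):
    """Log Bayes factor: negative binomial marginal (H1) vs Poisson (H0)."""
    shape, p = mean_rr * beta, beta / (beta + exposure)
    log_h1 = (math.lgamma(x + shape) - math.lgamma(shape) - math.lgamma(x + 1)
              + shape * math.log(p) + x * math.log(1.0 - p))
    log_h0 = x * math.log(exposure) - exposure - math.lgamma(x + 1)
    return log_h1 - log_h0


def count_discoveries(n_trios, n_genes=2000, pi=0.08, mean_rr=12.0, beta=1.0,
                      mu=2e-5, fdr_threshold=0.05, seed=1):
    """Simulate de novo counts and count genes passing the FDR threshold."""
    rng = random.Random(seed)
    exposure = 2.0 * n_trios * mu  # expected count per gene when RR = 1
    ppas = []
    for _ in range(n_genes):
        is_risk = rng.random() < pi
        rr = rng.gammavariate(mean_rr * beta, 1.0 / beta) if is_risk else 1.0
        x = sample_poisson(rng, exposure * rr)
        log_bf = log_bf_de_novo(x, exposure, mean_rr, beta)
        ppas.append(1.0 / (1.0 + (1.0 - pi) / pi * math.exp(-log_bf)))
    ppas.sort(reverse=True)
    running_null, discoveries = 0.0, 0
    for rank, ppa in enumerate(ppas, start=1):
        running_null += 1.0 - ppa
        if running_null / rank <= fdr_threshold:
            discoveries = rank
    return discoveries


for n in (500, 5000, 20000):
    print(n, count_discoveries(n))
```

Repeating such a simulation over a grid of trio and case-control sample sizes gives the kind of discovery surface summarized in Figure 3.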
2.4 extTADA Analyses of Other Neurodevelopmental Disorders
We also used the current pipeline to infer rare variant genetic architecture parameters from available data for autism spectrum disorder (ASD), intellectual disability (ID), developmental disorders (DD), and epilepsy (EPI). Sample sizes of these diseases are presented in Table S12 and Figure S2. Numbers of trios were 365 for EPI, 1,112 for ID, 4,293 for DD and 5,122 for ASD. As previously reported (see references in Table S12), these data have strong signals for de novo mutations contributing to disease (Table S13). Only ASD data included case-control samples (404 cases, 3,654 controls) from the Swedish PAGES study of the Autism Sequencing Consortium (De Rubeis et al., 2014) (see Methods for details).
2.4.1 Rare variant genetic architectures of ASD, ID, DD, EPI
extTADA genetic parameter estimates are presented in Figure 4 and Table 4. MCMC analyses showed good convergence, except for the small sample size EPI (392 families compared with > 1000 families for other diseases). The proportions of risk genes (π) in these diseases were lower than that of SCZ (Figure 4, Tables 2 & 4). For ASD, the estimated proportion of risk genes π was 4.59% (95% CI 3.19% - 6.01%), consistent with the result of 550-1000 genes estimated in the original TADA model (He et al., 2013) using only LoF de novo data. For ID, π was smaller than that of ASD; the estimated value was 2.76% (2.1% - 3.7%). For DD, π = 2.87% (2.34% - 3.49%) was similar to that of ID. The estimated π value for EPI, 1.65% (0.8% - 3.21%), was the lowest but had a broad credible interval owing to its much smaller sample size. Mean RRs of de novo mutations in all four neurodevelopmental diseases were much higher than those of SCZ. This was expected because of the strong signal of de novo mutations in these data for other diseases. For ASD, estimated mean RRs for de novo mutations were consistent with previous results and much lower than for the other diseases. ID and DD had the highest estimated de novo LoF mean RRs, 96.0 (68 - 131) and 86.5 (66 - 112), respectively. Even though the EPI estimated de novo LoF mean RR, 77.0 (37 - 138), was slightly lower than those of ID and DD, the estimate for EPI de novo MiD mean RR, 48 (20 - 87), was somewhat higher than those of other diseases. The previously estimated (Epi4K Consortium and Epilepsy Phenome/Genome Project, 2013) EPI MiD mean RR of 81 is consistent with the current results, and it will be of interest to see if this result remains consistent in additional data in the future.
2.4.2 Novel risk genes in ID and DD
The extTADA risk gene results of the four disorders ID, DD, ASD and EPI are presented in Tables S14, S15, S16 and S17. Results of other de novo mutation methods using these same data have been recently reported (Lelieveld et al., 2016; Deciphering Developmental Disorders Study, 2017); nevertheless, extTADA identified novel genes with strong statistical support from these recent data. There were 58 and 73 genes for ID with FDR ≤ 0.05 and ≤ 0.1, respectively, and 164 and 201 genes for DD. In ID, 15 of 58 FDR ≤ 0.05 genes (TCF7L2, USP7, ATP8A1, FBXO11, KDM2B, MED12L, MAST1, MFN1, TNPO2, CLTC, CEP85L, AGO1, AGO2, SLC6A1-AS1, POU3F3) were not on the list of previously reported known and novel ID genes (Lelieveld et al., 2016). Of the 15 genes, six (TNPO2, AGO2, CLTC, CEP85L, FBXO11, MFN1) were strongly significant (FDR < 0.01); these are genes hit by two or three MiD or LoF de novos that were not identified by the simulation based analyses of Lelieveld et al. (2016). In DD, only 59 of 164 FDR ≤ 0.05 genes were reported by Deciphering Developmental Disorders Study (2017); 101 genes are novel. Similar to ID, the total MiD+LoF de novo counts of these 101 genes were not high (between two and six). Surprisingly, 58 of the 101 genes had FDRs < 0.01.
2.4.3 Multiple gene sets are enriched in top significant genes across neurodevelopmental diseases
We also tested for gene set enrichment in the four NDs and combined this information with the SCZ gene-set information above (Tables S18 and S19, Figures 5 and S12). First, we tested the 161 known or strong-candidate gene sets tested in SCZ (see Methods for details). The numbers of significant gene sets (FDR < 0.05) were 51, 74, 29 and 17 for ID, DD, ASD and EPI, respectively. Five gene sets were significant across all five diseases: Cav2 channels, FMRP targets, NMDAR network, PSD95, and abnormal excitatory postsynaptic currents (all FDR ≤ 0.0097). Second, we tested our 1,877 data-driven gene sets; only one gene set was significant in all five diseases after FDR adjustment: NMDAR network genes (all FDR ≤ 0.024). FMRP target genes were also highly significant across ASD, ID, DD and SCZ (all FDR ≤ 3.1e-05) but not significant for EPI (FDR = 0.058, Figure S12, Table S19).
The number of significant gene sets was not as high in EPI as in the other diseases, likely due to its smaller sample size and power; therefore, we removed this disorder and repeated our assessment of significant gene sets overlap in the four disorders SCZ, DD, ID and ASD. Twelve gene sets were significant in all four disorders. These consisted of the five gene sets above and seven other gene sets: constrained genes (constrained and pLI09), rbfox1/3 and rbfox2 targets, CHD8 targets (chd8 human brain), and the mouse mutant gene sets abnormal social investigation and abnormal brain size. In an analysis of all 1,877 data-driven gene sets, FMRP targets, constrained and pLI09 genes, and NMDAR network genes remained significant across the four disorders. In addition, one other gene set, GO:0016568/chromatin organization, was also enriched for each of SCZ, ASD, DD and ID (Table S19, Table S18).
3 Discussion
In this work, we have built an integrative pipeline extTADA for Bayesian analysis of de novo mutations and rare case-control variants, to infer genetic architecture parameters and identify risk genes. We applied extTADA to available data in schizophrenia and four other neurodevelopmental disorders (Figure S2). The pipeline is based on our previous work in autism sequencing studies, TADA (He et al., 2013; De Rubeis et al., 2014), and conducts fully Bayesian analysis of a simple rare variant genetic architecture model. Unlike TADA, which was developed for studies where LoF de novo mutations have strong discernible effects, we developed extTADA for schizophrenia, where de novo and case-control variants have more subtle effects discernible only at the level of gene set analysis. extTADA borrows information across all annotation categories and between de novo and case-control samples in genetic parameter inference, critical for sparse rare variant sequence data, and we hope that it will be generally useful for rare variant analyses across complex traits.
Using Markov Chain Monte Carlo, extTADA samples from the joint posterior density of risk gene proportion and mean relative risk parameters. Inference of rare variant genetic architecture is of great interest in its own right (Zuk et al., 2014), but of course risk gene discovery is one of the most important objectives of genetics. We provide Bayesian statistical support for risk gene status in the form of Bayes factors for each gene, and we further calculate posterior probabilities (Stephens and Balding, 2009) and false discovery rates (Benjamini and Hochberg, 1995). Although we use TADA for inference of genetic parameters, and joint analysis certainly impacts genetic parameter estimation (see the primary analysis vs single class analyses in Tables S8 and S4), we found that the empirical Bayesian approach of calculating genic BFs from model parameter point estimates (He et al., 2013) is highly similar to joint posterior mean genic BFs (see Methods). Therefore, the approach of He et al. (2013) is a good one if model parameters are known approximately, and we maintain this functionality in extTADA if users have prior information on the rare variant genetic architecture of the tested disease.
As in all Bayesian and likelihood analyses, we must specify a statistical model; the true model underlying the data is unknown and could in principle yield different results. This is addressed by analyzing a simple model that can allow illustrative, interpretable results, and by assessing the sensitivity of results to a range of alternative model specifications. extTADA uses relatively agnostic hyper-parameter prior distributions (Figure S2), without assuming known parameters and without any previously known risk gene seeds. Still, extTADA makes important assumptions, both in common with TADA and uniquely. First, both models assume Poisson distributed counts data and Gamma distributed mean relative risks across genes for analytical convenience, making alternative model specification inconvenient. Poisson counts are likely to be a good approximation for genetic counts data (He et al., 2013), assuming linkage disequilibrium can be ignored, and that stratification has been adequately addressed. Alternatives should be explored for Gamma distributed mean relative risk distributions. Poisson de novo mutation counts further assume known mutation rates, uncertainty in which may introduce bias for multiple reasons; in our data, mutation rate adjustment for silent de novo count rates was actually anti-conservative (Table S9). Differences between de novo studies are not unlikely, even though the previous studies of De Rubeis et al. (2014) and Singh et al. (2016) did not adjust mutation rates to account for them. The ability to incorporate covariates, perhaps with Gaussian sample frequency data and Gaussian effect sizes, would be an important further extension of TADA-like models.
Second, extTADA assumes that different variant classes share risk genes such that the mixture model parameter π applies to all data types, facilitating borrowing of information across classes. This is supported by convergent de novo and case-control rare variant results in SCZ (Fromer et al., 2014; Purcell et al., 2014; Singh et al., 2016; Genovese et al., 2016) (Table S4); however, some evidence exists for disjoint risk genes for de novo vs case-control protein-truncating variants, e.g. in congenital heart disease (CHD) (Sifrim et al., 2016). We emphasize that we do consider multiple population samples as different categories in extTADA, since sequence data are very often from different countries and/or centers. (Here we used multiple categories of case-control data, but multiple de novo categories could be important as well.)
The current study replicated previous studies, and supplies new information about SCZ. First, SETD1A (Singh et al., 2016; Takata et al., 2016) is the most significant gene across analyses (FDR ∼ 1.5 × 10−3); TAF13 (Fromer et al., 2014) is also significant across analyses. Of the two genes with FDR < 0.1, RB1CC1 was reported in a study of copy-number variation in SCZ (Degenhardt et al., 2013). Second, we found substantial overlap of top genes in this study and gene sets known from previous reports on these same SCZ data (Genovese et al., 2016). Several conserved non-coding motif gene sets (Xie et al., 2005) and a few GO gene sets were also significant (Table 3). Third, in this study, we describe in detail the rare variant genetic architecture of SCZ. It appears more complex than those of ASD, ID, DD and EPI; the estimated risk gene proportion for SCZ (∼8%) is higher than those of the four other diseases (Figures 2 and 4, Tables 2 and 4). We also see that disease risk information is concentrated in ultra-rare variants not present in the ExAC database (Kosmicki et al., 2016; Genovese et al., 2016) (Table S7). Finally, we see substantial overlap between de novo and case-control, and common variant (Pardinas et al., 2017) genes in SCZ.
We used extTADA to infer genetic parameters for four other neurodevelopmental diseases: ASD, EPI, DD and ID (Table 4, Figure 4). The ASD results of extTADA are comparable to previous results (He et al., 2013; De Rubeis et al., 2014). We note the exceptionally high de novo missense damaging mean RR estimated for EPI, also consistent with previous analyses (EuroEPINOMICSRES Consortium et al., 2014). We also highlight the sharing of gene sets enriched across multiple neurodevelopmental diseases (Figure 5), including diverse synaptic gene sets, possibly distinguishing EPI as less similar to the other disorders. Multi-phenotype analyses leveraging this sharing could have higher power to detect novel risk genes. Finally, and importantly, extTADA discovers many novel significant genes that were missed in recent studies (101 for DD and 15 for ID).
4 Data and methods
4.1 Data
Figure S2 shows the workflow of all data used in this study.
4.1.1 Variant data of SCZ, ID, DD, EPI and ASD
High-quality variants were obtained from published analyses (Table S12). Variants were annotated using Plink/Seq (using RefSeq gene transcripts, UCSC Genome Browser, http://genome.ucsc.edu) as described in Fromer et al. (2014). SnpSift version 4.2 (Cingolani et al., 2012) was used to further annotate these variants using dbnsfp31a (Liu et al., 2015). Variants were grouped into different categories as follows. Loss of function (LoF): nonsense, essential splice, and frameshift variants. Missense damaging (MiD): defined as missense by Plink/Seq and damaging by all of 7 methods (Genovese et al., 2016): SIFT, Polyphen2_HDIV, Polyphen2_HVAR, LRT, PROVEAN, MutationTaster and MutationAssessor. Recently, Takata et al. (2016) reported significant results for synonymous mutations in regulatory regions; therefore, this category was also analyzed. To annotate synonymous variants within DNase I hypersensitive sites (DHS) as in Takata et al. (2016), the file wgEncodeOpenChromDnaseCerebrumfrontalocPk.narrowPeak.gz was downloaded from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeOpenChromDnase/ on April 20, 2016. Based on previous results with SCZ exomes (Purcell et al., 2014; Genovese et al., 2016), only case-control singleton variants were used in this study. The data from the Exome Aggregation Consortium (ExAC) (Lek et al., 2015) were used to annotate variants as inside ExAC (InExAC, or not private) or not inside ExAC (NoExAC, or private). On April 20, 2016, the file ExAC.r0.3.nonpsych.sites.vcf.gz was downloaded from http://ftp.broadinstitute.org/pub/ExAC_release/release0.3/subsets/ and BEDTools was used to obtain variants inside (InExAC) or outside this file (NoExAC).
4.1.2 Gene sets
Multiple resources were used to obtain gene sets for our study. First, we used known gene sets with prior evidence for involvement in schizophrenia and autism from several sources. Second, to identify possible novel significant gene sets, we collected gene sets from available databases (see below).
4.1.2.1 Known gene sets
These gene sets and their abbreviations are presented in Table S10.
Gene sets enriched for ultra-rare variants in SCZ, which were described in detail in Genovese et al. (2016): missense constrained genes (constrained) from Samocha et al. (2014), loss-of-function tolerance genes (pLI90) from Lek et al. (2015), RBFOX2 and RBFOX1/3 target genes (rbfox2, rbfox13) from Weyn-Vanhentenryck et al. (2014), Fragile X mental retardation protein target genes (fmrp) from Darnell et al. (2011), CELF4 target genes (celf4) from Wagnon et al. (2012), synaptic genes (synaptome) from Pirooznia et al. (2012), microRNA-137 (mir137) from Robinson et al. (2015), PSD-95 complex genes (psd95) from Bayés et al. (2011), ARC and NMDA receptor complexes (arc, nmdar) genes from Kirov et al. (2012), and de novo copy number variants in SCZ, ASD and bipolar as presented in Supplementary Table 5 of Genovese et al. (2016).
Allelic-biased expression genes in neurons from Table S3 of Lin et al. (2012).
Promoter targets of CHD8 from Cotney et al. (2015).
Known ID gene set was from the Sup Table 4 of Lelieveld et al. (2016) and the 10 novel genes reported by Lelieveld et al. (2016).
Gene sets from MiD and LoF de novo mutations of ASD, EPI, DD, ID.
The essential gene set from the supplementary data set 2 of Ji et al. (2016). Lists of human accelerated regions (HARs) and primate accelerated regions (PARs) (Lindblad-Toh et al., 2011) were downloaded from http://www.broadinstitute.org/scientific-community/science/projects/mammals-models/29-mammals-project-supplementary-info on May 11, 2016. The coordinates of these regions were converted to hg19 using Liftover tool (Kent et al., 2002). We used a similar approach as Xu et al. (2015) to obtain genes nearby HARs. Genes in regions flanking 100 kb of the HARs/PARs were extracted to use in this study (geneInHARs, geneInPARs).
List of known epilepsy genes was obtained from Supplementary Table 3 of Phenome et al. (2017).
List of common-variant genes was obtained from Extended Table 9 of Pardinas et al. (2017).
134 gene sets from mouse mutants with central nervous system (CNS) phenotypes were obtained from Pardinas et al. (2017). Steps which were used to obtain the gene sets were described in Pocklington et al. (2015). We finally obtained 134 gene sets from this step after removing overlapping gene sets between previous studies and the 161 gene sets.
In the gene-set tests for a given disease, we removed the list of known genes and the list of de novo mutation genes for that disease. As a result, we tested 161 known gene sets for ASD, DD and SCZ; and 159 gene sets for EPI and ID.
4.1.2.2 Other gene sets
We also used multiple data sets to identify novel gene sets overlapping with the current gene sets. We used gene sets from the Gene Ontology database (Consortium et al., 2015), and KEGG, REACTOME and C3 motif gene sets collected by the Molecular Signatures Database (MSigDB) (Subramanian et al., 2005). To increase the power of this process, we only used gene sets with between 100 and 4995 genes. In total, there were 1717 gene sets. These gene sets and the known gene sets above were used in this data-driven approach.
4.2 Methods
4.2.1 extTADA pipeline: extended transmission (case-control) and de novo analysis
4.2.1.1 extTADA for one de novo population and one case/control population
extTADA is summarized in Table 1 and Figure S1. There,
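$x_d \sim \mathrm{Pois}(2 N_d \mu \gamma_{dn})$, $x_{ca} \sim \mathrm{Pois}(q N_1 \gamma_{cc})$, $x_{cn} \sim \mathrm{Pois}(q N_0)$, and $\gamma_{dn} \sim \mathrm{Gamma}(\bar{\gamma}_{dn} \beta_{dn}, \beta_{dn})$, $\gamma_{cc} \sim \mathrm{Gamma}(\bar{\gamma}_{cc} \beta_{cc}, \beta_{cc})$, $q \sim \mathrm{Gamma}(\rho, \nu)$.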
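Let K be the number of categories (e.g., LoF, MiD), and $x_i = (x_{i1}, \ldots, x_{iK})$ be the vector of counts at the ith gene. The Bayes Factor for the jth category, testing the two hypotheses $H_0: \gamma = 1$ versus $H_1: \gamma \neq 1$, was:
$$BF_{ij} = \frac{P(x_{ij} \mid H_1)}{P(x_{ij} \mid H_0)} = \frac{\int\!\!\int P(x_{ij} \mid \gamma, q)\, P(\gamma \mid H_1)\, P(q \mid H_1)\, d\gamma\, dq}{\int P(x_{ij} \mid q)\, P(q \mid H_0)\, dq} \qquad (3)$$
In Equation 3, $x_{ij} = x_d$ for de novo data and $x_{ij} = (x_{ca}, x_{cn})$ for case-control data. In addition, the integral over q was not applicable for de novo data because there is no q parameter for de novo data.
As in He et al. (2013), the BF for the ith gene combining all categories is:
$$BF_i = \prod_{j=1}^{K} BF_{ij} \qquad (4)$$
To calculate BFs, the hyperparameters in Table 1 need to be inferred. Let $\phi_{1j}$ and $\phi_{0j}$ be the hyperparameters for $H_1$ and $H_0$ respectively. A mixture model of the two hypotheses was used to infer parameters using information across the number of tested genes (m) as:
$$P(x \mid \phi_1, \phi_0) = \prod_{i=1}^{m} \left[ \pi\, P(x_i \mid \phi_1) + (1 - \pi)\, P(x_i \mid \phi_0) \right] \qquad (5)$$
Equation 5 was calculated across categories as in Equation 4, that is, $P(x_i \mid \phi_1) = \prod_{j} P(x_{ij} \mid \phi_{1j})$ and $P(x_i \mid \phi_0) = \prod_{j} P(x_{ij} \mid \phi_{0j})$.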
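To illustrate the de novo part of this calculation, the following is a minimal sketch rather than the extTADA implementation. It assumes the Poisson-Gamma model stated above, with the Gamma prior parameterized by shape (mean relative risk times β) and rate β, in which case the marginal likelihood under H1 has a closed negative-binomial form. The function names and the example parameter values are hypothetical.

```python
import math

def log_marginal_h1(x, lam, gamma_mean, beta):
    """log P(x | H1) for x ~ Pois(lam * gamma),
    gamma ~ Gamma(shape=gamma_mean * beta, rate=beta) (assumed parameterization)."""
    a = gamma_mean * beta
    return (math.lgamma(x + a) - math.lgamma(a) - math.lgamma(x + 1)
            + a * math.log(beta / (beta + lam))
            + x * math.log(lam / (beta + lam)))

def log_pois(x, lam):
    """log P(x | H0) with gamma fixed at 1, i.e. x ~ Pois(lam)."""
    return x * math.log(lam) - lam - math.lgamma(x + 1)

def bf_de_novo(x, n_fam, mu, gamma_mean, beta):
    """Bayes factor of one de novo category at one gene."""
    lam = 2.0 * n_fam * mu  # expected count under H0
    return math.exp(log_marginal_h1(x, lam, gamma_mean, beta) - log_pois(x, lam))

def bf_gene(category_bfs):
    """Gene-level Bayes factor: product over categories."""
    return math.prod(category_bfs)

# Hypothetical example: 1,077 families, per-gene mutation rate 1e-5.
bfs = [bf_de_novo(x, 1077, 1e-5, 12.25, 0.8) for x in (0, 1, 2)]
```

With a mean relative risk above 1, each additional de novo mutation in a gene raises its Bayes factor, while a gene with no mutations receives a Bayes factor slightly below 1.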
We used the same approach for the analysis of multiple population samples. Let $N_{dnpop}$ and $C_{dn}$ be the numbers of populations and categories for de novo data, and $N_{ccpop}$ and $C_{cc}$ the corresponding numbers for case-control data. The total Bayes Factor of a given gene was the product of Bayes Factors of all populations as in Equation 1, and all hyper parameters were estimated using Equation 2.
The hyperparameters φ1j = (γj(dn), γj(cc), βj(dn), βj(cc), ρj, νj) were estimated using a Hamiltonian Monte Carlo (HMC) Markov chain Monte Carlo (MCMC) method implemented in the rstan package (Carpenter et al., 2015; R Core Team, 2016). However, the model was first simplified by removing q (see below).
4.2.1.2 Simplified approximate case-control model
For case-control (transmitted) data, q ∼ Gamma(ρ, ν), and the hyper-parameters ρ and ν controlled the mean and dispersion of q; therefore, as in the previous studies (He et al., 2013; De Rubeis et al., 2014), ν was heuristically chosen (200 was used in all analyses) and ρ was set so that ρ/ν equaled the mean frequency across genes in both cases and controls.
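We simplified the case-control model by expressing it as
$$P(x_{ca}, x_{cn} \mid H_j) = P(x_{ca} \mid x_{ca} + x_{cn}, H_j)\, P(x_{ca} + x_{cn} \mid H_j).$$
Assuming that $x_{ca}$ and $x_{cn}$ were independent, the case data could be modeled as:
$$x_{ca} \mid x_{ca} + x_{cn}, H_j \sim \mathrm{Binomial}(x_{ca} + x_{cn}, \theta \mid H_j)$$
with $\theta \mid H_1 = N_1 \gamma_{cc} / (N_1 \gamma_{cc} + N_0)$ and $\theta \mid H_0 = N_1 / (N_1 + N_0)$.
The marginal likelihood was
$$P(x_{ca} \mid x_{ca} + x_{cn}, H_1) = \int P(x_{ca} \mid x_{ca} + x_{cn}, \gamma_{cc})\, P(\gamma_{cc} \mid H_1)\, d\gamma_{cc}.$$
Based on simulation results, the first part $P(x_{ca} \mid x_{ca} + x_{cn}, H_j)$ can be used to infer mean RRs; therefore only this part was used in the extTADA estimation process.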
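The conditional binomial approximation can be sketched numerically as follows. This is an illustrative sketch, not the extTADA code: it assumes that, conditional on the total count, the case count is binomial with success probability N1γ/(N1γ + N0), that γ has a Gamma prior with shape (mean RR times β) and rate β, and it integrates over γ with a simple midpoint rule. All function names, the integration limits and the example values are hypothetical.

```python
import math

def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)

def theta(gamma, n_case, n_ctrl):
    """Probability that a variant falls in a case, given relative risk gamma."""
    return n_case * gamma / (n_case * gamma + n_ctrl)

def gamma_pdf(g, shape, rate):
    """Gamma density with the given shape and rate."""
    return math.exp(shape * math.log(rate) + (shape - 1.0) * math.log(g)
                    - rate * g - math.lgamma(shape))

def marginal_h1(x_ca, x_cn, n_case, n_ctrl, gamma_mean, beta,
                upper=50.0, steps=20000):
    """Midpoint-rule integral of Binomial(x_ca | x_ca + x_cn, theta(g)) * prior(g)."""
    shape, h, total = gamma_mean * beta, upper / steps, 0.0
    for i in range(steps):
        g = (i + 0.5) * h
        total += (binom_pmf(x_ca, x_ca + x_cn, theta(g, n_case, n_ctrl))
                  * gamma_pdf(g, shape, beta))
    return total * h

def bf_case_control(x_ca, x_cn, n_case, n_ctrl, gamma_mean, beta):
    """Approximate Bayes factor using only the conditional (binomial) part."""
    p0 = binom_pmf(x_ca, x_ca + x_cn, theta(1.0, n_case, n_ctrl))
    return marginal_h1(x_ca, x_cn, n_case, n_ctrl, gamma_mean, beta) / p0
```

Because the allele frequency q cancels in θ, this part of the likelihood depends only on the sample sizes and the relative risk, which is why it can be evaluated without a prior on q.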
4.2.1.3 Control of an implied proportion of protective variants using the relative risk dispersion hyper-parameter
If the mean relative risk $\bar{\gamma}$ and the dispersion hyper-parameter β were both small, the Gamma distribution of relative risks would imply a high proportion of protective variants (γ < 1). Although this might be of biological interest, it is not currently accounted for in the model. To control the proportion of protective variants, we tested the relationship between $\bar{\gamma}$ and β in determining $P(\gamma < 1)$. We set this proportion very low (0.5%) (Figure S10) and built a nonlinear relationship $\beta = e^{a \bar{\gamma}^{b} + c}$. The R function nls was used to estimate a, b and c, as 6.83, −1.29 and −0.58 respectively.
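As a rough numerical check of this relationship, the sketch below computes β from a mean relative risk and then the implied proportion of protective variants. It rests on two assumptions that are not spelled out in the text: that the nonlinear relationship has the form β = exp(a·mean^b + c), and that relative risks follow a Gamma distribution with shape (mean times β) and rate β. The coefficients are the reported estimates; the function names are hypothetical.

```python
import math

A, B, C = 6.83, -1.29, -0.58  # coefficient estimates reported in the text

def beta_from_mean_rr(gamma_mean):
    """Assumed functional form: beta = exp(a * gamma_mean**b + c)."""
    return math.exp(A * gamma_mean ** B + C)

def reg_lower_gamma(s, x, max_terms=1000):
    """Regularized lower incomplete gamma function P(s, x), series expansion."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / s
    total = term
    for n in range(1, max_terms):
        term *= x / (s + n)
        total += term
        if term < total * 1e-15:
            break
    return total * math.exp(-x + s * math.log(x) - math.lgamma(s))

def protective_proportion(gamma_mean):
    """P(gamma < 1) for gamma ~ Gamma(shape=gamma_mean * beta, rate=beta)."""
    beta = beta_from_mean_rr(gamma_mean)
    return reg_lower_gamma(gamma_mean * beta, beta)

for mean_rr in (1.5, 2.0, 5.0):
    print(mean_rr, round(beta_from_mean_rr(mean_rr), 3), protective_proportion(mean_rr))
```

Under these assumptions, β shrinks as the mean relative risk grows, so the prior is allowed to be more dispersed only when its mean is far enough above 1 to keep the mass below 1 small.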
4.2.1.4 Power analyses for extTADA risk gene identification
We simulated DN and CC data for ranges of sample sizes, using random samples from the posterior density of our primary genetic architecture inference analysis. The original case-control model was used in this calculation; however, we changed the order of the integral of parameters to not rely on q because the range of this parameter was not frequently known in advance (Sup Information 6.3). BFs of genes were calculated according to Equation 1, and Newton et al. (2004) false discovery rates (FDRs) were calculated following De Rubeis et al. (2014). Posterior probability (PP) for each gene was calculated as PP = π * BF/(1 – π + π * BF) (Stephens and Balding, 2009). The number of risk genes could be predicted based on the FDR threshold, for which we chose 0.05.
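The conversion from Bayes factors to posterior probabilities and false discovery rates can be sketched as below. The posterior probability follows the formula given above; the FDR step assumes the usual Bayesian construction in which genes are ranked by posterior probability and the FDR at each rank is the running mean of (1 − PP). The function names and the example Bayes factors are hypothetical.

```python
def posterior_probs(bfs, pi):
    """PP = pi * BF / (1 - pi + pi * BF) for each gene."""
    return [pi * bf / (1.0 - pi + pi * bf) for bf in bfs]

def bayesian_fdr(bfs, pi):
    """FDR per gene: running mean of (1 - PP) over genes ranked by decreasing PP."""
    pps = posterior_probs(bfs, pi)
    order = sorted(range(len(bfs)), key=lambda i: pps[i], reverse=True)
    fdr = [0.0] * len(bfs)
    running = 0.0
    for rank, i in enumerate(order, start=1):
        running += 1.0 - pps[i]
        fdr[i] = running / rank
    return fdr

# Hypothetical Bayes factors for three genes, with pi = 0.08.
example_bfs = [1000.0, 1.0, 50.0]
example_fdr = bayesian_fdr(example_bfs, 0.08)
n_significant = sum(1 for f in example_fdr if f <= 0.05)
```

A gene with a Bayes factor of 1 keeps a posterior probability equal to π, so only genes with substantial Bayes factors can pass an FDR threshold of 0.05.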
4.2.2 Testing the model on simulated data
To calculate the ability of the model in predicting significant genes, we used the simulation method described in the TADA paper (He et al., 2013). We simulated one case-control (CC) variant class, two CC classes, or one CC and one de novo (DN) class. For CC data, the original case-control model in TADA (He et al., 2013) was used to simulate case-control data and then case-control parameters were estimated using the approximate model. The frequency of SCZ case-control LoF variants was used to calculate prior information of q ∼ Gamma(ρ, ν) as described in Table 1. For DN data, we used exactly the original model of TADA in both the simulation and estimation process.
Different sample sizes were used. For CC data, to see the performance of the approximate model, we used four sample sizes: 1092 cases plus 1193 controls, 3157 cases plus 4672 controls, 10000 cases plus 10000 controls, 20000 cases plus 20000 controls. The first two sample sizes were exactly the same as the two sample sizes from Sweden data in current study. The last two sample sizes were used to see whether the model would be better if sample sizes increased. For DN and CC data, we used exactly the sample sizes of the largest groups in our current data sets: family numbers = 1077, case numbers = 3157 and control numbers = 4672.
To see correlations between simulated and estimated parameters, the Spearman correlation method (Spearman, 1904) was used. To see the performance of the estimation process of parameters inside the model, we compared between expected FDRs and observed FDRs (oFDRs).
We defined the oFDR for an FDR threshold as follows. Let G be the set of significant genes under the FDR threshold, and n1 be the length of G. Let n2 be the number of true risk genes (information from simulated data) inside G. The oFDR for the FDR threshold was the proportion of genes in G that were not true risk genes (oFDR = (n1 − n2)/n1). Estimated parameters from extTADA were used in this calculation.
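A minimal sketch of the observed FDR calculation is given below. It assumes that the observed FDR is the fraction of selected genes that are not true risk genes in the simulation, and that a gene is selected when its estimated FDR is at or below the threshold. The function name and inputs are hypothetical.

```python
def observed_fdr(estimated_fdr, is_true_risk, threshold):
    """Fraction of genes selected at `threshold` that are not true risk genes.

    Returns None when no gene passes the threshold.
    """
    selected = [i for i, f in enumerate(estimated_fdr) if f <= threshold]
    if not selected:
        return None
    n_true = sum(1 for i in selected if is_true_risk[i])  # n2 in the text
    return (len(selected) - n_true) / len(selected)       # (n1 - n2) / n1
```

Comparing this quantity with the nominal threshold over many simulations indicates whether the estimated FDRs are well calibrated.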
For each combination of simulated parameters, we re-ran 100 times and obtained the medians of estimated values to use for inferences.
We also used different priors of hyper parameters (e.g., in Table 1) in the simulation process and chose the most reliable priors corresponding with ranges of . Because mainly controlled the dispersion of hyper parameters, was set equal to 1, and only was tested.
4.2.2.1 Test NULL model (π = 0,
We also tested the situation in which no signal from either de novo mutations or rare case-control variants was present. We simulated one DN category and one CC category with π = 0, . To see the influence of prior information of on these results, we used different values of .
4.2.3 Calculate mutation rates
We used the methodology based on trinucleotide context and depth of coverage, as described in Fromer et al. (2014), to obtain mutation rates (MRs) for different classes. There were genes whose mutation rates were equal to 0 (0-MR genes). To adjust for this situation for each mutation class, we calculated the minimum MR of genes having a value > 0, and this minimum value divided by 10 was then used as the MR of 0-MR genes.
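The adjustment of zero mutation rates can be written in a few lines. This is a minimal sketch of the rule described above, with a hypothetical function name and made-up rates.

```python
def fill_zero_mutation_rates(rates):
    """Replace zero mutation rates by (smallest positive rate) / 10."""
    positive = [r for r in rates if r > 0]
    if not positive:
        raise ValueError("at least one mutation rate must be positive")
    floor = min(positive) / 10.0
    return [r if r > 0 else floor for r in rates]

# Hypothetical per-gene rates for one mutation class.
adjusted = fill_zero_mutation_rates([2e-6, 0.0, 5e-7])
```

The rule is applied separately to each mutation class, so the floor value differs between classes.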
4.2.4 Analyze SCZ data
4.2.4.1 Obtain non-heterogeneous populations for case-control data of SCZ
The case-control data sets were divided into three big populations: Finland, United Kingdom and Sweden. The Sweden population was a large data set and was sequenced at different centers (Genovese et al., 2016); therefore, we divided this population as follows.
A simple combination of a clustering process using a multivariate normal mixture model and a data analysis strategy using linear and generalized linear models was used to divide the Sweden data into non-heterogeneous populations. Genovese et al. (2016) recently analyzed all case-control data sets by adjusting for multiple covariates, namely genotype gender of individuals (SEX), 20 principal components (PCs), year of birth of individuals (BIRTH) and the Agilent kit used in wet labs (KIT), using linear regression and generalized linear regression models as in Equation 7. They reported significant results for NoExAC LoF and MiD variants; therefore, this information was used in this step. We defined homogeneous populations as populations which were not much affected by the covariates. Thus, for these populations, the results of analyses using Equation 7 (adjusting for covariates) would not be much different from the results using Equation 8 (not adjusting for covariates). The mclust package Version 5.2 (Fraley and Raftery, 1999), which uses a multivariate normal mixture model, was used to divide 11,161 samples (4,929 cases and 6,232 controls) into different groups. To cover all situations of the grouping process, we used mclust with three strategies on the 11,161 samples: grouping on all 20 PCs, grouping on all 20 PCs and total counts, and grouping on only the first three PCs. The number of groups was set between 2 and 6. For each clustering run, Equations 7 and 8 were used to calculate p values for each variant category of each group from the clustering results (p1 and p2 respectively); then, the Spearman correlation (Spearman, 1904) between the p values from the two equations (cPvalue) was calculated. Next, to filter reliable results from the clustering process, we set the following criteria:
cPvalue ≥ 0.85 and p values for NoExAC ≤ 0.005.
The ratio p1/p2 from Equations 7 and 8 had to be between 0.1 and 1.
From the results that satisfied the above criteria, we manually chose groups which had similar results between Equations 8 and 7.
For the data from the UK10K project (Singh et al., 2016), we divided the data into two separate populations, England and Finland, and tested NoExAC variants in these populations by calculating sample-size-adjusted ratios between cases and controls. The ratios were 0.91 and 0.95 for the UK data. Regarding the Finland data, the ratio for MiD variants was only 0.41, which was extremely low. This could be a special case for the population or might be because of other technical reasons. We did not use this population in the next stage because it showed a different trend from the other populations.
4.2.4.2 Estimate genetic parameters for SCZ
De novo mutations and case-control variants from the non-heterogeneous populations were integratively analyzed. Three de novo classes (MiD, LoF and silentFCPk mutations) and two case-control classes (MiD and LoF variants) were used in Equation 5 to obtain genetic parameters for SCZ. Case-control MiD and LoF variants were pooled into one class in the estimation process.
4.2.4.3 Estimate number of risk genes for SCZ
Based on the genetic parameters estimated from the available data sets, the number of risk genes was predicted as described in the extTADA pipeline above. Different FDR thresholds were used to report their corresponding risk-gene numbers.
4.2.4.4 Test enrichment in known gene sets
Based on the extTADA results, we tested the enrichment of gene sets by using gene FDRs as follows. For each gene, we obtained the FDR from extTADA. For each tested gene set, we calculated the mean of the FDRs of its genes (m0). We then randomly chose gene sets of the same size n times (n = 10 million in this study) from all genes and recalculated the mean FDR of each chosen gene set (vector m). The p value for the gene set was calculated as the proportion of values in m that were smaller than m0. To correct for multiple tests, the p values were adjusted using the method of Benjamini and Hochberg (1995) across all tests.
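The resampling test can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: random gene sets are drawn without replacement with the same size as the tested set, the p value counts random sets whose mean FDR is no larger than the observed mean, and a +1 correction is added to avoid p values of exactly zero. The function name, the seed and the toy data are hypothetical, and far fewer resamples are used than the 10 million in the study.

```python
import random

def enrichment_p_value(gene_fdr, gene_set, n_perm=2000, seed=0):
    """Resampling p value for a gene set having a low mean FDR.

    gene_fdr: dict mapping gene name to its FDR.
    gene_set: iterable of gene names (genes without an FDR are ignored).
    """
    rng = random.Random(seed)
    genes = list(gene_fdr)
    members = [g for g in gene_set if g in gene_fdr]
    k = len(members)
    m0 = sum(gene_fdr[g] for g in members) / k
    hits = 0
    for _ in range(n_perm):
        sample = rng.sample(genes, k)
        if sum(gene_fdr[g] for g in sample) / k <= m0:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Toy data: 10 genes with low FDR among 200 genes.
toy_fdr = {"g%d" % i: (0.01 if i < 10 else 0.9) for i in range(200)}
p_enriched = enrichment_p_value(toy_fdr, ["g%d" % i for i in range(10)])
p_background = enrichment_p_value(toy_fdr, ["g%d" % i for i in range(100, 110)])
```

The smallest attainable p value is 1/(n_perm + 1), which is why a very large number of resamples is needed before adjusting for many gene sets.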
4.2.4.5 Predict number of risk genes for different sample sizes
Based on the genetic architecture of SCZ, we predicted the number of risk genes for the disease under different sample sizes. To simplify the calculation, we assumed that the sample sizes of cases and controls were the same and that there was only one de novo population and one case-control population. In addition, a threshold of FDR = 0.05 was used to predict the number of individually significant genes. A grid of simulated counts was generated for family numbers between 500 and 20000 and case/control numbers between 1000 and 50000. From these simulated counts, we inferred how many risk genes would have FDR ≤ 0.05.
4.2.4.6 Test for single classes
To obtain a general picture of all classes, extTADA was used to test single classes (LoF/MiD/silentFCPk de novo mutations only, or LoF/MiD case-control variants only). All parameters were set as in the integrated analysis.
4.2.4.7 Test genetic architecture of SCZ using both InExAC and NoExAC variants
To test whether InExAC variants could increase (or decrease) the strength of identifying significant genes, we pooled all InExAC and NoExAC case-control variants and then used extTADA to analyze this pooled data set.
4.2.4.8 Test the influence of mutation rates to the analyzing results of SCZ
The de novo data in the current study were from different sources; therefore, de novo counts could be affected by differences in coverage or technology. We therefore tested the sensitivity of the results to mutation rates by adjusting the rates using synonymous mutations. We divided the observed counts by the expected counts (= 2 × family numbers × total mutation rates), and then used this ratio to adjust all mutation rates. The new mutation rates and the original data (NoExAC) were re-analyzed using extTADA.
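A minimal sketch of this adjustment is shown below. It assumes that the ratio of observed to expected synonymous de novo counts is applied multiplicatively to every mutation rate, which the text does not state explicitly; the function name and the numbers are hypothetical.

```python
def rescale_mutation_rates(mut_rates, syn_rates, observed_syn, n_families):
    """Scale mutation rates by observed / expected synonymous de novo counts.

    expected = 2 * n_families * sum of per-gene synonymous mutation rates.
    Returns the ratio and the rescaled rates.
    """
    expected_syn = 2.0 * n_families * sum(syn_rates)
    ratio = observed_syn / expected_syn
    return ratio, [ratio * r for r in mut_rates]

# Hypothetical example: 100 families, two genes.
ratio, new_rates = rescale_mutation_rates(
    mut_rates=[1e-6, 4e-6], syn_rates=[0.001, 0.004],
    observed_syn=2, n_families=100)
```

A ratio below 1 lowers all expected de novo counts, which makes the observed counts look more extreme; this is one way such an adjustment can become anti-conservative.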
4.2.4.9 Test extTADA with the same mean relative risks for case-control data
To test the performance of the model when mean RRs were equal, we re-ran the analysis for SCZ data with an adjustment inside the model: ( was the relative risk at the ith gene in the jth population).
4.2.5 Use extTADA to predict genetic parameters of other neurodevelopmental diseases
Using extTADA, we analyzed the genetic architecture of four other neurodevelopmental diseases: EPI, ID, DD and ASD. For ASD, genetic parameters were estimated simultaneously for both de novo and case-control data. For the three other diseases, the estimation process was only carried out for de novo data because no rare case-control data were publicly available.
4.2.6 Infer parameters using MCMC results
The rstan package (Carpenter et al., 2015) was used to run MCMC processes. For simulation data, 5,000 iterations and a single chain were used. For real data, 20,000 iterations and three independent chains were used. In addition, for SCZ data we used two steps to obtain final results. First, 10,000 iterations were run to obtain parameters. After that, we calculated β values from the estimated mean RRs using the equation described in Table 1. Finally, extTADA was re-run for 20,000 iterations on the SCZ data, with the calculated β values set as constants, to re-estimate mean RRs and the proportions of risk genes. For each MCMC process, a burn-in period equal to half of the total number of iterations was used to ensure that chains did not rely on their initial values. For example, 2,500 burn-in iterations were run and discarded before the 5,000 iterations for simulation data.
We retained only 1,000 samples from each chain of the MCMC results for further analyses. For example, for a chain of 20,000 iterations, one sample was taken every 20 iterations. For all parameters estimated from MCMC chains, convergence was diagnosed using the estimated potential scale reduction statistic introduced in Stan (Carpenter et al., 2015). To produce heatmap plots, modes and credible intervals (CIs) of the estimated parameters, the locfit package (Loader, 2007) was used. The mode values were used as our estimated values for other calculations.
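The thinning step and a convergence check can be illustrated as follows. This is a simplified sketch: the thinning function keeps evenly spaced draws, and the diagnostic is the classic Gelman-Rubin potential scale reduction, which is an assumption here because Stan reports a split-chain variant of this statistic. The function names are hypothetical.

```python
def thin(chain, n_keep=1000):
    """Keep n_keep evenly spaced draws (every 20th draw for 20,000 iterations)."""
    step = max(1, len(chain) // n_keep)
    return chain[step - 1::step][:n_keep]

def potential_scale_reduction(chains):
    """Classic Gelman-Rubin statistic for equal-length chains of one parameter."""
    m, n = len(chains), len(chains[0])
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    # Between-chain variance (scaled by n) and mean within-chain variance.
    b = n * sum((mu - grand) ** 2 for mu in means) / (m - 1)
    w = sum(sum((x - mu) ** 2 for x in c) / (n - 1)
            for c, mu in zip(chains, means)) / m
    var_plus = (n - 1) / n * w + b / n
    return (var_plus / w) ** 0.5
```

Values of the statistic close to 1 suggest that the chains have mixed; values well above 1 indicate that chains are exploring different regions and that more iterations are needed.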
6 Supplementary information
6.1 Supplementary Tables
Table S8: extTADA results of SCZ risk gene identification (see Long-SupTables.xlsx).
Table S9: extTADA results of SCZ risk gene identification after adjusting mutation rates (see Long-SupTables.xlsx).
Table S14: extTADA risk gene identification results of ID data (see LongSupTables.xlsx).
Table S15: extTADA risk gene identification results of DD data (see LongSupTables.xlsx).
Table S16: extTADA risk gene identification results of ASD data (see LongSupTables.xlsx).
Table S17: extTADA risk gene identification results of EPI data (see LongSupTables.xlsx).
Table S18: The p values of enrichment tests for 161 known gene sets in SCZ, DD, ID, ASD and EPI (see LongSupTables.xlsx).
Table S19: The p values of enrichment tests for whole gene sets in SCZ, DD, ID, ASD and EPI (see LongSupTables.xlsx).
6.2 Sup Figure
6.3 Sup Information
6.3.1 Sup Results
6.3.1.1 Simulation case-control data only
To evaluate the performance of the approximate CC model for different parameter values, we simulated a single CC sample with either one or two variant/annotation classes. We tested sample sizes ranging from that of the available data, 1,092 cases and controls each (ASD) and 3,157 cases and controls (SCZ), to larger sample sizes of 10,000 cases and controls, and 20,000 cases and controls. Overall, high correlations (∼1) between estimated and simulated parameter values indicate little bias in inference based on CC data (Figures S3 and S5). Slight overestimation was observed for the sample size of 1092, especially for risk-gene proportions.
An additional analysis was carried out to assess the performance of specific simulated values. Correlations were calculated for each mean RR and π value. For one CC class, mean RRs were estimated well by the model with correlations ∼1 (Figure S4). However, the proportion of risk genes was affected by mean RRs. They were estimated well when mean RRs were between 1.5 and 3.5, but underestimated with smaller mean RRs and slightly overestimated with larger mean RRs (Figure S4). For two CC classes, high correlations (≥ 0.97) between simulated and estimated values were seen for all parameters. In addition, small mean RRs of a given class did not directly affect the estimated values of proportions of risk genes (Figure S6).
The issue of poor estimation for one class, but good estimation for > one class was expected. This was an advantage of using multiple classes compared to using only one class in the estimation process when the clustering signal was not very strong. Small mean RRs could result in difficulties in the calculation process to differentiate between a risk gene (mean RR > 1) and a non-risk gene (mean RR ∼ 1). If one class was used then many risk genes would be considered to be non-risk genes. If more than one class was used, such risk genes would be assigned as genuine risk genes due to the information available from other classes.
6.3.2 Sup methods
6.3.2.1 Calculate Bayes Factor for case/control data
At a given gene, the Bayes Factor for each class was calculated as $BF = P(x_{ca}, x_{cn} \mid H_1) / P(x_{ca}, x_{cn} \mid H_0)$. The probability for each model (Hj, j = 0, 1) was calculated so as to rely only on γ parameters, as follows.
The first part P (xcn|Hj) was the same as De Rubeis et al. (2014):
The second part:
To identify the lower and upper limits of γCC for the integral, we randomly sampled 10,000 times values from the and used the minimum and maximum values for the lower and upper limits respectively.
5 Acknowledgements
This work was supported in part through the computational resources and staff expertise provided by Scientific Computing at the Icahn School of Medicine at Mount Sinai, and by NIH grant R01MH105554 to E.A.S. The Sweden exome sequencing data generation and analysis are supported by the Stanley Center for Psychiatric Research and NIH grant R01 MH077139 to C.H., P.S. and P.F.S. We are deeply grateful for the participation of all subjects contributing to this research.
Footnotes
* eli.stahl@mssm.edu