ABSTRACT
Most genetic studies consider autism spectrum disorder (ASD) and developmental disorder (DD) separately despite overwhelming comorbidity and shared genetic etiology. Here we analyzed de novo mutations (DNMs) from 15,560 ASD (6,557 are new) and 31,052 DD trios independently and combined as broader neurodevelopmental disorders (NDD) using three models. We identify 615 candidate genes (FDR 5%, 189 potentially novel) by one or more models, including 138 reaching exome-wide significance (p < 3.64e-07) in all models. We find no evidence for ASD-specific genes in contrast to 18 genes significantly enriched for DD. There are 53 genes show particular mutational-bias including enrichments for missense (n=41) or truncating DNM (n=12). We find 22 genes with evidence of sex-bias including five X chromosome genes also with significant female burden (DDX3X, MECP2, SMC1A, WDR45, and HDAC8). NDD risk genes group into five functional networks associating with different brain developmental lineages based on single-cell nuclei transcriptomic data, which provides important insights into disease subtypes and future functional studies.
INTRODUCTION
Neurodevelopmental disorders (NDDs) are a group of heritable disorders that are twice as likely to affect males when compared to females and whose prevalence broadly defined continues to increase, in part, due to improved pediatric ascertainment1. Among NDDs, autism spectrum disorder (ASD), developmental disorder (DD), intellectual disability (ID), and attention deficit hyperactivity disorder (ADHD) show considerable heterogeneity both genetically and clinically, and often present some of the greatest sex differences in prevalence2-6. Individuals with different primary NDD diagnoses frequently present overlapping phenotypes. For example, ADHD has been observed as the most common co-occurring disorder among individuals with an ID diagnosis, followed by 15-40% of individuals presenting with ASD7-10. Similarly, ∼30% of individuals with an ASD diagnosis also show some level of cognitive impairment, and ∼30-40% of cases co-occur with ADHD5,11. This significant diagnostic overlap among NDD groups has long suggested a shared genetic etiology, at least among a subset of cases12.
De novo mutations (DNMs) that disrupt or alter protein-coding gene function significantly contribute to NDDs, this genetic signal has facilitated the discovery of hundreds of risk genes over the last decade13,14. Most studies to date have focused on a single phenotype group, e.g., either on ASD15 or DD16, in an effort to identify genes that specifically contribute to that phenotype. The locus heterogeneity of these disorders, however, has limited power to identify genes with rare DNMs given that more than half of the genes still await a sufficient number of cases to reach statistical significance17. Moreover, different studies have tended to apply their preferred statistical or evolutionary models to identify genes with an excess of DNM. This has led to the emergence of slightly different gene sets from the same parent–child trio data15,17.
In this study, we integrate DNMs from 11 cohorts with a primary diagnosis of ASD or DD from 46,612 parent–child trios, of which 6,557 ASD trios are new to this study, and apply three different statistical models (chimpanzee–human divergence [CH] model17,18, denovolyzeR19, and DeNovoWEST16) (Figure 1). Briefly, the CH model estimates the number of expected DNMs by incorporating locus-specific transition, transversion, and indel rates, the gene length, and null expectation based on chimpanzee–human coding sequence divergence; denovolyzeR estimates mutation rates, considers the triplet context, and adjusts divergence based on macaque–human gene comparisons; DeNovoWEST applies the same underlying mutation rate as denovolyzeR but also incorporates gene-based weighting with missense clustering to provide the most sensitive set of candidate genes. In addition to applying standard statistical thresholds of significance, we prioritized genes as higher confidence if they were predicted by more than one or all of these three models and flag potential artefacts in previous call sets based on exclusivity of variant calls to a particular study.
The goal of this analysis is fourfold: 1) discover new candidate genes by combining ASD and DD patients as a broader NDD cohort; 2) identify ASD- or DD-enriched genes and test if any genes show an enrichment in one diagnosis over the other20; 3) identify genes that favor a particular class of mutation (i.e., missense over likely gene-disruptive [LGD]); and 4) identify genes that are biased for DNMs either among females or males. We further explore the protein-protein interaction (PPI) networks and expression properties of these genes both in bulk tissue as well as single-cell RNA-seq data to reveal novel properties regarding the tissues and functional neural networks affected by disruption of these genes.
RESULTS
DNM integration across studies
We integrated whole-exome sequencing (WES) and whole-genome sequencing (WGS) data from over 44,800 parent–child families from 11 cohorts15,16,21-26, where children received a primary diagnosis of ASD or DD (Table S1). We first split the cohorts into two groups based on whether the underlying Illumina sequencing data were available for reanalysis and variant recalling. We identified DNMs by reanalyzing underlying CRAMs or BAMs using a previously published pipeline27 using FreeBayes and GATK for five of the cohorts (recalled subset: 46.4% of families or 60,868 exomes and 9,304 genomes). For the remaining six cohorts (no-recall subset: 53.6% of families with 81,052 exomes), we used the published DNMs with downstream filtering applied (Table S2). We performed stringent quality control and filtering on both recalled and no-recall subsets in an effort to better harmonize DNMs (Methods). This included removing sample duplicates and outliers with a significant excess of DNM; restricting calls to regions with sufficient coverage (∼20X) across the different platforms, including WES and WGS data; excluding DNMs within segmental duplications, low complexity regions, and centromeric and telomeric regions to reduce false positives; and flagging “jackpot” genes where multiple independent events were identified in a single cohort but never again observed in another study (Methods).
After quality control on both samples and variants, the harmonized DNM set was generated from 15,560 ASD (6,557 are new) and 31,052 DD patients as well as 5,241 unaffected siblings (3,034 are new). This set includes 6,921 de novo LGD variants (dnLGD, including frameshift, stop-gain, splice-donor, or splice-acceptor variant), 32,774 de novo missense variants (dnMIS), as well as 11,706 de novo synonymous variants (dnSYN). Among the dnMIS variants, we classify 5,946 as severe based on a combined annotation dependent depletion (CADD) score28,29 greater than 30 (v1.3), ranking these mutations among the top 0.1% for the most severe predicted effect (dnMIS30) (Table S3). There are 11,409 genes with non-synonymous DNMs in probands versus 2,742 genes in siblings, with 37.7% (4,304/11,409) of the genes in probands and 83.7% (2,296/2,742) of genes in siblings with only one DNM (Figure S1). Comparing the overall mutation rate for the recalled and no-recall sample sets gave similar DNM frequency (∼0.78 non-synonymous DNMs per proband). Considering cohorts by phenotype, we find that DD shows the highest DNM rate followed by ASD and then siblings (Figure S2, Table S2).
DNM-enriched NDD candidate genes and diagnostic specificity
To identify genes with a significant excess of DNMs, we applied three statistical models (CH model17,18, denovolyzeR19, and DeNovoWEST16) independently to the ASD and DD cohorts based on their primary diagnoses and then combined as one broader NDD group (Figure 1). After excluding genes that showed evidence of DNM significance among unaffected siblings (n = 7, Table S4), we identify 615 candidate genes in the combined NDD group with an excess of DNMs by one or more of the three models (union FDR 5%, q-value < 0.05, DNM count > 2). We consider these candidates as a lower confidence set and hereafter refer as the LC615 gene set (Table 1, Figure 2). Although the majority of the LC615 genes are shared across the models, 43.3% (266/615) are only identified by a single model (Figure S3). Among the LC615 genes, we also defined the highest confidence set of 138 genes (HC138 genes) where the excess of DNMs meets exome-wide significant supported by all three models (intersection family-wise error rate [FWER] 5%, p-value < 3.64e-07, DNM count > 2) (Table 1, Table S5). Among the three models, DeNovoWEST shows the greatest sensitivity because it incorporates missense clustering in addition to strict counts of DNMs16. We observe similar trends in our independent analyses of the ASD and DD cohorts (Table 1, Tables S6-S7). Candidate genes, as expected, are significantly more likely to be intolerant to mutation (Figure S4). Among the LC615 genes, 324 genes are considered novel when compared to three previous reports (ASC10215, Coe25317, DDD28516) (Figure S5, Table S8). Among the 324 candidates with novel statistical significance, we searched for additional evidence of pathogenicity (including case reports) within the Development Disorder Genotype - Phenotype Database (DDG2P), SFARI gene, Online Mendelian Inheritance in Man (OMIM) database, and PubMed, identifying 189 genes (30 FWER genes) with novel association with NDD (Table S8). Among the most stringent set of HC138 genes, we identify or confirm three genes (MED13, NALCN, and PABPC1) newly reach exome-wide significance in this large data, further confirmed their association with NDD based on recent reports30,31 (Figure S5). We should also note that there are 52 potentially novel genes if we consider exome-wide significance more broadly by any one or more of the models instead of requiring all three to intersect (Table S8). These genes, thus, represent a powerful resource for future clinical and basic research investigations related to NDDs.
Next, we considered cohorts with the primary diagnosis of ASD and DD separately using the same criteria and then compared the groups to identify genes potentially unique to a specific diagnosis. This analysis identifies several genes enriched for DNMs that are unique to each primary diagnosis or combined (union FDR 5%): ASD (n = 17), DD (n = 66), and NDD (n = 119) (Figure 2c). However, most of those genes are only identified by a single model and often with borderline statistical support (88.2% [15/17] of genes in ASD, 93.9% [62/66] of genes in DD, and 85.7% [102/119] of genes in the NDD group). Among the genes with unique significance in ASD, we note that 58.8% (10/17) of genes show evidence of DNMs in both ASD and DD patients, which suggesting that screening of further DD samples will likely yield additional cases and eliminate the ASD specificity. Although there are seven genes (INTS2, ASB17, KRT34, C1orf105, C6orf52, NUCB2, and JKAMP) that currently show no DNMs among the DD patients, but none are supported by all three models (five of which only predicted by DeNovoWEST) and none reach exome-wide significance in ASD, and most are not particularly compelling candidates based on available functional data. We suggest that all of these should be regarded as either unlikely or low-confidence “autism-specific” genes. While we recognize that there is considerable overlap among DD and autism, autism criteria are generally more specific including a large number of patients without intellectual disability or generalized developmental delay.
In contrast to ASD, there is compelling evidence for DD-specific genes that are not associated or rarely associated with ASD. Among the 66 genes that reach FDR significance only in DD but neither ASD nor the combined NDD set, 10 genes meet the exome-wide significance and supported by all three models (ARID1A, BCOR, PTCH1, KIF11, DPF2, DNM1L, ANKRD11, H4C5, HUWE1, and TRAF7) (Figure 2c). There are three genes (KIF11, H4C5, and TRAF7) that show DD-specific significance where no DNM has yet been observed among autism patients in this study, however, there are also few autism cases have been reported previously with DNM in KIF1132, and TRAF7 33. As expected, genes with NDD-specific significance show DNMs in both ASD and DD patients, but 10 genes reach the intersection exome-wide significance when cohorts are combined together as broader NDD group (MYT1L, PHF21A, FBN1, PBX1, GNB2, PABPC1, CLCN4, NALCN, PSMC5, and CERT1).
Comparing the relative frequency of DNMs among DD and ASD cohorts identifies genes with a potential bias toward DD or ASD diagnoses, especially when DNM is further categorized by dnMIS or dnLGD mutation class (Figure 3, Tables S9-S10). There are chromatin modifying genes among the genes with a trending of potential ASD bias: CHD8, KDM5B, ASH1L, and KMT5B—although none of these genes reach significance for ASD enrichment. In contrast to DD patients, a comparison of DNM counts identifies 18 genes with higher DNM burden in DD over ASD patients (two-sided Fisher’s exact test, FDR 5%) (Table 2). Of which, GATAD2B and KIF1A are exclusive to DD without DNM in ASD patients in this study (Figure 4). Interestingly, GATAD2B shows an enrichment only for dnLGD variants, while in contrast KIF1A shows a specific enrichment for dnMIS variants (Figure 4). Patients with mutations in these genes have been described as exhibiting DD and moderate to severe ID34,35. Although no gene shows significant burden specifically for ASD compared to DD patients, however, there are 149 of the LC615 genes show a trend toward ASD diagnosis over DD with more DNMs in ASD patients when comparing DNM frequency with sample size adjusted. For example, CHD8 is highly intolerant to LGD mutation (pLI score = 1, LOEUF score = 0.082), shows a 2.22-fold enrichment of DNMs in ASD patients (n = 15,560, 18 dnLGD and 12 dnMIS variants) when compared to DD patients (n = 31,052, 19 dnLGD and 8 dnMIS variants). Other genes, like KDM5B (1.48-fold) and WDFY3 (2.90-fold), also have more DNMs in ASD than in DD patients relative to the corresponding sample size (Figure 4, Table S10).
Genes with biased burden by mutation class
In the combined NDD group (n = 46,612), the majority of genes (92.7%, 10,578/11,409) show a greater number of dnMIS when compared to dnLGD variants, which is consistent with expectations. However, a subset of 7.3% (831/11,409) show the reverse pattern with more dnLGD than dnMIS variants (e.g., ARID1B, ADNP, KMT2A, DYRK1A, CTNNB1, MED13L, POGZ, SETD5, GATAD2B, etc.), suggesting different mechanisms of pathogenicity or phenotypic manifestation by mutational class. A directly comparison of dnLGD and dnMIS variant count of each gene in the combined NDD cohort (n = 11,409 genes), we identify 12 genes (ARID1B, ADNP, KMT2A, DYRK1A, CTNNB1, SETD5, GATAD2B, WAC, ASXL1, ASXL3, ARID2, and PPM1D) significantly enriched for dnLGD when compared to dnMIS variants (two-sided Fisher’s exact test, FDR 5%). By contrast, there are 41 genes with significantly greater burden of dnMIS over dnLGD variants (e.g., DYNC1H1, GRIN2B, CHD3, CACNA1A, SCN8A, etc.) (Figure 3, Table S11). Among the above genes with a bias by mutation class, PPM1D shows exclusively dnLGD variants (n = 22), while 23 genes show only dnMIS variants (e.g., KIF1A, KCNQ2, PTPN11, SMARCA2, etc.) in this dataset (Figure 4).
Sex-based DNM enrichment and burden analysis
The male-biased predominance is well-established for NDDs especially among ASD, but the underlying mechanism remains largely unknown. We further investigated this sex bias in a subset of probands (19,319 males and 8,132 females) where the sex information of the proband is available (Table 1). Using the three models and treating males and females as two distinct groups, we identified 119 and 90 candidate genes with potential male or female bias for DNM, respectively (union FDR 5%) (Tables S12-S13). After applying the most stringent exome-wide significance criteria (intersection FWER 5%), five male-enriched (ASH1L, ARID2, MBD5, NSD1, and SIN3A) and 17 female-enriched genes (DDX3X, MECP2, SMC1A, WDR45, HDAC8, NEXMIF, USP9X, CHD7, ZC4H2, CASK, KAT6B, KIF1A, MEF2C, NAA15, PABPC1, PPP2R1A, and HNRNPU) were identified (Figure 3d). Directly comparing of the DNM counts in males versus females, we also identify five genes (DDX3X, MECP2, SMC1A, WDR45, and HDAC8) with a significant excess of DNM in females (two-sided Fisher’s exact test, FDR 5%) (Table 3, Figure 5). Notably, those five genes all map to the X chromosome with either no DNMs or a single DNM observed in males suggesting that such mutations may be lethal during male embryonic development. Although no individual gene shows a significant male excess when comparing DNM counts between the sexes, several genes show interesting trends. For example, ASH1L (aka KMT2H), for example, is a histone-lysine N-methyltransferase enzyme with 22 DNMs in males versus only two DNMs observed in females; KMT2E encodes a member of the lysine N-methyltransferase-2 family of chromatin modifiers, is associated with O’Donnell-Luria-Rodan syndrome, and shows a male bias in our data (eight DNMs in males verse none in females); finally, KDM6B shows eight DNMs in males compared to a single event in females and encodes a histone H3K27 demethylase that specifically demethylates di- or trimethylated lysine-27 (K27) of histone H3 (Table 3, Figure 5).
Functional and neuronal properties of NDD genes
We focus on investigating the functional properties of the highest confidence gene set (HC138) by considering both PPI as well as gene expression trends and differences in the adult and developing human brain. A PPI network analysis using STRING highlights five distinct network clusters (p < 1e-16) (Figure 6a). These include previously described networks such as chromatin binding (GO: 0003682, p = 2.25e-22), ion channel (GO: 0022839, p = 8.43e-13), lysine degradation (hsa00310, KEGG, p = 6.86e-09), and the GABAergic synapse (hsa04727, KEGG, p = 8.78e-08). We identify 20 top “hub” genes, where we observe an excess of PPI interactions supported by half or more of the 12 models in CytoHubba36 (Table S14). Tissue-specific expression analysis (TSEA), as expected, shows enrichment in the brain (p = 0.003), with cell-type-specific expression analysis (CSEA) further highlighting expression in specific brain regions from early to late-mid fetal developmental stages; namely: early-mid (p = 1.11e-04) and late-mid fetal (p = 0.005) amygdala, early fetal cerebellum (p = 5.4e-04), early to late fetal cortex (p = 4.87e-06), early to early-mid fetal striatum (p = 1.97e-04), and early fetal thalamus (p = 0.002). Overall, the HC138 genes are more highly expressed prenatally, consistent with the early developmental impact of variation in these genes and NDD (Figure S6).
We also explored the enrichment of NDD genes (ASC10215, Coe25317, DDD28516, and LC615 and HC138 in this study) at cellular resolution from expression datasets obtained from the developing and adult human cerebral cortex37. Overall, the NDD genes show broad expression across adult human cortical cell types, including inhibitory and excitatory neurons and, to a lesser extent, non-neuronal cells (Figure 6b). All PPI-defined clusters show a similar pattern of pan-neuronal enrichment. While the signal is less pronounced for the LC615 genes, it is still qualitatively more than the 447 genes with non-synonymous DNMs (n > 1) in siblings (Figures S7a,b). We tested for expression enrichment in neuronal and non-neuronal cells for NDD versus control gene sets by comparing expression levels for cell subtypes (Figure 6c, Figure S7c) and the number of clusters that express genes for broad cell classes (Figure S8). All NDD gene sets are significantly enriched in neuronal subtypes, and all gene sets (except the LC615 genes) are enriched in non-neuronal types (Bonferroni-corrected p < 0.05, Kolmogorov–Smirnov test). While different neuronal subtypes show similar enrichments, we note that oligodendrocyte progenitor cells show the greatest enrichment when compared to other non-neuronal cells (Figure 6c). If we use the pan-neuronal expression signature as a further classifier of functional enrichment, we identify 194 of the LC615 gene set meet these criteria as are pan-neuronal genes (Table S8).
To further dissect the neuronal signature, we investigated single-nucleus transcriptomic datasets from the developing human cerebral cortex. At the single-cell level, we find, once again, that HC138 genes are also broadly enriched across inhibitory and excitatory neurons (Figures 6d, e). To test which neuronal subtype shows greater specificity, we performed conditional enrichment analysis between the neuronal lineage populations, and found excitatory neurons to have an independent signal with respect to controlling interneuron signal, while neither enrichment nor significance (p = 1 for all) is reached in any neuronal lineage population when controlling for excitatory neurons (Figure S9). Among excitatory neuron clusters, we find the HC138 genes are most strongly enriched in visual cortex neurons, which is consistent with recently reported changes in gene expression in the postmortem ASD brain tissue samples38. When stratified by PPI-defined cluster membership, we find that cluster 1, which harbors many proteins involved in chromatin modification and gene expression regulation, are enriched in progenitor cells and particularly ventral telencephalic neural progenitor cells (‘MGE-IPC’, ‘MGE-div’) and newborn GABAergic neurons (‘nIN’). Cluster 2, which includes proteins involved in neuronal communication and ion channel function, are enriched in cortical excitatory and inhibitory neurons, including excitatory neurons of the prefrontal cortex (‘EN-PFC’), as previously reported39, and caudal ganglionic eminence derived GABAergic neurons (‘IN-CTX-CGE’). Additional analysis of cluster 3 and cluster 4 revealed potential trends but did not reach statistical significance after multiple test correction. Genes in cluster 4 (serine/protease phosphorylation), for example, are biased in their expression towards outer radial glia, which are primate-enriched neural stem cells believed to be responsible for the majority of upper cortical layer neurogenesis in humans40-42, while genes in cluster 3 (insulin receptor function) are enriched in vascular endothelial and mural cells. We performed a similar single-nucleus transcriptomic analysis for the top 10 genes showing an enrichment trend for DNM in either males or females. Using the HC138 genes as the background gene set, we observe an enrichment signal for intermediate progenitor cells for the female-enriched genes and no signal for male-enriched genes (Figure S10).
DISCUSSION
Here we present a comprehensive DNM analysis from 15,560 ASD (6,557 are new) and 31,052 DD patients as well as 5,241 unaffected siblings (3,034 are new) using three different statistical models to increase power to identify risk genes and highlight primary diagnosis, sex, mutation class, and model-specific differences. In the combined NDD group, we identify 615 NDD candidate genes with nominal significance supported by any one or more of the three models (LC615, union FDR 5%), of which, 138 genes reaching exome-wide significance in all three models are defined as the most stringent set (HC138, intersection FWER 5%). Among the LC615 genes, 119 were additionally identified only in the combined NDD, which are not observed when ASD and DD are considered independently (Figure 2c). For example, SYNCRIP (OMIM: 616686), a gene already strongly implicated in NDD43,44, shows two dnLGD and one dnMIS variants in DD patients as well as two dnLGD variants in ASD patients in this dataset, it reaches FDR significance supported by all three models and FWER significance by CH model when combined but no significance when ASD and DD are analyzed separately. Among the high-confidence genes (HC138), there are 10 genes (MYT1L, PHF21A, FBN1, PBX1, GNB2, PABPC1, CLCN4, NALCN, PSMC5, and CERT1) reaching genome-wide significance (intersection FWER 5%) only in the combined NDD group, which representing good candidates for further investigation (Figure 2c).
Among the LC615 genes, there are 189 genes with potentially novel associations. We note, however, that 84.1% (159/189) of those novel associations only reach nominal significance (FDR 5%) and thus only represent as potential candidates for future investigation as opposed to genes where DNM pathogenicity has been established. Among the HC138 genes supported by all three models (intersection FWER 5%), we highlight three genes (MED13, NALCN, and PABPC1) reaching exome-wide significance first in this study. MED13, a gene that encodes a component of the CDK8-kinase module, has previously been identified through case reports and its phenotype investigated as part of GeneMatcher collaboration30. Although NALCN was not previously reported as significant in large WES projects of DNM in ASD or DD separately, the gene now reaches the most stringent significance in this study by combining ASD and DD patients. In addition, Chong and colleagues31 recently described an autosomal dominant disorder associated with DD and multiple congenital contractures of the face and limbs, extending the phenotype beyond Freeman-Sheldon syndrome to include the impaired cognitive phenotype. Another gene, PABPC1 encodes a poly-A RNA binding protein and is thought to play a critical role in the structure of the translation initiation complex45. We identify 15 dnMIS variants in ASD and DD patients and the gene, to our knowledge, reaches genome-wide significance for an excess of DNM for the first time.
As a whole, the genes identified in this study show a consistent pattern of pan-neuronal expression, which appear to be driven in large part by excitatory neurons. A comparison of single-nucleus transcriptomic data with specific functional PPI networks reveals a more nuanced relationship where specific functionally related genes clearly associate with specific cellular lineages in the brain. Four of the five PPI networks in this study show differential enrichment in specific lineages ranging from rapidly dividing progenitor cells (cluster 1) and outer radial glial cells (cluster 4) to established excitatory and inhibitory neuronal lineages (cluster 2) to cell types associated with vascularization and maintenance of blood-brain barrier (cluster 3). Although not all of these findings are yet statistically significant and will become refined as new highly curated cellular-resolution datasets become available, they suggest possible contribution of these specific cell types to a subset of NDD disease phenotypes and are consistent with observations from studies of large effect size copy number variations on cortical development46,47. If functional networks and definition of cell types and circuits are important targets of future therapeutics, distinguishing patients with DNM in these specific genes and further exploration of these PPI networks will be important areas of future research.
Given the locus heterogeneity of ASD and DD, it will be important to continue to expand WES and WGS of parent–child trios in order to identify all the genes associated with excess DNM in NDD. The novel candidate genes identified here with different levels of significance provide an important resource for further genetic and functional characterization. Severe dnMIS30 or dnLGD mutations among the HC138 and LC615 genes account for only 4,667 (10.0%) and 2,872 (6.2%) of the NDD families in this cohort, respectively. Thus, a large number of risk genes are awaiting discovery. Our analyses highlight the value of combining ASD and DD data in increasing power to identify NDD risk genes. While there is ample evidence of DD-specific genes, we find little or no support for ASD-specific genes although do identify candidates with trends toward ASD diagnostic enrichment but none reach statistical significance yet (e.g., CHD8, KDM5B, WDFY3)20.
However, there are also several limitations and caveats with this study. The size of the DD cohort was double to that of the ASD patients and, as a result, favored the discovery of DD-specific genes where the yield of DNM is greater. The Simons Foundation Powering Autism Research for Knowledge (SPARK) project and other collaborative initiatives aim to generate WES data from over 50,000 ASD families over the next few years and will help rectify this imbalance48. Another limitation of this meta-analysis is that not all underlying sequence data or phenotypic information was available or could be accessed for all published cohorts. There are only ∼50% of the raw sequencing data could be reprocessed using the exact same parameters to create a harmonized dataset. Moreover, even basic information regarding patient sex or the ability to match DNM calls to specific samples was limited to some published cohorts. Such impediments need to be rectified for the benefit of the research community and the families who contributed their DNA for the purpose of research. Finally, most of the cohorts included in this study originate from WES whose platform, quality, and coverage varies more widely than WGS data. It has been estimated that 5-8% of genic regions are inaccessible by earlier WES platforms49. In the near future, repeating these analyses with WGS data or even ultimately generated by long-read sequencing technologies will help identify new risk genes or loci where may miss by exome sequencing.
AUTHOR CONTRIBUTIONS
T.W. and E.E.E. designed the study; T.W. analyzed, collected and harmonized the dataset, and performed the data analyses. C.K., T.E.B., and T.J.N. helped with the gene expression analyses. M.A.G. contributed in interpretation of the candidate genes. B.H. and Y.M. helped with figure preparations. C.G. contributed some of the underlying data of RUMC cohort. T.W. and E.E.E. wrote the manuscript with input from other authors. All authors reviewed and approved the manuscript.
DECLARATION OF INTERESTS
The authors declare no competing interests.
METHODS
Cohorts and samples
This study was approved by the University of Washington Institutional Review Board #STUDY00000383, Genetics Consortium Repository. Informed consent was obtained from all subjects by each of the corresponding study cohort. We collected eight WES and three WGS parent–child cohorts (>100 trios), including over 44,800 families from both published and publicly available data (Table S1). We preferentially retain WGS over WES data for cohorts that have both types of data available. For example, we only included the Centers for Common Disease Genomics (CCDG) genomes for Simons Simplex Collection (SSC) samples26. For cohorts with continuous publications, like for the Autism Sequencing Consortium (ASC), Deciphering Developmental Disorders (DDD), and Radboud University Medical Center (RUMC) samples, we only included the final samples from their latest study with potential sample duplicates removed. We also excluded any potential overlaps in the literature; for example, we excluded all SSC samples used in the ASC paper15 for potential redundancy with the CCDG SSC genomes. After all those control measures, we also ran KING50 (v1.4) for samples with the underlying sequencing data available for further detection of potential sample overlap. KING uses identical by state (IBS) to estimate pairwise relatedness between samples; any samples with a kinship value >0.35 were considered as potential sample duplicates. We first checked if the potential duplicates were known as monozygotic twin pairs or known duplicates within a cohort (some individuals had both blood and cell line DNA sequenced for quality control purposes). Only one sample was retained from each duplicate pair in downstream analyses—blood DNA was preferred if both blood and cell line sequencing data were available.
DNM discovery and integration
DNMs were identified by analyzing/reanalyzing the underlying sequencing data wherever available for the five cohorts using the same pipeline. Specifically, DNMs were harmonized by reanalyzing 70,172 samples (46.5%), including 24,520 families within two categories. First, for ASD cohorts with genome sequencing data from the CCDG study, including SSC and the Study of Autism Genetics Exploration (SAGE), raw single-nucleotide variant/insertion or deletion (SNV/indel) variants were called (on hg38) independently using four different callers: GATK51, FreeBayes52, Platypus53, and Strelka254. Downstream DNM discovery was based on genotype, which is required only if the offspring has the alternative allele (with genotype as 0/1 or 1/1) but is not observed in either of the parents (with genotype as 0/0). Candidate DNMs needed to have the support of at least two of the four callers; and then variants from Platypus with a filter of LowGQX or NoPassedVariantGTs were removed, and Strelka2 variants had to have the filter field equal to PASS. For variants on the X chromosome, we separately considered variants in the pseudoautosomal regions (chrX:10000-2781479, chrX:155701382-156030895, hg38) and the X/Y duplicative transposed region (chrX:89201803-93120510, hg38). Candidate DNMs were then converted to hg19 for downstream integration. Second, for part of other WES cohorts, including SPARK_pilot, SPARK_WES_1, and DDD, raw SNV/indel variants were called (on hg19) independently using GATK and FreeBayes. Downstream DNM discovery was based on genotype as applied in genome cohorts. Candidate DNMs needed to have the support from both callers, which is the intersection set by GATK and FreeBayes. Beyond the above measures, we also applied the following variant-level filters: allele balance (AB = 0 in both parents, and AB > 0.25 in the child), read depth (DP > 9 for all family members), child genotype quality (GQ > 20 by both GATK and FreeBayes). For the rest cohorts included in studies with no underlying sequencing data available, DNMs were collected from each corresponding publication. DNMs on hg38 were first converted to hg19; the final integrated DNMs were all on hg19. To combine DNMs between WES and WGS datasets, we restricted DNMs to a well-covered coding region55 (average DP > 20X) generated by accessing the WES data from the SSC and DDD study. We also removed all DNMs in the segmental duplication regions, recent repeat and low-complexity regions, or centromeric and telomeric regions. We excluded variants in a homopolymer A or T of length 10 or greater, and the variants with a reference or alternative allele with greater than 10 bp, to remove potential sequencing errors. Beyond all the above filtering and sample duplicate exclusions, we further excluded samples with more than 10 coding DNMs as outliers and removed specific DNMs that were observed in more than five different unrelated individuals for frequency control. All of the above strict measures yielded a total of 46,612 nonredundant NDD cases with a primary diagnosis of ASD (n = 15,560) or DD (n = 31,052), and also unaffected siblings (n = 5,241) in the integrated de novo enrichment analysis (Table S1). To ensure uniformity, the same version of CADD score (v1.3) and VEP annotation (Ensembl GRCh37 release 94) were applied, and the analysis was restricted to the canonical transcript with the most deleterious annotation.
Statistical analyses
De novo enrichment analyses were performed independently for ASD, DD, and NDD samples by using three statistical models: the CH model, denovolyzeR, and DeNovoWEST. All three methods apply their own underlying variant rate estimates (denovolyzeR and DeNovoWEST use the same rate while CH model is different) to generate the prior probabilities for observing a specific number and class of mutations for a given gene. Briefly, the CH model estimates the number of expected DNMs by incorporating locus-specific transition, transversion, and indel rates and chimpanzee–human coding sequence divergence and the gene length; while denovolyzeR estimates mutation rates based on trinucleotide context and incorporates exome depth, mutational biases such as CpG hotspots, and divergence adjustments based on macaque–human comparisons. DeNovoWEST is a further development beyond denovolyzeR, which scores all classes of coding variants on a unified severity scale based on the empirically estimated positive predictive value of being pathogenic, and incorporates a gene-based weighting derived from the deficit of protein-truncating variants in the general population, further combining missense enrichment by a clustering test. Default parameters were used for all three methods with some minor adjustments, such as in the process of weight creation in DeNovoWEST, fewer numbers of CADD bins for missense and nonsense variants were used for ASD samples (three bins), versus in the DD and combined NDD group where seven bins were used for both, due to the sample size differences as suggested6. The expected mutation rate of 1.8 DNMs per exome was set to the CH model as an upper bound baseline. Siblings were also analyzed similarly using the CH model and denovolyzeR, but not run for DeNovoWEST due to the small sample size. We applied two metrics of significance with the union and intersection of three models: first is the FDR significance, the significance threshold (q < 0.05) was corrected exome-wide using Benjamini–Hochberg method by accounting for the total number of genes in each model (18,946 genes in CH model, 19,618 genes in denovolyzeR, and 18,762 genes in DeNovoWEST); the second is a more stringent FWER significance, for which we applied exome-wide Bonferroni multiple-testing correction considering both the largest number of genes among three models (n = 19,618) and the total of tests per gene across the three models. For probands, FWER 5% significance threshold (p < 3.64e-07) was corrected by the Bonferroni method for 19,618 genes and seven tests in the analysis (dnLGD, dnMIS, and dnMIS30 in CH model, dnLGD and dnMIS variants in denovolyzeR, and dnLGD and dnMIS variants in DeNovoWEST). For siblings, FWER 5% significance threshold (p < 5.09e-07) was corrected for 19,618 genes but only five tests in the CH model and denovolyzeR. We excluded genes that show any significance in the siblings from the counting of significant genes in probands (ASD, DD, and NDD). For each variant category, we required each gene to have more than two DNMs to be considered as significant. All statistics were calculated using R (versions 3.6.2).
Enrichment analyses in recalled and no-recall subsets
We also performed same enrichment analyses using three models in parallel for those two subsets. We identified 323 FDR (132 FWER) significant genes in the recalled subset and 389 FDR (174 FWER) significant genes in the no-recall subset (Tables S15-S16). For those FDR-significant genes, of which 87.3% (282/323) in the recalled subset and 90.0% (350/389) in the no-recall subset overlap with the LC615 genes in combined NDD group (Figure S11), suggesting consistent results in both subsets after data harmonization. However, there are 12 exclusively significant genes in the no-recall subset and one exclusively significant gene (PABPC1) in the recalled subset among the HC138 genes in combined NDD group. A closer look found those 12 genes were also reported as significant genes in a recent study16, and the significant signal was driven by DNMs almost exclusively from GeneDx, which the raw data is not available for recalling. For example, all DNMs in three genes (ZEB2, PDHA1, and SLC2A1) are from GeneDx probands (n = 18,783) and none among the other five cohorts (n = 8,454) in the no-recall subset (Figure S11). This is consistent with the original study where the majority of the DNMs are from GeneDx cohort, with very few from the DDD and RUMC samples. This draws attention to the significance of such genes, where the significance signal is mostly driven by DNMs from a single cohort and no underlying data is available for reanalyzing as more quality control.
PPI analyses and hub genes
PPI network was assessed by searching Multiple Proteins by Names using the online STRING database with default settings. The interaction result was exported as a TSV file and then imported into Cytoscape software for downstream analysis. We used CytoHubba to identify top hub genes (most interacted genes). CytoHubba provides the analyzed results computed by 12 methods, including Degree, clustering coefficient, Edge Percolated Component (EPC), Maximum Neighborhood Component (MNC), Density of Maximum Neighborhood Component (DMNC), Maximal Clique Centrality (MCC), and centralities based on shortest paths, such as Bottleneck (BN), EcCentricity, Closeness, Radiality, Betweenness, and Stress, as previously described36. The top 20 genes were supported by the most, and at least half, of the models as top hub genes. The PPI clusters were identified by the Markov Cluster Algorithm (MCL, https://micans.org/mcl/). The top three GO functions were selected from rank order of the functional enrichment from STRING database with default settings.
Single-nucleus RNA expression analysis
The dataset includes single-nucleus transcriptomes from 49,495 nuclei across multiple human cortical areas. Individual cortex layers were dissected from tissues covering the middle temporal gyrus (MTG), anterior cingulate cortex (ACC; also known as the ventral division of medial prefrontal cortex, A24), primary visual cortex (V1C), primary motor cortex (M1C), primary somatosensory cortex (S1C), and primary auditory cortex (A1C) derived from human brain. Nuclei were dissociated and sorted using the neuronal marker NeuN. Nuclei were sampled from postmortem and neurosurgical (MTG only) donor brains and expression was profiled with SMART-Seq v4 RNA-sequencing. The data are available from the Allen Institute for Brain Science website for analysis (https://celltypes.brain-map.org/rnaseq/human_ctx_smart-seq) and download (https://portal.brain-map.org/atlases-and-data/rnaseq/human-multiple-cortical-areas-smart-seq). Unsupervised clustering with Seurat identified 120 distinct transcriptomic clusters, including 54 GABAergic (inhibitory) neuronal, 56 glutamatergic (excitatory) neuronal, and 10 non-neuronal cell types. Heatmaps were constructed of log-normalized trimmed mean expression (excluding the 25% lowest and 25% highest expression values), log2(CPM + 1), of NDD and control gene sets across cell types. Genes were ordered by the number of cell types with trimmed mean expression > 1. For each cell class (GABAergic and glutamatergic neurons and non-neuronal cells), the number of cell types with trimmed mean expression > 1 for NDD risk genes and control genes were quantified and visualized as empirical cumulative distributions (Figure S8). A Kolmogorov–Smirnov test was used to reject the null hypothesis that the cell type count distributions were the same between each gene set and the control DNM gene set. P-values were Bonferroni-corrected for multiple testing. Similarly, for each cell subclass (e.g., SST interneurons or L6b excitatory neurons), the trimmed mean expression levels of NDD risk genes and control genes were quantified and visualized as empirical cumulative distributions (for example, Figure S7c). A Kolmogorov– Smirnov test was used to reject the null hypothesis that the expression distributions were the same between each gene set and the control DNM gene set. P-values were Bonferroni corrected for multiple testing and -log10-transformed and visualized as a heatmap with columns corresponding to cell subclasses ordered by Ward’s clustering and rows corresponding to gene sets. Full names and detailed descriptions of each cell subtype in the heatmap (Figure 6c) are: GABAergic interneuron (LAMP5: LAMP5 expressing GABAergic neuron, PAX6: PAX6 expressing GABAergic neuron, PVALB: PVALB expressing GABAergic neuron, SST: SST expressing GABAergic neuron; VIP: VIP expressing GABAergic neuron); Glutamatergic neuron (IT: Intratelencephalic neuron, L4 IT: Layer 4 Intratelencephalic neuron, L5 ET: Layer 5 Extratelencephalic neuron, L5/6 IT Car3: Layer 5/6 Intratelencephalic neuron that selectively expresses Car3, L5/6 NP: Layer 5-6 Near-projecting neuron, L6 CT: Layer 6 Corticothalamic neuron, L6b: Layer 6b neuron); and Non-neuronal cell (Astrocyte, Microglia, Oligodendrocyte, OPC: oligodendrocyte progenitor cell).
Tissue and cell-type-specific expression of significant genes
scRNA-seq data were pulled from http://cells.ucsc.edu/?ds=cortex-dev and CPM counts were quasi normalized unto unique molecular identifies (UMI) using quminorm (https://github.com/willtownes/quminorm). Cells were then regrouped by their broad parent cell types with unknown cell types filtered out. SCTransform (https://github.com/ChristophH/sctransform) was used to normalize the UMI counts from quminorm. The corrected counts from SCTransform were used as input into Expression Weighted Celltype Enrichment (EWCE) following the default parameters with two levels of annotations based on clusters and clusters split by sex. Bootstrapping parameters in EWCE: 10000 repetitions with the LC615 genes as background in the unconditional enrichment and HC138 for the controlled experiments. Cluster-specific analysis within the most stringent gene set and top sex specific genes (10 for female and 10 for male) used the HC138 genes as background. Online TSEA and CSEA tools were used to determine the enrichment of expression across brain regions and cell types56. The expression among these tissues was compared using Fisher’s exact tests and followed by Benjamini–Hochberg correction.
Assessment of gene intolerance scores
To assess a gene’s intolerance to variation, we applied the ExAC-based residual variance to intolerance score (RVIS) and missense constraint scores (mis_Z scores), as well as the gnomAD-based “loss-of-function observed/expected upper bound fraction” (LOEUF). LOEUF score is a conservative estimate of the observed/expected ratio, based on the upper bound of a Poisson-derived confidence interval around the ratio. It ranges from 0 to 2, with lower LOEUF scores indicating stronger selection against predicted loss-of-function variation in a given gene, and a cut-off value is suggested as 0.35. The mis_Z score indicates a gene’s intolerance to missense variants, positive scores indicate more constraint, and negative scores indicate less constraint. A greater Z-score indicates more intolerance to the class of variation. RVIS gene score was based on ExAC v2 release 2.0 (accessed: March 15th, 2017). As of this release we use CCDS release 20 and Ensembl release 87 annotations. The score was converted into percentile by ranking all genes from most intolerant to least. For example, percentile of 1% means the gene is amongst the top 1% of the most intolerant genes. Wilcoxon two-sample test was performed in R (versions 3.6.2) with the wilcox.test function.
SUPPLEMENTAL INFORMATION
Supplemental_Information.pdf
Figures S1-S11
Tables S1, S2, S9
Supplemental_Tables.xlsx
Tables S3-S8, S10-S16
DATA AND CODE AVAILABILITY
The underlying genomic and phenotypic data in the recalled subset are available from the following resources: the sequencing and phenotype data for the SSC cohort are available to approved researchers at SFARI Base (accession IDs: SFARI_SSC_WGS_p, SFARI_SSC_WGS_1, and SFARI_SSC_WGS_2). WGS data from the SAGE samples is available at dbGaP under accession phs001740.v1.p1. Family-level FreeBayes and GATK VCF files for SSC and SAGE samples are available at dbGaP under accession phs001874.v1.p1 and also at SFARI Base under accession: SFARI_SSC_WGS_2a. The genomic and phenotypic data for the SPARK study is available by request from SFARI Base (accession IDs: SFARI_SPARK_pilot, SFARI_SPARK_WES_1). The genomic and phenotypic data for the DDD study is available by request from the European Genome-phenome Archive (EGA) under study EGAS00001000775. The DNMs in the no-recall subset are retrieved from each of the corresponding publications (Table S2). The final harmonized DNMs, from both the recalled and no-recall subsets, used in the study are available in Table S3. This study did not generate new unique reagents. Code used for analyses and figures are available upon request. Further information and requests for resources should be directed to and will be fulfilled by the Lead Contact, Evan E. Eichler (eee{at}gs.washington.edu). Software and databases used in this study are publicly available as follows:
denovo-db: http://denovo-db.gs.washington.edu/;
SPARK: https://sparkforautism.org;
Ensembl VEP (GRCh37): http://grch37.ensembl.org/Homo_sapiens/Tools/VEP/;
CADD score: https://cadd.gs.washington.edu/;
FreeBayes: https://github.com/ekg/freebayes;
CH model: https://github.com/tianyunwang/CH-model;
denovolyzeR: https://github.com/jamesware/denovolyzeR;
DeNovoWEST: https://github.com/queenjobo/DeNovoWEST;
quminorm: https://github.com/willtownes/quminorm;
SCTransform: https://github.com/ChristophH/sctransform;
UCSC Cell Browser: https://cells.ucsc.edu;
TSEA tool: http://genetics.wustl.edu/jdlab/tsea/;
CSEA tool: http://genetics.wustl.edu/jdlab/csea-tool-2/;
RVIS score: http://genic-intolerance.org/;
STRING: https://string-db.org/;
CytoScape: https://cytoscape.org/;
CytoHubba: http://apps.cytoscape.org/apps/cytohubba;
ProteinPaint: https://proteinpaint.stjude.org/;
DDG2P: https://www.ebi.ac.uk/gene2phenotype;
SFARI Gene: https://gene.sfari.org/;
SFARI Base: http://base.sfari.org;
EGA: https://ega-archive.org/;
OMIM: https://omim.org;
gnomAD: https://gnomad.broadinstitute.org/.
The SPARK Consortium
John Acampado 1, Andrea J. Ace 1, Alpha Amatya 1, Irina Astrovskaya 1, Asif Bashar 1, Elizabeth Brooks 1, Martin E. Butler 1, Lindsey A. Cartner 1, Wubin Chin 1, Wendy K. Chung 1,2, Amy M. Daniels 1, Pamela Feliciano 1, Chris Fleisch 1, Swami Ganesan 1, William Jensen 1, Alex E. Lash 1, Richard Marini 1, Vincent J. Myers 1, Eirene O’Connor 1, Chris Rigby 1, Beverly E. Robertson 1, Neelay Shah 1, Swapnil Shah 1, Emily Singer 1, LeeAnne G. Snyder 1, Alexandra N. Stephens 1, Jennifer Tjernagel 1, Brianna M. Vernoia 1, Natalia Volfovsky 1, Loran Casey White 1, Alexander Hsieh 2, Yufeng Shen 2, Xueya Zhou 2, Tychele N. Turner 3, Ethan Bahl 4, Taylor R. Thomas 4, Leo Brueggeman 4, Tanner Koomar 4, Jacob J. Michaelson 4, Brian J. O’Roak 5, Rebecca A. Barnard 5, Richard A. Gibbs 6, Donna Muzny 6, Aniko Sabo 6, Kelli L. Baalman Ahmed 6, Evan E. Eichler 7, Matthew Siegel 8, Leonard Abbeduto 9, David G. Amaral 9, Brittani A. Hilscher 9, Deana Li 9, Kaitlin Smith 9, Samantha Thompson 9, Charles Albright 10, Eric M. Butter 10, Sara Eldred 10, Nathan Hanna 10, Mark Jones 10, Daniel Lee Coury 10, Jessica Scherr 10, Taylor Pifher 10, Erin Roby 10, Brandy Dennis 10, Lorrin Higgins 10, Melissa Brown 10, Michael Alessandri 11, Anibal Gutierrez 11, Melissa N. Hale 11, Lynette M. Herbert 11, Hoa Lam Schneider 11, Giancarla David 11, Robert D. Annett 12, Dustin E. Sarver 12, Ivette Arriaga 13, Alexies Camba 13, Amanda C. Gulsrud 13, Monica Haley 13, James T. McCracken 13, Sophia Sandhu 13, Maira Tafolla 13, Wha S. Yang 13, Laura A. Carpenter 14, Catherine C. Bradley 14, Frampton Gwynette 14, Patricia Manning 15, Rebecca Shaffer 15, Carrie Thomas 15, Raphael A. Bernier 16, Emily A. Fox 16, Jennifer A. Gerdts 16, Micah Pepper 16, Theodore Ho 16, Daniel Cho 16, Joseph Piven 17, Holly Lechniak 18, Latha V. Soorya 18, Rachel Gordon 18, Allison Wainer 18, Lisa Yeh 18, Cesar Ochoa-Lubinoff 19, Nicole Russo 19, Elizabeth Berry-Kravis 20, Stephanie Booker 21, Craig A. Erickson 21, Lisa M. Prock 22, Katherine G. Pawlowski 22, Emily T. Matthews 22, Stephanie J. Brewster 22, Margaret A. Hojlo 22, Evi Abada 22, Elena Lamarche 23, Tianyun Wang 24, Shwetha C. Murali 7, William T. Harvey 24, Hannah E. Kaplan 25, Karen L. Pierce 25, Lindsey DeMarco 26, Susannah Horner 26, Juhi Pandey 26, Samantha Plate 26, Mustafa Sahin 27, Katherine D. Riley 27, Erin Carmody 27, Julia Constantini 7, Amy Esler 28, Ali Fatemi 29, Hanna Hutter 29, Rebecca J. Landa 29, Alexander P. McKenzie 29, Jason Neely 29, Vini Singh 29, Bonnie Van Metre 29, Ericka L. Wodka 29, Eric J. Fombonne 30, Lark Y. Huang-Storms 30, Lillian D. Pacheco 30, Sarah A. Mastel 30, Leigh A. Coppola 30, Sunday Francis 31, Andrea Jarrett 31, Suma Jacob 31, Natasha Lillie 31, Jaclyn Gunderson 31, Dalia Istephanous 31, Laura Simon 31, Ori Wasserberg 31, Angela L. Rachubinski 32, Cordelia R. Rosenberg 32, Stephen M. Kanne 33,34, Amanda D. Shocklee 34, Nicole Takahashi 34, Shelby L. Bridwell 34, Rebecca L. Klimczac 34, Melissa A. Mahurin 34, Hannah E. Cotrell 34, Cortaiga A. Grant 34, Samantha G. Hunter 34, Christa Lese Martin 35, Cora M. Taylor 35, Lauren K. Walsh 35, Katherine A. Dent 35, Andrew Mason 36, Anthony Sziklay 36, Christopher J. Smith 36
1 Simons Foundation, New York, USA
2 Columbia University, New York, USA
3 Washington University School of Medicine, St. Louis, USA
4 University of Iowa Carver College of Medicine, Iowa City, USA
5 Oregon Health & Science University, Portland, USA
6 Baylor College of Medicine, Houston, USA
7 University of Washington School of Medicine & Howard Hughes Medical Institute, Seattle, USA
8 Maine Medical Center Research Institute, Portland, USA
9 University of California, Davis, Sacramento, USA
10 Nationwide Children’s Hospital, Columbus, USA
11 University of Miami, Coral Gables, USA
12 University of Mississippi Medical Center, Jackson, USA
13 University of California, Los Angeles, Los Angeles, USA
14 Medical University of Southern Carolina (MUSC), Portland, USA
15 Cincinnati Children’s Hospital Medical Center - Research Foundation, Cincinnati, USA
16 Seattle Children’s Autism Center/UW, Seattle, USA
17 University of North Carolina at Chapel Hill, Chapel Hill, USA
18 Department of Child & Adolescent Psychiatry, Rush University Medical Center, Chicago, USA
19 Department of Developmental & Behavioral Pediatrics, Rush University Medical Center, Chicago, USA
20 Department of Neurological Sciences, Department of Pediatrics, Department of Biochemistry, Rush University Medical Center, Chicago, USA
21 Cincinnati Children’s Hospital Medical Center - Research Foundation, Cincinnati, USA
22 Boston Children’s Hospital (BCH), Boston, USA
23 University of North Carolina at Chapel Hill, Chapel Hill, USA
24 University of Washington School of Medicine, Seattle, USA
25 University of California, San Diego, School of Medicine, La Jolla, USA
26 Children’s Hospital of Philadelphia, Philadelphia, USA
27 Boston Children’s Hospital (BCH), Boston, USA
28 University of Minnesota, Minneapolis, USA
29 Kennedy Krieger Institute, Baltimore, USA
30 Oregon Health & Science University, Portland, USA
31 University of Minnesota, Minneapolis, USA
32 University of Colorado School of Medicine, Aurora, USA
33 Department of Health Psychology, University of Missouri, Columbia, USA
34 Thompson Center for Autism and Neurodevelopmental Disorders, University of Missouri, Columbia, USA
35 Geisinger Autism & Developmental Medicine Institute, Lewisburg, USA
36 Southwest Autism Research and Resource Center, Phoenix, USA
ACKNOWLEDGEMENTS
We are grateful to all the families who participated in the study. We thank Tonia Brown for assistance in editing this manuscript. The DDD study presents independent research commissioned by the Health Innovation Challenge Fund (grant number HICF-1009-003), a parallel funding partnership between the Wellcome Trust and the Department of Health, and the Wellcome Trust Sanger Institute (grant number WT098051). The views expressed in this publication are those of the authors and not necessarily those of the Wellcome Trust or the Department of Health. The study has UK Research Ethics Committee approval (10/H0305/83, granted by the Cambridge South REC, and GEN/284/12 granted by the Republic of Ireland REC). The research team acknowledges the support of the National Institute for Health Research, through the Comprehensive Clinical Research Network. This work was supported, in part, by US National Institutes of Health (NIH) grants (R01MH101221 and U01MH119705) and a grant from the Simons Foundation (SFARI #608045) to E.E.E.; E.E.E. is an investigator of the Howard Hughes Medical Institute.
Footnotes
↵† A full list of consortium members appears at the end of the paper.
ABBREVIATIONS
- DNM
- de novo mutation
- NDD
- Neurodevelopmental disorder
- ASD
- Autism spectrum disorder
- DD
- Developmental disorder
- ID
- Intellectual disability
- WES
- Whole-exome sequencing
- WGS
- Whole-genome sequencing
- CCDG
- Centers for Common Disease Genomics
- SFARI
- Simons Foundation Autism Research Initiative
- SPARK
- Simons Foundation Powering Autism Research for Knowledge
- SSC
- Simons Simplex Collection
- ASC
- Autism Sequencing Consortium
- DDD
- Deciphering Developmental Disorders
- RUMC
- Radboud University Medical Center
- CADD score
- combined annotation dependent depletion score
- LGD
- likely gene-disruptive
- dnLGD
- de novo LGD variant
- dnSYN
- de novo synonymous variant
- dnMIS
- de novo missense variant
- dnMIS30
- de novo missense variant with CADD score greater than 30
- FDR
- False discovery rate
- FWER
- Family-wise error rate
- PPI
- Protein-protein interaction
- CSEA
- Cell-type-specific expression analysis
- TSEA
- Tissue-specific expression analysis
- DDG2P
- Development Disorder Genotype - Phenotype Database
- OMIM
- Online Mendelian Inheritance in Man
- LC615
- The 615 genes are with a lower confidence at the union significance (FDR 5%) by one or more of the three models
- HC138
- The 138 genes are with the highest confidence at the intersection significance (FWER 5%) supported by all three models, which are a subset of the LC615 genes