Summary
De novo mutations (DNMs) in protein-coding genes are a well-established cause of developmental disorders (DD). However, known DD-associated genes only account for a minority of the observed excess of such DNMs. To identify novel DD-associated genes, we integrated healthcare and research exome sequences on 31,058 DD parent-offspring trios, and developed a simulation-based statistical test to identify gene-specific enrichments of DNMs. We identified 299 significantly DD-associated genes, including 49 not previously robustly associated with DDs. Despite detecting more DD-associated genes than in any previous study, much of the excess of DNMs of protein-coding genes remains unaccounted for. Modelling suggests that over 500 novel DD-associated genes await discovery, many of which are likely to be less penetrant than the currently known genes. Research access to clinical diagnostic datasets will be critical for completing the map of dominant DDs.
Introduction
It has previously been estimated that ~42-48% of patients with a severe developmental disorder (DD) have a pathogenic de novo mutation (DNM) in a protein-coding gene1,2. However, over half of these patients remain undiagnosed despite the identification of hundreds of dominant and X-linked DD-associated genes. This implies that there are more DD-relevant genes left to find. Existing methods to detect gene-specific enrichments of damaging DNMs typically ignore much prior information about which variants and genes are more likely to be disease-associated. However, missense variants and protein-truncating variants (PTVs) vary in their impact on protein function3–6. Known dominant DD-associated genes are strongly enriched in the minority of genes that exhibit patterns of strong selective constraint on heterozygous PTVs in the general population7. To identify the remaining DD genes, we need to increase our power to detect gene-specific enrichments for damaging DNMs by both increasing sample sizes and improving our statistical methods. In previous studies of pathogenic Copy Number Variation (CNV), utilising healthcare-generated data has been key to achieving much larger sample sizes than would be possible in a research setting alone8,9.
Improved statistical enrichment test identifies 299 significant DD-associated genes
Following clear consent practices and only using aggregate, de-identified data, we pooled DNMs in patients with severe developmental disorders from three centres: GeneDx (a US-based diagnostic testing company), the Deciphering Developmental Disorders study, and Radboud University Medical Center. We performed stringent quality control on variants and samples to obtain 45,221 coding and splicing DNMs in 31,058 individuals (Supplementary Fig. 1; Supplementary Table 1), which includes data on over 24,000 trios not previously published. These DNMs included 40,992 single nucleotide variants (SNVs) and 4,229 indels. The three cohorts have similar clinical characteristics, male/female ratios, enrichments of DNMs by mutational class, and prevalences of known disorders (Supplementary Fig. 2).
To detect gene-specific enrichments of damaging DNMs, we developed a method named DeNovoWEST (De Novo Weighted Enrichment Simulation Test, https://github.com/queenjobo/DeNovoWEST). DeNovoWEST scores all classes of sequence variants on a unified severity scale based on the empirically-estimated positive predictive value of being pathogenic (Supplementary Fig. 3-4). We then applied a Bonferroni multiple testing correction with independent hypothesis weighting10 to incorporate a gene-based weighting using the selective constraint against heterozygous PTVs in the general population (shet11), which is strongly correlated with the likelihood of being a dominant disease gene7,11.
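The general idea of a simulation-based weighted enrichment test, followed by a weighted Bonferroni correction, can be sketched as below. This is a minimal illustration and not the DeNovoWEST implementation: the function names, the two consequence classes, the mutation rates and the severity weights are all assumptions made for the example, and the gene-level weights are supplied directly rather than learned by independent hypothesis weighting.

```python
import math
import random


def poisson_draw(rng, lam):
    """Draw one Poisson(lam) count with Knuth's method (fine for small lam)."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulated_enrichment_p(observed_weights, class_rates, class_weights,
                           n_trios, n_sims=20000, seed=0):
    """Estimate P(simulated severity score >= observed score) for one gene.

    observed_weights: severity weights of the DNMs seen in the gene.
    class_rates: assumed per-chromosome mutation rate per consequence class.
    class_weights: assumed severity weight per consequence class.
    """
    rng = random.Random(seed)
    observed_score = sum(observed_weights)
    # Expected DNM count per class under the null (two chromosomes per child).
    expected = {c: 2 * n_trios * rate for c, rate in class_rates.items()}
    hits = 0
    for _ in range(n_sims):
        score = sum(class_weights[c] * poisson_draw(rng, lam)
                    for c, lam in expected.items())
        # Small tolerance so that exact ties survive floating-point rounding.
        if score >= observed_score - 1e-9:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_sims + 1)


def weighted_bonferroni(pvalues, gene_weights, alpha=0.05):
    """Reject gene i when p_i <= alpha * w_i / m, with weights scaled to mean 1."""
    m = len(pvalues)
    mean_weight = sum(gene_weights) / m
    return [p <= alpha * (w / mean_weight) / m
            for p, w in zip(pvalues, gene_weights)]


# Hypothetical gene: the rates and weights are made up for illustration.
rates = {"ptv": 2e-6, "missense": 1e-5}
weights = {"ptv": 0.9, "missense": 0.3}
observed = [0.9] * 4 + [0.3] * 3  # four PTVs and three missense DNMs
p_gene = simulated_enrichment_p(observed, rates, weights, n_trios=31058)
```

Because every variant class contributes to a single severity score, a gene with a few high-weight DNMs and a gene with many moderate-weight DNMs are tested on the same scale. The gene-level weights then shift the per-gene significance threshold without changing the overall error budget.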
We first applied DeNovoWEST to all individuals in our cohort and identified 299 significant genes, 35 more than when using our previous method¹ (Supplementary Fig. 5; Fig. 1a). The majority (181/299; 61%) of these significant genes already had sufficient evidence of DD-association to be considered of diagnostic utility (as of September 2018) by all three centres, and we refer to them as “consensus” genes. 69/299 of these significant genes were previously considered diagnostic by one or two centres (“discordant” genes). Applying DeNovoWEST to synonymous DNMs, as a negative control analysis, identified no significantly enriched genes (Supplementary Fig. 6). To discover novel DD-associated genes with greater power, we then applied DeNovoWEST only to DNMs in patients without damaging DNMs in consensus genes (we refer to this subset as ‘undiagnosed’ patients) and identified 118 significant genes (Fig. 1b; Supplementary Fig. 7; Supplementary Table 2). While 69 of these genes were discordant genes, we identified 49 ‘novel’ DD-associated genes, which had a median of 10 nonsynonymous DNMs in our dataset (Fig. 1c; Supplementary Table 3). There were 500 patients with nonsynonymous DNMs in these 49 genes (1.6% of our cohort); all DNMs in these genes were inspected in IGV¹² and, of 198 for which experimental validation was attempted, all were confirmed as DNMs in the proband. The DNMs in these novel genes were distributed approximately randomly across the three datasets (no genes with p < 0.001, heterogeneity test). Fourteen of the 49 novel DD-associated genes have been further corroborated by recent OMIM entries or publications. In particular, seven of these 14 genes (PPP2CA¹³, ZMIZ1¹⁴, CDK8¹⁵, VAMP2¹⁶, KMT2E¹⁷, KDM6B¹⁸ and TAOK1¹⁹) have had genotype-phenotype studies recently published.
We also investigated whether some synonymous DNMs might be pathogenic by disrupting splicing. We annotated all synonymous DNMs with a splicing pathogenicity score, SpliceAI20, and identified a significant enrichment of synonymous DNMs with high SpliceAI scores (≥ 0.8, 1.56-fold enriched, p = 0.0037, Poisson test; Supplementary Table 4). This enrichment corresponds to an excess of ~15 splice-disrupting synonymous mutations in our cohort, of which six are accounted for by a single recurrent synonymous mutation in KAT6B known to disrupt splicing21.
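A one-sided Poisson enrichment test of this kind can be sketched as follows. The observed and expected counts used here (42 and 27) are illustrative values chosen only to be consistent with the reported 1.56-fold enrichment and excess of ~15; they are not taken from the study, so the resulting p-value should not be expected to match the reported one exactly.

```python
import math


def poisson_upper_tail(observed, expected):
    """Exact P(X >= observed) for X ~ Poisson(expected)."""
    term = math.exp(-expected)  # P(X = 0)
    below = 0.0
    for k in range(observed):
        below += term
        term *= expected / (k + 1)  # step from P(X = k) to P(X = k + 1)
    return max(0.0, 1.0 - below)


# Illustrative counts only (assumed, not reported in the text).
observed_count, expected_count = 42, 27
fold_enrichment = observed_count / expected_count
excess = observed_count - expected_count
p_value = poisson_upper_tail(observed_count, expected_count)
```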
Taken together, 24.8% of individuals in our combined cohort have a nonsynonymous DNM in one of the consensus or significant DD-associated genes (Fig. 1d). We noted significant sex differences in the autosomal burden of nonsynonymous DNMs (Supplementary Fig. 8). The rate of nonsynonymous DNMs in consensus autosomal genes was significantly higher in females than males (OR = 1.17, p = 1.1 × 10−7, Fisher’s exact test; Fig. 1e), as noted previously1. However, the exome-wide burden of autosomal nonsynonymous DNMs in all genes was not significantly different between undiagnosed males and females (OR = 1.03, p = 0.29, Fisher’s exact test). This suggests the existence of subtle sex differences in the genetic architecture of DD, especially with regard to known and undiscovered disorders.
Characteristics of the novel DD-associated genes and disorders
Based on semantic similarity22 between Human Phenotype Ontology terms, patients with DNMs in the same novel DD-associated gene were less phenotypically similar to each other, on average, than patients with DNMs in a consensus gene (p = 9.5 × 10−38, Wilcoxon rank-sum test; Fig. 2a). This suggests that these novel disorders less often result in distinctive and consistent clinical presentations, which may have made these disorders harder to discover via a phenotype-driven analysis or recognise by clinical presentation alone. Each of these novel disorders requires a detailed genotype-phenotype characterisation, which is beyond the scope of this study.
Overall, novel DD-associated genes encode proteins that have very similar functional and evolutionary properties to consensus genes, e.g. developmental expression patterns, network properties and biological functions (Fig. 2b; Supplementary Table 5). Despite the high-level functional similarity between known and novel DD-associated genes, the nonsynonymous DNMs in the more recently discovered DD-associated genes are much more likely to be missense DNMs, and less likely to be PTVs (discordant and novel; p = 3.3 × 10−21, chi-squared test). Sixteen of the 49 novel genes (33%) had only missense DNMs, and only a minority had more PTVs than missense DNMs. Consequently, we expect that a greater proportion of the novel genes will act via altered-function mechanisms (e.g. dominant negative or gain-of-function). For example, the novel gene PSMC5 (DeNovoWEST p = 6.5 × 10−10) had one inframe deletion and nine missense DNMs, eight of which altered one of two amino acids that interact within the 3D protein structure: p.Pro320Arg and p.Arg325Trp (Supplementary Fig. 9a-b), and so is likely to operate via an altered-function mechanism. Additionally, we identified one novel DD-associated gene, MN1, with de novo PTVs significantly (p = 1.6 × 10−7, Poisson test) clustered at the 3’ end of its transcript (Supplementary Fig. 9c). This clustering of PTVs indicates the transcript likely escapes nonsense mediated decay and potentially acts via a gain-of-function or dominant negative mechanism23, although this will require functional confirmation.
We observed that missense DNMs were more likely to affect functional protein domains than other coding regions. We observed a 2.76-fold enrichment (p = 1.6 × 10−68, G-test) of missense DNMs residing in protein domains among consensus genes and a 1.87-fold enrichment (p = 1.4 × 10−4, G-test) in novel DD-associated genes, but no enrichment for synonymous DNMs (Supplementary Table 6). Three protein domain families in consensus genes were specifically enriched for missense DNMs (Supplementary Table 7): ion transport protein (PF00520, p = 3.9 × 10−7, G-test), ligand-gated ion channel (PF00060, p = 6.7 × 10−7, G-test), and protein kinase domain (PF00069, p = 4.4 × 10−2, G-test). Missense DNMs in all three enriched domain families have previously been associated with DD (Supplementary Table 8)24.
We observed a significant overlap between the 299 DNM-enriched DD-associated genes and a set of 369 previously described cancer driver genes25 (p = 1.7 × 10−46, logistic regression correcting for shet), as observed previously26,27, as well as a significant enrichment of nonsynonymous DNMs in these genes (Supplementary Table 9). This overlap extends to somatic driver mutations: we observe 117 DNMs at 76 recurrent somatic mutations observed in at least three patients in The Cancer Genome Atlas (TCGA)28. By modelling the germline mutation rate at these somatic driver mutations, we found that recurrent nonsynonymous mutations in TCGA are enriched 21-fold in the DDD cohort (p <10−50, Poisson test, Supplementary Fig. 10), whereas recurrent synonymous mutations in TCGA are not significantly enriched (2.4-fold, p = 0.13, Poisson test). This suggests that this observation is driven by the pleiotropic effects of these mutations in development and tumourigenesis, rather than hypermutability.
Recurrent mutations and potential new germline selection genes
We identified 773 recurrent DNMs (736 SNVs and 37 indels), ranging from 2 to 36 independent observations per DNM, which allowed us to interrogate systematically the factors driving recurrent germline mutation. We considered three potential contributory factors: (i) clinical ascertainment enriching for pathogenic mutations, (ii) greater mutability at specific sites, and (iii) positive selection conferring a proliferative advantage in the male germline, thus increasing the prevalence of sperm containing the mutation29. We observed strong evidence that all three factors contribute, though they are not necessarily mutually exclusive. Clinical ascertainment drives the observation that 65% of recurrent DNMs were in consensus genes, a 5.4-fold enrichment compared to DNMs only observed once (p <10−50, proportion test). Hypermutability underpins the observation that 68% of recurrent de novo SNVs occurred at hypermutable CpG dinucleotides30, a 1.8-fold enrichment over DNMs only observed once (p = 1.1 × 10−59, proportion test). We also observed a striking enrichment of recurrent mutations at the haploinsufficient DD-associated gene MECP2, in which we observed 11 recurrently mutated SNVs within a 500 bp window, nine of which were G to A mutations at a CpG dinucleotide. MECP2 exhibits a highly significant twofold excess of synonymous mutations within the Genome Aggregation Database (gnomAD) population variation resource5, suggesting that locus-specific hypermutability might explain this observation.
To assess the contribution of germline selection to recurrent DNMs, we initially focused on the 12 known germline selection genes, which all operate through activation of the RAS-MAPK signalling pathway31,32. We identified 39 recurrent DNMs in 11 of these genes, 38 of which are missense and all of which are known to be activating in the germline (see supplement). As expected, given that hypermutability is not the driving factor for recurrent mutation in these germline selection genes, these 39 recurrent DNMs were depleted for CpGs relative to other recurrent mutations (9/39 vs 450/692, p = 0.0067, chi-squared test).
Positive germline selection has been shown to be capable of increasing the apparent mutation rate more strongly29 than either clinical ascertainment (10-100× in our dataset) or hypermutability (~10× for CpGs). However, only a minority of the most highly recurrent mutations in our dataset are in genes that have been previously associated with germline selection. Nonetheless, several lines of evidence suggested that the majority of these most highly recurrent mutations are likely to confer a germline selective advantage. Based on the recurrent DNMs in known germline selection genes, DNMs under germline selection should be more likely to be activating missense mutations, and should be less enriched for CpG dinucleotides. Table 1 shows the 16 de novo SNVs observed nine or more times in our DNM dataset, only two of which are in known germline selection genes (MAP2K1 and PTPN11). All but two of these 16 de novo SNVs cause missense changes, all but two of these genes cause disease by an altered-function mechanism, and these DNMs were depleted for CpGs relative to all recurrent mutations. Two of the genes with highly recurrent de novo SNVs, SHOC2 and PPP1CB, encode interacting proteins that are known to play a role in regulating the RAS-MAPK pathway, and pathogenic variants in these genes are associated with a Noonan-like syndrome33. Moreover, two of these recurrent DNMs are in the same gene, SMAD4, which encodes a key component of the TGF-beta signalling pathway, potentially expanding the pathophysiology of germline selection beyond the RAS-MAPK pathway. Confirming germline selection of these mutations will require deep sequencing of testes and/or sperm32.
Evidence for incomplete penetrance and pre/perinatal death
Nonsynonymous DNMs in consensus or significant DD-associated genes accounted for half of the exome-wide nonsynonymous DNM burden associated with DD (Fig. 1b). Despite our identification of 299 significantly DD-associated genes, there remains a substantial burden of both missense and protein-truncating DNMs in unassociated genes (those that are neither significant in our analysis nor on the consensus gene list). The remaining burden of protein-truncating DNMs is greatest in genes that are intolerant of PTVs in the general population (Supplementary Fig. 11), suggesting that more haploinsufficient (HI) disorders await discovery. We estimated that our statistical power to detect the gene enrichment for protein-truncating DNMs expected for an HI disorder was lower in unassociated genes compared to the novel DD-associated genes (p = 2.9 × 10−6, Wilcoxon rank-sum test; Fig. 3a). However, the novel genes do not have significantly less power compared to the consensus genes (p = 0.059, Wilcoxon rank-sum test).
A key parameter in the above power analysis is the fold-enrichment of de novo PTVs expected in as yet undiscovered HI disorders, which we assumed above to be 37-fold, based on the average enrichment observed in known HI DD-associated genes. However, we observed that novel DD-associated HI genes had significantly lower PTV enrichment compared to the consensus HI genes (p = 1.6 × 10−5, Poisson test; Fig. 3b). Two additional factors that could lower DNM enrichment, and thus power to detect a novel DD-association, are reduced penetrance and increased pre/perinatal death, which here covers spontaneous fetal loss, termination of pregnancy for fetal anomaly, stillbirth, and early neonatal death. To evaluate incomplete penetrance, we investigated whether HI genes with a lower enrichment of protein-truncating DNMs in our cohort are associated with greater prevalences of PTVs in the general population. We observed a significant (p = 0.031, weighted linear regression) negative correlation between PTV enrichment in our cohort and the ratio of PTV to synonymous variants in the gnomAD dataset of population variation5, suggesting that incomplete penetrance does lower de novo PTV enrichment in our cohort (Fig. 3c).
Additionally, we observed that the fold-enrichment of protein-truncating DNMs in consensus HI DD-associated genes in our cohort was significantly lower for genes with a medium or high likelihood of presenting with a prenatal structural malformation (p = 0.0002, Poisson test, Fig. 3d), suggesting that pre/perinatal death decreases our power to detect some novel DD-associated disorders (see supplement for details).
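The dependence of power on the assumed fold-enrichment can be illustrated with a simple per-gene Poisson calculation. This is a sketch under stated assumptions, not the power analysis used in the study: the per-chromosome PTV mutation rate (1e-6) and the significance threshold (0.05 divided by 19,000 genes) are hypothetical values, so the resulting power figures are illustrative only.

```python
import math


def poisson_upper_tail(k, lam):
    """Exact P(X >= k) for X ~ Poisson(lam)."""
    term = math.exp(-lam)  # P(X = 0)
    below = 0.0
    for i in range(k):
        below += term
        term *= lam / (i + 1)
    return max(0.0, 1.0 - below)


def power_to_detect(n_trios, ptv_rate, fold, alpha):
    """Power to detect a given fold-enrichment of de novo PTVs in one gene."""
    null_mean = 2 * n_trios * ptv_rate  # two chromosomes per child
    # Smallest count whose probability under the null is at most alpha.
    critical = 0
    while poisson_upper_tail(critical, null_mean) > alpha:
        critical += 1
    # Probability of reaching that count when the rate is fold times higher.
    return poisson_upper_tail(critical, fold * null_mean)


ASSUMED_PTV_RATE = 1e-6      # hypothetical per-chromosome PTV mutation rate
ASSUMED_ALPHA = 0.05 / 19000  # hypothetical exome-wide threshold

power_37_fold = power_to_detect(31058, ASSUMED_PTV_RATE, 37, ASSUMED_ALPHA)
power_12_fold = power_to_detect(31058, ASSUMED_PTV_RATE, 12, ASSUMED_ALPHA)
```

For a fixed cohort size and mutation rate, lowering the assumed enrichment from 37-fold to a smaller value reduces the computed power, which is why disorders with reduced penetrance or increased pre/perinatal death are harder to detect.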
Modelling reveals hundreds of DD genes remain to be discovered
To understand the likely trajectory of future DD discovery efforts, we downsampled the current cohort and reran our enrichment analysis (Fig. 4a). We observed that the number of significant genes has not yet plateaued. Increasing sample sizes should result in the discovery of many novel DD-associated genes. To estimate how many haploinsufficient genes might await discovery, we modelled the likelihood of the observed distribution of protein-truncating DNMs among genes as a function of varying numbers of undiscovered HI DD genes and fold-enrichments of protein-truncating DNMs in those genes. We found that the remaining HI burden is most likely spread across ~500 genes with ~12-fold PTV enrichment (Fig. 4b). This fold enrichment is three times lower than in known HI DD-associated genes, suggesting that incomplete penetrance and/or pre/perinatal death is much more prevalent among undiscovered HI genes. We modelled the missense DNM burden separately and also observed that the most likely architecture of undiscovered DD-associated genes is one that comprises over 500 genes with a substantially lower fold-enrichment than in currently known DD-associated genes (Supplementary Fig. 12).
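One simple way to frame such a model is as a two-component Poisson mixture evaluated over a grid of candidate gene numbers and fold-enrichments. The sketch below makes that assumption explicitly; it is not the model used in the study, and the gene counts, expected PTV counts and random seed are synthetic values chosen for illustration.

```python
import math
import random


def log_poisson(x, lam):
    """Log probability of observing x events when lam are expected."""
    return -lam + x * math.log(lam) - math.lgamma(x + 1)


def mixture_loglik(counts, expected, n_hi, fold):
    """Log-likelihood when n_hi of the genes carry a fold-enriched PTV rate."""
    pi = n_hi / len(counts)
    total = 0.0
    for x, mu in zip(counts, expected):
        null_part = (1 - pi) * math.exp(log_poisson(x, mu))
        hi_part = pi * math.exp(log_poisson(x, fold * mu))
        total += math.log(null_part + hi_part)
    return total


def grid_search(counts, expected, n_hi_grid, fold_grid):
    """Return the (n_hi, fold) pair with the highest mixture log-likelihood."""
    candidates = [(n, f) for n in n_hi_grid for f in fold_grid]
    return max(candidates, key=lambda nf: mixture_loglik(counts, expected, *nf))


def poisson_draw(rng, lam):
    """Knuth's method for a single Poisson draw (small lam only)."""
    threshold, count, product = math.exp(-lam), 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


# Synthetic data: 3,000 genes, the first 300 of which are 12-fold enriched.
rng = random.Random(1)
expected = [0.1] * 3000
counts = [poisson_draw(rng, (12.0 if i < 300 else 1.0) * mu)
          for i, mu in enumerate(expected)]
best_n_hi, best_fold = grid_search(counts, expected,
                                   n_hi_grid=[0, 100, 300, 900],
                                   fold_grid=[4.0, 12.0, 36.0])
```

The number of enriched genes and their fold-enrichment trade off against each other in such a model, since many weakly enriched genes and fewer strongly enriched genes can produce a similar total excess. This is why the two parameters are evaluated jointly.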
We calculated that a sample size of ~200,000 parent-offspring trios would be needed to have 80% power to detect a 12-fold enrichment of protein-truncating DNMs for a gene with the median PTV mutation rate among currently unassociated genes. Using this inferred 12-fold enrichment among undiscovered HI genes, from our current data we can evaluate the likelihood that any gene in the genome is an undiscovered HI gene, by comparing the likelihood of the number of de novo PTVs observed in each gene to have arisen from the null mutation rate or from a 12-fold increased PTV rate. Among the ~19,000 non-DD-associated genes, ~1,100 were more than three times more likely to have arisen from a 12-fold increased PTV rate, whereas ~9,000 were three times more likely to have no de novo PTV enrichment.
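The per-gene comparison described above reduces to a Poisson likelihood ratio, which has the closed form fold^x · exp(−(fold − 1)·μ) for x observed de novo PTVs and μ expected under the null. A minimal sketch follows; the function names and the example expected counts are hypothetical, and the threshold of three follows the text.

```python
import math


def hi_likelihood_ratio(n_ptv, expected_ptv, fold=12.0):
    """Poisson likelihood at fold * expected divided by that at expected."""
    return fold ** n_ptv * math.exp(-(fold - 1.0) * expected_ptv)


def classify_gene(n_ptv, expected_ptv, fold=12.0, threshold=3.0):
    """Label a gene by which rate explains its de novo PTV count better."""
    ratio = hi_likelihood_ratio(n_ptv, expected_ptv, fold)
    if ratio > threshold:
        return "enriched"    # at least 3x more likely under the increased rate
    if ratio < 1.0 / threshold:
        return "null"        # at least 3x more likely under the null rate
    return "ambiguous"


# Hypothetical genes, given as (observed PTVs, expected PTVs under the null).
labels = [classify_gene(1, 0.05), classify_gene(0, 0.12), classify_gene(0, 0.05)]
```

A gene with no observed de novo PTVs can only be confidently assigned to the null when its expected count is large enough, so short or slowly mutating genes tend to remain ambiguous at the current sample size.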
Discussion
In this study, we have discovered 49 novel developmental disorders by developing an improved statistical test for mutation enrichment and applying it to a dataset of exome sequences from 31,058 children with developmental disorders, and their parents. These 49 novel genes account for up to 1.6% of our cohort, and inclusion of these genes in diagnostic workflows will catalyse increased diagnosis of similar patients globally. We have shown that both incomplete penetrance and pre/perinatal death reduce our power to detect novel DDs postnatally, and that one or both of these factors are likely operating considerably more strongly among undiscovered DD-associated genes. In addition, we have identified a set of highly recurrent mutations that are strong candidates for novel germline selection mutations, which would be expected to result in a higher than expected disease incidence that increases dramatically with increased paternal age.
Our study represents the largest collection of DNMs for any disease area, and is approximately three times larger than a recent meta-analysis of DNMs from a collection of individuals with autism spectrum disorder, intellectual disability, and/or a developmental disorder34. Our analysis included DNMs from 24,348 previously unpublished trios, and we identified ~2.4 times as many significantly DD-associated genes as this previous study when using Bonferroni-corrected exome-wide significance (299 vs 124). In contrast to meta-analyses of published DNMs, the harmonised filtering of candidate DNMs across cohorts in this study should protect against results being confounded by substantial cohort-specific differences in the sensitivity and specificity of detecting DNMs.
Here we inferred indirectly that developmental disorders with higher rates of detectable prenatal structural abnormalities had greater pre/perinatal death. The potential size of this effect can be quantified from the recently published PAGE study of genetic diagnoses in a cohort of fetal structural abnormalities35. In this latter study, genetic diagnoses were not returned to participants during the pregnancy, and so the genetic diagnostic information itself could not influence pre/perinatal death. In the PAGE study data, 69% of fetal abnormalities with a genetically diagnosable cause died perinatally or neonatally, with termination of pregnancy, fetal demise and neonatal death all contributing. This emphasises the substantial impact that pre/perinatal death can have on reducing the ability to discover novel DDs from postnatal recruitment alone, and motivates the integration of genetic data from prenatal, neonatal and postnatal studies in future analyses.
To empower our mutation enrichment testing, we estimated positive predictive values (PPV) of each DNM being pathogenic on the basis of their predicted protein consequence, CADD score3 and presence in a region or gene under missense constraint in the general population4. These PPVs should also be highly informative for variant prioritisation in the diagnosis of dominant developmental disorders. Further work is needed to see whether these PPVs might be informative for recessive developmental disorders, and in other types of dominant disorders. More generally, we hypothesise that empirically-estimated PPVs based on variant enrichment in large datasets will be similarly informative in many other disease areas.
We adopted a conservative statistical approach to identifying DD-associated genes. In two previous studies using the same significance threshold, we identified 26 novel DD-associated genes1,36. All 26 are now regarded as being diagnostic, and have entered routine clinical diagnostic practice. Had we used a significance threshold of FDR <10% as used in Satterstrom, Kosmicki, Wang et al37, we would have identified 737 DD-associated genes. However, as the FDR of individual genes depends on the significance of other genes being tested, FDR thresholds are not appropriate for assessing the significance of individual genes, but rather for defining gene-sets. There are 150 consensus genes that did not cross our significance threshold in this study. It is likely that many of these cause disorders that were under-represented in our study due to the ease of clinical diagnosis on the basis of distinctive clinical features or targeted diagnostic testing. These ascertainment biases are, however, not likely to impact the representation of novel DDs in our cohort.
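The point that an FDR call on one gene depends on the other genes tested can be illustrated with a small comparison of the Benjamini-Hochberg procedure and a Bonferroni threshold. The p-values below are invented for the example, and the Benjamini-Hochberg procedure is used here as a generic FDR method, which is an assumption rather than a description of the cited study.

```python
def benjamini_hochberg(pvalues, fdr=0.10):
    """Return a rejection flag per test using the Benjamini-Hochberg step-up rule."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Largest rank whose p-value falls under its rank-scaled threshold.
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= fdr * rank / m:
            cutoff_rank = rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[i] = True
    return rejected


def bonferroni(pvalues, alpha=0.05):
    """Return a rejection flag per test using a fixed alpha / m threshold."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]


# The gene of interest (index 0) has p = 0.03 in both hypothetical gene sets.
strong_company = [0.03] + [1e-6] * 8 + [0.9]
weak_company = [0.03, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9]

fdr_calls = (benjamini_hochberg(strong_company)[0],
             benjamini_hochberg(weak_company)[0])
bonferroni_calls = (bonferroni(strong_company)[0],
                    bonferroni(weak_company)[0])
```

With the same p-value and the same number of tests, the gene passes the FDR threshold only when tested alongside many strongly significant genes, whereas the Bonferroni decision is identical in both sets.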
Our modelling also suggested that over 1,000 DD-associated genes likely remain to be discovered, and that reduced penetrance and pre/perinatal death will reduce our power to identify these genes through DNM enrichment. Identifying these genes will require both improved analytical methods and greater sample sizes. We anticipate that the variant-level and gene-level weights used by DeNovoWEST will improve over time. As reference population samples, such as gnomAD5, increase in size, gene-level weights based on selective constraint metrics will improve. Gene-level weights could also incorporate more functional information, such as expression in disease-relevant tissues. For example, we observe that our DD-associated genes are significantly more likely to be expressed in fetal brain (Supplementary Fig. 13). Furthermore, novel metrics based on gene co-regulation networks can predict whether genes function within a disease relevant pathway38. As a cautionary note, including more functional information in the gene-level weights may increase power to detect some novel disorders while decreasing power for disorders with pathophysiology different from known disorders. Variant-level weights could be further improved by incorporating other variant prioritisation metrics, such as upweighting variants predicted to impact splicing, variants in particular protein domains, or variants that are somatic driver mutations during tumourigenesis. Finally, the discovery of less penetrant disorders can be empowered by analytical methodologies that integrate both DNMs and rare inherited variants, such as TADA39. Nonetheless, using current methods, we estimated that ~200,000 parent-child trios would need to be analysed to have ~80% power to detect HI genes with a 12-fold PTV enrichment. Discovering non-HI disorders will need even larger sample sizes.
Reaching this number of sequenced families will be impossible for an individual research study or clinical centre. It is therefore essential that genetic data generated as part of routine diagnostic practice are shared with the research community, so that they can be aggregated to drive discovery of novel disorders and improve diagnostic practice.
Data Access
Sequence- and variant-level data and phenotypic data for the DDD study are available through EGA study ID EGAS00001000775.
RadboudUMC sequence- and variant-level data cannot be made available through EGA due to the nature of consent for clinical testing.
GeneDx data cannot be made available through EGA due to the nature of consent for clinical testing. GeneDx has contributed deidentified data to this study to improve clinical interpretation of genomic data, in accordance with patient consent and in conformance with the ACMG position statement on genomic data sharing (see Supplementary Note for details).
Clinically interpreted variants and associated phenotypes from the DDD study are available through DECIPHER (https://decipher.sanger.ac.uk).
Clinically interpreted variants from RUMC are available from the Dutch national initiative for sharing variant classifications (https://www.vkgl.nl/nl/diagnostiek/vkgl-datashare-database).
Clinically interpreted variants from GeneDx are deposited in ClinVar (https://www.ncbi.nlm.nih.gov/clinvar).
Acknowledgements
We thank the families and their clinicians for their participation and engagement. We are very grateful to our colleagues who assisted in the generation and processing of data. Inclusion of RadboudUMC data was in part supported by the Solve-RD project that has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 779257. The DDD study presents independent research commissioned by the Health Innovation Challenge Fund [grant number HICF-1009-003]. This study makes use of DECIPHER which is funded by Wellcome. See www.ddduk.org/access.html for full acknowledgement. The DDD study would like to acknowledge the tireless work of Rosemary Kelsell. Finally, we acknowledge the contribution of an esteemed DDD clinical collaborator, M. Bitner-Glindzicz, who died during the course of the study.