Abstract
While next-generation sequencing has revealed deleterious de novo mutations (DNMs) in coding regions that confer risk for neuropsychiatric disorders, the role of DNMs that affect post-transcriptional regulation in the pathogenesis of these disorders remains to be elucidated. Here, we identified 1,736 post-transcriptionally impaired DNMs (piDNMs) and prioritized 1,482 candidate genes in four neuropsychiatric disorders from 7,748 families. Our results revealed a higher prevalence of piDNMs in probands than in controls (P = 8.19×10−17), and piDNM-harboring genes were enriched for epigenetic modifications and neuronal or synaptic functions. Moreover, we identified 86 piDNM-containing genes that form convergent co-expression modules and intensive protein-protein interactions in at least two neuropsychiatric disorders. These cross-disorder genes carrying piDNMs can form an interaction network centered on RNA binding proteins, suggesting a shared post-transcriptional etiology underlying these disorders. Our findings illustrate the significant contribution of piDNMs to four neuropsychiatric disorders and emphasize the value of combining functional and network-based evidence to identify regulatory causes of genetic disorders.
Introduction
Next-generation sequencing, which allows genome-wide detection of rare and de novo mutations (DNMs), is accelerating the genetics of human disease by identifying protein-coding mutations that confer risk1. Various computational methods have been developed to predict the effects of amino acid substitutions on protein function and to classify the corresponding mutations as deleterious or benign, based on evolutionary conservation or protein structural constraints2, 3. Besides their effects on protein structure and function, genetic mutations are involved in transcriptional processes4, 5 via direct or indirect effects on histone modifications6 and enhancers7, and thereby affect the pathogenesis of diseases. However, the majority of mutations are located in non-coding regions, and some of them have no relationship with transcriptional regulation yet can lead to an observable phenotype or disease8, suggesting the existence of another layer of regulatory effects of mutations. Single nucleotide variants that alter RNA structure, known as RiboSNitches, have been identified, and depletion of RiboSNitches results in the alteration of specific RNA shapes at thousands of sites, including 3’ untranslated regions and binding sites of RNA binding proteins (RBPs) and microRNAs9. Thus, mutations can impair post-transcriptional processes by disrupting the binding of microRNAs and RBPs10, 11, resulting in various human diseases. For example, a variant in the 3’ untranslated region of FMR1 decreases neuronal activity-dependent translation of FMRP by disrupting the binding of HuR, leading to developmental delay in patients10. Some attempts have been made to better understand the interactions between mutations and the binding of noncoding RNAs or RBPs. Maticzka et al. developed a machine learning-based approach to predict protein binding sites on RNA from crosslinking immunoprecipitation (CLIP) data using both RNA structure and sequence features12. Fukunaga et al. developed the CapR algorithm, which is based on the probability of the secondary structure of an RNA for RBP binding13. However, identifying the relationships between single nucleotide mutations and post-transcriptional regulation remains challenging because of the complexity of the underlying interaction networks. Our method RBP-Var14 and the method POSTAR15 developed by others represent initial efforts to systematically annotate post-transcriptional regulatory maps, which hold great promise for exploring the effect of single nucleotide mutations on post-transcriptional regulation in human diseases.
An increasing prevalence of neuropsychiatric disorders of unclear etiology in children has been reported during the past three decades16. Whole-exome sequencing of pediatric neuropsychiatric disorders uncovered the critical role of DNMs in the pathogenesis of these disorders1. However, previous studies of these disorders have focused on mutations in coding regions1, cis-regulation17, 18, the epigenome19, the transcriptome20, 21, and the proteome22, and very little is known about the effect of DNMs on post-transcriptional regulation. Recently, more attention has been paid to DNMs in regulatory elements and non-coding regions in neurodevelopmental disorders, as it is indispensable to combine functional and evolutionary evidence to identify regulatory causes of genetic disorders23, 24. Most recently, a deep-learning-based framework illuminated the involvement of noncoding DNMs in synaptic transmission and neuronal development in autism spectrum disorder25. Therefore, it is imperative to identify DNMs that disrupt post-transcriptional regulation and are related to the pathology and clinical treatment of neuropsychiatric disorders.
To test whether DNMs that disrupt post-transcriptional regulation contribute to the genetic architecture of psychiatric disorders, we collected whole exome sequencing data from 7,748 core families (5,677 parent-proband trios and 2,071 control trios) and curated 9,519 de novo mutations (6,996 DNMs in probands and 2,523 DNMs in controls) from four neuropsychiatric disorders, namely autism spectrum disorder (ASD), epileptic encephalopathy (EE), intellectual disability (ID) and schizophrenia (SCZ), as well as from unaffected control subjects (Supplementary Table 1). Using our newly updated workflow RBP-Var2 (Figure 1A, Supplementary Table 2), derived from our previously developed RBP-Var14, we investigated the potential impact of these de novo mutations on post-transcriptional regulation in the four neuropsychiatric disorders based on experimental data of genome-wide association studies (GWAS), expression quantitative trait loci (eQTL), CLIP-seq derived RBP binding sites, RNA editing and miRNA targets, and found that a subset of de novo mutations could be classified as post-transcriptionally impaired DNMs (piDNMs). These piDNMs showed significant enrichment in cases after correcting for multiple testing, and genes hit by these piDNMs were further analyzed for their properties and relative contribution to the etiology of neuropsychiatric disorders.
Results
The frequency of piDNMs is much higher in probands than in controls
To test whether specific subsets of regulatory DNMs contribute to the genetic architecture of neuropsychiatric disorders, we devised and updated the method RBP-Var2 (http://www.rbp-var.biols.ac.cn/), based on experimental data of GWAS, eQTL, CLIP-seq derived RBP binding sites, RNA editing and miRNA targets. Subsequently, we used our updated workflow to identify functional piDNMs from 5,677 trios with 6,996 DNMs across four neuropsychiatric disorders as well as 2,071 unaffected controls with 2,523 DNMs (Supplementary Table 1). We defined DNMs assigned a category 1 or 2 score by RBP-Var2 as piDNMs, considering their impact on RNA secondary structure and on the binding of miRNAs and RBPs, and identified 1,736 piDNMs in probands (Supplementary Table 3), of which 17, 7, 7, 6 and 1,699 were located in 3’ UTRs, 5’ UTRs, ncRNA exons, splicing sites and exons, respectively. In detail, RBP-Var2 identified 1,262 piDNMs in ASD, 281 piDNMs in SCZ, 101 piDNMs in EE, 92 piDNMs in ID and 354 piDNMs in healthy controls (Supplementary Table 3, 4). Interestingly, piDNMs in the four neuropsychiatric disorders were significantly over-represented compared with those in controls (OR = 1.62, P = 8.19×10−17, Table 1). We also observed that probands carried considerably more piDNMs than controls in each of the four neuropsychiatric disorders (Figure 1B; Table 1). Notably, we found that synonymous piDNMs were significantly enriched in probands compared with controls (P=9.73×10−4). In contrast, in the original set of DNMs before evaluation by RBP-Var2, no enrichment of synonymous DNMs was observed in cases, which is consistent with a previous study1. To eliminate the effects of loss-of-function (LoF) mutations, we filtered out LoF mutations from all piDNMs and found that the non-LoF piDNMs also exhibited a higher frequency in probands (P=2.36×10−14) (Table 1, Figure 1C; Supplementary Figure 1). 
Our analysis found that a subset of DNMs, namely piDNMs, is enriched in probands and may contribute to the pathogenesis of these disorders, although de novo synonymous variants as a category do not contribute significantly to risk for neurodevelopmental disorders.
piDNMs outperform protein-disruptive DNMs in risk prediction
To investigate the accuracy and specificity of DNM predictions for different regulatory processes, we compared our tool with three other variant effect prediction tools, namely SIFT25, PolyPhen2 (PPH2)26 and RegulomeDB26. We found that the frequency of stop gain DNMs determined by SIFT is higher in cases than in controls (P = 4.83×10−2), and that a higher frequency of nonsynonymous DNMs in cases was identified by PPH2 (P = 1.82×10−2) (Figure 2A, B). However, RegulomeDB detected no significantly higher frequency of functional DNMs in cases versus controls in any functional category (Figure 2C). In contrast, RBP-Var2 detected significantly more functional DNMs in cases in the categories of frameshift (P = 1.38×10−3), nonsynonymous (P = 8.79×10−15), stopgain (P = 6.42×10−4) and synonymous (P = 7.30×10−4) (Figure 2D). Then, we performed receiver operating characteristic (ROC) analysis to systematically evaluate the sensitivity and specificity of these four prediction methods. We found that the area under the curve (AUC) values of SIFT, PPH2, RBP-Var2 and RegulomeDB are 78.27%, 76.57%, 82.89% and 50.77%, respectively (Supplementary Figure 2), indicating that SIFT, PPH2 and RBP-Var2 are more sensitive and specific than RegulomeDB, with P values of 1.63×10−10, 2.40×10−8 and 2.51×10−60, respectively. In addition, the AUC value of RBP-Var2 is higher than those of SIFT and PPH2, with P values of 0.049 and 0.019, respectively. Intriguingly, RBP-Var2 detected an additional 928 piDNMs covering 665 genes that were regarded as benign DNMs by the other three methods, accounting for 25.27% of the total of 3,672 deleterious DNMs detected by all four tools (Supplementary Figure 3A, B). In particular, the non-LoF piDNMs detected by RBP-Var2 alone account for 52.8% of non-LoF piDNMs, while only 26.2% of non-LoF piDNMs were predicted to be deleterious by both SIFT and PolyPhen2 (Supplementary Figure 3C). 
The top three enriched gene ontology terms for these 665 genes were intracellular signal transduction (P = 7.41×10−6), organelle organization (P = 8.90×10−6) and mitotic cell cycle (P = 2.06×10−5) (Supplementary Figure 3D; Supplementary Table 5), suggesting that dysregulation of the cell cycle and impaired signal transduction may contribute to diverse neural damage, thereby triggering neurodevelopmental disorders27, 28. Therefore, the piDNMs detected by RBP-Var2 are distinct and may play significant roles in post-transcriptional processes during the development of neuropsychiatric disorders.
Genes hit by piDNMs are shared across four neuropsychiatric disorders
Firstly, we identified 13 recurrent piDNMs, including seven piDNMs in ASD and six piDNMs in ID (Figure 3A). Secondly, we identified 149 genes carrying at least two piDNMs across all disorders, including 128 genes in ASD, three genes in EE, ten genes in ID and eight genes in SCZ. Among these 149 genes, we identified 21 high-risk genes with P value < 1×10−2 derived from our previously published TADA program (Transmission And De novo Association)29 (Figure 3B). As our previous study using the NPdenovo database demonstrated that DNMs predicted to be deleterious at the protein level are shared by four neuropsychiatric disorders30, we then asked whether there were common piDNMs among the four neuropsychiatric disorders. By comparing the genes harboring piDNMs across the four disorders, we found 86 genes shared by at least two disorders, significantly more than expected from random overlap (permutation test, P <1.00×10−5 based on random resampling, Figure 3C, D). Similar results were observed for the overlap between the cross-disorder genes of any two or three disorders and the genes in controls, as well as for the overlapping genes between each disorder and the controls (Supplementary Figure 4B-O). In addition, the numbers of shared genes for any pairwise comparison or any three disorders are all significantly higher than randomly expected, except for the comparison of EE versus SCZ (P=0.4964) (Supplementary Figure 5). Our observations revealed the existence of common genes harboring piDNMs among these four neuropsychiatric disorders.
Genes harboring piDNMs are involved in epigenetic modification and synaptic functions
The sharing of genes among the four neuropsychiatric disorders suggests that common molecular mechanisms may underlie their pathogenesis. Thus, we performed functional enrichment analysis for these shared genes and found that they were remarkably enriched in biological processes related to chromatin modification, such as histone methylation, and in functional classifications of neuromuscular control and protein localization to synapse (Supplementary Table 5, Supplementary Figure 6). These epigenetic regulatory genes comprise CHD5, DOT1L, JARID2, MECP2, PHF19, PRDM4 and TNRC18 (Supplementary Table 6). Moreover, most of these epigenetic modification genes have been previously linked with neuropsychiatric disorders31–35. Interestingly, the shared genes link these significant pathways closely to one another, as some piDNM-containing genes play roles in more than one of these pathways (Supplementary Figure 7).
Next, to investigate the biological pathways involved in each group of disorder-specific genes with piDNMs, we carried out functional enrichment analysis with biological process terms (Supplementary Table 7-9). The top three enriched categories of ASD-specific genes were “macromolecule modification” (P = 2.90×10−13), “organelle organization” (P = 1.22×10−11) and “cell cycle” (P = 1.17×10−9). For genes specific to SCZ, the significantly enriched categories are, unsurprisingly, related to protein localization and calcium transport, which have been shown to be involved in the pathophysiology of schizophrenia36. Because of the limited number of genes, only two GO terms were enriched for EE-specific genes, namely “N-glycan processing” (P = 3.65×10−5) and “protein deglycosylation” (P = 2.17×10−4), while no terms were statistically enriched for ID-specific genes. Our observation that each group of disorder-specific genes is overrepresented in different biological pathways suggests that piDNMs may also play a role in the distinct phenotypes of the four psychiatric disorders, although the explicit underlying mechanisms need to be further explored.
Co-expression modules are convergent for cross-disorder genes hit by piDNMs
Co-expression of genes can be used to explore the common and distinct molecular mechanisms in neuropsychiatric disorders37. Thus, we performed weighted gene co-expression network analysis (WGCNA)38 for the 86 cross-disorder piDNM-containing genes based on gene expression in 16 human brain structures across 31 developmental stages from the BrainSpan developmental transcriptome (n=524)39. WGCNA identified two gene modules with distinct spatiotemporal expression patterns (Figure 4A, B; Supplementary Figure 8). The turquoise module (n=55 genes) was characterized by high expression during early fetal development (8-24 postconceptional weeks) in the majority of brain structures (Figure 4C), whereas the blue module (n=22 genes) showed low expression in early fetal development (8-38 postconceptional weeks) in the majority of brain structures (Figure 4D). It is also crucial to clarify the expression of these genes in early developmental stages, since altered epigenetic regulation in early development has been shown to be associated with neurodevelopmental disorders40. We found that most of these 86 genes are highly expressed and may be required for the normal development of the human embryo (Figure 4E, Supplementary Table 10). Our observations indicate that these cross-disorder piDNM-containing genes may play important roles not only in early brain development but also in early embryonic development.
Protein-protein interactions are intensive for cross-disorder piDNM-containing proteins
The co-expression results indicate that the proteins encoded by the 86 cross-disorder genes may have intensive protein-protein interactions (PPIs). To identify common biological processes that potentially contribute to disease pathogenesis, we investigated protein-protein interactions among these 86 cross-disorder piDNM-containing genes. Our results revealed that 56 out of 86 (65.12%) cross-disorder genes form an interconnected network at the level of direct or indirect protein-protein interactions (Figure 5A; Supplementary Table 11). Furthermore, we identified several crucial hub piDNM-containing genes in the protein-protein interaction network, such as NOTCH1, MTOR, RYR2, and GNAS (Figure 5A), which may control common biological processes among these four neuropsychiatric disorders. In addition, these 86 cross-disorder genes are enriched in nervous system phenotypes, including abnormal synaptic transmission, abnormal nervous system development, abnormal neuron morphology and abnormal brain morphology, and in behavior/neurological phenotypes such as abnormal motor coordination/balance (P <0.05, Supplementary Figure 9A). Similarly, the 56 genes in the interaction network are enriched in nervous system phenotypes, including abnormal nervous system development and abnormal brain morphology (P <0.05, Supplementary Figure 9B). By investigating the expression of these interacting genes in the human cortex of 12 ASD patients and 13 normal donors from the public datasets GSE6401841 and GSE7685242, we found that 45 (80.35%) of these PPI genes were significantly differentially expressed between ASD patients and normal controls (Student’s t-test, q <0.05, Figure 5B). 
Of these PPI genes, 40 (71.42%) were up-regulated and three (5.35%) were down-regulated in ASD patients compared with normal controls (Supplementary Table 12), indicating that the majority of these PPI genes were abnormally expressed in ASD patients.
Regulatory networks between RBPs and targeting genes are potentially disrupted by piDNMs
Dysregulation or mutations of RBPs can cause a range of developmental and neurological diseases43, 44. Meanwhile, mutations in the RNA targets of RBPs could disturb the interactions between RBPs and their mRNA targets and affect mRNA metabolism and protein homeostasis in neurons during the progression of neuropathological disorders45–47. Hence, we constructed a regulatory network between piDNMs and RBPs based on predicted binding sites of RBPs to investigate the genetic perturbations of mRNA-RBP interactions in the four disorders (Figure 6). We identified several crucial RBP hubs that may contribute to the pathogenesis of the four neuropsychiatric disorders, including EIF4A3, FMR1, PTBP1, AGO1/2, ELAVL1, IGF2BP1/3 and WDR33. Genes with piDNMs in different disorders could be regulated by the same RBP hub, while one candidate gene may be regulated by different RBP hubs (Figure 6). In addition, all of these RBP hubs were highly expressed in early fetal development stages (8-37 postconceptional weeks) based on the BrainSpan developmental transcriptome (Supplementary Figure 10), suggesting their essential roles in the early stages of brain development.
Discussion
In contrast with the recognized role of LoF DNMs in conferring risk for neuropsychiatric disorders, the effect of DNMs on post-transcriptional regulation in the pathogenesis of these disorders remains unknown. In this study, we systematically analyzed the damaging effect of DNMs on post-transcriptional regulation in four neuropsychiatric disorders and observed a higher prevalence of piDNMs in probands than in controls in each of the four disorders.
To date, it has been a challenge to estimate the functions of synonymous and UTR mutations, although such mutations have been widely acknowledged to alter protein expression, conformation and function48. We applied the RBP-Var2 algorithm to annotate and interpret de novo variants in subjects with four neuropsychiatric disorders based on their impact on RNA secondary structure and on the binding of miRNAs and RBPs. Compared with other prediction algorithms such as SIFT, PPH2 and RegulomeDB, RBP-Var2 has the highest accuracy (AUC: 82.89%) in differentiating affected from control subjects. RBP-Var2 identified 399 synonymous DNMs and 25 UTR DNMs that were predicted to be extremely harmful to post-transcriptional regulation. Consistent with a previous study49, damaging synonymous DNMs were significantly more frequent in probands than in controls. Meanwhile, de novo insertions and deletions (InDels), especially frameshift InDels, are generally assumed to be deleterious. Indeed, de novo frameshift InDels are more frequent in neuropsychiatric disorders than non-frameshift InDels50, a pattern captured by the predictions of RBP-Var2 but not by those of SIFT or PPH2. Therefore, RBP-Var2 holds great promise for exploring the effect of mutations on post-transcriptional regulation, and deciphering multiple biological layers of deleteriousness may improve the accuracy of predicting disease-related genetic variations.
Most interestingly, we found that some epigenetic pathways are enriched among these piDNM-containing genes, such as those involved in regulation of gene expression and histone modification. This finding is consistent with a previous report in which more than 68% of ASD cases shared common acetylome aberrations at >5,000 cis-regulatory regions in the prefrontal and temporal cortex51. Such common “epimutations” may be induced either by perturbations of epigenetic regulation, including post-transcriptional regulation, due to mutations of substrates, or by disruptions of epigenetic modifications resulting from mutations of epigenetic genes. Our observations revealed an association between “epimutations” and dysregulation of post-transcriptional processes. This hypothesis is consistent with the observation that several recurrent piDNM-containing genes are non-epigenetic genes, including SYNGAP1, ADNP, POGZ and ANK2. Moreover, we discovered several recurrent epigenetic genes that contain piDNMs, including CHD8, EP300, KMT2A, KMT2C, KDM3B, JARID2 and MECP2, which may play important roles in genome-wide aberrations of epigenetic landscapes through disruption of post-transcriptional regulation. Furthermore, WGCNA based on the BrainSpan developmental transcriptome revealed that the major hubs of the co-expression network for these 86 piDNM-containing genes were histone modifiers. These data indicate that piDNM-containing genes are co-expressed with genes frequently involved in epigenetic regulation of common cellular and biological processes in neuropsychiatric disorders. Importantly, these 86 piDNM-containing genes show intensive physical protein-protein interactions and share regulatory networks between piDNMs and RBPs in four neuropsychiatric disorders. We identified several RBP hubs of regulatory networks between piDNM-containing genes and RBPs, including EIF4A3, FMRP, PTBP1, AGO1/2, ELAVL1, IGF2BP1/3, WDR33 and FXR2. 
For example, FMRP is encoded by the well-known pathogenic gene of Fragile X syndrome, which co-occurs with autism in many cases, and its targets are highly enriched for DNMs in ASD52. Our results suggest that, like mutations in RBP hubs, mutations in RBP-targeted genes may synergistically contribute to the pathogenesis of multiple neuropsychiatric disorders by disrupting their interactions with multiple RBPs.
Alterations in expression or mutations in either RBPs or their binding sites in target transcripts have been reported to cause several human diseases such as muscular atrophies, neurological disorders and cancer53. Although we identified 1,736 piDNMs associated with neuropsychiatric disorders, the cause and explicit effects of these piDNMs in these disorders need to be further validated and explored. In this study, our method sheds light on evaluation of post-transcriptional impact of genetic mutations especially for synonymous mutations. Additionally, as small molecules can be rapidly designed to selectively target RNAs and affect RNA-RBP interactions54, our study provides new insights into RNA-based therapeutic strategies for the treatment of neuropsychiatric disorders.
Materials and methods
Data collection and filtration
For this study, 7,748 trios or quartets were collected from previous whole exome sequencing (WES) studies (ref), comprising 5,677 parent-proband trios associated with four neuropsychiatric disorders and 2,071 control trios (Supplementary Table 1). After removing DNMs that overlapped between probands and controls, a total of 6,996 DNMs in probands and 2,523 DNMs in controls were retained for subsequent analysis.
RBP-Var2 algorithm
To better interpret the catalog of DNMs, we developed a new heuristic scoring system that ranks variants by functional confidence based on experimental data from GWAS, eQTL, CLIP-seq derived RBP binding sites, RNA editing and miRNA targets, together with machine learning algorithms. The score reflects increasing confidence as a variant lies in more functional elements14. For example, we consider variants that are known eQTLs as significant and label them as category 1. Within category 1, subcategories indicate additional annotations ranging from the most informative variants (1a, variant may change the motif for RBP binding) to the least informative variants (1e, variant only has a motif for RBP binding). For the computational predictions, we employed LS-GKM55 (10-mer) and deltaSVM56 to predict the impact of DNMs on the binding of specific RBPs by calculating delta SVM scores. Moreover, for single-base mutations, we employed RNAsnp57 with default parameters to estimate the effects of mutations on local RNA secondary structure and calculated empirical P values based on the base pair probabilities of the wild-type and mutant RNA sequences. For insertions and deletions, we evaluated the effects of DNMs on RNA secondary structure using the minimum free energy generated by RNAfold58 and calculated empirical P values based on cumulative probabilities of the Poisson distribution. A functional DNM was determined to be a piDNM only if it produced a change of >5 in gkm-SVM scores for the effect on RBP binding, and a P value < 0.1 or a free energy change >1 for the effect on RNA secondary structure. Only DNMs occurring in exonic or UTR regions were included in our analysis.
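The threshold rule described above can be sketched as a small predicate. This is an illustration only and not the RBP-Var2 implementation: the record type `DnmScores`, its field names, and the function `passes_pidnm_thresholds` are hypothetical, and the sketch assumes that the rule reads as "binding change AND (structure P value OR free energy change)" and that "change" means absolute change.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DnmScores:
    """Hypothetical container for per-variant scores (field names are assumptions)."""
    delta_svm: float                     # change in gkm-SVM score for RBP binding
    structure_p: Optional[float] = None  # empirical P value for structure change
    delta_mfe: Optional[float] = None    # change in minimum free energy


def passes_pidnm_thresholds(s: DnmScores,
                            svm_cutoff: float = 5.0,
                            p_cutoff: float = 0.1,
                            mfe_cutoff: float = 1.0) -> bool:
    """Apply the stated cutoffs, assuming absolute changes are meant."""
    binding_hit = abs(s.delta_svm) > svm_cutoff
    p_hit = s.structure_p is not None and s.structure_p < p_cutoff
    mfe_hit = s.delta_mfe is not None and abs(s.delta_mfe) > mfe_cutoff
    return binding_hit and (p_hit or mfe_hit)
```

A variant with a large binding change but no structural signal fails the rule under this reading, which is the main design consequence of combining the two criteria with a logical AND.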
Identification of piDNMs and comparison with variants predicted by other methods
To determine the likelihood that a mutation is functional in post-transcriptional regulation, our newly updated program RBP-Var2 was used to assign an exclusive rank to each SNV and InDel, and only mutations categorized into rank 1 or 2 were considered piDNMs. For comparison with mutations involved in the disruption of gene function or transcriptional regulation, the programs SIFT, PolyPhen2 and RegulomeDB were used to analyze the same dataset of DNMs that served as the input for RBP-Var2. We kept only the mutations qualified as “damaging” by SIFT and as “possibly damaging” or “probably damaging” by PolyPhen2. In the case of RegulomeDB, mutations labeled as category 1 and 2 were retained. Next, we classified the mutations by type (frameshift, nonframeshift, nonsynonymous, synonymous, splicing and stop) and by region (UTR3, UTR5, exonic, ncRNA exonic and splicing) to determine the distribution of piDNMs, genetic variants and other regulatory variants. The numbers of variants in cases versus controls were illustrated by bar charts (***: P< 0.001, **: 0.001 <P< 0.01, *: 0.01 <P< 0.05, binomial test).
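As an illustration of the case-versus-control comparison, the following minimal sketch computes a one-sided exact binomial tail probability in log space and maps a P value to the star notation used in the bar charts. The function names are hypothetical, and the choice of null proportion (for example, the fraction of all DNMs that come from probands) is an assumption, since the text does not state it.

```python
import math


def log_binom_pmf(k: int, n: int, p: float) -> float:
    """Log of the Binomial(n, p) probability mass at k, for 0 < p < 1."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))


def binom_upper_tail(k: int, n: int, p: float) -> float:
    """One-sided P value P(X >= k) for X ~ Binomial(n, p), via log-sum-exp."""
    logs = [log_binom_pmf(i, n, p) for i in range(k, n + 1)]
    m = max(logs)
    return min(1.0, math.exp(m) * sum(math.exp(x - m) for x in logs))


def significance_stars(p_value: float) -> str:
    """Map a P value to the star notation of the figure legend."""
    if p_value < 0.001:
        return "***"
    if p_value < 0.01:
        return "**"
    if p_value < 0.05:
        return "*"
    return ""
```

Here `k` would be the number of variants of a given class seen in cases, `n` the number seen in cases and controls combined, and `p` the assumed null proportion attributable to cases.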
TADA analysis of DNMs in four disorders
Our previously published TADA program29, which predicts risk genes accurately on the basis of allele frequencies, gene-specific penetrance, and mutation rate, was used with default parameters to calculate the P value for the likelihood of each gene contributing to all four disorders.
ROC curves and specificity/sensitivity estimation
We screened a positive (non-neutral) test set of likely causal mutations in Mendelian disease from the ClinVar database (v20170130). From a total of 237,308 mutations in the ClinVar database, we selected 145 exonic mutations present in our curated DNMs in probands. Our negative (neutral) set of likely non-causal variants was built from DNMs of unaffected controls in four neuropsychiatric disorders. To exclude rare deleterious DNMs, we selected only DNMs in controls with a minor allele frequency of at least 0.01 in the 1000 Genomes Project (1000g2014oct), and obtained a set of 921 exonic variants. Then, we employed the R package pROC to analyze and compare ROC curves.
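For readers unfamiliar with how an AUC is obtained from a positive and a negative set, the sketch below computes it directly from its rank-based definition (the probability that a randomly chosen positive scores higher than a randomly chosen negative, counting ties as one half). It is not the pROC code path used in the analysis; the function name `roc_auc` and the convention that higher scores mean more deleterious are assumptions for illustration.

```python
from typing import Sequence


def roc_auc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties contribute 0.5. Assumes higher scores indicate a more
    deleterious prediction.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("both score sets must be non-empty")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))
```

With 145 positives and 921 negatives, the pairwise loop covers roughly 134,000 comparisons, so the quadratic form is adequate here; a rank-sum formulation would be preferable for much larger sets.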
Permutation analysis for overlaps of genes with piDNMs
To evaluate the overlap between any two sets of genes with piDNMs, we performed a permutation test with 100,000 permutations. In each permutation, we randomly selected, for each disorder, the same number of genes as observed from all RefSeq genes, taking the gene-level de novo mutation rate into account. P values were calculated as the proportion of permutations in which the simulated overlap was greater than or equal to the observed overlap.
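A minimal sketch of such a permutation test for two gene sets is given below. The function names are hypothetical, and the way mutation rates enter the sampling (weighted sampling without replacement using the Efraimidis-Spirakis key method) is an assumption, since the exact sampling scheme is not specified in the text.

```python
import heapq
import random
from typing import Sequence, Set


def weighted_sample(rng: random.Random, genes: Sequence[str],
                    weights: Sequence[float], k: int) -> Set[str]:
    """Draw k distinct genes with probability increasing in weight.

    Uses keys u ** (1 / w) and keeps the k largest; weights must be > 0.
    """
    keyed = ((rng.random() ** (1.0 / w), g) for g, w in zip(genes, weights))
    return {g for _, g in heapq.nlargest(k, keyed)}


def overlap_permutation_p(observed_overlap: int, size_a: int, size_b: int,
                          genes: Sequence[str], weights: Sequence[float],
                          n_perm: int = 100_000, seed: int = 0) -> float:
    """Proportion of permutations with simulated overlap >= observed overlap."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        a = weighted_sample(rng, genes, weights, size_a)
        b = weighted_sample(rng, genes, weights, size_b)
        if len(a & b) >= observed_overlap:
            hits += 1
    return hits / n_perm
```

In this sketch `genes` stands for the RefSeq gene list and `weights` for the gene-level de novo mutation rates. When no permutation reaches the observed overlap, the estimate is 0 and is usually reported as an upper bound of 1 divided by the number of permutations.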
Functional enrichment analysis
A gene harboring piDNMs was included in our candidate gene set for functional enrichment analysis if it occurred in at least two of the four disorders. GO (Gene Ontology) and KEGG (Kyoto Encyclopedia of Genes and Genomes) pathway enrichment analyses were implemented with the Cytoscape (version 3.4.0) plugin ClueGO (version 2.3.0) using genome-wide coding genes as background. P values calculated by the hypergeometric test were corrected to q values by the Benjamini-Hochberg procedure to control the false discovery rate resulting from multiple hypothesis testing.
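The two statistical steps named above, a hypergeometric over-representation test followed by Benjamini-Hochberg adjustment, can be sketched as follows. This is a generic illustration rather than the ClueGO implementation, and the function and parameter names are hypothetical.

```python
import math
from typing import List, Sequence


def hypergeom_upper_tail(k: int, K: int, n: int, N: int) -> float:
    """P(X >= k) when n genes are drawn from N background genes,
    K of which carry the annotation."""
    total = math.comb(N, n)
    top = min(K, n)
    favorable = sum(math.comb(K, i) * math.comb(N - K, n - i)
                    for i in range(k, top + 1))
    return favorable / total


def benjamini_hochberg(pvals: Sequence[float]) -> List[float]:
    """Return BH-adjusted q values in the same order as the input P values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    qvals = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotone q values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        qvals[i] = running_min
    return qvals
```

For a given term, `k` is the number of candidate genes annotated to the term, `n` the size of the candidate set, `K` the number of background genes annotated to the term, and `N` the size of the background.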
Co-expression and spatiotemporal specificity
Normalized gene expression data for 16 human brain regions, determined by RNA sequencing, were obtained from the BrainSpan database (http://www.brainspan.org). We extracted expression data for 77 of the 86 extremely damaging cross-disorder genes and employed the R package WGCNA (weighted correlation network analysis) with a power of five to cluster the spatiotemporal expression patterns and prenatal laminar expression profiles. The expression level for each gene and developmental stage (only stages with expression data for all 16 structures were selected, n = 14) was presented across all brain regions.
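To make the role of the soft-thresholding power concrete, the sketch below computes a WGCNA-style adjacency between two expression profiles as the absolute Pearson correlation raised to the power of five. It is a pure-Python illustration rather than the WGCNA package itself, and the unsigned network type is an assumption because the text does not state which type was used.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length, non-constant profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have equal length >= 2")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def soft_threshold_adjacency(x: Sequence[float], y: Sequence[float],
                             power: int = 5) -> float:
    """Unsigned adjacency |cor(x, y)| ** power (network type is assumed)."""
    return abs(pearson(x, y)) ** power
```

Raising the correlation to a power suppresses weak correlations much more than strong ones: a correlation of 0.5 becomes an adjacency of about 0.03, while a correlation of 0.9 remains about 0.59.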
Protein-protein interaction network of cross-disorder genes
Protein-protein interaction data for Homo sapiens were collected from the STRING (v10.5) database with a score over 0.8. For the PPI network of all cross-disorder genes, we retained only the proteins with at least two links. Nodes with a degree over 30 in the network were considered hubs. Cytoscape (version 3.4.0) was used to analyze and visualize the protein-protein interaction networks. Overrepresentation of mouse-mutant phenotypes was evaluated with the web tool MamPhea for the genes in the PPI network and for all cross-disorder genes containing piDNMs. The rest of the genome was used as background, and multiple test adjustment of P values was done by the Benjamini-Hochberg method.
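The filtering steps above can be sketched on a plain edge list. The sketch is illustrative: the function names and the edge tuple layout are hypothetical, scores are assumed to be on a 0 to 1 scale, and the "at least two links" filter is applied in a single pass, which is an assumption because the text does not say whether the pruning was repeated.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def build_ppi_network(edges: Iterable[Tuple[str, str, float]],
                      genes: Set[str],
                      min_score: float = 0.8,
                      min_links: int = 2) -> Dict[str, Set[str]]:
    """Keep confident edges among the given genes, then keep proteins
    with at least min_links partners (single pass)."""
    neighbors: Dict[str, Set[str]] = defaultdict(set)
    for a, b, score in edges:
        if score > min_score and a in genes and b in genes and a != b:
            neighbors[a].add(b)
            neighbors[b].add(a)
    kept = {g for g, nb in neighbors.items() if len(nb) >= min_links}
    return {g: neighbors[g] & kept for g in kept}


def find_hubs(network: Dict[str, Set[str]], min_degree: int = 30) -> List[str]:
    """Nodes whose degree exceeds min_degree, sorted by name."""
    return sorted(g for g, nb in network.items() if len(nb) > min_degree)
```

Because neighbors outside the retained set are dropped after the degree filter, a retained protein can end up with fewer than two partners in the final network; repeating the filter until it stabilizes would remove such nodes.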
Gene-RBP interaction network
Cytoscape (version 3.4.0) was utilized for visualization of the associations between genes harboring piDNMs in the four neuropsychiatric disorders and the corresponding regulatory RBPs.
The available data resources
To make our findings easily accessible to the research community, we have developed the RBP-Var2 platform (http://www.rbp-var.biols.ac.cn/) for the storage and retrieval of piDNMs and candidate genes, and for exploring the genetic etiology of neuropsychiatric disorders at the level of post-transcriptional regulation. The expression and epigenetic profiles of genes related to regulatory de novo mutations and early embryonic development have been deposited in our previously published database EpiDenovo (http://www.epidenovo.biols.ac.cn/)40.
URLs
RBP-Var2, http://www.rbp-var.biols.ac.cn/; NPdenovo, http://www.wzgenomics.cn/NPdenovo/index.php; EpiDenovo, http://www.epidenovo.biols.ac.cn/; BioGRID, https://thebiogrid.org/; MamPhea, http://evol.nhri.org.tw/phenome/index.jsp?platform; BrainSpan, http://www.brainspan.org; ClinVar, https://www.ncbi.nlm.nih.gov/clinvar/; 1000Genomes, http://www.internationalgenome.org/; WGCNA, https://labs.genetics.ucla.edu/horvath/CoexpressionNetwork/Rpackages/WGCNA/; esyN, http://www.esyn.org/; Cytoscape, http://www.cytoscape.org/; TADA, http://wpicr.wpic.pitt.edu/WPICCompGen/TADA/TADA_homepage.htm; ClueGO, http://apps.cytoscape.org/apps/cluego; pROC, http://web.expasy.org/pROC/; R, https://www.r-project.org/; Perl, https://www.perl.org/.
Contributions
F.M. and L.W. participated in the design and execution of analyses, produced the figures, participated in the interpretation of results and edited the manuscript. F.M. developed the computational code employed in the analyses. L.W. and Z.L. developed the statistical framework and drew the figures. X.Z. participated in the interpretation of results and the oversight of analyses. L.X. developed and improved the online platform of RBP-Var2. H.T. and R.C.R. provided professional guidance in the writing and refining of the manuscript. J.L. and H.T. collected the DNMs from the literature and databases. Z.S.S. and X.H. conceived the study, participated in the design of analyses, oversaw the study and the interpretation of results, and drafted and edited the manuscript.
Competing interests
The authors declare no competing financial interests.
Supplementary Figure 1. Excess of piDNMs in probands. The odds ratios of synonymous DNMs and piDNMs were analyzed. The dominance of filtered piDNMs that did not contain LoF mutations is also displayed.
Supplementary Figure 2. ROC curve showing the performance of the predictions of SIFT, PPH2, RBP-Var2 and RegulomeDB.
Supplementary Figure 3. Overlap of DNMs identified by different tools. (A) Venn diagram depicting the overlap between the DNMs predicted by SIFT, PPH2, RBP-Var2 and RegulomeDB. (B) Venn diagram depicting the overlap between the genes predicted by SIFT, PPH2, RBP-Var2 and RegulomeDB. (C) Pie chart showing the distribution of all non-LoF piDNMs. The non-LoF piDNMs detected by RBP-Var2 alone account for 52.8% of all non-LoF piDNMs (pink), while the non-LoF deleterious DNMs identified by both SIFT and Polyphen2 account for 26.2% (light purple). (D) Pathway enrichment analysis of the 665 genes unique to the prediction of RBP-Var2.
Supplementary Figure 4. Permutation test of the randomness of the overlap of different sets of disease genes with control. (A-K) Permutation test for the validity of the gene overlap between the cross-disorder genes and the control. (L-O) Permutation test for the overlap of genes from each disorder with control. We shuffled the genes of each disorder, calculated the number of shared genes between each pair, and repeated this procedure 100,000 times to obtain the null distribution. The vertical dashed line indicates the observed value.
Supplementary Figure 5. Test of the significance of the number of cross-disorder genes involved in the four neuropsychiatric disorders. (A-J) Permutation test for the validity of the gene overlap among/between every combination of three/two disorders.
Supplementary Figure 6. Pie chart of the pathway enrichment analysis for the 86 cross-disorder genes.
Supplementary Figure 7. Interaction network of the gene enrichment analysis for the 86 cross-disorder genes.
Supplementary Figure 8. Relationship between co-expression modules. (A) MDS plot of genes in the turquoise module and the blue module. (B) Relationship between module eigengenes. (C) Clustering tree based on the module eigengenes. (D) Heatmap of eigengene adjacency.
Supplementary Figure 9. Mammalian phenotype enrichment analysis of selected genes. (A) Mammalian phenotype enrichment of the 86 cross-disorder genes containing piDNMs. (B) Mammalian phenotype enrichment of the 56 genes in the interaction network.
Supplementary Figure 10. Heat map of the expression of the crucial RBP hub genes during the early fetal development stages.
Acknowledgments
The project was funded by the National Key R&D Program of China (No. 2016YFC0900400). We thank Kun Zhang for his help with the TADA analysis and Leisheng Shi for his help with data analysis.