ABSTRACT
Motor stereotypies are common in children with autism spectrum disorder (ASD), intellectual disability, or sensory deprivation, as well as in typically developing children (“primary” stereotypies, CMS). The precise pathophysiological mechanism for motor stereotypies is unknown, although genetic etiologies have been suggested. In this study, we perform whole-exome DNA sequencing in 129 parent-child trios with CMS and 853 control trios (118 cases and 750 controls after quality control). We report an increased rate of de novo predicted-damaging variants in CMS versus controls, identifying KDM5B as a high-confidence risk gene and estimating 184 genes conferring risk. Genes harboring de novo damaging variants in CMS probands show significant overlap with those in Tourette syndrome, ASD candidate genes, and those in ASD probands with high stereotypy scores. Furthermore, exploratory biological pathway and gene ontology analysis highlight histone demethylation, organism development, cell motility, glucocorticoid receptor pathway, and ion channel transport. Continued sequencing of CMS trios will identify more risk genes and allow greater insights into biological mechanisms of stereotypies across diagnostic boundaries.
INTRODUCTION
Motor stereotypies are rhythmic, repetitive, prolonged, fixed, patterned, non-goal directed movements that are often bilateral and temporarily stop with distraction. Complex motor stereotypies (CMS) include hand flapping, finger wiggling, head nodding, and rocking; these are often accompanied by mouth opening, head posturing, jumping, pacing and occasional vocalizations (1). Movements occur for up to minutes in duration, multiple times per day, and tend to be exacerbated by excitement, fatigue, stress, boredom, or being engrossed in an activity. CMS are common in children with autism spectrum disorder (ASD), intellectual disability, or sensory deprivation, as well as in typically developing children. A favored classification subdivides by etiology into primary (otherwise typically developing) and secondary categories. In both groups, stereotypies often result in social stigmatization, classroom disruption, and interference with academic activities.
In children with ASD, stereotypic behaviors (“secondary” stereotypies) occur in about 44% of patients and are recognized as a core phenotype of the disorder (2). The severity and frequency of motor stereotypies is correlated with severity of illness, degree of intellectual disability, and impairments in adaptive functioning and symbolic play (3–9). They are often associated with self-injurious behaviors (10, 11). A wide range of medications have been tried for treatment of stereotypies in ASD, but efficacy is inconsistent and inadequate, with potential for long-term side effects (12).
Motor stereotypies also occur in otherwise typically developing children (“primary” stereotypies) (13–22). Studies comparing primary and secondary stereotypies show that there is considerable similarity in their phenomenology (23–25). Primary CMS has a typical age of onset before 3 years, and greater than 90% of children continue to experience CMS into adolescence and adulthood (16, 26). The prevalence of primary CMS is estimated to be 3-4% of children in the U.S. (26, 27). Similar to secondary stereotypies, medications are generally regarded as ineffective for primary CMS (13, 28), but there is evidence to support the benefits of cognitive behavioral therapy (29, 30).
The precise pathophysiological mechanism for motor stereotypies remains obscure (31), though investigators have hypothesized abnormalities within cortico-striatal-thalamo-cortical pathways (32–38) and several neurotransmitter systems (33, 39–41). A genetic etiology for stereotypies has been suggested in primary and secondary categories, although the specific gene(s) contributing to this movement disorder remain unclear. With respect to secondary stereotypies in ASD, family studies have demonstrated that these repetitive behaviors are highly heritable, with a genetic etiology that is likely independent from other core diagnostic features (42). While there are no studies of recurrence risk or twin concordance reported for primary CMS, a positive family history is reported in 25-40%, while remaining cases appear to be sporadic (16, 28, 43).
Considering these findings and with the recent success of risk gene discovery in other neurodevelopmental disorders via detection of de novo, or spontaneous, germline DNA mutations (44–46), we conducted the first pilot genetic study of CMS in 129 typically-developing children and their parents. We hypothesized that primary CMS may represent a more genetically homogenous group of individuals versus those with secondary stereotypies, thereby facilitating genetic discovery and insight into the biology of stereotypies more generally (49, 50). Using whole-exome DNA sequencing, we identified an enrichment of de novo predicted-damaging mutations and identified one high-confidence risk gene, Lysine Demethylase 5B (KDM5B) in our cohort. By further analysis of de novo damaging mutations in primary CMS, we predict that there are approximately 184 CMS risk genes and that sequencing more CMS parent-child trios is a definite path toward discovering these genes. In this pilot study, we see an overlap between genes harboring de novo damaging mutations in CMS and those in Tourette syndrome. Furthermore, owing to the two de novo damaging KDM5B mutations in our CMS cohort, there is significant overlap with ASD probands with highest stereotypy scores, but not those with low scores. Finally, exploratory systems analyses of genes harboring de novo damaging mutations in CMS show enrichment in histone demethylation, multicellular organism development, and cell motility.
MATERIALS AND METHODS
Figure 1 provides an overview of the study methods.
We performed whole exome sequencing on 129 CMS and 853 control parent-child trios. After quality control, 118 CMS and 750 control trios remained for subsequent analyses. We performed a burden analysis, comparing the rates of de novo single nucleotide (SNVs) and insertion-deletion (indel) variants between cases and controls. Next, we assessed the significance of gene-level recurrence of de novo damaging variants in our CMS group, identifying one high-confidence risk gene. Using the MLE method, variant simulations, and TADA, we estimated the number of genes contributing to CMS risk and used this estimate to predict the number of risk genes that will be discovered as more CMS trios are sequenced. Finally, exploratory gene enrichment analyses were performed, assessing degree of overlap with gene sets from other disorders, canonical pathways, gene ontologies, and expression pattern clustering within certain brain regions across development.
Subjects and Assessment Measures
This protocol was approved by the Johns Hopkins Medicine Institutional Review Board. Children with primary complex motor stereotypies (CMS) were recruited from either the Johns Hopkins Pediatric Neurology Movement Disorder Outpatient Clinic (HSS, Director), or via email (singerlab{at}jhmi.edu). All participants verbally consented and signed parental consent was obtained. Using standardized forms via telephone, the study coordinator completed a brief screening general history, obtained baseline data about each child’s stereotypies, and completed an Autism Spectrum Screening Questionnaire (ASSQ). The presence of stereotypic movements was confirmed, either via direct observation in clinic or by video review (HSS). If the subject passed the screening assessment, additional data was collected on the child and both parents via RedCap, an electronic web-based application for data capture and online questionnaires. The latter included the Stereotypy Severity Scale (Motor and Impairment scores) and comorbidity measures (Multidimensional Anxiety Scale for Children—MASC; ADHD-Rating Scale IV; Conner’s Parent Rating Scale—CPRS; Repetitive Behavior Scale-Revised—RBS-R; Children’s Yale-Brown Obsessive-Compulsive Scale—CYBOCS; and Social Responsiveness Scale—SRS) (see Supplementary Methods).
For this pilot study, we prioritized the study of “simplex” CMS (children without known family history of affected first or second-degree relatives) to increase the likelihood of detecting de novo sequence and structural variants. Eligibility required participants to have: (a) confirmed complex motor stereotypies; (b) onset before age 3 years; (c) temporary suspension of movements by an external stimulus or distraction. Exclusion criteria included: (a) a total score >13 on the ASSQ or a prior autism spectrum disorder diagnosis; (b) evidence of intellectual disability; (c) seizures or a known neurological disorder; and (d) the presence of motor/vocal tics. The presence of inattentiveness, hyperactivity, or impulsivity (i.e., ADHD symptoms) and/or obsessive-compulsive behaviors were not exclusionary.
DNA whole-exome sequencing (WES)
DNA was collected from all children meeting eligibility criteria and from their parents, using the Oragene OG-500 collection kit and standard extraction protocols (DNA Genotek, Ottowa, Ontario, Canada). Exome capture and sequencing were performed at the Yale Center for Genome Analysis (YCGA), using the NimbleGen SeqCap EZExomeV2 capture library (Roche NimbleGen, Madison, WI, USA) and the Illumina HiSeq 2500 platform (Illumina, San Diego, CA, USA). WES data from 853 unaffected parent-child trios (2,559 samples total) were obtained from the Simons Simplex Collection via the NIH Data Archive (https://ndar.nih.gov/edit_collection.html?id=2042). These children and their parents have no evidence of autism spectrum or other neurodevelopmental disorders (47). The same exome capture and sequencing platforms were used for these control samples.
Sequence alignment, variant calling, and quality control
Alignment and variant calling of the sequencing reads followed the latest Genome Analysis Toolkit (GATK) (48) Best Practices guidelines, as described previously (49). Variants were annotated using RefSeq hg19 gene definitions using ANNOVAR (50). Trios were omitted from downstream analyses if (a) genetic markers were not consistent with expected family relationships; (b) an excessive number of de novo variants were observed, or (c) if they were outliers in principal components analysis (see Supplementary Methods). De novo variants were called as previously described (49) and as detailed in Supplementary Methods.
Mutation rate and gene recurrence
Within each cohort, we calculated the rate of de novo mutations per base pair, using methods previously described (49). We included only those de novo variants present with a frequency of <0.001 (0.1%) in the ExAC v0.3.1 database (51) and compared de novo mutation rates in cases versus controls using a one-tailed rate ratio test (Supplementary Methods).
As described in our previous WES studies (45, 46, 52), we used the Transmitted And De novo Association (TADA-Denovo) test as a statistical method for risk gene discovery based on gene-level recurrence of de novo mutations within the classes of variants that we found enriched in CMS (53, 54). This test generates random mutational data based on each gene’s specified mutation rate to determine null distributions, then calculates a p-value and a false discovery rate (FDR) q-value for each gene using a Bayesian “direct posterior approach.” A low q-value represents strong evidence for CMS association. See Supplementary Methods for details.
Estimating the number of CMS risk genes
As described previously (45, 52), we used a maximum likelihood estimation (MLE) method (55) to estimate the number of genes contributing risk to CMS, based on the observed number of de novo damaging variants in our dataset. See Supplementary Methods for details of these calculations.
Next, we used previously described methods (45, 52) to predict the likely number of risk genes that will be discovered as additional CMS parent-child trios are sequenced by WES.
These predictions utilize the estimated number of CMS risk genes along with CMS de novo mutation rates observed in our study to perform mutation simulations, followed by TADA-Denovo testing (see Supplementary Methods).
Gene set overlap
We used DNENRICH (56) (https://psychgen.u.hpc.mssm.edu/dnenrich/) to test whether genes harboring de novo damaging mutations in our CMS subjects were significantly enriched among several gene lists from the literature related to neuropsychiatric disorders, including autism (ASD), schizophrenia (SCZ), Tourette’s disorder (TD), obsessive-compulsive disorder (OCD), attention-deficit/hyperactivity disorder (ADHD), and intellectual disability (ID). Additionally, we were interested in the question of whether our CMS cohort share genes harboring de novo damaging mutations with ASD probands having high stereotypy scores. To approach this question, we assembled lists of genes harboring damaging de novo mutations in ASD probands from the Simons Simplex Collection (SSC) for whom stereotyped behavior scores (Stereotyped Behavior Score from the RBS-R, Repetitive Behavior Scale-Revised) were available. We looked for overlap between our CMS cohort and those SSC ASD probands with stereotypy scores in the 90th percentile (high stereotypies) and those scoring in the 10th percentile (low stereotypies). These gene lists are compiled in Table S4. Further details about gene list curation and DNENRICH methods can be found in Supplementary Methods.
Exploratory pathway, gene ontology, and spatiotemporal analyses
To determine whether genes harboring de novo damaging variants in CMS may perform similar biological functions, we used the list of CMS genes harboring de novo damaging mutations to identify enriched pathways and gene ontologies using ConsensusPathDB (Release 34, 15.01.2019). This tool integrates human protein and genetic interaction networks from 32 databases and interactions curated from the literature (57).
Finally, using this same list of genes, we searched for possible enrichment of gene expression within certain brain regions across multiple developmental time periods, using data from the Brainspan Atlas of the Developing Human Brain (58, 59). See Supplementary Methods.
RESULTS
We performed WES on 129 CMS parent-child trios (387 samples total) meeting inclusion criteria. WES data from 853 unaffected control trios, already sequenced from the Simons Simplex Collection, were pooled with our CMS trios for joint variant calling. After quality control methods, our sample size for a burden analysis was 118 CMS and 750 unaffected trios (Table 1, Figure 1, Table S1, Figure S1).
Scree plots following Principal Components Analysis (PCA), showing (A) the percentage of variance captured by each of the first 32 principal components, and (B) the cumulative percentage of variance captured by these same components in the exome metrics data from cases and controls. The “elbow” of the scree plot is visualized to be around the 5th principal component. This was confirmed by the Factominer R code function “estim_ncp()”. The first 5 PCs capture over 80% of the variance, and this number of PCs was used to determine PCA outliers during quality control (see Table S1 and Supplementary Methods). (C) Individual plots for the first two principal components, based on PCA of exome sequencing quality metrics. CMS cases are plotted in red, and controls in blue. The first two PCs together capture 56.3% of the variance. R code to generate this data and figure are in Supplementary Methods, and individual PC factor values are in Table S1. This figure includes PCA outliers (>3 standard deviations from the mean in PCs 1-5), which were removed during quality control, prior to further analysis of case-control data.
– Distribution of de novo variants in CMS cases and controls
Increased burden of de novo damaging variants in CMS
Based on work in other neurodevelopmental disorders, we expected to find an enrichment of de novo LGD variants (stop codon, frameshift, or canonical splice-site variants) in CMS probands versus controls. We found a statistically significant increased rate of de novo LGD variants in CMS cases, confirming our hypothesis (rate ratio [RR] 1.95, 95% Confidence Interval [CI] 1.04-3.50, p=0.04). Furthermore, de novo variants predicted to be damaging (LGD plus missense variants with Polyphen2-HDIV score <0.957 and ≥0.453) were also over-represented in CMS probands (RR 1.37, CI 1.05-1.76, p=0.03). We did not detect a difference in mutation rates for de novo synonymous variants, or when all de novo variants (coding +/- non-coding) were considered together. (Table 1, Figure 2, Table S2)
Bar chart comparing the rates of de novo variant classes between CMS cases (red) and controls (blue). Comparisons are between per base pair (bp) mutation rates, using a one-tailed rate ratio test. Statistically significant comparisons (p<0.05) are marked with asterisks. Error bars show 95% confidence intervals.
KDM5B is a high-confidence candidate risk gene in CMS
Having established a higher rate of de novo damaging variants in CMS probands, we next asked whether these variants cluster within specific genes. We identified one gene with more than one predicted damaging de novo variant in unrelated probands: KDM5B (Lysine Demethylase 5B) harbored two different LGD (stopgain) de novo variants in CMS probands 1029-03 and 1050-03. Using TADA-Denovo (53) and previously established false discovery rate (FDR) thresholds, we found that KDM5B meets statistical criteria for a high-confidence risk gene (q<0.1) in CMS (Table S3).
Approximately 184 genes contribute to CMS risk
Based on the number of observed de novo damaging mutations in CMS, the MLE method estimated the most likely number of CMS risk genes to be 184 (Figure S2). Next, we used this estimate along with de novo mutation rates observed in CMS trios to predict the likely number of risk genes that will be discovered in larger CMS cohorts. Based on these simulations, WES of 500 trios should find 16 probable and 7 high-confidence risk genes; 1000 trios should find 51 probable and 26 high-confidence risk genes (Figure S3).
For each number of possible risk genes between 1-2,500, we conducted 50,000 simulations to determine the number of risk genes that yielded the closest agreement between our observed and simulated data. This MLE method yields an estimate of 184 CMS risk genes (red vertical line). See Supplementary Methods.
Using our estimate of 184 risk genes (based on the MLE method – see Main Text and Supplementary Methods), we estimated the number of probable (FDR q<0.3) and high-confidence (FDR q<0.1) risk genes that will be discovered as more CMS trios are sequenced. We performed 10,000 simulations at each cohort size from 25-3,000 trios, randomly generating variants and assigning to risk genes in agreement with the proportions seen in our data, then applying the TADA-Denovo algorithm. Based on these simulations, WES of 500 trios should find 16 probable and 7 high-confidence risk genes; 1000 trios should find 51 probable and 26 high-confidence risk genes.
CMS gene enrichment in Tourette syndrome, “suggestive evidence” ASD genes, and in ASD with high stereotypy scores
Using DNENRICH (56), we found significant overlap between genes harboring de novo damaging variants in CMS (52 genes after excluding two genes with de novo damaging variants in controls) and several gene sets curated from the literature (Table S4). In particular, our CMS cohort genes show significant gene overlap with autism probands with high stereotypy scores (5.8x enrichment, p=0.048), Tourette’s disorder (4.7x enrichment, p=0.011), and category 3 (suggestive evidence) autism risk genes (3.7x enrichment, p=0.023). There was no significant overlap with OCD, ADHD, schizophrenia, intellectual disability, developmental disabilities, or other categories of autism gene lists (Table S4).
Exploratory pathway, gene ontology, and spatiotemporal analyses
Using this same list of 52 genes harboring de novo damaging variants in CMS, we performed exploratory analyses to determine shared underlying canonical pathways and functional connectivity. Our input gene list is significantly enriched for biological process ontology-based sets including demethylation, organization of branching structures (e.g. neurons and vasculature), multicellular organism development, and cell motility. Enriched canonical pathways include the glucocorticoid receptor pathway, transient receptor potential (TRP) channel function, demethylation of histones by histone lysine demethylases, and ion channel transport (Table S5).
Finally, mapping our CMS de novo damaging variant genes onto the Brainspan Atlas of the Developing Human Brain gene expression data, we see nominal enrichment in early mid-fetal cortex and striatum, with a possible trends toward enrichment in early fetal hippocampus, late mid-fetal cerebellum, and young childhood cerebellum (Table S5).
DISCUSSION
Like prior studies of ASD, Tourette’s disorder, and OCD, the current study demonstrates that the identification of de novo variants will identify risk genes and provide a reliable entry-point into understanding the biology of stereotypies. We are studying otherwise typically-developing children with stereotypies (primary CMS), as this may represent a more genetically homogenous group of individuals versus those with secondary stereotypies, thereby facilitating genetic discovery and insight into the biology of stereotypies more generally (60, 61). Despite our small cohort size, we identified two de novo nonsense mutations in KDM5B in unrelated probands, and we show that finding two such mutations in our cohort is highly unlikely to be a chance occurrence.
KDM5B is a lysine-specific demethylase that removes methyl groups from tri-, di- and monomethylated lysine 4 on histone 3. KDM5B acts a transcriptional repressor and has primarily been implicated in the pathogenesis of cancer (62). More recently, this gene has also been implicated in congenital heart disease risk, embryonic stem cell renewal and differentiation, and DNA repair (63–65). KDM5B has been identified as high-confidence risk gene in WES studies of ASD (54), and expression is normally restricted to the brain and the testis (66). Within the brain, high expression levels are in the cerebellum (Figure S4), and expression across all brain regions is highest prenatally (Figure S5). The identification of this risk gene in CMS suggests that chromatin (dys)regulation of KDM5B target genes may be one contributing mechanism underlying stereotypies. As shown in our gene ontology and pathway results (Table S5), demethylation of histones is enriched, driven by the two KDM5B mutations and a de novo damaging mutation in KDM3B in a third proband. Certainly, further studies are warranted to determine the downstream effects of these mutations in the developing brain. These studies are underway in our laboratory.
Brain expression by region for KDM5B. Data is from GTEx Analysis Release V7 (dbGaP Accession phs000424.v7.p2) (https://gtexportal.org/home/gene/KDM5B). Expression values are shown in Transcripts Per Million (TPM), calculated from a gene model with isoforms collapsed to a single gene. No other normalization steps have been applied. Box plots are shown as median, 25th, and 75th percentiles. Points are displayed as outliers if they are above or below 1.5 times the interquartile range. Further details about expression quantification and samples can be found at https://gtexportal.org/home/documentationPage#AboutData.
Brain expression trajectories of KDM5B in the developing human brain. Expression data is from the Brainspan Consortium (brainspan.org, hbatlas.org), generated using the Affymetrix GeneChip Human Exon 1.0 ST Array platform. Vertical axis is the log2-transformed array signal intensity, which is proportional to transcript expression. A stringent threshold of ≥6 was required to meet criteria for brain expression. Horizontal axis represents periods of human development and adulthood as previously defined by Kang et al (2011). Birth begins period 8 and adolescence begins period 12. Brain regions are by color: neocortex (NCX), hippocampus (HIP), amygdala (AMY), striatum (STR), mediodorsal nucleus of the thalamus (MD), cerebellar cortex (CBC).
It is interesting that we find significant overlap between genes harboring de novo damaging mutations in CMS and those reported in a recent study of Tourette syndrome (Table S4; 4.7x enrichment, p=0.011). We find this overlap with Tourette despite excluding CMS subjects with comorbid motor or vocal tics (see Methods). Enriched expression of CMS genes in the cortex and striatum (Table S5) is also consistent with widely believed involvement of these regions in Tourette syndrome. While OCD and ADHD symptoms were not exclusionary in our CMS study, we saw no gene overlap with these disorders. Similarly, we found no overlap with SCZ, ID, or DD. We did, however, find significant overlap between CMS and Category 3 (suggestive evidence) ASD risk genes (3.7x enrichment, p=0.023).
With regard to stereotypies in ASD, we curated lists of genes harboring de novo damaging mutations in SSC probands with the highest (90th percentile) and lowest (10th percentile) stereotypies, measured by Stereotyped Behavior Scores (SBS) from the RBS-R. KDM5B mutations were found only in SSC probands with high stereotypy scores, yielding 5.8-fold enrichment over expectation (p=0.047) when compared against our CMS genes (Table S4). To further examine the relation between de novo KDM5B mutations stereotypies in SSC ASD probands, we compared SBS scores in four probands with KDM5B mutations versus 364 age-matched patients without (Figure S6). Scores were higher in mutation carriers, but this did not reach statistical significance (p=0.076), likely due to the low number of mutation carriers in this cohort.
Tukey box and whisker plot of Stereotyped Behavior Score (SBS) from the RBS-R (Repetitive Behavior Scale-Revised) in aged-matched Simons Simplex Collection ASD probands with (+, n=4) and without (-, n=364) de novo damaging mutations in KDM5B. Two-tailed Mann-Whitney test of ages (months) between groups: p=0.86. One-tailed Mann-Whitney test of SBS between groups: p=0.076.
In summary, we report an increased burden of de novo damaging DNA coding variants in primary complex motor stereotypies. We identified one high-confidence risk gene for CMS in our pilot cohort and estimate that there are 184 genes conferring risk for this phenotype. Whole-exome sequencing in parent-child CMS trios provides a reliable way to make progress in gene discovery. Our preliminary analyses of genes harboring de novo damaging mutations in CMS highlight several biological pathways, processes, brain regions, and developmental time periods that give insights into possible etiologies of stereotypies, which are a prerequisite to development of new treatments. Further sequencing and mechanistic studies are warranted to understand this phenotype, which has relevance across diagnostic boundaries.
SUPPLEMENTARY METHODS
Subjects and Assessment Measures
The Stereotypy Severity Scale (SSS) is a 5-item caregiver questionnaire consisting of two components (Motor and Impairment) for the ranking of motor stereotypy severity (1). The SSS Motor score (range: 0-18) quantifies motor severity and rates movements along four discriminate dimensions: number (0–3), frequency (0–5), intensity (0–5), and interference (0–5). The SSS Impairment score (range 0-50) is an independent rating of difficulties in self-esteem, family, school, or social acceptance caused by the movements.
The ASSQ is a 27-item caregiver questionnaire addressing symptoms of ASD (2).
Other parent-completed measures included: the Multidimensional Anxiety Scale for Children (MASC), assessing symptoms of anxiety (3); the ADHD-Rating Scale-IV (4) and Conners Parent Rating Scale (CPRS), assessing symptoms of ADHD; the Child Yale Brown Obsessive-Compulsive Scale (CY-BOCS), assessing symptoms of OCD (5); the Repetitive Behavior Scale-Revised (RBS-R), assessing repetitive behaviors (6); and the Social Responsiveness Scale (SRS), assessing social communication skills (7).
Whole-exome sequencing, alignment, variant calling, and quality control
Exome capture and sequencing were performed at the Yale Center for Genome Analysis (YCGA), using the NimbleGen SeqCap EZExomeV2 capture library (Roche NimbleGen, Madison, WI, USA) and the Illumina HiSeq 2500 platform (74 bp paired-end reads; Illumina, San Diego, CA, USA). We multiplexed six samples during each capture reaction and sequencing lane, pooling parents and probands when possible. Alignment and variant calling of the sequencing reads followed the latest Genome Analysis Toolkit (GATK) (8) Best Practices guidelines, as described previously (9). Reads were aligned using BWA-mem (10) to the b37 human reference sequence with decoy sequences. Picard’s MarkDuplicates tool was used to mark PCR duplicates (https://broadinstitute.github.io/picard/). GATK was used to realign indels, recalibrate quality scores, and generate GVCF files for each sample using the HaplotypeCaller tool. All samples were called jointly using GATK’s GenotypeGVCFs tool, variant score recalibration was applied to the called variants, and all variant call data was written to a VCF file. This pipeline uses GATK’s Best Practices parameters and the default parameters for BWA and Picard. Only passing variants were used in downstream analyses. Variants were annotated using the RefSeq hg19 gene definitions and multiple external databases of variant population frequency, conservation scores, variation intolerance, mutation severity, and predicted functional effects using ANNOVAR (11).
Relatedness statistics were calculated based on the method of Manichaikul et al. (12), implemented in VCFtools v0.1.14.10 (13). Trios were omitted if expected family relationships were not confirmed or if there were unexpected relationships within or between families. Trios were omitted if > 5 de novo variants were observed. PLINK/SEQ (14) (i-stats; https://psychgen.u.hpc.mssm.edu/plinkseq/stats.shtml), PicardTools, and GATK DepthOfCoverage tools were used to generate quality metrics (Table S1). To identify outliers that might confound our case-control analysis, we performed principal components analysis (PCA) using this data. A scree plot determined the number of principal components accounting for the greatest proportion of variance, and we removed trios with family members falling more than three standard deviations from the mean in any of these principal components (Figure S1, Table S1). The R code PCA is provided below.
We used stringent thresholds for identifying de novo mutations because DNA from control subjects in the Simons Simplex Collection was not available for confirmation by Sanger sequencing. As previously described (15), de novo variants were called using an in-house script that required: (a) child is heterozygous for a variant, with alternate allele frequency between 0.3 and 0.7 in the child and < 0.05 in the parents; (b) sequencing depth (DP) ≥ 20 in all family members at the variant position; (c) alternate allele depth (AD) ≥ 5; (d) observed allele frequency (AC) < 0.01 (1%) among all cases and controls; and (e) mapping quality (MQ) ≥ 30. False positive calls were removed by in silico visualization. Based on Sanger sequencing confirmations, our confirmation rate using this method and platform is 98.7% (16).
Principal Component Analysis (PCA)
PCA was performed on all sequencing quality metrics (Table S1) in R using the following code:
Mutation rate calculations
Within each cohort, we calculated the rate of de novo mutations per base pair. For accurate rate calculation, we first determined the number of “callable” base pairs per family using the GATK DepthOfCoverage tool. We considered only bases covered at ≥ 20x in all family members, with base quality ≥ 20, and map quality ≥ 30; these thresholds match those required for GATK and de novo variant calling. For each cohort, we summed the “callable” base pairs in every family and used this number as the denominator for de novo rate calculations. Details for calculating these callable base pairs is below. The resulting rate was divided by two to give haploid rates. Confidence intervals were calculated using the pois.conf.int (pois.exact) function from the epitools v0.5-9 package in R. We compared de novo mutation rates in cases versus controls (burden analysis) using a one-tailed rate ratio test in R (https://cran.r-project.org/package=rateratio.test), considering only those variants present with a frequency of <0.001 in the ExAC v0.3.1 database (17).
Calculating “callable” base pairs
The following command was used to calculate the callable base pairs in each trio:
FamilyID.list contains names and locations of the three trio bam files. The .bed file contains the genomic intervals over which to calculate the callable base pairs. To calculate the coding callable base pairs (used for coding mutation rates, e.g. synonymous, nonsynonymous, missense, etc., see Table 1, Table S2), we used a bed file with intervals spanning the intersection of both capture array target intervals and the RefSeq coding intervals (32,027,823 bp total). To calculate all callable base pairs (used for the total coding + noncoding mutation rate, see “All” in Table 1), we used a bed file with intervals spanning the intersection of both capture array target intervals (33,973,867 bp total). The number of coding and total callable base pairs for every family passing is listed in Table S1.
Calculating expected mutation rates for downstream analyses
To perform subsequent maximum likelihood estimation (MLE) and TADA analyses, we used published per gene de novo mutation rates from unaffected parent-child trios (18). From the control samples in our dataset, we calculated the proportion of the overall coding mutation rate that comprised LGD and Mis-D mutations, and then used these proportions to calculate the expected LGD and Mis-D mutation rate per gene (Table S3).
The following R code was used to generate the expected mutation rates:
TADA analysis
As shown in prior WES studies in neuropsychiatric disorders, a small number of rare de novo mutations in the same gene among unrelated individuals can provide considerable statistical power to establish association. To test this hypothesis in CMS, we used the transmitted and de novo association (TADA-Denovo) test. TADA uses a Bayesian model that combines data from de novo mutations and population mutation rates to increase the power of gene discovery. While TADA has a version that can include inherited variants, we did not include inherited data in this study, because their confirmation rate is not known and their contribution to the TADA score is minimal, given their lower relative risks (19, 20). The code and documentation for this tool can be found here (http://wpicr.wpic.pitt.edu/WPICCompGen/TADA/TADA_homepage.htm).
We describe the parameters of this test in our prior WES studies of Tourette’s disorder and OCD (15, 21). The code and parameters used in the current study are given below. A low FDR-corrected q-value represents strong evidence for association. Genes with FDR q<0.3 are considered probable risk genes, and those with FDR q<0.1 are high-confidence risk genes.
Maximum Likelihood Estimation (MLE) method for estimating the number of CMS risk genes
We used a maximum likelihood estimation (MLE) method to estimate the number of genes contributing risk to CMS, based on vulnerability to de novo damaging variants (22). For every number of risk genes from 1 to 2,500, we simulated 54 variants (the number of damaging de novo variants observed in probands in our case-control burden analysis). Variant simulations were performed 50,000 times at each number of risk genes. Following each simulation, a percentage of variants was randomly assigned to the risk genes. The percentage of variants assigned to risk genes was determined by the fraction of de novo damaging variants estimated to carry CMS risk, and variant simulations were weighted by gene size and GC content (19). We then counted the number of risk and non-risk genes containing two variants and the number containing three or more variants. The frequency of concordance between our simulated and observed data was calculated. A curve was plotted to show the concordance frequency (y-axis) at each assumed number of risk genes (x-axis), and the peak was taken as the estimate of the most likely number of risk genes (22). See Figure S2.
The following R code was used to perform these calculations:
Predicting the number of CMS risk genes identified by cohort size
Future discovery of risk genes was predicted as previously described (16, 21). Fixing the number of CMS risk genes at 184 (from estimate above), we simulated de novo mutations, with the number of mutations matching the observed mutation rate in CMS probands, and with mutation simulations weighted by gene size and GC content. Simulations were performed at each cohort size, from 25 to 3000, in increments of 25. Simulated variants were randomly assigned to the risk genes, with the percentage of variants assigned to risk genes determined by the fraction of de novo damaging variants estimated to carry CMS risk. At each cohort size, 10,000 simulations were performed. LGD and Mis-D variants were simulated separately. Simulated variants were then combined and given as input to the TADA-Denovo algorithm, using the same parameters described above for the observed data. The number of high confidence (q<0.1) and probable (q<0.3) risk genes were recorded and plotted using polynomial regression fitting; this regression model allows prediction of the number of genes identified at a specified cohort size. See Figure S3.
The following R code was used to perform these calculations:
Gene set overlap
We used DNENRICH (14) (https://psychgen.u.hpc.mssm.edu/dnenrich/) to test whether CMS genes found to have de novo damaging mutations in our study (52 genes after excluding two genes with de novo damaging variants in controls) were significantly enriched among previously reported genes in autism (ASD), schizophrenia (SCZ), developmental disorders (DD), Tourette’s disorder (TD), obsessive-compulsive disorder (OCD), attention-deficit/hyperactivity disorder (ADHD), and intellectual disability (ID). Gene lists for SCZ and ID were obtained from a recent cross-disorder study (23) that included de novo single nucleotide and indel variants from multiple exome sequencing studies (14, 24–31). DD (32), TD (33), OCD (15), and ADHD (34) genes were obtained from recently published WES studies. For the TD gene list, we removed variants reported in subjects with comorbid OCD, as this phenotype information was readily available. ASD gene lists were obtained from the SFARI Gene online database (https://gene.sfari.org/about-gene-scoring/criteria/), version 08-06-2019. ASD genes in this database have been stratified into six categories, based on manually curated strength of evidence from human genetics studies. As described in more detail on the above website, gene categories are as follows: Category 1: high confidence; Category 2: strong candidate; Category 3: suggestive evidence; Category 4: minimal evidence; Category 5: hypothesized but untested; Category 6: evidence does not support a role. Finally, we curated lists of genes harboring damaging de novo mutations in ASD probands from the Simons Simplex Collection (SSC) for whom stereotyped behavior scores (Stereotyped Behavior Score from the RBS-R, Repetitive Behavior Scale-Revised) were available. De novo mutation data from SSC probands was obtained from denovo-db (http://denovo-db.gs.washington.edu/, version 1.6.1, accessed 5/14/2019). RBS-R Stereotyped Behavior Score data was obtained from the Simons Foundation. Because we were particularly interested in the question of whether our CMS cohort share genes harboring de novo damaging mutations with SSC probands having high stereotypy scores, we assembled gene lists from SSC subjects with stereotypy scores in the 90th percentile (high stereotypies) and those in the 10th percentile (low stereotypies). Gene lists are provided in Table S4, first tab.
DNENRICH simulates random mutations while accounting for gene size, trinucleotide context, and mutational effect. We performed 100,000 permutations, comparing the observed and expected overlap with each gene set. Empirical p-values were generated, based on a one-sided enrichment analysis under a binomial model of greater than expected hits per gene set. We tested for overlap between our CMS genes and those in each of the mentioned gene lists. Results are provided in Table S4, third tab.
The following Linux commands were used to run the DNENRICH analysis:
Exploratory pathway, gene ontology, and spatiotemporal analyses
To explore whether genes harboring de novo damaging variants in our CMS probands (52 genes after excluding two genes with de novo damaging variants also found in SSC controls), we used ConsensusPathDB (35) (http://cpdb.molgen.mpg.de/, Release 34 [15.01.2019], accessed 8/9/2019). This tool integrates human protein and genetic interaction networks from 32 databases and interactions curated from the literature. The following default settings were used for ConsensusPathDB: gene set analysis → over-representation analysis; gene identifier type: gene symbol (HGNC symbol); Pathway-based sets: pathways as defined by pathway databases, select all resources, minimum overlap with input list = 2, p-value cutoff = 0.05; Gene ontology categories: gene ontology level 2 categories, select all (biological processes, molecular function, cellular component), p-value cutoff = 0.05. Results are in Table S5, first tab.
For spatiotemporal enrichment analysis, we used our same list of 52 genes and asked whether these genes have known expression patterns that cluster within certain anatomical brain regions or within certain developmental time periods. To perform this analysis, we used data from the Brainspan Atlas of the Developing Human Brain (36) as implemented in the Specific Enrichment Analysis (SEA) tool (http://genetics.wustl.edu/jdlab/csea-tool-2/, version 1.1, accessed 8/10/2019). For this analysis, we used a specificity index threshold (pSI) of 0.05 (37). Results are in Table S5, second tab. Fisher’s Exact p-values are uncorrected for multiple comparisons.
SUPPLEMENTARY TABLES
Table S1 – Phenotype, exome sequencing metrics, and principal components analysis.
(see “TableS1.xlsx”)
First tab contains individual-level sample information (columns A-I), including family ID, individual ID, phenotype, cohort, collection site, gender, capture platform, size of “callable exome”, and paternal age (years) at birth, where available. Column J lists reasons for any sample exclusions by quality control methods; “0” indicates that the sample was not excluded, and was included in subsequent analyses. Columns K-AF list individual sample sequencing metrics generated using PicardTools, and GATK DepthOfCoverage tools. Columns AG-AQ list individual sample sequencing metrics generated using PLINK/SEQ (i-stats; https://psychgen.u.hpc.mssm.edu/plinkseq/stats.shtml). Columns B, K-AQ were included in Principal Components Analysis (PCA). Third tab contains cohort-level metrics calculated using samples passing quality control. ±95% confidence intervals are given, when applicable. Fourth tab contains coordinates generated for each sample for the top 10 principal components following PCA. The code used to generate this data is included in Supplementary Methods. Using these coordinates, we removed trios with family members falling more than three standard deviations from the mean in any of the first five principal components; this information is contained in the fifth tab.
Table S2 – Annotated de novo variants in CMS and controls.
(see “TableS2.xlsx”)
Detailed information on all high confidence de novo variants in cases and controls. These variants were annotated using ANNOVAR, based on RefSeq hg19 gene definitions. Column descriptions are provided in a separate tab of this file. A third tab provides the number of each de novo variant type per sample.
Table S3 – Gene-level de novo mutation rates, variant counts, and TADA-Denovo results.
(see “TableS3.xlsx”)
First tab contains de novo mutation rates used to perform subsequent maximum likelihood estimation (MLE) and TADA-Denovo analyses. The following mutation rates are listed for each gene: overall (mut.rate), likely gene disrupting (lgd), predicted damaging missense (mis3), and all damaging (lgd + misD). These overall mutation rates were previously published (Ware et al., 2015) from unaffected parent-child trios. The code used to generate the mutation rate table is provided in Supplementary Methods. Second tab contains the input file for the TADA-Denovo algorithm. Gene-level expected mutation rates for LGD (“mut.cls1” column) and Mis-D variants (“mut.cls2” column) are listed, along with their respective observed mutation counts in our CMS data (“dn.cls1” and “dn.cls2”, respectively). Code for running TADA-Denovo is given in Supplementary Methods. Third tab contains the final output results from TADA-Denovo code provided in Supplementary Methods. One gene harboring more than one damaging de novo (LGD or Mis-D) variant in unrelated CMS families is highlighted in yellow (KDM5B). This gene exceeded the threshold for being considered a high confidence (qval < 0.1) risk gene.
Table S4 – DNENRICH gene lists and results.
(see “TableS4.xlsx”)
See Supplementary Methods for details of DNENRICH analysis and gene lists used. First tab contains input gene lists and information about their curation. Second tab contains the input mutation list for DNENRICH; each row represents a de novo damaging mutation in a CMS proband. Third tab contains results output from DNENRICH. Significantly enriched gene sets are highlighted.
Table S5 – Exploratory pathway, gene ontology, and spatiotemporal analyses results.
(see “TableS5.xlsx”)
Pathway and gene ontology results from ConsensusPathDB are in the first tab; p-values < 0.05 and corresponding q-values are shown. Specific Enrichment Analysis (SEA) exploring whether genes cluster within certain brain regions across development using Brainspan atlas data is in the second tab; p-values < 0.05 are highlighted in yellow. See Supplementary Methods for details of these analyses.
ACKNOWLEDGEMENTS
We wish to thank the families who have participated in and contributed to this study. Control subject data were obtained from the NIH-supported National Database for Autism Research (NDAR). NDAR is a collaborative informatics system created by the National Institutes of Health to provide a national resource to support and accelerate research in autism. Dataset identifier: 2042. This manuscript reflects the views of the authors and may not reflect the opinions or views of the NIH or of the Submitters submitting original data to NDAR. This work was supported by grants from the Simons Foundation (SFARI award #239013, TVF), the Allison Family Foundation (TVF), Nesbitt-McMaster Foundation (HSS), Klump Family (HSS), and Graves Family (HSS). We are grateful to all of the families at the participating Simons Simplex Collection (SSC) sites, as well as the principal investigators (A. Beaudet, R. Bernier, J. Constantino, E. Cook, E. Fombonne, D. Geschwind, R. Goin-Kochel, E. Hanson, D. Grice, A. Klin, D. Ledbetter, C. Lord, C. Martin, D. Martin, R. Maxim, J. Miles, O. Ousley, K. Pelphrey, B. Peterson, J. Piggot, C. Saulnier, M. State, W. Stone, J. Sutcliffe, C. Walsh, Z. Warren, E. Wijsman). We appreciate obtaining access to phenotype and genetic data on SFARI Base. Approved researchers can obtain the SSC population dataset described in this study (https://www.sfari.org/resource/simons-simplex-collection/) by applying at https://base.sfari.org.
Dr. Fernandez receives research/grant support from the National Institutes of Mental Health and the Brain and Behavior Research Foundation. Dr. Olfson receives research support from the American Academy of Child & Adolescent Psychiatry and the Alan B. Slifka Foundation through the Riva Ariella Ritvo endowment. Dr Singer serves as a consultant for Abide Therapeutics, Inc; Cello Health BioConsulting; ClearView Healthcare Partners; Teva Pharmaceutical Industries Ltd; and Trinity Partners, LLC. Dr Singer receives publishing royalties from Elsevier and research/grant support from the Tourette Association of America. Other authors declare no potential conflicts.