Mutually exclusive autism mutations point to the circadian clock and PI3K signaling pathways

Mutual exclusivity analysis of genomic mutations has proven useful for detecting driver alterations in cancer patient cohorts. Here we demonstrate, for the first time, that this pattern is also present among de novo mutations in autism spectrum disorder. We analyzed three large whole genome sequencing studies and identified mutual exclusivity patterns within the most confident set of autism-related genes, as well as in the circadian clock and PI3K/AKT signaling pathways.


Main text
DNA sequencing studies have identified several genes whose mutations are strongly associated with autism spectrum disorder (ASD) as well as a greater number of weakly associated genes that require further investigation. While over 1,000 of these genes are catalogued and classified in the SFARI Gene database 1 , only 25 rank in the highest-confidence category (rank 1) (Fig.1a). We need better methods to distinguish functionally impactful genetic mutations from those that are innocuous. More importantly, we need methods to biologically contextualize these so-called functional mutations and to explain how and when they contribute to ASD etiology.
A fundamental approach for discovering functional genomic alterations is to focus on genomic regions with high numbers of variant calls in probands relative to background controls. However, such frequent mutations are rare in complex heterogeneous disorders such as ASD. In the cancer research field, we and others have employed an approach that is uniquely well-suited to this scenario: we simultaneously evaluate mutations within multiple genes to test whether their distribution across patients is non-random [2][3][4] . This approach identifies "mutually exclusive" gene groups whose members-while mutated throughout a disease cohort-are rarely co-mutated within an individual. The mutual exclusivity pattern indicates the existence of substitutable functional mutations, which often align with known biological pathways. Such a pattern is characteristic of certain subgroups of cancer-driving mutations and it contrasts with the random distribution of innocuous mutations (Fig. 1c). Here we show that mutual exclusivity also exists in ASD datasets and that it presents a novel opportunity for detecting and characterizing functional mutations that, to date, have been indistinguishable from randomly distributed background mutations.
We use three recent whole genome sequencing ASD datasets with de novo mutation calls (Fig 1b): (i) a dataset released by Yuen et al. 5 from the MSSNG project containing 1,625 individuals with ASD and 2 control cases, (ii) a dataset by Turner et al. 6 analyzing 516 ASD probands and 516 unaffected sibling controls from Simon Simplex Collection (SSC), and (iii) a dataset published by An et al. 7 analyzing 1,902 ASD probands and 1,902 unaffected sibling controls from the SSC cohort. Although all samples in the Turner dataset were re-analyzed within the An dataset, it should be noted that the Turner study's probands were specifically selected for their lack of likely gene-disrupting (LGD) de novo mutations or large copy number variants (CNVs).
For proof of principle, we first test whether mutations of SFARI genes are distributed in a mutually exclusive fashion-as these genes are the most likely to bear functional mutations. To do so, we transform each dataset into a gene by sample mutation matrix and test for mutual exclusivity of the high-confidence SFARI genes with varying confidence thresholds (see Methods). We determine that the most confident SFARI gene set (rank 1) has mutually exclusive mutations in both the Yuen dataset (p = 0.0251) and in the combined Yuen and An dataset (p = 0.0108). This significance diminishes as we expand the gene set with less-confident tiers (Fig. 2a, Suppl. Tab. 1a-e). Notably, the vast majority (91%) of the gene-associated de novo mutations in these datasets are intronic. To understand if the observed mutual exclusivity is driven by coding mutations, we perform the same analysis on the combined Yuen and An datasets (i) using only intronic mutations, and (ii) excluding intronic mutations. While the intron-only test produces a nearly significant result for the SFARI rank 1 gene set (p = 0.0553, Suppl. Tab. 1f), excluding intron mutations results in only one overlap but insignificant results (p = 0.6421, Suppl. Tab. 1g). We conclude that, while intronic mutations may contribute to the observed mutual exclusivity, we lack the abundance of coding mutations necessary to confirm or deny their contribution.

SSC cohort
Turner et al.

Figure 1.
Overview of resources in ASD and the concept of mutual exclusivity in cancer. a) Frequency of autism genes in SFARI database grouped by their assigned ranks (lower rank indicates higher confidence). b) Contents of 3 whole-genome ASD datasets used in this study. c) An example of mutual exclusivity from the TCGA Glioblastoma cohort wherein CDKN2A, CDK4 and RB1 genomic alterations exhibit much less overlap than would occur randomly 2 . In this case, inhibition of either CDKN2A or RB1, or activation of CDK4 is enough to unlock the cell cycle, and a second gene alteration in this group is unnecessary for disease progression.
Next, we test biological pathways from the Reactome knowledgebase 8 to identify processes harboring mutually exclusive mutations. Reactome contains 1,576 pathways of various sizes. While exhaustively testing every pathway would not produce significant results, limiting the search space to the most mutated 50 pathways identifies the Circadian Clock pathway within the Yuen dataset (p = 0.0006, FDR = 0.03, Suppl. Tab. 2), and we find that the An dataset independently validates this result (p = 0.0371, Suppl. Tab. 3). To identify more pathways, we again look to the combined Yuen and An datasets, but here we find only the Circadian Clock pathway (Suppl. Tab. 4). Interestingly, using the combined Yuen and Turner dataset identifies the Circadian Clock as well as two additional pathways-both of which are related to PI3K signaling (FDR = 0.0817, Fig. 2b,c,d, Suppl. Tab. 5). It is possible that the sample selection criteria in the Turner dataset has, for reasons not intended nor yet understood, enriched it for functional non-coding mutations in PI3K signaling.
In the combined Yuen and An dataset, the Circadian Clock pathway has 53 mutated member genes-15 of which are catalogued by SFARI (although none of them are ranked 1, Fig. 2c, Suppl. Tab. 6, Suppl. Fig. 1). None of the member gene contributions to the observed exclusivity are individually significant. In the combined Yuen and Turner dataset, the PI3K/AKT Signaling pathway has 73 mutated member genes-11 of which are catalogued by SFARI (again, none of them are ranked 1, Fig. 2d, Suppl. Tab. 7, Suppl. Fig. 2). For this pathway, the member gene ERBB4, which is of low-confidence among SFARI genes (rank 5), has a significant contribution to the mutual exclusivity pattern (p = 0.0005, FDR = 0.0365). These 2 significant pathways are totally disjoint in terms of their mutated members, while the other PI3K pathway-Constitutive Signaling by Aberrant PI3K in Cancer-is a subset of the second pathway.
We perform three additional analyses to further validate our findings. First, we interrogate the reason for the observed increase in PI3K/AKT Signaling pathway mutual exclusivity brought about by replacing the An dataset with the smaller Turner dataset. Specifically, we ask whether Turner et al.'s unique sample selection is related to this difference rather than it being entirely attributable to general differences in data generation and processing. To test this, we calculate mutual exclusivity by using the combined Yuen and An dataset, but this time limiting the An samples to those 516 ASD probands that overlap with Turner samples. We find that, while p-values slightly deteriorate, the Circadian Clock and PI3K/AKT Signaling pathways are still significant in these results (p = 0.0001 and p = 0.0028, respectively, FDR = 0.07, Suppl. Tab. 8), so we conclude that sample selection has a major role in the effect, and it cannot be solely explained by data processing differences. Second, we assess the results for association with the ASD phenotype by replacing the ASD individuals with their neurotypical siblings  from the Turner and An datasets (Fig. 2e,  individual mutual exclusivity contributions (Suppl. Tab. 11), we find that the first 11 genes, taken together, are significantly biased toward mutation in ASD in the An dataset (p = 0.0218, Fig. 2f). In this analysis, we use the Yuen dataset alone for generating individual mutual exclusivity scores so that bias calculations and mutual exclusivity calculations are done on independent datasets, thereby preventing possible confounding effects. PI3K signaling is already popularly studied in autism, as abnormalities in this pathway have been shown to promote the ASD phenotype in multiple reports 9,10 . The circadian clock pathway, however, is relatively less studied at the genetic level, despite the well-documented prevalence of circadian rhythm disruptions within individuals with ASD 11 , and higher polymorphism of circadian clock related genes in ASD 12 . While it is still unclear if circadian rhythm abnormalities contribute to or are a by-product of ASD, our results bolster the former hypothesis and nominate the circadian clock as a key process whose alteration promotes the ASD phenotype. Here we demonstrate that mutual exclusivity analysis of de novo mutations is a powerful methodology for identification and characterization of functional mutations in ASD, and that it will play an important role in future research on ASD and other complex, multi-factorial disorders of unclear origins.
We downloaded the Yuen and Turner datasets from denovo-db 14 version 1.6.1 at http://denovo-db.gs.washington. edu/denovo-db.non-ssc-samples.variants.tsv.gz and http://denovo-db.gs.washington.edu/ denovo-db.ssc-samples.variants.tsv.gz on March 26, 2019. We limited the Yuen dataset to the samples listed in their supplementary table 3 and ignored mutation calls coming from other tables that are not derived from whole genome sequencing. We downloaded the An dataset from their supplementary table 2. For each of the datasets and in every analysis, we ignored multiple mutations of the same gene in the same sample (i.e. we only considered whether the gene is mutated at least once; additional mutations on the gene in the same sample were disregarded).

Detection of mutual exclusivity
Our mutual exclusivity detection approach is a hybrid of two previously published methods: MEMo 3 and Mutex 2 . For a given mutation matrix and a gene set of interest, we first calculate that gene set's "overlap" in the matrix (Fig. 3). Within the gene set, we define each sample's overlap as one less than the number of mutated genes from the gene set in that sample (or 0 if none are present). Summation of this number across all samples produces the overlap value for the gene set within the matrix. Then, to produce a p-value for mutual exclusivity, we generate a null background distribution by shuffling the entire mutation matrix 10,000 times, preserving the number of mutations on each gene and on each sample in each shuffle, and test whether this produces the same or reduced overlap for the gene set. In addition to computing the significance for the gene set, we seek to identify the contribution of individual members. We calculate a p-value for each member gene based on its overlap with all other members collectively, using the same shuffled matrices as background.  To generate the null background distribution, we employ a degree-preserving randomization method previously applied in MEMo and originally adapted from the switching algorithm described in Milo et al. 15 . The algorithm preserves gene and sample mutation counts in the matrix during shuffling, as described below:

4/6
Define the mutation matrix as a bipartite graph from G (genes) to S (samples). Each edge is represented with (g i →s j ) where g i ∈G and s j ∈S. E is the total number of edges, and Q is a constant. For each of our 10,000 shuffle iterations: Do Q times Do E times Select two existing random edges (g i →s j ) and (g k →s l ) If there exist no such edges (g i →s l ) and (g k →s j ) Then replace (g i →s j ) and (g k →s l ) with (g i →s l ) and (g k →s j ) Although there is no theoretical optimal value for Q, Milo et al. demonstrate that Q = 100 is a practical value for a broad variety of graphs. We follow this precedent and set Q to 100.
We extend MEMo's group score with the idea of calculating individual gene scores-an idea we previously used in the Mutex approach-which enabled us to detect ERBB4 individually and to generate ranked gene lists from the resulting gene sets.
We use the Benjamini-Hochberg procedure for false discovery rate (FDR) estimation for testing the most-mutated 50 Reactome pathways. We considered 0.1 as FDR threshold for significance. In the Supplementary Tables 2-5,8-10, a pathway's mutation count is equal to the sum of the "Coverage" and "Overlap" columns.

Calculation of mutational bias toward autism
In the An dataset, there are a total of 73,624 mutations in probands and 72,576 mutations in the control samples. For a randomly selected mutation, the probability of belonging to a control sample is: p = 72576 73624+72576 = 0.496 Using this probability, we define a Binomial distribution B(n, p) where n is the total number of mutations for a given gene or a group of genes. On the Binomial distribution, we calculate a one-sided p-value for the number of mutations in the control samples to be less than or equal to the observed value.