Abstract
Early prediction of complex disorders (e.g., autism and other neurodevelopmental disorders) is one of the fundamental goals of precision medicine and personalized genomics. Early prediction can significantly increase the effectiveness of interventions and treatments, improving the prognosis and, in many cases, the quality of life of affected patients. Considering the genetic heritability of neurodevelopmental disorders, we propose a novel framework for utilizing rare coding variation for early prediction of these disorders. We provide a novel formulation for the Ultra-Accurate Disorder Prediction (UADP) problem and develop a combinatorial framework for solving it. The primary goal of this framework, denoted Odin (Oracle for DIsorder predictioN), is to make an accurate prediction for a subset of affected cases while having virtually zero false positive predictions for unaffected samples. The Odin framework takes advantage of available functional information (e.g., pairwise coexpression of genes during brain development) to increase prediction power beyond genes with recurrent variants. Application of our method accurately recovers an additional 8% of autism cases without known recurrent mutated genes in the training set, with a false positive prediction rate of less than 0.5% based on our analysis of unaffected controls. Furthermore, Odin predicted a set of 391 genes in which severe variants can cause autism or other developmental delay disorders. Odin is publicly available at https://github.com/HormozdiariLab/Odin
1 Introduction
The start of the genomics era and sequencing of the first human genome over a decade ago promised significant benefits to public health [1]. These include the potential capability of early detection, pinpointing the causes, and developing novel treatments and therapeutics for most diseases. The sequencing of the human genome has dramatically accelerated biomedical research; however, even a decade after publication, progress has been slow in truly unlocking the promise of genetics and genomics in direct application to human health and disease. Notably, the translation of genetic discoveries into actionable items in medicine has not achieved the promised potential. One of the main challenges lies in the fact that discovering the exhaustive set of causative variants for most diseases, except some monogenic Mendelian disorders, has proven to be an elusive and unmet objective [2–4].
One of the first questions to answer regarding any disease of interest is how much genetics contributes to its etiology. The primary metric used for calculating the contribution of genetics to any trait, including diseases, is denoted as genetic heritability (0 ≤ h² ≤ 1) [5, 6]. However, in many complex diseases, the calculated genetic heritability is far higher than the fraction of cases that can be explained or predicted by the observed genetic variants. This gap is known as the “missing heritability” problem and is one of the main hindrances not only in building early prediction models for complex disorders but also in developing novel treatments [7, 8].
Autism spectrum disorder (ASD) is an umbrella term used to describe a set of neurodevelopmental disorders having a wide range of symptoms, including lack of social interaction, difficulty in communication/language, repetitive behavior, and in many cases intellectual disability (ID) (i.e., having an IQ < 70) [9]. ASD is typically diagnosed around the age of two and is estimated to affect over 1 in 68 children (1.5% of all children). There is a well-known sex bias in ASD, as there are four times more male children affected with ASD than female children. Twin study comparisons have shown that genetics plays a major role in ASD, and researchers have estimated the heritability of ASD to be one of the highest among complex diseases (0.5 ≤ h² ≤ 0.8) [10, 11].
It is becoming apparent that early treatment and intervention can significantly improve the IQ, language skills, and social interactions in children affected with ASD [12–14]. Early diagnosis of ASD in young infants is challenging mainly due to the fact that most symptoms are not reliably detectable at a very young age and children tend to manifest a heterogeneous set of phenotypes with a diverse range of severity [15]. However, it is theoretically possible to make an accurate diagnosis of ASD or other neurodevelopmental disorders in some children before any symptoms appear (or even before the child is born) using (perinatal) genetic testing and genome sequencing [16]. Thus, building methods for early prediction of ASD using genetic variation and other biomarkers is extremely important and will have significant direct positive effects on public health.
The recent advances in high-throughput sequencing (HTS) technologies have given us the capability to sequence the whole genomes or exomes of many samples. For instance, consortia focused on complex disorders such as autism, ID, schizophrenia, and diabetes have sequenced tens of thousands of cases [17–20]. Sequencing of samples with these complex disorders has produced genetic maps with tens of thousands of variants, most of them rare (i.e., minor allele frequency, MAF < 0.05). Unfortunately, in most cases the effect of the variants found is not known (i.e., a variant of unknown significance), and for rare variants it is extremely hard to assign significance solely by considering their frequency. In the extreme case of de novo variants (i.e., novel variants not inherited from the parents), the same exact variant will likely never be seen in any other sample. Thus, models for early prediction of complex disorders need to be more sophisticated than just considering the frequency of variants in cases and controls.
There are some known syndromic subtypes of ASD with known genetic causes, such as Fragile X or Rett syndromes, which are the result of single-gene mutations (FMR1 or MECP2, respectively) [21]. Furthermore, there are known rare, large recurrent copy number variations, such as the 16p11.2 deletion or Prader-Willi syndrome, which are known to cause ASD. However, in most cases of ASD, the exact cause of the disorder is not known and no accurate method or model for early prediction of ASD using genetic variants exists. Recently, consortia focused on sporadic ASD have performed whole-exome sequencing on thousands of autism families (affected proband, unaffected sibling and parents) with the hope of finding causative variants in these samples. The enrichment of de novo variants in affected probands versus unaffected siblings has indicated that a significant fraction of ASD is the result of de novo and rare (MAF < 0.05) variants [17, 22]. However, in many cases, it is not clear which de novo or rare variants are the real culprit(s) of the phenotype. As we do not expect the same de novo variant to appear in two different samples, it has proven useful to summarize the observed variants by the genes they affect. This simple approach has provided researchers with enough statistical power to accurately (p < 0.01) predict tens of novel ASD genes with high penetrance of likely gene disruptive (LGD) and missense variants. However, these statistically significant genes only cover a fraction of the ASD cases estimated to be caused by de novo or rare variants. Based on twin studies, it is estimated that ASD and ID have a genetic heritability of over 0.5, while we can optimistically explain fewer than 20% of the affected children [17, 22] based on observed common or rare genetic variants (including structural variation and copy number variation [23–25]).
It is estimated that hundreds of genes are involved in neurodevelopment and that disruption of them can cause ASD or ID [22, 26]. The primary justification for such a high number of genes (high genetic heterogeneity) contributing to a similar disorder (i.e., ASD) is that most of these genes are members of only a few functional “modules” or pathways [27, 28]. Thus, disruption of any of these functional modules results in a similar disorder, mainly through interruption of normal neurodevelopment. It has been hypothesized that by using the functional relationship between these genes it is possible to find the causative variants.
Complex disorder prediction using (rare) coding variants
As mentioned above, early prediction of complex disorders using genetic variation is one of the fundamental goals of personalized medicine. Currently, thousands of cases with neurodevelopmental disorders (e.g., ASD) have been studied using WES or targeted sequencing. Thus, we have a very rich set of rare and common coding variants found in samples with ASD and other neurodevelopmental disorders, which can be used for building early prediction models and methods. However, it is also important to realize the intrinsic limitations of rare coding variants in predicting ASD or other complex disorders. Notably, (i) most complex disorders have genetic heritability of significantly less than 1 (e.g., 0.5 < h² < 0.8 for autism), (ii) noncoding variants, which significantly contribute to these disorders, are not found using WES, and (iii) (coding) variants alone do not have the power to rule out, with very high accuracy, the possibility of being diagnosed with a complex disorder (such as autism). Therefore, achieving accurate prediction for all (or even most) affected cases using solely the coding variants is theoretically not achievable. On the flip side, this also means that we cannot confidently predict a sample as an unaffected control solely based on the observed (coding) variants, as other factors (e.g., environmental, epigenetic) can contribute to the disorder. Thus, instead of trying to predict the status of every input sample as affected case or unaffected control, which theoretically is not possible, we propose to only predict a subset of samples as affected cases with very low false prediction/discovery.
Ultra-Accurate Disorder Prediction (UADP) problem
A positive diagnosis/prediction of a complex disorder (e.g., ASD) can have a severe negative psychological and economic impact on affected individuals and their families. For instance, a positive prediction of severe developmental disability during prenatal testing can result in a termination of pregnancy. Thus, one of the main practical constraints in developing models and methods for prediction of a severe complex disorder is to guarantee a false positive prediction/discovery rate (FDR) of virtually zero. In other words, it is highly desirable not to have a false prediction of an input unaffected control as an affected case. Note that the UADP problem is different from traditional binary classification problems where each sample is assigned to one of the two classes (i.e., affected case or unaffected control). In the UADP problem, the goal is to predict a subset of samples as affected cases while all other samples are not assigned to any class.
In this paper, we study the UADP problem and provide a framework for solving this problem using rare coding variants. We aim to develop a computational method for positive prediction of a significant fraction of affected cases (due to the prediction limitation from coding variants) with virtually zero false positive prediction of unaffected controls (due to the negative effects of false positive prediction). We choose ASD as a case study since we can utilize a rich dataset of de novo mutations. In addition, we also integrate the functional relationship to increase the prediction capability. Approaches such as the one presented in this paper are needed not only to close the missing heritability in many complex disorders but also to translate the biomedical discoveries into actionable items by clinicians.
2 Methods
2.1 UADP problem definition and notations
In the UADP problem we are trying to maximize the number of samples correctly predicted as affected case, while the number of unaffected controls falsely predicted as affected cases must be extremely small. Another way to look at this problem is that we are trying to select a subset of samples such that the total number of unaffected controls picked is negligible while the number of affected cases selected is maximized. Finally, note that because of low recurrence of the same rare and de novo variants, we will be using a summarization of coding rare variants based on their effect on each gene and the biological function disrupted to increase our power for prediction.
Training Data
Let n and m be the number of genes and the total number of samples, respectively. The LGD (likely gene disruptive) mutation profile of the ith sample is a binary row vector $x_i = (x_{i,1}, x_{i,2}, \ldots, x_{i,n})$, where $x_{i,j} = 1$ if the ith sample carries an LGD mutation in the jth gene and $x_{i,j} = 0$ otherwise. An assumption here is that an LGD mutation will completely knock out or disrupt the copy of the affected gene in the sample. The diagnosis result (or class) of the ith sample is a binary value $y_i$, where $y_i = 1$ if the ith sample is an affected case and $y_i = 0$ if it is an unaffected control. A dataset D of m input samples is a set of m pairs D = {(x1, y1), (x2, y2), …, (xm, ym)}, where each pair (xi, yi) represents the LGD mutation profile and the diagnosis result, respectively, of the ith sample. We define the unaffected control set and the affected case set as Dcontrol = {xi | (xi, yi) ∈ D, yi = 0} and Dcase = {xi | (xi, yi) ∈ D, yi = 1}, respectively.
Gene similarity score
We will use the functional similarity between genes to increase the statistical power in disorder prediction. The assumption is that disruption of genes with similar functionality will result in similar phenotypes. Thus, we would like to develop a framework that can include the similarities between genes (mutational landscape and function) as an additional signal for disease prediction. We denote such a matrix by P ∈ [0, 1]^(n×n), where Pi,j indicates the similarity between the ith and jth genes and potentially how disruption of one gene can affect the other gene. For neurodevelopmental disorders such as ASD, our goal is to build the matrix P to reflect functional similarity of genes during brain development. One way to calculate such a score for any pair of genes is to use the coexpression of genes during brain development. Coexpression between two genes i and j is denoted by R(i, j) and is calculated to represent the expression similarity of these two genes in different conditions and tissues. Similar to previous practices, we use the Pearson correlation of expression profiles between two genes in different conditions as the coexpression value [27, 29–32]. Coexpression has been shown to be a powerful indicator of functional similarity of two genes for neurodevelopment. We also include the similarity of the likelihood of observing an LGD mutation (pLI) between the two genes in the population [33] for building this matrix. Assuming we are given multiple matrices capturing the similarity of genes, with each matrix using different biological concepts, we use the minimum of the similarity scores of two genes among the different matrices to build matrix P. In other words, if we have two matrices of gene similarity P′ and P″, the matrix P is built by assigning $P_{i,j} = \min(P'_{i,j}, P''_{i,j})$. Of course, the choice of these matrices and the way we combine them can be changed without any need to change the underlying framework and proposed methods.
The details of the different datasets used to build the matrix P for this study are provided in Section 3.1. We convert every sample by multiplying the vector xi by matrix P to produce new vectors zi = xi × P. We denote the sets of samples Dcontrol and Dcase converted by the gene similarity matrix P as D′control = {zi = xi × P | xi ∈ Dcontrol} and D′case = {zi = xi × P | xi ∈ Dcase}.
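As a minimal illustration of the two operations described above (combining similarity matrices by element-wise minimum and mapping a mutation profile through P), the following pure-Python sketch may help. The function names `combine_min` and `transform`, the three-gene matrices, and all numeric values are hypothetical and chosen only for illustration; they are not taken from the Odin implementation or from the study's data.

```python
def combine_min(p1, p2):
    """Element-wise minimum of two gene-similarity matrices (lists of rows)."""
    return [[min(a, b) for a, b in zip(r1, r2)] for r1, r2 in zip(p1, p2)]


def transform(x, p):
    """Row vector times matrix: z = x × P, with z_j = sum_i x_i * P[i][j]."""
    n = len(p)
    return [sum(x[i] * p[i][j] for i in range(n)) for j in range(n)]


# Hypothetical similarity matrices for three genes (values invented).
coexpression = [[1.0, 0.8, 0.1],
                [0.8, 1.0, 0.3],
                [0.1, 0.3, 1.0]]
pli_similarity = [[1.0, 0.5, 0.9],
                  [0.5, 1.0, 0.2],
                  [0.9, 0.2, 1.0]]

P = combine_min(coexpression, pli_similarity)

# A sample with a single LGD mutation in the first gene.
x = [1, 0, 0]
z = transform(x, P)
print(z)
```

In this toy example the transformed vector z is simply the first row of P, so a mutation in one gene contributes a graded signal to every gene that is similar to it.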
2.2 Odin framework
In this subsection, we will introduce the intuition behind our framework Odin (Oracle for DIsorder predictioN) as a practical solution for the UADP problem. To build such a conservative prediction model, Odin will intuitively predict an input/test sample to be an affected case if and only if it satisfies two conditions:
The input sample is “close” to many affected case samples
The input sample is “far” from any unaffected control sample
To satisfy the first condition, we simply use the nearest neighbor approach with a distance function (e.g., Euclidean distance). The closest neighbor of the input sample among the training data should be an affected case for the input sample to pass the first condition.
To satisfy the second condition, we develop a novel algorithm that first finds a region (after dimension reduction) containing a significant number of affected cases and no unaffected controls. This cluster is denoted as a unicolor cluster, as it only includes affected cases. The input sample passes the second condition if it falls inside this unicolor cluster. We denote the problem of finding such a cluster as Unicolor Clustering with Dimensionality Reduction (UCDR). We prove that this problem is NP-complete (Section 1 in the supplemental material), and it is therefore unlikely to be solvable efficiently. Therefore, we propose a relaxation of UCDR that we denote as Weighted Unicolor Clustering with Dimensionality Reduction (WUCDR). In the remainder of this section, we first formalize the UCDR and WUCDR problems and then present an iterative algorithm to solve the WUCDR problem.
2.2.1 Unicolor Clustering with Dimensionality Reduction (UCDR) problem
In the UCDR problem we have a set of red and blue points in n-dimensional space ℝ^n representing unaffected controls (i.e., D′control) and affected cases (i.e., D′case), respectively. Furthermore, we have an upper bound on the number of dimensions to consider (dimension reduction/feature selection), denoted by k. The goal of the UCDR problem is to discover a subset of dimensions with cardinality k (k ≪ n), a center point c ∈ ℝ^k and a constant r such that, after mapping all the blue and red points to the reduced k dimensions, the following objective and constraint hold:
Objective: maximize the total number of blue points with “distance” less than r to center c.
Constraint: there is no red point with “distance” less than r to center c.
As a general rule, any metric distance function (e.g., Euclidean distance) can be used for the UCDR problem. However, we use the ℓ1 distance since it is concordant with the additive model used in common variant studies. The ℓ1 distance between two points a = (a1, a2, …, an) and b = (b1, b2, …, bn) is defined as $\ell_1(a, b) = \sum_{i=1}^{n} |a_i - b_i|$. We denote the region within distance r from center c ∈ ℝ^k as the area of interest A(c, r). Furthermore, any affected case zi ∈ D′case inside the area of interest (i.e., ℓ1(c, zi) ≤ r) is considered covered by this area.
Note that the intuition behind the dimension reduction is to avoid the overfitting that arises from a large number of dimensions (> 20,000 genes) and a small number of training samples. In practice, we require that the number of selected dimensions be at most logarithmic in the number of samples (i.e., k = O(log2(m))), where m is the total number of training samples (both cases and controls).
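The ℓ1 distance and the notion of a case being covered by an area of interest can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `l1` and `covered`, the two-dimensional points, and the radius are all hypothetical and are not taken from the Odin code base.

```python
def l1(a, b):
    """l1 distance: sum of absolute coordinate differences."""
    return sum(abs(ai - bi) for ai, bi in zip(a, b))


def covered(center, r, points):
    """Points inside the area of interest A(center, r), i.e. l1 distance <= r."""
    return [p for p in points if l1(center, p) <= r]


# Hypothetical transformed case vectors in a reduced 2-dimensional space.
cases = [(1.0, 0.5), (0.0, 0.0), (2.0, 2.0)]
center = (0.5, 0.5)
radius = 1.0

inside = covered(center, radius, cases)
print(inside)
```

Here the first two points lie at ℓ1 distance 0.5 and 1.0 from the center and are covered, while the third (distance 3.0) is not.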
2.2.2 Weighted Unicolor Clustering with Dimensionality Reduction (WUCDR) Problem
Since the UCDR problem is NP-complete (see Section 1 in the supplemental material), we define a relaxation in which we assign (continuous-valued) weights to the dimensions. We denote this problem as the Weighted Unicolor Clustering with Dimensionality Reduction (WUCDR) problem. More formally, in addition to selecting k genes/dimensions, we also have to assign a weight 0 ≤ wi ≤ 1 to each gene/dimension i and use the weighted ℓ1 as the distance metric for clustering (we use the notation wℓ1 to represent weighted ℓ1). In the rest of the paper, we define the weighted ℓ1 distance function between two input points a and b with weights w (in n dimensions) as $w\ell_1(a, b, w) = \sum_{i=1}^{n} w_i \, |a_i - b_i|$. Note that since we are only allowed to select k dimensions, at least n − k of the dimensions will have weight zero.
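To make the role of the weights concrete, the following short sketch evaluates a weighted ℓ1 distance for two weight vectors. The function name `weighted_l1` and the example points are hypothetical; the sketch only illustrates that unit weights give back the plain ℓ1 distance and that a zero weight removes a dimension from the distance altogether.

```python
def weighted_l1(a, b, w):
    """Weighted l1 distance: sum_i w_i * |a_i - b_i|."""
    return sum(wi * abs(ai - bi) for ai, bi, wi in zip(a, b, w))


a = (1.0, 0.0, 2.0)
b = (0.0, 0.0, 0.0)

# All weights equal to 1: identical to the unweighted l1 distance.
d_unweighted = weighted_l1(a, b, (1.0, 1.0, 1.0))

# Second dimension dropped (weight 0), third dimension down-weighted.
d_weighted = weighted_l1(a, b, (1.0, 0.0, 0.5))

print(d_unweighted, d_weighted)
```

Selecting k dimensions therefore corresponds to allowing at most k non-zero entries in w.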
2.2.3 Iterative solution for WUCDR
Here we propose an iterative approach to solve the WUCDR problem. It consists of two main steps. In the first step, given a set of weights w, we find the optimal center c and radius r to cover a maximum number of affected cases (blue points) in the area of interest A(c, r) (note that the area of interest is defined using the weighted ℓ1 distance). In the second step, we find a new set of weights w given the center c and the radius r.
First Step
Given the weights w = (w1, w2, …, wn) (all the weights are assigned to 1 at the first iteration), find a center c and constant r such that
all red points have a weighted ℓ1 distance greater than r to center c and
the number of blue points, which have weighted ℓ1 distance less than r to center c, is maximized.
In general, finding such a center is a hard problem in n dimensional space and can be very time-consuming. Thus, we will relax the problem only to consider the blue points as a potential center c. This can be done trivially in polynomial time by considering every blue point as potential center and picking the optimal one. Given a center c, radius r and the weights w we can easily calculate the affected cases (i.e., blue points) covered by the area of interest. Let set S denote the covered (blue) points (i.e., affected cases), which will be used in the next step for updating the weights.
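The relaxed first step (trying every blue point as a candidate center) can be sketched as follows. This is a simplified illustration rather than the Odin implementation: the function name `best_center` is hypothetical, the radius for each candidate is assumed to be the distance to its nearest red point, coverage is assumed to use a strict inequality, and ties between candidates are broken by keeping the first one found.

```python
def weighted_l1(a, b, w):
    """Weighted l1 distance: sum_i w_i * |a_i - b_i|."""
    return sum(wi * abs(ai - bi) for ai, bi, wi in zip(a, b, w))


def best_center(blue, red, w):
    """Try each blue point as center; return (center, radius, covered set S)."""
    best = None
    for c in blue:
        # Assumption: largest admissible radius is the distance to the
        # nearest red point, so no red point lies strictly inside.
        r = min(weighted_l1(c, p, w) for p in red)
        s = [p for p in blue if weighted_l1(c, p, w) < r]
        if best is None or len(s) > len(best[2]):
            best = (c, r, s)
    return best


# Hypothetical toy data: four affected cases (blue), two controls (red).
blue = [(0, 0), (1, 0), (0, 1), (5, 5)]
red = [(3, 3), (6, 6)]
weights = (1, 1)  # all weights are 1 in the first iteration

center, radius, S = best_center(blue, red, weights)
print(center, radius, len(S))
```

With m training samples in n dimensions this exhaustive search performs on the order of m² distance evaluations, each costing O(n), which is polynomial as stated above.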
Second Step
Given a center c and the set of blue points S, covered by the area of interest found in the first step, we will calculate new weights w (for each dimension). The objective is to decrease the weighted ℓ1 distance of points in the set S to center c, while increasing the weighted ℓ1 distance of points in the set S to the red points (D′Control). We will solve the linear programming (LP) problem below to find these new weights
Note that in the above LP problem only w and ρ are unknown variables, while the set S and center c are calculated in the first step of the method. The constraints in the above LP problem yield a set of weights that are guaranteed to place all of the points in set S closer to the selected center c than any red point. Furthermore, these weights try to squeeze the (blue) points in S closer to the center c, while increasing the distance of the red points to the (blue) points in the set S. The objective function of the above LP problem has two main terms. The first term aims to reduce the average distance between points in the set S and the center c. Simply stated, the new weights w try to move the blue points covered in the first step (i.e., the points in set S) closer to the center c (note that both c and S come from the previous step and are not variables in this LP problem). The second term aims to increase the average weighted ℓ1 distance of all red points to the blue points in set S. Finally, among the weights produced we only keep the top k weights and set all of the remaining weights to 0. Note that because of the condition , we are guaranteed to be able to keep any dimension with value > 0.5 from the LP solution.
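The final truncation described above (keep the k largest weights and zero out the rest) is simple enough to sketch directly. The LP itself is not reproduced here; the function name `keep_top_k` and the example weight vector are hypothetical, and ties between equal weights are resolved in favor of the lower index, which is an arbitrary choice made for this illustration.

```python
def keep_top_k(w, k):
    """Keep the k largest weights and set every other weight to 0."""
    # Indices sorted by decreasing weight; Python's sort is stable, so
    # equal weights keep their original (lower-index-first) order.
    order = sorted(range(len(w)), key=lambda i: w[i], reverse=True)
    keep = set(order[:k])
    return [w[i] if i in keep else 0.0 for i in range(len(w))]


# Hypothetical weights, standing in for an LP solution over five genes.
lp_weights = [0.9, 0.1, 0.6, 0.0, 0.3]
truncated = keep_top_k(lp_weights, 2)
print(truncated)
```

After truncation, at most k dimensions contribute to the weighted ℓ1 distance used in the next iteration.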
Odin framework using WUCDR
As mentioned in Section 2.2, two conditions should be satisfied for a sample to be predicted as a potential affected case by Odin. The first condition is that the nearest neighbor of the sample should not be an unaffected control. Odin uses the ℓ1 distance function for finding the nearest neighbors of any test sample. The second condition is that the input sample should fall inside the area of interest A(c, r) after performing the same dimension reduction mapping using weights w (note that c, r and w are found by the iterative solution of WUCDR).
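The two-condition decision rule can be sketched as a single function. This is a simplified illustration under stated assumptions, not the Odin source: the name `odin_predict` and its signature are hypothetical, labels are assumed to be 1 for affected cases and 0 for unaffected controls, ties for the nearest neighbor are broken by taking the first training sample, and membership in the area of interest is tested with a strict inequality. A return value of False means "no prediction", not "predicted unaffected".

```python
def l1(a, b):
    """Unweighted l1 distance."""
    return sum(abs(ai - bi) for ai, bi in zip(a, b))


def weighted_l1(a, b, w):
    """Weighted l1 distance: sum_i w_i * |a_i - b_i|."""
    return sum(wi * abs(ai - bi) for ai, bi, wi in zip(a, b, w))


def odin_predict(z, train, labels, c, r, w):
    """True only if both conditions hold; False means no prediction is made."""
    # Condition 1: the nearest training sample (plain l1) is an affected case.
    nearest = min(range(len(train)), key=lambda i: l1(z, train[i]))
    if labels[nearest] != 1:
        return False
    # Condition 2: z falls inside the area of interest A(c, r).
    return weighted_l1(z, c, w) < r


# Hypothetical training data: two cases (label 1) and one control (label 0).
train = [(0, 0), (1, 0), (4, 4)]
labels = [1, 1, 0]
c, r, w = (0, 0), 3, (1, 1)

print(odin_predict((1, 1), train, labels, c, r, w))  # both conditions hold
print(odin_predict((3, 3), train, labels, c, r, w))  # nearest neighbor is a control
print(odin_predict((2, 2), train, labels, c, r, w))  # outside the area of interest
```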
3 Results
3.1 Data Summary
We tested Odin for accurate prediction of neurodevelopmental disorders using the LGD (likely gene disruptive) de novo variants from WES and targeted sequenced samples with ASD or ID. Table 1 shows the total number of samples and LGD variants reported from the union of several publications on over 6,000 ASD/ID probands. The union dataset of de novo variants used from these publications can be found in [34].
For building the gene similarity matrix P, which is used to convert the input variant vector for every sample (i.e., zi = xi × P), we have used the combination of coexpression values between two genes during brain development [27] and the difference between the likelihood of observing LGD variants in the population [33]. We can trivially extend the matrix P to include additional data such as tissue-specific networks [42]. We observed that using such a matrix to map the variant vector of each sample into a new space significantly reduces the ℓ1 distance of probands to each other (p < 1.6e−16). This indicates that using such a transformation indeed helps in increasing the prediction power.
Considering the samples in Table 1, there are a few genes with significant recurrence of de novo LGD variants in affected cases while having no de novo variant in unaffected controls. Any prediction model/method for ASD can be trivially extended to predict a sample as an affected case if it has an LGD de novo variant in any of these genes. Thus, in our test data we do not consider any samples with de novo variants in any of the genes that are recurrently mutated in our training data. We call these samples trivial cases/samples and the remaining samples nontrivial cases/samples. Note that there are nine genes with four or more LGD variants in the union of these ASD/ID samples (Table 1). These nine genes are ADNP, ANK2, ARID1B, CHD2, CHD8, DSCAM, DYRK1A, SCN2A and SYNGAP1; any sample with an LGD variant in any of these genes is considered a trivial case to predict and is not considered in our analysis.
3.2 Unicolor clustering with dimension reduction
We first show that the iterative method proposed in Section 2.2.3 for solving the WUCDR problem (number of dimensions selected < 10) does in fact greatly improve the number of cases covered in comparison to the unweighted result (considering all dimensions with weights wi = 1). As shown in Table 2, the optimal result found using the original (unweighted) input was only able to cover 45 cases (only 24 of them nontrivial, i.e., not having LGD variants in recurrently mutated genes). However, the WUCDR approach was able to cover over 71 cases in fewer than five iterations (40 of them not having LGD variants in significantly recurrent mutated genes). Thus, our iterative approach for WUCDR improves the number of affected cases covered by over 60% using fewer than 10 dimensions. We also investigated the “density” of cases inside each selected region. The density was defined as the ratio between the number of affected cases covered and the radius r. We observed that the WUCDR approach improved not only the number of cases covered but also the density, which increased per iteration (see Table 2).
3.3 ASD/ID disease prediction results
We compare the Odin framework with different classification methods in predicting affected ASD cases. We have used the k-NN classifier (with various values of k), support vector machines (SVM) [43], and (lasso and elastic-net) regularization of generalized linear models [44]. We are specifically interested in comparing these methods in predicting ASD in nontrivial cases. We use the leave-one-out (LOO) technique to compare the prediction power of the Odin framework against that of the k-NN and SVM classifiers and the (lasso and elastic-net) generalized linear models. As our stated goal is to keep the false positive prediction of unaffected samples as cases close to zero, we only consider the most conservative results for each method (FDR < 0.01). As can be seen, Odin’s true positive rate for predicting ASD is at least two times higher than the best k-NN result (for different values of k) and significantly higher than that of SVM (Figure 1). It is also significantly higher than that of the different regularized generalized linear models (lasso and elastic net) for different input parameters of α (Glmnet implementation). For each of these tools, we used their intrinsic properties to control/limit the FDR for calculating the TDR. In k-NN we used the difference between the number of affected cases and unaffected controls among the k closest neighbors; for SVM and generalized linear models we used the predicted probability (or distance) given by libSVM [43] or Glmnet [44]. For Odin, the weighted ℓ1 distance of the sample to the selected center was used. Using these values we could calculate the highest true positive rate for each method given the FDR value using LOO cross-validation. Note that in Odin the full set of samples predicted as affected cases will have an FDR of less than 0.01.
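One way to read the comparison procedure above is as a threshold search over per-sample scores: for each method, find the cutoff that maximizes the true positive rate while keeping the FDR below the chosen bound. The sketch below illustrates this under several assumptions that are not stated in the text: FDR is taken to be FP / (TP + FP), a smaller score is taken to mean "more case-like" (as with a distance to the center), and the function name `best_tpr_under_fdr` and all scores are hypothetical.

```python
def best_tpr_under_fdr(case_scores, control_scores, max_fdr):
    """Highest true positive rate over thresholds whose FDR stays below max_fdr.

    A sample is called an affected case when its score is <= the threshold.
    """
    best = 0.0
    # Only thresholds equal to a case score can change the set of true positives.
    for t in sorted(set(case_scores)):
        tp = sum(s <= t for s in case_scores)
        fp = sum(s <= t for s in control_scores)
        # tp >= 1 here because t is itself a case score, so tp + fp > 0.
        if fp / (tp + fp) < max_fdr:
            best = max(best, tp / len(case_scores))
    return best


# Hypothetical leave-one-out scores (e.g., weighted l1 distance to the center).
case_scores = [0.5, 1.0, 1.5, 4.0, 6.0]
control_scores = [3.0, 5.0, 7.0]

tpr = best_tpr_under_fdr(case_scores, control_scores, max_fdr=0.01)
print(tpr)
```

For methods where a larger score is more case-like (such as a predicted probability), the same search applies after negating the scores.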
3.4 Developmental delay disorder (DDD) predictions
In addition to the samples reported in Table 1, a set of over 4,000 trios with developmental delay disorder (DDD) was whole-exome sequenced [45]. Since linear models (i.e., SVM or Glmnet) were not suitable for this prediction problem (as seen in Figure 1), we only compared Odin against the k-NN approach (1 ≤ k ≤ 10), using the dataset in Table 1 as the training set and the (nontrivial) DDD affected cases [45] for testing (Figure 2). Note that we used the parameters from the LOO cross-validation experiment in the previous section (Section 3.3) to control the FDR. Using the ASD/ID samples as training, the Odin method was able to accurately predict a higher fraction of nontrivial DDD probands than the k-NN approach (Figure 2a). We further investigated the overlap between the nontrivial DDD affected samples that were correctly predicted by Odin and by the 1-NN (nearest neighbor) approach (Figure 2b). Interestingly, there is a significant number of samples that were correctly predicted by only one of the methods, which indicates that an approach combining different methods could outperform Odin. Similar results were observed for FDR < 0.01 (Supplementary Figure 1).
3.5 Autism and developmental delay gene prediction and ranking
Odin is a framework to predict with ultra-accuracy a subset of samples that will develop ASD given the de novo variation; however, it can also be used to predict some novel ASD genes. We have utilized Odin to rank all genes for the potential impact of a de novo LGD variant disrupting them. Note that similar to the predictions of ASD/ID for subsets of samples made by Odin, if a gene is not selected does not mean it is not an ASD/ID gene. We used the ASD and Siblings variants (Table 1) as the training data and calculated the weighted ℓ1 distance to the selected center. Ranking all the genes based of the calculated distance, we clearly see an enrichment of known ASD genes closer to the center (Figure 3a). For the set of known ASD genes we used the union of SFARI high-confidence genes and known syndromic genes.
Our analysis also indicated that there are 391 genes where an LGD variant on them will result in a sample falling inside the predicted area of interest (i.e., the inner most sphere/circle in Figure 3a). These 391 genes indicate a set of genes in which Odin has predicted with high probability that their disruption will cause significant (neuro)developmental disorder. Furthermore, there was significant enrichment of LGD variants in these 391 genes in the DDD set (which was not used in the training) versus the ASD set (which was used in the training) as shown in Figure 3b. Interestingly, this clearly indicates that even after normalizing based on expected LGD variants for each disease group the more severe samples tend to be more enriched in LGD variants than their less severe autism samples disrupting these selected 391 genes (Figure 3b). There is also an enrichment of severe de novo missense variants (i.e., with CADD score > 25) disrupting these genes (Figure 3c) in affected autism/ID/DDD probands while no such enrichment is seen for control/sibling samples (Figure 3c).
In the Simons Simplex Collection (SSC), we observed not only that probands with a de novo LGD variant tend to have a lower IQ than probands without de novo LGD variants, but also that probands with a de novo LGD variant disrupting one of the genes in the innermost sphere (the 391 genes) have a lower IQ than other probands with de novo LGD variants (Figure 4a). It is known that there is a large male-to-female bias in autism (estimated to be over 4 to 1). In the SSC there are a total of 2478 male probands and 396 female probands (a ratio of over 6:1). However, among samples with de novo LGD variants in the 391 genes of the innermost sphere, the split is 31 to 16 (a ratio of around 2:1). This indicates that the sex difference is much smaller for ASD samples with de novo LGD variants in the ASD genes predicted by Odin (Figure 4b). We were also interested in whether these top 391 genes show any specific enrichment of expression in the human brain. We used the online tool CSEA (http://genetics.wustl.edu/jdlab/csea-tool-2/) to study the expression profile of these 391 genes. Interestingly, the only significant expression we observed was during early fetal and mid-early fetal development of the brain (Figure 4c). No significant expression of these genes was observed in any tissue of the adult human or mouse brain. Finally, we utilized the predicted probability of observing a missense or LGD de novo variant per sample for each gene [35] to calculate p-values. We could then group these genes based on whether significant de novo LGD and/or missense variants were observed in affected probands (Figure 4d). The set of genes with only significant missense de novo variants observed in cases potentially indicates genes in which an LGD variant would be incompatible with life (i.e., essential genes), whereas a missense mutation can result in a severe (neuro)developmental disorder.
These genes include CSNK2A1, SMARCA4, TRRAP, MORC2, PRPF8, TAF1, CNOT1, SF3B1, SMAD4, UBR5, CLASP1, KDM2B, and U2AF2.
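One way such per-gene p-values and groupings could be computed is sketched below. The sketch assumes a one-sided binomial test, in which each proband independently carries a de novo variant of a given class in the gene with the per-sample probability; the text does not state which test was used, so this choice, the function names, the threshold alpha, and the example numbers are all hypothetical.

```python
import math


def binomial_upper_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p) with 0 < p < 1."""
    if k <= 0:
        return 1.0
    if k > n:
        return 0.0
    # First term of the upper tail, computed in log space for stability.
    log_term = (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log1p(-p)
    )
    term = math.exp(log_term)
    total = 0.0
    for j in range(k, n + 1):
        total += term
        # Ratio between consecutive binomial probabilities.
        term *= (n - j) / (j + 1) * p / (1 - p)
    return min(total, 1.0)


def classify_gene(n_probands, lgd_hits, mis_hits, p_lgd, p_mis, alpha=0.05):
    """Group a gene by which variant classes are significant (alpha is a placeholder)."""
    lgd_sig = binomial_upper_tail(n_probands, lgd_hits, p_lgd) < alpha
    mis_sig = binomial_upper_tail(n_probands, mis_hits, p_mis) < alpha
    if lgd_sig and mis_sig:
        return "LGD and missense"
    if lgd_sig:
        return "LGD only"
    if mis_sig:
        return "missense only"
    return "neither"


# Hypothetical gene: 5 missense and 0 LGD de novo variants in 5000 probands.
group = classify_gene(5000, lgd_hits=0, mis_hits=5, p_lgd=1e-6, p_mis=2e-5)
print(group)
```

In practice the threshold would need to account for the number of genes tested; the default here only keeps the example short.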
3.6 Pathways
We were interested in studying the properties of the samples that Odin correctly predicted as affected cases. We used the tool DAVID (v6.7) [46] to discover enriched GO terms and KEGG pathways for the genes mutated in these samples. For the ASD/ID samples in Table 1, Odin was able to correctly predict the ASD status of samples that have de novo variants in genes in Wnt pathways [27, 47, 48] and in chromatin regulation [18, 35] (Figure 5a). Similarly, correctly predicted DDD study samples [45] had mutations in chromatin modification and transcription regulation genes (Figure 5b).
4 Conclusion and future works
In this paper, we introduced Odin, a framework for early prediction of ASD and related disorders from rare genetic variants. Our initial evaluation on experimental data shows the clear power of this approach for ultra-accurate prediction of ASD using rare genetic variants. The proposed framework can be extended to take into account not only LGD mutations but also missense mutations, which would increase the power of the model to predict a higher percentage of affected cases. As we have shown, there is a clear enrichment of severe missense mutations (CADD score > 25) in genes closer to the predicted center. We can adapt evolutionary-based scores (e.g., CADD score [49] or PolyPhen-2 score [50]) to define an additive summarization function that assigns a disruption score to each gene (i.e., a continuous value rather than the binary value used in this paper). In addition, we can integrate other information, such as protein interaction [51, 52], tissue-specific networks [42], or the regulation of specifically related pathways such as Wnt [48] or mTOR [53], to increase the prediction capability. For the algorithm, we can improve the initial guessed solution (in the first step) of WUCDR (see Section 2.2.3) by utilizing algorithmic techniques from geometry. Finally, the proposed framework can be extended to predict the risk of other neurological disorders, such as schizophrenia, epilepsy, or Alzheimer's disease.
1 Supplementary materials
1.1 Complexity of the UCDR problem
We show that the decision version of the UCDR problem is NP-complete.
Remark 1
Given a set of positive (rational) numbers, the problem of determining whether there exist two disjoint nonempty subsets whose elements sum to the same value is NP-complete [Woeginger, G. J., & Yu, Z. (1992). On the equal-subset-sum problem. Information Processing Letters, 42(6), 299-302].
The problem in Remark 1 is called the “equal subset sum problem”. Notice that the two subsets in a solution do not necessarily form a partition (i.e., there may be elements of the original set that are in neither of the two subsets).
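To make the problem statement concrete, the following is a small brute-force sketch of the equal subset sum problem. It is exponential in the input size and is meant only as an illustration of the definition; the function name and the labeling scheme are choices made for this sketch.

```python
from fractions import Fraction
from itertools import product


def equal_subset_sum(values):
    """Return two disjoint nonempty index lists with equal sums, or None.

    Each element gets a label: 0 (unused), 1 (first subset), 2 (second subset).
    Unused elements are allowed, so the two subsets need not form a partition.
    """
    nums = [Fraction(v) for v in values]  # exact rational arithmetic
    for labels in product((0, 1, 2), repeat=len(nums)):
        first = [i for i, lab in enumerate(labels) if lab == 1]
        second = [i for i, lab in enumerate(labels) if lab == 2]
        if not first or not second:
            continue
        if sum(nums[i] for i in first) == sum(nums[i] for i in second):
            return first, second
    return None


print(equal_subset_sum([1, 2, 3]))  # some pair of index lists with equal sums
print(equal_subset_sum([1, 2, 4]))  # no such pair exists for this input
```

The search tries all 3^n labelings, which is consistent with the problem being hard in general but is fine for a handful of numbers.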
Theorem 2
Given a set of points in an n-dimensional space, where each point is assigned a color, either blue or red, the problem of determining whether there exist a nonempty subset of dimensions and a center point such that no blue point is farther from the center than any red point (under the L1 norm in the reduced-dimension space) is NP-complete. We call this problem the “UCDR decision problem”.
Proof. We will reduce the equal subset sum problem (Remark 1) to a special instance of the UCDR decision problem.
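Assume we are given a set of positive rational numbers $A = \{a_1, a_2, \ldots, a_n\}$. We create two blue points $B_1 = (a_1, a_2, \ldots, a_n)$ and $B_2 = (-a_1, -a_2, \ldots, -a_n)$ and one red point $R = (0, 0, \ldots, 0)$. We consider the UCDR decision problem on the three points $B_1$, $B_2$ and $R$. Suppose that this UCDR decision problem has a solution that includes a dimension subset $I = \{i_1, i_2, \ldots, i_d\} \subseteq \{1, 2, \ldots, n\}$ and a center $C$.
Now we only consider the reduced space with the $d$ dimensions from $I$. We denote $B_1'$, $B_2'$, and $R'$ as the corresponding points of $B_1$, $B_2$, and $R$, respectively, in the reduced space.
Let $H$ be the smallest (by volume) L1 norm ball that has the center $C$ and contains both $B_1'$ and $B_2'$. Thus $B_1'$ or $B_2'$ (or both) must be on a facet of $H$; we can assume $B_1'$ is on a facet of $H$ without loss of generality. Since $H$ is convex and $R' = \tfrac{1}{2}(B_1' + B_2')$, $H$ also contains $R'$. But if $B_2'$ is not on the same facet as $B_1'$, then $R'$ will be inside $H$ and thus $\|R' - C\|_1 < \|B_1' - C\|_1$, which contradicts the requirement that no blue point is farther from the center than the red point. Therefore, $B_1'$, $B_2'$, and $R'$ must all be on the same facet of $H$. Let $F$ be that facet. Since $H$ is an L1 norm ball, any point $X = (x_i)_{i \in I} \in F$ must satisfy an equation that has the form

$$\sum_{i \in I_1} x_i - \sum_{i \in I_2} x_i = s$$

Since $R' = (0, 0, \ldots, 0) \in F$, $s$ must be 0. Thus we can rewrite the equation as $\sum_{i \in I_1} x_i = \sum_{i \in I_2} x_i$, where $I_1 \cap I_2 = \emptyset$ and $I_1 \cup I_2 = I$. Since $B_1' \in F$, we have $\sum_{i \in I_1} a_i = \sum_{i \in I_2} a_i$; both sums range over elements of $A$, which contains positive numbers only, so $I_1 \neq \emptyset$ and $I_2 \neq \emptyset$. Therefore, the pair of sets $\{a_i : i \in I_1\}$ and $\{a_i : i \in I_2\}$ is a solution of the equal subset sum problem for the set $A$.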
Thus, a solution of the UCDR decision problem yields a solution of the equal subset sum problem. Conversely, it is easy to verify that a solution of the equal subset sum problem yields a solution of the UCDR decision problem. Therefore, if we can solve the decision version of UCDR, then we can solve the equal subset sum problem, which is NP-complete (Remark 1). Since it is easy to verify that the UCDR decision problem is in NP, it is also NP-complete.
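The converse direction can be checked numerically with a short sketch. Given index sets I1 and I2 with equal sums, one possible witness (an assumption of this sketch, not a construction stated in the proof) keeps the dimensions in I1 ∪ I2 and places the center at −a_i on the dimensions in I1 and at +a_i on the dimensions in I2. The two blue points and the red point are then all at L1 distance equal to the sum of a_i over I1 ∪ I2 from the center.

```python
from fractions import Fraction


def l1(p, q):
    """Unweighted L1 distance between two points of equal dimension."""
    return sum(abs(x - y) for x, y in zip(p, q))


def ucdr_witness(values, first, second):
    """Build the reduced points B1', B2', R' and a candidate center C.

    `first` and `second` are disjoint index lists (I1 and I2 in the proof).
    The center choice is an assumption of this sketch.
    """
    a = [Fraction(v) for v in values]
    dims = sorted(first + second)
    b1 = [a[i] for i in dims]
    b2 = [-a[i] for i in dims]
    red = [Fraction(0)] * len(dims)
    center = [-a[i] if i in first else a[i] for i in dims]
    return b1, b2, red, center


def blue_not_farther(b1, b2, red, center):
    """True if neither blue point is farther from the center than the red point."""
    d_red = l1(red, center)
    return l1(b1, center) <= d_red and l1(b2, center) <= d_red


# A = {1, 2, 3}: indices [0, 1] and [2] have equal sums (1 + 2 = 3).
print(blue_not_farther(*ucdr_witness([1, 2, 3], [0, 1], [2])))  # True
# Indices [0] and [2] do not have equal sums; this particular center fails.
print(blue_not_farther(*ucdr_witness([1, 2, 3], [0], [2])))     # False
```

The second call only shows that this specific center does not work for unequal sums; the argument that no center can work is the one given in the proof.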
1.2 Developmental delay disorder (DDD) prediction using Odin
We further analyzed Odin’s capability to accurately predict DDD probands with FDR < 0.01, using the ASD/ID data (samples in Table 1) for training.
1.3 Protein interaction enrichment
We investigated how the degree of genes in protein-interaction networks changes with their weighted ℓ1 distance to the center found by Odin. There is an interesting correlation between the distance calculated by Odin for each gene and the average degree of that gene in protein-interaction networks (Supplementary Figure 2).
5 Acknowledgment
We would like to thank Evan E. Eichler, Tychele N. Turner, Phuong Dao, Farhad Hormozdiari, Madeleine Geisheker and Tonia Brown for reading the paper and providing comments.