Forecasting autism gene discovery with machine learning and genome-scale data

Background Genes are one of the most powerful windows into the biology of autism, and it has been estimated that perhaps a thousand or more genes may confer risk. However, less than 100 genes are currently viewed as having robust enough evidence to be considered true "autism genes". Massive genetic studies are underway to produce data to implicate additional genes, but this approach, although necessary, is costly and slow-moving. Methods We approach autism gene discovery as a machine learning problem, rather than a genetic association problem, and use genome-scale data as predictors for identifying further genes that have similar properties in the feature space compared to established autism risk genes. This approach, which we call forecASD, integrates spatiotemporal gene expression, heterogeneous network data, and previous gene-level predictors of autism association into an ensemble classifier that yields a single score that indexes each gene’s evidence for being involved in the etiology of autism. Results We demonstrate that forecASD has substantially increased sensitivity and specificity compared to previous gene-level predictors of autism association, including genetic measures such as TADA. On an independent test set, consisting of newly-released pilot data from the SPARK Genomics Consortium, we show that forecASD best predicts which genes will have an excess of likely gene disrupting (LGD) de novo mutations. We further use independent data from a recent post mortem study of case/control gene expression to show that forecASD is also a significant predictor of genes implicated in ASD through differential expression. Using forecASD results, we show which molecular pathways are currently under-represented in the autism literature and likely represent under-appreciated biological mechanisms of autism. Finally, forecASD correctly predicted 12 of 16 genes implicated at FDR=0.2 by the latest ASD gene discovery study, while also identifying the most likely false positives among the candidate genes. Conclusions These results demonstrate that forecASD bridges the gap between genetic- and expression-based ASD gene discovery, and provides a data-driven replacement to much of the manual filtering and curation that is a critical step in ensuring the robustness of gene discovery studies.


24
Background 25 Genes are one of the most powerful windows into the biology of autism, and it has been 26 estimated that perhaps a thousand or more genes may confer risk. However, less than 27 100 genes are currently viewed as having robust enough evidence to be considered 28 true "autism genes". Massive genetic studies are underway to produce data to 29 implicate additional genes, but this approach, although necessary, is costly and slow-30 moving.

32
We approach autism gene discovery as a machine learning problem, rather than a 33 genetic association problem, and use genome-scale data as predictors for identifying 34 further genes that have similar properties in the feature space compared to established 35 autism risk genes. This approach, which we call forecASD, integrates spatiotemporal 36 gene expression, heterogeneous network data, and previous gene-level predictors of 37 autism association into an ensemble classifier that yields a single score that indexes 38 each gene's evidence for being involved in the etiology of autism.

40
We demonstrate that forecASD has substantially increased sensitivity and specificity 41 compared to previous gene-level predictors of autism association, including genetic 42 measures such as TADA. On an independent test set, consisting of newly-released 43 pilot data from the SPARK Genomics Consortium, we show that forecASD best predicts 44 which genes will have an excess of likely gene disrupting (LGD) de novo mutations. We 45 further use independent data from a recent post mortem study of case/control gene 46 providing a useful genome-wide metric that indicates the evidence of autism 85 involvement for every gene. More recently, machine learning based methods have used 86 gene interaction networks(6) and cell-specific expression profiles(7) to predict gene 87 involvement in autism. Importantly, the results of these studies lead to a quantitative 88 metric that scores every gene in the genome according to evidence of a role in autism. 89 Despite the demonstrated effectiveness of these studies in prioritizing autism risk 90 genes, our preliminary investigations suggested there was still room for appreciable 91 improvement in the form of the classification algorithm, the training set, and the 92 predictors used. In particular, these approaches do not incorporate indicators of autism 93 involvement that are based on genetic association (e.g., TADA scores) into their 94 predictive features.

96
We introduce a new score, forecASD, that integrates prior network-biology approaches, 97 scores of genetic association, brain gene expression, and topological information from 98 large gene interaction networks relevant to the brain into a single gene-level score for 99 autism involvement. We show that forecASD successfully outperforms existing methods 100 in a diverse range of gene and mutation prioritization tasks. Further, using the recent 101 sequencing studies MSSNG(8) and SPARK(9), we show that forecASD generalizes to 102 previously unseen data. Importantly, this generalization holds even when excluding 103 genes with known links to autism, emphasizing forecASD's ability to identify novel ASD 104 genes. We also demonstrate that forecASD correctly predicts 12 of 16 genes implicated 105 at FDR=0.2 by the latest ASD gene discovery study, while also identifying the most 106 likely false positives among the candidate genes. Through comparing the top decile of 107 forecASD identified genes (1787 genes; hereafter forecASD genes) with known autism 108 genes, we identify numerous biological pathways that are currently underrepresented in 109 our understanding of autism risk. By reanalyzing the results of autism brain differential 110 gene expression studies, we show that the current list of known autism genes is 111 significantly depleted for upregulated biological pathways, whereas forecASD captures 112 both up-and downregulated pathways. We show that direction of differential expression 113 is related to haploinsufficiency status, with low pLI genes showing a trend towards 114 upregulation. Importantly, this relationship between direction of differential expression 115 and pLI is dependent on forecASD gene inclusion, signifying forecASD's ability to 116 capture low-and high-pLI disease genes. Through these studies, we show evidence 117 that current methods of autism gene discovery have biases, and that forecASD 118 mitigates these biases through its integrative approach, thus providing a view of the full 119 spectrum of genes and biological pathways underlying autism. Overview:

124
The forecASD method relies upon stacked Random Forest models, organized in two 125 levels (shown in Figure 1). In the first level, two models are trained using BrainSpan(10) 126 gene expression and the STRING(11) shortest paths network as features, respectively.

127
Our training dataset consists of high-confidence genes scored in SFARI gene(4) as 128 either 1 or 2 (SFARI HC genes), and 1,000 random background genes not contained 129 within SFARI gene. These two models produce genome-wide predictions for autism involvement. These scores are then used as features in the second level's Random

131
Forest model, along with other genome-wide scores obtained from previous studies.

133
BrainSpan, STRING, and TADA data assembly 134 BrainSpan data was obtained from the Allen Institute, and brain regions containing 135 fewer than 20 samples were excluded. This filtered BrainSpan dataset was loess-136 smoothed, with the purpose of reducing noise and imputing missing data points.

138
The STRING database(11) was thresholded at their recommended score of 0.4, and 139 transformed into a gene by gene matrix with each cell representing the shortest path 140 between two genes.    Pathway enrichment and comparison with case/control brain gene expression data 185 We used Reactome annotations(13), and unless otherwise noted, PantherDB(14) to 186 assess functional enrichment in both forecASD genes and SFARI HC genes using Reactome pathway satisfying pLI>0.9. Gene-wise and pathway-level comparisons with 194 ASD case/control brain gene expression data were performed using frontal cortex RNA-195 seq summary statistics from Gandal et al.(17). Our preliminary tests showed that both   Figure 2D were taken from Sugathan et al.(19), Darnel et al.(20), 213 and Abrahams et al.(4). P-values were computed by the hypergeometric statistical test 214 of overlap between forecASD genes and these three gene sets. The goal of our approach was to create a gene-wise score that indexes the level of 237 evidence for involvement in ASD using both systems biology (i.e., network and 238 transcriptional data) and genetic features. An initial forecASD systems biology model 239 was built (forecASD:sys) using only BrainSpan expression and the STRING database 240 shortest paths matrices as features. This model was trained on the high confidence set 241 of 76 SFARI genes scoring 1 or 2 (SFARI HC genes), with negative training labels 242 assigned to 1,000 background genes that were not listed in the SFARI gene database.

269
As an initial validation of genes prioritized by forecASD, we tested for an enrichment of 270 gene sets and characteristics well known to be overrepresented in autism genes (Fig. 271 2D). We first performed several overrepresentation tests and found that genes receiving 272 forecASD scores in the top decile (1,787 genes, referred to as forecASD genes) had a 273 significant overlap with known targets of CHD8 (P < 1 x 10 -16 ), FMRP (P < 1 x 10 -16 ), 274 and the full SFARI gene database (P < 1 x 10 -16 ). We next performed a series of 275 functional enrichment tests, comparing forecASD genes to randomly sampled sets of 276 background genes. Text mining in PubMed showed that forecASD genes were 277 significantly overrepresented in abstracts which mention autism (P < 0.001). Given the 278 established role of autism genes early in fetal development, we next tested and found 279 that forecASD genes showed significantly higher rates of coexpression across all 280 regions of the fetal brain (P < 0.001). Lastly, forecASD genes were shown to have 281 significantly enriched rates of interaction in the STRING database (P < 0.001).

283
We next tested the ability of these scores to discriminate both high confidence ( To compare forecASD and prior methods' ability to generalize to new data, we 299 combined two recently released autism genetics resources. Specifically, we used de 300 novo mutations in gene regions from the SPARK(9) and MSSNG(8) cohorts.

301
Importantly, none of our model training used information from these studies, thus any 302 subsequent validation is unbiased.  Table 2). To highlight 321 new biological themes that forecASD detects but that are not clear from the list of 322 SFARI HC genes, we prioritized pathways based on differential enrichment (Fig. 4).

339
Lastly, forecASD genes were loaded into the STRING network and clustered using a 340 greedy hierarchical approach which maximizes the modularity score. The resulting 341 networks consisted of 17 clusters composed of 1452 genes. All clusters were found to 342 be significantly enriched with numerous GO and KEGG pathways (Supplemental Table   343 3). Similarly, all clusters contained a significant enrichment of haploinsufficiency genes 344 (pLI > 0.5), except for the small cluster of 31 genes with functions related to the 345 mediator complex. Clusters were also tested for overlap with SFARI HC genes, of which

365
Because it draws upon multiple approaches for identifying autism genes, forecASD is 366 less biased than gene discovery based only on one form of data (e.g., genetic data).

367
This is particularly important because current SFARI HC genes, which rely heavily on 368 studies of de novo mutation, are strongly biased towards genes that are loss-of-function 369 intolerant (Fig. 5A). While these haploinsufficient genes represent a sizable and 370 important component of genetic risk for autism, this ascertainment bias has led to 371 molecular "blind spots" that will not be resolved simply by sequencing more probands

389
In our analyses, we noted a trend that is a potential bridge between gene discovery 390 studies based on DNA sequence variants and those based on differential expression.

402
Consequently, we propose that this pLI-expression relationship is a hallmark of robust 403 ASD risk genes, and may be used as a criterion when identifying optimal thresholds in 404 genome-wide scores like forecASD. Indeed, although initially chosen as a convenient 405 but arbitrary threshold for identifying a discrete set of ASD candidate genes, the top 406 decile proved to be the optimal split point for forecASD, maximizing the significance of 407 the pLI/t-statistic relationship among candidate genes, while minimizing the same 408 relationship in the remaining, non-candidate genes. Interestingly, when applying this 409 approach to TADA FDR values, although TADA-implicated genes showed the expected 410 pLI-expression relationship, no TADA threshold was able to eliminate the trend from 411 non-candidate genes, suggesting lower sensitivity in identifying ASD risk genes 412 compared to forecASD. Taken together, these analyses demonstrate that the reduced 413 bias in forecASD contributes to increased sensitivity to autism risk pathways identified in 414 gene expression studies (Fig. 4C,4D) as well as those implicated by genetic studies 415 (Fig. 3).

439
Other pathways were identified by forecASD as significantly enriched for autism risk, but 440 were not represented at all among SFARI HC genes (Supplemental Table 2, Figure 4B).

441
Consequently, we expect that new insights into the molecular basis of autism will come 442 disproportionately from these pathways as their constituent genes are associated with 443 autism. One gene set in particular, potassium channels, showed highly significant 444 enrichment in forecASD genes (OR=4.1, P=7.2x10 -9 , N=35 genes) despite the absence 445 of potassium channel genes among currently accepted autism risk genes. However, the 446 literature shows support for a role for potassium channels in ASD risk(30) ,(31),(32),(33) , and 447 the pathway was enriched for differential regulation in a recently published brain gene 448 expression study of autism (P=0.001, downregulated)(17). Notably, this pathway has a 449 lower proportion of genes with pLI>0.9 (0.22) compared to SFARI HC gene-implicated 450 pathways (median=0.47), potentially explaining its absence due to ascertainment bias. 451 Overall, pathways that demonstrated forecASD-specific excess enrichment showed a 452 significant agreement with pathway enrichment from independent case/control brain

469
During the development of forecASD, another ASD gene prediction method was 470 published(36), ASD-FRN, which utilizes a brain-specific functional network. We 471 evaluated this method using the performance benchmarks presented in Figure 3, and found it to have performance comparable to DAMAGES and Krishnan,et al. 473 (Supplemental Figure 1 and Supplemental Figure 2), but in none of these tests did it 474 surpass the performance observed using forecASD as an ASD gene predictor. We also 475 added ASD-FRN to the ensemble that comprises forecASD, but we did not observe a 476 significant increase in predictive performance, suggesting that the latent information in 477 ASD-FRN is already accounted for in the forecASD ensemble as presented here.

478
Our use of TADA to impart genetic association information to the forecASD ensemble is 479 unique among the ASD gene prediction approaches we used as benchmarks. However, 480 this raises concerns about the potential for circularity: TADA is emerging as the most 481 popular way to compute and update gene-wise genetic association statistics for ASD 482 studies, and previous TADA scores are strongly correlated with updated TADA scores.

483
Furthermore, TADA scores are among the most important predictive features in the 484 forecASD ensemble (Fig. 2). Consequently, it would be concerning if forecASD's ability 485 to predict new ASD genes was due entirely to the inclusion of previously published 486 TADA scores. To examine this possibility, we fit bivariate logistic regression models, 487 always including TADA rankings as a covariate, with Krishnan,DAMAGES,488 or forecASD rankings as the predictor of membership among SFARI Gene score 3 489 (Supplemental Figure 2; score 3 genes were not used in training forecASD, but are 490 nevertheless assumed to be enriched for true ASD genes). All other genes listed in the 491 SFARI Gene database were removed from the analysis, and the remainder of the 492 genome was considered the negative class. Strong association with score 3 493 membership (as measured by the regression coefficient and Z-score) is a desirable trait 494 for an ASD gene prediction method, and forecASD proved to be the most strongly 495 associated of the tested methods, even after correcting for the information imparted by 496 TADA rankings (Z=11.9, P=8.9x10 -33 ).

497
Rather than simply parroting gene prioritization according to TADA, forecASD purifies 498 TADA's signal by imposing biological plausibility through the inclusion of other predictive 499 features, such as network and expression data, as well as previously published 500 machine learning approaches to ASD gene prediction. To illustrate this filtering effect 501 that forecASD has on TADA rankings, we performed a GSEA on the differences in gene  These kinds of genes are routinely manually removed from consideration even in the 511 presence of statistical support, because they often lack biological plausibility. forecASD 512 accomplishes this same task, but in a more automated, data-driven, and objective way.

513
One bias in forecASD uncovered by the above analysis, and that is intuitive based on 514 forecASD's reliance on functional data, is that forecASD tends to de-prioritize genes 515 that are poorly annotated or lack strong signals in expression or other functional data 516 sets (in the above GSEA, forecASD-preferred genes were depleted for "Uncategorized" 517 at P=6.8x10 -9 ). On one hand, this limits forecASD's ability to make entirely unexpected discoveries among poorly characterized genes. On the other hand, the probability of 519 highly important genes completely evading decades of neuroscience investigation is 520 diminishing, and penalizing these "unassuming" genes may be justifiable. Nevertheless, 521 the value of new discovery should not be discounted, and this will be an area of active 522 investigation in the ongoing development of forecASD. We introduce a new model, forecASD, for prioritizing autism-associated risk genes. We

551
show that forecASD has significantly improved sensitivity and specificity compared to 552 previous gene-level predictors, including the purely genetic-based approach TADA. We 553 demonstrate forecASD's usefulness as an autism risk-gene discovery post-hoc filter 554 through its ability to prioritize 12 of 16 genes implicated at FDR=0.2 by the latest ASD 555 gene discovery study. Using forecASD prioritized risk-genes, we highlight which LGD-likely gene disrupting, DNM-de novo mutation, TADA-Transmission And De novo 560 Association, DE-differential expression, GSEA-gene set enrichment analysis 561 Ethics approval and consent to participate: All human genetic data used in this study 563 was accessed in a deidentified manner from associated sequencing consortia with their 564 approval. Due to the method of access, this study is not formally considered human 565 subjects research and therefore not subject to associated restrictions, as outlined by the 566 National Institutes of Health. 567 Consent for publication: Not applicable.

568
Availability of data and material: All data and code used to generate the forecASD 569 model is available for download from the repository associated with this project, 570 https://github.com/LeoBman/forecASD.

571
Competing interests: The authors declare that they have no competing interests.   We are grateful to all of the families in SPARK, the SPARK clinical sites and SPARK 583 staff. We appreciate obtaining access to exome sequencing and phenotypic data on 584 SFARI Base. Approved researchers can obtain the SPARK population dataset 585 described in this study by applying at https://base.sfari.org. when combined with a genetic measure of autism association (a). Building the full forecASD model, we test all features for their informativeness, finding that the STRING 606 score is primary (b). Using the three mentioned scores, we assess their genome-wide 607 ranking of SFARI genes at all levels, and find that the full forecASD model at least ties, 608 and often significantly outperforms TADA and forecASD:sys in the prioritization of 609 SFARI genes (c). As an initial assessment of forecASD prioritized genes, we find the represented, but not enriched in the SFARI HC list (a). Other pathways were highly enriched in forecASD genes that were not represented at all in the SFARI HC list, even 634 though they have associated literature suggesting a role in autism (b). forecASD is 635 more sensitive than SFARI HC to pathways that are differentially regulated in the brains 636 of individuals with autism, particularly in ASD-upregulated pathways (c), but also in 637 downregulated pathways (d). Using the top decile of TADA -log10 FDR genes showed 638 similar sensitivity to SFARI HC (not shown), suggesting that rare variant approaches 639 may be less sensitive in implicating genes found through gene expression studies. while forecASD is significantly less biased. We found a significant relationship between 645 pLI and differential expression (DE) in the brains of autism cases (b), such that low pLI 646 genes tend toward upregulation in cases, while high pLI genes tend to be 647 downregulated. We also observed a significant interaction between forecASD and pLI 648 such that the observed pLI-DE trend (b) is absent in non-forecASD genes (c), and 649 present and significant among forecASD genes (d). We propose that the presence of 650 the pLI-DE trend is a hallmark of ASD risk genes, and an optimal ASD gene 651 prioritization method will concentrate the trend among risk genes and remove it from 652 non-risk genes. Notably, no threshold of TADA (tested to the 50 th percentile) was able to 653 remove the trend from the non-prioritized genes, suggesting the persistence of residual 654 risk genes that were not selected.