Abstract
Autism spectrum disorder (ASD) has clinically and genetically heterogeneous characteristics. Here, we show a two-step genome-wide association study (GWAS). In the first step, we observed no significant associations in a GWAS including 597 cases and 370 controls. In the second step, we conducted a cluster analysis using k-means with 15 clusters based on Autism Diagnostic Interview-Revised (ADI-R) scores and history of vitamin treatment. We then conducted GWAS by each subgroup of cases vs all controls (cluster-based GWAS) and identified significant associations with 93 chromosomal loci that satisfied the genome-wide significance threshold of P<5.0×10−8. These loci included previously reported candidate genes for ASD: CDH9, MED13L, SOX5, CADM2, CADM1, DAB1, SEMA5A, RORA, MED13, COBL, EPHA7, HIF1AN, ICE1, PML, and WNT7B. We observed that clustering-based GWAS, even with a smaller sample size, revealed abundant significant associations. These findings suggest that clustering may successfully identify subgroups that are aetiologically more homogeneous.
Introduction
Autism spectrum disorder (ASD) has heterogeneous characteristics, in terms of both phenotypic features and genetics. Clinically, ASD is mainly characterized by difficulties in communication and repetitive behaviours1, but ASD also shows many other symptoms2. Regarding genetics, previous studies have not consistently identified relatively common genetic variants that are associated with an increased risk of ASDs3, although several lines of evidence suggest that strong genetic components contribute to susceptibility to ASDs. Concordance rates of ASDs are higher in monozygotic twins (92%) than in dizygotic twins (10%)4. The sibling recurrence risk ratio (λs) is 22 for ASD5. The Human Gene module of the Simons Foundation Autism Research Initiative (SFARI) Gene serves as a comprehensive, up-to-date reference for all known human genes associated with ASD6 and currently lists ∼1,000 genes that have potential links to ASD, indicating the heterogeneity of ASD. In addition to phenotypic and genotypic heterogeneity, ASD shows heterogeneous responses to interventions. Several kinds of pharmacological treatments have been suggested, but the effects of these treatments are controversial7.
If the heterogeneous phenotypes and responses to treatment in some way correspond to differences in genotype, grouping persons with ASD according to phenotypic variables may increase the chances of identifying common genetic susceptibility factors. A simulation study demonstrated that analysis of case subsets could be a powerful strategy to uncover some of the hidden heritability of common complex disorders8. Several studies of ASD, Alzheimer’s disease, neuroticism, or asthma indicated that items or symptoms were to some degree useful for identifying subgroups that are genetically more homogeneous than the broadly defined diseases9–12. In recent years, ASD has been investigated using machine learning methods13, 14. Machine learning employs artificial intelligence techniques to discover useful hidden patterns. Clustering algorithms of machine learning could produce novel and potentially more homogeneous clusters, but these algorithms using phenotypic variables have not, to the best of our knowledge, been applied to subgrouping multifactorial diseases to date.
In the present study, we explored whether grouping persons with ASD by applying clustering algorithms to phenotypic and treatment-related variables can identify more genetically homogeneous subgroups of persons with ASD. We applied the machine learning k-means15 or affinity propagation (AP)16 algorithm for the cluster analysis. Based on these clusters, we conducted genome-wide association studies (GWASs). We used genetic data to evaluate whether our clusters identify biologically homogeneous subgroups.
Results
Clustering
We used phenotypic variables, history of treatment, and genome-wide genotypic data from the Simons Simplex Collection (SSC)17, the largest cohort of autism simplex families amassed to date. The SSC is a core project and resource of the SFARI6.
To classify persons with ASD into more homogeneous subgroups, we conducted cluster analyses using phenotypic variables of Autism Diagnostic Interview-Revised (ADI-R)18 scores and history of vitamin treatment. We chose these variables because the ADI-R is one of the most reliable estimates of ASD and has the ability to evaluate substructure domains of ASD. Among the treatments19, we selected the variable history of vitamin treatment because we recently found that a cluster of persons with ASD is associated with potential responsiveness to vitamin B6 treatment20, 21. The history of treatment is not always compatible with responsiveness, but we considered that continuous treatment indicates responsiveness to some degree. The SSC dataset includes history of treatment but not variables of responsiveness.
We used k-means15 or AP16 algorithms. The k-means algorithm requires cluster numbers determined by researchers. AP algorithms do not need a priori cluster numbers; rather, the algorithm itself finds the appropriate one. When using k-means algorithms, we chose 2, 3, 4, 5, 10, 15, and 20 clusters. Interestingly, we observed that the AP analysis classified the participants into 36 groups.
Cluster-based genome-wide association study
GWASs were applied to male ASD probands and their unaffected brothers. In the first step, we conducted GWAS for all 597 male probands vs all 370 unaffected brothers using the sib transmission/disequilibrium test (sib-TDT)22. We observed no significant associations (Fig. 1).
In the second step, we conducted GWAS by each subgroup of the probands vs unaffected brothers as controls without the brothers of the members of the subgroup being analysed (cluster-based GWAS) (Fig. 2) using k-means or AP algorithms. We applied the Cochran-Armitage trend test23, 24 and Fisher’s exact test25 to both algorithms. Notably, we observed that the number of genome-wide significant loci increased as the number of clusters increased when the Cochran-Armitage trend test was applied (Table 1). In contrast, when Fisher’s exact test was applied, zero to three significant loci were observed for numbers of clusters between two and 36. Two reasons may explain the difference in the results between the two tests. The first is the difference in analysis methods for the genetic case-control data. The Cochran-Armitage trend test examines the risk of disease in those who do not have the allele of interest, those who have a single copy, and those who are homozygous. Fisher’s exact test examines the allele frequency in cases and controls. The disease model and mode of inheritance may influence the difference, although those of ASD are largely unknown26, 27. Our data might indicate that a case-control study of ASD should be analysed by genotype. The second is the conservative nature of Fisher’s exact test. The quantile-quantile (Q-Q) plots of the cluster-based GWAS with 20 clusters by k-means using Fisher’s exact test demonstrated that almost all observed p-values were high compared to the expected distribution of p-values. In addition, genomic inflation factor (λ) values ranged from 0.615 to 0.738, and the average was 0.683, which was very small compared to one (Table 1). We therefore regarded the Cochran-Armitage trend test to be a more appropriate method in the present cluster-based GWAS.
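The contrast between the two tests can be illustrated with a minimal Python sketch. It assumes additive genotype weights (0, 1, 2) for the trend test and a two-sided Fisher's exact test on the 2×2 table of allele counts, following the description above. The function names and the genotype counts are hypothetical, and the analyses in this study were run in PLINK rather than with this code.

```python
import math

def trend_test(cases, controls, weights=(0, 1, 2)):
    """Cochran-Armitage trend test on genotype counts.

    cases/controls: counts of individuals carrying 0, 1 and 2 copies
    of the allele of interest. Returns (chi-square, p-value), 1 df.
    """
    n = [r + s for r, s in zip(cases, controls)]
    R, S = sum(cases), sum(controls)
    N = R + S
    t_r = sum(t * r for t, r in zip(weights, cases))
    t_n = sum(t * k for t, k in zip(weights, n))
    t2_n = sum(t * t * k for t, k in zip(weights, n))
    chi2 = N * (N * t_r - R * t_n) ** 2 / (R * S * (N * t2_n - t_n ** 2))
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    return chi2, math.erfc(math.sqrt(chi2 / 2))

def allelic_fisher(cases, controls):
    """Two-sided Fisher's exact test on the 2x2 table of allele counts."""
    a = cases[1] + 2 * cases[2]        # alleles of interest in cases
    b = 2 * cases[0] + cases[1]        # other alleles in cases
    c = controls[1] + 2 * controls[2]  # alleles of interest in controls
    d = 2 * controls[0] + controls[1]  # other alleles in controls
    n, row1, col1 = a + b + c + d, a + b, a + c

    def prob(x):
        # Hypergeometric probability of x alleles of interest among case alleles.
        return math.comb(col1, x) * math.comb(n - col1, row1 - x) / math.comb(n, row1)

    p_obs = prob(a)
    lo, hi = max(0, row1 - (n - col1)), min(row1, col1)
    p = sum(q for q in map(prob, range(lo, hi + 1)) if q <= p_obs * (1 + 1e-9))
    return min(1.0, p)

# Hypothetical genotype counts for one SNP (0, 1, 2 copies of the allele).
cases, controls = (10, 40, 50), (100, 80, 20)
chi2, p_trend = trend_test(cases, controls)
p_fisher = allelic_fisher(cases, controls)
print(f"trend chi2={chi2:.2f}, p={p_trend:.2e}; allelic Fisher p={p_fisher:.2e}")
```

The trend test works on individuals grouped by genotype, whereas the allelic test counts each chromosome separately, which is the distinction drawn in the paragraph above.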
Regarding appropriate cluster numbers, we compared the Q-Q plots and λ values among the analyses and observed that as the number of clusters increased, the observed p-values were lower than the expected distribution of p-values. For instance, the Q-Q plots for the cluster-based GWAS with 20 clusters by k-means using the Cochran-Armitage trend test demonstrated that the observed p-values were very low compared to the expected distribution of p-values. In addition, λ values ranged from 1.022 to 1.093, and the average was 1.054 (Table 1), indicating that the rate of false positives was relatively high. Previous studies suggest that, empirically, an inflation factor λ of less than 1.050 is deemed safe for avoiding false positives28–30.
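For readers unfamiliar with λ, the following Python sketch shows the conventional definition: the median of the observed 1-degree-of-freedom chi-square statistics divided by the median expected under the null hypothesis (about 0.455). It assumes that each p-value comes from a 1-degree-of-freedom test; the function name and the simulated p-values are hypothetical and are not the study data.

```python
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 degree of freedom (about 0.455).
CHI2_1DF_MEDIAN = NormalDist().inv_cdf(0.75) ** 2

def genomic_inflation(p_values):
    """Genomic inflation factor from p-values in the interval (0, 1]."""
    std_normal = NormalDist()
    # Convert each two-sided p-value back to a 1-df chi-square statistic.
    chi2 = [std_normal.inv_cdf(p / 2) ** 2 for p in p_values]
    return median(chi2) / CHI2_1DF_MEDIAN

# Evenly spaced p-values mimic the null expectation, so lambda is close to 1.
null_like = [i / 1000 for i in range(1, 1000)]
print(round(genomic_inflation(null_like), 3))
```

Values above 1 mean that the observed p-values are, on the whole, smaller than expected, and values below 1 mean that they are larger, as in the Fisher's exact test results reported above.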
In contrast, inflation factor λ values of the cluster-based GWAS with 15 clusters by k-means ranged from 1.018 to 1.065, and the average was 1.043, which was below 1.050 (Table 1 and Fig. 3).
According to the above results, we considered the cluster-based GWAS with 15 clusters by k-means using the Cochran-Armitage trend test to be the most appropriate approach to the present dataset. The characteristics of each cluster are presented in Table 2.
Our results indicate that clustering by specific phenotypic variables might be informative and provide the best model for identifying aetiologically similar cases of ASD.
Gene interpretation
Among the cluster-based GWASs, we mainly presented here the results using the Cochran-Armitage trend test by k-means with 15 clusters. In this cluster-based GWAS, we identified significant associations with 93 chromosomal loci that satisfied the genome-wide significance threshold of P < 5.0 × 10−8 (Table 1 and Fig. 3), and this cluster-based GWAS demonstrates that a total of 93 single nucleotide polymorphisms (SNPs), including 45 intragenic and 48 intergenic SNPs, satisfied the genome-wide significance threshold (Table 3). Among them, 9 genes corresponded to the Human Gene module of the SFARI Gene scoring system6; CDH9 (score 4) in Cluster 3; MED13L (score 2, Rare Single Gene Mutation, Syndromic) in Clusters 7 and 13; SOX5 (Rare Single Gene Mutation, Syndromic, Genetic Association) in Cluster 9; CADM2 (score 4) in Cluster 9; CADM1 (score 4, Rare Single Gene Mutation) in Cluster 10; DAB1 (score 5) in Cluster 11; SEMA5A (score 3) in Cluster 12; RORA (Rare Single Gene Mutation, Syndromic, Genetic Association, Functional) in Cluster 13; and MED13 (score 2, syndromic) in Cluster 15.
The SFARI Gene scoring system ranges from “Category 1”, which indicates “high confidence”, through “Category 6”, which denotes “evidence does not support a role”. Genes predisposing to autism in the context of a syndromic disorder (e.g., fragile X syndrome) are placed in a separate category. Rare single gene variants, disruptions/mutations, and sub-microscopic deletions/duplications directly linked to ASD are placed in “Rare Single Gene Mutation”. The relatively high correspondence between part of our results and the SFARI Gene scoring system indicates that the statistically significant loci we found may indeed be associated with ASD subgroups.
In addition to genes in the Human Gene module of the SFARI Gene, several important genes associated with ASD or other related disorders31, 32 from previous reports were included in our findings as follows: COBL in Cluster 12, EPHA7 in Cluster 3, HIF1AN in Cluster 4, ICE1 in Cluster 2, PML in Cluster 15, and WNT7B in Cluster 8 previously reported with ASD33–38; LHPP in Cluster 7 previously reported with depression39; KIDINS220 in Cluster 7 previously reported with intellectual disability40; ALPL in Cluster 6 previously reported with deleterious neurological outcome41; and PAX2 in Cluster 4 previously reported with development of the central nervous system42. These findings suggest that the statistically significant SNPs might explain autistic symptoms because these diseases are suggested to share common aetiology, even in part, with ASD31, 32. Associations at the remaining significant loci that were not in the SFARI module or described above have not been previously reported, and to the best of our knowledge, some of them might be novel findings, although further confirmation is needed.
Replication study
To further validate the associations identified in the GWASs, we performed replication studies on another independent dataset from SSC, 1Mv3. In the first step, we conducted GWAS for all 712 male probands vs all 354 unaffected brothers using the sib-TDT test, and we observed no significant associations.
In the second step, we classified the male probands by k-means into 15 clusters and conducted GWAS for each subgroup vs the unaffected brothers as controls without the siblings of the members of the subgroup being analysed using the Cochran-Armitage trend test23, 24. We observed that the number of genome-wide significant loci slightly increased as the number of clusters increased (Supplementary Table S1), as observed with the Omni2.5 data set. In this cluster-based GWAS using the Cochran-Armitage trend test by k-means with 15 clusters, we identified significant associations at 8 chromosomal loci that satisfied the genome-wide significance threshold of P < 5.0 × 10−8. Furthermore, this cluster-based GWAS demonstrated that a total of 8 SNPs, including 5 intragenic and 3 intergenic SNPs, satisfied the genome-wide significance threshold (Supplementary Table S2).
Between the results from the Omni2.5 and 1Mv3 datasets, we observed no consistent genes that displayed genome-wide significance, although a consistent increase in the number of genome-wide significant loci as the number of clusters increased was observed. One possible explanation might be the extremely heterogeneous features of the ASD genotype. If more than 1,000 genes are involved in ASD6, each analysis with 15 clusters, which compares fewer than one hundred cases with several hundred controls, could find different genes.
Discussion
To the best of our knowledge, this is the first study to demonstrate that grouping persons with ASD using clustering algorithms is useful to discriminate more genetically homogeneous ASD persons. We observed many statistically significant SNPs, which is consistent with the findings from previous studies, and significant high odds ratios and corresponding reasonable lambda values, indicating our results indeed have reasonable validity.
Previous studies regarding ASD, Alzheimer’s disease, neuroticism, or asthma found that analyses based on individual items or symptoms yielded, to some degree, larger odds ratios at the associated loci than previous studies using broadly defined disease diagnoses9–12. These findings may indicate that a GWAS based on a symptom or an item could identify genetically more homogeneous subgroups, and they led us to hypothesize that a reasonable combination of symptoms or items could also identify more genetically homogeneous subgroups. Clustering algorithms could produce essentially homogeneous clusters. To the best of our knowledge, these algorithms using phenotypic variables have not been applied for subgrouping multifactorial diseases to date. The present study demonstrates that clustering is one successful approach to identifying more homogeneous subgroups.
Selection of variables is a critical issue in conducting clustering analysis. In this study, we focused on ADI-R variables and treatment, which have been indicated as candidates in previous studies18, 20, 21. We believe this protocol is an appropriate way of identifying subgroups of ASD. Nevertheless, further clustering utilizing other variables is warranted because ASD is highly heterogeneous and there are many variables for evaluating ASD symptoms. We can obtain many kinds of clusters from various views, and the ultimate cluster is the individuals themselves because every person has different genetic factors; however, we believe that one of the goals of clustering is the identification of subgroups based on treatment responsiveness, which may indicate the implementation of precision medicine for ASD.
AP is a relatively recently developed unsupervised machine learning clustering algorithm that identifies clusters of similar points using a set of points and a set of similarity values between the points and provides a representative example, called an exemplar, for each cluster16. We identified 36 clusters and 1,253 significant loci using the AP analysis, but our data also showed that the lambda values ranged from 1.032 to 1.093, with an average lambda value of 1.076 (Table 1). Although AP is a useful algorithm to identify clusters, the lambda values exceeded the appropriate threshold, i.e., less than 1.050, necessary to avoid false positives28–30. Therefore, the observed significant loci might include both true positives and false positives and we selected here the Cochran-Armitage trend test.
One of the most important findings of our study was that reasonably decreasing the sample size could increase the statistical power. A plausible explanation is that our clustering may have successfully identified subgroups that are aetiologically more homogeneous. To date, genetic studies have been conducted with huge sample sizes and have found modest to moderate impacts of genetic factors on multifactorial diseases, called missing heritability43. The present study indicates that the reason for the observed modest effects in previous genetic studies may be disease heterogeneity because we observed several significantly high odds ratios. Our approach using clustering algorithms in machine learning methods may be a breakthrough approach for dealing with the issue of missing heritability and for identifying disease architectures. GWAS with a larger sample size is useful, but our data indicate that another strategy, such as clustering by phenotype, may also be useful.
Our data strongly highlight the relevance of cluster-based GWAS as a means to identify more homogeneous subgroups of ASD than broadly defined ASD. The present study may provide clues to discover the aetiologies of ASD as well as those of other multifactorial diseases.
Methods
We conducted the present study in accordance with the guidelines of the Declaration of Helsinki44 and all other applicable guidelines. The protocol was reviewed and approved by the institutional review board of Tohoku University Graduate School of Medicine, and written informed consent from all participants was obtained by the Simons Foundation Autism Research Initiative (SFARI)17. For participants under the age of 18 years, SFARI obtained informed consent from a parent and/or legal guardian. Additionally, for participants 10 to 17 years of age, SFARI obtained informed assent from the individuals.
Datasets
We used phenotypic variables, history of treatment, and genome-wide genotypic data from the Simons Simplex Collection (SSC)17, the largest cohort of autism simplex families amassed to date. The SSC establishes a repository of genetic samples from simplex families.
The SSC data were publicly released in October 2007 and are directly available from the SFARI. From the SSC dataset, we used data from 614 affected white male child or adult probands who had no missing information about ADI-R scores and vitamin treatment and 391 unaffected brothers for whom Omni2.5 array data were available for subsequent clustering and genetic analyses. We excluded participants whose ancestries were estimated to be different from those of the other participants using principal component analyses (PCAs) performed with EIGENSOFT version 7.2.1 (refs. 45, 46). We also performed PCA for the genotype data in our study. Based on the PCAs, we excluded data beyond 4 standard deviations of principal components 1 or 2 (Supplementary Fig. S1). Therefore, we used data from 597 probands and 370 unaffected siblings.
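The exclusion rule can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the standard deviations are computed once over all samples (the text does not say whether the rule was applied once or iteratively), and the function name, sample identifiers, and scores are hypothetical.

```python
from statistics import mean, stdev

def pca_inliers(pc_scores, n_sd=4.0):
    """Return IDs whose PC1 and PC2 scores both lie within n_sd SDs of the mean.

    pc_scores: dict mapping sample ID -> (pc1, pc2).
    """
    keep = set(pc_scores)
    for axis in (0, 1):
        values = [score[axis] for score in pc_scores.values()]
        centre, spread = mean(values), stdev(values)
        keep -= {
            sample
            for sample, score in pc_scores.items()
            if abs(score[axis] - centre) > n_sd * spread
        }
    return keep

# Hypothetical scores: 200 tightly grouped samples and one distant sample.
scores = {f"s{i}": ((i % 10) * 0.01, (i % 7) * 0.01) for i in range(200)}
scores["distant"] = (5.0, 0.0)
kept = pca_inliers(scores)
print(len(kept), "distant" in kept)
```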
In the replication study, we used the SSC 1Mv3 dataset. In the dataset, data from 735 affected male child or adult probands with no missing information about ADI-R scores and vitamin treatment and 387 unaffected child or adult male siblings were available. After conducting PCA, we excluded data beyond 4 standard deviations of principal components 1 or 2 as outliers. Therefore, we used data from 712 probands and 354 unaffected siblings in the replication study.
Cluster analysis
In the cluster analysis, we used phenotypic variables of the Autism Diagnostic Interview-Revised (ADI-R) score and treatment18. Among ADI-R scores, “The total score for the Verbal Communication Domain on the ADI-R algorithm minus the total score for the Nonverbal Communication Domain on the ADI-R algorithm”, “The total score for the Nonverbal Communication Domain on the ADI-R algorithm”, “The total score for the Restricted, Repetitive, and Stereotyped Patterns of Behavior Domain on the ADI-R algorithms”, and “The total score for the Reciprocal Social Interaction Domain on the ADI-R algorithms” were included in the preprocessed dataset. Among the histories of treatments, the use of vitamins, though it does not guarantee effectiveness, was also included in the preprocessed dataset because we recently found that a cluster of persons with ASD is associated with potential responsiveness to vitamin B6 treatment21.
We applied machine learning k-means15 or affinity propagation (AP)16 algorithms to conduct a cluster analysis to divide the dataset including data from ASD persons into subgroups using phenotype variables and history of treatment. The k-means algorithm requires cluster numbers determined by researchers. AP algorithms do not need a priori cluster numbers, as the algorithm itself finds the appropriate number. When using k-means algorithms, we chose 2, 3, 4, 5, 10, 15, and 20 clusters. The ordinary k-means algorithm was first applied to the preprocessed dataset to divide the participants into more homogeneous subgroups15. Then, we used the relatively recently developed AP algorithm16. AP is an unsupervised clustering analysis using a message-passing-based algorithm. In the present study, AP was performed without diagonal components using a damping factor of 0.9. These analyses were performed with the scikit-learn toolkit in Python 2.7 (Supplementary Information S1, Supplementary Information S2 and Supplementary Information S3)47.
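As an illustration only, the two clustering steps might look like the following sketch written against a current Python 3 release of scikit-learn rather than the Python 2.7 environment used in the study. The feature matrix is synthetic (four score columns standing in for the ADI-R variables plus a binary vitamin-treatment flag), and the random seeds, `n_init`, `max_iter`, and the absence of feature scaling are assumptions. The damping value of 0.9 is taken from the text; the handling of the diagonal components is not reproduced here, and the original scripts are in the Supplementary Information.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation, KMeans

rng = np.random.default_rng(0)

# Synthetic stand-in for the preprocessed dataset (not the SSC data):
# four score columns and one binary history-of-vitamin-treatment column.
n = 300
scores = rng.integers(0, 30, size=(n, 4)).astype(float)
vitamin = rng.integers(0, 2, size=(n, 1)).astype(float)
features = np.hstack([scores, vitamin])

# k-means needs the number of clusters in advance; try each value in turn.
kmeans_labels = {
    k: KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
    for k in (2, 3, 4, 5, 10, 15, 20)
}

# Affinity propagation chooses the number of clusters itself.
ap = AffinityPropagation(damping=0.9, max_iter=1000, random_state=0).fit(features)
ap_labels = ap.labels_
n_ap_clusters = len(ap.cluster_centers_indices_)

print({k: len(set(labels)) for k, labels in kmeans_labels.items()})
print("AP clusters:", n_ap_clusters)
```

Each entry of `kmeans_labels` assigns every proband to one subgroup, and these labels are what a cluster-based GWAS would use to define the case sets.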
The cluster analyses described above were performed in the replication study as well.
Genotype data and quality control
We used the SSC dataset, in which probands and unaffected siblings had already been genotyped in other previous studies17, 48. In the discovery-stage genome-wide association study (GWAS), all members of each family were analysed on the same array version, the Illumina HumanOmni2.5, which has approximately 2,450,000 probes. We excluded SNPs with a minor allele frequency (MAF) < 0.01, call rate < 0.95, and Hardy-Weinberg equilibrium test P < 0.000001 and obtained genotype data for 1000 participants in SSC.
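The three exclusion criteria can be expressed per SNP as in the Python sketch below. It is a simplified illustration: genotypes are coded as 0, 1 or 2 copies of one allele with `None` for a missing call, and Hardy-Weinberg equilibrium is checked with a 1-degree-of-freedom chi-square approximation, which is an assumption made for brevity and may differ from the test implemented in the software actually used. The function name and the example genotypes are hypothetical.

```python
import math

def snp_passes_qc(genotypes, min_maf=0.01, min_call_rate=0.95, min_hwe_p=1e-6):
    """Apply call-rate, MAF and Hardy-Weinberg filters to one SNP."""
    called = [g for g in genotypes if g is not None]
    if not called or len(called) / len(genotypes) < min_call_rate:
        return False
    n = len(called)
    freq = sum(called) / (2 * n)          # frequency of the counted allele
    if min(freq, 1 - freq) < min_maf:     # minor allele frequency
        return False
    # Expected genotype counts under Hardy-Weinberg equilibrium.
    expected = [n * (1 - freq) ** 2, 2 * n * freq * (1 - freq), n * freq ** 2]
    observed = [called.count(g) for g in (0, 1, 2)]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    hwe_p = math.erfc(math.sqrt(chi2 / 2))  # chi-square upper tail, 1 df
    return hwe_p >= min_hwe_p

# Hypothetical SNP in exact Hardy-Weinberg proportions (allele frequency 0.3).
balanced = [0] * 490 + [1] * 420 + [2] * 90
print(snp_passes_qc(balanced))
```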
In the replication study, we used genotyping data generated using the Illumina BeadChip in the SSC 1Mv3 datasets. We applied the same quality control criteria as those used in the discovery-stage GWAS.
Statistical analysis
In the discovery and replication studies, GWASs were applied to ASD probands and unaffected siblings. In the first step, we conducted a GWAS for all male probands vs all unaffected male siblings using the sib-TDT. In the second step, we conducted a GWAS by each subgroup of the male probands vs unaffected male siblings, excluding the siblings of the members of the subgroup being analysed (cluster-based GWAS), with subgroups defined by the k-means15 or AP16 algorithm. We applied the Cochran-Armitage trend test23, 24 and Fisher’s exact test25 to the clusters from both algorithms. Details of the study design are also indicated in Fig. 2.
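The construction of the case and control sets in the second step can be sketched in Python as follows. The sketch assumes that each proband and each unaffected sibling carries a family identifier; the function name, the dictionaries, and the identifiers are hypothetical.

```python
def cluster_case_control_sets(proband_cluster, proband_family, brother_family):
    """Build (cases, controls) for every cluster.

    proband_cluster: proband ID -> cluster label
    proband_family:  proband ID -> family ID
    brother_family:  unaffected sibling ID -> family ID
    Controls are all unaffected siblings except those whose family
    contains a proband in the cluster being analysed.
    """
    designs = {}
    for cluster in sorted(set(proband_cluster.values())):
        cases = sorted(p for p, c in proband_cluster.items() if c == cluster)
        case_families = {proband_family[p] for p in cases}
        controls = sorted(
            b for b, family in brother_family.items() if family not in case_families
        )
        designs[cluster] = (cases, controls)
    return designs

# Toy example with three probands in two clusters and three unaffected siblings.
proband_cluster = {"p1": 0, "p2": 0, "p3": 1}
proband_family = {"p1": "f1", "p2": "f2", "p3": "f3"}
brother_family = {"b1": "f1", "b3": "f3", "b4": "f4"}
designs = cluster_case_control_sets(proband_cluster, proband_family, brother_family)
print(designs)
```

Each cluster therefore has a small case set and a large control set that overlaps heavily with the control sets of the other clusters.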
Association analyses were performed in PLINK version 1.07 (ref. 49) and version 1.9 (ref. 50). The detected SNPs were subsequently annotated using ANNOVAR51. Manhattan plots and Q-Q plots were generated using the ‘qqman’ package in R version 3.0.2.
Data availability
All the data used in the study are available only to those granted access by the Simons Foundation.
Author information
These authors contributed equally: Akira Narita, Masato Nagai, and Satoshi Mizuno.
Author contributions
A.N., M.N., S.M., S.O., G.T. and S.K. designed the study. M.N. and S.K. conducted the clustering analyses. A.N., M.N., S.M., S.O. and G.T. conducted the GWAS. A.N., M.N., S.M., S.O., G.T. and S.K. drafted the manuscript. M.U., R.S., S.M., T.O., M.I., C.Y., H.M., Y.K., K.M., T.K., M.K., T.U., H.O., A.H., M.K., H.M., and S.K. helped with the interpretation of data. A.N., M.N., S.M., S.O., G.T., M.U., R.S., S.M., T.O., M.I., C.Y., H.M., Y.K., K.M., T.K., M.K., T.U., H.O., A.H., M.K., H.M., S.K., and S.K. edited the manuscript and gave intellectually critical contributions to it.
Competing interests
The authors declare no competing interests.
Acknowledgements
We are grateful to all of the families at the participating Simons Simplex Collection (SSC) sites, as well as the staff at the Simons Foundation Autism Research Initiative (SFARI). The present study was supported by the Ministry of Education, Culture, Sports, Science and Technology (MEXT) KAKENHI Grant Numbers 19390171 and 16H05242. MEXT had no role in the design or execution of the study.