Clustering by phenotype and genome-wide association study in autism

Akira Narita; Masato Nagai; Satoshi Mizuno; Soichi Ogishima; Gen Tamiya; Masao Ueki; Rieko Sakurai; Satoshi Makino; Taku Obara; Mami Ishikuro; Chizuru Yamanaka; Hiroko Matsubara; Yasutaka Kuniyoshi; Keiko Murakami; Tomoko Kobayashi; Mika Kobayashi; Takuma Usuzaki; Hisashi Ohseto; Atsushi Hozawa; Masahiro Kikuya; Hirohito Metoki; Shigeo Kure; Shinichi Kuriyama

doi:10.1101/614958

Abstract

Autism spectrum disorder (ASD) has clinically and genetically heterogeneous characteristics. Here, we show a two-step genome-wide association study (GWAS). In the first step, we observed no significant associations in a GWAS including 597 cases and 370 controls. In the second step, we conducted a cluster analysis using k-means with 15 clusters based on Autism Diagnostic Interview-Revised (ADI-R) scores and history of vitamin treatment. We then conducted GWAS by each subgroup of cases vs all controls (cluster-based GWAS) and identified significant associations with 93 chromosomal loci that satisfied the genome-wide significance threshold of P<5.0×10⁻⁸. These loci included previously reported candidate genes for ASD: CDH9, MED13L, SOX5, CADM2, CADM1, DAB1, SEMA5A, RORA, MED13, COBL, EPHA7, HIF1AN, ICE1, PML, and WNT7B. We observed that clustering-based GWAS, even with a smaller sample size, revealed abundant significant associations. These findings suggest that clustering may successfully identify subgroups that are aetiologically more homogeneous.

Introduction

Autism spectrum disorder (ASD) has heterogeneous characteristics, in terms of both phenotypic features and genetics. Clinically, ASD is mainly characterized by difficulties in communication and repetitive behaviours¹, but ASD also shows many other symptoms². Regarding genetics, previous studies have not consistently identified relatively common genetic variants that are associated with an increased risk of ASDs³, although several lines of evidence suggest that strong genetic components contribute to the susceptibility to ASDs. There are higher concordance rates of ASDs in monozygotic twins (92%) than in dizygotic twins (10%)⁴. The sibling recurrence risk ratio (λs) is 22 for ASD⁵. The Human Gene module of the Simons Foundation Autism Research Initiative (SFARI) Gene serves as a comprehensive, up-to-date reference for all known human genes associated with ASD⁶ and currently lists ∼1,000 genes that have potential links to ASD, indicating the heterogeneity of ASD. In addition to the phenotype and genotype heterogeneities, ASD shows heterogeneous responses to interventions. Several kinds of pharmacological treatments have been suggested, but the effects of these treatments are controversial⁷.

If the heterogeneous phenotypes and responses to treatment in some way correspond to differences in genotype, grouping persons with ASD according to phenotypic variables may increase the chances of identifying common genetic susceptibility factors. A simulation study demonstrated that analysis of case subsets could be a powerful strategy to uncover some of the hidden heritability of common complex disorders⁸. Several studies of ASD, Alzheimer’s disease, neuroticism, or asthma indicated that items or symptoms were in some degree useful to identify more genetically homogeneous subgroups of these diseases than broadly defined ones^9–12. In recent years, ASD has been investigated using machine learning methods^{13, 14}. Machine learning employs artificial intelligence techniques to discover useful masked patterns. Clustering algorithms of machine learning could make novel and potentially more homogeneous clusters, but these algorithms using phenotypic variables have not, to the best of our knowledge, been applied to subgrouping multifactorial diseases to date.

In the present study, we explored whether grouping persons with ASD using clustering algorithms applied to phenotypic and treatment-response variables can identify genetically more homogeneous subgroups of persons with ASD. We applied the machine learning k-means¹⁵ or affinity propagation (AP)¹⁶ algorithms for the cluster analysis. Based on these clusters, we conducted genome-wide association studies (GWASs). We used genetic data to evaluate whether our clusters identify biologically homogeneous subgroups.

Results

Clustering

We used phenotypic variables, history of treatment, and genome-wide genotypic data from the Simons Simplex Collection (SSC)¹⁷, the largest cohort of autism simplex families amassed to date. The SSC is a core project and resource of the SFARI⁶.

To classify persons with ASD into more homogeneous subgroups, we conducted cluster analyses using phenotypic variables of Autism Diagnostic Interview-Revised (ADI-R)¹⁸ scores and history of vitamin treatment. We chose these variables because the ADI-R is one of the most reliable estimates of ASD and has the ability to evaluate substructure domains of ASD. Among the treatments¹⁹, we selected the variable history of vitamin treatment because we recently found that a cluster of persons with ASD is associated with potential responsiveness to vitamin B6 treatment^{20, 21}. The history of treatment is not always compatible with responsiveness, but we considered that continuous treatment indicates responsiveness to some degree. The SSC dataset includes history of treatment but not variables of responsiveness.

We used k-means¹⁵ or AP¹⁶ algorithms. The k-means algorithm requires cluster numbers determined by researchers. AP algorithms do not need a priori cluster numbers; rather, the algorithm itself finds the appropriate one. When using k-means algorithms, we chose 2, 3, 4, 5, 10, 15, and 20 clusters. Interestingly, we observed that the AP analysis classified the participants into 36 groups.

Cluster-based genome-wide association study

GWASs were applied to male ASD probands and their unaffected brothers. In the first step, we conducted GWAS for all 597 male probands vs all 370 unaffected brothers using the sib transmission/disequilibrium test (sib-TDT)²². We observed no significant associations (Fig. 1).

Fig. 1. Manhattan plots (a) and corresponding quantile-quantile plots (b) of the GWAS for all male probands vs all unaffected male siblings using the sib transmission/disequilibrium test.

We conducted GWAS for all 597 male probands vs all 370 unaffected brothers using the sib transmission/disequilibrium test (sib-TDT). We observed no significant associations in this GWAS. The dotted line indicates the threshold for genome-wide significance (P < 5.0 × 10⁻⁸).

In the second step, we conducted GWAS by each subgroup of the probands vs unaffected brothers as controls without the brothers of the members of the subgroup being analysed (cluster-based GWAS) (Fig. 2) using k-means or AP algorithms. We applied the Cochran-Armitage trend test^{23, 24} and Fisher’s exact test²⁵ to both algorithms. Notably, we observed that the number of genome-wide significant loci increased as the number of clusters increased when the Cochran-Armitage trend test was applied (Table 1). In contrast, when Fisher’s exact test was applied, zero to three significant loci were observed for numbers of clusters between two and 36. Two reasons may explain the difference in the results between the two tests. The first is the difference in analysis methods for the genetic case-control data. The Cochran-Armitage trend test examines the risk of disease in those who do not have the allele of interest, those who have a single copy, and those who are homozygous. Fisher’s exact test examines the allele frequency in cases and controls. The disease model and mode of inheritance may influence the difference, although those of ASD are largely unknown^{26, 27}. Our data might indicate that a case-control study of ASD should be analysed by genotype. The second is the conservative nature of Fisher’s exact test. The quantile-quantile (Q-Q) plots of the cluster-based GWAS with 20 clusters by k-means using Fisher’s exact test demonstrated that almost all observed p-values were high compared to the expected distribution of p-values. In addition, genomic inflation factor (λ) values ranged from 0.615 to 0.738, and the average was 0.683, which was very small compared to one (Table 1). We therefore regarded the Cochran-Armitage trend test to be a more appropriate method in the present cluster-based GWAS.
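The contrast between the two tests can be made concrete with a minimal sketch. The Python code below is an illustration under stated assumptions, not the PLINK implementation used in this study: it computes the Cochran-Armitage trend statistic from genotype counts with additive scores (0, 1, 2) and a two-sided Fisher's exact test on the corresponding 2 × 2 allele-count table. The function names and the argument layout are hypothetical.

```python
from math import comb, erfc, sqrt


def cochran_armitage_trend(cases, controls, weights=(0, 1, 2)):
    """Return (chi2, p) for the 1-df trend test on genotype counts.

    cases / controls: numbers of samples carrying 0, 1 and 2 copies of the
    allele of interest. weights: additive scores (an assumption of this sketch).
    """
    totals = [r + s for r, s in zip(cases, controls)]
    n_cases, n_controls = sum(cases), sum(controls)
    n = n_cases + n_controls
    sum_wr = sum(w * r for w, r in zip(weights, cases))
    sum_wn = sum(w * t for w, t in zip(weights, totals))
    sum_w2n = sum(w * w * t for w, t in zip(weights, totals))
    denom = n_cases * n_controls * (n * sum_w2n - sum_wn ** 2)
    if denom == 0:  # monomorphic SNP or an empty group: no evidence either way
        return 0.0, 1.0
    chi2 = n * (n * sum_wr - n_cases * sum_wn) ** 2 / denom
    # Upper tail of a chi-square distribution with one degree of freedom.
    return chi2, erfc(sqrt(chi2 / 2.0))


def allelic_fisher_exact(cases, controls):
    """Two-sided Fisher's exact p-value on the 2 x 2 table of allele counts."""
    a = 2 * cases[2] + cases[1]        # allele of interest in cases
    b = 2 * cases[0] + cases[1]        # other allele in cases
    c = 2 * controls[2] + controls[1]  # allele of interest in controls
    d = 2 * controls[0] + controls[1]  # other allele in controls
    row1, row2, col1 = a + b, c + d, a + c
    n_tables = comb(row1 + row2, col1)

    def pmf(x):
        # Hypergeometric probability of x copies of the allele among cases.
        return comb(row1, x) * comb(row2, col1 - x) / n_tables

    p_obs = pmf(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    p = sum(pmf(x) for x in range(lo, hi + 1) if pmf(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)
```

The trend test scores each individual by genotype, whereas the allelic test counts each chromosome separately, so the two functions can return different p-values for the same SNP.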

Fig. 2. Methods of GWAS according to each subgroup of the probands vs the unaffected brothers as controls without the brothers of the members of the subgroup being analysed in the present study.

We refer to a GWAS of each subgroup of the probands vs the unaffected brothers as controls, excluding the brothers of the members of that subgroup, as “cluster-based GWAS”. This panel shows the detailed methods of the cluster-based GWAS in the present study.

Table 1.

Number of genome-wide significant loci for each clustering algorithm and test method using the Omni2.5 dataset after excluding SNPs with MAF < 0.01

Regarding appropriate cluster numbers, we compared the Q-Q plots and λ values among the analyses and observed that as the number of clusters increased, the observed p-values were lower than the expected distribution of p-values. For instance, the Q-Q plots for the cluster-based GWAS with 20 clusters by k-means using the Cochran-Armitage trend test demonstrated that the observed p-values were very low compared to the expected distribution of p-values. In addition, λ values ranged from 1.022 to 1.093, and the average was 1.054 (Table 1), indicating that the rate of false positives was relatively high. Several lines of evidence suggest that regarding an appropriate threshold of inflation factor λ, empirically, a value of less than 1.050 is deemed safe for avoiding false positives^28–30.

In contrast, inflation factor λ values of the cluster-based GWAS with 15 clusters by k-means ranged from 1.018 to 1.065, and the average was 1.043, which was below 1.050 (Table 1 and Fig. 3).
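For reference, λ is commonly computed as the median of the observed 1-df χ² statistics divided by the median of the χ² distribution with one degree of freedom (approximately 0.455). The following Python sketch assumes that definition and takes a list of per-SNP p-values as input; the function names are illustrative and are not part of PLINK or any other tool used here.

```python
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()
# Median of a chi-square distribution with 1 df (about 0.455).
EXPECTED_MEDIAN_CHI2 = _STD_NORMAL.inv_cdf(0.75) ** 2


def p_to_chi2(p):
    """Convert a two-sided p-value into the matching 1-df chi-square statistic."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p-values must lie in (0, 1]")
    # Use the lower tail so that very small p-values do not round to 1.
    return _STD_NORMAL.inv_cdf(p / 2.0) ** 2


def genomic_inflation(p_values):
    """Median observed chi-square divided by the expected median under the null."""
    observed = [p_to_chi2(p) for p in p_values]
    return median(observed) / EXPECTED_MEDIAN_CHI2
```

Under this definition, evenly spread p-values give λ close to 1; values well above 1 point to inflation, and values well below 1 (as seen with Fisher's exact test in Table 1) point to a conservative test.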

Fig. 3. Manhattan plots (a) and corresponding quantile-quantile plots (b) of the cluster-based GWAS for male probands in each cluster vs unaffected male siblings, excluding the siblings of the probands in that cluster, using k-means with 15 clusters and the Cochran-Armitage trend test.

We conducted GWAS according to each subgroup of the probands vs the unaffected brothers as controls without the brothers of the members of the subgroup being analysed (cluster-based GWAS) using the k-means with 15 clusters and the Cochran-Armitage trend test. Among 15 clusters, significant associations were observed in 14 clusters. In total, we identified significant associations in 93 chromosomal loci that satisfied the genome-wide significance threshold of P < 5.0 × 10⁻⁸. The genetic loci that were previously reported candidate genes for ASD and satisfied the genome-wide significance threshold are labelled. The dotted line indicates the threshold for genome-wide significance (P < 5.0 × 10⁻⁸).

According to the above results, we considered the cluster-based GWAS with 15 clusters by k-means using the Cochran-Armitage trend test to be the most appropriate approach to the present dataset. The characteristics of each cluster are presented in Table 2.

Table 2.

The characteristics of each of 15 clusters by k-means in the Omni2.5 dataset

Our results indicate that clustering by specific phenotypic variables might be informative and provide the best model for identifying aetiologically similar cases of ASD.

Gene interpretation

Among the cluster-based GWASs, we mainly presented here the results using the Cochran-Armitage trend test by k-means with 15 clusters. In this cluster-based GWAS, we identified significant associations with 93 chromosomal loci that satisfied the genome-wide significance threshold of P < 5.0 × 10⁻⁸ (Table 1 and Fig. 3), and this cluster-based GWAS demonstrates that a total of 93 single nucleotide polymorphisms (SNPs), including 45 intragenic and 48 intergenic SNPs, satisfied the genome-wide significance threshold (Table 3). Among them, 9 genes corresponded to the Human Gene module of the SFARI Gene scoring system⁶; CDH9 (score 4) in Cluster 3; MED13L (score 2, Rare Single Gene Mutation, Syndromic) in Clusters 7 and 13; SOX5 (Rare Single Gene Mutation, Syndromic, Genetic Association) in Cluster 9; CADM2 (score 4) in Cluster 9; CADM1 (score 4, Rare Single Gene Mutation) in Cluster 10; DAB1 (score 5) in Cluster 11; SEMA5A (score 3) in Cluster 12; RORA (Rare Single Gene Mutation, Syndromic, Genetic Association, Functional) in Cluster 13; and MED13 (score 2, syndromic) in Cluster 15.

Table 3.

Association table of the cluster-based GWASs in 15 clusters by k-means in the Omni2.5 dataset

The SFARI Gene scoring system ranges from “Category 1”, which indicates “high confidence”, through “Category 6”, which denotes “evidence does not support a role”. Genes predisposing to autism in the context of a syndromic disorder (e.g., fragile X syndrome) are placed in a separate category. Rare single gene variants, disruptions/mutations, and sub-microscopic deletions/duplications directly linked to ASD are placed in “Rare Single Gene Mutation”. The relatively high, although partial, correspondence between our results and the SFARI Gene scoring system indicates that the statistically significant loci we found may indeed be associated with ASD subgroups.

In addition to genes in the Human Gene module of the SFARI Gene, several important genes associated with ASD or other related disorders^{31, 32} from previous reports were included in our findings as follows: COBL in Cluster 12, EPHA7 in Cluster 3, HIF1AN in Cluster 4, ICE1 in Cluster 2, PML in Cluster 15, and WNT7B in Cluster 8 previously reported with ASD^33–38; LHPP in Cluster 7 previously reported with depression³⁹; KIDINS220 in Cluster 7 previously reported with intellectual disability⁴⁰; ALPL in Cluster 6 previously reported with deleterious neurological outcome⁴¹; and PAX2 in Cluster 4 previously reported with development of the central nervous system⁴². These findings suggest that the statistically significant SNPs might explain autistic symptoms because these diseases are suggested to share common aetiology, even in part, with ASD^{31, 32}. Associations at the remaining significant loci that were not in the SFARI module or described above have not been previously reported, and to the best of our knowledge, some of them might be novel findings, although further confirmation is needed.

Replication study

To further validate the associations identified in the GWASs, we performed replication studies on another independent dataset from SSC, 1Mv3. In the first step, we conducted GWAS for all 712 male probands vs all 354 unaffected brothers using the sib-TDT test, and we observed no significant associations.

In the second step, we classified the male probands by k-means into 15 clusters and conducted GWAS for each subgroup vs the unaffected brothers as controls without the siblings of the members of the subgroup being analysed using the Cochran-Armitage trend test^{23, 24}. We observed that the number of genome-wide significant loci slightly increased as the number of clusters increased (Supplementary Table S1), as observed with the Omni2.5 data set. In this cluster-based GWAS using the Cochran-Armitage trend test by k-means with 15 clusters, we identified significant associations at 8 chromosomal loci that satisfied the genome-wide significance threshold of P < 5.0 × 10⁻⁸. Furthermore, this cluster-based GWAS demonstrated that a total of 8 SNPs, including 5 intragenic and 3 intergenic SNPs, satisfied the genome-wide significance threshold (Supplementary Table S2).

Between the results from the Omni2.5 and 1Mv3 datasets, we observed no consistent genes that displayed genome-wide significance, although a consistent increase in the number of genome-wide significant loci as the numbers of clusters increased was observed. One possible explanation might be the extremely heterogeneous features of the ASD genotype. If more than 1,000 genes are involved in ASD⁶, each analysis of fewer than one hundred cases vs several hundred controls across 15 clusters could find different genes.

Discussion

To the best of our knowledge, this is the first study to demonstrate that grouping persons with ASD using clustering algorithms is useful to discriminate more genetically homogeneous ASD persons. We observed many statistically significant SNPs, which is consistent with the findings from previous studies, and significant high odds ratios and corresponding reasonable lambda values, indicating our results indeed have reasonable validity.

Previous studies of ASD, Alzheimer’s disease, neuroticism, or asthma found that analyses based on individual items or symptoms yielded, to some degree, larger case-control odds ratios at the associated loci than previous studies using broadly defined disease diagnoses^9–12. These findings may indicate that GWAS with a symptom or an item could identify genetically more homogeneous subgroups, and they led us to hypothesize that a reasonable combination of symptoms or items could do the same. Clustering algorithms could produce essentially homogeneous clusters. To the best of our knowledge, these algorithms using phenotypic variables have not been applied for subgrouping multifactorial diseases to date. The present study demonstrates that clustering is one of the successful approaches to identifying more homogeneous subgroups.

Selection of variables is a critical issue in conducting clustering analysis. In this study, we focused on ADI-R variables and treatment, which have been indicated as candidates in previous studies^{18, 20, 21}. We believe this protocol is an appropriate way of identifying subgroups of ASD. Nevertheless, further clustering utilizing other variables is warranted because ASD is highly heterogeneous and there are many variables for evaluating ASD symptoms. We can obtain many kinds of clusters from various views, and the ultimate cluster is the individuals themselves because every person has different genetic factors; however, we believe that one of the goals of clustering is the identification of subgroups based on treatment responsiveness, which may indicate the implementation of precision medicine for ASD.

AP is a relatively recently developed unsupervised machine learning clustering algorithm that identifies clusters of similar points using a set of points and a set of similarity values between the points and provides a representative example, called an exemplar, for each cluster¹⁶. We identified 36 clusters and 1,253 significant loci using the AP analysis, but our data also showed that the lambda values ranged from 1.032 to 1.093, with an average lambda value of 1.076 (Table 1). Although AP is a useful algorithm to identify clusters, the lambda values exceeded the appropriate threshold, i.e., less than 1.050, necessary to avoid false positives^28–30. Therefore, the observed significant loci might include both true positives and false positives and we selected here the Cochran-Armitage trend test.

One of the most important findings of our study was that reasonably decreasing the sample size could increase the statistical power. A plausible explanation is that our clustering may have successfully identified subgroups that are aetiologically more homogeneous. To date, genetic studies have been conducted with huge sample sizes and have found modest to moderate impacts of genetic factors on multifactorial diseases, called missing heritability⁴³. The present study indicates that the reason for the observed modest effects in previous genetic studies may be disease heterogeneity because we observed several significantly high odds ratios. Our approach using clustering algorithms in machine learning methods may be a breakthrough approach for dealing with the issue of missing heritability and for identifying disease architectures. GWAS with a larger sample size is useful, but our data indicate that another strategy, such as clustering by phenotype, may also be useful.

Our data strongly highlight the relevance of cluster-based GWAS as a means to identify more homogeneous subgroups of ASD than broadly defined ASD. The present study may provide clues to discover the aetiologies of ASD as well as those of other multifactorial diseases.

Methods

We conducted the present study in accordance with the guidelines of the Declaration of Helsinki⁴⁴ and all other applicable guidelines. The protocol was reviewed and approved by the institutional review board of Tohoku University Graduate School of Medicine, and written informed consent from all participants was obtained by the Simons Foundation Autism Research Initiative (SFARI)¹⁷. For participants under the age of 18 years, SFARI obtained informed consent from a parent and/or legal guardian. Additionally, for participants 10 to 17 years of age, SFARI obtained informed assent from the individuals themselves.

Datasets

The SSC data were publicly released in October 2007 and are directly available from the SFARI. From the SSC dataset, we used data from 614 affected white male child or adult probands who had no missing information about ADI-R scores and vitamin treatment and 391 unaffected brothers for whom Omni2.5 array data were available for subsequent clustering and genetic analyses. We excluded participants whose ancestries were estimated to be different from the other participants using principal component analyses (PCAs) performed by EIGENSOFT version 7.2.1^{45, 46}. We also performed PCA for the genotype data in our study. Based on the PCAs, we excluded data beyond 4 standard deviations of principal components 1 or 2 (Supplementary Fig. S1). Therefore, we used data from 597 probands and 370 unaffected siblings.

In the replication study, we used the SSC 1Mv3 dataset. In the dataset, data from 735 affected male child or adult probands with no missing information about ADI-R scores and vitamin treatment and 387 unaffected child or adult male siblings were available. After conducting PCA, we excluded data beyond 4 standard deviations of principal components 1 or 2 as outliers. Therefore, we used data from 712 probands and 354 unaffected siblings in the replication study.
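The outlier rule described above can be sketched as follows. This Python fragment assumes that the 4-standard-deviation rule is applied to the distance from the sample mean on each of the first two principal components, and that the PC scores have already been exported into a dictionary keyed by sample ID; the data layout and the function name are hypothetical and do not reproduce EIGENSOFT itself.

```python
from statistics import mean, stdev


def pca_inliers(pc_scores, n_sd=4.0, n_components=2):
    """Return the sample IDs lying within n_sd standard deviations on every PC.

    pc_scores: dict mapping sample ID -> sequence of PC scores (PC1, PC2, ...).
    A sample is dropped if it is an outlier on PC1 or on PC2.
    """
    ids = list(pc_scores)
    keep = set(ids)
    for k in range(n_components):
        values = [pc_scores[i][k] for i in ids]
        centre, spread = mean(values), stdev(values)
        if spread == 0:
            continue  # no variation on this component, nothing to exclude
        for i in ids:
            if abs(pc_scores[i][k] - centre) > n_sd * spread:
                keep.discard(i)
    # Preserve the original sample order.
    return [i for i in ids if i in keep]
```

In this sketch the mean and standard deviation are computed once over all samples, including the eventual outliers; an iterative scheme that recomputes them after each round of removal would be a different design choice.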

Cluster analysis

In the cluster analysis, we used phenotypic variables of the Autism Diagnostic Interview-Revised (ADI-R) score and treatment¹⁸. Among ADI-R scores, “The total score for the Verbal Communication Domain on the ADI-R algorithm minus the total score for the Nonverbal Communication Domain on the ADI-R algorithm”, “The total score for the Nonverbal Communication Domain on the ADI-R algorithm”, “The total score for the Restricted, Repetitive, and Stereotyped Patterns of Behavior Domain on the ADI-R algorithms”, and “The total score for the Reciprocal Social Interaction Domain on the ADI-R algorithms” were included in the preprocessed dataset. Among the histories of treatments, the use of vitamins, though it does not guarantee effectiveness, was also included in the preprocessed dataset because we recently found that a cluster of persons with ASD is associated with potential responsiveness to vitamin B6 treatment²¹.

We applied machine learning k-means¹⁵ or affinity propagation (AP)¹⁶ algorithms to conduct a cluster analysis to divide the dataset including data from ASD persons into subgroups using phenotype variables and history of treatment. The k-means algorithm requires cluster numbers determined by researchers. AP algorithms do not need a priori cluster numbers, as the algorithm itself finds the appropriate number. When using k-means algorithms, we chose 2, 3, 4, 5, 10, 15, and 20 clusters. The ordinary k-means algorithm was first applied to the preprocessed dataset to divide the participants into more homogeneous subgroups¹⁵. Then, we used the relatively recently developed AP algorithm¹⁶. AP is an unsupervised clustering analysis using a message-passing-based algorithm. In the present study, AP was performed without diagonal components using a damping factor of 0.9. These analyses were performed with the scikit-learn toolkit in Python 2.7 (Supplementary Information S1, Supplementary Information S2 and Supplementary Information S3)⁴⁷.
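To illustrate the k-means step, the sketch below implements a plain Lloyd-style iteration in standard-library Python. It is a stand-in for illustration only and is not the scikit-learn routine used in this study: the farthest-first initialisation is chosen here purely to make the example deterministic, and the five-column row layout (four ADI-R domain scores followed by a 0/1 vitamin-history flag) is an assumption about how the preprocessed dataset might be arranged.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=100):
    """Cluster points into k groups; returns (labels, centroids)."""
    # Farthest-first initialisation (an assumption of this sketch).
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        centroids.append(list(far))

    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids
```

A call such as `kmeans(rows, 15)` would return one cluster label per proband, which could then be used to define the case subgroups for the cluster-based GWAS.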

The cluster analyses described above were performed in the replication study as well.

Genotype data and quality control

We used the SSC dataset, in which probands and unaffected siblings had already been genotyped in other previous studies^{17, 48}. In the discovery-stage genome-wide association study (GWAS), all members of each family were analysed on the same array version, the Illumina HumanOmni2.5, which has approximately 2,450,000 probes. We excluded SNPs with a minor allele frequency (MAF) < 0.01, call rate < 0.95, and Hardy-Weinberg equilibrium test P < 0.000001 and obtained genotype data for 1000 participants in SSC.

In the replication study, we used genotyping data generated using the Illumina BeadChip in the SSC 1Mv3 datasets. We applied the same quality control criteria as those used in the discovery-stage GWAS.
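The three quality control thresholds can be expressed as a single per-SNP filter. The Python sketch below is illustrative only: it takes genotype counts for one biallelic SNP and uses a 1-df χ² goodness-of-fit test for Hardy-Weinberg equilibrium, which is an approximation chosen for brevity and may differ from the test that PLINK applies. The function names and argument order are hypothetical.

```python
from math import erfc, sqrt


def hwe_chi2_p(n_aa, n_ab, n_bb):
    """Approximate Hardy-Weinberg p-value from a 1-df chi-square test."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 1.0  # monomorphic SNP: no departure can be measured
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return erfc(sqrt(chi2 / 2.0))


def passes_qc(n_aa, n_ab, n_bb, n_missing,
              maf_min=0.01, call_rate_min=0.95, hwe_p_min=1e-6):
    """True if the SNP survives the MAF, call-rate and HWE thresholds."""
    called = n_aa + n_ab + n_bb
    if called == 0:
        return False
    call_rate = called / (called + n_missing)
    p = (2 * n_aa + n_ab) / (2 * called)
    maf = min(p, 1.0 - p)
    if maf < maf_min or call_rate < call_rate_min:
        return False
    return hwe_chi2_p(n_aa, n_ab, n_bb) >= hwe_p_min
```

The default thresholds mirror those stated above (MAF < 0.01, call rate < 0.95, and Hardy-Weinberg P < 0.000001 lead to exclusion).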

Statistical analysis

In the discovery studies and in the replication studies, GWAS were applied to ASD probands and unaffected siblings. In the first step, we conducted a GWAS for all male probands vs all unaffected male siblings using sib-TDT analyses. The first step association test was the sib-TDT for all cases and controls. In the second step, we conducted a GWAS by each subgroup of the male probands vs unaffected male siblings without the siblings of the members of the subgroup being analysed (cluster-based GWAS) using k-means¹⁵ or AP¹⁶ algorithms. We applied the Cochran-Armitage trend test^{23, 24} and Fisher’s exact test²⁵ to both algorithms. Details of the study design are also indicated in Fig. 2.
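The case and control sets for each cluster-based GWAS can be assembled as in the following sketch. It assumes that probands and unaffected brothers are linked through a shared family identifier; the dictionary layout, the identifiers, and the function name are hypothetical.

```python
from collections import defaultdict


def cluster_case_control_sets(proband_cluster, brother_families):
    """Build (cases, controls) family lists for each cluster.

    proband_cluster: dict mapping family ID -> cluster label of its proband.
    brother_families: family IDs that have an unaffected brother.
    Controls for a cluster are all unaffected brothers except those whose
    own sibling is a case in that cluster.
    """
    cases_by_cluster = defaultdict(list)
    for family_id, cluster in proband_cluster.items():
        cases_by_cluster[cluster].append(family_id)

    designs = {}
    for cluster, case_families in cases_by_cluster.items():
        excluded = set(case_families)
        controls = [f for f in brother_families if f not in excluded]
        designs[cluster] = (case_families, controls)
    return designs
```

Each `(cases, controls)` pair would then be passed to the association test for that cluster, so the control group changes slightly from one cluster to the next.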

Association analyses were performed in PLINK version 1.07⁴⁹ and 1.9⁵⁰. The detected SNPs were subsequently annotated using ANNOVAR⁵¹. Manhattan plots and Q-Q plots were generated using the ‘qqman’ package in R version 3.0.2.

Data availability

All the data used in the study are available only to those granted access by the Simons Foundation.

Author information

These authors contributed equally: Akira Narita, Masato Nagai, and Satoshi Mizuno.

Author contributions

A.N., M.N., S.M., S.O., G.T. and S.K. designed the study. M.N. and S.K. conducted the clustering analyses. A.N., M.N., S.M., S.O. and G.T. conducted GWAS. A.N., M.N., S.M., S.O., G.T. and S.K. drafted the manuscript. M.U., R.S., S.M., T.O., M.I., C.Y., H.M., Y.K., K.M., T.K., M.K., T.U., H.O., A.H., M.K., H.M., and S.K. helped with the interpretation of data. A.N., M.N., S.M., S.O., G.T., M.U., R.S., S.M., T.O., M.I., C.Y., H.M., Y.K., K.M., T.K., M.K., T.U., H.O., A.H., M.K., H.M., S.K., and S.K. edited the manuscript and gave intellectually critical contributions to it.

Competing interests

The authors declare no competing interests.

Acknowledgements

We are grateful to all of the families at the participating Simons Simplex Collection (SSC) sites, as well as the staff at the Simons Foundation Autism Research Initiative (SFARI). The present study was supported by the Ministry of Education, Culture, Sports, Science and Technology (MEXT) KAKENHI Grant Numbers 19390171 and 16H05242. MEXT had no role in the design or execution of the study.

REFERENCES

1.↵
American Psychological Association. Diagnostic and Statistical Manual of Mental Disorders (DSM–5) (American Psychological Association, Washington, DC, 2018).
2.↵
Rapin, I. Autism. N. Engl. J. Med. 337, 97–104 (1997).
OpenUrl CrossRef PubMed Web of Science
3.↵
Geschwind, D. H. & State, M. W. Gene hunting in autism spectrum disorder: on the path to precision medicine. Lancet Neurol. 14, 1109–1120 (2015).
OpenUrl CrossRef PubMed
4.↵
Bailey, A. et al. Autism as a strongly genetic disorder: evidence from a British twin study. Psychol. Med. 25, 63–77 (1995).
OpenUrl CrossRef PubMed Web of Science
5.↵
Lauritsen, M. B., Pedersen, C. B. & Mortensen, P. B. Effects of familial risk factors and place of birth on the risk of autism: a nationwide register-based study. J. Child Psychol. Psychiatry 46, 963–971 (2005).
OpenUrl CrossRef PubMed Web of Science
6.↵
Gene, S. Gene scoring. https://gene.sfari.org/database/gene-scoring/ (2018).
7.↵
Eissa, N. et al. Current enlightenment about etiology and pharmacological treatment of autism spectrum disorder. Front. Neurosci. 12, 304 (2018).
OpenUrl
8.↵
Traylor, M., Markus, H. & Lewis, C. M. Homogeneous case subgroups increase power in genetic association studies. Eur. J. Hum. Genet. 23, 863–869 (2015).
OpenUrl
9.↵
Chaste, P. et al. A genome-wide association study of autism using the Simons Simplex Collection: does reducing phenotypic heterogeneity in autism increase genetic homogeneity? Biol. Psychiatry 77, 775–784 (2015).
OpenUrl CrossRef PubMed
10.
Mukherjee, S. et al. Genetic data and cognitively defined late-onset Alzheimer’s disease subgroups. Mol. Psychiatry (2018). doi: 10.1038/s41380-018-0298-8.
OpenUrl CrossRef
11.
Nagel, M., Watanabe, K., Stringer, S., Posthuma, D. & van der Sluis, S. Item-level analyses reveal genetic heterogeneity in neuroticism. Nat. Commun. 9, 905 (2018).
OpenUrl
12.↵
Lavoie-Charland, E., Berube, J. C., Boulet, L. P. & Bosse, Y. Asthma susceptibility variants are more strongly associated with clinically similar subgroups. J. Asthma 53, 907–913 (2016).
OpenUrl
13.↵
Krishnan, A. et al. Genome-wide prediction and functional characterization of the genetic basis of autism spectrum disorder. Nat. Neurosci. 19, 1454–1462 (2016).
OpenUrl CrossRef
14.↵
Thabtah, F. Machine learning in autistic spectrum disorder behavioral research: a review and ways forward. Inform. Health Soc. Care 1–20 (2018). doi: 10.1080/17538157.2017.1399132
OpenUrl CrossRef
15.↵
MacQueen, J. Some methods for classification and analysis of multivariate observations in Procedings Fifth Berkeley Symposium on Mathematical Statistics and Probability 281–297 (University of California Press, Berkeley, 1967).
16.↵
Frey, B. & Dueck, D. Clustering by passing messages between data points. Science 315, 972–976 (2007).
OpenUrl Abstract/FREE Full Text
17.↵
Fischbach, G. D. & Lord, C. The simons simplex collection: a resource for identification of autism genetic risk factors. Neuron 68, 192–195 (2010).
OpenUrl CrossRef PubMed Web of Science
18.↵
Beggiato, A. et al. Gender differences in autism spectrum disorders: divergence among specific core symptoms. Autism. Res. 10, 680–689 (2017).
OpenUrl
19.↵
Sharma, S. R., Gonda, X. & Tarazi, F. I. Autism spectrum disorder: classification, diagnosis and therapy. Pharmacol. Ther. 190, 91–104 (2018).
OpenUrl
20.↵
Kuriyama, S. et al. Pyridoxine treatment in a subgroup of children with pervasive developmental disorders. Dev. Med. Child Neurol. 44, 284–286 (2002).
OpenUrl PubMed
21.↵
Obara, T. et al. Potential identification of vitamin B6 responsiveness in autism spectrum disorder utilizing phenotype variables and machine learning methods. Sci. Rep. 8, 14840 (2018).
OpenUrl
22.↵
Spielman, R. S. & Ewens, W. J. A sibship test for linkage in the presence of association: the sib transmission/disequilibrium test. Am. J. Hum. Genet. 62, 450–458 (1998).
OpenUrl CrossRef PubMed Web of Science
23.
Freidlin, B., Zheng, G., Li, Z. & Gastwirth, J. L. Trend tests for case-control studies of genetic markers: power, sample size and robustness. Hum. Hered. 53, 146–152 (2002).
24.
Sladek, R. et al. A genome-wide association study identifies novel risk loci for type 2 diabetes. Nature 445, 881–885 (2007).
25.
Fisher, R. A. On the interpretation of χ² from contingency tables, and the calculation of P. J. R. Stat. Soc. 85, 87–94 (1922).
26.
Sasieni, P. D. From genotypes to genes: doubling the sample size. Biometrics 53, 1253–1261 (1997).
27.
Emily, M. Power comparison of Cochran-Armitage trend test against allelic and genotypic tests in large-scale case-control genetic association studies. Stat. Methods Med. Res. 27, 2657–2673 (2018).
28.
Price, A. L., Zaitlen, N. A., Reich, D. & Patterson, N. New approaches to population stratification in genome-wide association studies. Nat. Rev. Genet. 11, 459–463 (2010).
29.
Zeng, P. et al. Statistical analysis for genome-wide association study. J. Biomed. Res. 29, 285–297 (2015).
30.
Wang, Y. et al. Genome-wide association study of piglet uniformity and farrowing interval. Front. Genet. 8, 194 (2017).
31.
Anttila, V. et al. Analysis of shared heritability in common disorders of the brain. Science 360, eaap8757 (2018).
32.
Bralten, J. et al. Autism spectrum disorders and autistic traits share genetics and biology. Mol. Psychiatry 23, 1205–1212 (2018).
33.
Griswold, A. J. et al. Evaluation of copy number variations reveals novel candidate genes in autism spectrum disorder-associated pathways. Hum. Mol. Genet. 21, 3513–3523 (2012).
34.
Wurzman, R., Forcelli, P. A., Griffey, C. J. & Kromer, L. F. Repetitive grooming and sensorimotor abnormalities in an ephrin-A knockout model for autism spectrum disorders. Behav. Brain Res. 278, 115–128 (2015).
35.
Martinez-Noel, G. et al. Identification and proteomic analysis of distinct UBE3A/E6AP protein complexes. Mol. Cell. Biol. 32, 3095–3106 (2012).
36.
Zhang, B. et al. Multigenerational autosomal dominant inheritance of 5p chromosomal deletions. Am. J. Med. Genet. A 170, 583–593 (2016).
37.
Silva, A. E., Vayego-Lourenco, S. A., Fett-Conte, A. C., Goloni-Bertollo, E. M. & Varella-Garcia, M. Tetrasomy 15q11-q13 identified by fluorescence in situ hybridization in a patient with autistic disorder. Arq. Neuropsiquiatr. 60, 290–294 (2002).
38.
Darbandi, S. F. et al. Neonatal Tbr1 dosage controls cortical layer 6 connectivity. Neuron S0896–S6273, 30829-30838 (2018).
39.
Cui, L. et al. Relationship between the LHPP gene polymorphism and resting-state brain activity in major depressive disorder. Neural Plast. 2016, 9162590 (2016).
40.
Josifova, D. J. et al. Heterozygous KIDINS220/ARMS nonsense variants cause spastic paraplegia, intellectual disability, nystagmus, and obesity. Hum. Mol. Genet. 25, 2158–2167 (2016).
41.
Hofmann, C. et al. Compound heterozygosity of two functional null mutations in the ALPL gene associated with deleterious neurological outcome in an infant with hypophosphatasia. Bone 55, 150–157 (2013).
42.
Namm, A., Arend, A. & Aunapuu, M. Expression of Pax2 protein during the formation of the central nervous system in human embryos. Folia Morphol. 73, 272–278 (2014).
43.
Manolio, T. A. et al. Finding the missing heritability of complex diseases. Nature 461, 747–753 (2009).
44.
World Medical Association. World Medical Association Declaration of Helsinki: ethical principles for medical research involving human subjects. JAMA 310, 2191–2194 (2013).
45.
Price, A. L. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 38, 904–909 (2006).
46.
Patterson, N., Price, A. L. & Reich, D. Population structure and eigenanalysis. PLoS Genet. 2, e190 (2006).
47.
Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
48.
Sanders, S. J. et al. Multiple recurrent de novo CNVs, including duplications of the 7q11.23 Williams syndrome region, are strongly associated with autism. Neuron 70, 863–885 (2011).
49.
Purcell, S. et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81, 559–575 (2007).
50.
Chang, C. C. et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience 4, 7 (2015).
51.
Wang, K., Li, M. & Hakonarson, H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 38, e164 (2010).


Posted April 22, 2019.


