Abstract
Recent work has shown that genome-wide DNA methylation (DNAm) profiles contain signatures that can identify specific genetic disorders. Such signatures are especially effective at identifying single-gene (Mendelian) diseases, and they have been derived by comparing the methylation profiles of known disease samples against those of controls. To date, however, these methods have been supervised, which precludes their application to diseases whose genetic cause is as yet unknown. In this work, we tackle the problem of unsupervised disease classification based on DNAm signatures. Our method combines pre-filtration of the data to identify the most promising methylation sites, clustering to identify co-varying sites, and an iterative procedure that further refines the signatures, yielding an effective clustering framework. We validate the proposed method on four diseases with known DNAm signatures (CHARGE, Kabuki, Sotos, and Weaver syndromes) and show high accuracy at determining the correct disease using unsupervised analysis. We also apply our approach to a novel dataset of patients with a clinical diagnosis of autism and illustrate the de novo identification of a specific subtype.