RT Journal Article
SR Electronic
T1 DeLUCS: Deep Learning for Unsupervised Classification of DNA Sequences
JF bioRxiv
FD Cold Spring Harbor Laboratory
SP 2021.05.13.444008
DO 10.1101/2021.05.13.444008
A1 Pablo Millán Arias
A1 Fatemeh Alipour
A1 Kathleen A. Hill
A1 Lila Kari
YR 2021
UL http://biorxiv.org/content/early/2021/05/14/2021.05.13.444008.abstract
AB We present a novel Deep Learning method for the Unsupervised Classification of DNA Sequences (DeLUCS) that does not require sequence alignment, sequence homology, or (taxonomic) identifiers. DeLUCS uses Chaos Game Representations (CGRs) of primary DNA sequences, and generates “mimic” sequence CGRs to self-learn data patterns (genomic signatures) through the optimization of multiple neural networks. A majority voting scheme is then used to determine the final cluster label for each sequence. DeLUCS is able to cluster large and diverse datasets, with accuracies ranging from 77% to 100%: 2,500 complete vertebrate mitochondrial genomes, at taxonomic levels from sub-phylum to genera; 3,200 randomly selected 400 kbp-long bacterial genome segments, into families; and three viral genome and gene datasets, averaging 1,300 sequences each, into virus subtypes. DeLUCS significantly outperforms two classic clustering methods for unlabelled data (K-means and Gaussian Mixture Models), by as much as 48%. DeLUCS is highly effective: it is able to classify datasets of unlabelled primary DNA sequences totalling over 1 billion bp of data, and it bypasses common limitations to classification resulting from the lack of sequence homology, variation in sequence length, and the absence or instability of sequence annotations and taxonomic identifiers. Thus, DeLUCS offers fast and accurate DNA sequence classification for previously unclassifiable datasets. Competing Interest Statement: The authors have declared no competing interest.
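
The abstract states that DeLUCS takes Chaos Game Representations (CGRs) of DNA sequences as input. As a rough illustration of what a CGR is, the sketch below builds the discretized (k-mer count) form of a CGR in plain Python. It is not taken from the DeLUCS code: the function name `cgr_counts`, the corner assignment in `CORNERS`, the default `k`, and the handling of non-ACGT characters are all assumptions made for this example.

```python
from typing import List

# Assumed corner convention (one common choice, not confirmed by the abstract):
# A = bottom-left, C = top-left, G = top-right, T = bottom-right.
# Each value is the (x, y) bit contributed by that nucleotide.
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def cgr_counts(seq: str, k: int = 3) -> List[List[int]]:
    """Return a 2^k x 2^k grid of k-mer counts laid out as a CGR.

    In a CGR each nucleotide moves the current point halfway toward its
    corner, so the grid cell at resolution 2^k depends only on the last
    k nucleotides. The cell index is tracked here with integer bit shifts,
    which avoids floating-point drift on long sequences.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    size = 1 << k
    grid = [[0] * size for _ in range(size)]
    col = row = 0
    run = 0  # consecutive valid nucleotides seen since the last reset
    for base in seq.upper():
        corner = CORNERS.get(base)
        if corner is None:
            # Assumption: ambiguous symbols (e.g. N) break the k-mer window.
            run = 0
            continue
        cx, cy = corner
        # The newest nucleotide supplies the most significant bit.
        col = (col >> 1) | (cx << (k - 1))
        row = (row >> 1) | (cy << (k - 1))
        run += 1
        if run >= k:
            grid[row][col] += 1
    return grid


if __name__ == "__main__":
    for line in cgr_counts("ACGTACGTTTGACA", k=2):
        print(line)
```

A sequence of length n with no ambiguous symbols contributes n - k + 1 counts, one per k-mer. The resulting grid can be treated as a 2D image of the sequence, which is the kind of representation the abstract refers to as a genomic signature.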