RT Journal Article
SR Electronic
T1 DeLUCS: Deep Learning for Unsupervised Classification of DNA Sequences
JF bioRxiv
FD Cold Spring Harbor Laboratory
SP 2021.05.13.444008
DO 10.1101/2021.05.13.444008
A1 Pablo Millán Arias
A1 Fatemeh Alipour
A1 Kathleen A. Hill
A1 Lila Kari
YR 2021
UL http://biorxiv.org/content/early/2021/05/14/2021.05.13.444008.abstract
AB We present a novel Deep Learning method for the Unsupervised Classification of DNA Sequences (DeLUCS) that does not require sequence alignment, sequence homology, or (taxonomic) identifiers. DeLUCS uses Chaos Game Representations (CGRs) of primary DNA sequences, and generates “mimic” sequence CGRs to self-learn data patterns (genomic signatures) through the optimization of multiple neural networks. A majority voting scheme is then used to determine the final cluster label for each sequence. DeLUCS is able to cluster large and diverse datasets, with accuracies ranging from 77% to 100%: 2,500 complete vertebrate mitochondrial genomes, at taxonomic levels from sub-phylum to genera; 3,200 randomly selected 400 kbp-long bacterial genome segments, into families; and three viral genome and gene datasets, averaging 1,300 sequences each, into virus subtypes. DeLUCS significantly outperforms two classic clustering methods (K-means and Gaussian Mixture Models) for unlabelled data, by as much as 48%. DeLUCS is highly effective: it can classify datasets of unlabelled primary DNA sequences totalling over 1 billion bp of data, and it bypasses common limitations to classification resulting from the lack of sequence homology, variation in sequence length, and the absence or instability of sequence annotations and taxonomic identifiers. Thus, DeLUCS offers fast and accurate DNA sequence classification for previously unclassifiable datasets. Competing Interest Statement: The authors have declared no competing interest.