Abstract
Long-read sequencing techniques can sequence transcripts from end to end, greatly improving our ability to study the transcription process and enabling more detailed analysis of diseases such as cancer. While several well-established tools exist for long-read transcriptome analysis, most are reference-based and therefore limited by the reference genome. As a result, organisms without high-quality reference genomes, and samples or genes with high variability (e.g., cancer samples or some gene families), cannot be analyzed to their full potential. In such settings, a reference-free method is preferable. Clustering long reads by region of common origin is a well-established computational problem in reference-free transcriptome analysis pipelines. Such clustering splits large datasets roughly by gene family, so that each cluster can be analyzed independently. Tools for this task exist, but none can efficiently process the large number of reads now generated by long-read sequencing technologies.
We present isONclust3, an improved algorithm over isONclust and isONclust2, to cluster massive long-read transcriptome datasets at the gene family level. Like isONclust, isONclust3 represents each cluster with a set of minimizers. However, unlike other approaches, isONclust3 dynamically updates the cluster representation during clustering by adding high-confidence minimizers from new reads assigned to the cluster. We show that isONclust3 yields results of higher or comparable quality relative to state-of-the-art algorithms while being 10-100 times faster on large datasets. In addition, on a 256 GB computing node, isONclust3 was the only tool that could cluster 37 million PacBio reads, a typical throughput of the recent PacBio Revio sequencing machine.
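The idea of representing each cluster by a growing set of minimizers can be illustrated with a toy sketch. This is not the isONclust3 implementation: the function names, the lexicographic minimizer ordering, the parameters `k`, `w`, and `min_shared`, and the greedy one-pass assignment rule are all assumptions made for illustration. The sketch also adds every minimizer of an assigned read to the cluster, whereas the method described above adds only high-confidence ones.

```python
from typing import List, Set


def minimizers(seq: str, k: int = 5, w: int = 4) -> Set[str]:
    """Return the lexicographically smallest k-mer of every window of
    w consecutive k-mers (a simple (w, k)-minimizer scheme)."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return set()          # sequence shorter than k
    if len(kmers) <= w:
        return {min(kmers)}   # fewer k-mers than one full window
    return {min(kmers[i:i + w]) for i in range(len(kmers) - w + 1)}


def cluster_reads(reads: List[str], k: int = 5, w: int = 4,
                  min_shared: float = 0.5) -> List[int]:
    """Greedy one-pass clustering; returns one cluster id per read.

    Assumed rule (hypothetical): a read joins the cluster that contains
    the largest fraction of its minimizers, if that fraction is at
    least min_shared; otherwise it starts a new cluster.
    """
    clusters: List[Set[str]] = []  # one minimizer set per cluster
    labels: List[int] = []
    for read in reads:
        mins = minimizers(read, k, w)
        best_id, best_frac = -1, 0.0
        for cid, rep in enumerate(clusters):
            frac = len(mins & rep) / len(mins) if mins else 0.0
            if frac > best_frac:
                best_id, best_frac = cid, frac
        if best_id >= 0 and best_frac >= min_shared:
            # Dynamic update: the cluster representation grows with
            # the minimizers of the newly assigned read.
            clusters[best_id] |= mins
            labels.append(best_id)
        else:
            clusters.append(set(mins))
            labels.append(len(clusters) - 1)
    return labels


reads = ["ACGTACGGACCAGTACGATC", "ACGTACGGACCAGTACGATC", "TTTTTTTTTTTTTTTTTTTT"]
print(cluster_reads(reads))  # [0, 0, 1]
```

Because the representation is updated as reads arrive, a later read can join a cluster through minimizers contributed by an earlier member rather than by the read that founded the cluster.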
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
* alexander.petri{at}math.su.se, ksahlin{at}math.su.se