ABSTRACT
Each individual cell produces its own set of transcripts, which is a combinatorial result of genetic, transcriptomic and post-transcriptomic variations. Due to this combinatorial nature, obtaining the exhaustive set of full-length transcripts for a given species is a never-ending endeavor. Yet each RNA deep sequencing experiment yields a variety of transcripts that depart from reference transcriptomes and should be properly identified. To address this challenge, we introduce a k-mer-based software protocol for capturing local transcriptional variation from a set of standard RNA-seq libraries, independently of a reference genome or transcriptome. Our software, called DE-kupl, analyzes k-mer contents and detects k-mers with differential abundance directly from the sequencing files, prior to assembly or mapping. This makes it possible to retrieve the virtually complete set of unannotated variation lying in an RNA-seq dataset. This variation can subsequently be assigned to lincRNAs, antisense RNAs, splice and polyadenylation variants, retained introns, expressed repeats, chimeric or circular RNAs, foreign RNAs and SNV-harbouring RNAs. We applied DE-kupl to a published differential RNA-seq experiment carried out on a human cell line, and were able to discover highly significant unannotated transcript variations. We propose that DE-kupl could be a valuable tool for extracting in full the untapped transcript information contained in large-scale transcriptome projects.
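
The core idea of counting k-mers directly in the reads and comparing their abundance between conditions can be illustrated with a minimal Python sketch. This is not the DE-kupl implementation: the function names `count_kmers` and `differential_kmers`, the pseudocount, and the simple log2 fold-change threshold are assumptions made for illustration only, and the sketch omits library-size normalization, strand handling and any statistical test.

```python
from collections import Counter
from math import log2


def count_kmers(reads, k):
    """Count every k-mer occurring in a list of read sequences."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def differential_kmers(group_a, group_b, k, min_abs_log2fc=1.0, pseudocount=1.0):
    """Return {kmer: log2 fold change (B vs A)} for k-mers passing the threshold.

    group_a, group_b: lists of samples, each sample being a list of read strings.
    The fold-change filter is an illustrative stand-in (assumption), not the
    differential test used by DE-kupl.
    """
    counts_a = [count_kmers(sample, k) for sample in group_a]
    counts_b = [count_kmers(sample, k) for sample in group_b]

    # Union of all k-mers seen in any sample of either condition.
    all_kmers = set().union(*counts_a, *counts_b)

    selected = {}
    for kmer in all_kmers:
        # Mean raw count per condition; a missing k-mer counts as 0.
        mean_a = sum(c[kmer] for c in counts_a) / len(counts_a)
        mean_b = sum(c[kmer] for c in counts_b) / len(counts_b)
        lfc = log2((mean_b + pseudocount) / (mean_a + pseudocount))
        if abs(lfc) >= min_abs_log2fc:
            selected[kmer] = lfc
    return selected
```

Because the comparison is made on k-mers rather than on aligned reads or annotated features, a k-mer spanning a novel splice junction or an SNV is scored like any other, with no reference required at this stage.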