Abstract
Background The quality of gene annotation determines the interpretation of results obtained in transcriptomic studies. The growing amount of genome sequence information calls for experimental and computational pipelines for de novo transcriptome annotation. Ideally, gene and transcript models should be called from a limited set of key experimental data.
Results We developed TranscriptomeReconstructoR, an R package which implements a pipeline for automated transcriptome annotation. It relies on integrating features from independent and complementary datasets: i) full-length RNA-seq for detection of splicing patterns and ii) high-throughput 5’ and 3’ tag sequencing data for accurate definition of gene borders. The pipeline can also take a nascent RNA-seq dataset to supplement the called gene model with transient transcripts.
We reconstructed de novo the transcriptional landscape of wild-type Arabidopsis thaliana seedlings as a proof of principle. A comparison to the existing transcriptome annotations revealed that our gene model is more accurate and comprehensive than the two most commonly used community gene models, TAIR10 and Araport11. In particular, we identify thousands of transient transcripts missing from the existing annotations. Our new annotation promises to improve the quality of A. thaliana genome research.
Conclusions Our proof-of-concept data suggest a cost-efficient strategy for rapid and accurate annotation of complex eukaryotic transcriptomes. We combine the choice of library preparation methods and sequencing platforms with the dedicated computational pipeline implemented in the TranscriptomeReconstructoR package. The pipeline requires prior knowledge of only the reference genomic DNA sequence, not the transcriptome. The package seamlessly integrates with Bioconductor packages for downstream analysis.
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
- Abbreviations
  - RNA-seq: RNA sequencing
  - ONT: Oxford Nanopore Technologies
  - PacBio: Pacific Biosciences
  - NET-seq: Native Elongation Transcript sequencing
  - GRO-seq: Global Run-On sequencing
  - A. thaliana: Arabidopsis thaliana
  - cDNA: complementary DNA
  - TSS: transcription start site
  - PAS: polyadenylation site
  - lncRNA: long non-coding RNA
  - CAGE-seq: Cap Analysis of Gene Expression sequencing
  - PAT-seq: Poly(A) tag sequencing
  - CRAN: Comprehensive R Archive Network
  - BAM: Binary Alignment Map
  - BED: Browser Extensible Data
  - Iso-seq: isoform sequencing
  - HC: high confidence
  - MC: medium confidence
  - LC: low confidence
  - RT: read-through
  - plaNET-seq: plant Native Elongation Transcript sequencing
  - M: million
  - bp: base pair
  - ncRNA: non-coding RNA
  - TSS-seq: Transcription Start Site sequencing
  - 3’ DRS-seq: 3’ Direct RNA sequencing
  - TU: transcription unit
  - chrRNA-seq: chromatin-associated RNA sequencing
  - TIF-seq: Transcript Isoform sequencing
  - EST: expressed sequence tag