Re-assembly, quality evaluation, and annotation of 678 microbial eukaryotic reference transcriptomes

Lisa K Johnson; Harriet Alexander; C Titus Brown

doi:10.1093/gigascience/giy158

Gigascience. 2019 Apr 1;8(4):giy158. doi: 10.1093/gigascience/giy158.

Authors

Lisa K Johnson^{1,2}, Harriet Alexander^{1,3}, C Titus Brown^{1,2,4}

Affiliations

¹ Department of Population Health and Reproduction, School of Veterinary Medicine, University of California Davis, One Shields Ave, Davis, CA 95616, USA.
² Molecular, Cellular, and Integrative Physiology Graduate Group, University of California Davis, One Shields Ave, Davis, CA 95616, USA.
³ Biology Department, Woods Hole Oceanographic Institution, Woods Hole, MA 02543, USA.
⁴ Genome Center, University of California Davis, 451 Health Sciences Dr, Davis, CA 95616, USA.

Abstract

Background: De novo transcriptome assemblies are required prior to analyzing RNA sequencing data from a species without an existing reference genome or transcriptome. Despite the prevalence of transcriptomic studies, the effects of using different workflows, or "pipelines," on the resulting assemblies are poorly understood. Here, a pipeline was programmatically automated and used to assemble and annotate raw transcriptomic short-read data collected as part of the Marine Microbial Eukaryotic Transcriptome Sequencing Project. The resulting transcriptome assemblies were evaluated and compared against assemblies that were previously generated with a different pipeline developed by the National Center for Genome Resources.

Results: New transcriptome assemblies contained the majority of previous contigs as well as new content. On average, 7.8% of the annotated contigs in the new assemblies carried gene names not found in the previous assemblies. Taxonomic trends were observed in the assembly metrics. Assemblies from the Dinoflagellata showed a higher number of contigs and unique k-mers than transcriptomes from other phyla, while assemblies from Ciliophora had a lower percentage of open reading frames compared to other phyla.
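
As an illustration of two of the metrics named in the Results, the following minimal Python sketch counts unique k-mers in a set of contigs and computes the percentage of annotated contigs whose gene name is absent from a previous assembly. It is not the authors' pipeline: the function names, the k-mer length, and the toy sequences and gene names are all hypothetical, and the k-mer count here does not collapse reverse complements as dedicated k-mer tools often do.

```python
from typing import Dict, Iterable, Set


def unique_kmers(contigs: Iterable[str], k: int = 25) -> Set[str]:
    """Collect the distinct k-mers (length-k substrings) across all contigs.

    Assumption: k=25 is an arbitrary illustrative default, not a value
    taken from the paper. Reverse complements are NOT merged.
    """
    kmers: Set[str] = set()
    for seq in contigs:
        seq = seq.upper()
        # A contig shorter than k contributes no k-mers (empty range).
        for i in range(len(seq) - k + 1):
            kmers.add(seq[i:i + k])
    return kmers


def percent_novel_gene_names(
    new_annotations: Dict[str, str],
    previous_gene_names: Set[str],
) -> float:
    """Percentage of annotated contigs in the new assembly whose gene
    name does not appear among the previous assembly's gene names.

    new_annotations maps contig ID -> gene name (hypothetical layout).
    """
    if not new_annotations:
        return 0.0
    novel = sum(
        1 for name in new_annotations.values()
        if name not in previous_gene_names
    )
    return 100.0 * novel / len(new_annotations)


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    toy_contigs = ["ACGTAC", "CGTACG"]
    print(len(unique_kmers(toy_contigs, k=3)))  # 4 distinct 3-mers

    toy_new = {"c1": "geneA", "c2": "geneB", "c3": "geneC", "c4": "geneD"}
    toy_prev = {"geneA", "geneB", "geneC"}
    print(percent_novel_gene_names(toy_new, toy_prev))  # 25.0
```

Averaging the second value across all assemblies would give a figure comparable in kind to the 7.8% reported above, assuming gene names are matched as exact strings.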

Conclusions: Given current bioinformatics approaches, there is no single "best" reference transcriptome for a particular set of raw data. As the optimum transcriptome is a moving target, improving (or not) with new tools and approaches, automated and programmable pipelines are invaluable for managing the computationally intensive tasks required for re-processing large sets of samples with revised pipelines and ensuring a common evaluation workflow is applied to all samples. Thus, re-assembling existing data with new tools using automated and programmable pipelines may yield more accurate identification of taxon-specific trends across samples in addition to novel and useful products for the community.

Keywords: automated pipeline; marine microbial eukaryote; re-analysis; transcriptome assembly.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Computational Biology* / methods
Databases, Genetic
Eukaryota / genetics*
Gene Expression Profiling* / methods
Genome
Genomics / methods
High-Throughput Nucleotide Sequencing
Transcriptome*
Workflow