Abstract
Background
De novo transcriptome assemblies are required prior to analyzing RNAseq data from a species without an existing reference genome or transcriptome. Despite the prevalence of transcriptomic studies, the effects of using different workflows, or “pipelines”, on the resulting assemblies are poorly understood. Here, a pipeline was programmatically automated and used to assemble and annotate raw transcriptomic short read data collected by the Marine Microbial Eukaryotic Transcriptome Sequencing Project (MMETSP). Transcriptome assemblies generated through this pipeline were evaluated and compared against assemblies that were previously generated with a pipeline developed by the National Center for Genome Resources (NCGR).
Findings
New transcriptome assemblies contained 70% of the previous contigs as well as new content. On average, 7.8 ± 0.19% of the annotated contigs in the new assemblies had novel gene names not found in the previous assemblies. Taxonomic trends were observed in the assembly metrics: assemblies from the Dinoflagellata had a higher percentage of open reading frames and more contigs than transcriptomes from other phyla, whereas assemblies from the Ciliophora had a lower percentage of open reading frames.
Conclusions
Given current bioinformatics approaches, there is no single ‘best’ reference transcriptome for a particular set of raw data. As the optimum transcriptome is a moving target, improving (or not) with new tools and approaches, automated and programmable pipelines are invaluable for managing the computationally-intensive tasks required for re-processing large sets of samples with revised pipelines. Moreover, automated and programmable pipelines facilitate the comparison of diverse sets of data by ensuring a common evaluation workflow was applied to all samples. Thus, re-assembling existing data with new tools using automated and programmable pipelines may yield more accurate identification of taxon-specific trends across samples in addition to novel and useful products for the community.
Introduction
The analysis of gene expression from high-throughput nucleic acid sequence data relies on the presence of a high quality reference genome or transcriptome. When there is no reference genome or transcriptome for an organism of interest, raw RNA sequence data (RNAseq) must be assembled de novo into a transcriptome [1]. This type of analysis is ubiquitous across many fields, including evolutionary developmental biology [2], cancer biology [3], agriculture [4,5], ecological physiology [6,7], and biological oceanography [8]. In recent years, substantial investments have been made in data generation, primary data analysis, and development of downstream applications, such as biomarkers and diagnostic tools [9–16].
Methods for de novo RNAseq assembly of the most common short read Illumina sequencing data continue to evolve rapidly, especially for non-model species [17]. At this time, there are several major de novo transcriptome assembly software tools available to choose from, including Trinity [18], SOAPdenovo-Trans [19], Trans-ABySS [20], Oases [21], SPAdes [22], IDBA-tran [23], and Shannon [24]. The availability of these options stems from continued research into the unique computational challenges associated with transcriptome assembly of short read Illumina RNAseq data, including large memory requirements, alternative splicing and allelic variants [18,25].
The continuous development of new tools and workflows for RNAseq analysis combined with the vast amount of publicly available RNAseq data [26] raises the opportunity to re-analyze existing data with new tools. This, however, is rarely done systematically. To evaluate the performance impact of new tools on old data, we developed and applied a programmatically automated de novo transcriptome assembly workflow that is modularized and extensible based on the Eel Pond Protocol [27]. This workflow incorporates Trimmomatic [28], digital normalization with khmer software [29,30], and the Trinity de novo transcriptome assembler [18].
To evaluate this pipeline, we re-analyzed RNAseq data from 678 samples generated as part of the Marine Microbial Eukaryotic Transcriptome Sequencing Project (MMETSP). The MMETSP RNAseq data set was generated to broaden the diversity of sequenced marine protists to enhance our understanding of their evolution and roles in marine ecosystems and biogeochemical cycles [31,32]. With data from species spanning more than 40 eukaryotic phyla, the MMETSP provides one of the largest publicly-available collections of RNAseq data from a diversity of species. Moreover, the MMETSP used a standardized library preparation procedure and all of the samples were sequenced at the same facility, making this data set unusually comparable.
Reference transcriptomes for the MMETSP were originally assembled by the National Center for Genome Resources (NCGR) with a pipeline that used the Trans-ABySS software program [31] to assemble the short reads. The transcriptomes generated from the NCGR pipeline have already facilitated discoveries in the evolutionary history of ecologically significant genes [33,34], differential gene expression under shifting environmental conditions [8,35], inter-group transcriptome comparisons [36], unique transcriptional features [37–39], and meta-transcriptomic studies [34–36].
In re-assembling the MMETSP data, we sought to compare against and improve upon the original MMETSP reference transcriptomes and to create a platform that facilitates automated re-assembly and evaluation. Here, we show that our re-assemblies had higher evaluation metrics and contained most of the NCGR contigs as well as new content.
Methods
Programmatically Automated Pipeline
An automated pipeline was developed to execute the steps of the Eel Pond mRNAseq Protocol [27], a lightweight protocol for assembling short Illumina RNA-seq reads that uses the Trinity de novo transcriptome assembler. This protocol generates de novo transcriptome assemblies of acceptable quality [43]. The pipeline was used to assemble all of the data from the MMETSP (Figure 1). The code and instructions for running the pipeline are available at https://doi.org/10.5281/zenodo.249982.
The steps of the pipeline applied to the MMETSP are as follows:
1. Download the raw data
Raw RNA-seq data sets were obtained from the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) from BioProject PRJNA231566. Data were paired-end (PE) Illumina reads with lengths of 50 bases for each read. A metadata (SraRunInfo.csv) file obtained from the SRA web interface was used to provide a list of samples to the getdata.py pipeline script, which was then used to download and extract fastq files from 719 records. The script uses the fastq-dump program from the SRA Toolkit (version 2.5.4) [44] to extract fastq files from the SRA-formatted records. There were 18 MMETSP samples with more than one SRA record (MMETSP0693, MMETSP1019, MMETSP0923, MMETSP0008, MMETSP1002, MMETSP1325, MMETSP1018, MMETSP1346, MMETSP0088, MMETSP0092, MMETSP0717, MMETSP0223, MMETSP0115, MMETSP0196, MMETSP0197, MMETSP0398, MMETSP0399, MMETSP0922). In these cases, reads from multiple SRA records were concatenated together per sample. Taking these redundancies into consideration, a total of 678 re-assemblies were generated from the 719 records in PRJNA231566 (Supplemental Notebook 1). Assembly evaluation metrics were not calculated for MMETSP samples with more than one SRA record because these assemblies, which combined multiple SRA records, were different from the others and thus not as comparable.
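The bookkeeping described above (finding samples whose reads are spread over several SRA records) can be sketched as follows. This is an illustrative sketch only, not the getdata.py script itself; the column names "Run" and "SampleName" are assumed to be present in SraRunInfo.csv, and the accessions and sample names in the example are hypothetical.

```python
import csv
import io
from collections import defaultdict


def group_runs_by_sample(runinfo_csv_text):
    """Map each sample name to the list of SRA run accessions recorded for it."""
    runs = defaultdict(list)
    for row in csv.DictReader(io.StringIO(runinfo_csv_text)):
        # "Run" and "SampleName" are assumed column names in SraRunInfo.csv.
        runs[row["SampleName"]].append(row["Run"])
    return dict(runs)


def multi_record_samples(runs_by_sample):
    """Return the samples whose reads would need concatenating across records."""
    return sorted(s for s, runs in runs_by_sample.items() if len(runs) > 1)


# Hypothetical metadata: three records, two samples.
example = (
    "Run,SampleName\n"
    "SRR0000001,SAMPLE_A\n"
    "SRR0000002,SAMPLE_B\n"
    "SRR0000003,SAMPLE_B\n"
)
grouped = group_runs_by_sample(example)
print(len(grouped), "samples;", "multi-record:", multi_record_samples(grouped))
```

Grouping before download makes the difference between the number of records and the number of assemblies explicit, which is the same reduction that takes the 719 records to 678 re-assemblies.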
Initial transcriptomes that were assembled by the National Center for Genome Resources (NCGR), using methods and data described in the original publication [31], were downloaded from the iMicrobe repository to compare with our re-assemblies (ftp://ftp.imicrobe.us/projects/104/). There were two versions of each assembly, ‘nt’ and ‘cds’. The version used for comparison is noted below in each evaluation step. To our knowledge, the NCGR took extra post-processing steps to filter content, leaving only coding sequences in the ‘cds’ versions of each assembly [31].
2. Perform quality control
Reads were analyzed with FastQC (version 0.11.5) and MultiQC (version 1.2) [45] to confirm overall read quality before and after trimming. A conservative trimming approach [46] was used with Trimmomatic (version 0.33) [28] to remove residual Illumina adapters and cut bases off the start (LEADING) and end (TRAILING) of reads if they were below a threshold Phred quality score (Q<2).
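To illustrate what the LEADING and TRAILING steps do to a single read, the following sketch removes bases from each end while their quality is below the threshold. It is a simplified re-implementation for illustration, not Trimmomatic's code, and it assumes Phred+33 quality encoding; adapter removal is not shown.

```python
def trim_ends(seq, qual, threshold=2, offset=33):
    """Cut bases from both ends while their Phred score is below `threshold`.

    `qual` is the FASTQ quality string; Phred+33 encoding is assumed.
    """
    scores = [ord(c) - offset for c in qual]
    start = 0
    while start < len(scores) and scores[start] < threshold:
        start += 1  # drop low-quality leading bases
    end = len(scores)
    while end > start and scores[end - 1] < threshold:
        end -= 1  # drop low-quality trailing bases
    return seq[start:end], qual[start:end]


# '!' encodes Q0, '"' encodes Q1, '#' encodes Q2 and 'I' encodes Q40.
trimmed_seq, trimmed_qual = trim_ends("ACGTAC", '!"II#!')
print(trimmed_seq, trimmed_qual)
```

With a threshold of 2, only bases with Q0 or Q1 are removed, which is why this setting is considered conservative: nearly all sequence is retained for the assembler.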
3. Apply digital normalization
To decrease the memory requirements for each assembly, reads were interleaved, normalized to a k-mer (k = 20) coverage of 20 and a memory size of 4e9, then low-abundance k-mers from reads with a coverage above 18 were trimmed. Orphaned reads, where the mated pair was removed during normalization, were included in the assembly.
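The core idea of digital normalization can be sketched in a few lines: a read is kept only if the median abundance of its k-mers, counted over the reads kept so far, is below the coverage cutoff. This is a toy sketch under simplifying assumptions. It uses an exact in-memory counter rather than the fixed-memory probabilistic counting used by khmer, it ignores read pairing and reverse complements, and it does not show the subsequent low-abundance k-mer trimming step.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return all overlapping k-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def digital_normalize(reads, k=20, cutoff=20):
    """Keep a read only while its estimated coverage is below `cutoff`."""
    counts = Counter()
    kept = []
    for read in reads:
        words = kmers(read, k)
        if not words:
            continue  # read is shorter than k
        # Median k-mer abundance serves as the coverage estimate for the read.
        if median(counts[w] for w in words) < cutoff:
            kept.append(read)
            counts.update(words)
    return kept


# Toy input: one highly redundant read and one read seen once.
reads = ["ACGTTGCA"] * 10 + ["GGGGCCCCAA"]
kept = digital_normalize(reads, k=4, cutoff=3)
print(len(reads), "reads in,", len(kept), "reads kept")
```

Redundant reads from highly expressed transcripts are discarded once the cutoff is reached, while reads from low-coverage transcripts are all retained, which is what reduces the memory needed for assembly.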
4. Assemble
Transcriptomes were assembled from normalized reads with Trinity 2.2.0 using default parameters (k = 25).
The resulting assemblies are referred to below as the “Lab for Data Intensive Biology” assemblies, or DIB assemblies. The original assemblies are referred to as the NCGR assemblies.
5. Post-assembly assessment
Transcriptomes were annotated using the dammit pipeline (Scott 2016), which relies on the following databases as evidence: Pfam-A [47], Rfam [48], OrthoDB [49]. In the case where there were multiple database hits, one gene name per contig was selected by choosing the name of the lowest e-value match (<1e-05).
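The rule for choosing one gene name per contig can be sketched as below. The tuple layout (contig, name, e-value) and the example hits are hypothetical and are not dammit's output format; only the selection rule (lowest e-value, below 1e-05) comes from the text.

```python
def best_name_per_contig(hits, max_evalue=1e-5):
    """Pick, for each contig, the name of the database hit with the lowest e-value.

    `hits` is an iterable of (contig_id, gene_name, evalue) tuples; this layout
    is an assumption made for illustration.
    """
    best = {}
    for contig, name, evalue in hits:
        if evalue >= max_evalue:
            continue  # hit does not pass the e-value threshold
        if contig not in best or evalue < best[contig][1]:
            best[contig] = (name, evalue)
    return {contig: name for contig, (name, _) in best.items()}


# Hypothetical hits for two contigs.
hits = [
    ("contig_1", "Pkinase_Tyr", 1e-20),
    ("contig_1", "Bromodomain", 1e-8),
    ("contig_2", "DnaJ", 1e-3),
]
print(best_name_per_contig(hits))
```

A contig whose only hits fail the threshold, such as contig_2 here, receives no gene name.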
All assemblies were evaluated using metrics generated by the Transrate program [50]. Trimmomatic quality-trimmed reads, prior to digital normalization, were used to calculate a Transrate score for each assembly, which represents the geometric mean of all contig scores multiplied by the proportion of input reads providing positive support for the assembly [50]. Comparative metrics were calculated using Transrate for each MMETSP sample between the DIB and NCGR assemblies using the Conditional Reciprocal Best BLAST hits (CRBB) algorithm [51]. A forward comparison was made with the NCGR assembly used as the reference and each DIB re-assembly as the query. Reverse comparative metrics were calculated with each DIB re-assembly as the reference and the NCGR assembly as the query.
Benchmarking Universal Single-Copy Orthologs (BUSCO) software (version 3) was used to search open reading frames in the assemblies against a database of 234 orthologous genes specific to protistans and a database of 306 genes specific to eukaryota. BUSCO scores are frequently used as one measure of assembly completeness [52].
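The assembly-level score described above can be written out directly from its definition. The sketch below is not Transrate's implementation; in particular, returning zero when any contig score is zero is a simplifying choice made here, and the input values are hypothetical.

```python
import math


def assembly_score(contig_scores, good_read_fraction):
    """Geometric mean of contig scores times the fraction of supporting reads."""
    if not contig_scores or any(s <= 0 for s in contig_scores):
        return 0.0  # simplifying choice of this sketch
    # Averaging logarithms avoids underflow when there are many contigs.
    log_mean = sum(math.log(s) for s in contig_scores) / len(contig_scores)
    return math.exp(log_mean) * good_read_fraction


# Hypothetical values: two contigs, 80% of reads supporting the assembly.
print(assembly_score([0.25, 1.0], 0.8))
```

Because the two factors are multiplied, an assembly can score poorly either because its contigs are individually weak or because many input reads are not well represented by it.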
To assess the occurrences of fixed-length words in the assemblies, unique 25-mers were measured in each assembly using the HyperLogLog estimator of cardinality built into the khmer software package [53].
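For small inputs, the quantity that the HyperLogLog estimator approximates can be computed exactly with a set. The sketch below does that; it is not the khmer implementation, and treating a k-mer and its reverse complement as the same word is an assumption of this sketch.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_unique_kmers(contigs, k=25):
    """Exact count of distinct canonical k-mers across all contigs."""
    seen = set()
    for seq in contigs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:  # skip words containing ambiguous bases
                seen.add(canonical(kmer))
    return len(seen)


# Toy contigs with k = 3 so the result can be checked by hand.
print(count_unique_kmers(["ACGTT", "AACGT"], k=3))
```

An exact set grows with the number of distinct k-mers, which is why a fixed-memory cardinality estimator is preferable when hundreds of assemblies are measured.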
Unique gene names were compared from a random subset of 296 samples using the dammit annotation pipeline [54]. If a gene name was annotated in NCGR but not in DIB, this was considered a gene uniquely annotated in NCGR. Unique gene names were normalized to the total number of annotated genes in each assembly.
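The comparison of gene names amounts to a set difference followed by normalization. In the sketch below the denominator is taken to be the number of distinct gene names in the assembly, which is an assumption about how the normalization was done; the example name sets are hypothetical.

```python
def unique_name_fraction(names_a, names_b):
    """Fraction of gene names in assembly A that are absent from assembly B."""
    a, b = set(names_a), set(names_b)
    if not a:
        return 0.0
    return len(a - b) / len(a)


# Hypothetical gene-name sets for one sample.
dib_names = {"Helicase_C", "DnaJ", "Bromodomain", "PsaA_PsaB"}
ncgr_names = {"DnaJ", "Bromodomain"}
print(unique_name_fraction(dib_names, ncgr_names))
print(unique_name_fraction(ncgr_names, dib_names))
```

Computing the fraction in both directions keeps the comparison symmetric: names unique to DIB and names unique to NCGR are each expressed relative to their own assembly.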
A Tukey’s honest significant difference (HSD) post-hoc range test of multiple pairwise comparisons was used in conjunction with an ANOVA to measure differences between distributions of data from the top eight most-represented phyla (“Bacillariophyta”, “Dinophyta”, “Ochrophyta”, “Haptophyta”, “Ciliophora”, “Chlorophyta”, “Cryptophyta”, “Others”) using the ‘agricolae’ package version 1.2-8 in R version 3.4.2 (2017-09-28). Margins sharing a letter in the group label are not significantly different at the 5% level (Figure 8). Averages are reported ± standard deviation.
Results
After assemblies and annotations were completed, files were uploaded to Figshare and Zenodo and are available for download [55,56]. Due to obstacles encountered uploading and maintaining 678 assemblies on Figshare, Zenodo will be the long-term archive for these re-assemblies (http://doi.org/10.5281/zenodo.1212585).
Differences in available evaluation metrics between NCGR and DIB were variable
The majority of transcriptome evaluation metrics collected for each sample were higher in the Trinity-based DIB re-assemblies than in the Trans-ABySS-based NCGR assemblies (Table 1), the exception being the Transrate score relative to the ‘nt’ version of the NCGR assemblies. Transrate scores were higher in DIB than in the NCGR ‘cds’ version but lower in DIB than in the NCGR ‘nt’ version (Supplemental Figure 1).
The DIB re-assemblies had more contigs than the NCGR assemblies in 83.5% of the samples (Table 1). The mean number of contigs in the DIB re-assemblies was 48,361 ± 35,703 while the mean number of contigs in the NCGR ‘nt’ assemblies was 30,532 ± 21,353 (Figure 2). A two-sample Kolmogorov-Smirnov test comparing distributions indicated that the number of contigs was significantly different between DIB and NCGR assemblies (p < 0.001, D = 0.35715). Transrate scores [50], which measure the overall quality of the assembly based on the original reads, were significantly higher in the DIB re-assemblies (0.31 ± 0.1) than in the ‘cds’ versions of the NCGR assemblies (0.22 ± 0.09) (p < 0.001, D = 0.49899). The Transrate scores in the NCGR ‘nt’ assemblies (0.35 ± 0.09) were significantly higher than those in the DIB re-assemblies (0.31 ± 0.1) (p < 0.001, D = 0.22475) (Supplemental Figure 1). The frequency of the differences between Transrate scores in the NCGR ‘nt’ assemblies and the DIB re-assemblies appears to be normally distributed (Figure 2C). Transrate scores from the DIB assemblies relative to the NCGR ‘nt’ assemblies did not appear to have taxonomic trends (Supplemental Figure 2).
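The D statistic reported with each two-sample Kolmogorov-Smirnov test is the largest vertical distance between the two empirical cumulative distribution functions, so it always lies between 0 and 1. A minimal sketch of that calculation is given below; it computes D only, not the p-value, and the input values are hypothetical rather than data from this study.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov D: maximum gap between empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    # The gap can only change at observed values, so checking those suffices.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Hypothetical samples with partially overlapping ranges.
print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))
```

A value near 0 means the two distributions are nearly indistinguishable, whereas a value near 1 means they barely overlap.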
The DIB re-assemblies contained most of the NCGR contigs as well as new content
We applied CRBB to evaluate overlap between the assemblies. A positive CRBB result indicates that one assembly contains the same contig information as the other. Thus, the proportion of positive CRBB hits can be used as a scoring metric to compare the relative similarity of content between two assemblies. For example, MMETSP0949 (Chattonella subsalsa) had 39,051 contigs and a CRBB score of 0.71 in the DIB re-assembly, whereas the NCGR assembly of the same sample had 18,873 contigs and a CRBB score of 0.34. This indicated that 71% of the reference of DIB was covered by the NCGR assembly, whereas in the reverse alignment, the NCGR reference assembly was covered by only 34% of the DIB re-assembly. The mean CRBB score in DIB when queried against NCGR ‘nt’ as a reference was 0.70 ± 0.22, while the mean proportion for NCGR ‘nt’ assemblies queried against DIB re-assemblies was 0.49 ± 0.10 (p < 0.001, D = 0.71121) (Figure 3). This indicates that more content from the NCGR assemblies was included in the DIB re-assemblies than vice versa and also suggests that the DIB re-assemblies overall have additional content. This finding is reinforced by the higher unique k-mer content found in the DIB re-assemblies, where more than 95% of the samples had more unique k-mers in the DIB re-assemblies than in the NCGR assemblies (Figure 4).
To investigate whether the new sequence content was genuine, we examined two different metrics that take into account the biological quality of the assemblies. First, the estimated content of open reading frames (ORFs), or coding regions, across contigs was quantified. Though DIB re-assemblies had more contigs, the ORF content was similar to that of the original assemblies, with a mean of 81.8 ± 9.9% ORF content in DIB re-assemblies and 76.7 ± 10.1% ORF content in the NCGR assemblies. ORF content in DIB re-assemblies was nonetheless slightly higher than in NCGR assemblies for 95% of the samples (Figure 5 A,B), and this difference was significant (p < 0.001, D = 2681). Secondly, when the assemblies were queried against the eukaryotic BUSCO database [52], the percentages of BUSCO eukaryotic matches in the DIB re-assemblies (63 ± 18.6%) were only slightly different from those in the original NCGR assemblies (65 ± 19.1%), although the difference was statistically significant (p = 0.001873, D = 0.10291) (Figure 5 C,D). Thus, although the number of contigs and amount of content was increased in the DIB re-assemblies compared to the NCGR assemblies, the ORF content and contigs matching with the BUSCO eukaryotic (Figure 5 C,D) and protistan (Supplemental Figure 3) databases did not decrease, suggesting that the extra content contained similar proportions of ORFs and BUSCO annotations and, therefore, might be biologically meaningful.
Following annotation by the dammit pipeline (Scott 2016), 91 ± 1.6% of the contigs in the DIB re-assemblies had positive matches with sequence content in the databases queried (Pfam, Rfam, and OrthoDB), with 48 ± 0.9% of those containing unique gene names (the remaining are fragments of the same gene). Of those annotations, 7.8 ± 0.2% were identified as novel compared to the NCGR ‘nt’ assemblies, as determined by a “false” CRBB result (Figure 6). Additionally, the number of unique gene names was higher in the DIB re-assemblies than in the NCGR assemblies for 97% of the samples, suggesting an increase in genic content (Figure 7).
Novel contigs in the DIB re-assemblies likely represent a combination of unique annotations, allelic variants and alternatively spliced isoforms. For example, “F0XV46_GROCL”, “Helicase_C”, “ODR4-like”, “PsaA_PsaB”, and “Metazoa_SRP” are novel gene names found annotated in the DIB re-assembly of the sample MMETSP1473 (Stichococcus sp.) that were absent in the NCGR assembly of this same sample. Other gene names, for example “Pkinase_Tyr”, “Bromodomain”, and “DnaJ”, are found in both the NCGR and DIB assemblies, but are identified as novel contigs based on negative CRBB results in the DIB re-assembly of sample MMETSP1473 compared to the NCGR reference.
Assembly metrics varied by taxonomic group being assembled
To examine systematic taxonomic differences in the assemblies, metrics for content and assembly quality were assessed (Figure 8). Metrics were grouped by the top eight most represented phyla in the MMETSP data set as follows: Bacillariophyta (N=173), Dinophyta (N=114), Ochrophyta (N=73), Chlorophyta (N=62), Haptophyta (N=61), Ciliophora (N=25), Cryptophyta (N=22) and Others (N=130).
While there were no major differences between the phyla in the number of input reads (Figure 8 A), the Dinoflagellates (Dinophyta) had significantly more contigs (p < 0.01), more unique k-mers (p < 0.001), and a higher % ORF (p < 0.001) than other groups (Figure 8 B,C,D), and assemblies from Ciliates (Ciliophora) had a lower % ORF (p < 0.001) (Figure 8 D).
Discussion
DIB re-assemblies contained the majority of the previously-assembled contigs
We used a different pipeline than the original one used to create the NCGR assemblies, in part because new software was available [8] and in part because of new trimming guidelines [27]. We had no a priori expectation for the similarity of the results, yet we found that in the majority of cases the new DIB re-assemblies included substantial portions of the previous NCGR assemblies. Moreover, both the fraction of contigs with ORFs and the mean percentage of BUSCO matches were similar between the two assemblies, suggesting that both pipelines yielded equally valid contigs, even though the NCGR assemblies were less sensitive.
Reassembly with new tools can yield new results
Evaluation with quality metrics suggested that the DIB re-assemblies were more inclusive than the NCGR assemblies. The Transrate scores in the DIB re-assemblies were significantly lower than those in the NCGR ‘nt’ assemblies, indicating that the NCGR ‘nt’ assemblies had better overall read inclusion in the assembled contigs. In contrast, the DIB re-assemblies had higher Transrate scores than the NCGR ‘cds’ versions. This suggests that the NCGR ‘cds’ version, which was post-processed to only include coding sequence content, was missing information originally in the quality-trimmed reads. The Transrate score [50] is one of the few metrics available for evaluating the ‘quality’ of a de novo transcriptome. It is similar to the DETONATE RSEM-EVAL score in that it returns a metric indicating how well the assembly is supported by the read data [57]. Metrics directly evaluating the underlying de Bruijn graph data structure used to produce the assembled contigs may be better evaluators of assembly quality in the future. Here, the DIB re-assemblies, which used the Trinity de novo assembly software, typically contained more k-mers, more annotated transcripts, and more unique gene names than the NCGR assemblies. These points all suggest that additional content in these re-assemblies might be biologically meaningful and that these re-assemblies provide new content not available in the previous NCGR assemblies. Since contigs are probabilistic predictions made by assembly software for full-length transcripts [57], ‘final’ reference assemblies are approximations of the full set of transcripts in the transcriptome. Results from this study suggest that achieving the ‘ideal’ reference transcriptome is like chasing a moving target and that these predictions may continue to improve given updated tools in the future.
The evaluation metrics described here serve as a framework for better contextualizing the quality of protistan transcriptomes. For some species and strains in the MMETSP data set, these data represent the first nucleic acid sequence information available [31].
Automated and programmable pipelines can be used to process arbitrarily many RNAseq samples
The automated and programmable nature of this pipeline was useful for processing large data sets like the MMETSP as it allowed for batch processing of the entire collection, including reanalysis when new tools or new samples become available (see op-ed Alexander et al. 2018). During the course of this project, we ran four re-assemblies of the MMETSP data set as versions of the component tools were updated, including new Trinity releases (Supplemental Notebook 2). Each re-analysis required only a single command and approximately half a CPU-year of compute. The value of programmable automation is clear when new data sets become available, tools are updated, or many tools are compared in benchmark studies. Despite this, few assembly efforts completely automate their process, perhaps because the up-front cost of doing so is high compared to the size of the dataset typically being analyzed.
Analyzing many samples using a common pipeline identifies taxon-specific trends
The MMETSP dataset presents an opportunity to examine transcriptome qualities for hundreds of taxonomically diverse species spanning a wide array of protistan lineages. This is among the largest sets of diverse RNAseq data to be sequenced. In comparison, the Assemblathon2 project compared genome assembly pipelines using data from three vertebrate species [59]. The BUSCO paper assessed 70 genomes and 96 transcriptomes representing groups of diverse species (vertebrates, arthropods, other metazoans, fungi) [52]. Other benchmarking studies have examined transcriptome qualities for samples representing dozens of species from different taxonomic groupings [57,58]. A study with a more restricted evolutionary analysis of 15 plant and animal species found no evidence of taxonomic trends in assembly quality but did find evidence of differences between assembly software packages [58].
With the MMETSP data set, we show that comparison of assembly evaluation metrics across this diversity not only provides a baseline for assembly performance, but also highlights particular metrics which are unique within some taxonomic groups. For example, the phylum Ciliophora had a significantly lower percentage of ORFs compared to other phyla. This is supported by recent work which has found that ciliates have an alternative triplet codon dictionary, with codons normally encoding STOP serving a different purpose [37–39]; thus, typical ORF-finding tools fail to identify ORFs accurately in Ciliophora. Additionally, Dinophyta data sets had a significantly higher number of unique k-mers and total contigs in assemblies compared to the assemblies from other data sets, despite having the same number of input reads. Such a finding supports previous evidence from studies showing that large gene families are constitutively expressed in Dinophyta [60].
In future development of de novo transcriptome assembly software, the incorporation of phylum-specific information may be useful in improving the overall quality of assemblies for different taxa. Phylogenetic trends are important to consider in the assessment of transcriptome quality, given that the assemblies from Dinophyta and Ciliophora are distinguished from other assemblies by some metrics. Applying domain-specific knowledge, such as specialized transcriptional features in a given phylum, in combination with other evaluation metrics can help to evaluate whether a transcriptome is of good quality or “finished” enough to serve as a high quality reference to answer the biological questions of interest.
Conclusion
As the rate of sequencing data generation continues to increase, efforts to programmatically automate the processing and evaluation of sequence data will become increasingly important. Ultimately, the goal in generating de novo transcriptomes is to create the best possible reference on which downstream analyses can be accurately based. This study demonstrated that reanalysis of old data with new tools and methods improved the quality of the reference assembly through an expansion of the gene catalogue of the dataset. Notably, these improvements arose without further experimentation or sequencing.
With the growing volume of nucleic acid data in centralized and de-centralized repositories, streamlining methods into pipelines will not only enhance the reproducibility of future analyses, but will also facilitate inter-comparisons amongst similar and diverse datasets. Automation tools were key in successfully processing and analyzing this large collection of 678 samples.
Acknowledgements
Camille Scott, Luiz Irber, Daniel Standage, and other members of the Data Intensive Biology lab at UC Davis provided helpful assistance with troubleshooting the assembly, annotation and evaluation pipeline. Funding was provided from the Gordon and Betty Moore Foundation under award number GBMF4551 to CTB. Scripts were tested and run on the MSU HPCC and NSF-XSEDE Jetstream cloud platform with allocation TG-BIO160028.