The evaluation of RNA-Seq de novo assembly by PacBio long read sequencing

RNA-Seq de novo assembly is an important method for generating transcriptomes of non-model organisms before any downstream analysis. Although many good de novo assembly methods have been developed, one critical issue is that there is still no consensus on how to evaluate them. Establishing a benchmark for evaluating the quality of de novo assemblies is therefore critical. Addressing this challenge will deepen our insight into the properties of different de novo assemblers and their evaluation methods, and provide guidance on choosing the best assembly set as the transcriptome of a non-model organism for further functional analysis. In this article, we generate a "real time" transcriptome using PacBio long reads as a benchmark for evaluating five de novo assemblers and two model-based de novo assembly evaluation methods. By comparing the de novo assemblies generated from RNA-Seq short reads with the "real time" transcriptome from the same biological sample, we find that Trinity is best in completeness, generating more assemblies than the alternative assemblers, but is less continuous and produces more misassemblies; Oases is best in continuity and specificity, but is less complete; the performance of SOAPdenovo-Trans, Trans-ABySS and IDBA-Tran falls in between. For the evaluation methods, DETONATE leverages multiple aspects of an assembly set and ranks the set with the best average performance highest; its contig score can serve as a good metric for selecting assemblies with high completeness, specificity and continuity, but it is not sensitive to misassemblies. The TransRate contig score is useful for removing misassemblies, yet the assemblies in its optimal set are often too few to be used as a transcriptome.

With the rapid development of sequencing technology, transcriptome assembly from RNA-Seq short reads has become increasingly important in many fields, such as plant science (1; 2), animal science (3) and disease related studies (4; 5; 6). Current transcriptome assembly methods mainly fall into three categories: reference-based assembly, de novo assembly, and hybrid assembly that merges the two (7). For non-model organisms with no available reference genome or transcriptome, de novo assembly is the only choice for determining the transcriptome before any downstream analysis. Many de novo assembly methods have been developed; however, there is no consensus on how to evaluate these methods. Therefore, establishing a reliable benchmark for understanding the properties of each de novo assembly tool has become a critical issue (8).

Recently, powerful tools, such as Trinity (9), Oases (10), SOAPdenovo-Trans (11), Trans-ABySS (12), and IDBA-Tran (13), have been developed for de novo assembly of transcriptomes from RNA-Seq short reads. From the data perspective, when evaluating de novo assembly methods, researchers can either simulate RNA-Seq short reads based on a known reference genome or transcriptome (14), or use real RNA-Seq datasets and evaluate the performance of assemblers by comparing the assemblies with the reference transcriptome or the transcriptome of a related species (15; 16). In the first case, even though it is convenient to control the properties of the simulated data, such as the expression levels of transcripts, the sequencing error rate, and the sequencing depth, simulated data cannot completely represent real data. In the latter case, the evaluation relies heavily on the quality of the reference transcriptome. Moreover, the expressed transcripts may vary even among biological replicates or different tissues (17).
The presence of assemblies that are missing from the reference transcriptome does not necessarily mean that those assemblies are misassemblies; such novel transcripts could represent mutated or fusion transcripts that have not been annotated in the reference. Similarly, the absence of assemblies relative to the reference transcriptome does not necessarily indicate incompleteness of an assembly set; the corresponding transcripts may simply not be expressed in that particular sample. Even for a species whose reference transcriptome is well annotated, e.g., Homo sapiens, the reference transcriptomes still vary between institutions (Ensembl and RefSeq) and between versions, which complicates the issue from another perspective.

Two types of methods are used for assessing de novo assemblies: metrics-based methods and model-based methods. Without a reference transcriptome, metrics-based methods can only provide an empirical description of the assemblies, such as their total number and length distribution, rather than an assessment of their quality. With a reference transcriptome, metrics-based methods can comprehensively evaluate the accuracy, completeness, continuity, and misassembly rate of the assemblies. However, this analysis is based on the assumption that the reference transcriptome is complete and reliable (7; 18). Model-based methods, such as DETONATE (19) and TransRate (20), focus on how well the assemblies can be explained by the read evidence. However, each model-based method has its own definition of the "optimal" assembly, which is inconsistent among different models.

Furthermore, model-based methods themselves are hard to evaluate if we do not have a reliable reference transcriptome in hand.

In this study, we utilize PacBio long read sequencing technology to generate a "real time" transcriptome as a benchmark for assessing (1) the properties of five commonly used de novo assembly methods and (2) the effectiveness of two model-based evaluation methods. By comparing the assemblies from the short reads to the "real time" transcriptome from PacBio long reads of the same biological sample, we eliminate the biological uncertainties to a large extent. We conclude that Trinity is best in completeness, but its assembled transcripts are less continuous and have more misassemblies than those of the alternative methods; Oases is best in continuity and specificity (we follow the nomenclature used in rnaQUAST (18); specificity refers to the percentage of assemblies that can be well mapped back to the annotated transcripts), but is less complete; the performance of SOAPdenovo-Trans, Trans-ABySS and IDBA-Tran falls in between. For the model-based evaluation methods, DETONATE ranks the method with average performance across all aspects as the best, while TransRate does not penalize any downsides but only rewards the good aspects of the assemblies. The contig scores of DETONATE can help select assemblies with high completeness, specificity and continuity, but not a low misassembly rate, while the contig scores of TransRate are helpful in removing misassemblies.

RNA-Seq datasets

The datasets we used were from the Sequencing Quality Control (SEQC)/MAQC-III Consortium, which sequenced a human brain sample on multiple platforms, including MiSeq short read sequencing and PacBio long read sequencing (21). MiSeq generated 7.85 million paired-end reads of length 250 bp. PacBio generated 0.68 million Reads of Insert (RoIs) with an average length of 1,640 bp.
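As a rough sanity check, the relative throughput of the two platforms can be computed directly from the figures above; a back-of-the-envelope sketch (the variable names are ours):

```python
# Total sequenced bases per platform, from the counts quoted above.
miseq_pairs = 7.85e6      # paired-end read pairs
miseq_read_len = 250      # bp per mate
pacbio_rois = 0.68e6      # Reads of Insert (RoIs)
pacbio_mean_len = 1640    # bp, average RoI length

miseq_bases = miseq_pairs * 2 * miseq_read_len   # both mates count
pacbio_bases = pacbio_rois * pacbio_mean_len

print(f"MiSeq:  {miseq_bases / 1e9:.2f} Gb")   # ~3.9 Gb
print(f"PacBio: {pacbio_bases / 1e9:.2f} Gb")  # ~1.1 Gb
```

The roughly 3.5-fold difference in total bases is consistent with the lower PacBio throughput noted later in the text.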

Quality control for short read de novo assemblies

We first trimmed the adapters and filtered out the low quality reads from the MiSeq dataset using Trimmomatic (version 0.32). Adapters were removed, read ends were trimmed once the average quality dropped below 16 over a 5-base sliding window, and only trimmed reads longer than 30 bases were used for de novo assembly. FastQC (version 0.11.2) (https://www.bioinformatics.babraham.ac.uk/projects/fastqc) was then used to visualize the read quality before and after cleaning, shown in S1 Fig. To determine the best k-mer for de novo assembly, we used KmerGenie (version
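The filtering rule above can be sketched as a simplified re-implementation of Trimmomatic's SLIDINGWINDOW:5:16 and MINLEN:30 steps (this is our illustrative code, not the tool itself):

```python
def sliding_window_trim(quals, window=5, min_avg=16):
    """Cut the read at the first 5-base window whose mean quality
    drops below min_avg (simplified SLIDINGWINDOW:5:16)."""
    for i in range(len(quals) - window + 1):
        if sum(quals[i:i + window]) / window < min_avg:
            return i  # keep bases [0, i)
    return len(quals)

def keep_read(quals, min_len=30):
    """Apply the window trim, then keep only reads of at
    least min_len bases (simplified MINLEN:30)."""
    return sliding_window_trim(quals) >= min_len

good = [30] * 100            # uniformly high quality: kept in full
bad = [30] * 20 + [2] * 80   # quality collapses after 20 bases: discarded
```

Here `good` survives untrimmed, while `bad` is cut to 18 bases and dropped by the length filter.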

First, to show the relationship between the "real time" transcriptome generated from PacBio long reads and the well annotated human transcriptomes, we drew a Venn diagram between the reference transcriptomes from Ensembl and RefSeq and the "real time" transcriptome using vennBLAST (26). The "real time" transcriptome should thus be more suitable as a benchmark for assessing short read assemblies than the other two references.

Second, we checked whether the abundance of PacBio long reads corresponds to that of the MiSeq short reads. If so, it would provide further evidence that the "real time" transcriptome generated from PacBio long reads can serve as a reliable reference for assessing short read assemblies. In a scatter plot of the ranks of the abundances, many transcripts detected with only one copy in PacBio span expression levels ranging from low to high as estimated by short reads, and lie along the bottom of the plot. This pattern is due to the different throughputs of the two sequencing technologies: PacBio has a lower sequencing throughput than the short read platform, so many transcripts with only one copy detected in PacBio may have many short reads sampled in MiSeq. This relationship between the abundances of PacBio long reads and MiSeq short reads suggests that a majority of transcripts from the "real time" transcriptome should be recoverable by the short read assembly. (Figure caption: the x-axis shows the ranks of gene abundances estimated by (27) using MiSeq short reads; the y-axis shows the ranks of gene counts from PacBio long reads. If a gene is supported by 1f 5p in PacBio long read sequencing, it is supported by one full length read and five partial length reads; its expression is then given as 1 + 0.5 × 5 = 3.5.)
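The weighting used in the figure caption (full-length reads count 1, partial reads 0.5) can be written as a one-liner; the function name is ours, not from the SEQC pipeline:

```python
def pacbio_expression(full, partial):
    """Weighted PacBio support: each full-length read counts 1.0,
    each partial-length read counts 0.5."""
    return full + 0.5 * partial

# A gene supported by "1f 5p" (1 full, 5 partial reads):
pacbio_expression(1, 5)  # 1 + 0.5*5 = 3.5
```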
In summary, the "real time" transcriptome agrees with both the well annotated reference transcriptomes and real time sampling. Therefore, the "real time" transcriptome can be a better benchmark for assessing short read de novo assemblies in terms of both sufficiency and specificity.
Assessments of short read de novo assembly methods

Assembly was performed with each method, and for each method the number and the length of the predicted transcripts were compared (Table 1). (Figure caption: the cumulative curves of the assembly length from the five de novo assembly methods; three reference transcriptomes are also plotted as quality controls.)
The more short reads that can be aligned back to an assembly, the higher the probability that the assembler has generated the correct assembly, if gene expression level is not taken into account at this stage. Although Trinity covers the highest number of short reads, it reports many more predicted assemblies than the other methods.

We evaluated the quality of the assemblies by aligning them back to the "real time" transcriptome. The total number of supporting reads includes both concordantly and discordantly mapped reads. Concordant read support means both reads of a pair map to the same assembly in the correct orientation; discordant read support means the paired reads either do not map to the same assembly or map to an assembly in a reversed orientation. The number of aligned reads per kilobase of assembly is the total number of supporting reads divided by the total length of the assemblies in kilobases.
Oases and Trinity perform very differently, yet each has its own advantages. Oases has the longest average alignment length and the best continuity and specificity.

Assessments of model-based de novo assembly evaluation methods

Two state-of-the-art methods are assessed here: DETONATE (version 1.9) and TransRate (version 1.0.1). The RSEM-EVAL module of DETONATE is used for evaluating de novo assemblies without a reference transcriptome. The RSEM-EVAL score is the sum of three components: the likelihood estimates how well the assemblies are explained by the mapped short reads; the assembly prior assumes the assembly length follows a negative binomial distribution and that the transcripts are independent of each other (the number of isoforms or homologous genes influences this component); and the BIC penalty penalizes the prediction of too many bases and assemblies. Table 4 shows that the likelihood makes the largest contribution to the RSEM-EVAL score. Consistent with Tables 2 and 3, Trinity has the highest likelihood but the lowest assembly prior and BIC penalty, which lowers its overall RSEM-EVAL score. On the contrary, SOAPdenovo-Trans and Trans-ABySS do not score highly in any single component, which is consistent with Table 3, but achieve the best overall RSEM-EVAL scores. Oases and IDBA-Tran have low RSEM-EVAL scores, mainly due to their low likelihoods, though Oases has the best assembly prior and BIC penalty, which is also consistent with Tables 2 and 3.
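Because the RSEM-EVAL score is simply the sum of the three log-scale components, a strong likelihood can be outweighed by a poor assembly prior and BIC penalty; a sketch with purely hypothetical numbers (not taken from Table 4):

```python
# Hypothetical log-scale components (closer to zero is better):
#                 likelihood  prior    BIC penalty
components = {
    "Trinity":    (-1.000e9, -9.0e6, -8.0e6),  # best likelihood, worst prior/penalty
    "SOAP-Trans": (-1.005e9, -4.0e6, -3.5e6),  # no extreme weakness in any component
}
scores = {name: sum(parts) for name, parts in components.items()}
best = max(scores, key=scores.get)  # the least negative total wins
```

With these made-up numbers, SOAP-Trans overtakes Trinity despite its lower likelihood, mirroring the pattern described above.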

TransRate shows the opposite pattern compared with DETONATE. The TransRate assembly score is the geometric mean of the contig scores multiplied by the proportion of short reads that positively support the assemblies. Each contig score is the product of four components: the nucleotide score, measuring the alignment distance between the assembly and the short reads; the coverage score, measuring the fraction of the assembly length covered by reads; the order score, measuring the orientation of the paired-end read mapping; and the segment score, measuring the per-nucleotide read coverage. In Table 5, we find that Oases has the highest TransRate score. However, after optimizing an empirical target function, T = (∏_{c=1}^{n} S(C_c))^{1/n} · R_valid, where S(C_c) is the contig score, n is the number of selected contigs, and R_valid is the proportion of reads that can be mapped to the selected contigs, as recommended by TransRate, the TransRate scores of all methods greatly increase, and Trinity has the best optimal score.

Table 3. The evaluation of the assemblies from five de novo assembly methods by comparing with the "real time" transcriptome. We evaluate the assembly quality in terms of alignability, accuracy, completeness/sensitivity, specificity, continuity, and misassembly, following the nomenclature used in rnaQUAST. Completeness/sensitivity is calculated by aligning assemblies to genes/isoforms in the database, showing how completely the assemblies cover the database. Specificity is calculated by aligning isoforms/genes in the database to assemblies, showing how specific or redundant the assemblies are with respect to the database. Continuity is calculated using the longest assembly that can be continuously mapped to a gene/isoform in the database, showing whether the assemblies are integral. Misassemblies are confirmed by both GMAP and BLASTN, meaning partial alignments from one assembly can be equally well mapped to different locations in the database. Genes/isoforms refer to the transcripts from the "real time" transcriptome; assemblies refer to the de novo assemblies generated by each assembler. Gene/isoform coverage is a percentage: the number of bases of the gene/isoform covered by the assemblies, divided by the length of that gene/isoform. "x% covered genes/isoforms" is the number of genes/isoforms with at least x% gene/isoform coverage. Database coverage is the total number of bases covered by assemblies divided by the total length of all isoforms in the database. The matched fraction of an assembly is the number of matched bases on the assembly divided by the length of that assembly; "x% matched assemblies" is the number of assemblies with at least x% matched fraction. Gene/isoform continuity is also a percentage: the number of bases of the gene/isoform covered by the longest continuous assembly, divided by the length of that gene/isoform; "x% assembled genes/isoforms" is the number of genes/isoforms with at least x% gene/isoform continuity. Dark green: the best performance; light green: good performance, slightly lower than the best but better than the remaining methods; red: the lowest performance. Entries with no color are comparable to each other, clearly better than red and worse than green.
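Under the definitions above, the assembly score and the recommended cutoff optimization can be sketched as follows (a simplified re-implementation; real TransRate derives the four contig subscores from the read alignments themselves):

```python
from math import prod

def assembly_score(contig_scores, valid_reads, total_reads):
    """Geometric mean of the contig scores, weighted by the
    proportion of reads mapped to the selected contigs."""
    n = len(contig_scores)
    return prod(contig_scores) ** (1 / n) * (valid_reads / total_reads)

def optimise(contigs, total_reads):
    """contigs: list of (contig_score, reads_mapped_to_contig).
    Try every score cutoff, i.e. every prefix of the contigs sorted
    by score, and keep the prefix maximising T."""
    ranked = sorted(contigs, key=lambda c: c[0], reverse=True)
    best_T, best_set = 0.0, []
    for k in range(1, len(ranked) + 1):
        prefix = ranked[:k]
        scores = [s for s, _ in prefix]
        reads = sum(r for _, r in prefix)
        T = assembly_score(scores, reads, total_reads)
        if T > best_T:
            best_T, best_set = T, prefix
    return best_T, best_set

contigs = [(0.9, 500), (0.8, 400), (0.05, 10)]  # (score, mapped reads)
best_T, best = optimise(contigs, total_reads=1000)  # drops the 0.05 contig
```

Dropping the one low-scoring contig raises T, which is why optimized TransRate scores exceed the raw ones.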

Contig scores can serve as a good metric for removing low quality assemblies

In the section on assessments of short read de novo assembly methods, we found that the number of predicted assemblies produced by the de novo approaches was, on average, about 15 times the number of transcripts in the "real time" transcriptome; in the section on assessments of model-based de novo assembly evaluation methods, we found that a large portion of assemblies have low DETONATE and TransRate contig scores.

(Table 4 caption: the RSEM-EVAL score is the sum of three components, the likelihood, the assembly prior and the BIC penalty, for each assembly set. The potential bridges show the number of potential links between contigs that are supported by the reads.)

Together, this indicates redundancy among the assemblies. The question, therefore, is whether DETONATE and TransRate contig scores can serve as good metrics for removing low quality assemblies. We selected the top 40,000 assemblies based on the DETONATE contig score, the TransRate contig score, and the FPKM of each assembly. An ideal selection would be an assembly set with no change in completeness, but with increased specificity and continuity and a decreased misassembly rate, compared with the full set of assemblies. (Figure caption: the efficiency of removing low quality assemblies by three different metrics, the DETONATE contig score, the TransRate contig score, and the FPKM of each contig. Four major aspects are evaluated by comparing the top 40,000 selected assemblies with the "real time" transcriptome: completeness, specificity, continuity, and the misassembly rate, as in Table 3. The misassembly rate is calculated as the number of misassemblies divided by the number of assemblies.)
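The selection itself is a plain top-N cut on each per-contig metric; a minimal sketch with placeholder scores (the IDs and values are ours, and N is 40,000 in the study):

```python
def top_n(scores, n):
    """scores: {assembly_id: per-contig metric}. Return the IDs of
    the n highest-scoring assemblies."""
    return set(sorted(scores, key=scores.get, reverse=True)[:n])

# Placeholder contig scores for four assemblies.
detonate = {"a1": -10.0, "a2": -250.0, "a3": -40.0, "a4": -900.0}
transrate = {"a1": 0.9, "a2": 0.6, "a3": 0.1, "a4": 0.4}

kept = top_n(detonate, 2) & top_n(transrate, 2)  # kept by both metrics
```

The two metrics keep different subsets, which is exactly what the comparison in the figure probes.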

In this study, we propose a reliable benchmark, a "real time" transcriptome produced by PacBio long read sequencing, for assessing de novo assembly and evaluation methods. As opposed to other de novo assembly assessment strategies, which either simulate RNA-Seq data or use a well annotated reference transcriptome as the ground truth for real data, our study takes advantage of sequencing the same biological sample with both short read and long read technologies to eliminate the biological uncertainty. The "real time" transcriptome relies on both the well annotated reference transcriptome and real time sampling; thus, it can serve as a better reference for assessing de novo assemblies than either simulation or a reference transcriptome alone.

By comparing the de novo assemblies from the five commonly used methods to the "real time" transcriptome, we find that the properties of the tested assemblers vary significantly. For instance, Trinity has the highest read mapping rate (shown in Table 2) and the best completeness, but generates too many short assemblies in the range of 200-400 bases (shown in Fig 3). This makes Trinity assemblies less continuous and potentially increases the number of assemblies that can be linked by short reads (shown in Table 4). Trinity also has the highest misassembly rate of the five methods (shown in Table 3). An improvement to Trinity would be to decrease the number of misassemblies while increasing the continuity. Oases generally generates the longest and the fewest assemblies of all five methods (shown in Table 1), which gives it the best continuity and specificity (shown in Table 3). However, Oases has a low read mapping rate (shown in Table 2), which makes it less complete than the other methods. An improvement to Oases would be to increase the completeness of the assemblies. The performance of SOAPdenovo-Trans, Trans-ABySS, and IDBA-Tran is very similar, but SOAPdenovo-Trans has the lowest number of mismatches and misassemblies of the five methods (shown in Table 3).

Because of the overall redundancy of the de novo assemblies from all methods, DETONATE and TransRate can serve as good tools for evaluating assemblies and removing low quality ones, but they behave in different ways. The DETONATE assembly score mainly considers the read mapping rate, the independence of transcripts, the total number of assemblies, and the number of assembled bases when evaluating the assemblies.

DETONATE ranks the method with no extreme disadvantages in the above aspects as the best (shown in Table 4). The TransRate assembly score is an empirical function that takes many different aspects into account, mainly including the mapping accuracy, the mapping orientation, the mapping depth, the mapping coverage, and the fraction of mapped reads. By taking the product of the first four terms as the contig score, TransRate effectively treats the first four aspects equally, then weights the contig score by the fraction of mapped reads. Unlike DETONATE, TransRate only rewards the strengths of an assembly and does not penalize its weaknesses. Optimizing the TransRate assembly score is a good way to select the best quality assemblies, but the number of selected assemblies is often low and cannot be controlled by the user.

Both DETONATE and TransRate provide contig scores as an evaluation of each assembly. The contig scores can be used as metrics for removing redundant low quality assemblies. When the top 40,000 assemblies ranked by DETONATE, TransRate and FPKM are examined, we find that the DETONATE contig score can effectively remove redundant assemblies while keeping high completeness and continuity, but is not able to remove misassemblies. The TransRate contig score is very sensitive in removing misassemblies but does not help with completeness, specificity or continuity.

There are some weaknesses in this study. For instance, only one dataset has been tested here, because it is not easy to obtain datasets that have been sequenced by both short read and long read technologies. It would be better to include further benchmark datasets to eliminate any bias from the sequencing platforms or organisms. Also, we evaluated the assemblies from several major perspectives, including length, the total number of assemblies, the read mapping rate, completeness, specificity, continuity and misassembly, by comparing the assemblies with the "real time" transcriptome. There may be additional perspectives that the model-based evaluation methods take into account but that are not included in our metrics.

Note that when the coverage threshold moves from 0.90 to 0.95, more transcripts/genes are dropped than in the previous columns, which indicates that many consensus sequences have a coverage between 0.90 and 0.95. To keep as much information from the PacBio long reads as possible, we consider a coverage of 0.9 long enough to represent a transcript/gene. Similarly for identity: when the identity threshold moves from 0.85 to 0.90, more transcripts/genes are dropped than in the previous columns.
Given that the error rate of PacBio sequencing was about 10%-15% in 2013 and that the average length of the consensus sequences (1,640 bp) was long enough to align a consensus sequence to

(Figure: misassembly rate (%) of all assemblies versus the top 40,000 assemblies selected by DETONATE, by TransRate, and by FPKM.)
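The threshold analysis above, i.e., more sequences dropped when the coverage cutoff moves from 0.90 to 0.95, amounts to counting survivors per cutoff; a sketch with made-up coverage values:

```python
def dropout_per_bin(coverages, cutoffs=(0.80, 0.85, 0.90, 0.95)):
    """For each step between consecutive cutoffs, count how many
    sequences pass the lower cutoff but fail the higher one."""
    kept = [sum(c >= t for c in coverages) for t in cutoffs]
    return [kept[i] - kept[i + 1] for i in range(len(kept) - 1)]

# Hypothetical per-sequence coverage fractions.
coverages = [0.99, 0.97, 0.93, 0.92, 0.91, 0.86, 0.82]
```

With these values the last bin (0.90 to 0.95) drops the most sequences, the pattern that motivated choosing 0.9 as the coverage cutoff.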