Abstract
Motivation Third-generation sequencing technologies can sequence long reads, which is advancing the frontiers of genomics research. However, their high error rates prohibit accurate and efficient downstream analysis. This difficulty has motivated the development of many long read error correction tools, which tackle the problem by sampling redundancy and/or leveraging accurate short reads of the same biological samples. Existing studies assessing these tools use simulated data sets, and are not sufficiently comprehensive in the range of software covered or the diversity of evaluation measures used.
Results In this paper, we present a categorization and review of long read error correction methods, and provide a comprehensive evaluation of the corresponding long read error correction tools. Leveraging recent real sequencing data, we establish benchmark data sets and set up evaluation criteria for a comparative assessment which includes quality of error correction as well as run-time and memory usage. We study how trimming and long read sequencing depth affect error correction in terms of length distribution and genome coverage post-correction, and the impact of error correction performance on an important application of long reads, genome assembly. We provide guidelines for practitioners for choosing among the available error correction tools and identify directions for future research.
Availability The source code is available at https://github.com/haowenz/LRECE.
Contact aluru{at}cc.gatech.edu
1. Introduction
Third-generation sequencing technologies produce long reads with average length of 10 Kbp or more that are orders of magnitude longer than the short reads available through second-generation sequencing technologies (typically a few hundred bp). In fact, the longest read length reported to date is > 1 million bp (Sedlazeck et al., 2018). Longer lengths are attractive because they enable disambiguation of repetitive regions in a genome or a set of genomes. The impact of this valuable long-range information has already been demonstrated for de novo genome assembly (Loman et al., 2015; Chin et al., 2016; Jain et al., 2018), novel variant detection (Sedlazeck et al., 2017; Chaisson et al., 2015), RNA-seq analysis (Gordon et al., 2015), and epigenetics (Rand et al., 2017; Simpson et al., 2017).
The benefit of longer read lengths, however, comes with the major challenge of handling high error rates. Currently, there are two widely used third-generation sequencing platforms – Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT). Both sequencing platforms are similar in terms of their high error rates (ranging from 10-20%), with most errors occurring due to insertions or deletions (indels); however, the error distributions differ (Carneiro et al., 2012; Jain et al., 2015, 2018). PacBio sequencing errors appear to be randomly distributed over the sequence (Korlach and Biosciences, 2013). For ONT, on the other hand, the error profile has been reported to be biased. For example, A to T and T to A substitutions are less frequent than other substitutions, and indels tend to occur in homopolymer regions (Jain et al., 2015; Ashton et al., 2015). These error characteristics pose a challenge for long read data analyses, particularly for detecting correct read overlaps during genome assembly and variants at single base pair resolution, thus motivating the development of error correction methods.
Error correction algorithms are designed to identify and fix or remove sequencing errors, thereby benefiting resequencing or de novo sequencing analysis. In addition, the algorithms should be computationally efficient to handle increasing volumes of sequencing data, particularly in the case of large, complex genomes. Numerous error correction methodologies and software have been developed for short reads; we refer readers to Yang et al. (2012) and Alic et al. (2016) for a thorough review. Given the distinct characteristics of long reads, i.e., significantly higher error rates and lengths, specialized algorithms are needed to correct them. To date, several error correction tools for long reads have been developed, including PacBioToCA (Koren et al., 2012), LSC (Au et al., 2012), ECTools (Lee et al., 2014), LoRDEC (Salmela and Rivals, 2014), proovread (Hackl et al., 2014), NaS (Madoui et al., 2015), Nanocorr (Goodwin et al., 2015), Jabba (Miclotte et al., 2016), CoLoRMap (Haghshenas et al., 2016), LoRMA (Salmela et al., 2016), HALC (Bao and Lan, 2017), FLAS (Bao et al., 2017), FMLRC (Wang et al., 2018), HG-CoLoR (Morisse et al., 2018) and Hercules (Firtina et al., 2018).
In addition, error correction modules have been developed as part of long read de novo assembly pipelines, such as Canu (Koren et al., 2017) and HGAP (Chin et al., 2013). In the assembly pipeline, correction helps by increasing alignment identities of overlapping reads, which facilitates overlap detection and improves assembly. Many long read error correction tools require and make use of highly accurate short reads to correct long reads (accordingly referred to as hybrid methods). Others, referred to as non-hybrid methods, perform self-correction of long reads using overlap information among them.
A few review studies have compared these rapidly evolving error correction algorithms to assess the state of the art. Laehnemann et al. (2015) provide an introduction to error rates/profiles and a methodology overview of some correction tools for various short and long read sequencing platforms, although no benchmark is included. A review and benchmark for hybrid methods is also available (Mahmoud et al., 2017). However, that study used only simulated reads and focused more on speed than on correction accuracy; moreover, it does not include non-hybrid methods in the assessment. More recently, LRCstats (La et al., 2017) was developed for evaluating long read error correction software; however, it is restricted to benchmarking with simulated reads, and it does not provide a comprehensive evaluation of many of the current state-of-the-art correction tools.
While benchmarking with simulated reads is useful, it fails to convey performance in real-world scenarios. Besides the base-level errors (i.e., indels and substitutions), real sequencing data sets also contain larger structural errors, e.g., chimeras (Fichot and Norman, 2013). However, state-of-the-art simulators such as SimLoRD (Stöcker et al., 2016) only generate reads with base-level errors rather than structural errors. Furthermore, Miclotte et al. (2016) consistently observed worse performance when using real reads instead of simulated reads, suggesting that simulation may fail to match the characteristics of actual error distribution. Therefore, benchmarking with real datasets is important.
In this study, we establish benchmark datasets, present an evaluation methodology suitable to long reads, and carry out comprehensive evaluation of the quality and computational resource requirements of state-of-the-art long read correction software. We also study the effect of trimming and different sequencing depths on correction quality. To understand impact of error correction on downstream analysis, we perform assembly using corrected reads generated by various tools and assess quality of the resulting assemblies.
2. Overview of long read error correction methods
2.1 Hybrid methods
Hybrid methods take advantage of the high accuracy of short reads (error rates often < 1%) for correcting errors in long reads. An obvious requirement is that the same biological sample be sequenced using both short read and long read technologies. Based on how these methods make use of short reads, we further divide them into two categories: alignment-based and assembly-based. The first category includes Hercules, CoLoRMap, Nanocorr, NaS, proovread, LSC and PacBioToCA, whereas HG-CoLoR, HALC, Jabba, LoRDEC, and ECTools are in the latter. The ideas underlying the methods are summarized below.
2.1.1 Short-read-alignment-based methods
As a first step, these methods align short reads to long reads using a variety of aligners, e.g. BLAST (Altschul et al., 1990), Novoalign (http://www.novocraft.com/products/novoalign/). As long reads are usually error-prone, some alignments can be missed or biased. Thus, most of the tools in this category utilize various approaches to increase accuracy of alignments. Drawing upon the alignments, these methods use distinct approaches to generate corrected reads.
PacBioToCA
Consensus sequences for long reads are generated by multiple sequence alignment of short reads using AMOS consensus module (Pop et al., 2004).
LSC
Short reads and long reads are compressed using homopolymer compression (HC) transformation prior to alignment. Then error correction is performed at HC points, mismatches and indels by temporarily decompressing the aligned short reads and then generating consensus sequences. Finally, the corrected sequences are decompressed.
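The HC transformation described above can be sketched as follows; the function names and the round-trip structure are illustrative, not LSC's actual implementation:

```python
def hc_compress(seq):
    """Collapse each homopolymer run to a single base, keeping
    the run lengths so the original sequence can be restored."""
    bases, runs = [], []
    for b in seq:
        if bases and bases[-1] == b:
            runs[-1] += 1
        else:
            bases.append(b)
            runs.append(1)
    return "".join(bases), runs

def hc_decompress(compressed, runs):
    """Invert hc_compress by expanding each base to its run length."""
    return "".join(b * n for b, n in zip(compressed, runs))
```

For example, "AAACCGTT" compresses to "ACGT" with run lengths [3, 2, 1, 2]; alignment and correction are performed in the compressed space, and the result is then expanded back.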
proovread
Similar to PacBioToCA and LSC, short reads are mapped to long reads and then the resulting alignments are used to call consensus. But its alignment parameters are carefully selected and adapted to the PacBio sequencing error profile. To further improve correction, the phred quality score and Shannon entropy value are calculated at each nucleotide for quality control and chimera detection, respectively. To reduce run time, an iterative correction strategy is employed. Three pre-correction steps are performed using increasing subsamples of short reads. In each step, the long read regions are masked to reduce alignment search space once they are corrected and covered by a sufficient number of short read alignments. In the final step, all short reads are mapped to the unmasked regions to make corrections.
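The per-nucleotide Shannon entropy used for chimera detection can be illustrated with a minimal sketch; how proovread actually forms the alignment columns it scores is an assumption here:

```python
from collections import Counter
from math import log2

def shannon_entropy(column):
    """Entropy (in bits) of the base composition of short reads
    aligned at one position: a unanimous column has entropy 0,
    while a highly mixed column (potentially an unreliable or
    chimeric region) has higher entropy."""
    counts = Counter(column)
    total = sum(counts.values())
    return -sum((n / total) * log2(n / total) for n in counts.values())
```

A unanimous column such as "AAAA" scores 0 bits, while an evenly split column such as "AACC" scores 1 bit.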
NaS
Like the other tools in this category, it first aligns short reads with long reads. However, only the stringently aligned short reads are kept as seed-reads. Then, instead of calling consensus, short reads similar to these seed-reads are retrieved. Micro-assemblies of these short reads are performed to generate contigs, which are regarded as corrected reads. In other words, the long reads are only used as templates to select seed-reads.
Nanocorr
It follows the same general approach as PacBioToCA and LSC, by aligning short reads to long reads and then calling consensus. But before the consensus step, a dynamic programming algorithm is utilized to select an optimal set of short read alignments that span each long read.
CoLoRMap
CoLoRMap does not directly call consensus. Instead, for each long read region, it runs a shortest path algorithm to construct a sequence of overlapping short reads aligned to that region with minimum edit distance. Subsequently, the region is corrected by the constructed sequence. In addition, for each uncovered region (called gap) on long reads, any unmapped reads with corresponding mapped mates are retrieved and assembled locally to fill the gap.
Hercules
It first aligns short reads to long reads. Then unlike other tools, Hercules uses a machine learning-based algorithm. It creates a profile Hidden Markov Model (pHMM) template for each long read and then learns posterior transition and emission probabilities. Finally, the pHMM is decoded to get the corrected reads.
2.1.2 Short-read-assembly-based methods
These methods first perform assembly with short reads, e.g., generate contigs using an existing assembler, or only build the de Bruijn graph (DBG) based on them. Then the long reads are aligned to the assemblies, i.e., contigs/unitigs or paths in the DBG, and corrected. Algorithms for different tools in this category are summarized below.
ECTools
First, unitigs are generated from short reads using any available assembler and aligned to long reads. Afterwards, the alignments are filtered to select a set of unitigs which provide the best cover for each long read. Finally, differences in bases between each long read and its corresponding unitigs are identified and corrected.
LoRDEC
Unlike ECTools which generates assemblies, LoRDEC only builds a DBG of short reads. Subsequently, it traverses paths in the DBG to correct erroneous regions within each long read. The regions are replaced by the respective optimal paths which are regarded as the corrected sequence.
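LoRDEC's notion of correctable regions rests on "solid" k-mers, those occurring frequently enough in the short reads to be trusted as DBG vertices. The following is a minimal sketch of this idea (parameter names and thresholds are illustrative, not LoRDEC's defaults):

```python
from collections import Counter

def solid_kmers(short_reads, k, min_count=2):
    """k-mers seen at least min_count times in the short reads;
    these form the vertices of the de Bruijn graph."""
    counts = Counter()
    for r in short_reads:
        for i in range(len(r) - k + 1):
            counts[r[i:i + k]] += 1
    return {km for km, n in counts.items() if n >= min_count}

def weak_positions(long_read, solid, k):
    """Positions whose k-mers are absent from the solid set;
    contiguous runs delimit the erroneous regions to be replaced
    by an optimal DBG path between flanking solid k-mers."""
    return [i for i in range(len(long_read) - k + 1)
            if long_read[i:i + k] not in solid]
```

In LoRDEC proper, the weak regions are then bridged by searching for a best-scoring path in the DBG between the bounding solid k-mers.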
Jabba
It adopts a similar strategy as in LoRDEC, and builds a DBG of short reads followed by aligning long reads to the graph to correct them. The improvement is that Jabba employs a seed-and-extend strategy using maximal exact matches (MEMs) as seeds to accelerate the alignment.
HALC
Similar to ECTools, short reads are used to generate contigs as the first step. Unlike other methods which try to avoid ambiguous alignments (Koren et al., 2012; Yang et al., 2010), HALC aligns long reads to the contigs with a relatively low identity requirement, thus allowing long reads to align with their similar repeats which might not be their true genomic origin. Then long reads and contigs are split according to the alignments so that every aligned region on a read has its corresponding aligned contig region. A contig graph is constructed with the aligned contig regions as vertices. A weighted edge is added between two vertices if there are adjacent aligned long read regions supporting it; the more regions support the edge, the lower the weight assigned to it. Each long read is corrected by the path with minimum total weight in the graph. Furthermore, the corrected long read regions are refined by running LoRDEC if they are aligned to similar repeats.
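The minimum-total-weight path selection described above is a standard shortest-path computation; the following sketch uses Dijkstra's algorithm over a toy contig-region graph (the adjacency-list encoding is an assumption for illustration, not HALC's data structure):

```python
import heapq

def min_weight_path(graph, source, target):
    """Dijkstra's algorithm over a graph given as
    {vertex: [(neighbor, weight), ...]}. Returns the path of
    aligned contig regions (which yields the corrected sequence)
    and its total weight."""
    dist, prev = {source: 0}, {}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    path, node = [target], target
    while node != source:
        node = prev[node]
        path.append(node)
    return path[::-1], dist[target]
```

Because more supporting long read regions mean lower edge weights, the minimum-weight path corresponds to the best-supported sequence of contig regions.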
FMLRC
This software uses a DBG-based correction strategy similar to LoRDEC. However, the key difference in the algorithm is that it makes two passes of correction using DBGs with different k-mer sizes. The first pass does the majority of correction, while the second pass with a longer k-mer size corrects repetitive regions in the long reads. Note that a straightforward implementation of a DBG does not support dynamic adjustment of k-mer size. As a result, FMLRC uses an FM-index to implicitly represent DBGs of arbitrary k-mer sizes.
HG-CoLoR
Similar to FMLRC, it avoids using a fixed k-mer size for the de Bruijn graph. Accordingly, it relies on a variable-order de Bruijn graph structure (Kowalski et al., 2015). It also uses a seed-and-extend approach to align long reads to the graph. However, the seeds are found by aligning short reads to long reads rather than directly selecting them from the long reads.
2.2 Non-hybrid methods
These methods perform self-correction with long reads alone. They all contain a step to generate consensus sequences using pairwise alignment/overlap information. However, the respective methods vary in how they find the overlaps and generate consensus sequences. The details are as follows.
FLAS
It takes all-to-all long read overlaps computed using MECAT (Xiao et al., 2017) as input, and clusters the reads that are aligned with each other. In case of ambiguous instances, i.e., the clusters that share the same reads, FLAS evaluates the overlaps by computing alignments using sensitive alignment parameters either to augment the clusters or discard the incorrect overlaps. The refined alignments are then used to correct the reads. To achieve better accuracy, it also corrects errors in the uncorrected regions of the long reads. Accordingly, it constructs a string graph using the corrected regions of long reads, and aligns the uncorrected ones to the graph for further correction.
LoRMA
By gradually increasing the k-mer size, LoRMA iteratively constructs DBGs using k-mers from long reads exceeding a specified frequency threshold, and runs LoRDEC to correct errors based on the respective DBGs. After that, a set of reads similar to each read, termed friends, is selected using the final DBG, which should be more accurate after several rounds of correction. Then, each read is corrected by the consensus sequence generated from its friends.
Canu error correction module
As a first step during the correction process, Canu computes all-versus-all overlap information among the reads using a modified version of MHAP (Berlin et al., 2015). It uses a filtering mechanism during the correction to favor true overlaps over the false ones that occur due to repetitive segments in genomes. The filtering heuristic ensures that each read contributes to correction of no more than D other reads, where D is the expected sequencing depth. Finally, a consensus sequence is generated for each read using its best set of overlaps.
3. Materials and Methods
We selected data sets from recent publicly accessible genome sequencing experiments. For benchmarking the different programs, our experiments used genome sequences from multiple species and different sequencing platforms with recent chemistry, e.g., R9/R7 for ONT or P6-C4/P5-C3 for PacBio. We describe our evaluation criteria and use them for a comprehensive assessment of the correction methods/software.
3.1. Benchmark data sets
Our benchmark includes resequencing data from three reference genomes – Escherichia coli K-12 MG1655 (E. coli), Saccharomyces cerevisiae S288C (yeast), and Drosophila melanogaster ISO1 (fruit fly). The biggest hurdle when benchmarking with real data is the absence of ground truth (i.e., perfectly corrected reads). However, the availability of reference genomes of these strains enables us to evaluate the output of correction software in a reliable manner using the reference. Essentially, differences in a corrected read with respect to the reference imply uncorrected errors. A summary of the selected read data sets is listed in Table 1. We leveraged publicly available high coverage read data sets of the selected genomes available from all three platforms – Illumina (for short reads), PacBio, and ONT. In addition, some of these samples were sequenced using multiple protocols, yielding reads of varying quality. This enabled us to conduct a thorough comparison among error correction software across various error rates and error profiles.
Dataset D1-O1 is a recent MinION sequencing of the E. coli genome (Loman et al., 2015). Its 2D reads were also extracted from the raw reads using poretools (Loman and Quinlan, 2014), and were included in the benchmark as D1-O2. Note that raw reads are more erroneous than the 2D reads, which enabled the evaluation of the tools across different error rates. Giordano et al. (2017) recently released a bundle of PacBio, MinION, and MiSeq sequencing data of the yeast genome. For the same purpose, pass 2D reads and the combination of pass and fail 2D reads of the MinION data were downloaded and regarded as two separate data sets in our benchmark (D2-O1 and D2-O2).
To conduct performance evaluation under different sequencing depths, yeast sequencing reads (D2-P and D2-O1) were subsampled randomly using Seqtk (https://github.com/lh3/seqtk). Subsamples with average depth of 10x and 20x were generated for MinION reads. In addition, 10x, 20x, 30x, 60x and 90x PacBio read subsamples were generated from D2-P. Details of these subsamples are available in Supplementary Table 1.
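Random subsampling of this kind can be reproduced with seqtk itself, or approximated by a short script such as the following sketch, which keeps each FASTQ record with a given probability (the seed value and I/O handling are illustrative):

```python
import random

def subsample_fastq(in_path, out_path, fraction, seed=11):
    """Keep each read with probability `fraction`, in the spirit of
    `seqtk sample`; a fixed seed makes the subsample reproducible."""
    rng = random.Random(seed)
    with open(in_path) as fin, open(out_path, "w") as fout:
        while True:
            record = [fin.readline() for _ in range(4)]  # FASTQ: 4 lines/read
            if not record[0]:
                break  # end of file
            if rng.random() < fraction:
                fout.writelines(record)
```

The target fraction is chosen so that (total sampled bases) / (genome size) matches the desired depth, e.g., 10x or 20x.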
3.2 Evaluation methodology
Our evaluation method takes uncorrected reads, corrected reads, and a reference genome as input. Both the uncorrected and corrected reads were filtered using a user-defined length threshold (default 500). Reads too short to include in downstream analysis were dropped during the filtration. Filtered reads were aligned to the reference genome using Minimap2 (Li, 2018). The majority of reads align to a single position in the reference, and the fraction of base pairs with ambiguous or split read mappings was found to be insignificant (Supplementary Table 2). This can be attributed to two reasons. First, the reads were sequenced from the same reference genome to which they are aligned. Second, as the reads are long (> 500 bp), the majority of base pairs map uniquely to the reference. As a result, we retain only the primary alignment for a read with multiple mappings or split alignments.
In an ideal scenario, an error correction software should take each erroneous long read and produce the error-free version of it, preserving each read and its full length. To assess how close to the ideal one can get, measures such as error rate post-correction or percentage of errors removed (termed gain; see Yang et al. (2012)) can be utilized. However, long read error correction programs do not operate in this fashion. They may completely discard some reads or choose to split an input read into multiple reads when the high error rate cannot be resolved. In addition, short-read-assembly-based error correction programs use long read alignments to de Bruijn graphs, and produce sequences corresponding to the aligned de Bruijn graph paths as output reads instead. Though original reads may not be fully preserved, all that matters for effective use of error correction software is that its output consists of a sufficient number of high quality long reads that reflect adequate read lengths, sequencing depth, and coverage of the genome. Accordingly, our evaluation methodology reflects such assessment.
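The gain measure of Yang et al. (2012), gain = (TP − FP) / (TP + FN), reduces to the fraction of errors removed when expressed in terms of error counts before and after correction; a one-line sketch:

```python
def gain(errors_before, errors_after):
    """Fraction of errors removed by correction. In Yang et al.'s
    notation, errors_before = TP + FN and errors_after = FP + FN,
    so this equals (TP - FP) / (TP + FN). Negative values mean the
    tool introduced more errors than it fixed."""
    return (errors_before - errors_after) / errors_before
```

For example, a tool that reduces 100 errors to 10 achieves a gain of 0.9.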
We measure the number of reads and total bases output by each error correction software, along with the number of aligned reads and total number of aligned bases extracted from alignment statistics, because together they reveal the effectiveness of correction. In addition, statistics that convey the read length distribution, such as maximum length and N50, were calculated to assess the effect of the correction process on read lengths. The fraction of the genome covered by output reads is also reported, to assess whether there are regions of the genome that lost coverage or suffered significant deterioration in coverage depth post-correction. Any significant drop in these metrics can be a potential sign of information loss during correction. Finally, alignment identity is calculated as the number of matched bases divided by the alignment length, averaged over all reads. Tools which achieve maximum alignment identity with minimum loss of information are desirable.
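The identity computation described above amounts to the following sketch; in practice the match counts and alignment lengths would be taken from the aligner's output (e.g., the matching-bases and alignment-block-length columns of Minimap2's PAF records):

```python
def alignment_identity(matches, alignment_length):
    """Per-read identity: matched bases over alignment length,
    where the alignment length also counts mismatch and indel
    columns, so indels lower the identity."""
    return matches / alignment_length

def mean_identity(alignments):
    """Average identity over all primary alignments, given an
    iterable of (matches, alignment_length) pairs."""
    pairs = list(alignments)
    return sum(alignment_identity(m, l) for m, l in pairs) / len(pairs)
```

For instance, a read with 95 matched bases in a 100-column alignment has identity 0.95.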
As part of this study, we provide an evaluation tool to automatically generate the evaluation statistics mentioned above. We also provide a wrapper script that can run state-of-the-art error correction software on a grid engine given any input data from the user. Using the scripts, two types of evaluations can be conducted. Users can either evaluate the performance of a list of tools on their own data to find a suitable tool for their studies, or run any correction tool on the benchmark data and compare it with other state-of-the-art tools.
4 Experimental results and discussion
4.1 Experimental setup
All tests were run on the Swarm cluster located at Georgia Institute of Technology. Each compute node in the cluster has dual Intel Xeon CPU E5-2680 v4 (2.40GHz) processors equipped with a total of 28 cores and 256GB main memory. The cluster runs a 64-bit Red Hat Linux kernel, version 2.6.32.
4.2. Evaluated software
We evaluated 15 long read error correction programs in this study: Hercules, HG-CoLoR, FMLRC, HALC, CoLoRMap, Jabba, Nanocorr, proovread, LoRDEC, ECTools, LSC, PacBioToCA, FLAS, LoRMA and the error correction module in Canu. NaS was not included in the evaluation because it requires the Newbler assembler, which is no longer available from 454. The command line parameters were chosen based on the user documentation of each software (Supplementary note section “Versions and configurations”). The tools were configured to run exclusively on a single compute node and allowed to leverage all 28 cores where multi-threading is supported. A wall-time cutoff of three days was imposed.
4.3. Performance on benchmark data sets
We evaluated the quality and computational resource requirements of each software on our benchmark data sets (Table 1). Results for the three different datasets are shown in Tables 2, 3 and 4, respectively. Because multiple factors are at play when considering accuracy, it is important to consider their collective influence in assessing quality of error correction. In what follows, we present a detailed and comparative discussion on correction accuracy, runtime and memory-usage. In addition, to guide error correction software users and future developers, we provide further insights into the strengths and limitations of various approaches that underpin the software. This includes evaluating their resilience to handle various sequencing depths, studying the effect of discarding or trimming input reads to gain higher accuracy, and impact on genome assembly.
4.3.1 Correction quality
We measure quality using the number of reads and total bases output in comparison with the input, the resulting alignment identity, the fraction of the genome covered, and the read length distribution including maximum size and N50 length. From Tables 2, 3 and 4, we gather that the best performing hybrid methods (e.g., FMLRC) are capable of correcting reads to achieve base-level accuracy in the high 90’s. For the E. coli and yeast data sets, many of these programs achieve alignment identity > 99%. A crucial aspect to consider here is whether the high accuracy is achieved while preserving input read depth and N50. A few tools (e.g., Jabba, proovread) attain high alignment identity at the cost of producing shorter reads and reduced depth because they choose to either discard uncorrected reads or trim the uncorrected regions. This may have a negative impact on downstream analyses. This trade-off is discussed further in Section 4.3.4.
Among the hybrid methods, a key observation is that short-read-assembly-based methods tend to show better performance than short-read-alignment-based methods. We offer the following explanation. Given that long reads are error-prone, short read alignment to long reads is more likely to be wrong (or ambiguous) than long read alignment to graph structures built using short reads. Errors in long reads can cause false positives in identifying the true positions at which the respective short reads should align, leading to false corrections later. For example, during the correction of D3-P, the alignment identity of corrected reads generated by CoLoRMap in fact decreased when compared to the uncorrected reads. The reason is that CoLoRMap uses BWA-MEM (Li, 2013) to map short reads, which is designed to report the best mapping. However, due to the high error rates, the best mapping is not necessarily the true mapping. The large volume of erroneous long reads in D3-P led to many false alignments, which affected the correction process. On the other hand, long read lengths make it possible to have higher confidence when aligning them to paths in the graph. Therefore, in most of the experiments, assembly-based methods were able to produce reasonable correction.
Non-hybrid correction is more challenging as it relies solely on overlaps between erroneous long reads, yet the tools in this category yield competitive accuracy in many cases. However, non-hybrid methods may significantly reduce read count and/or read lengths, and completely fail when the original long reads are highly erroneous. For example, neither Canu nor LoRMA was able to correct D1-O1 where average input identity is only 63.46%. FLAS also discarded most of the reads.
4.3.2 Runtime and memory usage
Scalability of the correction tools is an important aspect to consider in their evaluation. Slow speed or high memory usage makes it difficult to apply them to correct large data sets. Our results show that hybrid methods, in particular assembly-based methods, are much faster than the rest. For instance, PacBioToCA and LSC failed to generate corrected reads in three days for D1-P, while most of the assembly-based tools finished the same job in less than one hour. Nanocorr, ECTools and LSC were unable to finish the correction of D2-O2 in three days, a job that FMLRC and LoRDEC each finished in 30 minutes. Although proovread completed the corrections of D2-P, D2-O1 and D2-O2, its run-time was 49.3, 17.5 and 29.3 times longer, respectively, than that needed by FMLRC. Moreover, assembly-based methods, e.g., LoRDEC and FMLRC, used less memory in most of the experiments. Therefore, in terms of computational performance, users should give priority to assembly-based methods over short-read-alignment-based methods.
Among the non-hybrid methods, LoRMA generally had the highest memory usage of all the tools and was slower than the assembly-based methods. Canu, however, showed superior scalability. Owing to a fast long read overlap detection algorithm using MinHash (Berlin et al., 2015), Canu was able to compute long read overlaps and use them to correct the reads in a time comparable to most of the hybrid methods. The memory footprint of Canu was also lower than that of many hybrid methods. However, Canu did not finish the correction of D3-P in three days, probably because this data set is too large for pairwise overlap computation. FLAS showed performance comparable to Canu, as FLAS also leverages the fast overlap computation method in MECAT (Xiao et al., 2017).
4.3.3 Effect of long read sequencing depth on error correction
Requiring high sequencing coverage for effective error correction can impact both cost and time consumed during sequencing and analysis. The relative cost per base pair using third-generation sequencing is still several folds higher when compared to the latest Illumina sequencers (Sedlazeck et al., 2018). Accordingly, we study how varying long read sequencing depth affects correction quality, while keeping the short read data set fixed. We conducted this experiment using data sets D2-P and D2-O1 with various depth levels obtained using random sub-sampling. The output behavior of the correction tools is shown in Supplementary Tables 3-7.
For corrected reads generated by hybrid methods, no significant change on the metrics was observed except those generated by CoLoRMap. The alignment identity of its corrected reads increased with decreased sequencing depth. This observation is consistent with the experimental results reported by its authors. Similarly, CoLoRMap did not perform well on large data sets such as D3-P and D3-O as large data sets increase the risk of false positive alignments (discussed previously in Section 4.3.1).
On the other hand, the performance of non-hybrid methods deteriorated significantly when sequencing depth was decreased. As non-hybrid methods leverage overlap information to correct errors, they require sufficient long read coverage to make true corrections. The genome fraction covered by corrected reads produced by LoRMA decreased from 99.59% to 82.97% when sequencing depth dropped from 90x to 60x, and further decreased to 9.61%, 5.39% and 3.78% for 30x, 20x and 10x respectively, implying loss of many long reads after correction. The alignment identities were still greater than 99% for all subsamples because LoRMA trimmed the uncorrected regions. For corrected reads generated by Canu, no significant change in genome fraction was observed, but the alignment identity dropped from above 99% to 97.03% and 95.63% for 20x and 10x sequencing depths, respectively. FLAS showed similar performance, but its genome fraction for 10x sequencing depth was only 90.204%, which is lower than the 99.919% achieved by Canu.
4.3.4 Effect of discarding reads during correction
Many correction tools opt for discarding input reads, or regions within reads, that they fail to correct. As a result, the reported alignment identity is high (> 99%), but far fewer bases survive the correction. This effect is most pronounced in corrected reads generated by Jabba, proovread, ECTools, PacBioToCA and LoRMA. They either trim uncorrected regions at sequence ends, or even in the middle, to avoid errors in the final output, which eventually yields high alignment identity. However, aggressive trimming also makes the correction lossy and may influence downstream analysis, because long range information is lost if the reads are shortened or broken into smaller pieces. Therefore, users should be conservative in trimming and turn it off when necessary. One good practice is to keep the uncorrected regions and let downstream analysis tools perform the trimming, e.g., overlap-based trimming after read correction in Canu.
A direct implication of discarding or trimming reads is a change in read length distribution. Figures 1 and 2 show the original and corrected read length distributions. Among all the tools, Hercules, FMLRC, HALC, CoLoRMap and LoRDEC maintain a similar read length distribution after correction, whereas Nanocorr, Jabba, ECTools and proovread lost many long reads after correction due to their trimming step. Nanocorr drops a long read when no short read aligns to it. This procedure can remove many error-prone long reads, which leads to a higher alignment identity after correction. However, the fraction of discarded reads is significant in many cases. For example, a mere 1.5 million bp of cumulative sequence length survived out of the 245.7 million bp D1-O1 data set after correction. ECTools also generated only 10.7 million corrected bases on this data set. Canu changed the read length distribution significantly after correction, although for a different reason (Figure 1): Canu estimates the read length after correction and tries to keep the longest 40x of reads for subsequent assembly. FLAS kept most of the short reads while losing many of the long ones.
A few hybrid methods managed to generate enough corrected reads with relatively high alignment identity. Notably, FMLRC and HG-CoLoR substantially outperformed other tools on D1-O1 by producing high alignment identities of 98.98% and 99.56%, respectively, while maintaining long read lengths (Table 2, Figure 2). HG-CoLoR even generated one extremely long read of 73,992 bp, substantially longer than the longest read (43,624 bp) in D1-O1, perhaps due to its use of an assembly de Bruijn graph during the correction process.
4.3.5 Effect of error correction on genome assembly
We examine the effect of error correction on genome assembly, and evaluate whether the quality of error correction correlates well with the quality of genome assembly performed using the corrected reads. To do so, we computed genome assemblies using corrected PacBio and ONT 2D reads of E. coli, i.e., the corrected reads for D1-P and D1-O2. Assemblies were computed using Canu, and their quality was assessed using QUAST (Gurevich et al., 2013); results are shown in Table 5.
Considering the assemblies generated using corrected PacBio reads (D1-P), an NGA50 score of >3 million bases was obtained when using reads corrected by FLAS, Canu, FMLRC, Nanocorr, LoRDEC or ECTools. Surprisingly, the highest NGA50 was obtained using corrected reads generated by LoRDEC, even though the alignment identity of its corrected reads was lower than that of most tools. Similarly, the highest NGA50 for D1-O2 was achieved using corrected reads generated by Canu, although the alignment identity of those reads was only 93.32%. Therefore, higher alignment identity does not necessarily translate to a higher NGA50, i.e., a more contiguous assembly.
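For reference, an NGA50-style statistic can be computed from the lengths of aligned contig blocks (contigs broken at misassembly breakpoints) rather than raw contig lengths. The sketch below is our own simplified illustration with toy numbers, not QUAST's implementation:

```python
def ngx50(block_lengths, genome_size):
    """Largest length L such that blocks of length >= L together cover at
    least half the reference genome. With aligned block lengths this gives
    NGA50; with raw contig lengths it gives NG50."""
    target = genome_size / 2
    covered = 0
    for length in sorted(block_lengths, reverse=True):
        covered += length
        if covered >= target:
            return length
    return 0  # the blocks cover less than half the genome

# Toy 10 kb genome with aligned blocks of 4, 3, 2 and 1 kb:
print(ngx50([4000, 3000, 2000, 1000], 10_000))  # -> 3000
```

Because NGA50 is normalized by the reference genome size rather than the assembly size, it penalizes assemblies that miss genomic regions, which makes it a stricter contiguity measure than N50.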
We also examined the frequency of mismatches and indels in the assemblies. For both data sets D1-P and D1-O2, corrected reads generated by HALC and ECTools produced assemblies with >500 mismatches, significantly more than the other tools, even though the alignment identity of their corrected reads was competitive with, or superior to, that of other tools. Notably, both HALC and ECTools use contigs assembled from short reads for error correction. Mis-assemblies of short reads, especially in repetitive and low-complexity regions, may cause false corrections, which lead to errors during assembly (Wang et al., 2018). Corrected reads produced by FMLRC yielded the fewest errors in assembly; meanwhile, its alignment identity was also the highest among the methods that avoid trimming. Therefore, higher alignment identity of corrected reads can lead to, but does not guarantee, fewer errors in genome assemblies.
Non-hybrid methods such as LoRMA and Canu produced more indels than mismatches in their assemblies, while most of the hybrid methods showed the opposite behavior. To investigate further, we visualized the alignments of corrected reads generated by Canu and FMLRC for D1-O2 in Supplementary Figures 1, 2 and 3. More indels were observed in the alignments of reads corrected by Canu than by FMLRC. Moreover, for D1-O2, indels mostly occurred in homopolymers, which is consistent with the ONT sequencing error profile. These observations suggest that self-correction methods are not as good at handling indels as hybrid methods.
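The mismatch/indel balance can be tallied directly from read or contig alignments. A minimal sketch, assuming alignments reported with extended CIGAR operations ('=' match, 'X' mismatch, 'I'/'D' indel); the example CIGAR string is illustrative:

```python
import re

def error_counts(cigar):
    """Return (mismatch bases, indel bases, alignment identity) parsed from
    an extended CIGAR string using '=', 'X', 'I' and 'D' operations."""
    matches = mismatches = indels = 0
    for length, op in re.findall(r"(\d+)([=XID])", cigar):
        n = int(length)
        if op == "=":
            matches += n
        elif op == "X":
            mismatches += n
        else:  # 'I' or 'D'
            indels += n
    identity = matches / (matches + mismatches + indels)
    return mismatches, indels, identity

# Toy alignment: mostly matches, 2 mismatches, a 3 bp deletion, a 1 bp insertion
mm, ind, ident = error_counts("50=2X30=3D20=1I10=")
print(mm, ind, round(ident, 3))  # 2 4 0.948
```

Aggregating such counts over all alignments separates the two error classes, which matters because indels are the harder class for assembly polishing.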
5 Conclusions and Future Directions
In this work, we established benchmark data sets and evaluation methods for a comprehensive assessment of long read error correction software. Our results suggest that hybrid methods, aided by accurate short reads, can achieve better correction quality than non-hybrid methods, especially when handling long reads of low coverage depth. Among the hybrid methods, assembly-based methods are superior to alignment-based methods in terms of scalability to large data sets. Moreover, better correction performance, such as preserving a higher proportion of input bases or achieving higher alignment identity, may lead to, but does not guarantee, better results in downstream applications such as genome assembly. Tools with superior correction performance should be further tested in the context of the intended application to determine which are best suited for it.
Users can also select tools based on our experimental results and their specific requirements. When speed is a concern, assembly-based hybrid methods are preferred, as long as short reads are available. In addition, hybrid methods are less sensitive to low sequencing depth than non-hybrid methods; thus, users are advised to choose hybrid methods when sequencing depth is relatively low. In cases where indel errors may seriously harm downstream analyses, hybrid methods should also be preferred over non-hybrid ones.
FMLRC outperformed the other hybrid methods in almost all experiments. Among the non-hybrid methods, Canu and FLAS performed better than LoRMA. Hence, these three are recommended as defaults when users want to avoid laborious testing of all the error correction tools.
For future work, better self-correction algorithms are needed to avoid hybrid sequencing, thereby reducing the experimental labor of short read sequencing. In addition, most correction algorithms run for days to correct errors in the sequencing data of even moderately large and complex genomes such as the fruit fly, and thus become a bottleneck in sequencing data analysis. More efficient or parallel correction algorithms should therefore be developed to ease the computational burden. Furthermore, except for CoLoRMap, none of the hybrid tools makes use of paired-end information during correction, and even in CoLoRMap the use of paired-end reads did not significantly improve correction performance according to previous studies. Paired-end reads have already been used to resolve repeats and remove entanglements in de Bruijn graphs (Bankevich et al., 2012). Since many error correction tools build de Bruijn graphs to correct long reads, paired-end information may also be able to improve error correction.
Most published error correction tools focus on correcting long DNA reads sequenced from a single genome, which also served as the motivation for our review. Long read sequencing is increasingly gaining traction in transcriptomics and metagenomics applications. Whether the existing tools can be leveraged or extended to work effectively in such scenarios remains unclear and is an active area of research (de Lima et al., 2018).
Funding
This work is supported in part by the National Science Foundation under CCF-1718479.
Key Points
Despite the high error rate of long reads, the state-of-the-art correction tools achieve high correction accuracy and throughput.
The best hybrid methods show better performance than non-hybrid methods in terms of correction quality and computing resource usage.
Several correction tools discard reads or trim regions that they fail to correct; practitioners should be mindful of this behavior.
Evaluation of long read error correction should also consider its effect on downstream analysis, since better correction quality does not always imply better accuracy in downstream analysis.
Biographical Note
Haowen Zhang and Chirag Jain are PhD students in the School of Computational Science and Engineering at the Georgia Institute of Technology. Srinivas Aluru, PhD, is a Professor in the School of Computational Science and Engineering and Co-Executive Director of the Institute for Data Engineering and Science at the Georgia Institute of Technology. He is a Fellow of the AAAS and IEEE.