## Abstract

Next-generation sequencing produces large volumes of short sequences with broad applications. The noise due to sequencing errors led to the development of several correction methods. The main correction paradigm expects a high (30-40X), uniform coverage to correctly infer a reference set of subsequences from the reads, which are then used for correction. In practice, most accurate methods use *k*-mer spectrum techniques to obtain a set of reference *k*-mers. However, when correcting NGS datasets that present an uneven coverage, such as RNA-seq data, this paradigm tends to mistake rare variants for errors. It may therefore discard or alter them using highly covered sequences, which leads to an information loss and may introduce bias. In this paper we present two new contributions to cope with this situation.

First, we show that starting from non-uniform sequencing coverages, a De Bruijn graph can be cleaned of most errors while preserving biological variability. Second, we demonstrate that reads can be efficiently corrected via local alignment on the cleaned De Bruijn graph paths. We implemented the described method in a tool dubbed BCT and evaluated its results on RNA-seq and metagenomic data. We show that the graph cleaning strategy combined with the mapping strategy saves more rare *k*-mers, resulting in a more conservative correction than previous methods. BCT is also capable of better taking advantage of the signal of high-depth datasets. We suggest that BCT, being scalable to large metagenomic datasets as well as able to correct shallow single-cell RNA-seq data, can serve as a general corrector for non-uniform data. Availability: BCT is open source and available at github.com/Malfoy/BCT under the Affero GPL License.

## Introduction

### Motivations

Fast and accurate pre-filters are key to many bioinformatics applications such as assembly (to remove spurious nucleotides and propose more contiguous assemblies), variant calling, genotyping or mapping (to help reduce the noise ratio and increase the power of these methods). Removing errors is also decisive in reducing the quantity of data to index, in particular for *k*-mer approaches. Thus, a good read correction should make it possible to index less, though more meaningful, data.

While numerous and very efficient methods exist for genomic correction, fewer methods have been dedicated to short-read datasets presenting uneven coverage^{1}, such as transcriptomics or metagenomics. Short-read sequencing (shotgun and single-cell) remains the most widespread way to access transcriptomic, metagenomic and metatranscriptomic data despite the growing applications of long reads (PacBio, Oxford Nanopore). Partly because of the relative lack of correction methods, most transcriptomic/metagenomic software tools include their own pre-filters, which works against homogeneous pipelines. This motivates efficient and reliable correction of non-uniform coverage short-read data.

## Background

### *k*-mer spectrum methods

These correction methods assume a uniform sequencing coverage. Under the second assumption that sequencing errors are uniformly distributed along read sequences, there must exist a coverage level for *k*-mers (the set of read substrings of length *k*) at which it becomes highly unlikely to observe errors. Thus, they compute the distribution of the *k*-mers’ numbers of occurrences (spectrum) in the dataset, and rely on an abundance threshold chosen to distinguish “weak” (error-prone) *k*-mers from “solid” *k*-mers. “Weak” *k*-mers are then corrected by converting them to corresponding solid ones. Because they are conceptually simple and only rely on computationally efficient structures [15, 18, 28], these methods have been preferred to other approaches relying on suffix structures [25], [21], [13], probabilistic models or multiple sequence alignment [23].
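As an illustration, the spectrum-and-threshold idea fits in a few lines of Python (the function names and the toy threshold are ours, not those of any published tool):

```python
from collections import Counter

def kmer_spectrum(reads, k):
    """Count the occurrences of every k-mer across a read set."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def split_solid_weak(counts, threshold):
    """Partition k-mers into 'solid' (abundance >= threshold) and 'weak' ones."""
    solid = {km for km, c in counts.items() if c >= threshold}
    weak = set(counts) - solid
    return solid, weak

counts = kmer_spectrum(["ACGTACGT", "ACGTACGA", "ACGTACGT"], 4)
solid, weak = split_solid_weak(counts, 2)  # here "ACGA" (seen once) is weak
```

The entire difficulty addressed in this paper lies in choosing `threshold` when coverage is uneven: no single value separates errors from rare variants.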

### Graph-based methods

Recently, methods based on De Bruijn graphs emerged for read correction. The rationale is that a set of *k*-mers can be efficiently represented in such a graph, and that a read can be corrected by aligning it to a graph that contains few errors. Following this idea, two methods [22], [24] were specifically designed to correct long reads from PacBio or Oxford Nanopore technologies (they are therefore out of the scope of short-read correction, as shown in [17]). Lastly, Bcool [17] introduced a new approach dedicated to short reads. The approach first constructs a De Bruijn graph and applies heuristics to remove most of the erroneous *k*-mers from the graph. The reads are then aligned to this cleaned graph, which is used as a set of reference sequences to correct them. This takes more global information into account for the correction by relying on graph paths instead of *k*-mers.

### Adaptation to non-uniform coverage

In transcriptomics, the data reflect gene expression levels. Consequently, relatively low-frequency *k*-mers are not always the mark of a sequencing error, as they may pertain to lowly expressed regions. Moreover, in a De Bruijn graph, alternative splicing or alternative transcription events produce several alternative correct *k*-mers in the same graph region. These properties make the correction process harder. In metagenomics, the mixture of unevenly represented genomes yields similar issues.

The first approach to face this issue was SEECER [14], combining multiple sequence alignment with hidden Markov models, but it scaled poorly. BayesHammer and Rcorrector [27] both derive from the idea of computing groups of neighboring *k*-mers, from which “solid” *k*-mers are identified and used to correct “weak” *k*-mers from the same group. BayesHammer (designed for single-cell datasets) proposes a clustering technique based on Bayesian statistics, derived from k-means clustering, to find groups of close *k*-mers according to the Hamming distance. Rcorrector is designed for transcriptomic datasets, and proposes to choose among sets of *k*-mers from possible alternative paths in the De Bruijn graph to correct an error. The chosen path must not contain *k*-mers counted fewer times than a fixed threshold, and must not be incoherent with the read’s *k*-mer counts. Finally, to our knowledge, no method is specifically dedicated to the correction of shotgun metagenomic data.

### Contribution

In this paper, we show that important algorithmic novelties are required to propose an uneven coverage corrector based on De Bruijn graph cleaning and alignment. We demonstrate that:

- *k*-mer set cleaning steps, whether they are *k*-mer spectrum approaches, assembler pre-filters or Bcool, have room for improvement, and we present a new approach with better results;
- a graph approach can be combined with a likelihood model scoring the unitigs according to their probability of being generated by errors. Such an approach makes it possible to save more “weak” yet meaningful *k*-mers;
- local read alignment on the graph provides better results for read correction (more reads corrected and fewer errors introduced) than current techniques.

## Methods

### Preliminaries

Let Σ be an alphabet of fixed size; in this work we assume Σ = {*A,C,G,T*}. A *k*-mer is a sequence *s*∈Σ^{k} (i.e. of length |*s*| = *k*). We consider a read dataset as a multiset *R*⊂Σ* and an integer *k*, and define the De Bruijn graph as a directed graph *G*_{k}(*R*)=(*N,E*) where *N* is the set of distinct *k*-mers that appear in *R*, and for *u,v*∈*N*, (*u,v*)∈*E* if and only if *u*[2,*k*]=*v*[1,*k*-1] (see (1) in Figure 1).
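For illustration, this definition translates directly into a small sketch (the all-pairs edge enumeration is quadratic in the number of nodes and would not scale to real datasets):

```python
def de_bruijn_graph(reads, k):
    """Node-centric De Bruijn graph: nodes are the distinct k-mers of R,
    and (u, v) is an edge iff the (k-1)-suffix of u equals the
    (k-1)-prefix of v."""
    nodes = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            nodes.add(read[i:i + k])
    edges = {(u, v) for u in nodes for v in nodes if u[1:] == v[:-1]}
    return nodes, edges

nodes, edges = de_bruijn_graph(["ACGTC"], 3)
# nodes: ACG, CGT, GTC; edges: ACG->CGT, CGT->GTC
```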

The out-neighborhood (respectively in-neighborhood) of a node *u* is *N*^{+} = {*v*∈*N*|(*u,v*)∈*E*} (respectively *N*^{−} = {*v*∈*N*|(*v,u*)∈*E*}), and its out-degree *d*^{+}(*u*) (respectively in-degree *d*^{−}(*u*)) is |*N*^{+}| (respectively |*N*^{−}|). A node *u* is called a *branching node* if *d*^{+}(*u*)>1 or *d*^{−}(*u*)>1.

We call a *simple path* in *G*_{k} a sequence of nodes *s* = *u*_{0},…,*u*_{n} such that for each 0≤*i*<*n*, (*u*_{i},*u*_{i+1})∈*E*, *d*^{+}(*u*_{0}) = 1, *d*^{−}(*u*_{n}) = 1, and for each 1≤*i*≤*n*-1, *d*^{+}(*u*_{i})=*d*^{−}(*u*_{i})=1. A simple path *s* = *u*_{0},…,*u*_{n} can be compacted by concatenating *u*_{0} with the last character *u*_{i}[*k*] of each *u*_{i} for 0<*i*≤*n* into a single node of length *k*+*n*, and by keeping only edges to in-neighbors of *u*_{0} and to out-neighbors of *u*_{n} (see (2) in Figure 1). We call a graph whose simple paths have been compacted a *compacted De Bruijn graph*. The resulting compacted nodes are called *unitigs*. In the following, we work with this node-centric, compacted version of the De Bruijn graph.
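A minimal compaction sketch, under the simplifying assumption that the graph contains no isolated cycles (which have no unitig start node and are skipped here):

```python
def compact_unitigs(nodes, edges):
    """Compact each maximal simple path of a De Bruijn graph into a single
    unitig by appending the last character of every extending k-mer."""
    out_nb = {u: [] for u in nodes}
    in_nb = {u: [] for u in nodes}
    for u, v in edges:
        out_nb[u].append(v)
        in_nb[v].append(u)

    def starts_unitig(u):
        # u starts a unitig unless it simply extends a unique predecessor
        return len(in_nb[u]) != 1 or len(out_nb[in_nb[u][0]]) != 1

    unitigs = []
    for u in nodes:
        if not starts_unitig(u):
            continue
        seq = u
        while len(out_nb[u]) == 1 and len(in_nb[out_nb[u][0]]) == 1:
            u = out_nb[u][0]
            seq += u[-1]  # last character of the extending k-mer
        unitigs.append(seq)
    return unitigs

# the linear path ACG -> CGT -> GTC compacts into the unitig "ACGTC"
print(compact_unitigs({"ACG", "CGT", "GTC"}, {("ACG", "CGT"), ("CGT", "GTC")}))
```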

### Algorithm overview

We present an approach to correct reads using a two-step procedure. First, a De Bruijn graph is built from the reads’ *k*-mers. This graph is iteratively cleaned (unitigs are removed, see section 2.3) using topological, textual and unitig coverage information (the coverage of a unitig of size *n*+*k*-1 being the mean coverage of its *n* *k*-mers). Then, reads are mapped to the cleaned graph, using a procedure involving coverage estimation and local alignment, to decide which path is the most eligible to correct each read (see section 2.4).

### Graph cleaning

#### Graph construction and hard filters

To build a reference De Bruijn graph, we rely on Bcalm2 [5] to build a raw De Bruijn graph from all non-unique *k*-mers of the dataset. *k*-mers are extracted and counted, then Bcalm2 constructs the unitigs and computes their average coverage. Optionally, we can apply a filter on this coverage, removing low-coverage unitigs (useful in case (5), Figure 2). While this kind of filtering has been shown to be efficient in genomics [17], such a threshold could suppress poorly represented but relevant *k*-mers. Unless said otherwise, we did not apply this filter.

In genomics, since most sequencing errors create bulges (pattern (3) in Figure 1) and tips (pattern (1) in Figure 2), their removal is done by a very efficient cleaning [1]. However, those patterns are commonly found as variants in transcriptomic De Bruijn graphs [11], so we introduce a more advanced strategy to distinguish sequencing errors from actual variants in the next paragraph.

#### Advanced filters

In this section we present a likelihood model, based on *k*-mer counting and Hamming distance, that attributes to unitigs a probability of having emerged from an error or from a real variant. We use the Bayesian estimation framework to derive rules for keeping or discarding unitigs for the correction step. Despite being based on a simple framework, we demonstrate that these rules help our correction be more accurate than Bcool’s cleaning algorithm.

##### We examine branching patterns

We start from the following observations. First, sequencing errors provoke patterns involving a branching node and alternative paths in the graph. Thus, we can focus on graph regions that correspond to triplets of nodes *u,v,w*∈*N* such that (*u,v*),(*u,w*)∈*E* or (*v,u*),(*w,u*)∈*E*. In the following, *u* is the branching node, and *v* and *w* are the alternative unitigs. In Figure 2, scenario (1) represents such a branching pattern.

##### We examine pairs of alternative nodes and assume that the most covered comes from a real event

Second, for triplets *u,v,w*_{e}, unitigs *w*_{e} containing sequencing errors are expected to be such that the coverages satisfy *c*_{w_{e}} < *c*_{v}. However, due to the skewed *k*-mer distribution, an error can be more covered than other lowly expressed real events in other parts of the graph. This motivates working at a local scale (i.e. each branching pattern must be considered, rather than applying a global threshold). Third, we expect that, for a triplet *u,v,w*_{e}, if *w*_{e} is an erroneous unitig, *v* exists with *c*_{v} > *c*_{w_{e}} and corresponds to a real event node. The existence of the error node without a true alternative is considered highly unlikely, except in the specific scenario depicted in Figure 2 (5).

##### Sequencing errors have a small distance to real events

Within triplets where *w*_{e} is erroneous, we expect that *d*_{hamming}(*v,w*_{e})<*t*, with *t*≤1%×|*w*_{e}|, because we consider a small (0.01% to 1%) and uniform error rate. On the contrary, alternative splicing or insertions/deletions, for instance, are expected to produce distant nodes (see Figure 2 (3) and (4) for the contrast). A SNV can yield a node with a small distance, just as sequencing errors do. Such properties were also noticed and used in [3]. This means that in some cases, SNVs with very shallow coverage might be lost during the correction.
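This distance rule can be sketched as follows; the floor of one substitution for very short unitigs is our own assumption:

```python
def hamming(a, b):
    """Hamming distance between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

def may_be_error_twin(v, w, max_rate=0.01):
    """True when w is close enough to v to be its error-induced twin,
    using the t <= 1% * |w| threshold from the text.  The floor of one
    substitution for short unitigs is an illustrative assumption."""
    if len(v) != len(w):
        return False  # indels or splicing yield distant, unequal-length nodes
    return hamming(v, w) <= max(1, int(max_rate * len(w)))
```

Pairs that fail this test (distant alternatives) are treated as real events such as splicing isoforms and are never candidates for removal.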

##### Rules for the relative filters

We place ourselves in a given branching pattern with a triplet of nodes *u,v,w*, as described above (with *c*_{v} ≥ *c*_{w}), and we want to determine whether *w* comes from an expressed variant or an error.

Following Bayesian frameworks [9] designed for close problems (distinguishing genomic variants), we compare the probability *p*(*err*|*cov*) of observing an error *err* in a node given the data in the given pattern with the probability *p*(*event*|*cov*) of observing a real event *event* under the same circumstances, where *cov* represents the observed data, i.e. the mean *k*-mer coverage of each unitig.

The first posterior probability, corresponding to errors, decomposes as *p*(*cov*|*err*)×*p*(*err*), where:

- *p*(*err*) = *ϵ*^{d} is the prior probability to have the *d* errors in *w* (under the hypothesis that errors are uniformly distributed and that there is no error in *v*);
- *p*(*cov*|*err*) is the probability to generate *c*_{w} erroneous unitigs given that the neighborhood indicates there are *C* unitigs (with *C*=*max*(*c*_{v}+*c*_{w},*c*_{u})).

The second posterior probability, corresponding to an existing variant, decomposes symmetrically as *p*(*cov*|*event*)×*p*(*event*), where *p*(*event*) is the prior probability that both sequences are expressed and contain no sequencing error, and *p*(*cov*|*event*) uses an approximation of the expression of the region from which the unitigs originate, with their coverages as a proxy.

For each pair of unitigs, we compute both *p*(*err*|*cov*) and *p*(*event*|*cov*) and keep the maximum value between the two. If *p*(*err*|*cov*)>>*p*(*event*|*cov*) the unitig is removed, else it is considered as an event and kept in the graph for correction.
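To make the error side concrete, here is a sketch of such a posterior under a binomial likelihood, which is our modelling assumption rather than the paper's exact formula (`eps` is the per-base error rate and `d` the Hamming distance between v and w):

```python
from math import comb

def p_err_given_cov(c_v, c_w, c_u, d, eps=0.01):
    """Sketched error posterior for an alternative unitig w seen c_w times:
    with C = max(c_v + c_w, c_u) sequenced copies, each copy independently
    carries the same d errors with probability eps**d.  The binomial
    likelihood is an illustrative assumption, not BCT's exact formula."""
    C = max(c_v + c_w, c_u)
    p = eps ** d  # prior probability of d independent errors
    return comb(C, c_w) * p ** c_w * (1 - p) ** (C - c_w)
```

Under this sketch, a well-covered alternative (large `c_w`) is nearly impossible to explain with errors and is kept, while a weak, close alternative can plausibly be attributed to sequencing noise.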

##### Implementation details

A detailed paragraph summing up the implementation details is provided in the Appendix. We mention that we apply an iterative procedure to compare pairs of nodes when the branching node *u* has more than two out-neighbors. We also apply a low-pass filter on the *v,w* pairs to account for a putative over-estimation of the error rate. Finally, polyA regions (from messenger RNAs) are removed from the reads before the graph construction to avoid highly connected nodes due to repeats.
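The polyA removal mentioned above can be sketched as follows; the run-length cutoff of 10 and the handling of a leading poly(T) run (reverse-strand reads) are our illustrative assumptions:

```python
import re

def trim_polya(read, min_run=10):
    """Remove a trailing polyA run (and, symmetrically, a leading polyT
    run from reverse-strand reads) before graph construction.  The
    minimum run length is an illustrative default."""
    read = re.sub("A{%d,}$" % min_run, "", read)
    read = re.sub("^T{%d,}" % min_run, "", read)
    return read
```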

### Read correction

During the graph cleaning, some rare events may be removed from the graph due to their low coverage (example (2) in Figure 2). To cope with this peculiar situation, which is not expected in genomic graphs, we developed a new read mapping algorithm. In such cases, our aim is to produce a local alignment of the read to the graph, able to partly correct the read without applying incorrect modifications to the read subsequence absent from the graph.

#### Global strategy

The mapping procedure follows a seed-and-extend paradigm, as described in [16, 17]. During the mapping of a read, the anchors shared between the read and the graph are detected, the alignment is greedily extended from each anchor using a simple linear scoring function, and the best alignments are used for correction. The proposed mapping algorithm brings two important new features, described in detail below: local alignment and coverage-assisted consensus correction.

#### Local alignment

Because we expect some read parts to be missing from the cleaned graph, we allow trimming of the last (respectively first) selected unitigs if it improves the read score, in order to find the best possible mapping even if it does not involve the entire read. The procedure (algorithm described in Appendix) selects the unitig sequence optimizing the alignment score and keeps the corresponding unitigs for correction. Note that if a local alignment is chosen to correct a read, the unmapped part of the read is left unchanged. The rationale behind this choice is that we do not want to risk performing wrong corrections in such cases (Figure 3).
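The local-trimming idea can be illustrated with a maximum-scoring-window sketch (this is not BCT's actual algorithm, which extends from anchors on the graph and handles indels; here we assume an ungapped comparison):

```python
def best_local_correction(read, path_seq, match=1, mismatch=-1):
    """Score the read against a candidate path position by position,
    keep only the best-scoring window (Kadane's algorithm), and correct
    the read inside that window while leaving the trimmed ends untouched."""
    n = min(len(read), len(path_seq))
    scores = [match if read[i] == path_seq[i] else mismatch for i in range(n)]
    best, cur, start, span = 0, 0, 0, (0, 0)
    for i, s in enumerate(scores):
        if cur <= 0:
            cur, start = 0, i  # restart the window after a losing stretch
        cur += s
        if cur > best:
            best, span = cur, (start, i + 1)
    lo, hi = span
    # correct only the mapped window; keep the rest of the read as-is
    return read[:lo] + path_seq[lo:hi] + read[hi:]
```

For example, a single mismatch inside a well-supported window is corrected, while a read whose tail diverges entirely from the graph keeps that tail untouched.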

#### Coverage-assisted correction

Another significant feature of the mapping algorithm is its ability to take the unitig coverages into account. Following the idea of Rcorrector, which estimates the coverage of a read from its *k*-mer counts, our approach estimates a read coverage using the coverages of the unitigs it can be mapped on. This information is then used when different mappings present the same score: the average coverage of the read is computed from the unitigs shared among the mapping candidates. This behavior allows choosing more precisely the path that corresponds to the read context (an example is given in Figure 4, see also the algorithm in Appendix). Once again, the rationale is to avoid performing wrong corrections when possible. We use a comparison-of-means test statistic to decide whether one mapping should be selected among the others or the corresponding read part should be left uncorrected (implementation details in Appendix).
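The coverage-based tie-breaking can be sketched as follows; `read_cov` would come from the unitigs shared among the candidates, and the Poisson-style z-score with its cutoff is an illustrative stand-in for the comparison-of-means test, not BCT's exact statistic:

```python
def pick_by_coverage(read_cov, candidates, z_max=2.0):
    """Among candidate paths tying on alignment score, choose the one whose
    coverage is closest to the read's estimated coverage, or abstain
    (return None) when none is close enough."""
    best, best_z = None, float("inf")
    for path, path_cov in candidates:
        # deviation in units of the expected Poisson noise sqrt(read_cov)
        z = abs(path_cov - read_cov) / max(read_cov, 1) ** 0.5
        if z < best_z:
            best, best_z = path, z
    return best if best_z <= z_max else None

pick_by_coverage(100, [("A", 95), ("B", 10)])  # path "A" matches the read context
```

Abstaining rather than guessing implements the stated rationale of leaving a read part uncorrected instead of risking a wrong correction.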

#### Differences with the other correction strategies

Like BCT, Bcool and Rcorrector rely on De Bruijn graphs for read correction. Bcool was designed to correct genomic data with a high, uniform sequencing coverage. Its approach, based on a global alignment of the read to the graph, is not adapted to RNA-seq data. Indeed, the reference graph may not contain all rare alternative events. Furthermore, the graph cleaning step proposed by Bcool strongly relies on a uniform coverage.

Rcorrector uses paths of the De Bruijn graph built from the reads’ *k*-mers for the correction. However, it does not perform a graph cleaning step. It selects *k*-mer nodes that are covered above a threshold to correct reads in a greedy way. The correction step relies heavily on the coverage estimation of the read to be corrected. Contrary to BCT, the graph is not compacted, so each single *k*-mer coverage is used, and the filter threshold is estimated empirically.

Interestingly, the proposed improvements over Bcool do not impact BCT’s ability to correct genomic datasets. We show in Appendix that BCT provides results similar to Bcool on the correction of a genomic dataset, with the same parameters.

## Results

### Transcriptomic De Bruijn graph cleaning

In this section we assess the ability of our proposed method to remove erroneous *k*-mers from a De Bruijn graph constructed from a non-uniform dataset such as RNA-seq data. To this end, we simulated an RNA-seq dataset with FluxSimulator [12]: 107 million reads (19 billion bases) generated from the RefSeq reference genome v19 with a one percent error rate. We benchmarked our BCT implementation against rnaSPAdes [3], selected as it shares similarities with the presented approach (graph topology and Hamming-based cleaning), and against the graph correction module of Bcool [17]. We also included a simple *k*-mer spectrum technique for comparison.

In Table 1, we compare the produced cleaned De Bruijn graphs to the reference De Bruijn graph constructed from the errorless reads, and measure the number of true positives (“good” *k*-mers seen both in the graph and the reference), false positives (“erroneous” *k*-mers in the graph not present in the reference) and false negatives (“missing” *k*-mers from the reference not found in the graph), from which we derive precision, recall and F1-score.
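These three counts directly give the reported metrics; a minimal sketch:

```python
def graph_metrics(graph_kmers, reference_kmers):
    """Precision, recall and F1-score of a cleaned graph's k-mer set
    against the reference (errorless) k-mer set, as in Table 1."""
    tp = len(graph_kmers & reference_kmers)   # k-mers kept and real
    fp = len(graph_kmers - reference_kmers)   # erroneous k-mers kept
    fn = len(reference_kmers - graph_kmers)   # real k-mers lost
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```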

We show that BCT’s cleaning is able to retain most *k*-mers, even *k*-mers with very low abundance. BCT lost 3.5 million *k*-mers, slightly more than the filter removing unique *k*-mers but fewer than the filter removing *k*-mers seen 3 times or less. We observe that while rnaSPAdes’ and Bcool’s cleaning steps are able to suppress most sequencing errors (low FP and high precision), they tend to suppress a very important part of the expressed *k*-mers. In addition to its high recall, BCT also achieves a ratio of erroneous *k*-mers similar to those methods.

### Results on simulated transcriptomic data

We kept the previously described dataset to assess our ability to correct sequencing errors. We corrected the read set using BFC [15] and Musket [18], which are among the best-performing correctors according to recent benchmarks [17], as well as Bcool and Rcorrector [27]. In Table 2, we report the recall and precision of the correction, the F1-score, the correction ratio (the factor by which the error rate was divided after correction) and the proportion of reads that still contain at least one error after correction.

We observe that BCT is able to correct more errors than Rcorrector at the cost of being slightly less precise. We also observe that *k*-mer spectrum techniques developed for genomic data do not perform well.

Because of low-expression variants, we expect the correction to be harder with a lower coverage, as it becomes more challenging to distinguish erroneous from rare *k*-mers. To assess this effect we constructed smaller datasets by random sub-sampling of the one presented above. We built a “medium” dataset of 53 million reads and 9 billion bases, a “small” dataset of 17 million reads and 5 million bases, and a “shallow” dataset of 13 million reads and 2 billion bases. We kept Rcorrector as the most competitive tool for this benchmark and report the results in Table 3.

We show that BCT is able to improve its correction performance when a higher coverage is available: the correction ratio goes from 35 on the 13 million read dataset to 52 on the dataset containing four times more reads. This can be explained by the fact that a higher coverage leads to the construction of a better De Bruijn graph, as sequencing errors and variants become easier to distinguish. In contrast, the performance of Rcorrector remained very similar despite an order-of-magnitude coverage change.

### Results on simulated metagenomic data

To evaluate the capacity of BCT to correct sequences from metagenomic samples, we created two synthetic datasets. First, we created a highly covered dataset (more than 100X) of 25 mixed bacterial genomes using CAMISIM [8] (genomes included in CAMISIM). Second, we created a low-coverage dataset (20X on average) with a high diversity of bacterial genomes (100 genomes). The datasets used for read generation can be found in the Appendix. For both datasets, we added 1% of uniformly distributed errors.

Results are presented in Table 4. A study of the influence of BCT’s parameters is available in the Appendix. While BCT, BFC and Rcorrector could all handle the correction of 25 genomes, only BFC’s and BCT’s performances kept up with the number of genomes. With default parameters, BCT had the best results on all metrics on the 100 genome dataset.

### Results on real datasets

We tested our correction on more realistic scenarios from human sequencing: 2 RNA-seq files and 5 single-cell RNA-seq (scRNA) files, presenting various read sizes and coverages. Since no ground truth is available when working on real data, we used read alignment on the reference genome as a proxy for the correctors’ performance. All read files (raw and corrected) were mapped to the human reference genome using STAR [7] with default parameters. Since BayesHammer uses different information (FASTQ qualities) than our algorithm, we do not report a direct comparison. However, BayesHammer’s results can be found in the Appendix, showing that it performed best on scRNA-seq, with BCT ranking second in terms of mismatches. Results are presented in Tables 5 and 6, and the complete results are available in the Appendix. BCT successfully reduced the mismatch rates, at a better rate than Rcorrector in 6 out of 7 cases. It increased the mapped read length and decreased the fraction of unmapped reads in most cases.

## Conclusion and future work

In this work we presented a method able to construct a De Bruijn graph containing a low amount of erroneous *k*-mers while preserving most of the diversity present in the dataset, even in the presence of a high discrepancy in the observed coverage. We showed that this graph can be used as a reference for read correction using local read alignment on the De Bruijn graph. We benchmarked the implementation of the proposed method, dubbed BCT, against state-of-the-art correctors on RNA-seq and metagenomic data, and showed the interest of such a method to efficiently raise the signal-to-noise ratio in such datasets.

Since our proposition is designed as a workflow, a natural extension of this work would be to test different graph alignment tools, such as graphAligner [19] or vg [10], to assess whether they could be adapted to perform read correction.

An ongoing application of this work is the hybrid correction of long reads. Transcriptomic long reads (Iso-seq and Nanopore) are increasingly used for full-length isoform identification and *de novo* transcript and alternative splicing discovery. For these data, hindered by high error rates and especially insertions/deletions, a primary correction step is often required. Mixing short and long reads was shown to provide sequences with higher accuracy [6], in particular for technologies prone to systematic errors such as Oxford Nanopore. We would like to investigate the impact of high-quality short reads on such correction.

Another potential function of the proposed method would be to produce paired-end read alignments in order to build merged long reads from read pairs [4]. Such extended and corrected reads could improve assembly or variant calling, especially in repeated regions. These extended reads could also improve hybrid correction, as they should be easier to associate with long reads.

An interesting feature of our method is that most reads can be decomposed into a path of the De Bruijn graph. This compressed representation could be highly efficient to index RNA-seq experiments at the read level. One could build an index of such paths to efficiently associate a *k*-mer (or a set of *k*-mers) with the reads it originates from.

Finally, we see in Figure 5 that the corrected reads account for orders of magnitude fewer distinct *k*-mers to index while preserving most of the original read *k*-mer content. Therefore, this procedure could greatly impact the scalability of *k*-mer based indexing methods such as SBT [26] or BIGSI [2] by drastically reducing the amount of *k*-mers to index.

## Acknowledgments

The authors would like to thank Rayan Chikhi and Mikaël Salson for their support and the insightful discussions. This work was supported by the ANR Transipedia (ANR-18-CE45-0020). We also thank the Université de Lille HPC cloud computing resources and the Institut Pasteur for the computing and storage resources.

## Footnotes

camille.marchet{at}univ-lille.fr, antoine.limasset{at}univ-lille.fr

1 number of reads that cover a given nucleotide in the original sequences