PT  - JOURNAL ARTICLE
AU  - Nathan LaPierre
AU  - Rob Egan
AU  - Wei Wang
AU  - Zhong Wang
TI  - MiniScrub: de novo long read scrubbing using approximate alignment and deep learning
AID  - 10.1101/433573
DP  - 2018 Jan 01
TA  - bioRxiv
PG  - 433573
4099  - http://biorxiv.org/content/early/2018/10/02/433573.short
4100  - http://biorxiv.org/content/early/2018/10/02/433573.full
AB  - Long read sequencing technologies such as Oxford Nanopore can greatly decrease the complexity of de novo genome assembly and large structural variation identification. Currently, Nanopore reads have high error rates, and the errors often cluster into low-quality segments within the reads. Many methods for resolving these errors require access to reference genomes or high-fidelity short reads, which are often not available. De novo error correction modules are available, often as part of assembly tools, but large-scale errors still remain in the resulting assemblies, motivating further innovation in this area. We developed a novel Convolutional Neural Network (CNN) based method, called MiniScrub, for de novo identification and subsequent “scrubbing” (removal) of low-quality Nanopore read segments. MiniScrub first generates read-to-read alignments with MiniMap, then encodes the alignments into images, and finally builds CNN models to predict low-quality segments that can be scrubbed based on a customized quality cutoff. Applying MiniScrub to real-world control datasets under several different parameters, we show that it robustly improves read quality. Compared to raw reads, de novo genome assembly with scrubbed reads produces many fewer mis-assemblies and large indel errors. We propose MiniScrub as a tool for preprocessing Nanopore reads for downstream analyses. MiniScrub is open-source software and is available at https://bitbucket.org/berkeleylab/jgi-miniscrub