PT - JOURNAL ARTICLE AU - Nathan LaPierre AU - Rob Egan AU - Wei Wang AU - Zhong Wang TI - MiniScrub: <em>de novo</em> long read scrubbing using approximate alignment and deep learning AID - 10.1101/433573 DP - 2018 Jan 01 TA - bioRxiv PG - 433573 4099 - http://biorxiv.org/content/early/2018/10/02/433573.short 4100 - http://biorxiv.org/content/early/2018/10/02/433573.full AB - Long read sequencing technologies such as Oxford Nanopore can greatly de-crease the complexity of de novo genome assembly and large structural variation iden-tification. Currently Nanopore reads have high error rates, and the errors often cluster into low-quality segments within the reads. Many methods for resolving these errors require access to reference genomes, high-fidelity short reads, or reference genomes, which are often not available. De novo error correction modules are available, often as part of assembly tools, but large-scale errors still remain in resulting assemblies, motivating further innovation in this area. We developed a novel Convolutional Neu-ral Network (CNN) based method, called MiniScrub, for de novo identification and subsequent “scrubbing” (removal) of low-quality Nanopore read segments. MiniScrub first generates read-to-read alignments by MiniMap, then encodes the alignments into images, and finally builds CNN models to predict low-quality segments that could be scrubbed based on a customized quality cutoff. Applying MiniScrub to real world con-trol datasets under several different parameters, we show that it robustly improves read quality. Compared to raw reads, de novo genome assembly with scrubbed reads pro-duces many fewer mis-assemblies and large indel errors. We propose MiniScrub as a tool for preprocessing Nanopore reads for downstream analyses. MiniScrub is open-source software and is available at https://bitbucket.org/berkeleylab/jgi-miniscrub