PT - JOURNAL ARTICLE
AU - Gunjan Baid
AU - Daniel E. Cook
AU - Kishwar Shafin
AU - Taedong Yun
AU - Felipe Llinares-López
AU - Quentin Berthet
AU - Aaron M. Wenger
AU - William J. Rowell
AU - Maria Nattestad
AU - Howard Yang
AU - Alexey Kolesnikov
AU - Armin Töpfer
AU - Waleed Ammar
AU - Jean-Philippe Vert
AU - Ashish Vaswani
AU - Cory Y. McLean
AU - Pi-Chuan Chang
AU - Andrew Carroll
TI - DeepConsensus: Gap-Aware Sequence Transformers for Sequence Correction
AID - 10.1101/2021.08.31.458403
DP - 2021 Jan 01
TA - bioRxiv
PG - 2021.08.31.458403
4099 - http://biorxiv.org/content/early/2021/08/31/2021.08.31.458403.short
4100 - http://biorxiv.org/content/early/2021/08/31/2021.08.31.458403.full
AB - Pacific Biosciences (PacBio) circular consensus sequencing (CCS) generates long (10-25 kb), accurate “HiFi” reads by combining serial observations of a DNA molecule into a consensus sequence. The standard approach to consensus generation uses a hidden Markov model (pbccs). Here, we introduce DeepConsensus, which uses a unique alignment-based loss to train a gap-aware transformer-encoder (GATE) for sequence correction. Compared to pbccs, DeepConsensus reduces read errors in the same dataset by 42%. This increases the yield of PacBio HiFi reads at Q20 by 9%, at Q30 by 27%, and at Q40 by 90%. With two SMRT Cells of HG003, reads from DeepConsensus improve hifiasm assembly contiguity (NG50 4.9Mb to 17.2Mb), increase gene completeness (94% to 97%), reduce false gene duplication rate (1.1% to 0.5%), improve assembly base accuracy (Q43 to Q45), and also reduce variant calling errors by 24%. Competing Interest Statement: GB, DEC, KS, TY, FLL, QB, MN, HY, AK, WA, JPV, AV, CYM, PCC, and AC are employees of Google LLC and own Alphabet stock as part of the standard compensation package. AMW, AT, and WJR are full-time employees and shareholders of Pacific Biosciences. This study was funded by Google LLC.