PT - JOURNAL ARTICLE
AU - Yao, Brian
AU - Hsu, Chloe
AU - Goldner, Gal
AU - Michaeli, Yael
AU - Ebenstein, Yuval
AU - Listgarten, Jennifer
TI - Nanopore callers for epigenetics from limited supervised data
AID - 10.1101/2021.06.17.448800
DP - 2021 Jan 01
TA - bioRxiv
PG - 2021.06.17.448800
4099 - http://biorxiv.org/content/early/2021/06/17/2021.06.17.448800.short
4100 - http://biorxiv.org/content/early/2021/06/17/2021.06.17.448800.full
AB - Nanopore sequencing platforms combined with supervised machine learning (ML) have been effective at detecting base modifications in DNA such as 5mC and 6mA. These ML-based nanopore callers have typically been trained on data that span all modifications on all possible DNA k-mer backgrounds, that is, a complete training dataset. However, as nanopore technology is pushed to more and more epigenetic modifications, such complete training data will not be feasible to obtain. Nanopore calling has historically been performed with Hidden Markov Models (HMMs) that cannot make successful calls for k-mer contexts not seen during training because of their independent emission distributions. However, deep neural networks (DNNs), which share parameters across contexts, are increasingly being used as callers, often outperforming their HMM cousins. It stands to reason that a DNN approach should be able to better generalize to unseen k-mer contexts. Indeed, herein we demonstrate that a common DNN approach (DeepSignal) outperforms a common HMM approach (Nanopolish) in the incomplete data setting. Furthermore, we propose a novel hybrid HMM-DNN approach, Amortized-HMM, that outperforms both the pure HMM and DNN approaches on 5mC calling when the training data are incomplete. Such an approach is expected to be useful for calling 5hmC and combinations of cytosine modifications, where complete training data are not likely to be available. Competing Interest Statement: The authors have declared no competing interest.