Abstract
We introduce alevin, a fast end-to-end pipeline for processing droplet-based single-cell RNA sequencing data. It performs cell barcode detection, read mapping, unique molecular identifier (UMI) deduplication, gene count estimation, and cell barcode whitelisting. Alevin’s approach to UMI deduplication accounts for both gene-unique reads and reads that multimap between genes. This addresses the inherent bias in existing tools that discard gene-ambiguous reads, and improves the accuracy of gene abundance estimates.
Footnotes
* asrivastava{at}cs.stonybrook.edu
† lmalik{at}cs.stonybrook.edu
‡ tss38{at}cam.ac.uk
§ i.sudbery{at}sheffield.ac.uk
¶ rob.patro{at}cs.stonybrook.edu
† We note that whether the majority of amplification occurs pre- or post-fragmentation can be protocol-specific, and that this can suggest different strategies for UMI deduplication. Here, we are primarily concerned with the 10X Chromium protocols, which are dominated by pre-fragmentation amplification. However, the method we propose for UMI deduplication can be applied to other protocols as well.