Abstract
We introduce a probabilistic model for estimating the sample index-hopping rate in multiplexed droplet-based single-cell RNA sequencing data and for inferring the true sample of origin of hopped reads. Across the datasets we analyzed, we estimate the sample index-hopping probability to range from 0.003 to 0.009, a small number that counter-intuitively gives rise to a large fraction of ‘phantom molecules’, as high as 85% in a given sample. We demonstrate that our model-based approach can correct for this artifact by accurately purging the majority of phantom molecules from the data. Code and reproducible analysis notebooks are available at https://github.com/csglab/phantom_purge.
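To give a rough intuition for why a per-read hopping probability below 1% can still yield a large fraction of phantom molecules, the following back-of-the-envelope sketch may help. It is not the model developed in this paper; it is a simplified illustration under assumptions made only for this example: reads hop independently, a hopped read lands uniformly on one of the other sample indices, every molecule is sequenced to the same number of reads, and molecules from different samples never share a label. The function names and the numbers in the usage example are hypothetical.

```python
def prob_any_read_hops(reads: int, hop_rate: float) -> float:
    """Probability that at least one of a molecule's reads hops to another index.

    Assumes each read hops independently with probability `hop_rate`.
    """
    return 1.0 - (1.0 - hop_rate) ** reads


def expected_phantom_fraction(
    own_molecules: int,
    foreign_molecules: int,
    reads: int,
    hop_rate: float,
    n_samples: int,
) -> float:
    """Expected share of observed molecules in one sample that are phantoms.

    Illustrative assumptions (not taken from the paper):
    - every molecule is sequenced to exactly `reads` reads;
    - a hopped read lands uniformly on one of the other `n_samples - 1` indices;
    - foreign molecules never collide with this sample's own molecule labels.
    """
    if n_samples < 2:
        raise ValueError("index hopping requires at least two multiplexed samples")

    # Chance that a given foreign molecule leaves at least one read in this sample.
    per_target_rate = hop_rate / (n_samples - 1)
    p_phantom_here = 1.0 - (1.0 - per_target_rate) ** reads
    expected_phantoms = foreign_molecules * p_phantom_here

    # A real molecule stays visible unless every one of its reads hops away.
    expected_real = own_molecules * (1.0 - hop_rate ** reads)

    return expected_phantoms / (expected_phantoms + expected_real)


if __name__ == "__main__":
    # Hypothetical scenario: a low-complexity sample multiplexed with
    # much larger samples on the same lane.
    print(prob_any_read_hops(reads=100, hop_rate=0.005))
    print(
        expected_phantom_fraction(
            own_molecules=1_000,
            foreign_molecules=100_000,
            reads=10,
            hop_rate=0.005,
            n_samples=8,
        )
    )
```

Under these assumptions, two effects compound: a molecule sequenced to many reads has many chances to hop, and a sample with few molecules of its own receives phantoms from all the larger samples it is multiplexed with. This is why the phantom fraction in a sample can far exceed the per-read hopping probability.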
Structure. Section 1 provides a concise summary of the paper. Section 2 gives a brief historical and technical overview of sample index hopping and explains related concepts. The three sections that follow describe the statistical modeling approach, each addressing one goal: (1) building a generative model that probabilistically describes sample index hopping of multiplexed sample reads (Section 3); (2) estimating the index-hopping rate from empirical experimental data (Section 4); and (3) correcting for the effects of sample index hopping through a principled probabilistic procedure that reassigns reads to their true sample of origin and discards predicted phantom molecules by optimally minimizing the false positive rate (Section 5). Section 6 then details the results of the analyses performed on empirical and experimental validation datasets. The Supplementary Notes consist of three sections: (1) Mathematical Derivations, (2) Overview of Computational Workflow, and (3) Method’s Limitations.
Footnotes
Manuscript updated. Figures and Tables updated.