Abstract
Summary The number of cells measured in single-cell transcriptomic data has grown rapidly in recent years. For such large-scale data, subsampling is a powerful and often necessary tool for exploratory data analysis. However, the simplest approach, uniform random subsampling, tends to miss rare cell types because it selects cells in proportion to their abundance. Diversity-preserving subsampling is therefore required for fast exploration of cell types in a large-scale dataset. Here we propose scSampler, an algorithm for fast diversity-preserving subsampling of single-cell transcriptomic data. Using simulated and real data, we show that scSampler consistently outperforms existing subsampling methods in terms of both computational time and the Hausdorff distance between the full and subsampled datasets.
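To make the evaluation criterion concrete, the following is a minimal sketch of the Hausdorff distance between a full dataset and a subsample of it. It is an illustration only and is not the scSampler implementation: the function name hausdorff_to_subsample, the toy two-dimensional data, and the choice of Euclidean distance are all assumptions made here. Because the subsample is a subset of the full data, the Hausdorff distance reduces to the largest distance from any cell to its nearest selected cell.

```python
import numpy as np


def hausdorff_to_subsample(full, sub):
    """Largest Euclidean distance from any row of `full` to its nearest row of `sub`.

    When `sub` is a subset of `full`, this equals the Hausdorff distance
    between the two point sets (the reverse direction is zero).
    Hypothetical helper for illustration; not part of scSampler.
    """
    diff = full[:, None, :] - sub[None, :, :]      # shape (n_full, n_sub, dim)
    dists = np.sqrt((diff ** 2).sum(axis=2))       # pairwise distances
    return dists.min(axis=1).max()                 # worst-covered cell


# Toy data (assumed): one abundant population and one rare, distant population.
rng = np.random.default_rng(0)
common = rng.normal(loc=0.0, scale=1.0, size=(990, 2))
rare = rng.normal(loc=10.0, scale=0.1, size=(10, 2))
full = np.vstack([common, rare])

# A subsample that misses the rare population versus one that keeps one rare cell.
sub_without_rare = common[:20]
sub_with_rare = np.vstack([common[:19], rare[:1]])

d_without = hausdorff_to_subsample(full, sub_without_rare)
d_with = hausdorff_to_subsample(full, sub_with_rare)
print(f"missing rare cells: {d_without:.2f}, covering rare cells: {d_with:.2f}")
```

In this toy setting, a subsample that omits the rare population leaves those cells far from any selected cell, so its Hausdorff distance is larger than that of a subsample of the same size that includes a single rare cell.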
Availability scSampler is implemented in Python and is released under the MIT license. It can be installed with pip install scsampler and used with the Scanpy pipeline. The code is available on GitHub: https://github.com/SONGDONGYUAN1994/scsampler.
Contact linwang{at}gwu.edu; jli{at}stat.ucla.edu
Competing Interest Statement
The authors have declared no competing interest.