Abstract
The rise of large-scale single-cell RNA-seq data has introduced challenges in data processing, chiefly slow processing speed. Leveraging advances in GPU computing ecosystems such as CuPy, and building on the Scanpy and rapids-singlecell packages, we developed ScaleSC, a GPU-accelerated solution for single-cell data processing. ScaleSC delivers over a 20x speedup through GPU computing and significantly improves scalability: by overcoming the memory bottleneck, it handles datasets of 10–40 million cells with over 1000 batches on a single A100 card. This far surpasses rapids-singlecell, which can process only 1 million cells without multi-GPU support. We also resolved discrepancies between the GPU and CPU implementations of the algorithms to ensure consistent results. In addition to these core optimizations, we developed new advanced tools for marker gene identification, cluster merging, and more, all with seamlessly integrated GPU-optimized implementations. Designed for ease of use, the ScaleSC package is compatible with Scanpy workflows and requires minimal adaptation from users. The ScaleSC package (https://github.com/interactivereport/ScaleSC) promises significant benefits for the single-cell RNA-seq computational community.
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
1. Fixed typos.
2. Fixed citation order.
3. Fixed inconsistent use of names.
4. Changed sentences to reflect the improved capabilities of the most recent version of rapids-singlecell.