Abstract
Motivation Single-cell RNA sequencing (scRNAseq) technologies allow for measurements of gene expression at a single-cell resolution. This provides researchers with a tremendous advantage for detecting heterogeneity, delineating cellular maps, or identifying rare subpopulations. However, a critical complication remains the low number of single-cell observations, a limitation imposed by the rarity of a subpopulation, tissue degradation, or cost. This lack of sufficient data may make downstream analysis inaccurate or irreproducible. In this work, we present ACTIVA (Automated Cell-Type-informed Introspective Variational Autoencoder): a novel framework for generating realistic synthetic data using a single-stream adversarial variational autoencoder conditioned on cell-type information. Within a single framework, ACTIVA can generate data representative of the entire population or of specific subpopulations on demand, rather than requiring two separate models (such as scGAN and cscGAN). Data generation and augmentation with ACTIVA can enhance scRNAseq pipelines and analysis, such as benchmarking new algorithms, studying the accuracy of classifiers, and detecting marker genes. ACTIVA will facilitate analysis of smaller datasets, potentially reducing the number of patients and animals necessary in initial studies.
Results We train and evaluate models on multiple public scRNAseq datasets. In comparison to GAN-based models (scGAN and cscGAN), we demonstrate that ACTIVA generates cells that are more realistic, are harder for classifiers to identify as synthetic, and have better pairwise correlations between genes. We show that data augmentation with ACTIVA significantly improves the classification of rare subtypes (more than 45% improvement over no augmentation and 4% better than cscGAN), all while reducing training time by an order of magnitude in comparison to both models.
Availability of data and code Links to raw, pre- and post-processed data, source code and tutorials are available at https://github.com/SindiLab.
Supplementary information Supplementary material can be found as a separate file with the same pre-print submission.
Competing Interest Statement
The authors have declared no competing interest.