Abstract
High-throughput, high-dimensional single-cell data is accumulating at a staggering rate. As the cost of data generation decreases, experimental design is moving towards the measurement of many different samples, such as different patients, conditions, or treatments. While scalability is a challenge on its own, dealing with large experimental designs presents a whole new set of problems, such as large-scale batch effects and sample comparison issues. Currently, there are no computational tools that can both handle large amounts of data in a scalable manner (many cells) and at the same time deal with many samples (many patients). Moreover, data analysis currently involves different tools that each operate on their own data representation, which does not guarantee a synchronized analysis pipeline, such as a visualization that matches the clustering. To address these problems, we present SAUCIE, a deep neural network that leverages the high degree of parallelization and scalability offered by neural networks, as well as the deep representation of data that they can learn.
A well-known limitation of neural networks is their lack of interpretability. Our key contribution here is to constrain and regularize the layers with newly formulated regularizations such that their features become interpretable. When large multipatient datasets are fed into SAUCIE, the layers contain denoised and batch-normalized data, a low-dimensional visualization, unsupervised clustering, as well as other information that can be used to explore the data. We show this capability by analyzing a newly generated 180-sample dataset consisting of T cells from dengue patients in India, measured with mass cytometry. We show that SAUCIE, for the first time, can batch normalize and process this 11-million-cell dataset to identify cluster-based signatures of acute dengue infection and create a patient manifold, stratifying immune response to dengue on the basis of single-cell measurements.