ABSTRACT
In single-cell analysis, visualization of high-dimensional data is essential for information extraction and interpretation. t-Distributed Stochastic Neighbor Embedding (t-SNE) is a dimensionality reduction algorithm that facilitates visualization of complex high-dimensional cytometry data as a two-dimensional distribution or ‘map’. t-SNE maps can be interrogated by either expert-driven or automated techniques to categorize single-cell data into relevant biological populations and to discern biologically relevant differences between individual samples. The use of t-SNE for high-parameter mass and fluorescence cytometry datasets enables a more comprehensive and unbiased view of results than traditional biaxial gating. However, successful t-SNE visualization depends on heuristic titration of multiple parameters, as non-optimal embeddings can carry artifacts that make the map difficult or impossible to interpret. Moreover, standard t-SNE implementations fail to produce clear visualizations when millions of datapoints are projected on the map, often making the method unusable for larger biological datasets. To overcome these limitations, we formulated opt-SNE, an array of automated tools for optimal parameter selection in t-SNE visualization. To obtain an optimal embedding in the shortest time, opt-SNE evaluates the Kullback-Leibler (KL) divergence in real time and uses it to tailor the early exaggeration stage of the t-SNE gradient computation to each dataset. Here, we demonstrate that precise timing of early exaggeration, together with scaling the gradient descent learning rate to the size of the dataset, dramatically improves computation time and enables high-quality visualization of both large cytometry and transcriptomics datasets. Our results also explain why existing software solutions with hard-coded t-SNE parameters produce poorly resolved and potentially misleading maps of fluorescence and mass cytometry data.
In sum, our novel approach to t-SNE enables the required fine-tuning of the algorithm to ensure optimal resolution of t-SNE maps and more precise data interpretation.