Using global t-SNE to preserve inter-cluster data structure

Yuansheng Zhou; Tatyana O Sharpee

doi:10.1101/331611

Abstract

The t-distributed Stochastic Neighbor Embedding (t-SNE) method is one of the leading techniques for data visualization and clustering. It finds lower-dimensional embeddings of data points while minimizing distortions in the distances between neighboring points. By construction, t-SNE discards information about the large-scale structure of the data. We show that adding a global cost function to the t-SNE cost function makes it possible to cluster the data while preserving the global inter-cluster structure. We test the new “global t-SNE” (g-SNE) method on one synthetic data set and two real data sets, on flowers and on human brain cells, all of which have significant and meaningful global structure. In all cases, g-SNE outperforms t-SNE in preserving the global structure. The weight parameter λ of the global cost function determines the balance between preserving local distances and preserving global distances. For the human brain atlas data set, we illustrate the tradeoff that λ controls in representing the global structure of the data. Using g-SNE with an optimized λ may therefore yield biological insights into how data are organized on multiple scales.
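As a rough illustration of the idea summarized above, the following minimal sketch evaluates a combined cost of the form (local t-SNE cost) + λ × (global cost). The abstract does not specify the form of the global term, so the kernel used here, which grows with distance as 1 + d², is an assumption chosen only so that distant pairs dominate. The fixed-bandwidth Gaussian affinities are likewise a simplification of the perplexity-calibrated affinities used by standard t-SNE. All function names (`gaussian_affinities`, `local_cost`, `global_cost`, `combined_cost`) are hypothetical.

```python
import numpy as np


def pairwise_sq_dists(Z):
    """Squared Euclidean distances between all rows of Z."""
    sq = np.sum(Z ** 2, axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * (Z @ Z.T)
    np.fill_diagonal(D, 0.0)
    return np.maximum(D, 0.0)  # clip tiny negative values from round-off


def normalize_offdiag(W):
    """Zero the diagonal and scale so that all entries sum to 1."""
    W = W.copy()
    np.fill_diagonal(W, 0.0)
    return W / W.sum()


def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q), summed over the entries where P is positive."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / np.maximum(Q[mask], eps))))


def gaussian_affinities(X, sigma=1.0):
    """Simplified input affinities: one fixed bandwidth for all points
    (assumption; standard t-SNE tunes a bandwidth per point via perplexity)."""
    return normalize_offdiag(np.exp(-pairwise_sq_dists(X) / (2.0 * sigma ** 2)))


def local_cost(P_local, Y):
    """Standard t-SNE cost: KL between input affinities and Student-t
    affinities in the embedding, which emphasizes nearby pairs."""
    Q_local = normalize_offdiag(1.0 / (1.0 + pairwise_sq_dists(Y)))
    return kl_divergence(P_local, Q_local)


def global_cost(X, Y):
    """Assumed global term: same KL form, but with a kernel that grows
    with distance (1 + d^2), so that distant pairs carry the most weight."""
    P_global = normalize_offdiag(1.0 + pairwise_sq_dists(X))
    Q_global = normalize_offdiag(1.0 + pairwise_sq_dists(Y))
    return kl_divergence(P_global, Q_global)


def combined_cost(P_local, X, Y, lam):
    """Local cost plus lam times the global cost; lam = 0 recovers the
    local t-SNE cost alone."""
    return local_cost(P_local, Y) + lam * global_cost(X, Y)
```

Under these assumptions, evaluating `combined_cost` for the same embedding at several values of `lam` shows how strongly the global term contributes; an actual embedding would additionally require minimizing this cost with respect to `Y`, for example by gradient descent, which the sketch omits.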

The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.