VAE-SNE: a deep generative model for simultaneous dimensionality reduction and clustering

Scientific datasets are growing rapidly in scale and complexity. Consequently, the task of understanding these data to answer scientific questions increasingly requires the use of compression algorithms that reduce dimensionality by combining correlated features and cluster similar observations to summarize large datasets. Here we introduce a method for both dimension reduction and clustering called VAE-SNE (variational autoencoder stochastic neighbor embedding). Our model combines elements from deep learning, probabilistic inference, and manifold learning to produce interpretable compressed representations while also readily scaling to tens-of-millions of observations. Unlike existing methods, VAE-SNE simultaneously compresses high-dimensional data and automatically learns a distribution of clusters within the data — without the need to manually select the number of clusters. This naturally creates a multi-scale representation, which makes it straightforward to generate coarse-grained descriptions for large subsets of related observations and select specific regions of interest for further analysis. VAE-SNE can also quickly and easily embed new samples, detect outliers, and can be optimized with small batches of data, which makes it possible to compress datasets that are otherwise too large to fit into memory. We evaluate VAE-SNE as a general purpose method for dimensionality reduction by applying it to multiple real-world datasets and by comparing its performance with existing methods for dimensionality reduction. We find that VAE-SNE produces high-quality compressed representations with results that are on par with existing nonlinear dimensionality reduction algorithms. As a practical example, we demonstrate how the cluster distribution learned by VAE-SNE can be used for unsupervised action recognition to detect and classify repeated motifs of stereotyped behavior in high-dimensional timeseries data. Finally, we also introduce variants of VAE-SNE for embedding data in polar (spherical) coordinates and for embedding image data from raw pixels. VAE-SNE is a robust, feature-rich, and scalable method with broad applicability to a range of datasets in the life sciences and beyond.


Introduction
Modern scientific research generates large, high-resolution datasets that are complex and high-dimensional, where a single observation from an experimental system can contain measurements for many correlated features. However, most popular methods for summarizing these data struggle to scale beyond a few million observations without specialized, high-performance hardware - especially in the case of large, out-of-core datasets that cannot fit into memory. Recent applications of deep learning (Goodfellow et al.), and deep generative models in particular (Appendix A; Kingma and Welling; Rezende et al.), have begun to address these issues (Ding et al.; Szubert et al.; Ding and Regev). Nevertheless, even with the low memory and computational cost of deep learning methods that can be trained with small batches of data on consumer-grade hardware, these new algorithms are still significantly slower to fit to data than more popular methods because they require costly nearest neighbor or pairwise distance calculations (Becht et al.; Ding et al.; Szubert et al.). The majority of these methods also do not provide any built-in mechanism for detecting outliers, which could potentially bias downstream results and cause statistical errors when testing hypotheses. There has also been a flurry of recent work on advanced methods for clustering data (e.g., Campello et al.).

Here we compare our model with existing methods for dimensionality reduction, and importantly, to ensure that it has practical utility, we demonstrate the application of our method using empirical examples with real-world data from multiple domains. In comparison to existing dimension reduction methods, our proposed method produces low-dimensional data representations with similar, or better, quality while also offering several key improvements. Notably, our approach provides the ability to scale to datasets containing tens-of-millions of observations without specialized, high-performance hardware and automatically learns an interpretable cluster distribution from the data without any manual tuning or expensive computations to determine the number of clusters. Together these results demonstrate that our proposed method is a robust, feature-rich, and scalable tool for data analysis and is widely applicable to a variety of tasks.

Results
We make three main contributions in this paper: (1) First, we introduce a deep generative model for both dimensionality reduction and clustering called variational autoencoder stochastic neighbor embedding (VAE-SNE; Fig. 1; Methods). VAE-SNE can produce a variety of different compressed representations and readily scales to out-of-core datasets with tens-of-millions of observations. Our model builds on numerous ideas from past work by synthesizing methods from a class of generative models known as variational autoencoders (VAEs; Kingma and Welling), the popular dimensionality reduction algorithm (t-distributed) stochastic neighbor embedding (SNE/t-SNE; Hinton and Roweis; van der Maaten and Hinton) and its many extensions (van der Maaten; Wang and Wang; Chien and Hsu; Ding et al.), as well as recent advances in variational inference (Kingma et al.; Burda et al.; Dilokthanakul et al.; Cremer et al.; Tomczak and Welling) and clustering methods (Todd et al.). (2) Second, we apply VAE-SNE, and a variety of other popular dimensionality reduction methods, to compress real-world datasets from different domains. We then quantitatively assess how each algorithm performs in preserving important aspects of the data - including information about local, global, and temporal structure. We also assess generalization to new, out-of-sample data and compare processing speeds for each algorithm. Additionally, we show how the likelihood score produced by VAE-SNE can be used to detect outliers when embedding out-of-sample data. (3) Third, we show how VAE-SNE can be used to automatically cluster large datasets into a small set of interpretable classes. As a practical example, we apply VAE-SNE to a dataset of approximately 21 million observations describing the high-dimensional body posture dynamics of a commonly-used model organism - the fruit fly (Drosophila melanogaster) - to automatically discretize these data into motifs of stereotyped behavior for further analysis.

Figure 1. Overview of the VAE-SNE model. a-f, Observed samples from a high-dimensional data distribution x ∼ p(x) (a) are probabilistically embedded (b) into a low-dimensional latent distribution p_θ(z) (c) using an encoder deep neural network DNN_φ : x → z to generate an approximate latent posterior distribution q_φ(z|x). Samples from the latent distribution z ∼ q_φ(z|x) or z ∼ p_θ(z) (c) are then transformed (f) using a generative decoder deep neural network DNN_θ : z → x to probabilistically reconstruct the high-dimensional data distribution p_θ(x|z). Given a set of observed high-dimensional data {x_1, x_2, ..., x_N}, the model parameters for the encoder and decoder {θ, φ} are optimized so that the approximate posterior for the encoder matches the true posterior from the generative decoder as closely as possible, or q_φ(z|x) ≈ p_θ(z|x), which then creates a functional mapping between the high-dimensional and low-dimensional distributions. To improve local structure preservation during optimization, pairwise distances between vectors in the high-dimensional and low-dimensional space are optimized using pairwise similarity kernels (e), a probability density function of distance, so that the local neighborhoods around each observation match as closely as possible, or p(x|x_i) ≈ q_φ(z|z_i). This preferentially weights the preservation of local neighborhoods over global relationships by assigning more probability mass to nearby neighbors during optimization.
The prior for the latent distribution p_θ(z) is also a learned Gaussian mixture distribution (c) that is jointly optimized with the encoder and decoder to fit the observed data and can be used to cluster the latent distribution (d) into a small set of discrete classes p_θ(y|z) - where highly-overlapping modes (mixture components) within the distribution are automatically merged into the same class label using sparse watershed assignment (Methods; Todd et al.).

VAE-SNE builds on the variational autoencoder framework for dimensionality reduction (Appendix A), which, like other types of autoencoders (Hinton and Salakhutdinov), models high-dimensional data using two deep neural networks: one to encode data to a compressed latent representation, and another to decode the latent vectors and reconstruct the data. However, VAEs are distinct from other autoencoders in that the encoder is used to parameterize continuous distributions of latent vectors - from which latent vectors are then probabilistically sampled - rather than embedding each high-dimensional observation as a single point in the latent space. This type of model offers an attractive dimensionality reduction framework because the objective function (Appendix A) naturally imparts a trade-off between the complexity of the encoded description and the overall accuracy of the decoded reconstruction (Alemi et al.). However, these models suffer from multiple long-standing issues, including a phenomenon known as posterior collapse (Alemi et al.; Dieng et al.), where the latent coordinate space becomes arbitrarily organized and no longer preserves any statistical features of the high-dimensional data distribution. There has been a string of recent work to address these issues, including some relatively straightforward solutions. To this end, VAE-SNE incorporates a stochastic neighbor regularizer (Hinton and Roweis; van der Maaten and Hinton; Ding et al.) that optimizes pairwise similarity kernels between the high- and low-dimensional distributions to strengthen local neighborhood preservation and more explicitly retain a useful representation.

We also draw on other theoretical and practical improvements from the literature to enhance the performance of VAE-SNE (Methods). For example, we use a Gaussian mixture prior for learning the latent distribution (Kingma et al.; Dilokthanakul et al.; Tomczak and Welling). This choice of distribution allows for better local structure preservation and, when combined with sparse watershed assignment to merge overlapping mixture components (Fig. 1; Methods; Todd et al.), serves as a flexible method for clustering data - without the need to manually define the number of clusters or impose strong assumptions about cluster shape. We employ several other advances to further improve structure preservation. For instance, we apply a perplexity annealing technique (Kobak and Berens) to slowly decay the size of the local neighborhoods optimized by the model during training, which helps to preserve structure across multiple scales. Moreover, we extensively optimize the algorithms underlying our model by applying parallel computations on the CPU and GPU that dramatically improve processing speed compared to previous work (Ding et al.).

In addition to our three main contributions, we further extend VAE-SNE to demonstrate its flexibility as a framework for dimensionality reduction. To accomplish this, we introduce a von Mises-Fisher variant of VAE-SNE (Appendix C) that embeds data in polar coordinates (rather than Euclidean coordinates) on a unit sphere, which is potentially a more natural representation for many high-dimensional datasets (Davidson et al.) and solves the "crowding" problem common to some methods (van der Maaten and Hinton; Ding and Regev). Finally, we also apply a modified convolutional version of VAE-SNE (Appendix C) to visualize natural history images of animal specimen collections (Cuthill et al.; Zhang et al.) by directly embedding the raw pixel data. Our results for these two extensions are described in Appendix C.
Comparisons with other dimension reduction algorithms

Current methods for dimensionality reduction generally fall into two classes known as linear and nonlinear algorithms. Linear algorithms, such as principal components analysis (PCA), compress high-dimensional data by learning linearly weighted combinations (affine transformations) of the original feature set. Typically these algorithms are optimized to preserve the global structure of the data, where local neighborhood relationships may be distorted in order to maintain that global structure.

To validate VAE-SNE as a general-purpose method for dimensionality reduction, we quantitatively compare its performance with other dimension reduction algorithms - both linear and nonlinear - using two datasets from different domains (see Methods) describing animal body part dynamics (Berman et al.; Pereira et al.) and single-cell RNA-seq expression profiles for hippocampal neurons (La Manno et al.). We benchmark multiple variants of VAE-SNE with different pairwise similarity kernels for preserving local neighborhood information (including kernel functions with learned parameters; Appendix B), and we compare these results with those from two high-performance variants of t-SNE - FIt-SNE (Linderman et al.) and Barnes-Hut-SNE (van der Maaten) - as well as UMAP (McInnes et al.), scvis (Ding et al.), and ivis (Szubert et al.). We also apply PCA with multiple embedding dimensionalities for a linear baseline comparison. We fit each algorithm with a training set and also embed an out-of-sample test set to assess generalization to new data. For both the training and test sets, we then quantitatively assess each algorithm's ability to preserve different types of information about the high-dimensional data when compressing the data to two dimensions, including local, global, fine-scale, and temporal information (Methods). We quantify local information preservation for each algorithm by measuring the preservation of both metric (distance- or radius-based) and topological (nearest neighbors-based) neighborhoods that are approximately 1% of the total embedding size; we measure global information preservation by calculating the correlation between pairwise distances in high- and low-dimensional space; we assess fine-scale information by measuring neighborhood preservation for multiple neighborhood sizes < 1% of the total embedding size; and we evaluate temporal information preservation by computing the correlation between high- and low-dimensional temporal derivatives in a timeseries dataset.

Overall the qualitative properties of the embeddings produced by each algorithm are strikingly similar within datasets, which likely indicates shared mathematical properties of how the latent distributions are modelled. However, we do find potentially important quantitative differences between these algorithms in terms of information preservation and processing speed. We summarize our overall assessments of each nonlinear dimension reduction algorithm in the supplementary tables. For local structure preservation, we find that learning the similarity kernel parameters (for both Gaussian and Student's t kernels) as a function of each data point does not improve performance for our local preservation metrics. The top performing algorithms for local structure preservation (VAE-SNE, t-SNE, and UMAP) are closely comparable to PCA with a higher-dimensional embedding for both metrics we used to assess local neighborhood preservation.

Global structure preservation
We find that VAE-SNE also does well in preserving global structure for both the training set and the test set. VAE-SNE with a Gaussian SNE kernel performs best for this metric, but VAE-SNE with a t-SNE kernel also performs nearly as well. Notably, all of the neural-network-based methods (VAE-SNE, scvis (Ding et al.), and ivis (Szubert et al.)) outperform both t-SNE and UMAP (McInnes et al.) in preserving global structure for both datasets we tested. This is perhaps not surprising given that recent work has shown neural network models tend to learn the same axes as PCA (Rolinek et al.). Additionally, these results show that learning the similarity kernel parameters as a function of each data point does improve global structure preservation for VAE-SNE with a t-SNE kernel - likely because it is optimized to be more similar to the Gaussian kernel used to calculate high-dimensional similarities (Appendix B). The top performing algorithms for this metric are comparable to PCA, which demonstrates that nonlinear algorithms are capable of preserving the same global information as PCA while also better preserving local structure. On one hand, the scvis algorithm (Ding et al.) in particular excels at preserving global structure for the single-cell RNA-seq dataset we tested. On the other hand, ivis (Szubert et al.) performs much more poorly than the other neural network algorithms for this dataset, and FIt-SNE (Linderman et al.) and Barnes-Hut-SNE (van der Maaten) perform even worse. We also show that UMAP (McInnes et al.) with PCA initialization better preserves global structure than the default Laplacian Eigenmap initialization.
Fine-scale structure preservation

In addition to local and global structure preservation, we evaluate the ability of each algorithm to preserve very fine-scale neighborhood information. We find that both FIt-SNE (Linderman et al.) and Barnes-Hut-SNE (van der Maaten) excel at preserving this fine-scale information for the posture dynamics dataset, while every other nonlinear algorithm performs relatively poorly for both the training and test set. For the single-cell RNA-seq dataset, this distinction is not nearly as large and the algorithms all perform more similarly, which indicates performance varies depending on the dataset. Performance for the ivis algorithm (Szubert et al.) is especially poor for this metric on the single-cell RNA-seq dataset. However, neighborhood membership for neighborhoods between 1% and 10% of the total embedding size is similarly well-preserved for each algorithm.

Temporal structure preservation
Because one of the datasets we use for benchmarking is a behavioral timeseries, for these data we also assess the temporal structure preservation of each algorithm on the out-of-sample test set (the training set is randomly sampled across multiple timeseries, so temporal information is not preserved). We also find that UMAP encounters out-of-memory errors for larger training sets (see Methods), while all of the other algorithms we tested could still successfully run under the same conditions. This helps to illustrate a key advantage of deep learning-based methods, which naturally maintain very low memory complexity by applying optimization using small batches of data.

Using the likelihood to assess out-of-sample data
Because VAE-SNE also calculates a likelihood score for reconstructing the original high-dimensional data, we can use this to assess performance on out-of-sample data, an idea originally proposed by Ding et al. To test this, we calculate the likelihood score for real data from the posture dynamics dataset (Berman et al.; Pereira et al.) and randomly-permuted data (randomized across feature columns) from the same dataset. We find that the likelihood score is reliably lower for the randomized data, and the two likelihood distributions are well separated (Fig. S6a), which shows this metric could potentially be used to detect outliers. We also compare the entropy of the approximate posterior distribution for each embedded sample as another potential metric for detecting outliers. While we find that the entropy is much higher for the randomized data, the distribution is highly overlapping with the entropy for the real data (Fig. S6b), which indicates the entropy may not be as useful for evaluating the embedding quality.
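As a rough illustration of how this likelihood score could be used in practice, the sketch below flags out-of-sample observations whose per-sample log-likelihood falls far below the range seen in the training set. The function names, the percentile threshold, and the assumption that per-sample scores are available as arrays are illustrative, not details of the original implementation.

```python
import numpy as np

def flag_outliers(train_log_likelihood, new_log_likelihood, percentile=1.0):
    """Flag new samples whose log-likelihood is below a low percentile of the
    training distribution; returns a boolean mask (True = likely outlier)."""
    threshold = np.percentile(train_log_likelihood, percentile)
    return new_log_likelihood < threshold

# Example usage with hypothetical per-sample scores from a trained model:
# outliers = flag_outliers(model.score(X_train), model.score(X_new))
```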

Clustering body posture dynamics to reveal stereotyped behavioral organization
To demonstrate its capabilities as a clustering algorithm, we use VAE-SNE to automatically discretize a dynamical timeseries dataset describing the high-dimensional body posture and behavioral repertoire of the fruit fly (Drosophila melanogaster). We use a trained pose estimation model to automatically track the spatial locations of body parts (head, legs, wings, abdomen) directly from video timeseries data and generate time-frequency spectrograms describing body-part dynamics for each observation in the timeseries (Berman et al.), which naturally incorporates multi-scale temporal information into each data vector. We then apply VAE-SNE to compress the data to a low-dimensional latent embedding and simultaneously discretize the dynamical posture timeseries into a set of behavioral clusters. We find that, after optimizing the VAE-SNE model for repeated trials using the full dataset of approximately 21 million observations and applying sparse watershed assignment to generate cluster labels (Methods; Fig. 1d; Todd et al.), VAE-SNE consistently learns the same set of low-level behavioral clusters describing distinct, stereotyped body part movements. We also achieve similar (nearly identical) results when clustering in latent spaces of different dimensionality and when varying the number of components in the Gaussian mixture prior used for clustering, provided that the number of components is large enough (e.g., K ≥ 100).
To provide a broad overview of the behavioral structure discovered by VAE-SNE, we manually group these low-level clusters into 6 high-level clusters by examining video clips sampled from each cluster and by calculating and visualizing the mean spectrograms for each low-level cluster to quantify the average distribution of body part movements across frequencies for each behavioral class. These high-level clusters include: locomotion, anterior grooming, posterior grooming, wing movements, small/slow leg movements (Video S6), and idle behavior. Many of the low-level clusters describe distinct slow/small leg movements, while the remaining low-level clusters describe locomotion (Fig. S8), anterior grooming, posterior grooming, wing movements, and idle behavior. Videos and posture timeseries data sampled from each cluster also clearly demonstrate the stereotypy of behaviors within these behavioral classes, which matches well with previous work describing these dynamics; clusters are largely distinguished by which body parts are moving and by differences in peak movement frequency.

We calculate basic statistics describing cluster usage across individuals, including the marginal (stationary) probability of behavioral classes across individuals and the mean bout length, or the average amount of time a behavior is performed when an individual transitions into that cluster. In particular, the low probability and short bout length for wing movements and the short bout length for slow/small leg movements indicate these clusters may be transitional or idiosyncratic behaviors (Todd et al.). For the low-level locomotion clusters (Fig. S8) we also calculate the forward component of the leg movement velocity (in body lengths per second, or BL·s⁻¹) relative to the egocentric orientation of the animal. We then use the forward velocity to classify each leg in the timeseries as "swing" (forward velocity > 0 BL·s⁻¹) or "stance" (forward velocity ≤ 0 BL·s⁻¹) and find that our low-level locomotion clusters show signatures of distinct locomotory gaits (i.e., tetrapod and tripod gaits; Mendes et al.) with different numbers of legs being used for walking, on average, within each cluster. Together these results demonstrate that VAE-SNE is able to automatically decompose the dynamics of known complex behaviors.
Due to the many philosophical complexities of objectively evaluating unsupervised cluster representations (reviewed by Jain et al.; Kleinberg; Todd et al.), we forgo any further quantitative assessment of our clustering results and instead leave this for future work. For example, it is unclear how to best select the number of clusters for many different algorithms; how to properly compare algorithms that naturally produce different numbers of clusters and cluster shapes; and what metric(s) should be used to meaningfully evaluate a clustering description as generally "good" or "useful" other than manual, qualitative validation of the results, which we already provide here - though several quantitative descriptors with varying levels of desirability have been recently proposed for behavioral data (Todd et al.). Comparing unsupervised cluster labels with a priori-defined labels - as is common practice (e.g., Jiang et al.) - is also problematic, as human-supervised descriptions may not accurately capture the underlying structure of the data distribution, and this is especially true for datasets where the goal is to potentially discover subtle differences that are undetectable by humans (e.g., Wiltschko et al.).

In contrast to many existing methods, VAE-SNE performs dimension reduction and clustering simultaneously, and, unlike most previously-described algorithms for clustering data (e.g., Jiang et al.), it does not require the number of clusters to be specified in advance. In contrast to methods that impose strong assumptions about cluster shape, our clustering method has relaxed assumptions and allows for arbitrarily complex (e.g., non-convex) cluster distributions based on the local structure of the data. Additionally, in comparison to prior methods for unsupervised behavioral analysis, VAE-SNE has the advantage of being able to use more than two dimensions for clustering data, which has been shown to provide higher-quality behavioral labels with many potentially-desirable properties (Todd et al.).

Discussion

Here we introduce VAE-SNE, a deep generative model for simultaneously reducing dimensionality and clustering data. We compare VAE-SNE to existing methods for dimensionality reduction and demonstrate its utility and versatility using real-world examples. Our results establish that VAE-SNE is able to generate robust and interpretable compressed representations for data from different domains and is comparable in performance to other nonlinear methods for dimensionality reduction. In contrast to these existing methods, VAE-SNE has the advantage of being able to automatically cluster similar observations into a small set of classes, which can then be used to summarize large datasets with a coarse-grained description or to select specific subpopulations of data for more detailed analysis. Our approach can also readily scale to very large datasets, including, and especially, out-of-core data that cannot fit into memory, by leveraging techniques from deep learning. However, despite these strengths, VAE-SNE still has important limitations depending on the goals of the user, and there are many ways in which the model could be improved or extended in subsequent iterations. There are also other domains that VAE-SNE could be applied to in the future.

In terms of information preservation, our results show that VAE-SNE preserves local relationships while also minimizing global structure distortion. Additionally, while VAE-SNE is not explicitly an autoregressive model, it still preserves a good deal of high-dimensional timeseries information.
However, our results also show that VAE-SNE, and most of the other dimension reduction methods we tested, do not accurately preserve fine-scale structure (neighborhoods < 1% of the total embedding size). For many applications, preserving these details may be unimportant, but this structure has been shown to be useful for detecting infrequent types of data, such as rare cell types (Linderman et al.). Therefore, our results suggest that if researchers wish to preserve this type of information they should use FIt-SNE (Linderman et al.) or Barnes-Hut-SNE (van der Maaten) over other algorithms for dimension reduction. We also find that, when initialized with PCA rather than the default initialization, UMAP (McInnes et al.) better preserves global structure. These existing methods are still attractive options for dimensionality reduction, but for datasets that do not fit into memory, VAE-SNE provides some distinct advantages.

VAE-SNE has the ability to detect outliers and assess the embedding quality for out-of-sample data. This provides a straightforward mechanism for identifying new data to include in the training set, which can further improve performance. Most of the other algorithms we tested, or at least the specific software implementations we tested, provide no mechanism for quantitatively assessing embedding quality for each observation - with outliers being simply embedded under the assumption that the data are well supported by the training distribution. This can cause problems for any downstream analysis, especially when using statistical tests to answer scientific questions. Further improvements for outlier detection might include the use of Bayesian inference (Hafner et al.) or other methods for estimating predictive uncertainty (reviewed by Kendall and Gal).

We demonstrate that results produced by VAE-SNE can serve as a highly-interpretable coarse-grained description of tens-of-millions of observations - with several advantages over existing methods for clustering data. Applying VAE-SNE to future research in the behavioral sciences could help to reveal the genetic, environmental, and neural underpinnings of animal behavior. The cluster distribution learned by VAE-SNE also provides a naturally multi-scale description of the data, where additional scales with finer- or coarser-grained descriptions can be selected from the model for post-hoc analysis. Recent work has also shown that iteratively adjusting the parameters of the t-SNE similarity kernel can be used to generate a hierarchy of clusters in the latent embedding (Robinson and Pierce-Hoffman), which could potentially be applied to VAE-SNE as well.

To demonstrate the flexibility of VAE-SNE as a deep learning model, we introduce a variant for embedding data in polar coordinates on a unit sphere (Appendix C). We find that VAE-SNE successfully preserves structure in a spherical embedding as well. While we focus on the Euclidean and cosine distances for calculating local neighborhoods, any differentiable distance function could potentially be substituted to create different embedding geometries, and, while we focus on kernels from the location-scale family of probability distributions (i.e., Gaussian, Student's t), other log probability functions could potentially be used as well. We also introduce a convolutional version of VAE-SNE for embedding images directly from raw pixel data (Appendix C). After applying this model to natural history images, we find that it groups perceptually similar images together based on complex combinations of image features rather than simple heuristics like color.

There are a multitude of ways in which VAE-SNE could be further improved or extended.
Naturally, future work could apply more recent advances in variational and probabilistic inference like normalizing flows (Rezende and Mohamed; Kingma et al.; Papamakarios et al.), which allow data to be modeled with a more direct invertible mapping from the latent posterior to the data distribution, while also employing flexible, arbitrarily-complex distributions. The latent distribution used for VAE-SNE could also be modeled using many other types of representations, such as quantized representations (Van Den Oord et al.).

In summary, VAE-SNE is a general-purpose method for dimensionality reduction and clustering that can be applied to many different types of data and readily scales to large datasets. Together our results illustrate that it is a robust, feature-rich method with multiple distinct advantages that make it an effective tool for analyzing real-world datasets across disciplines.

Methods

VAE-SNE builds on the variational autoencoder (VAE; Kingma and Welling), a deep generative model for learning compressed latent representations (Appendix A). However, VAEs suffer from long-standing problems, such as posterior collapse (Alemi et al.; Dieng et al.), that may prevent the model from learning a useful latent representation - e.g., a powerful, overparameterized decoder can simply ignore the compressed latent codes but still produce high-quality reconstructions. These issues render VAEs problematic as a general method for reducing dimensionality, as the primary purpose of dimensionality reduction is to create compressed representations that preserve important statistical features of the original data distribution.

Regularizing the ELBO to improve structure preservation
We address the problems outlined above by optimizing VAE-SNE with a regularized version of the ELBO. This idea of using the SNE objective for regularizing the latent space of VAEs was first proposed by Chien and Hsu, who called it variational manifold probabilistic linear discriminant analysis (vm-PLDA), and was later independently proposed by Ding et al. with their scvis model. However, the idea of applying the SNE objective to autoencoders, and deep neural networks in general, was introduced much earlier by van der Maaten with parametric t-SNE (pt-SNE), who proposed to use this objective in conjunction with an autoencoder to jointly learn a latent embedding. The pt-SNE model (van der Maaten) was also recently combined with advances from the Barnes-Hut-SNE algorithm (van der Maaten) under the name net-SNE (Cho et al.). Additionally, Moody developed one of the first publicly-available pieces of software to combine the SNE objective with variational inference (variational t-SNE, or vt-SNE; and topic-SNE) but did not use a deep neural network to amortize inference across a set of shared parameters. Im et al. also proposed a variational bound on the t-SNE objective to improve optimization.
Here we apply the SNE objective to a VAE in a similar fashion to Ding et al. That is, we use the SNE objective as a method of better preserving structure in the latent embedding produced by our VAE, which improves the usefulness of the compressed representation (approximate posterior) produced by the ELBO. When combined into a single objective, we call this the stochastic neighbor evidence lower bound, or SNELBO. Generalizing from Ding et al., given a high-dimensional data matrix X = {x_1, ..., x_N} and model parameters {θ, φ}, the SNELBO objective is written as:

$$\mathcal{L}(\theta, \phi \mid \mathbf{X}) = \sum_{i=1}^{N} \bigg[\, \gamma\, \underbrace{\mathbb{E}_{q_\phi(z \mid x_i)}\big[\log p_\theta(x_i \mid z)\big]}_{\text{distortion}} \;-\; \beta\, \underbrace{\mathrm{KL}\big(q_\phi(z \mid x_i)\,\|\,p_\theta(z)\big)}_{\text{rate}} \;-\; \alpha\, \underbrace{\mathbb{E}_{q_\phi(z \mid x_i)\, q_\phi(z \mid x_j)}\Big[\textstyle\sum_{j \neq i} \mathrm{KL}\big(p(x_j \mid x_i)\,\|\,q_\phi(z_j \mid z_i)\big)\Big]}_{\text{similarity}} \,\bigg]$$

for i, j = 1, ..., N and i ≠ j, where N is the number of observations in the N × M data matrix X ∈ R^{N×M}. Thus vectors x_i and x_j are the ith and jth rows of X, while z_i and z_j are Monte Carlo samples from the approximate low-dimensional posterior z_i ∼ q_φ(z|x_i) and z_j ∼ q_φ(z|x_j) respectively - sampled using the reparameterization trick from Kingma and Welling, or z_i = µ + σ ⊙ ε, where ε ∼ N(0, I) is an auxiliary noise variable and ⊙ is the element-wise product (see Appendix A for further discussion).
The objective function consists of three terms, which can be interpreted as follows: (1) the expected log likelihood of the decoder distribution (the distortion term) minimizes distortion between the observed ground truth x_i and its reconstruction, or maximizes accuracy, and preserves global structure in the embedding; (2) the divergence between the approximate posterior and the prior distribution (the rate term) constrains the global coordinate space of the embedding and restricts the rate of information (relative to the prior) that can be transmitted through the compressed space; and (3) the expected divergence between pairwise similarities in high-dimensional space p(x_j|x_i) and those in low-dimensional space q_φ(z_j|z_i) (the similarity term) acts as a regularizer to preserve local neighbor relationships between data points. Further details of this stochastic neighbor regularizer are derived in Appendix B.
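To make the structure of this objective concrete, the sketch below computes the negative SNELBO for one mini-batch in PyTorch. The encoder, decoder, and log_prior callables, the Gaussian reconstruction likelihood (written here as a mean squared error), and the fixed Student's t similarity kernel are simplifying assumptions for illustration; they are not the original implementation, which supports several likelihoods, kernels, and learned kernel parameters.

```python
import torch
import torch.nn.functional as F

def snelbo_loss(x, P, encoder, decoder, log_prior, gamma=1.0, beta=1.0, alpha=1.0):
    mu, log_var = encoder(x)                                   # approximate posterior parameters
    z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)   # reparameterization trick
    x_recon = decoder(z)

    # (1) distortion: negative expected log-likelihood of the reconstruction
    distortion = F.mse_loss(x_recon, x, reduction="none").sum(dim=1)

    # (2) rate: single-sample Monte Carlo estimate of KL(q(z|x) || p(z))
    log_two_pi = torch.log(torch.tensor(2.0 * torch.pi))
    log_qz = (-0.5 * (log_var + (z - mu) ** 2 / log_var.exp() + log_two_pi)).sum(dim=1)
    rate = log_qz - log_prior(z)

    # (3) similarity: KL between high-dimensional similarities P (row-normalized,
    #     zero diagonal) and low-dimensional similarities Q from a Student's t kernel
    mask = 1.0 - torch.eye(x.shape[0], device=x.device)
    W = mask / (1.0 + torch.cdist(z, z) ** 2)                  # zero self-transitions
    Q = W / W.sum(dim=1, keepdim=True)
    similarity = (P * (torch.log(P + 1e-12) - torch.log(Q + 1e-12))).sum(dim=1)

    # negative SNELBO, averaged over the batch (to be minimized)
    return (gamma * distortion + beta * rate + alpha * similarity).mean()
```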
The Lagrange multipliers γ, β, and α are used to weight the distortion, rate, and pairwise similarity terms respectively, which we include as hyperparameters for the model. These multipliers can be adjusted to produce different forms of the objective for optimizing the model - e.g., increasing or decreasing the rate with the β multiplier (Higgins et al.) - but in practice we set γ = β = 1, while α is set following Ding et al.

The prior for the latent distribution p_θ(z) is a Gaussian mixture distribution with K components and identity covariance,

$$p_\theta(z) = \sum_{k=1}^{K} \omega_k\, \mathcal{N}(z \mid \mu_k, \mathbf{I}).$$

The mean µ_k ∈ M and mixture weight ω_k ∈ ω of each component are learned as model parameters {M, ω} ∈ θ subject to a softmax normalization constraint $\sum_{k=1}^{K} \omega_k = 1$. We also regularize the prior distribution by minimizing the divergence between the mixture distribution used to weight each component and a maximum-entropy mixture distribution, or:

$$\mathrm{KL}\big(\boldsymbol{\omega}\,\|\,\boldsymbol{u}\big) = \sum_{k=1}^{K} \omega_k \log\frac{\omega_k}{1/K},$$

where u = {1/K, ..., 1/K} is the uniform (maximum-entropy) mixture distribution. This prevents the prior from degenerating to a small number of modes
(a problem described in more detail by Kingma et al.; Dilokthanakul et al.) by increasing the entropy of the mixture distribution. A higher entropy mixture distribution forces the model to utilize more of the components within the distribution, which increases the number of clusters and, consequently, the level of detail of the final clustering description (Still and Bialek). An analogous maximum entropy regularizer was also recently applied to solve the long-standing mode collapse problem common to generative adversarial networks (GANs; Dieng et al.). The covariance for each component distribution could be learned as free parameters, but we find that using a simpler identity covariance matrix I allows for a sufficiently expressive prior distribution without adding additional complexity - and is less prone to cluster degeneracy during optimization.

Because the rate term has no closed-form solution for a Gaussian mixture distribution with K > 1, we instead approximate this term numerically using Monte Carlo integration. In this case we use the expected log-density ratio for calculating the rate (Appendix A), which is written as:

$$\mathrm{KL}\big(q_\phi(z \mid x_i)\,\|\,p_\theta(z)\big) \approx \frac{1}{S}\sum_{s=1}^{S}\Big[\log q_\phi(z^{(s)} \mid x_i) - \log p_\theta(z^{(s)})\Big], \qquad z^{(s)} \sim q_\phi(z \mid x_i),$$

where S is the number of Monte Carlo samples.

Clustering data with the Gaussian mixture prior. After optimizing the parameters for the prior, we can then use the learned Gaussian mixture to assign embedded data to discrete clusters. In other words, we wish to calculate the conditional distribution p_θ(y|z), where y is a vector of class labels, or y = {y_1, y_2, ..., y_K}. However, the Gaussian mixture prior can contain highly-overlapping component distributions, which can cause undesirable side-effects. On one hand, this renders the parameterized mode for each overlapping component an unreliable descriptor of the surrounding local density, as each component is then simply a degenerate sub-mode within a non-Gaussian density cluster rather than a distinct subpopulation within the distribution delineated by the structure of the data. On the other hand, a Gaussian mixture distribution can have any arbitrary arrangement of weighted components, which makes the task of directly calculating the true local density mode for each embedded point both analytically and numerically intractable. Therefore, to circumvent these problems, we apply sparse watershed assignment (Todd et al.) to merge overlapping components and ensure each latent vector is assigned to the mode that best represents the underlying density of the local neighborhood for that observation.

To accomplish this, the expected value of the approximate posterior for each data point is initially assigned to a single mode in the Gaussian mixture distribution by calculating the weighted mixture component with the maximum likelihood (minimum distortion), which is written as:

$$k_i = \operatorname*{arg\,max}_{k}\ \omega_k\, \mathcal{N}\!\big(\mathbb{E}[q_\phi(z \mid x_i)] \mid \mu_k, \mathbf{I}\big),$$

where k_i is the initial cluster assignment for the ith data point x_i. We then combine degenerate (highly-overlapping) modes from the distribution by applying the sparse watershed procedure described by Todd et al. Using this procedure, the initial cluster assignments are further combined by optimizing the mean of each component to ascend to its local maximum within the Gaussian mixture prior, which we write as a minimization of the negative log-likelihood, or:

$$\mu_k^* = \operatorname*{arg\,min}_{\mu_k^*}\ -\log \sum_{l=1}^{K} \omega_l\, \mathcal{N}\!\big(\mu_k^* \mid \mu_l, \mathbf{I}\big),$$

where µ*_k ∈ M* is the optimized mean of each component (initialized at µ_k). We optimize this objective numerically with the Adam optimizer (Kingma and Ba) with a learning rate of 1 × 10⁻³ until the objective stops improving.
We then merge cluster assignments based on whether the mode for the initial cluster assignment k_i has moved within the basin of attraction for another mixture component in the distribution (after optimizing the objective above), or:

$$l_i = \operatorname*{arg\,max}_{l}\ \omega_l\, \mathcal{N}\!\big(\mu^*_{k_i} \mid \mu_l, \mathbf{I}\big),$$

where l_i is the sparse watershed label assignment for the ith data point x_i, which was assigned to the k_i-th mode of the distribution µ_{k_i} in the initial cluster assignment step. We then repeat this assignment procedure K times to ensure all label assignments to degenerate modes are reassigned to the mode with the highest local density:

$$l_i := \operatorname*{arg\,max}_{l}\ \omega_l\, \mathcal{N}\!\big(\mu^*_{l_i} \mid \mu_l, \mathbf{I}\big) \qquad \text{for } k = 1, \dots, K.$$

Note that, for data assigned to non-degenerate modes in the initial step, the cluster assignment typically does not change.
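A minimal numpy sketch of this sparse watershed assignment is shown below, assuming the mixture means, weights, and embedded posterior means are available as arrays. It uses plain gradient ascent in place of the Adam optimizer and a fixed number of steps in place of the convergence criterion described above, so it is an illustration of the procedure rather than the original implementation.

```python
import numpy as np

def sparse_watershed_labels(z, means, weights, lr=1e-3, n_steps=10000):
    """Assign embedded points `z` (N x D) to merged modes of a Gaussian mixture
    with component `means` (K x D), `weights` (K,), and identity covariance."""
    log_w = np.log(weights)

    # initial assignment: maximum-likelihood weighted component for each point
    d2 = ((z[:, None, :] - means[None, :, :]) ** 2).sum(-1)               # (N, K)
    labels = np.argmax(log_w[None, :] - 0.5 * d2, axis=1)

    # hill-climb each component mean to its local maximum of the mixture density
    means_opt = means.copy()
    for _ in range(n_steps):
        d2 = ((means_opt[:, None, :] - means[None, :, :]) ** 2).sum(-1)   # (K, K)
        resp = weights[None, :] * np.exp(-0.5 * d2)
        resp /= resp.sum(axis=1, keepdims=True)
        grad = (resp[:, :, None] * (means[None, :, :] - means_opt[:, None, :])).sum(axis=1)
        means_opt += lr * grad

    # merge degenerate modes: reassign each label to the component whose basin of
    # attraction its optimized mean fell into, repeated K times as described above
    for _ in range(means.shape[0]):
        d2 = ((means_opt[labels][:, None, :] - means[None, :, :]) ** 2).sum(-1)
        labels = np.argmax(log_w[None, :] - 0.5 * d2, axis=1)
    return labels
```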
During training we anneal the perplexity of the high-dimensional similarity kernel toward its target value (Kobak and Berens), which helps to preserve structure across multiple scales. After the perplexity ratio is fully annealed to the target value, we then perform early stopping if the pairwise similarity loss stops improving per epoch, with lack of progress ignored for a fixed number of epochs (the patience) before stopping training. While it is common practice to decrease the learning rate after training stagnates to further improve performance, we instead increase the batch size, which has been shown to provide similar improvements (Smith et al.). Therefore, after training stagnates and early stopping is initiated for the initial batch size, we increase the batch size and continue training until early stopping is initiated again using the same criteria. For the Gaussian mixture prior we set the number of components to K = 100, but we found that any arbitrarily large number of components produced similar (nearly identical) results.

We tested variants of VAE-SNE with different similarity kernels. We tested VAE-SNE using a t-SNE similarity kernel with (1) constant kernel parameters (ν = τ = 1) as well as (2) learned kernel parameters (van der Maaten). We also tested VAE-SNE variants using a SNE kernel with (1) constant (η = 1) and (2) learned parameters. Otherwise the hyperparameters for each variant were kept constant, as described above.

Local structure preservation
After embedding the data with each algorithm we assessed local structure preservation with two measures of preservation that define local neighborhoods in different ways. For both of these metrics we targeted neighborhoods that correspond to ∼1% of the total embedding size.

Metric neighborhoods. First, we assessed local neighborhood preservation metrically by applying k-means clustering (with the number of clusters chosen so that each cluster covers ∼1% of the data) to the high-dimensional data and the low-dimensional embedding for each method, which effectively divides the data into small Voronoi regions. We then calculated the normalized mutual information (reviewed by Vinh et al.; see also McDaid et al.) between the high-dimensional and low-dimensional cluster assignments (using scikit-learn; Pedregosa et al.). This provides a symmetric and permutation-invariant measure of how well local neighborhood memberships from the high-dimensional space are preserved by each embedding method - with similarity ranging from 0 (no overlap, or random) to 1 (perfect overlap). We performed multiple replicates of this for each trial.
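A minimal sketch of this metric-neighborhood score is shown below; the use of mini-batch k-means, the choice of 100 regions, and the seed handling are illustrative assumptions rather than details taken from the original analysis.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import normalized_mutual_info_score

def metric_neighborhood_preservation(X_high, X_low, n_regions=100, seed=0):
    # divide both spaces into small Voronoi regions (~1% of the data each for 100 regions)
    labels_high = MiniBatchKMeans(n_clusters=n_regions, random_state=seed, n_init=3).fit_predict(X_high)
    labels_low = MiniBatchKMeans(n_clusters=n_regions, random_state=seed, n_init=3).fit_predict(X_low)
    # NMI between the two partitions: 0 = random overlap, 1 = perfect overlap
    return normalized_mutual_info_score(labels_high, labels_low)
```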
Topological neighborhoods. Second, we assessed local neighborhood preservation topologically by calculating the exact nearest neighbors for a random subset of data points and then defining the local neighborhood for each point as its k nearest neighbors, where k is selected such that k/N ≈ 0.01, and N is the total embedding size. We then computed the proportion of the neighbors that are assigned to the correct local neighborhood in the low-dimensional embedding, which ranges from 0 (no neighbors preserved) to 1 (all neighbors preserved). We performed multiple replicates of this for each trial.
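The sketch below illustrates this topological score; the number of query points and the exact neighbor search backend are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def topological_neighborhood_preservation(X_high, X_low, n_queries=1000, frac=0.01, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.choice(X_high.shape[0], size=n_queries, replace=False)
    k = max(1, int(frac * X_high.shape[0]))
    # +1 neighbors because each query point is returned as its own nearest neighbor
    nn_high = NearestNeighbors(n_neighbors=k + 1).fit(X_high)
    nn_low = NearestNeighbors(n_neighbors=k + 1).fit(X_low)
    high_sets = nn_high.kneighbors(X_high[idx], return_distance=False)[:, 1:]
    low_sets = nn_low.kneighbors(X_low[idx], return_distance=False)[:, 1:]
    # proportion of high-dimensional neighbors recovered in the embedding (0 to 1)
    overlap = [len(set(h) & set(l)) / k for h, l in zip(high_sets, low_sets)]
    return float(np.mean(overlap))
```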

Global structure preservation
To assess global structure preservation we follow Becht et al. by calculating the Pearson correlation between pairwise squared Euclidean distances for a random subset of points in the high-dimensional space and the low-dimensional embedding for each method. As distances have a lower bound of zero and tend to follow a log-normal (or Gamma) distribution, we first log transformed the distances in order to homogenize the variance and better match the assumptions of Pearson's correlation score. The Pearson correlation then provides a measure of global structure preservation ranging from -1 (anti-correlated) to 1 (correlated). We performed multiple replicates of this for each trial.
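A minimal sketch of this global score follows; the number of sampled points is an illustrative assumption.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import pearsonr

def global_structure_preservation(X_high, X_low, n_points=1000, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.choice(X_high.shape[0], size=n_points, replace=False)
    d_high = pdist(X_high[idx], metric="sqeuclidean")
    d_low = pdist(X_low[idx], metric="sqeuclidean")
    # log transform to homogenize variance before computing Pearson's correlation
    r, _ = pearsonr(np.log(d_high + 1e-12), np.log(d_low + 1e-12))
    return r
```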

Fine-scale structure preservation
Because our metrics for local structure preservation only account for a single scale but not the fine-scale structure within local neighborhoods, we also assessed topological structure preservation for smaller neighborhood sizes. As before, we calculated the exact nearest neighbors for a random subset of data points. We then computed the proportion of points assigned to the correct neighborhood across dyadically (log2) spaced neighborhood sizes ranging from k = 2^1 to k = 2^14. Neighborhood sizes were then normalized as a proportion of the total embedding size, or k/N. We performed multiple replicates of this for each trial and neighborhood size.

Temporal structure preservation
Because the largest dataset we use is also timeseries data, we assess temporal structure preservation for the test set by calculating Euclidean distances between sequential time points in high dimensions and low dimensions for each method. We then calculate the Pearson correlation coefficient of the log-transformed distances (as for assessing global structure preservation) for randomly selected subsets within the full timeseries. This then provides a measure of how well temporal derivatives are preserved in the low-dimensional embedding, ranging from -1 (anti-correlated) to 1 (correlated).

To compare methods, we computed bootstrap estimates of the mean for each information preservation metric - resampling (with replacement) both within trials and across trials. We then plot intervals of the bootstrap distribution to compare the performance of each dimension reduction method statistically. Rather than attempting to make decisions regarding the statistical "significance" of these bootstrap distributions based on an arbitrary threshold, we instead simply treat them as a measure of the uncertainty (variance) in effect size for each information preservation metric. The computational experiments from which the information preservation metrics are derived could be run ad infinitum to achieve statistical significance, which is effectively a measure of statistical resolution based on the number of observations, but this is not necessarily informative in practice.

To compute the time-frequency spectrograms for the posture dynamics dataset (following Berman et al.), we used dyadically (log2) spaced frequencies up to the Nyquist frequency of the signal, which expanded the dimensionality of the timeseries from 30 × 2 = 60 to 30 × 2 × 25 = 1500.
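A minimal sketch of the temporal preservation score described at the start of this subsection is shown below, assuming sequential numpy arrays for the high-dimensional timeseries and its embedding.

```python
import numpy as np
from scipy.stats import pearsonr

def temporal_structure_preservation(X_high, X_low):
    # distances between sequential time points approximate the temporal derivatives
    d_high = np.linalg.norm(np.diff(X_high, axis=0), axis=1)
    d_low = np.linalg.norm(np.diff(X_low, axis=0), axis=1)
    r, _ = pearsonr(np.log(d_high + 1e-12), np.log(d_low + 1e-12))
    return r
```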

Dimension reduction comparisons
To generate a training set for benchmarking the different algorithms, we uniformly randomly sampled a subset of data from the body posture dynamics timeseries for 58 of the 59 individuals while excluding one randomly selected individual to use as a test set. We tested training set sizes of 58 × 500 = 29,000; 58 × 1000 = 58,000; 58 × 2000 = 116,000; and 58 × 4000 = 232,000, above which we encountered out-of-memory errors when running UMAP (McInnes et al.) on larger subsets of data. Each test set contains ∼360,000 sequential observations. We then applied each dimension reduction method to the training set and subsequently embedded the test set. For training VAE-SNE we used the cross-entropy loss as a log likelihood function, as it matches well with the normalized time-frequency data, but we also found that other likelihood functions work similarly well.

Behavioral clustering. To simplify the dataset for performing our clustering analysis, we used the sine and cosine of the keypoint angles for the 6 legs (the distal tips of each leg), wings, head, and abdomen for a total of 10 body parts and a 10 × 2 = 20 dimensional posture timeseries. As before we applied the time-frequency transform, which expands the dimensionality of the timeseries from 10 × 2 = 20 to 10 × 2 × 25 = 500. We then applied VAE-SNE with a t-SNE kernel (Appendix B; ν = τ = 1) to compress and cluster these data.

Appendix A

As is common to most dimensionality reduction algorithms, we seek to model a high-dimensional data distribution p(x) using a low-dimensional latent distribution p(z). Variational autoencoders (VAEs) are one such model that combines both modeling and inference by defining a joint distribution between a latent variable z and observed samples x. We can accomplish this using a generative model that maps samples from the low-dimensional latent distribution to the high-dimensional data distribution using a set of shared parameters θ, which can take the form of a deep neural network model p_θ(x|z) = DNN_θ(z) with some prior over the latent distribution p_θ(z). We then wish to find the model parameters θ that maximize the likelihood of the observed data, which can be written as the marginal of the joint distribution:

$$p_\theta(x) = \int p_\theta(x \mid z)\, p_\theta(z)\, dz.$$

To compute the low-dimensional distribution for the data, we then need to derive the latent posterior for the model p_θ(z|x). This can be derived from the likelihood using Bayes' rule:

$$p_\theta(z \mid x) = \frac{p_\theta(x \mid z)\, p_\theta(z)}{p_\theta(x)} = \frac{p_\theta(x \mid z)\, p_\theta(z)}{\int p_\theta(x \mid z)\, p_\theta(z)\, dz},$$

but the integral in the denominator is generally intractable. We therefore define the decoder likelihood and an approximate posterior parameterized by an encoder network:

$$p_\theta(x \mid z) = \mathcal{L}\big(x \mid \mathrm{DNN}_\theta(z)\big), \qquad q_\phi(z \mid x) = \mathcal{N}\big(z \mid \mu, \mathrm{diag}(\sigma^2)\big),$$

where L(x|·) is a user-selected likelihood function parameterized by the decoder function DNN_θ(z), and N(·|µ, diag(σ²)) is a multivariate Gaussian whose parameters µ and σ² are specified by the encoder function DNN_φ(x).
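As a concrete illustration of these two networks, the sketch below defines a simple encoder that outputs the Gaussian posterior parameters (µ, log σ²) and a feed-forward decoder in PyTorch; the layer widths, activations, and two-dimensional latent space are illustrative assumptions rather than the architecture used for the experiments.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, data_dim, latent_dim=2, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(data_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent_dim)        # posterior mean
        self.log_var = nn.Linear(hidden, latent_dim)   # posterior log variance

    def forward(self, x):
        h = self.net(x)
        return self.mu(h), self.log_var(h)

class Decoder(nn.Module):
    def __init__(self, data_dim, latent_dim=2, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, data_dim))

    def forward(self, z):
        return self.net(z)
```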

A. Deriving the evidence lower bound

After defining the generative model, we then wish to optimize the parameters of the encoder φ and decoder θ - given a set of observed samples from a data distribution x ∼ p(x) - so that the approximate posterior distribution q_φ(z|x) matches closely with the true latent posterior from the generative decoder, or q_φ(z|x) ≈ p_θ(z|x). In other words, we wish to minimize the divergence between the two distributions, or:

$$\min_{\theta, \phi}\ \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p_\theta(z \mid x)\big).$$

However, as we have already established, computing the true posterior is intractable, so researchers have derived a lower bound known as the evidence lower bound, or ELBO (Kingma and Welling), to approximate this objective. The ELBO can be derived directly from the divergence above (Adams), which is written as:

$$\mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p_\theta(z \mid x)\big) = \log p_\theta(x) - \underbrace{\mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right]}_{\mathrm{ELBO}}.$$

Because the Kullback-Leibler divergence is strictly non-negative, the ELBO is then a lower bound on the log marginal likelihood. However, the ELBO can also be derived by applying Jensen's inequality, as is more common in the literature (Kingma and Welling), to directly calculate a lower bound on the log marginal likelihood, or:

$$\log p_\theta(x) = \log \mathbb{E}_{q_\phi(z \mid x)}\!\left[\frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right] \ge \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right] = \mathrm{ELBO}.$$

To learn the latent distribution given the model and the data, the ELBO is then maximized to optimize the model parameters. Here we write this as a minimization of the negative ELBO, which can be further decomposed into separate terms for the log-likelihood and the divergence between the approximate posterior and the prior over the latent distribution, or:

$$-\mathrm{ELBO} = -\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] + \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p_\theta(z)\big).$$

The derivation of the ELBO has also been discussed at length elsewhere.

A. Importance-weighted ELBO

While we use only a single Monte Carlo sample from the approximate posterior per training batch, we also include a hyperparameter for multiple samples per training batch using the importance-weighted ELBO from Burda et al., which modifies how the expectation is calculated to produce a tighter bound on the loss by implicitly increasing the complexity of the posterior (Cremer et al.). However, we did not see any obvious performance improvements when using the importance-weighted objective, and increasing the number of Monte Carlo samples per batch also increases training time. The general utility of calculating a tighter bound is also unclear (Rainforth et al.), but this may be related to the generalization ability of the model. We leave further exploration of this hyperparameter for future work.

Appendix B

For computing pairwise similarities, we largely follow Hinton and Roweis and van der Maaten and Hinton by modeling local neighborhoods as the probability of transitioning from a landmark point to its nearby neighbors when performing a random walk initialized from the landmark. By modeling local neighborhoods as probability distributions and then minimizing the divergence between the neighborhood distributions in high- and low-dimensional space, we preserve more local structure within the low-dimensional embedding than a standard VAE (Ding et al.).

High-dimensional transition probabilities
To accomplish this, pairwise transition probabilities in high-dimensional space t(x_j|x_i) are modelled by applying a Gaussian kernel to convert the pairwise distances between data points d(x_i, x_j) into conditional probabilities - with self transitions set to t(x_i|x_i) = 0. While Ding et al. use these asymmetric conditional probabilities t(x_j|x_i) directly for the high-dimensional similarities, van der Maaten and Hinton show that symmetrizing the pairwise similarities so that p(x_j|x_i) = p(x_i|x_j) reduces susceptibility to outliers, which can become ill-determined in the low-dimensional embedding with an asymmetric kernel. Therefore, we use the symmetrized conditional probabilities, which are computed as:

$$t(x_j \mid x_i) = \frac{\exp\!\big(-d(x_i, x_j)^2 / 2\varsigma_i^2\big)}{\sum_{n \neq i} \exp\!\big(-d(x_i, x_n)^2 / 2\varsigma_i^2\big)}, \qquad p(x_j \mid x_i) = \frac{t(x_j \mid x_i) + t(x_i \mid x_j)}{2}$$

for n = 1, ..., N and n ≠ i, where d(·, ·) is a user-selected distance metric, such as the Euclidean distance. The landmark data point x_i can then be considered the mean, and ς²_i is the variance of the Gaussian kernel describing the local neighborhood around x_i - thereby assigning more probability mass to nearby neighbors. The variance ς²_i is selected for each data point via binary search such that 2^{H_i} ≈ P, where P is the desired perplexity (a user-defined hyperparameter), 2^{H_i} is the perplexity of the kernel for the ith data point, which approximately corresponds to the number of nearest neighbors considered by the kernel, and H_i is the Shannon entropy in bits, or:

$$H_i = -\sum_{j \neq i} t(x_j \mid x_i)\, \log_2 t(x_j \mid x_i).$$
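The binary search over the kernel variance can be illustrated with the standard calibration routine below, which works in nats rather than bits (the two are equivalent for matching a target perplexity); the dense distance matrix, tolerance, and iteration limit are simplifying assumptions for illustration.

```python
import numpy as np

def calibrate_similarities(sq_distances, perplexity=30.0, tol=1e-5, n_iter=50):
    """Binary search for per-point kernel precisions so that each conditional
    distribution t(x_j | x_i) has the target perplexity."""
    N = sq_distances.shape[0]
    P = np.zeros_like(sq_distances, dtype=float)
    log_target = np.log(perplexity)
    for i in range(N):
        d = np.delete(sq_distances[i], i)
        beta, beta_min, beta_max = 1.0, 0.0, np.inf    # precision = 1 / (2 * variance)
        for _ in range(n_iter):
            p = np.exp(-d * beta)
            p_sum = p.sum()
            entropy = np.log(p_sum) + beta * (d * p).sum() / p_sum
            if abs(entropy - log_target) < tol:
                break
            if entropy > log_target:                   # neighborhood too broad -> tighten
                beta_min = beta
                beta = beta * 2.0 if np.isinf(beta_max) else (beta + beta_max) / 2.0
            else:                                      # neighborhood too narrow -> widen
                beta_max = beta
                beta = beta / 2.0 if beta_min == 0.0 else (beta + beta_min) / 2.0
        P[i, np.arange(N) != i] = p / p_sum
    return P
```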

( )
As in high-dimensional space, self transitions are set to q φ (z i |z i ) = 0. Here we test two kernel functions for preserving Euclidean similarities.

t-SNE kernel
First is the heavy-tailed Student's t-distributed kernel used for the t-SNE algorithm (van der Maaten and Hinton) with the log probability function written as:

$$\log w_\phi(z_j \mid z_i) = -\frac{\nu_i + 1}{2}\,\log\!\left(1 + \frac{d(z_i, z_j)^2}{\nu_i\, \tau_i^2}\right) + \log \Gamma\!\left(\frac{\nu_i + 1}{2}\right) - \log \Gamma\!\left(\frac{\nu_i}{2}\right) - \frac{1}{2}\log\!\left(\nu_i\, \pi\, \tau_i^2\right),$$

where τ_i is the scale, ν_i is the degrees of freedom, which varies the heavy tails of the kernel, and Γ(·) is the gamma function. We write this as a log probability to more clearly show the relationship with the similarity loss term derived later in this section. The Student's t-distribution is used primarily to alleviate the "crowding problem" (van der Maaten and Hinton) that can occur with other nonlinear embedding algorithms, including the original SNE algorithm (Hinton and Roweis), where points are too densely packed in the low-dimensional space and moderately distant points are "crushed" together as an artifact of the embedding algorithm.

SNE kernel. Second, we test a Gaussian kernel - the kernel used for the original SNE algorithm (Hinton and Roweis; van der Maaten and Hinton) - with the log probability function:

$$\log w_\phi(z_j \mid z_i) = -\frac{d(z_i, z_j)^2}{2\eta_i^2} - \frac{1}{2}\log\!\left(2\pi \eta_i^2\right),$$

where η²_i is the variance.
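A minimal sketch of these two log-probability kernels, written as functions of squared distance with their normalization terms included, is shown below; the function names and the use of PyTorch are illustrative assumptions.

```python
import torch

def log_tsne_kernel(d2, nu=1.0, tau=1.0):
    # Student's t kernel: heavy tails alleviate the crowding problem
    log_norm = (torch.lgamma(torch.tensor((nu + 1) / 2)) - torch.lgamma(torch.tensor(nu / 2))
                - 0.5 * torch.log(torch.tensor(nu * torch.pi * tau ** 2)))
    return -0.5 * (nu + 1) * torch.log1p(d2 / (nu * tau ** 2)) + log_norm

def log_sne_kernel(d2, eta=1.0):
    # Gaussian kernel used by the original SNE algorithm
    return -d2 / (2 * eta ** 2) - 0.5 * torch.log(torch.tensor(2 * torch.pi * eta ** 2))
```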

Setting the kernel parameters. The kernel parameters for the low-dimensional similarities are typically set to a constant value, such as τ_i = ν_i = η_i = 1 (van der Maaten and Hinton), or are scaled linearly with the dimensionality of the latent embedding (van der Maaten), but we also test similarity kernels where these parameters are learned for each data point, parameterized by the encoder DNN_φ(x) - an idea proposed by van der Maaten. When the kernel parameters are constant across all data points, the log normalization terms used for calculating the log probabilities can be omitted as an additive constant that has no effect on the calculations after normalization. However, this term is potentially important for optimization when learning these parameters as a function of each data point, so we include it in our calculations.
Reinterpreting the similarity loss term. To maximize numerical stability when optimizing the similarity term, we substitute the cross-entropy between the high-dimensional and low-dimensional similarities H[p(x_j|x_i), q_φ(z_j|z_i)], which is proportional to the Kullback-Leibler divergence and, after dropping the expectation, can be derived as follows:

$$\mathrm{KL}\big(p(x_j \mid x_i)\,\|\,q_\phi(z_j \mid z_i)\big) = \sum_{j \neq i} p(x_j \mid x_i)\,\log\frac{p(x_j \mid x_i)}{q_\phi(z_j \mid z_i)} = \underbrace{-\sum_{j \neq i} p(x_j \mid x_i)\,\log q_\phi(z_j \mid z_i)}_{H[p(x_j \mid x_i),\, q_\phi(z_j \mid z_i)]} \;-\; \underbrace{\Big(\!-\!\sum_{j \neq i} p(x_j \mid x_i)\,\log p(x_j \mid x_i)\Big)}_{H[p(x_j \mid x_i)]}.$$

Consequently, the Kullback-Leibler divergence for the similarity term can be reinterpreted as the cross-entropy between the pairwise similarities up to an additive constant (the negative entropy of the high-dimensional similarities), which can be omitted for the purposes of optimization. To further improve numerical stability for this computation, the cross-entropy is decomposed into attractive and repulsive forces using the unnormalized similarities (following Ding et al.; Kobak and Berens), which is written as:

$$H\big[p(x_j \mid x_i),\, q_\phi(z_j \mid z_i)\big] = \underbrace{-\sum_{j \neq i} p(x_j \mid x_i)\,\log w_\phi(z_j \mid z_i)}_{\text{attraction}} \;+\; \underbrace{\log \sum_{n \neq i} w_\phi(z_n \mid z_i)}_{\text{repulsion}}.$$

This may also help to clarify why we wrote the low-dimensional kernels as log-probability functions above.
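The attraction/repulsion form of this cross-entropy can be sketched as below; it mirrors the similarity term in the earlier SNELBO sketch but in the numerically decomposed form, and the fixed Student's t kernel (ν = τ = 1) and function names are illustrative assumptions.

```python
import torch

def similarity_cross_entropy(P, z):
    """Cross-entropy between high-dimensional similarities P (row-normalized,
    zero diagonal) and low-dimensional Student's t similarities, decomposed
    into attractive and repulsive terms."""
    mask = 1.0 - torch.eye(z.shape[0], device=z.device)
    log_w = -torch.log1p(torch.cdist(z, z) ** 2)              # unnormalized log kernel
    attraction = -(P * log_w).sum(dim=1)                      # pulls neighbors together
    repulsion = torch.log((log_w.exp() * mask).sum(dim=1))    # pushes all points apart
    return (attraction + repulsion).mean()
```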
C. von Mises-Fisher VAE-SNE for spherical embeddings

In addition to embeddings with Euclidean geometry, we introduce a version of VAE-SNE that uses polar geometry and embeds high-dimensional data on the surface of a unit sphere. We calculate the high-dimensional similarities according to Appendix B, but we alter the calculations for the transition probabilities by using the cosine similarity for the high-dimensional pairwise metric. After normalization, this is equivalent to using a (hyper)spherical von Mises-Fisher distribution as the similarity kernel, or:

$$t(x_j \mid x_i) = \frac{\exp\!\big(\kappa_i\, \hat{x}_i \cdot \hat{x}_j\big)}{\sum_{n \neq i} \exp\!\big(\kappa_i\, \hat{x}_i \cdot \hat{x}_n\big)},$$

where $\hat{x}_i = x_i / \|x_i\|_2$ and κ_i is the concentration parameter (the inverse variance κ_i = ς_i^{-2}), which is selected using binary search to match the perplexity to a desired value (see Appendix B for details). We then calculate the low-dimensional similarities using a von Mises-Fisher kernel to create a spherical embedding:

$$\log w_\phi(z_j \mid z_i) = \log \mathcal{F}(\hat{z}_j \mid \hat{z}_i, \rho_i) = \rho_i\, (\hat{z}_i \cdot \hat{z}_j) + Z_i,$$

where $\hat{z}_i = z_i / \|z_i\|_2$, ρ_i is the concentration parameter (inverse variance), and Z_i is the log normalization term. The log normalization term can be omitted when ρ_i is set to a constant, but we include it for the purposes of optimizing ρ_i as a function of each data point.

The idea of using spherical embeddings for dimensionality reduction has been explored previously with the von Mises-Fisher stochastic neighbor embedding (VMF-SNE) algorithm (Wang and Wang) as well as more recent work by Ding and Regev, who apply this type of embedding to visualize single-cell RNA-seq data. The UMAP algorithm (McInnes et al.) has a similar option to embed data in polar coordinates, as well as other non-Euclidean spaces. VAEs with (hyper)spherical latent variables have also been explored extensively in the machine learning literature (Davidson et al.; reviewed by Ding and Regev). This type of spherical representation can be useful for data analysis. We find that the results are qualitatively similar to 2-D Euclidean embeddings of the same data, but are instead distributed across the surface of a sphere. Despite not using a heavy-tailed similarity kernel (van der Maaten and Hinton), these spherical embeddings naturally do not exhibit any crowding problems (Davidson et al.; Ding and Regev), which may make this a useful visualization tool for some scenarios.

C. Convolutional VAE-SNE for image data
We introduce a convolutional version of VAE-SNE for embedding image data from raw pixels. This version of VAE-SNE is modified by first applying a 2-D convolutional neural network CNN_φ - a SqueezeNet (Iandola et al.) pretrained on ImageNet (Deng et al.) - to each image and then calculating the pairwise similarity using spatially-pooled feature maps from the CNN_φ output. The high-dimensional transition probabilities (Appendix B) are then calculated using a Gaussian kernel:

$$t(x_j \mid x_i) = \frac{\exp\!\big(-\|\bar{v}_i - \bar{v}_j\|_2^2 / 2\varsigma_i^2\big)}{\sum_{n \neq i} \exp\!\big(-\|\bar{v}_i - \bar{v}_n\|_2^2 / 2\varsigma_i^2\big)},$$

where $\bar{v}_i$ is a vector of spatially-pooled feature maps from the CNN_φ output, or $\bar{v}_i = \mathrm{CNN}_\phi(x_i)$. The approximate posterior is then calculated as a nonlinear function of the pooled feature maps, DNN_φ : $\bar{v}_i$ → z_i, which is written as q_φ(z|x_i) = N(z | DNN_φ($\bar{v}_i$)). For the decoder we use a feed-forward network DNN_θ : z_i → $\tilde{v}_i$ as before, where $\tilde{v}_i$ is a reconstruction of the CNN_φ output $\bar{v}_i$. We then apply the mean squared error between the pooled feature maps and the reconstruction as the likelihood function for the distortion loss. A convolutional decoder could also be used to fully reconstruct the raw image pixels, but we found simply reconstructing the pooled feature maps to be effective for visualizing the distribution of images in two dimensions.
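The feature-extraction step can be sketched as follows with torchvision; the use of SqueezeNet v1.1 specifically, the ImageNet preprocessing transform, and global average pooling are assumptions made for illustration, since the exact version and pooling used in the original model are not specified here.

```python
import torch
from torchvision import models

weights = models.SqueezeNet1_1_Weights.IMAGENET1K_V1
cnn = models.squeezenet1_1(weights=weights).features.eval()   # convolutional layers only
preprocess = weights.transforms()                             # ImageNet resizing/normalization

@torch.no_grad()
def pooled_features(images):
    """images: (N, 3, H, W) float tensor; returns (N, 512) pooled feature vectors."""
    fmaps = cnn(preprocess(images))        # (N, 512, h, w) feature maps
    return fmaps.mean(dim=(2, 3))          # spatial (global average) pooling
```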
To demonstrate the utility of convolutional VAE-SNE, we embed natural history image datasets of both shells (Zhang et al.) and butterflies (Heliconius spp.; Cuthill et al.). We then visualize these embeddings to qualitatively assess performance of this VAE-SNE variant. We find that perceptually similar images are grouped together in the embedding based on complex sets of image features - rather than simple heuristics like color - and these groupings correspond to taxonomic relationships within the dataset, which were not explicitly included as part of the training set. This variant of VAE-SNE is functionally similar to using the perceptual distance (Johnson et al.; Wham et al.) as a similarity metric and likelihood function, except that the model can be trained end-to-end with small batches of images directly using raw pixels instead of first preprocessing images to produce feature activations. These results demonstrate that VAE-SNE can be used to analyze very large image datasets, by loading images in small batches, and can also be extended to images with variable resolution, by integrating across feature map outputs from the CNN to remove the spatial dimension - both of which are typically not possible with other dimension reduction algorithms.