Unsupervised discovery of dynamic cell phenotypic states from transmitted light movies

Identification of cell phenotypic states within heterogeneous populations, along with elucidation of their switching dynamics, is a central challenge in modern biology. Conventional single-cell analysis methods typically provide only indirect, static phenotypic readouts. Transmitted light images, on the other hand, provide direct morphological readouts and can be acquired over time to provide a rich data source for dynamic cell phenotypic state identification. Here, we describe an end-to-end deep learning platform, UPSIDE (Unsupervised Phenotypic State IDEntification), for discovering cell states and their dynamics from transmitted light movies. UPSIDE uses the variational auto-encoder architecture to learn latent cell representations, which are then clustered for state identification, decoded for feature interpretation, and linked across movie frames for transition rate inference. Using UPSIDE, we identified distinct blood cell types in a heterogeneous dataset. We then analyzed movies of patient-derived acute myeloid leukemia cells, from which we identified stem-cell associated morphological states as well as the transition rates to and from these states. UPSIDE opens up the use of transmitted light movies for systematic exploration of cell state heterogeneity and dynamics in biology and medicine.


Introduction
Cells maintain and switch between distinct phenotypic states in a dynamic manner. The facile identification of these states, and an understanding of the basis for and the dynamics by which they interconvert, are central challenges in biology. Modern single-cell analysis methods, such as single-cell RNA sequencing and multiparameter flow cytometry or mass cytometry [1][2][3][4][5], are widely used to define cell states in heterogeneous populations; while powerful, these methods provide incomplete readouts of cell phenotypes, and typically do not report on state stability or transition dynamics. Transmitted light microscopy images directly reveal cell morphology, and have historically formed the basis for identifying cell types and cell states in diverse fields, ranging from cell biology to neuroscience 6,7 . These images can be acquired at successive timelapse intervals and over long times, with minimal phototoxicity and without prior labeling or genetic manipulation. The resultant live-cell movies can reveal additional information about the dynamics of these cell phenotypic states.
Cell phenotypes have traditionally been identified by visual inspection and interpretation of transmitted light or electron micrographic images. The advent of modern machine learning, however, is enabling high-throughput automated analysis of cell morphology, and could open doors for new deep learning approaches for the systematic, unbiased extraction of cell morphological states and their transition dynamics from these imaging datasets. However, several barriers remain to the development of such methods. First, current deep learning pipelines for cell image analysis rely heavily on predetermined knowledge to generate classification training datasets, or on large sets of heuristic formulations to capture the diversity of cell shapes and morphologies [8][9][10][11] . When examining novel biological processes with minimal to no preconceived information, it can be difficult for investigators to determine the important labels without manual intervention and feature selection. Second, current machine learning pipelines generate features that are often not readily interpretable. A variety of unsupervised methods can generate reduced-dimensionality representations from complex data, including principal component analysis (PCA), adversarial autoencoders 12 , generative adversarial networks 13,14 , and self-supervised deep learning approaches 15,16 . However, these methods are limited in their ability to generate interpretable morphological features that allow scientists to investigate and understand the machine-identified cell states. Finally, current movie analysis methods cannot infer state transition dynamics from live-cell movies in an automated, systematic manner 17 . Cell state transitions are typically observed from trajectories of single cells; however, despite recent advances 18 , current tracking algorithms still typically require considerable parameter adjustment and manual error correction to generate cell trajectories 19 .
Here, we present an end-to-end deep learning method for elucidating cell phenotypic states and their dynamics from brightfield movies of living cells. This method, termed UPSIDE (for Unsupervised Phenotypic State IDEntification), is designed to facilitate unsupervised discovery of cellular phenotypic states, elucidation of the morphological features that define these states, and inference of state transition dynamics. UPSIDE segments cells directly from brightfield images, then utilizes the variational autoencoder (VAE) architecture 20 to learn intuitive latent features that can be clustered to reveal distinct morphological states, and also decoded to extract human-interpretable meaning. To demonstrate the use and versatility of UPSIDE, we first used it to distinguish distinct hematopoietic cell phenotypes in a mixed dataset. We then analyzed live imaging movies of leukemic cells from an acute myeloid leukemia (AML) patient to identify morphologically distinct cell states associated with stemness, and determined the rates of transition to and from these states. These results demonstrate the utility of UPSIDE as a tool for unbiased exploration of cellular states and their dynamics from large, time-resolved imaging datasets.

Description of the UPSIDE platform
UPSIDE is designed to be a versatile machine-learning pipeline for unsupervised exploration of cell morphological states in transmitted light images, and subsequent elucidation of their transition dynamics from movies (Figure 1A; see Methods for a detailed description of the pipeline).
In this pipeline, cells are first segmented using a neural network that converts brightfield images of unlabeled cells into synthetic fluorescent images of cytoplasm for segmentation 21 . This neural network is trained using a set of images of cells stained for their cytoplasm (Supplementary Figure 1), allowing the network to autonomously tailor its parameters and to accommodate a wide range of different cell types without human input. Dead cells and other debris are eliminated from identified cell sub-images by a convolutional classifier trained on manually annotated brightfield cell crops labeled as live or dead (Supplementary Figure 2). UPSIDE then utilizes a variational autoencoder (VAE) architecture to learn morphological features of the segmented cells. Preprocessed binary masks of each cell, together with the cellular texture inside the mask boundary, are used to train two concurrent VAEs: one that encodes cell shape and another that encodes cell texture. The mask VAE learns features related to overall cell shape and size, while the texture VAE learns pixel-value variation within the mask itself, accounting for size and shape to some degree. Latent space encodings representing the learned mask and texture features are then scaled by varying contribution weights and concatenated for subsequent clustering and dimensionality reduction. Specifically, encodings are clustered into groups using the Louvain method 23 and represented on a 2D plane using the uniform manifold approximation and projection algorithm (UMAP) 22 . Additionally, mask and texture vectors can be decoded, after varying the magnitudes of specific features or groups of features, to generate synthetic images in observable image space (Figure 1B). This approach allows latent features to be visually displayed for human inspection and interpretation.
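For concreteness, the fusion-and-clustering step can be sketched as follows in Python. This is a minimal illustration, not the UPSIDE implementation: the function name, the weighting knob `w`, and the choice of libraries (umap-learn, python-louvain, networkx, scikit-learn) are assumptions.

```python
import numpy as np
import umap                                    # umap-learn
import networkx as nx                          # requires networkx >= 3.0
from sklearn.neighbors import kneighbors_graph
import community as community_louvain          # python-louvain

def fuse_and_cluster(z_mask, z_texture, w=0.5, n_neighbors=15):
    """Weight and concatenate mask/texture latents, then cluster and embed.

    z_mask, z_texture: (n_cells, 100) arrays from the two VAE encoders.
    w: relative weight of the shape (mask) features -- an assumed knob.
    """
    # Weighted concatenation of the two latent representations
    z = np.concatenate([w * z_mask, (1.0 - w) * z_texture], axis=1)

    # Build a k-nearest-neighbor graph over cells and cluster it with Louvain
    adj = kneighbors_graph(z, n_neighbors=n_neighbors, mode="connectivity")
    graph = nx.from_scipy_sparse_array(adj)
    partition = community_louvain.best_partition(graph)
    labels = np.array([partition[i] for i in range(z.shape[0])])

    # 2D embedding of the fused latent space for visualization
    embedding = umap.UMAP(n_neighbors=n_neighbors).fit_transform(z)
    return labels, embedding
```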

UPSIDE distinguishes between different blood cell types in a heterogeneous population
To test UPSIDE's ability to learn cell type-defining morphological features, we first determined whether this pipeline could distinguish among different cell types based on their morphologies in a mixed cell dataset. We focused first on using UPSIDE to discriminate among four blood cell types that, despite having distinct size, shape and textural features, were similar in their gross morphologies. We compared the latent representations generated by the VAE against those produced by alternative unsupervised architectures, including PCA, a vanilla autoencoder, adversarial autoencoders, and ClusterGAN (see Methods), scoring each by the homogeneity of cell-type groupings in latent space. Our VAE outperformed these other approaches, generating approximately 6% higher homogeneity scores than the adversarial autoencoders, 9% higher than PCA, and 26% higher than the ClusterGAN architecture (Supplemental Figure 5C). Adversarial autoencoders performed better than the vanilla autoencoder, though worse than the VAE, possibly because it is difficult to train the discriminator to fit the latent encoding precisely to the desired distribution. Surprisingly, the ClusterGAN architecture performed the worst, even though it generated quite realistic-looking synthetic cell texture and mask images (data not shown). This weaker performance may stem from an inability to consistently generate direct, regularized encoded representations. These comparisons suggest that the VAE architecture is particularly well suited for learning morphological features for cell type discrimination.
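For reference, a homogeneity comparison of this kind could be scored as sketched below; scikit-learn's homogeneity_score is used here as a stand-in for the scoring described in Methods (an assumption), with toy labels for illustration.

```python
from sklearn.metrics import homogeneity_score

# Toy labels for illustration: `true_types` are known cell-type identities,
# and `clusters` are cluster assignments derived from one architecture's
# latent space. A score of 1.0 means every cluster contains one cell type.
true_types = [0, 0, 1, 1, 2, 2]
clusters = [0, 0, 1, 2, 2, 2]
print(f"homogeneity: {homogeneity_score(true_types, clusters):.3f}")
```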
To further visualize and analyze the representation of cells in latent space, we projected the encodings from the VAE into two dimensions using the UMAP algorithm 22 (Figure 2B). From this projection, we found that the cell types largely segregated into distinct regions of this two-dimensional space (Figure 2B). Raw264.7 macrophages occupied a region largely distinct from those occupied by the other three cell types, reflecting their markedly different cell size and shape distributions. The three other cell types occupied partially overlapping regions, reflecting greater similarities in morphology among these cells (Supplemental Figure 3A).
Interestingly, primary human AML stem cells (identified by their CD34+CD38− surface marker phenotype) overlapped parts of the Scid.ADH2 region, suggesting that some Scid.ADH2 cells look quite similar to their AML counterparts. Despite these overlaps, there are substantial areas in the two-dimensional space occupied by only one cell type, indicating the presence of morphological features that distinguish each of these three cell types from one another and allow them to be identified in mixed populations.
To better understand the morphological features that drive cell type discrimination in this learned latent space, we clustered cell representations in the latent space using the Louvain method, then visualized the cells and the morphological attributes that defined each cluster. Eight clusters were identified, each enriched for different cell types (Figure 2C-D, Supplemental Figure 3B).
Clusters C1-C3 were highly enriched for Raw264.7 macrophages, which are larger, phagocytic cells compared with the progenitor cell types. Clusters C4 and C8 were highly enriched for Kasumi-1 cells, which had circular profiles with dark granules, a unique distinguishing observable feature of these cells. Cluster C5 was enriched for Scid.ADH2 cells, which were also circular but lacked granules. Clusters C6 and C7 were enriched for both LSCs and Scid.ADH2 cells, both of which were small and lacked granules. Cells in Cluster C7 had darker interiors and less well-defined cell boundaries than Cluster C6 cells, indicating that they are flatter and may be more substrate-adherent. The morphological differences within these clusters indicate the existence of distinct morphological sub-states within individual cell types.
To understand the morphological features that separate cells into distinct groups in latent space, we performed hierarchical clustering on the averaged latent space representations for cells from different clusters (Figure 3A). From this analysis, we found that each morphological cluster of cells was associated with a specific set of latent features with magnitudes higher than the population average. To decode these latent features, we transformed them back into synthetic images in visual space (Figure 3B-C, top). First, we generated a mean mask or texture vector by averaging over all cells in the dataset. From these mean vectors, we then selectively increased the magnitudes of the feature (or group of features) of interest to generate a new vector. Using the VAE decoder module, we then transformed the feature-dominated vector and the mean vector into synthetic images for interpretation.
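A minimal sketch of this decoding procedure is shown below, assuming a trained `decoder` callable that maps latent vectors to images; the function name and amplification scale are illustrative, not the exact implementation.

```python
import numpy as np

def decode_feature(decoder, latents, feature_idx, scale=3.0):
    """Decode a feature-dominated latent vector alongside the mean vector.

    decoder: trained VAE decoder mapping (1, d) latents to images (assumed).
    latents: (n_cells, d) encoded mask or texture vectors.
    feature_idx: index (or list of indices) of the feature(s) to amplify.
    """
    mean_vec = latents.mean(axis=0)        # population-average latent vector
    boosted = mean_vec.copy()
    # Increase the chosen feature(s) by `scale` standard deviations
    boosted[feature_idx] += scale * latents[:, feature_idx].std(axis=0)
    # Decode both vectors into synthetic images for side-by-side comparison
    return decoder(mean_vec[None, :]), decoder(boosted[None, :])
```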
We first examined the synthetic decoded images from the five most enriched mask features for each morphology-defined cluster (Figure 3B).

UPSIDE uncovers morphologically distinct cell states in patient-derived leukemic cells
LSCs play critical roles in AML disease propagation and drug resistance 26,27 . LSCs and other AML cell subpopulations are typically identified and characterized by a combination of cell staining for granule content and cell surface markers, as well as by their gene expression signatures 28,29 . All of these approaches can be extended by transmitted light imaging and analysis, which provides complementary information about leukemic cell types and states that is not readily obtainable through more conventional classification approaches. In particular, live-cell movies that resolve phenotypic states over time and in response to pharmacological treatment could provide unique insights into cellular heterogeneity and responses that could better inform therapeutic decision-making.
Towards this end, we employed UPSIDE to profile human LSCs cultured under cytokine conditions promoting expansion and differentiation, and filmed using brightfield imaging (Figure 4A, left). We directly isolated leukemic stem cells from an adult AML patient, using expression of the cell surface marker CD34 together with an absence of the differentiation marker CD38 as markers of stemness 26,30,31 (i.e., the CD34+CD38− cell fraction). To profile the self-renewal and differentiation dynamics of these sorted cells, we then cultured LSCs either with IL6 and thrombopoietin (TPO) to induce differentiation, or in the presence of the aryl hydrocarbon receptor inhibitors (AhRi) UM729 and SR1 to maintain stemness and suppress differentiation [32][33][34] . We then filmed these cells in the brightfield channel for ~4 days at high temporal resolution (3 minute intervals, Figure 4A). To determine the association between observed cell morphological states, stemness and differentiation, we also added fluorescently-labeled anti-CD34 and anti-CD38 antibodies in situ, and took fluorescent images each hour to follow expression of these markers in the imaged cells (Figure 4A).
To gain insight into the features that drive the separation of the cell encodings into distinct clusters, we performed hierarchical clustering on averaged cell encodings from each group (Figure 4C), then decoded the specific mask or textural features with the highest z-scores in each group to generate feature-dominated synthetic images as described above (Figure 3B-C).
These synthetic images highlight significant morphological features that display coherence across cells within a cluster, but differ between cells in different clusters. Indeed, cells starting in states C7 and C8 (tracks 1 and 4) tended to remain in the same state, whereas cells starting in states C1, C3 and C5 were highly dynamic, switching rapidly from one state to another in successive frames (tracks 2 and 3). Notably, transitions occurred preferentially between particular groups of states. For instance, track 2 showed frequent transitions between states C1 and C2, whereas track 3 showed frequent transitions between states C4, C5 and C6, consistent with the elevated transition probabilities between these states observed in the transition matrix (Figure 6B, left).
The morphological state transitions observed above occurred over tens of minutes (Figure 4B), a timescale much shorter than that of cell differentiation, which occurs progressively over the course of the movie (Figure 5A). As such, these transitions are unlikely to report on an entire cell differentiation event, but rather on snapshots of such a process. They could also reflect more rapid cell phenotypic state changes. Cells tend to polarize as they crawl on a substrate; thus, it is possible that some of the observed transitions reflect switches from a stationary to a motile state. To test this hypothesis, we derived the instantaneous velocities of cells in different morphological states by calculating the displacement between successive frames for each state (Figure 6C). From this analysis, we found that cells with elongated morphologies, such as those in states C2, C4, C5, and C7, showed higher movement velocities than cells in other states. Consistently, there was a strong correlation between instantaneous velocity and cell eccentricity, averaged over all cells in individual clusters (Figure 6D). Thus, it is likely that the observed morphological transitions partially reflect rapid switching between stationary and motile states (Figure 6B).

Methods

Label-free Imaging and Image Segmentation. UPSIDE utilizes a label-free imaging method to identify cells from brightfield (BF) images. Here we adapted a U-net-based deep learning technique 21 to predict fluorescent images of the cells' cytoplasm and nuclei from the captured BF images. To complete this task, a small sample of the cells to be analyzed was stained with CellTrace Violet Cell Proliferation dye (ThermoFisher C34557) to label the cytoplasm.
Training data were obtained by capturing approximately 300-400 BF images and their corresponding nuclei or cytoplasm fluorescent images. These data were then used to train one model that predicts nuclei and another that predicts cellular cytoplasm. The two models were subsequently used to predict fluorescent images for the main timelapse brightfield image stacks.
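A hedged sketch of this training setup is given below in PyTorch. A small convolutional stack stands in for the published U-Net 21 so the example runs end to end, and the pixel-wise mean-squared-error loss is an assumption.

```python
import torch
import torch.nn as nn

# Placeholder network standing in for the U-Net of ref. 21
unet = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, 3, padding=1),
)
opt = torch.optim.Adam(unet.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()  # pixel-wise regression loss (assumed)

def train_step(brightfield, fluorescence):
    """brightfield, fluorescence: paired (batch, 1, H, W) image tensors."""
    opt.zero_grad()
    pred = unet(brightfield)           # predicted synthetic fluorescence image
    loss = loss_fn(pred, fluorescence)
    loss.backward()
    opt.step()
    return loss.item()
```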
Object segmentation was performed on the predicted nuclei images using the ictrack software described in 19 .
Live Cell Classification. Identified cell crops were then fed through a classifier that separated living cells, to be retained for further analysis, from dead cells and other debris picked up during segmentation. We built a convolutional neural network for this classification task:

Supplementary Figure 11. Architecture of convolutional classifier neural network for live cell classification.
To obtain training data for this network, ~10,000 brightfield cell crops were manually annotated as 'Live' or 'Dead.' The network was trained for approximately 10,000 steps; the cross-entropy loss was calculated and the Adam optimizer 41 was used to learn weights and biases:

$$\mathcal{L}_{CE} = -\sum_{i} y_i \log f(X)_i$$

where $f(X)$ is the vector of predicted class probabilities for a given cell crop $X$ and $y$ is its correct label. The remainder of the identified cell crops were then fed to the trained classification model. Crops classified as 'Dead' were discarded, and 'Live' crops were used for further analysis.
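A minimal sketch of such a classifier and its training step is given below; the layer organization is an assumption (the actual architecture is shown in Supplementary Figure 11), while the cross-entropy loss and Adam optimizer follow the description above.

```python
import torch
import torch.nn as nn

# Layer sizes are illustrative; only the loss and optimizer follow the text.
classifier = nn.Sequential(
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.AdaptiveAvgPool2d(4),           # pool to a fixed 4x4 spatial size
    nn.Flatten(),
    nn.Linear(64 * 4 * 4, 2),          # two output classes: 'Live' and 'Dead'
)
opt = torch.optim.Adam(classifier.parameters())
loss_fn = nn.CrossEntropyLoss()        # implements L = -sum_i y_i log f(X)_i

def train_step(crops, labels):
    """crops: (batch, 1, H, W) brightfield crops; labels: 0 = Dead, 1 = Live."""
    opt.zero_grad()
    loss = loss_fn(classifier(crops), labels)
    loss.backward()
    opt.step()
    return loss.item()
```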
Variational Autoencoder Training. Preprocessed image crops for shape and texture were used to train two separate VAEs. The overall architecture is as described below:

Supplementary Figure 12. Architecture of convolutional variational autoencoder for cell shape and texture learning.

The loss function for the VAE is a weighted combination of the reconstruction loss and the Kullback-Leibler divergence (KLD) loss:

$$\mathcal{L}_{VAE} = A\left[\gamma\,\mathcal{L}_{recon} + (1 - \gamma)\,\mathcal{L}_{KLD}\right]$$

where $A$ is a constant and $\gamma$ is between 0 and 1. Cell crops with defective shapes were first gated out using the cytometry2 function in ictrack, and the remaining crops were used to train the VAEs for cell shape and texture separately: the VAE for cell shape feature extraction was trained for ~100,000 steps, while the VAE for texture feature extraction was trained for ~200,000 steps. The trained weights and biases were then used to encode each cell crop obtained from the movie into a 100-element shape vector and a 100-element texture vector, which were projected onto a 2D plane using UMAP 22 . Each cell crop's latent vector is represented by a weighted concatenation of the shape and texture contributions:

$$z = \left[\,w\,z_{shape},\ (1 - w)\,z_{texture}\,\right]$$

where $w$ sets the relative contribution of the shape features. The encoded latent representations of the cell crops are then clustered using the Louvain algorithm.
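For concreteness, the weighted objective above can be sketched in PyTorch as follows; the Gaussian KL term and the default values of A and γ are assumptions consistent with the reconstructed equation.

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_recon, mu, logvar, A=1.0, gamma=0.9):
    """Weighted combination of reconstruction and KL-divergence losses."""
    recon = F.mse_loss(x_recon, x, reduction="sum")  # reconstruction term
    # KL divergence between N(mu, sigma^2) and the unit Gaussian prior
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return A * (gamma * recon + (1.0 - gamma) * kld)
```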
To generate synthetic images, encoded cell barcodes and their arithmetic variations were treated as latent codes $z$ and fed directly into the decoder.

Comparable Deep Learning Architectures. In addition to utilizing the Variational Auto Encoder architecture to learn the latent dimensions in our imaging datasets, we tested several other deep learning architectures to compare their performance with our current approach.

Vanilla Auto Encoder (AE) 24. In this architecture, each processed shape or texture crop is fed through a series of convolutional layers and fully connected neural network layers to generate a latent vector with a dimension of 100. The organization of the neural network layers is as follows:

Supplementary Figure 13. Architecture of convolutional Auto Encoder (AE) for cell shape and texture learning.

The loss function for the AE consists of the reconstruction loss alone:

$$\mathcal{L}_{AE} = \mathcal{L}_{recon}$$

The AE for cell shape feature extraction was first trained for ~100,000 steps, while the AE for texture feature extraction was first trained for ~200,000 steps. Each cell crop's latent vector is represented by a weighted concatenation of the shape and texture contributions, as for the VAE above.

Adversarial Auto Encoder (AAE) 12. In this architecture, each processed shape or texture crop is fed through a series of convolutional layers and fully connected neural network layers to generate a latent vector with a dimension of 100. The latent dimension is then regularized using a discriminator that forces the latent space into a unit Gaussian distribution (1x AAE) or a mixture of four Gaussian distributions (4x AAE).
The organization of the neural network layers is as follows:

Supplementary Figure 14. Architecture of convolutional Adversarial Auto Encoder (AAE) for cell shape and texture learning.
The loss functions for the AAE comprise the reconstruction loss together with adversarial losses for the discriminator and the encoder:

$$\mathcal{L}_{disc} = -\log D(z_{real}) - \log\left(1 - D(z)\right), \qquad \mathcal{L}_{enc} = -\log D(z)$$

where $z_{real}$ is a 100-element vector sampled from a normal Gaussian distribution (1X AAE) or from a mixture of four Gaussians, with each Gaussian's mean at -1, -0.5, 0.5, and 0.5 and standard deviation 1 (4X AAE). The AAE for cell shape feature extraction was first trained for ~100,000 steps, while the AAE for texture feature extraction was first trained for ~200,000 steps. Each cell crop's latent vector is represented by a weighted concatenation of the shape and texture contributions, as for the VAE above.
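The adversarial regularization step can be sketched as follows, assuming a discriminator `D` that outputs probabilities in (0, 1); this follows the standard AAE formulation 12 rather than the exact implementation used here.

```python
import torch
import torch.nn.functional as F

def aae_adversarial_losses(D, z, z_real):
    """D: discriminator returning (batch, 1) probabilities (assumed).
    z: encoder outputs; z_real: samples drawn from the target prior."""
    ones = torch.ones(z.size(0), 1)
    zeros = torch.zeros(z.size(0), 1)
    # Discriminator learns to label prior samples real and encodings fake
    d_loss = (F.binary_cross_entropy(D(z_real), ones)
              + F.binary_cross_entropy(D(z.detach()), zeros))
    # Encoder is trained to make its encodings look like prior samples
    g_loss = F.binary_cross_entropy(D(z), ones)
    return d_loss, g_loss
```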

ClusterGAN 25
This architecture carries an encoder that converts a generated image into a latent code, which is then forced to match the starting latent code originally used to generate the image. This is a semi-supervised architecture in which a specific number of classes must be predetermined. To convert this into an unsupervised method, we removed the class module, enabling the GAN to draw latent codes from a normal distribution without the one-hot class vector input. The neural network organizations for the generator, encoder, and discriminator are as follows:

Supplementary Figure 15. Architectures of the generator, encoder, and discriminator modules of ClusterGAN for cell shape and texture learning.

Loss functions used for training were described previously 25 . We input the cell crops into the encoder module of ClusterGAN to generate the latent dimensions for the comparative analysis with other architectures.
For ClusterGAN, cell shape feature extraction was first trained for ~100,000 steps, while texture feature extraction was first trained for ~200,000 steps. Each cell crop's latent vector is represented by a weighted concatenation of the shape and texture contributions, as for the VAE above.

Algorithms and quantitative analysis
Neighbor similarity scoring. This metric is formulated to estimate the degree of homogeneity of the grouping of each cell type in the encoding space of the four-cell-type dataset; a neighbor similarity score is computed for each cell type.

Transition probability between cell clusters. To estimate the transition dynamics between identified morphological clusters through time, we determine the probability that a given cell $X$ belonging to cluster $i$ at time $t$ transitions to another cluster $j$ in the set of clusters at time $t + 1$, as follows:

$$P_{ij} = \frac{f_{ij}}{\sum_{j'} f_{ij'}}$$

where $f_{ij}$ is the number of observed transitions from $i$ to $j$.
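A short sketch of this estimate is given below, assuming per-cell tracks of cluster labels over successive frames; the function and variable names are hypothetical.

```python
import numpy as np

def transition_matrix(tracks, n_clusters):
    """Estimate P_ij = f_ij / sum_j f_ij from tracked cluster labels.

    tracks: list of sequences; each sequence holds one cell's cluster
    label at successive movie frames.
    """
    counts = np.zeros((n_clusters, n_clusters))
    for labels in tracks:
        for i, j in zip(labels[:-1], labels[1:]):
            counts[i, j] += 1              # one observed i -> j transition
    row_sums = counts.sum(axis=1, keepdims=True)
    # Row-normalize, leaving rows with no observed transitions at zero
    return np.divide(counts, row_sums,
                     out=np.zeros_like(counts), where=row_sums > 0)
```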

Data Availability
Imaging data for the UPSIDE platform will be made available on the Image Data Resource ( https://idr.openmicroscopy.org/about/ ). Code for the software is available on GitHub ( https://github.com/KuehLabUW/UPSIDE ).

[Training curves: Reconstruction Loss and KLD Loss versus training step.]