Quantifying Cell-State Densities in Single-Cell Phenotypic Landscapes using Mellon

Cell-state density characterizes the distribution of cells along phenotypic landscapes and is crucial for unraveling the mechanisms that drive cellular differentiation, regeneration, and disease. Here, we present Mellon, a novel computational algorithm for high-resolution estimation of cell-state densities from single-cell data. We demonstrate Mellon’s efficacy by dissecting the density landscape of various differentiating systems, revealing a consistent pattern of high-density regions corresponding to major cell types intertwined with low-density, rare transitory states. Using hematopoietic stem cell fate specification to B-cells as a case study, we present evidence implicating enhancer priming and the activation of master regulators in the emergence of these transitory states. Mellon can also perform temporal interpolation of time-series data, providing a detailed view of cell-state dynamics during inherently continuous developmental processes. Scalable and adaptable, Mellon facilitates density estimation across various single-cell data modalities, scaling linearly with the number of cells. Our work underscores the importance of cell-state density in understanding differentiation processes, and the potential of Mellon to provide new insights into the regulatory mechanisms guiding cellular fate decisions.


Convergence forces cell states into a more confined state space, thereby increasing density, whereas divergence spreads out cell states, resulting in a decrease in density. C. The pace of state changes also influences cell-state density. Deceleration of state changes can result in a concentration of cell states, thereby increasing density. Conversely, acceleration of state changes tends to disperse cell states, decreasing density. In gene expression, acceleration is induced by rapid transcriptional changes. D. Schematic of the toy dataset 1 to illustrate the Mellon continuous density function.
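The link between the pace of state changes and density can be illustrated with a minimal simulation. The sketch below is not part of Mellon; it assumes a hypothetical one-dimensional trajectory on [0, 1] along which cells are observed at uniformly random times, with one segment traversed more slowly than the rest. The function names and the parameters `slow_start`, `slow_end`, and `slow_factor` are illustrative assumptions.

```python
import random


def simulate_positions(n_cells, slow_start=0.4, slow_end=0.6, slow_factor=0.2, seed=0):
    """Sample cell positions along a hypothetical 1D trajectory on [0, 1].

    Cells move at unit speed outside [slow_start, slow_end] and at
    `slow_factor` times that speed inside it. Observation times are
    uniform, so the time spent in a segment sets how many cells fall there.
    """
    rng = random.Random(seed)
    # Time needed to cross each of the three segments.
    t_before = slow_start
    t_slow = (slow_end - slow_start) / slow_factor
    t_after = 1.0 - slow_end
    total = t_before + t_slow + t_after
    positions = []
    for _ in range(n_cells):
        t = rng.uniform(0.0, total)
        if t < t_before:
            x = t
        elif t < t_before + t_slow:
            x = slow_start + (t - t_before) * slow_factor
        else:
            x = slow_end + (t - t_before - t_slow)
        positions.append(x)
    return positions


def histogram_density(positions, n_bins=5):
    """Normalized histogram density over [0, 1] with equal-width bins."""
    counts = [0] * n_bins
    for x in positions:
        counts[min(int(x * n_bins), n_bins - 1)] += 1
    width = 1.0 / n_bins
    return [c / (len(positions) * width) for c in counts]


positions = simulate_positions(20000)
density = histogram_density(positions, n_bins=5)
for b, d in enumerate(density):
    print(f"bin {b}: density {d:.2f}")
```

With these settings the middle bin coincides with the decelerated segment, so its density should be roughly 1/`slow_factor` times that of the other bins, up to sampling noise. Setting `slow_factor` above 1 models acceleration and lowers the density in that segment instead.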

Supplementary Figure 2: Comparison of cell-state density estimation approaches
A. UMAP of the scRNA-seq dataset of T-cell depleted bone marrow, colored by cell-type. B. UMAPs colored by Mellon density (left), density computed as the inverse of the distance to the kth nearest neighbor (middle), and density computed using UMAP coordinates (right); k = 15. C. Violin plots comparing cell-state densities among different hematopoietic cell-types. Arrowheads indicate example cell-types with high variability in density. Top: Mellon. Middle: inverse of distance to kth nearest neighbor. Bottom: UMAP densities. Mellon densities are most consistent with the expected landscape of human hematopoiesis. D. Plots comparing Palantir pseudotime to log-density for different hematopoietic lineages. Top row: Mellon. Middle row: inverse of distance to kth nearest neighbor. Bottom row: UMAP densities. Mellon provides the most robust and interpretable density estimates, with clear separation of high- and low-density regions. Log densities are shown for all comparisons.
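The baseline in the middle panels, density as the inverse of the distance to the kth nearest neighbor, can be sketched as follows. This is a brute-force illustration on made-up 2D points, not the code used for the figure; the function name and the toy clusters are assumptions, and Euclidean distance in the input coordinates is assumed.

```python
import math
import random


def knn_inverse_distance_density(points, k=15):
    """Return log(1 / d_k) for every point, where d_k is the Euclidean
    distance to its k-th nearest neighbor (the point itself excluded)."""
    if k >= len(points):
        raise ValueError("k must be smaller than the number of points")
    log_densities = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        # Guard against duplicate points, which would give d_k = 0.
        d_k = max(dists[k - 1], 1e-12)
        log_densities.append(-math.log(d_k))
    return log_densities


# Toy data (made up): a tight cluster and a diffuse cluster.
rng = random.Random(0)
tight = [(rng.gauss(0.0, 0.1), rng.gauss(0.0, 0.1)) for _ in range(40)]
sparse = [(rng.gauss(5.0, 2.0), rng.gauss(5.0, 2.0)) for _ in range(40)]

log_density = knn_inverse_distance_density(tight + sparse, k=15)
mean_tight = sum(log_density[:40]) / 40
mean_sparse = sum(log_density[40:]) / 40
print(f"mean log-density, tight cluster:  {mean_tight:.2f}")
print(f"mean log-density, sparse cluster: {mean_sparse:.2f}")
```

Because each estimate depends on a single neighbor distance, values for nearby cells can differ noticeably, which is in line with the high within-cell-type variability marked by the arrowheads in panel C.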

Supplementary Figure 3: Illustration of how the Covariance Kernel of the Gaussian Process links cell-state density inference across related cells.
A. Inset of the toy dataset in Fig. 1B, showing the nearest neighbor distance of representative cells in high- and low-density regions. B. Representative cells displaying the covariance function (more intense red indicates higher covariance). This gradient demonstrates the degree of covariance with neighboring cells: a cell located in a high-density region shows high covariance of density with many neighbors, whereas a cell in a low-density region exhibits strong covariance with fewer cells. C. UMAPs of the T-cell depleted bone marrow dataset colored by covariance to randomly selected cells in low-density regions. Covariance between all pairs of cells serves as input to the Gaussian Process. D. Same as (C), for the CD34+ bone marrow data. E. UMAPs of the T-cell depleted bone marrow dataset colored by covariance to randomly selected cells in high-density regions. F. Same as (E), for the CD34+ bone marrow data.
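The behavior described in panel B can be reproduced with a small sketch. The caption does not name the kernel, so a squared-exponential kernel with a fixed length scale is assumed here purely for illustration; the function names, the toy coordinates, and the 0.5 covariance threshold are likewise assumptions.

```python
import math


def covariance_to_cell(cells, index, length_scale):
    """Covariance between cells[index] and every cell.

    Assumption: squared-exponential kernel, used for illustration only.
    """
    ref = cells[index]
    return [
        math.exp(-math.dist(ref, c) ** 2 / (2 * length_scale ** 2))
        for c in cells
    ]


def count_strong_neighbors(cells, index, length_scale, threshold=0.5):
    """Number of other cells whose covariance with cells[index] is at least threshold."""
    cov = covariance_to_cell(cells, index, length_scale)
    return sum(1 for j, c in enumerate(cov) if j != index and c >= threshold)


# Toy layout (made up): 11 closely spaced cells, then 11 widely spaced cells.
dense_region = [(0.1 * i, 0.0) for i in range(11)]
sparse_region = [(5.0 + i, 0.0) for i in range(11)]
cells = dense_region + sparse_region

dense_count = count_strong_neighbors(cells, index=5, length_scale=1.0)
sparse_count = count_strong_neighbors(cells, index=16, length_scale=1.0)
print("strongly covarying neighbors, dense cell: ", dense_count)
print("strongly covarying neighbors, sparse cell:", sparse_count)
```

The kernel and length scale are identical for both cells; the difference in the number of strongly covarying neighbors comes only from how many cells lie within reach of the kernel.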

Supplementary Figure 4: Validation of Mellon density estimation using simulated datasets with known ground truth.
A. UMAP of simulated data colored according to differentiation tree nodes. B. Correlation plots between known ground truth log-density (x-axis) and Mellon-inferred density (y-axis) across all simulated cells. Each point represents a simulated cell. C. UMAPs colored by Mellon-inferred density (left) and ground truth log-density (right). Density values falling below the 20th percentile are projected to the 20th percentile for visualization. D-F. Same as A-C for a second simulated dataset. Cells in (D) are colored by differentiation tree nodes. G-I. Same as A-C for a third simulated dataset. Cells in (G) are colored by clusters.

A. Heatmaps displaying Spearman correlation between density estimates derived using different numbers of landmarks for five datasets. The number of landmarks used for each column's density estimate is indicated by a marginal plot at the bottom. The right-most column and topmost row correspond to the maximum number of landmarks tested, which is typically substantially more than the default of 5,000 landmarks. This default value is indicated by a vertical line traversing both the heatmap column and the associated marginal plot. As the number of landmarks decreases from right to left, the resulting correlations provide insights into the robustness of Mellon's density inference as a function of landmark quantity. The colormaps are uniform across all heatmaps, ranging from -1 to 1, covering the entire possible range of Spearman correlation values. High correlation values adjacent to the top-right corner signify that Mellon's density inference maintains strong accuracy, even with varying numbers of landmarks. B. Spearman and Pearson correlation coefficients between density estimates derived using varying numbers of landmarks and the landmark-free full Gaussian process for each dataset. In the case of the iPSC dataset, a 65k landmark inference is used as the reference, as the full Gaussian process requires too much memory. A vertical line marks the default value of 5,000 landmarks, while a horizontal line denotes a perfect Pearson correlation coefficient of 1.

B. Force-directed layout of the mouse lung adenocarcinoma scATAC-seq dataset 10 colored by cell-type (left) and Mellon density (right). C. Same as (A), for the scRNA-seq data of mouse models of lung adenocarcinoma 11. UMAPs and diffusion maps for density estimation were computed using the scVI 12 latent space. D. Same as (A), for SHARE-seq data of mouse skin differentiation 13. UMAPs and diffusion maps for density estimation were computed using the MIRA 14 multimodal representation.
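The landmark-robustness comparison above relies on Spearman and Pearson correlations between pairs of density estimates. A minimal sketch of both coefficients is given below, written without external libraries; the function names are illustrative, and the density values are made-up toy numbers, not results from any dataset in the paper.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy log-density estimates (made up) for five cells.
reference = [-3.2, -1.0, -2.5, -0.4, -1.8]
# A monotone but nonlinear distortion of the reference estimate.
distorted = [math.exp(v) for v in reference]

rho = spearman(reference, distorted)
r = pearson(reference, distorted)
print(f"Spearman: {rho:.3f}, Pearson: {r:.3f}")
```

Spearman correlation depends only on how cells are ranked by density, so it stays at 1 under any monotone distortion, whereas Pearson correlation also penalizes departures from a linear relationship. Reporting both, as in panel B, separates the two kinds of disagreement.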

Supplementary Figure 29: Mellon length scale heuristic.
A. Plot depicting datasets with varying numbers of diffusion components used for cell-state representation (Supplementary Table 1). The optimal length scale (y-axis) is chosen through the maximum a posteriori estimate of the Bayesian model employed in Mellon's density inference, where the length scale is considered a free parameter. The regression line depicts the fit used to derive the length scale heuristic for any dataset. B-D. Scatter plots displaying the same relation for 597 simulated datasets (Supplementary Note 5). The optimal length scale on the y-axis is chosen by maximizing the Spearman correlation of the Mellon-inferred density with the ground truth density. The x-axis shows the geometric mean of nearest neighbor distances across all cells in the respective datasets.
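The x-axis statistic, the geometric mean of nearest neighbor distances, can be computed as in the sketch below. This is a brute-force illustration on a made-up grid, assuming Euclidean distance and no duplicate points; the function name is an assumption, and the regression coefficients of the heuristic itself are not given in this caption and are not reproduced here.

```python
import math


def geometric_mean_nn_distance(points):
    """Geometric mean, over all points, of the Euclidean distance to the
    nearest other point. Assumes no duplicate points (distance 0)."""
    log_sum = 0.0
    for i, p in enumerate(points):
        nearest = min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        log_sum += math.log(nearest)
    return math.exp(log_sum / len(points))


# Toy data (made up): a 3 x 3 grid with spacing 0.5, and the same grid doubled.
grid = [(0.5 * i, 0.5 * j) for i in range(3) for j in range(3)]
doubled = [(2 * x, 2 * y) for x, y in grid]

gm_grid = geometric_mean_nn_distance(grid)
gm_doubled = geometric_mean_nn_distance(doubled)
print(f"grid: {gm_grid:.3f}, doubled grid: {gm_doubled:.3f}")
```

Rescaling all coordinates by a constant rescales this statistic by the same constant, which makes it a natural reference for a length scale that should adapt to the overall spread of a dataset.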