Quasi-universality in single-cell sequencing data

The development of single-cell technologies provides the opportunity to identify new cellular states and reconstruct novel cell-to-cell relationships. Applications range from understanding the transcriptional and epigenetic processes involved in metazoan development to characterizing distinct cell types in heterogeneous populations such as cancers or immune cells. However, analysis of the data is impeded by its unknown intrinsic biological and technical variability together with its sparseness; these factors complicate the identification of true biological signals amidst artifact and noise. Here we show that, across technologies, roughly 95% of the eigenvalues derived from each single-cell data set can be described by universal distributions predicted by Random Matrix Theory. Interestingly, 5% of the spectrum deviates from these distributions and presents a phenomenon known as eigenvector localization, where information tightly concentrates in groups of cells. Some of the localized eigenvectors reflect underlying biological signal, and some are simply a consequence of the sparsity of single-cell data; roughly 3% is artifactual. Based on the universal distributions and a technique for detecting sparsity-induced localization, we present a strategy to identify the residual 2% of directions that encode biological information and thereby denoise single-cell data. We demonstrate the effectiveness of this approach by comparing with standard single-cell data analysis techniques in a variety of examples with marked cell populations.


INTRODUCTION
Single-cell technologies offer the opportunity to identify previously unreported cell types and cellular states and to explore the relationship between new and known cell states (1)(2)(3)(4)(5)(6)(7). However, several significant biological and technical challenges complicate the analysis. The first challenge relates to the lack of a complete quantitative understanding of the different sources of noise. Almost identical cells have an intrinsic cell-to-cell variability and, within a cell, there are spatial and temporal fluctuations. Moreover, different technologies show biases in detecting, amplifying, and sequencing genomic material that vary significantly across genomic loci. Estimating noise, and distinguishing between biological and technical sources, is paramount for any further analysis: without reliable estimates of noise it is difficult to distinguish states or identify potential variations of a single state. A second complicating factor for single-cell analysis is the sparsity of the data, associated with the very low amounts of genomic material amplified.
Several computational and statistical approaches have been designed to address some of these challenges (4,8-13). For instance, imputation methods try to infer the "true" expression for missing values from the sample data by empirically modeling the underlying distributions, for instance, using a negative binomial plus zero inflation (drop-out) for single-cell data. These techniques usually assume that values are generated by the same distribution (independent and identically distributed variables, or i.i.d.). However, we currently do not have predictive quantitative models of gene expression, and so it is not clear what the correct distribution is or why the i.i.d. assumption should hold. Given the lack of a quantitative microscopic description of cell transcription, we would ideally like to have a statistical description of the noise in single-cell data that does not rely on specific details of the underlying distributions of expression.
Historically, a similar problem occurred in the 1950s in nuclear physics, when the lack of quantitative models of complex nuclei precluded accurate predictions of their energy levels. However, simple theoretical models based on experimental data showed that some observables, such as the distance between two consecutive energy levels, followed universal distributions that could be derived from random Hermitian matrices (14)(15)(16). These distributions were subsequently identified in a variety of complex systems, as quantum versions of chaotic systems (17), and even in patterns of zeros of Riemann zeta functions (18,19).
Further work showed that many random matrix statistics present a universal behavior, akin to the central limit theorem, where specific details of the underlying distribution generating the entries of the matrix become irrelevant (20,21). In the context of PCA problems where the ratio of rows to columns converges to a constant as both dimensions go to infinity, as arises in the single-cell setting, the study of the asymptotics of random matrices led to the development of techniques for sparse PCA. These methods are intended to correct for bias in empirical eigenvalues and eigenvectors in PCA and to enhance interpretability of the results (22,23).
We propose here to apply these asymptotics to identify universal statistical features of noise that are insensitive to the specific details of a complex system (i.e., the cell and the single-cell measurement technologies). It has recently been shown that universality of the eigenvalues of random matrices depends only on the asymptotic behavior (subexponential) of, or the finiteness of the first few moments of, the distribution generating the matrix entries, without requiring identically distributed random variables (24)(25)(26). These hypotheses hold for all distributions commonly used to describe single-cell data. Similarly strong results have been observed in the distribution of eigenvectors: the eigenvectors of a random matrix show a phenomenon called delocalization, being distributed randomly on a high-dimensional sphere. These universal characterizations provide the basis of a test: deviations from universal eigenvalue distributions (20,27) or the appearance of localized eigenvectors indicate the presence of a signal that can be further analyzed.
However, this test can be confounded by sparsity. Specifically, we show that the intrinsic sparsity of single-cell data can introduce deviations from the universal random matrix eigenvalue distribution. Nonetheless, we observe that these deviations can be easily identified by the presence of localized eigenvectors that survive permutation of the data. We combine the universal random matrix statistics with corrections for sparsity to identify the biological signal in single-cell data. By studying a variety of single-cell transcriptomic experiments, we show that the spectrum of a normalized Wishart matrix generated from the data follows a Marchenko-Pastur (MP) distribution with a small fraction of outlier eigenvalues and localized eigenvectors. Thus, we can use the associated edge statistics (Tracy-Widom) and localized eigenvectors to generate a low-rank approximation of the data, increasing the power to identify potentially interesting biological signals. Our direct approach is substantially simpler and more efficient than sparse PCA (22,23), explicitly handles sparsity of the data matrix (without any assumptions about the distribution of missing elements), and leads directly to an estimate of the rank of the underlying signal. We show that this procedure is better able to capture marked single-cell clusters than alternative techniques, across a variety of data sets.

Quasi-universality of single-cell sequencing data.
A standard technique for studying high dimensional data is dimensionality reduction via Principal Component Analysis (PCA) or versions thereof. PCA estimates the sample covariance and can be used to generate a low dimensional representation by projecting onto the eigenvectors associated with the eigenvalues capturing most of the variance (principal components). We observed that the distribution of separations between the square roots of consecutive eigenvalues of the sample covariance matrix in different single-cell RNASeq experiments (28)(29)(30)(31)(32)(33) resembles the Wigner surmise distribution conjectured by Wigner in 1955 (16) in the study of the difference between resonant peaks in slow neutron scattering (Figure 1A and Supplementary Figure 1). This observation prompted us to investigate the connection between Random Matrix Theory (RMT) and the spectra of single-cell data, guided by the hypothesis that departures from random matrix distributions indicate potentially interesting biological signals. Figure 1B shows an example: in red, the nonparametric Marchenko-Pastur (MP) distribution of the density of eigenvalues, the associated RMT distribution, which fits most of the eigenvalues well (see Methods).
The same results can be observed across many other single-cell datasets (Supplementary Figure 2, Methods). Deviations from RMT can also be found by analyzing the largest eigenvalues in relation to the expected Tracy-Widom (TW) distribution. We observed that across single-cell datasets these deviations amount to 5% of the eigenvalues (Figure 1C).
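The bulk statistics discussed above can be reproduced on synthetic noise. The sketch below (NumPy only, using i.i.d. Gaussian entries as a stand-in for normalized data) checks that the eigenvalues of a pure-noise normalized Wishart matrix fall inside the Marchenko-Pastur support; all sizes and tolerances are illustrative choices, not the paper's parameters.

```python
import numpy as np

def mp_edges(n_cells, n_genes, sigma2=1.0):
    """Edges of the Marchenko-Pastur support for a normalized Wishart matrix."""
    gamma = n_genes / n_cells  # aspect ratio P/N
    lam_minus = sigma2 * (1 - np.sqrt(gamma)) ** 2
    lam_plus = sigma2 * (1 + np.sqrt(gamma)) ** 2
    return lam_minus, lam_plus

rng = np.random.default_rng(0)
N, P = 2000, 500                      # cells x genes, pure noise
X = rng.standard_normal((N, P))
W = X.T @ X / N                       # normalized Wishart matrix
eigvals = np.linalg.eigvalsh(W)

lam_minus, lam_plus = mp_edges(N, P)
# For pure noise, essentially all eigenvalues fall inside the MP support.
inside = np.mean((eigvals > lam_minus - 0.05) & (eigvals < lam_plus + 0.05))
print(f"fraction inside MP support: {inside:.3f}")
```

On real single-cell data the same comparison shows a small fraction of eigenvalues escaping the support, which is the deviation quantified in Figure 1C.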
We further investigated the potential leading causes of these deviations. After randomizing the data by shuffling the cell expression values in each gene independently, to erase any potential signal in the sample covariance matrix, we found that only ~2% of eigenvalues could be associated with potential biological signal (Figure 1C).

Sparsity Induced Eigenvector Localization
One of the key features of single-cell data is its sparsity. We investigated whether sparsity could induce deviations from universality. By introducing zeros into random matrices with entries generated from Gaussian or Poisson distributions, we observed that the mean and the median of the highest eigenvalue deviate from the expectations for dense random matrices (Figure 2A, 2C). For Sparse Random Matrix Ensembles, a generalization of RMT to random matrices with a significant fraction of zero entries (24,34-37), it has been shown that, above a certain sparsity bound, the highest eigenvalue distribution can be approximated by a rescaled and shifted Tracy-Widom distribution. However, little is known below that bound, and nothing is known when sparsity varies across columns; in these regimes we observe sparsity-induced eigenvector localization. We discovered that this phenomenon can also be observed in real data, where it can be detected by removing sparsity and comparing the localization properties of the eigenvectors (Supplementary Figures 5 and 6). The localization phenomenon due to sparsity generates artifacts that could potentially be interpreted as true signal in standard applications of PCA. For instance, outlier points can be detected in the highest components of sparse random data (Figure 2D). Another effect of sparsity is the artifactual generation of an "elbow" in randomized sparse data (Figure 2E). These effects can be suppressed by eliminating the localized vectors, generating a more homogeneous distribution in the lower-dimensional representation that reflects the random nature of the data.
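A toy illustration of sparsity-induced localization (not the paper's pipeline; all parameters are arbitrary choices): when the fraction of non-zero entries varies across columns, the column variances become heterogeneous and the edge eigenvectors of the covariance concentrate on the densest columns. The inverse participation ratio (IPR) detects this concentration.

```python
import numpy as np

def max_ipr(M):
    """Largest inverse participation ratio over the covariance eigenvectors.
    IPR ~ 1/P for delocalized vectors, of order 1 for localized ones."""
    W = M.T @ M / M.shape[0]
    _, vecs = np.linalg.eigh(W)
    return float((vecs ** 4).sum(axis=0).max())

rng = np.random.default_rng(1)
N, P = 1000, 300
X = rng.standard_normal((N, P))

# Column-dependent sparsity: most columns keep few entries, one is much denser.
density = rng.uniform(0.01, 0.2, size=P)
density[0] = 0.9
mask = rng.random((N, P)) < density
Xs = np.where(mask, X, 0.0)

# Sparsity raises the maximal IPR: the top eigenvector concentrates
# on the densest column instead of spreading over all of them.
print(max_ipr(X), max_ipr(Xs))
```

Because the sparsity pattern survives per-gene permutation of the cells, the same localization appears in randomized data, which is how it can be told apart from biological signal.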

Simulations and comparison with alternative approaches
We now proceed to evaluate the performance of the sparsity-induced localization correction and the random matrix statistics for the identification of potentially relevant biological signals (Figure 3). We first perform two sets of simulations: a single-cell RNA sequencing simulation of six cell populations using Splatter (39) (see Methods), and seven Gaussian clusters with a small mean-to-variance ratio in each dimension. Figures 3A and 3B show t-SNE representations of the data before and after the RMT procedure, and panel 3C the associated MP statistics. Panels 3D, 3E and 3F show the before, after and MP statistics for the set of seven Gaussian clusters. The first example illustrates the challenge of identifying structures based on t-SNE plots before performing the RMT procedure (Figure 3A); in contrast, after the procedure we see clearly separated clusters (Figure 3B). The test based on the MP statistics correctly identifies the six components associated with the six simulated clusters. The second, Gaussian simulation corresponds to a regime where the identification of clusters can be challenging due to the low difference in means compared to the variance (Figure 3D). Again, the RMT procedure clearly identified the cluster structure (Figures 3E, 3F).
We now perform a comparison in terms of cell-phenotype cluster resolution with some published algorithms. For that purpose, we use the data sets (38) (human PBMC) and (40) (mouse cortex) described in the previous section. As explained there, these references, together with the analysis done in (11) of (40), already have cells labeled in terms of phenotype. We claimed in the previous section that our RMT method is able to clean systematic noise such that the cell-phenotype clusters are better resolved. This noise is partially generated by the missing values in single-cell experiments. For that reason, we compare with the two main approaches in the field that address this question: imputation (MAGIC (8) and scImpute (10)) and zero-inflated dimensionality reduction (ZIFA (13) and ZINB-WaVE (9)). For completeness, we also compare with the raw data, with a selection of genes based on higher variance (top 300 genes), and with Seurat (41). The comparison is performed using the knowledge of cell phenotypes in refs. (11,38,40) and by computing the mean silhouette score in the reduced space; higher values indicate a better (less noisy) cell-phenotype cluster resolution. In Figures 3G-3J we represent the mean silhouette score as a function of the number of dimensions of the reduced space (number of PCs) for 13 PBMC cell-phenotypes described in (11) (Figure 3G) and for 7 (Figure 3H), 15 (Figure 3I) and 26 (Figure 3J) marked mouse cortex cell populations described in (40). We selected the 1,500 most signal-like genes (see Methods).

Application to diverse single-cell transcriptomic data sets
In this section we present in detail the RMT analysis of the two marked single-cell data sets: 6,573 human PBMC cells from reference (38), together with the labels produced in (11), where the authors analyze the data of (38) and produce a larger number of labels (corresponding to more cell identities), and the mouse cortex cells of (40). We ran the RMT algorithm, which first eliminates the genes that introduce artifacts due to sparsity, then performs hierarchical clustering of the cleaned system and projects the clusters onto a t-SNE plot for better visualization. We also performed a comparison with the cell identities defined in references (11,38,40,42) (D panels), confirming that we can recover the populations in these references and adding some more potential subpopulations (E panels). In particular, for the PBMC case we found two subpopulations of dendritic cells that were not identified in the original work (38) or its subsequent refinements (11), and that express markers identified in an independent study (43) (Supplementary Figure 16). In addition, we identified three subpopulations of CD14 monocytic cells, two of activated B cells and two of activated T cells (see Methods for the discussion of the differential expression analysis).

Discussion
In this manuscript, we demonstrate the effectiveness of tests based on (sparse) Random Matrix Theory for studying the spectrum of the covariance matrix of single-cell genomic data. We show that while most of the spectrum (95%) follows the expectations from RMT, deviations become apparent both in the distribution of the density of eigenvalues and the maximum eigenvalue and in eigenvector localization. We further show that most of these effects are artifacts due to the sparsity of the data, including sparsity-induced localization; these account on average for 3% of the eigenvalues. The final 2% of the eigenvalues can then be attributed to true biological signal. Eigenvector localization after eliminating the sparsity effects points to groups of cells where information tightly concentrates, indicating common transcriptional programs.
Sparse Random Matrix Theory and associated eigenvector localization correction provides a powerful tool to identify this signal and produce a low rank representation of single-cell data that may be used for further interpretation.
Additionally, we should point out that the universality we observed in Wishart/covariance matrices is also observable in the spectra of graph Laplacians (including sparse graphs (44)) and kernel random matrices (45), which are used in other single-cell analytic techniques, suggesting that the approach followed here could be applied more broadly.
Code for the algorithm and denoising pipeline is publicly available at https://rabadan.c2b2.columbia.edu/html/randomly/.

Figure legend (partial): the panel also shows a projection onto random vectors, with lines showing gamma-function fits to the distributions discussed. The right subpanel shows the number of relevant genes according to the test discussed above, together with a false discovery rate; the higher the chi-squared test for variance, the fewer genes are responsible for signal. C) Evolution of the mean silhouette score as a function of the statistical test discussed in panel (B). The score grows at first because genes responsible for noise are eliminated as the sample-variance threshold increases; beyond a certain threshold the score starts dropping because too few genes remain to define the cell clusters. The nullified case corresponds to genes projected onto signal before selecting the signal-like genes (the RMT procedure uses this case). D) Hierarchical clustering of the PBMC cells after using the RMT method to eliminate noise, and comparison with the cell populations found in (38) and (11). The number of signal-like genes selected is 1,500. E) Visual representation of the previous panel through a t-SNE plot.

Introduction to eigenvalue statistics in covariance random matrices
Given an N × P matrix X, where each column is independently drawn from a distribution with mean zero and variance σ², the corresponding Wishart matrix is defined as W = (1/N) XᵀX. The eigenvalues λᵢ and normalized eigenvectors vᵢ of the Wishart matrix, where i = 1, 2, …, P, are given by the relation W vᵢ = λᵢ vᵢ. If X happens to be a random matrix (a matrix whose entries x_ij are randomly sampled from a given distribution), then W becomes a random covariance matrix, and its spectrum shows universal properties:
1. Bulk universality: the density of eigenvalues converges to the Marchenko-Pastur (MP) distribution.
2. Largest eigenvalue universality: the fluctuations of the largest eigenvalue λ_max around the upper edge of the MP distribution follow a universal distribution which, in the Wishart case, is known as the Tracy-Widom distribution. A similar argument can be made for the smallest eigenvalue λ_min.
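The largest-eigenvalue statement can be checked numerically; the sketch below (i.i.d. Gaussian entries, arbitrary sizes) verifies that λ_max of a pure-noise Wishart matrix concentrates at the MP upper edge, with Tracy-Widom fluctuations of order N^(-2/3) around it.

```python
import numpy as np

rng = np.random.default_rng(2)
N, P, trials = 400, 100, 50
gamma = P / N
edge = (1 + np.sqrt(gamma)) ** 2   # upper edge of the MP support (sigma^2 = 1)

lam_max = []
for _ in range(trials):
    X = rng.standard_normal((N, P))
    W = X.T @ X / N                # normalized Wishart matrix
    lam_max.append(np.linalg.eigvalsh(W)[-1])
lam_max = np.array(lam_max)

# The mean of the largest eigenvalue sits close to the MP edge;
# the spread around it is the Tracy-Widom fluctuation scale.
print(lam_max.mean(), edge)
```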
• Eigenvector delocalization: this property implies that the norm of the eigenvectors vᵢ is spread evenly among all their components a, so that each squared component satisfies |vᵢ(a)|² ≈ 1/d, where d is the dimension of the vector.

Introduction to eigenvector localization
Let us now consider the case of perturbed random matrices, i.e. matrices that contain a random part plus a perturbation that partially breaks the randomness in some direction. The spiked model of Johnstone provides a simple example, where a finite-rank perturbation is added to a large random matrix (6). If the perturbation is below a certain critical value, the largest eigenvalue follows the Tracy-Widom distribution of the unperturbed matrix. However, if the perturbation is larger than the critical value, the largest eigenvalue separates from the bulk MP distribution and presents Gaussian fluctuations (7). This is the so-called BBP (Baik-Ben Arous-Péché) phase transition; above the transition, the leading eigenvector becomes localized along the direction of the perturbation. The delocalized-localized phase transition is an example of the famous Anderson localization phase transition first observed in quantum disordered systems (10). It has been observed that, in the delocalized phase, the eigenvalue statistics are governed by RMT.
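The BBP transition can be illustrated with a rank-one spiked covariance model (a sketch; the spike strengths and matrix sizes are arbitrary choices, not from the manuscript). Below the critical spike strength the top eigenvalue stays at the MP edge and the top eigenvector carries no information about the spike; above it, the eigenvalue separates and the eigenvector aligns with the perturbation direction.

```python
import numpy as np

rng = np.random.default_rng(3)
N, P = 2000, 500
gamma = P / N                         # aspect ratio; BBP threshold is sqrt(gamma)
edge = (1 + np.sqrt(gamma)) ** 2      # MP upper edge for pure noise

def top_eig(spike):
    """Largest covariance eigenvalue and its overlap with the spike direction."""
    u = np.ones(P) / np.sqrt(P)       # direction of the rank-one perturbation
    # rows ~ N(0, I + spike * u u^T)
    X = rng.standard_normal((N, P)) + np.sqrt(spike) * rng.standard_normal((N, 1)) * u
    W = X.T @ X / N
    vals, vecs = np.linalg.eigh(W)
    return vals[-1], float(np.abs(vecs[:, -1] @ u))

weak_val, weak_overlap = top_eig(0.1)      # below the threshold sqrt(0.25) = 0.5
strong_val, strong_overlap = top_eig(4.0)  # well above the threshold

print(weak_val, strong_val, edge)
```

The overlap printed for the strong spike is close to 1, which is exactly the localization along the perturbation described above.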
Interestingly, the distribution of components of delocalized eigenvectors can be easily estimated by considering them as vectors on a unit sphere of dimension N-1. The distribution of a component v_a is then proportional to (1 - v_a²)^((N-3)/2), which for large N approximates a Gaussian distribution with mean zero and variance 1/N (see Figure 2B). This can be made more precise in terms of information theory, by noticing that delocalized eigenvectors do not carry any information, while localized vectors are correlated with the original perturbation. This insight proves very useful when attempting to distinguish between noise and signal in any complex system, biological systems in particular. The noise in the data corresponds to the part of the spectrum that can be described in terms of RMT.
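The 1/N variance of the components can be checked directly by sampling a random point on the unit sphere (a minimal sketch; the dimension is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
N = 2000
# A delocalized eigenvector is statistically a uniformly random point on the
# unit sphere S^(N-1): normalize a standard Gaussian vector.
v = rng.standard_normal(N)
v /= np.linalg.norm(v)

# Each component follows ~ (1 - x^2)^((N-3)/2), which for large N
# approaches a Gaussian with mean 0 and variance 1/N.
print(v.var(), 1 / N)
```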
In Supplementary Figures 5A and 6A we evaluate the localization of eigenvectors using the inverse participation ratio (IPR), defined as the sum of the fourth powers of the eigenvector components. The inverse of the IPR quantifies the number of eigenvector components that contribute significantly. The results are equivalent to those obtained with the other statistical tests.
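A minimal implementation of the IPR and its inverse (the effective number of contributing components); the function names are ours, chosen for illustration:

```python
import numpy as np

def ipr(v):
    """Inverse participation ratio: sum_a v_a^4 for a unit-norm vector v."""
    return float(np.sum(np.asarray(v) ** 4))

def participation_number(v):
    """1/IPR: effective number of components carrying the vector's weight."""
    return 1.0 / ipr(v)

N = 1000
delocalized = np.ones(N) / np.sqrt(N)   # weight spread over all components
localized = np.zeros(N)
localized[0] = 1.0                      # weight on a single component

print(participation_number(delocalized), participation_number(localized))
```

For a fully delocalized unit vector the participation number equals the dimension N; for a vector concentrated on one component it equals 1.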

Sparse Random Matrix ensembles and sparsity induced localization
Sparse Random Matrix Theory (sRMT) ensembles are a class of random matrices in which only a fraction p of the elements is non-zero. In contrast to the RMT ensembles presented in the first section of the Methods, sRMT ensembles exhibit localized eigenvectors, a phenomenon that we call here sparsity-induced localization.
The universality properties of covariance sRMT ensembles have been studied in (12), (13) and (14). The local statistics of the eigenvalues preserve the bulk and largest-eigenvalue universalities. The main differences with non-sparse RMT are in the global statistics (15) and in the presence of localized eigenvectors (16,17).
Regarding the global statistics, there can be significant deviations from the original MP and Tracy-Widom distributions, depending on the fraction of non-zero values p (Figures 2A, 2C). In these figures we use sparse ensembles generated from a mixture of a Dirac delta distribution centered at zero with a Gaussian or a Poisson distribution, as well as the randomized dataset of (18).
The presence of localized eigenvectors is a very important feature. The bottom panel of Figure 2C shows the correlation between the MP deviations and the presence of localized eigenvectors, using the Gaussianity test discussed above. In Supplementary Figures 4A and 4B, we evaluate the localization of eigenvectors using two other tests, Shannon entropy and the IPR, described previously. The three tests show sparsity-induced localization. This sparsity-induced localization is not associated with any biologically relevant information.
Sparsity-induced localization can introduce artifacts such as outliers in PCA and artifactual elbow plots (Figures 2D, 2E and Supplementary Figure 4C). We have also performed a comparison between the sparse dataset and the dataset after removing sparsity in Figures 2C, 2D and 2E, applied to the randomized dataset (18). Colors distinguish between sparse and clean data. The same comparison has been performed for the original datasets (18) (Supplementary Figure 5) and (19) (Supplementary Figure 6).

Algorithm description for denoising of single-cell data
We outline the three major steps of the single-cell denoising algorithm using the example of the PBMC dataset by Kang et al. (18), illustrated in Supplementary Figure 13.

• Preprocessing
The goal of preprocessing is to remove genes that create artifacts due to the sparse nature of the data. Gene expression values for each cell were divided by the total number of transcripts and multiplied by 10^6, and these values were then log2 transformed. Next, the single-cell data matrix X is Z-score normalized, such that every gene has mean 0 and standard deviation 1. A randomized matrix is obtained via random permutation of the cells for every gene independently, to destroy potential correlations. We project the expression of each gene onto the eigenvector basis of the randomized matrix. To assess the normality of these projections, we evaluated several related methods: the Kolmogorov-Smirnov, Anderson-Darling, and Shapiro-Wilk tests, all providing similar results. In this manuscript, we use the Shapiro-Wilk statistic, comparing each gene to sparse genes expressing fewer than a certain number of transcripts (7 by default) (see Supplementary Figure 13(a)). Genes with a Shapiro-Wilk statistic higher than the minimum statistic of the sparse genes with fewer than 7 transcripts are considered abnormal and are removed from further analysis. Alternatively, as the p-value is a monotonic function of the Shapiro-Wilk statistic, one can impose an equivalent cut-off on the p-value, correcting for multiple hypotheses.
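The normalization and randomization steps above can be sketched as follows. This is our own illustrative implementation, not the released pipeline; the pseudocount in the log transform and the function name are assumptions.

```python
import numpy as np

def preprocess(counts, rng=None):
    """Sketch of the preprocessing steps described above.
    counts: (n_cells, n_genes) raw transcript counts; assumes every cell
    has at least one transcript."""
    if rng is None:
        rng = np.random.default_rng(0)
    # 1) library-size normalization: counts per million
    cpm = counts / counts.sum(axis=1, keepdims=True) * 1e6
    # 2) log2 transform (pseudocount of 1 to keep zeros finite -- an assumption)
    logged = np.log2(1.0 + cpm)
    # 3) Z-score each gene: mean 0, standard deviation 1
    z = (logged - logged.mean(axis=0)) / (logged.std(axis=0) + 1e-12)
    # 4) randomized matrix: permute cells independently for every gene,
    #    destroying gene-gene correlations while preserving marginals
    randomized = np.column_stack(
        [rng.permutation(z[:, j]) for j in range(z.shape[1])]
    )
    return z, randomized
```

The randomized matrix has exactly the same per-gene values as the normalized matrix, just shuffled across cells, which is what makes it a null model for the covariance spectrum.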

Datasets
In Figures 2C and 4 we use the Kang et al. (18) dataset GSE96583 with their labels, and a second set of labels (those referred to as control by the authors) from Butler et al. (20). For Figure 5, we used the Zeisel et al. dataset GSE60361 with the annotation from (19).
For Figure 1C, we are using the following datasets:
- Count matrix for the SMART-Seq2 data set (21), obtained under GEO accession number GSE81682.
- Count matrices for the CelSeq and CelSeq2 data sets, obtained from accession numbers GSE81076 and GSE86469, respectively.
- Count matrix for the Fluidigm C1 technology, obtained from accession number GSE86469.
- Hi-C data, obtained from accession number GSE84290.
- ATAC-seq data, obtained under accession number GSE65360.
- Raw Nuc-seq data, obtained from the Gene Expression Omnibus under accession number GSE84371.
- Bulk RNAseq GBM data, obtained from the TCGA Firehose portal.

Wishart matrix statistics simulations
The single-cell RNAseq simulated dataset (Figure 3A) was generated using the Splatter R package. A mean expression level for each gene is simulated using a gamma distribution, and a negative binomial distribution with a fixed dispersion parameter is then used to generate a count for each cell based on these means.
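A toy version of this gamma/negative-binomial scheme (a sketch in NumPy, not Splatter itself; the shape, scale and dispersion values are arbitrary choices):

```python
import numpy as np

def simulate_counts(n_cells, n_genes, shape=0.6, scale=2.0,
                    dispersion=0.1, rng=None):
    """Toy gamma/negative-binomial count simulator, in the spirit of Splatter."""
    if rng is None:
        rng = np.random.default_rng(0)
    # gene mean expression levels drawn from a gamma distribution
    means = rng.gamma(shape, scale, size=n_genes)
    # negative binomial with mean mu and variance mu + dispersion * mu^2:
    # parameterized via n = 1/dispersion, p = n / (n + mu)
    n = 1.0 / dispersion
    p = n / (n + means)
    counts = rng.negative_binomial(n, p, size=(n_cells, n_genes))
    return means, counts
```

With this parameterization the empirical gene means converge to the sampled gamma means as the number of cells grows.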

Comparison to other techniques
When comparing with other methods, we normalize the data using a log2(1 + x) transformation of the library-size-normalized expression values for ZIFA (22), scImpute (23) and ZINB-WaVE (24). The silhouette score of a cell is s = (b - a)/max(a, b), where a is the mean distance between the cell and all the other cells of the same class (the class is defined by the phenotype labels provided in (18)(19)(20)) and b is the mean distance between the cell and all the cells in the next nearest cluster.
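The mean silhouette score used in the comparison can be written out directly (a minimal NumPy sketch of the standard definition; it assumes every class has at least two members):

```python
import numpy as np

def mean_silhouette(X, labels):
    """Mean silhouette score: s = (b - a) / max(a, b) per cell, averaged."""
    X, labels = np.asarray(X), np.asarray(labels)
    n = len(X)
    # pairwise Euclidean distances between all cells
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    scores = []
    for i in range(n):
        same = labels == labels[i]
        # a: mean intra-class distance (excluding the cell itself)
        a = D[i, same & (np.arange(n) != i)].mean()
        # b: mean distance to the nearest other class
        b = min(D[i, labels == c].mean()
                for c in set(labels.tolist()) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))
```

Well-separated clusters give scores close to 1; overlapping or noisy clusters push the score toward 0 or below.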
For Supplementary Figures 7, 8 and 9, we are using t-SNE representation on top of a PCA reduction, where we have selected the optimal number of principal components according to Figures 3G, 3H, 3I, 3J.

Clustering
The hierarchical Ward clustering method was used in Figures 4D and 5D.
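This clustering step can be reproduced with SciPy's hierarchical clustering (a sketch on toy data; the blob parameters and cluster count are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(7)
# two well-separated 2D blobs standing in for cell populations
X = np.vstack([rng.normal(0, 0.2, (30, 2)), rng.normal(5, 0.2, (30, 2))])

Z = linkage(X, method="ward")                      # Ward hierarchical clustering
labels = fcluster(Z, t=2, criterion="maxclust")    # cut the tree into 2 clusters
print(labels)
```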

t-SNE representation
The t-SNE representations were obtained using the default parameters: learning rate = 1000, perplexity = 30 and early exaggeration = 12.
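With scikit-learn, these settings correspond to the following call (a sketch on random data; the input sizes are arbitrary):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(8)
X = rng.standard_normal((100, 10))   # toy input: 100 cells in a 10-D reduced space

embedding = TSNE(
    n_components=2,
    learning_rate=1000,
    perplexity=30,
    early_exaggeration=12,
    init="random",
    random_state=0,
).fit_transform(X)
print(embedding.shape)
```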

Performance
The most computationally intensive part of our approach is the identification of the full set of eigenvectors and eigenvalues. We compared different off-the-shelf SVD approaches (28).
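When only the leading components are needed, a truncated solver is much cheaper than a full decomposition; the sketch below contrasts the two (illustrative sizes, not a reproduction of the benchmark referenced above):

```python
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.default_rng(9)
X = rng.standard_normal((2000, 500))

# full decomposition: all min(N, P) singular triplets
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# truncated decomposition: only the top k components, cheaper when k << min(N, P)
k = 20
Uk, sk, Vtk = svds(X, k=k)   # svds returns singular values in ascending order

print(np.allclose(np.sort(sk), np.sort(s)[-k:]))
```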

Differential expression analysis of PBMC dataset
In Figure 4D

Supplementary Figure 4. A) Calculation of the Shannon entropy for the randomized Kang et al. (18) dataset. This is another way of expressing the same phenomenon as in Figure 2C: when the system is sparse (blue), there are eigenvectors whose entropy decreases, a sign of information contained in these eigenvectors. B) Calculation of the inverse participation ratio (IPR) for the randomized Kang et al. (18) dataset. The participation ratio indicates the number of cell covariates taking part in each eigenvector; when sparsity is non-negligible in the random system (blue), there are eigenvectors with fewer contributing cell covariates. C) t-SNE representation of the randomized Kang et al. (18) dataset, the non-linear counterpart of Figure 1D. Sparsity creates privileged directions in the space, as can be seen in the right panel.

Supplementary Figure 5. A) Localization properties of the eigenvectors in a single-cell dataset of PBMC cells (18). The blue line represents the system dominated by sparsity and the red line corresponds to the system after removing sparsity. B) Calculation of the Shannon entropy for the eigenvectors of the PBMC (18) dataset. The blue (red) line corresponds to the system before (after) cleaning the sparsity. For completeness, a comparison with the non-sparse randomized dataset (green line) is plotted. C) Calculation of the inverse participation ratio (IPR) for the same dataset.

Supplementary Figure 6. A) Localization properties of the eigenvectors in a single-cell dataset of mouse cortex cells (19). The blue line represents the system dominated by sparsity and the red line corresponds to the system after removing sparsity. The localization in the smallest eigenvector is due to the presence of low-expressed cells (these cells could be eliminated from the analysis). B) Calculation of the Shannon entropy for the eigenvectors of the mouse cortex (19) dataset. The blue (red) line corresponds to the system before (after) cleaning the sparsity. For completeness, a comparison with the non-sparse randomized dataset (green line) is plotted. C) Calculation of the inverse participation ratio (IPR) for the same dataset.

Supplementary Figure 7. Comparison of the t-SNE representation for the different methods and algorithms used in Figure 3I. This case corresponds to 15 different mouse cortex cell-phenotypes described in (19).

Supplementary Figure 8. Comparison of the t-SNE representation for the different public algorithms used in Figure 3H. This case corresponds to 7 different mouse cortex cell-phenotypes described in (19).

Supplementary Figure 9. Comparison of the t-SNE representation for the different public algorithms used in Figure 3G. This case corresponds to 13 different PBMC cell-phenotypes sequenced in (18) and described in (20).

Supplementary Figure 16. Comparison of dendritic cell populations found using RMT with the populations found in (29). On the left panel, we show RMT cell-population clusters by color; the dashed rectangle focuses on the three dendritic cell populations. On the right panels, we plot the dendritic cell marker genes identified in Villani et al. (29). Based on this comparison, the DC1 population from Villani et al. (29) corresponds to RMT Dendritic 2, whereas DC3 and DC4 are grouped in Dendritic 1. Finally, DC5/6 correspond to RMT pDC cells.