## Abstract

Single-cell RNA sequencing (scRNA-seq) can determine gene expression in numerous individual cells simultaneously, promoting progress in the biomedical sciences. However, scRNA-seq data are high-dimensional with substantial technical noise, including dropouts. During analysis of scRNA-seq data, such noise engenders a statistical problem known as the curse of dimensionality (COD). Based on high-dimensional statistics, we herein formulate a noise reduction method, RECODE (resolution of the curse of dimensionality), for high-dimensional data with random sampling noise. We show that RECODE consistently eliminates COD in relevant scRNA-seq data with unique molecular identifiers. RECODE does not involve dimension reduction and recovers expression values for all genes, including lowly expressed genes, realizing precise delineation of cell-fate transitions and identification of rare cells with all gene information. Compared to other representative imputation methods, RECODE employs different principles and exhibits superior overall performance in cell-clustering and single-cell level analysis. The RECODE algorithm is parameter-free, data-driven, deterministic, and high-speed, and notably, its applicability can be predicted based on the variance normalization performance. We propose RECODE as a general strategy for preprocessing noisy high-dimensional data.

## Introduction

Single-cell RNA sequencing (scRNA-seq) enables the determination of gene expression profiles in multiple individual cells simultaneously (Tang *et al*., 2009). When used together with recently developed microfluidics platforms and cell-index strategies, scRNA-seq permits the analysis of gene expression in thousands of single cells in parallel, accelerating progress in the biomedical sciences (Cao *et al*., 2020; Regev *et al*., 2017). On the other hand, due to technical limitations, scRNA-seq detects only a fraction of the transcriptome in single cells (~1–60%; typically <10% on average), and there are large variations in the detection level of each transcript by scRNA-seq (Grun *et al*., 2014; Kiselev *et al*., 2019). Thus, unlike conventional bulk RNA-sequencing, scRNA-seq, which is generally used to process numerous cells on automated platforms, provides a sparse representation of the true transcriptome of single cells, with detection failures (dropouts) and variations occurring randomly for most genes, particularly genes with low expression levels (Lähnemann *et al*., 2020). These drawbacks, collectively regarded as non-biological technical noise, pose a key challenge in scRNA-seq data analysis and interpretation.

To circumvent these drawbacks, preprocessing of scRNA-seq data, such as dimension reduction and normalization, is widely used (Kiselev *et al*., 2019; Stegle *et al*., 2015). However, such preprocessing does not provide a fundamental solution, because it compresses the original data without separating the true information from the noise information, and therefore cannot recover the true expression values. With a focus on mitigating the dropout effects and data sparsity and based on models for transcript/noise distributions as well as on the information from similar cells (“nearest neighbors”), many inventive methods for modifying scRNA-seq data have been proposed (Bonnefoy *et al*., 2018; Eraslan *et al*., 2019; Li and Li, 2018; Peng *et al*., 2019; van Dijk *et al*., 2018; Wagner *et al*., 2019; Wang *et al*., 2019; Zappia *et al*., 2018). These “imputation” methods are classified into several categories, including model-based imputation, data smoothing, and data reconstruction, and appear to be successful in recovering dropped-out gene information (Lähnemann *et al*., 2020). However, compared to no-imputation controls, most of these methods fail to substantially improve performance in downstream data analyses, such as clustering analyses and dimension reduction mappings; they also introduce “circularity”, thereby generating false positives and decreasing the reproducibility of specific gene expressions (Andrews and Hemberg, 2018; Hou *et al*., 2020). Thus, imputations need to be used with appropriate caution and require further improvements.

It is fundamental to note that scRNA-seq data are high-dimensional data (the dimension corresponds to the number of genes, i.e., > ~10,000), with each feature (each gene-expression level) bearing technical noise (Grun *et al*., 2014). High-dimensional statistics theories demonstrate that such data, even when the noises are small, suffer from the “curse of dimensionality” (COD), which causes detrimental effects in downstream data analyses (Hall *et al*., 2005). Specifically, COD causes impairments of close distances in true data structures, inconsistency of statistics such as contribution rates of principal components (PCs), and inconsistency of PCs, among other deleterious influences (Aoshima *et al*., 2018; Yata and Aoshima, 2009) (see **Results** and the Appendix Supplementary text for details; COD in this context is distinct from that in informatics, which refers to an exponential increase of computational complexity associated with a rise in data dimension). However, despite its significance, COD has not been explicitly addressed in the context of scRNA-seq data analysis, including imputations.

Here, based on high-dimensional statistics (Yata and Aoshima, 2010; 2012), we formulate a noise reduction method, RECODE (resolution of the curse of dimensionality), which resolves COD in scRNA-seq data. RECODE is tailored to the scRNA-seq data with unique molecular identifiers (UMIs) because its mathematical formulation relies on the theory of random samplings, which are involved as noise in the copying and sequencing steps for generating scRNA-seq data with UMIs. We show that RECODE consistently eliminates COD in the relevant scRNA-seq data. Significantly, RECODE does not involve dimension reduction, such as selection of highly variable genes and major principal components, for the downstream data analysis, but recovers expression values, even for lowly expressed genes, and thus enables the use of all the gene information and the distinguishing of close cell types/transient cell populations masked by COD. RECODE outperforms representative imputation methods, not only in the cluster level analysis, but also in the single-cell level analysis (e.g., the identification of rare cell types). The algorithm of RECODE is parameter-free, data-driven, deterministic, and high-speed, making the method practical, and notably, the applicability of RECODE is predictable. We propose the use of RECODE as a general strategy for preprocessing noisy high-dimensional data, including scRNA-seq data.

## Results

### Curse of dimensionality (COD)

The noise (technical noise) of scRNA-seq data arises from variations of the copying and sequencing errors during data creation and differs from so-called biological noise, such as transcriptional stochasticity and biological variations. Since true expression values do not contain technical noise, we define the noise of scRNA-seq data as the difference between the observed UMI and true RNA counts (**Materials and Methods**). A typical cell expresses more than 10,000 genes, and the noise arises in all expressed genes. Accordingly, the accumulation of such noise causes severe problems in downstream data analyses, so-called COD. In this study, therefore, we attempt to resolve COD in scRNA-seq data analysis.
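This definition of technical noise as a random-sampling effect can be illustrated with a minimal simulation. The sketch below assumes (purely for illustration) Poisson-distributed true counts and a 5% detection rate modeled as a binomial draw per transcript; none of these numbers come from the paper's datasets:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true transcript counts for one cell (illustrative, not from
# the paper's data): most genes lowly expressed.
true_counts = rng.poisson(lam=2.0, size=10_000)

# Model the copying/sequencing steps as a random draw of each transcript
# with a small detection probability (5% here, within the ~1-60% range).
detection_rate = 0.05
observed_umi = rng.binomial(true_counts, detection_rate)

# Technical noise, per the definition above: the difference between the
# (rescaled) observed UMI counts and the true RNA counts.
noise = observed_umi / detection_rate - true_counts

# Lowly expressed genes frequently drop out entirely.
dropout_fraction = np.mean((observed_umi == 0) & (true_counts > 0))
print(f"dropout fraction among expressed genes: {dropout_fraction:.2f}")
```

With a low detection rate, the vast majority of expressed genes can drop out in any given cell, even though the underlying expression is nonzero.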

First, we demonstrate COD using scRNA-seq simulation data with 1,000 cells and variable dimensions (200–20,000) based on the Splatter algorithm (Zappia *et al*., 2017) (Figs 1A and EV1A; **Materials and Methods**). Because the Euclidean distance contains a summation of the squared components, the distance errors of observed values grow with the dimension. Eventually, the distance errors obscure the differences among neighboring samples. For example, the higher the dimension, the longer the “legs”, i.e., distances among neighboring cells/clusters, of the dendrogram of unsupervised hierarchical clustering (UHC) become, leading to impaired clustering (Fig 1B). Similar problems occur even for other metrics, such as correlation distance (Section 1.1 in the Supplementary text) (Hall *et al*., 2005). Thus, conventional data analysis methods based on distances fail to identify true data structures in high-dimensional data with noise. We call this type of COD the *loss of closeness* (COD1); it makes the detailed classification of high-dimensional data impossible.
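The loss of closeness can be reproduced with a toy calculation in which two cells share an identical true profile and differ only in i.i.d. noise (Gaussian noise and the scale 0.5 are illustrative stand-ins for sampling noise, not the paper's simulation settings):

```python
import numpy as np

rng = np.random.default_rng(1)

# Two cells with identical true expression profiles; only the technical
# noise differs, so their true distance is exactly zero.
dists = {}
for dim in (200, 2_000, 20_000):
    cell_a = rng.normal(scale=0.5, size=dim)
    cell_b = rng.normal(scale=0.5, size=dim)
    # Noise alone inflates the Euclidean distance roughly like
    # sigma * sqrt(2 * dim), obscuring the closeness of the cells (COD1).
    dists[dim] = np.linalg.norm(cell_a - cell_b)
    print(f"dim={dim:>6}: observed distance = {dists[dim]:.1f}")
```

The observed distance grows with the square root of the dimension despite the true distance being zero, which is exactly why the dendrogram legs lengthen as more genes are included.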

The noise accumulation also causes adverse effects on data statistics, such as the contribution rate in principal component analysis (PCA) and the Silhouette score (Figs 1C–D). In particular, it has been proven that data variances of PCA-transformed data (eigenvalues of the data covariance matrix) may not converge to true variances for high-dimensional data with noise and low sample sizes (Section 1.2 in the Supplementary text) (Yata and Aoshima, 2009). We call this COD the *inconsistency of statistics* (COD2), and it leads to false statistical inferences.
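The inconsistency of sample PC variances is easy to reproduce numerically: for pure unit-variance noise with few samples and many features, the top sample eigenvalue lands far above the true value of one. A minimal sketch (the sizes n = 50 and d = 5,000 are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# Pure noise: n cells, d genes, every entry i.i.d. with variance 1, so the
# true covariance is the identity and every true PC variance equals 1.
n, d = 50, 5_000
X = rng.normal(size=(n, d))
X = X - X.mean(axis=0)

# The nonzero eigenvalues of the d x d sample covariance equal those of the
# small n x n Gram matrix, which keeps the computation cheap.
gram = X @ X.T / (n - 1)
pc_var = np.linalg.eigvalsh(gram)  # ascending order

# The top sample PC variance lands far above the true value of 1 -- roughly
# (1 + sqrt(d/n))**2 under random-matrix asymptotics -- illustrating COD2.
print(f"top sample PC variance: {pc_var[-1]:.1f} (true value: 1)")
```

In this regime the largest sample eigenvalue is inflated by two orders of magnitude, so contribution rates computed from it are badly biased.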

When there is considerable variation in the noise scale of each feature, as in the case of scRNA-seq data, another type of COD that induces false PCA structures may occur (Fig 1E). PCA is robust to noise with small variances, even for high-dimensional data (Section 3 in the Supplementary text). However, in the case of random sampling with a low detection rate, the variances of some noise components become large, and the PCA structures break down (Yata and Aoshima, 2009). For example, the PCA structures change with the dimension, and the principal components are affected by non-biological information such as the sequencing depth (the total UMI counts per cell) or the number of detected genes (Fig 1E, Section 4 in the Supplementary text) (Kiselev *et al*., 2019). We call this type of COD the *inconsistency of principal components* (COD3).

We also note that CODs adversely affect almost all high-dimensional data analysis, including nonlinear dimension reduction methods, such as uniform manifold approximation and projection (UMAP) and t-stochastic neighbor embedding (t-SNE) (Figs EV1B–C).

### CODs in scRNA-seq data analysis

We show CODs in real scRNA-seq data using a public dataset for peripheral blood mononuclear cells generated by Cellular Indexing of Transcriptomes and Epitopes by Sequencing (CITE-seq) on the 10X platform (*human PBMC CITE-seq data*, 10X Genomics demo data: 10k PBMCs from a Healthy Donor - Gene Expression with a Panel of TotalSeq™-B Antibodies). We create clusters using the protein expression data of 12 cell surface markers (Appendix Fig S1) and use them as the ground-truth clusters. We vary the dimension (200 to 33,254 (all) genes) by selecting highly variable genes in order (see the **Materials and Methods** for details) and demonstrate the CODs on this scRNA-seq data by changing the dimension.

In the UHC, the legs of the dendrogram become longer as the dimension increases due to COD1 (Fig 2A), leading to the loss of detailed hierarchies. As a result, we cannot identify clusters that consist of a small number of cells, and thus overlook rare cell types. The contribution rates in PCA and the Silhouette scores become worse (lower rates and lower scores) due to COD2 (Figs 2B–C). Accordingly, we may miss some critical information in the statistical analysis. In the PCA dimension reductions, the major axes (first and second principal components) change as the dimension increases due to COD3 (Fig 2D). In particular, we often observe that clusters grow in one direction, simply following the total UMI counts, in scRNA-seq data analysis (Fig 2D bottom); such unidirectional growth arises due to COD3, implying that COD3 affects even basic PCA processing.

Moreover, CODs have adverse effects on almost all downstream data analyses. For example, UMAP and t-SNE fail to function as the dimension increases (Figs 2E–F). We also obtain similar results for other 10X Chromium and Drop-seq data (Fig EV2 and Appendix Figs S2–S3): 10X 3’scRNA-seq data for five human lung adenocarcinoma cell lines used in a benchmarking experiment (*CellBench data*) (Tian *et al*., 2019), 10X 3’scRNA-seq data for human primordial germ cell-like cell (hPGCLC) specification (*hPGCLC induction data*) (Chen *et al*., 2019), 10X 3’scRNA-seq data for a mixture of human induced pluripotent stem cells (hiPSCs) and hPGCLCs (*hiPSC/hPGCLC mixture data*), and Drop-seq data for a cultured cell line (*Drop-seq data*) (Torre *et al*., 2018) (see the **Materials and Methods** for a detailed description of these datasets). In widely used scRNA-seq data analysis tools, e.g., Seurat (Hao *et al*., 2021; Satija *et al*., 2015) and Scanpy (Wolf *et al*., 2018), highly variable genes (HVGs) and/or major PCs are selected for downstream data analyses. However, we do not know *a priori* the best HVGs to select, and PCA itself is affected by CODs, as we see above.

### Resolution of the curse of dimensionality (RECODE)

To overcome CODs, we herein formulate a novel noise reduction method, RECODE (resolution of the curse of dimensionality). RECODE is tailored to the scRNA-seq data with UMIs because its mathematical formulation relies on the theory of random sampling, which models the copying and sequencing steps for generating scRNA-seq data with UMIs. See the **Materials and Methods** and Sections 3 and 4 in the Supplementary text for the detailed mathematical formulations and theorems for RECODE. RECODE consists of four procedures (Fig 3A). Procedure I normalizes the original data by the *noise-variance-stabilizing normalization* (NVSN) that transforms the original data such that the noise variances of all features are equal to one. Procedure II projects the normalized data into the PCA space. Procedure III modifies the variances of principal components (eigenvalues of the covariance matrix) based on high-dimensional statistics theories (*PC variance modification*) for the major PCs (essential part), and by setting those variances to be zero (*PC variance elimination*) for the others (noise part). Procedure IV maps the variance-modified data into the original space by the inverse transformations of I and II. The PC variance elimination and PC variance modification contribute to the resolution of CODs 1-2, respectively, while the NVSN resolves the COD3 caused by random samplings.
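In outline, the four procedures amount to a few lines of linear algebra. The following is a simplified stand-in, not the paper's exact formulation: the per-gene noise variances and the number of essential PCs are taken as given here (RECODE derives both from the data via the random-sampling model and high-dimensional statistics), and the eigenvalue correction is reduced to subtracting the unit noise floor:

```python
import numpy as np

def recode_sketch(X, noise_var, n_essential):
    """Schematic of RECODE's procedures I-IV (simplified illustration).

    X           : cells x genes matrix
    noise_var   : per-gene noise-variance estimates (assumed given here)
    n_essential : number of essential PCs (assumed given here)
    """
    # I. Noise-variance-stabilizing normalization (NVSN): scale each gene
    #    so that its noise variance becomes one.
    scale = np.sqrt(noise_var)
    Xn = X / scale

    # II. Project the normalized data into the PCA space.
    mean = Xn.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xn - mean, full_matrices=False)
    pc_var = s**2 / (X.shape[0] - 1)

    # III. Modify the PC variances: shrink the essential part toward the
    #      unit noise floor (PC variance modification) and set the
    #      remaining, noise-part variances to zero (PC variance elimination).
    essential = np.arange(pc_var.size) < n_essential
    new_var = np.where(essential, np.maximum(pc_var - 1.0, 0.0), 0.0)
    factor = np.sqrt(new_var / np.where(pc_var > 0, pc_var, 1.0))

    # IV. Map back by the inverse transformations of II and I.
    return ((U * (s * factor)) @ Vt + mean) * scale
```

In RECODE proper, Procedure III applies the eigenvalue correction from the high-dimensional statistics theories cited above rather than this simple subtraction, and the essential/noise split is determined data-drivenly.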

We show the verification of RECODE using the scRNA-seq simulation data in Fig 1. RECODE shortens the legs of the dendrogram in UHC compared to those in the observed data and, accordingly, yields more correct clustering (Fig 3B). Further, RECODE obtains better statistics (i.e., better contribution rates in PCA and better Silhouette scores) (Figs 3C–D) and improves the principal components independently of the total counts (Fig 3E). These results indicate that RECODE successfully resolves CODs 1–3. Moreover, the RECODE-modified data are highly consistent with the reference data (Figs 3F–G and EV3A–B). As a result, downstream data analysis methods, even nonlinear dimension reduction methods, work well even when using all genes (Figs EV3C–D). We note that RECODE apparently overperforms in some outcomes, which might stem from the incompleteness of the current scRNA-seq simulation algorithm; see below for RECODE in real scRNA-seq data analysis.

RECODE has the distinctive characteristics of being parameter-free, data-driven, deterministic (no random effect) and high-speed, making it a practical method (Section 5.2 in the Supplementary text). Notably, the applicability of RECODE to scRNA-seq data is predictable based on the variances after the noise variance-stabilizing normalization (hereinafter NVSN variances) (Fig 4A). We can define three classes of the applicability of RECODE to scRNA-seq data. Class A consists of data with NVSN variances of all genes ≥ 1, and those of many genes = 1; RECODE is applicable to these data with a good noise reduction effect (strongly applicable). Class B consists of data with NVSN variances of all genes ≥ 1 and few/no genes = 1; RECODE is applicable to these data with a limited noise reduction effect (weakly applicable). Class C consists of the data not assigned to Classes A and B; RECODE is inapplicable in these cases. We examined the applicability of the human cell atlas data (*HCA data*) generated by 10X Chromium, Drop-seq, Quartz-seq, Smart-seq2, and Smart-seq3 (Hagemann-Jensen *et al*., 2020; Mereu *et al*., 2020). We classified 10X Chromium, Drop-seq, and Quartz-seq data into Class A and Smart-seq2 and Smart-seq3 into Class B (Fig 4B). We were able to observe the elimination of the overvalued variations caused by noise for strongly applicable data (Figs 4C–D). In addition, we note that all 10X Chromium and Drop-seq data we have examined so far belonged to Class A. In contrast, there is still overvalued variation in non-significant genes in Class B, indicating that such data contain another type of noise different from the random sampling noise.
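The class definitions above translate directly into a simple check on the per-gene NVSN variances. A sketch follows, with illustrative thresholds (`tol` for "approximately 1" and `many_fraction` for "many genes") that are our assumptions, not values from the paper:

```python
import numpy as np

def recode_applicability(nvsn_var, tol=0.05, many_fraction=0.1):
    """Classify RECODE applicability from per-gene NVSN variances.

    Class A: all variances >= 1 and many genes ~ 1  -> strongly applicable
    Class B: all variances >= 1 but few/no genes ~ 1 -> weakly applicable
    Class C: otherwise                               -> inapplicable
    """
    nvsn_var = np.asarray(nvsn_var, dtype=float)
    # Any variance meaningfully below 1 violates the random-sampling model.
    if np.any(nvsn_var < 1 - tol):
        return "C"
    # "Many genes ~ 1" taken as a fraction of genes within tol of 1.
    near_one = np.mean(np.abs(nvsn_var - 1) <= tol)
    return "A" if near_one >= many_fraction else "B"
```

For example, a 10X-like variance profile with most genes at the unit noise floor would return `"A"`, whereas a profile where every gene sits well above 1 (as for Smart-seq2/3 in Fig 4B) would return `"B"`.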

### Resolution of CODs in scRNA-seq data analysis by RECODE

This section verifies the performance of RECODE applied to the human PBMC CITE-seq data shown in the section CODs in scRNA-seq data analysis. We first show that RECODE resolves CODs 1–3. In the UHC (Fig 5A), we observe closer distances (shorter legs) among neighboring cells and detailed hierarchies after RECODE (resolution of COD1). This enables us to classify populations of a few cells. In the statistical analyses of the contribution rates of PCA and the Silhouette scores (Figs 5B–C), we obtain better values that enable us to discuss the statistical significance after RECODE (resolution of COD2). In the PCA projection (Fig 5D), RECODE returns principal components independent of the total number of UMIs (resolution of COD3). In addition to resolving CODs, RECODE can also eliminate overvalued variances caused by noise without losing true signals (Figs 4C–D). Moreover, RECODE can improve the mappings of nonlinear dimension reduction methods, such as UMAP and t-SNE (Fig 5E).

Next, we show that RECODE recovers the true expression values of each gene. In the UMAP plot colored by the expression values of the lowly expressed genes *CD3D* and *CD19* (Fig 5F), we observe scattered blue dots (zero expression values) in the non-preprocessed data, which are regarded as dropouts. In contrast, almost all the blue dots disappear after RECODE, and the relative expression levels of *CD3D* and *CD19* approach those of the protein expressions (Figs 5F–G). Moreover, in the biaxial plots of *CD3D* and *CD19* (Fig 5H), the distribution of the non-preprocessed data is too sparse and is completely disparate from those of fluorescence-activated cell sorting (FACS) (Stoeckius *et al*., 2017) and of the protein expression due to the severe dropout effects. This indicates that the current scRNA-seq data analysis may not capture the true gene-to-gene relations among lowly expressed genes. In contrast, the RECODE-modified data distribution is highly similar to the distributions of FACS and the protein expression. We also obtained similar results for the other 10X Chromium and Drop-seq data (Fig EV5).

From the verification above, we conclude that RECODE can recover true gene expressions by resolving CODs, and therefore, RECODE functions effectively as a noise reduction method for scRNA-seq data.

### Comparison of RECODE with imputation methods

We compare the performance of RECODE with that of representative imputation methods using the human PBMC CITE-seq data. We use six imputation methods that show high performance in previous reports (Andrews and Hemberg, 2018; Hou *et al*., 2020)—namely, SAVER (Huang *et al*., 2018b), scImpute (Li and Li, 2018), MAGIC (van Dijk *et al*., 2018), DrImpute (Gong *et al*., 2018), ENHANCE (Wagner *et al*., 2019), and ALRA (Linderman *et al*., 2022)—which span different imputation categories following the classification in a review (Lähnemann *et al*., 2020): model-based imputation (SAVER and scImpute), data smoothing (DrImpute and MAGIC), and data reconstruction/matrix factorization (ALRA and ENHANCE). We do not use machine learning-based methods because their results are strongly dependent on the training data or hyperparameters.

We first investigate the performance on UMAP without any pre-dimension reduction (e.g., PCA) (Fig 6A). ScImpute and DrImpute failed to map cells in an appropriate manner; e.g., NK cells (colored in gray) were intermingled with other cell types. Further, we found that ENHANCE treated a number of cells as identical data points (43 out of 6,341; ~0.7%), presumably due to its aggregation algorithm, leading to the loss of a fraction of single-cell information. Next, we show the performance on variance correction (Fig 6B). scImpute, DrImpute, and ALRA could not reduce the variances of non-significant genes (e.g., housekeeping genes) appropriately, whereas RECODE, SAVER, MAGIC, and ENHANCE suitably reduced them while preserving those of significant genes. We conclude that the four methods scImpute, DrImpute, ALRA, and ENHANCE are inadequate as single-cell noise reduction methods.

Next, we examine the *CD3D* and *CD19* expression values preprocessed with RECODE, SAVER, and MAGIC. By coloring the expression values on the UMAP plot (Fig 6C), we find that all the methods recover their expression to a level comparable to that detected by CITE-seq protein expressions. Furthermore, the relative expression levels also become significant (Fig 6D). These two results show that there is no significant difference in performance among RECODE, SAVER, and MAGIC in the cluster level analysis. However, in the biaxial plots for *CD3D*/*CD19* (Fig 6E), which can be regarded as a single-cell level analysis, we observe aggregated distributions with MAGIC compared to the protein expression and a sparse distribution at low expression levels with SAVER. In contrast, like the FACS and protein expression plots, the biaxial plot of RECODE shows a continuous distribution and three major populations (*CD3D*^{+}/*CD19*^{−}: T cells; *CD3D*^{−}/*CD19*^{+}: B cells; and *CD3D*^{−}/*CD19*^{−}: the other cell types), indicating that only RECODE recovers true gene expression values for single cells.

To further compare the performance at the single-cell resolution, including the detection of rare cell populations, we designed mixed scRNA-seq data that contain major and rare cell types using CellBench data and hiPSC/hPGCLC mixture data. We generated pseudo-rare cell types with variable numbers of cells (100, 20, 10, 5, 3, and 1 cells) (Figs 7A–B). We applied RECODE, SAVER, and MAGIC to the mixed scRNA-seq data and monitored gene expression values signifying rare cell types (*ZBED2* and *CA9* for CellBench data, and *OTX2* and *ZIC3* for hiPSC/hPGCLC mixture data) (Figs 7C–D). In the cases of SAVER and MAGIC, the expression levels were negatively influenced by those of the majority cell type. In addition, as reported in a previous study (Tian *et al*., 2019) and shown in Fig 6E, SAVER sometimes returns discrete values around low expression levels (arrowheads in Fig 7D). In contrast, RECODE does not lose essential signals even with one cell. Moreover, we observe this trend even in the median expression of signature genes (Figs 7E–F). Consequently, only RECODE can identify true heterogeneous gene expression at the single-cell resolution and detect rare cell types hidden by noise.

In addition, we compared the computational practicalities of RECODE, SAVER, and MAGIC. The run time of RECODE is as fast as that of MAGIC, which is one of the fastest imputation methods (Hou *et al*., 2020), and much faster than that of SAVER (Fig 8A). The high-speed performance of RECODE is achieved by a fast algorithm based on mathematical treatments of eigenvalues (Section 5.1 in the Supplementary text). Furthermore, the scalability of RECODE for the number of cells was the best (RECODE:0.91, SAVER:1.12, MAGIC:1.86, calculated by the linear regression of run-times for 10,000–30,000 cells), indicating that RECODE can be applied to a greater number of cells in the most practical manner. Moreover, the memory usage of RECODE is as low as that of MAGIC (Fig 8B).
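Scaling exponents such as those quoted above can be obtained by fitting a power law to run-time measurements; the following sketch assumes the fit is a linear regression in log-log space (the timing numbers below are invented for illustration, not the measured benchmarks):

```python
import numpy as np

def scaling_exponent(n_cells, run_times):
    """Estimate b in run_time ~ a * n_cells**b by linear regression in
    log-log space."""
    slope, _intercept = np.polyfit(np.log(n_cells), np.log(run_times), 1)
    return slope

# Hypothetical run times (seconds) for 10,000-30,000 cells.
n_cells = np.array([10_000, 15_000, 20_000, 25_000, 30_000])
near_linear_times = 0.005 * n_cells**1.0
print(f"estimated exponent: {scaling_exponent(n_cells, near_linear_times):.2f}")
```

An exponent near 1 means the run time grows roughly linearly with the number of cells, which is the sense in which a value of 0.91 indicates the most favorable scalability of the three methods.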

### Precise delineation of cell-fate transition dynamics and identification of rare cell populations during mouse gastrulation by RECODE

This section explores the performance of RECODE for the delineation of cell-fate specification dynamics and the identification of rare cell populations in a complex dataset for mouse gastrulation from embryonic day (E) 6.5 to E8.5 (*mouse gastrula data*, 10X Chromium, version 1) (Pijuan-Sala *et al*., 2019). This previous study successfully identified many cell types, but some cell populations were misannotated or left unidentified/undescribed. For example, the node, an organizer for left-right axis determination (Lee and Anderson, 2008), was misannotated as the notochord, a distinct rod-like structure formed both anteriorly and posteriorly to the node that plays key patterning roles, such as in the dorsal-ventral axis patterning of the neural tube (Balmer *et al*., 2016). Furthermore, the anterior visceral endoderm (AVE), an essential signaling center for anterior-posterior axis formation (Bardot and Hadjantonakis, 2020), was not annotated/identified.

We compare the performance of RECODE with that of preprocessing in Seurat, a widely used scRNA-seq data analysis platform based on dimension reduction (Hao *et al*., 2021; Satija *et al*., 2015). The preprocessing in Seurat consists of the selection of highly variable genes (2,000 genes) and major principal components based on the jackstraw algorithm for the original data (**Materials and Methods**). We conduct the subsequent downstream analyses under the same conditions.

First, we analyze the mouse gastrula data for E7.5 embryos, in which complex cell-fate specifications/embryonic patterning proceed, including node and notochordal plate formation, using Seurat and RECODE preprocessing. On UMAP plots (Fig 9A), key lineage-marker gene expressions of the Seurat-preprocessed data capture progressive cell-fate transitions from the epiblast (EPI) to mesoderm (Meso) and to extraembryonic mesoderm (Ex.Meso): it appears that *Pou5f1*^{+} EPI goes on to express key mesoderm markers such as *T* and *Hand1* successively, leading to the formation of *Pou5f1*^{−}/*Tal1*^{+} Ex.Meso. However, the *Pou5f1*^{low}/*Sox17*^{+} definitive endoderm (D.Endo) is isolated from both the EPI-Meso-Ex.Meso cluster and the *Ttr*^{+} VE cluster. In contrast, RECODE-preprocessed results delineate continuous cell-fate transitions from *Pou5f1*^{+} EPI to Meso and Ex.Meso, but also to D.Endo through *Pou5f1*^{+}/*T*^{+} cell populations. Furthermore, D.Endo appears to be extending toward a subpopulation of the VE cluster that expresses *Sox17* at a weak level. It has been demonstrated that during gastrulation, D.Endo arises from the *T*^{+} mesendoderm (Bardot and Hadjantonakis, 2020), D.Endo ingresses into the VE layer, and both D.Endo and VE contribute to the gut endoderm (Kwon *et al*., 2008). During this process, D.Endo continues to be *Ttr*^{−}/*Sox17*^{+}, whereas the *Ttr*^{+}/*Sox17*^{−} VE becomes *Ttr*^{+}/*Sox17*^{+} cells (Viotti *et al*., 2014). Thus, these findings indicate that the RECODE-preprocessed data recapitulate continuous cell-fate transition dynamics during mouse gastrulation in a more appropriate manner than the Seurat-preprocessed data. This would be because while the Seurat-preprocessed data lose critical information that reflects precise cell-to-cell relationships during dimension reduction (i.e., HVG and PC selection), the RECODE-preprocessed data recover the expression values of all genes, including lowly expressed ones.

Next, we apply UHC to Seurat- and RECODE-preprocessed data to determine a more precise annotation of cell types (Fig EV5A–C). Under the UHC results and key gene expressions, we first annotate the TE, VE, and EPI lineages (Fig EV5A). Then, we classify the EPI lineage based on the UHC results and the expression of the Meso and D.Endo markers and the genes that are expressed in the node and/or notochordal plate (*T*, *Mesp1*, *Hand1*, *Tal1*, *Sox17*, *Noto*, *Foxj1*, *Pifo*, *Chrd*, and *Cer1*) (Balmer *et al*., 2016; Biben *et al*., 1998; Cruz *et al*., 2010; Hadjantonakis *et al*., 2008; Kinzel *et al*., 2010; Plouhinec *et al*., 2004; Yamanaka *et al*., 2007) (Figs EV5B–C). Accordingly, we identify the node essentially as *Noto*^{+}, *Foxj1*^{+}, *Pifo*^{+}, *Chrd*^{+}, *T*^{+}, *Cer1*^{−} cells and the notochordal plate essentially as *T*^{+}, *Cer1*^{+} cells in both datasets. We note that the numbers of the cells classified into the respective subclusters differ to a certain extent, and more importantly, the hierarchical position of the node differs between the Seurat- and RECODE-preprocessed data (Fig EV5C). Consistent with the UHC result, the node is also isolated on the UMAP plot of the Seurat-preprocessed data (Fig 9B). In contrast, the UMAP plot of the RECODE-preprocessed data shows a continuous transition from the *Pou5f1*^{+}/*T*^{+} EPI cells to the node, notochordal plate, and D.Endo. This is in good agreement with the fact that the node, notochordal plate, and D.Endo are derived differentially from the mesendoderm, i.e., the node originates from the anterior end of the primitive streak, the notochordal plate arises from the anterior midline, while D.Endo forms around flanking areas (Balmer *et al*., 2016). Thus, the RECODE-based analysis successfully captures accurate developmental trajectories, which were overlooked in the Seurat-based analysis.

Next, we show differentially expressed genes (DEGs) between the node and the notochordal plate using the Seurat- and RECODE-preprocessed data (Fig 9C). We obtain a substantially larger number of DEGs from the RECODE-preprocessed data than from the Seurat-preprocessed data for both the node and the notochordal plate (178 and 250 DEGs for RECODE, versus 60 and 96 DEGs for Seurat). While we detect *Cer1* as a notochordal plate-specific gene in both the Seurat- and RECODE-preprocessed data, we identify *Noto*, *Pifo*, and *Foxj1* as statistically significant, node-specific DEGs only in the RECODE-preprocessed data (Fig 9C). Moreover, the Seurat-preprocessed data define *T* and *Foxa2* as DEGs specific to the node and notochordal plate, respectively, whereas it has been shown that they are expressed uniformly in the node and notochordal plate (Balmer *et al*., 2016; Lolas *et al*., 2014), indicating false-positive calls (Fig EV5D).

Next, we analyze the mouse gastrula data for E6.5 embryos. We classify the TE, VE, and EPI lineages by the UHC results and key gene expressions (Fig EV5E). We further classify the VE by the expression of AVE markers (*Dkk1*, *Cer1*, *Otx2*, *Lhx1*, *Hhex*, *Lefty1*) (Bardot and Hadjantonakis, 2020; Takaoka and Hamada, 2012) (Fig 9D). We find that the Seurat-preprocessed data can classify the VE cells into two major clusters (VE and AVE), including one that expresses key AVE markers, but this classification might be uncertain because of the sparse and variable expression values. In contrast, the RECODE-preprocessed result clearly classifies the VE cells. Moreover, the AVE cells can be classified into two cell types, one expressing AVE markers at higher levels (5 cells: AVE1) and the other expressing them sporadically (19 cells: AVE2). In the DEG analysis (Fig 9E), we obtain larger numbers of DEGs both between AVE1 and VE (438 and 328 DEGs) and between AVE2 and VE (257 and 287 DEGs) with the RECODE-preprocessed data compared to the number of DEGs obtained between AVE and VE (80 and 90 DEGs) with the Seurat-preprocessed data. Importantly, two genes that are known to be specific to AVE, *Gsc* (Belo *et al*., 1997; Ding *et al*., 1998) and *Shisa2* (Cheng *et al*., 2019; Furushima *et al*., 2007), are contained only in the DEGs of the AVE1 cells with the RECODE-preprocessed data. These findings indicate that AVE1 represents the genuine AVE and that AVE2 may be cells located around AVE1 with a property akin to AVE, implying that RECODE can produce further biological insights.

## Discussion

We have formulated a noise reduction method, RECODE, which resolves the curse of dimensionality (COD) in scRNA-seq data analysis. RECODE significantly mitigates the CODs caused by the random sampling noise introduced during the generation of scRNA-seq data with UMIs. We have shown that RECODE eliminates this noise and recovers the true expression values, even for lowly expressed genes, enabling the use of the information of all genes in downstream analysis, the discrimination of single cells bearing close gene expression profiles, and the identification of rare cells. Furthermore, we have performed comprehensive comparisons of RECODE with representative imputation methods and a widely used scRNA-seq data analysis platform (Seurat), demonstrating the superiority of RECODE. RECODE is parameter-free, data-driven, deterministic, and high-speed, and importantly, its applicability can be predicted from the variance distributions after noise variance-stabilizing normalization (NVSN).

The successful application of RECODE to scRNA-seq data with UMIs rests on the fact that the noise variances can be appropriately estimated from the theory of random sampling in statistics (**Materials and Methods**). The generation of scRNA-seq data consists of essentially three key steps: copying the original information (reverse transcription of mRNAs), amplifying the copied information (PCR amplification of cDNAs), and sequencing the amplified information (next-generation sequencing of amplified cDNAs) (see Section 4 in the Supplementary text). The copying and sequencing steps are assumed to involve random sampling noise, since these steps randomly pick up mRNAs and cDNAs, respectively, whereas the amplifying step involves a distinct type of noise resulting from PCR amplification, which can vary for each cDNA molecule. The introduction of UMIs in the cDNA synthesis step, prior to cDNA amplification, offsets this amplification noise, allowing the noise associated with scRNA-seq to be approximated as random sampling noise. Indeed, RECODE mitigated random sampling noise in all the scRNA-seq data generated on the 10X Chromium and Drop-seq platforms we evaluated, irrespective of their chemistry versions (Fig 4).
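To make the role of UMIs concrete, the toy simulation below (our illustrative sketch, not the authors' procedure; all quantities are invented for illustration) shows that counting reads after uneven PCR amplification distorts the molecule count, whereas counting unique UMIs recovers the number of originally captured molecules.

```python
import numpy as np

rng = np.random.default_rng(0)

n_captured = 1000                     # cDNA molecules after reverse transcription
umis = np.arange(n_captured)          # one unique UMI per captured molecule

# PCR amplification: each molecule is amplified with variable efficiency,
# so the number of copies per molecule differs widely (amplification noise).
copies = rng.geometric(0.05, size=n_captured)   # >= 1 copy per molecule
reads = np.repeat(umis, copies)                 # sequenced reads carrying UMI tags

read_count = reads.size               # naive read counting: distorted by amplification
umi_count = np.unique(reads).size     # UMI counting: deduplicates amplified copies

print(read_count, umi_count)          # umi_count equals n_captured
```

The sequencing-subsampling step is omitted here for brevity; adding it would turn the UMI counts into a random sample of the captured molecules, i.e., the random sampling noise that RECODE targets.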

On the other hand, scRNA-seq data based on read counts contain mixed noise (random sampling noise and PCR amplification noise), and the extent of noise reduction by RECODE is not as large as that for scRNA-seq data with UMIs (Fig 4). Similarly, bulk RNA-seq or microarray data contain different types of noise, and data merged across different culture environments or developmental time points also include other noise known as batch effects. These data are beyond the scope of the current version of RECODE because their noise cannot be regarded as consisting of random sampling noise alone. To optimize the noise reduction effect of RECODE on such data, it is necessary to appropriately estimate the noise variances arising from the mixture of random sampling, PCR amplification, batch effects, and possibly other factors. Toward this end, we need to examine the manner and mechanism of noise emergence at each step of scRNA-seq data generation more carefully and to construct a mathematical theory modeling this noise-generating process.

Although both RECODE and imputation methods directly modify raw data values, they focus on different aspects of the data and, accordingly, employ distinct principles. Typically, imputation methods are parametric and aim to circumvent the dropout effect and data sparsity. To recover the expression of dropped-out genes based on their expression levels in nearest neighbors, imputation methods require prior data clustering (e.g., k-nearest neighbors), which does not escape COD and therefore potentially leads to incorrect results. Aggregating information from nearest neighbors does not work for rare cells with no biologically appropriate neighbors, leading to the loss of such rare cells (Fig 7). Furthermore, since these methods are not necessarily based on a precise understanding of the noise-generating process, they may improve parts of the original data but exacerbate other parts, causing problems such as “over-imputation” or “cyclicity” (Andrews and Hemberg, 2018; Lähnemann *et al*., 2020). In contrast, RECODE is non-parametric and focuses on removing COD based on theories in high-dimensional statistics. Accordingly, RECODE mitigates data sparsity and simultaneously allows a more quantitative comparison of cells bearing close data structures.

Importantly, RECODE imposes no assumptions on the data type; that is, RECODE is applicable to other sequencing data created by similar platforms. For example, with appropriate preprocessing, spatial gene expression and scATAC-seq data generated with 10X Chromium are also categorized as Class A (strongly applicable) (Sections 5.3 and 5.4 in the Supplementary text). In addition, RECODE can be applied to noisy high-dimensional data in biology and even in other disciplines. We therefore propose RECODE as a general strategy for preprocessing noisy high-dimensional data.

## Author contributions

Conceptualization: Y.I., T.N., T.Y., Y.H., and M.S.; Investigation: Y.I., T.N., E.G.E., M.Y., Y.Ko., Y.Y., and Y.H.; Methodology: Y.I. and T.N.; Formal analysis: Y.I. and T.N.; Writing: Y.I., T.N., Y.H., E.G.E., M.Y., and M.S.; Code implementation: Y.I. and Y.Ka.; Supervision: Y.H. and M.S.

## Competing interests

The authors declare no competing interests.

## Materials and Methods

We propose a noise reduction method for scRNA-seq data represented by UMI counts, called RECODE (resolution of the curse of dimensionality). Using high-dimensional statistical theories, RECODE effectively reduces the noise in data by resolving the curse of dimensionality. The detailed theory is presented in the Supplementary text.

### RECODE

Let *c _{ij}* (*i* = 1,…,*d*; *j* = 1,…,*n*) be the observed UMI count of gene *i* in cell *j*, with a corresponding true (noise-free) expression value. Here, *n* and *d* are the sample size (number of cells) and the dimension (number of genes), respectively. Because the scales of the observed and true counts can vary significantly due to the low detection rate in data sampling, we consider the scaled values (probabilities) *x _{ij}* = *c _{ij}*/*t _{j}*, where *t _{j}* = ∑_{i} *c _{ij}* is the total UMI count in cell *j*; the true scaled value is defined analogously from the true expression values and the total RNA count in cell *j*. We then define the noise *e _{ij}* (*i* = 1,…,*d*; *j* = 1,…,*n*) as the difference between the observed and true scaled values.

Hereafter, we treat the above values as random variables.

To correctly split the essential and noise parts in the singular value decomposition-based transformation in RECODE, we first address COD3 (the inconsistency of principal components) caused by non-uniform noise (procedure I in Fig 3A). Based on the random sampling procedures, we model the observed UMI count data *c _{ij}* with a Poisson sampling model whose parameter involves *t _{j}* (see the Supplementary text).

This model generally coincides with previous studies (Grun *et al*., 2014; Hafemeister and Satija, 2019; Huang *et al*., 2018a). Some recent studies have additionally modeled the parameter of the Poisson distribution with a gamma distribution, so that the resulting mixture follows the negative binomial distribution. Although the negative binomial model may represent scRNA-seq data well, estimating its two independent parameters is difficult and computationally expensive. In contrast, this study does not assume any distribution for the parameter *t _{j}* (allowing arbitrary distributions). That is, this study employs a general model that contains the Poisson, gamma, and negative binomial distributions used in previous studies as special cases.

Without modeling the parameter distribution of *t _{j}*, we directly proved a relationship between the mean and variance statistics of the observed and true scaled values for each gene, Eq. (1) (see Theorem 4.1 in the Supplementary text). Here, E_{i} and Var_{i} denote the expectation (mean) and variance for fixed gene *i*. To see the behavior of the noise variance, let us set *t _{j}* to be constant. Then, we obtain Var_{i}(*e _{ij}*) = E_{i}(*x _{ij}*/*t _{j}*) from (1). Conventional studies based on the negative binomial distribution have assumed that the mean of the scaled data describes the noise variance, as Var_{i}(*e _{ij}*) = *a*E_{i}(*x _{ij}*) + *b*E_{i}(*x _{ij}*)^{2} (*a*, *b*: parameters). In contrast, our observation Var_{i}(*e _{ij}*) = E_{i}(*x _{ij}*/*t _{j}*) indicates that the explanatory variable is E_{i}(*x _{ij}*/*t _{j}*). This description of the noise variance is an essential difference from the conventional studies. Moreover, it allows us to evaluate the noise variance directly from the observed data without any parameters.

Using a function *f _{α}* (*α* ≥ 0) given as *f _{α}*(*x*) = *x*/√*α* for *α* > 0 and *f*_{0}(*x*) = 0, we define the normalized data *z _{ij}* as *z _{ij}* ≔ *f*_{E_{i}(x_{ij}/t_{j})}(*x _{ij}*). The noise of the normalized data is defined analogously by applying the same function to *e _{ij}*. Because this noise has variance one for each expressed gene *i*, we call the transformation *x _{ij}* → *z _{ij}* the *noise variance-stabilizing normalization* (NVSN).

Next, we denoise the normalized UMI counts *z _{ij}* by transforming eigenvalues (procedures II and III in Fig 3A). Let *Z* be the matrix form of the normalized data, and consider the singular value decomposition of the centralized data (each gene centered by its row average) as *U*_{z}Σ_{z}*V*_{z}^{T}, where *U*_{z}, Σ_{z}, and *V*_{z} are a *d* × *d* orthogonal matrix, a *d* × *n* rectangular diagonal matrix, and an *n* × *n* orthogonal matrix, respectively. The diagonal components *λ*_{z,i} (*i* = 1,…,*d*) of the matrix Λ_{z} derived from Σ_{z} are equal to the eigenvalues (PC variances) of PCA. That is, for the covariance matrix *S*_{z}, *λ*_{z,i} satisfies the eigenvalue equation *S*_{z}*u*_{z,i} = *λ*_{z,i}*u*_{z,i}, where *u*_{z,i} is the *i*th column vector of *U*_{z}. Furthermore, under the PCA transformation, the *i*th eigenvalue corresponds to the variance of the *i*th principal component (*i*th PC variance). Because NVSN resolves COD3, the PCA transformation can divide the data into an essential part, spanned by the first *ℓ* principal components, and a noise part, spanned by the others (*ℓ*: threshold). Accordingly, we introduce modified eigenvalues (see the Supplementary text). Here, *d*^{PCA} is the dimension of the PCA-projected data, i.e., *d*^{PCA} = min{*n* − 1, *d*}. The transformation of the essential part (PC variance modification in Fig 3A) adopts Yata and Aoshima’s method, which modifies eigenvalues so that they converge to the true eigenvalues (Yata and Aoshima, 2012). The transformation of the noise part (PC variance elimination in Fig 3A) eliminates the effect of the noise part by setting its eigenvalues to zero. The threshold *ℓ* is optimized by using the fact that all the noise variances of the normalized data equal one.

We then define the denoised data by replacing the eigenvalues with the modified ones while preserving the principal component structure and the center values; the corresponding factor is treated as zero when *λ*_{z,i} = 0. Finally, we obtain the denoised scaled data and the denoised UMI count data by applying the inverse functions of NVSN and of the scaling, respectively.

In the computation, the noise variance E_{i}(*x _{ij}*/*t _{j}*) is evaluated from the sample values, indicated by the star symbol ⋆, and the normalized values are then generated from this estimate.

### Classification of genes

We classify genes into three groups: *significant* genes, whose true expression varies across cells; *non-significant* genes, which are expressed but whose true expression does not vary; and *silent* genes, which are not expressed. From this definition, the significant genes capture cell-specific features, the non-significant genes do not identify cell differences, and the silent genes carry no information. We denote the index sets of significant, non-significant, and silent genes as *I*_{sig}, *I*_{non-sig}, and *I*_{silent}, respectively. Then, from Eq. (1), the variance of the normalized data (hereafter, the NVSN variance) is greater than one for significant genes, equal to one for non-significant genes, and zero for silent genes.

Thus, we can classify genes from their NVSN variances. This classification allows us to determine whether a gene defines a difference among cell populations in scRNA-seq data, and it is also used to judge the applicability of RECODE, as discussed in the next section.
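Under this classification, genes can be labeled directly from their NVSN variances, as in the following sketch (our illustration; the tolerance for "equal to one" is an arbitrary choice for finite samples, not a value prescribed by the method):

```python
import numpy as np

def classify_genes(nvsn_var, tol=0.1):
    """Label genes by NVSN variance (sketch).

    > 1: significant, ~= 1: non-significant, 0: silent.
    `tol` is an illustrative tolerance for sampling fluctuation around one.
    """
    nvsn_var = np.asarray(nvsn_var, dtype=float)
    labels = np.full(nvsn_var.shape, "non-significant", dtype=object)
    labels[nvsn_var > 1 + tol] = "significant"
    labels[nvsn_var == 0] = "silent"
    return labels

print(classify_genes([3.2, 1.02, 0.0]))
```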

### Applicability of RECODE

In principle, RECODE can be used for any scRNA-seq data and reduces the noise induced by random sampling. However, the output data may still retain noise caused by other technical problems, such as amplification. To examine the expected denoising effect, we judge the applicability of RECODE from the NVSN variance distribution.

We can first show that the empirical NVSN variances of expressed genes are bounded below by one under the model (2); we refer to this lower bound as condition (3). According to this property, we classify scRNA-seq data as follows:

Class A (strongly applicable): The NVSN variances satisfy condition (3) and are close to one for most genes *i*.

Class B (weakly applicable): The NVSN variances satisfy condition (3) but take values far greater than one for most genes *i*.

Class C (inapplicable): The NVSN variances do not satisfy condition (3).

Here, we recall that most genes are not directly related to cell identification. In our setting, this fact indicates that the number of genes in *I*_{non-sig} (non-significant genes) is much larger than that in *I*_{sig} (significant genes), leading to the second requirement in Class A. It should also be noted that noise variances generally increase when other noise effects exist in addition to random sampling, such as amplification errors. Such additional noise causes the NVSN variances to take large values away from one, as addressed in Class B. Based on this discussion, the data in Class A match our model, and hence we can expect RECODE to remove noise appropriately. The data in Class B imply that other noise may be mixed in with the random sampling noise, and accordingly, RECODE may achieve only partial noise removal. Finally, since the data in Class C do not follow our model, we cannot expect noise removal.
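The class decision can be sketched as a small rule on the NVSN variances of expressed genes (our illustration; the tolerance and the majority rule standing in for "most genes" are assumptions of this sketch, not thresholds prescribed by the method):

```python
import numpy as np

def recode_class(nvsn_var, tol=0.1):
    """Judge applicability (A/B/C) from NVSN variances of expressed genes."""
    v = np.asarray(nvsn_var, dtype=float)
    if np.any(v < 1 - tol):          # condition (3): variances bounded below by one
        return "C"                   # inapplicable
    near_one = np.mean(np.abs(v - 1) < tol)
    return "A" if near_one > 0.5 else "B"   # most genes close to one -> Class A

print(recode_class([1.0, 1.02, 0.98, 5.0]))   # UMI-like variance distribution
print(recode_class([4.0, 6.0, 3.0, 5.0]))     # variances far above one
```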

### Simulation data creation

We created the simulation data in Figs 1, 3, EV1, and EV3 based on the Splatter algorithm (Zappia *et al*., 2017), with 1,000 cells, 20,000 genes, and 200 differentially expressed genes identifying the four groups. We used the trended cell mean of Splatter as the reference data, where the symbol SPL denotes the Splatter variables. To represent the low detection rate, we created the observed counts by downsampling the reference data with a detection rate of *r* = 0.1. We set the other parameters as follows: *α*^{SPL} = 0.3, *β*^{SPL} = 0.6, *π*^{O, SPL} = 0, *μ*^{O, SPL} = 4, *σ*^{O, SPL} = 0.5, *μ*^{L, SPL} = 11, *σ*^{L, SPL} = 0.2, *ϕ*^{SPL} = 0, and *k*^{SPL} = −1.
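A detection rate of *r* = 0.1 can be emulated by binomial thinning of reference counts, as in the sketch below (our illustration; binomial thinning is an assumption of this sketch, and the matrix sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)

reference = rng.poisson(50, size=(1000, 200))  # toy reference count matrix
r = 0.1                                        # detection rate

# Binomial thinning: each reference molecule is detected with probability r
observed = rng.binomial(reference, r)

print(observed.mean() / reference.mean())      # close to r
```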

### Collection of human iPSCs and PGCLCs

All the experiments on the induction of hPGCLCs from hiPSCs were approved by the Institutional Review Board of Kyoto University and were performed according to the guidelines of the Ministry of Education, Culture, Sports, Science, and Technology (MEXT) of Japan.

The culture of human iPSCs and the induction of PGCLCs were performed as described previously (Kojima *et al*., 2017; Kojima *et al*., 2021; Sasaki *et al*., 2015). The 585B1 BTAG (585B1-868, bearing 2A-tdTomato and 2A-EGFP at BLIMP1 and TFAP2C loci) hiPSCs were maintained in StemFit AK03N medium (Ajinomoto) on cell culture plates coated with iMatrix-511 (Nippi). The medium was changed every other day. For the passaging or the induction of differentiation, the cells were treated with a one-to-one mixture of TrypLE Select (Life Technologies) and 0.5 mM EDTA/PBS to dissociate into single cells, and 10 μM of a ROCK inhibitor (Y-27632; Wako Pure Chemical Industries) was added for 24 h after plating.

For the induction of hPGCLCs, hiPSCs were first plated at a density of 5 × 10^{4} cells/cm^{2} onto a fibronectin (FC010; Millipore)-coated plate. The cells were then cultured in GK15 medium (GMEM with 15% KSR, 0.1 mM NEAA, 2 mM L-glutamine, 1 mM sodium pyruvate, penicillin-streptomycin, and 0.1 mM 2-mercaptoethanol) supplemented with 50 ng/ml activin A (R&D Systems), 3 μM CHIR99021 (TOCRIS), and 10 μM of Y-27632 (Wako Pure Chemical Industries) for 44–48 h. Next, the cells were dissociated into single cells with TrypLE Select and aggregated in a low cell-binding V-bottom 96-well plate (Greiner) at 5,000 cells per well in 100 μl of GK15 medium supplemented with 200 ng/ml BMP4 (R&D Systems), 100 ng/ml SCF (R&D Systems), 50 ng/ml EGF (R&D Systems), 1,000 U/ml LIF (Millipore), and 10 μM of Y-27632 to be induced into hPGCLCs.

For the sample collection from iPSCs, the cells were dissociated in the same manner as described above. The aggregates of the PGCLC induction were collected on the designated day of induction (day 4), washed once in PBS, and dissociated with 0.25% Trypsin–EDTA for 10–15 min at 37°C with gentle pipetting every 5 min. Trypsin was neutralized with a 5× volume of 10% FBS in DMEM. Before the scRNA-seq data analysis, live BTAG double-positive single cells were sorted using a FACS Aria III system (BD Biosciences) by FSC-SSC and EGFP-tdTomato gating with DRAQ7 (ab109202) staining. To make an aliquot of the mixture of iPSCs and day 4 BTAG double-positive cells, equal numbers of cells were mixed prior to the scRNA-seq data analysis.

### 10X scRNA-seq data acquisition

ScRNA-seq libraries of 10X data were generated using the 10X Genomics Chromium Controller (10X Genomics) and Chromium Single Cell 3’ Reagent Kits v3.1 according to the manufacturer’s instructions. Reverse transcription, cDNA amplification, and sample indexing were performed using an Eppendorf Mastercycler. The final libraries were quantified using a KAPA library quantification kit (KK4824), and the fragment size distribution of the libraries was determined using a LabChip GX DNA high sensitivity kit (Perkin Elmer). Pooled libraries were then sequenced using a NovaSeq 6000 Illumina Sequencer with a NovaSeq 6000 S1 Reagent Kit (100 Cycles, 20012865).

### Mapping of scRNA-seq data

For human PBMC CITE-seq data, the filtered count matrix data were downloaded from the 10X Genomics demo data site (see Data availability in the **Materials and Methods** for details).

For the other 10X 3’ scRNA-seq data (the CellBench data by Tian *et al*., the hPGCLC induction data by Chen *et al*., the hiPSC/hPGCLC mixture data generated in this study, and the mouse gastrula data by Pijuan-Sala *et al*.), raw files were processed with CellRanger 6.0.1 using default mapping arguments. Reads were mapped to the human genome (GRCh38.p12) or, for the mouse gastrula data, the mouse genome (GRCm38.p6).

For the Human Cell Atlas project data by Mereu *et al*. and Hagemann-Jensen *et al*., the 10X data were processed with CellRanger 6.0.1 using default mapping arguments, and the Drop-seq, Quartz-seq, Smart-seq2, and Smart-seq3 data were processed with zUMI v2.9.7 (Parekh *et al*., 2018) and STAR v2.7.3 (Dobin *et al*., 2013) according to the original papers (Hagemann-Jensen *et al*., 2020; Mereu *et al*., 2020). Briefly, reads were mapped onto the human genome (GRCh38.p12) alone and onto a custom genome consisting of the human (GRCh38.p12), dog (CanFam3.1), and mouse (GRCm38.p6) genomes, with chromosome and gene names prefixed by “hs_”, “cf_”, and “mm_”, respectively, to identify the organisms.

### Mapping of bulk RNA-seq data

For the hPGCLC and hiPSC data generated by the 10X 3’ scRNA-seq, conversion of the read data into expression levels was performed as described previously (Kojima *et al*., 2017; Nakamura *et al*., 2016; Nakamura *et al*., 2015). The reads were mapped to the human genome (GRCh38.p12). Because 3’ RNA-seq reads only the 3’ ends of transcripts, expression levels were calculated per gene (Entrez genes) rather than per mRNA, using Reads Per Million mapped reads (RPM). The RPM expression data matrix was then transformed to log2(RPM+1) and used for the downstream data analysis.

For the data of the five adenocarcinoma cell lines generated by conventional full-length RNA-seq (the bulk counterparts of CellBench; Holik *et al*., 2017), the reads were mapped to the human genome (GRCh38.p12), and the expression levels were calculated as Fragments Per Kilobase Million mapped reads (FPKM). The FPKM expression data matrix was then transformed to log2(FPKM+0.1) and used for the downstream data analysis.

### ScRNA-seq data quality check and preprocessing

In the quality check of the scRNA-seq data, cells with low nUMIs mapped to genes, low nGenes, or high percentages of UMIs on mitochondrial genes, as well as multiplets judged by Scrublet (Wolock *et al*., 2019), were filtered out. Additionally, for the 10X CITE-seq data of PBMCs, cells with low nUMIs mapped to the protein indices were filtered out. For the Human Cell Atlas data, multiplets between different species were first omitted according to the ratio of UMIs mapped to the human genome. Then, using the data mapped onto the human genome (GRCh38.p12), cells with low nUMIs mapped to genes, low nGenes, or high percentages of UMIs on mitochondrial genes were filtered out. See Appendix Figs S4–S5 for the detailed criteria. For preprocessing of the expression values, each UMI value was divided by the total UMI count of the cell, multiplied by a scale factor (10k), and then log2-transformed after adding a pseudo expression value of 1 to avoid the undefined value log_{2}0 [log2(ss10k+1)].
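The log2(ss10k+1) preprocessing above can be written compactly (a minimal sketch, assuming a cells × genes count matrix):

```python
import numpy as np

def log2_ss10k(counts):
    """Scale each cell to 10,000 total counts, then apply log2(x + 1)."""
    counts = np.asarray(counts, dtype=float)
    ss10k = counts / counts.sum(axis=1, keepdims=True) * 1e4  # per-cell scaling
    return np.log2(ss10k + 1)  # pseudocount of 1 avoids log2(0)

c = np.array([[100, 0, 900], [10, 10, 80]])
out = log2_ss10k(c)
print(out[0])   # a gene with zero counts maps to exactly 0
```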

### Applying RECODE

RECODE was applied to the quality-checked UMI count matrices using Python 3.7.4 (Anaconda). The output matrices were log-normalized [log2(ss10k+1)] for the downstream data analyses.

### Applying imputation methods

The imputation methods were run with default settings according to their respective program manuals. SAVER (Huang *et al*., 2018b), scImpute (Li and Li, 2018), and ENHANCE (Wagner *et al*., 2019) were applied to the UMI count matrices, and their output matrices were log-normalized [log2(ss10k+1)]. In contrast, MAGIC (van Dijk *et al*., 2018), DrImpute (Gong *et al*., 2018), and ALRA (Linderman *et al*., 2022) were applied to the log-normalized UMI count matrices. Negative values in the output matrices were set to 0.

### Analysis of 10X data

The following analyses were performed using Microsoft R version 4.0.2 with the `parallelDist`, `amap`, `Rtsne`, `gplots`, `uwot`, `cluster`, `qvalue`, and `Seurat` packages. UHC was performed using the `hclust` function with Euclidean distances and Ward’s method (ward.D2). t-SNE was performed using the `Rtsne` function with perplexity = 30, and UMAP was performed using the `umap` function with n_neighbors = 10 and min_dist = 0.5, in both cases using the genes with expression (log_{2}(ss10k + 1) > 0) in at least one cell. The cell types of the human PBMC CITE-seq, CellBench, and hiPSC/hPGCLC mixture datasets were defined using external data as described below (Appendix Figs S1–S3). Some of the ENHANCE-treated cells became identical to one another, and those cells were removed ahead of clustering to avoid computation errors. Note that, for the analyses other than Seurat, we performed these multivariate analyses without any other dimension reduction steps, such as selecting HVGs or aggregating gene information by PCA. To identify DEGs, the `wilcox.test` and `kruskal.test` functions, followed by the `qvalue` function, were used to calculate the P-values and false discovery rates (FDRs), respectively.

For the analyses with Seurat, importation of the raw UMI count matrix, log2 normalization [log2(ss10k+1)], scaling, selection of HVGs (2,000 genes), and PCA were performed with the `NormalizeData`, `FindVariableFeatures`, `ScaleData`, and `RunPCA` functions with default settings according to the tutorial on the Seurat website (https://satijalab.org/seurat/articles/pbmc3k_tutorial.html). The significant PCs were selected based on the results of the `jackstraw` function. The following analyses (UHC, UMAP, and t-SNE) were performed with the `parallelDist`, `Rtsne`, and `uwot` packages using the PC scores of the significant PCs suggested by `jackstraw` (mouse gastrula data: E6.5, 26 PCs; E7.5, 31 PCs).

### Identification of cell types in human PBMC CITE-seq, CellBench, and the hiPSC/hPGCLC mixture dataset

For the cell annotations in the human PBMC CITE-seq data, raw UMI counts of protein expression (CD3, CD4, CD8A, CD14, CD15, CD16, CD19, CD25, CD56, CD137, CD45RA, and CD45RO) were processed by dividing each UMI count by the total UMI count of the proteins in the cell (nUMI_protein), multiplying it by the median of nUMI_protein, and then log2-transforming it after adding a pseudo expression value of 1 to avoid the undefined value log_{2}0 [log2(ss.med+1)]. The cell clustering and annotations were determined according to the UHC analysis and the protein expression patterns. The UHC and UMAP were performed with the parameters described above.

For the annotation of the CellBench dataset, the signature genes of each cell type were first calculated using the corresponding bulk RNA-seq data (Holik *et al*., 2017) [174 genes; log2 F.C. (mean of target cells versus mean of the others) > 3 and mean log2(FPKM+0.1) of the target cells > 4]. The cell clustering of the CellBench 10X data was performed according to the UHC analysis using the log2(ss10k+1) values of the 174 signature genes. Finally, the correlations between the cluster means of the 10X CellBench data and the bulk RNA-seq data were computed, and the CellBench cells were annotated according to the results of this correlation analysis.

For the annotation of the hiPSC/hPGCLC mixture data, the signature genes of each cell type were first calculated using the corresponding bulk RNA-seq data (Kojima *et al*., 2017) [442 genes; log2 F.C. (mean of hiPSCs versus mean of hPGCLCs) > 3 and mean log_{2}(RPM + 1) > 4]. The clustering of the 10X scRNA-seq data was performed according to the UHC analysis using the log2(ss10k+1) values of the 442 signature genes. We annotated the clusters expressing *SOX17*, *TFAP2C*, *PRDM1*, and *NANOS3* as hPGCLCs and those expressing *SOX2* and *DNMT3B* as hiPSCs. Clusters expressing both PGCLC and iPSC markers were considered doublets and removed ahead of the following analysis. Note that we assumed that the effect of COD is negligible for such a small number of genes (dimensions).

### In silico dilution analysis

For the assay in Fig 7, 100, 20, 10, 5, 3, or 1 hiPSCs or H2228 cells in the hiPSC/hPGCLC mixture or the CellBench data, respectively, were picked randomly and mixed with the remaining cells (Figs 7A–B). The resulting expression matrices were then processed with RECODE and the other imputation methods in the same manner as described above.

### Evaluation of run-time, memory, and scalability

We used scRNA-seq data with 36,694 cells (the “uclc2” series of GSE140021; Chen *et al*., 2019; see Appendix Table S1 for details) to evaluate the run-time, memory usage, and scalability of RECODE, SAVER, and MAGIC (Fig 8). We created datasets by randomly sampling 200, 500, 1,000, 2,000, 5,000, 10,000, 20,000, and 30,000 cells from the original data. Run-times were measured with the `time.time` function in Python for RECODE and the `tictoc` package in R for SAVER and MAGIC. Memory usage was measured as the maximum memory usage during the job, reported by the Grid Engine system (UNIVA Grid Engine) of our computation server. Scalability was calculated by linear regression of the run-times for 10,000–30,000 cells. The brief specs of the computation node used for this assay are CPU: dual Xeon Gold 6146 (12 cores × 2, 3.2 GHz, without hyperthreading technology) and memory: DDR4-2666, 512 GB.

### Data availability

The accession numbers used in this study are as follows. For the scRNA-seq data: the hiPSC/hPGCLC mixture data generated in this study (GSE175525); the human PBMC CITE-seq data (demo data from 10X Genomics: *10k PBMCs from a Healthy Donor - Gene Expression with a Panel of TotalSeq™-B Antibodies*; https://www.10xgenomics.com/resources/datasets/10-k-pbm-cs-from-a-healthy-donor-gene-expression-and-cell-surface-protein-3-standard-3-0-0); the CellBench data (GSE118767) (Tian *et al*., 2019); the hPGCLC induction data (GSE140021) (Chen *et al*., 2019); the Drop-seq data (GSE99330) (Torre *et al*., 2018); the datasets in the Human Cell Atlas project (10X 3’ scRNA-seq, Drop-seq, Quartz-seq, Smart-seq2, and Smart-seq3; GSE133549, E-MTAB-8735) (Hagemann-Jensen *et al*., 2020; Mereu *et al*., 2020); and the mouse gastrula data (E-MTAB-6967) (Pijuan-Sala *et al*., 2019).

For the external data: the bulk RNA-seq data of the five adenocarcinoma cell lines for CellBench (GSE86337) (Holik *et al*., 2017); the bulk RNA-seq data of the iPSCs and purified hPGCLCs (GSE99350) (Kojima *et al*., 2017); and the smFISH count data, which were obtained from https://www.dropbox.com/sh/g9c84n2torx7nuk/AABZei_vVpcfTUNL7buAp8z-a?dl=0 (Torre *et al*., 2018). See Appendix Table S1 for details.

### Code availability

The python and R codes of RECODE are available at https://github.com/yusuke-imoto-lab/RECODE.

## Extended Data

## Acknowledgments

We are grateful to Dr. Kazuyoshi Yata for the valuable discussions and comments, and to Y. Nagai, N. Konishi, E. Tsutsumi, and M. Kawasaki of the Saitou Laboratory for their technical assistance. We thank the Single-Cell Genome Information Analysis Core (SignAC) in ASHBi for the RNA sequence analysis. This work was sponsored by the World Premier International Research Center Initiative (WPI). This work was also sponsored by a JST PREST Grant (no. JPMJPR2021 to Y.I.), a JSPS Grant-in-Aid for Early-Career Scientists (no. 18K14714 to T.N.), a MEXT Grant-in-Aid for Transformative Research Areas B (no. 20H05761 to T.N.), a JST CREST Mathematics Grant (no. 15656429 to Y.H.), a JST MIRAI Program Grant (no. 22682401 to Y.H.), and a Grant-in-Aid for Specially Promoted Research from JSPS (nos. 17H06098 and 22H04920 to M.S.).

## Appendix

### Supplementary Figures

### Supplementary Tables

**Appendix Table S1.**

A list of the accession numbers or IDs for the data used in this study.

### Supplementary text

A mathematical formulation of RECODE.