Abstract
Principal component analysis (PCA) is an essential method for analyzing single-cell RNA-seq (scRNA-seq) datasets, but for large-scale scRNA-seq datasets, the computation takes a long time and consumes a large amount of memory.
In this work, we review the existing fast and memory-efficient PCA algorithms and implementations and evaluate their practical application to large-scale scRNA-seq datasets. Our benchmark showed that some PCA algorithms based on Krylov subspace and randomized singular value decomposition are faster, more memory-efficient, and more accurate than the other algorithms. Considering the differences in the computational environments of users and developers, we also developed guidelines for selecting appropriate PCA implementations.
Background
Owing to the emergence of single-cell RNA sequencing (scRNA-seq) technologies [1], many types of cellular heterogeneity have been examined. For example, cellular subpopulations that make up tissues [2–6], rare cells and stem cell niches [7], continuous gene expression changes related to the cell cycle [8], spatial coordinates [9–11], and differences in differentiation maturity [12, 13] have been captured by many scRNA-seq studies. Since the measurement of cellular heterogeneity highly depends on the number of cells measured simultaneously, a wide variety of large-scale scRNA-seq technologies have been developed [14], including those using cell sorting devices [15–17], Fluidigm C1 [18–21], droplet-based technologies (Drop-Seq [2–4], in-Drop RNA-Seq [5, 6], 10X Genomics Chromium [22]), and single-cell combinatorial-indexing RNA-sequencing (sci-RNA-seq [23]). Such technologies have encouraged the establishment of several large-scale genomics consortiums such as the Human Cell Atlas [24–26], Mouse Cell Atlas [27], and Tabula Muris [28]. These projects are measuring a tremendous number of cells by scRNA-seq and tackling basic life science problems such as the number of cell types that make up an individual, cell-type-specific marker gene expression and gene functions, and molecular mechanisms of diseases at a single-cell resolution.
Nevertheless, the analysis of scRNA-seq datasets poses a potentially difficult problem; the cell type corresponding to each data point is unknown a priori [1, 29–33]. Accordingly, researchers perform unsupervised machine learning (UML) methods such as dimensional reduction and clustering to reveal the cell type corresponding to each individual data point. In particular, principal component analysis (PCA [34–36]) is a workhorse algorithm for UML across many situations. PCA is widely used for data visualization [37–39], data quality control (QC) [40], feature selection [13,41–47], de-noising [48,49], imputation [50–52], confirmation and removal of batch effects [53–55], confirmation and estimation of cell-cycle effects [56], input of other non-linear dimensional reduction [57–63] and clustering methods [64–67], rare cell type detection [68,69], cell type and cell state similarity search [70], pseudotime coordinate [13, 71–75], and spatial coordinate [9]. A wide variety of data analysis pipelines include PCA as an internal function or utilize principal component (PC) scores as input for the down-stream analyses [22, 76–83].
Despite its wide use, there are several reasons why it is unclear how PCA should be conducted for large-scale scRNA-seq. First, since the widely used PCA algorithms and implementations load all elements of a data matrix into memory, for large-scale datasets such as the 1.3 million cells measured by 10X Genomics Chromium [39] or the 2.0 million cells measured by sci-RNA-seq [23], the calculation is difficult unless the memory size of the user’s machine is very large. Furthermore, the same data analysis workflow is repeatedly performed, with deletions or additions of data or changes to workflow parameters, and under such trial-and-error cycles, PCA can become a bottleneck of the workflow. Therefore, fast and memory-efficient PCA algorithms are required.
Second, there are indeed other PCA algorithms that are faster and more memory-efficient, but their practicality for use with large-scale scRNA-seq datasets is not fully known. Generally, there can be a trade-off between the acceleration of algorithms by approximation methods and the accuracy of biological data analysis. Fast PCA algorithms might overlook some important differential gene expression. In large-scale scRNA-seq studies aiming to find novel cell types, this property may reduce clustering accuracy and would not be acceptable.
Finally, actual computational time and memory efficiency are highly dependent on the specific implementation, including the programming language and data format, but there is no benchmarking for evaluating these properties. Such information is directly related to the practicality of the software and is useful as a guideline for users and developers.
For the above reasons, in this work, we examine the practicality of fast and memory-efficient PCA algorithms for use with large-scale scRNA-seq datasets. This work provides four key contributions. First, we reviewed the existing PCA algorithms and their implementations (Figure 1). Second, we performed a benchmark test with selected PCA algorithms and implementations. To our knowledge, this is the first comprehensive benchmarking of PCA with scRNA-seq datasets. Third, we provide some original implementations of some PCA algorithms and utility functions for QC, filtering, and feature selection. All commands are implemented as a fast and memory-efficient Julia package. Finally, we propose guidelines for end-users and software developers.
Results
Review of PCA algorithms and implementations
Here, we review the existing PCA algorithms and their implementations. Pseudocode for all the algorithms is provided in Additional file 1.
PCA is formalized as eigenvalue decomposition (EVD) of the covariance matrix of the data matrix or singular value decomposition (SVD) of the data matrix [84]. To perform PCA, the most widely used function in the R language is probably the prcomp function, which is a standard R function [13, 40, 41, 49, 53, 64, 67, 69, 72, 74, 76, 79, 85]. Likewise, users and developers of the Python language may use the PCA function of scikit-learn (sklearn) [50, 54, 56, 86, 87], which is a Python package for machine learning. These are actually wrapper functions for performing SVD with LAPACK subroutines such as DGESVD (QR method-based) or DGESDD (divide-and-conquer method-based), and both subroutines perform the Golub-Kahan method [84]. In this method, the covariance matrix is tri-diagonalized by Householder transformation, and then the tri-diagonalized matrix is diagonalized by the QR method or divide-and-conquer method (Figure 2a). Such a two-step transformation is commonly performed by the sequential similarity transformation expressed as $M_k^{-1} \cdots M_2^{-1} M_1^{-1} X M_1 M_2 \cdots M_k$, where $X$ is an n-by-n covariance matrix, $M_k$ is an n-by-n invertible matrix, and $k$ is the step of the transformation. Likewise, when the input data matrix is asymmetric, the matrix is bi-diagonalized and then tri-diagonalized. When the matrix is finally diagonalized at the kth step, the diagonal elements become eigenvalues and $M = M_1 M_2 \cdots M_k$ becomes the set of corresponding eigenvectors. Although the Golub-Kahan method is the most widely used SVD algorithm, it has some drawbacks. First, the large dense matrix $M$ must be temporarily saved and incrementally updated in each step, and therefore the memory space is filled quickly. Second, when the matrix is large, the data matrix itself is difficult to load and can cause an out-of-memory error. For the above reasons, the Golub-Kahan method cannot be directly applied to large-scale scRNA-seq datasets.
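The equivalence between the EVD and SVD formulations stated at the start of this paragraph can be checked numerically. The following is a minimal sketch on a toy matrix, assuming NumPy and a cells-by-genes orientation chosen only for this illustration; it calls full-rank LAPACK-backed routines and is not the code used in the benchmark.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))      # toy data: 50 cells x 8 genes (assumed layout)
Xc = X - X.mean(axis=0)           # center each gene (column)
n = Xc.shape[0]

# Route 1: EVD of the covariance matrix.
cov = Xc.T @ Xc / (n - 1)
evals, evecs = np.linalg.eigh(cov)           # eigh returns ascending eigenvalues
order = np.argsort(evals)[::-1]              # reorder to descending
evals, evecs = evals[order], evecs[:, order]

# Route 2: SVD of the centered data matrix.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
svd_evals = s**2 / (n - 1)                   # squared singular values give the eigenvalues

# The eigenvectors and right singular vectors agree up to sign.
alignment = np.abs(np.sum(evecs * Vt.T, axis=0))
```

Both routes give the same eigenvalues, and each eigenvector matches the corresponding right singular vector up to a sign flip.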
There are some faster and more memory-efficient PCA algorithms. In contrast to the full-rank SVD solved by LAPACK, such algorithms are formalized as truncated SVD, in which only some of the top PCs are calculated. We classify these methods into five categories (Figure 2b). The first category consists of downsampling-based methods [88]. In these methods, SVD is first performed for a small matrix consisting of cells randomly sampled from the original large matrix. The remaining cells are then projected onto the low-dimensional space spanned by the eigenvectors learned from the small matrix. The effectiveness of this method in scRNA-seq studies has been evaluated by Bhaduri et al. [88] (Table 1).
The second category is SVD update [89], which repeatedly performs SVD using subsets of the data sampled from the data matrix and incrementally updates the result. Sequential Karhunen-Loeve transform (SKL) [89], which is a kind of SVD update, is used in loompy (http://loompy.org) (Table 1).
The third category consists of Krylov subspace-based methods [90–93]. The most typical method within this category is the power method, which iteratively multiplies a vector w with a covariance matrix X and normalizes w. Within some iterations, w converges to the eigenvector (PC1) corresponding to the largest eigenvalue. Because the power method by itself yields only PC1, several algorithms have been proposed to calculate the higher PCs, such as orthogonal iteration (also known as the block power method, subspace iteration, or simultaneous iteration [90]), the Lanczos method [90], and the Arnoldi method [90]. Orthogonal iteration performs the power method with multiple initial vectors in parallel, and each power iteration step performs QR decomposition for orthonormalization. In contrast to orthogonal iteration, the Lanczos and Arnoldi methods respectively introduce Lanczos and Arnoldi processes to generate vectors that are orthogonal to each other. To make the convergence faster, both methods also introduce “restart” strategies such as the augmented implicitly restarted Lanczos bidiagonalization algorithm (IRLBA [94]) and implicitly restarted Arnoldi methods (IRAM [95]), in which new initial vectors are calculated from the accumulated result of Lanczos or Arnoldi processes. In contrast to the Golub-Kahan method, these methods do not generate large dense temporary matrices, and when the data matrix is sparse, these methods are compatible with a sparse matrix format and can be accelerated. Cell Ranger [22], Seurat2 [47], Scanpy [87], SAFE [66], Scran [48], Giniclust2 [68], MAGIC [50], Harmony [55], and Scater [76] use IRLBA for PCA functions (Table 1). Although IRAM appears not to have been used in scRNA-seq studies, the effectiveness of its use with population genetic datasets has been argued recently [96].
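The power method described above can be sketched in a few lines. This is an illustrative example assuming NumPy; the function name `power_method`, the iteration count, and the small symmetric test matrix are hypothetical choices, and practical implementations use restarted Lanczos or Arnoldi processes rather than this plain loop.

```python
import numpy as np

def power_method(cov, n_iter=200, seed=0):
    """Estimate the largest eigenvalue and its eigenvector (PC1) of a symmetric matrix."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=cov.shape[0])
    w /= np.linalg.norm(w)
    for _ in range(n_iter):
        w = cov @ w                  # multiply the vector by the covariance matrix
        w /= np.linalg.norm(w)       # normalize to unit length
    eigenvalue = w @ cov @ w         # Rayleigh quotient of the converged vector
    return eigenvalue, w

# Small symmetric matrix standing in for a covariance matrix.
cov = np.array([[4.0, 1.0, 0.0],
                [1.0, 3.0, 1.0],
                [0.0, 1.0, 2.0]])
eigenvalue, w = power_method(cov)
```

The convergence rate depends on the ratio between the second-largest and largest eigenvalues, which is one reason the restart strategies mentioned above are used in practice.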
The fourth category is gradient descent (GD, or steepest descent)-based methods. In this category, the gradient of the objective function is calculated, and the initial vectors are updated in the direction opposite to the gradient. Whereas GD utilizes all the data to calculate the gradient (i.e., full gradient), stochastic gradient descent (SGD) calculates the gradient with a subset of the data (i.e., stochastic gradient). Although these PCA algorithms are sometimes used for situations in which the data are incrementally observed, such as subspace tracking [97], they can also be fast and memory-efficient because the calculation of the full/stochastic gradient is decomposable into the sum of the gradients of individual data points. Although these methods appear not to have been used in scRNA-seq studies, in image processing studies they are known as Oja’s method or the generalized Hebbian algorithm [98–100].
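As a rough illustration of the stochastic variant, the sketch below applies an Oja-style update for the first PC, using one data point per step. It assumes NumPy; the function name `oja_first_pc`, the learning rate `eta`, the number of epochs, and the toy data are all hypothetical choices for this example and are not the settings of sgd in OnlinePCA.jl.

```python
import numpy as np

def oja_first_pc(X, eta=0.001, n_epochs=5, seed=0):
    """Estimate PC1 of centered data X (samples x features) one sample at a time."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])
    w /= np.linalg.norm(w)
    for _ in range(n_epochs):
        for i in rng.permutation(X.shape[0]):
            x = X[i]
            w += eta * x * (x @ w)      # step along the per-sample gradient of the projected variance
            w /= np.linalg.norm(w)      # keep w on the unit sphere
    return w

# Toy data with one dominant direction of variance (first feature).
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 5)) * np.array([10.0, 1.0, 1.0, 1.0, 1.0])
X -= X.mean(axis=0)
w = oja_first_pc(X)
```

Because each update touches a single row, the data never need to be held in memory at once, but the result depends on the learning rate and the number of passes.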
The fifth category comprises random projection-based methods, in which a data matrix is randomly projected onto lower dimensions and basic linear algebraic methods such as QR decomposition and SVD are performed on the smaller matrix. Since most calculations are performed in this randomly chosen lower-dimensional space, these methods can be fast and memory-efficient. Surprisingly, the SVD of the original data matrix can be accurately reconstructed from arithmetic on the small matrix, with low reconstruction error [101, 102]. Halko’s method is a well-known randomized SVD algorithm, and Li et al. modified its preconditioning step to improve the calculation time (algorithm971 [103]). Halko’s method is used in scanpy [87], SIMLR [65], and SEQC [83], and algorithm971 is implemented in CellFishing.jl [70] (Table 1). Halko’s method is also used in population genetic studies [104].
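The random projection idea can be sketched as follows. This is a simplified Halko-style randomized SVD assuming NumPy; the function name `randomized_svd`, the oversampling amount, and the toy low-rank matrix are assumptions made for illustration, and the sketch does not reproduce the preconditioning of algorithm971 or the code of any package listed in Table 1.

```python
import numpy as np

def randomized_svd(X, k, n_oversamples=10, niter=3, seed=0):
    """Approximate the top-k SVD of X via a random projection and power iterations."""
    rng = np.random.default_rng(seed)
    omega = rng.normal(size=(X.shape[1], k + n_oversamples))  # random test matrix
    Q, _ = np.linalg.qr(X @ omega)            # orthonormal basis of the projected range
    for _ in range(niter):                    # power iterations with re-orthonormalization
        Q, _ = np.linalg.qr(X.T @ Q)
        Q, _ = np.linalg.qr(X @ Q)
    B = Q.T @ X                               # small (k + n_oversamples) x n matrix
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ U_small)[:, :k], s[:k], Vt[:k]

# Toy matrix: rank-5 signal plus weak noise.
rng = np.random.default_rng(1)
signal = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 100))
X = signal + 0.01 * rng.normal(size=(200, 100))
U, s, Vt = randomized_svd(X, k=5)
s_exact = np.linalg.svd(X, compute_uv=False)[:5]
```

Only the small matrix `B` is passed to the dense SVD routine, which is where the savings in time and memory come from.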
Notably, the acceleration techniques of the above algorithms are based on random row/column selection or random projection of data matrices, both of which are used to make a large matrix smaller. When these processes are used in an out-of-core (also known as online, incremental, or on-disk) implementation, in which only a subset of the data matrix is loaded into memory and used to incrementally update the calculation, these algorithms might be scalable even to scRNA-seq datasets consisting of millions of cells. For example, in fast Fourier transform-accelerated interpolation-based t-stochastic neighbor embedding (FIt-SNE [59]), algorithm971 is implemented in an out-of-core manner and named out-of-core PCA (oocPCA). However, most PCA implementations load all the elements of a data matrix into memory simultaneously. Therefore, the order of memory usage of such algorithms is commonly $\mathcal{O}(NM)$, where $N$ is the number of genes and $M$ is the number of cells (Figure 3). To extend the scope of algorithms used in the benchmarking, we implemented our own out-of-core versions of algorithms such as orthogonal iteration, GD, SGD, Halko’s method, and algorithm971.
Benchmarking of PCA algorithms and implementations
Here, we perform the benchmarking test of the PCA algorithms described above. As far as possible, we included PCA implementations that are freely available and easy to download, install, and run. The source code for performing the benchmarking is summarized in Additional file 2, and the results of the benchmarking are summarized in Figure 3.
Real-world datasets
In consideration of the trade-offs among the large number of methods to be evaluated and our limited time, computational resources, and manpower, we carefully selected real-world datasets for benchmarking. We selected three datasets: mouse cells from a primary visual cortex region (Cortex), human cells from the pancreas (Pancreas), and mouse cells from the cortex, hippocampus, and ventricular zone (Brain) (Table 2). These datasets have been used in many previous scRNA-seq studies [66, 70, 82, 88, 105–111].
The accuracy of PCA algorithms
Here, we evaluate the accuracy of the various PCA algorithms by using three real-world datasets. For the analyses of the Cortex and Pancreas datasets, we set the result of prcomp as the gold standard, and the other implementations are compared with this result (Figure 1b and 3). For the Brain dataset analyses, full-rank SVD by LAPACK is computationally difficult. Therefore, we set the result of Cell Ranger as the gold standard. Although we know that the algorithm is IRLBA, some details of data preprocessing, such as gene selection and logarithm transformation, are unclear. Accordingly, for the Brain dataset, the comparison may be imprecise. In our computing environment, we could not use Cell Ranger to analyze the Brain dataset owing to an out-of-memory error in Python, so the result of the analysis provided by 10X Genomics is used instead.
First, we performed t-stochastic neighbor embedding (t-SNE [57, 58]) for the results of each PCA algorithm and compared the clarity of the cluster structure detected by the original studies (Figure 1b and 4). For the Brain dataset, only downsampling and some out-of-core PCA implementations such as IncrementalPCA (sklearn), orthiter/gd/sgd/halko/algorithm971 (OnlinePCA.jl), and oocPCA_CSV (oocRPCA) could be performed, while the other implementations were terminated by out-of-memory errors. Compared with the gold standard cluster structures, the structures detected by downsampling were unclear, and some distinct clusters were incorrectly combined into single clusters. In the realistic situation in which the cellular labels are not available a priori, the labels are estimated in an exploratory manner by confirming differentially expressed genes, known marker genes, or related gene functions of clusters. In such a situation, downsampling may overlook subgroups hiding in a cluster. We also applied two clustering methods (k-means and Gaussian mixture model (GMM) clustering [112]) to all the results of the PCA implementations and calculated the adjusted Rand index (ARI [113]) to evaluate clustering accuracy (Figure 1b and 5). Compared with the gold standard, the results of downsampling and sgd (OnlinePCA.jl) were worse, and the other implementations were as accurate as the gold standard.
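For readers unfamiliar with the ARI, the following sketch computes it from two label vectors using only the Python standard library. It follows the standard pair-counting definition; the function name and the toy labels are illustrative, and it is not the code used to produce Figure 5.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two partitions of the same items."""
    n = len(labels_true)
    pair_counts = Counter(zip(labels_true, labels_pred))   # contingency table cells
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)
    sum_pairs = sum(comb(c, 2) for c in pair_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())
    expected = sum_true * sum_pred / comb(n, 2)            # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:                              # degenerate case, e.g. one cluster each
        return 1.0
    return (sum_pairs - expected) / (max_index - expected)

same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])     # identical partitions, renamed labels
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])  # partitions that cut across each other
```

The index equals 1 when the two partitions are identical regardless of how the clusters are named, and it falls to around 0 or below when agreement is no better than chance.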
Next, we performed an all-to-all comparison between PCs from the gold standard and the other PCA implementations (Figure 1b and 6). Since the PCs are unit vectors, when two PCs are directed in the same or opposite direction, their inner product becomes 1 or −1, respectively. Vectors in the same and opposite directions are mathematically identical in the PCA optimization problem, and different PCA implementations may yield PCs with different signs. Accordingly, we calculated the absolute value of the inner product, which ranges from 0 to 1, for the all-to-all comparison and evaluated whether higher PCs are accurately calculated. Figure 6 shows that the higher PCs of downsampling, orthiter/gd/sgd (OnlinePCA.jl), and PCA (dask-ml [114]) become inaccurate as the dimensionality of the PC increases. The higher PCs of these implementations also look noisy and unclear in pair plots of PCs in each implementation and seem uninformative (Additional file 3, Additional file 4, and Additional file 5). In particular, the higher PCs calculated by sgd (OnlinePCA.jl) are sometimes influenced by the existence of outlier cells (Additional file 3, Additional file 4, and Additional file 5) and are very sensitive to the learning parameters, including the number of passes over the row vectors of the data matrix (i.e., the number of epochs; Additional file 7). In contrast to these results, all the implementations of IRLBA and IRAM as well as the randomized SVD approaches except for PCA (dask-ml) are surprisingly accurate regardless of the programming language and the developers. Although PCA (dask-ml) is based on Halko’s method and almost identical to the other implementations of Halko’s method, this function uses the direct tall-and-skinny QR algorithm [115] (https://github.com/dask/dask/blob/a7bf545580c5cd4180373b5a2774276c2ccbb573/dask/array/linalg.py#L52) and this part might be related to the inaccuracy.
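The sign-invariant, all-to-all comparison of PCs can be sketched as below. The example assumes NumPy and PCs stored as matrix columns; the function name `pc_similarity` and the toy matrices are hypothetical.

```python
import numpy as np

def pc_similarity(pcs_a, pcs_b):
    """Absolute pairwise products |a_i . b_j| between unit-normalized PC columns."""
    a = pcs_a / np.linalg.norm(pcs_a, axis=0)
    b = pcs_b / np.linalg.norm(pcs_b, axis=0)
    return np.abs(a.T @ b)

rng = np.random.default_rng(0)
gold, _ = np.linalg.qr(rng.normal(size=(100, 4)))      # four orthonormal "gold standard" PCs
flipped = gold * np.array([1.0, -1.0, 1.0, -1.0])      # same PCs with some signs reversed
sim = pc_similarity(gold, flipped)
```

For two implementations that agree up to sign, the resulting matrix is the identity, which corresponds to a clear diagonal line in plots such as Figure 6.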
For the Brain dataset, compared with the gold standard (irlb (Cell Ranger)), the diagonal lines within the plots of all the results seem unclear (Figure 6c). This may be because the data preprocessing conditions for irlb (Cell Ranger) [22] and the other PCA implementations are not identical. The distribution of eigenvalues of irlb (Cell Ranger) is also slightly flatter than those of the other implementations (Figure 7c).
Because PCA calculates cell-wise eigenvectors (PCs) and gene-wise eigenvectors (loading vectors) simultaneously, we also performed all-to-all comparisons between the loading vectors of the gold standard and those of the other PCA implementations (Figure 8). We extracted the 500 genes with the largest absolute values in each loading vector and calculated the number of genes in common between the two loading vectors. The same tendencies were observed in the loading vectors. Since the genes with large absolute values in loading vectors are used as feature values in some studies [41–46], inaccurate PCA implementations may lower the accuracy of such an approach. The distributions of the eigenvalues of downsampling, IncrementalPCA (sklearn), and sgd (OnlinePCA.jl) are also different from those of the other implementations (Figure 7).
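The gene-overlap measure used for the loading vectors can be sketched with the standard library alone. The helper names and the five-gene toy vectors are hypothetical; the default `k=500` mirrors the number of genes stated in the text.

```python
def top_genes(loading, k):
    """Indices of the k genes with the largest absolute loading values."""
    order = sorted(range(len(loading)), key=lambda g: abs(loading[g]), reverse=True)
    return set(order[:k])

def top_gene_overlap(loading_a, loading_b, k=500):
    """Number of genes shared by the top-k sets of two loading vectors."""
    return len(top_genes(loading_a, k) & top_genes(loading_b, k))

# Toy loading vectors over five genes; note that signs do not matter.
a = [0.9, -0.8, 0.1, 0.05, -0.7]
b = [-0.85, 0.75, 0.6, 0.0, 0.1]
overlap = top_gene_overlap(a, b, k=3)
```

Because only absolute values are ranked, the measure is unaffected by the sign ambiguity of PCA.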
Finally, we compared the computational time and the memory usage of all the PCA implementations (Figure 9). For the Brain dataset, downsampling itself was fast, but the preprocessing steps, such as matrix transposition (X′) and multiplication of the transposed data matrix and the loading vectors to calculate PCs (X′W), were slow and had high memory space requirements (Additional file 2). We also found that PCA (dask-ml) was not fast despite its out-of-core implementation; for the Brain dataset, this implementation could not finish the calculation within three days in our computational environment. The other out-of-core PCA implementations, such as IncrementalPCA (sklearn), orthiter/gd/sgd/halko/algorithm971 (OnlinePCA.jl), and oocPCA_CSV (oocRPCA), were able to finish those calculations in 10 or fewer hours.
Calculation time, memory usage, and scalability
We also systematically estimated the calculation time, memory usage, and scalability of all the PCA implementations using 18 synthetic datasets consisting of $\{10^2, 10^3, 10^4\}$ genes × $\{10^2, 10^3, 10^4, 10^5, 10^6, 10^7\}$ cells matrices (see Materials and methods). We evaluated whether the calculations can be finished or are terminated by out-of-memory errors (Figure 1b). We also manually terminated a PCA process (i.e., dask-ml) that was unable to generate output files within three days. All the terminated jobs are summarized in Additional file 6. Note that the number of epochs in orthiter/gd/sgd (OnlinePCA.jl) is one, and in most situations, the value should be tuned using grid search (Additional file 7).
Figures 10 and 11 show the calculation time and the memory usage of all the PCA implementations that can be scaled to a $10^4 \times 10^7$ matrix. IncrementalPCA (sklearn) and oocPCA_CSV (oocRPCA) were slightly slower than the other implementations (Figure 10), and this was probably because the inputs of these implementations were CSV files while the other implementations used binary files. The memory usage of all the implementations was almost the same except for oocPCA_CSV (oocRPCA). This is probably because this function has a parameter that controls the maximum memory usage (mem), and we set the value as 10 GB (Additional file 2). Indeed, the memory usage seemed to have converged to around 10 GB (Figure 11). This property is considered an advantage of this implementation; the users can specify different values to suit the computational environment.
The relationship between file format and performance
We also counted the number of passes over the Brain matrix in out-of-core PCA implementations (Figure 12a) and found that the calculation time was correlated with the number of passes made by the implementation. Furthermore, data compression substantially shortens the calculation time. This suggests that the data loading process is critical for out-of-core implementations and that the overhead for this process has a great effect on the overall calculation time and memory usage. Accordingly, using the Brain dataset in different data formats, such as CSV, Zstd, Loom [87], and the hierarchical data format 5 (HDF5) file provided by 10X Genomics (10X-HDF5), we evaluated the calculation time and the memory usage for the simple one-pass orthogonal iteration (qr(XW)), where qr is the QR decomposition, X is the data matrix, and W is the 30 vectors to be estimated as the eigenvectors (Figure 12b). Since only one row is loaded at once for CSV, Zstd, and Loom formats, their memory usage is very low but the time needed for calculation is greater than that of 10X-HDF5. Conversely, for 10X-HDF5 format, the data matrix is stored in compressed sparse column (CSC) format, which omits the 0 values and saves memory usage, and this enables the large data matrix to be divided into multiple blocks containing multiple row vectors, each of which is loaded incrementally. While it is not obvious that the use of a sparse matrix accelerates PCA with scRNA-seq datasets, because scRNA-seq datasets are not particularly sparse compared with data from other scientific fields (cf. recommender systems or social networks [116, 117]), we showed that it does shorten the calculation time for scRNA-seq datasets.
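The one-pass calculation qr(XW) with block-wise loading can be sketched as follows. The example assumes NumPy and uses an in-memory array sliced into blocks as a stand-in for reading row blocks from a file; the function name `blockwise_product` and the block size of 100 rows are hypothetical.

```python
import numpy as np

def blockwise_product(row_blocks, W):
    """Compute X @ W while holding only one row block of X in memory at a time."""
    return np.vstack([block @ W for block in row_blocks])

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))       # toy data matrix
W = rng.normal(size=(50, 5))          # vectors to be refined toward the eigenvectors

# Generator of row blocks; in an out-of-core setting each block would be read from disk.
row_blocks = (X[i:i + 100] for i in range(0, X.shape[0], 100))
Q, R = np.linalg.qr(blockwise_product(row_blocks, W))
```

A larger block means fewer read operations but more memory per step, which is the trade-off between calculation time and memory usage discussed in this section.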
When all row vectors stored in 10X-HDF5 are loaded at once, the calculation is fastest, but the memory usage is also highest. Since there is a trade-off between the calculation time and the memory usage, and the user’s computational environment is not always high-spec, the block size should be specifiable as an optional argument of the command. For the above reasons, we also developed tenxpca, which is a new implementation that performs algorithm971 for a sparse matrix stored in 10X-HDF5 format. Using all elements of the CSC at once, tenxpca can finish the calculation in 1.18 hours with 82.96 GB memory usage. This is the fastest analysis of the Brain dataset in this study. According to the user’s machine specification, the number of rows loaded at once can be optionally changed to a different number.
In addition to tenxpca, some algorithms used in this benchmarking, such as orthogonal iteration, GD, SGD, Halko’s method, and algorithm971, are implemented as Julia functions and command-line tools, which have been published as the Julia package OnlinePCA.jl (Figure 13). When the data are stored as a CSV file, they are compressed as a Zstd file (Figure 13a) and then some out-of-core PCA implementations are performed. When the data are in 10X-HDF5 format, algorithm971 is directly performed with the data by tenxpca (Figure 13b). We also implemented some functions and command-line tools to extract row-wise/column-wise statistics such as mean and variance as well as highly variable genes [118] in an out-of-core manner. Because such statistics are saved as small vectors, they can be loaded by any programming language and applied to QC, and the users can select only informative genes and cells. After QC, the filtering command removes low-quality genes/cells and generates another Zstd file.
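The idea of extracting row-wise statistics in an out-of-core manner can be illustrated with a short standard-library sketch. It reads a CSV stream one row at a time and keeps only the resulting small vectors; the function name, the use of the sample variance, and the in-memory `StringIO` stand-in for a file are assumptions of this example and do not describe the OnlinePCA.jl commands.

```python
import csv
import io
from statistics import fmean, variance

def stream_row_stats(handle):
    """Yield (mean, sample variance) per row while holding one row in memory."""
    for row in csv.reader(handle):
        values = [float(v) for v in row]
        yield fmean(values), variance(values)

# Three toy rows (e.g., genes) by three columns (e.g., cells).
toy_csv = io.StringIO("1,2,3\n0,0,6\n4,4,4\n")
stats = list(stream_row_stats(toy_csv))
```

The output has one pair of numbers per row, so it stays small even when the input matrix does not fit in memory.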
Guidelines for users and developers
Based on all the benchmarking results and our implementation in this work, we propose some user guidelines (Figure 14). Considering that bioinformatics studies combine multiple tools to construct a user’s specific workflow, written language is an important factor in selecting the right PCA implementation. Therefore, we categorized the PCA implementations by their written language (i.e., R, Python, and Julia; Figure 14, column-wise). Along with the data matrix size, we also categorized implementations by the way they load data (in-memory or out-of-core) as well as their input matrix format (dense or sparse, Figure 14, row-wise). Here we define the GC-value of a data matrix as the number of genes × the number of cells.
If the data matrix is not too large (e.g., GC ≤ $10^7$), the data matrix can be loaded as a dense matrix, and full-rank SVD of LAPACK is then accurate and optimal (in-memory & dense matrix). In such a situation, the wrapper functions for utilizing LAPACK written in each language are useful. However, if the data matrix is much larger (e.g., GC ≥ $10^8$), an alternative to LAPACK is needed. Based on the benchmarking results, we recommend IRLBA, IRAM, Halko’s method, and algorithm971 as alternatives to LAPACK. If the GC-value is around $10^8$ ≤ GC ≤ $10^{10}$, and if the data matrix can be loaded into memory as a sparse matrix, some implementations for these algorithms are available (in-memory & sparse matrix). In particular, such implementations are effective for large data matrices stored in 10X-HDF5 format as CSC format. Seurat2 [47] also introduces this approach by combining the matrix market format (R, Matrix) and irlba function (R, irlba). When the data matrix is dense and cannot be loaded into memory space (e.g., GC ≥ $10^{10}$), the out-of-core implementations such as oocPCA_CSV (R, oocRPCA), IncrementalPCA (Python, sklearn), and algorithm971 (Julia, OnlinePCA.jl) are useful (dense matrix & out-of-core). If the data matrix is extremely large and cannot be loaded into memory even if the data are formatted as a sparse matrix, out-of-core PCA implementations for sparse matrices are needed. In such a situation, tenxpca can be used if the data are stored in 10X-HDF5 format.
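One possible reading of these guidelines is sketched below as a small decision function. The function name, its arguments, and the returned strings are hypothetical, and the $10^7$ threshold is only the example value given in the text, so the sketch should be treated as a rough aid rather than a definitive rule; Figure 14 remains the reference.

```python
def suggest_pca_strategy(n_genes, n_cells, is_sparse, fits_in_memory):
    """Map the GC-value and data properties to one of the four guideline categories."""
    gc_value = n_genes * n_cells                 # GC-value as defined in the text
    if fits_in_memory:
        if gc_value <= 10**7:                    # example threshold from the text
            return "in-memory, dense: full-rank SVD via LAPACK wrappers"
        return "in-memory, sparse: IRLBA, IRAM, Halko's method, or algorithm971"
    if is_sparse:
        return "out-of-core, sparse: e.g., tenxpca for 10X-HDF5 input"
    return "out-of-core, dense: e.g., oocPCA_CSV, IncrementalPCA, or algorithm971 (OnlinePCA.jl)"

small = suggest_pca_strategy(10**3, 10**3, is_sparse=False, fits_in_memory=True)
huge = suggest_pca_strategy(3 * 10**4, 10**6, is_sparse=True, fits_in_memory=False)
```

Whether a matrix fits in memory depends on the machine, so that flag is left to the caller instead of being inferred from the GC-value.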
There is a point to be noted regarding effective utilization of the implementations for randomized SVD. Both Halko’s method and algorithm971 have a parameter for specifying the number of power iterations (niter), and this iteration step sharpens the distribution of eigenvalues and enforces a more rapid decay of the singular values ([119] and Additional file 2). In our experiments, the value of niter is very critical for achieving accuracy, and we highly recommend niter values of 3 or larger (Additional file 8). In some implementations, this parameter is specified as a smaller number or cannot be accessed as a function parameter. Therefore, users should carefully set the parameter or select an appropriate implementation.
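The effect of the power iterations can be stated in one line. This is a standard identity for randomized SVD rather than a result specific to the implementations benchmarked here, and the symbols $U$, $\Sigma$, $V$, and $q$ (standing for niter) are introduced only for this illustration.

```latex
X = U \Sigma V^{\top}
\quad\Longrightarrow\quad
(X X^{\top})^{q} X = U \Sigma^{2q+1} V^{\top}
```

Each singular value $\sigma_i$ is thus replaced by $\sigma_i^{2q+1}$, so the ratio between neighboring singular values $\sigma_{i+1}/\sigma_i \le 1$ shrinks to $(\sigma_{i+1}/\sigma_i)^{2q+1}$, which makes the leading subspace easier for the random projection to capture.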
We also propose guidelines for developers. Many methodologies related to data, algorithms, and computational frameworks and environments are available for developing fast, memory-efficient, and scalable PCA implementations (Additional file 9). Here, we focus on two topics.
The first topic is “loss of sparsity”. As described above, the use of a sparse matrix can effectively reduce the memory space and accelerate the calculation, but developers must be careful not to destroy the sparsity of a sparse matrix. PCA with a sparse matrix is not equivalent to SVD with a sparse matrix; in PCA, all sparse matrix elements must be centered by the subtraction of gene-wise average values. Once the sparse matrix X is centered (X − Xmean), it becomes a dense matrix filled with floating-point numbers, and the memory usage is significantly increased. Obviously, explicit calculation of the centered matrix should be avoided. In such a situation, if multiplication of this centered matrix and a dense vector/matrix is required, the calculation should be divided into two parts, such as (X − Xmean) W = XW − XmeanW, and these parts should be calculated separately. If one or both parts require more than the available memory space, such parts should be incrementally calculated in an out-of-core manner. There are actually some PCA implementations that can accept a sparse matrix, but they may consume very long calculation time and large memory space because of a loss of sparsity (cf. rpca of rsvd https://github.com/cran/rsvd/blob/7a409fe77b220c26e88d29f393fe12a20a5f24fb/R/rpca.R#L158). To the best of our knowledge, prcomp_irlba of irlba (https://github.com/bwlewis/irlba/blob/8aa970a7d399b46f0d5ad90fb8a29d5991051bfe/R/irlba.R#L379), irlb of Cell Ranger (https://github.com/10XGenomics/cellranger/blob/e5396c6c444acec6af84caa7d3655dd33a162lib/python/cellranger/analysis/irlb.py#L118), safe_sparse_dot of sklearn (https://scikit-learn.org/stable/modules/generated/sklearn.utils.extmath.safe_sparse_dot.html), and tenxpca of OnlinePCA.jl (https://github.com/rikenbit/OnlinePCA.jl/blob/c95a2455acdd9ee14f8833dc5c53615d5e24b5f1/src/tenxpca.jl#L183) deal with this topic.
Likewise, as an alternative to the centering calculation, MaxAbsScaler of sklearn (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html) introduces a scaling method in which the maximum absolute value of each gene vector becomes one, thereby avoiding the loss of sparsity.
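The two-part calculation (X − Xmean) W = XW − XmeanW can be checked with a short sketch. It assumes NumPy, a cells-by-genes orientation so that gene-wise means are column means, and a dense toy array standing in for a sparse matrix; the function name `centered_matmul` is hypothetical and the code is not taken from any of the packages cited above.

```python
import numpy as np

def centered_matmul(X, gene_means, W):
    """Compute (X - gene_means) @ W without forming the dense centered matrix."""
    # X @ W can use a sparse X; gene_means @ W is a single row broadcast over all cells.
    return X @ W - gene_means @ W

rng = np.random.default_rng(0)
X = rng.poisson(0.3, size=(200, 30)).astype(float)   # mostly zeros, like count data
W = rng.normal(size=(30, 4))
gene_means = X.mean(axis=0)

implicit = centered_matmul(X, gene_means, W)
explicit = (X - gene_means) @ W                       # dense centering, for comparison only
```

The correction term needs memory proportional only to the number of columns of W, so the sparsity of X is never lost.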
The second topic is “lazy loading.” Although the out-of-core PCA implementations used in this benchmarking explicitly calculate centering, scaling, and any other arithmetic operations from the extracted block of the data matrix, such processes should ideally be expressed virtually, as if the whole matrix were in memory, and calculated lazily on the fly only when the data are actually required. Source code written in this way also tends to be safe and readable. Some packages, such as DeferredMatrix of BiocSingular (R/Bioconductor, https://bioconductor.org/packages/devel/bioc/html/BiocSingular.html), CenteredSparseMatrix (Julia, https://github.com/jsams/CenteredSparseMatrix), Dask [114] (Python, https://dask.org), and Vaex (Python, https://vaex.io/), support lazy loading.
Discussion
In this benchmarking study, we found that PCA implementations based on LAPACK are accurate but cannot be scaled for use with large-scale scRNA-seq datasets such as the Brain dataset, and alternative implementations are thus required. Some methods approximate the calculation by using truncated SVD forms such as IRLBA, IRAM, Halko’s method, and algorithm971, and these are sufficiently accurate as well as faster and more memory-efficient than LAPACK. The actual memory usage is highly dependent on whether an algorithm is implemented as out-of-core or whether a sparse matrix can be specified as input. Some sophisticated implementations, including ours, can handle such issues. Other PCA algorithms, such as downsampling, SKL, orthogonal iteration, GD, and SGD, are actually not accurate, and their use risks overlooking cellular subgroups contained within scRNA-seq datasets. These methods commonly update eigenvectors with small fractions of the data matrix, and this process may overlook subgroups or subgroup-related gene expression, thereby causing the observed inaccuracy. Although the downstream analyses of PCA vary widely, and we could not examine all the topics of scRNA-seq analysis, such as rare cell-type detection [68, 69] and pseudotime analysis [13, 71–75], differences among PCA algorithms might also affect the accuracy of such analyses. Butler et al. showed that batch effect removal can be formalized as canonical correlation analysis (CCA) [47], which is mathematically very similar to PCA. The optimization of CCA is also formalized in various ways, including randomized CCA [120] or SGD of CCA [121]. Although this topic is beyond the scope of the present work, we will also evaluate such differences among algorithms in the future.
This work also sheds light on the effectiveness of randomized SVD. This algorithm is popular in population genetic studies [104]. In the present study, we also assessed its effectiveness with scRNA-seq datasets with high heterogeneity. This algorithm is relatively simple, and some studies have implemented it from scratch (Table 1). The simplicity may be the attraction of this algorithm. Our literature review, benchmarking, special implementation for scRNA-seq datasets, and guidelines provide important resources for new users and developers tackling the challenges of large-scale scRNA-seq data analysis. EVD/SVD is also known as the “master” algorithm for matrices [84]; it can also solve least squares problems and be applied to other data analysis methods, such as dimensionality reduction, clustering, and prediction, which means that many out-of-core algorithms can be developed for large scRNA-seq datasets.
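To give a sense of how compact randomized SVD can be, the following is a minimal Python sketch in the spirit of Halko's method with power iterations. It is not the code of OnlinePCA.jl or of any implementation in Table 1; the function name, the oversampling default of 10, the QR re-orthonormalization at each step, and the dense in-memory input are all assumptions made for illustration.

```python
import numpy as np


def randomized_svd(X, k, n_oversamples=10, n_iter=3, seed=0):
    """Approximate rank-k SVD of a dense matrix X (illustrative sketch)."""
    rng = np.random.default_rng(seed)      # explicit seed for reproducibility
    m, n = X.shape
    l = min(k + n_oversamples, m, n)       # size of the random sketch

    # Random projection followed by orthonormalization of the range.
    Q, _ = np.linalg.qr(X @ rng.standard_normal((n, l)))

    # Power iterations sharpen the separation of the leading subspace.
    for _ in range(n_iter):
        Z, _ = np.linalg.qr(X.T @ Q)
        Q, _ = np.linalg.qr(X @ Z)

    # Exact SVD of the small projected matrix, then map back.
    B = Q.T @ X
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k]


# Toy low-rank matrix (hypothetical): exact rank 5.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 50))
U, s, Vt = randomized_svd(X, k=5)
```

Exposing the seed as a parameter, as done here, is one way to make repeated runs comparable.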
Materials and methods
Empirical datasets
The gene expression matrix and cell type labels for the Cortex dataset [37] were retrieved from the Single Cell Portal Beta (https://portals.broadinstitute.org/single_cell/study/a-transcriptomic-taxonomy-of-adult-mouse-visual-cortex-visp). The gene expression matrix and cell type labels for the Pancreas dataset [38] were retrieved from the GEO database (GSE84133). The gene expression matrix and cell type labels for the Brain dataset [39] were downloaded from the 10X Genomics company website (https://support.10xgenomics.com/single-cell/datasets/1M_neurons). Genes with zero variance were removed from all matrices because such genes are meaningless for PCA calculation. The numbers of remaining genes and cells are summarized in Table 2.
Simulated datasets
All count datasets were generated by the R rnbinom function (random number generation based on a negative binomial distribution) with shape and rate parameters of 0.4 and 0.3, respectively. Matrices of {10^2, 10^3, 10^4} genes × {10^2, 10^3, 10^4, 10^5, 10^6, 10^7} cells were generated.
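A rough Python analogue of this simulation is sketched below. The original data were generated in R; here the two parameters 0.4 and 0.3 are mapped to NumPy's n and p arguments of negative_binomial, which is an assumption about the parameterization and may not match the R call exactly. The function name simulate_counts is hypothetical.

```python
import numpy as np


def simulate_counts(n_genes, n_cells, size=0.4, prob=0.3, seed=0):
    """Draw a genes x cells count matrix from a negative binomial distribution.

    The mapping of the reported 0.4 and 0.3 onto NumPy's (n, p) is an
    assumption made for this sketch.
    """
    rng = np.random.default_rng(seed)
    return rng.negative_binomial(size, prob, size=(n_genes, n_cells))


# Smallest configuration from the grid above: 10^2 genes x 10^3 cells.
X_sim = simulate_counts(10**2, 10**3)
```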
Benchmarking procedures
Assuming digital expression matrices of unique molecular identifier (UMI) counts, all the data files, including real and synthetic datasets, were in CSV format. When using the Brain dataset, the matrix stored in 10X-HDF5 format was converted to CSV using our in-house Python script (https://gist.github.com/kokitsuyuzaki/5b6cebcaf37100c8794bdb89c7135fd5). After being loaded by each PCA implementation, the raw data matrix Xraw was transformed to X by the logarithm transformation X = log10 (Xraw + 1), where log10 is the element-wise base-10 logarithm. When performing each PCA implementation based on the truncated SVD, the number of PCs was specified in advance (Table 2).
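Because log10(0 + 1) = 0, this transformation maps zeros to zeros and can be applied to the stored non-zero entries only. The following Python sketch illustrates this for a scipy sparse matrix; it is not the code used in the benchmark, and the function name log10p1_sparse and the toy data are hypothetical.

```python
import numpy as np
import scipy.sparse as sp


def log10p1_sparse(X_raw):
    """Element-wise log10(X_raw + 1) that keeps the matrix sparse."""
    X = sp.csr_matrix(X_raw, dtype=np.float64, copy=True)
    # log10(1 + x) = ln(1 + x) / ln(10); only stored entries are touched,
    # and implicit zeros remain zero.
    X.data = np.log1p(X.data) / np.log(10.0)
    return X


# Toy UMI-like counts (hypothetical).
rng = np.random.default_rng(0)
X_raw = sp.csr_matrix(rng.poisson(0.3, size=(50, 30)))
X_log = log10p1_sparse(X_raw)
```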
Although it is unclear how many cells should be used in downsampling, an empirical analysis [88] suggests that 20,000 to 50,000 cells are sufficient for clustering and detecting subpopulations in the 1.3M dataset. Thus, 50,000/1,300,000 × 100 = 3.8% of cells were sampled from each dataset and used for the downsampling method. When performing IncrementalPCA (sklearn), blocks of row vectors, each containing as many rows as the number of PCs, were extracted repeatedly until the end of the file was reached. When performing irlb (Cell Ranger), the loaded dataset was first converted to a scipy sparse matrix and passed to the function, because this function supports sparse matrix data stored in 10X-HDF5 format. The time and memory usage of this conversion were also included in the benchmark. When performing the functions of OnlinePCA.jl (orthiter, gd, sgd, halko, and algorithm971), we converted the CSV data to Zstd format, and the calculation time and memory usage of this conversion were included in the benchmark for fairness. orthiter, gd, and sgd (OnlinePCA.jl) were performed until the calculations converged (Additional file 7). For all the randomized SVD implementations, the niter parameter value was set to 3 (Additional file 8). As an alternative to oocPCA_CSV, users can also use oocPCA_BIN, which performs PCA with binarized CSV files. The binarization is performed by the csv2binary function, which is also implemented in the oocRPCA package. Although data binarization accelerates the PCA calculation itself, we confirmed that csv2binary is based on in-memory calculation, and in our environment, csv2binary was terminated by an out-of-memory error. Accordingly, we only used oocPCA_CSV, and the CSV files were directly loaded by this function.
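The downsampling fraction described above can be reproduced with a short Python sketch. Uniform sampling without replacement is an assumption here, as are the function name downsample_cells and the fixed seed.

```python
import numpy as np


def downsample_cells(n_cells, n_target=50_000, n_reference=1_300_000, seed=0):
    """Return sorted indices of a random subset of cells.

    The sampled fraction is n_target / n_reference (about 3.8%), applied
    to a dataset with n_cells cells. Uniform sampling is an assumption.
    """
    fraction = n_target / n_reference
    n_keep = max(1, round(fraction * n_cells))
    rng = np.random.default_rng(seed)
    return np.sort(rng.choice(n_cells, size=n_keep, replace=False))


# Example: a hypothetical dataset with 10,000 cells.
idx = downsample_cells(10_000)
```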
Since most algorithms are based on random numbers, we also tried to evaluate stability, as captured by variation among multiple trials. However, we could not specify the random seed in many of the implementations. This is because, in many cases, there is no parameter for specifying the seed in the PCA function, or sometimes the source code for performing the PCA calculation is separated into other languages such as FORTRAN, C, and C++, making the seed hard to specify in the code. We confirmed that many implementations generated the same results with multiple trials (data not shown), but this does not always mean that the calculations stably converge to the same solution from any random seed, because the random seed is sometimes hard-coded in the source code. For the above reason, in this work, we used the result of a single trial for each implementation.
Computational environment
All computations were performed on two node-machines with Intel Xeon E5-2697 v2 (2.70 GHz) processors and 128 GB of RAM, four node-machines with Intel Xeon E5-2670 v3 (2.30 GHz) processors and 96 GB of RAM, and four node-machines with Intel Xeon E5-2680 v3 (2.50 GHz) processors and 128 GB of RAM. Storage among node machines was shared by NFS, connected using InfiniBand. All jobs were queued by the Open Grid Scheduler/Grid Engine (v2011.11) in parallel. The elapsed time and maximum memory usage were evaluated using the GNU time command (v1.7). We also tried to use Cell Ranger (v1.3.0) to analyze the Brain dataset on a large-memory machine with an Intel Xeon (2.90 GHz) processor and 512 GB of RAM.
Reproducibility
All the analyses were performed on the machines described above. We used R v3.5.0, Python v3.6.4, and Julia v1.0.1 in the benchmarking, and only when we performed t-SNE by bhtsne (https://github.com/lvdmaaten/bhtsne) and CSV conversion of the Brain dataset, we used Python v2.7.9. All the programs used to perform the PCA implementations in the benchmarking are summarized in Additional file 2. Orthogonal iteration, GD, SGD, Halko’s method, and algorithm971 are implemented as orthiter, gd, sgd, halko, and algorithm971, respectively, which are the Julia functions or commands for OnlinePCA.jl (https://github.com/rikenbit/OnlinePCA.jl). We also published the script files used to perform the benchmark (https://github.com/rikenbit/onlinePCA-experiments).
Competing interests
The authors declare that they have no competing interests.
Funding
This work was supported by MEXT KAKENHI Grant Number 16K16152. This work was partially supported by the Japan Science and Technology Agency (JST), CREST grant number JPMJCR16G3, and the Projects for Technological Development, Research Center Network for Realization of Regenerative Medicine by Japan (18bm0404024h0001), the Japan Agency for Medical Research and Development (AMED).
Authors’ contributions
KT and HS surveyed the PCA algorithms and implementations. KT and IN designed the benchmarking test. KT and KS implemented the Julia program and performed all the analyses. KT retrieved and preprocessed the test dataset to evaluate the proposed method. All the authors have written, read, and approved the manuscript.
Acknowledgements
We thank Mr. Akihiro Matsushima and Mr. Manabu Ishii for their assistance with the IT infrastructure for the data analysis. We are also grateful to all members of the Laboratory for Bioinformatics Research, RIKEN Center for Biosystems Dynamics Research for their helpful advice. Computations were partially performed on the NIG supercomputer at ROIS National Institute of Genetics.
Abbreviations
- PCA: principal component analysis
- scRNA-seq: single-cell RNA sequencing
- sci-RNA-seq: single-cell combinatorial-indexing RNA-sequencing analysis
- UML: unsupervised machine learning
- QC: quality control
- PC: principal component
- EVD: eigenvalue decomposition
- SVD: singular value decomposition
- Sklearn: scikit-learn
- SKL: sequential Karhunen-Loeve transform
- IRLBA: augmented implicitly restarted Lanczos bidiagonalization
- IRAM: implicitly restarted Arnoldi method
- GD: gradient descent
- SGD: stochastic gradient descent
- t-SNE: t-stochastic neighbor embedding
- FIt-SNE: fast Fourier transform-accelerated interpolation-based t-stochastic neighbor embedding
- oocPCA: out-of-core PCA
- GMM: Gaussian mixture model
- ARI: adjusted Rand index
- Zstd: Zstandard
- UMI: unique molecular identifier
- CSV: comma-separated values
- HDF5: hierarchical data format 5
- 10X-HDF5: HDF5 provided by 10X Genomics
- CSC: compressed sparse column format
- CSR: compressed sparse row format