Accelerated dimensionality reduction of single-cell RNA sequencing data with fastglmpca

Summary: Motivated by theoretical and practical issues that arise when applying Principal Components Analysis (PCA) to count data, Townes et al. introduced "Poisson GLM-PCA", a variation of PCA adapted to count data, as a tool for dimensionality reduction of single-cell RNA sequencing (RNA-seq) data. However, fitting GLM-PCA is computationally challenging. Here we study this problem, and we show that a simple algorithm, which we call "Alternating Poisson Regression" (APR), produces better-quality fits, in less time, than existing algorithms. APR is also memory-efficient and lends itself to parallel implementation on multi-core processors, both of which are helpful for handling large single-cell RNA-seq data sets. We illustrate the benefits of this approach in two published single-cell RNA-seq data sets. The new algorithms are implemented in an R package, fastglmpca.

1. General problem formulation. Let Y ∈ R_+^{n×m} be an n × m non-negative matrix storing observed counts y_ij. Similar to [5,14,19,20], we consider the following model of the count data:

(1)  y_ij ∼ Pois(λ_ij),  log Λ = UΣV^T + XB^T + WZ^T,

in which Λ is the n × m matrix with entries λ_ij, Pois(λ) denotes the Poisson probability distribution with mean λ, X ∈ R^{n×n_x}, Z ∈ R^{m×n_z} are additional observed covariates for, respectively, the rows and columns of Y (n_x or n_z, or both, may be zero), U ∈ R^{n×K}, Σ ∈ R^{K×K}, V ∈ R^{m×K}, B ∈ R^{m×n_x}, W ∈ R^{n×n_z} are matrices of unknowns to be estimated from the data, and K > 0 is an integer specifying the rank of the matrix factorization UΣV^T, which is typically much smaller than n or m. (Our software implementation also allows for fixing one or more columns of B and/or one or more columns of W to chosen values.) In the simplest case, when there are no additional covariates (n_x = n_z = 0), the model is

(2)  y_ij ∼ Pois(λ_ij),  log Λ = UΣV^T.

To make the unknowns identifiable (up to sign) under general conditions [14], we further impose the following "PCA-like" constraints:

(3)  U^T U = I_K,  V^T V = I_K,  Σ = diag(σ_1, …, σ_K),  σ_1 ≥ ⋯ ≥ σ_K ≥ 0,

in which I_K denotes the K × K identity matrix. We call these the "identifiability constraints." Our aim is to compute maximum-likelihood estimates of the unknowns U, Σ, V, B, and W in (1) subject to the identifiability constraints (3). As we will show, computing the constrained maximum-likelihood solution for (2) is easily extended to the model with additional covariates (1), so we focus initially on fitting (2), then explain the extension to (1).
1.1. Dealing with the identifiability constraints. To deal with the identifiability constraints (3), we first compute maximum-likelihood estimates Û, V̂ of U and V without these constraints, then we transform the solution post hoc to obtain equivalent estimates of U, V, and Σ satisfying the identifiability constraints. One way to achieve this is to compute a truncated rank-K singular value decomposition (SVD) of the matrix Ĥ := ÛV̂^T:

(4)  Ĥ = UΣV^T.
The solutions Û, V̂ and U, Σ, V have the same likelihood because Ĥ = ÛV̂^T = UΣV^T, and U, Σ, V satisfy the identifiability constraints (3) by construction. However, we would like to avoid computing the SVD of an n × m matrix, which can be computationally expensive and memory-intensive. The following procedure generates the same result as (4), but avoids the costly SVD:

(5)  Û = Q_U R_U  (QR decomposition of Û)
     V̂ = Q_V R_V  (QR decomposition of V̂)
     R_U R_V^T = Ũ Σ Ṽ^T  (SVD of the K × K matrix R_U R_V^T)
     U = Q_U Ũ,  V = Q_V Ṽ.

The first two steps compute QR decompositions of Û and V̂. Since R_U and R_V are K × K matrices, their product, and its SVD, are computed very quickly. Therefore, the predominant computations in this procedure are the QR decompositions, each of which has a much more favorable time complexity than the truncated SVD of Ĥ.
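To make procedure (5) concrete, here is a minimal R sketch; the function name orthonormalize_fit and its interface are ours, for illustration, and are not part of the fastglmpca API.

    # Transform unconstrained estimates (U, V) into equivalent estimates
    # (U, Sigma, V) satisfying the identifiability constraints (3), without
    # ever forming the n x m matrix H = U V'.
    orthonormalize_fit <- function (U, V) {
      qr_u <- qr(U)                             # U = Q_U R_U
      qr_v <- qr(V)                             # V = Q_V R_V
      s <- svd(qr.R(qr_u) %*% t(qr.R(qr_v)))    # K x K SVD of R_U R_V'
      list(U = qr.Q(qr_u) %*% s$u,              # new U, orthonormal columns
           d = s$d,                             # diagonal entries of Sigma
           V = qr.Q(qr_v) %*% s$v)              # new V, orthonormal columns
    }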
1.2. Unconstrained optimization problem. Since the identifiability constraints are easily satisfied post hoc, for the remainder we focus on developing algorithms for the unconstrained optimization of U and V. Arbitrarily setting Σ = I_K, we seek to maximize the following objective, which is the log-likelihood derived from model (2) after removing terms that do not involve U or V:

(6)  ℓ(U, V) = ∑_{i=1}^n ∑_{j=1}^m { y_ij (UV^T)_ij − exp((UV^T)_ij) }.

In the next section, we develop a fast algorithm for finding an unconstrained maximizer of (6).
2. Algorithm for maximizing the GLM-PCA log-likelihood. To approach the problem of finding an unconstrained maximizer of the GLM-PCA log-likelihood (6), we first consider the much simpler problem of fitting the following model to data X ∈ R^{N×K}, y ∈ R^N:

(7)  y_i ∼ Pois(λ_i),  log λ_i = x_{i,:}^T β,  i = 1, …, N,

in which x_{i,:} denotes the ith row of X, and β = (β_1, …, β_K)^T ∈ R^K is the vector of K unknowns to be estimated from the data X, y. This is the standard generalized linear model (GLM) with a Poisson error distribution and log link function [11]. Therefore, we refer to (7) as the "Poisson GLM." For later convenience, we define a function that returns the maximum-likelihood estimate (MLE) of the coefficients β in the Poisson GLM:

(8)  Pois-GLM-MLE(X, y) := argmax_{β ∈ R^K} ℓ_pois-glm(X, y; β),

where ℓ_pois-glm(X, y; β) denotes the log-likelihood of the Poisson GLM (7).
We now relate the problem of computing the MLE for the Poisson GLM to our original problem. The log-likelihood for the Poisson GLM (7), excluding terms that do not involve β, is

(9)  ℓ_pois-glm(X, y; β) = ∑_{i=1}^N { y_i x_{i,:}^T β − exp(x_{i,:}^T β) }.

By rearranging (6), the GLM-PCA log-likelihood can be rewritten as a sum of Poisson GLM log-likelihoods in two different ways:

(10)  ℓ(U, V) = ∑_{i=1}^n ℓ_pois-glm(V, y_{i,:}; u_{i,:})
(11)          = ∑_{j=1}^m ℓ_pois-glm(U, y_{:,j}; v_{j,:}),

in which y_{i,:} ∈ R_+^m, y_{:,j} ∈ R_+^n denote vectors storing, respectively, the ith row and jth column of Y, and u_{i,:} ∈ R^K, v_{j,:} ∈ R^K denote vectors storing, respectively, the ith row of U and the jth row of V. This result, while straightforward algebraically, has an important implication: if we fix V, the GLM-PCA optimization problem reduces to a collection of n Poisson GLM optimization problems (eq. 10) and, similarly, if we fix U, it reduces to a collection of m Poisson GLM optimization problems (eq. 11). This suggests an alternating optimization approach in which we iterate between optimizing U with V fixed and optimizing V with U fixed. This is an example of a block-coordinate optimization algorithm [23] (also called "nonlinear Gauss-Seidel" [4]), in which the "blocks" are U and V. It is analogous to "alternating least squares" for matrix factorization with Gaussian errors, and therefore we call this algorithm "Alternating Poisson Regression" (APR). (Similarly, in non-negative matrix factorization, some algorithms reduce to solving a series of non-negative least squares problems [8,10].) This idea is summarized by Algorithm 1 in the main text.
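The alternating scheme can be sketched in a few lines of R. This is an illustrative outline only, with pois_glm_mle standing in for the Pois-GLM-MLE subroutine of eq. (8) and its beta0 argument for the warm start; it is not the fastglmpca implementation.

    # One outer-loop iteration of Alternating Poisson Regression (APR):
    # update each row of U with V fixed, then each row of V with U fixed.
    apr_update <- function (Y, U, V, pois_glm_mle) {
      for (i in 1:nrow(Y))    # n independent Poisson GLMs (eq. 10)
        U[i, ] <- pois_glm_mle(V, Y[i, ], beta0 = U[i, ])
      for (j in 1:ncol(Y))    # m independent Poisson GLMs (eq. 11)
        V[j, ] <- pois_glm_mle(U, Y[, j], beta0 = V[j, ])
      list(U = U, V = V)
    }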
The core computation of the APR algorithm is the MLE of the Poisson GLM. Therefore, in the next section we focus on the development of a fast solution to this core computation.

2.1. Fitting the Poisson GLM.
We compute the MLE for the Poisson GLM (7) using a cyclic coordinate descent (CCD) algorithm [4,23], in which each coordinate β_k is optimized one at a time while the others are held fixed. While iteratively reweighted least squares (IRLS) is the standard algorithm for fitting GLMs, including the Poisson GLM, the complexity of each IRLS iteration is O(K^3) due to the inversion or factorization of a K × K matrix. To handle large data sets, we would prefer a more efficient alternative that scales better to larger values of K. For this reason, we use the CCD approach, whose computational complexity grows only linearly with K.
The CCD algorithm performs the following 1-d optimization for each k = 1, …, K,

(12)  β_k ← argmax_{β_k ∈ R} ℓ_pois-glm(X, y; β),  holding β_{k′} fixed for all k′ ≠ k,

and repeats these 1-d optimizations until the iterates β^{(0)}, β^{(1)}, … reach a stationary point. (Although we have framed this as a maximum-likelihood estimation problem, the convention in most numerical optimization papers is to frame the optimization as a minimization problem, so we view our algorithms as minimizing the negative log-likelihood, hence "descent.") Note that this coordinate descent algorithm is different from the coordinate descent approach for GLMs in glmnet [7]; glmnet essentially implements IRLS, but performs coordinatewise updates to solve each weighted least-squares problem, whereas we perform the coordinatewise updates (12) on the original log-likelihood. In this respect, our approach is similar to the cyclic coordinate ascent algorithm for penalized logistic regression described in [24].
To further motivate the CCD algorithm, consider the first- and second-order partial derivatives of the Poisson GLM log-likelihood (9):

(13)  ∂ℓ_pois-glm/∂β_k = ∑_{i=1}^N (y_i − λ_i) x_ik
(14)  ∂²ℓ_pois-glm/∂β_k ∂β_{k′} = −∑_{i=1}^N λ_i x_ik x_{ik′},

in which λ_i := exp(x_{i,:}^T β). We showed previously that it is possible to easily orthogonalize U and V. When solving Pois-GLM-MLE(X, y), X will be either U or V, so we can assume that X is orthogonal simply by making it so. Although the K × K matrix of second-order partial derivatives (the Hessian) is not guaranteed to be diagonal even when X is orthogonal, it might be close to diagonal; that is, the diagonal elements may be much larger in magnitude than the off-diagonal elements. In the special case in which all the λ_i's are the same, the off-diagonal entries of the Hessian (eq. 14 with k ≠ k′) will be exactly zero, since ∑_i x_ik x_{ik′} = 0 when the columns of X are orthogonal. Therefore, in situations where the λ_i's are similar, the Hessian should be close to diagonal. Coordinate descent algorithms have good rates of convergence when the Hessian is diagonal or close to diagonal. Even when the Hessian is not close to diagonal, in practice coordinatewise algorithms often converge very rapidly. In summary, although the worst-case convergence behaviour of (cyclic) coordinate descent is poor, we view CCD as a practical approach that is very scalable and should have good convergence behaviour in most situations.
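The quantities in (13) and (14) are all that is needed for a Newton-type update of a single coordinate. The following R sketch (function name ours, for illustration) computes the gradient and the corresponding Hessian diagonal entry for coordinate k:

    # Gradient (13) and the diagonal entry of the Hessian (14 with k = k')
    # for coordinate k of the Poisson GLM log-likelihood.
    pois_glm_grad_hess <- function (X, y, beta, k) {
      lambda <- drop(exp(X %*% beta))      # Poisson rates, lambda_i
      c(g = sum((y - lambda) * X[, k]),    # first derivative, eq. (13)
        h = -sum(lambda * X[, k]^2))       # second derivative, eq. (14)
    }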

2.2. Other implementation details.
Here we describe other important aspects of the computer implementation.

2.2.1. Fitting the Poisson GLMs in parallel.
The n Poisson GLM optimizations in (10) and the m Poisson GLM optimizations in (11) are trivially parallelizable, so we can take advantage of multicore processors to greatly speed up the optimization. In the R package, we used Intel Threading Building Blocks (TBB) multithreading [6], interfaced via the RcppParallel R package [2], to solve the Poisson GLM subproblems in parallel.
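The same trivial parallelism can be illustrated at the R level with the parallel package; this toy sketch only shows the structure of the computation (fastglmpca instead parallelizes at the C++ level via TBB), and pois_glm_mle is the illustrative subroutine from the earlier sketch:

    # Update the rows of U in parallel with V fixed (eq. 10); each row is an
    # independent Poisson GLM. (mclapply forks processes; on Windows use a
    # cluster-based alternative such as parLapply.)
    library(parallel)
    update_U_parallel <- function (Y, U, V, pois_glm_mle, n_cores = 2) {
      rows <- mclapply(seq_len(nrow(Y)),
                       function (i) pois_glm_mle(V, Y[i, ], beta0 = U[i, ]),
                       mc.cores = n_cores)
      do.call(rbind, rows)
    }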

2.2.2. Initialization.
To initialize the model fitting, by default we set the entries of U and V to random values (drawn from a normal distribution). While other approaches have been suggested (e.g., [14]), we found that these initializations did not reliably lead to better solutions or faster convergence.
2.2.3. Stopping criteria. We stop the model fitting when a maximum number of outer-loop iterations in Algorithm 1 is reached (chosen by the user), or when two successive outer-loop iterates improve the log-likelihood (6) by less than some tolerance ε > 0. This tolerance is also set by the user; by default, ε = 10^{-4}.
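In code, the tolerance-based check amounts to comparing successive log-likelihood values; a minimal sketch:

    # Check convergence of the outer loop: stop when the improvement in the
    # log-likelihood (6) between successive iterates falls below epsilon.
    converged <- function (loglik_new, loglik_old, epsilon = 1e-4)
      (loglik_new - loglik_old) < epsilon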

2.2.4. CCD algorithm.
The CCD updates (12) were implemented using backtracking line search [15, Algorithm 3.1], with Newton search direction p_k = g_k/h_kk, where g_k and h_kk are obtained from the derivatives (13) and (14). While there is some debate as to whether line search is worth the added expense (e.g., [7]), in this setting we have found it to be important for avoiding divergent updates.
Fully optimizing U with V fixed (and fully optimizing V with U fixed) is possible, up to some numerical error, but in practice it is not necessary; because of the alternating nature of the algorithm, incompletely optimizing U and V provides considerable reductions in computation with little impact on convergence. Specifically, we approximated (12) with the solution obtained by cycling through the CCD updates a fixed number of times, with β initialized to row i of U or row j of V. The number of CCD cycles to perform is a tuning parameter in the software. We found that cycling through coordinates 1 through K three times worked well in practice, so we took this to be the software default. Although the theory on coordinate descent does not guarantee convergence unless we compute exact solutions, we have not encountered situations in which inexact solutions caused the iterates to diverge.
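Putting these pieces together, a minimal R sketch of the inexact CCD solver follows. The backtracking constants and the simplified acceptance rule (accept any non-decreasing step, rather than the full sufficient-increase condition of [15, Algorithm 3.1]) are ours, chosen for illustration; this is not the fastglmpca source code.

    # Approximate Pois-GLM-MLE(X, y) by a fixed number of CCD cycles
    # (default 3), with each 1-d update taking a Newton step, damped by
    # backtracking line search.
    pois_glm_ccd <- function (X, y, beta, n_cycles = 3,
                              shrink = 0.5, max_backtracks = 20) {
      loglik <- function (b) {            # log-likelihood (9)
        u <- drop(X %*% b)
        sum(y * u - exp(u))
      }
      for (cycle in 1:n_cycles) {
        for (k in 1:ncol(X)) {
          lambda <- drop(exp(X %*% beta))
          g <- sum((y - lambda) * X[, k])   # eq. (13)
          h <- sum(lambda * X[, k]^2)       # minus eq. (14), so h > 0
          p <- g / h                        # Newton (ascent) direction
          f0 <- loglik(beta)
          a <- 1                            # initial step size
          for (t in 1:max_backtracks) {
            beta_new <- beta
            beta_new[k] <- beta[k] + a * p
            if (loglik(beta_new) >= f0) {   # accept a non-decreasing step
              beta <- beta_new
              break
            }
            a <- shrink * a                 # otherwise, backtrack
          }
        }
      }
      beta
    }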

2.2.5. Acceleration of the outer-loop updates.
Since APR is a coordinatewise optimization method that alternates between optimizing U and optimizing V, one possible reason for slow convergence is strong interdependence between U and V [3,23]. To remedy this issue, we tried DAAREM, an off-the-shelf method for accelerating fixed-point iterations [9,18]. The DAAREM-accelerated variant of APR is the same as the original algorithm, except for an additional step at the end of each outer-loop iteration that computes a modified update. The computation involved in this additional step is negligible compared to the other computations.
We made two minor changes to the optimization that greatly improved the effectiveness of DAAREM for fitting GLM-PCA models. First, we found that DAAREM did not work well when U was made orthogonal before updating V, and vice versa, so we skipped this orthogonalization step when using DAAREM.
Second, we found that the DAAREM-modified updates behaved poorly when the iterates were changing rapidly and were far away from the solution.Therefore, in our numerical experiments we first performed 100 unaccelerated APR updates to obtain a good initial estimate, then we performed a second round of APR updates with DAAREM.
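To illustrate how APR fits into the fixed-point framework that DAAREM accelerates, here is a schematic using the daarem R package. The vectorization of (U, V) into a single parameter vector and the helper functions (apr_update from the earlier sketch, and the inline log-likelihood) are ours, for illustration only.

    # Schematic: wrap one APR outer-loop iteration as a fixed-point map so
    # that it can be accelerated by daarem.
    run_daarem_apr <- function (Y, U0, V0, pois_glm_mle) {
      n <- nrow(Y); m <- ncol(Y); K <- ncol(U0)
      loglik <- function (par) {          # log-likelihood (6)
        U <- matrix(par[1:(n*K)], n, K)
        V <- matrix(par[-(1:(n*K))], m, K)
        H <- tcrossprod(U, V)             # H = U V'
        sum(Y * H - exp(H))
      }
      fixpt <- function (par) {           # one APR iteration
        U <- matrix(par[1:(n*K)], n, K)
        V <- matrix(par[-(1:(n*K))], m, K)
        upd <- apr_update(Y, U, V, pois_glm_mle)
        c(upd$U, upd$V)
      }
      daarem::daarem(par = c(U0, V0), fixptfn = fixpt, objfn = loglik,
                     control = list(mon.tol = 0.05))
    }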
3. Comparisons in simulated data sets and scRNA-seq data sets. The source code implementing these experiments is provided in the "inst" directory within the GitHub repository, https://github.com/stephenslab/fastglmpca. The Zenodo repository [22] contains a snapshot of this source code near the time of publication.
3.1. Simulated data sets. We simulated data matrices Y of different sizes to assess the scalability of the model-fitting algorithms. Specifically, for each trial, we simulated Y using the generate_glmpca_data_pois() function from the fastglmpca package. This function randomly generates a matrix Y from the GLM-PCA model (1), with the parameters of the model chosen to roughly mimic scRNA-seq data. The value of K in the GLM-PCA model used to simulate the data was set to 5 or 10.
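A call of the following form generates one simulated data set; the argument names and the returned list element shown here are assumptions for the purpose of illustration, so consult the fastglmpca documentation for the exact interface.

    # Simulate an n x m counts matrix from the GLM-PCA model (1) with
    # parameters chosen to roughly mimic scRNA-seq data.
    library(fastglmpca)
    set.seed(1)
    sim <- generate_glmpca_data_pois(n = 2000, m = 500, K = 5)
    Y <- sim$Y   # the simulated matrix of Poisson counts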
3.2. Single-cell RNA-seq data sets. All scRNA-seq data sets were stored as sparse n × m matrices Y, where n was the number of genes and m was the number of cells.
Note that no additional steps were taken to preprocess the scRNA-seq UMI counts other than removing genes with no variation in the data.Since the GLM-PCA model is a Poisson model of UMI counts, no transformation or normalization of the counts is needed.

3.2.1. Mouse epithelial airway data set.
The mouse epithelial airway data were gene expression profiles of trachea epithelial cells in C57BL/6 mice obtained using droplet-based 3' scRNA-seq, processed using the GemCode Single Cell Platform [12]. The data were downloaded from the Gene Expression Omnibus (GEO) website, accession GSE103354. Specifically, we downloaded the file GSE103354_Trachea_droplet_UMIcounts.txt.gz. This file also contained predicted cell-type assignments from the community detection analysis of [12]. After removing genes that were not expressed in any of the cells, Y was an 18,388 × 7,193 matrix of UMI counts, and 90.7% of the counts were zero.
3.2.2. 68k PBMC data set. The 68k PBMC data were transcriptome profiles from peripheral blood mononuclear cells (PBMCs) processed using Cell Ranger 1.1.0 [25]. We downloaded the "Gene/cell matrix (filtered)" tar.gz file for the Fresh 68k PBMCs (Donor A) data set from the 10x Genomics website. We also downloaded the file 68k_pbmc_barcodes_annotation.tsv from https://github.com/10XGenomics/single-cell-3prime-paper, which contained predicted cell-type assignments from the analysis of [25]. After filtering out genes that were not expressed in any of the cells, Y was a 20,387 × 68,579 matrix of UMI counts, and 97.3% of the UMI counts were zero. In plots visualizing V, we defined "T cells" as all cells labeled as one of "CD4+ T Helper2", "CD4+/CD25 T Reg", "CD4+/CD45RA+/CD25- Naive T", "CD4+/CD45RO+ Memory" or "CD8+/CD45RA+ Naive Cytotoxic."

3.3. Methods compared. We compared three methods for fitting GLM-PCA models to the simulated and scRNA-seq data: the AvaGrad method implemented in the glmpca R package [21] (version 0.2.2.9000); the iteratively reweighted SVD (IRSVD) method implemented in the scGBM R package [13] (version 1.0.3); and the alternating Poisson regression (APR) method described in this paper and implemented in the fastglmpca R package (version 0.1-5). Since a key feature of the APR method is the straightforward parallelization of the updates, we compared two variants of fastglmpca, with and without parallelization.
Note that we did not compare the "Fisher scoring" method (optimizer = "fisher") implemented in glmpca [19]. We found this method to be unstable (indeed, this instability is documented in the R package), and others have found that it generally did not work as well as the AvaGrad method, so we did not include it in our comparisons.
For AvaGrad, we called the glmpca() function from the glmpca package with fam = "poi", minibatch = "stochastic" and optimizer = "avagrad". For IRSVD, we called the gbm.sc() function from the scGBM package. For APR, we first called init_glmpca_pois() to initialize the parameter estimates, then we called fit_glmpca_pois() to perform the updates. For the parallel (multithreaded) variant of APR, we used 28 threads. For the DAAREM-accelerated variant of APR, we again used 28 threads, and performed the optimization in two stages: we first performed an initial round of 100 updates without DAAREM, then we called fit_glmpca_pois() a second time with daarem = TRUE and orthonormalize = FALSE. We used version 0.7 of the daarem R package, setting the monotonicity tolerance (mon.tol) to 0.05. All other DAAREM tuning parameters were kept at their defaults. For all these calls, the remaining tuning parameters were kept at the software defaults, except when modifying the termination criteria to force the methods to continue optimizing for longer.
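For reference, the two-stage APR fit has roughly the following shape in R. The arguments daarem = TRUE and orthonormalize = FALSE are quoted from the description above; the other argument names (such as fit0 and maxiter) are assumptions about the fastglmpca interface, shown only to convey the structure of the procedure.

    # Stage 1: initialize, then run 100 APR updates without acceleration.
    # Stage 2: continue from that fit with DAAREM acceleration enabled and
    # the orthogonalization step disabled.
    library(fastglmpca)
    fit0 <- init_glmpca_pois(Y, K = 10)
    fit1 <- fit_glmpca_pois(Y, fit0 = fit0, maxiter = 100)
    fit2 <- fit_glmpca_pois(Y, fit0 = fit1, daarem = TRUE,
                            orthonormalize = FALSE)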
Unlike the other methods, the AvaGrad method is a stochastic optimization method whose performance can be sensitive to the choice of learning rate (despite the fact that AvaGrad was intended to reduce this sensitivity [17]). In particular, we found that the algorithm could be unstable at higher learning rates (indeed, the "maxTry" option in glmpca is set to 10 by default, to automatically restart the optimization in the case of numerical divergence). We experimented with different learning rates and found that lr = 1e-4 worked well, although it is possible that other learning rates could have worked better.
The three software implementations included a single row covariate (n_x = 1) and a single column covariate (n_z = 1) in the model (1) by default, and we kept this default setting. These covariates were defined identically in all implementations, as follows. The row covariate was a vector of ones, x_i1 = 1, with fixed coefficients b_j1 = log(∑_{i=1}^n y_ij / n). The column covariate was also a vector of ones, z_j1 = 1, with coefficients w_i1 to be estimated. When Y is a matrix of UMI counts with rows i corresponding to genes and columns j corresponding to cells, the row covariate defines a "size factor" [1] for each cell j as the mean UMI count per gene, and the column covariate encodes a baseline expression level for each gene i.
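For concreteness, the fixed coefficients defined above can be computed in one line of R (a sketch following the definitions given here, not code taken from any of the packages):

    # Fixed coefficients b_{j1} = log(mean UMI count per gene in cell j),
    # one value per column (cell) of Y.
    b1 <- log(colMeans(Y))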
For a fair comparison, we provided the same initial estimates of U and V to glmpca and fastglmpca. The entries of U and V were initialized to small random values drawn from the normal distribution with mean zero and standard deviation 10^{-5}/K.
The scGBM interface did not allow the initial estimates of U and V to be specified. Instead, the gbm.sc() function computes initial estimates based on Pearson residuals obtained by fitting a simple rank-2 GLM-PCA model to the data [14]. In practice, we have found that this initialization is better than our simple random initialization, but the downside is that it involves memory-intensive computations.
With the minibatch = "stochastic" setting, glmpca() does not output the exact log-likelihood achieved at each iteration, so we modified the code slightly to record it. These additional log-likelihood calculations have a negligible impact on the overall runtime. The modified code is available at https://github.com/eweine/glmpca.

3.4. Computing environment.
All analyses were performed in R 3.6.1 [16], linked to the OpenBLAS 0.2.19 optimized numerical libraries, on Linux machines (Scientific Linux 7.4) with Intel Xeon E5-2680v4 ("Broadwell") processors. Up to 56 GB of memory was needed to run glmpca and fastglmpca on the largest data set (68k PBMC). The scGBM computations were more memory-intensive, and needed up to 250 GB of memory on the largest data set. For the parallel variant of fastglmpca, we used 28 cores.

[Caption of Supplementary Figure 6: The V matrix returned by the different optimization methods when applied to the same 68k PBMC data set, after running each optimization for about 10 hours (K = 10). The fastglmpca result was obtained using parallel updates. Colors depict the predicted cell-type annotations from [25].]