Abstract
High-throughput single cell transcriptomics is rapidly emerging as the technique of choice for establishing a census of neurons in the nervous system. Integrating the resulting cell type census with a physiological and anatomical taxonomy has been difficult, as most techniques require the tissue to be dissociated before sequencing. The recently proposed patch-seq technique makes it possible to acquire multi-modal single cell data, in which RNA-seq data are collected together with physiological and morphological information from the same cells. The technique typically results in data sets that have many more dimensions (expression levels of genes and electrophysiological properties) than measurements (cells), making it computationally difficult to relate the two modalities. Here we present a framework based on sparse reduced-rank regression for obtaining an interpretable visualization of the relationship between high-dimensional transcriptomic data and electrophysiological information on the single-cell level.
Introduction
Since the days of Ramón y Cajal, neuroscientists have classified neurons into cell types, which are often considered the fundamental building blocks of neural circuits (Masland, 2004). Classically, these types have been defined based on their physiology or anatomy, but due to the recent rise of single cell transcriptomics, a definition of cell types based on genetics is becoming increasingly popular (Poulin et al., 2016). For example, high-throughput single cell transcriptomics approaches have been used to establish a census of neurons in the retina (Shekhar et al., 2016; Macosko et al., 2015) and the cortex (Tasic et al., 2016, 2017; Zeisel et al., 2015) of mice.
Despite this success, it has proven difficult to integrate the obtained cell type taxonomy based on the transcriptome with information about physiology and anatomy. On the cell type level, single genes have been shown to correlate with physiological properties in a meta-study of a brain-wide database of cell types characterized using microarrays and electrophysiology (Tripathy et al., 2017). To relate gene expression patterns to physiological characteristics on the single cell level, however, we need to obtain the transcriptome as well as electrophysiological measurements from the same cells and then integrate the resulting data sets.
The experimental capability to achieve this was developed with patch-seq (Cadwell et al., 2016; Fuzik et al., 2016; Cadwell et al., 2017; Földy et al., 2016), a technique that allows obtaining the transcriptome of cells characterized electrophysiologically or morphologically (Figure 1a). In contrast to other single cell transcriptomics experiments, these experiments are rather low throughput, resulting in a multi-modal dataset with particular statistical structure: for a few dozen cells, we have expression data on several thousand genes as well as dozens of electrophysiological measurements (Figure 1a). Integrating and properly visualizing genetic and physiological information in this n ≪ p regime requires specialized techniques that allow extracting an interpretable subset of genes and exploiting as much information about the relationship between genes and physiology as possible to increase statistical power.
Here we present a framework based on sparse reduced-rank regression for obtaining an interpretable visualization of the relationship between high-dimensional single cell transcriptomes and electrophysiological information obtained using techniques like patch-seq. The method yields an intuitive low-dimensional representation of key features of the data, relating the dominant gene expression patterns to the variation they predict in the electrophysiological space.
Results
To relate gene expression patterns to electrophysiological properties, one can use the genetic data to predict any given electrophysiological property (Cadwell et al., 2016). This is a regression problem: each gene is a predictor and a given electrophysiological property, such as the action potential (AP) threshold, is the response variable (Figure 1b). To predict multiple electrophysiological properties at the same time, one can combine the individual regression problems into a multivariate regression problem where the response is a multivariate vector (Figure 1c).
However, different electrophysiological properties tend to be strongly correlated, so one can construct a more parsimonious model in which gene expression predicts several latent factors that in turn predict all the electrophysiological properties together (Figure 1d). These latent factors form a “bottleneck” in the linear mapping and allow exploiting correlations between the predicted electrophysiological properties to increase statistical power. This is called reduced-rank regression (RRR) and can be solved by running principal component analysis (PCA) on the results of multivariate regression (see Methods). An attractive property of RRR is that it can be viewed not only as a prediction method, but also as a dimensionality reduction method, allowing visualization and exploration of the multi-modal dataset.
As there are over 20 thousand genes in the mouse genome and the typical sample size of a patch-seq data set is on the order of n ≈ 100, all of these regression problems are in the n ≪ p regime and need to be regularized. Here we use elastic net regularization, which combines the ℓ1 (lasso) and ℓ2 (ridge) penalties. This enforces sparsity and performs feature selection: only a small subset of genes is selected into the model while all other genes get zero regression coefficients (Figure 1e).
Mathematically, our method minimizes the following loss function (see Methods for details):

L = 1/(2n) ‖Y − XWV⊤‖² + λ (α ∑ᵢ ‖Wᵢ·‖ + (1−α)/2 ‖W‖²),   subject to V⊤V = I,

where X is the n × p matrix of gene expression levels, Y the n × q matrix of electrophysiological properties, the product WV⊤ is the p × q rank-r matrix of regression coefficients, ‖Wᵢ·‖ denotes the ℓ2 norm of the i-th row of W, and λ and α control the overall regularization strength and the lasso/ridge trade-off.
Our elastic net RRR extends a recently suggested sparse RRR (Chen and Huang, 2012) and can be implemented using the glmnet package (Friedman et al., 2010), a popular library for elastic net regression (see Methods). This model yields latent factors XW that can be interpreted as the low-dimensional genetic variability that is predictive of electrophysiological variability. Similarly, it allows interpreting YV as the low-dimensional electrophysiological variability that can be predicted from the genetic variability.
We applied our RRR approach to the patch-seq data set from Cadwell et al. (Cadwell et al., 2016) (see Methods for preprocessing). This data set encompasses two classes of interneurons, single bouquet cells (SBC) and elongated neurogliaform cells (eNGC), and contains n = 44 neurons. We used a permutation approach to estimate the rank r (the dimensionality of the bottleneck) given the available data; for this data set, we obtained r = 2 (see Methods). We then used cross-validation (see Methods) to select the optimal values of the regularization parameters α and λ (Figure 2a,b). At the optimum, the model achieved a test-set R² ≈ 0.22, a test-set correlation between the first pair of components of ρ1 ≈ 0.75, and a test-set correlation between the second pair of components of ρ2 ≈ 0.5.
We can use the bottleneck representation in the RRR model (XW) to visualize and explore the genetic data (Figure 2c). The first component is clearly associated with the cell type, whereas the second one is uncorrelated with cell type and corresponds to within-type variation. Only 24 genes are selected into the model; they are shown as lines in Figure 2c, where each line represents a gene's correlations with RRR components 1 and 2 (the circle shows the maximum possible correlation). In the PCA literature, such a visualization is called a “biplot” and we adopt this terminology here. Of the 24 genes, 22 are strongly correlated with the first component, with 18 of them having higher expression in the eNGC cells and four having higher expression in the SBC cells. Many of these genes were identified as differentially expressed in the original publication (Cadwell et al., 2016). The two remaining genes are strongly correlated with the second component.
We can visualize the electrophysiological space in a similar manner (using YV) as a biplot with all 11 available electrophysiological properties (Figure 2d). Comparing the directions of variables across the two biplots can suggest which electrophysiological variables are associated with which genes (e.g. the AP threshold is positively correlated with the Rpn2 expression level). We suggest calling such a pair of RRR biplots a “bibiplot”.
One important caveat is that the list of selected genes (Figure 2c) should not be interpreted as definitive, for two reasons. First, the model performance (Figure 2a,b) was unaffected across a range of regularization parameters corresponding to selecting from ∼10 to ∼50 genes, meaning that the choice of regularization strength within this interval remains an analyst's call. As an example, we show gene-space biplots with 10 and 40 genes in Figure S1. Second, even for fixed regularization parameters, a somewhat different set of genes may be selected each time the model is bootstrapped (this is often true for lasso-regularized models, especially when n ≪ p). We show the frequencies with which some genes are selected during bootstrapping in Figure 2e. There is an interplay between these two factors: stronger ℓ1 regularization leads to a sparser model with lower bootstrap reliability, whereas weaker ℓ1 regularization leads to a less sparse model with higher bootstrap reliability.
The RRR biplots can be compared to PCA biplots constructed in the gene space and in the electrophysiological space independently of each other (Figure S2; a similar analysis on another data set with n = 11 neurons was done in (Harris et al., 2017)). In the electrophysiological space, the RRR biplot is almost identical to the PCA biplot, meaning that our RRR model explains the dominant modes of variation among the dependent variables. In the gene space, the situation is different: the first PCA component also separates the two cell types (albeit less clearly), but the second component is only weakly related to PC2 in the electrophysiological space (correlations between the first/second pairs of components: 0.74 and 0.22). Moreover, PCA in the gene space is not sparse, making the biplot practically impossible to display and interpret, as it would have to show thousands of genes.
In addition, we applied our framework to the patch-seq dataset from Fuzik et al. (Fuzik et al., 2016), which encompasses n = 80 inhibitory and excitatory neurons from layers 1/2 of mouse somatosensory cortex (Figure 3). The first RRR component strongly separated excitatory from inhibitory neurons, which is not surprising given the large differences in gene expression and in firing patterns between these two classes of neurons. At the same time, applying PCA separately to each modality results in pronounced class differences along multiple PCs (PC1, PC2, and some others; Figure S3), whereas RRR is capable of identifying and isolating this co-variation between modalities in the first component.
However, the second RRR component did not seem to carry much signal in this dataset. Its test-set correlation was very weak (Figure 3b), in particular when the regularization was strong enough to select fewer than 100 genes. Cross-validation indicated that the second RRR component became more pronounced when several hundred genes or more were used (Figure 3a,b), but given that this signal could not be attributed to a smaller set of genes, we suspect that it represents some unidentified experimental bias. Also, when running RRR separately on the inhibitory and the excitatory classes, we were unable to identify meaningful RRR components. For these reasons, we here chose values of the regularization parameters that did not yield any genes associated with the second component.
Discussion
We suggested regularized cross-validated sparse reduced-rank regression as a tool for interpretable data exploration and visualization of patch-seq datasets. It makes it possible to visualize the variability across cells in the transcriptomic and electrophysiological modalities in a consistent way, and to find a sparse set of genes explaining the electrophysiological variability. Cross-validation allows estimating the out-of-sample validity of the model. We expect that our method will also be relevant beyond the scope of patch-seq data: spatial transcriptomics (Lein et al., 2017) combined with two-photon imaging may allow characterizing the transcriptome and physiology of individual cells in intact tissue, yielding large multi-modal data sets. Similarly, other types of “multi-omics” data, where single-cell or bulk transcriptomic data are combined with some other type of measurements (e.g. chemical, medical, or even behavioural), may benefit from interpretable visualization techniques.
Reduced-rank regression is closely related to two other classical dimensionality reduction methods that analyze two data matrices (“two views”) together: canonical correlation analysis (CCA) and partial least squares (PLS). These can be understood as looking for projections with maximal correlation (CCA) or maximal covariance (PLS), whereas RRR looks for projections with maximal explained variance in Y. In recent years, multiple approaches to sparse CCA and sparse PLS have been suggested (Witten et al., 2009; Wilms and Croux, 2015; Lê Cao et al., 2008; Chun and Keleş, 2010), among others. Here, we chose sparse RRR as the core of our framework because it seemed more meaningful to predict electrophysiological properties from transcriptomic data than to treat the two modalities symmetrically. Moreover, sparse RRR allows a mathematically simple formulation for rank r > 1 (using group lasso, see Methods) and can be conveniently built on top of existing implementations of elastic net regression.
To regularize the RRR model and to achieve sparse solutions, we used the elastic net penalty. It has two parameters, α and λ, and cross-validation will often indicate that α can be varied over some range without affecting the validation performance (see e.g. Figure 2a). This allows the researcher to control the trade-off between a sparser solution and a more comprehensive gene selection: if there is a set of genes that are highly correlated with each other, then large α will tend to select only one of them, whereas small α will tend to assign similar weights to all of them. Using α = 1 corresponds to RRR with pure lasso regularization as suggested in (Chen and Huang, 2012). In the datasets analyzed here, we found that α ≈ 0.5 yielded a good compromise.
In principle, it would be possible to generalize this regression framework to nonlinear mappings, using e.g. a neural network with a bottleneck instead of the low-rank linear mapping shown in Figure 1e. This can be an interesting direction for future research, but fitting such models would require much larger sample sizes than currently available for patch-seq data.
Python code for this manuscript is available at https://github.com/berenslab/patch-seq-rrr.
Methods
Data
For the data from Cadwell et al. (Cadwell et al., 2016) we used RPKM values as gene expression data; for the Fuzik et al. data set (Fuzik et al., 2016) we used UMI counts. In the Cadwell et al. data set there are n = 51 interneurons (two of the 53 sequenced interneurons were excluded in the original publication as “contaminated”), p = 15074 genes identified by the authors as “detected”, and q = 11 electrophysiological properties. In the Fuzik et al. data set there are n = 83 cells, p = 24378 genes after excluding ERCC spike-ins, and q = 89 electrophysiological properties. Of the 83 sequenced cells, we were only able to match n = 80 to the electrophysiological data, and we used only the q = 80 electrophysiological properties for which data were available for all of these cells (the fact that n = q = 80 is coincidental).
We performed library size normalization by dividing the values for each cell by that cell's sum over all genes (the “library size”) and multiplying the result by the median library size across all cells:

xᵢⱼ ← xᵢⱼ · medianₖ(∑ₗ xₖₗ) / ∑ₗ xᵢₗ,

where xᵢⱼ denotes the expression value of gene j in cell i.
We then log-transformed the data using log2(x + 1) transformation. We excluded all cells for which at least one electrophysiological property was not estimated. Further, we excluded all genes that had exactly zero expression for all remaining cells. Finally, we standardized all gene expression values and all electrophysiological properties (to zero mean and unit variance).
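As an illustration, this preprocessing pipeline can be sketched in Python as follows (a minimal sketch; the array names counts and ephys are hypothetical, and the actual implementation is in the repository linked above):

```python
import numpy as np

def preprocess(counts, ephys):
    """Library-size normalization, log-transform, and standardization.

    counts: (n_cells, n_genes) expression matrix; ephys: (n_cells, n_properties)
    electrophysiological properties, with NaN for missing values."""
    # Library-size normalization: scale each cell to the median library size
    libsize = counts.sum(axis=1, keepdims=True)
    X = counts / libsize * np.median(libsize)
    # Log-transform
    X = np.log2(X + 1)
    # Exclude cells with any missing electrophysiological property,
    # then genes with zero expression in all remaining cells
    keep = ~np.isnan(ephys).any(axis=1)
    X, Y = X[keep], ephys[keep]
    X = X[:, X.sum(axis=0) > 0]
    # Standardize both modalities to zero mean and unit variance
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    Y = (Y - Y.mean(axis=0)) / Y.std(axis=0)
    return X, Y
```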
For the Cadwell et al. data set, this yielded a final sample size of n = 44, p = 15054 genes, and q = 11 electrophysiological properties. We restricted the gene pool to the p = 3000 most variable genes (the same ones identified in the original publication) for all our analyses. We used the expert classification of cells into two classes performed in the original publication for annotating cell types. Of the n = 44 cells, only 35 were classified unambiguously (score 1 or score 5 on a scale from 1 to 5); the remaining 9 cells received intermediate scores.
Using the Cadwell et al. data, we tried two modifications of the above preprocessing pipeline: first, we left out the standardization of the transcriptomic data; second, we left out feature selection and used all available genes instead of the 3000 most variable ones. In both cases the cross-validated R² was somewhat lower than with our default pipeline, but when using α ≈ 0.5 and the value of λ needed to select ∼20 genes, we obtained very similar RRR projections.
For the Fuzik et al. data set, the same preprocessing pipeline yielded n = 80, p = 13089, and q = 80. We selected the p = 1384 genes with average expression above 0.5 (before standardization) for the RRR analysis. The n = 80 cells were classified in the original publication into L2, L4, and L5 excitatory neurons and into five classes of interneurons (labeled T1 to T5 in Figure 3).
All data sets were provided by the authors.
Algorithm
We consider two data matrices, X of n × p size and Y of n × q size that contain two sets of measurements on the same n samples. We assume that both matrices are centered, i.e. column means have been subtracted.
For simplicity, we first consider the special case of rank r = 1. The loss function of reduced-rank regression (RRR) in this case can be written as

L = ‖Y − Xwv⊤‖²,

where without loss of generality it is convenient to require that ‖v‖ = 1. Here and below all matrix norms are Frobenius norms. The product wv⊤ forms the matrix of regression coefficients, which has rank r = 1. This decomposition allows interpreting w as a mapping that transforms X into a latent variable and v as a mapping that transforms the latent variable into Y (Figure 1e).
RRR can be solved directly using the singular value decomposition (SVD). Indeed, the loss can be decomposed into the ordinary least squares (OLS) loss and a low-rank approximation loss:

‖Y − Xwv⊤‖² = ‖Y − XB̂OLS‖² + ‖XB̂OLS − Xwv⊤‖²,

where B̂OLS = (X⊤X)⁻¹X⊤Y is the solution of the un-penalized OLS regression. The first term corresponds to the variance of Y that is un-explainable by any linear model. The minimum of the second term can be obtained using the SVD of XB̂OLS: the right singular vector corresponding to the largest singular value gives v̂, and ŵ = B̂OLSv̂.
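For concreteness, this closed-form solution can be sketched in a few lines of numpy (assuming centered X and Y and a well-posed OLS problem; in the n ≪ p regime one would use the regularized version described next):

```python
import numpy as np

def rrr(X, Y, rank):
    """Unregularized reduced-rank regression via OLS followed by SVD."""
    B_ols = np.linalg.lstsq(X, Y, rcond=None)[0]       # (X^T X)^{-1} X^T Y
    _, _, Vt = np.linalg.svd(X @ B_ols, full_matrices=False)
    V = Vt[:rank].T             # top right singular vectors of X B_ols
    W = B_ols @ V               # regression coefficients are W V^T
    return W, V
```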
We now add the elastic net penalty to the loss function, which linearly combines the lasso (ℓ1-norm) and ridge (ℓ2-norm) penalties:

L = ‖Y − Xwv⊤‖² + λ₁‖w‖₁ + λ₂‖w‖².
The penalties are only applied to the vector w because the vector v has a fixed ℓ2 norm anyway, and an ℓ1 penalty would be inappropriate because we do not wish to make it sparse. This optimization problem is biconvex and can be solved with an iterative “alternating” approach: in turn, we fix v and find the optimal w, then fix w and find the optimal v, repeating until convergence.
For fixed w, the penalty terms are constant and the least-squares term can be written as

‖Y − Xwv⊤‖² = ‖Y‖² − 2 v⊤Y⊤Xw + ‖Xw‖²,

which is minimized when v is aligned with Y⊤Xw, i.e.

v = Y⊤Xw / ‖Y⊤Xw‖.
For fixed v, the least-squares term can be rewritten as

‖Y − Xwv⊤‖² = ‖Yv − Xw‖² + const,

meaning that the loss is equivalent to

L = ‖Yv − Xw‖² + λ₁‖w‖₁ + λ₂‖w‖².
This is the loss of elastic net regression of Yv on X, and so the optimal w can be obtained using any of the many available elastic net libraries. We used glmnet (Friedman et al., 2010), which is readily available for Matlab, Python, and R. It uses the following parameterization of the loss, which we also adopt here:

L = 1/(2n) ‖Yv − Xw‖² + λ (α‖w‖₁ + (1−α)/2 ‖w‖²).
Here α controls the trade-off between the lasso and the ridge and λ controls the overall regularization strength.
Now we can consider the general case of arbitrary rank r. Instead of the vectors w and v we now have matrices W and V of p × r and q × r shape, respectively. Without loss of generality, it is convenient to constrain the encoder matrix V to have orthonormal columns: V⊤V = I. The RRR loss term and the ridge term have the same form as before, but the lasso term needs to be modified. We want the matrix W to be sparse in the sense that some of the genes are left out of the model entirely, meaning that entire rows of W, and not just its individual elements, should be zeroed out. This can be achieved with a group lasso penalty that sums the ℓ2 norms of the rows of W. This is lasso in disguise, because it can be seen as the ℓ1 norm of the vector of row norms. Conveniently, glmnet allows fitting such models using the family="mgaussian" option.
Using the glmnet-like parameterization, the loss is

L = 1/(2n) ‖Y − XWV⊤‖² + λ (α ∑ᵢ ‖Wᵢ·‖ + (1−α)/2 ‖W‖²),

where ‖Wᵢ·‖ denotes the ℓ2 norm of the i-th row of W.
For fixed V, the optimal W can be obtained by glmnet. For fixed W, we have an instance of the orthogonal Procrustes problem (Gower and Dijksterhuis, 2004): using the same argument as in the rank-one case above, we need to maximize tr(Y⊤XWV⊤). This can be achieved via the “thin” SVD of Y⊤XW: if the left and right singular vectors are stacked in the columns of L and R respectively, then the optimal V is given by LR⊤. We provide a short proof below.
Given that the loss function is biconvex but possibly not jointly convex in V and W, it is important to choose a reasonable initialization. We initialized V by the r leading right singular vectors of X⊤Y and found this strategy to work well.
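To make the algorithm concrete, here is a minimal Python sketch of the alternating procedure. It substitutes scikit-learn's MultiTaskElasticNet for glmnet with family="mgaussian" (both minimize the same group-lasso parameterization of the loss given above); see the linked repository for the actual implementation:

```python
import numpy as np
from sklearn.linear_model import MultiTaskElasticNet

def sparse_rrr(X, Y, rank=2, lam=1.0, alpha=0.5, n_iter=100, tol=1e-6):
    """Sparse RRR via alternating minimization (sketch).

    X: (n, p) centered predictors; Y: (n, q) centered responses.
    lam and alpha play the roles of glmnet's lambda and alpha."""
    # Initialize V with the r leading right singular vectors of X^T Y
    _, _, Vt = np.linalg.svd(X.T @ Y, full_matrices=False)
    V = Vt[:rank].T
    W = np.zeros((X.shape[1], rank))
    for _ in range(n_iter):
        # W-step: group elastic net regression of YV on X; entire rows of W
        # (genes) are zeroed out jointly across all rank components
        enet = MultiTaskElasticNet(alpha=lam, l1_ratio=alpha,
                                   fit_intercept=False, max_iter=10000)
        W_new = enet.fit(X, Y @ V).coef_.T              # coef_ is (rank, p)
        # V-step: orthogonal Procrustes, V = L R^T from the SVD of Y^T X W
        L, _, Rt = np.linalg.svd(Y.T @ X @ W_new, full_matrices=False)
        V = L @ Rt
        if np.linalg.norm(W_new - W) < tol:
            W = W_new
            break
        W = W_new
    return W, V
```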
Relaxed elastic net
It is well known that the elastic net, or even the lasso penalty on its own, can lead to an over-shrinkage effect whereby the non-zero coefficients are shrunk too strongly. Several suggestions on how to address this problem have been made in the literature (Efron et al., 2004; Zou and Hastie, 2005; Meinshausen, 2007). For example, relaxed lasso (Meinshausen, 2007) performs lasso regression with a penalty λ₁ and then, using only the terms with non-zero coefficients, performs another lasso regression with a different penalty λ₂. If λ₂ = 0, this is also called the “LARS-OLS hybrid” (Efron et al., 2004). Similar procedures for the elastic net are less established. We found that we obtain an improvement if, after RRR with the elastic net penalty with coefficients λ and α, we take only the genes with non-zero coefficients and run RRR with a pure ridge penalty (α = 0) and the same value of λ. This procedure does not introduce any additional tuning parameters but substantially outperformed pure elastic net RRR on our data.
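A sketch of this relaxation, reusing sparse_rrr() from above; for simplicity the ridge refit is written in closed form rather than via glmnet:

```python
import numpy as np

def relaxed_sparse_rrr(X, Y, rank=2, lam=1.0, alpha=0.5, n_iter=20):
    """Relaxed elastic net RRR (sketch): select genes with the elastic net,
    then refit on the selected genes with a pure ridge penalty (alpha = 0)
    at the same lambda, alternating ridge and Procrustes steps."""
    W, V = sparse_rrr(X, Y, rank=rank, lam=lam, alpha=alpha)
    sel = np.abs(W).sum(axis=1) > 0          # non-zero rows = selected genes
    Xs, n = X[:, sel], X.shape[0]
    # Ridge W-step in closed form, matching the glmnet-style loss with alpha = 0:
    # (X^T X + n*lam*I) W = X^T Y V
    G = Xs.T @ Xs + n * lam * np.eye(Xs.shape[1])
    for _ in range(n_iter):
        Ws = np.linalg.solve(G, Xs.T @ (Y @ V))
        L, _, Rt = np.linalg.svd(Y.T @ Xs @ Ws, full_matrices=False)
        V = L @ Rt                           # Procrustes V-step
    W_relaxed = np.zeros_like(W)
    W_relaxed[sel] = Ws
    return W_relaxed, V
```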
Cross-validation
We used repeated k-fold cross-validation (CV) to select the values of λ and α. We used two measures of performance: (i) the test-set reconstruction error ‖Ytest − XtestŴV̂⊤‖² and (ii) the test-set correlation coefficients corr(Ytestv̂, Xtestŵ) for each pair of columns of Ŵ and V̂. The correlation is not directly optimized by RRR, but it is arguably a more intuitive metric and is what we are mostly looking for when assessing how well the two biplots in a bibiplot match each other.
We found it convenient to work with k ≈ 10. Leave-one-out CV is less applicable in this case because it does not allow computing test-set correlations. We found that the cross-validation curves were quite sensitive to the random splitting into k folds. For that reason we used repeated cross-validation, averaging all error estimates across 100 random splits into folds.
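A sketch of this procedure, using the sparse_rrr() sketch from above (the R² estimate assumes standardized Y, as in our preprocessing):

```python
import numpy as np
from sklearn.model_selection import KFold

def cv_performance(X, Y, rank=2, lam=1.0, alpha=0.5, k=10, n_repeats=10, seed=0):
    """Repeated k-fold CV (sketch): mean test-set R^2 and mean per-component
    correlation between the X and Y projections."""
    rng = np.random.RandomState(seed)
    r2, corrs = [], []
    for _ in range(n_repeats):
        kf = KFold(n_splits=k, shuffle=True, random_state=rng.randint(1 << 30))
        for train, test in kf.split(X):
            W, V = sparse_rrr(X[train], Y[train], rank=rank, lam=lam, alpha=alpha)
            res = Y[test] - X[test] @ W @ V.T
            r2.append(1 - np.sum(res**2) / np.sum(Y[test]**2))
            corrs.append([np.corrcoef(X[test] @ W[:, j], Y[test] @ V[:, j])[0, 1]
                          for j in range(rank)])
    return np.mean(r2), np.mean(corrs, axis=0)
```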
Biplots
For any linear dimensionality reduction method that reduces dataset X with n samples and p columns to XW with 2 columns, we construct the corresponding biplot as follows.
The scatter plot shows n points with x-coordinates given by XW·1 and y-coordinates given by XW·2, both standardized to have unit variance. The p lines show correlations between the original variables in X and the projections XW, such that the i-th variable is represented as a vector with coordinates (corr(X·i, XW·1), corr(X·i, XW·2)). It is convenient to scale these vectors by a constant factor γ for better visibility; we used γ = 2. We also display a circle of radius γ that shows the maximal possible extent of the vectors (assuming that the columns of XW are uncorrelated).
If W is sparse, then we only show the variables corresponding to non-zero rows of W (even though other variables can also have non-zero correlations with XW). In case of reduced-rank regression, we use XW as the dimensionality reduction mapping for X and YV as the dimensionality reduction mapping for Y.
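In code, such a biplot can be constructed as follows (a matplotlib sketch; names is a hypothetical list of variable labels, and shown would hold the indices of the non-zero rows of W):

```python
import numpy as np
import matplotlib.pyplot as plt

def biplot(X, XW, names, gamma=2.0, shown=None):
    """Biplot (sketch): scatter of the standardized 2-d projection XW, plus
    one line per shown variable, with coordinates given by its correlations
    with the two components, scaled by gamma."""
    Z = XW / XW.std(axis=0)                        # standardize components
    plt.scatter(Z[:, 0], Z[:, 1], s=10)
    shown = range(X.shape[1]) if shown is None else shown
    for i in shown:
        cx = np.corrcoef(X[:, i], XW[:, 0])[0, 1]
        cy = np.corrcoef(X[:, i], XW[:, 1])[0, 1]
        plt.plot([0, gamma * cx], [0, gamma * cy], 'k-', lw=0.5)
        plt.text(gamma * cx, gamma * cy, names[i], fontsize=8)
    # Circle of radius gamma: the maximal possible extent of the vectors
    plt.gca().add_patch(plt.Circle((0, 0), gamma, fill=False, ls='--'))
    plt.gca().set_aspect('equal')
```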
Permutation-based rank estimation
We used a permutation-based procedure to estimate the rank of the linear mapping between X and Y (Figure S4). First, we estimate the dimensionality of X as follows. PCA on X yields a sequence of eigenvalues sorted in decreasing order. Randomly permuting (shuffling) the values of X within each column separately, we obtain a dataset X̃ that preserves the marginal variances (and hence the sum of the PCA eigenvalues) but sets all population correlations to zero. Consequently, PCA on X̃ yields a sequence of decreasing eigenvalues under the null hypothesis that all variables are uncorrelated. We repeat this nrep = 100 times and estimate the dimensionality d of X as the number of its eigenvalues that lie above the 95th percentile of the corresponding shuffled eigenvalues. This procedure yields d = 13 and d = 16 for the Cadwell et al. and the Fuzik et al. datasets, respectively.
Now we apply PCA to reduce the dimensionality of X to d. Let us call this reduced dataset Z. We perform unregularized RRR of Y on Z by doing SVD of ZB̂OLS, as described above. This yields a sequence of d singular values sorted in decreasing order. Randomly permuting the rows of Z yields a dataset Z̃ that has exactly the same covariance matrix as Z but is statistically independent of Y. RRR gives a sequence of decreasing singular values under the null hypothesis that Z and Y are unrelated. We repeat this nrep = 100 times and estimate the rank r of the linear mapping as the number of singular values that are above the 95th percentile of the shuffled singular values. This procedure yields r = 2 and r = 3 for the Cadwell et al. and the Fuzik et al. datasets, respectively. For both datasets we used RRR with rank r = 2.
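Both permutation tests can be sketched as follows (assuming centered X and Y, as produced by our preprocessing):

```python
import numpy as np

def estimate_rank(X, Y, n_rep=100, pct=95, seed=0):
    """Permutation-based rank estimation (sketch): first estimate the PCA
    dimensionality d of X against column-shuffled surrogates, then count the
    RRR singular values exceeding those of row-shuffled surrogates."""
    rng = np.random.RandomState(seed)

    def pca_eigenvalues(A):
        return np.linalg.svd(A, compute_uv=False) ** 2

    def shuffle_within_columns(A):
        return np.column_stack([rng.permutation(A[:, j])
                                for j in range(A.shape[1])])

    # Step 1: dimensionality of X (eigenvalues above the shuffled ones)
    eig = pca_eigenvalues(X)
    null = np.array([pca_eigenvalues(shuffle_within_columns(X))
                     for _ in range(n_rep)])
    d = np.sum(eig > np.percentile(null, pct, axis=0))

    # Step 2: reduce X to d PCs and compare the RRR singular values to
    # surrogates with shuffled rows (same covariance, unrelated to Y)
    U, S, _ = np.linalg.svd(X, full_matrices=False)
    Z = U[:, :d] * S[:d]

    def rrr_singular_values(Z):
        B = np.linalg.lstsq(Z, Y, rcond=None)[0]
        return np.linalg.svd(Z @ B, compute_uv=False)

    sv = rrr_singular_values(Z)
    null = np.array([rrr_singular_values(Z[rng.permutation(len(Z))])
                     for _ in range(n_rep)])
    return np.sum(sv > np.percentile(null, pct, axis=0))
```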
Bootstrapping
We used bootstrapping to estimate the reliability of the gene selection. The following procedure was repeated nrep = 100 times: we sampled n out of n cells with replacement and applied the regularized RRR to the resulting data. This allows estimating the frequency with which each gene gets selected into the model. We used the same values of the regularization parameters for all bootstrap repetitions.
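A sketch of the bootstrap, again reusing sparse_rrr() from above:

```python
import numpy as np

def bootstrap_selection(X, Y, rank=2, lam=1.0, alpha=0.5, n_rep=100, seed=0):
    """Gene-selection frequencies (sketch): resample cells with replacement,
    refit sparse RRR with fixed regularization, and count how often each
    gene enters the model with a non-zero row of W."""
    rng = np.random.RandomState(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_rep):
        idx = rng.randint(n, size=n)     # sample n cells with replacement
        W, _ = sparse_rrr(X[idx], Y[idx], rank=rank, lam=lam, alpha=alpha)
        counts += np.abs(W).sum(axis=1) > 0
    return counts / n_rep                # selection frequency per gene
```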
Procrustes problem
Given A, the problem is to maximize tr(AV⊤) subject to V⊤V = I. Let A = LQR⊤ denote the “thin” and A = L̃Q̃R⊤ the “full” SVD of A (note that the matrix R is the same in both). Now we have:

tr(AV⊤) = tr(L̃Q̃R⊤V⊤) = tr(Q̃R⊤V⊤L̃) = ∑ᵢ Qᵢᵢ (R⊤V⊤L̃)ᵢᵢ = ∑ᵢ Qᵢᵢ Hᵢᵢ.
Here H = R⊤V⊤L̃ is a matrix with orthonormal rows, as can be verified directly, and so all of its elements are at most one in absolute value. It follows that the whole trace cannot be larger than the sum of the singular values of A. Using V = LR⊤ yields exactly this value of the trace, hence it is the optimum.
Acknowledgements
This work was funded by the German Ministry of Education and Research (FKZ 01GQ1601), the German Research Foundation (EXC307, BE5601/4-1) and the National Institutes of Health BRAIN Initiative (1U19MH114830-01).
We thank Andreas Tolias, Rickard Sandberg, Cathryn Cadwell, Jiaolong Xiang and Frederico Scala for discussion, Cathryn Cadwell and Janos Fuzik and their co-authors for making their data available and Shreejoy Tripathy for help with data processing.
PB and DK conceptualized the project, DK and MW developed statistical methods and wrote the software, PB supervised the project, DK and PB wrote the paper.
The authors declare that they have no competing financial interests.
Correspondence and requests for materials should be addressed to P.B. (email: philipp.berens{at}uni-tuebingen.de).