ABSTRACT
The cell cycle is a highly conserved, continuous process which controls faithful replication and division of cells. Single-cell technologies have enabled increasingly precise measurements of the cell cycle both as a biological process of interest and as a possible confounding factor. Despite its importance and conservation, there is no universally applicable approach to infer position in the cell cycle at high resolution from single-cell RNA-seq data. Here, we present tricycle, an R/Bioconductor package, to address this challenge by leveraging key features of the biology of the cell cycle, the mathematical properties of principal component analysis of periodic functions, and the ubiquitous applicability of transfer learning. We show that tricycle can predict any cell’s position in the cell cycle regardless of the cell type, species of origin, and even sequencing assay. The accuracy of tricycle compares favorably to gold-standard experimental assays which generally require specialized measurements in specifically constructed in vitro systems. Unlike gold-standard assays, tricycle is easily applicable to any single-cell RNA-seq dataset. Tricycle is highly scalable, universally accurate, and eminently pertinent for atlas-level data.
INTRODUCTION
The cell cycle is the biological process which controls faithful replication and division of cells across all species of life. Despite existing as a continuous process, the cell cycle has historically been characterized as having four discrete stages during which the cell performs growth and maintenance (G1), replicates its DNA (S), increases further in size and prepares for mitosis (G2), and undergoes mitosis and cytokinesis (M). The cell cycle is a highly conserved mechanism with an integral role in generating the diversity of cell types within multicellular organisms. As a result, maladaptive modifications of the cell cycle can have devastating consequences in development and disease (McConnell and Kaznowski, 1991; Ambros, 1999; Ohnuma and Harris, 2003). Despite its importance, many of the molecular mechanisms regulating and interacting with the cell cycle remain poorly understood.
High-throughput expression data have been used to study the cell cycle since the seminal work on the yeast cell cycle by Spellman et al. (1998) and Cho et al. (1998) at the dawn of the microarray era. This work used various approaches to synchronize cells in specific cell cycle stages, followed by assaying cells in bulk. The data from Spellman et al. (1998) were later used by Alter et al. (2000) to show that principal component analysis reveals a circular pattern which represents the cyclical nature of the cell cycle; this is widely cited as one of the first examples of the use of principal component analysis and singular value decomposition in the analysis of high-throughput expression data. Subsequent work sought to systematically identify both periodically expressed genes and cell cycle marker genes and deposited these into widely used databases (Whitfield et al., 2002; Gauthier et al., 2008).
Single-cell technologies have made it possible to study the effects of the cell cycle in multicellular organisms with a degree of sensitivity and accuracy previously available only in monocellular or clonal systems. Thus, the cell cycle has been the subject of substantial interest, both as a biological variable of interest and as a possible confounding feature for other comparisons of interest (Buettner et al., 2015). A number of methods have been developed to estimate cell cycle state from single-cell expression data (Leng et al., 2015; Scialdone et al., 2015; Liu et al., 2017; Stuart et al., 2019; Hsiao et al., 2020; Schwabe et al., 2020). These methods differ broadly in the definition of cell cycle state (discrete stages vs. continuous pseudotime) as well as the use of special training data. Most of these methods have been demonstrated to be effective on datasets consisting of a single cell type. Despite the conservation of the cell cycle process, none of these methods have been shown to be applicable across single-cell technologies and mammalian tissues.
RESULTS
Transfer learning
To develop a universal method for estimating a continuous cell cycle pseudotime for a single-cell expression data set independent of technology, cell type, or species, we leverage transfer learning via dimensionality reduction (Pan et al., 2008). We define a reference cell cycle embedding (or latent space) into which we project a new data set; an approach originally advocated for in Stein-O’Brien et al. (2019). After projection, we infer cell cycle pseudotime as the polar angle around the origin. This pseudotime variable takes values in [0, 2π); it is unrelated to wall time and instead represents progression through the cell cycle phases. We refer to this pseudotime variable as cell cycle position to avoid confusion with wall time and to emphasize its periodic nature.
To define a reference cell cycle embedding, we leverage key features of principal component analysis of cell cycle genes. Previous work has found that principal component analysis on expression data sometimes yields an ellipsoid pattern. This was first described by Alter et al. (2000); it has later been observed independently in multiple data sets (Schwabe et al., 2020; Liu et al., 2017; Mahdessian et al., 2021). Here, we demonstrate that the ellipsoid pattern is a consequence of a link between Fourier analysis of periodic functions and principal component analysis. The shape is created by the fact that cell cycle genes are periodic with a single peak of expression (which differs between genes). Thus, there is a direct link between progression through the cell cycle process and angular position on the ellipsoid.
We use the first two principal components to define a reference embedding representing the cell cycle. Because this reference embedding is a low dimensional linear space, we obtain an orthogonal projection operator allowing us to project any new data set into the reference embedding. We show that projecting new data into the reference cell cycle embedding overcomes technical and biological challenges posed by data sets where substantial variation is explained by one or more factors different from cell cycle, such as cellular differentiation.
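The projection step can be illustrated with a minimal Python sketch (not the tricycle implementation itself, which is an R/Bioconductor package). The function name `project_and_angle`, its arguments, and the optional per-gene centering are assumptions made for illustration; the sketch only shows matching genes to the reference, applying the reference weights, and reading off the polar angle.

```python
import numpy as np

def project_and_angle(expr, genes, ref_weights, ref_genes, center=True):
    """Project a genes x cells log-expression matrix into a 2-D reference
    embedding and return the polar angle of each cell.

    expr        : array, shape (n_genes, n_cells)
    genes       : gene names for the rows of expr
    ref_weights : array, shape (n_ref_genes, 2), PCA weights of the reference
    ref_genes   : gene names for the rows of ref_weights
    center      : subtract each gene's mean first (an assumption of this sketch)
    """
    row_of = {g: i for i, g in enumerate(genes)}
    ref_row_of = {g: i for i, g in enumerate(ref_genes)}
    # Only genes present in both the new data and the reference are used.
    shared = [g for g in ref_genes if g in row_of]
    x = expr[[row_of[g] for g in shared], :].astype(float)
    w = ref_weights[[ref_row_of[g] for g in shared], :]
    if center:
        x = x - x.mean(axis=1, keepdims=True)
    embedding = x.T @ w  # cells x 2
    # Polar angle around the origin, mapped to [0, 2*pi).
    theta = np.arctan2(embedding[:, 1], embedding[:, 0]) % (2 * np.pi)
    return embedding, theta
```

Because the weights are fixed by the reference, the new dataset never needs its own principal component analysis; any gene missing from the new data simply drops out of the sum.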
Principal component analysis and periodic functions
To gain insight into gene expression dynamics over the cell cycle, we start by analyzing principal component analysis of periodic functions. Our model is a collection of periodic functions with a single peak, taking the form fg(θ) = Ag cos(θ − Lg), with a gene-specific amplitude (Ag) and location of the peak (Lg), and with 0 ≤ θ < 2π representing the unknown cell cycle position. Figure 1a,b depicts the unobserved (true) time ordering, observed on a discrete grid of time points, together with a random permutation of these time points; this represents the observed data which is not ordered by time. A key insight is that the first two principal components are the same for the observed and the unobserved data (Figure 1c) when the analysis is performed on a discrete set of observation times. The unknown time order can be inferred from the principal component plot as the angle of each point, making it possible to fully reconstruct the unobserved time order (Figure 1d), i.e., the first two principal components form an orthogonal projection into a two-dimensional space representing the periodic time.
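The following Python sketch mimics this thought experiment under stated assumptions: single-peak genes are modelled as cosines with random amplitudes and peak locations, Gaussian noise is added, and the cells are randomly permuted before principal component analysis. All parameter values (100 genes, 400 cells, noise level 0.1) are arbitrary choices for illustration and are not taken from Figure 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_cells = 100, 400

# True (unobserved) positions on an even grid, plus gene-specific parameters.
theta_true = np.linspace(0, 2 * np.pi, n_cells, endpoint=False)
amp = rng.uniform(0.5, 2.0, n_genes)          # A_g
peak = rng.uniform(0, 2 * np.pi, n_genes)     # L_g

# Genes x cells matrix of single-peak periodic expression with noise.
expr = amp[:, None] * np.cos(theta_true[None, :] - peak[:, None])
expr += rng.normal(0.0, 0.1, expr.shape)

# The observed data are not ordered by time: shuffle the cells.
perm = rng.permutation(n_cells)
observed = expr[:, perm]

# PCA via SVD of the gene-centered matrix.
centered = observed - observed.mean(axis=1, keepdims=True)
u, s, vt = np.linalg.svd(centered, full_matrices=False)
pcs = vt[:2].T * s[:2]                         # cells x 2 scores

# Polar angle of each cell in the PC1/PC2 plane.
theta_hat = np.arctan2(pcs[:, 1], pcs[:, 0]) % (2 * np.pi)

# Sorting by the angle recovers the cyclic order of the true positions,
# up to an arbitrary starting point and direction of travel.
order = np.argsort(theta_hat)
recovered = theta_true[perm][order]
explained = (s[:2] ** 2).sum() / (s ** 2).sum()
```

The angle is only defined up to rotation and reflection, so the recovered order should be compared with the truth cyclically rather than element by element.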
For this result to hold, it is required that the gene expression data exhibit at least two distinct peak locations (not separated by exactly π) and that each gene has at most one peak over the time period (Methods). The assumption of a single expression peak for each gene is supported by empirical data for genes in the cell cycle expression program (see below). The first two principal components of these data can be represented as (PC1(θ), PC2(θ)) = b1 cos(θ) + b2 sin(θ), where b1, b2 are two-dimensional vectors which are linear functions of the eigenvectors and eigenvalues of a 2 × 2 matrix entirely determined by the set of peak locations and amplitudes (Lg, Ag) (Methods). No matter how many distinct peak locations and amplitudes are present, the space representing periodic time will always be two-dimensional. Higher dimensions are only required when individual genes have multiple peaks. Previous empirical investigations of the cell cycle using expression data support the observation of a two-dimensional space for principal component analysis (Buettner et al., 2015; Schwabe et al., 2020).
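A short worked derivation may help to see why two dimensions suffice. It assumes the cosine form of the single-peak model; the symbols u, v and M are introduced here for illustration, and M is one natural candidate for the 2 × 2 matrix mentioned above rather than a reproduction of the Methods.

```latex
% Single-peak model (cosine form assumed), expanded with the angle-difference identity
f_g(\theta) = A_g \cos(\theta - L_g)
            = \underbrace{A_g \cos L_g}_{u_g}\,\cos\theta
            + \underbrace{A_g \sin L_g}_{v_g}\,\sin\theta .

% Hence the expression vector of any cell lies in the plane spanned by u and v
x(\theta) = u \cos\theta + v \sin\theta .

% Gram matrix of (u, v): determined entirely by the pairs (L_g, A_g)
M = \begin{pmatrix} u^\top u & u^\top v \\ u^\top v & v^\top v \end{pmatrix}
  = \sum_g A_g^2
    \begin{pmatrix} \cos^2 L_g & \cos L_g \sin L_g \\
                    \cos L_g \sin L_g & \sin^2 L_g \end{pmatrix},
\qquad
\det M = \sum_{g<h} A_g^2 A_h^2 \sin^2(L_g - L_h) .
```

Under this sketch the plane is genuinely two-dimensional exactly when det M > 0, that is, when at least one pair of genes has peak locations that differ by something other than 0 or π, which matches the condition stated in the text.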
The simulated data depicted in Figure 1 have Gaussian noise, but we have verified that the result holds for data generated using the negative binomial distribution with an associated mean-variance relationship. Using the negative binomial distributed data required more than 2 distinct peaks to be stable (Supplementary Figures S1, S2). For both distributions, this approach is robust to downsampling of the data similar to what is seen with the increased sparsity from droplet-based sequencing technology. In simulations, we can recover cell cycle position with as few as 10 total counts per cell across 100 genes (depending on noise levels and heights of the peaks) (Supplementary Figure S3).
Recovering cell cycle position using principal component analysis on cell cycle genes
We next assess our model on experimental data, and learn an embedding representing cell cycle. We use 10x Genomics Chromium single-cell RNA-Sequencing (scRNA-Seq) data on two replicate cultures of E14.5 mouse cortical neurospheres (Methods), integrated using Seurat 3 and transformed to log2-scale. The use of an alignment method (CCA in Seurat 3) to integrate the two samples is important for the quality of the ellipsoid, by maximizing the correlation structure between the two samples. Since neurospheres are maintained in a proliferative state, we expect that cell cycle phase is an important contributor to the variation in expression within this single-cell dataset. To confirm this expectation, we consider a UMAP representation of the data based on all variable genes (Supplementary Figure S4) colored according to the predictions from two separate cell cycle stage estimation utilities (cyclone and a modification of Schwabe et al. (2020) we call modified-Schwabe, see Methods); this analysis demonstrates that the cell cycle is a major source of transcriptional variation in the neurosphere dataset.
We then perform principal component analysis of the top 500 most variable genes amongst the roughly 1700 genes annotated with the Gene Ontology cell cycle term (GO:0007049, Methods) (Ashburner et al., 2000). As suggested by our model, the first two principal components form an ellipsoid with a sparse/empty interior (Figure 2a). Using the modified-Schwabe cell cycle stage predictor, we observe a strong relationship between polar angle on the ellipsoid and predicted cell cycle stage.
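A minimal Python sketch of this step is given below. It assumes the input matrix has already been restricted to the annotated cell cycle genes and log2-transformed, and it ranks genes by plain variance; the exact variability criterion is described in the Methods and may differ. The function name `cell_cycle_pca` is hypothetical.

```python
import numpy as np

def cell_cycle_pca(expr, k=500):
    """PCA on the k most variable rows of a genes x cells matrix.

    Returns the indices of the selected genes, the gene weights for the
    first two principal components, and the cell scores.
    """
    # Rank genes by variance across cells (an assumption of this sketch).
    variances = expr.var(axis=1)
    top = np.argsort(variances)[::-1][:k]

    # Subtract each gene's mean expression before PCA.
    centered = expr[top] - expr[top].mean(axis=1, keepdims=True)

    # Left singular vectors give the gene weights of each component.
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    weights = u[:, :2]               # genes x 2
    scores = centered.T @ weights    # cells x 2
    return top, weights, scores
```

The returned `weights` play the role of the reference embedding: they are what would be stored and reused when projecting another dataset.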
The strong relationship between polar angle on the ellipsoid and predicted cell cycle stage was also observed on an independent dataset on cultured primary mouse hippocampal progenitors from a wild-type mouse as well as from a Kmt2d+/βgeo mouse, a previously described model of Kabuki syndrome (Carosso et al., 2019). The data were processed similarly to the neurosphere data. Again, we select the top 500 most variable cell cycle genes and perform a principal component analysis (Figure 2b) which reveals an ellipsoid pattern. The shape of the principal component plot differs between the two datasets, but the weights used to form the first two principal components are highly concordant (Figure 2c, Supplementary Figure S5 for PC2) for the 318 genes present in both cell cycle embeddings. Almost all of the highly ranked genes (absolute weights > 0.1, highlighted in red and labelled with gene name) represent important regulators of, or participants in, the cell cycle. For example, the highest ranked gene is Topoisomerase 2A (Top2a), which controls the topological state of DNA strands and catalyzes the breaking and rejoining of DNA to relieve supercoiling tension during DNA replication and transcription (Lodish et al., 2008). Also highly ranked are Smc2 and Smc4, which compose the core subunits of condensin, which regulates chromosome assembly and segregation (Ono et al., 2003; Wei-Shan et al., 2019).
Given our mathematical analysis as well as the strong empirical relationship between polar angle on the ellipsoid and cell cycle stage predictions, we define a method to learn cell cycle position as the polar angle around the origin on the coordinate plane which we denote by θ. We center the coordinate plane on (0, 0) whose location corresponds to cells with zero expression for all 500 variable cell cycle genes.
To demonstrate that cell cycle position reflects the true biological cell cycle progression, we consider expression dynamics of specific cell cycle genes. For Top2a and Smc2 the peak expressions are observed at G2 stage around π (Figure 2e), consistent with their known increased expression through S phase and into G2 (Heck et al., 1988; Belluti et al., 2013; Wei-Shan et al., 2019). Furthermore, the dynamics are highly similar between the independently analyzed cortical neurosphere and hippocampal NPC datasets, which supports the observation that the two different embeddings yield concordant cell cycle positions (despite each including dataset-specific genes). These observations hold for all genes with high weights (Supplementary Figure S6). This approach serves as an internal control in any single-cell RNA-seq data set and can be used to assess the quality of any continuous ordering.
Next, we directly relate θ to the measured transcription values. Figure 2d shows the log2 transformed total UMI numbers against θ, with a periodic loess smoother for each dataset. In both datasets, the maximum level is reached around π and the minimum around 1.5π, which corresponds to the end of G2 and the middle of M stage respectively. We observe the total UMI number begins to increase at the beginning of G1/S phase and to decrease sharply as cells progress through M phase. The difference between the maximum and minimum of the periodic loess line is 1, corresponding to a two-fold difference in total UMI, which is known to be proportional to cell size (Marguerat and Bähler, 2012; Padovan-Merhar et al., 2015). This observation, and the timing with respect to cell cycle position, is consistent with the approximate reduction in cellular volume by one half as a result of cytokinesis in M phase and the formation of two daughter cells of roughly equal size.
Note that these principal component analyses are differentiating G2/M cells from G1/G0 cells on the first principal component. This is in contrast to the mathematical analysis where the starting point (θ = 0) can be any location (red point in Figure 1) as there is no clear starting point for a periodic function. That the first principal component differentiates G2/M from G1/G0 can be explained by the nature of principal component analysis. Before principal component analysis we subtract each gene’s mean expression. However, genes marking G2/M usually have very high expression compared to other stages, with G0/G1 being the lowest (Supplementary Figure S7), ensuring that this becomes the first principal component. A clustering analysis of the expression patterns provides further evidence that cell cycle genes have a single peak pattern of expression (Supplementary Figure S7). Thus, the observed behavior of the cell cycle genes in these data sets fits the theoretical requirements of our model.
In summary, principal component analysis of the cell cycle genes predicts cell cycle progression for the mNeurosphere and mHippNPC datasets with a high degree of similarity between the cell cycle position inferred independently in the two datasets as predicted by our mathematical model.
When principal component analysis fails to reflect cell cycle position
A principal component analysis does not always yield an ellipsoid pattern; a requirement for this to work is for the first principal component to be dominated by cell cycle. To illustrate this, we used an existing mouse developing pancreas dataset, with cell type labels (Bastidas-Ponce et al., 2019). A major source of variation in this dataset is cellular differentiation as demonstrated by a standard UMAP embedding (based on all variable genes) illustrating the previously described (Bastidas-Ponce et al., 2019) differentiation trajectories (Figure 3a). When we perform principal component analysis using only the variable cell cycle genes, the resulting PCA plot still reflects the differentiation trajectory and does not resemble the ellipsoid pattern observed in the previous section (Figure 3b,c). Note that PC1 has some relationship with cell cycle since the differentiation path goes from cycling to non-cycling cells, but it also reflects the progression from cycling multipotent cells to terminally differentiated cells. This result strongly suggests that some of the cell cycle genes may participate in biological processes other than the cell cycle and demonstrates that PCA of cell cycle genes does not always exclusively capture cell cycle variance.
However, when we perform principal component analysis only on a subset of cells from a single, proliferating progenitor cell type, the ellipsoid pattern returns (Supplementary Figure S8a,b). This highlights the challenge of inferring cell cycle for datasets that contain many different cell types, including postmitotic cells.
Transfer learning through projection
To overcome the challenges of inferring cell cycle position in arbitrary datasets, we propose a simple, yet highly effective transfer learning approach we term tricycle (transferable representation and inference of cell cycle). In short, we first construct a reference embedding representing the cell cycle process using a fixed dataset where cell cycle is the primary source of transcriptional variation. For the remainder of this manuscript we will use the cortical neurosphere data as this reference. We show that the learned reference embedding generalizes across all datasets we have examined. Because our reference embedding is a linear subspace, we benefit from an orthogonal projection operator which allows us to map new data into the reference embedding, with well understood mathematical properties. Finally, we infer cell cycle position by the polar angle around the origin of each cell in the embedding space. The robustness of this approach is demonstrated by the ability of this projection to estimate cell cycle position in multiple independent and disparate datasets; evidence of which is provided below. Specifically, using the cortical neurosphere dataset as a fixed reference, our transfer learning approach generalizes across cell types, species (human/mouse), sequencing depths and even single-cell RNA sequencing protocols.
As a demonstration, we consider a diverse selection of single-cell RNA-seq datasets representing different species (mouse and human), cell types and technologies (10x Chromium, SMARTer-Seq, Drop-seq and Fluidigm C1) (Table S1). We project these datasets into the cell cycle embedding learned from the neurosphere data (Figure 4, Supplementary Figure S9), and color the projections according to the modified Schwabe estimator of cell cycle stage. Although the shape of the projection varies from dataset to dataset, the cells of the same stage always appear at a similar position of θ, such as cells at S stage centering at 0.75π. To verify our cell cycle ordering, we look at the expression dynamics of Top2a and Smc4 as a function of θ (Figure 4, Supplementary Figure S10). PCA plots of the GO cell cycle genes for each dataset illustrate the advantage of using a fixed embedding to represent cell cycle (Supplementary Figure S11). Together, these results strongly support that tricycle generalizes across data modalities.
Having inferred cell cycle position, we can visualize the cell cycle dynamics on a UMAP plot representing the full transcriptional variation, as is standard in the scRNA-Seq literature (Figure 4). To effectively visualize cell cycle position, we use a circular color scale to account for the fact that position “wraps around” from 2π to 0. Doing so reveals the smooth behaviour of the tricycle predictions (despite not using smoothing or imputation) and argues for representing cell cycle in gene expression data as a continual progression rather than discrete states.
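The idea of a circular color scale can be sketched in a few lines of Python. The HSV hue wheel is used here purely as a stand-in for whichever palette is used in the figures; the function name `position_to_rgb` is hypothetical.

```python
import colorsys
import math

def position_to_rgb(theta):
    """Map a cell cycle position (radians) to an RGB triple on a hue wheel.

    The hue is periodic, so positions just below 2*pi and just above 0
    receive nearly identical colors.
    """
    hue = (theta % (2 * math.pi)) / (2 * math.pi)
    return colorsys.hsv_to_rgb(hue, 1.0, 1.0)

# Colors for a handful of positions around the cycle.
palette = [position_to_rgb(k * math.pi / 4) for k in range(8)]
```

A linear color scale would instead place the two ends of the cycle at opposite extremes, which would visually separate cells that are adjacent in time.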
Cell cycle position estimation on gold-standard datasets
We validated tricycle on multiple datasets containing “gold-standard” cell cycle measurements, including measurements by proxy using the fluorescent ubiquitination-based cell-cycle indicator (FUCCI) system and by fluorescence-activated cell sorting (FACS) of cells in discrete cell cycle stages. Both of these approaches allow for assignment to or selection of cells from discrete phases of the cell cycle. The FUCCI system uses a dual reporter assay in which the reporters are fused to two genes with dynamic and opposing regulation during the cell cycle (Sakaue-Sawano et al., 2008), allowing for a quantitative assessment of whether cells are in G1 or S/G2/M phase. In contrast to FACS, FUCCI systems, combined with an appropriate quantification method, make it possible to continuously measure cell cycle progression by placing the 2 protein measurements in a 2-dimensional space. Cell cycle pseudotime needs to be inferred from these 2-dimensional measurements, which is usually done by a variant of polar angle (Hsiao et al., 2020; Mahdessian et al., 2021).
Mahdessian et al. (2021) measured human U-2 OS cells to derive a FUCCI-based pseudotime scoring. Their FUCCI measurements form a distinct horseshoe shape with the left side of the horseshoe representing time post-metaphase-anaphase transition with a continuous progression through G1, S, G2 and ending pre-metaphase-anaphase transition (Figure 5; this depiction mirrors other data presentations (Sakaue-Sawano et al., 2008; Sakaue-Sawano et al., 2017)). Cell cycle is a continuous process which is not immediately reflected in the horseshoe form because of the large gap (in the x-axis) between the two ends of the horseshoe. The x-axis reflects the protein levels of geminin (GMNN) which is degraded during the metaphase-anaphase transition (McGarry and Kirschner, 1998) and the two ‘open’ ends of the horseshoe are closely connected in time despite the visual gap in the scatter-plot. This fact gives the FUCCI system the ability to assess whether a cell in M phase is before or after this transition, or said differently, a high temporal resolution around this transition despite the relatively short wall time compared to the rest of the cell cycle. We observe a close correspondence between tricycle cell cycle position and FUCCI pseudotime. The only cells for which there is a superficial disagreement are placed in M phase by tricycle (cell cycle position around 0.85π) and are split between pre-metaphase-anaphase transition and post-metaphase-anaphase transition by FUCCI pseudotime. For this particular transition the FUCCI system has higher temporal resolution than tricycle; adding a small offset to these cells results in a remarkable concordance between the two systems (Figure 5). Elsewhere in the cell cycle, there is no evidence of better temporal resolution with FUCCI; examining expression dynamics suggests that tricycle does at least as well as FUCCI at ordering key cell cycle genes.
We can use tricycle to examine the expression dynamics of GMNN and CDT1 which reveals that GMNN expression is stable across the cell cycle (Supplementary Figure S12), suggesting the protein is predominantly regulated post-transcriptionally during mitosis.
Hsiao et al. (2020) used FUCCI on human induced pluripotent stem cells (iPSC) followed by scRNA sequencing using Fluidigm C1. While the Mahdessian et al. (2021) FUCCI data look like a horseshoe, the Hsiao et al. (2020) FUCCI data are more akin to a cloud (the data differ in quantification and normalization of the FUCCI scores). These data are used to estimate a continuous cell cycle position (which we term “FUCCI pseudotime”) based on polar angle of the FUCCI scores. Compared with the data in Mahdessian et al. (2021), there are larger differences between FUCCI pseudotime and tricycle cell cycle position. However, we can directly compare the associated expression dynamics of key cell cycle genes (Figure 5 for TOP2A, Supplementary Figure S13 for 8 additional genes). These results suggest that tricycle cell cycle position is at least as good as, if not better than, the FUCCI pseudotime at ordering the cells along the cell cycle; the R² for TOP2A is 0.42 for tricycle compared with 0.27 for peco.
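One simple way to put a number on how well an ordering explains a gene's expression is sketched below in Python. It is an illustration only: this passage does not state how the reported R² values were computed, and the sketch swaps in a first-harmonic (cosine plus sine) least-squares fit as the periodic model. The function name `periodic_r2` is hypothetical.

```python
import numpy as np

def periodic_r2(theta, y):
    """R^2 of a first-harmonic periodic fit of expression y on position theta.

    Model (an assumption of this sketch): y ~ a + b*cos(theta) + c*sin(theta).
    """
    theta = np.asarray(theta, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones_like(theta), np.cos(theta), np.sin(theta)])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residual = y - design @ coef
    # Fraction of variance explained by the periodic fit.
    return 1.0 - residual.var() / y.var()
```

Computing this for the same gene under two candidate orderings (for example, tricycle position and FUCCI pseudotime) gives a like-for-like comparison, with a higher value indicating a cleaner periodic expression pattern along that ordering.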
In contrast to FUCCI measurements, FACS sorting and enrichment of cells yields groups of cells in (supposedly) distinct phases of the cell cycle. We consider 2 different datasets where FACS has been combined with single-cell RNA-seq. Buettner et al. (2015) assays mouse embryonic stem cells (mESC) using Hoechst 33342-staining followed by cell isolation using the Fluidigm C1. They use very conservative gating for G1 and G2M at the cost of less conservative gating for S phase. Leng et al. (2015) uses FACS on FUCCI labeled H1 human embryonic stem cells (hESC) followed by cell isolation using the Fluidigm C1. In both experiments, cells largely appear as expected in the cell cycle embedding defined by the cortical neurosphere reference embedding (Supplementary Figure S14). For the mESC, we note that some cells labeled S (but not G1 or G2M) appear outside the position expected for this stage, consistent with the gating strategy used for these data.
Summarizing this evidence, we conclude that tricycle recapitulates and refines the cell cycle ordering consistent with current “state of the art” experimental methods. Tricycle cell cycle position is competitive with FUCCI based measurements, except for cells in the metaphase to anaphase transition during mitosis.
Comparison to existing tools for cell cycle position inference
We next sought to compare tricycle cell cycle position estimates to those obtained from other available methods. Existing methods for cell cycle assessment can be divided into those which infer a continuous position and those which assign a discrete stage. We have evaluated the following methods: peco (Hsiao et al., 2020), Revelio (Schwabe et al., 2020), Oscope (Leng et al., 2015), reCAT (Liu et al., 2017), cyclone (Scialdone et al., 2015), Seurat (Stuart et al., 2019), the original Schwabe (Schwabe et al., 2020), and the modified Schwabe 5 stage assignment method. Each method differs in which datasets it works well on and which issues it might have; a detailed comparison is available in the Supplement (Supplemental Methods, Supplementary Figures S15–S21).
Issues with existing methods include (a) the ability to work on datasets with multiple cell types, (b) the ability to scale to tens of thousands of cells or more, and (c) the ability to work on less information-rich datasets such as those generated by droplet-based or in situ scRNA-Seq methods. Oscope requires data on many genes due to its use of pair-wise correlations, and therefore does not work on less information-rich platforms (e.g. 10x Chromium or Drop-Seq). peco works better on less sparse, information-rich data (e.g. Fluidigm C1), but even on data from this platform, it is outperformed by tricycle. reCAT is critically dependent on the extent to which a principal component analysis of the cell cycle genes reflects cell cycle and only infers a cell ordering; it is not straightforward to interpret the reCAT ordering, especially across datasets. Revelio is primarily a visualization tool, which appears to fail on datasets where substantial variation is driven by processes other than the cell cycle. Of the discrete predictors, Seurat agrees well with tricycle (and is very scalable) but is limited by only predicting a 3 stage cell cycle representation (G1/S/G2M). Cyclone appears to do poorly in labelling cells in S phase and only predicts 3 stages. The (modified) Schwabe predictor assigns 5 stages, but has many missing labels and mis-assigns cells from G0/G1 to other stages.
Additionally, we benchmarked the computational speed and performance of tricycle against other cell cycle estimation algorithms. We briefly compared the running time of several methods using subsets of the mRetina dataset (Supplementary Figure S22). To compute continuous estimates using tricycle takes a mean of about 0.58, 0.86 and 1.48 seconds when the number of cells is 5000, 10000, and 50000 respectively. In contrast, to compute finite discrete stages Seurat takes a mean of about 1.10, 1.22 and 4.95 seconds for a three stage estimation and cyclone takes a mean of about 7.96, 11.50 and 50.66 minutes for a three stage estimation, when the number of cells is 5000, 10000, and 50000 respectively. Other methods (peco, Oscope, reCAT) are not capable of processing large (10k-100k+) datasets. All of the comparisons were run on Apple Mac mini (2018) with 3.2 GHz 6-Core Intel Core i7 CPU, 64GB RAM, and operating system macOS 11.2. Thus, tricycle is able to scale with the increasing size of datasets.
Application of tricycle to a single-cell RNA-seq atlas
To demonstrate the scalability and generalizability of tricycle we applied it to a recent dataset of ≈ 4 million cells from the developing human (Cao et al., 2020). The data were generated using combinatorial indexing (sci-RNA-seq3) and are relatively lightly sequenced with a median of 429 – 892 total UMIs for 4 single-cell profiled tissues and 354 – 795 for 11 single-nuclei profiled tissues (Supplementary Figure S23). Using tricycle, we are able to rapidly and robustly annotate cell cycle position for each of the cells/nuclei in this atlas (Figure 6a, Supplementary Figure S24). Within a global UMAP embedding, tricycle annotations enable immediate visual identification of proliferating and/or progenitor cell populations for most cell types and tissues. The rapid annotation of cell cycle position on this reference dataset further allowed us to examine the relative differences in the proportion of cells actively proliferating across different tissues and cell types in the developing human. To quantify this, we discretized all cells along θ into two bins corresponding to actively proliferating (0.25π < θ < 1.5π; S/G2/M) or non-proliferating (G1/G0). We next ranked each tissue by the relative proportion of actively proliferating cells to identify the tissues and cell types with the highest proliferative index (Figure 6b). To examine cell-type specific differences in proliferation potential, we computed the cell cycle embedding as well as the proliferative index for the 9 most abundant cell types within each tissue (Supplementary Figures S25 and S26).
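The discretization and ranking described above can be sketched in Python as follows. The thresholds 0.25π and 1.5π come from the text; the function name `proliferative_index` and the input layout (one position and one tissue label per cell) are assumptions made for illustration.

```python
import math
from collections import defaultdict

def proliferative_index(theta, tissue, lo=0.25 * math.pi, hi=1.5 * math.pi):
    """Fraction of cells per tissue with lo < theta < hi (S/G2/M bin).

    Returns (tissue, fraction) pairs sorted from most to least proliferative.
    """
    counts = defaultdict(lambda: [0, 0])  # tissue -> [proliferating, total]
    for position, name in zip(theta, tissue):
        counts[name][1] += 1
        if lo < position < hi:
            counts[name][0] += 1
    index = {name: c[0] / c[1] for name, c in counts.items()}
    return sorted(index.items(), key=lambda item: item[1], reverse=True)
```

The same function can be applied with cell type labels in place of tissue labels to obtain a per-cell-type index within a tissue.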
Tissue-level proliferation indexes identified thymus, cerebrum, and adrenal gland as having the highest overall proportions of dividing cells across the sampled fetal timepoints. Within the thymus, thymocytes represent both the most abundant cell type and the most ‘prolific’ cell type as a function of the proportion of mitotic cells. Thymocytes exhibit a circular embedding in UMAP space that effectively recapitulates the estimated cell cycle position predictions from tricycle (Supplementary Figure S26k). Within this circular embedding, there is a gap of cells with cell cycle position estimates at π, consistent with dropout of cells and lower information content in M-phase. Comparison of tricycle cell cycle annotations to modified Schwabe cell cycle phase calls in this embedding suggests that tricycle more accurately estimates cell cycle position even on cell types with a mean total UMI of 354 (Supplementary Figure S27).
Within tissues, lymphoid cells are often the cell type with the highest proliferation index (Supplementary Figures S25, S26), often with a greater number of actively proliferating cells than not. Within the fetal liver and spleen, both sites of early embryonic erythropoiesis during human development (Cumano and Godin, 2007), erythroblasts represent the cell type with the highest fraction of proliferating cells. Across developmental time, most tissues maintain relatively monotonic proliferation indices; however, several (liver, placenta, intestine) exhibit dynamic changes across the sampled timepoints. This application illustrates the utility of tricycle for atlas-level data.
Stability of the cell cycle position assignments
To test the robustness of tricycle we performed in-silico experiments to determine the stability of cell cycle position assignments. We evaluated three different types of stability, with respect to (a) missing genes, (b) sequencing depth, and (c) data preprocessing.
When projecting new data into the cell cycle reference embedding, it is common that the feature mapping between the two datasets contains only a subset of the 500 genes used in the embedding. The number of genes available for the feature mapping has an impact on the shape of the resulting embedding; the mNeurosphere and mHippNPC datasets have almost the same shape when restricted to a set of common genes (Supplementary Figure S28). To establish the stability of tricycle, we randomly removed genes from the neurosphere dataset and computed tricycle cell cycle positions; we used the neurosphere dataset as a positive control to ensure all genes are present. We used the circular correlation coefficient to assess the similarity between the tricycle cell cycle position for the full dataset and that for the dataset with randomly pruned genes (Supplementary Figures S29, S30). This reveals excellent stability (circular ρ > 0.8) using as few as 100 genes.
To examine the impact of sequencing depth, we downsampled the mHippNPC dataset (Supplementary Figures S31, S32), and used the circular correlation coefficient to quantify the similarity to the cell cycle position inferred using the full sample. Originally, the median library size (total UMIs) is 10,000 for the mHippNPC data. Downsampling to 20% of the original depth (approximate median library size of 2,000) kept circular ρ > 0.8. This is congruent with the observed robustness of the method to the varying sequencing depth of the various datasets examined above.
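One way to picture the downsampling experiment is binomial thinning, in which every UMI is kept independently with a fixed probability. The text does not state which downsampling scheme was used, so the sketch below is an assumption made for illustration; the function name and the toy counts are hypothetical.

```python
import random

def downsample_counts(counts, fraction, rng):
    """Keep each UMI independently with probability `fraction` (binomial thinning)."""
    return [sum(1 for _ in range(c) if rng.random() < fraction) for c in counts]

rng = random.Random(0)
cell = [500, 0, 1200, 300]                 # made-up UMI counts for one cell
thinned = downsample_counts(cell, 0.2, rng)  # about 20% of the original depth
```

After thinning, the cell cycle position would be re-estimated and compared with the full-depth estimate using the circular correlation coefficient.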
Next, we examined the stability of tricycle with respect to the choice of reference embedding. Above, we show a cell cycle space estimated separately for the mNeurosphere and the mHippNPC datasets (Figure 2). We observe that the inferred expression dynamics are more alike in the two datasets if we project the mHippNPC data into the mNeurosphere embedding compared to using its own embedding. To quantify this, we pick key cell cycle genes (previously examined in Supplementary Figure S6) and compare the location of peak expression in the mNeurosphere dataset to that in the mHippNPC dataset, with cell cycle position estimated using these two approaches (Supplementary Figure S33). For the vast majority of genes, the highest expression appears at a closer position when we estimate cell cycle position by projecting the mHippNPC dataset into the mNeurosphere embedding.
To examine the impact of preprocessing data prior to projection, we compared cell cycle position inferred using data processed with and without Seurat. Note that when we estimate the cell cycle space, we use Seurat to align the different biological samples. But this is not done when we project new data using the pre-learned reference. We observe negligible differences, whether or not Seurat is used (Supplementary Figure S34).
These results demonstrate that tricycle accurately estimates cell cycle position across a wide range of both the number of detectable genes within the feature map and the depth of information content in the target cells.
DISCUSSION
Here, we have demonstrated the ability of tricycle to accurately call cell cycle position in 26 datasets across species, cell types, and assay technologies.
Tricycle achieves its universality by leveraging key features of the biology of the cell cycle, the mathematical properties of principal component analysis on periodic functions, and the ubiquitous applicability of transfer learning to enable rapid and efficient use across a diverse collection of datasets. Our embedding – shaped by the fact that the first dimension stratifies G2/M from G1/G0 – ensures that we can easily interpret cell cycle position between datasets, overcoming one challenge of cell cycle inference. The stage-specific periodicity of cell cycle markers, tied to their biological function, implies that the cell cycle space becomes two dimensional. Our definition of cell cycle position as the polar angle of a cell embedded in the reference cell cycle space serves as a form of internal normalization and helps with the generalizability of tricycle across datasets. Despite this, it is still remarkable that we can project new data – without data integration or batch effect removal – and still get a useful and accurate embedding of the data into the cell cycle space with minimal computational effort. Because the projection operator is a single low-dimensional linear operation, tricycle has excellent scalability and can easily be applied to atlas-scale datasets. Thus, tricycle is a powerful tool for quickly and accurately inferring cell cycle position for single-cell RNA-seq data.
The cell cycle is a major source of transcriptional variation in many biological systems. In particular, highly studied systems such as developmental and disease processes rely on proper regulation of the cell cycle. In many single-cell experiments however, cell cycle is often considered a confounding factor and as such, methods exist to remove this effect from the data prior to analysis. We caution against removing cell cycle progression blindly as it can be intimately intertwined with other sources of variation of interest. Taking the mPancreas data as an example, there is a clear relationship between the number of cycling cells and differentiation as the multi-potent ductal cells advance to be terminally differentiated alpha and beta cells. If correction for cell cycle progression is warranted, our analysis of the mPancreas data suggests that the common approach of regressing out principal components of cell cycle genes may remove biological variation of interest.
The success of tricycle’s application using a single arbitrary cell cycle embedding raises interesting questions about the robustness and universality of the biological process itself. Here, we use a fixed reference embedding to represent cell cycle, defined using the mouse cortical neurosphere dataset. This raises the question: is there a single best embedding? One approach would be to decrease the size of the gene list used to construct the embedding. In support of this, Hsiao et al. (2020) report that as few as 5 genes yield good performance. We find a small set (though larger than 5) of genes with high weights (Figure 2), but that making the list too small results in inferior performance.
Another approach would be to optimize the embedding to be as circular as possible. However, despite different shapes, embeddings based on the cortical neurosphere and the primary hippocampal NPC datasets result in similar cell cycle position estimates. Both results argue that the robustness of the method is derived from the structure created by the relationship of the genes to each other rather than the behavior of any individual marker gene. Thus, so long as the structure of the embedding is driven by the cell cycle, the specific source of the reference embedding is irrelevant. Here, we use 500 genes as well as a single, clean, dataset to define the cell cycle embedding, and we show that this achieves excellent generalization performance without any optimization. While our use of a single, fixed, reference embedding is a clear advantage to users, our package contains functions to define and use a custom reference embedding.
METHODS
Using principal component analysis to recover time ordering
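We will consider the following statistical model. The mean expression of each gene is modelled as
$$f_g(t) = A_g \cos(t - d_g),$$
here $A_g$ is a gene-specific amplitude and $d_g$ is a gene-specific displacement (location of the peak). In this formulation, the mean function has a single peak and is periodic. We have $G$ genes and each gene has its own (but not necessarily unique) $(A_g, d_g)$.
Basic trigonometry yields the identity
$$A_g \cos(t - d_g) = A_g \cos(d_g)\cos(t) + A_g \sin(d_g)\sin(t),$$
which we can write as
$$f_g(t) = c_{g1}\,\phi_1(t) + c_{g2}\,\phi_2(t), \qquad c_{g1} = \sqrt{\pi}\,A_g\cos(d_g), \quad c_{g2} = \sqrt{\pi}\,A_g\sin(d_g),$$
using the orthonormal functions
$$\phi_1(t) = \cos(t)/\sqrt{\pi}, \qquad \phi_2(t) = \sin(t)/\sqrt{\pi}.$$
Our derivation is based on Ramsay and Silverman (2005) section 8.4. This section shows that the variance-covariance operator is given by
$$v(s, t) = G^{-1}\,\phi(s)^\top C^\top C\,\phi(t),$$
with $\phi(t) = (\phi_1(t), \phi_2(t))^\top$ and $C$ the $G \times 2$ matrix with rows $(c_{g1}, c_{g2})$, where the inner matrix (which turns out to determine the principal components) is a 2 × 2 matrix equal to
$$G^{-1} C^\top C = \frac{\pi}{G}\begin{pmatrix} \sum_g A_g^2 \cos^2 d_g & \sum_g A_g^2 \cos d_g \sin d_g \\ \sum_g A_g^2 \cos d_g \sin d_g & \sum_g A_g^2 \sin^2 d_g \end{pmatrix}.$$
The principal component analysis is given by the eigenfunctions and eigenvalues of the variance-covariance operator. Such an eigenfunction and eigenvalue pair $\xi, \rho$ takes the form
$$\xi(t) = b^\top \phi(t)$$
for a vector $b$ which satisfies
$$G^{-1} C^\top C\, b = \rho\, b,$$
i.e. $b, \rho$ are an eigenvector and eigenvalue of the $G^{-1} C^\top C$ matrix. Specifically, if $q_1, q_2, \lambda_1, \lambda_2$ are two such eigenvectors and eigenvalues then the two first principal components are given by
$$\xi_1(t) = q_1^\top \phi(t), \qquad \xi_2(t) = q_2^\top \phi(t).$$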
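To make the role of the 2 × 2 inner matrix concrete, the sketch below computes it for a given set of amplitudes and peak locations and returns its two eigenvalues. It assumes the cosine mean model f_g(t) = A_g cos(t − d_g) expanded in the orthonormal basis cos(t)/√π, sin(t)/√π, which is our reading of the derivation; the function names are hypothetical and mean-centering across genes is ignored.

```python
import math

def inner_matrix(amplitudes, peaks):
    """2 x 2 matrix G^{-1} C^T C for f_g(t) = A_g * cos(t - d_g) (assumed model)."""
    n_genes = len(amplitudes)
    pairs = list(zip(amplitudes, peaks))
    a = sum(A * A * math.cos(d) ** 2 for A, d in pairs)
    b = sum(A * A * math.cos(d) * math.sin(d) for A, d in pairs)
    c = sum(A * A * math.sin(d) ** 2 for A, d in pairs)
    s = math.pi / n_genes
    return [[s * a, s * b], [s * b, s * c]]

def eigenvalues_sym_2x2(m):
    """Eigenvalues (largest first) of a symmetric 2 x 2 matrix."""
    centre = (m[0][0] + m[1][1]) / 2
    radius = math.hypot((m[0][0] - m[1][1]) / 2, m[0][1])
    return centre + radius, centre - radius

# Peaks spread evenly around the cycle: two equally strong components.
spread = eigenvalues_sym_2x2(inner_matrix([1, 1, 1, 1],
                                          [0, math.pi / 2, math.pi, 3 * math.pi / 2]))
# All genes peaking at the same location: only one non-zero component.
single = eigenvalues_sym_2x2(inner_matrix([1, 1, 1, 1], [0.7] * 4))
```

Under these assumptions, a second principal component only carries variance when the genes have at least two distinct peak locations, which is why evenly spread peaks give two equal eigenvalues while a single shared peak gives one.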
Simulations
For Figure 1 we performed the following simulation. We generated 50 realizations of a cosine function with a location of 0.2 and an amplitude of 0.5, as well as 50 realizations of a cosine function with a location of 1.2 and an amplitude of 1. Each function was evaluated on an equidistant grid of 1000 points and independent Gaussian noise with a standard deviation of 0.2 was added. The depictions in Figure 1a,b each show one of the realizations of the two different cosine functions.
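A minimal sketch of this simulation is given below. The locations, amplitudes, grid size, and noise level follow the text; the grid range [0, 2π), the function name, and the random seed are assumptions made for illustration.

```python
import math
import random

def simulate_cosine(location, amplitude, n_points=1000, noise_sd=0.2, rng=None):
    """One noisy realization of amplitude * cos(t - location) on an equidistant grid."""
    if rng is None:
        rng = random.Random()
    # Assumption: the grid covers one period, [0, 2*pi).
    grid = [2 * math.pi * k / n_points for k in range(n_points)]
    return [amplitude * math.cos(t - location) + rng.gauss(0.0, noise_sd) for t in grid]

rng = random.Random(1)
curves_a = [simulate_cosine(0.2, 0.5, rng=rng) for _ in range(50)]
curves_b = [simulate_cosine(1.2, 1.0, rng=rng) for _ in range(50)]
```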
For Supplementary Figures S1, S2 and S3 we simulated data using the negative binomial distribution, inspired by the setup in Splatter (Zappia et al., 2017). In addition to a gene-specific amplitude (Ag) and location of the peak (Lg), we also consider different library sizes (l), which are approximate as we still have some cell-to-cell variance. For a cell, we let , with c a constant to ensure positivity of . Then the cell mean is . The trended cell mean is simulated from a Gamma distribution as , with B the biological coefficient of variation (we fix B as 0.1 in our simulations). Thus, the counts for gene g are given as $y_g \sim \mathrm{Pois}(\lambda_g)$. We always simulate a count matrix of 100 genes by 5000 cells, with cell timepoint θ uniformly distributed between 0 and 2π. We only vary one of Lg, Ag and l in each of Supplementary Figures S1, S2 and S3. Specifically, in Supplementary Figure S1, we used different numbers of distinct peak locations across 100 genes, and fixed the amplitudes (across 100 genes) as 3 and the library size as 2000. In Supplementary Figure S2, we used different numbers of distinct amplitudes across 100 genes, and fixed the number of distinct peak locations (across 100 genes) as 100 and the library size as 2000. In Supplementary Figure S3, we changed the library size l, and fixed the number of distinct peak locations (across 100 genes) as 100 and the amplitudes (across 100 genes) as 3. PCA was performed on the library size normalized and log2 transformed matrix obtained from the count matrix.
Generation of mouse primary hippocampal NPC scRNA-Seq dataset
Hippocampal neural stem/progenitor cells (NPCs) were isolated by microdissection from embryonic day 17 (E17) embryos (offspring of male Kmt2d+/βgeo and female C57Bl/6J) and cultured on Matrigel as described in Carosso et al. (2019). We verified neuronal lineage by demonstrating Nestin, Calbindin, and Prox1 expression (not shown). Cells were maintained in an undifferentiated state with growth factor inhibition (EGF, FGF2) in Neurobasal media. In a prior publication, we have demonstrated that the Kmt2d+/βgeo cells exhibit defects in proliferation (Carosso et al., 2019). Following isolation we collected cells from both genotypes at the undifferentiated state (day 0) and then after growth factor removal on days 4, 7, 10 and 14, capturing cells that were progressively more differentiated. scRNA-Seq libraries were created with a Chromium Single-Cell 3’ library & Gel Bead Kit v2 (10x Genomics) according to the manufacturer's protocol. Only cells from day 0 are analyzed here.
Generation of mouse E14.5 Neurosphere scRNA-Seq dataset
Cortical neurospheres were generated from the dissociated telencephalon of embryonic day 14.5 (E14.5) wild type embryos. Embryos were harvested and the dorsal telencephalon was dissected away and collected in 1X HBSS at room temperature. The dorsal telencephalon was gently triturated using p1000 pipette tips, the resultant cell suspension was spun at 500 × g for 5 min, and the media was aspirated off. The cell pellets were resuspended in 7 ml of complete neurosphere media (CNM) and plated in ultra-low adherence T25 flasks. CNM is made by combining 480 ml DMEM-F12 with glutamine, 1.45 g of glucose, 1X N2 supplement, 1X B27 supplement without retinoic acid, 1X penicillin/streptomycin and 10 ng/ml of both epidermal growth factor (EGF) and basic fibroblast growth factor (bFGF). The cells were cultured for 3-5 days, or until spheroids had formed. The neurospheres were then collected and spun at 100 × g for 5 min and the supernatant was removed. Neurospheres were resuspended in 5 ml TrypLE and incubated for a maximum of 5 min at 37 °C with gentle trituration every 1.5 min with a p1000 until the neurospheres were mostly a single-cell suspension. The cells were spun down at 500 × g for 5 min and the supernatant was removed. The cells were resuspended in 15 ml of CNM and gently passed through a 40 µm filter to remove large cell clumps. The resultant cell suspension was then plated in T75 flasks for another 2-5 days or until spheres began to have dark centers. This process was repeated two more times before cells were collected for 10x Genomics single-cell library prep. Before single-cell library preparation, the neurospheres were dissociated as described above and passed through a 40 µm filter to ensure a single-cell suspension. Approximately 7000 cells were selected from each sample for input to the scRNA-Seq library prep. scRNA-Seq libraries were created using the Chromium Single-Cell 3’ library & Gel Bead Kit v2 (10x Genomics) according to the manufacturer's protocol.
Reference genome and mapping index building
For mouse, the GRCm38 reference genome fasta file and primary gene annotation GTF file (v25) were downloaded from GENCODE (https://www.gencodegenes.org). Similarly, the GRCh38 reference genome fasta file and primary gene annotation GTF file (v35) were downloaded for human. We built a reference index for use by alevin as described by Soneson (2020) using the R package eisaR (v1.2.0), which we use to quantify both spliced and unspliced counts of annotated genes.
scRNA-Seq preprocessing
Mouse Neurosphere (mNeurosphere) dataset
FASTQ files were used to quantify both spliced and unspliced counts by alevin (Salmon v1.3.0) with default settings as described by Soneson (2020). Abundance matrices were read in by the R package tximeta (v1.8.1). The spliced counts were treated as the expression counts. We removed cells with fewer than 200 expressed genes, and cells flagged as outliers (deviating more than three median absolute deviations (MAD) from the median of log2(TotalUMIs), log2(number of expressed genes), percentage of mitochondrial gene counts, or log10(doublet scores)). The doublet scores were computed using the doubletCells function in the R package scran (v1.18.1). All mitochondrial genes and any genes which were expressed in fewer than 20 cells were further excluded from all subsequent analyses. Expression abundances were then library size normalized and log2 transformed by the function normalizeCounts in the R package scuttle (v1.0.2). The biological samples were integrated by Seurat (v3.2.2). We then ran PCA on the top 2000 highly variable genes of the integrated log2(expression) using the runPCA function with default parameters, followed by running the runUMAP function on the resulting top 30 principal components with default parameters. Note that we did not restrict genes to cell cycle genes in this step, as we would like to see the overall variation of the data. Cell types were inferred by the SingleR package (v1.4.0) using the built-in MouseRNAseqData dataset as the reference.
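The outlier rule used for cell filtering can be sketched as below for a single quality metric. This is an illustration only: it uses the unscaled MAD, whereas R's `mad` applies a consistency constant by default, and the function name and library sizes are hypothetical.

```python
import math
from statistics import median

def mad_outliers(values, n_mads=3.0):
    """Flag values more than n_mads median absolute deviations from the median."""
    med = median(values)
    mad = median(abs(v - med) for v in values)  # assumption: unscaled MAD
    return [abs(v - med) > n_mads * mad for v in values]

total_umis = [4300, 5200, 4800, 5100, 4500, 90, 4900]  # made-up library sizes
flags = mad_outliers([math.log2(u) for u in total_umis])
kept = [u for u, bad in zip(total_umis, flags) if not bad]
```

In the pipeline described above, the same test would be repeated for each metric (number of expressed genes, mitochondrial percentage, doublet score) and a cell dropped if any metric flags it.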
Mouse primary hippocampal NPC (mHippNPC) dataset
All preprocessing steps are the same as for the mouse Neurosphere (mNeurosphere) dataset.
Mouse developing pancreas (mPancreas) dataset
We obtained the spliced and unspliced count matrices of the mouse developing pancreas dataset from the Python package scvelo (v0.2.1). The spliced counts were treated as the expression counts. We removed cells with fewer than 200 expressed genes, and any cells flagged as outliers (deviating more than three median absolute deviations (MAD) from the median of log2(TotalUMIs), log2(number of expressed genes), percentage of mitochondrial gene counts, or log10(doublet scores)). Here, the doublet scores were computed using the doubletCells function in the R package scran (v1.18.1). All mitochondrial genes and any genes which were expressed in fewer than 20 cells were further excluded from all subsequent analyses. Expression abundances were then library size normalized and log2 transformed by the function normalizeCounts in the R package scuttle (v1.0.2). We ran PCA on the top 500 highly variable genes using the runPCA function with default parameters, followed by running the runUMAP function on the resulting top 30 principal components. When running the UMAP, we set min_dist to 0.5 instead of the default value 0.01, with other parameters at their defaults, to replicate the UMAP figure shown in Bergen et al. (2020). Of note, the single-cell libraries of these data were generated using 10x Genomics’ Chromium v2 system.
Mouse Hematopoietic Stem Cell (mHSC) Dataset
We downloaded the processed log2 transformed TPM matrix directly from GEO under accession number GSE59114 (Kowalczyk et al., 2015). We only used the cells from the C57BL/6 strain, which contains more cells, as the number of overlapping genes between the xlsx files of the C57BL/6 strain and the DBA/2 strain is too small. Because the data were already processed and filtered, we did not perform any other processing. Unlike the above mentioned datasets, the SMARTer protocol was applied during library preparation.
Mouse Retina (mRetina) dataset
This dataset is available at https://github.com/gofflab/developing_mouse_retina_scRNASeq. We removed cells flagged as outliers (deviating more than three median absolute deviations (MAD) from the median of log2(TotalUMIs), log2(number of expressed genes), percentage of mitochondrial gene counts, or log10(doublet scores)). As the total UMIs depend on cell type, we filtered the cells by blocking for each cell type. The doublet scores were computed using the doubletCells function in the R package scran (v1.18.1). All mitochondrial genes and any genes which were expressed in fewer than 20 cells were further excluded from all subsequent analyses. Expression abundances were then library size normalized and log2 transformed by the function normalizeCounts. We used the cell type annotations given in the new CellType column of the provided phenotype file. The single-cell libraries of these data were generated using 10x Genomics’ Chromium v2 system.
HeLa cell lines datasets
The spliced and unspliced count matrices of HeLa Set 1 (HeLa1) and HeLa Set 2 (HeLa2) were downloaded from GEO under accession numbers GSE142277 and GSE142356. Both datasets were generated by the same lab under the same protocol, while the sequencing depth of Set 2 is only about half that of Set 1 (Schwabe et al., 2020). For each dataset, we only used the genes present in both the spliced and unspliced count matrices. The spliced counts were treated as the expression counts. We removed cells with fewer than 200 expressed genes, and cells flagged as outliers (deviating more than three median absolute deviations (MAD) from the median of log2(TotalUMIs), log2(number of expressed genes), percentage of mitochondrial gene counts, or log10(doublet scores)). All mitochondrial genes and any genes which were expressed in fewer than 20 cells were further excluded from all subsequent analyses. Expression abundances were then library size normalized and log2 transformed by the function normalizeCounts. The single-cell libraries of these data were generated using the Drop-seq system.
Mouse embryonic stem cell (mESC) dataset
The processed count matrix was downloaded from the ArrayExpress website under accession number E-MTAB-2805 (https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-2805/). We only retained 279 cells with log2(counts) greater than 15. The count matrix was library size normalized across cells and log2 transformed by the function normalizeCounts. The RNA-Seq data in this dataset were generated using the Fluidigm C1 system.
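The library size normalization and log2 transformation used throughout can be sketched as follows. This assumes what we understand to be the default behaviour of normalizeCounts (size factors equal to library sizes scaled to mean 1, and a pseudo-count of 1); it is a plain Python illustration with a hypothetical function name, not the scuttle implementation.

```python
import math

def log_normalize(counts):
    """counts[g][c] is the raw count of gene g in cell c.

    Returns log2(count / size_factor + 1), where the size factor of a cell is
    its library size divided by the mean library size (assumed default).
    """
    n_cells = len(counts[0])
    lib = [sum(row[c] for row in counts) for c in range(n_cells)]
    mean_lib = sum(lib) / n_cells
    size_factors = [l / mean_lib for l in lib]
    return [[math.log2(row[c] / size_factors[c] + 1.0) for c in range(n_cells)]
            for row in counts]

# Two genes by two cells; the second cell is sequenced twice as deeply.
norm = log_normalize([[10, 20], [30, 60]])
```

After normalization the two cells have identical values, since they differ only in depth.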
Human embryonic stem cells (hESC) dataset
The processed count matrix was downloaded from GEO under accession number GSE64016. We only retained FACS sorted cells. The count matrix was library size normalized across cells and log2 transformed by the function normalizeCounts. The RNA-Seq data in this dataset were generated using the Fluidigm C1 system.
Human U-2 OS cells (hU2OS) dataset
The TPM matrix was downloaded from GEO under accession number GSE146773. We only retained FACS sorted cells with log2(counts) greater than the 3 times MAD range. Genes which were expressed in fewer than 20 cells were removed. The remaining TPM matrix was library size normalized across cells and log2 transformed by the function normalizeCounts. The RNA-Seq data in this dataset were generated using SMART-seq2 chemistry.
Human induced pluripotent stem cells (hiPSCs) dataset
The processed FUCCI intensity and RNA-seq data were downloaded from https://github.com/jdblischak/fucci-seq/blob/master/data/eset-final.rds?raw=true. The preprocessing was described in Hsiao et al. (2020). The count matrix was library size normalized across cells and log2 transformed by the function normalizeCounts. The RNA-Seq data in this dataset were generated using the Fluidigm C1 system.
Fetal tissue dataset
We obtained the loom file containing gene counts of all tissues from GEO under accession number GSE156793. We then processed and analyzed each tissue separately. For each tissue type, cells for which log2(TotalUMIs) is lower than median – 3 × MAD, and genes expressed in fewer than 20 cells, were excluded from further analyses. The count matrix was library size normalized across cells and log2 transformed by the function normalizeCounts. All 4 tissues profiled using single-cell and 11 tissues profiled using single-nuclei were generated on the sci-RNA-seq3 system.
5 stage cell cycle assignments
The 5 stage (G1S, S, G2, G2M, and MG1) cell cycle assignments were adapted from Schwabe et al. (2020) with some modifications. Briefly, the assignments use the list of highly expressed genes for each stage, curated by Whitfield et al. (2002). Let k represent one of the 5 stages, and lk represent the gene list with pk genes. For each stage k, we calculate the mean expression across genes in the gene list lk for the jth cell as with as the log2 transformed expression value of gene and cell j. Then we assess how well a gene in a gene list correlates with the mean expression level of that gene list as . For each stage, the gene list is pruned to genes with . (For the fetal tissues dataset, we used since the extremely shallowly sequenced data show weaker co-expression patterns and the threshold 0.2 could leave us with no genes.) We label this pruned new gene list as with qk the number of genes. The stage assignment score for cell j and stage k is given as
The 5-by-n matrix A, whose number of columns equals the number of cells, is z-score transformed first by rows and then by columns, resulting in the 5-by-n matrix . For each cell, we compute the preliminary stage assignment as .
As in Schwabe et al. (2020), we also apply two filtering steps. The first filtering step is exactly the same as described in the original paper. We require , the stage with the second largest assignment score, to be a neighboring stage of sj. This requirement reflects that the 5 stages form a continuous cyclic process.
As for the second filtering step, the original method discards all cells with the second largest assignment score . We found the threshold of 0.75 to some extent not applicable, as in some datasets it leads to losing 90% of cells. Therefore, we use a more adaptive threshold by requiring .
If a cell passes both filtering steps, it is assigned to stage sj. Otherwise, it is assigned NA with respect to the 5 stages of the cell cycle. To mitigate batch effects on the 5 stage assignments, the assignment procedure is carried out for each sample/batch separately within each dataset, as recommended in the Revelio package (Schwabe et al., 2020).
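The z-scoring, preliminary assignment, and first (neighbor) filtering step can be sketched as below. This is a hedged illustration rather than the Revelio implementation: it assumes the population standard deviation for the z-scores, takes the stage score matrix as given, and omits the second filtering step because its threshold is not recoverable here. Function names and the toy scores are hypothetical.

```python
from statistics import mean, pstdev

STAGES = ["G1S", "S", "G2", "G2M", "MG1"]

def zscore(xs):
    """Z-scores using the population standard deviation (an assumption)."""
    m, s = mean(xs), pstdev(xs)
    return [(x - m) / s if s > 0 else 0.0 for x in xs]

def assign_stages(scores):
    """scores[k][j] is the score of stage k for cell j (5 x n).

    Returns one stage name per cell, or None when the runner-up stage is not
    cyclically adjacent to the top stage.
    """
    by_row = [zscore(row) for row in scores]      # z-score within each stage
    calls = []
    for j in range(len(scores[0])):
        col = zscore([by_row[k][j] for k in range(5)])  # then within each cell
        order = sorted(range(5), key=lambda k: col[k], reverse=True)
        best, second = order[0], order[1]
        neighbours = {(best - 1) % 5, (best + 1) % 5}
        calls.append(STAGES[best] if second in neighbours else None)
    return calls

toy_scores = [
    [5, 1, 4],  # G1S
    [4, 1, 1],  # S
    [1, 4, 1],  # G2
    [1, 5, 5],  # G2M
    [1, 1, 1],  # MG1
]
calls = assign_stages(toy_scores)
```

In this toy example the third cell is left unassigned because its top stage (G2M) and runner-up stage (G1S) are not adjacent in the cycle.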
PCA of GO cell cycle genes
For each dataset, we subsetted the preprocessed log2 transformed expression matrix to genes in the GO term cell cycle (GO:0007049). If there are clear batches defined in the dataset, such as sample or batch, we use Seurat3 to remove the batch effect. When using Seurat3, we used a library size normalized count matrix as input instead of log2 transformed values. The integration anchors were searched in the space of the top 30 PCs. The output integrated matrix is a log2 transformed matrix of the top 500 most variable genes. We then performed principal component analysis on the gene-wise mean centered expression matrix. When no batches exist, we also restrict to the top 500 variable genes among GO cell cycle genes.
Projection of new data to cell cycle embedding and calculation of cell cycle position θ
The projection using the weights matrix pre-learned during PCA of GO cell cycle genes is straightforward, given by
$$P = \tilde{E}^\top R$$
where $R$ represents the o-by-2 reference matrix (o ≤ 500), which contains the weights of the top 2 PCs learned from PCA of GO cell cycle genes; $\tilde{E}$ is an o-by-n matrix, subsetted from $E$ (the log2 transformed expression matrix) to the genes in the weights matrix and row-mean centered. The resulting n-by-2 matrix $P$ is the cell cycle embedding projected by the reference. The calculation of the cell cycle position θ is given by
$$\theta = \operatorname{atan2}(P_2, P_1) \bmod 2\pi$$
where $P_i$ is the ith column of matrix $P$. When mapping the genes between the weights matrix and the data that we want to project, the Ensembl ID is given higher priority than the gene symbol for mouse. For across-species projection, we only consider the homologous genes with the same gene symbols.
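The projection and the polar angle can be sketched in a few lines. This is a minimal illustration under the assumption that the embedding is the row-centered expression matrix multiplied by the two weight vectors and that θ is the polar angle wrapped into [0, 2π); the function names and the toy data are hypothetical and this is not the package's own code.

```python
import math

def project(expr, weights):
    """expr[g][c]: log2 expression (genes x cells); weights[g]: (w1, w2) for gene g.

    Returns one (pc1, pc2) pair per cell after centering each gene across cells.
    """
    n_genes, n_cells = len(expr), len(expr[0])
    centered = [[x - sum(row) / n_cells for x in row] for row in expr]
    return [
        (sum(centered[g][c] * weights[g][0] for g in range(n_genes)),
         sum(centered[g][c] * weights[g][1] for g in range(n_genes)))
        for c in range(n_cells)
    ]

def polar_angle(pc1, pc2):
    """Polar angle of an embedded cell, wrapped into [0, 2*pi)."""
    return math.atan2(pc2, pc1) % (2 * math.pi)

# Two toy genes peaking a quarter cycle apart, four cells around the cycle.
expr = [[1.0, 0.0, -1.0, 0.0],
        [0.0, 1.0, 0.0, -1.0]]
weights = [(1.0, 0.0), (0.0, 1.0)]
theta = [polar_angle(p1, p2) for p1, p2 in project(expr, weights)]
```

Because the projection is a single matrix product over at most 500 genes, its cost grows linearly with the number of cells.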
Periodic loess
As θ is a circular variable bounded between 0 and 2π, fitting a traditional loess model y ~ θ, with y any response variable, such as the expression of a gene or log2(TotalUMIs), has problems around the boundaries 0 and 2π. Hence, we concatenate three copies of y and three copies of θ, the latter shifted by one period, to form [y, y, y] and [θ − 2π, θ, θ + 2π], on which the loess line is fitted. We then only use the fitted values for θ between 0 and 2π for visualization purposes.
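The calculation of the coefficient of determination R2 of the fitted loess model is given by
$$R^2 = 1 - \frac{SS_{res}}{SS_{total}}.$$
Here $SS_{res} = \sum_i (y_i - \hat{y}_i)^2$ and $SS_{total} = \sum_i (y_i - \bar{y})^2$, with $\hat{y}_i$ the fitted value for data point i and $\bar{y}$ the mean of y. Note that instead of using all three copies of data points, we restrict the calculation of SSres and SStotal to the original data points (the middle copy). The residuals are not the same for the three copies, especially at the beginning and end of [−2π, 4π].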
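The tripling trick and the restriction of R2 to the middle copy can be illustrated as follows. To stay dependency-free, the sketch swaps loess for a simpler Gaussian kernel-weighted local mean, so it demonstrates the periodic extension rather than the exact smoother used in the paper; the function names, bandwidth, and toy data are assumptions.

```python
import math

def periodic_smooth(theta, y, bandwidth=0.3):
    """Kernel-weighted local mean fitted on three shifted copies of the data.

    Only fitted values at the original theta (the middle copy) are returned.
    """
    two_pi = 2 * math.pi
    theta3 = [t - two_pi for t in theta] + list(theta) + [t + two_pi for t in theta]
    y3 = list(y) * 3
    fitted = []
    for t0 in theta:
        w = [math.exp(-0.5 * ((t - t0) / bandwidth) ** 2) for t in theta3]
        fitted.append(sum(wi * yi for wi, yi in zip(w, y3)) / sum(w))
    return fitted

def r_squared(y, fitted):
    """1 - SS_res / SS_total, computed on the original data points only."""
    ybar = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, fitted))
    ss_total = sum((a - ybar) ** 2 for a in y)
    return 1 - ss_res / ss_total

theta = [2 * math.pi * k / 200 for k in range(200)]
y = [math.cos(t) for t in theta]          # toy periodic response
fitted = periodic_smooth(theta, y)
r2 = r_squared(y, fitted)
```

Because the shifted copies supply neighbors on both sides of 0 and 2π, the fit near the boundaries uses the same amount of data as the fit in the interior.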
The circular correlation coefficient ρ
We use the circular correlation coefficient ρ defined by Jammalamadaka and Sarma (1988) to evaluate concordance between two polar vectors θ1 and θ2.
It is defined as
$$\rho = \frac{\sum_i \sin(\theta_{1i} - \mu_1)\,\sin(\theta_{2i} - \mu_2)}{\sqrt{\sum_i \sin^2(\theta_{1i} - \mu_1)\,\sum_i \sin^2(\theta_{2i} - \mu_2)}},$$
where μ1 and μ2 represent the mean directions of θ1 and θ2 respectively, and are estimated by maximum likelihood under a von Mises distribution assumption.
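A small sketch of the circular correlation coefficient is given below, assuming the standard Jammalamadaka and Sarma form with sines of deviations from the mean direction. The mean direction is computed from the summed sines and cosines, which coincides with the von Mises maximum likelihood estimate; function names and the toy angles are hypothetical.

```python
import math

def circular_mean(theta):
    """Mean direction: the angle of the summed unit vectors."""
    return math.atan2(sum(math.sin(t) for t in theta),
                      sum(math.cos(t) for t in theta))

def circular_correlation(theta1, theta2):
    """Circular correlation coefficient between two vectors of angles."""
    m1, m2 = circular_mean(theta1), circular_mean(theta2)
    s1 = [math.sin(t - m1) for t in theta1]
    s2 = [math.sin(t - m2) for t in theta2]
    num = sum(a * b for a, b in zip(s1, s2))
    den = math.sqrt(sum(a * a for a in s1) * sum(b * b for b in s2))
    return num / den

base = [0.5, 1.0, 1.5, 2.0, 2.5]
shifted = [t + 0.3 for t in base]     # same ordering, constant offset
reversed_ = [-t for t in base]        # opposite direction of travel
rho_same = circular_correlation(base, shifted)
rho_flip = circular_correlation(base, reversed_)
```

A constant offset between the two sets of angles leaves ρ at 1, which is why the coefficient suits comparisons between embeddings whose zero point may differ, while reversing the direction gives −1.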
Running other methods
For other cell cycle inference methods, we use all default parameters and their built-in references (if needed) in the following packages: cyclone in scran (v1.18.5), CellCycleScoring in Seurat (v4.0.0.9015), Revelio (v0.1.0), peco (v1.1.21), and reCAT (v1.1.0).
Silhouette index on angular separation distance of tricycle cell cycle position θ
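For cyclone and Seurat, we use the Silhouette index to describe consistency between the discretized cell cycle stages and tricycle cell cycle position θ. We use the angular separation distance metric to quantify the distance between cell i and cell j as
$$d(i, j) = 1 - \cos(\theta_i - \theta_j).$$
For a cell i, let $k(i)$ denote its assigned stage and $C_{k(i)}$ the set of cells assigned to that stage. The mean distance between cell i and all other cells assigned to the same stage is
$$a(i) = \frac{1}{|C_{k(i)}| - 1} \sum_{j \in C_{k(i)},\, j \neq i} d(i, j),$$
with $|C_{k(i)}|$ the cardinality of $C_{k(i)}$. Specially, a(i) = 0 if $|C_{k(i)}| = 1$. The smallest mean distance from cell i to all cells assigned to another stage k′, such that k′ ≠ k(i) ∧ k′ ∈ {G1, S, G2M}, is
$$b(i) = \min_{k' \neq k(i)} \frac{1}{|C_{k'}|} \sum_{j \in C_{k'}} d(i, j).$$
The Silhouette index for cell i is given as
$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}.$$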
For any cell i, the Silhouette index s(i) is bounded between −1 and 1 (−1 ≤ s(i) ≤ 1). An s(i) close to 1 means the cell is consistently assigned with its neighbors with respect to its cell cycle position θi. An s(i) close to −1 means the cell is closer to another stage. An s(i) equal to 0 means the cell is on the border of two stages. The mean Silhouette index over all cells measures how tight the stage assignments are. In this context, this value must be interpreted carefully, as the setting differs from traditional clustering, which might put hard boundaries and gaps between clusters. As the cell cycle process is continuous in nature, there must be cells on the boundaries that are ambiguous between two stages, and no gap should appear between stages. Thus a mean Silhouette index greater than 0 might be sufficient to conclude agreement between tricycle cell cycle position θ and discretized cell cycle stages.
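The Silhouette computation on angles can be sketched as below. It assumes the angular separation distance 1 − cos(θi − θj) and the standard Silhouette definition with a(i) = 0 for singleton stages; names and toy data are hypothetical, and at least two stages must be present.

```python
import math

def angular_distance(a, b):
    """Angular separation distance between two angles (assumed form)."""
    return 1.0 - math.cos(a - b)

def silhouette(thetas, stages):
    """Silhouette index per cell, using distances between cell cycle positions."""
    labels = set(stages)
    scores = []
    for i, (ti, ki) in enumerate(zip(thetas, stages)):
        same = [angular_distance(ti, tj)
                for j, (tj, kj) in enumerate(zip(thetas, stages))
                if kj == ki and j != i]
        a = sum(same) / len(same) if same else 0.0
        b = min(
            sum(angular_distance(ti, tj)
                for tj, kj in zip(thetas, stages) if kj == k)
            / sum(1 for kj in stages if kj == k)
            for k in labels if k != ki
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return scores

thetas = [0.1, 0.2, 3.0, 3.1]
good = silhouette(thetas, ["G1", "G1", "G2M", "G2M"])   # labels match positions
bad = silhouette(thetas, ["G1", "G2M", "G2M", "G1"])    # labels scrambled
```

With labels that follow the positions, every cell scores close to 1; scrambling the labels pushes scores below 0.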
Data availability
Data reported in this publication are being submitted to NCBI GEO and, until they are released, are available from the authors upon request.
Software availability
The tricycle method is implemented in the R package tricycle containing the mNeurosphere reference, which is available on https://github.com/hansenlab/tricycle. This package is being submitted to Bioconductor.
Funding
This project has been made possible in part by grant number CZF2019-002443 from the Chan Zuckerberg Initiative DAF, an advised fund of Silicon Valley Community Foundation. Research reported in this publication was supported by the National Institute of General Medical Sciences of the National Institutes of Health under award R01GM121459. This work was additionally supported by awards from the National Science Foundation (IOS-1665692), the National Institute on Aging (R01AG066768), and the Maryland Stem Cell Research Foundation (2016-MSCRFI-2805). GSO is supported by postdoctoral fellowship awards from the Kavli Neurodiscovery Institute, the Johns Hopkins Provost Award Program, and the BRAIN Initiative in partnership with the National Institute of Neurological Disorders and Stroke (K99NS122085).
Disclaimer
The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health or National Science Foundation.
Conflict of Interest
None declared.
Supplementary Materials
Supplementary Methods.
SUPPLEMENTARY METHODS
Comparison to existing cell cycle tools
Oscope
Oscope poses significant challenges when run on shallow data (10X, sci-RNA-seq3, or DropSeq), since the method requires quantification of a high number of genes in every cell. For this reason, we do not evaluate Oscope.
peco
Peco supplies 2 models: one trained on 101 genes and one trained on 5 genes. We used the 101 gene model to be robust to some genes not being measurable in all datasets. We applied peco to all datasets described in Supplementary Table S1, except mRetina and human fetal tissues. For human fetal tissues, we only use a random subset of 2000 cells selected from the human fetal intestine data (termed “hfIntestineSub”).
We assess the expression dynamics of 4 genes highlighted in Hsiao et al. (2020): CDK1, TOP2A, UBE2C and H4C3 (Supplementary Figure S15); datasets in which these genes are not measured are absent from the figure. To systematically compare tricycle and peco we use the R2 associated with the two different cell cycle positions. This compares R2 values for the same data and the same periodic loess approach, using two different position variables. For these genes, across all datasets, tricycle cell cycle position has a higher R2 than peco cell cycle position (Supplementary Figure S15). Generally, peco does better on information-rich Fluidigm C1 data than on information-poor 10X and Drop-seq data.
Revelio
Revelio is designed to search for an ellipsoid pattern amongst (rotated) principal components, by finding the directions with the strongest association to 5 discrete cell cycle stages. The output of Revelio is therefore supposed to be an ellipsoid. Revelio by itself does not quantify cell cycle position, although it seems natural to do so by the angle. When we use Revelio, we do indeed observe an ellipsoid in 6 datasets (Supplementary Figure S16a, b, f, g, i and j), but it clearly fails in 3 datasets: the mPancreas, mRetina, and mHSC datasets (Supplementary Figure S16c, d, and e). These 3 datasets all have substantial variation which is not associated with cell cycle, such as cell types and differentiation, which we believe explains the non-ellipsoidal embedding. For example, in the mPancreas data some of the differentiation effect is perfectly confounded with cell cycle as the terminally differentiated cells stop cycling. It is not clear that simply rotating the principal components will help us find a better cell cycle exclusive dimension. Additionally, Revelio removes any cell which does not have a prediction using the Schwabe stage predictor; in the mRetina dataset only 30k out of more than 90k cells are retained.
reCAT
reCAT starts with a principal component analysis of the cell cycle genes, and infers an ordering by solving a traveling salesman problem on this representation. This produces an ordering, but the ordering is hard to interpret because it is not directly linked to cell cycle stage. To address this, the authors provide two different stage predictors. Because the method requires the solution of a traveling salesman problem, it scales poorly. Due to these issues, we only ran reCAT on data with fewer than 5000 cells. The orderings inferred by reCAT are largely consistent with our cell cycle position θ using the mNeurosphere reference for all datasets except the most shallowly sequenced hfIntestineSub data (Supplementary Figure S17, last sub-panel in each panel). The expression dynamics of Top2A on the time series also confirm the appropriate ordering of cells (Supplementary Figure S17, third sub-panel in each panel). However, the two stage predictors given by reCAT yield different stage predictions. For example, for the mPancreas dataset (Supplementary Figure S17a), the majority of cells are in S stage based on Bayes scores but in G1 stage based on mean scores. Note that the reCAT function requires the user to supply an approximate cutoff position to assign a cell cycle stage based on Bayes scores. However, in none of the datasets were we able to choose cutoff positions such that each stage occupies the interval in which its own score is highest. Without a useful stage assignment, the ability to make use of the cell orderings is substantially restricted, as the percentage of cells in each stage differs across datasets.
Cyclone
We observe general agreement between the 3 stage predictions of cyclone and tricycle cell cycle position, as the cyclone stages cluster together (Supplementary Figure S18). We note that cyclone assigns very few cells to the S stage. We believe this is caused by the assignment strategy (cells are assigned to S stage if both G1 and G2M scores are below 0.5). To expand on this comparison, we computed a silhouette index with a distance defined by the tricycle cell cycle position (Methods). For cyclone, the under-representation of S stage drags down the silhouette index for both G1 and S stages: cells in S stage are usually mixed with G1 cells, so the mean distance to all G1 cells and the mean distance to all S cells are hard to distinguish. We note that cyclone works best on the last two FACS datasets, one of which (mESC) is the training dataset for the cyclone gene list.
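The assignment strategy described above can be written as a small rule. This is a sketch, not cyclone's code: the S rule (both scores below 0.5) is taken from the text, while choosing between G1 and G2M by the larger score is an assumption, as are the function name and the example scores.

```python
def assign_stage(g1_score, g2m_score, cutoff=0.5):
    """Three-stage assignment from G1 and G2M scores.

    S is a residual class: a cell is called S only when neither score
    reaches the cutoff (rule stated in the text). Picking G1 versus G2M
    by the larger score is an assumption made for this sketch.
    """
    if g1_score < cutoff and g2m_score < cutoff:
        return "S"
    return "G1" if g1_score >= g2m_score else "G2M"


# Hypothetical score pairs (G1 score, G2M score).
example_scores = [(0.4, 0.3), (0.7, 0.2), (0.3, 0.8), (0.55, 0.45)]
stages = [assign_stage(g1, g2m) for g1, g2m in example_scores]
print(stages)
```

Since S has no score of its own, a cell with even a moderately elevated G1 score (such as the last pair) is labelled G1, which is consistent with S cells being mixed into G1.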
Seurat
We observe good agreement between the 3 stage predictions of Seurat and tricycle cell cycle position, better than cyclone (Supplementary Figure S19). Compared to cyclone, we have a much higher silhouette index for Seurat; the highest observed mean is 0.74 for the mHSC dataset, which confirms the strong visual agreement between Seurat assignments and tricycle. The main disadvantage of Seurat is the inherent limitation of a 3-stage prediction.
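A silhouette index on circular positions can be sketched as follows. The exact definition used in this work is given in Methods; the version here is the standard silhouette formula combined with an assumed circular distance (the shorter arc between two positions), and the function names and toy data are illustrative.

```python
import math

TWO_PI = 2 * math.pi


def circ_dist(a, b):
    """Shorter arc length between two positions on a circle of circumference 2*pi."""
    d = abs(a - b) % TWO_PI
    return min(d, TWO_PI - d)


def silhouette(theta, labels):
    """Per-cell silhouette values s = (b - a) / max(a, b) using circ_dist."""
    n = len(theta)
    values = []
    for i in range(n):
        sums, counts = {}, {}
        for j in range(n):
            if j == i:
                continue
            lab = labels[j]
            sums[lab] = sums.get(lab, 0.0) + circ_dist(theta[i], theta[j])
            counts[lab] = counts.get(lab, 0) + 1
        own = labels[i]
        other_means = [sums[lab] / counts[lab] for lab in sums if lab != own]
        if own not in sums or not other_means:
            values.append(0.0)  # singleton stage or single label: use 0 by convention
            continue
        a = sums[own] / counts[own]  # mean distance to cells of the same stage
        b = min(other_means)         # mean distance to the closest other stage
        values.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return values


# Toy example: one stage straddles 0/2*pi, the other sits near pi.
theta = [0.1, 0.2, 6.2, 3.0, 3.1, 3.2]
labels = ["G1", "G1", "G1", "G2M", "G2M", "G2M"]
sil = silhouette(theta, labels)
print(sum(sil) / len(sil))
```

The circular distance matters for stages that straddle 0/2π: with an ordinary absolute difference, the cell at 6.2 would appear far from the cells at 0.1 and 0.2 despite being adjacent on the circle.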
(modified) Schwabe
The (modified) Schwabe method assigns cells to 5 different stages. Because of the higher resolution, it is the main predictor we use in our work. By default, the Schwabe method as reported in Schwabe et al. (2020) produces a substantial number of missing labels, and we have therefore modified the method to address this (Methods); we call this the modified Schwabe predictor.
Broadly, the (modified) Schwabe predictor agrees with tricycle, with one specific type of disagreement. These inconsistencies are examined in Supplementary Figure S20. Some cells with a tricycle cell cycle position of 0/2π (G0/G1) are assigned to other stages by modified Schwabe (Supplementary Figure S20, second sub-panel of each row). It is well appreciated that there are many more genes specifically expressed at S, G2 or M stage as compared to G0/G1 stage (Dolatabadi et al., 2017). For each dataset, we plot the percentage of non-expressed genes over all projection genes in the first sub-panels, which show that the dynamics of these percentages are captured by cell cycle position θ using the mNeurosphere reference. We plot the percentage of non-expressed genes conditioned on stage and on whether tricycle cell cycle position is around 0/2π (Supplementary Figure S20, third sub-panel of each row), which confirms that for each stage there exist two distinct groups. This is reinforced by the different expression patterns of Top2a and Smc4 between flagged cells and non-flagged cells in the last two sub-panels. Thus, we conclude that the cells around 0/2π are likely to be wrongly assigned to other stages, probably due to low information content.
To assess whether these inconsistencies are caused by our modification of Schwabe, we repeat the comparison using the original Schwabe assignments and arrive at the same conclusion (Supplementary Figure S21). This assessment highlights the large number of missing labels from the original Schwabe predictor; for example, only 30k out of 90k cells in the mRetina dataset are labelled.
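The diagnostic used for these inconsistencies, the percentage of non-expressed projection genes conditioned on stage and on whether position is around 0/2π, can be sketched as below. The window defining "around 0/2π" is not specified in this section, so the `window` value is an assumption, as are the function names and the toy counts.

```python
import math
from collections import defaultdict

TWO_PI = 2 * math.pi


def pct_nonexpressed(counts):
    """Percentage of projection genes with zero counts in one cell."""
    return 100.0 * sum(1 for c in counts if c == 0) / len(counts)


def near_zero(theta, window=0.25 * math.pi):
    """True if theta lies within `window` of 0/2*pi (window is an assumed value)."""
    d = theta % TWO_PI
    return min(d, TWO_PI - d) <= window


def summarize(cells):
    """Mean percentage of non-expressed genes per (stage, flagged) group.

    `cells` is an iterable of (stage, theta, counts) tuples, where counts
    holds the cell's counts over the projection genes.
    """
    groups = defaultdict(list)
    for stage, theta, counts in cells:
        groups[(stage, near_zero(theta))].append(pct_nonexpressed(counts))
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}


# Toy cells (made-up counts over 4 projection genes).
cells = [
    ("S", 0.1, [0, 0, 0, 5]),
    ("S", 2.0, [3, 2, 0, 1]),
    ("G2", 6.2, [0, 0, 1, 0]),
    ("G2", 3.5, [4, 1, 2, 0]),
]
summary = summarize(cells)
print(summary)
```

Within a stage, a clear gap between the flagged and non-flagged groups is the pattern that would suggest the flagged cells carry too little information for a reliable stage call.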