Universal prediction of cell cycle position using transfer learning

Shijie C. Zheng; Genevieve Stein-O’Brien; Jonathan J. Augustin; Jared Slosberg; Giovanni A. Carosso; Briana Winer; Gloria Shin; Hans T. Bjornsson; Loyal A. Goff; Kasper D. Hansen

doi:10.1101/2021.04.06.438463

ABSTRACT

The cell cycle is a highly conserved, continuous process which controls faithful replication and division of cells. Single-cell technologies have enabled increasingly precise measurements of the cell cycle both as a biological process of interest and as a possible confounding factor. Despite its importance and conservation, there is no universally applicable approach to infer position in the cell cycle with high resolution from single-cell RNA-seq data. Here, we present tricycle, an R/Bioconductor package, to address this challenge by leveraging key features of the biology of the cell cycle, the mathematical properties of principal component analysis of periodic functions, and the ubiquitous applicability of transfer learning. We show that tricycle can predict any cell’s position in the cell cycle regardless of the cell type, species of origin, and even sequencing assay. The accuracy of tricycle compares favorably to gold-standard experimental assays which generally require specialized measurements in specifically constructed in vitro systems. Unlike gold-standard assays, tricycle is easily applicable to any single-cell RNA-seq dataset. Tricycle is highly scalable, universally accurate, and eminently pertinent for atlas-level data.

INTRODUCTION

The cell cycle is the biological process which controls faithful replication and division of cells across all species of life. Despite existing as a continuous process, cell cycle has historically been characterized as having four discrete stages during which the cell performs growth and maintenance (G1), replicates its DNA (S), increases further in size and prepares for mitosis (G2), and undergoes mitosis and cytokinesis (M). Cell cycle is a highly conserved mechanism with an integral role in generating the diversity of cell types within multicellular organisms. As a result, maladaptive modifications of the cell cycle can have devastating consequences in development and disease (McConnell and Kaznowski, 1991; Ambros, 1999; Ohnuma and Harris, 2003). Despite its importance, many of the molecular mechanisms regulating and interacting with cell cycle remain poorly understood.

High-throughput expression data has been utilized for studying the cell cycle since the seminal work on the yeast cell cycle by Spellman et al. (1998) and Cho et al. (1998) at the dawn of the microarray era. This work used various approaches to synchronize cells in specific cell cycle stages followed by assaying cells in bulk. The data from Spellman et al. (1998) were later used by Alter et al. (2000) to show that principal component analysis reveals a circular pattern which represents the cyclical nature of the cell cycle; this study is widely cited as one of the first examples of the use of principal component analysis and singular value decomposition in analysis of high-throughput expression data. Subsequent work sought to systematically identify both periodically expressed genes and cell cycle marker genes and deposited these into widely used databases (Whitfield et al., 2002; Gauthier et al., 2008).

Single-cell technologies have made it possible to study the effects of cell cycle in multicellular organisms with a degree of sensitivity and accuracy previously available only in monocellular or clonal systems. Thus, cell cycle has been the subject of substantial interest, both as a biological variable of interest and as a possible confounding feature for other comparisons of interest (Buettner et al., 2015). A number of methods have been developed to estimate cell cycle state from single-cell expression data (Leng et al., 2015; Scialdone et al., 2015; Liu et al., 2017; Stuart et al., 2019; Hsiao et al., 2020; Schwabe et al., 2020). These methods differ broadly in the definition of cell cycle state (discrete stages vs. continuous pseudotime) as well as the use of special training data. Most of these methods have been demonstrated to be effective on datasets consisting of a single cell type. Despite the conservation of the cell cycle process, none of these methods have been shown to be applicable across single-cell technologies and mammalian tissues.

RESULTS

Transfer learning

To develop a universal method for estimating a continuous cell cycle pseudotime for a single-cell expression data set independent of technology, cell type, or species, we leverage transfer learning via dimensionality reduction (Pan et al., 2008). We define a reference cell cycle embedding (or latent space) into which we project a new data set; an approach originally advocated for in Stein-O’Brien et al. (2019). After projection, we infer cell cycle pseudotime as the polar angle around the origin. This pseudotime variable takes values in [0, 2π) and is unrelated to wall time, but rather represents progression through the cell cycle phases. We refer to this pseudotime variable as cell cycle position to avoid confusion with wall time and to emphasize its periodic nature.
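The polar-angle step can be illustrated with a minimal sketch. It assumes that each cell already has two embedding coordinates; the function name and the toy coordinates are hypothetical and are not taken from the tricycle package.

```python
import math

def cell_cycle_position(pc1, pc2):
    """Polar angle of the point (pc1, pc2) around the origin, in [0, 2*pi)."""
    # atan2 returns values in (-pi, pi]; the modulo moves negative angles
    # into the upper half of the [0, 2*pi) range.
    return math.atan2(pc2, pc1) % (2 * math.pi)

# Hypothetical embedding coordinates for four cells.
cells = [(1.0, 0.0), (0.0, 2.0), (-3.0, 0.0), (0.0, -0.5)]
positions = [cell_cycle_position(x, y) for x, y in cells]
```

Only the direction of each point matters here, not its distance from the origin, which is why cells with different overall expression levels can share the same position.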

To define a reference cell cycle embedding, we leverage key features of principal component analysis of cell cycle genes. Previous work has found that principal component analysis on expression data sometimes yields an ellipsoid pattern. This was first described by Alter et al. (2000); it has later been observed independently in multiple data sets (Schwabe et al., 2020; Liu et al., 2017; Mahdessian et al., 2021). Here, we demonstrate that the ellipsoid pattern is a consequence of a link between Fourier analysis of periodic functions and principal component analysis. The shape is created by the fact that cell cycle genes are periodic with a single peak of expression (which differs between genes). Thus, there is a direct link between progression through the cell cycle process and angular position on the ellipsoid.

We use the first two principal components to define a reference embedding representing the cell cycle. Because this reference embedding is a low dimensional linear space, we obtain an orthogonal projection operator allowing us to project any new data set into the reference embedding. We show that projecting new data into the reference cell cycle embedding overcomes technical and biological challenges posed by data sets where substantial variation is explained by one or more factors different from cell cycle, such as cellular differentiation.

Principal component analysis and periodic functions

To gain insight into gene expression dynamics over the cell cycle, we start by analyzing principal component analysis of periodic functions. Our model is a collection of periodic functions with a single peak, taking the form f_g(θ) = A_g cos(θ − L_g), with a gene-specific amplitude (A_g) and location of the peak (L_g), and with 0 ≤ θ < 2π representing the unknown cell cycle position. Figure 1a,b depicts the unobserved (true) time ordering, observed on a discrete grid of time points, together with a random permutation of these time points; this represents the observed data which is not ordered by time. A key insight is the fact that the first two principal components are the same for the observed and the unobserved data (Figure 1c), when performed on a discrete set of observation times. The unknown time order can be inferred from the principal component plot as the angle of each point, making it possible to fully reconstruct the unobserved time order (Figure 1d), i.e., the first two principal components form an orthogonal projection into a two-dimensional space representing the periodic time.

For this result to hold, it is required that the gene expression data exhibits at least two distinct peak locations (not separated by exactly π) and that each gene has at most one peak over the time period (Methods). The assumption of a single expression peak for each gene is supported by empirical data for genes in the cell cycle expression program (see below). The first two principal components of this data can be represented as b₁ cos(θ) + b₂ sin(θ), where b₁, b₂ are two-dimensional vectors which are linear functions of the eigenvectors and eigenvalues of a 2 × 2 matrix entirely determined by the set of peak locations and amplitudes (L_g, A_g) (Methods). No matter how many distinct peak locations and amplitudes are present, the space representing periodic time will always be two-dimensional. Higher dimensions are only required when individual genes have multiple peaks. Previous empirical investigations of cell cycle using expression data support the observation of a 2-dimensional space for principal component analysis (Buettner et al., 2015; Schwabe et al., 2020).
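The two-dimensionality can be seen from a short worked step. This is a sketch that assumes the single-peak cosine form f_g(θ) = A_g cos(θ − L_g), as used in the simulations, and applies the angle-difference identity:

```latex
\begin{align*}
f_g(\theta) &= A_g \cos(\theta - L_g) \\
            &= (A_g \cos L_g)\,\cos\theta + (A_g \sin L_g)\,\sin\theta .
\end{align*}
```

Under this assumption every gene is a linear combination of the same two functions, cos θ and sin θ, so the genes-by-cells matrix has rank at most two regardless of how many genes, amplitudes, or peak locations are present. The rank is exactly two when the coefficient pairs (A_g cos L_g, A_g sin L_g) are not all collinear, that is, when at least two peak locations differ by something other than 0 or π.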

Figure 1. Principal component analysis recovers time ordering in simulations.

Simulations are based on cosine functions with Gaussian noise (Methods). (a) Expression vs. time for 2 genes with different peak locations and amplitudes. Each of the two gene peaks is replicated 50 times for a total of 100 genes and 1,000 time points (cells). (b) Expression vs. permuted time, representing the unknown time order of observed data which obscures the periodicity of the functions. (c) Principal component analysis of the data from (a) and (b); the two datasets have equivalent principal components. We infer cell cycle position by the angle of the ellipsoid. The red dot indicates θ = 0. (d) Expression vs. inferred cell cycle position.

The simulated data depicted in Figure 1 has Gaussian noise, but we have verified that the result holds for data generated using the negative binomial distribution with an associated mean-variance relationship. Using the negative binomial distributed data required more than 2 distinct peaks to be stable (Supplementary Figures S1, S2). For both distributions, this approach is robust to downsampling of the data similar to what is seen with the increased sparsity from droplet-based sequencing technology. In simulations, we can recover cell cycle position with as few as 10 total counts per cell across 100 genes (depending on noise levels and heights of the peaks) (Supplementary Figure S3).
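The recovery of time order from shuffled data can be reproduced in a small, self-contained sketch. It is not the simulation code used for Figure 1: the peak locations, amplitudes, gene count, and the power-iteration PCA are illustrative assumptions, and the noise level defaults to zero so that the check at the end is deterministic (`noise_sd` can be raised to explore noisy data).

```python
import math
import random

def simulate(n_cells=200, peaks=((0.2 * math.pi, 2.0), (0.9 * math.pi, 1.0)),
             copies=3, noise_sd=0.0, seed=1):
    """Return shuffled true positions and a genes-by-cells expression matrix."""
    rng = random.Random(seed)
    theta = [2 * math.pi * i / n_cells for i in range(n_cells)]
    rng.shuffle(theta)  # observed cells are not ordered by time
    X = [[amp * math.cos(t - loc) + rng.gauss(0.0, noise_sd) for t in theta]
         for loc, amp in peaks for _ in range(copies)]
    return theta, X

def first_two_pc_scores(X, iters=200):
    """Scores on the first two principal components (orthogonal power iteration)."""
    n = len(X[0])
    Xc = []
    for row in X:
        mean = sum(row) / n
        Xc.append([v - mean for v in row])  # subtract each gene's mean
    g = len(Xc)
    cov = [[sum(a * b for a, b in zip(Xc[i], Xc[j])) / n for j in range(g)]
           for i in range(g)]
    axes = []
    for _axis in range(2):
        v = [1.0 / (k + 1) for k in range(g)]  # arbitrary start vector
        for _step in range(iters):
            w = [sum(cov[i][j] * v[j] for j in range(g)) for i in range(g)]
            for u in axes:  # keep the new axis orthogonal to earlier ones
                dot = sum(a * b for a, b in zip(w, u))
                w = [a - dot * b for a, b in zip(w, u)]
            norm = math.sqrt(sum(a * a for a in w))
            v = [a / norm for a in w]
        axes.append(v)
    return [[sum(u[i] * Xc[i][c] for i in range(g)) for u in axes]
            for c in range(n)]

def cyclic_descents(values):
    """Number of positions where a cyclic sequence decreases."""
    n = len(values)
    return sum(1 for i in range(n) if values[(i + 1) % n] < values[i])

theta, X = simulate()
scores = first_two_pc_scores(X)
inferred = [math.atan2(pc2, pc1) % (2 * math.pi) for pc1, pc2 in scores]

# Sort cells by inferred angle and read off the true positions in that order.
order = sorted(range(len(theta)), key=lambda c: inferred[c])
descents = cyclic_descents([theta[c] for c in order])
```

If the ordering is recovered, the true positions read in inferred order form a single cycle: `descents` equals 1 when the inferred direction matches the true one and `len(theta) - 1` when it is reversed. The starting point and direction are arbitrary, as expected for a periodic function.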

Recovering cell cycle position using principal component analysis on cell cycle genes

We next assess our model on experimental data, and learn an embedding representing cell cycle. We use 10x Genomics Chromium single-cell RNA-Sequencing (scRNA-Seq) data on two replicate cultures of E14.5 mouse cortical neurospheres (Methods), integrated using Seurat 3 and transformed to log₂-scale. The use of an alignment method (CCA in Seurat 3) to integrate the two samples is important for the quality of the ellipsoid, by maximizing the correlation structure between the two samples. Since neurospheres are maintained in a proliferative state, we expect that cell cycle phase is an important contributor to the variation in expression within this single-cell dataset. To confirm this expectation, we consider a UMAP representation of the data based on all variable genes (Supplementary Figure S4) colored according to the predictions from two separate cell cycle stage estimation utilities (cyclone and a modification of Schwabe et al. (2020) we call modified-Schwabe, see Methods); this analysis demonstrates that the cell cycle is a major source of transcriptional variation in the neurosphere dataset.

We then perform principal component analysis of the top 500 most variable genes amongst the roughly 1700 genes annotated with the Gene Ontology cell cycle term (GO:0007049, Methods) (Ashburner et al., 2000). As suggested by our model, the first two principal components form an ellipsoid with a sparse/empty interior (Figure 2a). Using the modified-Schwabe cell cycle stage predictor, we observe a strong relationship between polar angle on the ellipsoid and predicted cell cycle stage.

Figure 2. The cell cycle ellipsoid and cell cycle position.

(a) Top 2 principal components of GO cell cycle genes from E14.5 primary mouse cortical neurospheres, in which the variation is primarily driven by cell cycle. Each point represents a single cell, which is colored by 5 stage cell cycle representation, inferred using the modified Schwabe method (Schwabe et al., 2020). The cell cycle position θ (with values in [0, 2π); sometimes called cell cycle pseudotime) is the polar angle. (b) As in (a), but for a dataset of primary mouse hippocampal progenitor cells from both a mouse model of Kabuki syndrome and a wildtype. (c) A comparison of the weights on principal component 1 between the cortical neurosphere and hippocampal progenitor datasets. Genes with high weights (|score| > 0.1 for either vector) are highlighted in red. (d,e) The expression dynamics of (d) Top2a and (e) Smc4 using the inferred cell cycle position, with a periodic loess line (Methods). (f) The dynamics of total UMI using the inferred cell cycle position, with a periodic loess line, illustrating the high agreement of the dynamics between datasets.

The strong relationship between polar angle on the ellipsoid and predicted cell cycle stage was also observed on an independent dataset on cultured primary mouse hippocampal progenitors from a wild-type mouse as well as from a Kmt2d^+/βgeo mouse, a previously described model of Kabuki syndrome (Carosso et al., 2019). The data were processed similarly to the neurosphere data. Again, we select the top 500 most variable cell cycle genes and perform a principal component analysis (Figure 2b) which reveals an ellipsoid pattern. The shape of the principal component plot differs between the two datasets, but the weights used to form the first two principal components are highly concordant (Figure 2c, Supplementary Figure S5 for PC2) for the 318 genes present in both cell cycle embeddings. Almost all of the highly ranked genes (absolute weights > 0.1, highlighted in red and labelled with gene name) represent important regulators of, or participants in, the cell cycle. For example, the highest ranked gene is Topoisomerase 2A (Top2a), which controls the topological state of DNA strands and catalyzes the breaking and rejoining of DNA to relieve supercoiling tension during DNA replication and transcription (Lodish et al., 2008). Also highly ranked are Smc2 and Smc4, which compose the core subunits of condensin, which regulates chromosome assembly and segregation (Ono et al., 2003; Wei-Shan et al., 2019).

Given our mathematical analysis as well as the strong empirical relationship between polar angle on the ellipsoid and cell cycle stage predictions, we define a method to learn cell cycle position as the polar angle around the origin on the coordinate plane which we denote by θ. We center the coordinate plane on (0, 0) whose location corresponds to cells with zero expression for all 500 variable cell cycle genes.

To demonstrate that cell cycle position reflects the true biological cell cycle progression, we consider expression dynamics of specific cell cycle genes. For Top2a and Smc4 the peak expressions are observed at G2 stage around π (Figure 2d,e), consistent with their known increased expression through S phase and into G2 (Heck et al., 1988; Belluti et al., 2013; Wei-Shan et al., 2019). Furthermore, the dynamics are highly similar between the independently analyzed cortical neurosphere and hippocampal NPC datasets, which supports the observation that the two different embeddings yield concordant cell cycle positions (despite each including dataset-specific genes). These observations hold for all genes with high weights (Supplementary Figure S6). This approach serves as an internal control in any single-cell RNA-seq data set and can be used to assess the quality of any continuous ordering.

Next, we directly relate θ to the measured transcription values. Figure 2f shows the log₂ transformed total UMI numbers against θ, with a periodic loess smoother for each dataset. In both datasets, the maximum level is reached around π and the minimum around 1.5π, which correspond to the end of G2 and the middle of M stage respectively. We observe the total UMI number begins to increase at the beginning of G1/S phase and to decrease sharply as cells progress through M phase. The difference between the maximum and minimum of the periodic loess line is 1, corresponding to a two-fold difference in total UMI, which is known to be proportional to cell size (Marguerat and Bähler, 2012; Padovan-Merhar et al., 2015). This observation, and the timing with respect to cell cycle position, is consistent with the approximate reduction in cellular volume by one half as a result of cytokinesis in M phase and the formation of two daughter cells of roughly equal size.

Note that these principal component analyses are differentiating G2/M cells from G1/G0 cells on the first principal component. This is in contrast to the mathematical analysis where the starting point (θ = 0) can be any location (red point in Figure 1) as there is no clear starting point for a periodic function. That the first principal component differentiates G2/M from G1/G0 can be explained by the nature of principal component analysis. Before principal component analysis we subtract each gene’s mean expression. However, genes marking G2/M usually have very high expression compared to other stages, with G0/G1 being the lowest (Supplementary Figure S7), ensuring that this becomes the first principal component. A clustering analysis of the expression patterns provides further evidence that cell cycle genes have a single peak pattern of expression (Supplementary Figure S7). Thus, the observed behavior of the cell cycle genes in these data sets fits the theoretical requirements of our model.

In summary, principal component analysis of the cell cycle genes predicts cell cycle progression for the mNeurosphere and mHippNPC datasets with a high degree of similarity between the cell cycle position inferred independently in the two datasets as predicted by our mathematical model.

When principal component analysis fails to reflect cell cycle position

A principal component analysis does not always yield an ellipsoid pattern; a requirement for this to work is for the first principal component to be dominated by cell cycle. To illustrate this, we used an existing mouse developing pancreas dataset, with cell type labels (Bastidas-Ponce et al., 2019). A major source of variation in this dataset is cellular differentiation as demonstrated by a standard UMAP embedding (based on all variable genes) illustrating the previously described (Bastidas-Ponce et al., 2019) differentiation trajectories (Figure 3a). When we perform principal component analysis using only the variable cell cycle genes, the resulting PCA plot still reflects the differentiation trajectory and does not resemble the ellipsoid pattern observed in the previous section (Figure 3b,c). Note that PC1 has some relationship with cell cycle since the differentiation path goes from cycling to non-cycling cells, but it also reflects the progression from cycling multipotent cells to terminally differentiated cells. This result strongly suggests that some of the cell cycle genes may participate in biological processes other than the cell cycle and demonstrates that PCA of cell cycle genes does not always exclusively capture cell cycle variance.

Figure 3. When principal component analysis fails to describe the cell cycle.

Data is from the developing mouse pancreas. (a) UMAP embedding using all variable genes. Cells are colored by cell type. (b) PCA plot of the cell cycle genes; this reflects the differentiation path in (a). (c) PCA plot of the cell cycle genes for ductal cells only; this plot reflects cell cycle.

However, when we perform principal component analysis only on a subset of cells from a single, proliferating progenitor cell type, the ellipsoid pattern returns (Supplementary Figure S8a,b). This highlights the challenge of inferring cell cycle for datasets that contain many different cell types, including postmitotic cells.

Transfer learning through projection

To overcome the challenges of inferring cell cycle position in arbitrary datasets, we propose a simple, yet highly effective transfer learning approach we term tricycle (transferable representation and inference of cell cycle). In short, we first construct a reference embedding representing the cell cycle process using a fixed dataset where cell cycle is the primary source of transcriptional variation. For the remainder of this manuscript we will use the cortical neurosphere data as this reference. We show that the learned reference embedding generalizes across all datasets we have examined. Because our reference embedding is a linear subspace, we benefit from an orthogonal projection operator which allows us to map new data into the reference embedding, with well understood mathematical properties. Finally, we infer cell cycle position by the polar angle around the origin of each cell in the embedding space. The robustness of this approach is demonstrated by the ability of this projection to estimate cell cycle position in multiple independent and disparate datasets; evidence of which is provided below. Specifically, using the cortical neurosphere dataset as a fixed reference, our transfer learning approach generalizes across cell types, species (human/mouse), sequencing depths and even single-cell RNA sequencing protocols.
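The projection step can be sketched as follows. This is a simplified illustration rather than the package implementation: the data layout (dictionaries keyed by gene), the toy weights, and the choice to center each gene by its mean in the new dataset are all assumptions made for the example.

```python
import math

def project_cells(expr, weights):
    """Project cells into a 2-D reference embedding.

    expr:    dict gene -> list of log-expression values, one per cell
    weights: dict gene -> (w1, w2), loadings from a reference PCA (toy values)
    """
    shared = sorted(g for g in weights if g in expr)  # genes present in both
    if not shared:
        raise ValueError("no genes shared with the reference")
    n_cells = len(expr[shared[0]])
    coords = [[0.0, 0.0] for _ in range(n_cells)]
    for g in shared:
        row = expr[g]
        mean = sum(row) / n_cells  # assumption: center each gene in the new data
        w1, w2 = weights[g]
        for c, value in enumerate(row):
            coords[c][0] += w1 * (value - mean)
            coords[c][1] += w2 * (value - mean)
    return coords

def polar_angle(x, y):
    """Angle around the origin in [0, 2*pi)."""
    return math.atan2(y, x) % (2 * math.pi)

# Toy reference loadings; "GeneX" is a hypothetical gene absent from the new data.
weights = {"Top2a": (0.8, 0.1), "Smc4": (0.5, 0.3), "GeneX": (-0.2, 0.9)}
# Toy new dataset with four cells; "GeneY" is not in the reference and is ignored.
expr = {
    "Top2a": [3.0, 1.0, 0.0, 2.0],
    "Smc4": [2.0, 1.0, 0.0, 1.0],
    "GeneY": [5.0, 5.0, 5.0, 5.0],
}
coords = project_cells(expr, weights)
theta = [polar_angle(x, y) for x, y in coords]
```

Because the weights are fixed in advance, variation in the new dataset that is unrelated to the reference axes (for example, a differentiation trajectory) has little influence on the projected coordinates, which is the motivation given above for using a fixed embedding.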

As a demonstration, we consider a diverse selection of single-cell RNA-seq datasets representing different species (mouse and human), cell types and technologies (10x Chromium, SMARTer-Seq, Drop-seq and Fluidigm C1) (Table S1). We project these datasets into the cell cycle embedding learned from the neurosphere data (Figure 4, Supplementary Figure S9), and color the projections according to the modified Schwabe estimator of cell cycle stage. Although the shape of the projection varies from dataset to dataset, the cells of the same stage always appear at a similar position of θ, such as cells at S stage centering at 0.75π. To verify our cell cycle ordering, we look at the expression dynamics of Top2a and Smc4 as a function of θ (Figure 4, Supplementary Figure S10). PCA plots of the GO cell cycle genes for each dataset illustrate the advantage of using a fixed embedding to represent cell cycle (Supplementary Figure S11). Together, these results strongly support the conclusion that tricycle generalizes across data modalities.

Figure 4. A pre-learned weights matrix learned from proliferating cortical neurospheres enables cell cycle position estimation in other proliferating datasets.

(a) Different datasets (hippocampal NPCs, mouse pancreas, mouse retina and HeLa set 2) projected into the cell cycle embedding defined by the cortical neurosphere dataset. Cell cycle position θ is estimated as the polar angle. (b) Inferred expression dynamics of Top2a (TOP2A for human), with a periodic loess line (Methods). (c) UMAP colored by cell cycle position using a circular color scale.

Having inferred cell cycle position, we can visualize the cell cycle dynamics on a UMAP plot representing the full transcriptional variation, as is standard in the scRNA-Seq literature (Figure 4). To effectively visualize cell cycle position, we use a circular color scale to account for the fact that positions “wrap around” from 2π to 0. Doing so reveals the smooth behaviour of the tricycle predictions (despite not using smoothing or imputation) and argues for representing cell cycle in gene expression data as a continual progression rather than discrete states.
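A circular color scale can be sketched by mapping the angle to hue. The HSV hue wheel used here is a stand-in chosen because it is available in the Python standard library; it is not the palette used for the figures.

```python
import colorsys
import math

def circular_color(theta):
    """Map an angle to an RGB triple via hue, so 0 and 2*pi get the same color."""
    hue = (theta % (2 * math.pi)) / (2 * math.pi)
    return colorsys.hsv_to_rgb(hue, 1.0, 1.0)

colors = [circular_color(t) for t in (0.0, math.pi, 2 * math.pi)]
```

With a linear (non-circular) scale, cells just below 2π and just above 0 would receive opposite ends of the palette despite being adjacent in the cell cycle; a hue-based mapping avoids that discontinuity.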

Cell cycle position estimation on gold-standard datasets

We validated tricycle on multiple datasets containing “gold-standard” cell cycle measurements, including measurements by proxy using the fluorescent ubiquitination-based cell-cycle indicator (FUCCI) system and by fluorescence-activated cell sorting (FACS) of cells in discrete cell cycle stages. Both of these approaches allow for assignment to or selection of cells from discrete phases of the cell cycle. The FUCCI system uses a dual reporter assay in which the reporters are fused to two genes with dynamic and opposing regulation during the cell cycle (Sakaue-Sawano et al., 2008), allowing for a quantitative assessment of whether cells are in G1 or S/G2/M phase. In contrast to FACS, FUCCI systems, combined with an appropriate quantification method, make it possible to continuously measure cell cycle progression by placing the 2 protein measurements in a 2-dimensional space. Cell cycle pseudotime needs to be inferred from these 2-dimensional measurements, which is usually done by a variant of polar angle (Hsiao et al., 2020; Mahdessian et al., 2021).

Mahdessian et al. (2021) measured human U-2 OS cells to derive a FUCCI-based pseudotime scoring. Their FUCCI measurements form a distinct horseshoe shape with the left side of the horseshoe representing time post-metaphase-anaphase transition with a continuous progression through G1, S, G2 and ending pre-metaphase-anaphase transition (Figure 5; this depiction mirrors other data presentations (Sakaue-Sawano et al., 2008; Sakaue-Sawano et al., 2017)). Cell cycle is a continuous process which is not immediately reflected in the horseshoe form because of the large gap (in the x-axis) between the two ends of the horseshoe. The x-axis reflects the protein levels of geminin (GMNN) which is degraded during the metaphase-anaphase transition (McGarry and Kirschner, 1998) and the two ‘open’ ends of the horseshoe are closely connected in time despite the visual gap in the scatter-plot. This fact gives the FUCCI system the ability to assess whether a cell in M phase is before or after this transition, or said differently, a high temporal resolution around this transition despite the relatively short wall time compared to the rest of the cell cycle. We observe a close correspondence between tricycle cell cycle position and FUCCI pseudotime. The only cells for which there is a superficial disagreement are placed in M phase by tricycle (cell cycle position around 0.85π) and are split between pre-metaphase-anaphase transition and post-metaphase-anaphase transition by FUCCI pseudotime. For this particular transition, the FUCCI system has higher temporal resolution than tricycle; adding a small offset to these cells results in a remarkable concordance between the two systems (Figure 5). Elsewhere in the cell cycle, there is no evidence of better temporal resolution with FUCCI; examining expression dynamics suggests that tricycle performs at least as well as FUCCI at ordering key cell cycle genes.
We can use tricycle to examine the expression dynamics of GMNN and CDT1 which reveals that GMNN expression is stable across the cell cycle (Supplementary Figure S12), suggesting the protein is predominantly regulated post-transcriptionally during mitosis.
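Comparing two periodic orderings requires some care at the wrap-around point. The sketch below illustrates two generic operations under stated assumptions: re-expressing positions on an axis that starts at a chosen offset (0.85π is used here only because it is the starting point mentioned for the comparison plot), and measuring the smallest signed difference between two angles. The function names and toy values are hypothetical, and the rule is a simplification: it shifts every cell below the offset, whereas the adjustment described above applies to a selected group of cells.

```python
import math

TWO_PI = 2 * math.pi

def shift_axis(theta, start):
    """Re-express theta on the interval [start, start + 2*pi)."""
    return (theta - start) % TWO_PI + start

def circular_difference(a, b):
    """Signed smallest difference a - b on the circle, in [-pi, pi)."""
    return (a - b + math.pi) % TWO_PI - math.pi

start = 0.85 * math.pi
theta = [0.80 * math.pi, 0.90 * math.pi, 1.90 * math.pi]
shifted = [shift_axis(t, start) for t in theta]

# Two positions on either side of the 0 / 2*pi boundary are close on the circle.
gap = circular_difference(0.05 * math.pi, 1.95 * math.pi)
```

Positions below the offset gain one full period, which changes where they are drawn on a linear axis but not where they sit in the cell cycle.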

Figure 5. Evaluation of tricycle on FUCCI datasets.

(a-c) Data from Mahdessian et al. (2021). (a) FUCCI scores colored by tricycle cell cycle position. (b) Comparison between FUCCI pseudotime and tricycle cell cycle position with a periodic loess line. Cells in the dotted rectangle were moved by adding one period 2π to tricycle θ to reflect the higher temporal resolution around the metaphase-anaphase transition for FUCCI pseudotime. Note that the x-axis starts at 0.85π, which corresponds to FUCCI pseudotime 0. (c) R² values of periodic loess line of all projection genes when using tricycle θ and FUCCI pseudotime as the predictor. The dashed line represents y = x. (d-g) Data from Hsiao et al. (2020). (d) FUCCI scores colored by tricycle cell cycle position. (e,f) Expression dynamics of TOP2A with a periodic loess line using either (e) tricycle cell cycle position or (f) FUCCI pseudotime inferred by Hsiao et al. (2020). Cells are colored by 5 stage cell cycle representation, inferred using the modified Schwabe method (Schwabe et al., 2020). (g) Similar to (c), but for the data from Hsiao et al. (2020).

Hsiao et al. (2020) used FUCCI on human induced pluripotent stem cells (iPSC) followed by scRNA sequencing using Fluidigm C1. While the Mahdessian et al. (2021) FUCCI data look like a horseshoe, the Hsiao et al. (2020) FUCCI data are more akin to a cloud (the data differ in quantification and normalization of the FUCCI scores). These data are used to estimate a continuous cell cycle position (which we term “FUCCI pseudotime”) based on polar angle of the FUCCI scores. Compared with the data in Mahdessian et al. (2021), there are larger differences between FUCCI pseudotime and tricycle cell cycle position. However, we can directly compare the associated expression dynamics of key cell cycle genes (Figure 5 for TOP2A, Supplementary Figure S13 for 8 additional genes). These results suggest that tricycle cell cycle position is at least as good as the FUCCI pseudotime at ordering the cells along the cell cycle; the R² for TOP2A is 0.42 for tricycle compared with 0.27 for peco.
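The idea of scoring an ordering by the R² of a periodic smoother can be sketched as follows. Note the substitution: instead of a periodic loess, this example uses a simpler Nadaraya-Watson smoother with a von Mises kernel, and the gene, noise level, and bandwidth parameter `kappa` are illustrative assumptions. The values it produces are not comparable to those reported above.

```python
import math
import random

def periodic_smooth(theta, y, kappa=10.0):
    """Kernel smoother that is periodic in theta (von Mises weights)."""
    fitted = []
    for t in theta:
        w = [math.exp(kappa * (math.cos(t - s) - 1.0)) for s in theta]
        fitted.append(sum(wi * yi for wi, yi in zip(w, y)) / sum(w))
    return fitted

def r_squared(y, fitted):
    """Proportion of variance in y explained by the fitted values."""
    mean = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, fitted))
    ss_tot = sum((a - mean) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot

rng = random.Random(0)
n = 200
theta = [2 * math.pi * i / n for i in range(n)]
# Toy gene peaking at pi, with Gaussian noise.
expr = [math.cos(t - math.pi) + rng.gauss(0.0, 0.2) for t in theta]
scrambled = theta[:]
rng.shuffle(scrambled)  # an uninformative ordering of the same cells

r2_good = r_squared(expr, periodic_smooth(theta, expr))
r2_bad = r_squared(expr, periodic_smooth(scrambled, expr))
```

An ordering that tracks the cell cycle lets the smoother explain most of the variance of a periodic gene, while an uninformative ordering of the same cells explains little, so R² can serve as a yardstick for comparing two candidate orderings on the same expression data.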

In contrast to FUCCI measurements, FACS sorting and enrichment of cells yields groups of cells in (supposedly) distinct phases of the cell cycle. We consider 2 different datasets where FACS has been combined with single-cell RNA-seq. Buettner et al. (2015) assayed mouse embryonic stem cells (mESC) using Hoechst 33342-staining followed by cell isolation using the Fluidigm C1. They use very conservative gating for G1 and G2M at the cost of less conservative gating for S phase. Leng et al. (2015) used FACS on FUCCI labeled H1 human embryonic stem cells (hESC) followed by cell isolation using the Fluidigm C1. In both experiments, cells largely appear as expected in the cell cycle embedding defined by the cortical neurosphere reference embedding (Supplementary Figure S14). For the mESC, we note that some cells labeled S (but not G1 or G2M) appear outside the position expected for this stage, consistent with the gating strategy used for these data.

Summarizing this evidence, we conclude that tricycle recapitulates and refines the cell cycle ordering consistent with current “state of the art” experimental methods. Tricycle cell cycle position is competitive with FUCCI based measurements, except for cells in the metaphase to anaphase transition during mitosis.

Comparison to existing tools for cell cycle position inference

We next sought to compare tricycle cell cycle position estimates to those obtained from other available methods. Existing methods for cell cycle assessment can be divided into those which infer a continuous position and those which assign a discrete stage. We have evaluated the following methods: peco (Hsiao et al., 2020), Revelio (Schwabe et al., 2020), Oscope (Leng et al., 2015), reCAT (Liu et al., 2017), cyclone (Scialdone et al., 2015), Seurat (Stuart et al., 2019), the original Schwabe (Schwabe et al., 2020), and the modified Schwabe 5 stage assignment method. Each method differs in which datasets it works well on and which issues it might have; a detailed comparison is available in the Supplement (Supplemental Methods, Supplementary Figures S15–S21).

Issues with existing methods include (a) the ability to work on datasets with multiple cell types, (b) the ability to scale to tens of thousands of cells or more, and (c) the ability to work on less information rich datasets such as those generated by droplet-based or in situ scRNA-Seq methods. Oscope requires data on many genes due to its use of pair-wise correlations, and therefore does not work on less information rich platforms (e.g., 10x Chromium or Drop-Seq). peco works better on less sparse, information-rich data (e.g., Fluidigm C1), but even on data from this platform, it is outperformed by tricycle. reCAT is critically dependent on the extent to which a principal component analysis of the cell cycle genes reflects cell cycle and only infers a cell ordering; it is not straightforward to interpret the reCAT ordering, especially across datasets. Revelio is primarily a visualization tool, which appears to fail on datasets where substantial variation is driven by processes other than the cell cycle. Of the discrete predictors, Seurat agrees well with tricycle (and is very scalable) but is limited by only predicting a 3 stage cell cycle representation (G1/S/G2M). Cyclone appears to do poorly in labelling cells in S phase and only predicts 3 stages. The (modified) Schwabe predictor assigns 5 stages, but has many missing labels and mis-assigns cells from G0/G1 to other stages.

Additionally, we benchmarked the computational speed of tricycle against other cell cycle estimation algorithms, comparing the running time of several methods on subsets of the mRetina dataset (Supplementary Figure S22). Computing continuous estimates with tricycle takes a mean of about 0.58, 0.86 and 1.48 seconds when the number of cells is 5000, 10000, and 50000 respectively. In contrast, for a three stage estimation Seurat takes a mean of about 1.10, 1.22 and 4.95 seconds and cyclone takes a mean of about 7.96, 11.50 and 50.66 minutes for the same numbers of cells. Other methods (peco, Oscope, reCAT) are not capable of processing large (10k-100k+) datasets. All comparisons were run on an Apple Mac mini (2018) with a 3.2 GHz 6-Core Intel Core i7 CPU, 64GB RAM, and macOS 11.2. Thus, tricycle is able to scale with the increasing size of datasets.

Application of tricycle to a single-cell RNAseq atlas

To demonstrate the scalability and generalizability of tricycle we applied it to a recent dataset of ≈ 4 million cells from the developing human (Cao et al., 2020). The data were generated using combinatorial indexing (sci-RNA-seq3) and are relatively lightly sequenced with a median of 429 – 892 total UMIs for 4 single-cell profiled tissues and 354 – 795 for 11 single-nuclei profiled tissues (Supplementary Figure S23). Using tricycle, we are able to rapidly and robustly annotate cell cycle position for each of the cells/nuclei in this atlas (Figure 6a, Supplementary Figure S24). Within a global UMAP embedding, tricycle annotations enable immediate visual identification of proliferating and/or progenitor cell populations for most cell types and tissues. The rapid annotation of cell cycle position on this reference dataset further allowed us to examine the relative differences in the proportion of cells actively proliferating across different tissues and cell types in the developing human. To quantify this, we discretized all cells along θ into two bins corresponding to actively proliferating (0.25π < θ < 1.5π; S/G2/M) or non-proliferating (G1/G0). We next ranked each tissue by the relative proportion of actively proliferating cells to identify the tissues and cell types with the highest proliferative index (Figure 6b). To examine cell-type specific differences in proliferation potential, we computed the cell cycle embedding as well as the proliferative index for the 9 most abundant cell types within each tissue (Supplementary Figures S25 and S26).
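The two-bin discretization described above can be sketched as follows. This is a minimal illustration, not the package's implementation; the function names `is_proliferating` and `proliferative_index` and the example angles are hypothetical, and only the interval 0.25π < θ < 1.5π comes from the text.

```python
import math


def is_proliferating(theta):
    """Return True when theta (radians) falls in the S/G2/M bin.

    The bin 0.25*pi < theta < 1.5*pi is taken from the text; theta is
    first wrapped into [0, 2*pi).
    """
    theta = theta % (2 * math.pi)
    return 0.25 * math.pi < theta < 1.5 * math.pi


def proliferative_index(thetas):
    """Fraction of cells whose position lies in the proliferating bin."""
    if not thetas:
        raise ValueError("need at least one cell")
    return sum(is_proliferating(t) for t in thetas) / len(thetas)


# Hypothetical cell cycle positions for six cells of one tissue.
example_thetas = [0.1, 1.0, math.pi, 4.0, 5.5, 6.0]
example_index = proliferative_index(example_thetas)
```

Ranking tissues then amounts to sorting them by this fraction.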

Figure 6. Application of tricycle on a human fetal tissue atlas.

Data is from Cao et al. (2020). (a) UMAP embedding of human fetal tissue atlas data colored by cell cycle position θ estimated using the mNeurosphere reference. (b) The percentage of actively proliferating cells in the human fetal tissue atlas. Tissues are ordered by decreasing percentage. Tissue and cell type annotations are available in Supplementary Figure S24.

Tissue-level proliferation indexes identified thymus, cerebrum, and adrenal gland as having the highest overall proportions of dividing cells across the sampled fetal timepoints. Within the thymus, thymocytes represent both the most abundant cell type and the most ‘prolific’ cell type as a function of the proportion of mitotic cells. Thymocytes exhibit a circular embedding in UMAP space that effectively recapitulates the cell cycle position predictions from tricycle (Supplementary Figure S26k). Within this circular embedding, there is a gap of cells with cell cycle position estimates at π, consistent with dropout of cells and lower information content in M-phase. Comparison of tricycle cell cycle annotations to modified Schwabe cell cycle phase calls in this embedding suggests that tricycle more accurately estimates cell cycle position even on cell types with a mean total UMI of 354 (Supplementary Figure S27).

Within tissues, lymphoid cells are often the cell type with the highest proliferation index (Supplementary Figures S25, S26), often with a greater number of actively proliferating cells than not. Within the fetal liver and spleen – both sites of early embryonic erythropoiesis during human development (Cumano and Godin, 2007) – erythroblasts represent the cell type with the highest fraction of proliferating cells. Across developmental time, most tissues maintain relatively monotonic proliferation indices; however, several (liver, placenta, intestine) exhibit dynamic changes across the sampled timepoints. This application illustrates the utility of tricycle for atlas-level data.

Stability of the cell cycle position assignments

To test the robustness of tricycle we performed in-silico experiments to determine the stability of cell cycle position assignments. We evaluated three different types of stability, with respect to (a) missing genes, (b) sequencing depth, and (c) data preprocessing.

When projecting new data into the cell cycle reference embedding, it is common that the feature mapping between the two data sets contains only a subset of the 500 genes used in the embedding. The number of genes available for the feature mapping has an impact on the shape of the resulting embedding; the mNeurosphere and mHippNPC datasets have almost the same shape when restricted to a set of common genes (Supplementary Figure S28). To establish the stability of tricycle, we randomly removed genes from the neurosphere dataset and computed tricycle cell cycle positions; we used the neurosphere dataset as a positive control to ensure all genes are present. We used the circular correlation coefficient to assess the similarity between the tricycle cell cycle position for the full dataset and for the dataset with randomly pruned genes (Supplementary Figures S29, S30). This reveals excellent stability (circular ρ > 0.8) using as few as 100 genes.

To examine the impact of sequencing depth, we downsampled the mHippNPC dataset (Supplementary Figures S31, S32), and used the circular correlation coefficient to quantify the similarity to the cell cycle position inferred using the full sample. Originally, the median of library sizes (total UMIs) is 10,000 for the mHippNPC data. Downsampling to 20% of the original depth (approximate median of library sizes 2,000) kept circular ρ > 0.8. This is congruent with the observed robustness of the method to the varying sequencing depth of the various datasets examined above.
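The text does not state how the downsampling was carried out. One common choice, assumed here purely for illustration, is binomial thinning, where every UMI is kept independently with probability equal to the target fraction. The function name `downsample_counts` is hypothetical.

```python
import random


def downsample_counts(counts, fraction, rng):
    """Binomial thinning of one cell's UMI counts (an assumed scheme).

    Each of the `c` UMIs of a gene is kept independently with
    probability `fraction`, so the expected library size shrinks by
    that fraction while zero counts stay zero.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie in [0, 1]")
    return [sum(1 for _ in range(c) if rng.random() < fraction) for c in counts]


# A hypothetical cell with 10,000 UMIs spread over 100 genes,
# thinned to roughly 20% of its original depth.
rng = random.Random(0)
cell = [100] * 100
thinned = downsample_counts(cell, 0.2, rng)
```

Cell cycle positions estimated from the thinned counts can then be compared with those from the full counts using the circular correlation coefficient.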

Next, we examined the stability of tricycle with respect to the choice of reference embedding. Above, we show a cell cycle space estimated separately for the mNeurosphere and the mHippNPC datasets (Figure 2). We observe that the inferred expression dynamics are more alike in the two datasets if we project the mHippNPC data into the mNeurosphere embedding compared to using its own embedding. To quantify this, we pick key cell cycle genes (previously examined in Supplementary Figure S6) and compare the location of peak expression in the mNeurosphere dataset to that in the mHippNPC dataset with cell cycle position estimated using these two approaches (Supplementary Figure S33). For the vast majority of genes, the highest expression appears at a closer position when we estimate cell cycle position by projecting the mHippNPC dataset into the mNeurosphere embedding.

To examine the impact of preprocessing data prior to projection, we compared cell cycle position inferred using data processed with and without Seurat. Note that when we estimate the cell cycle space, we use Seurat to align the different biological samples. But this is not done when we project new data using the pre-learned reference. We observe negligible differences, whether or not Seurat is used (Supplementary Figure S34).

These results demonstrate that tricycle accurately estimates the cell cycle position across a wide range of both the number of detectable genes within the feature map and the depth of information content in the target cells.

DISCUSSION

Here, we have demonstrated the ability of tricycle to accurately call cell cycle position in 26 datasets across species, cell types, and assay technologies.

Tricycle achieves its universality by leveraging key features of the biology of the cell cycle, the mathematical properties of principal component analysis on periodic functions, and ubiquitous applicability of transfer learning to enable rapid and efficient use across a diverse collection of datasets. Our embedding – shaped by the fact that the first dimension stratifies G2/M from G1/G0 – ensures that we can easily interpret cell cycle position between datasets, overcoming one challenge of cell cycle inference. The stage specific periodicity of cell cycle markers, tied to their biological function, implies that the cell cycle space becomes two dimensional. Our definition of cell cycle position as the polar angle of a cell embedded in the reference cell cycle space, serves as a form of internal normalization and helps with the generalizability of tricycle across datasets. Despite this, it is still remarkable that we can project new data – without data integration or batch effect removal – and still get a useful and accurate embedding of the data into the cell cycle space with minimal computational effort. Because the projection operator is a single low-dimensional linear operation, tricycle has excellent scalability and can easily be applied to atlas-scale datasets. Thus, tricycle is a powerful tool for quickly and accurately inferring cell cycle position for single-cell RNA-seq data.

The cell cycle is a major source of transcriptional variation in many biological systems. In particular, highly studied systems such as developmental and disease processes rely on proper regulation of the cell cycle. In many single-cell experiments however, cell cycle is often considered a confounding factor and as such, methods exist to remove this effect from the data prior to analysis. We caution against removing cell cycle progression blindly as it can be intimately intertwined with other sources of variation of interest. Taking the mPancreas data as an example, there is a clear relationship between the number of cycling cells and differentiation as the multi-potent ductal cells advance to be terminally differentiated alpha and beta cells. If correction for cell cycle progression is warranted, our analysis of the mPancreas data suggests that the common approach of regressing out principal components of cell cycle genes may remove biological variation of interest.

The success of tricycle’s application using a single arbitrary cell cycle embedding raises interesting questions about the robustness and universality of the biological process itself. Here, we use a fixed reference embedding to represent cell cycle, defined using the mouse cortical neurosphere dataset. This raises the question: is there a single best embedding? One approach would be to decrease the size of the gene list used to construct the embedding. In support of this, Hsiao et al. (2020) report that as few as 6 genes yield good performance. We find a small set (though larger than 6) of genes with high weights (Figure 2), but making the list too small results in inferior performance.

Another approach would be to optimize the embedding to be as circular as possible. However, despite different shapes, embeddings based on the cortical neurosphere and the primary hippocampal NPC datasets result in similar cell cycle position estimates. Both results argue that the robustness of the method is derived from the structure created by the relationship of the genes to each other rather than the behavior of any individual marker gene. Thus, so long as the structure of the embedding is driven by the cell cycle, the specific source of the reference embedding is irrelevant. Here, we use 500 genes as well as a single, clean, dataset to define the cell cycle embedding, and we show that this achieves excellent generalization performance without any optimization. While our use of a single, fixed, reference embedding is a clear advantage to users, our package contains functions to define and use a custom reference embedding.

METHODS

Using principal component analysis to recover time ordering

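We will consider the following statistical model. The mean expression of each gene is modelled as

$$f_g(t) = A_g \cos(t - d_g), \qquad t \in [0, 2\pi],$$

where A_g is a gene-specific amplitude and d_g is a gene-specific displacement (location of the peak). In this formulation, the mean function has a single peak and is periodic. We have G genes and each gene has its own (but not necessarily unique) (A_g, d_g).

Basic trigonometry yields the identity

$$\cos(t - d_g) = \cos(d_g)\cos(t) + \sin(d_g)\sin(t),$$

which we can write as

$$f_g(t) = c_{g1}\,\phi_1(t) + c_{g2}\,\phi_2(t), \qquad c_{g1} = \sqrt{\pi}\,A_g\cos(d_g), \quad c_{g2} = \sqrt{\pi}\,A_g\sin(d_g),$$

using the orthonormal functions

$$\phi_1(t) = \pi^{-1/2}\cos(t), \qquad \phi_2(t) = \pi^{-1/2}\sin(t).$$

Our derivation is based on Ramsay and Silverman (2005) section 8.4. This section shows that the variance-covariance operator is given by

$$v(s, t) = G^{-1}\,\phi(s)^t\, C^t C\, \phi(t),$$

where φ(t) = (φ₁(t), φ₂(t))^t, C is the G × 2 matrix with entries c_{gj}, and the inner matrix (which turns out to determine the principal components) is a 2 × 2 matrix equal to

$$G^{-1} C^t C = \frac{\pi}{G}\begin{pmatrix} \sum_g A_g^2\cos^2(d_g) & \sum_g A_g^2\cos(d_g)\sin(d_g) \\ \sum_g A_g^2\cos(d_g)\sin(d_g) & \sum_g A_g^2\sin^2(d_g) \end{pmatrix}.$$

The principal component analysis is given by the Eigen-functions and −values of the variance-covariance operator. Such an Eigen-function and −value pair ξ, ρ takes the form

$$\xi(s) = \phi(s)^t\, b$$

for a vector b which satisfies

$$G^{-1} C^t C\, b = \rho\, b,$$

i.e. b, ρ are Eigen-vectors and −values for the G⁻¹C^t C matrix. Specifically, if q₁, q₂, λ₁, λ₂ are two such Eigen-vectors and −values then the two first principal components are given by

$$\xi_1(t) = \phi(t)^t q_1, \qquad \xi_2(t) = \phi(t)^t q_2.$$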
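To make the role of the 2 × 2 inner matrix concrete, the sketch below computes it and its eigen-decomposition for the two-group setting used in the Figure 1 simulation. It assumes the cosine mean model f_g(t) = A_g cos(t − d_g) expanded in the basis cos(t)/√π, sin(t)/√π; the function names `inner_matrix` and `eigen_sym2` are hypothetical and the closed-form 2 × 2 eigen-solver is a standard result, not something taken from the paper.

```python
import math


def inner_matrix(amplitudes, peaks):
    """Entries (m11, m12, m22) of the symmetric matrix G^{-1} C^t C.

    Assumes f_g(t) = A_g * cos(t - d_g) written in the orthonormal basis
    cos(t)/sqrt(pi), sin(t)/sqrt(pi), so that row g of C equals
    sqrt(pi) * A_g * (cos d_g, sin d_g).
    """
    scale = math.pi / len(amplitudes)
    pairs = list(zip(amplitudes, peaks))
    m11 = scale * sum(a * a * math.cos(d) ** 2 for a, d in pairs)
    m12 = scale * sum(a * a * math.cos(d) * math.sin(d) for a, d in pairs)
    m22 = scale * sum(a * a * math.sin(d) ** 2 for a, d in pairs)
    return m11, m12, m22


def eigen_sym2(m11, m12, m22):
    """Eigenvalues (largest first) and unit eigenvectors of [[m11, m12], [m12, m22]]."""
    half_trace = (m11 + m22) / 2.0
    radius = math.hypot((m11 - m22) / 2.0, m12)
    angle = 0.5 * math.atan2(2.0 * m12, m11 - m22)  # direction of the leading eigenvector
    first = (math.cos(angle), math.sin(angle))
    second = (-math.sin(angle), math.cos(angle))
    return (half_trace + radius, first), (half_trace - radius, second)


# Two groups of 50 genes, as in the Figure 1 simulation.
amplitudes = [0.5] * 50 + [1.0] * 50
peaks = [0.2] * 50 + [1.2] * 50
(lam1, q1), (lam2, q2) = eigen_sym2(*inner_matrix(amplitudes, peaks))
```

Under these assumptions the second eigenvalue is positive only when at least two distinct peak locations are present; with a single shared peak location the matrix has rank one and the embedding collapses to a line.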

Simulations

For Figure 1 we performed the following simulation. We generated 50 realizations of a cosine function with a location of 0.2 and an amplitude of 0.5, as well as 50 realizations of a cosine function with a location of 1.2 and an amplitude of 1. Each function was evaluated on an equidistant grid of 1000 points and independent Gaussian noise with a standard deviation of 0.2 was added. The curves depicted in Figure 1a,b are each one realization of the two different cosine functions.

For Supplementary Figures S1, S2 and S3 we simulated data using the negative binomial distribution, inspired by the setup in Splatter (Zappia et al., 2017). In addition to a gene-specific amplitude (A_g) and location of the peak (L_g), we also consider different library sizes (l), which are approximate as we still have some cell-to-cell variance. For a cell at timepoint θ, the expression level of gene g is a cosine in θ with amplitude A_g and peak location L_g, plus a constant c that ensures positivity. The cell mean is then obtained from these levels and the library size l. The trended cell mean λ_g is simulated from a Gamma distribution around the cell mean, with B the biological coefficient of variation (we fix B as 0.1 in our simulations). Thus, the count for gene g is given as y_g ∼ Pois(λ_g). We always simulate a count matrix of 100 genes by 5000 cells, with cell timepoint θ uniformly distributed between 0 and 2π. We only vary one of L_g, A_g and l in each of Supplementary Figures S1, S2 and S3. Specifically, in Supplementary Figure S1, we used different numbers of distinct peak locations across 100 genes, and fixed the amplitudes (across 100 genes) as 3 and the library size as 2000. In Supplementary Figure S2, we used different numbers of distinct amplitudes across 100 genes, and fixed the number of distinct peak locations (across 100 genes) as 100 and the library size as 2000. In Supplementary Figure S3, we changed the library size l, and fixed the number of distinct peak locations (across 100 genes) as 100 and the amplitudes (across 100 genes) as 3. PCA was performed on the library size normalized and log₂ transformed count matrix.
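A Gamma-Poisson simulation of this kind can be sketched for a single cell as below. Several details are assumptions made for illustration and are not stated in the text: the unscaled level is taken as A_g cos(θ − L_g) + c with c = max(A_g) + 1, the cell means are the levels rescaled to sum to the library size, and the Gamma distribution uses shape 1/B² so that its coefficient of variation is B. The names `poisson` and `simulate_cell` are hypothetical.

```python
import math
import random


def poisson(lam, rng):
    """Draw a Poisson variate with Knuth's multiplication method.

    Adequate for the small means used here (a few tens per gene).
    """
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def simulate_cell(theta, amplitudes, peaks, library_size, bcv, rng):
    """Simulate counts for one cell at timepoint theta (assumed model)."""
    # Assumption: cosine level plus a constant c that keeps every level positive.
    c = max(amplitudes) + 1.0
    levels = [a * math.cos(theta - p) + c for a, p in zip(amplitudes, peaks)]
    total = sum(levels)
    # Assumption: cell means are the levels rescaled to the library size.
    means = [library_size * v / total for v in levels]
    shape = 1.0 / bcv ** 2  # Gamma shape giving a coefficient of variation of bcv
    counts = []
    for m in means:
        lam = rng.gammavariate(shape, m / shape)  # Gamma with mean m
        counts.append(poisson(lam, rng))
    return counts


# One cell with 100 genes, amplitude 3, 100 distinct peak locations,
# library size 2000 and B = 0.1, mirroring the fixed settings in the text.
rng = random.Random(1)
gene_peaks = [2 * math.pi * g / 100 for g in range(100)]
cell_counts = simulate_cell(0.5, [3.0] * 100, gene_peaks, 2000, 0.1, rng)
```

Repeating this for 5000 values of θ drawn uniformly from [0, 2π) would give a count matrix of the size described above.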

Generation of mouse primary hippocampal NPC scRNA-Seq dataset

Hippocampal neural stem/progenitor cells (NPCs) were isolated by microdissection from embryonic day 17 (E17) embryos (offspring of male Kmt2d^+/βgeo and female C57Bl/6J) and cultured on Matrigel as described in Carosso et al. (2019). We verified neuronal lineage by demonstrating Nestin, Calbindin, and Prox1 expression (not shown). Cells were maintained in an undifferentiated state with growth factor inhibition (EGF, FGF2) in Neurobasal media. In a prior publication, we have demonstrated that the Kmt2d^+/βgeo cells exhibit defects in proliferation (Carosso et al., 2019). Following isolation we collected cells from both genotypes at the undifferentiated state (day 0) and then after growth factor removal on days 4, 7, 10 and 14, capturing cells that were ever more differentiated. scRNA-Seq libraries were created with a Chromium Single-Cell 3’ library & Gel Bead Kit v2 (10x Genomics) according to the manufacturer's protocol. Only cells from day 0 are analyzed here.

Generation of mouse E14.5 Neurosphere scRNA-Seq dataset

Cortical neurospheres were generated from the dissociated telencephalon of embryonic day 14.5 (E14.5) wild type embryos. Embryos were harvested and the dorsal telencephalon was dissected away and collected in 1X HBSS at room temperature (RT). The dorsal telencephalon was gently triturated using p1000 pipette tips and the resultant cell suspension was spun at 500 × g for 5min and the media was aspirated off. The cell pellets were resuspended in 7ml of complete neurosphere media (CNM) and plated in ultra-low adherence T25 flasks. CNM is made by combining 480ml DMEM-F12 with glutamine, 1.45g of glucose, 1X N2 supplement, 1X B27 supplement without retinoic acid, 1X penicillin/streptomycin and 10ng/ml of both epidermal growth factor (EGF) and basic fibroblast growth factor (bFGF). The cell pellets were cultured for 3-5 days, or until spheroids had formed. The neurospheres were then collected and spun at 100 × g for 5min and the supernatant was removed. Neurospheres were resuspended in 5ml TrypLE and incubated for a maximum of 5min at 37° C with gentle trituration every 1.5min with a p1000 until the neurospheres were mostly a single-cell suspension. The cells were spun down at 500 × g for 5min and the supernatant was removed. The cells were resuspended in 15ml of CNM and gently passed through a 40 µm filter to remove large cell clumps. The resultant cell suspension was then plated in T75 flasks for another 2-5 days or until spheres began to have dark centers. This process was repeated two more times before cells were collected for 10X Genomics single-cell library prep. Before single-cell library preparation, the neurospheres were dissociated as described above and passed through a 40 µm filter to ensure a single-cell suspension. Approx. 7000 cells were selected from each sample for input to the scRNA-Seq library prep. scRNA-Seq libraries were created using the Chromium Single-Cell 3’ library & Gel Bead Kit v2 (10x Genomics) according to the manufacturer's protocol.

Reference genome and mapping index building

For mouse, the GRCm38 reference genome fasta file and primary gene annotation GTF file (v25) were downloaded from GENCODE (https://www.gencodegenes.org). Similarly, the GRCh38 reference genome fasta file and primary gene annotation GTF file (v35) were downloaded for human. We built a reference index for use by alevin as described by Soneson (2020) using the R package eisaR (v1.2.0), which we use to quantify both spliced and unspliced counts of annotated genes.

scRNA-Seq preprocessing

Mouse Neurosphere (mNeurosphere) dataset

FASTQ files were used to quantify both spliced and unspliced counts by Alevin (Salmon v1.3.0) with default settings as described by Soneson (2020). Abundance matrices were read in by the R package tximeta (v1.8.1). The spliced counts were treated as the expression counts. We removed cells with fewer than 200 expressed genes, and cells flagged as outliers (deviating more than triple median absolute deviations (MAD) from the median of log₂(TotalUMIs), log₂(number of expressed genes), percentage of mitochondrial gene counts, or log₁₀(doublet scores)). The doublet scores were computed using the doubletCells function in the R package scran (v1.18.1). All mitochondrial genes and any genes which were expressed in fewer than 20 cells were further excluded from all subsequent analyses. Expression abundances were then library size normalized and log₂ transformed by the function normalizeCounts in the R package scuttle (v1.0.2). The biological samples were integrated together by Seurat (v3.2.2). We then ran PCA on the top 2000 highly variable genes of the integrated log₂(expression) using the runPCA function with default parameters, followed by running the runUMAP function on the resulting top 30 principal components with default parameters. Note that we did not restrict genes to cell cycle genes in this step, as we would like to see the overall variation of the data. Cell types were inferred by the SingleR package (v1.4.0) using the built-in MouseRNAseqData dataset as the reference.
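The outlier rule used in the filtering step (more than three MADs from the median of a per-cell metric) can be sketched as follows. This is a simplified illustration with a hypothetical function name `mad_outliers`; it uses the raw MAD, whereas R's `mad` applies a scaling constant by default, so the thresholds actually used in the analysis may differ.

```python
import statistics


def mad_outliers(values, n_mads=3.0):
    """Flag values deviating more than n_mads raw MADs from the median.

    Assumption: the MAD is not rescaled (no consistency constant), and
    deviations in both directions are flagged.
    """
    med = statistics.median(values)
    deviations = [abs(v - med) for v in values]
    mad = statistics.median(deviations)
    return [d > n_mads * mad for d in deviations]


# Hypothetical log2(TotalUMIs) values for seven cells; the last one is extreme.
log_umis = [10, 11, 9, 10, 12, 10, 30]
flags = mad_outliers(log_umis)
```

In the analysis the same rule is applied to each metric in turn, and a cell is dropped if any metric flags it.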

Mouse primary hippocampal NPC (mHippNPC) dataset

All preprocessing steps are the same as for the mouse Neurosphere (mNeurosphere) dataset.

Mouse developing pancreas (mPancreas) dataset

We obtained the spliced and unspliced count matrices of the mouse developing pancreas dataset from the python package scvelo (v0.2.1). The spliced counts were treated as the expression counts. We removed cells with fewer than 200 expressed genes, and any cells flagged as outliers (deviating more than triple median absolute deviations (MAD) from the median of log₂(TotalUMIs), log₂(number of expressed genes), percentage of mitochondrial gene counts, or log₁₀(doublet scores)). Here, the doublet scores were computed using the doubletCells function in the R package scran (v1.18.1). All mitochondrial genes and any genes which were expressed in fewer than 20 cells were further excluded from all subsequent analyses. Expression abundances were then library size normalized and log₂ transformed by the function normalizeCounts in the R package scuttle (v1.0.2). We ran PCA on the top 500 highly variable genes using the runPCA function with default parameters, followed by running the runUMAP function on the resulting top 30 principal components. When running the UMAP, we set min_dist to 0.5 instead of the default value 0.01, with other parameters at their defaults, to replicate the UMAP figure shown in Bergen et al. (2020). Of note, the single-cell libraries of the data were generated using 10x Genomics’ Chromium v2 system.

Mouse Hematopoietic Stem Cell (mHSC) Dataset

We downloaded the processed log₂ transformed TPM matrix directly from GEO under accession number GSE59114 (Kowalczyk et al., 2015). We only used the cells from the C57BL/6 strain, which contains more cells, as the number of overlapping genes between the xlsx files of the C57BL/6 strain and the DBA/2 strain is too small. Because the data was already processed and filtered, we did not perform any other processing. Unlike the above mentioned datasets, the SMARTer protocol was applied during library preparation.

Mouse Retina (mRetina) dataset

This dataset is available at https://github.com/gofflab/developing_mouse_retina_scRNASeq. We removed cells flagged as outliers (deviating more than triple median absolute deviations (MAD) from the median of log₂(TotalUMIs), log₂(number of expressed genes), percentage of mitochondrial gene counts, or log₁₀(doublet scores)). As the total UMIs depend on cell type, we filtered the cells by blocking for each cell type. The doublet scores were computed using the doubletCells function in the R package scran (v1.18.1). All mitochondrial genes and any genes which were expressed in fewer than 20 cells were further excluded from all subsequent analyses. Expression abundances were then library size normalized and log₂ transformed by the function normalizeCounts. We used the cell type annotations given in the new CellType column of the provided phenotype file. The single-cell libraries of the data were generated using 10x Genomics’ Chromium v2 system.

HeLa cell lines datasets

The spliced and unspliced count matrices of HeLa Set 1 (HeLa1) and HeLa Set 2 (HeLa2) were downloaded from the GEO website with accession numbers GSE142277 and GSE142356. Both datasets were generated by the same lab under the same protocol, while the sequencing depth of Set 2 is only about half that of Set 1 (Schwabe et al., 2020). For each dataset, we only used the genes existing in both the spliced and unspliced count matrices. The spliced counts were treated as the expression counts. We removed cells with fewer than 200 expressed genes, and cells flagged as outliers (deviating more than triple median absolute deviations (MAD) from the median of log₂(TotalUMIs), log₂(number of expressed genes), percentage of mitochondrial gene counts, or log₁₀(doublet scores)). All mitochondrial genes and any genes which were expressed in fewer than 20 cells were further excluded from all subsequent analyses. Expression abundances were then library size normalized and log₂ transformed by the function normalizeCounts. The single-cell libraries of the data were generated using the Drop-seq system.

Mouse embryonic stem cell (mESC) dataset

The processed count matrix was downloaded from the ArrayExpress website under accession number E-MTAB-2805 (https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-2805/). We only retained 279 cells with log₂(counts) greater than 15. The count matrix was library size normalized across cells and log₂ transformed by the function normalizeCounts. The RNA-Seq data was generated using the Fluidigm C1 system in this dataset.
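Library size normalization followed by a log₂ transform, as applied to each dataset, can be sketched as below. The sketch assumes size factors equal to each cell's library size divided by the mean library size and a pseudo-count of 1, which is our understanding of the default behaviour of normalizeCounts; the function name `normalize_log2` is hypothetical.

```python
import math


def normalize_log2(counts_by_cell, pseudo_count=1.0):
    """Library size normalize and log2 transform a cells-by-genes count table.

    Assumption: the size factor of a cell is its library size divided by
    the mean library size, so the factors average to 1.
    """
    library_sizes = [sum(cell) for cell in counts_by_cell]
    if min(library_sizes) <= 0:
        raise ValueError("every cell needs a positive library size")
    mean_size = sum(library_sizes) / len(library_sizes)
    normalized = []
    for cell, size in zip(counts_by_cell, library_sizes):
        size_factor = size / mean_size
        normalized.append([math.log2(x / size_factor + pseudo_count) for x in cell])
    return normalized


# Two hypothetical cells with the same composition but a twofold depth difference.
log_expr = normalize_log2([[1, 3], [2, 6]])
```

After normalization the two example cells have identical values, which is the intended effect of removing depth differences.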

Human embryonic stem cells (hESC) dataset

The processed count matrix was downloaded from GEO under accession number GSE64016. We only retained FACS sorted cells. The count matrix was library size normalized across cells and log₂ transformed by the function normalizeCounts. The RNA-Seq data was generated using the Fluidigm C1 system in this dataset.

Human U-2 OS cells (hU2OS) dataset

The TPM matrix was downloaded from GEO under accession number GSE146773. We only retained FACS sorted cells with log₂(counts) greater than the 3 times MAD range. Genes which were expressed in fewer than 20 cells were removed. The remaining TPM matrix was library size normalized across cells and log₂ transformed by the function normalizeCounts. The RNA-Seq data was generated using SMART-seq2 chemistry in this dataset.

Human induced pluripotent stem cells (hiPSCs) dataset

The processed FUCCI intensity and RNA-seq data was downloaded from https://github.com/jdblischak/fucci-seq/blob/master/data/eset-final.rds?raw=true. The preprocessing was described in Hsiao et al. (2020). The count matrix was library size normalized across cells and log₂ transformed by the function normalizeCounts. The RNA-Seq data was generated using the Fluidigm C1 system in this dataset.

Fetal tissue dataset

We obtained the loom file containing gene counts of all tissues from GEO under accession number GSE156793. We then processed and analyzed each tissue separately. For each tissue type, cells whose log₂(TotalUMIs) is lower than median – 3 × MAD, and genes expressed in fewer than 20 cells, were excluded from further analyses. The count matrix was library size normalized across cells and log₂ transformed by the function normalizeCounts. All 4 tissues profiled using single-cell and 11 tissues profiled using single-nuclei were generated on the sci-RNA-seq3 system.

5 stage cell cycle assignments

The 5 stage (G1S, S, G2, G2M, and MG1) cell cycle assignments were adapted from Schwabe et al. (2020) with some modifications. Briefly, the assignments use the list of highly expressed genes for each stage, curated by Whitfield et al. (2002). Let k represent one of the 5 stages, and l_k represent the corresponding gene list with p_k genes. For each stage k, we calculate the mean expression across genes in the gene list l_k for the jth cell as

$$m_{kj} = \frac{1}{p_k}\sum_{g \in l_k} x_{gj},$$

with x_{gj} the log₂ transformed expression value of gene g and cell j. Then we assess how well a gene in a gene list correlates to the mean expression level of that gene list as

$$r_{gk} = \mathrm{cor}(x_{g\cdot},\, m_{k\cdot}), \qquad g \in l_k,$$

where the correlation is taken across cells. For each stage, the gene list is pruned to genes whose correlation passes a threshold of 0.2. (For the fetal tissues dataset, we used a lower threshold, since the extremely shallowly sequenced data shows less co-expression patterns and the threshold 0.2 could leave us with no genes.) We label this pruned gene list as l′_k, with q_k the number of genes. The stage assignment score for cell j and stage k is given as

$$A_{kj} = \frac{1}{q_k}\sum_{g \in l'_k} x_{gj}.$$

The 5-by-n matrix A, whose number of columns equals the number of cells, is z-score transformed first across rows and then across columns, resulting in the 5-by-n matrix Ã. For each cell j, we compute the preliminary stage assignment as s_j = argmax_k Ã_{kj}.

As in Schwabe et al. (2020), we also apply two filtering steps. The first filtering step is exactly the same as described in the original paper: we require the stage with the second largest assignment score to be a neighboring stage of s_j. This requirement reflects that the 5 stages form a continuous cyclic process.

As for the second filtering step, the original method discards cells based on a fixed threshold of 0.75 applied to the second largest assignment score. We found the threshold of 0.75 to be of limited applicability, as in some datasets it leads to losing 90% of cells. Therefore, we use a more adaptive threshold.

If the cell passes both filtering steps, it is assigned to stage s_j. Otherwise, it is assigned NA with respect to the 5 stages of the cell cycle. To mitigate batch effects on the 5 stage assignments, the assignment procedure is done for each sample/batch separately within each dataset, as recommended in the Revelio package (Schwabe et al., 2020).

PCA of GO cell cycle genes

For each dataset, we subsetted the preprocessed log₂ transformed expression matrix to genes in the GO term cell cycle (GO:0007049). If there are clear batches defined in the dataset, such as sample or batch, we use Seurat3 to remove the batch effect. When using Seurat3, we used a library size normalized count matrix as input instead of log₂ transformed values. The integration anchors were searched in the space of the top 30 PCs. The output integrated matrix is a log₂ transformed matrix of the top 500 most variable genes. We then performed principal component analysis on the gene-wise mean centered expression matrix. When no batch exists, we also restrict to the top 500 most variable genes among the GO cell cycle genes.

Projection of new data to cell cycle embedding and calculation of cell cycle position θ

The projection using the weights matrix pre-learned during PCA of GO cell cycle genes is straightforward, given by

$$P = \tilde{E}^t R,$$

where R represents the o-by-2 reference matrix (o ≤ 500), which contains the weights of the top 2 PCs learned from PCA of GO cell cycle genes; Ẽ is an o-by-n matrix, subsetted from E (the log₂ transformed expression matrix) to the genes in the weights matrix and row-means centered. The resulting n-by-2 matrix P is the cell cycle embedding projected by the reference. The calculation of the cell cycle position θ is given by

$$\theta = \operatorname{atan2}(P_2, P_1) \bmod 2\pi,$$

applied to each cell, where P_i is the ith column of matrix P. When mapping the genes between the weights matrix and the data that we want to project, the Ensembl ID is given higher priority than the gene symbol for mouse. For across species projection, we only consider the homologous genes with the same gene symbols.
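The projection and the polar-angle calculation can be sketched as below. This is an illustrative re-implementation under stated assumptions, not the tricycle package code: genes are matched by identical identifiers only, each gene is centered across the cells being projected, and θ is taken as the two-argument arctangent of the two projected coordinates wrapped into [0, 2π). The function name `project_theta` and the toy weights are hypothetical.

```python
import math


def project_theta(expr, weights):
    """Project cells into a 2D reference space and return (p1, p2, thetas).

    expr:    dict mapping gene id -> list of log2 expression values (one per cell)
    weights: dict mapping gene id -> (w1, w2), the two reference PC weights
    Only genes present in both inputs are used.
    """
    genes = [g for g in weights if g in expr]
    if not genes:
        raise ValueError("no overlapping genes between data and reference")
    n_cells = len(expr[genes[0]])
    p1 = [0.0] * n_cells
    p2 = [0.0] * n_cells
    for g in genes:
        row = expr[g]
        row_mean = sum(row) / n_cells  # row-mean centering of each gene
        w1, w2 = weights[g]
        for j, x in enumerate(row):
            p1[j] += (x - row_mean) * w1
            p2[j] += (x - row_mean) * w2
    thetas = [math.atan2(b, a) % (2 * math.pi) for a, b in zip(p1, p2)]
    return p1, p2, thetas


# Toy reference with two genes and four cells; gene "C" is absent from the
# reference and is therefore ignored.
toy_weights = {"A": (1.0, 0.0), "B": (0.0, 1.0)}
toy_expr = {"A": [1, 0, -1, 0], "B": [0, 1, 0, -1], "C": [5, 5, 5, 5]}
_, _, toy_thetas = project_theta(toy_expr, toy_weights)
```

Because the projection is a single matrix product over at most a few hundred genes, its cost grows linearly with the number of cells.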

Periodic loess

As θ is a circular variable bounded between 0 and 2π, fitting a traditional loess model y ~ θ, with y as any response variable, such as the expression of a gene or log₂(TotalUMIs), has problems around the boundaries 0 and 2π. Hence, we concatenate three copies of y and three copies of θ with one period shift to form [y, y, y] and [θ − 2π, θ, θ + 2π], on which the loess line is fitted. We then only use the fitted values for θ between 0 and 2π for visualization purposes.

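The calculation of the coefficient of determination R² of the fitted loess model is given by

$$R^2 = 1 - \frac{SS_{res}}{SS_{total}}.$$

Here $SS_{res} = \sum_i (y_i - \hat{y}_i)^2$ and $SS_{total} = \sum_i (y_i - \bar{y})^2$, with ŷ_i the fitted value for data point i and ȳ the mean of y. Note that instead of using all three copies of data points, we restrict the calculation of SS_res and SS_total to the original data points (the middle copy). The residuals are not the same for the three copies, especially at the beginning and end of [−2π, 4π].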
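The tripling trick and the R² restricted to the middle copy can be sketched as follows. To keep the example dependency-free, a Gaussian kernel (Nadaraya-Watson) smoother stands in for loess; this substitution, the bandwidth, and the function names `periodic_smooth` and `r_squared` are assumptions made for illustration only.

```python
import math


def periodic_smooth(theta, y, bandwidth=0.3):
    """Smooth y against circular theta using three shifted copies of the data.

    A Gaussian kernel smoother replaces loess here. Fitted values are
    returned only for the original points (the middle copy).
    """
    period = 2 * math.pi
    theta3 = [t - period for t in theta] + list(theta) + [t + period for t in theta]
    y3 = list(y) * 3

    def fit_at(t0):
        w = [math.exp(-0.5 * ((t - t0) / bandwidth) ** 2) for t in theta3]
        return sum(wi * yi for wi, yi in zip(w, y3)) / sum(w)

    return [fit_at(t) for t in theta]


def r_squared(y, fitted):
    """R^2 = 1 - SS_res / SS_total, computed on the original data points."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, fitted))
    ss_total = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_total


# A noiseless periodic signal on an even grid of 200 positions.
grid = [2 * math.pi * i / 200 for i in range(200)]
signal = [math.cos(t) for t in grid]
smoothed = periodic_smooth(grid, signal)
fit_quality = r_squared(signal, smoothed)
```

With the shifted copies in place, the fitted curve is continuous across the 0 / 2π boundary, which a fit on the single copy would not guarantee.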

The circular correlation coefficient ρ

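We use the circular correlation coefficient ρ defined by Jammalamadaka and Sarma (1988) to evaluate concordance between two polar vectors θ₁ and θ₂.

It is defined as

$$\rho = \frac{\sum_i \sin(\theta_{1i} - \mu_1)\,\sin(\theta_{2i} - \mu_2)}{\sqrt{\sum_i \sin^2(\theta_{1i} - \mu_1)\,\sum_i \sin^2(\theta_{2i} - \mu_2)}},$$

where μ₁ and μ₂ represent the mean of θ₁ and θ₂ respectively, and are estimated by maximum likelihood estimation under the von Mises distribution assumption.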
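A direct implementation of this coefficient can be sketched as below. The mean directions are computed as the usual circular mean, which is the maximum likelihood estimate of the mean direction under a von Mises model; the function names `circular_mean` and `circular_correlation` are hypothetical.

```python
import math


def circular_mean(thetas):
    """Mean direction of a sample of angles (radians)."""
    return math.atan2(sum(math.sin(t) for t in thetas),
                      sum(math.cos(t) for t in thetas))


def circular_correlation(theta1, theta2):
    """Circular correlation coefficient of Jammalamadaka and Sarma."""
    if len(theta1) != len(theta2):
        raise ValueError("both vectors must have the same length")
    mu1, mu2 = circular_mean(theta1), circular_mean(theta2)
    s1 = [math.sin(t - mu1) for t in theta1]
    s2 = [math.sin(t - mu2) for t in theta2]
    numerator = sum(a * b for a, b in zip(s1, s2))
    denominator = math.sqrt(sum(a * a for a in s1) * sum(b * b for b in s2))
    return numerator / denominator


# Hypothetical positions, a rotated copy, and a mirrored copy.
positions = [0.1, 0.8, 1.5, 2.9, 4.0, 5.5]
rotated = [(t + 1.0) % (2 * math.pi) for t in positions]
mirrored = [-t for t in positions]
rho_rotated = circular_correlation(positions, rotated)
rho_mirrored = circular_correlation(positions, mirrored)
```

The coefficient is unchanged by a constant rotation of either vector, which is why it suits comparing positions across embeddings that may differ by an offset.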

Running other methods

For other cell cycle inference methods, we use all default parameters and the built-in reference (if needed) of the following packages: cyclone in scran (v1.18.5), CellCycleScoring in Seurat (v4.0.0.9015), Revelio (v0.1.0), peco (v1.1.21), and reCAT (v1.1.0).

Silhouette index on angular separation distance of tricycle cell cycle position θ

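For cyclone and Seurat, we use the Silhouette index to describe the consistency between discretized cell cycle stages and tricycle cell cycle position θ. We use the angular separation distance to quantify the distance between cell i and cell j as

$$d(i, j) = 1 - \cos(\theta_i - \theta_j)$$

For a cell i assigned to stage k⁽ⁱ⁾, let $C_{k^{(i)}}$ denote the set of cells assigned to that stage, with cardinality $|C_{k^{(i)}}|$. The mean distance between cell i and all other cells assigned to the same stage is

$$a(i) = \frac{1}{|C_{k^{(i)}}| - 1} \sum_{j \in C_{k^{(i)}},\, j \neq i} d(i, j)$$

Specially, a(i) = 0 if $|C_{k^{(i)}}| = 1$. The mean distance from cell i to all cells assigned to another stage k′, such that k′ ≠ k⁽ⁱ⁾ ∧ k′ ∈ {G1, S, G2M}, minimized over k′, is

$$b(i) = \min_{k' \neq k^{(i)}} \frac{1}{|C_{k'}|} \sum_{j \in C_{k'}} d(i, j)$$

The Silhouette index for cell i is given as

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$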

For any cell i, the Silhouette index s(i) is bounded between −1 and 1 (−1 ≤ s(i) ≤ 1). An s(i) close to 1 means the cell is consistently assigned with its neighbors w.r.t. its cell cycle position θ_i. An s(i) close to −1 means the cell is closer to another stage. An s(i) equal to 0 means the cell is on the border of two stages. The mean Silhouette index over all cells measures how tight the stage assignments are. In this context, this value must be interpreted carefully, as the setting differs from traditional clustering, which might put hard boundaries and gaps between clusters. As the cell cycle process is continuous in nature, there must be cells on the boundaries whose assignment to either stage is ambiguous, and no gap should appear between stages. Thus, a mean Silhouette index greater than 0 might be sufficient to conclude that tricycle cell cycle position θ and the discretized cell cycle stages agree.
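The computation can be sketched as below. This is a sketch under stated assumptions rather than the code used for the figures: the angular separation distance is taken to be 1 − cos(θ_i − θ_j), the index follows the standard silhouette definition s(i) = (b(i) − a(i)) / max{a(i), b(i)}, at least two stages must be present, and the function name `angular_silhouette` is hypothetical.

```python
import numpy as np

def angular_silhouette(theta, stage):
    """Per-cell silhouette index with d(i, j) = 1 - cos(theta_i - theta_j)."""
    theta = np.asarray(theta, dtype=float)
    stage = np.asarray(stage)
    # Full n x n pairwise distance matrix; memory grows quadratically with n.
    dist = 1.0 - np.cos(theta[:, None] - theta[None, :])
    labels = np.unique(stage)
    s = np.zeros(theta.size)
    for i in range(theta.size):
        same = stage == stage[i]
        n_same = same.sum()
        # a(i): mean distance to the other cells of the same stage (0 for a singleton).
        # dist[i, i] is 0, so summing over the whole stage and dividing by n_same - 1 is safe.
        a = dist[i, same].sum() / (n_same - 1) if n_same > 1 else 0.0
        # b(i): smallest mean distance to the cells of any other stage.
        b = min(dist[i, stage == k].mean() for k in labels if k != stage[i])
        denom = max(a, b)
        s[i] = (b - a) / denom if denom > 0 else 0.0
    return s
```

For example, `angular_silhouette(theta, stage).mean()` gives the mean index over all cells. The dense distance matrix is the practical bottleneck for very large datasets.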

Data availability

Data reported in this publication are being submitted to NCBI GEO and are available from the authors upon request in the meantime.

Software availability

The tricycle method is implemented in the R package tricycle containing the mNeurosphere reference, which is available on https://github.com/hansenlab/tricycle. This package is being submitted to Bioconductor.

Funding

This project has been made possible in part by grant number CZF2019-002443 from the Chan Zuckerberg Initiative DAF, an advised fund of Silicon Valley Community Foundation. Research reported in this publication was supported by the National Institute of General Medical Sciences of the National Institutes of Health under award R01GM121459. This work was additionally supported by awards from the National Science Foundation (IOS-1665692), the National Institute on Aging (R01AG066768), and the Maryland Stem Cell Research Foundation (2016-MSCRFI-2805). GSO is supported by postdoctoral fellowship awards from the Kavli Neurodiscovery Institute, the Johns Hopkins Provost Award Program, and the BRAIN Initiative in partnership with the National Institute of Neurological Disorders and Stroke (K99NS122085).

Disclaimer

The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health or National Science Foundation.

Conflict of Interest

None declared.

Supplementary Materials

Supplementary Methods.
Supplementary Figures S1–S34.
Supplementary Table S1.

SUPPLEMENTARY METHODS

Comparison to existing cell cycle tools

Oscope

Oscope poses significant challenges when run on shallow data (10X, sci-RNA-seq3, or DropSeq), since the method requires quantification of a high number of genes in every cell. For this reason, we do not evaluate Oscope.

peco

peco supplies 2 models: one trained on 101 genes and one trained on 5 genes. We used the 101-gene model to be robust to some genes not being measurable in all datasets. We applied peco to all datasets described in Supplementary Table S1, except mRetina and the human fetal tissues. For the human fetal tissues, we only used a random subset of 2000 cells selected from the human fetal intestine data (termed “hfIntestineSub”).

We assess the expression dynamics of 4 genes highlighted in Hsiao et al. (2020): CDK1, TOP2A, UBE2C and H4C3 (Supplementary Figure S15); not all datasets have these genes measured, in which case they are absent from the figure. To systematically compare tricycle and peco, we use the R² of the periodic loess fit (Methods). This compares R² values for the same data and the same periodic loess approach, differing only in the position variable (tricycle or peco). For these genes, across all datasets, tricycle cell cycle position has a higher R² than peco cell cycle position (Supplementary Figure S15). Generally, peco performs better on information-rich Fluidigm C1 data than on information-poor 10X and Drop-Seq data.

Revelio

Revelio is designed to search for an ellipsoid pattern amongst (rotated) principal components, by finding the directions having strongest association to 5 discrete cell cycle stages. The output of Revelio is therefore supposed to be an ellipsoid. Revelio by itself does not quantify cell cycle position, although it seems natural to do so by the angle. When we use Revelio, we do indeed observe an ellipsoid in 4 datasets (Supplementary Figure S16a, b, f, g, i and j), but it clearly fails in 3 datasets: mPancreas dataset, mRetina dataset, and mHSC dataset (Supplementary Figure S16c, d, and e). These 3 datasets all have substantial variation which is not associated with cell cycle, such as cell types and differentiation, which we believe explains the non-ellipsoidal embedding. For example, in the mPancreas data some of the differentiation effect is perfectly confounded with cell cycle as the terminally differentiated cells stop cycling. It is not clear that simply rotating the principal components will help us find a better cell cycle exclusive dimension. Additionally, Revelio removes any cell which does not have a prediction using the Schwabe stage predictor; in the mRetina dataset only 30k out of more than 90k cells are retained.

reCAT

reCAT starts with a principal component analysis of the cell cycle genes, and infers an ordering by solving a traveling salesman problem on this representation. This produces an ordering, but the ordering is hard to interpret because it is not directly linked to cell cycle stage. To address this, the authors provide two different stage predictors. Because the method requires the solution of a traveling salesman problem, it scales poorly. For this reason, we only ran reCAT on datasets with fewer than 5000 cells. The orderings inferred by reCAT are largely consistent with our cell cycle position θ using the mNeurosphere reference for all datasets except the most shallowly sequenced hfIntestineSub data (Supplementary Figure S17, last sub-panel in each panel). The expression dynamics of Top2A along the inferred time series also confirm the appropriate ordering of cells (Supplementary Figure S17, third sub-panel in each panel). However, the two stage predictors given by reCAT yield different stage predictions. For example, for the mPancreas dataset (Supplementary Figure S17a), the majority of cells are at S stage based on Bayes scores but at G1 stage based on mean scores. Note that the reCAT function requires the user to supply an approximate cutoff position to assign a cell cycle stage based on Bayes scores. However, in all the datasets, we were unable to choose cutoff positions such that each stage has its own interval of highest scores. Without a useful stage assignment, the ability to make use of the cell orderings is substantially restricted, as the percentage of cells in each stage differs across datasets.

Cyclone

We observe general agreement between the 3 stage predictions of cyclone and tricycle cell cycle position, as the cyclone stages cluster together (Supplementary Figure S18). We note that cyclone assigns very few cells to the S stage. We believe this is caused by the assignment strategy (cells are assigned to S stage only if both G1 and G2M scores are below 0.5). To expand on this comparison, we computed the silhouette index with a distance defined by the tricycle cell cycle position (Methods). For cyclone, the under-representation of S stage drags down the silhouette index for both G1 and S stages, as cells at S stage are usually mixed with G1 cells, making the mean distance to all cells at G1 stage hard to distinguish from the mean distance to all cells at S stage. We note that cyclone works best on the last two FACS datasets, one of which (mESC) is the training dataset for the cyclone gene list.

Seurat

We observe good agreement between the 3 stage predictions of Seurat and tricycle cell cycle position, better than for cyclone (Supplementary Figure S19). Compared to cyclone, we obtain a much higher silhouette index for Seurat; the highest observed mean is 0.74 for the mHSC dataset, which confirms the strong visual agreement between Seurat assignments and tricycle. The main disadvantage of Seurat is the inherent limitation of a 3 stage prediction.

(modified) Schwabe

The (modified) Schwabe method assigns cells to 5 different stages. Because of the higher resolution, it is the main predictor we use in our work. By default, the Schwabe method as reported in Schwabe et al. (2020) produces a substantial amount of missing labels, and we have therefore modified the method to address this (Methods); we call this the modified Schwabe predictor.

Broadly, the (modified) Schwabe predictor agrees with tricycle, with one specific type of disagreement. These inconsistencies are examined in Supplementary Figure S20. Some cells with a tricycle cell cycle position of 0/2π (G0/G1) are assigned to other stages by modified Schwabe (Supplementary Figure S20, second sub-panel of each row). It is well appreciated that there are many more genes specifically expressed at S, G2 or M stage as compared to G0/G1 stage (Dolatabadi et al., 2017). For each dataset, we plot the percentage of non-expressed genes among all projection genes in the first sub-panel, which shows that the dynamics of this percentage are captured by cell cycle position θ using the mNeurosphere reference. We plot the percentage of non-expressed genes conditioned on stage and on whether tricycle cell cycle position is around 0/2π (Supplementary Figure S20, third sub-panel of each row), which confirms that for each stage there exist two distinct groups. This is reinforced by the different expression patterns of Top2a and Smc4 between flagged cells and non-flagged cells in the last two sub-panels. Thus, we conclude that the cells around 0/2π are likely to be wrongly assigned to other stages, probably due to low information content.

To assess whether these inconsistencies are caused by our modification of Schwabe, we repeat the comparison using the original Schwabe assignments and arrive at the same conclusion (Supplementary Figure S21). This assessment highlights the large number of missing labels from the original Schwabe predictor, for example only 30k out of 90k cells in the mRetina dataset are labelled.

SUPPLEMENTARY FIGURES

Supplementary Figure S1. Simulations using negative binomial distribution with different numbers of distinct peak locations.

We used different numbers of distinct peak locations across 100 genes, and fixed the amplitudes (across 100 genes) as 3 and library size as 2000. The number of distinct peak locations across 100 genes is (a) 2, (b) 3, (c) 5, (d) 10, and (e) 50. As long as we have more than 2 distinct peak locations, we get an ellipsoid.

Supplementary Figure S2. Simulations using negative binomial distribution with different numbers of distinct amplitudes.

We used different numbers of distinct amplitudes across 100 genes, and fixed the number of distinct peak locations (across 100 genes) as 100 and library size as 2000. The number of distinct amplitudes across 100 genes is (a) 1, (b) 2, (c) 3, (d) 5, (e) 10, and (f) 50. No matter what the number of distinct amplitude(s) is, we always get an ellipsoid.

Supplementary Figure S3. Simulations using negative binomial distribution with different library sizes.

We changed the library size l, and fixed the number of distinct peak locations (across 100 genes) as 100 and the amplitudes (across 100 genes) as 3. The library size is (a) 2000, (b) 500, (c) 100, and (d) 10. The ranges of the x-axis and y-axis of the first two sub-panels are fixed across (a)-(d). As library size decreases, the ellipsoid shrinks towards the origin (0, 0). However, the ordering of cells can still be recovered.

Supplementary Figure S4. UMAPs of the mouse cortical Neurosphere dataset.

Scatter plots show the UMAPs of the Seurat3-merged Neurosphere data colored by (a) sample, (b) cell type inferred by SingleR, (c) log₂(TotalUMIs), (d) cell cycle stage inferred by cyclone, (e) cell cycle stage inferred by Seurat, and (f) cell cycle stage inferred by the modified Schwabe method (Schwabe et al., 2020; see Methods). The UMAP coordinates were computed using PCA on the top 2000 highly variable genes after integration by Seurat3.

Supplementary Figure S5. Weights of PCA on GO cell cycle genes.

(a) The weights of top 2 PCs learned from doing PCA on GO cell cycle genes of cortical Neurosphere data. (b) The weights of top 2 PCs learned from doing PCA on GO cell cycle genes of mouse primary hippocampal NPC data. (c) A comparison of the weights on principal component 1 between the cortical neurosphere and hippocampal progenitor datasets. (d) As (c), but for PC2. Genes with high weights (|score| > 0.1 for either vector) are highlighted in red. PCC: Pearson’s Correlation Coefficient.

Supplementary Figure S6. Expression dynamics of top ranked genes.

Similar to Figure 2d and e, but now showing all overlapping projection genes with absolute weights greater than 0.1 in either PC1 or PC2 of either dataset. Yellow points are cells of the mHippNPC data, while blue points are cells of the mNeurosphere data. A separate loess line was fitted for each of the two datasets. There is high agreement of the dynamics between datasets.

Supplementary Figure S7. Characteristics of expression patterns of the mNeurosphere reference.

(a) Heatmap shows the z-scores of the 500 projection genes in the mNeurosphere data. Each row represents a gene and each column represents a cell, ordered by the cell cycle position θ from PCA. We also annotate the positions at multiples of 0.5π, as the cells are not uniformly distributed between 0 and 2π. (b) The fitted loess line of z-scores over cell cycle position θ for all 500 projection genes. (c-e) The three different clusters in (b). (c) The cluster of genes with highest z-scores less than 0.5. (d) The cluster of genes with highest z-scores greater than 0.5 and peak position before π. This cluster corresponds to genes highly expressed at G1/S stage. (e) The cluster of genes with highest z-scores greater than 0.5 and peak position after π. This cluster corresponds to genes highly expressed at G2/M stage.

Supplementary Figure S8. PCA and projections of the mouse developing pancreas data.

(a-c) The top 2 PCs of GO cell cycle genes of the three most multipotent cell types in the mouse developing pancreas data. PCA was performed independently for each cell type. (d) Projection of allNgn3LEP cells of mPancreas data using the learned top 2 PCs weights on GO cell cycle genes of Ductal cells. (e) Projection of allNgn3HEP cells of mPancreas data using the learned top 2 PCs weights on GO cell cycle genes of Ductal cells.

Supplementary Figure S9. A pre-learned rotation matrix learned from proliferating cortical neurospheres enables cell cycle position estimation in other proliferating datasets.

This figure includes two other datasets in addition to the four datasets in Figure 4. (a) Different datasets (mouse hematopoietic stem cell and HeLa set 1) projected into the cell cycle embedding defined by the cortical neurosphere dataset. Cell cycle position θ is estimated using the polar angle. (b) Inferred expression dynamics of Top2A (or TOP2A for human), with a periodic loess line (Methods). (c) UMAP colored by cell cycle position using a circular color scale.

Supplementary Figure S10. The dynamics of Smc4 expression over cell cycle position θ.

Inferred expression dynamics of Smc4 (or SMC4 for human) over cell cycle position inferred using the cortical neurosphere reference, with a periodic loess line (Methods) for (a) hippocampal NPCs, (b) mouse pancreas, (c) mouse retina, (d) HeLa set 2, (e) mouse hematopoietic stem cell, and (f) HeLa set 1 data. These are the same data used in Figure 4 and Supplementary Figure S9.

Supplementary Figure S11. The top 2 PCs of GO cell cycle genes.

The figure shows the top 2 PCs of PCA performed on GO cell cycle genes of each dataset. The panels serve as companions to Figure 4 and Supplementary Figure S9. Note that cell cycle progression is hidden by direct PCA on datasets with higher heterogeneity, such as the mPancreas and mRetina datasets, while cell cycle progression is visible in the other datasets.

Supplementary Figure S12. Expression dynamics of GMNN and CDT1 on FUCCI pseudotime and tricycle position of hU2OS data.

(a) The gene expression of GMNN in hU2OS dataset is stable over the FUCCI pseudotime. (b) The gene expression of CDT1 in hU2OS dataset is stable over the FUCCI pseudotime. (c-d) Similar to (a-b), but we use the cell cycle position θ using mNeurosphere reference as the predictor.

Supplementary Figure S13. Expression dynamics of selected cell cycle genes of hiPSCs dataset.

Similar to Figure 5e,f, but now we show more cell cycle related genes. In each panel, the left sub-panel shows the expression of the gene over tricycle cell cycle position θ using the mNeurosphere reference, and the right sub-panel over the FUCCI pseudotime inferred by Hsiao et al. (2020). Cells are colored by 5 stage cell cycle representation, inferred using the modified Schwabe method (Schwabe et al., 2020). Periodic loess lines and R² are added for each sub-panel (Methods).

Supplementary Figure S14. Evaluation of tricycle on FACS datasets

(a-b) Data from Buettner et al. (2015). (a) The data is projected to the cell cycle embedding defined by the cortical neurosphere dataset. Cells are colored by FACS labels. (b) Expression dynamics of Top2A with a periodic loess line using tricycle cell cycle position estimated by projection in (a). (c-d) Similar to (a,b), but for data from Leng et al. (2015).

Supplementary Figure S15. Expression dynamics of cell cycle genes on peco cell cycle position.

We ran peco on all datasets described in Supplementary Table S1, except mRetina and the human fetal tissues. The mRetina data has too many cells, and for the human fetal tissues, we only used a random subset of 2000 cells from the intestine data (hfIntestineSub). For each dataset, the expression dynamics of Cdk1, Top2A, Ube2C and H4c3 over peco inferred θ are plotted, provided the genes exist in the target dataset. Note that in this figure, the y-axis represents the peco normalized expression values, as peco has its own normalization requirement. We annotate each panel with the R² of the loess line calculated on peco inferred θ and the R² of the loess line on tricycle inferred θ using the mNeurosphere reference (although we have not plotted the expression dynamics over tricycle inferred θ). Across all datasets and genes, the tricycle inferred θ has a greater R² than the peco inferred θ; the greater value is highlighted in red.

Supplementary Figure S16. Cell cycle embeddings by Revelio.

The cell cycle embedding produced by Revelio for each dataset. Cells are colored by 5 stage cell cycle representation, inferred using the original Schwabe method (Schwabe et al., 2020) as implemented in the Revelio package. Note that all cells without a valid stage assignment (assigned to “NA”) are removed by the functions in the Revelio package.

Supplementary Figure S17. Cell cycle stage and order estimations by reCAT.

Panels show the cell cycle stage scores and cell orders estimated by reCAT for (a) mPancreas, (b) mHSC, (c) HeLa set 1, (d) HeLa set 2, (e) hfIntestineSub, (f) hU2OS, (g) hiPSCs, (h) mESC, and (i) hESC data. For each dataset, the first sub-panel shows the Bayes scores for G1, S, and G2/M stage over the estimated cell orders (time series t). For each cell, there are three scores (data points) colored by stage. The second sub-panel shows the mean scores for G1, G1/S, S, G2, G2/M, and M stage over the estimated cell orders. For each cell, there are six scores (data points) colored by stage. The third sub-panel shows the expression dynamics of Top2A (or TOP2A for human) over reCAT estimated cell orders. Each data point is a cell, colored by 5 stage cell cycle representation, inferred using the modified Schwabe method (Schwabe et al., 2020), except for the last two FACS datasets, for which we color the cells by FACS stage. Note that although the reCAT package provides a function to assign cell cycle stage, it requires manually specified cutoffs for the Bayes scores. We could not pick appropriate cutoffs for most of the datasets presented here. For example, for mPancreas data in (a), we cannot decide which region has consistent G1 scores. The last sub-panel compares tricycle cell cycle position using the mNeurosphere reference and reCAT cell orders.

Supplementary Figure S18. Comparison between cyclone assigned stages and tricycle cell cycle position using mNeurosphere reference.

Each panel describes one dataset, specifically (a) mNeurosphere, (b) mHippNPC, (c) mPancreas, (d) mRetina, (e) mHSC, (f) HeLa set 1, (g) HeLa set 2, (h) hfIntestineSub, (i) hU2OS, (j) hiPSCs, (k) mESC, (l) hESC data. For each dataset, the first sub-panel shows the cell cycle embedding projected by the mNeurosphere reference; each point is a cell, colored by cyclone inferred cell cycle stage. The second sub-panel shows the silhouette index computed using the angular separation distance of tricycle cell cycle position θ estimated using the mNeurosphere reference (Methods), stratified by cyclone inferred cell cycle stage. The mean silhouette index across all cells is given in the title. Boxes indicate 25th and 75th percentiles. Whiskers extend to the largest values no further than 1.5 interquartile range (IQR) from these percentiles. For mRetina data, the pairwise distance matrix is too large to instantiate, so we could not compute the silhouette index. The third sub-panel shows the marginal density of Top2A (or TOP2A for human) expression conditioned on cyclone cell cycle stage.

Supplementary Figure S19. Comparison between Seurat assigned stages and tricycle cell cycle position using mNeurosphere reference.

Each panel describes one dataset, specifically (a) mNeurosphere, (b) mHippNPC, (c) mPancreas, (d) mRetina, (e) mHSC, (f) HeLa set 1, (g) HeLa set 2, (h) hfIntestineSub, (i) hU2OS, (j) hiPSCs, (k) mESC, (l) hESC data. For each dataset, the first sub-panel shows the cell cycle embedding projected by the mNeurosphere reference; each point is a cell, colored by Seurat inferred cell cycle stage. The second sub-panel shows the silhouette index computed using the angular separation distance of tricycle cell cycle position θ estimated using the mNeurosphere reference (Methods), stratified by Seurat inferred cell cycle stage. The mean silhouette index across all cells is given in the title. Boxes indicate 25th and 75th percentiles. Whiskers extend to the largest values no further than 1.5 IQR from these percentiles. For mRetina data, the pairwise distance matrix is too large to instantiate, so we could not compute the silhouette index. The third sub-panel shows the marginal density of Top2A (or TOP2A for human) expression conditioned on Seurat cell cycle stage.

Supplementary Figure S20. Comparison between modified 5 stage assignments and tricycle cell cycle position using mNeurosphere reference.

Each row or panel contains the analysis for one dataset, specifically (a) mNeurosphere, (b) mHippNPC, (c) mPancreas, (d) mRetina, (e) HeLa set 1, and (f) HeLa set 2 data. For each dataset, the first sub-panel shows the dynamics of the percentage of non-expressed genes among all genes overlapping with the mNeurosphere projection matrix (number of genes with 0 expression divided by the number of genes overlapping with the mNeurosphere projection matrix) w.r.t. tricycle cell cycle position θ using the mNeurosphere reference. Cells are colored by 5 stage assignment. The second sub-panel shows the marginal density of tricycle cell cycle position θ conditioned on 5 stage assignments, using a von Mises kernel on a polar coordinate system. The third sub-panel shows, as boxplots, the percentage of non-expressed genes among all genes overlapping with the mNeurosphere projection matrix, conditioned on 5 stage assignment and on whether cells appear in the G1/G0 cluster (θ < 0.25π or θ > 1.5π). The fourth and the last sub-panels show the expression of Top2A and Smc4 conditioned on 5 stage assignment and on whether cells appear in the G1/G0 cluster. Boxes indicate 25th and 75th percentiles. Whiskers extend to the largest values no further than 1.5 IQR from these percentiles.

Supplementary Figure S21. Comparison between original 5 stage assignments and tricycle cell cycle position using mNeurosphere reference.

This figure shows the exact same data and comparisons as in Supplementary Figure S20, but now we use the original Schwabe method as implemented in the Revelio package (Schwabe et al., 2020). Note that the number of cells in each dataset is decreased as any cell without a valid stage assignment (assigned to “NA”) is removed by the functions in Revelio package.

Supplementary Figure S22. Running time comparisons between cyclone, Seurat, and tricycle cell cycle inference

We record the elapsed time for each method when running it on 10 random subsets of mRetina data with 5000, 10000, and 50000 cells. For cyclone and Seurat, the time is recorded for the cell cycle stage assignment function. For tricycle, the time is recorded for cell cycle position estimation using the mNeurosphere reference. Note that we add jitter to the data points to avoid excessive overlap.

Supplementary Figure S23. TotalUMIs of human fetal atlas.

For each tissue type of the human fetal atlas data (Cao et al., 2020), we show the total UMIs per cell. The dashed line separates the 4 single-cell profiled tissues from the 11 single-nuclei profiled tissues. Boxes indicate 25th and 75th percentiles. Whiskers extend to the largest values no further than 1.5 × interquartile range (IQR) from these percentiles.

Supplementary Figure S24. Human fetal tissue atlas UMAP embeddings with all tissues

Human fetal tissue atlas UMAP embeddings with all tissues, colored by (a) tissue and (b) cell type.

Supplementary Figure S25. Application of tricycle on 4 single-cell profiled human tissues

We show one tissue type in each row/panel: (a) intestine, (b) kidney, (c) pancreas, and (d) stomach. For each tissue, the cell cycle embedding using the mNeurosphere reference is given in the first sub-panel, tissue-level UMAPs from Cao et al. (2020) colored by cell cycle position θ in the second sub-panel, tissue-level UMAPs from Cao et al. (2020) colored by cell type in the third sub-panel, percentage of actively proliferating cells for each cell type in decreasing order in the fourth sub-panel, tissue-level UMAPs from Cao et al. (2020) colored by development days in the fifth sub-panel, and percentage of actively proliferating cells for each development day in the last sub-panel.

Supplementary Figure S26. Application of tricycle on 11 single-nuclei profiled human tissues

Similar to Supplementary Figure S25, but for the 11 tissues profiled with single-nuclei RNA sequencing. We show one tissue type in each panel: (a) adrenal, (b) cerebellum, (c) cerebrum, (d) eye, (e) heart, (f) liver, (g) lung, (h) muscle, (i) placenta, (j) spleen, and (k) thymus. For each tissue, the cell cycle embedding using the mNeurosphere reference is given in the first sub-panel, tissue-level UMAPs from Cao et al. (2020) colored by cell cycle position θ in the second sub-panel, tissue-level UMAPs from Cao et al. (2020) colored by cell type in the third sub-panel, percentage of actively proliferating cells for each cell type in decreasing order in the fourth sub-panel, tissue-level UMAPs from Cao et al. (2020) colored by development days in the fifth sub-panel, and percentage of actively proliferating cells for each development day in the last sub-panel.

Supplementary Figure S27. Human fetal thymus UMAPs colored by cell cycle position or stage.

(a) Same as Supplementary Figure S26k second sub-panel, which shows the UMAP embeddings of human fetal thymus, colored by cell cycle position θ. (b) Same UMAP embedding as in (a), but colored by 5 stage cell cycle representation, inferred using the modified Schwabe method from Schwabe et al. (2020). (c) Same UMAP embedding as in (a), but colored by 3 stage cell cycle representation, inferred by cyclone (Scialdone et al., 2015). (d) Same UMAP embedding as in (a), but colored by 3 stage cell cycle representation, inferred by Seurat (Stuart et al., 2019).

Supplementary Figure S28. Projection using the exact same genes on two datasets.

This figure shows cell cycle embeddings for the (a) mNeurosphere and (b) mHippNPC datasets using the mNeurosphere reference restricted to the 384 genes present in both datasets. Cells are colored by 5 stage cell cycle representation, inferred using the modified Schwabe method (Schwabe et al., 2020).

Supplementary Figure S29. Examples of mNeurosphere dataset projections with randomly sub-sampled projection matrices.

Each column represents an example of a projection using genes sub-sampled from the original 500 projection genes. From left to right, the numbers of genes are 400, 300, 200, 100, and 50. (a) The projected cell cycle embedding using the sub-sampled projection matrix. Cells are colored by 5 stage cell cycle representation, inferred using the modified Schwabe method from Schwabe et al. (2020). (b) Comparisons of cell cycle positions θ estimated using the full 500-gene projection matrix and using the sub-sampled projection matrix. The circular correlation ρ is given in the figure.

Supplementary Figure S30. Stability assessment with projection genes missing.

This figure shows comprehensive assessments as complement to Supplementary Figure S29. For each target number of genes retained in the mNeurosphere reference matrix, we randomly sampled different genes 30 times. For each run, the circular correlation coefficient ρ was calculated between θ from projection using the full reference matrix and θ from projection using sub-sampled reference.

Supplementary Figure S31. Examples of projections on downsampled mHippNPC dataset.

(a) The cell cycle embedding projection using the mNeurosphere reference on the original mHippNPC data. (b) Each sub-panel represents the same projection as in (a), but with the mHippNPC data downsampled to 80%, 60%, 40%, 20%, and 10% of its original library size (the corresponding median library size is given in the panel title). Note that the ranges of both the x-axis and y-axis differ across sub-panels. (c) Overlaying (a) and all sub-panels of (b) shows the shrinkage of the projections as library size decreases. (d) Comparisons of cell cycle positions θ estimated from the original mHippNPC data and from the downsampled mHippNPC data. (e) Similar to (a), but the points are colored by log₂ transformed library size. (f) Similar to (b), but the points are colored by log₂ transformed library size.

Supplementary Figure S32. Stability assessment with decreasing sequencing depths.

This figure shows comprehensive assessments as complement to Supplementary Figure S31. We repeated the downsampling processes for mHippNPC for each target downsampling percentage. For each run, the circular correlation coefficient ρ was calculated between θ estimated on the original mHippNPC data and θ estimated on the downsampled data.

Supplementary Figure S33. Comparison of positions of peak expression for θ estimated on independent PCA and projection by mNeurosphere reference.

(a) For each gene depicted in Supplementary Figure S6, we estimate and compare the position between 0 and 2π at which peak expression is reached for the mNeurosphere and mHippNPC data. The position θ is based on independent PCA on GO cell cycle genes of each dataset. (b) Similar to (a), but now we use position θ estimated using the pre-learned mNeurosphere reference. (c) The majority of genes are better aligned on θ estimated using the pre-learned mNeurosphere reference. The x-axis represents the absolute distance between the positions of peak expression on θ estimated by independent PCA, while the y-axis represents the same distance on θ estimated using the pre-learned mNeurosphere reference. Genes with a larger absolute distance on θ estimated by independent PCA than on θ estimated using the pre-learned mNeurosphere reference are colored blue, and genes showing the opposite are colored red.

Supplementary Figure S34. Self-projection to test method sensitivity on a positive control.

(a) The cell cycle embedding of the mNeurosphere data using the reference learned from itself. Note that the projections differ from the direct PCA, because the PCA is performed on Seurat-corrected expression values while the projection is calculated on non-corrected expression values. (b) Comparisons of cell cycle positions θ estimated from the direct PCA and from the projection. (c) RNA velocity embedding of the projection genes on the cell cycle embedding for the mNeurosphere data. Cells are colored by the 5-stage cell cycle representation, inferred using the method modified from Schwabe et al. (2020).

SUPPLEMENTARY TABLES

Supplementary Table S1. Datasets

Acknowledgements

BIBLIOGRAPHY

Alter, O, Brown, PO, and Botstein, D (2000). Singular value decomposition for genome-wide expression data processing and modeling. Proceedings of the National Academy of Sciences of the United States of America 97, 10101–10106. DOI: 10.1073/pnas.97.18.10101.
Ambros, V (1999). Cell cycle-dependent sequencing of cell fate decisions in Caenorhabditis elegans vulva precursor cells. Development 126, 1947–56.
Ashburner, M, Ball, CA, Blake, JA, Botstein, D, Butler, H, Michael Cherry, J, Davis, AP, Dolinski, K, Dwight, SS, Eppig, JT, et al. (2000). Gene Ontology: tool for the unification of biology. Nature Genetics 25, 25–29. DOI: 10.1038/75556.
Bastidas-Ponce, A, Tritschler, S, Dony, L, Scheibner, K, Tarquis-Medina, M, Salinno, C, Schirge, S, Burtscher, I, Böttcher, A, Theis, FJ, et al. (2019). Comprehensive single cell mRNA profiling reveals a detailed roadmap for pancreatic endocrinogenesis. Development 146, dev173849. DOI: 10.1242/dev.173849.
Belluti, S, Basile, V, Benatti, P, Ferrari, E, Marverti, G, and Imbriano, C (2013). Concurrent inhibition of enzymatic activity and NF-Y-mediated transcription of Topoisomerase-IIα by bis-DemethoxyCurcumin in cancer cells. Cell Death & Disease 4, e756–e756. DOI: 10.1038/cddis.2013.287.
Bergen, V, Lange, M, Peidli, S, Wolf, FA, and Theis, FJ (2020). Generalizing RNA velocity to transient cell states through dynamical modeling. Nature Biotechnology, 1–7. DOI: 10.1038/s41587-020-0591-3.
Buettner, F, Natarajan, KN, Casale, FP, Proserpio, V, Scialdone, A, Theis, FJ, Teichmann, SA, Marioni, JC, and Stegle, O (2015). Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells. Nature Biotechnology 33, 155–160. DOI: 10.1038/nbt.3102.
Cao, J, O’Day, DR, Pliner, HA, Kingsley, PD, Deng, M, Daza, RM, Zager, MA, Aldinger, KA, Blecher-Gonen, R, Zhang, F, et al. (2020). A human cell atlas of fetal gene expression. Science 370, eaba7721. DOI: 10.1126/science.aba7721.
Carosso, GA, Boukas, L, Augustin, JJ, Nguyen, HN, Winer, BL, Cannon, GH, Robertson, JD, Zhang, L, Hansen, KD, Goff, LA, et al. (2019). Precocious neuronal differentiation and disrupted oxygen responses in Kabuki syndrome. JCI Insight 4. DOI: 10.1172/jci.insight.129375.
Cho, RJ, Campbell, MJ, Winzeler, EA, Steinmetz, L, Conway, A, Wodicka, L, Wolfsberg, TG, Gabrielian, AE, Landsman, D, Lockhart, DJ, et al. (1998). A genome-wide transcriptional analysis of the mitotic cell cycle. Molecular Cell 2, 65–73. DOI: 10.1016/s1097-2765(00)80114-8.
Clark, BS, Stein-O’Brien, GL, Shiau, F, Cannon, GH, Davis-Marcisak, E, Sherman, T, Santiago, CP, Hoang, TV, Rajaii, F, James-Esposito, RE, et al. (2019). Single-Cell RNA-Seq Analysis of Retinal Development Identifies NFI Factors as Regulating Mitotic Exit and Late-Born Cell Specification. Neuron 102, 1111–1126.e5. DOI: 10.1016/j.neuron.2019.04.010.
Cumano, A and Godin, I (2007). Ontogeny of the hematopoietic system. Annual Review of Immunology 25, 745–785. DOI: 10.1146/annurev.immunol.25.022106.141538.
Dolatabadi, S, Candia, J, Akrap, N, Vannas, C, Tomic, TT, Losert, W, Landberg, G, Åman, P, and Ståhlberg, A (2017). Cell Cycle and Cell Size Dependent Gene Expression Reveals Distinct Subpopulations at Single-Cell Level. Frontiers in Genetics 8, 1. DOI: 10.3389/fgene.2017.00001.
Gauthier, NP, Larsen, ME, Wernersson, R, de Lichtenberg, U, Jensen, LJ, Brunak, S, and Jensen, TS (2008). Cyclebase.org–a comprehensive multi-organism online database of cell-cycle experiments. Nucleic Acids Research 36, D854–9. DOI: 10.1093/nar/gkm729.
Heck, MM, Hittelman, WN, and Earnshaw, WC (1988). Differential expression of DNA topoisomerases I and II during the eukaryotic cell cycle. Proceedings of the National Academy of Sciences 85, 1086–1090. DOI: 10.1073/pnas.85.4.1086.
Hsiao, CJ, Tung, P, Blischak, JD, Burnett, JE, Barr, KA, Dey, KK, Stephens, M, and Gilad, Y (2020). Characterizing and inferring quantitative cell cycle phase in single-cell RNA-seq data analysis. Genome Research 30, 611–621. DOI: 10.1101/gr.247759.118.
Jammalamadaka, SR and Sarma, Y (1988). A correlation coefficient for angular variables. Statistical theory and data analysis II, 349–364.
Kowalczyk, MS, Tirosh, I, Heckl, D, Rao, TN, Dixit, A, Haas, BJ, Schneider, RK, Wagers, AJ, Ebert, BL, and Regev, A (2015). Single-cell RNA-seq reveals changes in cell cycle and differentiation programs upon aging of hematopoietic stem cells. Genome Research 25, 1860–1872. DOI: 10.1101/gr.192237.115.
Leng, N, Chu, LF, Barry, C, Li, Y, Choi, J, Li, X, Jiang, P, Stewart, RM, Thomson, JA, and Kendziorski, C (2015). Oscope identifies oscillatory genes in unsynchronized single-cell RNA-seq experiments. Nature Methods 12, 947–950. DOI: 10.1038/nmeth.3549.
Liu, Z, Lou, H, Xie, K, Wang, H, Chen, N, Aparicio, OM, Zhang, MQ, Jiang, R, and Chen, T (2017). Reconstructing cell cycle pseudo time-series via single-cell transcriptome data. Nature Communications 8, 22. DOI: 10.1038/s41467-017-00039-z.
Lodish, H, Berk, A, Kaiser, CA, Krieger, M, Scott, MP, Bretscher, A, Ploegh, H, Matsudaira, P, et al. (2008). “Section 12.3 The Role of Topoisomerases in DNA Replication”. Molecular Cell Biology. 4th edition. New York: W. H. Freeman.
Mahdessian, D, Cesnik, AJ, Gnann, C, Danielsson, F, Stenström, L, Arif, M, Zhang, C, Le, T, Johansson, F, Shutten, R, et al. (2021). Spatiotemporal dissection of the cell cycle with single-cell proteogenomics. Nature 590, 649–654. DOI: 10.1038/s41586-021-03232-9.
Marguerat, S and Bähler, J (2012). Coordinating genome expression with cell size. Trends in Genetics 28, 560–565. DOI: 10.1016/j.tig.2012.07.003.
McConnell, S and Kaznowski, C (1991). Cell cycle dependence of laminar determination in developing neocortex. Science 254, 282–285. DOI: 10.1126/science.1925583.
McGarry, TJ and Kirschner, MW (1998). Geminin, an inhibitor of DNA replication, is degraded during mitosis. Cell 93, 1043–1053. DOI: 10.1016/s0092-8674(00)81209-x.
Ohnuma, Si and Harris, WA (2003). Neurogenesis and the Cell Cycle. Neuron 40, 199–208. DOI: 10.1016/s0896-6273(03)00632-9.
Ono, T, Losada, A, Hirano, M, Myers, MP, Neuwald, AF, and Hirano, T (2003). Differential Contributions of Condensin I and Condensin II to Mitotic Chromosome Architecture in Vertebrate Cells. Cell 115, 109–121. DOI: 10.1016/s0092-8674(03)00724-4.
Padovan-Merhar, O, Nair, GP, Biaesch, AG, Mayer, A, Scarfone, S, Foley, SW, Wu, AR, Churchman, LS, Singh, A, and Raj, A (2015). Single Mammalian Cells Compensate for Differences in Cellular Volume and DNA Copy Number through Independent Global Transcriptional Mechanisms. Molecular Cell 58, 339–352. DOI: 10.1016/j.molcel.2015.03.005.
Pan, SJ, Kwok, JT, and Yang, Q (2008). “Transfer learning via dimensionality reduction”. AAAI. Vol. 8, 677–682.
Ramsay, JO and Silverman, BW (2005). Functional Data Analysis, 2nd ed. Springer Verlag, New York.
Sakaue-Sawano, A, Kurokawa, H, Morimura, T, Hanyu, A, Hama, H, Osawa, H, Kashiwagi, S, Fukami, K, Miyata, T, Miyoshi, H, et al. (2008). Visualizing Spatiotemporal Dynamics of Multicellular Cell-Cycle Progression. Cell 132, 487–498. DOI: 10.1016/j.cell.2007.12.033.
Sakaue-Sawano, A, Yo, M, Komatsu, N, Hiratsuka, T, Kogure, T, Hoshida, T, Goshima, N, Matsuda, M, Miyoshi, H, and Miyawaki, A (2017). Genetically Encoded Tools for Optical Dissection of the Mammalian Cell Cycle. Molecular Cell 68, 626–640.e5. DOI: 10.1016/j.molcel.2017.10.001.
Schwabe, D, Formichetti, S, Junker, JP, Falcke, M, and Rajewsky, N (2020). The transcriptome dynamics of single cells during the cell cycle. Molecular Systems Biology 16, e9946. DOI: 10.15252/msb.20209946.
Scialdone, A, Natarajan, KN, Saraiva, LR, Proserpio, V, Teichmann, SA, Stegle, O, Marioni, JC, and Buettner, F (2015). Computational assignment of cell-cycle stage from single-cell transcriptome data. Methods 85, 54–61. DOI: 10.1016/j.ymeth.2015.06.021.
Soneson, C (2020). RNA Velocity with alevin.
Spellman, PT, Sherlock, G, Zhang, MQ, Iyer, VR, Anders, K, Eisen, MB, Brown, PO, Botstein, D, and Futcher, B (1998). Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell 9, 3273–3297. DOI: 10.1091/mbc.9.12.3273.
Stein-O’Brien, GL, Clark, BS, Sherman, T, Zibetti, C, Hu, Q, Sealfon, R, Liu, S, Qian, J, Colantuoni, C, Blackshaw, S, et al. (2019). Decomposing Cell Identity for Transfer Learning across Cellular Measurements, Platforms, Tissues, and Species. Cell Systems 8, 395–411.e8. DOI: 10.1016/j.cels.2019.04.004.
Stuart, T, Butler, A, Hoffman, P, Hafemeister, C, Papalexi, E, Mauck 3rd, WM, Hao, Y, Stoeckius, M, Smibert, P, and Satija, R (2019). Comprehensive Integration of Single-Cell Data. Cell 177, 1888–1902.e21. DOI: 10.1016/j.cell.2019.05.031.
Wei-Shan, H, Amit, VC, and Clarke, DJ (2019). Cell cycle regulation of condensin Smc4. Oncotarget 10, 263–276. DOI: 10.18632/oncotarget.26467.
Whitfield, ML, Sherlock, G, Saldanha, AJ, Murray, JI, Ball, CA, Alexander, KE, Matese, JC, Perou, CM, Hurt, MM, Brown, PO, et al. (2002). Identification of Genes Periodically Expressed in the Human Cell Cycle and Their Expression in Tumors. Molecular Biology of the Cell 13, 1977–2000. DOI: 10.1091/mbc.02-02-0030.
Zappia, L, Phipson, B, and Oshlack, A (2017). Splatter: simulation of single-cell RNA sequencing data. Genome Biology 18, 174. DOI: 10.1186/s13059-017-1305-0.

Posted April 06, 2021.


[1] ↵
Alter, O, Brown, PO, and Botstein, D (2000). Singular value decomposition for genome-wide expression data processing and modeling. Proceedings of the National Academy of Sciences of the United States of America 97, 10101–10106. DOI: 10.1073/pnas.97.18.10101.
OpenUrl Abstract/FREE Full Text

[2] ↵
Ambros, V (1999). Cell cycle-dependent sequencing of cell fate decisions in Caenorhabditis elegans vulva precursor cells. Development 126, 1947–56.
OpenUrl Abstract/FREE Full Text

[3] ↵
Ashburner, M, Ball, CA, Blake, JA, Botstein, D, Butler, H, Michael Cherry, J, Davis, AP, Dolinski, K, Dwight, SS, Eppig, JT, et al. (2000). Gene Ontology: tool for the unification of biology. Nature Genetics 25, 25–29. DOI: 10.1038/75556.
OpenUrl CrossRef PubMed Web of Science

[4] ↵
Bastidas-Ponce, A, Tritschler, S, Dony, L, Scheibner, K, Tarquis-Medina, M, Salinno, C, Schirge, S, Burtscher, I, Böttcher, A, Theis, FJ, et al. (2019). Comprehensive single cell mRNA profiling reveals a detailed roadmap for pancreatic endocrinogenesis. Development 146, dev173849. DOI: 10.1242/dev.173849.
OpenUrl Abstract/FREE Full Text

[5] ↵
Belluti, S, Basile, V, Benatti, P, Ferrari, E, Marverti, G, and Imbriano, C (2013). Concurrent inhibition of enzymatic activity and NF-Y-mediated transcription of Topoisomerase-IIα by bis-DemethoxyCurcumin in cancer cells. Cell Death & Disease 4, e756–e756. DOI: 10.1038/cddis.2013.287.
OpenUrl CrossRef

[6] ↵
Bergen, V, Lange, M, Peidli, S, Wolf, FA, and Theis, FJ (2020). Generalizing RNA velocity to transient cell states through dynamical modeling. Nature Biotechnology, 1–7. DOI: 10.1038/s41587-020-0591-3.
OpenUrl CrossRef

[7] ↵
Buettner, F, Natarajan, KN, Casale, FP, Proserpio, V, Scialdone, A, Theis, FJ, Teichmann, SA, Marioni, JC, and Stegle, O (2015). Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells. Nature Biotechnology 33, 155–160. DOI: 10.1038/nbt.3102.
OpenUrl CrossRef PubMed

[8] ↵
Cao, J, O’Day, DR, Pliner, HA, Kingsley, PD, Deng, M, Daza, RM, Zager, MA, Aldinger, KA, Blecher-Gonen, R, Zhang, F, et al. (2020). A human cell atlas of fetal gene expression. Science 370, eaba7721. DOI: 10.1126/science.aba7721.
OpenUrl Abstract/FREE Full Text

[9] ↵
Carosso, GA, Boukas, L, Augustin, JJ, Nguyen, HN, Winer, BL, Cannon, GH, Robertson, JD, Zhang, L, Hansen, KD, Goff, LA, et al. (2019). Precocious neuronal differentiation and disrupted oxygen responses in Kabuki syndrome. JCI Insight 4. DOI: 10.1172/jci.insight.129375.
OpenUrl CrossRef

[10] ↵
Cho, RJ, Campbell, MJ, Winzeler, EA, Steinmetz, L, Conway, A, Wodicka, L, Wolfsberg, TG, Gabrielian, AE, Landsman, D, Lockhart, DJ, et al. (1998). A genome-wide transcriptional analysis of the mitotic cell cycle. Molecular Cell 2, 65–73. DOI: 10.1016/s1097-2765(00)80114-8.
OpenUrl CrossRef PubMed Web of Science

[11] Clark, BS, Stein-O’Brien, GL, Shiau, F, Cannon, GH, Davis-Marcisak, E, Sherman, T, Santiago, CP, Hoang, TV, Rajaii, F, James-Esposito, RE, et al. (2019). Single-Cell RNA-Seq Analysis of Retinal Development Identifies NFI Factors as Regulating Mitotic Exit and Late-Born Cell Specification. Neuron 102, 1111–1126.e5. DOI: 10.1016/j.neuron.2019.04.010.
OpenUrl CrossRef PubMed

[12] ↵
Cumano, A and Godin, I (2007). Ontogeny of the hematopoietic system. Annual Review of Immunology 25, 745–785. DOI: 10.1146/annurev.immunol.25.022106.141538.
OpenUrl CrossRef PubMed Web of Science

[13] ↵
Dolatabadi, S, Candia, J, Akrap, N, Vannas, C, Tomic, TT, Losert, W, Landberg, G, Åman, P, and Ståhlberg, A (2017). Cell Cycle and Cell Size Dependent Gene Expression Reveals Distinct Subpopulations at Single-Cell Level. Frontiers in Genetics 8, 1. DOI: 10.3389/fgene.2017.00001.
OpenUrl CrossRef

[14] ↵
Gauthier, NP, Larsen, ME, Wernersson, R, de Lichtenberg, U, Jensen, LJ, Brunak, S, and Jensen, TS (2008). Cyclebase.org–a comprehensive multi-organism online database of cell-cycle experiments. Nucleic Acids Research 36, D854–9. DOI: 10.1093/nar/gkm729.
OpenUrl CrossRef PubMed Web of Science

[15] ↵
Heck, MM, Hittelman, WN, and Earnshaw, WC (1988). Differential expression of DNA topoisomerases I and II during the eukaryotic cell cycle. Proceedings of the National Academy of Sciences 85, 1086–1090. DOI: 10.1073/pnas.85.4.1086.
OpenUrl Abstract/FREE Full Text

[16] ↵
Hsiao, CJ, Tung, P, Blischak, JD, Burnett, JE, Barr, KA, Dey, KK, Stephens, M, and Gilad, Y (2020). Characterizing and inferring quantitative cell cycle phase in single-cell RNA-seq data analysis. Genome Research 30, 611–621. DOI: 10.1101/gr.247759.118.
OpenUrl Abstract/FREE Full Text

[17] ↵
Jammalamadaka, SR and Sarma, Y (1988). A correlation coefficient for angular variables. Statistical theory and data analysis II, 349–364.

[18] ↵
Kowalczyk, MS, Tirosh, I, Heckl, D, Rao, TN, Dixit, A, Haas, BJ, Schneider, RK, Wagers, AJ, Ebert, BL, and Regev, A (2015). Single-cell RNA-seq reveals changes in cell cycle and differentiation programs upon aging of hematopoietic stem cells. Genome Research 25, 1860–1872. DOI: 10.1101/gr.192237.115.
OpenUrl Abstract/FREE Full Text

[19] ↵
Leng, N, Chu, LF, Barry, C, Li, Y, Choi, J, Li, X, Jiang, P, Stewart, RM, Thomson, JA, and Kendziorski, C (2015). Oscope identifies oscillatory genes in unsynchronized single-cell RNA-seq experiments. Nature Methods 12, 947–950. DOI: 10.1038/nmeth.3549.
OpenUrl CrossRef PubMed

[20] ↵
Liu, Z, Lou, H, Xie, K, Wang, H, Chen, N, Aparicio, OM, Zhang, MQ, Jiang, R, and Chen, T (2017). Reconstructing cell cycle pseudo time-series via single-cell transcriptome data. Nature Communications 8, 22. DOI: 10.1038/s41467-017-00039-z.
OpenUrl CrossRef PubMed

[21] ↵
Lodish, Berk, Harvey and, Kaiser, Arnold and, Kaiser, Chris A and, Krieger, Chris and, Scott, Monty and, Bretscher, Matthew P and, Ploegh, Anthony and, Matsudaira, Hidde and, and others, Paul and (2008). “Section 12.3 The Role of Topoisomerases in DNA Replication”. Molecular Cell Biology. 4th edition. New York: W. H. Freeman.

[22] ↵
Mahdessian, D, Cesnik, AJ, Gnann, C, Danielsson, F, Stenström, L, Arif, M, Zhang, C, Le, T, Johansson, F, Shutten, R, et al. (2021). Spatiotemporal dissection of the cell cycle with single-cell proteogenomics. Nature 590, 649–654. DOI: 10.1038/s41586-021-03232-9.
OpenUrl CrossRef

[23] ↵
Marguerat, S and Bähler, J (2012). Coordinating genome expression with cell size. Trends in Genetics 28, 560– 565. DOI: 10.1016/j.tig.2012.07.003.
OpenUrl CrossRef PubMed Web of Science

[24] ↵
McConnell, S and Kaznowski, C (1991). Cell cycle dependence of laminar determination in developing neocortex. Science 254, 282–285. DOI: 10.1126/science.1925583.
OpenUrl Abstract/FREE Full Text

[25] ↵
McGarry, TJ and Kirschner, MW (1998). Geminin, an inhibitor of DNA replication, is degraded during mitosis. Cell 93, 1043–1053. DOI: 10.1016/s0092-8674(00)81209-x.
OpenUrl CrossRef PubMed Web of Science

[26] ↵
Ohnuma, Si and Harris, WA (2003). Neurogenesis and the Cell Cycle. Neuron 40, 199–208. DOI: 10.1016/s0896-6273(03)00632-9.
OpenUrl CrossRef PubMed Web of Science

[27] ↵
Ono, T, Losada, A, Hirano, M, Myers, MP, Neuwald, AF, and Hirano, T (2003). Differential Contributions of Condensin I and Condensin II to Mitotic Chromosome Architecture in Vertebrate Cells. Cell 115, 109– 121. DOI: 10.1016/s0092-8674(03)00724-4.
OpenUrl CrossRef PubMed Web of Science

[28] ↵
Padovan-Merhar, O, Nair, GP, Biaesch, AG, Mayer, A, Scarfone, S, Foley, SW, Wu, AR, Churchman, LS, Singh, A, and Raj, A (2015). Single Mammalian Cells Compensate for Differences in Cellular Volume and DNA Copy Number through Independent Global Transcriptional Mechanisms. Molecular Cell 58, 339– 352. DOI: 10.1016/j.molcel.2015.03.005.
OpenUrl CrossRef PubMed

[29] ↵
Pan, SJ, Kwok, JT, and Yang, Q (2008). “Transfer learning via dimensionality reduction”. AAAI. Vol. 8, 677–682.
OpenUrl

[30] ↵
Ramsay, H and Silverman, BW (2005). Functional Data Analysis, 2nd ed. Springer Verlag, New York.

[31] ↵
Sakaue-Sawano, A, Kurokawa, H, Morimura, T, Hanyu, A, Hama, H, Osawa, H, Kashiwagi, S, Fukami, K, Miyata, T, Miyoshi, H, et al. (2008). Visualizing Spatiotemporal Dynamics of Multicellular Cell-Cycle Progression. Cell 132, 487–498. DOI: 10.1016/j.cell.2007.12.033.
OpenUrl CrossRef PubMed Web of Science

[32] ↵
Sakaue-Sawano, A, Yo, M, Komatsu, N, Hiratsuka, T, Kogure, T, Hoshida, T, Goshima, N, Matsuda, M, Miyoshi, H, and Miyawaki, A (2017). Genetically Encoded Tools for Optical Dissection of the Mammalian Cell Cycle. Molecular Cell 68, 626–640.e5. DOI: 10.1016/j.molcel.2017.10.001.
OpenUrl CrossRef

[33] ↵
Schwabe, D, Formichetti, S, Junker, JP, Falcke, M, and Rajewsky, N (2020). The transcriptome dynamics of single cells during the cell cycle. Molecular Systems Biology 16, e9946. DOI: 10.15252/msb.20209946.
OpenUrl CrossRef

[34] ↵
Scialdone, A, Natarajan, KN, Saraiva, LR, Proserpio, V, Teichmann, SA, Stegle, O, Marioni, JC, and Buettner, F (2015). Computational assignment of cell-cycle stage from single-cell transcriptome data. Methods 85, 54–61. DOI: 10.1016/j.ymeth.2015.06.021.
OpenUrl CrossRef PubMed

[35] ↵
Soneson, C (2020). RNA Velocity with alevin.

[36] ↵
Spellman, PT, Sherlock, G, Zhang, MQ, Iyer, VR, Anders, K, Eisen, MB, Brown, PO, Botstein, D, and Futcher, B (1998). Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell 9, 3273–3297. DOI: 10.1091/mbc.9.12.3273.
OpenUrl Abstract/FREE Full Text

[37] ↵
Stein-O’Brien, GL, Clark, BS, Sherman, T, Zibetti, C, Hu, Q, Sealfon, R, Liu, S, Qian, J, Colantuoni, C, Black-shaw, S, et al. (2019). Decomposing Cell Identity for Transfer Learning across Cellular Measurements, Platforms, Tissues, and Species. Cell Systems 8, 395– 411.e8. DOI: 10.1016/j.cels.2019.04.004.
OpenUrl CrossRef

[38] ↵
Stuart, T, Butler, A, Hoffman, P, Hafemeister, C, Papalexi, E, Mauck 3rd, WM, Hao, Y, Stoeckius, M, Smibert, P, and Satija, R (2019). Comprehensive Integration of Single-Cell Data. Cell 177, 1888–1902.e21. DOI: 10.1016/j.cell.2019.05.031.
OpenUrl CrossRef PubMed

[39] ↵
Wei-Shan, H, Amit, VC, and Clarke, DJ (2019). ell cycle regulation of condensin Smc4. Oncotarget 10, 263–276. DOI: 10.18632/oncotarget.26467.
OpenUrl CrossRef

[40] ↵
Whitfield, ML, Sherlock, G, Saldanha, AJ, Murray, JI, Ball, CA, Alexander, KE, Matese, JC, Perou, CM, Hurt, MM, Brown, PO, et al. (2002). Identification of Genes Periodically Expressed in the Human Cell Cycle and Their Expression in Tumors. Molecular Biology of the Cell 13, 1977–2000. DOI: 10.1091/mbc.02-02-0030.
OpenUrl Abstract/FREE Full Text

[41] ↵
Zappia, L, Phipson, B, and Oshlack, A (2017). Splatter: simulation of single-cell RNA sequencing data. Genome Biology 18, 174. DOI: 10.1186/s13059-017-1305-0.
OpenUrl CrossRef