Unraveling the timeline of gene expression: A pseudotemporal trajectory analysis of single-cell RNA sequencing data

Background Single-cell RNA sequencing (scRNA-seq) technologies have rapidly developed in recent years. The droplet-based single cell platforms enable the profiling of gene expression in tens of thousands of cells per sample. The goal of a typical scRNA-seq analysis is to identify different cell subpopulations and their respective marker genes. Additionally, trajectory analysis can be used to infer the developmental or differentiation trajectories of cells. Methods This article demonstrates a comprehensive workflow for performing trajectory inference and time course analysis on a multi-sample single-cell RNA-seq experiment of the mouse mammary gland. The workflow uses open-source R software packages and covers all steps of the analysis pipeline, including quality control, doublet prediction, normalization, integration, dimension reduction, cell clustering, trajectory inference, and pseudo-bulk time course analysis. Sample integration and cell clustering follows the Seurat pipeline while the trajectory inference is conducted using the monocle3 package. The pseudo-bulk time course analysis uses the quasi-likelihood framework of edgeR. Results Cells are ordered and positioned along a pseudotime trajectory that represented a biological process of cell differentiation and development. The study successfully identified genes that were significantly associated with pseudotime in the mouse mammary gland. Conclusions The demonstrated workflow provides a valuable resource for researchers conducting scRNA-seq analysis using open-source software packages. The study successfully demonstrated the usefulness of trajectory analysis for understanding the developmental or differentiation trajectories of cells. This analysis can be applied to various biological processes such as cell development or disease progression, and can help identify potential biomarkers or therapeutic targets.


Introduction
Single cell RNA sequencing (scRNA-seq) has emerged as a popular technique for transcriptomic profiling of samples at the single cell level. With droplet-based methods, thousands of cells can be sequenced in parallel using next-generation sequencing platforms [1,2]. One of the most widely used droplet-based scRNA-seq technologies is the 10x Genomics Chromium which enables profiling transcriptomes of tens of thousands of cells per sample [3]. A common goal of a scRNA-seq analysis is to investigate cell types and states in heterogeneous tissues. To achieve this, various pipelines have been developed, such as Seurat [4] and the Bioconductor's OSCA pipeline [5]. A typical scRNA-seq data analysis pipeline involves quality control, normalization, dimension reduction, cell clustering, and differential expression analysis.
As the cost of scRNA-seq continues to drop, more experimental studies involve replicate samples. In a multiple sample single-cell experiment, an integration method is required to investigate all cells across all samples simultaneously. This ensures that sample and batch effects are appropriately considered in visualizing and clustering cells. Popular integration methods include the Seurat's anchor-based integration method [4], Harmony [6], and the MNN [7].
After integration and cell clustering, differential expression analysis is often performed to identify marker genes for each cell cluster. Various methods have been developed at the single-cell level for finding marker genes [8,9]. Recently, the pseudo-bulk method has become increasingly popular due to its superior computational efficiency and its ability to consider biological variation between replicate samples [10].
Trajectory inference is another popular downstream analysis that aims to study cell differentiation or cell type development. Popular software tools to perform trajectory analysis include monocle3 [11] and slingshot [12]. These methods learn trajectories based on the change of gene expression and order cells along a trajectory to obtain pseudotime [13,14]. This allows for pseudotime-based time course analysis in single-cell experiments, which is extremely useful for investigating specific biological questions of interest.
Here we present a new single-cell workflow that integrates trajectory analysis and pseudo-bulking to execute a singlecell pseudo time course analysis. The inputs for this workflow are single-cell count matrices, such as those generated by 10x Genomic's cellranger (https://www.10xgenomics.com). The single-cell level analysis is performed in Seurat, and the trajectory analysis is conducted using monocle3. Once the pseudo-bulk samples are created and assigned pseudotime, a time course analysis is conducted in edgeR [15]. The analysis pipeline presented in this article can be applied to any scRNA-seq study with replicate samples.

Description of the biological experiment
The scRNA-seq data used in this workflow consists of five mouse mammary epithelium samples at five different stages: embryonic, early postnatal, pre-puberty, puberty and adult. The puberty sample is from the study in Pal et al. 2017 [16], whereas the other samples are from Pal et al. 2021 [17]. These studies examined the stage-specific single-cell profiles in order to gain insight into the early developmental stages of mammary gland epithelial lineage. The cellranger count matrix outputs of these five samples are available on the GEO repository as series GSE103275 and GSE164017.

Data preparation
Downloading the data The cellranger output of each sample consists of three key files: a count matrix in mtx.gz format, barcode information in tsv.gz format and feature (or gene) information in tsv.gz format.
We first create a data directory to store all the data files.

Cell type identification
The mammary gland epithelium consists of three major cell types: basal myoepithelial cells, luminal progenitor (LP) cells and mature luminal (ML) cells. These three major epithelial cell populations have been well studied in the literature. By examining the classic marker genes of the three cell types, we are able to identify basal, LP and ML cell populations in the integrated data ( Figure 4). Here we use Krt14 and Acta2 for basal, Csn3 and Elf5 for LP, and Prlr and Areg for ML. We also examine the expression level of Hmgb2 and Mki67 as they are typical markers for cycling cells and the expression level of Igfbp7 and Fabp4 as they are marker genes for stromal cells.
The number of cells in each cluster for each sample is shown below.
> tab_number <- The proportion of cells in each cluster is calculated for each sample to compare the variation in cell composition across different stages.  The bar plot ( Figure 5) shows the proportion of different cell types in samples at different developmental stages. Specifically, the proportion of basal cells (purple) demonstrates an ascending trend from E18.5 to pre-puberty stage, after which it declines towards adult stage. The LP cell proportion (red) rises from E18.5 to puberty stage, followed by a slight dip at adult stage. Although the proportion of ML cells (blue) is higher at P5 than pre-puberty stage, it shows an overall increasing trend. Cycling cells (green) constitute the highest proportion at E18.5 stage, but decrease to a smaller proportion at pre-puberty stage, with a slight increase at puberty stage, and subsequently, they reduce to a negligible proportion at adult stage. The augmented cycling cell proportion at puberty stage aligns with the ductal morphogenesis characteristics of the mammary gland.

Trajectory analysis with monocle3
Constructing trajectories and pseudotime Many biological processes manifest as a dynamic sequence of alterations in the cellular state, which can be estimated through a "trajectory" analysis. Such analysis is instrumental in detecting the shifts between different cell identities and modeling gene expression dynamics. By treating single-cell data as a snapshot of an uninterrupted process, the analysis establishes the sequence of cellular states that forms the process trajectory. The arrangement of cells along these trajectories can be interpreted as pseudotime.
Here, we use the monocle3 package to infer the development trajectory in the mouse mammary gland epithelial cell population. The Seurat object of the integrated data is first converted into a cell_data_set object to be used in monocle3.
> library(monocle3) > cds_obj <-SeuratWrappers::as.cell_data_set(seurat_int) monocle3 re-clusters cells to assign them to specific clusters and partitions, which are subsequently leveraged to construct trajectories. If multiple partitions are used, each partition will represent a distinct trajectory. The calculation of pseudotime, which indicates the distance between a cell and the starting cell in a trajectory, is conducted during the trajectory learning process. These are done using the cluster_cells and learn_graph functions. To obtain a single trajectory and avoid a loop structure, both use_partition and close_loop are turned off in learn_graph.

Visualizing trajectories and pseudotime
The plot_cells function of monocle3 is used to generate a trajectory plot that superimposes the trajectory information onto the UMAP representation of the integrated data. By adjusting the label_principal_points parameter, the names of roots, leaves, and branch points can be displayed. Cells in the trajectory UMAP plot ( Figure 6) on the left are colored by cell cluster identified in the previous Seurat integration analysis.
Along the monocle3 trajectory analysis, several nodes are identified and marked with black circular dots on the resulting plot, representing key points along the trajectories. To establish the order of cells and calculate their corresponding pseudotime, it is necessary to select a starting node from among the identified nodes. For this analysis, node "Y_65" in the basal population (cluster 1) was selected as the starting node, as mammary stem cells are known to be enriched in the basal population and give rise to LP and ML cells in the epithelial lineage [19]. It should be noted that node numbers may vary depending on the version of monocle3 used.
The cells are then ordered and assigned pseudotime values by the order_cells function in monocle3. The resulting pseudotime information can be visualized on the UMAP plot by using the plot_cells function, as demonstrated in the UMAP plot on the right ( Figure 6).

Psuedo-bulk time course analysis with edgeR Constructing pseudo-bulk profiles
After obtaining the pseudotime of each cell, we proceed to a time course analysis to identify genes that change significantly along the pseudotime time. Our approach involves creating pseudo-bulk samples using a pseudo-bulking approach and performing an edgeR-style time course analysis.

Filtering and normalization
We now proceed to the standard edgeR analysis pipeline, which starts with filtering and normalization. The sample information, such as library sizes, average pseudotime and cell numbers, are shown below.
> y$samples[, c("lib.size", "pseudotime", "cell_number")] To ensure the reliability of the analysis, it is recommended to remove pseudo-bulk samples that are constructed from a small number of cells. We suggest each pseudo-bulk sample should contain at least 30 cells. In this analysis, we identified seven pseudo-bulk samples that were constructed with less than 30 cells and removed them form the analysis.

> keep_samples <-y$samples$cell_number > 30 > y <-y[, keep_samples]
Genes with very low count number are also removed from the analysis. This is performed by the filterByExpr function in edgeR.

[1] 11550 22
Normalization is performed by the trimmed mean of M values (TMM) method [20] implemented in the calcNormFactors function in edgeR.

> y <-calcNormFactors(y)
A Multi-dimensional scaling (MDS) plot serves as a valuable diagnostic tool for investigating the relationship among samples. MDS plots are produced using the plotMDS function in edgeR (Figure 7).

Design matrix
The aim of a time course experiment is to examine the relationship between gene abundances and time points. Assuming gene expression changes smoothly over time, we use a natural cubic spline with degrees of freedom of 3 to model gene expression along the pseudotime. The spline design matrix is generated by ns function in splines. The design matrix is also reformed so that the first column represents the linear trend.

Dispersion estimation
The edgeR package uses negative binomial (NB) distribution to model read counts of each gene across all the sample.
The NB dispersions are estimated by the estimateDisp function. The estimated common, trended and gene-specific dispersions can be visualized by plotBCV (Figure 8).
> y <-estimateDisp(y, design) > sqrt(y$common.dispersion) [1] 0.588 > plotBCV(y) The NB model can be extended with quasi-likelihood (QL) methods to account for gene-specific variability from both biological and technical sources [21,22]. Note that only the trended NB dispersion is used in the QL method. The gene-specific variability is captured by the QL dispersion.
The glmQLFit function is used to fit a QL model and estimate QL dispersions. The QL dispersion estimates can be visualized by plotQLDisp (Figure 9). > fit <-glmQLFit(y, design, robust=TRUE) > plotQLDisp(fit) Figure 9. A scatter plot of the quarter-root QL dispersion against the average abundance of each gene. Estimates are shown for the raw, trended and squeezed dispersions.

Time course trend analysis
The QL F-tests are performed by glmQLFTest in edgeR to identify genes that change significantly along the pseudotime.
The test are conducted on all three covariates of the spline model matrix. This is because the significance of any of the three coefficients would indicate a strong correlation between gene expression and pseudotime.

Gene ontology analysis
To interpret the results of the time course analysis at the functional level, we perform geneset enrichment analysis. Gene ontology (GO) is one of the commonly used databases for this purpose. The GO terms in the GO databases are categorized into three classes: biological process (BP), cellular component (CC) and molecular function (MF). In a GO analysis, we are interested in finding GO terms that are over-represented or enriched with significant genes.
GO analysis is usually directional. For simplicity, we re-perform the QL F-test on the Z1 coefficient to identify genes that exhibit a general linear increase or decrease along pseudotime. The numbers of genes with a significant increasing or decreasing linear trend are shown below.
> res_2 <-glmQLFTest(fit, coef=2) > summary(decideTests(res_2)) Z1 Down 1366 NotSig 9056 Up 1128 To perform a GO analysis, we apply the goana function to the above test results. Note that Entrez gene IDs are required for goana, which has been added to the ENTREZID column in the gene annotation. The top enriched GO terms can be viewed using topGO function.

KEGG pathway analysis
The Kyoto Encyclopedia of Genes and Genomes (KEGG) [23] is another commonly used database for exploring signaling pathways to understand the molecular mechanism of diseases and biological processes. A KEGG analysis can be done by using kegga function.
The top enriched KEGG pathways can be viewed by using topKEGG function.

Discussion
In this article, we demonstrated a complete workflow of a pseudo-temporal trajectory analysis of scRNA-seq data. This workflow takes single-cell count matrices as input and leverages the Seurat pipeline for standard scRNA-seq analysis, Figure 14. A smooth curve of PI3K-Akt signaling pathway expression level against pseudotime.
including quality control, normalization, and integration. The scDblFinder package is utilized for doublet prediction. Trajectory inference is conducted with monocle3, while the edgeR QL framework with a pseudo-bulking strategy is applied for pseudo-time course analysis. Alternative methods and packages can be used interchangeably with the ones implemented in this study, as long as they perform equivalent functions. For instance, the bioconductor workflow may be substituted for the Seurat pipeline in scRNA-seq analysis, whereas the slingshot package may replace monocle3 for performing trajectory analysis.
This workflow article utilized 10x scRNA-seq data from five distinct stages of mouse mammary gland development, with a focus on the lineage progression of epithelial cells. By performing a time course analysis based on pseudotime along the developmental trajectory, we successfully identified genes and pathways that exhibit differential expression patterns over the course of pseudotime. The results of this extensive analysis not only confirm previous findings in the literature regarding the mouse mammary gland epithelium, but also reveal new insights specific to the early developmental stages of the mammary gland. The analytical framework presented here can be utilized for any single-cell experiments aimed at studying dynamic changes along a specific path, whether it involves cell differentiation or the development of cell types.

Packages used
This workflow depends on various packages from the Bioconductor project version 3. 15