Revealing dynamics of gene expression variability in cell state space

Grün, Dominic

doi:10.1038/s41592-019-0632-3

Brief Communication
Published: 18 November 2019

Dominic Grün ORCID: orcid.org/0000-0002-3364-5898^1,2

Nature Methods volume 17, pages 45–49 (2020)

Abstract

To decipher cell state transitions from single-cell transcriptomes, it is crucial to quantify weak expression of lineage-determining factors, which requires computational methods that are sensitive to the variability of weakly expressed genes. Here, I introduce VarID, a computational method that identifies locally homogenous neighborhoods in cell state space, permitting the quantification of local variability in gene expression. VarID delineates neighborhoods with differential gene expression variability and reveals pseudo-temporal dynamics of variability during differentiation.

**Fig. 1: Locally homogenous neighborhoods enable sensitive cell type identification.**

**Fig. 2: Inferring local variability in hematopoietic progenitor cell state space.**

**Fig. 3: Exploring dynamics of gene expression variability during neutrophil differentiation.**

Data availability

Primary data used in this manuscript were downloaded from GEO with accession code GSE89754 for the hematopoietic data¹⁰, and GSE92332 for the intestinal data²⁰.

Code availability

VarID is integrated in the RaceID v0.1.4 package available from CRAN or GitHub (https://github.com/dgrun/RaceID3_StemID2_package). Source code for reproducing the results of this manuscript is available on GitHub (https://github.com/dgrun/VarID_analysis) and as Supplementary Software.

References

1. Grün, D. Revealing routes of cellular differentiation by single-cell RNA-seq. Curr. Opin. Syst. Biol. 11, 9–17 (2018).
2. Grün, D., Kester, L. & van Oudenaarden, A. Validation of noise models for single-cell transcriptomics. Nat. Methods 11, 637–640 (2014).
3. Vallejos, C. A., Marioni, J. C. & Richardson, S. BASiCS: Bayesian analysis of single-cell sequencing data. PLoS Comput. Biol. 11, e1004333 (2015).
4. Eling, N., Richard, A. C., Richardson, S., Marioni, J. C. & Vallejos, C. A. Correcting the mean-variance dependency for differential variability testing using single-cell RNA sequencing data. Cell Syst. 7, 284–294 (2018).
5. Macosko, E. Z. et al. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell 161, 1202–1214 (2015).
6. Satija, R., Farrell, J. A., Gennert, D., Schier, A. F. & Regev, A. Spatial reconstruction of single-cell gene expression data. Nat. Biotechnol. 33, 495–502 (2015).
7. Setty, M. et al. Wishbone identifies bifurcating developmental trajectories from single-cell data. Nat. Biotechnol. 34, 637–645 (2016).
8. Herman, J. S., Sagar & Grün, D. FateID infers cell fate bias in multipotent progenitors from single-cell RNA-seq data. Nat. Methods 15, 379–386 (2018).
9. Grün, D. et al. Single-cell messenger RNA sequencing reveals rare intestinal cell types. Nature 525, 251–255 (2015).
10. Tusi, B. K. et al. Population snapshots predict early haematopoietic and erythroid hierarchies. Nature 555, 54–60 (2018).
11. Brennecke, P. et al. Accounting for technical noise in single-cell RNA-seq experiments. Nat. Methods 10, 1093–1095 (2013).
12. Hafemeister, C. & Satija, R. Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression. Preprint at bioRxiv https://doi.org/10.1101/576827 (2019).
13. Hu, H. et al. AnimalTFDB 3.0: a comprehensive resource for annotation and prediction of animal transcription factors. Nucleic Acids Res. 47, D33–D38 (2019).
14. Huynh-Thu, V. A., Irrthum, A., Wehenkel, L. & Geurts, P. Inferring regulatory networks from expression data using tree-based methods. PLoS ONE 5, e12776 (2010).
15. Liting, X., Gerstein, R., Socolovsky, M. & Castilla, L. H. Deletion of core binding factors Runx1 and Runx2 leads to perturbed hematopoiesis in multiple lineages. Blood 122, 46 (2013).
16. Komorowska, K. et al. Hepatic leukemia factor maintains quiescence of hematopoietic stem cells and protects the stem cell pool during regeneration. Cell Rep. 21, 3514–3523 (2017).
17. Paul, F. et al. Transcriptional heterogeneity and lineage commitment in myeloid progenitors. Cell 163, 1663–1677 (2015).
18. Doi, Y. et al. SATB1 expression marks lymphoid-lineage biased hematopoietic stem cells in mouse bone marrow. Blood 126, 2356 (2015).
19. Jones, C. L. et al. ETV6 regulates Pax5 expression in early B cell development. Blood 128, 2655 (2016).
20. Haber, A. L. et al. A single-cell survey of the small intestinal epithelium. Nature 551, 333–339 (2017).
21. McInnes, L., Healy, J. & Melville, J. UMAP: uniform manifold approximation and projection for dimension reduction. Preprint at arXiv https://arxiv.org/abs/1802.03426v2 (2018).
22. Yu, G. & He, Q.-Y. ReactomePA: an R/Bioconductor package for reactome pathway analysis and visualization. Mol. Biosyst. 12, 477–479 (2016).
23. Aibar, S. et al. SCENIC: single-cell regulatory network inference and clustering. Nat. Methods 14, 1083–1086 (2017).
24. Anders, S. & Huber, W. Differential expression analysis for sequence count data. Genome Biol. 11, R106 (2010).

Acknowledgements

This study was supported by the Max Planck Society, the German Research Foundation (DFG) (grant numbers SPP1937 GR4980/1-1, GR4980/3-1, and GRK2344 MeInBio), by the DFG under Germany’s Excellence Strategy (CIBSS, EXC-2189, Project ID 390939984), by the ERC (818846, ImmuNiche, ERC-2018-COG), and by the Behrens-Weise-Foundation.

Author information

Authors and Affiliations

Max-Planck-Institute of Immunobiology and Epigenetics, Freiburg, Germany
Dominic Grün
Centre for Integrative Biological Signaling Studies, University of Freiburg, Freiburg, Germany
Dominic Grün

Authors

Dominic Grün

Contributions

D.G. conceived and implemented the method and performed the analysis.

Corresponding author

Correspondence to Dominic Grün.

Ethics declarations

Competing interests

The author declares no competing interests.

Additional information

Peer review information Nicole Rusk and Nina Vogt were the primary editors on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Figure 1 Hematopoietic cell type identification by Louvain clustering on knn networks.

a, Scatterplot of variance and mean transcript count in logarithmic space for all genes across all cells in the mouse hematopoietic progenitor dataset. The red line indicates a second order polynomial fit to all genes. The blue line indicates the maximum deviation of the fit towards higher variability based on the error interval of the fitted coefficients. The broken orange line indicates a loess regression. The polynomial fit function is given at the top. b, UMAP representation highlighting clusters obtained by Louvain clustering on the full k-nearest neighbor network. c, Dot plot showing the expression z-score of lineage-specific marker genes across all clusters from (b). The dot size indicates the fraction of cells expressing a gene. d, t-SNE map of clustering output obtained from Seurat⁵,⁶ using high-resolution settings (Methods). e, UMAP representation highlighting clusters obtained with Seurat from (d). f, Alluvial diagram comparing the cluster composition obtained with VarID, RaceID3⁸, and Seurat. g, Evaluation of the resolution of rare populations as a function of α and knn within this dataset. I tested the overlap of inferred clusters with lymphoid progenitors (Dntt), B cells (Ebf1, Cd19), basophils (Lmo4, Ms4a2), eosinophils (Ear10), dendritic cells (Cd74), and megakaryocytes (Mpl, Pf4), based on expression of the corresponding marker genes (in parentheses). For each cluster, I computed the fraction of its cells with positive transcript counts for the respective markers (termed enrichment) and the fraction of all marker-positive cells falling into that cluster (termed overlap). The clustering should maximize both the overlap and the enrichment. If a cluster perfectly recapitulates the marker expression domain, both values equal one.
The heatmap shows the maximum of the product of overlap and enrichment across all clusters averaged across all marker genes as a function of the parameters, and supports α=10 and knn=10 as an optimal parameter choice. Smaller values for knn would lead to higher variances of the variability estimates. h, The same analysis as in (g) was performed on a subset of parameters, either using a supplied distance matrix (1 – Pearson’s correlation coefficient) or the default method (Euclidean distance in PCA-space) for the knn search. The ratio of the overlap*enrichment product between the default and the correlation-based approach is shown in the heatmap and close to one for all parameter combinations. (a-h) Data from n=2 biologically independent experiments.
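
The enrichment and overlap scores described in panel g can be illustrated with a short Python sketch. This is not the VarID implementation; the function names and the toy inputs are hypothetical, and the code only assumes that cluster labels and raw marker transcript counts are available per cell.

```python
from collections import defaultdict


def cluster_marker_score(clusters, marker_counts):
    """Maximum of enrichment * overlap across clusters for one marker gene.

    clusters      -- cluster label of each cell (hypothetical input)
    marker_counts -- transcript count of the marker gene in each cell
    """
    cells_per_cluster = defaultdict(int)
    positive_per_cluster = defaultdict(int)
    for label, count in zip(clusters, marker_counts):
        cells_per_cluster[label] += 1
        if count > 0:
            positive_per_cluster[label] += 1

    total_positive = sum(positive_per_cluster.values())
    if total_positive == 0:
        return 0.0  # marker not detected in any cell

    best = 0.0
    for label, n_cells in cells_per_cluster.items():
        n_pos = positive_per_cluster[label]
        enrichment = n_pos / n_cells      # marker-positive fraction of the cluster
        overlap = n_pos / total_positive  # share of all marker-positive cells
        best = max(best, enrichment * overlap)
    return best


def mean_score(clusters, counts_by_marker):
    """Average the per-marker maximum over all marker genes."""
    scores = [cluster_marker_score(clusters, c) for c in counts_by_marker.values()]
    return sum(scores) / len(scores)


# Toy data: cluster "B" coincides exactly with the domain of the first marker.
clusters = ["A", "A", "B", "B", "B"]
perfect = [0, 0, 3, 1, 2]
partial = [0, 1, 3, 0, 2]
print(cluster_marker_score(clusters, perfect))
print(cluster_marker_score(clusters, partial))
```

A score of one is reached only when a single cluster contains all marker-positive cells and no marker-negative cells, which matches the statement that both values equal one for a perfect match.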

Supplementary Figure 2 Exploring local gene expression variability in hematopoietic progenitors.

a, Gene-specific parameter fits from the negative binomial generalized linear model with log link function and total transcript count of a cell as independent variable are shown in a scatterplot as a function of the mean expression. Robust parameter fits for the coefficient β₁, size factor θ, and intercept β₀ are obtained by a loess-regression of the parameter fits as a function of mean expression in order to share information between genes of similar expression (broken orange line). This method follows a recently published approach¹². b, Scatter plot of the variance of Pearson residuals from the generalized linear model fit as a function of the mean transcript expression in logarithmic space. The broken orange line represents a loess regression. c, Heatmap of normalized expression (left) and corrected variance (right) for the top 50 genes with enhanced variability from Figure 2d ordered by decreasing log₂-foldchange of variability between cluster 16 and the remaining cells. Clusters were manually grouped by lineage. Hierarchical clustering of rows was performed based on gene expression. d, Heatmap of normalized expression (left) and corrected variance (right) for all transcription factor genes with enhanced variability ordered by decreasing log₂-foldchange of variability between cluster 16 and the remaining cells (one-sided Wilcoxon rank sum-test P<0.001, Benjamini Hochberg corrected, log₂-foldchange >1.25). Clusters were manually grouped by lineage. Hierarchical clustering of rows was performed based on gene expression. (a-c) Data from n=2 biologically independent experiments.
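
To make the Pearson residuals in panel b concrete, the following Python sketch computes them for a single gene from already fitted parameters β₀, β₁, and θ. It is a sketch under stated assumptions rather than the method's code: the covariate is taken to be log10 of the total transcript count, as in the cited approach¹², and the function names are hypothetical.

```python
import math


def pearson_residual(count, total_count, beta0, beta1, theta):
    """Pearson residual of one observed count under a negative binomial GLM.

    Assumed model (log link): log(mu) = beta0 + beta1 * log10(total_count).
    The log10 transform of the covariate is an assumption of this sketch.
    """
    mu = math.exp(beta0 + beta1 * math.log10(total_count))
    variance = mu + mu ** 2 / theta  # negative binomial variance
    return (count - mu) / math.sqrt(variance)


def residual_variance(counts, total_counts, beta0, beta1, theta):
    """Sample variance of the Pearson residuals of one gene across cells."""
    residuals = [
        pearson_residual(x, n, beta0, beta1, theta)
        for x, n in zip(counts, total_counts)
    ]
    mean = sum(residuals) / len(residuals)
    return sum((r - mean) ** 2 for r in residuals) / (len(residuals) - 1)


# Toy example with made-up parameters and counts for one gene.
counts = [0, 2, 1, 7]
totals = [1500, 3000, 2200, 4100]
print(residual_variance(counts, totals, beta0=-6.0, beta1=2.0, theta=5.0))
```

If the model describes a gene well, the residual variance is expected to be close to one, so values well above one point to variability beyond what the fitted model accounts for.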

Supplementary Figure 3 Sensitivity and specificity for the identification of genes with enhanced variability.

a, Three populations were simulated with 500 cells each. For population 1, the mean expression of all genes was set equal to the values measured in the hematopoietic dataset¹⁰. Gene expression was sampled from negative binomial distributions with size factors determined from the mean-variance relation in Figure 2a. Population 2 was generated from the same mean expression values after 5-fold up- or down-regulation of 100 genes each. Population 3 was simulated accordingly after 5-fold up- or down-regulation of 100 genes from population 2. For population 2 the variance of 50 genes taken from Fig. 2a was increased two-fold to simulate enhanced variability. The t-SNE map depicts the three populations resolved by VarID into three clusters. b, The plot shows the genes (n=50) with simulated noise differences ordered by average expression. If differentially variable genes are called by VarID in population 2 versus 1 and 3 with a fold change cut-off of >1.25 and one-sided Wilcoxon test P<0.001, the true positive rate is 52% at a false positive rate of 5%. The true positives are highlighted in red. Applying an average expression cut-off of >0.5 increases the true positive rate to 1 at a false positive rate of 4%. The solid black line indicates an average expression of 0.4. I note that a cut-off on the variability fold change is required to control the false positive rate: significant differences in variability can be induced by a few tail events, i.e. cells with positive transcript counts for a lowly expressed gene, because these events affect a larger number of neighborhoods (determined by knn). The unconstrained false positive rate is ~23%. I thus recommend applying a fold change cut-off of >1.25, which I use throughout the manuscript. c, To test the dependence of sensitivity and specificity on the number of cells I varied the size of population 2 between 20 and 1,000 cells.
The plot shows the true positive rate (solid lines) and false positive rate (broken line) as a function of the size of population 2. Rates were computed without filtering, after applying a variability fold change cut-off (FC>1.25) and after applying an additional average expression cut-off (EXP>0.5). While rates saturate beyond a population size of ~200, sensitivity drops at small population sizes. For 50 cells, I observed a true positive rate of 32% at a false positive rate of 7% (64% and 7%, respectively, at an average expression cut-off of >0.5) with a fold change cut-off >1.25.
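
Two ingredients of this benchmark can be written down compactly: the negative binomial variance, mean + mean²/θ, and the true and false positive rates of a gene call. The Python sketch below shows one way to choose a size parameter that doubles the variance at fixed mean and how the rates are computed. How the variance was actually inflated in the simulation is not stated in the legend, so this is an assumed approach, and all names and numbers are hypothetical.

```python
def nb_variance(mu, theta):
    """Variance of a negative binomial with mean mu and size parameter theta."""
    return mu + mu ** 2 / theta


def theta_for_variance(mu, target_variance):
    """Size parameter that yields target_variance at mean mu."""
    if target_variance <= mu:
        raise ValueError("negative binomial variance must exceed the mean")
    return mu ** 2 / (target_variance - mu)


def rates(called, truly_variable, all_genes):
    """True and false positive rates of a set of called genes."""
    called, truly_variable = set(called), set(truly_variable)
    negatives = set(all_genes) - truly_variable
    tpr = len(called & truly_variable) / len(truly_variable)
    fpr = len(called & negatives) / len(negatives)
    return tpr, fpr


# Doubling the variance of a gene with mean 2 and size parameter 4.
mu, theta = 2.0, 4.0
base_var = nb_variance(mu, theta)                    # 2 + 4/4 = 3
theta_noisy = theta_for_variance(mu, 2 * base_var)   # 4 / (6 - 2) = 1

# Toy gene call: 4 truly variable genes out of 10, 3 genes called.
all_genes = [f"g{i}" for i in range(10)]
truly_variable = all_genes[:4]
called = ["g0", "g1", "g5"]
tpr, fpr = rates(called, truly_variable, all_genes)
print(base_var, theta_noisy, tpr, fpr)
```

Because the variance can never fall below the mean in this model, a two-fold increase always corresponds to a smaller size parameter, i.e. stronger overdispersion.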

Supplementary Figure 4 Characterization of co-expressed and co-varying genes during neutrophil differentiation.

a, Self-organizing map (SOM) of pseudo-temporal gene expression profiles inferred by FateID⁸. The color indicates the z-score of loess-smoothed profiles. Cells were ordered along the trajectory connecting clusters 5, 4, 3, 7, 1, and 2 in (Fig. 3a) by StemID2. Original clusters (cf. Fig. 1b) are highlighted at the bottom. Modules were obtained by grouping SOM nodes based on correlation (Pearson correlation > 0.85). Only modules with >10 genes are shown in the map. Genes with >2 transcripts in at least one cell were included. Data from n=2 biologically independent experiments. b, Reactome pathway analysis²² revealing enriched pathways in module 2 (n=112 genes) and module 3 (n=55 genes) (hypergeometric test P<0.05, Methods) of SOM in Fig. 3b. c, Reactome pathway analysis revealing enriched pathways in module 14 (n=45 genes, hypergeometric test P<0.05, Methods) of (a). (b,c) The x-axis shows the number of genes of a particular pathway present in the module. The gene universe comprised n=3,439 expressed genes.
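
The hypergeometric test behind the pathway analysis in panels b and c asks how likely it is to find at least the observed number of pathway genes in a module drawn at random from the gene universe. The Python sketch below computes that unadjusted tail probability with the standard library only. The universe size (3,439) and the module size (112) come from the legend; the pathway size and the overlap are made-up illustration values, and the function name is hypothetical. It does not reproduce the ReactomePA workflow, which also handles annotation and multiple testing.

```python
from math import comb


def hypergeom_pvalue(overlap, module_size, pathway_size, universe_size):
    """P(X >= overlap) when module_size genes are drawn without replacement
    from a universe that contains pathway_size pathway genes."""
    upper = min(module_size, pathway_size)
    total = comb(universe_size, module_size)
    tail = sum(
        comb(pathway_size, k) * comb(universe_size - pathway_size, module_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Module 2 has 112 genes in a universe of 3,439 expressed genes (from the legend).
# A pathway of 100 genes with 10 of them in the module is a hypothetical example.
p = hypergeom_pvalue(overlap=10, module_size=112, pathway_size=100, universe_size=3439)
print(p)
```

Exact integer arithmetic via `math.comb` avoids the rounding problems that factorial-based formulas run into for universes of this size.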

Supplementary Figure 5 EPO-stimulation of murine bone marrow cells leads to variable expression of innate immune genes in erythrocyte progenitors.

a, UMAP representation of combined EPO-stimulated and normal mouse hematopoietic progenitor single-cell RNA-seq data¹⁰ highlighting clusters inferred by Louvain clustering on the pruned knn network (knn=10 and α=10). b, UMAP representation indicating the sample of origin for each single-cell transcriptome. Only for the erythrocyte progenitor branch a separation of the samples is observed. c, UMAP highlighting expression of the erythrocyte progenitor marker gene Gata1. d, Heatmap of normalized expression (left) and corrected variance (right) for the top 50 genes with enhanced variability in cluster 17 ordered by decreasing log₂-foldchange of variability between cluster 17 and 15 (one-sided Wilcoxon rank sum-test P<0.001, Benjamini Hochberg corrected, foldchange >1.25). Clusters were manually grouped by lineage. Hierarchical clustering of rows was performed based on gene expression. Pathway enrichment analysis revealed that 39 out of 170 differentially variable genes were annotated within the pathway “Innate Immune System” (hypergeometric test, adjusted P=0.002, Methods). e, Venn diagram showing the overlap of genes with enhanced local variability (one-sided Wilcoxon rank sum-test P<0.001, Benjamini Hochberg corrected, foldchange >1.25) and differentially expressed genes (P<0.001, Benjamini Hochberg corrected, see Methods, foldchange >1.25 between the populations) in cluster 17 versus 15. (a-e) Data from n=2 biologically independent experiments.
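
The gene selection used in panels d and e combines a Benjamini Hochberg corrected P value with a fold change cut-off. The Python sketch below illustrates that filtering step on precomputed per-gene P values. The one-sided Wilcoxon test itself is not implemented here, the cut-off is applied to the plain fold change as written in this legend, and the gene names and values are invented for illustration.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini Hochberg adjusted P values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_genes(pvalues, fold_changes, p_cutoff=0.001, fc_cutoff=1.25):
    """Genes passing both the adjusted P value and the fold change cut-off.

    pvalues, fold_changes -- dicts keyed by gene name (hypothetical inputs)
    """
    genes = list(pvalues)
    adjusted = benjamini_hochberg([pvalues[g] for g in genes])
    return [
        g for g, p_adj in zip(genes, adjusted)
        if p_adj < p_cutoff and fold_changes[g] > fc_cutoff
    ]


# Invented example: geneB is significant but fails the fold change cut-off.
pvals = {"geneA": 1e-6, "geneB": 1e-6, "geneC": 0.5}
fcs = {"geneA": 2.0, "geneB": 1.1, "geneC": 3.0}
print(select_genes(pvals, fcs))
```

Requiring both criteria reflects the reasoning given for the simulation in Supplementary Figure 3, where the fold change cut-off is needed to keep the false positive rate under control.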

Supplementary Figure 6 Intestinal stem cells exhibit stochastic expression of secretory lineage transcription factors.

a, UMAP representation of mouse intestinal epithelial single-cell RNA-seq data²⁰ highlighting clusters inferred by Louvain clustering on the pruned knn network (k=10). Cell type labels are based on marker gene expression. b, Dot plot showing the expression z-score of lineage-specific marker genes across all clusters from (a). The dot size indicates the fraction of cells expressing a gene. c, UMAP representation with links connecting cluster medoids. The thickness and color of a link indicate the transition probability between the connected clusters. d, Scatterplot showing corrected variance of transcript counts as a function of the mean in logarithmic space after eliminating the mean-dependence by subtracting the baseline fit. The red line indicates the baseline level of the corrected variability. e, Scatter plot of the variance of Pearson residuals from a negative binomial generalized linear model with log link function and the total transcript count of a cell as independent variable as a function of the mean transcript expression in logarithmic space. The broken orange line represents a loess regression. Highly variable outliers at low and high expression are not visible, since the plot shows a zoom-in to increase visibility. f, Venn diagram showing the overlap of genes with enhanced local variability in cluster 10 versus the remaining cells as predicted after correcting the variance or computing the variance of Pearson residuals (one-sided Wilcoxon rank sum-test P<0.001, Benjamini Hochberg corrected, foldchange >1.25). g, Heatmap of normalized expression (left) and corrected variance (right) for the top 50 genes with enhanced variability ordered by decreasing log₂-foldchange of variability between cluster 10 and the remaining cells (one-sided Wilcoxon rank sum-test P<0.001, Benjamini Hochberg corrected, see Methods, log₂-foldchange >1.25). Clusters were manually grouped by lineage. Hierarchical clustering of rows was performed based on gene expression.
h, Venn diagram showing the overlap of genes with enhanced local variability (one-sided Wilcoxon rank sum-test P<0.001, Benjamini Hochberg corrected, foldchange >1.25) and differentially expressed genes (P<0.001, Benjamini Hochberg corrected, see Methods, foldchange >1.25 between the populations) in cluster 10 versus the remaining cells. i, Gene regulatory network predicted by GENIE3 run on all transcription factors among the genes with enhanced variability, using the full dataset as input. j, UMAP representation highlighting corrected variability (upper panel) and normalized gene expression (lower panel) for Tox3. k, UMAP representation highlighting corrected variability (upper panel) and normalized gene expression (lower panel) for Hopx.

Supplementary Figure 7 Intestinal cell type identification by Louvain clustering on knn networks.

a, UMAP representation highlighting clusters obtained by Louvain clustering on the full knn network. b, Dot plot showing the expression z-score of lineage-specific marker genes across all clusters from (a). The dot size indicates the fraction of cells expressing the gene. c, t-SNE map of clustering output obtained from Seurat⁵,⁶ using high-resolution settings (Methods). d, Dot plot showing the expression z-score of lineage-specific marker genes across all Seurat clusters from (c). (a-d) Data from n=4 animals.

Supplementary Figure 8 Exploring local variability in intestinal epithelial stem cells.

a, Scatterplot showing variance and mean of the transcript count of all genes across all cells in the mouse intestinal dataset in logarithmic space. The red line indicates a second order polynomial fit to the baseline level of the variance comprising technical and biological variability. (b-d) Gene-specific parameter fits from the negative binomial generalized linear model with log link function and total transcript count of a cell as independent variable are shown in a scatterplot as a function of the mean expression. Robust parameter fits for the intercept β₀ (b), the size factor θ (c), and coefficient β₁ (d) are obtained by a loess-regression of the parameter fits as a function of mean expression in order to share information between genes of similar expression (broken orange line). This method follows a recently published approach¹². e, Heatmap of normalized expression (left) and corrected variance (right) for all transcription factor genes with enhanced variability ordered by decreasing log₂-foldchange of variability between cluster 10 and the remaining cells (one-sided Wilcoxon rank sum-test P<0.001, Benjamini Hochberg corrected, log₂-foldchange >1.25). Clusters were manually grouped by lineage. Hierarchical clustering of rows was performed based on gene expression. f, UMAP representation highlighting corrected variability (upper panel) and normalized gene expression (lower panel) for Foxa3. (a-f) Data from n=4 animals.

About this article

Cite this article

Grün, D. Revealing dynamics of gene expression variability in cell state space. Nat Methods 17, 45–49 (2020). https://doi.org/10.1038/s41592-019-0632-3

Received: 11 June 2019
Accepted: 08 October 2019
Published: 18 November 2019
Issue Date: January 2020
DOI: https://doi.org/10.1038/s41592-019-0632-3
