Abstract
Rapid advancements in single-cell RNA-sequencing (scRNA-seq) technologies have revealed the rich set of attributes that make up cell identity, such as cell type, organ-of-origin, and developmental stage. However, due to the large scale of the data, obtaining an interpretable compressed representation of cellular states remains a computational challenge. To address this, we introduce bioIB, a method based on the Information Bottleneck algorithm, designed to extract an optimal compressed representation of scRNA-seq data with respect to a desired biological signal, such as cell type or disease state. BioIB generates a hierarchy of weighted gene clusters, termed metagenes, that maximize the information regarding the signal of interest. Applying bioIB to a scRNA-seq atlas of differentiating macrophages, with either the organ-of-origin or the developmental stage set as the signal of interest, yielded two distinct signal-specific sets of metagenes that captured the attributes of the respective signal. BioIB’s representation can also be used to expose specific cellular subpopulations: for example, when applied to a single-nucleus RNA-sequencing dataset of an Alzheimer’s Disease mouse model, it identified a subpopulation of disease-associated astrocytes. Lastly, the hierarchical structure of metagenes revealed interconnections between the corresponding biological processes and cellular populations. We demonstrate this on hematopoiesis scRNA-seq data, where the metagene hierarchy reflects the developmental hierarchy of hematopoietic cell types.
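For orientation, the generic Information Bottleneck objective on which methods of this kind are built can be written as follows. This is the standard textbook form and not necessarily the exact formulation used by bioIB; the symbols $X$ (genes), $\hat{X}$ (metagenes), $Y$ (signal of interest), and $\beta$ (trade-off parameter) are notational assumptions introduced here for illustration.

```latex
\[
  \min_{p(\hat{x}\mid x)} \; I(X;\hat{X}) \;-\; \beta\, I(\hat{X};Y)
\]
```

Under this reading, the first term rewards compression of genes into metagenes, while the second rewards retaining information about the signal of interest. A small $\beta$ favors few, coarse metagenes and a large $\beta$ preserves finer detail about $Y$, so varying $\beta$ would be one plausible way to obtain representations at different resolutions.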
Significance
Single-cell gene expression represents an invaluable resource, encoding multiple aspects of cellular identity. However, its high complexity poses a challenge for downstream analyses. We introduce bioIB, a methodology based on the Information Bottleneck, that compresses data while maximizing the information about a biological signal-of-interest, such as disease state. bioIB generates a hierarchy of metagenes, probabilistic gene clusters, which compress the data at gradually changing resolutions, exposing signal-related processes and informative connections between gene programs and their corresponding cellular populations. Across diverse single-cell datasets, bioIB generates distinct metagene representations of the same dataset, each maximally informative relative to a different signal; uncovers signal-associated cellular populations; and produces a metagene hierarchy that reflects the developmental hierarchy of the underlying cell types.
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
This version includes an updated description of the model. We explain how to estimate the relevant distributions from the input data. Figure 1 is adapted accordingly.