## Abstract

Rapid advancements in single-cell RNA-sequencing (scRNA-seq) technologies have revealed the richness of the myriad attributes encompassing cell identity, such as diversity of cell types, organ-of-origin, or developmental stage. However, due to the large scale of the data, obtaining an interpretable compressed representation of cellular states remains a computational challenge. For this task we introduce bioIB, a method based on the Information Bottleneck algorithm, designed to extract an optimal compressed representation of scRNA-seq data with respect to a desired biological signal, such as cell type or disease state. BioIB generates a hierarchy of weighted gene clusters, termed metagenes, that maximize the information regarding the signal of interest. Applying bioIB to a scRNA-seq atlas of differentiating macrophages and setting either the organ-of-origin or the developmental stage as the signal of interest provided two distinct signal-specific sets of metagenes that captured the attributes of the respective signal. BioIB’s representation can also be used to expose specific cellular subpopulations: for example, when applied to a single-nucleus RNA-sequencing dataset of an Alzheimer’s Disease mouse model, it identified a subpopulation of disease-associated astrocytes. Lastly, the hierarchical structure of metagenes revealed interconnections between the corresponding biological processes and cellular populations. We demonstrate this on hematopoiesis scRNA-seq data, where the metagene hierarchy reflects the developmental hierarchy of hematopoietic cell types.

**Significance** Single-cell gene expression represents an invaluable resource, encoding multiple aspects of cellular identity. However, its high complexity poses a challenge for downstream analyses. We introduce bioIB, a methodology based on the Information Bottleneck, that compresses data while maximizing the information about a biological signal-of-interest, such as disease state. bioIB generates a hierarchy of metagenes, probabilistic gene clusters, which compress the data at gradually changing resolutions, exposing signal-related processes and informative connections between gene programs and their corresponding cellular populations. Across diverse single-cell datasets, bioIB generates distinct metagene representations of the same dataset, each maximally informative relative to a different signal; uncovers signal-associated cellular populations; and produces a metagene hierarchy that reflects the developmental hierarchy of the underlying cell types.

## Introduction

Cellular gene expression profiles encapsulate a wealth of information regarding a cell’s identity, defined by a variety of biological factors, such as cell type, disease state, and developmental stage. Single-cell RNA-sequencing (scRNA-seq) technologies, quantifying gene expression levels at single-cell resolution, are invaluable for revealing these facets, allowing the study of the different factors encompassing a cell’s identity^{1}. However, exposing such factors poses a computational challenge due to the complexity and high dimensionality of scRNA-seq data. Datasets typically comprise expression profiles of thousands of genes across thousands to hundreds of thousands of cells, yet any reduction in dimensionality will naturally result in loss of information^{2}. Specifically, when aiming to uncover factors associated with a specific biological signal (e.g. gene programs associated with disease progression), the challenge can be framed as a trade-off between reducing the complexity of the data and retaining as much information as possible regarding the signal of interest.

The Information Bottleneck (IB) theory^{3} allows us to reason mathematically about this trade-off. Given a dataset (e.g. scRNA-seq measurements) and a variable of interest encoded in the data (e.g. healthy vs. disease samples), IB provides a reduced data representation which is maximally informative about the selected variable^{3,4}. Since it was first introduced, IB has been successfully applied in diverse fields, such as text clustering^{5}, image analysis^{6,7}, language processing^{8}, neuroscience^{9} and computational biology^{10–12}. Here, we present bioIB, a single-cell tailored method based on the IB algorithm, providing a compressed, signal-informative representation of single-cell data. The compressed representation is given by metagenes, which are probabilistic clusters of genes. The probabilistic construction preserves gene-level biological interpretability, allowing characterization of each metagene.

By considering the trade-off, and focusing on the signal of interest, bioIB differs from previous approaches suggested for data compression, namely dimensionality reduction of single-cell data (e.g. PCA^{13,14}, ICA^{15}, NMF^{16}), which are typically applied to the data without any constraints and produce factors with limited biological interpretability. Furthermore, the interpretability of bioIB’s output is an advantage over deep learning methods which obtain non-linear latent representations of the data (e.g. scVI^{17}, scDeepCluster^{18}, SAUCIE^{19}, scGNN^{20}). Finally, existing methods aimed at extracting interpretable factors from scRNA-seq data (e.g. f-scLVM^{21}, net-NMFsc^{22}, Spectra^{23}) rely on prior knowledge in the form of known molecular pathways or gene interactions which regulate the desired signal of interest. In contrast, bioIB is applicable in the setting where such prior knowledge is not necessarily available, but the labels with respect to the signal of interest are known. Specifically, bioIB elucidates the relevant gene programs de novo, guided by the cellular labels associated with the signal of interest (e.g., ‘Wildtype’ and ‘Mutant’ for the ‘Genotype’ signal).

In addition to achieving optimal signal-aware clustering of genes via metagenes, bioIB generates a hierarchy of these metagenes, reflecting the inherent data structure relative to the signal of interest. The bioIB hierarchy facilitates the interpretation of metagenes, elucidating their significance in distinguishing between biological labels and illustrating their interrelations with both one another and the underlying cellular populations.

We demonstrate that metagenes generated by bioIB are biologically meaningful, capturing molecular pathways differentially activated between selected cell groups. First, we consider a scRNA-seq atlas of differentiating macrophages. By using either organ-of-origin or developmental stage as signals of interest, we show that bioIB extracts distinct, signal-specific metagene hierarchies and associated biological processes. Next, we demonstrate how bioIB can be used to identify a cellular subpopulation of disease-associated astrocytes in a single-nucleus RNA-seq (snRNA-seq) dataset from murine Alzheimer’s Disease models. Finally, we show that the bioIB metagene hierarchy for a dataset of differentiating hematopoietic cell types reflects the developmental hierarchy of the corresponding cellular populations. bioIB is available as an open-source software package, along with documentation and tutorials (https://github.com/nitzanlab/bioIB).

## Results

### bioIB elucidates signal-specific metagenes and their structure

The bioIB representation is computed for a given dataset and signal of interest, provided as cell labels. The representation is composed of metagenes, which are probabilistic aggregations of genes into clusters, representing the major patterns of gene expression variation underlying the labeled signal.

The input to bioIB includes a count matrix *D* ∈ *R*^{N×G} of N cells by G genes, and a vector of cell labels related to the signal of interest, *S* ∈ *R*^{N×1}, where, for example, each cell is labeled as sampled from either a healthy or diseased population (Methods; Figure 1A). This input is used to estimate the distributions required for the bioIB algorithm. We thus define three categorical random variables, *C* ∼*Cat*({*c*_{1}, …, *c*_{N}}), *X* ∼*Cat*({*x*_{1}, …, *x*_{G}}), *Y* ∼*Cat*({*y*_{1}, …, *y*_{K}}), respectively representing the *N* cells, *G* genes and *K* cell states of interest. Normalizing the input matrix *D* by the total number of counts, we obtain a joint probability distribution *p*(*c, x*). Next, summing *p*(*c, x*) across the cells, we obtain *p*(*x*), such that an entry [*p*(*x*)]_{i} represents the marginal probability of sampling the transcript of gene *x*_{i}. Using Bayes’ theorem, we obtain the conditional probability:

$$p(c \mid x) = \frac{p(c, x)}{p(x)}$$

Here, an entry [*p*(*c*|*x*)]_{ij} represents the probability of sampling the cell *c*_{i} out of all measured cells in *D*, given that we observed gene *x*_{j}.

The provided cellular annotation vector *S* ∈ *R*^{N×1} allows us to define the conditional distribution of *Y* (representing the *K* cell states of interest) given that we observed a cell in *D*. By definition, *p*(*y*|*c*) is an indicator function defined by *S*; namely, for a cell *c*_{i}, *p*(*y*|*c*_{i}) = 1 if *S*_{i} = *y* and zero otherwise:

$$p(y \mid c_i) = \begin{cases} 1, & S_i = y \\ 0, & \text{otherwise} \end{cases}$$

Finally, we obtain the conditional distribution of cell states of interest given that we observed a certain gene in *D*:

$$p(y \mid x) = \sum_{c} p(y \mid c)\, p(c \mid x)$$
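The construction of *p*(*x*) and *p*(*y*|*x*) from *D* and *S* can be sketched in a few lines. The following is a minimal illustration in plain Python, not the bioIB package code: the function name `probabilistic_representation` and the toy matrix are hypothetical, and the sketch assumes every gene has a non-zero total count so that *p*(*x*) > 0.

```python
def probabilistic_representation(D, S):
    """Return (p_x, p_y_given_x, states) from a cells-by-genes matrix D and labels S.

    D: list of N rows (cells), each a list of G non-negative values (genes).
    S: list of N labels, one per cell.
    Assumes every gene has a non-zero total, so that p(x) > 0.
    """
    n_cells, n_genes = len(D), len(D[0])
    total = sum(sum(row) for row in D)
    # Joint distribution p(c, x): normalise by the total number of counts.
    p_cx = [[value / total for value in row] for row in D]
    # Marginal p(x): sum the joint over cells.
    p_x = [sum(p_cx[i][j] for i in range(n_cells)) for j in range(n_genes)]
    states = sorted(set(S))
    # p(y|x) = sum_c p(y|c) p(c|x). Since p(y|c) is an indicator, only cells
    # labelled y contribute, and p(c|x) = p(c, x) / p(x).
    p_y_given_x = [
        [sum(p_cx[i][j] for i in range(n_cells) if S[i] == y) / p_x[j]
         for j in range(n_genes)]
        for y in states
    ]
    return p_x, p_y_given_x, states


# Toy input (hypothetical): three cells, three genes, two cell states.
D = [[4, 0, 1],
     [2, 0, 1],
     [0, 3, 1]]
S = ["healthy", "healthy", "disease"]
p_x, p_y_given_x, states = probabilistic_representation(D, S)
```

The returned `p_y_given_x` has one row per cell state and one column per gene, matching the *K* × *G* shape used as input to the IB step.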

The conditional probability matrix of cell states given the genes, *p*(*y*|*x*), and the gene probability vector, *p*(*x*), are used as input to the core of the bioIB method, the IB algorithm. The IB yields the optimal probabilistic mapping from the genes’ random variable, *X*, to the categorical random variable representing the metagenes, $\hat{X} \sim Cat(\{\hat{x}_1, \ldots, \hat{x}_M\})$ (for *M* ≤ *G*). The mapping is optimal with respect to the trade-off between compression and information about the signal of interest *Y*, according to a given trade-off parameter *β* (Figure 1C). This is achieved by optimizing for the mapping $p(\hat{x} \mid x)$ that minimizes the mutual information with the input genes *X*, $I(X; \hat{X})$, while maximizing the mutual information with *Y*, $I(\hat{X}; Y)$ (Methods; Figure 1C):

$$\min_{p(\hat{x} \mid x)} \; I(X; \hat{X}) - \beta\, I(\hat{X}; Y)$$

The resulting metagenes are probabilistic clusters of genes capturing the shared expression patterns amongst cell states relative to *Y* (Figure 1D). The number of metagenes is roughly determined by the trade-off parameter *β*, ranging from the original representation (no compression, *β* → ∞) to full compression to a single cluster (*β* = *0*). A hierarchy of metagenes is obtained by gradually decreasing *β* through a reverse-annealing process^{4} (Methods). The probabilistic output mapping, $p(y \mid \hat{x})$, reflects the amount of information each metagene holds regarding the different labels, whereas the hierarchical structure reveals the interdependence between the metagenes, and the underlying cellular populations they correspond to (Figure 1E).

As an illustrative example, we construct a toy dataset composed of cells belonging to one of two cell types, which act as the signal of interest Y (Supplementary Figure 1A-D). The bioIB hierarchy is revealed by plotting the conditional probabilities of a particular label given every metagene, across *β* values that define the compression level (Supplementary Figure 1C-D). The hierarchical structure reflects the interconnections among the metagenes and the specified cell types of interest (*Y*), while the bifurcation order is dictated by the informativity of the generated metagenes relative to *Y*.

bioIB can also capture the relationships between related cell types, defined as distinct labels of interest (*Y*). Given a toy model with four related cell types, the bioIB hierarchy reflects the two distinct pairs of linked cell types by two branches. Further splits correspond to higher-resolution separation to different cell types, eventually resulting in cell type-specific metagenes (Supplementary Figure 1E-G).

### bioIB extracts distinct molecular signatures in macrophages for developmental stage and organ residence across development

Gene expression data in scRNA-seq experiments contain signatures associated with multiple overlapping biological signals or conditions. How can we identify gene signatures associated with a specific source of heterogeneity in the data? We demonstrate bioIB’s approach to this challenge in the context of a scRNA-seq atlas of the developing immune system, which contains cells from 9 organs spanning weeks 4 to 17 after conception^{24} (Figure 2A). Applied to this dataset, bioIB identified gene signatures associated with either the organ-of-origin or with developmental stage. For each signal of interest (either organ-of-origin or developmental stage), bioIB clustered the relevant genes into co-expressed metagenes, thus compressing the data, while preserving the information about the selected signal of interest.

We focused on the macrophages subset of cells, as their gene expression varies across organs and throughout the gestation stages^{24} (Supplementary Figure 2A). We first applied bioIB with *Y* set to be the developmental stage, after aggregating cells to ‘Early’ (8-12 gestational weeks) and ‘Late’ (>14 gestational weeks) labels. The resulting bioIB representation, compressed into four metagenes, exposed the selected signal of developmental stage (Supplementary Tables 1, 2). Qualitatively, cells cluster according to ‘Early’ and ‘Late’ developmental stages in the compressed representation (Figure 2B). Evaluating the corresponding clustering quantitatively we find that the Normalized Mutual Information (NMI) score following bioIB is increased relative to the original data, specifically compared to the developmental stage labels, and not the organ labels (Fold change increase in NMI score: stage labels = 13.8; organ labels = 0.76; Supplementary Table 5). Moreover, bioIB metagenes outperform baseline clustering methods in predicting the gestational week of the cells with an average success rate improvement of 12%, indicating that bioIB preserves more information about the signal of interest (Figure 2D; Supplementary Figure 3A; Supplementary Table 6; Methods). Applying bioIB to the same data, but now setting the signal of interest *Y* as the organ-of-origin, resulted in a different compressed representation, informative with respect to the organ-of-origin (Supplementary Tables 3, 4), as the cells in their compressed metagene representation cluster according to their organ-of-origin (Figure 2C). bioIB with *Y* set to organ improves the NMI scores both for organ clustering and for developmental stage clustering (Fold change increase in NMI score: stage labels = 2.2; organ labels = 1.5; Supplementary Table 5). 
This is explained by the non-uniform distribution of cells from different organs across developmental stages (Supplementary Figure 2B), such that the information about the cell’s organ-of-origin is also highly predictive of its developmental stage. Again, bioIB metagenes outperform baselines in signal label prediction with an average success rate improvement of 12% (Figure 2E; Supplementary Figure 3B; Supplementary Table 7; Methods).

Following the validation of the representation we turn to examine the hierarchical structure and inter-connections of metagenes exposed by the gradual compression of data by bioIB reverse-annealing (Methods).

When *Y* is set as the gestational stage, as *β* increases, the first bifurcation represents a metagene that is specific to late gestational stages (Metagene 0; Figure 2F, Supplementary Figure 4A). This metagene is statistically enriched for immune processes based on Gene Ontology analysis^{25} (Figure 2G) and contains genes involved in the development of adaptive immune response, such as *TRBV7-2* and *IGL1-40*, that are specifically active in later gestation stages across organs^{24} (Supplementary Figure 4B). The next bifurcation produces a more general metagene 1, statistically enriched for multiple biological processes including immune function (Supplementary Figure 4C). Relative to metagene 0, metagene 1 is less informative for the differentiation between early and late stages, and is therefore generated second by bioIB (Supplementary Figure 4D). Metagene 2 contains yolk sac markers (e.g. *TTR, CGA*), in accordance with yolk sac being the only organ that is present exclusively at the earliest gestational stages (Supplementary Figure 4B).

Next, we validate that, as desired, metagenes generated with *Y* being the organ-of-origin reflect organ-specific transcriptomic signatures and that the bioIB hierarchy reflects the developmental connections between the macrophagic populations from different organs (Figure 2H, I, Supplementary Figure 4E). For example, the branch uniting metagenes 0-1, associated with yolk sac, and metagene 2, characterizing liver, is consistent with the reported developmental migration of the macrophages from the yolk sac to the liver^{24}. Another example is the branch consisting of metagenes 3 and 4 that mirrors macrophage transition from the liver to the spleen^{24}. Together, bioIB extracts distinct, signal-specific hierarchies of metagenes.

### bioIB metagenes identify Alzheimer’s Disease associated astrocytes

A key challenge in scRNA-seq analysis is to identify specific cellular subpopulations affected by a certain condition, such as disease. The standard pipeline commonly implemented for this task involves unsupervised clustering of cells, which exposes the downstream analysis to clustering-related bias^{28}. BioIB can overcome such limitations and detect disease-associated cells within a heterogeneous cellular population, which we demonstrate in the context of Alzheimer’s disease (AD)-associated astrocytes. To do so, we re-analyzed single-nucleus RNA-seq measurements of astrocytes from an AD mouse model and wild-type (WT) mice^{29} (Figure 3A).

BioIB analysis with the signal of interest set as the genotype (AD/WT) resulted in a hierarchy of six metagenes (Supplementary Tables 9, 10) capturing informative transcriptomic signatures differentiating between AD and WT cells (Figure 3B, Supplementary Figure 5A, Supplementary Table 11). The resulting metagenes allowed for better separation of AD and WT cells, compared to the original data (Fold change increase in NMI score = 3.98; Figure 3C).

Furthermore, bioIB metagenes captured a higher-resolution structure within the data; the main branch of metagenes associated with the AD phenotype is composed of metagenes 0, 1 and 2, each associated in turn with a distinct subpopulation of AD astrocytes (Figure 3D,E). To interpret their biological identities, we extracted a set of representative genes for each metagene (Methods). Metagene 0, whose representative gene set includes genes involved in morphology regulation (*GFAP, THY1, VIM, B2M, PSEN1*), is enriched for the cellular projection development process, consistent with general astrocyte activation^{30} (Supplementary Figure 5D). Metagenes 1 and 2 represent pathways more tightly associated with the disease: the representative gene set of metagene 1 is enriched with immune genes^{31}, such as *C1QA* and *CTSS*, and metagene 2 is represented by established markers of AD pathology, *TYROBP* and *SERPINA3N*^{32,33}. Meta-analysis of the AD-associated transcriptome^{34} revealed that metagenes 1 and 2 are the only metagenes that are exclusively represented by AD-associated genes (Figure 3F; Methods). Characterization of the WT-related metagene 5 can be found in Supplementary Figure 5E.

While metagene 0 is expressed in the majority of AD astrocytes, metagenes 1 and 2 characterize distinct cellular subpopulations among the AD cells (Figure 3D,E), which we hypothesized to correspond to disease-associated astrocytic signatures. To validate our interpretation, we quantified the expression of bioIB metagenes in six astrocytic clusters defined in the original analysis, which included two homeostatic clusters, two GFAP-high clusters of reactive astrocytes which are not specific to the disease, and two disease-associated clusters^{29}. We found that while the bioIB metagene 0 is highly expressed both in disease-associated clusters and in reactive GFAP-high clusters, metagenes 1 and 2 are specifically enriched in the disease-associated cluster, most abundant in AD^{29} (Figure 3G; Supplementary Figure 5F). The two WT-associated metagenes (4,5) are correspondingly enriched in the homeostatic clusters (Figure 3G). In summary, bioIB directly uncovers the cellular subpopulations differentially affected by the disease, avoiding the initial pre-clustering of the cells.

### bioIB metagene hierarchy reflects the developmental connections between hematopoietic cell types

scRNA-seq datasets expose a striking diversity of cell types and states, whose interconnections carry important biological information about cell state identity. For example, the hierarchical differentiation tree of hematopoietic stem and progenitor cells (HSPCs) reveals the phenotype and function of mature hematopoietic cells^{35}. The bioIB metagene hierarchy can capture the developmental hierarchical structure of cell types, as we demonstrate here for scRNA-seq data of HSPC differentiation^{36} (Figure 4A). BioIB is applied given the cell type signal over a subset of the data containing six major hematopoietic cell types: monocytes, neutrophils, mast cells, basophils, megakaryocytes and erythroid cells. This analysis produced 11 metagenes, where each of the six cell types is uniquely characterized by at least one metagene, maximizing its expression level within that particular cell type (Figure 4B, Supplementary Tables 12,13). In addition, there are metagenes representing a transcriptional program shared by several developmentally linked cell types (Figure 4B,C; Supplementary Figure 6A). For example, metagenes 0 and 2 are specifically expressed in monocytes and neutrophils, respectively, while metagene 1 is activated in both (Figure 4B,C). The bioIB metagenes are biologically informative, uniting genes and processes characteristic of the corresponding cell types (Supplementary Figure 6B,C). For instance, metagene 0, specifically representing monocytes, features monocyte marker genes such as *FABP5*^{36} and *WFDC17*^{37,38} (Figure 4C,D), and is associated with pro-inflammatory macrophage activation, characteristic of monocyte function^{39} (Figure 4E). Similarly, metagene 2, specifically characterizing neutrophils, includes markers like *ITGB2L*^{36}, *CAMP*, *LTF*, and *ELANE*^{40} (Figure 4C,D) and is statistically enriched for neutrophil mediated immunity and neutrophil activation (Figure 4E).

The hierarchical representation of the metagenes generated by bioIB induces a hierarchy of cell types that reflects the developmental links between them (Figure 4F,G; Supplementary Figure 6D). In particular, the first bifurcation in the metagene hierarchy generates two metagenes corresponding to the two major branches in the developmental hierarchy^{36} (Figure 4A), one that includes Monocytes and Neutrophils, and another that includes Mast cells, Basophils, Megakaryocytes and Erythroid cells (Figure 4F,G). The second bifurcation splits the latter into two more specific metagenes, one including Mast cells and Basophils, and another including Megakaryocytes and Erythroid cells (Figure 4F,G). The third bifurcation further splits the metagene corresponding to the Mast-Baso branch into two separate metagenes that are more specific to either Mast cells or Basophils. Similarly, the fourth bifurcation splits the metagene corresponding to the Monocyte-Neutrophil branch into two separate Monocyte and Neutrophil associated metagenes. Finally, the last bifurcations split the metagene corresponding to the Megakaryocyte-Erythroid branch into four metagenes distinguishing between Megakaryocytes and Erythroid cells.

In conclusion, bioIB metagenes characterize distinct biological processes linked to the underlying cellular populations, while the metagene hierarchy unveils the biological relationships interconnecting these populations.

## Discussion

We introduced bioIB, a scRNA-seq tailored method for clustering genes with respect to a set of known cellular labels, based on the Information Bottleneck (IB) algorithm. We have shown that bioIB metagenes, which are biologically interpretable, provide a meaningful representation which exposes the representative molecular pathways differentially expressed between cellular populations of interest. Given single-cell data from human differentiating macrophages, with overlapping signals of organ-of-origin and developmental time, bioIB successfully extracted two distinct compressed data representations, each depicting the respective biological processes. Next, we used bioIB to identify a subpopulation of disease-associated astrocytes in single-nucleus data from an Alzheimer’s Disease mouse model, providing the genotype as the signal of interest. Finally, we have shown that beyond the final metagenes used for analysis, their hierarchical structure, produced by the iterative algorithm used by bioIB, exposes interconnections between metagenes and their respective cell types. We showcased this in the context of differentiating hematopoietic cells, where the bioIB hierarchical structure matched the expected developmental hierarchy of hematopoietic cell types.

By definition, the bioIB output is sensitive to the representation of the differentially labeled cell clusters in the data. The sensitivity is both with respect to the number of genes enriched in each cluster and in terms of the clusters’ size. This is expected since the mutual information measure at the core of bioIB is sensitive to these quantities. That is, a cluster containing a small number of cells is expected to be characterized by fewer metagenes compared to a larger cluster. This can be biologically motivated, on the one hand, as the relative abundance of cells within clusters in the data can be associated with their relative importance, and on the other hand may be a caveat when studying rare populations. For the latter case, a potential solution to this limitation could involve subsampling the data for a uniform distribution of clusters’ size.

As with a majority of computational methods, the bioIB output depends on a hyperparameter, *β*, controlling the level of compression. This is analogous to setting the number of clusters in a clustering algorithm, making this value data-specific. Here, the interpretability of the obtained metagenes allows the user to tune *β* to obtain the desired number of informative metagenes. The current bioIB formulation is limited in its scalability to data size, as it relies on the exact solution to the IB problem. This can be overcome, as we have done in this study, by focusing the analysis on highly informative genes. A natural extension to bioIB to overcome this limitation more generally is using an existing variational IB solver which relies on neural approximation^{42–44}.

In future work, bioIB can be extended to extract multiple related data representations with respect to several variables of interest, based on the multivariate information bottleneck framework^{45}. This paradigm might be particularly useful in analyzing gene expression data, allowing the simultaneous extraction of multiple encoded signals and the analysis of the corresponding biological processes. Furthermore, bioIB could be extended to produce signal-specific cell clusters, or metacells, retaining maximal possible information about a target gene subset, such as disease biomarkers.

Here we demonstrated that bioIB can provide efficient characterization of signals of interest encoded in single-cell data, such as cell type, disease state or tissue-of-origin. BioIB can be generalized beyond single-cell gene expression data to additional types of biological data, such as bulk RNA-seq and proteomics data, to expose signal-specific optimally compressed representations. In summary, bioIB is expected to enrich biological data analysis by revealing the hierarchical, signal-specific structure encoded in complex datasets.

## Materials and methods

### The bioIB algorithm

The bioIB algorithm provides a compressed representation of scRNA-seq data with respect to a signal of interest. To do so it takes as input a cell (*N*) by gene (*G*) scRNA-seq measurements matrix, *D* ∈ *R*^{N×G}; following standard practice we suggest providing log-normalized counts as input. Additional input to bioIB is a vector of cell labels related to the signal of interest *S* ∈ *R*^{N×1}, labeling every cell with one of *K* possible cell states of interest defined using *Y* = [1, …, *K*], such that *Y* = {*S*}. Given this input, the bioIB pipeline is composed of two main steps: (1) obtaining a probabilistic representation of the count matrix, and (2) using this representation as input for the Information Bottleneck (IB) algorithm.

#### 1. Obtaining a probabilistic data representation

We use the input count matrix *D* and signal of interest *S* to obtain the relevant probability distributions required for the IB algorithm: the conditional probability matrix of cell states given the genes, *p*(*y*|*x*), and the gene probability vector, *p*(*x*). To convert to probability space we define the random variables *C* ∼*Cat*({*c*_{1}, …, *c*_{N}}), *X* ∼*Cat*({*x*_{1}, …, *x*_{G}}), *Y* ∼*Cat*({*y*_{1}, …, *y*_{K}}), respectively representing the *N* cells, *G* genes and *K* cell states of interest. The empirical distributions of these are then constructed using the input data (see Equations 1-3).

#### 2. The IB algorithm

The obtained probabilistic representations, *p*(*y*|*x*) ∈ *R*^{K×G} and *p*(*x*) ∈ *R*^{G} are the input for the Information Bottleneck (IB) algorithm.

IB^{3} is a dimensionality reduction method, designed to extract the information from data *X* that is relevant for the prediction of another related variable *Y*, such that the choice of *Y* determines the relevant components of the signal encoded in *X*. Mutual information (MI) is used to evaluate both the extent of compression, $I(X; \hat{X})$, and the level of relevant information preserved in the compressed data, through $I(\hat{X}; Y)$. A trade-off parameter *β* is introduced to control the amount of compression (distortion) allowed. Formally, the IB objective is given by,

$$\min_{p(\hat{x} \mid x)} \; I(X; \hat{X}) - \beta\, I(\hat{X}; Y)$$

Notably, when *β* = 0, all genes are merged into one cluster (full compression), and when *β* = ∞, the compressed data is identical to the original full data, so every cluster is associated with one particular gene, $\hat{X} = X$. For every value of *β*, the algorithm yields the conditional probability matrix of *M* gene clusters, which we term metagenes, $\hat{x} \in \hat{X}$, given the genes, *x* ∈ *X*, $p(\hat{x} \mid x)$, representing the optimal mapping of genes to metagenes, and the conditional probability matrix of cell states given the metagenes, $p(y \mid \hat{x})$. For the full mathematical description and the associated proofs for the information bottleneck algorithm, see refs^{3,4}.
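To make the two limits of *β* concrete, the objective can be evaluated directly for a candidate mapping. The sketch below is illustrative only: the helper names `mutual_information` and `ib_objective` and the four-gene toy input are assumptions, and mutual information is measured in nats.

```python
from math import log


def mutual_information(joint):
    """I(A;B) in nats from a joint distribution given as joint[a][b]."""
    p_a = [sum(row) for row in joint]
    p_b = [sum(col) for col in zip(*joint)]
    return sum(
        p * log(p / (p_a[a] * p_b[b]))
        for a, row in enumerate(joint)
        for b, p in enumerate(row)
        if p > 0
    )


def ib_objective(p_x, p_y_given_x, p_xhat_given_x, beta):
    """Evaluate I(X;Xhat) - beta * I(Xhat;Y) for a given mapping.

    p_x[j] = p(x_j); p_y_given_x[k][j] = p(y_k|x_j);
    p_xhat_given_x[m][j] = p(xhat_m|x_j).
    """
    G, K, M = len(p_x), len(p_y_given_x), len(p_xhat_given_x)
    # Joint p(x, xhat) = p(x) p(xhat|x).
    joint_x_xhat = [[p_x[j] * p_xhat_given_x[m][j] for m in range(M)]
                    for j in range(G)]
    # Joint p(xhat, y) = sum_x p(x) p(xhat|x) p(y|x).
    joint_xhat_y = [
        [sum(p_x[j] * p_xhat_given_x[m][j] * p_y_given_x[k][j] for j in range(G))
         for k in range(K)]
        for m in range(M)
    ]
    return mutual_information(joint_x_xhat) - beta * mutual_information(joint_xhat_y)


# Toy input (hypothetical): four equiprobable genes, two perfectly separated states.
p_x = [0.25] * 4
p_y_given_x = [[1.0, 1.0, 0.0, 0.0],
               [0.0, 0.0, 1.0, 1.0]]
identity = [[1.0 if m == j else 0.0 for j in range(4)] for m in range(4)]
two_clusters = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
one_cluster = [[1.0, 1.0, 1.0, 1.0]]

values = {name: ib_objective(p_x, p_y_given_x, mapping, beta=2.0)
          for name, mapping in [("identity", identity),
                                ("two_clusters", two_clusters),
                                ("one_cluster", one_cluster)]}
```

In this toy case the two-cluster mapping attains the lowest objective at *β* = 2: it halves the compression cost of the identity mapping while keeping all information about *Y*, whereas the single cluster loses that information entirely.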

There are many ways to solve the IB objective (including neural approximators introduced recently^{42–44}). Here we will focus on the Blahut-Arimoto algorithm^{46}, described below. IB can provide either a series of solutions at different compression levels, using a reverse-annealing process, or a single solution with a flat division of the data points to a predefined number of clusters.

#### a. Blahut-Arimoto

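The algorithm alternates between updating the mapping $p(\hat{x} \mid x)$, the cluster marginal $p(\hat{x})$ and the conditional $p(y \mid \hat{x})$ until convergence. Here, *ε* is a threshold parameter used to define convergence based on the difference between previous and current iterations. For a given *β*, the algorithm converges into a stable solution, providing two output probability matrices that define $\hat{X}$: $p(\hat{x} \mid x)$ determines the mapping between the original data points *x* ∈ *X* and the data clusters $\hat{x} \in \hat{X}$, whereas $p(y \mid \hat{x})$ defines the association between the data clusters, $\hat{x} \in \hat{X}$, and the groupings of the signal of interest, *y* ∈ *Y*.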
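As a rough illustration of this iteration, the sketch below implements the standard IB self-consistent updates from refs 3,4 in plain Python. It is not the bioIB implementation: the function names, the convergence test (largest absolute change in the mapping falling below `eps`), and the toy input are assumptions, and the sketch presumes *p*(*x*) > 0 for every gene.

```python
from math import exp, inf, log


def kl(p, q):
    """KL divergence D(p || q) in nats; infinite if q lacks support where p > 0."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i > 0:
            if q_i == 0:
                return inf
            total += p_i * log(p_i / q_i)
    return total


def decoder(p_x, p_y_given_x, q):
    """Return p(xhat) and p(y|xhat) implied by the mapping q[m][j] = p(xhat_m|x_j)."""
    G, K, M = len(p_x), len(p_y_given_x), len(q)
    p_xhat = [sum(p_x[j] * q[m][j] for j in range(G)) for m in range(M)]
    p_y_xhat = [
        [sum(p_y_given_x[k][j] * q[m][j] * p_x[j] for j in range(G)) / p_xhat[m]
         if p_xhat[m] > 0 else 1.0 / K  # empty cluster: placeholder, gets zero weight
         for m in range(M)]
        for k in range(K)
    ]
    return p_xhat, p_y_xhat


def ib_iterate(p_x, p_y_given_x, q, beta, eps=1e-8, max_iter=1000):
    """Iterate the IB updates from an initial mapping q until it changes by < eps."""
    G, K, M = len(p_x), len(p_y_given_x), len(q)
    for _ in range(max_iter):
        p_xhat, p_y_xhat = decoder(p_x, p_y_given_x, q)
        new_q = [[0.0] * G for _ in range(M)]
        for j in range(G):
            p_y_x = [p_y_given_x[k][j] for k in range(K)]
            # Log-weights: log p(xhat) - beta * KL[p(y|x) || p(y|xhat)].
            log_w = []
            for m in range(M):
                if p_xhat[m] == 0:
                    log_w.append(-inf)
                    continue
                d = kl(p_y_x, [p_y_xhat[k][m] for k in range(K)])
                log_w.append(log(p_xhat[m]) - (beta * d if beta > 0 else 0.0))
            top = max(log_w)  # subtract the maximum for numerical stability
            w = [exp(lw - top) for lw in log_w]
            z = sum(w)
            for m in range(M):
                new_q[m][j] = w[m] / z
        change = max(abs(new_q[m][j] - q[m][j]) for m in range(M) for j in range(G))
        q = new_q
        if change < eps:
            break
    return q, decoder(p_x, p_y_given_x, q)[1]


# Toy input (hypothetical): four genes, two perfectly separated cell states.
p_x = [0.25] * 4
p_y_given_x = [[1.0, 1.0, 0.0, 0.0],
               [0.0, 0.0, 1.0, 1.0]]
identity = [[1.0 if m == j else 0.0 for j in range(4)] for m in range(4)]
mapping, p_y_given_xhat = ib_iterate(p_x, p_y_given_x, identity, beta=5.0)
```

In this toy run, genes 0 and 1 end up with identical rows of the mapping and share no mass with the clusters used by genes 2 and 3, so the four nominal clusters carry only two distinct *p*(*y*|x̂) profiles.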

#### b. Reverse-annealing

In the process of reverse-annealing the IB algorithm is initialized with a compressed representation that is identical to the original data *X* and with a large value of *β*:

*β*_{max}→ ∞

Next, we run the algorithm iteratively, while reducing *β*. Upon convergence, we initialize the next iteration with the final $p(\hat{x} \mid x)$ mapping achieved in the previous step, and with *β* − Δ, for a small step size Δ. Following this procedure we obtain a series of solutions for every value of *β*: ∀*β* ∈ {*β*_{min}, *β*_{min}+ Δ, …, *β*_{max}}. At the end of this process *β*_{min} → 0, corresponding to maximal compression, where $\hat{X}$ consists of a single point, uniting all the original data points in *X*. Reverse-annealing ultimately yields a hierarchical structure that mirrors several important aspects of the identified clusters, such as their informativity for discrimination between the labels of interest *Y*, as well as the interconnections among them. It is important to note that *β*_{max} controls the maximal number of metagenes, namely the number of end-nodes in the hierarchy, and modifying it does not affect the hierarchical structure itself.
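The warm-started schedule can be sketched independently of the IB solver itself. In the illustration below, `solve_ib` stands for any converged IB solver and is a hypothetical callable supplied by the caller; the function name `reverse_annealing`, the finite `beta_max` and the placeholder solver in the usage lines are likewise assumptions made for the example.

```python
def reverse_annealing(solve_ib, n_genes, beta_max, beta_min, step):
    """Run an IB solver over a decreasing beta schedule with warm starts.

    solve_ib(mapping, beta) -> mapping is a hypothetical converged IB solver,
    with mapping[m][j] = p(xhat_m | x_j).
    Returns a list of (beta, mapping) pairs ordered from beta_max down to beta_min.
    """
    # Start uncompressed: every gene is its own metagene (identity mapping).
    mapping = [[1.0 if m == j else 0.0 for j in range(n_genes)]
               for m in range(n_genes)]
    solutions = []
    n_steps = int(round((beta_max - beta_min) / step))
    for i in range(n_steps + 1):
        beta = max(beta_max - i * step, beta_min)
        # Warm start: the previous solution initialises the run at the lower beta.
        mapping = solve_ib(mapping, beta)
        solutions.append((beta, mapping))
    return solutions


# Usage with a placeholder solver that only records the schedule it is given.
calls = []


def placeholder_solver(mapping, beta):
    calls.append(beta)
    return mapping


solutions = reverse_annealing(placeholder_solver, n_genes=3,
                              beta_max=3.0, beta_min=0.0, step=1.0)
```

In practice *β*_{max} has to be finite, and choosing it large enough that the identity mapping remains a stable solution is left to the user.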

#### c. Clustering

To divide the data points *x* ∈ *X* into a defined number of clusters *M*, the IB algorithm is initialized with a random mapping of *X* to *M* clusters, generating a binary conditional probability matrix *p*(*x̂*|*x*). The corresponding *p*(*x̂*) and *p*(*y*|*x̂*) are obtained using basic probability rules and Bayes' theorem. Since this process makes the output depend on the initialization, we randomly initialize the algorithm *n* = 100 times and select the mapping that minimizes the objective function (Eq. 1)^{4}.
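A minimal sketch of one such random initialization is given below. The function name is hypothetical, and the sketch assumes a balanced random assignment so that no cluster starts empty, which the text does not specify.

```python
import numpy as np

def random_hard_init(p_x, p_y_given_x, n_clusters, rng):
    """Random binary p(x_hat|x) plus the induced p(x_hat) and p(y|x_hat)."""
    n_x = p_x.shape[0]
    # Assumption: shuffle a balanced label vector so every cluster is non-empty.
    labels = rng.permutation(np.arange(n_x) % n_clusters)
    q = np.zeros((n_x, n_clusters))
    q[np.arange(n_x), labels] = 1.0                 # one-hot rows

    p_c = p_x @ q                                   # p(x_hat) = sum_x p(x) p(x_hat|x)
    joint = (q * p_x[:, None]).T @ p_y_given_x      # p(x_hat, y)
    p_y_c = joint / p_c[:, None]                    # Bayes: p(y|x_hat)
    return q, p_c, p_y_c
```

Each of the *n* restarts would then be run to convergence from its own initialization, and the solution with the lowest objective value retained.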

### Downstream analyses

#### 1. Identifying representative genes

The representative genes *x* ∈ *X* for a given metagene *x̂* are identified as the ones that maximize *p*(*x*|*x̂*). Specifically, for a given metagene, we first order the genes by their conditional probability *p*(*x*|*x̂*) in decreasing order. For a given *τ* ∈ [0,1], the set of *j* representative genes {*x*_{1}, *x*_{2}, …, *x*_{j}} is chosen as the minimal set such that:

Σ_{i=1}^{j} *p*(*x*_{i}|*x̂*) ≥ *τ*
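This selection can be sketched as a cumulative-sum threshold. The sketch assumes that genes are ranked by a conditional probability vector over genes for one metagene that sums to 1; the function name is hypothetical.

```python
import numpy as np

def representative_genes(p_gene_given_metagene, tau):
    """Indices of the smallest top-ranked gene set whose probabilities sum to >= tau."""
    p = np.asarray(p_gene_given_metagene, dtype=float)
    order = np.argsort(p)[::-1]              # genes by decreasing probability
    cumulative = np.cumsum(p[order])
    # First position at which the cumulative probability reaches tau
    j = int(np.searchsorted(cumulative, tau, side="left")) + 1
    return order[:min(j, len(order))]
```

For example, with probabilities (0.1, 0.5, 0.15, 0.25) and *τ* = 0.7, the two genes with probabilities 0.5 and 0.25 are returned, since their sum 0.75 is the first cumulative value to reach *τ*.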

#### 2. Recovering single-cell metagene expression

The bioIB output provides the mapping of the original count matrix *D* ∈ *R*^{N×G} to its compressed representation *D̂* ∈ *R*^{N×M}. Namely, we obtain the weighted expression of genes, *x* ∈ *X*, using the mapping *p*(*x̂*|*x*), given by,

*D̂*_{ij} = Σ_{g=1}^{G} *D*_{ig} *p*(*x̂*_{j}|*x*_{g})

As a result, we obtain a cell (*N*) by metagene (*M*) compressed data matrix, *D̂*, such that *D̂*_{ij} represents the expression level of metagene *j* in cell *i*.
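In matrix form, this projection is a single product. The sketch below assumes the gene-to-metagene mapping is stored as a *G* × *M* array with one row per gene; the function name is hypothetical.

```python
import numpy as np

def metagene_expression(D, p_metagene_given_gene):
    """Project a cell-by-gene matrix onto metagenes.

    D:                      (N, G) expression matrix
    p_metagene_given_gene:  (G, M) mapping, each row sums to 1
    Returns an (N, M) matrix of weighted gene expression per metagene.
    """
    return np.asarray(D, dtype=float) @ np.asarray(p_metagene_given_gene, dtype=float)
```

Because each row of the mapping sums to 1, the total expression of every cell is preserved and only redistributed across metagenes.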

#### 3. Extracting the metagene hierarchy

The bioIB reverse-annealing output provides a series of conditional probability matrices, *p*(*x̂*|*x*) and *p*(*y*|*x̂*), for each *β*. Since we initialize the reverse-annealing process with *X̂* = *X*, these matrices include *N* metagenes, but only *M* of them are unique. We first identify the most representative gene *x* of each metagene *x̂*, using *p*(*x*|*x̂*):

*x*_{rep}(*x̂*) = argmax_{*x* ∈ *X*} *p*(*x*|*x̂*)

Next, we extract the metagene hierarchy by identifying the merging points of the most representative genes for each metagene across decreasing *β*. For example, metagenes *x̂*_{i} and *x̂*_{j} are considered merged at *β*_{merge} if ∀*y*, *p*(*y*|*x̂*_{i}) = *p*(*y*|*x̂*_{j}). The identified merging points are recorded in the format of the linkage matrix returned by scipy.cluster.hierarchy.linkage() and plotted using scipy.cluster.hierarchy.dendrogram(). The code and the documentation for the relevant bioIB functions are provided in the bioIB package at https://github.com/nitzanlab/bioIB.
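The merge test can be sketched as a scan over the annealing solutions. The helper below is hypothetical and not the bioIB function: it assumes the solutions are ordered by decreasing *β*, that each holds a matrix of *p*(*y*|*x̂*) with a fixed row per metagene, and that equality is checked up to a numerical tolerance `atol`.

```python
import numpy as np

def merge_beta(solutions, i, j, atol=1e-6):
    """Largest beta at which metagenes i and j share the same p(y|x_hat).

    solutions: list of (beta, p_y_given_metagene) ordered by decreasing beta.
    Returns None if the two metagenes never coincide.
    """
    for beta, p_y_c in solutions:
        if np.allclose(p_y_c[i], p_y_c[j], atol=atol):
            return beta
    return None
```

Collecting these values for all metagene pairs gives the merge order from which a linkage matrix can be assembled.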

#### 4. Linking metagenes to cell types

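Metagenes, *x̂* ∈ *X̂*, are linked to cell types, *y* ∈ *Y*, using the *p*(*y*|*x̂*) mapping, given by,

*p*(*y*|*x̂*) = (1/*p*(*x̂*)) Σ_{*x*} *p*(*y*|*x*) *p*(*x̂*|*x*) *p*(*x*)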

### Datasets

## Multi-organ atlas of human differentiating macrophages

### Data preprocessing

We obtained the dataset of the multi-organ atlas of human differentiating macrophages from ref.^{24}, available at https://developmental.cellatlas.io/fetal-immune. We downloaded the dataset of myeloid cells and further filtered the data to include only cells of macrophage cell types. Using the gestational week label we assigned the cells into two groups, “Early” and “Late” (Early: < 14 weeks; Late: >= 14 weeks).

To avoid bias towards less represented organ groups, we filtered out cells originating from organs with fewer than 2800 total cells, leaving cells from five organs: kidney (KI), liver (LI), skin (SK), spleen (SP) and yolk sac (YS). Following basic filtering of low-quality cells using scanpy’s^{47} `sc.pp.filter_cells(min_genes=200)`, the data used for bioIB analysis included 108197 cells. We further reduced the data to 500 highly variable genes using scanpy’s^{47} `sc.pp.highly_variable_genes()` with the default parameters.

### Method application

We applied bioIB to the obtained dataset twice, (1) setting *Y* as the developmental stage (*Y* = [*Early, Late*]), and (2) setting *Y* as the organ-of-origin (*Y* = [*KI, LI, SK, SP, YS*]). In both analyses we initialized the reverse-annealing process with *β*_{max} = 20.

### Benchmarking

To evaluate the predictive performance of bioIB metagenes, we trained linear SVMs on the compressed data generated by bioIB and compared their accuracy in predicting the cellular labels of gestational week and organ with that achieved using two settings of k-means gene clusters, computed over (1) the original log-normalized count matrix *X*, and (2) the probability matrix *p*(*y*|*x*), which contains the information about the labels and represents the original input to bioIB.

## Astrocytes from a murine model of Alzheimer’s Disease (AD)

### Data preprocessing

We obtained single-nucleus RNA-seq measurements of astrocytes from an AD mouse model and from wild-type (WT) mice from ref.^{29}, available at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE143758. Following normalization and log-transformation, we performed Leiden clustering using scanpy’s `sc.tl.leiden()` function with default parameters. We then retained the cell clusters with enriched expression of the astrocytic markers *Gfap* and *Slc1a3*, resulting in *n* = 7036 cells. As a last step we extracted highly informative genes with respect to the signal of interest, namely disease state, encoded by the provided genotype annotation *Y* = [*AD, WT*]. This was done by retaining the 1000 genes with the highest information gain (IG) values, where the IG is defined using the mutual information between the gene expression probability *p*(*x*) and the genotype probability *p*(*y*).
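One way to compute such a gene ranking is sketched below. It rests on two assumptions that are not taken from the text: the per-gene IG is taken to be the gene's contribution to the mutual information, Σ_{y} *p*(*x*,*y*) log[*p*(*x*,*y*) / (*p*(*x*)*p*(*y*))], and the joint distribution *p*(*x*,*y*) is estimated by summing counts per label and normalizing. The function name is hypothetical.

```python
import numpy as np

def information_gain(counts, labels):
    """Per-gene contribution to the mutual information I(X;Y).

    counts: (n_cells, n_genes) non-negative expression matrix
    labels: (n_cells,) label of each cell (e.g. genotype)
    Assumption: p(x, y) is estimated from label-wise summed counts.
    """
    counts = np.asarray(counts, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    # Joint distribution over (gene, label), shape (n_genes, n_labels)
    joint = np.stack([counts[labels == c].sum(axis=0) for c in classes], axis=1)
    joint = joint / joint.sum()
    p_x = joint.sum(axis=1, keepdims=True)
    p_y = joint.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(joint > 0, joint * np.log(joint / (p_x * p_y)), 0.0)
    return terms.sum(axis=1)
```

The top-ranked genes can then be taken as `np.argsort(ig)[::-1][:k]` for the desired number of genes `k`.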

### Method application

We applied bioIB with *Y* set as the mouse genotype: *Y* = [*AD, WT*]. The reverse-annealing process was initialized with *β*_{max}= 150.

### Constructing a list of AD-related genes

AD-associated genes were defined as genes that are differentially expressed in at least 7 of the 15 AD-APP mouse model studies included in the AD meta-analysis resource, which summarizes and compares the differential expression results from a wide range of AD transcriptomic studies^{34}.

## Hematopoiesis dataset

### Data preprocessing

We obtained the dataset of differentiating hematopoietic cell types collected by ref.^{36} and processed by ref.^{35}. Data were downloaded with the Cospar package (https://cospar.readthedocs.io/en/latest/index.html) using the function `cs.datasets.hematopoiesis()`. We filtered out the undifferentiated cells, as well as the differentiated cell types with fewer than 300 total cells, resulting in a data subset of 27387 cells. We further reduced the data to the highly variable genes using scanpy’s^{47} `sc.pp.highly_variable_genes()` with the default parameters, resulting in 1803 genes.

### Method application

We calculated the IG values (as above) for the highly variable genes and used the 300 genes with the highest IG values as input for bioIB.

## Data availability

The datasets analyzed in the current study are available at:

Immune macrophage atlas: https://developmental.cellatlas.io/fetal-immune

Alzheimer’s Disease astrocytes: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE143758

Hematopoiesis: https://cospar.readthedocs.io/en/latest/index.html

## Code availability

Software is available at https://github.com/nitzanlab/bioIB.

## Acknowledgements

We would like to thank the late Professor Naftali Tishby for initiating this project and his guidance which made this work possible. We would also like to express our gratitude to Professor Eli Nelken and Hadar Levi Aharoni for fruitful discussions. We acknowledge all members of the Nitzan lab for general feedback. This work was supported by the Azrieli, Kaete-Klausner and TEVA PhD fellowships (S.D.), a scholarship for outstanding doctoral students in data-science by the Israeli Council for Higher Education and the Clore Scholarship for PhD students (Z.P.), an Alon Fellowship, the Israel Science Foundation (Grant no. 1079/21), and the European Union (ERC, DecodeSC, 101040660) (M.N.). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council.

## Footnotes

This version includes an updated description of the model. We explain how to estimate the relevant distributions from the input data. Figure 1 is adapted accordingly.