Identifying maximally informative signal-aware representations of single-cell data using the Information Bottleneck

Rapid advancements in single-cell RNA-sequencing (scRNA-seq) technologies have revealed the richness of myriad attributes encompassing cell identity, such as diversity of cell types, organ-of-origin, or developmental stage. However, due to the large scale of the data, obtaining an interpretable compressed representation of cellular states remains a computational challenge. For this task we introduce bioIB, a method based on the Information Bottleneck algorithm, designed to extract an optimal compressed representation of scRNA-seq data with respect to a desired biological signal, such as cell type or disease state. BioIB generates a hierarchy of weighted gene clusters, termed metagenes, that maximize the information regarding the signal of interest. Applying bioIB to a scRNA-seq atlas of differentiating macrophages and setting either the organ-of-origin or the developmental stage as the signal of interest provided two distinct signal-specific sets of metagenes that captured the attributes of the respective signal. BioIB's representation can also be used to expose specific cellular subpopulations; for example, when applied to a single-nucleus RNA-sequencing dataset of an Alzheimer's Disease mouse model, it identified a subpopulation of disease-associated astrocytes. Lastly, the hierarchical structure of metagenes revealed interconnections between the corresponding biological processes and cellular populations. We demonstrate this over hematopoiesis scRNA-seq data, where the metagene hierarchy reflects the developmental hierarchy of hematopoietic cell types.

Significance
Single-cell gene expression represents an invaluable resource, encoding multiple aspects of cellular identity. However, its high complexity poses a challenge for downstream analyses. We introduce bioIB, a methodology based on the Information Bottleneck, that compresses data while maximizing the information about a biological signal-of-interest, such as disease state.
bioIB generates a hierarchy of metagenes, probabilistic gene clusters, which compress the data at gradually changing resolutions, exposing signal-related processes and informative connections between gene programs and their corresponding cellular populations. Across diverse single-cell datasets, bioIB generates distinct metagene representations of the same dataset, each maximally informative relative to a different signal; uncovers signal-associated cellular populations; and produces a metagene hierarchy that reflects the developmental hierarchy of the underlying cell types.


Introduction
Cellular gene expression profiles encapsulate a wealth of information regarding a cell's identity, defined by a variety of biological factors, such as cell type, disease state, and developmental stage. Single-cell RNA-sequencing (scRNA-seq) technologies, which quantify gene expression levels at single-cell resolution, are invaluable for revealing these facets, making it possible to study the different factors encompassing a cell's identity [1]. However, exposing such factors poses a computational challenge due to the complexity and high dimensionality of scRNA-seq data. Datasets typically comprise thousands of gene profiles across thousands to hundreds of thousands of cells, and any reduction in dimensionality will naturally result in loss of information [2]. Specifically, when aiming to uncover factors associated with a specific biological signal (e.g. gene programs associated with disease progression), the challenge can be framed as a trade-off between reducing the complexity of the data and retaining as much information as possible regarding the signal of interest. The Information Bottleneck (IB) theory [3] allows us to reason mathematically about this trade-off. Given a dataset (e.g. scRNA-seq measurements) and a variable of interest encoded in the data (e.g. healthy vs.
disease samples), IB provides a reduced data representation that is maximally informative about the selected variable [3,4]. Since it was first introduced, IB has been successfully applied in diverse fields, such as text clustering [5], image analysis [6,7], language processing [8], neuroscience [9] and computational biology [10-12]. Here, we present bioIB, a method tailored to single-cell data and based on the IB algorithm, which provides a compressed, signal-informative representation of single-cell data. The compressed representation is given by metagenes, which are clusters defined by a probabilistic mapping of genes. The probabilistic construction preserves gene-level biological interpretability, allowing characterization of each metagene. By considering this trade-off and focusing on the signal of interest, bioIB differs from previous approaches to data compression, namely dimensionality reduction methods for single-cell data (e.g. PCA [13,14], ICA [15], NMF [16]), which are typically applied to the data without any constraints and produce factors with limited biological interpretability. Furthermore, the interpretability of bioIB's output is an advantage over deep learning methods that obtain non-linear latent representations of the data (e.g. scVI [17], scDeepCluster [18], SAUCIE [19], scGNN [20]). Finally, existing methods aimed at extracting interpretable factors from scRNA-seq data (e.g. f-scLVM [21], net-NMFsc [22], Spectra [23]) rely on prior knowledge in the form of known molecular pathways or gene interactions that regulate the desired signal of interest. In contrast, bioIB is applicable in settings where such prior knowledge is not necessarily available, but the labels with respect to the signal of interest are known. Specifically, bioIB elucidates the relevant gene programs de novo, guided by the cellular labels associated with the signal of interest (e.g., 'Wildtype' and 'Mutant' for the 'Genotype' signal). In addition to achieving optimal signal-aware clustering of genes via metagenes,
bioIB generates a hierarchy of these metagenes, reflecting the inherent data structure relative to the signal of interest. The bioIB hierarchy facilitates the interpretation of metagenes, elucidating their significance in distinguishing between biological labels and illustrating their interrelations with one another and with the underlying cellular populations. We demonstrate that metagenes generated by bioIB are biologically meaningful, capturing molecular pathways differentially activated between selected cell groups. First, we consider a scRNA-seq atlas of differentiating macrophages. By using either organ-of-origin or developmental stage as the signal of interest, we show that bioIB extracts distinct, signal-specific metagene hierarchies and associated biological processes. Next, we demonstrate how bioIB can be used to identify a cellular subpopulation of disease-associated astrocytes in a single-nucleus RNA-seq (snRNA-seq) dataset from murine Alzheimer's Disease models. Finally, we show that the bioIB metagene hierarchy for a dataset of differentiating hematopoietic cell types reflects the developmental hierarchy of the corresponding cellular populations. bioIB is available as an open-source software package, along with documentation and tutorials (https://github.com/nitzanlab/bioIB).

bioIB elucidates signal-specific metagenes and their structure
The bioIB representation is computed for a given dataset and signal of interest, provided as cell labels. The representation is composed of metagenes, which are probabilistic aggregations of the genes into clusters, representing the major patterns of gene expression variation underlying the labeled signal. The input to bioIB includes a count matrix $D \in \mathbb{R}^{N \times G}$ of $N$ cells by $G$ genes, and a vector of cell labels related to the signal of interest, $S \in \mathbb{R}^{N \times 1}$, where, for example, each cell is labeled as sampled from either a healthy or diseased population (Methods; Figure 1A). This input is used to estimate the distributions required for the bioIB algorithm. We thus define three categorical random variables, $C \sim \mathrm{Cat}(\{c_1, \dots, c_N\})$, $X \sim \mathrm{Cat}(\{x_1, \dots, x_G\})$, $Y \sim \mathrm{Cat}(\{y_1, \dots, y_K\})$, respectively representing the $N$ cells, $G$ genes and $K$ cell states of interest. Normalizing the input matrix $D$ by the total number of counts, we obtain a joint probability distribution $p(C, X)$. Next, summing $p(C, X)$ across the cells, we obtain $p(X)$, such that an entry $[p(X)]_i$ represents the marginal probability of sampling a transcript of gene $x_i$. Using Bayes' theorem, we obtain the conditional probability:

$$p(C \mid X) = \frac{p(C, X)}{p(X)} \tag{1}$$

Here, an entry $[p(C \mid X)]_{ij}$ represents the probability of sampling the cell $c_i$ out of all measured cells in $C$, given that we observed gene $x_j$.
The provided cellular annotation vector $S \in \mathbb{R}^{N \times 1}$ allows us to define the conditional distribution of $Y$ (representing the $K$ cell states of interest) given that we observed a cell in $C$. By definition, $p(Y \mid C)$ is an indicator function defined by $S$: for a cell $c_i$, $p(y \mid c_i) = 1$ if $s_i = y$ and zero otherwise:

$$p(y \mid c_i) = \begin{cases} 1, & s_i = y \\ 0, & \text{otherwise} \end{cases} \tag{2}$$

Finally, we obtain the conditional distribution of cell states of interest given that we observed a certain gene in $X$:

$$p(Y \mid X) = \sum_{c \in C} p(Y \mid c)\, p(c \mid X) \tag{3}$$

The conditional probability matrix of cell states given the genes, $p(Y \mid X)$, and the gene probability vector, $p(X)$, are used as input to the core of the bioIB method, the IB algorithm. The IB yields the optimal probabilistic mapping, $p(\hat{x} \mid x)$, from the genes' random variable, $X$, to the categorical random variable representing the metagenes, $\hat{X} \sim \mathrm{Cat}(\{\hat{x}_1, \dots, \hat{x}_M\})$ (for $|\hat{X}| \le |X|$). The mapping is optimal with respect to the trade-off between compression and information about the signal of interest $Y$, according to a given threshold parameter $\beta$ (Figure 1C). This is achieved by optimizing for the $\hat{X}$ that minimizes the mutual information with the input genes $X$, $I(X, \hat{X})$, while maximizing the mutual information with $Y$, $I(\hat{X}, Y)$ (Methods; Figure 1C):

$$\hat{X} = \arg\min_{p(\hat{x} \mid x)} \left( I(X, \hat{X}) - \beta\, I(\hat{X}, Y) \right) \tag{4}$$

The resulting metagenes are probabilistic clusters of genes capturing the shared expression patterns amongst cell states relative to $Y$ (Figure 1D). The number of metagenes is roughly determined by the threshold parameter $\beta$, ranging from the original representation (no compression, $\beta \to \infty$; $\hat{X} = X$) to full compression into a single cluster ($\beta = 0$). A hierarchy of metagenes is obtained by gradually decreasing $\beta$ through a reverse-annealing process [4] (Methods). The probabilistic output mapping, $p(y \mid \hat{x})$, reflects the amount of information each metagene holds regarding the different labels, whereas the hierarchical structure reveals the interdependence between the metagenes and the underlying cellular populations they correspond to (Figure 1E). As an illustrative example, we construct a toy dataset composed of cells belonging to one of two cell types, which act as the signal of interest $Y$ (Supplementary Figure 1A-D). The bioIB hierarchy is revealed by plotting the conditional probabilities $p(y \mid \hat{x})$ of a particular label given every metagene, across the $\beta$ values that define the compression level (Supplementary Figure 1C-D). The hierarchical structure reflects the interconnections among the metagenes and the specified cell types of interest ($Y$), while the bifurcation order is dictated by the informativity of the generated metagenes relative to $Y$. bioIB can also capture the relationships between related cell types, defined as distinct labels of interest ($Y$). Given a toy model with four related cell types, the bioIB hierarchy reflects the two distinct pairs of linked cell types as two branches. Further splits correspond to higher-resolution separation into different cell types, eventually resulting in cell type-specific metagenes (Supplementary Figure 1E-G).
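To make the two information terms in the objective concrete, the following minimal Python sketch evaluates $I(X, \hat{X})$ and $I(\hat{X}, Y)$ for a given gene-to-metagene mapping. It is an illustration under stated assumptions rather than the bioIB implementation: the function names (`mutual_information`, `ib_terms`) and the toy probabilities are hypothetical, and information is measured in nats.

```python
import numpy as np

def mutual_information(p_joint):
    """Mutual information (in nats) of a 2-D joint probability table."""
    p_joint = np.asarray(p_joint, dtype=float)
    p_row = p_joint.sum(axis=1, keepdims=True)
    p_col = p_joint.sum(axis=0, keepdims=True)
    nonzero = p_joint > 0
    ratio = p_joint[nonzero] / (p_row @ p_col)[nonzero]
    return float(np.sum(p_joint[nonzero] * np.log(ratio)))

def ib_terms(p_x, p_xhat_given_x, p_y_given_x):
    """Return (I(X, X_hat), I(X_hat, Y)) for a gene-to-metagene mapping."""
    # Joint p(x_hat, x) = p(x_hat|x) p(x), shape M x G.
    p_xhat_x = p_xhat_given_x * p_x[None, :]
    # Joint p(y, x_hat) = sum_x p(y|x) p(x_hat|x) p(x), shape K x M.
    p_y_xhat = p_y_given_x @ p_xhat_x.T
    return mutual_information(p_xhat_x), mutual_information(p_y_xhat)

# Hypothetical toy input: four equally likely genes, two cell states.
p_x = np.array([0.25, 0.25, 0.25, 0.25])
p_y_given_x = np.array([[0.9, 0.8, 0.2, 0.1],
                        [0.1, 0.2, 0.8, 0.9]])

one_cluster = np.ones((1, 4))                    # full compression
two_clusters = np.array([[1.0, 1.0, 0.0, 0.0],
                         [0.0, 0.0, 1.0, 1.0]])  # two hard metagenes
identity = np.eye(4)                             # no compression

for name, mapping in [("one cluster", one_cluster),
                      ("two clusters", two_clusters),
                      ("identity", identity)]:
    compression, relevance = ib_terms(p_x, mapping, p_y_given_x)
    print(f"{name}: I(X, X_hat) = {compression:.3f}, I(X_hat, Y) = {relevance:.3f}")
```

The three mappings span the range described in the text: a single cluster carries no information about either the genes or the labels, while the identity mapping retains all of it at the cost of no compression.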

Figure 1. Elucidating meaningful metagenes underlying the signal of interest using bioIB.
A-D) The bioIB pipeline. A) Input: bioIB takes as input a count matrix and a cellular annotation vector, labeling every cell with a state, representing the signal of interest. For example, if the signal of interest is cell type, these labels should annotate every cell with the corresponding cell type. B) Distribution extraction: the provided count matrix and the cellular annotation vector are used to estimate the distributions of the random variables representing the genes ($X$) and the cell states of interest ($Y$). C) Information Bottleneck: the probabilities obtained in (B) are used as input for the Information Bottleneck (IB) algorithm, which yields the optimal mapping of genes to metagenes by optimizing the trade-off between complexity and accuracy. This is achieved by optimizing for the $\hat{X}$ that minimizes the mutual information between the input genes ($X$) and the metagenes ($\hat{X}$), $I(X, \hat{X})$, while maximizing the mutual information between the metagenes and the cell states of interest ($Y$), $I(\hat{X}, Y)$. D) Output: the output of bioIB is a probabilistic mapping between genes and metagenes, scoring all the input genes by their contribution to each metagene. bioIB also provides a cell-to-metagene compressed representation of the input matrix, summarizing the expression of metagenes in the input single cells. E) Possible downstream analyses of the compressed data achieved by bioIB: enhancing the signal of interest (such as cell type or genotype) in the data, extracting informative genes underlying the heterogeneity between cell type labels, and elucidating the interconnections between metagenes and the corresponding cell types.

bioIB extracts distinct molecular signatures in macrophages for developmental stage and organ residence across development
Gene expression data in scRNA-seq experiments contain signatures associated with multiple overlapping biological signals or conditions. How can we identify gene signatures associated with a specific source of heterogeneity in the data? We demonstrate
bioIB's approach to this challenge in the context of a scRNA-seq atlas of the developing immune system, which contains cells from 9 organs spanning weeks 4 to 17 after conception [24] (Figure 2A). Applied to this dataset, bioIB identified gene signatures associated with either the organ-of-origin or the developmental stage. For each signal of interest (either organ-of-origin or developmental stage), bioIB clustered the relevant genes into co-expressed metagenes, thus compressing the data while preserving the information about the selected signal of interest. We focused on the macrophage subset of cells, as their gene expression varies across organs and throughout the gestation stages [24] (Supplementary Figure 2A). We first applied bioIB with $Y$ set to be the developmental stage, after aggregating cells into 'Early' (8-12 gestational weeks) and 'Late' (>14 gestational weeks) labels. The resulting bioIB representation, compressed into four metagenes, exposed the selected signal of developmental stage (Supplementary Tables 1, 2). Qualitatively, cells cluster according to 'Early' and 'Late' developmental stages in the compressed representation (Figure 2B). Evaluating the corresponding clustering quantitatively, we find that the Normalized Mutual Information (NMI) score following bioIB is increased relative to the original data for the developmental stage labels, but not for the organ labels (fold change in NMI score: stage labels = 13.8; organ labels = 0.76; Supplementary Table 5). Moreover, bioIB metagenes outperform baseline clustering methods in predicting the gestational week of the cells, with an average success rate improvement of 12%, indicating that bioIB preserves more information about the signal of interest (Figure 2D; Supplementary Figure 3A; Supplementary Table 6; Methods). Applying bioIB to the same data, but now setting the signal of interest $Y$ as the organ-of-origin, resulted in a different compressed representation, informative with respect to
the organ-of-origin (Supplementary Tables 3, 4), as the cells in their compressed metagene representation cluster according to their organ-of-origin (Figure 2C). bioIB with $Y$ set to organ improves the NMI scores both for organ clustering and for developmental stage clustering (fold change in NMI score: stage labels = 2.2; organ labels = 1.5; Supplementary Table 5). This is explained by the non-uniform distribution of cells from different organs across developmental stages (Supplementary Figure 2B), such that a cell's organ-of-origin is also highly predictive of its developmental stage. Again, bioIB metagenes outperform baselines in signal label prediction, with an average success rate improvement of 12% (Figure 2E; Supplementary Figure 3B; Supplementary Table 7; Methods). Having validated the representation, we turn to examine the hierarchical structure and interconnections of metagenes exposed by the gradual compression of the data by bioIB reverse annealing (Methods). When $Y$ is set as the gestational stage, as $\beta$ increases, the first bifurcation produces a metagene that is specific to late gestational stages (Metagene 0; Figure 2F, Supplementary Figure 4A). This metagene is statistically enriched for immune processes based on Gene Ontology analysis [25] (Figure 2G) and contains genes involved in the development of the adaptive immune response, such as TRBV7-2 and IGL1-40, that are specifically active in later gestation stages across organs [24] (Supplementary Figure 4B). The next bifurcation produces a more general metagene 1, statistically enriched for multiple biological processes including immune function (Supplementary Figure 4C). Relative to metagene 0, metagene 1 is less informative for distinguishing between early and late stages, and is therefore generated second by bioIB (Supplementary Figure 4D). Metagene 2 contains yolk sac markers (e.g. TTR, CGA), in accordance with the yolk sac being the only organ that is present exclusively at the
earliest gestational stages (Supplementary Figure 4B). Next, we validate that, as desired, metagenes generated with $Y$ being the organ-of-origin reflect organ-specific transcriptomic signatures and that the bioIB hierarchy reflects the developmental connections between the macrophage populations from different organs (Figure 2H, I, Supplementary Figure 4E). For example, the branch uniting metagenes 0-1, associated with yolk sac, and metagene 2, characterizing liver, is consistent with the reported developmental migration of the macrophages from the yolk sac to the liver [24]. Another example is the branch consisting of metagenes 3 and 4, which mirrors the macrophage transition from the liver to the spleen [24].

Figure 2 (caption, partially recovered). The scRNA-seq data of macrophages from 5 distinct organs (kidney, liver, skin, spleen, yolk sac) and 11 gestational weeks (4, 7-12, 14-17) was analyzed using bioIB with $Y$ set either as the developmental stage (Early: < 14 weeks; Late: >= 14 weeks) or as the organ-of-origin, which resulted in two distinct signal-specific compressed representations of the same data. In (H), metagene 0, which is mostly expressed in the yolk sac and features characteristic organ markers like CGA [26], is merged into one branch with metagenes 1 and 2, which are elevated in the liver and feature apolipoprotein genes involved in liver-associated cholesterol efflux [27]. I) Heatmap showing scaled metagene expression in the original scRNA-seq matrix among organs. YS: yolk sac, LI: liver, SP: spleen.

bioIB metagenes identify Alzheimer's Disease associated astrocytes
A key challenge in scRNA-seq analysis is to identify specific cellular subpopulations affected by a certain condition, such as disease. The standard pipeline commonly implemented for this task involves unsupervised clustering of cells, which exposes the downstream analysis to clustering-related bias [28]. BioIB can overcome such limitations and detect disease-associated cells within a heterogeneous cellular population, which we demonstrate in the context of Alzheimer's disease (AD)-associated astrocytes. To do so, we re-analyzed single-nucleus RNA-seq measurements of astrocytes from an AD mouse model and wild-type (WT) mice [29] (Figure 3A). BioIB analysis with the signal of interest set as the genotype (AD/WT) resulted in a hierarchy of six metagenes (Supplementary Tables 9, 10) capturing informative transcriptomic signatures differentiating between AD and WT cells (Figure 3B, Supplementary Figure 5A, Supplementary Table 11). The resulting metagenes allowed for better separation of AD and WT cells compared to the original data (fold change increase in NMI score = 3.98; Figure 3C). Furthermore, bioIB metagenes captured a higher-resolution structure within the data: the main branch of metagenes associated with the AD phenotype is composed of metagenes 0, 1 and 2, each associated in turn with a distinct subpopulation of AD astrocytes (Figure 3D,E). To interpret their biological identities, we extracted a set of representative genes for each metagene (Methods). Metagene 0, whose representative gene set includes genes involved in morphology regulation (GFAP, THY1, VIM, B2M, PSEN1), is enriched for the cellular projection development process, consistent with general astrocyte activation [30] (Supplementary Figure 5D). Metagenes 1 and 2 represent pathways more tightly associated with the disease: the representative gene set of metagene 1 is enriched with immune genes [31], such as C1QA and CTSS, and metagene 2 is represented by established markers of AD pathology, TYROBP and SERPINA3N
[32,33]. Meta-analysis of the AD-associated transcriptome [34] revealed that metagenes 1 and 2 are the only metagenes that are exclusively represented by AD-associated genes (Figure 3F; Methods). Characterization of the WT-related metagene 5 can be found in Supplementary Figure 5E. While metagene 0 is expressed in the majority of AD astrocytes, metagenes 1 and 2 characterize distinct cellular subpopulations among the AD cells (Figure 3D,E), which we hypothesized to correspond to disease-associated astrocytic signatures. To validate our interpretation, we quantified the expression of bioIB metagenes in six astrocytic clusters defined in the original analysis, which included two homeostatic clusters, two GFAP-high clusters of reactive astrocytes that are not specific to the disease, and two disease-associated clusters [29]. We found that while bioIB metagene 0 is highly expressed both in disease-associated clusters and in reactive GFAP-high clusters, metagenes 1 and 2 are specifically enriched in the disease-associated cluster that is most abundant in AD [29] (Figure 3G; Supplementary Figure 5F). The two WT-associated metagenes (4, 5) are correspondingly enriched in the homeostatic clusters (Figure 3G).

Figure 3. A) Pipeline of the analysis shown in this section. The snRNA-seq data of astrocytes from an AD mouse model was analyzed using bioIB with $Y$ set to genotype, which resulted in identification of a specific subpopulation of disease-associated astrocytes. Figure created with Biorender.com. B) bioIB metagene hierarchy produced given the preprocessed snRNA-seq data, relating to the AD group. The defined metagenes exhibit differential expression patterns between AD and WT, with metagenes 0, 1 and 2 overexpressed in AD cells (fold change in metagenes 0, 1, 2: 1.9, 3.9, 6, respectively), a neutral metagene 3 (fold change in metagene 3 = 0.91), and metagenes 4 and 5 overexpressed in WT cells (fold change in metagenes 4, 5: 0.3, 0.66, respectively; Supplementary Table 11). C) UMAP representation of the original data (left) and of the bioIB compressed data (right). D) Heatmap showing scaled expression of metagenes 0, 1, 2 in individual cells of AD genotype, sorted by maximal normalized metagene expression. E) UMAPs of the bioIB-compressed data, colored by the expression of AD-associated metagenes 0, 1, 2. F) Fractions of representative genes of metagenes 0-5 that were found to be differentially expressed in at least 7 studies in the meta-analysis of the AD-associated transcriptome [34] (Methods). G) Heatmap of scaled expression values of six bioIB metagenes in six transcriptional clusters of astrocytes, defined in ref [29].

bioIB metagene hierarchy reflects the developmental connections between hematopoietic cell types.
scRNA-seq datasets expose a striking diversity of cell types and states, whose interconnections carry important biological information about cell state identity. For example, the hierarchical differentiation tree of hematopoietic stem and progenitor cells (HSPCs) reveals the phenotype and function of mature hematopoietic cells [35]. The bioIB metagene hierarchy can capture the developmental hierarchical structure of cell types, as we demonstrate here for scRNA-seq data of HSPC differentiation [36] (Figure 4A). BioIB is applied given the cell type signal over a subset of the data containing six major hematopoietic cell types: monocytes, neutrophils, mast cells, basophils, megakaryocytes and erythroid cells. This analysis produced 11 metagenes, where each of the six cell types is uniquely characterized by at least one metagene whose expression level is maximal within that particular cell type (Figure 4B, Supplementary Tables 12, 13). In addition, there are metagenes representing a transcriptional program shared by several developmentally linked cell types (Figure 4B,C; Supplementary Figure 6A). For example, metagenes 0 and 2 are specifically expressed in monocytes and neutrophils, respectively, while metagene 1 is activated in both (Figure 4B,C). The bioIB metagenes are biologically informative, uniting genes and processes characteristic of the corresponding cell types (Supplementary Figure 6B,C). For instance, metagene 0, specifically representing monocytes, features monocyte marker genes such as FABP5 [36] and WFDC17 [37,38] (Figure 4C,D), and is associated with pro-inflammatory macrophage activation, characteristic of monocyte function [39] (Figure 4E). Similarly, metagene 2, specifically characterizing neutrophils, includes markers like ITGB21 [36], CAMP, LTF, and ELANE [40] (Figure 4C,D) and is statistically enriched for neutrophil-mediated immunity and neutrophil activation (Figure 4E).
The hierarchical representation of the metagenes generated by bioIB induces a hierarchy of cell types that reflects the developmental links between them (Figure 4F,G; Supplementary Figure 6D). In particular, the first bifurcation in the metagene hierarchy generates two metagenes corresponding to the two major branches in the developmental hierarchy [36] (Figure 4A), one that includes Monocytes and Neutrophils, and another that includes Mast cells, Basophils, Megakaryocytes and Erythroid cells (Figure 4F,G). The second bifurcation splits the latter into two more specific metagenes, one including Mast cells and Basophils, and another including Megakaryocytes and Erythroid cells (Figure 4F,G). The third bifurcation further splits the metagene corresponding to the Mast-Baso branch into two separate metagenes that are more specific to either Mast cells or Basophils. Similarly, the fourth bifurcation splits the metagene corresponding to the Monocyte-Neutrophil branch into two separate Monocyte- and Neutrophil-associated metagenes. Finally, the last bifurcations split the metagene corresponding to the Megakaryocyte-Erythroid branch into four metagenes distinguishing between Megakaryocytes and Erythroid cells.
In conclusion, bioIB metagenes characterize distinct biological processes linked to the underlying cellular populations, while the metagene hierarchy unveils the biological relationships interconnecting these populations.

Discussion
We introduced bioIB, a method tailored to scRNA-seq data for clustering genes with respect to a set of known cellular labels, based on the Information Bottleneck (IB) algorithm. We have shown that bioIB metagenes, which are biologically interpretable, provide a meaningful representation that exposes the representative molecular pathways differentially expressed between cellular populations of interest. Given single-cell data from human differentiating macrophages, with overlapping signals of organ-of-origin and developmental time, bioIB successfully extracted two distinct compressed data representations, each depicting the respective biological processes. Next, we used bioIB to identify a subpopulation of disease-associated astrocytes in single-nucleus data from an Alzheimer's Disease mouse model, providing the genotype as the signal of interest. Finally, we have shown that beyond the final metagenes used for analysis, their hierarchical structure, produced by the iterative algorithm used by bioIB, exposes interconnections between metagenes and their respective cell types. We showcased this in the context of differentiating hematopoietic cells, where the bioIB hierarchical structure matched the expected developmental hierarchy of hematopoietic cell types. By definition, the bioIB output is sensitive to the representation of the differentially labeled cell clusters in the data. The sensitivity is with respect to both the number of genes enriched in each cluster and the clusters' size. This is expected, since the mutual information measure at the core of bioIB is sensitive to these quantities. That is, a cluster containing a small number of cells is expected to be characterized by fewer metagenes than a larger cluster. On the one hand, this can be biologically motivated, as the relative abundance of cells within clusters in the data can be associated with their relative importance; on the other hand, it may be a caveat when studying rare populations. For the latter case, a potential solution to this limitation could involve subsampling the data to obtain a uniform distribution of cluster sizes.
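As an illustration of the subsampling idea mentioned above, the following minimal Python sketch draws the same number of cells from every label before the distributions are estimated. It is one possible reading of the suggestion, not a procedure described by the authors; the function name `balanced_subsample` and the toy labels are hypothetical.

```python
import numpy as np

def balanced_subsample(labels, seed=0):
    """Return sorted cell indices with an equal number of cells per label."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    states, counts = np.unique(labels, return_counts=True)
    # Downsample every label to the size of the rarest one.
    n_per_state = counts.min()
    chosen = [
        rng.choice(np.flatnonzero(labels == state), size=n_per_state, replace=False)
        for state in states
    ]
    return np.sort(np.concatenate(chosen))

# Hypothetical toy labels: one rare and one abundant population.
labels = np.array(["rare"] * 3 + ["common"] * 20)
keep = balanced_subsample(labels)
balanced_labels = labels[keep]
```

The returned indices can be used to select rows of the count matrix and the label vector together, so that both inputs stay aligned. Downsampling to the rarest label discards data from abundant populations, which is the price of equalizing cluster sizes.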
As with the majority of computational methods, the bioIB output depends on a hyperparameter, $\beta$, controlling the level of compression. This is analogous to setting the number of clusters in a clustering algorithm, making this value data-specific. Here, the interpretability of the obtained metagenes allows the user to tune $\beta$ to obtain the desired number of informative metagenes. The current bioIB formulation is limited in its scalability to data size, as it relies on the exact solution to the IB problem. This can be overcome, as we have done in this study, by focusing the analysis on highly informative genes. A natural extension of bioIB that would overcome this limitation more generally is to use an existing variational IB solver that relies on neural approximation [42-44].
In future work, bioIB can be extended to extract multiple related data representations with respect to several variables of interest, based on the multivariate information bottleneck framework [45]. This paradigm might be particularly useful in analyzing gene expression data, allowing simultaneous extraction of multiple encoded signals and analysis of the corresponding biological processes. Furthermore, bioIB could be extended to produce signal-specific cell clusters, or metacells, retaining the maximal possible information about a target gene subset, such as disease biomarkers.
Here we demonstrated that bioIB can provide efficient characterization of signals of interest encoded in single-cell data, such as cell type, disease state or tissue-of-origin. BioIB can be generalized beyond single-cell gene expression data to additional types of biological data, such as bulk RNA-seq and proteomics data, to expose signal-specific optimally compressed representations. In summary, bioIB is expected to enrich biological data analysis by revealing the hierarchical, signal-specific structure encoded in complex datasets.

The bioIB algorithm
The bioIB algorithm provides a compressed representation of scRNA-seq data with respect to a signal of interest. To do so, it takes as input a cell ($N$) by gene ($G$) scRNA-seq measurement matrix, $D \in \mathbb{R}^{N \times G}$; following standard practice, we suggest providing log-normalized counts as input. An additional input to bioIB is a vector of cell labels related to the signal of interest, $S \in \mathbb{R}^{N \times 1}$, labeling every cell with one of $K$ possible cell states of interest, indexed by $\{1, \dots, K\}$, such that $S = \{s_i\}_{i=1}^{N}$ with $s_i \in \{1, \dots, K\}$. Given this input, the bioIB pipeline is composed of two main steps: (1) obtaining a probabilistic representation of the count matrix, and (2) using this representation as input for the Information Bottleneck (IB) algorithm.

Obtaining a probabilistic data representation
We use the input count matrix $D$ and signal of interest $S$ to obtain the probability distributions required for the IB algorithm: the conditional probability matrix of cell states given the genes, $p(y|x)$, and the gene probability vector, $p(x)$. To convert to probability space, we define the random variables $C \sim (\{c_1, \ldots, c_N\})$, $X \sim (\{x_1, \ldots, x_G\})$ and $Y \sim (\{y_1, \ldots, y_K\})$, respectively representing the $N$ cells, $G$ genes and $K$ cell states of interest. The empirical distributions of these are then constructed using the input data (see Equations 1-3).
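One possible construction of these distributions is sketched below in Python with NumPy. It is an illustration under stated assumptions rather than the bioIB implementation: the joint distribution of cells and genes is assumed to be proportional to the input expression values, labels are assumed to be integers in 0..K-1, and the function name `empirical_distributions` is hypothetical.

```python
import numpy as np

def empirical_distributions(counts, labels, n_states):
    """Return p(x) of shape (G,) and p(y|x) of shape (K, G).

    counts: (N, G) non-negative expression matrix (cells by genes).
    labels: (N,) integer cell-state label per cell, in 0..n_states-1.
    """
    counts = np.asarray(counts, dtype=float)
    labels = np.asarray(labels)
    # Assumption: joint p(c, x) is proportional to expression of gene x in cell c.
    p_cx = counts / counts.sum()
    # Marginalizing over cells gives the gene distribution p(x).
    p_x = p_cx.sum(axis=0)
    # One-hot indicator of each cell's state, shape (K, N).
    onehot = np.eye(n_states)[labels].T
    # Joint p(y, x): sum p(c, x) over the cells carrying label y.
    p_yx = onehot @ p_cx
    # Conditional p(y|x); genes with zero total expression must be removed first.
    p_y_given_x = p_yx / p_x
    return p_x, p_y_given_x
```

Each column of the returned `p_y_given_x` sums to one, so it can be read as the distribution over cell states associated with a single gene.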

The IB algorithm
The obtained probabilistic representations, $p(y|x) \in \mathbb{R}^{K \times G}$ and $p(x) \in \mathbb{R}^{G}$, are the input for the Information Bottleneck (IB) algorithm. IB 3 is a dimensionality reduction method, designed to extract the information from data $X$ that is relevant for the prediction of another related variable $Y$, such that the choice of $Y$ determines the relevant components of the signal encoded in $X$. Mutual information (MI) is used to evaluate both the extent of compression, $I(X; \hat{X})$, and the level of relevant information preserved in the compressed data, through $I(\hat{X}; Y)$. A trade-off parameter $\beta$ is introduced to control the amount of compression (distortion) allowed. Formally, the IB objective is given by $\hat{X} = \arg\min_{p(\hat{x}|x)} \left( I(X; \hat{X}) - \beta I(\hat{X}; Y) \right)$. Notably, when $\beta = 0$, all genes are merged into one cluster (full compression), and when $\beta = \infty$, the compressed data is identical to the original full data, so every cluster is associated with one particular gene, $\hat{X} = X$. For every value of $\beta$, the algorithm yields the conditional probability matrix of $M$ gene clusters, which we term metagenes, $\hat{x} \in \hat{X}$, given the genes, $x \in X$, $p(\hat{x}|x) \in \mathbb{R}^{M \times G}$, representing the optimal mapping of genes to metagenes, and the conditional probability matrix of cell states given the metagenes, $p(y|\hat{x}) \in \mathbb{R}^{K \times M}$. For the full mathematical description and the associated proofs for the information bottleneck algorithm, see refs 3,4.
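To make the two mutual information terms concrete, the following sketch evaluates the objective for a given gene-to-metagene mapping. It is an illustration only: the function names and the argument name `beta` for the trade-off parameter are hypothetical, and mutual information is computed in nats.

```python
import numpy as np

def mutual_information(p_joint):
    """I(A;B) in nats for a joint distribution given as a 2-D array."""
    p_joint = np.asarray(p_joint, dtype=float)
    p_a = p_joint.sum(axis=1, keepdims=True)   # marginal over columns, (A, 1)
    p_b = p_joint.sum(axis=0, keepdims=True)   # marginal over rows, (1, B)
    mask = p_joint > 0                          # 0 * log(0) is treated as 0
    return float(np.sum(p_joint[mask] * np.log(p_joint[mask] / (p_a @ p_b)[mask])))

def ib_objective(p_x, p_y_given_x, q_t_given_x, beta):
    """Compression term minus beta times the relevance term.

    q_t_given_x: (M, G) mapping of genes to metagenes, columns sum to 1.
    """
    p_tx = q_t_given_x * p_x          # joint of metagenes and genes, (M, G)
    p_ty = p_tx @ p_y_given_x.T       # joint of metagenes and cell states, (M, K)
    return mutual_information(p_tx) - beta * mutual_information(p_ty)
```

With two equally likely genes that each determine a distinct cell state, the identity mapping gives log 2 for both terms, whereas merging both genes into one metagene gives zero for both.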
There are many ways to solve the IB objective (including recently introduced neural approximators [42][43][44]). Here we focus on the Blahut-Arimoto algorithm 46, described below. IB can provide either a series of solutions at different compression levels, using a reverse-annealing process, or a single solution with a flat division of the data points into a predefined number of clusters. Convergence is defined by a threshold parameter on the difference between the previous and current iterations. For a given $\beta$, the algorithm converges to a stable solution, providing two output probability matrices that define $\hat{X}$: $p(\hat{x}|x)$ and $p(y|\hat{x})$. $p(\hat{x}|x)$ determines the mapping between the original data points $x \in X$ and the data clusters $\hat{x} \in \hat{X}$, whereas $p(y|\hat{x})$ defines the association between the data clusters, $\hat{x} \in \hat{X}$, and the groupings of the signal of interest, $y \in Y$.
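A minimal sketch of the iteration at a fixed trade-off value is given below. It assumes the standard self-consistent IB update equations (the mapping of a gene to a cluster is proportional to the cluster probability times the exponential of minus `beta` times the KL divergence between their cell-state distributions). The names `ib_fixed_beta`, `decoder`, `tol`, `max_iter` and the numerical floor `EPS` are choices made for this illustration, not part of the bioIB package.

```python
import numpy as np

EPS = 1e-12  # numerical floor; an assumption of this sketch

def decoder(p_x, p_y_given_x, q):
    """Return cluster probabilities (M,) and cell-state-given-cluster matrix (K, M)."""
    p_t = q @ p_x
    p_y_given_t = (p_y_given_x * p_x) @ q.T / np.maximum(p_t, EPS)
    return p_t, p_y_given_t

def ib_fixed_beta(p_x, p_y_given_x, q_init, beta, tol=1e-6, max_iter=500):
    """Iterate the IB updates from the (M, G) mapping q_init until convergence."""
    q = np.array(q_init, dtype=float)
    # sum_y p(y|x) log p(y|x): the gene-specific term of the KL divergence.
    neg_entropy = (p_y_given_x * np.log(p_y_given_x + EPS)).sum(axis=0)
    for _ in range(max_iter):
        p_t, p_y_given_t = decoder(p_x, p_y_given_x, q)
        # kl[m, g] = KL( p(y|gene g) || p(y|cluster m) )
        kl = neg_entropy[None, :] - np.log(p_y_given_t + EPS).T @ p_y_given_x
        logits = np.log(np.maximum(p_t, EPS))[:, None] - beta * kl
        q_new = np.exp(logits - logits.max(axis=0, keepdims=True))
        q_new /= q_new.sum(axis=0, keepdims=True)
        # Threshold on the change between previous and current iterations.
        converged = np.abs(q_new - q).max() < tol
        q = q_new
        if converged:
            break
    return q, decoder(p_x, p_y_given_x, q)[1]
```

In this sketch, genes with identical cell-state profiles end up with identical cluster memberships, and setting `beta=0` removes all dependence of the mapping on the gene.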

b. Reverse-annealing
In the process of reverse-annealing, the IB algorithm is initialized with a compressed representation $\hat{X}$ that is identical to the original data $X$ and with a large value of $\beta$. Next, we run the algorithm iteratively, while reducing $\beta$. Upon convergence, we initialize the next iteration with the final $p(\hat{x}|x)$ mapping achieved in the previous step, and with $\beta - \Delta\beta$, for a small step size $\Delta\beta$.
Following this procedure, we achieve a series of solutions for every value of $\beta$: $\forall \beta \in \{\beta_{min}, \beta_{min} + \Delta\beta, \ldots, \beta_{max}\}$. At the end of this process $\beta_{min} \to 0$, corresponding to maximal compression, where $\hat{X}$ consists of a single point, uniting all the original data points in $X$. Reverse-annealing ultimately yields a hierarchical structure that mirrors several important aspects of the identified clusters, such as their informativity for discrimination between the labels of interest $Y$, as well as the interconnections among them. It is important to note that $\beta_{max}$ controls the maximal number of metagenes, namely the number of end-nodes in the hierarchy, and modifying it does not affect the hierarchical structure itself.
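The reverse-annealing schedule can be sketched as a loop that starts from the identity mapping at the largest trade-off value and warm-starts each subsequent run from the previous solution. This is an illustrative sketch under the standard IB update equations. The function names, the arguments `beta_max`, `beta_min` and `step`, and the tolerance settings are assumptions of the example.

```python
import numpy as np

EPS = 1e-12  # numerical floor; an assumption of this sketch

def ib_converge(p_x, p_y_given_x, q, beta, tol=1e-6, max_iter=500):
    """Iterate the IB updates at fixed beta from the (M, G) mapping q."""
    p_yx = p_y_given_x * p_x                                   # joint, (K, G)
    neg_entropy = (p_y_given_x * np.log(p_y_given_x + EPS)).sum(axis=0)
    for _ in range(max_iter):
        p_t = q @ p_x                                          # cluster probabilities
        p_y_t = p_yx @ q.T / np.maximum(p_t, EPS)              # (K, M)
        kl = neg_entropy[None, :] - np.log(p_y_t + EPS).T @ p_y_given_x
        logits = np.log(np.maximum(p_t, EPS))[:, None] - beta * kl
        q_new = np.exp(logits - logits.max(axis=0, keepdims=True))
        q_new /= q_new.sum(axis=0, keepdims=True)
        converged = np.abs(q_new - q).max() < tol
        q = q_new
        if converged:
            break
    return q

def reverse_anneal(p_x, p_y_given_x, beta_max, step, beta_min=0.0):
    """Return a list of (beta, mapping) pairs from beta_max down to beta_min."""
    q = np.eye(len(p_x))          # start with every gene in its own cluster
    solutions = []
    for beta in np.arange(beta_max, beta_min - step / 2, -step):
        # Warm start: reuse the mapping obtained at the previous, larger beta.
        q = ib_converge(p_x, p_y_given_x, q, float(beta))
        solutions.append((float(beta), q.copy()))
    return solutions
```

Tracking how the cell-state distributions of the clusters coincide as the trade-off value decreases along `solutions` is one way to read off the merge order that forms the hierarchy.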

c. Clustering
To achieve the division of the data points $x \in X$ into a defined number of clusters $M$, the IB algorithm is initialized with a random mapping of $X$ to $M$ clusters, generating a binary conditional probability matrix $p(\hat{x}|x) \in \mathbb{R}^{M \times G}$. The corresponding $p(\hat{x})$ and $p(y|\hat{x})$ are obtained using basic probability rules and Bayes' theorem. Since this process introduces a dependence of the output on the initialization, we randomly initialize the algorithm 100 times and select the mapping that minimizes the objective function (Eq. 1) 4.
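The initialization step can be illustrated as follows. The sketch draws a random hard assignment of genes to clusters and derives the cluster probabilities and the cell-state distribution of each cluster from it. The function names and the use of `numpy.random.default_rng` are choices made for this example.

```python
import numpy as np

def random_hard_encoder(n_genes, n_clusters, rng):
    """Binary (M, G) mapping: each gene is assigned to one random cluster."""
    assignment = rng.integers(n_clusters, size=n_genes)
    q = np.zeros((n_clusters, n_genes))
    q[assignment, np.arange(n_genes)] = 1.0
    return q

def cluster_distributions(p_x, p_y_given_x, q):
    """Cluster probabilities by marginalization, cell states by Bayes' theorem."""
    p_t = q @ p_x                                              # (M,)
    # Bayes: p(gene | cluster) = p(cluster | gene) p(gene) / p(cluster).
    p_x_given_t = (q * p_x) / np.maximum(p_t, 1e-12)[:, None]  # (M, G)
    # p(state | cluster) = sum over genes of p(state | gene) p(gene | cluster).
    p_y_given_t = p_y_given_x @ p_x_given_t.T                  # (K, M)
    return p_t, p_y_given_t
```

Repeating the draw with different seeds, running the IB iteration from each start and keeping the run with the lowest objective value corresponds to the restart strategy described above.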

Recovering single-cell metagene expression:
The bioIB output provides the mapping of the original count matrix $D \in \mathbb{R}^{N \times G}$ to its compressed representation $\hat{D} \in \mathbb{R}^{N \times M}$. Namely, we obtain the weighted expression of genes, $x \in X$, using the mapping $p(x|\hat{x})$, given by $\hat{D}_{ij} = \sum_{k} D_{ik} \, p(x_k | \hat{x}_j)$. As a result, we obtain a cell ($N$) by metagene ($M$) compressed data matrix, $\hat{D} \in \mathbb{R}^{N \times M}$, such that $\hat{D}_{ij}$ represents the expression level of metagene $j$ in cell $i$.
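As an illustration, the weighted sum above reduces to a single matrix product once the gene-given-metagene weights are derived from the gene-to-metagene mapping by Bayes' theorem. The function name `metagene_expression` and its argument names are hypothetical.

```python
import numpy as np

def metagene_expression(counts, p_x, q_t_given_x):
    """Project an (N, G) expression matrix onto M metagenes, returning (N, M).

    q_t_given_x: (M, G) mapping of genes to metagenes, columns sum to 1.
    """
    counts = np.asarray(counts, dtype=float)
    p_t = q_t_given_x @ p_x                                    # metagene probabilities
    # Bayes: weight of gene k in metagene j, each row sums to 1.
    p_x_given_t = (q_t_given_x * p_x) / np.maximum(p_t, 1e-12)[:, None]
    # Entry (i, j) is the weighted sum of cell i's expression over genes.
    return counts @ p_x_given_t.T
```

For example, with three genes of probability 0.2, 0.3 and 0.5, where the first two form one metagene, a cell with expression (1, 2, 3) receives the value 1 × 0.4 + 2 × 0.6 = 1.6 for that metagene.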

Data preprocessing
We obtained the dataset of the multi-organ atlas of human differentiating macrophages from ref. 24, available at https://developmental.cellatlas.io/fetal-immune. We downloaded the dataset of myeloid cells and further filtered the data to include only cells of macrophage cell types. Using the gestational week label, we assigned the cells to two groups, "Early" and "Late" (Early: < 14 weeks; Late: >= 14 weeks).
To avoid bias towards less represented organ groups, we filtered out cells that originated from organs with fewer than 2800 total cells, resulting in cells originating from five organs: kidney (KI), liver (LI), skin (SK), spleen (SP) and yolk sac (YS). Following basic preprocessing for low-quality cells using scanpy's 47 `sc.pp.filter_cells(min_genes=200)`, the data used for bioIB analysis included 108197 cells. We further reduced the data to 500 highly variable genes using scanpy's 47 `sc.pp.highly_variable_genes()` with the default parameters.

Benchmarking
To evaluate the predictive performance of bioIB metagenes, we trained linear SVMs on the compressed data generated by bioIB and compared the success rates in predicting the cellular labels of gestational week and organ with the ones achieved by two settings of k-means gene clusters, computed over (1) the original log-normalized count matrix X, and (2) the probability matrix p(y|x), that contains the information about the labels and represents the original input to bioIB.
Astrocytes from a murine model of Alzheimer's Disease (AD)

Data preprocessing
We obtained single-nucleus RNA-seq measurements of astrocytes from an AD mouse model and wild-type (WT) mice from ref. 29, available at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE143758. Following normalization and log-transformation, we performed Leiden clustering using Scanpy's `sc.tl.leiden()` function with default parameters. We then retained the cell clusters with enriched expression of the astrocytic markers Gfap and Slc1a3, resulting in n=7036 cells. As a last step, we extracted highly informative genes with respect to the signal of interest, the disease state, encoded by the provided genotype annotation $Y = [\mathrm{AD}, \mathrm{WT}]$. This was done by retaining the 1000 genes with the highest information gain (IG) values, where the IG is defined using the mutual information between the gene expression probability $p(x)$ and the genotype probability $p(y)$, $IG(x) = p(x) \, D_{KL}(p(y|x) \,\|\, p(y))$.
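The gene selection step can be sketched as follows, taking the information gain of a gene to be its probability multiplied by the KL divergence between the label distribution given that gene and the overall label distribution. The function names and the small constant `eps`, which guards the logarithm, are assumptions of this illustration.

```python
import numpy as np

def information_gain(p_x, p_y_given_x, eps=1e-12):
    """Per-gene information gain in nats; p_y_given_x has shape (K, G)."""
    p_y = p_y_given_x @ p_x                                   # label marginal, (K,)
    log_ratio = np.log((p_y_given_x + eps) / (p_y[:, None] + eps))
    kl = (p_y_given_x * log_ratio).sum(axis=0)                # KL per gene, (G,)
    return p_x * kl

def top_informative_genes(p_x, p_y_given_x, n_top):
    """Indices of the n_top genes with the highest information gain."""
    ig = information_gain(p_x, p_y_given_x)
    return np.argsort(ig)[::-1][:n_top]
```

Summing the per-gene values recovers the mutual information between genes and labels, so the ranking orders genes by their contribution to that total.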

Constructing a list of AD-related genes
AD-associated genes were defined as differentially expressed genes in at least 7 of the 15 AD-APP mouse model studies as part of the AD meta-analysis resource, which has summarized and compared the differential expression results from a wide range of AD transcriptomic studies 34 .

Data preprocessing
We obtained the dataset of the differentiating hematopoietic cell types collected by ref. 36 and processed by ref. 35. Data were downloaded using the Cospar package (https://cospar.readthedocs.io/en/latest/index.html) with the function `cs.datasets.hematopoiesis()`. We filtered out the undifferentiated cells, as well as the differentiated cell types with fewer than 300 total cells, resulting in a data subset of 27387 cells. We further reduced the data to the highly variable genes using scanpy's 47 `sc.pp.highly_variable_genes()` with the default parameters, resulting in 1803 genes.

Method application
We calculated the IG values (as above) for the highly variable genes and used as input for bioIB the 300 genes with the highest IG values.
Figure 1. Elucidating meaningful metagenes underlying the signal of interest using bioIB. A-D) The bioIB pipeline. A) Input; bioIB takes as input a count matrix and a cellular annotation vector, labeling every cell with a state, representing the signal of interest. For example, if the signal of interest is cell type, these labels should annotate every cell with the corresponding cell type. B) Distribution extraction; The provided count matrix and the cellular annotation vector are used to estimate the distributions of the random variables representing the genes ($X$) and the cell states of interest ($Y$). C) Information Bottleneck; The probabilities obtained in (B) are used as input for the Information Bottleneck (IB) algorithm, which yields the optimal mapping of genes to metagenes, by optimizing the trade-off between complexity and accuracy. This is achieved by optimizing for $\hat{X}$ that minimizes the mutual information between the input genes ($X$) and the metagenes ($\hat{X}$), $I(X; \hat{X})$, while maximizing the mutual information between the metagenes and the cell states of interest ($Y$), $I(\hat{X}; Y)$. D) Output; The output of bioIB is a probabilistic mapping between genes and metagenes, scoring all the input genes by their contribution to each metagene. bioIB also provides a cell-to-metagene compressed representation of the input matrix, summarizing the expression of metagenes in the input single cells. E) Possible downstream analyses of the compressed data achieved by bioIB: enhancing the signal of interest (such as cell type or genotype) in the data, extracting informative genes underlying the heterogeneity between cell type labels, elucidating the interconnections between metagenes and the corresponding cell types. Figure created with BioRender.com.

Figure 2. bioIB extracts distinct molecular signatures underlying the signals related to developmental stage and organ-of-origin in developing macrophages. A) Schematic representation of the analyzed dataset 24. The scRNA-seq data of macrophages from 5 distinct organs (kidney, liver, skin, spleen, yolk sac) and 11 gestational weeks (4, 7-12, 14-17) was analyzed using bioIB with $Y$ set either as the developmental stage (Early: < 14 weeks; Late: >= 14 weeks), or as the organ-of-origin, which resulted in two distinct signal-specific compressed representations of the same data. Figure created with BioRender.com. B,C) UMAP representation of the data compressed by bioIB with $Y$ set to developmental stage (B) or organ-of-origin (C), representing single-cell expression of the generated metagenes, colored by the organ (on the left) and by the developmental stage (on the right). D,E) SVM success rates in predicting the gestational week (D) or the organ-of-origin (E) based on bioIB metagenes created with $Y$ set to the respective label; bioIB performance is compared to results based on k-means gene clusters and gene clusters generated by agglomerative clustering based on the log-normalized count matrix. The x-axis denotes the number of gene clusters or metagenes produced by each method. F,H) bioIB metagene hierarchy produced by reverse annealing with $Y$ set to be developmental stage (F) or organ-of-origin (H), showing the conditional probabilities of the late stage (F) or yolk sac and liver (H) given metagene expression. G) Gene Ontology biological processes significantly enriched among genes representative of metagene 0 shown in (F). In (H), metagene 0, which is mostly expressed in the yolk sac and features characteristic organ markers like CGA 26, is merged into one branch with metagenes 1 and 2, which are elevated in the liver and feature apolipoprotein genes involved in liver-associated cholesterol efflux 27. I) Heatmap showing scaled metagene expression in the original scRNA-seq matrix among organs. YS: yolk sac, LI: liver, SP: spleen.

Figure 3. bioIB metagenes identify the AD-associated astrocytes. A) Schematic representation of the analyzed dataset 29 and of the analysis pipeline shown in this section. The snRNA-seq data of astrocytes from an AD mouse model was analyzed using bioIB with $Y$ set to genotype, which resulted in identification of a specific subpopulation of disease-associated astrocytes. Figure created with BioRender.com. B) bioIB metagene hierarchy produced given the preprocessed snRNA-seq data, relating to the AD group. The defined metagenes exhibit differential expression patterns between AD and WT, with metagenes 0, 1 and 2 overexpressed in AD cells (Fold change increase in metagenes 0, 1, 2: 1.9, 3.9, 6, respectively), a neutral metagene 3 (Fold change increase in metagene 3 = 0.91), and metagenes 4 and 5 overexpressed in WT cells (Fold change increase in metagenes 4, 5: 0.3, 0.66, respectively; Supplementary table 11). C) UMAP representation of the original data (left) and of the bioIB compressed data (right). D) Heatmap showing scaled expression of metagenes 0, 1, 2 in individual cells of AD genotype, sorted by maximal normalized metagene expression. E) UMAPs of the bioIB-compressed data, colored by the expression of AD-associated metagenes 0, 1, 2. F) Fractions of representative genes of metagenes 0-5 that were found to be differentially expressed in at least 7 studies in the meta-analysis of the AD-associated transcriptome 34 (Methods). G) Heatmap of scaled expression values of six bioIB metagenes in six transcriptional clusters of astrocytes, defined in ref. 29.

Figure 4. bioIB metagene hierarchy reflects the connections between the developmentally linked hematopoietic cell types. A) Schematic representation of the analyzed dataset 36. The single-cell data from differentiating blood cells yielded the depicted developmental hierarchy of the hematopoietic cell types. Figure created with BioRender.com. B) Heatmap showing the scaled expression (z-score) of the bioIB metagenes across cell types. C) Heatmap showing scaled expression of the top representative genes of metagenes 0-2 across monocytes and neutrophils. Metagenes 0 and 2 are specifically expressed in monocytes and neutrophils, respectively, while metagene 1 is expressed in both. D) SPRING 41 visualizations of the hematopoietic dataset colored by cell type (left panel) and by the expression of metagenes 0-2 (three panels on the right). MG: metagene. E) Gene Ontology enrichment results showing biological process categories significantly enriched in metagene 0 (left) and 2 (right). F) Bifurcation plots of further compression of the 11 metagenes shown in (B) relative to Monocytes, Mast cells and Megakaryocytes. Metagenes characterizing developmentally linked cell types are linked in the metagene hierarchy. For example, metagene 0 representing monocytes diverges from the same branch as metagene 2, representing Neutrophils. Bifurcation plots relative to Neutrophils, Basophils and Erythroid cells are provided in Supplementary Figure 4D. G) Metagene hierarchy inferred from the bioIB reverse annealing output shown in (F) and in Supplementary Figure 4D. The cell type associated with every metagene is the one maximizing the conditional probability of a cell type given this metagene, $p(y|\hat{x})$.