## Abstract

Comprehensive molecular characterization of cancer subtypes is essential for predicting clinical outcomes and searching for personalized treatments. We present bnClustOmics, a statistical model and computational tool for multi-omics unsupervised clustering, which serves a dual purpose: Clustering patient samples based on a Bayesian network mixture model and learning the networks of omics variables representing these clusters. The discovered networks encode interactions among all omics variables and provide a molecular characterization of each patient subgroup. We conducted simulation studies that demonstrated the advantages of our approach compared to other clustering methods in the case where the generative model is a mixture of Bayesian networks. We applied bnClustOmics to a hepatocellular carcinoma (HCC) dataset comprising genome (mutation and copy number), transcriptome, proteome, and phosphoproteome data. We identified three main HCC subtypes together with molecular characteristics, some of which are associated with survival even when adjusting for the clinical stage. Cluster-specific networks shed light on the links between genotypes and molecular phenotypes of samples within their respective clusters and suggest targets for personalized treatments.

**Author summary** Multi-omics approaches to cancer subtyping can provide more insights into molecular changes in tumors compared to single-omics approaches. However, most multi-omics clustering methods do not take into account that gene products interact, for example, as parts of protein complexes or signaling networks. Here we present bnClustOmics, a Bayesian network mixture model for unsupervised clustering of multi-omics data, which can represent dependencies among molecular changes of various omics types explicitly. Unlike other approaches that use data from public interaction databases as ground truth, bnClustOmics learns the dependencies between genes from the analyzed multi-omics dataset. At the same time, our approach can also account for prior knowledge from public interaction databases and use it to guide network learning without losing the ability to learn new dependencies. We applied bnClustOmics to a multi-omics HCC dataset and identified three subtypes similar to those identified in other HCC studies. The cluster-specific networks learned by bnClustOmics revealed additional insights into the molecular characterization of the discovered subgroups and highlighted the changes in signaling networks leading to distinct HCC phenotypes.

## Introduction

Cancer is a complex disease and one of the leading causes of death worldwide. Over the last decades, much research was devoted to discovering cancer subtypes based on genomic and transcriptomic data [1–3]. Molecular subtyping approaches based on gene expression have been helpful for the identification of markers associated with clinical outcomes and facilitated the search for targeted therapies [4,5]. More recently, cancer subtyping has been based on integrating multiple different omics types [6–9]. Multiple tools have been developed to integrate multi-omics data and learn interaction networks to understand what drives oncogenesis [10,11]. However, our understanding of how heterogeneous genetic alterations in cancer cells affect signaling pathways and lead to a few disease phenotypes is still far from complete [12,13]. One major obstacle is the missing connection between methods for network discovery and approaches to molecular subtyping. Almost all existing methods focus on only one of these two tasks.

Only a few multi-omics clustering methods include interactions between gene products into the model explicitly. Some of them are designed for single omics types [14,15] or use a supervised approach for clustering [16]. PARADIGM [17] and BiCON [18] unsupervised clustering (of patient samples) while accounting for the fact that genes products can interact with each other and that interactions may differ between patient groups. However, these methods rely entirely on existing protein-protein interaction (PPI) databases and consider them as ground truth. Instead of learning the network from the dataset, they map the omics data onto interactions from existing databases by considering pairwise dependencies between genes. Such approaches are prone to mistakes contained in databases and do not allow the discovery of unknown interactions.

When learning gene regulatory networks, the Bayesian network framework is often used instead of pairwise correlation analysis since it can uncover direct interactions and, in some cases, learn their directions [19,20]. A Bayesian network mixture model was used for clustering of pan-cancer mutation data [14], but never applied to any other omics types or integrated multi-omics model for unsupervised clustering.

Here, we extend the model of Kuipers et al. [14] to multi-omics data comprising discrete and continuous data types. We present bnClustOmics, an unsupervised clustering method based on the assumption that the cancer subtype can be represented as a Bayesian network consisting of omics variables of various types. Our model reflects the consensus view of cancer mechanisms, in which genetic alterations disrupt normal cell signaling and activate oncogenic pathways. Biological experiments have shown that mutations in cancer cells result in altered interactions between proteins, including phosphoproteins [21]. Thus, modeling the subtype-specific changes in the interactome may improve the clustering model. With cancer subtypes being modeled as Bayesian networks, bnClustOmics can detect the signal from interactions that differ in networks representing different subtypes. A major advantage of bnClustOmics compared to other methods for multi-omics clustering is that the output includes networks (learned *de novo*) representing discovered clusters which can be considered further in downstream analyses and shed light on subtype-specific cancer mechanisms.

We demonstrated in simulation studies that many commonly used clustering methods, including those specifically designed for multi-omics data, have a limited ability to detect a signal from changed interactions, whereas the ability of bnClustOmics to do so improves its clustering accuracy. In particular, we compared traditional clustering approaches with three different approaches designed for multi-omics data. Among multi-omics approaches, we selected methods that demonstrated good performance in previous benchmarking studies [6,22]. iClusterPlus [23] assumes that only a limited number of features are relevant and uses regularization to select features, while CIMLR [24] can incorporate the complete genome without enforcing sparsity. CIMLR was expected to perform better in a broad range of settings due to its claimed ability to learn the importance of different omics types from the analyzed dataset [24], however Duan et al. reported controversial results in this regard [22]. We also added MOFA [25] to benchmarking since it demonstrated good results with regard to feature selection in our simulations. bnClustOmics is only feasible for a limited number of omics features, hence the importance of each omics type is implicitly affected by the feature selection method. We tried several approaches to select relevant features and compared the performance of bnClustOmics using a selected subset to all other clustering methods applied to a non-reduced set.

We applied bnClustOmics to a multi-omics dataset from hepatocellular carcinoma (HCC) patient biopsies [26]. HCC is the most common type of primary liver cancer, which is the fourth most common cause of cancer-related mortality worldwide [27]. We discovered three clusters of HCC patients based on five omics types: mutations and copy number changes (both genome), transcriptome, proteome, and phosphoproteome. The number and molecular characteristics of the three discovered groups confirm many findings from previous HCC studies, including an analysis of the same HCC dataset [26]. In addition to cluster assignments, we analyzed the cluster-specific networks learned by bnClustOmics and scrutinized specific edges which connect changes in the genome to abnormal expression of transcripts, proteins, and phosphorylation sites. Furthermore, we identified hub nodes, i.e. genes with the most stable and most varying neighbors across cluster-specific networks based on the posterior probabilities of the edges. Cluster-specific connections between omics variables provide insights into the molecular characteristics underlying HCC subtypes and suggest targets for personalized therapies.

## 1 Results and Discussion

### 1.1 Model and workflow

We model a cancer subtype as a Bayesian network, whose nodes represent different omics measurements of the same set of genes. The HCC dataset [26] includes five omics types, namely mutations and copy number changes (both genome, denoted *M* and *CN*), transcriptome (*T*), proteome (*P*), and phosphoproteome (*PP*). The edges in the network represent statistical dependencies among all observations across all omics types. Such dependencies are not limited to a single biological interpretation. For example, an edge in the network might represent a physical interaction between proteins, a regulatory relationship between a transcription factor and its target, a functional interaction or a co-expression pattern. A functional interaction denotes an indirect association where two gene products do not physically interact but are jointly involved in the same cellular process [28].

By design, our integrative model prohibits any edges from nodes of continuous data types (*T*, *P*, *PP*) to nodes of binary or ordinal data types (*M*, *CN*). Prohibiting interactions between certain omics types avoids overfitting and results in more interpretable networks. We only allow edges aligned with the information flow of the central dogma of molecular biology [29].

At the first step of the analysis, we perform feature selection from all features of all available omics types (Fig 1). To analyze the HCC dataset, we selected features based on multi-omics factor analysis (MOFA, [25]) latent factor analysis, differential gene expression (DGE) analysis, and prior knowledge about signaling networks (Section 2.13).

bnClustOmics uses a Bayesian network mixture model and employs the EM algorithm [14] to cluster patient samples and learn the networks representing those clusters. Unlike other multi-omics clustering methods, bnClustOmics does not rely on interactions from databases, but learns Bayesian networks from data *de novo* using a Bayesian approach [14,30,31]. However, it is possible to construct blacklist and penalization matrices that incorporate prior information about interactions between selected features and guide network learning in subsequent steps. In the extreme case, we could blacklist all edges which are not found in a specific database. However, blacklisting prevents the discovery of new interactions. Instead, we can use edge-specific penalization factors to modify the prior probability distribution of the graph structure and lower the probability of such edges appearing in the resulting graphs. The penalization matrix also provides an easy way to incorporate a confidence score which is often assigned to interactions in the PPI databases.

bnClustOmics takes as input the observed values of the selected omics features for all patients, the number of clusters, and optional blacklist and penalization matrices. As output, we obtain cluster assignments for all patient samples, cluster-specific networks consisting of omics variables, and the log-likelihood, AIC and BIC scores of the estimated model. The AIC and BIC can be used to determine the optimal number of clusters. In addition, the Bayesian method used for structure learning provides estimates of posterior probabilities of all edges in the discovered networks. The statistical model presented in this work is implemented in the R package bnClustOmics and available at the GitHub repository https://github.com/cbg-ethz/bnClustOmics.

### 1.2 Benchmarking

We compared the performance of several clustering algorithms to bnClustOmics when the data generating model is a mixture of Bayesian networks. For comparison, we selected several general clustering methods as well as methods specifically designed for integration and clustering of multi-omics data, including kmeans [32], hclust [32], mclust [33], iClusterPlus [23], CIMLR [24], and MOFA [25].

For each set of simulation settings, we generated 30 – 50 Bayesian network mixtures (Section 2.2), where each directed acyclic graph (DAG) *G _{k}* consists of

*n*Gaussian and

_{c}*n*Bernoulli random variables (S1 Appendix). We used the adjusted Rand index (ARI, [34]) between estimated cluster assignments and the ground truth membership as a measure of accuracy.

_{b}For large sample sizes, bnClustOmics reaches a high accuracy even in a setting where the difference between cluster centers is small (Fig 2A), while the other algorithms fail to discover cluster assignments when cluster centers are very close to each other. Accuracy improves when the distances between centers of mixture components become larger for all algorithms except CIMLR. In our simulation settings, CIMLR failed to detect the signal from the continuous nodes in the presence of binary nodes. When we removed binary nodes from the simulated datasets and applied CIMLR to the continuous part only, its accuracy improved considerably.

For small sample sizes, all methods demonstrate lower clustering accuracy (Fig 2B), and bnClustOmics outperforms the other approaches in the majority of cases. We attribute this outperformance to the ability of bnClustOmics to detect the signal not only from differences between cluster centers but also structural differences between graphs representing clusters.

Next, we fix the distance between the centers of the distributions at a medium value and analyze the performance of different algorithms with four different values *K* of the number of clusters. The clustering accuracy of bnClustOmics does not become worse with increasing number of clusters *K*, while for the other algorithms, the accuracy decreases (Fig 2C). Among the other algorithms, CIMLR applied to only a continuous part of the data performs the best. However, its accuracy is again substantially worse when the binary data is included.

Since bnClustOmics is only computationally feasible for networks with a limited number of nodes, its performance may strongly depend on the selection of the relevant features. To assess whether reducing the number of features affects the clustering accuracy of bnClustOmics, we generated Bayesian network mixtures with *n _{c}* = 1000 Gaussian nodes and

*n*= 100 binary nodes from

_{b}*K*= 3 mixture components and applied all clustering approaches to this dataset. All algorithms were applied to the complete dataset, except bnClustOmics which was applied to only a subset of the data consisting of 150 continuous features (S2 Appendix) and all binary features with at least one non-zero observation. We found that despite using significantly fewer variables, bnClustOmics outperforms the other methods (Fig 2D) when the distances between cluster centers are medium to large. For small distances, all methods perform poorly.

So far, we assumed that the number of clusters is known; however, in a fully unsupervised setting, this is not the case. bnClustOmics allows estimating the number of clusters *K* using either the AIC or BIC score. Our simulations indicate that for small sample size, AIC works better (Fig 3A), while for large sample size, BIC shows better results (Fig 3B).

Next, we tested the ability of bnClustOmics to reconstruct Bayesian networks representing discovered clusters. We generated 50 Bayesian network mixtures with *K* = 4 components and unequal weights, such that the four clusters contain 150, 100, 50, and 20 observations, respectively. The Bayesian approach yields estimated *maximum a posteriori* (MAP) structures, i.e. graphs which have the highest scores of all considered structures and represent the best fit to the data. In addition to MAP graphs, we also estimated the consensus structures (Section 2.5), which consist of edges whose posterior probabilities are higher than a certain threshold.

The number of observations per cluster correlates positively with the accuracy of the learned MAP structures, as progressively higher TPR and lower FDR levels were reported for MAP structures corresponding to a higher number of observations (Fig 3C). However, the FDR of MAP structures is rather high, especially for the cluster 4 with the smallest number of observations. We observe that consensus graphs help reduce FDR compared to the MAP estimates, although at the cost of reducing the true positive rate (TPR). The structural Hamming distance (SHD) is smaller for consensus structures than for MAP structures (S1 Fig). In our simulation, a posterior threshold of 0.7 minimizes the SHD for *N _{Zk}* ∈ {150, 100, 50} and 0.95 for

*N*= 20.

_{Zk}The Bayesian approach allows us to include prior knowledge about known interactions and guide *de novo* network learning. In the analysis of mutation data, an edge penalization matrix was used by Kuipers et al. [14] to include prior information from the database STRING [35]. The edge penalization matrix is used to modify the prior over structures, such that penalized edges have a lower chance to appear in the discovered graphs (Section 2.9). PPI databases contain known interactions between genes but most often do not describe the context in which a particular interaction occurs. Hence, if interactions differ between unknown cancer subtypes, we cannot learn them using a database alone. To assess to which extent the penalization matrix can improve network discovery, we constructed a simulated database of interactions by taking the union of all edges in the ground truth structures and introducing 10% of mistakes which model false-positive (5%) and false-negative (5%) interactions in databases. The entries of the penalization matrix corresponding to interactions from the simulated database were not penalized; all other edges were penalized by a factor of two.

The usage of an edge penalization matrix resulted in MAP and consensus structures containing fewer false-positive edges and more true positives than corresponding structures obtained without using a penalization matrix (Fig 3C). Limited sample size is a common problem of biological data, and proteome and phosphoproteome data are generally scarce. At the same time, extensive databases exist which include known protein-protein interactions and regulatory relationships identified in biological or computational studies. Hence, including information from such databases can be helpful for network reconstruction.

### 1.3 HCC patient subtyping

We analyzed the HCC multi-omics dataset [26] comprising 50 biopsies from 48 patients and including five omics types, namely mutations and CNAs (both genome), transcriptome, proteome, and phosphoproteome. In order to apply bnClustOmics, we first performed feature selection as follows. To select *M* features, we used a list of significantly mutated genes in the analyzed cohort identified by Ng et al. [26]. In addition, we included possible drivers of HCC identified in other studies [36–38]. To select continuous features, we applied MOFA and performed latent factor analysis. In addition, we included the *P* and *PP* features, which are differentially expressed/phosphorylated in tumor samples and either are present in the kinase-substrate database, or are known transcription factors according to the Omnipath database [39] (Section 2.10). We proceeded with the construction of the blacklist and penalization matrices as described in Section 2.8 and Section 2.9 and included prior information about interactions between selected features from the STRING and Omnipath databases [35,39].

We ran the algorithm for *K* =1, 2, 3, 4, and 5 clusters. The BIC and AIC scores indicated *K* = 3 as the optimal number of clusters (Fig 4A). *K* = 3 clusters were also found as optimal for the same data in [26] and in another HCC study applying a network-based method to the TCGA HCC dataset [40]. Similar to the clusters discovered in [26], the clusters discovered by bnClustOmics (Fig 4D) are associated with mutations in the genes *TP53* and *CTNNB1*, Edmondson grade, and BCLC stage (*p*-values using Fisher’s exact test are 0.012, 0.001, 0.007, and 0.019, respectively). Cluster 1 is dominated by samples with mutations in *CTNNB1*, and cluster 2 is dominated by samples with mutations in *TP53* (Fig 4C). Cluster 3 is the most heterogeneous in terms of mutations. However, all 4 samples with mutations in *ALB* are in cluster 3.

The three discovered subgroups are associated with patient survival with and without adjustment for BCLC stage (S2 Fig, Section 2.11). In particular, the Cox proportional hazards model revealed that cluster 2 is associated with a poor prognosis (*p* = 0.039 for the non-adjusted model and *p* = 0.024 for the adjusted model), while survival prognoses for cluster 1 and cluster 3 are better and similar. We tested several other approaches for multi-omics clustering, including MOFA, which we used for feature selection. None of the models produced patient subgroups significantly associated with survival when adjusting for BCLC stage (S1 Table, Survival analysis).

To identify processes whose regulation is different between the three patient clusters, we performed DGE and pathway enrichment analysis (Section 2.12). The differences in enriched pathways at all omics levels are more pronounced between cluster 2 and the other clusters (Fig 4D). Significant differences in enriched pathways between cluster 1 and cluster 3 were identified only at the transcriptome level, but not the proteome or phosphoproteome level. However, this situation can result from a combination of noisy data and limitations of pathway enrichment analysis [41].

In order to extend the molecular characterization of the discovered clusters beyond expression levels and mutational frequencies, we analyzed the multi-omics networks that define the clusters. The three MAP networks are very different from each other (Fig 4E). At the same time, the similarities between consensus networks constructed at the edge-wise posterior level of 0.1 are substantially larger (Section 2.5). While 0.1 is a low confidence threshold, the proportion of edges that pass this threshold is around 2% of all non-blacklisted edges for each network. Therefore, the high degree of similarity at the 0.1 level suggests that the posterior landscapes are not as different as the MAP structures. This reflects a high level of modeling uncertainty due to the small effective sample sizes from which the networks were learned. The downside of MAP structures is the inability to account for this uncertainty which can lead to overfitting, as we have seen in simulation studies (Fig 3C).

To address this limitation, we took advantage of the Bayesian approach that we used for structure learning and using several posterior thresholds constructed consensus networks for downstream analysis (Section 2.5).

### 1.4 Downstream effects of mutated genes

bnClustOmics allows for identifying links between genotypes and molecular characteristics of individual clusters. We analyzed all children of *M* (mutation) nodes in the cluster-specific networks. At the first step, we performed pathway enrichment analysis and identified KEGG pathways that are enriched with cluster-specific children of mutation nodes (S2 Table). Signaling pathways associated with HCC, including PIK3-Akt, p53 and cell cycle, were enriched in all clusters. The differences in enriched pathways between the clusters can be connected to their genotypes. For example, the Wnt signaling pathway, whose activation is usually associated with mutations in *CTNNB1* is enriched in cluster 1 and cluster 3 but not in cluster 2. Network *G*_{3} is characterized by more connections than other networks due to a higher level of heterogeneity in cluster 3. As a result, more pathways were found to be enriched with direct neighbors of *M* nodes.

We further scrutinized the individual edges connecting mutations to the nodes representing genes products of other omics types. To find the connections which can explain the abnormal expression of *T*, *P* and *PP* nodes, we have selected children of all mutation nodes in all networks which are differentially expressed in at least one cluster or in the whole dataset (Fig 5). The frequently mutated genes *TP53, CTNNB1*, and *ARID1A* have the most children across networks. However, *CTNNB1* has the largest proportion of children that are the same across the clusters, whereas the children of *ARID1A* are rather different across the clusters. This suggests that the effects of mutations in *CTNNB1* are more homogeneous, while effects of *ARID1A* are more heterogeneous across clusters. *ARID1A* is a sub-unit of chromatin remodeling complex SWI/SNF and may have broad effects on gene expression levels. Heterogeneous roles of *ARID1A* in HCC were already pointed out in previous studies [42]. In particular, *ARID1A* was found to act both as a tumor suppressor and oncogene depending on the context.

We noted that bnClustOmics was able to capture some of the well-known HCC-specific interactions while performing *de novo* clustering. One example of homogeneous connections is the edge from mutation in *CTNNB1* (denoted *CTNNB1-M*) to the *CTNNB1* transcript abundance (*CTNNB1-T*). In all clusters, the mutation status of *CTNNB1* is positively correlated with the expression of the *CTNNB1* transcript. In cluster 1 and cluster 3, *CTNNB1-T* is overexpressed compared to normal samples. This corresponds to the known effects of *CTNNB1* mutations in HCC [43]. However, in cluster 2, *CTNNB1-T* is not overexpressed, despite the edge between *CTNNB1-M* and *CTNNB1-T*. This situation results from cluster 2 containing only two samples with mutated *CTNNB1* and the fact that mutations in *TP53* are not associated with increased *CTNNB1-T*. This example demonstrates the complementary roles of network analysis with DGE in the downstream analysis.

The edge from *CTNNB1-M* to *GLUL-T* which is present in *G*_{2} and *G*_{3} is another example of a previously known interaction. *GLUL* is known to be upregulated in HCC and is associated with the mutated *CTNNB1*. It is also known that *GLUL* is affected by activation of the Wnt/*β*-catenin pathway at the transcription level, so the incoming edges in the *GLUL-T* node are consistent with previous findings [44]. Interestingly, there is no edge connecting *CTNNB1-M* and *GLUL-T* in *G*_{1}. If we examine the interaction partners of *GLUL-T* (Fig 6A), there is an incoming edge that is specific to *G*_{1} coming from the phosphorylation site AXIN2_S70, and AXIN2_S70 has an incoming edge from *CTNNB1* also only in *G*_{1}. AXIN2, just like *GLUL*, is a known target of the Wnt/*β*-catenin pathway [45]. The link between proteins GLUL and AXIN2 is also present in the STRING database with an interaction score of 0.42. The phosphorylation site AXIN2_S70 has been mentioned in the study connecting mutations to signaling in breast cancer [46]; however, there have been no previous studies about this phosphorylation site in HCC. Thus, the different path from *CTNNB1-M* to *GLUL* in *G*_{1} compared to *G*_{2} and *G*_{3} may represent differences in signaling leading to the same target. Alternatively, due to a limited number of observations, we may have captured the same process with a different set of edges, so further experiments are needed to clarify this link.

In addition to edges corresponding to known interaction contexts, bnClustOmics discovered edges pointing at new context-specific dependencies. Cluster 2 is characterized by mutations in the *TP53* gene, and we analyzed *TP53-M* connections which might contribute to the phenotype of cluster 2 (S8 Appendix). The transcript node *TERT-T* is differentially expressed in cluster 2 and also has an incoming edge from *TP53-M* in *G*_{2}. *TERT-T* expression is known to be upregulated in many cancers including HCC [47] and it is also significantly overexpressed in all clusters in the analyzed cohort. However, the expression level of *TERT-T* is significantly higher in cluster 2 than in cluster 1 and cluster 3 (Figure 9 in S8 Appendix). The high degree of *TERT-T* overexpression is associated with mutations in *TP53* as suggested by *G*_{2}. At the same time, the edge from *TP53-M* to *TERT-T* is absent in *G*_{1} and *G*_{3} (Figure 8 in S8 Appendix), suggesting that the effect of mutated *TP53* on *TERT-T* is only present in cluster 2.

We investigated connections of *TERT-T* in other networks to identify possible sources of its upregulation in remaining clusters (Fig 6B). There is an incoming edge from ALB-*M* in *G*_{3}; however, it is negatively correlated with *TERT-T* expression, so mutations in ALB-*M* do not seem to contribute to *TERT-T* overexpression. In addition, there is a *G*_{3}-specific incoming edge from the phosphorylation site RB1_S37, which is overexpressed in cluster 2 and cluster 3, but not in cluster 1. Network *G*_{3} suggests that RB1_S37 is associated with overexpression of *TERT-T* and hence might also contribute to carcinogenesis. In *G*_{3}, there is an incoming edge to RB1_S37 from *CTNNB1-M* and the corresponding correlation is positive. This suggests that *CTNNB1-M* contributes to RB1_S37 overexpression and via RB1_S37 may affect *TERT-T* as well. However, this dependency is not direct and weaker than the edge from *TP53-M* to *TERT-T* in cluster 2, suggesting that the direction of the effect of mutations in *CTNNB1* and *TP53* on the expression of *TERT* is the same, but the effect size is different. This finding aligns with the associations between mutations in *CTNNB1* and *TP53* and survival. Both mutated genes are drivers of HCC however, *TP53* results in a poorer prognosis than *CTNNB1*.

Other RB1 phosphorylation sites, namely S249 and T356, are highly phosphorylated across all clusters. Moreover, we observe several incoming edges from *M* nodes in all RB1 phosphorylation sites (Fig 6C:E). The mutation statuses of parent nodes of RB1 (*FAT4-M* in cluster 2, *TERT-M* in cluster 3, *TP53-M* in cluster 1) are positively correlated with increased phosphorylation of the respective sites, suggesting that they all may contribute to RB1 hyperphosphorylation. In previous studies, RB1 has been shown to play an important but complex role in cell cycle regulation and apoptosis [48]. It can act both as a tumor suppressor and oncogene depending on its phosphorylation status. All three phosphorylation sites included in our network can be found in the PhosphoSitePlus database [49]. The role of S249 and T356 phosphorylation is well studied and known to affect the cell cycle and apoptosis. The role of S37 phosphorylation is less well known, and there are no studies about its role in HCC. As previously noted, our analysis suggests that phosphorylation of this site may also play a role in HCC. We note that *RB1-T* is also overexpressed. However, there are no edges between *RB1-T* and RB1 phosphorylation sites (S3 FigB), suggesting that overexpression of *RB1-T* is not the main source of RB1 hyperphosphorylation. In addition, since unphosphorylated RB1 acts as a tumor suppressor, knocking it down does not seem wise. Many efforts rather target inhibiting its phosphorylation and activating its tumor-suppressive properties [48,50]. Furthermore, Indovina et al. [48] mention Cdk inhibitors as possible therapies which can prevent RB1 phosphorylation. Indeed, Ng et al. [26] found an association of overactive CDK1/CDK2/CDK5 kinases and the phenotype associated with mutations in *TP53*. The central role of phosphorylation of RB1 in all networks suggests that inhibition of Cdk can be beneficial for patients in all clusters.

Many of the edges in discovered networks are absent in the public PPI databases. The edge from *TP53-M* to *LECT2-T* is present in *G*_{3}, and *TP53-M* is negatively correlated with *LECT2-T* in this cluster (it is also negatively correlated with *LECT2-T* in *G*_{2}, but this edge has a low posterior probability). We note that *LECT2-T* is also downregulated in cluster 2 and cluster 3, but not in cluster 1. The downregulation of *LECT2-T* has been previously associated with a poor prognosis in HCC and mutations in *TP53* [51]. Thus, the discovered link between mutations in *TP53-M* and downregulation of *LECT2-T* is plausible, despite being absent in the STRING database. We further noted that *LECT2-M* has an incoming edge from *TCHH-M* in both *G*_{2} and *G*_{3}, while *TCHH* mutations are absent in cluster 1. Both *M* nodes are negatively correlated with *LECT2-T* suggesting that *TCHH-M* contributes in a similar way to the molecular phenotype as *TP53-M*. Heterogeneity is a known issue in identifying cancer subtypes. One implication of shared connections of different mutated genes in the discovered networks is that they affect similar downstream genes and may be targeted by similar therapies.

At the same time, some *M* nodes have opposite effects on the same interaction partners, indicating opposite effects of these corresponding mutated genes on the phenotype. *TP53-M* and *CTNNB1-M* share two common connections: HDAC4_S246 and *KMT2D-T*. In both cases, the mutation status of *CTNNB1* and *TP53* are oppositely correlated with their shared interaction partners. The correlation between *TP53-M* and *KMT2D-T* is positive, while the correlation between *KMT2D-T* and other *M* nodes (shown in S3 FigA) including *CTNNB1* is negative. In pancreatic cancer, low expression of *KMT2D* has been associated with a better prognosis [52]. Moreover, knock-out of *KMT2D* has been shown to attenuate cell proliferation and was suggested as a therapeutic target [53]. Opposite effects of *TP53-M* and *CTNNB1-M* on *KMT2D-T* in cluster 3 suggest that co-occurrence of these mutations may diverge the phenotype from phenotypes where *TP53* and *CTNNB1* do not co-occur. Mutations in *CTNNB1* and *TP53* have been considered mutually exclusive in many studies [54]. However, they co-occur in 10% of all samples in the analyzed dataset. The mutual exclusivity was also challenged by a study presenting a detailed case of *TP53/CTNNB1* co-occurrence in the same tumor [55]. In addition, we observe an interesting pattern of co-occurrence of *TP53* and *CTNNB1* across discovered clusters as four out of five co-occurrence cases fall outside of the *TP53*-dominated cluster 2, which can also hint at possible opposite effects of mutations in *TP53* and *CTNNB1* on the phenotype. Our findings align with another HCC classification based on morphological features of the tumor and gene expression [56]. The analysis by Trobenson et al. [56] indicated that *CTNNB1* and *TP53* were associated with opposite effects on the presence of pseudoglands (a histopathologic feature used for HCC characterization in clinics). In addition, the majority of samples with co-occurring *CTNNB1*/*TP53* mutations ended up in the *CTNNB1* cluster based on the gene expression data. However, *CTNNB1/TP53* mutated tumors were associated with clonal progression, in contrast to tumors harboring only *CTNNB1*.

### 1.5 Hub phosphorylation sites

In studies devoted to PPI network characterization, the number of neighbors (degree) of a node in the network is often used to characterize its biological importance [57,58]. Following this logic, we defined two lists of the most connected nodes in the networks discovered by bnClustOmics. In the first list, we included the top twenty nodes with the largest number of connections that are present with non-zero posterior probabilities in two or all networks (S1 File). Such nodes and their direct neighbors represent the most similar parts between the networks. In the second list, we included all nodes with the largest number of cluster-specific connections (S2 File). Interestingly, the nodes in the first list turned out to be *P* nodes (9 out of 20), *M* nodes (9 out of 20) nodes and *T*-nodes (2 out of 20) while the top nodes of the second list were dominated by *PP* nodes (17 out of 20). Hence, of all omics types, phosphorylation sites appear to have the most different neighborhoods between the clusters. While for *CN*, *T*, and *M* nodes, this can be explained by model structural restrictions, for *P* and *PP* nodes, this finding suggests that differences in the interactome between clusters are more substantial at the phosphoproteome level than at the proteome level.

The list of most differentially connected phosphorylation sites includes MAPK1_T185, CTNND1_S252, and GRB14_S372, which are known to play a role in HCC signaling and affect the regulation of cell cycle, apoptosis, and carcinogenesis (S3 Table). Some of these hub-phosphorylation sites have been found to be important in other cancers than HCC, e.g., ANKRD28_S1011, PRKAA2_S491, and TBXA2R_S331. Our networks suggest that they might also play a role in HCC and are thus candidates for further experiments.

MAPK1 is known to be essential for MAP kinase signaling, which is one of the targets of Sorafenib [59–61], a standard-of-care treatment for advanced HCC. The phosphorylation site MAPK1_T185 is increased in cluster 2 and cluster 3 and has a considerable amount of cluster-specific connections in *G*_{2} (Fig 6E). The phosphorylation of another MAPK1 site, namely Y187, is significantly increased in cluster 3 only. Both phosphorylation sites have many references in the PhosphoSitePlus database, and are known to induce carcinogenesis and alter apoptosis, and are known drug targets. However, MAPK1 is known to be active if both sites are phosphorylated [62]. The increased phosphorylation of both sites is observed only in cluster 3. At the same time, the role of mono-phosphorylated MAPK1 is not fully understood [63]. Sorafenib which inhibits upstream regulators of MAPK1 [64] was given to six patients from the analyzed cohort, three of which were assigned to cluster 2 and three to cluster 3. Five out of six patients had to discontinue treatment due to side-effects, but patients from cluster 3 on average tolerated the therapy longer and survived longer than patients who were treated with Sorafenib in cluster 2 (S9 Appendix). This separation aligns well with our clustering, although it is not possible to make stronger conclusions due to a limited number of biopsies and the short duration of treatment.

One of the MAPK1_T185 interaction partners in *G*_{2} is another hub phosphorylation site, PTPN1_S352, whose phosphorylation is increased in cluster 2 only. *PTPN1* is known to play an important role in many liver diseases; however, it can act both as a tumor suppressor, and oncogene in HCC [65]. Most studies suggest its tumor-suppressive role. However, our analysis indicates that increased phosphorylation of PTPN1_S352 is associated with a poor prognosis and increased phosphorylation of MAPK1_T185 in cluster 2. This connection is confirmed in [66], where *PTPN1* was identified as an oncogene, and its knockdown resulted in attenuated Ras activity and MAPK signaling. We found several inhibitors of PTPN1 in The International Union of Basic and Clinical Pharmacology (IUPHAR) / British Pharmacological Society (BPS) Guide to PHARMACOLOGY [67]. All of them have hypoglycaemic and other anti-diabetic effects. Previous studies already pointed out the anti-tumor properties of diabetes drugs on HCC [68]. We believe that investigating strong individual dependencies in cluster-specific networks coupled with DGE might suggest drug candidates and highlight interactions that are important in the context of different subtypes of HCC.

### 1.6 Discussion

Learning biological networks and cancer subtyping based on multi-omics molecular data are challenging problems, which are traditionally addressed by separate computational methods. In this work, we present bnClustOmics, a tool that tackles both problems simultaneously. Our approach can integrate and cluster multi-omics datasets and learn networks consisting of different types of omics variables, each of which characterizes a patient cluster. In simulation studies, we have shown that bnClustOmics outperforms other clustering approaches due to its ability to detect differences in network structures, while other algorithms mostly lack this ability. A major limitation of our method is the necessity to perform feature selection, which is not straightforward in an unsupervised setting. We suggest using a combination of MOFA and DGE analysis based on our simulation studies, but other ways can also be explored in the future. The package bnClustOmics can be applied to any combination of omics types and is not limited to the five omics types analyzed in this HCC cohort. In the current implementation, there is no possibility to learn the edges between discrete nodes. This feature can further refine clustering, but it makes sense only for larger datasets due to the extreme sparsity of the mutation data.

We applied bnClustOmics to an HCC dataset comprising five different omics types. Similar to previous studies [26,40,56], the three discovered clusters are associated with mutations in *CTNNB1* and *TP53*, and the BCLC stage. Our patient clustering is significantly associated with survival with and without adjustment for the BCLC stage. Cluster 2 is dominated by samples with mutated *TP53* and is associated with a poor prognosis. Samples in which *CTNNB1* and *TP53* co-occur are mostly found in cluster 1 and cluster 3. Moreover, we find that *CTNNB1* and *TP53* have opposite effects on the expression of the transcript *KMT2D* and the phosphorylation site HDAC4_S246 in the learned networks. These findings might explain why *CTNNB1* and *TP53* show mutual exclusivity patterns [69,70] and are associated with opposite effects of the phenotype [56] in some cohorts.

On a more general level, our analysis suggests that the discovered clusters are associated with changes in signaling networks as identified by substantial differences in the neighborhoods of phosphorylation sites. The differences between interactions partners are the largest on the phosphoproteome level, suggesting that this omics type brings a major contribution to the result of the network-based clustering highlighting the importance of phosphoproteome data for further studies.

Cluster-specific networks suggest that hyperphosphorylation of RB1 is associated with mutations in *TP53, CTNNB1*, and *FAT4* but not with overexpression of *RB1* at the transcriptome level. This finding aligns with previous studies suggesting that unphosphorylated RB1 acts as a tumor suppressor, while hyperphosphorylation of RB1 contributes to carcinogenesis [48]. Hence therapies that inhibit phosphorylation of RB1 such as Cdk inhibitors may be a promising treatment strategy.

Overall, our analysis has shown that including associations between different omics types in the clustering model is an important step towards defining cancer subtypes and their molecular makeup comprehensively. These novel associations may improve the selection of effective personalized therapies.

## 2 Methods

### 2.1 Data

We applied bnClustOmics to the HCC data analyzed in [26] (S1 Text). The full dataset comprises 51 biopsies from 49 patients with HCC diagnosis. For each patient, DNA, RNA, proteome, and phosphoproteome data are available. For two patients, two sets of biopsies were available from two genetically different HCC tumors. In addition, we obtained data from 15 biopsies from healthy livers for transcriptome analysis and 11 biopsies for proteome and 10 for phosphoproteome analysis from the same study. A detailed description of sequencing, library preparation, transcript quantification, and SWATH analysis can be found in [26]. We obtained the normalized data from Ng et al. [26] and performed data imputation and batch-correction where applicable (S7 Appendix, S1 Text). One sample was hypermutated with over 9000 mutated genes and was excluded from the analysis. Consequently, we included 50 biopsies from 48 patients in the study.

### 2.2 Bayesian network mixture model

We assume that the data *D* consisting of *N* observations is generated from a mixture of *K* components with weights *τ _{k}*. Each component is a Bayesian network , a directed probabilistic graphical model representing a factorization of the joint distribution of the random variables

*X*

_{1}, …,

*X*. The random variables are used to model omics features in the analyzed dataset (

_{n}*M*,

*CN*,

*T*,

*P*and

*PP*). Each patient sample

*D*represents a vector of

_{i}*n*values (one for each

*X*) and is generated from a model , depending on the value of a hidden variable

_{j}*Z*[14], where

_{i}*G*is a DAG and

_{k}*θ*are the parameters of the local probability distributions (LPD).

_{k}A Bayesian network mixture model was first suggested in [14] for (single-omics) binary mutation data. In our model, each network consists of binary (mutations), ordinal (CNA), and continuous variables (transcriptome, proteome, and phosphoproteome). We denote the set of indices of all binary, ordinal, and continuous nodes by Ω, Φ, and Ψ, respectively. The quantities *n _{b}, n_{o}*, and

*n*are the numbers of binary, ordinal, and continuous random variables, respectively, in the network. We model the LPD for each continuous node

_{c}*X*∈ Ψ, of each mixture component by linear regression on its parents in graph

_{ψk}, ψ*G*, where

_{k}**Pa**

_{ψk}is the set of parents of node

*X*in graph

_{ψk}*G*. The set of parameters of the LPDs of continuous nodes includes a vector of regression intercepts

_{k}*m*, a vector of standard deviations

_{k}*σ*, and a vector of regression coefficients

_{k}*B*defined for all nodes with non-empty parent set. Given a graph

_{ψk}*G*, the Gaussian Bayesian network model above can be equivalently parameterized using a vector of unconditional means

_{k}*μ*and a covariance matrix ∑

_{k}_{k}(S4 Appendix). We use both parametrizations interchangeably. Binary and ordinal nodes are not allowed to have parents by assumption. For binary nodes

*X*, we assume that the LPDs are defined by the parameters and for ordinal nodes

_{ωk}*X*, we use the Gaussian approximation

_{ϕk}We denote the set of all parameters of a mixture component *k* by *θ _{k}* = (

*λ*, ∑

_{k}, μ_{k}_{k}).

### 2.3 EM algorithm

Following [14] we use an EM algorithm for learning Bayesian network mixture models. We denote by *D _{i}* the

*i*-th observation in the dataset, representing a vector of omics measurements of one patient (or one biopsy in case of multiple biopsies per one patient). The algorithm proceeds as follows:

Initialize cluster membership probabilities

*γ*of patient_{ik}*i*being in cluster*k*(Section 2.7)Given

*γ*, perform MAP structure search and estimate DAGs (Section 2.5)_{ik}Given estimated DAGs , iterate

*q*times:(M-step) Compute MAP parameters (S5 Appendix)

(E-step) Update membership weights and cluster weights

Iterate steps 2 and 3 until convergence

The internal cycle with *q* iterations is added for computational efficiency because parameter updates are computationally less expensive than structure search. Hence, for each update of the structures, we perform *q* updates of the parameters. We learn cluster membership assignments for all patients *D _{i}* and MAP networks . Once the EM algorithm has converged, bnClustOmics can optionally perform sampling from the posterior distribution and the output includes the matrices of estimated probabilities of all edges (Section 2.5).

The main differences to the procedure in [14] are a different set of parameters *θ _{k}* and network structural constraints due to the multi-omics extension and differences in data types.

### 2.4 Network score

For assessing how well the network structure fits the data, we use the BGe score [71,72]. In addition to the model assumption specified in Eq 2, the BGe score requires technical assumptions on likelihood and parameter prior [71]. The network score *R*(*G _{k}* |

*D*) then decomposes over continuous nodes as

By our model design, nodes *X _{ϕ}* and

*X*, corresponding to mutations and copy number changes, are not allowed to have any parents. Hence, the terms

_{ω}*S*(

*X*,

_{ϕk}**Pa**

*|*

_{ϕk}*D*) =

*S*(

*X*|

_{ϕk}*D*) and

*S*(

*X*,

_{ωk}**Pa**

*|*

_{ϕk}*D*) =

*S*(

*X*|

_{ωk}*D*) are constant for all possible graphs. For this reason, we exclude these terms when performing structure search and the product in Eq 5 runs only over nodes

*X*. However, nodes

_{ψk}*X*and

_{ϕk}*X*may enter the equation as parents of

_{ωk}*X*.

_{ψk}### 2.5 Structure search

At each step of structure search, we use the iterative order MCMC scheme introduced in [30] and implemented in the R-package BiDAG [31], which proved to be superior to many other methods for MAP structure search in simulation studies [30]. An optional step after the MAP graph has been found is to sample graphs from the posterior distribution using the order MCMC scheme [30]. This step allows us to estimate consensus models by averaging over a sample of *L* graphs from the posterior distribution. In particular, the posterior probability of an edge *e _{ξψk}* between nodes

*X*and

_{ξk}*X*in the graph

_{ψk}*G*is estimated as: where if the edge

_{k}*e*is present in structure and 0 otherwise. Edges whose posterior probabilities are lower than a defined posterior threshold are excluded from the resulting consensus structure [30].

_{ξψk}We use the iterative MAP search at the second step of the EM algorithm and perform sampling once after the EM has converged to compute posterior probabilities of single edges and identify consensus graphs.

To construct graphs for the downstream analysis, we made a list of edges whose posterior *P*(*e _{ξψk}* |

*D*) is higher than 0.9 for at least one cluster

*k*(the threshold was chosen based on our simulation studies). In addition, we selected all edges whose sum of posteriors in all clusters , while the threshold for individual networks is lower:

*P*(

*e*|

_{ξψk}*D*) > 0.5 for at least one cluster

*k*. Finally, we constructed the graphs

*G*by including edges from the selected list if their posterior

_{k}*P*(

*e*|

_{ξψk}*D*) > 0.4. The reason behind this selection process is finding high-confidence cluster-specific interactions while not dismissing similarities at lower (but non-zero) posterior levels.

### 2.6 Cluster membership weights

Updating the membership weights *γ _{ik}* requires assessment of the likelihoods . The decomposition provided by the Bayesian network model allows us to integrate discrete and continuous data types in measuring how well an observation

*D*(a vector consisting of

_{i}*n*continuous,

_{c}*n*ordinal, and

_{o}*n*binary components) fits a DAG and parameters :

_{b}The detailed formulas for computing the likelihoods are given in S6 Appendix. We have extended the R-package BiDAG, such that the function scoreagainstDAG is able to accommodate mixed data.

### 2.7 Starting membership weights

In general, the EM algorithm does not guarantee finding the global maximum, and the local maximum it finds will depend on the starting point. For this reason, we use a non-random starting point in order to start in a parameter region of high likelihood and help mitigate the local optima issue. By default (and for the HCC data), the starting cluster membership of patients is defined via running mclust on the first *K* +2 principal components after applying PCA to the original data. Our simulation studies have shown that dimension reduction via PCA as a starting point improves the results of mclust. The initial membership weights are then defined as
where *g _{i}* denotes the cluster assignment of the

*i*observation by mclust. PCA is applied only to define the initial membership weights, but the EM algorithm is then applied to original non-reduced data. With a non-random starting point, by default, bnClustOmics runs the EM only once (the results of simulation studies are shown for one run). However, for the HCC dataset, we restarted the EM five times and selected the model with the highest likelihood for each value of

^{th}*K*.

### 2.8 Allowed edges

By design, bnClustOmics only prohibits incoming edges to discrete nodes. In the HCC data analysis, we added more constraints to obtain more biologically relevant networks. The general flow of the information is directed from the DNA to RNA and (pshospho)protein nodes (S4 Table).

Naturally, we allow all possible edges between *P* and *PP* nodes. We do not allow edges between transcripts because the transcripts do not interact directly. When proteome data is not available, it makes sense to approximate protein-protein interactions with transcript-transcript interactions. However, since we have (phospho)proteome data available, we prefer to explain dependencies with more relevant and interpretable edges between (phospho)proteins and between transcripts and proteins.

### 2.9 Edge penalization matrix

When performing structure search, we use the prior information about interactions between the genes included in the networks, following the methodology described by Kuipers et al. [14]. To do this, we modify the default prior distribution over structures *P*(*G _{k}*) and replace it with
where

*κ*defines the penalization factor of the edge

_{ξψ}*X*→

_{ξk}*X*. Note that

_{ψk}*κ*≥ 1 and these factors do not depend on

_{ξψ}*k*since prior knowledge does not include cluster assignments. The change of prior leads to replacing of the score terms

*S*(

*X*,

_{ψk}**Pa**

*|*

_{ψk}*D*) with the terms

*S*′(

*X*,

_{ψk}**Pa**

*|*

_{ψk}*D*) in Eq 2.4 for all nodes

*X*with non-empty parent sets:

_{ψk}We use the STRING v.11.0 [35] and Omnipath [39] databases to define penalization factors. We penalize the edges by a factor of 2 if they are not found in the databases. The edges corresponding to interactions from the Omnipath database are not penalized. The edges corresponding to the interactions from the STRING database are not penalized if the interaction score is equal to or bigger than 0.5. Otherwise the penalization factor is defined as 2 – 2 * *interaction_score*. In addition, we do not penalize the edges between the same genes of different omics types, e.g., the edges *TP53-T* → *TP53-P* and *TP53-CN* → *TP53-T* are not penalized.

### 2.10 Feature selection

The structure search is the most computationally expensive step of the learning procedure. The complexity of the structure search scheme depends only on the number *n _{c}* of continuous nodes in the network (since the product in Eq 5 goes only over continuous nodes) and equals [30]. Hence, for the feasibility of bnClustOmics, we must pre-select the features which we include in the Bayesian networks. Another beneficial point of sensible feature selection is better interpretability since the qualitative analysis is hardly possible for networks with thousands of nodes.

We selected 778 omics features in total (Table 3 in S3 Appendix): 24 *M*, 292 *CN*, 188 *T*, 116 P and 158 *PP*. The main idea behind our feature selection approach was to combine methods that proved to work best in simulation studies (S2 Appendix) with prior knowledge about genes and interactions that are known to be important in HCC signaling (S3 Appendix). In addition to listed criteria we used reasonable filters for selected features: we included only those *M* nodes which are present in at least two samples and *CN* nodes with non-zero variance.

### 2.11 Survival analysis

To study the association of clusters with clinical outcomes, we used the Cox proportional hazards model with and without adjustment for clinical stage BCLC. Time was measured in days from the date of diagnosis. In the adjusted model, we excluded BCLC group “0” consisting of one sample, which did not include death events. If two or more biopsies were available for one patient, one of them was included in the analysis if the cluster assignments for all of them were the same. Otherwise, all samples from the patient were excluded. Two samples of patients who were lost-to-followup were considered censored. We used a likelihood ratio test based on the *χ*^{2} distribution to assess the model fit.

### 2.12 Enrichment analysis

Pathway enrichment analysis was performed using the R package ReactomePA [73]. For each omics type, a list of differentially expressed/phosphorylated genes (proteins, phosphoproteins) with FDR adjusted *p*-value smaller than 0.05 was used as input. Pathways enriched with FDR-adjusted *p*-value smaller than 0.05 were selected for visualization.

### 2.13 Differential gene and protein expression analysis

For DGE analysis, we used the R package edgeR [74] for transcriptome data, and limma [75] for proteome and phosphoproteome data. Genes were considered differentially expressed if the FDR-adjusted *p*-value was smaller than 0.05. For variable selection, we compared tumor to healthy samples for all omics types. For the heatmap in Fig 4D, we compared samples in a specific cluster to samples in all other clusters. In the downstream analysis, we have also performed DGE analysis between tumor samples in individual clusters and healthy samples.

## S1 Text. Data and code availability

The sequencing datasets are available at European Genome-phenome Archive under accessions EGAS00001005073 (whole-exome sequencing), EGAS00001005074 (RNA-sequencing). The mass spectrometry proteomics and phospho-proteomics data have been deposited to the ProteomeXchange Consortium via the PRIDE partner repository under accessions PXD025705 and PXD025836.

The pre-processed reduced and non-reduced multi-omics datasets used for clustering by bnClustOmics and other methods in this study as well as the code and results of simulation studies and HCC clustering are available at the GitHub repository https://github.com/cbg-ethz/HCC. Patient IDs were encrypted.

The R-package bnClustOmics is available at the GitHub repository https://github.com/cbg-ethz/bnclustOmics.

**S1 File. Top twenty most similarly connected nodes and their interactions partners in cluster-specific networks.**

**S2 File. Top twenty most differently connected nodes and their interactions partners in cluster-specific networks.**

## Acknowledgments

We thank Dr. George Rosenberger and Yannick Suter for their expertise, advice, and fruitful discussions on topics presented in this manuscript.

Part of this research was supported by the European Research Council (ERC) Synergy Grant 609883 and SystemsX.ch Research, Technology and Development (RTD) Grant 2013/150.

## Supporting information

### S1 Appendix. Generating Bayesian network mixtures and data in simulations studies

The steps of generating a Bayesian network mixture and the dataset from this mixture include:

Generating DAGs

*G*=1, …,_{k}, k*K*, consisting of*n*variables each,*n*binary and_{b}*n*continuous._{c}First, we generate a random DAG

*G*_{1}consisting of*n*continuous nodes with the function randomDAG from the R-package pcalg [76]. The parameter prob is set to and corresponds to each continuous node_{c}*X*having one continuous parent on average. Then, the remaining_{ψk}*K*– 1 DAGs in the mixture are generated such that the SHD between each of them and the first structure equals*η*|*E*|, where |*E*| is the number of edges in the first randomly generated structure. We use*η*values in the range 0.1 – 0.4, hence graphs representing different mixture components have a lot of edges in common. From a biological point of view, this makes sense: While some interactions may be altered in a particular cancer subtype, most of them will stay the same (e.g. housekeeping pathways).At the next step, we add random edges from binary to continuous nodes, such that each binary node has 0.5 continuous children on average. These edges model the effects of mutations on other nodes (e.g. transcripts, proteins) and are generated randomly for each mixture component.

Generating parameters of local probability distributions (LPDs) for each structure.

For 99% of binary nodes

*X*, frequencies_{ωk}*λ*are sampled from beta distribution with parameters_{ωk}*α*= 0.1,*β*= 7, modeling sparse and heterogeneous mutation data. For 1% of binary variables (minimum one variable) frequencies*λ*are sampled from a beta distribution with parameters_{ωk}*α*= 0.5 and*β*=1, modeling rare genes, for which higher frequencies are observed in known cancer subtypes.For continuous nodes: regression coefficients

*β*for nodes with non-empty parent sets are chosen in the range [0.5, 1.5]. Conditional standard deviations_{ψk}*σ*are sampled from a normal distribution with mean 0.3 and standard deviation 0.2; to prevent negative values, we use the absolute values of generated numbers._{ψk}Regression intercepts

*m*= 0, by default, apart_{ψk}*v*=*δn*nodes_{c}*G*. For these_{k}*v*nodes, we first sample the*sign*(“+” or with equal probability and then sample randomly in the range [0.5, 1.5] or [−1.5, −0.5]. The parameter*δ*directly impacts how far the centers of distributions*μ*are from each other._{k}

Generating data for each mixture component

*Z*=1, …,_{k}, k*K*using graphs and parameters generated in the previous steps.Generate

*N*observations of each binary node_{Zk}*X*from a Bernoulli distribution, using parameters_{ωk}*λ*._{ωk}Generate

*N*observations of each continuous node_{Zk}*X*according to the Eq 2, using parameters_{ψk}*G*._{k}, m_{k}, B_{k}, σ_{k}

We varied two parameters of generated Bayesian network mixtures to see how different algorithms performed depending on the signal strength, defined as *L*2 norm between centers of distributions of mixture components. The first parameter *η* is responsible for the structural difference between networks representing mixture components. The second parameter *δ* was responsible for differences between vectors of regression intercepts *m _{k}, k* =1

*, …, K*. The tables below represent the correspondence between

*x*–

*axis*labels used in Fig 2A:D,

*η*,

*δ*and average

*L*

_{2}norm of differences between vectors of unconditional means

*μ*and

_{i}*μ*,

_{j}*i*≠

*j*.

### S2 Appendix. Feature selection simulation

This simulation study illustrates clustering accuracy of bnClustOmics depending on the selection of relevant features. For simulations we use *n _{c}* = 1000 (similarly to the feature selection benchmarking study [77]) and

*n*= 100. For clustering, we choose only binary features which equal to 1 in at least one generated sample. In addition we try several approaches to select the continuous features:

_{b}random selection: 150 continuous features

moCluster: 150 continuous features with non-zero loadings defined by sparse consensus PCA

MOFA: top 150 features sorted by total absolute weight in all latent factors

ranking by the absolute value of normalized mean: top 150 features sorted by , where is sample mean and

*s*is sample standard deviation of_{ψ}*X*_{ψ}hybrid of 3. and 4.: mix of MOFA (75 features) and ranking by the absolute value of normalized mean (75 features).

By simulation study design, the default values for all continuous nodes are 0. Hence a ranking by the normalized mean is similar to a ranking by the one-sample *t-test* statistics and defines the variables which most extremely deviate from 0. Feature selection using normalized mean can be seen as a proxy to performing the DGE analysis in the real expression data and selecting genes whose expression mostly deviates from non-tumor samples.

The method moCluster performed best with regard to feature selection in the benchmarking study [77]. However, in that study, the authors generated all variables independently and did not include any interactions in the model. In our simulation, this approach did not perform well. Ranking by normalized mean has shown the best performance. However, MOFA preserves more edges from the original network in subnetworks consisting of selected nodes only. For this reason, when comparing the accuracy of bnClustOmics to other methods in Fig 2D, we show the accuracy of bnClustOmics when the hybrid approach is used for feature selection.

### S3 Appendix. Feature selection for the HCC analysis

To select *M* nodes, we included all mutated genes found significant by MutSigCV tool (*q* < 0.1). In addition, we have added the genes which were found significantly mutated in the TCGA HCC cohort, and the genes identified by HCC studies [36–38] as potential cancer drivers if they were mutated in at least two samples in the HCC dataset.

For nodes of continuous types, we first identified latent factors using MOFA on a subset of features passing standard deviation thresholds (1 for proteome and 2 for transcriptome and phosphoproteome). Five latent factors have been identified by MOFA. Consequently, we selected the top 50 features for each omics type by the total absolute weight of features in all latent factors.

We extended the selected *P* and *PP* features by performing the DGE analysis and picking the differentially expressed features (*q* <0.05), which are also present in the kinase-substrate database Omnipath. Their crucial role in cancer development explains our interest in kinases. Most protein kinases promote cell proliferation, survival, and migration. Furthermore, their aberrant activity is often associated with cancer development [78]. The standard-of-care HCC treatment, Sorafenib, is also a multi-kinase inhibitor.

For *P* nodes, we also selected differentially expressed features present in the transcription factor (TF) database Omnipath, confidence level B.

We have also extended each omics feature set with genes present in selected features of other omics sets for consistency and interpretability of networks. For example, for the possibility of discovering an edge *TP53-M* → *TP53-P*, we have included TP53 at the protein level. The same reasoning stands behind our choice of *CN* nodes, which were selected as a union of gene features selected from transcriptome, proteome, and phosphoproteome. In addition, we included *CN* nodes identified as potential drivers in [26]. We excluded *CN* nodes that had 0 variance in the HCC dataset.

### S4 Appendix. Two ways to parametrize a Gaussian Bayesian network

In this section, we drop indices *k*, so the equations are valid for all mixture components. There are two ways to parametrize a Gaussian Bayesian network. One way is via a vector of regression intercepts *m*, a noise vector *σ*, and regression coefficients *B* = {*β _{ψ}*} as in Eq 2. The second way is via a vector of unconditional means

*μ*and a covariance matrix ∑. Given the DAG

*G*, the parameters {

*μ*, ∑} can be transformed into equivalent parameters {

*m, B, σ*} [71], where regression coefficients

*β*are computed only for those nodes, which have non-empty parent sets in a DAG

_{ψ}*G*. Let

*∑*be the block of the covariance matrix consisting only of rows with indices

_{VW}*V*and columns with indices

*W*. And let

*W*be the parents of node

*V*in the graph

*G*, then

For convenience, we use both parametrizations interchangeably. For example, in defining parameters of simulation studies it is more convenient to use {*m, B, σ*}, while in the description of the EM algorithm we use {*μ*, ∑}.

### S5 Appendix. MAP parameters estimation

Let . For binary (mutation) nodes, we follow [14] and parametrize local probability distributions as *P*(*X _{ωk}* = 1) =

*λ*, with a beta prior on

_{ωk}*λ*with hyperparameters . The posterior of

_{ωk}*λ*follows a beta distribution as well, so we compute the MAP parameters for all binary nodes

*X*in the M step of the EM algorithm as follows:

_{ω}For continuous and ordinal nodes we use the BGe score and assume a normal-inverse-Wishart prior on the parameters *μ* and ∑ [71]:

The posterior is then also normal-inverse-Wishart and the MAP parameters are computed as follows:
where and . The values of the hyperparameters *a _{μ}, a_{w}*, the prior mean vector

*v*and the parametric matrix

*U*by default are set as follows [79]: where is the identity matrix, and is a vector consisting of zeros.

When MAP graphs *G _{k}* and parameters are estimated, the estimates are computed according to S4 Appendix.

### S6 Appendix. Local likelihoods formulas

For binary nodes, we compute likelihoods as for all binary nodes.

For continuous nodes *X _{ψ}*, we compute Gaussian likelihoods according to the model specified in Eq 2,

For copy number nodes *X _{ϕ}*, we use a similar Gaussian likelihood, but the sum over parents is dropped due to structural assumptions:

### S7 Appendix. Data pre-processing

For transcriptome data, the pre-processing steps included:

Like in [26] gene-level expected counts were upper-quartile-normalized to 1000.

log

_{2}transformation.

For proteome data, the pre-processing steps included:

log

_{2}transformation.Normalization by median substraction.

Filtering out proteins which were detected in less than 50% of samples.

For clustering only: imputation of missing values using the R package impute [80]. For differential expression analysis, we used unimputed values.

For phosphoproteome data, the pre-processing steps included:

log

_{2}transformation.Normalization by median substraction.

Filtering out proteins which were detected in less than 50% of samples.

Batch correction with the R package edgeR.

For clustering only: imputation of missing values using the R package impute [80]. For differential expression analysis, we used unimputed values.

The CNA data was obtained at the gene level from the study by Ng et al. [26]. The copy number status was derived from the log-ratio and takes values from 2 to −2, which denote [81]:

2: amplification

1: copy gain indicates a low-level gain

0: copy number neutral

-1: shallow deletion indicates a shallow loss, possibly a heterozygous deletion

-2: deep deletion indicates a deep loss, possibly a homozygous deletion

This way, despite the ordinal nature of the CNA data, the range of the values justifies the normal approximation.

### S8 Appendix. High-confidence connections of *TP53-M* in *G*_{2}

We investigated the edges outgoing from *TP53-M* whose posterior probabilities in *G*_{2} is higher than 0.9 (p_cl2> 0.9, Table 4). All four identified edges are specific to *G*_{2}, as their posteriors in other networks are lower than 0.4. We further considered only gene products that were differentially expressed in cluster 2 (p_DE2< 0.05). Three edges satisfied these criteria. Eventually, we focussed on the edge connecting *TP53-M* and *TERT-T*, since it was also found in the STRING database.

### S9 Appendix. Responses to treatment with Sorafenib

Six out of all 48 patients in the analyzed cohort were treated with Sorafenib. All of these patients were assigned to either cluster 2 or cluster 3 and none to cluster 1. Edmondson grade, survival, and side-effects to Sorafenib were mostly similar within clusters. Two of three patients in cluster 2 experienced hepatic decompensation within several days after the start of treatment and had to stop it. Patients from cluster 3 could, on average, tolerate the side effects longer. All patients in cluster 3 survived longer than patients in cluster 2 even within the same clinical stage.

The biological interpretation of most edges is straightforward, however, edges of type X-*M* → Y-*P/T/PP*, for two genes X and Y, are rarely considered when learning the networks. Mehnert et al. [21] have shown via experiments that a cancerous mutation in gene X can change the interactome of its protein product X-*P* without affecting the expression of X-*P* itself. Say Y-*P* is one of the interactors of X-*P* affected by a mutation in gene X. In this case, we can observe a statistical dependency between a mutation node X-*M* and a protein node Y-*P* (but not between X-*P* and Y-*P*). Such dependencies are particularly interesting because they help understand the links between genotypes and phenotypes.

## Footnotes

Legend to Figure 3 is updated, Table numbering in supporting information changed.

## References

- 1.↵
- 2.
- 3.↵
- 4.↵
- 5.↵
- 6.↵
- 7.
- 8.
- 9.↵
- 10.↵
- 11.↵
- 12.↵
- 13.↵
- 14.↵
- 15.↵
- 16.↵
- 17.↵
- 18.↵
- 19.↵
- 20.↵
- 21.↵
- 22.↵
- 23.↵
- 24.↵
- 25.↵
- 26.↵
- 27.↵
- 28.↵
- 29.↵
- 30.↵
- 31.↵
- 32.↵
- 33.↵
- 34.↵
- 35.↵
- 36.↵
- 37.
- 38.↵
- 39.↵
- 40.↵
- 41.↵
- 42.↵
- 43.↵
- 44.↵
- 45.↵
- 46.↵
- 47.↵
- 48.↵
- 49.↵
- 50.↵
- 51.↵
- 52.↵
- 53.↵
- 54.↵
- 55.↵
- 56.↵
- 57.↵
- 58.↵
- 59.↵
- 60.
- 61.↵
- 62.↵
- 63.↵
- 64.↵
- 65.↵
- 66.↵
- 67.↵
- 68.↵
- 69.↵
- 70.↵
- 71.↵
- 72.↵
- 73.↵
- 74.↵
- 75.↵
- 76.↵
- 77.↵
- 78.↵
- 79.↵
- 80.↵
- 81.↵