Oncogenetic Network Estimation with Disjunctive Bayesian Networks

Motivation: Cancer is the process of accumulating genetic alterations that confer selective advantages to tumor cells. The order in which aberrations occur is not arbitrary, and inferring the order of events is challenging due to the lack of longitudinal samples from tumors. Moreover, a network model of oncogenesis should capture biological facts such as distinct progression trajectories of cancer subtypes and patterns of mutual exclusivity of alterations in the same pathways. In this paper, we present the Disjunctive Bayesian Network (DBN), a novel oncogenetic model with a phylogenetic interpretation. DBN is expressive enough to capture cancer subtypes' trajectories and mutually exclusive relations between alterations from unstratified data.

Results: In cases where the number of studied alterations is small (< 30), we provide an efficient dynamic programming implementation of an exact structure learning method that finds a best DBN in the super-exponential search space of networks. In the rare cases where the number of alterations is large, we provide an efficient genetic algorithm in our software package, OncoBN. Through numerous synthetic and real data experiments, we show OncoBN's ability to infer ground-truth networks and to recover biologically meaningful progression networks.

Availability: OncoBN is implemented in R and is available at https://github.com/phillipnicol/OncoBN.

hani and Lagergren, 2013). These algorithms' objective is to find a network structure that maximizes a (regularized) likelihood. The optimal network learned by any approximation method may be far from the ground truth, and iterative search methods can get trapped in local maxima. Here we show that for the number of driver alterations typically encountered in tumors (< 30), one can use an efficient dynamic programming implementation of an exact structure learning algorithm (Silander and Myllymäki, 2006).

Mutual exclusivity of alterations is another phenomenon that has been considered in learning cancer progression networks. Two sets of alterations are mutually exclusive if they (almost) never co-occur in a tumor (Leiserson et al., 2015). Two potential explanations for this observation are functional redundancy and synthetic lethality (Deng et al., 2019). Existing approaches considering pathways and their effects on cancer progression either assume that the pathways are inputs to the progression inference algorithm (Gerstung et al., 2011; Cheng et al., 2012) or learn them along with the progression network (Raphael and Vandin, 2015; Cristea et al., 2017).

The CBN progression rule dictates that all parent alterations must be present in the tumor for the child to occur, under which mutually exclusive genes cannot share any descendant alterations. CBN's inability to capture mutual exclusivity of alterations has motivated a line of work in which the mutual exclusivity restriction and pathway information are introduced artificially to the CBN (Gerstung et al., 2011; Cheng et al., 2012). Moreover, since each cancer subtype has distinct molecular characteristics and progression paths, one must first stratify samples into disjoint subtypes and then learn each subtype's progression network separately.
This extra step is required for all of the above models mainly because they cannot naturally capture subtypes' mutual exclusivity. PICNIC (Caravagna et al., 2016) is the state-of-the-art pipeline that clusters samples into subtypes, detects driver events, checks statistically significant mutual exclusivity hypotheses or takes pathway information as an input, and infers the progression network. A related class of models is (semi-)monotone progression networks, which lack a direct biological interpretation. The class of monotone BNs is a superset of our proposed model, which makes it more flexible but prone to overfitting given the limited number of samples in many real-world scenarios.

Biological Modeling. We propose the Disjunctive Bayesian Network (DBN), which reconciles population-level progression models (oncogenetic models) and individual tumor evolution models (phylogenetic models). From the oncogenetic perspective, DBN relaxes the CBN progression assumption by allowing progression once at least one of the parents has occurred, Figure 1A. From the phylogenetic perspective, each directed path starting from the wild-type root in a DBN graph represents a (sub)clone, Figure 2C, and each sample from the DBN graph represents an individual tumor consisting of (sub)clones, Figure 2B. Overall, the DBN itself is the overlay of all of the possible subclones corresponding to the modeled cancer, Figure 2A. DBN accommodates distinct progression paths for subtypes and is expressive enough to capture the mutual exclusivity of alterations present in the data. Therefore, one can skip two preprocessing steps necessary for the state-of-the-art models: stratifying samples by subtype and detecting mutual exclusivity. We consider two extensions of DBN. The first extension relaxes the strict disjunction assumption and allows spontaneous (parent-less) alteration, Figure 1B; the second accounts for measurement error in the observations.

We model the observation of alterations as a binary random vector (X_1, ..., X_p), where X_j = 1 if the j-th alteration is detected in the sample, and x = (x_1, ..., x_p) is an observed sample. We assume that a BN governs the order in which the events can occur. The BN consists of a DAG G and local Conditional Probability Distributions (CPDs) P(x_j | x(P_j); θ), where P_j is the set of parents of event j in G and θ parameterizes the distribution. Local CPDs form the joint distribution as P(x; G, θ) = ∏_{j=1}^{p} P(x_j | x(P_j); θ). Under the DBN progression rule, an event can occur only if at least one of its parents has occurred. Therefore, P(x_j = 1 | x(P_j); θ) = 0 if all parents are inactive, and θ_j otherwise, Figure 1A.
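To make the progression rule concrete, the joint probability of a sample under the basic DBN can be sketched in a few lines of hypothetical Python (OncoBN itself is implemented in R; the network, node names, and parameter values below are illustrative only):

```python
def dbn_joint_prob(x, parents, theta):
    """Joint probability of a binary sample x under the basic DBN.

    x: dict node -> 0/1 observation
    parents: dict node -> list of parent nodes (empty for roots)
    theta: dict node -> activation probability theta_j
    """
    p_total = 1.0
    for j, pa in parents.items():
        # Disjunctive rule: node j is eligible to activate iff at least
        # one parent is active (roots are always eligible).
        eligible = len(pa) == 0 or any(x[u] == 1 for u in pa)
        pj = theta[j] if eligible else 0.0  # P(x_j = 1 | parents)
        p_total *= pj if x[j] == 1 else 1.0 - pj
    return p_total

# Toy network: A -> C and B -> C (C may occur if A OR B has occurred).
parents = {"A": [], "B": [], "C": ["A", "B"]}
theta = {"A": 0.5, "B": 0.4, "C": 0.7}
print(dbn_joint_prob({"A": 1, "B": 0, "C": 1}, parents, theta))
```

For this toy network, the sample (A=1, B=0, C=1) has probability 0.5 · 0.6 · 0.7 = 0.21, while any sample with C present but both parents absent has probability zero under the basic model.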
Spontaneous Activation Model. Deviations from the DBN progression rule may be the result of spontaneous activation caused by unknown sources. To capture this, we add a non-zero spontaneous activation probability ε_j > 0 for each node, Figure 1B.

Given n cross-sectional samples and the network G, we wish to find θ̂_G, the maximum likelihood estimator (MLE) for θ. We focus on the spontaneous activation model, where the likelihood is

L(θ; G) = ∏_{i=1}^{n} ∏_{j=1}^{p} P(x_ij | x_i(P_j); θ),    (1)

where x_ij is the realization of the jth event in the ith sample. Maximizing the log-likelihood yields closed-form estimators. From now on, to reduce the number of inferred parameters, we assume ∀j: ε_j = ε, and we fix it throughout the experiments. Details of parameter estimation for the three models (basic, spontaneous, measurement error) are presented in Supplement B.
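As a rough illustration of the plug-in estimator (a deliberate simplification that ignores the ε correction; the exact MLEs are derived in Supplement B, and all names and data below are hypothetical), θ_j can be estimated as the frequency of event j among samples where j was eligible:

```python
def estimate_theta(samples, parents):
    """Plug-in estimate of theta_j, ignoring the spontaneous rate eps.

    For each node j, theta_j is estimated as the fraction of samples in
    which j occurred, among samples where j was 'eligible' (at least one
    parent active, or j is a root). Samples with no active parent are
    governed by eps and are excluded here.
    """
    theta_hat = {}
    for j, pa in parents.items():
        eligible = [x for x in samples
                    if len(pa) == 0 or any(x[u] == 1 for u in pa)]
        if eligible:
            theta_hat[j] = sum(x[j] for x in eligible) / len(eligible)
    return theta_hat

samples = [
    {"A": 1, "B": 0, "C": 1},
    {"A": 1, "B": 1, "C": 0},
    {"A": 0, "B": 0, "C": 0},
    {"A": 1, "B": 0, "C": 1},
]
parents = {"A": [], "B": [], "C": ["A", "B"]}
print(estimate_theta(samples, parents))
```

Here A and B are roots (eligible in all four samples), while C is eligible only in the three samples where A or B occurred, giving θ̂_C = 2/3.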

Although for a fixed network G the MLE parameters have closed form, finding the best G is NP-hard. We present an efficient Dynamic Programming (DP) method for p < 30 that finds a best graph with maximum likelihood. We use "a best" instead of "the best" graph to emphasize that the optimum may not be unique. To improve interpretability and avoid overfitting, we restrict our search space to G_{p,k}, the space of p-node DAGs with an in-degree bound of k. To further penalize dense graphs, we follow (Ramazzotti et al., 2015) and use the Bayesian Information Criterion (BIC) as our graph fitness score. The final optimization objective takes the following form:

G* = argmax_{G ∈ G_{p,k}} BIC(G, θ̂_G).    (2)

An exhaustive search of G_{p,k} takes super-exponential time. Silander and Myllymäki (2006) introduced a dynamic programming algorithm that finds the optimal network in exponential time. Their algorithm assumes that each graph G can be assigned a decomposable score Score(G) such that Score(G) = ∑_{j=1}^{p} Score_j(P_j), where Score_j(P_j) is the score of the subgraph consisting of only vertex j and its parents P_j; Score_j(P_j) is called the local score of j. For us, Score(G) = BIC(G, θ̂_G) is our decomposable score. The rest of this section is devoted to a high-level summary of the algorithm.

Optimal Substructure. First note that each DAG has at least one sink node, i.e., a node with no outgoing edges. The score of a best graph G*(V) can be broken down into the best parents of any of its sinks s and a best subgraph obtained by removing s and its incoming edges. More formally, for s, an arbitrary sink of G*, P*_s should be a best set of parents, i.e., one with the highest local score: P*_s = argmax_{P_s} Score_s(P_s). In addition, for G*(V) to be optimal, Score(G*(V \ {s})) should also be optimal. This optimal substructure suggests the following recursive formula for finding a best sink for a set of nodes W ⊆ V:

bestSink(W) = argmax_{s ∈ W} [ Score_s(P*_s(W \ {s})) + Score(G*(W \ {s})) ],

where P*_s(W) = argmax_{P_s ⊆ W} Score_s(P_s) is the pre-computed best parent set of s within W. Best sinks can be computed in O(n · 2^{n−1}) time using memoization.

Reconstructing an Optimal Solution. Best sinks immediately yield a best ordering of the nodes in reverse order. Given an optimal order and the best set of parents for all nodes, it is straightforward to build an optimal graph. Starting from an empty graph, we add nodes according to the optimal order and, for each, add incoming edges from its optimal parents that already exist in the graph.
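The subset DP and the reconstruction step can be sketched together as a small, hypothetical Python toy (the actual OncoBN implementation is in R and uses BIC local scores; here `toy_score` is an artificial decomposable score that simply rewards one particular ground-truth graph, used only to exercise the algorithm):

```python
from itertools import combinations

def learn_network(nodes, local_score, max_indegree=2):
    """Subset-DP sketch of exact structure learning (Silander & Myllymaki).

    local_score(j, parent_set) returns the decomposable score of node j
    with a frozenset of parents; the graph score is the sum of local
    scores. Returns (best total score, dict node -> best parent set).
    """
    n = len(nodes)
    idx = {v: i for i, v in enumerate(nodes)}

    def members(mask):
        return [v for v in nodes if mask >> idx[v] & 1]

    # Pre-compute, for every node j and candidate pool W (j not in W),
    # a best parent set of j drawn from W with |P_j| <= max_indegree.
    best_par = {}
    for j in nodes:
        for mask in range(1 << n):
            if mask >> idx[j] & 1:
                continue
            pool = members(mask)
            best_par[(j, mask)] = max(
                (frozenset(c)
                 for k in range(min(max_indegree, len(pool)) + 1)
                 for c in combinations(pool, k)),
                key=lambda ps: local_score(j, ps))

    # DP over subsets: best achievable score and a best sink of each W.
    best_score, best_sink = {0: 0.0}, {}
    for mask in range(1, 1 << n):
        best_score[mask], best_sink[mask] = max(
            (best_score[mask ^ (1 << idx[s])]
             + local_score(s, best_par[(s, mask ^ (1 << idx[s]))]), s)
            for s in members(mask))

    # Reconstruct: peel off best sinks, giving a best order in reverse.
    parents, mask = {}, (1 << n) - 1
    while mask:
        s = best_sink[mask]
        mask ^= 1 << idx[s]
        parents[s] = best_par[(s, mask)]
    return best_score[(1 << n) - 1], parents

# Artificial score rewarding one ground-truth graph: A -> B, {A,B} -> C.
truth = {"A": frozenset(), "B": frozenset({"A"}), "C": frozenset({"A", "B"})}
def toy_score(j, ps):
    return 1.0 if ps == truth[j] else -0.1 * len(ps)

score, parents = learn_network(["A", "B", "C"], toy_score)
print(score, parents)
```

The DP visits all 2^n subsets, so this is only feasible for small n, matching the paper's p < 30 regime.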

When the data is corrupted by noise, the estimated graph is likely to contain spurious edges. To remove low-confidence edges, we perform statistical tests on the estimated graph. In DBNs, if e = (u, v) is an edge in the ground-truth graph, we have P(X_v = 1 | X_u = 1) > P(X_v = 1 | X_u = 0). The pseudocode of this part is summarized in Algorithm 1. One regularization approach uses the length penalty introduced in (Lam and Bacchus, 1994), which simplifies to log n log p ∑_{j=1}^{p} |P_j| for the DBN. In another approach, represented by r > 0 in Algorithm 1, we limit the number of parents of each node to r, i.e., max_j |P_j| ≤ r.
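A toy stand-in for such an edge test (a simplification of the statistical tests summarized in Algorithm 1; all names and data below are hypothetical) compares the empirical frequency of the child with and without the parent:

```python
def edge_support(data, u, v):
    """Support for a candidate edge (u, v): difference between the
    empirical frequency of v among samples with u present and samples
    with u absent. Under the DBN, a true edge should make this positive.
    """
    with_u = [x[v] for x in data if x[u] == 1]
    without_u = [x[v] for x in data if x[u] == 0]
    p1 = sum(with_u) / len(with_u) if with_u else 0.0
    p0 = sum(without_u) / len(without_u) if without_u else 0.0
    return p1 - p0  # a clearly positive difference supports the edge

# Toy data: v occurs in 30/40 samples with u, but only 10/60 without u.
data = ([{"u": 1, "v": 1}] * 30 + [{"u": 1, "v": 0}] * 10
        + [{"u": 0, "v": 0}] * 50 + [{"u": 0, "v": 1}] * 10)
print(edge_support(data, "u", "v"))
```

In practice one would attach a p-value to this difference (e.g., via a test of proportions) rather than thresholding the raw gap.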

To test the DP method against existing cancer progression algorithms, we generate datasets from simulated networks. Random graphs G are created using the pcalg R package (Kalisch et al., 2012), which allows the user to specify the number of vertices and the average degree. For network parameters, we sample θ_j ~ Unif(0.25, 0.75). Once the θ_j's and G are known, a simulated dataset can be created by iterating over a topological sort of G. For tests on simulated data, we fix the number of observations n to 400 and the number of alterations p to 20 (similar to the size of existing cancer datasets). Unless specified otherwise, the average degree is set to 3.
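The sampling loop over a topological sort can be sketched as follows (a hypothetical Python analogue of the experiment setup; the actual experiments use R and the pcalg package, and the toy network below is illustrative):

```python
import random

def simulate_dbn(parents, theta, order, n, eps=0.0, seed=0):
    """Generate n binary samples from a DBN by sweeping a topological order.

    A node activates with probability theta[j] if at least one parent is
    active (roots are always eligible), and with probability eps otherwise
    (spontaneous activation).
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = {}
        for j in order:  # topological order: parents are sampled first
            pa = parents[j]
            eligible = len(pa) == 0 or any(x[u] for u in pa)
            p = theta[j] if eligible else eps
            x[j] = 1 if rng.random() < p else 0
        data.append(x)
    return data

parents = {"A": [], "B": ["A"], "C": ["A", "B"]}
theta = {"A": 0.6, "B": 0.5, "C": 0.7}
data = simulate_dbn(parents, theta, ["A", "B", "C"], n=400)
```

Observation noise can then be added by flipping each entry independently with probability η, as described above.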

To simulate the noise that is likely present in real data, we flip the binary value of each entry with probability η.

[Figure 5, panels B and C. B: Comparing the DP algorithm to CAPRI for various levels of network complexity, quantified by varying the average degree of random graphs from 1 to 8, with 100 datasets generated for each degree. C: Cross-comparison between DBN and CBN, where 100 datasets were generated from each model, both models were fitted to all datasets, and the mean MCC in each class is reported.]

Figure 5C reports the mean MCC in each category; see also Table 1. To quantify our uncertainty in the estimated progression network, we run the algorithm on 100 bootstrapped datasets. We form the mean graph by reporting only the edges that are present in a sufficiently large number of networks estimated from the bootstrapped datasets (this cutoff will be 25 or 50). An improvement to the method could be to integrate some of CAPRI's regularization steps to improve robustness to noise. When η is small and fixed, DP uniformly outperforms CAPRI at different levels of network complexity (Figure 5B).
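The bootstrap consensus step can be sketched as follows (hypothetical Python; `fit` stands in for any structure-learning routine, such as the DP or the genetic algorithm, and the toy `fit` used below is artificial):

```python
import random

def bootstrap_consensus(data, fit, n_boot=100, cutoff=50, seed=0):
    """Consensus-graph sketch: refit on bootstrap resamples and keep the
    edges appearing in at least `cutoff` of the estimated networks.

    `fit` maps a dataset to a set of directed edges (u, v).
    Returns (consensus edge set, per-edge occurrence counts).
    """
    rng = random.Random(seed)
    counts = {}
    for _ in range(n_boot):
        boot = [rng.choice(data) for _ in data]  # resample with replacement
        for e in fit(boot):
            counts[e] = counts.get(e, 0) + 1
    return {e for e, c in counts.items() if c >= cutoff}, counts

# Toy run: a degenerate fit that always returns one edge, so that edge
# appears in all 100 bootstrap networks and survives the cutoff.
data = [{"A": 1, "B": 0}] * 10
edges, counts = bootstrap_consensus(data, lambda d: {("A", "B")})
print(edges, counts)
```

The cutoff of 25 or 50 out of 100 mirrors the thresholds used in the paper's experiments.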

The cross-comparison (Figure 5C) [...] It confirms that the two subtypes (papillary and non-papillary) both have perturbations in the p53, RTK/Ras/PI(3)K, methylation, and acetylation pathways. The only mutation shared between the two subtypes is EP300, which corresponds to acetylation.