Knowledge discovery with Bayesian Rule Learning for actionable biomedicine

Biomarker discovery is critical for both biomedical research and for clinical diagnostic, prognostic, and therapeutic decision-making. They help improve our understanding of the underlying physiological processes within an individual. Discovery of biomarkers from complex biomedical datasets is done using data mining algorithms. Hundreds of thousands of biomarkers have been discovered and reported in literature but only a few dozen have been found to be clinically useful. This discrepancy is because statistical significance is not clinical relevance. Statistical significance only accounts for the correctness of the learned associations. Clinical relevance, in addition to statistical significance, also accounts for clinical utility such as cost-effectiveness, non-invasiveness, efficacy, and safety of the proposed biomarkers. We need models that are statistically significant and clinically relevant, all the while keeping it interpretable. Interpretable classifiers are more actionable in medicine because they offer human-readable explanations for their predictions. Traditional data mining methods cannot account for clinical relevance. We formulate this as a knowledge discovery problem. In computer science, knowledge discovery in databases is “a non-trivial process of the extraction of valid, novel, potentially useful, and ultimately understandable patterns in data”. Bayesian Rule Learning (BRL) finds an optimal Bayesian network to explain the training data and translates that into an interpretable rule model. In this paper, we extend BRL for knowledge discovery (BRL-KD) to enable BRL to incorporate a clinical utility function to learn models that are clinically more relevant. We demonstrate this using a real-world dataset to predict cardiovascular disease outcome. We evaluate predictive performance with the area under the receiver operating characteristic curve (AUROC) and clinical utility with the cost of the model. We show that BRL-KD successfully generates a set of models offering different trade-offs between AUROC and cost. Based on the clinical standard, a model with an acceptable trade-off can then be chosen.

proteins, metabolites, etc.) and an outcome variable of interest (e.g., disease state, 23 therapeutic response, etc.). 24 In 2016, Burke [4] reported that there were 768,000 papers indexed in PubMed 25 directly related to biomarkers. While many of those papers claimed the markers to be 26 clinically useful, very few molecular biomarkers are currently in clinical use today. 27 Selleck at al. [5] further quantify this discrepancy with respect to cancer-related  The translation of discoveries made in basic biological research into medical practice 35 is a slow process and is riddled with many challenges [6]. Selleck at al. [5] suggest that 36 it is not enough to just identify variants but instead to identify actionable variants as 37 biomarkers in order to revolutionize healthcare delivery. In order to improve translation 38 of discovered biomarkers into medical practice, they call for a more focused biomarker 39 discovery process using an evaluation criteria similar to the one set forth by Evaluation 40 of Genomic Applications in Practice and Prevention Initiative [7] for evaluating genetic 41 tests created for clinical and public health use. Their suggested four evaluation criteria 42 are-1) analytical validity, to test if the biomarker measurements are reproducible; 2) such as cost-effectiveness, treatment or risk reduction, etc. 47 Shortliffe et al., [8] succinctly state this from an informatics standpoint-statistical 48 significance is not clinical relevance. Statistical significance is the certainty that the 49 discovered pattern-i.e., the discovered association between the biomarker and the significance, also accounts for clinical utility such as invasiveness, efficacy, safety, and 53 cost associated with the discovered biomarkers. So, for example, if the newly discovered 54 biomarker is just as precise but less invasive or cheaper than the current clinical 55 standard, it has the same statistical significance but offers more clinical utility than the 56 current clinical standard in practice. 57 Traditional data mining methods only focus on discovering statistically significant 58 biomarkers. In this paper, we formulate the problem of identifying both statistically 59 significant and clinically relevant models as a statistical knowledge discovery problem. 60 Knowledge Discovery in Databases (KDD) is an important process of discovering useful 61 knowledge from data. KDD is the non-trivial process of "identifying valid, novel, 62 potentially useful, and ultimately understandable patterns in the data" [9]. help support any hypothesized association pattern between a biomarker and the clinical 78 outcome. Any incorrectly found association is called a false positive. High-dimensional 79 datasets considerably increase the risk of discovering false positives.

80
Bayesian Rule Learning (BRL) is a rule-based classifier generator that has been 81 shown to be proficient in biomarker discovery from high-dimensional datasets and has 82 been shown to perform better than state-of-the-art classifiers that are typically used for 83 such applications [10,11]. This implies that BRL models generalize better to unseen 84 data, which means they discover more valid patterns. finding the optimal BN model that maximizes the heuristic score, BRL infers a rule 126 model from the optimal BN. The constrained Bayesian network structure and its 127 representation is described in the next subsection. We then describe the heuristic 128 Bayesian score, and finally the greedy best-first search used to find the BN model that 129 optimizes the Bayesian score.

130
Model representation 131 We denote the input training dataset as D. Dataset D contains a set of n 132 discrete-valued, candidate predictive variables X 1:n . These are the n candidate 133 biomarkers measured during the study that generated the data. We use a subscript to 134 represent the i -th variable, X i . Additionally, D also contains a specified discrete target 135 variable, Y . This is the clinical outcome of interest for this data analysis task, for represent probabilistic dependencies between the variables. Variable X j is considered a 145 parent of X i if there is a directed edge from X j → X i . This implies that variable X i is 146 probabilistically dependent upon X j . A set of parents of variable X i is represented as 147 Π Xi . Each node, X i , in the network contains a parameter, θ i , which is a probability 148 distribution of the values taken by the node conditioned on its parents, θ i = p(X i |Π Xi ). 149 This information from the parent is sufficient because each variable is independent of its 150 non-descendants, given its parents [13]. Parameter set Θ is just the collection of the 151 probability distributions of all the nodes of the network. Bayesian networks provide 152 considerable computational savings by representing a full joint probability distribution 153 by decomposing the probability distribution of each variable to be dependent only on its 154 parents. For a brief tutorial on generalized Bayesian networks, please refer to 155 Heckerman [14].

156
In this paper, we will only look at a special case of a generalized Bayesian network, 157 called constrained Bayesian network (BN). A BN only consists of the target variable, Y , 158 and its parents Π Y . We illustrate this using  To evaluate the quality of the BNs found during the search, we need a heuristic score to 180 evaluate how well it explains the input training data. Buntine [15], under certain 181 assumptions, proposed a heuristic Bayesian score called the BDeu (Bayesian Dirichlet 182 equivalence uniform) score to compute the joint probability of a generalized Bayesian 183 network and the input training data. We modify this score to only evaluate for the 184 constrained Bayesian network containing the target variable and its parents. This 185 modified score is shown in Eq 1.
Here, index j iterates though each of the q Y possible variable-value configurations of 187 the parents of the target Y . Index k iterates through each value taken by the target configuration specified by j. And, N ij = k N ijk . All the terms described so far is 194 collectively called the likelihood function.

195
The first term p(B S ), in Eq 1 is the prior term. It is the prior distribution that 196 represents our prior belief about which of the models in the model space is likely to be 197 correct, before we saw the training data. In our earlier work [12], we showed how to use 198 this term to incorporate prior domain knowledge from literature to help improve the 199 model predictive performance. In this paper, however, we will use it to constrain the 200 search to prefer models with fewer parents added to the target variable. This makes an 201 assumption that models with fewer variables are more likely to be correct. This can be 202 helpful while modeling high-dimensional data as the search would try to reduce false 203 positive discoveries.

204
To achieve sparser models, we use a prior distribution score as described by Koller et 205 al. [16] to prefer models with fewer parents added to any given node in the network. 206 This is prior term is shown in Eq 2.
Here, the exponent term |Π i | is the total number of parents that the target variable 208 has in structure B S . Hyperparameter κ is a value ranging between 0 < κ ≤ 1. A value 209 of κ = 1 represents a uniform prior over all network structures. While, smaller values of 210 κ generates a distribution that prefers structures with nodes having fewer and fewer 211 parents, larger values for κ generates a distribution that allows more and more parents. 212 We combine Eq 1 and Eq 2 to obtain the heuristic score for BRL as shown in Eq 3. 213 Once we have the optimal BN, we can predict the target variable value of an unseen 214 test instance, Y t , given the value of its parent variables, X t , the learned optimal BN, 215 B S , and the training data D. We compute this using the expectation of the parameter 216 associated with the target variable as shown in Eq 4.
To make a prediction, we simply return that k value for which the above value is the 218 highest i.e., arg max k p(Y t = k|X t = j, B S , D). 219 We now need a search algorithm to find the BN that maximizes this score. We 220 describe this search algorithm in the next subsection.

221
Greedy best-first search 222 In this paper, BRL uses a greedy best-first search to find the BN that maximizes the 223 heuristic score in Eq 3. The greedy best-first search algorithm described here is based 224 on the same as the one described in our earlier work (see [11]). We modify it slightly to 225 remove the user parameter, MAX CONJ, from the search. The constraint in the 226 complexity of the model, in this paper, comes from the prior term we specified in the 227 heuristic score in Eq 3.

228
The BRL search algorithm takes as input, the training dataset D. In this paper, we 229 set the α = 1, similar to how we did in the previous paper [11]. We set the value of simulated datasets. In this paper, we do not concern ourselves with optimizing the prior 232 term, so we keep it fixed.

233
In the first step of the search algorithm, BRL initializes with n BNs, each containing 234 the target variable Y and a unique parent variable selected from X i=1:n . Each BN is 235 evaluated by the heuristic score in Eq 3. The BN with the highest score is selected for 236 the next iteration of the algorithm. Let's call this model, prev best, i.e., the model that 237 did the best in the previous iteration. In each subsequent iteration, a new parent is 238 added to Y , such that it is not already present in its parent set Π Y . After having tried 239 each variable to its parent set, one at a time, the BNs are evaluated again using the 240 heuristic score. If the score of any of these models improve upon the score of the model 241 which performed the best in the previous iteration, prev best, then the new specialized 242 model is chosen as the best model in this iteration. This iteration continues, until 243 adding another variable from D does not help improve the heuristic score of the BN. As mentioned in the introduction section, BRL helps us find valid and understandable 249 rule patterns from the training data. However, it does not account for clinical utility.

250
Let us re-visit the heuristic score in BRL, as shown in Eq 3. Here, the structure prior 251 term is p(B S ) = κ |Πi| , is the prior distribution that represents our belief in which of the 252 models in the hypothesis space is likely to be correct a priori. We set the value to prefer 253 models with fewer parents. In our previous work, we developed a method called 254 BRL p [12], we used the structure prior term, p(B S ), to incorporate prior domain 255 knowledge to create a bias in the search to prefer network substructures that have been 256 shown to be promising in the literature. The rest of the term in the equation is called 257 the likelihood function that encodes the likelihood that the observed training data was 258 generated by a given BN model. The likelihood function gives us a measure of how well 259 the model fits the data. Generally speaking, the better the model fits the data, the more 260 likely it is to generalize to unseen test data. The prior term encodes the prior probability 261 of which of the BNs is the correct data-generating model. Together, the likelihood 262 function and the prior term assist the search algorithm in identifying promising 263 candidate BNs that are most likely to have generated the training dataset. In KDD 264 definition, these two terms help the search algorithm find valid patterns from data.

265
To bias the search to prefer more useful patterns, we modify the heuristic score to 266 Eq 5.
This equation encodes the joint probability of the BN structure (B S ), the data (D), 268 and the clinical relevance as encoded by the utility function (Ψ). The utility function 269 p(Ψ|B S ) takes the BN structure as input and outputs the clinical relevance of the 270 model. The term p(Ψ|B S ) encodes a probability distribution that represents our belief 271 about which of the models in the hypothesis space is more clinically relevant. 272 We encode the utility function similar to how we encoded the prior distribution over 273 models in BRL p . We use the informative prior as specified by Mukherjee et al. [17]. shown in Eq 6.
Let us see an example to understand how to use this utility function. Say, the 279 current clinical standard being used in medical practice is expensive and we want 280 cheaper yet still effective models to replace them. In other words, we want more 281 cost-effective models. The utility function is encoded to score cheaper models, higher.

282
One way to compute the model cost is by adding the cost associated with each 283 biomarker in the model. To enable BRL-KD to look for cost-effective models, the set T 284 in Equation 6 iterates through each biomarker in the dataset. Weight w t is the weight 285 of the biomarker t. For cost-effectiveness, we set the weight to the cost of the biomarker. 286 The concordance function here can be an indicator function that returns 1, when the 287 variable t exists in the BN structure, B S . We use the same scoring function, we have 288 described here, to demonstrate BRL-KD on a real-world dataset in the next subsection. 289 An important consideration while using this utility function is to encode it as a 290 maximization problem. This is because we maximize the heuristic score in the greedy 291 best-first BRL search. However, currently we have encoded the current utility function 292 in such a way that we want to minimize the function (i.e., minimize the total cost). To 293 keep this consistent with the Bayesian score heuristic, we must convert the utility 294 function into a maximization problem. We can do that by simply encoding the negative 295 value of cost, instead of the cost. Now, the utility function needs to be maximized in 296 order to minimize the overall cost.

297
Another important consideration while using the utility function is the value set for 298 the weights. When encoding the cost, some markers may cost a few US dollars, some 299 others may cost thousands of dollars. This will lead to this term being either too large 300 or too small as a result of it being on the exponent term of the utility function. To 301 avoid this, we must scale this value. For example, we can perform min-max scaling for 302 the values to range between 0 and 1.  BRL-KD maximizes the heuristic function in Eq 5 to identify patterns that are valid, 317 novel, potentially useful, and understandable, from the data. To specifically seek novel 318 patterns, we may alter the utility function to encode a metric that quantifies how well 319 known it is in literature to have each of the biomarkers to be associated with the 320 disease. We explain this situation further in the Future Work section. For now, let us 321 design and observe BRK-KD in practice using a real-world example. The goal of the experiment here is to study the changes in behavior of BRL-KD models 324 as a result of altering the hyperparameter λ. Specifically, we want to observe the 325 changes in two evaluation metrics-1) a clinical relevance metric, here we calculate the 326 total cost of all the biomarkers selected by the model, and 2) a predictive performance 327 metric, here we use area under the receiver operating characteristic curve (AUROC).

328
In this experiment, we use a real-world dataset collected to learn differential features 329 between individuals who are at risk to develop cardiovascular disease (CVD) and those 330 who are unlikely to develop the disease in the near future. Here, we assume that the 331 most cost-effective diagnostic model is the clinically most relevant model. Cost-effective 332 models have both good predictive performance and are cheaper than clinical standard 333 currently used in practice.   The data had to be pre-processed to prepare it for analysis using BRL. We first had 350 to clean the data. The dataset had 608 clinical variables. For this data analysis, we  We randomly split the dataset into 70% training data and 30% test data. We will 365 use the training data to identify our preferred value for hyperparameter λ by observing 366 the average cost and average AUROC over cross-validation, achieved by a set value for 367 λ. The average of the metrics is computed over 10-fold cross-validation. We then choose 368 the λ with our preferred trade-off between cost and AUROC, and use that to model the 369 training dataset. We observe these models and evaluate their performance on the 30% 370 held-out test set.

371
Since BRL-KD can only handle discrete valued data, we discretize the continuous 372 valued variables from the Heart SCORE data under cross-validation. When the training 373 data is split into 10-folds, in each iteration of the cross-validation, we discretize on the 374 training-fold using the supervised minimum description length (MDL) principle 375 method [19] and use the learned bins to discretize the test-fold dataset. Once we 376 determine the ideal value for λ from the cross-validation experiment, we discretize the 377 70% training data using the MDL method and use the learned bins to discretize the 378 30% test data.

379
Methods compared 380 We will study the change in behavior of BRL-KD with the change in hyperparameter λ. We encode the term p(Ψ|B S ) in Eq 6 with a distribution that represents our belief about which of the models in the hypothesis space is more cost-efficient. We do this using Eq 7.
In this equation, λ represents the relative importance of considering cost as opposed 381 to the likelihood term (that tries to optimize predictive performance). The set of 382 weights, w = {w P E , w V AP , w F S , w M B }, represent the cost of physical examination 383 (w P E ), cost of Vertical Auto Profile test for lipids (w V AP ), cost of finger stick tests 384 (w F S ), and the cost of running a full metabolome profile of 1228 biochemicals (w M B ). 385 We want to state a disclaimer that none of the values encoded in this experiment 386 represents reality and were not done consulting a medical practitioner. They were 387 merely estimated to demonstrate the functionality and behavior of BRL-KD as a proof 388 of concept. The numbers in the next paragraph were obtained using simple online 389 search queries. In reality, the cost can be influenced by many factors including time, 390 insurance, medicare, and the location of point-of-care. For now, we assume that the cost 391 associated with each biomarker presented in the next paragraph is correct and we 392 observe the function of BRL-KD in light of having specified such values in practice.

393
The Heart SCORE dataset contains clinical and metabolic variables. For all the 394 demographic and questionnaire-based variables, we set the cost of the biomarkers to $0. 395 We assume that it is free to obtain markers that can be supplied by asking the 396 individual or can be obtained from their medical record. We set the cost of a physical 397 exam, w P E = $146. By setting the value of w P E = 146, we state that if the BRL model 398 requires the use of any one or more variables from the physical examination report (e.g., 399 height, weight, or body mass index), the model would incur a cost of $146. The cost of 400 Vertical Auto Profile (VAP) test for lipids is set to w V AP = $2813, for finger stick tests 401 w F S = $1, and the whole metabolic panel w M B = $1000. We re-emphasize that the cost 402 in this example is purely hypothetical and instead the focus of experiments in this paper 403 is to demonstrate and evaluate the mechanism of being able to incorporate such

408
The costs range in thousands leading to this probability ranging widely. To prevent 409 this, we scale the cost using min-max scaling to have the cost values between the range 410 0 and 1. We also take the negative value of the scaled cost in order to convert this 411 utility function into a maximization problem.

412
Evaluation metrics 413 We measure two metrics-1) clinical relevance using the total cost of the model, and 2) 414 predictive performance using AUROC.

415
The cost metric is simply the sum of the costs of each marker selected by the BRL 416 classifier. If the BRL model would select any variable from one of the sets with specified 417 costs, the model would incur a cost as specified by the associated weight. For example, 418 if the BRL model selects 2 metabolic variables, it would still only incur the cost of w M B 419 once, since we assume that the whole metabolic panel is run. Again, we emphasize that 420 our description of utility is meant to merely depict a real-world application and that our 421 choice may not reflect reality. Our goal is to evaluate the use of BRL-KD in a possible 422 real-world scenario.

424
Performance over cross-validation 425 On the 70% training data, we perform 10-fold cross validation. In each fold, the training 426 data is split into 90% train and 10% development data. The 90% train is used to learn 427 a BRL-KD model. We try different values of the hyperparameter  We observe that the average cost steadily decreases with the increase in the value of 431 λ. The average AUROC is more or less steady with the change in λ. The value of λ = 0 432 sets the whole expression of p(Ψ|B S ) in Eq 5 to 1. As a result, this heuristic 433 corresponds to that of plain BRL. The average cost of this model is $2687.8 and the 434 associated average AUROC is 0.5930.

435
As we increase the value of λ, the BRL-KD model starts to pick more cost-efficient 436 markers. This means markers that are of similar or slightly poorer quality but at a 437 cheaper cost. As we increase the λ value, we get cheaper and cheaper BRL-KD models. 438 Value λ = 4 corresponds to an average cost of $1000 and average AUROC of 0.5575.

439
This λ value presents with a cheaper option at the loss of some predictive power. Value 440 λ = 10 corresponds to an average cost of $400.2 and average AUROC of 0.5566. Value 441 λ = 20 corresponds to an average cost of just $0.2 and average AUROC of 0.6001. This 442 λ value results suggest that there are similarly good AUROC performing models by 443 simply using variables that are available for free (according to our definition).

444
Performance over held-out test set 445 We now look at the models learned on the overall 70% training dataset, for 446 λ = {0, 4, 10}, to observe the BRL-KD models. Note that we also looked at models with 447 greater values of λ but those models remained consistent after λ = 10. The value λ > 10 448 corresponded with models with a total cost of $0 so there was nothing to optimize 449 further using the λ hyperparameter. We estimate the AUROC by evaluating the model 450 on the held-out 30% test dataset. Together, the set of BRL-KD models, offering 451 trade-offs between cost and AUROC, is called a Pareto set of solutions.

452
The resulting BRL-KD model cost and AUROC is plot in Fig 3. We see that with 453 increasing λ, we get cheaper models but we get them at a loss of AUROC performance 454 on the test set. 455 We now observe the BRL-KD models generated by the three values of λ. Fig 4a,   456 shows the BRL-KD rule model we get when λ = 0. This sets the whole expression of

474
BRL-KD compared to other rule-based classifiers 475 We additionally learned three popular rule-based classifiers from the Weka machine 476 learning suite (version 3.9.3) [20], namely-C4.5 [21], a decision tree learning algorithm 477 (where each path from root to leaf in the tree is translated into a rule), RIPPER [22], 478 and PART [23]. The goal of this paper was only to demonstrate and evaluate the functionality of 494 BRL-KD. We used a real-world dataset to do this and used an arbitrarily defined utility 495 function. The models developed in this paper do not perform well enough to be 496 considered for clinical use. However, this paper provides motivation to try and use the 497 functionality of BRL-KD to explore real-world datasets with a reasonably accurately 498 defined clinical utility function.

499
In this paper, we only considered one definition of clinical utility, i.e., 500 cost-effectiveness. However, clinical utility can also mean novelty, non-invasiveness, 501 specificity, etc. Or any combination of those definitions. An important future work 502 would be to define new ways to encode clinical utility function using the same general 503 form as shown in Eq 6. To do this, we must first try to quantify this utility metric at a 504 biomarker level. For example, say we are interested in finding models containing novel 505 biomarkers. One way to quantify novelty is by finding the number of references in 506 biomedical literature that associate each candidate biomarker with the disease. This can 507 be done using a test mining tool like KinderMiner [25]. Similar metric was quantified by 508 Kleiman et al. to discover novel lab tests [24]. This metric can be used to encode the 509 utility function in BRK-KD to return models with novel biomarkers. These models are 510 clinically useful by providing new biomedical research directions.

511
Similarly, we can encode biomarker specificity by looking at gene-disease ontologies 512 and encoding the weights with the reciprocal of the number of known diseases 513 associated with each input gene. By maximizing such a function, BRL-KD would prefer 514 models containing biomarker with fewer known associated disease. These models are 515 clinically useful by providing tests that are specific to only the disease under study and 516 not a biomarker that indicates many associated illnesses.

518
The goal of the experiments was to demonstrate the functionality and behavior of 519 BRL-KD and not to actually find a workable clinical model for CVD diagnosis. We saw 520 that increasing the λ hyperparameter helped us specify a range of possible trade-off 521 values between clinical relevance and predictive performance. We used cost as the utility 522 function measuring clinical relevance. Using BRL-KD, we tuned the hyperparameter λ 523 to study a range of different trade-offs between cost and predictive performance.

524
To pass the four evaluation criteria posited by Selleck at al. [5], as described in the 525 Introduction section, we need to consider everything right from the experimental design 526 used to collect the dataset to the statistical data analysis method. However, this paper 527 only dealt with the statistical data analysis step. We assumed that the candidate 528 biomarkers measurements, chosen for data collection in the study, were reliable and 529 reproducible. However, to improve the chances of successful translation of discovered 530 biomarkers into clinical practice, the scientific investigators must also consider efficient 531 experimental design for data collection.

532
The ultimate decision on λ requires us to know a few things about the current 533 clinical standard being used in practice. This will help us determine from the data if 534 there exists a λ value that gives us a model that is clinically more relevant than the one 535 being used in medical practice today. From the experiments in this paper, however, it 536 appears to be clear that BRL-KD is a useful tool to help us find clinically more relevant 537 models and encourages usage in a real-world problem.

538
The Java code for BRL-KD has been made open source and is available for public 539 use from the GitHub repository (https://github.com/jeya-pitt/Bayesian-Rule-Learning). 540