BiGPICC: a graph-based approach to identifying carcinogenic gene combinations from mutation data

Genome data from cancer patients encapsulates explicit and implicit relationships between the presence of a gene mutation and cancer occurrence in a patient. Different types of cancer in humans are thought to be caused by combinations of two to nine gene mutations. Identifying these combinations through traditional exhaustive search requires an amount of computation that scales exponentially with the combination size and in most cases is intractable even for cutting-edge supercomputers. We propose a parameter-free heuristic approach that leverages the intrinsic topology of gene-patient mutations to identify carcinogenic combinations. The biological relevance of the identified combinations is measured by using them to predict the presence of tumor in previously unseen samples. The resulting classifiers for 16 cancer types perform on par with exhaustive search results, and score an average of 80.1% sensitivity and 91.6% specificity for the best choice of hit range per cancer type. Our approach is able to find higher-hit carcinogenic combinations, targeting which would take years of computation using exhaustive search.


Introduction
The multi-hit theory of carcinogenesis states that it takes combinations of gene mutations to initiate carcinogenesis in humans 1 . Clinical studies and a body of mathematical models have established that the size of such combinations (hits) can range from 2 to 9 depending on the cancer type [2][3][4][5][6] . However, most computational efforts to find carcinogenic mutations focus on finding individual "driver mutations" responsible for carcinogenesis [7][8][9][10] . While these mostly mutation frequency- and signature-based methods identify driver mutations associated with an increased risk of cancer, these mutations by themselves cannot cause cancer.
Cancer-screening methods based on genetic predisposition rely heavily on identifying driver mutations. However, some people with a genetic predisposition may never get cancer, while others get cancer as they grow older and accumulate more mutations. For example, women under the age of 20 with the BRCA1 mutation are very unlikely to get breast cancer; moreover, 28% of all women with the BRCA1 mutation never get the cancer 11 . Similar statistics can be observed for Li-Fraumeni syndrome [12][13][14][15] . These observations provide strong evidence that cancer in a patient is caused not by an individual driver gene mutation but rather by a combination of gradually accumulated gene mutations. Though there are several other factors responsible for cancer, such as epigenetic modifications, tumor environment, and adaptive evolution, carcinogenesis is primarily a result of genetic mutations.
Different combinations of gene mutations can cause cancer of the same type but with different etiologies and pathologies, representing different subtypes. To design and develop individualized and precision drugs for treating an individual with cancer, we need to identify the combinations of gene mutations in that patient. Since current computational approaches mostly look for driver gene mutations, they fall short in the context of precision drug discovery. The human genome comprises ∼20,000 genes, and each gene can host dozens of mutations. Finding the carcinogenic combinations of size h requires examining |G |^h combinations even with the simplistic assumption of one mutation per gene (here, G is the set of genes in the human genome). For h > 4, computationally evaluating such a large number of combinations can take years. In this work, we propose an approach that can efficiently find carcinogenic gene combinations even for large values of h.
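To give a sense of scale, the number of combinations to examine can be computed directly; a quick sketch assuming |G | = 20,000 and one mutation per gene:

```python
from math import comb

# Number of size-h gene combinations to examine under the simplistic
# one-mutation-per-gene assumption, for |G| = 20,000 genes.
G_SIZE = 20_000
for h in range(2, 8):
    print(f"h = {h}: {comb(G_SIZE, h):.2e} combinations")
```

Each additional hit multiplies the count by roughly |G |/h, i.e. three to four orders of magnitude, which is why exhaustive evaluation becomes infeasible for h > 4.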

Exhaustive search methods
A recent body of work [16][17][18] explored all possible gene combinations to identify those responsible for cancer based on tumor and normal samples from cancer patients. They mapped the task to a weighted set cover problem, where each combination is associated with a set of tumor samples in which it is jointly mutated and is assigned a weight based on its classification accuracy when used for differentiating between tumor and normal samples. They implemented a heuristic approximating the minimum weight set cover that requires O(|G |^h) time to identify a single combination of length h. For |G | ≈ 20,000, the size of the resulting search space makes this approach intractable for h > 4.

Our contribution
Here we present BiGPICC (Bipartite Graph Partitioning for Identifying Carcinogenic Combinations) -a parameter-free approach to finding multi-hit combinations based on the topology of mutation data. Analyzing mutation data is potentially more insightful for cancer genomics than using molecular networks, as the latter are simplified abstractions of gene or protein interactions. To the best of our knowledge, this is one of the first network-based algorithms working directly with gene-sample mutations.
We formulate the search in terms of a community detection problem on a bipartite graph representation of the data, and design and implement an algorithm for solving it. Our numerical experiments on the Summit supercomputer for 16 cancer types demonstrate that it identifies combinations of comparable biological relevance to state-of-the-art results. At the same time, our approach is capable of efficiently identifying relevant (5+)-hit combinations, unavailable to the existing methods due to their high computational cost. An additional advantage of our method is that it does not require any manual tuning, as it has no parameters representing some form of domain knowledge, which means that it is readily applicable to a broad variety of datasets.

Methods
Our approach for identifying carcinogenic combinations from data on gene mutations in tumor and normal tissues is outlined in Figure 1. Viewing the data as a bipartite graph, we iteratively partition it using community detection to find candidate gene combinations whose mutations tend to occur in tumor tissues, then filter out those with frequent mutations in normal samples. The final set of carcinogenic combinations is identified by selecting a minimum set cover of the tumor samples from the filtered candidate pool. To assess the relevance of the identified combinations, we use them to differentiate between tumor and normal samples on previously unseen data. A detailed description of these steps is provided below.

Graph formulation
Let G and S denote the sets of genes and samples in the data, respectively. The input data contains information about whether a mutation of gene g was observed in sample s for every gene-sample pair (g, s) ∈ G × S . Specifically, the data is stored in a binary |G | × |S | matrix D whose entries are 1 if a mutation of the corresponding gene was observed in the corresponding sample, and 0 otherwise. This information can be equivalently represented as an unweighted bipartite graph on the vertices of two distinct classes G and S , where an edge connects some g ∈ G and s ∈ S if and only if the corresponding mutation has been observed. We denote such a graph G, and notice that the adjacency matrix of G is the symmetric block matrix [[0, D], [D^T, 0]]. The presence of both normal and tumor samples in the data means that the vertex class S is itself partitioned into two subclasses, denoted S norm and S tum respectively, with opposing significance to the problem. To isolate the information about gene mutations in tumor tissues, we consider the graph G tum := G[G ⊔ S tum ], the induced subgraph of G on the vertices G ⊔ S tum . Recall that a community is a subset of graph vertices that are more densely connected with one another than with the rest of the graph, according to some metric. Any community C in G (or G tum ) consists of the gene component C G := C ∩ G , a combination of genes whose mutations tend to occur together, and the sample component C S := C ∩ S (or C ∩ S tum , respectively), the samples in which these genes co-mutate the most. Because bipartite communities of multiple vertices must contain vertices from both classes to allow for internal connections, both C G and C S are non-empty whenever |C| > 1.
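As a minimal sketch of this construction (with a toy matrix D rather than real TCGA data), the adjacency matrix of G can be assembled from D and checked for the expected symmetry:

```python
import numpy as np

# Toy binary mutation matrix D: rows are genes, columns are samples;
# D[g, s] = 1 iff a mutation of gene g was observed in sample s.
D = np.array([
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
])
n_genes, n_samples = D.shape

# Adjacency matrix of the bipartite graph: gene vertices first, then samples.
A = np.block([
    [np.zeros((n_genes, n_genes), dtype=int), D],
    [D.T, np.zeros((n_samples, n_samples), dtype=int)],
])
print(A.shape)  # → (7, 7)
```

Each observed mutation contributes exactly one (undirected) edge, so A has twice as many nonzero entries as D.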
For some gene g ∈ G , let M(g) ⊂ S denote the samples in which a mutation of g has been observed. Furthermore, for a gene combination C G , denote the samples in which these genes are jointly mutated as M(C G ) := ⋂ g∈C G M(g). Analogously, we define the tumor samples with a mutation of g as M tum (g) := M(g) ∩ S tum , and with a joint mutation of C G as M tum (C G ) := M(C G ) ∩ S tum . If C G is carcinogenic, its joint mutation is thought to explain some number of tumor samples from the data, so C G must be fully connected to the non-empty set M tum (C G ) in the graph representation of the data. For any unrelated gene g ′ ∈ G \ C G , the connectivity between the genes C G ∪ {g ′ } and the samples M tum (C G ) is expected to be less dense, because a mutation of g ′ is unlikely to appear in every sample from M tum (C G ) by chance unless it is mutated in most of S tum .
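The sets M(g) and M(C G ) translate directly into set operations; a small illustration on a toy matrix (gene and sample indices are hypothetical):

```python
import numpy as np

# Toy mutation matrix: rows = genes, columns = samples.
D = np.array([
    [1, 1, 0, 1],
    [1, 1, 1, 0],
    [0, 1, 1, 1],
])

def M(g):
    """Samples in which a mutation of gene g was observed."""
    return set(np.flatnonzero(D[g]))

def M_joint(genes):
    """M(C_G): samples in which all genes of the combination co-mutate."""
    return set.intersection(*(M(g) for g in genes))

print(M_joint([0, 1]))           # → {0, 1}
print(M_joint([0, 1]) & {0, 3})  # M_tum(C_G) if the tumor samples were {0, 3}
```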
It follows that in the absence of highly mutable genes, any carcinogenic gene combination C G is expected to correspond to a community C G ⊔ M tum (C G ) in G tum . At the same time, the converse is not necessarily true: because carcinogenic combinations are assumed to jointly mutate in predominantly tumor samples, not every community in G tum has a carcinogenic combination as its gene component. For example, if tumor samples are but a small fraction of all the samples with a joint mutation of C G (i.e. |M tum (C G )| ≪ |M(C G )|), then C G is unlikely to be a carcinogenic combination despite C G ⊔ M tum (C G ) being a community in G tum . Therefore, the task of finding carcinogenic combinations of gene mutations can be cast as a problem of identifying communities with some desired degree of tumor prevalence in their sample components.

Community detection
We use the Constant Potts Model (CPM) to formally define the notion of community structure in a graph. CPM was proposed as an alternative to the commonly-chosen modularity approach that does not suffer from the issue of inconsistent communities across different scales 27 . For the case of G tum , an unweighted bipartite graph on the vertex classes G and S tum , CPM formalizes its partition into disjoint communities P as the maximization of the quality

Q(P) = Σ C∈P ( e(C) − γ |C G ||C S | ),   ( * )

where e(C) is the total number of internal edges in community C (that is, between C G = C ∩ G and C S = C ∩ S tum ), and γ ∈ [0, 1] is the so-called resolution parameter. This parameter can be viewed as the density of internal connectivity required from a set of vertices C for it to qualify as a community, or, more specifically, for C to positively contribute to the partition quality ( * ). Notice that whenever the gene component or the sample component of C is empty, its contribution to ( * ) is zero.
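The quality ( * ) can be transcribed literally for an edge list; the partition below is illustrative toy data, not one found by the Leiden algorithm:

```python
def cpm_quality(partition, edges, gamma):
    """CPM quality (*) for a bipartite partition: each community is a pair
    (genes, samples); edges is a set of (gene, sample) pairs."""
    total = 0.0
    for genes, samples in partition:
        e_C = sum(1 for (g, s) in edges if g in genes and s in samples)
        total += e_C - gamma * len(genes) * len(samples)
    return total

edges = {(0, "a"), (0, "b"), (1, "a"), (1, "b"), (2, "c")}
# Two communities: genes {0, 1} with samples {a, b}; gene {2} with sample {c}.
partition = [({0, 1}, {"a", "b"}), ({2}, {"c"})]
print(cpm_quality(partition, edges, gamma=0.5))  # → (4 - 0.5*4) + (1 - 0.5*1) = 2.5
```

At γ = 1 every community must be fully connected to contribute positively, matching the reading of γ as a required internal connectivity density.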
To identify communities within a graph, we rely on the Leiden algorithm 28 , chosen for its guarantees on the community connectivity and speed. In particular, the Leiden algorithm converges to a partition in which all subsets of all communities are guaranteed to be locally optimally assigned. The Leiden algorithm starts by viewing every vertex as a separate community, and then alternates between moving nodes between communities, refining the partition, and aggregating communities into single vertices to reduce the graph size, until no further improvement to the chosen partition quality (in our case, ( * )) can be made.
Because the possible number of carcinogenesis drivers in a combination is known or hypothesized for many cancer types, we are interested in finding communities whose gene component size h := |C G | is within a certain range l ≤ h ≤ u obtained from the literature. To enforce this size constraint, we iteratively refine the identified communities using the Leiden algorithm until the size of their gene component does not exceed u, the maximum possible number of carcinogenesis drivers. Specifically, if a community C in G tum has a gene component C G that is too large, we refine it by first extending its sample component to ⋃ g∈C G M tum (g) to include all relevant tumor samples, then applying the Leiden algorithm to the resulting subgraph G tum [C G ⊔ ⋃ g∈C G M tum (g)] to partition C G based on the genes' mutations. Every time the Leiden algorithm is run on a (sub)graph induced by genes C G and samples C S , the CPM resolution parameter is set to the connectivity density of this (sub)graph, e(C)/(|C G ||C S |), requiring the communities to be internally connected at least as densely as the (sub)graph itself. After filtering out the communities whose gene component is too small, the gene components of the remaining communities are considered candidates for carcinogenic combinations.

Procedure 1 Iterative Partitioning
Input: graph G tum on vertices G ⊔ S tum , carcinogenic combination size bounds l and u. Output: candidates for carcinogenic combinations K • .

The above approach to partitioning G tum to identify candidate gene combinations is described formally in Procedure 1. Because the Leiden algorithm is randomized, every call of Procedure 1 may yield a different set of candidate gene combinations. To increase the likelihood of carcinogenic combinations appearing among the results, we perform multiple iterative partitioning passes via Procedure 1 and combine the obtained sets of candidate combinations K • into the joint candidate pool K .
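The control flow of Procedure 1 can be sketched as below; `partition_fn` stands in for a Leiden call on the induced subgraph (the actual pipeline uses leidenalg with the CPM quality), and the halving split used in the demo is only a placeholder:

```python
def iterative_partition(genes, l, u, partition_fn, out=None):
    """Procedure 1 sketch: keep communities whose gene component has
    l <= size <= u; recursively re-partition those that are too large;
    communities that are too small are filtered out."""
    if out is None:
        out = []
    if len(genes) <= u:
        if len(genes) >= l:
            out.append(sorted(genes))
        return out
    for part in partition_fn(genes):
        if part and len(part) < len(genes):  # guard against non-progress
            iterative_partition(part, l, u, partition_fn, out)
    return out

# Placeholder for a Leiden call: split the gene set in half.
halve = lambda gs: [gs[: len(gs) // 2], gs[len(gs) // 2:]]
print(iterative_partition(list(range(10)), l=2, u=3, partition_fn=halve))
# → [[0, 1], [2, 3, 4], [5, 6], [7, 8, 9]]
```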
The iterative partitioning passes are independent of each other and can be performed in a distributed fashion. We run them in parallel on the Summit supercomputer, relying on the implementation of the Leiden algorithm from the Python library leidenalg.

Filtration of candidates
For a candidate combination C G ∈ K , define its tumor ratio as the share of tumor samples among all those in which C G is jointly mutated, r(C G ) := |M tum (C G )| / |M(C G )|. We impose a threshold ρ ∈ [0, 1] on the tumor ratio of candidate combinations, so that only "sufficiently carcinogenic" candidates K ρ := {C G ∈ K : r(C G ) ≥ ρ} (notice that K 0 = K ) are considered when selecting the final set of carcinogenic combinations from the pool. For example, the choice of ρ = 1 implies considering only candidate combinations with no joint mutations in normal samples. This particular choice, however, is likely to result in implausible carcinogenic combinations due to the possibility of non-mutagenic drivers behind carcinogenesis and potential inaccuracies of genomic data.
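Filtration then reduces to thresholding r(C G ); a minimal sketch with hypothetical candidate combinations and sample sets:

```python
def tumor_ratio(tumor_samples, all_samples):
    """r(C_G) = |M_tum(C_G)| / |M(C_G)|."""
    return len(tumor_samples) / len(all_samples)

# Hypothetical candidate pool: combination -> (M_tum(C_G), M(C_G)).
K = {
    ("gene_a", "gene_b"): ({1, 2, 3}, {1, 2, 3, 4}),  # r = 0.75
    ("gene_c", "gene_d"): ({5}, {5, 6, 7, 8}),        # r = 0.25
}
rho = 0.5
K_rho = {c for c, (mt, ma) in K.items() if tumor_ratio(mt, ma) >= rho}
print(K_rho)  # → {('gene_a', 'gene_b')}
```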

Minimum set cover
For a reasonable choice of ρ, the set of tumor samples explained by the "sufficiently carcinogenic" candidates, S tum ρ := ⋃ C G ∈K ρ M tum (C G ), is expected to coincide with S tum , assuming that every tumor there is caused by a joint mutation of some gene combination from G whose size is in the hypothesized range. Given the threshold ρ, we select carcinogenic combinations as a subset of K ρ whose joint mutations explain S tum ρ most concisely, by constructing a minimum set cover of S tum ρ using the sets of tumor samples from {M tum (C G ) : C G ∈ K ρ }. Specifically, we employ a greedy heuristic that at each step chooses the candidate combination from K ρ covering the most of the yet unexplained samples in S tum ρ (see Procedure 2). The size of the resulting set of carcinogenic combinations C ρ is guaranteed to approximate the size of the true solution within a factor of ln m, where m := max{|M tum (C G )| : C G ∈ K ρ } is the biggest number of tumor samples explained by a "sufficiently carcinogenic" candidate 29 .

Procedure 2 Greedy Minimum Cover
Input: samples S tum ρ to cover, "sufficiently carcinogenic" candidate combinations K ρ Output: carcinogenic combinations C ρ
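A minimal transcription of the greedy heuristic in Procedure 2 (toy inputs; in the pipeline each candidate C G ∈ K ρ contributes its set M tum (C G )):

```python
def greedy_cover(universe, explained):
    """Repeatedly pick the candidate explaining the most uncovered samples."""
    uncovered, chosen = set(universe), []
    while uncovered:
        best = max(explained, key=lambda c: len(explained[c] & uncovered))
        gain = explained[best] & uncovered
        if not gain:
            break  # remaining samples cannot be explained by any candidate
        chosen.append(best)
        uncovered -= gain
    return chosen

explained = {"A": {1, 2, 3}, "B": {3, 4}, "C": {4, 5}}
print(greedy_cover({1, 2, 3, 4, 5}, explained))  # → ['A', 'C']
```

Note that candidate "B" is skipped entirely: after "A" is chosen, "C" covers more of the remaining samples, which is exactly the per-step choice the ln m approximation guarantee relies on.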

Algorithmic complexity
The time complexity of Procedure 1 is defined by the number and complexity of the calls to the Leiden algorithm it makes. Let n 0 , n 1 , . . . denote the numbers of genes in the inputs to these calls in chronological order. In particular, n 0 = |G |, as the first Leiden algorithm call takes the whole graph G tum as the input. For any k > 0, the input genes of the k-th call are a strict subset of the input genes of some previous call, and therefore n k ≤ n j − 1 for some j < k. Also, if every i < k satisfies n i ≤ |G | − i, then so does i = k, as n k ≤ n j − 1 ≤ |G | − j − 1 ≤ |G | − k. Because the condition is satisfied for k = 1, we obtain by induction that n i ≤ |G | − i for every i.

The resulting time complexity of the BiGPICC pipeline is determined by T Ldn (n), the runtime of the Leiden algorithm on an n-vertex graph, when the p calls to Procedure 1 are done in parallel. While it was proved that a CPM-optimal partition of an arbitrary n-vertex graph is reachable in O(n) steps of the Leiden algorithm, no upper bound on the number of steps, and therefore on T Ldn (n), is known. However, the algorithm was empirically shown to run in near-O(n) time on a variety of real-world and generated graphs of up to 10^7 vertices 28 . Assuming a linear runtime of the Leiden algorithm, the BiGPICC time complexity becomes O( 1/2 |G |^2 + |G ||S | + p|S tum ||G |l^−1 ).

The mutation data is represented as a |G | × |S | binary matrix. Candidate combinations identified in the community detection step are a superset of all BiGPICC outputs and span a total of O(p|G |) non-unique vertices. Therefore, the method's memory complexity is O(|G ||S | + p|G |).

Learning threshold for the tumor ratio
To choose the best value of ρ, we frame the BiGPICC pipeline as a classification problem by using the selected carcinogenic combinations C ρ to predict whether previously unseen samples are normal or tumor. Namely, a sample is classified as tumor if it has a joint mutation of any C G ∈ C ρ , and as normal otherwise. Notice that an increase in the value of ρ improves the precision score of the classifier while also growing the size of C ρ (as the choice of C G ∈ K ρ at each step of Procedure 2 is narrowed). Because the number of gene combinations in C ρ corresponds to the classifier's complexity and therefore its propensity to overfit, 1 − ρ can be viewed as the amount of regularization in the training, a trade-off between the "carcinogenicity" and the generalizability of the learned combinations.
Before BiGPICC can access the data, we remove 25% of the samples to serve as a test dataset S test for the final model. On the remaining 75% of samples S \ S test , we employ 4-fold cross-validation to find the optimal value of the hyperparameter ρ. Specifically, S \ S test is partitioned into equal parts S (k) fold , k = 1, 2, 3, 4, and in 4 separate scenarios the pipeline is run on the samples (S \ S test ) \ S (k) fold to produce C ρ for each value of ρ ranging from 0 to 1 in increments of 0.01. The values of ρ are then assessed by averaging the performance of the C ρ -based binary classifier on S (k) fold across the 4 scenarios. We use the Matthews correlation coefficient (MCC, also called the phi coefficient), which takes values between -1 and 1 and is considered more informative than the commonly chosen F1 and accuracy scores 30 , as a single performance metric. Because constructing C ρ and evaluating its performance for individual values of ρ can be done in parallel, the search for the best ρ does not introduce a significant computational overhead. The value of ρ delivering the highest mean MCC is then used to train the final classifier (by re-running the BiGPICC pipeline) on the samples S \ S test , and its performance is reported on S test .
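The MCC used for model selection follows directly from the confusion counts; a self-contained sketch (the counts below are illustrative, not from our experiments):

```python
from math import sqrt

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; defined as 0 when a marginal is empty."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Scoring one candidate value of rho on a held-out fold reduces to tallying
# the four counts for the joint-mutation classifier and computing the MCC.
print(round(mcc(tp=80, fp=5, tn=90, fn=20), 3))  # → 0.753
```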

Experimental setup
Our somatic mutation data was collected in mutation annotation format (MAF) from The Cancer Genome Atlas (TCGA) for 16 cancer types using Mutect2 software. We identify a set of 331 matched blood-derived normal samples from all cancer types. We use the Variant Effect Predictor (VEP) to determine the location (intron, exon, UTR) and effect of these variants (synonymous, non-synonymous, missense, nonsense), and only consider protein-altering variants (non-synonymous, nonsense, and insertion/deletions in exons).
We apply BiGPICC to 16 datasets for different cancer types in 33 numerical experiments, each exploring a combination size h or its range for a cancer type according to existing literature. Each of the 16 datasets contains the same set of 331 normal tissue samples. Table 1 shows the hypothesized ranges of h and the dataset parameters for each cancer type. For each cancer type and multi-hit range, we run the community detection step (Procedure 1) p = 10,000 times to increase the probability of including all carcinogenic combinations to the candidate pool.

Classification performance
In each numerical experiment we build a classifier based on the identified gene combinations of the desired size to differentiate between tumor and normal samples. We use the MCC, specificity, sensitivity, and F1 metrics to highlight various aspects of our classifiers. Each of the numerical experiments performs the 4-fold cross-validation runs and chooses the best value of ρ based on the highest mean MCC across the 4 folds. We demonstrate the effect of the value of ρ on the MCC in individual numerical experiments in Figure 2. After using the chosen value of ρ to train the final classifier on G ⊔ (S \ S test ), we report its combination count and performance metrics on the previously unseen test samples S test in Table 2. The spread of classification performance within the same cancer type is likely attributable to the biological relevance of the chosen hit range in individual experiments. The variation in performance across different cancer types can be explained by the data availability in each case and the representativeness of the available dataset for the high-dimensional search space. In particular, the poor performance of the PRAD and BLCA classifiers is consistent with previous results based on the same data 16,19 .

Comparison against exhaustive search results
Al Hajri et al. 17 identified carcinogenic combinations of length 2, 3, and 4 for various cancer types through an exhaustive search (ES) of all possible gene combinations of a given length. Similarly to the minimum set cover step of BiGPICC, they iteratively add carcinogenic gene combinations until the resulting set explains all tumor samples in the data. Each step of the ES approach selects the combination C G maximizing αn TP + n TN , where n TP is the number of currently unexplained tumor samples in which C G is jointly mutated and n TN is the number of normal samples in which it is not. The parameter α represents the importance of correctly identifying normal samples relative to the analogous importance for tumor samples, and was fixed to α = 0.1 in all ES experiments. Figure 3 shows the comparison between the classification performance of the carcinogenic combinations identified by BiGPICC for h < 5 and the metrics reported in Al Hajri et al. The two methods have comparable performance except for GBM cancer, where BiGPICC performs significantly worse than ES. In seven out of ten cases, BiGPICC outperforms ES in either specificity or sensitivity, and in two cases in both metrics. A possible explanation for the heuristics-based BiGPICC outperforming an exhaustive search method may be a suboptimal choice of the constant α = 0.1 in the latter. In addition, the train-test splits used in our runs are not the same as in Al Hajri et al. In particular, our second attempt at a random train-test split of the GBM data resulted in significantly higher MCC scores on S test : 0.791 for GBM, 2 and 0.714 for GBM, 3, with the new specificity for GBM, 2 outperforming its ES counterpart.

Runtime performance
The 4-fold cross-validation BiGPICC runs on 75% of all samples for tuning the parameter ρ were conducted using 280 Summit nodes each and took on average 26 minutes across all experiments, ranging from 11 minutes (SARC, 8) to 1 hour 29 minutes (UCEC, 2). Running the pipeline on the full dataset (without the validation samples) to measure the resulting classification performance was done using 70 nodes and took 23 minutes on average, ranging from 8 minutes (SARC, 8) to 1 hour 33 minutes (UCEC, 6-7). Adding the two mean runtimes, identifying carcinogenic combinations on the full dataset from scratch takes BiGPICC on average 146 node-hours on Summit.
For comparison, the ES method from Dash et al. 16 requires 124 Summit node-hours to identify 4-hit combinations for BRCA cancer type based on 75% of the available samples. Assuming ideal scaling of the method, the time required to identify 5-, 6-, and 7-hit combinations using all 4,600 nodes of Summit supercomputer would be 4.4 days, 39 years, and 107,000 years, respectively. In contrast, the BiGPICC runtimes are not noticeably affected by the combination size h and instead depend on the size and topology of the input graph determined by the cancer type. It follows that BiGPICC is orders of magnitude faster than ES for any h > 4.
While the ES method relies on the GPUs of each compute node, BiGPICC uses only the CPUs and thus can be moved to a less expensive CPU-only cluster to achieve the same runtime.

Conclusions and discussion
We proposed a community detection-based approach for identifying carcinogenic combinations and showed that the biological relevance of its results is comparable to that of the state-of-the-art. At the same time, BiGPICC enables the discovery of (5+)-hit combinations that are intractable for exhaustive search methods even on the most modern supercomputers.
In all cancer types considered, biological relevance of the identified combinations, measured as classification performance, tends to drop as the number of hits increases. An identical trend in the exhaustive search results from Al Hajri et al. 17 suggests that, aside from biological reasons, the issue may lie with the so-called curse of dimensionality in both the machine learning and combinatorial contexts. An increase in the number of hits means fewer samples with joint mutation of a gene combination exist in the data while the search space of possible combinations grows exponentially. Therefore, the number of samples required for finding carcinogenic combinations grows with the number of hits, while our runs use the same dataset for every multi-hit range.
The classification performance of BiGPICC exhibits a significantly higher variability in sensitivity, the percentage of correctly classified tumor samples, than in specificity, the analogous percentage for normal samples. The sensitivity and specificity scores across the experiments are distributed as 0.763 ± 0.175 (mean ± SD) and 0.911 ± 0.054, respectively. We attribute this trend to the implicit control exhibited over the specificity score by the parameter ρ, whose learned value is consistently high in the experiments (0.938 ± 0.051). Let n TP , n FN , n TN , and n FP denote respectively the number of correctly predicted tumor samples, incorrectly predicted tumor samples, correctly predicted normal samples, and incorrectly predicted normal samples in the training dataset, with the total number of samples n = n TP + n FN + n TN + n FP . The parameter ρ ensures that the combinations selected for the classifier have a sufficient tumor ratio, which translates into controlling the classifier's precision n TP /(n TP + n FP ). Assuming for simplicity no overlap between the samples in which the selected combinations are jointly mutated, the relationship is given by n TP /(n TP + n FP ) ≥ ρ (in general, the left-hand side can be either larger or smaller depending on the tumor ratio in the overlapping samples). It follows that the number of false positives is limited by the number of tumor samples in the dataset: n FP ≤ ((1 − ρ)/ρ) n TP ≤ ((1 − ρ)/ρ)(n TP + n FN ) = ((1 − ρ)/ρ) rn, where r = (n TP + n FN )/n is the ratio of tumor samples in the data. Because n FP is the only factor negatively affecting the specificity n TN /(n TN + n FP ), the latter is bounded from below by n TN /(n TN + n FP ) = 1 − n FP /((1 − r)n) ≥ 1 − ((1 − ρ)r)/((1 − r)ρ), which tends to 1 as ρ → 1.
In particular, the average value of 1 − ((1 − ρ)r)/((1 − r)ρ) across the experiments is 0.865 (if r is calculated using the full datasets; it is expected to be approximately the same across the training-validation split). At the same time, the pipeline does not exhibit control over n FN , the adverse factor for the sensitivity n TP /(n TP + n FN ), which is comprised of the tumor samples left unexplained by "sufficiently carcinogenic" combinations.
Unlike most other approaches, BiGPICC does not require manually balancing the importance of tumor and normal samples in the data. Instead, its hyperparameter ρ, which implicitly takes on this role, is learned from the data using cross-validation. This renders our approach parameter-free, alleviating the need to manually tune it on a case-by-case basis.

Limitations and future work
BiGPICC does not offer a dedicated mechanism to differentiate between the driver mutations causing carcinogenesis and the passenger mutations that do not contribute to cancer formation. This means that some of the highly mutable genes among the identified ones may be mutated in tumor samples by chance and therefore lack biological relevance. In particular, each of our datasets contains between 41 and 67 genes with mutations in over 50% of the samples. This can lead to many candidate combinations with near-identical explanatory power, which may explain the observed phenomenon of similar classification performance exhibited by significantly different sets of carcinogenic combinations. Increasing the number of samples in the data would mitigate the issue of driver-passenger distinction by amplifying the signal of carcinogenic pathways against the backdrop of noise from chance mutations. Under the assumption that highly mutable genes are unlikely to drive carcinogenesis, another approach would be to remove genes with a mutation frequency above some domain-informed threshold from the analysis.
Another limitation of the method is that it is run on the gene-patient data encoding all possible mutations of a gene as a binary variable, thus ignoring the variability in mutations of individual genes. However, the BiGPICC pipeline could similarly be applied to mutation-level data for the identified carcinogenic gene combinations. Given that the identified genes typically constitute less than 1% of the original gene pool, we expect the mutation-level data to be of similar size and tractability to the datasets used in this study.
Assuming that a sufficient amount of data is available, an additional use of our pipeline is testing competing multi-hit hypotheses by comparing the performance of the learned carcinogenic combinations of the corresponding sizes. The ICGC database 31 contains roughly twice as many samples as our datasets for individual cancer types, and may be a prospective dataset for trying this approach.