MTGCL: Multi-Task Graph Contrastive Learning for Identifying Cancer Driver Genes from Multi-omics Data

Cancer is a complex disease that typically arises from the accumulation of mutations in driver genes. Identification of cancer driver genes is crucial for understanding the molecular mechanisms of cancer, and developing the targeted therapeutic approaches. With the development of high-throughput biological technology, a large amount of genomic data and protein interaction network data have been generated, which provides abundant data resources for identifying cancer driver genes through computational methods. Given the ability of graph neural networks to effectively integrate graph structure topology information and node features information, some graph neural network-based methods have been developed for identifying cancer driver genes. However, these methods suffer from the sparse supervised signals, and also neglect a large amount of unlabeled node information, thereby affecting their ability to identify cancer driver genes. To tackle these issues, in this work we propose a novel Multi-Task Graph Contrastive Learning framework (called MTGCL) to identify cancer driver genes. By using self-supervised graph contrastive learning to fully utilize the unlabeled node information, MTGCL designs an auxiliary task module to enhance the performance of the main task of driver gene identification. MTGCL simultaneously trains the auxiliary task and main task, and shares the graph convolutional encoder weights, so that the main task enhances the discriminative ability of the auxiliary task via supervised learning, whereas the auxiliary task exploits the unlabeled node information to refine the node representation learning of the main task. The experimental results on pan-cancer and some specific cancers demonstrate the effectiveness of MTGCL in identifying the cancer driver genes. In addition, integrating multi-omics features extracted from multiple cancer-related databases can greatly enhance the performance of identifying cancer driver genes, especially, somatic mutation features can effectively improve the performance of identifying specific cancer driver genes. The source code and data are available at https://github.com/NWPU-903PR/MTGCL. Author Summary Identifying cancer driver genes that causally contribute to cancer initiation and progression is essential for comprehending the molecular mechanisms of cancer and developing the targeted therapeutic strategies. However, wet-lab experiments are time-consuming and labor-intensive. The advent of high-throughput multi-omics technology provides an opportunity for identifying the cancer driver genes through data-driven computing approaches. Nevertheless, effectively integrating these omics data to identify cancer driver genes poses significant challenges. Existing computational methods exhibit certain limitations. For instance, conventional approaches (e.g., gene mutation frequency-based methods, network-based methods) often focus on a single omics data, while existing deep learning-based methods have not fully utilized the abundant unlabeled node information, so that their identification accuracy is not high enough. Thus, by fully utilizing multidimensional genomics data and molecular interaction networks, we propose a multi-task learning framework (called MTGCL) to identify cancer driver genes. MTGCL synergistically combines graph convolutional neural networks with graph contrastive learning. The experimental results validate the power of MTGCL for identifying cancer driver genes.


Introduction
Cancer driver genes directly or indirectly participate in the formation and development of tumors, and their abnormal changes, such as gene mutations and copy number variations, can cause cell carcinogenesis, uncontrolled proliferation and metastasis of cancer cells, thus promoting tumor formation and progression [1][2][3].Identifying cancer driver genes helps uncover the pathogenesis of cancer, develop more precise therapeutic schedule, and explore some new drug targets [4,5].Although wet-lab experiments can identify the cancer driver genes, it is time consuming and expensive.With the rapid development of biotechnology, and the implementation of projects such as The Cancer Genome Atlas (TCGA) [6] and the International Cancer Genome Consortium (ICGC) [7] and other projects [8,9], a large amount of various omics data (e.g., mutation data, transcriptome data, proteome data, genome copy number change data) related to cancers have been generated.These omics data provide a guarantee for developing computational methods to identify cancer driver genes.
Existing computational methods of identifying cancer driver genes can be classified into three categories: gene mutation frequency-based methods, network-based methods, and machine learning-based methods.Gene mutation frequency-based methods identify the genes with higher mutation frequencies than the background mutations as cancer driver genes [10,11].These methods are limited by the selection of background mutation rate, making them difficult to distinguish high-frequency passenger genes and may miss some genes closely related to cancer but with lower mutation frequencies.Network-based methods assume that genes with important network structural properties [12,13] or affecting the state of the entire molecular network [14,15] are cancer driver genes.Due to the complex interrelationships among genes and the limited information on known gene relationships [16], network-based methods are limited by the incompleteness of the gene interaction network [17].
Machine learning-based methods of identifying cancer driver genes can be further divided into traditional machine learning methods and deep learning methods.Traditional machine learning methods usually select the effective features (e.g., gene mutation frequency, gene differential expression, and DNA methylation levels) to represent the genes, then feed these features into the classifier (e.g., support vector machine, random forest, and multilayer perceptron) to identify cancer driver genes [18][19][20].Although these traditional machine learning-based methods have achieved better results in identifying cancer driver genes, they only utilize the genomic multi-omics features, which are insufficient to obtain the domain knowledge related to cancer.In recent years, deep learning models have made unprecedented progress in the fields of molecular biology and genomics.Some deep learning-based cancer driver gene identification methods have presented to identify cancer driver genes.Especially, graph neural networks that effectively integrate graph topology and node attribute information have been developed to identify cancer driver genes.For instance, EMOGI [21] primarily used graph convolutional networks (GCNs) to identify cancer driver genes by integrating proteinprotein interaction networks and multiple genomics data (i.e., gene expression data, gene mutation frequency data, DNA methylation data, and copy number variation data), and also used the Layer-wise Relevance Propagation [22] to interpret and analyze the prediction results.MTGCN [23] employed a multi-task graph convolutional neural network to identify pan-cancer driver genes and specific cancer driver genes by optimizing both node prediction task and link prediction task during the learning process, and also concatenating the biological multi-omics features with network structure features obtained from DeepWalk [24].Although EMOGI [21] and MTGCN [23] have achieved better results than the traditional machine learning-based cancer driver gene identification methods, they still have the following limitations: EMOGI only utilized a very sparse set of labeled genes information, which limits its performance improvement; MTGCN designed an link prediction auxiliary task to utilize the unlabeled gene information, but its prediction performance and computational efficiency is significantly affected by the sparsity and scale of protein-protein interaction (PPI) network.
In view of that self-supervised learning provides a feasible solution for graph representation learning by effectively exploiting the unlabeled data, especially graph contrastive learning (GCL) uses the data itself to provide supervised information to guide learning richer representation of each sample from a large amount of unlabeled data [25][26][27], in this work we introduce self-supervised graph contrastive learning to design an auxiliary task to share the graph convolutional encoder weights with the main task, and jointly train an end-to-end multi-task graph contrastive learning model to identify cancer driver genes.In sum, the contributions of this work are mainly as follows: 1) We introduce self-supervised graph contrastive learning to design an auxiliary task, and utilize the graph convolutional networks to plan the main task of node classification.The main task enhances the discriminative ability of the auxiliary task via supervised learning, while the auxiliary task refines the gene representation learning of the main task by exploiting the unlabeled node information.2) We design a new convolutional layer structure for main task and auxiliary task, which can improve the performance of cancer driver gene identification by effectively integrating graph structure topology information and node features information.3) We perform comprehensive experiments on pan-cancer and some specific cancers to demonstrate the effectiveness of MTGCL in identifying cancer driver genes.

MTGCL architecture
MTGCL is an end-to-end framework (Fig 1) that effectively combines graph convolutional networks (GCNs) with graph contrastive learning (GCL) to identify cancer driver genes by using the molecular interaction networks and gene features as inputs.It generates scores for each gene, which represent the probability of the gene being a cancer driver gene.MTGCL mainly consists of a main task module and an auxiliary task module.The main task module is used to realize the semi-supervised node binary classification task using GCNs, while the auxiliary task is used to implement the node representation learning task using graph contrastive learning.We simultaneously train the two task modules and let them share the graph convolutional encoder weights, so that the main task enhances the discriminative ability of the auxiliary task via supervised learning, whereas the auxiliary task exploits the unlabeled node information to refine the node representation learning, thereby improving the classification performance of the main task.This learning paradigm enables MTGCL to learn more cancer-related representations for each gene.

Performance of MTGCL and other seven contrastive methods for identifying pan-cancer driver genes
In order to evaluate the performance of MTGCL for identifying pan-cancer driver genes, we compared our MTGCL with other seven methods on CPDB, STRING and PathNet networks.Seven contrastive methods are the five baseline methods of Multi-Layer Perceptron (MLP), DeepWalk [24], GCN [28], GAT [29], ChebConv [30], and two stateof-the-art methods of EMOGI [21] and MTGCN [23].The hyperparameters for each method can be found in the Supplementary A. The metrics of the area under the precision-recall curve (AUPR) and the area under the Receiver Operating Characteristic curve (AUC) are used to evaluate the performance of MTGCL and other seven comparative methods.AUPR punishes much more the existence of false-positive cancer driver genes among the best-ranked prediction scores, therefore it is a more significant quality metric than AUC in the class-imbalanced task [31].The results of MTGCL and other seven comparative methods in 5-fold cross-validation (5CV) test are shown in  The results in Table 2 show that all the proposed components for building MTGCL are valid and contribute to the final performance of MTGCL, and integrating omics features extracted from multiple cancer databases can effectively improve the performance of MTGCL for identifying pan-cancer driver genes.

Performance analysis of MTGCL in identifying novel pancancer driver genes
In this section, we analyze the performance of MTGCL in identifying novel pan-cancer driver genes on three molecular networks (i.e., CPDB, STRING and PathNet networks).
For each molecular network, we run MTGCL 10 times in 5CV test to calculate the average prediction score of genes as the final prediction score for the following analysis.
We first analyze whether there is a significant difference in prediction scores between known cancer driver genes (KCGs) and candidate cancer genes (CCGs)/ non-cancer genes (noCGs)/ other genes, as well as between CCGs and noCGs/ other genes.Fig 2a presents the prediction scores of KCGs, CCGs, noCGs and other genes on three molecular networks, and also provides the statistical significance t-test results between KCGs and CCGs/ noCGs/ other genes, as well as between CCGs and noCGs/ other genes.
As shown in Fig 2a, we can see that on three molecular networks, the scores of KCGs are significantly higher than that of CCGs, noCGs and other genes, and the scores of CCGs are also significantly higher than that of noCGs and other genes, indicating that MTGCL can not only accurately identify KCGs, but also predict CCGs that potentially driver cancer development.[37,38] and involves in the interaction between breast cancer cells and immune cells in the tumor microenvironment [39].

Performance contrastive of MTGCL with other seven methods for identifying specific cancer driver genes
To validate the effectiveness of MTGCL in identifying specific cancer driver genes, we perform MTGCL and other seven methods on CPDB network to identify the driver genes of Breast Cancer (BRCA) and Liver Hepatocellular Carcinoma (LIHC), respectively.We just use the breast-related data to extract the gene node features, and liver-related data to extract the gene node features in the CPDB network.That is, we extracted 40dimensional features ( 40 ) for each gene, which consist of 3-dimensional features ( 3 ) of single nucleotide mutation frequency, gene differential methylation, and gene differential expression obtained from TCGA, as well as 37-dimensional somatic mutation features ( 37 ) collected from the COSMIC database.The genes annotated in the NCG database [9] that are associated with BRCA/LIHC cancer are used as the positive samples.After mapping them to the CPDB network, we can obtain 201 positive samples for BRCA and 82 positive samples for LIHC.The 2,187 negative sample genes determined in pancancer driver gene identification are taken as the negative samples for identifying BRCA/LIHC cancer driver genes.The optimal hyperparameters used in MTGCL and other seven methods can be found in "Hyperparameter settings" in Supplementary A. has a significant contribution to improve the performance of MTGCL compared to the features ( 3 ) of single nucleotide mutation frequency, gene differential methylation, and gene differential expression.However, EMOGI [21] and MTGCN [23] only use  3 features to identify the BRCA/LIHC driver genes in their published original literatures.

Discussion
In this work, we developed a novel framework of MTGCL to identify the cancer driver Although MTGCL has achieved good performance in identifying pan-cancer and some specific cancers (i.e., BRCA and LIHC) driver genes, it can be improved from the following four aspects.Firstly, MTGCL should consider introducing more biological prior information, developing efficient feature extraction and integration approaches to effectively represent genes, even though it considers more omics data related to gene mutation types, achieving a significant performance improvement of cancer driver gene identification.Secondly, different molecular networks have a certain impact on the identification results of MTGCL, we should consider fully utilizing more known molecular interaction information to build a more complete network.Thirdly, MTGCL introduces graph contrastive learning to design an auxiliary task to alleviate the limited label problem, but it still cannot completely get rid of its dependence on labels, therefore we should develop an effective self-supervised learning method to reduce dependence on label information.Our MTGCL is a very flexible framework, especially in the design of the convolutional layer.Here, we chose to use the first-order neighbor nodes (not including the node itself) degree normalized average for neighbor aggregation, which is very suitable for the data characteristics of the problem studied in this work.However, in other fields, this aggregation method can be replaced by other methods such as graph attention mechanism, etc.

Materials and Methods
An overview of the data MTGCL utilizes the multi-omics data and three different biological molecular networks to identify cancer driver genes.The multi-omics data are used to build the gene feature matrix  ∈  ×85 , which contains  genes, and each gene is represented by an 85-dimensional feature vector   ∈  1×85 .The 85 features in   are composed of two parts: 1) 48 features were obtained from the gene expression data, gene mutation data, and DNA methylation data of 16 different cancer types from The Cancer Genome Atlas (TCGA) database [6] by performing the same data preprocessing pipeline as EMOGI; 2) 37 features were related to different mutation types, which are obtained from somatic mutation data of 37 primary tissues in COSMIC [8] database by performing the same data collection and preprocessing pipeline as cTaG [40].Three biological molecular networks are CPDB network, STRING network and PathNet network.CPDB network is a human functional interaction network that includes human binary and complex protein-protein, genetic, metabolic, signaling, gene regulatory and drug-target interactions, as well as biochemical pathways, and it is derived from the Consensus Pathway database [41].STRING network is a protein-protein association network, which is derived from the STRING database [42].PathNet network is the pathway network that includes KEGG and Reactome pathways [43].The detailed processing procedures of multi-omics data and three molecular networks are provided in Supplementary A.
For identifying pan-cancer driver cancers, we choose the positive and negative samples with the same strategy as EMOGI.The positive samples (i.e., cancer driver genes) are the union of the known cancer driver genes (KCGs) obtained from the expert-curated list in The Network of Cancer Genes (NCG) [9], the Cancer Gene Census (CGCs) genes from the COSMIC [8] database, and high-confidence genes mined from literature [44].The negative samples(i.e., non-cancer driver genes) are obtained by iteratively removing genes from the positive samples, NCG [9] candidate cancer driver genes, the genes associated with 'pathways in cancer' in the KEGG database [45], the disease genes in OMIM [46], the genes predicted to be involved with cancer by MSigDB [47], and the genes related to the expression of cancer genes [48]

Main task module of MTGCL
In this section, we first analyzed the performance differences of using different graph convolutional structure in the main task module in which it adopts semi-supervised GCN model to achieve node classification task (i.e., identifying cancer drive genes), especially analyzing the reasons for performance improvement of ChebConv [30] (used in EMOGI [21] and MTGCN [23]) compared to GCN [28] and GAT [29].Then, we presented a new graph convolutional layer structure for the main task and auxiliary task.
From the comparative experimental results (R1 Fig in Supplementary A) of using different graph convolutional structures, we can find that ChebConv can better integrate the graph structure topology information and omics feature information of nodes than GCN and GAT, thereby improving the performance of identifying cancer driver genes.This is because ChebConv performs message aggregation by separating the mapping matrices of node self-features (ego-features) from its neighboring features aggregated (neighborhood-features), while GCN and GAT share the mapping matrix.Therefore, ChebConv can avoid the discriminating features of a driver gene being smoothed during message aggregation, thereby alleviating the over-smoothing problem in GCN and GAT.
That is, separating the feature embedding matrices for the ego-features and neighborhood-features during each convolution operation can enhance the model's utilization of node feature information, thereby improving the model's performance in identifying cancer driver genes.Due to the fact that ChebConv is a frequency-based graph convolution structure, it is difficult to understand and lacks flexibility in its convolutional structure.Therefore, leveraging the aforementioned discovered characteristic, we designed a new convolutional layer structure (Equation 1) from the perspective of the spatial domain.
For network (, , , ) = (, ), where  is the edge set of molecular network,  ∈  × is the adjacency matrix,  is the node set,  = || is the number of nodes in the network,  is the matrix consisting of the content features associated with each molecular node, the representation of our proposed graph convolutional layer structure at the k-th layer can be formulated as: where,  0 = X,  is the degree matrix ( ), ( • )is the nonlinearity activation function (i.e., ReLU, Sigmoid). 1 and  2 represent the learnable parameter matrices of the ego-features and neighborhood-features of the node at the (k-1)-th layer, respectively.Operator (, ) represents the operation of summing (or concatenating, etc.)  and .In this work, we choose the concatenation operation in the first layer, because the first-order neighbors of a node have greatest impact on that node, and evaluating the relationship between the node own features and its first-order neighbor features is more conducive to learning more reliable representations.For the second layer and subsequent layers, we choose the summation operation, because their outputs need to be input into the auxiliary task of graph contrastive learning, and too high dimension of node features will increase the computational burden.For the second layer and subsequent layers, we choose the summation operation, because the output of the last layer of the graph convolutional encoder need to be input into the auxiliary task of graph contrastive learning, and too high dimension of node features will increase the computational burden.Considering our computer configuration and the model performance improvement, here we set  = 2.The output ŷ of the main task can be formulated as: where,  31 , 32 ∈   * 1 represent the learnable parameter matrices of the ego-features and neighborhood-features of the node at the 3-th layer (i.e., output layer), respectively.
In the training process of MTGCL, we adopt following cross-entropy loss function   to optimize the main task.
where,   is the predicted score of n-th gene,   is the true label of n-th gene,   is the total number of gene in the training set, μ is a balance factor that controls the weight of positive samples, here we set μ to 1.

Auxiliary task module of MTGCL
Due to the sparsity of known cancer driver genes, the prediction performance will be limited if only adopting the supervised learning models to identify cancer driver gene.In view of that graph contrastive learning can utilize the data itself to guide the learning of effective representations for each node from a large amount of the unlabeled node data [25][26][27].In this work, we introduce the graph contrastive learning to design the auxiliary task module to refine the node representation learning of main task module by exploiting the unlabeled node information.First, we generate two graph views (i.e.,  1 = ( where,  1 is the degree matrix of the augmented graph  1 ( 1 , 1 ),  2 is the degree matrix of the augmented graph  2 ( 2 , 2 ) .After that, we utilize a pair of the shared multi-layer perceptron (MLP) to project the node representation from both views (i.e.,  1 and  2 ) into the mapping space to generate the mapping node representation matrices

Fig 1 .
Fig 1. Architecture of MTGCL.MTGCL mainly consists of two task modules: the main task module that implements the semi-supervised graph convolutional neural network node classification task, and the auxiliary task module that implements the graph contrastive learning node feature representation learning task.These two modules are trained simultaneously by sharing the weights in the graph convolutional encoder.The inputs of MTGCL consist of the adjacency matrix  and the node feature matrix .The output  is a vector in which each element represents the probability of each gene being predicted as a cancer driver gene.'Aug' means graph augmentation operation.′||′ indicates concatenation operation.

Fig 2 .
Fig 2. Performance of MTGCL for identifying novel pan-cancer candidate driver genes on genes.MTGCL is a multi-task learning model framework that introduces the graph contrastive learning to design the auxiliary task and proposes a novel graph convolutional layer structure for the main task of GCN-based node classification.The auxiliary task utilizes a large amount of unlabeled node information to boost the node representation learning of the main task, thereby improving the performance of cancer driver gene identification.The main task enhances the discriminative ability of the auxiliary task via supervised learning.MTGCL jointly trains the auxiliary and main tasks by sharing the propagation layer parameters to improve the model performance and reduce computational costs.The results on three molecular networks of GPDB, STRING and PathNet show that MTGCL outperforms other existing state-of-the-art methods in identifying pan-cancer and specific cancer driver genes.The ablation experimental results demonstrate that our proposed new graph convolution strategy can effectively improve the performance of identifying cancer driver genes, and the graph contrastive learning auxiliary task can boost the main task gene representation learning for further improving model performance.In addition, integrating omics features extracted from multiple cancer-related databases can greatly enhance the performance of identifying cancer driver genes.Especially, somatic mutation features can effectively improve the performance of identifying specific cancer driver genes.The analysis results of top 100 predicted cancer driver genes demonstrate that the confidence level of our MTGCL prediction results is relatively high.

Table 1 .
We also performed the t hypothesis test (t-test) to verify whether the performance improvement of MTGCL over the other contrastive methods is significant.the other seven methods of MLP, DeepWalk, GCN, GAT, ChebConv EMOGI and MTGCN for identifying the cancer diver genes on three molecular networks of CPDB, STRING, and PathNet.The AUPR and AUC values of MTGCL on the CPDB network are 0.8316 and 0.9084, which are 0.0051~0.1610and 0.0031~0.0948higher than those of the other seven methods, respectively; AUPRC and AUC values of MTGCL on STRING network are 0.8240 and 0.9206, which are 0.0147~0.1458and 0.0076~0.0742higher than those of other seven methods, respectively; AUPRC and AUC values of In addition, we also compared the running time and convergence speed of MTGCL with MTGCN that also uses the GCN multi-task learning strategy (i.e., taking the node prediction as the main task and the link prediction as the auxiliary task) under the same experimental environment.On CPDB network, MTGCL takes 203 seconds to run 1000 epochs, while MTGCN takes 1,004 seconds; the convergence iteration number of MTGCL is 1,900, while MTGCN is 2,500.On STRING network, MTGCL takes 185 supplementary B) of MTGCL compared to ChebConv, EMOGI and MTGCN methods on CPDB, STRING and PathNet networks.From Table 1 and S1 Fig, we can see that the performance of our MTGCL outperforms all of MTGCL on PathNet network are 0.8531 and 0.8622, which are 0.0328-0.1523and 0.027-0.1286higher than those of other seven methods, respectively.Moreover, the AUPR performance improvement of MTGCL is statistically significant compared to these methods with similar performance, including MTGCN, EMOGI, and ChebConv.These results demonstrate that our MTGCL method has superior performance in identifying the pan-cancer driver genes.Table 1.Performance of MTGCL and other contrastive methods on CPDB, STRING and PathNet networks in 5CVtest.seconds to run 1000 epochs, while MTGCN takes 1,368 seconds; the convergence iteration number of MTGCL is 1,900, while MTGCN is 2,800.On PathNet network, MTGCL takes 89 seconds to run 1000 epochs, while MTGCN takes 342 seconds; the convergence iteration number of MTGCL is 700, while MTGCN is 800.Above results demonstrate that our MTGCL not only performs better than MTGCN in identifying pancancer driver genes, but also has higher computational efficiency than MTGCN.

Ablation experiments of diverse architecture components in MTGCL
and COSMIC) on MTGCL.The ablation experimental results of MTGCL are shown in

Table 2 .
In Table 2, MTGCL -GCL denotes that we remove the contrastive learning module from MTGCL.MTGCN ^ChebConv denotes that we use Chebyshev convolutional layer instead of our designed graph convolutional encoder in MTGCL.MTGCL XT denotes that we only use the 48-dimensional features extracted from TCGA in in MTGCL.MTGCL XC denotes that we only use the 37-dimensional features extracted from COSMIC in MTGCL.We also performed the t-test to verify whether the contributions of diverse architectural components in MTGCL are statistically significant.S2 Fig (in supplementary B) is the AUPR statistical significance test results of MTGCL compared to its variations (i.e., MTGCL -GCL , MTGCN ^ChebConv , MTGCL XT and MTGCL XC ).

Table 2 .
Ablation experiment results of MTGCL on CPDB, STRING and PathNet networks.As shown in Table 2 and S2 Fig, we can see that the AUPR values of MTGCL on GCL , indicating that introducing graph contrastive learning to design the auxiliary task can

.0022 0.9084±0.0011 0.8240±0.0027 0.9206±0.0013 0.8531±0.0028 0.8622±0.0032 improve
the performance of MTGCL in identifying the pan-cancer driver genes.The AUPR values of MTGCL on the CPDB, STRING and PathNet networks are 0.8316, 0.8240, and 0.8513, which are 0.0061, 0.0055, and 0.0287 higher than that of MTGCN ^ChebConv , respectively, our designed convolutional layer can enhance the performance of MTGCL than the Chebyshev convolutional layer.The AUPR value of MTGCL on the PathNet network are 0.8513, which is 0.0208, 0.008 higher than that of MTGCL XT and MTGCL XC , respectively, indicating that the integration of omics features extracted from TCGA and COSMIC databases can improve the performance of MTGCL.
[33]1, RYR3, SLX4, and RMI1) are the essential genes in Project Achilles[33].Out of the 57 NPCGs, 56 genes have at least one evidence supporting them the potential as [34][35][36]CF4, TEK, LAMA1, and GEN1) are annotated as driver genes, or oncogenes/tumor suppressor genes in the CancerMine database[32], and 4 genes (i.e., Supplementary C)), we can see that NPCGs are enriched into many KEGG pathways related to cancer, such as Focal adhesion, PI3K-Akt signaling pathway, Ras signaling pathway, Proteoglycans in cancer, MAPK signaling pathway, etc.And many genes in NPCGs play important roles in these cancer-related pathways.For example, PTK2 (also known as Focal Adhesion Kinase, FAK) is a key gene encoding the focal adhesion kinase, a critical protein in the focal adhesion pathway.Previous studies have shown that FAK plays a role in breast tumor progression, intestinal tumorigenesis, and squamous cell carcinoma[34][35][36].The FN1 gene is important in both the Focal Adhesion and PI3K-Akt signaling pathways.More studies have demonstrated that FN1 is highly expressed in breast cancer

Table 3
From Table3and S3 Table, we can see that the performance of our MTGCL is still superior to other seven methods in identifying BRCA and LIHC cancer driver genes, indicating that our MTGCL can also be effectively used to identify the specific cancer genes.In addition, we also investigated the impact of  3 and  37 features on MTGCL (S3 Fig), finding that somatic mutation features ( 37 ) Table (in Supplementary B) gives the significance t-test results of MTGCL compared to other methods on BRCA and LIHC cancers.

Table 3 .
Performance of MTGCL and other seven methods on BRCA and LIHC.
. Statistical results of CPDB, STRING and PathNet networks are shown in S2 Table (in Supplementary B).