Abstract
Multi-modal biological data integration can provide comprehensive views of gene regulation and cell development. However, conventional integration methods rarely utilize prior biological knowledge and lack interpretability. To address these challenges, we developed Pathformer, a biological pathway-informed deep learning model based on a Transformer with bias for integrating multi-modal data. Pathformer leverages a criss-cross attention mechanism to capture crosstalk between different biological pathways and between different modalities (i.e., multi-omics). It also utilizes the SHapley Additive exPlanations (SHAP) method to reveal key pathways, genes, and regulatory mechanisms. Through benchmark studies on 28 TCGA datasets, we demonstrated the superior performance and interpretability of Pathformer on various cancer classification tasks, compared to other integration models. Furthermore, we applied Pathformer to liquid biopsy multi-modal data integration and achieved high accuracy in cancer diagnosis. Pathformer also revealed interesting molecularly altered pathways in cancer patients’ body fluids, such as ligand binding of scavenger receptors, iron transport, and DAP12 signaling, which are related to extracellular vesicle transport, platelets, and immune response.
Introduction
The rapid progress in high-throughput technologies has made it possible to curate multi-modal data for disease studies using genome-wide platforms. These platforms can analyze different molecular alterations in the same samples, such as DNA alterations (e.g., mutation, methylation, and copy number variation) and RNA alterations (e.g., expression, alternative promoter, splicing, and editing). Integrating these multi-modal data offers a more comprehensive view of gene regulation in diseases (e.g., cancer) than analyzing a single data type1. For instance, multi-modal data integration helps address key challenges of cancer diagnosis and prognosis, such as intra- and inter-cancer heterogeneity and complex molecular interactions2. Therefore, there is a pressing need for advanced computational methods that uncover interactions within multi-modal data in cancer.
Current algorithms for integrating multi-modal data can be broadly categorized into three groups: early integration models that merge multi-modal data into a single matrix3, 4; late integration models that process each modality separately and then combine their outputs through averaging or maximum voting5, 6; and intermediate integration models that dynamically merge multi-modal data7, 8. Whereas previous methods mainly focused on unsupervised problems, several supervised algorithms have recently been proposed for classifying diseases. For example, mixOmics uses latent component analysis to find common features among multi-modal data9. Wang et al. proposed multi-omics graph convolutional networks (MOGONet), a late integration model that uses graph convolutional networks for modality-specific learning and a view correlation discovery network for multi-modal integration10. Moon et al. proposed MOMA, a multi-modal data integration and interpretation algorithm that utilizes attention mechanisms to extract important modules11. These methods rely on computational inference to capture relationships between modalities, but ignore immensely informative prior biological knowledge such as regulatory networks.
To improve interpretability, several studies have attempted to incorporate prior biological knowledge into deep learning models for multi-modal data integration. For instance, Ma et al. proposed a visible neural network that incorporates biological pathways to model the impact of gene interactions on yeast cell growth12. Meanwhile, a pathway-associated sparse deep neural network (PASNet) was utilized to accurately predict the prognosis of glioblastoma multiforme (GBM) patients13. Recently, P-net, a sparse neural network integrating multiple molecular features based on a multilevel view of biological pathways, was published for the classification of prostate cancer patients14. Another method, PathCNN, was developed to predict the survival of GBM patients by using principal component analysis (PCA) to define multi-modal pathway images and a convolutional neural network15. However, these algorithms rarely considered the synergy and nonlinear relationships between pathways. Given the complexity of biological systems, understanding pathway crosstalk is crucial for comprehending complex diseases16, and can help deep learning models better capture multi-modal interactions.
Inspired by these prior works, we propose Pathformer, which combines pathway crosstalk networks and a Transformer encoder with bias for the interpretation and classification of multi-modal data in cancer. Recently, the Transformer has demonstrated its capability in handling multi-modal tasks in computational fields17. However, it had not been applied to biological multi-modal data, owing to the lack of reliable biological embedding methods and of solutions to the memory explosion caused by the vast number of gene inputs. Pathformer addresses these challenges. First, Pathformer uses multiple statistical indicators of multi-modal data as the gene embedding, which comprehensively describes different perspectives of gene information. Second, Pathformer utilizes a sparse neural network based on prior pathway knowledge to transform gene embeddings into pathway embeddings, which not only captures valuable information but also addresses the memory explosion issue. Third, Pathformer incorporates pathway crosstalk networks into the Transformer model as bias to enhance the exchange of information between different modalities and pathways.
To the best of our knowledge, Pathformer is the first biological multi-modal integration model that combines prior pathway knowledge with a Transformer encoder. We evaluated Pathformer on 28 benchmark datasets from The Cancer Genome Atlas (TCGA)18 and demonstrated its superior performance and biological interpretability on various cancer classification tasks, compared to other integration models. We also applied Pathformer to liquid biopsy data, where it not only showed high accuracy for noninvasive cancer diagnosis but also revealed interesting molecularly altered pathways in human plasma.
Results
The Pathformer model
Pathformer utilizes a biological pathway network and a Transformer encoder to allow better information fusion. It has six modules: biological pathway input, pathway crosstalk network calculation, multi-modal data input, biological multi-modal embedding, Transformer module with pathway crosstalk network bias, and classification module (Fig. 1a, see Methods for details). Pathformer takes biological multi-modal data and biological pathway information as input, and defines biological multi-modal embeddings (gene embedding and pathway embedding). It then enhances the fusion of information between modalities and pathways by combining pathway crosstalk networks with the Transformer encoder. Finally, a fully connected layer serves as the classifier.
We curated all pathways from four public databases, then selected 1,497 pathways based on the criterion of gene number, overlap ratio with other pathways, and the number of pathway subsets. Next, we used BinoX19, a classic tool for crosstalk analysis, to calculate the crosstalk relationships among the 1,497 pathways. Based on these relationships, we created a pathway crosstalk network as Pathformer’s input (see Methods and Supplementary Notes).
Multi-modal biological data preprocessing and embedding are crucial components of Pathformer (Fig. 1b). We preprocessed the raw sequencing reads of DNA-seq and RNA-seq into multi-modal data, including DNA methylation, DNA copy number, and different RNA alterations (see Methods and Supplementary Notes). These multi-modal data are on different levels, such as nucleotide level, fragment level, and gene level, which significantly influences data integration. To address this, we used multiple statistical indicators as gene embeddings to retain gene diversity across modalities (see Fig. 1b and Methods). Subsequently, we used the known gene–pathway mapping relationships to develop a sparse neural network based on prior pathway knowledge (PSNN) that transforms the gene embedding into the pathway embedding. The PSNN has two layers, representing genes and pathways, respectively. These two layers are not fully connected; instead, connections are pruned according to gene–pathway membership. If a given gene does not belong to a given pathway, the connection weight between the two neurons is fixed at 0; otherwise, it is learned through training (see Methods). Pathway embedding is therefore a dynamic embedding method. The PSNN not only restores the mapping relationship between genes and pathways, but also identifies important genes in different pathways through the trained weights, and transfers the complementarity of modalities from the gene level to the pathway level. Additionally, this biological multi-modal embedding step requires no additional gene selection, thereby avoiding the bias and overfitting problems caused by artificial feature selection.
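The gene-to-pathway transformation above can be pictured as a masked linear map: only gene–pathway membership connections carry trainable weights, and all other connections are fixed at zero. Below is a minimal sketch in plain Python; the pathway names, membership lists, embeddings, and weights are purely illustrative, not Pathformer's actual values.

```python
# Hedged sketch of the pathway-based sparse neural network (PSNN).
# Connections absent from the membership mask stay at zero (never trained).
pathway_members = {
    "pathway_A": ["TP53", "MDM2"],   # hypothetical toy pathways
    "pathway_B": ["MDM2", "EGFR"],
}

def psnn_forward(gene_embedding, weights, bias=0.0):
    """gene_embedding: {gene: [float]*D}; weights: {(pathway, gene): float}.
    Returns {pathway: [float]*D}, a pathway embedding per pathway."""
    D = len(next(iter(gene_embedding.values())))
    out = {}
    for pathway, members in pathway_members.items():
        vec = [bias] * D
        for gene in members:  # only member genes are connected
            w = weights.get((pathway, gene), 0.0)
            vec = [v + w * x for v, x in zip(vec, gene_embedding[gene])]
        out[pathway] = vec
    return out

# Toy gene embeddings (D = 2) and trained-looking weights.
emb = {"TP53": [1.0, 0.0], "MDM2": [0.5, 0.5], "EGFR": [0.0, 1.0]}
w = {("pathway_A", "TP53"): 2.0, ("pathway_A", "MDM2"): 1.0,
     ("pathway_B", "MDM2"): 1.0, ("pathway_B", "EGFR"): 3.0}
pe = psnn_forward(emb, w)
# pathway_A = 2*[1,0] + 1*[0.5,0.5] = [2.5, 0.5]
```

In the actual model the mask is applied to a dense weight matrix during training, so non-member weights are zeroed out rather than enumerated per pathway; the effect is the same.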
Transformer module with pathway crosstalk network bias is the key module of Pathformer model (Fig. 1c). Inspired by the Evoformer model used in AlphaFold220 for processing multiple sequences, we developed the Transformer module based on criss-cross attention (CC-attention) with bias for data fusion of pathways and modalities. Particularly, multi-head column-wise self-attention (col-attention) is used to enhance the exchange of information between pathways, with the pathway crosstalk network matrix serving as the bias for col-attention to guide the flow of information. Multi-head row-wise self-attention (row-attention) is employed to facilitate information exchange between different modalities, and the updated multi-modal embedding matrix is used to update the pathway crosstalk network matrix by calculating the correlation between pathways. More details of the Transformer module are described in Methods.
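The core idea of biasing col-attention with the crosstalk network can be illustrated with a single-head sketch: the crosstalk matrix is added to the pre-softmax attention scores, so information flows preferentially between connected pathways. This is a simplified illustration (identity query/key/value projections, one head, no normalization layers), not Pathformer's actual implementation.

```python
import math

def attention_with_bias(X, bias):
    """Single-head self-attention over the rows of X (here: pathways),
    with an additive bias matrix (here: the pathway crosstalk network)
    on the pre-softmax scores: softmax(QK^T / sqrt(d) + bias) V."""
    n, d = len(X), len(X[0])
    scores = [[sum(X[i][k] * X[j][k] for k in range(d)) / math.sqrt(d)
               + bias[i][j] for j in range(n)] for i in range(n)]
    out = []
    for i in range(n):
        m = max(scores[i])                      # stable softmax
        exps = [math.exp(s - m) for s in scores[i]]
        z = sum(exps)
        probs = [e / z for e in exps]
        out.append([sum(probs[j] * X[j][k] for j in range(n))
                    for k in range(d)])
    return out

# Two toy pathway embeddings; a strong crosstalk bias between them forces
# each pathway to attend almost entirely to the other.
X = [[1.0, 0.0], [0.0, 1.0]]
B = [[0.0, 100.0], [100.0, 0.0]]
out = attention_with_bias(X, B)  # out[0] is close to X[1], and vice versa
```

In Pathformer, col-attention with this bias exchanges information between pathways, while row-attention (without bias) exchanges information between modality dimensions of the same embedding.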
Pathformer outperforms existing multi-modal integration methods in various classification tasks using TCGA datasets
To evaluate the performance of Pathformer, we tested the model on various cancer classification tasks as benchmark studies: cancer early- and late-stage classification (10 TCGA cancer datasets), low- and high-survival risk classification (10 TCGA cancer datasets), and cancer subtype classification (8 TCGA cancer datasets) (see Supplementary Fig. 1 and Supplementary Notes). For these tasks, DNA methylation, DNA CNV, and RNA expression were used as input. For model training and testing, we performed 5-fold cross-validation twice, dividing the data into a discovery set (75%) and a validation set (25%) for each test (see Supplementary Fig. 1 and Methods). We first optimized hyperparameters using 5-fold cross-validation on the discovery set, with the macro-averaged F1 score as the criterion for grid search. The optimal hyperparameter combinations for each dataset are listed in Supplementary Fig. 2 and Supplementary Table 1. Then, we trained Pathformer on the discovery set with early stopping and tested it on the validation set.
We compared the classification performance of Pathformer with several existing multi-modal integration methods: early integration methods based on base classifiers, i.e., k-nearest neighbors (KNN), support vector machine (SVM), logistic regression (LR), random forest (RF), and extreme gradient boosting (XGBoost); late integration methods based on KNN, SVM, LR, RF, and XGBoost; partial least squares-discriminant analysis (PLSDA) and sparse partial least squares-discriminant analysis (sPLSDA) of mixOmics9; and two deep learning-based integration methods, MOGONet10 and PathCNN15. MOGONet is a multi-modal integration method based on graph convolutional neural networks. PathCNN is a representative multi-modal integration method that incorporates pathway information. For the comparison methods, the multi-modal data were preprocessed with the same statistical indicators, and features were prefiltered with ANOVA before being used as input (see Supplementary Notes).
Pathformer consistently outperformed the other integration methods in most classification tasks, evaluated by macro-averaged F1 score (F1score_macro) (Fig. 2), as well as by area under the receiver operating characteristic curve (AUC) and average F1 score weighted by support (F1score_weighted) (Supplementary Fig. 3 and Supplementary Table 2). We report F1score_macro in the main figure because it is a more robust measurement than the other two scores for imbalanced classes. In the cancer stage and survival classification tasks, Pathformer achieved the best F1score_macro and F1score_weighted in all 10 datasets, and the best AUC in 8 of 10 datasets. In cancer subtype classification of TCGA, Pathformer achieved the best F1score_macro in 7 of 8 datasets, the best F1score_weighted in 6 of 8 datasets, and the best AUC in 6 of 8 datasets. Notably, Pathformer substantially outperformed the other methods on challenging tasks such as cancer early- and late-stage classification and low- and high-survival risk classification, with average increases in F1score_macro of 11% and 15% over XGBoost, respectively. This highlights Pathformer’s exceptional learning ability. Moreover, in terms of stability, Pathformer also showed significantly better generalization ability than the other deep learning algorithms, as indicated by the cross-validation variances (Supplementary Fig. 4).
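The macro-averaged F1 score's robustness to class imbalance comes from how it is computed: per-class F1 scores are computed independently and then averaged with equal weight, so a rare class counts as much as a common one. A minimal sketch of the metric (equivalent in spirit to scikit-learn's `f1_score(..., average='macro')`):

```python
def f1_macro(y_true, y_pred):
    """Macro-averaged F1: mean of per-class F1 scores, each class
    weighted equally regardless of its support."""
    labels = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Imbalanced toy example: class 1 has only one true sample, yet its F1
# contributes half of the macro average.
score = f1_macro([0, 0, 0, 1], [0, 0, 1, 1])
```

F1score_weighted, by contrast, weights each per-class F1 by its support, which lets the majority class dominate the score.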
Ablation analysis shows that Pathformer benefits from multi-modal integration, attention mechanism and pathway crosstalk network
We used ablation analysis to evaluate the essentiality of each data type and each model module in Pathformer’s multi-modal data integration, based on nine datasets of cancer early- and late-stage classification. First, we evaluated seven different data inputs: RNA expression, DNA methylation, DNA CNV, and combinations thereof (Fig. 3a). Comparing the classification performances of the seven models, we discovered that the model with all three modalities as input achieved the best performance, followed by the RNA expression-only and DNA methylation-only models. Furthermore, we observed that the performance of single-modality models can vary greatly between datasets. For example, the DNA methylation-only model performed better than the RNA expression-only and DNA CNV-only models on the KIRC dataset, but the opposite was observed on the LUAD dataset. These findings suggest that different modalities behave disparately in different cancer types, and emphasize the necessity of multi-modal data integration in various cancer classification tasks.
Next, we evaluated the essentiality of different modules of Pathformer. We developed 4 models, namely CC-Attention, Transformer, PSNN, and NN, which successively remove one or more modules of Pathformer. CC-Attention is a model without the pathway crosstalk network bias. Transformer is a model without either the pathway crosstalk network bias or row-attention. PSNN directly uses the classification module with the pathway embedding as input. NN directly uses the classification module with the gene embedding as input. As shown in Fig. 3b, the complete Pathformer model achieved the best classification performance, while the performance of CC-Attention, Transformer, PSNN, and NN decreased successively. Transformer had significantly lower classification performance than CC-Attention, but no significant improvement over PSNN. This indicates that the criss-cross attention mechanism (Fig. 1c) plays a key role in Pathformer, with respect to information fusion and crosstalk between different biological pathways and between different modalities (i.e., multi-omics).
Biological interpretability of the Pathformer model
To comprehend Pathformer’s decision-making process, we used averaged attention maps in row-attention to represent the contributions of different modalities, and SHapley Additive exPlanations21 (SHAP values) to decipher the important pathways and their key genes (see Methods). The SHAP value is a post hoc model interpretation method that assigns an importance value to each feature to explain the relationship between features and classification21. In addition, the z-scores of SHAP values of different modalities for each pathway and gene can demonstrate modal complementarity at the gene and pathway levels. Finally, the hub module of the updated pathway crosstalk network represents the most critical regulatory mechanism in classification, and is screened by sub-network scores based on the SHAP values of pathways. Links of the updated network indicate crosstalk relationships that affect classification tasks (see Methods).
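To make the SHAP idea concrete: the Shapley value of a feature is its average marginal contribution across all possible feature coalitions. The sketch below computes exact Shapley values for a toy additive "model" by enumerating coalitions; real SHAP implementations approximate this for large models, and the feature names here are purely illustrative.

```python
from itertools import combinations
from math import factorial

def shapley_values(features, value_fn):
    """Exact Shapley values by coalition enumeration (exponential cost,
    feasible only for a handful of features)."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for r in range(len(others) + 1):
            for S in combinations(others, r):
                # Shapley weight for a coalition of size |S|
                w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                total += w * (value_fn(set(S) | {f}) - value_fn(set(S)))
        phi[f] = total
    return phi

# Toy value function: model output given a subset of active features.
# For an additive model, the Shapley value recovers each contribution.
contrib = {"pathway_X": 2.0, "pathway_Y": 1.0, "pathway_Z": 0.0}
v = lambda S: sum(contrib[f] for f in S)
phi = shapley_values(list(contrib), v)
```

For non-additive models the marginal contributions differ across coalitions, and the Shapley value is the principled average over all of them.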
Here, we demonstrated the interpretability of Pathformer using the breast cancer subtype classification task as an example (Fig. 4). First, at the modality level, we visualized the contributions of different modalities to breast cancer subtype classification by the attention weights (Fig. 4a). The contribution of transcriptomic data was greater than 50% in breast cancer subtype classification, which is consistent with the fact that PAM50 is defined based on transcriptomic data22. Combined with the results of other classification tasks for breast cancer (Supplementary Figs. 5a, 6a), we observed that the transcriptome always played a crucial role in various classification tasks; DNA CNV made a certain contribution in subtype classification; and DNA methylation contributed substantially in early- and late-stage classification. In addition, the contributions of different statistical indicators within the same modality also differed between classification tasks. For example, the mean of DNA CNV played an important role in subtype classification, while the minimum of DNA CNV contributed more in stage classification and survival classification. These findings further validate the necessity of multi-modal integration and biological multi-modal embedding.
Next, at the pathway and gene level, we identified the top 15 pathways by SHAP value, and within each pathway the top 5 genes by SHAP value as key genes, in breast cancer subtype classification (Fig. 4b). We then examined a hub module of the updated pathway crosstalk network (Fig. 4c). Here, the complex I biogenesis pathway was identified as the most critical pathway in breast cancer subtype classification and as a key node in the hub module of the updated pathway crosstalk network. This pathway comprises 57 genes, including mitochondrial genes and protein-coding genes. Complex I participates in biosynthesis and redox control during cancer cell proliferation and metastasis23. Five mitochondrial genes (MT-ND3, MT-ND1, MT-ND4, MT-ND2, and MT-ND6) were identified by Pathformer as key genes of the complex I biogenesis pathway in breast cancer subtype classification. These mitochondrial genes have been reported to exhibit distinct patterns in different breast cancer subtypes24. In addition, in the hub module of the updated pathway crosstalk network, the complex I biogenesis pathway was closely related to the TP53-regulated metabolic genes pathway and the signaling by ERBB4 pathway, and was identified as the most critical regulatory mechanism for breast cancer subtype classification. According to the literature, the TP53 mutation spectrum25 and ERBB426 are biomarkers of breast cancer subtypes.
Moreover, many other important pathways identified by Pathformer for breast cancer subtype classification have also been reported previously (Fig. 4b). For example, the expression of the nucleotide excision repair pathway is reduced in TNBC, which may affect patient survival after platinum chemotherapy27. RFC4 is the key gene of this pathway, and DNA CNV of RFC4 was reported to play a crucial role in determining individual breast cancer subtypes28, consistent with Pathformer’s prediction of this gene within its pathway module. Key genes of the transcription of E2F targets under negative control by p107 and p130 in complex with HDAC1 pathway were identified by Pathformer as E2F1, HDAC1, RBBP4, CCNA2, and CDK1. The expression of most E2F family genes is significantly up-regulated in TNBC, and these genes are predictive biomarkers of neoadjuvant therapies in patients with ER-positive/HER2-negative tumors29. Beyond the transcriptome level, DNA CNV of E2F1 is also a susceptibility factor for breast cancer30, again consistent with Pathformer’s prediction of this gene within its pathway module. HDAC1 is significantly lower in HER2-positive and TNBC than in luminal A and luminal B31.
Similarly, we also analyzed important pathways and hub modules of the updated pathway crosstalk network in breast cancer early- and late-stage classification and high- and low-risk survival classification (Supplementary Figs. 5, 6). We found that the complex I biogenesis pathway always played a crucial role in different classification tasks of breast cancer, owing to its connections with various cancer-related pathways. In particular, in breast cancer early- and late-stage classification, the iron uptake and transport pathway had the greatest impact. Supportively, the transport and storage of iron in cells are known to play a key role in carcinogenesis, cell proliferation, and the development of breast cancer32. Furthermore, some pathways were more important in early- and late-stage classification than in subtype and survival classification, such as the collagen biosynthesis and modifying enzymes pathway, Eph/ephrin signaling pathway, FRA pathway, and G1 pathway. The roles of LAT2/NTAL/LAB in calcium mobilization pathway was more important in survival classification than in the other classification tasks, consistent with the calcium signaling pathway’s function in breast cancer cells’ proliferation, invasion, apoptosis, and multidrug resistance, and with breast cancer survival33.
Application of Pathformer to liquid biopsy data for non-invasive cancer diagnosis
Liquid biopsy is a non-invasive detection method with important clinical applications in both cancer diagnosis and status monitoring, providing comprehensive information on transcriptome dynamics34. Different RNA alterations carry complementary levels of information and help to overcome the missed detections of any single data type, further improving the accuracy of cancer diagnosis. Therefore, we used Pathformer to integrate multi-modal liquid biopsy data to classify cancer patients versus healthy controls. We applied Pathformer to three cell-free RNA-seq datasets derived from three different blood components: plasma, extracellular vesicle (EV), and platelet datasets (see Methods).
We calculated seven RNA-level modalities from RNA-seq data as Pathformer’s input: RNA expression, RNA splicing, RNA editing, RNA alternative promoter (RNA alt. promoter), RNA allele-specific expression (RNA ASE), RNA single nucleotide variation (RNA SNV), and chimeric RNA. From the results of 5-fold cross-validation in Supplementary Fig. 7, the model with all modalities as input had the best overall performance on the three datasets, followed by the RNA expression-only and RNA alt. promoter-only models, while models with some other single modalities fluctuated greatly across datasets. To integrate information effectively without redundancy, we performed further feature selection over different modality combinations evaluated by Pathformer. First, we calculated the contributions of each modality and its corresponding statistical indicators (Fig. 5a). Similar to the cross-validation results, RNA expression was the core modality across all datasets. Next, we performed 5-fold cross-validation to find an optimal modality combination for each dataset (Fig. 5b, Supplementary Table 3). The plasma dataset with 7 modalities, the EV dataset with 3 modalities, and the platelet dataset with 3 modalities obtained the best performance, with AUCs higher than 0.9 for all three datasets. In conclusion, Pathformer effectively integrated multi-modal data from human plasma and accurately classified cancer patients versus healthy controls.
Pathformer reveals deregulated pathways and genes in cancer patients’ plasma
Because the Pathformer model is biologically interpretable, we used it to predict cancer-related pathways and genes in the above liquid biopsy data (Fig. 6), gaining insight into the deregulated alterations in the body fluid (i.e., plasma) of cancer patients versus healthy controls.
First, in comparison to cancer tissue data (Fig. 4, Supplementary Fig. 6), we found that vesicle transport and coagulation-related pathways occupied an important position in the datasets of various blood components, consistent with the characteristics of body fluids (Fig. 6a-c). Furthermore, the active pathways and key genes of the plasma dataset were more similar to those of the platelet dataset, consistent with a recent report showing that platelets are a major origin of the plasma cell-free transcriptome35.
Next, we examined three interesting pathways: one found in the EV data and the others revealed in the platelet data. In both the EV and plasma datasets, the binding and uptake of ligands (e.g., oxidized low-density lipoprotein, oxLDL) by scavenger receptors pathway was identified as the most active pathway (Fig. 6a, b). It is well established that scavenger receptors play a crucial role in cancer prognosis and carcinogenesis by promoting the degradation of harmful substances and accelerating the immune response through endocytosis, phagocytosis, and adhesion36. Scavenger receptors are also closely related to vesicle transport processes. For example, stabilin-1, a homeostatic receptor, can impact macrophage secretion by linking extracellular signals and intracellular vesicular processes37. Meanwhile, HBB, HBA1, HBA2, FTH1, and HSP90AA1 were identified as key genes in this pathway. HBB has been reported as a biomarker in thyroid cancer38, breast cancer39, and gastric cancer40, and has been shown to be significantly downregulated in gastric cancer blood transcriptomics40. HSP90AA1 has also been demonstrated to be a potential biomarker for various cancers41, especially in the blood42.
The other two interesting pathways are the DAP12 signaling pathway and the DAP12 interactions pathway, revealed in both the platelet and plasma datasets (Fig. 6a, c). DAP12 triggers natural killer cell immune responses against certain tumor cells43, which are regulated by platelets44. Among the top 5 key genes of the DAP12-related pathways in both the platelet and plasma datasets, B2M has been reported as a serum protein-encoding gene and a widely recognized tumor biomarker45, while HLA-E and HLA-B have been reported as cancer biomarkers in tissue and plasma46, 47.
In addition, Pathformer provides insight into the interplay between various biological processes and their impact on cancer progression by updating the pathway crosstalk network (Fig. 6d-e). In the plasma data, the link between the binding and uptake of ligands by scavenger receptors pathway and the iron uptake and transport pathway was a novel addition to the updated network (Fig. 6d); in other words, this crosstalk relationship was newly predicted by Pathformer. The crosstalk between these two pathways was amplified by Pathformer in the plasma dataset, probably because they were important for classification and shared the same key gene, FTH1, one of the two genes at their intersection. In the platelet dataset, by contrast, this crosstalk was not shown, as the scavenger receptors pathway was not important enough there (Fig. 6e). In summary, Pathformer’s updated pathway crosstalk network visualizes the information flow between pathways relevant to the cancer classification task in the liquid biopsy data, providing novel insight into the crosstalk of biological pathways in cancer patients’ plasma.
Discussion
Pathformer utilizes a biological multi-modal embedding (Fig. 1b) based on a pathway-based sparse neural network, demonstrating the application of the Transformer model to biological multi-modal data integration. In particular, we showed that the criss-cross attention mechanism (Fig. 1c) contributed to the classification tasks by capturing crosstalk between biological pathways and potential regulation between modalities (i.e., multi-omics).
Applications of Pathformer
Pathformer will be useful in many clinical applications, such as cancer subtyping, staging, prognosis, and diagnosis. For instance, we demonstrated excellent performance of Pathformer on the noninvasive diagnosis of cancer based on multi-modal liquid biopsy data: the accuracies (AUC scores) of cancer classification on the plasma, EV, and platelet datasets were all higher than 90%. Furthermore, the interpretability of the Pathformer model can help researchers gain insight into the complex regulatory processes involved in cancer. For instance, Pathformer identified active pathways consistent with the characteristics of body fluid data, such as the binding and uptake of ligands by scavenger receptors pathway and the DAP12-related pathways, which have been reported to be closely related to extracellular vesicle transport, platelets, and the immune response during cancer development and progression.
Limitations of Pathformer and future directions
Pathformer used genes involved in pathways from four public databases, all of which consist of protein-coding genes. However, a substantial body of literature has reported that noncoding RNAs are also crucial in cancer prognosis and diagnosis48. Therefore, incorporating noncoding RNAs and their related functional pathways into Pathformer is a potential direction for future work. Another limitation of Pathformer is computing memory. The pathway embedding of Pathformer prevents the memory overflow of the Transformer module caused by long inputs; however, when adding more pathways or gene sets (e.g., transcription factors), Pathformer still faces memory overflow. In future work, we may introduce linear attention to further improve computational speed and memory efficiency.
Methods
Data collection and preprocessing
We collected 28 datasets across different cancer types from TCGA to evaluate the classification performance of Pathformer and the existing comparison methods, consisting of 8 datasets for cancer subtype classification, 10 datasets for cancer early- and late-stage classification, and 10 datasets for cancer low- and high-survival risk classification. In addition, to further verify the effectiveness of Pathformer in cancer diagnosis, we collected three types of body fluid datasets: the plasma dataset (373 samples assayed by total cell-free RNA-seq49), the extracellular vesicle (EV) dataset (477 samples from two studies assayed by exosomal RNA-seq50, 51), and the platelet dataset (918 samples from two studies assayed by tumor-educated blood platelet RNA-seq52, 53). Through our bioinformatics pipeline, 4 and 7 biological modalities in total were obtained for the TCGA and liquid biopsy datasets, respectively. More details of data collection and preprocessing are described in Supplementary Fig. 1 and Supplementary Notes.
The Pathformer model
As shown in Fig. 1, Pathformer consists of the following six modules: biological pathway input, pathway crosstalk network calculation, multi-modal data input, biological multi-modal embedding, Transformer module with pathway crosstalk network bias, and classification module.
Biological pathways and crosstalk network
We collected 2,289 pathways from four public databases: the Kyoto Encyclopedia of Genes and Genomes (KEGG)54, the Pathway Interaction Database (PID)55, the Reactome database (Reactome)56, and the BioCarta Pathways database (BioCarta)57. We then filtered these pathways by three criteria: gene number, overlap ratio with other pathways (the proportion of genes in the pathway that are also present in other pathways), and number of pathway subsets (the number of pathways included in the pathway). Following the principle of moderate size and minimal overlap with other pathway information, we selected 1,497 pathways with a gene number between 15 and 100, or a gene number greater than 15 and an overlap ratio less than 1, or a gene number greater than 15 and a number of pathway subsets less than 5. Next, we used BinoX to calculate the crosstalk relationships among the 1,497 pathways and built a pathway crosstalk network represented as an adjacency matrix (more details in Supplementary Notes).
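The pathway selection rule above can be written as a simple predicate. The thresholds are taken directly from the text; the helper name is our own.

```python
def keep_pathway(n_genes, overlap_ratio, n_subsets):
    """Selection rule described in the text: keep a pathway if it has
    a moderate gene number (15-100), or if it is larger than 15 genes
    but has an overlap ratio below 1 or fewer than 5 pathway subsets."""
    return (15 <= n_genes <= 100
            or (n_genes > 15 and overlap_ratio < 1)
            or (n_genes > 15 and n_subsets < 5))

# Examples: a moderately sized pathway passes regardless of overlap;
# a large, fully overlapping pathway with many subsets is dropped.
assert keep_pathway(50, 1.0, 10)
assert not keep_pathway(200, 1.0, 10)
```

Applying such a predicate to the 2,289 collected pathways yields the 1,497 pathways used by Pathformer.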
Biological multi-modal data input and embedding
Pathformer supports any number of modalities as input, which may have different dimensions, including the nucleotide level, fragment level, and gene level. For example, Pathformer's input for the TCGA datasets includes gene-level RNA expression, fragment-level DNA methylation, and both fragment-level and gene-level DNA CNV. Its input for the liquid biopsy datasets includes gene-level RNA expression; fragment-level RNA alternative promoter, RNA splicing, and chimeric RNA; and nucleotide-level RNA editing, RNA ASE, and RNA SNV. We represented the multi-modal input matrix of a sample as 𝑴, and converted 𝑴 into a gene embedding 𝑬G and a pathway embedding 𝑬P. First, we used a series of statistical indicators in different modalities as the gene embedding. These indicators include gene-level score, count, entropy, minimum, maximum, mean, weighted mean over the whole gene, and weighted mean within a window. The gene embedding is calculated as 𝑬G = 𝑭E(𝑴), where Gi is modality i, Dg is the length of the gene embedding across all modalities, and 𝑭E is a series of gene embedding functions. 𝑭E uses the statistical indicators to uniformly convert the data of different modalities to the gene level; the embedding functions differ between modalities (more details in Supplementary Notes). Then, we used the known biological pathways to construct a sparse neural network that converts the gene embedding 𝑬G into the pathway embedding 𝑬P, as described below: 𝑬P = 𝑾sparse 𝑬G + 𝑩, where Np is the number of pathways, Dp = Dg is the length of the pathway embedding, 𝑾sparse is a learnable sparse weight matrix, and 𝑩 is a bias term. 𝑾sparse is constructed from the known relationships between pathways and genes: when a gene is irrelevant to a pathway, the corresponding element of 𝑾sparse is fixed at 0; otherwise, it is learned through training.
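The sparse gene-to-pathway layer can be sketched as follows, assuming a binary gene–pathway membership matrix `mask` (all names here are illustrative, not from the released code):

```python
import numpy as np

def pathway_embedding(E_G, mask, W, B):
    """Map gene embedding E_G (N_g x D_g) to pathway embedding E_P (N_p x D_g).

    mask is the N_p x N_g gene-pathway membership matrix; multiplying by it
    keeps the weight of every gene that is irrelevant to a pathway fixed at 0,
    while the remaining weights in W stay free to be learned."""
    W_sparse = W * mask
    return W_sparse @ E_G + B
```

Because masked weights are exactly zero, perturbing a gene that does not belong to a pathway leaves that pathway's embedding unchanged.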
Transformer module with pathway crosstalk network bias
We employed a Transformer module based on criss-cross attention with pathway crosstalk network bias, consisting of 3 blocks. Each block contains the following operations: multi-head column-wise self-attention (col-attention), multi-head row-wise self-attention (row-attention), layer normalization, GELU activation, residual connection, and network update. The multi-head column-wise self-attention contains 8 heads; each head is a mapping of 𝑸1, 𝑲1, 𝑽1, and 𝑷, i.e., the query, key, and value vectors of the multi-modal embedding and the pathway crosstalk network matrix, respectively.
First, the hth column-wise self-attention is calculated with the pathway crosstalk network matrix added as a bias term before the softmax: col-attention_h = dropout0.2(softmax(Q_h K_hᵀ / √d + 𝑷)) V_h, where h = 1,2,⋯,H is the hth head; H is the number of heads; Q_h, K_h, and V_h are linear transformations of the input, whose weight matrices are learnable parameters; d is the attention dimension; dropout0.2 is a dropout neural network layer with a probability of 0.2; and softmax is the normalized exponential function.
Next, we merged the multi-head column-wise self-attention outputs and performed a series of operations: the H heads are concatenated and linearly projected with learnable weight matrices, followed by layer normalization (LayerNorm), GELU activation (a smooth variant of the ReLU activation function), a residual connection, and a dropout neural network layer with a probability of 0.2, yielding the column-attention output 𝑶1.
Multi-head row-wise self-attention enables information exchange between different modalities. It is a regular dot-product attention without the pathway crosstalk network bias. The hth row-wise self-attention is calculated as follows: row-attention_h = dropout0.2(softmax(Q_h K_hᵀ / √d)) V_h, where h = 1,2,⋯,H is the hth head; H is the number of heads; Q_h, K_h, and V_h are linear transformations of the input, whose weight matrices are learnable parameters; d is the attention dimension; dropout0.2 is a dropout neural network layer with a probability of 0.2; and softmax is the normalized exponential function.
Subsequently, we merged the multi-head row-wise self-attention outputs and performed the same series of operations: concatenation of the H heads, a linear projection with learnable weight matrices scaled by 𝜷, a constant coefficient for row-attention, layer normalization (LayerNorm), GELU activation, a residual connection, and a dropout neural network layer with a probability of 0.2. The result 𝑶2 is the pathway embedding input of the next Transformer block; superscripts in parentheses denote data at different blocks, e.g., the output 𝑶2 of one block is the pathway embedding input of the next.
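The two attention directions of one criss-cross block can be sketched in NumPy as below, assuming the crosstalk bias is added to the attention logits before the softmax (dropout, multi-head merging, LayerNorm, GELU, and residual connections are omitted; all names are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def col_attention_head(E, P, Wq, Wk, Wv):
    """Column-wise head: attention across pathways on E (N_p x D_p), with the
    pathway crosstalk network P (N_p x N_p) as an additive bias on the logits."""
    Q, K, V = E @ Wq, E @ Wk, E @ Wv
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d) + P, axis=-1) @ V

def row_attention_head(E, Wq, Wk, Wv):
    """Row-wise head: the same dot-product attention applied to E^T, without
    the crosstalk bias, exchanging information across modality dimensions."""
    return col_attention_head(E.T, 0.0, Wq, Wk, Wv).T
```

Both directions return a matrix of the same shape as the pathway embedding, so the blocks can be stacked.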
Then, we used the updated pathway embedding 𝑶2 to update the pathway crosstalk network: each element of the network matrix is updated with the correlation between the embedding vectors of the two corresponding pathways in 𝑶2. The updated matrix 𝑷ʹ serves as the pathway crosstalk network of the next Transformer block; that is, if 𝑷ʹ is 𝑷(1), then 𝑷 is 𝑷(0), where superscripts in parentheses denote data at different blocks.
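One plausible form of this update, sketched in NumPy (the multiplicative combination with the prior matrix is our assumption; the exact rule is given in the Supplementary Notes):

```python
import numpy as np

def update_crosstalk(P, O2):
    """Update the crosstalk matrix P (N_p x N_p) from the encoded pathway
    embedding O2 (N_p x D_p): element (i, j) is rescaled by the absolute
    Pearson correlation between the embedding vectors of pathways i and j."""
    C = np.corrcoef(O2)      # pairwise correlation of pathway embedding vectors
    return P * np.abs(C)     # assumed combination with the prior network
```

Because the correlation matrix is symmetric with a unit diagonal, this update preserves the symmetry of a symmetric prior network.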
Classification module
To solve the classification tasks, we used a fully connected neural network as the classification module, transforming the pathway embedding encoded by the Transformer module into a probability for each label. It consists of three fully connected layers with 300, 200, and 100 neurons, respectively, each with dropout probability dropout_c, a hyperparameter. More details of the classification module are described in Supplementary Notes.
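In PyTorch, such a head might look like the following sketch (layer sizes and dropout follow the text; the ReLU activation is our assumption):

```python
import torch.nn as nn

def make_classifier(d_in, n_classes, dropout_c):
    """Classification head sketch: three fully connected layers of 300, 200,
    and 100 neurons, each followed by an activation (assumed ReLU) and
    dropout with probability dropout_c, ending in a linear layer over labels."""
    return nn.Sequential(
        nn.Linear(d_in, 300), nn.ReLU(), nn.Dropout(dropout_c),
        nn.Linear(300, 200), nn.ReLU(), nn.Dropout(dropout_c),
        nn.Linear(200, 100), nn.ReLU(), nn.Dropout(dropout_c),
        nn.Linear(100, n_classes),
    )
```

The final linear layer outputs logits; per-label probabilities are obtained with a softmax inside the loss function.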
Model training and test
In this study, we implemented Pathformer's network architecture using the "PyTorch" package in Python v3.6.9, and our code can be found in the GitHub repository (https://github.com/lulab/Pathformer). For model training and testing, we divided the labeled dataset into a discovery set (75%) and a validation set (25%) in a stratified manner. We performed model training, hyperparameter optimization, and early stopping on the discovery set and tested on the validation set (Supplementary Fig. 1).
When training the model, we used a standard learning strategy: cross-entropy loss with class-imbalance weights as the label prediction loss, the Adam optimizer, and the cosine annealing method to schedule the learning rate. For hyperparameter optimization, we used grid search with 5-fold cross-validation on the discovery set, with the macro-averaged F1 score as the selection criterion, over the maximum learning rate ∈ [1e-4, 1e-5], the dropout probability of the classification module (dropout_c) ∈ [0.3, 0.5], and the constant coefficient for row-attention (𝜷) ∈ [0.1, 1]. For early stopping, we divided the discovery set into a training set (75%) and a test set (25%) in a stratified manner, and used the macro-averaged F1 score on the test set as the stopping criterion. When testing the model, we evaluated the best model trained with the optimal hyperparameter combination on the validation set. More details of model training and testing are described in Supplementary Notes.
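The loss, optimizer, and learning-rate schedule can be assembled as in the sketch below (the inverse-frequency form of the class-imbalance weights is our assumption):

```python
import torch
import torch.nn as nn

def training_setup(model, class_counts, max_lr, epochs):
    """Training configuration sketch: cross-entropy with class-imbalance
    weights (here inverse class frequency, an assumption), the Adam
    optimizer, and a cosine-annealed learning rate starting at max_lr."""
    counts = torch.tensor(class_counts, dtype=torch.float)
    weights = counts.sum() / (len(counts) * counts)  # rarer class -> larger weight
    loss_fn = nn.CrossEntropyLoss(weight=weights)
    optimizer = torch.optim.Adam(model.parameters(), lr=max_lr)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    return loss_fn, optimizer, scheduler
```

With counts [90, 10], the minority class receives a weight of 5.0 and the majority class about 0.56, so both classes contribute comparably to the loss.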
Model interpretability
To better understand Pathformer's decisions, we increased its interpretability by calculating the contributions of different modalities, identifying important pathways and their key genes, and extracting the hub module of the updated pathway crosstalk network.
Contribution of each modality
In Pathformer, row-attention facilitates information exchange between different modalities; thus, the row-attention map can represent the importance of each modality. From the trained model, we obtained the row-attention maps of the 8 heads in each of the 3 blocks for every sample. To compute the contribution of each modality, we first integrated all row-attention maps into one matrix by element-wise averaging. Then, we averaged this matrix along its columns and applied softmax normalization to obtain the attention weights of the modalities, i.e., their contributions. Here N is the number of samples, BL is the number of blocks, H is the number of heads, softmax is the normalized exponential function, and attention weight_i is the attention weight of dimension i of the pathway embedding.
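The averaging described above can be sketched in NumPy as follows (which axis corresponds to "along the columns" is our reading; all names are illustrative):

```python
import numpy as np

def modality_contribution(row_attn):
    """Contribution of each pathway-embedding dimension from row-attention
    maps of shape (N, BL, H, D, D): element-wise average over samples,
    blocks, and heads, then average along columns, then softmax-normalize."""
    A = row_attn.mean(axis=(0, 1, 2))  # (D, D) average row-attention matrix
    w = A.mean(axis=0)                 # one weight per attended dimension
    e = np.exp(w - w.max())
    return e / e.sum()
```

The softmax guarantees the contributions are positive and sum to one, so they can be compared directly across modalities.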
Important pathways and their key genes
SHapley Additive exPlanations21 (SHAP) is an additive explanation model inspired by coalitional game theory, which regards all features as "contributors". The SHAP value assigned to each feature explains the relationship between pathways, genes, and classification; we computed it with the "SHAP" package of Python v3.6.9. Specifically, we calculated the SHAP values of the gene embedding and of the pathway embedding encoded by the Transformer module for each sample and each category. The SHAP values of genes and pathways are then aggregated over samples and embedding dimensions, where g = 1,2,⋯,NG is the gth gene (NG being the number of genes), p = 1,2,⋯,Np is the pth pathway, n = 1,2,⋯,N is the nth sample, e = 1,2,⋯,Dp is dimension e of the pathway embedding, and j = 1,2,⋯,d_out is the jth category of sample.
In addition, we calculated the SHAP values of pathways and genes within each modality, where i = 1,⋯,m is the ith modality and the aggregation runs over the gene embedding and pathway embedding dimensions of modality i.
Finally, the pathways with the top 15 SHAP values in a classification task are considered important pathways. For each such pathway, the genes with the top 5 SHAP values are considered its key genes. The core modality of a gene is the modality on which its SHAP value ranks higher than on the others.
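The ranking rule can be sketched as below (aggregating per-sample SHAP values by the sum of absolute values is our assumption; function names are illustrative):

```python
import numpy as np

def top_features(shap_values, names, k):
    """Rank features (pathways or genes) by total absolute SHAP value across
    samples (rows of shap_values) and return the top-k names -- the rule used
    above to pick the top-15 pathways and each pathway's top-5 key genes."""
    scores = np.abs(shap_values).sum(axis=0)
    order = np.argsort(scores)[::-1]
    return [names[i] for i in order[:k]]
```

The same function applied per modality identifies a gene's core modality as the one where its score ranks highest.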
Hub module of the updated pathway crosstalk network
In Pathformer, the pathway crosstalk network matrix guides the direction of information flow and is updated from the encoded pathway embedding in each Transformer block. Therefore, the updated pathway crosstalk network contains not only prior information but also multi-modal data information, representing the specific regulatory mechanism of each classification task. We defined a sub-network score from the SHAP values of the pathways within a sub-network, in order to find the most predictive sub-network, i.e., the hub module of the updated pathway crosstalk network. The calculation of the sub-network score comprises four steps: averaging the pathway crosstalk network matrices, network pruning, sub-network boundary determination, and score calculation. More details of the sub-network score calculation are described in Supplementary Notes.
Declarations
Data availability
All datasets used in this study are publicly available for academic research usages. The details of usage are also fully illustrated in Methods and Supplementary Notes.
Code availability
Source code for data preprocessing and model training is freely available at GitHub (https://github.com/lulab/Pathformer) with detailed instructions. Source code for comparing with the other methods is also included.
Consent for publication
All authors have approved the manuscript and agree with the publication.
Competing interests
The authors declare that they have no competing interests.
Funding and Acknowledgements
This work is supported by the National Natural Science Foundation of China (81972798, 32170671), the Tsinghua University Spring Breeze Fund (2021Z99CFY022), the National Key Research Program of China (2021YFA1301603), a Tsinghua University Guoqiang Institute Grant (2021GQG1020), the Tsinghua University Initiative Scientific Research Program of Precision Medicine (2022ZLA003), and the Bioinformatics Platform of the National Center for Protein Sciences (Beijing) (2021-NCPSB-005). This study was also supported by Bayer Micro-funding, the Beijing Advanced Innovation Center for Structural Biology, and the Bio-Computing Platform of the Tsinghua University Branch of the China National Center for Protein Sciences. We also thank Hongli Ma and Kexing Li for helping us edit the text of the manuscript.
Funding for open access charge: Tsinghua University Guoqiang Institute Grant (2021GQG1020).