Attention-based GCN Integrates Multi-omics Data for Breast Cancer Subtype Classification and Patient-specific Gene Marker Identification

Breast cancer is a heterogeneous disease and can be divided into several subtypes with unique prognostic and molecular characteristics. Classification of breast cancer subtypes plays an important role in the precision treatment and prognosis of breast cancer. Benefitting from the relation-aware ability of graph convolution network (GCN), we presented a multi-omics integrative method, attention-based GCN, for breast cancer molecular subtype classification using mRNA expression, copy number variation, and DNA methylation multi-omics data. Several attention mechanisms were performed and all exhibited effectiveness in integrating heterogeneous data. Column-wise attention-based GCN outperformed the other baseline methods, achieving AUC of 0.9816, ACC of 0.8743 and MCC of 0.8151 in 5-fold cross validation. Besides, Layer-wise Relevance Propagation (LRP) algorithm was used for interpretation of model decision and could identify patient-specific important biomarkers which were reported to be related to the occurrence and development of breast cancer. Our results highlighted the effectiveness of GCN and attention mechanisms in multi-omics integrative analysis and implement of LRP algorithm could provide biologically reasonable insights into model decision. Author Summary Identification of molecular subtype is essential to understanding the pathogenesis of cancer and advancing the development of cancer precision treatment. We have developed a graph convolution network architecture to predict the molecular subtype of breast cancer patients based on multi-omics data. The major difference between our work and other methods is that we have offered a new view for multi-omics data integration with the assistance of graph convolution network. Besides, we adopted an interpretability method to explain the model decision and prioritize the genes for biomarker discovery.


Introduction
normal-like(Normal) tumors [1]. At the same time, there is a large correlation between breast cancer intrinsic subtypes and immune phenotypes, and different cohort studies have reproduced similar intrinsic subtypes by using different numbers of gene signatures [2,3]. It also indicates that multiple genes act synergistically as functional modules leading to the inherent molecular heterogeneity of breast cancer. Multi-omics sequencing platform offers opportunities to study breast cancer at various molecular levels [4]. Each type of omics data exhibits specific disease associations and several studies have developed machine learning models for breast cancer subtype identification and also exploit subtype-specific risk predictors or biomarkers [5][6][7][8].
However, single omics data analysis is limited to correlations, mainly reflecting reactive processes rather causative ones. Integrating different omics data may provide more biological insights [9], for example, further refining the intrinsic classification of breast cancer(intCluster10 [10]) or improving the predictive power of diagnosis model [11].
Many multi-omics integrated analysis approaches are based on machine learning techniques, aiming at patient stratification [12][13][14], biomarker discovery [15,16], pathway analysis [17], and drug discovery [18]. One of the challenges of multi-omics analysis is the high dimensionality of data and many studies have proposed different deep learning integration strategies [19][20][21]. Autoencoders are designed to reconstruct the original high-dimensional input into latent low-dimensional features which can represent the raw data and have been successfully applied to analyze high-dimensional omics data and integrate heterogeneous data types [22]. Generative adversarial networks (GAN) can also be used to extract effective feature representations from multi-omics heterogeneous [23]. Network-based fusion strategies were also applied in multi-omics data integration [24]. The application of GCN in multi-omics integration analysis is seldomly reported. And most of the existing studies usually combine feature representation learning technology with GCN based on the patient similarity network for patient diagnosis or clustering [25,26], where the prior structural relationships between genes are not taken into account. Genes act together in signaling, regulatory pathway, and protein complexes. The information contained in the protein-protein interaction network (PPI) is highly important for gene function prediction [27], gene module detection [17], and novel cancer gene discovering [28,29]. Here, we developed a GCN based framework that could simultaneously incorporate multi-omics features of genes and their structural relationships with gene-gene interaction network as graph input.
Genomic changes have complicated associations with cancer phenotypes, and each type of molecular features is not independent. For example, CNV causing a gene expression change may be relevant to disease, and presence of DNA methylation near the transcription start site of a gene is associated with its expression change. Most of existing supervised multi-omics can be divided into two types, concatenation-based methods and ensemble-based methods. Concatenation-based methods integrate different types of data input by directly concatenating the data input features or the learned embedding features, and the latter is implemented by integrating predictions from different classifiers. Both of them fail to catch the cross-omics correlations and are likely to be biased towards certain omics data types considering the existence of systematic noises or scale differences [25,30]. Several supervised machine learning classification methods are developed for modeling the cross-omics correlations. For example, Data Integration Analysis for Biomarker discovery using Latent components (DIABLO) maximizes the common or correlated information between multiple omics datasets and discriminates phenotypic groups [31]. A sparse multiple discriminative canonical correlation analysis (SMDCCA) integrates differentially expressed genes (DEGs), differentially methylated CpG sites (DMCs), and differential metabolic products (DMPs) for osteoporosis risk and identifies biomarkers which are correlated spanning different biological layers [32]. Attention mechanism is one of the most important concepts in the deep learning field and can explain the incomprehensible network architecture behavior [33]. We introduced attention-based GCN to capture the complex correlations between different omics date types.
The lack of transparency is a significant disadvantage of black-box deep learning methods, limiting the interpretability and thus the scope of application in practice [34].
In cancer diagnostics, the demand for model interpretation is increasing since it's crucial to identify important biomarkers and related biological processes involved in decision making process and further convince physicians in model rationality. What's more, reliable interpretation strategies can help investigate biological meanings of the nonlinear deep models and explore biological patterns which may lead to new biological hypotheses. Layer-wise Relevance Propagation (LRP) algorithm may provide a better explanation of model decision than the sensitivity-based approach or the deconvolution method in image classification [35]. LRP has been applied in explaining the deep neural network for gene expression [36] and to visualize convolutional neural network decisions for Alzheimer's disease based om magnetic resonance imaging data.
In summary, to explicitly model the associations between different molecular-level features of genes and their associations with disease phenotypes, we mapped features at multiple levels (mRNA, DNA methylation and CNV) to the gene level and adopted GCN with prior structural information from protein-protein interaction network as graph input for feature fusion. Based on multi-omics data from TCGA BRCA, we proposed an attention-based GCN for classification of different breast cancer molecular subtypes and jointly learning associations across multiple molecular levels of genes. We compared the model performance of GCN with three types of attention mechanism approaches. The results show that GCN combined with omics-specific attention mechanism can significantly improve the prediction performance of neural network.
We also analyzed the prediction performance under different molecular-level data types and proved that attention-based GCN has certain advantages in integrating highdimensional features. Furthermore, we adopted LRP algorithm to explain model decisions and identify patient-specific gene markers and related functional modules.
Finally, we validated the generalization performance of the model on unlabeled data. (Fig 1)

Performance of three attention mechanisms for multi-omics feature fusion and interpretation for attention-based GCN model
Based on the constructed hvGs dataset (hvG-3167) and corresponding hvG-PPI subnetwork, the performances of the GCN model using three different attention mechanisms and the classical MLP model framework were tested respectively. The classification results are shown in Table 1. We observed that the simple GCN model  Table S1). It can be observed that the information of mRNA is retained to the maximum extent, and CNV and DNA methylation data also have a certain degree of contribution, but the SE score is close to 0 in some sample decisions. We further visualized the sample-averaged attention score obtained from cAGCN with a heatmap reflecting the correlation between omics ( Fig   2B). There are some differences in the distribution of attention scores when the data set varies. We supposed that cAGCN yielding a better prediction score is attributed to its advantage in learning the cross-omics complicated relationships from heterogeneous data. Overall, mRNA data plays an important role in the model decision. It is inferred that the molecular heterogeneity of breast cancer was largely captured at the level of gene transcription and protein function.

Performance of the model under different hyperparameters
We mainly investigated the influence of two hyperparameters on the cAGCN model.
Considering that network shrinkage would affect the quality of the input graph and then the performance of the GCN, we tested the graph input with different order neighborhoods retained (k-hop). As the retention order increases, the density of the input graph gets larger. And when it exceeds a certain order (k=3), the performance of the model decreases obviously ( Table 2). Keeping higher-order neighborhood relationships may produce more noise edges and graph Laplacian with sparse networks can play a certain role as regularization [37] and improve the generalization performance of the model. The multi-head attention mechanism is an essential component of cAGCN, and the hyperparameter optimization is carried out on the number of attention heads. In the initial stage, MCC and ACC tend to increase along with the increase of the number of attention heads, and AUC is not sensitive to hyperparameters ( Fig S1). Finally, the number of attention heads was set as 7 and the 2-hop subnetwork was considered as graph input.

Performance of cAGCN under different omics data types
When comparing the hvGs of three molecular level data, it can be found that there is relatively little overlap among the three gene sets (Fig 3A). We further explored the ability of different omics data types to predict the molecular subtype of breast cancer.
Seven data sets were constructed under different permutations and combinations ( Table   3). Classification performance of single-omics data set is significantly inferior to that of multi-omics data set. Although the number of multi-omics features has greatly increased, using proper feature fusion strategy can still achieve good classification performance.

Effects of adding reference genes and validity of cAGCN
By comparing with the reported breast cancer related oncogenes in DisGenet database, we found that the selected hvGs filtered out important breast cancer related oncogenes.
Most of the hvGs have not been reported to be directly associated with breast cancer (Fig 3A). However, the prediction results show that these genes are closely related to the molecular subtypes of breast cancer. Meanwhile, the addition of reference gene set has little effects on the model performance ( Table 4), indicating that these hvGs are not independent from reference genes and are probably directly or indirectly involved in the molecular processes related to the occurrence and proliferation of breast cancer.
To further verify the effectiveness of feature fusion strategy, univariable Cox PH regression was adopted to analyze the fusion feature extracted from cAGCN middle layer. After screening out genes significantly associated with the overall survival of breast cancer, KEGG functional enrichment analysis and DO enrichment analysis were performed. The same process was carried out based on the original gene expression data. More genes are discovered to be related to the prognosis of breast cancer (Fig 3A) and significantly cancer-enriched. The results of DO enrichment analysis show that these genes are significantly enriched in breast cancer-related genes (Fig 3B). The results of KEGG pathway enrichment analysis further demonstrate the correlation between these genes and the occurrence and proliferation of cancer (Fig 3C). For example, Dysregulation of Hippo Signaling Pathway is directly relevant to various aspects in cancer biology, which includes cell proliferation, EMT (epithelial-tomesenchymal transition) and metastasis [38]. TNF signaling pathway contains an important pro-inflammatory cytokine, tumor necrosis factor-α (TNF-α), which is found in the TME that is involved in all stages of breast cancer development, affecting tumor cell proliferation and survival, EMT, metastasis and recurrence [39]. Wnt signaling pathway regulates a variety of cellular processes, including cell fate, differentiation, proliferation and stem cell pluripotency. Abnormal Wnt signal is a marker of many cancers [40]. Transforming growth factor beta(TGF-β) in TGF-β signaling pathway is a multifunctional cytokine involved in regulation of cell proliferation, differentiation and survival/or apoptosis of many cells [41]. In later stage of breast cancer, TGF-β has direct pro-tumorigenic effects through the stimulation of invasion, the migration of tumor cells [42,43]. Cell Cycle disorder is a common feature of human cancer. Different breast cancer subtypes show different patterns on cell cycle and its checkpoints [4].

Identification of important genes by LRP
We used the LRP algorithm to identify important genes that affect model decisions (Table S2). We first performed the functional analysis based on the top100 genes related to each breast cancer subtype. Although there is little genetic overlap among the subtypes, the biological processes involved in them have a high degree of functional overlap (Fig 4A). The results of GO and KEGG functional enrichment analysis show that the most important genes are related to cell cycle (Fig 4B, C, Table S3). Based on the merged network, 12 MCODE components are identified, where MCODE 2 represents mitotic cell cycle and MCODE 11 is related to Wnt signaling pathway (Fig   4D).
Since the model has the highest prediction precision for basal-like samples, we further explored the patient-specific feature pattern of 23 basal-like samples of the validation data set (Table S4). The basal-like group composes up to 15% of breast cancer and is characterized by triple-negative profile and is associated with an aggressive phenotype, high histological grade, poor clinical behavior, and high rates of recurrences and/or metastasis [44]. Among all frequently selected genes, CYP4X1, PPP2R5A, and SPRR2G are ranked in the top200 genes of all 23 basal-like samples.
CYP4X1 encodes a member of the cytochrome P450 superfamily of enzymes.
Cytochrome P450s are key oxidative enzymes that metabolize many carcinogens and anticancer drugs. A tissue microarray study found that multiple cytochrome P450s can be used as prognostic markers of breast cancer [45]. The product of PPP2R5A belongs SPRR3 promotes breast cancer cell proliferation by enhancing p53 degradation via the AKT and MAPK pathways and its downregulation is associated with triple-negative breast cancer (TNBC), which is the main component of basal-like breast cancer [47].
SPRR family genes and FLNB are involved in the epidermal cell differentiation pathway, which is a specific enrichment pathway for basal-like samples. FLNB has the function of regulating EMT [48], and meanwhile, it's reported that SPRR can play a role as a ligand of SH3 protein domain and is related to EMT in cholangiocarcinoma [49].
Basal-like samples are highly invasive and show EMT characteristics, but the association between FLNB and basal-like breast cancer needs to be further verified. In the top100 gene list (Table S4)

Discussion
Benefitting from the rapid advancement of omics technologies and existing large labeled biomedical data, more supervised methods are developed for patient diagnosis and biological knowledge discovery. High-dimensionality and small sample size, as well as the different distributions of omics data have posed great challenges to multiomics integrative analysis. A well-designed architecture is needed which can jointly make accurate decisions and learn the interrelationships between various biological data types or entities. In this paper, we proposed a supervised deep learning architecture, AGCN, for breast cancer subtype identification. We utilized GCN for taking advantages of various data features and gene inherent structures and attention mechanism for flexibly learning the association patterns of different omics data types or biomolecules.
Through ablation studies, both GCN and attention mechanism were proved to be essential to multi-omics integrative classification. The distribution of attention scores extracted from the two omics-specific attention mechanism suggests that gene Since we only focused on the highly variable features of different omics data types, such feature selection strategy may introduce bias and ignore those slightly variable but functionally indispensable biomarkers. Although manual addition of reference biomarkers can help alleviate the problem, it is not applicable to analysis of unknown diseases. Development of an end-to-end approach with no need of complicated preprocessing procedures can be one of the solutions.

TCGA BRCA dataset
We downloaded the TCGA BRCA benchmark dataset from [54], which included 759 samples with omics data of CNV at the genome level, DNA methylation, and mRNA expression at the transcriptome level. The real 671 patient labels came from the results of a previous TCGA cohort study [55] and the remaining 88 unknown samples were used for external validation. For mRNA expression and DNA methylation data, median absolute deviation (MAD) was adopted to select the top 2,000 most-variable features.
For CNV data, genes in significantly gained or lost regions (q-value<0.05) reported in Broad Institute for breast cancer were selected. 3,167 high variable genes (hvGs) at different molecular levels were retained (Fig 1A) and the corresponding dataset was named as MDC. Besides, we selected genes from DisGenet [56] which were reported to be breast cancer related and filtered out genes with a GDA score < 0.2. Thus, a total of 540 genes were collected as reference gene sets and the corresponding dataset was named as MDC_addRef consisting of 3483 genes.

Protein-protein interaction network
We downloaded human PPI from StringDB (v11.5) as the graph structure input, which consisted of 19,385 vertices and 11,938,498 edges. After removing low confidence edges (confidence score < 850) and ID mapping, the final reference PPI contained 12,397 vertices and 287,084 edges. Different gene sequencing platforms are always hampered by missing data due to discrepant gene annotations and only a minority of genes are disease-related. To reduce irrelevant features and further improve the quality of the input biological network, we mainly focused on 3,167 hvGs at different molecular levels. Considering that direct edge reduction would affect the relationships between the original gene pairs and result in the loss of association information, we further explored the performance of GCN with network input preserving neighborhoods of different order.

Implement of GCN
Graph Convolutional Neural Networks define spectral convolution operations on graphs. Referring to the convolution theorem, the spectral space signal obtained by the Fourier transform is first linearly transformed and the inverse Fourier transform is then used to convert the spectral space signal to the original space to realize GCN. [57] simplify and normalize the convolution kernel of Chebyshev polynomial approximation to obtain a first-order GCN layer which is shown in Equation (1), where ̃= + , � = ∑̃, ( ) is the trainable parameters of the th layer, σ represents the activation function. Here, we used GCN to aggregate the information of gene neighboring nodes and then update the feature representation of genes.

Attention mechanism
Attention mechanism is a special module in machine learning model that can automatically learn and calculate the association between input data and output data.
Here, we explored the influences of three attention mechanism modules on the model performance.

 Squeeze and Excitation
We referred to the proposed architecture which was used to explicitly model the interdependencies between channels of convolutional neural networks [58]. Squeeze and Excitation (SE) block here is in use for importance assessment of different omicslevel features. Squeeze is the global information embedding module using global average pooling to get channel statistics. The purpose of Excitation is to capture the complex nonlinear interaction between channels, which is implemented through fully connected layers (FC) and gating mechanisms. Finally, residual connection is applied for output. The concrete structure is demonstrated in Fig 5A and the calculation steps are as follows: For sample , ∈ ℝ × , Squeeze: Excitation: Output: , = + � where δ represents the activation function ReLU and σ represents the activation function Sigmoid.
 column-wise self-attention and row-wise self-attention The self-attention mechanism is a module that learns the internal associations of sequences. Referring to Scaled Dot-Product Attention [59], we first calculate the query (Q), key (K), and value (Q) vector corresponding to the input sequence, and calculate the dot product between the Q and K as the association evaluation between the two vectors and normalization is performed to obtain the final attention score (Fig 5B, C).
Column-wise self-attention (cAttn) is used to capture the association between different omics and row-wise self-attention (rAttn) is used to capture association information between genes. The difference between the two is the implement of matrix transpose.

Layer-wise relevance propagation(LRP)[60]
The LRP algorithm utilizes the structure of the network by propagating the explanations from the output to the input layer based on the deep Taylor decomposition to explain the nonlinear decision. At some chosen root point �, so that ( �) = 0, the first-order Taylor approximation of the function ( ) can be expressed as: where can represent the relevance between the input element and the predicted output. For deep neural network layers with nonlinear activation function ReLU, of which the form is: where ( ) represents the output of neuron of the -th layer, ( +1) is the weight between the neuron of the -th layer and the neuron of the ( + 1)-th layer and ( +1) is bias. Since the layers of this type always have non-negative activations, the relevance propagation rule is made as following according to z + -rule:

Model parameters
In ablation study, network hyperparameters were determined by cross-validation.
Finally, we set the dimensions of feed forward network to -512-256-5 for cAMLP,

Univariable Cox PH analysis and enrichment analysis
Univariable Cox PH regression analysis was performed using R package Survival to screen genes associated with breast cancer survival (log-rank P value < 0.05 ).
Functional analysis and DO enrichment analysis of the screened genes were implemented in R package ClusterProfile [61] and ClueGO [62] plugin in Cytoscape was used to visualize KEGG network. GO and KEGG functional enrichment analysis for multiple gene lists was performed using MetaScape [63]. Minimal overlap of 3 genes, enrichment factor of 1.5, and p-value of 0.05 were used for filtering. The remaining significant terms were then hierarchically clustered into a tree based on Kappastatistical similarities among their gene memberships. Then 0.3 kappa score was applied as the threshold to cast the tree into term clusters. MCODE algorithm was further applied to the merged network to identify neighborhoods where proteins are densely connected. Pathway and process enrichment analysis has been applied to each MCODE component independently, and the three best-scoring terms by P value have been retained as the functional description of the corresponding components.

Data Availability
The TCGA BRCA multi-omics data used in this paper were downloaded from [54] and PPI data was downloaded from https://stringdb-static.org/. Figures   Fig 1. Workflow overview. A. Data collection and preprocessing of TCGA BRCA multi-omics data and PPI network. 67l samples with subtype information were used for model training. 3167 highly variable genes (hvGs) of each omics type were retained and 540 reference genes were downloaded from DisGenet. Raw PPI was collected from StringDB (v11.5) and edges with low confidence score (<850) were filtered out.

Code Availablility
Subnetwork consisting of hvGs was extracted from high confidence PPI with consideration of keeping different hop neighborhood relationships. B. Attention-based graph convolution network (AGCN) framework. AGCN takes multi-omics data and corresponding subnetwork as input. Performance of three attention mechanisms (SE, cAttn and rAttn) were discussed. C. Validity of AGCN. Univariable Cox PH was adopted to screen genes associated with patient overall survival based on the embedded feature extracted from global pooling layer of AGCN and raw expression data respectively. KEGG and DO enrichment analysis were performed on the significant genes. D. Interpretation of AGCN. Layer-wise relevance propagation algorithm was adopted on the AGCN model for feature ranking. KEGG and GO enrichment analysis were performed on top100 genes of each subtype. Specifically, patient-specific pattern of top200 genes in basal-like samples was further discussed. E. Validation on unknown data. The best model obtained from cross-validation was used to predict 88 unlabeled samples and further survival analysis was performed based on predicted labels.

Fig 2.
Performance of three attention mechanisms for multi-omics feature fusion. A.