Abstract
Drug combination therapy is a promising solution to many complicated diseases. Since experimental measurements cannot be scaled to millions of candidate combinations, many computational approaches have been developed to identify synergistic drug combinations. While most existing approaches use either SMILES-based features or molecular-graph-based features to represent drugs, we found that neither of these two feature modalities can comprehensively characterize a pair of drugs, necessitating the integration of these two types of features. Here, we propose Pisces, a cross-modal contrastive learning approach for synergistic drug combination prediction. The key idea of our approach is to model the combinations of SMILES and molecular graphs as four views of a pair of drugs, and then apply contrastive learning to embed these four views closely to obtain high-quality drug pair embeddings. We evaluated Pisces on the recently released GDSC-Combo dataset, which includes 102,893 drug combinations and 125 cell lines. Pisces outperformed five existing drug combination prediction approaches under three settings: vanilla cross validation, stratified cross validation for drug combinations, and stratified cross validation for cell lines. Our case study and ablation studies further confirmed the effectiveness of our novel contrastive learning framework and the importance of integrating the SMILES-based features and the molecular-graph-based features. Pisces obtained state-of-the-art results on drug synergy prediction and can potentially be applied to other tasks that model pairs of drugs, such as drug-drug interaction prediction.
Availability: The implementation of Pisces and the comparison approaches can be accessed at https://github.com/linjc16/Pisces.
1 Introduction
Drug combination therapy is a promising solution to many complex diseases, such as breast cancer [1–3], colorectal cancer [4, 5], Alzheimer’s disease [6], and diabetes [7, 8]. However, previous studies have also pointed out the rareness of synergistic drug combinations [9–11]. Since experimentally testing millions of candidate combinations is not scalable, there is a pressing need to develop computational approaches to identify synergistic drug combinations [12–16].
We focus on cancer drug synergy prediction and follow existing approaches [17] to formulate the synergistic drug combination prediction problem as a triplet classification problem. Each triplet consists of two drugs and a cancer cell line and is classified as synergistic or not. Each cell line is represented using its genomics features, such as gene expression and somatic mutation.
A key technical challenge is to derive an effective representation for a pair of drugs. Simplified molecular-input line-entry system (SMILES) sequences [18] and molecular graphs [19–21] are the two major modalities for representing a drug. Both of them have strengths and limitations: SMILES sequences are easier to embed by leveraging the recent progress in natural language processing, but are ambiguous on drugs with complex structures; molecular graphs precisely characterize the molecular information, but large graphs (i.e., large diameter) are often hard to embed [22]. This dilemma is even more severe when we want to embed a pair of drugs since one might be better represented using SMILES sequences and the other might be better represented using molecular graphs. As we showed in the experiments, a simple concatenation of these two kinds of features yields undesirable results.
To address this problem, we propose Pisces, a cross-modal contrastive learning approach for drug synergy prediction. Our intuition is that the SMILES sequence modality and the molecular graph modality complement each other, and thus should be integrated. To realize this intuition, we have developed a cross-modal contrastive learning framework. Contrastive learning has recently obtained great success in computer vision, where an image is embedded closely to its augmentation (e.g., the rotated view) [23–30]. However, contrastive learning has never been applied to pairs of drugs, since it is unclear what a proper augmentation should be. We propose to create four augmented views for each drug pair based on the combinations of the SMILES sequence and the molecular graph modality. We hypothesize that these four combinations can offer a comprehensive view of a drug pair, thus enhancing drug synergy prediction.
We evaluated our method on a recently published large-scale cancer drug synergy dataset, GDSC-Combo [9], which covers 102,893 drug combinations spanning 63 drugs and 125 cell lines [2]. We first observed a substantial discrepancy between the prediction performance obtained with the two different modalities. By contrasting these two modalities, Pisces substantially outperformed five existing drug combination prediction approaches under the vanilla cross validation setting, the stratified cross validation for drug combinations setting, and the stratified cross validation for cell lines setting. Finally, we found that the two drugs in the top-performing drug pairs favored different modalities, again confirming the effectiveness of integrating the SMILES and molecular graph modalities. Despite the large number of triplet combinations that have been measured in GDSC-Combo [9], the dataset still covers only 20.7% of all possible triplet combinations. Pisces offers an in silico solution to massively generalize these in vitro measurements. In addition to drug synergy prediction, Pisces can be broadly applied to other applications that require the modeling of drug pairs, such as drug-drug interaction prediction, and can be extended to integrate other drug modalities.
2 Methods
2.1 Problem setting
We model the synergistic drug combination prediction problem as a classification task. Given the drug set $\mathcal{D}$ and the cell line set $\mathcal{C}$, each drug combination in the dataset is denoted as a triplet $(d_A, d_B, c)_i$, where $d_A \in \mathcal{D}$ and $d_B \in \mathcal{D}$ represent a pair of two different drugs, and $c \in \mathcal{C}$ denotes the cell line. The prediction of drug synergy is modeled as:
$$\hat{y}_i = f_\theta\big((d_A, d_B, c)_i\big), \tag{1}$$
where $f_\theta$ is a learned mapping function with $\theta$ as the learnable parameters. The output $\hat{y}_i \in [0, 1]$ is the predicted probability that the drug combination is synergistic. For each drug combination, the binary synergy label $y_i$ takes the value 1 for synergy and 0 otherwise.
We aim to find the best $f_\theta$ such that the predictions on the test set $\{\hat{y}_i\}_{test}$ approximate the labels $\{y_i\}_{test}$. Moreover, following previous works [12, 14–16], we treat a predicted probability greater than 0.5 as synergy, and as no synergy otherwise.
Furthermore, for drug representations, we denote the SMILES features and molecular graph features of a drug as $s$ and $g$, respectively. Each SMILES string can be represented as $s = (s_1, \dots, s_l)$, where each $s_j$ is a token and $l$ is the string length. Molecular graph features can be represented as $g = (V, E)$, where $V = \{v_1, \dots, v_n\}$ are the $n$ atoms in the molecular graph and $E \subseteq V \times V$ are the bonds between atoms. The cell line features are provided as a fixed-size vector $c \in \mathbb{R}^M$ for $M$ genes, with each dimension representing transcripts per million (TPM) for one gene. Since Pisces uses both kinds of drug features, the input is $(s_A, g_A, s_B, g_B, c)_i$.
2.2 Overview of Pisces
We propose Pisces, a cross-modal contrastive learning approach, as shown in Figure 1. Pisces makes full use of the complementary information in the structural and SMILES-based inputs. A graph neural network [31] and a Transformer [32] model are applied to encode each drug’s graph-based features and SMILES-based features, respectively. Pisces also encodes cell line features by assembling over-expressed genes and their associated neighbors with a Multi-layer Perceptron (MLP) [33]. We then concatenate the drug feature vectors and cell feature vectors to produce the drug combination representations. We introduce a contrastive learning loss term to integrate the feature representations. Finally, Pisces uses a binary classifier to predict the drug synergy label.
Figure 1. Pisces considers both the SMILES sequence and the molecular graph of each drug. It first uses a Transformer to embed SMILES sequences and graph neural networks to embed molecular graphs. Pisces embeds each cell line by aggregating neighbors of over-expressed genes in the protein-protein interaction network. It then concatenates the SMILES embedding and the graph embedding between two drugs, which creates four different views for each drug combination. These four different views are treated as augmentations in contrastive learning.
2.3 Embedding drugs using SMILES-based features
Transformer has obtained great success in processing sequence data in domains such as natural language [34, 35], combinatorial optimization [36–38], protein sequences [39, 40] and molecular features [22, 41]. We first encode the SMILES string using a Transformer [32] model. Specifically, we use the encoder architecture with multi-head self-attention modules. Each encoding layer includes one self-attention module and a feed-forward network (FFN). Skip-connections are added to each module to build up the residual blocks [42]. The special token [CLS] is added to the beginning of each SMILES string $s$. We use the contextualized embedding corresponding to [CLS] from the output as the SMILES representation and denote the SMILES embeddings of drug A and drug B as $h_s^A$ and $h_s^B$.
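As an illustrative sketch of the input side of this step (not the released implementation), the snippet below splits a SMILES string into tokens and prepends the [CLS] token whose output embedding serves as the drug representation. The regular expression is a commonly used SMILES tokenization pattern and is an assumption here; the tokenizer actually used by Pisces is not described in the text, and `tokenize_smiles` is a hypothetical helper name.

```python
import re
from typing import List

# Assumption: a widely used regex for SMILES tokens (bracket atoms, two-letter
# halogens, aromatic atoms, ring-closure digits, bond and branch symbols).
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]|%\d{2}|\d|[=#\-\+\(\)/\\\.@:~\*\$])"
)

CLS_TOKEN = "[CLS]"


def tokenize_smiles(smiles: str) -> List[str]:
    """Split a SMILES string into tokens and prepend the [CLS] token."""
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    # Refuse inputs that the pattern cannot cover completely.
    if "".join(tokens) != smiles:
        raise ValueError(f"Unrecognized characters in SMILES: {smiles!r}")
    return [CLS_TOKEN] + tokens


if __name__ == "__main__":
    print(tokenize_smiles("CC(=O)Cl"))
```

In a Transformer encoder, the output vector at position 0 (the [CLS] position) would then be taken as the SMILES embedding of the drug.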
2.4 Embedding drugs using molecular graph-based features
The molecular graph-based features are encoded with a DeeperGCN encoder [31], which is a stack of identical layers. It takes a set of node vectors and an adjacency matrix as input and propagates information between nodes based on the graph structure [43]. In each layer, the output from the previous layer is processed sequentially by a layer normalization module [44], a nonlinear ReLU module [45], and a residual graph convolutional block in which each node aggregates information from both its neighboring edges and its neighboring nodes with an aggregation module. The aggregation module comprises the concatenation of max, min and mean pooling. We denote the output node vectors as $\{x_1, \dots, x_n\}$. We then calculate the molecular graph-based embedding by applying mean pooling and max pooling over these node vectors.
Finally, for drug A and drug B, we produce the SMILES-based representations $h_s^A$ and $h_s^B$ and the molecular graph-based representations $h_g^A$ and $h_g^B$.
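The readout from node vectors to a single graph embedding can be sketched as follows. The text states only that mean pooling and max pooling are used; combining them by concatenation is an assumption made for this example, and `graph_readout` is a hypothetical name.

```python
from typing import List


def graph_readout(node_vectors: List[List[float]]) -> List[float]:
    """Pool per-atom vectors into one graph-level vector.

    Assumption: the mean-pooled and max-pooled vectors are concatenated;
    the source does not specify how the two poolings are combined.
    """
    if not node_vectors:
        raise ValueError("A molecular graph needs at least one atom.")
    n_atoms = len(node_vectors)
    dim = len(node_vectors[0])
    mean_pool = [sum(vec[k] for vec in node_vectors) / n_atoms for k in range(dim)]
    max_pool = [max(vec[k] for vec in node_vectors) for k in range(dim)]
    return mean_pool + max_pool


if __name__ == "__main__":
    # Two atoms with 2-dimensional node vectors.
    print(graph_readout([[1.0, 4.0], [3.0, 2.0]]))
```

Both poolings are invariant to the order of the atoms, so the resulting embedding does not depend on how the molecular graph is indexed.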
2.5 Embedding cell line features using gene expression level and PPI topological relationships
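Based on the TPM values, we set a threshold $T$ and determine an over-expressed gene set $\mathcal{G}_o = \{g : \mathrm{TPM}_g > T\}$. Then, we find the $k$ nearest neighbors of each gene $g \in \mathcal{G}_o$ in the protein-protein interaction (PPI) [46] network, defined as $\mathcal{N}_k(g)$, and finally compute a candidate set $\mathcal{G}_c = \mathcal{G}_o \cup \bigcup_{g \in \mathcal{G}_o} \mathcal{N}_k(g)$. We then produce the embedding of the cell line features as
$$h^c = \mathrm{MLP}\big(\mathrm{Agg}(\{c_g : g \in \mathcal{G}_c\})\big),$$
where $c_g$ is a learnable embedding for each protein and $\mathrm{Agg}$ denotes the aggregation over the candidate set.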
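The construction of the candidate gene set can be sketched as below. Several details are assumptions rather than facts from the text: "over-expressed" is taken as TPM strictly above the threshold, "nearest" is interpreted as smallest hop distance in the PPI network (found by breadth-first search, with ties broken alphabetically), and the candidate set is taken to include the over-expressed genes themselves. Gene names and values are toy placeholders.

```python
from collections import deque
from typing import Dict, List, Set


def over_expressed_genes(tpm: Dict[str, float], threshold: float) -> Set[str]:
    """Genes whose TPM exceeds the threshold T (strict inequality assumed)."""
    return {gene for gene, value in tpm.items() if value > threshold}


def k_nearest_neighbors(ppi: Dict[str, List[str]], gene: str, k: int) -> List[str]:
    """First k genes reached by breadth-first search from `gene`."""
    seen = {gene}
    queue = deque([gene])
    neighbors: List[str] = []
    while queue and len(neighbors) < k:
        current = queue.popleft()
        for nb in sorted(ppi.get(current, [])):
            if nb not in seen:
                seen.add(nb)
                neighbors.append(nb)
                queue.append(nb)
                if len(neighbors) == k:
                    break
    return neighbors


def candidate_gene_set(tpm, ppi, threshold: float, k: int) -> Set[str]:
    """Over-expressed genes together with their k nearest PPI neighbors."""
    seeds = over_expressed_genes(tpm, threshold)
    candidates = set(seeds)
    for gene in seeds:
        candidates.update(k_nearest_neighbors(ppi, gene, k))
    return candidates


if __name__ == "__main__":
    tpm = {"G1": 500.0, "G2": 10.0, "G3": 5.0, "G4": 1.0}
    ppi = {"G1": ["G2"], "G2": ["G1", "G3"], "G3": ["G2", "G4"], "G4": ["G3"]}
    print(sorted(candidate_gene_set(tpm, ppi, threshold=400.0, k=2)))
```

The learnable embeddings of the genes in the returned set would then be aggregated and passed through the MLP to obtain the cell line embedding.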
2.6 Improved synergistic prediction using contrastive learning
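Contrastive learning has been broadly used in both supervised and unsupervised learning [23, 25, 47]. The key idea behind contrastive learning is to push positive sample pairs to be close within the embedding space while pushing negative pairs far away from each other. In this formulation, positive pairs are instances defined to be similar in some important way, while negative pairs keep the space from collapsing to a single point. By defining positive pairs as two views of the same drug combination triplet and negative pairs as views of different drug combination triplets, we can apply contrastive learning to integrate different features in the drug combination learning task. In particular, we randomly select one type of feature embedding (SMILES or molecular graph) for each drug, i.e.,
$$z(u, v) = \big[\,h_u^A \,;\, h_v^B \,;\, h^c\,\big], \quad u, v \in \{s, g\},$$
where $h_u^A$ and $h_v^B$ are the embeddings of drug A and drug B, $h^c$ is the cell line embedding, $[\cdot\,;\cdot]$ denotes concatenation, and the subscripts $s$ and $g$ represent the SMILES and molecular graph modalities. We use $z_i(u, v)$ and $z_i(u', v')$ to denote two different random choices for the $i$-th drug combination in one training batch. We choose InfoNCE [48] as the contrastive learning loss in our study, which in its standard form reads
$$\mathcal{L}_{cl} = -\frac{1}{n_{batch}} \sum_{i=1}^{n_{batch}} \log \frac{\exp\big(z_i(u, v)^\top z_i(u', v')\big)}{\sum_{j=1}^{n_{batch}} \exp\big(z_i(u, v)^\top z_j(u', v')\big)}, \tag{6}$$
where $n_{batch}$ denotes the number of samples in one training batch. Finally, Pisces uses one linear layer and an MLP layer $P_\psi$ to generate the output $\hat{y}$, trained with the binary cross entropy loss for this classification task:
$$\hat{y}_i = \sigma\big(P_\psi(z_i(u, v))\big), \tag{7}$$
$$\mathcal{L}_{cls} = -\frac{1}{n_{batch}} \sum_{i=1}^{n_{batch}} \big[\, y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \,\big], \tag{8}$$
where $\sigma$ represents the Sigmoid [49] activation function. To further enhance the consistency between different types of features, we introduce an additional auxiliary loss. First, we define the auxiliary score as follows:
[Equation (9), the auxiliary score, is not recoverable from the source text.]
where $\sigma$ denotes the Sigmoid activation function. The auxiliary loss $\mathcal{L}_{aux}$ is then defined to minimize the discrepancy between the auxiliary scores and the classification outputs:
[Equation (10), the auxiliary loss, is not recoverable from the source text.]
Finally, we combine the InfoNCE loss in (6), the classification loss in (8) and the auxiliary loss in (10) into our final cross-modal contrastive learning loss with $\lambda_1, \lambda_2$ as hyperparameters:
$$\mathcal{L} = \mathcal{L}_{cls} + \lambda_1 \mathcal{L}_{cl} + \lambda_2 \mathcal{L}_{aux}. \tag{11}$$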
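A minimal sketch of the four views and the loss terms described in this section is given below, in plain Python for readability. It is not the released implementation. The assumptions are: views are built by concatenating the two drug embeddings with the cell line embedding; the InfoNCE term uses dot-product similarity without a temperature; negatives are the other samples in the batch; and $\lambda_1$ weights the contrastive term while $\lambda_2$ weights the auxiliary term. The auxiliary loss itself is passed in as a number because its definition is not available in the text. All function names are hypothetical.

```python
import math
from typing import Dict, List, Tuple

MODALITIES = ("s", "g")  # "s": SMILES, "g": molecular graph


def build_views(h_a: Dict[str, List[float]],
                h_b: Dict[str, List[float]],
                h_c: List[float]) -> Dict[Tuple[str, str], List[float]]:
    """Four views z(u, v) of one (drug A, drug B, cell line) triplet."""
    return {(u, v): h_a[u] + h_b[v] + h_c for u in MODALITIES for v in MODALITIES}


def dot(x: List[float], y: List[float]) -> float:
    return sum(a * b for a, b in zip(x, y))


def info_nce(anchors: List[List[float]], positives: List[List[float]]) -> float:
    """InfoNCE over a batch: sample i in `positives` is the positive for
    sample i in `anchors`; all other samples act as negatives."""
    n = len(anchors)
    total = 0.0
    for i in range(n):
        logits = [dot(anchors[i], positives[j]) for j in range(n)]
        top = max(logits)  # subtract the max for numerical stability
        log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
        total += log_denominator - logits[i]
    return total / n


def binary_cross_entropy(labels: List[int], probs: List[float],
                         eps: float = 1e-12) -> float:
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


def total_loss(l_cls: float, l_cl: float, l_aux: float,
               lambda_1: float = 0.01, lambda_2: float = 0.01) -> float:
    """Weighted sum of the three terms (weight assignment is an assumption)."""
    return l_cls + lambda_1 * l_cl + lambda_2 * l_aux


if __name__ == "__main__":
    views = build_views({"s": [1.0], "g": [2.0]}, {"s": [3.0], "g": [4.0]}, [5.0])
    print(sorted(views))
    print(info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]))
```

During training, two of the four views would be sampled at random for each triplet in the batch and passed as `anchors` and `positives`.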
3 Experimental settings
Dataset
We obtained a recently released drug combination dataset from Genomics of Drug Sensitivity in Cancer (GDSC-combo) [9]. Since there are two replicates for each (drug pair, cell line) triplet, we first obtained samples in the form of (Drug A, Drug B, cell line, synergy or not) tuples using the following rule: each triplet (Drug A, Drug B, cell line) in the original dataset is labeled as synergistic if at least one of its replicates is synergistic, and as not synergistic otherwise. We excluded the samples that involve a combination of three or more drugs. Finally, we obtained 102,893 samples, covering 63 drugs and 125 cell lines. This is a highly imbalanced dataset in which only 5,362 samples are synergistic. We evaluated three settings: the vanilla cross validation (CV) setting, the stratified cross validation for drug combinations setting and the stratified cross validation for cell lines setting. Specifically, in the stratified CV for drug combinations setting, drug combinations in the test set have never been seen in the training set. Likewise, in the stratified CV for cell lines setting, cell lines in the test set have never been seen in the training set. Moreover, PRODeepSyn, GraphSynergy and Pisces share the PPI network, which is obtained from the STRING database [50] as in [14].
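The labeling rule and the two stratified settings can be illustrated with the sketch below. It assumes that drug pairs are unordered, and it assigns groups to folds in a simple round-robin order; the number of folds and the actual fold assignment used in the experiments are not stated in the text, so `n_folds` and both function names are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, List, Tuple

Sample = Tuple[Tuple[str, str], str]  # ((drug A, drug B), cell line)


def merge_replicates(records: Iterable[Tuple[str, str, str, int]]) -> Dict[Sample, bool]:
    """Label a triplet as synergistic if any of its replicates is synergistic."""
    labels: Dict[Sample, bool] = defaultdict(bool)
    for drug_a, drug_b, cell_line, synergistic in records:
        pair = tuple(sorted((drug_a, drug_b)))  # assumption: pairs are unordered
        labels[(pair, cell_line)] = labels[(pair, cell_line)] or bool(synergistic)
    return dict(labels)


def grouped_folds(samples: List[Sample],
                  group_of: Callable[[Sample], Hashable],
                  n_folds: int) -> List[List[Sample]]:
    """Split samples so that every group falls into exactly one fold."""
    groups = sorted({group_of(s) for s in samples})
    fold_of_group = {g: i % n_folds for i, g in enumerate(groups)}
    folds: List[List[Sample]] = [[] for _ in range(n_folds)]
    for s in samples:
        folds[fold_of_group[group_of(s)]].append(s)
    return folds


if __name__ == "__main__":
    records = [("A", "B", "c1", 0), ("B", "A", "c1", 1), ("A", "C", "c2", 0)]
    labels = merge_replicates(records)
    # Group by drug pair for unseen-combination CV, by cell line for unseen-cell-line CV.
    print(grouped_folds(list(labels), lambda s: s[0], n_folds=2))
    print(grouped_folds(list(labels), lambda s: s[1], n_folds=2))
```

Grouping by `s[0]` ensures that no test drug pair appears in training, and grouping by `s[1]` does the same for cell lines.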
Comparison approaches
We compare Pisces to five existing drug synergy prediction approaches. PRODeepSyn [14] takes molecular fingerprints and descriptors for drugs as the inputs. The molecular fingerprint is a 256-dimensional binary vector for each drug, representing the existence of a set of predefined substructures [51]. The drug descriptor is a 200-dimensional real vector used in [52], representing physical or chemical properties of interest, such as lipophilicity or molecular refractivity. Both the fingerprints and the descriptors are obtained from RDKit [53]. Cell lines are embedded using Graph Convolutional Networks [54] by integrating the protein-protein interaction (PPI) network with the gene expression vector. Finally, PRODeepSyn uses an MLP to predict drug synergy. AuDNNsynergy [12] utilizes the same features as PRODeepSyn. Different from PRODeepSyn, it trains three autoencoders to predict drug synergy. DeepSynergy [13] takes molecular fingerprints, descriptors and drug-target interactions for drugs and Transcripts Per Million (TPM) for cell lines as the inputs. All these features are then fed into an MLP to predict drug synergy. GraphSynergy [15] utilizes Graph Convolutional Networks to extract drug and cell line features from the PPI network. These features are then fed into an MLP to predict drug synergy. DeepDDS [16] takes the molecular graph of each drug and TPM for cell lines as the inputs. It then uses an MLP to predict drug synergy. Notably, these comparison approaches often rely on very different features that might not be available in every dataset. The original implementations of these methods often use hard-coded or pre-processed features that cannot be generalized to GDSC-combo. The details of pre-processing are also not comprehensively described, which makes it hard to reproduce their results. Therefore, we have re-implemented many of them and used the same feature pre-processing to enable a fair comparison.
We have made our implementations of all five comparison approaches available for future studies.
Model architecture
We utilized DeeperGCN and Transformer to embed drugs. Specifically, DeeperGCN consists of 6 stacked graph convolutional blocks with a hidden size of 384. The hidden size, the FFN size, and the number of Transformer encoder layers were set to 512, 1024 and 6, respectively. The number of attention heads in the Transformer was set to 4. The dropout rates of DeeperGCN and Transformer were both 0.1. When determining over-expressed genes, the threshold $T$ was set to 400. For the comparison approaches, we followed the original papers for the model architectures and the training details.
Training details
We trained our models using the Adam [55] optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.98$, $\epsilon = 10^{-6}$, a weight decay of 0.01 and a batch size of 128 for 100,000 training steps. We used a linear decay scheduler with 4,000 warm-up steps. We ran a grid search over {1e-5, 5e-5, 8e-5, 1e-4, 5e-4} for the learning rate. $\lambda_1$ and $\lambda_2$ were both set to 0.01.
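The learning rate schedule described above can be sketched as a function of the training step. The text specifies 4,000 warm-up steps, linear decay and 100,000 total steps; that the warm-up is linear from zero and that the decay reaches zero at the final step are assumptions of this sketch, and `learning_rate` is a hypothetical helper.

```python
def learning_rate(step: int, peak_lr: float,
                  warmup_steps: int = 4_000, total_steps: int = 100_000) -> float:
    """Linear warm-up to `peak_lr`, then linear decay to zero (assumed shape)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)


if __name__ == "__main__":
    # Peak learning rates taken from the grid searched in the text.
    for peak in (1e-5, 5e-5, 8e-5, 1e-4, 5e-4):
        print(peak, [learning_rate(s, peak) for s in (0, 2_000, 4_000, 52_000, 100_000)])
```

Under these assumptions the rate peaks at step 4,000 and falls to half of its peak at step 52,000, the midpoint of the decay phase.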
Metrics
Since our dataset is highly imbalanced, we consider the following four metrics for evaluation: balanced accuracy (BACC), which is the average of sensitivity (true positive rate) and specificity (true negative rate), area under the precision-recall curve (AUPRC), F1 score and Cohen’s Kappa statistic. For all metrics, higher is better.
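For concreteness, the three threshold-based metrics can be computed from the confusion matrix as in the sketch below, using the standard textbook definitions and the 0.5 decision threshold from Section 2.1. AUPRC is omitted because it requires ranking by predicted probability. The function names are hypothetical, and the released code may rely on a library instead.

```python
from typing import List, Tuple


def confusion_counts(labels: List[int], probs: List[float],
                     threshold: float = 0.5) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn); a probability above the threshold means synergy."""
    tp = fp = tn = fn = 0
    for y, p in zip(labels, probs):
        pred = 1 if p > threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def balanced_accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return (sensitivity + specificity) / 2


def f1_score(tp: int, fp: int, tn: int, fn: int) -> float:
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def cohen_kappa(tp: int, fp: int, tn: int, fn: int) -> float:
    n = tp + fp + tn + fn
    observed = (tp + tn) / n
    # Agreement expected by chance from the marginal frequencies.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    return (observed - expected) / (1 - expected) if expected != 1 else 0.0


if __name__ == "__main__":
    labels = [1, 1, 0, 0, 0, 0]
    counts = confusion_counts(labels, [0.9, 0.2, 0.1, 0.3, 0.4, 0.8])
    print(balanced_accuracy(*counts), f1_score(*counts), cohen_kappa(*counts))
```

A model that predicts "no synergy" for every sample obtains an F1 score and a Kappa of exactly zero under these definitions, while its BACC is 0.5.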
4 Results
4.1 SMILES-based and molecular-graph-based features are complementary for drug synergy prediction
Pisces is developed based on the hypothesis that SMILES-based features and molecular-graph-based features complement each other, and thus that integrating them might enhance drug synergy prediction. Therefore, we first sought to validate this hypothesis by examining the performance of using either SMILES-based or molecular-graph-based features alone. Although the overall performance of the two types of features is consistent (Figure 2), we also observed substantial performance discrepancies between them on many drug combinations. For example, the combination of Sapitinib and Entinostat obtained a KAPPA score of 0.7897 when using SMILES-based features, while the KAPPA score drops to 0.3691 when using molecular-graph-based features. The number of drug combinations that are better predicted by the SMILES-based features is comparable to the number that are better predicted by the molecular-graph-based features, underscoring the importance of considering both types of features.
Figure 2. a-d, Scatter plots comparing the performance of using SMILES-based features and molecular-graph-based features in terms of BACC (a), AUPRC (b), F1 (c), and KAPPA (d). Each point is a drug. Shaded areas represent 95% confidence intervals of a linear regression line. FVE stands for fraction of inter-species variance explained. p-values are obtained using an F-test.
4.2 Pisces achieves substantial improvement on drug synergy prediction
After validating our hypothesis that molecular graphs and SMILES provide complementary information, we next sought to compare Pisces to other drug synergy prediction approaches under three cross validation settings. On vanilla cross validation, we found that Pisces substantially outperformed all comparison approaches on all four metrics (Figure 3). For example, Pisces obtained a 0.4474 KAPPA score, which is 21.05% higher than the best comparison approach. Since Pisces is the only approach considering both types of drug features, the prominent performance of Pisces confirms the effectiveness of contrasting molecular-graph-based features and SMILES-based features.
Figure 3. a-d, Bar plots comparing Pisces to five existing approaches under vanilla cross validation, stratified cross validation for drug combinations and stratified cross validation for cell lines, in terms of BACC (a), AUPRC (b), F1 (c), and KAPPA (d). GraphSynergy cannot be applied to the stratified cross validation for cell lines setting. Since DeepSynergy predicts all drug combinations as no synergy in the stratified cross validation for drug combinations and for cell lines settings, its F1 and KAPPA values are zero there.
The promising performance of Pisces on vanilla CV motivates us to further evaluate it in two more challenging settings. We first examined the stratified CV for drug combinations setting where all test drug pairs have never been seen in the training set (Figure 3). We found that the performances of all approaches dropped, confirming that this is a more challenging setting compared to vanilla CV. Nevertheless, our method still outperformed all comparison approaches and the improvement is even larger on this challenging setting than that on the vanilla CV setting. We attribute this improvement to Pisces’ consideration of both types of features, which offers us a more robust drug combination representation that can be generalized to unseen drug combinations.
Finally, we evaluated the second challenging setting of stratified CV for cell lines, where all test cell lines have never been seen in the training set (Figure 3). This setting is much closer to real-world clinical applications since it can perform predictions for a new patient who has not been treated by any drug combinations. We again found that the performance of all methods dropped substantially. Nevertheless, Pisces still achieved the overall best performance, suggesting its applicability in real-world applications. Collectively, the prominent performance of Pisces on three different settings demonstrates the importance of integrating molecular-graph-based and SMILES-based features and the effectiveness of our cross-modal contrastive learning framework.
4.3 Pisces obtains larger improvement on drug combinations that favor different modalities
After observing the promising performance of Pisces, we then sought to understand what kinds of drug combinations obtain larger improvements using Pisces. We first visualized the embedding space of all test triplets produced by Pisces and by the two best-performing comparison approaches, DeepDDS and GraphSynergy (Figure 4). We found that Pisces obtained a more visible pattern separating synergistic and non-synergistic triplets, further validating the promising performance of our method. We also observed a clear pattern on cell line OCUB-M, where our method achieved a large improvement compared to other approaches, suggesting the effectiveness of contrasting the two modalities.
Figure 4. t-SNE visualization of the embedding space of Pisces on all cell lines (a), DeepDDS on all cell lines (b), GraphSynergy on all cell lines (c), and Pisces on cell line OCUB-M (d). Each point is a triplet of a drug combination and a cell line.
Next, we studied whether Pisces obtains a larger improvement on drug pairs in which the two drugs favor different modalities. For each single drug, we determine the modality it favors using the same analysis as in Figure 2. If both drugs in a combination favor the same modality (i.e., SMILES or molecular graph), we denote the combination as "same" in Figure 5a. We found that drug pairs that favor different modalities show substantially larger improvements with Pisces than those that favor the same modality, supporting our hypothesis that these drug pairs are relatively more challenging to model and that our cross-modal contrastive learning approach can effectively embed them by integrating the two modalities.
Figure 5. a, Box plot comparing the improvement of AUPRC by Pisces on drug pairs that favor the same modality to drug pairs that favor different modalities. b,c, Box plots comparing the graph diameter (b) and number of rings (c) between the 50 most-improved drug combinations and other drug combinations. The improvement is calculated as the relative improvement of Pisces over the best comparison approach in terms of AUPRC. d, Molecular graphs of Vinorelbine and Linsitinib, on which Pisces obtained large improvements.
Finally, we investigated whether molecular-graph properties are also related to the improvement of Pisces. We found that the 50 combinations most improved by Pisces have a significantly larger graph diameter (p-value < 0.0023) and number of rings in the graph (p-value < 0.0091) (Figure 5b,c). For instance, the drug pair Linsitinib and Vinorelbine, which obtained a 211.87% AUPRC improvement over the best comparison approach, has a graph diameter of 31 and 15 rings in total (Figure 5d). Large and complicated graphs are often difficult to embed using graph-based approaches. Pisces mitigates this issue by additionally considering SMILES-based features, resulting in better performance on these combinations.
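The two graph properties used in this analysis can be computed from an adjacency list as sketched below. The diameter is taken as the longest shortest path between any two atoms, and the ring count as the cyclomatic number (bonds minus atoms plus connected components), which equals the size of a smallest set of smallest rings. Whether the analysis used exactly these definitions, or a cheminformatics toolkit, is an assumption; the function names are hypothetical.

```python
from collections import deque
from typing import Dict, List


def bfs_distances(adj: Dict[int, List[int]], source: int) -> Dict[int, int]:
    """Shortest-path distance (in bonds) from `source` to every reachable atom."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist


def graph_diameter(adj: Dict[int, List[int]]) -> int:
    """Longest shortest path between any two atoms in the same component."""
    return max(max(bfs_distances(adj, atom).values()) for atom in adj)


def ring_count(adj: Dict[int, List[int]]) -> int:
    """Cyclomatic number: bonds - atoms + connected components."""
    n_bonds = sum(len(nbs) for nbs in adj.values()) // 2
    seen = set()
    components = 0
    for atom in adj:
        if atom not in seen:
            components += 1
            seen.update(bfs_distances(adj, atom))
    return n_bonds - len(adj) + components


if __name__ == "__main__":
    # A six-membered ring, e.g. the carbon skeleton of benzene.
    hexagon = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
    print(graph_diameter(hexagon), ring_count(hexagon))
```

Running one breadth-first search per atom costs time quadratic in the number of atoms for sparse molecular graphs, which is negligible at the size of drug molecules.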
4.4 Ablation studies
Finally, we performed ablation studies to examine the importance of each component in Pisces (Figure 6). We first examined the importance of cross-modal contrastive learning by comparing Pisces to a baseline approach that used a simple concatenation to integrate molecular-graph-based and SMILES-based features (Molecular graph and SMILES (w/o cl)). We found that our method outperformed this approach on 11 out of 12 comparisons, indicating the effectiveness of contrastive learning. Next, we investigated the importance of using both types of features by comparing Pisces to a model that only uses molecular-graph-based features and a model that only uses SMILES-based features. Cross-modal contrastive learning cannot be applied to these two models. We found that Pisces substantially outperformed both methods, again confirming the importance of using both types of features. Interestingly, we found that the simple concatenation-based approach is only slightly better than these two models that only use one type of the features, further confirming the effectiveness of our cross-modal contrastive learning framework.
Figure 6. a-d, Bar plots examining the importance of different types of features and contrastive learning in terms of BACC (a), AUPRC (b), F1 (c), and KAPPA (d). "w/ cl" denotes using the proposed cross-modal contrastive learning; "w/o cl" denotes not using it.
5 Conclusion and future work
We have developed a synergistic drug combination prediction approach Pisces. Based on the intuition that different drug combinations might favor different types of features, Pisces exploits cross-modal contrastive learning to integrate SMILES-based features and molecular-graph-based features. We have evaluated our method on a recently-published large-scale dataset GDSC-Combo [9] and observed that Pisces substantially outperformed five existing approaches. In addition to drug synergy prediction, our framework of contrasting different modalities can also be applied to other drug combination tasks, such as drug-drug side effect prediction. In the future, we would like to further improve Pisces by incorporating more drug molecule modalities, such as drug three-dimensional structure, drug textual descriptions, and pharmacodynamics features.
References
- [1]
- [2]
- [3]
- [4]
- [5]
- [6]
- [7]
- [8]
- [9]
- [10]
- [11]
- [12]
- [13]
- [14]
- [15]
- [16]
- [17]
- [18]
- [19]
- [20]
- [21]
- [22]
- [23]
- [24]
- [25]
- [26]
- [27]
- [28]
- [29]
- [30]
- [31]
- [32]
- [33]
- [34]
- [35]
- [36]
- [37]
- [38]
- [39]
- [40]
- [41]
- [42]
- [43]
- [44]
- [45]
- [46]
- [47]
- [48]
- [49]
- [50]
- [51]
- [52]
- [53]
- [54]
- [55]