Abstract
Predicting synergistic drug combinations can help accelerate discovery of cancer treatments, particularly therapies personalized to a patient’s specific tumor via biopsied cells. In this paper, we propose a novel setting and models for in-context drug synergy learning. We are given a small “personalized dataset” of 10-20 drug synergy relationships in the context of specific cancer cell targets. Our goal is to predict additional drug synergy relationships in that context. Inspired by recent work that pre-trains a GPT language model (LM) to “in-context learn” common function classes, we devise novel pre-training schemes that enable a GPT model to in-context learn “drug synergy functions”. Our model, which does not use any textual corpora, molecular fingerprints, protein interactions, or any other domain-specific knowledge, is able to achieve competitive results. We further integrate our in-context approach with a genetic algorithm to optimize model prompts and select synergy candidates to test after conducting a patient biopsy. Finally, we explore a novel task of inverse drug design which can potentially enable the design of drugs that synergize specifically to target a given patient’s “personalized dataset”. Our findings can potentially have an important impact on precision cancer medicine, and also raise intriguing questions on non-textual pre-training for LMs.
1 Introduction
Drug combination therapy is a standard practice for diseases including cancer [46] and HIV. It is based on identifying multiple single agent therapies that, when used together, lead to synergistic effects. Predicting such combinatorial synergies is challenging, especially given the wide range of mutations and genetic backgrounds typically found in different patients’ cancer cells [48]. Many drug combinations can also cause increased toxicity [95, 29] in a manner that may depend on specific patient backgrounds [52], adding further complexity to the problem. To enable the safest and most effective implementation of combination therapy in cancer care, it is thus important to personalize the prediction of drug synergies.
Since the number of possible drug combinations grows combinatorially with the number of drugs, testing large numbers of pairings in the laboratory to differentiate synergistic from antagonistic ones is very expensive. Thus, considerable interest has recently grown in using machine learning for predicting synergistic and antagonistic effects between pairs of drugs in silico [41, 56, 61]. These approaches are typically not evaluated in the few-shot setting, where only a few training examples are given. This setting is particularly relevant to the personalized setting described above, and more generally to cancer tissue types for which there is limited training data for synergy learning models. Additionally, these efforts use a variety of features to characterize the drugs, from molecular fingerprints [56] to protein interactions [90]. Obtaining these features often requires integrating external knowledge sources (e.g., from drug databases), which often restricts findings to the limited subsets of drugs for which this information is available and also requires specialized engineering in model design. Finally, it is unclear if these external sources are actually needed for current models.
In this work, we address these limitations by exploring the ability of transformer language models (LMs) to learn drug synergy relations. We devise approaches that leverage transformers (1) without any external knowledge required to be integrated into the model (i.e., no protein interaction networks or patient cell line features); (2) in the few-shot setting with an in-context learning approach that can generalize to novel unseen drugs and patient cell lines; and (3) for designing novel synergistic drug structures in the context of a specific patient’s data.
Transformer LMs are Strong Drug Synergy Learners—Even Without Textual Representations
First, we consider drug synergy prediction using transformer language models without enriching drugs/cells with information from external knowledge bases. We find these “feature-less” models are able to achieve results that are better than or competitive with knowledge-enhanced state-of-the-art drug synergy models (e.g. BERT models achieve 84.1% ROC-AUC to GraphSynergy’s 83.4%). Furthermore, in contrast to recent work that uses textual representations pre-trained on scientific corpora [49], we discover an intriguing counter-intuitive finding: using randomized (i.e. uninformative) tokens instead of drug/cell names is able to rival models that use textual names of those entities, suggesting that external information coming from pre-training on scientific corpora has negligible impact on current models in this setting. These findings motivate us to explore the power of transformer models without external information, and to study generalization beyond memorization capacity by evaluating on drugs/cells unseen during training.
SynerGPT: A New In-Context Drug Synergy Setting & Model
We take inspiration from recent work [22] that showed how a GPT model architecture can be trained to “in-context learn” function classes such as linear functions (e.g., linear regression/classification) and neural networks. We pre-train a GPT model from scratch on known drug synergies—using no textual corpora—and explore its ability to generalize in the few-shot setting to drugs and patient cell lines unseen during training. We find that our model, dubbed SynerGPT, is able to achieve strong competitive results without any external knowledge sources. In particular, we introduce a new setting of In-Context Learning for Drug Synergy (ICL-DS). In-Context Learning (ICL) [14] has emerged as a powerful paradigm for few-shot learning [8]. In ICL, trained model parameters are never explicitly updated after pre-training, and adaptation to each task is done on the fly given contextual examples. This is particularly appealing in settings where it is prohibitively costly to perform parameter updates for each incoming new task and context (e.g., for each new patient in a hospital setting). We devise novel pre-training approaches for ICL-DS, including strategies for optimizing the language model prompt selection with a genetic algorithm. Prompts comprise specific combinations of drugs tested for synergy on specific patient cell lines; optimizing prompt selection in this setting has potential implications for the design of a standardized assay panel of drugs and cells to be tested for a patient’s particular tumor. While specific patient data at this level is not readily available, we re-purpose existing drug combination data to lay the foundations for formalizing and studying our approaches from a machine learning perspective.
Designing New Molecules to be Synergistic in the Context of a Specific Patient
Finally, in our third major contribution we propose an additional new task of Inverse Synergistic Drug Structure Design (ISDSD): using a GPT transformer model for generating or retrieving drug molecules that are synergistic in the context of a specific cancer patient’s information (i.e., molecules that are synergistic with other drugs administered to a patient with specific cancer cells). This approach may in the future provide a new methodology for personalized drug candidate discovery.
2 Background and Problem Setting
In the last few decades, combination therapy has emerged as an effective method to target genetically unstable diseases [47, 39, 46], with dramatic success in treating HIV [47] and more recently HCV [39]. Unlike HIV and HCV, which encode only 10-15 proteins [20, 15], cancer is radically more complex. Since cancer has an unstable genome, combination therapy is often considered necessary [46] and is commonly used in practice, with varying degrees of success.
Generally, drugs work by affecting cellular pathways–chain interactions of molecules which lead to changes in a cell. In drug synergy prediction, our goal is to predict whether combining drugs will have positive or negative outcomes in the complex system of these interacting pathways. Generally, synergy lab experiments are conducted in cell lines, which are a population of cells from a multi-cellular organism (for example, human lung cancer cells).
In this work, we also investigate inverse design of drug molecules. Traditionally, the idea behind inverse design of molecules is to predict or retrieve a molecular structure which has some desired chemical property or protein target [64]. In our work, we seek to explore inverse design at a higher level– the “interactome” of drug interactions in complex cellular pathways.
General Problem Formulation
Given k input drugs d1, d2,…, dk ∈ D along with a cell line c ∈ C, the goal of drug synergy prediction is to predict a synergy value y for the interactions between the drugs in the given cell line. In existing datasets, only the pairwise k = 2 setting is considered. Thus, we focus our experiments on pairwise drug synergy, the most commonly researched setting, but our methods can naturally be extended to n-ary synergies. This problem can be considered as either a regression (y ∈ R) or a binary classification problem (synergistic (True) or not (False); y ∈ {0, 1}). Synergy data comes from a dataset of tuples (d1, d2, c, y) ∈ D.
Few-Shot In-Context Setting
We also consider the few-shot setting in our formulation, which has applications for predicting synergies when there is scarce training data such as in tumor-specific synergy prediction, uncommon cancer tissues, or newly introduced single-agent therapies. In the few-shot setting, we assume there are n synergy tuples available which contain an unknown entity h (unknown cell line ch or unknown drug dh). Define these tuples as xi:= (d1, d2, c, y)i for i ∈ [1..n] where one of d1, d2, or c is the unknown h. Each xi can then be used for training in addition to the existing training data. In our proposed method SynerGPT, we don’t use these tuples xi in training– rather, we use them as the prompt for in-context learning. Here, we are particularly interested in synergy prediction based on extremely small datasets (e.g. tested synergies from a patient’s specific cancer cells), which makes traditional supervised approaches less effective. In section 3.2.3, we detail our training strategies for in-context learning with unknown h from limited examples.
Inverse drug design from Drug Synergy Context
We propose a new task where the goal is to predict the structure of a molecule given a context of drug synergy tuples (e.g., we might be given 20 synergy tuples). We train a model to predict the structure of some unknown drug dh from its synergy relations with other drugs. This has two important uses. First, this may enable scientists to predict new molecules which have desirable synergies or similar synergies to existing drugs, which is a novel way to consider drug discovery. This can potentially enable the design of drugs that synergize specifically to target a given patient’s unique cancer cells. Second, this can support explainability of the synergy prediction model as a function of the context it is fed, by “visualizing” SynerGPT’s understanding of the unknown drug given the context. Figure 2 shows that we can observe the structure of the molecule evolving towards the ground truth as more context examples are given. As this is a novel and difficult problem, we frame it as a retrieval task, though it is trivial from an implementation perspective to instead predict structures using a pretrained generative model for molecules [28].
3 Methodology
In this section, we consider the four components of our paper. First, we detail how drug synergy tuples are input to encoder-only language models (§ 3.1). Next, we extend this idea to the few-shot ICL setting and propose training methodologies to do so (§ 3.2). We then discuss optimization of the “prompt” used for ICL (§ 3.2.4). Finally, we extend our methodology to inverse drug design (§ 3.3).
3.1 Input for encoder-only language models
Initially, we explore the efficacy of BERT-style language models [13, 2, 92] for drug synergy prediction. We modify the task input to be in natural language using a simple formulation: [CLS] d1 [SEP] d2 [SEP] c [SEP] where d1 and d2 are drug names (e.g., imatinib, 5-FU), and c is the name of a cell line (e.g., MCF2, Ishikawa). The model is then trained to predict the output value y from the [CLS] token representation.
We also investigate to what extent pretraining knowledge is responsible for the model’s performance. To do so, we evaluate the impact on performance when the drug and cell names are replaced with ‘random’ tokens. Given the ordered (by frequency) vocabulary V of the LM, we select the tokens {vi ∈ V | i ∈ [k..(k + |C| + |D|)]} to represent our drug and cell lines. Note we start at a threshold k to avoid the most common tokens which might have specialized representations in the language model’s latent space. We uniquely map each cell line and drug to a token in this set, which we use as input to the BERT LM. Essentially, this experiment is used to determine whether knowledge from pretraining or the transformer architecture itself is responsible for performance on the drug synergy task. An example input from this strategy is: [CLS] rabbit [SEP] fish [SEP] book [SEP].
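The name-based and random-token inputs described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in our experiments: the helper names `build_random_token_map` and `format_input`, the toy vocabulary, and the `start` argument (standing in for the threshold k) are all hypothetical, and in practice a tokenizer would normally insert the special tokens itself.

```python
def build_random_token_map(vocab, drugs, cells, start):
    """Map each drug and cell line to a unique token from vocab[start:].

    `vocab` is assumed to be ordered by decreasing frequency, and `start`
    plays the role of the threshold k that skips the most common tokens.
    """
    entities = sorted(drugs) + sorted(cells)
    if len(set(entities)) != len(entities):
        raise ValueError("drug and cell line names must be distinct")
    pool = vocab[start:start + len(entities)]
    if len(pool) < len(entities):
        raise ValueError("vocabulary too small for the requested threshold")
    return dict(zip(entities, pool))


def format_input(d1, d2, cell, token_map=None):
    """Render one query as '[CLS] d1 [SEP] d2 [SEP] c [SEP]'.

    With `token_map`, entity names are replaced by their assigned tokens.
    """
    if token_map is not None:
        d1, d2, cell = token_map[d1], token_map[d2], token_map[cell]
    return f"[CLS] {d1} [SEP] {d2} [SEP] {cell} [SEP]"


# Toy vocabulary: the first three entries stand in for very common tokens.
vocab = ["the", "of", "and", "rabbit", "fish", "book", "tree"]
token_map = build_random_token_map(vocab, ["imatinib", "5-FU"], ["Ishikawa"], start=3)
named = format_input("imatinib", "5-FU", "Ishikawa")
randomized = format_input("imatinib", "5-FU", "Ishikawa", token_map)
```

Because the mapping is fixed and one-to-one, the random-token input carries the identity of each entity but none of the lexical information in its name.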
3.2 SynerGPT: In-Context Learning for Few-Shot Synergy Prediction
3.2.1 In-Context Learning for Function Classes: Background
Recent work trained transformer models to “in-context learn” function classes [22]. A function class is a set of functions that satisfy specific properties, such as linear functions or neural networks. In-context learning of a function class F is defined as being able to approximate f (xquery) for “most” functions f ∈ F given a new query xquery when conditioned on a prompt sequence (x1, f (x1),…, xn, f (xn), xquery). We define a prompt prefix Pn = (x1, f (x1),…, xn, f (xn), xn+1) as the first n in-context examples followed by the (n + 1)-th input. A model Mθ parameterized by θ is trained to minimize the loss averaged over all prefixes given some appropriate loss function ℓ. Weights wn := 1 unless otherwise noted.
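For reference, the weighted objective described in this paragraph can be written as below. This is a sketch reconstructed from the definitions above and the formulation in [22], so the normalization constant and the index range (k + 1 prefixes, indexed from 0) are assumptions rather than a verbatim statement of the objective.

```latex
\min_{\theta} \; \mathbb{E}_{P}\left[ \frac{1}{k+1} \sum_{n=0}^{k} w_n \, \ell\big( M_\theta(P_n),\; f(x_{n+1}) \big) \right]
```

Here P_n is the prompt prefix defined above, w_n is the weight on the n-th prefix, and ℓ is the loss function. With w_n := 1, every prefix length contributes equally to the training signal.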
3.2.2 Predicting Drug Synergy In-Context
For in-context prediction of drug synergy, we redefine Pn = (d1^1, d2^1, c^1, y^1, …, d1^n, d2^n, c^n, y^n, d1^(n+1), d2^(n+1), c^(n+1)) as the prompt prefix (as discussed in Section 2, we refer to this as the “context” or “input context”). Here, y can be considered the output of a function measuring synergy on (d1, d2, c). As in [22], we consider a GPT-2 family [57] decoder-only language model, which we call SynerGPT. Here, the prediction of the synergy value y^j is made using a linear transformation of the contextualized output representation of c^j (note that this includes d1^j and d2^j due to self-attention). Model inputs (drugs d, cell lines c, and labels y) are initialized using a learnable embedding layer (i.e. no external features). To evaluate the model’s ability to predict synergies of unknown drugs or cells, we hold out either m drugs or m cells and remove their synergy relations from the training set (see Section 4). We use a subset of the held-out tuples as a pool of context examples. We now turn to the question of how to select the context (prompt prefix) from this pool in a manner that increases predictive performance.
3.2.3 How to sample the context?
A central question about using language models without external features, including textual names, is how to teach the model to understand unknown drugs or cell lines. We propose using a masking strategy: every unknown drug dh or cell ch is represented by [UNKNOWN] and the model must use in-context learning to understand it based on contextually-related known drugs and cell lines. In this setting, we assume that we are given a set of synergy tuples to sample from to construct a prompt. During training, it’s simply the training set. During evaluation, we consider a special held-out “context” set Dc ⊂ D (thus named because we sample the context/prompt Pn from this set). To sample from this context set, we propose a context-selection strategy based on constructing a graph G on this Dc. Specifically, we construct G by creating a node for every synergy tuple x := (d1, d2, c, y) ∈ Dc. We construct a drug edge ed between two nodes x1 and x2 if they share drug d (i.e. d ∈ x1 ∧ d ∈ x2). Similarly, we construct a cell line edge ec if they share cell line c. See Figure 1 for an example and Appendix Figure 8 for more details. We employ the following context selection strategies to sample a context with n examples given some node x containing unknown h which is either drug dh or cell ch:
Random: Uniformly select n context examples from Dc.
Graph: Uniformly select examples from the nodes adjacent to x in G.
Unknown-First: Uniformly select nodes adjacent to x which share an edge of type eh, i.e. prioritizing selection of nodes that contain the masked unknown h.
Note that these strategies are hierarchical: Unknown-First falls back to Graph when there aren’t enough examples, and Graph in turn falls back to Random. Examples from Random are put earlier in the context than those from Graph, which are again put before those from Unknown-First. In order to train the model to correctly use the [UNKNOWN] token, we need to artificially create unknown drugs or cells during training. Given training example x, we uniformly select d1 ∈ x or d2 ∈ x to be the hidden drug dh. For the unknown cell line setting, c ∈ x is always set to ch because there is just one cell line per example. We replace all occurrences of h in the prompt with [UNKNOWN].
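The hierarchical sampling and masking described above can be sketched as follows. This is a minimal illustration, assuming tuples are stored as `(d1, d2, c, y)` and that drug and cell line names never collide; the function names are hypothetical, and adjacency in G is tested directly on the tuples instead of building the graph explicitly.

```python
import random

UNKNOWN = "[UNKNOWN]"


def shares_entity(a, b):
    """True if two tuples share a drug or a cell line (i.e. are adjacent in G)."""
    return bool({a[0], a[1]} & {b[0], b[1]}) or a[2] == b[2]


def contains(t, h):
    """True if the unknown entity h appears as a drug or cell line in t."""
    return h in t[:3]


def sample_context(query, h, bank, n, rng):
    """Unknown-First sampling that falls back to Graph, then to Random."""
    candidates = [t for t in bank if t != query]
    unknown_first = [t for t in candidates if contains(t, h)]
    graph = [t for t in candidates
             if shares_entity(t, query) and not contains(t, h)]
    rest = [t for t in candidates if not shares_entity(t, query)]

    picked_u = rng.sample(unknown_first, min(n, len(unknown_first)))
    picked_g = rng.sample(graph, min(n - len(picked_u), len(graph)))
    remaining = n - len(picked_u) - len(picked_g)
    picked_r = rng.sample(rest, min(remaining, len(rest)))
    # Random examples come first, then Graph, then Unknown-First.
    return picked_r + picked_g + picked_u


def mask(t, h):
    """Replace every occurrence of h among the entities of t with [UNKNOWN]."""
    return tuple(UNKNOWN if e == h else e for e in t[:3]) + (t[3],)


bank = [("A", "B", "c1", 1), ("A", "C", "c2", 0), ("D", "E", "c1", 1),
        ("F", "G", "c3", 0), ("A", "H", "c4", 1)]
query = ("A", "Z", "c1", 1)
context = sample_context(query, "A", bank, 5, random.Random(0))
masked_context = [mask(t, "A") for t in context]
```

Placing the examples that contain the unknown entity last keeps them closest to the query position in the prompt.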
3.2.4 Optimizing the Context
We further study whether the context can be optimized to best enable predictions for some unknown drug or cell line h (see Figure 7 for an example). The purpose of these experiments is to enable the eventual development of a standardized assay for drug synergy prediction. Thus, as output, these optimization algorithms produce a set of context tuples for each h. To do this optimization, we assume that we have four splits of data, which are constructed as follows. Given a set of p “unknown” drugs/cells H, all synergy tuples not containing any h ∈ H are put into a training set DTr. The remaining tuples are randomly partitioned into three equal sized sets: a context bank Dc, a validation set Dv, and a test set DTe. We first train a model on DTr following the Unknown-First strategy (where contexts are sampled from DTr itself). Following this, for each unknown entity hi, we select n context examples from Dc which maximize the model’s score on the validation set Dv. This is a combinatorial optimization problem which can be considered related to the best subset selection problem [3, 44]. We consider a genetic algorithm [21]: a metaheuristic method which is useful for black box optimization of systems containing complex interacting parts [45], which is suitable for the complex interactions between cellular pathways required for drug synergy prediction. As output, we get a set of context tuples for each h. Optimization algorithm details are given in Appendix B.
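To make the combinatorial search concrete, the following is a minimal hand-rolled genetic algorithm over fixed-size subsets of the context bank. It is a sketch under stated assumptions and is not the PyGAD configuration used in our experiments: `score_fn` is a hypothetical stand-in for the model’s validation ROC-AUC given a candidate context, and the crossover, mutation, and elitism rules are illustrative choices.

```python
import random


def genetic_select(bank_size, n, score_fn, pop_size=8, generations=50,
                   mutation_rate=0.2, seed=0):
    """Search for n context-bank indices that maximize score_fn.

    Each individual is a list of n distinct indices into the context bank.
    """
    rng = random.Random(seed)
    population = [rng.sample(range(bank_size), n) for _ in range(pop_size)]
    best = max(population, key=score_fn)
    for _ in range(generations):
        # Keep the better half of the population as parents (elitism).
        ranked = sorted(population, key=score_fn, reverse=True)
        parents = ranked[: pop_size // 2]
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            # Crossover: draw n distinct indices from the union of both parents.
            child = rng.sample(sorted(set(a) | set(b)), n)
            # Mutation: swap one index for one not yet in the child.
            if rng.random() < mutation_rate:
                unused = [i for i in range(bank_size) if i not in child]
                if unused:
                    child[rng.randrange(n)] = rng.choice(unused)
            children.append(child)
        population = parents + children
        best = max(population + [best], key=score_fn)
    return sorted(best)


# Toy black-box objective: prefer contexts made of high-index examples.
selected = genetic_select(bank_size=30, n=5, score_fn=sum)
```

In the real setting each call to `score_fn` requires a forward pass over the validation set, so the population size and number of generations directly bound the evaluation budget.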
3.3 In-Context Learning for Inverse Design
To train the model to retrieve relevant drug structures in-context, we use the same architecture as for synergy prediction (§ 3.2.2), so that we can use the same data split and optimized contexts from Section 3.2.4 to understand how the model interprets them. For effective retrieval, we need a strong base molecular representation that makes it possible to effectively distinguish molecules. We therefore choose MegaMolBARTv2 [50] representations, which were trained on 1.45 billion molecular SMILES strings and thus have a relatively comprehensive (in terms of drug classes) latent space. We train a SynerGPT model from scratch to predict representations using a linear transformation on the output [UNKNOWN] representation. We use this final representation to retrieve the desired drug using cosine similarity with the MegaMolBARTv2 representations of the drugs in our synergy dataset. The training context is selected using the Unknown-First strategy. Finally, we train the model using a minibatch contrastive loss [58, 16] between the L2-normalized ground truth representations Dg (here MegaMolBARTv2) and predicted representations Dp (output from our model’s prediction head). In this loss, CE is categorical cross-entropy, b is the mini-batch size, Ib is the b × b identity matrix, and τ is a learnable temperature parameter. We use this loss for ℓ in Equation 1.
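The minibatch contrastive loss can be illustrated with a short NumPy sketch. The exact form is an assumption based on the symmetric formulation common to [58, 16]: the similarity matrix between predicted and ground truth representations is scaled by exp(τ), and cross-entropy against the identity targets Ib is applied in both directions. The function names are hypothetical, and τ is treated here as a fixed number instead of a learned parameter.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row of x to unit L2 norm."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def cross_entropy_with_identity(logits):
    """Mean cross-entropy where the target class of row i is column i."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))


def contrastive_loss(pred, truth, tau):
    """Assumed symmetric loss: CE(e^tau Dp Dg^T, I_b) + CE(e^tau Dg Dp^T, I_b)."""
    d_p, d_g = l2_normalize(pred), l2_normalize(truth)
    logits = np.exp(tau) * (d_p @ d_g.T)
    return cross_entropy_with_identity(logits) + cross_entropy_with_identity(logits.T)


# Toy batch of b = 4 drugs with 4-dimensional representations.
truth = np.eye(4)
aligned = contrastive_loss(truth, truth, tau=2.0)
shuffled = contrastive_loss(truth[::-1], truth, tau=2.0)
```

The loss is low when each predicted representation is most similar to its own ground truth row, and high when predictions are matched to the wrong drugs, as in the reversed batch above.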
4 Results
4.1 BERT can do Drug Synergy?
In this section, we experiment with finetuning BERT on drug synergy data where all drugs and cell lines are seen during training (data splits detailed in Appendix A.1). As discussed earlier, there has been recent work using external network datasets capturing interactions between drugs, proteins and cell lines [90] for synergy prediction. To evaluate the impact of these external datasets, we compare against a strong and recent model, GraphSynergy [90], that uses over a dozen different network datasets and achieves state-of-the-art on its subset of DrugCombDB.
We train four BERT-based [13] language models [2, 92] and find that they outperform GraphSynergy in both name and random token settings. BioLinkBERT with random tokens, for example, achieves a ROC-AUC score of 84.1% compared to GraphSynergy’s 83.4% (p < 0.05 using paired t-test). In comparison, BioLinkBERT with drug names as input achieves 83.6%. We checked multiple BERT configurations, and details on other BERT models are shown in Appendix A.1 Table 4.
A natural question here is whether the model has learnt the required knowledge during pre-training. Surprisingly, replacing drug and cell names with random tokens (§ 3.1) resulted in no drop in performance. This suggests that the transformer architecture may be the dominant factor explaining BERT’s performance on the task. However, if we use a randomly-initialized BERT model without any pre-training, we find the performance is worse (by 3 ROC-AUC pts). We conjecture this may be related to the observation that pre-training on a nonsense corpus [35] can provide good weight initializations for downstream tasks.
To verify our findings, we consider the ChemicalX framework [62], which implements several baselines and provides a standardized subset of DrugCombDB [41] with drug and cell line features. This standardization allows us to compare different baseline methodologies on the same dataset.
The ChemicalX DrugCombDB dataset has 2,956 drugs, 112 cell lines, and 191,391 synergy tuples. We compare against baselines DeepSynergy [56], MR-GNN [88], SSI-DDI [51], and DeepDDS [80], which we train using default hyperparameters from the original papers for 50 epochs as in [62]. These baselines (details in Appendix A.2) represent the most popular approaches to drug synergy prediction and allow us to compare to the performance of transformer architectures. Remarkably, SciBERT with random tokens outperforms all baselines except DeepDDS in this setting as shown in Table 1.
4.2 In-Context Learning for Few-Shot Drug Synergy
We now evaluate models on the few-shot and zero-shot setting, i.e., when a new drug or cell line is introduced with limited or no interaction data. We use the same architecture used in Garg et al. [22]: a GPT-2 [57] model with 256-dimensional embeddings, 12 layers, 4 attention heads, and batch size of 64. We use a learning rate of 2e-5. Model weights are initialized from scratch. To enable efficient experimentation in the few-shot setting, we construct a dataset split which contains multiple unknowns (i.e. m held-out drugs or cells: H := {hi | i ∈ [1..m]}). To construct our split, we remove all synergy tuples containing h ∈ H from the dataset D so that the remaining dataset only contains tuples with known drugs/cells (this is our training set DTr). Then, for each h, we select n synergy tuples randomly to form the “context” bank/split Dc. All other “unknown” synergy tuples are put into DTe.
For comparison, we use the same baselines trained in zero-shot and few-shot settings. We also test SetFit [74] (a few-shot LM approach), k-nearest neighbors, off-the-shelf pre-trained GPT-2 (using entity names as input, similar to CancerGPT [37]), and MAML with DeepDDS (details in Appendix A.2). In the few-shot setting, the context bank Dc is considered part of the training set, and in the zero-shot setting it is not used. Our model, SynerGPT, however, is not trained on the context bank but uses it as context (prompt) examples for evaluation. Examples are selected using the Random, Graph, or Unknown-First strategies. We separately investigate the setting where drugs are unknown and where cell lines are unknown.
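The split construction described above can be sketched in a few lines. This is an illustration only, assuming tuples of the form `(d1, d2, c, y)`; the function name `split_unknown` is hypothetical, and the handling of tuples that contain two held-out entities (assigned to the first entity that samples them) is an assumption not specified in the text.

```python
import random


def split_unknown(dataset, unknowns, n, seed=0):
    """Split synergy tuples into (train, context bank, test).

    Tuples with no held-out entity go to train. For each held-out entity h,
    up to n tuples containing h are sampled into the context bank, and all
    remaining held-out tuples go to test.
    """
    rng = random.Random(seed)
    unknowns = set(unknowns)
    train = [t for t in dataset if not unknowns & set(t[:3])]
    held_out = [t for t in dataset if unknowns & set(t[:3])]

    context, used = [], set()
    for h in sorted(unknowns):
        pool = [i for i, t in enumerate(held_out)
                if h in t[:3] and i not in used]
        chosen = rng.sample(pool, min(n, len(pool)))
        used.update(chosen)
        context.extend(held_out[i] for i in chosen)

    test = [t for i, t in enumerate(held_out) if i not in used]
    return train, context, test


dataset = [("A", "B", "c1", 1), ("X", "A", "c1", 0), ("X", "B", "c2", 1),
           ("B", "X", "c1", 1), ("A", "C", "c2", 0)]
train, context, test = split_unknown(dataset, unknowns={"X"}, n=2)
```

Because every tuple containing a held-out entity is removed from the training set, the model never sees that entity before it appears, masked, in the evaluation prompt.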
Unknown Drugs
To construct the dataset split, we set m = 50 unknown, i.e., “held-out” drugs and context n = 20 synergy tuples. Hence, our context bank contains 50 × 20 = 1,000 tuples. Overall, we find that our SynerGPT can perform better in the few-shot setting than existing baselines on this task, as shown in Table 2. Full results are in Appendix Table 6. SynerGPT is trained in the zero-shot setting, which means it can be evaluated both with context examples (few-shot) and without any examples (zero-shot). Each strategy performs roughly the same zero-shot (although since strategies are used in training there are small differences), but the performance with sampled context examples differs considerably. Without examples, SynerGPT performs worse than DeepDDS few-shot, but the same SynerGPT model outperforms DeepDDS when given the few-shot context. Overall, we outperform all prior models in the few-shot setting and zero-shot setting. In particular, Unknown-First is able to increase performance by 3.8% absolute ROC-AUC with context, whereas DeepDDS only increases 1.3% from zero- to few-shot. Our approach is able to leverage the few given examples more effectively, as shown by this higher increase in ROC-AUC. It is also notable that Unknown-First outperforms Graph, since the context contains more examples with the unknown drug which the model is able to utilize to produce better predictions.
For example, the tuple (Vismodegib, Mithramycin A, NCI-H226) with unknown Vismodegib is True. Without examples, this is predicted as 0.46. For Graph with examples, it is predicted as 0.65–closer to the ground truth. For Unknown-First, the prediction further increases to 0.79. In this example, Graph only sees 15 examples containing the unknown but Unknown-First sees a full 20. Few-shot DeepDDS predicts 0.47 for this example, which is quite similar to our method without examples. As another example, (Chlorambucil, Cylocide, SK-OV-3) consists of two unknown drugs and has label False. Without examples, it is predicted as 0.62. Graph improves this to 0.35 and Unknown-First improves to 0.23. Interestingly, few-shot DeepDDS exhibits high uncertainty and predicts 0.50.
Unknown Cell Lines
Since there are only 112 cell lines, we set m = 20 as unknown and use n = 10 context examples. Interestingly, we find that models perform worse with context examples. We believe this is caused by the relatively small number of patient cell lines in the data vs. 2,956 drugs, making it harder to learn higher-level types of drug-cell line interaction. In other words, we are trying to learn a complex function class (drug synergy in an unknown cell line) without a significant number of example functions f ∈ F . To alleviate this issue, we use 6 layers, batch size of 128, and only 30 epochs. Nonetheless, the issue still exists–performance decreases for baselines DeepDDS and MR-GNN and our strategies Unknown-First and Graph. We experiment with interpolating between training initially with Random to Unknown-First at the end (see Appendix A.2.2), which helps in the unknown cell line case. We believe this creates an exploration-exploitation effect.
4.3 Context Optimization
As we have shown in the previous section that the context selection strategy is very important for SynerGPT performance, the natural next question is to what extent the context can affect model performance. To test this, we construct a different split. Like before, we select 50 unknown drugs and 20 cell lines; with their respective tuples, we create three uniform splits: context, validation, and test. We train a SynerGPT Unknown-First model using hyperparameters as in our above experiments.
In context optimization, our goal is to select examples from the context and train splits which maximize some metric on the validation split. For our experiments, we maximize ROC-AUC for our trained model using the validation set. Overall, we consider two strategies: Unknown-First, and a genetic algorithm. For the genetic algorithm, we use the implementation and hyperparameters from PyGAD [21] with a population of 8 for 50 epochs. Here, we consider each example in the context split to be a potential gene. For comparison, we also select the context at random according to the Unknown-First strategy. To ensure comparability, we evaluate Unknown-First the same number of times as the genetic algorithm and select the best context. Our results (Table 3) show that the genetic algorithm optimizes the context from a starting average AUC of 79.2% up to 81.5% for unknown drugs and from 85.2% to 86.1% for unknown cells. Appendix C visualizes this and shows error bars.
We further analyze the results by different tissue types (Appendix E). For example, we find that for unknown drugs, synergy prediction in ovarian cancer is effective, but for both unknown drugs and cell lines predictive performance on bone cell lines is low.
4.4 Inverse drug design from drug synergy examples
Next we evaluate SynerGPT’s ability to retrieve the structure of an unknown drug. We use the same splits as before but replace the classification head with a vector output head trained using the loss in Equation 2. Using the same splits allows us to visualize the genetic algorithm results. Experimentally, we find that we achieve the best performance with the weight value from Equation 1 set to wi := i/k. Two examples of the model retrieving drugs which match the context synergy pairs are shown in Figures 2 and 5. These examples show the retrieved drug after i context examples have been observed by the model. Additionally, we show overall retrieval performance as the number of context examples shown to the model increases in Appendix D Figure 4. For the weighted strategy, mean rank for seen drugs decreases from ∼1,500 to ∼400 as context increases. Qualitatively, we find that we are able to retrieve the relevant drug or one with similar structure from synergy relationships in multiple cases. This is considerably more effective for drugs observed during training, but the performance is also better than random for unknown drugs. This ability to visualize the model’s understanding is useful in two ways. First, it helps explain what the model predicts from observing a given context. Second, it is useful for retrieving drugs which have a desired set of synergies, which can help inform drug candidate discovery. Future research can extend this task to generation rather than retrieval.
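The retrieval metric and the prefix weighting can be illustrated as follows. This is a sketch under stated assumptions: the library matrix stands in for the MegaMolBARTv2 representations of the drugs in the dataset, the function names are hypothetical, ties are broken in favor of the target, and the index range i = 0..k for the weights is an assumption.

```python
import numpy as np


def retrieval_rank(predicted, library, target_index):
    """1-based rank of the target drug under cosine similarity.

    `library` holds one representation per row and `predicted` is the vector
    output by the model for the unknown drug.
    """
    lib = library / np.linalg.norm(library, axis=1, keepdims=True)
    query = predicted / np.linalg.norm(predicted)
    sims = lib @ query
    # Rank = 1 + number of drugs scoring strictly higher than the target.
    return 1 + int(np.sum(sims > sims[target_index]))


def prefix_weights(k):
    """Weights w_i = i / k for prefixes i = 0..k (index range assumed)."""
    return np.arange(k + 1) / k


library = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
predicted = np.array([0.9, 0.1])
ranks = [retrieval_rank(predicted, library, i) for i in range(3)]
weights = prefix_weights(4)
```

Averaging `retrieval_rank` over held-out drugs at each context length gives a mean-rank curve of the kind reported above. The linearly increasing weights put more of the training signal on prefixes that have already seen many context examples.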
5 Related Work
5.1 Molecular Language Models
In recent years, advances in machine learning and NLP have been applied to molecule representations. Several efforts [18, 11, 77, 66, 50, 75] show excellent results training on string representations of molecules [81, 82, 34, 10]. Interest has also grown in multi-modal models [17, 96] and multi-encoder models [16, 76, 85, 71, 42, 67, 87, 97] with applications to chemistry and biology. Existing work [17, 71, 86, 12] also builds on this to “translate” between these modalities, such as MolT5 [17], which translates between molecules and language.
5.2 In-Context Learning
With the success of models such as GPT-3 [8] and GPT-4 [54], interest has grown in the theoretical properties of in-context learning. [22], which we follow in this work, investigates the ability of transformers to learn function classes. [53] investigates whether in-context learning is related to specific “induction heads”. [79] shows that transformers do in-context learning by gradient descent. [38] frames in-context learning as algorithm learning to investigate generalization on unseen tasks.
5.3 Language Models for Chemistry and Knowledge Graph Completion
Very recently, considerable interest has grown in using language models, particularly GPT-4 [54], for uncovering chemical knowledge and molecular discovery [24, 83, 7, 5, 84, 9], including work in the few-shot setting [59, 27]. CancerGPT [37], a related contemporaneous preprint, was recently released which explores a similar few-shot approach to drug-drug synergy prediction. It explores training literature-aware text-based GPT models on drug synergy data. The use of GPT models pretrained on massive textual corpora from the web also makes rigorous evaluation and comparison difficult. We believe our work is complementary, since we largely explore the transformer architecture without language and we consider in-context learning which they do not. We also consider extensions such as inverse design and context optimization. Due to the recency of [37], we leave additional comparisons beyond our real GPT2 baseline to future work. Applying language models to knowledge graphs has been investigated in the general [91, 30, 93] and scientific domains [49, 63]. They can be considered similar to our tests of BERT language models applied to a drug synergy hypergraph (§ 4.1).
5.4 Drug Synergy Prediction
As discussed above, there are several approaches [56, 88, 51, 80, 36, 72, 62] which can predict synergy scores given cell line and drug features. There has also been interest in learning representations for these settings [65]. Recently, work [90, 61, 40] has begun to incorporate additional data sources such as drug-protein interactions. This can help improve results, but it often requires creating a subset of the original synergy dataset which can bias results towards the proposed method. [89] extracts additional training data from the literature to improve synergy prediction results, which may relate to our results in Appendix F. Research also investigates the application of few-shot [43] and zero-shot [26] machine learning to drug response prediction–we extend this idea to drug synergy prediction.
6 Conclusions and Future Work
As demonstrated by HIV, HCV, and now cancer, combination therapy is a critical option for disease treatment. Yet, difficulties arise with regard to understanding drug-drug interactions and patient-specific genetic differences. To tackle this, we show that encoder-only language models are effective for drug synergy prediction. We then build on these results by proposing SynerGPT, a decoder model with a novel training strategy for in-context learning which can produce strong results for few-shot drug synergy prediction. We additionally show that the model context can be optimized using non-linear black-box approaches, which has exciting implications for the design of a standardized drug synergy testing panel for creating patient-specific synergy datasets. Finally, we explore a novel task of inverse design using desired drug synergy tuples. Performance on this challenging task is low for unknown drugs; nonetheless, it shows promise for future work that may enable personalized drug discovery.
Limitations
While we are able to achieve strong performance without additional cellular or drug data, our approach is very much a black box akin to most deep learning methods. Future work will still likely want to integrate external database features. However, they will likely need to be integrated in a more thoughtful manner in order to ensure an actual benefit. It would also likely be interesting for future work to investigate the internal connections language models are learning and what they might mean for understanding the fundamental biology of how cellular pathways interact. It is also worth noting that designing molecules using drug synergy tuples is a somewhat atypical task, so there may exist a wall in terms of the information content inherent in the context. While we analyze model performance separately for different tissue types in this work (as done in multiple prior studies), we note that for future research it is likely too limiting and simplistic to separate cell lines into tissue types.
A Full Results Tables
A.1 GraphSynergy Full Results
Full results for the BERT input method and GraphSynergy tests are in Table 4. We compare on the specific subset of DrugCombDB [41] which was selected to match GraphSynergy’s network data (i.e. selecting the subset of DrugCombDB with drugs/cells that can be matched with external protein-protein interaction, drug-protein association, and cell-protein association networks) and a 7:1:2 train:validation:test split. This data subset also contains useful surface names (the common natural language name of the drug; e.g. dasatinib), which allows us to compare the effect that drug names have on language model synergy prediction performance.
We consider three BERT training variations: the original BERT [13], SciBERT [2], and BioLinkBERT [92]. SciBERT was trained on a corpus of scientific documents which would be considerably more focused on drugs than a general corpus. BioLinkBERT is a biomedical BERT model additionally trained using document relation prediction (e.g. citation links).
ChemicalX Results
We report full results on the subset of DrugCombDB [41] used by ChemicalX [62] in Table 5.
A.2 Few-Shot Full Results
A.2.1 Baseline Descriptions
DeepSynergy is a popular feedforward model which uses cell line features and drug fingerprints. MR-GNN is a graph convolutional network (GCN) [32] fed into an LSTM [23] which takes the drug structure into account. SSI-DDI uses a graph attention network (GAT) [78] with a final co-attention layer. DeepDDS uses both a GAT and GCN, which are fed into a fully connected feed forward network.
Real GPT-2 We train a GPT-2 model³ in the few-shot setting (as opposed to SynerGPT’s zero-shot) using random context and the same hyperparameters to mimic SynerGPT’s training settings as much as possible. We use names of the drugs obtained from linking to PubChem [31] as input in the form “Are drugs [DRUG1] and [DRUG2] synergistic in cell line [CELL]?”.
SetFit Furthermore, we test finetuning a few-shot language-model baseline, SetFit [74], on our few-shot data. We follow the original paper in using batch size 16, R = 20 text pairs generated for contrastive learning, and 1 epoch. Inputs to the model follow the same format as BERT in Section 3.1. We test using four models.
SetFit-SBERT: paraphrase-multilingual-mpnet-base-v2 from [60] with names as input. This model was trained to create semantic embeddings via Siamese networks.
SetFit-C: recobo/chemical-bert-uncased-simcse from Recobo.ai⁴ with names as input. This model was trained using SimCSE on chemistry text.
SetFit-S2: allenai/specter2 from [68] with names as input. This model was trained on multiple scientific classification and regression tasks, such as MeSH descriptors classification.
SetFit-SMILES: DeepChem/ChemBERTa-77M-MTR from [1] with SMILES strings as input. This model was pretrained by predicting 200 molecular properties for a molecule given its SMILES string.
Model-Agnostic Meta-Learning We also consider a meta-learning formulation of our problem setting. We use MAML [19] to train a DeepDDS model. Since MAML⁵ does few-shot classification using episodes sampled from different learning tasks, we reframe our problem to match this. We consider predicting synergy for each drug to be a task. Then, we sample an episode for training from a random task for each mini-batch. We aggregate rare drugs without enough samples to form an episode into the same task until there are enough samples for an episode. Additionally, since we are dealing with binary classification here, we use N = 2-way. As in SynerGPT, we sample the “validation” portion of each episode from our training set. We use the same context bank (and context size) for “adaptation” during evaluation. The same learning rate (1e-3), batch size (512), and number of steps/epochs as DeepDDS are used. We report few-shot (first-order) and zero-shot (no adaptation) versions. Overall, we find that the MAML training procedure produces poor results, and adaptation produces insignificant performance increases. We attribute this to the episode-based sampling strategy neglecting important information in training.
Protonets As another meta-learning baseline, we consider Protonets [69]. We use the same meta-learning framework as for MAML. Because we don’t have drug task meta-data, we only consider the few-shot setting. We find that the Protonet’s.
k-Nearest Neighbors We also consider a k-Nearest Neighbors baseline using scikit-learn [55] similar to [49]. We construct embeddings for each synergy pair by concatenating (Drug1, Drug2, Cell) embeddings. In the training set, we also include (Drug2, Drug1, Cell). We consider two embedding sources. For the first, kNN-Features, we consider the drug and cell fingerprint features from ChemicalX. For the second, kNN-S2, we use name embeddings from the Specter2 model. We report both zero-shot and few-shot versions. In the few-shot setting, the context bank is added to the training data. We set k equal to the context number (20 and 10 for drugs and cell lines, respectively). We find performance on cell lines to be surprisingly effective, although still less than SynerGPT.
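The embedding construction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses a small pure-Python majority-vote classifier as a stand-in for the scikit-learn model, and the names `pair_embedding`, `build_training_set`, `knn_predict`, and the toy embedding dictionaries are all hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Vector = List[float]
SynergyTuple = Tuple[str, str, str, int]  # (drug1, drug2, cell, label)


def pair_embedding(d1: str, d2: str, cell: str,
                   drug_emb: Dict[str, Vector],
                   cell_emb: Dict[str, Vector]) -> Vector:
    """Concatenate the (Drug1, Drug2, Cell) embeddings into one vector."""
    return drug_emb[d1] + drug_emb[d2] + cell_emb[cell]


def build_training_set(tuples: Sequence[SynergyTuple],
                       drug_emb: Dict[str, Vector],
                       cell_emb: Dict[str, Vector]):
    """Embed every training tuple in both drug orders."""
    X, y = [], []
    for d1, d2, cell, label in tuples:
        # Also include (Drug2, Drug1, Cell), since drug order is arbitrary.
        for a, b in ((d1, d2), (d2, d1)):
            X.append(pair_embedding(a, b, cell, drug_emb, cell_emb))
            y.append(label)
    return X, y


def knn_predict(X: Sequence[Vector], y: Sequence[int],
                query: Vector, k: int) -> int:
    """Majority vote over the k nearest training points (Euclidean distance)."""
    nearest = sorted(range(len(X)), key=lambda i: math.dist(X[i], query))[:k]
    votes = Counter(y[i] for i in nearest)
    return votes.most_common(1)[0][0]
```

In the few-shot variant, the context-bank tuples would simply be appended to `tuples` before calling `build_training_set`, with `k` set to the context size.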
A.2.2 Interpolate Details
In the Unknown cell line setting, we observe that Random has an interesting effect where it performs better after examples (although still worse than Unknown-First (no-ex)), so we consider a fourth strategy: interpolating between Random and Unknown-First. Essentially, for each data mini-batch in epoch e of E total epochs, we select the Random strategy with probability max(0.25, 1 − e/E); otherwise we use the Unknown-First strategy. This is analogous to an exploration-exploitation approach where we are pretraining with Random and transitioning to Unknown-First. We use a threshold of 25% to ensure the benefits of Random are kept until the end of training. We find that this interpolation strategy is effective (with p < 0.05, see Table 6) in dealing with the unknown cell line case.
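A minimal sketch of this per-mini-batch choice is shown below. It assumes the schedule normalizes the epoch index by the total number of epochs, i.e. the Random probability is max(0.25, 1 − e/E); the function names and the strategy labels are hypothetical.

```python
import random


def random_strategy_probability(epoch: int, total_epochs: int,
                                floor: float = 0.25) -> float:
    """Probability of using the Random strategy in a given epoch.

    Assumed schedule: linear decay from 1 toward 0, clipped at `floor`
    so that some Random sampling remains until the end of training.
    """
    return max(floor, 1.0 - epoch / total_epochs)


def choose_strategy(epoch: int, total_epochs: int, rng: random.Random) -> str:
    """Pick the context-sampling strategy for one mini-batch."""
    if rng.random() < random_strategy_probability(epoch, total_epochs):
        return "random"
    return "unknown_first"
```

Under this assumed schedule, training starts with Random contexts only and the 25% floor takes over for the last quarter of the epochs.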
A.2.3 In-Context Implementation Details
In the unknown drug setting, to allow for tuples with multiple unknown drugs, we use both a [UNKNOWN] and [UNKNOWN2] token (e.g. a tuple containing two unknown drugs would be ([UNKNOWN], [UNKNOWN2], c)).
For the inverse design experiments, in some cases, context examples do not contain the unknown entity h and therefore no [UNKNOWN] tokens, so we use the zero vector $\vec{0}$ as a replacement for the ground truth representation when calculating our loss function. We use the same model, splits, and training hyperparameters as in the context optimization setting.
B Context Optimization
B.1 Genetic Algorithm
In the case of the genetic algorithm, each context bank synergy tuple x ∈ Dc is considered as a gene which can be selected by the algorithm. Given p “unknown” drugs or cell lines, each has n slots for context examples in its prompt, which makes for np total genes. We also enforce that each x contains the relevant unknown drug dh or cell line ch. We disallow each context example from being selected multiple times; the reasons for this are twofold. First, in early experiments we found that if we use the same example for the entire context (e.g. 20 repeats of x), then the model performs poorly. This is likely because the model is not trained on duplicate input, so it is trying to make meaningless connections between the same x. Second, repeating x in the context provides no new information to the model. Although we enforce this constraint, in practice without it the model will likely do the same thing on its own.
For the genetic algorithm in context optimization, we use a population of 8 for 50 epochs. We use steady-state parent selection with 4 parents, single-point cross-over, 10% gene mutation, and elitism. Each example in the context bank is considered a gene and we disallow repeated genes. This results in 351 evaluations on the validation set.
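The configuration above can be illustrated with a small self-contained genetic algorithm. This is a sketch under stated assumptions rather than the implementation used in the paper: "steady-state selection" is interpreted as keeping the 4 fittest individuals as parents, the `fitness` callable is a placeholder for validation ROC-AUC, and `repair` and `optimize_context` are hypothetical names. It assumes at least 2 context slots and a context bank no smaller than the number of slots.

```python
import random
from typing import Callable, List

Chromosome = List[int]  # indices into the context bank, no repeats


def repair(child: Chromosome, bank_size: int, rng: random.Random) -> Chromosome:
    """Replace duplicate genes with unused context-bank indices."""
    present = set(child)
    unused = [g for g in range(bank_size) if g not in present]
    rng.shuffle(unused)
    seen, fixed = set(), []
    for g in child:
        if g in seen:
            g = unused.pop()  # swap the repeat for a fresh example
        seen.add(g)
        fixed.append(g)
    return fixed


def optimize_context(fitness: Callable[[Chromosome], float],
                     bank_size: int, n_slots: int,
                     pop_size: int = 8, generations: int = 50,
                     n_parents: int = 4, mutation_rate: float = 0.10,
                     seed: int = 0) -> Chromosome:
    rng = random.Random(seed)
    population = [rng.sample(range(bank_size), n_slots) for _ in range(pop_size)]
    for _ in range(generations):
        # Keep the fittest individuals as parents (assumed reading of
        # steady-state selection); retaining them also gives elitism.
        parents = sorted(population, key=fitness, reverse=True)[:n_parents]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_slots)  # single-point crossover
            child = a[:cut] + b[cut:]
            # Mutate each gene independently with probability mutation_rate.
            child = [rng.randrange(bank_size) if rng.random() < mutation_rate else g
                     for g in child]
            children.append(repair(child, bank_size, rng))
        population = parents + children
    return max(population, key=fitness)
```

Because the parents survive unchanged into the next generation, the best fitness never decreases. A real run would cache fitness values, since each evaluation requires a pass over the validation set.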
B.2 Error Reduction for Context Optimization
Using the Unknown-First strategy, we sample a context for some heldout tuple in the validation set. We then calculate the absolute error ɛ for the heldout tuple. For each context example xc in the heldout tuple’s input context Pn, we store ɛ and the relevant heldout entity, h. After some number of epochs on the validation set (for fairness we use 351, the same number of times the genetic algorithm evaluates ROC-AUC on the validation set), we calculate a mean error ɛh(xc) for that context example xc. Finally, for each heldout drug or cell h, we select the n context examples xc with the lowest ɛh(xc).
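The bookkeeping in this procedure can be sketched as below, assuming the stored observations are available as (heldout entity, context example, absolute error) records. The function name and the record layout are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Tuple

Record = Tuple[Hashable, Hashable, float]  # (heldout entity h, context example, abs. error)


def select_lowest_error_context(records: Iterable[Record],
                                n: int) -> Dict[Hashable, List[Hashable]]:
    """For each heldout entity, pick the n context examples with lowest mean error."""
    errors = defaultdict(lambda: defaultdict(list))
    for h, x_c, eps in records:
        errors[h][x_c].append(eps)

    selected = {}
    for h, per_example in errors.items():
        # Mean absolute error observed whenever x_c appeared in h's context.
        mean_err = {x: sum(v) / len(v) for x, v in per_example.items()}
        selected[h] = sorted(mean_err, key=mean_err.get)[:n]
    return selected
```

Each example is scored independently of the others in the context, so this selection cannot capture interactions between context examples.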
As shown in Table 7, this strategy produces poor performance. This indicates that simply selecting all of the most individually informative context examples is not useful. Rather, there is a more complex, non-linear interaction between examples which is informative to the model. This is intuitive, because the interaction between cellular pathways is complex and still not well understood. The ability for the context to be optimized by a genetic algorithm but not error reduction indicates that data collection strategies which emphasize diversity may be important to consider for constructing new drug synergy datasets.
We further analyze the results of context optimization by separating the results for unknown drugs into their effects on different tissue types. To obtain tissue types, we use the COSMIC [73] cancer mutation database. Results (Table 8) show that performance varies between different tissue types, but that the context optimized by the genetic algorithm outperforms default Unknown-First and the model with no context in all cases with the exception of pleura. For example, the model excels at predicting synergies in ovarian cancers, but results are lower than average in bone and lymphoid cancers. The ROC-AUC for ovarian cancer increases from 77.6% to 81.6% with examples selected using the default Unknown-First strategy. Our context optimization strategy also proves important: the ROC-AUC further increases significantly to 87.5% using the examples selected by the genetic algorithm. In the unknown cell line case, we see improvements on all cell lines except skin and bone. Interestingly, performance on bone-derived cell lines is low in both settings.
While we analyze model performance separately for different tissue types in this work (as done in multiple prior studies), we note that for future research it is likely too limiting and simplistic to separate cell lines into tissue types. Future studies may look at better bucketing approaches, such as primary cancer-driving mutations. An excellent example is KRAS mutations, which occur in up to 25% of human tumors and in many different tissue types (for KRAS: pancreatic, thyroid, colorectal, and lung carcinomas, among others) [33]. Further, we note that while our work focuses on few-shot applications to mono-clonal cell lines and tumor biopsies, there is growing evidence that intra-tumor heterogeneity is a driving factor in cancer growth and is also responsible for drug resistance [4]. Future work can investigate the effect that this heterogeneity may have on patient-specific drug synergy prediction.
C How does training data scale with performance?
Figure 6 shows how training data scale affects performance. We consider the version of the DrugComb [94] dataset for ZIP score regression recently released in the Therapeutic Data Commons software library [25]. It contains 129 drugs, 59 cell lines, and 297,098 synergy tuples. The figure shows BERT model validation performance trained using random token inputs.
D Computing and Implementation Details
All experiments were done on an internal cluster of GPUs. Each experiment was conducted on a single NVIDIA RTX A6000 with 48 GB VRAM. Notably, multiple experiments can fit on the GPU at one time. Our BERT model experiments done within the ChemicalX framework take roughly 2.5 hours each. For SynerGPT, the Unknown-First and Graph variants took roughly 3 hours to train. Random was compute-bound by sampling, which caused it to take 9 hours to train. The inverse design variants took roughly 6-7 hours to train. We estimate that 80 days of GPU time were used for all experiments.
BERT-base consists of 108,233,473 parameters and BERT-large of 333,476,865. Unknown drug SynerGPT contains 22,793,473 parameters. The unknown cell version has 18,044,673 parameters. BERT-large models (we experimented with BioLinkBERT-large) were unstable to train in many cases. BERT and SynerGPT training used a linear decay learning rate schedule. Unknown drug SynerGPT uses 10,000 steps of warm-up. BERT used 1,000 steps of warm-up. Unknown cell SynerGPT used 5% of training steps as warm-up. On ChemicalX DrugCombDB, we use a batch size of 512 with random tokens and 256 for names due to VRAM limits. For all random token BERT experiments, we use a high threshold of k = 5,000 to ensure no common tokens are used.
For the GraphSynergy experiments, BERT-base models use a learning rate of 2e-5. We use 5e-6 for large models, which we find improves training stability. We use a batch size of 32.
E Evaluation Metrics
In this section, we detail the binary classification metrics that are used in this paper. Assume we have values for true positives TP, true negatives TN, false positives FP, and false negatives FN where predictions are separated into positives and negatives based on some threshold t.
Accuracy: (TP + TN) / (TP + TN + FP + FN)
Precision: TP / (TP + FP)
Recall: TP / (TP + FN)
TPR: TP / (TP + FN)
FPR: FP / (TN + FP)
TNR: TN / (TN + FP)
ROC-AUC [6]: The area under the curve created by plotting TPR against FPR as t is varied.
PR-AUC: Similar to ROC-AUC, but the curve plots Precision against Recall (TPR).
F1: 2TP / (2TP + FP + FN)
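As a worked example, the threshold-based metrics listed above can be computed directly from the four confusion counts. The sketch below assumes all denominators are non-zero; the function names are hypothetical.

```python
from typing import Dict, Sequence, Tuple


def confusion_counts(scores: Sequence[float], labels: Sequence[int],
                     t: float) -> Tuple[int, int, int, int]:
    """Return (TP, TN, FP, FN) when scores >= t are predicted positive."""
    tp = tn = fp = fn = 0
    for s, y in zip(scores, labels):
        pred = 1 if s >= t else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 0 and y == 0:
            tn += 1
        elif pred == 1 and y == 0:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def binary_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Threshold-based metrics; assumes every denominator is non-zero."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),  # identical to TPR
        "fpr": fp / (tn + fp),
        "tnr": tn / (tn + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }
```

ROC-AUC and PR-AUC are obtained by sweeping `t` over the score range and integrating the resulting curves, which is not shown here.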
Given a list of rankings R,
Footnotes
↵1 Previous work tested on different subsets of existing datasets (due to filtering for external features).
↵2 We also test a simple error reduction algorithm which produces poor performance. See Appendix B.2.
↵3 “gpt2” from HuggingFace.
↵5 We use the implementation from https://github.com/cnguyen10/few_shot_meta_learning
References
- [1].↵
- [2].↵
- [3].↵
- [4].↵
- [5].↵
- [6].↵
- [7].↵
- [8].↵
- [9].↵
- [10].↵
- [11].↵
- [12].↵
- [13].↵
- [14].↵
- [15].↵
- [16].↵
- [17].↵
- [18].↵
- [19].↵
- [20].↵
- [21].↵
- [22].↵
- [23].↵
- [24].↵
- [25].↵
- [26].↵
- [27].↵
- [28].↵
- [29].↵
- [30].↵
- [31].↵
- [32].↵
- [33].↵
- [34].↵
- [35].↵
- [36].↵
- [37].↵
- [38].↵
- [39].↵
- [40].↵
- [41].↵
- [42].↵
- [43].↵
- [44].↵
- [45].↵
- [46].↵
- [47].↵
- [48].↵
- [49].↵
- [50].↵
- [51].↵
- [52].↵
- [53].↵
- [54].↵
- [55].↵
- [56].↵
- [57].↵
- [58].↵
- [59].↵
- [60].↵
- [61].↵
- [62].↵
- [63].↵
- [64].↵
- [65].↵
- [66].↵
- [67].↵
- [68].↵
- [69].↵
- [70].↵
- [71].↵
- [72].↵
- [73].↵
- [74].↵
- [75].↵
- [76].↵
- [77].↵
- [78].↵
- [79].↵
- [80].↵
- [81].↵
- [82].↵
- [83].↵
- [84].↵
- [85].↵
- [86].↵
- [87].↵
- [88].↵
- [89].↵
- [90].↵
- [91].↵
- [92].↵
- [93].↵
- [94].↵
- [95].↵
- [96].↵
- [97].↵
- [98].↵