Offline RL for generative design of protein binders

Offline Reinforcement Learning (RL) offers a compelling avenue for solving RL problems without the need for interactions with an environment, which may be expensive or unsafe. While online RL methods have found success in various domains, such as de novo Structure-Based Drug Discovery (SBDD), they struggle when it comes to optimizing essential properties derived from protein-ligand docking. The high computational cost associated with the docking process makes it impractical for online RL, which typically requires hundreds of thousands of interactions during learning. In this study, we propose the application of offline RL to address the bottleneck posed by the docking process, leveraging RL’s capability to optimize non-differentiable properties. Our preliminary investigation focuses on using offline RL to conditionally generate drugs with improved docking and chemical properties.


Introduction
The process of designing new drugs is extremely challenging, given the required investment of time and resources required for a drug to navigate through the various stages, spanning from its conception and development, pharmacological characterization and rigorous testing, and eventual approval for mass production and clinical use [15].Furthermore, the vast chemical space comprising all potential drug candidates and their intricate interactions with other molecules and proteins pose formidable obstacles [27].During the design phase in SBDD, our goal is to identify drug candidates with a higher likelihood of successfully traversing these stages -i.e.ensuring favorable chemical attributes such as synthetic accessibility or drug-likeness, while also targeting specific disease-related protein targets as selectively as possible.
Recent advancements in artificial intelligence (AI) have provided valuable tools to aid researchers in the quest for better drug candidates [1].Some of these tools are dedicated to assessing molecular properties [25,38], enabling the validation of molecules before embarking on costly real-world experiments.Another avenue of research focuses on generating molecules themselves [13,16,26].Many such approaches rely on deep learning techniques, necessitating differentiability for training.However, most molecular properties are rule-based or predicted using diverse models, rendering gradient-based learning infeasible for such signals.This challenge can be effectively addressed by incorporating techniques from Reinforcement Learning (RL).
RL has exhibited promising results across a wide range of domains, including robotics [43] and Natural Language Processing (NLP) [29].Online RL techniques have also found application in drug generation tasks [26].However, the necessity for direct interaction with the environment can be a substantial bottleneck for RL algorithms, given the associated costs and potential risks, particularly when algorithms require thousands or even millions of interactions.In SBDD, molecular docking is a critical process that assesses whether a molecule interacts with a target protein, and in this case serves as a prime example of such a costly interaction.Incorporating this property in the de novo molecule generation process is key to achieve successful drug candidates [8].
Offline RL methods [22] leverage the power of RL without the need for direct interaction with the environment, relying solely on pre-collected datasets of interactions.Numerous successful applications of offline RL exist in fields like robotics [31,20], autonomous driving [10], and recommendation systems [7].In this work, our primary objective is to apply offline RL techniques to the domain of de novo drug design, where drugs must satisfy a predefined set of chemical and docking properties.
Drawing inspiration from the successes of RL in the field of NLP, our proposed approach seeks to generate SMILES [40] strings for potential drugs, conditioned on the binding site of the target protein.
To achieve this, first we train the generative model on the supervised task of generating sequences similar to the dataset, after that we employ an offline RL methodology, aiming to enhance the quality of generated samples in terms of docking and chemical properties.This preliminary work serves as a proof of concept, demonstrating the potential of utilizing offline RL to address intricate optimization challenges in drug generation that are difficult to tackle using conventional approaches.Additionally, we propose a strategy for data augmentation, aiming to solve the limited availability of structural protein-ligand complex data.This approach paves the way for future research endeavours in this area.
Given the necessity to generate molecules with specific properties, the application of RL for drug design has gained traction in recent years [28,6,39,41,2,26].These RL-based approaches have demonstrated efficacy in optimizing one or multiple chemical properties that are computationally inexpensive to evaluate.However, the optimization of docking properties -which instead rely or expensive physics-based simulation -remains a formidable challenge [8].
It's noteworthy that some research endeavors have harnessed offline RL techniques in sensitive domains like personalized patient treatment [30,12].Intriguingly, as of our knowledge cutoff, there have been no prior attempts to apply offline RL to the conditional de novo drug design.

Docking Process in Drug Design
The docking process is a pivotal step in SBDD.It plays a crucial role in predicting whether a specific molecule, often referred to as a ligand, can effectively interact with a pharmacological target -in most instances, a protein.This interaction is fundamental to the identification of potential drug candidates and development of novel pharmaceutical compounds.Docking involves the precise prediction of how a ligand (drug) molecule fits into the binding site or active pocket of a target protein.The objective is to assess whether the ligand can form stable and biologically relevant interactions with the protein binding site, often mimicking the natural binding of endogenous ligands or substrates.To do so, it is required to place 3D representations of the target (obtained experimentally) and the ligand under a force field.The initial ligand pose in the binding site is then optimized under the force field so, an energy-minimized structure the bound complex is obtained.The structural properties -e.g.position of the ligand vs the pocket, number of interacting atoms -and the relative energy of the complex can be then used to score its stability.
The docking process is an indispensable tool in computational drug design, enabling researchers to screen vast libraries of molecules efficiently and identify potential drug candidates.However, it is computationally intensive and can be a time-consuming aspect of drug discovery, making it an important target for optimization through advanced techniques such as offline reinforcement learning, as explored in this study.

Language Modeling
We leverage the potent encoder-decoder Transformer architecture [37] as the foundation for our generative model.This architecture comprises stacked encoder and decoder blocks, with the decoder blocks structured similarly to the encoder but incorporating certain inputs from the encoder blocks.Each of these blocks relies on the attention mechanism, a pivotal component in the Transformer framework.The attention mechanism operates on keys (K), queries (Q), and values (V ), and its calculation is defined as follows: here, d represents the dimensionality of the representation.In the encoder, K and V are derived from the outputs of the preceding blocks, while in the decoder, they are obtained from the final encoder block.
The rationale behind selecting the Transformer architecture for our task is grounded in its remarkable track record of success across diverse domains.Transformers have achieved exceptional performance in various applications, and they have recently shown promise in the field of drug generation [3,26].Furthermore, we want to leverage the recently released ESM2 [23] transformer model to compute embeddings for the target proteins on which our generative model is conditioned.

RL Task Formulation
Drawing inspiration from the work of [32], we formalize our problem as a Partially Observable Markov Decision Process [33].At each timestep t, the agent (i.e. the generative model) receives an observation that consist of all previously generated tokens in our sequence i.e. o t = {o 1 , o 2 , ..., o t−1 }, and predicts the next token for which its receives a reward r t .In our specific context, tokens correspond to the set of all possible atoms and bonds present in our chosen SMILES's vocabulary.A special begining of string token is used at the start to indicate we are generating the first token and another special end of string token is used to indicate the end of the sequence.Given that we can only compute the properties of the molecule when the full sequence is transformed to the molecule, all timesteps have reward 0 except the final timestep.
Our formulation can be summarised as follows: 1. Generative model: given some target information T and a ligand with SMILES's representation {o 1 , o 2 , ..., o n }, find a model π θ such that 2. RL: finetune the generative model to maximise the expected return R where R = γ t r t and γ is the traditional discount factor in reinforcement learning.

Method
In this section, we outline our method, which comprises three fundamental steps: data preparation, supervised training of the generative model, and offline RL training.These steps collectively form the core of our approach and are pivotal to achieving our research objectives.

Supervised Data
In our research, we harnessed the PDBBind dataset [24], a resource that supplies triplets of 3D structures comprising a protein, a binding pocket, and a ligand.Notably, the inclusion of information about the pocket is a distinctive feature of the PDBBind dataset, rendering it the sole viable option for our SBDD task.
For our SBDD approach, we need string representations of both proteins and ligands.Biopython [9] was used to transform proteins strudtures into amino acids sequences and the binding pockets into binary masks to indicate the presence or not of each amino acids in the binding pocket.For the ligands, we used RDKit [21] to generate SMILES strings, representing their molecular structures.SMILES representation pose a challenge in molecular generation because of the difficultly to accurately capture branches and rings.Additionally, a single molecule can have multiple SMILES representations.
Hence, for training we use a recently developed molecular string representation called SELFIES [19] that is more robust than SMILES, in that any random sequence of SELFIES tokens can be synthesised into a valid molecule.We convert all the SMILES strings in our dataset to their SELFIES representations and after training the generated SELFIES are converted back to their equivalent SMILES sequences.

RL Data
Preparing the data for RL finetuning requires scoring the various ligands.For this purpose, we use the following metrics: 1. Quantitative Estimate of Drug-likeness (QED): The QED is a measure of the likelihood of a generated molecule to have molecular properties similar to known drugs [4].2. Synthetic Accessibility Score (SAS): The SAS is a metric that measures the level of difficulty to synthesize the generated molecule.3. Number of satisfied Lipinski's rules (Lipinski): Lipsinki's rules are a set of guidelines used to assess the likelihood of the generated molecule being suitable for use as a drug based on physichochemical parameters.4. Retrosynthetic Accessibility Score (RAS): This metric estimates whether it is possible to identify a retrosynthetic path for obtaining the compound [35].
Each score was subsequently mapped to the range [0, 1].In addition to these property scores, we utilized a GPU-accelerated version [11] of the Vina [36] algorithm to obtain docking scores for the ligands.For each ligand, we first converted it from SELFIES to a 3D molecule using RDKit 2 .Subsequently, we dock the ligand to the target protein using Vina, initializing the position at the center of masses of the corresponding pocket.We recorded the following metrics related to the docking process: 1. VinaE: Vina energy score of the best resulting configuration.2. Dist: Distance between the centers of mass of the ligand and the pocket.
3. Clash: Number of instances where the ligand atoms and proteins are too close such that a large repulsive force will be generated that will make the complex unstable.
It's important to note that solely relying on the VinaE score is insufficient to gauge the quality of docking for the target pocket.The distance to the pocket is also crucial in this regard, as it can report wether the docking algorithm could not fit the ligand inside the binding site.Similarly, the number of clashes can be potentially indicative of physically impossible docking configurations.
Following the removal of samples that RDKit or Vina failed to process, the dataset's final size was 17,133 samples, down from the original 19,447.Running the docking process for each dataset sample took approximately 100 hours in total, utilizing Quadro-RTX-4000 GPUs, averaging to 21 seconds per sample.This extensive computational effort underscores our motivation for optimizing the docking property through offline RL.

Data Augmentation
The dataset size of 17,133 samples is a relatively small for the complex task of drug generation.Unfortunately, the supervised portion of our dataset, which necessitates experimentally determined triplets of proteins, pockets, and ligands specifically binding to the target pockets of proteins, cannot be readily augmented.
However, we can augment the RL datasets by introducing additional docking examples.To achieve this, we randomly select pairs of proteins and pockets and endeavor to dock random ligands from the dataset, following the same process detailed in subsubsection 4.1.2.To prevent redundancy, we exclude triplets that already exist in the PDBBind dataset.This augmentation strategy exposes the model not only to positive examples of ligand binding but also to negative ones, as we assume that a random ligand has a low likelihood of successful docking to a randomly chosen protein.This assumption is supported by the distribution plots shown in Figure 1 and the statistics presented in Table 1.
The result of this data augmentation effort is a dataset comprising 77,344 samples.It's worth noting that this number can be further increased, albeit at the cost of significantly higher computational resources required for docking.

Generative model
The initial training stage of our method focuses on training a generative model using the supervised objective, leveraging the PDBBind dataset.In this stage, we provide the target protein sequence and the binding site mask as inputs to the encoder.The decoder's task is to predict the next token based on the encoder's output and all previous tokens in the sequence.We utilize cross-entropy loss as the training objective.
Notably, our approach introduces a novel concept: explicit conditioning on the protein and binding site, a factor not previously employed in attempts to address the binding problem.Our hypothesis is that this conditioning mechanism may enhance the generation of relevant drugs and facilitate better generalization.
To enhance the representation learning for proteins, given the limited dataset size, we opt to utilize the ESM2 protein encoder [23], a pre-trained encoder that has proven effective in capturing protein features.In this regard, we do not retrain this component of our encoder.However, to incorporate binding site information, we introduce additional trainable parameters into our encoder.The decoder is trained from scratch.The supervised training stage of our method is illustrated at the top of Figure 2.

Offline RL
For the offline RL component of our method, we draw inspiration from ILQL [32], initially proposed for addressing NLP problems.ILQL is built upon the foundation of IQL [18], an offline RL algorithm recognized as one of the most potent offline RL methodologies to date [34].Additionally the algorithm is very easy to implement.
The underlying concept of ILQL involves taking a pre-trained decoder and replacing the prediction head with a state value function V and an action state value function Q.In value based RL, the V and Q functions for a state s and an action a are defined as follows respectively: The V value provides an estimate of the expected discounted reward that a policy will accumulate when starting from a given state s.It offers insights into the overall value of being in a specific state.
On the other hand, the Q function estimates the expected discounted reward after taking a specific action a in a particular state s.During training we use the ILQL expectile loss as defined in [32] to learn the V and Q functions.
During model inference, we employ the outputs of the generative model and shift the predicted token distribution with an advantage obtained from the V and Q functions, utilizing a weighting parameter β.The formula for the resulting score of token a given the input h (comprising the protein, pocket, and previous tokens) is expressed as: where p gen (a|h) represents the probability of token a predicted by the generative model.The offline RL stage is visually represented at the bottom of Figure 2.
At present, our reward function is a weighted combination of the factors outlined in subsubsection 4.1.2.The weights can be found in subsection A.2.
We christen our approach as Ligend, short for Ligand Generation via Structural Coditioning, encapsulating our novel methodology.

Experimental Setup
Using our approach we train models with different configurations in order to see which components are essential and which are not.The following parts of the method were probed: using or not using data augmentation and using or not using docking properties.
We split the PDBBind dataset into training and validation sets.The validation dataset size is only 5% which results into 800 samples due to two factors.First, original dataset size is small, so removing a big part of it may result in severe model performance drop.Second, the docking process required for the evaluation takes a lot of time, so the increase of the validation set size increases the computational overhead.
First, we train the generative model with the training subset of PDBBind.After that, for each of the experiment we reuse the obtained model for further RL tuning and use the same split of the PDBBind for the training.
We evaluate each of the resulting models with the same set of β hyperparameters.We choose the best model based on the highest average reward it achieves on validation set.For the list of hyperparameters see subsection A.2.
After collecting predictions for all of the best models we use the intersection of predictions sets which were successfully processed with evaluation pipeline for fair comparison.This results into 554 samples.

Experimental Results
Table 1: Average results of scoring on the intersection of successfully processed validation sets.↓ and ↑ indicates that metric must be minimized or maximized respectively.In "w/o docking" all the docking metrics are removed from the reward function."+ augmentation" indicates the usage of the augmented data.Chemical properties are normalized into the range [0, 1].We also report the same metrics for the entire PDBBind dataset and it's augmentation.We highlight the best of scores for generated molecules with the bold text.The outcomes of our method under various configurations are presented in Table 1, showcasing a breakdown of each reward component and the cumulative rewards for chemical properties, docking properties, and the overall score.It is important to note that we did not conduct a hyperparameter search while RL performance may improve strongly with better parameters.

Chemical properties Docking properties
Several noteworthy insights can be gleaned from our results.Firstly, and somewhat surprisingly, the generative model produces predictions for molecules with relatively favorable chemical properties when compared to the dataset distribution.Additionally, the average VinaE score demonstrates notable performance.It must be remarked that the PDBBind full and validation sets are different from the augmented in a fundamental sense: the experimental X-ray structures represent rigid, energyminimized states not only for the ligand, but also the targets' binding pocket residues.This means that the structure of the target is optimized towards its bound ligand in the PDBBind dataset.Therefore, it is not surprising that the VinaE scores are significantly better-and that the newly generated molecules can not reach the same energy range -as well as the larger presence of clashes, an artifact from the crystallization.Interestingly, the average VinaE is similar between the Ligend models and the generative one.Optimizing solely for chemical properties yields marginal improvements in chemical scores but comes at the cost of a significant decline in docking results.
When we employ all metrics simultaneously for optimization, a consistent enhancement in chemical properties is observed, alongside a boost in docking properties.Interestingly, the physichochemicalbased properties QED and Lipinski rule of 5 show the smallest improvement over the generative model -suggesting that the learned distribution captures druglikeness -while the SAS and RA, related to the more abstract concept of synthetizability, show the larger changes.Regarding the docking parameters, what distinguishes the full model from the rest, is the reduced average drugpocket distance and number of clashes, which may compensate the slightly less favorable average dockeing energy.
It is remarkable that the improvements across all metrics were often smaller when the augmentation dataset is included.This behavior with the augmented data might be attributed to the influence of the CQL loss component within the ILQL loss, which forces to elevate the scores for samples from the dataset.In the augmented dataset, we encounter approximately four times as many negative examples of docking compared to positive instances, potentially contributing to this observed effect.Addressing this issue may be possible through more refined hyperparameter selection.
Optimizing for both chemical and docking properties simultaneously leads to improved chemical scores.This improvement reaffirms the idea that molecular docking involves an interplay of various chemical properties which need to be optimised during docking along with the traditional docking specific metrics such as the docking distance or the energy score.Nevertheless, the most significant conclusion remains that RL can effectively enhance predictions compared to those of the generative model improving it by approximately 50%.Despite this improvement, a considerable gap persists between the samples from PDBBind and the output generated by Ligend.
To ascertain the diversity and novelty of the molecules generated by our approach, we provide details in Table 2, encompassing the count of unique molecules, the number of novel molecules not present in the entirety of PDBBind and the Tanimoto similarity between the Morgan fingerprints of generated samples and the molecules in the PDBBind dataset.It is worth noting a slight decrease in the count of unique and novel molecules after RL finetuning compared to the Generative model, indicating a potential avenue for future exploration.Both the Generative and RL finetuned models have average Tanimoto similarities below 0.4, hence both models do not just generate molecules with minor differences compared to those in the datasets but attempt to generate molecules with completely different structures.

Conclusion and Future Work
In this study, we have demonstrated a proof of concept, showcasing the potential of offline reinforcement learning (RL) in the domain of de novo SBDD, with a focus on specific chemical and docking properties.
A significant limitation of our work is that RL, by nature, optimizes the reward specified, which may not fully capture the complexities of real-world applications.Even the optimal policy derived under a given reward function may produce solutions that are impractical or unsuitable for actual use.To address this limitation, additional expert scoring of the generated molecules is essential to ensure the practical viability of the solutions.
While we have achieved promising results, they remain several avenues for improvement that have the potential to significantly enhance the performance of our approach: 1. Hyperpararemters search.Our current experimentation did not involve hyperparameter tuning.For the supervised learning step, this may be less critical, but fine-tuning hyperparameters in the RL phase could substantially boost performance, even in the absence of other enhancements.
2. Better generative model pretraining.Given the small dataset size, achieving the challenging generative objective requires additional considerations.An additional pretraining stage, preceding the current supervised phase, could potentially lead to better performance.
This new stage could involve generating arbitrary molecules without specific conditioning, utilizing datasets significantly larger than PDBBind.Existing research in this area provides pre-trained models that can be leveraged without the need for extensive training.
3. Advanced decoding techniques.In this work, we utilized greedy decoding for generating predictions.However, more sophisticated techniques like beam search are known to offer improved performance.Implementing advanced decoding methods could yield further gains.
4. Enhanced reward design.The design of the reward function is a crucial aspect of RL applications.Our current reward function is a simple weighted sum of various metrics, with weights that have not been tuned (except for setting them to zero in ablations).Exploring nonlinear combinations of these metrics may also lead to substantial performance improvements.Also, adding additional scores such as toxicity might be advantageous.
We anticipate that our further work will refine and optimize our approach.We hope that our research will provide valuable insights and inspiration for fellow researchers in the field of drug generation with specific properties.

Figure 1 :Figure 2 :
Figure 1: Distribution densities of (a) Vina energy score and (b) distance to the target pocket for PDBBind dataset and its augmentation.The lower values of both metrics correspond to better docking results.

Table 2 :
Number of unique molecules, number of molecules which are not present in the PDBBind dataset, and mean Tanimoto similarity between Morgan fingerprints with molecules in the PDBBind dataset, computed using RDKit.