Can AI reproduce observed chemical diversity?

Generating diverse molecules with desired chemical properties is important for drug discovery. The use of generative neural networks is promising for this task. To facilitate evaluation of generative models, this paper introduces a metric of internal chemical diversity, and raises the following challenge: can a nontrivial AI model reproduce observed internal diversity for desired molecules? To illustrate this metric, a mini-benchmark is performed with two generative models: a Reinforcement Learning model and the recently introduced ORGAN. The aim of this paper is to encourage research about internal diversity metrics.


Introduction
Drug discovery is like finding a needle in a haystack. The chemical space of potential drugs contains more than 10^60 molecules. Moreover, testing a drug in a medical setting is time-consuming and expensive: getting a drug to market can take up to 10 years and cost $2.6 billion [1]. In this context, computer-based methods are increasingly employed to accelerate drug discovery and reduce development costs.
In particular, there is a growing interest in AI-based generative models. Their goal is to generate new lead compounds in silico, such that their medical and chemical properties are predicted in advance. Examples of this approach include Variational Auto-Encoders [2], Adversarial Auto-Encoders [3,4], and Recurrent Neural Networks with Reinforcement Learning [5,6,7], possibly in combination with Sequential Generative Adversarial Networks [8,9].
However, research in this field often remains at the exploratory stage: generated samples are sometimes evaluated only visually, or with respect to metrics that are not the most relevant for the actual drug discovery process.
Rigorous evaluation would be particularly welcome regarding the internal chemical diversity of the generated samples. Generating a chemically diverse stream of molecules is important, because drug candidates can fail in many unexpected ways, later in the drug discovery pipeline.
Based on visual inspection, [5, p. 8] reports that their Reinforcement Learning (RL) generative model tends to produce simplistic molecules. On the other hand, [8, p.6, p.8] argues that their Objective-Reinforced Generative Adversarial Network (ORGAN) generates less repetitive and less simplistic samples than RL. However, their argument is also based on visual inspection and therefore, it remains subjective: our own visual inspection of the ORGAN-generated samples (available on the ORGAN Github: https://github.com/gablg1/ORGAN/tree/master/results/mol_results) rather suggests that ORGAN produces molecules as repetitive and as simplistic as RL.
In this paper, we introduce a metric that quantifies the internal chemical diversity of the model output. We also submit a challenge: Challenge: Is it possible to build a non-trivial generative model, with (part of) its output satisfying a non-trivial chemical property, such that the internal chemical diversity of this output is at least equal to the observed diversity naturally found for the same kind of molecules?
To illustrate this challenge, we compare the RL and ORGAN generative models with respect to the following chemical properties:
1. Activity against the dopamine receptor D2. The dopamine D2 receptor is the main target of all antipsychotic drugs (schizophrenia, bipolar disorder...).
2. Druglikeness as defined in [8]. We are interested in this property because we can use the experimental results in [8] to facilitate discussion. However, the notion of druglikeness in [8] differs from the Quantitative Estimation of Druglikeness (QED) [10], an index measuring different physico-chemical properties that facilitate oral drug action.
Here, druglikeness is the arithmetic mean of four normalized scores: solubility (normalized logP), novelty (equal to 1 if the output is outside the training set, 0.3 if the output is a valid SMILES in the training set, and 0 if the output is not a valid SMILES), synthesizability (normalized synthetic accessibility score [11]), and conciseness (a measure of the length difference between the generated SMILES and its canonical representation).
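As a minimal sketch (assuming the four component scores have already been normalized to [0, 1]; the function names are illustrative, not taken from the authors' code), the druglikeness score is simply the arithmetic mean of its components, with the novelty component following the three-valued rule above:

```python
def novelty(smiles_is_valid, in_training_set):
    """Novelty score as defined above: 1 for an output outside the training
    set, 0.3 for a valid SMILES inside it, 0 for an invalid SMILES."""
    if not smiles_is_valid:
        return 0.0
    return 0.3 if in_training_set else 1.0

def druglikeness(solubility, novelty_score, synthesizability, conciseness):
    """Arithmetic mean of the four normalized component scores."""
    return (solubility + novelty_score + synthesizability + conciseness) / 4.0
```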
We note that, recently, [9] considered an ORGAN with the QED definition of druglikeness. We also performed our own experiments with the QED property; they did not affect our conclusions.

The metric of internal chemical diversity
Let a and b be two molecules, and m_a and m_b be their sets of Morgan fingerprints [12]. Their number of common fingerprints is |m_a ∩ m_b|, and their total number of fingerprints is |m_a ∪ m_b|. The Tanimoto-similarity T_s between a and b is defined by:

T_s(a, b) = |m_a ∩ m_b| / |m_a ∪ m_b|

Their Tanimoto-distance is:

T_d(a, b) = 1 - T_s(a, b)

We use the rdkit implementation [13] of this distance, with fingerprint size 4.
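On fingerprint sets, the two formulas above can be sketched as follows (a pure-Python illustration over sets of fingerprint identifiers; in practice we use the rdkit implementation):

```python
def tanimoto_similarity(fp_a, fp_b):
    """Tanimoto-similarity: shared fingerprints over total fingerprints."""
    if not fp_a and not fp_b:
        return 1.0  # convention for two empty fingerprint sets
    return len(fp_a & fp_b) / len(fp_a | fp_b)

def tanimoto_distance(fp_a, fp_b):
    """Tanimoto-distance: one minus the Tanimoto-similarity."""
    return 1.0 - tanimoto_similarity(fp_a, fp_b)
```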

Internal diversity
We define the internal diversity I of a set of molecules A of size |A| to be the average of the Tanimoto-distance T_d between the molecules of A. Formally, we have:

I(A) = (1 / |A|^2) * Σ_{(x, y) ∈ A × A} T_d(x, y)    (1)

Note that this sum includes self-distances, although their contributions are equal to zero.
For a sufficiently large set A, any sufficiently large subset A' ⊂ A, sampled with uniform probability, has the same internal diversity as A. This property follows from the law of large numbers. We can thus define the internal diversity of a generative model by computing the internal diversity of a sufficiently large generated sample. This allows us to formalize our challenge:

Challenge (restatement): Let N be the molecules observed in nature. Is there a non-trivial generative model G and a non-trivial chemical property P such that:

I({x generated by G : x satisfies P}) ≥ I({x ∈ N : x satisfies P}) ?

Internal chemical diversity is always smaller than 1 (because the Tanimoto-distance is smaller than 1), and it is usually much smaller. That is why we prefer this definition to the Tanimoto-variance of a set of molecules A, which is:

V(A) = (1 / |A|^2) * Σ_{(x, y) ∈ A × A} (T_d(x, y) - I(A))^2
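The internal diversity of equation (1) can be implemented directly (a sketch; `distance` stands for any molecular distance, e.g. the Tanimoto-distance on fingerprints):

```python
def internal_diversity(mols, distance):
    """Average pairwise distance over A x A, self-pairs included (equation (1))."""
    n = len(mols)
    total = sum(distance(x, y) for x in mols for y in mols)
    return total / (n * n)
```

A trivial model that always emits the same molecule gets internal diversity 0, whatever the distance used.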

Relation of the internal diversity metric with the previous literature
Internal diversity quantitatively captures the visually observed fact that generated molecules can be repetitive and simplistic [8,5]. Previous metrics could not do this.

Internal vs. external diversity metric [8, p.5]
Let A_1 be the training set, and A_2 be the generated set. The external diversity (called 'diversity' in [8, p.5]) is defined by:

E(A_1, A_2) = (1 / (|A_1| |A_2|)) * Σ_{x ∈ A_1, y ∈ A_2} T_d(x, y)

External diversity and internal diversity are different metrics: in our definition, there is only one set A, the generated set (see equation (1)).
External diversity fails to capture the visually observed fact that generated molecules can be repetitive and simplistic (as observed in [8,5]). On the other hand, our metric gives better results, because it better matches human visual observation of the samples: internal diversity is slightly lower for RL than for ORGAN.
Why the internal diversity metric works better than the external diversity metric: suppose the chemical space is R^2 with the Euclidean distance (in place of the Tanimoto-distance). Suppose the training data is located on a circle of radius one centered at the origin, and consider a trivial generative model whose generated samples are all located at the origin. See figure 1.
In this setting, the external diversity of the model is equal to 1, because the distance between a generated point and a training point is always equal to 1.
On the other hand, the internal diversity for this generative model is equal to zero, because the distance between two generated samples is always zero.
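This toy example can be checked numerically (a sketch, with the Euclidean distance on R^2; `external_diversity` averages distances across the two sets, `internal_diversity` within one set):

```python
import math

def dist(p, q):
    """Euclidean distance in R^2."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def external_diversity(a1, a2):
    """Average distance between the training set a1 and the generated set a2."""
    return sum(dist(x, y) for x in a1 for y in a2) / (len(a1) * len(a2))

def internal_diversity(a):
    """Average pairwise distance within a single set."""
    return sum(dist(x, y) for x in a for y in a) / (len(a) ** 2)

# Training data: points on the unit circle; generated data: all at the origin.
training = [(math.cos(t), math.sin(t))
            for t in [2 * math.pi * k / 100 for k in range(100)]]
generated = [(0.0, 0.0)] * 50

# External diversity is 1 (every training point is at distance 1 from the
# origin), yet the internal diversity of the generated set is 0.
```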
Contrary to the external diversity metric in [8], our metric can distinguish between this trivial case and a less trivial generative model whose generated points are spread around the circle. However, some form of external diversity metric is still important, in order to eliminate another kind of trivial generative model: one that simply reproduces the training set. Suitable external and internal diversity metrics are complementary.

Internal diversity metric vs. NN-Tanimoto similarity [6]
The nearest-neighbor Tanimoto-similarity between a generated molecule m and the training set A_1 is given by:

NN-T_s(m, A_1) = max_{m' ∈ A_1} T_s(m, m')

Segler et al. consider the distribution of the Tanimoto-similarities between generated molecules and their nearest neighbor in the training set. They qualitatively discuss the shape of this distribution for their own models, but they do not use this distribution to define a quantitative metric that allows ranking different models.
To extend the work done in [6], various metrics can be derived from this distribution (e.g. its variance, or its Wasserstein distance to the uniform distribution). This will be interesting for future work, but in any case, such metrics will be more complicated than ours.
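The nearest-neighbor similarity and its distribution can be sketched as follows (reusing set-based fingerprints as an illustration; the function names are ours, not Segler et al.'s):

```python
def tanimoto_similarity(fp_a, fp_b):
    """Tanimoto-similarity between two sets of fingerprint identifiers."""
    return len(fp_a & fp_b) / len(fp_a | fp_b)

def nn_similarity(fp_m, training_fps):
    """Tanimoto-similarity between m and its nearest neighbor in the training set."""
    return max(tanimoto_similarity(fp_m, fp) for fp in training_fps)

def nn_similarity_distribution(generated_fps, training_fps):
    """The distribution studied in [6]: one NN-similarity per generated molecule."""
    return [nn_similarity(fp, training_fps) for fp in generated_fps]
```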

Internal diversity metric vs. NN-Levenshtein distance [6, p.10 and figure 11]
The same remarks as for the NN-Tanimoto-distance apply. For future work, it will be interesting to replace the Tanimoto-distance in our work with the Levenshtein distance.

Recently, [15, figure 2] used a visualization to claim that the molecules generated by their RL model (similar to [6]) "populate" the chemical space. It is analogous to the visualizations in [14, figures 5, 6 and 9] and [6, figures 5 and 8] (except that the latter is a t-SNE visualization, which is useful but can also be misleading, see [16]).
The internal diversity metric introduced here is a contribution towards giving a precise meaning to the expression "populate". It makes it possible to compare different models from the viewpoint of which one "populates" the chemical space better.

Internal diversity metrics and computer vision
Generative models in computer vision also consider internal diversity metrics. For example, [17, section 4] introduced the Inception score to assess both the quality and the internal diversity of generated images. Other metrics are being considered and evaluated in the literature [18]. For future work, it will be interesting to build analogous metrics for molecule generation.

Beyond fingerprints
Our definition of internal diversity depends on Morgan fingerprints, which are hand-crafted features that do not always capture the notion of chemical distance [19]. It would be better to use automatically learned features: molecule vector representations analogous to the word embeddings used in Natural Language Processing [20]. There is some work in this direction [21].

Reinforcement Learning
As in the RL case considered in [8], and as in [22, p.4], the generator G_θ is an LSTM Recurrent Neural Network [23] parameterized by θ.
G_θ maps the input embedding representations into a sequence of hidden states, and a softmax output layer maps the hidden states into the output token distribution. G_θ generates SMILES (Simplified Molecular-Input Line-Entry System) sequences of length T (padded with " " characters if necessary), denoted by:

Y_{1:T} = (y_1, ..., y_T), with each y_t in the alphabet Y.

Let R(Y_{1:T}) be the reward function.
• For the case of dopamine D2 activity, we take:

R(Y_{1:T}) = P_active(Y_{1:T})

where P_active(Y_{1:T}) is the probability for Y_{1:T} to be D2-active. This probability is given by the predictive model built in [7], available online at https://github.com/MarcusOlivecrona/REINVENT/releases

• For the case of druglikeness, we take:

R(Y_{1:T}) = L(Y_{1:T})

where L(Y_{1:T}) is the druglikeness score defined above.

The generator G_θ is viewed as a Reinforcement Learning agent: its state s_t is the currently produced sequence of characters Y_{1:t}, and its action a is the next character y_{t+1}, selected from the alphabet Y. The agent policy is G_θ(y_{t+1} | Y_{1:t}): the probability of choosing y_{t+1} given the previous characters Y_{1:t}.
Let Q(s, a) be the action-value function: the expected reward at state s for taking action a and then following the policy G_θ to complete the rest of the sequence. We maximize the expected long-term reward:

J(θ) = E[R(Y_{1:T}) | s_0, θ] = Σ_{y_1 ∈ Y} G_θ(y_1 | s_0) · Q(s_0, y_1)

For any full sequence Y_{1:T}, we have:

Q(Y_{1:T-1}, y_T) = R(Y_{1:T})

For t < T, in order to estimate the expected reward Q for Y_{1:t}, we perform an N-time Monte Carlo search with the rollout policy G_θ:

MC(Y_{1:t}; N) = {Y^1_{1:T}, ..., Y^N_{1:T}}

where Y^n_{1:t} = Y_{1:t} and Y^n_{t+1:T} is randomly sampled via the policy G_θ.

For t < T, Q is then given by:

Q(Y_{1:t-1}, y_t) = (1/N) Σ_{n=1}^{N} R(Y^n_{1:T}), with Y^n_{1:T} ∈ MC(Y_{1:t}; N)
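The Monte Carlo estimate of Q can be sketched with a toy rollout policy (illustrative only; `rollout` and `reward` stand in for G_θ and R, which are a trained LSTM and a chemistry-based score in the actual models):

```python
import random

def mc_q_estimate(prefix, rollout, reward, n_rollouts, seq_len):
    """Estimate Q(Y_{1:t-1}, y_t) by averaging the reward of N rollouts
    that complete the prefix Y_{1:t} up to length T."""
    total = 0.0
    for _ in range(n_rollouts):
        seq = list(prefix)
        while len(seq) < seq_len:
            seq.append(rollout(seq))  # sample the next token from the policy
        total += reward(seq)
    return total / n_rollouts

# Toy example: uniform policy over {'a', 'b'}, reward = fraction of 'a' tokens.
random.seed(0)
q = mc_q_estimate(['a'], lambda s: random.choice('ab'),
                  lambda s: s.count('a') / len(s), n_rollouts=200, seq_len=5)
```

With the first token fixed to 'a' and the rest uniform, the estimate concentrates around 0.6 as the number of rollouts grows.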

Objective-Reinforced Generative Adversarial Network (ORGAN)
To obtain an ORGAN, [8] adds a Character-Aware Neural Language Model [24] D_φ, parameterized by φ. Basically, D_φ is a Convolutional Neural Network (CNN) whose output is fed to an LSTM. D_φ receives both training data and data generated by G_θ, and plays the role of a discriminator distinguishing between the two: for a SMILES Y_{1:T}, the output D_φ(Y_{1:T}) is the probability that Y_{1:T} belongs to the training data.
For the case of dopamine D2 activity, the reward function becomes:

R(Y_{1:T}) = λ D_φ(Y_{1:T}) + (1 - λ) P_active(Y_{1:T})

and for the case of druglikeness:

R(Y_{1:T}) = λ D_φ(Y_{1:T}) + (1 - λ) L(Y_{1:T})

where λ ∈ [0, 1] is a hyper-parameter. For λ = 0, we recover the RL case, and for λ = 1, we obtain a Sequential Generative Adversarial Network (SeqGAN) [22].
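The blended reward is a one-liner (a sketch; `discriminator_prob` and `objective_score` stand for D_φ(Y_{1:T}) and the chosen objective, P_active or L):

```python
def organ_reward(discriminator_prob, objective_score, lam):
    """ORGAN reward: lam * D_phi + (1 - lam) * objective.
    lam = 0 recovers plain RL; lam = 1 recovers SeqGAN."""
    return lam * discriminator_prob + (1.0 - lam) * objective_score
```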
The networks G_θ and D_φ are trained adversarially [25,26], such that the loss function minimized by D_φ is:

L(φ) = - E_{Y ~ p_data}[log D_φ(Y)] - E_{Y ~ G_θ}[log(1 - D_φ(Y))]    (5)

Dopamine D2 activity
As in [8], we pre-train the models for 240 steps with Maximum Likelihood Estimation (MLE), on a random subset of 15k molecules from the ZINC database of 35 million commercially-available compounds for virtual screening, used in drug discovery [27]. Then we further train the models with RL and ORGAN, for 30 and then 60 more steps. The most interesting case is RL after 30 steps: in this case, increasing the probability of D2 activity conflicts with keeping diversity. After 30 steps, internal diversity is even higher than the DRD2 diversity baseline.
However, when we only keep the molecules of interest, those with P_a > 0.8, internal diversity drops dramatically, to vanishingly small levels. Note that even in these cases, the generated SMILES remain distinct from each other.
For ORGAN-0.04, results are mostly analogous to RL. Note that at 30 steps, diversity for P_a > 0.8 is 2 orders of magnitude better than for RL at 30 steps. However, it still remains one order of magnitude lower than the DRD2 baseline, and at 60 steps, diversity has dropped to levels similar to RL.
For ORGAN-0.5, learning of the D2 property had still not started after 60 steps. The situation is analogous to the SeqGAN case (λ = 1) described in [8]: high diversity, but no learning of the objective. In particular, this is why the internal diversity for P_a > 0.8 is undetectable: only 6 samples among 32k satisfy the desired property.
The intermediate cases between λ = 0.04 and λ = 0.5 are analogous to one or the other of them. It is hard to locate the tipping point between the cases where training is merely slow and those where training will never take off.
Note that external diversity can be high even when internal diversity is vanishingly small.
Here are the SMILES (structures in figure 4)

Druglikeness
Here, we use the experimental data from [8], made available on their Github. [8] pre-trains the models for 240 epochs with Maximum Likelihood Estimation (MLE), on a random subset of 15k molecules from the ZINC database of 35 million commercially-available compounds for virtual screening, used in drug discovery [27]. Then [8] further trains the models with RL and ORGAN, for 200 steps. Results are shown in Table 2. They show that ORGAN indeed improves over RL, since it is able to raise internal diversity to detectable levels. However, ORGAN diversity still remains 2 orders of magnitude lower than ZINC diversity when L > 0.8. ORGAN diversity also remains 3 orders of magnitude lower than the total diversity of ZINC, which corresponds to the level of internal diversity most eyes are accustomed to. We conclude that in our limited setting (small datasets...), both RL and ORGAN with λ = 0.8 fail to generate internally diverse molecules for this property.
Note again that external diversity can be very high even when internal diversity is vanishingly small.
Note also that for ZINC, which is the training set, external and internal diversities for L > 0.8 still differ, because external diversity is taken over all training molecules, whereas internal diversity is taken over the small fraction (2%) of them that have L > 0.8.

Conclusion and additional future work
The conclusion of this mini-benchmark is that for small training datasets, small architectures, and the specific hyperparameters tested, both RL and ORGAN fail to match the observed internal diversity of desired molecules, although ORGAN is slightly better than RL. Future work on diversity metrics was already discussed in subsection 2.2. There is also room for a more comprehensive benchmark, with larger datasets, with larger and more models (like the recent [29]), and with various hyperparameters. Finally, there is also work to do on better ORGAN training. On this point, two distinct problems can be considered:
• The perfect discriminator problem in adversarial training
• The imbalance between different objectives in Reinforcement Learning

The perfect discriminator problem
In ORGAN training, the discriminator D_φ quickly becomes perfect: it perfectly distinguishes between training data and generated data. In general, this situation is not good for adversarial learning [30]. Here, the discriminator still teaches the generator something: on average, according to the discriminator, the probability for a generated sample to belong to the training set remains far from 0, although always smaller than 0.5. This probability is transmitted to the generator through the reward function.
However, the inability to 'fool' the discriminator, even in the SeqGAN case λ = 1 (without any other objective), reveals a weakness of the generator: it is unable to reproduce a plain druglike dataset like ZINC. Training a SeqGAN properly should be a first step towards improving ORGAN.
To achieve this, it might be possible to use a larger generator, to replace the discriminator loss in equation (5) with another function (like CramerGAN [31]), and to use one-sided label smoothing [17, p.4].
The discriminator might also overfit the training data. Taking a larger training set could help: we used 15k samples here (less than 1 MB), which is small compared with training sets in Natural Language Processing. On the other hand, datasets in drug discovery rarely exceed 10k molecules, so it could also be interesting to look in the direction of low-data predictive neural networks [32].
Once adversarial training is stabilized, it might be interesting to replace all classifiers in the reward function with discriminators adversarially trained on different datasets. Various desired properties might be instilled into generated molecules with multiple discriminators.
This might better transmit the chemical diversity present in the various training sets.

Imbalance in multi-objective RL
The main issue is the imbalance between the various objectives in the reward function, a problem occurring also in RL. Multi-objective reinforcement learning is a broad topic (for a survey, see [33]).
A problem here is that with a weighted sum, the agent always focuses on the easiest objective and ignores the harder ones. Moreover, the relative difficulty of the objectives evolves over time. For example, the average probability of D2 activity initially grows exponentially, so this growth is small while the probability is near 0.
Using time-varying adaptive weights might help. Moreover, the combination of objectives need not be linear: for example, the reward function can be of the form (x^λ + y^λ)^(1/λ), which converges towards min(x, y) as λ → -∞. Using an objective function of the form min(x, y) focuses the generator on the hardest objective (but in our experiments, due to the perfect discriminator problem, this did not work).
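The limiting behaviour can be checked numerically (a sketch; for scores in (0, 1], a moderately large negative λ already brings the combination close to the minimum):

```python
def power_combination(x, y, lam):
    """Reward combination (x^lam + y^lam)^(1/lam).
    Tends to min(x, y) as lam -> -infinity."""
    return (x ** lam + y ** lam) ** (1.0 / lam)

# With lam = -50, combining the scores 0.2 and 0.9 already yields
# almost exactly min(0.2, 0.9) = 0.2.
```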
Moreover, a penalty can be introduced in the reward function for newly generated molecules that are too similar to the generated molecules already having the desired properties.
In any case, the (varying) relative weights of the different objectives must be determined automatically, not through guesswork. In a drug discovery setting, a molecule must simultaneously satisfy a large number of objectives. For example, for an antipsychotic drug, it is not enough to be active against D2: the molecule must also pass toxicity and druglikeness tests. Moreover, to avoid side-effects, the molecule must not be active against D3, D4, serotonin or histamine receptors. That is a lot of objectives to include in the reward function.

Availability of data and material
All code and data are available at: https://github.com/mostafachatillon/ChemGAN-challenge

Competing interests
The author declares that he has no competing interests.

Funding
This study was self-funded by the author.