ABSTRACT
Designing molecules with desirable physicochemical properties and functionalities is a long-standing challenge in chemistry, materials science, and drug discovery. Recently, machine learning-based generative models have emerged as promising approaches for de novo molecule design. However, further refinement of methodology is highly desired, as most existing methods lack unified modeling of 2D topology and 3D geometry information and fail to effectively learn the structure-property relationship for molecule design. Here we present MolCode, a roto-translation equivariant generative framework for Molecular graph-structure Co-design. In MolCode, 3D geometric information empowers the molecular 2D graph generation, which in turn helps guide the prediction of molecular 3D structure. Extensive experimental results show that MolCode outperforms previous methods on a series of challenging tasks including de novo molecule design, targeted molecule discovery, and structure-based drug design. In particular, MolCode not only consistently generates valid (99.95% Validity) and diverse (98.75% Uniqueness) molecular graphs/structures with desirable properties, but also generates drug-like molecules with high affinity to target proteins (61.8% high affinity ratio), which demonstrates MolCode’s potential applications in material design and drug discovery. Our extensive investigation reveals that the 2D topology and 3D geometry contain intrinsically complementary information in molecule design, and provides new insights into machine learning-based molecule representation and generation.
Introduction
Designing molecules with desirable characteristics is of fundamental importance in many applications, ranging from drug discovery1–3 and catalysis4 to semiconductors5,6. However, the size of the chemical space is estimated to be on the order of 10^60 (ref. 7), which precludes an exhaustive computational or experimental search of possible molecular candidates. In recent years, advances in machine learning (ML) methods have greatly accelerated the exploration of chemical compound space8–19. Many studies propose to generate 2D/3D molecules and optimize molecular properties with deep generative models20–25.
Molecules can be naturally represented as 2D graphs where nodes denote atoms and edges represent covalent bonds. Such a concise representation has motivated a series of studies on molecule design and optimization. These works either predict the atom types and adjacency matrix of the graph26–29, or employ autoregressive models to sequentially add nodes and edges21,30. Furthermore, some methods leverage the chemical priors of molecular fragments/motifs and propose to generate molecular graphs fragment-by-fragment31,32. However, these methods cannot provide complete information about a molecule because its 3D structure remains unknown, which limits their practical applications. Due to intramolecular interactions or rotations of structural motifs, the same molecular graph can correspond to various spatial conformations with different quantum properties33–37. Therefore, molecular generative models that consider 3D geometry information are desired to better learn structure-property relationships.
Recently, some studies characterize molecules as 3D point clouds where each point has atom features (e.g., atom types) and 3D coordinates, and corresponding generative models have been proposed for 3D molecule design. These methods include estimating pairwise distances between atoms38, employing diffusion models to predict atom types and coordinates of all atoms39, and using autoregressive models to place atoms in 3D space step-by-step22,24,40. Since molecular drugs inhibit or activate particular biological functions by binding to the target proteins, another line of work further proposes generating 3D molecules inside the target protein pocket, which is a complex conditional generation task41–44. However, most of these methods do not explicitly consider chemical bonds and valency constraints and may generate molecules that are not chemically valid. Moreover, the lack of bonding information also inhibits the generation of realistic substructures (e.g., benzene rings).
In this work, we propose MolCode, a roto-translation equivariant generative model for Molecular graph-structure Co-design from scratch or conditioned on the target protein pockets. Our model is motivated by the intuition that the information in the 2D graph and the 3D structure is intrinsically complementary in molecule generation: the 3D geometric structure information empowers the generation of chemical bonds, and the bonding information can in turn guide the prediction of 3D coordinates to generate more realistic substructures by constraining the search space of bond lengths/angles. In MolCode, we employ an autoregressive flow as the backbone framework to generate atom types, chemical bonds, and 3D coordinates sequentially. To encode intermediate 3D graphs, roto-translation equivariant graph neural networks (GNNs)45,46 are first used to obtain node embeddings. Note that MolCode is agnostic to the choice of encoding GNNs. Then, a novel attention mechanism with bond encoding enriches the embeddings with global context as well as bonding information. In the decoding process, we construct a local coordinate system based on local reference atoms and predict the relative coordinates, ensuring the equivariance property of atomic coordinates and the invariance property of likelihood. The generated 2D molecular graphs also help check the chemical validity of the generated molecules in each step. In our experiments, we show that MolCode outperforms existing generative models in generating diverse, valid, and realistic molecular graphs and structures from scratch. Further investigations on targeted molecule discovery show that MolCode can generate molecules with desirable properties that are scarce in the training set, demonstrating its strong capability of capturing structure-property relationships for generalization. Finally, we extend MolCode to the structure-based drug design task and manage to generate drug-like ligand molecules with high binding affinities.
Systematic hyperparameter analysis and ablation studies show that MolCode is robust to hyperparameters and the unified modeling of 2D topology and 3D geometry consistently improves molecular generation performance.
Results
Sequential Generation with Flow Models
Contrary to previous works that treat molecules solely as 2D graphs or 3D point clouds, we represent a molecule comprehensively as a 3D graph G = (V, A, R). Let a and b denote the number of atom types and bond types, respectively.
For a molecule with n atoms, $V \in \{0, 1\}^{n \times a}$ is the atom type matrix, $A \in \{0, 1\}^{n \times n \times (b+1)}$ is the adjacency matrix, and $R \in \mathbb{R}^{n \times 3}$ is the 3D atomic coordinate matrix. We add one additional edge type, which corresponds to no bond between two atoms. Following previous works like GraphAF21 and G-SchNet22, we formalize the problem of molecular graph generation as a sequential decision process (Fig. 1a and b). We can factorize the probability of a molecule P(V, A, R) as:
$$P(V, A, R) = \prod_{i=1}^{n} P(V_i, A_i, R_i \mid V_{:i-1}, A_{:i-1}, R_{:i-1}) = \prod_{i=1}^{n} P(V_i \mid V_{:i-1}, A_{:i-1}, R_{:i-1})\, P(A_i \mid V_i, V_{:i-1}, A_{:i-1}, R_{:i-1})\, P(R_i \mid V_i, A_i, V_{:i-1}, A_{:i-1}, R_{:i-1}),$$
where $V_{:i-1}$, $A_{:i-1}$ and $R_{:i-1}$ indicate the graph (V, A, R) restricted to the first i − 1 atoms, $V_i$ and $R_i$ represent the atom type and coordinates of the i-th atom, and $A_i$ denotes the connectivity of the i-th atom to the first i − 1 atoms. We employ a normalizing flow model47 to learn these probabilities. A flow model aims to learn a parameterized invertible function between the data point variable x and the latent variable z: $f_\theta : z \in \mathbb{R}^d \to x \in \mathbb{R}^d$. The latent distribution $p_Z$ is a pre-defined probability distribution, e.g., a Gaussian distribution. The data distribution $p_X$ is unknown, but given a data point x, its log-likelihood can be computed with the change-of-variable theorem:
$$\log p_X(x) = \log p_Z\big(f_\theta^{-1}(x)\big) + \log \left|\det J\right|, \tag{3}$$
where $J = \partial f_\theta^{-1}(x) / \partial x$ denotes the Jacobian matrix. To train the flow model on a molecule dataset, the log-likelihoods of all data points are computed from Eq. (3) and maximized via gradient ascent. In the sampling process, a latent variable z is first sampled from the pre-defined latent distribution $p_Z$. Then the corresponding data point x is obtained by performing the feedforward transformation $x = f_\theta(z)$. Therefore, $f_\theta$ needs to be invertible, and the computation of det J should be tractable so that training and sampling are efficient. A common choice is the affine coupling layers21,48,49, where the computation of det J is very efficient because J is an upper triangular matrix.
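The change-of-variable computation can be illustrated with the simplest invertible map, an element-wise affine transformation. The following is a minimal NumPy sketch under that assumption, not the MolCode implementation; the function names and the fixed `scale` and `shift` values are hypothetical (in a real flow they would be predicted by a network).

```python
import numpy as np

def affine_forward(z, scale, shift):
    """Sampling direction x = f(z): an element-wise affine map."""
    return scale * z + shift

def affine_inverse(x, scale, shift):
    """Inverse map z = f^{-1}(x), needed to evaluate the likelihood."""
    return (x - shift) / scale

def log_likelihood(x, scale, shift):
    """log p_X(x) = log p_Z(f^{-1}(x)) + log|det J| with J = d f^{-1}(x) / dx."""
    z = affine_inverse(x, scale, shift)
    # Log-density of z under a standard Gaussian prior.
    log_pz = -0.5 * np.sum(z ** 2 + np.log(2.0 * np.pi))
    # J is diagonal with entries 1 / scale, so log|det J| = -sum(log|scale|).
    log_det = -np.sum(np.log(np.abs(scale)))
    return log_pz + log_det

rng = np.random.default_rng(0)
scale = np.array([0.5, 2.0, 1.5])   # hypothetical values for illustration
shift = np.array([1.0, -1.0, 0.0])  # hypothetical values for illustration
z = rng.standard_normal(3)          # sample from the latent distribution
x = affine_forward(z, scale, shift) # corresponding data point
```

Because the Jacobian of an element-wise map is diagonal, its determinant is a product of the diagonal entries, which is what makes both training and sampling cheap in this toy case.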
Fig. 1. a, In the sequential generation, MolCode concurrently generates molecular 2D graphs and 3D structures. The joint probability of atom types, bond types, and coordinates can then be factorized into a chain of conditional probabilities. b, MolCode employs the normalizing flow as the backbone model and predicts atom types, bond types, and coordinates sequentially in each step. c, MolCode employs roto-translation equivariant Graph Neural Networks and multi-head self-attention with bond encoding for the conditional feature extraction from the intermediate 3D graph. d, For the generation of atomic coordinates, MolCode first constructs a local spherical coordinate system and generates the relative coordinates, i.e., d, θ, ϕ, which ensures the equivariance of coordinates and the invariance of likelihood.
Fig. 1 shows a schematic depiction of the MolCode architecture. At each generation step, we predict the new atom type, the bond types, and the 3D coordinates sequentially. We use an equivariant graph neural network to extract conditional information from intermediate molecular graphs. A novel multi-head self-attention network with bond encoding is proposed to further capture the global and bonding information. For the generation of atomic coordinates, MolCode first constructs a local spherical coordinate system and generates the relative coordinates, i.e., d, θ, ϕ, which ensures the equivariance of coordinates and the invariance of likelihood. In de novo molecule design and targeted molecule discovery, MolCode generates molecules from scratch. In structure-based drug design, which is a conditional generation task, the target protein pocket represented as a 3D graph is first input into MolCode. Then MolCode generates ligand molecules based on the protein pocket.
We train MolCode on a set of molecular structures and the corresponding molecular graphs can be obtained with toolkits in chemistry50,51. In the generation process, we check whether the generated bonds violate the valency constraints at each step. If the newly added bond breaks the valency constraint, we just reject it, sample a new latent variable and generate another new bond type. More details on the model architecture and training procedure can be found in the Methods section.
De novo Molecule Design
For virtual screening, the generative model should be able to sample a large quantity of valid and diverse molecules from scratch. In the random molecule generation task, we evaluate MolCode on the QM9 dataset52 consisting of 134k organic molecules with up to nine heavy atoms from carbon, nitrogen, oxygen, and fluorine. We use Validity, Uniqueness, and Novelty to evaluate the quality of the generated molecules: Validity calculates the percentage of valid molecules among all the generated molecules; Uniqueness is the percentage of unique molecules among all the valid molecules; Novelty measures the fraction of novel molecules among all the valid and unique ones. Specifically, the 3D molecular structures are first converted to 2D graphs, and the bond types (single, double, triple, or none) are determined based on the distances between pairs of atoms and the atom types51. A molecule is considered valid if it obeys the chemical valency rules; it is considered unique if its 2D molecular graph appears only once in the whole sampled molecule set, and novel if its 2D molecular graph does not exist in the training set. In Table 1, we compare MolCode with four state-of-the-art baselines including E-NFs53, G-SchNet22, G-SphereNet40, and EDM39 on 3D molecule generation. We also compare MolCode with its two variants, i.e., MolCode without validity check (MolCode w/o check) and MolCode without bond information (MolCode w/o bond), for ablation studies. All metrics are computed from 10,000 generated molecular structures. We observe that MolCode achieves the best performance in generating valid and diverse molecular structures (99.95% Validity, 98.75% Uniqueness). With the advantage of the generated bonds, MolCode can rectify the generation process when the valency constraints are violated, and therefore better explore the chemical space with the autoregressive flow framework.
Interestingly, even without a validity check, MolCode can still achieve Validity as high as 94.60%, which indicates the strong ability of MolCode to capture the underlying chemical rules by modeling the generation of bonds. In MolCode (w/o bond), the bonding information is not provided to the conditional information extraction block. The Validity drops from 99.95% to 92.12% and the Uniqueness drops from 98.75% to 94.32%, which also verifies the usefulness of bonding information in MolCode. Regarding Novelty, as discussed in previous work39, QM9 is an exhaustive enumeration of molecules that satisfy a predefined set of constraints, so the Novelty of MolCode is reasonable and acceptable.
Table 1. Validity calculates the percentage of valid molecules among all the generated molecules; Uniqueness refers to the percentage of unique molecules among the valid molecules; Novelty measures the fraction of molecules not in the training set among all the valid and unique molecules. The best results are bolded.
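The three metrics can be sketched in a few lines. This is an illustrative sketch rather than the evaluation script used for Table 1: it assumes each generated molecule has already been converted to a canonical 2D graph identifier (for example a canonical SMILES string) and that validity flags come from a separate valency check. It also counts Uniqueness as the share of distinct graphs among valid molecules, which is a common convention and an assumption here. The function name and toy inputs are hypothetical.

```python
def generation_metrics(generated, is_valid, training_set):
    """Return (validity, uniqueness, novelty) as fractions in [0, 1].

    generated    -- list of canonical graph identifiers, one per sample
    is_valid     -- list of bools from a separate valency check (assumed given)
    training_set -- set of canonical identifiers seen during training
    """
    valid = [g for g, ok in zip(generated, is_valid) if ok]
    unique = set(valid)                  # distinct valid graphs
    novel = unique - set(training_set)   # distinct valid graphs not in training data
    validity = len(valid) / len(generated) if generated else 0.0
    uniqueness = len(unique) / len(valid) if valid else 0.0
    novelty = len(novel) / len(unique) if unique else 0.0
    return validity, uniqueness, novelty

# Toy example: five samples, one of which violates valency.
sampled = ["CCO", "CCO", "C=O", "CC(C)(C)(C)C", "C#N"]
valid_flags = [True, True, True, False, True]
train = {"CCO"}
validity, uniqueness, novelty = generation_metrics(sampled, valid_flags, train)
```

Note that each metric is computed on the subset that passed the previous one, matching the nested definitions in the text.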
To further investigate how well our model fits the distribution of QM9, we conduct qualitative substructure analysis (Table S1). Specifically, we first collect the bond length/angle distributions in the generated molecules and the training dataset and then employ Kullback-Leibler (KL) divergence to measure the distance between them. We show several common bond and bond angle types. We observe that MolCode obtains much lower KL divergence than the other methods and its variant without bond information, indicating that the molecules generated by MolCode capture more geometric attributes of the data. Moreover, we show two sets of bond length distributions (carbon-carbon single bond and carbon-oxygen single bond) and two sets of bond angle distributions (carbon-carbon-carbon and carbon-carbon-oxygen chains) in Fig. 2a. Generally, the distributions of MolCode align well with those of QM9, indicating that the distances and angles between atoms are accurately modeled and reproduced.
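A histogram-based KL divergence between two sets of bond lengths can be sketched as follows. The binning, value range, smoothing constant, and the direction KL(reference || generated) are all assumptions made for illustration, and the bond lengths are synthetic numbers rather than data from QM9 or MolCode.

```python
import numpy as np

def histogram_kl(reference, generated, bins=50, value_range=(1.0, 2.0), eps=1e-10):
    """KL(reference || generated) between two binned empirical distributions."""
    p, _ = np.histogram(reference, bins=bins, range=value_range)
    q, _ = np.histogram(generated, bins=bins, range=value_range)
    # Smooth empty bins so the logarithm is defined, then normalize to sum to 1.
    p = p.astype(float) + eps
    q = q.astype(float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(0)
# Synthetic C-C single bond lengths in angstroms (illustrative values only).
reference = rng.normal(1.53, 0.02, size=5000)
close_model = rng.normal(1.53, 0.02, size=5000)
far_model = rng.normal(1.45, 0.05, size=5000)

kl_close = histogram_kl(reference, close_model)
kl_far = histogram_kl(reference, far_model)
kl_self = histogram_kl(reference, reference)
```

A generator whose bond-length histogram overlaps the reference closely yields a smaller value than one whose histogram is shifted or broadened.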
Fig. 2. a, Radial distribution functions for carbon-carbon single bond and carbon-oxygen single bond (first row) and angular distribution functions for bonded carbon-carbon-carbon and carbon-carbon-oxygen chains (second row) in the training data and in the molecules generated by MolCode. b, Histograms of calculated HOMO-LUMO gaps and isotropic polarizability for molecules generated with the biased MolCode (green curves), MolCode before biasing (purple curves), and for the QM9 dataset (blue curves). c, Bar plots showing the average numbers of atoms, bonds, and rings per molecule for QM9 and for molecules generated with MolCode. B1, B2, and B3 correspond to single, double, and triple bonds. R3, R4, R5, and R6 are rings of size 3 to 6.
Targeted Molecule Discovery
The ability to generate molecules with desirable properties that are either absent or rare in the training data (e.g., new materials) is quite useful for the targeted exploration of chemical space. Here we conduct two targeted molecule discovery experiments, namely minimizing the HOMO-LUMO gap and maximizing the isotropic polarizability. Following previous works22,40, we finetune the pretrained generative models on the collected biased datasets. Specifically, we collect from QM9 all molecular structures whose HOMO-LUMO gaps are smaller than 4.5 eV and all molecular structures whose isotropic polarizabilities are larger than 91 Bohr³ as the two biased datasets. Afterward, we generate 10,000 molecular structures with the finetuned model and compute the quantum properties (HOMO-LUMO gap and isotropic polarizability) with the PySCF package54,55. The performance is then evaluated by calculating the mean and optimal value over all property scores (Mean and Optimal) and the percentage of molecules with good properties (Good Percentage). Molecules with good properties are those with HOMO-LUMO gaps smaller than 4.5 eV or isotropic polarizabilities larger than 91 Bohr³, depending on the task.
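As a concrete reading of the three evaluation quantities, the sketch below computes Mean, Optimal, and Good Percentage from a list of property values. The function name and the four example gap values are hypothetical; only the 4.5 eV threshold comes from the text.

```python
import numpy as np

def targeted_metrics(values, threshold, minimize):
    """Return (mean, optimal, good_percentage) for one property.

    minimize=True  -- lower is better (e.g. HOMO-LUMO gap); good means value < threshold
    minimize=False -- higher is better (e.g. polarizability); good means value > threshold
    """
    values = np.asarray(values, dtype=float)
    if minimize:
        optimal = values.min()
        good = values < threshold
    else:
        optimal = values.max()
        good = values > threshold
    return float(values.mean()), float(optimal), float(100.0 * good.mean())

# Hypothetical HOMO-LUMO gaps in eV for four generated molecules.
gaps_ev = [3.9, 4.2, 4.6, 5.1]
mean_gap, best_gap, good_pct = targeted_metrics(gaps_ev, threshold=4.5, minimize=True)
```

For isotropic polarizability the same function would be called with `threshold=91` and `minimize=False`.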
The results of targeted molecule discovery for two quantum properties are shown in Table 2. For both properties, MolCode outperforms all the baseline methods and its variants without validity check and bonding information, demonstrating MolCode’s strong capability in capturing structure-property relationships and generating molecular structures with desirable properties. For instance, even though the biased datasets are only 3.20% and 2.04% of QM9 respectively, the fine-tuned MolCode achieves Good Percentages of 87.76% and 38.40%. We also illustrate the property distributions of QM9, MolCode, and biased MolCode in Fig. 2b. The property distributions of MolCode align well with those of the QM9 dataset, while the property distributions of the biased MolCode models shift towards smaller HOMO-LUMO gaps and larger isotropic polarizabilities, respectively.
Table 2. We aim to minimize the HOMO-LUMO gap and maximize the isotropic polarizability. The properties are calculated by PySCF and the best results are bolded. Good Percentage measures the ratio of molecules with HOMO-LUMO gaps smaller than 4.5 eV or isotropic polarizabilities larger than 91 Bohr³, respectively.
Fig. 2c reveals further insights into the structural statistics of the generated molecules. First, we observe that MolCode captures the atom, bond, and ring counts of the QM9 dataset accurately. Second, for the MolCode biased towards smaller HOMO-LUMO gaps, the generated molecules exhibit an increased number of nitrogen/oxygen atoms and double bonds in addition to a tendency towards forming six-atom rings. These features indicate the presence of aromatic rings with nitrogen/oxygen atoms and conjugated systems with alternating single and double bonds, which are important motifs in organic semiconductors with small HOMO-LUMO gaps. Finally, for the MolCode biased towards larger isotropic polarizability, the generated molecules contain more atoms, bonds, and rings, which are prerequisites for large isotropic polarizabilities.
Structure-based Drug Design
Designing ligand molecules that bind to target proteins is a fundamental and challenging task in drug discovery56. According to the lock-and-key model57,58, molecules that bind more tightly to a disease target are more likely to be drug candidates with higher bioactivity against the disease. Therefore, it is beneficial to take the structure of the target proteins into consideration when generating molecules for drug discovery. Here, we train MolCode on the CrossDocked2020 dataset59, which contains 22.5 million protein-molecule complexes, for structure-based drug design. Starting with the target protein pocket as the context, MolCode iteratively predicts the ligand atom types, bond types, and atom coordinates. We generate 100 ligand molecules for each target protein pocket in the test set. More details are included in the Methods section.
Fig. 3 shows the property distributions of the sampled ligand molecules. Following previous works41,44, we mainly focus on the following metrics: Vina Score measures the binding affinity between the generated molecules and the protein pockets; QED measures how likely a molecule is to be a potential drug candidate; Synthesizability (SA) represents the difficulty of drug synthesis (the score is normalized between 0 and 1 and higher values indicate easier synthesis). In our work, the Vina Score is calculated by QVina60,61, and the chemical properties are calculated by RDKit62 over the valid molecules. Before being fed to Vina, all the generated molecular structures are first refined by universal force fields63. Four competitive baselines including LiGAN64, AR41, GraphBP43, and Pocket2Mol44 are compared. We also show the distributions of the test set for reference. MolCode can generate ligand molecules with higher binding affinities (lower Vina scores) than baseline methods. Specifically, MolCode succeeds in generating molecules with higher affinity than the corresponding reference molecules for 61.8% of protein pockets on average. Moreover, the generated molecules also exhibit more potential to be drug candidates (higher QED and SA). These improvements indicate that MolCode effectively captures the distribution of 3D ligand molecules conditioned on binding sites with the graph-structure co-design scheme.
Fig. 3. The distributions of Vina scores, QED, and SA scores of the generated molecules. We also show the distributions of the test set for reference. Lower Vina scores and higher QED and SA indicate better ligand quality.
In Fig. 4, we further show several examples of generated 3D molecules with higher affinities to the target proteins than their corresponding reference molecules in the test set. The generated molecules with higher binding affinity also have diverse structures and are largely different from the reference molecules. This demonstrates that MolCode is capable of generating diverse and novel molecules that bind target proteins, instead of just memorizing and reproducing known molecules in the dataset, which is important for exploring novel drug candidates.
Fig. 4. Examples of the generated molecules with higher binding affinities than the references. We report the Vina scores; a lower Vina score indicates higher binding affinity.
Conclusion
In this article, we have reported a roto-translation equivariant generative framework for molecular graph-structure co-design from scratch or conditioned on the target protein pockets. Compared to existing methods that only represent and generate 2D topology graphs or 3D geometric structures, MolCode concurrently designs 2D molecular graphs and 3D structures and can capture complex molecular relationships well. Extensive experiments on de novo molecule design, targeted molecule discovery, and structure-based drug design demonstrate the effectiveness of our model. Our investigation demonstrates that the 2D topology and 3D geometry contain intrinsically complementary information for molecular representation and generation, and that their unified modeling greatly improves molecular generation performance.
There are also several potential extensions of MolCode as future work. First, MolCode may be extended and applied to significantly larger systems with more diverse atom types, such as proteins and crystal materials. Although MolCode has been trained on ligand-protein pocket complexes from the CrossDocked2020 dataset, modifications will be necessary to ensure further scalability and robustness65–69. Another potential improvement is to incorporate chemical priors such as ring structures into MolCode to generate more valid molecules and realistic 3D structures19,25. For example, the molecules may be generated fragment-by-fragment instead of atom-by-atom, which can also speed up the generation process. Furthermore, wet-lab experiments may be conducted to validate the effectiveness of MolCode. Overall, we anticipate that further developments in deep generative models will greatly accelerate and benefit various applications in material design and drug discovery.
Methods
Dataset
For the tasks of random molecule generation and targeted molecule discovery, we evaluate MolCode on the QM952 dataset. The QM9 dataset contains over 134k molecules and their corresponding 3D molecular geometries computed by density functional theory (DFT). In the random molecular geometry generation task, we randomly select 100k 3D molecular geometries as the training set and 10k 3D molecular geometries as the validation set. For the targeted molecule discovery, we collect all molecular geometries whose HOMO-LUMO gaps are smaller than 4.5 eV and all molecular geometries whose isotropic polarizabilities are larger than 91 Bohr³ as the finetuning datasets.
As for the structure-based drug design, we use the CrossDocked2020 dataset59, which contains 22.5 million protein-molecule structures, following refs. 41 and 44. We filter out data points whose binding pose RMSD is greater than 1 Å and molecules that cannot be sanitized with RDKit62, leading to a refined subset with around 160k data points. We use MMseqs270 to cluster data at 30% sequence identity, and randomly draw 100,000 protein-ligand pairs for training and 100 proteins from the remaining clusters for testing. For evaluation, we randomly sample 100 molecules for each protein pocket in the test set.
For all the tasks including random/targeted molecule generation and structure-based drug design, MolCode and all the other baseline methods are trained with the same data split for a fair comparison.
Overview of MolCode
Let a be the number of atom types, b be the number of bond types, and n denote the number of atoms in a molecule. We can represent the molecule as a 3D graph G = (V, A, R), where $V \in \{0, 1\}^{n \times a}$ is the atom type matrix, $A \in \{0, 1\}^{n \times n \times (b+1)}$ is the adjacency matrix, and $R \in \mathbb{R}^{n \times 3}$ is the 3D atomic coordinate matrix. Note that we add one additional edge type, which corresponds to no bond between two atoms. Here, each element $V_i$ in V and $A_{ij}$ in A is a one-hot vector. $V_{iu} = 1$ and $A_{ijv} = 1$ represent that the i-th atom has type u and that there is a type v bond between the i-th and j-th atoms, respectively. The i-th row of the coordinate matrix, $R_i$, represents the 3D Cartesian coordinates of the i-th atom.
We adopt the autoregressive flow framework47 to generate the atom type $V_i$ of the new atom, the bond types $A_{ij}$, and the 3D coordinates at each step. Since both the node type $V_i$ and the edge type $A_{ij}$ are discrete and therefore do not fit directly into a flow-based model, we adopt the dequantization method20,21 to convert them into continuous values by adding real-valued noise:
$$\tilde{V}_i = V_i + u, \; u \sim U(0, 1)^{a}; \qquad \tilde{A}_{ij} = A_{ij} + u, \; u \sim U(0, 1)^{b+1}, \tag{4}$$
where U(0, 1) is the uniform distribution over the interval (0, 1). To generate $V_i$ and $A_{ij}$, we first sample the latent variables $z_i^V \in \mathbb{R}^{a}$ and $z_{ij}^A \in \mathbb{R}^{b+1}$ from the standard Gaussian distribution 𝒩(0, 1), and then map $z_i^V$ and $z_{ij}^A$ to $\tilde{V}_i$ and $\tilde{A}_{ij}$ respectively by the following affine transformation:
$$\tilde{V}_i = s_i^V \odot z_i^V + t_i^V, \qquad \tilde{A}_{ij} = s_{ij}^A \odot z_{ij}^A + t_{ij}^A, \tag{5}$$
where ⊙ denotes element-wise multiplication. Both the scale factors ($s_i^V$ and $s_{ij}^A$) and the shift factors ($t_i^V$ and $t_{ij}^A$) depend on the conditional information extracted from the intermediate 3D graph $G_i$, which we will discuss later. After obtaining $\tilde{V}_i$ and $\tilde{A}_{ij}$, $V_i$ and $A_{ij}$ can be computed by taking the argmax, i.e., $V_i = \text{one-hot}(\arg\max \tilde{V}_i)$ and $A_{ij} = \text{one-hot}(\arg\max \tilde{A}_{ij})$.
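The dequantization and argmax steps can be illustrated with a short NumPy sketch. It is a toy illustration, not MolCode code: the helper names are hypothetical, and the scale and shift vectors are fixed by hand, whereas in the model they are predicted from the intermediate graph.

```python
import numpy as np

rng = np.random.default_rng(0)

def dequantize(one_hot, rng):
    """Add U(0, 1) noise to a one-hot vector to obtain a continuous vector."""
    return one_hot + rng.uniform(0.0, 1.0, size=one_hot.shape)

def to_one_hot(continuous):
    """Map a continuous vector back to a one-hot vector via argmax."""
    out = np.zeros_like(continuous)
    out[np.argmax(continuous)] = 1.0
    return out

num_atom_types = 4                    # e.g. C, N, O, F
v = np.eye(num_atom_types)[2]         # one-hot atom type with index 2
v_tilde = dequantize(v, rng)          # training-time continuous target
recovered = to_one_hot(v_tilde)       # argmax recovers the discrete type

# Sampling direction: latent Gaussian -> affine map -> argmax.
scale = np.full(num_atom_types, 0.3)  # hand-picked for illustration
shift = np.array([0.2, 1.4, 0.1, 0.3])
z = rng.standard_normal(num_atom_types)
sampled = to_one_hot(scale * z + shift)
```

The hot entry of a dequantized vector lies in [1, 2) while all other entries lie in [0, 1), so the argmax always recovers the original discrete type.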
However, it is non-trivial to generate coordinates that satisfy equivariance to rigid transformations and the invariance property of likelihood. Inspired by G-SchNet22, MolGym71, and G-SphereNet40, we choose to construct a local spherical coordinate system (SCS) and generate the distance $d_i$, the angle $\theta_i$, and the torsion angle $\phi_i$ w.r.t. the constructed local SCS. Specifically, we first choose a focal atom among all atoms in $G_i$, which serves as the reference point for the new atom. The new atom is expected to be placed in the local region of the selected focal atom. Assume that the focal node is the f-th node in $G_i$. First, the distance $d_i$ from the focal atom to the new atom is generated, i.e., $d_i = \lVert R_i - R_f \rVert$. Then, if i ≥ 2, the angle $\theta_i \in [0, \pi]$ between the lines $(R_f, R_i)$ and $(R_f, R_c)$ is generated, where c is the closest atom to the focal atom in $G_i$. Finally, if i ≥ 3, the torsion angle $\phi_i \in [-\pi, \pi]$ formed by the planes $(R_f, R_c, R_i)$ and $(R_f, R_c, R_e)$ is generated, where e denotes the atom closest to c but different from f in $G_i$. Similar to $\tilde{V}_i$ and $\tilde{A}_{ij}$, $d_i$, $\theta_i$, and $\phi_i$ can be obtained by:
$$d_i = s_i^d \, z_i^d + t_i^d, \qquad \theta_i = s_i^\theta \, z_i^\theta + t_i^\theta, \qquad \phi_i = s_i^\phi \, z_i^\phi + t_i^\phi, \tag{8}$$
where $z_i^d$, $z_i^\theta$, and $z_i^\phi$ are latent variables sampled from standard Gaussian distributions, and the scale factors $s_i^d$, $s_i^\theta$, $s_i^\phi$ and the shift factors $t_i^d$, $t_i^\theta$, $t_i^\phi$ are functions of $G_i$. The coordinate $R_i$ of the new atom is computed from the relative coordinates $d_i$, $\theta_i$, $\phi_i$ and the reference atoms (f, c, e), hence satisfying the roto-translation equivariance property.
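To make the equivariance argument concrete, the sketch below converts relative coordinates (d, θ, ϕ) into a Cartesian position using a frame built from the three reference atoms, and then repeats the placement after a random rotation and translation of the references. The frame construction and the sign convention of the torsion angle are assumptions chosen for illustration; the function name and numeric values are hypothetical, and the three reference atoms are assumed not to be collinear.

```python
import numpy as np

def place_atom(r_f, r_c, r_e, d, theta, phi):
    """Place a new atom from (d, theta, phi) relative to reference atoms f, c, e."""
    # Unit vector from the focal atom f to its closest atom c (polar axis).
    u = (r_c - r_f) / np.linalg.norm(r_c - r_f)
    # Component of (e - f) orthogonal to u: fixes the zero of the torsion angle.
    w = (r_e - r_f) - np.dot(r_e - r_f, u) * u
    w = w / np.linalg.norm(w)
    # Third axis completing a right-handed orthonormal frame.
    v = np.cross(u, w)
    direction = np.cos(theta) * u + np.sin(theta) * (np.cos(phi) * w + np.sin(phi) * v)
    return r_f + d * direction

r_f = np.array([0.0, 0.0, 0.0])
r_c = np.array([1.5, 0.0, 0.0])
r_e = np.array([2.0, 1.2, 0.0])
d, theta, phi = 1.4, 1.9, 0.8
r_i = place_atom(r_f, r_c, r_e, d, theta, phi)

# Apply a random proper rotation q and a translation to the reference atoms.
rng = np.random.default_rng(0)
q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
if np.linalg.det(q) < 0:
    q[:, 0] *= -1.0  # flip one axis so that det(q) = +1
shift = rng.standard_normal(3)
moved = [r @ q.T + shift for r in (r_f, r_c, r_e)]
r_i_moved = place_atom(*moved, d, theta, phi)
```

Because the frame is built only from the reference atoms, rotating and translating them moves the placed atom in exactly the same way, while (d, θ, ϕ) themselves are unchanged.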
Encoder
Generating the atom type, covalent bonds, and 3D position at each step requires capturing the conditional information of the intermediate graph $G_i$ with an equivariant encoder. In MolCode, we use SphereNet45 for the QM9 dataset and EGNN46 for the CrossDocked2020 dataset to obtain the node embeddings. Note that MolCode is agnostic to the choice of equivariant graph neural network. SphereNet can capture the complete geometric information inside molecular structures, including bond lengths, bond angles, and dihedral angles, but can hardly scale to large molecules due to its computational complexity. In contrast, EGNN only encodes pairwise distances between atoms and is more efficient than SphereNet on systems with more atoms, e.g., ligand-protein pocket complexes. For the input graph $G_i$, let the node embedding matrix computed by the 3D GNN be $H = [h_1, h_2, \ldots]^{\top}$, with one row per atom in $G_i$, where $h_j \in \mathbb{R}^{d}$ is the embedding of the j-th atom and d is the embedding dimension.
To further encode the information of covalent bonds and capture the global information in the molecule graph, we modify the self-attention mechanism72 and propose a novel bond encoding. The multi-head self-attention (MHA) with bond encoding is calculated as:
where Con(·) denotes the concatenation operation, Emb($A_{ij}$) is the embedding of the bond between the i-th and j-th atoms, K is the number of attention heads, and the per-head projection matrices and $W^O$ are learnable matrices.
In MolCode, we use SphereNet45 with 4 layers or EGNN46 with 6 layers to extract features from the intermediate 3D graphs, where the input embedding size is set to 64 and the output embedding size is set to 256. The cutoff is set to 5 Å. The node features are initialized to the one-hot vectors of atom types and the edge features are initialized by spherical basis functions. In the multi-head self-attention module with bond encoding, there are 4 attention heads. In addition, we employ 6 flow layers with a hidden dimension of 128 for the decoder. We use the same model configuration for all the experiments.
Decoder
To generate new atoms, the scale factor $s_i^V$ and shift factor $t_i^V$ in Eq. (5) can be computed as:
$$s_i^V, \; t_i^V = \mathrm{MLP}^V\big(\mathrm{MHA}^V(H)_f\big),$$
where $\mathrm{MLP}^V$ is a multi-layer perceptron and $\mathrm{MHA}^V(H)_f$ denotes the f-th node embedding from the output of the multi-head self-attention network. With the predicted new atom $V_i$, we update H using the atom type embedding Emb($V_i$) and predict the scale factor $s_{ij}^A$ and shift factor $t_{ij}^A$ in Eq. (5) from the updated embeddings. The scale and shift factors in Eq. (8) are likewise predicted from the output of the multi-head self-attention network, using the node embedding of the newly added atom.
As for the focal atom selection, we employ a multi-layer perceptron (MLP) classifier with the atom embeddings as input. Atoms whose valence is not yet filled are labeled 1, and all other atoms are labeled 0. In the structure-based drug design task, where there is no ligand atom at the beginning, the focal atoms at the first step are defined as protein atoms that have ground-truth ligand atoms within 4 Å. After the generation of the first ligand atom, MolCode selects focal atoms from the generated ligand atoms. At the inference stage, we randomly choose the focal atom f from the atoms whose classification scores are higher than 0.5. The sequential generation process stops if all the classification scores are lower than 0.5 or if no bond is generated between the newly added atom and the previously generated atoms.
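The inference-time selection rule can be sketched as below. The function name and the example scores are hypothetical, and the scores are assumed to be the classifier's per-atom outputs in [0, 1]; returning `None` stands for the stopping condition in which no atom passes the threshold.

```python
import numpy as np

def select_focal_atom(scores, rng, threshold=0.5):
    """Pick a focal atom uniformly at random among atoms scoring above the threshold.

    Returns the atom index, or None when no atom qualifies (generation stops).
    """
    candidates = np.flatnonzero(np.asarray(scores) > threshold)
    if candidates.size == 0:
        return None
    return int(rng.choice(candidates))

rng = np.random.default_rng(0)
focal = select_focal_atom([0.1, 0.8, 0.3, 0.9], rng)   # atoms 1 and 3 qualify
stopped = select_focal_atom([0.1, 0.2, 0.4], rng)      # no atom qualifies
```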
Validity Filter
The graph-structure codesign scheme in MolCode makes it feasible to check the chemical validity based on the generated 2D graphs at each step. Specifically, we explicitly consider the valency constraints during sampling to check whether current bonds have exceeded the allowed valency. The valency constraint is defined as:
where $|A_{ij}|$ denotes the order of the chemical bond $A_{ij}$. If the newly added bond breaks the valency constraint, we reject the bond $A_{ij}$, sample a new $z_{ij}$ in the latent space, and generate another bond type.
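A minimal sketch of this reject-and-resample check is given below. Several parts are assumptions rather than details from the paper: the `MAX_VALENCY` table holds typical valencies for a few elements, `propose_order` is a hypothetical stand-in for decoding a freshly sampled latent $z_{ij}$ into a bond order, and falling back to "no bond" after `max_tries` rejections is a design choice made here for illustration.

```python
from typing import Callable, Dict, Sequence, Tuple

# Assumed typical maximum valencies; not taken from the paper.
MAX_VALENCY = {"H": 1, "C": 4, "N": 3, "O": 2, "F": 1}

Bonds = Dict[Tuple[int, int], int]  # (atom i, atom j) -> bond order


def used_valency(bonds: Bonds, atom: int) -> int:
    """Sum of bond orders over all bonds that touch the given atom."""
    return sum(order for (i, j), order in bonds.items() if atom in (i, j))


def bond_is_valid(atom_types: Sequence[str], bonds: Bonds,
                  i: int, j: int, order: int) -> bool:
    """Check that adding a bond of this order keeps both atoms within valency."""
    if order == 0:  # order 0 means "no bond", which is always allowed
        return True
    return all(used_valency(bonds, a) + order <= MAX_VALENCY[atom_types[a]]
               for a in (i, j))


def sample_valid_bond(atom_types: Sequence[str], bonds: Bonds, i: int, j: int,
                      propose_order: Callable[[], int], max_tries: int = 100) -> int:
    """Resample bond orders until one satisfies the valency constraint."""
    for _ in range(max_tries):
        order = propose_order()  # stand-in for decoding a new latent sample
        if bond_is_valid(atom_types, bonds, i, j, order):
            return order
    return 0  # assumed fallback: generate no bond
```

With `atom_types = ["C", "O", "H"]` and an existing double bond `{(0, 1): 2}`, a single bond from the oxygen to the hydrogen is rejected because the oxygen is already saturated, while a single bond from the carbon to the hydrogen is accepted.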
Model Training and Inference
To ensure that generated atoms lie in the local region of their corresponding reference atoms, we use Prim’s algorithm to obtain the generation order of atoms, with edge weights set to the distances between atoms. The first atom of each molecule is randomly sampled in each epoch to encourage generalization. With the resulting trajectories, MolCode is trained by stochastic gradient descent with the following loss function. For a 3D molecular graph G with n atoms (n > 3), we maximize its log-likelihood in Eqs. (18) and (19) to train the MolCode model. In addition, the atom-wise classifier for focal atom selection is trained with a binary cross-entropy loss.
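The ordering step can be sketched with a heap-based version of Prim's algorithm. This sketch assumes a complete graph over all atoms with Euclidean distances as edge weights; the paper does not state in this section whether edges are restricted (for example to bonded pairs), and the function name and the returned `(atom, reference)` pairs are illustrative choices.

```python
import heapq
import math
from typing import List, Sequence, Tuple


def prim_generation_order(coords: Sequence[Sequence[float]],
                          start: int = 0) -> List[Tuple[int, int]]:
    """Return (atom, reference_atom) pairs in Prim's spanning-tree order.

    Each new atom is the unvisited atom closest to any visited atom, and its
    reference is the visited atom it is closest to. The start atom is its own
    reference.
    """
    n = len(coords)
    visited = [False] * n
    order: List[Tuple[int, int]] = []
    heap = [(0.0, start, start)]  # entries are (distance, atom, reference)
    while heap and len(order) < n:
        _, atom, ref = heapq.heappop(heap)
        if visited[atom]:
            continue  # stale entry for an atom that was already added
        visited[atom] = True
        order.append((atom, ref))
        for other in range(n):
            if not visited[other]:
                dist = math.dist(coords[atom], coords[other])
                heapq.heappush(heap, (dist, other, atom))
    return order


coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
order = prim_generation_order(coords, start=0)
```

To mimic the random choice of the first atom in each epoch, `start` could be drawn with `random.randrange(len(coords))`.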
In the random molecule generation task, MolCode is trained with the Adam73 optimizer for 100 epochs, with a learning rate of 0.0001 and a batch size of 64. We report the results corresponding to the epoch with the best validation loss. Training MolCode from scratch takes around 36 hours on a single Tesla V100 GPU. In the targeted molecule discovery task, the model is fine-tuned with a learning rate of 0.0001 and a batch size of 32. The number of training epochs is 40 for the HOMO-LUMO gap and 80 for the isotropic polarizability. In the structure-based drug design task, we train MolCode with the Adam optimizer for 100 epochs with a learning rate of 0.0001 and a batch size of 8. β1 and β2 in Adam are set to 0.9 and 0.999, respectively. For all tasks, including random/targeted molecule generation and structure-based drug design, MolCode and all baseline methods are trained with the same data split for a fair comparison. We run the code provided by the authors to obtain the results of the baseline methods.
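For quick reference, the per-task settings stated above can be written as a lookup table. The dictionary layout and key names are hypothetical; only the numbers are taken from the text, and the optimizer used for fine-tuning is not stated in this section, so it is not recorded here.

```python
# Adam coefficients (beta1, beta2) as stated in the text.
ADAM_BETAS = (0.9, 0.999)

# Hypothetical key names; values are the settings reported per task.
TRAINING_SETTINGS = {
    "random_generation": {"epochs": 100, "learning_rate": 1e-4, "batch_size": 64},
    "targeted_homo_lumo_gap": {"epochs": 40, "learning_rate": 1e-4, "batch_size": 32},
    "targeted_polarizability": {"epochs": 80, "learning_rate": 1e-4, "batch_size": 32},
    "structure_based_drug_design": {"epochs": 100, "learning_rate": 1e-4, "batch_size": 8},
}

for task, settings in TRAINING_SETTINGS.items():
    print(f"{task}: {settings['epochs']} epochs, "
          f"lr={settings['learning_rate']}, batch size={settings['batch_size']}")
```

The learning rate is the same in every task; only the number of epochs and the batch size differ.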
During generation, we use temperature hyperparameters in the prior Gaussian distributions. Specifically, we set the standard deviation of each Gaussian distribution to the corresponding temperature hyperparameter. To choose the values, we perform a grid search over {0.3, 0.5, 0.7} based on Validity and Uniqueness in random molecule generation, to encourage generating more valid and diverse molecules. By default, we use 0.5 for sampling the latent variables of atom types, 0.5 for bond types, 0.3 for bond lengths, 0.3 for bond angles, and 0.7 for torsion angles. These choices reflect the following insight: to generate valid and diverse molecules, the latent variables for bond lengths and bond angles are assigned small temperature hyperparameters (low variance), since the values of a given type of bond length or angle are largely fixed. In contrast, torsion angles are more flexible in molecules, so their temperature hyperparameter is larger. We use the same fixed temperature hyperparameters for the targeted molecule discovery and structure-based drug design experiments. In Fig. S1, we show the hyperparameter analysis with respect to the temperature hyperparameters for atom types and bond types, whose default values are set to 0.5. MolCode is generally robust to the choice of hyperparameters and can further benefit from appropriate hyperparameter values.
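Temperature-scaled sampling from the Gaussian prior can be sketched as follows. The dictionary keys are descriptive labels chosen here, and mapping the two 0.5 values to atom types and bond types is an interpretation of the text rather than notation from the paper; the values for bond lengths, bond angles, and torsion angles follow the stated reasoning directly.

```python
import random
import statistics
from typing import List

# Temperature = standard deviation of the zero-mean Gaussian prior.
# Key names are assumptions made for this sketch.
TEMPERATURES = {
    "atom_type": 0.5,
    "bond_type": 0.5,
    "bond_length": 0.3,
    "bond_angle": 0.3,
    "torsion_angle": 0.7,
}

GRID = (0.3, 0.5, 0.7)  # candidate values used in the grid search


def sample_latent(kind: str, dim: int, rng: random.Random) -> List[float]:
    """Draw a latent vector from N(0, temperature^2) for the given variable."""
    sigma = TEMPERATURES[kind]
    return [rng.gauss(0.0, sigma) for _ in range(dim)]


rng = random.Random(0)
torsion_samples = sample_latent("torsion_angle", 20000, rng)
length_samples = sample_latent("bond_length", 20000, rng)
torsion_std = statistics.pstdev(torsion_samples)
length_std = statistics.pstdev(length_samples)
```

A lower temperature concentrates the samples near the mode of the prior, which matches the intent of keeping bond lengths and angles close to their typical values while leaving torsion angles more spread out.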
Algorithms 1 and 2 show the pseudocode of the training and generation processes of MolCode for random/targeted molecule generation. Note that, to scale to large molecules in experiments, bonds are only generated and predicted between new atoms and the reference atoms. The pseudocode of MolCode for structure-based drug design is similar to Algorithms 1 and 2, except that the ligand atoms are generated conditioned on the protein pocket instead of from scratch.
Data availability
The data necessary to reproduce our numerical benchmark results are publicly available at https://github.com/divelab/DIG and https://github.com/gnina/models.
Code availability
The code used in the study is publicly available from the GitHub repository: https://github.com/zaixizhang/MolCode.
Author contributions statement
Z.X.Z., Q.L., C.L., C.H., and E.H.C. designed the research; Z.X.Z. conducted the experiments; Z.X.Z., Q.L., and C.L. analyzed the results. All authors reviewed the manuscript.
Competing interests
The authors declare no competing interests.
Additional information
Correspondence and requests for material should be addressed to Qi Liu.
Supplementary Information
The KL divergence of the bond lengths (upper part) and bond angles (lower part) between the training set and the generated molecules are shown below.
Influence of temperature hyperparameters on Validity and Uniqueness in the random molecule generation task.
Acknowledgements
This research was partially supported by grants from the National Natural Science Foundation of China (Grant Nos. 61922073 and U20A20229).