## ABSTRACT

Designing molecules with desirable physicochemical properties and functionalities is a long-standing challenge in chemistry, materials science, and drug discovery. Recently, machine learning-based generative models have emerged as promising approaches for *de novo* molecule design. However, further methodological refinement is needed, as most existing methods lack unified modeling of 2D topology and 3D geometry information and fail to effectively learn the structure-property relationship for molecule design. Here we present MolCode, a roto-translation equivariant generative framework for Molecular graph-structure Co-design. In MolCode, 3D geometric information empowers the molecular 2D graph generation, which in turn helps guide the prediction of molecular 3D structure. Extensive experimental results show that MolCode outperforms previous methods on a series of challenging tasks including *de novo* molecule design, targeted molecule discovery, and structure-based drug design. In particular, MolCode not only consistently generates valid (99.95% Validity) and diverse (98.75% Uniqueness) molecular graphs/structures with desirable properties, but also generates drug-like molecules with high affinity to target proteins (61.8% high affinity ratio), which demonstrates MolCode’s potential applications in material design and drug discovery. Our extensive investigation reveals that the 2D topology and 3D geometry contain intrinsically complementary information in molecule design, and provides new insights into machine learning-based molecule representation and generation.

## Introduction

Designing molecules with desirable characteristics is of fundamental importance in many applications, ranging from drug discovery^{1–3} and catalysis^{4} to semiconductors^{5,6}. However, the size of the chemical space is estimated^{7} to be on the order of 10^{60}, which precludes an exhaustive computational or experimental search of possible molecular candidates. In recent years, advances in machine learning (ML) methods have greatly accelerated the exploration of chemical compound space^{8–19}. Many studies propose to generate 2D/3D molecules and optimize molecular properties with deep generative models^{20–25}.

Molecules can be naturally represented as 2D graphs where nodes denote atoms, and edges represent covalent bonds. Such concise representation has motivated a series of studies in the tasks of molecule design and optimization. These works either predict the atom type and adjacency matrix of the graph^{26–29}, or employ autoregressive models to sequentially add nodes and edges^{21,30}. Furthermore, some methods leverage the chemical priors of molecular fragments/motifs and propose to generate molecular graphs fragment-by-fragment^{31,32}. However, complete information about a molecule cannot be obtained from these methods since the 3D structures of molecules are still unknown, which limits their practical applications. Due to intramolecular interactions or rotations of structural motifs, the same molecular graph can correspond to various spatial conformations with different quantum properties^{33–37}. Therefore, molecular generative models considering 3D geometry information are desired to better learn structure-property relationships.

Recently, some studies have characterized molecules as 3D point clouds where each point has atom features (*e*.*g*., atom types) and 3D coordinates, and corresponding generative models have been proposed for 3D molecule design. These methods include estimating pairwise distances between atoms^{38}, employing diffusion models to predict atom types and coordinates of all atoms^{39}, and using autoregressive models to place atoms in 3D space step-by-step^{22,24,40}. Since molecular drugs inhibit or activate particular biological functions by binding to the target proteins, another line of work further proposes generating 3D molecules inside the target protein pocket, which is a complex conditional generation task^{41–44}. However, most of these methods do not explicitly consider chemical bonds and valency constraints and may generate molecules that are not chemically valid. Moreover, the lack of bonding information also inhibits the generation of realistic substructures (*e*.*g*., benzene rings).

In this work, we propose MolCode, a roto-translation equivariant generative model for Molecular graph-structure Co-design from scratch or conditioned on the target protein pockets. Our model is motivated by the intuition that *the information of the 2D graph and 3D structure is intrinsically complementary to each other in molecule generation*: the 3D geometric structure information empowers the generation of chemical bonds, and the bonding information can in turn guide the prediction of 3D coordinates to generate more realistic substructures by constraining the search space of bond lengths/angles. In MolCode, we employ an autoregressive flow as the backbone framework to generate atom types, chemical bonds, and 3D coordinates sequentially. To encode intermediate 3D graphs, roto-translation equivariant graph neural networks (GNNs)^{45,46} are first used to obtain node embeddings. Note that MolCode is agnostic to the choice of encoding GNNs. Then, a novel attention mechanism with bond encoding enriches embeddings with global context as well as bonding information. In the decoding process, we construct a local coordinate system based on local reference atoms and predict the relative coordinates, ensuring the equivariance property of atomic coordinates and the invariance property of likelihood. The generated 2D molecular graphs also help check the chemical validity of the generated molecules in each step. In our experiments, we show that MolCode outperforms existing generative models in generating diverse, valid, and realistic molecular graphs and structures from scratch. Further investigations on targeted molecule discovery show that MolCode can generate molecules with desirable properties that are scarce in the training set, demonstrating its strong capability of capturing structure-property relationships for generalization. Finally, we extend MolCode to the structure-based drug design task and manage to generate drug-like ligand molecules with high binding affinities. Systematic hyperparameter analysis and ablation studies show that MolCode is robust to hyperparameters and that the unified modeling of 2D topology and 3D geometry consistently improves molecular generation performance.

## Results

### Sequential Generation with Flow Models

Contrary to previous works that treat molecules solely as 2D graphs or 3D point clouds, a molecule is comprehensively represented as a 3D graph *G* = (*V, A, R*) in this work. Let *a* and *b* denote the numbers of atom types and bond types, respectively.

For a molecule with *n* atoms, *V* ∈ {0, 1} ^{n×a} is the atom type matrix, *A* ∈ {0, 1} ^{n×n×(b+1)} is an adjacency matrix, and *R* ∈ ℝ^{n×3} is the 3D atomic coordinate matrix. We add one additional type of edge between two atoms, which corresponds to no edge between two atoms. Following previous works like GraphAF^{21} and G-SchNet^{22}, we formalize the problem of molecular graph generation as a sequential decision process (Fig. 1a and b). We can factorize the probability of molecule *P*(*V, A, R*) as:

$$
P(V, A, R) = \prod_{i=1}^{n} P\left(V_i, A_i, R_i \mid V_{:i-1}, A_{:i-1}, R_{:i-1}\right),
$$
where *V*_{:i−1}, *A*_{:i−1} and *R*_{:i−1} indicate the graph (*V, A, R*) restricted to the first *i* − 1 atoms, *V*_{i} and *R*_{i} represent the atom type and coordinates of the *i*-th atom, and *A*_{i} denotes the connectivity of the *i*-th atom to the first *i* − 1 atoms. We employ a normalizing flow model^{47} to learn such probabilities. A flow model aims to learn a parameterized invertible function between the data point variable *x* and the latent variable *z*: *f*_{θ} : *z* ∈ ℝ^{d} → *x* ∈ ℝ^{d}. The latent distribution *p*_{Z} is a pre-defined probability distribution, *e*.*g*., a Gaussian distribution. The data distribution *p*_{X} is unknown, but given a data point *x*, its log-likelihood can be computed with the change-of-variable theorem:

$$
\log p_X(x) = \log p_Z\left(f_\theta^{-1}(x)\right) + \log \left| \det J \right|, \qquad J = \frac{\partial f_\theta^{-1}(x)}{\partial x}, \tag{3}
$$

where *J* denotes the Jacobian matrix. To train the flow model on a molecule dataset, the log-likelihoods of all data points are computed from Eq. (3) and maximized via gradient ascent. In the sampling process, a latent variable *z* is first sampled from the pre-defined latent distribution *p*_{Z}. Then the corresponding data point *x* is obtained by performing the feedforward transformation *x* = *f*_{θ} (*z*). Therefore, *f*_{θ} needs to be invertible, and the computation of det*J* should be tractable for training and sampling efficiency. A common choice is the affine coupling layers^{21,48,49}, where the computation of det*J* is very efficient because *J* is an upper triangular matrix.
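The change-of-variable computation can be illustrated with a minimal sketch, assuming a single element-wise affine map *x* = *s* ⊙ *z* + *t* with a standard Gaussian prior. The function names below are illustrative and are not taken from the MolCode code base.

```python
import numpy as np

def affine_forward(z, s, t):
    """Sampling direction: x = s * z + t (element-wise)."""
    return s * z + t

def affine_inverse(x, s, t):
    """Inference direction: z = (x - t) / s, which requires s != 0."""
    return (x - t) / s

def log_likelihood(x, s, t):
    """log p_X(x) = log p_Z(z) + log|det J| with J = dz/dx = diag(1 / s)."""
    z = affine_inverse(x, s, t)
    log_pz = -0.5 * np.sum(z ** 2 + np.log(2.0 * np.pi))  # standard Gaussian prior
    log_det = -np.sum(np.log(np.abs(s)))                  # diagonal Jacobian
    return log_pz + log_det

rng = np.random.default_rng(0)
s = np.array([0.5, 2.0, 1.5])   # made-up scale factors
t = np.array([1.0, -1.0, 0.0])  # made-up shift factors
z = rng.standard_normal(3)
x = affine_forward(z, s, t)
print(log_likelihood(x, s, t))
```

Because the Jacobian of this map is diagonal, its log-determinant is a plain sum, which is the property that keeps training and sampling cheap.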

Fig. 1 shows a schematic depiction of the MolCode architecture. At each generation step, we predict the new atom type, bond types, and the 3D coordinates sequentially. We use an equivariant graph neural network to extract conditional information from intermediate molecular graphs. A novel multi-head self-attention network with bond encoding is proposed to further capture the global and bonding information. For the generation of atomic coordinates, MolCode first constructs a local spherical coordinate system and generates the relative coordinates, *i*.*e*., *d, θ, ϕ*, which ensures the equivariance of coordinates and the invariance of likelihood. In *de novo* molecule design and targeted molecule discovery, MolCode generates molecules from scratch. In structure-based drug design, which is a conditional generation task, the target protein pocket, represented as a 3D graph, is first input into MolCode. Then MolCode generates ligand molecules based on the protein pocket.

We train MolCode on a set of molecular structures, and the corresponding molecular graphs can be obtained with toolkits in chemistry^{50,51}. In the generation process, we check whether the generated bonds violate the valency constraints at each step. If a newly added bond breaks the valency constraint, we reject it, sample a new latent variable, and generate another bond type. More details on the model architecture and training procedure can be found in the Methods section.

### *De novo* Molecule Design

For virtual screening, the generative model should be able to sample a large quantity of valid and diverse molecules from scratch. In the random molecule generation task, we evaluate MolCode on the QM9 dataset^{52}, which consists of 134k organic molecules with up to nine heavy atoms from carbon, nitrogen, oxygen, and fluorine. We use Validity, Uniqueness, and Novelty to evaluate the quality of the generated molecules: Validity is the percentage of valid molecules among all the generated molecules; Uniqueness is the percentage of unique molecules among all the valid molecules; Novelty is the fraction of novel molecules among all the valid and unique ones. Specifically, the 3D molecular structures are first converted to 2D graphs, and the bond types (single, double, triple, or none) are determined based on the distances between pairs of atoms and the atom types^{51}. A molecule is considered valid if it obeys the chemical valency rules, unique if its 2D molecular graph appears only once in the whole sampled molecule set, and novel if its 2D molecular graph does not exist in the training set. In Table 1, we compare MolCode with four state-of-the-art baselines, E-NFs^{53}, G-SchNet^{22}, G-SphereNet^{40}, and EDM^{39}, on 3D molecule generation. We also compare MolCode with two of its variants, *i*.*e*., MolCode without validity check (MolCode w/o check) and MolCode without bond information (MolCode w/o bond), for ablation studies. All metrics are computed from 10,000 generated molecular structures. We observe that MolCode achieves the best performance in generating valid and diverse molecular structures (99.95% Validity, 98.75% Uniqueness). With the advantage of the generated bonds, MolCode can rectify the generation process when the valency constraints are violated, and therefore better explore the chemical space with the autoregressive flow framework. Interestingly, even without a validity check, MolCode still achieves a Validity as high as 94.60%, which indicates the strong ability of MolCode to capture the underlying chemical rules by modeling the generation of bonds. In MolCode (w/o bond), the bonding information is not provided to the conditional information extraction block. The Validity drops from 99.95% to 92.12% and the Uniqueness drops from 98.75% to 94.32%, which also verifies the usefulness of bonding information in MolCode. Regarding Novelty, previous work^{39} has pointed out that QM9 is an exhaustive enumeration of molecules that satisfy a predefined set of constraints, so the Novelty of MolCode is reasonable and acceptable.
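The three metrics can be sketched as follows, under stated assumptions: each generated molecule is summarized by a canonical string of its 2D graph, `None` marks a molecule that failed the valency check, and Uniqueness is read as the number of distinct graphs divided by the number of valid molecules (one common convention, which may differ in detail from the evaluation scripts used for Table 1). The function name and toy strings are hypothetical.

```python
def generation_metrics(generated, training_set):
    """Return (validity, uniqueness, novelty) as fractions in [0, 1].

    generated: list of canonical graph strings; None marks an invalid molecule.
    training_set: iterable of canonical graph strings seen during training.
    """
    valid = [g for g in generated if g is not None]
    unique = set(valid)                    # distinct valid graphs
    novel = unique - set(training_set)     # distinct graphs absent from training
    validity = len(valid) / len(generated) if generated else 0.0
    uniqueness = len(unique) / len(valid) if valid else 0.0
    novelty = len(novel) / len(unique) if unique else 0.0
    return validity, uniqueness, novelty

# Toy data, made up for illustration only.
samples = ["CCO", "CCO", "C=O", None, "CC#N"]
train = {"CCO", "CC"}
print(generation_metrics(samples, train))
```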

To further investigate how well our model fits the distribution of QM9, we conduct a quantitative substructure analysis (Table S1). Specifically, we first collect the bond length/angle distributions in the generated molecules and the training dataset and then employ the Kullback-Leibler (KL) divergence to measure the distance between them. We show several common bond and bond angle types. We observe that MolCode obtains a much lower KL divergence than the other methods and its variant without bond information, indicating that the molecules generated by MolCode capture more geometric attributes of the data. Moreover, we show two sets of bond length distributions (carbon-carbon single bond and carbon-oxygen single bond) and two sets of bond angle distributions (carbon-carbon-carbon and carbon-carbon-oxygen chains) in Fig. 2a. Generally, the distributions of MolCode align well with those of QM9, indicating that the distances and angles between atoms are accurately modeled and reproduced.
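A histogram-based estimate of this KL divergence might look like the sketch below. The binning, the value range, the smoothing constant, and the direction KL(reference ‖ generated) are all assumptions made for illustration, and the bond lengths are synthetic rather than taken from QM9.

```python
import numpy as np

def histogram_kl(reference, generated, bins=50, value_range=(1.0, 2.0), eps=1e-10):
    """Estimate KL(reference || generated) from two samples via shared histograms."""
    p, _ = np.histogram(reference, bins=bins, range=value_range)
    q, _ = np.histogram(generated, bins=bins, range=value_range)
    # Smooth with a tiny constant so empty bins do not produce log(0).
    p = (p + eps) / np.sum(p + eps)
    q = (q + eps) / np.sum(q + eps)
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(0)
ref = rng.normal(1.52, 0.02, 5000)      # synthetic "dataset" bond lengths in Å
close = rng.normal(1.52, 0.02, 5000)    # a generator that matches the data
shifted = rng.normal(1.60, 0.05, 5000)  # a generator with biased bond lengths
print(histogram_kl(ref, close), histogram_kl(ref, shifted))
```

A generator whose bond lengths follow the reference distribution yields a value near zero, while a biased one yields a clearly larger value.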

### Targeted Molecule Discovery

The ability to generate molecules with desirable properties that are either absent or rare in the training data (*e*.*g*., new materials) is quite useful for the targeted exploration of chemical space. Here we conduct two targeted molecule discovery experiments, namely *minimizing* the HOMO-LUMO gap and *maximizing* the isotropic polarizability. Following previous works^{22,40}, we finetune the pretrained generative models on the collected biased datasets. Specifically, we collect from QM9 all molecular structures whose HOMO-LUMO gaps are smaller than 4.5 eV and all molecular structures whose isotropic polarizabilities are larger than 91 Bohr^{3} as the two biased datasets. Afterward, we generate 10,000 molecular structures with the finetuned model and compute the quantum properties (HOMO-LUMO gap and isotropic polarizability) with the PySCF package^{54,55}. The performance is then evaluated by calculating the mean and optimal value over all property scores (Mean and Optimal) and the percentage of molecules with good properties (Good Percentage). Molecules with good properties are those with HOMO-LUMO gaps smaller than 4.5 eV and isotropic polarizabilities larger than 91 Bohr^{3}, respectively.
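The three evaluation quantities (Mean, Optimal, Good Percentage) reduce to simple statistics over the computed property values. The sketch below assumes the property values are already available as a list; the function name and the example numbers are made up.

```python
import numpy as np

def property_summary(values, threshold, minimize=True):
    """Mean, optimal value, and percentage of 'good' molecules for one property.

    minimize=True treats smaller values as better (e.g. HOMO-LUMO gap < 4.5 eV);
    minimize=False treats larger values as better (e.g. polarizability > 91 Bohr^3).
    """
    values = np.asarray(values, dtype=float)
    if minimize:
        optimal, good = values.min(), values < threshold
    else:
        optimal, good = values.max(), values > threshold
    return {
        "mean": float(values.mean()),
        "optimal": float(optimal),
        "good_percentage": float(100.0 * good.mean()),
    }

gaps = [3.9, 4.2, 4.6, 5.1]  # made-up HOMO-LUMO gaps in eV
summary = property_summary(gaps, threshold=4.5, minimize=True)
print(summary)
```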

The results of targeted molecule discovery for the two quantum properties are shown in Table 2. For both properties, MolCode outperforms all the baseline methods and its variants without validity check and bonding information, demonstrating MolCode’s strong capability in capturing structure-property relationships and generating molecular structures with desirable properties. For instance, even though the biased datasets are only 3.20% and 2.04% of QM9 respectively, the fine-tuned MolCode achieves Good Percentages of 87.76% and 38.40%. We also illustrate the property distributions of QM9, MolCode, and biased MolCode in Fig. 2b. The property distributions of MolCode align well with those of the QM9 dataset, while the property distributions of the biased MolCode models shift towards smaller HOMO-LUMO gaps and larger isotropic polarizabilities, respectively.

Fig. 2c reveals further insights into the structural statistics of the generated molecules. First, we observe that MolCode captures the atom, bond, and ring counts of the QM9 dataset accurately. Second, for the MolCode model biased towards smaller HOMO-LUMO gaps, the generated molecules exhibit an increased number of nitrogen/oxygen atoms and double bonds in addition to a tendency towards forming six-atom rings. These features indicate the presence of aromatic rings with nitrogen/oxygen atoms and conjugated systems with alternating single and double bonds, which are important motifs in organic semiconductors with small HOMO-LUMO gaps. Finally, for the MolCode model biased towards larger isotropic polarizability, the generated molecules contain more atoms, bonds, and rings, which are the prerequisites for large isotropic polarizabilities.

### Structure-based Drug Design

Designing ligand molecules that bind to target proteins is a fundamental and challenging task in drug discovery^{56}. According to the lock-and-key model^{57,58}, molecules that bind more tightly to a disease target are more likely to be drug candidates with higher bioactivity against the disease. Therefore, it is beneficial to take the structure of the target proteins into consideration when generating molecules for drug discovery. Here, we train MolCode on the CrossDocked2020 dataset^{59}, which contains 22.5 million protein-molecule complexes for structure-based drug design. Starting with the target protein pocket as the context, MolCode iteratively predicts the ligand atom types, bond types, and atom coordinates. We generate 100 ligand molecules for each target protein pocket in the test set. More details are included in the Methods section.

Fig. 3 shows the property distributions of the sampled ligand molecules. Here, following previous works^{41,44}, we mainly focus on the following metrics: **Vina Score** measures the binding affinity between the generated molecules and the protein pockets; **QED** measures how likely a molecule is to be a potential drug candidate; **Synthesizability (SA)** represents the difficulty of drug synthesis (the score is normalized between 0 and 1 and higher values indicate easier synthesis). In our work, the Vina Score is calculated by QVina^{60,61}, and the chemical properties are calculated by RDKit^{62} over the valid molecules. Before being fed to Vina, all the generated molecular structures are first refined with the universal force field^{63}. Four competitive baselines, LiGAN^{64}, AR^{41}, GraphBP^{43}, and Pocket2Mol^{44}, are compared. We also show the distributions of the test set for reference. MolCode can generate ligand molecules with higher binding affinities (lower Vina scores) than baseline methods. Specifically, MolCode succeeds in generating molecules with higher affinity than the corresponding reference molecules for 61.8% of protein pockets on average. Moreover, the generated molecules also exhibit more potential to be drug candidates (higher QED and SA). These improvements indicate that MolCode effectively captures the distribution of 3D ligand molecules conditioned on binding sites with the graph-structure co-design scheme.

In Fig. 4, we further show several examples of generated 3D molecules with higher affinities to the target proteins than their corresponding reference molecules in the test set. Our generated molecules with higher binding affinity also have diverse structures and are largely different from the reference molecules. This demonstrates that MolCode is capable of generating diverse and novel molecules to bind target proteins, instead of just memorizing and reproducing known molecules in the dataset, which is quite important in exploring novel drug candidates.

## Conclusion

In this article, we have reported a roto-translation equivariant generative framework for molecular graph-structure co-design from scratch or conditioned on the target protein pockets. Compared to existing methods that only represent and generate 2D topology graphs or 3D geometric structures, MolCode concurrently designs 2D molecular graphs and 3D structures and can capture complex molecular relationships well. Extensive experiments on *de novo* molecule design, targeted molecule discovery, and structure-based drug design demonstrate the effectiveness of our model. Our investigation demonstrates that the 2D topology and 3D geometry contain intrinsically complementary information for molecular representation and generation, and that their unified modeling greatly improves the molecular generation performance.

There are also several potential extensions of MolCode for future work. First, MolCode may be extended and applied to significantly larger systems with more diverse atom types such as proteins and crystal materials. Although MolCode has been trained on ligand-protein pocket complexes from the CrossDocked2020 dataset, modifications will be necessary to ensure further scalability and robustness^{65–69}. Another potential improvement is to incorporate chemical priors such as ring structures into MolCode to generate more valid molecules and realistic 3D structures^{19,25}. For example, the molecules may be generated fragment-by-fragment instead of atom-by-atom, which can also speed up the generation process. Furthermore, wet-lab experiments may be conducted to validate the effectiveness of MolCode. Overall, we anticipate that further developments in deep generative models will greatly accelerate and benefit various applications in material design and drug discovery.

## Methods

### Dataset

For the task of random molecule generation and targeted molecule discovery, we evaluate MolCode on the QM9^{52} dataset. The QM9 dataset contains over 134k molecules and their corresponding 3D molecular geometries computed by density functional theory (DFT). In the random molecular geometry generation task, we randomly select 100k 3D molecular geometries as the training set and 10k 3D molecular geometries as the validation set. For the targeted molecule discovery, we collect all molecular geometries whose HOMO-LUMO gaps are smaller than 4.5 eV and all molecular geometries whose isotropic polarizabilities are larger than 91 Bohr^{3} as the finetuning dataset.

As for structure-based drug design, we use the CrossDocked2020 dataset^{59}, which contains 22.5 million protein-molecule structures, following previous works^{41,44}. We filter out data points whose binding pose RMSD is greater than 1 Å and molecules that cannot be sanitized with RDKit^{62}, leading to a refined subset with around 160k data points. We use MMseqs2^{70} to cluster data at 30% sequence identity, and randomly draw 100,000 protein-ligand pairs for training and 100 proteins from the remaining clusters for testing. For evaluation, we randomly sample 100 molecules for each protein pocket in the test set.

For all the tasks including random/targeted molecule generation and structure-based drug design, MolCode and all the other baseline methods are trained with the same data split for a fair comparison.

### Overview of MolCode

Let *a* be the number of atom types, *b* be the number of bond types, and *n* denote the number of atoms in a molecule. We can represent the molecule as a 3D graph *G* = (*V, A, R*), where *V* ∈ {0, 1} ^{n×a} is the atom type matrix, *A* ∈ {0, 1} ^{n×n×(b+1)} is an adjacency matrix, and *R* ∈ ℝ^{n×3} is the 3D atomic coordinate matrix. Note that we add one additional type of edge between two atoms, which corresponds to no edge between two atoms. Here, each element *V*_{i} in *V* and *A*_{i j} in *A* is a one-hot vector. *V*_{iu} = 1 and *A*_{i jv} = 1 represent that the *i*-th atom has type *u* and that there is a type *v* bond between the *i*-th and *j*-th atoms, respectively. The *i*-th row of the coordinate matrix, *R*_{i}, represents the 3D Cartesian coordinate of the *i*-th atom.

We adopt the autoregressive flow framework^{47} to generate the atom type *V*_{i} of the new atom, the bond types *A*_{i j}, and the 3D coordinates at each step. Since both the node type *V*_{i} and the edge type *A*_{i j} are discrete and therefore do not fit directly into a flow-based model, we adopt the dequantization method^{20,21} to convert them into continuous numbers by adding real-valued noise as:

$$
\tilde{V}_i = V_i + u, \quad u \sim U(0, 1)^{a}; \qquad \tilde{A}_{ij} = A_{ij} + u, \quad u \sim U(0, 1)^{b+1},
$$

where *U* (0, 1) is the uniform distribution over the interval (0, 1). To generate *V*_{i} and *A*_{i j}, we first sample the latent variables *z*_{i}^{V} and *z*_{i j}^{A} from the standard Gaussian distribution 𝒩 (0, 1), and then map them to *Ṽ*_{i} and *Ã*_{i j} respectively by the following affine transformation:

$$
\tilde{V}_i = s_i^{V} \odot z_i^{V} + t_i^{V}, \qquad \tilde{A}_{ij} = s_{ij}^{A} \odot z_{ij}^{A} + t_{ij}^{A}, \tag{5}
$$

where ⊙ denotes element-wise multiplication. Both the scale factors (*s*_{i}^{V} and *s*_{i j}^{A}) and shift factors (*t*_{i}^{V} and *t*_{i j}^{A}) depend on the conditional information extracted from the intermediate 3D graph *G*_{i}, which we will discuss later. After obtaining *Ṽ*_{i} and *Ã*_{i j}, *V*_{i} and *A*_{i j} can be computed by taking the argmax of *Ṽ*_{i} and *Ã*_{i j}, i.e., *V*_{i} = one-hot(arg max *Ṽ*_{i}) and *A*_{i j} = one-hot(arg max *Ã*_{i j}).
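The dequantization and argmax decoding steps can be sketched as below. This is an illustration under assumptions, not the MolCode implementation: the function names are hypothetical, and the scale and shift values are made up rather than predicted by a network.

```python
import numpy as np

def dequantize(one_hot, rng):
    """Add uniform noise in [0, 1) to a one-hot vector to make it continuous."""
    return one_hot + rng.uniform(0.0, 1.0, size=one_hot.shape)

def requantize(continuous):
    """Recover the one-hot vector by taking the argmax of the continuous vector."""
    out = np.zeros_like(continuous)
    out[np.argmax(continuous)] = 1.0
    return out

def sample_type(scale, shift, rng):
    """Sample z ~ N(0, 1), apply the affine map, and decode with argmax."""
    z = rng.standard_normal(scale.shape)
    return requantize(scale * z + shift)

rng = np.random.default_rng(0)
atom = np.array([0.0, 0.0, 1.0, 0.0])  # one-hot atom type, a = 4
noisy = dequantize(atom, rng)
print(noisy, requantize(noisy))
```

Since the noise is strictly below 1, the hot entry always stays the largest one, so the argmax recovers the original discrete type exactly.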

However, it is non-trivial to generate coordinates that satisfy the equivariance to rigid transformations and the invariance property of likelihood. Inspired by G-SchNet^{22}, MolGym^{71}, and G-SphereNet^{40}, we choose to construct a local spherical coordinate system (SCS) and generate the distance *d*_{i}, the angle *θ*_{i}, and the torsion angle *ϕ*_{i} w.r.t. the constructed local SCS. Specifically, we first choose a focal atom among all atoms in *G*_{i}, which serves as the reference point for the new atom. The new atom is expected to be placed in the local region of the selected focal atom. Assume that the focal node is the *f*-th node in *G*_{i}. First, the distance *d*_{i} from the focal atom to the new atom is generated, i.e., *d*_{i} = ∥*R*_{i} − *R*_{f}∥. Then, if *i* ≥ 2, the angle *θ*_{i} ∈ [0, *π*] between the lines (*R*_{f}, *R*_{i}) and (*R*_{f}, *R*_{c}) is generated, where *c* is the closest atom to the focal atom in *G*_{i}. Finally, if *i* ≥ 3, the torsion angle *ϕ*_{i} ∈ [−*π*, *π*] formed by the planes (*R*_{f}, *R*_{c}, *R*_{i}) and (*R*_{f}, *R*_{c}, *R*_{e}) is generated, where *e* denotes the atom closest to *c* but different from *f* in *G*_{i}. Similar to *Ṽ*_{i} and *Ã*_{i j}, *d*_{i}, *θ*_{i}, and *ϕ*_{i} can be obtained by:

$$
d_i = s_i^{d} z_i^{d} + t_i^{d}, \qquad \theta_i = s_i^{\theta} z_i^{\theta} + t_i^{\theta}, \qquad \phi_i = s_i^{\phi} z_i^{\phi} + t_i^{\phi}, \tag{8}
$$

where *z*_{i}^{d}, *z*_{i}^{θ}, and *z*_{i}^{ϕ} are latent variables sampled from standard Gaussian distributions, and the scale factors (*s*_{i}^{d}, *s*_{i}^{θ}, *s*_{i}^{ϕ}) and the shift factors (*t*_{i}^{d}, *t*_{i}^{θ}, *t*_{i}^{ϕ}) are functions of *G*_{i}. The coordinate *R*_{i} of the new atom is computed based on the relative coordinates *d*_{i}, *θ*_{i}, *ϕ*_{i} and the reference atoms (*f, c, e*), hence satisfying the roto-translation equivariance property.
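To make the equivariance argument concrete, the sketch below converts relative coordinates (distance, angle, torsion) into a Cartesian position using a local frame built from the three reference atoms, then checks that rotating and translating the reference atoms moves the new atom in the same way. The frame construction and the sign convention of the torsion angle are assumptions for illustration, and the reference atoms are assumed not to be collinear.

```python
import numpy as np

def place_atom(r_f, r_c, r_e, d, theta, phi):
    """Place a new atom from (d, theta, phi) relative to reference atoms f, c, e."""
    u = (r_c - r_f) / np.linalg.norm(r_c - r_f)   # axis from focal atom f to c
    w = r_e - r_f
    v = w - np.dot(w, u) * u                      # in-plane direction of (f, c, e)
    v = v / np.linalg.norm(v)
    n = np.cross(u, v)                            # normal of the plane (f, c, e)
    direction = np.cos(theta) * u + np.sin(theta) * (np.cos(phi) * v + np.sin(phi) * n)
    return r_f + d * direction

def random_rotation(rng):
    """Random proper rotation (determinant +1) from a QR decomposition."""
    q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]
    return q

rng = np.random.default_rng(0)
r_f, r_c, r_e = rng.standard_normal((3, 3))       # made-up reference positions
d, theta, phi = 1.5, 1.9, -0.7
r_i = place_atom(r_f, r_c, r_e, d, theta, phi)

rot, shift = random_rotation(rng), rng.standard_normal(3)
r_i_moved = place_atom(rot @ r_f + shift, rot @ r_c + shift, rot @ r_e + shift,
                       d, theta, phi)
print(np.allclose(r_i_moved, rot @ r_i + shift))
```

Because the relative coordinates themselves never change under a rigid motion, any likelihood defined on them is invariant, while the reconstructed Cartesian position follows the motion.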

### Encoder

Generating the atom type, covalent bonds, and 3D position at each step requires capturing the conditional information of the intermediate graph *G*_{i} with an equivariant encoder. In MolCode, we use SphereNet^{45} for the QM9 dataset and EGNN^{46} for the CrossDocked2020 dataset to obtain the node embeddings. Note that MolCode is agnostic to the choice of equivariant graph neural networks. SphereNet can capture the complete geometric information inside molecular structures, including bond lengths/angles and dihedral angles, but can hardly scale to large molecules due to its computational complexity. In contrast, EGNN only encodes pairwise distances between atoms and is more efficient than SphereNet on systems with more atoms, *e*.*g*., ligand-protein pocket complexes. For the input graph *G*_{i}, let the node embedding matrix computed by the 3D GNN be *H*, whose *j*-th row *h*_{j} is the embedding of the *j*-th atom, and let *d* be the dimension of the embedding.

To further encode the information of covalent bonds and capture the global information in the molecule graph, we modify the self-attention mechanism^{72} and propose a novel bond encoding. The multi-head self-attention (MHA) with bond encoding is calculated as:
where Con(·) denotes the concatenation operation, Emb(*A*_{i j}) is the embedding of the bond between the *i*-th and *j*-th atoms, *K* is the number of attention heads, and the per-head projection matrices and *W*^{O} are learnable matrices.

In MolCode, we use SphereNet^{45} with 4 layers or EGNN^{46} with 6 layers to extract features from the intermediate 3D graphs, where the input embedding size is set to 64 and the output embedding size is set to 256. The cutoff is set to 5 Å. The node features are initialized to the one-hot vectors of atom types and the edge features are initialized by spherical basis functions. In the multi-head self-attention module with bond encoding, there are 4 attention heads. In addition, we employ 6 flow layers with a hidden dimension of 128 for the decoder. We use the same model configuration for all the experiments.

### Decoder

To generate new atoms, the scale factor *s*_{i}^{V} and shift factor *t*_{i}^{V} in Eq. (5) can be computed as:

$$
s_i^{V}, t_i^{V} = \mathrm{MLP}^{V}\left(\mathrm{MHA}^{V}(H)_f\right),
$$

where MLP^{V} is a multi-layer perceptron and MHA^{V} (*H*) _{f} denotes the *f*-th node embedding from the output of the multi-head self-attention network. With the predicted new atom *V*_{i}, we can update *H* with the new atom and predict the scale and shift factors of the bond types in Eq. (5):
where Emb(*V*_{i}) denotes the atom type embedding here. As for the scale and shift factors in Eq. (8), we have:
where the embedding of the newly added atom is taken from the output of the multi-head self-attention network.

As for the focal atom selection, we employ a multi-layer perceptron (MLP) with the atom embeddings as input. Atoms that are not valence filled are labeled 1, otherwise 0. Particularly, in the structure-based drug design task where there is no ligand atom at the beginning, the focal atoms are defined as protein atoms that have ground-truth ligand atoms within 4 Å at the first step. After the generation of the first ligand atom, MolCode selects focal atoms from the generated ligand atoms. At the inference stage, we randomly choose the focal atom *f* from atoms whose classification scores are higher than 0.5. The sequential generation process stops if all the classification scores are lower than 0.5 or there is no generated bond between the newly added atom and the previously generated atoms.
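The inference-time focal atom selection and stopping rule can be sketched as follows, assuming the classifier scores are already available as an array. The function name is hypothetical; the 0.5 threshold is the one stated above.

```python
import numpy as np

def choose_focal_atom(scores, rng, threshold=0.5):
    """Pick a random focal atom among those scoring above the threshold.

    Returns None when no atom qualifies, which signals that generation stops.
    """
    candidates = np.flatnonzero(np.asarray(scores) > threshold)
    if candidates.size == 0:
        return None
    return int(rng.choice(candidates))

rng = np.random.default_rng(0)
print(choose_focal_atom([0.1, 0.9, 0.3], rng))  # only atom 1 qualifies
print(choose_focal_atom([0.1, 0.2], rng))       # no candidate: stop generating
```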

### Validity Filter

The graph-structure co-design scheme in MolCode makes it feasible to check the chemical validity based on the generated 2D graphs at each step. Specifically, we explicitly consider the valency constraints during sampling to check whether current bonds have exceeded the allowed valency. The valency constraint is defined as:

$$
\sum_{j} |A_{ij}| \le \mathrm{Valency}(V_i),
$$

where |*A*_{i j}| denotes the order of the chemical bond *A*_{i j} and Valency(*V*_{i}) is the maximum valency allowed for the atom type *V*_{i}. If the newly added bond breaks the valency constraint, we reject the bond *A*_{i j}, sample a new *z*_{i j} in the latent space, and generate another bond type.
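A minimal sketch of this rejection step is given below. The valence table covers only neutral H, C, N, O, and F and ignores formal charges, the `propose_order` callable stands in for sampling a latent variable and decoding a bond type, and the retry limit with a fallback to "no bond" is an added safeguard; all of these are assumptions rather than details from the source.

```python
# Simplified maximum valences for neutral atoms (an assumption for illustration).
MAX_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2, "F": 1}

def used_valence(bonds, k):
    """Sum of bond orders already attached to atom k; bonds maps (a, b) -> order."""
    return sum(order for (a, b), order in bonds.items() if k in (a, b))

def bond_is_allowed(atom_types, bonds, i, j, order):
    """Check the valency constraint for both endpoints of a proposed bond."""
    return (used_valence(bonds, i) + order <= MAX_VALENCE[atom_types[i]]
            and used_valence(bonds, j) + order <= MAX_VALENCE[atom_types[j]])

def sample_valid_bond(propose_order, atom_types, bonds, i, j, max_tries=20):
    """Resample bond orders until one satisfies the constraint (0 means no bond)."""
    for _ in range(max_tries):
        order = propose_order()
        if bond_is_allowed(atom_types, bonds, i, j, order):
            return order
    return 0

atoms = ["C", "O", "H"]
bonds = {(1, 2): 1}                 # the oxygen already carries one hydrogen
proposals = iter([3, 2, 1])         # made-up sequence of sampled bond orders
print(sample_valid_bond(lambda: next(proposals), atoms, bonds, 0, 1))
```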

### Model Training and Inference

To make sure the generated atoms are in the local region of their corresponding reference atoms, we propose to use Prim’s algorithm to obtain the generation orders of atoms. The weights of the edges are set as the distances between atoms. The first atoms of molecules are randomly sampled in each epoch to encourage the generalization ability of the model. With such obtained trajectories, MolCode is trained by stochastic gradient descent with the following loss function. For a 3D molecular graph *G* with *n* atoms (*n* > 3), we maximize its log-likelihood in Eq. (18) and (19) to train the MolCode model. Besides, the atom-wise classifier for focal atom selection is trained with a binary cross entropy loss.
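A generation order based on Prim's algorithm can be sketched as below, assuming a complete graph over the atoms with Euclidean distances as edge weights (the source states only that edge weights are interatomic distances). The function name and the toy coordinates are illustrative.

```python
import heapq
import numpy as np

def prim_generation_order(coords, start=0):
    """Visit atoms in the order Prim's algorithm adds them to the spanning tree."""
    coords = np.asarray(coords, dtype=float)
    n = len(coords)
    visited = [False] * n
    order = []
    heap = [(0.0, start)]  # (distance to the current tree, atom index)
    while heap and len(order) < n:
        _, k = heapq.heappop(heap)
        if visited[k]:
            continue
        visited[k] = True
        order.append(k)
        for m in range(n):
            if not visited[m]:
                dist = float(np.linalg.norm(coords[k] - coords[m]))
                heapq.heappush(heap, (dist, m))
    return order

# Four atoms on a line, made up for illustration.
coords = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [3.0, 0.0, 0.0], [2.1, 0.0, 0.0]]
print(prim_generation_order(coords, start=0))
```

Each newly visited atom is the unvisited atom closest to some already placed atom, so it always lies in the local region of a possible reference atom. Varying `start` corresponds to sampling a different first atom in each epoch.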

In the random molecule generation task, our MolCode model is trained with the Adam^{73} optimizer for 100 epochs, where the learning rate is 0.0001 and the batch size is 64. We report the results corresponding to the epoch with the best validation loss. It takes around 36 hours to train MolCode from scratch on one Tesla V100 GPU. In the targeted molecule discovery task, the model is fine-tuned with a learning rate of 0.0001 and a batch size of 32. The number of training epochs is 40 for the HOMO-LUMO gap and 80 for the isotropic polarizability. In the task of structure-based drug design, we train MolCode with the Adam optimizer for 100 epochs with a learning rate of 0.0001 and a batch size of 8. *β*_{1} and *β*_{2} in Adam are set to 0.9 and 0.999, respectively. We run the code provided by the authors to obtain the results of baseline methods.

During generation, we use temperature hyperparameters in the prior Gaussian distributions. Specifically, we set the standard deviation of each Gaussian distribution to the corresponding temperature hyperparameter. To decide the specific values of the temperature hyperparameters, we perform a grid search over {0.3, 0.5, 0.7} based on Validity and Uniqueness in random molecule generation to encourage generating more valid and diverse molecules. As the default setting, we use 0.5 for sampling the atom-type latent variable *z*_{i}^{V}, 0.5 for the bond-type latent variable *z*_{i j}^{A}, 0.3 for the distance latent variable *z*_{i}^{d}, 0.3 for the angle latent variable *z*_{i}^{θ}, and 0.7 for the torsion-angle latent variable *z*_{i}^{ϕ}. We have the following insight for choosing temperature hyperparameters: to generate valid and diverse molecules, the latent variables for bond lengths/angles (*z*_{i}^{d} and *z*_{i}^{θ}) are assigned small temperature hyperparameters (low variance), since the values of a certain type of bond length/angle are largely fixed. In contrast, the torsion angles are more flexible in molecules, so the temperature hyperparameter of *z*_{i}^{ϕ} is larger. We use the same fixed temperature hyperparameters for the targeted molecule discovery and structure-based drug design experiments. In Fig. S1, we show the hyperparameter analysis; the default values of the analyzed hyperparameters are set to 0.5. MolCode is generally robust to the choice of hyperparameters and can further benefit from setting appropriate hyperparameter values.
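Temperature-scaled sampling amounts to drawing from a Gaussian whose standard deviation equals the temperature. The sketch below assumes that the five default values listed above correspond, in order, to the atom type, bond type, distance, angle, and torsion angle latent variables; the dictionary keys and function name are illustrative.

```python
import numpy as np

# Assumed mapping of the default temperatures to the five latent variables.
TEMPERATURES = {"atom": 0.5, "bond": 0.5, "distance": 0.3, "angle": 0.3, "torsion": 0.7}

def sample_latent(kind, size, rng):
    """Draw latent variables from N(0, T^2), where T is the temperature for `kind`."""
    return TEMPERATURES[kind] * rng.standard_normal(size)

rng = np.random.default_rng(0)
z_d = sample_latent("distance", 5, rng)    # low variance: bond lengths are rigid
z_phi = sample_latent("torsion", 5, rng)   # higher variance: torsions are flexible
print(z_d, z_phi)
```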

Algorithms 1 and 2 show the pseudo-code of the training and generation processes of MolCode for random/targeted molecule generation. Note that, to scale to large molecules in experiments, the bonds are only generated and predicted between new atoms and the reference atoms. The pseudo-code of MolCode for structure-based drug design is similar to Algorithms 1 and 2, except that the ligand atoms are generated conditioned on the protein pocket instead of from scratch.

## Data availability

The data necessary to reproduce our numerical benchmark results are publicly available at https://github.com/divelab/DIG and https://github.com/gnina/models.

## Code availability

The code used in the study is publicly available from the GitHub repository: https://github.com/zaixizhang/MolCode.

## Author contributions statement

Z.X.Z., Q.L., C.L., C.H., and E.H.C. designed the research; Z.X.Z. conducted the experiments; Z.X.Z., Q.L., and C.L. analyzed the results. All authors reviewed the manuscript.

## Competing interests

The authors declare no competing interests.

## Additional information

Correspondence and requests for material should be addressed to Qi Liu.

## Acknowledgements

This research was partially supported by grants from the National Natural Science Foundation of China (Grant Nos. 61922073 and U20A20229).