Infinite Physical Monkey: Do Deep Learning Methods Really Perform Better in Conformation Generation?

Conformation generation is a fundamental problem in drug discovery and cheminformatics, and the generation of organic molecule conformations, particularly in vacuum and in protein pocket environments, is the most relevant to drug design. Recently, with the development of geometric neural networks, data-driven schemes have been successfully applied in this field, both for molecular conformation generation (in vacuum) and for binding pose generation (in protein pockets). The former beats the traditional ETKDG method, while the latter achieves accuracy similar to that of widely used molecular docking software. Although these methods have shown promising results, some researchers have recently used a parameter-free method to question whether deep learning (DL) methods really perform better in molecular conformation generation. To our surprise, what they designed is analogous to the famous infinite monkey theorem, except that the monkeys are even equipped with a physics education. To discuss the soundness of their proof, we constructed a truly stochastic infinite monkey for molecular conformation generation, showing that even with a more stochastic sampler for geometry generation, the coverage of the benchmark QM-computed conformations is higher than that of most DL-based methods. By extending their physical monkey algorithm to binding pose prediction, we also find that its docking success rate is close to the best performance among existing DL-based docking models. Thus, although their conclusions are right, their proof process deserves closer scrutiny.


Introduction
Binding pose generation is a fundamental and significant problem in drug discovery and cheminformatics 1,2, which aims at obtaining the spatial and orientational relationship between protein pockets and drug candidates. Once the binding pose is determined, it can be utilized for structure-based drug design (SBDD) and ligand-based drug design (LBDD) based on protein-ligand energy profiles. In particular, high-throughput virtual screening 3, the most popular SBDD protocol, requires a binding energy assessment for every molecule in the screened compound library, in which the scoring function relies heavily on the plausibility of the binding poses.
Numerous algorithms have been developed to predict binding poses, with some acknowledged as standard tools in virtual screening protocols 4. Docking pose generation algorithms can be classified into systematic search, heuristic search, and deterministic search based on the sampling strategy. Systematic search, represented by Glide 5, samples all degrees of freedom, but the sampling complexity grows exponentially with the number of rotatable bonds, leading to the curse of dimensionality. Therefore, in practical applications the search space is first filtered according to empirical statistical rules. Heuristic search applies random transformations to the ligand conformations, which are then accepted or rejected based on an evaluation function, and iterates until convergence. In particular, AutoDock 6 and LeDock 7 are two examples of this class that use the genetic algorithm and the simulated annealing algorithm, respectively. Deterministic search, which uses molecular dynamics as the search engine, explores the conformational space with a greedy strategy, moving step by step toward lower energy regions.
However, since deterministic search collapses to the same final state given the same initial state and the same dynamics parameters, it is less common than the first two. CDOCKER 8 is an example of deterministic software based on the CHARMM 9 molecular dynamics simulation.
Although these molecular docking algorithms have played a significant role in drug discovery campaigns over the past few decades, the intrinsic complexity of protein-ligand interactions, which involve entropy and solvation effects 10,11, makes binding pose generation far from solved.

Accurate prediction of binding poses remains a key problem in computer-aided drug design (CADD).
In recent years, data-driven schemes have been used to tackle this problem and have shown powerful potential in real-world scenarios. DeepDock 12 follows the heuristic search idea, training a scoring function to accept or reject transformations of conformations.
EquiBind 13 proposes an SE(3)-equivariant framework to embed the physical constraints when predicting Cartesian coordinates directly. TankBind 14 generates extensive 2D information, i.e., interatomic distances inside ligands and between ligand atoms and protein atoms, to recover the 3D conformation. However, since this is not a one-to-one mapping, the recovery process can compromise the plausibility of the binding poses. Thus physical refinement is often required to polish these conformations 15. A more physical model, DiffDock 16, has been designed to operate on the internal coordinate space with the popular and powerful diffusion model, achieving state-of-the-art (SOTA) performance among all the deep learning (DL)-based methods. These rapid developments are founded on molecular conformation generation methods, such as GeoDiff 17 for directly generating Cartesian coordinates, CGCF 18 and SDEGen 15 for constructing 3D geometries from extensive 2D distances, and torsional diffusion 19 for searching in torsional angle space.
Recently, a work 20 showed that DL-based models were defeated by a curated parameter-free method. This method sampled ~2000 conformations from RDKit by rotating the dihedral angles in 1/6 of them, performing force field optimization in 1/6 of them, and clustering the resulting conformations based on their 3D coordinates. The constructed baseline achieves results comparable to the SOTA method in terms of the COV and MAT metrics, leading to questions about the rationality of the benchmark used for the DL-based approaches. Two points need to be noted. First, for the molecular conformation generation problem, COV and MAT are only a part of the assessment metrics, and other evaluation approaches have been used in the field. The authors write: "By revising the MCG setting in many DL-based methods, we suggest the community rethink the benchmark in the current MCG, and focus on the end applications in various MCG-related downstream applications in the future" 21. What they suggest has long been adopted in most molecular conformation generation works. For example, in GraphDG 22, ConfGF 23, and GeoDiff 17, the thermodynamic properties of the generated conformations were compared with those of the quantum-computed conformations; in DMCG 24, it was demonstrated that the docking success rate improved when the model-generated conformations were used for downstream docking tasks; in SDEGen 15, not only were the generated conformations compared with the crystal conformations, but the free energy profile was also plotted by molecular dynamics to qualitatively compare the coverage area of the model-generated conformations with that of the RDKit-generated conformations. Table 1 lists the evaluation methods involved in each model in the field of molecular conformation generation, emphasizing that a variety of metrics have been adopted for evaluating conformation generation models. Second, the parameter-free method they constructed beat the other machine learning models because of oversampling, which coincides with the infinite monkey theorem 25, except that the monkeys were even equipped with physical knowledge.
This prompted the construction of a truly stochastic infinite monkey algorithm in this paper to demonstrate that even this more random method can achieve the reported accuracy. Compared with their infinite physical monkey, the infinite stochastic monkey moves further away from the dependence on the parameters of the ETKDG algorithm. The idea of this approach is as follows: we model the chemical bonds using a harmonic oscillator model, under which the bond length distribution can be well approximated by a Gaussian distribution. The mean and standard deviation of the bond lengths for the different bond types (e.g., carbon-carbon single bond, carbon-oxygen double bond) in the data set are counted. During the generation procedure, we resample each chemical bond length and then reconstruct the three-dimensional geometries from the two-dimensional bond lengths. The experiments demonstrate that our method achieves results almost as good as those of their proposed RDKit+Clustering algorithm, indicating that the comparison in the previous work is unfair. At the very least, they should also sample 2000 conformations for the DL methods, followed by clustering.
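The bond-length resampling step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function names, the string bond-type labels, and the toy bond lengths are hypothetical, and the subsequent reconstruction of 3D geometries from the sampled lengths is not shown.

```python
import random
import statistics
from collections import defaultdict


def fit_bond_length_stats(observations):
    """Collect the mean and standard deviation of bond lengths per bond type.

    observations: iterable of (bond_type, length_in_angstrom) pairs.
    """
    grouped = defaultdict(list)
    for bond_type, length in observations:
        grouped[bond_type].append(length)
    # Population standard deviation; a single observation gives std = 0.
    return {
        bond_type: (statistics.fmean(values), statistics.pstdev(values))
        for bond_type, values in grouped.items()
    }


def resample_bond_lengths(bonds, stats, rng):
    """Draw one Gaussian-distributed length for every bond of a molecule.

    bonds: list of (atom_i, atom_j, bond_type) triples.
    Returns a dict mapping (atom_i, atom_j) to the sampled length.
    """
    sampled = {}
    for i, j, bond_type in bonds:
        mean, std = stats[bond_type]
        sampled[(i, j)] = rng.gauss(mean, std)
    return sampled


# Toy data set with hypothetical bond lengths (in angstrom).
observations = [
    ("C-C", 1.52), ("C-C", 1.54), ("C-C", 1.53),
    ("C=O", 1.21), ("C=O", 1.23),
]
stats = fit_bond_length_stats(observations)

# A hypothetical three-atom fragment: atoms 0-1 single bond, 1-2 double bond.
bonds = [(0, 1, "C-C"), (1, 2, "C=O")]
sampled = resample_bond_lengths(bonds, stats, random.Random(0))
```

Each call to `resample_bond_lengths` yields one set of trial bond lengths, so repeating it many times gives the large pool of stochastic trial geometries that the argument relies on.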
We also extended the infinite monkey algorithm to the binding conformation prediction problem and constructed the Scoring baseline, which considers pocket information, and the UFF baseline model, which is based on force field optimization. Our approach highlights the importance of a large sampling size in achieving high success rates in molecular docking, challenging the use of RDKit+Clustering as a fair baseline model. Moreover, we show that the infinite physical monkey can serve as a stochastic baseline, shedding light on the binding conformation prediction problem.
Table 6 presents the docking success rates of Uni-Dock, two traditional docking methods, two DL methods given pockets, and our constructed baselines. Our results suggest that machine learning models have the potential to surpass traditional models in terms of the docking success rate for a given pocket. Additionally, any DL-based docking model must outperform the infinite physical monkey under the same conditions to validate its effectiveness. We also identify a possible inductive bias of current pocket-based docking models and propose straightforward training methods to mitigate this bias.

Infinite Stochastic Monkey (Molecular Conformation Generation)
As we mentioned above, the RDKit+Clustering algorithm is essentially a physical version of the infinite monkey theorem, like shooting a target with a machine gun and then clustering the scores.
With the same sample size, COV/MAT can reflect the conformational quality to some extent, not to mention that in the field of molecular conformation generation there are at least distance distribution and thermodynamic calculation experiments to further verify the quality of a conformation ensemble. Here, the reference count is the number of favored conformations at quantum chemistry accuracy in the dataset, and the number of ETKDG-constrained dihedral random conformations and of force-field optimized conformations is 1/4 of the number of random conformations. Similarly, the K-means algorithm is adopted to cluster the conformations into twice as many clusters as the reference count, and the center of each cluster is taken as a pseudo-conformation for evaluation.
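For readers unfamiliar with the two metrics, the sketch below computes COV and MAT from a precomputed RMSD matrix. It assumes the definitions commonly used in the cited conformation generation works: COV is the fraction of reference conformations that have at least one generated conformation within a threshold, and MAT is the mean of the best-match RMSD values. The function name, the toy RMSD matrix, and the threshold value are hypothetical.

```python
def cov_mat(rmsd, threshold):
    """Compute coverage (COV) and matching (MAT) scores.

    rmsd[i][j] is the RMSD (in angstrom) between reference conformation i
    and generated conformation j.
    """
    # Best-matching generated conformation for each reference conformation.
    best = [min(row) for row in rmsd]
    cov = sum(1 for d in best if d < threshold) / len(best)
    mat = sum(best) / len(best)
    return cov, mat


# Toy matrix: 3 reference conformations x 3 generated conformations.
rmsd = [
    [0.3, 1.4, 2.0],
    [1.8, 0.9, 1.1],
    [2.2, 1.6, 1.5],
]
cov, mat = cov_mat(rmsd, threshold=1.25)
```

Because both scores depend only on the best match per reference conformation, adding more generated conformations can never make either score worse, which is why the sample size has to be matched before two methods are compared.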

Infinite Physical Monkey (Binding Pose Prediction)
Besides its high performance in molecular conformation generation, the infinite physical monkey algorithm also reaches similar results in the binding pose generation problem.
Among drug design applications, the most commonly used scheme is predicting the binding conformations when the pocket is given. Under the assumption that the pocket is given, the infinite physical monkey algorithm is as follows: 1) generate the molecular conformation using the ETKDG algorithm, without considering any protein environment; 2) place the molecule onto the geometric midpoint of the pocket and then perform a random rotation centered on the geometric midpoint to obtain the final infinite physical monkey version of the docking conformation. Moreover, we propose a scoring baseline model that takes into account the chemical environment inside the pocket: after obtaining the docking conformations of the infinite physical monkey, the binding energy of each conformation inside the pocket is evaluated by Vina 26, and the conformations are ranked in order of binding energy from lowest to highest. In addition to this post-processing approach, we also propose a force field-based baseline for binding conformation prediction: force field optimization is performed on the molecule using the infinite physical monkey conformation as the starting point, with the residues in the pocket fixed. With the infinite physical monkey algorithm built for both the molecular conformation generation task and the binding pose prediction task, it can be concluded that a large sampling size results in a sustained improvement of the metrics associated with RMSD.
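Step 2 of the procedure above (placement at the pocket midpoint followed by a random rotation) can be sketched as follows. This is an illustration under assumptions rather than the code used in the experiments: the geometric midpoint is taken to be the coordinate centroid, the rotation is drawn uniformly via a normalized random quaternion, and all function names and coordinates are hypothetical. The ETKDG conformation from step 1 is replaced by toy coordinates.

```python
import math
import random


def centroid(points):
    """Arithmetic mean of a list of (x, y, z) coordinates."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(3))


def random_rotation_matrix(rng):
    """Uniform random 3D rotation from a normalized Gaussian quaternion."""
    q = [rng.gauss(0.0, 1.0) for _ in range(4)]
    norm = math.sqrt(sum(c * c for c in q))
    w, x, y, z = (c / norm for c in q)
    return (
        (1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w)),
        (2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w)),
        (2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y)),
    )


def place_in_pocket(ligand, pocket, rng):
    """Center the ligand on the pocket centroid and rotate it randomly."""
    lig_c, poc_c = centroid(ligand), centroid(pocket)
    rot = random_rotation_matrix(rng)
    posed = []
    for p in ligand:
        # Shift to the ligand's own centroid, rotate, then move to the pocket.
        v = [p[k] - lig_c[k] for k in range(3)]
        r = [sum(rot[a][b] * v[b] for b in range(3)) for a in range(3)]
        posed.append(tuple(r[a] + poc_c[a] for a in range(3)))
    return posed


# Toy coordinates (angstrom) standing in for a ligand and pocket atoms.
ligand = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.2, 0.0)]
pocket = [(10.0, 10.0, 10.0), (12.0, 10.0, 10.0),
          (10.0, 12.0, 10.0), (10.0, 10.0, 12.0)]
posed = place_in_pocket(ligand, pocket, random.Random(0))
```

The transformation is rigid, so the internal geometry of the ligand is unchanged; only its position and orientation relative to the pocket are randomized.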
Thus, we offer a suggestion on the method comparison in the discussed paper 21: at the very least, when they turn RDKit into infinite monkeys, they should also turn the other machine learning methods into infinite monkeys. Although we partially agree that COV/MAT is not a completely reasonable metric, there is no perfect metric in the world. If we compare a method to an elephant, then each metric is like a blind person touching the elephant, and different metrics are used for evaluation in order to understand the overall shape of this elephant as much as possible.
Unfortunately, they only asked the blind person who was touching the elephant's leg what the elephant really looked like, ignoring the opinions of the other blind people.

Method comparison in binding pose prediction given pockets
We agree with Yu et al. 20 that DL models should focus on the docking scenario, which predicts the binding conformations inside the protein pockets rather than within the whole protein.
However, there are still some DL-based docking models developed for fair comparison. In the last experiment, we discuss the results of the pocket-aware docking methods, from traditional approaches, such as AutoDock GPU, Glide SP, and Uni-Dock, to deep-learning models, such as LigPose 27 and TankBind 14. In addition, three baselines are presented to demonstrate that the DL-based models truly model the interaction between the ligands and the pockets.

Results and discussions
Experiment 1: Performances of Infinite Stochastic Monkey in Molecular Conformation Generations.
Although our Infinite Stochastic Monkey algorithm is much more stochastic than the RDKit+Clustering methodology, it still achieves SOTA performance on COV/MAT. Such results indicate that in the limit of a large sampling number (e.g., 2000 conformations), purely stochastic algorithms should approach the limiting COV/MAT scores. This is easy to understand: performing conformation generation with a large sampling number is equivalent to uniformly sampling the molecules' conformation spaces. It is just like randomly sending infinite monkeys (trial conformations) to the molecules' potential surfaces and then analyzing whether the monkeys could fully occupy the surfaces. The results must be deterministic: the sampled conformation space would naturally cover all low-energy conformations, including the reference conformations. In the case of RDKit, the monkeys are further driven by physical rules and are only sent to the regions with relatively low potential energies, which is just the so-called enhanced sampling methodology. As a result, under a large but not infinite sampling number, the monkeys with physical knowledge in their minds perform slightly better than our purely stochastic monkeys, as can be seen in Table 2.
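The sampling-number argument can be made concrete with a small idealized calculation. The sketch assumes that trial conformations are independent and that each one lands within the RMSD threshold of a given reference conformation with the same probability; both assumptions and the numerical value of that probability are hypothetical and serve only to show the trend.

```python
def coverage_probability(p_hit, n_samples):
    """Probability that at least one of n independent samples is a hit."""
    return 1.0 - (1.0 - p_hit) ** n_samples


# Hypothetical per-sample hit probability of 0.1%.
p_hit = 0.001
for n in (1, 5, 2000):
    print(n, round(coverage_probability(p_hit, n), 4))
```

Under these assumptions the probability of covering a reference conformation tends to one as the sampling number grows, regardless of how small the per-sample probability is, which is the infinite monkey effect in miniature.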
On the other hand, DL methods are designed to generate low-energy conformations with a small sampling number. Thus, it is unfair to criticize DL methods for not outperforming the limiting case of the RDKit+Clustering methodology, especially when the sampling number of the DL methods is far smaller than that of the stochastic algorithms. What's more, as we mentioned above, since the COV/MAT scores have their own limitations, the SMCG community no longer invokes them as the only benchmark standard. It is therefore also meaningless to use them as the sole criteria when benchmarking the performance of SMCG methods.
To further illustrate that numerous samples could cover most of the energy profile, we curated the Infinite Physical Monkey algorithm for a well-acknowledged problem, docking pose generation. We aligned the molecules whose conformations are generated by RDKit to the center of the protein pocket, followed by a random rotation. Table 3 shows the performance of the constructed Infinite Physical Monkey under different numbers of conformations and thresholds. The 2 Å hit rate under 2,000 samples, 74.84%, is close to the SOTA performance and outperforms most traditional docking methods. However, one would never take these resulting conformations to assess the binding energy for virtual screening in real-world drug development, and the same holds for the previous Infinite Physical Monkey in molecular conformation generation, namely RDKit+Clustering.
Figure 2. The hit rate of three baseline methods at different thresholds
Now we could further investigate the revelations brought by our monkeys. We calculated the binding energy between the randomly generated ligand conformations and the protein pocket.
Then we sorted the RMSD values between the generated ligand conformations and the crystal ligand conformation according to these binding energies. The results are presented in Table 4. We found that after the sorting, the top-1 and top-5 2 Å hit rates rise slightly compared with the original Infinite Physical Monkey algorithm. This is because, during the sorting, we put the more reasonable binding conformations (the conformations with higher binding scores) at the top of the conformation list.
After all, however, these conformations were generated by random rotation, so most of them are not superior in binding energy. Some of them even overlap with the atoms of the binding pocket. This experiment further explains the randomness of our Infinite Physical Monkey algorithm. Under the same sampling number, however, the docking success rate is still a valid benchmarking score.
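The ranking and hit-rate evaluation discussed above can be sketched as follows. The sketch assumes that a lower score means a more favorable binding energy and that a complex counts as a hit when any of its top-k ranked poses has an RMSD below the threshold; the function name and the toy scores and RMSD values are hypothetical and are not results from the tables.

```python
def top_k_hit_rate(complexes, k, threshold=2.0):
    """Fraction of complexes with a hit among the k best-scored poses.

    complexes: one list of (score, rmsd) pairs per protein-ligand complex,
    where rmsd is measured against the crystal ligand conformation.
    """
    hits = 0
    for poses in complexes:
        # Most favorable (lowest) score first.
        ranked = sorted(poses, key=lambda pose: pose[0])
        if any(rmsd < threshold for _, rmsd in ranked[:k]):
            hits += 1
    return hits / len(complexes)


# Toy example with three complexes and two poses each.
complexes = [
    [(-7.1, 1.2), (-5.0, 4.0)],
    [(-6.0, 3.5), (-5.5, 1.8)],
    [(-4.0, 6.0), (-3.0, 5.0)],
]
top1 = top_k_hit_rate(complexes, k=1)
top2 = top_k_hit_rate(complexes, k=2)
```

Setting k to the full number of sampled poses makes the ranking irrelevant, so the score then depends only on the sampling size.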
Based on the conformations generated by the Infinite Physical Monkey algorithm, we further tested whether force-field-based geometry optimization could improve the docking success rates. The geometry optimization process moves the conformations to the nearest local potential energy minima. Thus the accuracy of the optimized geometries depends on the initial guess structures and on the accuracy of the chosen force field. In the current experiment, we used the UFF force field. As shown in Table 6, compared with random conformations, the top-1 and top-5 2 Å hit rates of the optimized conformations rose to 9.23% and 26.66%. We also plotted the hit rates of the three selected methods under sampling numbers of 1, 5, and 2000 in Figure 2.
In Table 6 we compared the binding-pocket-specific methods (which require the users to specify the binding pocket before docking) with the deep-learning-based local docking models. For the sake of fairness, two traditional methods (AutoDock, Glide SP), two DL models, and three baselines constructed by us were selected for the comparison. The comparison standard is the top-1 2 Å hit rate of the methods. As can be seen, the traditional methods reach a maximum accuracy of 66.8% (Glide SP), while the DL methods reach a SOTA accuracy of 74.7% (LigPose). Such a result indicates that DL methods are able to outperform the traditional methods in binding conformation prediction tasks when binding pockets are given. Another deep-learning-based method, the TankBind model, achieves an accuracy of only 24.2% but still performs much better than our three baselines. As we pointed out, once a DL model outperforms the purely stochastic or scoring-based Infinite Physical Monkey method, the model must have learned the interactions between the pocket and the ligands. In conclusion, DL models could be superior to traditional methods in binding conformation generation. For both classes of methods, the ultimate way to benchmark their effectiveness is to perform the subsequent biological activity experiments.

Conclusions
In this paper, we systematically analyze the plausibility of the comparison made with the Infinite Physical Monkey algorithm in molecular conformation generation and binding pose prediction. The conclusions are as follows:
1) Although COV/MAT has its own limitations, the reason why RDKit+Clustering outperforms the other deep learning-based models is primarily the larger sampling size. In addition, in the field of molecular conformation generation, there are other physics-related evaluation metrics.
2) The Infinite Physical Monkey provides a stochastic baseline for evaluating the feasibility of DL approaches in binding pose prediction.
3) Our fair comparison demonstrates that DL approaches have the potential to enhance binding pose prediction. Moreover, this comparison reveals an inductive bias that is hidden in the pocket truncation process during data processing.
In summary, our analysis suggests that the use of physics-based metrics and the consideration of sample size are important factors in evaluating the performance of molecular conformation generation and binding pose prediction methods. Our results highlight the potential of DL approaches and the importance of considering inductive biases in the data processing phase.

Figure 1. The workflow of the Infinite Physical Monkey algorithm in binding pose prediction

Experiment 2: Performances of Infinite Physical Monkey in Binding Pose Prediction.

Experiment 3:
From the figure, we could summarize at least two conclusions. First, once a DL model outperforms the baseline version of our Infinite Physical Monkey algorithm, it must have learned the distribution of the molecules inside the protein pocket. We found that each of the currently developed models is better than the stochastic method, which proves the effectiveness of these models. Second, when developing deep-learning-based binding conformation prediction models, we should pay more attention to the truncation method of the binding pocket. In most training tasks, we tend to select the binding pocket according to the center of the ligand. As a result, the training sets we construct usually contain only one kind of conformation, in which the center of the ligand is the center of the binding pocket. The models trained with such training sets would naturally tend to place the ligand at the center of the given pocket, even if the true binding site is not located at the center of the given residues. A simple way to avoid such an inductive bias is to add Gaussian random noise to the center position of the ligand when extracting the binding pocket.
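The noisy pocket extraction suggested above can be sketched as follows. This is one possible reading of the idea under stated assumptions: the pocket is defined as all residues whose representative coordinate lies within a cutoff radius of the perturbed ligand center, and the function name, the cutoff radius, the noise standard deviation, and the toy coordinates are all hypothetical.

```python
import math
import random


def extract_pocket(residue_centers, ligand_coords, radius, noise_std, rng):
    """Select pocket residues around a Gaussian-perturbed ligand center.

    residue_centers: one (x, y, z) representative coordinate per residue.
    Returns the indices of the residues within `radius` of the noisy center.
    """
    n = len(ligand_coords)
    center = [sum(atom[k] for atom in ligand_coords) / n for k in range(3)]
    # Independent Gaussian noise on each axis shifts the pocket center, so
    # the ligand no longer sits exactly in the middle of the extracted pocket.
    noisy_center = [c + rng.gauss(0.0, noise_std) for c in center]
    return [
        idx
        for idx, residue in enumerate(residue_centers)
        if math.dist(residue, noisy_center) <= radius
    ]


# Toy coordinates in angstrom.
residue_centers = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
ligand_coords = [(1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]

exact = extract_pocket(residue_centers, ligand_coords, 5.0, 0.0, random.Random(0))
noisy = extract_pocket(residue_centers, ligand_coords, 5.0, 1.0, random.Random(0))
```

With a noise standard deviation of zero the function reduces to the usual ligand-centered truncation, so the two settings can be compared directly on the same data.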

Table 1. The evaluation contents in different works

Table 3. Hit rate of Infinite Physical Monkey baseline

Table 4. Hit rate of Random Scoring baseline

Table 5. Hit rate of UFF baseline