Learning with uncertainty for biological discovery and design

Machine learning that generates biological hypotheses has transformative potential, but most learning algorithms are susceptible to pathological failure when exploring regimes beyond the training data distribution. A solution is to quantify prediction uncertainty so that algorithms can gracefully handle novel phenomena that confound standard methods. Here, we demonstrate the broad utility of robust uncertainty prediction in biological discovery. By leveraging Gaussian process-based uncertainty prediction on modern pretrained features, we train a model on just 72 compounds to make predictions over a 10,833-compound library, identifying and experimentally validating compounds with nanomolar affinity for diverse kinases and whole-cell growth inhibition of Mycobacterium tuberculosis. We show how uncertainty facilitates a tight iterative loop between computation and experimentation, improves the generative design of novel biochemical structures, and generalizes across disparate biological domains. More broadly, our work demonstrates that uncertainty should play a key role in the increasing adoption of machine learning algorithms into the experimental lifecycle.


Introduction
As unprecedented high-throughput assays continue to transform biology [1]- [6], the 26 ultimate goal of these studies remains the same: to generate hypotheses that elucidate important 27 features of biological systems [7], [8]. The growing volume of experimental data underscores the 28 importance of robust, systematic strategies to explore these results and identify experimental 29 conditions that give rise to a desirable biological outcome. 30 Machine learning algorithms offer a way to translate existing data into actionable 31 biological hypotheses [9]- [14]. However, while hypothesis generation often relies on a human 32 expert's intuitive certainty or uncertainty about a given hypothesis, this intuition is not 33 automatically built into a machine learning algorithm, making these algorithms susceptible to 34 overconfident predictions, especially when access to training data is limited [15], [16]. Instead, 35 an intelligent algorithm that quantifies prediction uncertainty [17]- [21] could help focus 36 experimental effort on hypotheses with a high likelihood of success, which is especially useful 37 when new data acquisition is slow or arduous; or, the algorithm could alert a researcher to 38 experiments with greater novelty but also with a greater risk of failure [17]. 39 While uncertainty is gradually becoming recognized as a critical property in learning 40 algorithms [22]- [24], in the biological setting, most machine learning studies simply do not 41 consider uncertainty, while those that do are limited to specific tasks or in silico validation [25]-Before committing resources to experimental validation of compound-kinase affinity 139 predictions, we first set up an in silico simulation of the prediction and discovery process. We 140 obtained a publicly-available dataset [1] containing binding affinity measurements, within a 0.1 141 to 10,000 nanomolar (nM) range, of the complete set of kinase-compound pairs among 72 142 compounds and 442 unique kinase proteins (the dataset contained 379 unique kinase genes with 143 multiple mutational variants for some of the genes). We set up a cross-validation-based 144 simulation by separating the known data into training and test data (Supplementary Fig. 1A). 145 Importantly, to simulate out-of-distribution prediction, we ensured that approximately one-third 146 of the test data contained interactions involving compounds not in the training data, one-third 147 contained interactions involving kinase genes not in the training data, and one-third contained 148 interactions involving compounds and kinase genes not in the training data (Supplementary 149 Fig. 1A). 150 Our main set of benchmarking methods leverages unsupervised pretraining via state-of- 151 the-art neural graph convolutional-based compound features (pretrained by the original study 152 authors on ~250K small molecule structures) [43] and neural language model-based protein 153 sequence features (pretrained by the original study authors on ~21M protein sequences) [44] 154 (Methods). Subsequent regression models use a concatenation of these features to predict Kd 155 binding affinities ( Fig. 2A). 156 We benchmark methods that also learn some notion of prediction uncertainty. Our first 157 uncertainty model fits a GP regressor [18], [45] to the training set. For a given compound-kinase 158 pair, the GP provides a Kd prediction in the form of a Gaussian distribution, where we use the 159 mean of the Gaussian as the prediction value and the standard deviation as the measure of 160 uncertainty. Our second method first fits a multilayer perceptron (MLP), followed by fitting a GP 161 to the residuals of the MLP predictions (MLP + GP) [46]. This results in a hybrid model where 162 the prediction is the sum of the MLP and the GP estimates, while the GP variances can be used 163 as the uncertainty scores. We also benchmark two other methods that attempt to augment neural 164 networks with uncertainty: a Bayesian multilayer perceptron (BMLP) [23], [47] and an ensemble 165 of MLPs that each emits a Gaussian distribution (GMLPE) [22]. 166 We also benchmark three baseline methods without uncertainty. On the same neural 167 pretrained features, we test an MLP, also known as a densely-connected neural network [48], 168 [49] and collective matrix factorization (CMF), a model often used to recommend shopping 169 items for potential buyers that can also recommend drugs for potential targets [50]- [53]; variants 170 of both models have seen extensive use in previous compound-target interaction prediction 171 studies. Lastly, to assess the benefit of our unsupervised pretraining-based features, we also train predicted Kds and the ground truth Kds for our GP and MLP + GP models over all test data are 178 0.35 and 0.38, respectively (n = 24,048 compound-kinase pairs), in contrast with 0.26, 0.23, and 179 0.21 for the MLP, CMF, and DGraphDTA baselines, respectively (Supplementary Fig. 1B). 180 Good regression performance of GP-based methods is also consistent across all our metrics 181 (Pearson correlation, Spearman correlation, and mean square error) when partitioning the test set 182 based on exclusion of observed compounds, kinases, or both (Supplementary Fig. 1B). 183 Importantly, we also observed that, in this relatively data-limited training setting, rich 184 pretrained features combined with a relatively lightweight regressor (e.g., a GP or MLP) 185 outperformed a more complex regressor architecture (i.e., DGraphDTA) trained end-to-end on 186 simpler features (Supplementary Fig. 1B). This provides evidence that pretraining with state-of- 187 the-art unsupervised models contributes valuable information in a data-limited setting. 188 Where robust GP-based prediction has a substantially large advantage is in prioritizing 189 compound-kinase pairs for further study. Importantly, in contrast to average-case metrics, 190 focusing on top predictions directly mimics biological discovery, since researchers typically 191 choose only a few lead predictions for further experimentation rather than testing the full, 192 unexplored space. In GP-based models, we observed that predictions with lower uncertainty are 193 more likely to be correct, whereas high-uncertainty predictions have worse quality 194 ( Supplementary Fig. 1C), allowing us to prioritize compound-kinase pairs with high predicted 195 affinity and low prediction uncertainty (Methods). In contrast, models without uncertainty like 196 the MLP do not distinguish confident and uncertain predictions (Supplementary Fig. 1C). The 197 top compound-kinase pairs acquired by the GP-based models have strong, ground-truth 198 affinities, while the other methods with poorly calibrated or nonexistent uncertainty 199 quantification struggle to prioritize true interactions and acquire interactions with significantly 200 higher Kds ( Fig. 2B and Supplementary Fig. 2A). Performance of the GP-based models 201 decreases when ignoring uncertainty (Supplementary Fig. 2B In theory, out-of-distribution predictions with low predicted Kd and low uncertainty ( Fig.   219 2C,D) will also have low in vitro Kds. Before validating our predictions, we first wanted to 220 ensure that the empirical behavior of our uncertainty models matched our intuition: namely, that 221 unknown compounds very different from any compound in the training set would also have high 222 associated uncertainty. To do so, we visualized the 72 compounds from the training set [1] and 223 the 10,833 unknown-affinity compounds using a two-dimensional t-distributed Stochastic 224 Neighbor Embedding (t-SNE) [55] of the structure-based compound feature space. The 225 embedding shows large regions of the compound landscape that are far from any compounds 226 with known affinities (Fig. 2E). 227 We can then use a GP, trained on just 72 compounds, to assign a predicted Kd affinity 228 (for a given kinase target) to each compound in the ZINC/Cayman library, as well as a 229 corresponding uncertainty score. When we color the points in the visualization by uncertainty, 230 consistent with our intuition, GP uncertainties are lower in regions near compounds with known 231 affinities and higher in regions with many unknown-affinity compounds (Fig. 2F), with high 232 correlation between the uncertainty score and test compound distance to its Euclidean nearest 233 neighbor in the training set (Spearman r = 0.87, n = 10,833 compounds; Methods). The GP 234 prioritizes compounds within the low uncertainty regimes that also have high predicted binding 235 affinity ( Fig. 2G and Supplementary Fig. 3). In contrast, the MLP assigns high priority to many 236 compounds far from the known training examples ( Fig. 2H and Supplementary Fig. 3), which 237 is most likely due to pathological behavior on out-of-distribution examples. For comparison, 238 CMF seems unable to learn generalizable patterns from the small number of training compounds 239 ( Fig. 2I and Supplementary Fig. 3). 240 Uncertainty prediction discovers sub-nanomolar compound-kinase biochemical activity 241 We then performed machine-guided discovery of compound-kinase interactions. Since 242 our in vitro binding assays are optimized to screen many compounds for a given kinase, we 243 focused our validation efforts on a set of four diverse kinases: human IRAK4, a serine/threonine 244 kinase involved in Toll-like receptor signaling [56]; human c-SRC, a tyrosine kinase and 245 canonical proto-oncogene [57]; human p110δ, a lipid kinase and leukocytic immune regulator 246 [58]; and Mtb PknB, a serine/threonine kinase essential to mycobacterial viability [59]. These 247 kinases have well-documented roles in cancer, immunological, or infectious disease [39]- [42]. 248 To compare prediction with and without uncertainty, we used either our GP or MLP 249 models to acquire compounds from the ZINC/Cayman library with high predicted affinity for 250 each of the four kinases of interest. We validated the top five predictions returned by the GP or 251 MLP for each kinase using an in vitro biochemical assay to determine the Kd (Methods). 252 We observed that none of the predictions acquired by the MLP had a Kd of less than the 253 top tested concentration of 10 μM ( Fig. 3 and Supplementary Table 2), consistent with out-of-254 distribution prediction resulting in pathological model bias (Supplementary Fig. 2). In contrast, 255 the GP yielded 18 compound-kinase pairs with Kds less than 10 μM (out of 20 pairs tested, or a 256 hit rate of 90%), 10 of which are lower than 100 nM ( Fig. 3 and Supplementary Table 2). 257 Notably, GP acquisition yielded sub-nanomolar affinities between K252a and IRAK4 (Kd = 0.85 258 nM) and between PI-3065 and p110δ (Kd = 0.36 nM), automating discoveries that previously 259 had been made with massive-scale screens or expert biochemical reasoning [42], [60]. Some 260 compounds had predicted and validated affinities for multiple kinases, such as K252a, an 261 member of the indolocarbazole class of compounds, many of which have broad-spectrum kinase 262 inhibition [1]. Other compounds were only acquired for one of the kinases, including PI-3065 for 263 p110δ, WS3 for c-SRC (Kd = 4 nM), and SU11652 for PknB (Kd = 76 nM). Interestingly, the 264 latter two of these interactions do not seem to have existing experimental support; WS3 was 265 developed as an inducer of pancreatic beta cell proliferation [61] and SU11652 was developed 266 for human receptor tyrosine kinase inhibition [62]. 267 To further assess the impact of uncertainty on prediction quality, we also performed 268 PknB acquisition with another GP-based model (MLP + GP) and varied the weight β on the 269 uncertainty (Methods). We validated the top five predictions from the GP and MLP + GP at β = 270 1 (tolerates some uncertainty) and β = 20 (prefers lowest Kds with a very low tolerance for 271 uncertainty), as well as the top five predictions from the GP at β = 0 (i.e., ignoring uncertainty). 272 At β = 20, the MLP + GP acquired a similarly potent set of compounds as the GP. Tolerating 273 greater amounts of uncertainty, or ignoring it completely, led to more false-positive predictions 274 (Fig. 3B). 275 The ability for GP-based models to yield (sub-)nanomolar affinity compound-kinase pairs 276 in a highly out-of-distribution prediction task provides strong experimental support for our 277 uncertainty-based learning approach. We note that training our models on information from 72 278 compounds to make predictions over a 10,833-compound library is a much more challenging 279 task than any previously reported drug-target interaction prediction setting [48]- [52]. Moreover, 280 among previous studies that performed experimental validation of machine learning-predicted 281 compound-target interactions, none report Kds or IC/EC50s below the micromolar range [48], 282 [52], suggesting that our approach yields most potent interactions discovered by compound-283 target interaction prediction to date. 284 A notable advantage of GP-based uncertainty quantification is that it enables an absolute 285 assessment of prediction quality. For example, all predictions with a mean less than 10 μM (our 286 top-tested concentration) and an interquartile range less than 2 μM resulted in true positive hits 287 ( Supplementary Fig. 5). In contrast, more dispersed prediction distributions had higher 288 variability in the potency of the true binding interaction including false positives 289 ( Supplementary Fig. 5), suggesting that our GP-based models make better predictions when 290 they are more confident. Uncertainty adds an interpretable dimension to machine-generated 291 predictions; for example, a researcher with a low tolerance for false positives might ignore a 292 generated hypothesis with a low predicted Kd but a high prediction uncertainty. 293 Anti-Mtb activity of compounds with validated PknB biochemical activity 294 Machine-generated hypotheses also add value by stimulating follow-up biological 295 research that provides insights beyond the initial prediction problem. For example, given the 296 novel, potent interactions discovered by our models, we sought to further assess if the 297 compounds had any broader relevance beyond biochemical affinity with the protein molecule 298 itself. 299 PknB is a kinase that is essential to Mtb viability [59]. Bacterial kinases are less well 300 studied than human (or other mammalian) kinases [63], [64] but are nonetheless important 301 therapeutic targets [39], [59], [65], especially in antibiotic-resistant bacterial strains. Most 302 importantly, tuberculosis remains the leading cause of infectious disease death globally [32], 303 underscoring the importance of novel antibacterial therapies. Given the essentiality of PknB and 304 our in silico identification of PknB-binding compounds, we sought to examine if the compounds 305 with high binding affinity to PknB would have any impact on mycobacterial growth. This would 306 not be guaranteed since factors like cell wall permeability or intracellular stability were not 307 explicitly encoded in the training data. 308 We focused on the compounds with a Kd less than 100 nM: K252a (Kd = 11 nM), 309 TG101209 (Kd = 71 nM), and SU11652 (Kd = 76 nM). Using the colorimetric, resazurin 310 microtiter assay (alamar blue) [39], [66], we determined the minimum inhibitory concentration 311 (MIC) of these compounds as well as rifampicin, a frontline antibiotic for tuberculosis [32] 312 (Methods); the MICs for these compounds with H37Rv are shown in Supplementary Table 3. 313 We observed that K252a and SU11652 inhibited the growth of H37Rv compared to a dimethyl 314 sulfoxide (DMSO) vehicle control (one-sided t-test P-value of 7.0 × 10 -8 for K252a and 3.9 × 10 -315 8 for SU11652, n = 3 replicate cultures per condition) ( Fig. 4A and Supplementary Fig. 4A). 316 SU11652 is a well-documented inhibitor of human receptor tyrosine kinases including PDGFR, 317 VEGFR, and Kit [62]. TG101209 did not inhibit growth of H37Rv (one-sided t-test P-value of 318 0.11, n = 3 replicate cultures per condition) ( Fig. 4A and Supplementary Fig. 4A), perhaps due 319 to low cell permeability [67], [68]. These results were corroborated using additional validation 320 where Mtb expressing the luxABCDE cassette (luxMtb) was incubated with increasing 321 concentrations of K252a, SU11652, and TG101209 (Supplementary Fig. 4B; Methods). 322 We further validated these results in a more complex, host-pathogen model. We utilized a 323 previously described assay for monitoring intracellular growth where macrophages are infected 324 with luxMtb and luminescence is measured as a proxy of bacterial growth [69], [70] (Methods). 325 We infected macrophages with luxMtb for 4 hours prior to the addition of compounds dissolved 326 in cell culture media. Consistent with our axenic culture experiments, treatment with K252a and 327 SU11652 resulted in less luminescence as compared to DMSO (one-sided t-test P-value of 2.9 × 328 10 -6 for K252a and 2.8 × 10 -6 for SU11652; n = 3 replicate cultures per condition) (Fig. 4B,C). 329 In examining the literature for prior work on compounds targeting PknB, we identified support 330 for K252a as an inhibitor of PknB kinase activity and Mtb growth [59], [65]. These previous Follow-up analyses can also take the form of additional prediction rounds that 340 incorporate the results of previous experiments, a setting in which sample-efficiency is 341 paramount. This iterative cycle involving prediction, acquisition, model retraining, and 342 subsequent prediction and acquisition is referred to as "active learning" [28], [31]. In this vein,  Table 4), indicating that IKK-16 is structurally remote to any compound in the training data; for reference, a recently used novelty 361 threshold was a Tanimoto similarity of 0.40 [12]. 362 We could not find existing literature linking IKK-16 to PknB, suggesting a new potential 363 biomedical dimension for IKK-16 and related small molecules. These results also illustrate how 364 uncertainty combined with an active learning strategy can explore regions of the compound 365 space that are more distal to the original training set and still provide successful predictions. 366 Uncertainty prediction improves generative design of novel compound structures 367 While our discovery experiments leveraged existing, commercially available compounds, 368 our robust predictive models can also help us design new compound structures with high affinity 369 for PknB. In particular, we are interested in a generative design paradigm. As in all design 370 problems, a designer wishes to create a new object with a desired property. In machine-assisted 371 generative design, a generator algorithm is responsible for generating novel objects while an 372 evaluator algorithm prioritizes objects that best fulfill the desired property. In theory, uncertainty 373 prediction enables more robust evaluators that select designs that better reflect the desired, 374 ground-truth property. 375 We performed generative design of novel small molecule structures that have strong 376 affinity for PknB. Our generation strategy was based on sampling from the latent space of a 377 variational autoencoder (VAE) [72], with an architecture optimized for chemical structures 378 (JTNN-VAE) [43] (Methods). We trained a JTNN-VAE to reconstruct the distribution of the 379 entire ZINC/Cayman dataset. We randomly sampled from the JTNN-VAE latent space and 380 decoded the result to obtain an "artificial library" of 200,000 compound structures that do not 381 exist in the ZINC/Cayman dataset (we note that our model for generating chemical structures is 382 distinct from the model used to encode structural features). We used the MLP, MLP + GP, and GP to rank compounds within this artificial library for predicted affinity with PknB, taking 384 uncertainty into account for the latter methods. We emphasize that the generative model learned 385 across the ZINC/Cayman dataset is still highly mismatched from the 72-compound training 386 distribution of the GP, MLP + GP, and MLP. We then used molecular docking, an orthogonal 387 method for binding affinity prediction, to simulate the true binding affinity between the 388 generated compound and the PknB active site. Since consistency across disparate docking 389 scoring functions corresponds to better prediction of true biochemical affinity [73], we use six 390 scoring functions to compare generated designs selected with and without uncertainty [74]- [78]. 391 The molecules prioritized by the GP-based methods had significantly higher affinity than 392 the MLP baseline across all scoring functions, based on one-sided Welch's t-test P-values at a 393 false discovery rate (FDR) less than 0.05 [79] (Fig. 5A). There was no significant difference 394 between docking scores of GP-based methods and of known high-affinity compounds, used as 395 positive controls (Fig. 5A). Visual inspection of the best designs predicted by the GP and by 396 docking reveals structures similar to known inhibitors (Fig. 5B), while some of the structures 397 prioritized by the MLP appear pathological (Fig. 5C). For the compounds with strong binding 398 affinity, visualizing the binding poses suggested by the docking algorithm shows concordance 399 with a known crystallography-determined small molecule pose [80] (PDB: 2FUM) (Fig. 5D). 400 These results show how uncertainty-based robustness in an out-of-distribution setting can better 401 guide the generative design of new chemical structures. 402 Generality of uncertainty prediction to disparate biological domains 403 Importantly, because our approach to machine-guided discovery is general, it can also be 404 applied to diverse domains beyond kinase affinity prediction. Many biological problems, while 405 seemingly disparate, are fundamentally similar in that they are based on predicting the value of target variables based on a set of feature variables ( Fig. 6A; compare to Fig. 2B). We therefore 407 wanted to demonstrate generality by applying our same learning paradigm to predict the 408 brightness of fluorescent proteins based on protein sequence features (Methods), potentially 409 enabling an algorithm that can optimize, in silico, the fluorescence of an existing protein design 410 [14], [81]. 411 We obtained a high-throughput mutagenesis dataset involving avGFP [2]. We trained 412 machine learning models to predict fluorescent brightness based on protein sequence features, 413 where we used the exact same pretrained embedding model as in our kinase experiments [44]. To 414 simulate a data-limited scenario, we only gave the model access to sequences with at most one 415 amino acid mutation compared to wild-type (n = 1,115 sequences) (Fig. 6B); in contrast, our test 416 set consisted of sequences with two or more mutated residues (n = 52,910 sequences), simulating 417 a scenario where an algorithm is asked to make predictions over a more combinatorially complex 418 space. We use the same pretrained learning models as in our kinase cross-validation experiments 419 except for DGraphDTA, a domain-specific model, and CMF, which does not naturally apply to 420 this problem setup; instead, we use ordinary least squares (OLS) regression, a better-suited linear 421 model, as a replacement benchmark (Methods). 422 We once more observed that GP-based models acquired mutant sequences with 423 significantly higher average brightness than other methods (Fig. 6C). The GP also performs well 424 in the average case, with an acquisition ranking that is strongly and significantly correlated with 425 measured fluorescence (Spearman r = 0.78, two-sided P < 10 -308 , n = 52,910 sequences) ( Fig.   426 6D), which is competitive with or better than baseline methods, many of which overfit the small 427 training dataset (Supplementary Fig. 6A). While OLS regression was also robust to overfitting 428 in the average case, GP-based uncertainty results in substantially fewer false positive predictions 429 among top-acquired sequences ( Fig. 6C and Supplementary Fig. 6B). As in the kinase 430 inhibition cross-validation experiments, we observed that GP-based uncertainty helped reduce 431 false-positive predictions among top-acquired samples compared to predictions without 432 uncertainty (Supplementary Fig. 6C). These results suggest how predictive (and robust) 433 algorithms could help reduce the complexity of mutagenesis experiments traditionally used to 434 optimize new protein designs [2], [82]. 435 Structurally, many GP-acquired mutations are to surface residues (Fig. 6E), consistent 436 with previous observations that buried residue mutations are more likely to be deleterious to 437 fluorescence [2]. A notable exception is high GP prioritization of mutations to T62 (Fig. 6E), a 438 buried residue essential to chromophore synthesis [83]. Interestingly, the GP prioritized alanine 439 or serine substitutions (i.e., T62A or T62S) that preserve and even confer higher fluorescence 440 (e.g., a T62A/L178I mutant, the second sequence in the GP acquisition ranking, has a log-441 fluorescence of 3.9, versus 3.7 for wild-type). While surface residue mutations are more likely to 442 be neutral, an algorithm aimed at enhancement might focus on these buried residues that directly 443 influence fluorescence. Biological discovery often requires making educated hypotheses with limited data under 452 substantial uncertainty. In this study, we show how machine learning models that generate 453 biological hypotheses can overcome such challenges. In particular, our results suggest a broadly 454 useful paradigm: neural pretrained features followed by a task-specific supervised GP-based 455 model. We show that uncertainty provides a useful guard against overfitting and pathological 456 model bias, sample efficiency enables successful iterative learning across a broad spectrum of 457 experimental scales, and pretraining elevates our uncertainty models to state-of-the-art 458 performance. 459 While our study provides concrete illustrations of uncertainty prediction in biochemical 460 activity and protein engineering tasks, uncertainty could also be impactful in many other areas 461 given the complexity of biological structures and systems. The generality of our approach is a 462 notable advantage and stands to benefit from many recent advances in biological representation 463 learning, such as, for example, gene embedding approaches that learn information from 464 coexpression or protein interaction networks [84]. These features can be readily coupled with a 465 GP-based regressor to predict a desirable phenotype; for example, a GP could be trained based 466 on gene embeddings to prioritize CRISPR experiments or other genetic perturbations with the 467 goal of inducing a particular phenotype (for example, cellular fitness or surface marker 468 expression), which is especially useful when acquiring from a combinatorially large 469 interventional search space [85]. 470 Our work highlights GP-based methods as particularly useful. GPs enable theoretically 471 principled incorporation of prior information, can use standard kernels to approximate any 472 continuous function [86], and preserve and even improve modelling performance in complex, order to probe novel regimes [17], [18]. For example, in the drug discovery setting, novelty is important since human-designed drugs are often appraised based on their creativity (in addition 497 to effectiveness) [90], [91]. Novelty is also important across biological domains, such as 498 designing artificial proteins not found in nature [14] or discovering new transcriptional circuits 499 [85]. Uncertainty helps define the boundaries of an algorithm's knowledge, beyond which human 500 creativity can take over. 501 Although initializing a model with some training data is helpful, it is also possible to 502 begin with zero training data (all predictions might therefore begin as equally uncertain). As 503 more data is collected, a sample-efficient model with uncertainty can progressively yield better 504 and more confident predictions. This is the iterative cycle of computation and experimentation at 505 the heart of active learning [28], [31], for which we provide a proof-of-concept example in this 506 study. 507 More generally, we anticipate that iterative experimentation and computation will have a 508 transformative effect on the experimental process. In addition to learning from high-throughput 509 datasets, we also envision learning algorithms working intimately alongside bench scientists as 510 they acquire new data, even on the scale of tens of new datapoints per experimental batch. As we 511 show, using machine learning to generate novel hypotheses will require a principled 512 consideration of uncertainty.

Acknowledgments
We thank Tristan Bepler and Ellen Zhong for helpful discussions. We thank Diane Ballestas, Patrice Macaluso and Sydney Solomon for assistance with the validation experiments.
We thank Robert Chun for assistance with the manuscript. B.H. is partially supported by the Department of Defense (DoD) through the National Defense Science and Engineering Graduate Fellowship (NDSEG) and by NIH grant R01 GM081871 (to B. Berger).

Declaration of interests
The authors declare no competing interests.

Mycobacterium tuberculosis model details
We utilized wild-type H37Rv and H37Rv expressing an integrated copy of the luxABCDE cassette which enables mycobacteria to endogenously produce light [70]; monitoring luminescence of the latter strain has been demonstrated to correlate well with the standard colony forming unit assay [69].

Multilayer perceptron (MLP)
We trained an MLP with two hidden layers with 200 neurons per layer and rectified linear unit (ReLU) activation functions, trained with mean square error loss and adaptive moment estimation (Adam) with our implementation's default optimization parameters (learning rate of 0.001, 1 = 0.9, 2 = 0.999) [92]. Hyperparameters were tuned based on a small-scale grid search using five-fold random cross-validation within the compound-kinase training set before application to the out-of-distribution test set, with a particular emphasis on preventing overfitting [48]. Lower model capacity, ℓ 2 regularization of the densely connected layers (weight 0.01), and early stopping after 50 training epochs were helpful in preventing overfitting to the training data and highly pathological outputs on out-of-distribution data (for example, outputting the same prediction value for all instances). The MLP was implemented using the keras Python package (version 2.3.1) using a tensorflow (version 1.15.0) backend with CUDA-based GPU acceleration.

Collective matrix factorization (CMF)
We performed CMF using the compound-kinase Kds as the explicit data matrix and the neural-encoded compound and kinase features as side-information [53]. Briefly, CMF optimizes the loss function

Ordinary least squares (OLS)
For the protein fluorescence experiments, we use OLS as a replacement benchmarking model that is more natural to the problem setup than CMF, which is most commonly used in recommender-system problems involving the affinity between two types of entities (for example, users and shopping items, or compounds and kinases). OLS minimizes the loss ( ; , ) = ‖ − ‖ F 2 where is sample-by-target-variable matrix, is the sample-by-feature-variable matrix, and is a learned coefficient matrix (the latter two matrices are also augmented to fit a constant intercept term for each target variable). We use the implementation in the scikit-learn Python package.

DGraphDTA
We used DGraphDTA [54] to predict compound-kinase Kds. DGraphDTA leverages a graph neural network based on the compound molecular structure and the protein residue contact map. We used the implementation provided at https://github.com/595693085/DGraphDTA with default model architecture hyperparameters. For compound features, we provided the model with chemical SMILE strings that the model transforms into a graph convolutional representation [94]; for kinase features, we use the protein contact maps provided by the original study.

Prediction acquisition function
For models that output uncertainty scores, an acquisition function is used to rank compound-kinase pairs for acquisition, which in the biological setting often corresponds to further experimental validation, in a way that balances both the prediction value and the associated uncertainty. A standard acquisition function is the upper confidence bound (UCB) [38]. When low prediction values are desirable, UCB acquisition takes the form

Gaussian Process (GP)
GPs are a Bayesian machine learning strategy that can learn nonlinear functions, can work with limited data, and enable principled incorporation of prior information. The aspect of GPs most relevant to this study is that they enable a researcher to explicitly specify prior information encoding both a "baseline" prediction and corresponding uncertainty. For example, a priori, a researcher can assume that a given compound-kinase pair has low affinity; this intuition can be encoded as a probability distribution with most of the probability density assigned to low affinity but with small, nonzero probability assigned to high affinity. On a prediction example that is very different from any training example, the prediction uncertainty of a GP approaches the value of the prior uncertainty [18].
A Gaussian process regressor is fully described by a mean function and a covariance function. For the compound-kinase experiments, our mean function is set to a constant value corresponding to a Kd of 10,000 nM (i.e., the top tested concentration above which the Kd is not determined and a compound-kinase pair is considered inactive). For the protein fluorescence experiments, the mean function is set to a constant value corresponding to a log-fluorescence of 3 (i.e., the original study's darkness cutoff). Our covariance function is set to a Gaussian, or a squared exponential, kernel scaled by a constant prior related to the prior uncertainty where ‖•‖ 2 denotes the ℓ 2 -distance between feature vectors and . For the kinase experiments, prior is set to 10,000 nM; for the protein fluorescence experiments, prior is set to a log-fluorescence of 2; for both, is set to unity. Each prediction takes the form of a (scalar) Gaussian distribution; we use the mean as the prediction value and the variance as the uncertainty estimate. We use the Gaussian process regressor implementation provided by the scikit-learn Python package. Gaussian processes are reviewed in-depth by Rasmussen and Williams and with helpful, high-level visual aids by Görtler et al.

Gaussian process fit to residuals of a multilayer perceptron (MLP + GP)
Since much of the interest in machine learning has been on improving the performance of neural network models, a simple way to augment neural networks with uncertainty is to combine the predictions made by a neural network and predictions made by a GP [46]. We use an MLP regressor with the same architecture and hyperparameters as the standalone MLP model described above. The GP fit to the residuals of the MLP regressor has the same form as described for the regular GP above but where the regression problem is formulated as for training example and training label . To calculate the prediction value, we evaluate both the MLP and the GP and sum the MLP prediction and the GP mean [46], i.e., To calculate the uncertainty estimate, we can simply use the GP standard deviation [46], i.e., pred ( ) = Var(GP(̃)) 1/2 .
We used the same software (a combination of the scikit-learn, GPyTorch, keras, and tensorflow Python packages) to implement the hybrid model.

Bayesian multilayer perceptron (BMLP)
A more involved, Bayesian approach to augmenting neural networks with uncertainty is to impose a Bayesian prior on the parameters of the neural network. We train an MLP regressor with the same architecture described above (two hidden layers with 200 neurons per layer and ReLU non linearities) but with a unit-variance Gaussian prior on each weight and bias entry [23].
Within the respective biological task, the Gaussian prior mean for each entry corresponds to a Kd of 10,000 nM (i.e., no biochemical affinity) or a log-fluorescence of 3 (i.e., a dark protein).
Optimization was performed under a mean-field independence assumption with gradient descentbased variational inference [23], [47]. When making predictions, we sample 100 neural networks and evaluate each neural network on each prediction example. We use the mean prediction across the 100 neural networks as the prediction value and the variance across the 100 neural networks as the uncertainty estimate. To implement the BMLP, we used the Edward Python package (version 1.3.5) for probabilistic programming [47] with a tensorflow CPU (version 1.5.1) backend.

Gaussian negative log-likelihood-trained multilayer perceptron ensemble (GMLPE)
Rather than a Bayesian approach to uncertainty, another group of uncertainty methods is based on model ensembles. Ensembling involves fitting multiple models to a training dataset; then, variation in the predictions of the models can be used to estimate uncertainty. For our ensemble method, we use the model described by Lakshminarayanan et al. We train an MLP regressor with the same architecture described above (two hidden layers with 200 neurons per layer and ReLU non linearities) but, instead of mean square error loss, with Gaussian negative log-likelihood loss where pred ( ) is the predicted value and pred ( ) is the predicted uncertainty (both outputted by the neural network), and true ( ) is the ground truth value for training example ∈ {1,2, … , }. We train five such models to create a neural network ensemble and we combine prediction distributions across the ensemble as with a Gaussian mixture. As an implementation detail, we trained the neural network to output the log variance to enforce positivity. We implemented the GMLPE with the keras Python package using a tensorflow backend with CUDA-based GPU acceleration.

Compound-kinase affinity prediction cross-validation setup and benchmarking
We obtained a dataset of binding affinity Kds across all pairs of 72 compounds and 442 kinases (corresponding to 379 unique genes) from Davis et al. Compounds were partitioned randomly into two equal-sized sets of 36 and kinases were partitioned randomly into one set of 190 unique genes (corresponding to 216 kinase proteins, including mutational variants) and another set of 189 unique genes (corresponding to 226 kinase proteins). Compound-kinase pairs therefore fell into one of four "quadrants" of the interaction matrix defined by sets of partitioned compounds and kinases (for a pictorial representation, see Supplementary Fig. 1A). One quadrant of the compound-kinase interaction matrix was reserved as training data; the other three quadrants were used as out-of-distribution test data.
Compound structures were obtained from the ZINC database [30] and kinase amino acid sequences were obtained from the UniProt database [95]. of the corresponding compound and kinase. We observed that these state-of-the-art features provided much better empirical performance than traditional features like chemical fingerprints [96] or one-hot-encoded protein family domains [48].
The six benchmarking models (GP, MLP + GP, BMLP, GMLPE, MLP, CMF) were trained on the training quadrant and used to make a prediction (and, if applicable, a corresponding uncertainty score) for each compound-kinase pair in the three test quadrants.
Three standard, average-case performance metrics were used: (1) Pearson correlation between predicted and true Kds, (2) Spearman correlation between predicted and true Kds, and (3) mean square error between predicted and true Kds. We also performed a "lead prioritization" experiment. Compound-kinase pairs in the test set were ranked for uncertainty methods by a rank-UCB acquisition function (β = 1) and for non-uncertainty methods by the prediction value.
We compared the Kds for the top k of these compound-kinase pairs across all six methods; we repeated this experiment for k = 5 and k = 25 to assess different lead prioritization thresholds.
The distributions of binding affinities among acquired compound-kinase pairs were assessed for statistical significance (FDR < 0.05) using Welch's unequal variances t-test. The above crossvalidation benchmarking experiments (both average-case and lead prioritization) were repeated over five random seeds.

Acquisition of commercially available ZINC/Cayman compound dataset
We obtained a dataset of small molecule compounds over which we wished to predict binding affinity with various kinases. We used the ZINC database [30], an online repository of Compounds were acquired directly from Cayman Chemical. All supplied compounds were tested to ensure ≥ 98% purity. We leveraged the kinase affinity assays provided by the .
Curves were fitted using a non-linear least square fit with the Levenberg-Marquardt algorithm.
The Hill slope was set to -1; a deviation from this Hill slope in the dose-response pattern was used to identify possible aggregation, but no such deviation was observed. A full dose-response curve was obtained and fit in duplicate, and we report the mean Kd between duplicate curves.

Axenic Mtb growth inhibition assay
H37Rv Mtb growth was evaluated using the resazurin viability assay (alamar blue allowed to re-adhere overnight.

Intra-macrophage Mtb growth inhibition assay
H37Rv Mtb expressing the luxABCDE cassette were grown to an optical density of 0.4 and centrifuged briefly. Mtb were resuspended in pre-warmed maintenance media and filtered through a 5 μM filter to remove clumped bacteria and generate a single-cell suspension.
Macrophages were infected at a multiplicity of infection of 3 bacteria to 1 macrophage in 100 μL per well and phagocytosis was allowed to proceed for 4 hours prior to washing macrophages twice with pre-warmed maintenance media to remove extracellular bacteria. Following phagocytosis and washing, cells were incubated with media containing a DMSO vehicle control or rifampicin, K252a, or SU11652 (Cayman Chem) for 5 days. On day 5, we measured luminescence as a proxy of intracellular bacterial burden as previously described [69], [70] using a high-throughput luminometer.

Second Round of Acquisition and Validation of Compound-PknB Interactions
A GP or an MLP was trained on both the original training dataset [1] and all of the firstround PknB-related experimental data (Fig. 3B). All other training and acquisition details were the same as in the first prediction round. The top five predictions for the GP (β = 20) and for the MLP (ten predictions in total) had their binding affinity Kds determined via the same in vitro binding assay described above. IKK-16, acquired by the GP, was the sole hit with Kd below 10,000 nM. To assess the similarity between IKK-16 and each of the original training set compounds, we used the RDKit to compute the Tanimoto similarity of Morgan fingerprints with a radius of 2 and 2048 bits, the same Tanimoto similarity computation procedure as in Stokes et al.

Generation of artificial compound library and affinity prediction
We used a machine-learning method to generate a library of 200,000 unique compound structures not present in the ZINC/Cayman database. To do so, we trained a JTNN-VAE [43] to reconstruct the ZINC/Cayman dataset of 10,833 compounds using the default model architecture parameters in the publicly available training code (https://github.com/wengong-jin/icml18-jtnn).
Hyperparameters were selected based on the provided defaults, namely an embedding dimension molecule structures present in the training data were discarded and not considered as novel designs. We note that the JTNN-VAE model for chemical featurization is a different, pretrained model used in the original study [43]; we trained a separate JTNN-VAE model for molecule generation. These artificially generated compound structures were featurized and concatenated with the PknB feature vectors as described previously. We then acquired the top ten compounds according to a GP (β = 20), an MLP + GP (β = 20), and an MLP, where all models were trained exclusively on the Davis et al. kinase inhibition data, as described previously.

Docking-based validation of compound designs
We used a crystallography-determined structure of PknB in complex with mitoxantrone (PDB: 2FUM) [80] as the underlying structure for our docking procedure. To prepare the kinase structure for docking, we restricted our analysis to chain A with the mitoxantrone molecule removed. Our docking region encompassed the full set of amino acids directly proximal to the ligand pocket (https://www.rcsb.org/3d-view/2FUM?preset=ligandInteraction&sele=MIX).
Structure files were preprocessed to be in compatible file formats using AutoDockTools version  [78], and smina version "Oct 15 2019" [75], the last of which implemented the DK and Vinardo [74] scoring functions in addition to its own default scoring function. The Vina-and smina-based toolkits were run with an exhaustiveness parameter of 500, a high value meant to increase the search space of possible poses, and all other parameters set to the default. We use the reported energy scores returned by the docking procedure. We visualized docking poses using PyMOL version 2.3.3.

Protein fluorescence cross-validation setup and benchmarking
We obtained mutagenesis data from Sarkisyan et al., treating each unique avGFP sequence as a separate sample featurized by embeddings derived from the full protein sequence.
We used the same pretrained sequence embedding model from Bepler and Berger [44] used to featurize kinase sequences in our other experiments. The original mutagenesis study assigned a median log-fluorescence value to each unique sequence, obtained via fluorescence-activated cell sorting of GFP-expressing bacterial vectors based on brightness at 510 nm emission [2]. Our supervised formulation is to predict log-fluorescence based on the neural sequence embedding.
We use a log-fluorescence of 3 as a cutoff (as used in the original study) below which all sequences are considered equally dark (i.e., when training the model, we set all dark sequences to a log-fluorescence value of 3).
GP, MLP + GP, BMLP, GMLPE, MLP, and OLS regressors were each trained on 1,115 unique sequences containing at most one mutation to wild-type avGFP (UniProt: P42212). After training, we acquired sequences among the remaining avGFP mutants (a total of 52,910 unique sequences) from the same study based on higher predicted brightness, and, if available, low predicted uncertainty using rank-UCB (β = 1). We compared models based on the top 50 or 500 acquired sequences; models with pseudo-randomness were run across five random seeds. The distributions of fluorescence among acquired sequences were assessed for statistical significance (FDR < 0.05) using Welch's unequal variances t-test. We also measured the Spearman correlation between acquisition rank and each mutant sequence's median log-fluorescence.
Mutations involved in the top hundred acquired sequences by the GP were located on an X-ray crystallography-determined avGFP structure (PDB: 2WUR) [99]. We used the FindSurfaceResidues (https://pymolwiki.org/index.php/FindSurfaceResidues) PyMOL script to distinguish buried and surface residues. We used PyMOL to visualize the protein structure.

Benchmarking hardware and computational resources
Experiments had access to a Nvidia Tesla V100 PCIe GPU (32GB RAM) and an Intel Xeon Gold 6130 CPU (2.10GHz, 768GB of RAM). GP training for kinase and GFP experiments 53 required approximately 60 and 30 minutes of runtime, respectively. All experiments required a maximum of 50 GB of CPU RAM or 32 GB of GPU RAM.

Statistical analysis implementation
We use the scientific Python toolkit, including the scipy (version 1.3.1) and numpy (version 1.17.2) Python packages [100], to compute the statistical tests described in the manuscript, including Pearson correlation, Spearman correlation, Welch's unequal variance ttest, and associated P values. We use the seaborn Python package (version 0.9.0; https://seaborn.pydata.org/) to compute the 95% confidence intervals and violin-plot kernel density estimates in our data visualizations.

Data and code availability
We make our data and code available at http://cb.csail.mit.edu/cb/uncertainty-ml-mtb/.