TY - JOUR
T1 - Adding stochastic negative examples into machine learning improves molecular bioactivity prediction
JF - bioRxiv
DO - 10.1101/2020.05.21.107748
SP - 2020.05.21.107748
AU - Elena L. Cáceres
AU - Nicholas C. Mew
AU - Michael J. Keiser
Y1 - 2020/01/01
UR - http://biorxiv.org/content/early/2020/05/22/2020.05.21.107748.abstract
N2 - Multitask deep neural networks learn to predict ligand-target binding by example, yet public pharmacological datasets are sparse, imbalanced, and approximate. We constructed two hold-out benchmarks to approximate temporal and drug-screening test scenarios whose characteristics differ from a random split of conventional training datasets. We developed a pharmacological dataset augmentation procedure, Stochastic Negative Addition (SNA), that randomly assigns untested molecule-target pairs as transient negative examples during training. Under the SNA procedure, ligand drug-screening benchmark performance increases from R² = 0.1926 ± 0.0186 to 0.4269 ± 0.0272 (121.7%). This gain was accompanied by a modest decrease in the temporal benchmark (13.42%). SNA increases in drug-screening performance were consistent for classification and regression tasks and outperformed scrambled controls.
Our results highlight where data and feature uncertainty may be problematic, but also show how leveraging uncertainty into training improves predictions of drug-target relationships. Competing Interest Statement: The authors have declared no competing interest. General Abbreviations: SNA, Stochastic Negative Addition as a procedure; AUROC, AUC of the Receiver Operating Characteristic Curve (classification); AUPRC, AUC of the Precision-Recall Curve (classification); AUROCr, AUC of the Receiver Operating Characteristic Curve (regression-as-classification); AUPRCr, AUC of the Precision-Recall Curve (regression-as-classification). Model Abbreviations: STD, a “standard” model trained without SNA procedure; STD scrambled, STD model trained with y-randomization of the input training data; SNA, a model trained with SNA; SNA scrambled, SNA model trained with y-randomization of the input training data; Negatives Removed, a model trained with negatives removed from the training set; Negatives Removed scrambled, a Negatives Removed model trained with y-randomization of the input training data; SNA +SEA blacklist, an SNA model where ligands with a chance of binding (by SEA) are blacklisted from SNA choice during training.
ER -