Abstract
Novel invertebrate-killing compounds are required in agriculture and medicine to overcome resistance to existing treatments. Because insecticides and anthelmintics are discovered in phenotypic screens, a crucial step in the discovery process is determining the mode of action of hits. Visible whole-organism symptoms are combined with molecular and physiological data to determine mode of action. However, manual symptomology is laborious and requires symptoms that are strong enough to see by eye. Here we use high-throughput imaging and quantitative phenotyping to measure C. elegans behavioral responses to compounds and train a classifier that predicts mode of action with an accuracy of 88% for a set of ten common modes of action. We also classify compounds within each mode of action to discover pharmacological relationships that are not captured in broad mode of action labels. High-throughput imaging and automated phenotyping could therefore accelerate mode of action discovery in invertebrate-targeting compound development and help to refine mode of action categories.
Introduction
Invertebrate pests including insects, mites, and nematodes damage crops, decrease livestock productivity, and cause disease in humans. Nematodes alone infect over 1 billion people and lead to the loss of 5 million disability adjusted life years annually1. In livestock, they infect sheep, goats, cattle and horses causing gastroenteritis that leads to diarrhea, reduced growth, and weight loss. Nematodes that parasitize crops have been estimated to cause well over $100 billion in annual crop losses2. Crop loss due to insects is measured in tens of metric megatons and is predicted to increase due to climate change3. Compounds that kill or impair invertebrates are one of the primary means of defense in human and veterinary medicine and in crop protection. However, resistance is widespread in nematodes and insects and drives continuing efforts to discover new invertebrate-targeting compounds4,5.
Most currently approved treatments for infections in humans and livestock, as well as for crop protection in the field, have been discovered through phenotypic screens6,7. That is, compounds are first screened for the ability to kill or impair a target species without any hypothesized molecular target. A critical problem is then determining the mode of action of hit compounds, which is important for understanding resistance mechanisms, avoiding pathways where resistance is already common, and guiding subsequent lead optimization. Despite advances in biochemical and genetic methods for determining mode of action, direct observation of the symptoms induced by compounds remains a key step in mode of action discovery6. Because most insecticides and anthelmintics target the neuromuscular system, behavioral symptoms are a particularly important class of phenotypes to consider, but manual observation of behavior is time consuming, insensitive to subtle phenotypes, and prone to inter-operator variability and bias8. We therefore sought to develop more automated and quantitative methods for mode of action prediction from phenotypic screens of freely behaving invertebrates.
Pioneering work in zebrafish showed that behavioral fingerprints can be used to discover neuroactive compounds and that behavioral fingerprints correlate with compound mode of action9–11. However, this approach has not yet been applied to invertebrate animals—the targets of insecticides and anthelmintics—at a large scale. Furthermore, although previous zebrafish screens were high throughput, their spatial resolution was low and phenotypes were limited to activity levels in response to stimuli. Recent work in computational ethology has shown the power of moving beyond point representations of animal behavior to include information on posture12–15. From previous symptomology work, it is clear that detailed postural information can be useful for resolving mode of action16,17. We chose C. elegans as our model system because it is small and compatible with multiwell plates and automated liquid handling. It is sensitive to anthelmintics and insecticides and has played an important role in mode of action discovery in the past16,18–21.
To combine the benefits of high throughput and high resolution, we used megapixel camera arrays to record the behavioral responses of worms to a library of 110 compounds covering 22 distinct modes of action. We simultaneously recorded all of the wells of 96-well plates with sufficient resolution to extract the pose of each animal and a high-dimensional behavioral fingerprint that captures aspects of posture, motion, and path. We show that worms have diverse dose-dependent behavioral responses to insecticides and anthelmintics and develop a machine learning approach that shares information across replicates and doses to accurately predict the mode of action of previously unseen test compounds. Furthermore, we show that a novelty detection algorithm can be used to identify compounds with modes of action not contained in the training data suggesting a way to prioritize lead compounds with novel modes of action early in the development process. These results demonstrate that high-throughput phenotyping in C. elegans is a promising approach for assisting target deconvolution in anthelmintic and insecticide discovery. Finally, we show that our prediction accuracy may not be limited by noise or phenotypic dimensionality. Rather, we show that we can classify compounds even within a mode of action class, suggesting that there are limitations in our knowledge of the relevant pharmacology rather than limitations in our ability to reproducibly detect compound-induced phenotypes.
Results
Insecticides affect phenotypes in multiple behavioral dimensions
We assembled a library of 110 insecticides and anthelmintics with diverse targets to sample a range of modes of action used medically and commercially (see Table S1 for full list). The modes of action represented in the library cover 70% of the market of insecticides used in the field5 and several important classes of anthelmintics used in veterinary and human medicine4. To quantify the effects of the compounds on behavior we recorded worms using megapixel camera arrays that simultaneously image all of the wells of 96-well plates (Fig. 1a). We recorded at least 10 replicates at 3 doses for each compound with enough resolution to extract high-dimensional behavioral fingerprints following segmentation, pose estimation, and tracking (Fig. 1a). The behavioral fingerprints are vectors of posture and motion features that are subdivided by body segment and motion state including, for example, ‘midbody curvature during forward crawling’ or ‘angular velocity of the head with respect to the tail while the worm is paused’. We have previously shown that similar features can detect even subtle behavioral differences that can be difficult to detect by eye22 and that the combined feature set has sufficient dimensionality to accurately classify worms with diverse behavioral differences caused by genetic variation and optogenetic perturbation23,24.
As expected, several compound classes have strong visible effects on C. elegans behavior including, for example, the glutamate-gated chloride channel activator emamectin benzoate, the spiroindoline vesicular acetylcholine transporter inhibitor SY1713, and the serotonin receptor antagonist mianserin. All three compounds can be distinguished from DMSO controls and from each other in a simple two-dimensional space defined by speed and body curvature (Fig. 1b). The large differences in curvature and in motion caused by some compounds are observable by eye, as shown in the inset images and in Fig. 1c. However, not all compounds are well separated in these two dimensions (gray points, Fig. 1b). To find which features are different for the full compound set, we compared the behavioral fingerprints of treated worms with DMSO controls for each feature and corrected for multiple comparisons using the Benjamini-Yekutieli procedure25. To account for random day-to-day variation in the experiments, we used a Linear Mixed Model for these statistical tests, where the fixed effect is the drug dose and the day of the experiment is added as a random effect. The number of features that are significantly different at a false discovery rate of 1% between the behavior of worms treated with each compound and the DMSO controls is summarized in a heatmap (Fig. 1d).
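The multiple-comparison step can be illustrated with a minimal sketch of the Benjamini-Yekutieli step-up rule. It assumes the per-feature p-values have already been obtained from the Linear Mixed Model tests; the function name and the example p-values are hypothetical and are not taken from the analysis code.

```python
import numpy as np

def benjamini_yekutieli(p_values, alpha=0.01):
    """Return a boolean mask of hypotheses rejected at false discovery rate alpha."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Harmonic correction c(m) = sum_{i=1..m} 1/i, valid under arbitrary dependence.
    c_m = np.sum(1.0 / np.arange(1, m + 1))
    thresholds = alpha * np.arange(1, m + 1) / (m * c_m)
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        # Step-up rule: reject every hypothesis ranked at or below the largest passing rank.
        k = np.max(np.nonzero(below)[0])
        reject[order[: k + 1]] = True
    return reject

# Hypothetical p-values for four features of one compound versus DMSO.
significant = benjamini_yekutieli([0.0001, 0.5, 0.002, 0.9], alpha=0.01)
n_significant = int(significant.sum())
```

Counting the rejected features per compound and assay period in this way would give the kind of summary shown in the heatmap.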
To further increase the dimensionality of the behavioral phenotypes, we included a blue light stimulation protocol. Each tracking experiment is divided into three parts: 1) a 5-minute pre-stimulus recording, 2) a 6-minute stimulus recording with three 10-second blue light pulses starting at 60, 160, and 260 seconds, and 3) a 5-minute post-stimulus recording. Blue light is aversive to C. elegans26 and so it can help to distinguish between animals that are simply pausing and those that are not able to move27. Behavioral differences are observed in each assay period but the stimulus period shows the most differences (Fig. 1d). Even within mode of action classes, compound potency can be highly variable. The largest potency difference is observed for the octopamine agonists, where amidine affects 0.08% of features and oxazoline affects 75% of features. Overall, 86% of compounds have a detectable effect on behavior in at least one feature. Compounds with no detectable effect were not included in subsequent analysis.
Compounds with the same mode of action have similar effects on behavior
Having established that C. elegans shows diverse behavioral responses to insecticides and anthelmintics, we next sought to determine to what extent the responses are mode of action specific. For all clustering and classification tasks we first z-normalize each feature to put them on a common scale and to prevent arbitrary choices of units from impacting the analysis. We used hierarchical clustering to visualize the relationships between the behavioral responses to different compounds (Fig. 2a). Several of the compound classes show clear clustering, including the AChE inhibitors, vAchT inhibitors, GluCl agonists, and mAchR agonists. The degree of mode of action clustering is greater than expected by chance, which can be seen in a plot of the cluster purity observed in the data compared to random clustering (Fig. 2b). However, the distance between compounds that share the same mode of action can be large, even for classes that cluster well overall, in part because behavioral fingerprints change with dose. It is also not always possible to align feature vectors using doses because compounds can have very different potencies: a low dose for one compound could be a high dose for another.
These effects can be seen in dose-response plots for individual features. The three mitochondrial inhibitors in Fig. 2c all decrease angular velocity, but they do it at different doses. At 3 μM, only SY1048 has a strong effect, while at 30 μM, rotenone has a similarly strong effect. Clustering based on angular velocity would lead to qualitatively different conclusions about nearest neighbors at these different doses. For the spiroindolines, similar differences in dose-response are observed for body curvature with the added difference that the effect of SY1786 is non-monotonic and returns to baseline at high doses. These non-monotonic effects can be due to compounds precipitating from solution at high doses or due to intrinsically complex compound effects such as a compound that causes an increase in speed at low doses but is lethal at high doses. Regardless of the cause, complex dose-response curves present challenges for mode of action prediction since supervised machine learning algorithms rely on differences in feature distributions to learn decision boundaries and dose-response effects spread out the distributions and increase the overlap between classes.
Combining classifiers by voting enables mode of action prediction
We take advantage of the fact that several replicates are recorded per condition and resample with replacement from the multiple replicates for each dose to create a set of average behavioral fingerprints. This effectively smooths the data and provides a simple method for balancing classes before classifier training. For classes with fewer compounds, we resample more times so that each class contains the same number of points. To partially mitigate the effect of compound potency, we then normalize each behavioral fingerprint to unit magnitude. This normalization is done row-wise on each sample in contrast to the z-normalization described above which is done column-wise on each feature. Rescaling in this way brings compounds with similar effect profiles but different potencies closer together in feature space (Fig. 3a), but because of nonlinearities in the dose response profiles, the overlap is not perfect even after rescaling.
Predictions must be combined across doses and replicates to make a single prediction for the mode of action of a given compound. Inspired by an analogy with the multi-sensor fusion problem28, we use a voting procedure to make a final prediction. However, in contrast to multi-sensor fusion, we cannot train different classifiers for each dose because 1 μM for one compound is not equivalent to 1 μM for another compound. Instead, we train a single classifier for all doses and make predictions for each replicate. Each replicate contributes a vote for a compound’s class and the class with the most votes wins.
We split our data into a training/tuning set consisting of 60 compounds and a hold-out/reporting dataset consisting of 16 compounds containing at least one compound for each mode of action class. We used the training set to determine an appropriate classifier, select features, and tune hyperparameters using cross-validation. We achieved the highest cross-validation accuracy using multinomial logistic regression with 1024 features selected using recursive feature elimination (Fig. 3b). To determine whether the classifier could generalise to unseen compounds, we applied it to the test data without further tuning. The classifier predicted the correct mode of action for the unseen compounds 88% of the time (Fig. 3c).
In addition to the ten modes of action in our dataset that were represented by at least five compounds, we also included 19 compounds representing an additional 11 modes of action. We used these additional compounds to simulate another use-case for our approach: detecting screening hits that represent potentially novel modes of action that do not fall into known classes. Using a novelty detection algorithm29, we assigned a novelty score to each of the new compounds based on their affinity to each of the existing classes. To obtain the novelty score we use an ensemble of support vector machine (SVM) classifiers that flag novel compounds based on the confidence values of the main multinomial logistic regression classifier used for the predictions of known classes. The ensemble of SVM classifiers is trained using partitions of the training set into presumed-known and presumed-unknown classes. The novelty score is defined as the weighted average of the output of this ensemble. Most of the truly novel compounds were assigned novelty scores above 0.8. Several of the non-novel compounds (those that come from a class that is present in the training data) also have high novelty scores, and these include the two test compounds that were incorrectly classified. To explore the origin of the high novelty score for the incorrectly classified compounds, we looked for differences between the effects of compounds within a class.
Mode of action deconvolution within classes
Although compounds are categorized into broad mode of action classes, most compounds will have some degree of off-target engagement. If the off-target effects are different for compounds within a mode of action class or if the compounds have differences in pharmacokinetics, they may lead to different phenotypes. In this case, it may be possible to use behavioral fingerprinting to further deconvolve mode of action classes revealing hidden compound heterogeneity. To test for phenotypic differences within mode of action classes, we trained a classifier to distinguish the replicates from each compound within a class from the replicates of the other compounds in the class. We then used cross-validation accuracy to quantify the distinguishability of the compounds within a class. We hypothesized that for classes without mechanistic sub-classes, the classifier would perform similarly to random guessing. In contrast, if compounds with the same mode of action had different off-target profiles or different pharmacokinetics, the classifier would be able to reliably distinguish individual compounds or subsets of compounds with the same broad mode of action.
In all classes, the compound-level classifiers performed better than random guessing, in some cases by a large margin. One of the incorrectly classified compounds in the test set was ritanserin, which was also assigned a high novelty score. The within-class classifier shows that it is indeed clearly distinguishable from the other 5-HT receptor antagonists (Fig. 4a). Although ritanserin is known to be a 5-HT receptor antagonist, it is also known to affect multiple other targets. In addition to detecting outlying compounds, the deviations from random guessing revealed substructures within the classes. For example, one group contains the two antidepressants, which are nearest neighbors in terms of structural similarity (atom pair Tanimoto coefficient of 0.44 between the antidepressants compared to a Tanimoto coefficient of 0.20 ± 0.09 (mean ± standard deviation) for the other pairwise comparisons within the class). As with ritanserin, the other four compounds in this class have known polypharmacology which could be driving their clustering. Another class with interesting substructure is that consisting of mitochondrial inhibitors which also separates into distinguishable groups. In this case, the phenotypically distinct groups separate the complex I inhibitors from the complex II and III inhibitors which appear phenotypically more similar. The mectins we tested are structurally similar and are known to share the same binding site, suggesting they would be difficult to separate into subgroups. Consistent with this expectation, the compound-level mectin classifier performs only slightly better than chance (Fig. S1).
Discussion
We have shown that worms have diverse responses to insecticides and anthelmintics and that behavioral fingerprints can be used to cluster compounds with similar modes of action. With appropriate normalization and by combining information across doses and replicates through voting, we can also accurately predict the mode of action of previously unseen compounds. Given the wide variety of modes of action included in this study, it might be expected that the dimensionality of worm behavior would be limiting. In other words, it could have been the case that worm behavior can only vary in a small number of ways and that this would limit the number of distinct classes that are distinguishable using tracking data alone. On the contrary, we found that in addition to predicting mode of action across ten classes, we could often classify at the level of individual compounds and that the within-mode-of-action confusion matrices revealed subgroups that related to finer-scale mechanistic divisions. This finding suggests there is sufficient dimensionality in worm behavioral responses to detect mode of action as well as more detailed pharmacological differences such as the spectrum of a compound’s off-target effects or differences in target engagement. These results belie worms’ superficial simplicity but are consistent with genetic findings that mutations can lead to multiple types of uncoordination18 and with tracking results suggesting that even wild type spontaneous locomotion is surprisingly complex30,31.
The most obvious limitation of our method is that not all compounds affect the N2 strain of C. elegans. Only a small minority of the compounds we assayed had no detectable effect, but there will be entire classes of compounds that are not expected to have an effect on C. elegans because their targets are not conserved, such as the pyrethroid insecticides that target voltage-gated sodium channels32. Expanding the range of organisms included in the training data is one way to address this limitation. There are methods for deriving multidimensional behavioral fingerprints that incorporate postural information from flies33 and zebrafish larvae34 and both organisms are compatible with high throughput screening. Because our approach already involves the fusion of data from multiple samples, additional species level classifiers could be seamlessly incorporated into the voting procedure to arrive at a final prediction. Results from ensemble learning in diverse fields suggest a further benefit of a multi-species approach: increasing the votes from independent classifiers should increase classifier accuracy if the predictions from different species are partially uncorrelated35. The same principle applies beyond behavioral phenotypes. The results of automated symptomology can also be combined, within the same analysis framework, with data from non-animal species including bacteria and fungi as well as other assays commonly used in mode of action identification including genetic and biochemical assays.
Strains of C. elegans that have been recently isolated from the wild are readily available36 and have given insight into anthelmintic resistance37. Using wild isolated strains of C. elegans would increase the diversity of compound responses without having to modify the experimental protocols for screening or downstream analysis. Moving beyond C. elegans to other nematode species, including parasitic nematodes, could provide further independent phenotypes for improving mode of action prediction. However, particularly for animal parasites, different morphologies might require alternative analysis approaches21,38,39.
Phenotypes derived from in vitro microscopy of cells in culture have also been extensively studied for mode of action prediction40–42. Human cells in culture would provide an additional phylogenetically distinct species to combine with invertebrate screening. Human cell responses may also give an indication of mammalian toxicity43. However, there are drawbacks with respect to insecticide and anthelmintic discovery. The first, which is shared with zebrafish, is that the goal of these compounds is to affect invertebrates without affecting vertebrates so there may be a lower response rate than for pharmaceutical compounds. Cultured insect cells44 could be used in similar assays as human cells but may provide more relevant information for insecticide mode of action. The second drawback, which applies to most culture systems, is that some compounds act on super-cellular structures. For example, acetylcholinesterase inhibitors act by causing a buildup of acetylcholine at synapses or neuro-muscular junctions. Whole-animal phenotypic screening is therefore likely to continue to play a role in phenotypic screening of neuroactive compounds both for discovery and mode of action prediction.
Symptomology is an important technique for mode of action determination in insecticide discovery. Our method would therefore fit straightforwardly into existing discovery pipelines. Since phenotypic screening in pest species is the primary means of lead identification, an intriguing possibility would be to adapt our method to primary screening data so that mode of action prediction is not only available for selected hits but can already be included at the earliest stages of decision making. Given advances in computer vision and pose estimation45–47, it should be possible to track pest species in complex media such as leaf sections with existing technology. Improved phenotyping in primary screens might also reduce false negative rates by picking up compounds with unique phenotypic effects that are too weak to register as hits using current metrics but might provide useful starting points for optimization.
Methods
Worm husbandry
The N2 Bristol C. elegans strain was obtained from the CGC (Caenorhabditis Genetics Center) and cultured on Nematode Growth Medium at 20°C and fed with E. coli (OP50) following standard procedures18.
Worm preparation
Synchronized populations of young adult worms for imaging were cultured by bleaching unsynchronized gravid adults, and allowing refed L1 diapause progeny to develop for two and a half days (detailed protocol: https://dx.doi.org/10.17504/protocols.io.2bzgap6). On the day of imaging, young adults were washed in M9 (detailed protocol: https://dx.doi.org/10.17504/protocols.io.bfqbjmsn) and transferred to the prepared drug plates (3 worms per well) via the COPAS 500 Flow Pilot (detailed protocol: https://dx.doi.org/10.17504/protocols.io.bfc9jiz6) and returned to a 20°C incubator for 3.5 hours. Plates were then transferred onto the multi-camera tracker for another 30 minutes to habituate prior to imaging so that the total drug exposure time was 4 hours.
Plate preparation
Low peptone (0.013%) nematode growth medium (detailed protocol: https://dx.doi.org/10.17504/protocols.io.2rcgd2w) was prepared as follows: 20 g agar (Difco), 0.13 g Bactopeptone, and 3 g NaCl were dissolved in 975 ml of milliQ water. After autoclaving, 1 ml of 10 mg/ml cholesterol was added along with 1 ml of 1 M CaCl2, 1 ml of 1 M MgSO4 and 25 ml of 1 M KPO4 buffer (pH 6.0). Molten agar was cooled to 50−60°C and 200 μl was dispensed into each well of 96-square well plates (WHAT-7701651) using an Integra VIAFILL (detailed protocol: https://dx.doi.org/10.17504/protocols.io.bmxbk7in). Poured plates were stored agar-side up at 4°C until required.
Prior to applying compounds, plates were placed without lids in a drying cabinet to lose 3−5% weight by volume and then stored with lids (WHAT-77041001) on at room temperature until required.
Compound preparation
Compounds were prepared for screening by dissolving in DMSO at 1000X their final imaging plate concentration (so that final concentration of DMSO in imaging plates was 0.1%). The results presented here are pooled from two rounds of experiments that used two different methods of adding compounds.
For the first round (December 2019), the library was stored in 56 ‘master’ 96-well plates in which eight replicates of each compound solution were in a single column of a 96-well plate and lower concentrations were made by serial dilution of the highest concentration using DMSO. There were up to six doses per drug and one column was reserved for DMSO and one column for no compound controls on each master plate, which were stored at −20°C. The day prior to imaging, ‘source’ plates were prepared using an Integra VIAFLO 96-channel pipette by adding 7 μl water and 0.5 μl 1000X drug and mixing by pipetting up and down. 5 μl water was added to the surface of each well of an imaging plate to facilitate compound transfer and prevent agar damage from pipette contact. Then, 3 μl of compound was transferred from the source plates to the destination imaging plates using an Opentrons liquid handling robot, which randomly shuffled the column order during the transfer.
For the second round (January 2020), ‘master’ 96-well plates were filled sequentially with three doses of each drug (made by serial dilution of the highest concentration using DMSO) so that the entire library fitted into three and a half 96-well plates. Three sets of shuffled ‘library’ plates were created with randomized column orders using an Opentrons robot so that three replicates per drug per dose fitted into 10.5 96-well plates. All plates were stored at −20°C. The day prior to imaging, ‘source’ plates were prepared using the VIAFLO by transferring 1.4 μl of 1000X drug from the shuffled ‘library’ plates into 96-well (PCR) plates filled with 19 μl of water and pipetting up and down to mix. 5 μl of water was added to the imaging plates to facilitate compound transfer and 3 μl of the pre-diluted compounds was then transferred to imaging plates using the VIAFLO.
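As a rough check on the two dilution schemes, the arithmetic can be written out in a few lines. This is a sketch under a simplifying assumption that is not stated in the protocol: the transferred compound is taken to distribute evenly through the 200 μl of agar in a well, and the small volumes of added liquid are neglected. The function name is hypothetical.

```python
def fold_on_plate(stock_fold, stock_ul, water_ul, transfer_ul, agar_ul=200.0):
    """Fold concentration in the agar relative to the intended final (1X) dose."""
    # Pre-dilution of the DMSO stock with water in the 'source' plate.
    source_fold = stock_fold * stock_ul / (stock_ul + water_ul)
    # Assumption: the transferred compound spreads evenly through the agar and
    # the small volumes of added liquid are neglected.
    return source_fold * transfer_ul / agar_ul

# Round 1: 0.5 ul of 1000X stock in 7 ul water, 3 ul transferred per well.
round_one = fold_on_plate(1000, stock_ul=0.5, water_ul=7.0, transfer_ul=3.0)
# Round 2: 1.4 ul of 1000X stock in 19 ul water, 3 ul transferred per well.
round_two = fold_on_plate(1000, stock_ul=1.4, water_ul=19.0, transfer_ul=3.0)
print(f"round 1: {round_one:.2f}X, round 2: {round_two:.2f}X")
```

Under these assumptions both schemes land close to the intended 1X dose, which is consistent with pooling the two rounds.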
After the drug had absorbed into the agar, imaging plates were seeded with 5 μl OP50 diluted 1:10 in M9 solution using an Integra VIAFILL, and left overnight at room temperature in the dark. Full protocols are available at https://doi.org/10.17504/protocols.io.9vqh65w, https://doi.org/10.17504/protocols.io.bn5zmg76.
Image acquisition
Image acquisition was performed on custom-built tracking rigs (LoopBio GmbH, Vienna) equipped with six Basler acA4024 cameras to simultaneously image all wells of a 96-well plate. Each tracking experiment is divided into three parts, which are run in series by a script: 1) a 5-minute pre-stimulus recording, 2) a 6-minute stimulus recording with three 10-second blue light pulses starting at 60, 160, and 260 seconds, and 3) a 5-minute post-stimulus recording. Blue light was provided using four Luminus CBT-90 TE light-emitting diodes with a peak wavelength of 456 nm and peak radiometric flux of 10.3 W each.
Image processing and quality control
Segmentation and skeletonization were performed using Tierpsy tracker23 (https://github.com/Tierpsy/tierpsy-tracker). Each video was manually checked using the Tierpsy tracker viewer and any wells that had precipitation, excess unabsorbed liquid that led to swimming worms, or damaged agar were marked as bad and excluded from the analysis.
Feature extraction and pre-processing
Python code to reproduce the analysis in the paper is available on GitHub (https://github.com/em812/moaclassification). We used Tierpsy tracker to obtain summarized behavioural features for each screened well24. These include morphological features such as length and width, postural features such as curvature and features describing movement such as speed and angular velocity. A total of 3020 features is obtained. We derived summarized features separately for the pre-stimulus period, for the period with blue light stimuli and for the post-stimulus period resulting in a total of 9060 features for each well.
In addition to the manual quality control described above we used the filtering options in Tierpsy tracker to filter out tracked objects that have average width smaller than 20μm or larger than 500μm and average length smaller than 200μm or larger than 2000μm. Features that have more than 5% NaN values in the entire dataset are removed entirely. Any remaining NaN values are imputed to the mean of the given feature calculated across all wells. Compounds with very low effect compared to DMSO are also removed from the dataset. A compound is considered to have very low effect when none of the features shows significant differences across doses (including DMSO as dose 0) in a statistical test based on a Linear Mixed Model with random intercept where the dose is the fixed effect and the day of the experiment is the random effect. The Benjamini-Yekutieli method25 with 1% false discovery rate was used to correct for multiple comparisons. After cleaning up the dataset, we standardize the features (to mean 0 and standard deviation 1) to bring them on a common scale and prevent the unit differences from influencing the analysis.
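The feature-cleaning steps described above (dropping sparse features, mean imputation, and standardization) can be sketched with NumPy as follows. The function name and the handling of constant features are assumptions made for illustration; the published repository may organize these steps differently.

```python
import numpy as np

def preprocess_features(X, max_nan_fraction=0.05):
    """Drop sparse features, impute remaining NaNs, then z-score each column.

    X: (n_wells, n_features) float array; NaN marks a missing value.
    Returns the cleaned matrix and the boolean mask of retained columns.
    """
    X = np.asarray(X, dtype=float)
    # Keep features whose NaN fraction across all wells is at most the threshold.
    keep = np.isnan(X).mean(axis=0) <= max_nan_fraction
    X = X[:, keep]
    # Impute remaining NaNs with the per-feature mean across all wells.
    col_mean = np.nanmean(X, axis=0)
    X = np.where(np.isnan(X), col_mean, X)
    # Standardize to mean 0 and standard deviation 1.
    # Assumption: constant features are left at 0 rather than divided by zero.
    std = X.std(axis=0)
    std[std == 0] = 1.0
    return (X - X.mean(axis=0)) / std, keep
```

Returning the column mask makes it possible to apply the same feature selection to wells that are processed later.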
For the classification task, we use a bootstrap method that simultaneously smooths the data and balances the classes. We bootstrap the compound dose replicates (i.e. we resample with replacement until we get a sample 0.6 times the size of the initial sample) multiple times and each time we derive the mean feature values. In this way, we replace the drug dose replicates with bootstrapped averages which smooths the data reducing the effect of outliers. This method also gives an easy way to balance the classes. The number of bootstrapped averages per dose in the training set depends on how well-populated the mode of action class is. In the class with the most members (AChE inhibitors with 9 compounds), we get 20 bootstrap averages per drug dose. In the rest of the classes, the number of averages is proportional to the ratio between the max number of members in a class (9) and the number of members in the given class. Finally, to partially mitigate the effect of compound potency, we normalize each behavioral fingerprint to unit L2-norm. Rescaling in this way brings compounds with similar effect profiles across features but different potencies closer together in feature space.
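A minimal sketch of the bootstrap averaging, class balancing, and row-wise rescaling is given below. The function names are hypothetical, and the rounding used to turn the class-size ratio into a whole number of averages is an assumption; only the 0.6 resampling fraction, the 20 averages for the largest class, and the unit L2-norm come from the description above.

```python
import numpy as np

def bootstrap_averages(replicates, n_averages, frac=0.6, rng=None):
    """Mean fingerprints from resampling the replicates of one compound dose.

    replicates: (n_replicates, n_features) array for one compound at one dose.
    """
    rng = np.random.default_rng(rng)
    replicates = np.asarray(replicates, dtype=float)
    # Each bootstrap sample is 0.6 times the size of the original sample.
    n_draw = max(1, round(frac * len(replicates)))
    idx = rng.integers(0, len(replicates), size=(n_averages, n_draw))
    return replicates[idx].mean(axis=1)

def averages_per_dose(n_compounds_in_class, max_class_size=9, base=20):
    """Number of averages per dose, scaled so that smaller classes get more."""
    # Assumption: the ratio is rounded to the nearest integer.
    return round(base * max_class_size / n_compounds_in_class)

def l2_normalize(X):
    """Rescale each row (one fingerprint) to unit L2-norm."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero rows unchanged
    return X / norms
```

Two fingerprints that differ only by an overall scale factor, such as `[1, 2, 2]` and `[3, 6, 6]`, map to the same point after `l2_normalize`, which is the sense in which the rescaling reduces the effect of potency.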
Hierarchical clustering
To investigate how well the compounds of the same mode of action cluster together in an unsupervised way, we use hierarchical clustering. For this task, we create a matrix where each row is the average of a drug dose and each column is one of the 256 features in the tierpsy_256 feature set24. We also include 5 rows with the average of 5 random partitions of the DMSO control replicates. We use the hierarchical clustering algorithm implemented in the seaborn48 clustermap function with complete linkage and cosine distances. To assess the quality of the clustering, we use the row linkage dendrogram and compare the cluster purity at each level of the dendrogram with the purity of random clusters. To get the random clusters we permute the cluster labels derived from the dendrogram. We repeat the permutation 1000 times.
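The comparison between observed and random cluster purity can be sketched as below. The purity definition used here (the fraction of points that belong to the majority class of their cluster) is an assumption, as are the function names; the cluster labels are taken as given, for instance from cutting the row dendrogram at a chosen level.

```python
import numpy as np

def cluster_purity(cluster_labels, class_labels):
    """Fraction of points that share the majority class of their cluster."""
    cluster_labels = np.asarray(cluster_labels)
    class_labels = np.asarray(class_labels)
    total = 0
    for c in np.unique(cluster_labels):
        members = class_labels[cluster_labels == c]
        _, counts = np.unique(members, return_counts=True)
        total += counts.max()
    return total / len(class_labels)

def permutation_purity(cluster_labels, class_labels, n_perm=1000, seed=0):
    """Purity of random clusters obtained by permuting the cluster labels."""
    rng = np.random.default_rng(seed)
    cluster_labels = np.asarray(cluster_labels)
    return np.array([
        cluster_purity(rng.permutation(cluster_labels), class_labels)
        for _ in range(n_perm)
    ])

# Toy example: three perfectly separated classes (labels are placeholders).
demo_clusters = [0, 0, 0, 1, 1, 1, 2, 2, 2]
demo_classes = ["a", "a", "a", "b", "b", "b", "c", "c", "c"]
observed_purity = cluster_purity(demo_clusters, demo_classes)
null_purity = permutation_purity(demo_clusters, demo_classes)
```

Permuting the cluster labels keeps the cluster sizes fixed, so the null distribution reflects only the assignment of points to clusters.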
Train, test, novel split
For the classification task, we created a training set and a test set. We considered that we need at least 4 compounds per class in the training set for cross validation, so classes with fewer than 5 compounds were not included. As a result, we used ten classes for the classification task. The split into training set and test set was done using a stratified split as implemented in scikit-learn49 with 20% of the compounds of each class included in the test set. Using this method, 60 compounds were assigned to the training set, while 16 compounds spanning all ten classes were assigned to the test set. The compounds of the sparsely-populated classes that are not included in the classification task were considered ‘novel’ compounds and were used for novelty detection. Eleven classes with a total of 17 compounds are included in the novel compounds dataset. The training set was used for feature selection and hyperparameter selection. The test set was only used to test the classification accuracy of the selected trained model.
Classification by mode of action
The classification of compounds into modes of action is a classification of ‘bags’ of data points, as we know that we have the same label across doses and across replicates of the same compound. To get compound-level predictions, we use the following approach. We train a single classifier with the rescaled bootstrapped averages from all the compounds and then we combine the predictions across all the data points of the same compound with a simple voting procedure. Each bootstrapped average point contributes one vote for the class of the compound. The predicted class for the compound is the one with the most votes. We initially tested three types of classifiers: logistic regression, random forest and XGBoost. Logistic regression performed better than the ensemble methods in terms of cross-validation accuracy in the training set, so it was adopted for the entire pipeline.
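The voting step that turns point-level predictions into compound-level predictions can be sketched as follows. The function name and the toy labels are hypothetical, and ties are broken in favor of the class that first reached the top count, which is an assumption not stated in the text.

```python
from collections import Counter

def vote_per_compound(compound_ids, point_predictions):
    """Return the class with the most votes for each compound.

    Each bootstrapped average contributes one vote for its predicted class.
    """
    votes = {}
    for cid, pred in zip(compound_ids, point_predictions):
        votes.setdefault(cid, Counter())[pred] += 1
    # most_common(1) returns the class with the highest vote count.
    return {cid: counts.most_common(1)[0][0] for cid, counts in votes.items()}

# Toy example with placeholder compound and class names.
demo_ids = ["cmpd1", "cmpd1", "cmpd1", "cmpd2", "cmpd2"]
demo_preds = ["class_A", "class_A", "class_B", "class_B", "class_B"]
demo_result = vote_per_compound(demo_ids, demo_preds)
```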
As we have thousands of highly correlated features, we performed feature selection with recursive feature elimination using the training set. The estimator used was multinomial logistic regression with elastic net. We tested different sizes of selected feature sets and chose the set with 1024 features which resulted in the highest 4-fold cross-validation accuracy. The full list of selected features is shown in Table S2.
Using only the 1024 selected features, we optimized the hyperparameters of the estimator using stratified 4-fold cross validation. We tested the following parameter grid:
The optimal parameters for the logistic regression classifier are penalty='l2', C=10, multi_class='multinomial'. The cross-validation accuracy presented in Figure 3b is the result obtained with the optimal parameters. Finally, using the optimal feature set and the optimal hyperparameters we trained a classifier with the entire training set and predicted the class of the unseen compounds in the held-out test set.
Novelty detection
To detect potentially novel modes of action that are not part of the known classes seen by the trained classifier, we use an adjusted version of the novelty detection algorithm for multiclass problems proposed in Ref 29. The algorithm uses θ scores to assess the affinity of a compound to the known classes. The θ score is defined as the ratio between the confidence of the classifier in the most likely class and the confidence in the second most likely class. Using our logistic regression classifier, the θ score of a compound is computed from class_probas, the probabilities of each class predicted by the trained classifier, over S, the set of data points belonging to the given compound. These θ scores are fed to an ensemble of binary support vector machines (SVMs) to flag novel compounds. The ensemble is trained with the following procedure. First, we partition the training set ten times, each time leaving one class out as presumed-novel. For each partition, we train a logistic regression classifier and we get the θ scores for the compounds of the known classes and for the compounds of the presumed-novel class. Using the θ scores and the mean θ scores in the respective class as input features, we train a binary SVM to label compounds with 0 if they are known and with 1 if they are novel. This procedure gives us an ensemble of ten SVMs that take as input the θ score of a compound and the mean θ score of its assigned class and give as output a prediction of whether the compound is novel. The final novelty score of the compound is given by the average of the ten predictions weighted by the cross-validation accuracy of the SVM in its respective training set.
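One possible reading of the θ score and of the weighted ensemble score is sketched below. How the class probabilities are aggregated over the data points in S is not recoverable from the text, so averaging them before taking the ratio is an assumption, as are the function names.

```python
import numpy as np

def theta_score(class_probas):
    """Ratio of the top to the second-highest aggregated class probability.

    class_probas has one row per data point in S and one column per class.
    Assumption: probabilities are averaged over the rows before the ratio.
    """
    probas = np.asarray(class_probas, dtype=float)
    mean_probas = probas.mean(axis=0)
    second, top = np.sort(mean_probas)[-2:]  # two largest values, ascending
    return top / second

def novelty_score(svm_predictions, svm_cv_accuracies):
    """Average of the 0/1 SVM predictions weighted by each SVM's CV accuracy."""
    return float(np.average(svm_predictions, weights=svm_cv_accuracies))

# Toy example: two data points, three classes (values are synthetic).
demo_probas = [[0.6, 0.3, 0.1],
               [0.8, 0.1, 0.1]]
demo_theta = theta_score(demo_probas)  # mean probabilities are 0.7, 0.2, 0.1
demo_novelty = novelty_score([1, 0], [0.5, 0.5])
```

A θ score close to 1 indicates that the classifier cannot separate its two best candidate classes, whereas a large θ score indicates a confident assignment to a single known class.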
Within mode of action classification
To test for phenotypic differences within mode of action classes, we trained a classifier to distinguish the replicates from each compound within a class from the replicates of the other compounds in the class. In this case, we did not use the bootstrapped averages, but the standardized and normalized raw data. For each of the ten modes of action included in the main classification task, we pool all the compounds from the training set and the test set. Each compound is considered a separate class. We use stratified 4-fold cross validation to include replicates of every compound at every dose both in the training and the test set in each fold. We then pool together the predictions for each test fold to create a confusion matrix for all the replicates. Finally, we cluster the confusion matrix using the spectral co-clustering algorithm as implemented in scikit-learn to reveal internal structure within the mode of action.
Acknowledgements
This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (Grant agreement No. 714853) and was supported by the Medical Research Council through grant MC-A658-5TY30. AMR was supported by a BBSRC CASE studentship part-funded by Syngenta.
Footnotes
andre.brown{at}imperial.ac.uk