TargetSeeker-MS: A Computational Method for Drug Target Discovery using Protein Separation Coupled to Mass Spectrometry

When coupled to mass spectrometry (MS), energetics-based protein separation (EBPS) techniques, such as thermal shift assay, have shown great potential to identify the targets of a drug on a proteome scale. Nevertheless, the computational analyses assessing the confidence of drug target predictions made by these methods have remained rudimentary and significantly differ depending on the protocol used to produce the data. To identify drug targets in datasets produced using different EBPS-MS techniques, we have developed a novel flexible computational approach named TargetSeeker-MS. We showed that TargetSeeker-MS reproducibly identifies known and novel drug targets in C. elegans and HEK293 samples that were treated with the fungicide benomyl and processed using two different EBPS techniques. We also validated a novel benomyl target in vitro. TargetSeeker-MS, which is available online, allows for the confident identification of targets of a drug on a proteome scale, thereby facilitating the evaluation of its clinical viability.

INTRODUCTION 1 proposed approaches. Benomyl is known to inhibit aldehyde dehydrogenase (ALDH), a 1 mechanism putatively leading to Parkinson's disease development 16 . TargetSeeker-MS identified 2 aldehyde dehydrogenase as a benomyl target along with other known and novel targets. Finally, 3 we highlight that TargetSeeker-MS identified human benomyl target orthologs when processing 4 a HEK293 cells dataset treated with the drug and validated the impact of benomyl on the 5 enzymatic activity of one of its novel predicted targets, GAPDH. TargetSeeker-MS is a Bayesian inference-based approach that computes the probability that a 9 protein is bound by a given drug through the analysis of EBPS-MS datasets. Briefly, 10 TargetSeeker-MS takes as input a set of untreated (control) samples that were processed using 11 EBPS and quantified using MS and builds for each protein a noise model of the similarity of the 12 protein fractionation profiles in different biological replicates. It then evaluates the similarity of 13 these control protein fractionation profiles with that of a drug-treated sample that was also 14 separated using the same EBPS-MS approach. TargetSeeker-MS then assesses the confidence 15 that each protein is bound by the drug. Figure 1 provides a graphical representation of 16 TargetSeeker-MS' pipeline. In this study, we used TargetSeeker-MS to identify the proteins 17 bound by benomyl in three different datasets. The first two datasets analyzed C. elegans samples 18 and were produced using DiffPOP separation (see Methods) coupled to MS (Dataset 1 -    creating a noise model for the similarity between the fractionation profiles of a given protein 1 using a small number of biological replicates.

3
TargetSeeker-MS assesses the statistical significance of protein fractionation profile changes 4 upon benomyl treatment.

5
The ability of the algorithm to assess the significance of the change in the fractionation profile of 6 a protein upon drug treatment was illustrated with four protein examples from Dataset 1 ( Figure   7 2C and 2D). Pck-2 represents an example of a protein without a change in its fractionation 8 profile upon benomyl treatment ( Figure 2C). Indeed, both distributions of normalized spectral 9 counts are almost identical (Similarity " #,/ = 0.89 TargetSeeker-MS identifies high-confidence benomyl targets in C. elegans samples that were 4 analyzed using DiffPOP-MS. 5 We tested the ability of TargetSeeker-MS to identify benomyl targets in C. elegans through the 6 analysis of a dataset produced with DiffPOP-MS (Dataset 1). TargetSeeker Figure 3B).  that have been reported to be affected by benomyl, such as "Aldehyde catabolic process" and 20 "Cellular aldehyde metabolic process" ( Figure 3D; Supplementary  3 assess the significance of benomyl binding. A green row represents a protein that was also predicted as a benomyl target in the TSA/C. elegans.

1
In order to test the ability of the TargetSeeker-MS algorithm to accurately identify drug targets in 2 samples that were separated using a different EBPS-MS method, we applied our algorithm to the 3 TSA-C. elegans dataset (Dataset 2). Combining two replicate TSA analyses of benomyl-treated 4 samples into a single TargetSeeker-MS analysis allowed the algorithm to identify 278 benomyl 5 targets with a FDR < 0.01 and 331 benomyl targets with a FDR < 0.05 (Supplementary Table   6 S7).

7
Even though Dataset 2 was processed using a similar approach to that described in  Figure S8). This decreased variation in fractionation profiles therefore allows TargetSeeker-MS 15 to assign high-confidence FDRs to even small changes in protein fractionation profiles upon 16 drug treatment. To maximize sensitivity we therefore opted to not apply a FSD threshold for all 17 data produced using TSA-MS. Nevertheless, in order to maintain a high level of confidence in 18 TargetSeeker-MS predictions, we only retained as putative drug targets proteins that were 19 assigned by the algorithm a FDR < 0.05 in both replicate TSA analyses of benomyl-treated 20 samples when analyzed independently (Supplementary Table S8, S9, and S10). This process the TSA/C. elegans dataset to be analyzed by TargetSeeker-MS were also identified as drug 17 targets in the latter dataset (Table 1). These results highlight the ability of TargetSeeker-MS to 18 reproducibly detect protein targets in datasets originating from different EBPS-MS techniques.

TargetSeeker-MS identifies high-confidence benomyl targets in HEK 293 that are C. elegans
1 orthologs. 2 We executed the TargetSeeker-MS algorithm on the DiffPOP/HEK293 datasets (Dataset 3) to 3 obtain some insights regarding human proteins that may be bound by benomyl. TargetSeeker-4 MS identified 349 human drug targets with a FDR < 0.01 and 566 proteins with an FDR < 0.05 Table S12). Once again, TargetSeeker-MS outperformed the -score method  Table S13, S14, and S15). This resulted in a list of 22 high-confidence benomyl 13 targets (Supplementary Table S15). Among these targets we find three proteins, CKMT1A (FDR 14 < 0.001), CKMT2 (FDR < 0.001), and UBE2O (FDR = 0.005) that correspond to orthologs of C. 15 elegans proteins (E-value < 10 89 and sequence identity > 50%; see Methods) that were also 16 predicted by TargetSeeker-MS to be high-confidence benomyl targets in the DiffPOP/C. elegans 17 dataset (Dataset 1). In addition, CKMT1A and CKMT2 orthologs were also identified as 18 benomyl targets in the TSA/C. elegans dataset (Dataset 2). Of note, creatine kinases were 19 identified as benomyl targets in all three datasets that involve two different EBPS-MS techniques 20 analyzing two different organisms. Among the notable GO term enrichments present in the list of 21 22 benomyl targets, we find "Creatine metabolic process" and "Creatine kinase activity", which are biological processes that have been reported to be linked to benomyl 19,20 (Figure 5C and   1   Supplementary Table S16).  GAPDH is among the H. sapiens proteins for which a C. elegans ortholog was not a predicted as 2 a benomyl target. To our knowledge, GAPDH activity was never previously reported to be 3 affected by benomyl nor was the protein shown to be bound by the compound. We therefore 4 validated the TargetSeeker-MS' prediction from the Dataset 3 using an orthogonal in vitro 5 approach (see Methods). Figure 5D shows that the enzymatic activity of GAPDH is reduced with 6 an increasing concentration of benomyl, with activity reduced approximately by 50% at 25 uM.

7
This effect confirms the TargetSeeker-MS prediction that GAPDH is bound by benomyl and that 8 this binding affects the in vitro functionality of GAPDH.

12
In this study, spectral counting was used to build protein fractionation profiles to illustrate that 13 TargetSeeker-MS can confidently identify drug targets with simple and cost-effective MS 14 quantitative approaches (label-free quantification). Nevertheless, most types of intensity-based  The rapidly growing field of drug repositioning depends on the use of techniques like EBPS-MS 6 for the systematic identification of the targets of a drug. The TargetSeeker-MS software package 7 will facilitate the implementation of these techniques and will have a significant impact on the 8 field. Over time the drug discovery pipeline has become increasingly long and costly. Indeed, it 9 takes on average 13.5 years for a drug to reach the market from the start of the investigation 29 .

10
The drug discovery process is also highly failure-prone with a success rate of less than 10% 29 .

11
Drug repositioning is an effort to test whether compounds for which the safety is known and that 12 were approved by organizations such as the FDA could have applications for treatment of 13 disease processes other than the one they were originally designed for. This accelerates the drug 14 discovery process and minimizes the expense and failure risks. Success stories such as 15 duloxetine, which was originally developed to treat depression and is now also used to improve 16 the condition of stress urinary incontinence victims 30 , and crizotinib, which was created to treat 17 anaplastic large-cell lymphoma and was repurposed for the treatment of non-small-cell lung 18 cancer 31 , will hopefully become more common with the increased popularity of EBPS-MS and the use of TargetSeeker-MS.
While EBPS-MS approaches are emerging as fast efficient methods to identify drug 1 targets in complex proteome, no user-friendly approaches have been proposed to assess the 2 confidence of drug binding predictions. We have shown that TargetSeeker-MS can identify high-3 confidence drug targets with a limited number of control samples. We also demonstrate that 4 TargetSeeker-MS recapitulates predictions of benomyl targets using different EBPS-MS 5 techniques and processing samples from different organisms. In conclusion, our algorithm will 6 favor the growth of the applications of EBPS-MS techniques and improve their impact the field 7 of drug discovery.

9
Method overview 10 We propose a computational approach, TargetSeeker-MS, for the identification of drug targets in 11 energetics-based protein separation (EBPS) coupled to mass spectrometry (MS) datasets. The

12
Bayesian inference approach assesses the confidence that a protein is bound by a given 13 compound. We first describe the two EBPS procedures used in this manuscript (DiffPOP and 14 Thermal shift assay) and the mass spectrometry analysis. We then present TargetSeeker-MS 15 machine learning algorithm. Figure 1 graphically depicts our approach.  Human embryonic kidney cell lysate (HEK293) was prepared from HEK293 cells grown in supplemented with penicillin and streptomycin. Cells were grown (37 o C/5% CO 2 ) to 1 approximately 80% confluence in tissue culture flasks. Cells were washed twice with DPBS, 2 scrapped from flasks, supplemented with protease inhibitor cocktail (Roche) and lysed by 3 sonication. Protein concentration was determined by BCA assay, lysate was kept at -80 o C until 4 use.

5
C. elegans and HEK293 lysate samples were aliquoted and volume adjusted so that they 6 contained five hundred micrograms of proteins in a final volume of 250 µL containing 40 % 7 Phosphoprotein Buffer A (ClonTech). Nineteen aliquots of C. elegans lysate were utilized, seven 8 of them were treated with benomyl, while twelve were left untreated (controls). In addition, six 9 aliquots of HEK293 human embryonic kidney cell lysate were utilized, four were left untreated 10 (controls), while two were treated with benomyl (see below).

12
Benomyl treatment and DiffPOP protein separation 13 Lysates of HEK293 and C. elegans were incubated with benomyl prior to fractionation. Benomyl 14 (5ul of 10mM stock dissolved in DMSO) was added for the drug treated samples, 5ul of DMSO 15 was added to the control samples. The DiffPOP method was carried out with sequential additions 16 of denaturing solution of 90 % methanol/1% acetic acid (3.75, 8.25, 12.5, 16.25, 20, 42.5, 65, 17 212.5 and 2000 µL). Each addition was followed by vigorous vortexing and centrifugation 18 (18000 x g for 10 min at 4ºC). The supernatant was transferred to a new Eppendorf tube, 19 denaturing solution was added, sample was vortexed and centrifuged. The process was repeated 20 to produce ten pelleted fractions. All resulting pellets were washed with 400 µL ice-cold acetone 21 and centrifuged (18000 x g for 10 min at 4ºC). Pellets were air-dried and digested with trypsin 22 (see below).

23
Thermal shift assay protein separation 1 Four C. elegans untreated samples and two benomyl treated samples were each separated into 10 2 fractions using a thermal shift assay procedure slightly modified from the version presented by 3 Savitski et al. 10 Lysates of C. elegans were prepared and incubated with benomyl, as described 4 above, prior to the thermal shift assay. Lysates were heated in an Eppendorf heated shaker block   The digested samples were analyzed on a Q Exactive mass spectrometer (Thermo Fisher 100um ID analytical column with pulled tip. Both were packed with 5 µm ODS-AQ C18 resin, 1 (YMC). Samples were separated at a flow rate of 400nl/min on an Easy nLCII (Thermo Fisher 2 Scientific). Buffer A was 5 % acetonitrile and 0.1 % formic acid, buffer B was 80 % acetonitrile 3 and 0.1 % formic acid. The following gradient was utilized: 1-10% B over 5 min, an increase to 4 45% B over 90 min, an increase to 80% B over another 15 min and held at 80% B for 5 min of 5 washing before returning to 1% B during the final 5 min for a 120 min total run time. Column 6 was re-equilibrated with 10ul of buffer A prior to the injection of sample. Peptides were eluted 7 directly from the tip of the column and nanosprayed directly into the mass spectrometer by  cysteine was considered as a static modification. Data was searched with 50 ppm precursor ion 23 tolerance and 600 ppm fragment ion tolerance. Data was filtered to 10 ppm precursor ion 1 tolerance post database search. Identified proteins were filtered using DTASelect 35 and utilizing 2 a target-decoy database search strategy to control the false discovery rate to 1% at the protein 3 level. To maximize identification specificity only proteins that were identified by at least two 4 different peptides in a given fraction were considered. A single protein identification was 5 retained for the analysis when multiple proteins were inferred from the same peptides across all . " # is obtained similarly.

7
We hypothesize that upon the binding of a drug to a protein, the stability of this protein , to ensure that the measure falls between 0 and 1 inclusively. We 13 then define the similarity " #,/ for between two samples' fractionation profiles " # and " / as 1-  We used an approach inspired by a previously described method used to assess the confidence of 8 protein-protein interactions 36 to build this model and to assess the significance of the similarity 9 difference between treated and untreated samples. Our novel computational approach is 10 described below.

12
Step 1: Building the similarity noise model from untreated samples 13 For all untreated sampled that were processed using a EBPS-MS approach TargetSeeker- , the average of the similarity between all fractionation profiles except those of the sample pair 1 ( ) , * ). We therefore assume that " U# % ,# ' represents a fair approximation of " # . Now let be a 2 100×100 matrix and , represent the frequency where " # % ,# ' • 100 = and provides us a good estimate of the noise of the similarity between two fractionation profiles.
Step 2: Posterior distribution of the mean similarity between fractionation profiles from 1 untreated samples 2 Using Bayes' rule and assuming the conditional independence of the similarity observations 3 given their true means, the posterior distribution of " # is computed as follows: Step 3: Significance assessment 11 The goal of TargetSeeker-MS is to assess the significance of the change in the fractionation 12 profile of each protein in untreated samples and a drug treated sample. In order to do so, we first 13 compute the average of the similarity values between " / and " # ∀ ∈ as follows, " #,/ = 14 ‡ N O,P O∈W |W| . We then assess the significance of the change in similarity for the drug treated Step 4: False discovery rate estimation 1

The -values computed in
Step 3 may be underestimated as a result of the possible violation of 2 some of the assumptions made when building our noise model. To assess the specificity of its 3 predictions, TargetSeeker-MS uses a hypothesis-free approach to compute a False Discovery Step 5: Biological replicates of drug treated samples 4 To maximize the stringency of its drug target predictions when given biological replicates of 5 drug treated samples, TargetSeeker-MS allows the user to analyze both replicates independently 6 and to report the proteins that are considered high-confidence drug targets in all replicates. Computation of fold-change of similarity difference 14 To maximize the specificity of TargetSeeker-MS' predictions, putative drug targets must be 15 associated with a FDR < 0.10, but a Fold-change of Similarity Difference (FSD) above a given 16 threshold can also be used. The FSD of a protein is calculated as follows: To investigate the mechanism of action of benomyl, we evaluated the statistical enrichment of 2 Gene Ontology terms 37 among the proteins predicted as its targets by TargetSeeker-MS in the 3 three different datasets using Ontologizer 38 . We tested the enrichment of molecular functions, 4 biological processes, and cellular components (with the complete set of proteins associated with 5 a fractionation profile as background). Ontologizer uses a modified Fisher's exact test to assess 6 the statistical significance of the enrichment of Gene Ontology terms and the Bonferroni 7 correction to correct for multiple hypothesis testing 38 .

9
Protein ortholog determination 10 Orthologous protein targets between H. sapiens and C. elegans were determined using the Blastp 11 algorithm 39 . Proteins associated with a sequence identity between the two species > 50% and an 12 E-value < 10 -10 were considered orthologs.   DiffPOP-MS analysis along with their associated statistics as calculated by TargetSeeker-MS.   DiffPOP-MS analysis along with their associated statistics as calculated by TargetSeeker-MS.  DiffPOP-MS analysis along with their associated statistics as calculated by TargetSeeker-MS. 13 14 Supplementary   Figure 5D.