Clinical Forecasting using Ex Vivo Drug Sensitivity Profiling of Acute Myeloid Leukemia

Acute Myeloid Leukemia (AML) is a heterogeneous malignancy involving the clonal expansion of myeloid stem and progenitor cells in the bone marrow and peripheral blood. Most AML patients eligible for potentially curative treatment receive intensive chemotherapy. Risk stratification is used to optimize treatment intensity and transplant strategy, and is mainly based on cytogenetic screening for structural chromosomal alterations and targeted sequencing of a selection of common mutations. However, the forecasting accuracy of treatment response remains modest. Recently, ex vivo drug screening has gained traction for its potential in personalized treatment selection, as well as a tool for identifying and mapping patient groups based on relevant cancer dependencies. We systematically evaluated the use of drug sensitivity profiling for predicting patient survival and clinical response to chemotherapy in a cohort of AML patients. We compared computational methodologies for scoring drug efficacy and characterized tools to counter noise and batch-related confounders pervasive in high-throughput drug testing. We show that ex vivo drug sensitivity profiling is a robust and versatile approach to patient prognostics that comprehensively maps functional signatures of treatment response and disease progression. In conclusion, ex vivo drug profiling can accurately assess risk of individual AML patients and may guide clinical decision-making.


Introduction
Acute Myeloid Leukemia (AML) is a heterogeneous cancer where the clonal expansion of myeloid progenitor cells (blasts) in the bone marrow and peripheral blood, interfere with 48 healthy hematopoiesis resulting in immunodeficiency, thrombocytopenia and anemia 1 . The 49 current 5-year survival rate of patients over 60 years of age is estimated to be 10-15% 2 . Most 50 treatment-eligible individuals receive standard induction chemotherapy, which consists of a 51 three-day treatment of 60 mg/m 2 Daunorubicin (or Idarubicin) and 100-200 mg/m 2 Cytarabine 52 (or Ara-C) intravenously for seven days 3 . Survival rates of older patients have not 53 substantially improved over the past decades, underscoring the need for better clinical 54 assessment and more accurate prognostic approaches. 55 Current risk stratification methods such as the European LeukemiaNet (ELN) 56 guidelines for AML stratification are mainly based on parameters such as molecular pathology 57 and genomic alterations 4 . However, a substantial number of patients remains difficult to 58 stratify, such as cytogenetically normal AML, and approximately 50% of patients are 59 stratified as intermediate-risk patients for whom selection of the appropriate treatment 60 regimen remains a major challenge 4-6 . Various non-genomic methods have been developed 61 for identification of clinically and biologically relevant molecular subtypes for risk 62 stratification, such as flow and mass cytometry, transcriptomics and proteomics 7-12 , although 63 clinical implementation of these methods has generally been slow. 64 binarized clinical variables using the glmnet package 48 . Ridge, Lasso, or Elastic net models 146 were trained by setting the penalty mixture parameter ( ) to 0, 1, or 0.4 respectively, unless 147 otherwise specified, and screening over a sequence of penalties using leave-one-out cross-148 validation (due to the low sample size). The model with the best average cross-validation 149 deviance score was selected for testing or further analysis of model coefficients. The Cox 150 model coefficients represent the change in log-hazard ratio (log-HR) as a function of drug 151 sensitivity. Regularized binomial models were used for logistic regression of binarized 152 clinical outcomes, and optimized using the same procedure. 153 Pre-selection of variables prior to model training was based on thresholds for standard 154 deviations in drug sensitivity (as indicated), where the number of drugs with the greatest drug 155 sensitivity standard deviations were selected. For a random sampling of features (100), the 156 rAUC standard deviation was used as sampling weights. 157 158

Model testing 159
For testing the models, a random sample of 10 patients was withheld from the training 160 procedure, keeping a 4/10 proportion of deceased patients to maintain the approximate 161 proportions in the original dataset. The prediction accuracy on test data was scored with 162 Harrell's concordance index (C-index), and the training and testing procedure was repeated 50 163 or 200 times, as indicated 49 . For testing of models predicting binary clinical outcomes, a five-164 fold training and testing procedure was performed and prediction accuracy was measured 165 using an area under the receiver operating characteristic curve (ROC AUC). 166 167

Variable importance 168
The importance of individual drugs was assessed by their contribution to predictivity through 169 a variable withdrawal test, and the statistical significance of the model coefficients. The Hierarchical clustering was performed on a full drug sensitivity dataset, or the same data 188 weighted by the respective Lasso coefficients for each drug, resulting in a dimension-reduced 189 (by removing zero-sum rows) survival association matrix. Clustering was performed using 190 euclidean distance and ward.D. For unsupervised classification the dendrograms were divided 191 into 8 groups for the drug dimension and 6 groups for the patient dimension. Correspondence 192 with drug annotations or patient annotations was measured using the Rand index. For patient 193 annotation based on mutation profiles, a separate unsupervised classification using the same 194 hierarchical clustering procedure of the binary mutation matrix was performed.

Data and code availability 196
The data used in this study are available upon reasonable request. All code is freely available 197 at https://github.com/Enserink-lab/ DSCoxTools. 198 199

200
Study approach, quality control, and comparison of metrics 201 To evaluate the clinical utility of ex vivo drug sensitivity profiling in AML, we 202 systematically performed drug screens on bone marrow or peripheral blood samples from 203 patients that were obtained at the time of diagnosis ( Fig. 1A and Suppl. Table S1). For high-204 throughput curve-fitting and multiparametric scoring of drug sensitivities, we used the Breeze 205 pipeline, which performs Hill curve fitting to compute several drug sensitivity metrics, such 206 as the EC50, TEC50, and DSS1, DSS2 and DSS3 22,25 . In addition to these metrics, we 207 calculated a raw AUC metric (rAUC) by computing a step-wise normalized rectangular area 208 under the relative viability dose response along a logarithmic concentration scale (Fig. 1B). 209 Finally, due to potential right skewness in the relative viability scale, we also performed a 210 negative log 2 -transformation of the rAUC, resulting in a weighted-average log 2 fold-change in 211 cell viability (rAUC-log 2 ). 212 To obtain a measure of the quality of the drug screens we computed plate-wise Z´-213 factors, and analyzed drug sensitivity profile correlations between patients to detect potential 214 outliers (Suppl. Fig. S1). Although some patient-specific plates had a suboptimal Z´-factor 215 (Suppl. Fig. S1A), all plate control averages were at least two standard deviations apart (Suppl. 216 Fig. S1B), and we did not observe any consistent association between plate control noise and 217 patient profile correlations (Suppl. Fig. S1C). However, we did find substantial differences in 218 profile correlations for different metrics, with DSS1-3 and rAUC-log 2 yielding the best inter-patient correlations (median PCC over 75%), and TEC50 and EC50 giving the worst (Suppl. 220 Fig. S1C-D). Furthermore, intra-patient profile correlations between treatment-naive and 221 relapsed samples were higher than inter-patient correlations (Suppl. Fig. S1D). Comparison of 222 the different metrics revealed that low-confidence Hill curve-fits reported by Breeze tend to 223 be associated with low-sensitivity dose responses (EC50 >10 -7 rAUC resulted in a substantial improvement in prediction accuracy, performing at least as 245 good as DSS3 and better than DSS-AUC (Fig. 1E, left), indicating that distributional 246 skewness in relative viabilities has a negative impact on appropriate drug sensitivity scoring. 247 To counter potential batch effects, we standardized the drug sensitivity distribution for 248 each patient (Suppl. Fig. S2B). This technique transforms the drug sensitivity metrics to a 249 respective z-score, assuming that the major differences in patient distributions mainly reflect 250 technical influences and that only changes in the relative magnitude of specific drug 251 sensitivities are relevant. With the exception of DSS1 and DSS2, this operation further 252 improved the prediction accuracy for the AUC-based metrics, with rAUC-log 2 yielding the 253 best performance with a median C-index of 76% for both Elastic net and Lasso, whereas 254 DSS3 and AUC-DSS yielded a median C-index of 73% for Elastic net (Fig. 1E, right). In 255 contrast, patient standardization of DSS1 and DSS2 caused a slight reduction in predictivity 256 that was most profound for models trained with a Lasso penalty (Fig. 1E). To better 257 understand these effects, we measured the inter-drug profile correlations, which revealed 258 library-wide multicollinearity that was removed when standardizing the metrics, but 259 maintained to a certain degree for drug pairs targeting similar proteins ( Fig. 1F and Suppl. Fig.  260 S2C-D). Here, DSS1 and DSS2 showed a slightly stronger correlation than DSS3 and other 261 AUC metrics. 262 Scaling the drug distributions (Suppl. Fig. S2B) to equalize the penalty of drugs with 263 differences in response variance caused a strong reduction in prediction accuracy using 264 rAUCs, while retaining or slightly improving the prediction accuracy using DSS1 and DSS2 265 (Suppl. Fig. S3B). This suggests that the strict processing of the DSS de-noises the data but 266 also removes informative variability, and that patient-wise standardization of other metrics Given the large number of drugs relative to the number of patients, we also tested the 269 effect of pre-selecting features ranked by their standard deviations as well as a greater range 270 of Elastic net penalties. These operations, in particular feature pre-selection, improved the 271 survival predictions further, with rAUC-log 2 yielding C-index averages over 80% for several 272 Elastic net models ( Fig. 1G and Suppl. Fig. S3D The first principal component identified over 20% of the total variance ( Fig. 2A, left), 281 and was strongly associated with curve-fit non-responders (>75%, Fig. 2B, left). Moreover, 282 noise in plate controls and curve-fit error had strong associations with the first two to four 283 components of DSS2 and DSS3, whereas rAUCs were more evenly affected (Fig. 2B, left). 284 Strikingly, for any drug sensitivity metric the first component was almost devoid of any 285 association with biological/clinical covariates, suggesting that the major cause of variability in 286 drug sensitivities stems from patient response averages, noise and potential curve-fit issues. 287 These effects were rectified to some extent by standardizing the drug sensitivities per 288 patient, particularly for rAUC and rAUC-log 2 ( Fig. 2A-C). It also increased the variance 289 explained by biological/clinical covariates, while decreasing the effect of other sample 290 characteristics, such as number of patient non-responders ( Fig. 2B and C). Surprisingly, DSS2 291 and DSS3 were most susceptible to batch covariates and noise in plate controls (Fig. 2C), and 292 revealed that the curve-fitting is inherently sensitive to batch variability, introducing biases 294 that confound the information in DSS-based patient profiles (Fig. 2D). 295 To remove confounding factors that interfere with accurate prediction of patient 296 survival, we performed a Removal of Confounding Principal Components (RCPC) 297 procedure 35 . For all non-standardized datasets, removing at least one or two principal 298 components improved prediction accuracy ( processed ex vivo drug screening data can contain consequential confounders that may bias 306 the dose-response curve fit and drug sensitivity scoring, and that standardization or RCPC can 307 be used as a straightforward and reliable method for de-confounding the data. 308 309

Integration with AML biomarkers and treatment response predictions 310
We next compared the effectiveness of ex vivo drug profiling in predicting treatment 311 outcome to the predictiveness of other AML patient data, such as genetic features and/or 312 prognostic clinical features that included ELN risk stratifications, age, sex and WHO's patient 313 performance status. As expected, due to coverage sparsity the Ridge models performed better 314 on the clinical feature sets (Fig. 3A). The genetic feature set provided the highest prediction 315 with a median C-index of 63%, which is similar to what has been reported previously using 316 much larger cohorts 36 . Combining the different clinical feature sets did not improve 317 predictivity, most likely due to redundancy in information (Suppl. Fig. S5A). In contrast, combining the clinical prognostic features with either the full standardized rAUC-log 2 or 319 DSS3 datasets resulted in a median C-index improvement of 3-5% (Fig. 3A). Importantly, 320 only minor effects were observed when combining genetic features with the entire drug 321 dataset. However, we did notice that combining genetic features with a reduced drug dataset 322 using feature pre-selection resulted in an improved combination effect ( and Suppl. Fig. S6D). Given that many compounds in the library have overlapping targets, 345 there was substantial redundancy in single-drug withdrawals with respect to model 346 performance (Fig. 4C). In-depth investigation of specific drug responses that were 347 significantly associated with survival revealed that ex vivo sensitivity to anthracyclines 348 predicted a favorable outcome (Fig. 4A). Daunorubicin, one of the major components of 7+3 349 chemotherapy, had one of the most significant associations with survival, while Idarubucin 350 also had a favorable (albeit statistically non-significant) survival association. Epirubicin, 351 another anthracycline used in treatment of several types of cancers, and the anthracycline 352 analog Mitoxantrone, were also significantly associated with survival ( Fig. 4A). Interestingly, 353 we found that ex vivo sensitivity to the Bcl-2/Bcl-XL/Bcl-w inhibitor ABT-263 (Navitoclax) 354 was strongly associated with increased risk (Fig. 4A). ABT-263 is an experimental drug with 355 a mechanism of action similar to the more selective Bcl-2 inhibitor Venetoclax, which has 356 been reported to significantly improve survival of AML patients who are ineligible for 357 standard chemotherapy (please note that at the time of sample collection, Venetoclax had not 358 yet been approved for AML therapy, which is why it was absent from the drug library). 359 Hierarchical clustering of learned risk associations against different clinical outcomes 360 revealed interesting correlations (Fig. 4D). For instance, sensitivity to the mTOR inhibitors 361 Rapamycin, Everolimus, and Temsirolimus was associated with initial positive response to 362 induction therapy, but also with increased risk of relapse and decreased long-term survival. 363 This implies that dependence on mTOR may be an early prognostic marker for acquired 364 resistance to chemotherapy. Furthermore, enrichment analysis of drug targets associated with 365 Cox model coefficients demonstrated a library-wide risk association of sensitivity to kinase 366 inhibitors, including JAK and PI3K signaling (Fig. 4E). Analysis of the differential sensitivity 367 between treatment-naive and relapsed samples indicated a further increase in kinase inhibitor sensitivity, including multiple receptor tyrosine kinases ( Fig. 4F-G). Conversely, sensitivity to 369 topoisomerase II-targeting chemotherapeutics in treatment-naive samples, which was 370 associated with a favorable outcome, was decreased in samples from relapsed patients (Fig.  371 4E-G). Consistent with previous studies 37,38 , ex vivo sensitivity to agents that target anti-372 apoptotic proteins, particularly ABT-263, was strongly reduced in relapsed samples (Fig. 4F), 373 indicating an increased apoptotic threshold or development of cross-resistance. This was also 374 accompanied with a library-wide negative correlation trend between differential relapse 375 sensitivity and risk-association in treatment-naive samples, where several prognostic drug 376 markers for adverse response to standard treatment diminished in sensitivity after relapse ( Finally, to investigate the relationship between different drug responses and their 384 contribution to survival forecasting, we used the drug sensitivity profiles to cluster the 385 patients into different risk groups. In contrast with previous studies comparing drug sensitivity 386 metrics, we could only detect minor differences in clustering fidelity based on a Rand index 387 measuring the agreement between curated drug assignment and hierarchical clustering (Suppl. 388 25 . However, congruent with the previous findings in this study, clustering of 389 patients improved using DSS or rAUC-log 2 compared to rAUC, as well as with patient 390 standardization (Suppl. Fig. S7C). To assess drug profiles with the highest relevance for risk 391 assignment, we weighted the rAUC-log 2 drug sensitivity data with learned Lasso model 392 coefficients, thus shrinking the drug dimension and transforming the scores to relative risk contributions (Fig. 5A). Although this reduced the agreement with clusters of genetic markers 394 slightly, it greatly improved patient risk group stratification ( Fig. 5B and Suppl. Fig. S7C-D). 395 Stratifying the patient cohort into six groups, clear patterns in relative drug sensitivity profiles 396 emerged that aligned with different survival responses (Fig. 5

407
In this retrospective study, we used ex vivo drug sensitivity screening to predict the 408 survival of AML patients subjected to standard chemotherapy. An important topic that has 409 been the subject of much discussion is how to appropriately score drug sensitivities in large-410 scale drug screens. Although sigmoid-constrained curve fitting and response scoring is a 411 valuable tool in large-scale drug screens, it is inherently sensitive to dose coverage in the 412 dynamic range, and model parameters tend to be unreliable 23,28,39 . A common approach is to 413 only consider the area under inhibitory response curves that cross a certain threshold, 414 exemplified by the DSS 25 . We observed that this compression of drug response variability 415 resulted in zero-inflated data with substantial loss of information. Furthermore, covariances in 416 curve-fitted inhibition data were more strongly affected by batch effects than covariances in other methods in scoring drug sensitivities, suggesting this is a useful strategy to counter 419 skewness and noisy measurements in large-scale ex vivo drug sensitivity screens. 420 421 In our study, we used a drug library that not only includes standard chemotherapeutics 422 but also a wide variety of other compounds, and identified informative 'drug sensitivity 423 fingerprints' that may not only help guide treatment choice but that can also provide valuable 424 prognostic information. Importantly, merging the data from drug sensitivity profiling with 425 information from genetic biomarkers did not further improve the predictivity of drug profiling 426 alone unless a large number of drugs were removed. This suggests that predictive information 427 in genetic profiles is already captured by ex vivo drug sensitivities, and indicates that ex vivo 428 drug profiling can be further developed into a prognostic tool through rational design of 429 focused drug libraries that fully cover cancer dependencies. Examples of drugs that should be 430 included in such next-generation libraries for risk stratification are BCL2, mTOR and HSP90 431 inhibitors, which, as already discussed above, are clearly associated with risk. Furthermore, 432 we found that the five top-ranking drugs associated with high risk include three JAK2 433 inhibitors and one STAT3 inhibitor, which is consistent with previous findings that high 434 JAK2/STAT3 activity causes resistance to chemotherapy [40][41][42] g  e  n  e  t  i  c  h  e  t  e  r  o  g  e  n  e  i  t  y  ,  a  l  t  e  r  e  d  c  e  l  l  f  a  t  e  a  n  d  d  i  f  f  e  r  e  n  t  i  a  t  i  o  n  t  h  e  r  a  p    Mean test C-index results (50 tests) for Lasso survival models trained on different drug 600 sensitivity metrics or z-scores with various feature pre-selection thresholds (based on rAUC 601 standard deviations), and different numbers of principal components removed. F, Number of 602 components removed to achieve the highest mean test C-index for different drug sensitivity 603 metrics or z-scores shown in Fig. 2E and Suppl. Fig. S4B. G, Mean test C-index results (50 604 tests) for Lasso survival comparing zero or the optimal number of components removed for 605 50 datasets generated under weighted random sampling of features, for different drug 606 sensitivity metrics or z-scores (shown in Suppl. Fig. S4E). 607  (lower). Blue dots indicate significance. G, Drug target or drug class association with 632 differential sensitivity, using directional drug-set enrichment on differential drug sensitivities 633 from (F). 634   Patients are color annotated as in Fig. 5 for mutations, sex, response to induction, relapse, 723 survival status, event time, prognosis, and predicted log-HR (relative to population-median).