Abstract
Therapy resistance in breast cancer is increasingly attributed to polyploid giant cancer cells (PGCCs), which arise through whole-genome doubling and exhibit heightened resilience to standard treatments. Characterized by enlarged nuclei and increased DNA content, these cells tend to be dormant under therapeutic stress, driving disease relapse. Despite their critical role in resistance, strategies to effectively target PGCCs are limited, largely due to the lack of high-throughput methods for assessing their viability. Traditional assays lack the sensitivity needed to detect PGCC-specific elimination, prompting the development of novel approaches. To address this challenge, we developed a high-throughput single-cell morphological analysis workflow designed to differentiate compounds that selectively inhibit non-PGCCs, PGCCs, or both. Using this method, we screened a library of 2,726 FDA Phase 1-approved drugs, identifying promising anti-PGCC candidates, including proteasome inhibitors, FOXM1 inhibitors, CHK inhibitors, and macrocyclic lactones. Notably, RNA-Seq analysis of cells treated with the antimalarial drug Pyronaridine revealed AXL inhibition as a potential strategy for targeting PGCCs. Although our single-cell morphological analysis pipeline is powerful, empirically testing all existing compounds is impractical and inefficient. To overcome this limitation, we trained a machine learning model to predict anti-PGCC efficacy in silico, integrating chemical fingerprints and compound descriptions from prior publications and databases. The model demonstrated a high correlation with experimental outcomes and predicted efficacious compounds in an expanded library of over 6,000 drugs. Among the top-ranked predictions, we experimentally validated two compounds as potent PGCC inhibitors. These findings underscore the synergistic potential of integrating high-throughput empirical screening with machine learning-based virtual screening to accelerate the discovery of novel therapies, particularly for targeting therapy-resistant PGCCs in breast cancer.
Introduction
PGCCs are cancer cells with additional copies of chromosomes, often resulting in significantly larger cell size and increased genomic content.1–3 These cells are found across various cancer types, including breast, prostate, lung, ovarian and colorectal cancers.4–8 The presence of PGCCs has been correlated with advanced disease stages, increased tumor aggressiveness, and poor clinical outcomes. The formation of PGCCs can be attributed to several mechanisms, including aberrant cell cycle regulation, mitotic failure, and response to cellular stress such as chemotherapy and radiation. These mechanisms result in the cells bypassing normal mitotic checkpoints, leading to endoreduplication or cell fusion events that contribute to polyploidy.9–15 PGCCs contribute significantly to tumor heterogeneity. By re-shuffling the genomic content of multiple genome copies,16 they generate diverse progeny through asymmetric division and budding, which allows for the rapid adaptation of tumor cells to changing microenvironments and therapeutic pressures.17 This adaptability promotes tumor evolution and metastasis, complicating treatment strategies.
PGCCs have emerged as a key target in cancer research due to their critical role in therapy resistance. These cells exhibit resistance to conventional chemotherapies and radiation therapy, often surviving initial treatments and giving rise to recurrent tumors.15, 18, 19 This resistance is mediated through multiple mechanisms, including enhanced DNA repair capabilities, activation of survival pathways, avoidance of apoptosis, and the ability to enter a dormant state. In addition, PGCCs are reported to exhibit stem cell-like properties by their enhanced tumor-initiating capability and up-regulation of relevant biomarkers.20–22 Their presence often correlates with more aggressive disease phenotypes and poorer patient outcomes. Targeting PGCCs represents a promising therapeutic strategy. Approaches under investigation include disrupting the specific cell cycle and survival pathways active in PGCCs, as well as exploiting their unique metabolic dependencies.23–28 Therapies aimed at eliminating PGCCs or preventing their formation could enhance treatment efficacy and reduce relapse rates.
Although there has been some progress in this direction,23–32 there are to date no effective therapies targeting PGCCs.15 The development of anti-PGCC treatments has been hindered by the absence of a high-throughput method to rapidly quantify these cells. Traditional drug screening assays, such as MTT, XTT, or ATP-based assays, quickly measure the overall inhibition of cancer cell populations but fail to provide specific information on the elimination of a small PGCC subpopulation, which is crucial for addressing treatment resistance and relapse. PGCCs can be characterized by their excessive DNA content and large cell and nuclear size. Currently, the gold standard for identifying and isolating PGCCs involves fluorescence-activated cell sorting (FACS) combined with visual confirmation.20 While flow cytometry can quantify the number and percentage of PGCCs, it is impractical for screening thousands of compounds or for monitoring the dynamic processes of PGCC induction and death. The limitations of existing approaches underscore the need for a high-throughput and precise analytical method specifically tailored for PGCC research. Building upon the advances in image-based cell segmentation and detection methods,33–37 we recently established a dedicated workflow for the identification and tracking of PGCCs.38 In this study, we expanded the screen to a library of 2,726 FDA Phase 1-approved drugs to identify novel PGCC inhibitors. Additionally, we conducted RNA-Seq analysis to preliminarily elucidate the mechanisms of these anti-PGCC compounds and to explore new strategies for targeting PGCCs.
Although our single-cell morphological analysis allows high-throughput testing of thousands of compounds, it is impractical to empirically test all existing compounds. This challenge underscores the need for computational methods that can efficiently predict anti-PGCC drug responses, streamlining the drug discovery process by identifying promising candidates for experimental validation. Machine learning models have emerged as powerful tools, offering a promising solution by leveraging multi-omics data and biochemical features of compounds, such as chemical structures, to predict drug sensitivity across cancer cell lines.39–45 However, to the best of our knowledge, no machine learning models currently exist for predicting anti-PGCC compounds, largely due to the lack of large training datasets. Establishing such methods is essential for advancing the development of targeted therapies against these challenging cancer cells. In this study, powered by our high-throughput morphological assay, we systematically evaluated a wide array of machine learning models to predict anti-PGCC effects (Fig. 1a). Furthermore, we developed a novel ensemble model that integrates biochemical features with pharmacological descriptions of compounds to enhance prediction performance. This model enabled virtual screening of an expanded library of 6,575 compounds for potential drug repurposing opportunities. Among the top predictions, we experimentally validated two compounds. Taken together, this study demonstrates the significant potential of integrating empirical and virtual screening approaches for PGCCs, which may unlock new avenues for overcoming cancer therapy resistance and ultimately lead to improved patient outcomes.
Methods
Cell culture
We cultured MDA-MB-231 and Vari068 cells in Dulbecco’s Modified Eagle Medium (DMEM, Gibco 11995) supplemented with 10% fetal bovine serum (FBS, Gibco 16000), 1% GlutaMax (Gibco 35050), 1% penicillin/streptomycin (pen/strep, Gibco 15070), and 0.1% Plasmocin (InvivoGen ant-mpp). SUM159 cells were cultured in F-12 medium (Gibco 11765) supplemented with 5% FBS (Gibco 16000), 1% pen/strep (Gibco 15070), 1% GlutaMax (Gibco 35050), 1 μg/mL hydrocortisone (Sigma H4001), 5 μg/mL insulin (Sigma I6634), and 0.1% Plasmocin (InvivoGen ant-mpp). MDA-MB-231 and SUM159 cells were obtained from Dr. Gary Luker’s lab at the University of Michigan, while Vari068 cells were obtained from Dr. Max Wicha’s lab at the University of Michigan. The Vari068 cells, derived from an ER−/PR−/Her2− breast cancer patient who provided informed consent, were adapted to a standard two-dimensional culture environment.46–48 All cell cultures were maintained at 37 °C in a humidified incubator with 5% CO2 and passaged upon reaching over 80% confluency. All cell lines were cultured with the anti-mycoplasma antibiotic Plasmocin.
Compound screening to identify inhibitors of PGCCs
In our screening experiments, we utilized a compound library of 2,726 compounds, each having successfully completed Phase I drug safety confirmation (APExBIO, L1052, DiscoveryProbe™ Clinical & FDA Approved Drug Library). These compounds were prepared at a concentration of 10 mM in DMSO or PBS. For screening, serial dilution was performed to achieve a final concentration of 10 µM. DMSO at 0.1% was used as the control treatment. Cells were harvested from culture dishes using 0.05% Trypsin/EDTA (Gibco, 25200), centrifuged at 1,000 rpm for 4 minutes, re-suspended in appropriate media, and seeded into 96-well plates. For both SUM159 and MDA-MB-231, 1,000 cells were seeded per well in 100 μL of media. Cells were cultured for 24 hours before treatment with compounds for 48 hours. Post-treatment, cells were stained with 0.3 μM Calcein AM (Biotium, 80011-2), 0.6 μM Ethidium homodimer-1 (Invitrogen™, L3224 Live/Dead Viability/Cytotoxicity Kit), and 8 μM Hoechst 33342 (Thermo Scientific 62249), followed by a 30-minute incubation. For other experiments, 4,000 cells per well were seeded for all cell lines. After 24 hours, cells were treated with the PGCC-inducing agent (1 μM Docetaxel) for 48 hours. Post-induction, the reagents were aspirated, and the test compounds were added to treat the mixed populations for an additional 48 hours without flow sorting. The same staining and imaging protocol was used to quantify PGCCs and non-PGCCs after treatments.
Image acquisition
Cells in 96-well plates were imaged using an inverted Nikon Ti2E microscope. Brightfield and fluorescence images were captured with a 4X objective lens and a Hamamatsu ORCA-Fusion Gen-III sCMOS monochrome camera. Each field of view covers approximately 14 mm², accommodating up to 10,000 cells per image. Hoechst-stained cell nuclei were visualized with a DAPI filter set, while live and dead cells were detected using FITC and TRITC filter sets, respectively. Auto-focusing ensured image clarity, with the entire imaging process for a 96-well plate completed in under 9 minutes.
Single-cell morphological analysis software
The goal of our image processing is to quantify viable cells and distinguish PGCCs from non-PGCCs. We developed a custom MATLAB (2021b) program to achieve this in three steps: (1) identify cell nuclei with Hoechst staining, (2) determine cell viability, and (3) recognize PGCCs based on nuclear size. Hoechst-stained images were initially filtered using top-hat and bottom-hat filters to reduce background noise, then enhanced through contrast adjustment, and binarized to measure nuclear sizes. Cell debris was excluded based on its smaller size.49 Live/Dead staining was employed to exclude dead cells, identified by dim Live signals and bright Dead signals. The cell counting method was adapted from our previous work.50–52 Live cells with nuclei larger than 300 pixels using a 4X objective lens or 1,875 pixels using a 10X objective lens (817 µm² area, equivalent to a 32 µm diameter circle) were classified as PGCCs, while live cells with smaller nuclei were considered non-PGCCs. These thresholds were empirically validated with flow cytometry and visual confirmation (Fig. 1). Among the 2,726 compounds, 29 compounds were excluded because their intrinsic fluorescence interferes with image processing.
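The size-based classification step can be sketched in a few lines of Python. This is an illustration only, not the MATLAB program described above: the `Nucleus` record, the `is_live` flag, and the `min_area_px` debris cutoff are assumptions introduced here, and only the 300-pixel (4X) and 1,875-pixel (10X) thresholds and the 817 µm² area come from the text.

```python
import math
from dataclasses import dataclass

PGCC_AREA_PX_4X = 300.0  # nuclear-area threshold at 4X, as stated in the text


@dataclass
class Nucleus:
    area_px: float  # segmented nuclear area, in pixels
    is_live: bool   # hypothetical flag: bright Live signal and dim Dead signal


def scale_threshold(area_px, mag_from, mag_to):
    """Rescale a pixel-area threshold between objectives.

    Area in pixels grows with the square of the magnification ratio.
    """
    return area_px * (mag_to / mag_from) ** 2


def equivalent_diameter(area):
    """Diameter of the circle that has the given area."""
    return 2.0 * math.sqrt(area / math.pi)


def classify(nucleus, pgcc_threshold_px, min_area_px=20.0):
    """Label one nucleus; min_area_px is an assumed debris cutoff."""
    if nucleus.area_px < min_area_px:
        return "debris"
    if not nucleus.is_live:
        return "dead"
    return "PGCC" if nucleus.area_px > pgcc_threshold_px else "non-PGCC"


threshold_10x = scale_threshold(PGCC_AREA_PX_4X, 4, 10)
print(threshold_10x)                       # 1875.0
print(round(equivalent_diameter(817), 1))  # 32.3 (µm, for an 817 µm² nucleus)
print(classify(Nucleus(area_px=420, is_live=True), PGCC_AREA_PX_4X))  # PGCC
```

The rescaling shows why the two pixel thresholds describe the same physical size: moving from 4X to 10X multiplies pixel area by (10/4)² = 6.25.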
Whole-transcriptome sequencing
We extracted RNA from MDA-MB-231 cells, both untreated and treated with 1 μM Pyronaridine Tetraphosphate for 2 days, using the PureLink™ RNA Mini Kit (Invitrogen™, 12183018A). The RNA samples were processed at the UPMC Hillman Cancer Center Cancer Genomics Facility with a KAPA RNA HyperPrep Kit with RiboErase. Each sample was expected to generate approximately 40 million reads (38×38 base paired-end), and two biological replicates were performed. Reads were aligned using the Bowtie2 read aligner in Partek, followed by transcriptome assembly and differential expression analysis with DESeq2.53, 54
Functional enrichment analysis of the Pyronaridine treatment
Gene Set Enrichment Analysis (GSEA) was performed to understand the underlying mechanisms of Pyronaridine treatment.55 Genes from RNA-seq were ranked based on the statistical significance (P-value) of their differential expression in Pyronaridine-treated MDA-MB-231 cells compared to untreated cells. The curated gene sets representing genetic and chemical perturbations (CGPs) from the Molecular Signatures Database (MSigDB) were tested for enrichment at the negative end of the ranked gene list (i.e., downregulated genes in response to Pyronaridine).56 To analyze overlaps among enriched gene sets, we utilized EnrichmentMap and AutoAnnotate in Cytoscape for constructing and visualizing a gene set association network.57 Gene set associations were represented by the degree of gene overlap between two sets, calculated as the average of the Jaccard index and the overlap coefficient (referred to as the combined coefficient). Gene sets with an FDR q-value below 0.05 in GSEA and a combined coefficient above 0.375 were included in the association network. Additionally, we analyzed the leading-edge subset of an enriched gene set of interest identified by GSEA, which represents the top-ranked genes that contribute most to the enrichment score. This subset was further studied for its potential relevance in the response to Pyronaridine.
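The combined coefficient used to connect gene sets can be written out directly. The sketch below is a minimal Python illustration with made-up gene symbols; the function names and the `network_edges` helper are assumptions, and in practice this computation is performed inside EnrichmentMap rather than by hand.

```python
from itertools import combinations


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlap_coefficient(a, b):
    """Size of the intersection divided by the size of the smaller set."""
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


def combined_coefficient(a, b):
    """Average of the Jaccard index and the overlap coefficient."""
    return 0.5 * (jaccard(a, b) + overlap_coefficient(a, b))


def network_edges(gene_sets, cutoff=0.375):
    """Pairs of gene sets whose combined coefficient exceeds the cutoff."""
    edges = []
    for (name_a, set_a), (name_b, set_b) in combinations(gene_sets.items(), 2):
        score = combined_coefficient(set_a, set_b)
        if score > cutoff:
            edges.append((name_a, name_b, score))
    return edges


# Toy gene sets with placeholder gene names, for illustration only.
toy_sets = {
    "SET_1": {"G1", "G2", "G3", "G4"},
    "SET_2": {"G3", "G4", "G5"},
    "SET_3": {"G6", "G7"},
}
for name_a, name_b, score in network_edges(toy_sets):
    print(name_a, name_b, round(score, 3))  # SET_1 SET_2 0.533
```

For SET_1 and SET_2 the Jaccard index is 2/5 and the overlap coefficient is 2/3, so the combined coefficient is about 0.533, which passes the 0.375 cutoff; SET_3 shares no genes and gets no edge.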
Statistical analysis
Statistical analyses were conducted using R (version 4.1), GraphPad Prism 10, and MATLAB. GraphPad Prism 10 software determined half-maximal inhibitory concentrations (IC50s). Two-tailed Student’s t-test compared two groups, while paired 1-way ANOVA and Fisher’s Least Significant Difference (LSD) test compared multiple groups, considering treatment conditions as the variable. Within each cell line, treated versus untreated conditions were consistently paired for comparisons, with significance set at P<0.05. The standard deviation was represented by error bars; sample/group details were specified in figure captions. For data with high variability (e.g., gene expression levels), comparisons were made on a log scale.
Representation of drug features using structures and descriptions
For machine learning modeling, each drug was represented by either a vector of molecular fingerprints to capture its biochemical and structural features, or a vector of text embeddings to encode descriptions of its pharmacological, biochemical, and molecular biological properties. Drug structures were represented by the Simplified Molecular Input Line Entry System (SMILES) line notation. Canonical SMILES codes were obtained from PubChem using the Python PubChemPy package and then converted into molecular fingerprints based on the Molecular ACCess System (MACCS), PubChem, and Extended-Connectivity Fingerprint (ECFP6) systems using the R rcdk package.58 The molecular fingerprints are binary vectors that encode the structural properties of a drug, with lengths of 166, 881, and 1,024 bits, respectively, where each bit denotes the presence (1) or absence (0) of a pre-defined structural property. Text descriptions of drugs were obtained from PubChem using the PUG REST interface, which provides programmatic access to PubChem data.59, 60 We then converted the descriptions into text embeddings using the latest embedding methods developed by OpenAI, including text-embedding-3-small (1,536 dimensions) and text-embedding-3-large (3,072 dimensions), which generate vectors composed of continuous values to represent the semantic information of drug descriptions.
Machine learning models to predict anti-PGCC efficacy
We trained machine learning models to predict drug responses in PGCCs of MDA-MB-231 based on drug structures and descriptions. The normalized count of PGCCs, compared between treated and untreated cells, was increased by 10⁻³ and then log2-transformed and used as the prediction target. We employed 10-fold cross-validation to train and test each model. In each round of 10-fold cross-validation, the drugs were randomly partitioned into 10 sets, where 9 sets were used for model training and the remaining set was used for testing by calculating the Pearson correlation coefficient between the actual and predicted values. Once all 10 sets were tested by the corresponding trained models, we summarized the performance by averaging the 10 correlation coefficients. This entire process, including random partitioning and 10-fold cross-validation, was repeated for 10 rounds. The results from these 10 rounds are presented in box plots, with performance summarized by the median correlation value. We evaluated a total of seven linear and nonlinear regression-based machine learning models, including linear regression with L2 regularization (Ridge), support vector machine (SVM), random forest (RF), histogram-based gradient boosting (HGB), decision tree (DT), stochastic gradient descent linear regression (SGD), and multi-layer perceptron (MLP). These models were implemented using the respective functions of the Python scikit-learn library. For ensemble learning, the predicted drug responses from two individual models, trained on either drug structures or descriptions, were used as inputs for training a linear regression model to predict the drug response. We ensured that all random partitions were applied consistently across individual and ensemble models to allow for rigorous comparison of the results.
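The evaluation protocol (target transform, random 10-fold partitioning, per-fold Pearson correlation, averaging within a round, and repetition over 10 rounds) can be sketched with the Python standard library. This is a hedged illustration rather than the scikit-learn code used in the study: the one-feature least-squares line is a placeholder model, the data are synthetic, and treating the 10⁻³ adjustment as an additive offset applied before the log2 transform is an assumption about the text.

```python
import math
import random
from statistics import median

PSEUDOCOUNT = 1e-3  # offset applied before the log2 transform (assumed additive)


def log2_target(normalized_pgcc_count):
    """Prediction target: log2 of the normalized PGCC count plus the offset."""
    return math.log2(normalized_pgcc_count + PSEUDOCOUNT)


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def kfold_indices(n, k, rng):
    """Randomly partition range(n) into k near-equal folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def repeated_cv(X, y, fit_predict, k=10, rounds=10, seed=0):
    """Return one score per round: the mean of the k per-fold Pearson values."""
    rng = random.Random(seed)
    round_scores = []
    for _ in range(rounds):
        fold_scores = []
        for test_idx in kfold_indices(len(y), k, rng):
            held_out = set(test_idx)
            train_idx = [i for i in range(len(y)) if i not in held_out]
            preds = fit_predict([X[i] for i in train_idx],
                                [y[i] for i in train_idx],
                                [X[i] for i in test_idx])
            fold_scores.append(pearson([y[i] for i in test_idx], preds))
        round_scores.append(sum(fold_scores) / k)
    return round_scores


def line_fit_predict(X_train, y_train, X_test):
    """Placeholder model: a one-feature least-squares line."""
    n = len(X_train)
    mx, my = sum(X_train) / n, sum(y_train) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(X_train, y_train))
             / sum((a - mx) ** 2 for a in X_train))
    return [my + slope * (x - mx) for x in X_test]


# Synthetic data only; these are not the screening results.
data_rng = random.Random(1)
X = [float(i) for i in range(100)]
y = [0.05 * x + data_rng.gauss(0.0, 1.0) for x in X]
scores = repeated_cv(X, y, line_fit_predict)
print(len(scores), round(median(scores), 2))
```

Passing the same `seed` to `repeated_cv` for every model reproduces the same partitions, which is one simple way to keep individual and ensemble models on identical splits.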
Results and Discussion
Comprehensive compound efficacy analysis by quantifying PGCCs and non-PGCCs
We developed a single-cell morphological analysis pipeline to rapidly quantify PGCCs and non-PGCCs by identifying cell nuclei with Hoechst staining, excluding dead cells using Live/Dead staining, and distinguishing PGCCs and non-PGCCs based on nuclear size (Fig. 1a).38 This pipeline was validated with multiple breast cancer cell lines and confirmed through flow cytometry and visual inspection. As a demonstration, we treated MDA-MB-231 cells with Paclitaxel, a common and widely used drug for triple-negative breast cancer (TNBC) (Fig. 1b). Without treatment, the cell population was predominantly non-PGCCs and much higher in number. Paclitaxel treatment significantly reduced the total number of cells while inducing a higher proportion of PGCCs, which can be the source of treatment resistance. Fig. 1c shows enlarged views of non-PGCCs and PGCCs. Our pipeline converts raw images to pseudo-colors representing nuclear size: red for larger nuclei and blue for smaller nuclei (Fig. 1d). As anticipated, the plot of Paclitaxel-treated cells shifts significantly towards red, indicating an increase in PGCCs, while the untreated cell population predominantly remains blue. Based on the size threshold established in our prior work, we quantify the numbers of PGCCs and non-PGCCs for each image.38 This high-throughput screening tool can process up to 10,000 cells per condition within one second, enabling detailed monitoring of cell development and the identification of compounds affecting PGCC populations.
Using this single-cell morphological analysis, we characterized the changes in cell composition when treated with a compound library of 2,726 compounds, each having successfully completed Phase I drug safety confirmation for potential rapid translational impact. One day after cell loading, cells were treated for two days and then stained and imaged to quantify non-PGCCs and PGCCs (Fig. 2a). The counts of PGCCs and non-PGCCs were normalized to the numbers in 8 control wells on the same 96-well plate. Among 2,726 compounds, 29 compounds were excluded due to their fluorescent colors that interfere with image processing, and 461 inhibited the total cell number by at least half. However, among those 461 compounds, 236 (51.2%) increased the number of PGCCs at least twofold. We further examined commonly used chemotherapeutics. We found that taxanes (Docetaxel and Paclitaxel (Taxol)), Gemcitabine, Carboplatin, and Vinorelbine significantly inhibited non-PGCCs but increased the number of treatment-resistant PGCCs after treatment. This partially explains why overall tumor shrinkage is seen after treatment, yet the remaining cancer cells develop therapeutic resistance and relapse in the clinic. While Cyclophosphamide monohydrate, Capecitabine, and Fluorouracil do not induce PGCCs, they are not effective in killing cells. These observations highlight the challenges of current chemotherapies in treating TNBC. Given the complicated in vivo environment and challenges of effective drug delivery into the core of tumors, the situation is likely to be worse in patients. As such, our high-throughput screening capability is essential in identifying new compounds that inhibit PGCCs.
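The per-plate normalization and the two criteria used above (total cell number reduced by at least half; PGCC number increased at least twofold) can be illustrated as follows. This is a minimal sketch with invented counts: the function name, the dictionary keys, and the use of the mean across control wells are assumptions, since the text does not state how the 8 control wells were aggregated.

```python
from statistics import mean


def screen_flags(pgcc, non_pgcc, ctrl_pgcc, ctrl_non_pgcc):
    """Normalize one treated well to same-plate controls and flag it.

    ctrl_pgcc and ctrl_non_pgcc hold the counts from the control wells.
    Aggregating controls by their mean is an assumption.
    """
    ctrl_pgcc_mean = mean(ctrl_pgcc)
    ctrl_non_mean = mean(ctrl_non_pgcc)
    total_ratio = (pgcc + non_pgcc) / (ctrl_pgcc_mean + ctrl_non_mean)
    pgcc_ratio = pgcc / ctrl_pgcc_mean
    return {
        "total_ratio": total_ratio,
        "pgcc_ratio": pgcc_ratio,
        "non_pgcc_ratio": non_pgcc / ctrl_non_mean,
        "halves_total": total_ratio <= 0.5,   # total cell number cut by >= half
        "doubles_pgcc": pgcc_ratio >= 2.0,    # PGCC number at least doubled
    }


# Invented counts for 8 control wells and one treated well.
ctrl_pgcc = [10, 12, 9, 11, 10, 8, 10, 10]                         # mean 10
ctrl_non_pgcc = [2000, 1950, 2050, 2000, 1980, 2020, 2000, 2000]   # mean 2000
flags = screen_flags(pgcc=40, non_pgcc=500,
                     ctrl_pgcc=ctrl_pgcc, ctrl_non_pgcc=ctrl_non_pgcc)
print(flags["halves_total"], flags["doubles_pgcc"])  # True True
```

A well like this one, with far fewer cells overall but four times as many PGCCs as the controls, is the pattern reported for 236 of the 461 cytotoxic compounds.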
Discovering PGCC inhibitors with screening experiments
Given that most TNBC cell lines naturally harbor a minimal PGCC population (<1%), accurately assessing the impact of compounds on PGCCs is challenging. To induce PGCCs, Docetaxel was administered to cells for two days after initial loading. Subsequently, the Docetaxel-containing medium was aspirated, and the testing compounds were introduced for an additional two days. Cells were then stained and imaged to quantify both PGCCs and non-PGCCs (Fig. 1a). As illustrated in Fig. 2b, drug-resistant PGCCs proved largely impervious to most compounds. Conventional chemotherapeutic drugs were also ineffective in killing treatment-resistant PGCCs. Among 2,697 compounds, 169 inhibited PGCCs by at least twofold, 45 inhibited them by at least tenfold, and 63 inhibited both PGCCs and non-PGCCs by at least twofold (Fig. 2b).
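Sorting compounds by their effect on each subpopulation reduces to two threshold comparisons. The sketch below assumes that "inhibited by at least twofold" means the treated-to-control count ratio is at most 1/2; the function name and category labels are illustrative and do not come from the study.

```python
def selectivity(pgcc_ratio, non_pgcc_ratio, fold=2.0):
    """Categorize a compound from treated/control count ratios.

    A subpopulation counts as inhibited when its ratio is <= 1/fold.
    """
    cutoff = 1.0 / fold
    hits_pgcc = pgcc_ratio <= cutoff
    hits_non_pgcc = non_pgcc_ratio <= cutoff
    if hits_pgcc and hits_non_pgcc:
        return "both"
    if hits_pgcc:
        return "PGCC-selective"
    if hits_non_pgcc:
        return "non-PGCC-selective"
    return "neither"


print(selectivity(0.08, 0.90))            # PGCC-selective
print(selectivity(0.30, 0.20))            # both
print(selectivity(1.50, 0.10))            # non-PGCC-selective
print(selectivity(0.30, 0.20, fold=10))   # neither
```

Raising `fold` to 10 applies the stricter tenfold criterion mentioned in the text.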
Among the potent drugs against PGCCs, we observed the efficacy of proteasome inhibitors (e.g., Bortezomib, Oprozomib, Carfilzomib, and Celastrol), CHK inhibitors (e.g., AZD7762, PF-477736), and FOXM1 inhibitor Thiostrepton. FOXM1, a key regulator of the cell cycle, is dysregulated in PGCCs, making them particularly susceptible to FOXM1 inhibition.38, 61, 62 Proteasome inhibitors induce cancer cell death through multiple mechanisms, including the accumulation of pro-apoptotic proteins and cell cycle arrest, as well as the buildup of misfolded proteins that heighten cellular stress and sensitivity to other therapies.63–65 CHK inhibitors, by targeting CHK1 and CHK2, disrupt DNA damage repair and cell cycle control, preventing cancer cells from recovering from therapy-induced damage and enhancing the efficacy of existing treatments.66, 67 While these compounds have been studied, they are not yet in clinical use for treating breast cancer. Their selective activity against PGCCs highlights their potential as therapeutic options for patients with treatment-resistant breast cancer characterized by a significant presence of PGCCs.
In addition to well-studied targets, the large-scale screening revealed promising new compounds for targeting PGCCs (Fig. 2b). Notable among them are macrocyclic lactones such as Ivermectin, Doramectin, and Moxidectin, which are known for their antiparasitic effects and function by binding to glutamate-gated chloride channels in parasitic nerve and muscle cells.68–71 This binding elevates chloride ion permeability, leading to hyperpolarization, paralysis, and death of the parasites. These compounds also interact with other ion channels, disrupting neurotransmission specifically in parasites while leaving host cells largely unaffected due to structural differences in ion channels. Recent studies have demonstrated that Doramectin inhibits glioblastoma cell survival through modulation of autophagy; however, its effects on breast cancer cells have yet to be explored.72 Furthermore, Pyronaridine, an antimalarial drug used in combination therapies for Plasmodium falciparum and Plasmodium vivax infections, was also found to effectively eliminate PGCCs.73, 74 Pyronaridine’s effect is visually indicated by a blue shift in pseudo color compared to the control (Fig. 3a). Pyronaridine disrupts hemozoin formation, leading to toxic heme accumulation, intercalates into DNA to inhibit nucleic acid synthesis, and induces oxidative stress through ROS generation. This multifaceted action damages critical cellular components, killing the parasite. When used with artesunate, Pyronaridine improves treatment efficacy and overcomes resistance, enhancing parasite clearance and therapeutic outcomes. Beyond its antimalarial properties, Pyronaridine’s antiviral activity against COVID-19 and Ebola viruses has garnered significant attention.74–76 Although its potential impact on breast cancer has been noted,77, 78 there has been no prior investigation into its ability to overcome therapeutic resistance or specifically target PGCCs. Overall, although the mechanism of PGCC inhibition by these compounds remains unclear, they present intriguing possibilities for future investigation.
In addition to single-dose treatments, we tested five concentrations of selected compounds to validate our screening results in MDA-MB-231 cells (Fig. 3b). To further confirm these findings, we evaluated the compounds in a second TNBC line, SUM159. Notably, Pyronaridine selectively targeted PGCCs in both cell types (Fig. 3b). These results highlight our distinct capability to differentiate compounds based on their selective effects on PGCCs versus non-PGCCs, enabling precise identification and validation of effective PGCC inhibitors.
Identification and validation of AXL as a key mediator for the anti-PGCC effects of Pyronaridine
To investigate the potential mechanisms underlying Pyronaridine-induced inhibition of PGCCs in MDA-MB-231 cells, we performed RNA-seq on Pyronaridine-treated PGCCs and compared their gene expression profiles to those of untreated cells. We applied GSEA to identify signaling pathways perturbed by the treatment, focusing on gene sets associated with various perturbations. We identified 283 statistically significantly depleted gene sets (normalized enrichment score [NES] < 0, q-value < 0.05) in Pyronaridine-treated cells compared to control cells. In other words, these gene sets were enriched for genes downregulated by Pyronaridine treatment. An association network analysis of these gene sets revealed a close involvement in cell cycle regulation and cancer cell proliferation (Fig. 4a and b). Among these gene sets, we observed significant depletion of the KOBAYASHI_EGFR_SIGNALING_24HR_DN gene set, which contains genes downregulated by EGFR inhibition (NES = −1.74, q = 0.007) (Fig. 4a and c).79 This gene set overlapped with several others related to cell cycle states, RB1 targets, and breast cancer grades (Fig. 4a). These findings indicate that Pyronaridine may deregulate the EGFR signaling pathway to inhibit PGCC proliferation in TNBC, echoing results from a previous report in non-small cell lung cancer.80
We further explored key genes mediated by the EGFR signaling pathway for their potential as therapeutic targets of PGCCs in TNBC. The five top-ranked leading-edge genes from GSEA were TUBB, AXL, NOLC1, CCND1, and TPX2 (Fig. 4c), all of which were significantly downregulated in Pyronaridine-treated cells. Among these, AXL emerged as a particularly promising target for further investigation. The AXL pathway, driven by the AXL receptor tyrosine kinase, orchestrates cell survival, proliferation, migration, and invasion.81–83 Activation by its ligand, Gas6, triggers a signaling cascade involving PI3K, AKT, and MAPK, which enhances cell survival, inhibits apoptosis, promotes epithelial-to-mesenchymal transition (EMT), and facilitates cancer metastasis.84 AXL also plays a role in immune evasion and therapy resistance, with its dysregulation often correlating with aggressive cancer phenotypes and poor prognosis, making it a prime target for therapeutic intervention.85 In light of our RNA-Seq data and existing literature on AXL’s role in therapy resistance, we tested TP-0903, a novel, orally bioavailable AXL inhibitor currently in a first-in-human clinical trial for advanced solid tumors.86, 87 As an ATP-competitive inhibitor, it features an adenine-mimicking heterocyclic structure and specifically binds to the active form of AXL. Our findings demonstrate that TP-0903 effectively targets PGCCs in both MDA-MB-231 and SUM159 cells (Fig. 4d). This preliminary result is consistent with the RNA-Seq analysis and suggests that Pyronaridine’s mechanism in targeting PGCCs may involve the AXL pathway.
Machine learning-based prediction of anti-PGCC effects using high-throughput screening data
The impracticality of empirically screening all existing compounds and the absence of predictive models are major obstacles hindering the identification of promising anti-PGCC compounds for experimental validation. To address this challenge, we assessed the potential of our high-throughput morphological assay of 2,726 compounds to effectively inform predictive machine learning models. Specifically, we comprehensively tested seven state-of-the-art machine learning methods to predict anti-PGCC efficacy in MDA-MB-231 cells. As described in Methods, these regression models were trained to predict changes in PGCC counts based on quantitative representations of either chemical structures (fingerprints) or compound descriptions (text converted to embeddings) (Fig. 5a). A total of 2,430 compounds in the screening library with both features available were used in the model. We adopted 10 rounds of 10-fold cross-validations to train and test each model. In each iteration of cross-validation, a model was trained using 90% of the 2,430 compounds and tested on the remaining 10%, which were not seen by the model during training. Overall, 31 out of 63 (49.2%) models achieved a median Pearson correlation coefficient π above 0.2 across 10 rounds of cross-validations (Fig. 5b).
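The text reports 63 model configurations without spelling out the grid. One reading that reproduces this count, offered here as an assumption rather than a confirmed description, is seven regression models crossed with nine feature sets: every non-empty combination of the three fingerprint types (seven combinations) plus the two embedding sizes. The Python sketch below enumerates that grid using the bit lengths and dimensions given in Methods.

```python
from itertools import combinations, product

# Lengths as stated in Methods.
FINGERPRINTS = {"MACCS": 166, "PubChem": 881, "ECFP6": 1024}
EMBEDDINGS = {"text-embedding-3-small": 1536, "text-embedding-3-large": 3072}
MODELS = ["Ridge", "SVM", "RF", "HGB", "DT", "SGD", "MLP"]


def feature_sets():
    """Map each feature-set name to its vector length.

    Fingerprint combinations are treated as concatenations, which is an
    assumption about how multiple fingerprints were combined.
    """
    sets = {}
    for size in range(1, len(FINGERPRINTS) + 1):
        for combo in combinations(FINGERPRINTS, size):
            sets["+".join(combo)] = sum(FINGERPRINTS[name] for name in combo)
    sets.update(EMBEDDINGS)
    return sets


configs = list(product(MODELS, feature_sets()))
print(len(feature_sets()))              # 9
print(len(configs))                     # 63
print(feature_sets()["MACCS+PubChem"])  # 1047
```

Under this reading, the best structure-based feature set (MACCS plus PubChem) is a 1,047-bit vector.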
For molecular fingerprints, HGB with a combination of MACCS and PubChem was the best model (median π, 0.29; Fig. 5b). Models that used combinations of multiple molecular fingerprints as features tended to achieve better performance compared to those using single molecular fingerprints. For example, HGB with MACCS and PubChem, RF with MACCS and ECFP6, and SVM with all three molecular fingerprints outperformed their single-fingerprint counterparts (Fig. 5b). For description-based embeddings, models with longer embeddings (3,072 dimensions) generally outperformed those with 1,536 dimensions (Fig. 5b), suggesting that longer embeddings capture additional pharmacological information. Notably, SVM with 3,072-dimensional embeddings was the best-performing model (median π = 0.24; Fig. 5b). Overall, performance of these models was comparable to the best results from a community challenge for predicting drug sensitivities and recent studies predicting genetic dependencies in pan-cancer cell lines,88–90 demonstrating the capability of our screening library to support accurate predictive modeling.
Enhancing predictive performance by integrating compound structures and descriptions using an ensemble learning approach
Since compound structures and descriptions provide distinct yet potentially complementary information, combining these features may improve the performance of predictive models. To explore this, we developed an ensemble learning method by integrating the best-performing models for drug structures and descriptions, respectively (i.e., HGB on MACCS and PubChem, and SVM on the longer embedding). The ensemble model utilized linear regression to generate the final prediction based on the outputs of these two models. Notably, this approach significantly improved performance (median π = 0.31) compared to the individual models (one-tailed paired t-test, both P < 1×10⁻⁶) (Fig. 5c). Across all 2,430 drugs, the ensemble model achieved a π of 0.33 between real and predicted drug responses (P = 1.53 × 10⁻⁶¹) (Fig. 5d).
In the ensemble model, the regression coefficients for the HGB and SVM models were 1.2 and 0.6, respectively, both statistically significant (P < 1 × 10⁻³). These results suggest that both models contributed meaningful and independent information to the ensemble model. The HGB model had a greater impact on the final prediction, while the SVM model predictions provided a complementary effect. Taken together, our findings demonstrate that integrating these two distinct features allows the model to capture meaningful and complementary patterns related to anti-PGCC effects, leading to enhanced predictive performance.
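The linear-regression ensemble described above can be sketched as an ordinary least-squares fit of the observed response on the two base-model predictions. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names and the toy numbers are hypothetical, an intercept term is assumed, and the text does not state whether the base predictions fed to the regression were out-of-fold, which is the usual precaution against leakage.

```python
from typing import List, Sequence, Tuple


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a small dense system a @ x = b by Gaussian elimination."""
    n = len(b)
    # Augmented matrix, so row operations update both sides together.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: base predictions are collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_stacker(
    pred_a: Sequence[float], pred_b: Sequence[float], y: Sequence[float]
) -> Tuple[float, float, float]:
    """Fit y ~ b0 + b1 * pred_a + b2 * pred_b via the normal equations."""
    rows = [[1.0, a, b] for a, b in zip(pred_a, pred_b)]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(3)]
    b0, b1, b2 = solve_linear_system(xtx, xty)
    return b0, b1, b2


def predict_stacker(
    coefs: Tuple[float, float, float],
    pred_a: Sequence[float],
    pred_b: Sequence[float],
) -> List[float]:
    """Combine two base-model predictions with fitted coefficients."""
    b0, b1, b2 = coefs
    return [b0 + b1 * a + b2 * b for a, b in zip(pred_a, pred_b)]


# Toy base-model outputs (made-up numbers; "a" stands in for the
# structure-based model and "b" for the description-based model).
toy_a = [0.1, 0.4, 0.3, 0.9, 0.5]
toy_b = [0.2, 0.1, 0.7, 0.3, 0.8]
toy_y = [0.1 + 1.2 * a + 0.6 * b for a, b in zip(toy_a, toy_b)]
toy_coefs = fit_stacker(toy_a, toy_b, toy_y)
```

In this formulation, a larger fitted coefficient for one base model means its predictions carry more weight in the final output, which is how the relative contributions of the two models are read from the regression.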
Expanded virtual screening by the ensemble prediction model and validation using a patient-derived model
We expanded our virtual screening to a broader range of compounds to identify potential anti-PGCC agents in breast cancer. We compiled a large library of compounds based on the Profiling Relative Inhibition Simultaneously in Mixtures (PRISM) project, one of the largest drug sensitivity screens, which covers 6,575 oncology and non-oncology drugs (as of 24Q2).91 Of these, 3,093 were not part of our original screening library but had both drug structure and description information. We applied our ensemble model to predict anti-PGCC effects for these 3,093 drugs in MDA-MB-231 cells. The predicted drug rankings, based on their viability-inhibitory effects in PGCCs, are shown in Fig. 6a. Among the top-ranked candidates, we prioritized those with novelty, strong pharmacological profiles, and translational potential for experimental validation. Notably, two compounds, Lestaurtinib and UCN-01, effectively inhibited PGCCs in both MDA-MB-231 and SUM159 cell lines, validating the model’s predictions (Fig. 6b).
To further assess the clinical relevance of these findings, we validated the two compounds in a low-passage, TNBC patient-derived cell line, Vari068, which naturally harbors a high population of PGCCs. We confirmed a significant reduction in PGCCs within this patient-derived model. Although machine learning models do not always provide direct mechanistic explanations, a literature review suggests plausible mechanisms. Lestaurtinib, a multi-targeted tyrosine kinase inhibitor, interferes with stress signaling pathways involving JAK2, which PGCCs depend on for survival.92–95 UCN-01, a Chk1 inhibitor, targets crucial cell cycle checkpoints, undermining PGCCs’ ability to manage DNA damage and genomic instability.96, 97 By disrupting these survival pathways, both drugs may render PGCCs vulnerable, leading to selective cell death. The successful validation of these model-predicted compounds demonstrates the potential of machine learning-based virtual screening to accelerate the discovery of novel anti-cancer therapies, particularly for targeting therapy-resistant PGCCs.
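The library-expansion step in this section (keeping compounds that were not in the original screen and that have both a structure and a description, then ranking them by predicted effect) can be sketched as follows. All names here are hypothetical, the `predict` callable stands in for the trained ensemble model, and the sketch assumes that a lower predicted PGCC viability means a stronger inhibitory effect.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Set, Tuple


@dataclass(frozen=True)
class Compound:
    name: str
    smiles: Optional[str]       # structure string; None when unavailable
    description: Optional[str]  # free-text description; None when unavailable


def select_candidates(
    library: Iterable[Compound], screened_names: Set[str]
) -> List[Compound]:
    """Keep unscreened compounds that have both structure and description."""
    return [
        c
        for c in library
        if c.name not in screened_names and c.smiles and c.description
    ]


def rank_by_prediction(
    candidates: Iterable[Compound], predict: Callable[[Compound], float]
) -> List[Tuple[str, float]]:
    """Sort compounds by predicted PGCC viability, lowest first.

    Assumption: lower predicted viability means stronger inhibition.
    """
    scored = [(c.name, predict(c)) for c in candidates]
    return sorted(scored, key=lambda item: item[1])


# Toy library with placeholder names and made-up scores (not real data).
toy_library = [
    Compound("drug_A", "C", "already screened"),
    Compound("drug_B", None, "no structure available"),
    Compound("drug_C", "CC", None),
    Compound("drug_D", "CCC", "kinase inhibitor"),
    Compound("drug_E", "CCCC", "checkpoint inhibitor"),
]
toy_scores = {"drug_D": 0.8, "drug_E": 0.3}
toy_candidates = select_candidates(toy_library, screened_names={"drug_A"})
toy_ranking = rank_by_prediction(toy_candidates, lambda c: toy_scores[c.name])
```

Matching compounds by exact name is a simplification; in practice, identifiers would usually need to be harmonized between libraries before the set difference is taken.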
Conclusions
Therapy resistance in breast cancer is increasingly linked to the presence of PGCCs, which arise through whole-genome doubling and exhibit heightened resistance to conventional treatments. To address the challenge of identifying effective PGCC inhibitors in a high-throughput manner, we developed a single-cell morphological analysis workflow that rapidly distinguishes compounds targeting non-PGCCs, PGCCs, or both. By screening a library of 2,726 FDA Phase 1-approved drugs, we identified several promising anti-PGCC candidates, including proteasome, FOXM1, and CHK inhibitors, as well as macrocyclic lactones. RNA-Seq analysis of Pyronaridine-treated cells further suggested that AXL inhibition could be a viable strategy for targeting PGCCs. To scale up the discovery of potential PGCC inhibitors, we developed an ensemble learning model that predicts anti-PGCC efficacy by integrating two machine learning models, one based on chemical fingerprints and one on compound descriptions. This model successfully predicted effective compounds from the PRISM library, which includes over 6,000 drugs. Two of the top-ranked predictions were experimentally validated as potent PGCC inhibitors. These findings underscore the potential of machine learning-driven virtual screening to accelerate the discovery of novel therapies aimed at overcoming therapy resistance in PGCCs.
Author Contributions
Drug screening and cell biology experiments were performed by Jinxiong Cheng, Hsiao-Chun Chen, and Yu-Chih Chen. Software for single-cell morphological analysis was developed by Yushu Ma. RNA-Seq experiment was performed by Yu-Chih Chen. Sequencing read alignment and data analysis were performed by Chien-Hung Shih and Yu-Chih Chen. In silico prediction of PGCC inhibitors was performed by Chien-Hung Shih, Li-Ju Wang, Yanhao Tan, and Yu-Chiao Chiu. Yu-Chih Chen and Yu-Chiao Chiu supervised the study. Chien-Hung Shih, Yu-Chiao Chiu, and Yu-Chih Chen wrote the manuscript. All authors discussed the results, commented on the manuscript, and approved the final manuscript.
Conflicts of Interest
The authors declare no competing interests.
Declaration of Generative AI in Scientific Writing
The authors used ChatGPT (versions 4o and 3.5) to enhance the readability and language of this work. Following its use, the authors thoroughly reviewed and edited the content as necessary and take full responsibility for the content of the publication.
Acknowledgements
This study was generously funded by start-up support from the UPMC Hillman Cancer Center awarded to Yu-Chih Chen and Yu-Chiao Chiu (supported by the National Institutes of Health [NIH] through Grant Numbers P30CA047904 and P50CA272218), the Women’s Cancer Research Center (WCRC) at Magee Women’s Research Institute to Yu-Chih Chen, the Pitt CTSI Pilot project to Yu-Chih Chen (NIH Grant Number UL1TR001857), Pittsburgh Liver Research Center (NIH P30DK120531) to Yu-Chiao Chiu, the NIH National Institute of General Medical Sciences (R35GM150509 to Yu-Chih Chen and R35GM154967 to Yu-Chiao Chiu), the NIH National Cancer Institute to Yu-Chiao Chiu (R00CA248944), the NIH Office of the Director to Yu-Chiao Chiu (3R00CA248944-04S1 and R03DE033361), and Leukemia Research Foundation to Yu-Chiao Chiu, as well as the UPMC Competitive Medical Research Fund (CMRF) awarded to Yu-Chih Chen. This research was supported in part by the University of Pittsburgh Center for Research Computing (NIH S10OD028483), through the resources provided. We also thank Drs. Gary Luker and Max Wicha at the University of Michigan for kindly providing the cell lines used in this study.