Abstract
Developing novel cancer treatments is a challenging task that can benefit from computational techniques matching transcriptional signatures to large-scale drug response data. Here, we present ‘retriever,’ a tool that extracts robust disease-specific transcriptional drug response profiles based on cellular response profiles to hundreds of compounds from the LINCS-L1000 project. We used retriever to extract transcriptional drug response signatures of triple-negative breast cancer (TNBC) cell lines and combined these with a single-cell RNA-seq breast cancer atlas to predict drug combinations that antagonize TNBC-specific disease signatures. After systematically testing 152 drug response profiles and 11,476 drug combinations, we identified the combination of kinase inhibitors QL-XII-47 and GSK-690693 as the topmost promising candidate for TNBC treatment. Our new computational approach allows the identification of drugs and drug combinations targeting specific tumor cell types and subpopulations in individual patients. It is, therefore, highly suitable for the development of new personalized cancer treatment strategies.
Introduction
Developing new drugs and pharmacological regimens to treat complex diseases such as cancer is an expensive and challenging task (1). Several computational techniques that use structural interactions (2), transcriptional signatures (3–5), biological networks perturbations (6, 7), and data mining (8) are available to aid in the discovery of new treatment regimens which could improve patient management (9). Methods based on transcriptional signatures use the observed changes in gene expression profiles between samples from patients affected with the disease under study compared to samples from control subjects, and match these to the response profiles of cell lines that act as surrogates for the disease to different compounds. These data can be combined to identify compounds that may antagonize the disease’s transcriptional changes, returning it to a more healthy-like state (5). Finding specific transcriptional response profiles to drug candidates, as well as robust disease-associated transcriptional alterations, is therefore crucial for such approaches to produce reliable predictions.
To obtain precise disease-specific transcriptional signatures that can be targeted by pharmacological compounds, one needs to account for transcriptional variability between patients by including samples from multiple individuals (10). This allows for the identification of a target set of genes that is constitutively expressed with the same magnitude and pattern across patients/cells. Targeting such a consistently expressed set of genes is thought to increase the effectiveness and safety of the treatment (11). Transcriptional signatures are measurable at a variety of resolutions, ranging from low resolution, as in the complex mixture of cells present within tissues, to high resolution, such as single-cells (12). The Cancer Genome Atlas (TCGA) project provides RNA-sequencing (RNA-seq) data for tumors and adjacent tissues (13). While such tissue-level data have been useful in the identification of cancer subtypes and prognostic signatures, the identification of accurate transcriptional signatures between disease and healthy states has been hindered by the wide heterogeneity of cell types found in tumor tissues (14).
This issue can be overcome by using single-cell RNA-seq data (15). A drawback of single-cell RNA-seq experiments is that, due to the higher cost of profiling of single cells, often fewer biological replicates are available. However, by combining multiple experiments, sample sizes of cells with a desired molecular phenotype can be increased, improving characterization of the transcriptional changes observed in disease (16). In addition, the construction of pseudo-bulk profiles (the sum of the expression values of a gene across all cells obtained from the same individual) from single-cell data makes it possible to account for gene expression variability among patients (17, 18).
The LINCS-L1000 project published transcriptional profiles of several cell lines, treated with hundreds of compounds at various concentrations and time points of drug exposure (4). These profiles have previously been used to identify potential drugs that can be repurposed to treat a variety of diseases, including cancer (3). Additionally, the project provides an interactive portal in which users can interrogate whether an up- or down-regulated gene set of interest overlaps with transcriptional drug response signatures. The portal then returns a ranked list of compounds that are likely to have an inverse effect on disease-associated gene expression levels (19). However, these predictions are not based on robust tissue- or disease-specific transcriptional profiles and may therefore over- or underestimate the potential effect of drugs on specific diseases. Additionally, the LINCS-L1000 web portal returns an exhaustive list of independent experiments with matching inverse patterns, rendering it difficult to identify compounds that induce stable transcriptional responses in cell lines derived from the same disease across multiple time points and drug concentrations. Thus, being able to extract robust disease-specific transcriptional drug response signatures that are consistent at different time points, drug concentration, or cell line from the profiles provided by the LINCS-L1000 project would significantly improve drug prioritization and accelerate the identification of new pharmacological options for personalized treatment of cancer (20).
Once both the disease profile and the disease-specific response to multiple compounds are available, rank-based correlation analysis can be used to quantitatively identify compounds that can revert the transcriptional changes that distinguish diseased samples from healthy ones (5, 9). Nonetheless, monotherapy in cancer is highly susceptible to the development of resistance following an initial response to treatment (21). Combination therapy, or the simultaneous administration of multiple drugs to treat a disease, has evolved into the standard pharmacological regimen for treating complex diseases such as cancer. Combination therapy prevent tumor evolution and help inhibit the development of drug resistance in cancer, thereby improving patient survival (22).
The in silico prediction of responses to drug combinations is an active topic of research in computational biology (23–25). Recently, Pickering (2021) found that the transcriptional response signatures to 856, 086 unique two-drug combinations could be predicted based on the 1, 309 drug response profiles to compounds tested by the Connectivity Map Build 2 project (parent of the LINCS-L1000 project) (26). After analyzing 148 independent studies involving 257 treatment combinations from the Gene Expression Omnibus (GEO) database, it was discovered that averaging the expression profiles of individual treatments provides 78.96 percent accuracy in predicting the direction of differential expression for the combined treatment However, (27) different from the Connectivity Map project— where the generation of combinatorial response profiles is possible thanks to the availability of single response profiles for each individual compound (28)—the LINCS-L1000 project does not provide robust single drug response profiles, making the generation of meaningful combinatorial profiles not yet possible.
Here, we present retriever, a tool that uses correlation analysis and hierarchical collapsing to extract robust disease-specific transcriptional drug response signature profiles that are consistent across time, concentration, and cell line, from data provided by the LINCS-L1000 project. We integrated these transcriptional drug response profiles with a single-cell RNA-seq signature representing the transcriptional changes observed in triple-negative breast cancer cells compared to healthy breast epithelial cells. We used these two signatures—the transcriptional changes induced by the disease and the cellular responses to drugs—to prioritize drugs and to predict novel drug combinations suitable for the treatment of triple-negative breast cancer. The transcriptional changes associated with TNBC were computed from a single-cell breast cancer RNA-seq atlas built by us after compiling single-cell RNA-seq data obtained from 36 publicly available healthy breast and breast cancer samples. After systematically testing drug response profiles and drug combinations, we identified the combination of kinase inhibitors QL-XII-47 and GSK-690693 as the topmost promising treatment for TNBC. In addition to recommending drug combinations, the profiles returned by retriever allow for the characterization of possible mechanisms of action of the identified compound(s) to reverse the disease’s transcriptional profile towards a healthy-like state.
Methods
Single-cell RNA-seq datasets
We collected publicly available single-cell RNA-seq count matrices for healthy breast tissues and breast cancer samples from multiple sources (see Data Availability). Each sample’s gene identifiers (IDs) were translated into current gene symbols using the dictionary of gene ID synonyms provided by ENSEMBL (29). Datasets were loaded into R and integrated into a single ‘Seurat’ object (30). Data were then subjected to quality control keeping only cells with a library size of at least 1, 000 counts and within the 95 percent confidence interval of the prediction of the mitochondrial content ratio and detected genes in proportion to the cell’s library size. We also removed all cells that had mitochondrial proportions greater than 10% (31). We then normalized, scaled, and reduced the dimensionality of the data through Principal Component Analysis (PCA) using the default functions and parameters included in Seurat. Data integration was performed using Harmony (32). UMAP projections of the data were extracted using the top 20 dimensions returned by Harmony (33). Cell clustering was performed using the functions included in Seurat for this purpose, with a resolution of 0.01, using UMAP embeddings as the source for the nearest neighbor network construction.
Cell type assignation
Assignation of cell types was performed based on the expression of the marker genes reported by Wu et al. (34, 35). Epithelial cell subtypes were assigned using the Nebulosa package by querying cells expressing ESR1, PGR, and ERBB2 receptors (36).
Single-cell RNA-seq differential expression analysis
Transcriptional signatures of triple-negative breast cancer epithelial cells compared to healthy epithelial cells were quantified by differential expression using MAST (37). We compared (Supplementary Fig. S1 and Fig. S2) all epithelial cells from cancer patients (5, 730 cells, 31, 73%) against those from healthy donors (12, 323 cells, 68.26%). In addition, we compared the cluster of patient-derived epithelial cells depleted of hormone and growth factor receptor expression (enrichment for ESR1−, PGR−, and ERBB2− cells—2, 998 cells, 16.6%) against all healthy epithelial cells (12, 323 cells, 68.3%), against the cluster of epithelial cells depleted of hormone and growth factor expression (6117 cells, 33.9%) derived from healthy individuals, and against the cluster of epithelial cells enriched for triple-positive cells (ESR1+,PGR+, and ERBB2+) from healthy individuals (6, 206 cells, 34.4%). We selected the comparison of the cluster enriched with triple-negative cancer cells with the cluster enriched for triple-positive healthy epithelial cells as the disease-specific transcriptional profile, as this profile showed the highest Spearman Correlation Coefficient with tissue-level fold-changes computed on TCGA breast cancer data.
Pseudo-bulk analyses
Pseudo-bulk profiles were computed by summing up all expression values for the same gene across cells of the same individual. We applied this procedure to the triple-negative epithelial cells derived from patients against the triple-positive epithelial cells from healthy individuals. The computed pseudo-bulk profiles from diseased and healthy individuals were then compared using DESeq2 (38).
TCGA breast cancer dataset analysis
Upper-quantile FPKM-transformed RNA-seq data at the tissue level as well as the associated metadata were downloaded from the TCGA breast cancer (TCGA-BRCA) project using the ‘TCGAbiolinks’ Bio-conductor package (39). Samples labeled as ‘healthy’ and ‘basal’ were collected and differential expression between these groups was evaluated using DESeq2 (38).
Comparisons between data-types
The log2 fold-changes of expression between cancer and control samples at the single-cell level, pseudo-bulk level and tissue level were used to validate the computed signatures describing the transcriptional changes in triple-negative breast cancer compared to healthy epithelial cells using the Spearman correlation coefficient (40).
Application of retriever to triple-negative breast cancer
The transcriptional response profiles of triple-negative breast cancer cell lines that were treated with different concentrations and treatment lengths of different compounds available in the LINCS-L1000 project were collected using the ‘ccdata’ Bioconductor package (28). The LINCS-L1000 project provides these profiles for three different TNBC cell lines (BT20, HS578T, and MDAMB231). We used these data to compute the disease-specific transcriptional response profiles.
Generation of drug combination transcriptional response profiles
All possible two-drug combinations were generated by averaging the computed disease-specific transcriptional response profiles.
Drug repurposing analysis
Using the set of overlapping genes between the LINCS-L1000 project and differential expression analysis from single-cell RNA-seq data, we performed Spearman correlation analyses with the independent disease-specific drug response profiles, as well as with the drug combination profiles. Ninety-five percent confidence intervals, p-values, and false-discovery rates (FDR) (41) were also computed. Compounds were ranked based on the computed correlation coefficients, from negative to positive.
Mechanisms of action prediction
To identify possible mechanisms of action for drugs or drug combinations, we took the disease-specific transcriptional response profiles and used these as input for GSEA, with the MSigDB Hallmarks gene set as reference to compute the enrichment of pathways as well as the directionality of the effect (42). Pathways with FDR < 0.05 were considered significant.
Cell Lines and Cell Culture
Human breast adenocarcinoma cell lines DU4475 and BT20 were purchased from the American Type Culture Collection (ATCC), while CAL120 was obtained from Leibniz-Institut DSMadeno Z-Deutsche Sammlung von Mikroorganismen und Zellkulturen GmbH. In all cases, the provider supplied an authentication certificate. To ensure the integrity of the cell lines, routine mycoplasma testing was conducted using the Lonza MycoAlertTM kit (LT07-418). None of the cell lines used in this study appeared in the database of commonly misidentified cell lines maintained by the International Cell Line Authentication Committee and NCBI Biosample. The cell lines were not passaged more than 7 times, with the highest passage used being passage 14. DU4475 cell line was cultured in RPMI-1640 media (ThermoFisher Cat:A1049101) supplemented with 10% Fetal Bovine Serum (FBS; ThermoFisher Cat:10500064) and 1% penicillin–streptomycin (ThermoFisher; 10,000 U/mL). CAL120 cell line was maintained in DMEM-High glucose media (ThermoFisher Cat:11965084) with the same supplements, while BT20 cell line was cultured in EMEM (ThermoFisher Cat:31095029) media with identical supplementation.
Cell Viability Assays
BT20 and CAL120 cells were initially seeded at a density of 4000 and 2500 cells/well, respectively, in a 96-well plate and allowed to incubate overnight. After 24 hours, the culture media was replaced with media containing predetermined concentrations of selected drugs, namely BMX/BTK Inhibitor II, QL-XII-47 (CAS 1469988-75-7, Sigma Cat:5307580001), and GSK-690693 (Sigma Cat:SML0428-5MG), and incubated for an additional 72 hours. For the suspended cell line DU4475, cells were not counted, and drugs were added from the initiation of culturing. Cell viability for the drug assay was evaluated using the ATP-based CellTiter-Glo® 3D Cell Viability Assay kit (Promega Cat:7570). The luminescence signal of each well was measured using a bottom-reading luminescent plate reader (SpectraMax M5 Multimode microplate reader). In each independent experiment, a minimum of three technical replicates were employed, and at least three independent experiments were conducted.
Results
The retriever algorithm
The retriever algorithm extracts disease-specific drug response profiles using three steps (Figure 1). The first step summarizes the cellular responses at different time points after the application of the drug, the second step summarizes the responses at different concentrations, and the third step summarizes the responses across different cell lines. This provides robust, disease-specific transcriptional response profiles based on the responses observed in all cell lines used as surrogates of a specific disease.
In the first step, retriever takes the response profiles of a given cell line to the same compound under the same concentration, but at different time points, and averages these. Then, the descriptive power of the extracted profile to represent the transcriptional response to a drug at a given concentration in the cell line, consistent across time, is evaluated using Spearman’s correlation coefficient.
The averaged profile is returned if the correlation with the computed profile is larger than a user-defined threshold (default ρ = 0.6). This threshold can take values between 0 and 1, representing the percentage of agreement between the cellular responses to the same compound. If the threshold is too close to one, very few averaged profiles will be returned due to strong filtering, and if it is defined to close to zero, very noisy profiles (similar to simply averaging all the available profiles) will be returned. The original drug response profiles that do not reach this threshold are removed. The averaged profile is then recomputed using all response profiles that reached the threshold. This procedure ensures the removal of aberrant or insufficient cellular responses. Only averaged signatures of at least two profiles are used in the second step.
In the second step, retriever takes the stable time-consistent signature profiles of a compound at a particular concentration in the same cell line. To summarize the response at different concentrations, it applies the same procedure described in the first step over the averaged profiles. The profiles returned by the second step are stable transcriptional response profiles that are consistent across time and drug concentration in a specific cell line.
The last step extracts disease-specific drug response profiles by again applying the procedure described in step 1 to the stable response profiles to the compound in all disease-specific cell lines available in the LINCS-L1000 project. The profiles returned by the third step are robust disease-specific response signatures representing transcriptional changes to a specific compound.
Single-cell RNA-seq atlas of breast samples
To showcase how retriever can be used to prioritize drugs, we applied it to single-cell data from breast cancer and healthy breast tissues. For this, we compiled 36 publicly available single-cell RNA-seq count matrices from breast samples (26 diseased and 10 healthy). In total, we combined 109, 097 cells into a single Seurat object, maintaining the sample of origin metadata. Following quality control (see Methods section), 77, 384 cells were retained for further analysis (Figure 2A), of which 30, 790 were derived from healthy (left panel in Figure 2B) and 46, 594 from cancer samples (middle panel in Figure 2B). Cell types were assigned to the nine identified clusters in the low dimensional representation (Figures 2C, S3) using markers reported by Wu et al. (34, 35).
Transcriptional signatures associated with triple-negative breast cancer at the single-cell level
To identify the transcriptional signature that best represents the changes associated with triple-negative breast cancer, we first identified the single cells with a triple-negative phenotype within the epithelial cluster of cells. To do so, we queried cells expressing ESR1, PGR, and ERBB2 receptors, which allowed us to assign a cluster of cells enriched for the expression of the three receptors (Right panel in Figure 2B); cells belonging to the cluster enriched for triple-positive cells will be further referred as “triple-positive-like” cells in the remainder of this work. While cells belonging to the other cluster will be referred to as “triple-negative-like” cells.
We compared the triple-negative-like breast cancer cells with different healthy epithelial cell subpopulations, including all epithelial cells, cells belonging to the triple-negative cluster, and cells belonging to the triple-positive cluster of epithelial cells. We then compared fold changes to those derived from the population level-data from TCGA (see Methods section). The most representative profile was the one comparing triple-negative-like epithelial cancer cells with healthy triple-positive-like epithelial cells. Note that, while approximately half of all healthy epithelial cells are assigned to the triple-positive cluster, these receptors are expressed in only 20% of these cells, and at lower levels compared to breast cancer cells (Supplementary Fig. S1).
As a result, we used MAST to perform differential expression analysis at the single cell level between triple-negative-like epithelial cancer cells (n = 2, 998) and healthy triple-positive-like epithelial cells (n = 6, 206), identifying 205 differentially expressed genes (106 upregulated and 99 downregulated) with absolute log2 fold-changes larger than 1 and false discovery rate lower than 0.05 (Figure 3A).
After testing using both the hypergeometric test through Enrichr (43) and gene set enrichment analysis (GSEA) using the Hall-mark MSigDB signatures as reference gene sets (42), we found that the differentially expressed genes were associated with the activation of Oxidative Phosphorylation Pathway, Interferon Alpha Response, as well as Myc Targets, and the downregulation of TNFα Signaling via NF-κB, Estrogen Response, UV Response Up, Apoptosis, Hypoxia, Unfolded Protein Response, IL-6/JAK/STAT3 Signaling, Inflammatory Response, and the Androgen Response pathway (Figure 3C, Supplementary Table S1). Many of these pathways are known to be highly associated with TNBC molecular phenotype. For example, activation of both the oxidative phosphorylation pathway and of Myc targets is associated with a worse outcome of the disease (44, 45), and the crosstalk between the interferon alpha response pathway and NF-κB is associated with drug resistance and tumor progression in this malignancy (46). The downregulation of the estrogen response pathway as well as apoptosis and hypoxia are markers of TNBC (47). In addition, downregulation of androgen and inflammatory responses is associated with worse prognosis and higher chemotherapy responsiveness respectively (48, 49).
We found 6, 149 genes expressed in both the single-cell RNA-seq datasets and the bulk RNA-seq from the TCGA-BRCA project. Spearman correlation coefficients between the expression changes computed at the single-cell level (Left panel in Figure 3A), pseudo-bulk level (Middle panel in Figure 3A) and tissue level (Right panel in Figure 3A), revealed a monotonic positive association between these different levels (Figure 3B), supporting the descriptive power of the computed differential single-cell expression signature to describe transcriptional changes associated with TNBC, and ensuring that the selected cell type as well as the pseudo-bulk samples are high-resolution descriptors of the diseased tissues.
Applying retriever to extract TNBC-specific transcriptional drug-response signatures
We collected 4, 899 response profiles measured in TNBC cell lines from the LINCS-L1000 project using the ‘ccdata’ package available in Bioconductor (28). These profiles correspond to the expression change of 1000 genes in response to 205 compounds on average at four different concentrations (0.08μM, 0.4μM, 2μM, and 10μM), and at two time points (6 and 24 hours), in three different TNBC cell lines (BT20, HS578T, and MDAMB231).
An illustration of the step-by-step process of constructing a generalized response profile to a compound across three TNBC cell lines is presented in Figure 4 using the QL-XII-47 compound as a case example.
1. Computing time-consistent generalized response profiles
In the first step (Figure 4A), we computed time-consistent generalized response profiles. To achieve this, we took the response profile of the MDAMB231 cell line to QL-XII-47 under the same concentration at two different time points (6 and 24 hours) and averaged them. When we compared the transcriptomes of the cell line exposed at two different time points, we found little or no correlation (large boxes labeled in gray) for each concentration (0.08μ M, 0.4μ M,2μ M, and 10μ M). However, we found that their averaged profile displays high predictive power (Spearman Correlation Coefficient in all cases) of the cellular response to the compound, independent of the time of treat ment. When present, we removed the profiles that did not correlate with the averaged profile above 0.6 (A total of 19.72% of the profiles). This procedure allows to identify concentrations in which the compounds induces a different, aberrant or insufficient cellular responses to the drug and also allows us to compute a single robust response profile to the compound.
2. Computing time- and concentration-consistent response profiles
In the second step (Figure 4B), we extract time and concentration-consistent responses. For this purpose, we averaged the time-consistent profiles computed in the first step. As before, we removed the profiles that did not correlate with the averaged profile above 0.6, such as the profile present in the first panel of Figure 4B, which showed a different cellular response at the concentration (A total of 16.91% of the profiles computed in Step 1). We then recomputed the profile for the compound in the MDAMB231 cell line with all profiles that had correlation coefficients above the threshold.
3. Generating disease-specific drug response profiles
In the last step (Figure 4C), we processed the outputs of the second step to extract a generalized response profile to a compound across the three TNBC cell lines. To do this, we performed independently steps 1 and 2 for QL-XII-47 in the other two TNBC cell lines available in the LINCS-L1000 project (BT20, HS578T) and averaged these profiles.
After summarizing the time points, concentrations, and cell lines for all the 205 compounds tested in TNBC cell lines, we ended with robust transcriptional response profiles of 152 (74.14%) compounds (Supplementary Table 1) that are specific for TNBC.
The remaining 53 compounds were removed, as they did not show a stable response across time points, concentrations, and cell lines.
Having a single generalized disease-specific response profile for each compound allowed us to (1) generate compound combinations in-silico as candidates for the treatment of TNBC (Supplementary Table 2), (2) to rank compounds and compound combinations based on their predicted potential to reverse the disease state towards a healthy-like state (Supplementary Table 3 and 4), and to (3) interrogate the computed profiles using rank-based gene set enrichment analysis (GSEA) to predict possible mechanisms of action of the compounds when used to treat TNBC.
Our method to extract disease-specific drug response profiles is implemented in the ‘retriever’ R package (see Data Availability section) and allows the computation of disease-specific transcriptional drug response signatures for different cancer types available in the LINCS-L1000 project.
Ranking compound candidates for the treatment of TNBC
We used the 152 TNBC-specific transcriptional response profiles extracted with retriever to compute 11, 476 response profiles to nonredundant drug combinations. To do so, we calculated the averaged effect sizes of Spearman correlation between the response profiles and the differential single-cell transcriptional signatures, and then ranked the compounds and compound combinations likely to specifically antagonize the TNBC-specific transcriptional signatures. Since Spearman correlation is a rank-based approach, this method can quantitatively identify inverse effects induced by drugs using the response profiles and disease-associated transcriptional signatures. Thus, we computed the Spearman coefficients, 95% confidence intervals, and p-values corrected for multiple testing (FDR) for all 11, 628 profiles (Supplementary Table 3 and 4). We then ranked the drug combinations by their correlation coefficients, from negative to positive. A negative value represents the expected inverse effect of the drug against the expression changes observed in the disease towards a healthy-like state.
We identified QL-XII-47 as the most promising compound to reverse the transcriptional profile of TNBC back to a healthy-like state, followed by Torin-2, Torin-1, QL-X-138, and WYE-125132. QL-XII-47 is a highly effective and selective Bruton tyrosine kinase (BTK) inhibitor that covalently modifies Cys481 of the protein. QL-XII-47 has an IC50 of 7 nM for inhibiting BTK kinase activity and induces a G1 cell cycle arrest in Ramos cells (B lymphocytes from Burkitt lymphoma), which is associated with significant degradation of the BTK protein. It was also shown that, at sub micromolar concentrations, QL-XII-47 inhibits the proliferation of B-cell lymphoma cell lines (50). Furthermore, other BTK inhibitors have shown to reduce TNBC cell viability (51). In addition, independent validation of sensitivity to QL-XII-47 has been tested before in two other TNBC cell lines (HC70 and HCC1806) that were not included in the LINCS-L1000 project. Low toxicity (lower than Torin-2) and more than 50% reduction of the growth rate (GRmax) was found for both cell lines (52).
Through enrichment analysis, we found that QL-XII-47 may act in TNBC through activation of the TNF-alpha Signaling via NF-κB, Hypoxia-associated genes, the Inflammatory Response, the IL-2/STAT5 Signaling pathway, and by deactivating genes involved in Fatty Acid Metabolism (FDR < 0.05 in all cases, Fig 5A).
Ranking compound combination candidates for the treatment of TNBC
After testing the combinations of the disease-specific drug response profiles, we found that a combination of QL-XII-47 and GSK-690693—a pan-AKT kinase inhibitor that reduces tumor cell proliferation and induces tumor cell apoptosis (53, 54)—was the best performing drug combination to revert TNBC signatures Gene set enrichment analysis predicted that this combination may act through the activation of TNF-alpha signaling via NF-κB and Myogenesis, and the deactivation of Fatty Acid Metabolism, mTORC1 Signaling, E2F targets, and the Unfolded Protein Response (FDR < 0.05 in all cases). These pathways have previously been associated with the inhibition of triple-negative breast cancer growth and metastasis (55–57), and thus highlight the potential of our computational approach to prioritize drugs by integrating cell-type specific profiles obtained from single-cell RNA-seq data sets with disease-specific transcriptional drug response profiles.
Moreover, we performed experimental validation of the identified top combination in three different breast cancer cell lines (CAL120, BT20, and DU4475). We compared the cellular viability of the cells under baseline conditions and after treatment with 0.6 μM QL-XII-47, 0.8 μM GSK-690693, and the combination of both compounds under the same concentration. We found that in all cases, as predicted by retriever, QL-XII-47 (top hit) displayed the strongest effect compared to GSK-690693. Additionally, we found that both compounds displayed an additive effect in two (CAL120 and DU4475) of the three cell lines tested supporting retriever’s potential to identify useful drug combinations from cellular transcriptional responses to individual compounds.
Signatures impacted by QL-XII-47 and GSK-690693 in individual patients
Finally, we evaluated how combination treatment with QL-XII-47 and GSK-690693 might impact transcriptional profiles of triple-negative cancer cells at the individual patient level. We found that, for most patients, the signatures expected to be impacted by the drug combination were inversely correlated between healthy samples and cancer samples (Figure 6). For example, TNF-alpha signaling via NF-κB shows an inverse association in 84% (22 of 26) of patients, while Hypoxia shows an inverse association in 73% (19 of 26) patients. Note that, while we focused on triple-negative cells in this study, the breast cancer atlas we analyzed also included samples from receptor-positive cancers. However, we specifically applied retriever to triple-negative cells from these samples, aiming to identify drug combinations that would target the tumor’s most difficult-to-treat cells. Interestingly, signatures in triple-negative cells from all of the included non-TNBC samples (8 of 8) showed strong inverse correlations to those from healthy samples. For the TNBC samples, 7 of 18 had strong inverse associations with the healthy sample cluster. This indicates that, when applied to single-cell data derived from a specific tumor type, the candidate drug combinations predicted by retriever are likely to benefit a relatively large number of patients.
Discussion
Single-cell RNA-seq provides an unprecedented resolution to characterize cellular heterogeneity in cancer. Compared with cell lines, the cells from tumors characterized by single-cell RNA-seq do not exhibit metabolic adaptation due to culturing. In advantage to bulk RNA-seq, single-cell RNA-seq can provide a more accurate signature of the changes observed in specific cells relevant to the disease under study(14, 58). Thus, leveraging the information provided by this new technology is important to accelerate the identification of new personalized treatments that target specific subpopulations of cells in a tissue (59).
However, just as important as characterizing the signatures of the disease in affected cells at the highest possible resolution is to characterize the response profiles of drugs that may help to reverse disease-associated transcriptional changes towards a healthy state. Large consortia including the LINCS-L1000 project provide transcriptional phenotypes at different time points after applying hundreds of compounds to multiple types of cell lines at different concentrations. Even though this information is valuable, its direct application to pathophysiological scenarios is limited due to the difficulty to extract robust response profile in these large data sets (60). Here, we introduced retriever, a tool to extract robust disease-specific drug response profiles from the LINCS-L1000 project. retriever mines the information associated to each cell line in the project to select subsets of cell lines that are surrogates of a defined disease, after which it extracts disease-specific response profiles. By doing this, retriever maximizes the robustness of the transcriptional signatures that are used to rank candidate compounds for the identification of new treatment strategies. With retriever, we hypothesize that if a signature is robust across multiple cell lines representing the same disease, it will likely also be more robust across individuals, as cell lines exhibit genetic and metabolic differences.
We showcased retriever’s strength by retrieving time-, dose-, and cell-type consistent transcriptional drug response signatures of TNBC cells from the LINCS-L1000 project and combining it with single-cell RNA transcriptome profiles obtained from a large single-cell breast cancer atlas, which we used to predict novel drug combinations against TNBC. This identified a combination of two kinase inhibitors predicted to be effective through together acting on important biological pathways in breast cancer, and thereby antagonize the TNBC-specific signature.
The prediction of drug combinations with retriever that we present here, although it is relatively simple and relies on the hypothesis of independent mechanisms of action for each compound, has been established as accurate (predicting the right directionality of the change) in analyses performed on profiles from the Connectivity Map project (28). Nevertheless, our computational approach only allows us to rank compounds and compounds combinations that are suitable for the development of treatments based on the negative correlation of the drug response profiles and the disease signatures. The topmost promising compounds and drug combinations still need to be validated experimentally to define the right concentration and evaluate if they indeed exhibit a synergistic mechanism of action. Their toxicity across cell lines and, after preclinical tests, potential adverse events in in-vivo models and cancer patients also need to be evaluated.
Thanks to the multiple cell lines available in the LINCS-L1000 project, our approach can be replicated in at least 13 other cancer types, including: adult acute monocytic leukemia, adult acute myeloid leukemia, cecum adenocarcinoma, colon (adeno)carcinoma, endometrial adenocarcinoma, lung adeno-carcinoma, large cell lung carcinoma, small cell lung carcinoma, melanoma, ovarian mucinous adenocarcinoma, prostate carcinoma, and triple-negative breast cancer among other variants (such as local or metastatic tumors for some cancer types), as soon as single-cell RNA-seq data for those cancer types and healthy tissues become available. Considering the increase in single-cell RNA-seq studies being published since the development of the technique, we expect this to be possible for most cancer types represented in the LINCS-L1000 project in the near future (61).
Finally, our approach can also be applied to disease profiles derived from a single patient to recommend personalized treatment strategies. We envision this to be one of the potentially most impactful applications of retriever in computationally-informed precision medicine once single-cell transcriptional profiles become a standard in the clinic. In addition, while we have show-cased retriever’s strength in predicting drug combinations for TNBC in general, the method can be applied to single-cell RNA-seq data derived from individual tumors, opening up the road for applications in precision medicine.
Data Availability
All the data and code required to replicate the analysis as well as the figures and tables are available at https://github.com/dosorio/L1000-TNBC. The retriever package is available at https://github.com/kuijjerlab/retriever. The following datasets were used to construct the single-cell RNA-seq breast atlas used in this study:
Tabula sapiens wild-type mammary gland. Data from (62) Quake, Stephen R., and Tabula Sapiens Consortium. “The Tabula Sapiens: a single cell transcriptomic atlas of multiple organs from individual human donors.” bioRxiv (2021). Accessed through: https://tabula-sapiensportal.ds.czbiohub.org/
Wild-type data from seven individuals reported by (63) Nguyen, Quy H., et al. “Profiling human breast epithelial cells using single cell RNA sequencing identifies cell diversity.” Nature communications 9.1 (2018): 1-12. GEO accession code: GSE113197.
Wild-type data from five individuals reported by (64) Bhat-Nakshatri, Poornima, et al. “A single-cell atlas of the healthy breast tissues reveals clinically relevant clusters of breast epithelial cells.” Cell Reports Medicine 2.3 (2021): 10021955. GEO accession code: GSE164898.
Triple-negative breast cancer data from 9 individuals from (34) Wu SZ, Al-Eryani G, Roden DL, Junankar S et al. A single-cell and spatially resolved atlas of human breast cancers. Nat Genet 2021 Sep;53(9):1334-134756. GEO accession code: GSE176078
Breast cancer data from 17 individuals from the Array-Express database accession code: E-MTAB-8107. Described in (65) Qian, Junbin, et al. “A pan-cancer blueprint of the heterogeneous tumor microenvironment revealed by single-cell profiling.” Cell research 30.9 (2020): 745-76257.
Supplementary Material
Supplementary Figures
Supplementary Tables
Table S2: (CSV File) Computed robust disease-specific response profiles to 152 compounds across three triple-negative breast cancer cell lines (TNBC)
Table S3: (CSV File) Computed 11, 476 compound combinations response profiles across three triple-negative breast cancer cell lines (TNBC)
Table S4: (CSV File) Ranked compounds based on their predicted potential to reverse the disease state (TNBC) towards a healthy state
Table S5: (CSV File) Ranked compound combinations based on their predicted potential to reverse the disease state (TNBC) towards a healthy state
Acknowledgments
This work was supported by the Norwegian Research Council, Helse Sør-Øst, and University of Oslo through the Centre for Molecular Medicine Norway (187615, M.L.K and D.O), the Norwegian Cancer Society (214871, M.L.K.), the Norwegian Research Council (313932, M.L.K), and the Marie Skłodowska-Curie Postdoctoral Scientia Fellows program from the Faculty of Medicine at the University of Oslo, co-funded by the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie Actions Grant (801133, D.O and P.S).
Footnotes
Experimental validation added