Drug combination prediction for cancer treatment using disease-speciﬁc drug response proﬁles and single-cell transcriptional signatures.

Developing novel cancer treatments is a challenging task that can beneﬁt from computational techniques matching transcriptional signatures to large-scale drug response data. Here, we present ‘ retriever ,’ a tool that extracts robust disease-speciﬁc transcriptional drug response proﬁles based on cellular response proﬁles to hundreds of compounds from the LINCS-L1000 project. We used retriever to extract transcriptional drug response signatures of triple-negative breast cancer (TNBC) cell lines and combined these with a single-cell RNA-seq breast cancer atlas to predict drug combinations that antagonize TNBC-speciﬁc disease signatures. After systematically testing 152 drug response proﬁles and 11,476 drug combinations, we identiﬁed the combination of kinase inhibitors QL-XII-47 and GSK-690693 as the topmost promising candidate for TNBC treatment. Our new computational approach allows the identiﬁcation of drugs and drug combinations targeting speciﬁc tumor cell types and subpopulations in individual patients. It is, therefore, highly suitable for the development of new personalized cancer treatment strategies.


Introduction
Developing new drugs and pharmacological regimens to treat complex diseases such as cancer is an expensive and challenging task (1).Several computational techniques that use structural interactions (2), transcriptional signatures (3)(4)(5), biological networks perturbations (6,7), and data mining (8) are available to aid in the discovery of new treatment regimens which could improve patient management (9).Methods based on transcriptional signatures use the observed changes in gene expression profiles between samples from patients affected with the disease under study compared to samples from control subjects, and match these to the response profiles of cell lines that act as surrogates for the disease to different compounds.These data can be combined to identify compounds that may antagonize the disease's transcriptional changes, returning it to a more healthy-like state (5).Finding specific transcriptional response profiles to drug candidates, as well as robust disease-associated transcriptional alterations, is therefore crucial for such approaches to produce reliable predictions.
To obtain precise disease-specific transcriptional signatures that can be targeted by pharmacological compounds, one needs to account for transcriptional variability between patients by including samples from multiple individuals (10).This allows for the identification of a target set of genes that is constitu-tively expressed with the same magnitude and pattern across patients/cells.Targeting such a consistently expressed set of genes is thought to increase the effectiveness and safety of the treatment (11).Transcriptional signatures are measurable at a variety of resolutions, ranging from low resolution, as in the complex mixture of cells present within tissues, to high resolution, such as single-cells (12).The Cancer Genome Atlas (TCGA) project provides RNA-sequencing (RNA-seq) data for tumors and adjacent tissues (13).While such tissue-level data have been useful in the identification of cancer subtypes and prognostic signatures, the identification of accurate transcriptional signatures between disease and healthy states has been hindered by the wide heterogeneity of cell types found in tumor tissues (14).This issue can be overcome by using single-cell RNA-seq data (15).A drawback of single-cell RNA-seq experiments is that, due to the higher cost of profiling of single cells, often fewer biological replicates are available.However, by combining multiple experiments, sample sizes of cells with a desired molecular phenotype can be increased, improving characterization of the transcriptional changes observed in disease (16).In addition, the construction of pseudo-bulk profiles (the sum of the expression values of a gene across all cells obtained from the same individual) from single-cell data makes it possible to account for gene expression variability among patients (17,18).
The LINCS-L1000 project published transcriptional profiles of several cell lines, treated with hundreds of compounds at various concentrations and time points of drug exposure (4).These profiles have previously been used to identify potential drugs that can be repurposed to treat a variety of diseases, including cancer (3).Additionally, the project provides an interactive portal in which users can interrogate whether an up-or down-regulated gene set of interest overlaps with transcriptional drug response signatures.The portal then returns a ranked list of compounds that are likely to have an inverse effect on disease-associated gene expression levels (19).However, these predictions are not based on robust tissue-or disease-specific transcriptional profiles and may therefore over-or underestimate the potential effect of drugs on specific diseases.Additionally, the LINCS-L1000 web portal returns an exhaustive list of independent experiments with matching inverse patterns, rendering it difficult to identify compounds that induce stable transcriptional responses in cell lines derived from the same disease across multiple time points and drug concentrations.Thus, being able to extract robust disease-specific transcriptional drug response signatures that are consistent at different time points, drug concentration, or cell line from the profiles provided by the LINCS-L1000 project would significantly improve drug prioritization and accelerate the identification of new pharmacological options for personalized treatment of cancer (20).
Once both the disease profile and the disease-specific response to multiple compounds are available, rank-based correlation analysis can be used to quantitatively identify compounds that can revert the transcriptional changes that distinguish diseased samples from healthy ones (5,9).Nonetheless, monotherapy in cancer is highly susceptible to the development of resistance following an initial response to treatment (21).Combination therapy, or the simultaneous administration of multiple drugs to treat a disease, has evolved into the standard pharmacological regimen for treating complex diseases such as cancer.Combination therapy prevent tumor evolution and help inhibit the development of drug resistance in cancer, thereby improving patient survival (22).
The in silico prediction of responses to drug combinations is an active topic of research in computational biology (23)(24)(25).Recently, Pickering (2021) found that the transcriptional response signatures to 856, 086 unique two-drug combinations could be predicted based on the 1, 309 drug response profiles to compounds tested by the Connectivity Map Build 2 project (parent of the LINCS-L1000 project) (26).After analyzing 148 independent studies involving 257 treatment combinations from the Gene Expression Omnibus (GEO) database, it was discovered that averaging the expression profiles of individual treatments provides 78.96 percent accuracy in predicting the direction of differential expression for the combined treatment (27).However, different from the Connectivity Map projectwhere the generation of combinatorial response profiles is possible thanks to the availability of single response profiles for each individual compound (28)-the LINCS-L1000 project does not provide robust single drug response profiles, making the generation of meaningful combinatorial profiles not yet possible.
Here, we present retriever, a tool that uses correlation analysis and hierarchical collapsing to extract robust disease-specific transcriptional drug response signature profiles that are consistent across time, concentration, and cell line, from data provided by the LINCS-L1000 project.We integrated these transcriptional drug response profiles with a single-cell RNA-seq signature representing the transcriptional changes observed in triple-negative breast cancer cells compared to healthy breast epithelial cells.We used these two signatures-the transcriptional changes induced by the disease and the cellular responses to drugs-to prioritize drugs and to predict novel drug combinations suitable for the treatment of triple-negative breast cancer.The transcriptional changes associated with TNBC were computed from a single-cell breast cancer RNA-seq atlas built by us after compiling singlecell RNA-seq data obtained from 36 publicly available healthy breast and breast cancer samples.After systematically testing drug response profiles and drug combinations, we identified the combination of kinase inhibitors QL-XII-47 and GSK-690693 as the topmost promising treatment for TNBC.In addition to recommending drug combinations, the profiles returned by retriever allow for the characterization of possible mechanisms of action of the identified compound(s) to reverse the disease's transcriptional profile towards a healthy-like state.

Methods
Single-cell RNA-seq datasets.We collected publicly available single-cell RNA-seq count matrices for healthy breast tissues and breast cancer samples from multiple sources (see Data Availability).Each sample's gene identifiers (IDs) were translated into current gene symbols using the dictionary of gene ID synonyms provided by ENSEMBL (29).Datasets were loaded into R and integrated into a single 'Seurat' object (30).Data were then subjected to quality control keeping only cells with a library size of at least 1, 000 counts and within the 95 percent confidence interval of the prediction of the mitochondrial content ratio and detected genes in proportion to the cell's library size.We also removed all cells that had mitochondrial proportions greater than 10% (31).We then normalized, scaled, and reduced the dimensionality of the data through Principal Component Analysis (PCA) using the default functions and parameters included in Seurat.Data integration was performed using Harmony (32).UMAP projections of the data were extracted using the top 20 dimensions returned by Harmony (33).Cell clustering was performed using the functions included in Seurat for this purpose, with a resolution of 0.01, using UMAP embeddings as the source for the nearest neighbor network construction.
Cell type assignation.Assignation of cell types was performed based on the expression of the marker genes reported by Wu et al. (34,35).Epithelial cell subtypes were assigned using the Nebulosa package by querying cells expressing ESR1, PGR, and ERBB2 receptors (36).
Single-cell RNA-seq differential expression analysis.Transcriptional signatures of triple-negative breast cancer epithelial cells compared to healthy epithelial cells were quantified by differential expression using MAST (37).We compared (Supplementary Fig. S1 and Fig. S2) all epithelial cells from cancer patients (5, 730 cells, 31, 73%) against those from healthy donors (12, 323 cells, 68.26%).In addition, we compared the cluster of patient-derived epithelial cells depleted of hormone and growth factor receptor expression (enrichment for ESR1 − , PGR − , and ERBB2 − cells-2, 998 cells, 16.6%) against all healthy epithelial cells (12, 323 cells, 68.3%), against the cluster of epithelial cells depleted of hormone and growth factor expression (6117 cells, 33.9%) derived from healthy individuals, and against the cluster of epithelial cells enriched for triple-positive cells (ESR1 + , PGR + , and ERBB2 + ) from healthy individuals (6, 206 cells, 34.4%).We selected the comparison of the cluster enriched with triple-negative cancer cells with the cluster enriched for triplepositive healthy epithelial cells as the disease-specific transcriptional profile, as this profile showed the highest Spearman Correlation Coefficient (ρ) with tissue-level fold-changes computed on TCGA breast cancer data.
Pseudo-bulk analyses.Pseudo-bulk profiles were computed by summing up all expression values for the same gene across cells of the same individual.We applied this procedure to the triple-negative epithelial cells derived from patients against the triple-positive epithelial cells from healthy individuals.The computed pseudo-bulk profiles from diseased and healthy individuals were then compared using DESeq2 (38).
TCGA breast cancer dataset analysis.Upper-quantile FPKM-transformed RNA-seq data at the tissue level as well as the associated metadata were downloaded from the TCGA breast cancer (TCGA-BRCA) project using the 'TCGAbiolinks' Bioconductor package (39).Samples labeled as 'healthy' and 'basal' were collected and differential expression between these groups was evaluated using DESeq2 (38).
Comparisons between data-types.The log 2 fold-changes of expression between cancer and control samples at the single-cell level, pseudo-bulk level and tissue level were used to validate the computed signatures describing the transcriptional changes in triple-negative breast cancer compared to healthy epithelial cells using the Spearman correlation coefficient (40).

Application of retriever to triple-negative breast cancer.
The transcriptional response profiles of triple-negative breast cancer cell lines that were treated with different concentrations and treatment lengths of different compounds available in the LINCS-L1000 project were collected using the 'ccdata' Bioconductor package (28).The LINCS-L1000 project provides these profiles for three different TNBC cell lines (BT20, HS578T, and MDAMB231).We used these data to compute the diseasespecific transcriptional response profiles.

Generation of drug combination transcriptional response
profiles.All possible two-drug combinations were generated by averaging the computed disease-specific transcriptional response profiles.
Drug repurposing analysis.Using the set of overlapping genes between the LINCS-L1000 project and differential expression analysis from single-cell RNA-seq data, we performed Spearman correlation analyses with the independent diseasespecific drug response profiles, as well as with the drug combination profiles.Ninety-five percent confidence intervals, p-values, and false-discovery rates (FDR) (41) were also computed.Compounds were ranked based on the computed correlation coefficients, from negative to positive.

Mechanisms of action prediction.
To identify possible mechanisms of action for drugs or drug combinations, we took the disease-specific transcriptional response profiles and used these as input for GSEA, with the MSigDB Hallmarks gene set as reference to compute the enrichment of pathways as well as the directionality of the effect (42).Pathways with FDR < 0.05 were considered significant.

Results
The retriever algorithm.The retriever algorithm extracts disease-specific drug response profiles using three steps (Figure 1).The first step summarizes the cellular responses at different time points after the application of the drug, the second step summarizes the responses at different concentrations, and the third step summarizes the responses across different cell lines.This provides robust, disease-specific transcriptional response profiles based on the responses observed in all cell lines used as surrogates of a specific disease.
In the first step, retriever takes the response profiles of a given cell line to the same compound under the same concentration, but at different time points, and averages these.Then, the descriptive power of the extracted profile to represent the transcriptional response to a drug at a given concentration in the cell line, consistent across time, is evaluated using Spearman's correlation coefficient.
The averaged profile is returned if the correlation with the computed profile is larger than a user-defined threshold (default ρ = 0.6).This threshold can take values between 0 and 1, representing the percentage of agreement between the cellular responses to the same compound.If the threshold is too close to one, very few averaged profiles will be returned due to strong filtering, and if it is defined to close to zero, very noisy profiles (similar to simply averaging all the available profiles) will be returned.The original drug response profiles that do not reach this threshold are removed.The averaged profile is then recomputed using all response profiles that reached the threshold.This procedure ensures the removal of aberrant or insufficient cellular responses.Only averaged signatures of at least two profiles are used in the second step.
In the second step, retriever takes the stable time-consistent signature profiles of a compound at a particular concentration in the same cell line.To summarize the response at different concentrations, it applies the same procedure described in the first step over the averaged profiles.The profiles returned by the second step are stable transcriptional response profiles that are consistent across time and drug concentration in a specific cell line.
The last step extracts disease-specific drug response profiles by again applying the procedure described in step 1 to the stable response profiles to the compound in all disease-specific cell lines available in the LINCS-L1000 project.The profiles returned by the third step are robust disease-specific response signatures representing transcriptional changes to a specific compound.
Single-cell RNA-seq atlas of breast samples.To showcase how retriever can be used to prioritize drugs, we applied it to single-cell data from breast cancer and healthy breast tissues.
For this, we compiled 36 publicly available single-cell RNA-seq count matrices from breast samples (26 diseased and 10 healthy).
In total, we combined 109, 097 cells into a single Seurat object, maintaining the sample of origin metadata.Following quality control (see Methods section), 77, 384 cells were retained for further analysis (Figure 2A), of which 30, 790 were derived from healthy (left panel in Figure 2B) and 46, 594 from cancer samples (middle panel in Figure 2B).Cell types were assigned to the nine identified clusters in the low dimensional representation (Figures 2C, S3) using markers reported by Wu et al. (34,35).
Transcriptional signatures associated with triple-negative breast cancer at the single-cell level.To identify the transcriptional signature that best represents the changes associated with triple-negative breast cancer, we first identified the single cells with a triple-negative phenotype within the epithelial cluster of cells.To do so, we queried cells expressing ESR1, PGR, and ERBB2 receptors, which allowed us to assign a cluster of cells enriched for the expression of the three receptors (Right panel in Figure 2B); cells belonging to the cluster enriched for triplepositive cells will be further referred as "triple-positive-like" cells in the remainder of this work.While cells belonging to the other cluster will be referred to as "triple-negative-like" cells.
We compared the triple-negative-like breast cancer cells with different healthy epithelial cell subpopulations, including all epithelial cells, cells belonging to the triple-negative cluster, and cells belonging to the triple-positive cluster of epithelial cells.
We then compared fold changes to those derived from the population level-data from TCGA (see Methods section).The most representative profile was the one comparing triple-negative-like epithelial cancer cells with healthy triple-positive-like epithelial cells.Note that, while approximately half of all healthy epithelial cells are assigned to the triple-positive cluster, these receptors are expressed in only 20% of these cells, and at lower levels compared to breast cancer cells (Supplementary Fig. S1).
As a result, we used MAST to perform differential expression analysis at the single cell level between triple-negative-like epithelial cancer cells (n = 2, 998) and healthy triple-positivelike epithelial cells (n = 6, 206), identifying 205 differentially expressed genes (106 upregulated and 99 downregulated) with absolute log 2 fold-changes larger than 1 and false discovery rate lower than 0.05 (Figure 3A).
After testing using both the hypergeometric test through Enrichr (43) and gene set enrichment analysis (GSEA) using the Hallmark MSigDB signatures as reference gene sets (42), we found that the differentially expressed genes were associated with the activation of Oxidative Phosphorylation Pathway, Interferon Alpha Response, as well as Myc Targets, and the downregulation of TNFα Signaling via NF-κB, Estrogen Response, UV Response Up, Apoptosis, Hypoxia, Unfolded Protein Response, IL-6/JAK/STAT3 Signaling, Inflammatory Response, and the Androgen Response pathway (Figure 3C, Supplementary Table S1).Many of these pathways are known to be highly associated with TNBC molecular phenotype.For example, activation of both the oxidative phosphorylation pathway and of Myc targets is associated with a worse outcome of the disease (44,45), and the crosstalk between the interferon alpha response pathway and NF-κB is associated with drug resistance and tumor progression in this malignancy (46).The downregulation of the estrogen response pathway as well as apoptosis and hypoxia are markers of TNBC (47).In addition, downregulation of androgen and inflammatory responses is associated with worse prognosis and higher chemotherapy responsiveness respectively (48,49).
We found 6, 149 genes expressed in both the single-cell RNAseq datasets and the bulk RNA-seq from the TCGA-BRCA project.Spearman correlation coefficients between the expression changes computed at the single-cell level (Left panel in Figure 3A), pseudo-bulk level (Middle panel in Figure 3A) and tissue level (Right panel in Figure 3A), revealed a monotonic positive association between these different levels (Figure 3B), supporting the descriptive power of the computed differential single-cell expression signature to describe transcriptional changes associated with TNBC, and ensuring that the selected cell type as well as the pseudo-bulk samples are high-resolution descriptors of the diseased tissues.
Applying retriever to extract TNBC-specific transcriptional drug-response signatures.We collected 4, 899 response profiles measured in TNBC cell lines from the LINCS-L1000 project using the 'ccdata' package available in Bioconductor (28).These profiles correspond to the expression change of 1000 genes in response to 205 compounds on average at four different concentrations (0.08µM, 0.4µM, 2µM, and 10µM), and at two time points (6 and 24 hours), in three different TNBC cell lines (BT20, HS578T, and MDAMB231).
An illustration of the step-by-step process of constructing a generalized response profile to a compound across three TNBC cell lines is presented in Figure 4 using the QL-XII-47 compound as a case example.

Computing time-consistent generalized response profiles:
In the first step (Figure 4A), we computed time-consistent generalized response profiles.To achieve this, we took the response profile of the MDAMB231 cell line to QL-XII-47 under the same concentration at two different time points (6 and 24 hours) and averaged them.When we compared the transcriptomes of the cell line exposed at two different time points, we found little or no correlation (large boxes labeled in gray) for each concentration (0.08µ M, 0.4µ M, 2µ M, and 10µ M).However, we found that their averaged profile displays high predictive power (Spearman Correlation Coefficient (ρ) > 0.65 in all cases) of the cellular response to the compound, independent of the time of treatment.When present, we removed the profiles that did not correlate with the averaged profile above 0.6 (A total of 19.72% of the profiles).This procedure allows to identify concentrations in which the compounds induces a different, aberrant or insufficient cellular responses to the drug and also allows us to compute a single robust response profile to the compound.

Computing time-and concentration-consistent response profiles:
In the second step (Figure 4B), we extract time and concentration-consistent responses.For this purpose, we averaged the time-consistent profiles computed in the first step.As before, we removed the profiles that did not correlate with the averaged profile above 0.6, such as the profile present in the first panel of Figure 4B, which showed a different cellular response at the concentration (A total of 16.91% of the profiles computed in Step 1).
We then recomputed the profile for the compound in the MDAMB231 cell line with all profiles that had correlation coefficients above the threshold.
3. Generating disease-specific drug response profiles: In the last step (Figure 4C), we processed the outputs of the second step to extract a generalized response profile to a compound across the three TNBC cell lines.To do this, we performed independently steps 1 and 2 for QL-XII-47 in the other two TNBC cell lines available in the LINCS-L1000 project (BT20, HS578T) and averaged these profiles.
After summarizing the time points, concentrations, and cell lines for all the 205 compounds tested in TNBC cell lines, we ended with robust transcriptional response profiles of 152 (74.14%) compounds (Supplementary Table 1) that are specific for TNBC.The remaining 53 compounds were removed, as they did not show a stable response across time points, concentrations, and cell lines.
Having a single generalized disease-specific response profile for each compound allowed us to (1) generate compound combinations in-silico as candidates for the treatment of TNBC (Supplementary Table 2), ( 2) to rank compounds and compound combinations based on their predicted potential to reverse the disease state towards a healthy-like state (Supplementary Table 3 and 4), and to (3) interrogate the computed profiles using rank-based gene set enrichment analysis (GSEA) to predict possible mechanisms of action of the compounds when used to treat TNBC.
Our method to extract disease-specific drug response profiles is implemented in the 'retriever' R package (see Data Availability section) and allows the computation of disease-specific transcriptional drug response signatures for different cancer types available in the LINCS-L1000 project.

Ranking compound candidates for the treatment of TNBC.
We used the 152 TNBC-specific transcriptional response profiles extracted with retriever to compute 11, 476 response profiles to nonredundant drug combinations.To do so, we calculated the averaged effect sizes of Spearman correlation between the response profiles and the differential single-cell transcriptional signatures, and then ranked the compounds and compound combinations likely to specifically antagonize the TNBC-specific transcriptional signatures.Since Spearman correlation is a rank-based approach, this method can quantitatively identify inverse effects induced by drugs using the response profiles and disease-associated transcriptional signatures.Thus, we computed the Spearman coefficients, 95% confidence intervals, and p-values corrected for multiple testing (FDR) for all 11, 628 profiles (Supplementary Table 3 and 4).We then ranked the drug combinations by their correlation coefficients, from negative to positive.A negative value represents the expected inverse effect of the drug against the expression changes observed in the disease towards a healthy-like state.
We identified QL-XII-47 as the most promising compound to reverse the transcriptional profile of TNBC back to a healthy-like state, followed by Torin-2, Torin-1, QL-X-138, and WYE-125132.QL-XII-47 is a highly effective and selective Bruton tyrosine kinase (BTK) inhibitor that covalently modifies Cys481 of the protein.QL-XII-47 has an IC 50 of 7 nM for inhibiting BTK kinase activity and induces a G1 cell cycle arrest in Ramos cells (B lymphocytes from Burkitt lymphoma), which is associated with significant degradation of the BTK protein.
It was also shown that, at sub micromolar concentrations, QL-  XII-47 inhibits the proliferation of B-cell lymphoma cell lines (50).Furthermore, other BTK inhibitors have shown to reduce TNBC cell viability (51).In addition, independent validation of sensitivity to QL-XII-47 has been tested before in two other TNBC cell lines (HC70 and HCC1806) that were not included in the LINCS-L1000 project.Low toxicity (lower than Torin-2) and more than 50% reduction of the growth rate (GR max ) was found for both cell lines (52).
Through enrichment analysis, we found that QL-XII-47 may act in TNBC through activation of the TNF-alpha Signaling via NF-κB, Hypoxia-associated genes, the Inflammatory Response, the IL-2/STAT5 Signaling pathway, and by deactivating genes Ranking compound combination candidates for the treatment of TNBC.After testing the combinations of the disease-specific drug response profiles, we found that a combination of QL-XII-47 and GSK-690693-a pan-AKT kinase inhibitor that reduces tumor cell proliferation and induces tumor cell apoptosis (53,54)-was the best performing drug combination to revert TNBC signatures, with an increase of 10% compared to monotherapy with QL-XII-47.
Gene set enrichment analysis predicted that this combination may act through the activation of TNF-alpha signaling via NF-κB and Myogenesis, and the deactivation of Fatty Acid Metabolism, mTORC1 Signaling, E2F targets, and the Unfolded Protein Response (FDR < 0.05 in all cases).These pathways have previously been associated with the inhibition of triple-negative breast cancer growth and metastasis (55)(56)(57), and thus highlight the potential of our computational approach to prioritize drugs by integrating cell-type specific profiles obtained from single-cell RNAseq data sets with disease-specific transcriptional drug response profiles.
Signatures impacted by QL-XII-47 and GSK-690693 in individual patients.Finally, we evaluated how combination treatment with QL-XII-47 and GSK-690693 might impact transcriptional profiles of triple-negative cancer cells at the individual patient level.We found that, for most patients, the signatures expected to be impacted by the drug combination were inversely correlated between healthy samples and cancer samples (Figure 6).For example, TNF-alpha signaling via NF-κB shows an inverse association in 84% (22 of 26) of patients, while Hypoxia shows an inverse association in 73% (19 of 26) patients.Note that, while we focused on triple-negative cells in this study, the breast cancer atlas we analyzed also included samples from receptor-positive cancers.However, we specifically applied retriever to triple-negative cells from these samples, aiming to identify drug combinations that would target the tumor's most difficult-to-treat cells.Interestingly, signatures in triple-negative cells from all of the included non-TNBC samples (8 of 8) showed strong inverse correlations to those from healthy samples.For the TNBC samples, 7 of 18 had strong inverse associations with the healthy sample cluster.This indicates that, when applied to single-cell data derived from a specific tumor type, the candidate drug combinations predicted by retriever are likely to benefit a relatively large number of patients.

Discussion
RNA-seq provides an unprecedented resolution to characterize cellular heterogeneity in cancer.Compared with cell lines, the cells from tumors characterized by single-cell RNA-seq do not exhibit metabolic adaptation due to culturing.In advantage to bulk RNA-seq, single-cell RNA-seq can provide a more accurate signature of the changes observed in specific cells relevant to the disease under study (14,58).Thus, leveraging the information provided by this new technology is important to accelerate the identification of new personalized treatments that target specific subpopulations of cells in a tissue (59).
However, just as important as characterizing the signatures of the disease in affected cells at the highest possible resolution is to characterize the response profiles of drugs that may help to reverse disease-associated transcriptional changes towards a healthy state.Large consortia including the LINCS-L1000 project provide transcriptional phenotypes at different time points after applying hundreds of compounds to multiple types of cell lines at different concentrations.Even though this information is valuable, its direct application to pathophysiological scenarios is limited due to the difficulty to extract robust response profile in these large data sets (60).Here, we introduced retriever, a tool to extract robust disease-specific drug response profiles from the LINCS-L1000 project.retriever mines the information associated to each cell line in the project to select subsets of cell lines that are surrogates of a defined disease, after which it extracts disease-specific response profiles.By doing this, retriever maximizes the robustness of the transcriptional signatures that are used to rank candidate compounds for the identification of new treatment strategies.With retriever, we hypothesize that if a signature is robust across multiple cell lines representing the same disease, it will likely also be more robust across individuals, as cell lines exhibit genetic and metabolic differences.
We showcased retriever's strength by retrieving time-, dose-, and cell-type consistent transcriptional drug response signatures of TNBC cells from the LINCS-L1000 project and combining it with single-cell RNA transcriptome profiles obtained from a large single-cell breast cancer atlas, which we used to predict novel drug combinations against TNBC.This identified a combination of two kinase inhibitors predicted to be effective through together acting on important biological pathways in breast cancer, and thereby antagonize the TNBC-specific signature.
The prediction of drug combinations with retriever that we present here, although it is relatively simple and relies on the hypothesis of independent mechanisms of action for each compound, has been established as accurate (predicting the right directionality of the change) in analyses performed on profiles from the Connectivity Map project (28).Nevertheless, our computational approach only allows us to rank compounds and compounds combinations that are suitable for the development of treatments based on the negative correlation of the drug response profiles and the disease signatures.The topmost promising compounds and drug combinations still need to be validated experimentally to define the right concentration and evaluate if they indeed exhibit a synergistic mechanism of action.Their toxicity across cell lines and, after preclinical tests, potential adverse events in in-vivo models and cancer patients also need to be evaluated.
Thanks to the multiple cell lines available in the LINCS-L1000 project, our approach can be replicated in at least 13 other cancer types, including: adult acute monocytic leukemia, adult acute myeloid leukemia, cecum adenocarcinoma, colon (adeno)carcinoma, endometrial adenocarcinoma, lung adenocarcinoma, large cell lung carcinoma, small cell lung carcinoma, melanoma, ovarian mucinous adenocarcinoma, prostate carcinoma, and triple-negative breast cancer among other variants (such as local or metastatic tumors for some cancer types), as soon as single-cell RNA-seq data for those cancer types and healthy tissues become available.Considering the increase in single-cell RNA-seq studies being published since the development of the technique, we expect this to be possible for most cancer types represented in the LINCS-L1000 project in the near future (61).
Finally, our approach can also be applied to disease profiles derived from a single patient to recommend personalized treatment strategies.We envision this to be one of the potentially most impactful applications of retriever in computationally-informed precision medicine once single-cell transcriptional profiles be-come a standard in the clinic.In addition, while we have showcased retriever's strength in predicting drug combinations for TNBC in general, the method can be applied to single-cell RNAseq data derived from individual tumors, opening up the road for applications in precision medicine.

Fig. 1 .
Fig.1.Overview of the retriever algorithm.retriever generates disease-specific transcriptional drug response signatures by merging transcriptional signatures over time, concentration, and cell-type.These signatures can then be matched to single-cell or bulk expression profiles to predict drugs and drug combinations most likely to be effective in treating a disease.

Fig. 2 .
Fig. 2. Single-cell atlas of breast samples.(A) UMAP projection of the integrated 77,384 cells from 36 breast samples.(B) UMAP projection of the healthy (left) and cancer (center) cells, and visualization of triple-positive epithelial cells in the cancer samples.(C) Dotplot representing the normalized expression level and percentage of cells expressing the top five differentially expressed genes for each cell type.

Osorio 5 Fig. 3 .
Fig. 3. Transcriptional changes identified in TNBC cells.(A) Volcano plots report the difference between TNBC and healthy samples, on the left based on single-cell RNA-seq count data computed using MAST, in the middle using the pseudo bulk measures in each single-cell RNA-seq sample, and on the right using the bulk RNA-seq data from the BRCA-TCGA project.Each dot represents a gene.Dots are color coded, in red if the log2 fold-change is larger than 1 and in blue if the log2 fold-change is smaller than -1.(B) Comparisons of the transcriptional changes associated with TNBC at the single-cell, pseudo bulk and tissue level.Each dot represents a gene.Dots are color coded as in Figure 2A.(C) Single-sample GSEA Enrichment Score (ES) for the transcriptional changes between TNBC and healthy cells at different levels of resolution.Labeled pathways are those that show same trend (positive or negative ES) in the three data types.

Fig. 4 .
Fig. 4. Case example showing the construction of a single transcriptional response profile to a compound across three TNBC cell lines.Frames of the scatterplots displaying the relationship between the profiles and the averaged profiles are color coded, in black if the computed Spearman correlation coefficient ( ρ) is larger than 0.6 or in gray otherwise.(A) Generation of a time-consistent response profile.(B) Generation of a time and concentration-consistent response profile.(C) Generation of a time, concentration, and cell-line-consistent response profile.

Fig. 5 .
Fig. 5. Correlation analysis and mechanism of action prediction for the disease-specific drug response profiles in TNBC.(A) Spearman correlation analysis between the expression changes associated with TNBC and the disease-specific drug response signature of QL-XII-47.Each dot represents a gene.Dots are color coded, in red if they are expected to be upregulated by the drug, in blue if they are expected to be downregulated by the drug, and in gray if no significant change is expected.The red line represents the perfect match between both profiles.Density lines reflect the number of dots in each section of the plot.(B) Mechanisms of action predicted for QL-XII-47 in TNBC based on the enrichment of biological pathways in the disease-specific transcriptional drug response signature.(C) Independent sensitivity evaluation of the effect of QL-XII-47 in two other TNBC cell lines.(D) Spearman correlation analysis between the expression changes associated with TNBC and the combination signature of QL-XII-47 and GSK-690693.Each dot represents a gene.Dots are color coded as in (A).The red line represents the perfect match between both profiles.Density lines reflect the number of dots in each section of the plot.(E) Mechanisms of action predicted for the mixture of QL-XII-47 and GSK-690693 based on the enrichment of their combination drug response signature.
involved in Fatty Acid Metabolism (FDR < 0.05 in all cases, Fig 5A).

Fig. 6 .
Fig. 6.Cancer hallmark signatures identified in triple-negative cells from individual breast cancer samples and triple-positive-like epithelial cells from control samples.Single-sample GSEA Enrichment Score (ES) of the pseudo-bulk profile computed for each donor included in the study.Labeled pathways are those that show a similar trend (either positive or negative ES) in the three data types.Dn: downregulation; Up: upregulation.