Abstract
Resistant tumours are thought to arise from the action of Darwinian selection on genetically heterogenous cancer cell populations. However, simple clonal selection is inadequate to describe the late relapses often characterising luminal breast cancers treated with endocrine therapy (ET), suggesting a more complex interplay between genetic and non-genetic factors. Partially, this is due to our limited understanding on the effect of ET at the single cell level. In the present study, we dissect the contributions of clonal genetic diversity and transcriptional plasticity during the early and late phases of ET at single-cell resolution. Using single-cell RNA-sequencing and imaging we disentangle the transcriptional variability of plastic cells and define a rare sub-population of pre-adapted (PA) cells which undergoes further transcriptomic reprogramming and copy number changes to acquire full resistance. PA cells show reduced oestrogen receptor α activity but increased features of quiescence and migration. We find evidence for sub-clonal expression of this PA signature in primary tumours and for dominant expression in clustered circulating tumour cells. We propose a multi-step model for ET resistance development and advocate the use of stage-specific biomarkers.
Introduction
The outgrowth of primary luminal breast cancer (BCa) is driven by non-mutated estrogen receptor α (ERα), with all patients receiving adjuvant endocrine therapy after curative surgery (ET). This strategy significantly delays clinical relapse but does not abrogate it completely, as about ∼3% of the patients each year come back with overt relapse, inevitably leading to further metastatic development1-3. The frequency of relapse remains constant up to 20 years after surgery making ET- resistance the most critical clinical problem for the management of these patients4. The processes of adaptation and selection leading to late relapse are currently poorly understood and should be interpreted in light of adjuvant therapies.
Recent developments in next-generation sequencing (NGS) revealed that tumours are genetically heterogeneous5-7 and in some cancer types, heterogeneity correlates with the likelihood of recurrence and development of drug-resistance8,9. In some instances, targeted therapy can lead to the rapid expansion of genetically defined pre-existent resistant cells that can be explained by simple models of clonal selection10-12. However, this same model is mostly inconsistent with the decade-long latency observed in luminal BCa. In addition, despite recent studies showed that the majority of the genetic lesions in BCa are accumulated during the early phases of tumour development5,13, they failed to identify any major driver associated to metastasis and resistance, with the exception of a minor fraction of cases showing either ESR1 mutations or CYP19A1 amplification14-17. Yet, the transcriptomes of the resistant cells are profoundly heterogeneous and different from those of the primary tumour18-20, suggesting a contribution of non-genetic mechanisms21.
Rare phenotypic subpopulations, showing features of drug-tolerance and sometimes of quiescence, have been found in primary melanomas22, leukaemia23, non-small cell lung cancer24 and triple negative breast cancer (TNBC)25. In primary melanoma, a rare, transient subpopulation expressing resistant markers at high levels can survive and persist to become stably resistant26. Nevertheless, it remains unclear how genetic and non-genetic components contribute to different types or stages of ERα-BCa.
In this study, we used a combination of live cell imaging, single cell RNA- sequencing (scRNA-seq) and machine learning to dissect the phenotypic heterogeneity and plasticity of ERα-positive BCa and leveraged this information to identify a subpopulation of rare, pre-adapted cells both in vitro and in vivo. These cells (termed PA, from Pre-Adapted) display a unique transcriptional signature with features of dormancy and mixed epithelial and mesenchymal traits, and which was found dominant in clustered circulating tumour cells. PA cells showed a significant survival advantage under short-term ET but required further transcriptional reprogramming and genetic alterations to acquire full resistance and re-establish a proliferative phenotype in vitro. These results highlight the multi-faceted effects of ET at single cell level and suggest a multi-step mechanism of drug resistance that involve both non-genetic and genetic contributions.
Phenotypical equivalent of fully resistant clones is absent in treatment-naïve cells
In order to study the dynamic process of ET resistance, we exploited an in vitro system that maximises reproducibility while minimising confounding factors15,27. Long-term oestrogen deprived (LTED) cells originate from ESR1 wild-type MCF7 that have been deprived from oestradiol for one year. This model is generally considered a good proxy to study the effect of aromatase inhibitors (AI) (Fig. 1a). We previously showed that resistance in this model is driven by amplification of the aromatase gene (CYP19A1) in combination with endogenous cholesterol biosynthesis, but not by mutated ESR115,28. Even in this accelerated model, fully resistant cells emerge between 6 to 12 months of oestrogen deprivation29, which is incompatible with clonal selection of a pre-resistant cell30. In line with this, less than 1.5% of early-stage BCa show evidence of a pre-existent ESR1-mutant clone31 (Supplementary Table 1), suggesting key driver mutations to be acquired at a later stage. However, this model does not fully exclude pre-existence of transcriptomic clones with features of resistance. To investigate this, we generated scRNA-seq high-quality profiles for >1,200 MCF7 and >1,900 LTED cells (Supplementary Table 2).
Dimensionality reduction (Similarity Weighted Nonnegative Embedding, or SWNE)32 showed MCF7 and LTED as completely separated populations, with no single MCF7 clustering with LTED cells (Fig. 1b). Studies in melanoma and TNBC suggest that drug-resistant cells can rapidly emerge25,26. This implies that in drug-naïve tumors, at least a few cells have a transcriptional profile similar to that of fully resistant cells. However, our data suggest this is not the case in luminal breast cancer cell lines, which is concordant with the long latency taken by ET-resistance to occur in most patients treated with endocrine therapies. To completely exclude any contribution of a pre-existent genetic clone, we inferred single-cell, copy number alterations (CNAs) from scRNA-seq data (Methods). Clustering of single MCF7 and LTED cells based on the inferred patterns of CNAs identified two clades, one including all the MCF7 and one all the LTED cells (Fig. 1c). In line with CYP19A1 significantly contributing to AI resistance in vivo and in vitro15, an amplification involving the region was found in almost all LTED cells but not in MCF7 (Fig. 1c). This was confirmed by shallow whole genome sequencing (Supplementary Fig. 1a).
Taken together, these data support that AI resistance is not driven by a pre-resistant clone (whether genetic or in a particular transcriptional state), suggesting a multi-step adaptation process in which the necessary hits occur with a different timing during ET.
Using scRNA-seq, we were able to confirm that certain pathways, such as cholesterol biosynthesis, are profoundly reprogrammed by ET at the population level (Fig. 1d and Supplementary Fig. 1b). Nevertheless, clustering of single-cell profiles identified five distinct groups (two for the MCF7 and three for the LTED), reflecting a certain degree of diversity for the activity of these pathways. This led us to further dissect the phenotypic heterogeneity of breast cancer cells with the aim of pinpointing rare, transcriptionally-defined clones in the drug-naïve condition.
Characterisation of a phenotypically distinct population of plastic cells in luminal breast cancer
We first aimed at identifying any surface marker that could guide the dissection of the phenotypic heterogeneity of luminal breast cancer cells. Genes showing high transcriptional variability across single MCF7 cells were intersected with annotation in the Cell Surface Protein Atlas33. Among the top 30 genes, we identified CD44 (Supplementary Fig. 2) which has been previously recognised as a marker of plastic cells in various solid tumours34-36. Further investigation confirmed variable expression of CD44 in primary tumours (Supplementary Fig. 3a-c and Fig. 2a,b). CD44 positive cells were found significantly enriched after treatment in a cohort of neo-adjuvant AI patients (Fig. 2a; 3.8-fold, p-value = 0.0032; two-tailed paired t-test) as well as in matched AI-treated primary-metastatic (Fig. 2b and Supplementary Fig. 3d; 2-fold, p-value = 0.0029; Wilcoxon signed-rank test), suggesting higher chances of survival to ET for CD44-expressing cells in vivo. We next sought to investigate if CD44high cells can be also found at other active sites in breast cancer patients. Interestingly, we found substantial CD44high cells in pleural effusions from all four patients examined (Supplementary Fig. 3e). In line with this, the fraction of FACS- sorted CD44high cells was significantly increased in LTED (upper panels in Supplementary Fig. 3f,g). Extensive functional characterisation of these cells demonstrated that MCF7-CD44high cells were more invasive, more clonogenic, and could form first and second-generation of mammosphere at higher efficiency than CD44low cells (Supplementary Fig. 3h-j). In agreement with previous studies36, CD44high cells also showed cellular plasticity as they could recapitulate the entire population while CD44low were capable of generating only CD44low cells (Supplementary Fig. 3f).
To further investigate the plasticity of CD44high cells in vitro at the single-cell level, we generated MCF7 and LTED cell lines with a GFP reporter expressed under the promoter of the CD44 gene (Supplementary Fig. 4). Reconstitution experiments from FACS-sorted cells showed that CD44GFP-high cells could recapitulate all the functional aspects of endogenous CD44high cells including cellular plasticity (Figure 2c). Interestingly, both CD44GFP-high and CD44GFP-low showed features of plasticity in fully resistant cells (Fig. 2d and Supplementary Fig. 3g). When MCF7 were challenged with short-term ET, only CD44GFP-high cells appeared to adapt to it, while CD44GFP-low cells were rapidly cleared out between days 4 and 7 (Fig. 2e). Single-cell plating experiments confirmed that only CD44high cells could drive the formation of early colonies under E2-deprivation, but the colonies were significantly smaller compared to E2-supplemented conditions (Fig. 2f). These observations indicate combined cytostatic and cytotoxic effects of ET and that those cells that could adapt to the therapy originate within the CD44high compartment. Extrapolation of cell-cycle dynamics of CD44GFP-high and CD44GFP-low cells from time-lapse imaging data revealed comparable cell-cycle length in E2-supplemented condition (Fig. 2g; +E2). Nevertheless, CD44GFP-high cells had a significantly lower proportion of cells engaged in productive cell cycle entry, suggesting the existence of a low-proliferative subpopulation within the CD44high compartment even under permissive environments. Under E2-deprivation, the CD44GFP-low completely failed to undergo cell-cycle entry, while 12% of CD44GFP-high managed to do one or more cell cycles, with a much longer latency (Fig. 2g; -E2).
Taken together, these results further support the idea that at least some of the cells in the CD44GFP-high (but not CD44GFP-low) compartment have an increased ability to survive the acute phase of ET, and this correlates with their features of plasticity. This led us to hypothesise that non-genetic, transcriptional variability would reflect pre-existent, rare subpopulations in treatment-naïve cells with higher chances to survive and give rise to fully-resistant cells.
Single-cell transcriptomic profiling unveils further heterogeneity of plastic cells
To investigate the transcriptional variability of CD44high cells, we carried out sorting- driven, scRNA-seq of CD44-GFP luminal breast cancer cells. About 10,000 single cells in E2-supplemented condition were profiled (CD44GFP-high and CD44GFP-low in equal proportions; Fig 3a; in the remainder of the text, these two sorted subpopulations will be referred to as CD44high and CD44low). Dimensionality reduction (Fig. 3a) highlighted a surprising similarity between the profiles of CD44high and CD44low, except for a small percentage (∼4%) of CD44high cells significantly departing from the main cluster. In line with this, differential expression analysis of the two subpopulations resulted in 10-fold less differentially regulated genes (DEGs) than those observed by comparing them to LTED (Fig. 3b; Supplementary Table 3). Nevertheless, CD44high showed an overall, significantly higher transcriptomic variability (p-value < 2.2e-16; Wilcoxon rank-sum test) than CD44low (Fig 3c).
We next sought to systematically address whether the observed variability was the result of either an increased transcriptional noise specific to CD44high cells (compatible with a bet-hedging mechanism) or instead the reflection of a regulated network (leading to coordinated expression of multiple genes in the same cell). We applied PIDC37, an algorithm using partial information decomposition (PID), to identify regulatory relationships between genes, and reconstructed the gene regulatory networks (GRNs) from the scRNA-seq profiles of CD44high and CD44low cells, separately (Supplementary Table 4). The two networks were merged and analysed to identify major communities (Fig. 3d). The larger of the three identified communities (#1 in Figs. 3d-f) showed low similarity between the CD44high and CD44low GRNs, with the majority of edges supported only by the CD44high GRN (Fig. 3e). Pathway enrichment analyses38 for the genes in this community showed highly significant enrichments for oestrogen response, TNFα signalling, epithelial-mesenchymal transition and p53 pathway (Fig. 3f; q-value < 1e-8). These results strongly suggest that the variability specific to the CD44high compartment is the result of the coordinated regulation of genes within well-defined subpopulations. With this in mind, we hypothesised a central role for these rare cells in the early phases of acute oestrogen deprivation (termed acute-ET).
Single-cell profiling of acute response to endocrine therapy identifies a subpopulation of pre-adapted cells
To investigate the role of transcriptomic variability of plastic cells during acute-ET, we performed scRNA-seq experiments upon oestrogen deprivation (Supplementary Table 2). Continuous single-cell imaging suggested that cells within the CD44low subpopulation started being differentially affected by acute-ET after 48 hours of treatment (Fig. 2e). We thus profiled gene expression data of about 10,000 single cells at 48 hours of starvation (Fig. 4a). Applying a stringent threshold on the first SWNE component, we could define a rare, pre-adapted (PA) subpopulation among plastic cells (CD44high) expressing a signature of acute-ET even in permissive E2- supplemented condition. The identification of PA cells was confirmed using an approach based on Random Forests classification (Fig. 4b, Supplementary Table 5 and Methods). Of note, PA cells are genetically indistinguishable from the other CD44high cells and have not yet acquired any of the genetic re-arrangements of the fully resistant, LTED cells (Fig. 4c). Considering both approaches and both a lenient and a stringent threshold, PA cells are estimated to constitute 0.76 to 4% of the CD44high cells, which correspond to 0.03 to 0.14% of the total MCF7 population. Overall, these data suggest that PA cells might represent the first step in the process of adaptation to acute-ET.
We then sought to validate if the PA transcriptional state would confer a survival advantage compared to other plastic cells exposed to acute-ET. First of all, we identified the Claudin-1 gene (CLDN1) as a marker for PA cells (in combination with CD44). We then generated MCF7 cells stably labelled with either a nuclear GFP or mKate2 and leveraged this tool to follow two subpopulations over time after mixing them. The same amount of sorted PA cells (CD44high and CLDN1high) was mixed with other plastic cells (CD44high and CLDN1low; Fig. 4d). CD44high CLDN1high PA cells showed increased survival to acute-ET compared to CD44high CLDN1low, with this effect increasing over time. As a control, no difference was observed between CLDN1high and CLDN1low from the CD44low compartment. These data strongly support the hypothesis that PA cells have distinctive survival advantage under acute-ET (Fig. 4e).
We then further characterised these cells functionally. We focused on a stringent set of differentially expressed genes between the PA cells and the rest of the CD44high cells in +E2 condition (312 up-regulated and 341 down-regulated; Figs. 4a,b and Supplementary Table 5). PA cells displayed features of mixed epithelial and mesenchymal traits, along with up-regulation of p53 pathway, cell polarity (apical junction components), and hypoxia (Fig. 5a, upper panel). PA cells also showed reduced ERα activity and down-regulation of the cell cycle machinery, while still expressing ESR1 (Fig. 5a, lower panel and Supplementary Fig. 5). Interestingly, both plastic and non-plastic cells lied on a continuum showing negative correlation between the expression of the genes of the cell cycle and of those in the signature of PA cells (Fig. 5b; Spearman’s Rank Correlation coefficient = −0.519; p-value < 2.2e- 16), with PA cells found at the edge of this spectrum. We finally sought to quantify the overlap between the PA cells signature (up-regulated genes) with the CD44high- enriched GRN we previously identified (Figs. 3d-f). Indeed, when we further dissected the GRN (community #1) into its two main components, we found extensive overlap between one of these components and the PA signature (Fig. 5c; p-value = 2.7e-21; hypergeometric test). This further supports the idea that the genes in this signature are part of a co-regulated network.
Overall, these data support the hypothesis that plastic cells are non-genetically heterogeneous, and among them rare cells in the PA state have a survival advantage during acute-ET.
Pre-adaptation is a persistent feature of acute-ET but differs from a signature of full resistance
While these analyses support a pivotal role for the PA phenotype in conferring a survival advantage during acute-ET, PA cells are still genetically indistinguishable from the rest of the cells. This suggests these cells do not represent the final step of drug resistance. Nevertheless, we aimed at determining whether longer exposure to acute-ET correlates with the persistence of the PA signature, and/or this also coincides with other reprogramming events. In order to capture the different dynamics of survival of CD44high and CD44low (Figs. 2e and 6a), we generated scRNA-seq profiles at 4 and 7 days of oestrogen deprivation (Supplementary Table 2), a period in which the relative number of CD44high cells does not change while CD44low undergo rapid extinction. Dimensionality reduction of >28k cells showed increased prevalence of the PA signature with time of starvation (Fig. 6a). Formal quantification using AUCell39 confirmed this trend (Fig. 6b, left panel and Fig. 6d). The same analysis using a LTED-specific signature (Methods and Fig. 1) failed to identify any cell expressing it during acute-ET (Fig. 6b, right panel). In line with this, the critical transcriptional pathways driving full resistance (i.e. cholesterol biosynthesis and re-activation of ERα signalling) were completely abrogated in PA and cells exposed to acute-ET (Figs. 6c,d). On the other hand, some of the pathways associated to PA phenotype (partial-EMT, cell polarity, hypoxia) were found to consistently increase during treatment.
Unexpectedly, while imaging showed that after 7 days >75% of the CD44low died and were destined to extinction (Fig. 2f), the profiled CD44high and CD44low cells converged on the same transcriptional changes. We reasoned that since scRNA-seq experiments capture viable cells exclusively, we profiled only those cells that were still alive at day 7. Thus, we hypothesised that the PA-like transcriptional program is an intermediate bottleneck during acute-ET. In line with this, we discovered that CD44low cells can occasionally up-regulate a signature overlapping that of PA cells, but this happens with lower efficiency (Fig. 6b) and it is not sufficient to give them the survival advantage shown by CD44high. To validate these observations at the protein level, we performed a multi-marker tracing profile, exploiting some marker genes (namely, GPRC5A, MFGE8, FSCN1 and RAB11FIP1) showing a trend of up-regulation with starvation time. This trend was confirmed at the protein level, with values consistently higher in CD44high compared to CD44low cells (Fig. 6e and Supplementary Fig. 6 and 7). Nevertheless, this did not prevent cells in the CD44low compartment to die at an almost linear rate (Fig. 6e).
Taken together, these observations confirmed that the PA transcriptome is strongly selected by acute-ET. Nevertheless, the observed rapid expansion (Fig. 6b) seems incompatible with the strict selection of a pre-existing population30. The observation that also CD44low cells can adopt a similar transcriptional profile in response to ET (despite being unable to survive) suggests the PA program is required but not sufficient to explain the survival of plastic cells to acute-ET (see Discussion).
The pre-adapted signature is active in primary tumors and enriched in clusters of circulating tumor cells
We first looked for evidence of expression and co-regulation of genes up-regulated in PA cells, in 825 primary luminal breast tumours40. Tumours classified as luminal A showed significantly higher expression of the signature compared to luminal B (p- value < 2.2e-16; Wilcoxon rank-sum test) and TNBC/HER2+ lesions (p-value = 1.9e-8) (Fig. 7a). Of note, luminal A exhibit the longest latencies in relapse development amongst all BCa41-43. Considering >600 luminal A samples, we then checked the distribution of pairwise correlations between the expression pattern of the genes in the signature, as a proxy for co-regulation. Compared to a size-matched set of randomly picked genes, those in the PA signature showed significantly higher coefficients (Fig. 7b; p-value < 2.2e-16; Wilcoxon rank-sum test), with hundreds of pairs with values over 0.5 (Spearman’s Rank Correlation coefficient). These results further corroborate our previous observations that these genes tend to be controlled by the same GRNs and showed a trend of higher expression in luminal tumours with longer latency of recurrence (A vs B; Fig. 7a).
Given that some of the key pathways active in PA cells hinted to mixed epithelial and mesenchymal features, as well as cell polarity and migration, we asked if the PA phenotype could play a role in metastatic progression. Previous data strongly suggest that epithelial-like cluster of circulating tumour cells (CTCs) are responsible for 85-92% of metastatic dissemination44, with individual CTC showing more mesenchymal features playing a more limited role45. Interestingly, the PA signature was found significantly enriched in CTCs45 (Fig. 7c; q-value = 0.017, permutation test) and at even higher levels in clusters of CTCs44 (Fig. 7d; q-value = 0.066, permutation test). These results provide a further link between drug-induced adaptation and metastatic invasion27,46.
Discussion
In this study, we leveraged an in vitro model to investigate the contribution of genetic and transcriptional heterogeneity to the development of resistance to ET in luminal breast cancer. As opposed to previous observations in melanoma, TNBC, lung and colorectal cancers, in which targeted therapy lead to the rapid emergence of fully-resistant cells10–12,18,26, we could not find any genetic or phenotypic clone showing features of resistance in treatment-naïve cells (Fig. 1). The same observation held true even after thoroughly dissecting the heterogeneity of the cells showing features of plasticity (Figs. 2-3). On the other hand, we could identify and characterise a small subpopulation (∼0.1% of the treatment-naïve cells) showing a pre-adapted (PA) phenotype (Fig. 4). These cells showed a 2-fold increased survival to acute-ET compared to other plastic cells, while non-plastic cells undergo complete extinction under selective pressure (Fig. 4e), along with mixed epithelial and mesenchymal features, and quiescence. Interestingly, while any cell (also those with no feature of plasticity) can adopt a transcriptional program overlapping that of the PA cells, only plastic cells can withstand acute-ET (Fig. 6a and 6e), with PA cells showing a more pronounced survival advantage (Fig. 4e). Finally, we found an enrichment of the PA signature in clusters of CTCs. To our knowledge, this is the first time a quiescent population from the primary tumour has been linked to both features of survival to therapy and of CTCs. Interestingly, it has been reported that early stage metastatic cells possess partial features of survival, dormancy and EMT, which all overlap with our PA signature47. A signature of partial-EMT has also been recently shown to be expressed in the cells at the leading edge of primary head and neck cancers48. It is tempting to speculate that PA cells might not only display a survival advantage during the early phases of the therapy but might also be the pioneers of micro-metastatic spread.
Surprisingly, we found that also cells with no features of plasticity were able to adopt the PA signature, even though with a much lower efficiency, which cannot prevent the extinction of the compartment after two weeks of oestrogen deprivation (Fig. 6). On top of this, 70% of the plastic cells adopted a PA signature within 48 hours of acute-ET (Fig. 6b). This fast transition to a diverse transcriptional state is hardly explained by straight Darwinian selection of a persister cell30. For reasons that remain to be investigated, plastic cells have a much higher probability than non-plastic ones to transition into a PA state, and this probability is dramatically increased by oestrogen deprivation. We reason that upon stress, plastic PA cells are better positioned than cells requiring transcriptional reprogramming, hence the observed difference in survival within the plastic compartment (Fig. 6e).
PA cells represent an obligated step toward the acquisition of resistance while still requiring substantial reprogramming to recapitulate features of full resistance cells. We propose that the delayed relapse common to ET treated patients might be mediated by similar processes, in which PA-like cells are selected for and stalled by ET for up to >10 years. This model would reconcile why ET are sometime effective for down staging neo-adjuvant patients but fail to clear micro-metastatic disease. Nevertheless, how this bottleneck affects the progression of the tumour requires further investigation. Future studies on the necessary steps and their timing of occurrence during treatment must be carried out in order to expose potential vulnerabilities of these quiescent cells.
Methods
Cell lines
Cells were originated as previously described27. MCF7 cells were maintained in Dulbecco’s modified Eagle’s medium (DMEM) containing 10% fetal calf serum (FCS). Long-term oestrogen-deprived cells (LTED) were derived from MCF7 after one year oestrogen deprivation as previously described and were maintained in phenol-red free DMEM containing 10% charcoal stripped fetal calf serum (SFCS)27. Both media were supplemented with 2 mM L-glutamine, 100 units/mL penicillin and streptomycin. 10−8 M estradiol (E2758 Sigma) was added routinely to MCF7. Primary metastatic breast cancer cells were derived from pleural effusions of patients with metastatic breast cancers. The pleural effusion (PE) cells were maintained in DMEM containing 10% fetal calf serum (FCS) and 2 mM L- glutamine, 100 units/mL penicillin and streptomycin.
Plasmids
pLVX-IRES-mCherry-puro lentiviral vector (Cambridge Bioscience, Cambridge, UK) was used to infect MCF7 and LTED cells. CD44 reporter GFP cells were established with CD44CR1-IRES-GFP-puro lentiviral vector (Tebu-Bioscience). Stable and polyclonal cell populations were established after puromycin selection (0.5 μg/ml). NucLight Green lentivirus (IncuCyte, 4626) and NucLight Red Lentivirus (IncuCyte, 4627) were used to infect MCF7. Stable and polyclonal cell populations were established after Zeocin selection (300 μg/ml). H2B-mCherry-puro lentiviral vector was used to infect stable CD44 reporter GFP cells. Stable and polyclonal cell populations were established after sorting.
Antibodies
Anti-ERα antibody (Vector Laboratories, VP-E613) 1:100 for immunofluorescence (IF) and anti-ERα (Santa Cruz, HC-20) 1:1000 for western blot (WB), anti-CD44 antibody (Santa Cruz, sc-7297,) 1:200 for IF and 1:100 for immunohistochemistry (IHC), anti-pan Cytochemistry antibody (Abcam, ab17154) 1:200 for IF, anti-FGFR4 antibody (Abcam, ab44971) 1:100 for IF, anti-FSCN1(Sigma, HPA005723) 1:100 for IF, anti-MFGE8 (Sigma, HPA002807) 1:100 for IF, anti-RAB11FIP1 (Sigma, HPA023904) 1:100 for IF, anti-GPRC5A (Sigma, HPA007928) 1:100 for IF, anti-caspase3 (Merk, MAB10753) 1:100 for IF.
FACS analysis
Cells were cultured to 70 to 80% confluence and detached from the cell culture flasks using EDTA. Cell pellets were obtained and washed with cold phosphate-buffered saline (PBS) containing 1% FCS and 5mM EDTA. All further steps were performed on ice and all centrifugation steps at 4°C. Fluorochrome-conjugated monoclonal antibodies against human CD44 (FITC, BD Pharmingen; BV421, BD Pharmingen), Claudin1 (APC, R&D systems), and their isotype controls were added to the cell suspension at concentrations recommended by the manufacturer (BD Biosciences) and incubated at 4 °C in the dark for 30 min. The labelled cells and CD44 reporter GFP cells were washed in PBS and then were analyzed on a FACSAria (BD Biosciences). Gating was set to relevant isotype control (IgG-FITC)-labeled cells or unstained cells for each cell line. Propidium iodide (Bio-Rad, 1351101) and DRAQ7 (BioLegend, 424001) were used for the dead cell removal.
Soft Agar Colony-Forming Assay
Anchorage independent cell growth was carried out in six-well tissue culture plates. A 1-mL layer of 0.6% agar (DIFCO Laboratories) in appropriate cell culture medium was solidified at the bottom of each well. Cells to be assayed were suspended in 1 mL of 0.3% agar in medium. 1 × 104 cells were seeded in each dish. After 4 weeks of incubation at 37 °C in 5% CO2, colonies were visualized by staining with 0.02% Crystal violet.
Mammosphere Culture
Cells were plated as single cells at a density of 5×102 viable cells/well in ultralow attachment 6 well plates (Corning, CLS3814). Cells were grown in a serum-free DMEM or phenol-red free DMEM, supplemented with B27 (Invitrogen, 17504-044), 20 ng/mL EGF (Sigma, E9644) and 20 ng/mL bFGF (R&D systems, 233-FB-025). Mammospheres were grown for 10–14 d and phase contrast images were obtained using the ImageXpress Micro microscope (Molecular Devices). For the second-generation experiment, first-generation mammospheres were collected from multiple wells and spun at 500 × g per 5 minutes. The pellet was resuspended in 50μl Trypsin and the sample was passed 25 times through a sterile needle to get single cell suspension. The same density of cells as in first generation culture was seeded, and cells were allowed to grow for 14
Immunofluorescence
Briefly, 104 cells were seeded on chamber slides (Lab-Tek). On the final day, cells were washed twice with PBS at room temperature and 4% PFA/PBS was added for 15 minutes. Cells were washed twice with PBS and NH4Cl was added as quencher. 0.2% Triton/PBS was added for 5 minutes. 10% BSA/PBS was used as blocking reagent. 5% BSA/PBS was used to dilute primary antibodies and Alexa-fluor 488, 568, 594, 647 labelled anti-rabbit or anti-mouse secondary antibodies (ThermoFisher). Nuclei were counterstained with DAPI and were mounted in ProLong Antifade Mountant (ThermoFisher, P36941). Pictures were acquired using EVOS microscope system (Advanced Microscopy Group, Bothell, WA, USA) or Zeiss Axiovert 200M inverted microscope.
Tissue microarray (TMA) of paired primary and secondary breast cancers
Twenty primary breast carcinomas with a paired metastasis were acquired from the pathology archives of Charing Cross Hospital, London, UK. A tissue microarray was constructed using a manual microarrayer and 0.6 mm punches. The tissue microarray was immunohistochemically profiled for CD44 (Santa Cruz sc-7297). Antigen retrieval was performed using 0.01M citrate buffer, pH 6.0 followed by blocking in 0.3% hydrogen peroxide in PBS, then in normal goat serum (20 μl per ml) for 30 min. The primary antibody was incubated overnight at 4°C at 1:100, and then detected using anti-mouse secondary antibody (Vector Laboratories), Vectastain Elite peroxidase ABC kit, and ImmPACT DAB kit (Vector Laboratories). Subsequently, 4μm TMA sections were immuno-stained using the optimized staining protocol, including negative controls (omission of the primary antibody). Staining was scored based on the H-score and Allred Quick score by three independent investigators (including one consultant pathologist) blinded to the clinicopathological characteristics of patients. H = (3 × % of strongly stained cells) + (2 × % of moderately stained cells) + (1 × % of weakly stained cells) + (0 × % of cells without staining). Negative controls were performed by omission of the primary antibody.
Neo-adjuvant patient selection
All clinical data from patients operated at the European Institute of Oncology (IEO) were prospectively entered in an Institutional data base. For the present study we retrieved data from patients with a neoadjuvant treatment from 1999 to 2014. We selected patients having had pre-surgical biopsy and surgery in our Institute, in order to have their sample analyzed in the same laboratory. We randomly selected 20 patients with luminal tumors, treated by neoadjuvant hormone only (aromatase inhibitors), 10 responders and 10 non-responders.
Reconstitution assay
After sorting with CD44 using FACS Aria III (BD Biosciences), 105 cells of CD44high and CD44low were seeded on 6 well plates and incubated for 7 days. After 7 days, cells were trypsinized and stained with anti-CD44 antibody (FITC, BD) for FACS. After sorting with GFP using FACS Aria III (BD Biosciences), 105 cells of CD44GFP-high and CD44GFP-low were seeded on 6 well plates with oestradiol supplement or deprivation. Five pictures per condition were taken using an EVOS microscope system (Advanced Microscopy Group, Bothell, WA, USA) during 14 days. Fifty different fields were counted. The % of GFP positive cells was calculated by number of GFP positive cells/ number of total cells x100.
Monoclonal Assay
Using FACS Aria III (BD Biosciences), single cell of mCherry-MCF7 was seeded on 96 well plates with oestradiol supplement or deprivation. Single cell was confirmed using an EVOS microscope system (Advanced Microscopy Group, Bothell, WA, USA). The cells were incubated for 30 days, and colonies were counted on EVOS microscope system.
Live cell imaging and Data Analysis
After sorting with GFP by FACS Aria III (BD Biosciences), 105 cells of H2B mCherry MCF7 CD44rep GFPpos and GFPneg were seeded on 6 well plates with estradiol supplement or deprivation. Time-lapse live cell imaging was performed on IncuCyte ZOOM (Essen BioScience) equipped with temperature, humidity and CO2 control. Images were acquired every 6 hours with 10X plan fluorescence objectives for a proliferation assay and every 15 min with 20x up to 10 days for a cell cycle analysis. Excitation (Ex) and emission (Em) filters sets (Chroma Technology Corporation) were as follows: CFP, 427-10nm (Ex), 483-32 nm (Em); YFP, 504-12 nm (Ex), 542-27 nm (Em); mCherry, 589-15 nm (Ex), 632-22 nm (Em). Micromanager 1.3 was used for acquisition of time-lapse images. All data analysis was done with scripts written in Matlab (Mathworks) or using Cell Profiler (Broad Institute) and ImageJ (National Institutes of Health). Symmetric/Asymmetric/conversion analyses were performed on a total of 200 cells. Each cell was monitored for the first 3 cell divisions (1 cell to 2 cells, 2 cells to 4 cells, 4 cells to eight cells). Symmetric division was scored if the daughter cell matched the mother. Asymmetric was scored if the daughter cell did not match the mother. Conversion was scored if cell changed CD44 status without cell division (at least 4 hours pre-or post- division). Cell cycle speed was established by calculating the time intervening between two consecutive metaphase plates.
Statistical analyses
Unless specified otherwise, all the analyses and plots were performed in the statistical computing environment R v3 (www.r-project.org).
Single cells preparation
Single cells were prepared from a full population of MCF7 and LTED, or from sorting MCF7 CD44-GFP reporter cells by the level of GFP expression at different time points of E2 deprivation. After centrifugation, single cells were washed with PBS and were re-suspended with a buffer (Ca++/Mg++ free PBS + 0.04% BSA) at 1,000 cells/µl.
Single-cell RNA sequencing
Viability was confirmed to be > 90% in all samples using acridine orange/propidium iodide dye with LUNA-FL Dual Fluorescence Cell Counter (Logos Biosystems, L20001). Single cell suspensions were loaded on a Chromium Single Cell 3’ Chip (10X Genomics) and were run in the Chromium Controller to generate single-cell gel bead-in-emulsions using the 10x genomics 3’ Chromium v2.0 platform as per manufacturer’s instructions. Single-cell RNA-seq libraries were prepared according to the manufacturer’s protocol and the library quality was confirmed with a Bioanalyzer High-Sensitivity DNA Kit (Agilent, 5067- 4627) and a Qubit dsDNA HS Assay Kit (ThermoFisher, Q32851). Samples were pooled up to four and sequenced on an Illumina HiSeq 4000 according to standard 10X Genomics protocol.
Single-cell RNA-seq raw data analysis
cellRanger (v2.1.1) was run on the raw data using GRCh38 annotation (v1.2.0). Output from cellRanger was loaded into R using the function load_cellranger_matrix_h5 from package cellranger (v1.1.0; genome = “GRCh38”). Datasets were merged according to gene names. All cells sampled were retained except for flow-sorted CD44high and CD44low either in +E2 media or starved for two days, for which the top 5,000 cells in terms of UMIs per cell were considered. In order to robustly detect transcriptional states, a recent paper suggested to consider a coverage of at least 1,500 detected genes per cell49. A filter on cells showing at least 1,500 detected genes per cell and at least 5,000 UMIs per cell was then applied. After that, reads mapping on mitochondrial genes were excluded. Before normalization, a series of filtering steps were performed. To do that, data were imported in Seurat (v2.3.4)50 and scaled (NormalizeData function using normalization.method = “LogNormalize”, scale.factor = 10000, followed by the ScaleData function). A filtering step was then performed based on the cumulative level of expression (the sum of the Seurat-scaled values) of three housekeeping genes (GAPDH, RPL26 and RPL36)51. Manual inspection of these values versus the number of UMIs per cell (or the number of genes with non-zero expression per cell) revealed no correlation between the two. Nevertheless, a number of cells showed very low expression for these genes. Cells showing housekeeping gene expression in the bottom 1% were then excluded from further analyses. At last, genes expressed in less than 20 cells were excluded. Across-cells normalization was performed using the R package Scran (v1.6.9)52. Raw counts were imported into a SCE object using the newSCESet function; size factors were calculated using computeSumFactors (sizes = seq(20, 250, 10)), on data pre-clustered through quickCluster.
Estimation of Copy number alterations (CNAs) from scRNA-seq data
CNAs were estimated directly from the scRNA-seq data, using an approach similar to the one used by Patel at al.53. Only genes expressed in >=25 cells were considered.
A reference gene expression profile was generated based on published scRNA-seq profiles of hormone-responsive luminal cells (termed L2)54, using only the datasets obtained using a droplet-based approach. After normalizing each single-cell profile based on the total number of detected transcripts to a fixed constant (10,000), a pseudo-bulk profile for the L2-cells was derived using the mean expression value of each gene across all cells.
Before running the actual CNAs quantification, all the raw scRNA-seq datasets generated in this study (after filtering, pre-normalization) and the pseudo-bulk profile generated as described above were linearly normalized to a constant (10,000) and log-scaled (pseudo-count set to 1).
First of all, chromosomal coordinates of all genes were retrieved using the biomaRt R package (v2.34.2; host set to “jul2015.archive.ensembl.org”)55. This way, genes were sorted by chromosomal coordinates. A genome-wide scan was then conducted using a sliding window of 100 genes, with a step of 10. Using the rollapply function from the zoo package in R (v1.8-3), mean value of expression in each bin was calculated for each single cell, as well as for the reference profile. The resulting genome-wide profile from each single cell was then linearly regressed against the reference estimate (using the function lm). The residuals were then considered as a proxy for CNAs and plotted in the form of heat maps. Single-cell, CNAs profiles were hierarchically clustered (hclust, method="ward.D2”) and shown as a circular dendrogram using circlize_dendrogram from R package dendextend (v1.8.0). In case of MCF7 and LTED (full populations), CNAs were estimated on all the cells. In case of the identified pre-adapted cells, the same number of cells (n = 81) was randomly sampled from the other groups of cells.
Estimation of CNAs from ChIP-input-DNA
Reads were aligned to the hg19 human reference genome using bowtie2 (v2.3.4.3)56. Aligned reads were converted to BAM files, sorted and indexed using Samtools (v1.9)57. Duplicated reads were marked and removed using Picard MarkDuplicates (v2.1.1; REMOVE_DUPLICATES=true). Only uniquely mapped reads were retained for further analyses. Copy numbers were inferred using CNVkit tools (v.0.9.4.dev0)58, as described here: https://cnvkit.readthedocs.io/en/stable/pipeline.html. CNVkit was run with the default parameters of the batch command after creating a flat reference genome as suggested in the manual using the command reference.
Dimensionality reduction and clustering
Normalized data was then imported in Seurat and scaled. Variable genes were identified using the FindVariableGenes function (mean.function = ExpMean, dispersion.function = LogVMR, x.low.cutoff = 0.01, x.high.cutoff = 6, y.cutoff = 0.01, num.bin=100). Principal component analysis (PCA) was run using variable genes as input, and the top 50 components were kept. Clusters were then identified using FindClusters (resolution = 0.6). Considering only those variable genes identified as described above (Similarity Weighted Nonnegative Embedding (SWNE)32 was applied to further reduce the dimensionality of the data. The k parameter was estimated using FindNumFactors on a subsample of 1,000 cells (loss = “mse”, 2 to 50 as range of values, with a step of 2). The choice of k is determined by randomly set 20% of the gene expression matrix as missing, followed by finding the factorization that best imputes the missing values, minimizing the mean squared error. Using this parameter, non-negative matrix factorization was then run through RunNMF (alpha = 0, init = “ica”, loss = “mse”), followed by EmbedSWNE (alpha.exp = 1.25, snn.exp = 1.0, n_pull = n_4, dist.use = “IC”). For this step, the shared nearest neighbour (SNN) matrix calculated by the FindClusters function of Seurat was used.
Differential expression analysis
The Two-sample Likelihood Ratio Test implemented in the LRT function of the MAST R package (v1.4.1)59 was used to identify marker genes for a given sample or cluster. Briefly, each cell was either flagged as either belonging to the sample (or the cluster) or not. Those genes identified as up-regulated in the cluster at FDR <= 0.05 (Benjamini-Hochberg correction)60 and showing an Area Under the Curve (AUC) >= 0.6 were classified as markers for the sample or the cluster. The AUC is an estimate on how accurately a certain gene predicts a cell as part of a certain sample or cluster. AUCs were calculated using the ROCR R package61.
Functional enrichment analyses
For functional enrichment analyses a selected number of gene sets was employed. The fifty Hallmark gene sets from the Molecular Signature Database (MSigDB)38 were downloaded from the MSigDB website October 19, 2017. Gene sets from Puram et al.48 (Table S7), along with a manually curated list of ERα-core target genes (BYSL, GREB1, HEY2, MPHOSPH10, MYB, NIP7, RARA, SLC9A3R1, TFF1, XBP1) were also considered. For a given subset of cells, each gene set was scored separately as the sum of the normalized expression values of all the genes in the set. The resulting distributions were then used for statistical testing and visualization.
Single cell gene regulatory network inference
Networks were inferred separately for CD44high and CD44low cells, with nodes representing genes and edges representing statistical dependencies between gene pairs. For each dataset, genes expressed in fewer than 20% of cells were excluded; then all possible network edges were ranked using the PIDC network inference algorithm37 implemented in NetworkInference.jl (http://github.com/Tchanders/NetworkInference.jl), with expression data for each gene discretized independently into 6 bins of equal width; finally, a network was defined keeping the 2,000 highest ranking edges. The two networks were then superimposed to form an overlapping network with edges belonging (i) only to the CD44high network, (ii) only to the CD44low network, or (iii) to both networks. Communities were detected in the overlapping network (and recursively in each community) using the label propagation method implemented in LightGraphs.jl (http://github.com/JuliaGraphs/LightGraphs.jl). Communities were required to include at least 10 nodes. Similarity of the CD44high and CD44low networks within each community was calculated using the Jaccard index: the number of edges in the community that belong to both the CD44high and CD44low networks divided by the total number of edges in the community; an edge was deemed to belong to a community if it connected two nodes in the community.
Identification of the pre-adapted cells
Two different strategies were employed to identify the pre-adapted cells. The first one takes advantage of SWNE; a threshold was applied on the first component and the cells showing extreme values (>= 0.75) were labeled as pre-adapted. The second strategy leverages random forests classifiers62. First of all, the datasets of CD44high cells in +E2 media and starved conditions (2 days) were split into training and testing sets, using 10% and 90% of the cells, respectively. The training set was then used to call the DEGs between the two conditions (+E2 vs starved), using the procedure described in the Differential expression analysis paragraph above. These DEGs were used as input features to train a random forest classifier, using the randomForest R package (v4.6-14; default parameters). This model was then used to test the remaining data. Those cells in the testing set labelled as +E2 that were showing a probability >50% of being classified as starved were considered pre-adapted.
AUCell39 (R package v1.0.0) was the used to quantify the activity of the pre-adapted signatures (and of other signatures, whenever indicated in the text) in single cells. First of all, normalized data was processed using the AUCell_buildRankings function. The resulting rankings, along with the signatures of interest, were then subject to function AUCell_calcAUC (aucMaxRank set to 5% of the number of input genes). Following inspection of the resulting distributions, thresholds were then manually set to 0.37, 0.18 and 0.32 for the signatures of pre-adapted cells either based on SWNE or random forests, or for the LTED signature (defined as those genes up-regulated in LTED vs MCF7, as described in the Differential expression analysis paragraph above).
Re-analysis of published primary samples
Bulk RNA-seq datasets for 1,222 breast cancer samples were downloaded from the GDC (Genomic Data Commons)40 data portal (http://portal.gdc.cancer.gov/) using gdc-client, according to metadata obtained on July 25, 2018. Gene features were normalized to sequencing depth. Given that only a fraction of the samples was pre-classified using PAM5063, k- nearest neighbors (k-NN) classification was employed to impute the rest of the samples. This was performed via the knn function in the R package class (v7.3-14), using the pre-classified samples as training data. Unclassified samples were ascribed to a particular subtype only when showing >60% probability of being assigned to that class. Spearman’s correlations between expression profiles of pairs of genes were calculated on the depth-normalized values. Prior to calculating signature scores, these numbers were further log2-transformed (pseudo-count set to 1) and scaled to z-score gene-wise.
Re-analysis of published datasets for circulating tumour cells (CTCs) and clusters of CTCs
Normalized data for CTCs collected at five time points from a single patient along with identically processed blood specimens from 10 healthy donors45 were downloaded from GEO (GSE41245). For each capture, the log2-fold-change between EPCAM+ cells and the matched IgG+ cells (control) was calculated. DEGs were defined as those genes showing a linear fold-change between EPCAM+ cells and control >= 1.5. The fraction of DEGs overlapping the genes in the pre-adapted signature was then calculated for each pair. To test if the observed difference between the fraction of DEGs in CTCs and in healthy specimens was random, a P-value was calculated using the Wilcoxon rank-sum test. The corresponding false discovery rate (FDR) was estimated by 1,000 permutations.
Raw data for individual CTC-clusters (median of 3 cells per cluster) and numerically matched pools of single CTCs from the same specimen44 were downloaded from GEO (GSE51827). Each profile was normalized by depth, then a profile-specific score was derived for the signature of the pre-adapted cells by summing the normalized expression values of all genes in the signature. These numbers were then divided by the maximum across all profiles. To test if the observed difference between the values obtained for the clusters against the matched pools of CTCs, a P-value was calculated using the Wilcoxon rank-sum test. The corresponding false discovery rate (FDR) was estimated by 1,000 permutations.
GFP+ cells quantification
A custom Python script (available on request) was employed to segment images based on DAPI (to count the total number of cells) and GFP signal (to quantify the fraction of GFP+ cells).
Data availability
Raw sequencing data was deposited at the Gene Expression Omnibus (GEO) under accession number GSE119693. Reviewers can access the data using token mzgzmeowzfctneb and following the instructions provided at this link: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE122743
Author contribution
SP.H., I.B., and L.M. planned the research; SP.H. and Y.L. performed the experiments; I.B., SP.H., T.C., K.M., and L.M. analysed the data; G.C., N.R., G.P., S.A. and R.C. contributed technical support; SP.H., I.B., and L.M. wrote the manuscript.
Acknowledgments
We want to acknowledge and thanks all patients and their families for the support and for donating the research samples. The authors gratefully acknowledge infrastructure support from the Cancer Research UK Imperial Centre, the Imperial Experimental Cancer Medicine Centre and the National Institute for Health Research Imperial Biomedical Research Centre. L.M. was supported by a CRUK fellowship (P64250) and Imperial Junior Fellowship (G53019). SP.H. was supported by a Basic Science Research Program through the National Research Foundation of Korea (NRF-2013R1A1A1011832) and by a CRUK programme award (CRUK C37/A18784). I.B. was supported by CRUK funding (P64250) and by an Imperial College Research Fellowship. The Imperial College Healthcare NHS Trust Tissue Bank provided tissue samples. Consent was collected at IEO and Imperial College (Project R15036) by the respective Tissue Banks. Other investigators may have received samples from these same tissues. The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health.
Footnotes
The authors have declared no conflict of interest exists.