Disseminating cells in human oral tumours acquire an EMT cancer stem cell state that is predictive of metastasis

Cancer stem cells (CSCs) undergo epithelial-mesenchymal transition (EMT) to drive metastatic dissemination in experimental cancer models. However, tumour cells undergoing EMT have not been observed disseminating into the tissue surrounding human tumour specimens, leaving the relevance to human cancer uncertain. We have previously identified both EpCAM and CD24 as markers of EMT CSCs with enhanced plasticity. This afforded the opportunity to investigate whether retention of EpCAM and CD24 alongside upregulation of the EMT marker Vimentin can identify disseminating EMT CSCs in human tumours. Examining disseminating tumour cells in over 12,000 imaging fields from 84 human oral cancer specimens, we see a significant enrichment of single EpCAM, CD24 and Vimentin co-stained cells disseminating beyond the tumour body in metastatic specimens. Through training an artificial neural network, these predict metastasis with high accuracy (cross-validated accuracy of 87-89%). In this study, we have observed single disseminating EMT CSCs in human oral cancer specimens, and these are highly predictive of metastatic disease.


Introduction
In multiple types of carcinoma, cancer stem cells (CSCs) undergo epithelial-mesenchymal transition (EMT) to enable metastatic dissemination from the primary tumour ( from studies using murine models and human cancer cell line models. However, this process has not been observed in human tumours in the in vivo setting, leading to uncertainty over the relevance of these findings to human tumour metastasis (Bill and Christofori, 2015; Williams et al., 2019). A key complication with efforts to study metastatic processes in human tumours is the inability to trace cell lineage. As cancer cells exiting the tumour downregulate epithelial markers whilst undergoing EMT, they become indistinguishable from the mesenchymal non-tumour cells surrounding the tumour (Li and Kang, 2016). Therefore, once these cells detach from the tumour body and move away they are lost to analysis. Attempts have been made to use the retention of epithelial markers alongside acquisition of mesenchymal markers to identify cells undergoing EMT in human tumours (Bronsert et  new tumour growth at secondary sites, and therefore retained plasticity manifested as ability to revert to an epithelial phenotype is an important feature of metastatic CSCs (Ocana et al., 2012; Tsai et al., second marker of plastic EMT CSCs, and Vimentin as a mesenchymal marker to identify cells that have undergone EMT. Notably, CD44 cannot be used as an EMT marker in the context of intact tissue as it requires trypsin degradation in order to yield differential expression in EMT and epithelial populations (Biddle et al., 2013; Mack and Gires, 2008). Vimentin, on the other hand, accurately distinguishes EMT from epithelial tumour cells in immunofluorescent staining protocols (Biddle et al., 2016).
By combining EpCAM as a tumour lineage and EMT CSC marker, Vimentin as a mesenchymal marker, and CD24 as a plastic EMT CSC marker, we aimed to identify tumour cells that have undergone EMT and disseminated into the surrounding stromal region. For this, we developed a protocol for automated 4-colour (3 markers + nuclear stain) immunofluorescent imaging and analysis of entire histopathological slide specimens, to test for co-localisation of the 3 markers in each individual cell across each specimen.

To determine whether this marker combination identifies EMT CSCs, we initially tested the protocol on the CA1 OSCC cell line and an EMT CSC sub-line that is a derivative of this cell line (EMT-stem sub-line) (Biddle et al., 2016). EpCAM+Vim+CD24+ cells were greatly enriched in the EMT-stem sub-line, comprising 41% of the population, compared to 2.1% in the CA1 line (Figure 1A, B, E). Cells with this staining profile were absent from normal keratinocyte culture and cancer associated fibroblast culture (Supplementary Figure S1). To test the specific role of EpCAM retention, we replaced EpCAM with a pan-keratin antibody against epithelial keratins. There was very little Pan-keratin+Vim+CD24+ staining, and no enrichment for Pan-keratin+Vim+CD24+ cells in the EMT-stem sub-line (Figure 1C, D, E). Therefore, whilst epithelial keratins are lost, EpCAM is retained in cells undergoing EMT and an EpCAM+Vim+CD24+ staining profile can be used as a marker for EMT CSCs in immunofluorescent staining protocols.

Imaging the tumour body and adjacent stroma in sections of human OSCC specimens, we detected single cells co-expressing EpCAM, Vimentin and CD24 in the stromal region surrounding the tumour (Figure 1F), confirming that these cells can be detected in human tumour specimens. We next stratified 24 human primary OSCC specimens into 12 tumours that had evidence of lymph node metastasis or perineural spread, and 12 that remained metastasis free (Supplementary Figure S2), and stained them for EpCAM, Vimentin and CD24. Single cells co-expressing EpCAM, Vimentin and CD24 were abundant in the stroma surrounding metastatic tumours. This was not the case in non-metastatic tumours or normal epithelial regions (Figure 2, A-C). In contrast to EpCAM, pan-keratin staining did not identify cells in the stroma surrounding metastatic tumours (Figure 2D).

We developed an image segmentation protocol that separated the tumour body from the adjacent stroma, thus enabling each nucleated cell to be assigned to either the tumour or stromal region in automated image analysis (Figure 2E). Expression of EpCAM, Vimentin and CD24 was then analysed for every nucleated cell in every imaging field that included both tumour and stroma (3500 manually curated imaging fields across the 24 tumours). This enabled the proportion of each cell type in each region to be quantified (Figure 2F). EpCAM+Vim+CD24+ cells were enriched in the stroma compared to the tumour body, and there was a much greater accumulation of EpCAM+Vim+CD24+ cells in the stroma of metastatic tumours compared to non-metastatic tumours. Interestingly, this was not the case for EpCAM+Vim+CD24- cells, which were also enriched in the stroma but showed no difference between metastatic and non-metastatic tumours. Pan-keratin+Vim+CD24+ cells were not detected.
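The per-cell quantification described above can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the `Cell` record, the positivity thresholds and the example grey levels are all hypothetical, and the sketch assumes that each nucleated cell already carries a region label from the segmentation step.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical positivity cut-offs on per-cell grey levels; the source does
# not state the thresholds that were actually used.
THRESHOLDS = {"epcam": 50.0, "vimentin": 50.0, "cd24": 50.0}


@dataclass
class Cell:
    region: str      # "tumour" or "stroma", assumed to come from segmentation
    epcam: float     # grey level of each marker for this nucleated cell
    vimentin: float
    cd24: float


def phenotype(cell: Cell) -> str:
    """Return a label such as 'EpCAM+Vim+CD24-' for one cell."""
    def sign(marker: str) -> str:
        return "+" if getattr(cell, marker) > THRESHOLDS[marker] else "-"

    return f"EpCAM{sign('epcam')}Vim{sign('vimentin')}CD24{sign('cd24')}"


def percent_by_region(cells, label):
    """Percentage of cells in each region that carry the given phenotype."""
    totals = Counter(c.region for c in cells)
    hits = Counter(c.region for c in cells if phenotype(c) == label)
    return {region: 100.0 * hits[region] / n for region, n in totals.items()}


# Made-up example cells: four in the tumour body, two in the stroma.
cells = [
    Cell("tumour", 200, 10, 5),
    Cell("tumour", 180, 90, 80),
    Cell("tumour", 220, 15, 60),
    Cell("tumour", 210, 5, 5),
    Cell("stroma", 120, 150, 90),
    Cell("stroma", 5, 200, 10),
]
result = percent_by_region(cells, "EpCAM+Vim+CD24+")
print(result)
```

Expressing the result as a percentage of all nucleated cells in a region, rather than as a raw count, keeps regions of very different size comparable.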

To extend this analysis, we stained and imaged a further 60 tumours, evenly stratified on the same criteria. These displayed the same evidence of individual disseminating cells co-expressing EpCAM, Vimentin and CD24 in metastatic tumours only (Figure 2G and Supplementary Figure S3F

To explore whether these EpCAM+Vim+CD24+ cells in the stroma may in fact be non-tumour cell types, we analysed a published scRNAseq dataset for human head and neck cancer (Puram et al., 2017). In this dataset, tumour and non-tumour cells were separated using bioinformatic techniques (principally inferred CNV and a 'tumour-epithelial' expression signature). Analysing this dataset for EpCAM, Vimentin and CD24 co-expression, we found that 12% of tumour cells (267/2215) were EpCAM+Vim+CD24+. In the non-tumour cells, only 0.8% (29/3687) were EpCAM+Vim+CD24+ (Supplementary Figure S4). Therefore, the observed EpCAM+Vim+CD24+ cells in our tumour specimens are highly likely to be a tumour cell population. Indeed, use of EpCAM as a tumour lineage marker is specifically intended to exclude staining for stromal constituents. EpCAM is a specific epithelial marker that is not expressed in stromal or immune cells; it is expressed exclusively in epithelia and epithelial-derived tumours (

OSCC is an important health burden and one of the top ten cancers worldwide, with over 300,000 cases annually and a 50% 5-year survival rate. There is frequent metastatic spread to the lymph nodes of the neck; this is the single most important predictor of outcome and an important factor in treatment decisions (Sano and Myers, 2007). If spread to the lymph nodes is suspected, OSCC resection is accompanied by neck dissection to remove the draining lymph nodes, a procedure with significant morbidity.
At presentation it is currently very difficult to determine which tumours are metastatic, and this results in sub-optimal tailoring of treatment decisions. Accurate prediction of metastasis would therefore have great potential to improve clinical management of the disease and reduce both mortality and treatment-related morbidity. We sought to determine whether the EpCAM+CD24+Vim+ staining pattern could be predictive of metastasis.

Starting with the EpCAM, Vimentin and CD24 immunofluorescence grey levels for each nucleated cell, we used a supervised machine learning approach to predict whether an imaging field comes from a metastatic or non-metastatic tumour (Figure 5A). As a benchmark we used the pan-keratin, Vimentin and CD24 immunofluorescence grey levels, as we hypothesised that pan-keratin would provide inferior predictive value to EpCAM given that there was no dissemination of pan-keratin expressing cells in the stroma. 3500 imaging fields containing 2,640,000 total nucleated cells from 24 tumour specimens were used in the machine learning task. We compared the performance (10-fold cross-validated F-score) of different machine learning classification algorithms. The best performing classifiers for EpCAM, Vimentin and CD24 were the artificial neural network (ANN) and support vector machine (SVM), with F1 scores of 91% and 87% respectively (Figure 5B). For the ANN, the area under the curve (AUC) was 87%, with 94% sensitivity and 82% specificity. Training with Pan-keratin, Vimentin and CD24 gave much worse prediction across all classifiers (Figure 5C). These findings demonstrate that, utilising a machine learning algorithm, staining for EpCAM, Vimentin and CD24 can predict metastatic status with high accuracy and may therefore have clinical utility.
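The classifier comparison described above can be illustrated with a short scikit-learn sketch. It rests on several assumptions that are not taken from the study: the data are synthetic, each imaging field is reduced to three hypothetical summary features, the hyper-parameters are arbitrary, and scikit-learn's `MLPClassifier` stands in for the Keras neural network. It shows only the general pattern of scoring several classifiers by mean 10-fold cross-validated F1.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data: one row per imaging field, three hypothetical
# summary features (for example mean EpCAM, Vimentin and CD24 grey levels).
rng = np.random.default_rng(0)
n_fields = 200
y = np.repeat([0, 1], n_fields // 2)   # 0 = non-metastatic, 1 = metastatic
X = rng.normal(size=(n_fields, 3)) + 0.8 * y[:, None]

# Illustrative classifiers with arbitrary settings (assumptions, not the
# tuned models from the study).
classifiers = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "svm": SVC(),
    "ann": MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0),
}

# 10-fold stratified cross-validation, scored by F1 on the metastatic class.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = {}
for name, clf in classifiers.items():
    model = make_pipeline(StandardScaler(), clf)  # scale inside each fold
    fold_f1 = cross_val_score(model, X, y, cv=cv, scoring="f1")
    scores[name] = float(fold_f1.mean())

for name, score in scores.items():
    print(f"{name}: mean 10-fold F1 = {score:.2f}")
```

Placing the scaler inside the pipeline means it is refitted on the training portion of every fold, so no information from the held-out fold leaks into preprocessing.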

To extend this analysis of utility for metastasis prediction, we stained and imaged a further 60 tumours, evenly stratified on the same criteria, for EpCAM, Vimentin and CD24. Over 9000 imaging fields at the tumour-stroma boundary from these specimens, containing over 8.5 million nucleated cells, were fed into an artificial neural network machine learning task. For this task, we recorded the predictive accuracy from the training and validation sets after each training epoch, which showed good alignment and an 89% accuracy score after 12 training epochs (Figure 5D).

Machine learning for prognostic prediction using immunofluorescent staining data

A dataset was created of a pool of 2,640,000 nucleated cells across 3500 imaging fields from 24 tumour specimens (12 with lymph node metastasis or perineural spread, and 12 without) (batch 1) or 8,563,000 nucleated cells across 9,200 imaging fields from 60 tumour specimens (30 with lymph node metastasis or perineural spread, and 30 without) (batch 2). The background threshold for the FITC, CY3 and CY5 channels was subtracted from the grey level intensities for each nucleated cell. The supervised machine learning task was to classify each imaging field into whether it belonged to a metastatic or non-metastatic tumour.
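The background subtraction step described above can be sketched with NumPy. Everything numeric here is a made-up assumption: the per-channel background values, the example grey levels, the clipping of negative values to zero, and the use of the per-channel mean as a field-level feature. The source does not say how per-cell values were aggregated into an input for each imaging field.

```python
import numpy as np

# Hypothetical background thresholds for the FITC, CY3 and CY5 channels.
background = np.array([12.0, 20.0, 8.0])

# Grey levels for one imaging field: one row per nucleated cell,
# one column per channel (FITC, CY3, CY5). Values are illustrative.
field_grey_levels = np.array([
    [110.0, 25.0, 40.0],
    [10.0, 95.0, 6.0],
    [60.0, 70.0, 58.0],
])

# Subtract the channel background from every cell. Clipping negative
# results to zero is an assumption made for this sketch.
corrected = np.clip(field_grey_levels - background, 0.0, None)

# One possible field-level feature vector: the mean corrected intensity
# per channel across all cells in the field.
field_features = corrected.mean(axis=0)
print(corrected)
print(field_features)
```

Other aggregations, such as per-field histograms or the fraction of marker-positive cells, would fit the same pattern; the mean is used here only because it is the simplest.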

The dataset was split into training and validation cohorts in a 70%:30% ratio using a seeded random split. Supervised machine learning approaches were implemented using the scikit-learn libraries for Python 3.6 (Pedregosa et al., 2011) and the TensorFlow/Keras framework (https://www.tensorflow.org/api_docs/python/tf/keras/models). Hyper-parameter optimisation was performed by an exhaustive grid search and computed on Apocrita, a high performance computing (HPC) facility at Queen Mary University of London (http://doi.org/10.5281/zenodo.438045). To further minimise overfitting, 10-fold cross-validation was performed and the mean accuracy metric, F1 score, was obtained for each learning iteration. Receiver operating characteristic (ROC) curves and the area-under-the-curve (AUC) were computed for the optimum supervised learning algorithm. Supervised approaches used were logistic regression, support vector machines (Smola and Scholkopf, 2004), Naïve Bayes (Zhang, 2005), K-Nearest Neighbours (Bentley, 1975

Quantification of the percentage of EpCAM+Vim+CD24+ and pan-keratin+Vim+CD24+ cells in the CA1 cell line and EMT-stem sub-line. Significance is obtained from a two-tailed Student's t-test. The graph shows mean +/- 95% confidence interval. F, Detection of EpCAM+Vim+CD24+ cells in the stroma surrounding an oral cancer tumour specimen. The white arrow highlights an EpCAM+Vim+CD24+ cell in the stroma. The red arrow highlights an EpCAM+Vim+CD24- cell in the stroma. DAPI nuclear stain is blue. Below inset; enlargement of the highlighted cells for each marker. Scale bars = 100 µm.

generation of an 'EpCAM dense cloud' to distinguish the tumour body from the stroma. Grey level intensities for EpCAM, Vimentin and CD24 were obtained for every nucleated cell in each imaging field.
F, Quantification of the percentage of EpCAM+Vim+CD24+, EpCAM+Vim+CD24- and pan-keratin+Vim+CD24+ cells in normal region (epithelium distant from the tumour), tumour body, and stromal region from metastatic and non-metastatic tumours in the first batch of specimens. A student