Automated Raman micro-spectroscopy of epithelial cells for the high-throughput classification

Raman micro-spectroscopy is a powerful technique for the identification and classification of cancer cells and tissues. In recent years, the application of Raman spectroscopy to bladder, cervical, and oral cytological samples has been reported to achieve an accuracy greater than that of standard pathology. However, despite being entirely non-invasive and relatively inexpensive, the technology has not been adopted clinically because of its slow recording time and lack of reproducibility. Here we present an automated Raman cytology system that can facilitate high-throughput screening and improve reproducibility. The proposed system is designed to be integrated directly into the standard pathology clinic, taking into account its methodologies and consumables. The system employs image processing algorithms and integrated hardware/software architectures in order to achieve automation, and is tested using the ThinPrep standard, including the use of glass slides, and a number of bladder cancer cell lines. The entire automation process is implemented using the open-source Micro-Manager platform and is made freely available. We believe this code can be readily integrated into existing commercial Raman micro-spectrometers.

to analyse multi-well plates [18] for the purpose of developing a Raman-based cell viability assay, specifically on the effect of doxorubicin concentration on monocytic THP-1 cells; and (iii) to investigate the effect of the targeted cancer drug panitumumab on colorectal cancer cell lines [20].

Although not a target application of the system proposed here, the study of pathogens, and in particular bacteria, has been another key application area of automated Raman cytology systems in recent years. The automated system described in the previous paragraph has been adapted to record spectra of single isolated neutrophils from human peripheral blood [19], which were stimulated via an in-vitro infection model with heat-inactivated bacterial and fungal pathogens; the system captured 20000 neutrophil spectra, across various treatment groups, originating from three donors. Another system, developed by Douet et al. [22], has been designed to provide automated Raman spectroscopy of individual bacterial cells; once again, this automated system is based on image processing in order to automatically identify the bacterial cell position, followed by alignment of the cell with the excitation laser. In this case, the image processing component relies on the availability of an out-of-focus diffraction pattern facilitated by the use of a spatially coherent illumination source. The recorded image can be described as an in-line digital hologram, which can be subjected to numerical propagation [23] in order to obtain an in-focus image of the sample. The cell position can be identified based only on image contrast, whereby the bacterial cell appears dark against a bright background.

In this paper, we present an automated Raman cytology system with several contributions: (1) This system utilises a simpler image processing component than previous systems, which is based on a single step.
It is shown that this approach can accurately identify epithelial cell nucleus position, which, to the best of our knowledge, has not been a target area for previous automated systems; (2) The system can be applied to unlabelled 'phase-only' adherent cells, which produce low image contrast; (3) The system is demonstrated to work with the ThinPrep standard [5,7,12,24], an established instrument and protocol used to prepare cytology samples in hospital settings, such as for the cervical 'Pap' smear; (4) The method is based around the open-source Micro-Manager platform [25], which is freely available. The associated code is supplied in an online repository [26] and is described in detail in the supplementary information. This approach can be implemented easily on any existing RMS system

Central to the proposed automated Raman cytology system is an image capture and processing methodology that facilitates the rapid identification of cell nuclei, which can subsequently be targeted for RS. The cell nucleus, which contains almost all of the cell DNA, is the primary target for Raman-based classification of epithelial cell type [1-9,11-14]; in some studies the nucleus is targeted at several different points and the resulting spectra are averaged, and in other studies this 'averaging' is achieved optically by using a relatively large laser spot for excitation. However, this approach is complicated by the difficulty in clearly identifying cellular features such as the cell nucleus using brightfield microscopy. Adherent epithelial cells are commonly described as weakly scattering phase-only objects that appear almost transparent when imaged using brightfield microscopy. Although several imaging modalities exist to improve the image contrast of such objects, such as phase contrast and differential interference contrast, these methods cannot reliably be used to identify the cell nucleus.

Furthermore, these modalities require dedicated equipment such as a phase-contrast objective (which includes an annular filter) or polarising optics, which are preferably avoided when using RS. Fluorescence microscopy is the gold standard for identification of the cell nucleus, whereby a fluorescent stain such as 4′,6-diamidino-2-phenylindole

focusing partially-coherent illumination, similar to the effect of a micro-lens. This phenomenon has previously been investigated for the purpose of cell counting [29], which was the inspiration for this paper. In that study, human neuroblastoma cells were imaged in culture using a low-magnification objective in order to capture a large number of cells in the field of view. The authors demonstrated that a green filter and a 130 µm pinhole placed immediately above the culture flask generated bright spots in a defocused plane ≈50 µm from the object plane, which were approximately centred in the fluorescing regions (nuclei) of the cells. This was confirmed using the green fluorescent protein-tagged nuclear histone H2B protein.
In the system proposed in this paper, partially coherent illumination is generated using only the microscope components that would be found in an existing RMS system. Such illumination can easily be obtained by closing the condenser aperture diaphragm such that the spatial coherence of the Köhler illumination is maximised. We also make no attempt to apply a colour filter to the tungsten-halogen lamp in order to enhance the bright-spot contrast. Another important difference in our work with respect to [29] is that the approach taken here makes use of a high magnification/numerical aperture microscope objective, which can resolve several bright spots in the 'focal-plane' of the cell nucleus corresponding to various sub-cellular features. Even with this simplified approach, we demonstrate that it is possible to identify the cell nucleus position with a sufficient degree of accuracy for subsequent targeting with RMS.

The applicability of this method of nuclear targeting is demonstrated for two forms of sample preparation: (1) live adherent epithelial cells in medium, as shown in Fig. 1, and (2) the same cells prepared using the ThinPrep standard, as shown in Fig. 2.

ThinPrep is a clinical standard for the preparation of cyto-histological samples, particularly in the areas of cervical screening and urine cytology. The ThinPrep standard, including the use of associated fixatives and glass slides, has previously been shown to be compatible with Raman micro-spectroscopy [5,7,12,24,30]. HeLa cells were selected for this initial investigation due to their well-known morphology and were prepared as described in Section 3.1.1 and imaged using an IX81 fluorescence microscope as described in Section 3.2. In Fig. 1 (a) the live cells are shown in medium. The image was slightly defocused by displacing the sample approximately 1 µm from the focal plane, and oblique illumination was used in order to improve visualisation of cell boundaries. In Fig. 1 (b) the corresponding fluorescence image using DAPI is shown,

combined. Images (a) and (b) are superimposed together with the positions of the local maxima in image (c), which are shown as yellow targets. The red arrows highlight nuclei that have been just missed by the target. The green arrows highlight cells that appear to have been correctly targeted but did not fluoresce. The brown arrows highlight cases that are just at the edge of the nucleus but which would likely provide meaningful Raman spectra for a laser spot size > 1 µm. Finally, the orange arrows highlight cell nuclei that have been double targeted. Based on an analysis of several such images, we estimate successful (single) targeting of > 75% of HeLa cell nuclei.

A similar analysis was applied to samples prepared using the ThinPrep standard. A brightfield image of these cells is shown in Fig. 2 (a). The cells have a thicker morphology and a greater depth of field than for the adherent case, making it more difficult to record an in-focus image with a high-NA objective. In Fig. 2 (b) the corresponding fluorescence image using DAPI is shown, once again highlighting the nucleus in each cell. It is clear that the nuclei for the ThinPrep case are smaller in area than for the adherent case. In Fig. 2 (c) the bright-spot image is shown. This was obtained by moving the sample up a distance of 14 µm from the focal plane, which was found to provide optimal contrast. Interestingly, the 'focal-length' of the ThinPrep HeLa cells (i.e. the axial displacement providing the highest bright-spot image contrast) is significantly shorter than for the case of adherent cells, owing to their thicker, rounder morphology. In Fig. 2 (d) the information from these three images is combined. Images (a) and (b) are superimposed together with the positions of the local maxima in image (c) following Gaussian filtering, which are shown as yellow targets. The red arrows highlight nuclei that have been just missed by the target and the orange arrows highlight cases that are double targeted. Based on an analysis of several such images, we estimate successful (single) targeting of > 85% of HeLa cell nuclei prepared using the ThinPrep standard.

In this section, we have demonstrated that it is possible to identify cell nucleus position using a standard brightfield microscope with a closed condenser aperture, both for adherent live cells and for cells prepared using the ThinPrep standard. In the next section, we outline a global automation routine for Raman cytology that makes use of this. Although in subsequent sections we focus our results on ThinPrep samples due to their clinical relevance, we have also found that adherent cells work equally well with this automated routine.

The nucleus 'microlens-effect' is used as a basis for targeting of cells for an automated RMS platform. The system uses a conventional Olympus IX81 microscope, controlled by a PC via the IX2-UCB control box, which allows for electronic control of the microscope objective focus position (Z-stage) and the white-light lamp. As illustrated in Fig. 3, the open-source Micro-Manager software system [25] can be used to control the IX2-UCB, as well as several other opto-electronic components in the system, including an inexpensive CMOS digital camera inserted into the eyepiece of the

of the ImageJ library for image processing [31,32]. The entirety of the automation platform for collection of the cell spectra is written using this scripting interface and these scripts are freely provided in an online repository [26]. A key feature of this automation system is that the condenser aperture is closed to a minimum in order to maximise the contrast in the bright-spot image, as described in the previous section.

In order to locate the cells, the Z-position of the bright-spot plane must first be determined. However, as described in the previous section, the position of this plane will vary significantly depending on cell morphology. Therefore, in order to account for different cell morphologies, this plane is found using an auto-focusing routine. A series of spatially coherent light images is recorded across a range of focal positions in the same field of view (FOV). It was found that when the variance of a given FOV is maximised (defined as the square of the standard deviation divided by the mean of the image's pixel intensities), the sample is in the optimal bright-spot plane in terms of image contrast. The sample can also be brought into focus for brightfield imaging (and Raman spectroscopy) by finding the local minimum of the variance closest to the bright-spot plane. This variance response for a given range of focal positions is shown in Fig. 4.
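
The auto-focusing rule above can be sketched in a few lines of Python. The metric follows the definition in the text (squared standard deviation divided by the mean); the function names and the example Z-scan values are hypothetical and only illustrate how the two planes would be selected from a recorded focus series.

```python
import numpy as np


def normalised_variance(image):
    """Focus metric from the text: squared standard deviation over the mean intensity."""
    pixels = np.asarray(image, dtype=float)
    return pixels.var() / pixels.mean()


def find_planes(z_positions, metric):
    """Return (bright-spot Z, focus Z) from a focus scan.

    The bright-spot plane maximises the metric; the focus plane is the
    interior local minimum of the metric nearest to that maximum.
    """
    metric = np.asarray(metric, dtype=float)
    i_bright = int(np.argmax(metric))
    minima = [i for i in range(1, len(metric) - 1)
              if metric[i] < metric[i - 1] and metric[i] < metric[i + 1]]
    if not minima:
        raise ValueError("no local minimum found; widen the Z scan range")
    i_focus = min(minima, key=lambda i: abs(z_positions[i] - z_positions[i_bright]))
    return z_positions[i_bright], z_positions[i_focus]


# Hypothetical scan: one metric value per Z position (micrometres).
z_positions = list(range(0, 20, 2))
metric = [5, 3, 1, 2, 4, 9, 6, 4, 5, 3]
bright_z, focus_z = find_planes(z_positions, metric)
```

In practice the metric list would be filled by calling `normalised_variance` on each image of the focus series; a coarse scan followed by a finer one around the maximum is one possible way to limit the number of images recorded.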

The overall automation routine is illustrated in the flow-chart in Fig. 4 with a series of high-level steps. A more comprehensive low-level description is provided in the

filtered, removing the most clustered cells. By filtering for isolated cells, we improve the quality of the recorded spectra with respect to the targeting accuracy of the cell nucleus; as shown in the previous section, such clustered cells are more likely to be doubly targeted, particularly for the case of ThinPrep slides. The refined list is then sorted using a nearest-neighbour algorithm to limit stage movement, reducing the risk of drift and also speeding up the overall acquisition process. Information from the image stitching process about image overlap relative to stage movement is used to create a coordinate transform, converting the list of cell positions from pixel coordinates to stage positions, which is described in more detail in the Supplementary Information. Once an offset for the laser spot position is included, and the stage is moved into the focus plane, the cells can be targeted by the Raman laser, and spectra can be recorded. The final step in the process is the removal of the baseline and glass spectrum component in these spectra, as described in more detail in Section 3.5.

Fig. 2 (c) and (a). On the right side a high-level flow chart is shown for the overall automated Raman cytology process, which includes moving the sample between these two planes.

L-Glutamine. Flasks were maintained in a humidified atmosphere with 5% CO₂ at 37°C. When the cell lines reached 80% confluency, the culture medium was removed, and the cells were rinsed with sterile PBS. Trypsin-EDTA (0.5%) was added to the flask, which was incubated at 37°C until the cells had completely detached (not exceeding 15 min). An equal volume of 5% serum-containing medium was added to the flask to neutralise the trypsin enzyme.
The entire contents of the flask were transferred into a sterile container and centrifuged at 1200 rpm for 5 min. The supernatant was removed, and the cell pellet was resuspended in fresh medium. This solution was centrifuged at 1200 rpm for 5 min, the medium decanted, and the pellet resuspended in 1 ml PBS. This step was repeated and the cell

Raman spectra from the cells described in Section 3.1 were recorded using a custom-built Raman micro-spectroscopy system, which is illustrated in Fig. 3. This system employs a 150 mW laser with a wavelength of 532 nm and a coherence length ≈100 m (Torus, Laser Quantum), which is driven by a power supply unit (mpc3000, Laser Quantum) that is controllable over an RS232 cable with serial commands via the Micro-Manager 'freeserialport' device adapter. The system also employs a

isolates the signal from the cell nucleus in three dimensions, and minimises background noise from the glass slides, as well as from optical elements in the system. The system is designed to provide a spatial resolution of ≈2-3 µm² from the cell nucleus. A long pass filter (F, LP03-532RU-25, Semrock) and a dichroic beamsplitter (DB1, LPD-01-532RS, Semrock) are also used to block the laser wavelength from reaching the spectrograph, while transmitting the longer Raman scattered wavelengths. A dichroic short pass filter (DB1, 69-202, Edmund Optics) permits imaging of the sample to a digital camera (CMOS, MU300, AmScope). All spectra were recorded using the Andor Solis software plugin for Micro-Manager. The system was wavenumber calibrated using a polymer standard as described in [33]. No intensity calibration was performed for this experiment since all spectra were recorded from the same system. The aperture in the microscope condenser (C, U-UCD8, Olympus) was closed to a minimum in order to maximise the spatial coherence of the illumination.

The automation process described in Section 2 is capable of recording a large number of spectra. In total, spectra were recorded from 577 different HT1197 cell nuclei deposited on CaF₂ (for the purpose of removing the glass spectrum as described in the following section), 6426 different HT1197 cell nuclei deposited on glass, and 7499 different RT112

Raw spectra were first input to a cosmic ray removal algorithm [34]; this algorithm is capable of removing cosmic rays from a spectrum by identifying a closely matching spectrum in the dataset, and replacing each cosmic ray with the corresponding wavenumber intensity values in that matching spectrum.
It is, therefore, not necessary to apply the commonly used double acquisition cosmic ray removal method to obtain a closely matching spectrum [35], which was found to be problematic for the high-speed spectral acquisition applied in the automation process.
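
The idea of replacing cosmic-ray samples with values from a closely matching spectrum can be sketched as follows. This is an illustrative simplification and not the algorithm of [34]: choosing the match by correlation, scaling it by the ratio of medians, and flagging spikes with a median-absolute-deviation threshold (`k = 8`) are all assumptions made for this example.

```python
import numpy as np


def remove_cosmic_rays(spectra, k=8.0):
    """Replace spike samples in each spectrum using its best-matching neighbour.

    spectra: 2-D array, one spectrum per row. The matching rule, scaling
    and threshold k are illustrative assumptions, not taken from [34].
    """
    spectra = np.asarray(spectra, dtype=float)
    cleaned = spectra.copy()
    corr = np.corrcoef(spectra)        # correlation between every pair of rows
    np.fill_diagonal(corr, -np.inf)    # a spectrum may not match itself
    for i, spectrum in enumerate(spectra):
        match = spectra[int(np.argmax(corr[i]))]
        # The ratio of medians is insensitive to a few very large samples.
        scale = np.median(spectrum) / np.median(match)
        residual = spectrum - scale * match
        centre = np.median(residual)
        mad = np.median(np.abs(residual - centre))
        # Only positive outliers are treated as cosmic rays.
        spikes = residual > centre + k * 1.4826 * mad
        cleaned[i, spikes] = scale * match[spikes]
    return cleaned


# Synthetic example: three scaled copies of one band, one with a spike.
channel = np.arange(100)
base = 2.0 + 5.0 * np.exp(-0.5 * ((channel - 50) / 4.0) ** 2)
demo = np.array([a * base + 0.01 * np.sin(0.7 * channel + i)
                 for i, a in enumerate([1.0, 1.2, 0.9])])
demo[0, 10] += 50.0                    # simulated cosmic ray
cleaned = remove_cosmic_rays(demo)
```

Because the replacement values come from a different cell, an approach of this kind relies on the dataset containing many similar spectra, which is the situation produced by the automated acquisition.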

Cosmic ray removal was followed by an Extended Multiplicative Signal Correction (EMSC) algorithm [24,36,37], which removes the variable background signal from each spectrum. The EMSC algorithm estimates this background using an N-order polynomial (to remove the baseline signal that results from the cell's auto-fluorescence) and also, if required, the background signal from the substrate. Briefly described, the EMSC algorithm applies a least squares fit to (i) a low-noise, contaminant-free reference Raman spectrum from a cell; (ii) an N-order polynomial; and (iii) a reference spectrum taken from the substrate; this is required for the case of glass substrates but not for the case of Raman-grade CaF₂. The algorithm returns the weight of (i), which enables normalisation of the spectrum relative to the reference, as well as the total background made up of the appropriately weighted substrate spectrum plus the polynomial. The EMSC-corrected spectrum, X, is given by:

$$X = \frac{X_0 - c_b B - \sum_{m=0}^{N} c_m P_m}{c_r}$$

where X_0 is the raw data, B is the reference spectrum of the substrate, c_b is the weight of the reference substrate spectrum, P_m denotes the m-th order of the polynomial, c_m is the corresponding polynomial coefficient, and c_r is the weight of the cell reference spectrum, R. In summary, X_0 can be described as the linear (weighted) superposition of R, B, and P. It has been shown that the use of a high-order polynomial does not result in over-fitting with the EMSC algorithm [37]. For this study, a fifth-order polynomial was used in the EMSC-correction algorithm for all datasets.
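
The least squares fit at the heart of the EMSC correction can be sketched in Python as below. This is a minimal illustration under stated assumptions rather than the implementation used in the study: the function name is hypothetical, the polynomial is built on a wavenumber axis rescaled to [-1, 1] for numerical conditioning, and the spectra are synthetic.

```python
import numpy as np


def emsc_correct(raw, reference, substrate=None, poly_order=5):
    """Fit raw ~ c_r*R + c_b*B + sum_m c_m*P_m and return the corrected spectrum.

    Returns (corrected, coeffs) with coeffs ordered as
    [c_r, c_b (if a substrate is given), c_0, ..., c_N].
    The rescaled axis x in [-1, 1] is an assumption made for conditioning.
    """
    raw = np.asarray(raw, dtype=float)
    x = np.linspace(-1.0, 1.0, raw.size)
    columns = [np.asarray(reference, dtype=float)]
    if substrate is not None:
        columns.append(np.asarray(substrate, dtype=float))
    polynomial = np.vander(x, poly_order + 1, increasing=True)  # x^0 .. x^N
    design = np.column_stack(columns + [polynomial])
    coeffs, *_ = np.linalg.lstsq(design, raw, rcond=None)
    c_r = coeffs[0]
    background = design[:, 1:] @ coeffs[1:]   # substrate plus polynomial baseline
    return (raw - background) / c_r, coeffs


# Synthetic example: narrow "cell" bands, a broad "glass" band, quadratic baseline.
x = np.linspace(-1.0, 1.0, 400)
reference = (np.exp(-0.5 * ((x + 0.5) / 0.02) ** 2)
             + 0.6 * np.exp(-0.5 * ((x - 0.3) / 0.03) ** 2))
substrate = np.exp(-0.5 * ((x - 0.6) / 0.15) ** 2)
raw = 2.0 * reference + 0.5 * substrate + 1.0 + 0.3 * x - 0.2 * x ** 2
corrected, coeffs = emsc_correct(raw, reference, substrate)
```

For spectra recorded on Raman-grade CaF₂, the same function would be called without the `substrate` argument, so that only the reference and polynomial terms are fitted.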

The reference cell spectrum provides the basis for all of the spectra to be fitted; the reference spectrum used here is the mean spectrum of the highest quality 50 spectra from the HT1197 dataset recorded on the CaF₂ substrate. No smoothing was applied in this case. The background signal from Raman-grade CaF₂ substrates is flat in the fingerprint region [38] and, therefore, this substrate was selected from which to obtain the high-quality reference cell spectrum to be used in the EMSC-correction of the spectra recorded on the glass substrates. In order to remove any potential bias, the same reference spectrum was used for the EMSC algorithm applied to process the spectra of all cells deposited on glass. It has been demonstrated that equivalent results, in terms of the multivariate statistical analysis that follows, will be obtained when using significantly different reference spectra, so long as the same reference spectrum is applied in EMSC-correction of all datasets [?]. Also input to the EMSC-correction algorithm is a glass spectrum, which is the mean of the spectra recorded from the substrate, followed by Savitzky-Golay smoothing using a polynomial order of 3 and a window size of 9.

Following EMSC correction, the spectra were denoised using a Savitzky-Golay based algorithm [39] with a polynomial order of 3 and a window size of 7. The resulting datasets of cell spectra were filtered in order to remove lower-quality spectra. This was achieved by removing all spectra that yielded a Pearson correlation coefficient of less than 0.99 with respect to the reference cell spectrum, which resulted in a culling of approximately 50% of the data. A similar approach has previously been applied, albeit with a lower coefficient value, in order to extract spectra with high signal-to-noise ratios from a large dataset [15]. The results of the various pre-processing steps described in this section are presented in Section 4.

In order to comprehensively evaluate the capability of the automated system to accurately classify the low- and high-grade urinary bladder carcinoma epithelium cells, and to elucidate the underpinning differences in their biochemical composition, a range of machine learning classification techniques were tested to discriminate the two spectral datasets, following the pre-processing steps outlined in Section 3.5. The algorithms considered were as follows: Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), k-Nearest Neighbours (kNN), Random Forest (RF) [40], Support Vector Machine (SVM) [41] and Partial Least Squares (PLS) [42]. The classifiers were combined with two pre-processing steps, namely Principal Component Analysis (PCA) [43] and Marginal Relevance (MR) for wavelength selection [44]. PCA obtains lower dimensional projections of the data in the feature space. The new features (Principal Components, PCs) represent directions in the observation space along which the data have the highest variability.
In contrast, Marginal Relevance (MR) is a criterion that ranks the wavelengths in order of their capability to discriminate between the classes. The MR score for each wavelength is the ratio of the between-class to within-class sum of squares. In this approach, each wavelength is considered independently of the others, and neighbouring wavelengths have similar MR scores.
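
The MR score defined above can be computed per wavelength with a short Python sketch. The analysis in this work was carried out in R using the BKPC package; the function below is only an illustrative re-statement of the ratio of between-class to within-class sums of squares, with a hypothetical function name and toy data, and it assumes every wavelength has non-zero within-class variation.

```python
import numpy as np


def marginal_relevance(spectra, labels):
    """Between-class / within-class sum of squares for each wavelength (column).

    Assumes every column has non-zero within-class variation; otherwise the
    ratio is undefined (division by zero).
    """
    spectra = np.asarray(spectra, dtype=float)
    labels = np.asarray(labels)
    grand_mean = spectra.mean(axis=0)
    between = np.zeros(spectra.shape[1])
    within = np.zeros(spectra.shape[1])
    for label in np.unique(labels):
        group = spectra[labels == label]
        group_mean = group.mean(axis=0)
        between += group.shape[0] * (group_mean - grand_mean) ** 2
        within += ((group - group_mean) ** 2).sum(axis=0)
    return between / within


# Toy data: column 0 carries no class information, column 1 separates the classes.
toy_spectra = np.array([[0.0, 1.0],
                        [2.0, 1.2],
                        [0.0, 3.0],
                        [2.0, 3.2]])
toy_labels = np.array([0, 0, 1, 1])
scores = marginal_relevance(toy_spectra, toy_labels)
ranking = np.argsort(scores)[::-1]   # wavelengths ordered from most to least relevant
```

Because adjacent wavelengths tend to have similar scores, taking the top-ranked wavelengths directly would select many neighbours from the same band; selecting the highest-scoring feature from each of several regions, as done in this work, avoids that redundancy.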

All of the analysis was done in R, a free software environment for statistical computing and graphics [45]. LDA and QDA are implemented using the MASS [46] package and require no parameter tuning. PLS is implemented in the package pls [47] and the number of principal components was set to 15. SVM is implemented in the package kernlab [48]. A Gaussian kernel was used with the bandwidth parameter value set at an empirical estimate suggested by [49]. kNN and RFs are implemented in the packages class [46] and ranger [50], respectively. The values for the tuning parameters of these models were obtained by cross-validation of the training set.

Ten-fold cross-validation was used to estimate the performance of the models on new data. In this application, PCA was carried out on the training set of each cross-validatory split of the data. Three PCs were used as input features to the LDA, QDA and kNN classifiers. The Marginal Relevance (MR) criterion is implemented in the R package BKPC [51]. For all datasets, the highest scoring features from 7 regions were taken as inputs into the following classification algorithms: LDA, QDA, kNN, RF and SVM. RF, SVM and PLS were also trained on all wavelengths without any dimension reduction pre-processing steps.

Following application of the Pearson correlation coefficient filter, the two spectral datasets of 6426 HT1197 (high-grade bladder cancer) cell spectra and 7499 RT112 (low-grade bladder cancer) cell spectra are reduced in number to 3583 and 3701 cell spectra, respectively. This corresponds to retention of 56% and 49%, respectively. The raw

Fig 5. Raw and processed data from high- and low-grade bladder cancer cell lines. Raw spectra recorded using the proposed automated Raman micro-spectrometer, following cosmic ray removal; (a) 3583 spectra taken from individual HT1197 cell nuclei; and (b) 3701 spectra taken from individual RT112 cell nuclei; (c) The pre-processed HT1197 dataset, below which the mean spectrum is shown with a black line around which the standard deviation of the dataset is shown using a shaded grey colour; (d) The pre-processed RT112 dataset, below which the mean spectrum is shown with a green line around which the standard deviation of the dataset is shown using a shaded light green colour.

[5,11,52] and two new peaks that have not previously been identified in the analysis of bladder tissue: 1424 cm⁻¹ (deoxyribose) and 1490 cm⁻¹ (DNA) [53].

Classification accuracy, sensitivity, and specificity for the eleven classification approaches are given in Table 1, and Fig. 7 shows box-plots of classification accuracy over the test sets in ten-fold cross-validation for the eleven classifiers. The comparative analysis suggests that the classifiers PLS, SVM, and RF, with no (statistical) pre-processing steps, consistently perform better for classification of these datasets. MR seems to be more effective than PCA in selecting the features.
We note that these sensitivities and specificities are the highest reported to date in the literature in the classification of high- and low-grade bladder cancer cell lines, likely owing to the increased reproducibility of the recording process as well as the increased dataset size.

Table 1. Classification accuracy, sensitivity, and specificity for the eleven classifiers: LDA, QDA and kNN applied after pre-processing with PCA and MR. RF and SVM were trained after pre-processing with MR and on the entire data without any pre-processing for dimension reduction. PLS was applied without any pre-processing.

Classification accuracy of the ten test sets in ten-fold cross-validation for the eleven classifiers.

The automated Raman cytology system presented here is focused on recording spectra from the unlabelled nucleus of epithelial cells. Automated targeting of the nucleus is a key differentiator of the proposed technology from other automated Raman cytology systems, which have recently been proposed and which target the centre of the cell mass [15,16]. The basis of our approach is to move the focus of the microscope between two planes: the traditional image plane at which Raman spectra are recorded, and the 'bright-spot plane' some tens of micrometres from the image plane, where the 'micro-lens' effect of the cell nuclei focuses the partially coherent microscope illumination and produces bright spots approximately co-located with the cell nuclei, thereby facilitating subsequent RS. Our approach builds on the work of Drey et al. [29], who first identified the phenomenon and utilised it for the purpose of cell counting of live cell cultures. The 'focal-length' of the nucleus, i.e. the distance between the microscope focal plane and the 'bright-spot plane', appears to depend on the spatial variation in cell thickness, with a thicker, rounder morphology resulting in the focusing of light several micrometres behind the sample plane, and a thinner, flatter morphology focusing the light several tens of micrometres behind the sample plane.

We have shown that in addition to live adherent epithelial cells, the method can also be applied to identify the nucleus of cells prepared using the ThinPrep standard, which includes the use of specific fixatives and glass slides. Appending such a widely

approximately 75% of cells.
We believe this could be improved upon by using a colour filter on the illumination lamp to increase bright-spot contrast as in [29], or, for the case of live cells, by immersing the cells in phosphate-buffered solution before imaging, which was shown in [29] to have the effect of swelling the cell body and significantly enhancing bright-spot contrast.

The overall throughput of the method is demonstrated at approximately 0.1 cell/sec, which is slower than the method proposed by Schie et al. [15], which can record at a rate of approximately 1.4 cell/sec. However, this comparison must be qualified by the cell types that have been investigated by the two systems. In [15] the authors applied their system to lymphocytes, neutrophils, and monocytes, which are considerably thicker than the epithelial cells investigated in this paper, and which, therefore, produce more intense Raman scattering, permitting a shorter acquisition time. Furthermore, they use a significantly more powerful laser source of 400 mW, albeit at a longer wavelength of 785 nm that will produce less Raman scattering [38]. Notably, they also use a Pearson correlation coefficient of 0.95 to cull their noisy spectra, while we use a coefficient value of 0.99. We have found that an acquisition time of 5 s (increasing the throughput to 0.2 cells/sec) could be used if we apply a value of 0.95. We believe that it may be possible to further reduce acquisition time by increasing the laser power, distributed over a larger area of the cell nucleus, and by using advanced denoising methods based on machine learning. With these approaches, we believe it may be possible to achieve 1 cell/sec throughput, and this will be a subject of our future work.

Although we have demonstrated the applicability of the automated identification of the nucleus for both live cells and cells prepared using ThinPrep, we focus our experiments in Section 4 on bladder cancer cell lines using the latter. The application of the proposed system to clinical cytology samples is the primary motivation of our work. An important feature of our automated platform is the use of 532 nm excitation and the removal of the glass spectrum, which is unavoidable for clinical deployment of the technology. We have identified at least three clinical areas that could potentially benefit from the proposed technology, all of which have been shown to be diagnostically improved by Raman spectroscopy. The most common branch of cytology is the 'Pap smear', used to screen for precancerous cervical lesions; application of Raman to cervical cytology samples has received significant attention in the literature [7,11-13].

Cytological inspection is also common for bladder cancer, whereby epithelial cells are extracted from urine, though this is often used as an adjunct to cystoscopy due to the low sensitivity (20%) for low-grade carcinoma, which accounts for the majority of cases.

RMS has been demonstrated to improve the sensitivity of urine cytology to >90% [3,4,6] and our group has actively researched a methodology for RMS to be integrated into the pathology lab [5]. Oral cancer is one of the most common cancers worldwide, with tumours located around the tongue and mouth. Late-stage oral cancer is straightforward to diagnose with histological analysis of tissue biopsy, but results in poor outcomes. RMS has been demonstrated to successfully identify precancerous tissue by investigating epithelial cells [8,14]. All three of these clinical areas require only non-invasive procedures to retrieve cells, and although each of them benefits from improved classification using RMS, clinical adoption has been slow due to the slow throughput of Raman and issues with reproducibility. We believe that the system proposed here can solve these issues and facilitate clinical adoption.

with relatively little effort. We hope this approach will help to advance automated Raman cytology for clinical applications.

Micro-Manager software system and can readily be downloaded and adapted for existing RMS systems. We hope that this work will provide the much-needed throughput and reproducibility to finally advance Raman cytology into routine clinical practice.