Abstract
Background One in eight men will be diagnosed with prostate cancer (PCa) during their lifetime. While the Gleason score is the current clinical standard prognostic marker for PCa, it is subject to inter-reviewer variability. This study compares two machine learning methods for discriminating between high- and low-grade PCa on histology from 47 PCa patients.
Methods Digitized slides were annotated by a GU fellowship-trained pathologist. High-resolution tiles were extracted from annotated and unlabeled tissue. Glands were segmented, and pathomic features were calculated and averaged across each patient. Patients were separated into a training set of 31 patients (Cohort A, n=9345 tiles) and a testing set of 16 patients (Cohort B, n=4375 tiles). Tiles from Cohort A were used to train a compact classification ensemble model and a ResNet model to discriminate tumor; predictions were compared to pathologist annotations.
Results The ensemble and ResNet models had overall accuracies of 89% and 88%, respectively. The ResNet model was additionally able to differentiate Gleason patterns on data from Cohort B while the ensemble model was not.
Conclusions Our results suggest that quantitative pathomic features calculated from PCa histology can distinguish regions of cancer; however, texture features captured by deep learning frameworks better differentiate unique Gleason patterns.
Introduction
Prostate cancer (PCa) is the most commonly diagnosed non-cutaneous cancer in men, affecting an estimated 268,000 men in 2022[1]. Improved prostate cancer screening and therapies have led to a high five-year survival rate, and the overall prognosis for PCa is among the best of all cancers. Prostate cancer is currently graded using the Gleason grading system, which assigns scores corresponding to the two most predominant morphological patterns present. More recently, Gleason scores have been used to assign patients to one of five Grade Groups (GG) to predict prognosis[2]. Clinically significant cancer (GG ≥ 2, tumor volume ≥ 0.5 mL, or stage ≥ T3) is often treated with radiation therapy and/or radical prostatectomy, whereas low-grade cancer can often be monitored through annual prostate-specific antigen (PSA) testing. Side effects from prostate cancer treatment can include long-term complications such as impotence and impaired urinary function[3]; thus, early and accurate detection of PCa is necessary to minimize overtreatment while still addressing clinically significant cancer.
Digital pathology is playing an increasingly important role in clinical research, with applications in diagnosis and treatment decision support[4]. Fast acquisition times, straightforward data management, and on-screen interpretation of histology have made digital pathology popular and have made it easier for pathologists to manage and share slides. Additionally, pairing artificial intelligence (AI) with digital pathology has created opportunities to incorporate computational algorithms into pathology workflows or into AI-based computer-aided diagnostics[5].
In prostate cancer research, many machine learning applications have focused on automated Gleason grading. While the Gleason score is currently the gold-standard prognostic marker for prostate cancer, the process of assigning grades remains a subjective assessment. Additionally, producing pathologist annotations for digital pathology studies is not only time consuming but can result in significant inter-observer variability[6, 7]. The primary focus of these automated Gleason grading methods has been on biopsies or tissue microarrays as opposed to whole-slide images[8–11]. A fast, automated tool for identifying Gleason patterns in prostate histology could allow for rapid annotation and grading, as well as provide important prognostic information such as recurrence probabilities.
In this study, we developed an Automated Tumor Assessment of pRostate cancer hIstology (ATARI) classification model for the Gleason grading of whole-mount prostate histology using quantitative histomorphometric features calculated from digitized prostate cancer slides. The results of this model were validated against ground-truth pathologist annotations. In addition, we compared this model to a residual network with 101 layers (ResNet101) for automated Gleason grading[12]. Specifically, we tested the hypothesis that a machine learning model applied to second-order features calculated from digitized histology could discriminate prostate cancer from normal tissue. We also hypothesized that a deep learning model would differ in classification accuracy, both in detecting cancer and in differentiating Gleason patterns.
Materials and Methods
Patient Population and Data Acquisition
Data from 47 prospectively recruited patients (mean age 59 years) with pathologically confirmed prostate cancer were analyzed for this study. This study was conducted according to the guidelines of the Declaration of Helsinki and approved by the Institutional Review Board of the Medical College of Wisconsin. Written informed consent was obtained from all subjects involved in the study. The data presented in this study are available on request from the corresponding author; the data are not publicly available due to patient privacy concerns. For model development, subjects were split into 2/3 training (n = 31 patients) and 1/3 testing (n = 16 patients) data sets, matched for tumor grade and other clinical characteristics (see Table 1).
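The paper does not describe the splitting procedure in code; a minimal sketch of a grade-matched 2/3–1/3 split is shown below. The function name, the patient tuple format, and the use of Grade Group as the sole stratification variable are all illustrative assumptions, not the authors' implementation.

```python
import random
from collections import defaultdict

def stratified_split(patients, train_frac=2 / 3, seed=0):
    """Split patients into training/testing sets, stratified by tumor grade.

    `patients` is a hypothetical list of (patient_id, grade_group) tuples;
    splitting within each grade keeps the two sets matched on grade.
    """
    rng = random.Random(seed)
    by_grade = defaultdict(list)
    for pid, grade in patients:
        by_grade[grade].append(pid)
    train, test = [], []
    for grade in sorted(by_grade):
        ids = by_grade[grade][:]
        rng.shuffle(ids)
        cut = round(len(ids) * train_frac)  # ~2/3 of each grade to training
        train.extend(ids[:cut])
        test.extend(ids[cut:])
    return train, test
```

In practice the split would also be checked against the other clinical characteristics listed in Table 1, not grade alone.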
Tissue Collection and Processing
Prostatectomy was performed using a da Vinci robotic system (Intuitive Surgical, Sunnyvale, CA)[13, 14]. Whole prostate samples were fixed in formalin overnight and sectioned using custom axially oriented slicing jigs[15]. Briefly, prostate masks were manually segmented from each patient's pre-surgical T2-weighted magnetic resonance image using AFNI (v.19.1.00) (Analysis of Functional NeuroImages, http://afni.nimh.nih.gov/)[16]. Patient-specific slicing jigs were modeled using Blender 2.79b (https://www.blender.org/) to match the orientation and slice thickness of each patient's T2-weighted image[6, 17–19], and 3D printed using a fifth-generation Makerbot (Makerbot Industries, Brooklyn, NY). The MRI scans were not used for the remainder of this study beyond creating the slicing jigs.
Whole-mount tissue sections were processed and paraffin embedded, and the resulting whole-mount slides were hematoxylin and eosin (H&E) stained. The slides were then digitally scanned using a slide scanner (Olympus America Inc., Center Valley, PA, USA) at a resolution of 0.34 microns per pixel (40x magnification) to produce whole slide images (WSI), and down-sampled by a factor of 8 to decrease processing time. A total of 330 digitized slides were manually annotated by a GU fellowship-trained pathologist (KAI) using a Microsoft Surface Pro 4 (Microsoft, Seattle, WA, USA) with a pre-loaded color palette for different Gleason patterns[2]. An example of the prostate annotation process is shown in Figure 1.
Annotation Segmentation
Digital whole-mount slides were divided into high-resolution tiles of 3000×3000 pixels, labeled using their corresponding xy-coordinates within the image. This tile size was chosen as the smallest at which our pathologist could reliably determine Gleason grades. These tiles were then stitched back together to recreate the whole-mount image while concurrently creating x- and y-coordinate look-up tables. A subset of slides (n=201) was rescanned on the Olympus slide scanner, and annotations that had been performed on lower-resolution digitized versions of those slides were quantitatively transferred. Briefly, the analogous annotated image was aligned to the newly digitized slide using MATLAB 2021b's imregister function (The MathWorks Inc., Natick, MA, USA). The annotations were isolated to create a single mask for each of eight possible classes: seminal vesicles, atrophy, high-grade prostatic intraepithelial neoplasia (HGPIN), Gleason 3 (G3), Gleason 4 cribriform glands (G4CG), Gleason 4 non-cribriform glands (G4NC), Gleason 5 (G5), and unlabeled benign tissue. The Gleason 4 patterns were separated in our annotations because there are notable prognostic differences between the cribriform and non-cribriform patterns[20–23]. An additional averaged white image of non-tissue (i.e., background, lumen, and other artifacts) was computed and used to remove these areas from the annotation masks, ensuring that the most representative histology remained for analysis. Each region of interest (ROI) within a class was compared to the xy look-up tables to determine the corresponding tiles, and only tiles with over 50% of a specific annotation were included. Five tiles from each ROI were saved into annotation-specific directories for use with the ATARI model, except for unlabeled benign tissue, where 15 tiles were randomly saved from each slide. ROIs that were too small to extract 5 tiles from were excluded.
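The 50% inclusion rule above can be sketched as a simple overlap test between a binary annotation mask and a tile grid. This is an illustrative reconstruction, not the authors' MATLAB code; the function name and grid layout are assumptions.

```python
import numpy as np

def tiles_over_threshold(mask, tile_size, threshold=0.5):
    """Return (row, col) grid coordinates of tiles whose area is more than
    `threshold` covered by the binary annotation `mask`.

    `mask` is a 2-D array of 0/1 annotation labels; tiles that only
    partially fit within the mask image are ignored.
    """
    rows = mask.shape[0] // tile_size
    cols = mask.shape[1] // tile_size
    keep = []
    for r in range(rows):
        for c in range(cols):
            tile = mask[r * tile_size:(r + 1) * tile_size,
                        c * tile_size:(c + 1) * tile_size]
            if tile.mean() > threshold:  # >50% of tile pixels annotated
                keep.append((r, c))
    return keep
```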
Each annotated tile was further divided into 1024×1024 pixel tiles for use with the ResNet101 model, resulting in upwards of 9 sub-tiles for the ResNet101 per full-sized tile used for the ATARI model. Sub-tiles that remained within a mask were saved into annotation-specific directories, similarly to the large tiles used for the ATARI model. The ResNet101 was additionally trained using background tiles, determined by areas that were included in the average white image. Tiles used for training were augmented by resizing (250×250 pixels), random cropping (240×240 pixels), applying color jitter (0.3, 0.3, 0.3), adding random rotations (±0–30°), applying random horizontal and vertical flips, and center cropping to the ResNet input size of 224×224 pixels, as well as normalizing to ImageNet's mean (0.485, 0.456, 0.406) and standard deviation (0.229, 0.224, 0.225). This tile extraction process is demonstrated in Figure 1, and a breakdown of slides and sorted tiles can be found in Table 2.
Pathomic Feature Extraction
High-resolution tiles were down-sampled to decrease processing time and then processed using a custom, in-house MATLAB function to extract pathomic features for use with the ATARI model. First, a color deconvolution algorithm was applied to each image to segment stroma, epithelium, and lumen based on their corresponding stain optical densities (i.e., positive hematoxylin or eosin, and background)[24]. These segmentations were then smoothed and filtered to remove excess noise and improve quality. Glands with lumen touching the edge of a tile were excluded. Overall stromal and epithelial areas were calculated on a whole-image basis, and an additional six features were calculated on an individual-gland basis: epithelial area, roundness, and wall thickness; luminal area and roundness; and cell fraction (i.e., the percentage of epithelial cells per total gland area, defined as the area of the epithelium without lumen).
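The paper does not give formulas for the gland-level features. Two of them have natural definitions that can be sketched; the isoperimetric quotient shown for roundness is a common convention and an assumption on our part, as is the cell-fraction formula.

```python
import math

def roundness(area, perimeter):
    """Isoperimetric roundness: 1.0 for a perfect circle, lower for
    irregular shapes. Assumed definition; the paper does not specify one."""
    return 4 * math.pi * area / perimeter ** 2

def cell_fraction(epithelial_area, gland_area):
    """Percent of gland area occupied by epithelium (lumen excluded from
    the numerator), expressed here as a fraction in [0, 1]."""
    return epithelial_area / gland_area
```

For example, a unit circle (area π, perimeter 2π) gives a roundness of exactly 1, while a unit square (area 1, perimeter 4) gives π/4 ≈ 0.785; distorted malignant glands would score lower still.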
Model training
Flowcharts for the ATARI model and ResNet101 classifier can be found in Figure 1. An ensemble algorithm implemented in MATLAB (The MathWorks Inc., Natick, MA, USA) was used as the framework for developing the ATARI classifier on the 31 training subjects. A compact classification ensemble was used, which fits predictors trained on bootstrapped samples from the training data set to obtain a combined ensemble model that minimizes variance across learners[25, 26]. Inputs for this model were the mean, median, and variance of the calculated pathomic features averaged across each tile, z-scored across the training data. To test the granularity of Gleason pattern prediction, we trained predictive models at several levels of tumor specificity: all Gleason grades; high-grade (G4+), low-grade (G3), and benign tissue (HG vs LG model); and non-cancer versus cancer (G3+) (NC vs CA model). To test generalizability, the model was applied to a left-out test set. Predictions were then plotted on three slides from the test data set, using the same features calculated across all tiles of each slide, to assess identification of tumor; these predictions were compared to ground-truth pathologist annotations and to the ResNet model.
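The ATARI pipeline (z-scoring followed by a bagged compact classification ensemble) was built in MATLAB; an analogous sketch in Python using scikit-learn is shown below as an assumption, not the authors' implementation. The function name, estimator count, and choice of scikit-learn's `BaggingClassifier` (whose default base learner is a decision tree) are all illustrative stand-ins for MATLAB's bagged ensemble.

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def build_atari_like_model():
    """Bagged-ensemble analog of the ATARI classifier.

    Rows of the input matrix are tiles; columns are the mean, median,
    and variance of each pathomic feature. StandardScaler z-scores the
    features using training-set statistics, mirroring the paper.
    """
    return make_pipeline(
        StandardScaler(),
        BaggingClassifier(      # decision trees fit on bootstrapped samples
            n_estimators=100,
            random_state=0,
        ),
    )
```

The same model-building call would be reused for each level of specificity (all grades, HG vs LG, NC vs CA) with relabeled targets.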
To test a deep learning approach for comparison, a ResNet model with 101 layers was implemented in Python using the PyTorch framework (v.1.8.1)[12, 27]. The same tiling procedure as previously described was used to curate the dataset for this network, with the addition of splitting all tiles into smaller 1024×1024 pixel patches and saving those that remained 50% within an annotation mask. Data from Cohort A were split into 80/20 training and validation datasets to prevent overfitting, and several data augmentation techniques were used to increase training samples. The image patches were resized to 250×250 pixels, randomly cropped to 240×240 pixels, augmented, and center cropped to generate the required input size of 224×224 pixels. The same three model designs as the ATARI were trained using the ResNet101 framework. Class imbalance of the training dataset was addressed by introducing sample number-based class weights in the cross-entropy loss function.
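The class weighting in the loss can be sketched as follows. The text says only that weights were "sample number-based"; the inverse-frequency scheme and the helper name below are assumptions, though `torch.nn.CrossEntropyLoss` accepting a per-class `weight` tensor is standard PyTorch.

```python
import torch
import torch.nn as nn

def class_weighted_loss(class_counts):
    """Cross-entropy loss weighted inversely to per-class tile counts,
    so under-represented Gleason patterns contribute more per sample.

    Weighting scheme (inverse frequency) is an assumed reconstruction.
    """
    counts = torch.tensor(class_counts, dtype=torch.float)
    weights = counts.sum() / (len(counts) * counts)
    return nn.CrossEntropyLoss(weight=weights)
```

With counts of, say, [100, 10], the minority class receives a tenfold larger weight, counteracting the dominance of benign tissue tiles during training.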
Results
The accuracy of both models was analyzed using the left-out test dataset of 16 patients (95,875 image patches for the ResNet; 4,375 image tiles for ATARI). The ATARI model was unable to successfully classify individual Gleason grades (overall accuracy 85%, per-class accuracy range 0%–99%) or high- (HG) and low-grade (LG) cancer (overall accuracy 83%, per-class accuracy range <1%–99%). In both models, normal tissue was classified well above chance level (20% for all Gleason grades, 33% for high- and low-grade cancer), while G3 in the Gleason grades model and HG in the HG vs LG model performed at chance. The NC vs CA model had an overall accuracy of 89%, with per-class accuracies of 97% and 53% for NC and CA, respectively. The ResNet model successfully classified all Gleason grades with an overall accuracy of 79% (per-class accuracy range 25%–87%), HG vs LG with an overall accuracy of 72% (per-class accuracy range 55%–72%), and NC vs CA with an overall accuracy of 83% (per-class accuracies of 91% and 74% for non-cancer and cancer) (Figure 2).
Figure 3 shows representative slides with their ATARI and ResNet101 annotations compared to ground-truth annotations. Although the ATARI model was unable to capture unique Gleason patterns, it was able to define the region of tumor present on the slide. The ResNet101 model was able to predict the Gleason patterns, with per-class accuracies of 25–52% on these slides.
Discussion
In this study, high-resolution tiles taken from annotated regions on whole-mount digital slides after radical prostatectomy were used to train models to support pathologist diagnoses of prostate cancer. Specifically, the ATARI model used quantitative features to classify glandular features, whereas the ResNet101 classifier used deeper textural features of the histology. The ATARI was only able to accurately separate cancer from non-cancer, whereas the ResNet101 classifier was able to further predict the unique Gleason grades present on the slide. The results of our study indicate that while machine learning models using calculated features may be successful at differentiating tumor from non-tumor, deeper features found using neural networks can further define unique patterns. This may indicate that Gleason patterns exist beyond simple glandular features and may be more readily quantified using textural features. The overall accuracies of 89% and 83% for the ATARI and ResNet101 models, respectively, highlight the need for a more generalizable approach to using machine learning for cancer diagnosis.
Machine and deep learning applications are becoming prominent in clinical research. Machine learning focuses on the use of data and algorithms to imitate the way that humans learn. Data used in machine learning applications are human-derived, quantitative metrics that are then analyzed through statistical methods to make classifications or predictions. Deep learning is a sub-field of machine learning that automates feature extraction without the need for human intervention, allowing it to uncover more nuanced patterns within the data. In this study, our proposed machine learning model outperformed the ResNet model at classifying cancer from non-cancer; however, only the ResNet could classify unique Gleason grades. This may indicate that Gleason grades do not differ strongly in simple quantitative features, but rather in texture features that are discernible using a deep learning model. Prior studies have shown similar results, in which a trained deep learning model outperformed a simpler model trained on handcrafted features[28–30].
Automated Gleason grading applications have been previously developed for multiple purposes. One prior study trained a convolutional neural network (CNN) using WSI-level features constructed from a CNN-based PCa detection model that was trained from slide-level annotations to predict the final patient Gleason Grade Group[31]. This model achieved a 94.7% accuracy at detecting cancer and 77.5% accuracy at predicting the patient Grade Group. While promising, this model does not provide histological annotations on the WSI, but rather only predicts the patient Grade Group. Several previous studies have applied deep learning models to prostate biopsy specimens[11, 32, 33]. While these models have achieved high accuracies at annotating biopsy cores, our ResNet101 model was able to annotate whole-slide images and could distinguish between regions of Gleason 4 cribriform and non-cribriform tumors.
Integrating rapid annotation of Gleason patterns after tissue resection into the clinical workflow could save a tremendous amount of pathologist time. Once slides are digitally scanned, a diagnosis could be predicted automatically based on the automated annotations. This could then be used to rank slides by order of importance for pathologist review and to aid in treatment planning. The proposed models could be applied to large data sets and would decrease the workload on pathologists. Additionally, annotations derived from quantitative metrics may reduce variability in Gleason annotations.
One major limitation of the study is the use of only one pathologist for annotating the training and test datasets. Inter-observer variability is a known issue in prostate cancer diagnosis and should therefore be addressed in the training phase. Additionally, only one slide scanner was used to digitize the slides in this study. Future studies should investigate the impact that additional slide scanners would have on the generalizability of the models, as this analysis was outside the scope of the current study. Finally, future studies should examine larger populations to provide a more robust dataset of Gleason patterns, which may increase the accuracy of the machine learning models, as this study had a relatively small cohort of 47 patients.
Conclusion
We demonstrate in a cohort of 47 patients that machine learning models and neural networks can accurately predict regions of prostate cancer, with the latter network further able to classify unique Gleason patterns. These models are anticipated to aid in prostate cancer decision support by decreasing the diagnostic burden on pathologists. Future studies should determine how inter-observer variability and slide scanner resolution impact the classifications of these networks.
Author Contributions
Conceptualization, S.R.D. and P.S.L.; methodology, A.N., K.A.I. and P.S.L.; software, S.R.D., S.A.B., A.K.L., M.B., A.W., and F.K.; validation, S.R.D., S.A.B., A.K.L., M.B., A.W., F.K., and. P.S.L.; formal analysis, S.R.D.; investigation, P.S.L.; resources, P.S.L.; data curation, P.S.L.; writing—original draft preparation, S.R.D.; writing—review and editing, S.R.D., S.A.B., A.K.L., M.B., A.W., F.K., K.A.I., A.N., and P.S.L.; visualization, S.R.D.; supervision, P.S.L.; project administration, P.S.L; funding acquisition, P.S.L. All authors have read and agreed to the published version of the manuscript.
Acknowledgments
We would like to thank our patients for their participation in this study, and the Medical College of Wisconsin Machine Learning Group for helpful feedback and discussion.