Abstract
Breast cancer is one of the main causes of death world-wide. Histopathological cellularity assessment of residual tumors in post-surgical tissues is used to analyze a tumor’s response to a therapy. Correct cellularity assessment increases the chances of getting an appropriate treatment and facilitates the patient’s survival. In current clinical practice, tumor cellularity is manually estimated by pathologists; this process is tedious and prone to errors or low agreement rates between assessors. In this work, we evaluated three strong novel Deep Learning-based approaches for automatic assessment of tumor cellularity from posttreated breast surgical specimens stained with hematoxylin and eosin. We validated the proposed methods on the BreastPathQ SPIE challenge dataset that consisted of 2395 image patches selected from whole slide images acquired from 64 patients. Compared to expert pathologist scoring, our best performing method yielded the Cohen’s kappa coefficient of 0.70 (vs. 0.42 previously known in literature) and the intra-class correlation coefficient of 0.89 (vs. 0.83). Our results suggest that Deep Learning-based methods have a significant potential to alleviate the burden on pathologists, enhance the diagnostic workflow, and, thereby, facilitate better clinical outcomes in breast cancer treatment.
1. Introduction
Breast cancer is one of the most common cancer types diagnosed in women in the United States and worldwide [1]. Biopsies and histological assessment allow pathologists to analyze microscopic structures of breast tissues and, in particular, assess the cancer’s aggressiveness.
Multiple options are available to manage and monitor the breast cancer treatment based on the information provided from the tumor’s response to it. In addition to the treatment effect on the tumor size, the therapy may also alter the tumor’s cellularity [2]. During anticancer therapy, the size of the tumor may remain the same, but the overall cellularity may be drastically reduced [3]. As a result, it makes the residual tumor cellularity an important factor in assessing the response treatment.
Currently, tumor cellularity is manually assessed by pathologists from hematoxylin and eosin (H&E)-stained slides [3]. The costs of such estimation are high, the process is tedious and subjective, and the quality and reliability might be also be affected by high inter-observer variability even among senior pathologists. This potentially may affect prognostic power assessment in clinical trials [4]. The subjectivity in visual tissue assessment motivates the use of computer-aided methods to improve the diagnosis accuracy, reduce human error and increase inter-observer agreement and reproducibility [5, 6]. Automated analysis of the H&E slide using computer vision could provide immediate benefits to patient care. Recent success in Deep Learning (DL) [7, 8], and in particular the advances in convolutional neural networks (CNN), have recently shown high potential in this realm [9].
In this work, we evaluate three DL-methods to score the cellularity of the breast tissue from histopathological images. In particular, our first approach employs a weakly-supervised segmentation model with Resnet-34 [10] encoder and Feature Pyramid Network (FPN)[11] and a second-stage regression network that predicts the cellularity score using the predicted segmentation maps. Our second approach is also based on segmentation, however, instead of using the segmentation maps directly, we extract various features from them and use the gradient boosting trees (GBT) [12] to predict the cellularity score. Finally, we also evaluate using H&E image patches directly to predict the cellularity score.
2. Related work
CNNs have recently been successfully applied to many tasks in biomedical image analysis, often outperforming conventional machine learning methods [9, 13, 14]. As such, they have successfully been utilized for digital pathology image analysis and have demonstrated great potential for improving breast cancer diagnostics [15, 16, 17, 18].
Although there are not many studies focusing directly on automated quantitative cellularity assessment, it has been shown that this task can be solved by first segmenting malignant cells and then computing the tumor’s area [19]. Many efforts have been devoted to developing supervised and unsupervised methods for automated cell and nuclear segmentation and detection [20, 21]. Supervised segmentation models have superior performance but require hand-labeled nuclear mask annotations [20]. In these approaches, segmented nuclear bodies are used to extract features that are typically inspired by visual markers recognized by pathologists. Commonly used features describe morphology, texture, and spatial relationships among cell nuclei in tissue [19, 22].
The conventional approach most relevant to our work is by Peikari et al. [19] who proposed an automated cellularity assesement protocol. First, they used smaller patches, or regions of interest (RoI), extracted from whole slide images to segment all present cell nuclei. Then they extracted a number of predefined features from segmented nuclei and used support vector machines to distinguish lymphocytes and normal epithelial nuclei from malignant ones. Cellularity estimation was done using distinguished malignant epithelial figures for every RoI.
Alternatively, segmentation-free methods that directly estimate cellularity from histopathology imaging data and nuclei locations annotated by human observers were also shown promising. In particular, Veta et al. [23] proposed a deep learning-based method that leverages an information from a tumor’s cells nuclei locations (centroids) and predicts the areas of individual nuclei and mean nuclear area without the intermediate step of nuclei segmentation. In particular, this approach was based on a 10-layer deep neural network predicting nuclear areas quantized into 20 histogram bins. The results showed that predicted measurements had substantial agreement with manual measurements, which suggests that it is possible to compute the areas directly from imaging data, without the intermediate step of nuclei segmentation. This is in spirit similar to one of our approaches, but we do not directly compare our methods to Veta et al. since we use different datasets and performance metrics.
Recent works by Akbar et al. [24, 25] have compared the conventional approach based on segmentation and feature extraction and direct applications of deep CNNs to image patches in both regression and classification settings. Overall, they showed that the DL-based approach outperformed hand-crafted features in both accuracy and intraclass correlation (ICC) with expert pathologist annotations. Specifically, their best result was achieved by using a pretrained Inception [26] model that reached ICC of 0.83 and 0.81 with two expert pathologists. In this study we evaluate even wider range of DL-based approaches, including segmentation-based and segmentation-free, in both regression and classification settings. We provide appropriate performance comparisons with previously reported results. All the methods developed in this study are fully automatic and do not require any involvement of the human annotators at the test time.
3. Methods
In this study, we propose and evaluate three different methods. The first two methods are based on the nuclei segmentation and the third method leverages the raw image without preceding segmentation step. Graphical illustration of our approach is presented in Figure 1
3.1. Segmentation
Network Architecture
Most modern segmentation architectures inherit the encoder-decoder architecture similar to U-Net [27], where convolutional layers in the contracting branch (encoder) are followed by an upsampling branch that brings segmentation back to the original image size (decoder). In addition, skip connections are used between contracting and upsampling modules to help the localization information propagate through the complex multilayer structure and eventually improve segmentation accuracy [27]. U-Net and architectures inspired by this idea have produced state of the art results in various segmentation problems, and many improvements for the architecture and its training protocols have recently been proposed. In particular, Iglovikov et al. [28, 29] used batch normalization [30] and exponential linear unit (ELU) as the primary activation function and an ImageNet pre-trained VGG-11 network [31] as an encoder. Liu et al. [32] proposed an hourglass-shaped network (HSN) with residual connections, which is also very similar to the U-Net architecture. Rakhlin et al. [33] used the Resnet-34 network [34] as the encoder and the Lovász-Softmax loss function [35] along with Stochastic Weight Averaging (SWA) [36] for training.
In our proposed architecture, the segmentation module also inherits the U-Net architecture. The contracting branch (encoder) of our model is based on the Resnet-34 [34] network architecture where we have introduced several useful modifications. In particular, we have replaced ReLU activations with ELU that does not saturate gradients and keeps the output close to zero mean and have changed order of batch normalization [30] and activation layers. In Section 4 we compare encoders initialized with random He’s initialization [37] and pretrained on ImageNet.
A major difference between the proposed model and the classical U-Net architecture originates in the limited dataset size, characteristic for the BreastPathQ Cancer Cellularity Challenge and for medical imaging in general. We utilize two regularization techniques to alleviate the problem of overfitting to limited training data: (1) data augmentation and (2) spatial 2D dropout incorporated into the upsampling branch.
The upsampling branch is implemented as a Feature Pyramid Network (FPN) [38], reconstructing high-level semantic feature maps at four scales simultaneously. We implement a feature pyramid block as a convolutional layer with 64 activation maps followed by upsampling to the original resolution with upsampling rate of 8, 4, 2, or 1 depending on the feature map depth (see Fig. 2). In Section 4, we compare the performance of standard and FPN decoders.
We concatenate upsampled maps into a single layer of 64 × 4 = 256 maps and add after it a spatial 2D dropout layer, which acts as a regularizer and prevents coadaptation of the network weights, but unlike conventional dropout it drops out not individual neurons but rather entire activation maps. Throughout the work, we use dropout rate 0.5, randomly dropping 128 out of 256 activation maps.
Finally, the output of the model is a 4-channel sigmoid layer that assigns every pixel with four values from 0 to 1 that represent the probabilities of belonging to the Normal, Lymphocyte, Malignant, and Background classes.
Loss functions
Binary cross entropy (BCE), while convenient for training, does not directly translate into Jaccard index, the metric commonly used to evaluate segmentation accuracy. Hence, as the loss function we use a weighted sum of BCE and the soft Jaccard loss for class c; in this work we set α = 0.15, a value found via cross-validation. The soft Jaccard loss is defined as where w are network parameters, is the binary label for pixel i and class is the predicted probability of c for pixel i, and N is the total number of pixels. The total loss function is a weighted sum of class losses: where vc is a loss weight for class c. In this work we weigh Normal, Lymphocyte, and Background as 1 and Malignant, the class of primary importance in our problem, as 4.
3.2. Cellularity estimation from segmented cells
In this subsection, we describe the method for cellularity assessment that leverages the output of the trained segmentation network Fig. 2. We feed the segmented output into a Resnet-34 CNN model. The model automatically learns deep features from the 4-channel segmentation input and regresses it onto continuous cellularity score. In this approach, the segmentation model acts as a filter the aim of which is to extract only the information about the cell morphology. We hypothesized that this structured approach makes our method similar to methods employed by expert pathologists, makes it transparent and less sensitive to data acquisition settings.
3.3. Feature extraction-based cellularity estimation
The second type of model is Gradient Boosted Trees (GBT) [39]. This approach represents an example of inter-disciplinary connection in Machine Learning. The general idea and handcrafted features are borrowed from the second place solution for 2017 Kaggle contest for Sea Lion Population Count in aerial imaginary [40]. The authors would like to thank Konstantin Lopuhin for valuable discussion they had while incorporating his method. In this study, GBT operates on a vector of hand-crafted features extracted from nuclei segmentation maps, including:
activations and their areas aggregated over segmentation maps with different thresholds; for every segmentation map in Normal, Lymphocyte, Malignant and for 7 thresholds 0.02, 0.04, 0.08, 0.16, 0.24, 0.32, 0.5, we obtain 2 values: total area above threshold and total activation above threshold (see Fig. 4 for an illustration);
using the Laplacian of Gaussian (LoG) method as implemented in the OpenCV library [41], we find blobs in segmentation maps at 6 thresholds: 0.02, 0.04, 0.08, 0.16, 0.24, 0.5; for each threshold we find the number of blobs and total activation in blob centers (Fig. 5);
total activation for every channel, computed as a sum of the activations at every pixel after sigmoid.
In total, we obtain 3 ×(7 × 2 + 6 × 2 + 1) = 81 features to train the GBT model.
3.4. Cellularity estimation from the raw images
The third type of model is a deep convolutional network implemented in regression or classification settings.
These models do not use intermediate segmentation and predict the cellularity score immediately from the microscopic image. For classification, we categorize cellularity into 101 class using regular bins with thresholds 0.00, 0.01, …, 1.00. The idea of direct regression of an image into continuous value using CNN is not new. In particular, it was implemented in [42] where the authors use CNN to predict bone age from radiograph.
3.5. Evaluation metrics
We assessed the results using several metrics. The main evaluation metric is the mean squared error (MSE) between the cellularity score obtained in our experiments and ground truth provided by an expert pathologist.
In order to make our results comparable with previous work, we also report Cohen’s kappa coefficient agreement and the intra-class correlation coefficient (ICC) between expert and automated methods, similar to [19]. In all experiments, we find our results superior to our predecessors; however, the cellularity score itself in [19] is evaluated based on binning it into four categories of 0–25%, 26–50%, 51–75%, and 76–100%. Such 4-class categorization is relatively coarse and, in our opinion, does not represent a suitable evaluation metric for continuous cellularity estimation that is our goal in this work.
4. Experiments and results
4.1. Data
The data used in this study had been acquired from the Sunnybrook Health Sciences Centre with funding from the Canadian Cancer Society and was made available for the BreastPathQ challenge sponsored by the SPIE, NCI/NIH, AAPM, and the Sunnybrook Research Institute [19].
In our experiments we used 2, 395 patches of 512 × 512 pixels in size, extracted from 96 haematoxylin and eosin (H&E) stained whole slide images (WSI) acquired from 64 patients. Each patch in the training set has been assigned a tumor cellularity score by an expert pathologist. In Figure 6, we present a distribution of the cellularity scores in the dataset.
Besides the image data, we used the annotations (X and Y coordinates) to identify lymphocytes, malignant epithelial, and normal epithelial cell nuclei in the additional 153 patches. Using these weak annotations, we generated the segmentation masks that were used in out experiments. Here, at each XY location, we simply fit a blob of 15 pixels in diameter. In Figure 3 we present the generated masks for various classes.
4.2. Weakly-supervised cell segmentation
Highly limited amount of nuclei annotation data poses a serious challenge for training segmentation models. In our ablation studies, we evaluated our model in four different settings to find how different design choices influence the segmentation accuracy and generalization. Namely, we compared the model as described in Section 3 with a standard U-Net decoder against an FPN decoder, and with the encoder initialized randomly against the encoder initialized with weights pretrained on ImageNet. In all settings, the model was trained for 150 epochs with the Adam optimizer and gradually decreasing learning rate from 10−4 to 10−5.
To obtain training patches, we downscaled the microscopy images × 2 times, randomly cropped a 256 × 256 area, and rescaled pixel values from [0, 255] to [− 1, 1]. As mentioned previously, segmentation targets were generated as 4-channel masks with round blobs, 15 pixels in diameter (the characteristic nucleus size), drawn in the nuclei centers. During training, we dynamically augmented images with vertical and horizontal flips, rotation, gamma, hue, and saturation utilizing the Albumentations library [43].
In the first series of experiments, we evaluated segmentation quality as an important intermediate metric for the evaluation of our methods. The segmentation performance as a function of the decoder, initialization, and training epoch is shown in Fig. 8 and Table 1. As we can see, the model with the feature pyramid decoder and encoder pretrained on ImageNet achieved significantly higher and more stable Jaccard index on the validation set than the alternatives.
4.3. Segmentation-based cellularity assessment
Prediction from the segmented cells
As mentioned previously, we used the output of the segmentation model as input for the cellularity regressor and then trained this cascade end-to-end. We froze the segmentation model and stack its 4-channel output with a randomly initialized Resnet-34 in the regression setting. We trained the regression part with cellularity targets and MSE loss until convergence. Then we unfroze segmentation weights and fine-tuned both modules in an end-to-end fashion, as a single model. We repeated this experiment with Resnet-34 pretrained on ImageNet. In the latter case, we excluded the background channel from segmentation output to comply with the vanilla Resnet-34 architecture that has a 3-channel input.
In these experiments, we found that after fine-tuning the accuracy of segmentation itself slightly decreases, while the accuracy of the overall cellularity scoring increases. This is in line with [44], which found that perfect segmentation of nuclei figures does not ensure better classification of malignant objects from breast cancer tissues. This finding suggests that the two branches of future work, tumor bed segmentation and cellularity assessment, are relatively independent.
Feature extraction-based method
In this series of experiments, we extracted the 81 features from segmentation masks as discussed in Section 3 and trained the LightGBM [39] regression model with mean squared error (MSE) objective. The model was trained for 600 epochs with learning rate 0.01. The maximum tree depth was set to 5; the number of leaves, to 8. These parameters have been selected through cross-validation.
We report LightGBM accuracy in Table 2 and show the resulting feature importance on Fig. 10. Feature importance was calculated based on the total gain of the loss function from the splits formed according to this feature. As expected, all highly important features come from the Malignant channel. The most important feature is the total activation above 0.5 threshold, and the second and third most important features are the activations above 0.32 and 0.24 thresholds, as expected since activations at different thesholds are highly correlated, and the segmentation quality at threshold 0.5 was the best, so the feature based on this mask is a natural candidate for the most important feature. Activations at lower thresholds provide additional value, but a big part of the information that they contain has already been conveyed via the 0.5 threshold feature. Interestingly, malignant cell count (detected at threshold 0.24) is only the 9th feature in order of importance.
4.4. Direct cellularity assessment from the raw images
In our final experiments, we evaluated several deep neural architectures that take the original microscopy images as input and output the cellularity score without intermediate segmentation. Similarly to previous experiments, we trained the models with random He initialization [37] or initialized them with weights pretrained on ImageNet. In all cases, ImageNet initialization was superior to random, and the overall accuracy was slightly better than for the models with intermediate segmentation. The Xception model implemented in a classification setup with random initialization performed slightly better than its counterparts (MSE 0.012 vs. 0.017-0.025). The mean squared error of cellularity prediction as a function of the training epoch for different initializations is shown in Figure 9. All the performance evaluation metrics are presented in Table 2 and Table 3.
4.5. Discussion
As we can see in Table 2 and Table 3, direct cellularity assessment method slightly outperforms the segmentation-based approach, where the regression module works on top of the segmentation feature extractor. We believe that performance improves due to two main reasons. First, segmentation models were not trained on accurate segmentation masks but rather on approximate masks generated from weakly supervised labels. Second, the cellularity score depends not only on the tumor masks but also on a broader set of features, some of which could be lost during the segmentation step.
While we note the record results of our end-to-end models, we believe that the modular form of the prediction pipeline provides benefits that more than compensate for this small difference in the final score.
The segmentation-based approach has two significant advantages: generalizability and interpretability. In practice, the data used for medical imaging tasks comes from different hospitals and is collected by different hardware. Images may differ in quality, level of noise, color and brightness distributions. In [42], the authors proposed to use segmentation to clean and standardize the data, which helps with overall robustness and performance of various task-specific models.
Better interpretability is achieved by the fact that we can visually verify the quality of the intermediate step, i.e., segmented tumors. Furthermore, the decision trees model allows to estimate the feature importance for every feature based on the information gain. If segmented tumors are correct, and the most informative features make intuitive sense, we obtain additional confidence in our model, which is very important in the medical setting.
5. Conclusion
In this paper, we evaluate three automatic methods to assess the cellulalarity of residual breast tumors in H&E stained samples. Our first method leverages the weakly-supervised segmentation masks as inputs for deep CNN. We believe that this method will be more generalizable and robust towards the data acquisition and easier to interpret.
Our second method that leverages feature extraction from the weakly-supervised segmentation mask yields the highest score among the all previously published feature extraction-based methods [24, 19].
Finally, the third method in this study is an end-to-end approach that predicts the cellularity score without any intermediate segmentation step. Although it is attractive and produces the best results it lacks interpretability of the segmentation-based methods and could perform best due to the dataset bias.
The main limitation of this study is the dataset size and the weak labels for the segmentation model. We think that given a bigger dataset and good quality annotations, segmentation-based approach could produce better results that less deviate from the end-to-end trained models.
Footnotes
shvets{at}mit.edu, akalinin{at}umich.edu, aleksei.tiulpin{at}oulu.fi, iglovikov{at}gmail.com, snikolenko{at}neuromation.io