The Evaluation of Acute Myeloid Leukaemia (AML) Blood Cell Detection Models Using Different YOLO Approaches

This study evaluates the performance of Acute Myeloid Leukaemia (AML) blast cell detection models on microscopic examination images for faster diagnosis and disease monitoring. You Only Look Once (YOLO), a popular deep learning algorithm developed for object detection, is among the most successful state-of-the-art algorithms for real-time object detection systems. We employ four versions of the YOLO algorithm (YOLOv3, YOLOv3-Tiny, YOLOv2, and YOLOv2-Tiny) to detect 15 classes of AML blood cells in examination images. We use the publicly available dataset from The Cancer Imaging Archive (TCIA), which consists of 18,365 expert-labelled single-cell images. Data augmentation techniques are additionally applied to enhance and balance the training images in the dataset. The overall results indicate that all four YOLO approaches achieve outstanding performance, exceeding 92% in both precision and sensitivity, with YOLOv3 delivering the most reliable performance of the four. Consistently, the AUC values for the four YOLO models are 0.969 (YOLOv3), 0.967 (YOLOv3-Tiny), 0.963 (YOLOv2), and 0.948 (YOLOv2-Tiny). Furthermore, we compare the best model's performance between an approach that uses the entire training dataset without data augmentation and our approach of image division with data augmentation. Remarkably, using only 33.51 percent of the training data, the model trained with image partitioning and data augmentation produced prediction outcomes similar to those obtained with the complete training dataset. This work can potentially provide a beneficial rapid digital tool for the screening and evaluation of numerous haematological disorders.


approaches; one approach uses the entire training dataset and does not use data augmentation strategies, and our proposed methodology is used to determine the gaps in model performance. We also introduce a performance evaluation technique using the ROC curve. This technique helps examine the performance of our approaches with quantitative AUC values. The rest of the paper is organized as follows: Section 2 presents the dataset arrangement, data augmentation techniques

According to previous research [27], the most important aspect of machine learning models is the preparation and testing phase. Using different sampling strategies, the researchers investigated the impact of the training and testing processes on machine learning efficiency. The findings revealed that if the dataset, or the number of samples in it, is small, the test rate can be set between 10% and 20%; in general, the test rate ranges between 20% and 50%. We therefore randomly divide the received dataset into a training dataset and a testing dataset for each class, where the training dataset contains about 80% and the remainder is used for testing. We also minimize certain classes in the testing dataset that have a large number of images. As a result, the three classes with over 3000 images were reduced: Neutrophil (Segmented) was reduced to 1510 images, and the other two classes, Lymphocyte (Typical) and Myeloblast, were reduced to 1000 images each. The distribution of training, testing, and unused images is shown in Fig 2(b); according to the pie chart, the training dataset is reduced from 80% to 27% of the entire dataset. Moreover, the detailed data utilization of the AML dataset is described in Table 1.
 Rotation: This technique rotates the image by a degree value specified from -180º to 180º. In this study, the rotation was varied in 45° steps to obtain various images, as shown in Fig 3(b).

 Contrast: Contrast is a kind of augmentation technique in image processing that adjusts the difference between the darkest and brightest image areas. The contrast of the images was changed by multiplying all pixel values by 0.4, 0.6, 0.8, and 1.0, as shown in Fig 3(c).

 Noise: Noise can reduce the accuracy of neural networks when testing on real-world data. Image noise injection is an important augmentation step that allows the model to discriminate the original image content from noise in an image. Gaussian noise was injected using three values of standard deviation (σ = 0, 10, 20), as shown in Fig 3(d).

 Blur: Blurring is a very popular technique in image processing. It can be defined as the degree of separation between sharpened and blurred images. Applying a Gaussian blur filter results in a blurrier image, and blurring images for data augmentation can lead to higher resistance to motion blur during testing. The higher the standard deviation, the stronger the blur; here, a Gaussian filter with a standard deviation of 9 was used to blur the images, as shown in Fig 3(e).

Fig 3. (b) Image Rotation; (c) Image Contrasting; (d) Noise Injection to Image; (e) Image Blurring.

After completing the four image processing tasks mentioned above, a single image can be augmented into up to 108 images at 608x608 pixel resolution. The resulting dataset, represented at 608x608 pixels, was used to prepare the training data for the four versions of the YOLO algorithm.
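The combination of augmentations described above can be sketched as follows. This is an illustrative Python example, not the authors' exact pipeline: we assume the "up to 108" variants arise from 9 rotation angles × 4 contrast factors × 3 noise levels (9 × 4 × 3 = 108), with the σ = 9 Gaussian blur applied as a separate step, and all function names are ours.

```python
import numpy as np
from PIL import Image, ImageFilter

ROTATION_ANGLES = range(-180, 181, 45)   # -180º to 180º in 45º steps (9 angles)
CONTRAST_FACTORS = [0.4, 0.6, 0.8, 1.0]  # pixel-value multipliers
NOISE_SIGMAS = [0, 10, 20]               # Gaussian noise standard deviations

def augment(image, seed=0):
    """Yield up to 9 x 4 x 3 = 108 augmented variants of one cell image."""
    rng = np.random.default_rng(seed)
    for angle in ROTATION_ANGLES:
        rotated = image.rotate(angle)
        for factor in CONTRAST_FACTORS:
            scaled = np.asarray(rotated, dtype=np.float32) * factor
            for sigma in NOISE_SIGMAS:
                noisy = scaled + rng.normal(0.0, sigma, scaled.shape)
                yield Image.fromarray(np.clip(noisy, 0, 255).astype(np.uint8))

def blur(image, sigma=9):
    """Separate Gaussian-blur augmentation (standard deviation 9, as in the text)."""
    return image.filter(ImageFilter.GaussianBlur(radius=sigma))
```

Note that rotating by -180° and 180° yields the same image, so the effective count can be slightly below 108, which is consistent with "up to 108 images".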

To assess the improvements in our methodology, which used data utilization and data augmentation techniques, we compared its performance with an approach that used the whole training dataset without data augmentation techniques.

In this study, the proposed model for the AML blood cell classification task is based on the YOLOv3 convolutional neural network, as shown in Fig 4.

The YOLOv3 architecture also uses k-means clustering to evaluate the bounding box priors.
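A minimal sketch of how such bounding-box priors can be derived with k-means, clustering label widths and heights with 1 − IoU as the distance (as in the original YOLO papers); this is an illustrative re-implementation, not the study's code, and all names are ours.

```python
import numpy as np

def iou_wh(boxes, centroids):
    """IoU between (w, h) pairs, comparing boxes aligned at a shared corner."""
    inter = np.minimum(boxes[:, None, 0], centroids[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], centroids[None, :, 1])
    areas = boxes[:, 0] * boxes[:, 1]
    union = areas[:, None] + (centroids[:, 0] * centroids[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k=9, iters=100, seed=0):
    """Cluster (w, h) pairs into k anchor priors using the 1 - IoU distance."""
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(iou_wh(boxes, centroids), axis=1)  # min(1 - IoU)
        new = np.array([boxes[assign == i].mean(axis=0) if np.any(assign == i)
                        else centroids[i] for i in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids
```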

As illustrated in Fig 4, YOLOv3 predicts three branch outputs in the feature map for each cell image. We selected 608x608 as the input image size, and the output has three different types of attribute maps: 19x19 for large objects, 38x38 for medium objects, and 76x76 for small objects. The proposed architecture of YOLOv2 is shown in Fig 5. The fourth network consists of convolution layers and 6 max-pooling layers and predicts one output feature map, as shown in Fig 7; its training network parameters are the same as those of the above three models.

After obtaining all four of these values for each class, the four performance metrics described above are computed using Equations (1), (2), (3), and (4).

Precision is defined as the proportion of positive predictions that are truly positive.

The precision is calculated as Precision = TP / (TP + FP). To evaluate all classes together, an n × n confusion matrix is constructed as shown in Fig 8, and the resulting confusion matrices for the four trained models are shown in Fig 9.

In the confusion matrices, we found that YOLOv2 and YOLOv3-Tiny correctly predict more classes than the other two models: these two models can predict 14 classes, YOLOv2-Tiny can predict 13 classes, and YOLOv3 can predict 12 classes, even though some classes have small datasets. As mentioned in the Dataset and Labelling section, the four models had difficulty performing correctly on classes with small testing datasets. Nevertheless, all four models accurately predicted the Smudge cell class because it has a more distinctive character than the other classes. Typical results from the detection and classification of single-cell images are shown in Fig 10. Although YOLOv3 predicted fewer classes than the other algorithms, it correctly predicted more images than the other three models. For the total number of image predictions out of 3663 single-cell images, the four models performed as follows:

Fig 10. The classification and localization of Acute Myeloid Leukaemia (AML) Blood Cell images with a threshold level of 0.5 and NMS of 0.2.

Additionally, we compare the precision and sensitivity values to evaluate the quality of class-wise prediction using a one-versus-rest approach, as shown in Fig 11 and Fig 12.

The overall results of the four performance metrics for the four types of YOLO models are shown in Fig 13. YOLOv3 achieves 94% in overall precision and sensitivity and 99% in specificity and accuracy, the highest among the four models. Therefore, we can conclude that the YOLOv3 model is suitable for high-performance GPUs, whereas YOLOv3-Tiny is compatible with low-memory and CPU devices.
Fig 13. The overall scores for the four performance metrics, namely precision, sensitivity, specificity, and accuracy, with a threshold level of 0.5 and NMS of 0.2.

The model performances are calculated using micro-averaging.
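Per-class counts and micro-averaged metrics can be derived from an n × n confusion matrix as sketched below; this is an illustrative computation (names are ours), in which the per-class TP, FP, FN, and TN are summed before applying the formulas of Equations (1)-(4).

```python
import numpy as np

def per_class_counts(cm):
    """TP, FP, FN, TN per class from an n x n confusion matrix
    (rows = actual class, columns = predicted class)."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp   # predicted as this class but actually another
    fn = cm.sum(axis=1) - tp   # actually this class but predicted as another
    tn = cm.sum() - tp - fp - fn
    return tp, fp, fn, tn

def micro_metrics(cm):
    """Micro-averaged precision, sensitivity, specificity, and accuracy."""
    tp, fp, fn, tn = (x.sum() for x in per_class_counts(cm))
    return {
        "precision":   tp / (tp + fp),
        "sensitivity": tp / (tp + fn),  # recall / TPR
        "specificity": tn / (tn + fp),
        "accuracy":    (tp + tn) / (tp + fp + fn + tn),
    }
```

Note that for a complete square confusion matrix, micro-averaged precision and sensitivity coincide, since the summed FP and FN are equal.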

The performance comparison between YOLOv3 (using the entire training dataset without applying data augmentation techniques) and YOLOv3 (our approach)

Since the YOLOv3 model trained with the data partitioning strategy (our approach) has the best results, as described in the previous section, we evaluated whether its performance is comparable to a model trained with the entire public dataset. In this section, we compare the performance of a well-trained YOLOv3 model between using the entire training dataset (without data augmentation techniques) and using data partitioning (our approach). The same parameter choices were used in training and testing both approaches. Fig 14 depicts the confusion matrix for the baseline approach. Comparing Fig 9(a) and Fig 14, it is evident that the YOLOv3 method using the entire training dataset without data augmentation has higher prediction scores in the two largest dataset classes, namely Neutrophil (segmented) and Myeloblast. While this approach detected one more class, Promyelocyte (bilobed), than ours when analyzing the testing dataset, its overall detection rate is lower.

Fig 14. Confusion Matrix for YOLOv3 (Using the Entire Training Dataset Without Applying Data Augmentation Techniques).
Table 2 further summarizes and contrasts the precision and sensitivity for multi-class classification of blood cell representations using a one-versus-rest method. In the precision comparison, the YOLOv3 model trained on the entire training dataset without data augmentation has a higher precision score in five classes, while the model trained with our approach has a higher precision score in seven classes. In the sensitivity comparison, both versions have higher sensitivity scores in six classes each. As a result, our solution holds a competitive advantage in terms of precision and sensitivity for multi-class classification.

Furthermore, we compared the overall performance of the two YOLOv3 models using micro-averaging, as shown in Table 3. According to the findings, the YOLOv3 model trained on the entire training dataset without data augmentation outperforms our approach in all performance metrics. However, our methodology used only 33.51% of the training dataset, and all of the performance metrics are similar, with only a slight difference between the two models. The findings reveal that using data augmentation strategies when training on a blood cell dataset can reduce the required sample size from large dataset classes while still satisfying the data requirements of the model.

The classification results logically yield a numeric value of the instance probability for the predicted classes, as shown in Fig 10. By sweeping the confidence threshold, the total true positives (TTP), false positives (TFP), true negatives (TTN), and false negatives (TFN), together with the TPR and FPR, were computed for each of the four models; at a threshold of 0, all 3663 images are counted as positives, giving TPR = FPR = 1 for every model. The resulting ROC curves are shown in Fig 15. In the ROC curve analysis, the AUC is an effective way to evaluate the performance of a trained model. The AUC value is always bounded between 0 and 1, where a perfectly inaccurate test gives a value of 0 and a perfectly accurate test gives a value of 1. In general, an AUC value can be interpreted as follows: under 0.5 is not a realistic model, 0.5 indicates no discrimination, 0.7 to 0.8 is considered an acceptable model, 0.8 to 0.9 is considered an excellent model, and more than 0.9 is considered an outstanding model [38].

According to Fig 15, we conclude that the YOLOv3 model is the superior detection model because its AUC value is the highest. The AUC values for the four YOLO models are 0.969, 0.967, 0.963, and 0.948 for YOLOv3, YOLOv3-Tiny, YOLOv2, and YOLOv2-Tiny, respectively. Since all of the AUC values are higher than 0.9, we may consider all four models outstanding for AML cell classification.

We summarize our findings with an overall performance comparison between the four types of YOLO approaches, as shown in Table 5.