Convolutional neural networks for mild diabetic retinopathy detection: an experimental study

Currently, Diabetes and the associated Diabetic Retinopathy (DR) instances are increasing at an alarming rate. Numerous previous studies have focused on automated DR detection from fundus photography. The classification of severe cases of pathological indications in the eye has achieved over 90% accuracy. Still, mild cases are challenging to detect due to the inability of CNNs to identify the subtle features discriminative of the disease. The data used (i.e. annotated fundus photographs) was obtained from 2 publicly available sources: Messidor and Kaggle. The experiments were conducted with 13 Convolutional Neural Network architectures, pre-trained on the large-scale ImageNet database using the concept of Transfer Learning. Several performance improvement techniques were applied, such as: (i) fine-tuning, (ii) data augmentation, and (iii) volume increase. The results were measured against the standard Accuracy metric on the testing dataset. After extensive experimentation, the maximum Accuracy of 86% on the No DR/Mild DR classification task was obtained for the ResNet50 model with fine-tuning (un-freezing and re-training the layers from 100 onwards) and the RMSProp Optimiser, trained on the combined Messidor + Kaggle (aug) datasets. Despite promising results, Deep learning continues to be an empirical approach that requires extensive experimentation in order to arrive at the most optimal solution. A comprehensive evaluation of numerous CNN architectures was conducted in order to facilitate early DR detection. Furthermore, several performance improvement techniques were assessed to address the CNN limitation in identifying subtle eye lesions. The model also included various levels of image quality (low/high resolution, under/over-exposure, out-of-focus, etc.), in order to prove its robustness and ability to adapt to real-world conditions.

Introduction

[...] Early detection of changes in the eye leading to DR is critical. It not only allows the late invasive treatments and high medical expenses to be avoided, but most importantly reduces the risk of potential sight loss. The manual methods of diagnosis prove limiting given the worldwide increase in prevalence of both Diabetes and its retinal complications [7]. Currently, the ophthalmologist-to-patient ratio is approx. 1:1000 in China [5]. Furthermore, the traditional approaches reliant on human assessment require high expertise, as well as promote inconsistency among the readers [1]. The labour- and time-consuming nature of manual screening services has motivated the development of automated retinal lesion detection methods, in particular for the early stages of DR.

A Deep Neural Network model is a sequence of mathematical operations applied to the input, such as pixel values in an image [8], where the training is performed by presenting the network with multiple examples, as opposed to the inflexible rule-based programming underlying the conventional methodologies [9]. Deep learning, in particular Convolutional Neural Networks (CNN), has been widely explored in the field of DR detection [10][11][12][13][14], largely surpassing previous image recognition methodologies [14]. Overall, Deep learning has demonstrated tremendous potential in the healthcare domain, enabling the identification of patients likely to develop a disease in the future [6]. In terms of DR, the applications range from binary classification (No [...] high-dimensional datasets. The models have proven successful at learning the most discriminative, and often abstract, aspects of the image, while remaining insensitive to irrelevant details such as orientation, illumination or background.

Numerous challenges in automatic DR detection have been identified in the literature. The diagnosis is particularly difficult for patients with an early stage of DR [1]. As highlighted by Pratt et al. [12], Neural Networks struggle to learn sufficiently deep features to detect intricate aspects of Mild DR. In the same study, approx. 93% of mild cases were incorrectly classified as healthy eye instances. The problem is illustrated in [...]. Deep learning-based DR detection studies consistently report high performance on severe cases, while identification of mild cases still remains a challenge. This limitation impedes the wider application of fully automated mass-screening due to the potential omission of an early phase of DR, leading to more advanced condition development in the future. Also, according to the study conducted by Ting et al. [15], the referable stage of DR (mild) was 5x more prevalent than the vision-threatening stage of DR (severe), demonstrating the significance of early lesion detection.

Transfer learning has already been validated and has demonstrated promising results in medical image recognition. The concept re-purposes knowledge learned on a primary task to a secondary task. Transfer learning is particularly useful in Deep learning applications that require vast amounts of data and substantial computational resources. The state-of-the-art CNN models, pre-trained on a large public image repository, have been used as part of this study, following the concept of Transfer learning. Using the initialised weights, the top layers of the Neural Networks have been trained for the customised No DR/Mild DR binary classification from publicly available fundus image corpora. Improved classification performance via Transfer learning has already been reported in prior research on automated DR detection [16]. Unlike previous approaches, the study conducted focuses entirely on Mild DR instances, currently challenging to identify. Several task-specific data augmentation techniques for classification performance improvement are further evaluated. Finally, the fully-automated Deep learning-based system facilitates methodology reproducibility and consistency in order to streamline the early DR detection process, increasing access to mass-screening services among the population-at-risk. [...]

• Models fine-tuning to reflect the specifics of the application case-study;

• Various Optimisers performance assessment and selection of the optimal one;
• Combination and augmentation of the 2 datasets for further accuracy improvement.

To illustrate the steps followed, the high-level process pipeline is presented in Fig 2.

The data was acquired from publicly available corpora, i.e. Kaggle and Messidor. [...] for the Messidor and Kaggle datasets among the respective DR classes (the exact numbers of images for each dataset/class can be found in Table 2).

The knowledge transfer from a primary to a secondary task frequently acts as the only solution in highly-specialised disciplines, where the availability of large-scale quality data proves challenging. The adoption of already pre-trained models is not only an efficient optimisation procedure, but also supports classification improvement. The first layers of CNNs learn to recognise generic features such as edges, patterns or textures, whereas the top layers focus on more abstract and task-specific aspects of the image, such as blood vessels or hemorrhages. Training only the top layers on the target dataset, while using the initialised parameters for the remaining ones, is a commonly employed approach, in particular in the computer vision domain. Apart from efficiency gains, having fewer parameters to train also reduces the risk of overfitting, which is a major problem in the Neural Network training process [12]. The CNN models used in the experiments, along with their characteristics, are presented in Table 1.

The algorithms were implemented using the Keras library, with TensorFlow as a back-end. The image resolution was standardised to a uniform size in accordance with the input requirements of each model. The number of epochs, i.e. complete forward and backward passes through the network, was set to 20 due to the use of already pre-trained models.
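As a minimal illustration of the input-standardisation step above, a nearest-neighbour resize can be sketched in pure Python (the study's actual pipeline is not published; the function name is illustrative, and in practice a library resizer would be used; e.g. ResNet50's default Keras input size is 224x224):

```python
def resize_nearest(img, out_h, out_w):
    """Resize a 2-D image (list of rows) to out_h x out_w by
    nearest-neighbour sampling of the source pixels."""
    in_h, in_w = len(img), len(img[0])
    return [
        [img[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]

# A 2x2 image up-scaled to a 4x4 grid; each source pixel is repeated.
small = [[1, 2],
         [3, 4]]
large = resize_nearest(small, 4, 4)
```

Each model in Table 1 would receive images resized in this fashion to its own expected resolution before training.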

The training/testing split was set to 60/40 given the small-to-moderate dataset size.

Stratified random sampling was performed in order to ensure the correct class distribution and the reliability of the final findings. The mini-batch size was set to 32, and the cross-entropy loss function was selected due to its suitability for binary classification.
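The stratified 60/40 split can be sketched as follows (a pure-Python sketch; the study's own implementation is not published, and the function name is illustrative — in practice scikit-learn's `train_test_split` with `stratify` would do the same):

```python
import random

def stratified_split(items, labels, train_frac=0.6, seed=42):
    """Split items so each class keeps its proportion in both parts
    (here 60% training / 40% testing)."""
    rng = random.Random(seed)
    by_class = {}
    for item, label in zip(items, labels):
        by_class.setdefault(label, []).append(item)
    train, test = [], []
    for label, group in sorted(by_class.items()):
        rng.shuffle(group)                     # random sampling within the class
        cut = round(len(group) * train_frac)   # preserve per-class proportion
        train.extend((x, label) for x in group[:cut])
        test.extend((x, label) for x in group[cut:])
    return train, test
```

With 70 No DR and 30 Mild DR images, both the training and testing parts keep the 70/30 class ratio.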

The default Optimiser was RMSProp. The standard evaluation metric of Accuracy on the testing dataset was used for final results validation.

The CNN models adopted in the study were pre-trained on the large-scale ImageNet dataset that spans numerous categories including flowers, fruits, animals, etc. Such models obtain high performance on classification tasks for the objects present in the training dataset, but prove limiting in their application to niche domains, such as DR detection. Diagnosis of pathological indications in fundoscopy depends on a wide range of complex features and their localisations within the image [1]. In each layer of a CNN, there is a new representation of the input image, created by progressive extraction of the most distinctive characteristics [10]. For example, the first layer is able to learn edges, while the last layer can recognise exudates, a DR classification feature [1]. As a result, the threshold of 100 was selected, and the subsequent layers of each model were 'un-frozen' and fine-tuned to the application dataset. The initial 100 layers were treated as a fixed feature extractor [26], while the remaining layers were adapted to the specific characteristics of fundus photography. The potential classification improvement on the DR detection task was evaluated as a result of the proposed models customisation. In the study conducted by Zhang et al. [5], the performance accuracy of Deep learning-based DR detection [...]

[...] Data augmentation substantially expands the dataset, alleviating the imbalance issues between the classes. The comparison of image volumes before and after augmentation is included in Fig 5.
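The fine-tuning scheme above — the first 100 layers kept frozen as a fixed feature extractor, the layers from 100 onwards un-frozen and re-trained — can be sketched framework-agnostically (in Keras this corresponds to setting each layer's `trainable` attribute; the helper name is illustrative):

```python
def fine_tune_flags(num_layers, freeze_until=100):
    """Per-layer trainable flag: False = frozen (fixed feature extractor),
    True = un-frozen and re-trained on the fundus dataset."""
    return [i >= freeze_until for i in range(num_layers)]

# e.g. for a model with 120 layers: layers 0-99 stay frozen,
# layers 100-119 are re-trained on the target data.
flags = fine_tune_flags(120)
```

Only the `True`-flagged layers contribute trainable parameters, which is the source of the efficiency and overfitting gains discussed earlier.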

Models comparison
The 13 pre-trained CNNs were compared in terms of the Accuracy yielded on the testing dataset (Table 3). Additionally, fine-tuning was applied as an alternative to the default option. After removal and re-training of n layers from 100 onwards (n was [...] optimisation procedure (Fig 2).
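The Accuracy metric used throughout the comparison reduces to a simple ratio of correct predictions; a minimal sketch (the 0 = No DR, 1 = Mild DR label encoding is an assumption made here for illustration):

```python
def accuracy(y_true, y_pred):
    """Fraction of test images whose predicted class matches the label."""
    assert len(y_true) == len(y_pred)
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# e.g. three of four predictions match the ground truth -> 0.75
score = accuracy([0, 1, 1, 0], [0, 1, 0, 0])
```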

The Accuracy after each epoch was further plotted in order to investigate the models' convergence capabilities in the default and fine-tuning scenarios (Fig 6). As a result, the computational intensity was additionally evaluated. [...] (Table 5). [...] indicate that the number of people with Diabetes is projected to rise [12,27]. This in turn places an enormous amount of pressure on available resources and experts. [...] According to British Diabetic Association guidelines, a minimum standard of 80% sensitivity and 95% specificity must be obtained for sight-threatening DR detection by any method [29]. The second option allows the large-scale mass-screening outputs to be narrowed down to potential DR cases, followed by human examination. Both scenarios significantly reduce the burden on skilled ophthalmologists and specialised facilities, making the process accessible to a wider population, especially in low-resource settings.

August 28, 2019 8/12

[...] training greatly affects the outcome of the Neural Network process [9]. The numerous pre-processing steps were implemented to measure potential accuracy improvement for No DR/Mild DR image classification. As Mild DR proves extremely challenging to identify from a healthy retina due to only subtle indications of the disease, the data augmentation undertaken was believed to enhance the visibility of pathological features (e.g. zoom and crop).
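The flip and rotation transforms used for augmentation can be sketched on a tiny image represented as a list of rows (function names are illustrative; in the Keras stack these correspond to `ImageDataGenerator` options such as `horizontal_flip`):

```python
def hflip(img):
    """Horizontal flip: mirror each row."""
    return [row[::-1] for row in img]

def vflip(img):
    """Vertical flip: reverse the row order."""
    return img[::-1]

def rot90(img):
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

img = [[1, 2],
       [3, 4]]
# one original image yields four training samples
augmented = [img, hflip(img), vflip(img), rot90(img)]
```

Applied to every Mild DR image, such transforms multiply the minority-class volume without collecting new photographs.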

The top 3 CNN architectures with the top layer removed and re-trained were VGG19, ResNet50 and NASNetLarge, yielding an accuracy of 81.3% (Table 3). The lowest performance was obtained by DenseNet169 (56.9%) and MobileNet (58.3%), respectively. In terms of DenseNet169, the characteristic feature of its structure, which connects each layer to every other layer in a feed-forward manner, did not prove to enhance the performance on the No DR/Mild DR classification task. As for MobileNet, the results only confirmed its intended purpose for mobile applications due to its lightweight and streamlined architecture, which comes at a cost to the Accuracy.

The effect of fine-tuning (un-freezing the layers from 100 onwards) differed across the models. The Accuracy improvement observed was only minor, suggesting the relative suitability of the pre-trained models to the DR detection task. In other words, the CNN models were able to identify Mild DR from a healthy retina despite being trained on unrelated pictures from the ImageNet repository. When no Accuracy increase is achieved, further un-freezing of layers is not recommended due to the unnecessary computational time and cost incurred.

To complete the analysis of the effect of fine-tuning, graphs depicting each CNN architecture's performance at the respective epochs (a single pass of the full training set) were plotted, as illustrated in Fig 6. Despite no major influence on the classification Accuracy, faster model convergence was observed due to the fine-tuning applied. The higher number of layers un-frozen and re-trained made the models more task-specific, leading to an improved use of resources due to the reduced training time for the most optimal performance. The finding was particularly noticeable for the following models: Xception, MobileNet and DenseNet169.

Next, the various Optimisers were evaluated on the top 3 CNN architectures. Table 4 presents the Accuracy outputs obtained for each model. While there was no major impact on the classification performance for VGG19, higher variability was observed for ResNet50, proving its sensitivity to the selection of the most suitable Optimiser. Overall, RMSProp proved the most optimal choice for 2 out of 3 models. ResNet50 + RMSProp was selected as the maximum-Accuracy model + Optimiser option for the No DR/Mild DR classification task.
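For context, RMSProp adapts the step size per parameter by dividing each gradient by a running root-mean-square of recent gradients. A single scalar update step can be sketched as follows (the default coefficients shown follow the common Keras values and are an assumption here, not taken from the study):

```python
import math

def rmsprop_step(w, grad, avg_sq, lr=0.001, rho=0.9, eps=1e-7):
    """One RMSProp update for a single weight.
    avg_sq is the exponential moving average of squared gradients."""
    avg_sq = rho * avg_sq + (1 - rho) * grad * grad
    w = w - lr * grad / (math.sqrt(avg_sq) + eps)
    return w, avg_sq

# first step from a cold start: the squared-gradient average is seeded
w, s = rmsprop_step(1.0, 2.0, 0.0)
```

The per-parameter scaling is what distinguishes RMSProp from plain SGD and plausibly explains ResNet50's sensitivity to the Optimiser choice reported above.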

As the last step, 2 scenarios were considered, namely (i) volume increase and (ii) data augmentation. As expected, the datasets combination, i.e. Messidor + Kaggle (aug), further improved the classification Accuracy by +5% (86%). In contrast, the combination of the raw datasets, i.e. Messidor + Kaggle (raw), yielded 90.5%, which is considered a case of data imbalance. Therefore, the augmented images help in data [...]

Several shortcomings of the study have been identified. Firstly, due to the limited availability of Mild DR images, only a small-to-moderate dataset size was used in the study. As a compensation procedure, the CNN models pre-trained on the large-scale ImageNet database have been adopted. Appropriate data augmentation techniques have been further applied to expand the dataset, i.e. rotation, horizontal/vertical flip, etc. Secondly, the default hyper-parameters were followed while training the classifiers. [...]

As this is the initial study focusing on binary No DR/Mild DR classification, future work will cover finer-grained information extraction from cases previously identified as Mild DR. For instance, upon sufficient data availability, the model will allow the recognition of particular lesions such as exudates or aneurysms. The more in-depth classification will further assist the retinal practitioners in a more efficient eye-screening procedure. Also, highly varied input data (e.g. in terms of ethnicity, age group, level of lighting) will support the model's robustness and flexibility. Additionally, different scenarios with respect to the number of layers and nodes will be experimented with using TensorBoard for dynamic visualisation. Increased convolution layers are perceived to allow the model to learn deeper features [9,12]. This in turn will enable the most optimal CNN architecture design (depth and width of the network) for maximum classification accuracy.
An increase in network dimensionality is the most direct way to enhance model performance [16]. Future work will also place more emphasis on outputs visualisation in order to obtain greater insight into the models' internal workings, and to further improve the classification capability. In particular, the exact image regions that are associated with specific classification results will be highlighted, as well as the magnitude of each feature's intensity (so-called attention/saliency maps [8]). Improved understanding of the algorithm's workings will facilitate the automated system's wider adoption and acceptance among physicians [6]. Finally, experiments with an ensembling approach will be conducted, where the results of Neural Network models trained on the same data will be averaged in order to evaluate further classification accuracy gains.

Early detection and immediate treatment of DR is considered critical for the prevention of irreversible vision loss. Automated DR recognition has been a subject of many studies in the past, with the main focus on binary No DR/DR classification [12]. According to the results, identification of moderate to severe indications does not pose major [...] subjective manual feature extraction methods. Furthermore, the study used combined datasets from various sources to evaluate the system's robustness in its ability to adapt to real-world scenarios. As stated by Wan et al. [16], a single data collection environment poses difficulty in accurate model validation. The system successfully facilitates the streamlining of the labour-intensive eye-screening procedure, and serves as an auxiliary diagnostic reference whilst avoiding human subjectivity.