Feasibility of Automated Deep Learning Design for Medical Image Classification by Healthcare Professionals with Limited Coding Experience

Deep learning has huge potential to transform healthcare. However, considerable expertise is required to train such models, and this is a major blocker for their translation into clinical practice. In this study, we therefore sought to evaluate the use of automated deep learning software to develop medical image diagnostic classifiers by healthcare professionals with limited coding – and no deep learning – expertise. We used five publicly available open-source datasets: (i) retinal fundus images (MESSIDOR); (ii) optical coherence tomography (OCT) images (Guangzhou Medical University/Shiley Eye Institute, Version 3); (iii) images of skin lesions (Human against Machine (HAM)10000); and (iv) both paediatric and adult chest X-ray (CXR) images (Guangzhou Medical University/Shiley Eye Institute, Version 3, and the National Institutes of Health (NIH)14 dataset, respectively). Each dataset was fed separately into a neural architecture search framework that automatically developed a deep learning architecture to classify common diseases. Sensitivity (recall), specificity and positive predictive value (precision) were used to evaluate the diagnostic properties of the models. The discriminative performance was assessed using the area under the precision-recall curve (AUPRC). In the case of the deep learning model developed on a subset of the HAM10000 dataset, we performed external validation using the Edinburgh Dermofit Library dataset. Diagnostic properties and discriminative performance from internal validations were high in the binary classification tasks (range: sensitivity of 73.3-97.0%, specificity of 67-100% and AUPRC of 0.87-1). In the multiple classification tasks, the diagnostic properties ranged from 38-100% for sensitivity and 67-100% for specificity. The discriminative performance in terms of AUPRC ranged from 0.57 to 1 in the five automated deep learning models.
In an external validation using the Edinburgh Dermofit Library dataset, the automated deep learning model showed an AUPRC of 0.47, with a sensitivity of 49% and a positive predictive value of 52%. The quality of the open-access datasets used in this study (including the lack of information about patient flow and demographics) and the absence of measures of precision, such as confidence intervals, constituted the major limitations of this study. All models, except for the automated deep learning model trained on the multi-label classification task of the NIH CXR14 dataset, showed discriminative performance and diagnostic properties comparable to those of state-of-the-art deep learning algorithms. The performance in the external validation study was low. The availability of automated deep learning may become a cornerstone for the democratization of sophisticated algorithmic modelling in healthcare, as it allows the derivation of classification models without requiring a deep understanding of the mathematical, statistical and programming principles. Future studies should compare several application programming interfaces on thoroughly curated datasets.


INTRODUCTION
Diagnosis depends on data: its collection, integration and interpretation enable accurate classification of clinical presentations into an accepted diagnostic category. Human diagnosticians achieve acceptable accuracy in such classification tasks by learning diagnostic rules (patterns recorded by other human diagnosticians) and then training on real cases for which the diagnostic labels are provided (supervised clinical experience). Due to the limited capacity of the human neural network (the brain), the amount of data used to create these diagnostic rules, and then to reach a diagnosis in an individual patient, is highly selective and biased, with the vast majority of available data being ignored. In artificial intelligence (AI), the technique of deep learning uses artificial neural networks (so called because of their superficial resemblance to biological neural networks) as a computational model to discover intricate structure and patterns in large, high-dimensional datasets such as medical images. [1] A key feature of these networks is their ability to fine-tune based on experience, allowing them to adapt to their inputs and thus become capable of learning. It is this characteristic that makes them powerful tools for pattern recognition, classification, and prediction. In addition, the features discovered are not predetermined by human engineers but are learned from the input data. [2] Although first espoused in the 1980s, deep learning has come to prominence in recent years, driven in large part by the power of graphics processing units originally developed for video gaming, and the increasing availability of large datasets. [3] Since 2012, deep learning has brought seismic changes to the technology industry, with major breakthroughs in areas as diverse as computer vision, image captioning, speech recognition, natural language translation, robotics, and even self-driving cars. [4][5][6][7][8][9] In 2015, Scientific American listed deep learning as one of their "world changing" ideas for the year. [10]
Until now, the development and implementation of deep learning methodology in healthcare has faced three main blockers. Firstly, access to large, well-curated, and well-labeled datasets is a requirement that represents a major challenge to the use of deep learning. Although numerous institutions around the world have access to large clinical datasets, far fewer have them in a computationally tractable form and with robust clinical labels for learning tasks. Secondly, highly specialized computing resources are needed, since the performance of deep learning models is highly dependent on recent advances in massively parallel computing architectures, termed graphics processing units. The architecture of silicon customized to these tasks is rapidly evolving, with software companies increasingly designing their own hardware chips such as tensor processing units and field-programmable gate arrays. [11,12] Thus, it is already clear that it will be difficult for small research groups, working alone in hospital and university settings, to accommodate these huge financial costs and the rapidly evolving landscape. Thirdly, highly specialized technical expertise and significant mathematical knowledge are required to develop deep learning models. Currently, it is estimated that fewer than 10,000 people worldwide have the skills necessary to tackle serious AI research, with the majority of these employed outside academic institutions. [13] According to the 2017 Computing Research Association Taulbee survey, nearly 60% of new AI PhD graduates are recruited by industry. [14]
One approach to combat these obstacles is the increasingly popular technique of transfer learning, where a model developed for a specific task is repurposed and leveraged as a starting point for training on a novel task. While transfer learning mitigates some of the significant computing resources required in designing a bespoke model from inception, it nevertheless requires specialized deep learning expertise to deliver effective results. With this in mind, several companies released application programming interfaces (API) in 2018, claiming to have automated deep learning to such a degree that any individual with basic computer competence could train a high-quality model. [15,16] As programming is not a common skill among healthcare professionals, automated deep learning is a potentially promising platform to support the dissemination of deep learning application development in healthcare and medical sciences. In the case of classification tasks, these API products match generic neural-network architectures to a given imaging dataset, fine-tune the network with the aim of optimizing discriminative performance, and create a prediction algorithm as output. In other words, the input is a (labeled) image dataset, and the output is a custom classifying algorithm. Yet, the extent to which non-experts can replicate trained deep learning engineers' performance with the help of automated deep learning remains unclear.
In this study, healthcare professionals without any deep learning expertise explored the feasibility of automated deep learning model development and investigated the performance of these models in diagnosing a diverse range of diseases from medical imaging.
More precisely, we (i) identified medical benchmark imaging datasets for diagnostic image classification tasks and their corresponding publications on deep learning models; (ii) used these datasets as input; (iii) replaced the classic deep learning models with automated deep learning models; and (iv) compared the discriminative performance of the classic and the automated deep learning models. Moreover, we sought to evaluate the interface that was used for automated deep learning model development (Google Cloud AutoML© Vision API, Beta release) for its utilization in prediction model research. [17][18][19]

Challenge 1: Performance of automated deep learning in archetypal binary classification tasks
Diagnostic properties and discriminative performance in the investigated binary classification tasks were comparable to those of state-of-the-art deep learning algorithms.

Task 1: Classification of diabetic retinopathy vs normal retina on fundus images
The Retinal Fundus Image dataset comprised 1187 images in total, with 533 normal fundus images ("R0 cases"), and 153 images showing mild ("R1 cases"), 247 moderate ("R2 cases") and 254 severe diabetic retinopathy ("R3 cases"). Thirteen duplicate images were automatically excluded by the API. The automated deep learning model trained to distinguish healthy fundus images from fundus images showing diabetic retinopathy (R0 from R1, R2 and R3 cases) reached an AUPRC of 0.87, with best accuracy at a cut-off value of 0.5, a sensitivity of 73.3% and a specificity of 67%.
The AUPRC of this automated deep learning model was 1; best accuracy was reached at a cut-off value of 0.5, with a sensitivity of 97% and a specificity of 100%.
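The diagnostic properties reported at a cut-off value of 0.5 can be illustrated with a minimal sketch. The scores and labels below are hypothetical toy values, not data from this study, and the convention that a score equal to the cut-off counts as positive is an assumption rather than documented API behaviour.

```python
def diagnostic_metrics(scores, labels, cutoff=0.5):
    """Sensitivity, specificity and PPV of a binary classifier at one cut-off.

    scores: predicted probability of disease per image (hypothetical values)
    labels: 1 = disease present, 0 = disease absent
    Assumption: a score >= cutoff is treated as a positive prediction.
    """
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted_positive = score >= cutoff
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    nan = float("nan")
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else nan,  # recall
        "specificity": tn / (tn + fp) if tn + fp else nan,
        "ppv": tp / (tp + fp) if tp + fp else nan,          # precision
    }


# Toy example: four diseased and four healthy images.
toy_scores = [0.9, 0.8, 0.4, 0.7, 0.2, 0.1, 0.6, 0.3]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
toy_metrics = diagnostic_metrics(toy_scores, toy_labels)
print(toy_metrics)
```

In this toy case three of four diseased and three of four healthy images are classified correctly, so all three measures equal 0.75.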

Challenge 2: Performance of automated deep learning in multiple classification tasks
The two models trained on multiple classification tasks showed high diagnostic properties and discriminative performance.

Task 1: Classification of three common macular diseases and normal retinal OCT images
The Retinal OCT set provided by Guangzhou Medical University/Shiley Eye Institute comprised 101418 images of 5761 patients. Of these, 31882 images depicted OCT changes related to neovascular age-related macular degeneration (791 patients), 11165 to diabetic macular edema (709 patients), 8061 depicted drusen (713 patients), and 50310 were normal (3548 patients). One hundred and seventy-five images were identified as duplicates and excluded by the API. The AUPRC of the automated deep learning model trained to distinguish these four categories was 0.99, while best accuracy was reached at a cut-off value of 0.5, with a sensitivity of 97.3%, a specificity of 100% and a positive predictive value (PPV) of 97.7%. The thresholds were reported in two cases. [28,29] Tables 3 and 4 show the corresponding confusion matrices for the OCT and Dermatology Image set, and Fig 1 shows cases from each model where the incorrect label was predicted.
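For a multi-class task, per-class sensitivity and specificity can be read off a confusion matrix by treating each class in turn as "positive" and all others as "negative" (one-vs-rest). The following sketch uses a hypothetical four-class matrix with invented counts; it does not reproduce Tables 3 or 4, and the short class names are illustrative only.

```python
def one_vs_rest_metrics(matrix, class_names):
    """Per-class sensitivity and specificity from a square confusion matrix.

    matrix[i][j] = number of images with true class i predicted as class j.
    Assumption: every class has at least one true case and one non-case.
    """
    total = sum(sum(row) for row in matrix)
    results = {}
    for i, name in enumerate(class_names):
        tp = matrix[i][i]
        fn = sum(matrix[i]) - tp                 # true class i, predicted other
        fp = sum(row[i] for row in matrix) - tp  # other class, predicted i
        tn = total - tp - fn - fp
        results[name] = {
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp),
        }
    return results


# Hypothetical counts (100 images per true class), for illustration only.
oct_classes = ["nAMD", "DME", "drusen", "normal"]
toy_matrix = [
    [95, 2, 2, 1],
    [3, 90, 1, 6],
    [5, 0, 80, 15],
    [0, 1, 2, 97],
]
per_class = one_vs_rest_metrics(toy_matrix, oct_classes)
for name, values in per_class.items():
    print(name, values)
```

Reporting the measures per class in this way makes it visible when a high overall figure hides a weak class, such as drusen in the toy matrix above.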

External Validation
In the case of the deep learning model developed on a subset of the Dermatology Image set, we additionally performed an external validation using the Dermatology Validation set. As the latter set did not include benign keratosis as a label, these images were removed from the Dermatology Image set used for training.
The automated deep learning model showed poor diagnostic properties and a discriminative performance near chance (AUPRC: 0.47, best accuracy at a cut-off value of 0.5, with a sensitivity of 49% and a positive predictive value of 52%). Of note, the sensitivity for melanoma classification was 11%, with a misclassification rate of 63.7%.
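As an illustration of how an area under the precision-recall curve can be obtained from ranked predictions, the sketch below computes average precision, one common estimator of AUPRC. Whether the API uses this estimator is not stated in the documentation we relied on, so this is an assumption; the scores and labels are hypothetical, and tied scores are not treated specially.

```python
def average_precision(scores, labels):
    """Average precision: mean of the precision at the rank of each positive.

    scores: predicted probability per image (hypothetical values)
    labels: 1 = positive class, 0 = negative class
    Assumption: scores are distinct, so ranking is unambiguous.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    tp = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            tp += 1
            total += tp / rank  # precision among the top `rank` predictions
    return total / n_pos


toy_scores = [0.9, 0.8, 0.4, 0.7, 0.2, 0.1, 0.6, 0.3]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
toy_ap = average_precision(toy_scores, toy_labels)
print(round(toy_ap, 3))
```

Unlike the area under the ROC curve, the value expected from an uninformative ranking is not fixed at 0.5 but lies close to the prevalence of the positive class, which should be kept in mind when interpreting an AUPRC.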
Interestingly, nevus was the most likely classification in all cases, followed by the ground truth. The only exception was the case of actinic keratosis, where its ground truth diagnosis was the third most probable diagnosis to be predicted.
From a methodological viewpoint, our results, as with those reported in the state-of-the-art deep learning studies, might be overly optimistic, since we were not able to test all the models "out-of-sample" as recommended by current guidelines. Moreover, for external validation, the present version of the API only allows single image upload for prediction, limiting large-scale external validation. This considerably reduces its usability for systematic evaluation in prediction model research, given the high numbers of images that these datasets comprise. To circumvent this issue, we created a proxy for an external validation and found a substantial reduction in the diagnostic and discriminative performance of the deep learning model. The limited performance of automated deep learning models (also in the multi-label classification task) is likely related to inadequacy of the datasets used to train the models. To obviate concerns over class imbalance in our external validation, we under-sampled accordingly; however, this did not significantly alter the model's discriminative performance.
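Under-sampling to address class imbalance can be sketched as follows. This is a generic random under-sampling to the size of the smallest class, shown under stated assumptions; it is not a description of the exact procedure used in this study, and the file names and class counts are hypothetical.

```python
import random
from collections import Counter, defaultdict


def undersample(items, labels, seed=0):
    """Randomly reduce every class to the size of the smallest class.

    items:  image identifiers (hypothetical file names)
    labels: diagnostic label per image
    Returns a shuffled list of (item, label) pairs with equal class counts.
    """
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    by_class = defaultdict(list)
    for item, label in zip(items, labels):
        by_class[label].append(item)
    n_min = min(len(group) for group in by_class.values())
    balanced = []
    for label, group in by_class.items():
        for item in rng.sample(group, n_min):  # sample without replacement
            balanced.append((item, label))
    rng.shuffle(balanced)
    return balanced


# Toy example: an imbalanced set dominated by one class.
toy_labels = ["nevus"] * 6 + ["melanoma"] * 2 + ["basal cell carcinoma"] * 3
toy_items = [f"img_{i:03d}.png" for i in range(len(toy_labels))]
balanced = undersample(toy_items, toy_labels)
class_counts = Counter(label for _, label in balanced)
print(class_counts)
```

Because under-sampling discards images from the larger classes, it trades class balance against the size of the validation set.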

Strengths
To our knowledge, this is the first assessment of the feasibility and usefulness of automated deep learning technology in medical imaging classification tasks performed by healthcare professionals. In this study, we made every effort to comply with the reporting guidelines for prediction model research, and for developing and reporting machine learning predictive models. [24,30] One major strength of our study is that we tested one exemplary model for robustness in an out-of-sample cross validation, since internal validation and random split-sample validation have been claimed to overestimate test performance. A further strength is that our results can easily be explored and replicated by others, given the use of public datasets and the free trial use of the AutoML©®™ Cloud Vision API.

Limitations and future directions
The convenience sampling used in the out-of-sample cross validation has been claimed to introduce bias and exaggerate estimates of diagnostic performance. [ Currently, studies on AutoML, including ours, have to rely on publicly available datasets. While using them allows comparison of performance between different algorithms, these are not without concern. For many classification tasks, and particularly for validation barrier towards an efficient use of published data. However, since many studies will be dealing with ephemeral processing of de-identified data, we do not believe that the General Data Protection Regulation is likely to pose a substantial hindrance.
We confirmed feasibility but encountered various methodological problems that are well known in research projects performing classification tasks and predictions. We believe that concerted efforts in terms of data quality and accessibility are needed to make automated deep learning modelling successful. Moreover, as the technology evolves, transparency in the reporting of the models and careful reporting of their performance in methodologically sound validation studies will be pivotal for successful implementation into practice. Finally, the extent to which automated deep learning algorithms must adhere to regulatory requirements for medical devices is unclear. [34] Although the development of deep learning prediction models was feasible for healthcare professionals without any deep learning expertise, we would recommend the contrast, large-scale standardized deep learning algorithms will necessitate worldwide, multivariable validations of expertly tuned models. Thus, there is considerable value to these "small data" approaches customized to a specific geographical patient population that a given clinic may encounter. This may be where automated deep learning finds its niche in the medical field. Importantly, this could make models susceptible to selection bias, over-fitting, and a number of other issues from imprecise model training or patient selection. Therefore, regulatory guidelines are needed for both medical deep learning and clinical implementation of these models before they may be used in clinical practice. In summary, while our approach seems rational in this early evaluation, the results of this study cannot yet be extrapolated into clinical practice.

Study design and data source
[20][21][22][23] Moreover, in a proof-of-principle evaluation, we aimed at testing one of the models "out-of-sample" as recommended by current guidelines. [24] The current version of the API only allows single image upload for model prediction, limiting the feasibility of large-scale external validation. However, we created a proxy for an external validation in one exemplary use case, where we used the Dermatology Image set for algorithm development and tested its performance using the Edinburgh Dermofit Library dataset (hereafter referred to as Dermatology Validation set). [25]

Training of healthcare professionals using the API
Two clinicians (LF and SKW) with no prior coding or machine learning experience performed the analysis after a period of self-study based on the online documentation. In total, both researchers invested approximately ten hours of preparation. Due to the release cycle evolution of the AutoML©®™ Cloud Vision API during the study (alpha release May 2018, beta release July 2018), they adopted an iterative approach when executing the analyses. All analytic steps and interpretations of results were performed jointly.

Patient recruitment and enrolment
We accessed five de-identified open-source imaging datasets that were collected from retrospective, non-consecutive cohorts, showing diseases or disease features of common medical diagnoses. Eligibility criteria, patients' demographics and patient workflow for each of these datasets are published elsewhere. [20][21][22][23]

Index Test: AutoML©®™ Cloud Vision API
The term "automated machine learning" commonly refers to "automated methods for model selection and/or hyperparameter optimization". This is the concept which led to the idea of allowing a neural network to design another neural network, through the application of a neural architecture search. [17][18][19] In deep learning, designing and choosing the most suitable model architecture requires a significant amount of time and experimentation even for those with significant deep learning expertise. This is because the search space of all possible model architectures can be exponentially large (e.g. a typical 10-layer network could have ~10^10 candidate networks). To make this model design process easier and more accessible, an approach known as neural architecture search has recently been described. [26] Neural architecture search is typically achieved using one of two methods: 1) reinforcement learning algorithms, and 2) evolutionary algorithms. The former forms the basis of the commercially available API that was evaluated in this study. [27]

Data handling and analytic approach

Comparison with Benchmark Classic Deep Learning Models
In order to provide a direct comparator to the performance of classic deep learning models Recognition Challenge, for the first time. [4] If a study provided contingency tables for the same or for separate algorithms tested in a specific classification task, we assumed these to be independent from each other. We accepted this, as we were particularly interested in providing an overview of the results of various studies rather than providing precise point estimates.

Statistical Analysis
The AutoML©®™ Cloud Vision API provides metrics that are commonly used by the AI- We a priori attempted to compare the classification performance between state-of-the-art deep learning studies and our results. However, while the published reports provided areas under the receiver operating characteristic curve (AUC ROC), the AutoML API reports the AUPRC. Although the points of the two types of curves can be mapped one-to-one, and hence curves can be translated from the ROC space to the precision-recall space (if the confusion matrices are identical), differences in the confusion matrices and the level of reporting impeded us from performing a comparison on the level of AUC. Instead, we compared the