Automatic Detection of Diabetic Eye Disease Through Deep Learning Using Fundus Images: A Survey

Diabetes Mellitus, or Diabetes, is a disease in which a person’s body fails to respond to insulin released by their pancreas, or does not produce sufficient insulin. People suffering from diabetes are at high risk of developing various eye diseases over time. As a result of advances in machine learning techniques, early detection of diabetic eye disease using an automated system brings substantial benefits over manual detection. A variety of advanced studies relating to the detection of diabetic eye disease have recently been published. This article presents a systematic survey of automated approaches to diabetic eye disease detection from several aspects, namely: i) available datasets, ii) image preprocessing techniques, iii) deep learning models and iv) performance evaluation metrics. The survey provides a comprehensive synopsis of diabetic eye disease detection approaches, including state-of-the-art approaches, and aims to provide valuable insight to research communities, healthcare professionals and patients with diabetes.


I. INTRODUCTION
Diabetic Eye Disease (DED) comprises a group of eye conditions, which include Diabetic Retinopathy, Diabetic Macular Edema, Glaucoma and Cataract [1]. All types of DED have the potential to cause severe vision loss and blindness in patients from 20 to 74 years of age. According to the International Diabetes Federation (IDF), about 425 million people worldwide suffered from diabetes in 2017. By 2045, this is forecast to increase to 692 million [2]. The medical, social and economic complications of diabetes have a substantial impact on public health, with diabetes being the world's fourth-largest cause of death [3]. The effects of diabetes can be observed in different parts of a person's body, including the retina. Fig. 1 shows the normal anatomical structures of the retina. Fig. 2 illustrates a complication of DED in a retina. Serious DED begins with an irregular development of blood vessels, damage to the optic nerve and the formation of hard exudates in the macula region. Four types of DED threaten eye vision, and they are briefly described below.
The associate editor coordinating the review of this manuscript and approving it for publication was Haiyong Zheng.
Diabetic Retinopathy (DR) is caused by damage to the blood vessels of the light-sensitive tissue (retina) at the back of the eye. The retina is responsible for sensing light and sending signals to the brain, which decodes those signals to perceive surrounding objects [4]. There are two stages of DR: early DR and advanced DR. In early DR, new blood vessels are not developing (proliferating), and this stage is generally known as nonproliferative diabetic retinopathy (NPDR). In NPDR, the walls of the blood vessels inside the retina weaken. Tiny bulges (microaneurysms) protrude from the walls of the narrower vessels, often leaking fluid and blood into the eye. Large retinal vessels also start dilating and become irregular in diameter. As more blood vessels become blocked, NPDR progresses from mild to severe. Depending on the severity, the retina's nerve fibres may begin to swell. The central part of the retina (macula) often swells as well (macular edema), a condition that requires treatment. NPDR is divided into three stages, namely: mild, moderate and severe [5]. Advanced DR is called proliferative diabetic retinopathy (PDR). In this case, the damaged blood vessels leak into the transparent jelly-like fluid that fills the centre of the eye (vitreous), and abnormal blood vessels develop in the retina. Pressure can build up in the eyeball because the newly grown blood vessels interrupt the normal flow of the fluid. This can damage the optic nerve that carries images from the eye to the brain, leading to glaucoma.
VOLUME 8, 2020 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Glaucoma (Gl) is an ocular disease that damages the optic nerve that links the eye to the brain. When the fluid pressure inside the eye, known as intraocular pressure (IOP), is high, the optic nerve is impaired [6].
An increase in blood sugar doubles the chances of Gl, which can lead to vision loss and blindness if not detected early. Gl can be classified into three types based on the size of the enlarged optic nerve head or optic disc and the Cup-to-Disc Ratio (CDR), or cupping. The stages of Gl are mild, moderate and severe [7].
Diabetic Macular Edema (DME) occurs when fluid builds up in the centre of the retina (macula) due to damage to the blood vessels. The macula is responsible for sharp, straight-ahead vision. Fluid buildup causes swelling and thickening of the macula, which distorts vision [4]. The stages of DME can be categorized into mild, moderate and severe based on the following points [8]:
• Retinal thickening of the fovea at or below 500 µm or 1/3 of its disc diameter
• Hard exudates, with subsequent retinal thickening, at or within 500 µm of the fovea
• Retinal thickening at a size that is greater than one disc diameter (1500 µm), and which is within one disc diameter of the fovea.
Cataract (Ca) is the degeneration of the lens protein due to high sugar levels, which clouds the lens and in turn leads to blurred vision. Diabetic people are more prone to growing cloudy lenses and developing Ca earlier than nondiabetic people. Usually Ca is graded into four classes: noncataractous, mild, moderate and severe [9].
Patients suffering from diabetes display a significantly higher predisposition to develop DED. As a consequence, early detection of DED has become paramount in preventing vision loss in adults and children. Studies have already shown that 90% of patients with diabetes can avoid DED development through early detection [10]. Manual detection of DED involves no computer assistance, resulting in longer waiting times between early diagnosis and treatment. Moreover, the initial signs of DED are so minute that even an expert may struggle to identify them.
Advancements in Artificial Intelligence (AI) offer automated DED detection many advantages over the manual approach. They include a reduction in human error, time efficiency and easier detection of minute abnormalities. Automated DED detection systems can be assembled by combining image processing techniques with either Machine Learning (ML) or Deep Learning (DL) techniques.
In DL approaches, images with DED and without DED are collected. Then, image preprocessing techniques are applied to reduce noise in the images and prepare them for the feature extraction process. The preprocessed images are input to the DL architecture, which automatically extracts features and learns their associated weights to derive the classification rules. The feature weights are optimized iteratively to ensure the best classification results. Finally, the optimized weights are tested on an unseen set of images. This type of architecture demands a large number of images for training. Therefore, a limited number of images can severely restrict its performance.
DL techniques require a substantial amount of computational memory and power. Normally, to develop and evaluate the classification model, a DL architecture requires a Graphics Processing Unit (GPU). In real-world DL applications, the assumption that such hardware is available does not always hold. Training a DL model on images can be costly, challenging in terms of annotated data collection, and time and power consuming. To account for the above-mentioned shortcomings, researchers have introduced an approach called Transfer Learning (TL), or Knowledge Transfer (KT). In TL, previously derived knowledge (e.g. in terms of features extracted) can be readapted to solve new problems. Not only does TL drastically reduce the training time, it also reduces the need for large amounts of data. The latter point proves particularly convenient in niche applications where high-quality input images annotated by specialists are often limited or expensive to obtain.
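The idea of readapting previously derived knowledge can be illustrated with a framework-free toy sketch. It is not taken from any surveyed study: the frozen weights, the sample data and all function names are hypothetical, and in practice a CNN pretrained with a DL library would play the role of the frozen feature extractor. Only the small classification head is trained on the new task.

```python
import math

# Hypothetical frozen weights standing in for a feature extractor that was
# pretrained on a large source task. They are never updated below.
PRETRAINED_WEIGHTS = ((1.0, -1.0, 0.5), (-0.5, 1.0, 1.0))


def extract_features(x):
    """Frozen layer: a fixed linear map followed by ReLU."""
    return [max(0.0, sum(w * v for w, v in zip(row, x)))
            for row in PRETRAINED_WEIGHTS]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_head(samples, labels, learning_rate=0.1, epochs=200):
    """Fit only a new logistic-regression head on top of the frozen features."""
    features = [extract_features(x) for x in samples]
    weights = [0.0] * len(features[0])
    bias = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for f, y in zip(features, labels):
            error = sigmoid(sum(w * v for w, v in zip(weights, f)) + bias) - y
            grad_w = [g + error * v for g, v in zip(grad_w, f)]
            grad_b += error
        # Full-batch gradient descent step on the head parameters only
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


def predict(x, weights, bias):
    f = extract_features(x)
    return int(sigmoid(sum(w * v for w, v in zip(weights, f)) + bias) >= 0.5)


# Toy target task with only four labelled samples
samples = [(2.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 3.0, 0.0)]
labels = [1, 1, 0, 0]
head_weights, head_bias = train_head(samples, labels)
```

Because the extractor is fixed, only three numbers (two head weights and a bias) are learned here, which is the reason TL can cope with small annotated datasets.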

A. MOTIVATION
As mentioned above, DL and TL techniques have their advantages and disadvantages; nevertheless, several researchers have used these methods to build automatic DED detection systems in recent years. Overall, very few review studies published in academic databases simultaneously address all types of DED detection. Thus, this literature review is essential to collate the work in the DED detection field.
Ting et al. [11] published a review article focusing on eye conditions such as diabetic retinopathy, glaucoma, and age-related macular diseases. They selected papers published between 2016 and 2018 and summarised those which used fundus and optical coherence tomography images, and TL methods. Their research did not include more recent (2019-2020) publications that incorporated TL methods into their approach, and they omitted the identification of eye cataract disease from their study scope. Similarly, Hogarty et al. [12] provided a review of current articles using AI in ophthalmology, but their coverage of AI methodologies was not comprehensive. Mookiah et al. [13] reviewed computer-aided DR detection studies, which are largely DR lesion based. Ishtiaq et al. [14] comprehensively reviewed DR detection methods from 2013 to 2018, but their review lacked studies from 2019 to 2020. Hagiwara et al. [15] reviewed articles on the computer-aided diagnosis of Gl using fundus images. They addressed computer-aided systems and systems focused on optic disc segmentation. A variety of studies using DL and TL methods for Gl detection were not discussed in their review paper. It is, therefore, important to review papers that consider existing approaches to DED diagnostics. In fact, most scholars did not state the period of publication years covered by their review articles. Existing reviews were too narrow, either in terms of disease (DR, Gl, DME and Ca) or in aspects of methodology (DL and ML). Therefore, to address the limitations of the above-mentioned studies, this article offers a thorough analysis of both DL and TL approaches to automated DED detection published between 2014 and 2020, covering the current detection methods built through DL or TL based approaches.

B. CONTRIBUTION
To provide a structured and comprehensive overview of the state of the art in DED detection systems using DL, this paper surveys the literature from the following perspectives:
1) Datasets available for DED
2) Preprocessing techniques applied to fundus images for DED detection
3) DL approaches proposed for DED detection
4) Performance measures for DED detection algorithm evaluation.
The arrangement of this article is as follows. Section II presents the research method followed in surveying the articles. Section III analyses the papers based on the datasets used in their study. Section IV addresses the image processing techniques used in the prior work. Section V analyses the articles based on the classification methods employed. Section VI discusses the findings and observations. Section VII discusses the gaps and future directions. Finally, Section VIII concludes the paper.

II. RESEARCH METHOD
The overall research method followed is shown in Fig. 3. Initially, a keyword search was conducted using 10 academic databases considering our specific review target. Seven filters were applied to select the primary review target. Afterwards, the selected articles were critically analysed and grouped into three categories, namely: (i) papers employing TL, (ii) papers proposing a new DL network and (iii) papers combining DL and ML. Review target keywords were searched using the 'AND' Boolean operator and included: "deep learning", "transfer learning", "image processing", "image classification", "fundus images", "diabetic eye disease", "diabetic retinal disease", "diabetic retinopathy", "glaucoma", "diabetic macular edema" and "cataract".

Selection of Articles
Papers published between April 2014 and January 2020 were considered eligible for this study due to rapid advances in the field. We then narrowed our search to Conference Papers and Journal Articles. After the selection process, we encountered several duplicates as a result of using 10 different databases. After removing duplicates, the titles, abstracts and conclusions of the remaining publications were carefully read. This yielded 69 articles focusing on fundus images, DL methods and classification of DED. We studied the bibliographies and citations of the selected 69 articles, in which we found 7 more articles for potential inclusion. Finally, after a quality assessment based on reading all 76 papers, our selection was narrowed down to 65 studies. The details of the process followed during our systematic review are presented in Fig. 4. We subsequently distributed the final sample of articles into three target groups. The distribution of the 65 articles concerning the review target is represented in Table 1. The first group includes papers that use a pretrained network, also referred to as the TL approach. The second group comprises articles that use their own purpose-built DL network to detect DEDs. Finally, the third group summarises the articles that use combined DL and ML methods.

III. DIABETIC EYE DISEASE DATASETS
The authors of the selected articles use private and public datasets, which are divided into training and testing examples. The most common datasets used for the detection of DR are Kaggle and Messidor [77]. The authors of [19]-[24], [28], [29], [51]-[57], [59], [60], [78] used Kaggle data and those of [24], [25], [28], [30], [71], [72], [79] used Messidor [77] data. The Kaggle dataset consists of 88,702 images, of which 35,126 are used for training and 53,576 are used for testing. Messidor [77], one of the most widely used datasets, consists of 1,200 fundus images. Both the Kaggle and Messidor datasets are labeled for DR stages. Table 2 describes the datasets included in the chosen articles, listed from the viewpoint of the individual DED analyzed, i.e. DR, Gl, DME and Ca. The table contains the name of the DED, the name of the dataset, a summary of the particular dataset, the publications that used the dataset and, finally, the path where the dataset can be retrieved (if publicly accessible).

IV. IMAGE PREPROCESSING TECHNIQUES IN SELECTED ARTICLES
Images are subjected to numerous preprocessing steps to enhance their visual quality. Once the images are brighter and clearer, a network can extract more salient and unique features. A brief description of the preprocessing techniques used by the researchers is given in this section.
The green channel of the RGB color space provides better contrast than the red and blue channels and carries more information, so green channel extraction is employed in most image preprocessing pipelines. For instance, Li et al. [24] extracted the green channel of the image for exudate detection, since the exudates show better contrast against the background in this channel.
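Green channel extraction itself is a simple indexing operation. The following minimal sketch assumes a hypothetical in-memory layout (a list of rows of (R, G, B) tuples) chosen only to keep the example dependency-free; it is not the implementation used in [24].

```python
def extract_green_channel(rgb_image):
    """Return the green channel of an RGB image.

    rgb_image is assumed to be a list of rows, each row a list of
    (R, G, B) tuples with integer values in 0..255.
    """
    return [[pixel[1] for pixel in row] for row in rgb_image]


# Tiny 2 x 2 example image with made-up pixel values
image = [
    [(200, 120, 30), (190, 110, 25)],
    [(180, 60, 20), (210, 130, 35)],
]
green = extract_green_channel(image)
```

With a NumPy array in height × width × channel RGB layout, the same operation would be written as `image[:, :, 1]`.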
Another popular image preprocessing technique is contrast enhancement, which is typically applied to the green channel image to improve its contrast further. For example, Li et al. [24] enhanced the contrast of the extracted green channel by employing the Contrast Limited Adaptive Histogram Equalization (CLAHE) method, which increases the visibility of exudates in the green channel image. Normally, after contrast enhancement, illumination correction is implemented to improve the luminance and brightness of the image. A noise removal filter such as Gaussian filtering is then applied to smooth the image.
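To make the principle behind histogram-based contrast enhancement concrete, the sketch below implements plain global histogram equalization on a greyscale image. This is a deliberately simplified stand-in: CLAHE additionally operates on local tiles and clips the histogram before equalizing, and neither step is reproduced here. The function name and image layout are assumptions made for illustration.

```python
def equalize_histogram(gray, levels=256):
    """Global histogram equalization of a greyscale image (list of rows)."""
    flat = [value for row in gray for value in row]
    histogram = [0] * levels
    for value in flat:
        histogram[value] += 1

    # Cumulative distribution function of the intensity values
    cdf = []
    running = 0
    for count in histogram:
        running += count
        cdf.append(running)

    total = len(flat)
    cdf_min = next(c for c in cdf if c > 0)
    if total == cdf_min:
        # Constant image: there is no contrast to stretch
        return [row[:] for row in gray]

    # Map each intensity so that the output spans the full 0..levels-1 range
    lookup = [
        max(0, round((c - cdf_min) * (levels - 1) / (total - cdf_min)))
        for c in cdf
    ]
    return [[lookup[value] for value in row] for row in gray]


# Low-contrast 2 x 2 example: values 100..130 are spread over 0..255
low_contrast = [[100, 110], [120, 130]]
enhanced = equalize_histogram(low_contrast)
```

In this example the four input intensities, which span only 30 grey levels, are mapped to 0, 85, 170 and 255.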
Image resizing is another popular preprocessing method. The image is scaled down to a lower resolution that suits the system in use. Li et al. [24] resized their images, which had various sizes, to the same resolution of 512 × 512 pixels. Similarly, Li et al. [25] resized their images to 224 × 224 pixels to match the pretrained CNN models that take 224 × 224 images as input. In general, an image is resized to the resolution required by the network in use.
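As a hedged illustration of resizing, the sketch below uses nearest-neighbour sampling on a single-channel image stored as a list of rows. The surveyed studies do not state which interpolation they used; image libraries commonly offer smoother options such as bilinear or bicubic interpolation, so this should be read as the simplest possible variant.

```python
def resize_nearest(image, new_height, new_width):
    """Resize a single-channel image (list of rows) by nearest-neighbour sampling."""
    old_height = len(image)
    old_width = len(image[0])
    return [
        [
            # Pick the source pixel whose index scales to the target index
            image[r * old_height // new_height][c * old_width // new_width]
            for c in range(new_width)
        ]
        for r in range(new_height)
    ]


# 4 x 4 example image reduced to 2 x 2
image = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]
small = resize_nearest(image, 2, 2)
```

The same function would be called with a target size of 224 × 224 or 512 × 512 to match the input resolution expected by a given network.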
Researchers often have to remove or mask the blood vessels and optic discs so that they are not wrongly classified as DED lesions. Many DED datasets consist of images with a black border, and researchers generally prefer to segment out this meaningless border to focus on the Region Of Interest (ROI). For example, Li et al. [24] removed the black border of fundus images using a thresholding method to focus further on the ROI.
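One simple way to realize threshold-based border removal is to crop the image to the bounding box of all pixels brighter than a threshold. The sketch below follows that idea; the threshold value of 10 and the function name are assumptions, and the exact procedure of [24] may differ.

```python
def crop_black_border(gray, threshold=10):
    """Crop a greyscale image to the bounding box of pixels above threshold."""
    rows = [r for r, row in enumerate(gray) if any(v > threshold for v in row)]
    cols = [
        c for c in range(len(gray[0]))
        if any(row[c] > threshold for row in gray)
    ]
    if not rows or not cols:
        return []  # the whole image is background
    top, bottom = rows[0], rows[-1]
    left, right = cols[0], cols[-1]
    return [row[left:right + 1] for row in gray[top:bottom + 1]]


# Example: a bright region surrounded by a dark border
fundus = [
    [0, 0, 0, 0, 0],
    [0, 0, 90, 80, 0],
    [0, 2, 120, 0, 0],
    [0, 0, 0, 0, 0],
]
roi = crop_black_border(fundus)
```

Note that dark pixels inside the bounding box are kept, so the crop removes only the outer border and leaves the interior of the ROI untouched.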
Image augmentation is applied when there is a class imbalance in the images (as typically observed in real-world settings). Datasets usually contain a much larger proportion of healthy retina images than DED retina images, so images are mirrored, rotated, resized and cropped to produce additional samples for the classes with fewer images. Augmentation is a common strategy for enhancing outcomes and preventing overfitting. The distribution of the Kaggle dataset is uneven: it includes 35,126 fundus images annotated as No DR (25,810), Mild DR (2,443), Moderate DR (5,292), Severe DR (873) and Proliferative DR (708). Thus, Li et al. [24], An et al. [38], Nguyen et al. [31], Xu et al. [56], Pires et al. [60], Gargeya and Leng [52], Ghosh et al. [53], van Grinsven et al. [28] and Quellec et al. [22], who used the Kaggle dataset, adopted augmentation techniques to balance the dataset. Sometimes the RGB image is transformed into a greyscale image, followed by further processing. Greyscale conversion is mostly used in approaches where ML is used.
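The degree of imbalance follows directly from the class counts quoted above. The short sketch below computes each class's share of the 35,126 images and a rough oversampling factor, i.e. how many augmented copies per original image would bring a class up to the size of the largest one. The balancing rule is an illustrative assumption, not the procedure of any cited study.

```python
import math

# Class counts of the Kaggle images as reported in the text
KAGGLE_COUNTS = {
    "No DR": 25810,
    "Mild DR": 2443,
    "Moderate DR": 5292,
    "Severe DR": 873,
    "Proliferative DR": 708,
}


def class_shares(counts):
    """Fraction of the dataset that belongs to each class."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def oversampling_factors(counts):
    """Copies needed per image to reach the size of the largest class."""
    largest = max(counts.values())
    return {label: math.ceil(largest / n) for label, n in counts.items()}


shares = class_shares(KAGGLE_COUNTS)
factors = oversampling_factors(KAGGLE_COUNTS)
```

Under these counts, roughly 73.5% of the images are labelled No DR, while the Proliferative DR class would need about 37 versions of each image to match the majority class.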

V. DIABETIC EYE DISEASE CLASSIFICATION TECHNIQUES
In this section, we review the DL based approaches for DED detection. DL is defined as an extension of ML that uses a multilayer network for extracting features. In a DL architecture, the term "deep" refers to the depth of the layers. The classification process is as follows: (i) the annotated dataset is split into training and testing samples for the DL architecture, (ii) the dataset is preprocessed using image preprocessing techniques for quality enhancement and (iii) the preprocessed images are fed into the DL architecture for feature extraction and subsequent classification. Each layer in a DL architecture takes the output of the previous layer as its input, processes it and passes it on to the next layer. Finally, the last layer of the architecture produces the required result, i.e. the classification of DED within the scope of the study. Many authors fine-tune the hyperparameters of existing DL architectures, such as VGG16 or other CNNs, to improve classification performance. The hyperparameters observed in this study are shown in Table 4. Out of 65 studies, 38 used TL, 21 used their own proposed DL network and six used a combination of DL and ML classifiers such as Random Forest (RF), Support Vector Machine (SVM) and Backpropagation Neural Network (BPNN).
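Step (i) of the process above is often done with a stratified split, so that rare DED classes appear in both the training and the testing samples. The sketch below is one possible implementation under stated assumptions: the 20% test fraction, the fixed seed and the function name are illustrative and are not taken from the surveyed studies.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split sample indices into train and test sets, class by class."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Keep at least one test sample per class
        n_test = max(1, round(len(indices) * test_fraction))
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Imbalanced toy labels: 80 healthy (0) and 20 diseased (1) images
labels = [0] * 80 + [1] * 20
train_idx, test_idx = stratified_split(labels)
```

With these toy labels the test set contains 16 healthy and 4 diseased samples, preserving the 80:20 class ratio of the full dataset.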

A. DL APPROACHES EMPLOYING TL
The concept of TL is based on reusing the features learned by DL models on a primary task and adapting them to a secondary task. The idea is to reduce the computational cost of training a Neural Network architecture, which is resource intensive. Additionally, TL is found to be beneficial in cases where there is insufficient data to train a Neural Network from scratch, which requires a high volume of data. Using TL, the parameters are initialized from prior learning instead of being generated randomly. Intuitively, the first layers learn to extract basic features such as edges and textures, while the top layers are more specific to the task, e.g. blood vessels and exudates. Therefore, TL is commonly adopted in image recognition applications, as the initial features extracted are shared regardless of the task. Table 5 lists the works that applied TL for the detection of DED. For each work, the type of DED recognized, the network architecture and the model used were extracted. Additionally, the classification results were retrieved to allow comparison between the studies and an overview of the state of the art. Overall, 38 of the 65 studies adopted the TL approach for the detection of DED through DL (19-DR, 15-Gl, 3-DME and 1-Ca).
[...] of 96.4%. Last, for the referable DME, they achieved sensitivity of 92%. Quellec et al. [22] and Gondal et al. [19] used the 26-layer o_O solution proposed by Bruggemann and Antony, which ranked second in the Kaggle DR competition. Quellec et al. [22] achieved an AUC of 95.4% on the Kaggle dataset and an AUC of 94.9% on the e-ophtha dataset. Similarly, Gondal et al. [19] used the o_O solution to detect DR lesions such as red dots, soft exudates, hemorrhages and microaneurysms. They replaced the last dense layer with a global average pooling layer and achieved an AUC of 95.4% on the DIARETDB1 dataset. Gulshan et al. [20] detected DR using Inception-v3 on the Kaggle dataset and also on datasets collected from three Indian hospitals.
They achieved a specificity of 98.2% and a sensitivity of 90.1% for moderate or worse DR. Mansour [21] modified AlexNet for the classification of the five stages of DR, achieving an accuracy (Acc) of 97.93%, specificity of 93% and sensitivity of 100% on the Kaggle dataset. Roy et al. [23] used the Random Forest classifier on the Kaggle dataset and achieved a Kappa Score (KSc) of 86%. Li et al. [24] detected exudates using a modified U-Net. U-Net was originally designed for the segmentation of neuronal membranes; the authors modified the architecture by using unpooling layers rather than the deconvolutional layers of U-Net. They trained the model using the e-ophtha dataset and achieved an AUC of 96% on DIARETDB1. Li et al. [25] used various pretrained CNN models such as AlexNet, GoogLeNet and VGGNet. They achieved an AUC of 98.34%, Acc of 92.01%, specificity of 97.11% and sensitivity of 86.03%. They also achieved an AUC of 97.8% and KSc of 77.59%, with an Acc of 95.21%, specificity of 97.80% and sensitivity of 77.79%. Perdomo et al. [26] classified normal images and images with exudates using the LeNet architecture. Using the e-ophtha dataset, the authors achieved an Acc of 99.6%, specificity of 99.6% and sensitivity of 99.8%. Takahashi et al. [27] applied a modified GoogLeNet for detecting various stages of DR. They modified GoogLeNet by deleting the five accuracy layers and reducing the batch size to four, and achieved an Acc of 81% and a Kappa value of 74%. van Grinsven et al. [28] used a nine-layer CNN, inspired by OxfordNet, which consisted of five convolution layers with 32 filters. They achieved an AUC of 97.2%, specificity of 91.40% and sensitivity of 91.90% using the Messidor dataset [77]. Sayres et al. [29] classified five different stages of DR with an Acc of 88.4%. The accuracy on normal images was 96.9% and the accuracy on images with mild or worse NPDR was 57.9%. Umapathy et al.
[30] used images from STARE [80], HRE, MESSIDOR [77] and images acquired from the Retina Institute of Karnataka datasets. The authors proposed two methods for automated detection, Decision Trees classifier [...] [37] used the ResNet architecture and tested it on two datasets obtained from multiple institutes. They used data augmentation to increase the data volume and measured performance using the area under the receiver operating characteristic curve (AROC). They obtained two results: an AROC of 94.8% on the augmented dataset and an AROC of 99.7% on the dataset without augmentation. An et al. [38] used TL to detect Gl using color fundus images and 3-dimensional optical coherence tomography (OCT). Tenfold cross-validation (CV) was used to evaluate the model AUC. A Random Forest combined with five separate CNN models improved the tenfold CV AUC to 96.3%. Diaz-Pinto et al. [39] used five different publicly available datasets, resulting in an AUC of 96.05%, specificity of 85.80% and sensitivity of 93.46%. Cerentinia et al. [40] used the GoogLeNet architecture to detect the presence of Gl. They used datasets from various databases and achieved an Acc of 90% on the High Resolution Fundus (HRF) database, 94.2% on RIM-ONE(r1) [89], 86.2% on RIM-ONE(r2) [89], 86.4% on RIM-ONE(r3) [89] and 87.6% when combining all three databases. Orlando et al. [41] used two different CNN models, OverFeat and VGG-S, to develop an automated Gl detection system. The proposed architecture yielded AUC values of 76.3% and 71.8% for OverFeat and VGG-S, respectively. de Moura Lima et al. [42] used VGG-16, VGG-19, ResNet50, InceptionV3 and InceptionResNetV2 to diagnose Gl on the RIM-ONE [89] datasets. Promising results were obtained by combining ResNet with Logistic Regression, which yielded an AUC of 95.7% on RIM-ONE-r2 [89], while InceptionResNet with the same classifier yielded an AUC of 86% on RIM-ONE-r3 [89]. Li et al.
[43] used the VGG network to classify glaucoma and non-glaucoma visual fields based on the results of the visual field (VF) study; for this test, they obtained VF samples from three different ophthalmic centres in mainland China. They obtained an Acc of 87.6%, with a specificity of 82.6% and a sensitivity of 93.2%. In the study by Gómez-Valverde et al. [44], VGG-19 was used to distinguish glaucoma from non-glaucoma using two publicly available datasets, RIM-ONE [89] and DRISHTI-GS [84], and one private dataset from a screening campaign performed at Hospital de la Esperanza (Parc de Salut Mar) in Barcelona (Spain).
Diabetic Macular Edema
Various researchers also investigated the use of pretrained models to detect DME. Sahlsten et al. [49] used TL for the binary classification of Non-Referable DME and Referable DME (NRDME/RDME) and achieved an AUC of 98.7%, specificity of 97.4% and sensitivity of 89.6%.
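Since AUC is reported by many of the studies above, a brief sketch of how it can be computed for a binary classifier may be helpful. The version below uses the rank interpretation of the AUC: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The labels and scores are made-up values for illustration only.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive-negative pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Hypothetical classifier scores: 1 = referable, 0 = non-referable
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
```

In this toy example three of the four positive-negative pairs are ordered correctly, giving an AUC of 0.75. The pairwise loop is quadratic in the number of samples, so a sort-based formulation is preferable for large test sets.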

B. DL APPROACHES EMPLOYING NEW NETWORK
An alternative to TL is for researchers to develop a new network. Out of 65 studies, 21 designed their own DL architectures for the automated detection of DED. Table 6 lists the studies in which the researchers employed their own DL models, together with the classifier, the number of layers, the model used and the results obtained.
Diabetic Retinopathy
Doshi et al. [51] detected the severity of diabetic retinopathy across five stages of DR using a 29-layer CNN model, and three CNNs achieved a score of 39.96% on the kappa metric. Gargeya and Leng [52] identified diabetic retinopathy using a DL approach. They achieved an AUC of 94%, specificity of 87% and sensitivity of 93%. Ghosh et al. [53] employed a 28-layer CNN for two-class and five-class classification of diabetic retinopathy. Using Softmax, they achieved an Acc of 95% for two-class and 85% for five-class classification. Jiang et al. [54] classified two classes of diabetic retinopathy using fundus images. They used a 17-layer deep CNN on the Caffe framework and achieved an Acc of 75.7%. Pratt et al. [55] employed a CNN architecture to identify the severity level of DR. They achieved an Acc of 75%, specificity of 30% and sensitivity of 95% using a Softmax classifier. Xu et al. [56] [...] Hemanth et al. [61] proposed a hybrid method based on both image processing and DL for improved results. Using 400 retinal fundus images from the MESSIDOR [77] database, they obtained average values for the performance evaluation parameters of Acc 97%, sensitivity (recall) 94%, specificity 98%, precision 94%, FScore 94% and geometric mean (GMean) 95%.
Glaucoma
Chen et al. [62] developed a six-layer CNN model. With the Softmax classifier, they achieved AUCs of 83.1% and 88.7% on the ORIGA [83] and SCES datasets, respectively. Raghavendra et al. [63] built an eighteen-layer CNN framework to diagnose Gl using 1426 fundus images, of which 589 were normal and 837 showed glaucoma. They achieved an Acc of 98.13%, sensitivity of 98% and specificity of 98.3%. Pal et al. [64] introduced a novel multi-model DL network named G-EyeNet for glaucoma detection using the DRIONS [88] and Drishti-GS [84] datasets. Their experimental findings revealed an AUC of 92.3%.
Diabetic Macular Edema
Al-Bander et al. [67] proposed a CNN system to grade the severity of DME from fundus images using the MESSIDOR [77] dataset of 1,200 images. They obtained an Acc of 88.8%, sensitivity of 74.7% and specificity of 96.5%. Prentasić and Loncarić [68] introduced a novel supervised CNN based exudate detection method using the DRiDB dataset [93]. The proposed network consists of 10 alternating convolutional and max-pooling layers. They achieved a sensitivity of 78%, Positive Predictive Value (PPV) of 78% and F-score (FSc) of 78%. Tan et al. [69] used the CLEOPATRA [94] image dataset and obtained a sensitivity of 87.58% and specificity of 98.73%.
Cataract
Zhang et al. [9] proposed an eight-layer DCNN architecture. With the Softmax classifier, they achieved Acc values of 93.52% and 86.69%. Dong et al. [70] used a Softmax classifier with a five-layer CNN architecture and achieved Acc values of 94.07% and 81.91%.

C. APPROACHES EMPLOYING COMBINED DL AND ML
Out of 65 studies, six proposed a combination of DL and ML classifiers. Table 7 shows the studies in which the authors applied a combination of DL and ML classifiers, namely Random Forest (RF), Support Vector Machine (SVM) and Backpropagation Neural Network (BPNN) based architectures, for DED detection. Abbas et al. [71] developed a DL Neural Network (DLNN) to determine the severity of DR in fundus images by learning Deep Visual Features (DVFs). For feature extraction, they used the Gradient Location Orientation Histogram (GLOH) and Dense Color Scale Invariant Feature Transform (DColor-SIFT). They transformed the features using Principal Component Analysis (PCA). Afterwards, a three-layer deep neural network was used to learn these features and, subsequently, an SVM classifier was applied to classify DR fundus images into five severity stages: no DR, mild, moderate and severe NPDR (Nonproliferative Diabetic Retinopathy), and PDR (Proliferative Diabetic Retinopathy). They obtained a sensitivity of 92.18%, specificity of 94.50% and AUC of 92.4% on three publicly available datasets (Foveal Avascular Zone, Messidor [77] and DIARETDB1) and one additional dataset (from the Hospital Universitario Puerta del Mar, HUPM, Cádiz, Spain). Orlando et al. [72] combined ML and DL for the detection of red lesions. They used three public datasets, namely Messidor [77], DIARETDB1 and e-ophtha. They extracted intensity and shape as features using a knowledge-transferred LeNet architecture, which consists of 10 layers. They achieved an AUC of 93.47% and sensitivity of 97.21%. Arunkumar and Karthigaikumar [73] employed a Deep Belief Network (DBN) for diabetic retinal image classification.
At first, deep features were extracted with the DBN using three hidden layers; those features were then reduced by applying the Generalised Regression Neural Network (GRNN) technique and, finally, the retinal images were classified using an SVM. On the publicly available ARIA dataset, the authors achieved an Acc of 96.73%, specificity of 97.89% and sensitivity of 79.32%. Al-Bander et al. [74] used a CNN for feature extraction and an SVM for Gl and non-Gl classification. They achieved an Acc of 88.2%, specificity of 90.8% and sensitivity of 85%. Ran et al. [75] used a 17-layer DCNN feature extractor, which adopts a residual network to learn more detailed features of fundus images. The DCNN contains three modules, namely shallow, residual and pooling. Here, the shallow and residual modules extract features on a deep, medium and shallow level, and the final feature vectors for the Random Forests are output from the pooling module. The authors detected six classes of cataract with an Acc of 90.69%. Last, the authors of [57] proposed a two-layer stacking architecture with Support Vector Machine and Backpropagation Neural Network classifiers. The ensemble classifier achieved Acc values of 93.2% and 84.5%.

D. ANALYSIS AND REVIEW OF PERFORMANCE EVALUATION METRICS
A detailed description of the performance measures, namely specificity, sensitivity, accuracy, area under curve (AUC), precision, F-score and positive predictive value, can be found in [97]. Likewise, discussions of the Kappa Score and the PABAK Index can be found in [98]. In the majority of the listed papers, the authors used specificity, sensitivity, accuracy and AUC as their assessment metrics to evaluate the performance of the classifier. The most frequently used combination of metrics was Sensitivity, Specificity and Accuracy. This combination was used 12 times out of a total of 60 trials, alongside 12 uses of Sensitivity, Specificity and AUC, and two uses of Sensitivity, Specificity, Accuracy and AUC. Instead of Sensitivity, some researchers used Recall. We counted Recall under Sensitivity rather than treating it as a separate metric. The most frequently used individual measures were Sensitivity (32 times), Specificity (25 times), Accuracy (26 times) and AUC (25 times). Other performance metrics, less commonly used by research groups, were F-Score (twice), Precision (twice), PABAK (once), Kappa Score (3 times), Positive Predictive Value (once) and GMean (once).
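To make the relationship between these measures concrete, the following minimal sketch derives them from the four cells of a binary confusion matrix. The function name `binary_metrics` and the example counts are illustrative assumptions and are not taken from any surveyed study; the sketch also assumes that no denominator is zero.

```python
from math import sqrt


def binary_metrics(tp, fp, tn, fn):
    """Compute common metrics from binary confusion-matrix counts.

    Hypothetical helper for illustration; assumes every denominator is non-zero.
    """
    n = tp + fp + tn + fn
    sensitivity = tp / (tp + fn)          # also called recall
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / n              # observed agreement
    precision = tp / (tp + fp)            # positive predictive value
    f_score = 2 * precision * sensitivity / (precision + sensitivity)

    # Cohen's kappa: agreement corrected for the agreement expected by chance.
    p_chance = ((tp + fp) / n) * ((tp + fn) / n) + ((tn + fn) / n) * ((tn + fp) / n)
    kappa = (accuracy - p_chance) / (1 - p_chance)

    # PABAK for two categories depends only on the observed agreement.
    pabak = 2 * accuracy - 1

    g_mean = sqrt(sensitivity * specificity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "precision": precision,
        "f_score": f_score,
        "kappa": kappa,
        "pabak": pabak,
        "g_mean": g_mean,
    }


# Made-up counts: 45 diseased and 55 healthy images.
metrics = binary_metrics(tp=40, fp=10, tn=45, fn=5)
```

Precision and Positive Predictive Value are the same quantity, which is why the sketch returns a single value for both.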

VI. DISCUSSION AND OBSERVATIONS
AI is one of the most intriguing technologies used in the material science toolset in recent decades. This compendium of statistical techniques has already shown that it is capable of significantly accelerating both fundamental and applied research. ML already has a rich history in biology [99], [100] and chemistry [101], and it has recently gained prominence in the field of solid state materials science. Presently, DL models in ML are effectively used in imaging for classification, detection [102], segmentation [103] and preprocessing. The most popular and commonly employed DL architecture in the selected 65 studies is the CNN, which is used in 64 cases, while the DBN is implemented once. We can infer that the CNN is currently the most preferred deep neural network, particularly for the detection of DED as well as the diagnosis of other pathological indications from medical images.
We have noticed that DL performed well on binary classification tasks (e.g. DR and Non DR), whereas its performance dropped significantly when the number of classes increased. As an example, Ghosh et al. [53] obtained an Acc of 95% on the DR and Non DR classification task and an Acc of 85% on a multiclass problem (five stages of severity), a loss of 10 percentage points in Acc. Similarly, Choi et al. [17] classified 10 distinct retinal diseases and achieved an Acc of 30.5%. Also, Dong et al. [70] performed cataract classification based on two kinds of features, namely: i) features extracted using DL and ii) features extracted using wavelet. Classification for the binary problem (Cataract and Non Cataract) achieved an Acc of 94.07% and 81.91%. Then, the authors performed the classification of four classes of cataract and obtained an Acc of 90.82% and 84.17%. This shows that features extracted using wavelet increased the Acc of the Softmax classifier by 4% in the four class problem. Still, the overall highest accuracy was observed for a binary classification task.
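One way to see why the same model can score higher on a binary task than on a multiclass one is to collapse severity grades into a healthy/diseased label and recompute the accuracy. The sketch below does this on made-up labels (grades 0 to 4, with 0 meaning no disease); the data and helper names are illustrative assumptions and do not reproduce any result from the surveyed studies.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def to_binary(grades):
    """Collapse severity grades into 0 (no disease) and 1 (any disease)."""
    return [int(g > 0) for g in grades]


# Toy labels: the predictions confuse neighbouring severity grades only.
y_true = [0, 0, 1, 2, 2, 3, 4, 4]
y_pred = [0, 0, 2, 2, 1, 4, 4, 3]

multiclass_acc = accuracy(y_true, y_pred)
binary_acc = accuracy(to_binary(y_true), to_binary(y_pred))
```

In this toy case every error is a confusion between adjacent grades, so the errors disappear once the grades are merged. Real classifiers also make healthy/diseased mistakes, so the gap is usually smaller.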
This reveals a research gap: more rigorous approaches to multiclass DED classification are needed. Furthermore, we have observed that binary classification is mostly conducted between normal and affected DED cases. For instance, Ghosh et al. [53] and Choi et al. [17] classified DR and Non-DR. Also, Al-Bander et al. [74] and Phan et al. [35] identified glaucomatous and nonglaucomatous retinal images, while Dong et al. [70] detected cataract and noncataract conditions. The methods used in these articles effectively identify the large proportion of severe cases in which pathological signs are more prominent. Thus, there is a need for classifiers that perform equally well for the mild stages of DED, where the lesions are tiny and difficult to detect.
Early detection of DED, or detection of mild DED, is especially necessary in order to take effective preventive steps and to avoid possible blindness as the condition deteriorates over time. As we can see, DL has shown extensive capacity in the field of health care and especially in the field of DED detection. However, there are some limitations to its large-scale implementation. In terms of the validation of the proposed methods, the authors predominantly used Accuracy, Specificity and Sensitivity to report their classification performance. For instance, Perdomo et al. [26] used a LeNet CNN to detect exudates and reported the accuracy (99.6%), specificity (99.6%) and sensitivity (99.8%) of the proposed approach. Another widely used combination of metrics was AUC, accuracy and sensitivity. This combination is appropriate in DL methods where image classes are imbalanced. In addition, data imbalance has been addressed using geometric transformations (augmentation techniques) or by re-sampling images from each class. For example, Chen et al. [62] used augmentation to overcome overfitting on image data and obtained AUCs of 83.10% and 88.7% on the ORIGA and SCES datasets, respectively. Other metrics have also been used to measure performance, such as Ksc used by Roy et al. [23], AROC used by Asaoka et al. [37], and PABAK used by Takahashi et al. [27].
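The following minimal sketch illustrates why AUC is informative when classes are imbalanced. It computes AUC as the probability that a randomly chosen positive image receives a higher score than a randomly chosen negative one (ties count as one half). The function name `roc_auc` and all labels and scores are made-up assumptions for illustration.

```python
def roc_auc(labels, scores):
    """AUC via pairwise comparison of positive and negative scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Imbalanced toy set: 2 diseased images among 10.
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# A degenerate model that gives every image the same score ("always healthy").
constant_scores = [0.0] * 10
constant_accuracy = sum(y == 0 for y in labels) / len(labels)  # threshold above 0
constant_auc = roc_auc(labels, constant_scores)

# A model that ranks both diseased images above most healthy ones.
ranked_scores = [0.9, 0.4, 0.5, 0.3, 0.1, 0.2, 0.1, 0.3, 0.2, 0.1]
ranked_auc = roc_auc(labels, ranked_scores)
```

The degenerate model reaches 80% accuracy here simply because most images are healthy, yet its AUC stays at chance level, which is the behaviour that makes AUC a useful companion to accuracy on imbalanced data.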

VII. GAPS AND FUTURE RESEARCH DIRECTIONS
This section presents several research issues that previous DED detection studies have not been able to resolve. Significant research is, therefore, still needed to improve the effectiveness of the various techniques for DED detection. The research challenges that need to be addressed are set out below.
Developing stronger DL models. While DL has already shown extremely promising success in the field of medical imaging and retinal disease diagnosis, it remains difficult to further refine and create more effective deep neural networks. One solution may be to increase the capacity of the network, and with it the computational power required [104], [105], while addressing the risk of over-fitting. Another approach could be to create an object based model rather than an image based model. For example, if researchers are interested in detecting a particular eye malformation (e.g. exudates only), they could design a deep convolutional neural network that learns only from exudates, so that other malformations that are not of interest are not learned by the model. It is argued in [106] that object based identification is more effective than image based identification.
Training on minimal data. DL models typically require a large number of retinal fundus images for learning. If the training set is small, the model may not produce satisfactory results in terms of accuracy. There are two solutions available. The first is to use a range of augmentation methods, including rotation, shifting, cropping and colour adjustment. The second is to employ weakly supervised learning algorithms to obtain training data. Further, studies show that Generative Adversarial Networks (GANs) can be used to generate training data, so that the DL architecture can be trained more robustly and with more distinctive features [107].
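The augmentation methods listed above can be sketched on a tiny greyscale image stored as a list of rows. This is a minimal illustration under stated assumptions: the helper names, the 3x3 image and the parameter values are hypothetical, and a real pipeline would apply equivalent operations from an image library to full-size colour fundus images.

```python
def rotate90(img):
    """Rotate the image by 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]


def shift_right(img, dx, fill=0):
    """Shift pixels dx columns to the right, padding the left edge with `fill`."""
    width = len(img[0])
    dx = max(0, min(dx, width))
    return [[fill] * dx + row[: width - dx] for row in img]


def crop(img, top, left, height, width):
    """Return the height x width window whose top-left corner is (top, left)."""
    return [row[left : left + width] for row in img[top : top + height]]


def adjust_brightness(img, factor):
    """Scale pixel intensities and clip them to the 0..255 range."""
    return [[min(255, max(0, round(p * factor))) for p in row] for row in img]


def augment(img):
    """Produce one variant per augmentation type from a single image."""
    h, w = len(img), len(img[0])
    return [
        rotate90(img),
        shift_right(img, 1),
        crop(img, 0, 0, h - 1, w - 1),
        adjust_brightness(img, 1.2),
    ]


image = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
variants = augment(image)  # four extra training samples from one image
```

Each variant keeps the label of the source image, so a small labelled set grows without new annotation effort.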
Similar DL architecture for medical imaging. In DL, several TL frameworks for object recognition (such as GoogLeNet, AlexNet, VGGNet and LeNet) are available for retraining on a new collection of images, such as medical images. Nevertheless, as far as classification performance is concerned, these architectures are less suitable for medical images. For example, Choi et al. [17] used VGGNet for DR diagnosis utilising eye fundus images and obtained nearly 85.5% specificity. This is because such TL frameworks are designed for natural objects such as animals and flowers. As a result, such frameworks may be unsuitable for real-time medical images. Future studies could implement a TL architecture that has been trained on appropriate medical images rather than on natural objects, functioning as a generic architecture that is then retrained to improve the accuracy of medical image classification.
Automated selection of optimum values for DL architectures. Neural networks have provided promising results in the area of computer vision, and particularly in DED detection, but their internal behaviour is not well understood and they are often regarded as a black box. For instance, several researchers fine-tune the parameters of existing DL techniques, such as CNN or AlexNet, in order to enhance classification performance. It is, therefore, still hard to determine the most effective model and the optimum values for the number of hidden layers and the modules in the various layers.
Domain specific knowledge is also necessary for selecting values for the number of epochs, the learning rate and the regularization. Thus, in the future, automated optimization algorithms could be proposed to find optimal values for various DL architectures on various DED datasets and other similar medical image resources.
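As one simple instance of such automated optimization, the sketch below uses random search over the learning rate, the number of epochs and a regularization strength. Random search is a swapped-in baseline chosen for brevity and is not a method proposed by the surveyed studies. The search-space keys, the function names and the `toy_objective` are hypothetical; in practice the objective would train a model and return its validation score.

```python
import math
import random


def random_search(objective, space, n_trials=50, seed=0):
    """Sample configurations at random and keep the best-scoring one."""
    rng = random.Random(seed)
    best_cfg, best_score = None, -math.inf
    for _ in range(n_trials):
        cfg = {
            # Learning rate and weight decay are sampled on a log scale.
            "learning_rate": 10 ** rng.uniform(*space["log10_learning_rate"]),
            "epochs": rng.randint(*space["epochs"]),
            "weight_decay": 10 ** rng.uniform(*space["log10_weight_decay"]),
        }
        score = objective(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score


def toy_objective(cfg):
    """Stand-in for a validation score; peaks at a learning rate of 1e-3."""
    return -(math.log10(cfg["learning_rate"]) + 3.0) ** 2


space = {
    "log10_learning_rate": (-5.0, -1.0),
    "epochs": (5, 50),
    "log10_weight_decay": (-6.0, -2.0),
}
best_cfg, best_score = random_search(toy_objective, space, n_trials=50, seed=0)
```

Fixing the seed makes the search reproducible, which helps when comparing architectures across DED datasets.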
Integrating DL, cloud computing and telehealth. Rural regions in particular suffer from a lack of human capital, notably in healthcare. In such cases, telehealth can play an important role in overcoming this drawback. Neural networks, cloud computing and telehealth may be combined in the future to diagnose DED from eye fundus images. For example, in rural communities, the patient could use his or her mobile phone with a retinal camera to capture an image of the eye fundus. This image could then be transferred to a cloud computing platform, where the DED detection model (constructed through an ML or a DL approach) can be applied. The configured system would then identify DED from the image file and return the detection results and the prescription to the patient.

VIII. CONCLUSION
This review paper provides a comprehensive overview of the state of the art in Diabetic Eye Disease (DED) detection methods. To achieve this goal, a rigorous systematic review of relevant publications was conducted. After the final selection of relevant records, following the inclusion criteria and quality assessment, the studies were analyzed from the perspectives of 1) datasets used, 2) image preprocessing techniques adopted and 3) classification method employed. The works were categorized into the specific DED types, i.e. DR, Gl, DME and Ca, for clarity and comparison. In terms of classification techniques, our review included studies that 1) adopted TL, 2) built a DL network architecture and 3) used a combined DL and ML approach. Details of the findings obtained are included in Section VI.
We have also identified several limitations associated with our study. First, we restricted the review to publications from April 2014 to January 2020, due to rapid advances in the field. Second, we limited our review to DL based approaches due to their state of the art performance, in particular on the image classification task. Finally, our review focused on a collection of predefined keywords that provides thorough coverage of the DED detection area but may not be exhaustive. We hope that our research can be further expanded in the future to provide an all-encompassing and up-to-date overview of the rapidly developing and challenging field of DED detection.