A Comprehensive Review of Deep Learning Strategies in Retinal Disease Diagnosis Using Fundus Images

In recent years, computer vision and deep learning have seen unprecedented growth, driven by the exponential rise of computational infrastructure. This growth is also reflected in retinal image analysis, where successful artificial intelligence models have been developed for diagnosing various retinal diseases using a wide variety of visual markers obtained from eye fundus images. This article presents a comprehensive study of the deep learning strategies employed in recent times for the diagnosis of five major eye diseases, i.e., diabetic retinopathy, glaucoma, age-related macular degeneration, cataract, and retinopathy of prematurity. The article is organized according to the deep learning implementation pipeline: commonly used datasets, evaluation metrics, image pre-processing techniques, and deep learning backbone models are illustrated first, followed by an extensive review of the strategies applied to each of the five diseases. Finally, the article summarizes eight major research directions in the field of retinal disease diagnosis and outlines key challenges and future scope for the research community.


I. INTRODUCTION
To investigate the human eye, many imaging modalities have been developed over the years, out of which 'Fundus Imaging' has gained popularity due to its non-invasive and cost-effective nature. Fundus photography involves capturing the projection of the fundus (the rear portion of the eye) onto a two-dimensional plane using a monocular camera. Several ocular structures and biomarkers, including various abnormalities, can be identified from a captured 2D fundus image (Figure 1). Many of these visual markers play an important role in identifying retinal diseases.
The tiny red dot-like structures, known as Microaneurysms (MAs), normally develop due to a lack of oxygen supply that causes capillaries to bulge. Sometimes, when the supply completely shuts down due to certain arteriolar blockages, white soft patches form, indicated as Soft Exudates (SEs). Retinal vessels sometimes burst because of built-up pressure in the arterioles and manifest as dark red patches, known as Hemorrhages. Hard Exudates (HEs) form when proteins and fat leak from abnormal vessel walls and appear as hard yellow waxy structures. Examining the presence of these lesions, along with other retinal biomarkers like the optic disc (OD), optic cup (OC), macular region, fovea, and blood vessels, can provide valuable insights into some of the major retinal diseases and aid in their diagnosis.
Diabetic retinopathy (DR), glaucoma, age-related macular degeneration (AMD), diabetic macular edema (DME), retinopathy of prematurity (ROP), and cataract are some of the major eye diseases that can cause blindness if not treated appropriately. The screening process for such retinal diseases generally requires expert attention and substantial skill [1]. In densely populated countries like India, there is a severe shortage of trained ophthalmologists who can perform such time-consuming tasks [2]. Owing to the recent exponential growth of digital processors and data-driven technologies, artificial intelligence (AI) based medical screening systems are becoming more prevalent and offer feasible and cost-effective solutions for the automatic diagnosis of retinal diseases [3]. In particular, computer vision and deep learning (DL) techniques have shown immense growth and promise in fundus image analysis.
DL tasks in retinal disease diagnosis mainly fall into two categories: classification and segmentation. Classification refers to a direct categorization of input images into various disease classes. Similarly, identifying important biomarkers and crucial lesions in a patient's fundus image through segmentation can reveal many details about the nature and type of retinal disease. Many DL architectures developed and tested for such tasks are well illustrated in [4]. An overall DL framework for retinal disease diagnosis is shown in Figure 2.
The list of abbreviations used in this article is tabulated in Table 1.

A. FEATURES OF THE PROPOSED REVIEW
The proposed review focuses mainly on providing an in-depth review of various DL strategies recently implemented for retinal disease diagnosis using fundus images. This study also intends to outline possible future directions for new researchers interested in AI-based retinal disease diagnosis.
• Contrary to recently published review articles on this topic [5]-[9], this article takes a DL process-pipeline approach to retinal disease diagnosis and surveys recent articles on the diagnosis of five major eye diseases, i.e., diabetic retinopathy, glaucoma, age-related macular degeneration, cataract, and retinopathy of prematurity.
• It comprehensively outlines all the datasets available for the above-mentioned diseases, along with their ground-truth descriptions.
• It provides knowledge of widely used image pre-processing techniques, evaluation metrics, and commonly used DL backbone models for retinal disease diagnosis tasks.
• It contains an extensive literature study of DL implementations for the five major retinal diseases, along with tables comparing their performance.
• It discusses the various research directions currently available in this field.
The rest of this article is organized as follows. Section 2 covers datasets and evaluation metrics for retinal disease diagnosis. Commonly used fundus image pre-processing techniques are illustrated in Section 3. The most widely used DL strategies, along with specific backbone models for fundus image-based classification and segmentation tasks, are outlined in Section 4. Section 5 presents a literature review along with performance comparisons of recent research work on DR, glaucoma, AMD, cataract, and ROP diagnoses. Finally, future research directions and conclusions are presented in Sections 6 and 7, respectively.

II. DATASETS AND EVALUATION METRICS FOR RETINAL DISEASE DIAGNOSIS
Fundus photography is the process of retrieving a two-dimensional image of the 3D ocular retinal fundus, using reflected light projected onto an image plane. Table 2 shows an overview of widely used fundus image datasets that are utilized in DL-based diagnosis of the above-mentioned retinal diseases. The table lists each dataset, the number of fundus images it contains, their size and format, and its ground-truth description. The diagnosis tasks in which each dataset is utilized are indicated through color-coding. For ease of comprehension and comparative study, all the datasets are presented in one table.

A. MODEL PERFORMANCE EVALUATION METRICS
Several performance evaluation metrics are used to evaluate DL models in retinal disease diagnosis tasks. Table 3 lists the most commonly used metrics along with their descriptions. Throughout the literature review in this article, these metrics are used for comparing the various DL architectures.
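As a concrete reference for the metrics in Table 3, their pixel-level versions can be computed directly from confusion-matrix counts. The sketch below is our own NumPy illustration (the function names are not from any surveyed work) and assumes binary masks with values 0/1:

```python
import numpy as np

def confusion_counts(pred, gt):
    """Per-pixel TP/TN/FP/FN for binary masks (values 0/1)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)
    tn = np.sum(~pred & ~gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    return tp, tn, fp, fn

def metrics(pred, gt, eps=1e-8):
    tp, tn, fp, fn = confusion_counts(pred, gt)
    se = tp / (tp + fn + eps)                # sensitivity / recall
    sp = tn / (tn + fp + eps)                # specificity (FPR = 1 - SP)
    pr = tp / (tp + fp + eps)                # precision
    acc = (tp + tn) / (tp + tn + fp + fn)    # accuracy
    f1 = 2 * pr * se / (pr + se + eps)       # harmonic mean of PR and SE
    iou = tp / (tp + fp + fn + eps)          # Jaccard similarity index
    dsc = 2 * tp / (2 * tp + fp + fn + eps)  # Dice similarity coefficient
    return dict(SE=se, SP=sp, PR=pr, ACC=acc, F1=f1, IoU=iou, DSC=dsc)
```

For multi-class grading tasks (e.g., five-stage DR classification), the same quantities are typically computed per class and then averaged.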

III. PRE-PROCESSING TECHNIQUES
To improve the training process and build robust prediction models, fundus images are generally pre-processed before the training phase. This is done to compensate for the noise induced by the variety of image-capturing hardware used under varied illumination settings during imaging. Considering the complexity of the retinal structure, many important biomarkers and lesions may not be identifiable in poor-quality images, as shown in Figure 3. Apart from removing unwanted noise, pre-processing techniques are also used to enhance fundus image features before DL model implementation. Some of the widely used pre-processing techniques for color fundus images in retinal disease diagnosis are presented in Table 4.
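As an illustration of the contrast-enhancement rows of Table 4, global histogram equalization can be written in a few lines of NumPy. This is a simplified sketch of our own: the surveyed papers typically use CLAHE (e.g., OpenCV's `createCLAHE`), which additionally operates on local tiles and clips the histogram to avoid over-amplification.

```python
import numpy as np

def hist_equalize(gray):
    """Global histogram equalization for an 8-bit grayscale image."""
    hist = np.bincount(gray.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]
    # Map each intensity so the output CDF becomes approximately uniform.
    scale = max(int(cdf[-1] - cdf_min), 1)
    lut = np.clip(np.round((cdf - cdf_min) / scale * 255), 0, 255).astype(np.uint8)
    return lut[gray]

# Green-channel extraction from an RGB fundus image of shape (H, W, 3):
# green = rgb[..., 1], commonly the highest-contrast channel.
```

A fundus pipeline would usually apply this (or CLAHE) to the extracted green channel before patch extraction.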

IV. DEEP LEARNING CONCEPTS
Deep learning (DL) is a sub-class of artificial intelligence methods based on artificial neural networks (learning methods inspired by the biological structure of the human brain). In the DL process, the latent and intrinsic relations in the input data are learned automatically through mathematical representations. Contrary to traditional machine learning (ML) methods, DL can execute with far less human guidance, as it directly extracts useful features from the data without depending on hand-crafted features. This makes DL suitable for medical image analysis, where features can be learned automatically from complex visual information. In the following sections, we discuss the architectures of some of the frequently used backbone models for classification and segmentation tasks in retinal disease diagnosis.

TABLE 3: Commonly used model performance evaluation metrics.
• Sensitivity/Recall (SE) = TP / (TP + FN): ratio of classified true positives to the actual number of positives in the ground truth.
• Specificity (SP) = TN / (TN + FP): ratio of classified true negatives to the actual negatives in the ground truth. False-positive rate (FPR) = 1 - SP.
• Precision (PR) = TP / (TP + FP): indicates what proportion of positive findings was actually correct. A higher value of PR indicates better system performance.
• Accuracy (ACC) = (TP + TN) / (TP + TN + FP + FN): ratio of correct predictions to the total number of predictions.
• F1-score: the harmonic mean of recall and precision. A higher value of F1-score indicates better system performance [40].
• AUC: the area under the receiver operating characteristic curve. A higher value of AUC indicates better system performance.
• IoU/Jaccard similarity index, IoU = TP / (TP + FN + FP): intersection-over-union, a widely used measure of how accurate a proposed image segmentation is compared to the known ground truth.
• Dice similarity coefficient (DSC): ranges from 0 to 1, with 0 indicating no spatial overlap between two sets of binary segmentation results and 1 indicating complete overlap.

FIGURE 3: Examples of fundus images with poor quality: a) overexposure, b) underexposure, c) obscure, and d) post-operative.

1) Convolutional Neural Networks (CNNs)
One of the most widely used DL architectures for efficient training through multiple layers is the convolutional neural network (CNN) [45]. Figure 4 depicts the general architecture of a CNN. Convolution layers, pooling layers, and fully connected layers are its three major components. The training process consists of two stages. The first is known as the forward stage, where the input image is transformed through the weights and biases of each layer, and the predicted output is compared with the ground-truth values to measure the loss function. In the second stage, known as the backward stage, the gradient of each parameter is computed based on the loss function. The parameters are then updated and used for the next forward stage. This is repeated over multiple iterations until the network produces accurate classification results.
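The forward/backward cycle described above can be sketched for a single learnable 2 × 2 filter trained with gradient descent on a mean-squared-error loss. This is a toy NumPy example of our own (not from any surveyed work) that makes the forward stage, loss, gradient computation, and parameter update explicit:

```python
import numpy as np

def conv2d_valid(x, k):
    """Naive 'valid' 2D convolution (cross-correlation, as used in CNNs)."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 5))               # toy input "image"
k_true = np.array([[1.0, 0.0], [0.0, -1.0]])  # target filter
y = conv2d_valid(x, k_true)                   # ground-truth output

k = np.zeros((2, 2))                          # learnable filter
lr = 0.05
losses = []
for _ in range(500):
    pred = conv2d_valid(x, k)                 # forward stage
    losses.append(np.mean((pred - y) ** 2))   # MSE loss vs. ground truth
    err = 2.0 * (pred - y) / y.size           # dLoss/dPred
    grad = np.zeros_like(k)                   # backward stage: dLoss/dk
    for i in range(err.shape[0]):
        for j in range(err.shape[1]):
            grad += err[i, j] * x[i:i + 2, j:j + 2]
    k -= lr * grad                            # parameter update
```

Real CNN training follows the same loop, with many layers, non-linearities, and mini-batches handled by automatic differentiation frameworks.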

2) VGGNet
Another commonly used backbone network in retinal disease classification is the VGG network (VGGNet), proposed by Karen Simonyan and Andrew Zisserman in 2014 [46]. Figure 5 shows the architecture of a VGGNet. VGG stands for Visual Geometry Group, which released various versions of convolutional network models for image classification tasks, from VGG-11 to VGG-19. The original intention behind the development of VGG was to investigate how the depth of a CNN impacts image classification accuracy. A small 3 × 3 kernel is used in all convolution layers to increase the depth of the network while avoiding too many parameters. In VGGNet, the input is a 224 × 224 RGB image, and the 3 × 3 filters are applied with a fixed convolution stride. All variants have three fully connected layers; the variant names, from VGG-11 to VGG-19, reflect the total number of convolution plus fully connected layers. VGG-11 has eight convolution layers followed by three fully connected layers, whereas VGG-19 has sixteen convolution layers and three fully connected layers. In VGGNet, each convolution layer is not followed by its own pooling layer; instead, a total of five pooling layers are distributed throughout the network, as shown in Figure 5.
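The parameter saving that motivated VGG's 3 × 3 design is easy to verify by counting weights. In this small check of our own, for 256 input and output channels, two stacked 3 × 3 convolutions cover the same 5 × 5 receptive field as a single 5 × 5 convolution with roughly 28% fewer parameters (and an extra non-linearity in between):

```python
def conv_params(k, c_in, c_out):
    # Weights (k x k x C_in per output channel) plus one bias per output channel.
    return (k * k * c_in + 1) * c_out

c = 256
one_5x5 = conv_params(5, c, c)      # single 5x5 conv, 5x5 receptive field
two_3x3 = 2 * conv_params(3, c, c)  # two stacked 3x3 convs, same 5x5 field
```

The same arithmetic explains why three stacked 3 × 3 layers are preferred over one 7 × 7 layer.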

3) ResNet
Residual network (ResNet) [47] consists of up to 152 layers, built by stacking individual residual blocks, as shown in Figure 6. As the network deepens, the number of filters is doubled and the feature maps are spatially down-sampled using a stride of 2. The network employs special skip connections along with batch normalization after each convolution layer. Skip connections help optimize such deep models, as they take the activation from one layer and feed it directly to a later layer. This makes it possible to train deep networks without encountering the vanishing gradient problem. To reduce the number of parameters, ResNet applies global average pooling before a single fully connected layer that outputs 1000 classes.

TABLE 4: Widely used pre-processing techniques for color fundus images.
• Histogram equalization: increases the overall (global) contrast of the image and helps highlight foreground pixels against the background.
• CLAHE: eliminates the problem of over-amplification in near-constant pixel areas of the image and enhances minute lesions and markers like microaneurysms in fundus images.
• Colour space transformation: in DL model implementation for fundus images, model performance may in certain cases be improved by utilizing a single color from the RGB channels. Extraction of the green channel is popular, as it offers high-contrast images with rich visual information.
• Noise removal [44]: many denoising algorithms, such as Gaussian filters, median filters, and non-local means denoising, are utilized for removing unwanted noise. One trade-off is that denoising can also blur the image and degrade it by removing fine details.
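The skip connection can be sketched as follows; for simplicity, the convolution-BN-ReLU pairs are modeled as plain matrix multiplications (a toy stand-in of our own, not the actual ResNet layers). With the residual branch zeroed out, the block reduces to the identity, which is why signals and gradients survive very deep stacks:

```python
import numpy as np

def residual_block(x, w1, w2):
    """y = ReLU(F(x) + x): the shortcut adds the input back so the
    gradient can flow through the identity path unattenuated."""
    h = np.maximum(w1 @ x, 0.0)    # first weight layer + ReLU
    f = w2 @ h                     # second weight layer (residual branch F)
    return np.maximum(f + x, 0.0)  # identity shortcut, then final ReLU

# With zero weights the residual branch vanishes, so the block is the
# identity for non-negative inputs.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = residual_block(x, np.zeros((4, 4)), np.zeros((4, 4)))
```

When the spatial size or channel count changes, the actual architecture replaces the plain identity shortcut with a strided 1 × 1 projection.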

1) Fully Convolutional Networks (FCNs)
Long et al. [48] proposed a modified CNN, replacing the fully connected layers with upsampling layers (shown in Figure 7). The feature maps extracted by the initial layers are up-sampled to the size of the input image. A fully convolutional network (FCN) can perform dense pixel-wise prediction, making it better suited for segmentation tasks than a plain CNN.
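The upsampling step that turns coarse deep-layer scores into a dense pixel-wise prediction can be illustrated with nearest-neighbour resizing. This is our own simplification: FCNs learn this resizing with transposed convolutions, so only the size bookkeeping is shown here.

```python
import numpy as np

def upsample_nn(fmap, factor):
    """Nearest-neighbour upsampling of a 2D score map back to input size."""
    return fmap.repeat(factor, axis=0).repeat(factor, axis=1)

coarse = np.array([[0.1, 0.9],
                   [0.8, 0.2]])  # 2x2 per-pixel scores from deep layers
dense = upsample_nn(coarse, 4)   # 8x8 map matching a (toy) 8x8 input
mask = dense > 0.5               # dense pixel-wise prediction
```

In the actual FCN, the upsampled map has one channel per class, and skip connections from shallower layers sharpen the coarse boundaries.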

2) U-Net
Ronneberger et al. [49] proposed the network shown in Figure 8, which has symmetrical encoder and decoder structures along with several skip connections from the encoding path to the decoding path. The encoder is responsible for extracting features from input images, while the decoder reconstructs the images for the final output. The skip connections allow the network to make improved predictions by directly connecting low-level feature maps from the encoder to the decoder.
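A U-Net skip connection is a channel-wise concatenation of encoder and decoder feature maps at matching spatial resolution, sketched below with toy arrays of our own:

```python
import numpy as np

# Encoder and decoder feature maps at the same spatial resolution (H, W, C).
enc = np.arange(2 * 2 * 3, dtype=float).reshape(2, 2, 3)  # low-level features
dec = -np.ones((2, 2, 3))                                 # upsampled decoder features
# The skip connection concatenates them channel-wise before further
# convolutions, giving the decoder direct access to fine spatial detail.
merged = np.concatenate([enc, dec], axis=-1)
```

The subsequent convolutions then learn how to mix the fine encoder detail with the coarse decoder context.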

A. DIABETIC RETINOPATHY (DR) DIAGNOSIS
Diabetic retinopathy is one of the most common retinal diseases that can cause blindness if not treated in time. It is a complication seen in one-third of diabetes patients [50]. A survey estimates that nearly 93 million people worldwide suffer from DR [51]. Any diabetes patient can develop DR, which causes vascular disruption in the retina. These numbers are expected to grow considering the rapid rise in the number of diabetes patients worldwide [52]. Recent DL works on DR diagnosis are discussed below, and their performance comparison is presented in Table 5. Wang et al. [53] proposed a network that jointly performs the multiple tasks of increasing image resolution, segmenting various DR lesions, and DR grading. Their method leverages the fact that high-resolution images are suitable for accurate grading because they allow proper lesion segmentation. For each of the tasks, they employed CNN-based methods, where a robust feedback mechanism is established by utilizing task-aware loss functions. Li et al. [54] used an ensemble approach for developing a DR diagnosis method that utilizes an enhanced Inception-V4 on a privately collected dataset, which is then generalized using the Messidor-2 dataset. They investigated the effects of input image size and count on model performance. Automatic DR diagnosis presents the difficult task of handling fundus images captured under different illuminations.
Kaushik et al. [55] proposed to handle these irregularities using image desaturation techniques in the pre-processing stage. They stacked three CNNs for their training process, where optimum weights from these networks are fused for classifying fundus images for DR diagnosis. Das et al. [56] proposed a DL method that examines the branching of retinal blood vessels and abnormal vessel growth to identify DR from fundus images. After the pre-processing stage, they used the maximal principal curvature technique for segmenting blood vessels, followed by histogram equalization and a morphological operation to further refine the results. A CNN-based classifier was developed to classify the segmented blood vessels for DR diagnosis. Alyoubi et al. [8] presented a method consisting of two DL models working simultaneously: the first, based on a CNN, classifies the image into five DR categories, and the second, based on YOLOv3, detects DR lesions. Finally, both models were combined to achieve improved accuracy. Shankar et al. [57] proposed a method where the input images are first treated for noise removal, followed by histogram-based segmentation to retrieve salient regions for DR grading. A synergic DL model consisting of three sub-modules is then employed for classification.
Shankar et al. [58], in their model for DR screening, utilized Bayesian optimization for hyperparameter tuning of a DL architecture based on Inception-V4. First, they pre-processed the given images for contrast enhancement using the CLAHE algorithm, and then histogram-based segmentation was performed to generate suitable input images from which various features were extracted and utilized for DR classification. For early diagnosis of non-proliferative DR, Qiao et al. [59] utilized a CNN-based network along with various pre-processing techniques like LoG, BPF, match filters, and curve transform. They also developed a microaneurysm detection method based on principal component analysis. Wang et al. [60] developed a DL framework that performs DR diagnosis (classification into different severity levels) and DR feature detection simultaneously, where the detected features can act as supporting information. It comprises a squeeze-and-excitation (SE) backbone for feature extraction at higher scales and two heads, one for DR severity classification and another for DR feature detection.
Araújo et al. [61] proposed a DR grading system that can support its decisions by providing a grade-uncertainty parameter. The network consists of convolutional blocks from which lesion maps, indicative of the presence of lesions, are generated. Multiple-instance learning along with Gaussian sampling was utilized for computing grade-wise explanation maps. For detecting referable DR, Sahlsten et al. [62] developed a DL framework based on the Inception-V3 model pre-trained on the ImageNet dataset. High-resolution images were used for the training process, as they tend to yield better results with comparatively smaller training samples. As training on such high-resolution images may take a long time, they also took an ensemble approach, where six DL networks worked with low-resolution images, and performed a comparative analysis.
Qummar et al. [63] used an ensemble approach for improving DR severity classification in the early stages. They utilized five DCNN models for extracting salient features and generating probabilities that indicate an image's adherence to a particular DR class. The ensemble was achieved by stacking. Nneji et al. [64] employed two separate DL models, Inception-V3 and VGG-16, to work on two individual channels of the input fundus image, one derived by applying CLAHE and the other from the CECED pre-processing technique. The outputs of both DL models were weighted and merged for the final DR classification. Bora et al. [65] developed two types of DL systems based on the Inception-V3 architecture for predicting the development of DR in patients with diabetes. The DL systems were categorized by whether they take one-field (primary only) or three-field (primary, temporal, and nasal) fundus images as input. A five-stage DR classification network was proposed by Majumder and Kehtarnavaz [66], who implemented a multitask framework with two separate models: a classification model with a cross-entropy loss function and a regression model with a mean-square-error loss function. After training both separately, the extracted features were concatenated and utilized by a final perceptron network for five-stage DR classification.
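Ensembling the per-class probability vectors of several base models, as in the works above, can be illustrated with simple soft voting. Note that Qummar et al. instead train a meta-classifier on these vectors (stacking); the probabilities below are made-up toy numbers.

```python
import numpy as np

# Per-class probabilities from three hypothetical base classifiers
# for a single image (rows: models; columns: classes 0..2).
probs = np.array([
    [0.6, 0.2, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.5, 0.3],
])
# Soft voting: average the probability vectors, then take the argmax.
avg = probs.mean(axis=0)
pred = int(np.argmax(avg))
```

A stacked ensemble would feed the flattened `probs` matrix into a small learned classifier instead of averaging, letting it weight reliable models more heavily.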

1) DR Diagnosis using Microaneurysms (MAs)
Microaneurysms (MAs) are one of the earliest visual indications of DR and have gained a lot of research interest in the field of fundus image analysis. The challenges in automatic detection and segmentation of MAs are their visual similarity to other lesions, their low-contrast appearance in fundus images, and their extremely low pixel count compared to background pixels. Recent DL models that handle the problem of MA segmentation/detection are discussed below. A performance comparison is presented in Table 6.
Xia et al. [69] proposed a two-stage network, one for efficient feature extraction that employs residual learning from multiple scales of input images and the second stage for filtering out the false-positive patches. Liao et al. [70] used an encoder-decoder network for MA detection by utilizing the difference between skip connection layers. A customized loss function was used (smooth dice loss) for allowing the network to concentrate more on hard samples during the training process. They also modified the standard activation function to achieve a very precise probability distribution for MA detection. Zhang et al. [71] developed training and test samples consisting of green and blue channels of the original fundus image and two additional samples, one with enhanced contrast and another with a suppressed background. These were then used in a feature transfer network for detecting MAs, where the optimized weights from the previous phase were carried forward for the next learning phase.

2) DR Diagnosis using Exudates
Another important biomarker for detecting DR is exudates. Hence, hard and soft exudate (HE and SE) segmentation is another widely researched area in fundus image analysis. Some of the recent works in this direction are discussed below, and Table 7 shows the experimental results of different articles on exudate segmentation. Huang et al. [74], in their work on hard exudate detection, first used the Simple Linear Iterative Clustering (SLIC) algorithm to generate superpixels for each input image. Then, various pixel- and superpixel-level features were derived, and training patches were produced from each feature. These patches were applied to a CNN model to classify them into HE pixels or background pixels. Many DL detection/segmentation methods focusing on pixel-by-pixel annotation face the danger of catastrophic interference, where the model abruptly forgets previously learned attributes while learning new information. He et al. [75] used incremental learning to avoid this problem, where the knowledge of the previous model is utilized to refine the present model.
Kurilová et al. [76] presented a method that utilizes a machine learning technique for filtering input images before employing them for the DL task. A Support Vector Machine (SVM) classifier along with a faster R-CNN network was used for preliminary scanning of input image patches. Image patches without exudates were discarded while others were used for the object detection network. This helped in improving the speed and detection accuracy. To deal with the segmentation issues due to class imbalance and vast size variations in HE lesions, Liu et al. [77] proposed a double branch network, where the easy task of large HE segmentation is performed first and then gradually shifted the attention to the hard task of small HE segmentation. This is achieved through carefully guiding the training process by a customized Dice loss.
The choice of color space can also influence the accuracy of exudate detection, as demonstrated by Khojasteh et al. [78]. They first applied principal component analysis to three basic color spaces (RGB, LUV, and HSI) for contrast enhancement, and then sets of training samples were generated in all the color spaces to train a CNN for detecting exudates. Through this study, they also proposed a new color space, named PHS, for accurate detection. Kou et al. [79] implemented a modified U-Net structure consisting of one encoding path and three decoding paths. They replaced the general convolutional blocks of a U-Net with residual blocks for detailed feature extraction during the learning process. Zong et al. [80] also proposed a few modifications to the U-Net network for handling the uneven distribution of HEs in input image patches. An inception module replaced the basic convolution blocks for deriving features at various scales, and residual connections were used for generating the output. For the loss function, they used focal loss, which suitably tackles the data imbalance problem. Mohan et al. [81] proposed an exudate detection process based on altered KAZE features that effectively extracts feature points; autoencoders with extreme learning capability were used for exudate localization.
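The focal loss used by Zong et al. to counter class imbalance down-weights well-classified pixels so training concentrates on rare lesion pixels. A minimal binary NumPy version of our own, following the standard formulation of Lin et al. (the hyperparameter values are illustrative), is:

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-7):
    """Binary focal loss: (1 - pt)^gamma shrinks the contribution of
    easy examples; alpha balances the positive/negative classes."""
    p = np.clip(p, eps, 1.0 - eps)
    pt = np.where(y == 1, p, 1.0 - p)           # probability of the true class
    w = np.where(y == 1, alpha, 1.0 - alpha)    # class-balancing weight
    return float(np.mean(-w * (1.0 - pt) ** gamma * np.log(pt)))
```

With `gamma = 0` and `alpha = 1`, this reduces to plain binary cross-entropy; increasing `gamma` progressively suppresses the already-easy background pixels.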
Liu et al. [82] tackled the extreme size variation and class imbalance problems in the hard exudate (HE) segmentation task through a dual-branch network, one branch for large exudates and another for small ones. They also utilized a custom (dual-sampling modulated) loss function during training to segment HEs of different sizes. Mohan et al. [83] demonstrated a unique feature extraction method based on Hessian matrix approximation; the model was tested on multiple datasets, including a privately collected one.

3) DR Diagnosis using Hemorrhages
Hemorrhages are one of the visual signs of DR, which develop due to the burst of retinal blood vessels under extreme pressure build-up inside the vessels. The hemorrhage segmentation task is another direction taken by many researchers for DR diagnosis. Some of the recent works are discussed below and the experimental results are presented in Table 8.
Maqsood et al. [84] introduced a method for hemorrhage detection, where initially, edge details of the input image are enhanced through contrast modification and then passed onto a second stage that employs a 3D-CNN for segmentation. A modified VGG-19-CNN is also used to implement a transfer learning strategy for extracting features. Finally, before sending for feature fusion and classification, MRCEV-based feature selection is performed to mitigate redundancies. Lahmiri [85] combined CNN with a machine learning technique for detecting and classifying hemorrhage in fundus images. The task was performed in three stages, beginning with a CNN for feature extraction, followed by utilizing a Student t-test for further filtering and selecting key features. In the third stage, the selected features were passed through a support vector machine classifier for segregating images with hemorrhage from healthy ones.

4) Diagnosis using Retinal Vessel
Retinal blood vessels serve as a prominent biomarker for indicating the health of the eye. A variety of geometric characteristics, such as branch lengths, branch angles, and vessel diameters, can be derived from the retinal vessel map. These characteristics are used in the diagnosis of diseases like DR and glaucoma. Researchers have concentrated on retinal vessel segmentation and achieved excellent results. Some of the recent articles are discussed below, and the experimental results of several works on vessel segmentation are presented in Table 9.
Yang et al. [86] proposed a method where a separate U-Net-based module is first used for accurate segmentation of thin and thick vessels; it uses a common encoder for feature extraction followed by two decoders that independently use the corresponding ground-truth images of thick and thin vessels. A fusion module is then employed to combine the two segmentation results. Both U-Net architectures use additional skip connections to improve context information during training. To minimize computational time, Boudegga et al. [87] presented a new architecture in which image patches are first extracted after pre-processing and augmentation, before training. Their method utilizes a U-shaped structure implemented through lightweight convolution modules (LCMs). The segmented image patches are then merged in a post-processing stage to obtain the final results. Another U-Net-based DL model was presented by Fukutsu et al. [88] for vessel segmentation along with arteriovenous classification using probability maps. In addition to the traditional skip connections between the down-sampling and up-sampling blocks, their network has short connections for minimizing gradient loss. They also implemented a multiple dilated convolution (MDC) block between the encoder and decoder for extracting global features.
In their architecture for vessel segmentation, Atli and Gedik [89] modified the conventional encoder-decoder network by first performing up-sampling followed by a downsampling operation. Their model attempts to continue the learning process during the progression of sampling operations.
For reducing computational complexity and improving model generalization Gegundez-Arias et al. [90] presented an altered U-Net model which works on image patches derived from fundus images and employs a unique loss function during the training that considers each pixel distance to the vascular tree structure. It uses probability-based prediction for vessel segmentation. They also decreased the convolution count in each layer along with overall network depth for minimizing the model parameters. Building on the base VGG-16 network, Samuel and Veeramalai [91] proposed a method that can segment retinal blood vessels from both fundus images as well as from coronary angiograms. This architecture consists of two feature extraction layers on top of the base network. Each of these is responsible for localizing blood vessels, passing important vessel features through the intermediate layers, and summing up the feature maps from earlier stages.
To achieve satisfactory segmentation from relatively small datasets, Chen et al. [92] developed a method that uses a semi-supervised learning strategy along with a U-Net architecture. Their encoder-decoder structure first uses a relatively small amount of labeled data for training and later updates the old dataset using a custom-built strategy. Tian et al. [93] took a multi-path approach to vessel segmentation: the first path, consisting of convolution sampling blocks, utilizes low-frequency image characteristics to derive global features, while the second path, composed of coding and decoding regions, concentrates on high-frequency components to obtain local feature details. The final segmentation results were obtained by fusing both sets of information. Wang et al. [94] used an attention-driven encoder-multi-decoder network for the segmentation task. A basic U-Net structure is first implemented to generate a rough vessel segmentation map, which is compared with the available ground truth to identify hard and easy segmentation regions in the image. This information serves as a basis for two additional decoders, which focus on extracting features for hard and easy regions independently. The outputs of all three decoders are combined and finally fed into a light U-Net to yield the final results.
Biswas et al. [95] proposed a model that utilizes dilated convolutions to increase the receptive field (the amount of the input image visible to the innermost network layer) without stacking convolution layers linearly. This, in turn, provides better context for the segmentation task without necessarily increasing the computations. Wang et al. [96] presented a work where image patches, generated after the pre-processing stage, are fed into an encoder-decoder structure through two separate paths, one for capturing larger receptive fields and another for preserving spatial information. A unique attention mechanism was employed to improve the original features, and a fusion module was used to merge the features from the two paths. For accurate segmentation of the capillaries in retinal fundus images, Wu et al. [97] demonstrated a cascaded deep network: the first network generates probabilistic retinal vessel maps from the input image patches, and the second network, connected in series with the first, uses these maps to produce refined segmentation results. To avoid the loss of information caused by down-sampling of the maps, skip connections were arranged between the two cascaded networks. In their attempt at reducing mis-segmentations and computational complexity, Xiuqin et al. [98] combined U-Net with a residual learning scheme. This enabled them to make the network deeper, which is helpful for accurate segmentation, while the built-in residual module handles the network degradation caused by the increased depth.
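The receptive-field arithmetic behind dilated convolutions, as exploited by Biswas et al., is simple: each stride-1 k × k layer adds (k - 1) × dilation to the receptive-field side length. The helper below is our own quick check, not code from the paper:

```python
def receptive_field(dilations, k=3):
    """Receptive-field side length of a stack of stride-1 k x k convolutions;
    each layer contributes (k - 1) * dilation."""
    rf = 1
    for d in dilations:
        rf += (k - 1) * d
    return rf

plain = receptive_field([1, 1, 1])    # three ordinary 3x3 convs
dilated = receptive_field([1, 2, 4])  # same depth and parameter count
```

Exponentially growing dilation rates thus widen the context seen by the innermost layer without adding parameters or reducing resolution.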
It is difficult to segment retinal vessels in the presence of lesions or to identify microvessels due to the low contrast of fundus images. Dharmawan et al. [99] presented a hybrid algorithm for the segmentation task that involves a contracting path and an expansive path. Their architecture does not use a fully connected layer and hence avoids a substantial computational load. A concatenation operation in their model helps it train on both local and global features, yielding better results.

B. GLAUCOMA DIAGNOSIS
Glaucoma is another leading cause of irreversible blindness around the world [100]. As with many other retinal diseases, researchers have concentrated on developing various DL models to diagnose glaucoma from fundus images. Recent works in this direction are summarized in Table 10.
Xu et al. [101] developed a DL framework for glaucoma diagnosis with a relatively small number of training samples through the extraction of OD, OC, and retinal nerve fiber layer (RNFL) characteristics. Their framework consists of a pre-diagnosis classification phase based on the general fundus image (global attributes). In the next phase, the above-mentioned biomarker segmentation is performed and ISNT and MCDR scores are calculated. The final diagnosis is performed by utilizing all the segmentation results. Shanmugam et al. [102] used the cup-to-disc ratio (CDR) for identifying glaucoma in a fundus image. Their method primarily concentrates on accurate segmentation of the OC and OD, performed by a modified U-Net. Due to the addition of adaptive convolution in their framework, the computational burden was reduced as it used fewer filters than the standard U-Net. The morphometric attributes derived from the segmentation results were then utilized by a random forest classifier to separate glaucoma images from healthy ones. In another work, Wang et al. [103] employed a transfer learning approach using VGG-16 and AlexNet for model training and glaucoma classification. They collected ONH images from various publicly available datasets and constructed two sets. In one set, various data augmentation techniques like random scaling, cropping, rotation, and flipping were performed to expand the dataset. In the other set, three-dimensional topography maps of the ONH were constructed from the shading information of 2D images (SHS method). Both sets, when evaluated for glaucoma classification, yielded improved performance compared to regular CNN classification models. Gheisari et al. [104] utilized fundus image sequences (video) for extracting temporal features along with spatial features from static images to improve glaucoma classification accuracy. They implemented a fusion method that combines a CNN with a recurrent neural network (RNN).
For the CNN, two DL models (ResNet-50 and VGG-16) were implemented and tested. An LSTM-based RNN was used as it can select and retain useful information from the image sequence. A fully connected layer is established at the end of the fusion module for enhanced glaucoma classification.
To avoid problems like overfitting and the necessity of large datasets, Nayak et al. [105] proposed a network that utilizes a feature optimization technique based on a biological phenomenon, known as a real-coded genetic algorithm (RCGA). Once the improved features are derived through this technique, various classifiers are utilized for identifying glaucomatous images. The RCGA along with an SVM classifier provided the best results. Li et al. [106] proposed a CNN-ResNet architecture with 101 layers, trained and tested on a total of 26,585 images. They avoided the vanishing gradient problem by applying skip connections between the layers during the training process. Hemelings et al. [107] developed a CNN-based method that combines transfer learning with active learning strategies for accurate glaucoma diagnosis. They worked with 8,433 fundus images for developing their classifier. After basic image pre-processing techniques like ROI extraction and data augmentation, a pre-trained ResNet-50 consisting of 182 layers was utilized for transfer learning. An 'uncertainty sampling' technique was employed as the active learning method for the classification system. In addition, saliency maps were generated to support the model's decisions.
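The 'uncertainty sampling' strategy used in [107] can be sketched in a few lines: among the unlabeled images, those whose predicted probability lies closest to the decision boundary are sent to the annotator first. This is a minimal binary-classification sketch; the exact scoring function in [107] may differ.

```python
import numpy as np

def uncertainty_sampling(probs, budget):
    """Return indices of the `budget` most uncertain samples.

    `probs` holds the model's predicted probability of glaucoma for
    each unlabeled image; probabilities near 0.5 are most uncertain.
    """
    probs = np.asarray(probs)
    uncertainty = -np.abs(probs - 0.5)        # higher = more uncertain
    return np.argsort(uncertainty)[-budget:][::-1]

preds = [0.97, 0.51, 0.12, 0.46, 0.88]
# Picks the two samples closest to 0.5 (indices 1 and 3) for labeling.
print(uncertainty_sampling(preds, 2))
```

Labeling only these boundary cases tends to improve the classifier faster per annotation than random sampling, which is the appeal of active learning when expert-labeled fundus images are scarce.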
Juneja et al. [108] proposed a DL model where, after certain pre-processing techniques like image cropping, augmentation, and denoising, images were fed into a CNN-based model (76 layers deep). To compensate for the loss of data, they used an 'Add layer' in every block that combines the previous block's output with the next block. Martins et al. [109] developed a glaucoma diagnosis pipeline that can be executed offline on mobile devices. They mainly relied on OD and OC segmentation (U-shaped model) for generating useful morphological features, which are used by a separate classification network (with MobileNet-V2 as the backbone). All the results, along with the morphological calculations, are gathered to construct a diagnosis pipeline that was integrated with a mobile application. Liu et al. [110] developed a DL framework for glaucoma detection by utilizing 241,032 images collected from the Chinese Glaucoma Study Alliance (CGSA). Downsampled images are fed into a CNN architecture (based on ResNet). For accurate generalization, a unique 'online DL system' was developed where experts confirm the model's classification results and the confirmed samples are then utilized for retuning the model before the next prediction.
Bajwa et al. [111] performed glaucoma classification in two stages. The first stage utilizes 'Regions with CNN' (RCNN) for OD extraction and localization. It also includes a semi-automatic ground truth generation part that helps in creating ground truths containing the location of the OD for training the RCNN. The second stage is composed of four convolution layers and three fully connected layers and uses the ROI images (generated after OD extraction) for classification. Kim et al. [112] proposed a two-task network that utilizes various CNNs for glaucoma classification and 'gradient-weighted class activation mapping' for localizing the most suspicious glaucomatous regions in a given fundus image. Out of the various CNN variants, the ResNet-152-M model achieved the most promising results. As an extension, the authors also developed a web-based application incorporating the model in the background, which provides the decision, a diagnostic confidence score, and the suspicious location for a given input fundus image. Singh et al. [113] conducted a detailed study on a variety of DL methods for classifying fundus images into glaucomatous and normal categories. ORIGA, HRF, and ACRIMA were used as training and validation datasets. This study indicates that the Xception and Inception-ResNet-V2 models yield better performance than the others.
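Gradient-weighted class activation mapping, as used in [112], weights each feature map of the last convolutional layer by the spatially averaged gradient of the class score and keeps only positive evidence. The combination step reduces to the numpy sketch below, which operates on precomputed activations and gradients; a real implementation would obtain both from the network's forward and backward passes.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Combine conv feature maps into a class activation heatmap.

    activations: (C, H, W) feature maps of the last conv layer.
    gradients:   (C, H, W) gradients of the class score w.r.t. them.
    """
    weights = gradients.mean(axis=(1, 2))             # alpha_k via global average pooling
    cam = np.tensordot(weights, activations, axes=1)  # sum_k alpha_k * A_k
    cam = np.maximum(cam, 0)                          # ReLU: keep positive evidence only
    if cam.max() > 0:
        cam = cam / cam.max()                         # normalize to [0, 1] for overlay
    return cam

rng = np.random.default_rng(0)
acts, grads = rng.random((8, 7, 7)), rng.random((8, 7, 7))
heatmap = grad_cam(acts, grads)
print(heatmap.shape)  # (7, 7) map, upsampled to image size in practice
```

The resulting low-resolution map is upsampled to the fundus image size and overlaid to highlight the suspicious region, which is exactly the kind of supporting evidence the web application in [112] exposes to clinicians.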
Ovreiu et al. [114] proposed a dense network consisting of 201 layers for improving the performance of glaucoma classification. Each layer of this network gathers additional inputs from all previous layers. In another work, Saravanan et al. [115] demonstrated an autoencoder architecture for glaucoma detection along with AVP recognition; they specifically concentrated on reducing classification errors through a multi-modal learning implementation. Shoukat et al. [116] compared the performance of three pre-trained CNN-based models for early glaucoma detection. The test was conducted on the RIM-ONE, G1020, and REFUGE datasets. A pre-trained EfficientNet-B7 yielded the highest accuracy on the G1020 dataset. Islam et al. [117] compared the performance of different DL models like DenseNet, MobileNet, EfficientNet, and GoogleNet on a private dataset consisting of 643 fundus images. The best performance was achieved by EfficientNet-B3 when cropped OD and OC images were used for training. As an alternative to automatic glaucoma detection, they also developed a vessel segmentation model using a U-Net architecture. For early detection of glaucoma, Shoukat et al. [118] utilized the G1020 and DRISHTI-GS datasets for training their EfficientNet-based model. Image enhancement through various pre-processing techniques was carried out in the initial stage to highlight crucial features in the fundus image.

1) OD/OC Segmentation
Other important retinal biomarkers used for the diagnosis of glaucoma are the Optic Disc (OD) and Optic Cup (OC). The cup-to-disc ratio is calculated from the vertical cup diameter and the vertical disc diameter. Hence, accurate segmentation of the OD/OC has become crucial for glaucoma diagnosis, and a lot of research has been carried out in this direction. Some of the recent articles on DL-based OD/OC segmentation are discussed in the following sections, along with experimental results in Table 11.
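Once binary OD and OC segmentation masks are available, the vertical cup-to-disc ratio described above follows from simple mask arithmetic. This is a simplified sketch: clinical pipelines typically also fit ellipses to the masks and validate them before computing the ratio.

```python
import numpy as np

def vertical_extent(mask):
    """Vertical diameter (in pixels) of a binary segmentation mask."""
    rows = np.any(mask, axis=1)          # rows containing any foreground pixel
    idx = np.where(rows)[0]
    return 0 if idx.size == 0 else int(idx[-1] - idx[0] + 1)

def vertical_cdr(cup_mask, disc_mask):
    """Vertical cup-to-disc ratio from segmentation masks."""
    disc = vertical_extent(disc_mask)
    return vertical_extent(cup_mask) / disc if disc else 0.0

# Toy masks: disc spans rows 10-49 (40 px), cup spans rows 20-39 (20 px).
disc = np.zeros((64, 64), bool); disc[10:50, 15:55] = True
cup = np.zeros((64, 64), bool); cup[20:40, 25:45] = True
print(round(vertical_cdr(cup, disc), 2))  # 0.5
```

A ratio well above roughly 0.6 is commonly treated as a glaucoma-suspicious sign, which is why segmentation accuracy translates directly into diagnostic accuracy here.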
Wang et al. [124] developed a DL network, based on the U-Net framework, consisting of two subnetworks: a feature detection subnetwork (FDS) and a cross-connection subnetwork (CCS). The first subnetwork is responsible for extracting desired objects and necessary image features, while the second subnetwork is used for object segmentation. The presence of two subnetworks introduced several multiscale features into both the encoding and decoding processes by element-wise subtraction. This in turn made the model more sensitive to edge information and played a crucial role in accurately segmenting the optic disc. Veena et al. [125] proposed two individual CNN models for OD and OC segmentation. First, they located the optic nerve region using basic edge detection methods like the 'Sobel' and 'watershed' algorithms, and then the image was cropped to the desired region. The cropped images were fed into two separate CNN models, composed of 39 layers each, for the segmentation task. The increased number of layers in each CNN model contributed to the extraction of ample image features and also helped in retaining the image resolution in the output image, which improved the segmentation results. Kumar and Bindu [126] presented a U-Net-based architecture consisting of three subsections, namely 'contraction', 'bottleneck', and 'expansion'. The kernel count in the first subsection starts at 16 and doubles with each block to reach 256, which helps the architecture learn more complicated features from the input image. The expansion subsection contains the same number of blocks as the contraction, and every input is combined with the feature maps of the corresponding contraction layer. This approach alleviated the problem of gradient vanishing during the training of the model. Natarajan et al.
[127] developed a lightweight network for OD segmentation, where they used a Gaussian mixture model (GMM) superpixel segmentation algorithm followed by 'simple linear iterative clustering' to extract the region of interest (RoI) from the fundus image. These RoIs are then fed into a U-Net architecture for the semantic segmentation of the OD; to smoothen the isolated points and coarse edges of the output, a regularization term is introduced into the loss function. This helped in improving the model's generalization for the segmentation task. Panda et al. [128] presented their work on OD and OC segmentation using a model involving residual learning built on a CNN-based architecture. They carried out several image pre-processing techniques like region of interest (RoI) selection around the OD center, image normalization, and contrast enhancement. This was followed by random patch (32 × 32 pixels) extraction from both the pre-processed images and their corresponding ground truths as inputs for the training process. The unique combination of convolutional layers with residual blocks, processed on patch-level data, allowed the model to focus on localized structural similarities. This process flow helped to improve OD and OC segmentation, considering the limited availability of training samples.
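The patch-level training used in [128] amounts to paired random cropping of the pre-processed image and its ground truth, which multiplies the effective number of training samples. The sketch below is illustrative, with the 32 × 32 patch size from the text; the sampling strategy in [128] itself may differ in detail.

```python
import numpy as np

def random_patches(image, ground_truth, n, size=32, seed=0):
    """Extract n aligned random patches from an image and its label map."""
    rng = np.random.default_rng(seed)
    h, w = image.shape[:2]
    patches = []
    for _ in range(n):
        y = rng.integers(0, h - size + 1)   # top-left corner, kept in bounds
        x = rng.integers(0, w - size + 1)
        patches.append((image[y:y + size, x:x + size],
                        ground_truth[y:y + size, x:x + size]))
    return patches

img = np.random.rand(128, 128, 3)          # pre-processed RoI around the OD
gt = np.random.randint(0, 3, (128, 128))   # 0 = background, 1 = OD, 2 = OC
pairs = random_patches(img, gt, n=16)
print(len(pairs), pairs[0][0].shape, pairs[0][1].shape)
```

Because each patch and its label crop come from identical coordinates, the network sees locally consistent supervision, which is what lets patch-level models cope with small datasets.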
Fu et al. [129] demonstrated a fusion method for OD segmentation to improve model performance and avoid the distraction in images caused by bright lesions and illumination variations. First, two separate U-Nets were used for detecting the OD and retinal blood vessels independently. The Hough transform method is employed to fit the blood vessels with line segments, and the joint probability is then derived by combining the OD detection of the U-Net with probability bubbles from the intersection points of the Hough line segments. This in turn is used for OD center and radius estimation. Zhao et al. [130] proposed a model to decrease the computational load by combining transfer learning with a U-Net segmentation architecture. First, the segmentation accuracy was boosted by utilizing attention gate learning as an intermediate between the encoder and decoder phases of the classical U-Net architecture. Then the algorithm was implemented with transfer learning, where the weights were initially trained on a renowned dataset before learning from the fundus dataset. The scarcity of sufficient labeled images was tackled using the above-mentioned transfer of learning. This approach has shown a significant reduction in network inference time compared to its contemporaries.
Xiang et al. [131] presented their work on OD and OC segmentation, concentrating on improving model performance across multiple datasets. This was achieved by introducing a multi-scale weight-shared attention (MSA) module after the encoder phase, which enhances the OD/OC feature extraction process, and a depth-wise separable convolution (DSC) module after the decoder phase that accurately concentrates on the target features. Model generalization was tested across five fundus datasets, achieving improved results compared to other state-of-the-art architectures. Jin et al. [132] proposed a method for OD segmentation that involves a DenseNet-based encoder for feature extraction, to deal with small datasets. In the decoder stage, they used various layers of features to drive the attention process and combined feature information from multiple scales for the upsampling process. At the end of their network, they combined the cross-entropy and the Dice coefficient to generate an improved loss function for optimizing the model during training. Bengani et al. [133] handled the problem of the lack of a large labeled dataset by using a two-tier approach for the OD segmentation task. First, a convolutional autoencoder was trained on a large number of unlabeled fundus images, implementing semi-supervised learning. Later, transfer learning was applied to the above-mentioned pre-trained model with OD ground truth images. Bhatkalkar et al. [134] proposed an encoder-decoder-based generic regression model for simultaneous segmentation of the fovea center and the OD. For training the model, the central coordinates of the fovea and OD from ground truth images were transformed into heatmaps.
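A combined cross-entropy plus Dice loss of the kind used in [132] can be written compactly for a binary segmentation map. This is a generic sketch: the exact weighting between the two terms in [132] is not specified here, so the terms are simply summed.

```python
import numpy as np

def bce_dice_loss(pred, target, eps=1e-7):
    """Binary cross-entropy plus (1 - Dice) on probability maps.

    pred:   predicted foreground probabilities in (0, 1).
    target: binary ground-truth mask.
    """
    pred = np.clip(pred, eps, 1 - eps)   # avoid log(0)
    bce = -np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred))
    intersection = np.sum(pred * target)
    dice = (2 * intersection + eps) / (np.sum(pred) + np.sum(target) + eps)
    return bce + (1 - dice)              # lower is better for both terms

target = np.array([[0, 1], [1, 1]], float)
good = bce_dice_loss(np.array([[0.1, 0.9], [0.8, 0.9]]), target)
bad = bce_dice_loss(np.array([[0.9, 0.1], [0.2, 0.1]]), target)
print(good < bad)  # True: the closer prediction incurs the lower loss
```

The Dice term counteracts the class imbalance of small structures like the OD against a large background, while the cross-entropy term keeps per-pixel gradients well behaved, which is the usual motivation for combining the two.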
Nazir et al. [135] utilized EfficientNet-B0 as the base model to develop a network for glaucoma detection using OD and OC lesions. First, the relevant features were extracted using the base network, then a unique bidirectional feature module was employed to fuse the key points multiple times, and finally, glaucoma localization was achieved along with class prediction. Xiong et al. [136] proposed OD segmentation by leveraging Hough transform annotations. They used a Bayesian U-Net based on weak labels for the segmentation. A probability-based graphical model was built and implemented on the U-Net. The expectation-maximization method was used for OD estimation and subsequent weight assignment. Hervella et al. [137] demonstrated a multi-task architecture that simultaneously performs OD and OC segmentation along with glaucoma classification. Both image-level and pixel-level labels were utilized in the training process. The simultaneous classification and segmentation tasks increased the number of shared parameters.

C. AMD DIAGNOSIS
Age-related macular degeneration (AMD) is one of the leading causes of blindness among the elderly population [138]. AMD generally affects the macular region of the retina. A study estimated that by 2020 the number of patients with AMD had reached approximately 196 million globally and is expected to reach 288 million by the year 2040 [139]. In the following section, we discuss various DL-based methods employed for the automatic diagnosis of AMD in recent times. The publicly available datasets utilized for AMD diagnosis are presented in Table 2. Similarly, experimental results of recent research work on AMD diagnosis are shown in Table 11.
Chou et al. [140] utilized a stacking technique to combine a fundus image-based DL model with biomarkers derived from optical coherence tomography (OCT) for distinguishing neovascular AMD (nAMD) from polypoidal choroidal vasculopathy (PCV), as the two manifest similar image properties. A method called multiple correspondence analysis (MCA) was introduced for converting OCT biomarkers into continuous principal components. EfficientNet-B3 was used for training and validation on the fundus images. The ensemble stacking strategy yields the best mixture of the above two paths for accurate predictions on new input images. Yan et al. [141] presented a framework for predicting late AMD progression using a modified deep CNN. Apart from fundus images, their model also considers genotypes for improving accuracy. A total of 31,262 fundus images from the AREDS dataset were used for the project. An Inception-V3 CNN was used as the base model for the training process. The extracted features were fed to a fully connected layer for AMD severity classification. This severity, combined with 52 genetic variants, was again fed into another FC layer for predicting the probability of late AMD within a queried number of years.
Xu et al. [142] proposed a dual deep CNN model which utilizes fundus and OCT image pairs for categorizing AMD and PCV. Transfer learning was employed by first transferring the weights from ResNet-50 onto two individual models that separately take OCT images and fundus images as inputs. After refining the weights on new data, they were transferred to the corresponding convolutional blocks. In the end, an FC layer was established that classified input pairs into wet AMD, dry AMD, PCV, and nAMD. Another work, based on drusen segmentation for AMD detection, was proposed by Pham et al. [143], who tried to tackle the data imbalance problem, as the number of non-drusen pixels was very high compared to drusen pixels. Their model consists of two networks: an image-level network that uses a DeepLabv3+ base architecture to generate drusen probability maps, and a patch-level network that works on the corresponding patch images and their probability maps for the final prediction. The patch-level network employs U-Net-based segmentation. A total of 775 fundus images from Kangbuk Samsung Hospital were used for training. Model performance was also tested on the STARE dataset. Vaghefi et al. [144] introduced a multimodal approach for dry AMD diagnosis, where the DL model utilizes three modalities (fundus images, optical coherence tomography (OCT), and OCT angiography (OCT-A)) for improving accuracy. Input samples from 75 individuals were collected and grouped into three categories. An Inception-ResNet-V2-based CNN was used as the base model and further modifications were made to facilitate training on multiple modalities. It was demonstrated that higher accuracies can be achieved with DL models through the suitable utilization of images of different modalities.
Peng et al. [145] presented a two-step DL framework for accurately estimating the risk of late AMD at the individual level. Initially, a classification network was implemented and trained on over 80,000 manually annotated fundus images collected from the AREDS and AREDS2 datasets. The second part of the architecture, known as the survival model, was responsible for predicting late AMD progression probability depending on the grading results or the extracted features of the previous section. This stage also provides an option for including genotype information. Their model achieved an accuracy of 0.864 when validated with an independent test dataset. In another work, Heo et al. [146] developed a CNN-based classification model that uses VGG-16 as the base architecture for classifying dry AMD and neovascular AMD. In the pre-processing phase, image cropping was performed (around the macula center) followed by border adjustments. Feature maps from the final CNN layer were derived and weights were computed by a class activation map for generating a heatmap. González-Gonzalo et al. [147] collected 600 fundus images with AMD and DR from three different medical centers (in Europe) and conducted performance validation of the RetCAD-DL model (a commercially available DL system) for combined detection of AMD and DR from fundus images. The model was additionally assessed with the AREDS and Messidor datasets to establish further validation. The RetCAD model executes joint detection by first taking RGB and contrast-enhanced (CE) versions of the original fundus input image, and then utilizes two ensembles (each with three CNNs), one for DR and one for AMD. These ensembles generate DR and AMD scores indicating the probability of referable DR and AMD.
Bridge et al. [148] developed a prognostic model that predicts the future progression of AMD based on multiple longitudinal images (whose timepoints are spaced unevenly). In stage one of this method, a CNN (Inception-V3) was used for generating feature vectors from the input image; in the next stage, the vectors were merged through interval scaling, which compensates for the uneven image timepoints. Finally, in the third stage, a recurrent unit provides the probability of AMD progression by employing the sigmoid activation function. Another DL method was proposed by Yoo et al. [149], which utilizes fundus images and OCT to diagnose AMD. This multimodal approach improved the diagnostic results compared to those obtained using either imaging modality alone. VGG-19, pre-trained on the ImageNet dataset, was used for feature extraction from both OCT and fundus images; later, a random forest (RF) model operated on those features for the final classification. A DeLong test was performed, which implies that the multimodal approach significantly boosted model performance.
Building on their previous work, Chen et al. [150] utilized DeepSeeNet [151] to develop a classification network that classifies fundus images into the AREDS-prescribed nine-step AMD severity scale. The initial stage contains 10 Inception-V3 blocks for feature extraction; the second stage has an average pooling layer followed by a dense and a dropout layer. The third stage consists of four layers that run in parallel (multi-task) for detecting four major AMD characteristics that are later combined to map the image to a nine-step severity score. Training was performed using the AREDS dataset and the model was later evaluated on both AREDS and AREDS2.
Pham et al. [152] developed a framework for monitoring AMD disease progression from early-stage images by synthesizing future AMD images. They utilized a GAN with two discriminators for producing realistic future fundus images. In another work, Yellapragada et al. [153] presented a method for training a model without labeled data. They first used a self-supervised NPID training technique on fundus images and then tested its performance using a supervised classifier for grading AMD severity levels. Wu et al. [154] presented a comparative study on the performance of a model in predicting AMD progression by utilizing automatic OCT imaging biomarkers versus manually graded color fundus images. Govindaiah et al. [155] showed that the overall performance of late AMD prediction models may improve by adding genetic, clinical, and socio-demographic data while training the model.

D. CATARACT DIAGNOSIS
Cataracts are among the serious eye diseases and, if not identified and treated in time, may lead to irreversible vision loss [156]. A recent study shows that nearly 33.6 million cases of blindness (45 percent of global blindness) are due to cataracts [157]. Recently, many attempts have been made to automatically diagnose cataracts from fundus images. Developments in this direction are discussed below along with a performance comparison (Table 13) of various DL models.
To reduce the training parameters and computational burden while training a model for cataract detection, Junayed et al. [158] adjusted the activation function and loss function of their CNN-based architecture. They also experimented with three different models using various numbers of CNN blocks (3, 4, and 5 blocks) and tested them for detection accuracy. The model with 4 blocks presented optimal results without any overfitting. Imran et al. [159] proposed a cataract classification model (severe, moderate, mild, normal) by combining a CNN with a recurrent neural network (RNN). After the pre-processing stage, each fundus image from the dataset was subdivided into 12 patches, and each of these patches was processed through pre-trained CNN models (GoogleNet, AlexNet, VGGNet, ResNet) for feature extraction. A bidirectional LSTM-based RNN was utilized to process the feature vectors for cataract classification. To deal with noise-affected fundus images, which are common due to image acquisition complexities, Pratap and Kokil [160] employed two independent DCNNs in a combined feature extraction (CFE) strategy. A pre-trained AlexNet was modified and used for implementing CFE. Support vector machine (SVM) classifiers (independently trained at different noise levels) were used, and the features extracted in the earlier stage were fed into a specific classifier by considering the noise level in that particular image. In another work, Imran et al. [161] combined an SVM classifier with several DL models for cataract classification. After a pre-processing phase involving image resizing, green channel extraction, and image normalization, a transfer learning mechanism was implemented with pre-trained ResNet, VGGNet, and AlexNet for the feature extraction stage. Next, the first fully connected layer was replaced by a global average pooling layer, and individual SVMs were employed for the final four-level cataract classification.
The architecture proposed by Hossain et al. [162] used ResNet-50 as the base module for the cataract detection task. They collected fundus images from various sources and utilized a total of 3,048 cataract-infected images and 2,670 non-cataract images for their research. Zhang and He [163] used a stacking technique for grading cataracts into six different levels. This was achieved by employing ResNet-18 for extracting high-level features, combined with manual extraction of texture features using a GLCM (gray-level co-occurrence matrix). These two feature sets were fed into two separate SVM classifiers to learn the base-level probabilities of each input image, followed by a fully connected neural network for generating the final cataract classification label.
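The GLCM texture features combined with deep features in [163] count co-occurrences of gray levels at a fixed pixel offset; statistics such as contrast then summarize the matrix. Below is a minimal horizontal-offset sketch for illustration; libraries such as scikit-image provide a full implementation with multiple offsets and angles.

```python
import numpy as np

def glcm(image, levels=4):
    """Gray-level co-occurrence matrix for offset (0, 1) (right neighbor)."""
    m = np.zeros((levels, levels))
    for row in image:
        for a, b in zip(row[:-1], row[1:]):   # horizontally adjacent pairs
            m[a, b] += 1
    return m / m.sum()                        # normalize to joint probabilities

def contrast(p):
    """GLCM contrast: large when adjacent pixels differ strongly."""
    i, j = np.indices(p.shape)
    return float(np.sum(p * (i - j) ** 2))

# Toy quantized image with 4 gray levels.
img = np.array([[0, 0, 1, 1],
                [0, 0, 1, 1],
                [2, 2, 3, 3],
                [3, 3, 2, 2]])
p = glcm(img)
print(round(contrast(p), 3))
```

Cataractous fundus images appear blurred, which shifts GLCM statistics like contrast toward smoother values, making such hand-crafted texture features a sensible complement to CNN features in a stacked classifier.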
To overcome some of the shortcomings of standard CNN architectures like overfitting, high computational burden, and the fading gradient problem, Raza et al. [164] utilized transfer learning. The authors of [166] proposed an attention network that focuses on global features as well as local features before final grading.

E. ROP DIAGNOSIS
Retinopathy of prematurity (ROP) is a retinal disease that mainly affects the fundus vasculature of infants. With neovascularization, this disease may lead to severe consequences like retinal detachment and complete blindness among children. For timely treatment, it is important to identify an early symptom called plus disease, which causes morphological changes to the retinal blood vessels of preterm infants. Recent developments in DL implementation for ROP diagnosis and their experimental results are presented in Table 14.
Ramachandran et al. [167] introduced a framework for identifying ROP by detecting plus disease in infant fundus images. In their semi-supervised approach, the network generates bounding boxes around the twisted vessels, and the count of these boxes indicates the presence of plus disease in the retinal image. This is achieved by employing a fully convolutional neural network (inspired by the YOLO architecture) for detecting the twisted vessels. A twofold training approach was adopted, where the model is trained with manually labeled images to generate images with bounding boxes (pseudo-labeled images); then both manually labeled and pseudo-labeled images are used for retraining the model, which is finally used for predicting ROP. For establishing ROP diagnosis along with an assisted medical follow-up mechanism, Agrawal et al. [168] developed a network that uses an ensemble of U-Net and Hough transform techniques to detect the various zones (I, II, III) in fundus images of premature infants. These zones are used to indicate ROP severity and assist in scheduling the next screening. Two U-Nets are used for the OD and retinal vessel segmentation tasks, from which the zones can be estimated.
Lei et al. [169] developed an ROP detection network that also generates supporting evidence for its decisions. Initially, ResNet-50 (the backbone network) was modified and improved by the addition of a channel and spatial attention (CASA) module, which extracts ROP lesion features and subsequently, through a fully connected layer, detects ROP in the fundus image. In parallel, gradient-weighted class activation mapping is applied to the extracted features for visualizing them and locating the retinal structures that can explain the model's decision. To deal with the relatively high obscurity of ROP features compared to other retinal features in fundus images, Chen et al. [170] proposed a network that learns at multiple instances and classifies ROP into different stages. A fully convolutional network (FCN) is used for obtaining local features and producing a spatial score map of ROP lesions. These are again converted into patches to augment the dataset. A separate CNN (a multiple instance learning network) utilizes these patches for retraining to achieve even better performance. Finally, the ROP classification is obtained through an attention pooling mechanism. Coyner et al. [173] investigated the viability of utilizing synthetically generated fundus images for the diagnosis of plus disease in ROP. Generative adversarial networks (GANs) were utilized for fundus image synthesis. The pix2pix pipeline was implemented for realistic image generation from retinal vessel maps. This work indicates that GANs can be effectively utilized for dataset augmentation for improved model training in ROP classification networks. Tan et al. [174] utilized a private dataset consisting of 6,974 fundus images to train an AI model for classifying normal and plus-diseased images. Their algorithm also showed promise in detecting a comparatively less severe pre-disease stage known as pre-plus disease from a fundus image. In another study, Redd et al.
[175] assessed a DL severity screening system developed for ROP. The DL system generates a score on a 1-9 scale indicating severity based on retinal vascular abnormalities, which in turn is used to predict an overall ROP disease category. Wang and Chen [176] developed an automated system for identifying the existence of ROP in fundus images along with understanding its severity level. They utilized 6,030 data samples for the training process, where a median frequency balancing technique was employed to handle the data imbalance problem.
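The median frequency balancing used in [176] weights each class by the median class frequency divided by the class's own frequency, so rare classes (such as diseased pixels or under-represented severity levels) contribute more to the loss. The sketch below computes the weights from per-class counts over a toy set of label maps.

```python
import numpy as np

def median_frequency_weights(label_maps, num_classes):
    """Class weights = median(freq) / freq over the training labels."""
    counts = np.zeros(num_classes)
    for labels in label_maps:
        counts += np.bincount(labels.ravel(), minlength=num_classes)
    freq = counts / counts.sum()      # per-class pixel frequency
    return np.median(freq) / freq     # rare classes get large weights

# Toy dataset: background (0) dominates, lesion class (2) is rare.
maps = [np.array([[0, 0, 0, 1], [0, 0, 1, 2]])] * 4
w = median_frequency_weights(maps, 3)
print(w.round(2))  # [0.4  1.   2.  ]
```

These weights would then multiply the per-class terms of a cross-entropy loss so that the model is not rewarded for simply predicting the majority class.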

VI. RESEARCH DIRECTIONS
As discussed in the previous sections, retinal disease diagnosis using DL methods has progressed remarkably in terms of testing and evaluating various network architectures. However, many unexplored areas remain open for future research. From the review conducted in this paper, we see the following directions for future research:
• Weakly supervised learning models: Even though many fundus image datasets exist in the public domain, the availability of labeled fundus images is quite limited when compared to natural image datasets like ImageNet, which has nearly 14 million images. As the accuracy of a DL model depends on the number of available training images, one possible future direction is investigating weakly supervised learning models. Research could be directed towards developing DL models that achieve robust performance from fundus images that are either partially labeled or inaccurately labeled, as they present ample scope for improving retinal disease diagnosis.
• Fundus image synthesis: The recent popularity of Generative Adversarial Networks (GANs) has shown potential for generating synthetic fundus images which can be used to augment the training dataset. This can effectively mitigate the lack of good-quality labeled data and improve prediction performance. Although some recent research has shown the synthesis of images for DR, glaucoma, and AMD, the field is still relatively new and presents ample scope for future research.
• Lightweight network design: Most of the DL models developed for retinal disease diagnosis perform well at the expense of high consumption of computational resources. This is a major hurdle in implementing such models on resource-constrained devices, leaving ample scope for lightweight network design.
• Data privacy: Sharing of patient fundus images across institutions is restricted by privacy concerns. This adds to the data scarcity problem and restricts model training to only the available public datasets, depriving models of the rich and diverse private fundus data available at hospitals.
Schemes like federated learning can be explored where the models can be trained on private data locally and then the learned weights are transferred to a global model. • Multiple disease diagnosis: Another promising research direction is simultaneously detecting multiple retinal diseases using DL. This can be helpful for clinicians to identify patients with more than one retinal disease. Although studies have been carried out in this direction such as simultaneous 'DME and DR diagnosis', simultaneous 'AMD, DR and glaucoma diagnosis', etc. it is still a relatively less explored area. • Smartphone-based retinal disease diagnosis: The majority of current work in this field utilizes fundus images captured through high-resolution fundoscopy. There is ample scope for researchers to develop models that can learn from fundus images captured through smartphones. This will help in developing a remote eye screening facility. • Generating evidence maps: One of the major concerns of DL implementation for retinal disease diagnosis is acquiring the approval of professional doctors for the AI-based model. Very limited research has been carried out in the direction of making the predictions more interpretable. One possible solution for this could be, generating evidence maps for the predictions made by the DL model and showing or highlighting the crucial regions of the fundus image the deep network used to arrive at the final decision. Recently some approaches have shown progress in this direction but there is still vast scope for research in terms of improving the accuracy of these evidence maps. For example, DR diagnosis depends on finding various lesions and markers on fundus image, so one can perform accurate lesion segmentation and simultaneously grade DR to generate quality evidence maps.
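The federated learning scheme mentioned above can be illustrated with a FedAvg-style aggregation step: each site trains locally, and a server averages the resulting weights in proportion to each site's sample count, without ever seeing the raw fundus data. This is a minimal numpy sketch with hypothetical site names and a toy one-tensor "model", not a production federated setup.

```python
import numpy as np

def federated_average(local_weights, sample_counts):
    """FedAvg-style aggregation: per-tensor weighted mean of local
    model weights, weighted by each site's number of training samples."""
    total = sum(sample_counts)
    return [
        sum(n * w[i] for w, n in zip(local_weights, sample_counts)) / total
        for i in range(len(local_weights[0]))
    ]

# Two hypothetical hospitals train the same tiny model locally
# (a single weight tensor each, for illustration only).
site_a = [np.array([1.0, 2.0])]   # trained on 300 private images
site_b = [np.array([3.0, 4.0])]   # trained on 100 private images
global_model = federated_average([site_a, site_b], [300, 100])
```

Only the learned weights leave each hospital; the private fundus images never do, which is precisely what makes the scheme attractive for medical data.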

VII. CONCLUSION
There is a pressing need for automated systems for identifying eye-related diseases, considering the shortage of medical experts relative to the number of patients. A color fundus image, which captures a wide variety of eye-related pathologies in image format, has opened up substantial scope for research in medical image analysis. With the rapid growth in computational power, increasingly sophisticated DL models are being implemented and tested for automatic disease diagnosis. Sophisticated image processing techniques can now be utilized to bring out salient features from a given fundus image. Lesions like microaneurysms, exudates, and hemorrhages, which occupy a significantly small number of pixels in a fundus image, are now utilized to diagnose diseases at an early stage. This review presented a process-based approach for understanding the latest DL approaches in the ophthalmic disease diagnosis process.
As the success of a DL model highly depends on the training dataset, a consolidation of all publicly available fundus image datasets was presented along with descriptions of their ground truths. It is observed that many datasets, such as IDRiD, Messidor, and DRIVE, contain high-quality fundus images captured in a controlled environment. Models trained on these datasets may not perform well on other datasets. On the other hand, datasets like Kaggle and Eye-PACS, among others, contain images captured in diverse environmental conditions. These may not be ideal for efficient model training, but because they mimic real-world scenarios, they can steer model behavior toward the practical side. A balanced combination of datasets may help in developing a robust model that can be implemented clinically.
Most of the studies have shown that applying image pre-processing techniques such as contrast enhancement, color space transformation, image augmentation, and filtering can help the DL model better extract disease-relevant features during the training process.
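As a concrete example of the contrast enhancement mentioned above, the sketch below applies global histogram equalization to an image channel using only numpy. Fundus pipelines commonly enhance the green channel, where vessels and lesions show the highest contrast; the tiny 4x4 array and function name here are illustrative assumptions, and real pipelines often use CLAHE rather than global equalization.

```python
import numpy as np

def equalize_histogram(channel):
    """Global histogram equalization for an 8-bit image channel:
    remaps intensities so the cumulative distribution is ~uniform."""
    hist = np.bincount(channel.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]  # first non-zero CDF value
    lut = np.round((cdf - cdf_min) / (cdf[-1] - cdf_min) * 255).astype(np.uint8)
    return lut[channel]  # apply lookup table to every pixel

# Synthetic low-contrast green channel (values clustered in 50-120).
green = np.array([[50, 50, 60, 60],
                  [60, 70, 70, 80],
                  [80, 90, 90, 100],
                  [100, 110, 110, 120]], dtype=np.uint8)
enhanced = equalize_histogram(green)
```

After equalization the intensities span the full 0-255 range, which tends to make small lesions such as microaneurysms easier for a network to pick up.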
The work published in recent years has used various backbone models to build solutions for disease diagnosis. Basic CNNs, VGG, ResNet, Inception, etc. are utilized for classification tasks, while networks like U-Net, FCNs, Mask R-CNN, SegNet, etc. are implemented for segmentation tasks. In most of the studies, these backbone models have only served as a base structure. Many learning paradigms, such as ensemble learning, transfer learning, multitask learning, and active learning, have also been explored to improve model performance and provide an accurate diagnosis. Among all retinal diseases, Diabetic retinopathy has been the most widely studied, and its actual clinical implementation has been examined. The primary research for DR is now directed toward providing interpretable heatmaps along with disease classification. Similarly, glaucoma diagnosis has also been studied considerably, but most studies focus on direct classification or diagnosis based on CDR estimation. Compared to these two diseases, much less attention has been paid to AMD, one reason being the lack of large datasets for AMD diagnosis tasks. Diagnosis of Cataract and ROP offers plenty of scope for future researchers, as relatively few studies have been carried out in that direction.
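Of the learning paradigms listed above, ensemble learning is the simplest to illustrate: several backbone models each produce class probabilities, and a soft vote averages them before taking the argmax. The sketch below uses made-up probabilities and a hypothetical three-class DR grading scheme purely for illustration.

```python
import numpy as np

def soft_vote(probabilities):
    """Ensemble by soft voting: average per-class probabilities
    across models, then pick the highest-scoring class."""
    avg = np.mean(probabilities, axis=0)
    return avg, int(np.argmax(avg))

# Hypothetical 3-model ensemble grading DR into three classes
# (no DR / non-proliferative / proliferative); one row per model.
preds = np.array([
    [0.6, 0.3, 0.1],   # e.g. a ResNet backbone
    [0.2, 0.5, 0.3],   # e.g. an Inception backbone
    [0.3, 0.4, 0.3],   # e.g. a VGG backbone
])
avg_probs, grade = soft_vote(preds)
```

Even though two of the three models disagree on the top class here, averaging the full probability vectors lets the ensemble settle on the grade with the strongest overall support.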
The retinal diseases focused on in this review are of crucial importance, as a delay in their treatment may lead to complete vision loss. The interest in implementing DL for retinal disease diagnosis has grown significantly in the past few years. In some cases, the performance of DL models has surpassed that of human experts. However, the scope for future work remains wide open, as DL systems must still evolve and be integrated into clinical practice to deliver efficient and effective patient care.
NEERAJ DHANRAJ BOKDE is working as a Postdoctoral Researcher at Aarhus University, Aarhus, Denmark. He received a Ph.D. in data science from the Visvesvaraya National Institute of Technology, Nagpur, India. His major research contributions are in the domain of data science, focused mainly on time series analysis, software package development, and prediction applications in renewable energy. He holds editorial positions with the Data in Brief, Frontiers in Energy Research, Energies, and Information journals. His detailed biography and research contributions are available at https://neerajbokde.in/.