Deep Transfer Learning Strategy to Diagnose Eye-Related Conditions and Diseases: An Approach Based on Low-Quality Fundus Images

Data from the World Health Organization indicate that billion cases of visual impairment could be avoided, mainly with regular examinations. However, the absence of specialists in basic health units has resulted in a lack of accurate diagnosis of systemic or asymptomatic eye diseases, increasing the cases of blindness. In this context, the present paper proposes an ensemble of convolutional neural networks, which were submitted to a transfer learning process using 38,727 high-quality fundus images. Next, the ensemble was tested with 13,000 low-quality fundus images acquired by low-cost equipment. Thus, the proposed approach contributes to advancing the state-of-the-art in terms of: (i) validating the proposed transfer learning strategy by recognizing eye-related conditions and diseases in low-quality images; (ii) using high-quality images obtained by high-cost equipment only to train the predictive models; and (iii) reaching results comparable to the state-of-the-art, even using low-quality images. This way, the proposed approach represents a novel deep transfer learning strategy that is more suitable and feasible for public health systems of emerging and underdeveloped countries. From low-quality images, the proposed approach reached accuracies of 87.4%, 90.8%, 87.5%, and 79.1% in classifying cataract, diabetic retinopathy, excavation and blood vessels, respectively.


I. INTRODUCTION
Considering the gradual increase in life expectancy, eye complications are prevalent and the diseases related to blindness are commonly asymptomatic. According to the World Health Organization (WHO), billion vision impairment cases could be prevented with appropriate treatment and regular examinations [1]. A study from WHO [2] estimates that, by the end of 2020, 80 million people worldwide had suffered total loss of vision, almost 90% of whom live in emerging countries [3]. Moreover, emerging and underdeveloped countries have a considerable number of preventable blindness cases, which are predominant in the low-income class [4].
The associate editor coordinating the review of this manuscript and approving it for publication was Eduardo Rosa-Molinar.
In this scenario, clinical decision support (CDS) systems are promising to improve and streamline diagnostics [5], mainly when the amount of information exceeds the limit of human analysis and perception. These systems support clinicians in their decisions based on evidence from the history of diagnoses and registered treatments, mitigating clinical failures, as suggested in [6]. Additionally, in emerging and underdeveloped countries, ophthalmological pathologies are aggravated in emergency situations, where usually only a general practitioner is present and access to robust diagnostic equipment is limited in the public health system [7].
Currently, most CDS systems are based on machine learning algorithms, more specifically on convolutional neural networks (CNNs). According to the review published by [8], the increased use of CNNs to classify pathologies observed in the retina is due to their capability for pattern recognition in large image datasets and to the possibility of using pre-trained models with a transfer learning strategy. As mentioned by [9], transferring the base knowledge from a large dataset to a specific domain commonly contributes to improving the model's performance.
Following the above-mentioned context, this paper presents a CNN-based ensemble to classify eye conditions and diseases in images provided by Brazilian research institutes. The proposed approach consists of processing images of the ocular fundus and using pre-trained CNN models (VGG16 architecture). These models employ a transfer learning strategy to better adapt to different image qualities and eye conditions, namely: cataract (operable or not), referable diabetic retinopathy (present or not), excavation (abnormal or not) and blood vessels (abnormal or not).
Unlike the approaches found in the literature, the present paper contributes to advancing the state-of-the-art in terms of: (i) validating the proposed transfer learning strategy by recognizing eye-related conditions and diseases in low-quality images (produced by lower-cost equipment); (ii) using high-quality images (obtained by high-cost equipment from referral hospitals) only to train the predictive models; and (iii) reaching results comparable to the state-of-the-art, even using low-quality images. With these contributions, the proposed transfer learning strategy can improve the service of the public health system, especially when considering a realistic scenario of basic health units in emerging and underdeveloped countries.
The remainder of the paper is organized as follows. Section II provides a background review on the use of machine learning algorithms to classify eye-related diseases. Next, the proposed approach is described in Section III. The results and discussions are presented in Section IV. Finally, Section V provides the conclusions.

II. BACKGROUND REVIEW
Studies on the use of machine learning algorithms to classify eye-related diseases from images gained notoriety in the last decade [10], [11], [12], [13], [14], [15], [16]. Next, the approaches that advance the state-of-the-art are presented in chronological order, focusing on those that consider cataract, diabetic retinopathy (DR), excavation, and blood vessels, since they are the diseases/conditions treated in the present paper.

A. CATARACT
Cataract is a common eye disease that causes vision distortion and, if not operated on, blindness. Accordingly, several studies have been proposed to determine the presence of cataract and its grade. An automatic feature learning method was proposed by [17] to grade the severity of nuclear cataracts from slit-lamp images. This task was performed through filters (using a CNN) and recursive neural networks. Another feature learning procedure was proposed by [18], where a Wavelet Transform and a sketch method were employed to extract features from fundus images. In order to identify the cataract (presence or absence) and grade it (mild, moderate or severe), a multilabel discriminator was used, which reached a maximum accuracy of 90.9% in a dataset with only 445 images.
In [19], the authors extracted features using the Wavelet Transform, sketch and texture methods. From these features, an ensemble based on support vector machines (SVM) and multilayer Perceptron (MLP) neural networks was used to classify the cataract. The average accuracy obtained was about 93.2% to identify the disease and 84.5% to grade it.
A pre-trained CNN with a transfer learning strategy was considered by [20] to extract features. In the sequence, an SVM performs the cataract classification. The authors employed public datasets and the labeling task was supported by ophthalmologists. However, this data curation process was not explained in detail. The proposed approach was able to reach an average accuracy of 92.9%.
A novel deep learning model, named CataractNet, was proposed by [21] to identify cataract in a dataset composed of 1,130 fundus images (augmented to 4,746 images). The model obtained accuracies higher than 98%, the best result reported in the literature. However, it is important to mention that, although the authors perform a comparative analysis, it cannot be considered adequate since the data are not the same.
It is worth mentioning that, specifically for cataract identification/classification, benchmarking is not a trivial task given the need for a properly curated public dataset. Even so, predictive models are expected to achieve accuracies above 90%, as also shown in [18], [19], and [20].

B. DIABETIC RETINOPATHY
DR affects the retinal blood vessels of individuals and, for this reason, several studies have addressed the classification of its grade. However, the identification of referable DR (presence or absence) is of great importance, as its early-stage detection allows preventive measures to be taken.
In [22], the authors proposed a comparison between the Naïve Bayes and SVM to classify DR. Although the Naïve Bayes achieved an accuracy of 83.4% against 64.9% obtained by the SVM, the dataset used was composed of only 300 images.
One of the most cited studies in the field was proposed by [23], where the authors trained a CNN Inception-v3 using data from EyePACS and Indian hospitals to classify DR. However, the model was validated in both EyePACS and Messidor-2 public datasets that were curated by ophthalmologists. The results presented specificity between 87.0% and 93.9%, while the sensitivity was between 96.1% and 98.5%.
Retinal images from multiethnic populations were classified by [24] using a deep learning model, which was validated with a dataset composed of 494,661 images. This model obtained sensitivity and specificity of 90.5% and 91.6%, respectively.
Another approach based on the CNN Inception-v3 was proposed by [25], where the authors also used the EyePACS and Messidor-2 datasets. They demonstrated that, without data curation, it was not possible to obtain the same classification rates as [23]. Specificity values between 71.2% and 83.6% were reached, while sensitivity values ranged between 68.7% and 92.0%.
A visualization tool and deep learning models were proposed by [26] to support decisions in terms of referable DR. The models were trained and validated using 48,116 retinal images. To visualize the meaningful features for clinicians, the authors applied a CNN-independent adaptive kernel technique based on a sliding window to crop images and produce a feature map. Good results were obtained (considering a confidence interval of 95%), reaching 96% of true positive cases.
InceptionResNet-v2 CNNs were used by [28] to determine the retina status (healthy or pathological) and to classify DR (healthy, non-referable or referable). A Gaussian filter was used to preprocess images and a margin max technique was deployed to improve the models' sensitivities. In this sense, they were able to respectively obtain sensitivity values close to 80% and 98% for pathological and healthy retinas.
From the above-presented studies, it is also notably difficult to establish an adequate comparison, since they used different datasets and evaluation metrics, and the data curation was not detailed.

C. EXCAVATION
Another form of diagnosis and staging of ocular pathologies is the analysis of the optic nerve. Typically, this kind of diagnosis aims to validate if the diameter of the optic disc, also known as excavation, is more extensive than average, denoting a possible disease, such as glaucoma.
The first glaucoma detection study reported in the literature using CNN [29] developed a six-layer architecture, in which the first four layers are convolutional and the remaining two are fully connected. Furthermore, the output of the last fully connected layer was provided to a softmax classifier trained in the pathology domain. Images from the ORIGA (650 images) and SCES (1,722 images) datasets were used. Results demonstrated an area under the receiver operating characteristic curve (AUC) of 82.3% and 86.0%, respectively.
Employing two stages, the authors of [30] firstly located and extracted the optic nerve from the retina image with a region-based CNN. In the sequence, the images were classified (healthy or glaucoma) using another deep learning model. The approach was validated in the ORIGA dataset, reaching an AUC of 87.4%.
Focusing only on glaucoma disease, the study proposed by [31] presented a transfer learning approach, combining it with data augmentation and uncertainty sampling. The authors used a dataset composed of 4,933 fundus images, which was divided into training and validation subsets. A deep learning model achieved an AUC of 99.5%, sensitivity of 98.0% and specificity of 91.0% for individuals with and without glaucoma.
A deep neural network approach was proposed by [32] to segment the optic nerve region and classify the image as pathological (glaucoma) or healthy. This model used pixel-level features obtained from the segmentation process and image-level features extracted from the images. Two public datasets were used for validation, REFUGE and DRISHTI-GS, resulting in AUCs of 97.6% and 94.7%, respectively.

D. BLOOD VESSEL
The blood vessel analysis in fundus images is of paramount importance to support the identification of eye-related and systemic diseases (e.g. hypertension and ischemic stroke). There are a considerable number of studies on the segmentation of the eyes' vascular structure, since it is commonly used as a procedure before classification [33].
A k-nearest neighbors-based approach, which used segmented images as inputs, was proposed by [34] to classify arteries or veins. The structural profile of the vein or artery, color, and textures were extracted. The DRIVE dataset was used, and the results demonstrated an average accuracy of 92.3%.
In the scope of detecting glaucoma using blood vessel segmentation, the authors of [35] developed a method for classifying the severity of the pathology. As a first stage, they extracted features from the vessel images, which were used as inputs to a hybrid model composed of an adaptive neural-fuzzy inference system (ANFIS) and an SVM. The sensitivity, specificity, and accuracy were calculated to verify the model's performance, reaching 85.7%, 92%, and 97.8%, respectively. Information about the dataset is scarce.
Aiming at classifying non-specific eye conditions that can cause blindness, the work of [36] employed an Inception-v3 that includes an attention network indicating the pixels most likely to carry discriminating information. The images were labeled in grades (from 0 to 2), where grade 0 represents greater severity. The dataset used for the model was EIARG1, with 120 images assembled with the premise of denoting the severity of ocular anomalies. However, as it is an atypical approach, only the network models created by the authors were compared; the best result was an F1-score of 0.92, obtained by the Inception-v3 model.
VOLUME 11, 2023
Using an unsupervised model, the authors of [37] proposed an approach to classify arteries and veins that only takes advantage of the local contrast between the blood vessels and the surrounding background. A graph was used to represent the vascular structure of the retina, and a multilevel threshold was then applied to the graph to properly classify the images. Next, the arteriole-venular ratio was calculated. This approach was applied to the INSPIRE and DRIVE datasets, obtaining accuracies of 96% and 80%, respectively.

III. DEEP TRANSFER LEARNING-BASED APPROACH
The deep transfer learning approach proposed in the present paper uses high-quality and low-quality fundus images produced and provided by Brazilian research institutes (São Paulo Vision Institute and Federal University of São Paulo). These images were labeled by ophthalmologists and stored centrally to be processed and used to retrain an ensemble-based model (composed of CNNs with the VGG16 architecture pre-trained on ImageNet). First, on a machine learning server, only high-quality images were used to generate the baseline model. Thereafter, the model was retrained using only low-quality images, building on the knowledge acquired by the previous model. Once generated, the model can be made available to public health units, where low-quality images would be given as inputs to the model. A general overview of the proposed approach, which could be used by other emerging and/or underdeveloped countries, is depicted in Fig. 1. On the cloud storage and machine learning server side, the images were first collected and preprocessed (subsection III-A). In the sequence, the VGG16 CNNs were retrained (from ImageNet) in the domain of high-quality fundus images (subsection III-B), and then a novel transfer learning strategy was applied by considering low-quality images produced by low-cost equipment (subsection III-C). Finally, the ensemble model composed of CNNs was tested using only low-quality fundus images and, for this purpose, some performance metrics were considered (subsection III-D).

A. DATA COLLECTION AND PROCESSING
The high-quality images used to retrain the VGG16 CNNs come from the diagnosis base of the São Paulo Vision Institute and were collected from 2018 to 2020. These images were generated by two pieces of equipment, namely Canon CR2 -Phelcom Eyer and Topcon NW100. On the other hand, the low-quality images acquired by public health units were generated using a 3Nethra Classic. All images were labeled by five ophthalmologists with great expertise in the area (working in different periods and units). Thus, the images were classified into: cataract (operable or not operable); referable DR (presence or absence); excavation (abnormal or normal); and blood vessels (abnormal or normal). Examples of these labeled images are presented in Fig. 2. Since the images were generated by different equipment in both datasets (high- and low-quality), variations in size, aspect ratio, color, focus, and quality are present. Consequently, all the images had to be standardized to a size of 299 × 299 pixels (in RGB scale), with the ocular fundus centered and cropped. In both datasets there are mis-captured images, where a large portion of the image is practically black or white (as shown in Fig. 3).
In this way, a filter was created to remove them, which converts the RGB image into grayscale and calculates the average value of the pixels ($avg_{pixels}$). Two thresholds were empirically defined ($thr_{black} = 15$ and $thr_{white} = 145$), and an image is eliminated when the following rule is true: $avg_{pixels} < thr_{black}$ or $avg_{pixels} > thr_{white}$.
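The filtering rule above can be illustrated with a minimal sketch in plain Python. It assumes that an image is available as rows of (R, G, B) tuples with channels in the range 0 to 255 and that the grayscale conversion uses the ITU-R BT.601 luma weights; the text does not state which conversion was actually employed, and the function names are hypothetical.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each channel in 0..255

THR_BLACK = 15   # lower threshold reported in the text
THR_WHITE = 145  # upper threshold reported in the text


def average_gray(image: List[List[Pixel]]) -> float:
    """Return the mean grayscale intensity of an RGB image."""
    total = 0.0
    count = 0
    for row in image:
        for r, g, b in row:
            # Assumption: BT.601 luma weights; the paper does not name the conversion.
            total += 0.299 * r + 0.587 * g + 0.114 * b
            count += 1
    if count == 0:
        raise ValueError("image has no pixels")
    return total / count


def is_miscaptured(image: List[List[Pixel]],
                   thr_black: float = THR_BLACK,
                   thr_white: float = THR_WHITE) -> bool:
    """Apply the rule: discard if avg < thr_black or avg > thr_white."""
    avg = average_gray(image)
    return avg < thr_black or avg > thr_white


if __name__ == "__main__":
    almost_black = [[(5, 5, 5)] * 4 for _ in range(4)]
    print(is_miscaptured(almost_black))  # an almost black image is flagged for removal
```

Because the rule depends on a single scalar per image, the same check could also be applied at acquisition time, before an image is stored.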
Fewer than 500 images were discarded from the high- and low-quality datasets. Thus, the high-quality dataset consists of 68,171 images (being 19,360 of cataract; 5,833 of referable DR; 3,866 of abnormal excavation; 19,488 of abnormal blood vessels; and 19,624 of normal condition). In turn, the low-quality dataset was formed by 7,850 images (being 1,474 of cataract; 986 of referable DR; 763 of abnormal excavation; 1,046 of abnormal blood vessels; and 3,581 of normal condition).

B. ENSEMBLE OF CONVOLUTIONAL NEURAL NETWORKS
The CNN VGG16 model is basically composed of convolutional, pooling and dense layers. It aims to convert the image into representations of depth [38], filtering the pixel information for a smaller mapping. As can be seen in Fig. 4, this process is initialized by the convolutional layers (which scan a matrix of pixels), followed by a max pooling layer. The weights of the convolutional and pooling layers were kept in accordance with those obtained for ImageNet, i.e., these layers were pre-trained with the premise of granting greater adaptability to the network [39]. Next, the dense layers were modified (in relation to the CNN VGG16 base model) by using fully connected layers and a global average pooling (GAP) layer. The purpose of the GAP layer is to avoid overfitting, as it averages each feature map, as opposed to a flatten layer, which reshapes the layer output into a one-dimensional vector while keeping all the original values. Therefore, the use of a GAP layer makes optional the use of [...]. Unlike the convolutional and pooling layers, the dense layers were trained in the domain referred to in this work, i.e., the high-quality fundus images. Finally, a softmax function was used to generate the output. It is important to mention that, to prevent the propagation of negative values between the network layers, the rectified linear unit (ReLU) activation function was applied to the convolutional and fully connected layers.
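To make the difference between a GAP layer and a flatten layer concrete, the following minimal sketch operates on a small stack of feature maps represented as nested Python lists. It is an illustration of the two operations only, not the network used in this work, and the function names are hypothetical.

```python
from typing import List

FeatureMap = List[List[float]]  # one channel: height x width


def global_average_pooling(feature_maps: List[FeatureMap]) -> List[float]:
    """Reduce each channel to a single value: its spatial mean."""
    pooled = []
    for channel in feature_maps:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


def flatten(feature_maps: List[FeatureMap]) -> List[float]:
    """Keep every value, concatenated into one long vector."""
    return [v for channel in feature_maps for row in channel for v in row]


if __name__ == "__main__":
    # Two 2 x 2 feature maps (channels).
    maps = [[[1.0, 2.0], [3.0, 4.0]],
            [[0.0, 0.0], [0.0, 8.0]]]
    print(len(global_average_pooling(maps)))  # one value per channel
    print(len(flatten(maps)))                 # one value per spatial position and channel
```

Assuming the standard VGG16 convolutional base and a 299 × 299 input, the last block yields a 9 × 9 × 512 output, so flattening would pass 41,472 values to the first dense layer, whereas GAP passes only 512. This reduction in the number of dense-layer weights is the usual argument for GAP as a guard against overfitting.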
The TensorFlow framework with the Keras library was adopted, using stochastic gradient descent with a learning rate of 0.01, a weight decay of $10^{-6}$ and a momentum of 0.9.
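As an illustration of how these three hyperparameters interact, the sketch below implements one common formulation of stochastic gradient descent with classical momentum and an L2-style weight decay added to the gradient. This formulation is an assumption: the text does not specify how the framework combined the terms, and the function name is hypothetical.

```python
from typing import List, Tuple


def sgd_momentum_step(weights: List[float],
                      grads: List[float],
                      velocity: List[float],
                      lr: float = 0.01,
                      momentum: float = 0.9,
                      weight_decay: float = 1e-6) -> Tuple[List[float], List[float]]:
    """Perform one parameter update and return (new_weights, new_velocity)."""
    new_weights, new_velocity = [], []
    for w, g, v in zip(weights, grads, velocity):
        # Assumption: weight decay enters as an L2 penalty gradient (weight_decay * w).
        g_total = g + weight_decay * w
        # Classical momentum: keep a decaying sum of past update directions.
        v_next = momentum * v - lr * g_total
        new_velocity.append(v_next)
        new_weights.append(w + v_next)
    return new_weights, new_velocity


if __name__ == "__main__":
    w, v = [1.0], [0.0]
    for _ in range(3):
        # A constant gradient makes the accumulated velocity grow step by step.
        w, v = sgd_momentum_step(w, [0.5], v)
        print(w[0], v[0])
```

With a constant gradient, the step size increases over successive updates until it saturates, which is the intended effect of the momentum term.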
Since a CNN was employed for each specific domain (cataract, referable DR, excavation and blood vessels), the output layer was binary. The CNNs were trained with a batch size of 32. The entire training comprised 30 epochs, with the steps per epoch defined as the number of samples divided by the batch size. For each domain, 80% of the images were used to train the CNNs, 10% to validate and the remaining 10% to test the models.
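The split proportions and the steps-per-epoch rule can be written down as a short sketch. Integer (floor) division is assumed for the number of steps, and the remainder of the split is assigned to the test subset; the text only states that the number of samples is divided by the batch size, so both choices are assumptions, as are the function names and the sample count in the example.

```python
from typing import Tuple

BATCH_SIZE = 32  # reported in the text
EPOCHS = 30      # reported in the text


def split_counts(n_images: int,
                 train_pct: int = 80,
                 val_pct: int = 10) -> Tuple[int, int, int]:
    """Return (train, validation, test) sizes for an 80/10/10 split."""
    n_train = n_images * train_pct // 100
    n_val = n_images * val_pct // 100
    n_test = n_images - n_train - n_val  # remainder goes to the test subset
    return n_train, n_val, n_test


def steps_per_epoch(n_samples: int, batch_size: int = BATCH_SIZE) -> int:
    """Number of batches per epoch, assuming floor division."""
    return n_samples // batch_size


if __name__ == "__main__":
    # Hypothetical domain with 1,000 images.
    n_train, n_val, n_test = split_counts(1000)
    print(n_train, n_val, n_test)                # 800 100 100
    print(steps_per_epoch(n_train))              # 25
    print(steps_per_epoch(n_train) * EPOCHS)     # 750 weight updates in total
```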

C. TRANSFER LEARNING
Transfer learning is used to improve learning from one domain by transferring information from a related domain. As exemplified in [40], consider two people who want to learn to play piano. One person has no previous experience playing music, and the other person has extensive musical knowledge of playing the guitar. The person with a musical background will be able to learn piano more efficiently, transferring previously learned musical knowledge to the task of learning to play the piano.
According to [41], there are three categories of transfer learning, namely transductive, inductive, and unsupervised. Transductive learning represents the scenario in which all the knowledge comes only from the source domain.
Inductive learning refers to a situation where the label information of the target domain is available. When the label information is unknown for both the source and target domains, the transfer learning is categorized as unsupervised.
This paper starts from the premise that low-quality images: (i) have a lower pixel density, which may result in blurry or pixelated images; (ii) may have poor contrast, which can make it challenging to distinguish between different structures in the eye; and (iii) may have significant distortion, which can result in inaccurate representation of the shape and size of various structures in the eye. Hence, the transfer learning is conceived as taking place from one variety of images to another, and not as a permutation of images that compose the same set. An example of the differences between the two groups of images can be seen in Fig. 5. By using the transfer learning procedure on CNNs, it is possible to load the weights from the previous domain into a network and unlock only the last trainable layers (arbitrarily chosen) or unlock all layers for retraining in the new domain. Among all the tests carried out to find the best transfer learning process, the one that unlocks all the trainable layers reached the best results. In summary, we used a CNN model pre-trained on ImageNet and high-quality fundus images, and retrained it with a low learning rate (0.0001) to classify low-quality fundus images.
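The two training stages described in this section can be summarized as a small configuration sketch. It models layers with a hypothetical `Layer` record instead of a real framework object, and the layer list is a heavily shortened stand-in for the VGG16-based network; only the freezing pattern and the two learning rates are taken from the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Layer:
    name: str
    kind: str               # "conv", "pool", "gap" or "dense"
    trainable: bool = True


def build_layers() -> List[Layer]:
    # Hypothetical, shortened stand-in for the VGG16-based network.
    return [
        Layer("block1_conv", "conv"),
        Layer("block1_pool", "pool"),
        Layer("gap", "gap"),
        Layer("fc1", "dense"),
        Layer("predictions", "dense"),
    ]


def configure_stage(layers: List[Layer], stage: str) -> float:
    """Set the trainable flags for a stage and return its learning rate."""
    if stage == "high_quality":
        # Stage 1: ImageNet weights are kept; only the dense layers are trained.
        for layer in layers:
            layer.trainable = layer.kind == "dense"
        return 0.01
    if stage == "low_quality":
        # Stage 2: all layers are unlocked and retrained with a low learning rate.
        for layer in layers:
            layer.trainable = True
        return 0.0001
    raise ValueError(f"unknown stage: {stage}")


if __name__ == "__main__":
    layers = build_layers()
    for stage in ("high_quality", "low_quality"):
        lr = configure_stage(layers, stage)
        print(stage, lr, [l.name for l in layers if l.trainable])
```

The low learning rate in the second stage limits how far the unlocked weights can drift from what was learned on the high-quality images, which is the usual motivation for this kind of fine-tuning schedule.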

D. PERFORMANCE EVALUATION METRICS
To evaluate the performance of each CNN, metrics regularly seen in similar works were chosen, namely sensitivity (sens), specificity (spec), and accuracy (acc). These metrics are respectively given by their standard definitions:
$$sens = \frac{TP}{TP + FN}$$
$$spec = \frac{TN}{TN + FP}$$
$$acc = \frac{TP + TN}{TP + TN + FP + FN}$$
where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.
Next, the results were analyzed and discussed to demonstrate the robustness and effectiveness of the proposed deep transfer learning strategy against the state-of-the-art papers.
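As a worked example of the three metrics, the sketch below computes sensitivity, specificity, and accuracy from binary labels using their standard confusion-matrix definitions. The label vectors are made up for illustration and the function names are hypothetical.

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int],
                     y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (TP, TN, FP, FN) for binary labels, where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def sensitivity(tp: int, fn: int) -> float:
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    return tn / (tn + fp)


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    return (tp + tn) / (tp + tn + fp + fn)


if __name__ == "__main__":
    # Made-up labels: 3 pathological (1) and 5 normal (0) images.
    y_true = [1, 1, 1, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 0, 0, 0, 0, 1, 0]
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    print(sensitivity(tp, fn), specificity(tn, fp), accuracy(tp, tn, fp, fn))
```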

IV. RESULTS AND DISCUSSIONS
This section divides the results into: (A) proprietary dataset, for which a comparison is not applicable; and (B) public datasets, in which the proposed approach was compared with similar works. In both scenarios (public or proprietary datasets), the proposed approach was trained for each eye condition separately, i.e., each model used its own images of altered retina, while the images labeled as normal condition were used by all the models.

A. PROPRIETARY DATASET
Focusing on the proprietary dataset, four models, representing each previously addressed eye condition, were trained using a high-quality dataset. Next, these models were retrained using a low-quality dataset denoting the transfer learning method. A comparison was made by training another four models using only the low-quality dataset. These results are presented in Table 1. Given the fact that this research proposes an approach capable of assisting decision-making by ophthalmologists, it is possible to state that transferring the knowledge from one kind of dataset to another (high-to low-quality) shows a considerable improvement in the models' performance.
Although the proposed approach has achieved an average accuracy that exceeds 86%, it is difficult to compare it to the state-of-the-art due to the use of private data. The same issue occurs when analyzing the literature, since some of the works did not employ public datasets [23], [42], [43], [44], [45]. Additionally, there is no consensus regarding the evaluation metrics. For these reasons, in the sequence, the proposed approach was tested on public datasets of cataract, DR and glaucoma. Thus, the results could be adequately compared with those of other works.

B. PUBLIC DATASETS
In order to verify the assertiveness of the model on public data, we have used datasets commonly found in literature to analyze DR, cataract and glaucoma. Two DR datasets were considered to perform the comparisons (Messidor-2 [46] and EyePACS [47]). In addition, the ODIR [48] dataset was employed to classify cataract and the REFUGE [49] dataset was used for glaucoma.
It is worth noting that the use of public datasets is, in essence, of paramount importance to define benchmarks for predictive models. However, many of these studies curate the data and, therefore, change the class label of some instances. In this sense, it is difficult to establish an adequate benchmark, since such changes are only mentioned (not detailed) in the papers.

1) DIABETIC RETINOPATHY
As previously mentioned, Messidor-2 and EyePACS were considered to evaluate the DR domain. Messidor-2 has 1,748 images, while EyePACS consists of 9,963 images. Both had their classes divided into referable and non-referable DR.
A meta-analysis of deep learning-based approaches for DR was made by [50] with the intention of listing their types and performances. In this sense, Table 2 shows the comparison between the present paper and other works with the same scope. It is important to emphasize that the works condensed in this comparison use the same data labeling, without a specialized curation made by other professionals, as the curation process is subjective and difficult to standardize.

2) CATARACT
In the cataract context, the ODIR dataset, which is composed of 1,400 images, was chosen. The state-of-the-art papers obtained accuracies higher than 0.975, and our result, presented in Table 3, indicates that the proposed model is suitable for aiding cataract classification.

3) GLAUCOMA
In terms of glaucoma, it is common to find works on segmentation of the optic nerve. However, in addition to the ground-truth values of the exact segmentation of excavations, the REFUGE dataset [49] also provides the classification of retinas as healthy or pathological (in this case, with glaucoma). The dataset consists of 1,200 annotated retinographies, of which 121 correspond to eyes with glaucoma. The retinographies in this dataset are centered on the macula and have a size of 1,634 × 1,634 or 2,124 × 2,056 pixels. Again, the predictive model proposed in this paper proved to be adequate for the image classification process, as demonstrated in Table 4.
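Since only 121 of the 1,200 REFUGE retinographies correspond to glaucoma, accuracy values on this dataset are best read together with sensitivity and specificity. The short calculation below, which uses only the counts quoted above and a hypothetical helper function, gives the accuracy that a trivial classifier would obtain by always predicting the majority class.

```python
def majority_baseline_accuracy(n_total: int, n_positive: int) -> float:
    """Accuracy obtained by always predicting the most frequent class."""
    n_negative = n_total - n_positive
    return max(n_positive, n_negative) / n_total


if __name__ == "__main__":
    baseline = majority_baseline_accuracy(1200, 121)
    print(round(baseline, 4))  # 0.8992, i.e., about 89.9%
```

Such a baseline has a sensitivity of zero, which is why the comparison on imbalanced datasets should not rely on accuracy alone.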

C. GENERAL ANALYSIS
The results obtained demonstrate the robustness and adaptability of the proposed approach not only for the proprietary dataset, but also for the public datasets. Therefore, these results endorse the possibility of canonical models that, together with filtering techniques, can effectively support decision making in multiple domains.
Based on the results of multiple CNN models, the proposed pipeline made it possible to classify fundus images and, consequently, rank the probabilities of occurrence of pathologies that support the analysis and inference of ophthalmologists.
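A minimal sketch of such a ranking is given below. It assumes that each binary model returns the probability of its positive class for the same image; the probability values and the condition keys are made up for illustration.

```python
from typing import Dict, List, Tuple


def rank_conditions(probabilities: Dict[str, float]) -> List[Tuple[str, float]]:
    """Sort the conditions from the most to the least probable."""
    return sorted(probabilities.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical outputs of the four binary models for one fundus image.
    outputs = {
        "cataract": 0.12,
        "referable_dr": 0.81,
        "excavation": 0.40,
        "blood_vessels": 0.05,
    }
    for name, prob in rank_conditions(outputs):
        print(f"{name}: {prob:.2f}")
```

Because the four models are trained independently, their outputs need not sum to one, so the ranking indicates which findings deserve attention first rather than a single exclusive diagnosis.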
As shown in the related literature, the data curation process (labeling) contributes to improving the performance of predictive models. However, in this research, it could be noticed that the data processing stage (presented in subsection III-A) is easy to use and can be employed to validate images before they are stored in a database, since these images must be within previously defined thresholds.

V. CONCLUSION
It was found that the issues of scarcity of data in basic health units, as well as the use of equipment that is only able to produce low-quality images, can be addressed through a novel transfer learning strategy. In comparison with related studies, to the best of our knowledge, this is the first time that a machine learning-based approach has been proposed to classify four eye diseases/conditions while, at the same time, innovating with a transfer learning strategy.
The proposed approach was able to produce results comparable to the state-of-the-art while using only low-quality images in the validation process. In this sense, this approach contributes to better guiding decisions in a realistic scenario of emerging/underdeveloped countries and, consequently, to preventing visual impairment or blindness.
It is also worth noting that by condensing several domains into a single approach and considering common metrics in the field of medicine, this approach also contributes to its use as a support system for decision-making.
As future advances in this research, we seek to increase the domains of eye conditions or diseases analyzed, as well as test other deep transfer learning strategies in the public datasets. Additionally, one avenue for future exploration is the transfer of knowledge from one condition to another, rather than solely from one type of image to another. This could potentially reduce the need for large amounts of labeled data for each individual condition.