CataractNet: An Automated Cataract Detection System Using Deep Learning for Fundus Images

Cataract is one of the most common eye disorders that causes vision distortion. Accurate and timely detection of cataracts is the best way to control the risk and avoid blindness. Recently, artificial intelligence-based cataract detection systems have received research attention. In this paper, a novel deep neural network, namely CataractNet, is proposed for automatic cataract detection in fundus images. The loss and activation functions are tuned to train the network with small kernels and fewer training parameters and layers. Thus, the computational cost and average running time of CataractNet are significantly reduced compared to other pre-trained Convolutional Neural Network (CNN) models. The proposed network is optimized with the Adam optimizer. A total of 1130 cataract and non-cataract fundus images are collected and augmented to 4746 images to train the model. To avoid over-fitting, the dataset is extended through augmentation before model training. Experimental results show that the proposed method outperforms the state-of-the-art cataract detection approaches with an average accuracy of 99.13%.


I. INTRODUCTION
Cataract is a lenticular opacity clouding the transparent lens in human eyes. Typically, the lens converges the light to the retina. The presence of a cataract blocks this light so that it does not reach the retina, which results in poor visual acuity. It is a worldwide leading eye disease that develops gradually and does not affect sight early. However, after a while, it can interfere with vision and even cause vision loss in people over age 40 [1]. Cataract detection in earlier stages may avoid painful and costly surgeries and prevent blindness depending on its severity [2]. The World Health Organization (WHO) [3] reported that about 285 million people in the world have a visual impairment. Among them, 39 million people have limited vision, and the remaining ones have impaired vision. Cataract was responsible for 33% of visual impairment and 51% of blindness [4]. In 2020, Flaxman et al. [5] predicted that the number of people suffering from moderate to severe vision impairment (MSVI) and blindness would be 237.1 and 38.5 million, respectively. Of them, 57.1 million (24%) and 13.4 million (35%) people would be affected by cataract. Worldwide blindness will exceed 40 million by 2025 [6].
The associate editor coordinating the review of this manuscript and approving it for publication was Junhua Li.
A comparison of these reports shows that there was only a slight improvement in the eye care system and in controlling vision loss during the last decade. Among the leading causes of blindness such as glaucoma [7], corneal opacity, trachoma, and diabetic retinopathy [8], cataract accounts for the most significant proportion. It is considered one of the leading causes of blindness [5]. Cataract can be categorized into three main groups based on the location and area where it develops: Nuclear Cataract [9], Cortical Cataract [10], and Posterior Sub Capsular (PSC) Cataract [11]. These three types of cataracts occur due to several common factors such as aging, diabetes, and smoking [5].
Early detection of cataracts plays a vital role in the treatment and can significantly reduce the risk of blindness. Providing an automatic system for cataract detection is a challenging issue for three reasons: (i) the vast spectrum of cataract lesions and human eye tones, (ii) the scale, form, and location of cataracts, and (iii) the age, gender, and eye type dependence. In recent years, automatic cataract detection based on different imaging modalities has been investigated. Generally, automatic cataract detection and classification systems utilize four types of images: slit-lamp, retro-illumination, ultrasound, or fundus images. Among these imaging modalities, fundus images have attracted significant attention in this field as technologists or even patients themselves can easily employ the fundus camera [12]. In contrast, slit-lamp cameras need to be operated by well-experienced ophthalmologists. Consequently, the lack of professional ophthalmologists, especially in underdeveloped countries, results in untimely treatments [4]. Thus, to simplify the process for early cataract screening, an automatic cataract detection system based on fundus images is highly required.
Artificial intelligence-based systems for cataract detection are mostly based on global features (e.g., discrete cosine transformation (DCT) [13]), local features (e.g., local standard deviation [14]), and deep features (e.g., deep CNNs, which have achieved higher accuracy). Although numerous deep-learning-based automatic cataract detection systems have been reported in the literature, they still suffer from limitations such as low detection accuracy and a high number of model parameters, which makes them computationally expensive.
VOLUME 9, 2021. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
A fundus image-based automatic cataract detection system is proposed in this paper to overcome the limitations mentioned above. It classifies the patients into two groups: cataract or non-cataract conditions. The novelties of the proposed method are as follows: 1) reducing the number of parameters in the model such as layers and weights, thus reducing the computational cost and time, and 2) increasing the detection accuracy based on the proposed deep neural network structure. Thus, the proposed preliminary cataract detection can be used for mass screening and cataract grading. The main contributions of this article are as follows:
• A cataract dataset is collected, reorganized, and preprocessed from different standard datasets of fundus images published in the last two decades, i.e., the high-resolution fundus (HRF) [15] image archive, fundus image registration (FIRE) [16] dataset, ACHIKO-I fundus image dataset [17], Indian diabetic retinopathy image dataset (IDRiD) [18], color fundus image database [19], and digital retinal images for vessel extraction (DRIVE) database [20]. Then, it is extended to a considerable number of images through a data augmentation process.
• To detect cataract, a new 16-layer deep learning neural network, i.e., CataractNet, is proposed. The number of layers and the activation and loss functions are tuned to improve the detection accuracy significantly.
The remaining part of this paper is organized as follows: Section II briefly discusses the related works reported in recent years. Section III explains the proposed CataractNet for automated cataract detection. The experimental setup is presented in section IV. The experimental results are discussed in section V. Finally, we conclude this work with future work directions in section VI.

II. RELATED WORKS
The state-of-the-art automatic cataract detection systems consist of three steps: pre-processing, feature extraction, and classification [21]. These methods are categorized into two groups based on the algorithms used in either the feature extraction or the classification stage: Machine Learning (ML)-based and Deep Learning (DL)-based methods. These methods have been discussed in recent reports [21]-[24]. In this section, we briefly discuss some leading works in both groups.

A. MACHINE LEARNING-BASED METHODS
Gao et al. [25] proposed a computer-aided cataract detection system used for mass screening or as preprocessing for cataract grading. An enhanced texture feature was presented and used to train the linear discriminant analysis (LDA). The experimental results on a clinical database demonstrated an accuracy of 84.8%. Yang et al. [26] proposed an automatic cataract detection method that was performed in three steps. A top-bottom hat transformation was utilized to improve the contrast between the foreground and background. The luminance and texture were considered as features. The classifier was constructed using a backpropagation neural network (BPNN) to classify the cataract severity into mild, medium, or severe stages.
Guo et al. [27] presented a computer-aided cataract classification based on fundus images. The feature extraction step was carried out using wavelet transform and sketch-based methods. Then, a multiclass discriminant analysis algorithm was used for cataract detection and grading, where the correct classification rates (CCRs) were respectively 90.9% and 77.1% for wavelet transform-based feature extraction and 86.1% and 74.0% for sketch-based feature extraction. In [28], Fuadah et al. used K-Nearest Neighbor (KNN) as the classifier and an optimal combination of texture features, i.e., dissimilarity, contrast, and uniformity. This system was implemented for smartphones and obtained a high accuracy of 97.5%. A competitive cataract detection system was proposed in [29] based on statistical texture analysis and KNN. The Gray-Level Co-occurrence Matrix (GLCM) was applied for texture feature extraction. The testing set was classified into normal or cataract conditions using KNN and achieved an average accuracy of 94.5%. As the pupil area was cropped and extracted manually by users/experts in the processing step, it could not be considered a fully automatic system. However, this system operated on eye images taken by a standard camera without using any slit-lamp or fundus cameras.
Yang et al. [4] proposed an ensemble learning-based approach for cataract detection and grading. Three independent feature sets were extracted, and two learning models were formed for each group. The image classification was achieved by combining the multiple base learning models through ensemble methods, whose CCRs were 93.2% and 84.5% for cataract detection and grading, respectively. Caixinha et al. [30] provided an in-vivo automatic Nuclear Cataract detection and classification system by applying machine learning and ultrasound techniques. It extracted 27 features in the frequency and time domains, and support vector machine (SVM), Bayes, multilayer perceptron, and random forest classifiers were investigated in the classification phase. Although the methods based on ultrasound images achieve high accuracy in cataract screening, the imaging techniques are expensive and complicated to operate. In [31], the fundus images were classified as cataract images using an SVM classifier, and then an RBF Network graded their severity with a specificity of 93.33%. Rana and Galib [1] proposed a mobile application for smartphones (Android, iOS, Windows) that enabled the public to carry out self-screening cataract detection. They considered the texture information and reported 85% efficiency.
Jagadale et al. [32] utilized the Hough circle detection transform for detecting the center of the lens and its radius. Then, the statistical features were extracted and used in an SVM classifier for cataract detection with an accuracy of 90.25%. Sigit et al. [33] presented an Android smartphone-based method for cataract detection. The classification was carried out by a single-layer perceptron method with an accuracy of 85%. Recently, a hierarchical feature extraction-based method was presented in [6] for cataract grading. The four-class classification problem of cataract severity grading was transformed into three adjacent two-class classifications. These were carried out with three individual neural networks before integration. This system achieved accuracies of 94.83% and 85.98% for cataract detection and grading, respectively.

B. DEEP LEARNING-BASED METHODS
Deep learning-based methods can learn the essential features and then integrate the feature learning steps into the model building process to decrease the incompleteness of manually designed features and use them in different medical imaging modalities [34], [35]. Gao et al. [36] investigated a deep learning-based method for grading the severity of Nuclear Cataracts from slit-lamp images. Local filters were obtained by clustering the image patches fed into a convolutional neural network (CNN). Then a set of recursive neural networks (RNNs) was used to extract higher-order features. The cataract grading was performed using support vector regression. Zhang et al. [37] proposed a Deep CNN (DCNN) for cataract detection and grading that used the feature maps from the pooling layers of the architecture. This method was time-efficient and achieved 93.52% and 86.69% accuracies in cataract detection and grading, respectively.
Ran et al. [38] proposed a method for six-level cataract grading based on a combination of DCNN and Random Forests (RF). Three modules formed the proposed DCNN for extracting features at different levels on fundus images. On the other hand, a feature dataset was created by DCNN and used by RF to implement more intricate six-level cataract grading. On average, this method achieved an accuracy of 90.69%. This six-level grading system could help the specialists to understand the patients' condition more precisely. Pratap and Kokil [39] presented a computer-aided method for detecting cataract severity from normal to severe based on fundus images. This method utilized a pre-trained CNN as transfer learning for automatic cataract classification. The final classification was carried out using feature extraction, and an SVM classifier whose four-stage CCR was 92.91%. Jun et al. [40] proposed a cataract grading system based on Tournament based Ranking CNN composed of tournament structure and binary CNN models.
Hossain et al. [41] proposed an automatic cataract detection system using DCNNs and a trained classifier model based on ResNet, whose accuracy was 95.77%. Recently, Zhang et al. [42] have provided an attention-based Multi-Model Ensemble method for automatic cataract detection on ultrasound images, which, to the best knowledge of the authors, obtained the highest accuracy (97.5%) among the deep learning-based approaches in the literature. In this method, the whole system was composed of three main parts: an object detection network, three classification networks, and a model ensemble module. The performance was still low but satisfactory, especially in cases of inadequate training samples. However, the main limitation of this method was that it evaluated the cataract degree based on the blurriness of the retinal images, which can be caused by cataract as well as by other eye diseases such as corneal edema and diabetes mellitus. Hence, different types of eye diseases may not be distinguished by this method. A similar accuracy (97.47%) has been achieved recently by Khan et al. [43] for fundus images based on the VGG-19 model with a transfer learning approach on a recently published dataset on Kaggle [44]. In another recent work by Pratap and Kokil [45], cataract diagnosis has been investigated under a noisy environment. A pre-trained CNN was applied for feature extraction, followed by a set of locally and globally trained independent support vector networks. The obtained results demonstrated its robustness against noise. It was the first work that investigated the robustness of cataract detection systems.
It was observed that many works have been based on conventional machine learning methods, while only a few works have been reported on cataract detection and grading using deep learning methods. Therefore, there are still several challenges to deal with, such as improving the accuracy of the models while minimizing their complexity by reducing the number of training parameters, layers, depth, running time, and the overall model size.

III. PROPOSED CataractNet ARCHITECTURE
A Convolutional Neural Network (CNN) is a deep neural network that acquires a complex hierarchy of features through convolutional and pooling layers and non-linear activation functions [46], [47]. In deep learning-based methods, the feature extraction phase and the classification process are integrated, whereas these two steps are separate in manual feature extraction methods. A novel deep learning model, namely CataractNet, is proposed to address the limitation of manual feature extraction and reduce the computational cost. The layers of CataractNet are summarized in Table 1.
Half of these layers are placed in four blocks (each block is composed of two layers), and the rest are used for classification. In the first block, the inputs are RGB (i.e., 3-input-channel) images with a size of 224 × 224, and 32 filters with a kernel size (KS) of 3 × 3 and "valid" padding are applied. Then, a Max-Pooling (MP) layer with a stride of 2 is applied to reduce the spatial size of the data representation (width and height). It mainly reduces the image size, since a higher number of pixels corresponds to more parameters, which in turn require vast amounts of data. Finally, the block is activated by the ReLU activation function, which sets the negative values of the matrix to 0 while leaving the positive values unchanged. The same block with the same parameter values is used as the second block. Next, a similar block is used as the third block, but this time with 64 filters. In the fourth block, the number of filters is increased to 128. Outputs of all four blocks are combined into a feature map fed into the fully connected layers.
These layers, namely flatten, dense, and dropout layers, are designed for cataract detection. Three sets of dense and dropout layers are constructed, in which the dense layers have 64, 128, and 256 neurons to collect the filtered cataract characteristics. Furthermore, the three dropout layers use rates of 0.4, 0.4, and 0.5 to prevent the model from overfitting by setting 40%, 40%, and 50% of the neurons in the hidden layers to 0 at each update of the training phase. As cataract detection is a binary classification, the sigmoid activation function is used in the last dense layer, which is given as
$$\sigma(x) = \frac{1}{1 + e^{-x}},$$
where x denotes the input of the sigmoid function, which is used as the final activation function. The total number of classes in the sigmoid layer is N, and each class represents one neuron. There are two main classes in our system: people with cataracts and normal eye conditions.
In binary classification, the CNN architecture generates output at two neurons. Given a cataract image, the contribution of the first and the second neurons would be 1 or 0, or vice versa.
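The feature-map sizes implied by the four blocks can be traced with a short calculation. This is a minimal sketch rather than the authors' implementation: it assumes a 2 × 2 pooling window (the text only states a stride of 2) and that each block is one 3 × 3 "valid" convolution followed by one max-pooling layer. The function names are illustrative.

```python
def conv_valid(size, kernel=3):
    """Spatial size after a 'valid' (no padding) convolution with stride 1."""
    return size - kernel + 1


def max_pool(size, pool=2, stride=2):
    """Spatial size after max pooling (the pool window of 2 is an assumption)."""
    return (size - pool) // stride + 1


def trace_blocks(input_size=224, filters=(32, 32, 64, 128)):
    """Return (height, width, channels) after each convolution + pooling block."""
    shapes = []
    size = input_size
    for n_filters in filters:
        size = max_pool(conv_valid(size))
        shapes.append((size, size, n_filters))
    return shapes


shapes = trace_blocks()
h, w, c = shapes[-1]
flattened = h * w * c  # length of the vector passed on to the dense layers
for i, shape in enumerate(shapes, start=1):
    print(f"block {i}: {shape}")
print("flattened features:", flattened)
```

Under these assumptions, the four blocks shrink a 224 × 224 input to a 12 × 12 × 128 feature map before flattening.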
To investigate the effect of the number of blocks on the classification accuracy, and thus on cataract detection, three different models based on 3, 4, and 5 blocks with three different sets of filters, i.e., (16,32,64), (32,32,64,128), and (32,32,64,96,128), respectively, are developed and evaluated on the dataset. Table 2 presents the accuracy achieved by these models. The model's efficiency improves when the number of blocks is increased to 4. On the other hand, the efficiency is reduced with 5 blocks. Therefore, the model with 4 blocks and (32, 32, 64, 128) filters outperforms the others.
To explain the reason behind selecting the specific number of blocks and convolutional layers, we go deeper into the effects of increasing the number of layers and filters on the extracted features. Patterns such as edges, dots, and corners are extracted in the first layer. Then, the following layers build more extensive patterns such as squares and circles by combining the patterns extracted by the previous layer. In our method, the required features of the cataract are extracted with 4 convolutional layers. Adding more layers does not necessarily result in higher accuracy. Increasing the number of layers helps to extract more features. However, beyond a certain point, it leads to overfitting and false positives instead of improving accuracy. This happened with CataractNet when the number of layers reached 5. Therefore, the optimal number of blocks for CataractNet is determined as 4.
As is common in binary classification, binary cross-entropy is used as the loss function:
$$\text{Cross-entropy} = -\left(i \log(p) + (1 - i)\log(1 - p)\right),$$
where i is the binary class indicator (0 or 1), log is the natural logarithm, and p is the predicted probability.
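For a single sample, the loss can be evaluated directly from the sigmoid output. The following is a minimal sketch; the clipping constant `eps` and the raw score value are illustrative assumptions that do not come from the paper.

```python
import math


def sigmoid(x):
    """Logistic sigmoid, mapping a real-valued score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def binary_cross_entropy(label, p, eps=1e-12):
    """Loss for one sample: label is 0 or 1, p is the predicted probability."""
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0); eps is an assumption
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


score = 2.0                          # hypothetical raw output of the last neuron
p = sigmoid(score)                   # predicted probability of the positive class
print(binary_cross_entropy(1, p))    # lower loss: the prediction agrees with the label
print(binary_cross_entropy(0, p))    # higher loss: the prediction contradicts the label
```

The loss grows without bound as a confident prediction moves away from the true label, which is why the probability is clipped before taking the logarithm.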

IV. EXPERIMENTAL SETUP

A. DATASET
Employing a suitable dataset with many samples is critical in improving validation and training in deep learning-based classification. In this research, the high-resolution fundus (HRF) [15] image archive is used to collect all of the cataract retinal images. Additionally, some more images are employed from other datasets, i.e., fundus image registration (FIRE) [16] dataset, ACHIKO-I fundus image dataset [17], Indian diabetic retinopathy image dataset (IDRiD) [18], color fundus image database [19] and digital retinal images for vessel extraction (DRIVE) database [20], so that the total number of images in the classification phase is 1130, out of which 904 (80%) images are utilized for training and validation, and the remaining 226 (20%) images are used for testing. The model learns the patterns in the training process, while in the validation process, the weights are normalized. During the testing period, the model is evaluated for getting the accuracy and the loss.
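The 80/20 partition can be reproduced in a few lines. This sketch assumes a simple shuffled split; the paper does not state whether the split was random or stratified, and the file names and seed below are placeholders.

```python
import random


def split_dataset(items, test_fraction=0.2, seed=0):
    """Shuffle items and split them into (train_val, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


image_ids = [f"img_{i:04d}" for i in range(1130)]  # placeholder file names
train_val, test = split_dataset(image_ids)
print(len(train_val), len(test))
```

Fixing the seed keeps the test set identical across runs, so that different models are compared on the same held-out images.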

B. PRE-PROCESSING
Since the dataset images are collected from various sources, their sizes are neither identical nor appropriate for the classification task. Consequently, the images are resized to a unified format of 224 × 224 pixels. For the RGB images, the intensities of the three channels are normalized to a range between 0 and 1. Image normalization is one of the essential steps before training the networks: it ensures that each input pixel has a similar distribution, which results in faster convergence in training. The formula for normalization is given as [48]:
$$N = \frac{p - MinV}{MaxV - MinV},$$
where p and N refer to the original intensity (between 0 and 255) and the normalized intensity (between 0 and 1) of the cataract images, respectively, and MaxV and MinV denote the maximum and minimum intensities of the original images, respectively.
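Min-max normalization maps each intensity p to N = (p - MinV) / (MaxV - MinV). A minimal sketch for one row of pixels follows; it assumes 8-bit images, so MinV = 0 and MaxV = 255 by default, and the sample values are hypothetical.

```python
def min_max_normalize(pixels, min_v=0, max_v=255):
    """Map intensities from [min_v, max_v] to [0, 1]."""
    span = max_v - min_v
    if span == 0:
        raise ValueError("max_v and min_v must differ")
    return [(p - min_v) / span for p in pixels]


row = [0, 64, 128, 255]              # hypothetical 8-bit intensities of one channel
normalized = min_max_normalize(row)
print(normalized)
```

With the default bounds this reduces to dividing each intensity by 255.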

C. DATA AUGMENTATION
The lack of an extensive medical image dataset for training is a challenging issue that hinders further improvement in deep learning [49]. Hence, to deal with the insufficiency of the dataset, data augmentation is applied to the training samples through four geometric transformations: re-scaling, rotation (30 degrees randomly to the right or left), zooming, and horizontal flip. This results in 3616 additional training images (4 times the original number of training images) and prevents overfitting in the network. The numbers of testing and training images in the cataract and non-cataract classes are presented in detail in Table 3. The total numbers of non-cataract and cataract images are 2067 and 2679, respectively.
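The image counts follow from adding four augmented variants per training image. The sketch below checks that arithmetic and illustrates one of the four transformations (horizontal flip) on a toy image stored as nested lists. It is an illustration only; the helper names are hypothetical and the other three transformations are omitted.

```python
def horizontal_flip(image):
    """Mirror an image given as a list of rows (each row a list of pixels)."""
    return [list(reversed(row)) for row in image]


def augmented_total(n_train, n_test, copies_per_image=4):
    """Return (added images, total images) after augmenting the training set."""
    n_added = n_train * copies_per_image
    return n_added, n_train + n_test + n_added


tiny = [[1, 2, 3],
        [4, 5, 6]]                   # toy 2 x 3 single-channel image
print(horizontal_flip(tiny))

added, total = augmented_total(n_train=904, n_test=226)
print(added, total)
```

The total agrees with the 4746 images reported for the augmented dataset.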

D. IMPLEMENTATION DETAILS
All the experiments are carried out on a computer with the following properties: a Core i9-10850K CPU at 3.60 GHz, 64 GB RAM, and an NVIDIA GeForce RTX 2080 Super GPU. Image pre-processing, augmentation, and the CNN-based model are all implemented in Python using Keras and TensorFlow. The proposed CataractNet is optimized by the Adam optimizer with a learning rate of 0.0001, and the results confirm that combining it with a batch size of 32 works satisfactorily during CataractNet training.
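To make the role of the learning rate concrete, a single Adam update for one scalar weight can be written out by hand. This is a sketch of the standard Adam rule, not of the Keras internals; only the learning rate of 0.0001 comes from the text, while `beta1`, `beta2`, and `eps` are commonly used defaults assumed here, and the gradient value is hypothetical.

```python
import math


def adam_step(param, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam update for a scalar parameter; returns (param, m, v)."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction for step t
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


w, m, v = 1.0, 0.0, 0.0
w, m, v = adam_step(w, grad=2.0, m=m, v=v, t=1)  # hypothetical gradient
print(w)
```

On the first step the update size is close to the learning rate regardless of the gradient's magnitude, because the bias-corrected moments normalize the gradient.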

E. EVALUATION CRITERIA
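Accuracy alone is not sufficient for assessing the effectiveness of the model [55]. In addition to accuracy, various evaluation metrics such as Recall/Sensitivity, Precision, Specificity, F1-Score, and Matthews Correlation Coefficient (MCC) are employed to evaluate our model and five pre-trained deep learning models, as
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},$$
$$\text{Precision} = \frac{TP}{TP + FP},$$
$$\text{Recall} = \frac{TP}{TP + FN},$$
$$\text{Specificity} = \frac{TN}{TN + FP},$$
$$\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}},$$
$$\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}},$$
where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively, i.e., the results of the performance verification. Accuracy is the fraction of all predictions that are correct. Precision is the fraction of predicted cataract cases that are truly cataract. Recall is the fraction of actual cataract cases that are detected. The F1-score is the harmonic mean of precision and recall.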
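All six metrics can be derived from the four confusion-matrix counts using their standard definitions. The sketch below uses hypothetical counts for a 226-image test set; they are not taken from the paper's tables, and the function does not guard against empty classes.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Compute the evaluation metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)              # also called sensitivity
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical counts (120 + 104 + 1 + 1 = 226), for illustration only.
scores = binary_metrics(tp=120, tn=104, fp=1, fn=1)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

Unlike accuracy, MCC uses all four counts in both numerator and denominator, so it stays informative when the two classes are imbalanced.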

V. RESULTS AND DISCUSSIONS
In this section, we discuss the performance of the proposed CataractNet. Our model and five other pre-trained models are tested for cataract detection using the same dataset. Experimental results are compared with state-of-the-art cataract detection methods.

A. PERFORMANCE OF CataractNet
We employ five pre-trained CNN models (with high classification results) to evaluate the performance of the proposed CataractNet on the same dataset. These models are (i) MobileNet [51], (ii) VGG-16 [52], (iii) VGG-19 [53], (iv) ResNet-50 [54], and (v) Inception-v3 [50]. We split the dataset into training and testing sets as 90-10%, 80-20%, 70-30%, 60-40%, and 50-50%. The performance of our CataractNet is compared with these pre-trained models in Table 4 in terms of Accuracy, Precision, Recall (Sensitivity), Specificity, F1-Score, and MCC. We mark the best performance in bold. Our proposed CataractNet ranks first and achieves the highest performance in all evaluation metrics for the 80-20% and 70-30% dataset splitting conditions. CataractNet achieves an accuracy of 99.13%, which outperforms the other pre-trained models for the 80-20% dataset split. The model parameters of these networks, e.g., model size, trainable parameters, total layers, depth, and average running time, are also compared in Table 5. In addition to the high performance, the proposed model has the lowest running time among them. Our model size and number of trainable parameters (19.78 MB and 1.17 M, respectively) are significantly smaller than those of the other pre-trained models. CataractNet has only 16 layers, and the average running time is 1035 seconds, which is slightly less than that of MobileNet [51] and almost three times smaller than that of the other models [50], [52]-[54].
The confusion matrix of the CataractNet is depicted in Figure 2. Each column and row of the table corresponds to a predicted label and a true label, respectively. As illustrated in the matrix, there are only two wrong classification results where healthy eyes are detected as cataracts and vice versa in our test set.
Figures 3(a) and 3(b) illustrate the training and validation accuracy and loss graphs of CataractNet, respectively. The X-axis represents the number of epochs, and the Y-axis is the value of the accuracy or loss. As shown in these graphs, the validation accuracy and loss flatten out and become parallel to the training curves after 9 epochs, which is sufficient to simulate the pattern with minor anomalies.
To demonstrate the efficiency of CataractNet, the Receiver Operating Characteristic (ROC) curve is plotted using the testing set to make the final prediction, as shown in Figure 4. The X-axis and Y-axis represent the false-positive and true-positive rates, respectively. The dashed blue line represents random choice (probability 50%), and its slope is 1 because there are two classes (cataract and normal eyes). The area under the curve (AUC) is 0.9901, which is nearly perfect.
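The AUC can also be computed without plotting, since it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The following sketch uses that pairwise formulation on a toy example; the labels and scores are hypothetical and unrelated to the paper's test set.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]                # 1 = cataract, 0 = normal (toy data)
scores = [0.1, 0.4, 0.35, 0.8]       # hypothetical predicted probabilities
print(roc_auc(labels, scores))
```

A classifier that scores every sample identically obtains 0.5 under this definition, which corresponds to the diagonal random-choice line in a ROC plot.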
A graphical representation of cataract detection using CataractNet is illustrated in Figure 5. As shown in this figure, our model can successfully detect the cataract regardless of whether it is mild or severe. The two misclassified examples show that they were classified wrongly by only a slight difference in scores.
It is worth mentioning that these methods have been implemented on their own datasets, which are not publicly available. So, the accuracy can differ across datasets. To compare our method with the state of the art, it is required to implement and test these methods using our dataset. However, no source codes are available. We tried to contact these researchers through email to learn the implementation settings, but unfortunately, we could not reach them. Consequently, our model is only compared with the state-of-the-art approach of Khan et al. [43] by evaluating it on the same dataset used in [43], which is accessible on Kaggle [44]. The results are presented in Table 7. Our deep learning-based CataractNet achieves higher performance (98.62%) than the state-of-the-art approach of [43] (with a reported accuracy of 97.47%) on the same small dataset of 1400 images, even without any augmentation, which indicates the capabilities of our proposed model. Additionally, our method was compared to the other pre-trained CNN models in Table 5, which demonstrated its cost- and time-efficiency. Due to applying augmentation to the training dataset, our dataset is larger than those of most earlier works, which helps protect the model from overfitting.

VI. CONCLUSION AND FUTURE WORK
In this paper, we presented an automated cataract detection system, namely CataractNet, based on lightweight deep learning. Initially, a cataract dataset of fundus images was rearranged, pre-processed, and augmented to improve the dataset fed to the deep network. The developed CataractNet focused on investigating different layers, activation functions, loss functions, and optimization algorithms for minimizing the computational cost without sacrificing the model accuracy. Compared with five pre-trained CNN models, i.e., MobileNet, VGG-16, VGG-19, Inception-v3, and ResNet-50, CataractNet achieved competitive performance. Our model outperformed the state-of-the-art cataract detection approaches in terms of accuracy (99.13%), precision (99.08%), recall (99.07%), specificity (99.17%), MCC (98.23%), and F1-score (99.07%). Being highly accurate and cost- and time-efficient, CataractNet can enable ophthalmologists to detect cataract disease in a timely manner and more precisely. However, our method cannot discriminate between the three types of age-related cataracts (nuclear cataracts, cortical cataracts, and PSCs). Besides, it was only proposed for cataract detection and not for grading or finding the exact location of the cataract, which can be helpful for ophthalmologists. These issues need further investigation in the future.