Automatic Severity Classification of Diabetic Retinopathy Based on DenseNet and Convolutional Block Attention Module

Diabetic Retinopathy (DR), a complication caused by elevated blood glucose levels, is considered one of the most sight-threatening diseases. Unfortunately, DR screening is performed manually by an ophthalmologist, a process that is error-prone and time-consuming. Accordingly, automated DR diagnostics have become a focus of research in recent years due to the tremendous increase in diabetic patients. Moreover, the recent accomplishments of Convolutional Neural Networks (CNNs) have established them as the state of the art for DR stage identification. This paper proposes a new automatic deep-learning-based approach for severity detection that uses a single Color Fundus Photograph (CFP). The proposed technique employs DenseNet169's encoder to construct a visual embedding. Furthermore, a Convolutional Block Attention Module (CBAM) is introduced on top of the encoder to reinforce its discriminative power. Finally, the model is trained using cross-entropy loss on the Kaggle Asia Pacific Tele-Ophthalmology Society (APTOS) dataset. On the binary classification task, we achieved 97% accuracy, 97% sensitivity, 98.3% specificity, and a Quadratic Weighted Kappa (QWK) score of 0.9455, which is competitive with the state of the art. Moreover, our network showed high competency for severity grading (82% accuracy and a QWK of 0.888). The significant contribution of the proposed framework is that it efficiently grades the severity level of diabetic retinopathy while reducing the time and space complexity required, which makes it a promising candidate for autonomous diagnosis.


I. INTRODUCTION
Diabetes Mellitus is a chronic metabolic disease characterized by elevated blood glucose levels (hyperglycemia), which over time affects the blood vessels in the human body on both micro and macro scales. According to the World Health Organization (WHO), the number of diabetic people rose to 422 million in 2014 and is expected to reach 700 million by 2045 [1], [2]. One of the long-term diabetic micro-vascular effects is diabetic retinopathy, a progressive abnormality revealed and detected through ocular pathologies, which leads to blocking and bleeding of the retinal capillaries. Fortunately, early detection can prevent vision impairment. However, without frequent screening, it may induce irreversible damage. The International Diabetes Federation (IDF) affirmed that 93 million diabetics suffer from eye damage, yet only 200,000 ophthalmologists are available worldwide [3]. Grading inconsistency, a critical shortage of ophthalmologists, and the laborious screening process remain hindering factors for diabetic retinopathy detection. Therefore, automating retinopathy diagnostics is desired to reduce the high strain on health care systems. Motivated by this, significant efforts have been directed to enhance Computer-aided medical diagnosis (CAMD) systems.
The associate editor coordinating the review of this manuscript and approving it for publication was Chulhong Kim.
DR grading systems can be categorized into two groups: separation of diabetic retinas from healthy ones (binary classification task) and severity estimation (multi-class classification task) of affected retinas from class 0 (healthy) to class 4, proliferative DR (PDR). Traditional Machine Learning (ML) algorithms are Artificial Intelligence (AI) techniques that learn through experience by being exposed to data. Nagaraj et al. [4] employed them to detect diabetes type from patient attributes, using the Artificial Flora Algorithm (AFA) [5] for feature selection and Gradient Boosted Trees (GBT) [6] as the classification model. They were also exploited by Gharaibeh et al. [7], [8], who employed a feature engineering process and then applied Support Vector Machines (SVM) as a classifier for DR detection [9]. Despite their effectiveness, ML algorithms require expert experience and domain knowledge to find the most informative representation. Deep Learning (DL) has gained a foothold in various fields by representing the world as a nested hierarchy of concepts, with each concept defined through its relation to simpler concepts [10]. Convolutional Neural Networks emerged as the standout DL architecture in the late nineties. Since then, they have been used extensively for processing data such as images and time series. Moreover, they have demonstrated outstanding performance in practical applications such as Natural Language Processing (NLP) [11], [12] and Computer Vision (CV) problems [13]-[15].
VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Exploiting the power of convolutional neural networks in the medical domain has produced more robust solutions, specifically for DR. The works in [16] and [17] demonstrated the effectiveness of such techniques for retinal vessel segmentation. Similarly, by leveraging Generative Adversarial Networks (GANs), Zhao et al. [18] were able to synthesize fundus images. Dai et al. [19] utilized a multi-sieving convolutional neural network and image-to-text mapping for early detection of micro-aneurysms (MA). The study in [20] evaluated the performance of three recognized CNN architectures, VGG16, VGG19, and InceptionV3 [21], [22], by employing transfer learning and fine-tuning for binary and multi-class classification. Zeng et al. [23] introduced a Siamese-like architecture [24] trained with transfer learning to classify fundus images into two grades. Kassani et al. [25] used a Multi-Layer Perceptron (MLP) as a classification head on top of a modified Xception network [26] that concatenates feature maps from different convolutional layers. Four Inception models were utilized in [27] for multi-class classification: each fundus image was sliced into four quadrants, and each quadrant was classified by one of the four models. The authors of [28] exploited blended models to enhance data representation, and Gangwar et al. [29] investigated a new hybrid model derived from the Inception and ResNet architectures. Al Antary et al. [30] designed a ResNet architecture integrated with a Multi-Scale Attention (MSA) mechanism to enhance the representational power of the encoder. Moreover, they employed a multi-level approach to feature reuse for further improvement. Since our focus in this paper is to enhance the grading system on both binary and multi-class classification tasks, we note that, despite their success, the aforementioned algorithms have drawbacks ranging from high time and space complexity to neglecting the severe imbalance inherent in the data.
DR severity grading remains a challenging task due to three factors: (i) Data rarity. Acquiring massive labeled data is a crucial issue for DL, and it is even more significant in the medical domain due to data privacy issues and/or the costly devices needed to obtain high-quality images. (ii) Implicit stochasticity. Retinal fundus images exhibit large variations in color, contrast, illumination, and size caused by different devices and environmental conditions. As a result, the model's decision may be distorted. (iii) Fading class disparity. The threshold for image classification between two closely distributed classes (e.g., mild and moderate in the APTOS dataset) is blurry, as will be shown in Section III.C, due to the dependence on microscale ocular pathologies. To address fading disparity, large CNN architectures were employed in the literature to extract more informative features, while data augmentation and preprocessing were used to enhance CNNs' generalizability. Finally, transfer learning was exploited to overcome data shortage.
In this paper, we investigate the efficacy of a light-weight deep learning architecture for fast and robust severity grading of diabetic retinopathy. Our framework is based on a modified version of DenseNet [31] integrated with an attention mechanism for further feature refinement. Furthermore, we observe the effect of data imbalance on the model's performance and mitigate it by using an imbalanced learning technique. As shown in Fig.1, we first preprocess the retinal image for quality enhancement; the images are then passed to the DenseNet encoder C for feature extraction, and the features are sent to the attention module A for an improved representation. We freeze DenseNet's encoder, pre-trained on the ImageNet [32] dataset, and use its weights θ_C to accelerate convergence; only the attention module and the classification head are trained on APTOS data in a supervised manner to update θ_A and θ_M. Our main contributions are as follows:
1) We developed a modified architecture that reduces the time needed for training and inference while enhancing DR severity grading, using a relatively small model with 8.5 million parameters compared to 10.8 million in the previous work.
2) We explored the effect of using an attention mechanism as a supplementary module for feature refinement, which led to an increase in accuracy while preserving low model complexity.
3) We tested the effect of using an imbalanced learning approach to alleviate the impact of data imbalance on the model's performance and showed that it improves the overall metrics.
4) We utilized transfer learning only by freezing the convolutional encoder, without extra fine-tuning, which led to a relatively low number of learnable parameters (150K).
The paper is organized as follows. Related work is presented in Section II. The methodology is presented in Section III. Results and discussions are given in Section IV. Finally, conclusions are provided in Section V.
FIGURE 1. The scheme of our proposed approach. In the network training step (upper), we pass a batch of labeled preprocessed images X to our convolutional encoder C for feature extraction, and then to an attention mechanism A for feature refinement. In the testing phase (lower), we directly pass the data to the network to predict the image class.

II. RELATED WORK
Deep learning has been deployed extensively in DR due to the rise of the transfer learning paradigm, which offers fast convergence and performance enhancement while reducing the need for massive data and computational resources. This has opened the door for more robust algorithms in the medical domain. Wang et al. [33] developed Lesion-Net; the main aim of the network was to add lesion detection to severity grading to reinforce the representational power of the encoder. The architecture was built on InceptionV3 and was trained and validated using a private dataset. An ensemble stacking approach was investigated by Qummar et al. [34], who used five reputable architectures (ResNet50, InceptionV3, Xception, DenseNet121, DenseNet169) to improve the produced feature maps and assessed the model on the Kaggle EyePACS dataset. A hybrid deep learning model introduced by Cortes et al. [35] used an InceptionV3 encoder for feature extraction and then trained a Gaussian Process (GP) regressor to obtain the uncertainty of the prediction, using the EyePACS and Messidor-2 datasets for the DR binary classification task. The EfficientNet-B3 architecture was deployed by Sugeno et al. [36] for both binary and severity classification using the APTOS dataset. Furthermore, they developed a method for lesion detection and validated it against ground truth using the DIARETDB1 dataset (https://www.it.lut.fi/project/imageret/diaretdb1/). Meta-plasticity, a bio-inspired phenomenon, was artificially implemented in the CNN's back-propagation path by Boix et al. [37] to reinforce less common occurrences during the learning process and thereby enhance performance. Moreover, they deployed this technique in different deep learning architectures, using APTOS data for binary and severity grading tasks. Zhang et al. deployed a Source-Free Transfer Learning (SFTL) [38] model for referable DR, which utilizes unlabelled retinal images to alleviate the challenges of medical data annotation and privacy. They applied their algorithm to the APTOS dataset for binary and multi-class classification tasks.

III. METHODOLOGY
In this section, we present the details of our framework. First, we introduce APTOS data, followed by data preprocessing, then data augmentation, balancing, and analysis. Finally, we introduce our architecture, training settings, and evaluation metrics.

A. DATASETS
In 2019, the APTOS dataset was released on the Kaggle website as part of a public competition for DR detection. The main aim was to classify disease severity from fundus images by producing the probability that an image belongs to one of five classes: No DR, Mild, Moderate, Severe, and Proliferative DR. The data was collected by Aravind Eye Hospital in India. Approximately 13,000 images were provided in this competition; however, we only had access to the ground-truth labels of 3662 images.
FIGURE 2. Visual comparison between (a) the raw fundus image and (b) the pre-processed fundus image. After removing the black side borders and applying a Gaussian filter, the clarity of blood vessels and other bio-markers is enhanced significantly.

B. DATA PRE-PROCESSING
The uninformative black areas on the sides of the images were first trimmed, and then a circular crop was applied to obtain a centered retinal image. Moreover, a filtering technique [39] was used to enhance the clarity of visual bio-markers, as described by (1) and (2), where X indicates the input data, G(σ_x) is a 2D Gaussian kernel with a standard deviation of σ_x = 15 in the x-direction, and * is the convolution operation. α, β, and γ were chosen empirically to be 5, −4, and 70, respectively. Finally, each image was normalized to the range [0, 1], resized to 256 × 256 using bilinear interpolation, and cast to 32-bit floating point. Fig.2 shows the input and output of the preprocessing step.
Furthermore, the data were projected into a lower-dimensional feature space, using Principal Component Analysis (PCA) to reduce the data dimensionality to 500-D and then applying the t-Distributed Stochastic Neighbor Embedding (t-SNE) algorithm to analyze the data distribution across different classes [40]. The following intuitions were developed from Fig.3:
• Class 0 forms feature clusters all over the 2-D space, making it one of the easiest classes to detect.
• Classes 1-4 overlap severely, which makes it challenging for the algorithm to fit a proper separating hyperplane.
• We artificially clustered the data to form only two regions (infected and healthy), and we observed that DL, based on our understanding, is robust enough to solve the binary classification problem.
To mitigate the effect of class imbalance, we used an Inverse Number of Samples (INS) learning approach, where each class is weighted inversely proportional to its frequency in the original dataset, as described in (3) and (4), where W and S_i are 1-D arrays containing the weight of each class and the number of samples per class, respectively, N is the total number of classes, and i is the class index. Consequently, we used a weighted version of the Categorical Cross-Entropy (CCE) loss function, given in (5), where
• M is the number of training samples
• N is the total number of classes
• w_i is the weight for class i
• y_i^m is the target label for training example m for class i
• x_m is the input image for training example m
• h_θ is the model with learnable parameters θ
Random horizontal flipping, vertical flipping, and rotation were applied to reduce overfitting and improve the model's generalizability. The augmentation was applied on the fly, i.e., as a layer in our network that performs these transformations during the training phase.
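The INS weighting and the weighted CCE loss described in this section can be sketched as follows. This is a minimal illustration rather than the exact form of (3)-(5): it assumes that the raw weights 1/S_i are rescaled so that they sum to the number of classes N, and that the loss is averaged over the M samples. The function names `ins_weights` and `weighted_cce` and the class counts are hypothetical.

```python
import math


def ins_weights(samples_per_class):
    """Inverse-number-of-samples weights, rescaled to sum to the class count.

    The rescaling step is an assumption made for this sketch.
    """
    n_classes = len(samples_per_class)
    raw = [1.0 / s for s in samples_per_class]
    total = sum(raw)
    return [n_classes * r / total for r in raw]


def weighted_cce(probs, labels, weights, eps=1e-12):
    """Mean weighted cross-entropy for integer (sparse) labels.

    probs[m][i] is the predicted probability of class i for sample m.
    """
    loss = 0.0
    for p, y in zip(probs, labels):
        # Only the true class contributes, scaled by that class's weight.
        loss -= weights[y] * math.log(max(p[y], eps))
    return loss / len(labels)


# Hypothetical per-class counts for a five-class, imbalanced dataset.
counts = [100, 20, 50, 10, 20]
w = ins_weights(counts)

# Two samples with uniform predictions: one majority, one minority class.
uniform = [[0.2] * 5, [0.2] * 5]
loss = weighted_cce(uniform, [0, 3], w)
```

With these weights, an error on the rarest class costs ten times as much as the same error on the most frequent class, which is the penalizing behavior the weighted loss is meant to provide.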

D. ARCHITECTURE
Our algorithm consists of a backbone model (convolutional base) and an attention module. First, the backbone network is used as a feature extractor for the input fundus image, and then the features are refined using the Convolutional Block Attention Module (CBAM) to enhance the data representation. Afterward, the refined features are converted to a one-dimensional array by averaging each feature map generated by the attention module using Global Average Pooling (GAP), followed by the classification head. Fig.4 illustrates our network.

1) DenseNet
DenseNet was used as the main backbone for the proposed approach. Huang et al. [31] demonstrated the robustness of the architecture against the vanishing gradient problem, along with a reduced number of parameters and reduced over-fitting on smaller datasets. The main idea is to connect CNN layers using a dense connectivity pattern such that each layer receives the concatenation of all preceding feature maps as input:
$$X_l = H_l([X_0, X_1, \ldots, X_{l-1}])$$
where $[X_0, X_1, \ldots, X_{l-1}]$ is the concatenation of the feature maps fed to the $l$-th layer, and $H_l(\cdot)$ is a hidden layer that applies consecutive operations: batch normalization (BN) [41], followed by a rectified linear unit (ReLU) [42] and a convolution, yielding a non-linear transformation of the input. This design allows feature reuse by routing the previous feature maps to the next convolution layer. For pooling, a Transition Block (TB) is integrated, consisting of batch normalization, 1 × 1 convolution, and 2 × 2 average pooling.
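To make the effect of dense connectivity concrete, the sketch below tracks how the number of feature maps grows through a DenseNet encoder. It assumes the standard DenseNet169 settings from [31] (growth rate 32, 64 initial channels, transition blocks that halve the channel count, and dense blocks of 6, 12, 32, and 32 layers); these values are not stated in this section, and the function name `densenet_channels` is hypothetical.

```python
def densenet_channels(block_config, growth_rate=32, init_channels=64, compression=0.5):
    """Return the channel count at the output of each dense block."""
    channels = init_channels
    per_block = []
    for idx, n_layers in enumerate(block_config):
        # Each layer appends growth_rate new feature maps to the concatenated input.
        channels += n_layers * growth_rate
        per_block.append(channels)
        # A transition block follows every dense block except the last one.
        if idx < len(block_config) - 1:
            channels = int(channels * compression)
    return per_block


standard = densenet_channels((6, 12, 32, 32))
print(standard)  # [256, 512, 1280, 1664]
```

Passing a different `block_config`, such as the shortened fourth block described in Section III.D.3, shows how reducing the depth of one block shrinks the final embedding.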

2) CONVOLUTIONAL BLOCK ATTENTION MODULE (CBAM)
CBAM has proved its success in generating more curated features and enhancing performance [43]. It consists of two sub-modules:
• Channel Attention Module.
• Spatial Attention Module.
The attention module is used to infer two feature maps, one per sub-module. In their definitions, $\sigma(\cdot)$ is the sigmoid function, MLP is a shared network with hidden units $\in \mathbb{R}^{C/r \times 1 \times 1}$, $C$ is the number of channels, $r$ is a reduction ratio, $F \in \mathbb{R}^{H \times W \times C}$ is the output of the channel attention module, and $K^{H \times W}$ is a convolution kernel with one filter applied to the concatenation of $Sp_{Avg\,pool}$ and $Sp_{Max\,pool}$, both of which are computed across the channel axis. Fig.5 shows an illustration of CBAM.

Algorithm 1 The Implementation of DenseNet+CBAM Model
Input: Pre-trained DenseNet encoder C with ImageNet weights θ_C, labelled data (X, Y), α, β, γ, batch size B, class weights W.
Output: θ_A for the attention mechanism A, θ_M for the classification head.
Initialisation: Learning rate l_r
1: Apply preprocessing X = F_transform(X, α, β, γ)
2: for epoch = i from 1 to N do
3:   for each mini-batch do
4:     for image k in mini-batch b do
5:       Apply on-fly Keras augmentation
6:       Extract & refine the features
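Step 1 of Algorithm 1 applies the filtering transform from Section III.B. The sketch below is one possible reading of it, assuming the filter takes the weighted-sum form αX + β(G(σ_x) * X) + γ that the symbol definitions suggest. Several details are assumptions made only for this illustration: pixel values in [0, 255] before normalization, the same standard deviation in both directions, replicated borders, a kernel truncated at 3σ, and clipping of the result. The helper names are hypothetical, and `f_transform` simply mirrors the name used in Algorithm 1.

```python
import math


def gaussian_kernel_1d(sigma, radius):
    """Normalized 1-D Gaussian kernel with 2 * radius + 1 taps."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma)) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_rows(img, kernel):
    """Convolve every row with the kernel, replicating border pixels."""
    radius = len(kernel) // 2
    out = []
    for row in img:
        last = len(row) - 1
        out.append([
            sum(k * row[min(max(x + j - radius, 0), last)] for j, k in enumerate(kernel))
            for x in range(len(row))
        ])
    return out


def gaussian_blur(img, sigma, radius):
    """Separable 2-D Gaussian blur: filter rows, transpose, filter again."""
    kernel = gaussian_kernel_1d(sigma, radius)
    horizontal = blur_rows(img, kernel)
    transposed = [list(col) for col in zip(*horizontal)]
    vertical = blur_rows(transposed, kernel)
    return [list(col) for col in zip(*vertical)]


def f_transform(img, alpha=5.0, beta=-4.0, gamma=70.0, sigma=15.0, radius=None):
    """Weighted sum of the image and its blurred copy, clipped to [0, 255]."""
    if radius is None:
        radius = int(3 * sigma)  # assumed truncation of the kernel
    blurred = gaussian_blur(img, sigma, radius)
    return [
        [min(max(alpha * p + beta * b + gamma, 0.0), 255.0) for p, b in zip(row, blurred_row)]
        for row, blurred_row in zip(img, blurred)
    ]


# A flat single-channel patch: blurring leaves it unchanged, so only gamma shifts it.
flat = [[100.0] * 6 for _ in range(6)]
out = f_transform(flat)
```

On a flat region the blurred copy equals the input, so the output is just the input shifted by γ; on edges and small lesions the difference between the image and its blurred copy is amplified, which is what sharpens the visible bio-markers.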

3) PROPOSED IMPLEMENTATION
DenseNet169 was selected from the DenseNet family after comparing different reputable pre-trained models. It demonstrated robust performance across all classes due to its nature: as discussed in Section III.D.1, the flow of information from low-level features to the upper layers allows the model to exploit as many features as possible. A series of experiments was conducted to check whether the full depth is needed to achieve the best performance, and we decided to reduce the number of convolutional blocks in the fourth dense block from 32 to 12. Attention mechanisms give DL algorithms more flexibility to focus on the information most relevant to the target and discard the rest. CBAM has proved capable of enhancing a model's representational power without increasing its complexity, so we tried different positions for CBAM in our modified DenseNet. We observed that the best performance is obtained by positioning CBAM on top of the convolutional encoder, which also reduces the training time significantly because of the smaller spatial dimensions at that point.
Four trials were conducted to show the gradual increase in performance. Our baseline uses only DenseNet169's modified encoder, without CBAM as a supplementary module and without handling the class imbalance inherent in the APTOS data. In the second trial, we demonstrated the effectiveness of cost-sensitive learning, which penalizes the model more heavily for errors on minority classes and less heavily for errors on majority classes. In the third trial, CBAM was added to DenseNet without INS to investigate its effectiveness on its own. Finally, we investigated the enhancements obtained by using CBAM and INS together. The four experiments followed the same settings: DenseNet's encoder was frozen, and transfer learning was used to accelerate the training of the CBAM and Softmax layers.
Fine-tuning was not used, in contrast to the conventional practice when the data domain differs from ImageNet. We based this decision on the results reported in [44], where ImageNet weights proved robust as a feature extractor for retinal disease detection. For CBAM, we used a reduction ratio r = 32 in the channel module and a kernel size of 7 × 7 (K^{7×7}) in the spatial module. Due to its performance, our fourth trial was compared to other state-of-the-art techniques. Detailed information regarding our architecture is given in Table.1.
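For reference, the two attention maps configured above can be written following the original CBAM formulation [43]. The block below is a sketch in that formulation, not a transcription of this paper's equations: the symbols $F_{in}$ and $F_{out}$ and the spatial pooling operators AvgPool and MaxPool in the channel module are notation assumed here, while $F$, $K^{7\times 7}$, $Sp_{Avg\,pool}$, and $Sp_{Max\,pool}$ follow Section III.D.2.

```latex
\begin{align}
M_c(F_{in}) &= \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F_{in})) + \mathrm{MLP}(\mathrm{MaxPool}(F_{in}))\big), \\
F &= M_c(F_{in}) \otimes F_{in}, \\
M_s(F) &= \sigma\big(K^{7 \times 7} * [\,Sp_{Avg\,pool}(F);\; Sp_{Max\,pool}(F)\,]\big), \\
F_{out} &= M_s(F) \otimes F .
\end{align}
```

Here $\otimes$ denotes element-wise multiplication with broadcasting. With $r = 32$, the hidden layer of the shared MLP has $C/32$ units, so the channel module stays small relative to the encoder.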

E. TRAINING SETTINGS
Our splitting policy was 90% to 10% of the dataset to form the training and validation sets. A stratified splitting technique was used to keep the class distribution of both subsets consistent with that of the original set. Table.4 shows the training and validation data statistics. Furthermore, K-fold cross-validation was implemented to obtain more robust results; given the size of the dataset, we used 5 folds, training on 80% and testing on 20% of the original dataset in each trial. The maximum number of epochs was limited to 400, with an early stopping callback that avoids overfitting by saving the best weights, i.e., those corresponding to the minimum validation loss. Finally, we used the same stratified splitting mechanism to ensure the same class distribution in each fold.
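The stratified splitting described above can be sketched as follows. This is a minimal standard-library illustration, not the authors' implementation; the function name `stratified_kfold`, the round-robin assignment, and the label counts are assumptions made for the example.

```python
import random
from collections import Counter, defaultdict


def stratified_kfold(labels, n_splits=5, seed=0):
    """Yield (train_idx, test_idx) pairs that preserve class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    folds = [[] for _ in range(n_splits)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        # Deal the shuffled indices of each class round-robin across the folds.
        for pos, idx in enumerate(idxs):
            folds[pos % n_splits].append(idx)

    for k in range(n_splits):
        test = sorted(folds[k])
        train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train, test


# Hypothetical imbalanced labels for five severity grades (100 samples).
labels = [0] * 50 + [1] * 10 + [2] * 25 + [3] * 5 + [4] * 10
splits = list(stratified_kfold(labels, n_splits=5))
first_test_counts = Counter(labels[i] for i in splits[0][1])
```

Each fold then trains on 80% and tests on 20% of the data with the same class mix. A single 90%/10% split can be obtained in the same way by using ten folds and keeping one of them for validation.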
Our algorithm was implemented using TensorFlow [45] and trained on a Tesla V100 GPU provided by Google Colab. We trained the four networks for 1000 epochs with a small batch size of 32 images; the RGB images are passed to the network after being preprocessed. We used the Adam optimizer with a learning rate of 3 × 10^−4, β_1 = 0.9, β_2 = 0.909, and the weighted CCE defined in (5) as the loss function. Specifically, we used the sparse variant of CCE, in line with the label encoding found in the dataset. All layers in CBAM were initialized with the He normal initializer [46], a Dropout layer with a rate of 0.5 was used to improve generalizability, and Softmax was used as the final layer [47]. For severity grading, the class with the highest probability gives the predicted level of the sample, whereas for binary classification, the output was thresholded at 0.5. The overall training process of our proposed approach is given in Algorithm 1.

F. EVALUATION METRICS
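Five common metrics were used to evaluate the model's performance.

1) ACCURACY (ACC)
The percentage of correct predictions that a model achieves. Accuracy is defined as
$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$

2) SENSITIVITY (SENS)
The percentage of actual positive cases that are classified as positive. It is defined as
$$\mathrm{SENS} = \frac{TP}{TP + FN}$$

3) SPECIFICITY (SPEC)
The percentage of actual negative cases that are classified as negative. It is defined as
$$\mathrm{SPEC} = \frac{TN}{TN + FP}$$

4) F1-SCORE
The harmonic mean of precision and recall. It is defined as
$$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
where $\mathrm{Precision} = TP / (TP + FP)$ and $\mathrm{Recall} = \mathrm{SENS}$.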
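The four confusion-count metrics above can be computed directly from predicted and true binary labels. The following is a minimal sketch; the function names are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen only for this example.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary task."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp, tn, fp, fn


def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero (assumed convention)."""
    return num / den if den else 0.0


def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity, and F1-score from label lists."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    precision = safe_div(tp, tp + fp)
    sensitivity = safe_div(tp, tp + fn)  # also called recall
    return {
        "acc": safe_div(tp + tn, tp + tn + fp + fn),
        "sens": sensitivity,
        "spec": safe_div(tn, tn + fp),
        "f1": safe_div(2 * precision * sensitivity, precision + sensitivity),
    }


# Toy labels: 1 = DR present, 0 = healthy.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = binary_metrics(y_true, y_pred)
```

In this toy case there are three true positives, three true negatives, one false positive, and one false negative, so all four metrics evaluate to 0.75.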

5) KAPPA-SCORE
Used to assess the agreement between our model and the original rater. It is defined as
$$\kappa = 1 - \frac{\sum_{i,j} w_{i,j}\, O_{i,j}}{\sum_{i,j} w_{i,j}\, E_{i,j}}$$
where true positives (TP) are samples correctly predicted as positive, true negatives (TN) are samples correctly predicted as negative, false positives (FP) are samples misclassified as positive, and false negatives (FN) are samples misclassified as negative. $O_{i,j}$ is the observed matrix, $E_{i,j}$ is the expected one, and $w_{i,j} = (i - j)^2 / (N - 1)^2$ is the quadratic weight for $N$ classes.

IV. RESULTS AND DISCUSSIONS
Fig.6 illustrates the performance of our four algorithms. In Fig.6.a, we observe that without the weighted loss function, the model is easily biased and performs robustly only on the majority classes (0 and 2), at the expense of the minority classes. As shown in Fig.6.b, attaching CBAM to our encoder enhanced the detection of classes 1 and 3 by 63.3% and 90.9%, while reducing class 2 by only 4.6%. Class imbalance mitigation allowed better performance: as can be seen in Fig.6.c, the detection of classes 1 and 3 is enhanced by 43.3% and 236.4%, respectively, with respect to the baseline algorithm. Finally, using CBAM with DenseNet169 together with the weighted loss demonstrated strong performance across all classes. Despite the reduction in class 2 by 14.63%, classes 1, 3, and 4 exhibit significant improvements of 44.2%, 43.24%, and 235%. Average QWK and accuracy values of 0.8072 and 72.3%, respectively, were achieved using 5-fold cross-validation. As stated in Section III.E, we trained our algorithm for only 400 epochs to reduce the computational cost of training five different models; further training is expected to provide more reliable results.
As shown in Table.2, the proposed method compared favorably with prior work on the severity grading task, outperforming some methods and showing comparable results to others. Our model enhanced accuracy and QWK by 0.4% and 24.9% while decreasing inference time by cutting the number of parameters by 83% compared to [28]. We achieved almost the same accuracy as [29] while reducing the model size. Our best trial had an increase in accuracy of about 7% compared to the AM-InceptionV3 [37] method. The SFTL model achieved high accuracy on the severity grading task; however, it did not tackle the problem of data imbalance. EfficientNet-B3 [36] achieved higher accuracy, but only for the majority classes, while we achieved comparable accuracy on the minority classes. Finally, we compared our best trial with the MSA network without multi-level feature reuse [30]: we had almost the same accuracy with an increase in QWK of 3.6%. Furthermore, we achieved a better confusion matrix across all classes than the literature while reducing time and space complexity through a 45% reduction in parameters. The severity grading F1-score was not reported in the literature. However, by using CBAM and INS, it was improved by 21.4% with respect to the baseline DenseNet169.
Our algorithm demonstrated robustness against other deep learning architectures on the binary classification task, as shown in Table.3. Above all, the literature did not deal with the class imbalance problem. Most of the implemented algorithms did not consider its effect on the quality metrics, which led to overestimated outcomes, as most of them predicted well only for the majority classes. Furthermore, as mentioned in Section III.C, binary grading does not require complex architectures: our algorithm achieved almost the same metrics as other algorithms with fewer parameters. In addition, when we artificially formed two clusters (infected and normal), the classes were balanced, which helped the literature algorithms to excel in this task. Moreover, our algorithm exceeds the minimum limits for sensitivity and specificity set by the English National Screening Program [48]. Finally, our model achieved a low training time (9 seconds/epoch) and a relatively high inference speed (1.166 seconds per 32 images), compared to 5 seconds for the MSA network with the same batch size.
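The QWK values quoted in this section follow the kappa score of Section III.F.5. As a minimal sketch of how such a score is obtained from predicted and true grades, the function below uses the common quadratic weighting (i − j)² / (N − 1)² and an expected matrix built from the outer product of the row and column totals; the function name is hypothetical, and the sketch assumes at least two distinct grades appear so that the denominator is non-zero.

```python
def quadratic_weighted_kappa(y_true, y_pred, n_classes=5):
    """Quadratic weighted kappa between two lists of integer grades."""
    # Observed agreement matrix: rows are true grades, columns are predictions.
    observed = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1

    total = len(y_true)
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(observed[i][j] for i in range(n_classes)) for j in range(n_classes)]

    numerator = 0.0
    denominator = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            weight = (i - j) ** 2 / (n_classes - 1) ** 2
            # Expected count under independent raters with the same marginals.
            expected = row_sums[i] * col_sums[j] / total
            numerator += weight * observed[i][j]
            denominator += weight * expected
    return 1.0 - numerator / denominator


perfect = quadratic_weighted_kappa([0, 1, 2, 3, 4], [0, 1, 2, 3, 4])
chance = quadratic_weighted_kappa([0, 0, 1, 1], [0, 1, 0, 1], n_classes=2)
```

Because the weights grow with the squared distance between grades, confusing class 0 with class 4 is penalized far more than confusing two adjacent grades, which suits an ordinal task such as severity grading.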

V. CONCLUSION
In this study, we proposed a new CNN model based on the DenseNet169 architecture integrated with CBAM as an additional component for enhancing representational power. The proposed method demonstrated robust performance and comparable quality metrics while reducing the burden of space and time complexity. Furthermore, a 2-D Gaussian filter was used to enhance the quality of the fundus images. Finally, we used INS to form our weighted loss function, which tackles class imbalance and improves the model's predictions across all classes. In future research, we will evaluate the performance of different CBAM configurations. Moreover, experimenting with different imbalanced learning techniques and increasing the dataset size are expected to lead to better performance.