Classification of Diabetic Retinopathy Severity Based on GCA Attention Mechanism

Diabetic retinopathy (DR) is one of the major complications of diabetes and can lead to severe vision loss or even complete blindness if it is not diagnosed and treated in a timely manner. In this paper, a new global channel attention (GCA) mechanism for feature maps is proposed to address the early detection of DR. Within the GCA module, an adaptive algorithm is proposed that sets the one-dimensional convolution kernel size from the dimension of the feature map, and a deep convolutional neural network for grading DR severity in color fundus images, named GCA-EfficientNet (GENet), is designed. Training uses transfer learning with a cosine annealing learning-rate schedule, and the image regions that GENet attends to are visualized with heat maps. On the DR dataset of the Kaggle competition, the final accuracy, precision, sensitivity and specificity reach 0.956, 0.956, 0.956, and 0.989, respectively. Extensive experiment results show that GENet, based on the GCA attention mechanism, extracts lesion features more effectively and better classifies the severity of DR.


I. INTRODUCTION
Diabetic retinopathy (DR) is one of the major complications of diabetes, caused by retinal damage when capillaries rupture under high blood sugar levels [1]. There are now 460 million people worldwide aged 20-79 years with diabetes, and this number will exceed 700 million by 2045 [2], [3]. Due to this dramatic increase, the number of people with DR is expected to reach 191 million by 2030 [4]. Early-stage DR is less harmful, does not cause serious visual impairment and is clinically treatable [5]; timely treatment can reduce the risk of visual impairment by approximately 57% [6]. Therefore, timely examination and treatment are the main measures to protect visual acuity.
2D color fundus images and 3D optical coherence tomography (OCT) images are the most common examination methods and the diagnostic basis for ophthalmic diseases [7]. Compared with OCT, 2D color fundus images are a common DR examination method that saves time and cost, and additional lesions such as macular edema, microaneurysms and optic disc edema can be diagnosed from color images. With the rapid development of big data and computer technology in recent years, the application of deep learning in image processing and computer-aided diagnosis has become increasingly prevalent. Using color fundus images to identify the severity of DR with computer-aided diagnosis can not only improve the efficiency of doctors' diagnoses, but also reduce medical costs and benefit economically disadvantaged areas.
(The associate editor coordinating the review of this manuscript and approving it for publication was Rajeeb Dey.)
Deep convolutional neural networks (DCNNs) are widely used in computer vision and have achieved excellent results in many tasks. For DR severity diagnosis, Gulshan et al. [8] trained an Inception-v3-based deep convolutional neural network and achieved 90.3% sensitivity and 98.1% specificity on the EyePACS-1 and Messidor-2 datasets. Wan et al. [9] performed transfer learning and hyperparameter fine-tuning on AlexNet, VggNet, GoogLeNet and ResNet with data from the Kaggle platform; the best classification accuracy reached 95.68%. Gadekallu et al. [10] argued that previous studies lacked data preprocessing and dimensionality reduction, which led to poor results, and proposed a feature extraction method combining data standardization with StandardScaler, principal component analysis and deep neural networks; the model was evaluated against today's mainstream machine learning models. When using a DCNN for DR classification, proper preprocessing of the original image is required, because color fundus images are rich in fine detail, such as capillaries, spread over the vast majority of the image. More importantly, the network model needs to consider the relevance of the original image at different locations and in different channels. This requires the network to introduce a proper attention mechanism, which lets the model adaptively enhance its perception of useful information. However, there is scant existing research on this issue. Fan et al. [11] combined an attention model in the feature fusion stage of a DR classification model to adaptively update the weights of each feature block; Liu et al. [12] combined a compact bilinear pooling model and an attention mechanism for fine-grained DR image classification.
(VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/)
Based on the above background and previous studies, this paper proposes a deep learning DR severity classification model, GCA-EfficientNet (GENet), which introduces a global channel attention mechanism for feature maps and fully takes into account the correlation between the different channels of the feature map for DR severity-assisted diagnosis. The main contributions of this paper are as follows.
(1) For the DR severity classification problem, a Global Channel Attention (GCA) mechanism is proposed that updates the attention weights of the different channels of the feature map as the model trains.
(2) For the parameter update of the GCA module, an adaptive one-dimensional convolution kernel size calculation method is proposed that adjusts the kernel size according to the feature-map dimensions of the different feature extraction modules.
(3) Combining the GCA attention mechanism with EfficientNet, the GCA-EfficientNet (GENet) model is proposed and trained using transfer learning; the accuracy, precision, sensitivity and specificity finally reach 0.956, 0.956, 0.956, and 0.989, respectively, on the DR dataset of the Kaggle competition.

II. RELATED WORKS
Diabetic retinopathy severity detection aims to help physicians make a timely diagnosis of early fundus disease and provide a rationale for further treatment based on the severity of DR by discriminating lesion features on color fundus images or OCT images through image processing techniques and computer technology. Early research on DR mainly used traditional machine learning techniques to identify lesion features. However, in recent years, with the rapid development of artificial intelligence technology and computer technology, an increasing number of scholars have used deep learning techniques for DR severity classification.
In the stage of DR detection using traditional machine learning techniques, researchers needed some medical background to manually extract lesion features from the image dataset, whereupon the extracted features were fed into a classification model to complete the detection of DR. Nguyen et al. [13] proposed a robust multilayer feedforward neural network for DR severity classification. For the early detection and classification of the main symptoms of DR, Zhang et al. [14] used a support vector machine (SVM) to classify preprocessed bright non-lesion areas, exudates and cotton wool spots. Zhang et al. [15] proposed a top-down strategy to detect fundus hemorrhage, combined it with 2DPCA, and applied a virtual SVM to achieve higher classification accuracy. To extract feature vectors from DR images, Soares et al. [16] applied a 2D Gabor wavelet transform at multiple scales to image pixels and fed the results into a classification model to distinguish vascular from non-vascular regions. Nayak et al. [17] used image preprocessing, morphological processing and texture analysis techniques to detect lesion features and used them as inputs to an artificial neural network for the automatic detection of DR severity. An automatic system for analyzing DR lesions in the central field of the retina was proposed by Barriga et al. [18]; the system extracted features using amplitude and frequency modulation and used partial least squares (PLS) and an SVM for classification. Priya et al. [19] compared the performance of a probabilistic neural network (PNN) and an SVM for binary DR classification; the SVM model achieved 97.608% accuracy, better than the other models. Roychowdhury et al. [20] analyzed fundus images in different contexts and reduced the number of features used for lesion classification to generate DR severity classes using machine learning. Srivastava et al. [21] used a Frangi filter to extract features from the green channel of fundus images to train an SVM classifier, which predicted the severity of DR. Santhakumar et al. [22] divided the lesion features of fundus images into several rectangular patches, then passed the patch features into an SVM for DR severity classification. To detect red lesions in retinal fundus images, Srivastava et al. [23] proposed a new, robust filter to discriminate between vessels and red lesions; lesion features were extracted using the corresponding filter for red lesions of different sizes, and the experiment results show that this filter was helpful for the automatic detection of DR.
Although these methods can detect the severity of DR to some extent, traditional machine learning approaches require a large number of annotated features. This annotation process consumes considerable resources and time, and the lesion must be segmented from the whole fundus image, which makes the process demanding in terms of medical background and inefficient; moreover, lesion features are easily missed during annotation. Deep learning techniques, in contrast, do not require manual annotation and segmentation of lesion features; for example, convolutional neural networks (CNNs) can extract lesion features from the entire fundus image without missing features, unlike manual methods. In addition, when CNNs extract fundus image features, the different receptive fields mean that convolutional kernels close to the network input easily extract detailed features such as texture and shape, while kernels close to the network output extract more semantic features. Nowadays, an increasing number of researchers apply deep learning techniques to DR severity detection, and the adoption of CNNs has made the DR diagnosis process simple and efficient. Pratt et al. [24] used CNN structures to extract disease features from fundus images and trained the model with data augmentation techniques to enable the extraction of complex lesion features. A method for deep visual feature (DVF) extraction based on scale-invariant color density and gradient location-direction histograms was proposed by Abbas et al. [25]; the whole model has no pre-processing or post-processing stages, and the extracted features are transformed and fed into a multilayer classification network to obtain the prediction results. Kanungo et al. [26] derived the impact of hyperparameters and of the quality and quantity of training data on model performance through a large number of comparative experiments. Because of the uninterpretable black-box nature of how CNNs make decisions internally from image features, Quellec et al. [27] proposed a way to create heatmaps showing which pixels in an image contribute to the image-level prediction and applied it to DR screening. Zhao et al. [28] proposed a model combining an attention mechanism and a bilinear network for fine-grained classification to deal with small target lesion features in fundus images, and proposed a new loss function for the different classes of DR; the experiments showed that this model achieved excellent performance in DR severity classification. A contour detection image processing algorithm for vessel detection in fundus images based on Mamdani (Type-2) fuzzy rules was developed by Orujov et al. [29]; the algorithm passes the green channel of the image through contrast-limited adaptive histogram equalization and median filtering, then applies the Mamdani fuzzy rules to the gradient values of the image for edge detection. Das et al. [30] proposed a CNN-based DR detection and classification algorithm in which fundus images are preprocessed so that vessel branches can be extracted by a segmentation model; the segmented regions are corrected using maximum principal curvature and adaptive histogram equalization. A residual convolutional block attention model (RCAM) was proposed by Fan et al. [31]; the attention model is used in a multi-feature fusion technique with adaptive weights and combined with the MobileNetV3 network for DR severity classification. Liu et al. [12] considered DR severity classification as a fine-grained classification problem and proposed a compact bilinear pooling network model based on the attention mechanism, which improved the prediction accuracy while maintaining the computational efficiency of the model. Ramasamy et al. [32] extracted and fused ophthalmic features from retinal images based on texture gray-level features; DR severity was then classified using the sequential minimal optimization (SMO) classification method.
Different from previous approaches, this paper proposes a new attention mechanism together with a flexible adaptive convolution kernel sizing algorithm, which automatically adjusts the kernel size according to the size of the input feature matrix and fuses local channel correlations to obtain the global channel correlation of the feature map; this mechanism is combined with a deep convolutional neural network for DR severity classification. To determine the model's ability to detect lesion features at different stages, this paper uses a heatmap to visualize the areas to which the model pays special attention.

III. METHODOLOGY
A. ATTENTION MECHANISM
The attention mechanism was proposed by Treisman et al. [33] to simulate the human brain's attention: it derives attention weights for different factors, emphasizing the impact of a particular factor on the model's results. The attention mechanism has been widely used in deep learning tasks such as sequence-to-sequence modeling [34], image localization [35], image understanding [36], and lip translation [37]. The transformer structure proposed by the Google machine translation team [38] discards recursion and convolution entirely and processes feature sequences with attention alone; it achieved 28.4 BLEU on the WMT 2014 English-to-German translation task, 2 BLEU higher than the best result at the time.
The attention mechanism adaptively adjusts the higher-order abstract features extracted by the model for better performance and has increasingly been combined with computer vision in recent years. Hu et al. [39] proposed the ''Squeeze-and-Excitation'' (SE) structure. Assume the input feature map is I ∈ R^(H×W×C), where H, W, and C denote the height, width, and number of channels of the feature map, respectively. The output feature matrix of the SE structure can then be expressed as

O = I ⊗ σ(W_ex ReLU(W_sq gap(I))),

where O ∈ R^(H×W×C), gap denotes the global average pooling operation over each channel of the matrix, W_sq and W_ex are fully connected layers that downscale and upscale the channel dimension, and ReLU(·) and σ(·) denote the rectified linear unit and sigmoid activation functions. Owing to its flexibility and the clear performance improvement it brings to DCNNs, the SE structure is widely used in many classical network architectures, including MobileNet-v3 [40] and EfficientNet [41].
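The SE computation above (squeeze by global average pooling, excite by a two-layer bottleneck, then channel-wise rescaling) can be sketched in PyTorch as follows. This is an illustrative reimplementation; the class name and the reduction ratio of 16 are common conventions, not values taken from this paper.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation sketch: gap -> W_sq -> ReLU -> W_ex -> sigmoid -> scale."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # W_sq: channel downscaling
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),  # W_ex: channel upscaling
            nn.Sigmoid(),
        )

    def forward(self, x):                                # x: (N, C, H, W)
        s = x.mean(dim=(2, 3))                           # gap over H, W -> (N, C)
        w = self.fc(s).unsqueeze(-1).unsqueeze(-1)       # channel weights (N, C, 1, 1)
        return x * w                                     # O = I (x) sigma(...)
```

The two fully connected layers are exactly the bottleneck criticized later in this section: the reduction step shrinks the channel descriptor before re-expanding it, which saves parameters but discards some channel information.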
Although the inter-channel attention mechanism based on the SE structure considers the correlation between different channels of the feature matrix and weights the channels' influence on the model results, the fully connected layers used in the SE stages drastically increase the number of model parameters. The correlation between the channels of the feature map yields more nonlinear information, which benefits the performance of the network model [42]. The bottleneck structure composed of two fully connected layers in the SE structure reduces the number of parameters through dimensionality reduction to a certain extent, but it also discards some of the feature information extracted by the network, which limits model performance. SE-Var2 adopts a depth-wise separable convolution approach that reduces the number of parameters by learning the weight of each channel independently, but does not take the correlation between feature map channels into account; SE-Var3 adopts a fully connected mapping that considers channel correlation but significantly increases the number of parameters [42]. Although SE-Var3 improves on this problem, it still fails to strike a good balance between complexity and channel correlation.
The efficient channel attention (ECA) mechanism proposed by Wang et al. [42] makes a better trade-off between performance and complexity than the original SE structure and adopts an adaptive convolution kernel size to effectively extract the correlation between different channels of the feature matrix. The ECA structure is similar to channel convolution [41], which helps capture the intrinsic correlations between feature map channels; it uses a one-dimensional convolution with an adaptive kernel size in the channel attention mechanism, which greatly reduces the complexity of the model compared to the SE structure. In the ECA structure, assuming the input feature matrix is I ∈ R^(H×W×C),

y = σ(f_k(gap(I))),

where gap denotes the global average pooling operation, f_k denotes a one-dimensional convolution with kernel size k, k is positively correlated with the channel dimension C of the input feature matrix, and y = [y_1, y_2, …, y_C]. Finally, y is multiplied with the original matrix to obtain the inter-channel attention output

O = I ⊗ y,

where ⊗ denotes multiplication in the channel dimension and the final output feature matrix O ∈ R^(H×W×C). Although the ECA structure effectively controls the growth in model parameters and is an efficient inter-channel attention mechanism, the one-dimensional convolution means that the weight of each channel depends only on the fixed k channel features adjacent to it, ignoring the correlation between the global channel features of the feature map.
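For comparison with the SE sketch, the ECA computation can be written as follows. This is an illustrative reimplementation of the ECA idea with the adaptive odd kernel size from Wang et al. [42] (γ = 2, b = 1 are that paper's defaults, not this paper's).

```python
import math
import torch
import torch.nn as nn

class ECABlock(nn.Module):
    """ECA sketch: gap, then a 1-D conv over the channel descriptor with an
    adaptive odd kernel size, then sigmoid gating of the input channels."""
    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        t = int(abs(math.log2(channels) / gamma + b / gamma))
        k = t if t % 2 else t + 1                        # force an odd kernel size
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):                                # x: (N, C, H, W)
        s = x.mean(dim=(2, 3))                           # gap -> (N, C)
        y = self.conv(s.unsqueeze(1)).squeeze(1)         # local cross-channel mixing
        return x * torch.sigmoid(y).unsqueeze(-1).unsqueeze(-1)
</antml>```

Note that each output weight y_i only sees the k channels adjacent to channel i, which is precisely the locality limitation the GCA structure in the next subsection is designed to remove.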

B. GCA STRUCTURE
In this paper, we propose a global channel attention (GCA) model for feature maps. The GCA structure effectively overcomes the problem that the ECA structure only considers local channel correlations; in addition, GCA takes the correlations of all feature map channels into account while keeping the number of model parameters low, which effectively improves the model's perception of different channels. The GCA structure is shown in Fig. 1. Assuming the input feature matrix of the GCA structure is I ∈ R^(H×W×C), the feature of each channel is first obtained by a global average pooling operation: p = gap(I), p ∈ R^C.
For the extracted features p ∈ R^C, an adaptive one-dimensional convolution is performed to obtain the local inter-channel correlation features q:

q = swish(W_1 p),

where W_1 ∈ R^(C×C) is the parameter matrix of the one-dimensional convolution, a banded matrix in which each row contains the k shared convolution weights centered on the diagonal, and swish(·) is the activation function defined as

swish(x) = x · σ(x).

To determine the kernel size of the 1D convolution in W_1, this paper proposes an adaptive convolution kernel size design method based on the ECA structure [42], which adjusts the kernel size adaptively for different feature maps. To obtain more local inter-channel correlations, the number of channels of the feature map and the convolution kernel size are taken to be positively correlated, C ∝ k; this paper therefore assumes an exponential relationship between the number of channels and the kernel size,

C = φ(k) = 2^(γk − β),

where k denotes the convolution kernel size and γ and β are the parameters of the linear mapping in the exponent, which can be learned through the network training process. The kernel size k can therefore be expressed in terms of the number of feature map channels as

k = ψ(C) = |(log₂ C + β) / γ|_odd,

where |·|_odd denotes rounding to the nearest odd number. As a result, local inter-channel correlation features with window length k are extracted; each feature can be expressed as

q_i = swish(w_i · p_i(k)),  i = 1, 2, …, C,

where p_i(k) denotes the k channel features adjacent to p_i and w_i the shared convolution weights. After extracting the local inter-channel correlation feature q ∈ R^C, a global linear operation is performed on q to extract the global channel correlation feature r:

r = σ(W_2 q),

where W_2 ∈ R^(C×C) is the dense parameter matrix of the linear operation. The global channel correlation r = [r_1, r_2, …, r_C] is thus extracted from the local inter-channel correlation q, where each r_i can be expressed as

r_i = σ(Σ_{j=1}^{C} w_{ij} q_j),

with w_{ij} the entries of W_2. Finally, the global channel correlation features r ∈ R^C weight the input feature matrix I ∈ R^(H×W×C) along the channel dimension to obtain the output feature matrix O of the feature-map global attention mechanism:

O = I ⊗ r,

where ⊗ denotes a multiplying weighting operation along the feature map channel dimension.
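The two-step GCA computation above can be sketched in PyTorch as follows. This is an illustrative reimplementation based on the equations in this subsection, not the authors' released code; the class name and the fixed γ = 2, β = 1 values are our own assumptions (the paper learns γ and β during training).

```python
import math
import torch
import torch.nn as nn

class GCABlock(nn.Module):
    """GCA sketch. Step 1: local inter-channel correlation q via an
    adaptive-size 1-D convolution with swish. Step 2: global channel
    correlation r via a single dense C x C linear layer (W_2) and sigmoid."""
    def __init__(self, channels, gamma=2, beta=1):
        super().__init__()
        t = int(abs((math.log2(channels) + beta) / gamma))
        k = t if t % 2 else t + 1                        # adaptive odd kernel size
        self.local = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.globl = nn.Linear(channels, channels, bias=False)  # W_2

    def forward(self, x):                                # x: (N, C, H, W)
        p = x.mean(dim=(2, 3))                           # p = gap(I) -> (N, C)
        q = torch.nn.functional.silu(                    # swish(x) = x * sigmoid(x)
            self.local(p.unsqueeze(1)).squeeze(1))
        r = torch.sigmoid(self.globl(q))                 # r = sigma(W_2 q)
        return x * r.unsqueeze(-1).unsqueeze(-1)         # O = I (x) r
```

Compared with SE, the only dense mapping here is a single C × C layer applied after the cheap local convolution, so global channel correlation is captured without the SE bottleneck's dimensionality reduction.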
Unlike the SE structure and ECA structure, the GCA structure proposed in this paper overcomes both the loss of the feature map channel information caused by dimensionality reduction processing in the SE structure and the drawback of failing to consider the global channel correlation in the ECA structure, while the adaptive convolution kernel size adjustment method proposed in this paper can extract local channel correlation information at different scales according to the feature maps of different tasks.
The GCA structure proposed in this paper extracts the global channel correlation information of the feature map in two steps. The first step begins with the local inter-channel correlation derived by adaptive one-dimensional convolution with a small number of parameters; the second step integrates the local inter-channel correlation and extracts the global channel correlation. The two-step operation effectively avoids the huge number of parameters caused by two fully connected operations, so that the model does not suffer from overfitting problems and also extracts the global channel correlation features. In the later sections of this paper, a disease severity classification model based on the GCA structure will be proposed and trained based on transfer learning, and finally the performance of the model in this paper will be evaluated using the experiment results.

C. GENET STRUCTURE
The deep convolutional neural network used in this paper is based on EfficientNet [41], which was obtained through neural architecture search (NAS) by jointly balancing network width, depth and input image resolution, achieving better performance with a relatively small number of parameters. Depending on the input image resolution and the model width and depth, EfficientNet is divided into eight models, EfficientNet-B0 to EfficientNet-B7 [41]. EfficientNet-B7 exceeded the accuracy of the best GPipe model at the time with 8.4 times fewer parameters and 6.1 times faster inference [41]. However, EfficientNet uses the same inverted residual structure, MBConv, as MobileNetV2 [43], in which the inter-channel attention mechanism is an SE structure consisting of two fully connected layers; as a result, the model has a larger number of parameters and loses some information through the dimensionality reduction in the SE structure.
In this paper, we improve on the MBConv convolutional structure and propose the GConv structure, which integrates the GCA attention mechanism into the MBConv structure, as shown in Fig. 2. Finally, with reference to the EfficientNet-B0 model derived from NAS technology, the GCA-EfficientNet (GENet) severity classification model used in this paper for DR is proposed, as shown in Fig. 3.
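The GConv idea (an MBConv-style inverted residual whose channel attention slot holds the GCA block instead of SE) can be sketched as follows. This is an illustrative sketch of the block layout only; the expansion ratio, kernel size, and the `attention` placeholder (where the paper's GCA module would go) are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class GConv(nn.Module):
    """MBConv-style inverted residual sketch: 1x1 expand -> 3x3 depthwise ->
    channel attention -> 1x1 project, with a skip when shapes match.
    `attention` is a stand-in slot; the paper places its GCA block here."""
    def __init__(self, cin, cout, expand=4, stride=1, attention=None):
        super().__init__()
        mid = cin * expand
        self.block = nn.Sequential(
            nn.Conv2d(cin, mid, 1, bias=False), nn.BatchNorm2d(mid), nn.SiLU(),
            nn.Conv2d(mid, mid, 3, stride, 1, groups=mid, bias=False),  # depthwise
            nn.BatchNorm2d(mid), nn.SiLU(),
            attention if attention is not None else nn.Identity(),      # GCA slot
            nn.Conv2d(mid, cout, 1, bias=False), nn.BatchNorm2d(cout),
        )
        self.skip = stride == 1 and cin == cout

    def forward(self, x):
        y = self.block(x)
        return x + y if self.skip else y                 # inverted residual
```

Stacking such blocks following the EfficientNet-B0 stage layout would then yield a GENet-like backbone.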

D. VISUALIZATION
To address the invisibility inside a convolutional neural network model, which behaves like a ''black box'', this paper uses Grad-CAM [44] to visualize the attention region of the CNN for the input image in the form of a heat map. The gradient of the score y^c of any category c with respect to the feature map A^k of a convolution layer, i.e. ∂y^c/∂A^k_ij, is first calculated, then a global average pooling operation is applied over the height and width dimensions, and the corresponding weight scores are calculated as follows:

ϕ^c_k = (1/Z) Σ_i Σ_j ∂y^c/∂A^k_ij,

where Z is the number of spatial positions in the feature map. The weight ϕ^c_k indicates the importance of feature map A^k for the prediction result; after filtering out the effect of negative values with ReLU(·), the final CNN visualization for DR severity detection is obtained as follows:

L^c = ReLU(Σ_k ϕ^c_k A^k).
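The two Grad-CAM equations above can be implemented with forward and backward hooks on the chosen convolution layer. The following is a minimal sketch under the assumption that `model` maps an image batch to class scores; the function name and hook-based design are ours, not the paper's code.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, layer, image, class_idx):
    """Minimal Grad-CAM sketch: weight each feature map A^k by the spatial
    average of d y_c / d A^k, sum over k, and pass the map through ReLU."""
    feats, grads = [], []
    h1 = layer.register_forward_hook(lambda m, i, o: feats.append(o))
    h2 = layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
    score = model(image)[0, class_idx]                   # y^c for the chosen class
    model.zero_grad()
    score.backward()                                     # fills grads via the hook
    h1.remove(); h2.remove()
    A, dA = feats[0].detach(), grads[0]                  # (1, K, h, w) each
    phi = dA.mean(dim=(2, 3), keepdim=True)              # phi^c_k = gap of gradients
    cam = F.relu((phi * A).sum(dim=1))                   # L^c = ReLU(sum_k phi^c_k A^k)
    return cam / (cam.max() + 1e-8)                      # normalize for display
```

Upsampling `cam` to the input resolution and overlaying it on the fundus image then yields heat maps like those in Fig. 8.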

IV. EXPERIMENT

A. DATASET INTRODUCTION
The dataset used in this experiment is a Kaggle competition dataset containing 35,126 high-resolution color fundus images, which were divided into 5 categories by professional clinicians according to the severity of DR; the number of samples in each category is shown in Table 1. Because the number of samples varies greatly between categories, which would negatively affect the model's results, this imbalance is addressed by the preprocessing described in the next section.

B. PRE-PROCESSING AND DATA AUGMENTATION
The dataset used in this experiment was captured by fundus cameras in different environments, which introduces noise during data collection and negative effects such as uneven lighting, so image preprocessing is necessary to reduce the impact of noise on the experiment results and improve the learning of the network model. Meanwhile, to address the uneven number of DR images across classes, this paper applies data augmentation to the negative samples so that the number of samples in each class is basically the same.

1) IMAGE PRE-PROCESSING
The image pre-processing process comprises the following steps:
(A) Remove the black margin around the fundus image to reduce the impact of unnecessary information.
(B) Gaussian filtering, which suppresses Gaussian noise introduced during image acquisition.
(C) Adaptive gamma correction, which mitigates uneven lighting during image collection, corrects over- and under-exposed images, and enhances contrast.
(D) Contrast-limited adaptive histogram equalization (CLAHE): the image is converted from the RGB color space to the Y-Cr-Cb color space, then adaptive histogram equalization is applied to the luminance channel to reduce the impact of uneven gray-scale values.
After the above operations, the pre-processed DR images are obtained; pairs of DR images before and after preprocessing for the different categories are shown in Fig. 4. The pre-processed DR images are better suited to DCNN learning, allowing the proposed GENet model to better capture detailed features.
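Steps (A) and (C) above can be sketched in plain NumPy as follows. This is an illustrative sketch only: the margin threshold and target mean brightness are our assumptions, and the paper's pipeline additionally applies Gaussian filtering and CLAHE, for which OpenCV's `cv2.GaussianBlur` and `cv2.createCLAHE` would typically be used.

```python
import numpy as np

def preprocess(img):
    """Sketch of black-margin removal and adaptive gamma correction.
    img: H x W x 3 uint8 fundus image."""
    # (A) crop the black margin: keep rows/cols whose max intensity exceeds a threshold
    mask = img.max(axis=2) > 10
    rows, cols = np.any(mask, axis=1), np.any(mask, axis=0)
    img = img[rows][:, cols]
    # (C) adaptive gamma correction: pick gamma that maps the mean brightness toward 0.5
    x = img.astype(np.float32) / 255.0
    gamma = np.log(0.5) / np.log(x.mean() + 1e-8)
    return (np.clip(x ** gamma, 0.0, 1.0) * 255).astype(np.uint8)
```

The gamma chosen this way brightens under-exposed images (mean < 0.5 gives gamma < 1) and darkens over-exposed ones, matching the intent of step (C).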

2) DATA AUGMENTATION
Although the pre-processed DR images could be used directly for training, Table 1 shows that the number of DR images varies greatly between categories, which would adversely affect the network model. This paper therefore applies data augmentation to the negative samples: rotating (90°, 180°, 270°), flipping horizontally and vertically, and cropping at the four corners and the center of the negative sample images. This makes the number of samples in each class basically the same and solves the sample imbalance problem. A comparison of the sample numbers before and after data augmentation is shown in Table 2.
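The augmentations above (three rotations, two flips, five crops) can be sketched as follows; the crop fraction of 0.9 is our assumption, since the paper does not state the crop size.

```python
import numpy as np

def augment(img, crop=0.9):
    """Generate the ten augmented views described above from one image:
    rotations by 90/180/270 degrees, horizontal and vertical flips, and
    crops at the four corners plus the center."""
    h, w = img.shape[:2]
    ch, cw = int(h * crop), int(w * crop)
    out = [np.rot90(img, k) for k in (1, 2, 3)]          # 90, 180, 270 degree rotations
    out += [img[:, ::-1], img[::-1]]                     # horizontal / vertical flips
    corners = [(0, 0), (0, w - cw), (h - ch, 0), (h - ch, w - cw),
               ((h - ch) // 2, (w - cw) // 2)]           # four corners + center
    out += [img[r:r + ch, c:c + cw] for r, c in corners]
    return out                                           # 10 augmented samples
```

Applying all ten transforms to each minority-class image multiplies its sample count by roughly an order of magnitude, which is how the class counts in Table 2 can be brought close to parity.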

C. EXPERIMENT ENVIRONMENT
The GENet proposed in this paper runs under PyTorch 1.7.0 and Python 3.6. The dataset is divided into training and validation sets at a ratio of 8:2, the image resolution is set to 224 × 224, and the cross-entropy loss function is used. The model is trained for 100 epochs with stochastic gradient descent with momentum as the optimizer, an initial learning rate of 0.01 and a momentum of 0.9. To ensure the model finally converges, the cosine annealing learning-rate adjustment strategy shown in Fig. 5 is used.
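The training configuration above maps directly onto standard PyTorch components; a minimal sketch follows, with a stand-in model in place of GENet (which is defined elsewhere in the paper).

```python
import torch

model = torch.nn.Linear(10, 5)                 # placeholder for GENet
criterion = torch.nn.CrossEntropyLoss()        # cross-entropy loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(100):
    # ... one pass over the 8:2-split training set would go here ...
    scheduler.step()                           # cosine-annealed learning rate
```

Over the 100 epochs the learning rate follows a half cosine from 0.01 down to (approximately) zero, which matches the curve shown in Fig. 5.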
The lack of sufficient labeled data is a major challenge in medical image processing: when training DCNN models, a small training set is prone to overfitting. In addition, deep learning systems require much more training time and data than traditional machine learning systems. To address these problems, this paper adopts transfer learning in the model training stage. Transfer learning is a deep learning training strategy in which a pre-trained model with generalized features is reused for another task; in computer vision, specific low-level features such as edges, shapes, and textures can be shared between tasks. Therefore, transfer learning can fine-tune a pre-trained model on a downstream task, greatly saving training time and data, allowing the model to converge as soon as possible and avoiding the overfitting problem.
To reach the desired performance as soon as possible, the parameters of the structures shared by GENet and EfficientNet-B0 were used to initialize the GENet network model via transfer learning in the training stage; the GCA modules, whose structures differ, were initialized with Kaiming initialization [45], and the comparison models were initialized with the official pre-trained weights provided by PyTorch. To analyze the effectiveness of GENet in DR disease detection, GENet and classical DCNN networks are compared in the same environment; the experiment results are analyzed in the next section.
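The initialization strategy above (copy shared EfficientNet-B0 weights, Kaiming-initialize the new GCA layers) can be sketched as follows. The function name and the assumption that GCA submodules contain "gca" in their names are ours, purely for illustration.

```python
import torch

def init_from_pretrained(genet, pretrained_state):
    """Copy every pretrained tensor whose name and shape match genet's
    state_dict, then Kaiming-initialize the GCA layers; returns the number
    of transferred tensors."""
    own = genet.state_dict()
    shared = {k: v for k, v in pretrained_state.items()
              if k in own and v.shape == own[k].shape}
    own.update(shared)
    genet.load_state_dict(own)                           # transfer shared weights
    for name, m in genet.named_modules():
        if "gca" in name and isinstance(m, (torch.nn.Conv1d, torch.nn.Linear)):
            torch.nn.init.kaiming_normal_(m.weight)      # Kaiming init for new modules
    return len(shared)
```

In practice `pretrained_state` would be an EfficientNet-B0 checkpoint; only tensors whose names and shapes survive the MBConv-to-GConv change are transferred.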

D. EXPERIMENT CONCLUSION AND ANALYSIS
To evaluate the DR classification performance of GENet rigorously, the model trained for 100 epochs was comprehensively evaluated on the validation set in terms of accuracy, precision, sensitivity and specificity. The confusion matrix is defined in Table 3; the confusion matrix for DR severity classification was drawn and the GENet model was compared with the classical DCNNs. As shown in Table 3, accuracy, precision, sensitivity, and specificity are defined as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)    (17)
Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)

Experiment 1: In this paper, the performance of the model trained for 100 epochs is evaluated on the validation set to obtain the detection results of GENet for DR of different severities; the confusion matrix obtained from the experiment is shown in Fig. 6.
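For reference, the four metrics defined above can be computed directly from the confusion-matrix counts; a minimal sketch (one-vs-rest form for the multi-class case):

```python
def metrics(tp, tn, fp, fn):
    """Confusion-matrix metrics as defined above (TP/TN/FP/FN counts)."""
    return {
        "accuracy":    (tp + tn) / (tp + tn + fp + fn),
        "precision":   tp / (tp + fp),
        "sensitivity": tp / (tp + fn),   # recall / true-positive rate
        "specificity": tn / (tn + fp),   # true-negative rate
    }
```

For the five-class DR task, each severity level is treated as the positive class in turn and the per-class results are averaged.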
It can be seen from Fig. 6 that GENet achieves an accuracy of 95.63% after 100 epochs of training; the classification performance for DR of different severities is shown in Table 4. Table 4 shows that GENet achieves excellent classification results for Severe Non-Proliferative DR and Proliferative DR, with slightly lower performance for No DR and Mild DR. We conjecture that this is due to the relatively small difference between these two categories in the color fundus images; because the GENet model is not over-fitted, this problem should be overcome by training with more samples and for a longer period.
To verify the above inference, we set the number of epochs to 150 and trained GENet with the other hyperparameters unchanged. The confusion matrix obtained after 150 epochs of training is shown in Fig. 7, and the detection performance for the different DR severities is shown in Table 5. Comparing Fig. 7 with Fig. 6 and Table 5 with Table 4 shows that our analysis is correct: GENet significantly improves the detection of No DR and Mild DR. Therefore, the GENet model based on the GCA attention mechanism proposed in this paper can effectively detect the different DR severities.
Convolutional neural networks are like a ''black box''; however, Grad-CAM [44] can visualize the regions of interest of a CNN in the form of a heat map, providing an intuitive understanding of how the CNN model derives its predictions. The shallow layers of a convolutional neural network preserve lower-order features of the image such as contours, edges, and textures, while the deep layers preserve higher-order semantic features. To gain an intuitive understanding of the attention mechanism in GENet and of how GENet makes predictions internally, blocks of different depths were selected for visualization in this paper. The heat maps of DR images based on Grad-CAM are shown in Fig. 8, which demonstrates the different levels of attention that GENet, using the GCA attention mechanism, pays to different regions of the fundus image. In the heat map, red indicates that the image features in the region have higher weights and blue indicates lower weights. Fig. 8 shows that feature maps at different depths in the model capture different information about the lesions reflected in the DR images.
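The core Grad-CAM computation is compact: the gradient of the target class score with respect to a convolutional feature map is globally average-pooled into per-channel weights, which then form a ReLU-rectified weighted sum of the channels. A minimal NumPy sketch of this step, with synthetic activations and gradients standing in for a real forward/backward pass through GENet:

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM heat map for one conv layer.
    activations: (C, H, W) feature maps of the chosen block.
    gradients:   (C, H, W) d(class score)/d(activations).
    Returns an (H, W) map normalized to [0, 1]."""
    # Per-channel weights: global average pooling of the gradients.
    weights = gradients.mean(axis=(1, 2))            # shape (C,)
    # Weighted combination of the channels, then ReLU.
    cam = np.maximum((weights[:, None, None] * activations).sum(axis=0), 0.0)
    # Normalize for display as a heat map.
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam

# Synthetic stand-in data (a real implementation would capture these
# with forward/backward hooks on the chosen GENet block):
rng = np.random.default_rng(0)
acts = rng.random((8, 7, 7))
grads = rng.standard_normal((8, 7, 7))
cam = grad_cam(acts, grads)
print(cam.shape, float(cam.min()), float(cam.max()))
```

The resulting map is upsampled to the input resolution and overlaid on the fundus image to produce figures like Fig. 8.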

Experiment 2:
To verify the effectiveness of GENet in DR severity classification, GENet and DenseNet-121 were compared on the same training and validation sets under the same experiment environment; the accuracy is shown in Fig. 9. The figure shows that, under the same experiment environment, the accuracy of the proposed GENet in detecting DR is better than that of DenseNet-121, and no overfitting occurs as training continues.
We also compared GENet with classical DCNN models in the same experiment setting as above, analyzing the accuracy, precision, sensitivity, and specificity after 100 training epochs. The results are shown in Table 6. Table 6 shows that all evaluation indexes of the GENet network are better than those of the classical CNN networks, indicating that the DCNN model based on the GCA attention mechanism proposed in this paper achieves better performance in the disease severity classification task.

V. CONCLUSION
Diabetic retinopathy is one of the major complications of diabetes mellitus; failure to diagnose and treat it in time can lead to severe vision loss or even complete blindness. However, diabetic retinopathy can be prevented through routine screening and effective treatment, thus avoiding irreversible blindness. With the continuous development of machine learning and artificial intelligence technologies, an increasing number of machine learning techniques are being used in the medical field to assist doctors in routine diagnosis and treatment.
Therefore, this paper proposes a global channel attention mechanism for feature maps, named the GCA attention mechanism. Furthermore, a deep convolutional neural network model, GENet, in which the GCA attention mechanism and EfficientNet are integrated, is proposed for the early detection of diabetic retinopathy. In the disease feature extraction stage, to allow the network model to fully consider the correlation between feature map channels, this paper proposes an adaptive convolutional kernel size adjustment algorithm for extracting local channel correlation, which lets GENet adaptively adjust the convolutional kernel size for different tasks, enabling the network model to achieve better performance. The training process uses transfer learning techniques and a cosine annealing algorithm to ensure that the model converges as quickly as possible. The final GENet model achieves 0.956 accuracy, 0.956 precision, 0.956 sensitivity and 0.989 specificity on the DR validation set, demonstrating that the deep convolutional neural network model based on the GCA attention mechanism proposed in this paper is effective in classifying the severity of DR. In future work, we will combine the GCA attention mechanism with more deep learning models to improve the model's ability to detect small differences between categories, so that GENet can be used in more scenarios.
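The cosine annealing strategy mentioned above decays the learning rate from its initial value to a minimum along half a cosine period. A minimal sketch of the schedule (the rates and period below are illustrative values, not the paper's actual hyperparameters):

```python
import math

def cosine_annealing_lr(epoch, total_epochs, lr_max=1e-3, lr_min=0.0):
    """Cosine annealing: decay lr_max to lr_min over total_epochs
    following half a cosine period."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * epoch / total_epochs))

# The rate starts at lr_max, passes the midpoint halfway through,
# and reaches lr_min at the final epoch.
for e in (0, 50, 100):
    print(e, cosine_annealing_lr(e, 100))
```

In a PyTorch training loop this corresponds to `torch.optim.lr_scheduler.CosineAnnealingLR`, which the authors presumably used to implement this strategy.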
BINHUA YANG received the bachelor's degree in electronic science and technology from the Chengdu University of Information Technology, in 2019, where he is currently pursuing the master's degree.
He has participated in many mathematical modeling competitions and won awards, received several academic scholarships, and has several software copyrights. His research interests include deep learning and computer vision.
TONGYAN LI received the Ph.D. degree in communication and information system from the University of Electronic Science and Technology of China, in 2010.
Since 2010, she has been a Graduate Student Tutor with the Department of Communication Engineering, Chengdu University of Information Technology. She has been involved in a number of projects, like the ''863'' major project, National Natural Science Fund Project, Fund Project of Sichuan Provincial Department of Education and Found Project of Science, and Technology Department in Sichuan Province. As the first author, she has more than 20 academic papers published in academic journals and conferences, of which two papers were indexed by SCI, and more than 20 papers were indexed by EI. Her research interests include data mining, recommendation systems, big data processing, machine learning, and artificial intelligence methods.
HAIDI XIE received the bachelor's degree from the Chengdu University of Information Technology, where she is currently pursuing the degree majoring in electronic information with the School of Communication Engineering.
Her completed projects include a personalized recommendation system and the design and implementation of a weather data platform for the Internet of Things based on the REST architecture. She is currently writing a book on recommender systems. Her research interests include data mining, recommendation systems, and big data processing, and she is currently researching recommendation algorithms.
YULIN LIAO is currently pursuing the degree with the Chengdu University of Information Technology, majoring in communication engineering and electronic information. In 2016, she joined the School of Communication Engineering, Chengdu University of Information Technology. During her undergraduate period, she participated in the recommendation system and other projects and won provincial awards. She won the third prize in the 15th Wuyi Mathematical Modeling Contest and a school-level scholarship for excellence. Her current research interests include artificial intelligence and intelligent information processing.
YI-PING PHOEBE CHEN (Senior Member, IEEE) has been a Professor and the Chair with the Department of Computer Science and Information Technology, La Trobe University, Melbourne, Australia, since April 2010. She has been working in many emerging areas, such as bioinformatics, multimedia, artificial intelligence, scientific visualization, pattern recognition, health informatics, data mining, deep learning, and databases. She has published over 240 research papers, many of which appeared in top journals and conferences. She is the Steering Committee Chair of the Asia-Pacific Bioinformatics Conference (a founder) and of the International Conference on Multimedia Modeling. She has been on the program committees of over 100 international conferences, including top-ranking conferences, such as ICDE, ICPR, ISMB, and CIKM.