Texture Attention Network for Diabetic Retinopathy Classification

Diabetic Retinopathy (DR) is a disease caused by a high level of glucose in the retinal vessels. It puts millions of people around the world at risk of vision loss each year. Because the disease threatens vision, early diagnosis can be an effective step in the treatment and prevention of vision loss. Computer-aided diagnosis methods automate the early diagnosis process: they not only detect the diabetic signatures but also provide the diabetic grade, which helps the optometrist determine an appropriate treatment. Several deep classification models have been proposed in the literature for diabetic retinopathy classification; however, these methods usually lack an attention mechanism to better encode the semantic dependency and highlight the most important regions for boosting model performance. To overcome this limitation, we propose to incorporate a style and content recalibration mechanism inside the deep neural network to adaptively scale the informative regions for diabetic retinopathy classification. In our proposed method, the input image passes through the encoder module to encode both high-level and semantic features. Next, by utilizing a content and style separation mechanism, we decompose the representational space into a style (e.g., texture features) and a content (e.g., semantic and contextual features) representation. The texture attention module takes the style representation and applies a high-pass filter to highlight the texture information, while the spatial normalization module uses a convolutional operation to determine the more informative regions of the retina image for detecting diabetic signs. Once the attention modules have been applied to the representational features, the fusion module combines both features to form a normalized representation for the decoding path. The decoder module in our model performs both diabetic grading and healthy versus non-healthy classification.
Our experiments on the APTOS Kaggle dataset (accuracy 0.85) demonstrate a significant improvement over previously reported results, which indicates the applicability of our method in real-world scenarios.


I. INTRODUCTION
In the healthcare field, early diagnosis is a vital step since diseases are more treatable in their early stages. Millions of people around the world suffer from diabetes, a disease in which the amount of glucose in the blood increases due to a lack of control over the amount of insulin. According to the International Diabetes Federation [1], 425 million adults in the world are affected by diabetes. If not controlled well, diabetes can damage various parts of the human body, including the heart, kidneys, feet, nerves, and eyes [2]. Among these, the retina is extremely sensitive: Diabetic Retinopathy (DR), the diabetic disease of the retina, can result in vision loss if left untreated. Figure 1 shows the effects of DR on the retinal vessels.
The retina is a thin tissue that lines the back surface of the eye, excluding the area of the optic nerve. It contains light-sensitive cells that receive light, convert it into neural signals, and coordinate with the brain to process visual information. Like other organs in the human body, the retina receives its nourishment through blood vessels. A high blood sugar (glucose) level can cause DR: the tiny blood vessels that nourish the retina become blocked, cutting off its blood supply; the eye then tries to grow new blood vessels, but they do not develop well and start to weaken. In other words, DR causes the blood vessels of the retina to swell, leak fluid, or bleed, which often leads to vision impairment or blindness. DR causes 2.6% of blindness worldwide [4] and is the most prevalent microvascular complication among patients with diabetes mellitus [5].
VOLUME 4, 2016
FIGURE 1: Effects of diabetic retinopathy on the retinal vessels [3]. In the early stages of the disease, the walls of the retinal blood vessels weaken. Tiny bulges protrude from the vessel walls and may leak or ooze fluid and blood into the retina, causing the retinal tissue to swell and producing white spots in the retina. As DR progresses, new blood vessels may grow and threaten vision.
Diabetic retinopathy is a progressive eye disease classified into two types and four stages. The two types are non-proliferative (NPDR) and proliferative (PDR). NPDR refers to the early stages of the disease and is characterized by lesions such as microaneurysms (MAs) and exudates, whereas PDR is an advanced form of the disease, indicated by neovascularization of weak blood vessels. Besides, the four stages of diabetic retinopathy show the evolution cycle and the intensity of DR. These stages are:
• Mild nonproliferative diabetic retinopathy: This is the earliest stage of DR. Microaneurysms, tiny areas of swelling in the blood vessels of the retina, are characteristic of this stage. A small amount of fluid may leak into the retina and trigger swelling of the macula.
• Moderate nonproliferative diabetic retinopathy: In this stage, nourishment cannot reach the retina because the blood vessels are swollen and the paths for blood to reach the retina are blocked. As a result, blood and other fluids accumulate in the macula.
• Severe nonproliferative diabetic retinopathy: At this stage, the number of blocked blood vessels increases and causes a considerable reduction in blood flow. At this point, new blood vessels start to grow in the retina.
• Proliferative diabetic retinopathy: This stage is considered the most dangerous because a large number of fragile blood vessels have formed in the retina. At this stage, fluid leakage and visual disturbances such as blurriness, a reduced field of vision, and even blindness can occur at any time.
Figure 2 shows sample diabetic retinopathy images for the different grade levels.
FIGURE 2: Sample diabetic retinopathy images with stages zero (healthy) to four (proliferative diabetic retinopathy), from top left to bottom right. Samples from [6].
The risk of DR-induced vision loss is increasing every year. In this regard, diagnosing the disease in its early stages can be an effective step in the treatment and prevention of the disease. By investigating certain types of retinal lesions, such as microaneurysms (MA), hemorrhages (HM), and soft and hard exudates (EX), medical experts can detect DR. However, this is not always possible due to a lack of expertise or the high cost of the process. Furthermore, in many cases, manual examination of medical images is not very accurate because of human error and factors such as fatigue.
The use of automated DR identification methods not only gives many people the benefits of early diagnosis but also reduces costs, saves time, and increases the accuracy of the diagnosis. Numerous studies in recent years have shown that computer-aided diagnosis (CAD) methods are extremely useful in medical image processing [7]. In this field, advanced machine learning-based methods have been proposed to automatically segment and classify retina images. Segmentation and classification methods process images taken from the retina, identify the diseased areas, and detect the stage of the disease. This allows ophthalmologists to focus directly on the diseased areas and apply appropriate treatments. Classical machine learning techniques first extract retinal features from the original input image through steps such as optic disk detection, vessel enhancement, and lesion segmentation. Then, classifiers such as K-Nearest Neighbors (KNN), Support Vector Machine (SVM), and Naive Bayes are used to categorize the images [8]-[10]. These typical machine learning methods use specific handcrafted features to learn discriminative patterns from the image itself. As a result, they fail to learn the complex structure of the patterns and usually cannot model the underlying structure of the abnormality. Consequently, the applicability of these engineered methods in the clinical domain is limited.
Numerous studies have been conducted in recent years on DR grade classification. K. Xu et al. [19] proposed a CNN-based method for classifying images into normal and DR images. During the preprocessing step, they applied data augmentation, image resizing, and normalization. Their model includes eight convolutional (CONV) layers, four max-pooling layers, two fully connected (FC) layers, and a softmax function at the last layer to produce a binary classification. Li et al. [16] employed a Deep CNN (DCNN) for classifying DR images. More specifically, they utilized fractional max-pooling to extract more discriminative features from the input data and classified the obtained features using an SVM. R. Pires et al. [20] proposed a 16-layer CNN model to classify DR images into referable and non-referable classes and used dropout and L2 regularization to avoid overfitting during training. Sungheetha et al. [21] proposed a method to classify the diabetic condition of a retina image into five classes: No DR, Mild DR, Moderate DR, Severe DR, and Proliferative DR. The condition was detected by analyzing the hard exudates spotted in the blood vessels of the eye using a CNN model.
Although these deep learning methods have boosted diabetic retinopathy grading performance, they still fail to model the local contextual dependency inside the representation space when recognizing the diabetic retinopathy level. To address this limitation, we propose to incorporate the attention mechanism at both the texture and semantic levels. In our design, the representation space is decomposed into style and content features, on which we perform two parallel attention operations to accentuate the more informative texture and spatial features. By fusing both normalized features and following the decoding path, the model produces a grading label for each retina image. The contributions of this paper can be summarized as follows:
• Incorporating attention mechanisms to adaptively highlight texture information, which plays a significant role in recognizing diabetic retinopathy
• A style and content decomposition module to separate texture and semantic representations
• State-of-the-art (SOTA) results on a public dataset for diabetic retinopathy classification
The rest of the paper is organized as follows. Section II reviews related work. The proposed network is presented in Section III. The experimental results are described in Section IV. Finally, Section V concludes the paper.

II. RELATED WORK
Diabetic retinopathy classification plays an important role in the diagnosis of DR and helps prevent blindness through early detection of the disease. Similar to other research lines in the computer vision field, diabetic retinopathy classification approaches can be categorized into handcrafted (engineered-feature-based) and deep learning-based approaches. The handcrafted methods focus on designing specific features to learn discriminative patterns from the image itself, while deep learning methods learn features directly from the data and can discover hidden patterns in the input that are not explicitly apparent. In the following, we review some of the proposed methods in each category.

A. HANDCRAFTED APPROACHES
Akram et al. [22] combined support vector machines with a Gaussian Mixture Model (GMM) to identify and classify microaneurysms in the retina for early detection of diabetic retinopathy. Furthermore, they improved their method by enriching the feature set with the shape, intensity, and statistics of the affected region in their follow-up study [23]. Roychowdhury et al. [24] proposed a two-step hierarchical classification approach that uses a computer-aided screening system (DREAM) for classifying the severity grade of DR. They investigated varying illumination and fields of view for generating the severity grade.
Zhang et al. [25] utilized a three-step method for DR screening. First, they apply a series of normalization and denoising steps to detect reflections and artifacts in the image. Then, the images are segmented using mathematical morphology operations. Finally, image regions are classified into lesion and non-lesion areas using a binary random forest classifier. The random forest classifier may fail in cases where images do not have many distinctive features. Adal et al. [26] classify the DR severity grade by calculating the absolute difference between the extrema of the multiscale blobness responses of fundus images at two time points and applying SVM and K-Nearest Neighbors (KNN) classification algorithms. Approaches based on handcrafted features need a heuristic feature extraction stage, which makes these methods challenging to design and their results less satisfying. In complex tasks, manual feature extraction may not work well.
Thus, alternative methods are needed that do not require a manual feature extraction process and are able to discover hidden patterns within the data itself. To this end, deep learning models were proposed with the capability to automatically extract features, reveal hidden patterns in the data, and handle large amounts of data, and they have outperformed typical handcrafted methods significantly.

B. DEEP LEARNING APPROACHES
Deep learning methods, especially CNNs, have made great strides in recent years and have successfully performed several complex tasks. Their hierarchical learning capability and extraction of high-level features have made CNNs much more powerful than methods that work solely with raw image features. Various CNN architectures have been introduced in recent years; some state-of-the-art architectures include, but are not limited to, the Fully Convolutional Neural Network (FCN) [27], U-Net [28], SegNet [29], hourglass [30], and DeepLab [31]. In the following, we review some recent CNN-based methods for diabetic retinopathy classification.
Quellec et al. [32] proposed a method based on a heat map optimization scheme for identifying DR. They employed a back-propagation-based CNN for image-level classification to automatically detect lesions in retinal images. Zang et al. [33] utilized a three-level CNN-based classification method to perform DR classification. The first classifier determines whether the DR is referable or non-referable. Then, the second classifier classifies the eye as non-DR, nonproliferative DR (NPDR), or proliferative DR (PDR). Finally, the third classifier separates the cases into no DR, mild and moderate NPDR, severe NPDR, and PDR. Kassani et al. [34] built on the Xception model [35] and introduced a new model for classifying DR by inserting a deep layer aggregation that receives multilevel features from diverse convolutional layers. A multi-layer perceptron (MLP) then classifies these features. Jain et al. [36] utilized different data augmentation techniques to balance the input data during the preprocessing stage. Furthermore, they trained three different classifier networks, VGG16, VGG19, and InceptionV3 [37], [38], for both binary and 5-class DR classification. Based on the evaluation results, they showed that the VGG19 network, which contains a large number of convolutional and pooling layers, was more efficient than the other models. Mateen et al. [39] fine-tuned a pre-trained VGG19 model, passing the input data through its layers to extract features and classify DR. They utilized Principal Component Analysis (PCA) and Singular Value Decomposition (SVD) to reduce the feature dimension and avoid overfitting during the model training stage.
Dai et al. [40] proposed a method for detecting microaneurysms in fundus images by integrating an image-to-text mapping scheme with a multi-sieving CNN framework. With this approach, they addressed one of the major challenges in retina image classification, namely that the proportion of relevant information (the microaneurysms that are critical for ophthalmologists) in retina images is lower than that of irrelevant information. The image-to-text mechanism is used as a clinical report. Alryalat et al. [10] presented a two-stage deep learning model for retina segmentation and for predicting the response to intravitreal anti-VEGF injections among Diabetic Macular Edema (DME) patients. They first utilized an attention-based U-Net for the segmentation task and then passed the segmentation map through a classifier network to determine whether the patient would respond to the anti-VEGF injection (a binary classification task). Jaskari et al. [41] addressed the uncertainty challenges in clinical applications by developing a Bayesian classification method that models the underlying uncertainty in grading diabetic retinopathy from retina images. The evaluation results showed that using entropy-based uncertainty estimation improved the within-distribution uncertainty performance. Zia et al. [42] combined Inception-V3 with a VGG network to distinguish the key precursors of Diabetic Retinopathy. After extracting features from the input data, they used an entropy concept to select the most discriminating features. Their model is capable of highlighting the blood vessels, fluid leakage, exudates, hemorrhages, and microaneurysms in the input retina images.
One of the central limitations of previous work on recognizing the diabetic retinopathy grade is the failure to model the hidden structure that exists in the texture of retina images. To adaptively highlight this hidden information inside the representational space, we propose to incorporate texture and spatial attention mechanisms on top of the network bottleneck. In the next section, we present our method in more detail.

III. PROPOSED METHOD
Automatic diabetic retinopathy detection provides an early signal that helps an expert doctor design a specific treatment. Thus, it plays a critical role in the diagnosis and treatment process. To automate this process in an end-to-end manner, several deep learning-based methods have been proposed in the literature [5], [41], [42]. As described in the previous section, these methods do not incorporate an attention mechanism to highlight the more informative regions and patterns that exist inside retina images for detecting the grade of diabetic retinopathy from the image itself. To address this challenge, we design a network that learns the intrinsic patterns inside retina images by utilizing a parallel attention mechanism. The general structure of the proposed method is visualized in Figure 3. In the next subsections, we discuss each part in more detail.

A. PREPROCESSING
Nowadays, with the rapid progress in imaging systems, clinical imaging devices are produced by many different companies. Although these retinal imaging devices follow the same pipeline to produce retina images, the produced images may vary in terms of intensity, colour, and shape due to the intrinsic characteristics of each device. If such variations are not addressed, they can affect the network training process, and consequently the trained model will be biased towards a specific imaging standard. Furthermore, the input data need to be preprocessed into a form suitable for a deep model. In this regard, similar to the work in [34], we add a series of preprocessing steps to prepare the dataset for neural network training. First, based on the aspect ratio of the input images, we resize them to 512 × 512 pixels using bicubic interpolation. Next, the retina circle in each resized image is centred by cropping the image from the centre to a size of 320 × 320 pixels. Moreover, we utilize Graham's approach [43] to enhance the clarity of blood vessels and lesion areas. For this purpose, all the black pixels are removed from the input images and a min-pooling filtering technique is used to normalize the images as in [43]:

$$I_c = \alpha I + \beta \, G(\rho) * I + \gamma \tag{1}$$

where $*$ denotes the convolution operation, $I$ is the input image, $I_c$ is the normalized image, and $G(\rho)$ is a Gaussian filter with a standard deviation of $\rho$; $\alpha$, $\beta$, and $\gamma$ are pre-defined parameters. In addition, to achieve a uniform distribution across the dataset and eliminate feature bias, the intensity values of all channels of the input images are normalized to [-1, 1]. Figure 4 illustrates the effect of these preprocessing stages on the input retina images.
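The cropping, Gaussian-based normalization, and intensity scaling steps described above can be sketched in NumPy as follows. This is a minimal illustration, not the implementation used in our experiments: the function names are hypothetical, the default values of `alpha`, `beta`, `gamma`, and `rho` are illustrative placeholders that are not stated in the text, and the bicubic resizing and black-pixel removal steps are omitted.

```python
import numpy as np


def gaussian_blur(image: np.ndarray, sigma: float) -> np.ndarray:
    """Separable Gaussian blur of an H x W x C image (edge padded, cut at 3 sigma)."""
    radius = max(1, int(3 * sigma))
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-(offsets ** 2) / (2.0 * sigma ** 2))
    kernel /= kernel.sum()
    padded = np.pad(image, ((radius, radius), (radius, radius), (0, 0)), mode="edge")
    # Filter along the vertical axis first, then along the horizontal axis.
    rows = np.apply_along_axis(lambda v: np.convolve(v, kernel, mode="valid"), 0, padded)
    return np.apply_along_axis(lambda v: np.convolve(v, kernel, mode="valid"), 1, rows)


def center_crop(image: np.ndarray, size: int) -> np.ndarray:
    """Crop a size x size window around the image centre."""
    height, width = image.shape[:2]
    top, left = (height - size) // 2, (width - size) // 2
    return image[top:top + size, left:left + size]


def graham_normalize(image: np.ndarray, alpha: float = 4.0, beta: float = -4.0,
                     gamma: float = 128.0, rho: float = 10.0) -> np.ndarray:
    """Apply I_c = alpha * I + beta * (G(rho) * I) + gamma, clipped to [0, 255].

    The default parameter values are assumptions made for illustration only.
    """
    image = image.astype(np.float64)
    enhanced = alpha * image + beta * gaussian_blur(image, rho) + gamma
    return np.clip(enhanced, 0.0, 255.0)


def scale_to_unit_range(image: np.ndarray) -> np.ndarray:
    """Map intensities from [0, 255] to [-1, 1]."""
    return image / 127.5 - 1.0
```

With `beta = -alpha`, the first two terms subtract a local average from each pixel, so a perfectly uniform region maps to `gamma`; this is why the operation emphasizes fine structures such as vessels and lesions.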

B. INCEPTION ENCODER
Deep learning architectures usually consist of two main parts, an encoder and a decoder. The encoder is the first part of the network; its task is to encode the input data into a format from which the network can extract useful features that reveal existing patterns. In computer vision architectures, the encoder consists of a series of successive convolutional layers followed by pooling and activation layers to represent the data in a high-level space. In our proposed method, the encoder is built from inception modules. The concept of inception was first introduced by Szegedy et al. [44], and several follow-up versions were later proposed to improve its performance [39], [45], [46]. Unlike regular CNNs, the inception block applies several parallel convolution operations to encode the object of interest at various scales; such a multi-scale feature map might not be obtainable with a regular convolution layer alone. Figure 5 shows the architecture of the inception block. According to Figure 5, the block consists of one convolutional path with two 3×3 convolutional layers and a shortcut path with a 1×1 convolutional layer, which is designed to increase the filter depth during encoding, or decrease it during decoding, and to enable the pixel-wise summation by projecting the input feature map into the same space as the output. Because there are multiple inception paths with different scales, and the output feature maps of the different paths are concatenated together, each inception block introduces additional parameters. We use the inception network as our encoder module in order to increase the performance of the model and make it focus on specific areas of the image.
Our encoder network $E$, parametrized by $\theta$, receives the normalized input image $I_c$ and generates the encoded feature $x \in \mathbb{R}^{H \times W \times C}$:

$$x = E(\theta; I_c)$$

The main idea behind choosing the inception module is its capability to learn a rich and generic representation compared to the counterpart baseline models. Although our proposed method does not rely on any particular baseline model, we chose the inception model because it performed better throughout our experiments.

C. STYLE AND CONTENT DECOMPOSITION
The deep encoding module usually performs a set of convolution operations followed by pooling and activation layers to represent the object of interest in a hierarchical representation space. This representation space can be divided into style and content features, where the style captures the common representation shared among layers, such as colour and texture, while the content representation contains more core features such as structure, semantic, and shape information [47]. In our strategy, we apply the style and content decomposition technique on top of the encoder features so that attention can be performed separately on each feature set. To do so, we build a pyramid representation using the feature map derived from each block of the encoding path (the output of the first convolution in each block). The resulting feature pyramid contains both deep and shallow features that represent the textural information, on which we can perform a texture attention mechanism to adaptively recalibrate the most important regions. Meanwhile, we create the content representation from the output of the last convolution operation on the encoding path. The content representation contains the semantic information, to which we can apply the spatial normalization technique to determine important locations for diabetic retinopathy classification. Hence, our content and style representations are obtained by:

$$x_{content} = E_l(\theta; I_c), \quad l = L$$

$$x_{style} = \mathrm{concat}\big(E_l(\theta; I_c),\ l = 1, 2, \ldots, L\big)$$

where $L$ is the number of convolutional blocks in the encoder module. To ensure the style matching mechanism, we first train the encoder network using the perceptual loss; the main training stage then uses the obtained weights to initialize the encoder parameters.
To this end, a pair of retinopathy images is fed to the encoder module to generate both content and style representations. Then, similar to [48], we iteratively adjust the style matching mechanism by maximizing the correlation between the styles of both images while keeping the content representation as unchanged as possible. Both style and content losses are used to model the perceptual loss. In the style loss, $N$ denotes the spatial dimension and $C_l$ stands for the number of channels in layer $l$. The main objective of the content loss is to keep the representation as unchanged as possible.
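To make the decomposition and the two loss terms more concrete, the following NumPy sketch shows one possible formulation. It is an assumption-laden illustration rather than the exact losses used in this work: it compares styles through Gram matrices (a formulation widely used in neural style transfer), measures content with a mean squared error, keeps the style pyramid as a list instead of a single concatenated tensor, and treats each block output as a stand-in for the convolution outputs described above. All function names are hypothetical.

```python
import numpy as np


def decompose(block_outputs):
    """Split encoder block outputs into (content, style pyramid).

    Content is taken from the last block (l = L); the style pyramid keeps
    the feature maps of all blocks (l = 1, ..., L).
    """
    return block_outputs[-1], list(block_outputs)


def gram_matrix(features: np.ndarray) -> np.ndarray:
    """Channel-by-channel correlation of an H x W x C feature map."""
    height, width, channels = features.shape
    flat = features.reshape(height * width, channels)
    # Normalize by the number of spatial positions N = H * W.
    return flat.T @ flat / (height * width)


def style_loss(style_a, style_b) -> float:
    """Sum over pyramid levels of the mean squared Gram-matrix difference."""
    return sum(
        float(np.mean((gram_matrix(a) - gram_matrix(b)) ** 2))
        for a, b in zip(style_a, style_b)
    )


def content_loss(content_a: np.ndarray, content_b: np.ndarray) -> float:
    """Mean squared difference between two content representations."""
    return float(np.mean((content_a - content_b) ** 2))
```

A perceptual loss could then be formed as a weighted sum of `style_loss` and `content_loss`; the weighting is a design choice that the text does not specify.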

D. ATTENTION MODULE
The idea of the attention mechanism is derived from the real world, where humans focus on specific parts of their visual field, such as a particular food, road, or text, and reason about them. In computer vision, attention is a technique by which the model weighs features by their level of importance and uses this weighting to help achieve the task. In our proposed method, we employ two different attention mechanisms on top of the style and content modules in parallel.
Texture Attention Module
The texture representation in diabetic retinopathy images provides significant information regarding the abnormal regions. Hence, our texture attention mechanism aims to highlight these regions through the frequency domain. To this end, each level of the style pyramid passes through a Laplacian pyramid to modify the frequency information. To model the Laplacian operation, we use the difference of Gaussians, applied on each level of the pyramid with varying variances. In the Gaussian operation that generates the different scales, $x_{style}$ represents the style feature pyramid, $\sigma_l$ indicates the variance of the $l$-th Gaussian function, and $i$ and $j$ denote the spatial location. To highlight the high-frequency information (which relates to texture), we simply take the difference between pyramid levels of increasing variance, where $LP_l$ denotes the feature maps at the $l$-th pyramid level, $G_l$ indicates the output of the $l$-th Gaussian operation, and $L$ is the total number of pyramid levels.
Spatial Attention Module
Unlike texture attention, which aims to identify 'what' is meaningful in the input image, spatial attention looks for 'where' the informative parts of the input image are. The spatial attention map is generated from the inter-spatial relationship of the features and is complementary to texture attention.
We calculate spatial attention by applying a pooling operator (average pooling) along the channel axis of the content feature map to generate a robust feature descriptor. In our proposed architecture, we generate a spatial attention map $M_s(x_{content}) \in \mathbb{R}^{H \times W}$ by applying a convolution layer to this feature descriptor, which emphasizes or suppresses specific regions. The details of this process are described below.
We aggregate the channel information into $x^{s}_{content,avg} \in \mathbb{R}^{1 \times H \times W}$, which denotes the content features average-pooled across the channel axis. Next, we obtain a 2D spatial attention map by applying a convolution operation to the resulting feature map:

$$M_s(x_{content}) = \sigma\big(f^{7 \times 7}(x^{s}_{content,avg})\big) \tag{8}$$

where $\sigma$ denotes the sigmoid function and $f^{7 \times 7}$ denotes a convolution operation with a filter size of 7 × 7 [49]. By applying spatial attention to the content module, we generate the spatial-normalized feature map, which guides the network to place more emphasis on the informative regions related to the structural information of the diabetic retinopathy patterns. Figure 6 shows the spatial normalization process. The spatial normalization module [49] is used in our model to highlight the important locations inside the input image and tune the representational space accordingly.
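The two attention branches of this subsection can be sketched in NumPy as follows. This is a hedged illustration under several assumptions: feature maps are stored channel-last (H × W × C), the `sigmas` passed to the texture branch are standard deviations chosen by the caller, the last difference-of-Gaussians level simply keeps the coarsest blur (as in a classical Laplacian pyramid), and the 7 × 7 kernel of the spatial branch would be a learned parameter in practice. The function names are hypothetical.

```python
import numpy as np


def gaussian_blur(feature_map: np.ndarray, sigma: float) -> np.ndarray:
    """Separable Gaussian blur of an H x W x C map (edge padded, cut at 3 sigma)."""
    radius = max(1, int(3 * sigma))
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-(offsets ** 2) / (2.0 * sigma ** 2))
    kernel /= kernel.sum()
    padded = np.pad(feature_map, ((radius, radius), (radius, radius), (0, 0)), mode="edge")
    rows = np.apply_along_axis(lambda v: np.convolve(v, kernel, mode="valid"), 0, padded)
    return np.apply_along_axis(lambda v: np.convolve(v, kernel, mode="valid"), 1, rows)


def difference_of_gaussians(feature_map: np.ndarray, sigmas) -> list:
    """Texture branch: band-pass levels G_l - G_(l+1) for increasing sigmas.

    The final level keeps the coarsest blur, which is an assumption here.
    """
    blurred = [gaussian_blur(feature_map, s) for s in sigmas]
    levels = [fine - coarse for fine, coarse in zip(blurred[:-1], blurred[1:])]
    levels.append(blurred[-1])
    return levels


def conv2d_same(feature_map: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Cross-correlate an H x W map with a k x k kernel, zero padded to keep H x W."""
    size = kernel.shape[0]
    padded = np.pad(feature_map, size // 2)
    height, width = feature_map.shape
    out = np.zeros((height, width), dtype=np.float64)
    for di in range(size):
        for dj in range(size):
            out += kernel[di, dj] * padded[di:di + height, dj:dj + width]
    return out


def spatial_attention(x_content: np.ndarray, kernel: np.ndarray):
    """Spatial branch: channel-wise average pooling, 7x7 filter, sigmoid gate."""
    pooled = x_content.mean(axis=-1)                      # H x W descriptor
    attention = 1.0 / (1.0 + np.exp(-conv2d_same(pooled, kernel)))
    return x_content * attention[..., np.newaxis], attention
```

Because consecutive differences telescope, the levels returned by `difference_of_gaussians` sum back to the finest blur, so no information from that blur is lost; only its distribution across frequency bands changes.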

Fusion Module
The resulting feature maps (the gated representation and the spatial-normalized representation) are then concatenated and passed through a convolution operation to form an aggregated feature map, which is fed to the decoder to produce the classification label.
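A minimal sketch of this fusion step is given below. It assumes that both feature maps share the same spatial size and are stored channel-last, and it assumes a 1 × 1 convolution, since the kernel size of the fusion convolution is not specified in the text. The function name `fuse` and the `weights` argument are hypothetical.

```python
import numpy as np


def fuse(texture_features: np.ndarray, spatial_features: np.ndarray,
         weights: np.ndarray) -> np.ndarray:
    """Concatenate two H x W x C maps on the channel axis, then mix channels.

    A 1x1 convolution is a per-pixel linear map, so it reduces to a matrix
    product with `weights` of shape (C_texture + C_spatial, C_out).
    """
    stacked = np.concatenate([texture_features, spatial_features], axis=-1)
    return stacked @ weights
```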

E. DECODER
To map the feature vectors obtained from the encoder part to the desired output, we utilize a decoder block of fully connected layers. The decoder serves two main goals:
DR classification: Classifying the retinopathy images into five classes according to the DR grade is the main objective of this study. The classification model learns to predict the diabetic retinopathy class, and we employ a cross-entropy loss function to measure the loss between the predicted class and the true class.
Healthy and non-healthy retina: Grading retinal fundus images is a complex and costly task, and the main purpose of retinopathy detection is to help ophthalmologists and hospitals reduce the monitoring burden. The healthy versus non-healthy binary classification is therefore included as an auxiliary task to assist ophthalmologists in recognizing retinopathy; this binary annotation is also easier to provide. In our structure, besides the main classification task, this auxiliary task provides a second signal for ophthalmologists in recognizing healthy and non-healthy cases. A binary cross-entropy loss function is used for training the auxiliary task.
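The two objectives can be combined as in the following pure-Python sketch. How the two losses are weighted against each other is not stated in the text, so the simple weighted sum and the `aux_weight` parameter are assumptions, as is deriving the binary label from the grade (grade 0 treated as healthy). The function names are hypothetical.

```python
import math


def cross_entropy(logits, target, class_weights=None):
    """Cross-entropy of one sample from raw class scores (softmax inside)."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]   # shift for numerical stability
    prob = exps[target] / sum(exps)
    weight = 1.0 if class_weights is None else class_weights[target]
    return -weight * math.log(prob)


def binary_cross_entropy(logit, target):
    """Binary cross-entropy of one sample from a raw score (sigmoid inside)."""
    prob = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(prob) + (1 - target) * math.log(1.0 - prob))


def joint_loss(grade_logits, healthy_logit, grade, aux_weight=1.0):
    """Main 5-class grading loss plus the auxiliary healthy/non-healthy loss."""
    is_diseased = 1 if grade > 0 else 0           # assumption: grade 0 means healthy
    main = cross_entropy(grade_logits, grade)
    auxiliary = binary_cross_entropy(healthy_logit, is_diseased)
    return main + aux_weight * auxiliary
```

For an untrained model that outputs all-zero scores, the main loss equals log 5 and the auxiliary loss equals log 2, which gives a quick sanity check on an implementation.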

IV. EXPERIMENTAL RESULTS
Our proposed method has been evaluated on the APTOS Kaggle dataset [6] for diabetic retinopathy classification. Besides the dataset description, this section provides detailed information on the training process, the evaluation metrics, the comparison results for both multiclass diabetic retinopathy classification and healthy versus non-healthy classification, and finally an ablation study that highlights the contribution of each proposed module to the generalization performance of the model. In the next subsections, we elaborate on each part in more detail.

A. APTOS DATASET
The APTOS dataset [6] is a large collection of retina fundus images prepared using various medical imaging techniques. Aravind Eye Hospital prepared this database for a classification task: developing an automatic approach for detecting and classifying the severity of diabetic retinopathy on a scale from 0 to 4, where the numbers represent the extent of the disease. The dataset consists of a total of 3,662 retina images, each with a class label assigned by a clinician according to the severity of diabetic retinopathy: No DR (Class 0), Mild DR (Class 1), Moderate DR (Class 2), Severe DR (Class 3), and Proliferative DR (Class 4). Some samples of the APTOS dataset are illustrated in Figure 2. The class distribution of this dataset is highly imbalanced, i.e., 49%, 10%, 27%, 5%, and 8% of the images belong to the normal, mild, moderate, severe, and proliferative DR classes, respectively; we account for this during training by defining a weighted loss for each class. Similar to previous work [34], we use 10% of the labelled samples as a test set and the rest for training.
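The class weighting and the hold-out split can be illustrated with the short sketch below. The exact weighting scheme is not specified in the text, so inverse-frequency weights normalized to a mean of one are an assumption; likewise, the split shown here is a plain random split with a fixed seed, and whether the actual split was stratified is not stated. The function names are hypothetical.

```python
import random


def inverse_frequency_weights(class_fractions):
    """Weights proportional to 1 / frequency, rescaled so that their mean is 1."""
    inverse = [1.0 / fraction for fraction in class_fractions]
    scale = len(inverse) / sum(inverse)
    return [value * scale for value in inverse]


def split_indices(num_samples, test_fraction=0.1, seed=0):
    """Shuffle sample indices and hold out a fraction of them as the test set."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    num_test = round(num_samples * test_fraction)
    return indices[num_test:], indices[:num_test]


# Class shares reported for the APTOS dataset (No DR, Mild, Moderate, Severe, PDR).
aptos_fractions = [0.49, 0.10, 0.27, 0.05, 0.08]
aptos_weights = inverse_frequency_weights(aptos_fractions)
```

Under this scheme, the rarest class (Severe DR, 5%) receives the largest weight, so errors on it contribute most to the weighted loss.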

B. TRAINING PROCESS
The proposed method is implemented with the PyTorch library, and all experiments were carried out on an NVIDIA RTX 3090 GPU with a batch size of 8 and without any data augmentation. We trained all the models with an initial learning rate of 1e-3 and a decay rate of 1e-4 for 200 epochs using the Adam optimizer. If the validation performance does not change for 10 consecutive epochs, we stop the training process. The baseline network used in our experiments has the same structure as the U-Net model, without the proposed attention mechanism. It is worth mentioning that we do not use transfer learning during training; instead, we train each model from random weights drawn from a standard normal distribution.
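The stopping rule described above can be sketched in a framework-independent way. The snippet assumes that "does not change" means the monitored validation score does not improve on its best value, and that a higher score is better. The class `EarlyStopping` and the helper `epochs_run` are illustrative names, not part of the authors' code.

```python
class EarlyStopping:
    """Stop when the validation score has not improved for `patience` epochs."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = None
        self.bad_epochs = 0

    def step(self, score):
        """Record one epoch's validation score; return True if training should stop."""
        if self.best is None or score > self.best + self.min_delta:
            self.best = score
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

def epochs_run(val_scores, max_epochs=200, patience=10):
    """Return how many epochs are executed for a given sequence of validation scores."""
    stopper = EarlyStopping(patience=patience)
    for epoch, score in enumerate(val_scores[:max_epochs], start=1):
        if stopper.step(score):
            return epoch
    return min(len(val_scores), max_epochs)

# Improvement for 3 epochs, then a plateau: training stops 10 epochs after the best one.
plateau_scores = [0.5, 0.6, 0.7] + [0.7] * 20
stopped_at = epochs_run(plateau_scores)
```

In an actual training loop, `step` would be called once per epoch with the validation metric, and the best checkpoint would usually be saved whenever the score improves.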

C. EVALUATION METRICS
To evaluate the performance of our proposed method, we use standard, well-known metrics: accuracy (AC), sensitivity (SE), specificity (SP), F1-score, and the Kappa coefficient. The following terms are used to explain how the metrics are calculated. True positive (TP) denotes a sample that is correctly predicted as a retinopathy class. False positive (FP) denotes a sample that is falsely predicted as a retinopathy class. True negative (TN) denotes a sample that is correctly predicted as non-retinopathy. False negative (FN) denotes a sample that is falsely predicted as non-retinopathy.
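Accuracy indicates the percentage of correct predictions,

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (9)$$

Specificity indicates the proportion of actual negatives that are correctly identified by the model,

$$\text{Specificity} = \frac{TN}{TN + FP} \quad (10)$$

Sensitivity indicates the proportion of actual positives that are correctly identified by the model,

$$\text{Sensitivity} = \frac{TP}{TP + FN} \quad (11)$$

F1 score, also known as the balanced F-score or F-measure, is the harmonic mean of precision and recall,

$$\text{F1 score} = \frac{2 \cdot TP}{2 \cdot TP + FP + FN} \quad (12)$$

Kappa coefficient indicates the reliability between two raters who each classify N items into C mutually exclusive categories,

$$\kappa = \frac{p_o - p_e}{1 - p_e} \quad (13)$$

where $p_o$ is the observed agreement between the two raters and $p_e$ is the agreement expected by chance.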
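For concreteness, the metrics can be computed from the four counts using their standard definitions, as in the sketch below. The Kappa function implements the unweighted Cohen's kappa from a confusion matrix; whether the paper reports an unweighted or a weighted variant is not stated in this section, so that choice is an assumption. The counts are made-up illustration values.

```python
def binary_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity, and F1 score from the four counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),   # true positive rate (recall)
        "specificity": tn / (tn + fp),   # true negative rate
        "f1": 2 * tp / (2 * tp + fp + fn),
    }

def cohen_kappa(confusion):
    """Unweighted Cohen's kappa from a square confusion matrix.

    Rows are true classes, columns are predicted classes.
    """
    k = len(confusion)
    n = sum(sum(row) for row in confusion)
    p_o = sum(confusion[i][i] for i in range(k)) / n
    row_tot = [sum(confusion[i]) for i in range(k)]
    col_tot = [sum(confusion[i][j] for i in range(k)) for j in range(k)]
    p_e = sum(row_tot[i] * col_tot[i] for i in range(k)) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Illustrative counts only (not results from the paper).
m = binary_metrics(tp=40, tn=45, fp=5, fn=10)
# Same counts as a 2x2 matrix: [[TN, FP], [FN, TP]].
kappa = cohen_kappa([[45, 5], [10, 40]])
```

The same `cohen_kappa` function applies unchanged to the five-class grading task, since it only requires a square confusion matrix.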

D. DIABETIC RETINOPATHY CLASSIFICATION RESULTS
To evaluate the performance of the proposed method, we used the publicly available APTOS dataset. To provide a fair evaluation, we followed the same setting as in [34] to divide the dataset into train and test sets. In our first evaluation strategy, we applied well-known classification models to classify diabetic retinopathy images. To this end, we slightly modified the classification layer (the last fully connected layer) of the MobileNet, VGG, and ResNet models to produce the classification label for the diabetic classes. Both the baseline models and our proposed method are trained for 200 epochs using the training strategy explained in Section IV-B. We observed that the results of the baseline models are almost the same as those reported in [34]. Table 1 provides the comparison results. The results in Table 1 show that our approach, which applies attention to both texture and content features, outperforms every other model, as it enhances the representation space and generates a richer and more generic feature set. Moreover, in comparison with the recent modified Xception [34], Hybrid [53], and Multi-scale attention [5] methods, our attention-based strategy produces better classification results. To better analyze the performance of the proposed method on the diabetic classes, we provide the confusion matrix on the APTOS dataset [6] in Figure 7. The confusion matrix reveals that the method is highly capable of separating healthy and non-healthy retinopathy images from each other; however, in determining the diabetic level, there is some misclassification among the high-grade diabetic classes.
According to Figure 7, the proposed approach can effectively separate the healthy retinopathy images (with 98% confidence) from the diabetic classes. In other words, it provides remarkable classification confidence for separating non-healthy samples from the healthy class. However, the classification performance decreases considerably in recognizing the Severe and Proliferative diabetic classes. This is mainly due to the high feature similarity among adjacent diabetic classes (e.g., classes 3 and 4), which makes it extremely difficult for the deep model to distinguish them.
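A confusion matrix can be read in the way described above by computing the per-class recall (the diagonal entry divided by the row total) and locating the largest off-diagonal entry. The sketch below uses toy counts for illustration only; they are not the values of Figure 7, and the function names are hypothetical.

```python
def per_class_recall(confusion):
    """Fraction of each true class (row) that was predicted correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0
            for i, row in enumerate(confusion)]

def most_confused_pair(confusion):
    """Return (true_class, predicted_class, count) for the largest off-diagonal entry."""
    best = (None, None, -1)
    for i, row in enumerate(confusion):
        for j, count in enumerate(row):
            if i != j and count > best[2]:
                best = (i, j, count)
    return best

# Toy 5-class counts (rows: true grade 0-4, columns: predicted grade 0-4).
# These numbers are invented for illustration and are NOT the paper's results.
toy = [
    [18, 1, 1, 0, 0],
    [2, 5, 3, 0, 0],
    [1, 2, 9, 1, 1],
    [0, 0, 2, 2, 2],
    [0, 0, 1, 4, 2],
]
recall = per_class_recall(toy)
pair = most_confused_pair(toy)
```

In this toy matrix the healthy class has the highest recall, while the largest confusion occurs between grades 4 and 3, which mirrors the qualitative pattern discussed in the text.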

E. HEALTHY AND NON-HEALTHY CLASSIFICATION RESULTS
From a clinical perspective, classifying healthy and non-healthy retinopathy images not only reduces the burden on optometrists but also facilitates the screening process. Thus, we provide experimental results for the healthy versus non-healthy diabetic retinopathy classification problem. First, we summarize and compare the classification results of the proposed method with both the baseline and prior work in Table 2. According to Table 2, the proposed approach classifies healthy and non-healthy samples accurately and outperforms prior work on all metrics. Compared to the recent Multi-scale attention [5], our method produces slightly better classification results. In Figure 8 and Figure 9, we provide the ROC and precision-recall curves to analyze the true positive rate against the false positive rate. These curves demonstrate the trade-off between sensitivity and specificity and between precision and recall. As can be seen from the curves, our model is highly effective in classifying healthy and non-healthy samples, which is useful in determining the list of patients that need to be checked by a specialist.

FIGURE 8: ROC curve achieved by applying the proposed method on the Kaggle APTOS dataset [6] for classifying healthy and non-healthy retinopathy images.
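To make the construction of an ROC curve concrete, the sketch below sweeps a decision threshold over predicted scores, records the false positive rate and true positive rate at each step, and integrates the curve with the trapezoidal rule. The scores and labels are toy values, and the function names are illustrative.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by lowering the threshold over the scores."""
    pos = sum(labels)
    neg = len(labels) - pos
    order = sorted(zip(scores, labels), key=lambda t: -t[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(order):
        thr = order[i][0]
        # Consume all samples tied at this score before emitting a point.
        while i < len(order) and order[i][0] == thr:
            if order[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under a piecewise-linear curve via the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Toy predicted probabilities of "non-healthy" and the true binary labels.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
area = auc(roc_points(scores, labels))
```

For these toy values, 8 of the 9 positive-negative pairs are ranked correctly, so the area under the curve is 8/9.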

F. ABLATION STUDY
In this section, we discuss the effect of decomposing the representation space into content and style features, as well as the impact of the texture and spatial attention modules on the model performance. To investigate the effect of the decomposition, the proposed model was trained with and without decomposing the representational space. The results in Table 3 show that training without the decomposition leads to a performance loss in the diabetic retinopathy classification task. In another setting, we evaluated the model by dropping one attention module and using only the other, in order to check the separate contribution of each attention mechanism. The experimental results demonstrate that each module contributes to the model performance and that together they provide a powerful feature representation for classifying retinopathy, as shown in Table 3. Moreover, the experiments show that using separated content and style feature maps can effectively provide a region-based feature recalibration process, which is critical for diabetic retinopathy classification. Finally, the results indicate that applying the attention modules helps the model focus on the more informative areas and scale the representation space, which increases its performance in recognizing DR. Omitting the attention modules from the model (i.e., the baseline) decreased the kappa coefficient score of the proposed method by 8.9% on the APTOS dataset [6], as shown in Table 3.

G. COMPUTATIONAL TIME
As stated in the introduction, an ophthalmologist may need up to five minutes to closely examine a retinopathy image and assess the state of diabetic retinopathy [56]. In rare cases, such as the presence of macular degeneration, the screening process may take even longer. Hence, one major factor in determining the effectiveness of a machine learning algorithm is its inference time. We are also interested in the complexity of the model in terms of the required arithmetic operations. Table 4 shows the obtained results. As shown in Table 4, our algorithm predicts a batch of 16 images in two seconds, whereas the same process can take up to 80 minutes for ophthalmologists. In addition, the proposed attention module adds only a small number of parameters while effectively enhancing the performance. Overall, our method has 30.2 million parameters and is able to run on a single GPU device with 8 GB of memory, which makes it comparatively more effective than the recent MSA Net [5]. Last but not least, our method contains comparatively fewer parameters owing to the low overhead of the attention mechanism included in the model.
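The 80-minute figure follows directly from the numbers quoted above (16 images at up to five minutes each). The short sketch below reproduces this arithmetic and the resulting ratio to the two-second model inference time. Because the five-minute value is an upper bound, the ratio is an upper bound as well. The function names are illustrative.

```python
def manual_minutes(n_images, minutes_per_image=5.0):
    """Upper bound on manual screening time, in minutes."""
    return n_images * minutes_per_image

def speedup(n_images, model_seconds, minutes_per_image=5.0):
    """Ratio of manual screening time to model inference time for one batch."""
    return manual_minutes(n_images, minutes_per_image) * 60.0 / model_seconds

batch = 16
manual = manual_minutes(batch)              # 16 images * 5 min = 80 minutes
ratio = speedup(batch, model_seconds=2.0)   # 4800 s / 2 s
per_image_seconds = 2.0 / batch             # model time per image
```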

V. CONCLUSION
In this paper, we proposed a deep neural network that incorporates a style and content recalibration mechanism to adaptively scale informative regions for the classification of diabetic retinopathy images. Our proposed model performs both diabetic grading and healthy versus non-healthy classification.
To improve the representation power of the network, we utilized a separation mechanism that decomposes the features into style and content representations. Furthermore, we employed a texture attention module along with a spatial normalization module. The texture attention module highlights the texture information by taking the style representation and applying a high-pass filter, and the spatial normalization module determines the more informative regions inside the retinopathy image using a convolutional operation. Next, we applied a fusion module to combine both features into a normalized representation. Our experiments on the APTOS Kaggle dataset show an improvement over existing methods in the literature.