Retinal Image Classification by Self-Supervised Fuzzy Clustering Network

Diabetic retinal image classification aims to automatically diagnose diabetic retinopathy, a task on which deep learning models have achieved considerable improvement. However, these methods all rely on sufficient network training with large-scale annotated data, and medical image labeling is very labor-expensive. Aiming to overcome these drawbacks, this paper focuses on embedding a self-supervised framework into an unsupervised deep learning architecture. Specifically, we propose a Self-supervised Fuzzy Clustering Network (SFCN) built from a feature learning module, a reconstruction module, and a fuzzy self-supervision module. The feature learning and reconstruction modules ensure the representative ability of the network, and the fuzzy self-supervision module is in charge of providing the training direction for the whole network. Furthermore, three losses of reconstruction, self-supervision, and fuzzy supervision jointly optimize the SFCN in an unsupervised manner. To evaluate the effectiveness of the proposed method, we implement the network on three widely used retinal image datasets, and the results demonstrate satisfactory performance on the unsupervised retinal image classification task.


I. INTRODUCTION
Retinal image classification is a significant application in diagnosing diabetic retinopathy. According to reports of the World Health Organization, diabetes has a wide and profound impact on a large number of people around the world. According to a survey, 1.5 million people died of diabetes in 2012, an estimated 347 million people had diabetes in 2014, and the proportion of people with diabetes in the world will rise from 2.8% in 2000 to 4.4% by 2030. People with diabetes are usually over 30 years old, and severe diabetes can lead to diabetic retinopathy, which has early and controllable retinal abnormalities such as micro-aneurysms and other small lesions caused by thin retinal capillaries. A summary survey of 35 studies found that the overall incidence of diabetic retinopathy was 34.6%, vision-threatening diabetic retinopathy accounted for 10.2%, and diabetic macular edema accounted for 6.8%. The sooner diabetic retinopathy is detected, the less likely vision loss becomes. However, the early symptoms of this disease are not obvious, so late detection makes effective treatment difficult. Moreover, discovering this disease requires a great deal of experience and sufficient capacity from the doctor, because the diagnosing situation in retinal images is challenging (Figure 1). Therefore, the demand for effective classification and screening of retinopathy is even more urgent to help the timely detection and effective treatment of the disease. (The associate editor coordinating the review of this manuscript and approving it for publication was Tallha Akram.)
Deep learning based methods have been widely introduced into the medical image analysis field, and their ability to recognize and distinguish information can well support disease diagnosis [1], [14], [28]. When the number of retinal specialists is insufficient for all patients to be examined individually, the increased prevalence of the disease may cause great difficulties in the treatment of retinal diseases. In previous research, deep learning technology has helped human experts with some diabetic retinal diseases, such as glaucoma and retinopathy of prematurity, by effectively processing large amounts of data [2], [9], [18], [33]. It can be seen that deep learning has great prospects in the medical field, provided that the diagnosing performance of deep learning based methods is higher than that of conventional ophthalmologists.
In recent years, deep learning has played a remarkable role in retinal image classification methods for diabetic retinopathy diagnosis. Alqudah [5] presented an automated convolutional neural network architecture for a multi-class retinal disease classification system, which achieved an accuracy of 95.3% with a softmax classifier. Shanthi and Sabeenian [34] focused on classifying diabetic retinopathy fundus images by the severity of the disease using a convolutional neural network, applying suitable pooling, softmax, and rectified linear activation layers to obtain a high level of accuracy. Bourouis et al. [9] proposed a robust hybrid probabilistic learning approach that appropriately combines the advantages of both generative and discriminative models for the challenging retinal image classification problem; it obtained better results and showed the flexibility and merits of deep learning applications in retinal image analysis. Through sufficient training on large amounts of annotated data, existing deep learning based retinal image classification methods have achieved satisfactory performance according to the review research [33], where more successful applications can be found.
However, there is a main drawback in existing deep learning retinal image diagnosis works: they require large amounts of labeled data to supervise network learning. Manual labeling of retinopathy information is a complicated task, requiring many experienced experts to manually annotate a large number of retinopathy images. Moreover, different personal habits may cause inconsistencies in image labeling, hindering its application in scientific research. To overcome this limitation, this paper proposes a Self-supervised Fuzzy Clustering Network (SFCN) to conduct retinal image classification without any annotations. In detail, SFCN consists of three main components to learn a robust classification model within a self-supervised framework. The first is the feature extraction module, built from stacked convolutional layers, which learns representations from input retinal images. The second is the reconstruction module for the learned feature, which guarantees the representative ability of the feature extraction. The last is the fuzzy self-supervision module, which provides the training direction for the whole network. In summary, SFCN employs the fuzzy clustering results to supervise the network and constrains the network to output the probability of each retinal image belonging to every cluster.

II. RELATED WORK
This section reviews the related works in three aspects: retinal image classification, unsupervised medical image classification, and a brief introduction to self-supervision.

A. RETINA IMAGE CLASSIFICATION
Diabetic retinopathy is an important eye complication caused by diabetes. Early detection helps protect vision, and patients need to be checked semi-annually to determine their eye condition. In this paper, we propose a deep learning approach for processing retinal images to help treat retinal disease effectively. There are already many medical image analysis algorithms for diagnosing diabetic retinal lesions, which can identify and distinguish diabetic retinas by examining the occurrence of bleeding, lesions, micro-aneurysms, exudates, and other symptoms in the fundus image.
In previous studies, diabetic retinopathy was automatically detected by various models, which principally fall into two categories: conventional machine learning approaches [20], [23] and deep learning methodologies [11], [40]. For the first category, several traditional models have been employed in retinal image analysis. For instance, Jelinek et al. [23] deployed points-of-interest and a visual dictionary that contains the important features required to identify retinal pathology; Gupta and Karandikar [20] conducted morphological operations on retinal images to identify exudate and micro-aneurysm features, and then used a multi-class support vector machine and a k-nearest neighbor classifier to grade the severity of abnormality. Recent research on retinal image analysis mostly employs Convolutional Neural Networks (CNN) to automatically learn features from original images and conduct classification on them. Krause et al. [25] proposed a domain enriched deep network consisting of a representation network that learns geometric features specific to retinal images, and a custom designed, computationally efficient residual task network that utilizes the features obtained from the representation layer to perform pixel-level segmentation. For retinal image classification, Wu et al. [40] presented an attention-based deep learning method for retinal disease classification in optical coherence tomography images, especially designing an attention model which focuses on critical regions containing pathological anomalies. These methods all work under a supervised framework demanding a large amount of annotated data, which limits their flexibility in realistic applications.

B. UNSUPERVISED MEDICAL IMAGE CLASSIFICATION
For deep learning models, annotating large-scale medical image collections meets greater challenges than conventional vision tasks (e.g., face recognition) because it requires expensive labor from well-trained human experts. Thus, unsupervised medical image classification has broader prospects in real scenarios.
Several deep learning based unsupervised medical image classification models have been proposed using transfer learning or clustering based methodologies [3], [4], [31], [36]. From the view of transfer learning methods, Ahn et al. [3] utilized transferable knowledge across different domains to reduce the reliance on annotated training data, using a new hierarchical unsupervised feature extractor with a convolutional auto-encoder placed atop a pre-trained convolutional neural network; Tang et al. [36] combined active learning and transfer learning for medical data classification, iteratively querying a small number of informative unlabeled target samples and removing the source samples that conflict with the distribution of the target data. As for clustering based methods, Ahn et al. [4] proposed an unsupervised feature learning method to tackle the large volume of unlabeled medical data, learning feature representations to differentiate dissimilar medical images using an ensemble of different convolutional neural networks and K-means clustering; Perkonigg et al. [31] presented a method to identify predictive texture patterns in medical images under an unsupervised framework, simultaneously encoding and clustering medical image patches in a low-dimensional latent space.

C. SELF-SUPERVISION
Self-supervision is proposed to leverage huge amounts of unlabeled data to learn useful representations for various problems, for example, image classification, object detection, and video recognition. It has been proved that many deep learning methods can benefit from models pre-trained on large labeled datasets. The basic motivation behind self-supervision methods is to replace expensive labeled data with 'free' unlabeled data. Being a generic framework, self-supervision enjoys a wide range of applications, from robotics to image understanding. A common way to achieve self-supervised learning is to derive easy-to-obtain supervision signals without human annotations, to encourage the learning of useful features for regular tasks. Zhang et al. [42] designed a self-supervised convolutional subspace clustering network, combining a feature extraction module, subspace clustering, and a spectral clustering module into a joint optimization framework. Wang et al. [38] developed an HSI feature learning network that learns consistent features by self-supervision for HSI classification, jointly employing a conditional random field framework to boost the performance of self-supervised feature learning. Chen et al. [10] trained a generalized self-supervised embedding network to provide slow and robust representations for downstream tasks by learning from the data itself. Moreover, several self-supervised learning methods employ image rotation. For example, Gidaris et al. [19] proposed to learn image features by training ConvNets to recognize the 2d rotation applied to the input image (RotNet); Feng et al. [15] introduced a self-supervised learning method that incorporates rotation invariance, one of many good and well-studied properties of visual representation, into the feature learning framework, a property rarely appreciated or exploited by previous deep convolutional neural network based self-supervised representation learning methods.
Inspired by these successful applications in traditional computer vision tasks, we embed a fuzzy clustering algorithm as the training direction in a self-supervised framework to achieve the unsupervised retinal image classification task. The detailed approach is demonstrated in Section III.

III. SELF-SUPERVISED FUZZY CLUSTERING NETWORK
We introduce the proposed Self-supervised Fuzzy Clustering Network (SFCN) in this section, and demonstrate the joint optimization of the self-supervised framework.

A. APPROACH OVERVIEW
The proposed SFCN is achieved by three main stages, including feature learning directly from unlabeled medical images, self-supervision by the fuzzy clustering module, and the application to the retinal image classification task. Firstly, the feature learning module uses several stacked convolutional layers to extract a CNN representation from a given retinal image, and multiple deconvolutional layers are utilized to reconstruct the given retinal image. This module ensures that the learned feature representation contains sufficient information to reconstruct the input. Secondly, the fuzzy self-supervision module provides training supervision for the feature learning module through the predicted result of the fuzzy clustering algorithm. This self-supervision strategy exploits the inherent correlation in unlabeled medical images and makes the output of the network become fuzzy probabilities of belonging to each cluster. Finally, we illustrate how to apply the well-trained self-supervised network to the retinal image classification task.

B. FEATURE LEARNING MODULE
The feature learning module is the basic component of our proposed SFCN framework, which is used to extract discriminative features from retinal images. Given a retinal image x, we obtain its feature vector z = f(x) through an encoder f of L stacked convolutional layers with parameters {w_l, b_l}_{l=1}^{L}, since convolutional neural networks have proved their feature extracting ability in many computer vision tasks. Here w_l and b_l denote the weight and bias parameters of the l-th layer, respectively.

FIGURE 2. Framework of the proposed self-supervised fuzzy clustering network. Our SFCN method contains the feature learning module with fully connected layers, the fuzzy self-supervision module, and the feature decoder module to achieve the self-supervision process. Through fuzzy c-means clustering, the last FC layer outputs the probability of each retinal image belonging to every cluster, under the constraint of the self-supervision loss. To further improve the representation ability of the feature learning module, a reconstruction loss is placed on the feature decoder module. With this network, the retinal image classification task can be solved without any annotations.

There are two downstream tasks for the feature vector z: it is fed into a decoder to reconstruct the original retinal image x, and it simultaneously provides information for the fuzzy self-supervision module. Explicitly, the feature vector z = [h_1, ..., h_{m_L}] is constituted by m_L feature maps calculated by the convolutional layers, where h_{m_L} represents the raw output of the last layer of f. This feature vector is learned from unlabeled retinal images, and should exploit the informative content of x in the SFCN framework.
Therefore, we employ a decoder g of several deconvolutional layers to construct an image x̂ = g(z), which should restore the detailed information from the feature vector z. The reconstruction loss L_re is defined as

L_re = (1/N) Σ_{x ∈ X} ||x − g(f(x))||_2^2,    (2)

where N denotes the number of training retinal images, and X is the whole training dataset. The reconstruction loss employs the l2 loss to guarantee stable feature representations for retinal images after the feature learning, which is more sensitive and robust than the l1 loss. Through this encoder-decoder architecture, z = [h_1, ..., h_{m_L}] maintains the most discriminative feature representation of the image x. This procedure achieves robust feature learning on unlabeled medical images.
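As a concrete illustration, the encoder-decoder stream and the l2 reconstruction loss can be sketched as below. This is a minimal sketch with hypothetical layer counts and channel sizes; the paper's actual encoder is ResNet-based and its decoder follows the CycleGAN architecture.

```python
import torch
import torch.nn as nn

# Hypothetical, simplified encoder-decoder; layer counts and channel
# sizes are illustrative only, not the paper's actual architecture.
class TinyAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(           # f: stacked convolutions
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(           # g: deconvolutions
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1),
        )

    def forward(self, x):
        z = self.encoder(x)                     # feature maps z = f(x)
        return z, self.decoder(z)               # reconstruction x_hat = g(z)

def reconstruction_loss(x, x_hat):
    # L_re: mean squared (l2) error between input and reconstruction
    return ((x - x_hat) ** 2).mean()

x = torch.randn(4, 3, 224, 224)                 # toy batch of "retinal images"
model = TinyAutoencoder()
z, x_hat = model(x)
loss = reconstruction_loss(x, x_hat)
```

Minimizing this loss pushes z to retain enough information to rebuild the input, which is the role of the reconstruction module in SFCN.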

C. FUZZY SELF-SUPERVISION MODULE
As an effective generic unsupervised framework, self-supervision can provide a training direction on unlabeled data. We choose the fuzzy clustering results as the cues of self-supervision, because the fuzzy clustering algorithm [8] can learn the probabilities of belonging to each cluster center. The estimated probabilities are more realistic than those of traditional clustering methods, e.g., k-means, for the reason discussed below. Traditional clustering analysis belongs to ''crisp partition'', as in k-means and hierarchical clustering algorithms. These strictly classify each object into a certain category, with a unique decision on the category the object belongs to; hence the partition into categories is distinct. However, most objects do not have strict category properties, and their category attributes are intermediate, which makes it more reasonable to conduct ''soft partition''. For soft partition, fuzzy set theory provides a powerful analytical tool, which inspires researchers to adopt fuzzy-based methods for the clustering task. The Fuzzy C-means algorithm is a representative and scalable soft partition method that obtains the degree of uncertainty of belonging to each category, building a fuzzy representation of categories that reflects the nature of the real world. Therefore, integrating the Fuzzy C-means algorithm into our self-supervised framework is better than conventional crisp partition methods, e.g., K-means and hierarchical clustering.
In detail, the fuzzy clustering algorithm treats each sample as belonging to every cluster center in a fuzzy conception, which often matches realistic applications. It assigns each sample to multiple classes because there often exist noisy, unclear boundaries in the feature space. Fuzzy clustering algorithms introduce a membership of each sample in each class by the fuzzy partition conception [8].
Inspired by the successful applications of the Fuzzy C-means algorithm [7], [26], [43], we build the self-supervision module on the membership probabilities of the Fuzzy C-Means (FCM) algorithm. Specifically, we denote u_i as the probability of x belonging to the i-th cluster, which is iteratively learned by FCM. In the self-supervised framework, we expect our network to directly output this probability following the FCM algorithm without any iterative training, which leads to efficient application in realistic retinal image diagnosis. This expectation is achieved by the proposed self-supervision loss L_ss,

L_ss = (1/N) Σ_{x ∈ X} ||f_ss(z) − u||_2^2,    (3)

where f_ss consists of two fully connected layers reducing the dimension of z to the number of clusters N_c, and u = [u_1, ..., u_i, ..., u_{N_c}] is the concatenation of the probabilities estimated by the FCM algorithm. Through this loss function, we can observe that the prediction f_ss(z) is guided by the membership u. We next discuss how to obtain optimal membership probabilities u.
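The self-supervision head can be sketched as follows, assuming a hypothetical 128-dimensional feature vector and N_c = 2 clusters; f_ss maps the feature to per-cluster probabilities that are regressed onto placeholder FCM memberships u.

```python
import torch
import torch.nn as nn

# Sketch of the fuzzy self-supervision head under assumed sizes: the
# feature vector z has 128 dimensions and N_c = 2 clusters
# (normal/abnormal). Hidden width 32 is an arbitrary choice.
feat_dim, n_clusters = 128, 2
f_ss = nn.Sequential(                 # two FC layers reducing z to N_c scores
    nn.Linear(feat_dim, 32), nn.ReLU(),
    nn.Linear(32, n_clusters), nn.Softmax(dim=1),
)

z = torch.randn(8, feat_dim)          # features from the encoder (toy batch)
u = torch.rand(8, n_clusters)         # memberships from FCM (placeholder values)
u = u / u.sum(dim=1, keepdim=True)    # each row sums to 1, as in FCM

# L_ss: the network output is pulled toward the FCM memberships
loss_ss = ((f_ss(z) - u) ** 2).mean()
```

The softmax makes the network output a valid probability vector per image, matching the form of the FCM membership it is trained to imitate.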
In the fuzzy C-means algorithm, there are two major components for the feature vectors of retinal images: the membership u and the cluster centers c. Based on these two components, we design a fuzzy supervision loss L_fs on the feature vector z to boost the clustering effectiveness by optimizing the network parameters. The fuzzy supervision loss L_fs is defined by

L_fs = Σ_{j=1}^{N} Σ_{i=1}^{N_c} u_{ij}^m ||z_j − c_i||_2^2,    (5)

where m is a weighting exponent parameter. This objective function is not only the network loss, but also the objective of the FCM algorithm. Given fixed u and c, the network parameters {w_l, b_l}_{l=1}^{L} can be optimized by the back propagation algorithm. In turn, u and c can be updated from the learned feature vectors while the network parameters {w_l, b_l}_{l=1}^{L} are fixed. The detailed updates of u and c are

u_{ij} = 1 / Σ_{k=1}^{N_c} (||z_j − c_i|| / ||z_j − c_k||)^{2/(m−1)},    (6)

c_i = Σ_{j=1}^{N} u_{ij}^m z_j / Σ_{j=1}^{N} u_{ij}^m.    (7)

Algorithm 1 Self-Supervised Fuzzy Clustering Network
Initialization: The parameters of the feature extractor f, f_ss, and the image decoder g; the balance parameters α = 0.8, β = 0.7 in the encoder-decoder network; the cluster centers c, membership u, and parameter m = 2 in the FCM algorithm. Experimental settings: batch size = 16, learning rate = 0.001, maximum epochs N = 300, and FCM updating interval I_fcm = 10.
Pre-train the encoder-decoder with maximum epochs N_p:
  for t ∈ 1, ..., N_p do
    Train f and g by L_re (Eq. 2).
    If L_re(t) − L_re(t − 1) ≤ ε_p: return the pre-trained f and g.
  end for
Train the self-supervised framework from the pre-trained model:
  for t ∈ 1, ..., N do
    Train f and g by L_ss (Eq. 3) and L_fs (Eq. 5).
    If t % I_fcm == 0: update u and c by Eq. 6 and 7 iteratively.
  end for
Return the parameters of the SFCN framework.

Finally, we calculate the objective function Eq. 5 when u and c reach their optimal values by iteratively updating them via Eq. 6 and 7. Detailed optimization will be described below.
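The alternating FCM updates of u and c (Eq. 6 and 7) can be sketched in NumPy as follows; the toy two-cluster, 2-D data stands in for the learned feature vectors z.

```python
import numpy as np

def fcm_step(z, c, m=2.0, eps=1e-9):
    """One FCM iteration: update memberships u, then centers c.

    z: (N, d) feature vectors, c: (N_c, d) cluster centers, m: fuzzifier.
    These are the standard FCM updates; SFCN alternates them with
    network training every I_fcm epochs.
    """
    # Squared distances between every sample and every center: (N, N_c)
    d2 = ((z[:, None, :] - c[None, :, :]) ** 2).sum(-1) + eps
    # Membership update (Eq. 6): u_ij = 1 / sum_k (d_ij / d_kj)^(2/(m-1))
    power = 1.0 / (m - 1.0)
    u = (1.0 / d2) ** power
    u = u / u.sum(axis=1, keepdims=True)
    # Center update (Eq. 7): c_i = sum_j u_ij^m z_j / sum_j u_ij^m
    um = u ** m
    c_new = (um.T @ z) / um.sum(axis=0)[:, None]
    return u, c_new

rng = np.random.default_rng(0)
z = np.vstack([rng.normal(0, 0.3, (20, 2)),    # cluster around (0, 0)
               rng.normal(3, 0.3, (20, 2))])   # cluster around (3, 3)
c = np.array([[0.5, 0.5], [2.5, 2.5]])         # initial centers
for _ in range(10):
    u, c = fcm_step(z, c)
```

After a few iterations the centers settle near the two cluster means, while u keeps a soft (non-binary) assignment for every sample.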

D. JOINT OPTIMIZATION OF SFCN
To train the proposed self-supervised fuzzy clustering network in a unified framework, the final loss function L of SFCN is formed by fusing the balanced losses of Eq. 2, 3, and 5,

L = L_ss + αL_re + βL_fs,    (8)

where α and β are balance parameters for the different items in the total loss. This loss function ensures that we can obtain optimal network parameters given the fixed clusters c and probabilities u learned by the FCM algorithm. As for updating the optimal u and c, we minimize the fuzzy supervision loss L_fs once we obtain a satisfactory network output of the feature vector z.
To train the SFCN framework, we design a two-stage optimization algorithm: (1) train the encoder-decoder network to prepare the initial parameters of SFCN; (2) update the unified network by the fuzzy self-supervised module in a back propagation manner. The optimization is summarized in Algorithm 1.

E. ALGORITHM TRAINING
Our self-supervised fuzzy clustering network consists of the feature learning module, the reconstruction module, and the fuzzy self-supervision module. The feature learning module is built on ResNet [21], with one convolution layer, four residual blocks, and one fully connected layer. The reconstruction module follows the image decoder in CycleGAN [44], which employs the architecture from Johnson and Fei-Fei [22], containing several residual convolution blocks and two stride-1/2 deconvolutions to conduct upsampling, with instance normalization attached. For the fuzzy self-supervision module, we attach two fully connected layers after the feature learning module to output the prediction of u, which is constrained by the results of Fuzzy C-means clustering on the extracted features.
In detail, we employ the PyTorch framework to implement the network on two NVIDIA GTX 2080Ti GPUs under the Ubuntu 16.04 operating system. Each image is resized to 224 × 224, and the batch size is set to 32. In addition, we utilize the SGD optimizer to update the network parameters, with a learning rate of 0.001 decayed to 0 over 300 epochs. The balance parameters in Eq. 8 are α = 0.8, β = 0.7, and the parameter m = 2 in Eq. 5, with which we obtain optimal performance. The maximum number of training epochs is set to 300. The training strategy for datasets of different scales is introduced in Section IV.
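The stated optimizer schedule (SGD with learning rate 0.001 linearly decayed to 0 over 300 epochs) might be set up as below; the linear layer is only a placeholder for the SFCN parameters, and the linear decay rule is an assumption about how "decayed to 0 in 300 epochs" is realized.

```python
import torch
import torch.nn as nn

# Placeholder model standing in for the SFCN parameters.
model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
# Linear decay of the learning rate toward 0 across 300 epochs:
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda epoch: max(0.0, 1.0 - epoch / 300))

lrs = []
for epoch in range(300):
    optimizer.step()                 # (real training would compute a loss first)
    lrs.append(optimizer.param_groups[0]["lr"])
    scheduler.step()
```

The recorded learning rate starts at 0.001 and shrinks monotonically toward 0 by the final epoch.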

F. APPLICATION ON RETINAL IMAGES
Through the proposed optimization algorithm for SFCN, we can obtain the optimal network parameters of f and the following fully connected layers f_ss. Given a retinal image dataset X = {x_1, ..., x_i, ..., x_N}, f_ss outputs the probabilities u of each image belonging to each cluster. We obtain the predicted cluster label Label_c for x_i by

Label_c(x_i) = arg max_{i ∈ {1, ..., N_c}} u_i.
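In code, this prediction step is a simple argmax over the membership probabilities; the values of u below are toy placeholders.

```python
import numpy as np

# Given the probabilities u output by f_ss for each image, the predicted
# cluster label is the index of the largest membership.
u = np.array([[0.91, 0.09],          # toy memberships for 3 images, N_c = 2
              [0.30, 0.70],
              [0.55, 0.45]])
labels = u.argmax(axis=1)            # Label_c = argmax_i u_i
```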

IV. EXPERIMENTS
In this section, we conduct extensive evaluation experiments on three available retinal image datasets, MESSIDOR [13], DRIVE [35], and DIARETDB1 [24], following the research [6]. Then, we evaluate the performance of our proposed self-supervised fuzzy clustering network in several aspects (e.g., accuracy, ROC curve, and AUC). Finally, we discuss the effectiveness of the fuzzy self-supervision module in the SFCN method to further validate the novelties of this paper.

A. DATASETS
In the experiments, we employ the MESSIDOR, DIARETDB1, and DRIVE datasets, which are briefly introduced in this subsection. MESSIDOR [13] is a widely used dataset for automatic diagnosis on retinal images, which has been employed by existing approaches. The dataset was funded by the French Ministry of Research and Defense within a 2004 TECHNO-VISION program. In detail, it contains 1200 eye fundus color digital images acquired by 3 ophthalmologic departments using a color video 3CCD camera mounted on a Topcon TRC NW6 non-mydriatic retinograph with a 45 degree field of view. The images were captured using 8 bits per color plane at 1440*960, 2240*1488, or 2304*1536 pixels, and annotated by medical experts with a retinopathy grade and the risk of macular edema. In this paper, we evaluate this dataset on the normal/abnormal retinal image classification task. The architecture is trained with 900 fundus images from this dataset, and the remaining 300 images, equally sampled from the normal and abnormal categories, are employed as testing data.
DRIVE [35] is a retinal image dataset originally built for retinal vessel detection in fundus images, which we employ for the normal/abnormal retinal image classification task in this paper. The dataset was collected from a diabetic retinopathy screening program in the Netherlands. The screening population consists of 400 diabetic subjects between 25-90 years of age. Forty photographs were randomly selected, of which 33 do not show any sign of diabetic retinopathy and 7 show signs of mild early diabetic retinopathy. Each image was acquired using a Canon CR5 non-mydriatic 3CCD camera with a 45 degree field of view, and captured using 8 bits per color plane at 768*584 pixels. For the division of training and testing data, we select 250 images equally from the normal and abnormal classes, and the remaining 150 images are utilized for testing. Note that, because deep neural networks require a large amount of training data, we use the SFCN model optimized on MESSIDOR to initialize the parameters of SFCN when conducting experiments on the DRIVE dataset.

DIARETDB1 [24] is a standard database to benchmark diabetic retinopathy detection in digital images. In this paper, we employ it for the diabetic retinopathy classification task. The database consists of 89 color fundus images, of which 84 contain at least mild non-proliferative signs of diabetic retinopathy, and 5 are considered normal, not containing any signs of diabetic retinopathy according to all the experts who participated in the evaluation. The images were captured using the same 50 degree field-of-view digital fundus camera with varying imaging settings, and were independently marked by 4 medical experts. Because this dataset has a limited scale of 89 images, we also train the SFCN model by initializing the network parameters with the model optimized on MESSIDOR. For this dataset, we use 50 images in the training stage and 39 images for testing.

B. EVALUATION CRITERIA
In this subsection, we introduce several evaluation criteria to measure the performance of the proposed self-supervised fuzzy clustering network.
(1) Accuracy indicates the overall classification performance of the model, measured as the ratio of correctly predicted samples among all samples, calculated by (TP + TN) / (TP + FP + TN + FN).
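A direct translation of this formula, with toy counts chosen in the spirit of the confusion matrix reported later (90 correct abnormal and 86 correct normal out of 100 images per class):

```python
# Accuracy from confusion-matrix counts: (TP + TN) / (TP + FP + TN + FN).
def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)

# Toy counts: 90 true positives, 86 true negatives out of 200 samples.
acc = accuracy(tp=90, fp=14, tn=86, fn=10)   # -> 0.88
```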
(2) The ROC curve is the Receiver Operating Characteristic curve, which measures the discriminative ability of a classifier and is widely used in retinal image classification. (3) AUC is the area under the ROC curve, a widely used evaluation metric for examining the performance of a classification model in recognition tasks.
To further demonstrate the effectiveness of SFCN on unsupervised retinal image classification, we also present the network performance via a confusion matrix, loss and accuracy curves, feature map visualization, and clustering visualization by t-SNE, which are described in the following paragraphs.

C. PERFORMANCE EVALUATION
Following the training details described in Section III, we conduct extensive experiments on the three available retinal image databases and achieve comparative results. We summarize the results in Tables 1, 2, and 3, along with several related compared methods (including supervised and unsupervised frameworks).
It can be seen that the proposed self-supervised fuzzy clustering network achieves 87.6%, 81.7%, and 84.7% accuracy on the MESSIDOR, DRIVE, and DIARETDB1 datasets, respectively. On the MESSIDOR dataset, we compare our SFCN approach with seven supervised methods and one unsupervised method; the best supervised method [9] obtains 100% accuracy and 1.00 AUC, though its superiority depends on full label supervision. Our SFCN achieves 87.6% accuracy and 0.85 AUC, while the best compared unsupervised method (RotNet) obtains 82.3% accuracy and 0.837 AUC. Moreover, the ROC curves of the unsupervised methods are illustrated in Figure 3, where it can be seen that our SFCN performs best among them.
Though our method performs worse than the best supervised method, it is superior to the DEC and RotNet models, and keeps an acceptable distance from the supervised methods. Note that DEC is a classical deep learning based clustering method, and RotNet is a rotation based self-supervised learning approach. The comparison on MESSIDOR illustrates that our SFCN outperforms not only the clustering method (DEC), but also the rotation-based self-supervised learning method (RotNet). From Tables 2 and 3, the same conclusions can be drawn from the accuracies on the DRIVE and DIARETDB1 datasets. This demonstrates that our proposed SFCN model reaches satisfactory performance compared with supervised and unsupervised methods, while not requiring any labeled data to train such an efficient network. In particular, our SFCN achieves better performance than the supervised method [17], a hand-crafted feature based method, which benefits from the robust feature learning ability of the convolutional neural network. Moreover, the main reason for the good results of SFCN is that our method combines the advantages of self-supervised learning based on deep convolutional neural networks and Fuzzy C-means clustering with the soft partition strategy, which is more realistic than crisp partition methods.
To further show the training efficiency of our network, we further introduce four criteria of the network, including confusion matrix, loss and accuracy curves, feature map visualization and t-SNE clustering plot, taking the performance on MESSIDOR dataset as a representation.
The confusion matrix is an error matrix for judging the quality of a classifier, indicating whether the predicted results belong to the real class. Figure 5 shows the confusion matrix on the MESSIDOR dataset, for which we randomly choose 100 images per class. In the confusion matrix, the vertical coordinates are the true annotations, and the horizontal coordinates denote the labels predicted by SFCN. Our SFCN obtains 90 correctly predicted abnormal samples and 86 correctly predicted normal retinal samples, carrying out 88% accuracy on the chosen data. For abnormal images, our model can diagnose 90% of them, which is a preferable performance in clinical applications.
Loss and Accuracy curves reflect the training efficiency of the network, illustrating the variation of loss and accuracy over the training epochs. We draw the curves for both training and testing data in Figure 4. From the curves, the loss decreases markedly during training until it reaches epoch 250, where the accuracy also achieves its optimal result. This training trend demonstrates that our SFCN network converges easily and has a desirable training efficiency.
Feature Map is a means of evaluating the convolutional layers and a significant tool for reflecting the learning ability inside a CNN. We visualize several feature maps produced by the first convolutional layer of the feature learning module, as shown in Figure 6. The first picture in Figure 6 is the original retinal image, and the others are its feature maps obtained by the first convolutional layer. To better understand the feature maps of the convolutional layers, we also visualize the feature maps of the second convolutional layer for the same retinal image (as shown in Figure 7). It can be seen that the feature maps of both the first and second convolutional layers clearly present the components relevant to diagnosis, such as the vessel trails, the retinal edge, and the lesion area. This further proves the feature extracting ability of the proposed self-supervised fuzzy clustering network. Moreover, the features of the last layer of the feature learning module are analyzed by t-SNE, as discussed next.
t-SNE is a dimension-reduction technique combining the t-distribution with stochastic neighbor embedding to project high-dimensional features into a 2-D space, so its plot presents the representation ability of the last layer of the feature learning module. We draw the t-SNE plot in Figure 8, and the result shows that our SFCN model can clearly separate the abnormal and normal samples in the 2-D space. This is another crucial piece of evidence for the robustness of the proposed classifier.
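A t-SNE projection of this kind can be sketched with scikit-learn (a generic illustration on random stand-in features; in the paper the embedding is computed on the activations of the trained feature learning module's last layer):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Stand-ins for 64-D last-layer features of 50 normal and 50 abnormal
# samples; in practice these come from the trained network.
feats = np.vstack([rng.normal(0.0, 1.0, (50, 64)),
                   rng.normal(4.0, 1.0, (50, 64))])

# Project the high-dimensional features into 2-D for plotting (Figure 8 style).
emb = TSNE(n_components=2, perplexity=30,
           init="pca", random_state=0).fit_transform(feats)
```

Each row of `emb` is a 2-D point, which can then be scattered and colored by class label to reveal the separation between normal and abnormal samples.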

D. FURTHER ANALYSIS
1) EVALUATION OF THE FUZZY C-MEANS ALGORITHM
To evaluate the effectiveness of using the fuzzy C-means algorithm as the self-supervision guide, we replace it with two crisp partition strategies: K-means and hierarchical clustering. The crisp clustering results of K-means and hierarchical clustering on the learned features are introduced into Eq. 3, yielding the Self-supervised K-means Clustering Network (SKCN) and the Self-supervised Hierarchical Clustering Network (SHCN), respectively. Their results are reported in Tables 1-3. On the MESSIDOR dataset, SKCN and SHCN achieve accuracies of 82.8% and 81.7% respectively, at least 4% below SFCN, and their AUCs are also lower than that of our proposed SFCN. Moreover, the accuracies on the DRIVE and DIARETDB1 datasets also show that our SFCN is superior to the modified SKCN and SHCN. This comparison demonstrates that fuzzy C-means is more helpful than traditional crisp partition strategies, being more realistic for modeling the category attributes of the samples.
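The two crisp baselines can be reproduced in outline with scikit-learn (a sketch on random stand-in features; in the paper the clustering runs on the learned features before being fed into Eq. 3):

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 64))  # stand-in for the learned features

# Crisp pseudo-labels as used by the SKCN and SHCN variants: each sample
# is assigned to exactly one of the two clusters (normal / abnormal),
# with no graded membership in between.
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(feats)
hc_labels = AgglomerativeClustering(n_clusters=2).fit_predict(feats)
```

The hard 0/1 labels these methods emit are exactly what the soft FCM memberships replace in the full SFCN.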

2) EVALUATION OF SELF-SUPERVISION LOSS
To validate the effectiveness of the proposed fuzzy self-supervision module, we modify the SFCN by removing the self-supervision loss and directly attaching the FCM to the features obtained by the feature learning module; the results are shown in Tables 1, 2 and 3. The complete SFCN achieves improvements of 6.1%, 4.1%, and 5.4% over directly using FCM on the MESSIDOR, DRIVE, and DIARETDB1 datasets, respectively. In addition, it also obtains a gain of 0.04 AUC on the MESSIDOR dataset. These results further prove that the proposed self-supervision mechanism can effectively improve the performance of retinal image classification for automatic abnormality diagnosis.
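As an illustration only (the paper's exact Eq. 3 may differ in form), a self-supervision term of this kind can be written as a cross-entropy between the network's class probabilities and the FCM membership targets:

```python
import numpy as np

def self_supervision_loss(pred_probs, fcm_memberships, eps=1e-12):
    """Cross-entropy between predicted class probabilities and fuzzy
    membership targets; a generic sketch of an FCM-guided loss, not
    the paper's exact formulation."""
    return float(-np.mean(np.sum(fcm_memberships * np.log(pred_probs + eps),
                                 axis=1)))

# Predictions that agree with the soft FCM targets incur a lower loss
# than predictions that contradict them, steering the network's training.
u = np.array([[0.8, 0.2], [0.3, 0.7]])                       # FCM memberships
good = self_supervision_loss(np.array([[0.8, 0.2], [0.3, 0.7]]), u)
bad = self_supervision_loss(np.array([[0.2, 0.8], [0.7, 0.3]]), u)
```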

V. CONCLUSION
In this paper, we propose a Self-supervised Fuzzy Clustering Network (SFCN) to solve the diabetic retinopathy classification task, which has attracted much research interest in supervised models. However, these existing methods require tremendous professional labor to annotate large amounts of data, which is very expensive in medical image analysis. Therefore, our proposed SFCN model focuses on unsupervised retinal image classification, using a feature learning module, a reconstruction module, and a fuzzy self-supervision module to learn an unsupervised feature space. Extensive validation experiments are implemented on the MESSIDOR, DRIVE, and DIARETDB1 datasets, and the analysis of the results demonstrates the effectiveness of our SFCN approach.
YUEGUO LUO received the Ph.D. degree from Chongqing University, China, in 2017. Since 2004, he has been working as a Teacher with Yangtze Normal University, Fuling, Chongqing. He is currently an Associate Professor. His research interests include computational intelligence and biology computing.
JING PAN was born in 1986. She received the Ph.D. degree from Southwest University, in 2015. She is currently a Professor with the Department of Physics and Chemistry, Taiyuan University. She has a wealth of research experience in the synthesis of nanomaterials, battery electrode modification, mathematics, optimization, computer vision, and coding.
SHAOSHUAI FAN is currently pursuing the bachelor's degree with the School of Cyber Security and Computer, Hebei University, Baoding, China. He is also an Intern with SIBD. His research interests include deep learning and medical image classification.
ZEYU DU is currently pursuing the bachelor's degree with the School of Cyber Security and Computer, Hebei University, Baoding, China. He is also an Intern with SIBD. His research interests include medical image analysis and pattern recognition.
GUANGHUA ZHANG was born in 1986. He received the Ph.D. degree in computer science and technology from Chongqing University, in 2017. He is currently the Director of the Shanxi Intelligent Big Data Industry Technology Innovation Research Institute. He is also a Professor with the Department of Computer Science and Engineering, Taiyuan University. His research interests include machine learning, multispectral image processing, computer vision, and medical image analysis.