DIC: Deep Image Clustering for Unsupervised Image Segmentation

Unsupervised segmentation is an essential pre-processing technique in many computer vision tasks. However, current unsupervised segmentation techniques are either sensitive to parameters such as the number of segments or suffer from high training and inference complexity. Encouraged by neural networks' flexibility and their ability to model intricate patterns, an unsupervised segmentation framework based on a novel deep image clustering (DIC) model is proposed. DIC consists of a feature transformation subnetwork (FTS) and a trainable deep clustering subnetwork (DCS) for unsupervised image clustering. FTS is built on a simple yet capable network architecture, while DCS assigns pixels to different clusters by iteratively updating cluster associations and cluster centers. Moreover, a superpixel guided iterative refinement loss is designed to optimize the DIC parameters in an overfitting manner. Extensive experiments have been conducted on the Berkeley Segmentation Database. The experimental results show that DCS is more effective in aggregating features during the clustering procedure. DIC has also proven to be less sensitive to varying segmentation parameters and to incur lower computation costs, and it achieves significantly better segmentation performance than state-of-the-art techniques. The source code is available at https://github.com/zmbhou/DIC.


I. INTRODUCTION
Object segmentation is a challenging problem in the field of computer vision, and it has been widely applied in areas such as object recognition and image classification. Generally speaking, object segmentation methods can be divided into three categories: unsupervised, semi-supervised and fully supervised. The unsupervised methods, such as K-means [1], EM [2], FH segmentation [3], Active contour [4], normalized cut [5], mean-shift clustering [6], MLSS [7] and SAS [8], are implemented without prior knowledge about the images. In the semi-supervised methods [9]-[12], users can label pixels as foreground or background with interactive segmentation approaches. User input can locate where the object is (location information), and the colour and texture information contained in the scribbles provides prior knowledge about what the object is. Fully supervised segmentation methods are mostly designed for detecting and extracting specific semantic object categories in images, and an accurately labelled training dataset is required. Currently, deep learning-based semantic segmentation solutions have also achieved significant improvements and attracted a lot of attention [13]-[17]. (The associate editor coordinating the review of this manuscript and approving it for publication was Mohammad Shorif Uddin.)
In this paper, we focus on the research topic of unsupervised image segmentation. A typical unsupervised segmentation algorithm contains two parts: feature extraction from an image's pixels, and dividing the image into non-overlapping regions by pixel clustering, as described in methods such as normalized cut [5], MLSS [7] and SAS [8]. For example, a segmentation framework based on bipartite graph partitioning is designed to aggregate multi-layer superpixels in SAS [8]. In MLSS [7], a semi-supervised learning strategy is applied to generate pairwise affinities based on a sparse graph constructed over pixels and over-segmented regions; the pairwise affinities are then fed to spectral segmentation algorithms. However, the performance of those methods may suffer from two principal drawbacks: they are sensitive to segmentation parameters such as the cluster number, and the whole pipeline is complex and cannot be optimized jointly. Neural networks have also been applied to solve the unsupervised segmentation problem. However, existing frameworks such as W-Net [18] involve techniques such as complex loss functions; hence, the training procedure is sophisticated.
We argue that a practical unsupervised segmentation method should exhibit characteristics such as adaptivity for incorporating additional cues and ease of joint algorithm optimization. Encouraged by neural networks' flexibility and their ability to model complex patterns, a novel unsupervised segmentation framework based on neural networks is proposed. Firstly, a deep image clustering (DIC) network is designed. DIC contains a feature transformation subnetwork and a trainable deep clustering subnetwork. An autoencoder based network architecture is applied to transform the pixel information of images into high-dimensional features. Then a deep clustering subnetwork, implemented by iteratively calculating the cluster centers and cluster associations, is proposed. Secondly, superpixels are taken as grouping cues and a superpixel guided iterative refinement loss is designed to train DIC effectively. Finally, DIC is optimized via backpropagation in an overfitting manner on a single-image basis. The main contributions of the paper are summarized as follows, and the flow chart of our approach is shown in Figure 1.
• We propose a deep image clustering (DIC) model which consists of a feature transformation subnetwork (FTS) and a differentiable deep clustering subnetwork (DCS) for dividing the image space into different clusters.
• We propose a simple and effective superpixel guided iterative refinement loss for optimizing the DIC parameters in an end-to-end way. DIC can be optimized on a single image's basis in an overfitting manner.
• We achieve highly competitive results on the Berkeley Segmentation Database, with performance gains from 0.8319 to 0.8407 in PRI, from 0.1779 to 0.1390 in GCE, and from 11.29 to 10.18 in BDE.
The remainder of the paper is organized as follows. The related work is reviewed in section II, the proposed framework is introduced in section III, and the experimental results are presented in section IV.

II. RELATED WORK
A. UNSUPERVISED SEGMENTATION
Many unsupervised segmentation methods have been proposed recently, such as mean-shift (MS) [6], k-means [1], normalized cuts (NCuts) [5], Felzenszwalb and Huttenlocher's graph-based method (FH) [3], SDTV [19], KM [20], UCM [21], CCP [22], MLSS [7] and SAS [8]. Mean-shift (MS) [6] builds a non-parametric probability distribution in a feature space and applies mean-shift filtering in this domain to yield a convergence point for each pixel. Normalized cuts (NCuts) [5] focuses on minimizing the similarity between groups while maximizing the associations within groups. Other methods can be divided into two categories: region-based and contour-based. Region-based unsupervised segmentation methods focus on finding the similarity among neighbouring pixels and merging them using features such as color, texture, contour or luminance. Superpixels are often taken as important cues for aiding segmentation; one typical work is MLSS [7], in which a multi-layer semi-supervised learning scheme is proposed to construct a dense affinity matrix over pixels and superpixels for spectral clustering. Another highlighted work is SAS [8], a segmentation framework based on bipartite graph partitioning designed to aggregate multi-layer superpixels. Contour-based methods focus on generating segmentation masks via contour cues. In [21], the image segmentation problem is cast as a contour detection problem: a contour detector using multiscale local brightness, color and texture cues is proposed first, and then an Ultrametric Contour Map (UCM) is constructed by generating a hierarchical region tree from the contours. In CCP [22], a contour-guided color palette (CCP) is designed first and then fine-tuned by post-processing techniques, such as leakage avoidance, fake boundary removal and small region merging, to generate robust segmentation masks.

B. DEEP LEARNING AND IMAGE SEGMENTATION
In recent years, we have witnessed great progress in image segmentation based on deep convolutional neural networks [23], [24], and various methods have been proposed. Deep learning has been applied to solve problems related to fully supervised semantic segmentation [13], [14], [16], [17], [25]-[27], interactive segmentation [28]-[30] and unsupervised segmentation [18], [31], [32]. Besides fully supervised semantic segmentation, deep learning has proven to be effective in solving the interactive segmentation problem as well. For example, in [30], the user annotations are first converted into interaction maps by measuring the distance of each pixel to the annotated locations. Then a forward pass is performed in a convolutional neural network to generate an initial segmentation map, and a backpropagating refinement scheme (BRS) is designed to correct the mislabeled pixels. Neural networks have been applied to solve the unsupervised segmentation problem as well. In W-Net [18], a segmentation framework based on an autoencoder is proposed: a k-way pixel-wise prediction is generated by the autoencoder, and the whole network is optimized jointly by a reconstruction loss and a normalized cut loss. Furthermore, techniques such as conditional random field smoothing and hierarchical segmentation are used to refine the segmentation results. However, these methods may suffer from several drawbacks in applications: (1) the segmentation framework has a high computation cost; (2) the training procedure for optimizing the network parameters is sophisticated; (3) additional cues such as superpixels cannot be incorporated into the framework adaptively.
In order to design a more effective and practical neural network based unsupervised segmentation framework, a deep image clustering module is proposed in this paper. DIC utilizes a lightweight network architecture to reduce the computation cost. Furthermore, a superpixel guided iterative refinement loss is designed to optimize the DIC parameters in a simpler way. Hence, the unsupervised segmentation network can be optimized easily in an overfitting manner.

III. THE PROPOSED METHOD
In this section, the framework for unsupervised image segmentation is proposed. Firstly, we introduce the deep image clustering model in subsection III-A, which consists of two modules: a subnetwork for feature extraction and a deep clustering subnetwork. Then, we present the superpixel guided iterative refinement loss in subsection III-B. Finally, the overfitting training protocol is described in subsection III-C to optimize the network parameters in an end-to-end way.

A. DEEP IMAGE CLUSTERING MODEL 1) NETWORK ARCHITECTURE FOR FEATURE TRANSFORMATION SUBNETWORK (FTS)
We use an autoencoder architecture with skip connections to construct the feature transformation subnetwork (FTS). The CNN for feature extraction is composed of a series of convolution layers interleaved with batch normalization (BN) and ReLU activations. The architecture of the feature transformation subnetwork is shown in Figure 2: FTS consists of six convolution blocks, one max-pooling operation, one deconvolution operation and a simple convolution operation. We use max-pooling, which downsamples the input by a factor of 2, after the 2nd convolution block to increase the receptive field. The 4th convolution block outputs are then upsampled by deconvolution and concatenated with the 2nd convolution block outputs before being passed to the 5th convolution block. After the 6th convolution block and the simple convolution block, the feature Y with dimension C is generated. We use 3 × 3 convolution filters with the number of output channels set to 64, 128 or 192 in each block, except the last CNN layer, which outputs C channels. This CNN architecture is chosen for its simplicity and efficiency; other network architectures are also conceivable for such an algorithm. The resulting C-dimensional features Y can be taken as coarse cluster associations. In order to aggregate the features more effectively, Y is passed to the following deep clustering module, which iteratively updates the pixel-cluster associations and the cluster centers.
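As a rough illustration (not the authors' implementation), the tensor shapes flowing through the FTS just described can be traced as follows. The per-block channel assignment is our assumption; the paper only states that each block uses 64, 128 or 192 channels.

```python
# Hypothetical sketch: trace (height, width, channels) through the FTS layout
# described above. block_channels is an assumed assignment of the paper's
# stated 64/128/192 choices to the six blocks.
def fts_shape_flow(h, w, c_out=32, block_channels=(64, 128, 128, 192, 128, 64)):
    ch = block_channels
    shapes = [("block1", (h, w, ch[0])),
              ("block2", (h, w, ch[1]))]
    # max-pooling after block 2 halves the spatial resolution
    h2, w2 = h // 2, w // 2
    shapes += [("pool",   (h2, w2, ch[1])),
               ("block3", (h2, w2, ch[2])),
               ("block4", (h2, w2, ch[3]))]
    # deconvolution restores full resolution; the skip connection
    # concatenates the block-2 output channels
    shapes += [("deconv+concat", (h, w, ch[3] + ch[1])),
               ("block5", (h, w, ch[4])),
               ("block6", (h, w, ch[5])),
               # final simple convolution outputs the C-channel feature Y
               ("conv_out", (h, w, c_out))]
    return shapes
```

For a 321 × 481 BSDS image, the output feature Y has shape (321, 481, C).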

2) TRAINABLE DEEP CLUSTERING SUBNETWORK (DCS)
Firstly, the extracted feature $Y$ is flattened to dimension $N \times C$, where $N = H \times W$ ($H$ and $W$ are the height and width of the image) and $C$ is the channel number (also referred to as the superpixel number, SPN). Then a neural network based clustering procedure is designed. Denote the $M$ cluster centers by $\mu = \{\mu_k\}_{k=1}^{M}$, which serve as the initialization for feature clustering. An iterative procedure is then applied for adaptive feature clustering. In the $t$-th iteration of DCS, the association matrix $H^t$ with dimension $N \times M$ measures the similarity between the spatial features and the respective centers $\mu^{t-1}$:

$$H^t_{nk} = \kappa\!\left(y_n, \mu_k^{t-1}\right), \qquad (1)$$

where $\kappa$ represents a general kernel function. We simply take the exponential inner product $\exp(a^{\top}b)$ as the kernel function. For the neural network implementation, the operation for updating the associations in the $t$-th iteration is formulated as a normalization over clusters:

$$H^t_{nk} = \frac{\exp\!\left(y_n^{\top}\mu_k^{t-1}\right)}{\sum_{j=1}^{M}\exp\!\left(y_n^{\top}\mu_j^{t-1}\right)}. \qquad (2)$$

With the updated association $H^t$, the cluster centers $\mu^t$ can be estimated as the association-weighted summation of $Y$. Hence $\mu^t_k$ is defined as:

$$\mu_k^t = \frac{\sum_{n=1}^{N} H^t_{nk}\, y_n}{\sum_{n=1}^{N} H^t_{nk}}. \qquad (3)$$

Based on the above formulations, the associations $H^t$ and cluster centers $\mu^t$ are updated alternately over the iterations. Once the iterative procedure terminates, the final associations $H$ and cluster centers $\mu$ are used to construct the aggregated features $\Upsilon$:

$$\Upsilon = H\mu. \qquad (4)$$

As $\Upsilon$ is constructed from a compact basis set, it has a low-rank property compared with the input $Y$. Note that the iterative updating procedure is free of parameters. The whole pipeline of DCS is illustrated in Figure 3. Once the aggregated feature $\Upsilon$ is obtained, the final cluster assignment $\ell_n$ for each pixel is set by selecting the dimension with the maximum value of $\Upsilon_n$:

$$\ell_n = \arg\max_{c}\, \Upsilon_{n,c}. \qquad (5)$$

Hence $C$ can be interpreted as the cluster number in the following subsections.
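The parameter-free DCS iteration above can be sketched in a few lines of NumPy. This is an illustrative reconstruction, assuming the exponential inner-product kernel; the variable names are ours.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dcs(Y, mu, iters=3):
    """Y: (N, C) flattened features; mu: (M, C) initial cluster centers."""
    for _ in range(iters):
        # Eq. (2): H[n, k] = exp(y_n . mu_k) / sum_j exp(y_n . mu_j)
        H = softmax(Y @ mu.T, axis=1)                             # (N, M)
        # Eq. (3): centers as the association-weighted mean of the features
        mu = (H.T @ Y) / (H.sum(axis=0, keepdims=True).T + 1e-8)  # (M, C)
    U = H @ mu                  # Eq. (4): aggregated low-rank features, (N, C)
    labels = U.argmax(axis=1)   # Eq. (5): per-pixel cluster assignment
    return U, labels
```

Note that the loop contains no learnable parameters; only the feature transformation subnetwork that produces Y is trained.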

B. SUPERPIXEL GUIDED ITERATIVE REFINEMENT LOSS
In the setting of unsupervised image segmentation, reliable supervision information is not available. The training protocol of DIC is therefore formulated as a classification task in a self-training manner, in which superpixel guided grouping cues are taken as the principal supervision information.
The intuition behind this is that pixels in the same superpixel tend to be assigned the same cluster label, and the superpixels can also serve as cues for refining the object boundaries. Hence a superpixel guided iterative refinement loss is defined for network training. Firstly, the constraint that cluster labels should be the same for neighbouring pixels is enforced by superpixels. $K$ superpixels $\{S_k\}_{k=1}^{K}$ are extracted from the input image $I$ using a technique such as MCG [21], where $S_k$ denotes the set of indices of pixels that belong to the $k$-th superpixel; superpixels of higher quality can generate more reliable supervision. Hence all of the pixels in each superpixel are guided to have the same cluster label in the training procedure. In iteration $t-1$, the assignment $\ell^{t-1}$ is refined by the superpixels so as to enforce the label consistency constraint. Assuming $S_p$ is the superpixel that contains pixel $n$, a histogram is constructed by counting the cluster assignments within $S_p$:

$$\mathrm{Hist}_p(i) = \sum_{m \in S_p} \mathbb{1}\!\left(\ell_m^{t-1} = i\right). \qquad (6)$$

Finally, the refined assignment is defined by the following rule:

$$\tilde{\ell}_n^{t-1} = \arg\max_{i \in \{\ell_i^{t-1} \mid i \in S_p\}} \mathrm{Hist}_p(i), \qquad (7)$$

where $\{\ell_i^{t-1} \mid i \in S_p\}$ stands for the unique label set defined in $\mathrm{Hist}_p$.
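The majority-vote refinement above can be sketched as follows; this is a hedged reconstruction of the rule, with our own function name.

```python
import numpy as np

def refine_by_superpixels(labels, sp):
    """Every pixel inherits the majority cluster label of its superpixel.

    labels: (N,) integer cluster assignment per pixel
    sp:     (N,) superpixel index per pixel
    """
    refined = labels.copy()
    for s in np.unique(sp):
        members = np.where(sp == s)[0]
        vals, counts = np.unique(labels[members], return_counts=True)
        refined[members] = vals[counts.argmax()]   # majority vote
    return refined
```

For example, with labels [0, 0, 1, 2, 2, 2] and superpixels [0, 0, 0, 1, 1, 1], the stray label 1 in the first superpixel is overwritten, giving [0, 0, 0, 2, 2, 2].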
Then the refined assignment $\tilde{\ell}^{t-1}$ is taken as the supervision for training the DIC network in iteration $t-1$. Similarly, $\ell^{t}$ will be generated and refined subsequently as the supervision for the $t$-th iteration. Given the assignment $\ell^{t}$ and the refined assignment $\tilde{\ell}^{t}$, the iterative refinement loss is defined to optimize the DIC network using the cross entropy:

$$L^{t}_{itef} = -\frac{1}{N}\sum_{n=1}^{N} \log P^{t}_{n}\!\left(\tilde{\ell}^{t}_{n}\right), \qquad (8)$$

where $P^{t}_{n}$ denotes the softmax probability of pixel $n$ over the $C$ clusters. The iterative clustering procedure is illustrated in Figure 4; we can see that the cluster assignment is updated and refined iteratively.
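A minimal sketch of this loss, assuming the standard cross entropy between the per-pixel softmax probabilities and the refined labels:

```python
import numpy as np

def refinement_loss(U, refined):
    """U: (N, C) aggregated features; refined: (N,) refined cluster labels."""
    Z = U - U.max(axis=1, keepdims=True)
    P = np.exp(Z) / np.exp(Z).sum(axis=1, keepdims=True)  # softmax over clusters
    n = np.arange(len(refined))
    # mean negative log-likelihood of the refined label at each pixel
    return float(-np.log(P[n, refined] + 1e-12).mean())
```

In the actual framework this loss would be backpropagated through the (parameter-free) DCS into the feature transformation subnetwork.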

C. OVERFITTING TRAINING FOR UNSUPERVISED IMAGE SEGMENTATION
Encouraged by works such as deep image prior [33] and deep decoder [34], an overfitting manner is used for training the DIC network. Different from many current practices that use large scale datasets, the overfitting training protocol is simpler and the network parameters are not required to be stored. Given a target image as the input, the training procedure is as follows. (1) In the forward process of the $t$-th iteration, $Y^t$ is generated by the feature transformation subnetwork with parameters $\theta^{t-1}_{ft}$. (2) The deep clustering subnetwork runs for $Q$ iterations to aggregate the features adaptively; hence the feature $\Upsilon^t$ is obtained and the assignment $\ell^t$ is generated. (3) The refined assignment $\tilde{\ell}^t$ is then generated. In the backward process, $\tilde{\ell}^t$ is taken as the supervision with the iterative refinement loss, and the parameters $\theta^{t}_{ft}$ of the feature transformation subnetwork are updated according to the backpropagation rule using the iterative refinement loss $L^{t}_{itef}$. The detailed training and inference procedures are given in Algorithm 1.

Algorithm 1: Overfitting training of DIC
Input: image $I$, superpixels $\{S_k\}_{k=1}^{K}$, training epochs $T$, DCS iterations $Q$
1: Initialize the FTS parameters $\theta^{0}_{ft}$ and the cluster centers $\mu^{0}$
2: for $t = 1$ to $T$ do
3:   Extract the feature $Y^t$ using the feature transformation subnetwork with parameters $\theta^{t-1}_{ft}$
4:   Initialize the cluster centers with $\mu^{0}$
5:   for $q = 1$ to $Q$ do
6:     Generate the associations $H^{q}$ based on $Y^t$ and $\mu^{q-1}$ according to Eq. (2)
7:     Update the cluster centers $\mu^{q}$ according to Eq. (3)
8:   end for
9:   Generate $\Upsilon^t$ based on $H$ and $\mu$ according to Eq. (4)
10:  Generate the cluster assignment $\ell^{t}$ according to Eq. (5)
11:  Generate the refined assignment $\tilde{\ell}^{t}$ according to Eq. (7)
12:  Take $\tilde{\ell}^{t}$ as the supervision and optimize the network parameters $\theta^{t}_{ft}$ using the iterative refinement loss defined in Eq. (8)
13: end for
14: $\ell^{T}$ is taken as the final cluster assignment for image $I$
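To make the data flow of one training epoch concrete, the following forward-only toy sketch wires the steps together in NumPy. The linear map W is a hypothetical stand-in for the feature transformation subnetwork; in the actual method its parameters would be updated by backpropagating the iterative refinement loss, which is omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, C, M, Q = 200, 3, 8, 4, 3   # pixels, input dim, channels, centers, DCS iters

def softmax(x):
    x = x - x.max(axis=1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=1, keepdims=True)

pixels = rng.normal(size=(N, D))   # flattened toy "image"
sp = rng.integers(0, 10, size=N)   # precomputed superpixel index per pixel
W = rng.normal(size=(D, C))        # hypothetical stand-in for the FTS

Y = pixels @ W                     # step (1): feature transformation
mu = rng.normal(size=(M, C))       # initial cluster centers
for q in range(Q):                 # step (2): parameter-free DCS iterations
    H = softmax(Y @ mu.T)
    mu = (H.T @ Y) / (H.sum(axis=0)[:, None] + 1e-8)
U = H @ mu                         # aggregated features
labels = U.argmax(axis=1)          # cluster assignment

refined = labels.copy()            # step (3): superpixel majority-vote refinement
for s in np.unique(sp):
    m = np.where(sp == s)[0]
    v, c = np.unique(labels[m], return_counts=True)
    refined[m] = v[c.argmax()]
# backward step (omitted): update W via the cross-entropy refinement loss
```

The refined labels are constant within each superpixel, which is exactly the supervision signal the loss enforces.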

IV. EXPERIMENTAL RESULTS
In this section, we first introduce the experimental setup in subsection IV-A. Then, we compare the empirical results of our DIC with other state-of-the-art methods in subsection IV-B. Moreover, analyses of the iteration number T and the cluster numbers for overfitting training are presented in subsections IV-C and IV-D. The role of the deep clustering subnetwork is evaluated in subsection IV-E, and the computational cost is reported in subsection IV-F. The analysis of the application to semantic segmentation is presented in subsection IV-G.

A. EXPERIMENTAL SETUP
The segmentation results on the two Berkeley Segmentation Databases (BSDS300 and BSDS500) [35], which consist of 300 and 500 natural images respectively, are reported. To quantitatively evaluate the segmentation results, five criteria are used: 1) Probabilistic Rand Index (PRI) [36]; 2) Variation of Information (VoI) [37]; 3) Global Consistency Error (GCE) [35]; 4) Boundary Displacement Error (BDE) [38]; and 5) Segmentation Covering (SC). The segmentation performance is better if PRI and SC are larger and the other three are smaller compared to the ground truths [8]. In the implementation of the deep clustering module, the iteration number is set as 3 according to cross-validation experiments. In the training procedure, the training epoch T is set as T = 100, the learning rate is set as 5 × 10−2 and the momentum is set as 0.9.
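For reference, PRI can be sketched as below; this is a hedged reconstruction of the metric's common definition (the fraction of pixel pairs whose same/different-cluster relation agrees with a ground truth, averaged over the multiple human segmentations that BSDS provides), not code from the benchmark toolkits.

```python
from itertools import combinations
import numpy as np

def rand_index(a, b):
    """Fraction of pixel pairs on which labelings a and b agree
    (both same-cluster or both different-cluster)."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)

def pri(seg, ground_truths):
    """Probabilistic Rand Index: mean Rand index over all ground truths."""
    return float(np.mean([rand_index(seg, gt) for gt in ground_truths]))
```

Note that the measure is invariant to relabeling: pri([0, 0, 1, 1], [[1, 1, 0, 0]]) is 1.0, since only the pairwise same/different structure matters.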

B. COMPARISON WITH STATE-OF-THE-ART METHODS
In order to evaluate the proposed method DIC comprehensively, we first compare the average scores of DIC with sixteen benchmark algorithms, such as Ncut [5], Mean-shift [6], FH [3], JSEG [39], Multi-scale Ncut (MNcut) [40], NTP [41], SDTV [19], KM [20], gPb-owt-ucm [21], MLSS [7], SAS [8], W-Net [18] and CCP [22], on BSDS300. Similar to the strategy proposed in MLSS [7] and SAS [8], the optimal image scale (OIS) is selected for segmenting images in the Berkeley Segmentation Database; OIS means that the cluster number is selected optimally for each image. Seven settings, (SPN=20, M=16), (SPN=50, M=32), (SPN=90, M=32), (SPN=120, M=32), (SPN=160, M=64), (SPN=200, M=64) and (SPN=300, M=64), are used to generate segmentation results for each image, and then the best segmentation masks are selected. The scores of the different methods for comparison are collected from [8], [20], [22], and the segmentation results are reported in Table 1, with the best results highlighted in bold for each criterion. We can see that DIC achieves the highest PRI of 0.8407, the lowest BDE of 10.18, the second-lowest VoI and the second-lowest GCE compared to the other methods. In terms of VoI, gPb-owt-ucm outperforms DIC slightly, and CCP achieves a GCE of 0.127, which is slightly better than the 0.139 of DIC. Compared with the neural network based W-Net [18], DIC achieves a 0.03 point gain in PRI even though no post-processing procedure is applied. The performance improvement is primarily owed to the deep clustering module for feature aggregation and DIC's ability to incorporate additional cues such as superpixels. Moreover, the segmentation results on BSDS500 are reported as well: DIC obtains the best SC of 0.66 and PRI of 0.864 compared with the methods reported in Table 2, such as Ncut [5], Mean-shift [6], MLSS [7], gPb-owt-ucm [21], SF [42] and W-Net [18].
When compared with gPb-owt-ucm, DIC achieves higher SC and PRI because it can aggregate similar superpixels more effectively to generate fewer superpixels in a self-guided manner. However, the value of VoI will increase due to the smaller number of superpixels produced by DIC.
The visual comparison is illustrated in Figure 5 as well, in which the segmentation maps generated by SAS, MLSS and DIC are displayed. We can see that DIC works better in merging similar pixels and separating diverse regions by learning from local image patterns adaptively.

C. EVALUATION OF ITERATION NUMBER T FOR OVERFITTING TRAINING
DIC is optimized in an overfitting manner on a single-image basis, so there is a trade-off between the number of training iterations and the segmentation performance: more iterations increase the computation cost, while too few may prevent the training procedure from converging. An experiment is conducted to evaluate the effect of different iteration numbers T, with SPN=100 and M = 32, on the segmentation performance, and the evaluation results are displayed in Table 3. As shown in Table 3, a PRI of 0.8038 is achieved when T = 50. If T = 100, PRI is boosted to 0.8108, and performance gains can be witnessed for VoI, GCE and BDE as well. When T is small, similar superpixels have not yet been merged successfully, probably because the algorithm fails to converge with small T. When T is set to a value larger than 100, only a marginal performance gain can be witnessed while the computation cost increases greatly; it is observed that the overfitting training procedure converges in around 100 iterations. Hence T = 100 is set as the default setting in the following experiments.
Moreover, a convergence analysis of the overfitting training procedure is presented as well. Firstly, five images are selected and the corresponding segmentation results are displayed in Figure 6. Then the curves of the loss defined in Eq. (8) within 150 iterations of the overfitting training procedure are displayed in Figure 7. It can be seen in Figure 7 that the loss decreases slowly when T > 50 and the overall losses converge in around 100 iterations, which supports the observation in Table 3 that the segmentation performance changes only slightly when T ≥ 100.

D. ROBUSTNESS TO DIFFERENT CLUSTER NUMBERS
In this subsection, the segmentation performance conditioned on different cluster numbers C is evaluated. As shown in Table 4, the performance of SAS drops sharply when the cluster number increases, because consistent regions may be divided into small regions when the cluster number is large, while DIC is less sensitive to the number of superpixels. We find that DIC can generate relatively satisfactory results by learning from local image patterns adaptively even if the default cluster number is large. This indicates that the adaptive clustering mechanism of DIC tends to converge to optimal solutions by merging similar pixels into one cluster adaptively. The visual comparison is shown in Figure 8; we can see that the performance of DIC is more reliable when the setting of superpixel numbers changes. Even if SPN is set to a large value such as 300, DIC can still assign the same cluster label to pixels with similar color or texture cues by extracting more effective features and aggregating them adaptively. By contrast, the segmentation errors of SAS may increase as the cluster number grows. This is a fundamental limitation: the representation ability of the features extracted by the traditional spectral analysis technique is limited, and the k-means clustering lacks a mechanism for aggregating similar clusters accommodatively.

E. THE ROLE OF DEEP CLUSTERING SUBNETWORK
The deep clustering subnetwork is one of the major contributions of the proposed method, and DCS works as a module for fine-grained clustering. In this subsection, the role of DCS is evaluated. When DCS is not used, a softmax operation is applied for cluster assignment as a replacement. According to the comparison and analysis in the above subsections, we set SPN=50 and M = 32 to evaluate the role of DCS. The quantitative evaluation is listed in Table 5. When DCS is not used, PRI drops from 0.814 to 0.8064 and VoI increases from 1.9084 to 1.9890, which indicates that DCS is more effective in aggregating pixels with similar features in the deep clustering process than the simple softmax operation. The role of DCS is also visually illustrated in Figure 9. It is evident that pixels with similar or diverse color or texture can be merged or separated by DCS with higher accuracy, while the simple softmax operation tends to generate over-merged segmentations (see the 1st, 3rd and 4th examples in Figure 9) or fails to merge pixels with similar features (see the 5th, 6th and 7th examples in Figure 9).

F. ANALYSIS OF THE COMPUTATION COST
The computation cost of DIC is evaluated in this subsection. Two algorithms, MLSS and SAS, are selected for comparison, and the time cost comparison is presented in Table 6. For DIC, the iteration number of DCS is set as 3, the overfitting training epoch T is set as 100, and M is set according to the superpixel numbers described in Table 4. MLSS and SAS run on an Intel i7-6700K CPU with 32 GB RAM using Matlab R2016a, and DIC runs on a GTX 1080 GPU with 32 GB RAM using TensorFlow 1.5.0. DIC and SAS are faster than MLSS in general, and DIC is comparable to or faster than SAS when the superpixel number is larger than 20. When the SPN increases, the computation costs of MLSS and SAS increase dramatically, while the computation cost of DIC changes only slightly. For example, DIC takes around 21.5 s on average to segment a 321 × 481 image when SPN=300, while SAS takes 153.8 s and MLSS takes 275.5 s; DIC is almost 13 times faster than MLSS. The results indicate that neither the segmentation performance nor the computation cost of DIC is sensitive to parameters such as the cluster number, while the costs of segmentation frameworks based on spectral analysis increase greatly as the SPN rises. Moreover, the computation cost of DIC can easily be reduced for different applications by controlling the training epochs T, according to the analysis in Table 3. Although DIC runs on a GPU while MLSS and SAS only run on the CPU, the neural network based framework clearly has higher potential for further optimization as optimization algorithms and hardware develop, compared with the traditional spectral analysis based frameworks.

G. APPLICATION FOR SEMANTIC SEGMENTATION
In order to evaluate the quality of the unsupervised clusters of DIC more comprehensively, the generated unsupervised superpixels are utilized for semantic segmentation. The joint training and inference framework proposed in [17], which uses superpixels to improve segmentation performance, is applied for semantic segmentation, and the DIC superpixels are incorporated into the SCCN of [17]. The experiment is conducted on PASCAL VOC 2012 [46], a well-known segmentation dataset containing a training set, a validation set and a test set with 1464, 1449 and 1456 images respectively. Following common practice, we augment the dataset with the extra annotations provided by [47], which gives a total of 10,582 training images. The dataset provides annotations with 20 object categories and one background class. We do not provide evaluation scores such as PRI, GCE or SC on the PASCAL VOC 2012 dataset because the ground truths of the semantic boundaries are inaccurate due to the dilated boundary pixels. In the experiment, Deeplabv2-VGG [44] is connected with the DIC superpixels, the SCCN and the pairwise network to construct a model named Deeplabv2-VGG-DIC-SCCN. As shown in Table 7, Deeplabv2-VGG-DIC-SCCN achieves a mean IoU of 75.3 on the test set (http://host.robots.ox.ac.uk:8080/anonymous/YUT4XH.html), which is higher than Deeplabv2-VGG-SCCN (74.9) and other VGG+CRF based models such as CRF-RNN (74.7), GMF (73.2) and DeconvNet+FCN+CRF (72.5). Compared with the framework built on SLIC superpixels, the mean IoU of Deeplabv2-VGG-DIC-SCCN is 0.4 points higher. This is primarily because the DIC superpixels generated by deep image clustering are of higher quality and can provide more reliable supervision in the training procedure. The reported results also indicate the advantages of the proposed DIC superpixels in boosting the segmentation performance. Visual illustrations of the DIC results and the semantic segmentation results on the PASCAL VOC 2012 dataset are displayed in Figure 10.

V. CONCLUSION
We have presented a novel framework for unsupervised image segmentation based on neural networks. One major contribution is a neural network based deep image clustering model that divides the image space into different clusters by updating cluster centers and cluster associations iteratively. Another contribution is the incorporation of low-level superpixels into DIC through an iterative refinement loss that optimizes the DIC parameters in an end-to-end way, in which the superpixels serve as grouping cues for encoding complex image patterns. Moreover, DIC is utilized to generate superpixels by learning from local image patterns in an overfitting manner. Compared with other unsupervised segmentation methods, DIC is less affected by segmentation parameters, such as the cluster number, and has lower computation cost. Extensive experimental results on the Berkeley Segmentation Databases and the PASCAL VOC 2012 database have also revealed the superior performance of DIC in both quantitative and perceptual evaluations.