Image Clustering Based on Multi-Scale Deep Maximize Mutual Information and Self-Training Algorithm

Image clustering is a complex procedure that is significantly affected by the choice of image representation. Generally speaking, image representations are generated using handcrafted features or trained neural networks. When dealing with high-dimensional data, these two approaches raise two problems: i) the representational ability of manually designed features is limited; ii) non-representative and meaningless features from a trained deep network may hurt clustering performance. To overcome these problems, we propose a new clustering method that efficiently builds an image representation and precisely discovers the cluster assignments. Our main tools are an unsupervised representation learning method based on a Deep Mutual Information Maximization (DMIM) system, and a clustering method based on a self-training algorithm. Specifically, to extract an informative representation of image data, we extend the mutual information maximization theory and propose a system that learns to maximize the mutual information between the input images and the latent representations. To discover the clusters and assign each image a clustering label, a self-training mechanism is applied to cluster the learned representations. The superiority and validity of our algorithm are verified in a series of real-world image clustering experiments.


I. INTRODUCTION
Clustering, a vital research topic in the fields of data science and unsupervised learning, aims to group elements into categories on the basis of their similarity [1]. The clustering problem has been extensively studied in the past decades. However, the performance of standard clustering algorithms is adversely affected when dealing with high-dimensional data [2]. Because images are high-dimensional data, image clustering remains a challenging task in computer vision and machine learning [3].
The associate editor coordinating the review of this manuscript and approving it for publication was Noor Zaman.
Generally speaking, traditional image clustering methods such as k-means++ [4], the Gaussian mixture model [5] and spectral clustering [6] group images on handcrafted features and treat feature extraction and clustering separately. Based on this insight, many attempts have been dedicated to developing feature extraction techniques suitable for clustering, including manually designed feature descriptors such as Bag of Features (BOF) [7], Histogram of Oriented Gradients (HOG) [8], Principal Component Analysis (PCA) [9] and the Scale-Invariant Feature Transform (SIFT) [10]. However, the representational ability of manually designed features is limited, and traditional clustering methods can be misled by nuisance variables; they mostly suffer from appearance variations of scenes and objects when dealing with image data. How to automatically learn an image representation that captures the image information is therefore a crucial problem in image clustering tasks. In recent years, some novel representation learning methods have emerged, such as manifold alignment [11], dictionary learning [12] and deep neural networks [13]. Among them, deep neural networks have been successfully applied to various supervised learning tasks [14]. The reason for their success is that they can automatically learn the essential representation of images by constructing networks with multiple hidden layers and training them on large amounts of data [15]. Motivated by this, some studies are devoted to clustering images with deep neural networks, that is, deep clustering [16], [17].
Deep clustering uses neural networks to learn deep representations that favor clustering tasks. Most previous deep clustering studies train a deep generative model, e.g., an Auto-Encoder (AE), Variational Auto-Encoder (VAE) or Generative Adversarial Network (GAN), to reduce the dimensionality of the image data. Then, part of the trained generative model is fine-tuned by the clustering algorithm, and the generative model provides deep features for the clustering algorithm to discover clusters [18], [19]. This two-stage training scheme has been successfully applied in many clustering works [2], [20]-[22]. However, the clustering results may suffer from an invalid image representation: generative models use a reconstruction loss to learn the image representation, which is not an optimal choice for clustering tasks.
More recently, some deep clustering methods have tried to combine image representation learning with clustering [23], [24]. Generally, they construct neural networks and use a clustering objective as the loss function to train the network directly. From the perspective of representation learning, this one-stage clustering scheme is more reasonable because it learns the image representation and the clustering information simultaneously. However, one-stage clustering also has a non-representativeness problem: the clustering objective is not well matched to the deep representation. This may lead the learned image representation to focus only on clustering, lack the essential information of the image, and produce degenerate solutions.
To overcome the aforementioned problems, this paper establishes an image clustering method based on Deep Mutual Information Maximization (DMIM) and a self-training algorithm. Specifically, we first carry the Mutual Information Maximization (MIM) theory into deep neural networks, and propose a Deep Mutual Information Maximization (DMIM) system to learn an informative representation of image data. To discover the clusters of the input images, we assign each image a clustering label, and adopt a self-training algorithm to fine-tune the DMIM system to obtain a more clustering-friendly representation. We conduct a series of experiments to verify the effectiveness of our algorithm; the proposed algorithm outperforms the most recent competitors by a large margin.
The main contributions of this paper can be summarized in three aspects. Firstly, we propose a novel mutual information maximization system based on statistical learning theory, and use it to learn an informative image representation. Secondly, we incorporate the learned image representation into a self-training algorithm to realize image clustering. Thirdly, we conduct extensive experiments on four real-world datasets to verify the effectiveness of the proposed algorithm.
The rest of the paper is organized as follows. In Section II, we introduce the related work of our paper. Section III proposes the clustering algorithm as well as some details of the algorithm. Section IV provides a series of experiments to analyze the effect of parameters and verify the superiority of the proposed algorithm. We conclude this paper in the last section.

II. RELATED WORK

A. DEEP CLUSTERING
Deep clustering refers to clustering with deep learning techniques, and has been widely studied in recent years. Existing deep clustering algorithms are mainly divided into two categories: (I) two-stage methods that apply clustering after a representation is learned; (II) one-stage methods that jointly optimize representation learning and clustering.
Two-stage methods usually train a generative model in the first stage. The trained generative model then acts as a feature extractor, and a clustering algorithm is applied to its features to obtain the clustering results. For instance, Guo et al. propose a Convolutional Auto-Encoder (CAE), using the k-means algorithm to cluster the auto-encoder's image representation [20]. Ghasedi Dizaji et al. propose a Stacked Auto-Encoder (SAE) algorithm which first trains an AE, and then uses relative entropy as a loss function to train the encoder and obtain the clustering results [2]. Xie et al. propose Deep Embedded Clustering (DEC), which starts with a pre-training phase using only the reconstruction loss and then improves the clustering ability of the representation by optimizing in a self-supervised manner [25]. Peng et al. propose a novel clustering method that minimizes the discrepancy between pairwise sample assignments for each data point [26]. Gaussian Mixture Variational Autoencoders (GMVAE) is a representative generation-based clustering algorithm that incorporates a Gaussian mixture prior into the variational autoencoder [22]. Categorical Generative Adversarial Networks (CatGAN) is another clustering algorithm based on generative models. It relies on an objective function that trades off the mutual information between observed examples and their predicted categorical class distribution against the robustness of the classifier to an adversarial generative model [19]. The disadvantage of two-stage methods is the mismatch between image representation and clustering: the target of a generative model is to make the generated image visually close to the input image, while clustering aims to reduce all possible variations to a few templates [27], [28]. This difference makes generative models unsuitable for direct participation in deep clustering, as important discriminative information will be lost.
One-stage methods combine image representation learning with the clustering process. Joint Unsupervised Learning of Deep Representations and Image Clustering (JULE), Structured AutoEncoders for Subspace Clustering (SASC) and Deep Adaptive Clustering (DAC) are three representative image clustering methods that simultaneously learn the image representation and the clustering results. JULE proposes a recurrent framework for joint unsupervised learning of deep representations and image clusters [23]. DAC defines an effective objective and proposes an adaptive mechanism to realize image clustering [24]; the objective function is used to update the parameters of a convolutional network by selecting highly confident image pairs, and the cluster assignment is integrated into classification labels. SASC proposes a clustering method based on subspace clustering theory and a locality-preserving scheme; it improves traditional subspace clustering methods by using an autoencoder to guarantee that the learned representations preserve the local and global subspace structure [29]. The effectiveness of these learning schemes has been demonstrated both theoretically and experimentally. However, two crucial factors affect the stability and effectiveness of these algorithms. On one hand, the initialization of the convolutional network strongly affects the performance of DAC and JULE. On the other hand, as training goes on, the local structure preservation of the representation cannot be guaranteed, and the image representation in the distorted feature space may not be suitable for clustering.

B. MUTUAL INFORMATION MAXIMIZATION
Since the definition of mutual information is based on information entropy, we first briefly introduce the corresponding concepts. Information entropy is the average rate at which information is produced by a stochastic source of data [30]. The information content associated with each possible data value is the negative logarithm of its probability. Given X = {x_1, x_2, ..., x_n}, the information entropy is defined as

H(X) = −Σ_{x∈X} p(x) log p(x),   (1)

where p(x) is the probability density of X. If Z is a latent variable of X, mutual information (MI) is a measure of the reduction of uncertainty in X due to the knowledge of Z [31]. In information theory, given two random variables X and Z with joint distribution p(x, z) and marginal distributions p(x) and p(z), the MI between X and Z can be calculated as follows,

I(X; Z) = ∫∫ p(x, z) log ( p(x, z) / ( p(x) p(z) ) ) dx dz.   (2)

Therefore, MI can equivalently be defined as

I(X; Z) = D_KL( p(x, z) ‖ p(x) p(z) ),   (3)

where D_KL(· ‖ ·) denotes the Kullback-Leibler divergence (KL divergence for short). Mutual Information Maximization (MIM) builds on the definitions of information entropy and MI, and is a technique for maximizing the average mutual information between two variables [32], [33]. In this paper, we maximize the mutual information between an image X ∈ R^D and its representation Z ∈ R^d, where d ≪ D. Thus, we realize the dimensionality reduction of X and extract an informative representation of the image data.
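As a concrete check of these definitions, the short sketch below (with a hypothetical two-by-two joint distribution) computes the MI of two binary variables both directly from the integral/sum definition and as the KL divergence between the joint and the product of the marginals; the two values coincide.

```python
import numpy as np

# Toy joint distribution p(x, z) over two binary variables
# (hypothetical numbers, chosen only to illustrate the definitions).
p_xz = np.array([[0.4, 0.1],
                 [0.1, 0.4]])

p_x = p_xz.sum(axis=1)  # marginal p(x)
p_z = p_xz.sum(axis=0)  # marginal p(z)

# MI computed directly from its definition.
mi = sum(p_xz[i, j] * np.log(p_xz[i, j] / (p_x[i] * p_z[j]))
         for i in range(2) for j in range(2))

# The same quantity as a KL divergence between p(x,z) and p(x)p(z).
mi_as_kl = np.sum(p_xz * np.log(p_xz / np.outer(p_x, p_z)))
```

Both expressions evaluate to the same positive number, since X and Z are correlated in this toy table.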

III. METHOD
In this section, we first carry mutual information maximization into deep neural networks and establish a deep mutual information maximization system. Then, a self-training method is incorporated into the training of the image representation to obtain a representation suitable for clustering. Finally, we provide the network architecture and the detailed training procedure of the proposed algorithm.

A. DEEP MUTUAL INFORMATION MAXIMIZATION
As mentioned above, the mutual information between inputs X and their representations Z is defined by the KL divergence (3), which can be decomposed as follows,

I(X; Z) = E_{x∼p(x)}[ D_KL( p(z|x) ‖ p(z) ) ],   (4)

where p(x) denotes the probability density of the input data X and p(z) is the probability density of the representation Z. We will often write densities like p(Z = z) as p(z) to save space. To maximize the mutual information between the input image and the image representation, we model p(z|x) as a neural network, and assume that p(z) follows the standard normal distribution; this assumption is similar to what is done in VAE. Therefore, the objective function used to train the network p(z|x) can be defined as follows,

max_θ E_{x∼p(x)}[ D_KL( p(z|x) ‖ p(z) ) ].   (5)

In [33], Deep InfoMax optimizes (5) by using an adversarial scheme. However, we observe that this scheme often leads to instability in the process of network training. Different from Deep InfoMax, we directly add a restriction on the distribution of the representation Z, defined as follows,

L_prior = E_{x∼p(x)}[ D_KL( p(z|x) ‖ q(z) ) ],   (6)

where q(z) is a prior distribution that follows the standard normal distribution. Combining (5) with (6), the whole objective function is defined as follows,

min_θ −E_{x∼p(x)}[ D_KL( p(z|x) ‖ p(z) ) ] + γ L_prior,   (7)

where γ is the balance coefficient.
To solve this minimization problem, we decompose the mutual information term in (7) as follows,

E_{x∼p(x)}[ D_KL( p(z|x) ‖ p(z) ) ] = D_KL( p(z|x)p(x) ‖ p(z)p(x) ).   (8)

Therefore, (7) is equivalent to the following optimization problem,

min_θ −D_KL( p(z|x)p(x) ‖ p(z)p(x) ) + γ L_prior.   (9)

Because the KL divergence is theoretically unbounded, we use the Jensen-Shannon divergence (JS divergence for short) instead. The objective function can be transformed into

min_θ −D_JS( p(z|x)p(x) ‖ p(z)p(x) ) + γ L_prior.   (10)

In [34], f-GAN proves that the JS divergence can be estimated variationally as

D_JS( P ‖ Q ) = (1/2) max_D { E_{(x,z)∼P}[ log D(x, z) ] + E_{(x,z)∼Q}[ log(1 − D(x, z)) ] } + log 2,   (11)

where D(x, z) is a learnable discriminator. Therefore, up to additive constants, the objective function (10) can be rewritten as

min_{θ, D} −E_{p(z|x)p(x)}[ log D(x, z) ] − E_{p(z)p(x)}[ log(1 − D(x, z)) ] + γ L_prior.   (12)

To involve more comprehensive image information, we combine the local and the global mutual information losses, and define the following objective function,

min_{θ, φ_1, φ_2} −α L_g − β L_l + γ L_p,   (13)

where D_1(x, z_g) and D_2(z_l, z_g) are the learnable discriminators of the global and local terms, and L_g, L_l and L_p are the global mutual information loss, the local mutual information loss and the prior loss of the latent variable, respectively. The definitions of L_g, L_l and L_p are as follows,

L_g = E_{x∼p_pos}[ log D_1(x, z_g) ] + E_{x∼p_neg}[ log(1 − D_1(x, ẑ_g)) ],
L_l = E_{x∼p_pos}[ log D_2(z_l, z_g) ] + E_{x∼p_neg}[ log(1 − D_2(ẑ_l, ẑ_g)) ],
L_p = E_{x∼p_pos}[ D_KL( p(z|x) ‖ q(z) ) ].   (14)

B. NETWORK ARCHITECTURE

As a practical application of the ideas described above, we now develop a system for maximizing the mutual information between the input X and the latent variable Z to obtain an image representation. For this purpose, we model p(z|x), D_1(x, z_g) and D_2(z_l, z_g) by three neural networks, f_θ(x), g_φ1(x, z_g) and g_φ2(z_l, z_g), where θ, φ_1 and φ_2 denote the weight and bias parameters of the networks. The system we established is shown in Fig. 1. Specifically, f_θ(x) maps the input x to the global latent variable z_g, and can be implemented by several convolutional layers. For g_φ1(x, z_g), we first encode x into a vector and concatenate this vector with the global representation z_g; then, we feed the concatenated vector to several fully connected layers. The implementation of g_φ2(z_l, z_g) is similar to that of g_φ1(x, z_g), except that the inputs are the global representation z_g and the local representation z_l.
The goal of the proposed system is to learn an encoder which maps input images to the informative representations. Next, we introduce the details of the complete loss function.
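As a numerical illustration of the JS-divergence-based estimation discussed above, the snippet below is a generic f-GAN/Deep InfoMax-style sketch, not the paper's implementation. It parameterizes the discriminator by raw scores and uses the numerically stable softplus identity log σ(d) = −sp(−d), so the objective is largest (zero, up to constants) when positive pairs score high and negative pairs score low.

```python
import numpy as np

def softplus(a):
    # log(1 + e^a), computed stably via logaddexp.
    return np.logaddexp(0.0, a)

def js_mi_objective(d_pos, d_neg):
    """JSD-style mutual information objective on discriminator scores.

    d_pos: scores D(x, z) on positive (correctly paired) samples.
    d_neg: scores D(x, z') on negative (shuffled) samples.
    With D the sigmoid of the score, this equals
    E_pos[log D] + E_neg[log(1 - D)], whose maximum over D is 0
    up to the additive constants of the JS bound.
    """
    return np.mean(-softplus(-d_pos)) - np.mean(softplus(d_neg))
```

An uninformative discriminator (all scores zero) yields −2 log 2, while a well-separated one drives the objective toward 0.
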

C. COMPLETE LOSS FUNCTION
All the aforementioned objectives, including the global mutual information loss, the local mutual information loss and the prior matching loss, are joined together. The complete loss function for learning the image representation is defined as follows,

L = −α L_g − β L_l + γ L_p,   (15)

with

L_g = E_{x∼p_pos}[ log g_φ1(x, z_g) ] + E_{x∼p_neg}[ log(1 − g_φ1(x, ẑ_g)) ],
L_l = E_{x∼p_pos}[ log g_φ2(z_l, z_g) ] + E_{x∼p_neg}[ log(1 − g_φ2(ẑ_l, ẑ_g)) ],
L_p = E_{x∼p_pos}[ D_KL( p(z|x) ‖ q(z) ) ],

where p_pos and p_neg denote the distributions of positive and negative samples, z_l and z_g are the local and global latent representations of positive samples, and ẑ_l and ẑ_g are the local and global latent representations of negative samples, respectively. The complete loss function consists of three components. The first and second terms in (15) are the global and local mutual information losses, which measure the informational relevance between the input image and its representation. The third term in (15) is the prior loss, which penalizes the mismatch between the learned posterior p(z|x) and the prior q(z).
Next, we introduce the implementation details of the proposed deep mutual information network. For the implementations of L_g and L_l, the two expectations can be approximated by Monte Carlo sampling, since the expectations over p(z|x)p(x) and p(z)p(x) can be realized by sampling positive and negative samples, respectively [35]. To achieve this, we borrow from the well-known word2vec algorithm and adopt its negative sampling trick to obtain positive and negative samples [36]. Specifically, we first randomly select a batch of samples from the dataset as positive samples; the positive samples follow the distribution p_pos. Then, we shuffle the batch and take the disordered samples as negative samples; the negative samples follow the distribution p_neg.
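The positive/negative sampling step can be sketched as follows. The shapes and the cyclic shift are illustrative choices, not the paper's exact procedure; a cyclic shift is used instead of a random permutation because it guarantees that every negative pair is mismatched.

```python
import numpy as np

rng = np.random.default_rng(0)

# A batch of flattened "images" (hypothetical shapes).
batch = rng.normal(size=(8, 32))   # positive samples, drawn from p_pos

# Negative samples: pair each representation with a *different* image by
# shifting the batch by one position (a random permutation could leave
# some samples paired with themselves).
negatives = np.roll(batch, shift=1, axis=0)   # samples from p_neg
```

The negative set contains exactly the same images as the positive set, only in a different order, which is what approximating p(x)p(z) by shuffling requires.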
Similar to VAE, p(z|x) is modeled as a normal distribution with mean μ(x) and variance σ²(x), and q(z) follows the standard normal distribution, q(z) ∼ N(0, I). Thus, D_KL( p(z|x) ‖ q(z) ) can be calculated in closed form as follows,

D_KL( p(z|x) ‖ q(z) ) = (1/2) Σ_{k=1}^{d} ( μ_k(x)² + σ_k(x)² − log σ_k(x)² − 1 ),

where, for the implementation of L_p, μ_k(x) and σ_k(x) are outputs of the network f_θ(x). Note that the expectation over p_pos indicates that the prior loss is computed on positive samples only.
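The closed-form KL term above can be checked with a few lines of NumPy; this is a generic VAE-style computation, not code from the paper.

```python
import numpy as np

def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), one value per sample."""
    return 0.5 * np.sum(mu**2 + sigma**2 - np.log(sigma**2) - 1.0, axis=-1)
```

The term vanishes exactly when the posterior equals the prior (mu = 0, sigma = 1) and is strictly positive otherwise.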

D. CLUSTERING LOSS FUNCTION
By maximizing the mutual information between the input image and its representation, we obtain the image representation that retains the most image information. In this section, to mine the clustering characteristics of the image representation, a self-training module is proposed and integrated into the training of the deep mutual information network.
The concept of self-training is derived from semi-supervised learning. It first trains a classifier using the known labeled samples, and then uses the trained classifier to estimate the labels of unlabeled samples [37]. For the unsupervised problem, self-training can be divided into two steps: calculating initial labels for the samples, and training the network with the high-confidence labels to correct the low-confidence ones. Therefore, it is very important to generate initial labels as close to the real label distribution as possible. Inspired by the DEC algorithm, in this paper we use a self-training module as a tool to discover the clusters of the learned image representation. Specifically, we first use the well-known Student's t-distribution and the learned maximal mutual information representation to calculate the pseudo label of each image, and obtain the pseudo label distribution of the images. Then, we establish a target distribution and use it to fine-tune the learned maximal mutual information network. The reasons why we incorporate the maximal mutual information representation into the self-training algorithm are twofold: i) the initial pseudo labels of the self-training algorithm should be as close as possible to the semantic labels of the images, which means that the learned image representation should contain as much image information as possible; ii) the target distribution is established by enhancing the high-confidence pseudo labels and weakening the low-confidence ones.
In [25], DEC proposes a Student's t-distribution as a kernel to measure the similarity between the image representation z_i and the cluster centroid u_j as follows,

l_ij = (1 + ‖z_i − u_j‖² / v)^{−(v+1)/2} / Σ_{j′} (1 + ‖z_i − u_{j′}‖² / v)^{−(v+1)/2},   (19)

where z_i denotes the i-th image representation, u_j is initialized by running k-means++ on the learned representations, and v denotes the degrees of freedom of the Student's t-distribution. In particular, l_ij can be regarded as a pseudo label of each input image, and denotes the probability of assigning sample i to cluster j.
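Equation (19) is straightforward to implement; the sketch below is a minimal NumPy version (illustrative only, not the paper's code).

```python
import numpy as np

def soft_assign(z, centroids, v=1.0):
    """Soft assignment l_ij of (19): Student's t kernel with v degrees of freedom.

    z:         (n, d) array of image representations.
    centroids: (K, d) array of cluster centroids u_j.
    Returns an (n, K) row-stochastic matrix of assignment probabilities.
    """
    d2 = np.sum((z[:, None, :] - centroids[None, :, :]) ** 2, axis=-1)
    num = (1.0 + d2 / v) ** (-(v + 1.0) / 2.0)
    return num / num.sum(axis=1, keepdims=True)
```

Each row sums to one, and a sample lying on a centroid receives its highest probability for that cluster.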
After obtaining the clustering assignment distribution L, akin to [25], we also propose a target distribution to optimize the image representation by learning from the high-confidence pseudo labels. Specifically, the target distribution P is defined as follows,

p_ij = ( l_ij² / f_j ) / Σ_{j′} ( l_{ij′}² / f_{j′} ),   (20)

where f_j = Σ_i l_ij denotes the pseudo label frequency of cluster j. The goal of image clustering is to align the Student's t-distribution with the target distribution (20). Therefore, we define the following clustering loss function,

L_c = KL( P ‖ L ) = Σ_i Σ_j p_ij log ( p_ij / l_ij ).   (21)

By minimizing the distance between the P and L distributions, the target distribution can guide the DMIM system to learn an image representation that is more suitable for image clustering.
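The target distribution (20) and the clustering loss (21) can likewise be sketched in a few lines of NumPy (illustrative, not the paper's code); note how squaring and frequency-normalizing sharpens confident assignments.

```python
import numpy as np

def target_distribution(l):
    """Target distribution p_ij of (20): square assignments, normalize by frequency."""
    f = l.sum(axis=0)                      # pseudo-label frequencies f_j
    w = l ** 2 / f
    return w / w.sum(axis=1, keepdims=True)

def clustering_loss(p, l):
    """Self-training objective (21): KL(P || L) over all samples and clusters."""
    return np.sum(p * np.log(p / l))
```

Because P is built by sharpening L, high-confidence entries of L grow even larger in P, and minimizing KL(P ‖ L) pulls the soft assignments toward these sharpened targets.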

E. TRAINING AND CLUSTERING
Given a dataset X = [x_1, x_2, ..., x_n] with n samples to be clustered, the number of clusters K is assumed to be known a priori. Let Z = [z_1, z_2, ..., z_n] denote the image representations. We first sample a batch of images from the dataset X to construct the positive sample set, and shuffle the images to construct the negative sample set. Then, we train the proposed DMIM system by optimizing the objective in (15) to obtain the image representation Z. Finally, we calculate the clustering pseudo labels by using (19), and update the DMIM system by using the self-training objective function (21). The entire training procedure of the proposed algorithm is presented in Algorithm 1.

Algorithm 1 Training and Clustering Procedure
1: Input: dataset X, number of clusters K, learning rates ρ1, ρ2.
2: Pre-training of the DMIM system:
3: for t1 in epochs1 do
4:    Construct positive and negative sample batches from X.
5:    Compute the image representation Z.
6:    Update θ, φ1 and φ2 by minimizing (15) with the learning rate ρ1.
7: end for
8: Self-training for the optimization of the network:
9: Cluster the image representations by using k-means++ and initialize clustering centroids u_j, j = 1, 2, ..., K.
10: Calculate L and P by using (19) and (20).
11: for t2 in epochs2 do
12:    Update θ by minimizing (21) with learning rate ρ2.
13: end for
14: Calculate clustering labels:
15: for x_i in X do
16:    Calculate the clustering label based on f_θ(x_i) and (19).
17: end for

IV. EXPERIMENTS
In this section, we conduct a series of experiments to verify the effectiveness of our clustering algorithm. All the experiments are performed on a desktop workstation with an Intel(R) Core i7-4790 3.6 GHz CPU, 32 GB RAM, the Ubuntu 14.04 operating system and the Keras environment.

A. DATASETS
We select four representative image datasets, namely MNIST, Fashion-MNIST, CIFAR-10 and STL-10, to verify the effectiveness of our algorithm. Next, we briefly introduce these datasets.

1) MNIST AND FASHION-MNIST
MNIST is a handwritten digit database which includes a training set of 60,000 examples and a test set of 10,000 examples. Fashion-MNIST is a dataset of Zalando's article images, presented as an updated version of MNIST [38]. For these two datasets, each example is a 28×28 grayscale image associated with a label from 10 classes.
2) CIFAR-10 AND STL-10
The CIFAR-10 dataset consists of 60,000 32×32 color images in 10 classes [39]. The STL-10 dataset is an image dataset used to develop unsupervised feature learning, deep learning and self-supervised learning algorithms [40]. It is inspired by the CIFAR-10 dataset but with some modifications. Its high resolution (96×96) makes it a challenging benchmark for developing more scalable unsupervised learning methods. The detailed statistics of these four datasets are shown in Table 1.

B. EVALUATION METRICS
To evaluate the performance of the clustering algorithms, we adopt three commonly used metrics: clustering accuracy (ACC), Normalized Mutual Information (NMI) and Adjusted Rand Index (ARI). These three metrics reflect clustering performance from different perspectives. ACC measures the best matching between the unsupervised clustering results and the ground truth. NMI measures the similarity between pairs of clusterings [41], [42]. ARI establishes a baseline by using the expected similarity of all pairwise comparisons between clusterings specified by a random model [43]. For all three metrics, a higher score indicates better clustering performance.
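ACC is usually computed by finding the best one-to-one mapping between cluster ids and class labels with the Hungarian algorithm over the confusion matrix; a common sketch (assuming SciPy is available, and not the paper's code) is:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """ACC: accuracy under the best one-to-one map from cluster ids to labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    k = int(max(y_true.max(), y_pred.max())) + 1
    # count[p, t] = number of samples in cluster p with true label t
    count = np.zeros((k, k), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        count[p, t] += 1
    rows, cols = linear_sum_assignment(-count)  # maximize matched counts
    return count[rows, cols].sum() / len(y_true)
```

For example, a clustering that labels the classes consistently but with swapped ids still reaches ACC = 1, since the mapping is free.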

C. COMPETITORS
We compare our performance with traditional clustering methods, two-stage deep clustering methods and one-stage deep clustering methods. Specifically, the traditional clustering methods include k-means++ [4], Self-tuning Spectral Clustering (SSC) [6] and Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [44]; these methods first extract BOF features and then cluster the image features. The two-stage clustering methods include Greedy Layer-Wise Training of Deep Networks (GLWTDN) [45], Deep Embedded Clustering (DEC) [25] and Gaussian Mixture Variational AutoEncoders (GMVAE) [22]; these methods first train a feature extractor and then use clustering methods to calculate the clustering assignments. The one-stage clustering methods include Improved Deep Embedded Clustering (IDEC) [46], Categorical Generative Adversarial Networks (CatGAN) [19] and the Deep Adaptive Clustering (DAC) algorithm [24].

1) TRADITIONAL IMAGE CLUSTERING METHODS
K-means++, SSC and DBSCAN: These three baselines first use the BOF algorithm to encode the images. Then, the image features are clustered to achieve image clustering.

2) TWO-STAGE DEEP CLUSTERING METHODS
GLWTDN: It first trains an AE to extract image features, and then uses the k-means algorithm to cluster the image features to realize image clustering [45].
DEC: It first learns image representations from an AE. Then, clusters are obtained by utilizing a self-training mechanism [25].
GMVAE: GMVAE uses a Gaussian mixture model as the prior distribution to improve the traditional variational autoencoder. It uses the improved latent vector as the image representation, and then clusters the representations to realize image clustering [22].

3) ONE-STAGE DEEP CLUSTERING METHODS
IDEC: IDEC is an improved version of DEC. It optimizes the AE's reconstruction loss and the self-training loss simultaneously to guarantee local structure preservation [46].
CatGAN: It uses a general Generative Adversarial Network (GAN) and an entropy-based loss function to realize image clustering [19].
DAC: It formulates image clustering as a binary pairwise classification problem, and identifies the pairs of images which should belong to the same cluster [24].

D. EXPERIMENT SETUP
For the traditional clustering algorithms, i.e., k-means++, SSC and DBSCAN, we first extract the BOF features of the images. The selected image feature extraction method is the Scale-Invariant Feature Transform (SIFT) [47], and the number of bins in the BOF algorithm is set to 20. The parameters of the other comparison methods are mostly set according to the original literature. For our algorithm, we set the parameters α = 0.01, β = 0.5 and γ = 0.5. Following most methods based on the self-training scheme, the parameter of the Student's t-distribution is set to v = 1. We set the learning rates to η = 0.005 and ρ = 0.0001, which are chosen empirically. The detailed network architectures are shown in Table 2. The weights of the convolutional and fully connected layers are all initialized by the Xavier approach [48].

TABLE 3. Clustering performance comparison, ACC (%), NMI (%) and ARI (%), on all datasets. The results marked † are excerpted from [24], [25] and [46]. The best and second-best results are highlighted in bold and underlined, respectively.

1) CLUSTERING PERFORMANCE COMPARISON
In this part, we compare our method with many state-of-the-art methods, including k-means++ [4], SSC [6], DBSCAN [44], GLWTDN [45], DEC [25], IDEC [46], GMVAE [22], CatGAN [19] and DAC [24]. For our method, we followed the implementation details above and report the average results over 5 trials. For the rest, we present the best reported results, either from experiments with the original codes of their papers or from [24], [25] and [46]. We report the detailed clustering results of these methods on all the datasets in Table 3. As shown in Table 3, for each dataset, the deep clustering algorithms perform better than the traditional clustering algorithms. Our clustering method outperforms the traditional algorithms by a large margin, which shows the fascinating potential of the proposed method in clustering tasks. Furthermore, note that the proposed method outperforms the other deep clustering methods on all three evaluation metrics except on the MNIST dataset. Our algorithm outperforms all competitive baselines, with significant margins of 7.52%, 3.49% and 3.17% on Fashion-MNIST, CIFAR-10 and STL-10, respectively. These results show the effectiveness of our method in image clustering tasks. Fig. 2 shows the confusion matrices of the clustering results for all the datasets. The values along the diagonal represent the percentage of samples correctly classified into the corresponding categories. We find that the clustering accuracies are balanced and stable across all the datasets. This shows that our method does not aggregate samples into a few categories or assign a cluster to outlier samples, and can effectively avoid the degenerate solution problem.

2) VISUALIZATION
In this part, we use two methods to visualize the clustering results of our algorithm. In the first visualization experiment, we map the image representation Z to a 2-dimensional vector by using the t-SNE algorithm [49]. We report the t-SNE results on the Fashion-MNIST and STL-10 datasets at different clustering accuracies in Fig. 3 and Fig. 4. Different colors indicate different clusters, and the corresponding clustering accuracies are reported below each plot. The visualization results show that the proposed algorithm can effectively improve the separability of the data, which helps to improve the clustering accuracy.
In the second visualization experiment, we qualitatively analyze the clustering results of the proposed method on the Fashion-MNIST and CIFAR-10 datasets. For each category, we first randomly select one image as the original image. Then, we pick the 5 samples with the smallest Euclidean distance to the original image from the same cluster. Finally, we pick the 5 incorrectly clustered images that are closest to the original image. All the picked images are shown in Fig. 5, where the correct samples and the incorrect samples are marked with green and red labels, respectively. From the visualization results we find that the successful cases depend not only on appearance textures, but also on some semantic information of the categories. The failure cases also contain a lot of texture content similar to the source images. The visualization results imply that our method captures not only image appearance information but also some abstract image information for clustering. This is the reason why the proposed method can precisely discover the categories of the input images.

3) ON EFFECT OF THE NUMBER OF CLUSTERS
In this experiment, we mainly study the effect of the number of clusters on our algorithm. For each dataset, we conduct 6 experiments with different numbers of clusters, varying in the range [5, 10] at equal intervals. We report the variation curves of clustering accuracy with the number of clusters in Fig. 6.
As shown in Fig. 6, the accuracies of all the clustering methods decrease gradually as the number of clusters increases. For the Fashion-MNIST, CIFAR-10 and STL-10 datasets, the clustering accuracy of our method is always higher than that of the other algorithms for different numbers of clusters. In addition, the other two metrics also show the superiority of the proposed algorithm. This is because our algorithm can exploit the essential information of the images. The experimental results also show the stability of our algorithm.

V. CONCLUSION
This paper proposes a new image clustering method based on a Deep Mutual Information Maximization (DMIM) system and a self-training algorithm. To make the learned image representation contain more image information, we first derive a deep mutual information maximization system and use it to learn an unsupervised image representation. To discover the image cluster assignments, we borrow a self-training mechanism and incorporate it into the learning of the image representation. We evaluate our method on unsupervised clustering tasks using popular datasets, achieving competitive results compared with the current state-of-the-art methods.
From the viewpoint of the learning scheme, this paper treats an unsupervised learning problem as a semi-supervised learning problem by enhancing the high-confidence pseudo labels. Future work may include exploring more self-training methods to assist encoders in obtaining image representations. Specifically, we could first train a classification DMIM model and select high-confidence pseudo labels as initial labels, and then use semi-supervised learning schemes to train the model. Nevertheless, how to determine the initial labels is an open problem. An optional way is to automatically select the initial labels based on more prior information. Besides, the Graph Convolutional Network (GCN) has been proved effective in semi-supervised classification tasks [50]-[52]. A possible direction is to add some GCN layers to the proposed model to improve the clustering performance.
PEIYAO WANG received the B.Sc. and M.Sc. degrees in pattern recognition and intelligent system from Liaoning Shihua University, China, in 2014 and 2017, respectively. She is currently a Teaching Section Chief with the Shenyang Institute of Technology. Her research interests include image/video representation and deep learning.
YUNING WANG received the B.Sc. degree in ammunition engineering and explosive technology from the College of Equipment Engineering, Shenyang Ligong University, Shenyang, Liaoning, China. He is currently working as an Assistant Engineer with PLA 32681. His research interests include the theories and algorithms of object detection, machine learning, and intelligent vision systems.
CHENGDONG WU (Member, IEEE) is currently the Vice President of the Faculty of Robot Science and Engineering, Northeastern University, and the Director of the Institute of Artificial Intelligence, a Professor, and a Doctoral Tutor with Northeastern University, Shenyang, China. He has long been involved in automation engineering, artificial intelligence, and teaching and research in robot navigation. He is an expert in Chinese modern artificial intelligence and robot navigation, and receives a Special Allowance of the State Council.