A Multimodal Clustering Framework With Cross Reconstruction Autoencoders

Multimodal clustering algorithms partition a multimodal dataset into disjoint clusters. Common feature extraction is a key component of multimodal clustering algorithms. Recently, deep neural networks have shown strong performance on latent feature extraction. However, existing works do not fully exploit cross-modal distribution similarity with deep neural networks. We present a deep multimodal clustering framework with cross reconstruction. Feature extraction applies global cross reconstruction and local cross reconstruction, respectively, to enforce early fusion among different modalities. Analysis shows that both cross reconstruction networks reduce the Wasserstein distance between latent feature distributions, which indicates that the proposed framework ensures the distribution similarity of the common latent features. Experimental results on benchmark datasets demonstrate superiority over existing works.


I. INTRODUCTION
Clustering is a vital research area in unsupervised machine learning. Multimodal clustering, an important field within clustering, partitions multimodal data into disjoint groups in an unsupervised manner. A multimodal dataset describes objects from multiple aspects, which are called modalities (or modals). In each modal, an object has a group of properties that represent the object. Deep neural networks (DNN) [2] have bred effective unsupervised tools such as autoencoders [3] and generative adversarial networks (GAN) [4]. Deep clustering methods [5]-[8] have flourished following the development of deep unsupervised networks, on account of their remarkable performance on feature extraction [9] and dimensionality reduction. Deep multimodal clustering [10] utilizes multiple deep neural networks to extract common latent features. The final cluster assignment is then retrieved by running a clustering method on the fused latent features.
Common latent feature extraction is a key concept in multimodal clustering algorithms. Traditional works are developed using spectral clustering [11], subspace clustering [12], etc. Most of these studies are restricted by their feature extraction methods when handling large-scale data. Recent articles recognize the critical role of deep neural networks in extracting latent features. So far, there have been several works on deep multimodal clustering, but the field has not been comprehensively examined. Srivastava et al. [10] propose an algorithm that learns a joint representation of different modalities with a deep Boltzmann machine (DBM). But due to the high computational costs in high-dimensional data spaces [13], DBM-based methods have not been widely studied. Ngiam et al. [14] use an autoencoder to extract one common feature between two modals. However, autoencoders do not truly discover the similarity of the common feature distributions. Wang et al. [15] learn the most correlated features across modalities with deep canonical correlation analysis (DCCA), which merges a deep autoencoder with canonical correlation analysis. Though it maximizes the correlation between each pair of modals' latent features, the different numerical characteristics of different modalities can end up with indistinct correlation. Moreover, the above methods fail to separate distinctive latent features from common latent features.
In this paper, we propose cross reconstruction based on the following fact. Each modal has distinctive features that reflect the modal's characteristics. Meanwhile, all modals share common features so that they have similar cluster assignments. We highlight the importance of both distinctive latent features and common latent features when reconstructing a data object with autoencoders in multimodal datasets. Furthermore, we build a two-step deep multimodal clustering framework with a cross reconstruction loss. Specifically, we develop a global cross reconstruction (GCR) autoencoder and a local cross reconstruction (LCR) autoencoder as two typical neural networks. The cross reconstruction networks early-fuse multimodal features to accurately retrieve latent features.
In the GCR autoencoder, we extract latent features from different modalities with the Variational Information Bottleneck (VIB) [16]. VIB controls the amount of information passing through the autoencoder during feature extraction by minimizing the mutual information between the raw data and the extracted features. Meanwhile, we integrate cross reconstruction into the feature extraction process, which promotes the distribution similarity of different modalities in the latent feature space. We also provide a theoretical analysis to prove this similarity promotion of multimodal distributions. Finally, we fuse the extracted features into common features using fusion layers and run a clustering algorithm on the common features. We call the above process deep multimodal clustering with global cross reconstruction (DMGCR).
In the LCR autoencoder, we explicitly extract a distinctive latent feature and a common latent feature for each modal. The reconstruction of a data object in the current modal uses the mixture of the distinctive latent feature and the common latent feature. The cross reconstruction utilizes the common latent features of other modals together with the distinctive latent feature of the current modal. After the common latent features are trained, we fuse them with the same fusion layers as used in DMGCR and finally perform a clustering algorithm on the fused common latent features. We call the above LCR-based process deep multimodal clustering with local cross reconstruction (DMLCR).
We summarize the contributions of this article as follows:
(1) We propose two novel cross reconstruction networks for multimodal feature extraction, which implement early fusion across all modalities.
(2) We provide a theoretical analysis proving that the proposed cross reconstruction method effectively reduces the distribution difference between modalities in feature space.
(3) We develop a deep clustering framework and two deep multimodal clustering algorithms, DMGCR and DMLCR, which are capable of performing multimodal clustering on comprehensive datasets.
(4) Experiments show improvement over state-of-the-art methods on six benchmark datasets. We also explore parameter selection by grid search and provide suggested empirical values.

II. RELATED WORK
A. DEEP CLUSTERING
Deep clustering has received growing attention in the past few years owing to the good performance of deep neural networks on unsupervised feature extraction. Existing methods share two common concepts, feature extraction and cluster assignment. The existing deep clustering methods are roughly divided into two categories: two-stage methods and end-to-end methods [15].
The two-stage methods first extract latent features from the original feature space, then conduct clustering on the latent features. Tian et al. [8] use an autoencoder to extract graph features and finally apply k-means to cluster them. Chen [17] applies a Deep Belief Network (DBN) to extract features and then uses non-parametric maximum-margin clustering on the features.
The end-to-end methods realize feature extraction and clustering within unified deep neural networks. The joint unsupervised learning (JULE) algorithm [18] uses a recurrent framework for joint unsupervised learning of deep representations and image clustering, which are optimized jointly in the training process. The deep embedding clustering (DEC) algorithm [5] clusters a set of data points in a jointly optimized feature space. Based on DEC, the improved deep embedding clustering (IDEC) algorithm [6] jointly optimizes the clustering and preserves the local structure of the data distribution. These algorithms are limited to single-modal clustering tasks. DEPICT [19] develops a convolutional autoencoder for extracting latent features and uses a KL divergence objective to guide the training process. VaDE [20] proposes a generative deep clustering algorithm based on the variational autoencoder; it extracts the latent Gaussian distribution parameters instead of latent features. ADEC [21] utilizes adversarial reconstruction to retrieve better latent features for clustering.
Classical multimodal clustering methods either learn a consensus matrix or minimize the divergence of multiple modalities simultaneously. For example, the multi-view spectral clustering (MMSC) algorithm [22] learns a commonly shared graph Laplacian matrix by unifying different modalities. Gao et al. [23] propose a novel NMF-based multi-view clustering algorithm that searches for a factorization giving compatible clustering solutions across modalities. The diversity-induced multi-view subspace clustering (DIMSC) algorithm [24] extends existing subspace clustering to the multimodal domain. The low-rank tensor constrained multi-view subspace clustering (LT-MSC) algorithm [25] introduces a low-rank tensor constraint to explore the complementary information in multimodal data. The exclusivity-consistency regularized multi-view subspace clustering (ECMSC) algorithm [26] attempts to harness the complementary information between different representations by introducing a novel position-aware exclusivity term.
Deep multimodal clustering methods first learn common low-dimensional features from multimodal data, then cluster the features into groups. Ngiam et al. [14] propose a series of frameworks for deep multimodal learning based on autoencoders. Srivastava and Salakhutdinov [10] propose a deep multimodal representation learning framework, which learns a joint representation of different modalities by DBM. DCCA [27] learns complex nonlinear transformations of two modalities of data such that the resulting representations are highly linearly correlated. DCCAE [15] adds an autoencoder regularization term to DCCA.

III. CROSS RECONSTRUCTION
Multimodal learning tasks mine the consensus information among different modalities. Traditional approaches first extract latent features within each single modality, then fuse the features of each modal into a shared latent feature; we call this Late Fusion. We now introduce Early Fusion, which fuses information among modalities via cross reconstruction while extracting latent features within each modality. Cross reconstruction is biologically inspired by cross-modal systems [28] within the visual and auditory systems of the human brain [29].

A. MULTIMODAL FEATURE EXTRACTION
Consider the problem of clustering a dataset of n points X observed in M modalities. Data in the same modality have the same dimensions. Multimodal feature extraction aims to extract a latent feature Z from the multimodal data X. One of the best-known tools for deep learning is the autoencoder. Apart from normal autoencoders, variational autoencoders (VAE) are capable of capturing the distribution of the inputs.
A VAE contains two basic components, an encoder E and a decoder D. Both the encoder and the decoder are composed of a series of nonlinear transforms. The encoder E extracts the approximate latent feature distribution q(z_i|x_i) ~ N(μ_i, σ_i) of the input p(x_i), where p(·) denotes the ground-truth distribution and q(·) the approximate distribution. Next, z_i is sampled as z_i = μ_i + ε · σ_i using the reparameterization trick, where ε ~ N(0, 1). The decoder D restores the reconstruction x̂_i of the input x_i from the latent feature z_i. The optimization target of the VAE is to minimize the difference between the ground-truth distribution p(x, z) and the approximate distribution q(x, z), that is

$$\min \; KL\big(q(x, z) \,\|\, p(x, z)\big). \quad (1)$$

Eq. (1) can be deduced to the following form:

$$\mathcal{L} = -\mathbb{E}_{q(z|x)}\big[\log p(x|z)\big] + KL\big(q(z|x) \,\|\, p(z)\big). \quad (2)$$

The first term in Eq. (2) is the reconstruction loss, which is implemented with the reparameterization trick. The second term can be treated as the Variational Information Bottleneck (VIB) [16], which has been shown to benefit feature extraction. Without the VIB term, the VAE degrades to a normal autoencoder. Naively extending the single-modal VAE to the multimodal scenario is inadequate, because a multimodal dataset contains both distinctive latent features and common latent features.
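As a concrete illustration, the reparameterization step above can be sketched in a few lines of Python (a minimal stdlib-only sketch; the values of mu and sigma are illustrative, whereas in a real VAE they are produced by the encoder):

```python
import random

def reparameterize(mu, sigma, rng):
    # z = mu + eps * sigma with eps ~ N(0, 1): the sampling noise is moved
    # into eps, so z stays differentiable with respect to mu and sigma
    eps = rng.gauss(0.0, 1.0)
    return mu + eps * sigma

rng = random.Random(0)
samples = [reparameterize(2.0, 0.5, rng) for _ in range(20000)]
mean = sum(samples) / len(samples)  # empirically close to mu = 2.0
```

Drawing many samples this way recovers the intended distribution N(mu, sigma^2), while each individual draw remains a deterministic function of (mu, sigma) given eps.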
It is impracticable to extract high-quality common latent features directly from all modalities with traditional encoders. We use the Variational Information Bottleneck (VIB) [16] to extract the common latent features among different modalities. The reconstruction loss of the multimodal common latent features is

$$\mathcal{L}_{rec} = \sum_{i=1}^{M} -\mathbb{E}_{q(z_i|x_i)}\big[\log q(x_i|z_i; \phi_i)\big]. \quad (3)$$

It is widely accepted that the common latent features of different modalities share similar distributions. Based on this fact, we propose two cross reconstruction methods, global cross reconstruction and local cross reconstruction, to promote the feature extraction process. The cross reconstruction methods guide the common latent features of all modalities to share similar distributions.

B. GLOBAL CROSS RECONSTRUCTION
The global cross reconstruction network assigns each modal a VAE, as described in the last section, to extract latent features. The encoders and decoders in the network operate as

$$z_i = E_i(x_i; \theta_i), \qquad \hat{x}_i = D_i(z_i; \phi_i),$$

where θ_i and φ_i are the parameters of the encoder and the decoder of the i-th modality. We therefore introduce a cross reconstruction loss in addition to the normal reconstruction loss of Eq. (3). The global cross reconstruction loss of reconstructing x_j with z_i is

$$\mathcal{L}_{ji} = -\mathbb{E}_{q(z_i|x_i)}\big[\log q(x_j|z_i; \phi_j)\big], \quad (4)$$

where q(x_j|z_i; φ_j) denotes the decoder used for reconstructing x_j with z_i. Note that it is the same decoder as for reconstructing x_j with z_j.
The reconstruction loss of the i-th modal is

$$\mathcal{L}_{rec}^{i} = -\mathbb{E}_{q(z_i|x_i)}\big[\log q(x_i|z_i; \phi_i)\big]. \quad (5)$$

The VIB loss of the i-th modal is

$$\mathcal{L}_{VIB}^{i} = KL\big(N(\mu_i(x), \sigma_i(x)) \,\|\, N(0, I)\big), \quad (6)$$

where μ_i(x) and σ_i(x) are the outputs of the encoder of the i-th modal. The cross reconstruction loss of the i-th modal is

$$\mathcal{L}_{cross}^{i} = \sum_{j \neq i} -\mathbb{E}_{q(z_j|x_j)}\big[\log q(x_i|z_j; \phi_i)\big]. \quad (7)$$

Therefore, combining Eq. (5), Eq. (6) and Eq. (7), we get the final loss function of a GCR network:

$$\mathcal{L}_{GCR} = \sum_{i=1}^{M} \big(\mathcal{L}_{rec}^{i} + \beta \mathcal{L}_{VIB}^{i} + \gamma \mathcal{L}_{cross}^{i}\big), \quad (8)$$

where β and γ are hyperparameters controlling the weights of the VIB term and the cross reconstruction term, respectively. Figure 1(a) illustrates a two-modal global cross reconstruction network. The two encoders {E_1, E_2} accept data {x_1, x_2} from the two modals simultaneously. In the first modal, the encoder E_1 encodes x_1 into the common latent feature z_1. The decoder generates a reconstruction x̂_1 for the input x_1. Besides, the latent feature of the second modal, z_2, is conveyed to the decoder of the first modal, D_1, to produce the cross reconstruction x̂_21. Consequently, two losses, the reconstruction loss of the first modal and the cross reconstruction loss between the first and second modals, guide the network training. The same process is carried out for the second modal.
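The VIB term of Eq. (6) has a well-known closed form for a diagonal Gaussian measured against N(0, I), and Eq. (8) is a weighted sum of per-modal terms. A minimal Python sketch (the function names are ours, and the per-modal loss values are assumed to be precomputed scalars):

```python
import math

def kl_to_standard_normal(mu, sigma):
    # KL( N(mu, diag(sigma^2)) || N(0, I) ) in closed form:
    # 0.5 * sum( mu_k^2 + sigma_k^2 - 1 - log(sigma_k^2) )
    return 0.5 * sum(m * m + s * s - 1.0 - math.log(s * s)
                     for m, s in zip(mu, sigma))

def gcr_loss(rec, vib, cross, beta, gamma):
    # total GCR loss as in Eq. (8): rec[i], vib[i], cross[i] are the
    # scalar loss terms of the i-th modal
    return sum(r + beta * v + gamma * c for r, v, c in zip(rec, vib, cross))
```

The KL term vanishes exactly when the encoder outputs the standard normal parameters, which is the information-bottleneck equilibrium the β weight trades off against reconstruction quality.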
Global cross reconstruction builds an implicit connection among different modalities; we now analyze this connection probabilistically. Here we assume that q(x_i|z_i; φ_i) is a Gaussian distribution in the i-th modality with mean μ̂_i(z_i) and variance σ̂_i^2(z_i), and that q(x_ji|z_j; φ_i) is also a Gaussian distribution in the i-th modality with mean μ̂_i(z_j) and variance σ̂_i^2(z_j). μ̂_i(·) and σ̂_i^2(·) are the parameters of the reconstruction distribution of the i-th modal.
Then we further derive the loss function of cross reconstruction from the Gaussian negative log-likelihood:

$$L_i = \frac{\|x_i - \hat{\mu}_i(z_i)\|^2}{2\hat{\sigma}_i^2(z_i)} + \log \hat{\sigma}_i(z_i), \qquad L_{ji} = \frac{\|x_i - \hat{\mu}_i(z_j)\|^2}{2\hat{\sigma}_i^2(z_j)} + \log \hat{\sigma}_i(z_j),$$

where L_i represents the loss of reconstructing x_i with z_i, and L_ji represents the loss of reconstructing x_i with z_j.
Combining L_i and L_ji, both μ̂_i(z_i) and μ̂_i(z_j) reduce their difference from x_i, which means that the difference between μ̂_i(z_i) and μ̂_i(z_j) also decreases. Note that both μ̂_i(z_i) and μ̂_i(z_j) are outputs of the same decoder D_i, so the inputs of the decoder D_i, i.e., z_i and z_j, are similar. z_i and z_j are sampled from N_i(μ(x_i), σ(x_i)) and N_j(μ(x_j), σ(x_j)), respectively, and are generated with the reparameterization trick:

$$z_i = \mu(x_i) + \varepsilon \cdot \sigma(x_i), \qquad z_j = \mu(x_j) + \varepsilon \cdot \sigma(x_j),$$

where μ(x_i) and σ(x_i) are the latent distribution parameters output by the encoder E_i, and μ(x_j) and σ(x_j) are the outputs of the encoder E_j. Afterward, we calculate the Wasserstein distance [30] between N_i and N_j:
$$W(N_i, N_j) = \inf_{\gamma \in \Pi(N_i, N_j)} \mathbb{E}_{(z_i, z_j) \sim \gamma}\big[\|z_i - z_j\|\big], \quad (9)$$

where Π(N_i, N_j) denotes the set of all joint distributions whose marginals over (z_i, z_j) are N_i and N_j, respectively. Since z_i and z_j are close, we conclude from Eq. (9) that the two latent feature distributions are also similar. Therefore, under the constraint of cross reconstruction, the encoders reduce the distribution differences of multimodal features, which shows that cross reconstruction constrains the latent features to share similar distributions in the feature spaces of different modalities.
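Eq. (9) is defined as an infimum over couplings and is not directly computable in general; for Gaussian marginals, however, the 2-Wasserstein variant has a closed form, which makes the "similar latent distributions" claim easy to check numerically. A small Python sketch (our helper function, assuming diagonal covariances):

```python
import math

def w2_diag_gaussians(mu1, sigma1, mu2, sigma2):
    # 2-Wasserstein distance between N(mu1, diag(sigma1^2)) and
    # N(mu2, diag(sigma2^2)); for commuting covariances the general
    # Gaussian formula reduces to
    #   W2^2 = ||mu1 - mu2||^2 + ||sigma1 - sigma2||^2
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(mu1, mu2)) +
                     sum((a - b) ** 2 for a, b in zip(sigma1, sigma2)))
```

For example, as the encoder outputs (mu, sigma) of two modalities converge, the distance drops to zero, matching the argument above.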

C. LOCAL CROSS RECONSTRUCTION
Furthermore, we consider a more effective common latent feature extraction method, which we name Local Cross Reconstruction (LCR). We build the LCR based on the following fact. Each modal has distinctive latent features that reflect the modal's characteristics. Meanwhile, all modals share common latent features so that they have similar cluster assignments. In the GCR network, we only extract common latent features from each modality using VIB. The GCR thus mixes the distinctive latent features with the common latent features in a single z_i, which leads to indistinct clusters. Therefore, in addition to the GCR, we propose the Local Cross Reconstruction (LCR) network.
An LCR network is composed of M deep variational autoencoders. There are two variational encoders E_i = (Ě_i, Ẽ_i) for each modality, which extract a pair of latent features (ž_i, z̃_i) for each modal. ž_i denotes the distinctive latent features of modal i, and z̃_i denotes its common latent features. Specifically, the two encoders share the same parameters in each modal except for the last layer and the reparameterization layer. At the last layer, the encoders generate the distribution parameters (μ_ž_i, σ_ž_i) and (μ_z̃_i, σ_z̃_i), respectively. At the reparameterization layer, the two latent features ž_i and z̃_i are built using the reparameterization trick.
The decoder D_i accepts multiple inputs. First, it takes the pair of latent features (ž_i, z̃_i), composed of the distinctive latent feature and the common latent feature, and reconstructs the input of the current modal. Second, it rebuilds the input of the current modal from the pair (ž_i, z̃_j), composed of the distinctive latent feature of the current modal and the common latent feature of another modal.
Given data x_i in modal i and x_j in modal j, the local cross reconstruction process of the i-th modal is described as follows. Naturally, the local cross reconstruction network loss contains two parts, the reconstruction loss of the current modal and the cross reconstruction losses between the current modal and every other modal. The reconstruction loss of modal i is

$$\mathcal{L}_{rec}^{i} = \|x_i - D_i(\check{z}_i, \tilde{z}_i; \phi_i)\|^2. \quad (10)$$

The cross reconstruction loss of modal i is

$$\mathcal{L}_{cross}^{i} = \sum_{j \neq i} \|x_i - D_i(\check{z}_i, \tilde{z}_j; \phi_i)\|^2. \quad (11)$$

Therefore, combining Eq. (10) and Eq. (11), we get the final loss function of an LCR network:

$$\mathcal{L}_{LCR} = \sum_{i=1}^{M} \big(\mathcal{L}_{rec}^{i} + \gamma \mathcal{L}_{cross}^{i}\big), \quad (12)$$

where γ is a hyperparameter controlling the weight of the cross reconstruction term. For instance, Fig. 1(b) shows a two-modal local cross reconstruction network. The two encoders {E_1, E_2} accept data {x_1, x_2} from the two modals simultaneously. The encoder of the first modal, E_1, encodes x_1 into the distinctive latent feature ž_1 and the common latent feature z̃_1. After that, we use the pair of latent features (ž_1, z̃_1) as the input of the decoder of the current modal, which generates a reconstruction x̂_1 for the input x_1. Besides, the common latent feature z̃_1 is conveyed to the decoder of the second modal to produce the cross reconstruction x̂_12. The same process is executed for the second modal.
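A minimal Python sketch of the LCR objective in Eq. (10)-(12), assuming the reconstructions have already been produced by the decoders (the data structures and function names are ours):

```python
def mse(a, b):
    # mean squared error between two equal-length vectors
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)

def lcr_loss(x, recon_self, recon_cross, gamma=1.0):
    # x[i]           : input of modal i
    # recon_self[i]  : D_i(ž_i, z̃_i), reconstruction from modal i's own features
    # recon_cross[i] : dict mapping j -> D_i(ž_i, z̃_j), cross reconstructions
    #                  using modal j's common feature
    total = 0.0
    for i in range(len(x)):
        total += mse(x[i], recon_self[i])                       # Eq. (10)
        total += gamma * sum(mse(x[i], r)                       # Eq. (11)
                             for j, r in recon_cross[i].items() if j != i)
    return total
```

Note that every cross term still reconstructs the current modal's input x_i; only the common part of the latent code is swapped, which is what forces the common features of different modals toward each other.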
We now explain the LCR theoretically in detail. μ̃_i and σ̃_i are the distribution parameters of the common latent feature z̃_i. E_i and D_i are the encoder and the decoder of modal i, with parameters θ_i and φ_i, respectively. x̂_ij denotes the reconstruction of the input x_j using the distinctive latent feature ž_j and the common latent feature z̃_i.
In this scenario, the reconstruction loss and cross reconstruction loss of modal i are

$$L_i = \|x_i - \hat{x}_i\|^2, \quad (13)$$

$$L_{ji} = \|x_i - \hat{x}_{ji}\|^2, \quad (14)$$

where L_i denotes the reconstruction loss using the decoder output x̂_i = D_i(ž_i, z̃_i; φ_i), and L_ji denotes the cross reconstruction loss using the decoder output x̂_ji = D_i(ž_i, z̃_j; φ_i). Optimizing Eq. (13) and Eq. (14) updates the parameters θ_i and φ_i of the encoder E_i and decoder D_i, which means the common latent feature provided by the j-th modal, z̃_j, is treated as fixed while the i-th modal is being trained.
The two loss functions guide the training process to reduce the mean squared error between the two reconstructions and the original data. Therefore, for the i-th modal, the two reconstructions approach the input x_i after the network is trained. Consequently, the distance between the cross reconstruction x̂_ji and the reconstruction x̂_i is minimized. As they are outputs of the same decoder D_i, the inputs of the decoder are also close. Both inputs contain two parts, of which the distinctive latent feature ž_i is the same. Therefore, by removing the shared distinctive latent feature ž_i, we conclude that the common latent features of the two modals, z̃_i and z̃_j, are close to each other.
Given that the common latent features of the two modals are close, the distributions they are sampled from are also close, which we show by calculating the Wasserstein distance between the two distributions. Provided that z̃_i is sampled from N_i(μ̃_i, (σ̃_i)^2) and z̃_j is sampled from N_j(μ̃_j, (σ̃_j)^2), the Wasserstein distance between N_i(μ̃_i, (σ̃_i)^2) and N_j(μ̃_j, (σ̃_j)^2) is

$$W(N_i, N_j) = \inf_{\varepsilon \in \Pi(N_i, N_j)} \mathbb{E}_{(\tilde{z}_i, \tilde{z}_j) \sim \varepsilon}\big[\|\tilde{z}_i - \tilde{z}_j\|\big], \quad (15)$$

where ε ∈ Π(N_i, N_j) represents a coupling from the set of joint distributions of N_i and N_j. In summary, the Wasserstein distance between p(μ̃_i, (σ̃_i)^2) and p(μ̃_j, (σ̃_j)^2) is minimized after training. We conclude that the LCR network extracts close common latent features from multimodal data.

IV. THE PROPOSED FRAMEWORK
Generally, multimodal clustering partitions data X in M modalities into k clusters. We propose two novel multimodal clustering algorithms called the Deep Multimodal Clustering with Global Cross Reconstruction(DMGCR) and the Deep Multimodal Clustering with Local Cross Reconstruction (DMLCR) respectively. The two algorithms both contain two stages, the feature extraction stage and the clustering stage. We introduce our algorithms in detail in this section.

A. FEATURE EXTRACTION 1) FEATURE EXTRACTION WITH GLOBAL CROSS RECONSTRUCTION
Multimodal data contain modality-unique features and modality-common features. It is infeasible to extract common latent features directly from different modalities with traditional encoders. We use the Variational Information Bottleneck (VIB) [16] to extract the common latent features among different modalities, and then apply the global cross reconstruction method to ensure that the common latent features of different modalities follow similar distributions. We adopt a GCR autoencoder as the feature extractor. For a given input x_i in the i-th modality, the encoder extracts the latent feature z_i in a low-dimensional space, and the decoder reconstructs the input from the latent representation. After training, the GCR autoencoder yields a fine-tuned latent feature set {z_i}, i = 1, ..., M, for all M modals.

2) FEATURE EXTRACTION WITH LOCAL CROSS RECONSTRUCTION
Furthermore, we utilize the local cross reconstruction autoencoder to extract more precise common latent features. As described in the previous section, the LCR autoencoder explicitly separates common latent features from distinctive latent features. The encoder E_i in the LCR autoencoder retrieves the latent feature pair (ž_i, z̃_i). The decoders {D_j}, j ≠ i, reconstruct the input multiple times with every pair in {(ž_j, z̃_i) | j = 1, 2, ..., M, j ≠ i}. The training process takes advantage of both the distinctive latent feature ž_i and the common latent feature z̃_i. However, only the common latent features are used in the next feature fusion stage, because they capture the shared information among different modalities.

B. FEATURE FUSION AND CLUSTERING
After extracting latent features z 1 , z 2 , . . . , z M from all modalities, we fuse those features to merged latent features z * , and cluster on the shared features z * .
All the latent features share the same dimensionality d_z, so the merged features also have d_z dimensions. As shown in FIGURE 2, we use fusion layers to merge the extracted features. The fusion layers consist of fully connected layers. First, each latent feature z_i passes through a d_z × d_z dense layer to produce the weighted latent feature z_i^w. Next, we concatenate all weighted latent features into z^c and downsample the concatenated feature to the merged feature z^*:

$$z^* = \mathrm{Fusion}(z_1, z_2, \ldots, z_M; \eta), \quad (16)$$

where η denotes the parameters of the fusion layers. For the GCR network, the inputs of the fusion layers are the latent features z_1, z_2, ..., z_M. For the LCR network, the inputs are the common latent features z̃_1, z̃_2, ..., z̃_M. We use the mean squared error between the fused feature z^* and each extracted latent feature z_i to train the fusion layers:

$$\mathcal{L}_{fusion} = \sum_{i=1}^{M} \|z^* - z_i\|^2. \quad (17)$$

After training the fusion layers, the merged latent feature z^* captures the shared information among modals. Finally, we perform clustering on the merged features. We choose k-means as the final clustering algorithm in this article.
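Omitting activation functions and training, the data flow of the fusion layers can be sketched with plain linear algebra (a toy sketch; the weight matrices here are hand-picked for illustration, whereas in the framework they are learned with the fusion loss):

```python
def matvec(mat, vec):
    # multiply a matrix (list of rows) by a vector
    return [sum(w * v for w, v in zip(row, vec)) for row in mat]

def fuse(latents, weight_mats, down_mat):
    # 1) per-modal d_z x d_z dense layer produces the weighted feature z_i^w
    weighted = [matvec(W, z) for W, z in zip(weight_mats, latents)]
    # 2) concatenate all weighted features into z^c
    zc = [v for w in weighted for v in w]
    # 3) downsample the (M * d_z)-dim concatenation back to d_z dimensions
    return matvec(down_mat, zc)

# toy setup: d_z = 2, M = 2, identity per-modal weights, averaging downsampler
identity = [[1.0, 0.0], [0.0, 1.0]]
down = [[0.5, 0.0, 0.5, 0.0],
        [0.0, 0.5, 0.0, 0.5]]
z_star = fuse([[2.0, 0.0], [0.0, 2.0]], [identity, identity], down)
```

With the averaging downsampler, z_star is simply the mean of the two latent features; the learned layers generalize this to a weighted, data-driven combination.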

C. THE DMGCR ALGORITHM
We first implement the deep multimodal clustering with global cross reconstruction(DMGCR). The algorithm is composed of two stages.
In the first stage, we extract latent features from the different modalities by training the autoencoders with global cross reconstruction, which ensures that the extracted features of different modalities share similar distributions. In this stage, we train the GCR autoencoders with the loss function of Eq. (8).
In the second stage, we fuse the features to merged features by fusion layers, then we cluster on the merged features. This stage is guided by the fusion loss demonstrated as Eq. (17).
We adopt ADAM [31] as the optimizer with a learning rate of 10^-3 in both stages. The whole process of DMGCR is summarized in Algorithm 1.

D. THE DMLCR ALGORITHM
We further implement the deep multimodal clustering with local cross reconstruction(DMLCR). The algorithm is composed of two stages too.
In the first stage, we extract latent features from the different modalities by training the autoencoders with local cross reconstruction, which explicitly separates the common latent features from the distinctive latent features. In this stage, we train the LCR autoencoders with the loss function of Eq. (12).
In the second stage, we fuse the features to merged features by the same fusion layers as the former algorithm, then we cluster on the merged features. This stage is guided by the fusion loss demonstrated as Eq. (17).
We adopt ADAM [31] as the optimizer with a learning rate of 10^-3 in both stages. The whole process of DMLCR is summarized in Algorithm 2.

Require:
  The number of modalities M; the dataset of N data points for each modality i.
1: for number of training iterations do
2:   Sample a minibatch of samples from each modality;
3:   for each i ∈ [1, M] do
       …
     for each j ∈ [1, M]\{i} do
10:     x̂_ij = D_j(z̃_i, ž_j; φ_j);
11:   end for
12:   Train E_i and D_i with Eq. (12);
13:  end for
14: end for
15: for number of training iterations do
16:   Sample a minibatch of samples from each modality;
17:   for each i ∈ [1, M] do
       …
20:  end for
21:  z^* = Fusion(z_1, z_2, …, z_M; η);

V. EXPERIMENTS
We compare our DMGCR and DMLCR with nine state-of-the-art baseline algorithms on six multimodal datasets. We evaluate the clustering results with three metrics. The details and results are described in this section.
A. DATASETS
AwA comprises images of animals; it has 5814 samples belonging to 10 clusters. We select three features as three modalities, local self-similarity features, SIFT features and SURF features, which are all pretrained features of the images provided by the original source. CNN is a news dataset that contains two modalities of 2107 samples belonging to 7 clusters; the first modality consists of text contents, the second modality contains the images of the articles. Digits consists of features of handwritten numerals ('0'-'9') extracted from a collection of Dutch utility maps. It contains three modalities of 2000 samples belonging to 10 clusters; the three modalities respectively have 76, 216 and 240 dimensions. Caltech101, LUse-21 and Scene-15 all consist of pictures of objects: we extract LBP, GIST and CENTRIST descriptors from these datasets as three modalities. Caltech101 is an object recognition dataset containing 8677 images belonging to 101 categories; we choose 712 samples that belong to 10 clusters. LUse-21 contains 2100 samples belonging to 21 clusters. Scene-15 contains 3000 samples assigned to 15 clusters. Table 1 shows the statistics of the datasets, in which M represents the number of modalities, d_i is the dimension of the i-th modality, N indicates the number of objects, and K is the number of clusters.

B. BASELINE ALGORITHMS
We compare the proposed framework with the following baseline algorithms: (1) Single-modality clustering methods: DEC [5], IDEC [6] and JULE [18]. We test these methods on each modality and take the best result as the final result. (2) Multimodal clustering methods: MMSC [22], RMKMC [23], DIMSC [24], LT-MSC [25], ECMSC [26] and DCCAE [15]. Among these multimodal clustering methods, DCCAE is a two-modal method; to extend it to the multimodal clustering task, we combine every two of the multiple modalities and take the average result as the final result. (3) Simplified versions of DMGCR and DMLCR, obtained by dropping the cross reconstruction term (DMC and DMLC).

C. EVALUATION METRICS
We evaluate our approaches on three common metrics, Accuracy (ACC), Normalized Mutual Information (NMI), and Purity.
ACC measures the best mapping between cluster assignments and true labels, which is defined by

$$ACC = \frac{\sum_{i=1}^{N} \mathbb{1}\{l_i = m(c_i)\}}{N},$$

where l_i and c_i are the true label and predicted cluster of data point x_i, and m(·) is the permutation mapping function that maps each cluster label c_i to the equivalent label from the data.
NMI calculates the normalized measure of similarity between two labelings of the same data, which is defined as

$$NMI = \frac{I(l; c)}{\max\{H(l), H(c)\}},$$

where I(l; c) denotes the mutual information between the true labels l and the predicted clusters c, and H represents their entropy. NMI does not change under permutations of clusters, and it is normalized to the range [0, 1], with 0 meaning no correlation and 1 exhibiting perfect correlation.
Purity reflects the percentage of each cluster containing correctly grouped samples, which is defined as

$$Purity = \sum_{i=1}^{k} \frac{s_i}{n} P_i,$$

where k is the number of clusters, s_i and n are the number of samples of the i-th cluster and the total number of samples, respectively, and P_i denotes the fraction of correctly classified samples in the i-th cluster.
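The three metrics are straightforward to implement. The sketch below (pure Python, our function names) brute-forces the mapping m(·) in ACC over all label permutations, which is only feasible for a small number of clusters; in practice the Hungarian algorithm is used instead:

```python
import math
from collections import Counter
from itertools import permutations

def purity(labels, clusters):
    # for each predicted cluster, count the majority true label, then average
    by_cluster = {}
    for l, c in zip(labels, clusters):
        by_cluster.setdefault(c, []).append(l)
    correct = sum(Counter(members).most_common(1)[0][1]
                  for members in by_cluster.values())
    return correct / len(labels)

def acc(labels, clusters):
    # brute-force the best cluster -> label mapping m(.)
    # (assumes equal numbers of clusters and labels)
    cluster_ids = sorted(set(clusters))
    label_ids = sorted(set(labels))
    best = 0
    for perm in permutations(label_ids):
        m = dict(zip(cluster_ids, perm))
        best = max(best, sum(1 for l, c in zip(labels, clusters) if m[c] == l))
    return best / len(labels)

def nmi(labels, clusters):
    # I(l; c) / max{H(l), H(c)} from empirical frequencies
    n = len(labels)
    pl, pc = Counter(labels), Counter(clusters)
    joint = Counter(zip(labels, clusters))
    mi = sum(cnt / n * math.log((cnt / n) / ((pl[l] / n) * (pc[c] / n)))
             for (l, c), cnt in joint.items())
    hl = -sum(v / n * math.log(v / n) for v in pl.values())
    hc = -sum(v / n * math.log(v / n) for v in pc.values())
    return mi / max(hl, hc)
```

Note that ACC and NMI are invariant to relabeling the clusters, so a perfect clustering scores 1.0 even when the cluster indices disagree with the label indices.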

D. MODEL AND PARAMETER SETTINGS
The model parameter settings of our experiments are as follows: (1) For baseline algorithms, we set their parameters with values provided in the original papers. During training, we fine-tune the parameters of baseline methods to get the best performance.
(2) We use three autoencoders to handle datasets with three modalities and two autoencoders for datasets with two modalities. The encoders and decoders are composed of fully connected layers. We use sigmoid as the activation function in the last layer of the decoders, and ReLU activation in the other layers of the encoders and decoders. The weights of the neural networks in our framework are randomly initialized. We set the learning rate of ADAM to 0.001.
(3) We select k-means as the final clustering method, considering the interpretation of the Euclidean distance in the feature space as a diffusion distance in the input space [33]-[35]. It can be replaced by other clustering methods.

E. PERFORMANCE EVALUATION 1) CLUSTERING RESULTS
The experimental results for clustering ACC, NMI, and Purity are summarized in Table 2, Table 3, and Table 4, with the best results marked in bold. Firstly, we compare DMGCR and DMLCR with the single-modal algorithms DEC, IDEC, and JULE. In general, DMLCR and DMGCR outperform the single-modal algorithms when clustering multimodal data, which indicates the effectiveness of multimodal fusion. DMGCR and DMLCR perform better than DEC and IDEC on all datasets in terms of ACC, NMI, and Purity, and outperform JULE in most cases on these metrics. JULE performs best on only one metric, Purity, while our algorithms clearly surpass it on the other two. Note that DEC and IDEC use a plain autoencoder without VIB and do not perform well on multimodal datasets. This series of results demonstrates that our algorithms are carefully designed for multimodal clustering.
Secondly, we compare DMGCR and DMLCR with the multimodal clustering algorithms MMSC, RMKMC, DIMSC, LT-MSC, ECMSC, and DCCAE. In general, DMLCR outperforms the baseline multimodal methods, and DMGCR also performs better than most multimodal state-of-the-art works. For instance, compared with the baseline methods, the NMI of DMLCR improves by 12% on the Digits dataset, 3% on the AwA dataset, 13% on the Cal101 dataset, and 10% on the LUse-21 dataset. DMLCR outperforms MMSC, RMKMC, DIMSC, and ECMSC on every dataset and every metric. LT-MSC and DCCAE occasionally perform slightly better on Purity than the proposed algorithms; nevertheless, DMLCR is comparable with them on the other metrics. From the above results, we conclude that early fusion and cross reconstruction improve clustering performance on multimodal data.
Thirdly, we simplify our algorithms by dropping the cross reconstruction term from the loss function, and call the resulting variants DMC and DMLC respectively. The results in the tables indicate that DMGCR outperforms DMC and DMLCR performs better than DMLC, which confirms that the cross reconstruction regularization is effective in extracting common features from different modalities.
Finally, DMLCR surpasses DMGCR in most cases across all metrics. This indicates that local cross reconstruction goes beyond global cross reconstruction by explicitly retrieving the common latent features.
In summary, it can be concluded that our methods perform the best on the multimodal datasets. The proposed cross reconstruction regularization improves the performance of multimodal clustering, which further indicates that it is beneficial to establish early fusion among different modalities.

2) PARAMETER SETTING RESULTS
The loss of the DMGCR algorithm has two hyperparameters, β and γ, which control the weights of the VIB and cross reconstruction terms respectively. In Table 5, we discuss the effect of β and γ on the clustering performance of DMGCR. Due to space limitations, we only present the experimental results on the Digits and Cal101 datasets in this paper. Both β and γ vary over the set {0, 0.5, 1}, and the best results are marked in bold.
The values of β and γ reflect how strongly we enforce the VIB regularization and the cross reconstruction regularization, respectively. Setting β = 1, γ = 0 stands for DMGCR without cross reconstruction regularization, β = 0, γ = 1 stands for DMGCR without VIB regularization, and β = 0, γ = 0 stands for DMGCR without both the cross reconstruction and VIB regularizations. The performance of DMGCR without one regularization is better than that of DMGCR without both, but worse than that of DMGCR with both, which proves the validity of the VIB and cross reconstruction regularizations. Furthermore, as shown in the tables, we get the best results when β = 1 and γ = 0.5.
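The weighting just described can be summarized in a small sketch; the term names recon, vib, and cross are placeholders for the reconstruction, VIB, and cross reconstruction losses, not the paper's notation:

```python
def dmgcr_loss(recon, vib, cross, beta=1.0, gamma=0.5):
    """DMGCR objective as a weighted sum (illustrative form):
    beta scales the VIB term, gamma the cross reconstruction term.
      beta=1, gamma=0  -> no cross reconstruction regularization
      beta=0, gamma=1  -> no VIB regularization
      beta=0, gamma=0  -> plain reconstruction only"""
    return recon + beta * vib + gamma * cross
```

Sweeping beta and gamma over {0, 0.5, 1}, as in Table 5, reduces to calling this function with each pair of weights.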
The loss function of DMLCR has one hyperparameter β, which represents the weight of the cross reconstruction term. Table 6 shows the clustering performance for different values of β. We set β to 0.6, which achieves the best clustering performance.

VI. CONCLUSION
In this paper, we propose a novel deep multimodal clustering framework and two algorithms with cross reconstruction. Firstly, we analyze the mechanism by which the global cross reconstruction and the local cross reconstruction effectively reduce the distribution differences of multimodal latent features. Secondly, we build a deep multimodal clustering network framework and two specific implementations, DMGCR and DMLCR. Finally, we compare our DMGCR and DMLCR algorithms with state-of-the-art multimodal methods on representative multimodal datasets. Experimental results show that our algorithms achieve clear improvements on multimodal clustering tasks. A further study could assess end-to-end training instead of two-stage training to simplify the whole framework. Another possible area of future research would be to apply cross reconstruction to more deep clustering methods.

XIAORUI TANG received the bachelor's and master's degrees in software engineering from the Dalian University of Technology, Dalian, China, in 2016 and 2020, respectively. He is currently an Engineer with Meituan Inc.

VOLUME 8, 2020