Learning Decorrelated Hashing Codes With Label Relaxation for Multimodal Retrieval

Due to the correlation among hashing bits, the retrieval performance improvement becomes slower when the hashing code length becomes longer. Existing methods try to regularize the projection matrix as an orthogonal matrix to decorrelate hashing codes. However, the binarization of projected data may completely break the orthogonality. In this paper, we propose a minimum correlation regularization (MCR) for multimodal hashing. Rather than being imposed on projection matrix, MCR is imposed on a differentiable function which approximates the binarization. On the other hand, binary labels could not precisely reflect the distances among data. Hence, we propose a label relaxation scheme to achieve better performance.


I. INTRODUCTION
Multimodal hashing, which embeds data into binary codes, is an efficient tool for retrieving heterogeneous but correlated multimedia data, such as image-text pairs on Facebook and video-tag pairs on YouTube. Unlike the real-valued vectors used in traditional retrieval methods [1]-[4], binary codes can greatly reduce the storage requirement and the computation cost of nearest neighbor search.
Orthogonality is assumed to be a quality of good hashing codes [5]. However, the orthogonality constraint leads to an NP-hard problem. Hence, there are two widely used ways to approximate an orthogonal code matrix: (1) adopting orthogonal vectors and then thresholding them to generate binary codes [5], [6]; (2) imposing an orthogonality regularization on the objective function [7], [8]. Both ways share a theoretical defect: the orthogonality is corrupted by quantization.
Spectral hashing (SH) [5] and iterative quantization (ITQ) [6] are two representative works in way (1). SH selects eigenfunctions corresponding to several smallest eigenvalues and thresholds eigenfunctions at zero. ITQ rotates the principal components and thresholds data projected by those principal components at zero. Obviously, thresholding orthogonal vectors at zero cannot generate orthogonal binary vectors.
The associate editor coordinating the review of this manuscript and approving it for publication was Maurizio Tucci.
As a representative example of way (2), deep multimodal hashing with orthogonal regularization (DMHOR) [7] is illustrated in Fig. 1. Liong et al. [9] and Chen et al. [10] also use this orthogonal regularization in their deep hashing models.
Deep multimodal hashing with orthogonality regularization (DMHOR) [7] introduces an orthogonality regularization (OR) into a deep neural network (DNN). It uses restricted Boltzmann machines (RBMs) for image and text data. Each layer of an RBM can be represented as a nonlinear activation function of a linear transformation of the input. The OR is applied on the weight matrix of each layer. The authors argue that the proposed OR can lead to an orthogonal code matrix when the data matrices are orthogonal. This assumption is unreasonable in real applications. In this paper, we will briefly analyze the properties of this OR and demonstrate that it is only suitable for some linear hashing models. Deep cross-modal hashing (DCMH) [11] employs different types of DNN for different modalities. For example, a convolutional neural network (CNN) is used for images while a fully connected neural network is used for text. In DCMH, the orthogonality of hashing codes is neglected.
In this paper, we propose a hashing method named decorrelated multimodal hashing (DMH). First, a sigmoid function is applied on the linear transformations of original data points to map different modalities into a common code matrix. Then, we devise a minimum correlation regularization (MCR) to improve the retrieval performance on long-bit experiments. Unlike the aforementioned orthogonality constraints or regularizations [7], which are usually applied on the linear transformation matrices, the proposed MCR is applied on the sigmoid function. Because the output of the sigmoid function approximates a binary code and the hashing code matrix directly depends on its quantization, the proposed MCR works better on decorrelating hashing codes (Fig. 1).
FIGURE 1. Illustration of the difference between our regularization term and that of DMHOR. Red arrows indicate the flowchart of our method, while the blue ones indicate the flowchart of the method proposed in DMHOR. W is the projection matrix, X is the data matrix, v is a bias, B is the hashing code matrix, I is the identity matrix and f is the sigmoid function used to approximate binarization.
We do not use the term ''orthogonality'' because the maximum number of mutually orthogonal vectors is equal to their dimension, and an orthogonal linear transformation does not exist when the rank of a data matrix is less than that of its code matrix. For instance, if an N × d data matrix is encoded as an N × c code matrix, where N is the number of data points and d < c, the dimension of the linear transformation matrix W should be d × c. Because we cannot find c mutually orthogonal d-dimensional column vectors, an orthogonal W does not exist. In Subsection III-B, we will prove that when d + 1 < c, the output matrix of the sigmoid function cannot be orthogonal and hence the orthogonality of the code matrix cannot even be approximated.
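The dimension argument above can be written as a single chain of inequalities. The following is a short worked version that uses only W, d and c as defined in the paragraph; the symbol I_c for the c × c identity matrix is notation introduced here for convenience.

```latex
\operatorname{rank}\!\left(W^{\top} W\right) \le \operatorname{rank}(W) \le \min(d, c) = d < c = \operatorname{rank}(I_c)
\quad\Longrightarrow\quad W^{\top} W \neq I_c .
```

In words, a d × c matrix with d < c has at most d linearly independent columns, so its c columns cannot all be mutually orthogonal unit vectors.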
Besides the orthogonality regularization, a label relaxation method is proposed for multi-labeled data sets. Labels are generally treated as a special modality in multimodal hashing methods. Therefore, relaxed labels that can reflect the distances among data will benefit the hashing process. Ji et al. [12] proposed a deep multi-level semantic hashing method which is similar to ours. However, their method needs to compute the mutual distances among labels, which makes it intractable for large data sets.
Illustration of the proposed relaxation method in multi-labeled data. On image a, both human experts and the proposed method should label it as flower because it only contains flowers. It can be seen that for single-labeled data, there is little difference between human experts and the proposed method. However, when a, b, c and d are treated as data in a multi-labeled data set, human experts label them as ''1,0'' or ''1,1'' in the traditional way, while the proposed method relaxes labels to be real numbers according to their distances to the classes in the feature space. The closer a data point is to the class, the larger the label. That is, the generated labels can reflect the distances. As labels are treated as a special modality in the proposed method, it is beneficial if they reflect the distances.
The rest of this paper is organized as follows. The related works are reviewed in Section II. In Section III, we, step by step, derive our model from a widely used unimodal hashing method, iterative quantization (ITQ) [6]. The discussions on parameter settings and optimization algorithms are also given in Section III. Experimental results are reported in Section IV. We conclude this paper in Section V.
VOLUME 8, 2020

II. RELATED WORKS
Some well-known multimodal hashing models are related to some classical unimodal ones. Hence, in this section, unimodal hashing models will be firstly reviewed and then we will discuss some representative multimodal hashing models and their relations to unimodal ones. For a more comprehensive survey on unimodal hashing methods, please refer to [13].

A. UNIMODAL HASHING
Most existing unimodal hashing models focus on image retrieval tasks. However, they are also feasible for other types of data as long as the data are represented in real vectors. For images, there are lots of popular feature extraction methods [14], [15] available to represent images by real vectors.
The unimodal hashing methods can be divided into two categories according to their dependence on data. Locality-sensitive hashing (LSH) [16] and its kernelized versions [17], [18] are well-known data-independent unsupervised unimodal hashing methods. Due to randomized hashing, LSH demands more bits per hash table [19].
Spectral hashing (SH) [5], one of the most popular and pioneering data-dependent unimodal hashing methods, generates hashing codes by solving a relaxed mathematical problem to avoid computing the affinity matrix, which requires calculating and storing pairwise distances of the whole data set [20]. The authors argued that two constraints for a good code matrix are orthogonality and balance, either of which leads to an NP-hard problem. In the following works, balance is generally neglected and the orthogonality constraint is relaxed or neglected, too.
Anchor graph hashing (AGH) [21] substitutes the affinity matrix in SH with a highly sparse one constructed from several anchor points. Discrete graph hashing (DGH) [8] incorporates a relaxed orthogonality constraint into AGH to improve the performance on long-bit experiments.
Methods based on linear transformations, such as principal component analysis (PCA) [22], attract wide interest due to their effectiveness and computational efficiency. ITQ rotates the projection matrix obtained by PCA to minimize the quantization loss. Isotropic hashing (IsoH) [23], harmonious hashing (HH) [23] and ok-means [24] are derived from ITQ. IsoH equalizes the importance of principal components. HH puts an orthogonal constraint on an auxiliary variable for the code matrix. ok-means rotates the data matrix to minimize the quantization loss. ITQ, IsoH and HH depend on principal components, whose number is no larger than the minimum dimension of the data matrix. Hence, they cannot generate hashing codes longer than the data dimension. Besides PCA, other linear transformations can be used, such as linear discriminant analysis (LDA) [25]. Unlike these methods with precomputed transformation matrices, neighborhood discriminant hashing (NDH) [26] calculates the transformation matrix during the iterative minimization procedure.
Inductive manifold hashing [19] embeds some special samples into lower dimensional space and the embeddings of remaining samples are calculated by a linear combination of those special samples. The coefficients of the linear combination are the probabilities that a sample belongs to those special samples.
None of the aforementioned unimodal hashing models can generate a balanced code matrix. Spherical hashing (SpH) [27] and global hashing system (GHS) [20] quantize the distance between a data point and a special point. The half of the data closer to a special point is denoted as 1 while the farther half is denoted as 0. Therefore, a balanced matrix can be easily generated. Their major difference is in how to find these special points. SpH uses a heuristic algorithm while GHS treats it as a satellite distribution problem of the Global Positioning System (GPS).
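The balanced quantization described above can be sketched in a few lines. This is only an illustration of the "closer half is 1, farther half is 0" rule for a single special point; the function name `balanced_bits` and the use of the median distance as the threshold are assumptions of this sketch, and it does not reproduce how SpH or GHS choose their special points.

```python
import math
from statistics import median

def balanced_bits(points, pivot):
    """Return one bit per point: 1 for the half closer to `pivot`, else 0.

    Thresholding at the median distance is an assumption of this sketch.
    With an even number of points and distinct distances it yields exactly
    as many ones as zeros.
    """
    dists = [math.dist(p, pivot) for p in points]
    threshold = median(dists)
    return [1 if d < threshold else 0 for d in dists]

points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
bits = balanced_bits(points, pivot=(0.0, 0.0))
print(bits)
```

Repeating this with c different special points would give a c-bit code whose columns are each balanced.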
Some unsupervised unimodal hashing models can be easily extended to supervised models. For example, by substituting PCA with canonical correlation analysis (CCA) [28], the label information can be incorporated. Besides unsupervised and supervised models, weakly-supervised models [29]-[31] are also promising, since labels can significantly improve retrieval accuracy but manually labelling images is a heavy burden for human experts.

B. MULTIMODAL HASHING
Multimodal hashing models can be classified into unsupervised and supervised ones. Unsupervised multimodal hashing tries to preserve the Euclidean data structure by binary codes. Inter-media hashing (IMH) [19] learns hashing functions by linear regression. IMH models intra-media consistency in a way similar to SH. Like what AGH has done to SH, linear cross-media hashing (LCMH) [32] uses the distances between each data point and each cluster centroid to construct a sparse affinity matrix. Collective matrix factorization hashing (CMFH) [33] can be treated as an extension of NDH. For each modality, CMFH consists of two terms: (1) calculating a transformation matrix for the data matrix to match the code matrix through minimizing the quantization loss, and (2) calculating a transformation matrix for the code matrix to match the data matrix through minimizing the squared error. Latent semantic sparse hashing (LSSH) [34] is an extension of CMFH, and its basic idea is similar to that of HH, which imposes the orthogonality constraint on an auxiliary variable. LSSH imposes a sparse regularization on an auxiliary variable in the latent space. Shen et al. [35] proposed a cross-view hashing method for semi-paired data. It jointly learns a correlated representation for each modality and the hashing functions. It rotates the hashing code matrix to match the correlated representation matrices. Hence, it can be seen as an extension of ok-means.
By incorporating label information, supervised hashing can achieve higher accuracy. Cross-modality similarity-sensitive hashing (CMSSH) [36] treats hashing as a binary classification problem. Cross-view hashing (CVH) [37] assumes the hashing codes to be a linear embedding of the original data points. It substitutes the code matrix by this embedding. The objective function is a weighted summation of that of spectral hashing (SH) [5] on each modality. Multilatent binary embedding (MLBE) [38] treats hashing codes as the binary latent factors in the proposed probabilistic model and maps data points from multiple modalities to a common Hamming space. Semantics-preserving hashing (SePH) [39] learns the hashing codes by minimizing the KL-divergence of the probability distribution in Hamming space from that in semantic space. CMSSH, MLBE and SePH need to compute the affinities of all data points, which makes them intractable for large data sets. Semantic correlation maximization (SCM) [40] circumvents this by learning only one bit each time, and the explicit computation of the affinity matrix is avoided through several mathematical manipulations. Multimodal discriminative binary embedding (MDBE) [41] models hashing as a minimization problem. There are two main terms in its formulation. One term indicates that different modalities and the labels can be embedded into the same latent space, while the other indicates that the embedded modalities can be further embedded as the labels. The l2-norm is used to regularize the linear embedding matrix. Intra- and inter-modality similarity preserving hashing (IISPH) [42] measures the similarity among data within the same modality and across different modalities. SCM, MDBE and IISPH discard the uncorrelation property of the code matrix or embedding matrix, which makes their performance improve slowly as the code length increases.
Most hashing methods relax the binary constraint. Xu et al. [43] propose a discrete optimization algorithm to directly learn hashing codes without relaxing the binary constraint. Collective reconstructive embeddings [44], [45] use modality-specific similarity metrics for different modalities. Besides the above-mentioned shallow models, deep neural networks (DNNs) are extensively studied in cross-modal hashing. Among the DNN-based methods, the adversarial learning based models [46]-[48] have achieved appealing results.

III. METHODOLOGY
The terms ''view'' and ''modality'' are distinguished in some of the literature [41]. Multiple views of data refer to different types of features of one modality, e.g., SIFT [49] and GIST [50] features for images. However, we use these two words interchangeably since our method can be used in either situation as long as the data are represented by real matrices.
First, let us define the notation. Suppose that $X_i$ is the $i$-th view matrix of the data, $n$ is the number of data points and $i = 1, \ldots, g$. A binary code corresponding to the $m$-th data point is defined by a row vector $b_m \in \{0, 1\}^c$, where $c$ is the code length, and the code matrix is $B = [b_1^\top, \ldots, b_n^\top]^\top$. $h_i(X_i)$, the hashing function for the $i$-th view matrix, embeds $X_i$ into a binary code matrix.

A. PROBLEM FORMULATION
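ITQ is a successful hashing method for single-view data. The formulation of ITQ is
$$\arg\min_{B, R} \|B - XWR\|_F^2 \quad \text{s.t. } R^\top R = I, \tag{1}$$
where $B$ is the binary code matrix, $X \in \mathbb{R}^{n \times d}$ is the data matrix, $W \in \mathbb{R}^{d \times c}$ is obtained by principal component analysis (PCA) and $R \in \mathbb{R}^{c \times c}$ is an orthogonal matrix. An intuitive multi-view extension of ITQ can be
$$\arg\min_{B, R_i} \sum_{i=1}^{g} \alpha_i \|B - X_i W_i R_i\|_F^2 \quad \text{s.t. } R_i^\top R_i = I, \tag{2}$$
where $\alpha_i$ is a positive real constant. As the maximum number of principal components pre-computed by PCA on the $i$-th view matrix is $d_i$, Eq. (2) cannot be used when $c > d_i$.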
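For readers unfamiliar with ITQ, the alternating minimization behind the formulation above can be sketched as follows. This is a minimal NumPy sketch, not the authors' code: it uses codes in {-1, +1} as in the original ITQ formulation [6] (the {0, 1} codes used in this paper correspond to (B + 1) / 2), and the names `itq`, `n_iter` and `seed` are hypothetical.

```python
import numpy as np

def itq(V, n_iter=50, seed=0):
    """Alternate between B = sign(V R) and the orthogonal Procrustes update of R.

    V is the (n, c) matrix of zero-centred, PCA-projected data (XW above).
    Returns codes in {-1, +1}, the rotation R and the loss after each iteration.
    """
    rng = np.random.default_rng(seed)
    c = V.shape[1]
    # Random orthogonal initialisation obtained from a QR decomposition.
    R, _ = np.linalg.qr(rng.standard_normal((c, c)))
    losses = []
    for _ in range(n_iter):
        B = np.where(V @ R >= 0, 1.0, -1.0)        # fix R, update B
        U, _, Vt = np.linalg.svd(V.T @ B)          # fix B, update R
        R = U @ Vt
        losses.append(float(np.linalg.norm(B - V @ R) ** 2))
    return B, R, losses

# Toy data: 200 points in 8 dimensions, projected onto 4 principal directions.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 8))
X -= X.mean(axis=0)
_, _, Vt_pca = np.linalg.svd(X, full_matrices=False)
W = Vt_pca[:4].T                                   # shape (8, 4)
B, R, losses = itq(X @ W)
print(losses[0], losses[-1])
```

Each of the two updates is the exact minimizer of the quantization loss with the other variable fixed, so the recorded loss does not increase from one iteration to the next. The sketch also makes the limitation noted above visible: W has at most d columns, so c cannot exceed d.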
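We remove $R_i$ from Eq. (2). Then, we simultaneously calculate $W_i \in \mathbb{R}^{d_i \times c}$ and $B$ during the optimization process. This method can be modeled as
$$\arg\min_{B, W_i} \sum_{i=1}^{g} \alpha_i \|B - X_i W_i\|_F^2. \tag{3}$$
To approximate the binarization, a sigmoid function $f$ is applied on the affine transformation $\beta_i X_i W_i + \mathbf{1} v_i$, where $\mathbf{1}$ is an $n$-dimensional column vector whose elements are equal to 1, $\beta_i$ is a constant and $v_i$ is a bias vector. Hence, Eq. (3) can be modified as follows:
$$\arg\min_{B, W_i, v_i} \sum_{i=1}^{g} \alpha_i \|B - f(\beta_i X_i W_i + \mathbf{1} v_i)\|_F^2. \tag{4}$$
The orthogonality condition for good codes [5] is approximated by an orthogonal $W$ in ITQ. However, when $c > d_i$, an orthogonal $W_i$ does not exist. In this case, Wang et al. [7] introduce the following regularization to decorrelate the code matrix:
$$\|W_i^\top W_i - I\|_F^2. \tag{5}$$
First, let us discuss some interesting properties of Eq. (5).
Proposition 1: When $c \le d_i$, the $W_i$ that minimizes Eq. (5) is an orthogonal matrix.
It is easy to prove Proposition 1 by the definition of an orthogonal matrix.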
Proposition 2: Let the $W_i$ that minimizes Eq. (5) consist of column vectors $w_p^i$, where $p = 1, \ldots, c$. The angles between all pairs of column vectors are equal.
Proof: Let $V = W_i^\top W_i$ and let $V_{pq}$ be the element in the $p$-th row and $q$-th column of $V$. $V_{pq}$ is the inner product of $w_p^i$ and $w_q^i$. When $\|w_p^i\|_F^2 = 1$, the diagonal elements of $V - I$ will be 0 and the angle between $w_p^i$ and $w_q^i$ will be $\arccos({w_p^i}^\top w_q^i)$. Eq. (5) can be written as:
$$\|W_i^\top W_i - I\|_F^2 = \sum_{p \neq q} \left({w_p^i}^\top w_q^i\right)^2. \tag{6}$$
According to the inequality of arithmetic and geometric means, it can be deduced that
$$\sum_{p \neq q} \left({w_p^i}^\top w_q^i\right)^2 \ge c(c-1) \Big(\prod_{p \neq q} \left({w_p^i}^\top w_q^i\right)^2\Big)^{\frac{1}{c(c-1)}}. \tag{7}$$
The equality holds if and only if all ${w_p^i}^\top w_q^i$ are equal. That is, the angle between any pair of column vectors is equal when $W_i$ minimizes Eq. (5).
Proposition 3: If $W_i$ minimizes Eq. (5), the affine transformation of $W_i$, i.e., $W_i R$, also minimizes Eq. (5), where $R$ is an orthogonal matrix.
Proof: As $R$ is orthogonal, we have
$$\|(W_i R)^\top (W_i R) - I\|_F^2 = \|R^\top (W_i^\top W_i - I) R\|_F^2. \tag{8}$$
Eq. (8) can be rewritten as
$$\|R^\top (W_i^\top W_i - I) R\|_F^2 = \|W_i^\top W_i - I\|_F^2. \tag{9}$$
Here, $R^\top R = I$ is used in the deduction. Hence, $W_i R$ also minimizes Eq. (5).
In Fig. 3, we illustrate Proposition 2 and Proposition 3 in the 2-dimensional case. Following the flowchart of ITQ, one can find $c$ $d$-dimensional vectors distributed like those in Fig. 3 and then transform them by $R$ to minimize Eq. (2). However, the complexity of theoretically finding such vectors increases dramatically in high-dimensional spaces.
Wang et al. [7] use Eq. (5) as a regularization and argue that Eq. (5) will lead to an orthogonal code matrix when the data matrices are orthogonal. It is easy to find an example demonstrating that Eq. (5) can only be used in some linear models. For simplicity, let us consider a model, Eq. (10), that fits $B$ to $f(XW)$ under the regularization of Eq. (5), where $X$ is an orthogonal data matrix and $f(\cdot)$ is a linear or nonlinear function. Please note that Eq. (10) is not a unimodal hashing model, because the binary constraint is not imposed on $B$. Let us suppose the dimensions of $B$ and $X$ are equal. According to Proposition 1, Eq. (5) will lead to an orthogonal $W$. If $f(XW) = XW$, then $B = XW$ is also an orthogonal matrix. However, if $f(\cdot)$ is a sign function, which is nonlinear, we get a binary code matrix $B = \mathrm{sign}(XW)$ and Eq. (10) becomes a nonlinear unimodal hashing model. Obviously, an orthogonal $W$ cannot ensure an orthogonal $B$.
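The example above can be checked numerically. The following sketch builds one concrete orthogonal X and W (a 3 × 3 identity and a Householder reflection, both chosen here purely for illustration) and compares the linear case f(XW) = XW with the sign case; the helper names are hypothetical.

```python
import numpy as np

# Householder reflection I - 2 u u^T with a unit vector u: an orthogonal matrix.
u = np.ones((3, 1)) / np.sqrt(3.0)
W = np.eye(3) - 2.0 * (u @ u.T)
X = np.eye(3)                        # an orthogonal "data" matrix

linear_codes = X @ W                 # f(XW) = XW
binary_codes = np.sign(X @ W)        # f = sign

def is_orthogonal(M):
    """True if the columns of M are orthonormal."""
    return bool(np.allclose(M.T @ M, np.eye(M.shape[1])))

def has_uncorrelated_columns(M):
    """True if all inner products between distinct columns vanish."""
    G = M.T @ M
    return bool(np.allclose(G - np.diag(np.diag(G)), 0.0))

print(is_orthogonal(linear_codes))              # True
print(has_uncorrelated_columns(binary_codes))   # False
```

Here W has 1/3 on the diagonal and -2/3 elsewhere, so sign(XW) has columns such as (1, -1, -1) and (-1, 1, -1), whose inner product is -1 rather than 0.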
Inspired by this example, we propose the following regularization
$$\|f(X, \Theta)^\top f(X, \Theta) - I\|_F^2, \tag{11}$$
where $f(X, \Theta)$ is the nonlinear embedding function and $\Theta$ is the parameter set of $f$. In our proposed hashing model, i.e., Eq. (4), $f$ is the sigmoid function applied on $\beta_i X_i W_i + \mathbf{1} v_i$ and $\Theta = \{W_i, v_i\}$.
Proposition 4: Minimizing Eq. (11) cannot lead to an orthogonal $f(X_i, \Theta)$ when $d_i + 1 < c$.
Proof: According to the definitions, we have $\operatorname{rank}(\beta_i X_i W_i + \mathbf{1} v_i) \le d_i + 1 < c$. According to Theorem 4.2 in [51], $f^\top f$ cannot be equal to $I$ in any case. Hence, minimizing Eq. (11) cannot lead to an orthogonal $f$.
From Proposition 4, we can see that an orthogonal $f$ cannot be acquired when $d_i + 1 < c$. In this case, $f$ cannot even approximate an orthogonal matrix. Minimizing Eq. (11) will only minimize the correlation among the column vectors of $f$. Fig. 3 illustrates this situation.
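The rank bound that the condition d + 1 < c relies on can be illustrated numerically. The sketch below only checks that the affine pre-activation of the sigmoid has rank at most d + 1 for random inputs; it does not reproduce Theorem 4.2 of [51], and the sizes and variable names are arbitrary choices made here.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, c = 50, 3, 8                          # a case with d + 1 < c
X = rng.standard_normal((n, d))             # data matrix
W = rng.standard_normal((d, c))             # projection matrix
v = rng.standard_normal((1, c))             # bias row vector
beta = 1.0

# Pre-activation beta * X W + 1 v: X W has rank <= d and 1 v has rank <= 1.
Z = beta * X @ W + np.ones((n, 1)) @ v
rank_Z = int(np.linalg.matrix_rank(Z))
print(rank_Z, "<=", d + 1, "<", c)
```

The sum of a rank-d term and a rank-one term cannot exceed rank d + 1, which is why the bias contributes the extra 1 in the condition.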
Whether Eq. (11) is named ''minimum correlation regularization'' (MCR) or ''maximum uncorrelation regularization'' is inessential. Since Eq. (11) will be added into our hashing model, which is formulated as a minimization problem, we use the former to keep the terminology consistent.

C. DECORRELATED MULTIMODAL HASHING
In our implementation, we found that subtracting the identity matrix is somewhat redundant, so MCR can be simplified as:
$$\|f(X_i, \Theta)^\top f(X_i, \Theta)\|_F^2. \tag{14}$$
There is no need to worry that the diagonal elements of $f^\top f$ will become zero during the proposed minimization procedure, because as long as all variables are randomly initialized, it is nearly impossible for the gradient descent algorithm to reach a solution in which all variables are zero. Adding MCR to Eq. (4) leads to the following model:
$$\arg\min_{B, W_i, v_i} E = \sum_{i=1}^{g} \left( \alpha_i \|B - f(X_i, \Theta)\|_F^2 + \gamma_i \|f(X_i, \Theta)^\top f(X_i, \Theta)\|_F^2 \right), \tag{15}$$
where $\gamma_i$ is a positive real constant.
Setting Eq. (18) to 0, we can derive the solution for $B$. $B$ is rounded in each iteration to ensure $B \in \{0, 1\}^{n \times c}$. Taking the partial derivative with respect to $v_i$ results in Eq. (20). In Eq. (20), ''•'' means element-wise multiplication. The division and square are also element-wise. The partial derivative with respect to $W_i$ is given in Eq. (21). The prototype of the proposed training method is shown in Algorithm 1. In Subsection III-F, the parameter settings and details for efficient implementation are discussed.
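To make the optimization step concrete, the following sketch computes gradients for a single-view objective of the form alpha * ||B - F||_F^2 + gamma * ||F^T F||_F^2 with F = sigmoid(beta X W + 1 v), and checks them against finite differences. This objective is an assumption pieced together from the description above (reconstruction term plus simplified MCR); the gradient expressions are derived here and are not claimed to match the paper's Eq. (20) and Eq. (21) term by term. All function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def objective(W, v, X, B, alpha, beta, gamma):
    """alpha * ||B - F||_F^2 + gamma * ||F^T F||_F^2, F = sigmoid(beta X W + 1 v)."""
    F = sigmoid(beta * X @ W + v)            # v has shape (1, c), broadcast over rows
    return alpha * np.sum((B - F) ** 2) + gamma * np.sum((F.T @ F) ** 2)

def gradients(W, v, X, B, alpha, beta, gamma):
    """Analytic gradients of `objective` with respect to W and v (B held fixed)."""
    F = sigmoid(beta * X @ W + v)
    dE_dF = -2.0 * alpha * (B - F) + 4.0 * gamma * F @ (F.T @ F)
    G = dE_dF * F * (1.0 - F)                # chain rule through the sigmoid
    return beta * X.T @ G, G.sum(axis=0, keepdims=True)

def numeric_grad(fun, A, eps=1e-6):
    """Central finite differences, used only to sanity-check the analytic result."""
    g = np.zeros_like(A)
    for idx in np.ndindex(*A.shape):
        old = A[idx]
        A[idx] = old + eps
        up = fun()
        A[idx] = old - eps
        down = fun()
        A[idx] = old
        g[idx] = (up - down) / (2.0 * eps)
    return g

rng = np.random.default_rng(0)
n, d, c = 6, 3, 4
X = rng.standard_normal((n, d))
W = 0.1 * rng.standard_normal((d, c))
v = np.zeros((1, c))
alpha, beta, gamma = 1.0, 1.0, 0.001
B = np.round(sigmoid(beta * X @ W + v))      # rounded codes in {0, 1}

gW, gv = gradients(W, v, X, B, alpha, beta, gamma)
f = lambda: objective(W, v, X, B, alpha, beta, gamma)
ok_W = bool(np.allclose(gW, numeric_grad(f, W), atol=1e-5))
ok_v = bool(np.allclose(gv, numeric_grad(f, v), atol=1e-5))
print(ok_W, ok_v)
```

With several views, the same pattern applies per view, and for fixed W and v the reconstruction terms are minimized over real-valued B by a weighted average of the sigmoid outputs, which is then rounded to {0, 1}.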

Algorithm 1 the Prototype of the Proposed Training Method
Require: Update B using Eq. (18). 3:

E. LABEL RELAXATION
As discussed in Section I, relaxing a few labels to real numbers can benefit the learning of hashing codes on multi-labeled data sets. Let us denote the label matrix as $L$, which is the $g$-th modality. Without loss of generality, the last $r$ rows are extracted from $L$ to generate a new matrix $L_R$, and the remaining $n - r$ rows of $L$ form the matrix $L_T$. $L_R$ is used for relaxation. The relaxed $L_R$ is denoted as $\tilde{L}_R$.
Let us define $A^i_{pq} = \exp(\rho(x^i_p, x^i_q))$, where $x^i_p$ is the $p$-th row of $X_i$ and $\rho$ is the Euclidean distance. $A^i_{pq}$ reflects the data structure of the $i$-th modality. Let us define $H_{pq}$, which is used to build the matrix $H$, so that it integrates the data structure of all modalities by averaging the normalized $A^i_{pq}$. To make $\tilde{L}_R$ reflect the data structure, an objective function is built on $H$, where $H$ is partitioned into four blocks according to the dimensions of $\tilde{L}_R$ and $L_T$. On the other hand, the original labels $L_R$ also contain useful information. Hence, $\|\tilde{L}_R - L_R\|_F^2$ is added to the above objective function, which gives Eq. (23). The gradient descent algorithm is used to minimize Eq. (23), using its gradient with respect to $\tilde{L}_R$. After getting $\tilde{L}_R$, let $L = [L_T^\top, \tilde{L}_R^\top]^\top$.
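The construction of H can be sketched as below. Several details are assumptions of this sketch and not taken from the paper: affinities are computed with a decaying kernel exp(-distance) so that closer points get larger weights, each affinity matrix is normalized so that every row sums to 1, and the function name `averaged_affinity` is hypothetical. The objective of Eq. (23) itself is not reproduced.

```python
import numpy as np

def averaged_affinity(views):
    """Average row-normalised affinities exp(-distance) over several views.

    `views` is a list of (n, d_i) arrays describing the same n data points.
    The kernel sign and the row normalisation are assumptions of this sketch.
    """
    n = views[0].shape[0]
    H = np.zeros((n, n))
    for X in views:
        diff = X[:, None, :] - X[None, :, :]
        dist = np.sqrt((diff ** 2).sum(axis=-1))   # pairwise Euclidean distances
        A = np.exp(-dist)
        H += A / A.sum(axis=1, keepdims=True)      # each row sums to 1
    return H / len(views)

rng = np.random.default_rng(0)
n, r = 6, 2                                        # relax the labels of the last r rows
image_view = rng.standard_normal((n, 5))
text_view = rng.standard_normal((n, 3))
H = averaged_affinity([image_view, text_view])

# Partition H into four blocks following the split of L into L_T and L_R.
H_TT, H_TR = H[:-r, :-r], H[:-r, -r:]
H_RT, H_RR = H[-r:, :-r], H[-r:, -r:]
print(H_RT.shape, H_RR.shape)
```

The block H_RT links the rows to be relaxed with the rows whose labels stay fixed, which is the part of H that lets the fixed labels influence the relaxed ones.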

F. IMPLEMENTATION DETAILS
$\alpha_i$ is the weight for the $i$-th view. We set $\alpha_i$ as 10 for the label view and 1 for any other view. $\beta_i$ is used to re-scale the view matrix. We empirically found that the proposed method achieves the best performance when the values of the re-scaled view matrix are in the interval [0, 255]. For instance, in the NUS-WIDE data set [52], images are represented by 500-dimensional bag-of-visual-words SIFT feature vectors whose values are in [0, 255], texts are represented by 1000-dimensional index vectors whose values are 0 or 1 and labels are 10-dimensional index vectors. Hence, we set $\beta$ as 1, 255 and 255 for the image view matrix, text view matrix and label view matrix, respectively. To improve computational efficiency, $\beta_i$ is multiplied with $X_i$ before the iteration starts. All data matrices are zero-centered, except for the label matrix. We set the maximum number of iterations as $K$. $t$ linearly decreases from $k_s$ to $k_e$ over $K$ iterations, i.e., in the $k$-th iteration, $t = k_s - (k_s - k_e)k/K$.
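The linear schedule for t can be written directly from the formula above. In this small sketch the function name `step_size` is hypothetical, and the default values are the settings reported in the experimental section (k_s = 0.003, k_e = 0.0015, K = 400).

```python
def step_size(k, k_s=0.003, k_e=0.0015, K=400):
    """Value of t in the k-th iteration: linear decay from k_s (k = 0) to k_e (k = K)."""
    return k_s - (k_s - k_e) * k / K

schedule = [step_size(k) for k in (0, 100, 200, 400)]
print(schedule)
```

The schedule starts at k_s, reaches the midpoint 0.00225 after half of the iterations and ends at k_e.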
For large data sets, the first term in Eq. (15) is too large, which makes $\gamma_i$ and $t$ difficult to determine. We normalize the gradients so that we can fix the $\gamma_i$ and $t$ settings for all our experiments. The efficient version of the proposed method is given in Algorithm 2.

IV. EXPERIMENTAL RESULTS
In this section, we evaluate the retrieval performance and computational efficiency of the proposed method. First, we introduce the data sets, evaluation metrics and comparison methods. Then, two types of experiments, Hamming ranking and hash lookup, are conducted. Finally, we analyze the convergence and computational efficiency.

A. DATA SETS
Wiki (http://www.svcl.ucsd.edu/projects/crossmodal/) contains 2,866 image and text pairs. Each image is represented by a 4,096-dimensional feature extracted by the Caffe implementation of AlexNet [53], as in [41], and each text is represented by a 10-dimensional topic vector generated by a latent Dirichlet allocation (LDA) model. Each pair uniquely belongs to one of the 10 categories. Ground-truth neighbors for a test entry are defined as those in the same category.
MIRFlickr [54] contains 25,000 entries, each of which consists of 1 image, several textual tags and labels. Following [39], we only keep those textual tags appearing at least 20 times and remove entries which have no label. Hence, 20,015 entries are left. For each entry, the image is represented by a 512-dimensional feature extracted by Resnet-18 [55] and the text is represented by a 500-dimensional feature vector derived from PCA on index vectors of the textual tags. 5% of the entries are randomly selected for testing and the remaining entries are used as the training set. Ground-truth semantic neighbors for a test entry, i.e., a query, are defined as those sharing at least one label.
NUS-WIDE [52] comprises 269,648 images and over 5,000 textual tags collected from Flickr. Ground-truth of 81 concepts is provided for the entire data set. Following [33], [39], [40], we select the 10 most common concepts for labels and thus 186,577 entries are left. For each entry, the image is represented as a 512-dimensional feature extracted by Resnet-18 and the text is represented as an index vector of the most frequent 1,000 tags. 1% of the entries are randomly selected for testing and the remaining entries are used for training. Ground-truth semantic neighbors for a test entry are defined as those sharing at least one label.
For the image feature extraction neural networks, AlexNet and Resnet-18, the weights pretrained on ImageNet [56] are used. Fine-tuning is done on the Wiki, MIRFlickr and NUS-WIDE data sets. For fine-tuning, we resize the images to 244 × 244, use the Adam optimizer [57] with default settings and run the training process for 10 epochs with batch size 32.

B. EVALUATION METRICS
Hamming ranking and hash lookup are two widely used experiments for evaluating retrieval performance. In the Hamming ranking experiment, all data points in the training set are ranked according to their Hamming distances to a given query. The average precision (AP) is defined as
$$AP = \frac{1}{N} \sum_{r} P(r)\,\delta(r),$$
where $N$ is the number of relevant instances in the retrieved set, $P(r)$ is the precision of the top $r$ retrieved instances, and $\delta(r) = 1$ if the $r$-th retrieved instance is a true neighbor of the query, and otherwise $\delta(r) = 0$. Mean average precision (MAP) is the mean of the APs of all the queries. For the ideal case that all retrieved instances are true neighbors of the queries, MAP is equal to 1, while MAP is equal to 0 for the worst case that none of the retrieved instances is a true neighbor. Hence, the closer it is to 1, the better the performance.
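The Hamming ranking metric can be sketched in plain Python. The function names are hypothetical, and ground truth is simplified here to "same single label" (as for Wiki); for multi-labeled data the relevance test would be "shares at least one label".

```python
def hamming(a, b):
    """Number of positions in which two equal-length binary codes differ."""
    return sum(x != y for x, y in zip(a, b))

def average_precision(relevance):
    """AP of a ranked 0/1 relevance list: mean of P(r) over the relevant ranks."""
    hits, total = 0, 0.0
    for r, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / r            # P(r) at a relevant rank
    return total / hits if hits else 0.0

def mean_average_precision(query_codes, query_labels, db_codes, db_labels):
    """Rank the database by Hamming distance for every query and average the APs."""
    aps = []
    for q, q_label in zip(query_codes, query_labels):
        order = sorted(range(len(db_codes)), key=lambda j: hamming(q, db_codes[j]))
        aps.append(average_precision([int(db_labels[j] == q_label) for j in order]))
    return sum(aps) / len(aps)

db_codes = [(0, 0, 0, 0), (0, 0, 0, 1), (0, 0, 1, 1), (1, 1, 1, 1)]
db_labels = ["a", "b", "a", "b"]
map_value = mean_average_precision([(0, 0, 0, 0)], ["a"], db_codes, db_labels)
print(round(map_value, 4))   # 0.8333
```

For this query the ranked relevance list is [1, 0, 1, 0], so AP = (1/1 + 2/3) / 2.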
In the hash lookup experiment, the retrieved instances are those whose Hamming distances to a given query are not larger than a given radius, say 2 in our experiment. The performance is evaluated by the F1-score, which is defined as
$$F1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}.$$
The F1-scores are averaged over all queries. Similar to MAP, F1 also varies in [0, 1] and the closer it is to 1, the better the performance.
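A hash lookup evaluation for one query can be sketched as follows. The function name `lookup_f1` is hypothetical, relevance is again simplified to "same single label", and returning 0 when nothing relevant is retrieved is a convention assumed by this sketch.

```python
def hamming(a, b):
    """Number of positions in which two equal-length binary codes differ."""
    return sum(x != y for x, y in zip(a, b))

def lookup_f1(query, query_label, db_codes, db_labels, radius=2):
    """F1-score of the items retrieved within a Hamming ball around the query."""
    retrieved = [j for j, code in enumerate(db_codes) if hamming(query, code) <= radius]
    n_relevant = sum(1 for label in db_labels if label == query_label)
    true_pos = sum(1 for j in retrieved if db_labels[j] == query_label)
    if true_pos == 0:
        return 0.0                       # failed lookup, scored as 0 by assumption
    precision = true_pos / len(retrieved)
    recall = true_pos / n_relevant
    return 2 * precision * recall / (precision + recall)

db_codes = [(0, 0, 0, 0), (0, 0, 0, 1), (0, 0, 1, 1), (1, 1, 1, 1)]
db_labels = ["a", "b", "a", "b"]
f1 = lookup_f1((0, 0, 0, 0), "a", db_codes, db_labels, radius=2)
print(round(f1, 4))   # 0.8
```

Within radius 2 the query retrieves three items, two of which are relevant, so precision is 2/3, recall is 1 and F1 is 0.8.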

C. BASELINES
The proposed method is compared with eight multimodal hashing methods: CMSSH [36], CVH [37], MDBE [41], SCM [40], SePH [39], DMHOR [7], DJSRH [58] and SSAH [48]. DMHOR, DJSRH and SSAH are based on deep neural networks. CMSSH and SePH require too much computational cost. Following [39], [40], 10,000 entries are randomly selected for training the hashing functions and then we apply these functions to generate hashing codes. We use the codes provided by the authors except for MDBE and DMHOR. We re-implement MDBE and DMHOR, and set parameters following the authors' suggestions. For our method, we use the following parameter settings: $k_s = 0.003$, $k_e = 0.0015$ and $K = 400$. $\alpha_i$, $\beta_i$ and $\gamma_i$ are set as discussed in Subsection III-F.

D. RESULTS
MAP results are shown in Table 1. In Table 1, ''I2T'' means using images to query texts, while ''T2I'' means using texts to query images. From Table 1, it can be observed that our method outperforms all compared methods. As the bit length increases, the performance of our method increases faster than that of the baselines, which demonstrates the effectiveness of the proposed minimum correlation regularization. For example, in the ''Image-Text'' experiment on MIRFlickr, the performance improvement ranges from 3% to 5% as the bit length varies from 16 to 128, compared to the best baseline, i.e., MDBE. The MAP of DMHOR decreases as the code length increases, which demonstrates the ineffectiveness of the orthogonality regularization proposed in [7], as discussed in Subsection III-B.
F1-score results are shown in Fig. 4. Similar to the MAP results, our method surpasses all baselines by a large margin, especially on MIRFlickr. On MIRFlickr, the performance improvement ranges from 30% to 3,000%, compared to the best baseline. On NUS-WIDE, it is 5% to 200%. A reasonable explanation is that our method can precisely preserve the inter-class structure and therefore the lookup performance is significantly improved. Because the ranking performance depends on the preservation of the structure of the whole data set, regardless of inter-class or intra-class structure, the performance improvement is not as significant as that of the lookup experiment. The size of MIRFlickr is only about 1/10 of that of NUS-WIDE, so the simple non-linearity introduced in our method works much better on MIRFlickr. To achieve a comparable performance improvement on the NUS-WIDE data set, more sophisticated non-linear models are expected.
In both experiments, MDBE achieves the best performance among all the baselines. Actually, the main part of MDBE, Eq. (27), is equivalent to Eq. (3), which is an intuitive multimodal extension of ITQ. In Eq. (27), L is the label matrix, X is the image view matrix and Y is the text view matrix. W x , W y and U are variables. If we treat the label matrix as another view of the data and introduce an auxiliary variable B, it is easy to figure out that Eq. (27) and Eq. (3) are equivalent. By introducing non-linearity and the minimum correlation regularization, our method performs much better than MDBE. An illustrative experiment on the MIRFlickr data set is shown in Fig. 5.

E. PARAMETER SETTINGS
In Fig. 6, we show the MAP and F1-score of DMH on the MIRFlickr data set with various parameter settings. The default setting is α = 10, β = 255 and γ = 0.001. For label relaxation, we set r = 1 for generating the relaxed label matrix. In the left column of Fig. 6, α varies in {1, 5, 10, 15, 20, 25}. It can be seen that the highest MAP is usually achieved by α = 5 or α = 10. The highest F1-score is obtained when α = 1. However, when α = 1, DMH performs badly in MAP. Hence, α = 10 is selected for our experiments to achieve a balanced performance on these two types of experiments.
In the right column of Fig. 6, $\gamma$ varies in $\{10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 10^{0}, 10^{1}\}$. It can be seen that DMH performs best in MAP when $\gamma = 0.001$. When $\gamma > 0.1$, the F1-score rises sharply, while MAP drops. A possible explanation is that the regularization overly decorrelates a few columns of the code matrix and leaves the other columns highly mutually correlated. The resulting code matrix will be similar to a short-bit code matrix. That is why the MAP and F1-scores in all 6 experiments with different code lengths are rather close in this situation. Although the global optimum of MCR tends to generate column vectors similar to those illustrated in Fig. 3, the gradient descent algorithm cannot guarantee such solutions since MCR is not convex. Hence, $\gamma = 0.001$ is used in our experiments.

F. CONVERGENCE STUDY
The objective function of our method is minimized by Algorithm 2. In Algorithm 2, we empirically amend the derivatives of E for easy parameter tuning. The convergence property is experimentally studied in this subsection. Fig. 7 shows the convergence curves. It can be seen that the objective function value decreases quickly in the first 100 iterations and then decreases relatively slowly, except on the Wiki data set. The preset iteration step is too large for the Wiki data set, so the objective function value increases gradually after reaching its smallest value. The convergence curves of the experiments on Wiki and MIRFlickr are smooth, while those of the experiments on NUS-WIDE jitter, because the more sophisticated data structure yields more saddle points across which the algorithm jumps.

G. COMPUTATION EFFICIENCY
FIGURE 6. MAP and F1-score of DMH on MIRFlickr data set. The first two rows are MAP and the last two rows are F1-score.
Training and testing times for 32-bit codes are given in Table 2. The training time is the mean time of 10 runs. The testing time is the average time cost for one query. All experiments were performed on MATLAB R2015b installed on a GNU/Linux server with a 2.30 GHz 16-core CPU and 768 GB RAM. The three compared deep models, i.e., DMHOR [7], SSAH [48] and DJSRH [58], were trained on an NVIDIA GeForce 1080TI GPU. It is not meaningful to compare the running times of methods implemented on different platforms. Hence, the running times of these three deep models are not reported in Table 2. From Table 2, it can be seen that the training time of our method is moderate among all methods. Its testing time is close to that of MDBE, because the encoding procedures of these two methods for a new query are similar.
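The timing protocol described above (mean training time over 10 runs, mean testing time per query) can be sketched as follows. This is only an illustration in Python, whereas the reported experiments were run in MATLAB; `train_fn` and `encode_and_search_fn` are hypothetical placeholders for a method's training routine and its per-query encoding and search routine.

```python
import time

def mean_training_time(train_fn, n_runs=10):
    """Mean wall-clock time in seconds of `train_fn` over `n_runs` runs."""
    elapsed = []
    for _ in range(n_runs):
        start = time.perf_counter()
        train_fn()
        elapsed.append(time.perf_counter() - start)
    return sum(elapsed) / n_runs

def mean_query_time(encode_and_search_fn, queries):
    """Mean wall-clock time in seconds spent per query."""
    start = time.perf_counter()
    for query in queries:
        encode_and_search_fn(query)
    return (time.perf_counter() - start) / len(queries)

# Usage with dummy workloads standing in for a real method.
train_seconds = mean_training_time(lambda: sum(range(10000)))
query_seconds = mean_query_time(lambda q: q * q, list(range(100)))
print(train_seconds, query_seconds)
```

Wall-clock measurements of this kind are only comparable when all methods run on the same hardware and software platform, which is why timings of the GPU-trained deep models are excluded from the comparison.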

H. COMPARISON OF REGULARIZATIONS
To evaluate the effectiveness of the proposed regularization, we imposed four different types of regularization on our method, i.e., 1) no regularization, 2) the regularization proposed in [7], 3) Eq. (11) and 4) Eq. (14). The MAP results on the Wiki data set are shown in Fig. 8. From Fig. 8, we can see that the proposed regularization improves the performance in experiments with long codes (>64 bits). It is difficult to judge the effects of the regularization proposed in [7], since it did not improve the performance in all experiments. The performance of Eq. (11) and Eq. (14) is close. Hence, Eq. (14) is preferred due to its low computational cost.

I. ABLATION STUDY
To evaluate the effects of minimum correlation regularization (MCR) and label relaxation (LR) on our proposed method, we evaluated our method on the Wiki data set in four settings: (1) with both MCR and LR, (2) with only MCR, (3) with only LR and (4) with neither MCR nor LR. The four settings are denoted as ''MCR+LR'', ''MCR'', ''LR'' and ''NULL'' in Fig. 9. From Fig. 9, we can conclude that MCR is important for long-bit experiments. In short-bit experiments, the MAP improvements from LR and MCR are subtle. However, for code lengths longer than 64 bits, the benefits from MCR and LR become significant. LR consistently improves the MAP of our method with or without MCR.

V. CONCLUSION
This paper proposed an effective multimodal hashing method which is modeled as a quantization error problem, together with a minimum correlation regularization devised to improve the retrieval performance on long codes. Experiments on the MIRFlickr and NUS-WIDE data sets show that the proposed method clearly outperforms the compared methods. Future work includes testing more nonlinear embedding functions and refining the optimization procedure for higher computational efficiency.
DAYONG TIAN received the B.S. and M.E. degrees from Xidian University, Xi'an, China, in 2010 and 2014, respectively, and the Ph.D. degree from the University of Technology, Sydney, NSW, Australia, in 2017. He is currently an Assistant Professor with the School of Electronics and Information, Northwestern Polytechnical University, Xi'an, China. His research interests include computer vision and machine learning, in particular image restoration, image retrieval, and face recognition.
YIWEN WEI (Member, IEEE) received the Ph.D. degree in radio science from the School of Physics and Optoelectronic Engineering, Xidian University, Xi'an, China, in 2016. From 2016 to 2018, she worked as a Research Scientist with the Temasek Laboratories, National University of Singapore, Singapore. She is currently an Assistant Professor with the School of Physics and Optoelectronic Engineering Science, Xidian University, China. Her research interests include electromagnetic wave propagation and scattering in complex systems, computational electromagnetics, remote sensing, and parameter retrieval, in particular applying machine learning methods to complex electromagnetic problems.
DEYUN ZHOU received the B.E., M.E., and Ph.D. degrees from Northwestern Polytechnical University (NWPU), in 1985, 1988, and 1991, respectively. He has been a Professor at NWPU, since 1997, where he has also been the Dean of the School of Electronics and Information, since 2012. His research interests include self-adaptive control, intelligent control theory, complex systems modeling, multiobjective optimization, information fusion, and aerial electronic systems.