MESH: A Flexible Manifold-Embedded Semantic Hashing for Cross-Modal Retrieval

Hashing-based methods for cross-modal retrieval have been widely explored in recent years. However, most of them focus on preserving neighborhood relationships and label consistency while ignoring the proximity among neighbors and the proximity among classes, which degrades the discrimination of the hash codes. Moreover, most of them learn hash codes and hashing functions simultaneously, which limits the flexibility of the algorithms. To address these issues, in this article we propose a two-step cross-modal retrieval method named Manifold-Embedded Semantic Hashing (MESH). It exploits Locally Linear Embedding to model neighborhood proximity and uses class semantic embeddings to capture the proximity among classes. By doing so, MESH not only extracts the manifold structure of each modality but also embeds class semantic information into the hash codes, further improving their discrimination. Moreover, the two-step scheme makes MESH flexible with respect to the choice of hashing functions. Extensive experimental results on three datasets show that MESH is superior to 10 state-of-the-art cross-modal hashing methods. Moreover, MESH also demonstrates superiority on deep features compared with a deep cross-modal hashing method.


I. INTRODUCTION
Due to its significant role in many real-world applications, cross-modal retrieval has been widely studied in recent years [1]-[4]. Among the existing approaches, hashing-based methods (cross-modal hashing) are the most representative and have been investigated intensively. Cross-modal hashing maps data from different modalities into a common Hamming space, where cross-modal retrieval can be performed with fast XOR operations [5], [6].
A large number of cross-modal hashing methods have been proposed recently, spanning shallow learning [7]-[10] and deep learning methods [11]-[13]. All of these methods have made significant efforts toward improving the performance of cross-modal retrieval. Although deep learning based methods usually obtain superior performance, they are sample-intensive and require large amounts of labelled training data. Therefore, many state-of-the-art methods are still shallow learning models, which are the main concern of this article. (The associate editor coordinating the review of this manuscript and approving it for publication was Maurizio Tucci.)
Most of the previous shallow learning methods can be categorized into two groups according to the utilization of class labels, i.e., unsupervised and supervised methods. The unsupervised cross-modal hashing methods discard the label information and construct hashing functions by preserving the intrinsic structure of data [14]- [16]. For example, Collective Matrix Factorization Hashing (CMFH) [17] is proposed based on matrix factorization to learn the unified hash codes for different modalities. Fusion Similarity Hashing [18] explicitly embeds the graph-based fusion similarity across modalities into a common Hamming space, in which an undirected asymmetric graph is constructed to model the fusion similarity, and binary codes are learned with a graph hashing scheme.
The supervised ones exploit the explicit semantic labels of the training data to strengthen the correlation across modalities and thereby boost cross-modal retrieval. For example, several supervised extensions based on matrix factorization have been proposed, such as supervised matrix factorization hashing [9], [19]. In particular, Intra- and Inter-modality Similarity Preserving Hashing [9] combines matrix factorization with similarity preservation in the low-dimensional Hamming space. It is worth noting that the above methods generate binary codes by relaxing the discrete constraints, which leads to a large quantization error.
(VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/)
To address this issue, many studies learn binary codes with discrete optimization [7], [20], [21]. For instance, Discrete Cross-modal Hashing (DCH) [5] learns binary codes directly without relaxation and uses label information to enhance their discrimination through linear classifiers. Cross-modal Discrete Hashing (CMDH) [10] learns a shared set of binary codes for samples from different modalities based on their semantic affinities.

Although tremendous efforts have been made, some issues still need further investigation. Most existing methods focus on preserving the neighborhood relationship while ignoring the proximity among neighbors. Typically, the k-nearest neighbors are selected to preserve the intra-modality (neighborhood) relationship, but all k neighbors are treated equally; it cannot be distinguished which neighbor is closer and which is farther away. In such a case, sufficient discrimination cannot be guaranteed for the hash codes in cross-modal retrieval. Several works [22], [23] have noticed the importance of proximity among neighbors and employed manifold structure information to preserve it. Although they address relationship preservation to some degree, they neglect the proximity among classes: they preserve the inter-modality relationship via a one-hot label matrix, which ignores the distribution of class semantic embeddings. In the Hamming space, samples from different classes should be mapped far apart and samples from the same class should cluster together, but a one-hot scheme is not expressive enough to describe the semantic distribution of the classes; similar classes should be closer than dissimilar ones.
Therefore, neglecting the proximity among classes limits the semantic information embedded in the hash codes.
To address these two issues, in this article we propose a two-step cross-modal retrieval method named Manifold-Embedded Semantic Hashing (MESH). The framework of MESH is shown in Fig. 1, in which the solid lines indicate the training procedure and the dashed lines the testing procedure. In the first step, the unified hash codes of the training data are learned. Specifically, MESH exploits Locally Linear Embedding to model neighborhood proximity and uses class semantic embeddings to compute the inter-modality similarity, thereby formulating the proximity among classes. In addition, we inversely regress the class semantic embeddings onto the corresponding hash codes to further improve discrimination. In the second step, hashing functions based on linear regression are learned from the unified hash codes. Therefore, MESH not only extracts the manifold structure of each modality but also embeds class semantic information into the hash codes to further improve their discrimination. Extensive experimental results on three datasets show that MESH outperforms 10 state-of-the-art cross-modal hashing methods, especially on the Image-to-Text task.
The main contributions of this work are summarized as follows:
• A two-step cross-modal hashing method is proposed that leverages manifold structure to capture neighborhood proximity, which improves the discrimination of the hash codes. Moreover, the two-step scheme makes MESH flexible with respect to the choice of hashing functions.
• Class semantic embeddings are introduced to capture the proximity among classes and further augment the semantic information of the hash codes, which is particularly beneficial to the Image-to-Text task in cross-modal retrieval.
• Different from most existing methods, we regress the class semantic embeddings onto the hash codes, which enforces the generated hash codes of the image and text modalities to agree with those of the class semantic embeddings.

The rest of this article is organized as follows. Section II reviews the related work. Section III introduces the proposed MESH approach. Section IV presents the experimental settings and analyzes the results. Finally, we conclude our work in Section V.

II. RELATED WORK
This section reviews the most recent cross-modal hashing approaches, which can be categorized into two groups according to the utilization of class labels, i.e., unsupervised and supervised methods.
The unsupervised cross-modal hashing methods discard the label information and construct hashing functions by preserving the intrinsic structure of the data [14]-[16]. For example, Xie et al. [24] proposed Multi-graph Cross-modal Hashing (MGCMH), an unsupervised method that integrates multi-graph learning and hash function learning into a joint framework. In [25], Fang et al. proposed Multi-modal graph regularized Smooth matrix Factorization Hashing (MSFH) to reinforce the sparsity of the extracted common semantic features and of the learned dictionaries. Many deep neural network based approaches have also been presented. Wu et al. [26] proposed an Unsupervised Deep Cross-Modal Hashing that integrates deep learning and matrix factorization with binary latent factor models. Extending Generative Adversarial Nets (GAN) to cross-modal retrieval, Zhang et al. [27] developed an Unsupervised Generative Adversarial Cross-Modal Hashing that makes full use of GAN's capacity for unsupervised representation learning to explore the underlying manifold structure of cross-modal data. Although these unsupervised methods achieve impressive performance and improve flexibility, the lack of semantic supervision limits the discrimination of the hashing functions and codes.
The supervised cross-modal hashing methods utilize the label information of the data to improve the performance of cross-modal retrieval. For instance, many matrix factorization based methods have been proposed to learn latent semantic representations [9], [17], [19]. However, a common limitation is that they relax the discrete constraints during optimization, resulting in suboptimal binary codes. Therefore, a number of cross-modal hashing methods based on discrete optimization have been studied [7], [8], [21]. In [22], Locally Linear Embedding is used to extract manifold information as a similarity matrix for learning unified hash codes, where the binary codes are learned directly without relaxation. Lu et al. [8] proposed Discrete Latent Semantic Hashing (DLSH), which first learns latent semantic representations by matrix factorization and then projects them into a shared Hamming space. Mandal et al. [28] generalized cross-modal hashing to scenarios with unpaired and multi-label data, which enhances the usefulness of cross-modal retrieval in real-world applications. In [3], Fang and Ren proposed a supervised discrete cross-modal hashing framework that learns the unified discrete binary codes in closed form rather than bit by bit, which is very time-consuming.
In recent years, many approaches have been proposed to learn nonlinear hashing functions, such as deep learning based methods [2], [29]-[31]. Jin et al. [32] proposed Deep Ordinal Hashing (DOH), which learns hashing functions by leveraging the ranking structure of the feature space from both local and global views. In [33], a deep architecture that learns instance-aware representations for multi-label image data is proposed, where an image is represented by multiple pieces of hash codes, which differs from semantic hashing. Deep Cross-Modal Hashing (DCMH) [11] uses a deep neural network as the hashing function, one per modality. Deep Discrete Cross-Modal Hashing (DDCMH) [12] is similar to DCMH but learns the hashing functions with discrete optimization. In [34], a Dense Auto-encoder Hashing (DAH) is proposed that can explicitly impute a missing modality and produce robust binary codes. Owing to the effectiveness of GANs, several GAN-based cross-modal hashing methods have been proposed [13], [35], [36], which recast traditional cross-modal retrieval as a generative problem. In addition, cross-modal semantic embeddings learned by deep network structures were also explored in [37] and [38], where semantic information is transferred from texts to images for image classification. Different from [37] and [38], our method learns a common semantic representation of text and image to conduct the cross-modal retrieval task. Moreover, we employ class semantic embeddings to improve the discriminative property of the learned common representation, as detailed in Section III.
Different from the existing methods, we consider the proximity among neighbors in cross-modal hashing and investigate the preservation of semantic information via the proximity among classes, so as to enhance the semantic information embedded in the hash codes.

III. APPROACH

A. PROBLEM DEFINITION
In cross-modal retrieval, we take image and text as an example. Given image data X^(1) = {x_i^(1)}_{i=1}^n ∈ R^{d_1×n} and text data X^(2) = {x_i^(2)}_{i=1}^n ∈ R^{d_2×n}, where d_1 is the dimensionality of the image features, d_2 is the dimensionality of the text features (usually d_1 ≠ d_2), and n is the number of paired samples, the main goal is to learn unified binary codes B ∈ {−1, +1}^{k×n} as well as hashing functions for the image and text modalities respectively, where k is the code length. The semantic labels of the given data are represented as Y ∈ {0, 1}^{c×n}, where c is the number of classes. Without loss of generality, we assume that the data are zero-centred.

B. MANIFOLD STRUCTURE LEARNING
In order to capture the manifold structure of each modality, we first employ Locally Linear Embedding (LLE) [39] to compute a weighted representation of each sample. Specifically, each sample in the data set can be represented as a linear combination of its K nearest neighbors, which can be formulated as follows:

min_W Σ_i || x_i − Σ_{j∈N_K(i)} W_ij x_j ||²,  s.t. Σ_j W_ij = 1, W_ij = 0 if x_j ∉ N_K(i),   (1)

where W is the weight matrix, which is used as the similarity matrix of X, and N_K(i) denotes the K nearest neighbors of x_i. Equation (1) can be solved using the method proposed in [39]. Then, we construct the similarity matrix S by setting the similarities to the magnitudes of the weights obtained from LLE, i.e., S = |W|. Here, we denote the similarity matrices preserving the manifold structure of the X^(1) and X^(2) modalities as S^(1) and S^(2), respectively, and update them as follows to ensure symmetry:

S^(t) ← (S^(t) + (S^(t))^T) / 2,  t = 1, 2.   (2)
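As a concrete illustration, the reconstruction weights of Eq. (1) and the symmetrized similarity matrix can be sketched in a few lines of NumPy. This is our own minimal sketch (the function name and the regularization constant are assumptions, not part of the paper):

```python
import numpy as np

def lle_similarity(X, K=5, reg=1e-3):
    """Similarity matrix in the spirit of Eq. (1)-(2).

    X: (d, n) data matrix, one sample per column.
    Returns S = |W| symmetrized, where each row of W holds a sample's
    LLE reconstruction weights over its K nearest neighbors.
    """
    d, n = X.shape
    W = np.zeros((n, n))
    # pairwise squared Euclidean distances between columns
    sq = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
    for i in range(n):
        # K nearest neighbors of sample i (excluding itself)
        nbrs = np.argsort(sq[i])[1:K + 1]
        Z = X[:, nbrs] - X[:, [i]]           # center neighbors on x_i
        G = Z.T @ Z                           # local Gram matrix (K, K)
        G += reg * np.trace(G) * np.eye(K)    # regularize for stability
        w = np.linalg.solve(G, np.ones(K))
        W[i, nbrs] = w / w.sum()              # weights sum to one
    S = np.abs(W)
    return (S + S.T) / 2                      # enforce symmetry, Eq. (2)
```

The regularized local Gram system is the standard way of solving the constrained least-squares problem in Eq. (1).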

C. CLASS PROXIMITY LEARNING
Our approach aims to augment the semantic information of the hash codes to improve their discrimination. Thus, we propose to preserve the class proximity by employing class semantic embeddings. Different from most previous methods, which preserve the inter-modality similarity using the one-hot label matrix Y, we employ class semantic embeddings A ∈ R^{a×n} to compute the similarity matrix S^(12) between samples from different modalities. By doing so, samples with the same label obtain similar hash codes in the Hamming space, and similar classes, such as dog and cat, are kept close. The class semantic embeddings can be class attributes or distributed word embeddings trained on a large-scale linguistic text corpus, which can be easily extracted from pre-trained models such as GloVe [40]. In this article, we obtain the embeddings by looking up the class names in GloVe. For a single-word class such as cat, we take its word vector as the class semantic embedding. For a class name of two or more words, such as dining table, we obtain its class semantic embedding by averaging the word vectors of its components. Given the class label matrix Y ∈ {0, 1}^{c×n} and the a-dimensional distributed word embeddings of the class names C ∈ R^{a×c}, the embedding of the j-th sample, i.e., the j-th column of A, is computed as

A_j = C Y_j / ||Y_j||_1,   (3)

where Y_j is the j-th column of Y (for single-label data this reduces to A_j = C Y_j). Then, we use the class semantic embeddings A to compute the similarity matrix S^(12) between samples from different modalities.
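Under our reading of Eq. (3) (label-wise averaging of the GloVe class vectors) and with cosine similarity as one plausible instantiation of S^(12) — the paper does not spell that formula out, so treat it as an assumption — the construction looks like:

```python
import numpy as np

def class_embeddings(C, Y):
    """A_j = mean of the word vectors of sample j's active labels.
    C: (a, c) class-name word embeddings; Y: (c, n) binary label matrix."""
    counts = np.maximum(Y.sum(axis=0, keepdims=True), 1)  # avoid divide-by-zero
    return (C @ Y) / counts                                # A: (a, n)

def cross_modal_similarity(A):
    """Cosine similarity between samples' class embeddings (one
    plausible choice for S^(12))."""
    An = A / np.linalg.norm(A, axis=0, keepdims=True)
    return An.T @ An                                       # S12: (n, n)
```

With this choice, two samples sharing the same class get similarity 1, while samples of unrelated classes score near 0.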

D. OVERALL OBJECTIVE FUNCTION
First of all, the latent semantic representations V ∈ R^{k×n} underlying the unified hash codes are learned according to the manifold structure information and the class proximity. We formulate the problem by Laplacian Eigenmaps (LE) [41] as follows:

min_V  tr(V L V^T),   (5)

where L is the hybrid Laplacian matrix of W = S^(1) + S^(2) + S^(12), i.e., L = D − W, with D ∈ R^{n×n} a diagonal matrix whose diagonal entries are the column sums of W, i.e., D(i, i) = Σ_j W_ij. V denotes the latent semantic representations to be learned.
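Forming the hybrid Laplacian is a one-liner; the following NumPy helper (names ours) makes the construction explicit:

```python
import numpy as np

def hybrid_laplacian(S1, S2, S12):
    """L = D - W for the hybrid affinity W = S1 + S2 + S12,
    with D diagonal holding the column sums of W."""
    W = S1 + S2 + S12
    D = np.diag(W.sum(axis=0))
    return D - W
```

Because the summed affinity matrices are symmetric, every row (and column) of L sums to zero, the defining property of a graph Laplacian.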
Then, we define an orthogonal rotation matrix R ∈ R^{k×k} to generate the unified hash codes from the latent semantic representations, which can be stated as follows:

min_{B,R}  ||B − RV||_F²,  s.t.  B ∈ {−1, +1}^{k×n},  R^T R = I.   (6)

Through Eq. (6) we can learn the unified binary codes of the training data directly, avoiding a large quantization error.
Different from [7], we further regress the class semantic embeddings A to enhance the semantics of the hash codes. In contrast to most existing methods, which regress the hash codes to the class labels, we inversely regress the class semantic embeddings into the Hamming space to re-align the hash codes and improve discrimination, which has been shown to be more stable than the former [42]. Thus, we enforce the generated hash codes of the image and text modalities to agree with those predicted from the class semantic embeddings:

min_{B,M}  ||B − MA||_F²,   (7)

where M ∈ R^{k×a} is a projection matrix. Combining Eqs. (5), (6), and (7), we arrive at the following overall objective function:

min_{V,M,R,B}  tr(V L V^T) + α ||B − RV||_F² + β ||B − MA||_F² + λ ||M||_F²,
s.t.  B ∈ {−1, +1}^{k×n},  R^T R = I,   (8)

where α, β, and λ are trade-off parameters.
In addition, we apply kernel features in place of the original features to capture the nonlinear relationships of the different modalities. Following [10], we adopt an RBF kernel mapping φ(x) to obtain the kernel features of the image and text modalities respectively: φ_i(x) = exp(−D(x, a_i)/σ), where D(·,·) is the distance function, σ is the kernel bandwidth parameter, and {a_i}_{i=1}^q are q anchors randomly selected from the training set.
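A minimal sketch of the kernel mapping, assuming D(·,·) is the squared Euclidean distance (the paper only says "distance function", so this is an assumption):

```python
import numpy as np

def rbf_features(X, anchors, sigma):
    """phi_i(x) = exp(-||x - a_i||^2 / sigma).

    X: (d, n) samples; anchors: (d, q) randomly selected anchor points.
    Returns a (q, n) kernel-feature matrix.
    """
    # squared distances from every anchor to every sample
    sq = ((anchors[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)  # (q, n)
    return np.exp(-sq / sigma)
```

Each column of the result replaces the original feature vector of one sample in the subsequent learning steps.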

E. OPTIMIZATION
It is intractable to directly minimize the objective in Eq. (8) because it is non-convex in the four matrix variables V, M, R, and B jointly. Fortunately, it is convex with respect to any one of the four variables when the others are fixed. Therefore, we adopt alternating optimization in an iterative manner to address the problem. The detailed optimization steps are as follows.

Step 1. Update M. With all variables but M fixed, the objective in Eq. (8) simplifies to

min_M  β ||B − MA||_F² + λ ||M||_F².   (9)

Setting its derivative w.r.t. M to zero, we have

M = β B A^T (β A A^T + λ I)^{−1}.   (10)

Step 2. Update V. With the other three variables fixed, the overall objective reduces to

min_V  tr(V L V^T) + α ||B − RV||_F².   (11)

Setting its derivative w.r.t. V to zero, we arrive at a Sylvester equation [43] as shown in Eq. (12):

Ã V + V B̃ = α R^T B,   (12)
where Ã = α R^T R and B̃ = L. V can be easily obtained by solving Eq. (12) via the sylvester function in Matlab.
Step 3. Update R. With the other three variables fixed, the overall objective becomes

min_R  ||B − RV||_F²,  s.t.  R^T R = I.   (13)

This is an orthogonal Procrustes problem, solvable by Singular Value Decomposition (SVD). Let Q = B V^T; after the SVD operation we have Q = G Σ H^T, and R is updated as

R = G H^T.   (14)

Step 4. Update B. With the other three variables fixed, the overall objective simplifies to

min_B  α ||B − RV||_F² + β ||B − MA||_F²,  s.t.  B ∈ {−1, +1}^{k×n}.   (15)

Since tr(B^T B) = nk is constant, it can be transformed to

max_B  tr(B^T (α RV + β MA)),  s.t.  B ∈ {−1, +1}^{k×n}.   (16)

Algorithm 1 Manifold-Embedded Semantic Hashing
Input: Training data X^(1) and X^(2), label matrix Y, distributed word embeddings of class names C, parameters α, β, λ, and the iteration number T.
Output: Unified hash codes B.

Construct the class semantic embeddings A;
Compute the kernel features φ(X^(1)) and φ(X^(2));
Initialize V, M, B, and R randomly;
For T iterations, update M, V, R, and B in turn as described in Section III-E.

The discrete subproblem above can be easily solved in closed form as

B = sgn(α RV + β MA),   (17)

where sgn(·) is the element-wise sign function.
The description of the proposed method is summarized in Algorithm 1.
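For reference, one full pass of the four alternating updates can be sketched with NumPy and SciPy. The shapes, the use of SciPy's Sylvester solver (the paper uses Matlab's sylvester), and the equation forms are our reconstruction, not verbatim from the paper:

```python
import numpy as np
from scipy.linalg import solve_sylvester

def mesh_iteration(B, V, R, L, A, alpha, beta, lam):
    """One pass of the four alternating updates (Steps 1-4).

    Shapes: B, V are (k, n); R is (k, k) orthogonal; L is the (n, n)
    hybrid Laplacian; A is the (a, n) class semantic embeddings.
    """
    k, n = B.shape
    # Step 1: ridge-regression update of M
    M = beta * B @ A.T @ np.linalg.inv(beta * A @ A.T + lam * np.eye(A.shape[0]))
    # Step 2: Sylvester equation (alpha R^T R) V + V L = alpha R^T B
    V = solve_sylvester(alpha * R.T @ R, L, alpha * R.T @ B)
    # Step 3: orthogonal Procrustes via SVD of Q = B V^T
    G, _, Ht = np.linalg.svd(B @ V.T)
    R = G @ Ht
    # Step 4: closed-form discrete update of the binary codes
    B = np.sign(alpha * R @ V + beta * M @ A)
    B[B == 0] = 1                              # avoid zero bits
    return B, V, R, M
```

Running this pass for T iterations, with V, M, R, B initialized randomly, corresponds to the loop in Algorithm 1.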

F. HASHING FUNCTIONS LEARNING
As MESH is a two-step cross-modal hashing method, the unified hash codes of the training data are learned in the first step. In the second step, we learn hashing functions based on the unified hash codes. Since hashing function learning can be formulated as a regression problem, various types of classifiers can serve as hashing functions, which shows the flexibility of MESH.
Here, we adopt linear regression as the hashing function. Given the unified hash codes B, the hashing function for the t-th modality (t = 1 for image, t = 2 for text) is learned by solving

min_{P_t}  ||B − P_t φ(X^(t))||_F² + θ ||P_t||_F²,   (18)

where θ is a balance parameter and P_t is the hashing function to be learned for the t-th modality. The closed-form solution is

P_t = B φ(X^(t))^T (φ(X^(t)) φ(X^(t))^T + θ I)^{−1}.   (19)

Therefore, given a new query sample x_q^(t) from the t-th modality, its binary codes can be computed by

b_q^(t) = sgn(P_t φ(x_q^(t))).   (20)
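The ridge-regression hashing function and the query encoding step can be sketched directly from the closed forms above (function names are ours):

```python
import numpy as np

def learn_hash_function(B, Phi, theta):
    """P_t = B Phi^T (Phi Phi^T + theta I)^{-1}.
    B: (k, n) unified codes; Phi: (q, n) kernel features of modality t."""
    q = Phi.shape[0]
    return B @ Phi.T @ np.linalg.inv(Phi @ Phi.T + theta * np.eye(q))

def encode(P, phi_x):
    """Binary codes for a query's kernel-feature vector phi_x."""
    b = np.sign(P @ phi_x)
    b[b == 0] = 1
    return b
```

Because the two modalities share B, a text query encoded with P_2 can be matched against image codes produced with P_1 purely in Hamming space.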

IV. EXPERIMENTS AND RESULTS ANALYSIS
In this section, we conduct experiments on three datasets to evaluate the performance of our method for cross-modal retrieval. We design two representative cross-modal retrieval tasks, i.e., Text-to-Image, which uses text as a query to search relevant images, and Image-to-Text, which uses an image as a query to search relevant texts. In the experiments, images and texts are considered relevant if they share at least one common semantic label.

A. DATASETS
Wiki [44] dataset contains 2866 pairs of images and texts. All the image-text pairs can be divided into 10 semantic classes. In Wiki, image features are 128-dimensional bag-of-visual-words SIFT feature vectors, and text features are 10-dimensional topic vectors generated by the latent Dirichlet allocation (LDA) model. In our experiments, the initial split, i.e., 2173 pairs for training and 693 for testing, is employed.
LabelMe [45] dataset consists of 2688 outdoor scenes grouped into 8 classes. Each image is annotated with tags drawn from a 781-word dictionary. Following [12], we discard the words that occur fewer than 3 times; the text modality is therefore represented by 366-dimensional word-frequency vectors, while the images are represented by 512-dimensional GIST features. As in [12], samples without tags are deleted, which yields a new dataset of 2686 image-text pairs. In our experiments, 75% of the data are randomly selected for training, and the rest are used as the query set for testing.
Pascal VOC [46] dataset contains 9963 image-tag pairs, which can be divided into 20 categories. Following [12], we select the pairs with only one label, which results in 2808 training and 2841 testing image-tag pairs. The image modality is represented by 512-dimensional GIST features, and the text modality by 399-dimensional word-frequency features.
B. BASELINES AND SETTINGS

For all baselines, we carefully tune their parameters and report the average results over 10 repetitions. As for the proposed MESH, we empirically set α = 1e-4, β = 1, λ = 1e-3, θ = 1, and restrict the number of iterations to 10.

C. EVALUATION METRICS
In order to evaluate the performance of the proposed MESH and its competitors for cross-modal retrieval, we employ the widely used Mean Average Precision (MAP) as the measurement metric. In addition, we also plot the precision-recall and top-N precision curves for 16- and 64-bit hash codes to further compare performance. It is worth noting that, for all three metrics, larger values indicate better performance.
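MAP under this protocol can be computed with a straightforward reference implementation (variable names are ours; relevance follows the shared-label criterion stated above):

```python
import numpy as np

def mean_average_precision(scores, relevant):
    """MAP as commonly used in cross-modal retrieval.

    scores: (n_query, n_db) similarity matrix; relevant: same-shape 0/1
    matrix (1 if query and database item share at least one label).
    """
    aps = []
    for s, r in zip(scores, relevant):
        order = np.argsort(-s)                # rank database by descending score
        r = r[order]
        if r.sum() == 0:
            continue                          # skip queries with no relevant item
        hits = np.cumsum(r)
        ranks = np.arange(1, len(r) + 1)
        aps.append(np.sum((hits / ranks) * r) / r.sum())
    return float(np.mean(aps))
```

For a single query with ranked relevance [1, 0, 1], the precisions at the two hits are 1/1 and 2/3, giving an average precision of 5/6.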

D. RESULTS AND ANALYSIS 1) RESULTS ON WIKI
The MAP results of MESH and all baselines on Wiki with varied code lengths are reported in Table 1, and the precision-recall and top-N precision curves of Text-to-Image and Image-to-Text with 16 and 64 bits are shown in Fig. 2. We have the following observations from Table 1 and Fig. 2. 1) MESH slightly outperforms the baselines on both cross-modal retrieval tasks. 2) In terms of top-N precision, the results of MESH are consistently better than those of the other baselines. In particular, the precision at top-200 is much better than the baselines', which matters for search tasks where users mainly focus on the results on the first few pages. 3) Most supervised methods, such as MESH, DCH, FDCH, and SMFH, achieve performance superior to unsupervised methods such as CMFH and FSH, which illustrates the significance of employing supervised label information. 4) The results of DCH and FDCH are similar to each other but much better than those of the other baselines. This suggests that discrete optimization can boost cross-modal retrieval; however, both are still inferior to our MESH. 5) The Text-to-Image MAPs of most methods are higher than their Image-to-Text MAPs by a large margin. The reason may be that text features are closer than image visual features to the semantic information and can better describe the objects.

2) RESULTS ON LabelMe
The MAP results of MESH and all baselines on LabelMe are shown in Table 2, and we also plot the precision-recall and top-N precision curves of the various methods for both tasks with 16 and 64 bits in Fig. 3. From these results, we observe that: 1) Our proposed MESH consistently outperforms the other competitors. 2) On the Image-to-Text task, the MAPs of MESH are substantially better than those of the baselines. Specifically, compared with the best baseline, MESH improves performance by at least 19% and 10% at 16 and 32 bits, respectively. This may be because of the consideration of class proximity in MESH, which enhances the semantic information of the visual features.
3) Compared to the baselines, the curves of MESH in Fig. 3 are consistently higher, especially for the Image-to-Text task, consistent with the MAP results. 4) Most methods perform better on the Text-to-Image task than on the Image-to-Text task; however, the margins are smaller than on Wiki. One possible reason is that the semantic gap between text and image features is smaller in LabelMe than in Wiki.

3) RESULTS ON PASCAL
The MAP results of MESH and all baselines on Pascal VOC are reported in Table 3, and the precision-recall and top-N precision curves of all methods for both tasks with 16 and 64 bits are plotted in Fig. 4. From these results, we obtain observations similar to those on Wiki and LabelMe. 1) MESH is superior to the other baselines at varied code lengths on both tasks. 2) Similar to Wiki, the Text-to-Image results of most methods are higher than their Image-to-Text results by a large margin. 3) On the Image-to-Text task, the curves of MESH are significantly higher than those of the other baselines, which further validates the effectiveness of our method. 4) MESH, DCH, and FDCH form the first group, performing much better than the other baselines, which further demonstrates the effectiveness of learning hash codes discretely in reducing quantization error.
In summary, MESH yields state-of-the-art performance on the three datasets, which demonstrates the effectiveness of preserving neighborhood proximity and class proximity. In addition, the class-proximity preservation boosts the semantic information of the image visual features, yielding a significant performance improvement on the Image-to-Text task, especially on the LabelMe and Pascal datasets. Moreover, the two-step hashing scheme makes MESH flexible with respect to the choice of hashing functions.

4) COMPARISON WITH DEEP FEATURES
We further compare our method with a state-of-the-art deep cross-modal hashing method, i.e., DCMH [11], on MIRFlickr-25K [49]. In this experiment, the deep image features used in our proposed method are provided by [11]. The MAP results of cross-modal retrieval are presented in Table 4, in which the results of DCMH are cited from SCRATCH [20]. As can be seen, the proposed MESH outperforms DCMH consistently and by a large margin. This demonstrates that MESH can achieve much better cross-modal retrieval performance than the deep cross-modal hashing method DCMH on deep features, which further validates the effectiveness of our method. In addition, the MAPs of our method are superior to those of SCRATCH in most cases, especially on the Image-to-Text task.
Since our work mainly focuses on unified hash code and hashing function learning for cross-modal retrieval, we use SIFT and GIST features to evaluate performance in the experiments. Considering the powerful representational ability of deep features, especially those from recent deep networks [50], [51], using deep features in practice is one direction of our future work.

E. ABLATION STUDY
We also verify the impact of the different parts of the objective function on the performance of the proposed MESH. We design two variants for comparison: MESH-1 is built without the Laplacian graph regularization term in Eq. (5), and MESH-2 replaces the class semantic embeddings with the one-hot label information Y. Table 5 reports the results on the Wiki dataset. As can be seen, the proposed MESH outperforms both MESH-1 and MESH-2. We can also conclude that the semantic regression term in Eq. (7) plays the key role in MESH, while the manifold embedding is icing on the cake.

F. PARAMETER SENSITIVITY

Fig. 5 shows the effect of the trade-off parameters α, β, and λ on the LabelMe dataset. As shown in the figure, the cross-modal results are influenced by α and β. In general, the proposed method achieves good performance over a wide range of parameter values.

G. CONVERGENCE
Since alternating optimization in an iterative manner is used to optimize the objective function, we further investigate the convergence of our algorithm. We conduct experiments on the three datasets and record the objective function value at each iteration, yielding the convergence curves in Fig. 6. As can be seen, MESH typically converges within 10 iterations, which further demonstrates the efficiency of the optimization algorithm. Moreover, we report the per-iteration training time of Algorithm 1 on the different datasets in Table 6. The results are obtained with Matlab R2015b on a PC with a 2.7 GHz Intel CPU and 16 GB memory.

V. CONCLUSION
In this article, we proposed a flexible cross-modal hashing method named Manifold-Embedded Semantic Hashing, MESH for short. MESH preserves the neighborhood proximity and class proximity in the hash codes to augment their semantic information and improve their discrimination. In addition, the two-step scheme makes MESH flexible with respect to the choice of hashing functions. Extensive experiments on three datasets demonstrated the superiority of MESH over ten state-of-the-art cross-modal hashing approaches.
Recently, zero-shot cross-modal hashing has drawn considerable interest; our model currently cannot handle the zero-shot problem. We plan to extend MESH, in combination with deep representations, to cross-modal retrieval for newly emerging concepts.