Joint Semantic Preserving Sparse Hashing for Cross-Modal Retrieval

Supervised cross-modal hashing has received wide attention in recent years. However, existing methods primarily rely on sample-wise semantic relationships to evaluate the semantic similarity between samples, overlooking the impact of label distribution on retrieval performance. Moreover, the limited representation capability of traditional dense hash codes hinders the preservation of semantic relationships. To overcome these challenges, we propose a new method, Joint Semantic Preserving Sparse Hashing (JSPSH). Specifically, we introduce the concept of a cluster-wise semantic relationship, which leverages label distribution to indicate which samples are more suitable for clustering together. We then jointly use sample-wise and cluster-wise semantic relationships to supervise the learning of hash codes. In this way, JSPSH preserves both kinds of semantic relationships so that more samples with similar semantics are clustered together, thereby achieving better retrieval results. Furthermore, we use high-dimensional sparse hash codes, which offer stronger representation capability, to preserve these more complex semantics. Finally, an interaction term is introduced in the hash function learning stage to further narrow the gap between modalities. Experimental results on three large-scale datasets demonstrate the effectiveness of JSPSH in achieving superior retrieval performance.


Zhikai Hu, Yiu-Ming Cheung, and Weichao Lan are with the Department of Computer Science, Hong Kong Baptist University, Hong Kong, SAR, China (e-mail: cszkhu@comp.hkbu.edu.hk; ymc@comp.hkbu.edu.hk; cswclan@comp.hkbu.edu.hk).
Mengke Li is with the Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, Guangdong 518000, China (e-mail: limengke@gml.ac.cn).
Donglin Zhang is with the School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214122, China (e-mail: dlinzzhang@gmail.com).
Qiang Liu is with the State Key Laboratory of Synthetical Automation for Process Industries (Northeastern University), Shenyang, Liaoning 110819, China (e-mail: liuq@mail.neu.edu.cn).
Digital Object Identifier 10.1109/TCSVT.2023.3307608

I. INTRODUCTION
In the past decade, the growing availability of multimedia data on the Internet has made cross-modal retrieval a research hotspot. Cross-modal retrieval [1], [2], [3], [4], [5], [6], [7] refers to the task of retrieving data across different modalities, such as using a piece of text to retrieve the corresponding image, video, or audio. To cope with the large amount of multimedia data and improve retrieval efficiency, hashing technology [8], [9] has been widely used in the field of cross-modal retrieval, resulting in the development of cross-modal hashing methods [10], [11], [12], [13], [14], [15]. These methods map data of different modalities into a shared Hamming subspace, enabling fast retrieval of multi-modal data through the simple XOR operation. In general, cross-modal hashing methods can be broadly classified into unsupervised [11], [13], [14], [16], [17] and supervised methods [18], [19], [20], [21]. Supervised cross-modal hashing methods, which make use of label information, can more effectively mine the semantic relationships between multi-modal data and often achieve better retrieval results. Nevertheless, since the widely used logical labels are relatively coarse supervision information, how to use them more efficiently to mine the relationships between multi-modal data and supervise the learning of the corresponding hash codes is still an open problem.

To the best of our knowledge, existing methods [21], [22], [23], [24], [25] typically estimate the similarity between samples based on the cosine similarity or inner product of their corresponding labels, capturing the sample-wise semantic relationship. However, these approaches ignore the fact that the distribution of labels can be highly diverse across different datasets, and such information is crucial to further improving retrieval quality. For example, consider a scenario with three labels A [0,0,1,1], B [1,0,0,1], and C [0,1,1,0], whose corresponding sample sizes are 1, 10, and 100, respectively, as shown in Fig. 1. The similarity between A and B and that between A and C, calculated by the cosine similarity of their labels, are both 1/2. However, since there are more samples corresponding to label C, we may expect more correct samples to be retrieved during the retrieval phase if A is closer to C. Therefore, in addition to the sample-wise semantic relationship, we can also consider a cluster-wise semantic relationship. In this context, the cluster-wise similarity between A and C measures how well they belong to the same cluster of samples, compared to A and B. Considering the cluster-wise semantic relationships of labels in supervised cross-modal hashing can therefore potentially lead to more accurate retrieval results.

Fig. 1. When disregarding the distribution of labels, the sample-wise semantic similarity between A and B and that between A and C are identical. However, given that there are more samples affiliated with label C, it is desirable for A to be closer to C than to B, to produce more correct retrieval outcomes. This relationship is referred to as the cluster-wise semantic relationship in this paper.
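The Fig. 1 example can be checked numerically. The following is a minimal sketch in plain Python; the helper name `cosine_similarity` is ours and is not taken from the paper. It shows that the label similarity alone cannot distinguish B from C, even though C covers ten times as many samples as B.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two label vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Labels and sample counts from the Fig. 1 example.
labels = {"A": [0, 0, 1, 1], "B": [1, 0, 0, 1], "C": [0, 1, 1, 0]}
counts = {"A": 1, "B": 10, "C": 100}

sim_ab = cosine_similarity(labels["A"], labels["B"])
sim_ac = cosine_similarity(labels["A"], labels["C"])

# Both similarities are (numerically) 1/2, yet the sample counts differ.
print(f"sim(A, B) = {sim_ab:.3f} with {counts['B']} samples under B")
print(f"sim(A, C) = {sim_ac:.3f} with {counts['C']} samples under C")
```

The tie between the two similarities is exactly the situation in which the label distribution can serve as an additional signal.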
Furthermore, the representation capability of the traditional dense hash codes commonly used in cross-modal hashing is limited. Traditional hash encoding schemes map multi-modal data into dense codes of −1 and 1, and require long hash codes to achieve better retrieval performance [26], [27], [28]. This results in an additional storage burden and lower retrieval efficiency. There is also a similarity mismatch between dense hash codes and labels. Specifically, as labels consist of the binary values 0 and 1, their similarity range is $[0, 1]$, where $S_L = 0$ represents semantic irrelevance (a negative relationship) and $S_L > 0$ represents semantic relevance (a positive relationship). However, the similarity range of traditional dense hash codes is $[-1, 1]$, where $S_B \le 0$ represents negative relationships and $S_B > 0$ represents positive relationships. To bridge the mismatch in value range, some methods [24], [29] use $2S_L - 1$ to estimate $S_B$. In this case, however, positive relationships with $0 < S_L < 0.5$ are incorrectly estimated as negative relationships. In addition, most current two-stage cross-modal hashing methods [12], [24], [25], [28], [29] learn the hash function separately for each modality, which leads to a lack of interaction between modalities and ultimately hinders the ability to bridge the heterogeneous gap.
In this paper, we propose a framework based on sparse hashing to address the aforementioned problems, which is referred to as Joint Semantic Preserving Sparse Hashing (JSPSH). Specifically, we propose a joint learning scheme that incorporates both the commonly used sample-wise semantic relationship and a newly introduced cluster-wise semantic relationship obtained through label clustering. We utilize these relationships simultaneously to supervise the learning of hash codes. Furthermore, we leverage the representation capability of high-dimensional sparse hash codes, which have been shown to be effective in encoding multi-modal data [25], [30]. With sparse hash codes, there is no issue of mismatching similarity value domains, as the values of sparse hash codes are 0 or 1, the same as labels. Finally, to further narrow the heterogeneous gap between modalities during the hash function learning stage, we introduce a new interaction term to increase the interaction between them.

The main contributions of this paper are summarized as follows:
• We propose a novel approach called Joint Semantic Preserving Sparse Hashing, which leverages both sample-wise and cluster-wise semantic similarity to guide the learning of hash codes. By introducing cluster-wise semantic relationships, JSPSH ensures that samples with similar semantics can be clustered together more appropriately to achieve better retrieval performance.
• To enable effective learning of these joint semantic correlations, we adopt more expressive high-dimensional sparse hash codes for encoding multi-modal data. Compared with traditional dense hash codes, they can better preserve complex semantic relationships.
• We introduce a new interaction term in the hash function learning stage, which ensures better alignment between modalities. This further improves the retrieval performance of JSPSH by strengthening the relationship between the different modalities.
• The proposed method was evaluated on three commonly used public datasets, and the experimental results demonstrate that our method outperforms existing methods, both dense and sparse hashing ones.

The remainder of this paper is organized as follows. Section II gives an overview of related work. Section III presents the details of the proposed JSPSH. Section IV provides the experimental results and analyses. Finally, a conclusion is drawn in Section V.

II. RELATED WORK
In this section, we briefly classify existing cross-modal hashing methods based on their encoding method into two categories: traditional dense hashing and high-dimension sparse hashing methods.

A. Dense Cross-Modal Hashing
By default, cross-modal hashing usually refers to dense cross-modal hashing, which encodes multi-modal data into dense hash codes where each bit in the k-bit hash code must be 1 or −1. Depending on whether supervised information is utilized, these methods can be further classified into unsupervised and supervised methods. Unsupervised cross-modal hashing methods learn hash codes for multi-modal data without the use of any explicit supervision. They typically exploit the pairwise information between different modalities or the underlying manifold structure of data within each modality to learn the hash codes. A variety of unsupervised cross-modal hashing methods have been proposed in the literature. For example, Inter-Media Hashing (IMH) [10] learns linear hash functions to map multi-modal data into a common Hamming space by exploring the inter-modal and intra-modal correlation of different modalities. Collective Matrix Factorization Hashing (CMFH) [11] utilizes the pairwise information between different modalities and introduces collective matrix factorization to learn unified hash codes. Like CMFH, Latent Semantic Sparse Hashing (LSSH) [31] learns unified hash codes for all modalities, using sparse coding and matrix factorization techniques. Composite Correlation Quantization (CCQ) [32] jointly maps multi-modal data into an isomorphic latent space and learns the corresponding hash codes by composite quantization. Fusion Similarity Hashing (FSH) [33] employs a fusion strategy to learn hash codes by constructing an undirected graph among different modalities. Collective Reconstructive Embedding (CRE) [34] also learns unified binary codes by collectively reconstructing the embeddings of multi-modal data. More recently, Robust Unsupervised Cross-Modal Hashing (RUCMH) [35] further improves the robustness of cross-modal hashing by exploring the relation between modalities with only partial or even no pairwise information.
Supervised cross-modal hashing methods utilize the additional information provided by labels or annotations to learn hash codes. For example, Semantics Preserving Hashing (SePH) [18] uses labels to learn a similarity distribution, with the objective of maximizing the similarity between the learned hash codes and the given distribution. Generalized Semantics Preserving Hashing (GSPH) [23] proposes a cross-modal hashing algorithm that can seamlessly handle multi-label and single-label data as well as paired and unpaired data, making it applicable to a wide range of real-world scenarios. Discriminative Cross-modal Hashing (DCH) [36] uses labels to learn a classifier, with the aim of generating more discriminative hash codes. To further reduce quantization error, DCH employs the Discrete Cyclic Coordinate (DCC) [37] descent method to discretely update the learned hash codes. Label Consistent Matrix Factorization Hashing (LCMFH) [38] and Scalable disCRete mATrix faCtorization Hashing (SCRATCH) [12] simultaneously leverage heterogeneous multi-modal data and labels to learn consistent hash codes that preserve semantic similarity as much as possible. Matrix Tri-Factorization Hashing (MTFH) [22] is the first cross-modal hashing method that attempts to represent data of different modalities with hash codes of different lengths, which can help capture more information from each modality. Fast Cross-Modal Hashing (FCMH) [24], on the other hand, emphasizes both global and local similarity preservation in the process of learning hash codes, and proposes a discrete update framework to optimize the objective function. To make better use of label information, Adaptive Label correlation based asymmEtric Cross-modal Hashing (ALECH) [29] uses more adaptive labels to supervise the learning of hash codes.

B. High-Dimension Sparse Hashing
High-dimensional sparse hashing is a technique in which data is mapped into a higher-dimensional Hamming space, with only a small subset of bits containing information. This approach contrasts with dense hashing, where all bits in the hash code must be either 1 or −1. In high-dimensional sparse hashing, the number of bits carrying information is significantly smaller than the total number of bits, resulting in a sparse representation that is more efficient in terms of storage and computation. The first high-dimensional sparse hashing work, Fly-Hash [54], was inspired by the olfactory circuit of the fruit fly and adapted Locality-Sensitive Hashing (LSH) [55], originally a dense hashing scheme, into a high-dimensional sparse version. The key characteristic of this approach is that it uses a hash function to project the data into a high-dimensional Hamming space, where only a small number of bits contain information. Specifically, a winner-take-all strategy is employed: the largest r elements of the output of the hash function are set to 1 and the rest are set to 0. In Fly-Hash, the hash mapping function is randomly generated, so it cannot make use of the inherent information of the data. To address this issue, some data-driven methods have been proposed, such as Bio-Inspired Hashing (Bio-Hash) [56] and Optimal Sparse Lifting Hashing (OSLHash) [57]. Although performance has been significantly improved, these methods are still limited to single-modality retrieval tasks.
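The random-projection-plus-winner-take-all idea described above can be sketched in a few lines. This is an illustrative sketch, not the Fly-Hash reference implementation: the sparse binary projection, the `fan_in` parameter, the function names, and all sizes are assumptions made for the example.

```python
import random

def winner_take_all(values, r):
    """Return a 0/1 list with ones at the positions of the r largest values."""
    top = sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:r]
    code = [0] * len(values)
    for i in top:
        code[i] = 1
    return code

def random_sparse_projection(k, d, fan_in, rng):
    """Random k x d binary matrix; each row samples `fan_in` input dimensions.

    Assumption: a sparse binary projection is one common choice for a
    randomly generated hash mapping; the text only says it is random.
    """
    rows = []
    for _ in range(k):
        picked = set(rng.sample(range(d), fan_in))
        rows.append([1 if j in picked else 0 for j in range(d)])
    return rows

def random_wta_hash(x, projection, r):
    """Project x into k dimensions, then keep only the r strongest responses."""
    activations = [sum(w * xi for w, xi in zip(row, x)) for row in projection]
    return winner_take_all(activations, r)

rng = random.Random(0)
d, k, r = 8, 32, 4  # illustrative sizes, not taken from the paper
projection = random_sparse_projection(k, d, fan_in=3, rng=rng)
x = [rng.random() for _ in range(d)]
code = random_wta_hash(x, projection, r)
print(code)
```

Because the projection is drawn at random, it is independent of the data distribution, which is the limitation that the data-driven variants aim to remove.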
More recently, high-dimensional sparse hashing was first introduced into cross-modal hashing by High-dimensional Sparse Cross-modal Hashing (HSCH) [25]. HSCH maps multi-modal data into a high-dimensional sparse Hamming space, where only a small number of bits contain information. Compared with dense hashing, high-dimensional sparse hashing has been shown to have more efficient expression ability and better retrieval performance. Later, an online version of HSCH was also proposed [30]. However, to date, there are still only a small number of cross-modal hashing methods based on high-dimensional sparse hashing.

III. PROPOSED METHOD

A. Notations
Assume that there are $n$ pieces of multi-modal data $X_I \in \mathbb{R}^{d_1 \times n}$ and $X_T \in \mathbb{R}^{d_2 \times n}$ that represent image and text data, respectively, where $d_1$ and $d_2$ indicate the dimensions of image and text data, respectively. Their corresponding label matrix is denoted as $L \in \{0, 1\}^{c \times n}$, where $c$ represents the number of data categories. $L_{ij} = 1$ if the $j$-th sample, either image or text, belongs to the $i$-th category; otherwise, it is 0. The aim is to simultaneously map $X_I$ and $X_T$ to a high-dimensional Hamming space and obtain a unified hash code $B \in \{0, 1\}^{k \times n}$, where $k$ denotes the dimension of the Hamming space. Unlike traditional dense hash codes, only $r$ elements in each hash code of $B$ are assigned a value of 1, and the rest are all 0. Thus, in this paper, $r$ is used to indicate the length of the sparse hash code, while the sparse rate of the hash code is represented as $\tau = r/k$.

The other symbols used in this paper are defined as follows: $\|\cdot\|_F$ represents the Frobenius norm of a matrix. $\|\cdot\|_2$ represents the 2-norm of a vector. $\operatorname{tr}(\cdot)$ represents the trace of a matrix. $1_m$ represents an $m$-dimensional all-ones column vector. $I_m$ represents an $m \times m$ identity matrix.

The proposed JSPSH is a two-stage model that consists of three main parts: semantic relationship exploring, hash codes learning, and hash functions learning. The overall framework of JSPSH is depicted in Fig. 2.

B. Semantic Relationship Exploring
Fig. 2. The proposed JSPSH framework is a two-stage approach for learning hash codes. In the first stage, both sample-wise and cluster-wise semantic relationships are simultaneously extracted using label information. The sample-wise semantic relationship is obtained by computing the cosine similarity between labels. To obtain different levels of cluster-wise semantic relationships, various cluster numbers are selected for clustering. Then, we compute the final cluster-wise semantic relationship as a weighted average of the cluster-wise semantic relationships at various levels. Finally, the sample-wise and cluster-wise semantic relationships are jointly used to train high-dimensional sparse hash codes. In the second stage, hash functions are learned for the different modalities using the learned hash codes. To reduce the heterogeneous gap between the modalities, a constraint is added between the different hash functions to enhance their interaction.

1) Sample-Wise Semantic Relationship: We first leverage the label information to capture the sample-wise semantic relationship $S_s$. In this semantic relationship, each sample is treated as an independent entity, and the similarity between each pair of entities is calculated based on their corresponding labels. One of the most commonly used metrics is the cosine similarity between the samples, resulting in an $n$-by-$n$ similarity matrix $S_s = \cos(L, L)$. However, if we directly use $S_s$ in the subsequent optimization process, the time complexity of the solution will be at least $O(n^2)$, making it challenging to apply the algorithm to large-scale datasets. To address this issue, following [58], we decompose the cosine similarity calculation into a more efficient operation:

$$S_s = \tilde{L}^\top \tilde{L}, \tag{1}$$

where each column of $\tilde{L}$ is a normalized label vector, i.e., $\tilde{L}_{*j} = L_{*j} / \|L_{*j}\|_2$. Since the dimension of $\tilde{L}$ is $c \times n$, we can prioritize left-side matrix multiplication in the subsequent optimization process to avoid generating an $n \times n$ matrix. This helps reduce both the time and space complexity.
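The benefit of prioritizing left-side multiplication can be illustrated with a small sketch. Everything here is an assumption made for illustration: the toy label matrix, the arbitrary $k \times n$ matrix `M` standing in for whatever multiplies the similarity matrix later, and the helper names. The point is only that the product of `M` with the $n \times n$ similarity matrix can be obtained without ever building that matrix.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]

def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]

def normalize_columns(label_matrix):
    """Scale every column of a c x n label matrix to unit 2-norm."""
    cols = []
    for col in zip(*label_matrix):
        norm = math.sqrt(sum(v * v for v in col))
        cols.append([v / norm if norm > 0 else 0.0 for v in col])
    return transpose(cols)

# Toy data: c = 3 categories, n = 4 samples, k = 2 rows in M.
L = [[1, 0, 1, 0],
     [0, 1, 1, 0],
     [0, 0, 0, 1]]
M = [[1.0, 2.0, 3.0, 4.0],
     [0.5, 0.0, 1.0, 2.0]]

L_tilde = normalize_columns(L)                      # c x n

# Naive order: build the n x n similarity matrix first.
S_s = matmul(transpose(L_tilde), L_tilde)           # n x n
naive = matmul(M, S_s)                              # k x n

# Left-side order: (M L~^T) is only k x c, so no n x n matrix appears.
factored = matmul(matmul(M, transpose(L_tilde)), L_tilde)

max_gap = max(abs(a - b) for ra, rb in zip(naive, factored) for a, b in zip(ra, rb))
print(f"largest difference between the two orders: {max_gap:.2e}")
```

Both orders give the same result up to floating-point rounding; only the intermediate sizes differ, which is what matters when $n$ is large and $c, k \ll n$.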
It is evident that the value range of $S_s$ in Eq. (1) falls within the interval $[0, 1]$. However, in traditional dense cross-modal methods, since the dense hash code values are either −1 or 1, their similarity values are limited to the range $[-1, 1]$. To rectify this incompatibility, some methods [24], [29], [58] incorporate an offset term as follows:

$$S'_s = 2S_s - 1. \tag{2}$$

Although the value ranges are aligned in Eq. (2), the offset correction leads to the misclassification of positive samples in $S_s$ ($0 < S_s < 0.5$) as negative samples ($-1 < S'_s < 0$). This problem arises because the traditional dense hash code can finely describe the relationship between negative sample pairs, i.e., it can assign a specific value in the range $[-1, 0]$ to each negative pair. However, the similarity $S_s$ obtained from labels usually marks the relationship between all negative sample pairs as 0. Therefore, a simple offset correction does not fully resolve the inherent contradiction between the dense hash code and the label-based similarity.
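A short numerical check makes the misclassification concrete. This is a sketch assuming the offset takes the form $2S_s - 1$ stated in the introduction; the function name and the sample values are illustrative.

```python
def offset_similarity(s):
    """Map a label similarity in [0, 1] to the dense-code range [-1, 1]."""
    return 2.0 * s - 1.0

label_similarities = [0.0, 0.25, 0.5, 0.75, 1.0]
flipped_values = []
for s in label_similarities:
    shifted = offset_similarity(s)
    # A pair is "flipped" when it is positive under labels (s > 0)
    # but looks negative after the offset (shifted < 0).
    flipped = s > 0 and shifted < 0
    if flipped:
        flipped_values.append(s)
    print(f"S_s = {s:.2f} -> S'_s = {shifted:+.2f}  flipped = {flipped}")
```

In this sample only the pair with similarity 0.25 is flipped: it shares a label, yet the offset similarity treats it as a negative pair.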
In this paper, the use of high-dimensional sparse hashing avoids this problem. The similarity calculated from the sparse hash code $B \in \{0, 1\}^{k \times n}$ also marks the relationship between all negative sample pairs as 0, just like $S_s$, resulting in a natural alignment with $S_s$. Moreover, the strong representation ability of sparse hash codes enables better mining of the relationship between all positive sample pairs.
2) Cluster-Wise Semantic Relationship: While the sample-wise semantic relationship has been widely used and has shown satisfactory performance [22], [23], [29], [30], [58], it overlooks the overall distribution of labels, which may play a critical role in further improving retrieval results. For instance, in the example illustrated in Fig. 1, if the sample-wise semantic similarity between label A and the other labels is the same, we want A to be closer to the label that contains more samples, which ensures that more correct results can be retrieved. To this end, we introduce the cluster-wise semantic relationship to capture this similarity tendency. Specifically, we aim to further enhance the retrieval results by exploring which labels should be closer or clustered together based on the distribution of labels.
To obtain the cluster-wise semantic relationship, we treat each label in $L$ as a feature and use the k-means algorithm to cluster $L$. Based on the clustering results, we define the cluster-wise semantic similarity between two samples as follows:

$$S_c(i, j) = \begin{cases} 1, & \text{if } C(i, j) = 1, \\ 0, & \text{otherwise}, \end{cases} \tag{3}$$

where $C(i, j) = 1$ indicates that the $i$-th label and the $j$-th label belong to the same cluster. As Eq. (3) shows, $S_c$ is an $n \times n$ matrix, which would also result in an $O(n^2)$ time complexity, as analyzed previously.

To avoid this problem, we propose assigning new labels to samples based on the clustering results. Specifically, we treat all samples within the same cluster as the same class and assign the same one-hot label to them. We then obtain a new label matrix $\hat{L} \in \{0, 1\}^{p \times n}$, where $p$ is the number of clusters specified in the clustering algorithm. With these new labels, we can calculate the cluster-wise semantic similarity of the data using the following formula:

$$S_c = \hat{L}^\top \hat{L}. \tag{4}$$

As with Eq. (1), the $O(n^2)$ time complexity can be avoided by prioritizing left-side matrix multiplication.
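The relabeling step can be sketched as follows. The cluster assignments below are hypothetical stand-ins for the output of k-means on the label vectors, and the function names are ours. The sketch checks that the inner product of two one-hot columns reproduces the same-cluster indicator.

```python
def one_hot_cluster_labels(assignments, p):
    """Build the p x n one-hot label matrix from per-sample cluster indices."""
    return [[1 if a == cluster else 0 for a in assignments] for cluster in range(p)]

def cluster_similarity(one_hot, i, j):
    """Inner product of columns i and j of the one-hot label matrix."""
    return sum(row[i] * row[j] for row in one_hot)

# Hypothetical k-means output for n = 5 samples and p = 2 clusters.
assignments = [0, 1, 1, 0, 1]
L_hat = one_hot_cluster_labels(assignments, p=2)

n = len(assignments)
for i in range(n):
    row = [cluster_similarity(L_hat, i, j) for j in range(n)]
    print(row)
```

Since every column of the one-hot matrix has exactly one nonzero entry, no normalization is needed: the inner product is 1 when two samples share a cluster and 0 otherwise.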
During the clustering of labels, a thorny issue is determining the optimal number of clusters $p$. Given that label distribution varies across datasets, it is challenging to set the most appropriate $p$ for each dataset. Fortunately, as clustering is not the ultimate objective of our proposed approach, we can focus less on the selection of $p$. Our goal is only to extract cluster-wise semantic information between samples through clustering. Consequently, we can instead extract different levels of cluster-wise semantic information by varying the value of $p$. Specifically, we choose $m$ different numbers of clusters, denoted as $\{p_i\}_{i=1}^{m}$. With different $p_i$, we obtain different clustering results and the corresponding new labels $\hat{L}^{(i)}$. This in turn gives a series of cluster-wise semantic similarity matrices

$$\{S_c^{(i)} = \hat{L}^{(i)\top} \hat{L}^{(i)}\}_{i=1}^{m}. \tag{5}$$

To leverage cluster-wise semantic relationships across different levels simultaneously, we compute the final cluster-wise semantic similarity $S_c$ as a weighted average of the cluster-wise semantic similarities $S_c^{(i)}$ obtained at the different numbers of clusters $p_i$. Specifically, we use different weights $w_i$ to adjust the contribution of each level of clustering to the final cluster-wise semantic similarity, that is,

$$S_c = \sum_{i=1}^{m} w_i S_c^{(i)}. \tag{6}$$

Considering that a larger number of clusters $p_i$ results in stronger correlations between samples belonging to the same cluster, we believe the corresponding relationship $S_c^{(i)}$ to be more informative. Therefore, we set the weights $w_i$ in Eq. (6) proportional to $p_i$. The weights are computed as follows:

$$w_i = \frac{p_i}{\sum_{j=1}^{m} p_j}. \tag{7}$$

Remark. Why can clustering results provide an effective cluster-wise semantic relationship that benefits the retrieval results? On the one hand, clustering labels that are semantically similar enhances the sample-wise semantic relationship. In other words, it helps identify which sample-wise semantic relationships need to be highlighted. On the other hand, when the sample-wise semantic relationship between labels is the same, clustering results can provide a better ranking. For instance, in Fig. 3, assume that the sample-wise semantic relationship between C and B is the same as that between C and A, i.e., $d_1 = d_2$. Since there are more samples corresponding to label A, the center point of cluster 2 will be closer to A. Therefore, in the clustering process, C will be closer to the center point of cluster 2, i.e., $d_3 < d_4$, and C will be clustered with A. This cluster-wise semantic relationship tends to bring C and A closer, ensuring that more semantically similar samples are gathered together. This decision is more advantageous when A and B are negative samples of each other. For instance, suppose that the labels A, B, and C correspond to 001, 100, and 101, respectively. In this case, it is better to make C closer to A because doing so ensures more correct retrieval results.
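The multi-level combination of cluster-wise similarities can be sketched as below. Two points are assumptions on our side: the weights are normalized to sum to one (the text states only that they are proportional to the number of clusters and that the result is a weighted average), and the per-level cluster assignments are hypothetical stand-ins for k-means output.

```python
def level_weights(cluster_counts):
    """Weights proportional to the number of clusters, normalized to sum to 1."""
    total = sum(cluster_counts)
    return [p / total for p in cluster_counts]

def same_cluster(assignments, i, j):
    """Single-level cluster-wise similarity between samples i and j."""
    return 1.0 if assignments[i] == assignments[j] else 0.0

def multilevel_similarity(assignments_per_level, cluster_counts, i, j):
    """Weighted average of the single-level similarities for the pair (i, j)."""
    weights = level_weights(cluster_counts)
    return sum(w * same_cluster(a, i, j)
               for w, a in zip(weights, assignments_per_level))

# Hypothetical clusterings of n = 4 samples at p = 2 and p = 3 clusters.
cluster_counts = [2, 3]
assignments_per_level = [
    [0, 0, 1, 1],  # p = 2
    [0, 0, 1, 2],  # p = 3
]

weights = level_weights(cluster_counts)
pair_01 = multilevel_similarity(assignments_per_level, cluster_counts, 0, 1)
pair_23 = multilevel_similarity(assignments_per_level, cluster_counts, 2, 3)
pair_02 = multilevel_similarity(assignments_per_level, cluster_counts, 0, 2)
print(weights, pair_01, pair_23, pair_02)
```

Samples 0 and 1 stay together at both levels and receive the full similarity, whereas samples 2 and 3 are merged only at the coarser level and therefore receive only the smaller weight of that level.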

C. Hash Codes Learning
After obtaining the sample-wise and cluster-wise semantic relationships, we use them to jointly learn unified hash codes $B$. The learned hash codes $B$ should ideally preserve the semantic information at both the sample and cluster levels. To this end, we define the following objective function:

$$\min_{B} \|rS_s - B^\top B\|_F^2 + \alpha \|rS_c - B^\top B\|_F^2 \quad \text{s.t. } B \in \{0, 1\}^{k \times n},\ B^\top 1_k = r 1_n, \tag{8}$$

where the hyper-parameter $\alpha$ balances the two types of semantic relationships. The two constraints $B \in \{0, 1\}^{k \times n}$ and $B^\top 1_k = r 1_n$ ensure the binary values and the sparsity of the learned hash codes $B$, respectively. However, they also make the optimization of Eq. (8) an NP-hard problem. To address this challenge, we adopt an asymmetric hashing strategy [59] and introduce an intermediate variable $H \in \mathbb{R}^{k \times n}$. Specifically, we remove the discrete constraints on one $B$ in the matrix multiplication and replace it with the continuous variable $H$. We then add a penalty term between $B$ and $H$ to reduce the quantization loss. Additionally, to minimize redundancy among different bits of the hash codes, we further apply an orthogonal constraint on $H$. As a result, Eq. (8) is transformed into the following problem:

$$\min_{B, H} \|rS_s - H^\top B\|_F^2 + \alpha \|rS_c - H^\top B\|_F^2 + \beta \|B - H\|_F^2 \quad \text{s.t. } B \in \{0, 1\}^{k \times n},\ B^\top 1_k = r 1_n,\ \text{and the orthogonal constraint on } H. \tag{9}$$

We then split the solution of Eq. (9) into two steps, the H-Step and the B-Step, and optimize them alternately.
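H-Step: With $B$ fixed, Eq. (9) can be reformulated into the following sub-problem:

$$\max_{H} \operatorname{tr}(V H^\top) \quad \text{s.t. the orthogonal constraint on } H, \tag{10}$$

where $V = rBS_s + \alpha rBS_c + \beta B$. According to [30] and [60], Eq. (10) has a closed-form optimal solution that is built from the eigen-decomposition of the matrix $VV^\top$. In this decomposition, the diagonal matrix of positive eigenvalues has size $k' \times k'$, where $k'$ is the rank of $VV^\top$, and the matrix $Q \in \mathbb{R}^{k \times k'}$ consists of the eigenvectors corresponding to the positive eigenvalues.

Computing $V$ directly involves matrix multiplication with the $n \times n$ similarity matrices, which results in a time complexity of $O(n^2)$. We therefore propose to calculate $V$ using the following formula:

$$V = r (B \tilde{L}^\top) \tilde{L} + \alpha r \sum_{i=1}^{m} w_i (B \hat{L}^{(i)\top}) \hat{L}^{(i)} + \beta B,$$

where $\tilde{L}$ is the column-normalized label matrix and $\hat{L}^{(i)}$ is the one-hot cluster label matrix obtained with $p_i$ clusters. As a result, by prioritizing left-side matrix multiplication, the time complexity of computing $V$ decreases from $O(kn^2)$ to $O(ckn)$, where $c, k \ll n$. Section III-F gives a detailed analysis.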
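The reduction of the H-Step to a trace maximization can be sketched as follows. The sketch assumes that the relaxed objective consists of the two similarity-fitting terms $\|rS_s - H^\top B\|_F^2$ and $\alpha\|rS_c - H^\top B\|_F^2$ plus the quantization term $\beta\|B - H\|_F^2$, and that the orthogonal constraint fixes $HH^\top$ to a multiple of the identity. Both are our reading of the text rather than a quoted derivation.

```latex
\begin{aligned}
\|rS_s - H^\top B\|_F^2
  &= r^2\|S_s\|_F^2 - 2r\,\operatorname{tr}(S_s^\top H^\top B) + \|H^\top B\|_F^2, \\
\operatorname{tr}(S_s^\top H^\top B)
  &= \operatorname{tr}(B S_s H^\top)
  && \text{(cyclic property of the trace, } S_s = S_s^\top\text{)}, \\
\|H^\top B\|_F^2
  &= \operatorname{tr}(B^\top H H^\top B) = \text{const}
  && \text{(assuming } HH^\top \propto I_k\text{)}, \\
\|B - H\|_F^2
  &= \|B\|_F^2 - 2\operatorname{tr}(B H^\top) + \|H\|_F^2 .
\end{aligned}
```

The same expansion applies to the $S_c$ term. Dropping everything that does not depend on $H$ leaves $-2\operatorname{tr}\big((rBS_s + \alpha rBS_c + \beta B)H^\top\big)$, so minimizing over $H$ amounts to maximizing $\operatorname{tr}(VH^\top)$ with the matrix $V$ defined in the H-Step.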
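B-Step: With $H$ fixed, Eq. (9) can be reformulated into the following sub-problem:

$$\max_{B} \operatorname{tr}\big((rHS_s + \alpha rHS_c + \beta H) B^\top\big) \quad \text{s.t. } B \in \{0, 1\}^{k \times n},\ B^\top 1_k = r 1_n.$$

The optimal solution is given by

$$B = \operatorname{sign}_r(rHS_s + \alpha rHS_c + \beta H),$$

where $\operatorname{sign}_r$ is applied to each column and transforms a real-valued vector $x$ into a sparse hash code. It is defined as follows:

$$\operatorname{sign}_r(x)_i = \begin{cases} 1, & \text{if } x_i \text{ is among the } r \text{ largest elements of } x, \\ 0, & \text{otherwise}. \end{cases}$$

The function $\operatorname{sign}_r(x)$ adopts the winner-take-all strategy: it activates only the $r$ largest elements of $x$ and sets the rest to 0.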
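The column-wise winner-take-all update can be sketched as below. The score matrix is a toy stand-in for the matrix inside the trace of the B-Step, and breaking ties in favor of the lower index is an assumption, since the text does not say how ties are handled.

```python
def sign_r(x, r):
    """Sparse code of x: 1 at the r largest entries, 0 elsewhere.

    Ties are broken in favor of the lower index (an assumption).
    """
    top = sorted(range(len(x)), key=lambda i: x[i], reverse=True)[:r]
    code = [0] * len(x)
    for i in top:
        code[i] = 1
    return code

def b_step(scores, r):
    """Apply sign_r to every column of a k x n score matrix."""
    columns = [sign_r(list(col), r) for col in zip(*scores)]
    return [list(row) for row in zip(*columns)]

# Toy k = 4, n = 3 score matrix (illustrative values only).
scores = [[0.9, -0.2, 0.1],
          [0.1,  0.8, 0.4],
          [0.7,  0.3, 0.5],
          [-0.3, 0.6, 0.2]]
B = b_step(scores, r=2)

# Every column must contain exactly r ones (the sparsity constraint).
column_sums = [sum(col) for col in zip(*B)]
print(B, column_sums)
```

Each column is handled independently, which is why the sub-problem can be solved exactly: for a fixed column, choosing the $r$ largest scores maximizes that column's contribution to the trace.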

D. Hash Functions Learning
After obtaining the hash codes, it is necessary to learn the hash functions that map the data of the different modalities to the hash codes. One conventional approach is to use a linear classification model (Eq. (17)) that fits $P_* X_*$ to $B$, where $P_*$ denotes the hash functions to be learned. This approach treats the mapping of the data to each bit of the hash code as a distinct binary classification problem. Nevertheless, since $B$ is strictly binary and $P_* X_*$ is continuous, there will inevitably be a residual distance between them, and its direction is uncontrollable. These errors affect the validity of the generated hash codes, especially because the winner-take-all strategy is used to generate high-dimensional sparse hash codes during the retrieval phase. To address this issue, [30] proposed introducing an error correction term that uses sample-wise semantic information to strengthen the constraints on the mapping function (Eq. (18)), where $\gamma$ is a hyper-parameter, generally small, that controls the degree of error correction.

However, the two aforementioned methods share a limitation: there is a lack of interaction between modalities during the hash function learning process, which can result in misalignment of the hash codes of different modalities. In the hash function learning stage, it is assumed that data of different modalities share the same hash code $B = B_I = B_T$, where $B_I$ and $B_T$ represent the hash codes of image data $X_I$ and text data $X_T$, respectively. However, Eq. (17) and Eq. (18) essentially use $B_I$ and $B_T$ independently to learn hash functions for the different modalities, which weakens the assumption $B_I = B_T$. This can cause misalignment of the hash codes of different modalities, as shown in Fig. 4. Although both distances from $P_I X_I$ and $P_T X_T$ to $B$ are small, the directions are different. Ideally, we would like to achieve the effect in Fig. 4(b). To address this, we introduce an interaction term $P_I X_I - P_T X_T$ in the hash function learning stage, which re-emphasizes the assumption $B_I = B_T$. Consequently, the overall optimization function (Eq. (19)) includes this interaction term, where $\mu$ and $\lambda$ are two hyper-parameters and $R(P_I, P_T)$ represents the regularization term imposed on $P_I$ and $P_T$.

In Eq. (19), we use only the sample-wise semantic relationship for error correction. There are two reasons for this decision. First, we believe that the hash codes $B$ have effectively integrated both sample-wise and cluster-wise semantic information in the hash codes learning stage. Second, using multiple standards for error correction, i.e., using both sample-wise and cluster-wise semantic relationships simultaneously, may introduce contradictions and be counterproductive for learning hash functions.
Finally, we alternately solve for $P_I$ and $P_T$ to optimize Eq. (19). Each update has a closed-form solution, in which $\omega I_{d_1}$ and $\omega I_{d_2}$ are two small terms ($\omega = 0.01$) added to avoid the singularity of the matrix $X_* X_*^\top$. Compared to previous methods [28], [29], [30], [61] that only involve data from the corresponding modality in training each hash function, our proposed optimization process simultaneously involves data from all modalities in the training process. For example, when solving $P_I$, both $X_I$ and $X_T$ are involved, which enhances the interaction between different modalities. This interaction not only narrows the heterogeneous gap but also allows information from multiple modalities to be used to learn a better hash function $P_*$.
The whole training process of JSPSH including semantic relationship exploring, hash codes learning, and hash functions learning is summarized in Algorithm 1.

E. Proof of Convergence
In this section, we analyze the convergence of JSPSH. During the hash code learning stage, the variables $B$ and $H$ both have closed-form solutions to their corresponding sub-problems. Let $L(B, H)$ denote the value of the objective function in Eq. (9). Then we have $L(B^{t+1}, H^{t+1}) \le L(B^{t+1}, H^{t}) \le L(B^{t}, H^{t})$, where $t$ is the iteration index. Since the sequence of objective values is monotonically non-increasing, the bounded monotone convergence theorem [62] implies that the algorithm converges to a stable solution. Similarly, during the hash functions learning stage, the variables $P_I$ and $P_T$ both have closed-form solutions to their corresponding sub-problems. Using $L(P_I, P_T)$ to denote the value of the objective function in Eq. (19), we have $L(P_I^{t+1}, P_T^{t+1}) \le L(P_I^{t+1}, P_T^{t}) \le L(P_I^{t}, P_T^{t})$. In summary, the convergence of the JSPSH algorithm can be guaranteed.

F. Complexity Analysis
The JSPSH algorithm involves three main components: label clustering, hash code learning, and hash function learning. The time complexity of the label clustering stage is $O(\sum_{i=1}^{m} t c p_i n)$, where $t$ is the maximum number of iterations. Note that this stage is performed only once, and its results are saved and reused in subsequent calculations. Therefore, the time complexity of this stage is not counted. In the hash codes learning stage, the time complexities of solving $H$ and $B$ in each round are $O(ckn + \sum_{i=1}^{m} k p_i n + kn + k^2 n + k^3)$ and $O(ckn + \sum_{i=1}^{m} k p_i n + kn + nk \log_2 r)$, respectively. In the hash functions learning stage, the time complexities of solving $P_I$ and $P_T$ are likewise linear in $n$. As $k$, $c$, $r$, $d_1$, $d_2$, and $p_i$ are all constants and much smaller than $n$, the time complexity of the JSPSH algorithm can be considered linear in the size of the training set $n$, i.e., $O(n)$. Therefore, it can efficiently process large-scale datasets.
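To make the winner-take-all step mentioned in the hash function learning discussion concrete, the following is a minimal NumPy sketch. It assumes that a sparse code is a $k$-dimensional binary vector with exactly $r$ active bits, chosen as the $r$ largest entries of a continuous $k$-dimensional vector; this reading of $k$ and $r$ is an assumption, and the function name `winner_take_all` is hypothetical rather than taken from the paper.

```python
import numpy as np


def winner_take_all(V, r):
    """Return binary codes with exactly r ones per column of V.

    V has shape (k, n): one continuous k-dimensional vector per sample.
    """
    k, n = V.shape
    if not 0 < r <= k:
        raise ValueError("r must satisfy 0 < r <= k")
    # After partitioning at position k - r, the last r rows hold the
    # indices of the r largest entries of each column (in no fixed order).
    top = np.argpartition(V, k - r, axis=0)[k - r:, :]
    codes = np.zeros((k, n), dtype=np.uint8)
    np.put_along_axis(codes, top, 1, axis=0)
    return codes
```

Because only the $r$ largest of $k$ entries are selected for each of the $n$ samples, the cost of this step grows linearly with $n$ for fixed $k$ and $r$.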
IV. EXPERIMENT

A. Experimental Settings
1) Datasets: To measure the retrieval ability of JSPSH, we conducted experiments on three commonly used large-scale datasets: MIRFlickr [63], IAPR TC-12 [64], and NUS-WIDE [65]. MIRFlickr comprises 25,000 image-text pairs, divided into 24 categories. Each image is represented by a 512-dimensional GIST feature, and each text is represented by a 1,386-dimensional bag-of-words vector. To ensure effective training, we removed pairs with fewer than 20 textual tags, leaving 20,015 valid pairs. From the remaining data, we randomly selected 2,000 data points as the query set and used the rest as the retrieval and training sets.
The IAPR TC-12 dataset consists of 20,000 image-text pairs with a total of 255 different classes. Each data point is labeled with at least one of these categories. Each image is represented by a 512-dimensional GIST feature, and each text is represented by a 2,912-dimensional bag-of-words vector. Following the setting in [30], we randomly selected 2,000 data points as the query set and used the remaining data points for retrieval and training.
NUS-WIDE is larger than the previous two datasets, consisting of 269,648 image-text pairs and 81 different categories. Following the settings in [43], we used only the samples from the 10 most frequently occurring categories, totaling 186,577 pairs. Each image is represented by a 500-dimensional SIFT feature, while the corresponding text is represented by a 1,000-dimensional binary tagging vector. We randomly selected 2,000 data points as the query set, while the remaining samples were used as the retrieval and training sets.
2) Evaluation Metrics: In this paper, we conducted two cross-modal retrieval tasks: I2T, which retrieves texts based on image queries, and T2I, which retrieves images based on text queries. We employed three commonly used metrics to evaluate the performance of JSPSH and all compared methods, namely mean average precision (mAP), the precision-recall (PR) curve, and the top-K precision curve. A higher mAP and top-K precision value, as well as a larger area under the PR curve, indicate better retrieval performance. When calculating precision, we considered a search result to be correct if it shares at least one label with the query.
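The metrics above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the evaluation code used in the experiments: labels are assumed to be multi-hot matrices, the ranking is assumed to come from a precomputed distance matrix, and average precision at K is normalized by the number of relevant items found within the top K, which is one common convention in the hashing literature. All function names are hypothetical.

```python
import numpy as np


def relevance(query_labels, db_labels):
    """1.0 where a database item shares at least one label with the query."""
    return (query_labels @ db_labels.T > 0).astype(np.float64)


def average_precision_at_k(ranked_rel, k):
    """AP over the top-k results, given relevance flags in ranked order."""
    rel = ranked_rel[:k]
    hits = rel.sum()
    if hits == 0:
        return 0.0
    precision_at_i = np.cumsum(rel) / np.arange(1, len(rel) + 1)
    return float((precision_at_i * rel).sum() / hits)


def evaluate(dist, query_labels, db_labels, k=50):
    """Return (mAP@k, mean top-k precision).

    dist has shape (n_query, n_db); smaller values mean closer items.
    """
    rel = relevance(query_labels, db_labels)
    order = np.argsort(dist, axis=1, kind="stable")
    ranked = np.take_along_axis(rel, order, axis=1)
    ap = [average_precision_at_k(row, k) for row in ranked]
    top_k_precision = ranked[:, :k].mean(axis=1)
    return float(np.mean(ap)), float(np.mean(top_k_precision))
```

For the I2T task, `dist` would hold distances from image query codes to text database codes; for T2I, the roles are swapped.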

3) Baselines and Implementation Details:
To verify the effectiveness of the proposed JSPSH, we compared it with nine state-of-the-art cross-modal hashing methods, including DCH [36], SCRATCH [12], DLFH [66], LFMH [67], BATCH [68], WATCH [28], ALECH [29] and HSCH [30]. Among these methods, HSCH is the only high-dimensional sparse cross-modal hashing method, while the remaining methods are traditional dense hashing methods. The codes for all comparison methods were kindly provided by their authors, and all parameters follow the settings in the corresponding papers. All experiments were conducted on a server equipped with an Intel i7-12700KF CPU @ 3.7 GHz and 64 GB RAM.

B. Retrieval Performance
In this section, we analyze the retrieval performance of the proposed JSPSH and compare it with other methods from three aspects. Table I presents the mAP results of all methods on the three datasets. Moreover, Fig. 5 and Fig. 6 illustrate the PR curve and top-K precision curve of all methods on the MIRFlickr dataset, respectively, with hash code lengths varying from 2 to 32 bits. Based on these results, we draw the following conclusions:
• The superiority of the high-dimensional sparse hashing methods, JSPSH and HSCH, over traditional dense hashing algorithms is evident from the mAP results presented in Table I. In particular, JSPSH and HSCH remain robust when very few bits are used, such as r = 2 or 4, demonstrating their potential to encode abundant information with fewer hash bits. This highlights the representation capability of high-dimensional sparse hash codes and their efficacy in retrieval tasks.
• JSPSH consistently outperforms HSCH in terms of retrieval performance, which highlights the efficacy of cluster-wise semantic relationships. Both JSPSH and HSCH leverage high-dimensional sparse hash codes to encode information. The main difference is that HSCH uses only sample-wise semantic relationships in hash code learning, while JSPSH uses both sample-wise and cluster-wise semantic relationships. Supervising hash code learning with cluster-wise semantic relationships provides more precise information, based on label distribution, about which hash codes should be closer in Hamming space, resulting in better grouping of semantically similar samples and superior retrieval performance.

The PR curve and top-K precision curve depicted in Fig. 5 and Fig. 6 further support these analyses. The gap between traditional dense hashing methods and high-dimensional sparse hashing methods is substantial, particularly when the dimension of the hash codes is low, such as r = 2 and 4. Comparing JSPSH and HSCH, JSPSH consistently achieves higher retrieval precision at the same recall rate. Furthermore, JSPSH returns more relevant samples within the top-K retrieved results in every setting except r = 4. These observations suggest that JSPSH is better suited to ensuring that semantically similar samples are distributed around the query. In other words, with the help of cluster-wise semantic information, JSPSH can ensure that samples are more appropriately clustered in the retrieval set.

C. Efficiency Analyses
In Section III-F, we presented a theoretical analysis showing that the time complexity of JSPSH is linear in the size of the training set. To validate this analysis, we provide experimental data on the training time complexity, training time, and retrieval time of all methods on the MIRFlickr dataset. We believe that a slight increase in training time is a reasonable trade-off considering the significant improvement in retrieval performance offered by JSPSH. As for the retrieval time, all methods achieve similar performance with the same hash code length. This indicates that sparse hash codes do not impose an additional computational burden during the retrieval phase.

D. Ablation Study
In JSPSH, we made three key contributions. First, we introduced the concept of cluster-wise semantic relationships and used it in conjunction with sample-wise semantic relationships to jointly supervise the learning of hash codes. Second, we replaced traditional dense hash codes with high-dimensional sparse hash codes, whose effectiveness has already been validated in Section IV-B. Third, we introduced an interaction term during the hash function learning process to narrow the heterogeneous gap. To validate the effectiveness of the first and third contributions, we conducted ablation experiments on five variants of JSPSH. Specifically, JSPSH-1 used only sample-wise semantic relations to train hash codes. JSPSH-2 and JSPSH-3 used both semantic relations to jointly train hash codes, but used only p = 100 and p = 500, respectively, for the cluster-wise semantic relationship obtained from the clustering results. JSPSH-4 used both semantic relations to jointly train hash codes with $\{p_i\} = \{100, 200, 500\}$, but removed the interaction term during the hash functions learning stage. Finally, JSPSH-5 replaced the high-dimensional sparse hash codes in JSPSH with dense hash codes, keeping other settings unchanged. The specific differences between all variants are summarized in Table IV. The results are reported in Table III.
By comparing JSPSH-1, JSPSH-2, JSPSH-3, and JSPSH, we can verify the role of the cluster-wise semantic relationship. The results lead to the following conclusions:
• The introduction of cluster-wise semantic information, irrespective of its level, proves beneficial to the final retrieval performance. In most cases, JSPSH-2, JSPSH-3, and JSPSH perform better than JSPSH-1, which uses only sample-wise semantic information to learn hash codes.
• The cluster-wise semantic relationship required by different datasets varies. Specifically, for the MIRFlickr dataset, the results obtained by JSPSH-2 (p = 100) and JSPSH-3 (p = 500) are comparable. However, for the IAPR TC-12 dataset, the results of JSPSH-3 are significantly better than those of JSPSH-2. Theoretically, the larger the value of p, the more accurate the cluster-wise semantic information is, which is more conducive to the learning of hash codes. However, the validity of this information also depends on the distribution of the labels themselves, which requires further investigation.
• The cluster-wise semantic relationship suited to hash codes with different representation capabilities varies. When the representation ability of the hash code is limited, that is, when r is small, overly complex semantic information may not benefit the learning of the hash code. For instance, when r = 4, the I2T results of JSPSH-2 on the MIRFlickr dataset are significantly higher than those of the other variants. Conversely, when the representation ability of the hash code is adequate, that is, when r is larger, more appropriate semantic information can stimulate its representation potential. For instance, when r = 16, JSPSH-3 outperforms all other variants on the IAPR TC-12 dataset.

From the above analysis, it can be concluded that finding suitable cluster-wise semantic relations as supervisory information for different datasets is a challenging task. To address this issue, we adopt a weighted-average strategy, which helps to reconcile these differing requirements to a certain extent. The results demonstrate that JSPSH performs better than JSPSH-2 and JSPSH-3 in most cases.
Furthermore, the effectiveness of the interaction term in the hash function learning phase can be demonstrated by comparing JSPSH-4 and JSPSH. The retrieval performance of JSPSH is consistently better than that of JSPSH-4. This shows that the proposed interaction term can effectively strengthen the interaction between modalities, further narrow the heterogeneous gap, and achieve better retrieval results. Besides, JSPSH significantly outperforms JSPSH-5, indicating that high-dimensional sparse hash codes possess a stronger representation capability than traditional dense hash codes, given the same number of hash bits.

E. Further Analyses
1) Parameter Sensitivity: We conducted experiments on the MIRFlickr dataset to analyze the sensitivity of parameters α, β, γ, and µ. Parameter α adjusts the proportion of sample-wise and cluster-wise semantic relationships, while parameters β, γ, and µ are the weights of three different auxiliary terms, namely the quantization error term, the error correction term, and the interaction term. Figure 1 shows the corresponding mAP performance. Our observations are as follows:
• Parameter α: When α is small (α < 10), its impact on the retrieval performance is relatively slight. However, when α is large (α > 10), the retrieval performance drops significantly. This is because the cluster-wise semantic relationship should be auxiliary to the sample-wise semantic relationship in JSPSH. When α is excessively large, the cluster-wise semantic relationship dominates, inverting the primary and secondary roles and leading to a decline in retrieval performance.
• Parameters β, γ, and µ: These parameters correspond to auxiliary terms, and the performance of JSPSH is not very sensitive to them. Only when their values are too large, such as µ = 1000, does the retrieval performance drop significantly.

2) Convergence Analysis: In Section III-E, we provided a theoretical analysis of the convergence of JSPSH. To gain a deeper understanding, we conducted additional experiments on the MIRFlickr and IAPR TC-12 datasets to analyze the convergence empirically. Fig. 8 presents the convergence results, where we normalize the objective function value for ease of observation. It is worth noting that the objective value drops sharply after a single iteration and that the model consistently converges after five iterations. These findings provide additional evidence of the efficient and effective convergence of our proposed model.
3) Comparison With Deep Hashing Methods: To further validate the efficacy of JSPSH, we conducted a comparison study with some state-of-the-art deep cross-modal hashing methods, including DCMH [43], SSAH [44], and EDGH [45], among others.
To ensure a fair comparison, following [12], [29], and [30], we replaced the shallow features used in the prior experiments with 4096-dimensional CNN features extracted using the CNN-F network [74] pre-trained on ImageNet [75]. Table V presents the mAP results, and for all baselines, we directly report the results from the original papers.
As demonstrated, JSPSH consistently outperforms all the baselines. A plausible reason may be that deep hashing methods tend to relax the discrete constraints of hash codes and optimize the objective function in batches. In contrast, JSPSH can effectively guarantee the quality of the hash codes by designing a discrete update algorithm and updating it in a global manner. Besides, when the dense hash code length is reduced, there is a notable decline in the performance of these deep hashing methods. Conversely, JSPSH still achieves stable performance under the same circumstances. Furthermore, even with a hash code length of 4 bits, JSPSH surpasses the majority of deep methods, highlighting the expressive capability of sparse hash codes.

V. CONCLUSION AND FUTURE WORK
In this paper, we have proposed a novel approach, Joint Semantic Preserving Sparse Hashing (JSPSH), for cross-modal retrieval. It overcomes the limitations of existing methods that only consider sample-wise semantic relationships. We have proposed a new concept of cluster-wise semantic relationships that takes into account the distribution of labels to identify which samples should be closer to each other. By preserving both sample-wise and cluster-wise semantic relationships, JSPSH is able to learn more efficient hash codes. Additionally, to capture more precise semantic information, we have utilized high-dimensional sparse hash codes that are more expressive for multi-modal data representation than traditional dense hash codes. To further bridge the gap between heterogeneous modalities, we have proposed an interaction term during hash functions learning to align the hash codes of different modalities. The experimental results demonstrate that the proposed JSPSH outperforms existing state-of-the-art methods.
Although the effectiveness of the proposed cluster-wise semantic relationship has been demonstrated in improving retrieval performance, the k-means clustering algorithm used in this paper still has some limitations in capturing this relationship. Specifically, since the number of clusters for the labels is unknown, we adopt a compromise strategy that involves selecting different numbers of clusters and performing a weighted average on the results. However, as shown in Section IV-D, this strategy is not always the optimal solution. In future work, we plan to explore new methods to obtain more effective cluster-wise semantic information, thereby further improving retrieval performance.

Manuscript received 19 March 2023; revised 24 July 2023; accepted 20 August 2023. Date of publication 22 August 2023; date of current version 5 April 2024. This work was supported in part by the NSFC/Research Grants Council (RGC) Joint Research Scheme under Grant N_HKBU214/21, in part by the General Research Fund of RGC under Grant 12201321 and Grant 12202622, in part by the National Natural Science Foundation of China under Grant 61991401, Grant U20A20189, and Grant 62161160338, in part by NSFC under Grant 62202204, and in part by the Fundamental Research Funds for the Central Universities under Grant JUSRP123032. This article was recommended by Associate Editor H. Zhang. (Corresponding author: Yiu-Ming Cheung.)

Fig. 3. When the sample-wise semantic relationship between C and B and that between C and A are the same, i.e., $d_1 = d_2$, the k-means algorithm will cluster C with A because there are more samples corresponding to label A. By preserving this cluster-wise semantic relationship, it can be guaranteed that more semantically similar samples are clustered around C in the retrieval set.

Fig. 4. When hash codes of different modalities are not aligned, two different situations can arise: (a) both $P_I X_I$ and $P_T X_T$ have small distances to $B$ but in different directions, and (b) both $P_I X_I$ and $P_T X_T$ have small distances to $B$ and in the same direction.

E. Further Analyses 1 )
Parameter Sensitive: We conducted experiments on the MIRFlickr dataset to analyze the sensitivity of parameters α,
Joint Semantic Preserving Sparse Hashing for Cross-Modal Retrieval
Zhikai Hu, Yiu-Ming Cheung, Fellow, IEEE, Mengke Li, Weichao Lan, Graduate Student Member, IEEE, Donglin Zhang, and Qiang Liu, Senior Member, IEEE

Algorithm 1 JSPSH
Input: Cluster numbers $\{p_i\}_{i=1}^{m}$, image data $X_I$, text data $X_T$, and corresponding labels $L$;
Output: Unified hash codes $B$, image hash function $P_I$, and text hash function $P_T$;

TABLE I
THE MAP RESULTS (MAP@50) OF THE PROPOSED JSPSH AND OTHER COMPARED BASELINES ON THREE DATASETS. THE BEST RESULTS ARE IN BOLDFACE

TABLE II
THE TRAIN TIME COMPLEXITY, TRAINING TIME (SECONDS), AND RETRIEVAL TIME (SECONDS) OF THE PROPOSED JSPSH AND OTHER COMPARED BASELINES ON MIRFLICKR DATASET

TABLE III
THE MAP RESULTS (MAP@50) OF JSPSH AND ITS FOUR VARIANTS ON MIRFLICKR AND IAPR TC-12 DATASETS. THE BEST RESULTS ARE IN BOLDFACE

The training time complexity, training time, and retrieval time of all methods on the MIRFlickr dataset are presented in Table II. Regarding the training time, while the time complexity of most methods is $O(n)$, the actual time required varies because of different coefficients, such as $c^2 k$ and $k^3$, in the time complexity. Since these coefficients are significantly smaller than $n$, they are disregarded when calculating the time complexity. Generally, the training time of JSPSH is comparable to that of other methods.

TABLE IV
THE DIFFERENCES BETWEEN VARIANTS OF JSPSH IN ABLATION STUDY