Deep Hashing Similarity Learning for Cross-Modal Retrieval

In the realm of cross-modal retrieval research, hash methods have garnered significant attention from scholars due to their high retrieval efficiency and low storage costs. However, these methods often sacrifice a considerable amount of semantic features when mapping multi-modal characteristics to a low-dimensional space. Moreover, the focus of hash learning has primarily been on inter-modal similarity learning, neglecting the importance of intra-modal similarity learning. To address these issues, this paper proposes a novel cross-modal hash method called Deep Hashing Similarity Learning for Cross-modal Retrieval (DHSL). DHSL incorporates relation networks into the hash method, enabling pairwise matching between images and texts. This approach effectively bridges the heterogeneity gap between images and texts while simultaneously emphasizing the intra-modal similarity information within both modalities. The result is a hash similarity matrix that captures both inter-modal similarity and intra-modal discriminability. Considering that the process of transforming high-dimensional features into hash codes often leads to a loss of important semantic information, we introduce a feature selector to enhance the features. This selector filters out distinctive features from the original feature set and combines them with low-dimensional features to complement the semantic information. Moreover, we introduce weighted cosine triplet loss and quantization loss to constrain the hash representation in the Hamming space, thereby learning high-quality hash codes. Comprehensive experimental results on two benchmark datasets, NUS-WIDE and MIRFlickr25K, demonstrate that DHSL outperforms the state-of-the-art cross-modal hash methods.


I. INTRODUCTION
With the development of big data, the internet generates a vast amount of multimedia data every day, including texts, images, and videos. Because data from different modalities follow heterogeneous distributions, there is an urgent demand for accurate retrieval across multimedia data. As a result, cross-modal retrieval has emerged as an attractive and challenging research field. Among the various cross-modal retrieval applications, image-to-text (text-to-image) retrieval is the most widely used: given a text (image) query, the goal is to retrieve images (texts) from a database that contain semantically relevant information.
The associate editor coordinating the review of this manuscript and approving it for publication was Massimo Cafaro.
One commonly used approach for cross-modal retrieval on large-scale data is hashing, in which high-dimensional features are reduced in dimensionality and stored as binary codes. Related samples from different modalities then receive similar binary codes, resulting in more accurate cross-modal retrieval. Early cross-modal hashing methods typically relied on manually designed feature extraction methods [1], [2], such as SIFT and HOG, which have limited generalization capability and lower retrieval performance. In recent years, with the widespread application of deep neural networks in cross-modal retrieval, experimental results have shown that neural networks extract features better than shallow methods. Although deep neural networks have brought significant progress to cross-modal hashing [3], [4], [5], [6], [7], [8], [9], several issues still need improvement. Firstly, different modalities are heterogeneous, and maintaining similarity across modalities is one of the key challenges. Secondly, many cross-modal hashing methods use deep neural networks to convert features into hash codes, potentially losing some semantic information. Thirdly, while most methods focus on constructing cross-modal similarity matrices to preserve similarity between modalities, they often overlook similarity between instances within a modality.
To tackle these issues, we propose a novel approach for cross-modal retrieval called Deep Hashing Similarity Learning (DHSL). Inspired by relation networks [25], [26], [27], we incorporate relation networks into the cross-modal hashing method. The main contributions of our work can be summarized as follows:
1) We design a feature selector that enhances features using a class residual connection method [28], [29], [30]. Specifically, we employ norm-based feature selection to identify relevant features, and then integrate the selected features with low-dimensional features through a class residual connection. This integration ensures that the hash representation contains more semantic information.
2) We introduce a relational network into our hash learning approach. Firstly, we directly perform pairwise similarity learning on the enhanced features of images and texts, reducing the loss of semantic information in the hash codes. Secondly, we conduct intra-modality similarity learning on the fused features of the low-dimensional and high-dimensional spaces, learning discriminative hash codes.
3) We employ a weighted cosine triplet loss and a quantization loss to constrain the hash codes. By combining these loss functions, we optimize the model and enable DHSL to learn high-quality hash codes.

II. RELATED WORK
The essential principle of cross-modal hashing lies in mapping features of different modalities into a shared Hamming space, where the similarity between different modalities can be measured, enabling cross-modal retrieval. The crucial aspects of this process are the selection of features and the measurement of their similarity, both aimed at increasing retrieval accuracy.
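As a small illustration of why a shared Hamming space makes retrieval cheap, the sketch below ranks text codes against image queries. It assumes codes take values in {-1, +1}, for which the Hamming distance between two k-bit codes equals (k - inner product) / 2. The function name and toy codes are illustrative and are not taken from the paper.

```python
import numpy as np

def hamming_distance(query_codes, db_codes):
    """Pairwise Hamming distances between {-1, +1} codes of equal length k."""
    k = query_codes.shape[1]
    # Matching bits contribute +1 and differing bits -1 to the inner product.
    return 0.5 * (k - query_codes @ db_codes.T)

# Toy 4-bit codes: two image queries and three database texts.
image_codes = np.array([[1, -1, 1, 1], [-1, -1, 1, -1]])
text_codes = np.array([[1, -1, 1, -1], [-1, 1, -1, 1], [1, -1, 1, 1]])

dist = hamming_distance(image_codes, text_codes)
# Database indices sorted from nearest to farthest for each query.
ranking = np.argsort(dist, axis=1, kind="stable")
```

The whole distance matrix comes from one matrix product, which is the practical source of the retrieval efficiency attributed to hashing methods.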
Cross-modal hashing can be divided into unsupervised hashing and supervised hashing. Unsupervised hashing [10], [11], [12] converts data from different modalities into hash codes to measure similarity without label information. For example, Song et al. [13] sought a common Hamming space into which multimedia data could be mapped, enabling different media data to learn unified hash codes in this space. Ding et al. [2] used collective matrix factorization to learn information from different views and mapped it into unified hash codes. Considering the discreteness of hash codes, Wang et al. [14] first learned semantic features (multiple semantic topics or concepts) of multimedia data, mapped these features to a common subspace, and then directly generated hash codes. Supervised hashing [17], in contrast, uses supervised information such as labels or semantic information to further improve cross-modal retrieval performance. Lin et al. [18] added semantic-related supervision to the training data and minimized the Hamming distance of hash codes by transforming it into a probability distribution, thus obtaining hash codes that preserve semantic structure. Zhang et al. [19] seamlessly integrated semantic labels into hash learning to maximize the semantic relevance of modal data. Cao et al. [20] used data category labels as supervised information and learned superior hash codes by maximizing inter-class distance while minimizing intra-class variance. Ma et al. [34] used semantic similarity relationships to learn binary codes and learned hash codes bit by bit through alternating updates, making the hash codes more similar.
Compared to shallow approaches, deep learning can extract features with non-linear structure and outperforms manual feature extraction. Jiang and Li [3] proposed Deep Cross-Modal Hashing (DCMH), which integrated feature extraction and hash code learning into one framework using deep convolutional neural networks, enabling end-to-end learning. Yang et al. [4] further enhanced DCMH by considering the similarity between instances within each modality and applying pairwise constraints to different types, enhancing the discriminative power of the hash codes. Additionally, Li et al. [6] introduced labels as input to the neural network, converting labels into binary codes to constrain the hash codes. Yao et al. [35] used collaborative filtering to mine the relationship between labels and hash codes, which to some extent reduced memory consumption, and enhanced cross-modal alignment through image attributes, thereby improving the quality of hash codes. Wang et al. [21] introduced the concept of adversarial learning to cross-modal retrieval and proposed Adversarial Cross-Modal Retrieval (ACMR). ACMR has prompted many researchers to combine adversarial learning with hashing methods [22], [23], [24], with significant results. AGAH [8] used adversarial learning to guide the multi-label attention module in learning feature representations and employed a binary code mapping for multi-label semantic information, leading to improved retrieval performance. DADH [9] employed adversarial learning in both feature learning and hash code learning, ensuring consistency in cross-modal feature representation through dual adversarial learning. In contrast to adversarial learning, our approach adopts the relational network mechanism of the DRSL [27] method and optimizes it, resulting in improved retrieval performance.

Given the training sets X, Y, and L, the goal of DHSL is to acquire a hashing function and hash codes for images and texts, as well as a similarity hash matrix R. The DHSL framework is illustrated in Figure 1.

A. FEATURE LEARNING MODULE
1) FEATURE LEARNING
The feature learning part consists of two neural networks, one for learning the image features and another for learning the text features. For the image modality, we use the CNN-F network pre-trained on ImageNet, with fixed parameters, to extract 4,096-dimensional deep features from images. Since image features contain a lot of redundant information, we set up a 3-layer fully connected network to extract high-level semantic features from images. The last layer of the network serves as the hash layer, mapping image features to a low-dimensional feature space.
For the text modality, we first use the Bag-of-Words (BoW) model to transform each text into a feature vector, whose dimension depends on the dataset. We then build a three-layer fully connected neural network to extract high-level semantic information from the text, and employ a hash layer to map the text features to a lower-dimensional feature space.
We use the feature learning network described above as our primary network, with $F_X = f(X; \theta_X)$ and $F_Y = f(Y; \theta_Y)$ denoting the feature projection functions for images and texts, respectively. Here, $F_X$ and $F_Y$ are the image and text features output by the primary network, and $\theta_X$ and $\theta_Y$ are the parameters of the image and text feature projectors.
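The two projectors can be sketched as plain feed-forward stacks. This is a minimal NumPy sketch under stated assumptions: dropout and batch normalization are omitted, the weights are random rather than trained, the layer widths are shrunk so the example runs instantly, and the names `init_projector` and `project` are hypothetical.

```python
import numpy as np

def init_projector(layer_sizes, rng):
    """Create (weight, bias) pairs for a stack of fully connected layers."""
    return [
        (rng.normal(0.0, 0.01, size=(d_in, d_out)), np.zeros(d_out))
        for d_in, d_out in zip(layer_sizes[:-1], layer_sizes[1:])
    ]

def project(features, params):
    """Forward pass: ReLU on hidden layers, linear output from the last (hash) layer."""
    hidden = features
    for weight, bias in params[:-1]:
        hidden = np.maximum(hidden @ weight + bias, 0.0)
    weight, bias = params[-1]
    return hidden @ weight + bias

rng = np.random.default_rng(0)
k = 16  # hash code length

# Toy input sizes stand in for the CNN-F image features and the BoW text vectors.
theta_x = init_projector([64, 128, 32, k], rng)
theta_y = init_projector([40, 128, 32, k], rng)

F_x = project(rng.normal(size=(5, 64)), theta_x)  # image features, shape (5, k)
F_y = project(rng.normal(size=(5, 40)), theta_y)  # text features, shape (5, k)
```

Both modalities end in the same k-dimensional space, which is what lets their outputs be compared directly in later stages.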

B. FEATURE ENHANCEMENT
Projecting the original features onto a low-dimensional space through the primary network loses some semantic information. To address this issue, we propose a feature selector composed of three linear layers, which are distinct from the fully connected layers used in feature learning. We apply $L_{2,1}$-norm regularization to the feature selector parameters, which yields a sparse weight matrix. The feature selector can then identify features that carry discriminative edge semantics. These distinctive features are subsequently connected in a residual manner to the low-dimensional features, complementing the missing information and achieving feature enhancement. The loss function for feature enhancement is therefore the $L_{2,1}$-norm penalty on the feature selector weights, in which $\mu_1$ and $\mu_2$ denote the sparsity parameters and $U_*$ denotes the sparse weight matrix. The enhanced feature is $H_* = F_* + \sigma M_*$, with $* \in \{X, Y\}$, where $M_*$ is the output of the feature selector, $F_*$ is the output of the main network, and $\sigma$ is the weight parameter.
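The two ingredients of feature enhancement can be sketched as follows. The row-wise definition of the $L_{2,1}$ norm (sum of the Euclidean norms of the rows) is the common convention in sparse feature selection. The exact form of the paper's loss is not recoverable from the text, so `enhancement_loss` is an assumption: a sum of the two penalties weighted by $\mu_1$ and $\mu_2$. All function names are hypothetical.

```python
import numpy as np

def l21_norm(weights):
    """L2,1 norm: sum over rows of each row's Euclidean norm."""
    return float(np.sqrt((weights ** 2).sum(axis=1)).sum())

def enhancement_loss(U_x, U_y, mu1, mu2):
    """Assumed form: weighted sum of the L2,1 penalties on both selectors."""
    return mu1 * l21_norm(U_x) + mu2 * l21_norm(U_y)

def enhance(low_dim, selected, sigma):
    """Residual-style enhancement H = F + sigma * M."""
    return low_dim + sigma * selected

# A row of zeros contributes nothing, which is how the penalty promotes row sparsity.
U_x = np.array([[3.0, 4.0], [0.0, 0.0], [6.0, 8.0]])
U_y = np.array([[1.0, 0.0], [0.0, 0.0]])

penalty = l21_norm(U_x)                                # 5 + 0 + 10
loss = enhancement_loss(U_x, U_y, mu1=4.0, mu2=5.0)    # 4 * 15 + 5 * 1

F = np.array([[0.2, -0.4]])
M = np.array([[1.0, 2.0]])
H = enhance(F, M, sigma=0.1)
```

Because the penalty acts on whole rows, driving a row to zero discards one input feature entirely, so the surviving rows indicate which original features the selector keeps.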

C. HASHING LEARNING MODULE
In the hashing learning phase, the hashing layer applies a non-linear transformation to the image and text features using the tanh function, mapping them to a hash code representation ranging from −1 to 1. In the testing phase, the model converts the image and text hash code representations into binary codes using the sign function, defined as:

$$\mathrm{sign}(x) = \begin{cases} 1, & x \geq 0 \\ -1, & x < 0 \end{cases}$$

The image and text hash codes are then $B_X = \mathrm{sign}(H_X)$ and $B_Y = \mathrm{sign}(H_Y)$, respectively. To ensure consistency between the hash codes and the cross-modal features, we apply a quantization loss that enforces balanced constraints on the hash codes.
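The relaxation and binarization steps can be sketched as below. Two points are assumptions rather than statements from the paper: zero is mapped to +1, and the quantization loss is taken to be the squared distance between the relaxed representation and its binary code, a form widely used in deep hashing.

```python
import numpy as np

def binarize(h):
    """Sign function with the convention that zero maps to +1."""
    return np.where(h >= 0, 1, -1)

def quantization_loss(h):
    """Assumed form: squared distance between relaxed values and their binary codes."""
    b = binarize(h)
    return float(((b - h) ** 2).sum())

# tanh keeps the training-time representation inside (-1, 1).
relaxed = np.tanh(np.array([[2.0, -0.5, 0.0], [-3.0, 0.1, 1.0]]))
codes = binarize(relaxed)

gap = quantization_loss(relaxed)                        # positive: values are not yet binary
no_gap = quantization_loss(np.array([[1.0, -1.0]]))     # already binary
```

Pushing the relaxed values toward ±1 during training reduces the information lost when the sign function is applied at test time.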
To maintain consistency between modalities and discriminability within each modality, our approach to hash learning consists of two parts. On the one hand, we use relation networks to conduct similarity learning between image and text hash codes, aiming to learn a similarity hash matrix that captures inter-modality and intra-modality relationships. On the other hand, by employing a weighted cosine triplet loss in the Hamming space, we learn high-quality hash codes that have similar semantics across different modalities.

1) RELATION NETWORK
In DRSL [27], the authors employed a relational network to directly calculate the similarity between images and texts, representing the similarity of each image-text pair as a scalar value, which heavily compromised the modal semantic features. In DHSL, the relational network outputs a similarity matrix that represents the similarity between images and texts, and uses the Euclidean distance to measure the distance between this similarity matrix and a prior similarity matrix, thereby encoding more semantic features in the similarity matrix. For the image and text modalities, we use a fusion mechanism to directly match and fuse the enhanced features of the image and text, and then calculate the similarity of paired samples using the relational network. The relational network function is $R^{**} = r(G^{**}; \theta_r)$, where $G^{**}$ is the output of cross-modal feature fusion (performed by concatenation), $\theta_r$ denotes the parameters of the relational network, and $R^{**}$ is the hash similarity matrix calculated by the relational network. For inter-modality fusion, the fused binary code matrix is $G^{XY} = \{G^{XY}_{pq} \mid p = 1, \ldots, n_i;\ q = 1, \ldots, n_t\}$, where $G^{XY}_{pq}$ is the fused binary code of the p-th image and the q-th text.

The label matrix defines a similarity of 1 for samples of the same class and a similarity of 0 for samples of different classes. $S_{pq}$ denotes the similarity between the p-th image and the q-th text:

$$S_{pq} = \begin{cases} 1, & \text{if the } p\text{-th image and the } q\text{-th text belong to the same class} \\ 0, & \text{otherwise} \end{cases}$$

In the relational network module, we optimize the model by minimizing the distance between the hash similarity matrix $R^{XY}$ and the prior similarity matrix $S^{XY}$; the corresponding inter-modal loss is the Euclidean distance between these two matrices.

Within each modality, we fuse the low-dimensional features of images and texts with the enhanced features from the feature enhancement component into binary code representations $G^{XX}$ and $G^{YY}$. Using the relation network, we obtain the similarity matrices $R^{XX}$ and $R^{YY}$ for images and texts, respectively, and the intra-modal similarity loss of the relation network is defined on these matrices in the same manner. Although the hash similarity matrices take different forms, they convey similar similarity information. Thus, the similarity hash matrix can be expressed as $R = R^{XX} + R^{YY} + R^{XY}$, and the overall loss of the relational network combines the inter-modal and intra-modal losses.

2) WEIGHTED COSINE TRIPLET LOSS
In multi-label data, a sample can belong to multiple categories, which results in weak discriminative power between different modalities. Hence, we propose using a weighted cosine triplet loss [9] to measure the similarity between instances of hash codes, aiming to increase the distance between samples with different semantic meanings while reducing the distance among those with the same semantic meaning.
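For multi-label data of this kind, the label-derived prior similarity and the concatenation-based pairing used by the relation network can be sketched as follows. The assumptions are labeled: two samples are treated as similar when they share at least one label (the usual multi-label reading of "same class"), and the loss is a squared Euclidean distance. The relation network itself is left out, and all function names are hypothetical.

```python
import numpy as np

def prior_similarity(labels_a, labels_b):
    """S[p, q] = 1 when sample p and sample q share at least one label, else 0."""
    return (labels_a @ labels_b.T > 0).astype(np.float64)

def fuse_pairs(feats_a, feats_b):
    """Concatenate every row of feats_a with every row of feats_b.

    Returns an array of shape (n_a, n_b, d_a + d_b).
    """
    n_a, n_b = feats_a.shape[0], feats_b.shape[0]
    left = np.repeat(feats_a[:, None, :], n_b, axis=1)
    right = np.repeat(feats_b[None, :, :], n_a, axis=0)
    return np.concatenate([left, right], axis=2)

def similarity_loss(predicted, prior):
    """Assumed form: squared Euclidean distance between the two matrices."""
    return float(((predicted - prior) ** 2).sum())

image_labels = np.array([[1, 0, 1], [0, 1, 0]])
text_labels = np.array([[0, 0, 1], [0, 1, 0], [1, 1, 0]])
S_xy = prior_similarity(image_labels, text_labels)

H_x = np.arange(8.0).reshape(2, 4)      # stand-in enhanced image features
H_y = -np.arange(12.0).reshape(3, 4)    # stand-in enhanced text features
G_xy = fuse_pairs(H_x, H_y)             # one fused vector per (image, text) pair
```

A relation network would map each fused vector in `G_xy` to one entry of the predicted similarity matrix, which `similarity_loss` then compares against `S_xy`.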
For the image modality, we construct triplets of the form $(x_i, y_j^{+}, y_k^{-})$, where $x_i$ is an image sample, and $y_j^{+}$ and $y_k^{-}$ are positive and negative text samples, respectively. The positive text samples have semantics similar to the image sample, while the negative text samples have dissimilar semantics. Typically, cosine distance is used to measure the similarity between the instances in a triplet. To describe the association between data points and multi-class labels more accurately, the hash codes are semantically ordered in the Hamming space, and a weight factor $\omega$ is computed for the ranked hash codes based on the NDCG evaluation criterion. The weight factor depends on the similarity level of the i-th data point in the sorted list and is normalized by a constant $Z$. The weighted cosine triplet loss is defined for the image modality and, analogously, for the text modality, in terms of the weight factor $\omega$, the margin parameter $m$, and the cosine distance $\cos$. The higher the weight factor, the stronger the semantic relevance between the query sample and the data point from the other modality. The overall weighted cosine triplet loss combines the image-modality and text-modality terms.
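A single triplet term can be sketched as below. The hinge form with margin m, applied to cosine similarities, is an assumption based on how triplet losses are commonly written; the paper's exact expression and its NDCG-based weight formula are not reproduced here, so the weight is simply passed in as a number.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def weighted_triplet_loss(anchor, positive, negative, weight, margin):
    """Assumed hinge form: penalize when the negative is not at least
    `margin` less similar to the anchor than the positive."""
    violation = margin + cosine(anchor, negative) - cosine(anchor, positive)
    return weight * max(0.0, violation)

image = np.array([1.0, 1.0, -1.0, 1.0])
matching_text = image.copy()      # cosine similarity +1 with the image
opposite_text = -image            # cosine similarity -1 with the image

# Well-separated triplet: no penalty.
satisfied = weighted_triplet_loss(image, matching_text, opposite_text, weight=2.0, margin=0.5)
# Swapped roles: the largest possible violation, scaled by the weight.
violated = weighted_triplet_loss(image, opposite_text, matching_text, weight=2.0, margin=0.5)
```

A larger weight makes a violation on a highly relevant pair cost more, which matches the role the text assigns to the weight factor.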

D. OPTIMIZATION
The overall loss function of DHSL consists of the feature enhancement, relational network, weighted cosine triplet, and quantization losses. It is minimized with respect to $\theta_X$ and $\theta_Y$, the parameters of the main network, $\theta_r$, the parameters of the relation network, and $\theta_*$, the parameters of the sparse weight matrix. We optimize the overall objective function using stochastic gradient descent, and the optimization process is summarized in Algorithm 1.

IV. EXPERIMENTS
A. DATABASES AND EVALUATION CRITERIA
We conduct experiments on two datasets commonly used in cross-modal retrieval. MIRFlickr25K consists of 24 categories with a total of 25,000 image-text pairs. Each pair has at least one label. For our experimental database, we select 20,015 instance pairs, with each pair having a minimum of 20 text tags. Each instance in the text modality is represented as a 1,386-dimensional bag-of-words vector. We randomly choose 2,000 instances as the query set, while the remaining instances serve as the retrieval database. Additionally, we extract 10,000 image-text pairs from the database for training.
NUS-WIDE contains 81 categories and a total of 269,648 images, each accompanied by relevant text descriptions. We select 195,834 image-text pairs from this dataset as our experimental database, covering the 21 most common categories. Each instance in the text modality is represented as a 1,000-dimensional bag-of-words vector. We randomly choose 2,100 pairs as the query set and extract 10,000 image-text pairs from the database for training.
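The split protocol can be sketched with index arrays. The sizes below are the MIRFlickr25K figures quoted above; the seed and the function name are illustrative, and the sketch assumes the training pairs are drawn without replacement from the retrieval database.

```python
import numpy as np

def split_indices(n_total, n_query, n_train, seed=0):
    """Randomly split instance indices into query, retrieval, and training sets."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_total)
    query = perm[:n_query]
    retrieval = perm[n_query:]          # everything that is not a query
    train = rng.choice(retrieval, size=n_train, replace=False)
    return query, retrieval, train

query, retrieval, train = split_indices(n_total=20015, n_query=2000, n_train=10000)
```

Calling `split_indices(195834, 2100, 10000)` would give the corresponding split for NUS-WIDE.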
As evaluation metrics, we adopt three commonly used indicators in cross-modal retrieval: mAP, Precision-recall curves, and top-N precision. We compute these three metrics for two tasks: retrieving texts with an image query (I→T) and retrieving images with a text query (T→I).
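For concreteness, mAP under Hamming ranking can be computed as in the sketch below. It assumes {-1, +1} codes and treats a database item as relevant when it shares at least one label with the query, which is the usual convention for these multi-label datasets but is not spelled out in the text. The function name is hypothetical.

```python
import numpy as np

def mean_average_precision(query_codes, db_codes, query_labels, db_labels):
    """mAP over all queries, ranking the database by Hamming distance."""
    k = query_codes.shape[1]
    dist = 0.5 * (k - query_codes @ db_codes.T)
    relevant = (query_labels @ db_labels.T) > 0
    ap_values = []
    for d, rel in zip(dist, relevant):
        order = np.argsort(d, kind="stable")
        rel_sorted = rel[order]
        n_rel = rel_sorted.sum()
        if n_rel == 0:
            continue  # queries with no relevant item are skipped
        hits = np.cumsum(rel_sorted)
        ranks = np.arange(1, len(rel_sorted) + 1)
        # Average the precision values at the ranks where relevant items appear.
        ap_values.append(float((hits / ranks)[rel_sorted].sum() / n_rel))
    return float(np.mean(ap_values))

q_codes = np.array([[1, 1, 1, 1]])
q_labels = np.array([[1, 0]])
db_codes = np.array([[1, 1, 1, 1], [1, 1, 1, -1], [1, 1, -1, -1], [-1, -1, -1, -1]])
db_labels = np.array([[1, 0], [0, 1], [1, 1], [0, 1]])

score = mean_average_precision(q_codes, db_codes, q_labels, db_labels)
```

In this toy case the relevant items land at ranks 1 and 3, so the average precision is (1/1 + 2/3) / 2.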

B. EXPERIMENTAL SETUP
Based on experimental analysis, we tailor the hyperparameters to the different networks and datasets. The image and text main networks are each 3-layer fully connected networks. The dropout rate is set to 0.2 for the first two layers, and the activation function is ReLU. Batch normalization is applied to each layer, and the numbers of neurons in the three layers are set to 8,192-2,048-k. Because the relational network and the main networks learn features differently, we assign them distinct initial learning rates: 0.0001 for the main network and 0.0005 for the relational network. The batch size is 128 throughout. Notably, on both datasets, setting the hyperparameters to $\mu_1 = 4$ and $\mu_2 = 5$ improves the mAP results when the hash code length is 16 bits.
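To give a sense of scale for the 8,192-2,048-k layout, the sketch below counts the weights and biases of the image main network alone. It assumes the 4,096-dimensional CNN-F feature feeds directly into the first layer and ignores batch-normalization parameters, the feature selector, and the text network, so the result is not comparable with any total reported for the full model.

```python
def dense_parameter_count(layer_sizes):
    """Count weights and biases of a stack of fully connected layers."""
    return sum(
        d_in * d_out + d_out
        for d_in, d_out in zip(layer_sizes[:-1], layer_sizes[1:])
    )

# Image branch with a 16-bit hash layer: 4096 -> 8192 -> 2048 -> 16.
image_layers_16bit = [4096, 8192, 2048, 16]
image_params_16bit = dense_parameter_count(image_layers_16bit)
```

Almost all of these parameters sit in the first two layers, so changing the code length k has little effect on the size of the main network.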
The results in Table 1 and Table 2 show that DHSL achieves the best performance on both datasets. On both datasets, the performance of DHSL on both tasks improves as the number of hash bits increases. This improvement is attributed to the feature enhancement part, which allows longer hash codes to contain more discriminative semantic features, thus aiding the learning of sample similarities across modalities. Additionally, compared to the DADH baseline, DHSL achieves a larger performance improvement on the NUS-WIDE dataset. The reason is that each data point in NUS-WIDE has more labels, allowing inter-modality relationships to be expressed better in the reduced-dimensional hash representation. Moreover, the relational network can directly calculate the similarity between cross-modal features, enhancing the alignment of cross-modal data and subsequently improving accuracy. In comparison to the DCMHT baseline, DHSL shows only a slight performance improvement. The margin is small mainly because DCMHT uses the CLIP pre-trained model [31], [32] in its main network, which extracts cross-modal high-level semantic representations and enhances its potential for cross-modal alignment. However, this also significantly increases the complexity of the model. Therefore, DHSL improves retrieval performance while reducing computational space requirements.
We plot the Precision-recall curves and top-N precision curves of DHSL and the top six state-of-the-art baseline methods on the two public datasets, with hash codes of 16 bits, 32 bits, and 64 bits. The curves are shown in Figures 2, 3, and the following figures.

Table 3 presents the results of four ablation experiments. For DHSL1, low-dimensional features are directly fused with original features without the $L_{2,1}$-norm constraint, which includes many irrelevant features. This results in feature repetition and redundancy, causing a decrease in retrieval accuracy. The results of DHSL2 demonstrate that the feature enhancement component improves model performance. Fusing the filtered high-dimensional features partially compensates for the shortcomings of feature dimension reduction, adding semantic features from multiple modalities to cross-modal hashing. DHSL3 demonstrates the effectiveness of incorporating intra-modality similarity hashing learning in the relational network. By combining inter-modality and intra-modality similarity hashing learning in the relational network, the model learns not only the similarity between modalities but also greater discriminative power within each modality, thereby improving retrieval accuracy. DHSL4 demonstrates the positive effect of the relational network module on cross-modal hashing learning. The relation network directly performs similarity learning on the hash representation, reducing matching errors for cross-modal data points. It simultaneously conducts similarity learning on inter-modality and intra-modality features, allowing the similarity hash matrix to contain more semantic features and generating more similar and discriminative hash codes.

E. PARAMETER ANALYSIS
Experiments are conducted on the MIRFlickr25K dataset to analyze the impact of the parameters $\eta$, $\sigma$, $\mu_1$, and $\mu_2$ on the mAP results. While testing each parameter, the other parameters are kept constant at the optimal values from Section IV-B. The parameters $\mu_1$ and $\mu_2$ control the selection of image and text features in feature enhancement, while $\eta$ and $\sigma$ weight the different losses in the overall objective. We optimize $\mu_1$ and $\mu_2$ using grid search, and $\eta$ and $\sigma$ using traditional manual search. We evaluate the model's performance with a hash code length of 16 bits, and the results are shown in Figure 6.

VOLUME 12, 2024

F. COMPLEXITY ANALYSIS
This section compares the parameter size and training time of the proposed model with the baseline DADH on MIRFlickr25K. Table 4 shows that the proposed model has $89.10 \times 10^6$ fewer parameters than DADH and trains approximately 10 seconds faster per epoch. The adversarial networks in DADH contribute to a significant increase in parameter count, with the discriminator alone having $67.15 \times 10^6$ parameters, while DHSL's relation network has only $1 \times 10^5$ parameters, which can be considered negligible. Overall, DHSL substantially reduces model complexity compared to DADH.

V. CONCLUSION
This paper introduces a novel cross-modal hashing technique called Deep Hashing Similarity Learning for Cross-modal Retrieval (DHSL). DHSL aims to address the heterogeneity between modalities and achieve efficient hashing. In the feature learning process, DHSL incorporates class residual connections to enhance features and integrates them with low-dimensional features to supplement missing information. In the hashing learning process, DHSL incorporates relation networks to learn the similarity between modalities and within modalities. Our goal is to learn a hash similarity matrix that approximates the prior similarity matrix containing label information. This process effectively bridges the heterogeneity between images and texts, preserving the semantic relevance between cross-modal features and the discriminative semantics within each modality. The hash codes in Hamming space are constrained by the weighted cosine triplet loss and the quantization loss, preserving the cross-modal semantic data structure and ultimately learning efficient hash codes. Experiments on two widely used datasets demonstrate that DHSL outperforms several state-of-the-art methods. However, this study still has some limitations. In future work, we will further improve the utilization of label information in relation networks and explore superior feature extraction methods.
YING MA received the Bachelor of Science degree from Yantai University, in 2019. She is currently pursuing the master's degree with the School of Science, Guangxi University of Science and Technology. Her research interests include machine learning and multimodal learning.
MENG WANG received the master's degree from Huazhong Normal University in 2005. He is a Professor at the School of TUS-Digit, Guangxi University of Science and Technology. He has published more than 20 papers in domestic and international academic journals as well as conferences. His research focuses on natural language understanding, cross-modal retrieval, and graph neural networks.
GUANGYUN LU was born in Yulin, Guangxi, in 1983. He is currently a Senior Engineer/Information System Project Manager. His research interests include natural language processing, image recognition and processing, and multimodal retrieval.
YAJUN SUN received the Bachelor of Engineering degree in data science and big data technology from the Hubei University of Economics, in 2021. He is currently pursuing the master's degree in applied statistics with the Guangxi University of Science and Technology. His main research interests include computer vision and multimodal learning.
III. METHOD
Without loss of generality, we consider a collection of n instances consisting of image-text pairs. The image instances are denoted by $X = \{x_i\}_{i=1}^{n}$ and the text instances by $Y = \{y_i\}_{i=1}^{n}$, where $x$ and $y$ represent the image and the text, respectively. $L = \{l_i\}_{i=1}^{n}$ represents the multi-label information, where $l_i = [l_{i1}, l_{i2}, \ldots, l_{ic}]$ and $c$ is the number of categories. If the i-th instance belongs to the j-th class, then $l_{ij} = 1$; otherwise $l_{ij} = 0$.

FIGURE 1. The DHSL framework consists of two main components: the feature learning module and the hashing learning module. The feature learning module is further divided into feature learning for the image modality and feature learning for the text modality. The hashing learning module includes hash code learning for relation networks and hash code learning for Hamming space.

FIGURE 2. The Precision-recall curves on the MIRFlickr25K dataset.

D. ANALYSIS ON ABLATION EXPERIMENTS
To verify the effectiveness of the DHSL modules, four ablation experiments are designed. (1) DHSL1 abandons the selection of original features and directly combines them with the hashed output features at the fusion layer. (2) DHSL2 removes the class residual connections and deletes the feature enhancement part. (3) DHSL3 removes the intra-modal similarity hash metric from the relational network module. (4) DHSL4 removes the relational network module altogether.

FIGURE 4. The Precision-recall curves on the MIRFlickr25K dataset.

FIGURE 6. Parameter analysis on MIRFlickr25K.
With a hash code length of 16 bits, the model achieves the best performance when $\mu_1 = 4$ and $\mu_2 = 5$, as well as when $\eta = 0.1$ and $\sigma = 0.1$.

TABLE 1. mAP results for cross-modal retrieval task on MIRFlickr25K dataset.

TABLE 2. mAP results for cross-modal retrieval task on NUS-WIDE dataset.

TABLE 4. Computational overhead of different models.