Deep Label Feature Fusion Hashing for Cross-Modal Retrieval

The rapid growth of multi-modal data in recent years has driven strong demand for retrieving semantically related data across different modalities. Cross-modal hashing has therefore attracted extensive interest and study due to its fast retrieval speed and good accuracy. Most existing cross-modal hashing models simply apply neural networks to extract features from the original data, ignoring the unique semantic information that labels attach to each instance. To better capture the semantic correlation between data of different modalities, a novel cross-modal hashing model called deep label feature fusion hashing (DLFFH) is proposed in this article. By building label networks in each modal network for feature fusion, we can effectively embed semantic label information into data features. The fused features capture the semantic correlation between data more accurately and bridge the semantic gap, thus improving the performance of cross-modal retrieval. In addition, we construct feature label branches and a corresponding feature label loss to ensure that the generated hash codes are discriminative. Extensive experiments on three general datasets demonstrate the superiority of the proposed DLFFH, which performs better than most cross-modal hashing models.

which can be helpful for the users to better understand the target events or topics. However, due to the "heterogeneous gap" of various modal data, how to effectively implement cross-modal retrieval is still a challenging task.

The associate editor coordinating the review of this manuscript and approving it for publication was Gianluigi Ciocca.
Representational learning [3] is a usual way to eliminate the heterogeneity gap in cross-modal retrieval. It transforms different modal data into value representations in the same semantic space, where semantically similar data have more similar values. Many algorithms of this type are listed in the literature [4]; for example, Canonical Correlation Analysis (CCA) learns the common space by maximizing the pairwise correlation between two sets of heterogeneous data. As the dimension of multi-modal data increases, the storage of large-scale data and the speed of data retrieval become very important. Therefore, the cross-modal hashing method has attracted the attention of a large number of researchers due to its lower memory cost and high retrieval speed. We can obtain the Hamming distance by performing a simple bit-wise XOR operation [5] on two hash codes, and then judge whether the two original data items are similar. A small Hamming distance means that the two items are similar, and vice versa. The heterogeneity of different modalities makes it difficult to compare similarity directly. Therefore, the main research question at present is how to generate efficient hash codes that make cross-modal retrieval more accurate and extensible. Traditional hashing models [6], [7], [8], [9], [10], [11] generate hash codes based on hand-crafted features, which lack sufficient discrimination ability and cannot represent the original data effectively. In addition, a major disadvantage of these models is that the feature learning process and the hash code generation process are separated from each other. In the past few years, with the excellent performance of deep learning in feature extraction and representation, deep cross-modal hashing models [12], [13], [14] have shown that features based on deep network learning are more representative than those of the traditional hashing models.
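The XOR-based Hamming distance computation mentioned above can be sketched in a few lines. The 8-bit codes below are purely illustrative; real systems pack hash codes into machine words and use hardware popcount.

```python
# Hamming distance between two binary hash codes via a bit-wise XOR,
# as described above: XOR marks the differing bits, then we count them.
def hamming_distance(a: int, b: int) -> int:
    """XOR the codes, then count the set bits (popcount)."""
    return bin(a ^ b).count("1")

# A small distance suggests similar items, a large one dissimilar items.
d_similar = hamming_distance(0b10110100, 0b10110110)     # codes differ in 1 bit
d_dissimilar = hamming_distance(0b10110100, 0b01001011)  # codes differ in 8 bits
print(d_similar, d_dissimilar)  # → 1 8
```

Because XOR and popcount are single machine instructions on modern CPUs, this comparison is what makes hashing-based retrieval fast at scale.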
However, most of the deep cross-modal hashing models use …

The rest of this article is organized as follows.

Section 2 reviews the related work. Section 3 introduces the innovative DLFFH. The experimental results and corresponding analysis are presented in Section 4. Finally, the conclusion is given in Section 5.

Depending on whether data labels are applied to model training, cross-modal hashing can be divided into two categories [19]: unsupervised models and supervised models. Unsupervised models do not use data labels in the process of training retrieval models. To be specific, Collective Matrix Factorization Hashing (CMFH) [20] obtains a common semantic space for the different modalities through collective matrix factorization, and then learns the hash mapping of each modality in this space. Latent Semantic Sparse Hashing (LSSH) [21] applies sparse encoding to process image data and matrix decomposition to process text data, and then maps both into a common semantic space to learn hashing. Semantic Topic Multimodal Hashing (STMH) [22] applies clustering and matrix factorization to obtain semantic topics in image and text data respectively, and then learns the relationship between the two modalities in a common subspace through these topics. Finally, the mapping from the original data to the common subspace is established to obtain the hash code representation.
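The shared-latent-space idea behind collective matrix factorization can be sketched as follows. This is a minimal illustration of the concept, not the authors' exact CMFH formulation: dimensions, the alternating least-squares updates, and the final thresholding are all assumptions made for the sketch.

```python
import numpy as np

# Two modality matrices X1, X2 share one latent representation V,
# i.e. X1 ≈ U1 V and X2 ≈ U2 V, learned by alternating least squares.
rng = np.random.default_rng(0)
n, d1, d2, k = 50, 20, 30, 8            # samples, modality dims, latent dim
V_true = rng.normal(size=(k, n))
X1 = rng.normal(size=(d1, k)) @ V_true  # "image" features (toy data)
X2 = rng.normal(size=(d2, k)) @ V_true  # "text" features (toy data)

U1 = rng.normal(size=(d1, k))
U2 = rng.normal(size=(d2, k))
for _ in range(20):
    # Solve for the shared V given both modalities (stacked least squares).
    A = np.vstack([U1, U2])
    B = np.vstack([X1, X2])
    V = np.linalg.lstsq(A, B, rcond=None)[0]
    # Update each modality's basis given the shared V.
    U1 = np.linalg.lstsq(V.T, X1.T, rcond=None)[0].T
    U2 = np.linalg.lstsq(V.T, X2.T, rcond=None)[0].T

# Binary codes then come from thresholding the shared representation.
H = np.sign(V - V.mean(axis=1, keepdims=True))
print(H.shape)  # → (8, 50)
```

The key point is that both modalities constrain the same V, so semantically paired items end up with the same latent vector and hence the same hash code.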

In contrast, supervised cross-modal hashing can guide the generation of more representative hash codes by applying data labels in the training process.

In this section, we will introduce the DLFFH and discuss it in the two most frequently used modalities: image and text. Figure 1 shows the network structure of our DLFFH, which is divided into two segments: the image network and the text network. We present the details of the model in the following section.
Y = {y_1, . . . , y_N} stands for the text data, where D_y is the dimension of y_j. L = {l_1, . . . , l_N} ∈ {0, 1}^(N×C) refers to the label matrix, where we apply C to stand for the total number of label categories. S stands for the semantic similarity matrix; S_ij = 1 means that x_i and y_j have at least one label in common. Conversely, they are dissimilar and S_ij = 0. The Hamming distance between two hash codes reflects the similarity of image data and text data; a small Hamming distance means that two data items are similar, and vice versa. For two hash codes b_i, b_j ∈ {−1, +1}^K of length K, we can calculate their Hamming distance as

    dist_H(b_i, b_j) = (1/2)(K − b_i^T b_j).

U*_i = f(x_i, l_i; θ_x, θ_x_hash) denotes the image feature output by the hash code branch in the image network, and L^x*_i = f(x_i, l_i; θ_x, θ_x_label) denotes the image feature label output by the feature label branch in the image network, where θ_x represents the total parameters of the image feature learning network and label network, θ_x_hash represents the parameters of the hash code branch, and θ_x_label represents the parameters of the feature label branch. Furthermore, V*_j = g(y_j, l_j; θ_y, θ_y_hash) denotes the text feature output by the hash code branch in the text network, and L^y*_j = g(y_j, l_j; θ_y, θ_y_label) refers to the text feature label output by the feature label branch in the text network, where θ_y represents the total parameters of the text feature learning network and label network, θ_y_hash represents the parameters of the hash code branch, and θ_y_label represents the parameters of the feature label branch.
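The construction of the semantic similarity matrix S from the label matrix L defined above can be sketched directly. The toy label assignments are illustrative only.

```python
import numpy as np

# S_ij = 1 iff instances i and j share at least one label, per the
# definition above. L is a toy {0,1}^(N×C) label matrix with N=3, C=3.
L = np.array([[1, 0, 1],   # instance 0: labels {0, 2}
              [0, 1, 0],   # instance 1: label  {1}
              [1, 1, 0]])  # instance 2: labels {0, 1}

# L @ L.T counts shared labels per pair; any positive count means "similar".
S = (L @ L.T > 0).astype(int)
print(S)
```

Instances 0 and 1 share no label, so S_01 = 0; every other pair here shares at least one label, giving S_ij = 1.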

Semantic gaps between different modal data make direct comparison impossible, so we measure data similarity by mapping the data into a common semantic space. The likelihood function of image data features and text data features is

    p(S_ij | U*_i, V*_j) = σ(Θ_ij)      if S_ij = 1,
    p(S_ij | U*_i, V*_j) = 1 − σ(Θ_ij)  if S_ij = 0,

where Θ_ij = (1/2) U*_i^T V*_j and σ(·) is the sigmoid function. When S_ij = 1, a larger likelihood denotes that the inner product (similarity) between U*_i and V*_j is larger, and vice versa. To facilitate the training of the model, we apply the negative log likelihood of the above equation (the semantic similarity loss J_s) to obtain the similarity between image data and text data:

    J_s = − Σ_ij ( S_ij Θ_ij − log(1 + e^(Θ_ij)) ).

Minimizing this negative log likelihood (equivalent to maximizing the likelihood function) can reduce the Hamming distance between similar image data and text data, thereby improving the accuracy of the model.

Furthermore, we improve the discrimination of hash codes by constraining the distance between the feature label matrix generated by the feature label branch and the label matrix, effectively making hash codes with the same label more similar and vice versa. The feature label loss is defined as

    J_l = ‖L^x − L‖²_F + ‖L^y − L‖²_F,

where L^x refers to the image feature label matrix and L^y refers to the text feature label matrix.
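A numerically stable sketch of the semantic similarity loss described above follows. The shapes and the 1/2 scaling are assumptions matching the common pairwise formulation; the toy features and similarity matrix are illustrative only.

```python
import numpy as np

# J_s = -Σ_ij (S_ij Θ_ij - log(1 + e^{Θ_ij})), with Θ_ij = ½ U_i·V_j.
def semantic_similarity_loss(U, V, S):
    theta = 0.5 * U @ V.T                       # pairwise inner products Θ
    # log(1 + e^x) computed stably as logaddexp(0, x)
    return -np.sum(S * theta - np.logaddexp(0.0, theta))

rng = np.random.default_rng(0)
U = rng.normal(size=(4, 16))   # image features (N × code length)
V = rng.normal(size=(4, 16))   # text features
S = np.eye(4)                  # toy similarity: only (i, i) pairs match

print(semantic_similarity_loss(U, V, S) > 0)  # → True
```

Each term is a per-pair logistic loss, so J_s is always positive; gradient descent on it pushes Θ_ij up for similar pairs and down for dissimilar ones, which is exactly the Hamming-distance behavior described above.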

Next, there is a certain quantization error when the continuous network outputs are binarized into hash codes. Therefore, we propose the hash code discrete loss:

    J_q = ‖H − U‖²_F + ‖H − V‖²_F,  s.t.  H ∈ {−1, +1}^(K×N),

where U and V are the continuous outputs of the image and text hash code branches. For each image modal data x_i and each text modal data y_j, the derivative of the objective function can be obtained and back-propagated through the corresponding network.

Algorithm 1 Optimization Procedure of DLFFH
Input: Image set X, text set Y, label set L and semantic similarity matrix S.
Output: Parameters θ_x, θ_x_hash, θ_x_label, θ_y, θ_y_hash, θ_y_label of the two networks, and hash code matrix H.
Initialization: Initialize parameters θ_x, θ_x_hash, θ_x_label, θ_y, θ_y_hash, θ_y_label, γ, η, mini-batch sizes N_x, N_y, maximum iteration number T_max, image network iteration number T_x = n/N_x and text network iteration number T_y = n/N_y.
repeat
  for iter = 1, 2, . . . , T_x do
    Randomly select N_x samples from X.
    Calculate U*_i = f(x_i, l_i; θ_x, θ_x_hash) and L^x*_i = f(x_i, l_i; θ_x, θ_x_label) by forward propagation.
    Compute the corresponding derivatives using (7), (8).
    Update θ_x, θ_x_hash and θ_x_label by the BP algorithm.
  end for
  for iter = 1, 2, . . . , T_y do
    Randomly select N_y samples from Y.
    Calculate V*_j = g(y_j, l_j; θ_y, θ_y_hash) and L^y*_j = g(y_j, l_j; θ_y, θ_y_label) by forward propagation.
    Compute the corresponding derivatives using (9), (10).
    Update θ_y, θ_y_hash and θ_y_label by the BP algorithm.
  end for
  Learn H using (12).
until T_max iterations are reached

With P = η(U + V), the hash code matrix can be optimized by the following formula:

    H = sign(P) = sign(U + V). (12)

The trained DLFFH can generate hash codes for data outside the training set. We can take an instance of either modality as the input of the network and generate the corresponding hash code through forward propagation. Specifically, given an instance x_q of the image modality and its corresponding label l_q, its hash code can be obtained by the following formula:

    h_xq = sign(f(x_q, l_q; θ_x, θ_x_hash)). (13)

Similarly, for the text modality y_q, we have:

    h_yq = sign(g(y_q, l_q; θ_y, θ_y_hash)). (14)
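The closed-form hash code update above can be sketched directly. The continuous branch outputs U, V are random toy values here, standing in for the real network outputs.

```python
import numpy as np

# H = sign(P) with P = η(U + V); since η > 0 does not change signs,
# this reduces to H = sign(U + V), as in the update described above.
rng = np.random.default_rng(0)
U = rng.normal(size=(6, 16))   # image hash-branch outputs (N × K)
V = rng.normal(size=(6, 16))   # text hash-branch outputs (N × K)
eta = 0.1

P = eta * (U + V)
H = np.sign(P)                 # binary code matrix, entries in {-1, +1}
assert np.array_equal(H, np.sign(U + V))  # η drops out of the sign
print(H.shape)  # → (6, 16)
```

The same sign(·) operation implements the out-of-sample formulas (13) and (14): run a query through its modality's hash branch and binarize the output.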

The MIRFLICKR-25K dataset [41] contains 25,000 image-text pairs.

In this article, we build the DLFFH based on the TensorFlow framework. Except for the image feature learning network, which adopts pre-trained parameters, the other network parameters are randomly initialized. The hyper-parameters of the objective function are set to γ = 1 and η = 0.1; a detailed hyper-parameter analysis is given in the following sections. The mini-batch size is 128 and the number of model training iterations is 300. The learning rate decreases from 10^−2 to 10^−6 as the iterations increase. For all models, we run five times in turn and report the average.
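One plausible schedule matching the decay described above is log-linear (exponential) decay from 10^−2 to 10^−6 over the 300 iterations. The exact schedule is not specified in the text, so treat this as an assumption.

```python
# Exponential decay of the learning rate from lr_start at iteration 0
# to lr_end at the final iteration, interpolating log-linearly between.
def learning_rate(t, t_max=300, lr_start=1e-2, lr_end=1e-6):
    frac = t / (t_max - 1)                  # 0 at start, 1 at the end
    return lr_start * (lr_end / lr_start) ** frac

print(f"{learning_rate(0):.0e}")    # → 1e-02
print(f"{learning_rate(299):.0e}")  # → 1e-06
```

A step-wise or polynomial decay would also fit the description; the constraint is only the endpoints and monotone decrease.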

For the activation functions applied in DLFFH, we apply the identity function in the hash code branches, the sigmoid function in the feature label branches, and the Rectified Linear Unit (ReLU) [45] in all remaining layers.

Although deep network models can achieve good performance, they ignore the unique label features of each data item in the feature learning process. In contrast, we embed the semantic label information into the hash code through feature fusion and set feature label branches to further increase the discrimination of the hash codes. Therefore, the performance of DLFFH is effectively improved. In addition, the performance of most models is positively correlated with the length of the hash code.

VOLUME 10, 2022
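The activation choices described in this section (ReLU in hidden layers, identity on the hash code branch, sigmoid on the feature label branch) can be sketched with a toy forward pass. Layer sizes and weights are illustrative, not the paper's architecture.

```python
import numpy as np

def relu(x):    return np.maximum(0.0, x)
def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
x = rng.normal(size=10)             # a toy feature vector
W_hidden = rng.normal(size=(8, 10))
W_hash = rng.normal(size=(16, 8))   # toy 16-bit hash code branch
W_label = rng.normal(size=(3, 8))   # toy branch for 3 label categories

h = relu(W_hidden @ x)              # ReLU in the shared hidden layer
hash_out = W_hash @ h               # identity activation: raw real values
label_out = sigmoid(W_label @ h)    # sigmoid: values in (0, 1)

print(hash_out.shape, label_out.shape)  # → (16,) (3,)
```

The identity output leaves the hash branch free to produce values of either sign (later binarized by sign(·)), while the sigmoid output can be compared against {0,1} label entries in the feature label loss.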

To verify the effect of feature fusion and feature label branches in DLFFH, we design two variants for comparison: (a) DLFFH-1 removes the label network and feature fusion in each modality; (b) DLFFH-2 is constructed by deleting the feature label branch and retaining only the hash code branch.

In this article, we embed semantic label information into data features through the label network and feature fusion, which makes the data features generated by the network more representative. This more effectively captures the semantic correlation and bridges the semantic gap between multi-modal data. In addition, the feature label branch makes the generated hash codes more discriminative. Numerous experimental results on three general datasets prove that deep label feature fusion hashing achieves satisfactory performance.