Ranking-Based Deep Hashing Network for Image Retrieval

In large-scale image retrieval, deep learning-based hashing methods have made significant progress. However, most existing deep hashing methods still suffer from inefficient feature learning and weak discrimination of ranking relationships. To remedy these problems, a novel Ranking-based Deep Hashing Network (RDHN) is proposed for image retrieval in this paper, which integrates a feature learning module and a hash learning module into a unified deep hashing network to jointly learn a powerful hash function, so that raw images can be mapped to highly discriminative discrete hash codes. Specifically, a novel difference convolution is designed based on edge detection operators and applied exclusively to the first convolutional layer of the convolutional neural network (CNN); it exploits the sensitivity of edge detection operators to edge information to extract richer image edge information. Meanwhile, in hash learning, the ranking metric Mean Average Precision (MAP) is optimized using a scaling idea, and a MAP-based ranking loss function is carefully designed to enhance the neighborhood ranking capability of the hash codes. Furthermore, to reduce the quantization error, a quantization loss function is also designed. Finally, the ranking loss function is combined with the quantization loss function to form the objective function. The proposed method generates high-quality discrete hash codes while learning to preserve ranking information, effectively improving retrieval performance. Extensive experimental results on three widely used benchmark datasets demonstrate that the proposed method outperforms state-of-the-art hashing approaches.


I. INTRODUCTION
With the explosive growth of information, more than three billion pictures are reportedly released on social networks every day. For such a massive amount of information, effectively retrieving the objects of interest is crucial. In fact, image retrieval is not only a research hotspot in computer vision and artificial intelligence [1], but also has significant application value in daily life [2], [3], remote sensing [4], medicine [5], and other fields.
For large-scale databases, image retrieval systems usually suffer from high storage cost and slow search, problems that are effectively alleviated by Approximate Nearest Neighbor (ANN) search methods. Among ANN-based image retrieval methods, hashing is a representative one. It maps data features from a high-dimensional space to Hamming space, offering fast retrieval and low storage cost. Generally, hashing methods can be divided into traditional hashing and deep hashing methods. The most representative traditional hashing method is Locality Sensitive Hashing (LSH) [6], which is data-independent. In contrast, data-dependent methods have been proposed, such as Spectral Hashing (SH) [7], Iterative Quantization (ITQ) [8], Minimal Loss Hashing (MLH) [9], Kernel-based Supervised Hashing (KSH) [10], Binary Reconstructive Embedding (BRE) [11], and Supervised Hamming Hashing (SHH) [12]. However, due to the limitations of hand-crafted features, traditional hashing methods tend to perform poorly at extracting more complex semantic information. With the development of deep learning, deep learning-based hashing methods can not only learn more complex feature information, including semantic information, but also show excellent performance on large-scale image retrieval. Representative deep hashing methods include Convolutional Neural Network Hashing (CNNH) [13], Deep Neural Network Hashing (DNNH) [14], Deep Regularized Similarity Comparison Hashing (DRSCH) [15], Deep Semantic-preserving and Ranking-based Hashing (DSRH) [16], Deep Supervised Hashing (DSH) [17], Discriminative Deep Hashing (DDH) [18], Multi-Level Supervised Hashing (MLSH) [19], Deep Balanced Discrete Hashing (DBDH) [20], Central Similarity Quantization (CSQ) [21], Quadratic Spherical Mutual Information Hashing (QSMIH) [22], and Deep Hash Distillation (DHD) [23].
However, there still exist some problems in deep hashing methods, such as inefficient feature learning and weak discrimination of ranking relationships. In feature learning, it has been found that the feature extraction ability of Convolutional Neural Networks (CNNs) generally increases from the first convolutional layer to the last one [19], [24]; in other words, the feature extraction ability of the first convolutional layer in most CNNs is weak. On the other hand, visualization research on CNNs [25] shows that shallow layers mainly extract features such as edges and colors of input images, while deep layers usually extract more abstract features. At the same time, some edge detection operators, such as Roberts, Sobel, and Scharr, are designed specifically to extract image edge information. Therefore, the feature information extracted by the shallow layers of CNNs largely coincides with that extracted by edge detection operators. If an edge detection operator is combined with the convolutional kernel to realize a novel convolution operation applied to the shallow convolutional layer of a CNN, a new feature learning network can be formed, which helps enhance the feature extraction capability of the network.
In ranking learning, there are currently few research works on ranking-based deep hashing. Furthermore, most existing deep hashing methods [14], [15], [16], [17], [18], [19], [20], [21], [22] rely on metric learning with pairs [26] or triplets [27]. These methods usually require similar samples to be close to each other and dissimilar samples to keep a certain distance, without emphasizing the neighborhood ranking ability of hashing. Therefore, their ranking results do not necessarily fit the evaluation metrics that measure ranking quality in information retrieval, such as Mean Average Precision (MAP) and Normalized Discounted Cumulative Gain (NDCG), where MAP is the mean of Average Precision (AP). If a loss function in the form of MAP could be obtained by directly optimizing AP, not only could the ranking information be directly optimized, but retrieval performance measured by MAP could also be effectively improved. Nevertheless, AP is discrete and non-differentiable. To address this issue, researchers have optimized AP using structured SVM [28], gradient descent [29], an upper bound of AP [30], histogram block approximation [31], and sigmoid function approximation [32]. However, these methods are either not deep learning-based or not combined with hash learning, and most of them use single-batch retrieval rather than cross-batch retrieval [33].
To tackle the above problems, this paper proposes a novel Ranking-based Deep Hashing Network (RDHN) for image retrieval. The proposed RDHN framework is shown in Fig. 1 and mainly includes a feature learning module and a hash learning module. In feature learning, we combine an edge detection operator with the convolution kernel to realize a novel kind of convolution, called difference convolution. The difference convolution is then applied exclusively to the first convolutional layer of the deep network to form a new feature learning network that enhances the feature extraction ability of the network. In hash learning, the idea of scaling is used to optimize the discrete AP, and a ranking loss function in the form of MAP is designed to enhance the neighborhood ranking ability of the hash codes. Moreover, to reduce the quantization error introduced when discretizing hash codes and to improve the quality of the hash codes, a quantization loss function is carefully designed. Finally, the objective function is formed by combining the ranking loss function and the quantization loss function. Under the constraint and optimization of the objective function, the rich feature information extracted by the feature learning network is mapped into discriminative discrete hash codes. The main contributions of this paper are summarized as follows.
(1) A novel ranking-based deep hashing network is proposed, which combines the deep feature learning module and hash learning module into a unified framework to jointly learn a powerful hash function.
(2) In feature learning, a new kind of difference convolution is designed specifically for the first convolutional layer of the deep feature learning network to enhance its feature learning ability. In hash learning, an objective function that can optimize the discrete MAP is designed, which makes the network learn discriminative ranking information to enhance the neighborhood ranking ability of hash codes and reduce the quantization error.
(3) Extensive experimental results on three widely used datasets demonstrate that the proposed method obtains good ranking performance and outperforms other state-of-the-art hashing approaches.
The rest of this paper is organized as follows. Section II briefly reviews hashing-based and ranking-based image retrieval methods, as well as related work on feature extraction. The proposed method is detailed in Section III. The experimental results are analyzed and discussed in Section IV. Finally, the conclusion is given in Section V.
FIGURE 1. The proposed RDHN framework. The network framework combines the feature learning and hash learning modules, which can preserve good ranking information and learn discriminative discrete hash codes. The deep network consists of three convolutional layers, three pooling layers, and two fully connected layers.

II. RELATED WORK
This section briefly reviews related work on traditional hashing methods, deep hashing methods, ranking-based methods, and feature extraction methods for image retrieval.

A. HASHING METHODS FOR IMAGE RETRIEVAL
Early hashing methods are independent of the data distribution, such as the LSH method [6], which uses random projections to generate hash codes. However, this method needs long hash codes and many hash tables to obtain satisfactory retrieval precision, so it is difficult to apply to large-scale datasets. To alleviate this problem, hashing methods that rely on the data distribution have been proposed in succession. They are generally divided into supervised and unsupervised hashing methods, depending on whether data labels are used. SH [7] is a typical unsupervised hashing method that generates balanced hash codes by minimizing the correlation between different hash functions. Another unsupervised hashing method is ITQ [8], which learns similarity-preserving hash codes by rotating zero-centered PCA-projected data to minimize the quantization loss. Comparatively, supervised hashing methods achieve better retrieval performance than unsupervised ones. MLH [9] learns hash code representations based on structured prediction with latent variables and a hinge-like loss function. BRE [11] learns a hash function by explicitly minimizing the reconstruction error between the original Euclidean distances and the Hamming distances of the corresponding binary embeddings. However, the hash functions designed by these two methods are complex and result in expensive training costs. To solve this problem, KSH [10] designs a simple kernel-based hash function, which reduces the training cost and facilitates training on large-scale datasets. However, the retrieval performance of these hashing methods in Hamming space is still not satisfactory. To avoid this drawback, SHH [12] combines label regression with asymmetric similarity preserving, using a fusion strategy and a discrete algorithm to optimize hash codes, which improves retrieval performance in Hamming space. Although these traditional supervised hashing methods are superior to unsupervised hashing methods in retrieval performance, they remain limited in representing more complex semantic information.
Recently, with the development of deep learning, deep learning-based hashing methods have shown more powerful retrieval ability in image retrieval. The CNNH proposed by Xia et al. [13] is the first deep hashing method that combines CNNs and hashing for image retrieval. However, because this method separates feature learning from hash learning, the updated hash code information cannot be fed back to the feature learning process. To further improve performance, DNNH, an end-to-end hashing method, is proposed by Lai et al. [14]; it combines feature learning and hash coding based on a triplet loss and block encoding learning. Like DNNH, the DDH proposed by Lin et al. [18] also uses a block encoding module to enhance the discrimination of hash codes. Unlike DDH, DRSCH, proposed by Zhang et al. [15], uses a weighting method to generate scalable-length hash codes that preserve the discrimination of hash codes. To learn similarity-preserving hash codes on large-scale image datasets, Liu et al. [17] propose the pairwise-based DSH method. Unlike most methods that only learn hash codes to maintain similarity and ignore the preservation of semantic information, Yao et al. [16] present the triplet-based DSRH method, which preserves relative similarities among images while maintaining more of the image semantic structure to facilitate hash learning. Lai et al. [34] propose a deep architecture that learns instance-aware image representations for multi-label image data; based on this representation, binary hash codes can be generated for both semantic hashing and category-aware hashing. In addition, the MLSH method [19] applies a multiple-hash-table mechanism to integrate the multi-layer features of a CNN, retaining more structural and semantic information. To maintain the balance of hash codes, Zheng et al. [20] present the DBDH method, which maximizes the entropy of the discrete distribution of hash codes. Yuan et al. [21] propose the CSQ method, which encourages similar hash codes to approach a common center to improve hash learning efficiency and retrieval accuracy. On the other hand, to make full use of information-theoretic measures in the retrieval process, the QSMIH method [22] adopts Quadratic Mutual Information to optimize the learned hash codes, which better meets the needs of image hashing and information retrieval. To reduce the impact of data augmentation on hash codes, Jang et al. [23] propose the DHD scheme, which minimizes this discrepancy while exploiting the potential of augmented data. The correlation filtering hashing [35], asymmetric discrete hashing [36], and discrete class-specific prototype hashing [37] proposed by Ma et al. have achieved promising performance in narrowing semantic gaps, learning discriminative hash codes, and making full use of semantic information. Unlike these traditional hashing and deep hashing methods, the proposed method focuses on the impact of the ranking relationship on retrieval performance; it can simultaneously generate discriminative discrete hash codes and preserve good ranking information.

B. RANKING-BASED METHODS FOR IMAGE RETRIEVAL
As described by Liu et al. [39], many information retrieval problems are essentially ranking problems, such as text retrieval [40] and image retrieval [41]. Therefore, the performance of most information retrieval techniques can be enhanced by optimizing the ranking information. As mentioned above, the DSRH method effectively preserves the ranking performance of hash codes by adopting a triplet loss. Song et al. [41] propose a method to optimize the precision at the top positions of the Hamming distance ranking list, which significantly improves the retrieval effect. Similar to [41], Qin et al. [42] present a deep top similarity hashing method to preserve semantic similarity between the top images in the ranking list and the query image. Furthermore, Jin et al. [43] propose a novel deep ordinal hashing (DOH) method that learns ranking-based hash functions by encoding local spatial and global semantic information from deep networks. Among methods based on ranking metric optimization, Zhao et al. [44] and Wang et al. [45] propose to learn ranking-preserving hash codes by optimizing the ranking metric NDCG, which achieves better retrieval performance; however, these methods are mainly used for the retrieval of multi-label datasets. Currently, MAP-based optimization is used for both single-label and multi-label datasets. Various methods have been proposed to optimize the discrete AP. For example, Chapelle et al. [29] use a gradient-descent minimization method to smoothly approximate AP. Mohapatra et al. [30] devise a method to optimize an upper bound of AP. Revaud et al. [31] propose a histogram block approximation method to optimize AP. Brown et al. [32] propose relaxing AP with the sigmoid function to approximate the AP metric, achieving better retrieval performance in large-scale image retrieval. Yang et al. [33] use a lower bound of AP to optimize MAP and propose a cross-batch retrieval method for image classification and retrieval. However, none of these AP optimization methods are hashing methods.
For this reason, He et al. [46] devise a tie-aware version of AP for hash learning. Based on [33], Lu et al. [38] propose a method that combines ranking optimization with deep hashing and optimizes the hash ranking by making full use of category information to achieve better retrieval performance. However, similar to [33], this method weakens the ranking characteristics between hash codes because all relevant hash codes are replaced with their mean. Moreover, this method does not employ a quantization loss function; neither do [41] and [46]. Compared with these ranking-based deep hashing approaches, the proposed method pays more attention to the discrimination of ranking relationships and considers the impact of quantization error on retrieval performance, so as to obtain more discriminative discrete hash codes for image retrieval.

C. FEATURE EXTRACTION METHODS
In deep hashing methods, a hash function must usually be learned to obtain hash code representations. This requires not only an effective hash loss function, but also effective feature information for joint learning, which places higher demands on the feature learning of deep networks. Currently, most researchers adopt more complex deep network structures to improve the feature learning ability of the network. Typical deep network frameworks include LeNet [47], AlexNet [48], VGGNet [49], ResNet [50], Graph Convolutional Networks (GCN) [51], and the recently emerging feature learning network Vision Transformer (ViT) [52]. Moreover, some researchers utilize attention mechanisms to enhance feature extraction [43], [53], [54], while others combine global and local features to improve feature learning [43], [55]. Chen et al. [56] propose combining a CNN and a Recurrent Neural Network (RNN) for feature learning. Zhang et al. [57] adopt the capsule network to learn the effective information of images. However, few methods enhance the feature extraction efficiency of deep networks by improving the structure of the convolutional layers themselves, thereby improving retrieval performance. In recent years, difference convolution has been widely used in computer vision because it effectively promotes feature extraction and has strong applicability. For instance, the local binary convolutional neural network [59] applies local binary patterns (LBP) [58] to CNNs; central difference convolution is adopted for face liveness detection [60]; spatio-temporal difference convolution is exploited for gesture recognition [61]; and the pixel difference network is used for edge detection [62]. In this paper, a new difference convolution is introduced into the feature learning network to improve the feature extraction ability of the proposed network.

III. PROPOSED METHOD
In this paper, a Ranking-based Deep Hashing Network (RDHN) is presented for image retrieval. The overall network framework is shown in Fig. 1. This section describes in detail the two main parts of the proposed framework, namely the feature learning part and the hash learning part. The design and composition of the feature learning network are described first; then the ranking loss function and the quantization loss function of the hash learning part are introduced.

A. FEATURE LEARNING
Based on CNN, this paper proposes to apply difference convolution in the first convolutional layer of the deep network to improve the feature learning ability and enrich the information quantity of feature extraction.
As discussed in Section I, the feature information extracted by the shallow layers of a CNN has many similarities with that extracted by edge detection operators, while the latter extract it more rapidly and in a more target-oriented manner. Furthermore, edge detection operators usually extract edge information only from the raw input image, whereas the inputs of the second and subsequent convolutional layers in a CNN are no longer the raw image. Therefore, we apply difference convolution only to the first convolutional layer of the CNN. Fig. 2(a)-(c) shows the three edge detection operators (filters) used to implement difference convolution. Fig. 2(a) depicts the Sobel operator, a discrete difference operator that effectively extracts image edge information but performs poorly on weak edges. The Scharr operator is shown in Fig. 2(b). It enhances the Sobel operator by enlarging the weighting coefficients in the operator, increasing the differences between pixel values, and can therefore effectively extract weak edge information. Fig. 2(c) illustrates the binary difference operator, which expands the 2×2 Roberts operator to a 3×3 operator and detects edge information through local differences, following the same operating principle as the Roberts operator.
Based on the three operators mentioned above, a uniform equation can be used to realize difference convolution, implemented as follows. First, following the terminology of Yu et al. [60], the convolution commonly used in CNNs is called vanilla convolution, to distinguish it from difference convolution. Let $x$ be the input feature map, and let $w_c$ and $w_d$ be the filter weights of the vanilla and difference convolutions, respectively. Then the feature map $y_c$ of the vanilla convolution and the feature map $y_d$ of the difference convolution can be defined, respectively, as

$$y_c = w_c \otimes x \tag{1}$$

$$y_d = w_d \otimes x \tag{2}$$

where $\otimes$ denotes the convolution operation. To give the convolutional layer better edge detection characteristics, a uniform equation is used to combine vanilla convolution and difference convolution; that is, Eqs. (1) and (2) are linearly combined, and the final output feature map $y$ is defined as

$$y = y_c + \lambda y_d \tag{3}$$

where $\lambda$ is a weight parameter, set as a learnable parameter to encourage better fusion of the vanilla and difference convolution features. In this paper, a local region $x$ of size 3×3 is adopted, as shown in Fig. 2(d), where $E_i$, $i = 0, 1, \cdots, 8$, denotes the position of the $i$th element in the feature map. Then Eq. (1) can also be expressed as

$$y_c = \sum_{i=0}^{8} w_c(E_i) \cdot x(E_i) \tag{4}$$

where $x(E_i)$ is the value of the feature map at position $E_i$ and $w_c(E_i)$ is the value of the vanilla convolution filter weight at position $E_i$. To verify the feasibility of using other forms of difference convolution in the first convolutional layer of CNNs, we also introduce central difference convolution [60] into our method. The central difference refers to the difference between a neighborhood pixel value and the central pixel value, and its convolution result is equivalent to the weighted sum of the eight binary differences. The central difference convolution reuses the filter weight $w_c$ of the vanilla convolution, so its output feature map can be represented as

$$y_d = \sum_{i=0}^{8} w_c(E_i) \cdot \big(x(E_i) - x(E_4)\big) \tag{5}$$

where $E_4$ is the central position. To combine the characteristics of vanilla convolution and difference convolution, as in Eq. (3), central difference convolution fuses the intensity-level semantic information extracted by vanilla convolution with the gradient-level detail extracted by difference convolution. Unlike [60], we combine Eqs. (4) and (5) with the learnable parameter $\lambda$, so that the final output feature map $y$ of the central difference convolution can be formulated as

$$y = \sum_{i=0}^{8} w_c(E_i) \cdot x(E_i) + \lambda \sum_{i=0}^{8} w_c(E_i) \cdot \big(x(E_i) - x(E_4)\big) \tag{6}$$

The above describes the implementations of the central difference convolution and of the convolutions based on the Sobel, Scharr, and binary difference operators. However, in Fig. 2(a)-(c), the three edge detection operators are fixed and non-learnable. If they were used directly as filter templates for the convolutional layer, the extracted feature maps would be relatively simple and uniform. To enrich the diversity of feature information extracted by the first convolutional layer of the deep network, we first rotate each of the three edge detection operators counterclockwise eight times by 45° each, forming three filter templates, shown respectively in Fig. 2(e)-(g). This processing is equivalent to a data augmentation of the filters. Finally, we replicate each of the three filter templates four times to obtain the 32 filters of the first convolutional layer in the deep network shown in Fig. 1, as sketched in the code below.
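For concreteness, the following is a minimal PyTorch sketch of such a first layer under our reading of the construction: the difference filters are fixed edge-operator masks diversified by 45° rotations (realized here by cycling the outer ring of the 3×3 mask), while the vanilla filters and λ are learned. The names `rotate45`, `orientation_bank`, and `DifferenceConv` are ours for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Outer ring of a 3x3 kernel, traversed counterclockwise; cycling the ring by
# one position corresponds to one 45-degree rotation of the operator.
RING = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (1, 2), (0, 2), (0, 1)]

def rotate45(k):
    """One 45-degree rotation of a 3x3 operator (center element stays put)."""
    out = k.clone()
    vals = [k[r, c] for r, c in RING]
    vals = vals[-1:] + vals[:-1]                  # shift the ring by one step
    for (r, c), v in zip(RING, vals):
        out[r, c] = v
    return out

def orientation_bank(base, copies=4):
    """Eight 45-degree rotations of `base`, tiled `copies` times -> 32 filters."""
    rots, k = [], base
    for _ in range(8):
        rots.append(k)
        k = rotate45(k)
    return torch.stack(rots).repeat(copies, 1, 1)  # (32, 3, 3)

class DifferenceConv(nn.Module):
    """First-layer convolution: learnable vanilla conv fused with fixed
    edge-operator (difference) filters via a learnable lambda (Eq. (3))."""
    def __init__(self, in_ch=3):
        super().__init__()
        sobel = torch.tensor([[-1., 0., 1.],
                              [-2., 0., 2.],
                              [-1., 0., 1.]])      # Fig. 2(a); swap in Scharr etc.
        w_d = orientation_bank(sobel)[:, None, :, :].repeat(1, in_ch, 1, 1)
        self.register_buffer("w_d", w_d)           # fixed difference filters w_d
        self.vanilla = nn.Conv2d(in_ch, 32, 3, padding=1)   # learnable w_c
        self.lam = nn.Parameter(torch.tensor(0.5))          # learnable lambda

    def forward(self, x):
        y_c = self.vanilla(x)                       # Eq. (1)
        y_d = F.conv2d(x, self.w_d, padding=1)      # Eq. (2), 32 output channels
        return y_c + self.lam * y_d                 # Eq. (3)
```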
For convenience of comparison and illustration, we apply each of the four difference convolutions described above to the first convolutional layer of the deep network shown in Fig. 1; the resulting deep hashing network models are referred to as RDHN1, RDHN2, RDHN3, and RDHN4, respectively. The deep hashing model with vanilla convolution is referred to as RDHN0. Specifically, the deep network shown in Fig. 1 consists of three convolutional layers, three pooling layers, and two fully connected layers. When using vanilla convolution, the three convolutional layers use 32, 32, and 64 filters of size 5×5, respectively, and Batch Normalization (BN) is added after each convolutional layer. When using difference convolution, we apply it only to the first convolutional layer, with 32 filters of size 3×3. The pooling layers consist of one max pooling layer and two average pooling layers; the pooling window is 3×3 with stride 2. The first fully connected layer has 500 units. The hash layer has C units, where C is the length of the hash code. All convolutional layers and the first fully connected layer use the rectified linear unit (ReLU) activation function.
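The following is a hedged PyTorch sketch of this backbone; the filter counts, pooling settings, and layer sizes follow the text, while the padding choices are our assumption, and `RDHNBackbone` is our name for it.

```python
import torch.nn as nn

class RDHNBackbone(nn.Module):
    """Sketch of the feature learning network in Fig. 1: three conv layers
    (32/32/64 filters) each followed by BN and ReLU, one max pool and two
    average pools (3x3, stride 2), a 500-unit FC layer, and a C-unit hash
    layer."""
    def __init__(self, code_len=16, in_ch=3, first_conv=None):
        super().__init__()
        if first_conv is None:                    # vanilla 5x5 first layer (RDHN0)
            first_conv = nn.Conv2d(in_ch, 32, 5, padding=2)
        self.features = nn.Sequential(
            first_conv, nn.BatchNorm2d(32), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(32, 32, 5, padding=2), nn.BatchNorm2d(32), nn.ReLU(inplace=True),
            nn.AvgPool2d(3, stride=2),
            nn.Conv2d(32, 64, 5, padding=2), nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.AvgPool2d(3, stride=2),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(500), nn.ReLU(inplace=True),   # 500-unit FC layer
            nn.Linear(500, code_len),                    # hash layer with C units
        )

    def forward(self, x):
        return self.fc(self.features(x))  # real-valued h; codes are b = sign(h)
```

Passing `first_conv=DifferenceConv()` from the previous sketch would yield a difference-convolution variant in the spirit of RDHN1.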

B. HASH LEARNING
In order to efficiently map the rich feature information extracted by the feature learning module into discriminative discrete hash codes, a new objective function is designed, consisting of a ranking loss function and a quantization loss function. For ease of understanding, the symbols used in this paper and their meanings are listed in Table 1.

1) RANKING LOSS FUNCTION
Let $x_q$ be the query sample and $X_D$ the database sample set. Then $X_S$ denotes the set of relevant samples $x_s$ of the query $x_q$, that is, the set of all samples in the database $X_D$ that belong to the same category as $x_q$. The AP for the query sample $x_q$ can be defined as

$$\mathrm{AP}(x_q) = \frac{1}{K}\sum_{x_s \in X_S} \frac{R_{X_S}(x_s)}{R_{X_D}(x_s)} \tag{7}$$

where $K$ is the total number of samples in the relevant sample set $X_S$, and $R_{X_S}(x_s)$ and $R_{X_D}(x_s)$ are the ranks of the relevant sample $x_s$ in $X_S$ and in $X_D$, respectively. In this paper, we adopt cross-batch retrieval [33]. Specifically, given the training sample set $X$, the current batch $X_i$ of $M$ training samples is used as the mini-batch network input and serves as the query sample set (because the query sample set is identical to the current batch of training samples, we use the same symbol $X_i$ for convenience), while the training samples of the remaining batches in $X$ serve as the database $X_D$. Let $B_{X_i} = \{b_{x_q}\}_{q=1}^{M}$ be the hash code set of the query sample set $X_i$, and let $B_{X_D}$ and $B_{X_S}$ be the hash code sets of $X_D$ and $X_S$, respectively. Then the MAP of the query sample set $X_i$ can be defined as the AP of the hash codes averaged over the $M$ query samples $x_q$:

$$\mathrm{MAP}(X_i) = \frac{1}{M}\sum_{q=1}^{M} \frac{1}{K}\sum_{s=1}^{K} \frac{R_{B_{X_S}}(b_{x_q}, b_{x_s})}{R_{B_{X_D}}(b_{x_q}, b_{x_s})} \tag{8}$$

where $b_{x_q}$, $b_{x_s}$, and $b_{x_d}$ are the hash codes of the query sample $x_q$, the relevant sample $x_s$, and the database sample $x_d$, respectively. $R_{B_{X_S}}(b_{x_q}, b_{x_s})$ is the rank of $b_{x_s}$ in the ranking list of the relevant hash code set $B_{X_S}$, computed by comparing the similarity between $b_{x_q}$ and $b_{x_s}$ with the similarity between $b_{x_q}$ and each hash code $b_{x_{si}}$ in $B_{X_S}$, as shown in Eq. (9); similarly, $R_{B_{X_D}}(b_{x_q}, b_{x_s})$ is the rank of $b_{x_s}$ in the ranking list of the database hash code set $B_{X_D}$, obtained by comparing the similarity between $b_{x_q}$ and $b_{x_s}$ with the similarity between $b_{x_q}$ and each $b_{x_d}$ in $B_{X_D}$, as shown in Eq. (10):

$$R_{B_{X_S}}(b_{x_q}, b_{x_s}) = \sum_{b_{x_{si}} \in B_{X_S}} \tau\big(\langle b_{x_q}, b_{x_{si}}\rangle - \langle b_{x_q}, b_{x_s}\rangle\big) \tag{9}$$

$$R_{B_{X_D}}(b_{x_q}, b_{x_s}) = \sum_{b_{x_d} \in B_{X_D}} \tau\big(\langle b_{x_q}, b_{x_d}\rangle - \langle b_{x_q}, b_{x_s}\rangle\big) \tag{10}$$
where $\langle\cdot,\cdot\rangle$ denotes the inner product, which we use to represent the similarity between hash codes, and $\tau(\cdot)$ is the Heaviside step function: $\tau(z) = 0$ if $z < 0$, and $\tau(z) = 1$ otherwise. According to Eqs. (9) and (10), Eq. (8) can be rewritten as

$$\mathrm{MAP}(X_i) = \frac{1}{M}\sum_{q=1}^{M}\frac{1}{K}\sum_{s=1}^{K} \frac{\sum_{b_{x_{si}} \in B_{X_S}} \tau\big(\langle b_{x_q}, b_{x_{si}}\rangle - \langle b_{x_q}, b_{x_s}\rangle\big)}{\sum_{b_{x_d} \in B_{X_D}} \tau\big(\langle b_{x_q}, b_{x_d}\rangle - \langle b_{x_q}, b_{x_s}\rangle\big)} \tag{11}$$

However, observing Eq. (11), two issues must be solved to optimize MAP effectively. First, the Heaviside step function is non-differentiable. In the numerator, this can be handled by converting the numerator term of Eq. (11) into a constant: since $R_{B_{X_S}}(b_{x_q}, b_{x_s})$ is the rank of the relevant hash code $b_{x_s}$ in the ranking list of $B_{X_S}$, its ranking result $r_{x_s}$ is a positive integer from 1 to $K$. Thus, Eq. (11) can be further rewritten as

$$\mathrm{MAP}(X_i) = \frac{1}{M}\sum_{q=1}^{M}\frac{1}{K}\sum_{s=1}^{K} \frac{r_{x_s}}{\sum_{b_{x_d} \in B_{X_D}} \tau\big(\langle b_{x_q}, b_{x_d}\rangle - \langle b_{x_q}, b_{x_s}\rangle\big)} \tag{12}$$

In Eq. (12), the numerator term $r_{x_s}$ is a constant and no longer needs to be optimized. For the denominator term, we use the scaling idea and optimize a lower bound of MAP (lbMAP). Specifically, for any real number $z$, the exponential function $e^z$ and the step function $\tau(z)$ satisfy

$$e^z > \tau(z) = 0 \;\; (z < 0), \qquad e^z \ge \tau(z) = 1 \;\; (z \ge 0) \tag{13}$$

Therefore, for any real number $z$,

$$e^z \ge \tau(z) \tag{14}$$

According to Eq. (14), if each step function in the denominator of Eq. (12) is replaced by the exponential function, the lbMAP can be defined as

$$\mathrm{lbMAP}(X_i) = \frac{1}{M}\sum_{q=1}^{M}\frac{1}{K}\sum_{s=1}^{K} \frac{r_{x_s}}{\sum_{b_{x_d} \in B_{X_D}} \exp\big(\langle b_{x_q}, b_{x_d}\rangle - \langle b_{x_q}, b_{x_s}\rangle\big)} \tag{15}$$

Meanwhile, Eq. (14) shows that lbMAP is indeed a lower bound of MAP: every denominator in Eq. (15) is at least as large as the corresponding denominator in Eq. (12), so every summand is no larger, namely

$$\frac{r_{x_s}}{\sum_{b_{x_d}} \exp\big(\langle b_{x_q}, b_{x_d}\rangle - \langle b_{x_q}, b_{x_s}\rangle\big)} \le \frac{r_{x_s}}{\sum_{b_{x_d}} \tau\big(\langle b_{x_q}, b_{x_d}\rangle - \langle b_{x_q}, b_{x_s}\rangle\big)} \tag{16}$$

Hence, $\mathrm{lbMAP}(X_i) \le \mathrm{MAP}(X_i)$.
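The bound can be checked numerically. The toy snippet below compares the step-function form of Eq. (12) with the exponential form of Eq. (15) for a single query; the similarity values are made up for illustration, and `rank_terms` is our name for the helper.

```python
import torch

def rank_terms(sim_db, sim_rel, smooth=False):
    """Denominators of Eqs. (12)/(15): for each relevant similarity, count
    (step) or upper-bound (exp) the database similarities that reach it."""
    diff = sim_db[None, :] - sim_rel[:, None]          # (K, N)
    ind = torch.exp(diff) if smooth else (diff >= 0).float()
    return ind.sum(dim=1)

# Toy example: one query, K=3 relevant codes, N=6 database codes.
sim_db  = torch.tensor([4., 2., -1., 3., 0., -2.])     # <b_q, b_d>
sim_rel = torch.tensor([4., 3., 0.])                   # <b_q, b_s>
r = torch.arange(1, 4).float()                         # numerator constants r_xs

ap    = (r / rank_terms(sim_db, sim_rel)).mean()               # Eq. (12)
lb_ap = (r / rank_terms(sim_db, sim_rel, smooth=True)).mean()  # Eq. (15)
assert lb_ap <= ap    # e^z >= tau(z) enlarges every denominator (Eq. (16))
print(float(ap), float(lb_ap))
```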
To maximize MAP, we can maximize lbMAP. Nevertheless, another obstacle to MAP optimization is that the hash codes are discrete, which prevents effective back-propagation through the proposed network during training. To address this issue, we replace the hash code $b$ with the real-valued output $h$ of the hash layer for the input image $x$. Thus, Eq. (15) can be rewritten as

$$\mathrm{lbMAP}(X_i) = \frac{1}{M}\sum_{q=1}^{M}\frac{1}{K}\sum_{s=1}^{K}\frac{r_{x_s}}{\sum_{x_d \in X_D}\exp\big(\langle h_{x_q}, h_{x_d}\rangle - \langle h_{x_q}, h_{x_s}\rangle\big)} \tag{17}$$

In this way, lbMAP is fully differentiable. On the other hand, the goal during training is usually to minimize a loss function, which is equivalent to minimizing the negative of lbMAP. Therefore, the ranking loss function $L_R$ can be defined as

$$L_R = -\mathrm{lbMAP}(X_i) \tag{18}$$

In addition, the computational complexity of the training process must also be considered. If the database codes were updated once for each query sample or each batch, training would be difficult because of the large computational cost. This is similar to the problem encountered in [31] with cross-batch retrieval: every update of the network weights implies an update of the database hash codes, which increases training time. Therefore, to reduce training time, we update the database hash codes once per epoch. The disadvantage of this trick is that the mismatch between the new weight parameters and the stale database hash codes may affect the retrieval performance of the model. Moreover, to further reduce the computational cost, we use the mean of the relevant hash codes of the query in place of each individual relevant hash code; that is, we replace each relevant hash code $b_{x_s}$ with the mean of the relevant hash code set $B_{X_S}$ of the query. Accordingly, the real value $h_{x_s}$ corresponding to the relevant hash code $b_{x_s}$ is replaced with the fixed value $\bar{h}_{x_s} = \frac{1}{K}\sum_{s=1}^{K} h_{x_s}$. For example, for the differences between $n$ relevant 16-bit hash codes and $m$ database 16-bit hash codes, the computational complexity per database code is $O(n)$; if the mean of the relevant hash codes is used instead, it becomes $O(1)$. The proposed trick therefore significantly reduces the computational complexity. We discuss its impact on model performance in detail in Section IV-C-3). Meanwhile, to retain as much of the ranking information of the hash codes as possible and to enhance the discrimination of the ranking relationship, the hash codes of the database samples are not replaced by the mean of their relevant hash codes. Thus, Eq. (18) can be rewritten as

$$L_R = -\frac{1}{M}\sum_{q=1}^{M}\frac{W}{K}\cdot\frac{1}{\sum_{x_d \in X_D}\exp\big(\langle h_{x_q}, h_{x_d}\rangle - \langle h_{x_q}, \bar{h}_{x_s}\rangle\big)} \tag{19}$$

where $W = \sum_{r=1}^{K} r_{x_s}$. Since $W$ is a constant, it can be considered a weight parameter of the ranking loss function. For a single-label dataset, $K$ is simply the total number of samples of a category. For a multi-label dataset, however, the number of samples differs across categories, so the number of relevant samples varies greatly between queries of different categories, which easily leads to data imbalance. To solve this problem, in accordance with the characteristics of the designed ranking loss function, this paper uses a uniform fixed value as $K$. The impact of data imbalance is then reduced by penalizing the number of relevant samples of queries of different categories with different weights: if the number of relevant samples of a query is greater than $K$, the penalty is increased; otherwise, it is reduced.
The effect of different K values on the performance of the proposed method will be evaluated experimentally in Section IV-D-4).
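Putting the pieces together, the following is a minimal PyTorch sketch of the ranking loss of Eq. (19), under our reading that the constant W equals the sum of the ranks 1 + 2 + ... + K; the class name `RankingLoss` and the batch layout are ours for illustration.

```python
import torch
import torch.nn as nn

class RankingLoss(nn.Module):
    """Sketch of L_R in Eq. (19). h_q: (M, C) hash-layer outputs of the query
    batch; h_db: (N, C) database outputs, refreshed once per epoch; h_rel:
    (M, C) the MEAN of each query's relevant outputs (the trick above)."""
    def __init__(self, K):
        super().__init__()
        self.K = K
        self.W = K * (K + 1) / 2.0   # our reading of W = sum of the ranks 1..K

    def forward(self, h_q, h_db, h_rel):
        sim_db  = h_q @ h_db.t()                          # (M, N) inner products
        sim_rel = (h_q * h_rel).sum(dim=1, keepdim=True)  # (M, 1) <h_q, mean h_s>
        # Denominator of Eq. (19); the quantization loss below keeps |h| small
        # so that these exponentials stay within floating-point range.
        denom = torch.exp(sim_db - sim_rel).sum(dim=1)    # (M,)
        lb_map = (self.W / self.K) / denom                # per-query lbMAP term
        return -lb_map.mean()                             # Eq. (18): min -lbMAP
```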

2) QUANTIZATION LOSS FUNCTION
In Eq. (17), the real-valued output $h$ of the hash layer replaces the discrete hash code $b$, so $h$ must eventually be quantized into the discrete hash code $b$ by the sign function, namely

$$b = \mathrm{sign}(h) \tag{20}$$

However, this mandatory quantization inevitably leads to quantization errors. To address this issue, most current work uses continuous relaxation; for example, the hash layer uses a sigmoid or tanh activation function for continuous encoding. To generate higher-quality hash codes, Cao et al. [65] adopt the smooth activation function $\tanh(\beta x)$, which adds a hyperparameter $\beta$ to approximate the sign function. However, such relaxation usually makes gradient updates slow, leading to a time-consuming training process and even vanishing gradients. To reduce training time and quantization error, another relaxation approach, represented mainly by DSH, adds a penalty term $\| |h| - 1 \|_1$ to the loss function. Discretization methods, on the other hand, differ from relaxation methods; for example, the straight-through estimator used in RODH [38] and DBDH [20] applies the hard tanh (Htanh) function $\max(-1, \min(1, h))$ during back-propagation. Although this obtains discrete hash codes during training without a quantization loss, the gradient is zero whenever $h$ is greater than 1 or less than $-1$, so, like the tanh activation, its training is slow. In contrast, the quantization loss term $\|h - \mathrm{sign}(h)\|_1$ used by Li et al. [66] not only yields better discrete hash codes with fast training, but also achieves good retrieval performance. In this paper, to achieve faster training, reduce the loss of retrieval performance caused by the ranking optimization, and improve the discriminative ability of the hash encoding, and taking the characteristics of the ranking loss function into account, a novel quantization loss function is defined as

$$L_Q = \left\| \tanh(\beta h) - \mathrm{sign}(h) \right\|_1 \tag{21}$$

where $\beta$ is a hyperparameter that controls the saturation. Its role is different from that in [65]. First, instead of using $\tanh(\beta h)$ in the hash layer, we use it in the loss function to avoid slow training. Second, the main purpose of increasing $\beta$ is to facilitate the optimization of the ranking loss function, because the exponential function used in the ranking loss may exceed the storage range of a 64-bit computer when the hash code is long. For example, for a 48-bit query hash code, if one of its relevant hash codes is identical to it, the inner product (similarity) between them is 48, and $e^{48} > 2^{64}$. In this case, adding the hyperparameter $\beta$ to the quantization loss term keeps the value of the exponential function from becoming too large and expands the optimization space of the loss function. In other words, adding $\beta$ allows the real-valued output of the hash layer to saturate at a smaller magnitude without changing its sign, which has no effect on Eq. (20).
It should be noted that previous studies generally assumed the hash layer output should approximate the discrete values (+1/-1), from which the above processing deviates. However, the evaluation experiments on the quantization loss term in Sections IV-D-1) to IV-D-3) will show that the proposed quantization loss function helps the network obtain discrete hash codes with better ranking information, lower quantization error, and higher retrieval performance. Finally, the ranking loss function and the quantization loss function are combined through a weight parameter $\alpha$ to form the objective function:

$$L = L_R + \alpha L_Q \tag{22}$$

To optimize the network parameters, we use the Adam optimizer to minimize this objective function. Algorithm 1 details the proposed RDHN algorithm; a code sketch of the loss computation follows below.
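The sketch below shows the quantization loss under our reading of Eq. (21), pulling tanh(βh) toward sign(h), together with the combined objective of Eq. (22); the function names are ours, and the hyperparameter values in the usage comment are the CIFAR-10 settings reported in Section IV-A.

```python
import torch

def quantization_loss(h, beta):
    """Our reading of Eq. (21): pull tanh(beta*h) toward the code sign(h).
    A larger beta lets smaller-magnitude outputs saturate, keeping the
    inner products in the ranking loss small enough that exp(.) does not
    overflow (e.g., e^48 > 2^64 for identical 48-bit codes)."""
    return (torch.tanh(beta * h) - torch.sign(h)).abs().sum(dim=1).mean()

def objective(h_q, h_db, h_rel, ranking_loss, alpha, beta):
    """Eq. (22): L = L_R + alpha * L_Q."""
    return ranking_loss(h_q, h_db, h_rel) + alpha * quantization_loss(h_q, beta)

# One Adam step with the CIFAR-10 settings from Section IV-A:
# opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=0.004)
# loss = objective(h_q, h_db, h_rel, RankingLoss(K), alpha=10.0, beta=6.0)
# opt.zero_grad(); loss.backward(); opt.step()
```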

IV. EXPERIMENTS
This section first introduces the three benchmark datasets, the experimental settings, and the evaluation methods. Then, extensive evaluation experiments are conducted on these three datasets, and the experimental results are compared with other state-of-the-art hashing methods.

A. DATASETS AND EXPERIMENTAL SETTINGS
The three widely used benchmark datasets are CIFAR-10, CIFAR-20, and NUS-WIDE. The CIFAR-10 dataset [63] is a single-label dataset containing 60,000 32 × 32 color images in 10 categories, each consisting of 6,000 images. Following [20], [38], 10,000 images are selected as the test query set, and the other 50,000 images are used as the training set and database.
The CIFAR-20 [63] dataset is also a single-label dataset containing 60,000 32 × 32 color images. It has 20 superclasses, each consisting of 3000 images. In addition, each super-class also contains 5 classes, thus the semantic gap between different images within each super-class is large, which shows that this dataset is more challenging than the CIFAR-10 dataset. Following [20], [38], 10,000 images are leveraged as the test query set, and the remaining 50,000 images are exploited as the training set and database.
The NUS-WIDE [64] dataset is a large-scale multi-label dataset containing 195,834 color images. It has 21 categories, each including at least 5,000 images. Training on all images would take too much time. Therefore, following [19], 2,100 images (100 per class) are randomly chosen as the test query set, and the remaining images form the database. Meanwhile, 10,500 images (500 per class) are randomly sampled from the database as the training set. The details of the three datasets used in the experiments are shown in Table 2.

Algorithm 1 RDHN Algorithm
Input:
The training sample set $X = \{x_i\}_{i=1}^{N}$, divided into the current-batch training (query) sample set $X_i = \{x_q\}_{q=1}^{M}$ and the database $X_D$.

Output:
Discrete hash code b for each image.

Initialization:
Weight parameters of the deep network.

Repeat:
1) Before each epoch, the training sample set X is passed through the network once, and the corresponding real-valued outputs h (including $h_{x_q}$ and $h_{x_d}$) are obtained from the hash layer;
2) According to the query sample $x_q$, query the sample …

In the experiments on the deep hashing methods as well as the proposed method, for the CIFAR-10 and CIFAR-20 datasets the raw pixel images are used as input, while the images from the NUS-WIDE dataset are resized to 64 × 64. In the experiments on the traditional hashing methods, we follow [13], [20] and represent each image in CIFAR-10 and CIFAR-20 by a 512-dimensional GIST vector, and each image in NUS-WIDE by a 500-dimensional bag-of-words vector. For fair comparison, the three traditional hashing methods with deep features, MLH-CNN [9], KSH-CNN [10], and BRE-CNN [11], use 4096-dimensional deep features extracted from AlexNet [48].
All experiments are implemented on a computer configured with a GeForce GTX 1060 6GB GPU, an Intel Core i7-8700 CPU, and 32GB of RAM. In the training stage, we first pre-train a hashing network model with 12-bit hash codes, and then fine-tune it to obtain the models with 16-bit, 24-bit, 32-bit, and 48-bit hash codes. During pre-training, the model is trained for 300 epochs with a learning rate of $10^{-3}$; during fine-tuning, it is trained for 100 epochs, or until the gradient no longer updates, with a learning rate of $10^{-4}$. For both training stages, the mini-batch size is set to 200 and the weight decay to 0.004. By cross-validation, for the CIFAR-10, CIFAR-20, and NUS-WIDE datasets, α in Eq. (22) is set to 10, 10, and 0.1, respectively, and β in Eq. (21) is set to 6, 3, and 3, respectively. In addition, for the NUS-WIDE dataset, K in Eq. (19) is set to 800.
We use the Mean Average Precision (MAP) at different numbers of bits, Precision@500 curves at different numbers of bits, Precision@N curves at 16 bits, and Precision-Recall (PR) curves at 16 bits to evaluate the performance of the proposed method. In particular, for the NUS-WIDE dataset, the MAP values are calculated over the top 5,000 returned images, as sketched below.
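For reference, a minimal NumPy sketch of this MAP protocol is given below, assuming single-label relevance (a multi-label variant would instead count any shared label as relevant); `mean_average_precision` is our name for it.

```python
import numpy as np

def mean_average_precision(q_codes, db_codes, q_labels, db_labels, top_n=None):
    """MAP over Hamming ranking with single-label relevance; top_n=5000
    reproduces the NUS-WIDE protocol used here."""
    aps = []
    for q, ql in zip(q_codes, q_labels):
        order = np.argsort((q != db_codes).sum(axis=1))   # Hamming distances
        if top_n is not None:
            order = order[:top_n]
        rel = (db_labels[order] == ql).astype(np.float64)
        if rel.sum() == 0:
            continue
        prec = np.cumsum(rel) / (np.arange(len(rel)) + 1)  # precision@k
        aps.append((prec * rel).sum() / rel.sum())         # average precision
    return float(np.mean(aps))
```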

B. PERFORMANCE EVALUATION AND ANALYSIS OF THE PROPOSED NETWORK
The four models RDHN1, RDHN2, RDHN3, and RDHN4 with difference convolutions are compared with the RDHN0 model with vanilla convolution. The experiments are conducted on the CIFAR-10, CIFAR-20, and NUS-WIDE datasets, and the image retrieval results (MAP) at different numbers of bits are recorded in Fig. 3. From Fig. 3 it can be seen that, except for the 48-bit result of the RDHN3 model on the CIFAR-10 dataset, the RDHN1, RDHN2, RDHN3, and RDHN4 models achieve better retrieval results than RDHN0 at all code lengths on the three datasets. This demonstrates that the difference convolutions realized by the edge detection operators extract more effective edge feature information than the vanilla convolution, enrich the feature information extracted by the feature learning network, and help improve the retrieval performance of the model. Moreover, the retrieval performances of the RDHN1, RDHN2, and RDHN3 models are very similar, which may be related to the characteristics of the three edge detection operators. It should also be noted that the RDHN4 model achieves much better retrieval performance than the other models on the CIFAR-20 and NUS-WIDE datasets, and better results on the CIFAR-10 dataset, which indicates that applying other forms of difference convolution in the first convolutional layer of the proposed model is also effective and feasible. Overall, the four difference-convolution models outperform the vanilla-convolution model RDHN0 in image retrieval. Therefore, applying the designed difference convolution in the first convolutional layer of the proposed model is effective and advantageous.
To visually analyze why the difference convolution used in the first convolutional layer of the proposed model improves retrieval performance, we conducted further comparative experiments. A ''car'' image from the CIFAR-10 dataset is randomly selected as the test image, shown in the first column on the left of Fig. 4. For the first convolutional layer of the five models RDHN0, RDHN1, RDHN2, RDHN3, and RDHN4, the feature maps of the test image extracted from each channel are depicted as heatmaps; the visualization results are shown in columns 2 to 17 of Fig. 4. As shown in Fig. 4(a), the feature maps extracted by the first vanilla convolutional layer of the RDHN0 model mainly display the contour of the ''car'', but these contours are relatively blurry. In comparison, the feature maps extracted by the difference convolutional layers of the four models (RDHN1, RDHN2, RDHN3, and RDHN4) show clearer ''car'' contours. Moreover, in the feature maps at column 7 of row 1 in Fig. 4(b), column 13 of row 2 in Fig. 4(c), the last column of row 2 in Fig. 4(d), and column 15 of row 2 in Fig. 4(e), the wheels of the ''car'' are also relatively clear. These observations reveal that the designed difference convolution has better edge detection characteristics, in line with the original design intention in Section III-A, and intuitively explain why a model based on the proposed difference convolution can learn more effective feature information than the vanilla convolution model. In summary, with the help of the difference convolution, the proposed network model can more effectively learn the feature information of the input image, laying a solid foundation for hash learning.

C. COMPARISONS OF HASHING METHODS ON BENCHMARK DATASETS
To evaluate the effectiveness of the proposed RDHN for image retrieval, it is compared with traditional hashing methods, traditional hashing methods with deep features, and deep hashing methods on three datasets, namely CIFAR-10, CIFAR-20, and NUS-WIDE. The traditional hashing methods include LSH [6], SH [7], ITQ [8], MLH [9], KSH [10], and BRE [11]. The traditional hashing methods with deep features include MLH-CNN [9], KSH-CNN [10], and BRE-CNN [11]. The deep hashing methods include CNNH [13], DNNH [14], DRSCH [15], DSRH [16], DSH [17], DDH [18], RODH [38], MLSH [19], DBDH [20], CSQ [21], QSMIH [22], and DHD [23]. It should be noted that RODH uses a six-layer backbone network, while the other hashing methods use the same backbone network as RDHN0 in the proposed model. We use the published source codes of the CSQ, QSMIH, and DHD methods and re-evaluate them after changing their backbone networks to match the proposed method. The results of the DSH method on the CIFAR-20 dataset are not available in the original paper, so we re-evaluated it using its published source code. The results of the other hashing methods are taken from the original papers and the results cited in [19] and [20]. Moreover, in the following experiments, bold font marks the best retrieval results (MAP) among all methods.

1) RETRIEVAL RESULTS ON CIFAR-10 DATASET
On the CIFAR-10 dataset, Table 3 presents the retrieval results (MAP) at different numbers of bits for the different hashing methods, where ''-'' indicates that no retrieval result is available. Fig. 5(a)-(c) reports the Precision@500 results at different numbers of bits, the PR results at 16 bits, and the Precision@N results at 16 bits for some typical methods, respectively. From Table 3 and Fig. 5, we can observe the following. First, the deep hashing methods are generally better than the traditional hashing methods. Moreover, the three hashing methods with deep features, BRE-CNN, MLH-CNN, and KSH-CNN, obtain better retrieval results than the corresponding hashing methods with hand-crafted features (MLH, KSH, and BRE); for example, these three methods increase the 48-bit MAP by 13.85%, 7.51%, and 4.45%, respectively. These results suggest that deep learning-based features provide better feature extraction capabilities and help improve retrieval performance.
Second, the proposed method outperforms other state-of-the-art deep hashing methods. For example, at hash code lengths of 16, 24, 32, and 48 bits, RDHN2, the stronger of the proposed models, improves the MAP by 1.24%, 0.94%, 0.59%, and 0.2%, respectively, over RODH, the stronger of the ranking-based deep hashing baselines. Even RDHN0, the weakest of the proposed models, surpasses RODH at all code lengths except 48 bits. On the other hand, the proposed RDHN0 improves the 16-bit MAP from 80.16% to 84.61% compared with CSQ, the stronger of the deep hashing baselines. It can also be clearly seen from Fig. 5 that the five proposed models outperform CSQ and the other deep hashing methods. These results show that the proposed ranking-based deep hashing method better maintains ranking information, is more conducive to improving retrieval performance, and is more competitive among deep hashing methods.
Third, the deep hashing network based on difference convolution achieves better retrieval performance than that based on vanilla convolution. For example, RDHN4, RDHN3, RDHN2 and RDHN1 models are able to respectively increase the MAP with 16 bits by 0.42%, 0.33%, 0.86% and 0.63%, compared with RDHN0 model. These results more specifically illustrate the effectiveness of the deep hashing method based on difference convolution than that of Section IV-B.
To better illustrate the discrimination of the hash codes learned by the proposed models, we visualize them with t-SNE. All images in the CIFAR-10 query set are selected as the test set for visualization, and the visualization results of the 16-bit hash codes generated by the proposed models are shown in Fig. 6. We can see from Fig. 6 that the learned hash codes exhibit favorable intra-class compactness and inter-class separability.
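A minimal sketch of such a visualization, assuming scikit-learn's t-SNE and real-valued or binary codes with integer class labels; `plot_code_tsne` is our name for it.

```python
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_code_tsne(codes, labels):
    """Embed hash codes into 2-D with t-SNE and color points by class."""
    emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(codes)
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, s=2, cmap="tab10")
    plt.axis("off")
    plt.show()
```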

2) RETRIEVAL RESULTS ON CIFAR-20 DATASET
On the CIFAR-20 dataset, Table 4 shows the retrieval results (MAP) at different numbers of bits for the different hashing methods, and Fig. 7(a)-(c) presents the Precision@500 results at different numbers of bits, the PR results at 16 bits, and the Precision@N results at 16 bits for some typical methods, respectively. From Table 4 and Fig. 7, we can observe that, on the more challenging CIFAR-20 dataset, the improvements in retrieval performance achieved by the proposed method are more apparent than on the CIFAR-10 dataset. For example, in Table 4, the proposed RDHN4 increases the 48-bit MAP by 2.84%, 7.58%, and 11.86% over the RDHN0 model, the deep hashing method QSMIH, and the ranking-based deep hashing method RODH, respectively. At the same time, the precision curves and PR curves of the five proposed models all lie above those of CSQ and the other deep hashing methods in Fig. 7. These results not only validate again that the proposed difference convolution improves the feature learning ability of the model, but also demonstrate that the proposed method better optimizes ranking information and enhances the discrimination of the ranking relationship.
In addition, we found that on the CIFAR-10 and CIFAR-20 datasets, even the ranking-based deep hashing method RODH, which adopts a more complex backbone with four convolutional layers and two fully connected layers, generally performs worse than the proposed method. For an intuitive comparison, we take the network model with a 32 × 32 input image and a hash layer of C = 12 units as an example. Table 5 compares the network configurations of RODH and the proposed RDHN0, where ''-'' indicates that the method does not use that layer. Based on Table 5, the total numbers of network parameters (excluding the BN layers) of RODH and RDHN0 are 17,826,956 and 373,840, respectively; that is, the proposed method uses about 46.75 times fewer parameters than RODH and thus greatly reduces memory occupation. Although this is not the focus of our study, the proposed method achieves better retrieval performance while consuming relatively few computing resources, and a high-performance, lightweight model has greater application value in practical engineering. However, similar to [33], the idea of cross-batch retrieval improves the inter-batch communication ability but also lengthens the training time, which is a shortcoming of the proposed method.

3) RETRIEVAL RESULTS ON NUS-WIDE DATASET
On the NUS-WIDE dataset, Table 6 lists the retrieval results (MAP) at different numbers of bits for the different hashing methods, and Fig. 8(a)-(c) presents the Precision@500 results at different numbers of bits, the PR results at 16 bits, and the Precision@N results at 16 bits for some typical methods, respectively. From Table 6 and Fig. 8 it can be clearly seen that, as on the single-label datasets, the proposed method also obtains better retrieval results than other state-of-the-art hashing methods on the more challenging large-scale multi-label dataset NUS-WIDE. For example, in Table 6, the proposed RDHN4 increases the 16-bit MAP by 1.91%, 5.02%, and 8.34% over the RDHN0 model, the deep hashing method CSQ, and the ranking-based deep hashing method DSRH, respectively. The retrieval results in Fig. 8 further illustrate the effectiveness and feasibility of the proposed method. However, it should be noted that the margin of improvement on the NUS-WIDE dataset is smaller than on the single-label datasets, which indicates that the proposed method is more applicable to single-label datasets than to multi-label datasets. One reason may be that the number of images in the multi-label dataset is very large. Another may be that the proposed method uses the mean of the relevant hash codes to reduce computational cost, which is a stretch for multi-label data: unlike single-label datasets, the number of labels per image is not necessarily the same and the intra-class gap is large, so representing the relevant images by the mean of their hash codes does not fully fit the multi-label characteristics. This is also a limitation of the proposed method.

D. ABLATION STUDY
1) EVALUATION OF THE QUANTIZATION LOSS FUNCTION
In order to evaluate the impact of the quantization loss on the retrieval performance of the proposed network, we take the RDHN0 model as an example. On the three datasets CIFAR-10, CIFAR-20, and NUS-WIDE, four models adopting different quantization loss functions are used for the comparative experiments: the RDHN0 model with the proposed quantization loss function, without any quantization loss function, with the relaxed quantization loss function proposed in DSH [17], and with the discrete quantization loss function used in [66]. For simplicity, the last three models are named RDHN00, RDHN01, and RDHN02, respectively. The retrieval results (MAP) at 48 bits are shown in Table 7.
From Table 7, it can first be observed that better retrieval results are achieved when a quantization loss function is used. For example, on the CIFAR-10 dataset, the retrieval results of the RDHN0, RDHN01, and RDHN02 models are 2.17%, 1.31%, and 1.22% higher than that of the RDHN00 model, respectively. Second, the retrieval performances of the RDHN01 and RDHN02 models are similar on the three datasets, indicating that the relaxation strategy and the discretization strategy have similar effects on the proposed model. However, the retrieval results of the RDHN0 model are clearly better than those of RDHN01 and RDHN02. For example, on CIFAR-10, CIFAR-20, and NUS-WIDE, the MAP of RDHN0 is 0.86%, 1.18%, and 0.66% higher than that of the relaxed model RDHN01, respectively, and 0.95%, 0.96%, and 0.54% higher than that of the discrete model RDHN02. The primary reason is that the proposed quantization strategy is designed according to the characteristics of the ranking loss function; compared with other strategies, it is more helpful for obtaining discrete hash codes with good ranking information, reducing quantization errors, and improving retrieval performance. Furthermore, to illustrate the effectiveness of the proposed quantization loss function intuitively, the top 10 images retrieved at 16 bits on the CIFAR-10 dataset by the proposed model RDHN0 and the discrete model RDHN02 are shown in Fig. 9. It is evident from Fig. 9 that RDHN0 retrieves noticeably more correct images than RDHN02. In summary, the proposed quantization loss function helps the network learn a good discrete hash function and gives the model superior retrieval performance.

B. EVALUATION OF THE HYPERPARAMETER α
In order to evaluate the influence of the hyperparameter α in the objective function on the retrieval performance of the proposed network model, taking the RDHN0 model as an example, the retrieval results (MAP) with different numbers of bits and different values of α on the CIFAR-10, CIFAR-20, and NUS-WIDE datasets are shown in Fig. 10. First, it can be clearly observed from Fig. 10 that, on all three datasets, the retrieval results with different numbers of bits for α greater than 0 are always better than those for α equal to 0, which shows that considering the quantization loss is beneficial to improving image retrieval performance. Second, for different types of datasets, the α values of the optimal model are not necessarily the same. For example, on the NUS-WIDE dataset the model is optimal when α is 0.1, while on both the CIFAR-10 and CIFAR-20 datasets the α value of the optimal model is 10. This reflects that the impact of the weight α of the quantization loss function on the retrieval performance may be related to whether the dataset is single-label or multi-label. Different types of datasets have different degrees of sensitivity to the α value, which means that if the model is applied to a new type of dataset, the α value may need to be re-tuned.
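As a minimal sketch of how α enters training, assuming it simply weights the quantization term in the combined objective (the exact ranking and quantization losses are those defined earlier in the paper and are not reproduced here):

```python
import torch

def objective(rank_loss, quant_loss, alpha):
    """Weighted objective; alpha = 0 drops the quantization term,
    matching the worst-performing setting in Fig. 10."""
    return rank_loss + alpha * quant_loss

# Illustrative scalar losses and the alpha grid from the ablation.
rank_l, quant_l = torch.tensor(0.7), torch.tensor(0.3)
for alpha in [0.0, 0.1, 1.0, 10.0, 100.0]:
    print(alpha, objective(rank_l, quant_l, alpha).item())
```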

C. EVALUATION OF THE HYPERPARAMETER β
In order to evaluate the influence of the hyperparameter β in the objective function on the retrieval performance of the proposed network model, RDHN0 is taken as an example. Fig. 11 shows the retrieval results (MAP) with various numbers of bits and different values of β on the CIFAR-10, CIFAR-20, and NUS-WIDE datasets. Firstly, it is clearly seen from Fig. 11 that, on all three datasets, the retrieval results with different numbers of bits when β is set greater than 1 are always better than those when β is set to 1. Secondly, when β is greater than 1, different β values have different effects on the performance of the proposed model. Broadly, as the β value increases, the retrieval results with different numbers of bits first increase and then decrease. Finally, for different datasets, the β values giving the best retrieval results are not necessarily identical. For example, on the three datasets CIFAR-10, CIFAR-20, and NUS-WIDE, the RDHN0 model achieves the best retrieval result when β is set to 6, 3, and 3, respectively. To sum up, these results demonstrate that a reasonable setting of the hyperparameter β is beneficial to improving the retrieval performance of the proposed model, which also conforms to the original design intention in Section III-B-2).
To analyze the relationship between the hyperparameters α and β, we set α to 0.1, 1, 10, and 100, and β to 1, 2, 3, 4, 5, 6, 7, and 8. The retrieval results (MAP) of the RDHN0 model with 16 bits on the CIFAR-10 dataset are shown in Fig. 12. When β is fixed, the MAP first increases and then decreases as α increases. When α is set to 10, the MAP reaches its maximum, that is, RDHN0 achieves optimal performance. These results are consistent with those in Section IV-C. Similarly, when α is fixed, the MAP first increases and then decreases as β increases, and RDHN0 achieves optimal performance when β is set to 6. Therefore, we speculate that as β increases, the magnitude of the real-valued hash output decreases, and a suitably small output is helpful to hash learning. However, when the output is too small, the sensitivity of the hash code increases [23]. For example, a real value can easily change from 0.1 to −0.1, flipping the quantized bit from 1 to −1, which is not conducive to hash learning.
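This sensitivity argument can be illustrated directly: when the real-valued outputs have small magnitude, a tiny perturbation flips the quantized bit. A toy NumPy demonstration with illustrative values:

```python
import numpy as np

u = np.array([0.1, -0.4, 0.9])               # real-valued hash outputs
perturbed = u + np.array([-0.2, 0.0, 0.0])   # small shift on the first entry

print(np.sign(u))          # [ 1. -1.  1.]
print(np.sign(perturbed))  # [-1. -1.  1.]  -- the 0.1 entry flipped its bit
```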

D. EVALUATION OF THE HYPERPARAMETER K
In order to evaluate the influence of the hyperparameter K in the objective function on the retrieval performance of the proposed network model, RDHN0 is again taken as an example. Fig. 13 shows the retrieval results (MAP) with different numbers of bits and various values of K on the NUS-WIDE dataset, where ''K = class_num'' indicates that K is set to the number of relevant hash codes for the query sample, that is, the number of samples in the database that belong to the same category as the query sample; in this case the data imbalance problem of the NUS-WIDE dataset is not handled. When K is set to 600, 800, 1000, and 1200, different weight penalties are imposed according to the number of relevant samples (categories) of the query samples to deal with the data imbalance. It can be clearly seen from Fig. 13 that, when K is set to 800, 1000, and 1200, the three retrieval results are on the whole not much different, but they are all better than those when K is set to 600 or ''class_num''. However, careful observation shows that, when K is set to 800, the model is more advantageous at relatively small numbers of bits. The above phenomena demonstrate that a reasonable setting of the hyperparameter K can alleviate the problem of data imbalance. Meanwhile, setting K to a relatively large value (such as 800) is more helpful for improving the retrieval performance of the model.
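One plausible reading of the K-based reweighting is sketched below; the clip-style weighting is an illustrative assumption rather than the paper's exact penalty, but it captures the idea of preventing queries with very large relevant sets from dominating the loss.

```python
def imbalance_weight(n_relevant, K):
    """Down-weight a query whose relevant set exceeds K so that
    over-represented categories do not dominate the ranking loss
    (illustrative scheme, not the paper's exact formula)."""
    return min(1.0, K / float(n_relevant))

# With K = 800: a query with 5000 relevant samples gets weight 0.16,
# while a query with 300 relevant samples keeps full weight 1.0.
print(imbalance_weight(5000, 800), imbalance_weight(300, 800))
```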

VI. CONCLUSION
A novel Ranking-based Deep Hashing Network (RDHN) is proposed for image retrieval in this paper, which consists of a feature learning module and a hash learning module. The proposed method can generate discriminative discrete hash codes while learning to preserve ranking information. We uniquely apply a novel difference convolution in the first convolutional layer of the CNN to improve the learning ability of the deep network. Meanwhile, a new ranking-based objective function is carefully designed, which fully learns hash ranking information to enhance the discrimination of ranking relationships and reduce the quantization error. A large number of experiments are conducted on three widely used benchmark datasets, and the results demonstrate that the proposed method achieves better retrieval performance than the state-of-the-art hashing approaches. It should be pointed out that the proposed method still has some deficiencies in training efficiency and generalization performance, which need further improvement. In the future, on the one hand, we will explore combining the first convolutional layer of the CNN with feature extraction algorithms based on color or texture. On the other hand, we will further investigate the use of ranking-based deep hashing for image retrieval on multi-label datasets.