Dual Asymmetric Deep Hashing Learning

Owing to its impressive learning power, deep learning has achieved remarkable performance in supervised hash function learning. In this paper, we propose a novel asymmetric supervised deep hashing method that preserves the semantic structure among different categories and generates the binary codes simultaneously. Specifically, two asymmetric deep networks are constructed to reveal the similarity between each pair of images according to their semantic labels. The deep hash functions are then learned through the two networks by minimizing the gap between the learned features and the discrete codes. Furthermore, since the binary codes in the Hamming space should also keep the semantic affinity existing in the original space, another asymmetric pairwise loss is introduced to capture the similarity between the binary codes and the real-valued features. This asymmetric loss not only improves the retrieval performance, but also contributes to quick convergence in the training phase. By taking advantage of the two-stream deep structure and the two types of asymmetric pairwise functions, an alternating algorithm is designed to efficiently optimize the deep features and the high-quality binary codes. Experimental results on three real-world datasets substantiate the effectiveness and superiority of our approach compared with state-of-the-art methods.


I. INTRODUCTION
WITH the rapid growth of multimedia data in search engines and social networks, how to store these data and search them quickly when a new query, such as an image, is given plays a fundamental role in machine learning. Owing to their low storage cost and fast retrieval speed, hashing techniques have attracted much attention and are widely applied to nearest neighbor search [1] for information retrieval on large-scale datasets. Hashing learning aims to project the data from the original space into a Hamming space by generating compact codes. These codes not only dramatically reduce the storage overhead and achieve constant or sub-linear time complexity in information search, but also preserve the semantic affinity existing in the original space.
Many hashing methods have been studied [2] [3] [4] [5] [6] [7] [8] [9] [10]. Generally, these approaches can be roughly classified into two categories: data-independent and data-dependent hashing methods. Locality Sensitive Hashing (LSH) [2] and its extension Kernelized LSH (KLSH) [3], the most typical data-independent hashing methods, obtain the hashing function by using random projections. Although the design of these data-independent methods is quite simple, their performance often degrades when the binary codes are relatively short. By contrast, instead of randomly generating the hashing function as LSH does, data-dependent methods aim to learn a data-specific hashing function from the training data, and are capable of generating shorter binary codes while achieving better performance. Therefore, various data-dependent hashing approaches, both unsupervised and supervised, have been proposed. Unsupervised hashing methods, e.g., Spectral Hashing [6], Anchor Graph Hashing (AGH) [7], and Discrete Graph Hashing (DGH) [8], only utilize the data structure to learn compact binary codes. By taking the label information into account, supervised hashing methods attempt to map the original data into a compact Hamming space that preserves the similarity between each pair of samples. Many representative works, including Fast Supervised Hashing (FastH) [5], Kernel Supervised Hashing (KSH) [9], and Supervised Discrete Hashing (SDH) [10], demonstrate that supervised hashing methods often outperform unsupervised hashing methods. Thus, we focus on supervised hashing in this paper.
Although some traditional supervised hashing methods achieve good performance in some applications, most of them only linearly map the original data into a Hamming space by using hand-crafted features, which limits their application to large-scale datasets with complex distributions. Fortunately, owing to its powerful capability of data representation, deep learning [11] [12] provides a promising way to jointly represent the data and learn hash codes. Some deep learning based hashing methods have been studied, such as Deep Supervised Hashing (DSH) [13] and Deep Pairwise Supervised Hashing (DPSH) [14]. These approaches demonstrate the effectiveness of the end-to-end deep learning architecture for hashing learning.
Despite the wide application of deep neural networks to hashing learning, most of these methods use symmetric structures, in which the similarity between each pair of points is estimated by the Hamming distance between the outputs of the same hash function [15]. As described in [15], a crucial problem is that this symmetric scheme makes the discrete constraint difficult to optimize. Thus, in this paper, we propose a novel asymmetric hashing method to address this problem. Note that a similar work, named deep asymmetric pairwise hashing (DAPH), was described by Shen et al. [15]. However, our study is quite distinct from DAPH. Shen et al. tried to approximate the similarity affinity by exploiting two different hashing functions, which can preserve more similarity information among the real-valued features. However, DAPH only exploits a simple Euclidean distance and ignores the semantic structure between the learned real-valued features and the binary codes [16] [17]. One major deficiency is that it is difficult to efficiently preserve the similarity in the learned hash functions and discrete codes. Furthermore, in DAPH, two different types of discrete hash codes, corresponding to the two hash functions, are estimated at training time. This strategy enlarges the gap between the two schemes, resulting in a performance degradation. By contrast, we not only propose a novel asymmetric structure that learns two different hash functions and one consistent binary code for each sample in the training phase, but also asymmetrically exploit real values and multiple integer values, which permits better preservation of the similarity between the learned features and the hash codes. Experiments show that this novel asymmetric structure achieves better performance in image retrieval and quicker convergence in the training stage.
The main contributions of the proposed method are as follows:
(1) A novel asymmetric deep structure is proposed. Two streams of deep neural networks are trained to asymmetrically learn two different hash functions. The similarity between each pair of images is exploited through a pairwise loss according to their semantic/label information.
(2) The similarity between the learned features and the binary codes is also revealed through an additional asymmetric loss. The real-valued features and the binary codes are bridged through an inner product, which alleviates the binary limitation, better preserves the similarity, and speeds up convergence in the training phase.
(3) By taking advantage of these two asymmetric properties, an alternating algorithm is designed to efficiently optimize the real values and the discrete values.
(4) Experimental results on three large-scale datasets substantiate the effectiveness and superiority of our approach compared with some existing state-of-the-art hashing methods in image retrieval.
The rest of this paper is organized as follows. In Section II, the related works, including data-independent and data-dependent hashing methods, are briefly reviewed. In Section III, the proposed Dual Asymmetric Deep Hashing Learning (DADH) is analyzed, followed by its optimization. In Section IV, experiments are conducted on three real-world datasets, and the comparisons, parameter sensitivity analysis and convergence analysis are discussed. The paper is concluded in Section V.

II. RELATED WORKS
As mentioned before, hashing methods can be roughly separated into data-independent and data-dependent hashing.
Locality Sensitive Hashing (LSH) [2] uses several hash functions to randomly project the data into a Hamming space, so that the probability of collision is much higher for data points which are close to each other than for those which are far apart. Considering the non-linearity existing in many real-world datasets, LSH was generalized to accommodate arbitrary kernel functions (KLSH) in [3]. Some other priors, such as p-stable distributions [18] and shift-invariant kernels [19], are also embedded to extend LSH for performance improvement.
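As a minimal illustration of the random-projection idea behind data-independent hashing, the following Python sketch uses sign random projections with Gaussian directions. This is one common LSH family and is not necessarily the exact scheme of [2]; the function names `lsh_codes` and `hamming` are hypothetical.

```python
import numpy as np

def lsh_codes(X, n_bits, seed=0):
    """Map each row of X to an n_bits code in {-1, +1} via random projections."""
    rng = np.random.default_rng(seed)
    # One random Gaussian direction per bit (assumption: sign-random-projection LSH).
    W = rng.standard_normal((X.shape[1], n_bits))
    return np.where(X @ W >= 0, 1, -1)

def hamming(a, b):
    """Number of positions in which two codes differ."""
    return int(np.count_nonzero(a != b))

# Rows 0 and 1 point in the same direction; row 2 points the opposite way.
X = np.array([[1.0, 2.0], [2.0, 4.0], [-1.0, -2.0]])
codes = lsh_codes(X, n_bits=16)
print(hamming(codes[0], codes[1]), hamming(codes[0], codes[2]))
```

Points with a small angle between them fall on the same side of most random hyperplanes, so their codes collide in most bits, whereas opposite points disagree in essentially every bit.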
Different from data-independent methods, data-dependent methods try to learn more compact codes from a given dataset to achieve a satisfactory search accuracy. According to whether the label information is available, data-dependent hashing can be further classified into unsupervised and supervised hashing. Typical learning criteria for unsupervised hashing methods include graph learning [6] [7] [8] [20] and error minimization [4] [21] [22]. Graph Learning: Yair et al. [6] proved that finding the best code is associated with the problem of graph partitioning. Thus, spectral hashing (SH) was proposed to learn the hash function. Another graph hashing method, named Anchor Graph Hashing (AGH), was presented by Liu et al. [7]; it is capable of automatically capturing the neighborhood structure inherent in the data. In order to avoid the high complexity of existing graph hashing methods, Jiang et al. [20] proposed scalable graph hashing (SGH), which can be effectively applied to large-scale dataset search. Although SH, AGH and SGH achieve satisfactory performance on some datasets, all of them relax the optimization by discarding the discrete constraints, which results in an accumulated quantization error. To address this problem, discrete graph hashing (DGH) [8] was proposed, which can find the inherent neighborhood structure in a discrete code space. Error Minimization: A typical method is iterative quantization (ITQ) [4], which aims to project the data to the vertices of a binary hypercube and minimize the quantization error. Additionally, the method in [21] performs quantization by decomposing the input space into a Cartesian product of low-dimensional subspaces, dramatically reducing the quantization noise. Different from most single-bit quantization methods, double-bit quantization hashing [23] was also studied, which quantizes each dimension into two bits.
In contrast to unsupervised hashing methods, supervised hashing learning utilizes the label information to encourage the binary codes in the Hamming space to preserve the semantic relationship existing in the raw data. For instance, Mohammad et al. [24] introduced a hinge-like loss function to exploit the semantic information. Besides, Li et al. [25] projected the raw data into a latent subspace, and the label information is embedded in this subspace to preserve the semantic structure. The Jensen-Shannon divergence is also utilized in [26] to learn the binary codes within a probabilistic framework, in which an upper bound is derived for various hash functions. Similar to KLSH, Liu et al. [9] proposed supervised hashing with kernels, in which the distances between similar pairs are minimized while those between dissimilar pairs are maximized. Considering the discrete constraint, supervised discrete hashing (SDH) [10] was proposed to not only preserve the semantic structure, but also learn the hash codes discretely without any relaxation. However, this discrete optimization is time-consuming and unscalable. To tackle this problem, a method named column sample based discrete supervised hashing (COSDISH) was presented to directly obtain the binary codes from the semantic information.
Although the various works mentioned above have been studied, they only project the data into the Hamming space by using hand-crafted features. The main limitation is that their performance degrades if the distribution of a real-world dataset is complex. Fortunately, deep learning provides a reasonable and promising solution. Liong et al. [27] used a deep structure to hierarchically and non-linearly learn the hash codes. The convolutional neural network (CNN) was first applied to hashing learning by Xia et al. (CNNH) [28], which simultaneously represents the image and learns a hash function. A novel deep structure [29] was then proposed by replacing the fully-connected layer in CNNH with a divide-and-encode module, in which the hash codes can be obtained bit by bit. Also, Can et al. [30] combined a quantization model with a deep structure to gain satisfactory performance in image retrieval. Different from the triplet loss used in some deep hashing methods, Li et al. [14] studied a pairwise loss (DPSH) which can effectively preserve the semantic information between each pair of outputs. Owing to the power of the asymmetric structure, asymmetric deep hashing has also been studied in recent years. For instance, Shen et al. [15] (DAPH) tried to learn hash functions in an asymmetric network. However, DAPH only exploits two streams to preserve the pairwise label information between the deep neural network outputs, and ignores the similarity between the real-valued features and the binary codes. Thus, in this paper, we propose a novel deep hashing method that not only exploits the label information between each pair of outputs through an asymmetric deep structure, but also semantically associates the learned real-valued features with the binary codes.

III. THE PROPOSED METHOD
In this section, we first give some notations used in this paper, as well as the problem definition. The proposed Dual Asymmetric Deep Hashing Learning (DADH) is then described, followed by its optimization.

A. Notation and Problem Definition
In this paper, since there are two streams in the proposed method, we use the uppercase letters $X = \{x_1, \ldots, x_i, \ldots, x_N\} \in \mathbb{R}^{N \times d_1 \times d_2 \times 3}$ and $Y = \{y_1, \ldots, y_j, \ldots, y_N\} \in \mathbb{R}^{N \times d_1 \times d_2 \times 3}$ to denote the input images in the first and second deep neural networks, respectively, where $N$ is the number of training samples, and $d_1$ and $d_2$ are the length and width of each image. Note that, although $X$ and $Y$ are represented with different symbols, both of them denote the same training data. In our experiments, we alternately use the training samples $X$ and $Y$ in the first and second networks. Since our method is supervised, the label information can be used. Let the uppercase letter $S \in \{-1, +1\}^{N \times N}$ denote the similarity between $X$ and $Y$, where $S_{ij}$ is the element in the $i$-th row and $j$-th column of $S$. Let $S_{ij} = 1$ if $x_i$ and $y_j$ share the same semantic information or label; otherwise $S_{ij} = -1$.
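The construction of the similarity matrix can be sketched in a few lines of Python. The sketch assumes that the labels are stored as binary multi-hot matrices (a one-hot row for single-label data), and it treats two samples as similar when they share at least one label, which is the rule used for the multi-label datasets in the experiments. The function name `similarity_matrix` is hypothetical.

```python
import numpy as np

def similarity_matrix(labels_x, labels_y):
    """Return S in {-1, +1}^{N x N} from (N, C) binary label matrices.

    S[i, j] = +1 if samples i and j share at least one label, else -1.
    """
    labels_x = np.asarray(labels_x)
    labels_y = np.asarray(labels_y)
    shared = labels_x @ labels_y.T          # number of shared labels per pair
    return np.where(shared > 0, 1, -1)

# Three samples, three classes; sample 1 carries two labels.
labels = np.array([[1, 0, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
S = similarity_matrix(labels, labels)       # X and Y are the same training data
print(S)
```

Because both streams see the same training data, the same label matrix is passed twice, and the resulting S is symmetric with ones on the diagonal.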

Denote the binary codes as $B = [b_1, \ldots, b_i, \ldots, b_N]^T \in \{-1, +1\}^{N \times k}$, where $b_i \in \{-1, +1\}^{k \times 1}$ is the $k$-bit hash code of the $i$-th sample. The purpose of our model is to learn two mapping functions $F$ and $G$ that project $X$ and $Y$ into the Hamming space $B$, with $b_i = \mathrm{sign}(F(x_i))$ and $b_j = \mathrm{sign}(G(y_j))$, where $\mathrm{sign}(\cdot)$ is an element-wise sign function. The Hamming distance between $b_i$ and $b_j$ should be as small as possible if $S_{ij} = 1$, and vice versa. Owing to the power of deep neural networks in data representation, we apply a convolutional neural network to learn the hash functions. Specifically, the CNN-F structure [31] is adopted to perform feature learning. The CNN-F model has eight layers, including five convolutional layers and three fully-connected layers. The network structure is listed in Table I, where "f." means the filter, "st." means the convolution stride, and "LRN" means Local Response Normalization [11]. In order to get the final binary code, we replace the last layer in CNN-F with a layer that outputs a $k$-D vector, and the $k$-bit binary codes are obtained through a sign operation on the output of this layer. In this paper, the CNN-F model is applied to both streams of the proposed asymmetric structure.

B. Dual Asymmetric Deep Hashing Learning
The main framework of the proposed method is shown in Fig. 1. As we can see, there are two end-to-end neural networks that discriminatively represent the inputs. For a pair of outputs $F$ and $G$ of these two streams, the semantic information is exploited through a pairwise loss according to the predefined similarity matrix. Since the purpose is to obtain hash functions through the deep networks, the binary code $B$ is also generated by minimizing its distance to $F$ and $G$. Furthermore, in order to preserve the similarity between the learned binary codes and the real-valued features, and to alleviate the binary limitation, another asymmetric pairwise loss is introduced by using the inner product of the hash codes $B$ and the learned features $F$ ($G$).
Denote $f(x_i, W_f) \in \mathbb{R}^{k \times 1}$ as the output of the $i$-th sample in the last layer of the first stream, where $W_f$ is the parameter of the network. To simplify the notation, we use $f_i$ to replace $f(x_i, W_f)$. Similarly, we can obtain the output $g_j$ corresponding to the $j$-th sample under the parameter $W_g$ in the second network. Thus, the features $F = [f_1, \ldots, f_N]^T \in \mathbb{R}^{N \times k}$ and $G = [g_1, \ldots, g_N]^T \in \mathbb{R}^{N \times k}$ corresponding to the first and second networks are obtained.
To learn an accurate binary code, we require $\mathrm{sign}(f_i)$ and $\mathrm{sign}(g_i)$ to be close to their corresponding hash code $b_i$. A general way is to minimize the $L_2$ loss between them:

$$\min_{f, g} \sum_{i=1}^{N} \left( \|\mathrm{sign}(f_i) - b_i\|_2^2 + \|\mathrm{sign}(g_i) - b_i\|_2^2 \right), \quad \text{s.t. } b_i \in \{-1, +1\}^{k}. \tag{1}$$
Fig. 1. The framework of the proposed method (panels: feature extraction with different weights; real-valued data similarity preservation; data and binary codes similarity preservation). Two streams with five convolutional layers and two fully-connected layers are used for feature extraction. For the real-valued outputs from these two neural networks, their similarity is preserved by using a pairwise loss. Based on the outputs, a consistent hash code is generated. Furthermore, an asymmetric loss is introduced to exploit the semantic information between the binary code and the real-valued data.
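However, it is difficult to back-propagate the gradient with respect to $f_i$ or $g_i$ in Eq. (1), since the gradient of $\mathrm{sign}(\cdot)$ is zero almost everywhere. In this paper, we apply $\tanh(\cdot)$ to softly approximate the $\mathrm{sign}(\cdot)$ function. Thus, Eq. (1) is transformed into

$$\min_{f, g} \sum_{i=1}^{N} \left( \|\tanh(f_i) - b_i\|_2^2 + \|\tanh(g_i) - b_i\|_2^2 \right), \quad \text{s.t. } b_i \in \{-1, +1\}^{k}. \tag{2}$$

Furthermore, to exploit the label information and keep a consistent similarity between the two outputs $F$ and $G$, the negative log-likelihood of the dual-stream similarities is exploited, with the likelihood function

$$p(S_{ij} \mid \tanh(f_i), \tanh(g_j)) = \begin{cases} \sigma(\Theta_{ij}), & S_{ij} = 1, \\ 1 - \sigma(\Theta_{ij}), & \text{otherwise}, \end{cases} \tag{3}$$

where $\Theta_{ij} = \frac{1}{2} \tanh(f_i)^T \tanh(g_j)$ and $\sigma(\Theta_{ij}) = \frac{1}{1 + e^{-\Theta_{ij}}}$. Therefore, the pairwise loss for these two different outputs is shown as follows.

$$\min_{f, g} \; -\sum_{i,j} \left( S_{ij} \Theta_{ij} - \log\left(1 + e^{\Theta_{ij}}\right) \right). \tag{4}$$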
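Although Eq. (2) approximates the discrete codes and Eq. (4) exploits the intra- and inter-class information, the similarity between the binary codes and the real-valued features is ignored. To tackle this problem, another asymmetric pairwise loss is introduced.

$$\min_{f, g, B} \sum_{i,j} \left( \left(\tanh(f_i)^T b_j - k S_{ij}\right)^2 + \left(\tanh(g_i)^T b_j - k S_{ij}\right)^2 \right), \quad \text{s.t. } b_j \in \{-1, +1\}^{k}. \tag{5}$$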
In Eq. (5), the similarity between the real-valued data and the binary codes is measured by their inner product. It is easy to observe that Eq. (5) not only encourages $\tanh(f_i)$ ($\tanh(g_i)$) and $b_i$ to be consistent, but also preserves the similarity between them. Additionally, our experiments in Section IV show that this kind of asymmetric inner product makes the network converge quickly to a stable value for the real-valued features and hash codes.
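Jointly taking Eq. (2), Eq. (4) and Eq. (5) into account, the objective function can be obtained as follows:

$$\begin{aligned} \min_{F, G, B} L = {} & \|\tanh(F) B^T - kS\|_F^2 + \|\tanh(G) B^T - kS\|_F^2 - \tau \sum_{i,j} \left( S_{ij} \Theta_{ij} - \log\left(1 + e^{\Theta_{ij}}\right) \right) \\ & + \gamma \left( \|\tanh(F) - B\|_F^2 + \|\tanh(G) - B\|_F^2 \right) + \eta \left( \|\tanh(F)^T \mathbf{1}\|_F^2 + \|\tanh(G)^T \mathbf{1}\|_F^2 \right), \quad \text{s.t. } B \in \{-1, +1\}^{N \times k}, \end{aligned} \tag{6}$$

where $\tau$, $\gamma$ and $\eta$ are non-negative parameters that make a trade-off among the various terms, and $\mathbf{1}$ is the all-ones vector of length $N$. Note that the purpose of the fourth term, $\|\tanh(F)^T \mathbf{1}\|_F^2 + \|\tanh(G)^T \mathbf{1}\|_F^2$, in the objective function Eq. (6) is to maximize the information provided by each bit [32]. In detail, this term balances each bit, encouraging the numbers of -1 and +1 to be approximately equal over all training samples.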
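To make the structure of the objective concrete, the following NumPy sketch evaluates it for given real-valued outputs and codes. It is written under stated assumptions rather than taken from the authors' code: the asymmetric inner-product terms carry unit weight, τ weights the pairwise likelihood term, γ weights the gap between tanh outputs and codes, η weights the bit-balance term, and Θ uses a factor of 1/2. The function name `dadh_objective` is hypothetical.

```python
import numpy as np

def dadh_objective(F, G, B, S, tau, gamma, eta):
    """Evaluate the four-part objective for F, G in R^{N x k}, B and S in {-1, +1}."""
    U, V = np.tanh(F), np.tanh(G)
    k = B.shape[1]
    theta = 0.5 * U @ V.T                                   # assumed 1/2 scaling
    # Asymmetric terms: inner products of features and codes should match k * S.
    asym = np.sum((U @ B.T - k * S) ** 2) + np.sum((V @ B.T - k * S) ** 2)
    # Pairwise term; logaddexp(0, t) is a stable log(1 + exp(t)).
    pair = -np.sum(S * theta - np.logaddexp(0.0, theta))
    # Gap between the relaxed outputs and the discrete codes.
    quant = np.sum((U - B) ** 2) + np.sum((V - B) ** 2)
    # Bit balance: column sums of tanh outputs should be close to zero.
    balance = np.sum(U.sum(axis=0) ** 2) + np.sum(V.sum(axis=0) ** 2)
    return asym + tau * pair + gamma * quant + eta * balance

# Tiny example: zero features, two samples, two bits.
F = np.zeros((2, 2))
G = np.zeros((2, 2))
B = np.array([[1.0, -1.0], [1.0, 1.0]])
S = np.array([[1.0, -1.0], [-1.0, 1.0]])
print(dadh_objective(F, G, B, S, tau=2.0, gamma=0.5, eta=3.0))
```

With zero features every inner product vanishes, so the asymmetric terms contribute 2 · 4 · k² = 32, the pairwise term contributes τ · 4 log 2, the quantization term contributes γ · 8, and the balance term is zero.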

C. Optimization
From the objective function Eq. (6), we can see that the real-valued features together with the network weights, $(F, W_f)$ and $(G, W_g)$, and the discrete codes $B$ need to be optimized. Note that this NP-hard problem is highly non-convex, and it is very difficult to obtain the optimal solution directly. In this paper, we design an efficient algorithm to optimize them alternately. Specifically, we update one variable with the other variables fixed.
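1) Update $(F, W_f)$ with $(G, W_g)$ and $B$ fixed: By fixing $(G, W_g)$ and $B$, the objective function Eq. (6) can be transformed to

$$\min_{F, W_f} L = \|\tanh(F) B^T - kS\|_F^2 - \tau \sum_{i,j} \left( S_{ij} \Theta_{ij} - \log\left(1 + e^{\Theta_{ij}}\right) \right) + \gamma \|\tanh(F) - B\|_F^2 + \eta \|\tanh(F)^T \mathbf{1}\|_F^2. \tag{7}$$

Then back-propagation is exploited to update $(F, W_f)$. Here denote $U = \tanh(F)$ and $V = \tanh(G)$, and let $u_i$ and $v_i$ be the $i$-th rows of $U$ and $V$ written as column vectors. The gradient of the objective function with respect to $f_i$ is

$$\frac{\partial L}{\partial f_i} = \left\{ 2 \sum_{j=1}^{N} b_j \left( b_j^T u_i - k S_{ij} \right) + \frac{\tau}{2} \sum_{j=1}^{N} \left( \sigma(\Theta_{ij}) - S_{ij} \right) v_j + 2\gamma (u_i - b_i) + 2\eta\, U^T \mathbf{1} \right\} \odot \left( 1 - u_i \odot u_i \right), \tag{8}$$

where $\odot$ denotes the element-wise product. After getting the gradient $\frac{\partial L}{\partial f_i}$, the chain rule is used to obtain $\frac{\partial L}{\partial W_f}$, and $W_f$ is updated by using back-propagation.
2) Update $(G, W_g)$ with $(F, W_f)$ and $B$ fixed: Similarly, by fixing $(F, W_f)$ and $B$, back-propagation is exploited to update $(G, W_g)$. The gradient of the objective function with respect to $g_i$ is

$$\frac{\partial L}{\partial g_i} = \left\{ 2 \sum_{j=1}^{N} b_j \left( b_j^T v_i - k S_{ij} \right) + \frac{\tau}{2} \sum_{j=1}^{N} \left( \sigma(\Theta_{ji}) - S_{ji} \right) u_j + 2\gamma (v_i - b_i) + 2\eta\, V^T \mathbf{1} \right\} \odot \left( 1 - v_i \odot v_i \right). \tag{9}$$

After getting the gradient $\frac{\partial L}{\partial g_i}$, the chain rule is used to obtain $\frac{\partial L}{\partial W_g}$, and $W_g$ is updated by using back-propagation.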
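The two gradients can be checked numerically. The sketch below is an illustration under assumptions, not the authors' implementation: it takes the objective to be the sum of the asymmetric terms (unit weight), the pairwise term (weight τ, with Θ scaled by 1/2), the quantization term (weight γ) and the bit-balance term (weight η), writes both gradients in matrix form, and compares them with central finite differences. All function names are hypothetical.

```python
import numpy as np

def objective(F, G, B, S, tau, gamma, eta):
    U, V = np.tanh(F), np.tanh(G)
    k = B.shape[1]
    theta = 0.5 * U @ V.T
    asym = np.sum((U @ B.T - k * S) ** 2) + np.sum((V @ B.T - k * S) ** 2)
    pair = -np.sum(S * theta - np.logaddexp(0.0, theta))
    quant = np.sum((U - B) ** 2) + np.sum((V - B) ** 2)
    balance = np.sum(U.sum(axis=0) ** 2) + np.sum(V.sum(axis=0) ** 2)
    return asym + tau * pair + gamma * quant + eta * balance

def grad_F(F, G, B, S, tau, gamma, eta):
    """Gradient w.r.t. F; row i is the gradient w.r.t. f_i."""
    U, V = np.tanh(F), np.tanh(G)
    k = B.shape[1]
    sigma = 1.0 / (1.0 + np.exp(-0.5 * U @ V.T))
    dU = (2.0 * (U @ B.T - k * S) @ B               # asymmetric term
          + 0.5 * tau * (sigma - S) @ V             # pairwise term
          + 2.0 * gamma * (U - B)                   # quantization term
          + 2.0 * eta * U.sum(axis=0, keepdims=True))  # bit-balance term
    return dU * (1.0 - U ** 2)                      # derivative of tanh

def grad_G(F, G, B, S, tau, gamma, eta):
    """Gradient w.r.t. G; note the transpose in the pairwise term."""
    U, V = np.tanh(F), np.tanh(G)
    k = B.shape[1]
    sigma = 1.0 / (1.0 + np.exp(-0.5 * U @ V.T))
    dV = (2.0 * (V @ B.T - k * S) @ B
          + 0.5 * tau * (sigma - S).T @ U
          + 2.0 * gamma * (V - B)
          + 2.0 * eta * V.sum(axis=0, keepdims=True))
    return dV * (1.0 - V ** 2)

def numerical_grad(fun, X, eps=1e-6):
    """Central finite differences of a scalar function of a matrix."""
    g = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        Xp, Xm = X.copy(), X.copy()
        Xp[idx] += eps
        Xm[idx] -= eps
        g[idx] = (fun(Xp) - fun(Xm)) / (2.0 * eps)
    return g

rng = np.random.default_rng(0)
N, k = 5, 3
F, G = rng.standard_normal((N, k)), rng.standard_normal((N, k))
B = np.where(rng.standard_normal((N, k)) >= 0, 1.0, -1.0)
S = np.where(rng.standard_normal((N, N)) >= 0, 1.0, -1.0)
args = (B, S, 10.0, 100.0, 10.0)
num = numerical_grad(lambda Z: objective(Z, G, *args), F)
print(np.max(np.abs(grad_F(F, G, *args) - num)))    # should be close to zero
```

In a real training loop these per-sample gradients would be handed to the deep learning framework, which applies the chain rule through the network to obtain the gradients of the weights.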
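3) Update $B$ with $(F, W_f)$ and $(G, W_g)$ fixed: By fixing $(F, W_f)$ and $(G, W_g)$, we can get the following formulation.

$$\min_{B} L(B) = \|U B^T - kS\|_F^2 + \|V B^T - kS\|_F^2 + \gamma \left( \|U - B\|_F^2 + \|V - B\|_F^2 \right), \quad \text{s.t. } B \in \{-1, +1\}^{N \times k}. \tag{10}$$

Then Eq. (10) can be rewritten as:

$$\min_{B} L(B) = \|B U^T\|_F^2 + \|B V^T\|_F^2 - 2\, \mathrm{tr}\left( B^T \left( k S^T U + k S^T V + \gamma U + \gamma V \right) \right) + \text{const}, \quad \text{s.t. } B \in \{-1, +1\}^{N \times k}, \tag{11}$$

where 'const' means a constant value without any association with $B$. For the sake of simplicity, let $Q = -2k(S^T U + S^T V) - 2\gamma(U + V)$. Eq. (11) can be simplified to

$$\min_{B} L(B) = \|B U^T\|_F^2 + \|B V^T\|_F^2 + \mathrm{tr}(B^T Q) + \text{const}, \quad \text{s.t. } B \in \{-1, +1\}^{N \times k}. \tag{12}$$

According to Eq. (12) and [17], $B$ can be updated bit by bit. In other words, we update one column of $B$ with the remaining columns fixed. Let $B_{*c}$ be the $c$-th column of $B$ and $\hat{B}_c$ be the remaining columns of $B$; $U_{*c}$, $\hat{U}_c$, $V_{*c}$, $\hat{V}_c$, $Q_{*c}$ and $\hat{Q}_c$ are defined in the same way. Eq. (12) can then be rewritten as:

$$\min_{B_{*c}} \; \mathrm{tr}\left( B_{*c}^T \left[ 2 \hat{B}_c \left( \hat{U}_c^T U_{*c} + \hat{V}_c^T V_{*c} \right) + Q_{*c} \right] \right) + \text{const}, \quad \text{s.t. } B_{*c} \in \{-1, +1\}^{N}. \tag{13}$$

Obviously, the optimal solution for $B_{*c}$ is

$$B_{*c} = -\mathrm{sign}\left( 2 \hat{B}_c \left( \hat{U}_c^T U_{*c} + \hat{V}_c^T V_{*c} \right) + Q_{*c} \right). \tag{14}$$

After computing $B_{*c}$, we update $B$ by replacing the $c$-th column with $B_{*c}$. Then we repeat Eq. (14) until all columns are updated.
Overall, the optimization of the proposed method is listed in Algorithm 1.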
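The bit-by-bit update of B can be sketched in NumPy as follows. The sketch assumes that the B-subproblem has the form ‖BUᵀ‖² + ‖BVᵀ‖² + tr(BᵀQ) with Q = −2k(SᵀU + SᵀV) − 2γ(U + V), so that each column has the closed-form minimizer given by the negative sign of a linear term. The helper names `compute_Q`, `b_subproblem_value` and `update_B` are hypothetical, and ties at zero are resolved to +1 as an arbitrary choice.

```python
import numpy as np

def compute_Q(U, V, S, gamma):
    k = U.shape[1]
    return -2.0 * k * (S.T @ U + S.T @ V) - 2.0 * gamma * (U + V)

def b_subproblem_value(B, U, V, Q):
    """Value of the B-subproblem up to an additive constant."""
    return np.sum((B @ U.T) ** 2) + np.sum((B @ V.T) ** 2) + np.sum(B * Q)

def update_B(B, U, V, S, gamma):
    """One sweep over all k columns, each updated with the others fixed."""
    B = np.array(B, dtype=float)            # work on a copy
    Q = compute_Q(U, V, S, gamma)
    for c in range(B.shape[1]):
        B_hat = np.delete(B, c, axis=1)     # remaining columns of B
        U_hat = np.delete(U, c, axis=1)
        V_hat = np.delete(V, c, axis=1)
        z = 2.0 * B_hat @ (U_hat.T @ U[:, c] + V_hat.T @ V[:, c]) + Q[:, c]
        B[:, c] = np.where(z > 0, -1.0, 1.0)   # -sign(z); z == 0 mapped to +1
    return B

rng = np.random.default_rng(0)
N, k = 6, 4
U = np.tanh(rng.standard_normal((N, k)))
V = np.tanh(rng.standard_normal((N, k)))
S = np.where(rng.standard_normal((N, N)) >= 0, 1.0, -1.0)
B0 = np.where(rng.standard_normal((N, k)) >= 0, 1.0, -1.0)
B1 = update_B(B0, U, V, S, gamma=100.0)
Q = compute_Q(U, V, S, 100.0)
print(b_subproblem_value(B0, U, V, Q), b_subproblem_value(B1, U, V, Q))
```

Because the quadratic part of a single column is constant for ±1 entries, each column update exactly minimizes the subproblem in that column, so a sweep started from a valid ±1 matrix never increases the subproblem value.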

D. Query
When $W_f$ and $W_g$ are learned, the hash functions corresponding to the two neural networks are obtained. For a given testing image $x^*$, two kinds of binary codes can be computed, which are $b^*_f = \mathrm{sign}(f(x^*, W_f))$ and $b^*_g = \mathrm{sign}(g(x^*, W_g))$, respectively. Note that, since tanh does not influence the sign of each element, we do not apply tanh to the output in the testing phase. From the experiments, we find that the performances computed through the first and second networks are quite similar. To obtain a more robust result, we use the average of the two outputs as the final result in our experiments.
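The query step can be sketched as follows. The sketch adopts one plausible reading of "the average of the two outputs", namely averaging the real-valued outputs of the two streams before taking the sign; the text does not spell out the exact procedure, so this is an assumption. It also maps zero to +1 and ranks database items by Hamming distance using the identity dist = (k − ⟨b, q⟩)/2 for ±1 codes. The function names are hypothetical.

```python
import numpy as np

def query_code(f_out, g_out):
    """Binary code from the averaged outputs of the two streams (assumed reading)."""
    avg = 0.5 * (np.asarray(f_out) + np.asarray(g_out))
    return np.where(avg >= 0, 1, -1)

def rank_by_hamming(query, database):
    """Return database indices sorted by Hamming distance to the query."""
    k = database.shape[1]
    dist = (k - database @ query) // 2      # valid for codes in {-1, +1}
    return np.argsort(dist, kind="stable"), dist

q = query_code([0.2, -0.4, 0.9], [0.4, 0.1, -0.3])
database = np.array([[-1, 1, -1],
                     [1, -1, 1],
                     [-1, -1, 1]])
order, dist = rank_by_hamming(q, database)
print(q, order, dist)
```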

IV. EXPERIMENTS
In this section, experiments are conducted on three large-scale datasets to demonstrate the effectiveness of the proposed method compared with some state-of-the-art approaches. We first describe the datasets used in our experiments, followed by the description of the baselines, evaluation protocol, and implementation. We then make a comparison with other methods. The parameter sensitivity and the convergence are subsequently discussed.

A. Datasets
The IAPR TC-12 dataset [33] consists of 20000 images associated with 255 categories. Since some samples have multiple labels, we set $S_{ij} = 1$ only if the $i$-th and $j$-th samples share at least one label. In this dataset, 2000 images are used for testing, and 5000 samples selected from the remaining 18000 points (the retrieval set) are used for training, which greatly reduces the training time.
The MIRFLICKR-25K dataset [34] is composed of 25000 images collected from the Flickr website. Following [32], 20015 images associated with 24 categories are selected. As in the IAPR TC-12 dataset, some images have multiple labels, so we also define two images to be ground-truth neighbors if they share at least one label. Additionally, 2000 images are randomly selected as the testing data, and the rest are defined as the retrieval data. Meanwhile, 5000 samples selected from the retrieval data are used for training.
The CIFAR-10 dataset [35] contains 60000 32×32 color images in ten categories. Each image belongs to one of these ten classes. Two images are regarded as semantic neighbors if they have the same label. Similar to the setting in [17], we randomly select 1000 samples as the testing data. In order to reduce the training time, we also randomly select 5000 images from the remaining 59000 images as the training data. The remaining 59000 images are regarded as the retrieval set.

B. Baseline and Evaluation Protocol
To demonstrate the superiority of DADH, some existing hashing methods are used for comparison, including one data-independent method (LSH [2]), four traditional data-dependent methods (ITQ [4], DPLM [36], SDH [10], SGH [20]) and three deep learning based hashing methods (DPSH [14], ADSH [17], DAPH [15]). Since LSH, ITQ, DPLM, SDH, and SGH are not deep learning methods, features have to be extracted beforehand. For these three datasets, we have extracted the 4096-D CNN feature and the 512-D GIST feature. We have found that these five approaches often achieve better performance on the GIST feature. Thus we use the GIST feature as the input for LSH, ITQ, DPLM, SDH, and SGH. For DPSH, ADSH, and DAPH, the raw image is used as the input and all images are resized to 224×224×3. For all deep learning methods, CNN-F is used as the network for feature extraction, and the parameters of DPSH and ADSH are set according to the descriptions in their publications. Note that, since the code for DAPH has not been released, we carefully implement it with the deep learning toolbox MatConvNet [37]. Additionally, the original network structure in DAPH is not CNN-F, which means the parameters in [15] may not be optimal. Thus, we try our best to tune the parameters of DAPH.
To quantitatively evaluate the proposed method and the other comparison methods, two widely used metrics, mean average precision (MAP) and precision-recall (PR), are adopted. MAP is defined as follows. Given a query, the average precision (AP) is first computed over a set of $R$ retrieved results:

$$AP = \frac{1}{T} \sum_{r=1}^{R} P(r)\, \delta(r),$$

where $T$ is the total number of relevant documents in the retrieved set, $P(r)$ is the precision of the top $r$ retrieved results, and $\delta(r)$ denotes whether the $r$-th retrieved sample is relevant (if the instance is a true neighbor of the query, $\delta(r) = 1$; otherwise $\delta(r) = 0$). MAP is the mean of the AP values over all queries. Additionally, similar to some existing methods [15] [32], Top-500 MAP and Top-500 Precision are also used to evaluate the proposed method.
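The AP and MAP definitions can be sketched in Python as follows. The sketch assumes that each query comes with a 0/1 relevance list ordered by retrieval rank (for example, by increasing Hamming distance), and that T is the number of relevant items in that list; a query with no relevant item is given an AP of zero by convention. The function names are hypothetical.

```python
import numpy as np

def average_precision(relevance):
    """AP of one ranked 0/1 relevance list: (1/T) * sum_r P(r) * delta(r)."""
    relevance = np.asarray(relevance, dtype=float)
    T = relevance.sum()                      # relevant items among the R results
    if T == 0:
        return 0.0                           # convention for queries with no hit
    ranks = np.arange(1, len(relevance) + 1)
    precision_at_r = np.cumsum(relevance) / ranks
    return float(np.sum(precision_at_r * relevance) / T)

def mean_average_precision(relevance_lists):
    """MAP: the mean of AP over all queries."""
    return float(np.mean([average_precision(r) for r in relevance_lists]))

# Relevant results at ranks 1 and 3: AP = (1/1 + 2/3) / 2 = 5/6.
print(average_precision([1, 0, 1, 0]))
```

A Top-500 variant is obtained by truncating each relevance list to its first 500 entries before calling `average_precision`.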

C. Implementation
We implement DADH with the deep learning toolbox MatConvNet [37] on a Titan X GPU. The pre-trained ImageNet model is used to initialize the first seven layers in each stream, and the weights of the last layer are initialized randomly. During training, we set the mini-batch size to 64 and divide the learning-rate range [10^-6, 10^-4] over 150 iterations. In other words, the learning rate gradually decreases from 10^-4 to 10^-6, and stochastic gradient descent is used to update the weights. Based on cross-validation (a small validation set is randomly selected from the training data), we set γ = 100, η = 10, and τ = 10 for the three datasets. We further demonstrate the insensitivity to these parameters in Section IV-E.
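A learning-rate schedule of this kind can be sketched in one line of NumPy. The text only states that the rate decreases from 10^-4 to 10^-6 over 150 iterations; the logarithmic spacing used below is an assumption (a linear schedule would be equally compatible with the description), and the function name `learning_rates` is hypothetical.

```python
import numpy as np

def learning_rates(n_iters=150, lr_start=1e-4, lr_end=1e-6):
    """Per-iteration learning rates, log-spaced from lr_start down to lr_end."""
    return np.logspace(np.log10(lr_start), np.log10(lr_end), n_iters)

lrs = learning_rates()
print(lrs[0], lrs[-1], len(lrs))
```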

D. Comparison with Other Methods
1) IAPR TC-12: The MAP scores obtained by different methods on the IAPR TC-12 dataset are shown in Tab. II. It is easy to observe that DADH achieves a remarkable improvement in MAP scores compared with the other approaches. In contrast to the data-independent method LSH, DADH achieves MAP scores that are 10%-20% higher. Compared with ITQ, DPLM, SDH and SGH, there is also an obvious improvement. Specifically, our proposed method obtains a MAP score of at least 46.54% and reaches as high as 55.39% when the bit length is 48, while the best result obtained by ITQ, DPLM, SDH and SGH is only 38.83%, far below ours. Compared with DPSH, ADSH and DAPH, DADH also improves the MAP scores. Particularly, compared with DAPH, the presented approach gains an improvement of about 5% or more when the bit length ranges from 12 to 48. In comparison with DPSH and ADSH, DADH also achieves about 3%-5% improvement when the hash code length is 24, 36, and 48.
The Top-500 MAP and Top-500 Precision scores on the IAPR TC-12 dataset are listed in Tab. III. From this table we can see that the results obtained by the deep learning based methods, including DPSH, ADSH, DAPH and DADH, are remarkably better than those of the traditional approaches. Specifically, there is about 10%-20% improvement in the MAP@Top500 and Precision@Top500 scores in most cases. Compared with DPSH, ADSH and DAPH, the proposed method DADH still achieves the best performance. Except for the case when the bit length is 8, DADH always gains 4% or more in the MAP@Top500 and Precision@Top500 scores, indicating the effectiveness of our method. The Precision-Recall curves computed by different methods on the IAPR TC-12 dataset are displayed in Fig. 2, with the bit length varying from 8 to 48. We can easily observe that the areas covered by the DADH curves are much larger than those of the comparison methods. The proposed method dramatically outperforms the traditional data-independent and data-dependent strategies. Compared with DPSH, ADSH and DAPH, it also performs better for all code lengths.
2) MIRFLICKR-25K: The MAP results of the experiment conducted on the MIRFLICKR-25K dataset are tabulated in Tab. IV. Fig. 3 shows the Precision-Recall curves computed by different methods on the MIRFLICKR-25K dataset, with the bit length varying from 8 to 48. Note that we do not depict the Precision-Recall curve obtained by LSH in Fig. 3(f), since its precision scores are far below those of the others. From Fig. 3 we can observe that DADH remarkably outperforms LSH, ITQ, DPLM, SDH, SGH, ADSH, and DAPH. As for the comparison between DPSH and DADH, the proposed method is obviously superior to DPSH when the code length is 8 and 12. Although DPSH covers more area when the recall value is smaller than 0.4 in Fig. 3(c)-(f), it is inferior to DADH as the recall value increases. Overall, our method still outperforms DPSH when the code length is 16, 24, 36 and 48.
3) CIFAR-10: Tab. VI lists the MAP scores obtained by the proposed method and the various comparison approaches on the CIFAR-10 dataset. As the code length changes from 8 to 48, the MAP scores computed by DADH rise from 71.86% to 83.90%, much higher than those obtained by the traditional hashing approaches, including LSH, ITQ, DPLM, SDH and SGH. Compared with DPSH and DAPH, it is easy to observe that the presented method achieves better performance for every code length. Except for the cases when the code length is 8 and 12, the MAP scores gained by DADH are always higher than 80%, while the best performance of DPSH and DAPH is only 75.02%. Also, ADSH is inferior to the proposed method, especially when the code length is small. This indicates the effectiveness of our method regardless of whether the code length is small or large.

E. Parameter Sensitivity Analysis
The MAP scores under different values of τ, η and γ are shown in Fig. 5. Note that we tune one parameter with the others fixed. For instance, we tune τ over the values {0.001, 1, 5, 10, 50, 100} while fixing η = 10 and γ = 100. Similarly, we set τ = 10 and γ = 100 when tuning η, and τ = 10 and η = 10 when tuning γ. As we can see, our model is insensitive to these parameters. Specifically, τ, η and γ can vary within the wide ranges [1, 50], [1, 50] and [1, 300], respectively. Our method always achieves satisfactory performance when τ, η and γ are in these ranges. This demonstrates the robustness and effectiveness of the proposed method.

F. Convergence Analysis
Our proposed model converges within a few iterations. The changes of the objective function values and MAP scores on the three datasets are displayed in Fig. 6 for a code length of 48 bits. It is easy to observe that DADH converges to a stable value in fewer than 30 iterations. In fact, we further find that the asymmetric terms $\|\tanh(F) B^T - kS\|_F^2$ and $\|\tanh(G) B^T - kS\|_F^2$ greatly contribute to the quick convergence. We remove these two terms from our objective function to study their influence. Note that, if the asymmetric terms are removed, the binary code $B$ is updated through $\mathrm{sign}(\gamma[\tanh(G) + \tanh(F)])$. As shown in Fig. 7, if we remove $\|\tanh(F) B^T - kS\|_F^2$ and $\|\tanh(G) B^T - kS\|_F^2$ from the objective function, not only do the MAP scores degrade, but our model also converges much more slowly than the original DADH, indicating the necessity and significance of the asymmetric terms.

V. CONCLUSION
In this paper, we propose a novel deep hashing method named dual asymmetric deep hashing learning (DADH) for image retrieval. Specifically, two asymmetric networks are designed to integrate the feature representation and hash function learning into an end-to-end framework. A pairwise loss is introduced to exploit the semantic structure between each pair of outputs. Furthermore, another pairwise loss is proposed that not only captures the similarity between the discrete binary codes and the learned real-valued features, but also contributes to quick convergence in the training phase. Experiments are conducted on three large-scale datasets, and the outstanding results substantiate the superiority of the proposed method.
For the MAP results on the MIRFLICKR-25K dataset in Tab. IV, we can see that DADH achieves the best performance for all code lengths. Similar to the results on the IAPR TC-12 dataset, DPLM, SDH, DPSH, ADSH, DAPH and DADH obtain higher MAP values than LSH, ITQ and SGH. Regarding the comparison between the traditional methods and the deep learning methods, DPSH, ADSH, DAPH and DADH dramatically outperform LSH, ITQ, DPLM, SDH and SGH. Compared with the other deep hashing approaches, DADH still achieves an improvement: its MAP scores are about 1.5%-3% higher than those of the three other deep hashing approaches. The Top-500 MAP and Top-500 Precision scores on the MIRFLICKR-25K dataset are displayed in Tab. V. It is easy to observe that DADH obtains the best performance in both metrics.

Fig. 5. The MAP scores with the change of parameters τ, η and γ on three datasets, when the code length is 48.

Fig. 6. The change of objective function values and MAP scores with the increase of iterations.

Fig. 7. The change of objective function values and MAP scores with the increase of iterations. Note that the asymmetric terms $\|\tanh(F) B^T - kS\|_F^2$ and $\|\tanh(G) B^T - kS\|_F^2$ are removed from the objective function.

TABLE I. THE NETWORK STRUCTURE OF CNN-F.

Algorithm 1. Dual Asymmetric Deep Hashing Learning (DADH).
Input: Training data X/Y; similarity matrix S; hash code length k; predefined parameters τ, γ and η.
Output: Hashing functions F and G.
Initialization: Initialize the weights of the first seven layers by using the pre-trained ImageNet model; the last layer is initialized randomly; B is set to be a matrix whose elements are zero.
1: while not converged and the maximum number of iterations is not reached do
2:   Update (F, W_f): Fix (G, W_g) and B, and update (F, W_f) using back-propagation according to Eq. (8).
3:   Update (G, W_g): Fix (F, W_f) and B, and update (G, W_g) using back-propagation according to Eq. (9).
4:   Update B: Fix (F, W_f) and (G, W_g), and update B according to Eq. (14).
5: end while

TABLE II. THE MAP SCORES OBTAINED BY DIFFERENT METHODS ON THE IAPR TC-12 DATASET.

TABLE III. THE TOP-500 MAP AND TOP-500 PRECISION SCORES OBTAINED BY DIFFERENT METHODS ON THE IAPR TC-12 DATASET.

TABLE V. THE TOP-500 MAP AND TOP-500 PRECISION SCORES OBTAINED BY DIFFERENT METHODS ON THE MIRFLICKR-25K DATASET.

TABLE VI. THE MAP SCORES OBTAINED BY DIFFERENT METHODS ON THE CIFAR-10 DATASET.
... than 20% enhancement in most cases. Furthermore, the results computed by DADH are much higher than those of DPSH, ADSH and DAPH. Concretely, the Top-500 MAP and Top-500 Precision scores gained by DADH are almost always higher than 85%, while the scores of the other deep hashing methods are below 85% in most cases.