Contrastive Self-Supervised Hashing With Dual Pseudo Agreement

Recently, unsupervised deep hashing has attracted increasing attention, mainly because of its potential to learn binary codes without identity annotations. However, because the labels are predicted by pretext tasks, unsupervised deep hashing becomes unstable when learning with noisy labels. To mitigate this issue, we propose a simple but effective approach to self-supervised hash learning based on dual pseudo agreement. By adding a consistency constraint, our method can prevent corrupted labels and encourage generalization for effective knowledge distillation. Specifically, we use the refined pseudo labels as a stabilization constraint to train hash codes, which implicitly encodes the semantic structure of the data into the learned Hamming space. Based on the stable pseudo labels, we propose a self-supervised hashing method with mutual information and noise contrastive loss. Throughout the hash learning process, the stable pseudo labels and data distributions work together as teachers to guide binary code learning. Extensive experiments on three publicly available datasets demonstrate that the proposed method consistently outperforms state-of-the-art methods by large margins.


I. INTRODUCTION
Hash learning [1]-[5] aims to accelerate retrieval when querying an image from a large-scale image database. This is a challenging task because only limited binary features are available to preserve the semantic relationships of the original high-dimensional data [6], [7]. Recently, significant advances [8]-[19] have been made in tackling this problem with deep hash learning.
Generally, deep hash learning methods can be roughly divided into two categories: supervised deep hashing [8]-[13] and unsupervised deep hashing [14]-[19]. Supervised deep hashing methods learn deep networks and binary codes simultaneously from labeled data. However, such datasets require time-consuming annotations, which are impractical or expensive in practice. Meanwhile, it is reported that deep supervised hashing models trained on one source domain suffer a significant performance drop on a new target dataset, because of the data bias between the source and target datasets.
The associate editor coordinating the review of this manuscript and approving it for publication was Ilaria Boscolo Galazzo.
Since it is unfeasible to annotate all images in the target dataset, unsupervised deep hashing methods [14]-[19] have been proposed to learn hash codes from the pixels themselves, without relying on semantic annotations. But how to transform high-dimensional image data into low-dimensional binary codes without labels while effectively preserving the structural information between images has become a new issue in this field [20]. State-of-the-art unsupervised deep hashing methods [14]-[19] for image retrieval typically use a pretext task to learn representations on unlabeled data. This kind of unsupervised method can therefore also be called self-supervised hash learning, since it rests on the core idea of creating supervision by predicting labels on the data itself.
Although pretext tasks [21] such as generative adversarial networks (GANs) [22]-[24] and clustering can alternately refine the binary codes to some extent, the training of the neural network is still substantially hindered by the inevitable label noise. Recently, some works try to select clean instances out of the noisy ones [25]. For example, DistillHash [18] treated the initial similarity relationships as noisy similarity signals and learned a distilled dataset to perform confident unsupervised hash code learning. PSCH [25] used soft multi-part corresponding similarity and refined pseudo labels to further enhance the accuracy of hash codes. However, the performance of these unsupervised approaches is still far behind their fully-supervised counterparts [10], [12]. The main reason is that the noisy labels are varied: they can derive from imperfect clustering results, the unknown number of target-domain identities, and the limited transferability of source-domain features.

FIGURE 1. Illustration of contrastive self-supervised hashing with dual pseudo agreement. Each unlabeled instance is assigned two pseudo labels by different pre-trained networks. Refined pseudo labels are used to encode the semantic structure of the data.
To effectively address the problem of noisy labels in unsupervised deep hashing, in this article we propose a simple but effective approach to self-supervised hash learning based on dual pseudo agreement (illustrated in Fig. 1). Specifically, we use two deep networks with identical architectures to produce refined pseudo labels. By adding a consistency constraint, our method can prevent corrupted labels, which encourages generalization for effective knowledge distillation. Furthermore, we use the refined pseudo labels as a stabilization constraint to train hash codes, ensuring that errors from noisy labels do not accumulate during hash learning.
The main contributions of our method can be summarized as follows:
• We propose a simple but effective dual pseudo agreement labelling paradigm, which allows us to learn deep binary codes robustly even with extremely noisy labels. Although this conclusion is almost trivial from a statistical perspective, in practice it significantly eases hash code learning by lifting requirements on the availability of clean data.
• Based on the stable pseudo labels, we propose a self-supervised hashing method with mutual information and noise contrastive loss. Throughout the hash learning process, the stable pseudo labels and data distributions work together as teachers to guide binary code learning.
• Extensive experiments on three benchmark datasets show that the robustness of the proposed method is superior to many state-of-the-art approaches. Furthermore, the ablation studies clearly demonstrate the reliability of dual pseudo agreement and contrastive loss.
II. RELATED WORK
In this section, we review recent progress on unsupervised deep hashing methods. Since unsupervised deep learning is itself a challenging task, it also brings new challenges to unsupervised deep hashing.

A. UNSUPERVISED DEEP HASHING METHODS
Unsupervised deep hashing methods can learn binary codes without semantic annotations and are widely adopted to accelerate approximate nearest neighbor (ANN) search in large-scale databases [27]. We briefly group unsupervised deep hashing methods into two categories: generative model based deep hashing and pseudo label learning based deep hashing.

Generative model based deep hashing mainly explores unconditioned image synthesis tasks for hash learning [22]. For example, stochastic generative hashing (SGH) [16] proposed a novel generative approach based on the minimum description length principle to learn binary hash codes. HashGAN [28] proposed a new deep unsupervised hashing network consisting of three networks: a generator, a discriminator, and an encoder. Binary generative adversarial network (BGAN) [14] proposed to simultaneously learn a binary representation and generate an image plausibly similar to the original one. However, unsupervised hashing methods based on generative models are limited by their complexity. Furthermore, these methods cannot capture semantic relationships between different data points through the reconstruction loss.

Recently, pseudo label [29], [30] learning based deep hashing has been widely used and leads to better performance. In general, these methods use pseudo labels to convert the unsupervised problem into a supervised one. For example, Hu et al. [31] proposed the first two-step pseudo label based unsupervised deep discriminative hashing algorithm. In the first stage, they cluster images via k-means and treat the cluster labels as pseudo labels. In the second stage, the pseudo labels are used as soft supervision to train a deep hashing network by minimizing the classification loss and quantization loss. Unsupervised triplet hashing (UTH) [32] proposed to construct image triplets from an anchor image, a rotated image, and a random image.
Similarity-adaptive deep hashing (SADH) [13] proposed to train the hashing model alternately over three modules, which substantially improves the robustness of binary code optimization. Semantic structure based unsupervised deep hashing (SSDH) [17] analyzed the cosine distance distribution of image features and constructed the similarity matrix from the pairwise distances. DistillHash [18] improved on SSDH [17] by treating the initial similarity relationships as noisy similarity signals and learning a distilled dataset for confident unsupervised hash code learning.
However, most of these works ignore label noise, which inevitably yields negative pairs that share similar semantic meaning in the embedding space. Such a problem can also lead to class collision [33] and hurt hash code learning.

B. DEEP LEARNING WITH NOISY LABELS
Previous works have demonstrated that noisy labels can severely degrade the performance of supervised deep learning algorithms [34]. Several methods have been proposed to address this problem by outlier detection [35], thresholding [36] or reweighting [37].
For example, MentorNet [38] proposed to learn a data-driven curriculum dynamically with StudentNet, which can significantly improve the generalization performance of deep networks trained on corrupted training data. Co-teaching [39] proposed to train two deep neural networks simultaneously and identify the examples with small loss as clean examples. Noise-tolerant [40] proposed a novel training paradigm that uses an angular margin based loss to reflect the probability of noisy labels. PurifyNet [41] proposed to refine the annotated labels and optimize the neural network by progressively adjusting the predicted logits. Li et al. [42] proposed a distillation framework, which uses ''side'' information to ''hedge the risk'' of learning from noisy labels. Shu et al. [43] proposed a meta-learning method, which can adaptively learn hyper-parameters in robust loss functions. JoCoR [44] proposed a joint training method to select small-loss examples and reduce the diversity of two networks during training. DivideMix [45] proposed to model the per-sample loss distribution with a mixture model, which can dynamically divide the training data into a labeled set of clean samples and an unlabeled set of noisy samples in a semi-supervised manner.
In the unsupervised deep hashing domain, DistillHash [18] treated the initial similarity relationships as noisy similarity signals and learned a distilled dataset to perform confident unsupervised hash code learning. PSCH [25] used soft multi-part corresponding similarity and refined pseudo labels to further enhance the accuracy of hash codes. Despite these important findings, how to further improve the performance of these unsupervised approaches is still an open question. In this article, we provide further insights into unsupervised hash learning by investigating refined pseudo labels and robust loss functions.

III. PROPOSED METHOD
In this section, we first introduce the dual pseudo agreement labelling paradigm and then provide a self-supervised hashing method based on the contrastive loss.
A. DUAL PSEUDO AGREEMENT LABELLING
The goal of deep image hashing is to learn a nonlinear hash function that maps each image x_i to a compact binary code b_i. Based on the pairwise similarities, the key idea of deep hashing lies in constructing a similarity matrix S ∈ {−1, 1}^{N×N}, where S_ij = 1 denotes that x_i and x_j are similar and S_ij = −1 denotes that x_i and x_j are dissimilar. According to whether extra label information Y = {y_i}_{i=1}^{N} is used, existing deep hashing methods can be categorized into either supervised or unsupervised learning. In the supervised setting, instance-level annotations (e.g., classification tags) are used to measure whether two images are similar. In the unsupervised setting, no additional labels are available, and thus the raw features are the only source of judgment.
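For intuition, in the supervised setting the similarity matrix S can be built directly from class labels. A minimal NumPy sketch (variable and function names are illustrative, not from the paper):

```python
import numpy as np

def label_similarity_matrix(labels):
    """Build S in {-1, +1}^{N x N}: S_ij = 1 iff x_i and x_j share a class label."""
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]  # broadcasted pairwise comparison
    return np.where(same, 1, -1)

# Four images from classes 0, 0, 1, 2: only the first pair is similar.
S = label_similarity_matrix([0, 0, 1, 2])
```

The unsupervised setting must approximate this matrix from features alone, which is where the noise discussed below enters.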
Obviously, deep supervised hashing methods can provide more accurate estimates of S than unsupervised methods. However, annotating all the images in a dataset is time-consuming, which is impractical or expensive in practice. Hence, we focus on improving the performance of unsupervised deep hashing methods, which are easier to deploy in real-world scenarios.
According to the above analysis, the gap between supervised and unsupervised deep hashing methods is not that large. Actually, supervised hashing methods do not require real manually labeled information; they only need to know the similarity between different images. In the unsupervised setting, the similarity matrix S can also be obtained by calculating the distances between feature vectors. However, because feature similarities are not as accurate as label information, the estimated S always contains some noise. In other words, if we can estimate the similarity matrix S more accurately, we can further improve the accuracy of unsupervised deep hashing. In this article, we use two different deep neural networks, denoted as f(x; θ_1) and f(x; θ_2), to estimate the refined similarity matrix S. As illustrated in Fig. 1, we first use the two pre-trained deep neural networks to extract two sets of deep features. Then, we can roughly label the training data based on their feature centroid matrices C^1 and C^2. This means we assume the corruption of labels is a random variable distributed according to the clean target, which is an abstract approximation to the real-world corruption of labels [35].
Inspired by deep clustering theory [46], by iterating the k-means centers C^1 and C^2, the deep image features are divided into k different groups by optimizing the following problems:

min_{C^1 ∈ R^{d×k}} (1/N) Σ_{i=1}^{N} min_{y_i} || f(x_i; θ_1) − C^1 y_i ||_2^2,  s.t. y_i ∈ {0, 1}^k, y_i^T 1_k = 1,   (1)

min_{C^2 ∈ R^{d×k}} (1/N) Σ_{i=1}^{N} min_{y_i} || f(x_i; θ_2) − C^2 y_i ||_2^2,  s.t. y_i ∈ {0, 1}^k, y_i^T 1_k = 1,   (2)

where d is the dimension of the feature vectors and y_i is the label associated with x_i. After solving Eq. (1) and Eq. (2), we can use these assignments as pseudo labels. However, due to the domain gaps, the pre-trained networks f(x; θ_1) and f(x; θ_2) will produce many noisy labels, which are usually located at the edges, far away from the centers of the sample features. Consequently, we use two distance thresholds to reject noisy samples:

d(f(x_i; θ_1), c^1_{y_i}) ≤ μ^1_d + σ^1_d,   (3)

d(f(x_i; θ_2), c^2_{y_i}) ≤ μ^2_d + σ^2_d,   (4)

where d(·, ·) is the cosine distance, μ^1_d and μ^2_d are the means of the cosine distances between the features and their centroid features, and σ^1_d and σ^2_d are the standard deviations of the cosine distance distributions.
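The clustering-plus-rejection step can be sketched as follows. This is a minimal NumPy sketch with a plain k-means loop; the μ_d + σ_d rejection rule follows the description above, while details such as the iteration count and initialization are illustrative assumptions:

```python
import numpy as np

def kmeans(feats, k, iters=20, seed=0):
    """Plain k-means: returns (centroids, cluster labels) for a feature matrix."""
    rng = np.random.default_rng(seed)
    cents = feats[rng.choice(len(feats), k, replace=False)]  # init from data points
    for _ in range(iters):
        # Squared Euclidean distance of every feature to every centroid.
        d = ((feats[:, None, :] - cents[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            if (labels == j).any():
                cents[j] = feats[labels == j].mean(0)
    return cents, labels

def reject_noisy(feats, cents, labels):
    """Keep samples whose cosine distance to their own centroid is at most
    mean + std of that distance distribution (the mu_d + sigma_d rule)."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    c = cents / np.linalg.norm(cents, axis=1, keepdims=True)
    cos_d = 1.0 - (f * c[labels]).sum(1)
    return cos_d <= cos_d.mean() + cos_d.std()
```

In the dual scheme, this procedure is run once per pre-trained network, yielding two label vectors and two keep-masks.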
Based on the two distance thresholds in Eq. (3) and Eq. (4), we obtain two refined pseudo label sets. From a statistical viewpoint, we take the common part of the two sets as a maximum likelihood (ML) estimate and formulate the similarity matrix:

S_ij = 1 if y^1_i = y^1_j and y^2_i = y^2_j, and S_ij = −1 otherwise,   (5)

where the agreement is computed over the samples retained by both thresholds. Here ||f_i − f_j||_2 denotes the Euclidean distance between two feature vectors, which lies in the range [0, 2] for two normalized vectors. Furthermore, we consider two different modal information sources in Eq. (5). Due to the refinement step, our final labeled dataset has m samples, which is smaller than the original N.
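The agreement step above can be sketched as a small helper (hypothetical function name; it intersects the two kept sets and marks a pair as similar only when both pseudo-labelers agree):

```python
import numpy as np

def dual_agreement_similarity(labels1, labels2, keep1, keep2):
    """S_ij = 1 only when both pseudo-labelers put i and j in the same cluster,
    restricted to the m samples retained by both distance thresholds."""
    keep = np.asarray(keep1) & np.asarray(keep2)
    idx = np.flatnonzero(keep)                     # indices of the m kept samples
    l1, l2 = np.asarray(labels1)[idx], np.asarray(labels2)[idx]
    same = (l1[:, None] == l1[None, :]) & (l2[:, None] == l2[None, :])
    return np.where(same, 1, -1), idx
```

A pair survives as a positive only if both networks agree, which is exactly how the consistency constraint filters corrupted labels.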

B. CONTRASTIVE SELF-SUPERVISED HASHING
Training accurate hash codes in the presence of noisy labels is crucial for unsupervised hash learning. Our contrastive self-supervised hashing method builds on the dual pseudo agreement labels and distributions introduced in Section III-A, where the final pseudo labels are refined by two different pre-trained deep networks. Our hashing framework uses the final pseudo labels and data-to-data relations to train a deep siamese network [47] with shared structures and parameters. The siamese network is a common architecture for reducing model size and is widely used in deep learning [48], [49]. For the backbone of the deep siamese network, we adopt the PSCH architecture [25] for hash code learning. An overview of our method is illustrated in Fig. 2. The main intuition behind our framework is to learn binary representations that encode the underlying shared information between different images. Therefore, our loss function has three complementary parts: contrastive loss, hashing loss, and balance loss.

1) CONTRASTIVE LOSS
In order to alleviate label noise, we propose a contrastive loss that does not depend on the labels. By encouraging smoothness of decision boundaries along the data manifold, this contrastive loss can improve the robustness of hash codes. Formally, for any data point x, the contrastive loss aims to learn an encoder f such that:

score(f(x), f(x^+)) ≫ score(f(x), f(x^−)),   (6)

where x is an ''anchor'' data point, x^+ is a positive sample, and x^− is a negative sample. The score function is a metric that measures the similarity between two features. From Eq. (6), we can see that the contrastive loss aims to bring samples from the same instance closer and to separate samples from different instances.
To optimize this property, we introduce a mutual information loss, InfoNCE [50], for hash learning. Different from [50], our method uses it from the perspective of resistance to label noise. The loss encourages the score function to assign large values to positive examples and small values to negative examples:

L_NCE = −E[ log( exp(score(f(x), f(x^+))) / ( exp(score(f(x), f(x^+))) + Σ_j exp(score(f(x), f(x^−_j))) ) ) ].   (7)

As we can see, the contrastive objective in Eq. (7) is a kind of mutual information (MI) loss, which explicitly considers data-to-data relationships. Specifically, minimizing the InfoNCE loss maximizes a lower bound on the mutual information between f(x) and f(x^+) [51]. Furthermore, the MI loss is better than the traditional classification loss due to the separation of entropy and conditional entropy.
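A minimal single-anchor InfoNCE sketch; cosine similarity with a temperature τ is an assumed choice of score function here, not necessarily the paper's exact one:

```python
import numpy as np

def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE for one anchor: negative log of the softmax probability
    assigned to the positive pair; score = cosine similarity / tau."""
    def score(u, v):
        return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v) * tau)
    logits = np.array([score(anchor, positive)] +
                      [score(anchor, n) for n in negatives])
    logits -= logits.max()  # numerical stability before exponentiation
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())
```

The loss drops when the positive outscores the negatives, matching the inequality in Eq. (6).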

2) HASHING LOSS
In order to preserve the learned semantic structure, we define the hashing loss based on the hash layer outputs b_i and b_j:

H_ij = (1/K) b_i^T b_j,   (8)

L_H = Σ_{i,j} (H_ij − S_ij)^2,   (9)

where H_ij is the similarity structure in the hash code space, K is the code length, and S_ij is defined by the final pseudo labels in Eq. (5). According to Eq. (9), the hashing loss minimizes the discrepancy between the semantic structure and the hash similarity structure. During hash learning, we relax the binary codes to b_i = tanh(φ(x_i)), the output of the neural network, so that the above loss can be integrated into our deep architecture. However, this relaxed strategy may be harmful to the performance due to the relaxation error [52]. To reduce the relaxation error, a natural choice is to introduce the balance loss when solving the relaxed problem.
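A minimal sketch of the hashing loss, assuming the common inner-product form H_ij = b_i·b_j / K for the code-space similarity (a reconstruction under that assumption, not the authors' exact code):

```python
import numpy as np

def hashing_loss(b, S):
    """L_H: squared gap between the code-space similarity H_ij = b_i . b_j / K
    and the pseudo-label similarity S_ij in {-1, +1}."""
    K = b.shape[1]
    H = (b @ b.T) / K       # pairwise scaled inner products of relaxed codes
    return float(((H - S) ** 2).mean())
```

When codes of similar pairs align and codes of dissimilar pairs oppose, H matches S exactly and the loss is zero.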

3) BALANCE LOSS
The balance loss was originally proposed for balancing hash codes; it is also an effective way to make the binary values close to the network outputs. The balance loss has also been proved robust to label noise and is defined as:

L_B = Σ_{i=1}^{m} || b_i − sgn(h_i) ||^2,   (10)

where || b_i − sgn(h_i) ||^2 measures the quantization error between the binary code b_i and the deep activation h_i. L_B constrains the binary codes to fulfill the binary-valued requirement more appropriately.
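The balance loss can be sketched as follows (minimal NumPy version, assuming the tanh relaxation b_i = tanh(h_i) described in the hashing loss):

```python
import numpy as np

def balance_loss(h):
    """L_B: quantization error between the relaxed code tanh(h) and its
    binarization sgn(h); it vanishes as activations saturate toward +/-1."""
    b = np.tanh(h)
    return float(((b - np.sign(h)) ** 2).sum(axis=1).mean())
```

Penalizing this gap pushes the network to emit near-binary activations, shrinking the relaxation error mentioned above.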

4) FINAL LOSS FUNCTION
Inspired by the benefits of complementary learning [53], we combine the strengths of the three loss functions in Eq. (7), Eq. (9), and Eq. (10), leveraging their complementary aspects for both robust and sufficient learning. Formally, the final loss function of our method is defined as:

L = L_NCE + α L_H + β L_B,   (11)

where α > 0 and β > 0 are two hyper-parameters that balance the hashing loss and the balance loss. With proper balancing between the three terms, our final loss function enjoys both convergence and robustness advantages. Note that, based on Eq. (11), we can avoid the vanishing gradient problem by using the smooth activation b_i = tanh(h_i) to approximate the non-smooth activation b_i = sgn(h_i) by continuation. Therefore, the final loss can be optimized with the standard stochastic gradient descent (SGD) method. During training, we randomly select 48 images from the training set to update the parameters. During testing, image binary codes are predicted by a single forward pass.
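The combination and the continuation trick can be sketched as follows (α = 0.6 and β = 0.01 follow the settings reported in the experiments; the γ-scaled continuation is a standard formulation assumed here):

```python
import numpy as np

def total_loss(l_nce, l_hash, l_bal, alpha=0.6, beta=0.01):
    """Final objective of Eq. (11): L = L_NCE + alpha * L_H + beta * L_B."""
    return l_nce + alpha * l_hash + beta * l_bal

def relaxed_codes(h, gamma):
    """Continuation: tanh(gamma * h) approaches sgn(h) as gamma grows,
    giving sgn-like codes while keeping nonzero gradients for SGD."""
    return np.tanh(gamma * h)
```

During training γ can be increased gradually, so the relaxed codes converge to the binary codes used at test time.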

IV. EXPERIMENTS
Following common practice in hash learning, we provide comprehensive experiments to demonstrate the effectiveness of our method. We first introduce three public datasets and implementation details. Following these settings, we compare our method with some representative prior works. Finally, parameter sensitivity, training complexity, and retrieval results are further reported.

A. BENCHMARK DATASETS
We evaluate the performance of our method on three popular benchmark datasets: CIFAR-10, NUSWIDE, and FLICKR25K. All images except the query images are used as the retrieval set. The descriptions of these datasets are as follows.
• CIFAR-10. This dataset is a popular subset of the 80M Tiny Images dataset, composed of 60,000 32×32 color images from 10 classes (each class contains 6,000 images). Following the experimental settings in [14], [17], [18], we randomly select 500 images per category as the training data and 1,000 images per category as the query set.
• NUSWIDE. This dataset contains 269,648 color images covering 81 semantic concepts. Following the experimental settings in [18], we use the subset of the 10 most popular concepts, where each concept contains at least 5,000 images. We randomly select 100 query images and 1,000 training images per class.
• FLICKR25K. This popular dataset contains 25,000 images collected from the Flickr website. Each image is annotated with one of 38 unique labels. We randomly select 4,000 images for training and 1,000 query images for testing.

B. IMPLEMENTATION DETAILS
Our method is implemented in the TensorFlow [54] framework. For non-deep hashing methods [55]-[60], we use the 4096-D deep features extracted from the last fully-connected layer of the VGG16 network to learn hash codes. For deep hashing methods, we use raw images as inputs and adopt VGG16 as the backbone architecture to extract deep representations.
During training, we use the mini-batch SGD algorithm with momentum 0.9 to optimize the network. The learning rate is 0.001 and decays by a factor of 0.1 every 100 iterations. We set the batch size to 48, α = 0.6, and β = 0.01. A detailed analysis of these hyper-parameters is given in the following sections. Our experiments run on a Windows 10 (64-bit) platform with a GeForce RTX 2080 Ti GPU.
To evaluate the performance of our method, we use three widely employed evaluation criteria [61]: mean average precision (MAP), TopN-precision curves, and precision-recall (PR) curves. Detailed descriptions of these criteria can be found in [62].
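For concreteness, MAP can be computed as follows (a minimal NumPy sketch; `relevance` is an illustrative binary list marking whether each ranked result shares the query's label):

```python
import numpy as np

def average_precision(relevance):
    """AP of one ranked retrieval list: mean precision@k over relevant hits."""
    relevance = np.asarray(relevance, dtype=float)
    hits = np.flatnonzero(relevance)          # rank positions of relevant items
    if hits.size == 0:
        return 0.0
    return float((np.cumsum(relevance)[hits] / (hits + 1)).mean())

def mean_average_precision(relevance_lists):
    """MAP: average AP over all queries."""
    return float(np.mean([average_precision(r) for r in relevance_lists]))
```

For Hamming retrieval, each relevance list is obtained by ranking the database codes by Hamming distance to the query code.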
C. COMPARISON WITH STATE-OF-THE-ART METHODS
The overall MAP results of our method and other methods with different hash code lengths are shown in Table 1. The first and second best results are highlighted in bold and underline, respectively. As can be seen in Table 1, traditional unsupervised hashing methods (ITQ [57] and DSH [59]) sometimes surpass unsupervised deep hashing methods (DeepBit [15] and SGH [16]). This indicates that some deep hashing methods may achieve unsatisfactory performance due to label noise and over-fitting. However, by reducing the risk of learning under noisy labels, deep hashing methods (DistillHash [18] and our method) achieve more promising results. Compared with other unsupervised deep hashing methods, our method consistently achieves the best performance across different lengths of hash codes on all three datasets. Compared with DistillHash [18], the MAP improvement of our method is over 13% in all cases on the FLICKR25K dataset. Furthermore, our method outperforms BGAN [14] by over 6% on the CIFAR-10 dataset and 2% on the NUSWIDE dataset.
Additionally, we provide the TopN-precision and precision-recall curves for a more comprehensive comparison on the NUSWIDE and FLICKR25K datasets. As shown in Fig. 3 and Fig. 4, the curves of our method are consistently above all other curves, demonstrating that our method achieves higher precision than all competing methods.

D. EFFECTIVENESS OF DUAL PSEUDO AGREEMENT
To validate the effectiveness of the dual pseudo agreement mechanism in our self-labelling procedure, we compare our method with a single pseudo label baseline: single pseudo label + hash function learning. The MAP retrieval results are shown in Table 2 and Table 3. It can be observed that dual pseudo agreement effectively improves the retrieval performance. This happens because more corrupted labels are prevented by the consistency constraint. Moreover, the dual pseudo agreement mechanism connects the learning of the hash function and the binary codes, which makes them more compatible and easier to preserve the updated semantic similarity.

E. PARAMETER SENSITIVITY
To investigate the sensitivity of the hyper-parameters α and β, we conduct experiments with different parameter settings. Fig. 5 and Fig. 6 show the precision scores of our method when α and β vary within a range at 32 bits and 64 bits on the CIFAR-10 dataset. For both 32 bits and 64 bits, there is a significant drop in MAP when α and β become very small or very large. This is because α and β are designed to balance the hashing loss and the balance loss; setting them too small or too large leads to an imbalance between these two terms. The best result is obtained with α = 0.6 and β = 0.01, so we fix these values in our other experiments.

F. EFFICIENCY OF THE PROPOSED METHOD
In this subsection, we evaluate the efficiency of our method. Specifically, we conduct experiments on the CIFAR-10 dataset and compare the encoding time of different unsupervised methods, including Deepbit [15], BGAN [14], SH [56], DistillHash [18], and our method.
As illustrated in Fig. 7, our method outperforms Deepbit [15], BGAN [14], SH [56], and DistillHash [18]. In particular, our method runs 1.37× faster than the widely used BGAN [14] in the testing stage, which suggests that our method is indeed fast in real applications.

G. QUALITATIVE RESULTS
To further demonstrate the retrieval performance of our method, we provide some qualitative examples (queries and top-10 matches) on the CIFAR-10 dataset in Fig. 8. Most of the retrieved examples are true positives, which demonstrates that our hash features are robust to changes in cropping, scale, and viewpoint. However, there are still some failure cases where the hash feature is distracted by irrelevant objects.

V. CONCLUSION
In this article, we present a contrastive self-supervised method that addresses the fundamental limitation of corrupted labels in unsupervised hash learning. Based on dual pseudo agreement labels, we propose a self-supervised hashing method with mutual information and noise contrastive loss. Throughout the hash learning process, the stable pseudo labels and data distributions work together as teachers to guide binary code learning. Extensive experiments on three benchmark datasets show that the robustness of the proposed method is superior to many state-of-the-art approaches. Furthermore, the ablation studies clearly demonstrate the reliability of dual pseudo agreement and contrastive loss.
In the near future, we would like to explore new loss functions that are not only effective for sufficient learning but also come with theoretical robustness guarantees. Furthermore, we would like to test different architectures to further decrease the model size and improve the training speed.
YANG LI received the M.S. degree from the PLA University of Science and Technology, Nanjing, China, in 2010, and the Ph.D. degree from the Army Engineering University of PLA, Nanjing, in 2018. He is currently an Associate Professor with the Army Engineering University of PLA. His current research interests include computer vision, deep learning, and image processing.
YAPENG WANG received the B.S. degree from the PLA University of Science and Technology, Nanjing, China, in 2015. He is currently pursuing the M.S. degree with the Army Engineering University of PLA, Nanjing. His current research interests include digital image processing and deep learning.
ZHUANG MIAO received the Ph.D. degree from the PLA University of Science and Technology, Nanjing, China, in 2007. He is currently an Associate Professor with the Army Engineering University of PLA, Nanjing. His current research interests include artificial intelligence, pattern recognition, and computer vision.
JIABAO WANG received the Ph.D. degree in computational intelligence from the PLA University of Science and Technology, Nanjing, China, in 2013. He is currently an Assistant Professor with the Army Engineering University of PLA, Nanjing. His current research interests include computer vision and machine learning.
RUI ZHANG received the Ph.D. degree from the PLA University of Science and Technology, Nanjing, China, in 2004. He is currently a Professor with the Army Engineering University of PLA, Nanjing. His current research interests include data engineering and information fusion. VOLUME 8, 2020