Fully Unsupervised Person Re-Identification via Multiple Pseudo Labels Joint Training

Person re-identification (re-ID) is the task of finding a matched person in a non-overlapping, multi-camera system. Because annotating images across multiple cameras is difficult and time-consuming, this paper focuses on fully unsupervised person re-ID, which learns re-ID models from unlabeled data. Unsupervised re-ID must self-generate pseudo labels to make training possible. Unlike human-annotated true labels, pseudo labels contain noisy labels that substantially hinder the network's feature learning. In order to refine the predicted pseudo labels, we introduce a novel unsupervised re-ID method named Multiple pseudo Labels Joint Training (MLJT). Different from existing works, MLJT predicts multiple pseudo labels for each image by mining potential similarities in multiple ways. Based on invariance constraints among the multiple pseudo labels, MLJT is jointly optimized under their supervision to ease the impact of noise in any single pseudo label. The proposed MLJT predicts three types of pseudo labels for one input image: the clustering-based pseudo label, the adaptive similarity measurement-based pseudo label, and pseudo sub-labels predicted by mining channel-based self-similarities. MLJT has been extensively evaluated on two mainstream public person re-ID datasets and on outdoor real-world videos. Experiments demonstrate the effectiveness of the proposed multiple pseudo labels joint training strategy and the practicality of MLJT in real-world unsupervised person re-ID applications. The testing demo can be found at https://drive.google.com/drive/folders/1RvNaEiy6tF18_RcgTNcjE7jJ6eGy8sZL?usp=sharing.


I. INTRODUCTION
The person re-identification (re-ID) system aims to retrieve matched people across different cameras or occasions. It has become an active research area in recent years due to the growing practicability of computer vision in industry [1]-[4]. The person re-ID system has wide-ranging applications in security and surveillance, including human abnormality detection, specific-person search, and lost-person tracking [4], [5]. The core of person re-ID is matching image features; therefore, extracting discriminative features is both critical and challenging.
The associate editor coordinating the review of this manuscript and approving it for publication was Yiming Tang.
Based on the training strategy, current person re-ID methods can be grouped into three categories: supervised person re-ID [6]-[8], Unsupervised Domain Adaptive (UDA) person re-ID [9]-[13], and Fully Unsupervised Learning (FUL) based person re-ID [14]-[23]. These methods are illustrated in Fig. 1.
In past decades, person re-ID research mostly focused on the supervised setting. Supervised person re-ID methods [6]-[8] train the network on a labeled dataset, so annotating person identities for every image is required, as shown in Fig. 1(a). Manually annotating identities across multiple cameras is expensive.
Therefore, some recent works focus on unsupervised person re-ID, which does not require labeled information in the target dataset. Because of the lack of labels, unsupervised methods cannot match the performance of supervised methods. Several unsupervised methods [9]-[13] utilize Unsupervised Domain Adaptation (UDA) to improve model performance. The idea of UDA-based methods is to transfer knowledge from a labeled source dataset to the unlabeled target dataset. A common practice is to train the network on the labeled source and the unlabeled target dataset simultaneously [10], [13], as shown in Fig. 1(b). However, model performance declines significantly if the domain gap between the source and target datasets is large.
The above two factors make supervised person re-ID and UDA-based unsupervised person re-ID difficult to deploy in practical industrial applications. Conversely, FUL-based person re-ID methods [14]-[23] enjoy two merits that make them more suitable for real-world applications. (1) Training a FUL-based re-ID system does not require any labeled information, as shown in Fig. 1(c); it is relatively easy to acquire a large number of unlabeled images and videos from public surveillance systems. (2) FUL-based re-ID methods need not consider the domain gap between source and target datasets, because they can be trained directly on any unlabeled target dataset. Because of these two merits, FUL-based person re-ID has attracted increasing attention.
The recent MLCReID [18] formulated unsupervised person re-ID as a multi-label classification task and achieved state-of-the-art performance. Its framework, shown in Fig. 2(a), can be divided into three stages: feature extraction by a backbone network, pseudo label prediction, and fine-tuning the backbone network with the computed loss. Unlike human-annotated true labels, these self-generated pseudo labels contain noise that hinders the model's ability to extract discriminative features. Therefore, building on MLCReID, this paper proposes a Multiple pseudo Labels Joint Training (MLJT) method that aims to mitigate the effects of noisy labels. The proposed architecture, shown in Fig. 2(b), can be divided into four stages: feature extraction by a backbone network, channel-wise matching for exploring channel-based self-similarity, multiple pseudo labels prediction, and fine-tuning the backbone network with the computed joint losses.
Unlike MLCReID, which predicts a single pseudo label for an input image, the proposed MLJT predicts multiple pseudo labels for the input image by mining potential similarities in different ways. The combination of multiple pseudo labels is more robust than a single pseudo label. According to invariance constraints among these predicted pseudo labels, MLJT optimizes the network via multiple pseudo labels in a joint manner. In general, there are three types of pseudo labels in MLJT, as shown in Fig. 2(b): the clustering-based pseudo label y^cl_i, the adaptive similarity measurement-based pseudo label y^sm_i, and the pseudo sub-labels y^g_i. We introduce these three pseudo labels in detail in Section III.
In summary, the contributions of this work are fourfold: 1) Unlike the baseline method, which predicts a single pseudo label, this work proposes to predict multiple pseudo labels for an input image by mining multiple potential similarities among samples. 2) Based on the invariance constraints among multiple pseudo labels, this work proposes to train the backbone network jointly to refine pseudo labels effectively. 3) To avoid neglecting positive labels of difficult samples, this paper proposes an Adaptive Similarity Measurement-based pseudo label Prediction (ASMP) method to adaptively select positive labels based on the degree of difficulty of each sample. 4) This work improves the performance of the baseline method [18] by considerable margins on two mainstream public datasets, Market-1501 [26] (Market) and DukeMTMC-reID [27], [28] (Duke). Testing experiments are also performed on outdoor real-world videos to show the practicality of this work in real-world applications.
The remainder of this paper is organized as follows. Section II summarizes the related work. Section III describes the proposed method. In Section IV, the datasets, evaluation metrics, and implementation details are introduced. In Section V, extensive ablation studies are reported. The model performance on two public datasets and in real-world applications is reported in Section VI. Finally, Section VII concludes this paper.

II. RELATED WORK
A. FUL-BASED RE-ID METHODS
Several FUL-based methods [14]-[18] have been proposed that train the model without any manually annotated labels. Yu et al. [14] proposed to learn an unsupervised asymmetric metric based on asymmetric clustering of cross-view person images to deal with view-specific interference. Following a similar idea, Yu et al. [15] embedded the asymmetric metric into a deep neural network to further improve performance. Ding et al. [16] proposed an effective clustering approach for person re-ID by exploring the dispersion of statistics. Lin et al. [17] proposed a Bottom-Up Clustering (BUC) approach that merges a fixed number of clusters to fine-tune the model step by step until convergence. In order to ease the impact of noisy clusters, Ji et al. [20] proposed a two-stage clustering method that trains the network using only reliable samples. Without using an additional clustering algorithm, Wang et al. [18] formulated re-ID as a multi-label classification task by generating pseudo labels via similarity measurement. Based on MLCReID, Tang et al. [21] leveraged eligible neighbors as additional reference information to further boost ranking accuracy. Yang et al. [19] designed a noise-tolerant loss function that makes the model robust against noisy pseudo labels. Generally, existing FUL-based re-ID research mainly aims to design an accurate pseudo label prediction method or a noise-robust loss function. This paper likewise focuses on obtaining high-quality pseudo labels to optimize the model.

B. PSEUDO LABEL PREDICTION FOR UNSUPERVISED RE-ID
Based on the pseudo label prediction methods, unsupervised person re-ID can be generally divided into two categories: clustering-based label prediction and similarity measurement-based label prediction.
The core idea of clustering-based label prediction [12], [13], [19] is to perform a clustering algorithm on Convolutional Neural Network (CNN) features to generate pseudo labels for training. Clustering-based methods pull samples close to their class centroids, as illustrated in Fig. 3(a). Fan et al. [12] proposed a progressive unsupervised learning (PUL) method, which can be seen as the original clustering-based re-ID work. They iterated clustering and CNN fine-tuning step by step until convergence. Because the clustering results may be noisy, subsequent studies [13], [19], [20] mainly focus on refining noisy labels.
The core idea of similarity measurement-based label prediction [11], [18], [23] is to estimate similarities among samples in order to select positive samples for training. Similarity measurement-based methods pull samples close to their near neighbors, as illustrated in Fig. 3(b). Existing methods select positive samples based on fixed rules. ECN [11] is the original similarity measurement-based re-ID method; ECN [11] and SSL [23] select the k-Nearest Neighbors (k-NN) of each image as its positive samples. Subsequently, MLCReID [18] proposed to select positive samples using a pre-defined, fixed similarity threshold t.
Existing research focuses on only one of these pseudo label prediction methods. As mentioned above, clustering-based and similarity measurement-based methods lead the model to learn in different directions, and thus they can complement each other. This paper predicts clustering-based and similarity measurement-based pseudo labels for each image simultaneously and trains the model jointly to achieve better performance, as shown in Fig. 3(c).

C. SIMILARITY EXPLORATION FOR UNSUPERVISED RE-ID
Exploring similarity correctly is critical to re-ID performance. Supervised person re-ID methods can use additional human pose labels [6] or human body part segmentation regions [7] to help the model learn human part similarity. However, these pre-annotated auxiliary labels are unavailable in the unsupervised learning task.
In order to help the model learn discriminative features without human part labels, several methods, including the supervised method in [8], the UDA-based unsupervised method Self-Similarity Grouping (SSG) [22], and the FUL-based unsupervised method Softened Similarity Learning (SSL) [23], explore self-defined part-based self-similarities, as shown in Fig. 4(a). These methods [8], [22], [23] split the w_f × h_f × d global feature maps into P horizontal stripes to compute P part-based self-similarities. In SSG [22], P = 2 is used to explore the potential similarities of the upper and lower human body of unlabeled samples. These works [8], [22], [23] ignore the fact that the human region is not always located in the center of the image, as shown in Fig. 4(c). This human location variance introduces inconsistency into each partition. Moreover, data augmentation strategies, such as random crop, random erasing, and random rotation, further increase the human location variance. In order to avoid the impact of human location variance while still mining potential similarities among images, we propose to explore channel-based self-similarity in this paper, as shown in Fig. 4(b).

III. PROPOSED METHOD
A. FRAMEWORK OVERVIEW
The framework of the proposed method MLJT is shown in Fig. 2(b), which can be mainly divided into five stages: feature extraction by backbone network, feature splitting using channel-wise matching, multiple pseudo labels prediction, classification, network optimization using joint losses.
Given an unlabeled person image x_i ∈ X (i = 1, 2, ..., N), a d-dimensional feature f_i is extracted by the backbone network, where i is the index of the image in the unlabeled person re-ID dataset X and N is the number of images in X. A feature memory M of size N × d stores up-to-date features for all images in X, so that similarities among all images can be computed. M is updated by f_i after every training iteration as

M^t[i] = α · M^{t-1}[i] + (1 - α) · f_i^t,    (1)

where the superscript t indicates the t-th training iteration and α ∈ [0.0, 1.0] is the updating rate. Updating with moving-averaged weights makes M more stable for predicting pseudo labels [11], [18]. After updating, the multiple pseudo labels can be predicted. y^cl_i and y^sm_i are the clustering-based pseudo label and the adaptive similarity measurement-based pseudo label predicted using M, respectively. y^g_i are sub-labels predicted by mining channel-based self-similarities using sub-feature memories M^g, which are obtained by channel-wise matching, as shown in Fig. 2(b). The backbone network is optimized progressively by the joint loss between the multiple pseudo labels y_i = {y^cl_i, y^sm_i, y^g_i} and the classification scores computed from the whole feature f_i and the sub-features f^g_i, respectively. In the following subsections, we introduce these three pseudo labels one by one in detail.
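The moving-average memory update of Eq. (1) can be sketched as follows. Whether the updated slot is re-normalized afterwards is our assumption, modeled on the memory modules of ECN [11] and MLCReID [18]; the function name `update_memory` is hypothetical.

```python
import numpy as np

def update_memory(memory, i, f_i, alpha=0.5):
    """Moving-average update of one feature-memory slot (Eq. (1) sketch).

    memory : (N, d) array, assumed to hold L2-normalized features.
    alpha  : the updating rate from the paper, alpha in [0.0, 1.0].
    Re-normalizing after the update is an assumption, so that dot products
    with memory rows remain cosine similarities.
    """
    memory[i] = alpha * memory[i] + (1.0 - alpha) * f_i
    memory[i] /= np.linalg.norm(memory[i]) + 1e-12
    return memory
```

A larger `alpha` keeps the memory slot closer to its previous value, which is what makes the stored features change smoothly between iterations.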

B. MULTIPLE PSEUDO LABELS PREDICTION
1) CLUSTERING-BASED PSEUDO LABEL y^cl_i
Following previous works [19], [25], the unsupervised clustering algorithm DBSCAN [24] is used in this paper. The clustering result is illustrated in Fig. 3(a). Before each training epoch, DBSCAN assigns a pseudo-class c_i to every sample in X by computing distances among their features f^M_i ∈ M (i = 1, ..., N). DBSCAN is a density-based clustering algorithm: it treats high-confidence samples as clustered inliers (c_i ≥ 0) and low-confidence samples as unclustered outliers (c_i = -1).
There is one challenge in the DBSCAN-based clustering method: the number of clusters keeps changing during the whole training process, which makes the loss function difficult to design. To address this issue, we encode the 1-dimensional pseudo-class c_i into an N-dimensional clustering-based pseudo label y^cl_i. After encoding, the multi-label classification loss function [18] can easily be adopted to compute the loss between v^0_i and y^cl_i, regardless of the changing number of clusters.
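One plausible form of this encoding is sketched below. The exact encoding equation was lost in extraction, so the {1, -1} target convention (MMCL-style) and the treatment of outliers are our assumptions; `encode_cluster_label` is a hypothetical name.

```python
import numpy as np

def encode_cluster_label(cluster_ids, i):
    """Encode the scalar pseudo-class c_i into an N-dim clustering label y^cl_i.

    Assumed encoding: images sharing x_i's cluster are positives (+1), all
    others negatives (-1); an unclustered outlier (c_i == -1) keeps only
    itself as positive, so it still receives a valid training target.
    """
    N = len(cluster_ids)
    y = -np.ones(N, dtype=int)
    if cluster_ids[i] >= 0:
        y[cluster_ids == cluster_ids[i]] = 1
    y[i] = 1  # an image is always a positive for itself
    return y
```

Because y^cl_i always has N entries, its dimension is fixed even while the number of DBSCAN clusters changes between epochs, which is exactly what the multi-label loss needs.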

2) ADAPTIVE SIMILARITY MEASUREMENT-BASED PSEUDO LABEL
The cosine similarity between x_i and all images x_j ∈ X (j = 1, ..., N) is computed using their features in M as s^0_i = M f_i, since the features stored in M are L2-normalized. According to s^0_i, the adaptive similarity measurement-based pseudo label y^sm_i is predicted by our proposed Adaptive Similarity Measurement-based pseudo label Prediction (ASMP).
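With L2-normalized features, the full similarity vector s^0_i reduces to a single matrix-vector product, as in this sketch (the helper names are ours):

```python
import numpy as np

def l2_normalize(v):
    """L2-normalize a vector, or each row of a matrix, along the last axis."""
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + 1e-12)

def cosine_similarities(memory, f_i):
    """Similarity vector s^0_i between x_i and every image in the dataset.

    With unit-length rows, the (N, d) @ (d,) product directly yields the
    N cosine similarities against the feature memory M.
    """
    return l2_normalize(memory) @ l2_normalize(f_i)
```

Computing all N similarities in one product is what makes the memory-based formulation cheap enough to run before every training iteration.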
As mentioned in Section II-B, previous pseudo label prediction methods [11], [18], [23] select positive labels using fixed rules, i.e., a fixed number k of positive labels in ECN [11] and SSL [23], or a fixed label selection threshold t in MLCReID [18]. Using fixed rules to select positive labels is unsuitable, because the similarity distribution of every image is different and keeps changing during the whole training process. More specifically, difficult samples always share relatively low similarity with their k-Nearest Neighbors (k-NN) because of challenging situations, such as complex or uncommon human poses, occlusion, and complex backgrounds. Examples of the similarity distributions of image A (a simple case) and image B (a difficult case) are illustrated in Fig. 5. The similarities between image B and its k-NN are much lower than those between image A and its k-NN, as shown in the right graph of Fig. 5. If positive samples are selected using a fixed threshold t = 0.6 for all images as in [18], the potential positive labels of the difficult sample with similarity less than 0.6 may be continuously neglected.
To avoid neglecting the positive labels of difficult samples, we propose ASMP, which first identifies difficult samples based on the degree of difficulty θ_i of each sample, and then assigns a lower, adaptive threshold to difficult samples for selecting more positive labels. Our idea is that, once the model can roughly capture the data distribution and predict similarities, the obtained similarity distribution can be used to estimate the degree of difficulty of the image.
ASMP first collects the similarities between x_i and its k-NN, where R_i is the ranking list obtained by sorting the similarity s^0_i in descending order and K_i is the collection of k-NN of x_i. The hyper-parameter k controls the number of samples in K_i; the analysis of k is reported in Section V.
The degree of difficulty θ_i is estimated from the statistical characteristics of K_i, i.e., the mean m_i and the standard deviation σ_i. If θ_i is low, x_i is considered a difficult sample. The computed θ_i is then directly used as the threshold to select more positive labels for x_i. In this way, more difficult samples are assigned a lower θ_i to mitigate the neglect of potential positive labels. Finally, the adaptive similarity measurement-based label y^sm_i is generated. It is noteworthy that θ_i differs for every sample according to its statistical characteristics; therefore, ASMP is better suited than the fixed rules of [11], [18], [23].
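The ASMP procedure can be sketched as below. The exact combination of m_i and σ_i was lost with the original equations, so θ_i = m_i − σ_i is our assumption of the adaptive threshold; only the use of the k-NN mean and standard deviation is stated in the paper, and `asmp_select` is a hypothetical name.

```python
import numpy as np

def asmp_select(sim, k=3):
    """Adaptive Similarity Measurement-based pseudo-label Prediction (sketch).

    sim : (N,) similarity vector s^0_i for one image.
    The k-NN statistics give theta_i = m_i - sigma_i (assumed form): a hard
    sample with low, spread-out neighbor similarities gets a lower threshold,
    so more of its potential positives survive the selection.
    """
    nn = np.sort(sim)[::-1][:k]        # similarities of the k nearest neighbors
    theta = nn.mean() - nn.std()       # per-sample adaptive threshold (assumed)
    y = np.where(sim >= theta, 1, -1)  # positives where similarity clears theta_i
    return y, theta
```

Unlike a global threshold t = 0.6, `theta` here is recomputed per image from its own neighborhood, which is the property the paper argues makes ASMP robust to difficult samples.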

3) PSEUDO SUB-LABELS BY CHANNEL-BASED SELF-SIMILARITY EXPLORATION (CSS)
The third type of pseudo label in this paper is the pseudo sub-labels y^g_i, which are predicted by mining channel-based self-similarities.
The baseline method MLCReID [18] only compares the similarity between two images by their global features. To mine more similarity information from the unlabeled dataset, several methods [22] split images into horizontal parts to represent human partial regions, as illustrated in Fig. 4(a). Unlabeled images are then additionally compared using these partial features. The precondition of part-based self-similarity computation is that the human region should always be located in the center of the image. Uncertain human location may cause inconsistency in similarity mining and matching, as mentioned in Section II-C.
The key idea of self-similarity computation is to obtain a more robust similarity score by additionally comparing partial features. In order to avoid the impact of human location variance while still mining potential similarity, we propose to explore channel-based self-similarity in this paper, as shown in Fig. 4(b).
To realize the proposed channel-based self-similarity computation, a channel-wise matching module is proposed in this paper. The channel-wise matching module is attached after the feature f_i ∈ R^d and the feature memory M ∈ R^{N×d} to split them into G groups of sub-features f^g_i and sub-feature memories M^g, respectively. As illustrated in Fig. 2(b), each sub-feature f^g_i has d/G dimensions. The G groups of self-similarities of x_i are computed using the corresponding sub-feature memories as s^g_i = M^g f^g_i. Based on s^g_i, the corresponding pseudo sub-labels y^g_i can be predicted using the proposed ASMP, following Eqs. (4)-(7).
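The channel-wise matching step can be sketched as a column split of the feature matrix. The even d/G split and the seeded shuffle are our assumptions (the paper specifies only "in order or shuffle"); `channel_wise_match` is a hypothetical name.

```python
import numpy as np

def channel_wise_match(features, G, shuffle=False, seed=0):
    """Split d-dim features (rows of an N x d matrix) into G channel groups.

    Each returned group is an (N, d/G) sub-feature memory M^g, against which
    the channel-based self-similarity s^g_i is later computed. Shuffling the
    channel order before splitting is the variant the paper finally adopts.
    """
    N, d = features.shape
    assert d % G == 0, "d must be divisible by G"
    idx = np.arange(d)
    if shuffle:
        idx = np.random.default_rng(seed).permutation(idx)
    step = d // G
    return [features[:, idx[g * step:(g + 1) * step]] for g in range(G)]
```

Because the split acts on channels rather than spatial stripes, each group still sees the whole image area, which is why this scheme is insensitive to where the person appears in the frame.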

C. THE JOINT LOSS FUNCTION
The Memory-based Multi-label Classification Loss (MMCL) [18] L* is used to regress the classification scores v_i toward the predicted pseudo labels y_i.
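As a rough stand-in for this loss, the sketch below uses a plain per-class binary cross-entropy on sigmoid scores with {1, -1} targets. This is not the exact MMCL formulation of [18], which additionally applies score regression and hard-negative mining; `multilabel_loss` is a hypothetical name.

```python
import numpy as np

def multilabel_loss(scores, labels):
    """Simplified multi-label classification loss (stand-in, not exact MMCL).

    scores : (N,) raw classification scores v_i against all N memory classes.
    labels : (N,) pseudo-label targets in {1, -1}.
    """
    p = 1.0 / (1.0 + np.exp(-scores))   # sigmoid classification scores
    t = (labels > 0).astype(float)      # map {1, -1} targets to {1, 0}
    eps = 1e-12
    return -np.mean(t * np.log(p + eps) + (1 - t) * np.log(1 - p + eps))
```

In the joint-training setting, one such term would be computed per pseudo-label type (y^cl_i, y^sm_i, and each y^g_i) and the terms summed to form the joint loss.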

IV. EXPERIMENT SETTINGS
A. DATASETS AND EVALUATION METRICS
We evaluate the proposed method on two large-scale, mainstream person re-ID datasets. Each person re-ID dataset consists of three subsets: a training set, a query set, and a gallery set. The numbers of identities (IDs) and images in the two datasets are reported in Table 1. The training set is used for training, and both the query and gallery sets are used for testing. The identities of the training set are disjoint from those of the query and gallery sets. Market-1501 [26] (Market) has six cameras and 32,668 person images of 1,501 identities in total; 751 identities with 12,936 images are used for training, and 750 identities with 19,732 images are used for testing.
DukeMTMC-reID [27], [28] (Duke) has eight cameras and 36,411 person images of 1,404 identities in total. 750 identities with 16,522 images are used for training, and 702 identities with 19,889 images are used for testing.
Two evaluation metrics are used to measure model performance. The first is Mean Average Precision (mAP) (%), the average of the Average Precision (AP) over all query images, where AP is the area under the precision-recall curve. The second is the Cumulative Matching Characteristic (CMC) curve. The CMCs (%) at Rank-1 (R-1), Rank-5 (R-5), and Rank-10 (R-10) are reported, which represent the probability that the top-1, top-5, and top-10 ranked gallery samples contain the query identity, respectively.
VOLUME 9, 2021
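These two metrics can be sketched directly from their definitions; the function names are ours, and distances are assumed to be "smaller is better":

```python
import numpy as np

def rank_k_accuracy(dist, q_ids, g_ids, k=1):
    """CMC Rank-k: fraction of queries whose top-k gallery matches contain
    the query identity. dist has shape (num_queries, num_gallery)."""
    order = np.argsort(dist, axis=1)  # ascending distance = best match first
    hits = [q_ids[q] in g_ids[order[q, :k]] for q in range(len(q_ids))]
    return float(np.mean(hits))

def average_precision(dist_row, q_id, g_ids):
    """AP for one query: area under its precision-recall curve; mAP is the
    mean of this value over all queries."""
    order = np.argsort(dist_row)
    matches = (g_ids[order] == q_id).astype(float)
    if matches.sum() == 0:
        return 0.0
    cum = np.cumsum(matches)
    precision = cum / (np.arange(len(matches)) + 1)
    return float((precision * matches).sum() / matches.sum())
```

Note that standard re-ID evaluation additionally excludes same-camera gallery matches of the query, which is omitted here for brevity.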

B. IMPLEMENTATION DETAILS
Following previous studies [11], [18]-[21], [23], we use an ImageNet [30] pre-trained ResNet-50 [29], provided officially by PyTorch, as the backbone network to conduct fair comparisons. A 1 × 1 convolutional layer and a batch normalization layer are added after the last global pooling layer of ResNet-50 to generate 2048-dimensional L2-normalized features. The input images are resized to 256 × 128 × 3. The training batch size is 64. The network is trained by Stochastic Gradient Descent (SGD) with a learning rate of 0.03 for 40 epochs in total. CamStyle [9] is used as a data augmentation strategy. To achieve the best performance, the hyper-parameter k is set to 80. The experiments are performed on an Intel Core i5-6600 3.30-GHz CPU and one NVIDIA GeForce GTX 1080 Ti GPU with 11 GB of memory. The total training time is around 4 hours on Market-1501 and DukeMTMC-reID. The test architecture is the same as that of the baseline method [18]; therefore, our method does not add any computation during testing.

V. ABLATION STUDY OF THE PROPOSED COMPONENTS
To demonstrate the effectiveness of the proposed multiple pseudo labels and the joint training strategy, extensive ablation studies are reported in this section, covering four aspects: 1) the effectiveness of the clustering-based pseudo label y^cl_i, 2) the analysis of ASMP, 3) the analysis of CSS, and 4) the analysis of the multiple pseudo labels joint training strategy.
We summarize the performance of each proposed component in Table 2 and visualize the t-SNE [31] features in Fig. 6. The features in Fig. 6(a) contain only weak clustering information before training on a specific unlabeled person re-ID dataset. This is because the ImageNet [30] pre-trained model was trained to distinguish humans from other objects; without training on a specific person re-ID dataset, the model therefore still cannot distinguish human identities by appearance. As shown in Fig. 6(b)-(f), features of the same identity are progressively gathered after training with the predicted pseudo labels. This indicates that the predicted pseudo labels can train the model on the unlabeled dataset to distinguish human identities. More accurate pseudo labels help the model learn to extract more discriminative features to re-identify persons.

A. PERFORMANCE OF THE BASELINE METHOD
No. 1 in Table 2 and Fig. 6(b) show the results of the baseline method, which uses neither the clustering-based pseudo labels y^cl_i, nor the adaptive threshold θ_i (using the fixed threshold 0.6 in Eq. (7) instead), nor the sub-labels y^g_i from channel-based self-similarity exploration. In Table 2, baseline No. 1 produces unsatisfactory performance on the two datasets: it achieves 43.0% mAP and 78.9% R-1 on Market-1501, and 37.4% mAP and 63.5% R-1 on DukeMTMC-reID. This shows that a single pseudo label does not provide robust enough supervision. As shown in Fig. 6(b), the model still cannot produce discriminative features to distinguish difficult samples.
To improve on the baseline, this paper proposes three types of pseudo labels in Section III. We further perform experiments to investigate their effectiveness in detail by adding them to the baseline model. The results are reported in Table 2 and Fig. 6.

B. THE EFFECTIVENESS OF THE CLUSTERING-BASED PSEUDO LABEL
The effectiveness of the clustering-based pseudo label y^cl_i is presented here. From the comparison of No. 1 and No. 2 in Table 2, it is clear that model performance improves with the clustering-based pseudo label y^cl_i. Specifically, mAP improves from 43.0% to 50.9% on Market-1501 and from 37.4% to 40.3% on DukeMTMC-reID. This significant improvement arises because samples are additionally pushed toward their corresponding class centroids under the supervision of the clustering labels y^cl_i, as illustrated in Fig. 3(c). A similar observation can be made from the comparison of Fig. 6(b) and Fig. 6(c): features of the same identity are more compact and better separated from other clusters when using y^cl_i. The improvement demonstrates the effectiveness of combining similarity measurement-based and clustering-based label prediction in this paper.

C. THE ANALYSIS OF ADAPTIVE SIMILARITY MEASUREMENT-BASED PSEUDO LABEL PREDICTION (ASMP)
The ASMP is analyzed in three aspects. First, the effectiveness of ASMP is evaluated in Table 2 and Fig. 6. Second, the robustness of hyper-parameter k in Eq.(4) is analyzed in Table 3. Third, we compare our proposed ASMP with other label prediction methods in Table 4.

1) THE EFFECTIVENESS OF ASMP
ASMP is proposed to adaptively select positive labels for a sample according to the similarity distribution between the sample and its neighbors. If a sample shares low similarities with its neighbors, ASMP considers it a difficult sample and computes a low threshold to choose more positive labels for it. Comparison results between the models trained with and without ASMP are shown in Table 2 and Fig. 6. In Table 2, No. 3 surpasses No. 1 by 6.1% in mAP and 2.7% in R-1 accuracy on Market-1501, and by 3.4% in mAP and 2.3% in R-1 accuracy on DukeMTMC-reID. Similarly, comparing Fig. 6(b) and Fig. 6(d), we observe that the model with ASMP gathers features of the same identity and enlarges the distances among different identities. This is because using a fixed, high threshold in the baseline method leaves dispersed samples (e.g., the blue and orange samples in Fig. 6(b)) without positive supervisory signals for training, thereby causing them to be neglected. The results demonstrate the necessity of setting a lower threshold to choose more positive labels for difficult samples in unsupervised person re-ID, and the effectiveness of our proposed ASMP.

2) COMPARISON OF DIFFERENT k IN ASMP
The hyper-parameter k controls how many near neighbors are chosen to build K_i for estimating the degree of difficulty of a sample in Eq. (4). We validate the influence of k in Table 3 by varying it from 20 to 160. Compared with the baseline results in Table 2, it is clear that any choice of k enhances model performance consistently, which demonstrates the necessity of adaptive thresholds in ASMP. The optimal performance is obtained at k = 80. Too small a k decreases performance because a limited number of near neighbors cannot represent the statistical characteristics of the sample's similarity distribution well enough to estimate its degree of difficulty θ_i. Conversely, too large a k makes the statistical characteristics of simple and difficult images similar, making them hard to distinguish.

3) COMPARISON WITH DIFFERENT LABEL PREDICTION METHODS
We further compare the proposed ASMP with other label prediction methods in Table 4, i.e., a fixed number k of positive labels as in ECN [11] and SSL [23], and a fixed label selection threshold t as in MLCReID [18]. Table 4 shows that using the fixed threshold t (t = 0.6, its best-performing value) outperforms using the fixed number k (k = 10, its best-performing value) by large margins. Our proposed adaptive threshold θ_i in ASMP achieves the best performance. This superior performance demonstrates that, compared with the fixed rules of existing works, our adaptive threshold is a more reasonable and effective method for pseudo label prediction.

D. THE ANALYSIS OF CHANNEL-BASED SELF-SIMILARITY (CSS)
CSS is proposed to predict the pseudo sub-labels y^g_i by exploring a more robust and precise similarity relation among images. CSS is analyzed in three aspects here. First, the effectiveness of CSS is evaluated in Table 2 and Fig. 6. Second, different splitting methods and different numbers of groups G in channel-wise matching are analyzed in Table 5. Third, we compare our proposed channel-based self-similarity with the part-based self-similarity of existing works [22], [23] in Table 6.

1) THE EFFECTIVENESS OF CSS
As reported in Table 2, No. 4 surpasses the baseline No. 1 by 3.4% in mAP and 1.5% in R-1 accuracy on Market-1501, and by 2.8% in mAP and 1.9% in R-1 accuracy on DukeMTMC-reID. As shown in Fig. 6(e), the model can distinguish different identities better than the baseline by mining similarities from the global feature down to channel-wise partial features. The improvements and visualization results demonstrate that the proposed CSS is a simple and effective method for mining self-similarity in an unsupervised manner.

2) COMPARISON OF DIFFERENT SPLITTING METHODS AND DIFFERENT G
In channel-wise matching, the feature f_i and the feature memory M can be split into G groups of sub-features f^g_i and sub-feature memories M^g either in order or in a shuffled manner. Table 5 reports the performance of these two splitting methods with different numbers of splitting groups G. The results show no significant difference in performance between splitting in order and shuffling. It is clear that a large G consistently performs better than a small G. When G = 2, the small G slows down model convergence, so the model struggles to reach satisfactory performance even with a very large number of training epochs. Finally, we split the feature memory in a shuffled manner and use G = 8.

3) COMPARISON BETWEEN PART-BASED AND CHANNEL-BASED SELF-SIMILARITY
We compare the proposed channel-based self-similarity with the part-based self-similarity of [22], [23] in Table 6. For a straightforward comparison, we further visualize the heat maps of the feature maps before the last GAP layer of the backbone network in Fig. 7. Fig. 7(a) illustrates six input images with three different identities. The original size of the feature maps is w_f × h_f × d: 8 × 16 × 2048, as mentioned in Fig. 3. We resize them to the input image size W × H: 128 × 256 for a more straightforward comparison. A brighter color means the model extracts more features from that region. Table 6 reports that using part-based self-similarity drops the model performance from the baseline's 43.0% to 41.9% in mAP and from 78.9% to 74.6% in R-1 accuracy on Market-1501. Similar performance declines are observed on DukeMTMC-reID, which shows that the part-based self-similarity method is not robust. The same situation is observed in the visualization examples in Fig. 7(c): mining part-based self-similarity prevents the model from accurately capturing features of the human region, especially when the person is not located in the center of the image. In contrast, the proposed channel-based self-similarity helps the model learn to extract more discriminative features, as compared in Fig. 7. The superiority of the proposed CSS is mainly reflected in four aspects. Firstly, Table 6 reports that using CSS consistently enhances the model performance on Market-1501 and DukeMTMC-reID. Secondly, compared with Fig. 7(b), features of the foreground (human region) are more accurate and brighter in Fig. 7(d). Thirdly, compared with Fig. 7(c), the background area is darker (lower importance) in Fig. 7(d). Fourthly, the feature extraction capability of the model is not affected by the person's location.
The visualization results demonstrate that the proposed channel-based self-similarity helps the model extract discriminative features and avoid the impact of human location variance as well.

E. THE ANALYSIS OF JOINT TRAINING STRATEGY
Nos. 5-8 in Table 2 show that combinations of the individual proposed modules bring greater improvements. Finally, the full version of the proposed MLJT achieves the best performance of 55.3% in mAP and 81.6% in R-1 accuracy on Market-1501, and 42.9% in mAP and 66.3% in R-1 accuracy on DukeMTMC-reID. The t-SNE features generated by the full version of the proposed MLJT are shown in Fig. 6(f). They show that the joint training strategy can overcome the demerits and utilize the merits of each individual module. Specifically, compared with Fig. 6(c), the clustering errors are eased in Fig. 6(f). Compared with Fig. 6(d) and Fig. 6(e), Fig. 6(f) shows that the model with the joint training strategy produces more compact feature clusters. These results verify that our proposed multiple pseudo labels joint training strategy is able to fully utilize each individual module to learn more discriminative features.
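As an illustration of joint training under multiple pseudo labels, the sketch below sums one classification loss per pseudo-label type (clustering-based, similarity measurement-based, and channel-based sub-labels); the actual loss terms and their weights in MLJT may differ.

```python
import numpy as np

def ce(logits, y):
    # Numerically stable softmax cross-entropy over a batch.
    z = logits - logits.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(y)), y].mean()

def joint_loss(preds, labels, weights=(1.0, 1.0, 1.0)):
    # preds/labels: one (logits, targets) pair per pseudo-label type,
    # e.g. [clustering-based, similarity-based, channel-based sub-labels].
    # Summing the weighted losses lets the labels supervise the network
    # jointly, so noise in any single pseudo label is diluted.
    return sum(w * ce(p, y) for w, (p, y) in zip(weights, zip(preds, labels)))
```

Because the three losses share the same backbone, gradients from the cleaner pseudo labels can counteract gradients driven by noisy ones, which matches the invariance-constraint intuition described above.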

A. PERFORMANCE COMPARISON IN PUBLIC DATASETS
We compare our proposed MLJT against state-of-the-art unsupervised person re-ID models on Market-1501 and DukeMTMC-reID in Table 7. We first compare MLJT with hand-crafted feature-based methods, including BoW [32], UDML [34], and LOMO [33], which require neither a CNN nor any labeled dataset to extract features. The performance of these hand-crafted feature-based methods is not satisfactory, because it is difficult to manually design discriminative features with good generalization and robustness, especially on large-scale datasets.
We also compare MLJT with CNN-based fully unsupervised learning-based methods, including CAMEL [14], DECAMEL [15], SSL [23], BUC [17], ADTC [20], DBC [16], the baseline method MLCReID [18], the best similarity measurement-based method NNCT [21], and the best clustering-based method DSCE [21]. We summarize three observations. 1) The clustering-based methods BUC, DBC, ADTC, and DSCE achieve better clustering accuracy (in mAP) because they assign the same label to all samples in the same cluster; during training, intra-class samples are pulled towards their class centroids, which makes the clusters more compact. 2) The similarity measurement-based methods SSL, MLCReID, and NNCT achieve higher ranking accuracy (in R-k) because they assign labels by mining reliable positive neighbors around each sample and enforce learning towards those reliable neighbors. 3) Our method MLJT outperforms the other FUL-based person re-ID methods in both mAP and R-k accuracy. The superior performance demonstrates the effectiveness of our proposed multiple pseudo labels joint training strategy for unsupervised person re-ID.
Moreover, we compare the results of the baseline method [18] and the proposed MLJT by showing their top-10 retrieved images for three query images in Fig. 8. The green and red boundaries denote correct and false re-ID results, respectively. As illustrated, for the same query images, MLJT retrieves correct images more accurately than the baseline. For example, the baseline is easily confused by clothing with white and purple stripes. Overall, the comprehensive comparison results indicate the effectiveness and superiority of MLJT.

B. PERFORMANCE IN REAL-WORLD APPLICATION
Finally, we test the proposed re-ID method on two outdoor real-world videos to evaluate the model performance in a real-world application. The full testing demo can be found at https://drive.google.com/drive/folders/1RvNaEiy6tF18_RcgTNcjE7jJ6eGy8sZL?usp=sharing.
In the real-world application, YOLOv5 [35] is adopted to detect human regions, because person regions are a prerequisite for person re-ID. Then, the re-ID network retrieves the same person by matching a pre-defined query image against every detected person region in the frame.
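The matching step of this pipeline can be sketched as follows, assuming embeddings from the re-ID backbone for the query image and for each YOLOv5 person crop; the function name, cosine-similarity matching, and threshold here are hypothetical, not the paper's exact implementation.

```python
import numpy as np

def reid_frame(query_emb, det_embs, threshold=0.5):
    # query_emb: (d,) embedding of the pre-defined query person.
    # det_embs: (n, d) embeddings of detected person crops in one frame
    # (the detector and embedding extractor are assumed components,
    # e.g. YOLOv5 crops fed through the re-ID backbone).
    q = query_emb / np.linalg.norm(query_emb)
    D = det_embs / np.linalg.norm(det_embs, axis=1, keepdims=True)
    scores = D @ q                     # cosine similarity per detection
    best = int(scores.argmax())
    if scores[best] < threshold:
        return None, None              # query person not in this frame
    return best, float(scores[best])   # matched detection and its score
```

The returned score is what would be overlaid on the frame as the classification confidence in the demo videos.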
The re-ID results of the baseline method [18] and our proposed MLJT on two outdoor real-world videos are compared in Fig. 9. The query image (the re-ID target) is shown at the bottom left of each frame. From the comparison, we notice that MLJT retrieves the correct persons more accurately than the baseline and outputs more confident (higher) classification scores than the baseline in both videos.
The runtime of the person detection and re-identification system is shown at the top left of each frame, denoted in Frames Per Second (FPS). The system works in real time at a processing speed of about 128 FPS with the YOLOv5 [35] detector.

VII. CONCLUSION
In this work, we proposed an end-to-end fully unsupervised person re-ID method that can be trained without any labeled information. The proposed method achieves superior performance by benefiting from three aspects: 1) selecting positive labels adaptively according to the similarity distribution of samples; 2) estimating similarity precisely via the channel-based self-similarity exploration strategy; 3) optimizing the network jointly with multiple pseudo labels to mitigate the impact of noise in any single pseudo label. Extensive experiments and comprehensive analysis demonstrate the effectiveness of the proposed MLJT. In the future, we will integrate the proposed algorithm into high-level video surveillance tasks such as specific person search and abnormal human action recognition.