Person Re-Identification Using Additive Distance Constraint With Similar Labels Loss

Despite the promising progress made in recent years, person re-identification (Re-ID) remains a challenging task due to intra-class variations. Most current studies rely on the traditional Softmax loss, whose discriminative capability encounters a bottleneck. To address this problem, we proposed a novel loss function, namely the additive distance constraint with similar labels loss (ADCSLL). Specifically, we reformulated the Softmax loss by adding a distance constraint to the ground truth label, on top of which similar labels were introduced to make the learned features more stable and centralized. Experimental evaluations were conducted on two popular datasets (Market-1501 and DukeMTMC-reID) to examine the effectiveness of the proposed method. The results showed that the proposed ADCSLL was more discriminative than most of the compared state-of-the-art methods. The rank-1 accuracy and mAP were 95.0% and 87.0% on Market-1501, and 88.6% and 77.2% on DukeMTMC-reID, respectively.


I. INTRODUCTION
Person re-identification (Re-ID) aims to retrieve a given person across non-overlapping cameras and is widely applied in intelligent security, intelligent retail, and intelligent video surveillance [1]. However, intra-class variations, caused by large changes in pose, viewpoint, illumination, background, and occlusion, make this task challenging [2]-[4]. Moreover, as the number of persons to retrieve increases, inter-class similarities further complicate the task [5].
In the last decade, deep-learning techniques have significantly boosted the state-of-the-art performance [6]-[9], and recent deep-learning solutions for person Re-ID have demonstrated good retrieval performance [10]-[12]. The current mainstream strategy is to combine local features or to utilize semantic information from human pose or body parts [13]-[15]. As for the network architecture, a number of deep-learning methods chose ResNet50 [16] pre-trained on the image classification dataset ImageNet [17] as the baseline network, and further improved performance by considering both local and global information [18]-[22]. Among these studies, existing part-based approaches used human structural information such as body parts or human pose to further improve Re-ID performance. In the methods of [10] and [15], input images or feature maps were split horizontally to take advantage of local spatial cues. The methods in [11], [16] and [23] made use of human pose to learn local features. In recent years, attention mechanisms were also adopted to capture attentive regions for Re-ID enhancement [14], [24], [25]. However, these part-based or pose-based approaches need additional prior knowledge, such as a pre-trained human pose estimation model, which increases method complexity and computational cost, or impose high requirements on image normalization (e.g., the alignment of human body parts), which easily causes low-precision problems. Approaches based on attention mechanisms also make the model more complex and greatly increase the computational cost. Meanwhile, these deep-learning studies usually design neural network structures or concatenate multi-branch features to capture and focus on salient features, generally bringing too much extra computation, whereas the industry prefers simple but effective methods [26]-[28].

FIGURE 1. Differences between identification loss and metric loss.

Loss function design plays a significant role in feature learning for person Re-ID. The current loss functions can be categorized into two types: metric loss [25]-[30] and identification loss [23], [31]; see Fig. 1 for the differences between them. Although they share the same purpose of better completing the Re-ID task, the training targets and optimization mechanisms of different loss functions are quite different. Intuitively, simultaneously enlarging intra-class convergence and inter-class discrimination is more conducive to learning representative features. Inspired by this idea, metric losses were proposed to enforce that input images of the same person are mapped to nearby points in the feature space, while images of different persons are mapped far apart. However, when the number of training samples increases, the number of training pairs grows exponentially and therefore costs much more training time. Cross-entropy loss together with Softmax is an alternative identification loss, which treats the Re-ID task as a multi-class classification problem: each identity is treated as a separate category, and the Softmax loss is minimized. Despite its popularity, the current Softmax loss only concentrates on the logit of the ground truth label and neglects the effectiveness of the other labels.
To alleviate these shortcomings, the loss function proposed in this study is inspired by typical Softmax-based loss functions in face recognition [32]-[35] and by label smoothing (a regularization mechanism) [36]. Person Re-ID is similar to face recognition, where significant deep-learning-based progress has been made. To enhance the discriminative capability of the original Softmax loss, [32] proposed the large margin Softmax (L-Softmax), which adds angular constraints to each identity to improve feature discrimination. Angular Softmax (A-Softmax) [33] further improved L-Softmax by normalizing the weights. ArcFace [35] directly maximized the decision boundary in the angular (arc) space based on L2-normalized weights and features. CosFace [34] proposed the large margin cosine loss, which takes normalized features as inputs and learns highly discriminative features by maximizing the inter-class cosine margin. All of these methods generalize the Softmax loss to a more general large-margin Softmax loss in terms of angular similarity. Disappointingly, these methods have not been well examined on person Re-ID tasks.
Inspired by such ideas, we employed a distance constraint in the original Softmax loss to enhance the discriminative capability of the learned features. However, the Softmax loss is more suitable for classification over known classes, while Re-ID is a retrieval task over unknown labels. To solve this problem, we designed a new loss function by considering the following aspects: (1) The Softmax loss only considers the ground truth label and ignores the loss information of the other labels. Therefore, we extracted the other labels with higher logits (namely, similar labels) and introduced them into the Softmax loss design. Similar labels are expected to help the network learn more stable and more centralized features for the other labels.
(2) As the addition of similar labels distracts the attention of the network, the logit of the ground truth label decreases. Therefore, we imposed a distance constraint on the ground truth label to increase the difficulty of the training target, which can ultimately increase the logit of the ground truth label. Here, the ground truth label refers to the label that has the same person identity as the current training sample, while the rest are defined as other labels. Some of the other labels are considered to have similarities with the training sample, so they tend to obtain superior logits; we define such labels as similar labels.
The label smoothing strategy [36] prevents models from overfitting and improves their generalization ability by considering the weights of all identification labels (the ground truth label and the other labels) when calculating losses. However, no matter how the other labels perform during training, they all receive attention from the model, leading to insufficient bias toward similar labels and needless bias toward the others. Thus, we improved the loss function design by only calculating the losses of similar labels and removing the unnecessary part.
To leverage the complementary advantages of different loss functions, many person Re-ID approaches optimized their networks with multiple losses [1], [18], [22], [31], [37], [38]. The work presented in [39] combined identification loss and contrastive loss. Identification loss and metric loss were also combined to speed up convergence in [14], [16], [37]. In this paper, we proposed a novel comprehensive loss function considering both identification loss and triplet loss to achieve better person Re-ID performance. The main contributions of this study can be summarized as follows: (1) Our proposed method designs a novel identification loss based on a single model with unshared global features and achieves outstanding performance on the most popular public datasets for person Re-ID.
(2) Similar labels are introduced to make the learned features more stable and more centralized, thus improving the generalization ability.
(3) A distance constraint is added to the ground truth label to effectively tighten the decision boundary for Re-ID performance improvement.

II. PROPOSED APPROACH
To alleviate the above-mentioned problems, we proposed a generalized additive distance constraint with similar labels loss (ADCSLL) to acquire superior retrieval performance by intrinsically learning highly discriminative features without extra computation. The overview of the proposed architecture is presented in Fig. 2. Generally, ADCSLL consists of two parts: the additive distance constraint loss and the similar labels loss. ResNet50 [16] pre-trained on ImageNet was adopted as the backbone network. We changed the last stride of ResNet50 from 2 to 1, and then connected a global average pooling layer, a batch normalization layer, and a classifier. The details of the proposed approach are described as follows.

A. THE SOFTMAX LOSS
Given a training sample x, the Softmax loss L_s is calculated as

L_s = −log p_y = −log( e^{f_y} / Σ_{i=1}^{N} e^{f_i} ),   (1)

where y is the ground truth label; p_i is the probability of label i ∈ {1, ..., N}; f_i is the un-normalized log-probability (logit) of label i; and N is the number of training labels. Considering q(i) as the target distribution of the i-th label given the ground truth label y, so that q(y) = 1 and q(i) = 0 for all i ≠ y, the Softmax loss L_s can be re-written as

L_s = −Σ_{i=1}^{N} q(i) log p_i,   (2)

with

p_i = e^{f_i} / Σ_{j=1}^{N} e^{f_j}.   (3)

However, (2) ignores the distribution of the other labels, so the generalization performance needs to be improved. Consequently, we introduced similar labels into the original Softmax loss and imposed a constraint on the ground truth label to design a novel loss function.
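As a concrete illustration, the Softmax loss above can be sketched in a few lines of plain Python (the logits here are toy values, not from the paper):

```python
import math

def softmax_loss(logits, y):
    """Softmax cross-entropy: L_s = -log(e^{f_y} / sum_i e^{f_i})."""
    m = max(logits)                           # subtract the max for numerical stability
    exps = [math.exp(f - m) for f in logits]  # un-normalized probabilities
    p_y = exps[y] / sum(exps)                 # normalized probability of the ground truth
    return -math.log(p_y)

# The loss shrinks as the ground-truth logit dominates the others.
print(softmax_loss([2.0, 1.0, 0.1], y=0))  # moderate confidence
print(softmax_loss([8.0, 1.0, 0.1], y=0))  # high confidence, smaller loss
```

Note that only the relative magnitudes of the logits matter; shifting all logits by a constant leaves the loss unchanged.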

B. SIMILAR LABELS LOSS
The Softmax loss only maximizes the ground truth label's distribution q(y) = 1 but ignores the distribution of the other labels, leading to monotonous and limited training convergence. Label smoothing shares the distribution indiscriminately over the other labels. Compared with the Softmax loss, the mechanism of label smoothing is to replace the distribution q(i) with

q_LS(i) = 1 − ε, if i = y;  q_LS(i) = ε/N, if i ≠ y,   (4)

where ε = 0.1 is a preset constant balancing the distribution of the ground truth label and the other labels. Therefore, the label smoothing loss L_LS can be formulated as

L_LS = −Σ_{i=1}^{N} q_LS(i) log p_i,   (5)

with p_i defined in (3). Furthermore, we assume that the labels with higher logits are more meaningful for model training than all the other labels. Therefore, we introduced similar labels into the original Softmax loss, and the corresponding similar labels loss L_SL is formulated as

L_SL = −Σ_{i=1}^{N} q_SL(i) log p_i,   (6)

with

q_SL(i) = 1 − ε, if i = y;  q_SL(i) = ε/N_sl, if i ∈ S_{N_sl};  q_SL(i) = 0, otherwise,   (7)

which is a mixture of the ground truth label's distribution q(y) = 1 and the similar labels' distribution 1/N_sl, with weights 1 − ε and ε, respectively. N_sl is the number of similar labels, and the similar labels set S_{N_sl} is composed of the top N_sl other labels sorted by logits in descending order. N_sl varies according to the number of labels (identities) in the dataset. In this paper, we set ε and N_sl to 0.3 and 100, respectively, for both examined datasets (Market-1501 and DukeMTMC-reID). The distribution evolution from the Softmax loss to the similar labels loss is shown in Fig. 3: in Fig. 3(a), the distribution of the ground truth label is q(y) = 1; in Fig. 3(b), label smoothing replaces q(y) = 1 with weight 1 − ε and shares the distribution 1/N with weight ε over the other labels; in Fig. 3(c), we only share the distribution 1/N_sl with weight ε over the similar labels and set the distribution of the rest of the other labels to 0.
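The construction of the similar labels loss can be sketched in plain Python; this is a minimal sketch with toy values (N_sl = 2 here for readability, whereas the paper uses N_sl = 100), and the tie-handling in the top-N_sl selection is our assumption:

```python
import math

def similar_labels_loss(logits, y, eps=0.3, n_sl=2):
    """Similar labels loss: weight 1-eps on the ground truth label and
    eps/n_sl on each of the top-n_sl other labels (the 'similar labels')."""
    n = len(logits)
    # similar labels: the n_sl other labels with the highest logits
    others = sorted((i for i in range(n) if i != y),
                    key=lambda i: logits[i], reverse=True)
    similar = set(others[:n_sl])
    # target distribution q_SL: 1-eps on y, eps/n_sl on similar labels, 0 elsewhere
    q = [eps / n_sl if i in similar else 0.0 for i in range(n)]
    q[y] = 1.0 - eps
    # soft cross-entropy against q
    m = max(logits)
    exps = [math.exp(f - m) for f in logits]
    z = sum(exps)
    return -sum(qi * math.log(e / z) for qi, e in zip(q, exps) if qi > 0.0)
```

With eps = 0 the target collapses back to the one-hot distribution and the loss reduces to the plain Softmax loss.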

C. ADDITIVE DISTANCE CONSTRAINT LOSS
Face recognition is enlightening for Re-ID because of their similarities. Recent approaches to face recognition designed new loss functions for deep face recognition. Specifically, [32], [33], [34] and [40] maximized the decision boundary in the angular space to learn highly discriminative features. Despite their various geometrical forms, all of them intrinsically optimize the models by enforcing more stringent decision boundaries. As a classical face recognition method, the large margin cosine loss (LMCL) L_lmc [34] is defined as

L_lmc = −log( e^{s(cos θ_y − m)} / ( e^{s(cos θ_y − m)} + Σ_{i≠y} e^{s cos θ_i} ) ),   (8)

where θ_i is the angle between the weight vector w_i and the feature x, so that cos θ_i = w_i^T x / (||w_i|| ||x||); m is a fixed parameter introduced to control the magnitude of the cosine margin; and s is the scaling parameter. In [34], m was set to 0.35 and s to 64.
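The margin mechanism of LMCL reduces to a simple logit adjustment before the Softmax; a minimal sketch with the paper's m = 0.35 and s = 64 (the cosine similarities are toy inputs):

```python
def lmcl_logits(cosines, y, s=64.0, m=0.35):
    """LMCL logit adjustment: subtract the cosine margin m from the
    ground-truth cosine similarity, then scale all logits by s."""
    return [s * (c - m) if i == y else s * c for i, c in enumerate(cosines)]

f = lmcl_logits([0.8, 0.3], y=0)
# ground-truth logit: 64 * (0.8 - 0.35); other logit: 64 * 0.3
```

The adjusted logits are then fed to the ordinary Softmax cross-entropy, which is what makes the decision boundary stricter by the margin m.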
Similar to (2), L_lmc can be re-written as

L_lmc = −Σ_{i=1}^{N} q(i) log p_i,   (9)

with the logits replaced by f_y = s(cos θ_y − m) and f_i = s cos θ_i for i ≠ y. We experimented with L_lmc on the Re-ID task but achieved unsatisfactory results, which indicates that normalizing the weights and features and enlarging the angular margin are not sufficient to enhance the discriminative power of the learned features. This is probably because human faces share the same topological structure, whereas human poses differ significantly between data samples. Such significant variation makes loss functions designed for face recognition less applicable to person Re-ID tasks.
To solve this problem, we further introduced a distance constraint on the ground truth label without normalizing the features and weights, reinforcing the discrimination of the learned features by pursuing a more stringent decision boundary. The following simple binary-class example explains our improvement.
Given a learned feature x from class 1, the Softmax loss forces w_1^T x > w_2^T x to classify x correctly, so the decision boundary between class 1 and class 2 is w_1^T x = w_2^T x. To develop a more discriminative classifier, a more stringent decision boundary is needed. We employed a distance constraint c to further require w_1^T x − c > w_2^T x, so that the new decision boundary becomes w_1^T x − c = w_2^T x. This analysis generalizes naturally to multi-class identification tasks, and the new criterion is more stringent than the original decision boundary w_1^T x = w_2^T x. By directly reformulating the original Softmax loss, we defined the additive distance constraint loss L_ADC as

L_ADC = −log p'_y,   (10)

with

p'_i = e^{f'_i} / Σ_{j=1}^{N} e^{f'_j},  where f'_y = f_y − c and f'_i = f_i for i ≠ y,   (11)

where c is the preset distance constraint determining the strength of the pull toward the ground truth label. The parameter c should be set to a reasonably large value to effectively boost the learning of highly discriminative features; in general, c = 0 leads to insufficient convergence. Decreasing the logit of the ground truth label by increasing the distance constraint c increases the training difficulty and improves the Re-ID performance, but decreasing it too much may cause training divergence and convergence failure, because the distance constraint becomes stricter and more difficult to satisfy. Besides, an overlarge c makes the training process more sensitive to noise. Considering these factors comprehensively, we set c = 0.4.
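The additive distance constraint amounts to shifting the ground-truth logit down by c before the Softmax; a minimal plain-Python sketch (toy logits, c = 0.4 as in the paper):

```python
import math

def adc_loss(logits, y, c=0.4):
    """Additive distance constraint loss: subtract c from the ground-truth
    logit, then apply the usual Softmax cross-entropy."""
    shifted = [f - c if i == y else f for i, f in enumerate(logits)]
    m = max(shifted)
    exps = [math.exp(f - m) for f in shifted]
    return -math.log(exps[y] / sum(exps))

# The constraint makes the objective strictly harder than plain Softmax (c = 0).
assert adc_loss([2.0, 1.0], y=0, c=0.4) > adc_loss([2.0, 1.0], y=0, c=0.0)
```

Because the shift only applies during training, inference is unchanged: ranking by feature similarity does not involve c at all.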

D. ADDITIVE DISTANCE CONSTRAINT WITH SIMILAR LABELS LOSS (ADCSLL)
By combining the additive distance constraint loss and the similar labels loss, and together with the network structure presented in Fig. 2, we proposed the new loss function ADCSLL to solve the person Re-ID problem. Simultaneously considering (6) and (10), our proposed ADCSLL is defined as

L_ADCSL = −Σ_{i=1}^{N} q_SL(i) log p'_i,   (12)

which can be re-written as

L_ADCSL = −(1 − ε) log p'_y − (ε/N_sl) Σ_{k∈S_{N_sl}} log p'_k,   (13)

where q_SL(i) is given in (7) and p'_i in (11).
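Putting the two components together, a self-contained sketch of the full ADCSLL (toy values; the paper uses c = 0.4, ε = 0.3, and N_sl = 100):

```python
import math

def adcsll(logits, y, c=0.4, eps=0.3, n_sl=2):
    """ADCSLL: similar-labels target distribution q_SL combined with the
    distance-constrained probabilities p' (ground-truth logit shifted by c)."""
    n = len(logits)
    # shifted logits f': subtract the distance constraint from the ground truth
    f = [v - c if i == y else v for i, v in enumerate(logits)]
    m = max(f)
    exps = [math.exp(v - m) for v in f]
    z = sum(exps)
    # similar labels: the top-n_sl other labels by (unshifted) logit
    others = sorted((i for i in range(n) if i != y),
                    key=lambda i: logits[i], reverse=True)
    similar = set(others[:n_sl])
    # expanded form: -(1 - eps) log p'_y - (eps / n_sl) * sum over similar labels
    loss = -(1.0 - eps) * math.log(exps[y] / z)
    loss -= (eps / n_sl) * sum(math.log(exps[k] / z) for k in similar)
    return loss
```

With c = 0 and eps = 0 this reduces to the plain Softmax loss, which makes the two hyperparameters easy to ablate independently. Whether the similar labels are ranked by the shifted or unshifted logits is our assumption; the two orderings coincide here because only the ground-truth logit is shifted.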

III. EXPERIMENTS
To validate the performance of our proposed method, we tested it on two popular person Re-ID datasets, Market-1501 and DukeMTMC-reID. It is worth noting that all experiments were conducted in a single-query setting, without using any re-ranking algorithm to improve the mean average precision (mAP).

A. DATASETS
1) Market-1501
The Market-1501 dataset [41] was constructed and published in 2015. It consists of images captured by 6 cameras. In total, 32668 images of 1501 identities with annotated bounding boxes were collected for training and testing. The dataset is divided into a training set with 12936 images of 751 identities and a test set of 750 identities containing 3368 query images and 19732 gallery images.

2) DukeMTMC-reID
DukeMTMC-reID is an image-based subset of DukeMTMC [42] used for person Re-ID. It consists of 36411 images of 1812 persons captured by 8 cameras. The dataset contains 16522 randomly selected training images of 702 identities, 2228 query images of the other 702 identities, and 17661 gallery images.

B. IMPLEMENTATION DETAILS
Our model was implemented in the PyTorch framework. All images were resized to a resolution of 256 × 128, and image augmentation techniques including random crop, random erasing, and horizontal flip (with a probability of 0.5) [43] were used to expand the datasets. We used a batch size of 64 and trained the proposed model for 120 epochs. Adam [44] was used as the optimizer, with a weight decay factor of 0.0005. The learning rate increased linearly from 2.4×10^-5 to 2.4×10^-4 over the first 10 epochs, and was then decreased to 2.4×10^-5 and 2.4×10^-6 at the 40th and 70th epochs, respectively. In the training phase, ADCSLL enforced the model to learn more salient features through the distance constraint and similar labels. In the testing phase, the learned highly discriminative features were used to examine the effectiveness of the proposed method.
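The learning-rate schedule described above can be sketched as a function of the epoch index; a minimal sketch assuming the warmup interpolates linearly starting from epoch 0 (the exact interpolation endpoints are our assumption):

```python
def learning_rate(epoch, base=2.4e-4):
    """Warmup-and-step schedule: linear warmup from base/10 to base over the
    first 10 epochs, then decay to base/10 at epoch 40 and base/100 at epoch 70."""
    if epoch < 10:
        return base * (0.1 + 0.9 * epoch / 10.0)  # 2.4e-5 -> 2.4e-4
    if epoch < 40:
        return base
    if epoch < 70:
        return base * 0.1   # 2.4e-5
    return base * 0.01      # 2.4e-6
```

In practice this kind of schedule is typically installed via an optimizer's per-epoch hook (e.g., a lambda-based scheduler), but the piecewise function above captures the values used here.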

IV. RESULTS AND DISCUSSION
A. PERSON RE-ID PERFORMANCE
The performance of our proposed person Re-ID method was compared with state-of-the-art methods on both the Market-1501 and DukeMTMC-reID datasets. The compared methods include AlignedReID [45], IDE (ID-discriminative embedding) [39], SVDNet (singular vector decomposition net) [46], TriNet (triplet net) [30], Pyramid [47], AWTL (adaptive weighted triplet loss) [48], ABD-Net (attentive but diverse net) [49], DSA-reID (densely semantically aligned reID) [50], the baseline method in [51], and the baseline method together with triplet loss [51]. The comparison results are summarized in Table 1. It should be noted that the Pyramid method used one feature for training for a fair comparison. The results show that our proposed ADCSLL achieves competitive results compared to most of the existing methods, except ABD-Net and DSA-reID. The results of our proposed method are better than those of DSA-reID on DukeMTMC-reID, but not superior to those of ABD-Net. Different from our model, which only uses global features, both ABD-Net and DSA-reID utilize global features together with local features in a two-branch network, which makes these two methods more complex than ours and may be the main reason for their better performance. However, the performance gap between ABD-Net and our proposed method is small. In practical industry applications that require simple but effective solutions, our proposed method could be a promising, well-balanced solution to real-world person Re-ID tasks.
When comparing the performance of ADCSLL and ADCSLL + triplet loss, the results show that ADCSLL together with the triplet loss achieves better performance than ADCSLL alone. The rank-1 accuracies when using ADCSLL + triplet loss on Market-1501 and DukeMTMC-reID are 95.0% and 88.6%, respectively, 0.2% and 1.1% higher than when using ADCSLL alone. Similar results are found for the mAP values: the mAPs on Market-1501 and DukeMTMC-reID are 85.8% and 76.0%, respectively, 0.6% and 0.4% higher than when using ADCSLL alone.

B. PERFORMANCE OF DIFFERENT LOSS FUNCTIONS
To examine the effectiveness of the different loss functions in our proposed ADCSLL, we compared the rank-1 accuracies and mAPs on Market-1501 and DukeMTMC-reID when using different loss functions in Table 2. For a fair comparison, all experiments were conducted under the same network structure. Compared with the Softmax loss, the rank-1 accuracy and mAP on Market-1501 when using label smoothing increase by 0.6% and 1.2%, respectively; the corresponding increases on DukeMTMC-reID are 1.2% and 1.5%. The rank-1 accuracies and mAPs of the contrastive loss and the triplet loss are not as good as those of the Softmax loss on either Market-1501 or DukeMTMC-reID, which indicates that identification loss is more suitable for Re-ID than metric loss. When using label smoothing together with the triplet loss, the rank-1 accuracy and mAP are better than those of either the Softmax loss or label smoothing alone on both examined datasets. The rank-1 accuracies and mAPs when using the similar labels loss alone are higher than those when using the additive distance constraint loss alone on both datasets. The best performance is achieved when using our proposed method.
The parameter ε balances the distribution of the ground truth label and the similar labels, and the parameter c imposes a distance constraint to produce a stricter learning objective. To investigate how the stringent constraint and the partial similar labels affect the loss performance on person Re-ID tasks, we examined the effect of different c and ε values in our proposed ADCSLL with N_sl = 100. Varying c from 0 to 0.5 and ε from 0 to 0.4, the person Re-ID performance on Market-1501 is illustrated in Fig. 4. The best rank-1 accuracy (94.8%) and mAP (86.4%) on Market-1501 are simultaneously achieved when c = 0.4 and ε = 0.3. The results of one c and ε combination (c = 0.35, ε = 0.35) are shown in Table 2 for detailed numerical comparison.
We further examined the standard deviation (Std) of the other labels' probabilities at each epoch when using the original Softmax loss, label smoothing, and our developed ADCSLL. The results are illustrated in Fig. 5. Std measures the variation of a set of values: a lower Std means the values tend to be close to the expected value of the labels' probabilities. As shown in Fig. 5, ADCSLL obtains the lowest Std among the compared loss functions, indicating that ADCSLL makes the features of the other labels more stable and more concentrated, and therefore enlarges the intra-class convergence.
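For reference, the per-sample Std of the other labels' probabilities can be computed as the population standard deviation over all non-ground-truth labels; how the paper aggregates it over an epoch is not specified, so this is only a per-sample sketch:

```python
def other_label_std(probs, y):
    """Population standard deviation of the predicted probabilities of all
    labels except the ground truth label y."""
    others = [p for i, p in enumerate(probs) if i != y]
    mean = sum(others) / len(others)
    return (sum((p - mean) ** 2 for p in others) / len(others)) ** 0.5

s_uniform = other_label_std([0.7, 0.1, 0.1, 0.1], y=0)    # ~0: other labels uniform
s_peaked = other_label_std([0.7, 0.28, 0.01, 0.01], y=0)  # larger: one label dominates
```

A lower aggregate Std across training samples is what Fig. 5 uses as evidence that the other labels' probabilities stay close together.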

C. INDIVIDUAL PERFORMANCE WHEN USING ADDITIVE DISTANCE CONSTRAINT LOSS OR SIMILAR LABELS LOSS
Our proposed ADCSLL includes two component losses: the additive distance constraint loss and the similar labels loss. When using the additive distance constraint loss individually for person Re-ID, the rank-1 and mAP results with different c values on Market-1501 are shown in Table 3. The results show that the performance is almost unaffected by c values between 0 and 0.25, and decreases as c increases beyond 0.25.
When using the similar labels loss individually for person Re-ID on Market-1501, the rank-1 and mAP results corresponding to the best combination of N_sl and ε for N_sl = 0, 100, 300, 500, 750 are shown in Table 4. The best performance (rank-1: 94.6%, mAP: 85.7%) is obtained when N_sl = 500 and ε = 0.7, better than the best results in Table 3. This indicates that the similar labels loss might contribute more to the discriminative capability of our proposed ADCSLL.
As shown in Table 2, the best Re-ID performance of our proposed ADCSLL is achieved when c = 0.4 and ε = 0.3, with a corresponding rank-1 accuracy and mAP of 94.8% and 86.4%, respectively. However, as shown in Table 3, c = 0.4 corresponds to the worst performance when using the additive distance constraint loss alone. The results in Table 4 show that ε = 0.3 does not correspond to the best performance for the examined N_sl values when using the similar labels loss alone, either. Moreover, none of the results in Table 3 and Table 4 is better than the best results in Table 2. These results suggest that the additive distance constraint loss and the similar labels loss work together to achieve the better person Re-ID performance.

V. CONCLUSION
This paper proposed a novel and effective approach, ADCSLL (additive distance constraint with similar labels loss), to learn highly discriminative features for person re-identification (Re-ID). Both the additive distance constraint loss and the similar labels loss were innovatively developed as parts of the proposed ADCSLL. Two datasets (Market-1501 and DukeMTMC-reID) were used to examine the effectiveness of the proposed method. The results show that ADCSLL achieves highly competitive performance on both Market-1501 and DukeMTMC-reID, outperforming most of the compared methods. Comparing the performance of different loss functions, our proposed ADCSLL shows better discriminative capability. When using the additive distance constraint loss or the similar labels loss individually, the results indicate that: (1) the similar labels loss contributes more to the performance of ADCSLL, and (2) the two losses work together to achieve the better person Re-ID performance. Our results provide a novel perspective for further improvement in intelligent systems such as security and video surveillance applications. In future work, both global and local features will be used to further improve the performance of the proposed method. Besides, more datasets, including CUHK03 [52] and MSMT17 [53], will be employed for a more complete comparison with the state-of-the-art methods.