Local Heterogeneous Features for Person Re-Identification in Harsh Environments

Local features capture semantic information in pedestrian images and are very important for person re-identification (Re-ID) in harsh environments. However, most approaches optimize only one kind of local feature, which yields incomplete local representations. In this paper, we propose Local Heterogeneous Features (LHF) to extract discriminative local features from three aspects. To this end, we utilize three kinds of losses to learn three kinds of local features, i.e., local discriminative features, local relative features and local compact features. For local discriminative features, we split the attention maps into three horizontal sub-regions and perform the classification operation. We also divide the attention maps into two horizontal sub-regions and simultaneously apply the triplet loss and the center loss to learn local relative features and local compact features. Finally, we utilize local discriminative features to represent pedestrians. We evaluate LHF on public person Re-ID datasets and show that LHF is effective for local feature learning.


I. INTRODUCTION
Person re-identification (Re-ID) in harsh environments has important practical value, and it aims to match the same identity across different camera sensors [1]-[5]. Because real scenes are complex, pedestrian images vary widely in illumination, clothing, occlusion, background, and so on. Hence, person Re-ID still faces many challenges. In recent years, deep learning approaches [6]-[8] have become mainstream and have brought effective strategies for learning powerful feature representations. Some deep learning approaches [9]-[11] have achieved large breakthroughs, which pushes person Re-ID to a high level.
The intuitive approach is to learn robust features from the global pedestrian images. Global features search for the most salient region to represent pedestrian images [12]-[15]. However, the complexity of pedestrian images restricts the development of person Re-ID, because many non-salient details may be ignored. Local features contain subtle information which is important for person Re-ID. As shown in Fig. 1, the images in the first row have different identities but very similar global appearances, so it is hard to distinguish them using global features. The pedestrian images in the second row have the same identity but large variations. Local features can extract effective information from the head, legs or feet so as to recognize pedestrian images accurately.
The associate editor coordinating the review of this manuscript and approving it for publication was Qilian Liang.
In order to mine discriminative features, some researchers utilize deep learning to learn local features [16]-[19], which further improves person Re-ID performance. However, these methods split the original pedestrian images or feature maps and then utilize only the cross-entropy function to optimize local features. Thus, the existing methods cannot fully learn the local information of pedestrian images.
To overcome this shortcoming, in this paper we propose Local Heterogeneous Features (LHF) to learn discriminative local features from different views for person Re-ID in harsh environments. The proposed LHF contains three kinds of local features, i.e., local discriminative features, local relative features and local compact features. Specifically, we simultaneously employ three kinds of loss functions to learn local features. To learn local discriminative features, we split the attention maps into three horizontal regions and perform the pooling operation to extract features. Afterwards, Batch Normalization (BN) is utilized to normalize the features. Then, the cross-entropy loss is utilized to perform a classification task. At the same time, we divide the attention maps into two horizontal regions. Then, we utilize the triplet loss and the center loss to learn local relative features and local compact features, respectively. By optimizing local features from three different views, we can fully learn local information from pedestrian images.
Different from previous part-based methods, this paper makes the following contributions:
• The local features are all learned from attention maps, which contain abundant semantic information.
• We learn three kinds of local features from different views to discover local cues in pedestrian images.

II. RELATED WORK
Person Re-ID shows great application value and potential. Traditional methods [23]-[27] usually extract low-level features, e.g., colour features, edge features and texture features, to represent pedestrian images. Recently, the field has shifted from hand-crafted features to automatically learned deep features. Convolutional Neural Networks (CNNs) [28]-[31] show a large advantage in feature representation, and therefore they dominate the person Re-ID field.
Many researchers propose effective approaches that use CNNs to learn global features so as to search for the most salient characteristics. For example, Zheng et al. [32] combine the classification model and the verification model to jointly optimize global features. Huang et al. [33] utilize adversarially occluded samples to enlarge the training set and then retrain the deep model to obtain discriminative global features. Si et al. [34] design the Discriminative-Aware Integration (DAI), which employs different deep models to optimize global features and then integrates these global features to represent images. Luo et al. [14] apply multiple tricks to the deep model to extract global features that improve the performance of person Re-ID.
Global features often lose subtle information, which restricts the development of person Re-ID. Therefore, some methods propose strategies to learn local features, which further improves person Re-ID. For instance, Zhang and Si [35] split the original pedestrian images into three sub-regions and then employ the verification model to recognise positive or negative sub-region pairs. Sun et al. [17] split the attention maps into six horizontal regions and apply six cross-entropy functions to optimize local features independently. Huang et al. [36] leverage a pose estimation algorithm to predict seventeen key points and then partition the attention maps into eight sub-regions so as to learn local features. Fu et al. [37] propose Horizontal Pyramid Matching (HPM) to learn local features at different scales and use these local features to obtain pedestrian features.
The methods mentioned above learn local features using the cross-entropy loss. Different from them, we exploit three kinds of loss functions from multiple views to optimize local features. Then, we utilize local discriminative features to represent pedestrian images, which further improves Re-ID performance.

III. APPROACH
In this section, we introduce the structure of LHF. Then, we describe three kinds of local features, i.e., local discriminative features, local relative features and local compact features. Finally, we introduce how to optimize the overall deep model.

A. THE STRUCTURE OF LHF
The proposed LHF employs ResNet-50 [6] as the backbone to extract image features because of its succinct structure. Different from the traditional setting, we modify the last stride from 2 to 1 and discard the structure that follows it. Changing the stride from 2 to 1 removes the last down-sampling step and increases the size of the feature maps from 12 × 4 to 24 × 8, so the feature maps retain more semantic information. The overall backbone is shown in Table 1. The backbone contains a convolution layer, a pooling layer and four blocks. Each block includes a different number of convolution layers. As shown in Block 1 in Table 1, 1 × 1 denotes the size of the filter and 64 indicates the number of filters. The value 1 in the Stride column is the stride of the filter movement. (1,1) represents padding one column and one row with pixel value 0. ×3 denotes that the corresponding block has three such structures. We crop each pedestrian image into 384 × 128, and then these pedestrian images are fed into the backbone. We obtain feature maps with the size of 24 × 8 × 2048 after the backbone.
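The effect of the stride change on the spatial size can be checked with a little arithmetic. The sketch below assumes the standard ResNet-50 layout, in which the first convolution, the pooling layer and Blocks 2 to 4 each halve the resolution while Block 1 keeps it; the function name `feature_map_size` and the stride tuples are illustrative and are not taken from Table 1.

```python
def feature_map_size(height, width, strides):
    """Spatial size after applying a sequence of down-sampling strides."""
    for s in strides:
        # Each stride-s stage divides the resolution by s, rounding up,
        # which matches padded convolution and pooling layers.
        height = -(-height // s)
        width = -(-width // s)
    return height, width


# Assumed stride layout: first conv, pooling, Block 1, Block 2, Block 3, Block 4.
ORIGINAL_STRIDES = (2, 2, 1, 2, 2, 2)
MODIFIED_STRIDES = (2, 2, 1, 2, 2, 1)  # last stride changed from 2 to 1

print(feature_map_size(384, 128, ORIGINAL_STRIDES))  # (12, 4)
print(feature_map_size(384, 128, MODIFIED_STRIDES))  # (24, 8)
```

With a 384 × 128 input, the overall down-sampling factor drops from 32 to 16, which is why the output grows from 12 × 4 to 24 × 8.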
In order to learn local heterogeneous features, we propose an efficient model structure as shown in Fig. 2. We design three branches to learn local discriminative features, local relative features and local compact features, respectively. The local discriminative features make full use of the identity label and obtain discriminative information. The local relative features decrease the intra-class variation and increase the inter-class variation, and the local compact features further decrease the intra-class variation.

B. LOCAL DISCRIMINATIVE FEATURES
As shown in Fig. 2, we apply the first branch to optimize local discriminative features. Specifically, we first partition the feature maps extracted from the CNN into three horizontal sub-regions. Then, we perform the average pooling operation to extract local discriminative features $f_d^i$ ($i = 1, 2, 3$). Afterwards, each local discriminative feature is fed into a classifier. Here, each classifier contains a fully connected (FC) layer and a softmax, where the number of neurons in the FC layer is equal to the number of identities. We obtain the predicted probability that the i-th sub-region of each pedestrian image belongs to the k-th identity. It is formulated as:

$$p^i(k) = \frac{\exp(a_k^i)}{\sum_{j=1}^{K} \exp(a_j^i)} \tag{1}$$

where $K$ denotes the total number of identities in the training set and $a_k^i$ represents the value of the k-th neuron in the i-th classifier. Each local feature has an identity label which is the same as the label of the corresponding original pedestrian image, and each local feature corresponds to one cross-entropy loss function. Then, we employ the cross-entropy loss to compute the loss value of each pedestrian image:

$$L_d^b = -\sum_{i=1}^{3} \sum_{k=1}^{K} q^i(k) \log p^i(k) \tag{2}$$

where $q^i(k)$ is the true identity distribution and is written as:

$$q^i(k) = \begin{cases} 1, & k = t \\ 0, & k \neq t \end{cases} \tag{3}$$

where $t$ is the true identity. So, the total loss of each batch for local discriminative feature learning is written as:

$$L_d = \sum_{b=1}^{B} L_d^b \tag{4}$$

where $L_d^b$ is the loss of the b-th pedestrian image in one batch and $B$ is the batch size.
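The first branch can be illustrated with a small, dependency-free sketch: split a feature map into horizontal stripes, average-pool each stripe, and sum one softmax cross-entropy term per stripe. The helper names (`split_and_pool`, `softmax`, `image_loss`), the bias-free FC layer and the toy sizes are assumptions made for illustration; they are not the authors' implementation.

```python
import math


def split_and_pool(feature_map, num_parts):
    """Split an H x W x C map (nested lists) into horizontal stripes
    and average-pool every stripe into one C-dimensional vector."""
    rows_per_part = len(feature_map) // num_parts
    channels = len(feature_map[0][0])
    pooled = []
    for p in range(num_parts):
        rows = feature_map[p * rows_per_part:(p + 1) * rows_per_part]
        cells = [cell for row in rows for cell in row]
        pooled.append([sum(cell[c] for cell in cells) / len(cells)
                       for c in range(channels)])
    return pooled


def softmax(logits):
    """Numerically stable softmax over the K identity logits."""
    peak = max(logits)
    exps = [math.exp(a - peak) for a in logits]
    total = sum(exps)
    return [e / total for e in exps]


def image_loss(part_features, classifiers, true_id):
    """Sum of per-stripe cross-entropy losses for one image.
    Each classifier is a K x C weight matrix (FC layer without bias)."""
    loss = 0.0
    for feature, weights in zip(part_features, classifiers):
        logits = [sum(w * x for w, x in zip(row, feature)) for row in weights]
        loss -= math.log(softmax(logits)[true_id])
    return loss


# Toy example: a 6 x 2 x 2 map, three stripes, K = 4 identities.
toy_map = [[[float(r), 2.0 * r] for _ in range(2)] for r in range(6)]
parts = split_and_pool(toy_map, 3)
zero_classifiers = [[[0.0, 0.0] for _ in range(4)] for _ in range(3)]
loss = image_loss(parts, zero_classifiers, true_id=2)
print(parts)  # [[0.5, 1.0], [2.5, 5.0], [4.5, 9.0]]
```

With all-zero classifier weights every identity is equally likely, so each stripe contributes log K to the loss; this is a convenient sanity check before training.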

C. LOCAL RELATIVE FEATURES
We utilize the second branch to learn local relative features. As shown in Fig. 2, we split the feature maps into two horizontal sub-regions and then perform the Batch Normalization (BN) operation. Afterwards, we apply average pooling to extract local relative features $f_r^i$ ($i = 1, 2$). We apply the triplet loss to optimize $f_r^i$:

$$L_r = \sum_{b=1}^{B} \sum_{i=1}^{2} \Big[ \alpha + \max_{n} D\big[(f_r^i)_b, f_n^i\big] - \min_{m} D\big[(f_r^i)_b, f_m^i\big] \Big]_+ \tag{5}$$

where $\alpha$ indicates the threshold, $(f_r^i)_b$ denotes the i-th sub-region feature of the b-th pedestrian image in one batch, $f_n^i$ represents the i-th sub-region feature of the n-th pedestrian image with the same identity as $(f_r^i)_b$, and $f_m^i$ is the i-th sub-region feature of the m-th pedestrian image with a different identity from $(f_r^i)_b$. Here, $D[\cdot, \cdot]$ computes the Euclidean distance between two feature vectors, $\max D$ outputs the maximum value, $\min D$ outputs the minimum value and $[x]_+ = \max(x, 0)$.
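The triplet term described above (farthest same-identity sample, nearest different-identity sample, hinge with a threshold) can be sketched for a single sub-region as follows. The function names are hypothetical, summing rather than averaging over anchors is an assumption, and in the full model the same computation would be repeated for each of the two sub-regions.

```python
import math


def euclidean(u, v):
    """Euclidean distance D[u, v] between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def batch_hard_triplet_loss(features, labels, margin=0.3):
    """For every anchor, take the farthest positive and the nearest
    negative in the batch, then apply the hinge [x]_+ = max(x, 0)."""
    total = 0.0
    for b, anchor in enumerate(features):
        positives = [euclidean(anchor, f) for n, f in enumerate(features)
                     if n != b and labels[n] == labels[b]]
        negatives = [euclidean(anchor, f) for m, f in enumerate(features)
                     if labels[m] != labels[b]]
        if not positives or not negatives:
            continue  # anchor has no valid triplet in this batch
        total += max(margin + max(positives) - min(negatives), 0.0)
    return total


# Two identities with two images each; the classes overlap along the x-axis.
feats = [[0.0, 0.0], [3.0, 0.0], [1.0, 0.0], [4.0, 0.0]]
ids = [0, 0, 1, 1]
print(batch_hard_triplet_loss(feats, ids))
```

When identities are already well separated (every negative is farther than every positive by more than the threshold), all hinge terms vanish and the loss is zero.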

D. LOCAL COMPACT FEATURES
As shown in the third branch in Fig. 2, we apply the center loss to optimize the local compact features $f_c^i$ ($i = 1, 2$). Similarly, we divide the feature maps into two horizontal sub-regions and perform the BN operation. Since the three losses have different functions, the BN layer is added for better convergence. Then, we apply average pooling to obtain $f_c^i$. The loss is formulated as:

$$L_c = \frac{1}{2} \sum_{b=1}^{B} \sum_{i=1}^{2} \left\| (f_c^i)_b - t_b^i \right\|_2^2 \tag{6}$$

where $(f_c^i)_b$ denotes the i-th sub-region feature of the b-th pedestrian image in one batch and $t_b^i$ indicates the i-th sub-region feature center of the identity to which the b-th pedestrian image belongs.
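A minimal sketch of the center loss for one sub-region is given below. It follows the common definition of center loss, including the factor 1/2, which is an assumption here; the function name `center_loss` and the dictionary of per-identity centers are illustrative, and the update of the centers during training is omitted.

```python
def center_loss(features, labels, centers):
    """Half the squared Euclidean distance between each feature and the
    center of its identity, summed over the batch."""
    total = 0.0
    for feature, identity in zip(features, labels):
        center = centers[identity]
        total += 0.5 * sum((x - c) ** 2 for x, c in zip(feature, center))
    return total


# Two images of identity 0 placed symmetrically around its center.
batch = [[1.0, 1.0], [3.0, 1.0]]
batch_ids = [0, 0]
class_centers = {0: [2.0, 1.0]}
print(center_loss(batch, batch_ids, class_centers))  # 1.0
```

The loss is zero only when every feature coincides with its identity center, which is why this term pulls samples of one identity together.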

E. MODEL OPTIMIZATION
We combine the three loss functions to optimize local features, and the final loss value is formulated as:

$$L = L_d + \eta L_r + \rho L_c \tag{7}$$

where $\eta$ and $\rho$ are the weight coefficients that control the importance of local relative features and local compact features, respectively. The cross-entropy loss makes full use of the identity label and obtains discriminative information. The triplet loss decreases the intra-class variation and increases the inter-class variation, and the center loss further decreases the intra-class variation. We use the weighted sum of the three loss values as the final loss value. Then, we utilize the final loss value to compute the gradients and perform backpropagation. Afterwards, we utilize Stochastic Gradient Descent (SGD) to update the parameters of the deep model. Local discriminative features are extracted from feature maps that are optimized by all three kinds of loss functions, so the feature maps contain rich semantic information. Therefore, we only utilize local discriminative features as the final pedestrian features, which gives low dimension and complexity but better performance. In the test process, we concatenate the three local discriminative features into the feature $f_d$ to represent each pedestrian image. Then, $f_d$ is utilized to evaluate the proposed LHF.
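The weighted combination of the three losses and the test-time concatenation can be sketched as follows. The default weights are the values reported in the experiment settings (η = 1, ρ = 0.0005); the function names and the 2048-dimensional stripe vectors (one per sub-region, assuming no dimension reduction after pooling the 24 × 8 × 2048 feature maps) are illustrative assumptions.

```python
def total_loss(loss_d, loss_r, loss_c, eta=1.0, rho=0.0005):
    """Final loss: cross-entropy term plus weighted triplet and center terms."""
    return loss_d + eta * loss_r + rho * loss_c


def concat_features(part_features):
    """Concatenate the per-stripe vectors into one descriptor for testing."""
    return [x for part in part_features for x in part]


# Three stripe features of 2048 dimensions each (dummy values).
stripes = [[0.0] * 2048, [1.0] * 2048, [2.0] * 2048]
descriptor = concat_features(stripes)
print(len(descriptor))  # 6144
print(total_loss(2.0, 0.5, 100.0))
```

Because ρ is small, the center term only nudges the optimization even when its raw value is large, while the triplet term enters at full weight.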

IV. EXPERIMENTS
In this section, we introduce person Re-ID datasets, and then present the experimental settings. Afterwards, we evaluate three local features, i.e., local discriminative features, local relative features and local compact features. Finally, we compare LHF with other competitive approaches.

A. DATABASES
Market1501 [20]. The dataset possesses 1,501 identities. It contains 32,668 images in total and has two parts. One part includes 12,936 pedestrian images and forms the training set. The other part has 19,732 pedestrian images and forms the test set. These images are detected by DPM [38].
DukeMTMC-reID [22]. The dataset contains 1,812 identities and is captured by eight different cameras. These identities have 36,411 pedestrian images. Specifically, the training data includes 16,522 images, the test data has 17,661 images, and the query data possesses 2,228 images.
CUHK03 [21]. The dataset possesses 14,097 pedestrian images of 1,467 identities. We employ the same image partitioning strategy as [33], [39]. The training samples have 7,365 images and the gallery samples have 5,332 images. The other 1,400 pedestrian images are treated as the query samples. We evaluate the proposed LHF on both types of image samples, i.e., the detected set and the labeled set.
The three datasets are captured in different environments and have different settings. Some pedestrian samples are listed in Fig. 3. In order to comprehensively evaluate LHF, we employ mAP and Rank-1 accuracy as the criteria.
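For readers unfamiliar with the two criteria, the following sketch computes Rank-1 accuracy and mAP from a query-by-gallery distance matrix. It is a simplified illustration: the function names are hypothetical, and the filtering of gallery images taken by the same camera as the query, which standard Re-ID protocols apply, is omitted.

```python
def average_precision(ranked_matches):
    """AP for one query; ranked_matches lists, from nearest to farthest
    gallery image, whether the identity matches the query."""
    hits, precisions = 0, []
    for rank, is_match in enumerate(ranked_matches, start=1):
        if is_match:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def evaluate(distances, query_ids, gallery_ids):
    """Return (Rank-1 accuracy, mAP) over all queries."""
    rank1_hits, aps = 0, []
    for q, row in enumerate(distances):
        order = sorted(range(len(row)), key=row.__getitem__)
        matches = [gallery_ids[g] == query_ids[q] for g in order]
        rank1_hits += int(matches[0])
        aps.append(average_precision(matches))
    return rank1_hits / len(distances), sum(aps) / len(aps)


# Two queries against a three-image gallery (toy distances).
dist = [[0.1, 0.5, 0.3],
        [0.2, 0.9, 0.4]]
rank1, mean_ap = evaluate(dist, query_ids=[1, 2], gallery_ids=[1, 2, 1])
print(rank1, mean_ap)
```

In the toy example the first query retrieves both of its matches first (AP = 1), while the second query finds its only match at rank 3 (AP = 1/3), so Rank-1 is 0.5 and mAP is 2/3.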

B. EXPERIMENT SETTINGS
In the experiments, the batch size is set to 64, which includes 16 identities with four different images each. We randomly select 16 identities and randomly choose four pedestrian images for each identity. To optimize the deep model, each image is cropped into 384 × 128 and then normalized. The threshold α in Eq. 5 is 0.3. The parameters η and ρ in Eq. 7 are 1 and 0.0005, respectively. In addition, the epoch number denotes the total number of training iterations over the dataset. We observe that the deep model converges at epoch 120, and therefore we set the number of epochs to 120. As for the learning rate, we utilize different values at multiple steps and it is formulated as:

$$lr(e) = \begin{cases} \ldots, & e \le 10 \\ 3.5 \times 10^{-4}, & 10 < e \le 30 \\ 3.5 \times 10^{-5}, & 30 < e \le 70 \\ 3.5 \times 10^{-6}, & 70 < e \le 100 \\ 3.5 \times 10^{-7}, & 100 < e \le 120 \end{cases} \tag{8}$$

where $e$ represents the current epoch.
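The step schedule can be written as a small lookup function. The sketch below assumes the second interval ends at epoch 30 so that the intervals are contiguous. The rule for the first ten epochs is not recoverable from the text, so the function takes it as a caller-supplied `warmup` callable instead of guessing; the linear warm-up used in the usage line is purely hypothetical.

```python
def learning_rate(epoch, warmup=None):
    """Learning rate for a given epoch (1 to 120) under the step schedule.
    `warmup` must be supplied to cover epochs up to 10."""
    if epoch <= 10:
        if warmup is None:
            raise ValueError("no warm-up rule given for epoch <= 10")
        return warmup(epoch)
    if epoch <= 30:
        return 3.5e-4
    if epoch <= 70:
        return 3.5e-5
    if epoch <= 100:
        return 3.5e-6
    if epoch <= 120:
        return 3.5e-7
    raise ValueError("training stops after epoch 120")


# Hypothetical linear warm-up towards 3.5e-4, for illustration only.
def example_warmup(epoch):
    return 3.5e-4 * epoch / 10


for e in (5, 11, 31, 71, 101):
    print(e, learning_rate(e, warmup=example_warmup))
```

Keeping the warm-up as a parameter makes the unknown part of the schedule explicit and lets the remaining steps be checked independently.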

C. PERFORMANCE OF LOCAL FEATURES
We evaluate three kinds of local features and the performances are shown in Table 2. $f_d$ denotes local discriminative features and it achieves a good performance, which shows that $f_d$ is effective. Then, we add local relative features $f_r$ to $f_d$, which is called $f_d + f_r$. This method greatly raises the accuracy of person Re-ID, which verifies that $f_r$ can further distinguish different identities. Finally, we add the center loss to learn local compact features $f_c$. As a result, $f_d + f_r + f_c$ obtains the best result, which demonstrates that $f_c$ is meaningful for person Re-ID. The three kinds of local features are complementary and learn the local information from different aspects.
VOLUME 8, 2020

D. COMPARISON WITH OTHER APPROACHES
In this subsection, we compare LHF with other competitive approaches on person Re-ID datasets.
On Market1501, we compare the proposed LHF with other approaches and the results are listed in Table 3. LHF obtains a Rank-1 accuracy of 94.9% and an mAP of 86.1%, which surpasses other competitive approaches. This indicates that LHF effectively learns local features. The three kinds of local features jointly improve person Re-ID. In addition, the ASB method employs different kinds of loss functions to optimize global features. Our method is better than ASB, which verifies that local features are meaningful.
As listed in Table 4, we evaluate LHF on DukeMTMC-reID, and the results exceed other approaches. LHF obtains 87.1% and 77.6% for Rank-1 accuracy and mAP, respectively. LHF achieves the best performance, which verifies that the three kinds of local features are effective. Similarly, ASB utilizes different loss functions to optimize the global features. LHF utilizes the same loss functions to learn local features and achieves higher accuracy than ASB, which shows that local information is meaningful.
We evaluate LHF on CUHK03, and the comparison results are listed in Table 5. LHF achieves a Rank-1 accuracy of 66.4% and an mAP of 64.7% on the detected set. In addition, LHF achieves a Rank-1 accuracy of 70.1% and an mAP of 67.3% on the labeled set. The performance exceeds other approaches. This demonstrates once again that LHF is meaningful for person Re-ID.

E. PARAMETERS ANALYSIS
We perform experiments to evaluate different values of the threshold α on Market1501 and the results are shown in Table 6. The results indicate that the method obtains the best performance when α = 0.3.
ρ and η are the weight coefficients of the center loss and the triplet loss. We conduct the experiments to evaluate the two parameters and the results are shown in Table 7. When ρ = 0.0005 and η = 1, LHF achieves the best result.

V. CONCLUSION
We have proposed LHF for person Re-ID in harsh environments. The proposed LHF learns three kinds of local features from multiple angles. Specifically, we employ the cross-entropy loss to optimize local discriminative features. At the same time, we apply the triplet loss to optimize local relative features and utilize the center loss to learn local compact features. Finally, we employ local discriminative features to represent pedestrians. Experiments on three datasets show that LHF effectively improves the performance of person Re-ID.
The proposed LHF achieves better results on three large-scale person Re-ID datasets. However, occlusion remains an important challenge in person Re-ID, and the method does not consider how to learn discriminative pedestrian features under occlusion. In the future, we will utilize the local features to overcome the occlusion problem.
HAIJIA ZHANG is currently pursuing the master's degree with Tianjin Normal University, Tianjin, China. His research interests include person reidentification, deep learning, and sensor networks.