Joint Attention Mechanism for Person Re-Identification

Although person re-identification (ReID) has drawn increasing research attention due to its potential to address the problem of analysis and processing of massive monitoring data, it is very challenging to learn discriminative information when the people in the images are occluded, in large pose variations or from different perspectives. To address this problem, we propose a novel joint attention person ReID (JA-ReID) architecture. The idea is to learn two complementary feature representations by combining a soft pixel-level attention mechanism and a hard region-level attention mechanism. The soft pixel-level attention mechanism learns a discriminative embedding for the fine-grained information by exploring the salient parts in the feature maps. The hard region-level attention mechanism conducts uniform partitions on the convolutional feature maps for learning local features. We have achieved competitive results in three popular benchmarks, including Market1501, DukeMTMC-reID, and CUHK03. The experimental results verify the adaptability of the joint attention mechanism to non-rigid deformation of the human body, which can effectively improve the accuracy of ReID.


I. INTRODUCTION
Person re-identification (ReID) aims to tell whether a person can be found in other non-overlapping surveillance camera views by matching person images [1]. The key to the ReID problem is how to discover identity-discriminative information from difficult samples. Difficult samples can be divided into two categories: one is due to environmental changes (e.g., view variations of cameras, illumination, occlusion); the other is related to pedestrian attributes (e.g., pose changes, intra-person appearance variation, inter-person appearance similarity), as shown in Fig. 1. Non-rigid deformation is a common problem in both types of difficult samples. Non-rigid deformation means that the relative distances between parts of the human body change. In real-world applications, non-rigid deformation severely affects the performance of person ReID.
To address the problem, recent ReID methods focus on how to extract more discriminative feature representations. Some works cut the feature maps into strips [49] or grids [5] to extract regional features. However, these deep methods implicitly assume accurate bounding boxes and simply adopt existing deep architectures, which have high model-design complexity and lack interpretability. Some other works rely on prior knowledge to localize the discriminative components, such as poses or body landmarks [36], [51], [52]. However, the performance of these methods is highly dependent on the robustness of the pose or landmark estimation models. Erroneous estimation of poses or landmarks may greatly influence the identification result. Therefore, we want to design an end-to-end architecture that can effectively handle the misalignment caused by inaccurate bounding boxes and that does not need additional auxiliary labels.
The associate editor coordinating the review of this manuscript and approving it for publication was Shuhan Shen.
We propose a joint attention person ReID architecture (JA-ReID), which combines soft pixel-level attention and hard region-level attention. The soft pixel-level attention, which is based on saliency, can dynamically select the useful features. It automatically localizes the most activated part of the feature maps by aggregating all the pixels across channels into one feature map and taking the largest connected component. The advantage of this method is that it can remove background noise and less distinctive parts of the image without learning parameters, which benefits the ReID problem. The hard region-level attention, which rigidly divides the convolutional feature into several parts, selects features according to fixed scales. We introduce the hard region-level attention to get coarse-grained feature representations. This complementary attention mechanism makes the extracted feature representations more discriminative.
Moreover, we adopt two effective strategies to improve the accuracy of ReID. First, we reduce the dimension of each part separately after dividing the feature into parts. The experimental results show that the method performs better than reducing the dimension of the whole feature directly. Second, we adopt an improved pooling strategy. Average pooling perceives the information of the whole image including the background. Max pooling is concerned with the activated part. We concatenate both of them in the width direction. It performs better than using any pooling strategy alone in the experiments.
The main contributions of this paper are as follows:
(1) We propose a soft pixel-level attention mechanism, which can get fine-grained information of the image.
(2) A novel Joint Attention person re-identification architecture (JA-ReID) is proposed by combining the soft pixel-level attention and hard region-level attention, which can maximize the correlated complementary information.
(3) We have achieved competitive results on three large datasets including Market1501, DukeMTMC-reID, CUHK03. The experimental results verify the adaptability of the joint attention mechanism to non-rigid deformation of the human body, which can effectively improve the accuracy of ReID.
The rest of this paper is organized as follows. Sec. II introduces the related work on the person re-identification (ReID) task. The details of the proposed JA-ReID method are presented in Sec. III. In Sec. IV, we report experiments on three datasets and further discuss the effectiveness of the proposed method. Sec. V gives a brief summary and discussion of our work.

II. RELATED WORK
Early person ReID methods are based on hand-crafted features and have two components: image description and distance metrics. Hand-crafted features typically rely on color [2], texture features [3] and SIFT [4]. With the great breakthrough of deep learning in the field of computer vision, the person ReID problem has gradually entered the era of deep learning. In this section, we review several deep learning person ReID methods from three perspectives.

A. DEEP LEARNING GLOBAL FEATURES
Most of the early work based on deep learning uses global features. These methods regard person ReID as a matching or classification problem. The matching approaches use a siamese network with image pairs or triplets [25], [46], [47], [49], [50], [53] as input. The disadvantage of these methods is that the query has to be paired (or grouped into a triplet) with each gallery image before being sent into the network, which is a time-consuming process on large datasets. The classification approaches [48] can make full use of ReID labels, whereas the siamese network only considers pairwise (or triplet) labels. On larger datasets, such as PRW and MARS [54], [55], the classification approaches achieve good performance without careful training sample selection. However, the classification approaches assume that person images are well aligned, so they rely heavily on the accuracy of bounding boxes. In many real-world scenarios, bounding boxes are not perfect.

B. DEEP LEARNING LOCAL FEATURES
To overcome the limitation of global features, many approaches focus on local features, which carry more regional information. This type of approach can be divided into three categories. The first type crops the image into pre-defined strips. Sun et al. [6] propose a Part-based Convolutional Baseline (PCB) to learn discriminative part features. Yi et al. [49] divide an image into three strips, each of which is fed to an SCNN, and concatenate the part features. The second type divides the image into grid patches. Ahmed et al. [50] match each grid not only with the grid in the same position but also with the neighboring grids of the other image. The third type relies on prior knowledge such as human pose estimation or landmarks [51], [52]. Su et al. [36] embed pose information into the architecture and generate a modified image. However, the performance of these methods is highly dependent on the robustness of the pose or landmark estimation models. Whether the image is cropped based on patches or on human posture key points, each part is a rigid segment, so the parts may contain outliers that make the content of each partition inconsistent.

C. DEEP MULTI-VIEW LEARNING
Some researchers consider person ReID as a cross-view classification problem, which recognizes persons from different cameras. Learning a common feature space from multi-view spaces becomes an effective approach to solve the problem. Cao et al. [58] propose a unified solution for subspace learning methods using the Rayleigh quotient, which is extensible to multiple views. You et al. [59] present the MvCCDA algorithm for cross-view classification to handle view discrepancy, discriminability and nonlinearity in a joint manner. Zhu et al. [60] propose a multi-view PHDL approach, which aims to learn a type-specific discriminative representation space from the visual appearance features of images and this type of video features.

III. METHODOLOGY

A. JA-ReID OVERVIEW
Our target is to train a deep feature representation architecture that can reduce the interference caused by non-rigid deformation. The proposed JA-ReID architecture learns the feature representation at three scales: the global scale, the regional scale (hard region-level attention) and the fine-grained scale (soft pixel-level attention), as shown in Fig. 2. The hard region-level attention may be disturbed by useless information (e.g., background noise) in each partition, so the regional features do not completely represent the human body. The soft pixel-level attention can locate the most discriminative part of the feature maps, which reduces the impact of hard region-level attention errors. The complementary attention mechanism is designed to identify the best visual patterns for the ReID problem, and it simulates the dorsal and ventral attention mechanism of the human brain [9]. In addition, we adopt two effective strategies to improve the Rank1 accuracy and mAP. First, after dividing the feature into p parts, we reduce the dimensions of each part separately. Second, we concatenate the average pooling and max pooling features in the width direction for training.
Apart from the classification model pre-trained on ImageNet [15], our attention mechanism does not require additional pre-trained models or labels. It can take various models as the backbone, such as VGG [41], ResNet [8] and Google Inception [42]. Because of its good performance in classification, we choose ResNet50 [8] as the CNN backbone. The original global average pooling (GAP) layer and the structure after it are removed.
The global feature branch consists of a global average pooling (GAP) layer and a global max pooling (GMP) layer. We concatenate the two feature maps in the width direction. Then the concatenated feature maps are put into the fully-connected (FC) layer and classified with a Softmax loss. Some methods, such as PCB [6], hold that focusing only on local features, without considering global features, can achieve good results. But we think that global features contain some useful information. Models can learn the relationship between the background and the person under different camera views, for example, human contour information. Local feature methods segment the background and lose the complete information of the image.
The hard region-level attention branch segments the feature maps horizontally into p parts. The feature maps of each part are put into a part average pooling layer and a part max pooling layer respectively. After pooling, the two part feature maps are concatenated in the width direction, put into the fully-connected (FC) layer and classified with a Softmax loss.
Soft pixel-level attention branch is more concerned with salient information, focusing on the most discriminative part. In some cases where the bounding boxes detected are inaccurate, the regional features acquired from the hard region-level attention branch do not match the corresponding part. The soft pixel-level attention mechanism can effectively lessen the impact of misalignment. We get a saliency weight map from soft pixel-level attention branch. With a tensor multiplication, it can select the most representative features.
Remarks: PCB [6] is the closest competitor, which also leverages part-based learning for person ReID. However, it has three major drawbacks. 1) The performance of PCB depends on precise bounding boxes; otherwise the pre-defined partition cannot be aligned very well. In practical application scenarios, current detection models are not accurate enough to guarantee that. 2) It benefits from a post-processing approach called RPP, which prevents the optimized model from being trained in an end-to-end manner. 3) Global information is an important clue for identification and recognition, and it is completely ignored in PCB. Global features are often robust to subtle view changes and internal changes. Our method JA-ReID improves on all three shortcomings.

B. SOFT PIXEL-LEVEL ATTENTION LEARNING
After resnet_5c in ResNet50, the input image I can be represented by a 3-dimensional tensor T, which is a sparse and distributed representation [7], [10]. The task of soft pixel-level attention is to produce a saliency weight map W, which is the same size as T. With a tensor multiplication of W and T, it can remove unimportant information such as backgrounds and select the most representative features.
We first sum the pixels of T along the channel direction to get a 2-dimensional feature representation called the aggregation map S. The h × w × c tensor T becomes an h × w tensor S, formulated as

$$S = \sum_{n=1}^{c} f_n,$$

where h, w and c denote the number of pixels in the height, width and channel dimensions respectively, and f_n denotes the n-th feature map in T. We have two considerations for this summation. First, from the perspective of cross-channel feature fusion, the activations of different channels are sparse for the same component in an image, and the useful information should be activated on most channels. Therefore we sum all the pixels in the same spatial position across channels, which reveals the most activated positions containing the most important recognition information. Second, in terms of the spatial distribution of features, a single channel contains at most some weak semantic information, most of which is noise, and cross-channel feature fusion can highlight the part that actually contains semantic information.
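The higher the activation response of a specific location (i, j), the greater the possibility that the corresponding region belongs to a main part of the person. In order to distinguish the more activated part of S, we calculate the mean value $\bar{s}$ of all the positions in S as the threshold:

$$\bar{s} = \frac{1}{h \times w} \sum_{i=1}^{h} \sum_{j=1}^{w} S(i, j).$$

Positions where the activation response is higher than $\bar{s}$ indicate the more discriminative component of the image, formulated as

$$M(i, j) = \begin{cases} 1, & S(i, j) > \bar{s} \\ 0, & \text{otherwise.} \end{cases}$$

The largest connected component of this thresholded map is then selected by a labelling procedure, which ends with the following steps:

10: search for the next unlabelled pixel and label it L;
11: end while
12: select the largest L_num and label its components 1;
13: label other components 0;
14: return a saliency weight map W

We can efficiently compute the attention map A from the feature maps T and the saliency weight map W with a tensor multiplication, formulated as

$$A = W \odot T,$$

where $\odot$ denotes element-wise multiplication. This means that the most distinguishing salient information is retained and all the other parts are set to 0.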
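The steps of this subsection (channel-wise aggregation, mean thresholding, largest connected component, masking) can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function names, the intermediate name `M`, and the choice of 4-connectivity are assumptions, since the text does not state which connectivity is used, and `W` is kept as an h × w map that is broadcast over the channels.

```python
from collections import deque

import numpy as np


def largest_component(mask):
    """Keep only the largest 4-connected region of a boolean map (assumed connectivity)."""
    h, w = mask.shape
    labels = np.zeros((h, w), dtype=int)  # 0 means "not labelled yet"
    sizes = {}
    current = 0
    for i in range(h):
        for j in range(w):
            if not mask[i, j] or labels[i, j]:
                continue
            current += 1
            labels[i, j] = current
            queue = deque([(i, j)])
            size = 0
            while queue:  # breadth-first flood fill of one component
                y, x = queue.popleft()
                size += 1
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not labels[ny, nx]:
                        labels[ny, nx] = current
                        queue.append((ny, nx))
            sizes[current] = size
    if not sizes:
        return np.zeros((h, w))
    best = max(sizes, key=sizes.get)
    return (labels == best).astype(float)


def soft_pixel_attention(T):
    """T has shape (h, w, c); returns the attention map A and the weight map W."""
    S = T.sum(axis=2)          # aggregation map, shape (h, w)
    M = S > S.mean()           # positions above the mean activation
    W = largest_component(M)   # 1 on the largest component, 0 elsewhere
    A = T * W[:, :, None]      # broadcast W over the channel axis
    return A, W
```

For a feature tensor with one large activated blob and one isolated activated pixel, `W` keeps the blob and zeroes the isolated pixel, so the corresponding entries of `A` vanish. No parameters are learned in this step, in line with the description above.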

C. HARD REGION-LEVEL ATTENTION LEARNING
The soft pixel-level attention focuses on fine-grained features, while the hard region-level attention pays more attention to the regional features of the human body. According to the distribution of the human body, we divide the tensor T into p parts horizontally. For the hyper-parameter p, we follow the pedestrian partition of PCB [6] and set p = 6, which was shown there to perform best. Then we downsample each part of the tensor by part average pooling and part max pooling. We concatenate the two features in the width direction and use p 1 × 1 convolution layers to reduce the dimension and thus the computation. After that, the feature maps of each part are put into a classifier, which consists of a fully-connected (FC) layer followed by a Softmax loss.
Remarks: For how to segment and reduce the dimension of T, we consider two methods. The first method has fewer parameters. The dimension of T is reduced through a single 1 × 1 convolution layer, and then the result is divided into p parts, as shown in Fig. 3(a). The other method is to divide T into p parts first and then reduce the dimension of each part with p 1 × 1 convolution layers that have independent parameters, as shown in Fig. 3(b). Our architecture uses the second setting, which does not share the parameters of the 1 × 1 convolution layers across parts. We provide ablation experiments in Sec. IV.D.2 to verify the effectiveness of this setting.
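A rough NumPy sketch of the second setting (split first, then reduce each part with its own 1 × 1 convolution) is given below. It is illustrative only: the function names, tensor sizes and reduced dimension are hypothetical, biases are omitted, and the "width direction" concatenation is modelled as stacking the average-pooled and max-pooled vectors into a 2 × c array. On a pooled feature, a 1 × 1 convolution reduces to a matrix product over the channel axis.

```python
import numpy as np


def part_features(T, p=6):
    """Split T (h, w, c) into p horizontal stripes and pool each one.

    Each stripe yields a (2, c) array: row 0 is part average pooling,
    row 1 is part max pooling.
    """
    feats = []
    for stripe in np.array_split(T, p, axis=0):
        avg = stripe.mean(axis=(0, 1))
        mx = stripe.max(axis=(0, 1))
        feats.append(np.stack([avg, mx], axis=0))
    return feats


def reduce_parts(feats, weights):
    """Apply an independent 1x1 convolution (a (c, d) matrix) to each part."""
    return [f @ w_k for f, w_k in zip(feats, weights)]


# Hypothetical sizes chosen only for illustration.
rng = np.random.default_rng(0)
T = rng.standard_normal((12, 4, 8))                          # h=12, w=4, c=8
weights = [rng.standard_normal((8, 4)) for _ in range(6)]    # one matrix per part
reduced = reduce_parts(part_features(T), weights)            # six arrays of shape (2, 4)
```

The first setting would correspond to passing the same matrix for every part, which has p times fewer reduction parameters.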

D. POOLING STRATEGY
The average pooling is widely used in many classification problems because it forces a corresponding relationship between the feature representations and the category. This average operation gives all parts of the image the same weight and equal treatment. But when the background is similar to the pedestrian, average pooling may cause a low response to the component of the pedestrian and ignore it. On the other hand, the max pooling extracts the most discriminative information but lacks global relevance to the whole image.
These two pooling strategies have their own advantages. In order to combine these two complementary features and get a more representative feature, we concatenate the two features obtained by average pooling and max pooling in the width direction for training. This concatenated feature is put into the FC layer, and the Softmax loss is computed on its output.
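To make the contrast between the two pooling operators concrete, the sketch below pools a toy feature tensor in which a single position is strongly activated, then feeds the concatenated feature to a linear classifier with a Softmax (cross-entropy) loss. The names `concat_pool` and `softmax_loss`, the shapes and the number of classes are assumptions made for illustration only.

```python
import numpy as np


def concat_pool(T):
    """Global average and max pooling of T (h, w, c), stacked as a (2, c) array."""
    avg = T.mean(axis=(0, 1))   # every position contributes equally
    mx = T.max(axis=(0, 1))     # only the strongest response survives
    return np.stack([avg, mx], axis=0)


def softmax_loss(feature, W_fc, b_fc, label):
    """Cross-entropy of a linear (FC) classifier applied to the flattened feature."""
    logits = feature.reshape(-1) @ W_fc + b_fc
    logits = logits - logits.max()                      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return float(-log_probs[label])


# One strongly activated position on an otherwise empty 4 x 4 map with 3 channels.
T = np.zeros((4, 4, 3))
T[0, 0, :] = 8.0
f = concat_pool(T)   # the average row is diluted by the background, the max row is not

# Untrained classifier with 5 hypothetical identities.
loss = softmax_loss(f, np.zeros((6, 5)), np.zeros(5), label=2)
```

In this toy case the average-pooled row is 0.5 per channel while the max-pooled row keeps the full response of 8, which is the complementarity the concatenation is meant to exploit.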

IV. EXPERIMENTS

A. DATASETS AND PROTOCOL
We evaluate the proposed JA-ReID architecture on three large benchmarks Market1501 [18], DukeMTMC-reID [11], CUHK03 [12]. Table 1 shows the structure of the three datasets. Market1501 contains 2793 distractors. DukeMTMC-reID has 8 cameras, whose backgrounds are more complicated than Market1501. There are also distractors in DukeMTMC-reID, so it is the most challenging dataset at present. CUHK03 offers two types of annotations: human labeled and detected by DPM [43]. Our experiments are based on the detected label images.
We use the cumulative matching characteristic (CMC) and mean Average Precision (mAP) metrics to measure the performance. We set all the experiments with the single query evaluation. Moreover, in order to verify the effectiveness of the approach, we do not use data augmentation, reranking [13] or random erasing [16], which considerably improve mAP.
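As a small worked example of the two metrics, the following NumPy sketch computes the CMC curve and mAP from a query-by-gallery distance matrix under single-query evaluation. It is a simplified illustration with a hypothetical `evaluate` function: the benchmark protocols additionally discard gallery images taken by the same camera as the query and distractor images, which is omitted here.

```python
import numpy as np


def evaluate(dist, q_ids, g_ids, max_rank=3):
    """dist: (num_query, num_gallery) distances, smaller means more similar."""
    cmc = np.zeros(max_rank)
    aps = []
    for q in range(dist.shape[0]):
        order = np.argsort(dist[q])                       # gallery sorted by distance
        matches = (g_ids[order] == q_ids[q]).astype(float)
        if matches.sum() == 0:
            continue                                      # identity absent from gallery
        first_hit = int(np.argmax(matches))               # rank of first correct match
        if first_hit < max_rank:
            cmc[first_hit:] += 1
        precision = np.cumsum(matches) / (np.arange(len(matches)) + 1)
        aps.append(float((precision * matches).sum() / matches.sum()))
    return cmc / len(aps), float(np.mean(aps))


dist = np.array([[0.1, 0.9, 0.5],
                 [0.8, 0.2, 0.3]])
q_ids = np.array([1, 1])
g_ids = np.array([1, 2, 1])
cmc, mAP = evaluate(dist, q_ids, g_ids)
```

Here the first query retrieves both correct images at the top (AP = 1), while the second finds its first match at rank 2 (AP = 7/12), giving a Rank1 rate of 0.5 and an mAP of 19/24.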

B. IMPLEMENTATION DETAILS
We implement our JA-ReID in the PyTorch [14] framework with an NVIDIA 1080 Ti GPU. The backbone ResNet50 [8] is pre-trained on ImageNet [15]. The training images are resized to 384 × 192. We set the batch size to 32 and train the model for 50 epochs with the base learning rate initialized at 0.0003. We use AMSGrad [17] with the two moment terms β1 = 0.9 and β2 = 0.999. All datasets share the same experimental setting as above.
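For readers unfamiliar with the optimizer, the sketch below shows one AMSGrad update in NumPy with the learning rate and moment terms quoted above. It follows the basic AMSGrad rule (a running maximum of the second-moment estimate) without bias correction; framework implementations may differ in such details, and in PyTorch the variant is typically selected with `torch.optim.Adam(..., amsgrad=True)`. The function name, the `eps` value and the toy objective are assumptions.

```python
import numpy as np


def amsgrad_step(theta, grad, state, lr=0.0003, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AMSGrad update; `state` holds the moment estimates m, v and v_hat."""
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
    # Unlike Adam, the denominator uses the running maximum of v.
    state["v_hat"] = np.maximum(state["v_hat"], state["v"])
    return theta - lr * state["m"] / (np.sqrt(state["v_hat"]) + eps)


# Toy objective f(theta) = theta^2, whose gradient is 2 * theta.
theta = np.array([1.0])
state = {"m": np.zeros(1), "v": np.zeros(1), "v_hat": np.zeros(1)}
v_hat_history = []
for _ in range(100):
    theta = amsgrad_step(theta, 2.0 * theta, state)
    v_hat_history.append(float(state["v_hat"][0]))
```

Because `v_hat` never decreases, the effective step size cannot grow again after a large gradient has been seen, which is the property that distinguishes AMSGrad from plain Adam.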
In the testing stage, we concatenate 6 regional features after part average pooling as the final feature maps.

C. COMPARISON WITH THE STATE OF THE ART METHODS
We compare JA-ReID with the state of the art on three datasets. Note that we do not use any data augmentation methods, such as scaling, rotation or color distortion, nor any additional pre-trained model. Most deep learning ReID methods benefit greatly from these operations, which require heavier computation and are time-consuming. Since PCB [6] is the closest competitor and also leverages part-based learning for person ReID, we reproduce the PCB method in order to evaluate the two approaches under the same experimental environment. There may be some differences between our implementation and that of the PCB paper, which does not give all the details. The reproduced results are different from those reported in the paper, especially on the CUHK03 dataset. If the reproduced PCB results were better, we would expect our method to improve further as well. In this paper, we compare our performance with the reproduced results of PCB. All the other results are obtained from the published papers.
VOLUME 7, 2019
Comparisons on Market1501 are detailed in Table 2. The comparison methods are divided into three categories, namely, hand-crafted methods, deep learning methods with global feature and deep learning methods with local features. The method we proposed, JA-ReID, has a greater improvement on Rank1 and mAP than the previous methods including PDC [36] which needs auxiliary component labels to deliberately align components. Compared with HA-CNN [45], which does not use any augmentation, our method has a slight gap in Rank1 (−0.8%), but it has an improvement (+0.4%) in mAP. Compared with the camera style adaptation [56] and MLFN [57] of 2018, JA-ReID achieves higher Rank1 accuracy and mAP. Compared with PCB, our method has an improvement in Rank1 (+0.5%) and mAP (+0.3%).
Person images from DukeMTMC-reID have more variations in illumination and backgrounds than those of Market1501 because of wider camera views and a more complex scene layout. Our JA-ReID approach is again competitive with the most recent methods, as shown in Table 3. Although our method performs slightly worse than MLFN [57] in Rank1 (−0.1%), it has a clear improvement in mAP (+2.9%). JA-ReID achieves higher Rank1 accuracy and mAP than the camera style adaptation [56]. Compared with PCB, our method performs better in Rank1 (+1.4%) and mAP (+0.4%).
We visualize some results of Market1501 in Fig. 4. Whether the people in the images are occluded, in large pose variations or from different perspectives, our method can extract discriminative feature representations. The experimental results verify the adaptability of the proposed JA-ReID architecture to non-rigid deformation of the human body, which can effectively improve the accuracy of ReID.

D. ABLATION STUDY

1) EFFECT OF SOFT PIXEL-LEVEL ATTENTION
We conduct experiments on the effect of soft pixel-level attention on Market1501, DukeMTMC-reID and CUHK03 with the same training set. First, we adopt only the global feature branch for training and use the global features for testing. Second, we train the model that combines the global feature branch with the soft pixel-level attention branch, and use the global features for testing. Third, we train the model that combines the global feature branch with the hard region-level attention branch, and use the concatenated regional feature maps for testing. It can be seen from Fig. 6 that, with the soft pixel-level attention branch, the global feature branch achieves an increase in Rank1 (+3%) and mAP (+5.5%) on Market1501, in Rank1 (+1.3%) and mAP (+4.2%) on DukeMTMC-reID, and in Rank1 (+3.8%) and mAP (+5.5%) on CUHK03. Compared with the global feature branch combined only with the hard region-level attention branch, JA-ReID with the soft pixel-level attention branch achieves an improvement in Rank1 (+0.4%) and mAP (+2.4%) on Market1501, in Rank1 (+1.1%) and mAP (+0.9%) on DukeMTMC-reID, and in Rank1 (+0.9%) and mAP (+2.6%) on CUHK03. In order to better illustrate the role of the soft pixel-level attention branch, we visualize the feature maps obtained from it in Fig. 5. For example, the third person is occluded in the third image, and the feature maps focus on the unoccluded upper body of the pedestrian. When the people in the images show large pose variations or are seen from different perspectives (such as the first, second and fourth person), our soft pixel-level attention can locate the most discriminative part.

2) COMPARISON OF TWO DIMENSION REDUCTION METHODS
In hard region-level attention branch, we use 1×1 convolution layer to reduce the dimension, which aims to reduce computational complexity. We first use one 1 × 1 convolution layer for the whole feature maps, which has the advantage of fewer parameters. In addition, we try the independent parameter 1 × 1 convolution layer for each part. We set experiments on Market1501, DukeMTMC-reID and CUHK03. As shown in Table 5, setting an independent parameter 1 × 1 convolution layer for each part can make an improvement in Rank1 (+1.3%) and mAP (+2.5%) on Market1501, in Rank1 (+1.1%) and mAP (+0.8%) on DukeMTMC-reID, in Rank1 (+7.9%) and mAP (+8.1%) on CUHK03. This shows that it is better to reduce the dimension of each component separately with an independent parameter 1× 1 convolution layer.

3) EFFECT OF DIFFERENT POOLING STRATEGIES
Before putting the features into classifiers, we consider three pooling strategies, using GAP (global average pooling), GMP (global max pooling) and concatenated features for training. As can be seen from Table 6, the Rank1 and mAP of max pooling strategy are a bit higher than the ones of average pooling strategy. That is because max pooling strategy focuses more on discriminative parts of the feature maps. Average pooling strategy gives the same weight to all parts of the feature maps, which is easily influenced by the background. Each of these two strategies has unique advantages. We concatenate the two features obtained by average pooling and max pooling in the width direction for training. Table 6 shows that this strategy achieves the best results.
FIGURE 6. Results of different combined branches on Market1501, DukeMTMC-reID and CUHK03. We show that soft pixel-level attention branch can improve the performance over the global branch. In addition, soft pixel-level attention branch yields consistent improvement over the combination of global branch and hard region-level attention branch.

V. CONCLUSION
In this work, we propose a novel Joint-Attention ReID (JA-ReID) approach, which can effectively solve the current challenge of non-rigid deformation of the human body in ReID task. Unlike most methods just using one constraint, JA-ReID combines two types of attention: soft pixel-level attention and hard region-level attention. The advantage of the joint attention mechanism is that it can focus on feature representations of fine-grained and coarse-grained scale, which can improve the robustness of the model and make the features extracted more discriminative. The proposed soft pixel-level attention mechanism can locate the most discriminative part of the feature maps without additional auxiliary labels and training. Moreover, we adopt two effective strategies to improve the accuracy of ReID. First, we reduce the dimension of each part separately after dividing the feature into parts. Second, we adopt an improved pooling strategy. The experimental results show that our approach JA-ReID has achieved very competitive results on three datasets. Ablation study well demonstrates the effectiveness of our JA-ReID approach. The experimental results verify the adaptability of the joint attention mechanism to non-rigid deformation of the human body, which can effectively improve the accuracy of ReID.

SHANSHAN JIAO received the B.S. and M.S. degrees in safety engineering from the University of Science and Technology Beijing, Beijing, in 2011 and 2013, respectively. She is currently pursuing the Ph.D. degree in computer science and technology with PLA Army Engineering University, Nanjing, China. Her research interests include computer vision and machine learning.
JIABAO WANG received the Ph.D. degree in computational intelligence from the PLA University of Science and Technology, Nanjing, China, in 2013. He is currently an Assistant Professor with PLA Army Engineering University, Nanjing. His current research interests include computer vision and machine learning.
GUYU HU received the B.S. degree in radio communication from Zhejiang University, Hangzhou, China, in 1983, and the M.S. degree in computer application technology and the Ph.D. degree in communications and information systems from the Nanjing Institute of Communications, Nanjing, China, in 1989 and 1992, respectively. Since 1990, he has been involved in the research on network management. Since 1997, he has been a Full Professor with the PLA Army Engineering University, China. Since 1998, his research interests include intelligent of network management, mainly on failure-finding from data with pattern recognition, machine learning, and neural networks.
ZHISONG PAN received the Ph.D. degree in computational intelligence from the PLA University of Science and Technology, Nanjing, China, in 2013. He is currently a Professor with PLA Army Engineering University, Nanjing. His current research interests include computer vision and machine learning.
LIN DU received the B.S. degree in computer science and technology from Northeastern University, Shenyang, China, in 2014, and the M.S. degree in communication and information system from the PLA University of Science and Technology, Nanjing, China, in 2017. He is currently pursuing the Ph.D. degree with PLA Army Engineering University, Nanjing. His research interests include computer vision and machine learning.
JIN ZHANG received the B.S. and M.S. degrees in nuclear power engineering from the PLA Naval University of Engineering, Wuhan, in 2007 and 2009, respectively. He is currently pursuing the Ph.D. degree in computer science and technology with PLA Army Engineering University, Nanjing, China. His research interests include computer vision and machine learning.