A Double Stream Person Re-Identification Method Based on Attention Mechanism and Multi-Scale Feature Fusion

Person re-identification mainly plays a role in multiple non-overlapping camera surveillance environments to determine whether a target person of interest who has appeared under one camera appears again under others. However, in real scenes, the image data taken by surveillance cameras may be occluded or blurred, which increases the difficulty of identifying the pedestrian posture and leads to a dramatic decrease in recognition accuracy. To solve the above problems, we propose a dual-branch multi-scale feature fusion network, which improves the expression ability of pedestrian features under partial occlusion by learning discriminative pedestrian features. By embedding a lightweight attention module into Residual neural network-50 (Resnet-50), the image features of the channel dimension are extracted, and the interference caused by cluttered background information is suppressed. In the training phase, average pooling layers and maximum pooling layers with different kernels and strides are applied to different residual stages, and a mixed pooling strategy with different kernels and a mixed loss function are designed. Compared with existing representative methods on the Market1501, DukeMTMC-reID and MSMT17 datasets, the experimental results show that the features extracted by the proposed method are more discriminative and achieve high recognition accuracy.


I. INTRODUCTION
Person Re-Identification (Re-ID) is a computer vision technology that has been widely used in intelligent video surveillance systems. In real-life situations, face features often cannot be clearly captured due to equipment, weather or other problems, so the external features of people, such as clothing, skin color and posture, can be used to identify the pedestrian identity. The image or video data of a specific pedestrian obtained from one camera device serves as the detection set, and the image or video data obtained from other camera devices serves as the candidate set. By resorting to pedestrian association technology across camera devices, the pedestrian with the same identity as the detection set is searched from the candidate set [1]. As shown in Figure 1, person re-identification
is a challenging task, as images captured by different cameras often contain numerous intra-class variations, and the same pedestrian may show different features under different cameras. Therefore, designing or learning as many representations as possible that are resistant to intra-class variations has been one of the main goals of person re-identification.
Since deep learning models have a powerful ability to fit visual representations, existing approaches tend to focus more on direct visual representations with robust results, while ignoring some representative and critical information [2]. With the rapid development of the attention mechanism, attention-based models have gained more and more attention in the field of person re-identification, offering a partial solution to the problem of insufficient local expressiveness. The attention mechanism extracts discriminative pedestrian features by focusing selectively on the pedestrian part of the image and ignoring other uninteresting regions, enhancing the saliency of pedestrian features. (In Figure 1, (1) is the pedestrian detection module, which determines the candidate set, and (2) is the re-identification module, which selects pedestrians with the same identity as the detected pedestrians from the gallery.) Although there are many excellent works that combine the attention mechanism with person re-identification methods [3], [4], [5], [6], [7], they suffer from the problem of not being able to correctly learn the high-response local regions of images. Therefore, it is particularly important to construct a learning network that can extract local features without destroying the original image representations.
The loss function is another important factor that affects the training effectiveness of a re-identification model. Early models were usually treated as classification networks [8], so the loss function adopted was the corresponding Softmax loss. Triplet loss [9] is another commonly used loss function, which makes input samples with the same label as close as possible and samples with different labels as far apart as possible in the embedding space. Since the Softmax loss achieves robust performance in different classification tasks, and the Triplet loss has advantages in fine-grained recognition, a strategy that combines the Softmax loss and the Triplet loss is employed in this paper.
This paper proposes a two-branch multi-scale feature fusion person re-identification model based on CNNs, which exhibits better performance by learning features at different scales. Our approach has two main contributions: 1) a two-branch feature extraction network is designed, and channel attention is incorporated into the different branches, which makes the network extract more critical information; 2) a hybrid loss function is constructed in the designed model. Specifically, in each branch, we design a particular loss function and take advantage of the average pooling and maximum pooling mechanisms to extract features, evaluating them using the Triplet loss, the Softmax loss or their combination. Moreover, the model is tested on three mainstream datasets, and the extensive experimental results show that the model achieves good results under both the Rank-1 criterion and the mAP criterion.

II. RELATED WORKS
Thanks to the emergence of large-scale, well-benchmarked pedestrian datasets, deep learning based methods are widely exploited, which promotes the development of person re-identification algorithms. Among them, research based on local representations and on occluded scenes is gradually receiving attention. Meanwhile, attention-based learning approaches are also a promising new research direction for designing person re-identification models, and a large number of attention-based learning approaches have appeared in various top conferences.

A. PERSON RE-IDENTIFICATION BASED ON DEEP LEARNING
Deep learning based person re-identification methods can be divided into representation learning and metric learning. Representation learning intuitively uses the ability of deep learning models to extract features by treating person re-identification as a classification or verification problem, classifying identities by the extracted pedestrian features. Zheng et al. [10] proposed the Identity Embedding (IDE) network, which became the representation learning benchmark, and subsequent approaches improved upon the IDE network. Since the classification label of the supervised signal is the pedestrian ID, the cross-entropy loss in IDE networks is often referred to as ID loss. Some successful improvements to IDE networks are Label Smoothing [11] and Sphere Re-ID [12], which further improve the performance of IDE networks and are therefore becoming mainstream methods.
Metric learning gives the same pedestrian a better distribution in the feature space than different pedestrians by learning the similarity between samples. Hermans et al. [13] selected the positive and negative samples that are difficult to distinguish from each other in a batch to train the triplet loss function and enhance its ability to mine hard sample pairs. To solve the problem of the large intra-class distance of the triplet loss function, some researchers proposed a quadruplet loss function based on the triplet loss. Chen et al. [14] proposed a person re-identification network architecture that utilizes the quadruplet loss function by adding a negative sample with a different ID.
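As a sketch of the batch-hard mining idea of Hermans et al. [13], the following illustrative PyTorch snippet selects, for each anchor in a batch, the farthest positive and the nearest negative; the function name and the Euclidean distance choice are ours, not the original code:

```python
import torch

def batch_hard_triplets(feats, labels):
    """Batch-hard mining sketch: for each anchor, pick the farthest
    positive and the nearest negative within the batch.
    Note: the anchor itself counts as a positive at distance 0."""
    d = torch.cdist(feats, feats)                  # pairwise Euclidean distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    pos = d.clone(); pos[~same] = -1.0             # mask out negatives
    neg = d.clone(); neg[same] = float('inf')      # mask out positives
    return pos.max(dim=1).values, neg.min(dim=1).values
```

These hardest distances would then feed a margin-based triplet loss, which is what gives the strategy its hard-sample mining ability.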

B. PERSON RE-IDENTIFICATION BASED ON LOCAL FEATURES
According to the type of image features used, re-identification algorithms can be classified into two major groups: global feature based and local feature based methods. Global feature methods make the network extract one feature for the whole image, without considering local information. However, extracting only one feature per image often fails to achieve effective results, so it is necessary to extract more features and fully consider the local information of the image. Commonly used methods for local feature extraction are fixed chunking and attention-based mechanisms.
Most local feature-based methods segment the image according to a predetermined division and cannot handle the feature information at the segmentation points, which makes the model inflexible. Sun et al. [15] presented a Part-based Convolutional Baseline (PCB) method that divides the feature map into multiple horizontal feature blocks during feature extraction. Then a pooling operation is performed, and multiple feature vectors are generated by a feature downscaling technique with small convolutional kernels. The PCB methodology has had a large impact on subsequent works, many of which take it as a baseline [16], [17], [18].
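The horizontal-stripe pooling at the core of PCB can be sketched as follows; this is an illustrative simplification (the stripe count and average pooling follow common practice, not necessarily the exact configuration of [15]):

```python
import torch

def pcb_parts(feat_map, p=6):
    """PCB-style sketch: split the final feature map into p horizontal
    stripes and pool each stripe into its own part-level vector."""
    stripes = feat_map.chunk(p, dim=2)             # split along the height axis
    return [s.mean(dim=(2, 3)) for s in stripes]   # p vectors of shape (N, C)
```

In PCB, each part vector is then reduced and fed to its own classifier, which is exactly the inflexibility at segmentation points that the text above discusses.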

C. PERSON RE-IDENTIFICATION BASED ON ATTENTIONAL MECHANISMS
With the intensive research and wide application of the attention mechanism in the image field, person re-identification algorithms integrated with attention have become a dominant research direction [19], [20], [21]. The attention mechanism makes the network focus more on the effective features in the image by weighting local information. Inspired by the attention mechanism, Zhao et al. [22] proposed a new method that solves the unaligned key-point problem by dividing the network into K branches and adopting different weights when extracting features from different regions. Wang et al. [23] presented a person re-identification method that effectively solves the occlusion problem by using local features as nodes of a graph and aggregating the information between nodes through an adaptive directional graph convolution (ADGC). In recent years, there have also been other representative works on image attention [24], [25], [26], [27], [28].
Although deep learning-based person re-identification methods have achieved great results, they still have shortcomings in some aspects. The majority of local feature-based algorithms segment images according to predetermined divisions, ignoring the correlation between fine-grained features and failing to process feature information at segmentation points, which makes the models ineffective and inflexible. At the same time, existing models rely heavily on pedestrian color representations, so they can struggle with changes in complex lighting conditions. The attention mechanism can partially resolve this, but it is still difficult to learn the correct high-response local areas. This paper therefore summarizes the existing methods and addresses the problem of poor correlation of local features.

III. PROPOSED METHOD
We propose a person re-identification method based on channel attention and multi-scale feature fusion, which provides a solution for the practical application of person re-identification. In this section, we begin with an overview of the architecture of the network, and then introduce each module of the network in detail.

A. NETWORK ARCHITECTURE
As shown in Figure 2, the network structure is divided into three modules: the feature extraction backbone network based on Resnet-50 [29], the local branch network based on batch feature erasing, and the multi-scale feature extraction networks, which are used to learn different levels of image features. The model is built with the goal of learning discriminative global features and local fine-grained features to improve the representation of pedestrian features in images under local occlusion conditions. A lightweight attention module is also embedded in the Resnet-50 residual module in order to fully extract key information about the pedestrian and obtain discriminative global features. The backbone network based on Resnet-50 extracts the feature map of the input images. After the first, shallow residual layer, the network is divided into a global branch and a local branch. The global branch is composed of the remaining residual layers of Resnet-50, and the features of different layers are connected by adaptive average pooling and adaptive max pooling layers in the third and fourth layers, respectively. The local branch extracts the fine-grained features of the image and is located after the first residual layer of the backbone network. The training stage is driven by the ID loss and the triplet loss [9], while the testing stage ranks candidates by calculating the Euclidean distance between features.

B. BACKBONE NETWORKS
We use Resnet-50 as the backbone network and adapt its structure by embedding the Efficient Channel Attention module (ECA module) [30] into the residual module. Different from SE-Net [31], the ECA module implements a local cross-channel interaction strategy without dimensionality reduction, which avoids the negative effect of dimensionality reduction on the learning of channel attention and effectively improves the discriminative ability of the network. As shown in Figure 3, global average pooling over the channels of the input features F is performed to obtain a vector of size (1 × 1 × C). After channel feature learning by a (1 × 1) convolution, the output is again a vector of size (1 × 1 × C), which avoids any dimensionality reduction operation. Finally, the input features (H × W × C) and the channel features (1 × 1 × C) are multiplied channel-wise to output the feature map with channel attention. With different sizes of input feature maps, the ECA module uses a dynamic convolution kernel to acquire features at different scales: depending on the number of channels, it chooses convolution kernels of different sizes for the one-dimensional convolution operation, which achieves more dynamic cross-channel interaction.
ω = σ(Conv1_k(y))    (1)

k = ψ(C) = |log2(C)/γ + b/γ|_odd    (2)

where Conv1 is the one-dimensional convolution; k is the size of the one-dimensional convolution kernel, taken as the nearest odd number; y is the channel feature; C is the number of channels; γ and b are the hyperparameters; σ is the sigmoid function; and ω denotes the weights of the channels.
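A minimal PyTorch sketch of the ECA module described above; the hyperparameters γ = 2 and b = 1 follow the ECA paper's defaults, and this is an illustration rather than the authors' exact code:

```python
import math
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention: a 1-D convolution over the pooled
    channel descriptor, with kernel size chosen adaptively from the
    channel count, so no dimensionality reduction is needed."""
    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        t = int(abs(math.log2(channels) / gamma + b / gamma))
        k = t if t % 2 else t + 1          # force an odd kernel size
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                   # x: (N, C, H, W)
        y = self.avg_pool(x)                # (N, C, 1, 1)
        y = self.conv(y.squeeze(-1).transpose(-1, -2))   # 1-D conv across C
        y = self.sigmoid(y.transpose(-1, -2).unsqueeze(-1))
        return x * y.expand_as(x)           # channel-wise re-weighting
```

For C = 64 channels the adaptive rule above yields k = 3, i.e. each channel weight is learned from itself and its two neighbors.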

C. LOCAL BRANCH
The local branch extracts local multi-scale features of the image by the batch feature erasing method and is located after the first residual layer of the backbone network. The receptive field of the bottom features of Resnet-50 is comparatively small and contains many fine-grained features, but these would be disturbed by cluttered background information. Therefore, we place the local branch after the first layer of the residual network and calculate the loss separately for the local features, which enters the final loss in the form of a weighted fusion. The features fed into the local branch pass through a BN layer to generate depth features, and random erasing introduces a mask for the region to be erased, with the feature values in the mask region set to 0. After that, the features pass through a global average pooling layer and a global maximum pooling layer to obtain the final local features F_P, which are normalized to obtain the feature vector Y_P, as shown in Figure 4. The effect of erasure is achieved by setting the pixels in a region to 0 during a batch iteration, making the network focus more on the remaining fine-grained features. The location and size of the region to be erased are defined as follows: X and Y are the length and width of the region to be erased, r is the erasure ratio, and the randomly generated x_t and y_t are the coordinates of the upper-left initial point of the region to be erased. The pixel values in the erased region M are set to 0 if the sum of the horizontal coordinate x_t and the length X is less than the image length H, and the sum of the vertical coordinate y_t and the width Y is less than the image width W. After that, the erased image is fed into the feature extraction network to obtain the feature F_P.
If x_t + X is greater than H, or y_t + Y is greater than W, the coordinates are initialized again. The final features F_P obtained by the local fine-grained branch are fused with the global features through a splicing (concatenation) fusion strategy to obtain the final pedestrian feature vector Y_total.
VOLUME 11, 2023
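A simplified sketch of the batch feature erasing step: a single randomly placed rectangle, shared by every sample in the batch, is zeroed on the feature maps. The erasure ratios used here are illustrative placeholders, not the paper's values:

```python
import torch

def batch_feature_erase(feat, rh=0.3, rw=1.0):
    """Zero a shared random rectangle on every feature map in the batch.
    rh, rw: height/width ratios of the erased region (illustrative)."""
    n, c, h, w = feat.shape
    eh, ew = max(1, int(h * rh)), max(1, int(w * rw))
    # sample an upper-left corner so the rectangle always fits,
    # which avoids the re-initialization loop described in the text
    yt = torch.randint(0, h - eh + 1, (1,)).item()
    xt = torch.randint(0, w - ew + 1, (1,)).item()
    out = feat.clone()
    out[:, :, yt:yt + eh, xt:xt + ew] = 0
    return out
```

Because the same region is dropped for the whole batch, the network is forced to rank these samples using only the surviving fine-grained features.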

D. LOSS FUNCTION
The loss function used in our model is obtained by combining the Triplet loss [32] and the ID loss. The ID loss is the loss function commonly used in representation learning, as shown in Eq. (5):
L_ID = -(1/n) Σ_{i=1}^{n} log p(y_i | x_i)    (5)

where n is the number of samples trained in each batch, and p(y_i|x_i) is the predicted probability, after the softmax classification function, that the input image x_i is identified as its class label y_i. Triplet loss is a commonly used loss function in metric learning, which enlarges the distance between different classes and reduces the distance within the same class. In addition, the Triplet loss does not change the original position of features in space, so the deeper network layers are not affected by this optimization function. Applying the Triplet loss at the pooling layers after the third and fourth residual layers can effectively optimize the network; the Triplet loss is shown in Eq. (6):

L_Tri = [d_ap - d_an + α]_+    (6)

where d_ap and d_an are the distances from the anchor to the positive and negative samples, and α is the margin. To further increase the distance between positive and negative samples and decrease the distance between positive samples, we use a modified Triplet loss with an additional positive-pair margin on top of the original, as shown in Eq. (7):

L'_Tri = [d_ap - d_an + α]_+ + [d_ap - β]_+    (7)

The network thus not only pushes the positive and negative samples apart in the feature space, but also ensures that the distance between positive sample pairs is close enough. The final loss is

L_final = L_Tri-res4 + L_Tri-res5 + L_ID    (8)

where L_final denotes the final loss, L_Tri-res4 and L_Tri-res5 denote the Triplet loss at the third and fourth stages, and L_ID denotes the ID loss.
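A minimal sketch of the hybrid loss, assuming equal weighting of the three terms (the paper may weight them differently) and using PyTorch's built-in triplet loss rather than the modified form of Eq. (7):

```python
import torch
import torch.nn.functional as F

def hybrid_loss(logits, labels, a4, p4, n4, a5, p5, n5, margin=0.3):
    """ID (cross-entropy) loss on the classifier logits plus triplet
    losses on anchor/positive/negative features pooled from the third
    and fourth residual stages. Equal weighting is an assumption."""
    tri = torch.nn.TripletMarginLoss(margin=margin)
    return F.cross_entropy(logits, labels) + tri(a4, p4, n4) + tri(a5, p5, n5)
```

In practice the anchor/positive/negative features would come from batch-hard mining over each N × K batch; here they are plain tensors for clarity.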

IV. EXPERIMENTS

A. DATASETS
We use Market1501 [33], DukeMTMC-reID [34] and MSMT17 [35] as the evaluation datasets in this article. Figure 5 displays several images from the three datasets. The Market1501 dataset is from Tsinghua University and consists of 1501 pedestrians captured by 6 cameras, with a total of 12936 images, including 751 identities in the training set and 750 identities in the test set. The DukeMTMC-reID dataset is a large-scale labeled multi-target, multi-camera pedestrian tracking dataset made publicly available by Duke University, recorded by eight synchronized cameras and containing over 2700 individual identities. MSMT17 (Multi-Scene Multi-Time) is a large dataset closer to real scenes, presented by Wei et al. [36] at CVPR 2018, as shown in Table 1.

B. EXPERIMENT SETTINGS
We initialize the Resnet-50 network using parameters pre-trained on ImageNet. Each training iteration selects N pedestrian IDs from the training set and K images for each pedestrian ID; the size of the input images is set to 256 × 128, and the resulting batch size is N × K. The model is optimized using the Adam method. The initial learning rate is set to 0.00035 and reduced by a factor of 0.1 at the 80th and 140th epochs, for a total of 240 training epochs. Random flipping and random cropping are used simultaneously in the training phase for data augmentation. The experiments were performed on an Ubuntu 18.04 system with an 8-core Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10 GHz and 16 GB of memory. A TITAN Xp graphics card with 12 GB of video memory was used. The software environment is Python 3.7 and PyTorch 1.7.
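The optimization schedule described above can be expressed directly in PyTorch; the Linear layer is a stand-in for the actual Resnet-50 based model:

```python
import torch

model = torch.nn.Linear(2048, 751)      # stand-in for the Resnet-50 backbone
optimizer = torch.optim.Adam(model.parameters(), lr=3.5e-4)
# decay the learning rate by 0.1 at epochs 80 and 140 (240 epochs total)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[80, 140], gamma=0.1)
```

One `scheduler.step()` call per epoch (after the optimizer steps of that epoch) reproduces the 3.5e-4 → 3.5e-5 → 3.5e-6 schedule.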

C. EVALUATION METRICS
The evaluation metrics commonly used in person re-identification are adopted in this paper, mainly the cumulative match characteristic (CMC) and the mean average precision (mAP). The percentage of matches within the first n images of the CMC curve is denoted as Rank-n. The performance of the model is measured using Rank-1 and mAP in our case.
• Rank-n denotes the probability that, after sorting by similarity, at least one of the first n images belongs to the same ID as the query image. In the experiments, Rank-1 is selected as the evaluation index, which indicates the probability that the first returned image belongs to the same ID as the query image.
• mAP: Suppose the number of correctly predicted samples is TP and the number of incorrectly predicted samples is FP. Then the precision rate is defined as:

Precision = TP / (TP + FP)    (9)

AP refers to the average precision, which is the sum of the precision rates over the returned images of a category divided by the number of images containing targets of that category. The expression for AP is:

AP = (1/n) Σ_{c=1}^{M} Precision_c    (10)

where n is the number of images containing targets of the category, M is the total number of images returned, and Precision_c is the precision rate at the c-th returned image of the category. The mAP is the average precision of the model over all categories; since there is more than one category in target identification, the AP value needs to be averaged over all categories, as shown in Eq. (11):

mAP = (1/C) Σ_{k=1}^{C} AP_k    (11)

where C is the total number of categories and AP_k is the average precision of the k-th category.
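The Rank-1 and mAP computation can be sketched as follows for a query-by-gallery distance matrix; this simplified version omits the same-camera filtering applied by the real benchmark protocols:

```python
import numpy as np

def rank1_and_map(dist, q_ids, g_ids):
    """Rank-1 and mAP over a (num_query x num_gallery) distance matrix.
    Assumes every query has at least one match in the gallery."""
    rank1, aps = [], []
    for i in range(dist.shape[0]):
        order = np.argsort(dist[i])                       # nearest first
        matches = (np.asarray(g_ids)[order] == q_ids[i]).astype(int)
        rank1.append(matches[0])                          # top-1 hit or miss
        hits = np.where(matches == 1)[0]
        # AP: mean of precision evaluated at each correct match position
        precisions = [(k + 1) / (pos + 1) for k, pos in enumerate(hits)]
        aps.append(float(np.mean(precisions)))
    return float(np.mean(rank1)), float(np.mean(aps))
```

For a single query whose matches sit at gallery ranks 1 and 3, the AP is (1/1 + 2/3) / 2 = 5/6, matching the definition in Eq. (10).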

D. RESULTS
The comparison of our method with other state-of-the-art methods on the Market1501, DukeMTMC-reID and MSMT17 datasets is shown in Table 2 and Table 3, in which we use only Resnet-50 as the backbone of the model for testing the benchmark.

1) RESULTS ON THE MARKET1501 AND THE DukeMTMC-reID
Table 2 reports the results on Market1501 and DukeMTMC-reID. Our method achieves 94.8% rank-1 accuracy and 84.8% mAP on the Market1501 dataset, and our model improves by 0.4% on Rank-1 and 2.8% on mAP compared to CASN [37], which is based on global and local features. CASN needs to focus on local regions by manually segmenting the feature maps, which is not as effective as our method in extracting features. Compared with DNN+CRF [38] and other feature fusion networks, our model improves rank-1 accuracy by 1.3% and mAP by 3%. Compared with the Market1501 dataset, the DukeMTMC-reID dataset is closer to reality. Our method achieves 88.3% rank-1 accuracy and 75.1% mAP on the DukeMTMC-reID dataset, an improvement of 0.6% in rank-1 accuracy and 1.4% in mAP over CASN. Our model improves by 3.4% and 5.6% on the two metrics compared with the DNN+CRF [38] method trained using a combination of identity and attribute labels.

2) RESULTS ON THE MSMT17
We compared our model with other methods on the MSMT17 dataset in order to better validate its capability in real environments. As shown in Table 3, we obtained 82.1% rank-1 accuracy and 63.6% mAP. Although SCSN [43] improves on our method by 2.7% in rank-1 accuracy, we achieve the best result on the mAP index, improving on SCSN by 1.5%. For the MSMT17 dataset, which more closely resembles a realistic scenario, the mAP better reflects how many matching targets a query sample retrieves, which gives our method the better result.

E. ABLATION EXPERIMENT
To verify the influence of the attention module on person re-identification performance, the ECA module added to Resnet-50 was replaced with different attention modules while the rest of the settings were kept unchanged. Experiments were conducted on the Market1501 and DukeMTMC-reID datasets, and the results are shown in Table 4.
As shown in Table 4, the addition of an attention module can effectively improve the accuracy of the model. On the Market1501 dataset, with the addition of the SE module and the CBAM module, the rank-1 accuracy improves by 0.1%, while the mAP improves by 0.2% and 0.6%, respectively. The model with the ECA module improves by 0.8% and 0.9% on Rank-1 and mAP. On the DukeMTMC-reID dataset, the network with the ECA module improves rank-1 and mAP by 2.3% and 2.6%. The experimental results show that the ECA module has the best performance and can effectively help the model extract more significant global features and strengthen its retrieval ability.

V. CONCLUSION
In this article, we proposed a two-branch person re-identification method based on channel attention and multi-scale feature fusion. The method extracts accurate and recognizable features through the batch feature erasing module and the feature fusion module, which alleviates the occlusion and similarity problems. We adopt a mixed loss function combining the ID loss and the Triplet loss as the final loss to improve the generalization ability of the model. Extensive comparative experiments on three public datasets have shown that our method outperforms other state-of-the-art methods in most cases. In the future, we will explore the effectiveness of the present method on the unsupervised person re-identification problem by extracting invariant features from pedestrian data in different domains, aiming to further adapt the method in this paper to realistic application scenarios.
XIAO MA received the bachelor's degree in software engineering from Yangzhou University, Yangzhou, Jiangsu, China. Since September 2020, he has been a Postgraduate Researcher with the Chongqing University of Science and Technology, working on topics such as metric learning, generative adversarial networks, semi-supervised learning, and person re-identification. He has participated in mathematical modeling contests and won awards during his graduate studies.
WENQI LV is currently pursuing the Ph.D. degree with the Chongqing University of Science and Technology, under the supervision of Xiang Yi's mentor team. She mainly studies visual SLAM, machine vision, and image processing. She is also participating in research on visual odometry based on image enhancement technology in low-light scenes. To date, she has published one EI-indexed paper and holds one software copyright, as well as two invention patents that have already been accepted. During her postgraduate period, she won a First-Class Scholarship. She has participated in many competitions and won the