Multiscale Reference-Aided Attentive Feature Aggregation for Person Re-identification

In person re-identification (Re-ID), increasing the diversity of pedestrian features can improve recognition accuracy. In standard convolutional neural networks (CNNs), the receptive fields of the neurons in each layer are designed to have the same size. Therefore, in complex person re-identification tasks, standard CNNs extract local features well but cannot obtain satisfactory global features from the images. Local feature learning methods help obtain more abundant features, but they focus on the most significant local features and ignore the correlations between the features of the various parts of the human body. To solve the above problems, a new multiscale reference-aided attentive feature aggregation (MS-RAFA) mechanism is proposed, consisting of three main modules. First, to extract the most significant local features and strengthen the correlations between the features of the various parts of the human body, an autoselect module (ASM) is designed: an attention mechanism that stacks structural information and spatial relations to form new features. Second, to fuse the multiscale features of the multiple output branches of the backbone network and increase feature diversity, we propose a multilayer feature fusion module (MFFM), which enables the model to mine the features hidden by salient features and learn better representations. Third, to supervise the MFFM and guide the network toward better recognition features, we propose a multiple supervision mechanism. Finally, experimental results demonstrate that our proposed method outperforms state-of-the-art methods on three large-scale datasets.


I. INTRODUCTION
Person re-identification (Re-ID), which forms the core of video surveillance technology, implements image processing, computer vision, pattern recognition, machine learning and other related technologies to solve cross-camera and cross-scene pedestrian retrieval problems. Re-ID utilizes the spatiotemporal continuity of images to continuously track pedestrians across cameras. Recognition methods based on visual appearance features, such as carried items or clothing, are more reliable than those based on biometric information and can therefore be used more reliably in Re-ID [1,2,3,4]. With the popularity of video capture systems, video-based Re-ID also achieves robust performance. Many scholars have developed improved pedestrian re-recognition methods and achieved very good results. However, in cases involving different viewpoints, low image resolution, illumination changes, unconstrained pose changes and occlusion, the recognition effect is not ideal [5,6,7,8,9].
In recent years, deep learning, represented by convolutional neural networks (CNNs), has been successfully applied to the field of pedestrian re-recognition. CNNs are constructed by researchers with certain prior knowledge, and a large number of stacked convolutional kernels are used to extract regional features. At present, there are many efficient feature extraction CNNs, such as GoogLeNet [10,11,12,13], ResNet [14], and VGGNet [15]. Feature extraction methods that implement feature learning algorithms can obtain better pedestrian representations. In the field of Re-ID, most algorithms use the features of the last layer of the network to realize pedestrian recognition, which achieves good results but also has some defects [28]. Each CNN layer focuses on different information: low-level features, such as texture and shape, capture the shallow information of the object of interest, while deep-layer features focus more on semantic information. Therefore, using only the last-layer features weakens the recognition effect. To better verify this observation, Figure 1 depicts a visualization of the feature maps of different layers of ResNet50. Here, layer 1, layer 2, layer 3 and layer 4 represent the final output features of the corresponding layers of ResNet50. Each layer focuses on different significant features, but the detailed information of some local features, such as clothing color and shoes, is not sufficiently extracted. These local details can improve recognition accuracy, but deep neural networks have difficulty in selectively focusing on them.
Based on the above considerations, some works have obtained valuable detailed information by focusing the network on local areas while extracting global features. These methods can be summarized as follows. (1) With attention mechanism methods, partial alignment is achieved by enhancing the distinguishing areas and suppressing the background to reduce background interference. Many works learn attention using convolutional operations with small receptive fields on feature maps [23,24,25,26]. However, to intuitively determine whether a feature node is important, one should know the features of global scope, which facilitates the comparisons needed for decision-making. In addition, if the various features are linked arbitrarily, some discriminative features that do not show obvious strength will be masked by other salient features. (2) In stripe segmentation-based methods, the human image is segmented into fixed horizontal stripes, and the finer-grained local salient features of each stripe are studied [16,17,18,19]. Although these methods are effective, they have high requirements for image alignment. Moreover, the positioning ability of the model is poor, and the redundancy between features obtained from different regions is relatively high. (3) Methods based on automatic positioning attempt to locate body parts by learning a grid [20,21,22]. These methods rely on detection networks that require extra training, and when the localized regions contain considerable background noise, the complexity of the whole model increases.
To address the above deficiencies, in this paper we present a new multiscale reference-aided attentive feature aggregation (MS-RAFA) mechanism that enables the network to adaptively extract all potential salient pedestrian features. More specifically, we propose an autoselect module (ASM) to mine local and global information at different stages of the backbone network, which solves the problem of insufficient salient feature extraction. Then, we present a multilayer feature fusion module (MFFM), which better aggregates the low- and high-level features of the backbone. The MFFM uses an adaptive selection mechanism to select effective features of different stages from the multiscale features, which solves the problem of feature redundancy and models the object from a global scope. Additionally, a multiple supervision mechanism is used to supervise and guide the MFFM so that the network can obtain better identifying features. It is worth noting that an end-to-end training method is used and no additional training network is required, reducing the complexity of the model. The specific content is introduced in Section III.
To summarize, our proposed work makes the following contributions: • We introduce a novel multiscale reference-aided attentive feature aggregation mechanism (MS-RAFA) that can mine all potential salient features stage-by-stage and integrate these discriminative salience features with the global feature, forming the final diverse pedestrian feature representation.
• We devise an autoselect module (ASM), an attention mechanism placed on a backbone network that can optimize the backbone network features. This module extracts global and local appearance information compactly, stacks structural information and spatial relationships to form new features, and uses that information as input to the next stage.
• We incorporate a multilayer feature fusion module (MFFM) to extract and fuse low- and high-level features. The MFFM is a nonlinear dynamic selection mechanism that allows each neuron to adjust the size of its receptive field adaptively according to multiple input information scales.
• We propose a multiple supervision mechanism and use it to verify both the necessity of adding the MFFM and its effect on the final performance of the network.

II. RELATED WORK
Pedestrian re-recognition technology can be divided into five steps [27]: data collection, bounding box generation, training data annotation, model training and pedestrian retrieval. Due to the continuing improvements in computing power, many deep learning-based methods have been developed in recent years to solve pedestrian re-recognition tasks. This section will introduce the most representative works related to ours.

A. LOCAL FEATURE LEARNING
Varior R et al. [29] used a twin network to divide a pair of input images horizontally into several blocks. Then, several segmented image blocks were sent to a long short-term memory (LSTM) network in sequence [30], and the local features of all image blocks were fused to obtain the final feature representation. An image structure analysis method [31,32] was adopted to obtain the corresponding parts of features, such as head, chest, legs and shoes, and the color features of each part were extracted for matching. Zhang X et al. [33] designed a dynamic alignment network to automatically align image blocks from top to bottom. In terms of the spatial alignment of the human body, Zheng Z et al. [34] used spatial transformer networks (STNs) [35] to directly segment and align the original image. The STN was then also used to transform the shallow features extracted by CNNs to spatially align human body features. Reference [36] proposed a method of horizontal pyramid matching (HPM), which divides a pedestrian picture into 1, 2, 4 and 8 subparts horizontally. Reference [84] proposed a Multi-Granularity Network based on Local Context aware Correlation Feature (MGN_CACF) based on the ResNet50-IBN-a backbone, which is split into four branches.
Because local feature learning only obtains and combines information from different parts of the human body, the trained network has insufficient generalizability. However, the information from each part of the human body has strong semantic correlations, which help the network learn better representations. If only part of the feature map is learned, the correlation information between the parts is lost.

B. FINE-GRAINED INFORMATION LEARNING
One challenge in pedestrian recognition is distinguishing people with similar appearances. Reference [37] proposed a densely semantically aligned (DSA) model to map human body features to three-dimensional space. However, this method often requires both front and back pictures of the same person, which limits its applicability. Lin et al. [41] proposed a bilinear CNN model using two networks, VGG-D and VGG-M, as the joint benchmark network and achieved a good effect without using bounding box marking information. Reference [38] proposed an activation mapping method that judged the activation area by a loss function with an overlapping activation penalty to continuously expand the spatial perception range of the CNN. Reference [39] proposed an interaction-and-aggregation block (IA-Block), which can not only obtain pixel-level fine-grained information but also introduce channel information to obtain a more comprehensive feature representation. Reference [40] integrated attributes into features and proposed an attribute-driven feature separation and temporal aggregation method for pedestrian re-recognition. Reference [85] proposed leveraging the stability of person attributes to guide the learning of discriminative domain-invariant features (DIFs) and align attributes with the corresponding local visual features.
Because the classification networks of these methods have strong feature representation ability, they can achieve better results in conventional image classification. However, in the study of Re-ID, the difference in some pedestrians' appearances is actually very subtle, so the effect is not ideal. The common solution is to use the network weights pretrained on ImageNet as the initial weights and then finetune them on a fine-grained classification dataset to obtain the final classification network.

C. ATTENTION MECHANISM LEARNING
The essence of the attention mechanism [42] is to imitate the human visual signal processing mechanism, selectively observing part of the scene while ignoring other visible information. Li et al. [43] proposed a spatiotemporal attention model that uses multiple spatial attention models to ensure that each learns a different part of the body. Reference [44] used an attention diagram to determine whether an unnoticed area contains features that could provide a judgment basis, thereby obtaining complete human body features. Reference [45] proposed a pose-guided feature alignment (PGFA) method to obtain the features of the areas where body parts are connected. However, this method only focuses on unoccluded parts and fails to identify occluded parts adequately. Reference [46] proposed a spatiotemporal completion network (STCNet) to solve the problem of pedestrian lower-body occlusion: the spatial generator generates the frames that need to be completed, and then the temporal attention generator finds the adjacent key frames for completion. Reference [86] proposed a weighted aggregation strategy to impart a strong multiview reasoning ability to the imaginative reasoning module (IRM) and to classify and aggregate the single-view features of the same pedestrian.
Although the abovementioned attention mechanisms have achieved certain effects in pedestrian re-recognition, their main purpose is to extract the most significant features and suppress less obvious features. In this way, feature diversity is reduced, so the extracted features may be insufficient.

III. PROPOSED METHOD
We propose a new multiscale reference-aided attentive feature aggregation mechanism (MS-RAFA) that includes three main modules: the autoselect module (ASM), the multilayer feature fusion module (MFFM) and the multiple supervision module. The framework is shown in Figure 2. The ASM is an attention mechanism placed on the backbone network to address its insufficient feature extraction. The MFFM operates on the output features of each layer of the backbone network and selects useful information for fusion to obtain richer features. The multiple supervision module supervises the optimization direction of the MFFM and guides it to select effective pedestrian features for fusion.

A. AUTOSELECT MODULE
The goal of Re-ID is to recognize the same pedestrians across multiple cameras, but the differences among these people are often not sufficiently large; most pedestrians are distinguished by clothing color, height, belongings and other such information. The convolutional unit in a CNN focuses only on the neighborhood covered by the convolutional kernel at each step. Although the receptive field becomes increasingly larger in later layers, the unit still learns from information in a local region, thus neglecting the contribution of other global regions to the current region. Using only local information cannot capture the differences between objects well, and the correlations between the features of the various parts of the human body are ignored, limiting the achievable model performance. However, purely global learning extracts features from the global information of each pedestrian picture; these features have no spatial information and easily lose details, which is not conducive to pedestrian recognition.
Considering the above, this paper proposes an ASM that uses the channel attention mechanism to obtain local information. Then, global average pooling is used to obtain global information. The local and global information are superimposed to obtain more accurate pedestrian information. Through these operations, effective object features can be better selected. The specific ASM structure is shown in the red dotted box in Figure 2. The ASM is placed into the backbone branch to compensate for its insufficient feature extraction ability.
Given an intermediate feature tensor X ∈ R^(C×H×W) of width W, height H and channel number C from a CNN layer, the ASM performs a series of operations to obtain a new feature map F of size C×H×W. ψa, ψb, ψc, ψd and ψe are all 1×1 convolutions that can flexibly change the dimensions of the data: ψa changes the size to C×W×H, ψb and ψc change the size to C×H×W, ψd changes the size to 1×C×1, and ψe changes the size to 3C×1×1. The 1×1 convolution has different effects in different locations; its main purpose is to multiply or add features of the same input, increase similarity and extract features. At the module output, a sigmoid activation function is added to increase the nonlinear expression ability of the module and make the model fit the data better.
In Re-ID, global attention cannot be reliably computed using only local convolutional operations with small receptive fields. The main operation of the ASM is therefore to take the C-dimensional feature vector at each spatial position as a feature node, each of which uses the similarity relation function f = R(x, y) to obtain its similarity to the features at all other locations and form raster data.
Through raster scanning of the spatial positions, N feature nodes are expressed as the feature set V = { X_i ∈ R^C, i = 1, ..., N }. The similarity relation R_{i,j} between any two nodes i and j in V is defined as the dot-product similarity between the nodes:

R_{i,j} = α(ω(X))_i^T β(ω(X))_j,  (1)

where α and β are two embedding functions shared between feature nodes, and ω is an operation function for changing the size of the image. We first transform the input tensor X ∈ R^(C×H×W) into X1 ∈ R^(C×W×H), then apply a 1×1 convolution and a BN layer, and finally apply the ReLU activation function to transform X1. We obtain the affinity matrix R ∈ R^(N×N), which represents the pairwise relations of all nodes, and then transform these pairwise relations to obtain the global features. The local information of the C-dimensional feature vector at each spatial position is obtained by the ASM as:

X2 = φ_i(α(X)),  (2)

where α is a 1×1 convolution function and φ_i is an adaptive average pooling function. After this operation, the input tensor X ∈ R^(C×H×W) is reduced to X2 ∈ R^(C×1×1).
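The dot-product similarity of Equation (1) over raster-scanned feature nodes can be sketched as follows. This is a minimal PyTorch sketch; the embedding functions α and β are passed in as 1×1 convolutions, and all names are illustrative rather than the paper's exact implementation:

```python
import torch

def pairwise_relations(x, alpha, beta):
    """Dot-product similarities between all spatial feature nodes.

    x: feature tensor of shape (B, C, H, W); alpha/beta are shared 1x1-conv
    embedding functions. Returns an affinity matrix of shape (B, N, N),
    where N = H*W and R[i, j] = <alpha(x)_i, beta(x)_j>.
    """
    a = alpha(x).flatten(2)                      # (B, C', N): embedded nodes
    b = beta(x).flatten(2)                       # (B, C', N)
    return torch.einsum('bci,bcj->bij', a, b)    # (B, N, N) affinity matrix
```

Flattening the spatial grid into N = H×W nodes is the "raster scanning" step; the einsum then computes every pairwise dot product in one batched operation.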
The global scope contains rich structural and semantic information, while the local features contain the most significant information. Using the ASM, they can be stacked, and valuable knowledge can be mined to infer spatial attention. The spatial attention value G_i of the i-th feature node, obtained through the modeling function, is defined as:

G_i = Sigmoid(B(R(ψ_d([R_{i,:}, X2])))),  (3)

where B is the BN function, R represents the ReLU operation, and [·, ·] denotes stacking the global relation vector of node i with the pooled local information. The sigmoid function restricts the output to values between 0 and 1. As a result, each feature in the spatial structure carries a different weight, and the features that deserve more attention receive larger probabilities, which is conducive to feature extraction and model learning.
To learn the attention of the i-th feature node, in addition to the pairwise relation terms R_{i,j}, the feature X_i itself is added as input. Using the global scope structure information together with the local original information related to this feature, the final output F can be expressed as:

F_i = G_i ⊙ X_i,  (4)

where ⊙ denotes elementwise multiplication. Experimental results show that the proposed ASM can effectively extract the global and local hidden features. Moreover, it can be used in different neural network frameworks. Most importantly, it improves Re-ID accuracy.
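An ASM-style block can be sketched in PyTorch as follows. The pairwise relations are summarized per node, stacked with the globally pooled local branch, and passed through a small BN/ReLU/sigmoid head to produce the spatial gate. The layer widths, the mean-based descriptor summaries and the gating head are our own simplifications for illustration, not the paper's exact ψa-ψe design:

```python
import torch
import torch.nn as nn

class AutoSelectModule(nn.Module):
    """Sketch of an ASM-style attention block: global pairwise relations are
    stacked with pooled local information to predict a per-position gate in
    (0, 1), which rescales the input feature map (output size equals input)."""

    def __init__(self, channels, inter=32):
        super().__init__()
        self.alpha = nn.Conv2d(channels, inter, 1)   # shared embedding functions
        self.beta = nn.Conv2d(channels, inter, 1)
        self.pool = nn.AdaptiveAvgPool2d(1)          # global average pooling branch
        self.gate = nn.Sequential(                   # BN -> ReLU -> sigmoid head
            nn.Conv1d(2, 8, 1), nn.BatchNorm1d(8), nn.ReLU(),
            nn.Conv1d(8, 1, 1), nn.Sigmoid())

    def forward(self, x):
        b, c, h, w = x.shape
        n = h * w
        a = self.alpha(x).flatten(2)                 # (B, I, N)
        q = self.beta(x).flatten(2)                  # (B, I, N)
        rel = torch.einsum('bci,bcj->bij', a, q)     # (B, N, N) pairwise relations
        rel_desc = rel.mean(dim=2)                   # (B, N) per-node global summary
        loc = self.pool(x).flatten(1).mean(dim=1)    # (B,) pooled local summary
        loc_desc = loc[:, None].expand(b, n)         # broadcast to every node
        feats = torch.stack([rel_desc, loc_desc], dim=1)   # (B, 2, N) stacked input
        g = self.gate(feats)                         # (B, 1, N) attention in (0, 1)
        return x * g.view(b, 1, h, w)                # gated features, same size as x
```

Because the gate is bounded by the sigmoid, the block can only rescale (never amplify beyond the original) each spatial position, which matches the role of Eq. (4) as a soft selection over the input feature map.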

B. MULTILAYER FEATURE FUSION MODULE
The fusion and analysis of features at different levels [47] can help in semantic segmentation, classification and detection. Common fusion operations are performed at the pixel level, such as addition or concatenation, but their performance gains are limited and they lack semantic information. To aggregate features of different scales from different branches and retain them in the final representation, inspired by long-range dependency mechanisms for fusing multilayer features [48], we add an MFFM. As shown in Figure 2, the structural design of the MFFM is derived from SKNet [49], and the specific structure diagram is shown in Figure 3.
The input of the MFFM is the output feature of the ASM at each stage of the backbone network. In a CNN, the bottom layers contain shallow information, such as position and shape, while the top layers contain deep semantic information. To obtain effective pedestrian information at different stages, we propose the MFFM to fuse these different depths of information. The MFFM is a nonlinear module that dynamically selects features: it accepts inputs of multiple different sizes, allows each neuron to adjust the size of its receptive field according to the information scale, and then outputs features of uniform size.
As shown in Figure 2, the backbone network is divided into five layers, each with different sizes and information. Except for the first layer, the other four layers are each followed by an ASM. The features that pass through the ASMs are the sources of the multilayer features that enter the MFFM. In Figure 3, P2-P5 represent the multilayer feature inputs, in which the lower-level feature maps are larger and contain more detailed information, while the higher-level feature maps are smaller but cover larger objects. Previous studies found that as the object grows larger, most neurons gather more information from the larger kernel pathway. This suggests that the nonlinear dynamic selection mechanism in the MFFM, which enables neurons to adaptively adjust their receptive field sizes, offers superior performance in object recognition. P2-P5 are convolved to change the number of channels and the size of the feature maps so that the output features, denoted as C_i (i = 2, ..., 5), are uniform in size. For a given input feature tensor P_i ∈ R^(C×H×W) with width W, height H and C channels:

C_i = W2(W1(P_i)),  (5)

where W1 and W2 are implemented by convolution followed by BN. The final size of C_i is the same as that of P4; therefore, the feature map tensor C_i ∈ R^(C×H×W) is obtained by taking the size of P4 as the standard. After this unifying transformation, to highlight the significant features, we add the C_i elementwise to form a new aggregation feature L, which combines low- and high-level features:

L = Σ_{i=2}^{5} C_i.  (6)

Then, FC1 applies global average pooling to L, and the generated channel statistics embed the global information. Here, the input is L ∈ R^(C×H×W), and the output is FC1 ∈ R^(C×1×1). Specifically, the c-th element of FC1 is calculated by contracting L over the spatial dimensions:

FC1_c = (1/(H×W)) Σ_{i=1}^{H} Σ_{j=1}^{W} L_c(i, j).  (7)

Furthermore, a compact feature FC2 ∈ R^(d×1) is created by shrinking FC1 to achieve precise and adaptive selection guidance.
This is achieved through a simple convolutional layer, which improves efficiency by reducing the dimension:

d = max(C/r, m),  (8)

where m represents the minimum value of d (m = 256 is fixed in our experiments) and r is set to 4; the feature vector is then expanded back to the size of the feature before attenuation.
The FC1 and FC2 operators combine and aggregate the information from the multiple paths to obtain a global and comprehensive representation for the selection weights. Next, the weight FC_i of each branch is obtained through a softmax operation (in which the importance is expressed by the probability of each channel). Then, FC_i is multiplied with the original feature C_i to obtain the feature weight of each channel. This is equivalent to a select operation: it aggregates the feature maps of differently sized kernels according to the selection weights. Finally, the weighted feature maps are added elementwise, and the final weighted feature V is obtained through the multilayer feature fusion module as:

V = Σ_{i=2}^{5} FC_i ⊗ C_i,  (9)

where ⊗ denotes elementwise multiplication. The size of the feature map V is the same as that of the output feature map of the last layer of the backbone network. Through this soft attention method, adaptive kernel selection obtains the potentially significant features among the global features, increasing feature diversity and improving object recognition efficiency and effectiveness.
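The fuse-and-select procedure described above can be sketched in PyTorch in the style of SKNet. The branch inputs are assumed to be already resized to a common (C, H, W) shape, the reduction follows d = max(C/r, m) from Eq. (8), and the default m is reduced from the paper's 256 purely for illustration:

```python
import torch
import torch.nn as nn

class MFFM(nn.Module):
    """Sketch of an SKNet-style multilayer feature fusion module: sum the
    branches, squeeze to channel statistics (GAP), reduce to a compact
    descriptor, then softmax-select per-channel weights across branches."""

    def __init__(self, channels, branches=4, r=4, m=32):
        super().__init__()
        d = max(channels // r, m)                          # Eq. (8): d = max(C/r, m)
        self.fc1 = nn.AdaptiveAvgPool2d(1)                 # channel statistics (GAP)
        self.fc2 = nn.Sequential(nn.Conv2d(channels, d, 1),
                                 nn.BatchNorm2d(d), nn.ReLU())
        # one weight head per branch; softmax across branches selects per channel
        self.heads = nn.ModuleList(nn.Conv2d(d, channels, 1) for _ in range(branches))

    def forward(self, feats):                              # feats: list of (B, C, H, W)
        u = torch.stack(feats, dim=0).sum(dim=0)           # elementwise aggregation L
        z = self.fc2(self.fc1(u))                          # compact descriptor (B, d, 1, 1)
        weights = torch.stack([h(z) for h in self.heads])  # (K, B, C, 1, 1)
        weights = torch.softmax(weights, dim=0)            # selection across branches
        return sum(w * f for w, f in zip(weights, feats))  # weighted elementwise sum V
```

The softmax over the branch axis is what makes this a "select" rather than a plain sum: for each channel, the branches compete for a share of the output, so a large object can draw most of its weight from the deeper, larger-kernel pathway.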

C. MULTIPLE SUPERVISION MECHANISM
As shown in Figure 2, to increase information diversity and improve information interactivity, the feature map tensor V ∈ R^(C×H×W) output from the MFFM and the feature tensor x ∈ R^(C×H×W) output from the last ASM of the backbone network are concatenated along the channel dimension to form a new feature, Xcat ∈ R^(2C×H×W), with twice the number of channels. A subsequent 1×1 convolution reduces the number of channels from 2C back to C. Previously, a fully connected layer FC had been used for information classification, but its computational complexity is high. Therefore, we replace it with global average pooling, which takes the internal average of each feature map, turning each feature map into a single value.
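The steps above can be sketched as a small fusion head; the module name and layer arrangement are illustrative, not the paper's exact code:

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Sketch of the final fusion step: concatenate the MFFM output with the
    last backbone/ASM feature along channels, reduce 2C -> C with a 1x1
    convolution, then pool each map to a single value with GAP."""

    def __init__(self, channels):
        super().__init__()
        self.reduce = nn.Conv2d(2 * channels, channels, kernel_size=1)
        self.gap = nn.AdaptiveAvgPool2d(1)

    def forward(self, v, x):
        xcat = torch.cat([v, x], dim=1)      # (B, 2C, H, W) concatenated feature
        f = self.reduce(xcat)                # (B, C, H, W) channel reduction
        return self.gap(f).flatten(1)        # (B, C) pooled descriptor
```

Replacing a fully connected classifier input with GAP keeps the descriptor size at C regardless of the spatial resolution, which is where the computational saving comes from.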
The MFFM aggregates multiscale pedestrian features to obtain more robust information. However, the feature information involved in this module varies considerably, and naively aggregating these features will not achieve good recognition performance. Therefore, a multiple supervision mechanism that uses real pedestrian identity information to supervise the MFFM and adaptively select effective information is designed. As shown in Table 8, an ablation experiment demonstrates that this supervision is necessary. After the features supervised through the MFFM are obtained, they are merged with the features of the last layer of the backbone network, and then another round of supervision is applied.
We add two loss functions to supervise the final output of the network. The identification loss function operates on the predicted logit values of the image, similar to a classification loss, and is defined as:

L_id = Σ_{i=1}^{N} −q_i log(P_i),  with q_i = 1 − ε(N−1)/N if i = y, otherwise q_i = ε/N,  (10)

where y and P_i represent the real ID tag and the predicted logit value of the classification, respectively, N represents the number of classes, q_i is the smoothed label and ε = 0.1 [83]. Considering the goal of Re-ID, which is to find the most similar sequences of people from a gallery of images, the idea of metric learning is introduced to enable the network to find useful features for similarity measurement. Therefore, triplet loss is adopted to improve the final ranking performance; it is defined as:

L_tri = (1/N) Σ_{j=1}^{N} [d_pos − d_neg + α]_+,  (11)

where d_pos is the feature distance between samples of the same identity, d_neg is the distance between different identities, α is the margin, N is the batch size of the triplet samples, and [•]_+ represents max(•, 0). The purpose of the triplet loss is to ensure that the distance between positive sample pairs is smaller than the distance between negative sample pairs; here, it ensures that similar features stay close to the positive sample. Note that distances are measured with the Euclidean distance in this paper.

TABLE 1. Details of the datasets.
               CUHK03   Market1501   DukeMTMC-ReID
Training IDs     767       751            702
Query IDs        700       750            702
Gallery IDs      700       751           1110
Cameras            2         6              8
Images         28192     32668          36411

In this paper, the model undergoes supervised learning twice: first for the features obtained by the MFFM and second for the fused features of the last layer of the backbone network. The final total loss of the model is the sum of the two losses and can be written as:

L_total = Σ_{i=1}^{2} (L_id^i + L_tri^i),  (12)

where i indicates the i-th supervised learning. Finally, the supervision mechanism is used to supervise the results and verify the effectiveness of the proposed method.
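The two losses can be sketched directly in PyTorch. The label-smoothing scheme follows the standard formulation with ε = 0.1; the margin value 0.3 is a common default and an assumption here, since the paper does not state it:

```python
import torch
import torch.nn.functional as F

def id_loss(logits, labels, eps=0.1):
    """Label-smoothed identification loss: the true class gets probability
    1 - eps*(N-1)/N and every other class eps/N, then cross-entropy is taken."""
    n = logits.size(1)
    logp = F.log_softmax(logits, dim=1)
    q = torch.full_like(logp, eps / n)                        # smoothed labels
    q.scatter_(1, labels.unsqueeze(1), 1.0 - eps * (n - 1) / n)
    return -(q * logp).sum(dim=1).mean()

def triplet_loss(d_pos, d_neg, margin=0.3):
    """Triplet loss with margin: [d_pos - d_neg + margin]_+ averaged over the
    batch, pulling positives closer than negatives by at least the margin."""
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()
```

In training, the total loss would simply sum the two terms over both supervised stages, matching the twice-supervised scheme described above.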

IV. EXPERIMENTS

A. EXPERIMENTAL DETAILS
We use ResNet50 as our backbone network. The total batch size is set to 64 and is shared across two GPUs; the batch size on each GPU is automatically set to an equal share of the total.
We use common data augmentation strategies: random cropping [50], horizontal flipping and random erasing [51]. The inputs of all datasets are resized to a uniform size of 256×128, and the backbone network is pretrained on ImageNet [52]. Using the Adam optimizer, all models are trained for a total of 600 epochs; the recording of parameters starts from epoch 320, and a new checkpoint file is recorded every 40 epochs. The learning rate is 8×10^-4, and the weight decay is 5×10^-4.
The experiments are performed on three public re-identification datasets: CUHK03 [53], Market1501 [54] and DukeMTMC-ReID (a subset of the DukeMTMC [55] dataset). The details of the datasets are shown in Table 1. To compare the performance of our method with that of existing Re-ID methods, we use the rank indices of the cumulative matching characteristics (CMC) and the mean average precision (mAP) as the evaluation indices for each queried image.
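For reference, CMC and mAP can be computed from a query-gallery distance matrix as follows. This is a simplified sketch: it omits the same-camera filtering used in the standard Market1501/Duke protocols and divides the CMC by the total number of queries:

```python
import numpy as np

def cmc_map(dist, q_ids, g_ids):
    """Compute the CMC curve and mAP from a (n_query, n_gallery) distance
    matrix and the corresponding identity labels. Simplified: no camera-based
    filtering of gallery entries."""
    n_q, n_g = dist.shape
    order = np.argsort(dist, axis=1)              # gallery ranked per query
    matches = g_ids[order] == q_ids[:, None]      # (n_q, n_g) boolean hits
    cmc = np.zeros(n_g)
    aps = []
    for m in matches:
        pos = np.flatnonzero(m)                   # ranks of the true matches
        if pos.size == 0:
            continue                              # query has no gallery match
        cmc[pos[0]:] += 1                         # rank-k hit for k >= first match
        hits = np.cumsum(m)
        precision = hits[pos] / (pos + 1)         # precision at each hit rank
        aps.append(precision.mean())              # average precision per query
    return cmc / n_q, float(np.mean(aps))
```

Here `cmc[k-1]` is the Rank-k accuracy reported in the result tables, and the mean of the per-query average precisions is the mAP.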

B. EXPERIMENTAL RESULTS
In this section, the final experimental results of the proposed method on the different datasets are highlighted and compared with those of other methods. Table 2 shows the results of different methods on the Market1501 dataset. Note that all methods listed in the table are based on the ResNet50 backbone network. i) The methods come from different journals and were published in different years; the oldest papers were published in 2018 and the latest in 2021. ii) Our experimental results are superior to those of the other methods in terms of both mAP and Rank-1 accuracy. Our mAP value is 89.1%, and our Rank-1 accuracy is 95.8%; the mAP value is 5.4% higher than the baseline, and the Rank-1 value is 1.6% higher. CUHK03 is a more challenging dataset than Market1501 and DukeMTMC-ReID because i) CUHK03 has fewer samples and contains serious viewpoint variations and occlusion problems, and ii) the bounding boxes marked by the object detection algorithm have location offsets. All methods listed in Table 3 are also based on the ResNet50 backbone network. The experimental results of our method, in both mAP and Rank-1, are significantly better than those of the other methods. The mAP and Rank-1 values for the labeled category are 79.6% and 83.9%, which are 10.6% and 10.1% higher than the baseline, respectively. The mAP and Rank-1 values for the detected category are 77.2% and 82.2%, respectively, both 11.7% higher than the baseline. Table 4 shows the results of the different methods on the DukeMTMC-ReID dataset. To further verify the effectiveness of our method, we select methods with different backbone networks, for example, ResNet101, DenseNet121 and SEResNet101. As on the other two datasets, the proposed MS-RAFA also achieves the best results in terms of Rank-1 accuracy and mAP; our results exceed those of MGN_CACF [84] by 0.4%.
Through these comparisons, it can be seen that the method proposed in this paper is the most effective. The experimental results clearly demonstrate that mining potential features and integrating these complementary features leads to great performance advantages.

C. ABLATION EXPERIMENT
To demonstrate the contribution of each part of our MS-RAFA, we conduct an incremental evaluation of its modules on the Market1501 dataset. The dataset consists of complex scenes and contains much information, so the experimental results obtained when testing our model are more convincing. We again use ResNet50 as the backbone network. In Figure 2, the MFFM fuses the outputs of the four layers that contain the ASM module. After that, the feature V output from the MFFM is fused with the feature output by the backbone network (which we simply denote as x). Finally, the model is supervised by the ID and triplet losses twice. It is worth noting that the final total loss is the sum of the two supervised losses.
To better verify the effect of each module on the overall network, we conduct experiments on each module, and the results are shown in Table 5. For the three modules, the result of removing any one module is lower than the result of combining them all, which shows that the three modules are complementary. After implementing the ASM, the mAP value improves by 1.4%, which shows that the module extracts more effective target features by identifying local and global features. After implementing the MFFM, the mAP value improves by 0.9%, because the MFFM can make reasonable use of both high- and low-level features. The two proposed modules complement each other and extract more hidden features. After using the multiple supervision module, the mAP value improves by 0.3%, indicating that multiple supervision helps supervise and guide the MFFM to obtain better recognition features. These results reasonably demonstrate the validity of our three proposed modules.
The feature map output by the first layer of the ResNet50 model (a 7×7 convolution) is very large, contains very complex information, and involves a large number of parameters and computations. Therefore, we do not add the ASM after the first layer, as shown in the red dotted box in Figure 2; instead, ASM modules are added after the last four layers (i.e., layer2, layer3, layer4 and layer5). Table 6 shows the results of ablation experiments on the ASM structure, which verify its effect on model performance. For learning attention, the scheme that does not take the proposed global relation (Rel.) as part of the input is inferior to our full ASM scheme by 1.6% in mAP. Likewise, the scheme that does not take the feature itself (Ori.) as part of the input is inferior to our ASM scheme by 1.0% in mAP. Combining the two inputs achieves the best performance. This suggests that the main improvement comes from the new design for learning attentional relations, in which modeling global relations provides better performance.
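The Ori.+Rel. input design can be illustrated with a minimal sketch, assuming a simple formulation in which each spatial position is described by its reduced feature (Ori.) concatenated with its pairwise-affinity vector to all positions (Rel.); `ASMSketch` and its internals are our own hypothetical simplification, not the paper's exact module.

```python
import torch
import torch.nn as nn

class ASMSketch(nn.Module):
    """Illustrative attention: score each position from [Ori. ; Rel.]."""
    def __init__(self, channels, spatial, reduction=8):
        super().__init__()
        c_r = channels // reduction
        self.embed = nn.Conv2d(channels, c_r, 1)       # reduced feature (Ori.)
        self.score = nn.Conv1d(c_r + spatial, 1, 1)    # consumes [Ori. ; Rel.]

    def forward(self, x):
        b, c, h, w = x.shape
        n = h * w
        f = self.embed(x).flatten(2)                   # (b, c_r, n)  Ori.
        rel = torch.bmm(f.transpose(1, 2), f)          # (b, n, n)    Rel. (global affinities)
        joint = torch.cat([f, rel.transpose(1, 2)], dim=1)  # (b, c_r + n, n)
        a = torch.sigmoid(self.score(joint)).view(b, 1, h, w)
        return x * a                                   # attention-reweighted feature
```

Dropping either the `f` term or the `rel` term from `joint` corresponds to the w/o Ori. and w/o Rel. ablation variants in Table 6.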
To further verify the fusion effectiveness of the MFFM, we run experiments with internal detail changes; the results are shown in Table 7. The MFFM fuses the outputs of the four layers that contain the ASM. Here, we remove some of the fusion branches to verify the effectiveness of fusing low- and high-level features. The terms w/o layer5, w/o layer54, w/o layer543, and w/o layer5432 indicate removal of the corresponding fused layers in turn; for example, w/o layer54 indicates that the outputs of the fifth- and fourth-layer branches do not enter the MFFM for fusion, and w/o layer5432 is equivalent to using only the ASM in the model. As Table 7 shows, as the number of fusion layers decreases, the performance of the model gradually degrades: the mAP value decreases by 2.1%, and the Rank-1 value decreases by 0.6%, which indicates that the information inside the MFFM is complementary. Specifically, the MFFM acquires features of different levels and integrates them into a new feature. This is effective because the new feature has a positive effect on the extraction of global features and establishes complementary information between different layers, yielding more implicit features.
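The multiscale fusion being ablated here can be sketched as projecting each layer's output to a common width, pooling, and concatenating; `MFFMSketch` is a minimal illustrative stand-in under these assumptions, not the paper's actual fusion design.

```python
import torch
import torch.nn as nn

class MFFMSketch(nn.Module):
    """Illustrative multiscale fusion of layer2..layer5 outputs into one feature V."""
    def __init__(self, in_channels, out_dim=256):
        super().__init__()
        # one 1x1 projection per incoming layer, mapping to a shared width
        self.projs = nn.ModuleList(
            nn.Sequential(nn.Conv2d(c, out_dim, 1), nn.BatchNorm2d(out_dim), nn.ReLU())
            for c in in_channels
        )
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, feats):
        # feats: list of (B, C_i, H_i, W_i); removing list entries mimics
        # the w/o layer5 / layer54 / ... ablation variants of Table 7
        v = [self.pool(p(f)).flatten(1) for p, f in zip(self.projs, feats)]
        return torch.cat(v, dim=1)  # fused feature V: (B, out_dim * len(feats))
```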
To verify the fusion of the V and x features, as well as the effectiveness of applying ID and triplet loss supervision twice, we set up a comparison experiment, shown in Table 8, where w/o VLoss and w/o xLoss denote removal of the corresponding loss, xvCat denotes the fusion operation on the output features of the MFFM and the features of the last layer of the backbone network, and both refers to the complete multiple supervision mechanism. After adding VLoss, the mAP value improves by 1.8%, and after adding xLoss, the mAP value improves by 1.4%. Therefore, adopting both branches is most effective for loss supervision and information fusion; the mAP and Rank-1 values in this setting are the best in the comparative experiment.
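The dual ID + triplet supervision compared in Table 8 can be sketched as below; the function name, the batch-hard triplet simplification, and the margin value are our own illustrative assumptions, not the paper's exact losses.

```python
import torch
import torch.nn as nn

def multi_supervision_loss(v, x, labels, cls_v, cls_x, margin=0.3):
    """Illustrative total loss: (ID + triplet) on V plus (ID + triplet) on x."""
    ce = nn.CrossEntropyLoss()

    def triplet(feat):
        # simplified batch-hard triplet: hardest positive/negative per anchor
        d = torch.cdist(feat, feat)                         # pairwise distances
        same = labels.unsqueeze(0) == labels.unsqueeze(1)   # same-identity mask
        pos = (d + (~same).float() * -1e9).max(1).values    # hardest positive
        neg = (d + same.float() * 1e9).min(1).values        # hardest negative
        return torch.relu(pos - neg + margin).mean()

    v_loss = ce(cls_v(v), labels) + triplet(v)   # VLoss branch
    x_loss = ce(cls_x(x), labels) + triplet(x)   # xLoss branch
    return v_loss + x_loss                       # sum of the two supervisions
```

Dropping `v_loss` or `x_loss` from the sum corresponds to the w/o VLoss and w/o xLoss rows of Table 8.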

D. VISUALIZATION OF ATTENTION
Similar to RGA [56], we apply the Grad-CAM (gradient-weighted class activation mapping) [82] tool to both the baseline model and our model for qualitative analysis. Grad-CAM identifies the areas that the network deems important; a comparison between the models is shown in Figure 4. The Grad-CAM masks of our proposed model cover the human area better than those of the baseline model, allowing the network to focus on larger regions spanning different parts of the body. Compared with RGA [56], which uses spatial and channel attention mechanisms, our method more clearly reflects the extraction of more salient and implicit features. This results from the aggregation and mining of multiscale attention features from global-scale structural information.
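For reference, Grad-CAM itself reduces to weighting a target layer's activations by the spatially pooled gradients of the class score. A minimal hook-based sketch (our own simplification of the standard technique, not the tooling used in the paper) looks like this:

```python
import torch

def grad_cam(model, layer, x, class_idx):
    """Minimal Grad-CAM: activations of `layer` weighted by pooled gradients
    of the class-`class_idx` score, ReLU'd and normalized to [0, 1]."""
    store = {}

    def hook(module, inputs, output):
        output.retain_grad()      # keep the gradient on this non-leaf tensor
        store['a'] = output

    h = layer.register_forward_hook(hook)
    score = model(x)[0, class_idx]   # scalar class score for sample 0
    score.backward()
    h.remove()

    a, g = store['a'], store['a'].grad
    w = g.mean(dim=(2, 3), keepdim=True)      # channel weights: pooled gradients
    cam = torch.relu((w * a).sum(dim=1))      # weighted activation map
    return (cam / (cam.max() + 1e-8)).detach()
```

Upsampling the returned map to the input resolution and overlaying it on the image produces the masks shown in Figure 4.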

V. CONCLUSIONS
For the re-identification of pedestrians, a new multiscale reference-aided attentive feature aggregation (MS-RAFA) mechanism was proposed to learn more distinct features. First, to extract the most significant local and global features and strengthen the correlations between the features of various parts of the human body, an attention mechanism called the autoselect module (ASM) was designed. Then, to extract and fuse low- and high-level features, a multilayer feature fusion module (MFFM) was proposed. The MFFM is an independent branch that fuses the output of each backbone layer, which enables the model to extract hidden features concealed by salient features and to learn them better. Finally, a multiple supervision mechanism was applied to guide the network toward better recognition features and improve model performance. Extensive ablation studies demonstrated the high efficiency and state-of-the-art performance of our design.