Introduction
Person re-identification aims to match a specific person captured by non-overlapping cameras, or across time by the same camera. In many surveillance applications, such as cross-camera tracking [1] and multi-person association [2], person re-identification serves as a fundamental technique and is generally treated as an image retrieval problem. Despite great progress in recent years, person re-identification remains an open research challenge: due to large appearance variations arising from viewpoint changes, varying illumination, occlusion, and complex backgrounds, matching cross-view image pairs is rather difficult.
Extracting discriminative features that fully characterize the query person while distinguishing it from others is of vital importance for any person re-identification system. Owing to their remarkable ability to learn discriminative features, solutions based on Convolutional Neural Networks (CNNs) have become the mainstream for person re-identification [3], [4]. In practice, because global features are prone to ignore the information of small regions, it has become a trend to fuse global features with part-based local features [5], [6]. These local features are generally learned from multi-branch architectures with supervision, and they help re-identification models focus on fine-grained details in each individual local part. Thus, higher performance can be achieved compared to using global features alone [7], [8].
To further enhance the discrimination of feature representations, the visual attention mechanism has also been introduced into person re-identification [9], [10], [11], [12]. By endowing more distinguishable patterns with higher weights, the attention mechanism equips networks with the ability to emphasize more informative regions, while irrelevant background interference is suppressed. Therefore, representations strengthened by attention can better represent pedestrian images and provide more distinguishable information.
Despite the observed effectiveness of adopting local features and visual attention, most existing person re-identification approaches have two shortcomings. First, in existing branching networks the main-level branches rarely communicate with each other, leaving room for improvement in the ability to find potential clues. Second, the widely used branching architecture usually incurs high computational cost while boosting performance. This is especially true in works like [6] and [7], where several convolutional blocks are duplicated, or [10] and [11], where heavy matrix multiplications are executed for attention, so model complexity may increase greatly.
In this paper, we propose to address the above problems by hierarchically aggregating features based on the Omni-Scale Network (OSNet) [13]. Technically, we first introduce a hierarchical feature aggregation strategy to progressively combine multi-scale features: the pre-stage feature map is not only fed into the next stage of the current branch, but also aggregated into the parallel branch. In this way, semantic and detail information from different stages and different branches is aggregated. During aggregation, a Feature Suppression Operation (FSO) is applied to partially erase feature maps with the aim of mining more diversified features. Intuitively, the erased regions generally correspond to the areas where the network has strong activations, so other potential clues can stand out in the next branch. As a result, the branches are forced to work together, and salient features can be extracted in a branch-by-branch manner. Besides, we design a novel lightweight attention module to guide feature learning; compared to other typical attention modules, its number of parameters and computational complexity are significantly reduced. To better leverage the multi-branch structure, the final feature maps in each branch are processed via different pooling strategies to obtain global, multi-granularity part-based, and channel-based features.
We name our model Hierarchical Attentive Feature Aggregation (Hi-AFA). By taking advantage of the lightweight OSNet [13] architecture as the backbone, the number of parameters is kept low. We note that Hi-AFA is not restricted to OSNet; other lightweight architectures can also be employed as the backbone.
The main contributions of our work can be summarized as follows:
We design a novel hierarchical feature aggregation framework (Hi-AFA), which aims to generate more discriminative features by combining the features of different levels and branches. By partially erasing feature maps via Feature Suppression Operation (FSO), the branches can cooperate to mine richer and more diversified features.
We design a Lightweight Dual Attention Module (LDAM), which contains two complementary parts: a Spatial Attention Module (SAM) and a Channel Attention Module (CAM). Due to the adoption of group convolution, it has far fewer parameters than existing attention modules, and its computational cost is quite low.
We integrate Hi-AFA and LDAM into OSNet, forming a resource-economical and effective multi-branch network, from whose branches diverse features are computed for person re-identification. We conduct extensive experiments on four public person re-identification datasets. The proposed method achieves better or comparable performance relative to a broad range of existing models, while keeping much lower model complexity.
The rest of this work is organized as follows. Section II briefly reviews related works. Section III elaborates the structure of Hi-AFA and LDAM. Section IV presents the experimental evaluations and discussions. Finally, Section V concludes the paper.
Related Work
Person re-identification is one of the most active research areas in computer vision, and a large number of solutions have been reported [14], [15]. In this section, we briefly review closely related works on local feature learning, attention mechanisms, and feature aggregation.
A. Local Feature Learning for Person Re-Identification
Person re-identification is no exception to the prevailing success of deep learning. Earlier deep learning approaches, such as [3], [16], [17], and [18], naively applied CNN backbones to extract global features. Because global features are prone to ignore local information from small regions [19], more and more works focus on learning local features.
To obtain local features, the works in [20] and [21] first partitioned pedestrian images according to predefined rules, and then computed local features from each sub-image separately. This approach is easy to implement, but the predefined partitions are often not ideally aligned with human body parts. Instead of using a rough partition strategy, some methods extracted body part features via external clues like pose estimation and human part parsing. In [22], Zhang et al. constructed densely semantically aligned part images to assist feature learning. Rao et al. [23] learned multi-scale skeleton representations. However, these methods need to detect key points or perform semantic parsing with additional models, so extra computation cost is inevitable [24].
Recently, splitting feature maps into a set of spatial parts has become mainstream [4], [6], [7], [25], [26]. Generally, feature maps are first obtained by multi-branch deep architectures, and multi-granularity features are then acquired by pooling with different sizes. Part-based Convolutional Baseline (PCB) [4] is a typical representative of this type, splitting the last feature map into horizontal stripes of equal size. Multiple Granularity Network (MGN) [7] improved PCB by adding a global branch to utilize global features. Pyramid [6] learned multi-granularity features by dividing the final feature map into a pyramidal partition set. Although impressive performance is achieved, the branches in these works mainly operate separately, which limits their capability of mining diverse features; in Hi-AFA this is addressed by the aggregation structure together with the feature suppression operation.
B. Attention Mechanism in Person Re-Identification
The attention mechanism has also been introduced to person re-identification after its success in other computer vision tasks like visual question answering [27] and scene segmentation [28]. As attention can guide a model to focus on informative features while suppressing irrelevant ones, it matches the goal of handling challenges in person re-identification well.
Directly incorporating a separate stream of spatial attention into deep networks is a common strategy for feature enhancement [29]. Li et al. [9] proposed a multi-granularity attention selection mechanism to better select regions of interest. Si et al. [29] captured spatial dependencies among different pedestrian images with a correlation attention module. Chen et al. [30] learned attention with counterfactual causality, which measures attention quality and provides a supervisory signal to guide the learning process. Xun et al. [31] designed a local attention guided network to extract approximate semantic local features of human body parts. To better model long-range dependencies, second-order non-local attentions are computed in [8] and [11], although their computational cost is relatively high.
Channel-wise attention [32] has also been introduced to explore the correlations among different channels, and combining spatial and channel attention can further enhance feature representations [10], [33]. Zhang et al. [34] captured global structural information for better attention learning by mining pairwise correlations among feature positions and channels. Chen et al. [10] applied orthogonal regularization to enforce diversity on attention maps. In [35], an attention-guided mask module was proposed to address the occlusion problem. In [36], holistic and partial attentions are jointly learned to increase feature robustness against pose variations.
C. Feature Aggregation for Person Re-Identification
Feature aggregation is a common strategy to make full use of features. In deep architectures like ResNet [37] and DenseNet [38], feature aggregation plays a vital role in relieving the vanishing gradient problem and easing optimization. In person re-identification, a number of solutions with feature aggregation have been reported [12], [39], [40], [41].
Chen et al. [12] employed a salience suppression strategy to mine diverse visual clues at different stages. Xu et al. [42] aggregated the predictions of multiple networks to mimic a multi-expert decision process. Fu et al. [43] designed an iterative impression aggregation module to update features for similarity computation. Hou et al. [44] proposed to enhance feature representations by selectively aggregating correlated spatial and channel features. The typical two-stream network is employed to fuse features extracted from different spaces in [45] and [46]. Based on the Vision Transformer (ViT), with its impressive capability of exploiting structural patterns, Zhang et al. [47] proposed a hierarchical and iterative structure to refine and aggregate multi-level features. Wang et al. [48] proposed a neighbor transformer network to model interactions across all input images. However, one shortcoming of ViT-based methods is that they are data-hungry [49].
The proposed Hi-AFA learns local features via a multi-branch architecture that splits feature maps into horizontal parts. To guide feature learning, both spatial and channel-wise attentions are included to build a lightweight dual attention module. Due to its branching architecture, Hi-AFA might resemble PyConv [50], FractalNet [51], CliqueNet [52], and BranchyNet [53] at first glance. However, the branches of FractalNet are trained alternately, which implies the sub-paths still work separately in essence; the parameters of CliqueNet [52] are recurrently updated many times, making the computational cost too high; and BranchyNet and PyConv have no aggregation to utilize features of different stages. Hi-AFA is also related to [41] and [47], which share the idea of aggregating intermediate features, but there are notable differences: (1) a Feature Suppression Operation (FSO) is applied to partially erase feature maps, thereby allowing the network to discover diverse visual clues; (2) the attentive features at intermediate stages are aggregated along both the depth and the parallel branches; and (3) multi-granularity part-based and channel-based features are extracted from the branches for better utilization.
Methodology
Let
The architecture of Hierarchical Attentive Feature Aggregation (Hi-AFA) model. The OSNet is used as the backbone, and its transition stages are omitted for simplicity. There are four parallel branches in Hi-AFA, and their numbers of convolution blocks gradually decrease to 1 from branch-1 to branch-4. The feature maps are not only fed into the next convolution block in current branch, but also aggregated into next branch after suppression. Multi-granularity part-based local features and global features are computed from the first three branches. For branch-4, global and channel-based features are extracted, and DropBlock is applied to obtain another feature tensor. All pooled feature volumes are further forwarded to BNNeck to produce final embeddings.
Due to its outstanding feature extraction ability, the off-the-shelf OSNet [13] is utilized as the backbone of Hi-AFA. Similar to PyConv [50], multiple filters are utilized to learn diverse features in each convolutional block of OSNet. There are five convolutional blocks in OSNet, referred to as Conv1 to Conv5 hereafter, and their key component is the bottleneck illustrated in Figure 2. The Conv1 block contains a standard
The bottleneck of OSNet [13]. The Lite
Our Hi-AFA can be roughly divided into three parts: the common OSNet Conv1 and Conv2 blocks, hierarchical attentive feature aggregation, and final feature processing. Images are first passed through the OSNet backbone up to its Conv3 block. After these initial layers, the network forms an upper-triangular structure of multiple branches, which comprises the remaining layers of OSNet up to the Conv5 block. By this design, the layers up to Conv3 are shared by all branches. This concept has been employed in several person re-identification solutions like [7], [25], and [26], and can decrease model size effectively. Finally, the feature volumes in each branch are pooled with different sizes to obtain multi-granularity features. The part-based local features are computed via average pooling, and max pooling is utilized to obtain global features. The key components of Hi-AFA are detailed in the following.
A. Hierarchical Feature Aggregation
It has been demonstrated that multi-scale feature aggregation can improve person re-identification performance [42], [44], [47]. However, traditional aggregation operations generally only combine high- and low-level features; few efforts have been devoted to the cooperation of branches for mining potential clues in multi-branch architectures. In this work, the proposed hierarchical feature aggregation combines features from different branches, so that richer and more diverse features can be explored.
As shown in Figure 1, the numbers of convolutional blocks gradually decrease to 1 from branch-1 to branch-4 due to the aggregation structure. Moreover, extra links are added between adjacent branches in Hi-AFA, which distinguishes it from previous multi-branch networks with independent branches. By this design, the feature stream also flows along the parallel branches for aggregation, and the branches are consequently forced to cooperate with each other.
Let $\mathcal{F}_{b,l}(\boldsymbol{I};\mathcal{W}_{b,l})$ denote the output of the $l$-th convolutional block in branch $b$ for an input image $\boldsymbol{I}$ with parameters $\mathcal{W}_{b,l}$, and let $\mathcal{A}(\cdot)$ and $\mathcal{S}(\cdot)$ denote the attention module and the FSO, respectively. The input $\boldsymbol{X}_{b,l}$ to the $l$-th block of branch $b$ is then computed as \begin{align*} \boldsymbol{X}_{b,l} = & \mathcal{A}\left({\mathcal{F}_{b,l-1}\left({\boldsymbol{I}; \mathcal{W}_{b,l-1}}\right)}\right) + \mathcal{S}\left({\mathcal{A}\left({\mathcal{F}_{b-1,l}\left({\boldsymbol{I}; \mathcal{W}_{b-1,l}}\right)}\right)}\right), \\ & \qquad \qquad \qquad 2\leq b \leq 4,\; b+1\leq l\leq 5, \tag{1}\end{align*}
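As a minimal sketch of Eq. (1), the following PyTorch-style module illustrates one aggregation step; the `block`, `attention`, and `fso` submodules are hypothetical stand-ins for $\mathcal{F}$, $\mathcal{A}$, and $\mathcal{S}$, not the released implementation.

```python
import torch
import torch.nn as nn

class AggregationStep(nn.Module):
    """Sketch of Eq. (1): X_{b,l} = A(F_{b,l-1}) + S(A(F_{b-1,l})).

    `block`, `attention`, and `fso` are illustrative modules,
    not the authors' released code.
    """
    def __init__(self, block: nn.Module, attention: nn.Module, fso):
        super().__init__()
        self.block = block          # F_{b,l-1}: conv block of the current branch
        self.attention = attention  # A: the LDAM attention module
        self.fso = fso              # S: feature suppression operation

    def forward(self, x_own: torch.Tensor, attn_parallel: torch.Tensor):
        # x_own: input to this branch's previous block; attn_parallel: the
        # attentive feature A(F_{b-1,l}) coming from the parallel branch.
        own = self.attention(self.block(x_own))
        return own + self.fso(attn_parallel)
```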
Although the branches are forced to cooperate in the aggregation structure, they may converge to the same trivial salient features if no extra guidance is provided. To address this problem, FSO is applied to attentive features before aggregation, functioning somewhat like dropout. Unlike dropout, which randomly deactivates units, FSO only filters out high responses, so as to suppress the salient features discovered in the previous branch. Despite some information loss due to thresholding, the branches are endowed with the ability to mine more potential visual clues for matching, which is critical to the re-identification task.
As illustrated in Figure 3, we first apply channel-wise average pooling to get the averaged 2-D feature map $\bar{\boldsymbol{Y}}_{b,l}$. A binary suppression mask $\boldsymbol{M}_{b,l}$ is then obtained by thresholding: \begin{align*} \boldsymbol{M}_{b,l}\left({x,y}\right) = \begin{cases} 0, & \text{if } \bar{\boldsymbol{Y}}_{b,l}\left({x,y}\right) >\tau \\ 1, & \text{otherwise,} \end{cases} \tag{2}\end{align*}
where $\tau$ is the suppression threshold and $(x,y)$ indexes spatial positions. The mask is multiplied element-wise with the attentive feature maps before aggregation, so that the regions with the strongest responses are erased.
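A minimal sketch of FSO under these definitions is given below; the threshold value and the normalization of the averaged map are illustrative assumptions.

```python
import torch

def feature_suppression(x: torch.Tensor, tau: float = 0.7) -> torch.Tensor:
    """Sketch of FSO (Eq. (2)); tau = 0.7 is an assumed value for illustration.

    x: feature tensor of shape (N, C, H, W). Positions whose channel-wise
    average activation exceeds tau are zeroed, so salient regions found by
    the previous branch are suppressed before aggregation.
    """
    y_bar = x.mean(dim=1, keepdim=True)            # channel-wise average, (N, 1, H, W)
    # Normalize each map to [0, 1] so a single threshold is meaningful
    # (an assumption; the paper thresholds the averaged map directly).
    y_min = y_bar.amin(dim=(2, 3), keepdim=True)
    y_max = y_bar.amax(dim=(2, 3), keepdim=True)
    y_bar = (y_bar - y_min) / (y_max - y_min + 1e-6)
    mask = (y_bar <= tau).to(x.dtype)              # Eq. (2): 0 where response > tau
    return x * mask                                # broadcast over channels
```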
Based on the hierarchical aggregation structure, diversified features can be obtained for person re-identification. First, the multi-level attentive features in different branches are recurrently aggregated, so diversified information can be utilized. Second, potentially important features may stand out in the next branch after the salient features of the previous branch are suppressed. The network is thereby enabled to extract all potentially useful features branch-by-branch.
B. Lightweight Dual Attention Module
The proposed Lightweight Dual Attention Module (LDAM) can be viewed as a variant of the classical Convolutional Block Attention Module (CBAM) [55], consisting of a Channel Attention Module (CAM) and a Spatial Attention Module (SAM). The two modules work in a complementary manner to enhance feature representations: CAM explores the correlation between channel features, and SAM captures and aggregates semantically related spatial features. LDAM differs from CBAM in its attention computation, especially the group convolution employed in CAM and SAM, which leads to far fewer parameters than CBAM and a much lower computational cost. Besides, softmax activation is used in LDAM instead of the sigmoid in CBAM. The details of LDAM are as follows.
1) Channel Attention Module
It is well known that each channel map of a high-level convolutional feature can be viewed as a class-specific response, and the responses are generally semantically related. In the person re-identification task, fine-grained recognition benefits when channels sharing similar semantic contexts (e.g., foreground and background) are more correlated. Thus, we group and aggregate semantically correlated channels by explicitly exploiting the interdependencies between channel maps.
The structure of CAM is illustrated in Figure 4. Given a local feature tensor $\boldsymbol{X}$, a channel descriptor $\widetilde{\boldsymbol{x}}$ is first obtained by pooling over the spatial dimensions. The channel attention vector $\boldsymbol{h}$ is then computed as \begin{equation*} \boldsymbol{h}=\text{softmax}\left({\text{gconv}_{2}\left({\text{gconv}_{1}\left({\widetilde{\boldsymbol{x}}}\right)}\right)}\right), \tag{3}\end{equation*}
where $\text{gconv}_{1}$ and $\text{gconv}_{2}$ denote group convolution layers. The enhanced feature is obtained by re-weighting the channels of $\boldsymbol{X}$ with $\boldsymbol{h}$: \begin{equation*} \boldsymbol{A}_{ch}=\gamma \boldsymbol{X}\otimes \boldsymbol{h}+ \boldsymbol{X}, \tag{4}\end{equation*} where $\gamma$ is a scaling parameter and $\otimes$ denotes element-wise multiplication broadcast along the channel dimension.
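The following sketch illustrates CAM as described by Eqs. (3)-(4); the group count, reduction ratio, spatial pooling choice, and the zero-initialized learnable scale are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Sketch of CAM (Eqs. (3)-(4)); group size and reduction are assumed."""
    def __init__(self, channels: int, groups: int = 8, reduction: int = 4):
        super().__init__()
        hidden = channels // reduction
        # 1-D group convolutions over the pooled channel descriptor.
        self.gconv1 = nn.Conv1d(channels, hidden, kernel_size=1, groups=groups)
        self.gconv2 = nn.Conv1d(hidden, channels, kernel_size=1, groups=groups)
        self.gamma = nn.Parameter(torch.zeros(1))    # scaling parameter, init 0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        desc = x.mean(dim=(2, 3)).unsqueeze(-1)      # spatial pooling -> (N, C, 1)
        attn = self.gconv2(self.gconv1(desc))        # Eq. (3), before softmax
        attn = torch.softmax(attn, dim=1).view(n, c, 1, 1)
        return self.gamma * x * attn + x             # Eq. (4): residual re-weighting
```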
2) Spatial Attention Module
An illustration of SAM is shown in Figure 5. In contrast to CAM, SAM captures and aggregates related features in the spatial domain. Given a local feature map $\boldsymbol{X}$, channel-wise pooling first yields a 2-D map $\boldsymbol{M}$, from which the spatial attention map $\boldsymbol{H}$ is computed as \begin{equation*} \boldsymbol{H}=\text{vec}^{-1}\left({\text{softmax}\left({\text{gconv}_{2}\left({\text{gconv}_{1}\left({\text{vec}\left({\boldsymbol{M}}\right)}\right)}\right)}\right)}\right), \tag{5}\end{equation*}
where $\text{vec}(\cdot)$ flattens a 2-D map into a vector and $\text{vec}^{-1}(\cdot)$ denotes the inverse reshaping. The spatially enhanced feature is then \begin{equation*} \boldsymbol{A}_{sp}=\gamma \sum_{c=1}^{C} \boldsymbol{X}^{c}\otimes \boldsymbol{H}+ \boldsymbol{X}, \tag{6}\end{equation*} where $\boldsymbol{X}^{c}$ is the $c$-th channel of $\boldsymbol{X}$.
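A corresponding sketch of SAM under Eqs. (5)-(6) is given below; the fixed input size, group count, and reduction ratio are assumptions.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Sketch of SAM (Eqs. (5)-(6)) for a fixed input size; sizes are assumed."""
    def __init__(self, height: int, width: int, groups: int = 8, reduction: int = 4):
        super().__init__()
        n_pos = height * width
        hidden = n_pos // reduction
        # Group convolutions over the flattened (vec'ed) spatial map.
        self.gconv1 = nn.Conv1d(n_pos, hidden, kernel_size=1, groups=groups)
        self.gconv2 = nn.Conv1d(hidden, n_pos, kernel_size=1, groups=groups)
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        m = x.mean(dim=1).view(n, h * w, 1)              # channel pooling, then vec(M)
        attn = self.gconv2(self.gconv1(m))               # Eq. (5), before softmax
        attn = torch.softmax(attn, dim=1).view(n, 1, h, w)   # vec^{-1}
        summed = x.sum(dim=1, keepdim=True)              # sum over channels in Eq. (6)
        return self.gamma * summed * attn + x            # broadcast back to (N, C, H, W)
```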
It can be found that only simple operations of pooling, group convolution, softmax, and reshaping are involved in LDAM, which keeps its parameter count and computational cost low.
It has been shown that the sequential combination of SAM and CAM can lead to better performance [34], [55]; we follow the same scheme and place SAM in front of CAM for attention learning (see Figure 6 for an illustration). Due to its lightweight design, LDAM is quite flexible and can easily be plugged into networks multiple times if necessary.
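Reusing the two sketches above, the sequential composition can be expressed as follows (again a sketch, not the released code).

```python
import torch.nn as nn

class LDAM(nn.Module):
    """LDAM sketch: SAM placed in front of CAM, reusing the classes above."""
    def __init__(self, channels: int, height: int, width: int):
        super().__init__()
        self.sam = SpatialAttention(height, width)   # sketched earlier
        self.cam = ChannelAttention(channels)        # sketched earlier

    def forward(self, x):
        return self.cam(self.sam(x))                 # spatial first, then channel
```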
C. Feature Processing in Each Branch
In order to learn multi-granularity features and make better use of them, we employ a simple partition strategy to obtain global, part-based, and channel-based features. The final feature maps in each branch are equally partitioned with different sizes to get local features of multiple granularities. Both global and local features are extracted from each branch. In addition, we also extract channel-based features via channel partition.
To extract part-based local features, we simply divide the final feature map into
For branch-4, we first aggregate the information by global max pooling on the tensor, resulting in a vector
During training, the global features in
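The pooling scheme of this subsection can be sketched as follows; the part and channel-group counts are hypothetical, since the exact granularities are not reproduced here.

```python
import torch
import torch.nn.functional as F

def branch_features(fmap: torch.Tensor, n_parts: int):
    """Sketch: one global feature plus n_parts part-based stripe features."""
    global_feat = F.adaptive_max_pool2d(fmap, 1).flatten(1)     # max pooling -> global
    stripes = F.adaptive_avg_pool2d(fmap, (n_parts, 1))         # avg pooling -> stripes
    part_feats = [stripes[:, :, p, 0] for p in range(n_parts)]  # each of shape (N, C)
    return global_feat, part_feats

def channel_features(fmap: torch.Tensor, n_groups: int = 2):
    """Sketch: channel-based features obtained by partitioning the channels."""
    chunks = fmap.chunk(n_groups, dim=1)                        # channel partition
    return [F.adaptive_max_pool2d(c, 1).flatten(1) for c in chunks]
```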
D. Loss Functions
The combination of identification loss, ranking loss, and center loss [57] is adopted for the optimization of network parameters.
The cross-entropy with label smoothing [58] is used as the identification loss, which treats each identity as a distinct class. In each minibatch, the label-smoothed cross-entropy is defined as \begin{equation*} {\mathcal {L}}_{xe} = -\frac {1}{N} \sum _{i=1}^{N}{\sum _{k=1}^{K} {\left ({\left ({1-\epsilon }\right) y_{i}^{k} + \frac {\epsilon }{K} }\right)}} \log \left ({p_{i}^{k} }\right), \tag{7}\end{equation*} where $N$ is the minibatch size, $K$ the number of identities, $y_{i}^{k}$ the one-hot ground-truth label, $p_{i}^{k}$ the predicted probability, and $\epsilon$ the smoothing factor.
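A direct implementation of Eq. (7) might look like the following sketch; $\epsilon=0.1$ is a commonly used value, not necessarily the paper's setting.

```python
import torch
import torch.nn.functional as F

def smoothed_cross_entropy(logits: torch.Tensor, targets: torch.Tensor,
                           epsilon: float = 0.1) -> torch.Tensor:
    """Label-smoothed cross-entropy of Eq. (7); epsilon = 0.1 is a common choice."""
    n, k = logits.shape
    log_p = F.log_softmax(logits, dim=1)
    y = F.one_hot(targets, k).float()
    y_smooth = (1.0 - epsilon) * y + epsilon / k     # smoothed target distribution
    return -(y_smooth * log_p).sum(dim=1).mean()     # average over the minibatch
```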
For the ranking loss, the multi-similarity loss [59] is utilized. As a pair-based loss function, it integrates pair mining and a soft weighting scheme into a single framework. The multi-similarity loss is computed as \begin{align*} \mathcal{L}_{ms}=\frac{1}{N} \sum_{i=1}^{N}\left\{\frac{1}{\alpha} \log \left[1+\sum_{k \in {\mathcal{P}}_{i}} \exp \left(-\alpha\left(S_{ik}-\lambda\right)\right)\right]\right. \\ \left.+\,\frac{1}{\beta} \log \left[1+\sum_{k \in {\mathcal{N}}_{i}} \exp \left(\beta\left(S_{ik}-\lambda\right)\right)\right]\right\}, \tag{8}\end{align*} where $\mathcal{P}_{i}$ and $\mathcal{N}_{i}$ denote the positive and negative pair sets of anchor $i$, $S_{ik}$ is the pairwise similarity, and $\alpha$, $\beta$, and $\lambda$ are hyper-parameters.
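For reference, a simplified implementation of Eq. (8) is sketched below; it omits the pair-mining step of [59], and the hyper-parameter values are illustrative defaults rather than the paper's settings.

```python
import torch

def multi_similarity_loss(sim: torch.Tensor, labels: torch.Tensor,
                          alpha: float = 2.0, beta: float = 50.0,
                          lam: float = 0.5) -> torch.Tensor:
    """Sketch of the multi-similarity loss (Eq. (8)), without pair mining.

    sim: (N, N) pairwise similarity matrix; labels: (N,) identity labels.
    """
    n = sim.size(0)
    eye = torch.eye(n, dtype=torch.bool, device=sim.device)
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~eye    # P_i, self excluded
    neg = labels.unsqueeze(0) != labels.unsqueeze(1)             # N_i
    loss = sim.new_zeros(())
    for i in range(n):
        pos_term = torch.log1p(torch.exp(-alpha * (sim[i][pos[i]] - lam)).sum())
        neg_term = torch.log1p(torch.exp(beta * (sim[i][neg[i]] - lam)).sum())
        loss = loss + pos_term / alpha + neg_term / beta
    return loss / n
```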
To enhance the compactness of each identity cluster, the center loss [57] is also included, which is defined as \begin{equation*} {\mathcal {L}}_{ce}=\frac {1}{2}\sum _{i=1}^{N}{\lVert \boldsymbol {\psi }_{i}- \boldsymbol {c}_{y_{i}} \rVert _{2}^{2}}, \tag{9}\end{equation*} where $\boldsymbol{\psi}_{i}$ denotes the feature of the $i$-th sample and $\boldsymbol{c}_{y_{i}}$ is the center of its identity $y_{i}$.
During training, the final loss function is \begin{equation*} \mathcal {L} = \lambda _{xe}\sum _{ \boldsymbol {\psi }\in \mathcal {I}}{{\mathcal {L}}_{xe}} + \lambda _{ms}\sum _{ \boldsymbol {\psi }\in \mathcal {R}}{{\mathcal {L}}_{ms}} + \lambda _{ce} \sum _{ \boldsymbol {\psi }\in \mathcal {I}\cup \mathcal {R}}{{\mathcal {L}}_{ce}}, \tag{10}\end{equation*} where $\lambda_{xe}$, $\lambda_{ms}$, and $\lambda_{ce}$ are weighting coefficients, and $\mathcal{I}$ and $\mathcal{R}$ denote the feature sets supervised by the identification and ranking losses, respectively.
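Eqs. (9) and (10) can be sketched as follows; the loss weights are placeholders, not the values used in training.

```python
import torch

def center_loss(feats: torch.Tensor, labels: torch.Tensor,
                centers: torch.Tensor) -> torch.Tensor:
    """Center loss of Eq. (9): pull each feature toward its identity center."""
    return 0.5 * ((feats - centers[labels]) ** 2).sum()

def total_loss(l_xe, l_ms, l_ce, lam_xe=1.0, lam_ms=1.0, lam_ce=5e-4):
    """Eq. (10) with placeholder weights; the paper's values are not shown here."""
    return lam_xe * l_xe + lam_ms * l_ms + lam_ce * l_ce
```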
Experiments
In this section, we report the experimental results of the proposed Hi-AFA on four mainstream person re-identification datasets: Market-1501 [60], DukeMTMC-reID [61], MSMT17 [62], and CUHK03 [63]. Figure 7 shows some randomly selected images. We compare Hi-AFA with a range of state-of-the-art solutions, and conduct extensive ablation studies to investigate the effectiveness of each component.
Example images randomly chosen from three benchmark datasets. Images in each row are of the same person in each dataset.
A. Datasets
We conduct experiments on the following four widely used person re-identification datasets.
Market-1501 [60] is currently the most popular person re-identification dataset, which is captured by six cameras. This dataset contains 1,501 identities with 32,668 bounding boxes obtained by the Deform Part Model (DPM) detector. The training set contains 751 identities with 12,936 images, and in the testing set there are 750 identities with 3,368 query images and 19,732 gallery images.
DukeMTMC-reID [61] contains 36,411 images of 1,404 pedestrians captured by eight cameras. A total of 16,522 images belonging to 702 identities make up the training set, and the remaining 702 identities, along with 408 distractors, make up the testing set, which comprises 2,228 query images and 17,661 gallery images.
MSMT17 [62] is collected by twelve outdoor and three indoor cameras. There are 4,101 identities with a total of 126,441 images, divided into a training set of 32,621 images and a testing set of 93,820 images. Due to its massive scale and more complex, dynamic scenes, MSMT17 is much more challenging for person re-identification.
CUHK03 [63] consists of 14,097 pedestrian images of 1,467 identities captured from two disjoint camera views. There are two types of bounding boxes in CUHK03, one is obtained by human annotation, and the other is detected by DPM. We adopt the splitting protocol of 767/700 identities for training and testing on this dataset.
B. Experimental Settings
1) Implementation Details
The OSNet [13] initialized with the weights pretrained on ImageNet is used as our backbone. All images are resized to
2) Evaluation Metrics
The Cumulative Matching Characteristic (CMC) at top ranks and the mean Average Precision (mAP) are reported as evaluation metrics. The CMC value at rank $n$ shows the re-identification accuracy by counting the fraction of queries for which a correct match appears among the top-$n$ results. The mAP reflects the overall re-identification accuracy by calculating the area under the precision-recall curve of each query and averaging over all queries. We note that all experiments are conducted under the single-shot scenario.
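For concreteness, a simplified single-shot evaluation sketch is given below; it omits the camera-based filtering of the standard protocol and assumes every query has at least one gallery match.

```python
import numpy as np

def evaluate(dist: np.ndarray, q_ids: np.ndarray, g_ids: np.ndarray, topk: int = 5):
    """Simplified CMC/mAP computation; camera filtering is omitted.

    dist: (num_query, num_gallery) distance matrix.
    Assumes each query has at least one correct gallery match.
    """
    cmc = np.zeros(topk)
    aps = []
    for i in range(dist.shape[0]):
        order = np.argsort(dist[i])                 # gallery sorted by distance
        matches = (g_ids[order] == q_ids[i])
        first_hit = np.argmax(matches)              # rank of the first correct match
        if first_hit < topk:
            cmc[first_hit:] += 1
        hits = np.where(matches)[0]                 # ranks of all correct matches
        precisions = (np.arange(len(hits)) + 1) / (hits + 1)
        aps.append(precisions.mean())               # average precision of this query
    return cmc / dist.shape[0], float(np.mean(aps))
```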
C. Comparison With State-of-the-Art Methods
Table 1 shows the performance of our proposed Hi-AFA and other state-of-the-art methods on Market-1501, DukeMTMC-reID, MSMT17, and CUHK03. The compared methods can be grouped into three categories: discriminative feature learning based (top of the table), attention based (middle), and transformer based (bottom). We report mAP and CMC values at Rank-1/5 for comparison. Hi-AFA achieves superior or competitive performance compared to previous methods on multiple benchmarks.
1) Results on Market-1501
Our Hi-AFA achieves 91.8% mAP and 97.0%/99.0% Rank-1/5 accuracies on this dataset. Compared to the previous best Rank-1 of 96.3% reported by LightMBN [25], the improvement is 0.7%. Although the mAP of Hi-AFA is lower than the previous best ABD+NFormer [48], it still ranks second. Note that the stunning mAP of ABD+NFormer mainly comes from NFormer, which improves the mAP of ABD-Net [10] from 88.3% to 93.0%; as NFormer can be viewed as a post-processing module, a higher mAP is natural. We also conduct experiments with Hi-AFA+NFormer. For each image, the features extracted via Hi-AFA are concatenated into a representation vector, and NFormer is then applied to all vectors in a mini-batch to obtain the final representations. Following [48], the number of neighbors is set to 20 in Hi-AFA+NFormer. The obtained mAP and Rank-1 accuracy are as high as 95.4%/97.2% on Market-1501, significantly exceeding other methods.
Compared to the two representative feature-learning-based methods, Pyramid [6] and MGN [7], the improvements in mAP and Rank-1 accuracy are 3.6%/4.9% and 1.3%/1.3%. Because Hi-AFA shares a similar branching structure with them, we attribute the improvements to the aggregation structure and attention modules. Among the methods based on attention or transformers, IANet [44], SCSN [12], and HAT [47] all embrace the aggregation strategy to make better use of multi-scale features. Our Hi-AFA outperforms all of them, which demonstrates its encouraging ability to learn discriminative features.
2) Results on DukeMTMC-reID
Hi-AFA achieves competitive results on this dataset. The mAP of Hi-AFA is 82.9%, which ranks second among all methods; the highest score of 85.7% is again reported by ABD+NFormer [48]. On the most important Rank-1 metric, Hi-AFA achieves the same score as AdaSP [67] and BPB(Res50-IBN) [70], all reporting 91.7% matching accuracy. When Hi-AFA is combined with NFormer, the mAP and Rank-1 improve to 91.1% and 94.0%, significantly outperforming all others. Compared with SCSN [12] and HAT [47], which aggregate information via cascaded attentions or transformers, the superiority of Hi-AFA is obvious: the mAP and Rank-1 are improved by 3.9%/1.5% and 0.7%/1.3%. Both of them bear a heavy computation burden to mine diverse features, whereas Hi-AFA achieves this with simple but effective hierarchical feature aggregation and FSO.
3) Results on MSMT17
Our Hi-AFA achieves the best mAP (71.9%) and Rank-1 accuracy (87.6%) among all previous competitors. The previous best is TransReID [74], which reports 69.4% mAP and 86.2% Rank-1 accuracy. Although TransReID benefits from its transformer-based learning structure, Hi-AFA outperforms it by 2.5%/1.4%. On top of that, even higher performance of 76.7% mAP and 90.2% Rank-1 accuracy can be obtained by Hi-AFA+NFormer. From Table 1, we can also observe that Hi-AFA has an obvious advantage over other multi-branch feature-learning and attention-based models. Take the feature-learning-based AdaSP [67] for example: its mAP and Rank-1 are 67.1% and 85.5%, which our Hi-AFA exceeds by 4.8% and 2.1%. When compared with the attention-based DCA [69], the improvements are even larger. The results on MSMT17 demonstrate the scalability of Hi-AFA on such a huge person re-identification benchmark.
4) Results on CUHK03
As shown in Table 1, Hi-AFA achieves the best results in terms of both mAP and Rank-1 accuracy, giving 85.4%/83.6% mAP and 87.9%/85.5% Rank-1 matching accuracy on the labeled and detected settings, respectively. The previous best was reported by APNet [72], with 85.3%/81.5% mAP and 87.4%/83.0% Rank-1 accuracy; the improvements are 0.1%/2.1% in mAP and 0.5%/2.5% in Rank-1 accuracy. With the support of NFormer, the results can be boosted to 88.7%/86.4% and 89.5%/88.6%. Compared to the backbone OSNet [13], Hi-AFA improves the mAP and Rank-1 accuracy by as much as 15.8% and 13.2% under the detected setting, which justifies the superiority of aggregating attentive features.
D. Ablation Study
In the following, we systematically investigate the effectiveness of each key component of Hi-AFA, namely the hierarchical feature aggregation, FSO, LDAM, and the final feature processing. Experiments are conducted on all four considered datasets. On CUHK03, only the labeled version (CUHK03-L) is considered, since the two types of bounding boxes come from the same source. The results are obtained with only one setting changed while the rest remain the same.
1) Effect of Hierarchical Feature Aggregation
The hierarchical feature aggregation structure plays an important role in the proposed Hi-AFA model. To investigate its effectiveness, different sub-models of Hi-AFA are evaluated. We use branch-1 of Hi-AFA as the basic model, and then gradually add the other branches to it. The Hi-AFA with independent branches (denoted as Hi-AFA-BrIndep) and the backbone OSNet [13] are also evaluated for comparison. Note that in Hi-AFA-BrIndep, only the first links between branches are kept and all later ones are discarded, so the branches work independently.
Table 2 presents the results of each sub-model. We observe that quite encouraging results can be achieved with branch-1 alone: it gives 87.6% mAP and 95.3% Rank-1 accuracy on Market-1501, which are 2.7% and 0.5% higher than the results of the backbone OSNet [13]. By gradually adding the other branches, the re-identification performance increases accordingly on all datasets, which proves that the feature aggregation structure in Hi-AFA leads to significant performance improvements. From Table 2 we can also find that the mAP and Rank-1 of Hi-AFA-BrIndep are clearly lower than those of the full Hi-AFA with all links (i.e., branch-{1, 2, 3, 4}). This indicates that the lateral links between adjacent branches are vital to the final re-identification performance, because they encourage the branches to cooperate in exploring more potential clues. In Hi-AFA-BrIndep the branches work independently with no communication, so the performance drops in consequence.
At the bottom of Table 2, the results of Hi-AFA with two other widely used backbones, ResNet-50 [37] and DenseNet-169 [38], are also reported. We first evaluate them as baselines, and then apply our Hi-AFA to these backbones. Consistent improvements are achieved on both, which indicates Hi-AFA is effective across different backbones. In general, DenseNet-169 performs slightly better than ResNet-50, but both are inferior to OSNet. Therefore, OSNet is our first choice of backbone.
2) Effect of FSO
To demonstrate the effect of feature suppression, we evaluate Hi-AFA with different FSO embedding strategies: without FSO (w/o), FSO after a single block from Conv2 to Conv4 of the backbone (C2, C3, and C4), and combinations at consecutive stages (C2&C3, C3&C4, and C2-C4).
From the evaluation results shown in Figure 8, we draw the following observations. (1) FSO boosts re-identification performance effectively: with FSO embedded, both mAP and Rank-1 accuracy are clearly improved. For instance, even the weakest embedding strategy, C2, brings a 0.2% mAP gain on Market-1501. (2) The later the stage at which FSO is embedded, the higher the performance gain. This is natural: higher-stage convolutional features are more category-related than those of shallow layers, so embedding FSO into later stages of the CNN backbone yields more diverse and discriminant features and thus better matching results. (3) Combining FSOs further boosts performance. Similar to the single-stage case, C3&C4 performs better than C2&C3, again demonstrating the superiority of later feature suppression. By plugging FSO into all stages, C2-C4 gives the highest results on all datasets; compared to the model without FSO, the improvements are 1.6%/1.2%, 2.5%/2.2%, 4.2%/2.1%, and 2.8%/2.5%, respectively. This comparison justifies the effectiveness of mining diverse features with FSO.
Performance comparison of Hi-AFA under different FSO embedding settings. w/o means without FSO,
3) Feature Suppression Threshold Analysis
The parameter threshold
Variation of mAP (a), and Rank-1 accuracy (b) with respect to parameter
4) Effect of LDAM
In the proposed Hi-AFA, LDAM plays an important role in guiding feature learning. To investigate its effectiveness, we conduct comparative experiments of Hi-AFA with and without LDAM; in the latter setting, all attention modules are removed for a clean comparison. The results are shown in Figure 10 (a). Hi-AFA consistently outperforms the model without LDAM by a large margin: with the guidance of LDAM, the mAP is improved by 2.0%, 1.4%, 3.7%, and 3.1%, and the Rank-1 accuracy by 1.3%, 1.1%, 1.8%, and 2.8% on the respective datasets. This demonstrates that LDAM can effectively guide Hi-AFA to learn discriminative and robust features for cross-view matching.
Performance comparison of (a) Hi-AFA with (w/) and without (w/o) LDAM, (b)
In addition to these experiments, three other attention modules, CBAM [55], RGA [34], and Nonlocal [77], are also compared with LDAM. We use the same Hi-AFA architecture and replace LDAM with each of these attentions. The performance comparison is shown in Table 3. RGA [34] performs consistently better than the others due to its consideration of the structural relationships between human body parts, outperforming the second best by 0.5%/0.2%, 0.5%/0.5%, 0.7%/0.3%, and 0.5%/0.6% on the respective datasets. Although the performance of LDAM is a bit lower than RGA [34], it performs better than CBAM [55] and Nonlocal [77]. Since LDAM and CBAM have similar architectures, we attribute the improvement mainly to the group convolution, which endows the attention with more flexibility.
Given a tensor of shape
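As a rough illustration of why group convolution keeps LDAM light, the sketch below compares the weight counts of a standard and a grouped convolution; the channel sizes are hypothetical.

```python
def conv_params(c_in: int, c_out: int, k: int = 1, groups: int = 1) -> int:
    """Weight count of a (grouped) convolution, biases ignored."""
    return c_out * (c_in // groups) * k * k

# Hypothetical example: 256 -> 256 channels with 1x1 kernels.
print(conv_params(256, 256))             # 65536 parameters, standard conv
print(conv_params(256, 256, groups=8))   # 8192 parameters, 8x fewer with groups=8
```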
5) Effect of Final Feature Processing
In Hi-AFA, two feature sets are finally obtained, namely $\mathcal{I}$ and $\mathcal{R}$ in Eq. (10).
To validate the effectiveness of DropBlock and channel-wise features, we first use all features except
E. Visualization of Attention Maps
To investigate the image regions attended by each attention module, we use Grad-CAM [78] to visualize the attention maps for qualitative analysis. In all branches, the attention maps after each attention module are generated. As shown in Figure 11, the attentions at convolution block 2 are relatively coarse, with multiple parts of high importance in every attention map. Going deeper, they become more concentrated, forming a few blobs on salient parts. For attention maps at the same stage, the attended areas are generally consistent but differ from each other in detail. Take B1C4, B2C4, and B3C4 in the last row for example: besides the commonly highlighted legs, they focus on the left shoulder, head, and right elbow, respectively. This demonstrates the capability of different branches to mine diverse salient features, which greatly helps distinguish visually similar pedestrians in the person re-identification task.
Visualization of attention maps in Hi-AFA.
F. Model Complexity
The idea of learning diverse features via multi-branch architectures is quite popular in person re-identification, as it enables networks to focus on different person features in individual branches. However, such a branching strategy incurs higher computational cost while boosting re-identification performance. Although our Hi-AFA also embraces the branching strategy, reducing computational complexity is a primary design consideration: both the backbone and the attention module require far fewer parameters. Table 5 lists the complexity and model size of Hi-AFA, several other branching models, and the backbone OSNet [13], in terms of FLOPs, parameter count, and memory size. The proposed Hi-AFA has only 12.76M parameters, consumes 55.83MB of memory, and requires about 2.24G FLOPs. Although this is about six times larger than the backbone OSNet [13], Hi-AFA is still quite slim compared to other branching models.
Conclusion
In this paper, we present a novel Hierarchical Attentive Feature Aggregation (Hi-AFA) network to address the challenging person re-identification task. In Hi-AFA, features are aggregated not only along the depth but also across the parallel branches. In this way, the branches work together to mine more diverse and richer features for fine-grained recognition. To guide feature learning, we design a lightweight dual attention module that requires far fewer parameters than existing attention modules. With the aim of capturing essential person features, we extract global, channel-based, and multi-granularity part-based features from the distinct branches. Due to the lightweight backbone and attention module, the overall complexity of Hi-AFA is kept lower than that of state-of-the-art models, while superior or comparable performance is obtained on four mainstream person re-identification datasets. Ablation analysis is also performed to provide insight into the proposed model. The backbone of Hi-AFA is not restricted to OSNet; other lightweight deep convolutional models can also be utilized. In future work, we will continue to research more effective and lighter person re-identification models.