
Hierarchical Attentive Feature Aggregation for Person Re-Identification


Abstract:

Recent efforts on person re-identification have shown promising results by learning discriminative features via multi-branch networks. To further boost feature discrimination, the attention mechanism has also been extensively employed. However, in existing branching models the branches on the main level rarely communicate with each other, which may compromise the ability to mine diverse features. To mitigate this issue, a novel framework called Hierarchical Attentive Feature Aggregation (Hi-AFA) is proposed. In Hi-AFA, a hierarchical aggregation mechanism is applied to learn attentive features. The current feature map is not only fed into the next stage, but also aggregated into another branch, leading to hierarchical feature flows along depth and across parallel branches. We also present a simple Feature Suppression Operation (FSO) and a Lightweight Dual Attention Module (LDAM) to guide feature learning. The FSO partially erases the salient features already discovered, so that more potential clues can be mined by other branches with the help of LDAM. In this manner, the branches cooperate to mine richer and more diverse feature representations. The hierarchical aggregation and multi-granularity feature learning are integrated into a unified architecture built upon OSNet, resulting in a resource-economical and effective person re-identification model. Extensive experiments on four mainstream datasets, including Market-1501, DukeMTMC-reID, MSMT17, and CUHK03, are conducted to validate the effectiveness of the proposed method, and the results show that state-of-the-art performance is achieved.
Published in: IEEE Access ( Volume: 12)
Page(s): 55711 - 55725
Date of Publication: 16 April 2024
Electronic ISSN: 2169-3536



SECTION I.

Introduction

Person re-identification aims to match a specific person captured by non-overlapping cameras, or across time using the same camera. In many surveillance applications, such as cross-camera tracking [1] and multi-person association [2], person re-identification serves as a fundamental technique and it is generally considered as an image retrieval problem. Despite great progress in recent years, person re-identification still remains an open research challenge. Due to large appearance variation arising from viewpoint changes, varying illumination conditions, occlusion, and complex background, it is rather difficult to match cross-view image pairs.

Extracting discriminative features that fully characterize the query person while distinguishing it from others is of vital importance for any person re-identification system. Owing to their remarkable ability to learn discriminative features, solutions based on Convolutional Neural Networks (CNNs) have become the mainstream for person re-identification [3], [4]. In practice, because global features are prone to ignoring the information of small regions, it has become a trend to fuse global features with part-based local features [5], [6]. These local features are generally learned from multi-branch architectures with supervision, and they help re-identification models focus on fine-grained details in each individual local part. Thus, higher performance can be achieved compared to using global features alone [7], [8].

To further enhance the discrimination of feature representations, the visual attention mechanism has also been introduced into person re-identification [9], [10], [11], [12]. By assigning higher weights to more distinguishable patterns, the attention mechanism equips networks with the ability to emphasize more informative regions. At the same time, irrelevant background interference is suppressed. Therefore, representations strengthened by the attention mechanism can better represent pedestrian images and provide more distinguishable information.

Despite the observed effectiveness of adopting local features and the visual attention mechanism, most existing person re-identification approaches have two shortcomings. First, the branches on the main level rarely communicate with each other in existing branching networks, so the ability to find potential clues leaves room for improvement. Second, the widely used branching architecture usually incurs high computational cost while boosting performance. The model complexity may increase greatly, especially in works like [6] and [7], where several convolutional blocks are duplicated, or in [10] and [11], where heavy matrix multiplications are executed to compute attentions.

In this paper, we propose to address the above problems by hierarchically aggregating features based on the Omni-Scale Network (OSNet) [13]. Technically, we first introduce a hierarchical feature aggregation strategy to progressively combine multi-scale features. The pre-stage feature map is not only fed into the next stage of the current branch, but also aggregated into another parallel branch. In this way, the semantic and detail information at different stages and different branches is aggregated. During aggregation, a Feature Suppression Operation (FSO) is applied to partially erase feature maps with the aim of mining more diversified features. Intuitively, the erased regions generally correspond to the areas where the network has strong activations, so other potential clues stand out in the next branch. As a result, the branches are forced to work together, and all salient features can be extracted in a branch-by-branch manner. Besides, we also design a novel lightweight attention module to guide feature learning. Compared with other typical attention modules, the number of parameters and the computational complexity are significantly reduced. To better leverage the multi-branch structure, the final feature maps in each branch are processed via different pooling strategies to obtain global, multi-granularity part-based, and channel-based features.

We name our model Hierarchical Attentive Feature Aggregation (Hi-AFA). By taking advantage of the lightweight OSNet [13] architecture as the backbone, the number of parameters is kept low. We note that Hi-AFA is not restricted to OSNet; other lightweight architectures can also be employed as the backbone.

The main contributions of our work can be summarized as follows:

  1. We design a novel hierarchical feature aggregation framework (Hi-AFA), which aims to generate more discriminative features by combining the features of different levels and branches. By partially erasing feature maps via Feature Suppression Operation (FSO), the branches can cooperate to mine richer and more diversified features.

  2. We design a Lightweight Dual Attention Module (LDAM), which contains two complementary parts: a Spatial Attention Module (SAM) and a Channel Attention Module (CAM). Due to the adoption of group convolution, it has far fewer parameters than existing attention modules, and its computational cost is quite low.

  3. We integrate Hi-AFA and LDAM into OSNet, forming a resource-economical and effective multi-branch network, from whose branches diverse features are computed for person re-identification. We conduct extensive experiments on four public person re-identification datasets. The proposed method achieves better or comparable performance relative to a broad range of existing models, while keeping much lower model complexity.

The rest of this work is organized as follows. Section II briefly reviews related works. In Section III, the structure of Hi-AFA and LDAM will be elaborated. Section IV presents the experimental evaluations and some discussions. Finally, the whole work is concluded in Section V.

SECTION II.

Related Work

Person re-identification is one of the most active research areas in computer vision, and a large number of solutions have been reported [14], [15]. In this section, we briefly review some closely related works, including local feature learning, attention mechanisms, and feature aggregation.

A. Local Feature Learning for Person Re-Identification

The prevailing success of deep learning has made person re-identification no exception. Earlier deep-learning-based approaches, such as [3], [16], [17], and [18], directly applied CNN backbones to extract global features. Because global features are prone to ignoring local information from small regions [19], more and more works focus on learning local features.

To obtain local features, the works in [20] and [21] first partitioned pedestrian images according to some predefined rules, and then computed local features from each sub-image separately. This approach is easy to implement, but the predefined partitions are often not ideally aligned with human body parts. Instead of using a rough partition strategy, some methods extract body part features via external clues such as pose estimation and human part parsing. In [22], Zhang et al. constructed densely semantically aligned part images to assist feature learning. Rao et al. [23] learned multi-scale skeleton representations. However, these methods need to detect key points or perform semantic parsing with additional models, so extra computation cost is inevitable [24].

Recently, splitting feature maps into a set of spatial parts has become the mainstream [4], [6], [7], [25], [26]. Generally speaking, the feature maps are first obtained by multi-branch deep architectures, and multi-granularity features are then acquired by pooling with different sizes. The Part-based Convolutional Baseline (PCB) [4] is a typical representative of this type, which splits the last feature map into horizontal stripes of the same size. The Multiple Granularity Network (MGN) [7] improved PCB by adding a global branch to utilize global features. Pyramid [6] learned multi-granularity features by dividing the final feature map into a pyramidal partition set. Although impressive performance is achieved, the branches in these works mainly operate separately, which limits the capability of mining diverse features. In Hi-AFA, this is addressed by the aggregation structure assisted by the feature suppression operation.

B. Attention Mechanism in Person Re-Identification

The attention mechanism has also been introduced to person re-identification after its success in other computer vision tasks such as visual question answering [27] and scene segmentation [28]. Since attention can guide a model to focus on informative features while suppressing irrelevant ones, it matches well the goal of handling the challenges in person re-identification.

Directly incorporating a separate stream of spatial attention into deep networks is a common strategy for feature enhancement [29]. Li et al. [9] proposed a multi-granularity attention selection mechanism to better select regions of interest. Si et al. [29] captured spatial dependencies among different pedestrian images by incorporating a correlation attention module. Chen et al. [30] learned attention with counterfactual causality, which can measure attention quality and provide a supervisory signal to guide the learning process. Xun et al. [31] designed a local attention guided network to extract approximate semantic local features of human body parts. To better model long-range dependencies, second-order non-local attentions are computed in [8] and [11]. However, one potential limitation is that their computational cost is relatively high.

Channel-wise attention [32] has also been introduced to explore the correlations among different channels, and the combination of spatial and channel attention can further enhance feature representations [10], [33]. To this end, Zhang et al. [34] captured global structural information for better attention learning by mining pairwise correlations among feature positions and channels. Chen et al. [10] applied orthogonal regularization to enforce diversity on attention maps. In [35], an attention-guided mask module was proposed to address the occlusion problem. In [36], holistic and partial attentions are jointly learned to increase feature robustness against pose variations.

C. Feature Aggregation for Person Re-Identification

Feature aggregation is a common strategy to make full use of features. In deep architectures like ResNet [37] and DenseNet [38], feature aggregation plays a vital role in relieving the vanishing gradient problem for feasible optimization. In person re-identification, a number of solutions with feature aggregation have been reported [12], [39], [40], [41].

Chen et al. [12] employed a salience suppression strategy to mine diverse visual clues at different stages. Xu et al. [42] aggregated the predictions of multiple networks to mimic the decision process of multiple experts. Fu et al. [43] designed an iterative impression aggregation module to update features for similarity computation. Hou et al. [44] proposed to enhance feature representations by selectively aggregating correlated spatial and channel features. The typical two-stream network is employed to fuse features extracted from different spaces in [45] and [46]. Based on the Vision Transformer (ViT), with its impressive capability of exploiting structural patterns, Zhang et al. [47] proposed a hierarchical and iterative structure to refine and aggregate multi-level features. Wang et al. [48] proposed a neighbor transformer network to model interactions across all input images. However, one shortcoming of ViT-based methods is that they require large amounts of training data [49].

The proposed Hi-AFA learns local features via a multi-branch architecture and splits feature maps into horizontal parts. To guide feature learning, both spatial and channel-wise attentions are included to build a lightweight dual attention module. Due to the branching architecture, Hi-AFA might look like PyConv [50], FractalNet [51], CliqueNet [52], and BranchyNet [53] at first glance. However, the branches of FractalNet are trained alternately, which implies that the sub-paths still work separately in essence. The parameters in CliqueNet [52] are recurrently updated many times, so the computational cost is very high. For BranchyNet and PyConv, there is no aggregation to utilize features of different stages. Hi-AFA is also related to [41] and [47], which share the same idea of aggregating intermediate features, but there are notable differences: (1) A Feature Suppression Operation (FSO) is applied to partially erase feature maps, thereby allowing the network to discover diverse visual clues. (2) The attentive features at intermediate stages are aggregated along both the depth and the parallel branches. (3) Multi-granularity part-based and channel-based features are extracted from the branches for better utilization.

SECTION III.

Methodology

Let \mathcal {T}=\left \{ \boldsymbol {I}_{i},y_{i} \right \}_{i=1}^{n} be a set of training images, where \boldsymbol {I}_{i}\in \mathbb {R}^{H\times W\times 3} is the ith pedestrian image with corresponding label y_{i}\in \left \{ 1,2,\cdots,c \right \} and c is the number of identities. For each image, our goal is to compute its rich and diverse feature representations via a multi-branch architecture. To achieve this goal, the proposed Hi-AFA relies on a given CNN backbone and enriches it with hierarchical aggregation branches. In this manner, more potential clues can be mined for fine-grained cross-view matching. The overall architecture of Hi-AFA is illustrated in Figure 1.

FIGURE 1. The architecture of the Hierarchical Attentive Feature Aggregation (Hi-AFA) model. The OSNet is used as the backbone, and its transition stages are omitted for simplicity. There are four parallel branches in Hi-AFA, and their numbers of convolution blocks gradually decrease to 1 from branch-1 to branch-4. The feature maps are not only fed into the next convolution block in the current branch, but also aggregated into the next branch after suppression. Multi-granularity part-based local features and global features are computed from the first three branches. For branch-4, global and channel-based features are extracted, and DropBlock is applied to obtain another feature tensor. All pooled feature volumes are further forwarded to BNNeck to produce the final embeddings.

Due to its outstanding feature extraction ability, the off-the-shelf OSNet [13] is utilized as the backbone of Hi-AFA. Similar to PyConv [50], multiple filters are utilized to learn diverse features in each convolutional block of OSNet. There are five convolutional blocks in OSNet, referred to as Conv1 to Conv5 hereafter, and their key component is the bottleneck illustrated in Figure 2. The Conv1 block contains a standard 7\times 7 convolution layer and a 3\times 3 max pooling layer, both with stride 2. Conv2 to Conv4 each contain two bottlenecks. A transition block, which serves as a downsampler, follows Conv2 and Conv3. The Conv5 block contains a 1\times 1 convolution only. Benefiting from the design of multiple convolutional feature streams in the bottleneck, OSNet [13] outperforms ResNet50 [37] and its variants (e.g., PyramidNet [54]) with much lower model complexity on the re-identification task.

FIGURE 2. The bottleneck of OSNet [13]. The Lite 3\times 3 convolution consists of a 1\times 1 convolution, a depth-wise 3\times 3 convolution, batch normalization, and ReLU activation. AG means aggregation gate, which is a learnable neural network. \times 2 , \times 3 , and \times 4 indicate that the Lite 3\times 3 convolution is repeated 2, 3, and 4 times.

Our Hi-AFA can be roughly divided into three parts: the common OSNet Conv1&2 blocks, hierarchical attentive feature aggregation, and final feature processing. Images are first passed through the OSNet backbone up to its Conv3 block. After forwarding images through these initial layers, the network forms an upper-triangular structure of multiple branches, which comprise the remaining layers of OSNet up to the Conv5 block. By this design, the layers up to Conv3 are shared by all branches. This concept has been employed in several person re-identification solutions such as [7], [25], and [26], and it can decrease the model size effectively. Finally, the feature volumes in each branch are pooled with different sizes, such that multi-granularity features are obtained. The part-based local features are computed via average pooling, and max pooling is utilized to get global features. The key components of Hi-AFA are detailed in the following.

A. Hierarchical Feature Aggregation

It has been demonstrated that multi-scale feature aggregation can help to improve person re-identification performance [42], [44], [47]. However, traditional aggregation operations generally only consider combining high- and low-level features. Few efforts have been devoted to the cooperation of branches for mining potential clues in multi-branch architectures. In this work, the proposed hierarchical feature aggregation aims to combine features from different branches, such that richer and more diverse features can be explored.

As shown in Figure 1, the numbers of convolutional blocks gradually decrease to 1 from branch-1 to branch-4 due to the aggregation structure. In addition, extra links are added between adjacent branches in Hi-AFA, which distinguishes it from previous multi-branch networks with independent branches. By this design, the feature stream also flows along the parallel branches for aggregation. As a consequence, the branches are forced to cooperate with each other.

Let {\mathcal {F}}_{l}:\mathbb {R}^{H\times W\times C}\mapsto \mathbb {R}^{d} be the feature extraction function parameterized by a set of trainable parameters {\mathcal {W}}_{l} , where l\in \left \{ 1,2,\cdots,5 \right \} is the stage index of the OSNet [13] backbone, and H, W, C are the height, width, and number of channels of a tensor. The feature representation of an image \boldsymbol {I} at the lth stage (l \geq 2 ) can be denoted as \boldsymbol {X}_{b,l}=\mathcal {F}_{b,l}\left (\boldsymbol {I}; \mathcal {W}_{b,l} \right) , where b\in \left \{ 1,\cdots,4 \right \} is the branch index in Hi-AFA. If we denote \mathcal {F}_{b,l}\left (\boldsymbol {I}; \mathcal {W}_{b,l} \right) = \boldsymbol {0} \left (2\leq b=l\leq 4 \right) , then the feature aggregation can be formulated as
\begin{align*} \boldsymbol {X}_{b,l} = & \mathcal {A}\left (\mathcal {F}_{b,l-1}\left (\boldsymbol {I}; \mathcal {W}_{b,l-1} \right) \right) + \mathcal {S}\left (\mathcal {A}\left (\mathcal {F}_{b-1,l}\left (\boldsymbol {I}; \mathcal {W}_{b-1,l} \right) \right) \right) \\ &\qquad \qquad \qquad 2\leq b \leq 4, \; b+1\leq l\leq 5, \tag{1}\end{align*}
where \mathcal {A}\left (\cdot \right) and \mathcal {S}\left (\cdot \right) represent the computation of attention and FSO, respectively. Here we introduce the FSO first; the attention is detailed in the next section.
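To make the data flow of equation (1) concrete, a minimal PyTorch-style sketch of one aggregation step is given below. The callables ldam and fso stand for the attention module and the FSO described in this section; the function name, signature, and stage alignment are illustrative assumptions rather than the authors' released code.

```python
# Minimal sketch of one aggregation step from equation (1); `ldam` and `fso`
# are placeholders for the attention module and the feature suppression
# operation, not the authors' implementation.
import torch

def aggregate_step(feat_prev_stage, feat_prev_branch, ldam, fso):
    """X_{b,l} = A(F_{b,l-1}) + S(A(F_{b-1,l})): the attentive feature from the
    previous stage of the current branch is summed with the suppressed
    attentive feature arriving from the same stage of the previous branch."""
    x = ldam(feat_prev_stage)                 # A(F_{b,l-1}(I; W_{b,l-1}))
    if feat_prev_branch is not None:          # branch-1 receives no lateral input
        x = x + fso(ldam(feat_prev_branch))   # + S(A(F_{b-1,l}(I; W_{b-1,l})))
    return x
```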

Although the branches are forced to cooperate in the aggregation structure, they may fall into the same trivial salient features if no extra guidance is provided. To address this problem, the FSO is applied to attentive features before aggregation; it functions somewhat like dropout. But unlike dropout, which randomly chooses units to deactivate, FSO only filters out high responses, so as to suppress the salient features discovered in the previous branch. Despite some information loss due to the thresholding process, the branches are endowed with the ability to mine more potential visual clues for matching, which is critical to the re-identification task.

As illustrated in Figure 3, we first apply channel-wise average pooling to get the averaged 2-D feature map \boldsymbol {Y}_{b,l} given \boldsymbol {X}_{b,l} , and obtain its normalized version \bar { \boldsymbol {Y}}_{b,l} by min-max normalization. Then, we compute a thresholding mask \boldsymbol {M}_{b,l} as follows:
\begin{align*} \boldsymbol {M}_{b,l}\left (x,y \right) = \begin{cases} 0, & \text {if } \bar { \boldsymbol {Y}}_{b,l}\left (x,y \right) >\tau \\ 1, & \text {otherwise}\end{cases} \tag{2}\end{align*}

FIGURE 3. A schematic of the proposed feature suppression operation.

where \tau \in \left (0,1 \right] is a thresholding parameter assigned manually, and \bar { \boldsymbol {Y}}_{b,l}\left (x,y \right) stands for the intensity value at position \left (x,y \right) . With the obtained \boldsymbol {M}_{b,l} , the suppressed features \widetilde { \boldsymbol {Y}}_{b,l} can be computed as \widetilde { \boldsymbol {Y}}_{b,l}^{c} = \boldsymbol {Y}_{b,l}^{c} \otimes \boldsymbol {M}_{b,l} , where c is the channel index of \boldsymbol {Y}_{b,l} and \otimes represents element-wise multiplication. Because FSO involves only simple average pooling, normalization, and thresholding operations, it can be performed efficiently.
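For illustration, a minimal PyTorch sketch of the FSO is given below, assuming NCHW tensors and applying the mask of equation (2) to every channel of the incoming feature tensor; the function name and the small epsilon guard are our own additions.

```python
# Sketch of the Feature Suppression Operation (FSO); tau is the manually set
# threshold of equation (2). Assumes an (N, C, H, W) input tensor.
import torch

def feature_suppression(x, tau=0.7):
    y = x.mean(dim=1, keepdim=True)                    # channel-wise average pooling
    y_min = y.amin(dim=(2, 3), keepdim=True)
    y_max = y.amax(dim=(2, 3), keepdim=True)
    y_norm = (y - y_min) / (y_max - y_min + 1e-6)      # min-max normalization
    mask = (y_norm <= tau).float()                     # equation (2): zero where response > tau
    return x * mask                                    # erase suppressed positions in every channel
```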

Based on the hierarchical aggregation structure, diversified features can be obtained for person re-identification. First, the multi-level attentive features in different branches are recurrently aggregated, so diversified information can be utilized. Second, potentially important features may stand out in the next branch after the previous salient features have been suppressed. The network is thereby enabled to extract all potentially useful features branch by branch.

B. Lightweight Dual Attention Module

The proposed Lightweight Dual Attention Module (LDAM) can be viewed as a variant of the classical Convolutional Block Attention Module (CBAM) [55], which consists of a Channel Attention Module (CAM) and a Spatial Attention Module (SAM). The two types of attention modules work in a complementary manner to enhance feature representations. CAM explores the correlation between channel features, while SAM aims to capture and aggregate semantically related spatial features. However, LDAM differs from CBAM in its attention computation process, especially the group convolution employed in CAM and SAM, which leads to far fewer parameters than CBAM. As a result, the computational cost is quite low. Besides, the softmax activation is used in LDAM rather than the sigmoid used in CBAM. The details of LDAM are as follows.

1) Channel Attention Module

It is well known that each channel map of a high-level convolutional feature can be viewed as a class-specific response, and the responses are generally semantically related. In the person re-identification task, fine-grained recognition benefits when channels sharing similar semantic contexts (e.g., foreground and background) are more correlated. Thus, we group and aggregate semantically correlated channels by explicitly exploiting the interdependencies between channel maps.

The structure of CAM is illustrated in Figure 4. Given a local feature tensor \boldsymbol {X}\in \mathbb {R}^{H \times W \times C} , we first squeeze the spatial dimension with average pooling and max pooling. It is known that average pooling retains structural information well but is easily distracted by background interference. Max pooling overcomes this problem by focusing on the most salient part, at the cost of some structural information loss. In CAM, we jointly use them to obtain two context descriptors \boldsymbol {x}_{avg}\in \mathbb {R}^{1 \times 1 \times C} and \boldsymbol {x}_{max}\in \mathbb {R}^{1 \times 1 \times C} , and aggregate them via summation to obtain \widetilde { \boldsymbol {x}}= \boldsymbol {x}_{avg}+ \boldsymbol {x}_{max} . Then, we use group convolution to squeeze the channel size of \widetilde { \boldsymbol {x}} to C/r , where r is a shrinkage parameter. After dividing \widetilde { \boldsymbol {x}} into g independent fractions, we apply 1\times 1\times C/g filters on each of them and concatenate the resulting intermediate descriptors. Such group convolution achieves the effect of a standard convolution with far fewer parameters. Similarly, a second group convolution layer is applied to restore the channel size to C. At last, a softmax activation is applied. The whole procedure of CAM can be formulated as
\begin{equation*} \boldsymbol {h}=\text {softmax} \left (\text {gconv}_{2}\left (\text {gconv}_{1}\left (\widetilde { \boldsymbol {x}} \right) \right) \right), \tag{3}\end{equation*}

FIGURE 4. Structure of the channel attention module.

where \text {gconv}_{1}\left (\cdot \right) and \text {gconv}_{2}\left (\cdot \right) represent the two group convolutions. Finally, we obtain the output of CAM by
\begin{equation*} \boldsymbol {A}_{ch}=\gamma \boldsymbol {X}\otimes \boldsymbol {h}+ \boldsymbol {X}, \tag{4}\end{equation*}
where \gamma is a hyperparameter to adjust the impact of CAM. In equation (4), each position of \boldsymbol {X} is multiplied with \boldsymbol {h} along the channel dimension.
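A minimal sketch of CAM following equations (3)-(4) is shown below; the use of nn.Conv1d for the two group convolutions, the default r, g, and gamma values, and the tensor layout are assumptions for illustration, not the released implementation.

```python
# Sketch of the Channel Attention Module (CAM), equations (3)-(4).
import torch
import torch.nn as nn

class CAM(nn.Module):
    def __init__(self, channels, r=8, g=8, gamma=1.0):
        super().__init__()
        self.gamma = gamma
        self.gconv1 = nn.Conv1d(channels, channels // r, kernel_size=1, groups=g)
        self.gconv2 = nn.Conv1d(channels // r, channels, kernel_size=1, groups=g)

    def forward(self, x):                                  # x: (N, C, H, W)
        avg = x.mean(dim=(2, 3))                           # average-pooled descriptor (N, C)
        mx = x.amax(dim=(2, 3))                            # max-pooled descriptor (N, C)
        d = (avg + mx).unsqueeze(-1)                       # summed descriptors, (N, C, 1)
        h = self.gconv2(self.gconv1(d))                    # squeeze channels, then restore
        h = torch.softmax(h, dim=1).unsqueeze(-1)          # channel weights, (N, C, 1, 1)
        return self.gamma * x * h + x                      # equation (4): scaled residual
```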

2) Spatial Attention Module

An illustration of SAM is shown in Figure 5. In contrast to CAM, SAM captures and aggregates related features in the spatial domain. Given a local feature map \boldsymbol {X} of size H\times W\times C , SAM first obtains a 2-D matrix \boldsymbol {M}\in \mathbb {R}^{H\times W} by summation over the channels at each spatial position, i.e., \boldsymbol {M}\left (x,y \right) = \sum _{c=1}^{C}{ \boldsymbol {X}^{c}}\left (x,y \right) , where \boldsymbol {X}^{c} represents the submap of \boldsymbol {X} at the cth channel. Then \boldsymbol {M} is reshaped to 1 \times 1 \times HW for convenience of applying two sets of 1\times 1 convolutions. Similar to the two group convolution layers in CAM, a context descriptor with shape 1 \times 1 \times HW/r is obtained after the first convolution, and the second one restores the shape back to 1\times 1\times HW . After that, a softmax function is applied, and the 2-D attention map \boldsymbol {H}\in \mathbb {R}^{H \times W} is obtained by restoring the spatial shape. The value at each position of \boldsymbol {H} indicates the degree of importance of that location. Formally, the spatial attention map \boldsymbol {H} is computed as
\begin{equation*} \boldsymbol {H}=\text {vec}^{-1}\left (\text {softmax} \left (\text {gconv}_{2}\left (\text {gconv}_{1}\left (\text {vec}\left (\boldsymbol {M} \right) \right) \right) \right) \right), \tag{5}\end{equation*}

FIGURE 5. Structure of the spatial attention module.

where \text {vec}\left (\cdot \right) and \text {vec}^{-1}\left (\cdot \right) represent the vectorization of a 2-D matrix and its inverse operation, respectively. With \boldsymbol {H} , the output of SAM can be computed as
\begin{equation*} \boldsymbol {A}_{sp}=\gamma \sum _{c=1}^{C} \boldsymbol {X}^{c}\otimes \boldsymbol {H}+ \boldsymbol {X}. \tag{6}\end{equation*}
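A corresponding sketch of SAM following equations (5)-(6) is given below; because the group convolutions act on the HW-length vectorized map, the spatial size must be fixed at construction time, which is an assumption we make for illustration.

```python
# Sketch of the Spatial Attention Module (SAM), equations (5)-(6).
import torch
import torch.nn as nn

class SAM(nn.Module):
    def __init__(self, height, width, r=8, g=8, gamma=1.0):
        super().__init__()
        hw = height * width
        self.h, self.w, self.gamma = height, width, gamma
        self.gconv1 = nn.Conv1d(hw, hw // r, kernel_size=1, groups=g)
        self.gconv2 = nn.Conv1d(hw // r, hw, kernel_size=1, groups=g)

    def forward(self, x):                                  # x: (N, C, H, W)
        m = x.sum(dim=1)                                   # M: sum over channels, (N, H, W)
        v = m.reshape(x.size(0), self.h * self.w, 1)       # vec(M), treated as a length-HW descriptor
        a = self.gconv2(self.gconv1(v))                    # squeeze to HW/r, then restore to HW
        a = torch.softmax(a, dim=1)
        att = a.reshape(x.size(0), 1, self.h, self.w)      # vec^{-1}: 2-D attention map H
        return self.gamma * x * att + x                    # equation (6): scaled residual
```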

It can be seen that LDAM involves only simple operations of pooling, 1\times 1 convolution, and softmax, so the computational cost is rather low. In both CAM and SAM, two group convolutional layers are applied, which makes LDAM differ from CBAM [55] in structure. The usage of group convolutions follows the squeeze-and-excitation process in SENet [32], which enables the network to increase its sensitivity to informative features while greatly reducing the parameters. Consequently, the channel and spatial dependencies are better modeled. Besides, the softmax rather than the sigmoid activation function is used in both types of attention modules, because softmax encourages filters to learn diverse features, hence making the model more robust.

It has been shown that a sequential combination of SAM and CAM can lead to better performance [34], [55], so we follow the same scheme and place SAM in front of CAM for attention learning (see Figure 6 for an illustration). Due to the lightweight design of LDAM, it is quite flexible and can be easily plugged into networks multiple times if necessary.
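Reusing the CAM and SAM sketches above, LDAM itself is then just the sequential composition of Figure 6; this wrapper is illustrative only.

```python
# Sequential SAM -> CAM composition used as LDAM (Figure 6), built on the CAM
# and SAM sketches given above; layer choices remain assumptions.
import torch.nn as nn

class LDAM(nn.Module):
    def __init__(self, channels, height, width, r=8, g=8, gamma=1.0):
        super().__init__()
        self.sam = SAM(height, width, r=r, g=g, gamma=gamma)
        self.cam = CAM(channels, r=r, g=g, gamma=gamma)

    def forward(self, x):
        return self.cam(self.sam(x))    # spatial attention first, then channel attention
```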

FIGURE 6. Sequential combination of SAM and CAM in LDAM.

C. Feature Processing in Each Branch

In order to learn multi-granularity features and make better use of them, we employ a simple partition strategy to obtain global, part-based, and channel-based features. The final feature maps in each branch are equally partitioned with different sizes to get local features of multiple granularities. Both global and local features are extracted from each branch. In addition, we also extract channel-based features via channel partition.

To extract part-based local features, we simply divide the final feature map into n_{b} submaps according to the number of convolutional blocks in each branch. That is, n_{b} equals 5, 3, and 2 from branch-1 to branch-3. The local features \left \{ \boldsymbol {p}_{b}^{i}\right \}_{i=1}^{n_{b}} (\mathcal {B}=\{1,2,3\}, b\in \mathcal {B} ) are all acquired by spatial average pooling from final feature maps of size 24\times 8\times 512 . Additionally, we use max pooling on the initial feature maps, obtaining 512-dimensional global representations \left \{\boldsymbol {g}_{b}\right \} (b\in \mathcal {B} ). The hybrid usage of average and max pooling here helps to retain structural information and obtain robust global features simultaneously.

For branch-4, we first aggregate the information by global max pooling on the tensor, resulting in a vector \boldsymbol {g}_{4}\in \mathbb {R}^{512} . We also apply the mask computed via DropBlock [56] to the feature map, and global max pooling is then applied to the resulting tensor, leading to another vector \boldsymbol {g}_{drop}\in \mathbb {R}^{512} . In addition to \boldsymbol {g}_{4} and \boldsymbol {g}_{drop} , two channel-based feature vectors are also extracted. After reducing the original feature map by average pooling, we split the resulting 512-dimensional vector into two sub-vectors, each of length 256. Then, 1\times 1 convolutions are used to rescale them to 512 dimensions, by which the two channel-based vectors \boldsymbol {c}^{1}\in \mathbb {R}^{512} and \boldsymbol {c}^{2}\in \mathbb {R}^{512} are obtained.
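The per-branch pooling described above can be summarized with the following sketch; the tensor shapes, helper names, and the way the channel-split is expressed are illustrative assumptions.

```python
# Sketch of per-branch feature processing: horizontal part pooling plus a
# global max-pooled feature for branches 1-3, and channel-split features for
# branch-4. Shapes and helper names are assumptions for illustration.
import torch.nn.functional as F

def branch_features(x, n_parts):
    """x: (N, C, H, W) final map of a branch; returns the global vector and
    n_parts average-pooled horizontal part vectors."""
    g = F.adaptive_max_pool2d(x, 1).flatten(1)                      # global feature (max pooling)
    parts = x.chunk(n_parts, dim=2)                                 # horizontal stripes
    locals_ = [F.adaptive_avg_pool2d(p, 1).flatten(1) for p in parts]
    return g, locals_

def branch4_channel_features(x, conv_a, conv_b):
    """Split the average-pooled 512-d vector into two 256-d halves and rescale
    each back to 512-d with a 1x1 convolution (conv_a/conv_b: nn.Conv2d(256, 512, 1))."""
    v = F.adaptive_avg_pool2d(x, 1)                                 # (N, 512, 1, 1)
    c1, c2 = v.chunk(2, dim=1)                                      # two (N, 256, 1, 1) halves
    return conv_a(c1).flatten(1), conv_b(c2).flatten(1)
```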

During training, the global features in \mathcal {R}=\{ \boldsymbol {g}_{drop}, \boldsymbol {g}_{b'}\} (\mathcal {B}'=\{1,2,3,4\}, b'\in \mathcal {B}' ) are fed into a ranking loss to learn distance metrics. We also use BNNeck [57] to obtain \mathcal {I}=\{\widetilde { \boldsymbol {g}}_{drop},\widetilde { \boldsymbol {g}}_{b'},\widetilde { \boldsymbol {p}}_{b}^{i},\widetilde { \boldsymbol {c}}^{k}\} (b'\in \mathcal {B}' , b\in \mathcal {B} , 1\leq i\leq n_{b} , k\in \{1,2\} ), on which identity classifiers are learned. The BNNeck is composed of a batch normalization layer and a fully connected layer with as many units as classes. During inference, the network without BNNeck and classifiers is used as a feature extractor for all query and gallery images, and the Euclidean distance is then calculated to perform standard retrieval.
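The role of BNNeck in this pipeline can be sketched as follows; the two-output convention (pre-classifier normalized feature for the identity loss, logits from a bias-free classifier) follows [57], while the class and variable names are our own.

```python
# Sketch of BNNeck [57]: batch normalization followed by a bias-free classifier.
# During inference only the feature extractor output is used for retrieval.
import torch.nn as nn

class BNNeck(nn.Module):
    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.bn = nn.BatchNorm1d(feat_dim)
        self.classifier = nn.Linear(feat_dim, num_classes, bias=False)

    def forward(self, feat):
        feat_bn = self.bn(feat)             # normalized embedding, fed to the identity loss
        logits = self.classifier(feat_bn)   # class scores over the training identities
        return feat_bn, logits
```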

D. Loss Functions

The combination of identification loss, ranking loss, and center loss [57] is adopted for the optimization of network parameters.

The cross-entropy with label smoothing [58] is used as the identification loss, which treats each identity as a distinct class. In each mini-batch, the label-smoothed cross-entropy is defined as
\begin{equation*} {\mathcal {L}}_{xe} = -\frac {1}{N} \sum _{i=1}^{N}{\sum _{k=1}^{K} {\left (\left (1-\epsilon \right) y_{i}^{k} + \frac {\epsilon }{K} \right)}} \log \left (p_{i}^{k} \right), \tag{7}\end{equation*}
where \epsilon \in \left (0,1 \right) is a smoothing parameter, N is the mini-batch size, K is the number of identities, and y_{i}^{k} and p_{i}^{k} are the ground-truth and predicted probability, respectively.
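A compact sketch of equation (7) is given below, assuming logits of shape (N, K) and integer identity labels; the default smoothing value is an assumption for illustration.

```python
# Sketch of the label-smoothed identification loss of equation (7).
import torch
import torch.nn.functional as F

def smoothed_cross_entropy(logits, labels, epsilon=0.1):
    n, k = logits.shape
    log_p = F.log_softmax(logits, dim=1)
    # smoothed targets: (1 - eps) * y + eps / K for every class
    targets = torch.full_like(log_p, epsilon / k)
    targets.scatter_(1, labels.unsqueeze(1), 1.0 - epsilon + epsilon / k)
    return -(targets * log_p).sum(dim=1).mean()
```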

For the computation of the ranking loss, the multi-similarity loss [59] is utilized. As a pair-based list-wise loss function, the multi-similarity loss integrates pair mining and a soft weighting scheme into a single framework. It is computed as
\begin{align*} \mathcal{L}_{ms}=\frac{1}{N} \sum_{i=1}^{N}\left\{\frac{1}{\alpha} \log \left[1+\sum_{k \in {\mathcal{P}}_{i}} \exp \left(-\alpha\left(S_{ik}-\lambda\right)\right)\right] +\frac{1}{\beta} \log \left[1+\sum_{k \in {\mathcal{N}}_{i}} \exp \left(\beta\left(S_{ik}-\lambda\right)\right)\right]\right\}, \tag{8}\end{align*}
where S_{ik}=\langle \boldsymbol {\psi }_{i}, \boldsymbol {\psi }_{k} \rangle is the dot product of feature vectors \boldsymbol {\psi }_{i} and \boldsymbol {\psi }_{k} ; \alpha , \beta , and \lambda are manually set hyper-parameters; and {\mathcal {P}}_{i} and {\mathcal {N}}_{i} are the selected positive and negative pairs for an anchor \boldsymbol {\psi }_{i} .
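The multi-similarity term can be sketched as below; for brevity the informative-pair mining step of [59] is omitted and all in-batch positives and negatives of each anchor are used, which is a simplification rather than the exact training procedure.

```python
# Sketch of the multi-similarity ranking loss of equation (8), without the
# pair mining of [59]; alpha, beta, and lambda follow the reported settings.
import torch

def multi_similarity_loss(embeddings, labels, alpha=2.0, beta=40.0, lam=0.5):
    sim = embeddings @ embeddings.t()                            # S_{ik}, dot-product similarities
    n = embeddings.size(0)
    idx = torch.arange(n, device=labels.device)
    loss = embeddings.new_zeros(())
    for i in range(n):
        pos = (labels == labels[i]) & (idx != i)                 # P_i: positives of the anchor
        neg = labels != labels[i]                                # N_i: negatives of the anchor
        pos_term = torch.log1p(torch.exp(-alpha * (sim[i][pos] - lam)).sum()) / alpha
        neg_term = torch.log1p(torch.exp(beta * (sim[i][neg] - lam)).sum()) / beta
        loss = loss + pos_term + neg_term
    return loss / n
```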

To enhance the compactness of each identity cluster, the center loss [57] is also included, which is defined as
\begin{equation*} {\mathcal {L}}_{ce}=\frac {1}{2}\sum _{i=1}^{N}{\lVert \boldsymbol {\psi }_{i}- \boldsymbol {c}_{y_{i}} \rVert _{2}^{2}}, \tag{9}\end{equation*}
where \boldsymbol {c}_{y_{i}} denotes the center of class y_{i} .

During training, the final loss function is
\begin{equation*} \mathcal {L} = \lambda _{xe}\sum _{ \boldsymbol {\psi }\in \mathcal {I}}{{\mathcal {L}}_{xe}} + \lambda _{ms}\sum _{ \boldsymbol {\psi }\in \mathcal {R}}{{\mathcal {L}}_{ms}} + \lambda _{ce} \sum _{ \boldsymbol {\psi }\in \mathcal {I}\cup \mathcal {R}}{{\mathcal {L}}_{ce}}, \tag{10}\end{equation*}
where \lambda _{xe} , \lambda _{ms} , and \lambda _{ce} are suitable weights that can be obtained by grid search. The identification loss {\mathcal {L}}_{xe} , ranking loss {\mathcal {L}}_{ms} , and center loss {\mathcal {L}}_{ce} are computed over \mathcal {I} , \mathcal {R} , and \mathcal {I}\cup \mathcal {R} , respectively.
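Putting the three terms together, equation (10) amounts to a weighted sum over the two feature sets; the sketch below reuses the loss sketches above and treats the center loss and the feature collections as placeholders.

```python
# Sketch of the overall objective of equation (10). `id_outputs` is a list of
# (embedding, logits) pairs for the features in I, `rank_feats` a list of
# embeddings for R, and `center_loss` a callable such as a center-loss module;
# the weights are those reported in the experimental settings.
def total_loss(id_outputs, rank_feats, labels, center_loss,
               lambda_xe=0.5, lambda_ms=0.5, lambda_ce=5e-4):
    l_xe = sum(smoothed_cross_entropy(logits, labels) for _, logits in id_outputs)
    l_ms = sum(multi_similarity_loss(f, labels) for f in rank_feats)
    l_ce = sum(center_loss(f, labels) for f, _ in id_outputs)
    l_ce = l_ce + sum(center_loss(f, labels) for f in rank_feats)
    return lambda_xe * l_xe + lambda_ms * l_ms + lambda_ce * l_ce
```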

SECTION IV.

Experiments

In this section, we report the experimental results of the proposed Hi-AFA on four mainstream person re-identification datasets, including Market-1501 [60], DukeMTMC-reID [61], MSMT17 [62], and CUHK03 [63]. Figure 7 shows some randomly selected images. We compare Hi-AFA with a line of state-of-the-art solutions, and conduct extensive ablation studies to investigate the effectiveness of each component.

FIGURE 7. Example images randomly chosen from three benchmark datasets. Images in each row are of the same person in each dataset.

A. Datasets

We conduct experiments on the following four widely used person re-identification datasets.

Market-1501 [60] is currently the most popular person re-identification dataset, captured by six cameras. It contains 1,501 identities with 32,668 bounding boxes obtained by the Deformable Part Model (DPM) detector. The training set contains 751 identities with 12,936 images, and the testing set contains 750 identities with 3,368 query images and 19,732 gallery images.

DukeMTMC-reID [61] contains 36,441 images of 1,404 pedestrians captured by eight cameras. A total of 16,552 images belonging to 702 identities make up the training set, and the remaining 702 identities along with 408 distractors make up the testing set. In the testing set, there are 2,268 query images and 17,661 gallery images respectively.

MSMT17 [62] is collected by twelve outdoor and three indoor cameras. There are 4,101 identities with a total of 126,441 images. It is divided into a training set of 32,621 images and a testing set of 93,820 images. Due to its massive scale and its more complex and dynamic scenes, person re-identification on MSMT17 is much more challenging.

CUHK03 [63] consists of 14,097 pedestrian images of 1,467 identities captured from two disjoint camera views. There are two types of bounding boxes in CUHK03, one is obtained by human annotation, and the other is detected by DPM. We adopt the splitting protocol of 767/700 identities for training and testing on this dataset.

B. Experimental Settings

1) Implementation Details

The OSNet [13] initialized with weights pretrained on ImageNet is used as our backbone. All images are resized to 384\times 128 pixels such that more detailed information can be captured. For both training and testing, the input images are normalized to channel-wise zero mean and unit standard deviation. During training, we adopt a data augmentation strategy of random cropping, horizontal flipping, and random erasing. The model is trained for 200 epochs with a batch size of 64. Each mini-batch consists of 8 identities with 8 instances per identity. The Adam optimizer with \epsilon =1 \times 10^{-8} , \beta _{1}=0.9 , and \beta _{2}=0.999 is used for training. The learning rate is set to 8 \times 10^{-4} with a weight decay of 5 \times 10^{-4} . In LDAM, the shrinkage parameter r is set to 8, and a group size of g=8 is used for group convolution. The hyper-parameters \alpha , \beta , and \lambda in equation (8) are set to 2, 40, and 0.5. The balance parameters \lambda _{xe} , \lambda _{ms} , and \lambda _{ce} in equation (10) are empirically set to 0.5, 0.5, and 5\times 10^{-4} . We use the same settings for all considered datasets.
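As a reference point, the optimizer described above corresponds to the following configuration; `model` is a placeholder and this is not the authors' training script.

```python
# Sketch of the optimizer settings listed above (Adam with the stated learning
# rate, betas, epsilon, and weight decay); `model` is a placeholder module.
import torch

def build_optimizer(model):
    return torch.optim.Adam(
        model.parameters(),
        lr=8e-4,
        betas=(0.9, 0.999),
        eps=1e-8,
        weight_decay=5e-4,
    )
```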

2) Evaluation Metrics

The Cumulative Matching Characteristic (CMC) at top ranks and the mean Average Precision (mAP) are reported as evaluation metrics. The CMC value at rank n gives the re-identification accuracy by counting the queries whose correct identity appears among the top n results. The mAP reflects the overall re-identification accuracy by calculating the area under the precision-recall curve. All experiments are conducted under the single-shot scenario.
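For clarity, the two metrics can be computed per query as in the sketch below, assuming a gallery already ranked by ascending Euclidean distance and ignoring the usual same-camera and junk-image filtering for brevity.

```python
# Sketch of single-query evaluation: average precision (the per-query quantity
# averaged into mAP) and rank-n matching for the CMC curve.
import numpy as np

def average_precision(ranked_gallery_ids, query_id):
    matches = (np.asarray(ranked_gallery_ids) == query_id).astype(np.float32)
    if matches.sum() == 0:
        return 0.0
    cum_hits = np.cumsum(matches)
    hit_positions = np.where(matches == 1)[0]
    precision_at_hits = cum_hits[hit_positions] / (hit_positions + 1)
    return float(precision_at_hits.mean())

def cmc_at_rank(ranked_gallery_ids, query_id, n=1):
    return float((np.asarray(ranked_gallery_ids[:n]) == query_id).any())
```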

C. Comparison With State-of-the-Art Methods

Table 1 shows the performance of the proposed Hi-AFA and other state-of-the-art methods on Market-1501, DukeMTMC-reID, MSMT17, and CUHK03. The compared methods can be generally grouped into three categories: discriminative feature learning based (top of the table), attention based (middle), and transformer based (bottom). We report the mAP and CMC values at Rank-1/5 for comparison. We observe that Hi-AFA achieves superior or competitive performance on multiple benchmarks compared to previous methods.

TABLE 1. Performance Comparison of Hi-AFA With the State-of-the-Art Methods on Market-1501, DukeMTMC-reID, MSMT17, and CUHK03 Datasets. R1/5 Indicates Rank-1/5 Accuracy. In Each Column, the Highest Score is Marked in Bold, and the Second-Best is Underlined.

1) Results on Market-1501

Our Hi-AFA achieves 91.8% mAP and 97.0%/99.0% Rank-1/5 accuracies on this dataset. Compared to the previous best Rank-1 of 96.3% reported by LightMBN [25], the improvement is 0.7%. Although the mAP of Hi-AFA is lower than that of the previous best, ABD+NFormer [48], it still ranks second. Note that the remarkable mAP of ABD+NFormer mainly comes from NFormer, which improves the mAP of ABD-Net [10] from 88.3% to 93.0%. As NFormer can be viewed as a post-processing module, a higher mAP is natural. We also conduct experiments with Hi-AFA+NFormer. For each image, the features extracted via Hi-AFA are concatenated into a representation vector. NFormer is then applied to all vectors in a mini-batch to obtain their final representations. Following [48], the number of neighbors is also set to 20 in Hi-AFA+NFormer. The obtained mAP and Rank-1 accuracy reach 95.4%/97.2% on Market-1501, exceeding other methods significantly.

Compared to the two representative feature learning based methods, Pyramid [6] and MGN [7], the improvements in mAP and Rank-1 accuracy are 3.6%/4.9% and 1.3%/1.3%. Because Hi-AFA shares a similar branching structure with them, we believe the improvements should be attributed to the aggregation structure and attention modules. Among the methods based on attention or transformers, IANet [44], SCSN [12], and HAT [47] all embrace an aggregation strategy to make better use of multi-scale features. Our Hi-AFA outperforms all of them, which demonstrates its encouraging ability to learn discriminative features.

2) Results on DukeMTMC-Reid

Hi-AFA achieves competitive results on this dataset. The mAP of Hi-AFA is 82.9%, which ranks second among all methods. The highest score, 85.7%, is again reported by ABD+NFormer [48]. On the most important Rank-1 metric, Hi-AFA achieves the same score as AdaSP [67] and BPB(Res50-IBN) [70], all reporting 91.7% matching accuracy. When Hi-AFA is combined with NFormer, the mAP and Rank-1 are improved to 91.1% and 94.0%, significantly outperforming all others. Compared with SCSN [12] and HAT [47], which aggregate information via cascaded attentions or transformers, the superiority of Hi-AFA is obvious: the mAP and Rank-1 are improved by 3.9%/1.5% and 0.7%/1.3%. Both of them have to bear a heavy computation burden to mine diverse features, whereas Hi-AFA achieves this with the simple but effective hierarchical feature aggregation and FSO.

3) Results on MSMT17

Our Hi-AFA achieves the best mAP (71.9%) and Rank-1 (87.6%) among all previous competitors. The previous best is TransReID [74], which reports 69.4% mAP and 86.2% Rank-1 accuracy. Although TransReID benefits from its transformer-based learning structure, Hi-AFA outperforms it by 2.5%/1.4%. On top of that, a much higher performance of 76.7% mAP and 90.2% Rank-1 accuracy can be obtained by Hi-AFA+NFormer. From Table 1, we can also observe that Hi-AFA has an obvious advantage over other multi-branch feature learning based and attention-based models. Take the feature learning based AdaSP [67] for example: its mAP and Rank-1 are 67.1% and 85.5%, and Hi-AFA exceeds it by 4.8% and 2.1%. When compared with the attention based DCA [69], the improvements are even larger. The results on MSMT17 demonstrate the scalability of Hi-AFA on such a large person re-identification benchmark.

4) Results on CUHK03

As shown in Table 1, Hi-AFA achieves the best results in terms of both mAP and Rank-1 accuracy, giving 85.4%/83.6% mAP and 87.9%/85.5% Rank-1 matching accuracy under the labeled and detected settings, respectively. The previous best was reported by APNet [72], which gives 85.3%/81.5% mAP and 87.4%/83.0% Rank-1 accuracy. The improvements are 0.1%/2.1% in mAP and 0.5%/2.5% in Rank-1 accuracy. With the support of NFormer, the results can be boosted to 88.7%/86.4% and 89.5%/88.6%. Compared to the backbone OSNet [13], Hi-AFA improves the mAP and Rank-1 accuracy by as much as 15.8% and 13.2% under the detected setting, which justifies the superiority of aggregating attentive features.

D. Ablation Study

In the following, we systematically investigate the effectiveness of each key component of Hi-AFA, namely the hierarchical feature aggregation, FSO, and LDAM, along with the final feature processing. Experiments are conducted on all four considered datasets. On CUHK03, only the labeled version (CUHK03-L) is considered, since the two types of bounding boxes come from the same source. The results are obtained with only one setting changed while the rest remain the same.

1) Effect of Hierarchical Feature Aggregation

The hierarchical feature aggregation structure plays an important role in the proposed Hi-AFA model. To investigate its effectiveness, different sub-models of Hi-AFA are evaluated. We use branch-1 of Hi-AFA as the basic model, and then gradually add the other branches to it. The Hi-AFA variant with independent branches (denoted as Hi-AFA-BrIndep) and the backbone OSNet [13] are also evaluated for comparison. Note that in Hi-AFA-BrIndep, only the first links between branches are kept and all later ones are discarded, so the branches work independently.

Table 2 shows the results of each sub-model. We observe that quite encouraging results can be achieved with branch-1 alone. For example, it gives 87.6% mAP and 95.3% Rank-1 accuracy on Market-1501, which are 2.7% and 0.5% higher than the results of the backbone OSNet [13]. By gradually adding the other branches, the re-identification performance increases accordingly on all datasets. This proves that the feature aggregation structure in Hi-AFA leads to significant performance improvements. From Table 2 we can also find that the mAP and Rank-1 of Hi-AFA-BrIndep are obviously lower than those of the full Hi-AFA with all links (i.e., branch-{1, 2, 3, 4}). This indicates that the lateral links between adjacent branches are vital to the final re-identification performance, because they force the branches to cooperate with each other to explore more potential clues. In Hi-AFA-BrIndep the branches work independently without such correspondence, so the performance drops in consequence.

TABLE 2. Results of Different Sub-Models and Backbones on the Four Considered Datasets (%). The Highest Score in Each Column is Marked in Bold.

At the bottom of Table 2, the results of Hi-AFA with two other widely used backbones, ResNet-50 [37] and DenseNet-169 [38], are also reported. We first evaluate them as baselines, and then apply Hi-AFA to these backbones. We observe that consistent improvements are achieved on both of them, which indicates that Hi-AFA is effective for different backbones. In general, DenseNet-169 performs slightly better than ResNet-50, but both are inferior to OSNet. Therefore, OSNet is our first choice of backbone.

2) Effect of FSO

To demonstrate the effect of feature suppression, we evaluate Hi-AFA with different FSO embedding strategies, including no FSO (w/o), the main architecture equipped with FSO after Conv2, Conv3, or Conv4 in the backbone network (C2, C3, and C4), and different combinations of them at consecutive stages (C2&C3, C3&C4, and C2-C4).

From the evaluation results shown in Figure 8, we can draw the following observations. (1) FSO boosts the re-identification performance effectively. With FSO embedded, both mAP and Rank-1 accuracy are obviously improved. For instance, even the weakest embedding strategy, C2, brings a 0.2% mAP gain on the Market-1501 dataset. (2) The later the stage at which FSO is embedded, the higher the performance gain. This is a natural result: it is well known that higher-stage convolutional features are more category-related than those of shallow layers. By embedding FSO into the later stages of the CNN backbone, more diverse and discriminative features can be obtained, resulting in better matching results. (3) Combining FSOs further boosts the re-identification performance. Similar to the single-stage case, the combination C3&C4 also performs better than C2&C3, again demonstrating the superiority of later feature suppression. By plugging FSO into all stages, C2-C4 gives the highest results on all datasets. Compared to the model without FSO, the improvements are 1.6%/1.2%, 2.5%/2.2%, 4.2%/2.1%, and 2.8%/2.5%, respectively. This comparison justifies the effectiveness of mining diverse features with FSO.

FIGURE 8. Performance comparison of Hi-AFA under different FSO embedding settings. w/o means without FSO, \text{C}i means FSO is embedded after the ith convolution block, and C2-C4 means after convolution blocks 2 to 4.

3) Feature Suppression Threshold Analysis

The threshold \tau in FSO controls the degree of feature suppression in Hi-AFA, so choosing a proper threshold is of vital importance. With a low threshold, too many features will be erased, which is harmful to feature learning. On the contrary, a high threshold may prevent enough features from being removed, so the branches cannot cooperate well to mine new significant ones. To carefully choose the optimal value of \tau , we conduct experiments by varying its value from 0.1 to 1 and plot the corresponding mAP and Rank-1 in Figure 9. It can be observed that the results on the four datasets generally present a similar trend. Both mAP and Rank-1 accuracy increase as \tau grows at first, and the highest scores are obtained at roughly \tau =0.7 . When \tau keeps increasing, the performance begins to degrade. Therefore, we set \tau to 0.7 for performance considerations.

FIGURE 9. Variation of mAP (a) and Rank-1 accuracy (b) with respect to parameter \tau on each dataset.

4) Effect of LDAM

In the proposed Hi-AFA, LDAM plays an important role in guiding feature learning. To investigate its effectiveness, we conduct comparative experiments of Hi-AFA with and without LDAM. Under the setting of Hi-AFA without LDAM, all attention modules are removed for a clean comparison. The result is shown in Figure 10 (a). It can be found that Hi-AFA consistently outperforms the model without LDAM by a large margin. With the guidance of LDAM, the mAP is improved by 2.0%, 1.4%, 3.7%, and 3.1%, and the Rank-1 accuracy is promoted by 1.3%, 1.1%, 1.8%, and 2.8% on each dataset. This demonstrates that LDAM can effectively guide Hi-AFA to learn discriminative and robust features for cross-view matching.

FIGURE 10. Performance comparison of (a) Hi-AFA with (w/) and without (w/o) LDAM, (b) the \mathcal {I} and \mathcal {R} feature sets.

In addition to the experiments with and without LDAM, three other attention modules, CBAM [55], RGA [34], and Nonlocal [77], are also compared with LDAM. We use the same Hi-AFA architecture and replace LDAM with these attentions. The performance comparison is shown in Table 3. We can observe that RGA [34] performs consistently better than the others due to its consideration of the structural relationships between human body parts. It outperforms the second best by 0.5%/0.2%, 0.5%/0.5%, 0.7%/0.3%, and 0.5%/0.6% on each dataset. Although the performance of LDAM is slightly lower than that of RGA [34], it performs better than CBAM [55] and Nonlocal [77]. Since LDAM and CBAM have similar architectures, we attribute the improvement mainly to the group convolution, which endows the attention with more flexibility.

TABLE 3. Comparison of Different Attentions (%).

Given a tensor of shape H\times W \times C , the computational complexity of LDAM is \mathcal {O}((H^{2}W^{2}+C^{2})/(gr)) , and that of CBAM is \mathcal {O}(HWC+C^{3}/r) . RGA and Nonlocal are both at the level of \mathcal {O}(H^{2}W^{2}C+HWC^{2}) , which is much higher than the former two. Owing to the group convolution in LDAM, its complexity is the lowest. In Table 3, we also present the Floating-Point Operations (FLOPs) and Parameters (Params) of each attention module. The results are obtained by feeding each attention with an input tensor of shape 32\times 24 \times 8\times 2048 . It can be found that LDAM has only 0.26M parameters and merely 0.006G FLOPs, which is quite lightweight. On the contrary, Nonlocal [77] and RGA [34] involve heavy matrix multiplications, and their FLOPs amount to as much as 32.23G and 79.89G, respectively. From the performance perspective, RGA [34] would be the best choice for guiding feature learning. However, when model size and computational cost are taken into account, lightweight attentions are preferable, and the proposed LDAM is a good compromise.

5) Effect of Final Feature Processing

In Hi-AFA, two feature sets are finally obtained, namely \mathcal {I} and \mathcal {R} . \mathcal {R} consists of all global features, which are obtained by max pooling and DropBlock. Features in \mathcal {I} fall into two groups: one is obtained by applying BNNeck to the features in \mathcal {R} , and the other contains the spatially and channel-wise partitioned local features. To investigate the effect of this combination of global features and spatial and channel-wise local features, we conduct experiments with \mathcal {I} , \mathcal {R} , and \mathcal {I}\cup \mathcal {R} , respectively. The results are presented in Figure 10 (b). It can be seen that much higher performance is obtained with \mathcal {I} than with \mathcal {R} , which means that the diverse local features are more discriminative than the global ones. Besides, \mathcal {I}\cup \mathcal {R} significantly outperforms \mathcal {I} or \mathcal {R} alone, demonstrating the importance of the joint usage of global and local features. Note that we apply the identification loss to features in \mathcal {I} and the ranking loss to \mathcal {R} , and both sets are supervised by the center loss. In this way, the features are fully utilized and the advantages of the different losses are fully exploited.

To validate the effectiveness of DropBlock and the channel-wise features, we first use all features except \{\boldsymbol {g}_{drop}, \boldsymbol {c}^{1},\boldsymbol {c}^{2}\} as the baseline, and then add \boldsymbol {g}_{drop} , \{\boldsymbol {c}^{1},\boldsymbol {c}^{2}\} , and both of them for evaluation. The results are shown in Table 4. It can be found that DropBlock and the channel features each bring a certain performance gain. When both \boldsymbol {g}_{drop} and the channel-wise features are added, the mAP and Rank-1 accuracy are improved by 0.3%/0.2%, 0.5%/0.2%, 0.8%/0.6%, and 1.1%/1.4% on each dataset. This indicates that better generalization can be obtained with them.

TABLE 4. Comparison of Performance With Different Feature Settings (%).

E. Visualization of Attention Maps

To investigate the image regions attended by each attention module, we use Grad-CAM [78] to visualize the attention maps for qualitative analysis. In all branches, the attention maps after each attention module are generated. As shown in Figure 11, the attentions at convolution block 2 are relatively coarse, with multiple parts of high importance in every attention map. When going deeper, they become more concentrated, forming a few blobs on salient parts. For attention maps at the same stage, the attended areas are generally consistent but differ from each other in detail. Take B1C4, B2C4, and B3C4 in the last row for example: besides the commonly highlighted legs, they focus on the left shoulder, the head, and the right elbow, respectively. This demonstrates the capability of the different branches to mine diverse salient features, which greatly helps to distinguish visually similar pedestrians in the person re-identification task.

FIGURE 11. Visualization of attention maps in Hi-AFA. \text{B}i\text{C}j indicates the attention map of the jth convolution block in branch-i.

F. Model Complexity

The idea of learning diverse features via a multi-branch architecture is quite popular in person re-identification, as it enables networks to focus on different person features in individual branches. However, such a branching strategy brings higher computational cost while boosting re-identification performance. Although our Hi-AFA also embraces the branching strategy, the reduction of computational complexity is considered from the start: both the backbone and the attention module require far fewer parameters. In Table 5, the complexity and model size of Hi-AFA, some other branching models, and the backbone OSNet [13] are listed in terms of FLOPs, Params, and memory size. The proposed Hi-AFA has only 12.76M parameters, consumes 55.83MB of memory, and requires about 2.24G FLOPs. Although it is about 6 times larger than the backbone OSNet [13], Hi-AFA is still quite slim compared to other branching models.

TABLE 5. Comparison of Model Size and Complexity.

SECTION V.

Conclusion

In this paper, we present a novel Hierarchical Attentive Feature Aggregation (Hi-AFA) network to address the challenging person re-identification task. In Hi-AFA, features are aggregated not only along the depth but also across the parallel branches. In this way, the branches can work together to mine more diverse and richer features for fine-grained recognition. To guide feature learning, we design a lightweight dual attention module that requires far fewer parameters. With the aim of capturing the essential person features, we extract global, channel-based, and multi-granularity part-based features from the distinct branches. Due to the use of a lightweight backbone and attention module, the overall model complexity of Hi-AFA is kept at a lower level than state-of-the-art models, while superior or comparable performance is obtained on four mainstream person re-identification datasets. Ablation analysis is also performed to gain insight into the proposed model. The backbone of Hi-AFA is not restricted to OSNet; other lightweight deep convolutional models can also be utilized. In future work, we will continue the research on more effective and lighter person re-identification models.
