HAGN: Hierarchical Attention Guided Network for Crowd Counting

In recent years, deep learning based crowd counting networks have achieved significant progress. However, most of them generate coarse crowd density maps because they estimate the crowd distribution from low-resolution features, which degrades counting performance. To solve this problem, we propose a Hierarchical Attention Guided Network (HAGN) for crowd counting. We apply the first 13 layers of VGG-16 to extract base features. Then, the extracted features are processed by the Hierarchical Attention Mechanism (HAM), which guides the extracted features to enlarge step by step via our proposed attention guided branch. Finally, the outputs of HAM are fed to a $1\times 1$ convolutional layer for final crowd density estimation. Experiments are performed on the ShanghaiTech and UCF-QNRF datasets, and our HAGN achieves promising performance compared with other state-of-the-art methods on both crowd counting and crowd localization.


I. INTRODUCTION
Crowd counting is the basis of crowd analysis and scene understanding [1], [2]. However, obtaining accurate crowd counts in realistic application scenarios is challenging because the heads in a crowd vary widely in scale.
Many efforts have been made to address the scale variation problem. One approach adopts different network structures to model scale variation: Multi-column Convolutional Neural Network (MCNN) [25], Switching Convolutional Neural Network (Switching CNN) [26] and Contextual Pyramid CNN (CP-CNN) [57] use multi-column convolutional neural networks with different structures. However, multi-column convolutional networks generate redundant features and consume a lot of computation [29]. Another approach adopts multi-scale extraction modules to model scale information; for example, Context-Aware Network (CAN) [20] uses spatial pyramid pooling to obtain contextual information for scale variation of the crowd. Although the above works achieve strong performance, most of them directly upsample the network output to regress the final crowd density estimation, which generates a coarse density map and affects the final counting performance.
Therefore, in this paper, we propose a Hierarchical Attention Guided Network (HAGN) to generate high-resolution crowd density maps for crowd counting. We first apply the first 13 layers of VGG-16 to extract features. Then, the extracted features are fed along two separate paths. On one path, the extracted features are sent to an attention module to obtain global context information. This global context information then guides the extracted features to enlarge step by step via our proposed hierarchical attention mechanism (HAM), which consists of three attention guided branches placed in a cascade. Finally, the original-size features with global context information are fed to a 1 × 1 convolutional layer for final crowd counting.
The associate editor coordinating the review of this manuscript and approving it for publication was Abdullah Iliyasu.
We perform experiments on ShanghaiTech and UCF-QNRF datasets, and the results of the experiment demonstrate the effectiveness of our method. We also evaluate the crowd localization of our method on UCF-QNRF dataset, and we achieve superior results against the other methods.
In summary, the contributions of our method are as follows:
1) We propose the HAGN, which adopts the base feature of large receptive fields to guide the density map regression process step by step and to predict high-resolution crowd density maps for crowd counting.
2) We propose the HAM module, which contains three attention guided branches to supplement rich contextual information for the generation of crowd density maps.
3) The proposed HAGN achieves good performance on ShanghaiTech Part A/B and UCF-QNRF datasets compared with other state-of-the-art methods.
The remainder of this paper is organized as follows: Section II introduces previous progress on crowd counting methods and attention modules. Section III introduces HAGN and HAM. The experiments and ablation studies are presented in Section IV. We draw the conclusion of this paper in Section V.

II. RELATED WORK

A. CROWD COUNTING
Recently, the deep learning methods represented by CNNs have developed rapidly and have achieved great success in various fields of computer vision.
Many excellent crowd counting methods have emerged in recent years. The improvements mainly concern network structure [1]- [3], [5]- [11], [13]- [23], [30], [31], [41], [44], [49], [50], [52], [54], loss function design [4], [12], [27], more challenging crowd counting datasets [24], [51], efficient crowd counting frameworks [32], and so on.
Zhou et al. [1] proposed a network structure with deformable convolution for crowd counting tasks. Sang et al. [2] proposed a new model by improving the Scale-adaptive CNN (SaCNN) architecture with a backbone of fixed small receptive fields [43]. Cheng et al. [3] proposed a learning strategy named Multi-column Mutual Learning (McML) for multi-column networks, which effectively addresses the multi-scale learning problem of the network, uses fewer parameters, and is less prone to overfitting. Sindagi and Patel [5] proposed a counting method that fuses multi-level and multi-directional information from multi-layer networks. Zou et al. [6] proposed a "temporal channel-aware" (TCA) block, which exploits temporal interdependence for crowd counting in video sequences. Hossain et al. [7] posed a new problem termed single scene-specific crowd counting, together with a corresponding learning algorithm. Shi et al. [8] proposed a novel attention module that guides the counting network to the area of interest with supervised segmentation. Xu et al. [9] proposed the Learning to Scale Module (L2SM) to solve the density variation issue in crowd counting. Xiong et al. [10] proposed the Spatial Divide-and-Conquer Network (S-DCNet), which performs well on counting datasets across multiple categories. Liu et al. [11] proposed the Deep Structured Scale Integration Network (DSSINet), which responds to crowd density changes through structured, multi-level feature learning and a corresponding loss function optimization.
Yan et al. [13] proposed Perspective-Guided Convolution Networks (PGCNet), which address the perspective effect on crowd density changes in large scenes. Liu et al. [14] proposed the Recurrent Attentive Zooming Network, which handles crowded scenes by zooming blurred areas to a high resolution and training with a specific loss function. Lian et al. [15] proposed the regression guided detection network (RDNet), which estimates the number and location of heads in a crowd scene using RGB-D data. Wan et al. [16] proposed a residual regression framework that uses relevant information between samples. Zhang and Chang [17] proposed a novel network that combines the information of multiple cameras to estimate the crowd density of a scene. Zhao et al. [18] proposed a crowd counting method that uses heterogeneous attributes to handle varying scales, complex backgrounds, and occlusion. Jiang et al. [19] proposed Trellis Encoder-Decoder Networks (TEDnet), which estimate a high-quality crowd density map for counting. Liu et al. [20] proposed a novel network that encodes context information at different scales. Shi et al. [21] proposed the perspective-aware convolutional neural network (PACNN), which adds perspective information to the estimation to handle multi-scale issues effectively. Liu et al. [22] proposed a novel detection-based network that estimates crowd density by means of detection. Liu et al. [23] proposed an attention-injective deformable convolutional network called ADCrowdNet, which addresses the low precision in noisy scenes with high-density crowds. Sindagi and Patel [52] proposed the Hierarchical Attention-based Crowd Counting Network (HA-CCN), which employs attention mechanisms at multiple levels. Chen et al. [54] proposed an accurate and compact CNN-based model that can be deployed on embedded devices.
Cheng et al. [4] proposed the SPatial Awareness Network (SPANet) and a loss function called Maximum Excess over Pixels Loss, which effectively combine spatial context information for the crowd counting task. Ma et al. [12] proposed the Bayesian loss, which measures the density contribution by building a probability model from the annotation information. Gao et al. [32] designed an open source crowd counting framework based on PyTorch named the C$^3$ Framework.
For the crowd localization task, Zheng et al. [64] first obtained the density map using a sliding window on the image and then used integer programming to locate objects on the density maps. Liu et al. [55] proposed a reliable end-to-end network called DecideNet that estimates the localization of crowd heads and the crowd density by detection and regression, respectively. Liu et al. [14] proposed the Recurrent Attentive Zooming Network and Lian et al. [15] proposed the density map regression guided detection network (RDNet), both of which can perform crowd counting and localization simultaneously. Liu et al. [22] proposed a new deep detection network based on point supervision, which can simultaneously detect the localization of crowd heads and count in a crowded scene.
VOLUME 8, 2020
FIGURE 1. The extracted base feature is delivered to one 3 × 3 convolution layer and fused with the attention refined feature map. There is one residual connection after feature fusion. There is one upsampling layer after every attention guided branch. Finally, there is one 1 × 1 convolution layer and one upsampling layer for getting the crowd density map.
Different from the above crowd counting methods, we propose HAGN, which reuses the base feature from the first 13 layers of VGG-16 and recovers the density map through the hierarchical attention mechanism.

B. ATTENTION MODULE
Attention modules are often used in visual tasks [60]. Mnih et al. [33] were the first to add attention information to an RNN-based network. Xu et al. [34] proposed a new pooling method that uses attention weights to select regions of interest or spatial features. Vaswani et al. [35] proposed the transformer structure, which effectively captures global dependencies. Chen et al. [36] proposed a novel model that obtains multi-level attention information. Wang et al. [37] proposed a network architecture that computes the response at each point by incorporating the weighted information of all points into the feature map. Hou et al. [39] proposed a novel attention distillation approach to enhance the representation learning of CNN-based lane detection models.
CBAM [38] and SEblock [40] are among the most widely used attention modules in recent years. Woo et al. [38] proposed the Convolutional Block Attention Module (CBAM), which computes attention along the spatial and channel dimensions in the forward pass of convolutional neural networks. Hu et al. [40] proposed the Squeeze-and-Excitation block (SEblock), which adaptively adjusts the channel feature responses by modeling the dependencies between channels.

III. METHOD
In this section, we first give an overview of the proposed HAGN. Then, we describe the details of the hierarchical attention mechanism used in the HAGN and the associated attention modules.

A. OVERVIEW
The overview of HAGN is shown in Fig.1. The HAGN mainly consists of two parts: the encoder part and the decoder part. The details of HAGN are shown in Table 1.
Encoder part: The first 13 convolutional layers of VGG-16 are used as the local feature extractor.
Decoder part: We design the attention guided branch to recover the loss of contextual information, as shown in Fig.1 and Fig.2. Each attention guided branch concatenates feature maps from the output of the attention module and the upper layer. Besides, we add a residual connection after the fusion of the feature maps at each branch. The proposed HAGN adopts a step-by-step upsampling mode, and the ablation studies on different upsampling modes are shown in Section IV-E. Finally, the high-resolution crowd density maps are regressed by one 1 × 1 convolutional layer.
FIGURE 2. Illustration of the attention guided branch. The extracted base feature $F$ is fed to the attention module and transformed into $F_A$. On the other hand, the extracted base feature $F$ is fed to one 3 × 3 convolutional layer and transformed into $F'$. $F_A$ is fed to a 3 × 3 convolutional layer and one upsampling layer and transformed into $F_{RA}$, which has the same resolution as $F'$. $F''$ represents the attention refined feature map, which is the element-wise multiplication of $F'$ and $F_{RA}$. Finally, we get the refined feature map $F_R$ by element-wise summation of $F'$ and $F''$.

B. HIERARCHICAL ATTENTION MECHANISM
The hierarchical attention mechanism used in the proposed HAGN mainly consists of three attention guided branches, as shown in Fig.2. The workflow of each attention guided branch is as follows:
$$F'' = F' \otimes F_{RA}$$
$$F_R = F' \oplus F''$$
where $F$ represents the input feature map. $F'$ represents the feature map predicted by the 3 × 3 convolution layer. $F_A$ represents the refined feature map predicted by the attention module. $F_{RA}$ represents the refined feature map predicted by one 1 × 1 convolutional layer and one upsampling layer. $\otimes$ represents element-wise multiplication. $F''$ represents the attention refined feature map, which is the element-wise multiplication of $F'$ and $F_{RA}$. $\oplus$ represents element-wise summation. $F_R$ represents the final refined feature map, which is the element-wise summation of $F'$ and $F''$.
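The fusion step of one attention guided branch can be sketched in a few lines. This is a minimal pure-Python illustration under stated assumptions, not the authors' implementation: the convolution and the attention module are assumed to have been applied already, nearest-neighbour interpolation stands in for the upsampling layer (the interpolation method is not specified here), and only a single channel is shown. The names `upsample_nearest`, `fuse_branch`, `conv_feat` and `attn_feat` are hypothetical.

```python
def upsample_nearest(fmap, factor):
    """Nearest-neighbour upsampling of a 2-D map stored as a list of rows."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(factor)]     # repeat columns
        out.extend(list(wide) for _ in range(factor))      # repeat rows
    return out


def fuse_branch(conv_feat, attn_feat, factor):
    """Attention-guided fusion: conv_feat + conv_feat * upsample(attn_feat).

    conv_feat plays the role of the convolved feature map, and attn_feat
    the role of the low-resolution attention output (both assumed given).
    """
    guide = upsample_nearest(attn_feat, factor)
    if len(guide) != len(conv_feat) or len(guide[0]) != len(conv_feat[0]):
        raise ValueError("upsampled attention map must match conv_feat")
    # Element-wise multiplication followed by the residual summation.
    return [[c + c * g for c, g in zip(c_row, g_row)]
            for c_row, g_row in zip(conv_feat, guide)]


conv_feat = [[1.0, 2.0, 3.0, 4.0],
             [5.0, 6.0, 7.0, 8.0]]      # 2 x 4 feature map
attn_feat = [[0.5, 0.0]]                # 1 x 2 attention map
refined = fuse_branch(conv_feat, attn_feat, factor=2)
```

Where the attention weight is zero, the residual connection passes the convolved feature through unchanged, so the attention path can only add emphasis and never erases the base signal.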
For the attention module, we apply CBAM and SEblock, respectively, to obtain contextual information. The CBAM module encodes spatial-wise and channel-wise attention information at the same time. The operation process of CBAM is as follows:
$$U = M_C(F) \otimes F$$
$$F_A = M_S(U) \otimes U$$
where the refined feature map $F_A$ obtains channel-wise attention information and then spatial-wise attention information, in that order. $F$ represents the input feature map. $M_C(F)$ represents the channel attention feature map predicted by the channel attention module of CBAM. $U$ represents the refined feature map, which is the element-wise multiplication of $F$ and $M_C(F)$. $M_S(U)$ represents the spatial attention feature map predicted by the spatial attention module of CBAM. $F_A$ represents the channel and spatial refined feature map, which is the element-wise multiplication of $U$ and $M_S(U)$.
The SEblock module encodes channel attention information, as shown in Fig.3 (b):
$$S_A = C(S) \otimes S$$
where $S$ represents the input feature map. $C(S)$ represents the channel attention feature map predicted by the channel attention module of SEblock. $S_A$ represents the refined feature map, which is the element-wise multiplication of $S$ and $C(S)$.
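To make the channel attention of an SE-style block concrete, the following sketch computes a per-channel gate and rescales the input with it. It is a simplified illustration, not the configuration used in HAGN: biases are omitted, the weight matrices `w1` and `w2` are hypothetical inputs supplied by the caller, and the feature map is a plain nested list of shape channels × height × width.

```python
import math


def se_channel_attention(feat, w1, w2):
    """Squeeze-and-excitation style channel gating.

    feat: list of C channels, each a list of rows.
    w1:   reduction weights, shape [C_reduced][C].
    w2:   expansion weights, shape [C][C_reduced].
    Returns (scaled feature map, per-channel gates).
    """
    # Squeeze: global average pooling gives one scalar per channel.
    squeezed = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
                for ch in feat]
    # Excitation: fully connected -> ReLU -> fully connected -> sigmoid.
    hidden = [max(0.0, sum(w * s for w, s in zip(w_row, squeezed)))
              for w_row in w1]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(w_row, hidden))))
             for w_row in w2]
    # Scale: multiply every value of a channel by that channel's gate.
    scaled = [[[g * v for v in row] for row in ch]
              for ch, g in zip(feat, gates)]
    return scaled, gates


feat = [[[2.0, 2.0], [2.0, 2.0]],      # channel 0
        [[0.0, 0.0], [0.0, 0.0]]]      # channel 1
w1 = [[1.0, -1.0]]                     # 2 channels -> 1 hidden unit
w2 = [[0.0], [1.0]]                    # 1 hidden unit -> 2 channels
scaled, gates = se_channel_attention(feat, w1, w2)
```

Because the gate is a single scalar per channel, this form of attention reweights whole channels but carries no spatial information; CBAM adds a spatial attention map on top of the channel gate.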

C. LOSS FUNCTION
The L2 loss is chosen as the loss function:
$$L = \frac{1}{N}\sum_{i=1}^{N} \left\| f(x_i) - Y_i \right\|_2^2$$
where $L$ represents the loss of the HAGN, $N$ represents the number of training images, $i$ is the image index, $x_i$ represents the $i$-th training image, $Y_i$ represents the $i$-th ground truth density map, and $f(x_i)$ represents the $i$-th density map predicted by the HAGN.
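A pixel-wise L2 loss over density maps can be sketched as below. The normalization is an assumption: this sketch averages the summed squared error over the N images, whereas some implementations use 1/(2N) instead, which only rescales the gradient. The function name `l2_loss` and the nested-list representation are illustrative.

```python
def l2_loss(predicted_maps, target_maps):
    """Mean over images of the summed squared pixel error.

    Both arguments are lists of 2-D density maps (lists of rows) with
    matching shapes. The 1/N normalization is an assumption.
    """
    if len(predicted_maps) != len(target_maps) or not predicted_maps:
        raise ValueError("need the same non-zero number of maps")
    total = 0.0
    for pred, target in zip(predicted_maps, target_maps):
        total += sum((p - t) ** 2
                     for p_row, t_row in zip(pred, target)
                     for p, t in zip(p_row, t_row))
    return total / len(predicted_maps)


predicted_maps = [[[1.0, 2.0]], [[0.0, 0.0]]]
target_maps = [[[0.0, 2.0]], [[0.0, 2.0]]]
loss = l2_loss(predicted_maps, target_maps)
```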

IV. EXPERIMENT
In this section, we first introduce datasets, evaluation metrics, and implementation details. Then, we show the experimental results.
• UCF-QNRF dataset [51] consists of 1535 high-resolution crowd images with altogether 1.25 million labels, which includes 1201 training images and 334 test images. The number of crowd heads in images varies from 49 to 12865.
The mean absolute error (MAE) and mean squared error (MSE) are chosen as the evaluation metrics to evaluate the performance of our method. The definitions of MAE and MSE are as follows:
$$MAE = \frac{1}{N}\sum_{i=1}^{N} \left| y_i - \hat{y}_i \right|$$
$$MSE = \sqrt{\frac{1}{N}\sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2}$$
where $N$ denotes the number of all images in the dataset, and $y_i$ and $\hat{y}_i$ represent the ground-truth and predicted crowd count in the $i$-th image, respectively.
Following [14], [22], [51], we adopt the precision, recall and F-measure to evaluate the localization performance of our method.
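The counting metrics can be computed from per-image counts as in the sketch below. It assumes the convention common in the crowd counting literature, where the reported MSE is the square root of the mean squared count error; the function names and the sample counts are illustrative only.

```python
import math


def mae(true_counts, predicted_counts):
    """Mean absolute error between per-image crowd counts."""
    errors = [abs(y - y_hat) for y, y_hat in zip(true_counts, predicted_counts)]
    return sum(errors) / len(errors)


def mse(true_counts, predicted_counts):
    """Root of the mean squared count error (crowd counting convention)."""
    squared = [(y - y_hat) ** 2 for y, y_hat in zip(true_counts, predicted_counts)]
    return math.sqrt(sum(squared) / len(squared))


true_counts = [10, 20]          # illustrative ground-truth counts
predicted_counts = [13, 16]     # illustrative predictions
print(mae(true_counts, predicted_counts), mse(true_counts, predicted_counts))
```

MAE reflects the accuracy of the estimates, while the squared term makes MSE more sensitive to a few images with large errors.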

B. IMPLEMENTATION DETAILS
All images are uniformly resized to 576 × 768. We choose the Adam algorithm as the optimizer. The initial learning rate is set to $1\times 10^{-5}$. The learning rate decay is set to 0.995. All experiments are performed on one NVIDIA GTX TITAN Xp. The batch size is set to 2, 6, and 1 on the ShanghaiTech Part A, ShanghaiTech Part B and UCF-QNRF datasets, respectively.
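One plausible reading of these settings is an exponential schedule in which the learning rate is multiplied by the decay factor once per epoch. The sketch below illustrates that reading only; whether the decay is applied per epoch or per iteration is an assumption, and `exponential_lr` is a hypothetical helper name.

```python
def exponential_lr(epoch, base_lr=1e-5, decay=0.995):
    """Learning rate after `epoch` decay steps (per-epoch decay is assumed)."""
    if epoch < 0:
        raise ValueError("epoch must be non-negative")
    return base_lr * decay ** epoch


# Learning rates for the first three epochs under this assumption.
schedule = [exponential_lr(epoch) for epoch in range(3)]
```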
The qualitative results are displayed in Fig.4. From left to right, the columns show the input images, the ground truth, the results of HAGN, and the results of the pretrained VGG-16.

D. CROWD LOCALIZATION
We perform the crowd localization task on the UCF-QNRF dataset. First, we apply non-maximum suppression to the crowd density map to extract the local maxima, which are output as the head locations. Once the local maxima are found, we use 1-1 matching between predicted head locations and ground truth locations to calculate precision and recall. We use different pixel thresholds to evaluate the matching results. If the distance between a predicted head location and its matched ground truth location is less than the pixel threshold, the prediction is marked as a True Positive. If the distance is greater than the pixel threshold, the prediction is marked as a False Positive. A ground truth location without a match is marked as a False Negative. For the evaluation on the UCF-QNRF dataset, we follow [51] and report the average precision and average recall over the distance thresholds $t_d$ = 1, 2, 3, . . . , 100 pixels.
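The localization evaluation described above can be sketched as follows. Several details are assumptions rather than the protocol of [51]: local maxima are taken over a 3 × 3 neighbourhood, the 1-1 matching is a greedy nearest-pair assignment (an optimal assignment such as the Hungarian algorithm is another option), and a distance exactly equal to the threshold is not counted as a match. The names `local_maxima` and `evaluate_localization` are hypothetical.

```python
import math


def local_maxima(density, min_value=0.0):
    """Return (row, col) of pixels above min_value that are >= all neighbours."""
    height, width = len(density), len(density[0])
    peaks = []
    for r in range(height):
        for c in range(width):
            value = density[r][c]
            if value <= min_value:
                continue
            neighbours = [density[rr][cc]
                          for rr in range(max(0, r - 1), min(height, r + 2))
                          for cc in range(max(0, c - 1), min(width, c + 2))
                          if (rr, cc) != (r, c)]
            if all(value >= n for n in neighbours):
                peaks.append((r, c))
    return peaks


def evaluate_localization(predicted, ground_truth, threshold):
    """Greedy 1-1 matching; returns (precision, recall, f_measure)."""
    pairs = sorted((math.hypot(p[0] - g[0], p[1] - g[1]), i, j)
                   for i, p in enumerate(predicted)
                   for j, g in enumerate(ground_truth))
    used_pred, used_gt = set(), set()
    for distance, i, j in pairs:
        if distance >= threshold:      # remaining pairs are too far apart
            break
        if i in used_pred or j in used_gt:
            continue                   # each point is matched at most once
        used_pred.add(i)
        used_gt.add(j)
    true_pos = len(used_pred)
    false_pos = len(predicted) - true_pos      # unmatched predictions
    false_neg = len(ground_truth) - true_pos   # unmatched ground truth
    precision = true_pos / (true_pos + false_pos) if predicted else 0.0
    recall = true_pos / (true_pos + false_neg) if ground_truth else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


density = [[0.0, 0.1, 0.0, 0.0],
           [0.1, 0.9, 0.1, 0.0],
           [0.0, 0.1, 0.0, 0.2],
           [0.0, 0.0, 0.2, 0.8]]
heads = local_maxima(density)
ground_truth = [(1, 2), (3, 3), (0, 3)]
precision, recall, f_measure = evaluate_localization(heads, ground_truth, threshold=2)
```

Averaging the precision and recall returned for each threshold from 1 to 100 pixels gives the averaged scores reported for this task. Note that flat plateaus in the density map would yield several adjacent peaks under this simple rule.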
The localization results are shown in Table 3. Our proposed method outperforms MCNN [25], ResNet74 [61], DenseNet63 [62], Encoder-Decoder [63] and CL [51] in terms of average precision and F-measure and ranks second in terms of average recall, which demonstrates the effectiveness of our method.
The qualitative results are shown in Fig.5, where red points represent ground truth locations, and green dots represent head locations estimated by our method.

E. ABLATION STUDIES
We perform two ablation experiments on UCF-QNRF dataset. The details are shown in Table 4 and Table 5.

1) ABLATION EXPERIMENTS ON NETWORK STRUCTURE
We explore the selection of attention modules and upsampling modes of HAGN for crowd counting; the results are displayed in Table 4. SU denotes step-by-step upsampling after every attention guided branch, as shown in Fig.1. FU denotes a single final upsampling at the end of the network, as shown in Fig.6 (b).
The VGG-16 baseline model uses the first 13 convolutional layers of VGG-16 as the feature extraction module, as shown in Fig.6 (a). VGG-16+SE+SU is the model which adopts the SEblock attention module in each attention guided branch and the step-by-step upsampling mode. VGG-16+SE+FU is the model which adopts the SEblock attention module and the final upsampling mode. VGG-16+CBAM+FU is the model which adopts the CBAM attention module and the final upsampling mode.
From Table 4, we see that HAGN gets better results than VGG-16+SE+FU: 107.7/183.1 vs. 112.5/191.3 in terms of MAE/MSE. It indicates that the CBAM module is more effective than the SEblock module at encoding contextual information, because the CBAM module adopts both spatial and channel attention mechanisms for modeling contextual information, which helps to improve the crowd counting accuracy. The HAGN and the VGG-16+SE+SU model get better results than the VGG-16+CBAM+FU model and the VGG-16+SE+FU model. It indicates that the step-by-step upsampling mode is more effective than the final upsampling mode, because the step-by-step upsampling mode integrates the contextual information in the process of generating the crowd density maps, which increases the ability of the network to recognize the crowd.

2) ABLATION EXPERIMENTS ON FEATURE EXTRACTION SELECTION
The ablation experiments on feature extraction selection are shown in Table 5. "HAGN w/ the first 10 layers of VGG-16" is a model that has the same decoder network structure as HAGN but adopts the first 10 layers of VGG-16 to extract the base feature. The HAGN gets a better result than the model using the first 10 convolution layers of VGG-16. It indicates that the base feature from the first 13 convolution layers of VGG-16 is more effective, because the receptive field of the first 13 convolution layers is larger than that of the first 10 convolution layers, which captures more contextual information and guides the network to generate crowd density maps more effectively.

V. CONCLUSION
In this paper, we proposed an effective crowd counting network named HAGN to solve the scale variation problem in crowd counting. The HAGN consists of an encoder network, which is the first 13 layers of VGG-16, and a decoder network named HAM. The HAGN has achieved excellent performance on challenging crowd counting datasets against other state-of-the-art methods. Finally, we designed ablation experiments on attention module selection, upsampling strategy selection, and base feature extraction selection.
ZUODONG DUAN received the B.S. and master's degrees from the College of Information Science and Engineering, Shandong Agricultural University, Taian, China, in 2014 and 2016, respectively. He is currently pursuing the Ph.D. degree with the Beijing Institute of Technology. His research is about deep learning applications in computer vision and robotics, with specific interests in crowd analysis, vehicle fine-grained classification, and SLAM.
YUJUN XIE received the bachelor's degree from the College of Information Science and Technology, Beijing University of Chemical Technology, Beijing, China, in 2018. She is currently pursuing the master's degree with the Beijing Institute of Technology. Her research is about deep learning applications in computer vision and pattern recognition, with specific interests in target tracking and crowd analysis.
JIAHAO DENG was born in Laizhou, Shandong, China, in 1958. He received the B.S. and M.S. degrees in fuze technology and the Ph.D. degree in mechatronic engineering from the Beijing Institute of Technology, Beijing, in 1982, 1985, and 1998, respectively. Since 1985, he has been a Teacher with the School of Mechatronical Engineering, Beijing Institute of Technology.