Learning Multi-Level Features to Improve Crowd Counting

Crowd counting aims to estimate the number of people in an image. Recent crowd counting methods have made significant progress by employing convolutional neural networks to regress crowd density maps. One of the most challenging problems in this task is the drastic scale variation of the regions of interest in images. In this paper, a Feature Fusion Attention Network (FFANet) is proposed for crowd counting. First, the VGG16 network is adapted as the backbone of the FFANet to extract features from crowd images. The extracted features are then fused in two subsequent stages: the multi-level features are enhanced by the Feature Fusion Attention Module (FFAM) and further refined by the Residual Block (RB). Finally, the features are processed by the Compression Module (CM) to generate a density map. To demonstrate its effectiveness, the proposed algorithm is evaluated on three benchmark datasets. Comparison with other state-of-the-art methods indicates that the proposed FFANet outperforms the existing methods.


I. INTRODUCTION
Crowd counting is one of the promising applications in computer vision. It aims to estimate the number of people in an image. The predicted results can be used in a wide range of fields, such as intelligent transportation [1], public security [2], agriculture monitoring [3] and video surveillance [4]. However, crowd counting is also a highly challenging task because of occlusion, low image resolution, perspective distortion, scale variation of objects, etc. [5]. To obtain accurate results, researchers have devoted considerable attention to these issues.
Early methods for crowd counting are based on manually extracted features of the human body and various regression functions [6]. These methods usually do not perform well in dense crowd scenes where pedestrians are severely occluded or overlapped. With the development of deep learning in computer vision, CNN-based algorithms have made great progress in handling scale variation. These methods typically design architectures with filters of different sizes to extract multi-scale features [7]. However, human scales change continuously across the image, while current models can only cover a few discrete scales. As a result, these methods may miss a large number of people in an image.
Some researchers [8]-[10] have observed that a CNN's shallower layers focus on low-level texture and spatial information, which helps the model determine the location of a target, whereas the deeper layers focus on high-level contextual and semantic information, which helps the model identify the type of target. Inspired by this, we reason that fusing these multi-level features can effectively address crowd scale variation. There are two common ways to realize feature fusion: merging the features along the channel axis, or element-wise summation. The defect of both is that they cannot effectively exploit the information contained in the features, which wastes computation. In this paper, we introduce the FFAM to enhance the information in the multi-level features. First, the FFAM utilizes the contextual and semantic information contained in the high-level features to enhance channel-wise information in the low-level features. Second, the spatial information of the processed low-level features is extracted by the FFAM to enhance the high-level features. Third, the features processed in the first two steps are concatenated along the channel axis. Experiments show that the FFAM significantly improves the accuracy of crowd counting. Fig. 1 shows a crowd image and the density maps estimated by the proposed FFANet and by SANet [11]. Compared with SANet, our result deviates less from the ground truth. Observing the spatial distribution of the density maps, SANet cannot handle the drastic scale variation of the regions of interest in crowd images (as shown in the red box). In contrast, the proposed FFANet handles the scale variation well, and the spatial distribution of its estimated density map is very similar to the ground truth.
In conclusion, compared with SANet, the FFANet proposed in this paper achieves a significant improvement both in handling the drastic scale variation in crowd images and in counting accuracy.
In summary, the contributions of this paper are as follows: 1) A new end-to-end multi-level feature fusion network is proposed to enhance the network's robustness to scale variations in crowd images. 2) The proposed FFAM is utilized to enhance the spatial and semantic information exchanged between multi-level features in the FFANet. The remainder of the paper is organized as follows. After the related work discussion in Section II, we cover the details of the proposed method in Section III. Section IV introduces the experimental designs and discusses the results. We conclude with a short discussion in Section V.

II. RELATED WORK
A. CROWD COUNTING
In recent years, crowd counting methods have made great progress by employing convolutional neural networks to regress crowd density maps. Researchers have designed a variety of efficient convolutional neural networks to address scale variation [5]. The remainder of this section describes multi-column models and single-column models according to their network structure.

1) MULTI-COLUMN MODELS
Zhang et al. [7] proposed a Multi-column Convolutional Neural Network (MCNN) to overcome the scale variation in images. Each column is composed of different filters to obtain features at various scales. Inspired by MCNN, Onoro-Rubio et al. [12] designed a scale-aware counting model that can predict the crowd distribution and count without perspective information, by extracting features from the image at different resolutions to overcome perspective distortion. Sam et al. [13] proposed Switch-CNN, which trains a classifier to choose the optimal branch of a multi-column network for each crowd image patch. SANet, designed by Cao et al. [11], uses scale aggregation modules to solve scale variation and exploits these features to generate high-resolution density maps. Guo et al. [14] explored scale-aware attention fusion with various dilation rates to capture different visual granularities of crowd regions and utilized deformable convolutions to generate high-quality density maps. Recently, Gao et al. [15] proposed a Perspective Crowd Counting via Spatial Convolutional Network (PCC Net) to handle high appearance similarity, perspective changes and severe congestion.

2) SINGLE-COLUMN MODELS
CSRNet [16] uses dilated convolution layers to expand the receptive field and replace pooling operations. Owing to these designs, CSRNet can easily generate high-quality density maps. Shi et al. [17] proposed a Perspective Aware Convolutional Neural Network (PACNN), which incorporates perspective information into crowd estimation and effectively handles scale variation. ADCrowdNet [18] consists of two CNNs: an attention-aware network first detects the crowd regions in the image and estimates their congestion degree; based on the detected crowd area, a multi-scale deformable network then generates high-quality density maps. More recently, Jiang et al. [19] proposed an effective Multi-Level Convolutional Neural Network (MLCNN) architecture that first adaptively learns multi-level density maps and then fuses them to predict the final output.

B. ATTENTION MODELS
Attention models were first applied in the field of machine translation and have since spread to many deep learning fields such as object detection [20], image classification [21], image segmentation [22] and face recognition [23]. Hu et al. [24] proposed a lightweight attention mechanism named SENet, which automatically learns the importance of each feature channel to enhance useful features. CBAM [25] and BAM [26] combine the channel attention mechanism with the spatial attention mechanism. Li et al. [27] proposed a dynamic selection mechanism that allows each neuron to select a different receptive field according to the size of the target. Cao et al. [28] combined the advantages of Non-local [29] and SENet and proposed a new global context module, achieving significant results on computer vision tasks.
Multi-column models typically design architectures with filters of different sizes to extract multi-scale features. However, human scales change continuously across the image, while current models can only cover a few discrete scales; as a result, these methods may miss a large number of people in an image. Single-column models need to rely on additional information or auxiliary networks to solve scale variation; for example, PACNN needs to generate perspective maps and train the corresponding network branches, which leads to additional work and increases the difficulty of model training. Therefore, building on the above works, we design a simple and effective network based on multi-level feature fusion to solve the problem of scale variation in crowd counting.

III. METHOD
In this section, we first introduce the structure of the proposed FFANet. Then, we introduce the associated modules and the loss function. Fig. 2 shows the details of the network structure and Table 1 lists its configuration. The FFANet consists of a backbone network, feature fusion attention modules, residual blocks and a compression module.
A. NETWORK ARCHITECTURE
Backbone: the backbone network is a pre-trained VGG16 [31] network with the fully connected layers removed. A BN layer is added after every convolutional layer in the VGG16 network. The backbone contains 13 convolutional layers divided into 5 convolutional blocks. We take the outputs of the third to fifth convolutional blocks as the inputs of feature fusion.
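For intuition, the channel counts and spatial sizes of the block outputs used for fusion can be sketched as below for a 400 × 400 crop. This is an illustration only; it assumes each VGG16 block ends with a stride-2 max-pool (whether the fifth block's pool is kept is not stated in the text):

```python
def vgg16_block_shapes(h, w):
    """Return (channels, height, width) of each VGG16 block output."""
    channels = [64, 128, 256, 512, 512]  # standard VGG16 channels per block
    shapes = []
    for c in channels:
        h, w = h // 2, w // 2  # 2x2 max-pool with stride 2 after each block
        shapes.append((c, h, w))
    return shapes

# Blocks 3-5 feed the feature-fusion stage for a 400x400 crop.
fused = vgg16_block_shapes(400, 400)[2:5]
print(fused)  # [(256, 50, 50), (512, 25, 25), (512, 12, 12)]
```

The factor-of-two stride between consecutive block outputs is what makes the single 2x upsampling inside the FFAM sufficient to align adjacent levels.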
FFAM: the proposed FFAM fuses multi-level features from the backbone and exploits their information diversity to enhance channel-wise information in the low-level feature and spatial-wise information in the high-level feature.
RB & CM: the RB is a residual block composed of one 1 × 1 convolutional layer and two 3 × 3 convolutional layers that refine the features output by the FFAM. The compression module (CM) compresses the feature map into a single-channel crowd density map.

B. FEATURE FUSION ATTENTION MODULE
The FFAM upsamples the high-level feature F_{i+1} and concatenates it with the low-level feature F_i along the channel axis. The channel attention module uses the concatenated features to output a vector w_c that enhances channel-wise information in F_i. Fig. 4 (a) describes the structure of the channel attention module. It is formulated as

w_c = φ([U(F_{i+1}), F_i]),

where U(·) denotes the upsampling layer, [·] denotes the concatenation layer and φ(·) denotes the channel attention module. The channel-wise enhanced feature F'_i is obtained by element-wise multiplication of the vector w_c and F_i:

F'_i = w_c ⊗ F_i.

The spatial attention module processes F'_i to obtain the spatial weight that enhances the high-level feature. Fig. 4 (b) shows the structure of the spatial attention module. We define this operation as

w_s = ϕ(F'_i),

where ϕ(·) denotes the spatial attention module and w_s denotes the spatial weight. The spatially enhanced feature F'_{i+1} is obtained by element-wise multiplication of w_s and U(F_{i+1}):

F'_{i+1} = w_s ⊗ U(F_{i+1}).

Finally, the two enhanced features are concatenated along the channel axis:

F̂ = [F'_i, F'_{i+1}].
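The data flow above can be sketched in NumPy. This is an illustration only, not the authors' implementation: the channel-attention branch is reduced to global average pooling plus a single hypothetical projection matrix `w_fc`, the spatial branch to channel-wise average pooling, and all shapes are stand-ins for the learned layers:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def upsample2x(f):
    # Nearest-neighbour 2x upsampling, (C, H, W) -> (C, 2H, 2W).
    return f.repeat(2, axis=1).repeat(2, axis=2)

def ffam(f_low, f_high, w_fc):
    # f_low: (C_l, H, W); f_high: (C_h, H/2, W/2); w_fc is a hypothetical
    # learned projection of the channel-attention branch, (C_l + C_h, C_l).
    up = upsample2x(f_high)
    cat = np.concatenate([up, f_low], axis=0)           # (C_l + C_h, H, W)
    # Channel attention: global average pooling + projection + sigmoid.
    gap = cat.mean(axis=(1, 2))                         # (C_l + C_h,)
    w_c = sigmoid(gap @ w_fc)                           # (C_l,)
    f_low_e = w_c[:, None, None] * f_low                # channel-enhanced F'_i
    # Spatial attention: channel-wise average pooling + sigmoid.
    w_s = sigmoid(f_low_e.mean(axis=0, keepdims=True))  # (1, H, W)
    f_high_e = w_s * up                                 # spatially enhanced F'_{i+1}
    return np.concatenate([f_low_e, f_high_e], axis=0)  # F-hat

f_low = rng.standard_normal((256, 8, 8))
f_high = rng.standard_normal((512, 4, 4))
out = ffam(f_low, f_high, rng.standard_normal((768, 256)) * 0.01)
print(out.shape)  # (768, 8, 8)
```

Note how the output channel count is the sum of both inputs' channels, since F̂ concatenates the two enhanced features.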

C. LOSS FUNCTION
We define a joint loss function consisting of a Mean Squared Error (MSE) loss and a Structural Similarity Index (SSIM) loss. The MSE loss minimizes the Euclidean distance between the ground truth and the estimated density map. Ref. [11] reveals that the MSE loss employed by many previous methods relies on a pixel-independence hypothesis and does not consider the local correlation of the density map. Therefore, we utilize the SSIM loss as part of the loss function to improve the result. The joint loss function is defined as

L_MSE = (1/N) Σ_{i=1}^{N} ||D^P_i − D^G_i||²_2,
L_SSIM = 1 − (1/M) Σ_j SSIM(j),
L = L_MSE + λ · L_SSIM,

where N is the training batch size, D^G_i is the ground-truth density map, D^P_i is the predicted density map, M is the number of pixels in the density map and λ is the parameter used to balance L_MSE and L_SSIM. The means (µ^P_i, µ^G_i) and standard deviations (σ^P_i, σ^G_i, σ^{P_i G_i}) in the SSIM loss are calculated with a Gaussian kernel of standard deviation 1.5 within an 11 × 11 region at each position j.
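A minimal NumPy sketch of this joint loss for a single image pair follows. It is an illustration under stated assumptions: the SSIM stability constants `c1 = 1e-4` and `c2 = 9e-4` are the standard choices for signals in [0, 1] (not given in the text), and the windowed statistics use a naive valid-mode convolution for clarity rather than speed:

```python
import numpy as np

def gaussian_kernel(size=11, sigma=1.5):
    # 11x11 Gaussian window with sigma = 1.5, normalised to sum to 1.
    ax = np.arange(size) - size // 2
    k = np.exp(-(ax[:, None] ** 2 + ax[None, :] ** 2) / (2 * sigma ** 2))
    return k / k.sum()

def local_stats(x, k):
    # Naive valid-mode windowed average of x under kernel k.
    kh, kw = k.shape
    h, w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = (x[i:i + kh, j:j + kw] * k).sum()
    return out

def joint_loss(pred, gt, lam=100.0, c1=1e-4, c2=9e-4):
    # MSE term: pixel-wise squared error between the two density maps.
    l_mse = np.mean((pred - gt) ** 2)
    # SSIM term: local means/variances/covariance under the Gaussian window.
    k = gaussian_kernel()
    mu_p, mu_g = local_stats(pred, k), local_stats(gt, k)
    var_p = local_stats(pred ** 2, k) - mu_p ** 2
    var_g = local_stats(gt ** 2, k) - mu_g ** 2
    cov = local_stats(pred * gt, k) - mu_p * mu_g
    ssim = ((2 * mu_p * mu_g + c1) * (2 * cov + c2)) / \
           ((mu_p ** 2 + mu_g ** 2 + c1) * (var_p + var_g + c2))
    l_ssim = 1.0 - ssim.mean()
    return l_mse + lam * l_ssim
```

The default `lam=100.0` mirrors the λ = 100 found best in the ablation study; identical prediction and ground truth give SSIM = 1 everywhere and hence a loss of zero.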

IV. EXPERIMENT
In this section, we validate the effectiveness of our method on three public crowd counting datasets: ShanghaiTech [7], UCF_CC_50 [32] and UCSD [33]. Then we conduct ablation studies on the hyperparameter λ, the density map generation and the structure of the network. Considering the small size of the dataset, we use a cross-validation protocol for training and testing our method, following ref. [7].

A. DATASETS AND TRAINING DETAILS
• UCSD. The dataset consists of 2000 frames from a surveillance video camera on the UCSD campus. The resolution of each frame is 158 × 238 and the average number of people per frame is 25. The dataset provides a region of interest to ignore the background. Following ref. [11], frames #601 to #1400 are used for training and the rest for testing. To satisfy the constraints of the backbone on the shape of the input tensor, we resize the images to a resolution of 400 × 608. This operation not only meets the input restrictions but also ensures that the image content is not distorted. If the images in a dataset have varying resolutions, the original images are cropped into 400 × 400 patches. We convert the labels to density maps with a fixed-size Gaussian kernel G_σ (σ = 4). We use the delta function δ(x − x_i) to represent each head position. The ground-truth density map Y is generated by convolving the Gaussian kernel with each delta function.
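The label-to-density-map conversion can be sketched as below: each head annotation (a delta function) is replaced by a normalised Gaussian G_σ with σ = 4, so each person contributes a mass of one to the map. The 25 × 25 truncation window (≈ 3σ on each side) is our choice for the sketch and is not specified in the text:

```python
import numpy as np

def gaussian_kernel2d(size, sigma):
    ax = np.arange(size) - size // 2
    k = np.exp(-(ax[:, None] ** 2 + ax[None, :] ** 2) / (2 * sigma ** 2))
    return k / k.sum()  # normalised: each head adds ~1 to the map's sum

def density_map(shape, heads, sigma=4.0, ksize=25):
    # heads: list of (row, col) annotations; ksize is a hypothetical
    # truncation window, chosen here as roughly 3*sigma per side.
    h, w = shape
    pad = ksize // 2
    canvas = np.zeros((h + 2 * pad, w + 2 * pad))
    k = gaussian_kernel2d(ksize, sigma)
    for r, c in heads:
        # Stamp one kernel centred on each head (convolution with delta).
        canvas[r:r + ksize, c:c + ksize] += k
    return canvas[pad:pad + h, pad:pad + w]

dm = density_map((64, 64), [(10, 10), (30, 40), (50, 20)])
print(dm.sum())  # just under 3.0: a head near the border loses a little mass
```

Summing the density map therefore recovers (approximately) the annotated count, which is exactly the property the counting network is trained to regress.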
The parameters of the network are randomly initialized by the Gaussian distribution with mean zero and standard deviation of 0.01 except for the backbone network. We use Adam with a learning rate of 1e-4 and a batch size of 16 to train the network.

B. EVALUATION METRICS
The performance of the model is evaluated by two metrics, the Mean Absolute Error (MAE) and the Mean Squared Error (MSE). MAE reflects the accuracy of the model's predictions and MSE indicates its robustness. They are defined as follows:

MAE = (1/N) Σ_{i=1}^{N} |C_i − C^GT_i|,
MSE = sqrt( (1/N) Σ_{i=1}^{N} (C_i − C^GT_i)² ),

where N is the number of images in the test dataset, i is the index of the image, C_i is the estimated count for image i and C^GT_i is the ground truth.
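The two metrics can be written directly from these definitions. Note that, following common practice in the crowd counting literature, "MSE" here is the root of the mean squared error:

```python
import math

def mae(pred_counts, gt_counts):
    # Mean absolute error over the test set.
    n = len(pred_counts)
    return sum(abs(p - g) for p, g in zip(pred_counts, gt_counts)) / n

def mse(pred_counts, gt_counts):
    # "MSE" in crowd counting: the root of the mean squared error.
    n = len(pred_counts)
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred_counts, gt_counts)) / n)

print(mae([100, 90], [95, 100]), mse([100, 90], [95, 100]))  # 7.5 and ~7.906
```

Because the squared term penalises large deviations more heavily, MSE is the more sensitive of the two to occasional gross miscounts, which is why it is read as a robustness indicator.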

C. EXPERIMENTAL RESULTS
In Tables 2-4, we compare the performance of our method with several advanced works on ShanghaiTech [7], UCF_CC_50 [32] and UCSD [33]. These tables demonstrate that our method achieves the best MAE and MSE on all three benchmark datasets, which reveals that the accuracy and robustness of our method are competitive. On the challenging UCF_CC_50, our FFANet outperforms PACNN by 6.2% in MAE, which shows that our method is equally effective on a dataset with a small sample size where the number of people changes dramatically. Table 4 shows that our FFANet also achieves the best performance in sparse scenarios. The above results prove that our FFANet is accurate and robust in both dense and sparse scenes.
To comprehensively evaluate FFANet against other models, we further examine the results of the proposed FFANet on the ShanghaiTech Part A dataset in terms of model parameters, runtime and whether a pre-trained model is loaded. These measures are shown in Table 5. Compared with PACNN, which has the second-best MAE in Table 5, FFANet achieves higher accuracy with fewer parameters. However, there is still a certain gap between FFANet and the best model [15] in terms of parameters and runtime. In future work, we aim to use lightweight techniques to reduce the number of FFANet parameters while maintaining inference accuracy. Fig. 5 compares some high-quality results of FFANet on ShanghaiTech Part A with those of CSRNet. Compared with CSRNet, FFANet better captures the spatial distribution of the crowd and generates clearer density maps. Fig. 6 shows some high-quality results of FFANet on the three datasets.

D. ABLATION STUDIES
We present ablation studies on the hyperparameter λ, the density map generation, the CBAM attention module and the structure of the network.

1) ABLATION EXPERIMENTS ON LAMBDA
λ is a hyperparameter used to balance the MSE loss and the SSIM loss. To analyze the impact of the SSIM loss on the results, we set different λ values and observe the performance of the FFANet on ShanghaiTech Part A. Fig. 7 shows that the FFANet achieves the best results when λ = 100.

2) ABLATION EXPERIMENTS ON DENSITY MAP GENERATION
In this experiment, we compare the effects on the counting results of three Gaussian kernel values commonly used to generate density maps in crowd counting. Table 6 shows that the FFANet performs best when σ = 4.

3) ABLATION EXPERIMENTS ON CBAM ATTENTION MODULE
This part of the study discusses the impact of CBAM [25] on the network. Table 7 reports the results. Considering both accuracy and computational cost, this paper chooses FFANet as the optimal network.

4) ABLATION EXPERIMENTS ON NETWORK STRUCTURE
We study the effects of fusing different levels of features on the accuracy of crowd counting. The results indicate that feature fusion can effectively improve counting accuracy. Moreover, the network parameters and runtime of Method II are increased by 8.7% and 5.6% compared with Method I. Compared with Method II, the MAE of Method III is reduced from 69.1 to 65.9 and the MSE from 113.8 to 113.2. The results show that the performance gain is due to the increase in parameters caused by stacking the feature fusion stage, which strengthens the expressive ability of the network. However, simply increasing the network parameters also increases training difficulty and weakens the robustness of the network. Compared with Method II, the parameters and runtime of Method III are greatly increased, while the MSE of Method III decreases by only 0.6. The proposed FFAM offers a good tradeoff between crowd estimation performance and computational cost.
To evaluate the performance of the FFAM, we add the FFAM to Methods II and III in the feature fusion stage, resulting in Methods IV and V. Compared with Method II, the MAE of Method IV decreases from 69.1 to 66.5 and the MSE from 113.8 to 105.6. Moreover, in terms of network parameters and runtime, Method IV adds only 0.6M and 1 ms compared with Method I. In comparison with Method III, the MAE of Method V decreases from 65.9 to 62.4 and the MSE from 113.2 to 102.6. The decrease in MAE indicates that the FFAM can handle scale variation within an image, while the significant reduction in MSE indicates that the FFAM can handle scale changes across the dataset. The above results show that the FFAM is effective in dealing with scale changes in crowd counting. As for computational cost, Method V adds 0.6M parameters and 5 ms of runtime. In conclusion, the FFAM can greatly improve counting accuracy and enhance the robustness of the network while adding only limited parameters.

V. CONCLUSION
In this paper, we proposed a Feature Fusion Attention Network (FFANet) to accurately estimate the number of people in images. On the one hand, the Feature Fusion Attention Module (FFAM) is proposed to enhance the information in the multi-level features extracted by the VGG16 network. On the other hand, the enhanced features are processed by the Compression Module (CM) to generate a density map. Evaluation against other state-of-the-art methods indicates that the proposed FFANet is effective for crowd counting.
In the near future, we plan to verify the adaptability of our method with other feature extractors. In addition, the performance of FFANet on UCF_CC_50 is still not perfect; improving it will be another direction of our research. Finally, we also plan to use model lightweighting techniques to reduce the time complexity of FFANet.