CSANet: Channel and Spatial Mixed Attention CNN for Pedestrian Detection

Current mainstream pedestrian detectors tend to profit directly from convolutional neural networks (CNNs) designed for image classification. Such CNNs require a large downsampling factor to produce high-level semantic features, yet they cannot adaptively focus on the useful channels and regions of the feature maps, which limits the accuracy of pedestrian detection. To obtain higher accuracy, we propose a single-stage pedestrian detector with channel and spatial attention (CSANet), which automatically locates useful channels and regions while extracting features. Unlike the backbones of mainstream pedestrian detectors, the backbone of CSANet can effectively highlight likely pedestrian regions and suppress the background. Specifically, we model contextual dependencies along the channel and spatial dimensions of the feature maps, respectively. The channel attention module selectively guides the CNN to focus on key channels by integrating associated features. Meanwhile, the spatial attention module highlights semantically important pixels by aggregating similar features across all channels. Finally, the two modules are connected in series to further enhance the representation of the feature maps. Experimental results show that CSANet achieves state-of-the-art performance with $MR^{-2}$ of 3.55% on the Caltech dataset and obtains competitive performance on the CityPersons dataset while maintaining high computational efficiency.


I. INTRODUCTION
Pedestrian detection plays a critical role in computer vision tasks such as autonomous driving, robotics, and surveillance. In recent years, pedestrian detectors have made considerable progress with the revival of deep learning [1], [2]. However, current state-of-the-art pedestrian detectors still fall far short of human-level speed and accuracy [3].
Pedestrian detection can be traced back to traditional methods built on low-level features, e.g., HOG [4]. The emergence of R-CNN [5] made the two-stage ''region proposal + CNN'' architecture an established method in object detection [6], [7]. Zhang et al. [8] then examined the effectiveness of the Faster R-CNN framework in pedestrian detection tasks, improving small-scale pedestrian detection by raising the resolution of the feature maps. Unfortunately, the lack of diverse pedestrian datasets limits the ability of Faster R-CNN to generalize to pedestrian detection in various scenes. Zhang et al. [9] proposed a pedestrian benchmark called CityPersons, which contains rich and diverse images, and reported that AdaptFasterR-CNN [9], a model trained on CityPersons, has stronger generalization capabilities. However, in the two-stage frameworks mentioned above, the cumbersome design of anchor boxes and the generation of region proposals make real-time detection difficult.
In contrast to two-stage detectors, single-stage detectors, e.g., YOLO [10], [11], are well known for their fast detection and considerable accuracy. Therefore, Noh et al. [12] proposed a single-stage detector to optimize the real-time indicators of pedestrian detection. Currently, researchers are keen on the design of simpler anchor-free detectors.
In particular, Law et al. proposed an anchor-free object detector that detects bounding boxes as pairs of keypoints [13], which triggered a boom in anchor-free detectors, and numerous anchor-free object detectors followed [14]-[17]. Researchers then began to design anchor-free pedestrian detectors, such as [18], [19].
Notably, pedestrians in traffic scenes have characteristics different from general objects, such as dynamic backgrounds and variable scales [8], [20]. Generally, researchers employ deep models to abstract higher-level semantics of object instances, which helps identify pedestrians in traffic scenarios. Unfortunately, this approach filters out many small-scale pedestrians as well as the position information of large-scale pedestrians. Due to the inherent characteristics of CNNs, key channels cannot be highlighted and key spatial positions cannot be emphasized. Convolution is a local operation that extracts local information by applying a kernel to a local image patch, so CNNs cannot capture the image from a global view. Designing an efficient backbone for pedestrian detection therefore remains a challenging task.
In this paper, we design a dual attention network to mitigate the limitations mentioned above for pedestrian detection. In detail, the dual attention network is constructed from two dimensions (channel and spatial) of the feature maps. First, global average pooling is used to aggregate the global information of the feature channels: the original feature map is squeezed along the spatial axes to generate a 1D channel attention map, which models the correlations between channels. Then, global pooling is used to obtain a 2D spatial attention map by compressing the original feature maps along the channel axis. The two modules are then combined in proper order to refine the feature maps. In consideration of the superiority of anchor-free detectors, we design an anchor-free pedestrian detector called CSANet based on the dual attention modules. In addition, the performance of CSANet on multi-scale pedestrians is enhanced by fusing multi-scale feature maps. In line with our expectations, CSANet achieves strong performance on two pedestrian benchmarks, namely Caltech [21] and CityPersons [9].
Overall, our contributions are as follows:
1. We propose a lightweight dual attention network, which not only models the relationships among feature channels but also improves the expressive ability of feature maps at the pixel level.
2. We construct a single-stage pedestrian detector based on the dual attention network and further analyze the impact of several key components of CSANet on its performance.
3. CSANet achieves state-of-the-art performance on the Caltech benchmark and competitive performance on the CityPersons benchmark while maintaining computational efficiency.

II. RELATED WORKS
In this section, we review cutting-edge technologies related to pedestrian detection and motivate our approach to constructing a lightweight, concise, and effective pedestrian detector.
A. ANCHOR-BASED AND ANCHOR-FREE DETECTORS
Pedestrian detectors fall into two categories: anchor-based and anchor-free detectors. The former is dedicated to improving accuracy, as in Faster R-CNN [7] and Mask R-CNN [22]. The latter focuses on improving speed, as in YOLO [10], [11] and SSD [23]. Inspired by these methods, pedestrian detection has made great progress [8], [24], [25]. Recently, CornerNet [13] proposed an anchor-free detection method based on keypoint detection. Several studies [18], [19] applied the anchor-free idea to pedestrian detection, demonstrating its broad validity and opening up a new perspective on pedestrian detection.
Our work falls into the category of anchor-free pedestrian detection based on keypoint detection, which has advantages in both speed and accuracy. Keypoints are high-level semantic features, so anchor-free detectors rely strongly on the representation of the feature maps. Baseline methods often use deeper layers and larger downsampling factors to extract more abstract feature maps. Different from these methods, we use an attention mechanism to enhance the expressive ability of the feature maps, which enlarges the receptive fields. In addition, we set a small downsampling factor to maintain a high resolution of the feature maps.

B. ATTENTION MECHANISM IN IMAGE PROCESSING
The human visual attention mechanism inspired the development of attention mechanisms in computer vision. The idea of attention has been introduced into many computer vision tasks, such as image classification [27], [28], medical image segmentation [29], [30], image captioning [31], scene segmentation [32], and remote sensing imagery analysis [33].
Besides these outstanding works, there are many more works on visual attention. Wang et al. [34] proposed a non-local neural network for capturing long-range dependencies. SENet [28] modeled the correlations among channels. Inspired by SENet and Inception [35], SKNet [36] combined the channel attention module of SENet with the multi-branch convolutional layers of Inception. In addition, a well-known spatial attention model is STN [37], proposed by Google DeepMind, which can compensate for the limitations of local convolution operations by aggregating the context information of the feature maps.
Chen et al. [38] integrated spatial, channel-wise, and multi-layer visual attention in a CNN for image captioning, proposing two attention modules, Channel-Spatial and Spatial-Channel. Inspired by SCA-CNN [38], CBAM [39] integrated spatial and channel attention modules and achieved better results in image classification and object detection. In 2018, Guo et al. [40] addressed person re-identification with spatial and channel dual attention networks, making great progress on that task.
Inspired by [40], we tackle pedestrian detection in traffic scenes, a task similar to person re-identification. Our work centers on the design of a dual attention network for pedestrian detection, but it differs from all of the attention networks above. First, we model the attention mechanism from the channel and spatial dimensions of the feature maps, accounting for their unequal roles. In addition, inspired by GCNet [41], we use an addition operation to broadcast the attention maps. In short, our dual attention network can effectively improve the semantic abstraction of CNNs.

III. PROPOSED METHOD
In this section, we first present the overall framework of CSANet, then introduce the mathematical modeling of the channel attention module (CAM) and the spatial attention module (SAM) separately, and finally describe the arrangement of the two attention modules.

A. OVERALL ARCHITECTURE
The overall framework of CSANet is shown in Fig. 1. The backbone network is ResNet-50 [42] with the dual attention network embedded. Similar to [18], the detection head mainly includes three 1 × 1 convolutional layers, which predict the center position, scale, and offset, respectively. ResNet-50 is divided into 5 stages, and we define the output feature maps of stages 2 to 5 as $\phi_2$, $\phi_3$, $\phi_4$, and $\phi_5$, respectively. Following the practice of [43], they downsample the input image by factors of 4, 8, 16, and 16, respectively. The low-level feature maps provide more accurate position information, while the deeper feature maps contain more semantic information. We merge the multi-scale feature maps of the stages in a simple way to obtain the final feature map $\phi_{conc}$. Similar to [18], [19], and [20], the output feature maps of the stages are brought to a unified resolution by deconvolution before concatenation. The number of filters in the three deconvolution layers matches that of the last convolution layer in stage 2, although it can be set flexibly. Usually, shallow features are universal, while the semantic information expressed by each channel of deep features is more category-specific; we compare the effectiveness of different embeddings of the dual attention network in the ablation studies.
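For concreteness, a minimal Keras sketch of this fusion step is given below (our implementation uses Keras; see Section IV). The deconvolution kernel size and the helper name `fuse_stages` are illustrative assumptions; the filter count follows the stage-2 setting described above.

```python
from tensorflow.keras import layers

def fuse_stages(phi2, phi3, phi4, phi5, filters=256):
    """Upsample phi3..phi5 to phi2's 1/4 resolution, then concatenate."""
    up3 = layers.Conv2DTranspose(filters, 4, strides=2, padding='same')(phi3)  # 1/8  -> 1/4
    up4 = layers.Conv2DTranspose(filters, 4, strides=4, padding='same')(phi4)  # 1/16 -> 1/4
    up5 = layers.Conv2DTranspose(filters, 4, strides=4, padding='same')(phi5)  # 1/16 -> 1/4
    return layers.Concatenate(axis=-1)([phi2, up3, up4, up5])                  # phi_conc
```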
Taking the third residual block of stage 5 as an example, the dual attention network is constructed as follows. Given the output of the residual block, we define it as the original feature map $F \in \mathbb{R}^{H \times W \times C}$ and feed it into the dual attention mechanism. CAM and SAM derive, in turn, a 1D channel attention map $M_C \in \mathbb{R}^{1 \times 1 \times C}$ and a 2D spatial attention map $M_S \in \mathbb{R}^{H \times W \times 1}$. The original feature map $F$ is sequentially refined by the two attention maps:

$$F_C = F \oplus M_C \quad (1)$$

$$F_S = F_C \oplus M_S \quad (2)$$

where $\oplus$ indicates element-wise addition (with broadcasting), $M_C$ is the channel attention map, $M_S$ is the spatial attention map, $F_C$ is the feature map refined by $M_C$, and $F_S$ is the feature map refined by $M_S$.
In what follows, we describe the modeling process of the two modules in detail.

B. CHANNEL ATTENTION MODULE
Each channel of a high-level feature map can be regarded as a class-specific response, and different channels respond differently [28]. However, CNNs treat all channels equally. Therefore, we build a channel attention module to explicitly model the interdependencies between channels. The module emphasizes interdependent feature maps and improves the feature representation of specific semantics.
In [30], [36], [38], [39], the channel attention module assigns a different weight to each channel to stress useful channels for image classification. However, such channel relationship modeling is not well suited to the binary classification involved in pedestrian detection: the weighting excessively suppresses the information of unimportant channels and reduces the diversity of the feature maps, and while it enhances the information of a given channel, it also amplifies the interference of noise from complex backgrounds. We therefore use an addition operation to broadcast the channel attention map onto the original feature map.
As shown in Fig. 2 (a), to compute an effective channel attention map, the 2D feature map of each channel is compressed into a real number along the spatial axes of the input. First, for the original feature map $F \in \mathbb{R}^{H \times W \times C}$ with channel number $C$, global average pooling is used to aggregate the global information of each channel into the channel descriptor $U \in \mathbb{R}^{1 \times 1 \times C}$:

$$u_c = \mathrm{GAP}(f_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} f_c(i, j) \quad (3)$$

where GAP represents the global average pooling operation, $f_c$ represents the $c$-th channel of size $H \times W$ in the original feature map $F$, and $u_c$ is the $c$-th element of $U$. Then, we feed $U$ into two fully connected layers to obtain a non-linear descriptor, which passes through a sigmoid function to obtain the final channel attention map $M_C \in \mathbb{R}^{1 \times 1 \times C}$:

$$M_C = \sigma(W_2\,\delta(W_1 U)) \quad (4)$$

where the two fully connected layers fit the complex correlations between channels, $\delta$ represents the ReLU activation function, $\sigma$ denotes the sigmoid function, and $W_1 \in \mathbb{R}^{C/r \times C}$ and $W_2 \in \mathbb{R}^{C \times C/r}$ are the weights of the fully connected layers with reduction ratio $r$ set to 16.
Finally, we re-calibrate the input feature map $F$ with $M_C$. First, $M_C$ is broadcast to the same dimensions as $F \in \mathbb{R}^{H \times W \times C}$, and we denote the result as $M'_C \in \mathbb{R}^{H \times W \times C}$. We then apply the channel attention map by pixel-by-pixel addition:

$$F_C = F_{add}(F, M'_C), \quad \text{i.e.,} \quad f'_c = f_c \oplus m_c \quad (5)$$

where $F_{add}$ represents element-wise addition, $m_c$ is the $c$-th channel of the broadcast attention map $M'_C$, $f_c$ is the $c$-th channel of $F$, and $f'_c$ is the $c$-th channel of the refined feature map $F_C$.
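A minimal Keras sketch of CAM, under the assumption of broadcast addition as in (5), is shown below; the function name `channel_attention` is illustrative, not from our released code.

```python
from tensorflow.keras import layers

def channel_attention(F, reduction=16):
    """CAM sketch: GAP -> FC(C/r) + ReLU -> FC(C) + sigmoid -> broadcast add."""
    C = F.shape[-1]
    u = layers.GlobalAveragePooling2D()(F)                  # Eq. (3): channel descriptor U
    z = layers.Dense(C // reduction, activation='relu')(u)  # W1 followed by ReLU (delta)
    m = layers.Dense(C, activation='sigmoid')(z)            # W2 followed by sigmoid, Eq. (4)
    m = layers.Reshape((1, 1, C))(m)                        # 1 x 1 x C attention map M_C
    return layers.Add()([F, m])                             # Eq. (5): addition broadcast
```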

C. SPATIAL ATTENTION MODULE
In pedestrian detection, precise position information plays a critical role in locating pedestrians. Discriminative feature representations, which can be obtained by capturing long-range contextual dependencies between pixels, are equally critical. To make up for the shortcomings of the channel attention module, we further design a spatial attention module that refines the feature map at the pixel level to improve its representation. Similar to the channel attention map, the spatial attention map is broadcast onto the feature map by pixel-wise addition. To compute an effective spatial attention map, the 3D feature map is compressed into a single 2D channel along the channel axis of the feature map $F_C$.
First, given the re-calibrated feature map $F_C \in \mathbb{R}^{H \times W \times C}$, average pooling is performed over all feature channels along the channel axis to obtain a feature map $V \in \mathbb{R}^{H \times W \times 1}$:

$$v_{ij} = \mathrm{AP}(F_C) = \frac{1}{C} \sum_{c=1}^{C} f_{ij}^{c} \quad (6)$$

where AP represents the average pooling operation, $f_{ij}^{c}$ represents the pixel value at position $(i, j)$ of the $c$-th channel $f_c$, $C$ is the number of feature channels, and $v_{ij}$ represents the pixel value at position $(i, j)$ of $V$. Then, a convolution layer with a stride of 1 and a 7 × 7 kernel is applied to $V$, followed by a sigmoid activation, to obtain the spatial attention map $M_S \in \mathbb{R}^{H \times W \times 1}$:

$$M_S = \sigma(f^{7 \times 7}(V)) \quad (7)$$

where $\sigma$ denotes the sigmoid function and $f^{7 \times 7}$ represents a convolution with a 7 × 7 filter. Finally, we re-calibrate $F_C$ with the spatial attention map $M_S$. $M_S$ is broadcast to the same dimensions as $F_C \in \mathbb{R}^{H \times W \times C}$, denoted $M'_S \in \mathbb{R}^{H \times W \times C}$. As in formula (5), we use pixel-by-pixel addition:

$$F_S = F_{add}(F_C, M'_S), \quad \text{i.e.,} \quad f'_s = f_s \oplus m_s \quad (8)$$

where $F_{add}$ represents element-wise addition, $m_s$ is the $s$-th channel of the broadcast attention map $M'_S$, $f_s$ is the $s$-th channel of $F_C$ (the feature map refined by $M_C$), and $f'_s$ is the $s$-th channel of the refined feature map $F_S \in \mathbb{R}^{H \times W \times C}$. The spatial attention map thus has a global contextual view and selectively aggregates context. It focuses on the global information of each channel, enlarges the receptive fields, and enables the CNN to capture the image from a global perspective.
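Correspondingly, a minimal Keras sketch of SAM (again with illustrative naming) might look as follows:

```python
import tensorflow as tf
from tensorflow.keras import layers

def spatial_attention(F_c):
    """SAM sketch: channel-wise average pooling -> 7x7 conv + sigmoid -> add."""
    v = layers.Lambda(lambda t: tf.reduce_mean(t, axis=-1, keepdims=True))(F_c)  # Eq. (6)
    m = layers.Conv2D(1, kernel_size=7, strides=1, padding='same',
                      activation='sigmoid')(v)                                   # Eq. (7)
    return layers.Add()([F_c, m])                                                # Eq. (8)
```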

D. ARRANGEMENT OF TWO ATTENTION MODULES
In the works of [30], [38], [40], methods for organizing multiple attention modules are proposed. Inspired by these, we connect CAM and SAM in series. The two attention modules can be embedded in ResNet-50 in a parallel or sequential manner. The channel attention module focuses on important channels, while the spatial attention module focuses on important regions of the feature maps. A proper combination of the two modules maximizes the effectiveness of the attention mechanism; we compare the alternatives experimentally in the ablation studies.
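Composing the two sketches above, the CAM-first sequential arrangement reduces to the following (a sketch, not the exact implementation):

```python
def dual_attention(F):
    """CAM-first sequential arrangement of the two attention modules."""
    F_c = channel_attention(F)     # refine key channels first
    return spatial_attention(F_c)  # then refine key spatial positions
```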

IV. EXPERIMENTS
In this section, we first introduce the two pedestrian detection benchmarks, the evaluation metric, and the implementation details. Then, we report the results of the ablation studies on the Caltech dataset. In addition, to further demonstrate the effectiveness of CSANet, we conduct a visualization experiment. Finally, we compare CSANet with state-of-the-art pedestrian detectors on the Caltech and CityPersons datasets.

A. EXPERIMENTS DETAILS 1) DATASETS
We evaluate the effectiveness of CSANet on two challenging benchmarks, Caltech [21] and CityPersons [9]. For the Caltech dataset, we follow the approach in [18], where the training data are augmented by extracting one of every 3 frames. There are 42,782 images with resolution 640×480 in the training set. The testing set has 4,024 official images. The evaluations are conducted based on the new annotations provided by [1].
The CityPersons dataset is derived from Cityscapes [44], and has pedestrian annotations with multiple occlusion levels. Experiments use a training set that contains 2,975 images and an official validation set that contains 500 images.
The evaluation metric follows the Caltech evaluation standard [21]: the log-average miss rate over false positives per image (FPPI) in the range $[10^{-2}, 10^{0}]$, denoted $MR^{-2}$. A smaller $MR^{-2}$ indicates better performance.
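As a hedged sketch of how this metric is typically computed (following the Caltech toolbox convention; the function name and edge-case handling are our assumptions):

```python
import numpy as np

def log_average_miss_rate(fppi, miss_rate):
    """MR^{-2}: geometric mean of the miss rate sampled at nine FPPI points
    evenly spaced in log-space over [1e-2, 1e0]. `fppi` is sorted ascending
    and `miss_rate` is aligned with it."""
    refs = np.logspace(-2.0, 0.0, num=9)
    mrs = []
    for p in refs:
        below = miss_rate[fppi <= p]
        mrs.append(below[-1] if below.size else 1.0)  # no detections yet: miss rate 1
    return np.exp(np.mean(np.log(np.maximum(mrs, 1e-10))))
```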

2) IMPLEMENTATION DETAILS
Our method is implemented in the Keras framework. Training and testing are performed on a single NVIDIA GTX 1080Ti GPU. The backbone network is ResNet-50 pre-trained on ImageNet [45]. For the Caltech dataset, one mini-batch contains 16 images, the learning rate is $10^{-4}$, and training is stopped after 120 epochs. For the CityPersons dataset, one mini-batch contains 2 images, the learning rate is $2 \times 10^{-4}$, and training is stopped after 150 epochs. Following [9], [18], the Caltech experiments also include a further optimization in which the model is initialized from CityPersons, which contains a large number of images under a variety of conditions, and trained with a learning rate of $2 \times 10^{-5}$.
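A hedged sketch of the Caltech training configuration above is given below; the Adam optimizer and the loss argument are assumptions, as neither is specified here.

```python
import tensorflow as tf

def train_on_caltech(model, train_dataset, detection_loss):
    """Mirror the Caltech schedule: mini-batch 16, lr 1e-4, 120 epochs."""
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
                  loss=detection_loss)
    model.fit(train_dataset.batch(16), epochs=120)
```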

B. ABLATION STUDY
In this subsection, we conduct an ablative analysis on the Caltech dataset to show the effectiveness of four main components of the proposed method. The ablation studies are mainly divided into four parts: 1) Which feature extraction method is more effective? 2) How important is the feature fusion? 3) Which is more efficient, addition or multiplication? 4) How to connect CAM with SAM?

1) WHICH FEATURE EXTRACTION METHOD IS MORE EFFECTIVE?
Our dual attention network can easily be embedded in each residual block of ResNet-50, and multi-layer visual attention in CNNs has proven effective for image captioning [38]. In this part, we embed the dual attention network in multiple layers of ResNet-50 to obtain feature maps with different expressive abilities, and we compare five feature-extraction variants in detail. The experiments are conducted under IoU = 0.5 and IoU = 0.75, where the IoU measures the overlap between the predicted boxes and the ground-truth boxes.
As shown in Table 1, embedding at stages 3-5 achieves the best performance with $MR^{-2}$ of 3.88% under IoU = 0.5. Under the stricter IoU = 0.75, embedding at stages 2-5 improves performance by about 36% compared with embedding at stage 5 only. Notably, under IoU = 0.5, the stage 2-4 and stage 5 models perform comparably, with $MR^{-2}$ of 4.28% and 4.27%, respectively; under IoU = 0.75, however, a large gap of 4.77% in $MR^{-2}$ opens up between them. This comparison shows that the dual attention network is particularly conducive to producing high-quality bounding boxes.

2) HOW IMPORTANT IS THE FEATURE FUSION?
In this part, we compare different combinations of multi-scale feature maps based on the dual attention network under IoU = 0.5 and IoU = 0.75. Multi-scale feature fusion is essential for pedestrian detection [18], [20]. The shallower feature maps (e.g., $\phi_2$) maintain a higher resolution, which is conducive to detecting small-scale pedestrians. The deeper feature maps (e.g., $\phi_5$) have a lower resolution and a larger receptive field, which is conducive to detecting large-scale pedestrians. Proper feature fusion therefore benefits pedestrian detection across scales.
As Table 2 shows, models that combine only low-level features such as $\phi_2$ and $\phi_3$ have poor accuracy, but fewer parameters and faster detection speed. As the number of fused feature maps increases, detection accuracy also increases. Under IoU = 0.5, the fusion of $\phi_3$, $\phi_4$, and $\phi_5$ improves $MR^{-2}$ by about 47% relative to the worst-performing combination, and under IoU = 0.75 it gives the best result. In general, we find that although deeper features help detection, they consume more running memory.

3) WHICH IS MORE EFFICIENT, ADDITION OR MULTIPLICATION?
Considering the real-time requirement of pedestrian detection, we broadcast attention maps differently from [38], [39]. The study in [41] modeled long-range dependencies and used an addition operation to broadcast attention maps. In our method, a pixel-by-pixel addition broadcasts the attention maps, and the original feature maps are re-calibrated in turn by the dual attention modules. In this set of experiments, we compare the effects of addition and multiplication under IoU = 0.5.
As Table 3 shows, the models with addition broadcasting are better than those with multiplication broadcasting. The gap between ''p3p4p5+add'' and ''p3p4p5+mul'' is about 37% in $MR^{-2}$, and in the second and third sets of experiments the gaps are about 17%. We also find that the broadcasting method hardly affects test time; the runtime is mainly determined by the parameters of the model, although the computational complexity of multiplication is much higher than that of addition [46] and multiplication does increase network runtime. While the multiplication operation enhances the useful information in the feature maps, it also excessively enlarges the impact of noise, and its weighting excessively suppresses some contextual details, which hinders locating pedestrians. Beyond accuracy, the real-time indicators of the model must also be considered.
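In Keras terms, the two broadcasting schemes compared here differ only in the merge layer (an illustrative sketch; `broadcast_attention` is not a name from our implementation):

```python
from tensorflow.keras import layers

def broadcast_attention(F, m, mode='add'):
    """Apply an attention map m (broadcastable to F) to the feature map F.
    'add' is the scheme adopted by CSANet; 'mul' is the weighting of [38], [39]."""
    return layers.Add()([F, m]) if mode == 'add' else layers.Multiply()([F, m])
```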

4) HOW TO CONNECT CAM WITH SAM?
In this set of experiments, we compare three combinations of the dual attention modules under IoU = 0.5. There are three ways to connect the two modules: CAM-first, SAM-first, and parallel. All of them improve the accuracy of pedestrian detection, but to different degrees.
As shown in Table 4, CAM+SAM denotes connecting the channel attention module and the spatial attention module sequentially, CAM first. We find that the sequential arrangements outperform the parallel one, and CAM-first gives the best result with $MR^{-2}$ of 3.88%. CAM//SAM denotes arranging the two modules in parallel, which underperforms CAM+SAM by 0.27% in $MR^{-2}$. SAM+CAM denotes the SAM-first order, which performs worst with $MR^{-2}$ of 4.57%. The results in Table 4 thus show that CAM-first performs best.

C. NETWORK VISUALIZATION WITH GRAD-CAM
In this subsection, we apply Grad-CAM [47] to explain our model qualitatively. Grad-CAM improves the interpretability of CNNs to a certain extent: it derives a class activation map that locates the image regions associated with a class. It mainly uses the gradients flowing into the last convolutional layer of the network to generate a heat map that highlights the important pixels in the input image.
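A minimal Grad-CAM sketch is shown below; it assumes a single-output Keras model and a known last-convolution layer name, both illustrative assumptions rather than details of our pipeline.

```python
import tensorflow as tf

def grad_cam(model, image, last_conv_name, out_index=0):
    """Weight the last conv layer's channels by the spatially averaged
    gradients of a target score, then apply ReLU and normalize."""
    grad_model = tf.keras.Model(
        model.inputs, [model.get_layer(last_conv_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[tf.newaxis, ...])
        score = preds[..., out_index]           # target score (summed if not scalar)
    grads = tape.gradient(score, conv_out)
    weights = tf.reduce_mean(grads, axis=(1, 2))                    # GAP over H, W
    cam = tf.nn.relu(tf.einsum('bhwc,bc->bhw', conv_out, weights))  # weighted sum
    return (cam / (tf.reduce_max(cam) + 1e-8))[0].numpy()           # heat map in [0, 1]
```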
We examine how CSANet exploits features and enhances the expressive ability of feature maps at the pixel level. In Fig. 3, the masks of the model with the dual attention network clearly cover the pedestrian areas better than those of the model without it. In other words, the dual attention modules focus better on the pixel information of the target areas. The visualization results qualitatively show that, in the refined feature maps, the pixel-level expressive ability of the target areas is enhanced to a certain extent.

D. COMPARISONS WITH STATE-OF-THE-ART METHODS ON TWO BENCHMARKS
In this subsection, we compare CSANet with state-of-the-art models on two benchmarks, Caltech and CityPersons. In these experiments, CSANet denotes the model initialized with ImageNet pre-trained weights, while CSANet+City denotes the model initialized from CityPersons.
1) RESULTS ON CALTECH BENCHMARK
As shown in Fig. 4, the model initialized from the CityPersons dataset performs best: CSANet+City achieves the best performance with $MR^{-2}$ of 3.55%, surpassing the current advanced method CSP+City. The CSANet model initialized from ImageNet achieves $MR^{-2}$ of 3.88%, a significant improvement over the baseline models. As shown in Fig. 5, our models also achieve a smaller miss rate under stricter threshold settings, which means the dual attention network also helps improve the quality of the bounding boxes. Table 5 reports the detailed experimental results on Caltech, suggesting that CSANet significantly outperforms the competitors in accuracy while maintaining high computational efficiency. The speed of the proposed method is about 18 FPS at the original 640 × 480 resolution, with $MR^{-2}$ of 3.55% under IoU = 0.5 and 18.86% under IoU = 0.75. As Table 5 also shows, the two-stage detectors are slow, and our method achieves a better speed-accuracy trade-off.

2) RESULTS ON CITYPERSONS BENCHMARK
In this set of experiments, we report results on the CityPersons dataset under IoU = 0.5, using a single NVIDIA GTX 1080Ti GPU and a mini-batch of 2. Table 6 shows that CSANet achieves state-of-the-art performance with $MR^{-2}$ of 7.25% on the Bare subset of CityPersons. On the Reasonable subset, CSANet is second only to the CSP [18] model trained with a mini-batch of 8; within a reasonable range, a larger batch size makes the gradient-descent direction more accurate, which partly explains this gap. Our detector is comparable to ALFNet [51] and improves on the competitor RepLoss [52] by about 2.6%. On the Heavy and Partial subsets with different occlusion levels, the performance of our method still falls short of that of the most advanced detectors.

V. CONCLUSION
In this paper, we propose a dual attention network that obtains larger receptive fields and richer contextual information, helping CNNs capture the image from a global perspective, and we construct an anchor-free pedestrian detector based on it. The proposed detector, CSANet, achieves state-of-the-art performance on the Caltech benchmark and competitive performance on the CityPersons benchmark.
Introducing attention mechanisms into complex computer vision tasks is worth further study. Our work suggests that the attention mechanism can improve the performance of pedestrian detection to a certain extent. However, current detectors, including CSANet, only reach about 20 FPS, which falls short of real-time detection. In future work, we will design simple yet effective attention networks to further contribute to realizing real-time detection.