Pedestrian as Points: An Improved Anchor-Free Method for Center-Based Pedestrian Detection

Although numerous approaches based on traditional sliding-window methods or prevailing anchor-based techniques have been proposed for deep learning-based pedestrian detection, it remains a promising yet challenging problem. In this paper, we propose a precise, flexible, and fully anchor-free and proposal-free framework named Pedestrian-as-Points Network (PP-Net) for pedestrian detection. Specifically, we model a pedestrian as a single point, i.e., the center point of the instance, and predict the pedestrian scale at each detected center point. To achieve higher accuracy, we build a pyramid-like structure on the backbone as a feature extractor that aggregates multi-level information. In addition, we construct a deep guidance module (DGM) on top of the backbone, so that higher-level information is captured while building the feature pyramid network (FPN) and the dilution of high-level information along the top-down pathway is avoided. We further design a feature fusion unit (FFU) to fuse the fine-level features with the coarse-level semantic information from the top-down pathway. With non-maximum suppression (NMS) as the only post-processing step, we achieve better performance than many state-of-the-art methods on challenging pedestrian detection datasets.


I. INTRODUCTION
Deep neural networks (DNNs) based on fully convolutional neural networks have shown great improvements over systems relying on hand-crafted features [1]-[3] on benchmark tasks. The rapid progress of DNN research in recent years has dramatically facilitated the development of computer vision, including object detection [4]-[6], image retrieval [7]-[9], scene recognition [10], [11], semantic segmentation [12]-[14], image classification and inpainting [15], [16], and so on. In particular, the body of state-of-the-art work in object detection continues to grow, including face recognition [17]-[19], pedestrian detection [20]-[22], vehicle detection [23], [24], etc. Pedestrians are among the main participants in the public transportation system, so pedestrian detection helps to realize an efficient and safe system. In the past few years, the widely used anchor-based methods [25]-[29] have been dominant and have achieved tremendous progress.
The associate editor coordinating the review of this manuscript and approving it for publication was Yongqiang Zhao.
Unfortunately, current anchor-based approaches have several drawbacks. First, anchor-based methods introduce additional hyper-parameters and design choices. Usually a large number of anchors, i.e., bounding boxes of potential objects, is required to ensure a sufficiently high recall rate and a high Intersection over Union (IoU) with the ground-truth objects. Moreover, it is difficult to manually design object candidates when the size and aspect ratio of each anchor box vary widely. Second, the preset anchor boxes hinder the generality of detectors; that is, design choices based on a specific dataset are not always applicable to other datasets. Last but not least, most of these anchor boxes are labelled as negative samples during training, leading to an imbalance between positive and negative samples.
For these reasons, anchor-free methods have been attracting increasing attention. Keypoint-based object detection [30]-[32] is a family of methods that generate bounding boxes by detecting and grouping keypoints. The CSP detector [33], the state of the art among them, uses the vanilla ResNet-50 network [34] to extract multi-level feature maps and then concatenates them to predict the center heatmap and the corresponding scale map, i.e., it detects the center points and sizes of the bounding boxes. The CSP detector has achieved excellent accuracy on the challenging CityPersons dataset [35] with a simple design that eliminates the need for anchor boxes.
However, when we look closely at the operations of the CSP detector, we find that its detection performance can be further improved. Specifically, after feature extraction, the authors of CSP simply fuse the multi-scale feature maps, which come from different stages of the backbone network, into a single one. The underlying principle is that shallower feature maps contain more accurate localization information, while deeper ones provide more semantic information because their receptive field is larger. However, the in-network feature hierarchy introduces large semantic gaps between feature maps from different depths. Because of this inherent property, directly concatenating multi-depth feature maps harms their representational capacity for subsequent detection.
A large number of structures [36]-[38] have been proposed to alleviate this problem. U-shape based structures [39], [40] have received a lot of attention because they construct enriched feature maps by building top-down pathways upon the basic network. Thus, in this paper, we leverage the pyramidal shape of a ConvNet's feature hierarchy by creating a feature pyramid network (FPN) [40], in which each level has strong semantic and localization information regardless of scale. More specifically, we rely on an architecture that aggregates low-resolution features with strong semantic information and high-resolution features with weak semantic information through a top-down pathway and lateral connections. Going a step further than the standard FPN, we investigate how to better solve the problem of multi-scale feature fusion when building each pyramidal level. To this end, we incorporate a feature fusion unit (FFU) into our model to fuse features with different resolutions.
There is still considerable room for refining the existing feature pyramid network. First, as pointed out in [41], the semantic information captured by deep layers is gradually diluted along the top-down pathway of the FPN architecture. Second, as mentioned in [42], the size of the receptive field of a convolutional neural network (CNN) is not proportional to its layer depth. Several kinds of approaches aim to address these problems, such as recurrently refining feature maps [43], [44] and introducing attention mechanisms [45], [46] into FPN architectures.
Inspired by PoolNet [41], we improve the vanilla feature pyramid network. We propose to adopt a deep guidance module (DGM) upon the bottom-up pathway, i.e., to add a residual unit on top of the bottom-up pathway. Thanks to this operation, a higher-level feature map I with abundant semantic information can be obtained. The captured information is then transferred to the feature maps at all pyramid levels by fusing I with each of them. Because this deeper feature map contains extensive semantic information, it alleviates the sparsity of semantic information in the top-down pathway of the FPN.
In summary, the main contributions of this work are as follows: (1) We construct an FPN-shaped structure based on the ResNet-50 [34] network to obtain multi-scale features, which allows us to detect pedestrians at various scales. Our newly proposed feature fusion unit (FFU), together with the feature pyramid network (FPN), then addresses the large semantic gap between multi-layer features that is ignored when they are fused directly. (2) Based on the U-shaped architecture, we further build a novel deep guidance module (DGM) upon the bottom-up pathway, which aims to provide the location information of potential objects to layers at different feature levels. We thereby tackle the problem of information sparsity by expanding the role of deep features in U-shape based architectures. (3) We develop a novel framework called Pedestrian-as-Points Network (PP-Net) for real-time pedestrian detection, which effectively utilizes the semantic information of images at low resolution along with details at high resolution. (4) The proposed anchor-free method achieves higher performance than state-of-the-art methods on the CityPersons [35] and Caltech [47] datasets.

II. RELATED WORKS
Object detection has been extensively studied over the past few decades, and great progress has been made with the emergence of deep convolutional neural networks. Object detection algorithms can be classified into anchor-based and anchor-free detectors.

A. ANCHOR-BASED DETECTORS
Anchor-based detectors inherit and further expand the ideas of the traditional sliding-window strategy [22] and of proposal-based detectors such as Fast R-CNN [47]. Pedestrian detection has been significantly improved by the use of dense predefined anchors with preset scales and aspect ratios. Modern CNN-based detectors are categorized into two-stage and one-stage detectors. Within the two-stage framework, the classical Faster R-CNN [26] utilizes an anchor mechanism in the branch dedicated to generating proposals, i.e., the Region Proposal Network (RPN). Since then, dozens of methods [27], [48]-[50] have been developed. Following Faster R-CNN, Mask R-CNN [48] adds a mask branch in parallel with the classification and regression branch to predict masks. To spare Faster R-CNN its heavy region-wise CNN computational cost, R-FCN [49] proposes efficient region-wise fully convolutional computation without accuracy loss. Cascade R-CNN [27] extends the architecture of Faster R-CNN to multiple stages. Illumination-aware Faster R-CNN [50] addresses the problem of fusing color and thermal modalities for detection in multispectral images. In the one-stage stream, a considerable number of approaches [51]-[55] that use the anchor mechanism have been proposed after SSD [25]. They aim at improving performance through multi-stage refinement [51], [52], adaptive anchors [53], and loss function improvements [54], [55].
VOLUME 8, 2020

B. ANCHOR-FREE DETECTORS
Most recently, many papers on anchor-free detection [56]-[60] have been published, building considerable momentum toward moving beyond the era of anchor-based detectors.
CornerNet [56] predicts two groups of bounding-box corners, i.e., top-left and bottom-right points, and then groups the corners belonging to the same object based on the distance between their embeddings, a grouping step inspired by the Associative Embedding method [57]. Corner pooling is used to localize the corners better. CornerNet-Lite [58] is a combination of two efficient variants of CornerNet and thus improves efficiency without sacrificing accuracy. In contrast to CornerNet, ExtremeNet [59] detects the four extreme points and the center point of a bounding box instead of its corners. More specifically, the top, left, bottom and right points are predicted and then grouped to form the final detected bounding box. FCOS [60] predicts bounding boxes by making full use of all points inside a ground-truth bounding box, and low-quality detected bounding boxes are suppressed by the proposed "center-ness" branch. The detector treats object locations rather than anchor boxes as training samples, in the same way as semantic segmentation.
Following the anchor-free pipeline, our work aims to predict precise center points and the corresponding pedestrian scales. We explore whether such a simple method, which localizes pedestrians merely by detecting their center points, can be more competitive than other, more complex methods.

C. FEATURE PYRAMID NETWORKS
Feature pyramid construction modules are applied in many computer vision applications that require multi-scale processing as the basis of their solutions. Furthermore, a feature pyramid representation module can easily be modified and inserted into most detectors based on deep neural networks. SPPNet [61] eliminates the ConvNet's requirement for a fixed input size by introducing a spatial pyramid pooling layer. More recently, PFPNet [62] extends this idea by building multiple parallel SPPNets to generate feature pools of different sizes; the elements in the feature pool are then rescaled to a uniform size and their context information is aggregated to generate each level of the final feature pyramid. M2Det [63] employs multiple U-shaped modules after a backbone model and thus builds stronger feature pyramid representations. NAS-FPN [64] introduces the Neural Architecture Search (NAS) mechanism and discovers a new feature pyramid architecture in a novel scalable search space covering all cross-scale connections.

D. PEDESTRIAN DETECTION
As a critical part of general object detection, pedestrian detection receives considerable interest. Nowadays, the field of pedestrian detection is almost entirely dominated by deep learning [28], [65]-[67].
A joint learning framework is proposed in [65]. In addition, [63] adds extra features to improve performance. A cascaded prediction is performed in [28] to unlock the potential of one-stage detectors. References [66] and [67] focus on studying and overcoming the impact of occlusion. Reference [66] is the first to regress the full body and the visible body of a pedestrian simultaneously. Reference [67] incorporates an attention mechanism into the framework to enhance pedestrian features.
In our work, we aim to cope with the multi-scale problem of pedestrian detection by refining the feature pyramid network [68]-[71]. Reference [68] enhances the semantic information of low-level features by applying multiple convolution operations and increases the resolution of high-level features by removing the pooling layer. Reference [69] applies a feature pyramid network equipped with refined attention modules to strengthen the representation ability of features. Reference [70] enhances the feature pyramid network by introducing a cross-scale feature aggregation module. In [71], the structure of convolutional neural networks is summarized.

III. PROPOSED METHOD
It is well known that high-level features with rich semantic information help to discover the specific locations of objects. Meanwhile, low- and mid-level features with plenty of location information are also essential for refining the coarse features extracted from deep layers. Based on this knowledge, we propose in this section a pyramid-like network named Pedestrian-as-Points Network (PP-Net), as illustrated in Fig. 1, which has two complementary modules and can detect the exact positions of pedestrians while simultaneously predicting their sizes.

A. OVERALL PIPELINE
We build our architecture in a one-stage manner. It takes advantage of the widely adopted feature pyramid network (FPN) [40], a classic U-shaped architecture designed in a bottom-up and top-down manner for finely fusing multi-level features, as shown in Fig. 4a.
For the bottom-up pathway, we adopt ResNet-50 [34] as the forward baseline network unless otherwise stated. It consists of five stages made up of Conv layers (a convolutional layer followed by batch normalization and ReLU). Feature tensors with the same scale belong to the same network stage. We choose the last feature map of each stage as our reference set of feature maps, which we enrich to generate our pyramid-like structure, because the feature map with the strongest representation ought to exist in the deepest layer of each stage. The output feature maps of the different stages in the forward pass are down-sampled by factors of 2, 4, 8, 16 and 32 w.r.t. the input image. In practice, the output of stage 5 is kept at 1/16 of the input image size by using dilated convolutions. We denote the last feature map of each stage as C_i, where i is the index of the stage within the backbone hierarchy. Concretely, the last feature maps of stages 2, 3, 4 and 5 are denoted as C_2, C_3, C_4 and C_5, among which the shallower feature maps contain more accurate localization information, while the deeper ones provide more semantic information with larger receptive fields. We do not include stage 1 in the pyramid due to its large memory footprint.
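As an illustration of the stage resolutions described above, the following minimal sketch computes the spatial size of each stage output for a given input. The function name `stage_sizes` and the 640 × 1280 input size are assumptions made for this example only.

```python
def stage_sizes(height, width, dilate_stage5=True):
    """Return {stage name: (h, w)} for the last feature map of stages 1-5."""
    # Nominal down-sampling factors of the five backbone stages.
    strides = {"C1": 2, "C2": 4, "C3": 8, "C4": 16, "C5": 32}
    if dilate_stage5:
        # Dilated convolutions keep stage 5 at 1/16 of the input size.
        strides["C5"] = 16
    return {name: (height // s, width // s) for name, s in strides.items()}


sizes = stage_sizes(640, 1280)  # hypothetical input resolution
for name, (h, w) in sizes.items():
    print(name, h, w)
```

With dilation enabled, C5 has the same spatial size as C4, so the two can be fused without any up-sampling.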
As shown in Fig. 1, we add a deep guidance module (DGM) on top of the bottom-up pathway to address feature dilution. More specifically, we explicitly transfer the guidance information from the DGM to the layers at different feature levels by merging the high-level information extracted by the DGM with the feature maps at each feature level. We then go one step further and introduce a feature fusion unit (FFU) to ensure that feature maps at different resolutions can be concatenated seamlessly.
The features with higher resolution are generated in top-down pathway via up-sampling spatially coarser yet semantically stronger feature maps from higher pyramid levels.
For an efficient FPN design, we make the pyramid pathways lightweight by reducing their channel capacity. Specifically, we use a channel capacity that is significantly lower than the number of channels of the final stage of the backbone, which yields computationally efficient pathways because the computation cost of a weight layer scales quadratically with its channel dimension.
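The quadratic scaling mentioned above can be checked with a small calculation. The sketch below counts the multiply-accumulate operations of a single convolution layer. The helper `conv_macs` and the 40 × 80 feature-map size are illustrative assumptions; 2048 is the channel count of the final ResNet-50 stage, and 256 is the reduced channel number d used in this paper.

```python
def conv_macs(height, width, c_in, c_out, kernel):
    """Multiply-accumulate count of one convolution (stride 1, same size)."""
    return height * width * c_in * c_out * kernel * kernel


# Same 3 x 3 layer at backbone width versus at the reduced pyramid width.
full = conv_macs(40, 80, 2048, 2048, 3)
light = conv_macs(40, 80, 256, 256, 3)
ratio = full // light
print(ratio)
```

Reducing the channel count by a factor of 8 on both the input and the output side reduces the cost by a factor of 8 × 8.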
In detail, we first attach a 1 × 1 convolutional layer to C_5 to produce the coarsest-resolution map C_5. Here, the 1 × 1 convolutional layer is used to reduce the channel dimension to a fixed number, denoted as d (d = 256 in this paper). Then, the feature maps C_5 and C_6 (the output of the DGM) are fed into the FFU, creating a feature map P_5. Next, we reduce the number of channels of C_4 to d via a 1 × 1 convolutional layer and feed the output, along with P_5 and C_6, into the FFU to generate P_4. This process is iterated until the finest-resolution map P_2 is obtained. If the inputs of the FFU differ in resolution, we rescale the coarser ones by up-sampling them by a factor of 2 through bilinear interpolation. Last but not least, the number of channels is reduced to d by a 1 × 1 convolution before the feature maps are sent into the feature fusion module.
Finally, we append a detection head, which is crucial in the whole detection system, to the generated feature map P_2 to parse it into the final detection results. The structure of the head is shown in Fig. 1. First, the number of channels is reduced to 256 by a 3 × 3 convolutional layer. Then, the center heatmap and the scale map are produced separately via two parallel 1 × 1 convolutional layers. The predicted heatmaps have the same size as the concatenated feature maps. Note that more complicated detection heads such as [52], [55] can be explored to further improve the detection performance, but this is beyond the scope of this work.
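To make the parsing of the head outputs concrete, the following is a minimal sketch of how a center heatmap and a scale map could be turned into bounding boxes. It is not the authors' implementation. The function name `decode_detections`, the score threshold, the cell-to-pixel mapping and the convention that the scale map stores the logarithm of the height in image pixels are all assumptions; the stride of 4 corresponds to the resolution of P_2, and the aspect ratio of 0.41 is the value used for the annotations in the training section. The offset branch is ignored here.

```python
import math


def decode_detections(heatmap, log_heights, stride=4,
                      score_thresh=0.5, aspect_ratio=0.41):
    """Turn a center heatmap and a log-height map into scored boxes."""
    detections = []
    for row, (scores, scales) in enumerate(zip(heatmap, log_heights)):
        for col, (score, log_h) in enumerate(zip(scores, scales)):
            if score < score_thresh:
                continue
            # Assumed mapping from a feature-map cell to image coordinates.
            cx = (col + 0.5) * stride
            cy = (row + 0.5) * stride
            h = math.exp(log_h)        # assumed to be in image pixels
            w = aspect_ratio * h       # width follows from the fixed ratio
            detections.append((cx - w / 2, cy - h / 2,
                               cx + w / 2, cy + h / 2, score))
    return detections
```

The resulting boxes would then be passed to NMS, which is the only post-processing step of the detector.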
Three reasons explain why anchor-free detection is superior to anchor-based detection, i.e., why detecting centers is more effective than proposing bounding boxes. First, as noted in CornerNet [56], directly predicting center points is a more efficient way of densely discretizing the space of boxes, because O(w^2 h^2) possible anchor boxes can be represented by only O(wh) centers. Second, the anchor-free approach yields smoother predictions, which empirically improves the generalization performance of the network. Third, the anchor-free method avoids a large number of IoU calculations between ground-truth (GT) boxes and anchor boxes, so the training process takes up less memory.
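The gap between the two orders of growth can be illustrated by direct counting. The sketch below assumes axis-aligned boxes whose sides lie on the cells of a w × h grid; the function names are hypothetical.

```python
def count_boxes(w, h):
    """Number of axis-aligned boxes spanned by grid cells of a w x h map."""
    # Choose a left/right column pair and a top/bottom row pair
    # (a pair may coincide, giving a box that is one cell wide or tall).
    return (w * (w + 1) // 2) * (h * (h + 1) // 2)


def count_centers(w, h):
    """Number of candidate center locations on the same map."""
    return w * h


for size in (4, 16, 64):
    print(size, count_boxes(size, size), count_centers(size, size))
```

The number of boxes grows with the square of the number of centers, which is why enumerating box candidates is far more expensive than scoring center locations.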
In the following, we describe the architectures of the two modules mentioned above, namely the Deep Guidance Module (DGM) (Sec. III-B) and the Feature Fusion Unit (FFU) (Sec. III-C), and explain their functions in detail.

B. DEEP GUIDANCE MODULE
Constructing the top-down pathway of a U-shaped structure on the bottom-up backbone causes two noticeable issues. One is the dilution of deep semantic information along the top-down pathway. The other is the mismatch between the receptive field in practice and in theory; in particular, the receptive field that CNNs achieve in practice is too small to cover the entire input image. To this end, we propose a plug-and-play deep guidance module (DGM) that provides deeper and richer information.
As shown in Fig. 2c, the structure of the DGM is adapted from the residual stage of the original ResNet. Inspired by DetNet [72], our deep guidance module consists of a dilated bottleneck with a 1 × 1 convolution projection followed by two dilated bottlenecks with identity connections. More specifically, as shown in Fig. 2a and Fig. 2b, we use the dilated bottleneck as the basic unit of the DGM to explore deep semantic information efficiently while enlarging the receptive field without changing the spatial size of the feature map after stage 5.
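The claim that dilation enlarges the receptive field while preserving the spatial size can be verified with the standard convolution arithmetic. In the sketch below, the helper names, the dilation rate of 2 and the 40-pixel feature-map side are assumptions chosen for illustration; the section above does not state the rate used in the DGM.

```python
def effective_kernel(kernel, dilation):
    """Extent covered by a dilated kernel along one axis."""
    return kernel + (kernel - 1) * (dilation - 1)


def conv_out_size(size, kernel, dilation, stride=1, padding=0):
    """Output length of a convolution along one axis."""
    k_eff = effective_kernel(kernel, dilation)
    return (size + 2 * padding - k_eff) // stride + 1


# A 3 x 3 kernel with dilation 2 covers a 5 x 5 window ...
print(effective_kernel(3, 2))
# ... and with padding equal to the dilation rate the map size is unchanged.
print(conv_out_size(40, kernel=3, dilation=2, padding=2))
```

Each dilated 3 × 3 layer therefore sees a wider context than an ordinary 3 × 3 layer at the same number of weights, without any further down-sampling.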

C. FEATURE FUSION UNIT
In order to fuse feature maps with different resolutions when constructing the feature pyramid, we propose a simple yet effective feature fusion unit (FFU). As shown in Fig. 3, the inputs of the FFU are three feature maps with different scales. More precisely, they are the feature maps fused to build one pyramidal level in the top-down pathway of our new structure, i.e., feature maps F_1, F_2 and F_3 with sizes C_1 × H_1 × W_1, C_2 × H_2 × W_2 and C_3 × H_3 × W_3. Note that F_3 has twice the spatial size of F_1 and F_2, while F_1 and F_2 have the same resolution.
For F_1 (and likewise F_2), we first double the resolution via a deconvolution layer, which gives it the same size as feature map F_3. Then an L2-normalization layer is used to rescale the norm of the resized feature map for the subsequent fusion operation.
As for F_3, since there is no need to change its spatial size, we merely apply L2-normalization to adjust its norm to match that of the processed F_1.
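The data flow of the unit can be sketched in a few lines of plain Python on nested lists of shape [C][H][W]. This is an illustration under stated assumptions, not the authors' layer: nearest-neighbour up-sampling stands in for the learned deconvolution, the normalization is taken across channels at each spatial location without a learnable scale, and the final fusion is assumed to be channel concatenation, as suggested by the pipeline description. All function names are hypothetical.

```python
import math


def upsample2x(feat):
    """Nearest-neighbour 2x up-sampling (stand-in for the deconvolution)."""
    out = []
    for channel in feat:
        rows = []
        for row in channel:
            wide = [v for v in row for _ in range(2)]  # repeat each column
            rows.append(wide)
            rows.append(list(wide))                    # repeat each row
        out.append(rows)
    return out


def l2_normalize(feat, eps=1e-12):
    """Scale the channel vector at every location to unit L2 norm."""
    c, h, w = len(feat), len(feat[0]), len(feat[0][0])
    out = [[[0.0] * w for _ in range(h)] for _ in range(c)]
    for y in range(h):
        for x in range(w):
            norm = math.sqrt(sum(feat[k][y][x] ** 2 for k in range(c))) + eps
            for k in range(c):
                out[k][y][x] = feat[k][y][x] / norm
    return out


def fuse(f1, f2, f3):
    """Bring F_1 and F_2 to the size of F_3, normalize all, concatenate."""
    coarse = [l2_normalize(upsample2x(f)) for f in (f1, f2)]
    fine = l2_normalize(f3)
    return coarse[0] + coarse[1] + fine  # concatenation along channels
```

After normalization, the three inputs contribute on a comparable scale, so no single level dominates the fused map merely because its activations are larger.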

IV. EXPERIMENTAL RESULTS AND DISCUSSION
In this section, we first describe the details of training the framework. Then, we introduce the implementation details, the datasets used and the evaluation metrics of the experiments. Next, we present the experimental results and a comparison with previous state-of-the-art methods. Finally, we demonstrate the effectiveness of each proposed module through a series of ablation studies.

A. TRAINING
We generate the ground-truth maps of center and scale from the bounding box annotations. For the center ground truth, a location is defined as positive if it is the center point of a pedestrian, and as negative otherwise.
As for scale, it can be defined as the height and width of a pedestrian. Following the CSP detector [33], we only predict the height of each pedestrian; the bounding box is then obtained with a preset aspect ratio, because the high-quality ground-truth bounding boxes are automatically generated with a uniform aspect ratio of 0.41. Additionally, the value of log(h_k) corresponding to the k-th object is assigned to the k-th positive location and to the negatives within a radius of 2 of that positive (to alleviate ambiguity), while all other locations are assigned zero. Specifically, the framework directly predicts a 1D vector, i.e., the height of the object plus a class category, at each positive location on a level of feature maps. As shown in Fig. 4, the four sides of a bounding box (the orange box in the figure) can be obtained from this 1D vector (the vertical green solid line).
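The target construction described above can be sketched as follows. The function name `make_targets`, the stride of 4, the square shape of the radius-2 neighbourhood and the choice to measure the height in image pixels are assumptions of this example rather than details given in the text.

```python
import math


def make_targets(map_h, map_w, boxes, stride=4, radius=2):
    """Build center and scale target maps from (x1, y1, x2, y2) boxes."""
    center = [[0] * map_w for _ in range(map_h)]
    scale = [[0.0] * map_w for _ in range(map_h)]
    for x1, y1, x2, y2 in boxes:
        # Cell that contains the box center on the down-sampled map.
        cx = int((x1 + x2) / 2 / stride)
        cy = int((y1 + y2) / 2 / stride)
        center[cy][cx] = 1
        log_h = math.log(y2 - y1)
        # Assign log(h) to the positive and to its neighbouring negatives.
        for y in range(max(0, cy - radius), min(map_h, cy + radius + 1)):
            for x in range(max(0, cx - radius), min(map_w, cx + radius + 1)):
                scale[y][x] = log_h
    return center, scale
```

Only one cell per pedestrian is positive in the center map, whereas the scale map is filled over a small neighbourhood so that a slightly misplaced center prediction still reads a sensible height.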
We adopt the classification loss in [31]. In this loss, p_ij = 1 if the center point of a pedestrian is located at coordinate (i, j) and 0 otherwise, and y_ij ∈ {0, 1} denotes the ground-truth label. In addition, M is the mask map, calculated using a 2D Gaussian mask G(·) that is introduced to relieve the ambiguity of negative samples surrounding the positive ones; it is formulated as in the CSP detector [33].
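For orientation, the following sketch shows one common form of such a loss, namely the penalty-reduced focal loss with a Gaussian mask as used by the CSP detector [33]. It is an assumption that the loss used here has exactly this form. In the sketch, `pred` holds predicted center probabilities, and the exponents gamma = 2 and beta = 4 as well as all function names are assumed for illustration.

```python
import math


def gaussian_mask(map_h, map_w, centers):
    """Per-pixel maximum over 2D Gaussians placed at the object centers.

    centers: list of (cx, cy, sigma_x, sigma_y) in map coordinates.
    """
    mask = [[0.0] * map_w for _ in range(map_h)]
    for cx, cy, sx, sy in centers:
        for y in range(map_h):
            for x in range(map_w):
                g = math.exp(-((x - cx) ** 2 / (2 * sx ** 2)
                               + (y - cy) ** 2 / (2 * sy ** 2)))
                mask[y][x] = max(mask[y][x], g)
    return mask


def center_loss(pred, labels, mask, gamma=2.0, beta=4.0, eps=1e-7):
    """Focal-style loss; negatives near a center are down-weighted."""
    total, positives = 0.0, 0
    for y, row in enumerate(labels):
        for x, label in enumerate(row):
            p = min(max(pred[y][x], eps), 1.0 - eps)
            if label == 1:
                positives += 1
                total -= (1.0 - p) ** gamma * math.log(p)
            else:
                weight = (1.0 - mask[y][x]) ** beta
                total -= weight * p ** gamma * math.log(1.0 - p)
    return total / max(positives, 1)
```

Negatives close to a true center receive a weight near zero, so the network is not penalized heavily for responding one or two cells away from the annotated center.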
Overall, the final loss function combines the classification loss with L_reg and L_offset, both of which are adapted from the smooth L1 loss function.
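For reference, the standard smooth L1 function from which both regression terms are adapted can be written as below. The threshold of 1 is the usual default and is assumed here; how the authors adapt the function for L_reg and L_offset is not specified in this section.

```python
def smooth_l1(pred, target):
    """Standard smooth L1: quadratic near zero, linear for large errors."""
    diff = abs(pred - target)
    if diff < 1.0:
        return 0.5 * diff * diff
    return diff - 0.5


print(smooth_l1(0.5, 0.0), smooth_l1(3.0, 1.0))
```

The quadratic region keeps gradients small for nearly correct predictions, while the linear region limits the influence of outliers.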

B. EXPERIMENTAL SETUP
1) IMPLEMENTATION DETAILS
Our proposed framework is implemented in Keras [73]. The network is trained for 150 epochs in total, using the Adam optimizer [74] with an initial learning rate of 2e-4. By default, the backbone is a pre-trained ResNet-50 and the remaining modules are randomly initialized. During the test phase, we evaluate the models trained for 50 to 150 epochs unless otherwise stated.

2) DATASETS
To verify the effectiveness of PP-Net, we use two challenging datasets, CityPersons [35] and Caltech [47], which provide center point annotations and aspect ratios of bounding boxes. CityPersons contains 2975 images for training and 500 images for testing. Its images are extremely large and the types of occlusion are many and varied. Caltech consists of 42782 training images and 4024 testing images, which are frames extracted from a 2.5-hour driving video.
Compared with other datasets, the annotations of these two datasets fit our method well, as they contain normalized aspect ratios and central body line annotations.
Before training, several data augmentation methods are applied, such as random brightness, random cropping and color jittering.

3) METRICS
Following the CSP detector [33], we use the log-average miss rate over false positives per image (MR-FPPI), with FPPI ranging in [10^-2, 10^0], which we denote as MR, to evaluate the detection results. The calculation of the miss rate is described in [22]. We also report average precision (AP) as a supplement. Note that the higher the AP, the more accurate the detections; conversely, the lower the MR, the fewer pedestrians the detector misses.
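A minimal sketch of the log-average miss rate is given below. It assumes the common protocol of sampling the curve at nine FPPI reference points evenly spaced in log space over the stated range and averaging the miss rates in the log domain; the function name, the number of points and the fallback miss rate of 1 when the curve does not reach a reference point are assumptions, and [22] remains the authoritative description.

```python
import math


def log_average_miss_rate(fppi, miss_rate, num_points=9):
    """Geometric mean of the miss rate sampled at log-spaced FPPI values.

    fppi must be sorted in ascending order; miss_rate[i] belongs to fppi[i].
    """
    refs = [10 ** (-2 + 2 * i / (num_points - 1)) for i in range(num_points)]
    logs = []
    for ref in refs:
        # Miss rate at the largest FPPI that does not exceed the reference.
        reached = [mr for f, mr in zip(fppi, miss_rate) if f <= ref]
        mr = reached[-1] if reached else 1.0
        logs.append(math.log(max(mr, 1e-10)))
    return math.exp(sum(logs) / len(logs))
```

Averaging in the log domain gives equal weight to every decade of FPPI, so a detector cannot obtain a good MR by performing well only at high false-positive rates.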

1) CITYPERSONS
In this section, we compare our proposed framework with several previous state-of-the-art methods on the CityPersons dataset [35], including FRCNN [35], FRCNN+Seg [35], OR-CNN [75], RepLoss [76], TLL+MRF [77] and the CSP detector [33]. For fair comparison, the final detection results of these methods are those provided by their authors, except for our closest competitor, the CSP detector, which we re-implemented with Keras [73] based on the original code released by its authors. Table 1 shows that our proposed method (denoted as PP-Net in the table) outperforms most of the methods above, in particular the main object of comparison, the CSP detector. In other words, we reach competitive pedestrian detection performance on this challenging dataset in spite of the various occlusions and scales.
Moreover, as illustrated in Fig. 5, PP-Net performs only passably along the horizontal axis, although it is close to the best method, DCS+NMS [6]; that is to say, the AP of PP-Net is merely acceptable. Along the vertical axis, PP-Net performs well and is superior to most methods.
In brief, our proposed PP-Net combines accuracy with a strong ability to capture objects.
Several qualitative results are shown in Fig. 6. They indicate that our proposed PP-Net can detect the great majority of pedestrians, even when they are crowded, highly overlapped, small or large.

2) CALTECH
Table 2 and the corresponding Fig. 7 show comparisons with the state of the art on the Reasonable setting across multiple NMS thresholds. We also re-implement the CSP detector [33] for the sake of fairness. Our PP-Net achieves an acceptable result, comparable with that of the main competitor, the CSP detector. Because there are not enough training samples for the model to be fully trained, the improvement is slightly smaller than that on the CityPersons dataset. In PP-Net, each prediction point is not associated with a particular reference shape; the bounding boxes are predicted directly from the predicted height information. Since PP-Net allows specific aspect ratios, it can capture the entire body of a pedestrian in a similar shape.
From Fig. 7, we can conclude that our PP-Net is less sensitive to the NMS threshold, because its curve is smoother than that of the baseline.
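To make the role of the NMS threshold explicit, the following is a minimal sketch of standard greedy NMS. It is a generic illustration rather than the authors' implementation; the function names and the default threshold of 0.5 are assumptions.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep a box unless it overlaps a higher-scoring kept box."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep
```

A lower threshold removes more overlapping boxes, which risks suppressing true detections in crowds, while a higher threshold keeps more duplicates; a detector whose miss rate changes little across thresholds is therefore easier to deploy.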

D. ABLATION STUDY
In this subsection, we demonstrate the effectiveness of the different components of our proposed framework under different settings. To this end, we construct several variants and evaluate them on the CityPersons [35] and Caltech [47] datasets.

1) U-SHAPED STRUCTURE
We build a U-shaped structure upon the basic ResNet-50 to narrow the semantic gaps between features of different depths. We also design an alternative structure for the same purpose. As shown in Fig. 8b, we gradually fuse the feature maps from the two nearest stages instead of directly fusing all feature maps; we term this the Nearest-fused architecture. The results on the two datasets are compared in Table 3 (a) and (b), respectively. In addition, we construct these two feature fusion structures directly on the backbone network of the original CSP detector, so as to eliminate the influence of the deep guidance module. The corresponding comparisons on the two datasets are shown in Table 3 (c) and (d).
These results suggest that the alternative architecture is inferior to the U-shaped FPN structure. Table 3 (a) and (c) show that, on the CityPersons dataset, the U-shaped structure improves the baseline method by reducing the miss rate (MR) by 0.69% and 1.15% with and without the deep guidance module, respectively, while the Nearest-fused architecture reduces it by only 0.64% and 0.89%, which demonstrates the effectiveness of FPN.
In addition, Table 3 (b) and (d) show that, on the Caltech dataset, the U-shaped structure improves the baseline method by reducing the MR by 0.66% and 0.16% with and without the deep guidance module, respectively, while the Nearest-fused architecture hurts the performance.

2) DEEP GUIDANCE MODULE (DGM)
To verify the performance improvement brought by the proposed deep guidance module (DGM), we add the DGM on top of the backbone of the feature extraction part of the CSP detector [33]. We then concatenate the feature maps from stages 3, 4, 5 and the DGM for the subsequent detection. For verification, we also remove the DGM from our proposed framework and test the performance (note that, for simplicity, we only detect on the feature map from the bottom level of the FPN-like network). The results on the two datasets are shown in Table 4 (a) and (b). We can observe that the DGM plays an important role in our detector.
For further exploration, we employ atrous spatial pyramid pooling (ASPP) from DeepLab V3 [13] as a substitute for the original DGM. ASPP consists of several parallel branches of atrous convolution with different dilation rates to capture multi-scale context. Following the configuration in [13], ASPP consists of one 1 × 1 convolution and three 3 × 3 convolutions with rates = (6, 12, 18) when output stride = 16 (all with 256 filters and batch normalization), together with an image-level feature. The comparisons on the two datasets are shown in Table 5 (a) and (b). On CityPersons, the table shows that our DGM brings about a reduction of 1.35% in MR, whereas with ASPP the change is 0.22%. This means that our DGM provides more semantic information beneficial to the final prediction, while applying multi-branch dilated convolutions to the final feature maps may generate redundant feature information that greatly disturbs the detection results. On the Caltech dataset, our DGM reduces MR by 0.67%, while ASPP reduces it by 0.58%, showing that both ASPP and our DGM help to improve the results, but our DGM brings a larger gain.
To further verify this point and remove the interference brought by the feature fusion architecture (i.e., the U-shaped FPN structure), we conduct an experiment on the vanilla CSP detector without the follow-up FPN. As shown in Table 5 (c) and (d), on CityPersons our DGM improves the MR of the baseline method by 1.21%, while ASPP degrades the performance instead, which implies that not every module provides semantic information that is helpful for detection.
In addition, on Caltech we can draw a similar conclusion: our DGM improves the vanilla CSP detector [33] with a 0.17% reduction in MR, while ASPP has a negative effect.

3) FEATURE FUSION UNIT (FFU)
We consider our feature fusion unit superior to the previous operation, which fuses multi-scale feature maps directly. To verify this, we conduct an experiment in which the FFU is removed.
The numerical results in Table 6 (a) and (b) indicate that the absence of the FFU harms the performance of our approach, increasing MR by 0.45% on CityPersons and 0.09% on Caltech, because the differing norms of the multi-scale feature maps play a negative role in feature fusion. Compared with existing feature fusion modules, our FFU is simple and pragmatic.

4) AGGREGATE ALL LEVELS OR NOT
While building the U-shaped structure, we face two related choices. The first is which level of the structure should hold the feature map with the finest resolution, i.e., whether the bottom level should be P_2 or P_3.
The second is whether we should detect on feature maps obtained by fusing all levels of the FPN-like network or only on the one from the bottom level. To make the decision that most benefits performance, we conduct comparison experiments; the results on the two datasets are given in Table 7 (a) and (b), respectively.
The results demonstrate that we should build P_2 as the bottom level of the top-down pathway and concatenate all levels of the U-shaped network for the subsequent detection.

V. CONCLUSION
In this paper, we have proposed an anchor-free pedestrian detector that finds a better trade-off between accuracy and efficiency. We have established a U-shaped architecture to eliminate the semantic gaps between multi-level feature maps. In addition, we have proposed a deep guidance module that extracts deep semantic information to address the information dilution in the top-down pathway. We have further proposed a feature fusion unit (FFU) for multi-feature concatenation. By plugging these modules into the FPN-like network, we achieve strong performance. The detection results on the challenging CityPersons and Caltech datasets demonstrate that our framework is competitive with state-of-the-art methods. In future work, we will pursue better performance by exploring superior detection heads.
He is also involved in the research and development of high-performance devices/circuits, as well as intelligent electronic systems. He is a member of the Institute of Electronics, Information, and Communication Engineers of Japan.
QIU CHEN (Member, IEEE) received the Ph.D. degree in electronic engineering from Tohoku University, Japan, in 2004. Since then, he has been an Assistant Professor and an Associate Professor with Tohoku University. He is currently a Professor with Kogakuin University. His research interests include pattern recognition, computer vision, information retrieval, and their applications. He serves on the editorial boards of several journals, as well as committees for a number of international conferences.