Scale-Aware Hierarchical Detection Network for Pedestrian Detection



I. INTRODUCTION
Pedestrian detection stands out from traditional object detection tasks in view of its broad application prospects in computer vision, such as video surveillance, autonomous driving, and robotics. Although significant improvements have been made in pedestrian detection [4], [8], [24], [27], [41] over the years, most existing efforts work very well only for large-scale pedestrian instances [17]-[19], [23], [34], [36], [51]. Compared with pedestrian detection at large scale, much less attention has been paid to medium- and small-scale instances, as similarly observed in the literature [14], [54]. For an autonomous driving system, detecting medium- and small-size pedestrians is an important topic because it may leave sufficient time to alert the driver. Assuming the vehicle travels at an urban speed of 15 m/s and a pedestrian is 1.8 m tall, a person 80 pixels in height is just 1.5 s away, while a person of 30 pixels is 4 s away. Take one recent effort, AR-Ped [2], for example: it has been reported that their detector empirically achieves a 6.45% log-average miss rate for pedestrians taller than 50 pixels on the Caltech Pedestrian Benchmark [20]; however, the same error rate increases to 49.31% MR for pedestrians 30-80 pixels in height. Fig. 1(a) shows several failure cases of the state-of-the-art method AR-Ped [2] under large scale appearance variations on the Caltech benchmark. As Fig. 1(b) illustrates for the height distribution of pedestrians on the Caltech dataset, we group pedestrians by their image size (height in pixels) into three scales following [54]: near (80 or more pixels), medium (between 30-80 pixels), and far (between 20-30 pixels). Note that about 81.67% of the pedestrians on the Caltech dataset lie in the medium scale. (The associate editor coordinating the review of this manuscript and approving it for publication was Abdel-Hamid Soliman.)
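The timing figures above follow from a simple pinhole-camera model. A minimal sketch, assuming a focal length of roughly 1000 pixels, a value back-derived from the numbers quoted in the text rather than stated in the paper:

```python
# Sanity check of the timing example, assuming a pinhole-camera model.
# The focal length (~1000 px) is an assumption, chosen so that an
# 80-pixel, 1.8 m pedestrian sits 22.5 m (1.5 s at 15 m/s) away.
PERSON_HEIGHT_M = 1.8
SPEED_MPS = 15.0
FOCAL_PX = 1000.0  # assumed, not from the paper

def seconds_away(height_px: float) -> float:
    """Time until the vehicle reaches a pedestrian of the given image height."""
    distance_m = FOCAL_PX * PERSON_HEIGHT_M / height_px
    return distance_m / SPEED_MPS

print(seconds_away(80))  # 1.5 s for an 80-pixel pedestrian
print(seconds_away(30))  # 4.0 s for a 30-pixel pedestrian
```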
The degraded performance of pedestrian detection under large scale variations may be attributed to the following inherent challenges. First, small-size pedestrian instances often convey less information while carrying a greater proportion of noise, with obscure appearance and blurred boundaries; it is in general difficult to distinguish them from background clutter. Second, the visual semantic concepts of an object can emerge at different spatial scales depending on the size of the target object. For a pedestrian instance of interest, visual features are effective only at a proper scale where the optimal response is obtained. This difference is more pronounced in complex scenes containing pedestrian instances of diverse scales.
To address pedestrian detection under large scale appearance variations, Faster R-CNN [9] exploits a multiscale region proposal network (RPN), which achieves excellent object detection performance. However, its multi-scale detection is generated by sliding a fixed set of filters over a fixed set of convolutional feature maps. This results in an inconsistency between the sizes of objects and filter receptive fields: the scales of objects are variable, yet the sizes of filter receptive fields are fixed. Instead of using a fixed set of receptive fields, most related works [1], [7], [15], [44], [49], [56], [57] that aim to detect multi-scale pedestrians redeploy the receptive fields of convolution according to object sizes at multiple output layers. In our view, however, these methods either simply select multiple output layers based on the sizes of receptive fields [4], [5], [11], or use feature fusion to expand the receptive field on a single output layer [44], [55]; neither enhances the entire feature hierarchy for multiscale pedestrian detection. This motivates us to construct an aggregated feature representation that enhances semantic information and localization signals for scale-aware pedestrian detection.
Motivated by the above insight and analysis of the feature hierarchy pyramid representation, we propose a scale-aware hierarchical detection network for pedestrian detection under large scale variations. First, we accomplish feature aggregation based on FPN [43] to enhance semantic information and localization signals in the feature representation, by merging the lateral connection, the top-down path and the bottom-up path. Furthermore, in view of the feature differences among pedestrians at different scales, the scale-aware hierarchical detection network is designed to learn to adaptively perceive pedestrian instances within certain scale ranges, by probing the feature differences across scales in the augmented pyramid features.
To sum up, our work makes the following contributions: 1) We introduce a cross-scale features aggregation module to enhance the feature pyramid representation by fusing robust semantic information and accurate localization signals for pedestrians at different scales, accomplishing feature augmentation from the lateral connection, the top-down path and the bottom-up path.
2) A novel scale perception strategy based on a normalized Gaussian gate function is designed to integrate multiple detection heads into a unified framework, by adaptively perceiving the outputs of the cross-scale features aggregation module in the scale-aware hierarchical detection network.
3) Experimentally, compared with the state-of-the-art method FasterRCNN+ATT [44], the log-average miss rate of pedestrian detection is reduced by 11.98% for medium scale pedestrians (between 30-80 pixels in height), and 14.12% for whole scale pedestrians (above 20 pixels in height) on Caltech benchmark.

II. RELATED WORK
There has been lasting research activity on pedestrian detection, with a vast literature. Before the emergence of CNNs, hand-crafted features were widely used to obtain good performance for pedestrian detection, including HOG [10], Edgelets [38], ICF [21] and its variants ACF [28], LDCF [17], [22], and SCF [23]. The most popular pedestrian detector is the deformable part model (DPM) [50], which combines a rigid root filter and deformable part filters based on a HOG feature pyramid and latent SVM classifiers for detection.
Deep ConvNets, due to their stronger feature representation ability, exhibit obvious performance gains on pedestrian detection [2], [13], [18], [28], [29]. CCF [27] absorbs merits from both filtered channel features and Convolutional Neural Networks (CNNs), and transfers low-level features from pre-trained CNN models to feed a boosted forest model for pedestrian detection. ConvNet [41] uses an unsupervised method based on convolutional sparse coding to pre-train a CNN for pedestrian detection. DeepParts [13] consists of extensive part detectors, each of which is strong enough to detect a pedestrian by observing only a part of a proposal. SDP [11] investigates scale-dependent pooling and layer-wise cascaded rejection classifiers on CNN features to detect objects. CompACT-Deep [16] leverages both hand-crafted and CNN features to form complexity-aware cascaded detectors for an optimal trade-off between accuracy and speed. In particular, Faster R-CNN [9] introduced a multiscale region proposal network that shares full-image convolutional features with the detection network, leading to excellent performance for pedestrian detection.
However, spatial scale variation is one of the main challenges for pedestrian detection due to the large variance of instance scales across scenarios. To address the issue, upsampling or dilated operations [5], [11] are employed to alleviate the decline caused by the fixed set of filter receptive fields in Faster R-CNN [9]. MS-CNN [5] combines multiple output layers with feature upsampling by deconvolution to produce a strong multi-scale object detector. SA-FastRCNN [4] exploits multiple built-in subnetworks in a divide-and-conquer strategy to adaptively detect pedestrians across scales. RPN+BF [7] reuses the high-resolution convolutional features of the RPN with cascaded boosted forests for multiscale pedestrian detection. ADM [49] executes sequences of coordinate transformations on multi-layer feature maps to deliver accurate pedestrian locations. TridentNet [53] constructs a parallel multi-branch architecture to expand receptive fields through dilated convolution for the detection of objects at different scales. However, these methods do not effectively fuse the robust semantic information of targets in high-level convolutional layers with the precise localization signals of the lower convolutional layers for multiscale pedestrian detection.
To exploit strong semantics for prediction, FPN [43] augments a top-down pathway and lateral connections to propagate high-level semantic information for reasonable classification capability. DSSD [47] adopts deconvolution layers to aggregate context and high-level semantics for enhancing shallow features. M2Det [3] presents a multi-level feature pyramid network to fuse multiscale features for detecting objects of different scales. On the other hand, the many fine details and higher resolution of low-level feature maps benefit localization accuracy. PANet [32] builds a strong indicator to accurately localize instances for segmentation through a pathway with clean lateral connections from the low levels to the top ones. DLA [48] augments standard architectures with deeper aggregation across layers to obtain stronger layer-wise multi-scale representation capability. STDN [29] is equipped with embedded super-resolution scale-transfer layers to explore the inter-scale consistency across multiple detection scales. Recently, NAS-FPN [52] consists of a series of merging cells that fuse features across scales by a combination of top-down and bottom-up connections. Res2Net [46] constructs hierarchical residual-like connections within one single residual block to capture multi-scale features at a granular level.
Inspired by these observations and by the analysis of feature fusion for multiscale detection, in this paper we explore a scale-aware hierarchical detection network for multi-scale pedestrian detection that aggregates the strong semantic information from high-level features and the accurate localization signals from low-level layers to enhance pyramidal feature representations.

III. APPROACH OVERVIEW
A high-level overview of our architecture is shown in Fig. 2. Our proposed approach consists of two main components: the cross-scale features aggregation module and the scale-aware hierarchical detection network. The cross-scale features aggregation module is built on the Feature Pyramid Network (FPN) [43] to enhance the representation ability of pyramid features. FPN shows significant improvement as a generic feature extractor for object recognition: it propagates semantically strong features through the top-down path to enhance pyramid features with reasonable classification capability. Meanwhile, many fine details and strong responses to local patterns exist in low-level convolutional layers, which benefit localization accuracy. For this reason, we design a cross-scale features aggregation module to adaptively aggregate features of the pyramid hierarchy and enhance localization capability.
Further, the scale-aware hierarchical detection network based on the Fast R-CNN framework [6] combines complementary detection branches on the hierarchical pyramid feature maps from the cross-scale features aggregation module. The detection heads of the hierarchical detection network, built on a ResNet [33] pretrained on ImageNet, all share parameters for each proposal and learn scale-aware hierarchical weights by minimizing the error rate for pedestrians at different scales, regardless of their feature levels.

FIGURE 2. The architecture of our proposed Scale-aware Hierarchical Detection Network. Our approach uses the cross-scale features aggregation module to enhance semantic robustness and localization accuracy, and the scale-aware hierarchical detection network to adaptively detect pedestrians from augmented feature levels for the specific-scale pedestrians presented in the image.

A. CROSS-SCALE FEATURES AGGREGATION MODULE
Feature Pyramid Network (FPN) [43] shows significant improvement as a generic feature extractor for object recognition, propagating semantically strong features to enhance pyramid features with reasonable classification capability. Following previous evidence on the benefits of feature approximation [28], we denote the outputs of the last residual blocks as {C_1, C_2, C_3, C_4, C_5} for conv1, conv2, conv3, conv4, and conv5 in ResNet, and a list of multi-scale pyramid features {P_1, P_2, P_3, P_4, P_5} from FPN [43], where P_i represents the feature at pyramid level i. However, the feature fusion in FPN builds only on the lateral connection and the top-down pathway, ignoring the bottom-up path augmentation that could enhance the feature representation with the accurate localization signals existing in low-level convolutional layers.
Our goal is to find a transformation function f that can effectively aggregate multi-scale features and output a list of new features, X_out = f(X_in), where X_in may be C_i, P_i, or their union. Different from the feature augmentation generated by FPN, we propose a cross-scale features aggregation module (CFAM) that merges a bottom-up pathway into FPN. Specifically, we use {H_1, H_2, H_3, H_4, H_5} to denote the augmented feature pyramid, in which the spatial resolution of the feature maps is gradually upsampled by a factor of 2 from H_i to H_{i-1}. As shown in Fig. 3(b), each feature aggregation module takes a convolutional feature map C_{i-1} with higher resolution, an identity-mapping feature map C_i and a coarser feature map H_{i+1} with stronger semantics to generate the augmented feature map H_i. Note that we adopt average pooling to downsample the spatially finer feature maps, which directly propagates the strong responses to local patterns from low pyramid levels for accurate localization along the bottom-up augmented pathway.
The key idea of CFAM is to adaptively aggregate multi-scale context information from the feature maps of convolutional layers at adjacent scales to generate more discriminative features. As shown in Fig. 3(b), each aggregation module merges a top-down path, a lateral connection and a bottom-up augmented path by addition: it takes a convolutional feature map C_{i-1} with higher resolution, an identity-mapping feature map C_i and a coarser feature map H_{i+1} with stronger semantics to generate the fused feature map H_i. This process is iterated to build the augmented feature pyramid down to the finest resolution map H_3. At the beginning of the iteration, we adopt a 1 × 1 convolutional layer on C_5 to produce the coarsest but semantically strongest map H_5. Then the lower-level feature map C_{i-1} goes through a 2 × 2 average pooling layer with stride 2 to reduce the spatial size, yielding the down-sampled feature map of the bottom-up augmented pathway. The upsampled feature map H_{i+1}, the down-sampled feature map and the identity-mapping feature map C_i are added element-wise to generate the fused map. Finally, we append a 1 × 1 convolution on each merged map to generate the final augmented feature map H_i for the following sub-networks, which reduces the aliasing effect of upsampling and downsampling. In the feature aggregation module, the augmented feature maps correspond to {C_3, C_4, C_5} with the same spatial sizes, and we set 1024-channel outputs for each level of the augmented feature pyramid {H_3, H_4, H_5} fed to the scale-aware hierarchical detection network.
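The aggregation step described above can be sketched in NumPy. This is an illustrative simplification, not the paper's implementation: all levels share one channel count and one 1 × 1 convolution weight, whereas the actual network uses 1024-channel outputs with separately learned convolutions, and nearest-neighbour upsampling is assumed for the top-down path.

```python
import numpy as np

rng = np.random.default_rng(0)

def upsample2x(x):
    """Nearest-neighbour x2 upsampling of a (C, H, W) map (top-down path)."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def avgpool2x2(x):
    """2x2 average pooling with stride 2 (bottom-up path), as in the paper."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def conv1x1(x, w):
    """A 1x1 convolution is per-pixel channel mixing; w is (out_ch, in_ch)."""
    c, h, wd = x.shape
    return (w @ x.reshape(c, -1)).reshape(w.shape[0], h, wd)

def aggregate(c_lower, c_i, h_coarser, w):
    """One CFAM step: H_i = conv1x1(upsample(H_{i+1}) + C_i + avgpool(C_{i-1}))."""
    return conv1x1(upsample2x(h_coarser) + c_i + avgpool2x2(c_lower), w)

# Toy pyramid: C3/C4/C5 at strides 8/16/32 on a 256x256 input; the 1024
# channels of the paper are collapsed to 8 to keep the sketch small.
ch = 8
c3 = rng.standard_normal((ch, 32, 32))
c4 = rng.standard_normal((ch, 16, 16))
c5 = rng.standard_normal((ch, 8, 8))
w = rng.standard_normal((ch, ch))

h5 = conv1x1(c5, w)            # coarsest, semantically strongest map
h4 = aggregate(c3, c4, h5, w)  # merges top-down, lateral and bottom-up paths
print(h4.shape)                # (8, 16, 16): same spatial size as C4
```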

B. SCALE-AWARE HIERARCHICAL DETECTION NETWORK
Covering multiple scale ranges is a critical problem for pedestrian detection. Different from the multi-scale mechanism of the RPN [9], we divide the region proposals from the higher convolutional layer C_4 into three scales (near, medium, and far), and each scale is routed to an augmented feature pyramid level H_i to detect pedestrian instances within certain scale ranges, as shown in Fig. 4. We hypothesize that pedestrian instances at different scales can be better modeled by a hierarchical detection network whose filter receptive fields have matching valid ranges. Specifically, each pedestrian anchor scale needs to effectively match the receptive field size of the RoI pooling through a different spatial pooling structure.
Let L_m(X_i, Y_i|W) represent the multi-task loss function for each pedestrian proposal at a specific feature level H_m, given by:

L_m(X_i, Y_i|W) = L_cls^m(p_i, p̂_i) + p̂_i L_loc^m(b_i, b̂_i),    (1)

where p̂_i is 1 if the anchor is labeled positive and 0 otherwise, p_i is the predicted probability of the anchor being a pedestrian, b̂_i represents the ground-truth box associated with a positive anchor, and b_i represents the parameterized coordinates of the predicted bounding box. The classification loss L_cls^m is the softmax loss over two classes (pedestrian vs. not) at feature level H_m. For the regression loss, we use L_loc^m = R(b_i − b̂_i), where R is the robust loss function (smooth-L1) defined in [6]. The term p̂_i L_loc^m means the regression loss is activated only for positive anchors (p̂_i = 1) and disabled otherwise (p̂_i = 0).
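For concreteness, the per-anchor loss and the smooth-L1 function R can be sketched as follows. This is a plain NumPy illustration; the two-class softmax term is written here as a binary log loss, which is equivalent for two classes.

```python
import numpy as np

def smooth_l1(x):
    """Robust loss R from Fast R-CNN: quadratic near 0, linear beyond |x| = 1."""
    x = np.abs(x)
    return np.where(x < 1.0, 0.5 * x * x, x - 0.5)

def per_anchor_loss(p, p_hat, b, b_hat):
    """L_m for one anchor: classification loss plus a regression term gated
    by the ground-truth label p_hat (1 for positive anchors, 0 otherwise).

    p        : predicted pedestrian probability
    b, b_hat : predicted / ground-truth parameterized box coordinates
    """
    cls = -(p_hat * np.log(p) + (1 - p_hat) * np.log(1 - p))
    loc = smooth_l1(np.asarray(b, float) - np.asarray(b_hat, float)).sum()
    return cls + p_hat * loc  # regression is active only for positives

# A negative anchor (p_hat = 0) contributes no regression loss:
print(per_anchor_loss(0.1, 0, [0, 0, 0, 0], [1, 1, 1, 1]))
```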
To adaptively match feature levels and anchor scales for multiscale pedestrian detection, SDP [11] adopts a hard isolation strategy based on the pixel height of an object proposal, while SA-FastRCNN [4] exploits a soft isolation strategy with a sigmoid gate function defined over the object proposal sizes to generate scale-aware weights for its multi-scale detection subnetworks. In this paper, we design a novel scale perception strategy using a normalized Gaussian gate function for the scale-aware hierarchical detection network (SHDN), as shown in Fig. 4. The model loss function is defined as:

L(W) = Σ_{m=1}^{M} Σ_{i∈U_m} ω_m L_m(X_i, Y_i|W),    (2)

where M is the number of hierarchical feature pyramid levels mentioned in Section III-A, U_m contains the multi-scale training examples of pedestrian instances, and ω_m is the normalized scale-aware weight for the corresponding hierarchical loss L_m(X_i, Y_i|W), initialized by

ω_m = exp(−(s_i − s̄_m)² / (2γ_m²)) / Σ_{k=1}^{M} exp(−(s_i − s̄_k)² / (2γ_k²)),

where s_i denotes the height scale of pedestrian i, which has already been normalized to a narrow range prior to detection, and s̄_m and γ_m are the average height scale and the scaling coefficient for the specific feature level H_m, respectively. Given a sliding window, a Gaussian function with lower γ_m tends to enlarge the gap between the weights for pedestrian instances from different scale ranges. Based on the ResNet structure, the output size of RoI pooling is 7 × 7, with strides chosen from the set {8, 16, 32} to construct the deep network levels {C_3, C_4, C_5} and the corresponding valid receptive fields of the hierarchical feature pyramid. For efficient training of the scale-aware hierarchical detection network, sampling is used to compensate for the imbalance between the distributions of positive samples U_m^+ and negative samples U_m^−. In this paper, we adopt random sampling and bootstrapped sampling to collect the final set of negative samples, such that |U_m^−| = ζ|U_m^+|. We utilize random sampling to select easy negative samples according to a uniform distribution.
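The normalized Gaussian gate can be sketched numerically with the (s̄_m, γ_m) values reported later in the ablation study. The mapping from pixel height to the normalized scale s is an assumption here (log2 of the height in pixels, so s ≈ 6 for a 64-pixel pedestrian); the paper only states that heights are normalized to a narrow range.

```python
import numpy as np

# Scale-aware parameters (s_bar_m, gamma_m) for levels {H3, H4, H5},
# taken from the ablation section of the paper.
PARAMS = [(5.8, 1.25), (6.8, 2.0), (7.8, 1.25)]

def gate_weights(s):
    """Normalized Gaussian gate: w_m proportional to exp(-(s - s_bar_m)^2 / (2 gamma_m^2))."""
    w = np.array([np.exp(-(s - s_bar) ** 2 / (2.0 * gamma ** 2))
                  for s_bar, gamma in PARAMS])
    return w / w.sum()

# Assumed normalization: s = log2(height in pixels).
w = gate_weights(np.log2(40))  # a 40-pixel (medium/far) pedestrian
print(w.round(3))              # weight mass shifts toward the finest level H3
```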
Because hard negative mining has a large influence on detection accuracy, bootstrapped sampling is exploited to improve detection performance by ranking the negative samples according to their objectness scores. On the other hand, to avoid a heavy asymmetry between positive samples U_m^+ and negative samples U_m^− at each specific detection layer, the cross-entropy terms of positives and negatives are weighted in formula (3), which guarantees that each detection layer has enough positive samples to cover a certain range of scales.
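A sketch of the combined random and bootstrapped negative sampling. The 50/50 split between the two samplers is an assumption not stated in the paper, and ζ = 3 matches the 32 positive / 96 negative mini-batch composition used in the experiments section.

```python
import random

def sample_negatives(negatives, num_pos, zeta=3, boot_frac=0.5):
    """Collect zeta * num_pos negatives: hard ones ranked by objectness
    score (bootstrapped sampling) plus uniformly random easy ones.

    negatives : list of (proposal_id, objectness_score) pairs
    boot_frac : assumed share taken by the bootstrapped sampler
    """
    k = zeta * num_pos
    n_hard = int(k * boot_frac)
    ranked = sorted(negatives, key=lambda n: n[1], reverse=True)
    hard = ranked[:n_hard]                        # highest-scoring false alarms
    easy = random.sample(ranked[n_hard:], min(k - n_hard, len(ranked) - n_hard))
    return hard + easy

random.seed(0)
negs = [(i, random.random()) for i in range(100)]
batch = sample_negatives(negs, num_pos=8, zeta=3)
print(len(batch))  # 24 negatives for 8 positives (zeta = 3)
```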

IV. EXPERIMENTS

A. EXPERIMENT DETAILS
Starting from a ResNet [33] pretrained on ImageNet, we fine-tune the convolutional neural network to extract visual features from observed video frames on the Caltech training dataset. The convolutional and max pooling layers of the ResNet are used as shared convolutional layers before the Region-of-Interest (RoI) pooling layer to produce feature maps from the entire input image. The last convolutional block in ResNet is 2048-d, and we employ a randomly initialized 1024-d 1 × 1 convolutional layer to reduce the dimension. We use single-scale training in which the input image is resized to 600 pixels on the shortest side. The scale-aware feature aggregation network is trained with Stochastic Gradient Descent (SGD) with a momentum of 0.9 and a weight decay of 0.0005. As [9], [30] demonstrate that mining from a larger set of candidates (e.g., 2000) has no benefit, we use 300 RoIs for both training and testing in this paper. We fine-tune the scale-aware hierarchical detection network with a learning rate of 0.001 for 20k mini-batches. Each mini-batch consists of 128 randomly sampled object proposals from one randomly selected image, of which 32 are positive object proposals and the remaining 96 are negative. A positive pedestrian label is assigned when IoU ≥ 0.5 between the object proposal and a ground-truth box, and a negative label is assigned to RoIs whose IoU ≤ 0.3 with all ground-truth boxes. The whole scale-aware hierarchical detection network is trained on a single NVIDIA GeForce GTX TITAN X GPU with 12 GB of memory.
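The IoU-based labeling rule above can be sketched as follows; the helper names are illustrative, but the 0.5/0.3 thresholds are the ones stated in the text.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def label_proposal(proposal, gt_boxes):
    """Positive if IoU >= 0.5 with some ground truth, negative if IoU <= 0.3
    with all of them, otherwise ignored during sampling."""
    best = max((iou(proposal, g) for g in gt_boxes), default=0.0)
    if best >= 0.5:
        return 1    # pedestrian
    if best <= 0.3:
        return 0    # background
    return -1       # ambiguous overlap, not sampled

gt = [(10, 10, 50, 90)]
print(label_proposal((12, 12, 50, 90), gt))    # 1: heavy overlap
print(label_proposal((200, 10, 240, 90), gt))  # 0: no overlap
```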

B. ABLATION EXPERIMENTS

1) EVALUATING THE CROSS-SCALE FEATURES AGGREGATION MODULE
As mentioned in [7], the Region Proposal Network (RPN) in Faster R-CNN performs well as a stand-alone detector, but the downstream classifier degrades pedestrian detection performance. In this subsection, we investigate the cross-scale features aggregation module in terms of detection quality, evaluated by the log-average miss rate of pedestrian detection at IoU = 0.5 on the Caltech dataset.
First of all, we evaluate the high-level convolutional layers (ResNet-50-C3 to ResNet-50-C5) of ResNet [33] for extracting RoI features to detect pedestrians, using a set of anchor scales from the RPN. As shown in Table 1(a)(b)(c), which illustrates the effect of high-level convolutional features in ResNet-50, the higher convolutional layers (e.g., C_4, C_5) obviously perform better than the lower ones (e.g., C_3) for pedestrian instances at near scale. This can be attributed to higher-level convolutional features carrying more robust semantic information than lower levels.
Further, compared to adopting a single high-level convolutional layer (e.g., C_3, C_4, or C_5) for detection, FPN (e.g., P_3, P_4, or P_5) fuses the semantically strong features from higher convolutional layers to enhance the pyramid features, yet its improvement remains limited, as shown in Table 1(e), which may be due to the lack of the accurate localization signals existing in lower convolutional layers. Therefore, we propose a cross-scale features aggregation module (CFAM) to fuse semantic information and localization signals by adding a bottom-up augmented pathway to FPN. As shown in Table 1(g), H_3 achieves the best pedestrian detection performance at the far and medium scales, reaching 72.83% MR and 36.50% MR, respectively. Note that H_4 achieves 43.69% MR for pedestrians over all scales.

2) THE ROLE OF SCALE-AWARE HIERARCHICAL DETECTION NETWORK
In this subsection, the contribution of the proposed scale-aware hierarchical detection network is evaluated by the log-average miss rate at IoU = 0.5 on the Caltech testing dataset. We conduct comparison experiments to verify the effectiveness of the proposed method with a single output layer versus multiple output layers as detection heads. As shown in Table 2(a)(b)(c), we compare the single output layers H_3, H_4 and H_5 from the proposed cross-scale features aggregation module as detection heads for pedestrian detection at different scales. We find that H_3 performs better than the other single output layers on log-average miss rate at the far and medium scales. For the near scale, H_4 achieves the best single-output-layer performance, at 2.12% MR, a relative improvement of 14.23% over the competitor H_3.
However, detecting pedestrians from only a single output layer cannot effectively cover multiscale pedestrians appearing with large scale variations, owing to the lack of scale complementarity among multiple feature layers with different sizes of filter receptive fields. To effectively combine multiple output layers of the feature pyramid for pedestrian detection, we adopt the scale-aware parameters (s̄_m, γ_m) to initialize the learned hierarchical weights ω_m that optimize the multi-task loss function in formula 2. Specifically, we assign the scale-aware parameters (s̄_m, γ_m) as {(5.8, 1.25), (6.8, 2), (7.8, 1.25)} for the hierarchical feature pyramid {H_3, H_4, H_5}, respectively. Combining output layers consistently improves over any single layer, as shown in Table 2(e). The reason may be that in our proposed hierarchical scale-aware detection network each detection branch learns a proper pyramid feature layer to focus on pedestrian instances within certain scale ranges. Moreover, the log-average miss rate is reduced to 40.39% for all-scale pedestrian detection, 28.77% for the medium scale and 1.08% for the near scale by combining layers {H_3, H_4, H_5}, as shown in Table 2(f). Note that combining {H_3, H_4, H_5} achieves the best performance compared to {C_3, C_4, C_5}, {P_3, P_4, P_5}, and {P_2, P_3, P_4, P_5}, shown in Table 2(g-i). The experiments demonstrate that the proposed hierarchical scale-aware detection network is more flexible and able to take advantage of the different sizes of filter receptive fields from multiple pyramid feature levels under large variance in pedestrian instance scales.

C. COMPARISON WITH STATE-OF-THE-ARTS
In this section, the performance of the proposed algorithm is fully evaluated against state-of-the-art methods on the Caltech [20] and ETH [18] datasets. Following the evaluation criteria proposed in [54], the log-average miss rate is used to summarize detector performance; it is computed by averaging the miss rate at FPPI rates evenly spaced in log-space within the range 10^-3 to 10^0. The experiments demonstrate that jointly using the cross-scale features aggregation module and the scale-aware hierarchical detection network outperforms the state-of-the-art pedestrian detection algorithms, especially on pedestrian instances of small sizes.
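A minimal sketch of the log-average miss rate computation over the FPPI range stated above; the nearest-sample lookup at each reference point is a simplification of the official Caltech evaluation code.

```python
import numpy as np

def log_average_miss_rate(fppi, miss_rate, lo=1e-3, hi=1.0, n_ref=9):
    """Average the miss rate at n_ref FPPI points evenly spaced in
    log-space over [lo, hi]; the average itself is taken in log space."""
    refs = np.logspace(np.log10(lo), np.log10(hi), n_ref)
    mrs = []
    for r in refs:
        # take the miss rate at the largest sampled FPPI <= the reference
        idx = np.where(fppi <= r)[0]
        mrs.append(miss_rate[idx[-1]] if len(idx) else 1.0)
    # clip keeps log() finite for perfect (zero-miss) reference points
    return np.exp(np.mean(np.log(np.maximum(mrs, 1e-10))))

# A flat miss-rate curve must return its own value:
flat = log_average_miss_rate(np.logspace(-3, 0, 50), np.full(50, 0.4))
print(round(float(flat), 3))  # 0.4
```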

1) COMPARISON WITH STATE-OF-THE-ART METHODS ON CALTECH DATASET
The Caltech pedestrian dataset consists of approximately 10 hours of 640×480, 30 Hz video taken from a vehicle driving through regular traffic in an urban environment, comprising about 250,000 frames with a total of 2300 unique pedestrians. Similar to other relevant publications [13], [16], [17], [24], we evaluate our method at different spatial scales of pedestrians on the Caltech testing dataset, and choose the Caltech training dataset and the INRIA training dataset [10] as our training set. Our proposed method is compared with state-of-the-art methods on the Caltech testing dataset, including LDCF [22], ACF+SDt [34], RPN+BF [7], MS-CNN [5], CompACT-Deep [16], TA-CNN [24], SA-FastRCNN [4], FasterRCNN+ATT [44], and AR-Ped [2].
To evaluate the effectiveness of our proposed scale-aware hierarchical detection network, quantitative comparisons are presented for different scale ranges of pedestrian instances on the Caltech dataset. Fig. 5 shows the comparison of log-average miss rates for pedestrians at different scale ranges. It can be observed that our proposed method significantly outperforms the other methods, achieving the lowest log-average miss rate, 28.77%, on the medium scale of the Caltech dataset, as shown in Fig. 5(a), which is lower than the state-of-the-art approach FasterRCNN+ATT [44] by 11.98%. Showing a similar trend in Fig. 5(b), our approach achieves a 7.41% log-average miss rate for pedestrian instances taller than 50 pixels, second only to the state-of-the-art approach AR-Ped [2].
For pedestrian instances in the far scale range, most methods exhibit dramatic performance drops, as shown in Fig. 5(c): it is difficult to reliably identify small-size pedestrian instances under 30 pixels in height. Our proposed method nevertheless outperforms the available state-of-the-art competitors; in Fig. 5(c), the log-average miss rate is reduced to 70.69%, an improvement of 20.25% compared to FasterRCNN+ATT [44]. This mirrors human performance, which is also quite good at large scales but degrades noticeably at the medium and far scales. Significantly, over the whole scale span, our approach achieves a 40.39% log-average miss rate for all pedestrian instances taller than 20 pixels, better than the current FasterRCNN+ATT [44] by 14.12%, as shown in Fig. 5(d).
The comparison results across different scale ranges of pedestrian instances demonstrate that our proposed approach substantially improves pedestrian detection performance. Fig. 6 shows the detection results of our proposed scale-aware hierarchical detection network on the Caltech dataset. Green dotted bounding boxes represent true positive windows, where the intersection over union (IoU) between the detected window and the ground truth (green solid bounding box) exceeds 50%; otherwise, red dotted bounding boxes denote false positive windows. As shown in Fig. 6, most pedestrian instances across different scale ranges can be detected by our proposed approach. Moreover, by adaptively perceiving the augmented feature level with different resolutions for specific-scale pedestrians, medium-size and small-size pedestrian instances can also be detected by the proposed scale-aware hierarchical detection network. Some red dotted bounding boxes in Fig. 6 correspond to correctly detected pedestrians that are simply not marked in the ground truth. This experiment shows that jointly using the cross-scale features aggregation module and the scale-aware hierarchical detection network outperforms the state-of-the-art algorithms, especially for pedestrian instances in the medium and small scale ranges.

2) COMPARISON WITH STATE-OF-THE-ART METHODS ON ETH DATASET
The ETH benchmark dataset consists of 3 testing video sequences with a resolution of 640×480 and a frame rate of 13 FPS. Studies in [7], [37] report the remarkable detection performance of state-of-the-art algorithms evaluated on the ETH dataset, including ChnFtrs [21], MultiFtr+Motion [35], JointDeep [37], pAUCBoost [40], ConvNet [41], DBN-Mut [12], SpatialPooling [39], TA-CNN [24], and RPN+BF [7]. As most approaches are trained on the INRIA training dataset [10], our proposed method is also trained on the INRIA training dataset. As shown in Fig. 7(a), the log-average miss rate of our proposed approach reaches 44.75%, next to the state-of-the-art SpatialPooling [39] at 43.36%, for pedestrians at medium scale. Showing a similar trend for pedestrians at near scale, our approach achieves a 20.49% log-average miss rate, second only to the best available competitor RPN+BF [7], as shown in Fig. 7(b). Significantly, for pedestrian instances taller than 80 pixels, our approach obtains a 16.84% log-average miss rate, an improvement of 0.78% over the state-of-the-art RPN+BF [7], as shown in Fig. 7(c). Moreover, for the more challenging setting with large scale variation (above 20 pixels in height), the log-average miss rate of our approach is reduced by 3.98% relative to RPN+BF [7] on the ETH dataset, as shown in Fig. 7(d). The results demonstrate that our proposed method has substantially better detection performance for multiscale pedestrian instances appearing with large scale variations in natural scenes.
The pedestrian detection results of our proposed method on the ETH dataset are shown in Fig. 8, where the green dotted boxes show the detections of our approach. Our proposed approach adaptively perceives the augmented feature level through the scale-aware hierarchical detection network to generate the final detection results for specific-scale pedestrians. Small-size pedestrian instances can also be detected; the red dotted bounding boxes represent correctly detected pedestrians that are not marked in the ground truth, as shown in Fig. 8. One can observe that our method successfully detects most of the pedestrian instances, especially pedestrians with large scale variations.

V. CONCLUSION
This study describes an effective approach to detecting pedestrian instances over different scale ranges. The proposed cross-scale features aggregation module adaptively fuses hierarchical features to enhance the feature pyramid representation by merging the lateral connection, the top-down path and the bottom-up path. Moreover, by probing the differences of local features under different sizes of receptive fields, the proposed scale-aware hierarchical detection network effectively integrates multiscale pedestrian detection into a unified framework through adaptively perceiving the augmented feature level for specific-scale pedestrian detection. Experimentally, compared with the state-of-the-art FasterRCNN+ATT [44], the log-average miss rate of pedestrian detection is reduced by 11.98% for medium-scale pedestrians (between 30-80 pixels in height) and 14.12% for whole-scale pedestrians (above 20 pixels in height) on the Caltech benchmark.

CHENGLIZHAO CHEN received the Ph.D. degree in computer science from Beihang University, in 2017. He is currently an Assistant Professor with Qingdao University. His research interests include computer vision, machine learning, and pattern recognition.