Robust Head Detection in Complex Videos Using Two-Stage Deep Convolution Framework

Pedestrian head detection plays an important role in identifying and localizing individuals in real-world visual data. Head detection is a nontrivial problem due to considerable variance in camera viewpoints, scales, human poses, and appearances in the scene. The translation invariance property of convolutional neural networks (CNNs) enables large-capacity CNNs to handle appearance and pose variations in the scene. However, scale invariance is still an open issue. To address this problem, this paper presents a two-stage head detection framework that utilizes a fully convolutional network (FCN) to generate scale-aware proposals, followed by a CNN that classifies each proposal into two classes, i.e., head and background. Experimental results show that using scale-aware proposals obtained by the FCN improves the object recall rate and mean average precision (mAP). Additionally, we demonstrate that our framework achieves state-of-the-art results on four challenging benchmark datasets, i.e., HollywoodHeads, Casablanca, SHOCK, and WIDERFACE.


I. INTRODUCTION
For many vision-based applications, pedestrian and human face detection is a pre-processing step. These applications include person identification [53], [56], action recognition [14], [37], tracking [40], autonomous driving, and behavior understanding [15], [16]. While these algorithms have matured in recent years [28], [46], detecting pedestrians in natural images and videos is still challenging. A face detector cannot extract facial features for a person whose face is not visible. Person detection, on the other hand, is challenging because a large portion of the human body may be hidden by occlusion and clutter in the scene. For these reasons, face and pedestrian detection methods are not applicable in natural scenes. Therefore, to find people in unconstrained images and videos, the head is an indispensable choice.
The goal of a head detector is to precisely detect and localize human heads in naturalistic conditions. Precise head detection is an important element and is used as a pre-processing step in many video surveillance applications, for example, tracking [3], [12], person authentication [25], and density estimation [36]. In recent years, a few strides have been made towards head detection in crowds [7], [21], [39] in complex scenes; however, head detection is still a challenging task. Significant variations in the poses, scales, and appearances of human heads make the problem even more challenging.
The associate editor coordinating the review of this manuscript and approving it for publication was Shadi Alawneh.
A reliable head detection system should be invariant to scales, appearances, and poses. Figure 1 highlights these problems, where three human heads are marked in red, green, and yellow. From the figure, it is clear that the heads have different scales (sizes), poses, and appearances. Convolutional neural networks (CNNs) are inherently translation invariant. Due to this property, a large-capacity CNN can handle variations in pose and appearance. However, CNNs are not inherently scale invariant, so scale handling still has room for improvement.
Most existing methods treat head detection as a special case of the generic object detection problem. The detection pipeline of these detectors consists of two stages: (1) object proposal generation and (2) classification of the object proposals. Therefore, for two-stage object detectors, object proposal generation is an important pre-processing step. In contrast to an exhaustive search for objects in the image, object proposals guide the search. Generating object proposals is preferred over sliding-window approaches for the following reasons: (1) it saves computation time by passing a small number of proposals to the detector, and (2) it improves the precision and recall rates. Acknowledging the importance of object proposals in object detection tasks, several methods for generating object proposals have been reported in the literature in recent years.
Recent object proposal generators exploit saliency, gradient, and edge information [5], [59] to hypothesize the location of objects in images. Later, DeepBox [20] moved a step forward and refined the proposals generated by EdgeBox [59]. DeepProposal [9] utilizes the initial and final layers of the network in an inverse cascade fashion to generate object proposals. Multi-Box [23] employs regression to extract object regions.
Detectors usually face challenges in detecting heads in natural scenes, since human heads vary significantly in scale, appearance, and pose, as mentioned before. Therefore, existing two-stage methods usually achieve low precision and recall rates when tested in natural scenes. To address the problem of scale, we propose a novel strategy to detect human heads in complex scenes.
Specifically, we propose a head detector that detects heads at multiple scales in various complex scenes and follows a sequential pipeline. The first part is a multi-scale object proposal generation network that captures the distribution of scales in the input image by generating scale-specific object proposals. Concisely, a binary classifier is trained by employing [24] using patches belonging to human heads. The input to the network is an image of arbitrary size, and the output is a dense heat map. For every pixel, the dense heat map represents the confidence that the corresponding region contains a human head rather than background. To generate multi-scale object proposals, we re-size the input image to multiple scales (an image pyramid), pass the image pyramid to the network, and obtain multiple heat maps corresponding to the levels of the pyramid. Non-maximal suppression is then employed to remove redundant proposals and obtain refined proposals.
The contribution of this paper lies in the first part of the proposed framework. Compared with existing methods, our framework makes the following contributions: 1) We propose a novel framework to handle the scale problem by generating scale-aware proposals using a Fully Convolutional Network that generates pixel-wise head scores and square bounding boxes of the head instances across various scales and locations of the input image. 2) With the adoption of an anchor-free, scale-specific region proposal network, our framework significantly reduces the time cost compared to feed-forwarding each object proposal individually through the CNN. 3) Compared to deep networks, e.g. GoogleNet, we train a shallow network for head/background classification. This model can be adapted to dense prediction of human heads in images of arbitrary size. 4) The proposed framework shows superior performance on challenging datasets.
Comparison and Difference: The proposed proposal generation network is superficially similar to the typical Region Proposal Network (RPN) adopted by FRCNN. However, it differs in many aspects. For example, FRCNN uses a large receptive field to detect generic objects in images. Usually, these objects are large and occupy a large portion of the image. Such objects can easily be detected by FRCNN; however, FRCNN has difficulty detecting small objects whose size is less than 16 pixels. This is because the ROI pooling layers of FRCNN use feature maps from the highest convolutional layer. These feature maps have reduced resolution and have lost most of the important information related to small objects. Therefore, FRCNN cannot precisely classify and localize small objects. Another drawback of FRCNN is that it uses anchor boxes with predefined sizes and scales. To achieve high precision and recall, anchor boxes should have different sizes and scales to cover the size and shape variations of generic objects in the image. In crowded scenes, the size and shape of heads change significantly compared to generic large objects, so a much more complex anchor box design is required to capture the wide range of scales. Therefore, anchor-box-based methods are inefficient in such cases. Our proposed framework differs from FRCNN in the following ways. (1) The most important difference is that the proposed RPN is an anchor-free, class-specific proposal generator, in contrast to an anchor-based generic proposal generator. (2) We train a head descriptor that can detect heads at extreme scales by incorporating multi-scale features using an image pyramid.
The rest of the paper is organized as follows. In Section II, we discuss related work. Section III describes the proposed methodology. Experimental results on different datasets are reported in Section IV. The conclusion is presented in Section V.

II. RELATED WORKS
Since our framework has two sequential parts, i.e., object proposal generation and head detection, we discuss related work in separate subsections.

A. GENERATING OBJECT PROPOSALS
We categorize object proposal methods into two categories: (1) segment-based methods and (2) window scoring methods. In addition to these methods, we also discuss CNN-based approaches.
The goal of segment-based methods is to generate multiple segments from the image that may contain objects. These methods typically start with an initial over-segmentation, followed by different merging strategies that cluster similar segments into object proposals based on color, texture, and shape. For example, Selective Search (SS) [41] generates object proposals by greedily merging super-pixels without learning. Randomized Prim [26] utilizes a connectivity graph to learn a randomized merging strategy. Graph cut is used in [29] to merge super-pixels to generate proposals. Multiscale Combinatorial Grouping (MCG) [2] exploits multi-scale hierarchical segmentation to obtain object proposals. Reference [18] computes a geodesic distance transform between multiple segments, where the distance transform represents object proposals. The above-mentioned methods achieve high recall rates; however, they are computationally expensive, since proposals are obtained by multiple segmentations at multiple scales and in multiple color spaces.
Window scoring methods, on the other hand, score the likelihood that a window contains an object of interest and are therefore computationally efficient compared to segmentation-based methods. Generally, these methods first generate candidate object proposals (bounding boxes) at multiple locations and scales. High-confidence boxes are then selected as object proposals. Objectness [1] selects the highest-ranked proposals on the basis of low-level cues such as edges, size, location, and color. BING [5] trains a linear Support Vector Machine and applies it in a sliding-window fashion on the gradient map. Similarly, Edge Boxes [59] also follows a sliding-window approach and assigns a score to each window based on the edge map. In contrast to segment-based methods, window scoring approaches are fast. However, because proposals are sampled at discrete levels, these methods yield poor detection accuracy.
Due to the popularity of deep learning models, CNNs have also been explored for the object proposal generation task. Overfeat [34] trains a deep model that operates in a sliding-window fashion and simultaneously predicts a bounding box and score for each object. MultiBox [8] also trains a CNN that generates a fixed number of proposals without adopting a sliding-window strategy. DeepBox [20], on the other hand, does not output proposals by itself but re-ranks the proposals generated by other methods.
Our proposed object proposal approach falls into the category of window scoring methods. However, our method adopts a different scheme by exploiting a fully convolutional network that outputs heat maps. Our method is similar to the Region Proposal Network (RPN), which is employed in [32] for object proposal generation. RPN generates a fixed number of proposals based on pre-defined anchor boxes. Compared to RPN, our method neither generates a fixed number of proposals nor relies on pre-defined anchor boxes. Unlike Overfeat [34], MultiBox [8], and RPN [32], our method does not regress the bounding boxes. Instead, we adopt a mapping scheme in which each pixel in the heatmap corresponds to a window in the input image. The integration of localization information with the scale-specific strategy achieves better performance and higher recall rates than bounding box regression methods.

B. HEAD DETECTION
Most related works treat head detection as a special case of object detection. Traditional head detection methods classify hand-crafted features with a non-linear classifier. For example, the classical method proposed by Viola and Jones [43] extracts Haar-like features from the image and employs a cascaded boosting classifier for classification. In [33], the authors move a step forward and refine the results of Viola and Jones by exploiting spatial and temporal information using a Conditional Random Field (CRF). The Deformable Part Model (DPM) [47] utilizes Histogram of Oriented Gradients (HOG) features and was a widely adopted model in object detection tasks. However, these traditional methods suffer from degraded performance and high computational cost in real-world scenes.
Convolutional Neural Networks have enjoyed tremendous success in classification and segmentation tasks. Following the success of CNNs, deep neural networks became the first choice for object detection. The most notable step in this direction was taken by the Region-based Convolutional Neural Network (R-CNN) [10]. R-CNN is a two-stage framework, where the first step generates object proposals (around 2000) by employing the Selective Search (SS) method. The proposals generated by SS are then fed to a feed-forward network, which extracts hierarchical features from the 5th layer of AlexNet. A linear SVM classifier is then learned using the hierarchical features extracted from the last convolutional layer. Although R-CNN achieved state-of-the-art results, it suffers from high computational complexity. A more refined version, Faster R-CNN, replaced the traditional Selective Search strategy with an RPN. You Only Look Once (YOLO) [30] generates bounding boxes using regression and classifies each bounding box by assigning class scores to it. YOLO beats Faster R-CNN in terms of inference speed on most existing object detection datasets, however, at the cost of accuracy. Single Shot Detector (SSD) [23] generates a fixed number of bounding boxes by utilizing a fully convolutional network.
VOLUME 8, 2020
Although the above models achieve considerable performance in classifying multiple objects in an image, they face challenges in detecting small objects. This is because most models utilize features from the last convolutional layer for object detection, and the last convolutional layers contain inadequate information about small objects. Since the target (head) in the head detection problem is usually small (down to 10-20 pixels), existing methods in their current form are not well suited to detecting such small objects.

III. PROPOSED METHODOLOGY

A. NETWORK DESIGN FOR OBJECT PROPOSALS
In this section, we discuss the proposed architecture for generating scale-aware proposals. Fully Convolutional Networks (FCNs), which take an input of arbitrary size and predict a dense output of the same size, have become dominant in image segmentation tasks. The output of an FCN may also be used in other dense prediction tasks (e.g., image restoration, depth estimation, and semantic segmentation). Our multi-scale object proposal generation framework is based on an FCN that takes the whole image as input and produces a high-level semantic heat map. Each pixel in the output heat map represents the extent to which the corresponding region in the input image contains a human head. In short, we train a binary classifier (head/background) using a patch-wise training strategy with annotated heads. The framework slides over the image with the network stride and feed-forwards each sampled window to the binary classifier. The output is a heat map, where each pixel represents the confidence value of one window (patch) in the input image, as shown in Figure 3. Generally, fully convolutional networks are more efficient than existing sliding-window methods because they share computation among overlapping windows. Moreover, FCNs are translation invariant and take images of arbitrary size as input.
For an input image, the size of the image patch (window) that corresponds to one pixel in the heat map is called the scale (receptive field size) of the network. Several parameters affect the scale of the network, for example, the depth of the network, the sizes of the convolution and pooling layers, and the stride settings. Let $R_i$ represent the receptive field of network layer $i$, where $i \in \{1, 2, \ldots, n\}$ and $n$ is the total number of layers in the network. Then the scale of the network is $R_1$, and we can compute the receptive field of any layer $i$ of the network using the recursive formulation (2),
where $r_i$ represents the convolution or pooling stride and $k_i$ is the kernel size of the $i$-th convolution/pooling layer. $R_i$ and $R_{i-1}$ represent the receptive fields of the $i$-th and $(i-1)$-th layers, respectively. To precisely map any pixel in the heat map to the corresponding window region in an image, we need to compute the receptive field (scale) and the stride (or network stride) $N_s$. One inherent issue with FCNs is that $N_s$ is determined by the network itself and is equal to the product of the strides of all network layers.
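As an illustration of how the scale $R_1$ and the network stride $N_s$ follow from the layer settings, the sketch below applies the commonly used receptive-field recursion $R_{i-1} = r_i (R_i - 1) + k_i$, walking from a single output pixel back to the input, and multiplies the layer strides. This recursion is the standard textbook form and is assumed here; it is not necessarily identical to the paper's own equation. The function name and the example layer stack are hypothetical and do not reproduce the architecture in Table 1.

```python
from typing import List, Tuple


def receptive_field_and_stride(layers: List[Tuple[int, int]]) -> Tuple[int, int]:
    """Return (receptive field at the input, network stride).

    `layers` lists (kernel_size, stride) for every convolution/pooling
    layer, ordered from the input to the output of the network.
    """
    # Start from one pixel of the heat map and walk back towards the input.
    rf = 1
    for kernel, stride in reversed(layers):
        rf = stride * (rf - 1) + kernel

    # The network stride is the product of all layer strides.
    net_stride = 1
    for _, stride in layers:
        net_stride *= stride
    return rf, net_stride


# Hypothetical toy stack: conv 3x3/1, pool 2x2/2, conv 3x3/1, pool 2x2/2.
toy_layers = [(3, 1), (2, 2), (3, 1), (2, 2)]
scale, stride = receptive_field_and_stride(toy_layers)
print(scale, stride)
```

Padding is ignored in this sketch, so it only describes the size of the window, not its exact offset in the image.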
With the known receptive field size $R_1$ and network stride $N_s$, we can compute the window region in the input image that corresponds to a pixel in the heat map. Let $(x_o, y_o)$ represent a pixel in the heat map. We can compute its corresponding window $W = \{x_{min}, x_{max}, y_{min}, y_{max}\}$ in the input image as follows,
We train the network from scratch, and the details of the proposed network architecture are shown in Table 1. Our architecture follows the geometry of AlexNet [19] for the first five convolutional layers. In our design, we convert the 6th layer of AlexNet to a fully convolutional layer with a kernel size of 6 × 6. The last 1 × 1 convolutional layers follow Network in Network (NN) [22]. Each convolutional layer of the network is followed by a ReLU layer. We use a softmax layer on top of the network that predicts a confidence score within the range of 0 to 1 by optimizing the cross entropy loss (2),
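The mapping from a heat-map pixel $(x_o, y_o)$ to its window $W$ in the input image, described above, can be sketched as follows. This is a minimal sketch under stated assumptions: the window is anchored at its top-left corner, padding offsets are ignored, and the window is square with side $R_1$. The function name and the optional `scale` argument (the resize factor of a pyramid level) are hypothetical.

```python
from typing import Tuple


def heatmap_pixel_to_window(
    x_o: int,
    y_o: int,
    receptive_field: int,
    net_stride: int,
    scale: float = 1.0,
) -> Tuple[float, float, float, float]:
    """Map heat-map pixel (x_o, y_o) to (x_min, x_max, y_min, y_max).

    Assumption: moving one pixel in the heat map moves the window by
    `net_stride` pixels in the network input, and the window side equals
    the receptive field. Dividing by `scale` converts the coordinates back
    to the original image when the input was resized by that factor.
    """
    x_min = x_o * net_stride
    y_min = y_o * net_stride
    x_max = x_min + receptive_field
    y_max = y_min + receptive_field
    return (x_min / scale, x_max / scale, y_min / scale, y_max / scale)


# Example with hypothetical values R_1 = 10 and N_s = 4.
print(heatmap_pixel_to_window(2, 3, receptive_field=10, net_stride=4))
```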
where $t_k$ represents the $k$-th ground truth value and $p_k$ denotes the $k$-th prediction value. For training, instead of feeding the whole image, we adopt a patch-wise training strategy. To generate the training data, we crop positive patches around annotated heads and re-size them to the input size of the network (224 in our case). We also crop several patches around the human heads with Intersection over Union (IoU) ≥ 0.5 and treat them as positive samples. This step increases the number of positive patches and balances the data (the number of positive and negative patches). For the negative samples, we sparsely sample patches from the background with IoU < 0.5. IoU is computed as the area of intersection of a candidate box and the ground truth box divided by the area of their union.
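The IoU definition and the positive/negative labeling rule above can be illustrated with a short sketch. The box layout `(x_min, y_min, x_max, y_max)` and the function names are assumptions made for this example; only the 0.5 threshold comes from the text.

```python
from typing import Iterable, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection area divided by union area of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def label_patch(patch: Box, gt_heads: Iterable[Box], threshold: float = 0.5) -> int:
    """Return 1 (head) if the patch overlaps any annotated head with
    IoU >= threshold, otherwise 0 (background)."""
    best = max((iou(patch, gt) for gt in gt_heads), default=0.0)
    return 1 if best >= threshold else 0


heads = [(2.0, 0.0, 12.0, 10.0)]
print(label_patch((0.0, 0.0, 10.0, 10.0), heads))    # shifted copy of a head
print(label_patch((30.0, 30.0, 40.0, 40.0), heads))  # background patch
```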
For all layers of the network, we initialize the weights from a zero-mean Gaussian distribution with a standard deviation of 0.01 and set the biases to 0. We adopt stochastic gradient descent (SGD) during training with an initial learning rate of 0.01. We reduce the learning rate by a factor of 10 after every 40 epochs. We set the batch size to 256.
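The initialization and learning-rate schedule described above can be written down compactly. The sketch below is framework-independent and uses only the values stated in the text (standard deviation 0.01, initial rate 0.01, step of 40 epochs); reading the schedule as a division by 10 at each step is an assumption, and the function names are hypothetical.

```python
import random
from typing import List


def init_weights(count: int, std: float = 0.01, seed: int = 0) -> List[float]:
    """Draw `count` weights from a zero-mean Gaussian with the given std."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, std) for _ in range(count)]


def learning_rate(
    epoch: int, base_lr: float = 0.01, factor: float = 10.0, step: int = 40
) -> float:
    """Step decay: divide the learning rate by `factor` every `step` epochs."""
    return base_lr / (factor ** (epoch // step))


weights = init_weights(5)
biases = [0.0] * 5  # biases start at zero
for epoch in (0, 39, 40, 80):
    print(epoch, learning_rate(epoch))
```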

B. MULTISCALE OBJECT PROPOSALS
In this section, we discuss our strategy for generating multiscale proposals. For the fully convolutional network discussed above, pixels in the heatmap cover windows of fixed size $R_1$ in an image. Therefore, the FCN can only detect heads of size $R_1$ in the original image. However, the size of heads varies significantly due to perspective distortions. Therefore, to generate object proposals that capture different sizes of human heads, we re-size the original input to multiple sizes and generate an image pyramid.
After generating the image pyramid, we feed each re-sized image of the pyramid to the network and predict the corresponding heatmap. The heatmaps generated from different levels of the pyramid have different receptive fields. Figure 4 shows an original input image that is re-sized to different sizes, i.e., 28 × 28, 56 × 56, and 112 × 112, and then fed to the network one by one. We predict the corresponding heatmaps as shown in Figure 4. From the figure, we infer that in the heatmap corresponding to the smaller scale (28 × 28), the network gives a higher response on smaller heads and a lower response on bigger heads. In the same way, the network characterizes bigger heads at the large scale, i.e., 112 × 112. With this motivation, we propose a multiscale strategy to generate scale-aware proposals that capture different sizes of heads in the image. The proposed pipeline for generating multiscale proposals is shown in Figure 2.
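To make the multiscale idea concrete, the sketch below turns the heatmaps of several pyramid levels into square proposals in original-image coordinates. It is a hedged illustration, not the paper's implementation: it assumes the heatmaps have already been produced by the network, that each level is identified by its resize factor, and that a heat-map pixel maps to a top-left anchored window without padding offsets. All names are hypothetical.

```python
from typing import Dict, List, Tuple

Proposal = Tuple[float, float, float, float, float]  # x_min, y_min, x_max, y_max, score
Heatmap = List[List[float]]


def proposals_from_heatmap(
    heatmap: Heatmap,
    factor: float,
    receptive_field: int,
    net_stride: int,
    min_score: float = 0.0,
) -> List[Proposal]:
    """Convert one heatmap (computed on the image resized by `factor`)
    into square boxes expressed in original-image coordinates."""
    size = receptive_field / factor  # window side in the original image
    boxes = []
    for y_o, row in enumerate(heatmap):
        for x_o, score in enumerate(row):
            if score < min_score:
                continue
            x_min = x_o * net_stride / factor
            y_min = y_o * net_stride / factor
            boxes.append((x_min, y_min, x_min + size, y_min + size, score))
    return boxes


def multiscale_proposals(
    heatmaps: Dict[float, Heatmap],
    receptive_field: int,
    net_stride: int,
    min_score: float = 0.0,
) -> List[Proposal]:
    """Merge the proposals of all pyramid levels, highest score first."""
    merged: List[Proposal] = []
    for factor, heatmap in heatmaps.items():
        merged.extend(
            proposals_from_heatmap(heatmap, factor, receptive_field, net_stride, min_score)
        )
    merged.sort(key=lambda p: p[4], reverse=True)
    return merged


# Two hypothetical levels: full resolution and half resolution.
levels = {1.0: [[0.9, 0.1]], 0.5: [[0.6]]}
print(multiscale_proposals(levels, receptive_field=10, net_stride=4, min_score=0.3))
```

Under these assumptions, a level resized by a factor below 1 yields larger windows in the original image, so each level is responsible for one head size.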
Acknowledging the effectiveness of the multi-scale strategy, we now find the set of scales required to precisely detect all human heads in a given image. Generally, a large set of scales results in a large number of proposals concentrated around the regions containing heads. However, this setting also produces a large number of false bounding boxes (not likely to contain a head), which may lower the precision. On the other hand, a small set of scales usually misses objects in the image and results in lower recall. This raises a trade-off in selecting the parameters for the multiscale setting.
We use scale values ranging from a minimum bounding box size of 28 × 28 (784 pixels in area) to the full resolution of the image. For head detection, we keep the aspect ratios within $[\frac{2}{3}, \frac{3}{2}]$ for all bounding boxes. The exact values of the scale $S$ can be computed as follows,
where $r$ is the index and takes values from 0 to $\left[\log\left(I/\sqrt{784}\right) / \log\left(1/\alpha\right)\right]$, $I$ is the image size, and $\alpha$ is the step size of the scale, representing the IoU between neighboring boxes [59]. In all our experiments, we fix the value of $\alpha$ to 0.65, as it is ideal for most cases [59]. After obtaining multiscale proposals from the different heatmaps, we remove bounding boxes with a score lower than 0.3. This step significantly reduces the number of proposals. In the next step, we sort all the remaining proposals in descending order of score and apply non-maximal suppression (NMS). In all our experiments, we fix the NMS threshold to 0.8.
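The scale enumeration and the post-processing steps above can be sketched as follows. The scale formula itself is assumed: the code uses the geometric progression $S_r = \sqrt{784}\,(1/\alpha)^r$, which is consistent with the stated index range but is not confirmed by the text, and it reads the square brackets as a floor. The NMS shown is the usual greedy variant with the 0.8 threshold interpreted as an IoU threshold. Function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Proposal = Tuple[float, float, float, float, float]  # x_min, y_min, x_max, y_max, score


def pyramid_scales(image_size: float, min_area: float = 784.0, alpha: float = 0.65) -> List[float]:
    """Assumed progression S_r = sqrt(min_area) * (1 / alpha) ** r for
    r = 0 .. floor(log(I / sqrt(min_area)) / log(1 / alpha))."""
    base = math.sqrt(min_area)
    if image_size < base:
        return []
    r_max = int(math.floor(math.log(image_size / base) / math.log(1.0 / alpha)))
    return [base * (1.0 / alpha) ** r for r in range(r_max + 1)]


def iou(a: Sequence[float], b: Sequence[float]) -> float:
    """IoU of two boxes whose first four entries are x_min, y_min, x_max, y_max."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def filter_and_nms(
    proposals: List[Proposal], min_score: float = 0.3, iou_threshold: float = 0.8
) -> List[Proposal]:
    """Drop low-score boxes, sort by score, then greedily suppress boxes
    that overlap an already kept box by more than `iou_threshold`."""
    candidates = sorted(
        (p for p in proposals if p[4] >= min_score), key=lambda p: p[4], reverse=True
    )
    kept: List[Proposal] = []
    for cand in candidates:
        if all(iou(cand, k) <= iou_threshold for k in kept):
            kept.append(cand)
    return kept


print(pyramid_scales(448))
boxes = [(0, 0, 10, 10, 0.9), (0, 0, 10, 9, 0.8), (20, 20, 30, 30, 0.7), (0, 0, 10, 10, 0.2)]
print(filter_and_nms(boxes))
```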

C. HEAD DETECTION
After obtaining multiscale proposals using the above-mentioned multiscale strategy, we classify each proposal into two classes, i.e., head and background. Our head detection framework follows the classical R-CNN [10] approach, but instead of selective search [42], we use the proposals generated by our multiscale strategy. Before feeding a proposal to the network, we process it in the following way: 1) Extend the bounding box by a small scalar value. 2) Crop the patch corresponding to the proposal from the image. 3) Re-size the image patch to fit the input layer of the CNN. For classification, we use different architectures: AlexNet [19], VGGS [4], VGG-verydeep-16 [38], and ZF [54].
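The three pre-processing steps listed above (extend, crop, re-size) can be sketched in plain Python on an image stored as a list of rows. This is an illustrative sketch only: the additive padding `pad`, the nearest-neighbour resize, and all function names are assumptions, since the text does not specify how the box is extended or which interpolation is used.

```python
from typing import List, Tuple

Image = List[List[int]]
Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), max exclusive


def extend_box(box: Box, pad: int, width: int, height: int) -> Box:
    """Step 1: grow the box by `pad` pixels per side, clipped to the image."""
    x_min, y_min, x_max, y_max = box
    return (max(0, x_min - pad), max(0, y_min - pad),
            min(width, x_max + pad), min(height, y_max + pad))


def crop(image: Image, box: Box) -> Image:
    """Step 2: cut the patch covered by the box out of the image."""
    x_min, y_min, x_max, y_max = box
    return [row[x_min:x_max] for row in image[y_min:y_max]]


def resize_nearest(patch: Image, out_size: int) -> Image:
    """Step 3: nearest-neighbour resize of a non-empty patch to a square."""
    height, width = len(patch), len(patch[0])
    return [
        [patch[y * height // out_size][x * width // out_size] for x in range(out_size)]
        for y in range(out_size)
    ]


def prepare_proposal(image: Image, box: Box, pad: int = 2, out_size: int = 224) -> Image:
    """Run the three steps in order for one proposal."""
    height, width = len(image), len(image[0])
    return resize_nearest(crop(image, extend_box(box, pad, width, height)), out_size)


# Tiny 8 x 6 test image whose pixel value encodes its position.
image = [[10 * y + x for x in range(8)] for y in range(6)]
print(prepare_proposal(image, (2, 1, 4, 3), pad=1, out_size=2))
```

The default `out_size=224` matches the network input size mentioned earlier in the paper; the classifier backbones may expect a different size.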

IV. EXPERIMENT RESULTS
In this section, we evaluate the performance of the proposed framework using four publicly available datasets, i.e., SHOCK [6], WIDERFACE [51], HollywoodHeads [45], and Casablanca [33].
The SHOCK dataset was proposed by Conigliaro et al. [6]. The dataset captures 100,000 spectators from all over the world who gathered to watch an ice hockey match held in Trento, Italy. It contains 75 video sequences captured from five different cameras and covers four ice hockey matches on different days. Two different types of cameras were used to record the video sequences. To capture the panoramic and ice rink views, a full HD camera with a resolution of 1920 × 1080, a focal length of 4 mm, and a frame rate of 30 fps was used. To cover different locations of the spectator crowd, three cameras with a resolution of 1280 × 1024, a focal length of 12 mm, and a frame rate of 30 fps were mounted at different locations in the stadium. The video sequences are annotated in different ways to evaluate different crowd analysis methods, for example, face detection, pose estimation, action recognition, and posture detection.
The WIDERFACE dataset was proposed by Yang et al. [51] and is used to evaluate face detection methods. The dataset is composed of 32,203 images and 300,000 face annotations (bounding boxes), making it 10 times larger than existing face detection datasets. The images were collected from different sources with varying viewpoints, resolutions, scales, poses, and densities. The faces are divided into groups based on scale, occlusion, pose, and event. In particular, the faces are arranged into three groups based on face size: small (10-50 pixels), medium (50-300 pixels), and large (greater than 300 pixels).
In the same way, to evaluate detector performance in handling occlusion, faces are divided into three categories: no occlusion, medium occlusion, and high occlusion. We use the three scale groups (small, medium, and large) to evaluate and compare the performance of the proposed framework and the reference methods.
The HollywoodHeads dataset was first proposed by Vu et al. [45]. The dataset is collected from scenes of 21 Hollywood movies and contains 224,740 images with 369,846 annotations. Human heads were annotated in key frames, and the remaining frames were annotated using linear interpolation. These annotations were then verified by multiple coders. The dataset is divided into 216,719 training frames from 15 movies, 6,719 validation frames sampled from 3 other movies, and 1,302 test frames sampled from the remaining 3 movies. We follow the same convention in our evaluation of the proposed framework.
The Casablanca dataset was first proposed by Ren [33]. The dataset is collected from the old movie "Casablanca". It contains 147600 frames with a resolution of 464 × 640. The annotations mostly cover frontal heads with different scales and aspect ratios.

1) COMPLEXITY OF DATASETS
In this section, we discuss and compare the complexity of the datasets. As discussed above, the scale problem is caused by perspective distortions in the image induced by the camera viewpoint. Due to perspective distortions, human heads near the camera appear large, while heads become smaller as the distance from the camera increases. Objects appear at various scales in natural images, which may compromise the detector's performance. Therefore, the scale problem lies at the heart of every object detector [57], and a good object detector should overcome it. To demonstrate the complexity of the datasets in terms of scale variations, we plot the distribution of the entire scale space of heads/faces for all datasets, as shown in Figure 5. We count the number of heads whose scales belong to each of four groups and generate a histogram for each dataset, as shown in Figure 5. From the figure, it is clear that the Casablanca and HollywoodHeads datasets contain human heads belonging to the medium and large groups. The SHOCK dataset contains heads belonging to the medium group, while the WIDERFACE dataset is diverse and contains heads from all four groups. We further illustrate the complexity of the datasets by plotting the standard deviation of scales in Figure 5 (b). From the figure, it is clear that the SHOCK dataset has a low standard deviation compared to the other datasets. A small standard deviation indicates small scale variance, which can easily be captured by a single-scale detector. On the other hand, the standard deviation of WIDERFACE is high, which indicates large scale variance and requires a multi-scale detector. We also report the summary of the datasets in Table 2.
We compare the performance of the proposed framework with other state-of-the-art methods using the following performance metrics: object recall and detection mean average precision (mAP).
For two-stage detectors, it is important that the object proposal generator covers all objects of interest. Objects missed during the proposal stage will never be classified during the classification stage, which reduces the object recall rate of the classifier. In general, the performance of the detector depends on the performance of the object proposal generator, so it is important to evaluate the object proposal stage. Object recall rate is commonly used as the evaluation metric for this stage. We compare our proposed object proposal generation framework with other reference methods, including Bing [5], Region Proposal Network (RPN) [32], MultiBox [23], EdgeBox [59], and SelectiveSearch [10].
To compute the object recall, we match proposals to the ground truth by computing the intersection over union (IoU) between each object proposal and the ground truth. Figure 6 (a) shows the object recall of different methods at a fixed IoU threshold (0.6) with an increasing number of proposals. From Figure 6 (a), it is clear that our approach outperforms other state-of-the-art methods for both small and large numbers of proposals at the fixed threshold (0.6). It can also be noticed from Figure 6 (a) that even for a small number of proposals (1000), our approach performs comparatively better.
We next evaluate the performance of different methods by computing the object recall for a fixed number of proposals (2000) while varying the IoU threshold within the range [0.5, 1], as shown in Figure 6(b). Our approach beats other state-of-the-art methods by a considerable margin as the IoU threshold changes from 0.5 to 1. The superior performance of our approach is attributed to the multi-scale prediction strategy, which captures scale variations and results in high object recall rates.
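The object recall used in both experiments above can be computed as in the following sketch: a ground-truth box counts as recalled if at least one of the top-ranked proposals overlaps it with IoU at or above the threshold. The box layout and function names are assumptions for illustration; the defaults (IoU 0.6, 2000 proposals) are the values quoted in the text.

```python
from typing import List, Sequence, Tuple

Proposal = Tuple[float, float, float, float, float]  # x_min, y_min, x_max, y_max, score
Box = Tuple[float, float, float, float]


def iou(a: Sequence[float], b: Sequence[float]) -> float:
    """IoU of two boxes whose first four entries are x_min, y_min, x_max, y_max."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def object_recall(
    proposals: List[Proposal],
    ground_truths: List[Box],
    iou_threshold: float = 0.6,
    top_k: int = 2000,
) -> float:
    """Fraction of ground-truth boxes covered by one of the `top_k`
    highest-scoring proposals with IoU >= `iou_threshold`."""
    if not ground_truths:
        return 0.0
    top = sorted(proposals, key=lambda p: p[4], reverse=True)[:top_k]
    hits = sum(
        1 for gt in ground_truths if any(iou(p, gt) >= iou_threshold for p in top)
    )
    return hits / len(ground_truths)


gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
props = [(0, 0, 10, 9, 0.9), (50, 50, 60, 60, 0.8), (20, 20, 30, 30, 0.1)]
for k in (2, 3):
    print(k, object_recall(props, gts, top_k=k))
```

Sweeping `top_k` at a fixed threshold, or the threshold at a fixed `top_k`, gives the two kinds of recall curves discussed for Figure 6.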

A. COMPARISON WITH GENERIC DETECTORS
We now evaluate and compare the performance of our proposed framework with other generic object detectors. We categorize generic object detectors into two groups: (1) two-stage frameworks and (2) single-stage frameworks. Two-stage detection frameworks incorporate the generation of region proposals as a pre-processing step, while single-stage detection frameworks are free from region proposals.
We utilize different region proposal methods with different backbone CNN architectures. Faster-RCNN [32] uses a fully convolutional network, named the Region Proposal Network (RPN), for generating region proposals. The feature map generated by the last convolutional layer is used to generate region proposals of different sizes and aspect ratios. We combine R-CNN with MultiBox and Selective Search, which utilize low-level image features for generating object proposals. We also compare our results with the Cascade Rejection Classifier (SDP+CRC) [49], which utilizes EdgeBoxes for object proposals.
It is important to note that for generating object proposals, we fine-tuned the pre-trained models of Selective Search, MultiBox, and EdgeBoxes on the HollywoodHeads dataset according to the original splits. We also use single-stage detection frameworks and directly employ the publicly available pre-trained models of You Only Look Once (Yolo) [31] and Single Shot Detector (SSD) [23] during the testing phase. We use average precision (AP) at an IoU threshold of 0.5, computed from precision-recall curves, as the performance measure. The results are summarized in Table 3. From the table, it is clear that the proposed framework outperforms all the compared state-of-the-art detectors, while Faster-RCNN produces comparable results. The performance of Yolo and SSD is relatively lower than that of the other detectors. We attribute their inferior performance to the following two reasons: (1) they show poor generalization capability when applied to new datasets, and (2) both Yolo and SSD have more difficulty detecting small objects than region-based object detection methods. A possible reason is that both methods use low-resolution feature maps, in which the features of small objects become too small to be detected.
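For readers who want to reproduce the metric, the sketch below computes AP at an IoU threshold of 0.5 for the detections of a single image. The paper does not state which AP variant it uses, so the choices here are assumptions: detections are matched greedily in score order, each ground-truth box can be matched once, and AP is the area under the precision-recall curve after making precision monotonically non-increasing. Function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Detection = Tuple[float, float, float, float, float]  # x_min, y_min, x_max, y_max, score
Box = Tuple[float, float, float, float]


def iou(a: Sequence[float], b: Sequence[float]) -> float:
    """IoU of two boxes whose first four entries are x_min, y_min, x_max, y_max."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(
    detections: List[Detection], ground_truths: List[Box], iou_threshold: float = 0.5
) -> float:
    """Area under the precision-recall curve for one image."""
    if not ground_truths:
        return 0.0
    matched = [False] * len(ground_truths)
    precisions: List[float] = []
    recalls: List[float] = []
    true_positives = 0
    ranked = sorted(detections, key=lambda d: d[4], reverse=True)
    for rank, det in enumerate(ranked, start=1):
        # Find the ground-truth box that overlaps this detection the most.
        overlaps = [iou(det, gt) for gt in ground_truths]
        best = max(range(len(overlaps)), key=overlaps.__getitem__)
        if overlaps[best] >= iou_threshold and not matched[best]:
            matched[best] = True
            true_positives += 1
        precisions.append(true_positives / rank)
        recalls.append(true_positives / len(ground_truths))
    # Make precision non-increasing from right to left (curve envelope).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, previous_recall = 0.0, 0.0
    for precision, recall in zip(precisions, recalls):
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap


gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
dets = [(0, 0, 10, 10, 0.9), (50, 50, 60, 60, 0.8), (20, 20, 30, 30, 0.7)]
print(average_precision(dets, gts))
```

A dataset-level AP would pool the ranked detections of all images before building the curve; the single-image form is used here only to keep the example short.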
For a fair comparison, we first train each detector on the ImageNet dataset and then fine-tune it on each of the analyzed datasets. We observed from the experiments that the performance of the detectors improves after fine-tuning.
We evaluate the performance of all detectors using precision-recall curves with varying threshold values. We report the precision-recall curves of all methods on all datasets in Figure 7. We also report the precision, recall and F-score of all detectors in Tables 5, 6, 7 and 8 for the SHOCK, WIDERFACE, HollywoodHeads and Casablanca datasets, respectively.
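As a minimal sketch of how these quantities relate, the following computes precision, recall and F-score from detection counts and sweeps a score threshold to obtain points on a precision-recall curve. The function names and input format are hypothetical, and the F-score is assumed to be the balanced F1 measure, which the text does not state explicitly.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from true positive, false positive
    and false negative counts (F1 is an assumed choice of F-score)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def pr_curve(scored_matches, num_ground_truth, thresholds):
    """One (threshold, precision, recall, f1) point per score threshold.

    scored_matches: list of (score, is_true_positive) for every detection.
    """
    points = []
    for t in thresholds:
        # Keep only detections whose confidence reaches the threshold.
        kept = [is_tp for score, is_tp in scored_matches if score >= t]
        tp = sum(kept)
        fp = len(kept) - tp
        fn = num_ground_truth - tp  # heads that no kept detection matched
        points.append((t,) + precision_recall_f1(tp, fp, fn))
    return points
```

Raising the threshold typically discards false positives (higher precision) at the cost of missed heads (lower recall), which is the trade-off that a precision-recall curve visualizes.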
From the experimental results, we observe that the two variants of Viola-Jones, i.e., VJ-HOG and VJ-LBP, show lower performance on all datasets than the other specific detectors. The reason is that Viola-Jones is affected by the orientation of heads and faces. Furthermore, it is sensitive to illumination, and its sliding-window approach accumulates many bounding boxes at each face location, which lowers its precision-recall rate. We further observed that DPM-head also achieved lower performance on all datasets. This lower performance is attributed to the small size of human heads: the DPM detector could not detect heads smaller than 23 × 23 pixels. DISAM [17], on the other hand, achieved comparable results by tackling the scale problem to some extent. However, the method suffers from the following limitations: (1) The model follows the traditional pipeline of R-CNN and uses a scale-aware strategy for object proposal generation that typically requires human effort to generate a scale map. (2) The inference speed of the model is very slow, since the model extracts proposals in a sliding-window fashion and feeds each proposal forward in a separate pass.
The proposed framework, however, efficiently addresses all of the above problems. Its superior performance is attributed to the adoption of a scale-aware strategy that covers a large range of head scales. To demonstrate the effectiveness of the proposed approach, we present qualitative results in Figure 8 for samples from all datasets.
We further summarize the experimental results in two points: 1) The performance of the specific detectors (head/face) is higher on the SHOCK dataset than on the other datasets. In the SHOCK dataset, people are sitting in front of the camera, and most of the human body is visible. Furthermore, heads and faces lie in a limited range of scales, and the variance in scale is not significant, as is also evident from Figure 5 (b). Due to these properties, the specific detectors perform better on this dataset than on the other datasets. 2) The performance of head detectors is higher than that of face detectors. Face detectors rely on facial features for detection; in crowded scenes, however, facial features are often not visible due to occlusions, lighting conditions, and camera viewpoint. For example, a face detector cannot detect the face of a person whose back is turned to the camera. Due to these limitations, face detectors perform worse than head detectors.

C. EVALUATION ON EXTREME SCALES
To evaluate the performance of different methods on detecting heads of different sizes, we divide human heads into three categories, i.e., small, medium and large, based on head size. The size of a head corresponds to the height of the bounding box overlaid on the head. Bounding boxes with sizes of 8-60 pixels belong to the small category, the medium category contains bounding boxes of sizes 60-160 pixels, and bounding boxes greater than 160 pixels fall into the large category.
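The size grouping can be sketched as follows. The function names and the (x1, y1, x2, y2) box format are hypothetical. Because the stated ranges share the boundary value of 60 pixels, this sketch assumes that a height of exactly 60 pixels is assigned to the medium category, and that boxes shorter than 8 pixels are excluded from the evaluation; neither choice is stated in the text.

```python
def head_size_category(box):
    """Return 'small', 'medium' or 'large' for a box (x1, y1, x2, y2),
    or None when the box is shorter than 8 pixels (assumed to be ignored)."""
    height = box[3] - box[1]
    if height < 8:
        return None
    if height < 60:       # assumption: the boundary value 60 counts as medium
        return "small"
    if height <= 160:     # only boxes strictly greater than 160 are large
        return "medium"
    return "large"


def group_by_size(boxes):
    """Split ground-truth boxes into the three size groups."""
    groups = {"small": [], "medium": [], "large": []}
    for box in boxes:
        category = head_size_category(box)
        if category is not None:
            groups[category].append(box)
    return groups
```

Each detector can then be scored separately on the boxes of each group to obtain per-category results.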
Since the WIDERFACE dataset contains heads in a wide range of scales and sizes, we use it for this evaluation. We evaluate the performance of all detectors in terms of mean Average Precision (mAP), and the results are summarized in Table 4. The table shows that all detectors perform well on both the medium and large groups. However, the performance of these detectors degrades on the small group, which indicates that they have difficulty detecting small objects. The reason is that each of these networks uses a single deep convolutional neural network with a fixed receptive field size. For example, Faster-RCNN and SSD showed inferior performance compared to the other detectors. Faster-RCNN uses deep layers for ROI-pooling which [...] power. Our proposed framework addresses the above issue and achieves a significant improvement by using multi-scale features, which prove helpful in detecting heads over a wide range of scales.

D. TIME COMPLEXITY
We also compare the time complexity of our method with that of other state-of-the-art object proposal methods. The detailed time complexity of our proposed object proposal method and of the other methods is reported in Table 9.
For the Selective Search method, we use its fast version, while for the other methods we directly employ their publicly released code. For testing, we use images from the Casablanca dataset. The table shows that our method is not the fastest, but it still runs faster than most of the state-of-the-art methods.

V. CONCLUSION
In this paper, we exploit a fully convolutional network (FCN) to handle the problem of scale variance in images by generating scale-aware proposals. The heatmap produced by the FCN helps to identify whether a patch contains a head or not. We observed from the experiments that the proposals produced are more stable under image perturbation than those of other object proposal methods. The experiments also showed that our object proposal generation strategy results in high object recall and mean average precision. We believe that our proposed framework can also be extended to dynamic video sequences. In future work, we will therefore extend the current framework to incorporate motion information, which should strengthen its ability to identify and localize human behaviors and to support emotion recognition.