Moving Pedestrian Localization and Detection With Guided Filtering

Detecting moving pedestrians remains a challenging task in smart surveillance systems due to dynamic scenes. The ability to locate and detect moving pedestrians simultaneously is key to an integrated but low-resource smart surveillance system. This paper proposes a novel approach to locating and detecting moving pedestrians in a video. Our proposed method first locates the region of interest (ROI) using a background subtraction algorithm based on guided filtering. This novel background subtraction algorithm also filters unexpected noise at the same time, which benefits the performance of the overall method. Subsequently, pedestrians are detected using YOLOv2, YOLOv3, and YOLOv4 within the provided ROI. Our proposed method processes more frames per second than previous approaches. Our experiments show that the proposed method achieves competitive performance on the CDNET2014 dataset with a fast processing time: it runs at around 50 FPS on a CPU while classifying moving pedestrians at a high accuracy rate. Due to its fast processing, the proposed approach is suitable for IoT or smart surveillance devices with limited resources.


I. INTRODUCTION
Intelligent video surveillance systems currently play a crucial role in monitoring and evaluating human activity in public areas. This is especially true for pedestrian detection systems, which have been one of the most common subjects in various areas over the last decade. However, developing a robust pedestrian detection system is challenging due to many aspects such as illumination, cluttered background, and variations in pedestrian sizes.
The associate editor coordinating the review of this manuscript and approving it for publication was Abdullah Iliyasu.
Specifically, to address the background challenge, one promising approach is to integrate a robust background (BG) model into the pedestrian detection system. In the development of BG models, the work of Stauffer and Grimson [1] has been especially influential. The Stauffer-Grimson BG model is based on a Gaussian Mixture Model (GMM) fitted to the distribution of pixel values over time. After the fitting process, the model can decide whether an incoming pixel belongs to the BG, based on the probability under the GMM identified at each pixel location. Their BG model has inspired the development of many variants, such as the texture-based approach [2], improved GMMs [3], [4], and other novel approaches [5], [6], [7], [8]. Despite its popularity, a GMM-based model like the Stauffer-Grimson model still struggles with a noisy background. Fig. 1 depicts the problem caused by shadows acting as noise in the background. Changes in illumination can also introduce noise that leads to undesired results [9]. Ultimately, the ROI provided by the BG subtraction algorithm is required for building a complete pedestrian detection system. The recent trend suggests the use of a deep-learning-based detection algorithm for pedestrian detection [11], [12], [13], [14], although other types of computer vision techniques can also be employed [15], [16], [17]. Because of the sequential integration between the BG model and the detection system, a problem in the BG model can affect the overall performance of the pedestrian detection system. Thus, the aforementioned problem of the BG model needs to be addressed. At the same time, the whole pedestrian detection system needs to be fast enough to meet the requirements of real-time applications.
To answer the challenges that have been mentioned, this paper proposes an integration of a BG subtraction algorithm based on guided filtering [18] and a pedestrian detection algorithm using YOLO [19]. The key idea is to develop a fast pedestrian detection system that is robust to noisy frames. The guided filtering part allows the proposed system to generate an ROI while filtering the noise in the incoming frames. Subsequently, the detection process can be executed at high speed with YOLOv3. Unlike the previously prevalent approaches that used variants of R-CNN [20], [21], which are two-stage deep-learning-based object detectors, YOLOv3 is a one-stage object detector. This allows YOLOv3 to achieve a faster inference time, at about 45 FPS (frames per second), without significantly compromising accuracy. It is also worth noting that YOLOv3 excels at detecting large objects in the PASCAL dataset [22], which contains a large number of objects that are classified as pedestrians. Thus, YOLOv3 is a natural choice for a real-time pedestrian detection system.
In summary, the main contributions of this study are:
• To develop a novel and robust BG subtraction algorithm using guided filtering and texture-based modeling. This algorithm is expected to be robust against noise with the use of guided filtering.
• To apply YOLOv3 to detect pedestrians based on the ROI that is provided by the BG subtraction algorithm. This ultimately leads to a fast and accurate pedestrian detection system for real-time applications.
The remainder of this paper is organized as follows. The previous works related to this study are presented in Section 2. The details of the proposed BG subtraction model are provided in Section 3. The experiment result is presented and discussed in Section 4. Finally, the study is concluded in Section 5.

II. RELATED WORKS
A. BG SUBTRACTION FOR FINDING ROI
BG subtraction is a standard way of separating the ROI from the BG to find objects in successive frames. In moving pedestrian detection, BG subtraction is used to find probable pedestrian areas (ROI) prior to detecting the actual pedestrian object in a surveillance camera [23], [24]. The common method is based on color and texture features [1], [3], [25], which can utilize either pixel-based or block-based processing. To successfully detect the ROI from the BG, one of the most prevalent approaches is to use an edge-aware filtering technique. This technique has the unique trait that it can also filter the noise in the frames while detecting the ROI. Recently, edge-aware filtering has been applied in many applications, for example, in the studies by Wang et al. [26] and Munadi et al. [27]. Currently, the bilateral filter [28] and anisotropic diffusion [29] are the most popular variants of edge-aware filters. Despite their popularity, these two filters require a relatively high computational cost. To alleviate this issue, the guided filter [18] was developed, and it is increasingly being applied in many fields, such as image fusion, image matting, and up-sampling [30], [31]. Due to its non-approximate implementation, the guided filter is preferred over other filters. Compared to other filters, the guided filter generates a filtered image of improved quality at a faster runtime, since its cost does not depend on the filter size [30].

B. MOVING PEDESTRIAN DETECTION VIA HANDCRAFTED-FEATURE-BASED TECHNIQUES
Before the advent of deep learning, solutions for object detection, including pedestrian detection, relied on handcrafted features that were subsequently fed into a detector algorithm. Specifically for pedestrian detection, the most frequently employed features are the Histogram of Oriented Gradients (HOG) [32], [33], Haar-like features [34], [35], Viola-Jones features [36], texture features [37], and Local Binary Patterns (LBP) [38]. Since pedestrians are typically moving, spatio-temporal features are also commonly used for pedestrian detection [39], [40]. It is also beneficial to use handcrafted features for an intermediary BG subtraction process within a detection pipeline, as demonstrated by Kanagamalliga and Vasuki [41]. Recently, Kim et al. [60] focused on integrating the teacher-student concept into the standard random forest (RF) to create a novel fast pedestrian detection algorithm for surveillance cameras that can run on a low-end computing device.

C. MOVING PEDESTRIAN DETECTION VIA DEEP LEARNING TECHNIQUES
As in most computer vision tasks, deep learning has also emerged as a preferred approach for pedestrian detection. The deep learning algorithm in a pedestrian detection system is usually treated as a feature extractor, whose features are processed by another algorithm for the detection task. For instance, in the studies by Chahyati et al. [42] and Zhang et al. [43], features were extracted from a deep learning algorithm named Faster R-CNN. Similarly, Li et al. [44] used the features from a fully convolutional network (FCN). Deep learning not only improves pedestrian detection performance but can also alleviate challenging problems such as detecting pedestrians in occluded images [13].

III. METHODS
Unlike previous approaches, our proposed method integrates a guided-filtering-based BG subtraction algorithm prior to a deep learning algorithm for pedestrian detection. The motivation for incorporating guided filtering is to remove unwanted noise in the images that can degrade the accuracy of pedestrian detection. Fig. 2 depicts the outline of the proposed system pipeline. First, the image is fed into the BG subtraction algorithm to generate a bitmap that discriminates foreground from background. The foreground can be thought of as the promising part of the image containing the pedestrians that are eventually detected by the subsequent pedestrian detector. Therefore, to eliminate unnecessary computation, the input image is cropped to the smallest part of the image that contains all foreground. Afterward, this cropped image is fed into a deep learning algorithm for the final pedestrian detection. Because of the elimination of unnecessary computation, our proposed approach runs faster than a typical deep-learning-based pedestrian detection system. In the next two subsections, the details of this pipeline are elaborated. Specifically, the first subsection covers the details of the BG subtraction algorithm, and the second subsection covers the details of the deep learning algorithm for the final pedestrian detection.
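The ROI-cropping step described above (taking the smallest sub-image that contains all foreground pixels) can be sketched as follows; this is a simplified illustration rather than the authors' exact implementation, and the function name and list-of-lists frame representation are our own:

```python
def crop_to_foreground(frame, mask):
    """Crop `frame` to the tightest box containing every foreground pixel.

    frame: 2-D list of pixel values; mask: 2-D list of 0/1 (1 = foreground).
    Returns (cropped_frame, (top, left, bottom, right)), or None when the
    mask contains no foreground at all.
    """
    # Rows and columns that contain at least one foreground pixel.
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    if not rows or not cols:
        return None
    top, bottom = rows[0], rows[-1]
    left, right = cols[0], cols[-1]
    cropped = [row[left:right + 1] for row in frame[top:bottom + 1]]
    return cropped, (top, left, bottom, right)
```

Only the cropped sub-image is then passed to the detector, which is what saves computation relative to full-frame inference.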

A. ROI LOCALIZATION VIA BG SUBTRACTION
In brief, our proposed BG subtraction algorithm combines guided filtering with an improved version of the multi-level texture BG subtraction algorithm proposed by Yeh et al. [25]. This proposal was motivated by our observation of the behavior of existing BG subtraction methods, including the multi-level texture method, which tend to identify occurring noise as foreground. The observation is presented in Fig. 3, where the correct foreground region is marked in green and the incorrectly identified foreground in red. Because BG subtraction is an integral part of our proposed pedestrian detection method, this flaw may degrade the performance of the overall detection system. For this reason, this paper proposes to infuse guided filtering into the BG subtraction algorithm of a pedestrian detection system.
The first step of our proposed BG subtraction is to apply a guided filter to an input frame I. This process generates a grayscale image I_guided whose noise has been filtered. Afterward, I_guided is fed into the multi-level texture BG model [25] to produce a bitmap that separates foreground and background. In this study, the BG model is applied to I_guided for each 4 × 4-pixel block, as suggested by Yeh et al. [25]. Subsequently, the final binary bitmap BM_guided is obtained by calculating the average of each block and comparing each pixel value with its corresponding block mean. A pixel in BM_guided is set to 1 if the I_guided pixel at the same location is greater than the mean value, and to 0 otherwise. In the first frame, the BM_guided blocks are stacked to obtain the initial BG model BM_mod. The initial model is subsequently updated for each incoming frame to obtain a more accurate representation of the true BG. The same post-processing steps (connected components and morphology), update rule, and learning rate suggested by Yeh et al. [25] are used. Finally, to determine whether the blocks in a new incoming frame are foreground or background, they are compared to BM_mod. The details of this comparison process are elaborated in the following subsections.

B. GUIDED FILTER AS PRE-PROCESSING
Since the guided filter [18] directly inspired our algorithm, this subsection provides a brief review of its edge-preserving property and its formula. The guided filter assumes that an edge-preserving smoothing filter can be expressed as a linear model of the filtered image I_guided from a guidance image G within a window w_n that surrounds a pixel n. This can be formally expressed as follows:

I_guided(i) = a_n G(i) + b_n, ∀i ∈ w_n (1)

where a_n and b_n are constants that have a unique value for each window in the image. Their values can be obtained analytically by framing the case as an optimization problem that minimizes the squared error between I_guided and I together with an L2 regularization on a_n, which is formally expressed as follows:

E(a_n, b_n) = Σ_{i ∈ w_n} [ (a_n G(i) + b_n − I(i))² + ε a_n² ] (2)

where ε is a parameter to adjust the effect of the regularization term. Solving this problem yields the linear estimates of a_n and b_n in equations (3) and (4):

a_n = ( (1/|w|) Σ_{i ∈ w_n} G(i) I(i) − Ḡ_n Ī_n ) / (σ_n² + ε) (3)

b_n = Ī_n − a_n Ḡ_n (4)
where Ḡ_n and Ī_n are, respectively, the local means of the G and I values in a window w_n centered at pixel n, |w| is the window size, and σ_n² is the variance of G in w_n. The guided filtering process was implemented using OpenCV. Fig. 4 shows the comparison between the bilateral and guided filters. As clearly shown in the figure, the guided filter preserves the interesting area while smoothing the remaining regions.
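As a minimal sketch of how equations (1)-(4) work in practice, the following pure-Python example applies a guided filter to a 1-D signal. Real implementations, including OpenCV's, operate on 2-D images and use box filters for speed; the explicit per-sample loops and the function name here are ours, for clarity only:

```python
def guided_filter_1d(I, G, radius, eps):
    """Minimal 1-D guided filter following equations (1)-(4).

    I: input signal, G: guidance signal, radius: half window size,
    eps: regularization strength. The per-window a_n, b_n are averaged
    over all windows covering each sample, as in the full algorithm.
    """
    n = len(I)
    a = [0.0] * n
    b = [0.0] * n
    for k in range(n):
        lo, hi = max(0, k - radius), min(n, k + radius + 1)
        w = hi - lo
        mean_G = sum(G[lo:hi]) / w
        mean_I = sum(I[lo:hi]) / w
        cov_GI = sum(G[i] * I[i] for i in range(lo, hi)) / w - mean_G * mean_I
        var_G = sum(g * g for g in G[lo:hi]) / w - mean_G ** 2
        a[k] = cov_GI / (var_G + eps)      # equation (3)
        b[k] = mean_I - a[k] * mean_G      # equation (4)
    out = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        mean_a = sum(a[lo:hi]) / (hi - lo)
        mean_b = sum(b[lo:hi]) / (hi - lo)
        out.append(mean_a * G[i] + mean_b)  # equation (1), window-averaged
    return out
```

With a small eps and the input as its own guide, a sharp step in the signal survives filtering almost unchanged, which is the edge-preserving property the text describes.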

C. TEXTURE-BASED BG MODELING WITH GUIDED FILTERING
The texture-based BG modeling in this study was applied to each non-overlapping 4 × 4 block in the incoming frame. Firstly, each block is filtered by a guided filter. Afterward, a texture bitmap is generated by thresholding each pixel value with respect to the local mean of the block. If the pixel value is less than the mean, then the corresponding pixel in the texture bitmap is set to 0. Else, the bitmap pixel is set to 1. This process is illustrated in Fig. 5.
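This thresholding step can be made concrete with a small sketch (in our own notation, using a plain list-of-lists block rather than image types):

```python
def texture_bitmap(block):
    """Binarize one block of gray values against its own mean.

    Pixels below the block mean map to 0; all others map to 1,
    matching the thresholding rule described above.
    """
    flat = [v for row in block for v in row]
    mean = sum(flat) / len(flat)
    return [[0 if v < mean else 1 for v in row] for row in block]
```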
To justify the choice of the 4 × 4 block size, an example of the bitmap generated with different block sizes using the proposed texture descriptor is visualized in Fig. 6. The 2 × 2 block size fails to identify most of the important textures (Fig. 6(b)), while the 6 × 6 block size captures excessive texture (Fig. 6(d)). The 4 × 4 block size generates the bitmap with the best-balanced texture among the tested block sizes (Fig. 6(c)).
Motivated by PBAS [50], the texture information is used in this study to model the observed blocks by adding neighborhood information. The improved BG subtraction method is divided into two parts, namely the initial improvement and the final improvement.

1) INITIAL IMPROVEMENT
The first modification was made to Yeh et al.'s updating mechanism. In contrast to their method, which only updates the BG model of the observed block, the proposed idea incorporates the adjacent blocks by checking for similarity before updating the adjacent BG models. More specifically, different from [50], which selects and updates neighboring pixels randomly, the proposed step first checks the similarity of the observed block's model with each BM_adjacent_model via the Hamming distance. If the similarity of the models exceeds TH_adjacent, then each such BM_adjacent_model is replaced by its current binary bitmap.
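A hedged sketch of this neighborhood update follows; the similarity is expressed as 1 minus the normalized Hamming distance, and the function names and the default threshold value are hypothetical, since the paper does not specify TH_adjacent:

```python
def bitmap_similarity(bm_a, bm_b):
    """Fraction of matching bits between two equal-size binary bitmaps
    (i.e. 1 - normalized Hamming distance)."""
    flat_a = [v for row in bm_a for v in row]
    flat_b = [v for row in bm_b for v in row]
    matches = sum(1 for x, y in zip(flat_a, flat_b) if x == y)
    return matches / len(flat_a)

def update_adjacent_models(observed_model, adjacent_models,
                           current_bitmap, th_adjacent=0.75):
    """Replace every adjacent model that is similar enough to the observed
    block's model with the current binary bitmap; dissimilar neighbors are
    left untouched. th_adjacent is an assumed threshold value."""
    return [current_bitmap
            if bitmap_similarity(observed_model, m) >= th_adjacent else m
            for m in adjacent_models]
```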

2) FINAL IMPROVEMENT
The bit transition is estimated by Yeh et al. [25] to determine the mode of a block: when a block is complex, a higher level is used (the 2- or 3-bit mode instead of the 1-bit mode). The result of the first phase of the proposed approach allows for further development by using bit transitions to check the complexity of the observed block and its adjacent blocks. Fig. 8 shows an example of accumulating the bit transitions of a block. Note that the higher the total transition count, the more complex the block and the richer its texture information.
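The bit-transition count can be sketched as below; the exact scan order used by Yeh et al. may differ, so this horizontal-plus-vertical accumulation is an illustrative assumption:

```python
def bit_transitions(bitmap):
    """Count 0<->1 transitions along each row and each column of a binary
    bitmap; a higher total indicates a more complex, texture-rich block."""
    total = 0
    for row in bitmap:
        total += sum(1 for a, b in zip(row, row[1:]) if a != b)
    for col in zip(*bitmap):
        total += sum(1 for a, b in zip(col, col[1:]) if a != b)
    return total
```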
If the observed block is regarded as BG, the complexity of its adjacent blocks is checked for every adjacent block identified as FG. If both conditions are met (the adjacent blocks are all complex FG blocks), the label of the observed block is corrected accordingly. Once the binary mask is obtained, guided feathering is finally performed to further improve the results. To be specific, the binary mask b_output is filtered under the guidance of G (see Section III-B). The guided filter parameters are r = 5 and ε = 0.2², where r and ε denote the radius and epsilon, respectively. As shown in Figs. 9-10, the fragment issues highlighted in red (in the inner region of a detected object) can be alleviated.
Furthermore, more detailed comparative evaluations are shown in the subsequent results, which overlay the obtained final binary mask on an RGB color image. This verifies that the proposed method improves on the previous work, especially in optimizing the block-texture-based approach. It is noteworthy that the block-texture-based approach was selected to guarantee that the initial moving-object detection executes rapidly prior to inference through the deep learning pipeline.

D. FINAL PEDESTRIAN DETECTION VIA DEEP LEARNING
To detect pedestrians from the ROI previously generated by our BG model, a deep-learning-based pedestrian detector is utilized, specifically one based on Convolutional Neural Networks (CNN). CNN is currently the most popular model for many computer vision tasks, including image classification and object detection. In its simplest form, a CNN consists of convolutional, pooling, and fully-connected (FC) layers. The core of a CNN is the convolutional layer, which applies a fixed-size kernel to the input matrix via a convolution operation and sends the output matrix to the next layer. To allow non-linear mapping, a non-linear activation function is applied after each layer. CNNs have been observed to generate better-optimized features than hand-crafted features, which leads to better performance.
In particular, the models of YOLOv2 [46], YOLOv3 [19], and YOLOv4 [58] are used in this work, as they have demonstrated robust performance for object detection. YOLO is a type of CNN specially engineered for fast object detection with competitive accuracy; these versions are improvements over the original YOLOv1 [45]. At its core, YOLO is a single-stage regression model, which explains its fast inference compared to other object detection methods that typically involve multiple stages. The single regression formulation is achieved by framing detection as the regression of bounding boxes for each cell of an S × S grid over the input image. If a target bounding box is centered at a certain grid cell, the representation of that cell is utilized for detecting the bounding box. Each grid cell can detect a fixed number of bounding boxes B, along with the corresponding confidence score, which is calculated as Pr(object) × IOU^truth_pred. The confidence score can be interpreted as the probability that the detected bounding box is correct, measured by the Intersection over Union (IOU) of the predicted box against the intersecting target boxes. If no target box intersects the detected box, its confidence score is set to 0.
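The confidence score above can be made concrete with a small sketch (the function names are ours; boxes are assumed to be axis-aligned (x1, y1, x2, y2) tuples):

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def confidence(p_object, pred_box, truth_boxes):
    """YOLO-style confidence: Pr(object) * best IoU against target boxes;
    zero when no target box intersects the prediction."""
    best = max((iou(pred_box, t) for t in truth_boxes), default=0.0)
    return p_object * best
```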
To obtain better detections from the second version onward, the anchor box sizes were set to the sizes obtained from the training dataset via k-means clustering. Because YOLOv2, YOLOv3, and YOLOv4 include ''person'' as one of the object categories, they can be employed as pedestrian detectors.
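A minimal sketch of picking anchor sizes from training-box dimensions follows. Note that YOLOv2 actually clusters with a 1 − IOU distance; plain Euclidean distance is used here for brevity, and the initialization from the first k boxes is our simplification:

```python
def kmeans_anchors(dims, k, iters=20):
    """Naive k-means over (width, height) pairs to pick k anchor-box sizes.

    dims: list of (w, h) tuples from the training boxes.
    Returns k centroids as [w, h] lists.
    """
    centroids = [list(d) for d in dims[:k]]  # simplistic initialization
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for w, h in dims:
            # Assign each box to the nearest centroid (Euclidean distance).
            j = min(range(k),
                    key=lambda c: (w - centroids[c][0]) ** 2
                                + (h - centroids[c][1]) ** 2)
            clusters[j].append((w, h))
        for c, members in enumerate(clusters):
            if members:  # recompute centroid as the cluster mean
                centroids[c] = [sum(m[0] for m in members) / len(members),
                                sum(m[1] for m in members) / len(members)]
    return centroids
```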
Given their fast inference times, these models are suitable detectors for our proposed method, which is designed for real-time applications.

IV. RESULTS AND DISCUSSION
This section presents and discusses our simulation results as well as their evaluation. The simulation hardware is a PC with an Intel i7-7700HQ processor, 16 GB of memory, and an NVIDIA GeForce GTX 1050 Ti with 4 GB. It is noteworthy that the comparison between the previous works and the proposed work is evaluated in terms of speed, since the main objective of our proposed approach is deployment on real-time or edge devices. In our simulation, the proposed approach achieves around 55 frames per second, the fastest among the previous approaches considered in this study. The comparison of our proposed approach to the previous approaches is presented in Table 1. All approaches were run without GPU acceleration, following the typical protocol of BG modeling studies.

A. QUALITATIVE COMPARISON FOR PEDESTRIAN DETECTION
The comparison discussed in this section was evaluated on the CDNET2014 dataset [10], which introduces several challenging pedestrian scenes. In this paper, five representative scenarios are selected: backdoor, busStation, cubicle, pedestrians, and sofa. Notably, these videos contain difficult challenges such as shadow and occlusion. In Figs. 12-16, the BG subtraction results of our proposed approach and the top-10 SOTA methods from CDNET2014 are visually compared. The proposed approach removes the incorrectly detected regions caused by the previously mentioned challenges. It also eliminates noise better than the multi-level texture approach. As a result, our proposed method provides a tighter ROI, which leads to faster detection in the subsequent deep learning stage. In Figs. 17-21, the detection results from YOLO given the ROI from the BG model are visualized. They specifically highlight that YOLO can accurately detect persons in the given frames, which demonstrates the suitability of YOLO as the pedestrian detector in our proposed pipeline.

B. QUANTITATIVE COMPARISON FOR PEDESTRIAN DETECTION
The performance of our proposed approach compared to the previous approaches is quantitatively evaluated. The quantitative evaluation in this study is based on pixel-wise binary measurements with the following metrics: Specificity (Sp), False Positive Rate (FPR), False Negative Rate (FNR), Percentage of Wrong Classifications (PWC), Precision (Pr), and F1 score [59]. In the context of BG model assessment, specificity measures the proportion of background pixels that are correctly classified; FPR measures the ratio of background pixels misclassified as foreground; FNR measures the ratio of foreground pixels misclassified as background; PWC measures the overall misclassification rate; precision measures the proportion of pixels classified as foreground that are truly foreground; and F1 is the harmonic mean of precision and recall (1 − FNR). The results of the quantitative evaluation are presented in Tables 2 to 7. The best performance in these tables is highlighted in red, and the second best in blue. As visualized in Figs. 22 through 24, our proposed method is in general very competitive and applicable for extracting moving regions. It is noteworthy that the proposed approach aims to localize the ROI prior to pedestrian classification through YOLO. Therefore, as shown in Fig. 25, the whole pipeline processes the incoming frames faster than full-frame processing, which allows the proposed pipeline to be applied in a real-time environment.
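Given pixel-wise confusion counts (TP, FP, TN, FN), the listed metrics follow directly from their standard definitions, as in this short sketch (the function name and dictionary layout are ours):

```python
def bg_metrics(tp, fp, tn, fn):
    """Pixel-wise BG-subtraction metrics used in the comparison:
    Sp, FPR, FNR, PWC (in percent), Pr, and F1 (with recall = 1 - FNR)."""
    sp = tn / (tn + fp)                            # specificity
    fpr = fp / (fp + tn)                           # false positive rate
    fnr = fn / (fn + tp)                           # false negative rate
    pwc = 100.0 * (fp + fn) / (tp + fp + tn + fn)  # % wrong classifications
    pr = tp / (tp + fp)                            # precision
    recall = 1.0 - fnr
    f1 = 2 * pr * recall / (pr + recall)           # harmonic mean
    return {"Sp": sp, "FPR": fpr, "FNR": fnr, "PWC": pwc, "Pr": pr, "F1": f1}
```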
From the tables, it can be seen that the specificity of the proposed approach is, in most cases, close to the best specificity value, which is achieved by BSUV-NET.
The FPR and FNR values of the proposed approach are not the best in most of the representative scenarios, although it achieves the best FNR among the compared approaches in the ''Backdoor'' and ''Pedestrian'' scenarios. The percentage of wrong classifications (PWC) shows that the approach yields an average value only marginally different from those of the other approaches. In terms of precision, our proposed approach yields the second-highest score for the ''cubicle'' scenario, while in the other scenarios it scores near the bottom. Moreover, the F1 values of the proposed approach are close to the highest scores. Table 7 presents the average of the quantitative measurements over all pedestrian scenes. From the table, it can be seen that the best overall performance is provided by the BSUV-NET approach. However, our proposed pedestrian detection pipeline obtained values nearly identical to BSUV-NET in terms of specificity and FNR.

V. CONCLUSION
The advantages of our proposed pedestrian detection pipeline, based on robust ROI localization, have been comprehensively evaluated. The results of this study suggest that robust ROI localization with a guided-filtering-based BG model contributes to the rapid and accurate pedestrian detection in our pipeline. Moreover, the guided filter makes our proposed method robust against various complex challenges caused by noise, which are not adequately handled by the previous approaches. Future work includes evaluating more pre-processing steps for the robust BG subtraction method. In addition, comprehensive experiments in edge environments will enable this work to be applied in product-ready, real-time settings.

TJENG WAWAN CENGGORO (Member, IEEE) received the bachelor's degree in information technology from STMIK Widya Cipta Dharma, and the master's degree in information technology from Bina Nusantara University. He is currently an AI Researcher whose focus is the development of deep learning algorithms for applications in computer vision, natural language processing, and bioinformatics. He is also a Certified Instructor with the NVIDIA Deep Learning Institute. He has led several research projects applying deep learning to computer vision, in areas such as indoor video analytics and plant phenotyping. He has published over 20 peer-reviewed publications and has reviewed for prestigious journals, such as Scientific Reports and IEEE ACCESS. He also holds two copyrights for AI-based video analytics software.

ADHIGUNA MAHENDRA received the M.S. degree in computer vision and robotics from Heriot-Watt University, U.K., in 2008, and the Ph.D. degree in machine learning and computer vision from the Université de Dijon, France, in 2012. He is currently a Lecturer of data science and enterprise architecture in the Master of Information Technology Program at Swiss German University.
He is also a Lecturer of business analytics in the MBA Program at Central Queensland University. In addition, he is active in industry, with over 20 years of experience building intelligent systems based on AI and machine learning for global and national companies in Europe, Singapore, and Indonesia, in verticals such as oil and gas, industrial automation, aviation, logistics, smart cities, and B2B eCommerce. He is also the Chief of AI and Product at Nodeflux, a leading AI Vision company in Indonesia, where he leads the implementation of AI products, such as the video analytics platform, biometrics eKYC platform, and retail visual analytics platform, from product design and algorithm development to operationalization (MLOps) and commercialization. He has publications in SPIE and IEEE venues and served as a Reviewer for the International Conference on Engineering and Information Technology for Sustainable Industry (ICONETSI) in 2020. He received the Best Lecturer Award from Swiss German University in 2018.

BENS PARDAMEAN
MUHAMMAD RIZKY MUNGGARAN received the M.T. degree in business intelligence from the School of Electrical Engineering and Informatics, Institut Teknologi Bandung, in August 2016. His current research interests include information retrieval, artificial intelligence, computer vision, and business intelligence. He has been active in machine learning projects for over seven years, building machine learning systems for health applications using signal processing, temporal data analysis, text mining, and computer vision for handwriting recognition and biometric face recognition. He also holds intellectual property rights (HAKI) for Graph Digitizer, an application that reads and extracts values from archived graph images, developed jointly with the Indonesian Agency for Meteorology, Climatology, and Geophysics (Badan Meteorologi, Klimatologi, dan Geofisika, or BMKG) in 2020. He focuses on computer vision and artificial intelligence and has been with Nodeflux, a leading AI Vision company in Indonesia, for six years, contributing to its video analytics platform, biometrics eKYC platform, and retail visual analytics platform, from product design and algorithm development to operationalization (MLOps).