Novel On-Road Vehicle Detection System Using Multi-Stage Convolutional Neural Network

General object-detection methods based on deep learning have received considerable attention in the field of computer vision. However, when they are applied to vehicle detection (VD) in a straightforward manner to realize an intelligent vehicle (IV), a graphics processing unit (GPU) is required for their real-time implementation. The use of GPUs is unacceptable in commercial VD systems. A novel on-road VD method based on a multi-stage convolutional neural network (MSCNN) is proposed to solve this problem. In the MSCNN, the properties of vehicles are exploited, and an efficient region proposal specialized for vehicles is developed. The proposed MSCNN comprises four stages: vehicle lower-boundary detection, vehicle upper-boundary detection, a region proposal network (RPN), and vehicle classification. Effective anchor boxes are generated in the lower- and upper-boundary-detection stages with appropriate sizes for vehicles of all scales. The bounding box of the vehicle within the anchor box is determined in the RPN stage. In the last stage, the predicted bounding boxes are classified as vehicles or non-vehicles. Finally, the proposed method is applied to the KITTI, CrowdAI, and AUTTI datasets, and its advantages are demonstrated by comparing its performance with those reported in previous studies. The proposed MSCNN achieves an average precision of 72.1% on the KITTI dataset while running on a central processing unit (CPU).


I. INTRODUCTION
Autonomous driving is considered one of the most attractive emerging technologies in both industry and academia. Various technologies are being developed for the reliable implementation of autonomous driving. Object detection [1]-[3] around an autonomous vehicle is an important technology required for realizing autonomous driving, and popular examples of object detection (OD) include pedestrian detection, vehicle detection (VD), and cyclist detection. Various sensors and sensor combinations have been used to implement OD; monocular cameras, in particular, have been widely adopted owing to their low cost. This study focuses solely on VD using a monocular camera [4]-[7].
The majority of studies on VD using a monocular camera can be classified into two categories: 1) the use of hand-crafted features and 2) the use of deep learning methods. The most popular hand-crafted features used for VD are the histogram of oriented gradients (HOG) [8], Haar-like features [9], Gabor features [10], aggregate channel features (ACFs) [11], and position- and intensity-included HOG (πHOG) [12]. Haar-like features require less computation time than the HOG and Gabor features; however, the detection performance obtained using Haar-like features is inferior to that obtained using the HOG or Gabor features. The ACF is a popular hand-crafted feature that combines the HOG, normalized gradient magnitudes, and LUV color channels [11]. πHOG compensates for the information loss that occurs when computing a histogram by incorporating position and intensity information [12]. Fig. 1(a)-(d) illustrate the basic structures obtained from classical studies wherein hand-crafted features are used. Fig. 1(a)-(c) depict exhaustive searches, and Fig. 1(d) depicts an image to which the region proposal method was applied [13]-[15]. Fig. 1(a) illustrates a simple sliding window wherein a search is performed for vehicles at all possible locations and scales [16], [17]. In Fig. 1(b), the cost of generating multi-scale feature maps is reduced by approximating the feature maps [11]. Fig. 1(c) depicts a structure wherein a different classifier is assigned to each scale, thus realizing a considerable speed-up [18]. In Fig. 1(d), the region proposal method is employed to avoid an exhaustive search and reduce the number of classifier applications.
Deep learning methods based on convolutional neural networks (CNNs) have received significant attention as a general tool for OD in the field of computer vision. The learned features obtained from deep learning have a considerably higher discriminatory power for general OD than classical hand-crafted features [19]. Fig. 1(e)-(g) depict some of the structures of general OD based on CNNs. The structure illustrated in Fig. 1(e) is similar to that in Fig. 1(d); methods such as regions with CNN features belong to this type [20]. In Fig. 1(f), the feature map is computed for the entire image only once to extract the features of all the region proposals [21], [22], thus reducing the computation time.
In Fig. 1(g), a region proposal network (RPN) is used to share the feature map with the classifier [23].
Although the above method [23] shares the feature map between the RPN and the classifier, alternating training is still required. This training scheme becomes a bottleneck in real-time applications. To address this problem, one-step frameworks based on regression and classification have been proposed. You only look once (YOLO) [24] and the single-shot multibox detector (SSD) [25] are representative one-step frameworks. YOLO divides the input image into a grid, and each grid cell is used to predict bounding boxes and object classes. This network can process images at 45 fps with a better detection performance than other real-time systems. An improved second version, YOLOv2 [26], was later proposed to improve the detection performance and speed by using anchor boxes, dimension clusters, and multi-scale training. YOLOv2 can process images at 67 fps with a better detection performance than YOLO. A third version, YOLOv3 [27], was also presented; it adopts predictions across scales to detect small objects. In terms of accuracy and speed, YOLOv3 demonstrates significant benefits over YOLOv2. However, YOLOv3 incurs a high computational cost and requires a large amount of time without a GPU. For convenience of deployment, a simplified version called YOLOv3-tiny [27] was also presented. The structure of YOLOv3-tiny is similar to that of YOLOv3; however, it has fewer layers and filters.
The above methods [20]-[27] based on deep learning demonstrate excellent detection performance in comparison with those in other studies [8]-[12]. However, the straightforward application of general OD to VD presents some problems. First, because the entire image is used as the input to the deep network and millions of convolution operations are required to perform OD, a GPU is needed to achieve an acceptable computation time. However, the use of GPUs is unacceptable in commercial VD systems. Second, VD is likely to fail in detecting small vehicles at a large distance from the camera. As the network becomes deeper, the output size of each layer becomes smaller due to the pooling operations, and thus the output of the last layer does not fit small vehicles. To solve the above problems, a new VD method called the multi-stage CNN (MSCNN), based on deep learning, is proposed in this study. The proposed MSCNN comprises four stages: vehicle lower-boundary detection, vehicle upper-boundary detection, RPN, and vehicle classification. The first three stages correspond to a new region proposal method specialized for VD. In the last stage, a shallow CNN is trained to classify the given proposal into vehicles and non-vehicles. The contributions of this study can be summarized as follows: 1) In contrast to general OD methods based on deep learning, the properties of vehicles are exploited, and a new region proposal method specialized for VD is proposed. 2) A cascade of shallow CNNs is used to detect vehicles. 3) A new method of using a refined image strip in the second stage is developed; this method is based on the position-wise vehicle-height predictor (PVHP), as described in Section IV. Thus, the proposed MSCNN has the following advantages over general OD applied to VD: 1) The MSCNN outperforms general OD methods, such as YOLOv3 and YOLOv3-tiny, in terms of the log-average miss rate, average precision, and computation time.
2) As the MSCNN consists of a few shallow networks and does not use deep networks, it can be implemented on a central processing unit (CPU). 3) The MSCNN demonstrates a higher detection performance for small vehicles than general OD methods, such as YOLOv3 and YOLOv3-tiny, because the MSCNN is implemented using shallow networks with considerably fewer pooling layers than those of general OD, thus preventing the loss of information. The MSCNN may appear somewhat similar to CNN cascade methods, such as that presented in [33]; however, its philosophy is different. Fig. 2 presents a comparison of the previous CNN cascade method and the proposed MSCNN. In Fig. 2(a), the CNN cascade method considers the detections by the CNN in the earlier stage as region proposals. As the stages progress, deeper CNNs are used to detect vehicles, and the number of false positives is reduced; consequently, the same region can be tested several times, which is time-consuming. In contrast, in Fig. 2(b), each stage in the MSCNN plays its own semantic role, such as lower-boundary or upper-boundary detection, and generates proposals efficiently and effectively.
The remainder of this paper is organized as follows. In Section II, the proposed VD system based on the MSCNN is briefly outlined. In Section III, the proposed VD method based on the MSCNN and the network structure of each stage are explained in detail. A new method of using a refined image strip based on the PVHP to improve the detection performance is developed in Section IV. In Section V, the experimental results of the proposed methods are presented and compared with those of previous methods, and the merits of the proposed methods are discussed. In Section VI, the conclusions of this study are presented.

II. SYSTEM OVERVIEW
The proposed MSCNN comprises four stages, as outlined in Fig. 3: 1) vehicle lower-boundary detection, 2) vehicle upper-boundary detection, 3) RPN, and 4) vehicle classification.
In the first stage, the lower bounds of the vehicles are detected using a CNN. This CNN is based on a previous study [28], and is similar to that used in detecting the road boundary because all vehicles in an urban environment are on the road. In Fig. 3, the red bars represent the lower bounds of the vehicles. In the second stage, the upper bounds of the vehicles are detected using another CNN, and the input of the CNN is an image cut below the lower boundary generated using the first CNN. The architecture of the CNN in the second stage is the same as that in the first stage. In Fig. 3, the blue bars represent the upper bounds of the vehicles. In the third stage, pre-defined boxes called anchor boxes [23] are generated using the lower and upper bounds of the vehicles. The anchor boxes are indicated in green in Fig. 3. To consider various views, such as front, rear, or side views of the vehicles, anchor boxes of two different aspect ratios (1:1 and 2:1) are generated. These anchor boxes are refined using RPN [23]. The refined anchor boxes are termed object proposals. In Fig. 3, the green and cyan boxes represent the initial anchor boxes and corresponding refined object proposals, respectively. In the fourth stage, the object proposals are classified using the last CNN. The classified vehicles are represented by blue boxes. Non-maximum suppression (NMS) [29] is then used as a post-processing method to eliminate overlapping detection results. In the last image in Fig. 3, the detected vehicles are indicated by red boxes. The proposed MSCNN uses a shallow CNN with several layers to reduce the test or evaluation time. The compactness of the shallow CNN facilitates the implementation of the proposed MSCNN in an embedded system without a GPU.
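The four-stage data flow described above can be sketched as follows. The stage networks are passed in as callables (hypothetical stand-ins for the trained CNNs), so only the hand-off between stages is illustrated, not the networks themselves:

```python
def detect_vehicles(image, lower_cnn, upper_cnn, rpn, classifier, nms):
    """Data flow of the four-stage MSCNN; stage networks are callables."""
    lower = lower_cnn(image)            # stage 1: lower boundary per strip (red bars)
    upper = upper_cnn(image, lower)     # stage 2: upper boundary per strip (blue bars)
    anchors = []
    for i, (yl, yu) in enumerate(zip(lower, upper)):
        if yl is None or yu is None:    # non-vehicle strip: no anchor generated
            continue
        h = yl - yu                     # anchor height from the two boundaries
        cy = (yl + yu) / 2.0            # vertical center between the bars
        anchors += [(i, cy, r * h, h) for r in (1.0, 2.0)]  # aspect 1:1 and 2:1
    proposals = rpn(image, anchors)         # stage 3: refined object proposals
    detections = classifier(image, proposals)  # stage 4: vehicle / non-vehicle
    return nms(detections)              # post-processing: remove overlaps
```

A toy run with identity stand-ins for the networks produces two anchors (one per aspect ratio) for the single strip that contains a vehicle.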

III. PROPOSED METHOD A. LOWER BOUNDARY DETECTION OF VEHICLES
In the first stage, the CNN returns the lower boundaries of the vehicles in an urban environment. To obtain the lower boundaries, an RGB image I(x, y, c) ∈ R^(W×H×3) is divided into N vertical strips, where i = 1, 2, ..., N indexes the strips, W and H are the width and height of the image, respectively, and w_s = W/N is the width of each image strip. Fig. 4 presents the main objectives of the first stage. When the i-th image strip S_l^i(x, y, c) is given, the lower-boundary detection problem for the strip becomes a regression problem, as shown in Fig. 4(b), where the input is the image strip and the output y_l(i) comprises the lower boundaries of the vehicles in the strip. To simplify this regression problem, we convert the lower-boundary detection of the vehicles into a multi-class classification problem for each image strip. In particular, we divide (or quantize) the height of the image into M intervals, and the regression problem is approximated using the multi-class classification problem with the quantized label

ȳ_l(i) = Σ_{m=1}^{M} m · I[(m − 1)L ≤ y_l(i) < mL],

as shown in Fig. 4(c), where L = H/M is the length of each interval and I[·] is an indicator function that returns 1 if its argument is true and 0 otherwise. The classification problem is solved using the CNN. The classification result ȳ_l(i) ∈ {0, 1, 2, ..., M} denotes the label corresponding to the vertical position of the vehicle, and ȳ_l(i) = 0 corresponds to the case in which there is no vehicle in the image strip. The non-vehicle class aids in eliminating unnecessary region proposals and reducing the subsequent computation time. Using the image strips and the lower boundaries of the vehicles, the first CNN is trained to estimate the quantized vertical position ȳ_l of the lower boundary of the vehicle. The last layer of the first CNN comprises M + 1 nodes with a softmax function, and each node returns the probability of the corresponding interval containing the lower boundary of the vehicle.
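As a concrete illustration, the quantization can be written in a few lines. This is a sketch using H = 240 and M = 48 (the values used in this study); the convention that label m covers the interval [(m − 1)L, mL) is our assumption:

```python
def boundary_to_label(y_l, H=240, M=48):
    """Quantize a lower-boundary pixel row y_l into one of M intervals.

    Returns 0 when the strip contains no vehicle (y_l is None);
    otherwise a label in 1..M, where the interval length is L = H / M.
    """
    if y_l is None:
        return 0
    L = H / M                          # interval length (5 px for H=240, M=48)
    return min(M, int(y_l // L) + 1)   # clamp so y_l = H still maps to label M

def label_to_boundary(p_l, H=240, M=48):
    """Map a predicted class p_l back to a pixel row: y_hat = p_l * H / M."""
    return p_l * H / M
```

For example, a boundary at row 120 falls into the 25th interval, and predicting class 24 maps back to pixel row 120.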
Using these definitions, we minimize the cross-entropy loss

L_l = − Σ_{m=0}^{M} I[ȳ_l(i) = m] log n_l(m),

where n_l denotes the (M + 1)-dimensional output of the last layer of the first CNN. Among the M + 1 nodes, the node with the maximum probability gives the prediction p_l of the first CNN:

p_l = argmax_{m ∈ {0, 1, ..., M}} n_l(m).

If no vehicle is present, the first node has the maximum probability (p_l = 0). Based on the p_l estimated by the first CNN, the lower boundary of the vehicle is predicted as ŷ_l(i) = p_l × H/M, as shown in Fig. 4(d). Fig. 5 presents a summary of the workflow. Fig. 5(a) presents two deep-learning modules: C_M and F_M. The C_M module comprises four layers: a convolution layer, a batch-normalization layer, a max-pooling layer, and a rectified linear activation function. The F_M module comprises three layers: a fully connected layer, a batch-normalization layer, and a rectified linear activation function. These modules use shallow layers to reduce the computation time. Fig. 5(b) depicts the overall workflow of the first stage. In this study, the input to the first CNN is set to a size of 10 × 240 (w_s = 10 and H = 240), and all image strips of different sizes are resized to 10 × 240. The height H of the strips is quantized into M = 48 intervals of equal length H/M = 5 pixels. On combining the outputs from the image strips, the lower boundaries of the vehicles in the image are predicted, as indicated by the red bars in Fig. 5.
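A minimal sketch of how the per-strip softmax outputs are combined into the image-level boundaries (the choice of −1 as the "no vehicle" marker is ours):

```python
import numpy as np

def predict_lower_boundaries(softmax_out, H, M):
    """softmax_out: (num_strips, M + 1) class probabilities from the first CNN.

    The argmax node is the prediction p_l for each strip; node 0 means
    no vehicle, and a label m maps back to the pixel row m * H / M.
    """
    p_l = np.argmax(softmax_out, axis=1)               # prediction per strip
    boundaries = np.where(p_l == 0, -1.0, p_l * H / M)  # -1 marks empty strips
    return p_l, boundaries
```

With toy values H = 20 and M = 4, a strip whose maximum-probability node is 3 yields the boundary row 3 × 20/4 = 15.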

B. UPPER BOUNDARY DETECTION OF VEHICLES
In the second stage, the CNN returns the upper boundaries of the vehicles in an urban environment. Fig. 6 presents the main objective of the second stage. The input to the second CNN is the image strip cut below the lower boundary predicted by the first CNN, resized to a height of 240 pixels and denoted S_{u_r}^i(x, y, c). Using these definitions, we minimize the cross-entropy loss

L_u = − Σ_{m=1}^{M} I[ȳ_{u_r}(i) = m] log n_u(m),

where n_u denotes the M-dimensional output of the last layer of the second CNN. As in the case of the first CNN, the prediction p_{u_r} of the second CNN is

p_{u_r} = argmax_m n_u(m).

Using p_{u_r}(i), the upper boundary in the resized strip is computed as ŷ_{u_r}(i) = p_{u_r}(i) × 240/M, which is then mapped back to the original image coordinates to obtain the upper boundary of the vehicle in the i-th image strip, as shown in Figs. 6(c) and (d). Fig. 7 depicts a summary of the workflow. The workflow is similar to that of the first stage, and the architecture of the second CNN is the same as that of the first CNN; however, different weights are used. ŷ_u is indicated by the blue bars in Fig. 7.
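The mapping from the second CNN's class output back to image coordinates can be sketched as below. The assumption that the cut strip spans image rows [0, ŷ_l) and is resized to height 240 is ours, inferred from the 240/M factor above:

```python
def upper_boundary_from_label(p_u, y_l_hat, M=48, resized_h=240):
    """Convert the second CNN's class p_u, predicted on a strip that was cut
    below the lower boundary and resized to height 240, back to image rows.

    Assumption (not stated explicitly in the text): the cut strip covers
    image rows [0, y_l_hat), so a row in the resized strip maps back to the
    original image by the scale factor y_l_hat / resized_h.
    """
    y_u_r = p_u * resized_h / M          # boundary in resized-strip coordinates
    return y_u_r * y_l_hat / resized_h   # boundary in original image rows
```

For instance, with p_u = 24 and a predicted lower boundary at row 120, the resized-strip boundary 120 maps back to image row 60.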

C. REGION PROPOSAL NETWORK
In the third stage, two types of anchor boxes of different aspect ratios are generated for each vehicle candidate region, and an RPN is used to refine the anchor boxes to provide region proposals for the vehicles. Fig. 8 depicts the workflow of the RPN. As vehicles have varied width-to-height ratios, we generate two types of anchor boxes of aspect ratios 1:1 and 2:1 to represent all possible width-to-height ratios. For each vehicle candidate region characterized by a pair of upper and lower boundaries, the vertical center position of the anchor box is initialized as the mean of the upper and lower boundaries. For the j-th anchor box in the i-th image strip, the anchor box is parameterized as

a^h_{i,j} = ŷ_l^i − ŷ_u^i,  a^w_{i,j} = R_j · a^h_{i,j},  a^y_{i,j} = (ŷ_l^i + ŷ_u^i)/2,

where (a^x_{i,j}, a^y_{i,j}) denotes the center position of the j-th anchor box, with a^x_{i,j} placed at the horizontal center of the i-th image strip, a^w_{i,j} denotes the width of the j-th anchor box, a^h_{i,j} denotes its height, R_j is its aspect ratio, and ŷ_u^i and ŷ_l^i are the upper and lower boundaries of the i-th image strip, respectively. In Fig. 8, two anchor boxes of different aspect ratios are indicated by green boxes for an image strip. After generating all the anchor boxes, the RPN is used to estimate the bounding box via the parameterization of the four coordinates as follows [23]:

p̄_{i,j} = f_N(p_{i,j}, a_{i,j}) = [(p^x − a^x)/a^w, (p^y − a^y)/a^h, log(p^w/a^w), log(p^h/a^h)]^T,

where p̄_{i,j} denotes the normalized value of the predicted box p_{i,j} (the indices i, j are dropped on the right-hand side for brevity). The parameter g_{i,j} for the j-th anchor box is the ground-truth vehicle box nearest to the j-th anchor box. Using these definitions, we minimize the regression loss

L_reg = Σ_{i,j} smooth_L1(p̄_{i,j} − ḡ_{i,j}),
where ḡ_{i,j} denotes the normalized value of g_{i,j}, calculated using the function f_N(g_{i,j}, a_{i,j}) in (11). In Fig. 8, the region proposal results are indicated by blue bounding boxes. To generate a region proposal, an anchor box must first be defined. The anchor box is used as the input to the third CNN (the RPN) and serves as the reference region for the proposal. As the intersection over union between the anchor box and the ground truth of the vehicle increases, precise region proposals are generated more easily; therefore, a precise anchor box enhances the detection performance. To further enhance the detection performance, two RPNs are implemented for the anchor boxes of aspect ratios 1:1 and 2:1. The RPN for the 1:1 aspect ratio focuses on region proposals for the front or rear views of the vehicles, whereas the RPN for the 2:1 aspect ratio focuses on region proposals for the side views of the vehicles. In this study, the height of the anchor box is defined by the difference between the estimated upper and lower boundaries of the vehicle.
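A short numpy sketch of the anchor generation and the coordinate parameterization of [23]; the tuple layout (cx, cy, w, h) and the helper names are ours:

```python
import numpy as np

def make_anchors(y_u, y_l, cx, ratios=(1.0, 2.0)):
    """Generate the two anchor boxes for one vehicle candidate region.

    Height = lower boundary - upper boundary; width = R * height for
    R in {1, 2}; vertical center = mean of the two boundaries; cx is the
    horizontal center of the image strip."""
    h = y_l - y_u
    cy = 0.5 * (y_l + y_u)
    return [(cx, cy, r * h, h) for r in ratios]

def encode_box(gt, anchor):
    """Normalize a box (cx, cy, w, h) relative to an anchor (cx, cy, w, h),
    following the four-coordinate parameterization of Faster R-CNN [23]."""
    gx, gy, gw, gh = gt
    ax, ay, aw, ah = anchor
    return np.array([(gx - ax) / aw, (gy - ay) / ah,
                     np.log(gw / aw), np.log(gh / ah)])
```

When the ground truth coincides with the anchor, the normalized target is the zero vector, which is the fixed point the RPN regresses toward.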

D. VEHICLE CLASSIFIER
In the last stage, the vehicle classifier network is applied to the region proposals, and the vehicles are then detected. Fig. 9 illustrates the fourth stage, vehicle classification. The architecture of the fourth CNN is similar to that of the third one; however, different weights are used. The input to the fourth CNN is the region proposals returned from the third CNN. As in the third stage, two vehicle classifier networks are implemented separately, depending on which anchor box generated the region proposal in the previous stage. The vehicle classifier network related to the 1:1 anchor box is suitable for detecting the front or rear views of the vehicles, and the one related to the 2:1 anchor box is suitable for detecting the side views of the vehicles. The input to the fourth CNN is set to an image size of 64 × 64, and each region proposal from the third CNN is resized to 64 × 64. The fourth CNN is trained as a binary classifier, and the classification loss is the cross-entropy loss over the two classes (vehicle versus non-vehicle).
The proposed RPN produces several overlapping region proposals for each vehicle, as shown in Fig. 9. The detected bounding boxes accordingly overlap, thereby deteriorating the detection performance. To prevent this performance degradation, NMS [29] is used for post-processing. NMS eliminates the detection results with low confidence scores, retaining only the detection with the maximum confidence score among the overlapping detections. In Fig. 9, the VD results are indicated by blue bounding boxes, and the NMS results are indicated by red bounding boxes.
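The greedy NMS procedure of [29] can be sketched as follows (corner-format boxes (x1, y1, x2, y2) and the 0.5 IoU threshold are illustrative choices, not values stated in the text):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression: keep the highest-scoring box,
    drop boxes overlapping it above iou_thresh, and repeat.

    boxes: (N, 4) array of (x1, y1, x2, y2); scores: (N,) confidences.
    Returns the indices of the retained boxes."""
    order = np.argsort(scores)[::-1]          # process boxes by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        if order.size == 1:
            break
        rest = order[1:]
        # intersection of box i with all remaining boxes
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0.0, x2 - x1) * np.maximum(0.0, y2 - y1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thresh]       # suppress heavy overlaps
    return keep
```

Two heavily overlapping detections of the same vehicle collapse to the higher-scoring one, while a distant detection survives.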

E. TRAINING PROCESS
In this study, we adopt a four-step training algorithm to train the CNN in each stage. Fig. 10 presents the training process for the proposed MSCNN. In the first step, we train the first CNN for the lower-boundary detection of vehicles. In the second step, we collect a training dataset using image strips cut below the lower boundary generated by the first CNN. We then train the second CNN for the upper-boundary detection of the vehicles using this dataset. In the third step, we collect two training datasets using anchor boxes generated by the first and second CNNs; the two datasets consist of anchor boxes of aspect ratios 1:1 and 2:1. Then, we train the two RPNs using the respective datasets. In the fourth step, we separately train the two vehicle classifiers using the region proposals generated by each RPN.

IV. POSITION-WISE VEHICLE HEIGHT PREDICTOR
In the second stage for the upper-boundary detection of vehicles, the image strip from the lower boundary to the top is provided as the input to the second CNN. Because the image strip cut below the lower boundary is used, it is possible to prevent false upper-boundary detection of vehicles below the lower boundary. However, this approach still involves the scanning of the top of the image, including inefficient areas such as the sky or buildings. In this paper, a new approach using a PVHP is presented to generate a new image strip with the inefficient areas removed and enhance the upper-boundary detection performance. The main concept used herein is based on our previous work [12]. In our previous work, it was reported that the bottom position of the vehicle is highly correlated with the vehicle size. More specifically, large vehicles tend to be detected in the lower region of the image, whereas small vehicles tend to be detected in the upper region of the image. The vehicle-bottom position along the y-axis is plotted against the vehicle height in Fig. 11, wherein each dot corresponds to a vehicle in the AUTTI dataset [32]. In this manner, we defined the PVHP, which is given by the following linear statistical model.
w_h ~ N(w_y^T β, 1/φ),   (13)

where β = [β_0, β_1]^T and w_y = [1, w_y]^T (with a slight abuse of notation, w_y denotes both the design vector and the bottom position of the vehicle), w_h is the vehicle height, N(·) is a Gaussian distribution, and φ denotes the precision (the inverse of the variance). As shown in (13), β and φ are the model parameters of the PVHP. We then restrict the vehicle height to k standard deviations from the mean w_y^T β using

w_y^T β − k/√φ ≤ w_h ≤ w_y^T β + k/√φ.   (14)

In this study, we use k = 3 because three standard deviations from the mean account for 99.7% of the samples. Fig. 12 presents the results of the PVHP on the AUTTI dataset. In Fig. 11, the red line denotes the mean w_y^T β of the vehicle height, and the red dotted lines denote the minimum and maximum values w_y^T β − k/√φ and w_y^T β + k/√φ, respectively.
To explain the update of the PVHP parameters in detail, we denote the n-th detected vehicle as w^n = [w_y^n, w_h^n]^T and the accumulated vehicles as W^n = {w^1, w^2, ..., w^n}. We denote the parameters of the PVHP after detecting n vehicles as (β, φ)|W^n. If we assume that

(β, φ)|W^{n−1} ~ NG(·|µ_{n−1}, S_{n−1}, α_{n−1}, λ_{n−1}),   (15)

and that the n-th vehicle w^n = [w_y^n, w_h^n]^T, which respects the PVHP, is detected, the parameters of the PVHP can then be updated using the Bayesian rule as

(β, φ)|W^n ~ NG(·|µ_n, S_n, α_n, λ_n),   (17)

with

α_n = α_{n−1} + 1/2,
S_n^{−1} = S_{n−1}^{−1} + w_y^n (w_y^n)^T,
µ_n = S_n (S_{n−1}^{−1} µ_{n−1} + w_y^n w_h^n),
λ_n = λ_{n−1} + (1/2) ((w_h^n)² + µ_{n−1}^T S_{n−1}^{−1} µ_{n−1} − µ_n^T S_n^{−1} µ_n),   (18)

where w_y^n = [1, w_y^n]^T, and N(·) and NG(·) denote the normal and normal-gamma distributions, respectively. The derivation of (18) is presented in our previous work [12]. Using the PVHP parameterized by (β, φ)|W^n, we replace β and φ in (14) with their posterior means E(β|W^n) and E(φ|W^n), respectively, where E(β|W^n) = µ_n and E(φ|W^n) = α_n/λ_n. The proposed MSCNN with the PVHP (MSCNN+PVHP) in Fig. 13 is outlined as follows: 1) Assume that n − 1 vehicles have already been detected. The first stage of the proposed method (MSCNN+PVHP) is the same as that of the MSCNN. When the lower boundary predicted using the first CNN is given, the PVHP parameterized by (β, φ)|W^{n−1} estimates the maximum value of the vehicle height w_h^max = w_y^T β + k/√φ.
2) Using the maximum value of the vehicle height, we refine the image strip that is the input of the second CNN. The refined image strip covers a more relevant area than the previous one. 3) In the third and last stages, the workflow of the proposed method (MSCNN+PVHP) is the same as that of the MSCNN. If the n-th vehicle is detected, the parameters µ_n, S_n, α_n, λ_n are updated using (18). 4) Steps 1 to 3 are repeated. Each time the PVHP is updated with a newly detected vehicle, its predictions become more stable.
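The PVHP update (18) and the height bound (14) can be sketched in numpy as follows. This is a sketch assuming the standard normal-gamma recursion for Bayesian linear regression (as derived in [12]) for the λ-update and for E(φ|W_n) = α_n/λ_n:

```python
import numpy as np

def pvhp_update(mu, S_inv, alpha, lam, wy, wh):
    """One Bayesian update of the PVHP normal-gamma parameters after
    detecting a vehicle with bottom position wy and height wh.

    Follows the standard Bayesian linear-regression normal-gamma
    recursion; parameters are (mu, S^-1, alpha, lambda)."""
    x = np.array([1.0, wy])                  # design vector w_y = [1, wy]^T
    S_inv_new = S_inv + np.outer(x, x)       # precision-matrix update
    mu_new = np.linalg.solve(S_inv_new, S_inv @ mu + x * wh)
    alpha_new = alpha + 0.5
    lam_new = lam + 0.5 * (wh**2 + mu @ S_inv @ mu - mu_new @ S_inv_new @ mu_new)
    return mu_new, S_inv_new, alpha_new, lam_new

def pvhp_height_bounds(mu, alpha, lam, wy, k=3.0):
    """Predicted height range w_y^T beta +/- k / sqrt(phi), replacing beta
    and phi with their posterior means E[beta] = mu and E[phi] = alpha/lam."""
    x = np.array([1.0, wy])
    mean_h = x @ mu                          # posterior-mean height prediction
    phi = alpha / lam                        # posterior-mean precision
    return mean_h - k / np.sqrt(phi), mean_h + k / np.sqrt(phi)
```

Starting from a vague prior and feeding in one detection shifts the mean toward the observed height and increments α by 1/2, as in (18).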

V. EXPERIMENTAL RESULTS
In this section, the proposed MSCNN is applied to three datasets: KITTI VD [30], AUTTI [31], and CrowdAI [31]. Each dataset is built using a fixed monocular camera installed at the front of the host vehicle. The performance of the proposed method is compared with that of previous methods. All the algorithms are trained on a GTX TITAN X GPU using the PyTorch package [32] in Python, and the trained networks are deployed on an Intel CPU i5-4670 for testing. The following four algorithms are compared in this section: 1) vehicle detection based on YOLOv3 (single class); 2) vehicle detection based on YOLOv3-tiny (single class); 3) the proposed MSCNN; and 4) the proposed MSCNN with the PVHP (MSCNN_PVHP). The first and second algorithms are state-of-the-art detection methods, whereas the third and fourth algorithms are the ones proposed in this study. In the fourth algorithm, the PVHP is employed to accelerate the CNN of the upper-boundary detector. YOLOv3 and YOLOv3-tiny generally produce multiple outputs in each YOLO layer to achieve multi-class classification. For a fair experimental comparison, YOLOv3 and YOLOv3-tiny are implemented for a single class (vehicle) rather than multiple classes. Thus, for a single class, YOLOv3 and YOLOv3-tiny require fewer convolution filters than they do for multiple classes. Table 1 lists the experimental environments for the training and testing of each algorithm. To meet the real-time performance requirements in a CPU-based test environment, YOLOv3 is trained using a small image (160 × 160). To obtain computation times similar to that of YOLOv3, the other algorithms are trained using correspondingly appropriate image sizes.

A. KITTI DATASET
The KITTI set comprises the detection and tracking DBs, and the size of the images in the KITTI datasets is 1250 × 375. The KITTI detection DB comprises 7481 training images and 7518 testing images. The KITTI tracking DB comprises 21 training sequences (8008 images) and 29 test sequences (11095 images). The proposed and previous methods are trained using the training images in the KITTI detection DB and evaluated using the training images in the KITTI tracking DB because the testing set in the KITTI detection DB is not annotated. Fig. 14 depicts the VD results for the KITTI dataset.
Figs. 14(a) and (b) are obtained using the YOLOv3 and YOLOv3-tiny models. For real-time performance using only the CPU, the two models are trained with a reduced image size. In this CPU-based test environment, YOLOv3 has more false negatives than YOLOv3-tiny because YOLOv3 must be trained on an image size that is too small, a compromise made to satisfy the real-time requirement. Figs. 14(c) and (d) are obtained using the proposed methods (MSCNN and MSCNN_PVHP). As shown in the figure, the proposed methods have fewer false positives and false negatives than the previous methods. In particular, the PVHP contributes to improving the VD rate of the MSCNN.

B. CrowdAI DATASET
The CrowdAI set comprises images of driving in Mountain View, California, and neighboring cities during daylight. The size of the images in the CrowdAI dataset is 1920 × 1200. The CrowdAI DB contains 9400 images, which we randomly divide into 4800 training images and 4600 testing images. Fig. 15 depicts some of the VD results for the CrowdAI dataset.
Figs. 15(a) and (b) are obtained using the YOLOv3 and YOLOv3-tiny models. As in the KITTI test environment, the two models are trained with a reduced image size. In this CPU-based test environment, YOLOv3 has more false negatives than YOLOv3-tiny because YOLOv3 must be trained on an image size that is too small, a compromise made to satisfy the real-time requirement. Figs. 15(c) and (d) are obtained using the proposed methods (MSCNN and MSCNN_PVHP). As shown in the figure, the proposed methods have fewer false positives and false negatives than the previous methods. In particular, the PVHP improves the VD performance in the cases of small far-away vehicles and occlusions.

C. AUTTI DATASET
The AUTTI set is similar to the CrowdAI set. The size of the images in the AUTTI dataset is 1920 × 1200. The AUTTI DB comprises 10416 images, and we randomly divided them into 6400 training images and 4000 testing images. Fig. 16 depicts the VD results for the AUTTI dataset.
As shown in Fig. 16(a), a number of false positives and false negatives are obtained using YOLOv3.
In particular, small vehicles far from the camera or vehicles seen from the side tend to remain undetected. In Fig. 16(b), YOLOv3-tiny is more robust than YOLOv3 and detects vehicles viewed from various angles in the CPU-based test environment; however, small far-away vehicles are sometimes missed. In Fig. 16(c), the MSCNN demonstrates an improved detection performance for small far-away vehicles, implying that the MSCNN generates better region proposals than YOLOv3-tiny. Fig. 16(d) depicts the results of the MSCNN with the PVHP, which realizes a better detection performance than the MSCNN without it. Fig. 17 depicts the VD results for special cases. When two vehicles appear in the same image strip, the first CNN predicts only one vehicle's lower boundary, the one with the maximum probability, as shown in Fig. 17(a). Although the first CNN misses the lower boundary of a vehicle in some image strips (left image in Fig. 17(a)), the vehicle can still be detected (left image in Fig. 17(c)) because one vehicle typically spans several image strips; if, however, the lower boundary is missed in all the strips covering a vehicle (right image in Fig. 17(a)), the vehicle cannot be detected (right image in Fig. 17(c)). When a vehicle is hidden by obstacles, the first CNN predicts a false lower boundary near the obstacles, as shown in Fig. 17(b). As in the previous case, the lower-boundary detections from the other strips are used to detect the vehicle correctly, as shown in Fig. 17(d). Tables 2, 3, and 4 present comparisons of the log-average miss rates on each KITTI tracking DB sequence for all the methods under consideration. In the KITTI tracking DB, the ground truth of the vehicles has three occlusion levels: fully visible (level 0), partly occluded (level 1), and difficult to see (level 2). Fig. 18 illustrates examples of vehicles at each occlusion level. Tables 2, 3, and 4 present the detection performances for these three occlusion levels.
In Tables 2, 3, and 4, the best performance for each KITTI tracking DB sequence is indicated in red boldface. The proposed MSCNN and MSCNN_PVHP are advantageous at all the occlusion levels. Fig. 19 depicts the precision-recall curves for the KITTI tracking, CrowdAI, and AUTTI datasets; the value next to each algorithm in the legend denotes its average precision. The goal of VD is to increase the precision, recall, and average precision simultaneously. Figs. 19(a)-(c) correspond to the KITTI tracking, CrowdAI, and AUTTI datasets, respectively. In addition, YOLOv5 [34] is also compared in this figure.
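The precision-recall curves and the average-precision values in the legend of Fig. 19 can be reproduced from score-ranked detections. The sketch below computes AP as the area under the interpolated PR curve (all-point interpolation); this convention is an assumption, as the text does not state whether 11-point or all-point AP is used:

```python
import numpy as np

def precision_recall_curve(scores, is_tp, n_gt):
    """Precision and recall over score-sorted detections.
    scores: confidence per detection; is_tp: 1 if the detection is
    matched to a ground-truth vehicle, else 0; n_gt: number of
    ground-truth vehicles."""
    order = np.argsort(-np.asarray(scores))
    tp = np.cumsum(np.asarray(is_tp)[order])
    fp = np.cumsum(1 - np.asarray(is_tp)[order])
    recall = tp / n_gt
    precision = tp / (tp + fp)
    return precision, recall

def average_precision(precision, recall):
    """AP as the area under the PR curve (all-point interpolation;
    assumed convention)."""
    p = np.concatenate(([0.0], precision, [0.0]))
    r = np.concatenate(([0.0], recall, [1.0]))
    # make precision monotonically non-increasing from right to left
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # sum the area over the recall steps
    changed = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[changed + 1] - r[changed]) * p[changed + 1]))
```

Raising all three quantities simultaneously, as the text notes, corresponds to pushing the whole PR curve toward the top-right corner of Fig. 19.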

D. ANALYSIS OF THE EXPERIMENTAL RESULTS
Among the YOLO series, YOLOv5 exhibits the best performance. For all three datasets, the MSCNNs outperform YOLOv3, YOLOv3-tiny, and YOLOv5; in particular, the MSCNN_PVHP exhibits the best performance among the five compared methods. To demonstrate the ability of the MSCNN to detect small vehicles, the precision-recall curve obtained when only small vehicles are considered is plotted in Fig. 20. In the figure, small vehicles are defined as those with widths less than or equal to 50 (KITTI tracking dataset) or 200 (CrowdAI and AUTTI datasets) pixels. As shown in Fig. 20, the proposed MSCNNs again present better detection performance than the other methods, and the performance of the MSCNN_PVHP is the best. In particular, YOLOv3 and YOLOv3-tiny demonstrate significant performance degradation when only small vehicles are considered.
All the experiments conducted thus far have focused on the detection performance (log-average miss rate, precision, recall, and average precision) without considering the computation time. Considering only the previous evaluation criteria, it is difficult to determine which algorithm is better (e.g., one algorithm exhibits a high detection rate but a low number of frames per second (FPS), whereas another exhibits a low detection rate but a high FPS). To consider both the detection performance and the computation time, the AP-FPS value is defined as

AP-FPS = ∫ p(t) dt,    (21)

where t is the FPS and p(t) is the average precision (AP) value at t on the AP-FPS curve. Fig. 21 depicts the AP-FPS curves for the KITTI tracking, CrowdAI, and AUTTI datasets using only the CPU. In the region below 5 fps, YOLOv3 exhibits a high detection performance, but for a high FPS, its detection performance decreases significantly; therefore, YOLOv3 is unsuitable for use with only the CPU. As shown in Fig. 21, the proposed MSCNN_PVHP outperforms all the compared algorithms above 5 fps.
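Assuming that the AP-FPS value of (21) denotes the area under the AP-versus-FPS curve, it can be approximated from a few measured (FPS, AP) operating points by trapezoidal integration. A minimal sketch under that assumption (the function name and the trapezoidal approximation are ours, not the paper's):

```python
import numpy as np

def ap_fps_score(fps, ap):
    """Area under the AP-versus-FPS curve via trapezoidal integration.
    fps: frames-per-second operating points of one detector;
    ap: average precision measured at each operating point."""
    fps = np.asarray(fps, dtype=float)
    ap = np.asarray(ap, dtype=float)
    order = np.argsort(fps)          # integrate over increasing FPS
    t, p = fps[order], ap[order]
    return float(np.sum((t[1:] - t[:-1]) * (p[1:] + p[:-1]) / 2.0))
```

A detector that keeps its AP high as the FPS increases accumulates a larger area, which matches the ranking behavior described for Fig. 21 and Table 5.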
Table 5 lists the AP-FPS values defined by (21) for the three datasets. Considering both the detection performance and the computation time, the proposed MSCNN_PVHP exhibits the highest AP-FPS value for all the datasets. Moreover, the MSCNN_PVHP incurs a negligible additional computation cost (below 1 ms) compared with the MSCNN, although its detection performance is superior. In conclusion, our MSCNN demonstrates a higher AP-FPS value than the compared algorithms. Thus, the proposed VD system is effective for real-time systems using only a CPU (without a GPU).

E. CONCLUSION
In this study, the MSCNN and PVHP were proposed to improve the VD performance and reduce the corresponding computation time. The proposed methods demonstrated high performance in detecting vehicles on the KITTI, CrowdAI, and AUTTI datasets. The experimental results revealed that the MSCNN outperformed the state-of-the-art methods YOLOv3 and YOLOv3-tiny. Furthermore, the MSCNN with PVHP improved the VD performance without incurring an additional computation cost. The proposed method achieved the best performance without using a GPU and is therefore expected to facilitate the implementation of the system in embedded environments.