An Improved Speed Estimation Using Deep Homography Transformation Regression Network on Monocular Videos

Vehicle speed estimation is one of the most critical issues in intelligent transportation system (ITS) research, and defining distance and identifying direction have become inseparable parts of vehicle speed estimation. Despite the success of traditional and deep learning approaches in estimating vehicle speed, the high cost of deploying hardware devices to collect the required sensor data, such as infrared/ultrasonic devices, Global Positioning System (GPS) receivers, Light Detection and Ranging (LiDAR) systems, and magnetic devices, has been the key barrier to improvement in previous studies. In this paper, our proposed model consists of two main components: 1) a vehicle detection and tracking component, designed to detect and track each specific object reliably without camera calibration; and 2) a homography transformation regression network, which resolves occlusion issues and estimates vehicle speed accurately and efficiently. Experimental results on two datasets show that the proposed method outperforms state-of-the-art deep learning approaches, reducing the mean square error (MSE) from 14.02 to 6.56. Our test code and models are released on GitHub at https://github.com/ervinyo/Speed-Estimation-Using-Homography-Transformation-and-Regression-Network.


I. INTRODUCTION
Estimating vehicle speed precisely on the road, especially from monocular videos, has become a necessary and important task for various applications such as intelligent transportation systems (ITS), traffic analysis, anomaly event detection [1], [2], and vehicle re-identification [3], [4], [5]. Speed estimation can additionally be used to reduce traffic accidents on the highway under conditions such as daylight, low visibility, rain, hail, and snow [6]. It can also be combined with the distance and direction of each vehicle, for instance, to guide the visually impaired when they are walking on the street [7]. In that setting, direction is derived from the distance between the person and each object. Many researchers have used distance and direction together to obtain better speed estimation results, and many benefits of speed estimation can be realized in the real world.
There are two approaches to vehicle speed estimation: traditional and non-traditional methods. Traditional approaches typically utilize dedicated hardware sensors, including infrared/ultrasonic, Global Positioning System (GPS), Light Detection and Ranging (LiDAR), and magnetic sensors. In [8], an infrared/ultrasonic sensor combined with machine learning approaches (e.g., Bayesian methods) reached 99% accuracy in vehicle detection and a mean error of 0.7 m in vehicle speed estimation. However, the required cost of deployment in a city was around US$30,000. GPS and LiDAR sensors could solve calibration issues in vehicle speed estimation at 6 different locations and 3 different camera angles, including center, left, and right [9]. Based on experimental results, the combination of GPS and LiDAR can achieve higher accuracy than the state-of-the-art method. Many researchers have used magnetic sensors not only to measure vehicle speed, but also to classify vehicles. In [10], magnetic sensors were able to classify vehicles into four classes (sedan, SUV/van, bus, and truck) with an accuracy above 90%, while the mean error of the speed estimation was below 7.5 km/h. Almost all of these traditional approaches achieve better speed estimation results than approaches that directly use a camera; however, deploying such devices on a large scale is expensive and time-consuming, and each sensor only applies under specific conditions.
Non-traditional approaches usually rely on software without any hardware sensors, offering simplicity without expensive deployment steps. Such approaches typically use images with machine learning or deep learning methods for real-time applications. In [11], a deep Convolutional Neural Network (CNN) successfully addressed the lack of robustness and poor model interpretability in traffic speed estimation. It reached a Root Mean Square Error (RMSE) of ∼8.76 km/h and greatly reduced the number of network parameters, enabling the system to perform computations more quickly. Vehicle speed estimation with a 3-dimensional (3D) CNN and non-local block reached a Mean Absolute Error (MAE) of 2.71 km/h and a Mean Square Error (MSE) of 14.62 [12]. However, these methods require extensive labeled data and involve splitting the dataset into multiple short videos, so some information can be missing from the ground-truth datasets. On the other hand, object detection techniques such as You Only Look Once (YOLO) and tracking methods like deep Simple Online Realtime Tracking (DeepSORT) can estimate the speed of each vehicle from its appearance on the highway [13]. DeepSORT has a very high detection rate, and its speed estimation achieves an MSE of 16.78 compared to the other methods. However, these results were obtained on only one dataset [14] and suffer from occlusion issues.
Each of the two approaches to vehicle speed estimation has its advantages and disadvantages. In this paper, we propose a system that can estimate the speed of each vehicle using object detection and tracking methods on one complete video without splitting it. In addition, the proposed method is more efficient and has a lower cost. Moreover, we use two datasets, BrnoCompSpeed and AIC18, to evaluate the effectiveness of our proposed method. A summary of the contributions of this research follows. 1) We identify three important issues in vehicle speed estimation: identifying direction [7], defining distance, and generating speed estimates accurately and efficiently.
2) We solve the calibration issues using a combination of detection and tracking methods, based on CSPDarkNet53 as the network backbone, and DeepSORT. This helps us obtain reliable object detection and tracking, especially for a specific object class (e.g., cars).
3) We solve occlusion issues using a deep homography transformation regression network, not only to change monocular videos into a bird's-eye view, but also to improve the vehicle speed estimation results.
In Section II, we discuss related works. The proposed method, which includes general architecture, vehicle detection and tracking, homography transformation network, distance measurement, and regression speed estimation, will be presented in Section III. The experiments are shown in Section IV. Finally, conclusions and future works are presented in Section V.

II. RELATED WORKS
In this section, we discuss object detection and tracking methods, bird's-eye view methods, and vehicle speed estimation approaches related to our proposed method.

A. OBJECT DETECTION AND TRACKING METHOD
There are two kinds of object detectors. One is the one-stage detector, such as YOLO or the Single Shot Detector (SSD), which has a high inference speed. The other is the two-stage detector, such as Faster R-CNN, which has more robust accuracy but a lower inference speed [15]. A one-stage detector is more suitable for real-time systems according to the KITTI benchmark [16]. Recently, one-stage detectors have been found to outperform two-stage detectors in object detection. For instance, Peng et al. [17] and Hsu et al. [18] demonstrated that a one-stage detector achieved higher detection accuracy than a two-stage detector. They used YOLO to detect vehicles and pedestrians, reaching mean average precisions (mAP) of 92.1% and 88.5%, respectively.
The combination of object detection and tracking methods also shows powerful improvement and performance for autonomous vehicles in adverse weather conditions [19]. It can detect every vehicle, even small ones, in any weather condition, including heavy snow, fog, rain, or dust. Perera et al. [20] implemented vehicle tracking and detection to reduce errors in tracking associations. Kapania et al. [21] proposed a multi-object detection and tracking method for Unmanned Aerial Vehicle (UAV) datasets and achieved a Multi-Object Tracking Accuracy (MOTA) of 45.8%. In our work, we propose using multiple object detection and tracking methods to solve the calibration issues in monocular videos.

B. BIRD'S-EYE VIEW METHODS
Bird's-Eye View (BEV) is the most important tool for understanding a surrounding scene for specific tasks such as collision avoidance and object tracking. BEV has become well-known in recent years, since it offers accurate information and is easy to implement in the real world [22]. Many works continue to use sensors (e.g. LiDAR) with deep learning methods such as CNN [23], [24], Generative Adversarial Network (GAN) [25], and Deep Layer Aggregation (DLA) [26] to develop BEV from the frontal view, especially for vehicles and pedestrians. They take advantage of the LiDAR point cloud and the frontal view so as to achieve high-quality and significant performance compared to the baseline detector. However, the cost of LiDAR is considerable, and the fusion of LiDAR and images poses another issue.
Deep learning methods have rapidly developed in recent years as well. To build the BEV, Palazzi et al. [27] proposed a deep learning framework called the Semantic-aware Dense Projection Network (SDPN), which is based on a basic CNN architecture. The result showed improvements in vehicle detection and in the bounding box positioning of vehicles in real videos. BEV can resolve occlusion issues and reveal the scale of an object [28]. Moreover, a BEV system enhances scene understanding in complex situations that contain many objects, such as cars, people, trees, bicycles, and buses [29]. Given these advantages, we have implemented BEV based on deep learning (e.g., Inception-v4) to enhance vehicle and line detection and thereby calculate an accurate speed. In addition, the proposed method can transform any monocular view into BEV.

C. VEHICLE SPEED ESTIMATION
Markevicius et al. [30] proposed Anisotropic Magneto-Resistive (AMR) sensor hardware to estimate the speed of hatchbacks, sedans, station wagons, and SUVs in real time in Lithuania. The AMR sensor uses 3-axis coordinates to detect and classify vehicles. In addition, it has a wider detection area than other sensors. The Mean Square Error (MSE) in predicting vehicle size was around 22%. Famouri et al. [31] reported that the motion plane of the images can be used to estimate vehicle speed and detect the license plate with a camera in any condition, such as light or dark. They used projection removal, which transforms the monocular view into a top view, and used vehicle tracking and 3D positions to estimate speed. Hua et al. [32] proposed speed estimation to deal with collisions on the road by combining detection and tracking methods; they used the NVIDIA AI City Challenge dataset [14], [33], [34], [35], [36], [37], [38] for testing. Kumar et al. [39] proposed a semi-automatic 2D rectification with vanishing points and an associated scaling factor to estimate speed. Hua et al. [32] also used optical flow methods such as Shi-Tomasi and Lucas-Kanade to enhance the detection and tracking results before estimating vehicle speed. Tran et al. [40] likewise used optical flow to enhance detection and tracking results, and then created virtual scanlines from landmarks on the road to estimate vehicle speed. Lastly, Dubská et al. [41] applied Faster R-CNN and a Kalman filter for detection and tracking, and then used regular 2 × 2 orthogonal grids as a calibration function to estimate the speed of every vehicle.

III. PROPOSED METHOD
In the general architecture, as shown in Figure 1, we use three deep learning methods: YOLOv4, DeepSORT, and Inception-v4. YOLOv4 is used for concurrent vehicle and line detection, while DeepSORT is used for vehicle tracking. In addition, we use Inception-v4 to generate a homography matrix that changes the monocular view into a bird's-eye view. To identify the direction of each vehicle, a Python geo-location library is used. The combination of the three deep learning models is used to estimate the vehicle speed (in kilometers per hour) in videos. In Figure 2, our feature extraction includes homography transformation, line detection, and vehicle detection and tracking, respectively. Lastly, all the extracted features are combined by a regression module to generate the speed estimate. The final result shows the predicted speed and the direction of the vehicle.

A. HOMOGRAPHY TRANSFORMATION NETWORK
The aim of the homography transformation is to change the view from monocular to BEV. It can depict a scene better than the monocular view, especially for both vehicles and street lines. As shown in Figure 3, our homography transformation network uses Inception-v4 as the backbone and generates a 3 × 3 homography matrix. The Inception network has a simple architecture and a lower memory requirement than other CNN architectures such as VGG-16, ResNet, and AlexNet. Moreover, the number of parameters in the Inception network is around 6.4M, so training is faster than with the others. For the homography transformation, we use the CarlaVP2 dataset (https://carla.org/), which contains the ground truth of vanishing points ($q$), center $(x_s, y_s)$, tilt-roll $(t, r)$, and field of view $(f_w, f_h)$, to train our Inception network. After training, the vanishing points (VP), center, tilt-roll, and field of view (FOV) predicted for an input video are used to obtain the homography matrix $H$, as shown in Equation 1.
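As an illustration, the warping step itself can be sketched with OpenCV's warpPerspective. The matrix values below are placeholders only, since the actual $H$ is predicted per video by the regression network:

```python
import cv2
import numpy as np

def warp_to_bev(frame, H, out_size=(640, 480)):
    # Warp a monocular frame into a bird's-eye view using a
    # 3x3 homography matrix H (here assumed to come from the
    # Inception-v4 regression on VP/center/tilt-roll/FOV).
    return cv2.warpPerspective(frame, H, out_size)

# Illustrative placeholder matrix; not a real prediction.
H = np.array([[1.0, -0.3, 120.0],
              [0.0,  0.6,  40.0],
              [0.0, -1e-3,  1.0]])
bev = warp_to_bev(cv2.imread("frame.jpg"), H)
```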
After performing the homography transformation, line segments are found in the transformed frame by using the probabilistic Hough transform. Next, we compute the gradient $m$ between the two endpoints $(x_1, y_1)$ and $(x_2, y_2)$ of a line, as shown in Eq. (2), and the angle $\theta_{x,y}$, as shown in Eq. (3):

$$m = \frac{y_2 - y_1}{x_2 - x_1}, \qquad \theta_{x,y} = \tan^{-1}(m).$$
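A minimal sketch of this line-analysis step, assuming OpenCV's probabilistic Hough implementation (the Canny and Hough thresholds here are illustrative, not the paper's tuned values):

```python
import cv2
import numpy as np

def line_angles(bev_gray):
    # Detect line segments with the probabilistic Hough transform,
    # then compute each segment's gradient m (Eq. 2) and angle
    # theta (Eq. 3).
    edges = cv2.Canny(bev_gray, 50, 150)
    segments = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180,
                               threshold=50, minLineLength=30, maxLineGap=10)
    results = []
    if segments is not None:
        for x1, y1, x2, y2 in segments[:, 0]:
            if x2 == x1:  # vertical segment: gradient undefined
                m, theta = float("inf"), 90.0
            else:
                m = (y2 - y1) / (x2 - x1)
                theta = np.degrees(np.arctan(m))
            results.append(((x1, y1, x2, y2), m, theta))
    return results
```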

B. ARCHITECTURE OF VEHICLE DETECTION AND TRACKING NETWORK
In the detection and tracking network, we combine YOLOv4 and DeepSORT to detect and track each vehicle in the images. Figure 4 shows the architecture of our detection and tracking network, where CSPDarkNet53, Spatial Pyramid Pooling (SPP), and the Path Aggregation Network (PAN) come from YOLOv4. We chose YOLOv4 because it improves average precision (AP) and frames per second (FPS) by 10% and 12%, respectively, over YOLOv3. DeepSORT is used for object tracking and integrates Kalman prediction, Mahalanobis distance, a deep appearance descriptor, and Hungarian assignment. It was selected because it requires only minimal resources and offers faster processing with promising results compared to other tracking methods.
We use the pretrained model from the Microsoft Common Objects in Context (MS COCO) dataset and choose one specific object class (e.g., vehicle). Our input for the detection and tracking network is the BEV result generated by the homography transformation network, which has the advantage of mitigating occlusion issues in detection and tracking. In Figure 5, we can see the difference between using and not using BEV as input. With BEV, object detection is more effective than without it, especially for vehicle and street-line detection.
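For illustration, the vehicle-detection half of this module can be sketched with OpenCV's Darknet importer and the MS COCO pretrained YOLOv4 files; the file paths are placeholders, and the DeepSORT association step is omitted here:

```python
import cv2

# Load the MS COCO pretrained YOLOv4 (placeholder paths).
net = cv2.dnn.readNetFromDarknet("yolov4.cfg", "yolov4.weights")
model = cv2.dnn_DetectionModel(net)
model.setInputParams(size=(416, 416), scale=1 / 255.0, swapRB=True)

COCO_CAR_ID = 2  # "car" in the 80-class COCO label list

def detect_vehicles(bev_frame, conf=0.4, nms=0.4):
    # Run YOLOv4 on a BEV frame and keep only car detections.
    class_ids, scores, boxes = model.detect(bev_frame, conf, nms)
    return [(box, float(score))
            for cid, score, box in zip(class_ids, scores, boxes)
            if int(cid) == COCO_CAR_ID]
```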

C. LINE DETECTION AND DISTANCE MEASUREMENT
The distance measurement is defined by two detected street lines. Line detection is also based on YOLOv4, and we collected around 200 images for training the line detector. The detection achieves around 80% mean Average Precision (mAP). Moreover, we use real line segments to get the total distance, as shown in Figure 6. In Figure 6, the length of each line segment is 4 meters, and the gap between two consecutive line segments is 6 meters, so the total distance covered by a line segment and its gap is 10 meters. The distance used in the real image is shown in Figure 7. We select two consecutive white line segments; the center point of each white line segment is marked by a yellow circle with coordinates $(x_i, y_i)$. The vehicle speed is calculated between the vehicle entering at $(x_1, y_1)$ and leaving at $(x_2, y_2)$. The system realizes this concept by adding blue lines that are perpendicular to the green line. The distance measurement is performed in BEV rather than in the monocular view, which makes the distance and speed simple and accurate to compute.
The distance of the green line (as shown in Figure 7) was computed using the Euclidean distance. The Euclidean distance between the real street line and its image counterpart, $D_{r,i}$, is computed from the total distance of the line segment and its gap, $d_r$, as shown in Eq. (4), where $d_r$ is the real distance of 10 meters and $(x_1, y_1)$ and $(x_2, y_2)$ are the coordinates of the first and second line centers. The frame rate of the video is set to 50 frames per second ($C_{fps} = 50$), so the travel time $T_v$ is calculated by Eq. (5):

$$T_v = \frac{F_n - F_0}{C_{fps}},$$

where $F_0$ and $F_n$ are the frames in which the vehicle enters and leaves the measured segment. Then, the vehicle speed $V_c$ in kilometers per hour (km/h) is determined by Eq. (6):

$$V_c = \frac{D_{r,i}}{T_v} \times 3.6,$$

where the factor 3.6 converts meters per second into km/h, as given in Eq. (7): $3.6 = 3600\ \text{s/h} \,/\, 1000\ \text{m/km}$.
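A compact sketch of this computation (the function name and default arguments are ours, not the paper's):

```python
def estimate_speed_kmh(enter_frame, exit_frame, distance_m=10.0, fps=50.0):
    # Travel time T_v from the frame indices (Eq. 5), then speed in
    # km/h via the 3.6 m/s-to-km/h conversion factor (Eqs. 6-7).
    travel_time_s = (exit_frame - enter_frame) / fps
    return (distance_m / travel_time_s) * 3.6

# e.g., a vehicle crossing the 10 m segment in 25 frames at 50 fps:
# estimate_speed_kmh(100, 125) -> 72.0 km/h
```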
The estimated speed $V_c$ can be further improved by a regression network that smooths the result. The network takes four input variables: the enter frame (when the vehicle passes $F_0$), the exit frame (when the vehicle passes $F_n$), the distance $D_{r,i}$, and the predicted speed $V_c$, as shown in Figure 9. We also use two hidden layers with 8 and 4 neurons, respectively. An example of the final vehicle speed is shown in Figure 10.
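A possible PyTorch sketch of this regressor is shown below; the activation function and training setup are our assumptions, as they are not specified here:

```python
import torch
import torch.nn as nn

# Four inputs (enter frame, exit frame, distance D_ri, initial speed V_c),
# two hidden layers with 8 and 4 neurons, one refined-speed output.
speed_regressor = nn.Sequential(
    nn.Linear(4, 8), nn.ReLU(),
    nn.Linear(8, 4), nn.ReLU(),
    nn.Linear(4, 1),
)

features = torch.tensor([[120.0, 245.0, 10.0, 82.3]])  # illustrative sample
refined_speed = speed_regressor(features)
```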

IV. EXPERIMENTAL RESULTS
In this section, the performance of our proposed method is evaluated. We use the BrnoCompSpeed and AIC18 datasets to test our proposed method and provide a comparison to the baseline. In addition, we conduct an ablation study to evaluate the effectiveness of each component in our proposed model. All of our programs were implemented in Python and run on an Intel Core i7-8700.

A. DATASETS
We use two datasets: BrnoCompSpeed [9] and the 2018 AI City Challenge (AIC18) [14]. Both datasets contain full-HD videos (1080p) of various durations. The BrnoCompSpeed dataset contains 21 full-HD videos, each around 1 hour long, captured at six different locations. Vehicles in the videos are annotated with precise speed measurements from optical gates using LiDAR and verified with several reference GPS tracks. The BrnoCompSpeed dataset has seven sessions; however, one of them does not have ground truth, so we use only six sessions to evaluate our proposed method. The videos in our experiments were resized to 640 × 480 at 50 frames/second. Some images from the BrnoCompSpeed dataset are shown in Figure 11. The AIC18 dataset is no longer publicly available and has no public ground truth, so we needed to upload our results to obtain the accuracy. Several hours of video were recorded at multiple intersections and along highways in Silicon Valley and in Iowa, with recordings from multiple sensors capturing the flow of traffic along major arterials and multiple traffic intersections. AIC18 has only two camera angle views, center and right, but is more complicated than the BrnoCompSpeed dataset. The videos come from 4 different locations: Locations 1 and 2 have eight one-minute videos each, Location 3 has six, and Location 4 has five. The original AIC18 videos have a resolution of 1920 × 1080 px, and we resize them to 640 × 480 px at 50 frames/second. Some images from the AIC18 dataset are shown in Figure 12.

B. EVALUATION
Mean Square Error (MSE) is used as the measurement to evaluate our approach on both the BrnoCompSpeed and AIC18 datasets. The lower the MSE, the better the prediction is. For a video $i$ containing $N$ vehicles, a vector of predictions $Y_i$ is generated with corresponding ground truth $\hat{Y}_i$. The MSE is shown in Equation 8:

$$\text{MSE}_i = \frac{1}{N} \sum_{j=1}^{N} \left( Y_{i,j} - \hat{Y}_{i,j} \right)^2.$$
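For reference, the per-video metric can be computed with a small NumPy helper of our own:

```python
import numpy as np

def mse(predictions, ground_truth):
    # Per-video MSE over the N vehicles, as in Eq. (8).
    y = np.asarray(predictions, dtype=float)
    y_hat = np.asarray(ground_truth, dtype=float)
    return float(np.mean((y - y_hat) ** 2))
```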

C. RESULTS OF HOMOGRAPHY TRANSFORMATION ON MONOCULAR VIEW
The homography transformation can adapt to various views and improves the detection and tracking results, especially in occlusion cases. We can change the height parameter of the homography to create the effect of zooming in or out. The results with different height adjustments are shown in Figure 13. In our experiments, we set the height to 3, since it achieved better object detection and tracking results on both the BrnoCompSpeed and AIC18 datasets. In addition, we validated this value on 200 images downloaded from Google Street View and achieved a mean average precision (mAP) of 84.76%. Some results of the homography transformation on the BrnoCompSpeed and AIC18 datasets are shown in Figure 14.

D. SPEED ESTIMATION RESULTS ON BRNOCOMPSPEED AND AIC18
Speed estimation uses the MSE metric to measure the error of our proposed method against the ground-truth speed in BrnoCompSpeed. For AIC18, in contrast, we uploaded our results and obtained the MSE directly from the system evaluation. We evaluate two variants of our method. Proposed Method I uses homography transformation without regression to estimate vehicle speed, whereas Proposed Method II uses homography transformation with regression. The MSE results on BrnoCompSpeed and AIC18 are shown in Tables 1 and 2, respectively. The results in Table 1 show that the regression network can actually improve speed estimation; similar results are shown in Table 2 for the AIC18 dataset.

E. STATE-OF-THE-ART SPEED ESTIMATION RESULTS ON BRNOCOMPSPEED AND AIC18
We also compared our approach to state-of-the-art methods, including FullACC [50], OptScale [4], OptScaleVP2 [4], OptCalib [4], OptCalibVP2 [4], and 3DCNN [3]. Since AIC18 is a competition dataset, we compared our results to those of other participants in the competition, such as the Univ. of Washington team [12]. The comparison results on BrnoCompSpeed and AIC18 are shown in Tables 3 and 4, respectively.
As seen in Tables 3 and 4, our proposed methods outperform the other methods in terms of MSE. The two tables also show that AIC18 is more complicated than BrnoCompSpeed for estimating vehicle speed, because the MSEs on AIC18 are significantly larger than those on BrnoCompSpeed. Table 4 shows results for both Proposed Method I and II on AIC18. Similar to what was observed on the BrnoCompSpeed dataset, homography transformation alone already has benefits over the state-of-the-art speed estimation methods on the AIC18 dataset; however, it still falls short of the UW team [12], which has an MSE of 16.78. After we add regression to our network, the results improve and achieve a lower MSE than the UW team, with an MSE of 16.67, a difference of 0.11. The difference between the BrnoCompSpeed and AIC18 results, especially in the MSE scores, confirms that AIC18 is more complicated than BrnoCompSpeed for estimating vehicle speed.

F. ABLATION STUDY
The ablation study of the proposed method is shown in Tables 5, 6, and 7 in terms of the homography transformation and regression network. As shown in Table 5, we define four configurations to identify the contribution of each module in our network. Both the BrnoCompSpeed and AIC18 datasets are evaluated using MSE scores. A comparison of the results of these configurations is shown in Tables 6 and 7, which confirm that homography transformation and regression can significantly improve the performance of the network.

V. CONCLUSION
This paper addresses an important issue in vehicle speed estimation. The original videos are in monocular view, from which it is difficult to obtain an accurate vehicle speed. We integrate pretrained DeepSORT and YOLOv4 models to detect and track vehicles, and apply a homography transformation to convert monocular videos into bird's-eye-view videos, which solves the calibration and occlusion issues of monocular views. Finally, a regression network is used to further improve the speed estimation. The experimental results show that the proposed method reaches higher accuracy than the state-of-the-art methods on the BrnoCompSpeed and AIC18 datasets. Many methods beyond street-line-based measurement can estimate speed, and in future work we will explore other ideas to obtain more accurate and reliable vehicle speed results.

AVIRMED ENKHBAT received the B.S. degree in computer science and the M.S. degree in applied sciences and engineering from the National University of Mongolia, Mongolia, in 2011 and 2016, respectively. He is currently pursuing the master's degree in computer science and information engineering with the National Central University (NCU), Taoyuan City, Taiwan. His research interests include computer vision, human-computer interaction, and gesture recognition.
FITRI UTAMININGRUM was born in Surabaya, East Java, Indonesia. She received the bachelor's degree in electrical engineering from the National Institute of Technology, the master's degree in electrical engineering from Brawijaya University, Malang, Indonesia, and the Doctor of Engineering degree in computer science and electrical engineering from Kumamoto University, Japan. She is currently an Associate Professor with the Faculty of Computer Science, Brawijaya University. She is also a Coordinator of the Computer Vision Research Group and a full-time Lecturer at Brawijaya University. Her research interests include computer vision, machine learning, and image processing.