Camera and Radar Sensor Fusion for Robust Vehicle Localization via Vehicle Part Localization

Many production vehicles are now equipped with both cameras and radar in order to provide various driver-assistance systems (DAS) with position information of surrounding objects. These sensors, however, cannot provide position information accurate enough to realize highly automated driving functions and other advanced driver-assistance systems (ADAS). Sensor fusion methods were proposed to overcome these limitations, but they tend to show limited detection performance gain in terms of accuracy and robustness. In this study, we propose a camera-radar sensor fusion framework for robust vehicle localization based on vehicle part (rear corner) detection and localization. The main idea of the proposed method is to reinforce the azimuth angle accuracy of the radar information by detecting and localizing the rear corner part of the target vehicle from an image. This part-based fusion approach enables accurate vehicle localization as well as robust performance with respect to occlusions. For efficient part detection, several candidate points are generated around the initial radar point. Then, a widely adopted deep learning approach is used to detect and localize the left and right corners of target vehicles. The corner detection network outputs their reliability score based on the localization uncertainty of the center point in corner parts. Using these position reliability scores along with a particle filter, the most probable rear corner positions are estimated. Estimated positions (pixel coordinate) are translated into angular data, and the surrounding vehicle is localized with respect to the ego-vehicle by combining the angular data of the rear corner and the radar’s range data in the lateral and longitudinal direction. 
The experimental test results show that the proposed method provides significantly better localization performance in the lateral direction, with greatly reduced maximum errors (radar: 3.02m, proposed method: 0.66m) and root mean squared errors (radar: 0.57m, proposed method: 0.18m).


I. INTRODUCTION
A key enabler of recent DAS technologies is the drastically improved perception provided by vision and radar sensors. These sensors have their own strengths and weaknesses. Table 1 summarizes the strengths and weaknesses of three popular sensors in terms of performance, cost, and robustness to environmental effects. For mass-production vehicles, cost is usually the dominant factor behind the sensor choice, and thus lidar is rarely used in production cars despite its superior performance. Instead, radar and camera sensors are the two most widely used choices. These two sensors work adequately when they are applied in a well-conditioned environment for low-level automated driving systems. However, in order to develop advanced DAS such as a highly automated driving system, accurate position information is essential to predict the future motions of surrounding vehicles and to take appropriate actions while avoiding faulty ones [1], [2]. Unfortunately, radar and camera sensors do not provide sufficiently accurate position information when used individually.
The associate editor coordinating the review of this manuscript and approving it for publication was Mohammad Ayoub Khan.
The research on cameras in DAS has focused mainly on vehicle detection with a lighter computational load and greater robustness to external environmental changes [3]-[12]. A few recent studies investigated the distance estimation problem [13]-[20]. These studies estimate the distance between a vehicle with ADAS installed, called the "ego-vehicle", and surrounding vehicles through geometric relationships. With this method, a distance estimation error often occurs because the vehicle detection bounding box cannot always be fitted suitably. A detected vehicle is represented as a bounding box in an image, and the contact point is defined as the point where the bounding box meets the road plane in the image. The distance is then estimated using the geometric relationship between this contact point and the camera installed on the vehicle. However, since a camera has low resolution in the longitudinal direction, a slight position error of the bounding box causes a large variation in the longitudinal distance. Additionally, because the camera optical axis varies as the vehicle is driven, estimation errors based on geometric relationships are inevitable. Consequently, existing distance estimation methods using only a camera cannot represent the position of the vehicle precisely.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Radar also cannot always provide precise position information. As summarized in Table 1, radar shows good range accuracy in the radial direction but poor performance in the lateral direction. Some studies on improving radar accuracy in the lateral direction have been conducted [21]-[29]. However, limitations due to inherent characteristics still exist. Basically, a radar uses the measured phase difference between two or more Rx antennas to estimate the angle of arrival of the object.
The relationship between the phase and angle is nonlinear, and the sensitivity of the phase to angle degrades as the angle increases. Accordingly, when the angle is estimated using only radar, the error tends to grow as the angle increases. In addition, because the position of peak reflectivity on the object changes continuously during tracking, a further angle error arises. Also, as shown in Fig. 1, radar is still limited when classifying the vehicle's corner part, which is needed to accurately localize the vehicle in bird's-eye-view coordinates.
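The loss of angular sensitivity described above can be illustrated with the textbook two-antenna phase-comparison relation, Δφ = 2π (d/λ) sin θ. This relation is an assumption here, since the text does not state the radar model, and the half-wavelength antenna spacing and the phase error value are hypothetical. The sketch shows how a fixed phase error maps to a larger angle error as the target moves away from boresight.

```python
import math

def phase_difference(theta_rad, spacing_over_wavelength=0.5):
    """Phase difference (rad) between two Rx antennas for arrival angle theta."""
    return 2.0 * math.pi * spacing_over_wavelength * math.sin(theta_rad)

def angle_error_from_phase_error(theta_rad, phase_error_rad,
                                 spacing_over_wavelength=0.5):
    """First-order angle error (rad) caused by a small phase error.

    d(phase)/d(theta) = 2*pi*(d/lambda)*cos(theta), so the same phase error
    yields a larger angle error as |theta| grows.
    """
    sensitivity = 2.0 * math.pi * spacing_over_wavelength * math.cos(theta_rad)
    return phase_error_rad / sensitivity

if __name__ == "__main__":
    phase_noise = math.radians(5.0)  # hypothetical phase measurement error
    for deg in (0, 20, 40, 60):
        err = angle_error_from_phase_error(math.radians(deg), phase_noise)
        print(f"theta = {deg:2d} deg -> angle error ~ {math.degrees(err):.2f} deg")
```

Under this model the angle error at 60 degrees is twice the error at boresight, because the sensitivity scales with cos θ.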
In order to compensate for the limitations of radar and cameras, radar and camera sensor fusion algorithms have been studied. However, since most sensor fusion studies have focused on improving detection performance through cross-validation or on reducing the computational load, the weaknesses of each sensor are still not completely solved [30]-[37]. To achieve better localization performance, a few sensor fusion techniques have been proposed [34], [38]-[40]. In a prior study, a simple radar and camera coordinate transformation calibration method was proposed as a type of sensor fusion and has served as preliminary work for radar and camera sensor fusion studies [34]. In another study, to represent the position of the vehicle, the position information from each sensor is displayed on a grid cell in a bird's-eye view and the vehicle position is derived by superimposing this information [38]. Because the vehicle position is displayed over several adjacent grid cells, uncertainty arises. In order to reduce uncertainty in vehicle localization and maintain continuity of the localization results, a Kalman filter is used to fuse the radar and camera data [39]. This results in better performance, effectively reducing the noise in the distance estimates of each sensor in the radial direction. Moreover, other work sought to compensate for the high lateral position variance of radar using the symmetry of the rear parts of the vehicle [40]. This approach can provide a relatively accurate lateral position in the case of a vehicle moving forward. However, it is limited when the vehicle is only partially visible, as it relies on the symmetry of the rear contour of the vehicle. In a close-up case, unlike when a vehicle is moving forward, it is also difficult to apply symmetry detection due to the varying viewpoints.
Overall, existing sensor fusion methods are effective mainly as a form of cross-validation that increases detection accuracy, and they remain limited in robustly localizing a vehicle in diverse driving environments, including those with occlusion.
Therefore, in this study, we present a sensor fusion method which reinforces the azimuth angle accuracy of the radar data by localizing a vehicle's rear corner part using a camera. For vehicle part localization, the center position of the vehicle's rear corner is estimated and tracked based on a particle filter framework. Thus, through this part-based fusion approach, we can provide accurate vehicle localization as well as robust performance with respect to occlusions. In this way, the shortcomings of each sensor are compensated for and surrounding vehicles are localized accurately onto a bird's-eye view in actual driving environments.
The main contributions in this study can be summarized as follows: First, our method shows accurate localization performance by estimating a vehicle's rear corner part robustly. The vehicle's rear corner part is classified robustly via a proposed vehicle part classification model and using these classified corner parts, the most probable position of the vehicle's rear corner is estimated accurately through a particle filter.
Second, our method shows reliable results when localizing the relative positions of surrounding vehicles under diverse driving conditions. Through tracking each vehicle's rear corner part separately based on particle filter framework, our method copes well with sudden partial observation driving situations such as cut-in, cut-out driving.
The overall workflow of this study is described in Fig. 2 and consists of the following three steps.
Step 1: Preliminary candidate point generation. The radar data is translated to the image plane, and preliminary candidate points for classifying a vehicle's rear corner part are generated around the initially translated radar data.
Step 2: Vehicle rear corner part localization. Each corner part of the vehicle is classified from the candidate points, after which the most probable corner part position is estimated using the classified candidate points. Once the corner part position is estimated, the position is continuously tracked around the prior estimated position through an iterative process.
Step 3: Bird's-eye-view localization. Both estimated rear corner part positions in the image plane are combined with the range and angle data from the radar system, and the result is translated to the coordinates of the vehicle's rear corner part in a bird's-eye view.
The remainder of this paper is organized as follows. The vehicle's rear corner part localization on the image plane is described in Section II. In this section, an overview of the corner part localization method, the vehicle rear corner classification model, candidate point generation, and the particle-filter-based framework for the extraction and tracking of the vehicle rear corner part position are explained in detail. Section III explains vehicle localization in a bird's-eye view using both radar data and camera data. Section IV describes the test environment and presents experimental results on the accuracy of the surrounding vehicle localization process. Finally, the conclusion and future works are presented in Section V.

II. VEHICLE REAR CORNER PART LOCALIZATION
A. OVERVIEW
During the second step shown in Fig. 2, both rear corner part positions are estimated and tracked on the image plane in order to localize the vehicle in the lateral direction with respect to the relative coordinates of the ego-vehicle. Even when either the left or right corner part is not visible due to occlusion, the other corner part is continuously tracked to obtain a close approximation of the position of the vehicle. For part localization, the radar data must first be transformed to the image plane. Subsequently, the candidate points are generated around the initial radar point with a uniform distribution to classify both rear side parts of the vehicle, which efficiently reduces the processing time. From among the candidate points, the vehicle's rear corner parts on the left and right are identified through a classification model pre-trained on both corner parts, with the classified points having their weights assigned from their corresponding classification scores. Using the position and weight of the classified points, the most probable left and right side positions are estimated. Once the probable position is estimated, the candidate points are regenerated around the prior estimated position to track both rear corner parts of a vehicle, and each corner part is tracked through an iterative process. The surrounding vehicles are then localized onto a bird's-eye view based on the rear corner parts of the tracked vehicle. The workflow to track the most probable left and right side positions of the surrounding vehicles in the image sequences is described in Fig. 3.

B. DEEP LEARNING-BASED VEHICLE PART CLASSIFICATION MODEL
1) TRAINING DATASET
In order to classify a vehicle's rear corner part, we must establish a dataset composed of several classes. Thus, we define five classes, including other parts of the vehicle in the vicinity of the corner. Each of these classes is defined in Fig. 4. In total, 29,185 images of vehicles are included in the dataset; some of them come from the Stanford dataset, which contains 16,185 images of vehicles [41]. Using those images, we label the class of each vehicle part and train the model to classify the rear corner.

2) DEEP LEARNING MODEL FOR REAR CORNER CLASSIFICATION
A pre-trained model is used to train the vehicle rear corner part classification model. In this study, we use the VGG16 network [42] as the pre-trained model, which is simple but shows good performance in many image classification applications. Since our dataset has five classes, the output layer of VGG16 has to be replaced to fit the dataset. Therefore, we replace the classifier of the VGG16 network with a newly added classifier, as shown in Fig. 5. Finally, to confirm the performance of the trained model, classification is carried out on the test dataset. The overall result shows 98 percent classification accuracy on the test dataset. The per-class accuracy on the test dataset is shown as a confusion matrix in Table 2.
In order to use the radar data in an image, an appropriate coordinate transformation must first be performed. Thus, the direct linear transform (DLT) method is employed to translate the radar point to the corresponding point on the image plane. The radar and image coordinate systems are defined in Fig. 6, where the transformation matrix between the two coordinate systems is defined as (1).
This can be rearranged as (3) for the pixel coordinates u and v.
For each point i, (3) can be rewritten as a polynomial with respect to h, as shown in (4).
In conclusion, to obtain each element h of the homography matrix, the equation can be defined as a least-squares problem, as shown in (5).
In order to solve the least-squares problem for the homography matrix defined in (5), the distance from the radar sensor to the object (a corner reflector) was measured and the data was matched manually with the pixel coordinates of the object in an image. A test sample image for the calibration dataset is shown in Fig. 7. In the same manner as in Fig. 7, data was collected from 19 separate points, which are listed in Table 3.
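The calibration described above can be sketched as a linear least-squares fit. This is a minimal illustration, not the authors' implementation. It assumes that radar points lie on the ground plane as (x, y), that the homography is a 3x3 matrix normalized so that its last element is 1, and that NumPy is available. The function names `estimate_homography` and `radar_to_pixel` are hypothetical.

```python
import numpy as np

def estimate_homography(radar_xy, pixel_uv):
    """Least-squares estimate of H (3x3, last element fixed to 1).

    Needs at least four radar/pixel correspondences in general position.
    """
    rows, rhs = [], []
    for (x, y), (u, v) in zip(radar_xy, pixel_uv):
        # u = (h11*x + h12*y + h13) / (h31*x + h32*y + 1), likewise for v,
        # rearranged so that each pair gives two equations linear in h.
        rows.append([x, y, 1.0, 0.0, 0.0, 0.0, -u * x, -u * y])
        rhs.append(u)
        rows.append([0.0, 0.0, 0.0, x, y, 1.0, -v * x, -v * y])
        rhs.append(v)
    h, *_ = np.linalg.lstsq(np.asarray(rows), np.asarray(rhs), rcond=None)
    return np.append(h, 1.0).reshape(3, 3)

def radar_to_pixel(H, x, y):
    """Project a radar ground-plane point into the image with homography H."""
    u, v, w = H @ np.array([x, y, 1.0])
    return u / w, v / w
```

With the 19 manually matched calibration points, `radar_xy` would hold the measured radar positions and `pixel_uv` the corresponding pixel coordinates. Using more than the minimum four pairs lets the least-squares fit average out measurement noise.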

2) REAR CORNER CANDIDATE POINT GENERATION
The initial candidate points for classifying the rear corner parts of the surrounding vehicles are generated around the initial radar points. Each initial radar point is converted into pixel coordinates in an image through the homography matrix, calibrated using the process described above. The candidate points are then generated around the pixel coordinates of the radar points with a uniform distribution.
In each case, the mean of the uniform distribution is the pixel coordinate of the radar point and the size of the interval in the uniform distribution is dependent on pixel coordinate v in the image plane, as defined in (6). In accordance with this process, generated candidate points around the initial radar point are shown in Fig. 8.
Here, S(i) is a candidate point; m is the mean of the uniform distribution; σ is the interval size of the uniform distribution, which is set differently according to the v pixel coordinate; α, β and k are hyperparameters that are adjusted to the image frame size; and n is the number of samples, which is set to 30 in this study.
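The sampling step can be sketched as follows. Equation (6) is not reproduced in this text, so the `interval_size` function below is a hypothetical stand-in: it only borrows the symbol names α, β and k, and its linear form and default values are assumptions chosen so that the interval grows toward the bottom of the image, where closer vehicles appear.

```python
import random

def interval_size(v, alpha=20.0, beta=60.0, k=720.0):
    """HYPOTHETICAL interval model: grows linearly with image row v.

    Not the paper's equation (6); alpha, beta and k are placeholder values.
    """
    return alpha + beta * (v / k)

def generate_candidates(radar_uv, n=30, rng=None):
    """Draw n candidate points uniformly in a square centred on the radar pixel."""
    rng = rng or random.Random()
    u0, v0 = radar_uv
    half = interval_size(v0) / 2.0
    return [
        (rng.uniform(u0 - half, u0 + half), rng.uniform(v0 - half, v0 + half))
        for _ in range(n)
    ]
```

Passing a seeded `random.Random` instance makes the candidate set reproducible, which is convenient when comparing classification results between runs.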

D. REAR CORNER PART CLASSIFICATION FROM CANDIDATE POINTS
Once the candidate points are generated around the initial radar point, the trained vehicle rear corner part classification model is applied to a rectangular window box centered on each candidate point. Because a closer vehicle is located in the lower region of an image, the size of the window box for classification depends on the position of the candidate point. The classification score for the window box is then calculated by the classification model to find the rear left and right corner parts of the vehicle. Even if a window is initially classified as a corner part, it is regarded as a false positive if the classification score is lower than a predefined threshold value. Through the procedures described above, the left and right rear corner parts of the vehicle are classified among the candidate points. Figure 9 shows the classified rear left and right corner parts of the vehicle. Among the generated candidate points, the green 'star' represents the points classified as left corner parts and the red 'circle' represents the points classified as right corner parts.
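The filtering logic described above can be sketched as below. The trained network is replaced by a stand-in callable. The window-size rule, the threshold value and the label strings are all hypothetical, since the text does not give them.

```python
def window_size(v, min_size=24, max_size=96, image_height=720):
    """HYPOTHETICAL rule: window boxes grow toward the bottom of the image."""
    frac = min(max(v / image_height, 0.0), 1.0)
    return int(round(min_size + frac * (max_size - min_size)))

def classify_candidates(candidates, classifier, threshold=0.8):
    """Split candidates into left and right corner detections with scores.

    `classifier(u, v, size)` stands in for the trained network and must
    return (label, score); labels other than the two corners are ignored.
    """
    left, right = [], []
    for u, v in candidates:
        label, score = classifier(u, v, window_size(v))
        if score < threshold:
            continue  # low-confidence result: treated as a false positive
        if label == "left_corner":
            left.append(((u, v), score))
        elif label == "right_corner":
            right.append(((u, v), score))
    return left, right
```

Keeping the score next to each accepted point allows the scores to be reused later as weights when the corner position is estimated.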

E. TRACKING THE REAR CORNER POINT USING A PARTICLE FILTER
In order to extract the best possible position of the rear corner among the classified rear corner parts, a particle filter framework is employed. A point regarded as the left or right corner among all candidate points has its weight assigned from the classification score of the classification model. The weights are normalized and an arbitrary probability distribution is generated based on these values. Thus, the most suitable left and right corner part positions are estimated based on the generated arbitrary probability distribution. In order to keep track of more suitable positions in consecutive frames, the search points are regenerated around the estimated position of the prior frame and the best possible position is estimated through an iterative process among the customized regions. The position estimation and tracking process based on the particle filter framework for vehicle rear corner parts is described in Algorithm 1.
Algorithm 1 (fragment):
... ∼ N(P_t, σ^2)
9: w_{t+1}[k] ∼ classification score of the k-th sample
10: If w_{t+1}[k] ... to X_{t+1}
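One iteration of this estimate-and-resample loop can be sketched as follows. This is a minimal illustration under stated assumptions, not a reconstruction of Algorithm 1. The weighted mean as the position estimate, the threshold, the value of sigma and all function names are assumptions. Only the use of normalized classification scores as weights and the Gaussian regeneration around the previous estimate come from the text.

```python
import random

def estimate_position(points, scores):
    """Weighted mean of corner points; weights are the normalised scores."""
    total = sum(scores)
    weights = [s / total for s in scores]
    u = sum(w * p[0] for w, p in zip(weights, points))
    v = sum(w * p[1] for w, p in zip(weights, points))
    return (u, v)

def resample_around(position, n=30, sigma=8.0, rng=None):
    """Regenerate search points for the next frame from N(position, sigma^2)."""
    rng = rng or random.Random()
    return [
        (rng.gauss(position[0], sigma), rng.gauss(position[1], sigma))
        for _ in range(n)
    ]

def track_step(particles, score_fn, threshold=0.8, rng=None):
    """Score particles, keep confident ones, estimate, then resample.

    Returns (estimate, next_particles). The estimate is None when no particle
    passes the threshold, which signals that the tracker should restart from
    the next radar point.
    """
    scored = [(p, score_fn(p)) for p in particles]
    kept = [(p, s) for p, s in scored if s >= threshold]
    if not kept:
        return None, []
    estimate = estimate_position([p for p, _ in kept], [s for _, s in kept])
    return estimate, resample_around(estimate, n=len(particles), rng=rng)
```

Running one tracker of this kind per corner, left and right, matches the idea of tracking each corner part separately, so that a single visible corner is enough to keep an estimate.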
Through Algorithm 1, the positions [(U_L, V_L), (U_R, V_R)] of both vehicle rear corner parts can be extracted and tracked, as shown in Fig. 10. By tracking the position of each corner part of a vehicle in this way, it becomes possible to estimate the position of the vehicle even when only a part of the vehicle is visible due to the camera's field of view or occlusion. Moreover, by rapidly reducing the search regions to the estimated position, the computing load can be greatly reduced. If no region is detected as the left or right corner part in the initial candidate group, the tracker is reinitialized and the search is repeated in the candidate regions generated from the next radar data input.
Figure 11 demonstrates the limitation of the commercial radar output data. Commercial radar transmits the output data in the form of points, including radial range and angle information; however, it does not include information about where the point is on the vehicle, as shown in Fig. 11. Therefore, in this study we assume that the point lies between the left corner and the right corner of the vehicle rear part. In this case, even if the radar angle shows a large deviation, the vehicle position in the longitudinal direction, R cos(θ), is kept nearly constant. Here, R is the relative distance between the ego-vehicle and a surrounding vehicle in the radial direction, and θ is the relative angle between the ego-vehicle and the surrounding vehicle with reference to the center front of the ego-vehicle.
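The robustness of the longitudinal term to angle deviations can be checked numerically. The sketch below converts a radar point to longitudinal and lateral coordinates and measures how much each changes under an azimuth error. The range, angle and error values used in the example are hypothetical.

```python
import math

def radar_to_cartesian(range_m, angle_deg):
    """Convert (range, azimuth from the ego centre line) to (longitudinal, lateral)."""
    theta = math.radians(angle_deg)
    return range_m * math.cos(theta), range_m * math.sin(theta)

def deviation_effect(range_m, angle_deg, angle_error_deg):
    """Absolute change in (longitudinal, lateral) caused by an azimuth error."""
    lon0, lat0 = radar_to_cartesian(range_m, angle_deg)
    lon1, lat1 = radar_to_cartesian(range_m, angle_deg + angle_error_deg)
    return abs(lon1 - lon0), abs(lat1 - lat0)

if __name__ == "__main__":
    # Hypothetical target: 30 m away, 2 degrees off axis, 2 degrees of angle error.
    d_lon, d_lat = deviation_effect(30.0, 2.0, 2.0)
    print(f"longitudinal change: {d_lon:.3f} m, lateral change: {d_lat:.3f} m")
```

For these example values the lateral position shifts by roughly 1 m while the longitudinal position moves by only a few centimeters, which is why the radar is trusted for the longitudinal direction and the camera is used to correct the lateral direction.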

B. LOCALIZATION USING RADAR AND CAMERA DATA
The vehicle position with respect to the ego-vehicle is calculated based on both the radar output data and the spatial resolution of the camera. The vehicle position in the longitudinal direction is calculated from the radar output data directly, and is combined with the angular data of the rear corner part to calculate the vehicle's lateral position.
The pixel coordinates from the estimation of the vehicle rear corner part position on the image plane are translated into angular data with respect to the optical axis, as shown in Fig. 11(a). Because the radar and camera are aligned in the same direction, the longitudinal position of the vehicle with respect to the camera can be calculated simply by adding the difference in the length (T) between the radar and the camera, as shown in Fig. 11. Accordingly, the vehicle position in the lateral direction is calculated as shown in Fig. 11(b).
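The combination of radar range and camera angle can be sketched as follows. This is an illustration under several assumptions that the text does not confirm: a pinhole camera with hypothetical intrinsics `cx` and `fx`, a camera mounted a distance T behind the radar on the same longitudinal axis with no lateral offset, and lateral positions taken as positive to the right.

```python
import math

def pixel_to_azimuth(u, cx=640.0, fx=1000.0):
    """Azimuth (rad) of image column u relative to the optical axis.

    Pinhole model; cx and fx are HYPOTHETICAL camera intrinsics.
    """
    return math.atan2(u - cx, fx)

def localize_corner(radar_range, radar_angle_rad, corner_u,
                    radar_camera_offset=1.5):
    """Bird's-eye-view position of one rear corner, in the radar frame.

    The longitudinal distance comes from the radar alone, R * cos(theta).
    The lateral distance uses the camera azimuth of the corner, evaluated at
    the camera-referenced longitudinal distance (radar value plus offset T).
    """
    longitudinal = radar_range * math.cos(radar_angle_rad)
    camera_longitudinal = longitudinal + radar_camera_offset
    lateral = camera_longitudinal * math.tan(pixel_to_azimuth(corner_u))
    return longitudinal, lateral
```

Calling `localize_corner` once with the tracked left corner column and once with the right corner column gives both rear corners. When only one corner is visible, the other can be placed using the known vehicle width.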

IV. PERFORMANCE EVALUATION
A. TEST SCENARIO
In this study, the performance of the proposed method is evaluated in terms of the accuracy of the relative position between the ego-vehicle and the target vehicle. The radar system, which is the main sensor in a typical ADAS, performs poorly, especially regarding the 2D reconstruction of the edges of the surrounding vehicles. In order to solve this problem, we developed a relative position estimation method for the surrounding vehicles using sensor fusion based on mono-vision and radar. To evaluate the proposed method, datasets were generated through test driving. In the test datasets, surrounding vehicles were positioned under diverse conditions, such as in the left lane, middle lane, and right lane, to sufficiently reflect the relative locations of surrounding vehicles in an actual driving environment. An occlusion case is also included, which consists of observations of partial occlusions due to the camera's field of view and other vehicles.

B. TEST ENVIRONMENT CONFIGURATION
In order to obtain the reference data (ground truth) for a performance evaluation of the proposed method, three vehicles (one ego-vehicle and two target vehicles) with RT3002 and RT-Range systems (Oxford Technical Solutions) installed were set up, as shown in Fig. 12. The data from the RT equipment was calibrated based on the center of the front bumper of the ego-vehicle, which is the origin of the radar coordinate system, as described in Fig. 13. The accuracy of the RT equipment regarding the relative position of the target vehicle is 0.03m in both the longitudinal and lateral directions. Through the RT3002 and RT-Range systems, the relative position data between the ego-vehicle and the target vehicles can be measured as shown in Fig. 13. In order to evaluate the proposed method against the reference data, the ego-vehicle was equipped with a radar system (Delphi ESR), a camera and the RT equipment, as shown in Fig. 14. The data interface of each sensor and the sensor data processing unit are described in Fig. 15.
For data synchronization between heterogeneous devices with different data update and output times, the default data logging rate is set to 20Hz (a 50ms period), which corresponds to the minimum update interval of the radar. Because the RT signal is received via CAN communication, it is matched to the radar signal through the receiving period. For the vision data, the time reference is transmitted to the image processing board through the CAN signal every 50ms, which is the logging period of the radar data, and data synchronization is performed.
[...] depending on their position relative to the ego-vehicle. In each test case, bird's-eye view localization outcomes using both the proposed method and the radar tracking data are compared with the ground truth. Moreover, each lateral position error with respect to the ground truth is shown through bird's-eye view localization. When localizing surrounding vehicles onto the bird's-eye view, it is assumed that the radar tracking data is centered on the rear part of the vehicle and that the width of the vehicle is known.

1) MIDDLE LANE, RIGHT LANE (NORMAL DRIVING SITUATION)
In a normal driving situation without vehicle occlusion, it is confirmed that the proposed method tracks both corner parts of the vehicle, as shown in Fig. 14(a). Bird's-eye view vehicle localization based on this rear corner part tracking significantly reduces lateral position errors compared to the tracking data of the radar system. In Fig. 15, the radar tracking data (track 2) shows a maximum error of 1.5m, which could lead to a serious malfunction in an ADAS.

2) RIGHT LANE (OCCLUSION DUE TO THE CAMERA FIELD OF VIEW)
An adjacent vehicle in the right lane or left lane must be localized more precisely for an ADAS to make the correct decision. However, as shown in Fig. 17, the radar data for an adjacent vehicle (track 2) shows a large variation of about −1.5m to 1m in the lateral direction. This large variation makes the ego-vehicle unable to distinguish which lane the adjacent vehicle is in. On the other hand, the proposed method shows a small variation for the adjacent vehicle (track 2) of about −0.2m to 0.1m in the lateral direction. It works well for either left or right corner part tracking even when the vehicle is only partially visible due to the camera's field of view, as shown in Fig. 16(a). In this case, the left corner part is tracked, as represented by the green 'star' in Fig. 16(a). Using this corner part localization, bird's-eye view vehicle localization is carried out. Because the left and right corner parts are classified during the tracking process, surrounding vehicles can be localized precisely onto the bird's-eye view via the localization of a single corner part.

3) LEFT LANE (OCCLUSION DUE TO OTHER VEHICLE)
Similarly, for the surrounding vehicles in the left lane, as described in the previous case, the right corner part is classified and tracked for the occluded vehicle (track 1), as represented by the red 'circle' in Fig. 18(a), even when the vehicle is partially visible due to other vehicles. The occluded vehicle (track 1) is localized precisely onto the bird's-eye view via the localization of this one corner part. In Fig. 19, the proposed method shows an estimation distance error for track 1 of approximately −0.15m to 0.1m in the lateral direction. However, the radar tracking data shows a large variation for track 1 of about −1.25m to 1.4m in the lateral direction.

4) RESULTS OF AN ANALYSIS ACCORDING TO THE RELATIVE POSITIONS OF SURROUNDING VEHICLES
In order to analyze the position estimation results according to the relative position of the surrounding vehicles, the test cases are summarized by lane, and the lateral position accuracy for each lane is shown in Fig. 20 and Table 4. As indicated by the test results, the radar data, with a maximum lateral position error of 2.3 to 3.02m, may cause a crucial fault because it cannot identify the lane of the surrounding vehicle. Also, as shown by the total root mean square error (RMSE) of 0.5773m, the radar data is difficult to apply to the lateral control system in the ADAS due to the continual position fluctuations of the surrounding vehicles. The maximum error and total RMSE of the proposed method are 0.66m and 0.1831m, respectively. It is confirmed that the lanes of the surrounding vehicles can be accurately discriminated regardless of the relative positions of the surrounding vehicles. In addition, because the rear left and right corners of the vehicle are classified and separately extracted through the proposed method, it is possible to estimate the positions regardless of partial occlusion when driving.
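The two accuracy measures reported above, maximum error and RMSE, can be computed from time-aligned lateral position samples as in the following sketch. The function name and the list-based interface are hypothetical.

```python
import math

def lateral_error_stats(estimated, ground_truth):
    """Return (max absolute error, RMSE) of lateral positions, in the input unit."""
    if len(estimated) != len(ground_truth) or not estimated:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [e - g for e, g in zip(estimated, ground_truth)]
    max_abs = max(abs(e) for e in errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return max_abs, rmse
```

Applying the same function to the radar tracking data and to the fused estimates, each compared against the RT reference, gives directly comparable figures for the two methods.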

V. CONCLUSION AND FUTURE WORKS
The relative position estimation of surrounding vehicles around an ego-vehicle is important for the safety of ADAS and automated vehicles. In order to develop a highly automated driving system, rather than a semi-automated driving system, accurate localization of surrounding vehicles is essential. Accurate relative position estimates enable motion prediction of the surrounding vehicles, which helps an automated vehicle prevent accidents due to unexpected situations. However, a sensor system using radar and camera has limitations when accurately localizing surrounding vehicles onto a bird's-eye view in an actual driving environment. In that sense, the proposed method shows accurate and reliable localization performance under diverse driving test conditions that reflect an actual driving environment, rather than on a processed virtual dataset derived from a conventional dataset. The outcomes here can also be extended to problems concerning motion prediction of surrounding vehicles and the coordinated signal control system that uses it [44]. However, additional research on relative position estimation of surrounding vehicles should be carried out to make the estimates more robust. Issues related to sensor fusion should also be assessed with regard to errors in the radar range data. If there is an error in the radar range data, the localization performance in both the longitudinal and lateral directions will be degraded. Therefore, additional research is needed to cope with this error.