A VINS Combined With Dynamic Object Detection for Autonomous Driving Vehicles

This paper proposes a visual-inertial navigation system (VINS) combined with a dynamic object detection (DOD) algorithm to improve the localization and state estimation accuracy of autonomous driving vehicles (ADVs) in dynamic environments. Firstly, based on the YOLOv5 network, we train the proposed DOD model to detect dynamic objects in the road environment. Secondly, by removing the feature points in the dynamic object regions, we track the remaining feature points to eliminate the influence of dynamic objects. Furthermore, we model the global positioning system (GPS) measurement as a general factor and introduce its residual factor into the cost function to eliminate the cumulative error. Finally, we validate the performance of the proposed method on public datasets and in real-world experiments. The results show that the proposed method can effectively eliminate both the influence of dynamic objects and the cumulative error. It provides theoretical guidance for ADV navigation in dynamic or large-scale outdoor environments.


I. INTRODUCTION
In the past decade, ADVs have been widely applied in military and civilian fields [1]. Continuous and accurate positioning of vehicles is a primary requirement of the ADV system. As a basic auxiliary unit for ADV positioning, VINS has attracted the attention of many research institutions and scholars because of its simple hardware layout, small size, and low cost [2], [3], [4]. During VINS operation, visual observations and inertial measurement unit (IMU) measurements are tightly coupled to estimate the state accurately. Therefore, VINS can achieve high-precision positioning results in an ideal environment with stable illumination conditions and sufficient texture information [5], [6], [7].
The performance of VINS is heavily dependent on the assumption that visual features are static [8], [9]. This assumption is difficult to satisfy in actual dynamic environments. Dynamic objects such as pedestrians and vehicles in a typical urban environment are shown in Fig. 1. Since VINS can hardly distinguish between static and dynamic objects, breaking the static-world assumption may lead to significant estimation errors. Therefore, robust methods are utilized to detect and discard outliers [10], [11]. However, robust methods still struggle to work stably in a dynamic environment. In summary, a dynamic environment undoubtedly limits the application of VINS on ADVs. Furthermore, the positioning errors that accumulate over time also limit the application of VINS. The typical solution is loop detection and re-localization to mitigate the accumulated error, but such methods increase both computational complexity and memory requirements. Providing accurate, consistent localization by selectively fusing global pose information and local pose estimation is another solution [11]. GPS can be integrated into VINS to improve localization accuracy and solve the drifting problem of ADVs. In this paper, we fuse the position information of GPS in the global reference frame with the pose of VINS in the local reference frame.
Motivated by the above discussions, this paper mainly focuses on two unsolved problems: dealing with dynamic objects in the environment and eliminating the cumulative error in a large-scale outdoor environment. To improve the localization and state estimation accuracy of ADVs in a large-scale outdoor dynamic environment, this paper proposes an algorithm that combines VINS with DOD. The primary contributions of the paper are:
- Based on the YOLOv5 network, we trained the proposed DOD model to detect dynamic objects.
- By combining VINS with DOD, we removed the feature points in the dynamic object regions and tracked the remaining feature points to eliminate the influence of dynamic objects.
- To eliminate the cumulative error in a large-scale outdoor environment, we model the GPS measurement as a general factor and introduce its residual factor into the cost function.
The rest of this paper is organized as follows. Section 2 briefly reviews various VINS achievements in dynamic environments. Section 3 elucidates the architecture of our algorithm. We introduce the methodology of VINS combined with the GPS sensor and the YOLOv5 network in Section 4. Section 5 shows the qualitative and quantitative results of the proposed method. In the end, Section 6 concludes the paper.

II. RELATED WORK

A. VINS IN DYNAMIC ENVIRONMENT
The main research directions for eliminating the influence of dynamic objects can be divided into three parts: DOD based on motion tracking [12], [13], [14], semantic segmentation and removal based on deep learning [15], [16], [17], and eliminating the influence of dynamic objects using robust methods [8]. Most VINS algorithms treat dynamic feature points as outliers, so these points scarcely participate in localization and mapping.
To reduce the influence of dynamic objects on state estimation, one research direction is to manipulate features [12], [13], [14]. Reference [12] proposes a sparse motion removal model that detects the dynamic and static regions based on a Bayesian framework. Later, in [13], the authors design a new dynamic region detection method: the detection process is completed in a Bayesian framework and then preprocesses the input frame. Further, [14] proposed an optical-flow-based motion detection method to remove irrelevant information afterwards.
Another research direction is using semantic segmentation to remove dynamic feature points [15], [16], [17]. The semantic-based approaches use semantic information to segment objects and remove outliers from tracking, such as RS-SLAM [15] and Dynamic-SLAM [16]. DynaSLAM [17] pixel-wise segments the a priori dynamic objects in frames instead of extracting features on dynamic objects.

B. DYNAMIC OBJECT DETECTION
There are various object detection algorithms based on deep learning, such as the faster region-based convolutional neural network (R-CNN) [18] and Mask R-CNN [19]. The R-CNN detector is a two-stage detector: it utilizes a CNN to find potential regions that contain objects. Redmon et al. [20] proposed another series of excellent detectors, YOLO. YOLOv5 can even reach 140 frames per second because it processes the entire image at once with a single convolutional network.

C. VINS COMBINED WITH GPS
Inertial and visual measurements are suitable for obtaining an accurate local state, but large drifts will accumulate during long-term navigation. As a result, global positional information can be fused with visual and inertial measurements to achieve accurate localization.
In [11], GPS measurements are fused with VIO (visual-inertial odometry) estimates in a pose graph optimization model. Besides, Mohamed et al. [21] presented an integrated navigation system that depends on integrated sensors to enhance navigation in bad weather conditions. These systems are loosely coupled, which means that the relative pose update is estimated by the VIO algorithm independently of the GPS measurements. Then the pose graph is optimized to align with the global frame.
Unlike the previous loosely coupled works, the tightly coupled method allows exploiting the correlation among all the measurements. In [22], an adaptive fusion GNSS and VIO system was proposed to achieve accurate global positioning. Similarly to [22], reference [23] proposes a methodology to fuse GNSS information with visual-inertial measurements in a tightly coupled nonlinear optimization estimator.

III. SYSTEM OVERVIEW
The structure of VINS can be divided into four parts: measurement preprocessing, feature tracking, system initialization, and global optimization. The block diagram is shown in Fig. 2. Pictures in the block diagram are colored for better presentation to the reader.

A. FEATURES TRACKING
The Kanade-Lucas-Tomasi (KLT) [24] algorithm tracks features between two frames for the monocular camera. Fig. 3 shows the process of feature tracking.
After the DOD model obtains the bounding box of a dynamic object, the front-end removes the feature points in the object regions. Then we track the remaining feature points to eliminate the influence of the dynamic objects. It is worth noting that when the proportion of dynamic objects in the image is large, that is, when too many feature points are deleted, the algorithm will not drift thanks to the IMU and GPS sensors. The schematic diagram of processing feature points is shown in Fig. 4.
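As a minimal sketch of this rejection step (the function name `filter_dynamic_features` and the box format are illustrative assumptions, not the paper's implementation), the front-end can drop any tracked point that falls inside a detected bounding box before handing the remainder to the KLT tracker:

```python
import numpy as np

def filter_dynamic_features(points, boxes):
    """Drop feature points that fall inside any dynamic-object bounding box.

    points: (N, 2) array of pixel coordinates (x, y).
    boxes:  list of (x_min, y_min, x_max, y_max) detector outputs.
    Returns only the points lying outside every box.
    """
    keep = np.ones(len(points), dtype=bool)
    for (x0, y0, x1, y1) in boxes:
        inside = ((points[:, 0] >= x0) & (points[:, 0] <= x1) &
                  (points[:, 1] >= y0) & (points[:, 1] <= y1))
        keep &= ~inside
    return points[keep]

# Example: two features, one inside a detected vehicle's box.
pts = np.array([[50.0, 60.0], [200.0, 120.0]])
static_pts = filter_dynamic_features(pts, [(180, 100, 260, 160)])
# static_pts keeps only the point at (50, 60)
```

The surviving points would then be passed to an optical-flow tracker such as OpenCV's pyramidal KLT implementation.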

B. SYSTEM INITIALIZATION AND GLOBAL OPTIMIZATION
VINS utilizes the IMU information to form motion constraints between consecutive image frames and to provide the absolute scale of the motion. Furthermore, VINS requires exact initialization to perform accurate state estimation [25]. To bootstrap the subsequent nonlinear optimization, the initialization procedure provides initial values of the pose, gravity vector, velocity, biases, feature locations, etc.
The GPS sensor is employed to provide drift-free measurements in a global frame, ultimately eliminating the cumulative error of VINS. We model the GPS measurement as a general factor. The GPS factor and its related states constitute part of the global pose graph. The global pose graph optimization module eliminates the cumulative error and drift.

IV. METHODOLOGY

A. VINS COMBINED WITH GPS

1) PROBLEM DEFINITION
A tightly coupled, nonlinear optimization-based method is used to obtain highly accurate state estimates by fusing GPS measurements, preintegrated IMU measurements, and feature observations. The state vector X is defined as follows:

X = [x_0, x_1, ..., x_n, x_cam, x_imu, x_gps]  (1)

where each keyframe state x_i = [p_i, R_i]; p and R are the position and orientation of the body, respectively, and n is the total number of keyframes considered for optimization.

x_cam = [λ_0, λ_1, ..., λ_l]  (2)

x_cam includes the depth λ of each feature; l is the total number of features in the sliding window.

x_imu = [b_g, b_a, v]  (3)

x_imu consists of the gyroscope bias b_g, the acceleration bias b_a, and the velocity v.

x_gps = [x_0^gps, y_0^gps, z_0^gps, x_1^gps, y_1^gps, z_1^gps, ..., x_m^gps, y_m^gps, z_m^gps]  (4)

x_gps is the GPS-related variable; x, y, and z denote longitude, latitude, and altitude, respectively, and m is the number of valid GPS measurements.
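For concreteness, the four state blocks above could be grouped as a simple container (a sketch with assumed Python types and shapes; the paper does not prescribe this layout):

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class SlidingWindowState:
    # Keyframe states x_0..x_n: position p (3,) and orientation R (3, 3).
    p: list = field(default_factory=list)
    R: list = field(default_factory=list)
    # x_cam: one depth lambda per tracked feature in the window.
    depths: np.ndarray = field(default_factory=lambda: np.zeros(0))
    # x_imu: gyroscope bias, accelerometer bias, and velocity.
    b_g: np.ndarray = field(default_factory=lambda: np.zeros(3))
    b_a: np.ndarray = field(default_factory=lambda: np.zeros(3))
    v: np.ndarray = field(default_factory=lambda: np.zeros(3))
    # x_gps: (m, 3) array of [longitude, latitude, altitude] fixes.
    gps: np.ndarray = field(default_factory=lambda: np.zeros((0, 3)))
```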
The maximum a posteriori (MAP) method is used to solve the optimization problem of the pose graph. The MAP estimate is defined by:

X* = argmax_X ∏_{(k,t)∈S} p(z_k^t | X)  (5)

where x_i = [p_i^w, q_i^w]; q_i^w and p_i^w are the orientation quaternion and position vector, respectively. S is the set of measurements that come from the sensors, and the uncertainty of each measurement is modeled as a zero-mean Gaussian. The negative log-likelihood of equation (5) is as follows:

X* = argmin_X Σ_{(k,t)∈S} || z_k^t − h_k^t(X) ||²_{Ω_k^t}  (6)

where Ω_k^t represents the information matrix of the Gaussian distribution p(z_k^t | X), ||r||²_Σ = r^T Σ^{−1} r is the Mahalanobis norm, and h(·) denotes the sensor model.
Then bundle adjustment is performed; that is, the state estimation is transformed into a nonlinear least-squares problem. As a result, the entire navigation problem can be expressed as a nonlinear optimization problem.
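As a toy illustration of how the Mahalanobis-weighted cost of equation (6) becomes a least-squares solve (here with a linear sensor model h_k(x) = H_k x, for which a single Gauss-Newton step is exact; all names are illustrative):

```python
import numpy as np

def weighted_ls(H_blocks, z_blocks, omega_blocks):
    """Solve argmin_x sum_k ||z_k - H_k x||^2_{Omega_k} for a linear model.

    Each measurement k contributes H_k^T Omega_k H_k to the normal matrix
    and H_k^T Omega_k z_k to the information vector; solving the normal
    equations gives the minimizer directly.
    """
    dim = H_blocks[0].shape[1]
    A = np.zeros((dim, dim))
    b = np.zeros(dim)
    for H, z, omega in zip(H_blocks, z_blocks, omega_blocks):
        A += H.T @ omega @ H
        b += H.T @ omega @ z
    return np.linalg.solve(A, b)

# Two sensors observe the same 2D position with different confidence.
H = [np.eye(2), np.eye(2)]
z = [np.array([1.0, 2.0]), np.array([3.0, 2.0])]
omega = [np.eye(2) * 4.0, np.eye(2) * 1.0]   # first sensor is more certain
x_hat = weighted_ls(H, z, omega)
# x_hat is pulled toward the more certain measurement: [1.4, 2.0]
```

In the actual VINS problem the residuals are nonlinear, so a solver such as Ceres iterates this linearize-and-solve step.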

2) SENSOR FACTORS
The camera factor is constructed from the feature in each frame, and it reprojects the first observation into the following frames. The camera residual of the observation in image t is described as:

r_cam^t = z_t − π_c( T_bc^{−1} T_{w b_t}^{−1} T_{w b_1} T_bc π_c^{−1}(z_1, λ) )  (7)

where π_c^{−1} and π_c are the back-projection and projection functions, respectively, T denotes a 4 × 4 homogeneous transformation, and Ω_k^t represents the covariance matrix of the reprojection error. This camera factor is common to both stereo and monocular cameras. The IMU residuals come from the IMU measurement model of VINS-Fusion, which uses the IMU pre-integration [26], [27] algorithm to construct the IMU factor. The IMU residual is written as:

r_imu^t = [ R_{t−1}^T( p_t − p_{t−1} − v_{t−1}Δt + ½ g Δt² ) − α_{t−1}^t ;
            R_{t−1}^T( v_t − v_{t−1} + g Δt ) − β_{t−1}^t ;
            ( q_{t−1}^{−1} ⊗ q_t ) ⊟ γ_{t−1}^t ;
            b_{a,t} − b_{a,t−1} ;
            b_{g,t} − b_{g,t−1} ]  (8)

Within two time instants, the preintegration produces the relative rotation γ_{t−1}^t, velocity β_{t−1}^t, and position α_{t−1}^t. ⊟ denotes the minus operation on the manifold, g represents the gravity vector with a standard magnitude of approximately 9.81 m/s², b_a is the acceleration bias of the IMU, and b_g is the gyroscope bias of the IMU.
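The position and velocity rows of the IMU residual can be sketched as follows (rotation and bias rows omitted for brevity; gravity-sign conventions differ between implementations, so this is an assumption-laden illustration rather than the paper's exact code):

```python
import numpy as np

def imu_residual_pv(R_prev, p_prev, v_prev, p_cur, v_cur,
                    alpha, beta, dt, g=np.array([0.0, 0.0, -9.81])):
    """Position/velocity part of the preintegration residual.

    alpha, beta: preintegrated relative position and velocity between the
    two keyframes. g is the gravity vector expressed in the world frame
    (here pointing down, one common convention).
    """
    r_p = R_prev.T @ (p_cur - p_prev - v_prev * dt - 0.5 * g * dt**2) - alpha
    r_v = R_prev.T @ (v_cur - v_prev - g * dt) - beta
    return np.concatenate([r_p, r_v])
```

A state pair that exactly follows the preintegrated motion (e.g. free flight under gravity with zero preintegrated increments) yields a zero residual, which is how such factors are sanity-checked in practice.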
The GPS measurement does not contain cumulative error. Therefore, a GPS constraint added as a global constraint can relieve the cumulative error. The GPS residual can be expressed as:

r_gps^t = z_t^gps − p_t^w  (9)

where z_t^gps = [x_t^gps, y_t^gps, z_t^gps]^T represents the GPS measurement, and p_t^w is calculated from the state and position transformation of the previous moment.

3) GPS/VIO OPTIMIZATION
During the entire state estimation period, the VIO provides the main contribution because GPS pseudo-range measurements were significantly biased in some parts of the path with bad signal conditions. On certain road segments with good signal conditions, the constraint of GPS measurements can rectify the accumulated drift in VIO and improve the overall accuracy. The fused result of GPS/VIO has a smaller drift.
The GPS/VIO optimization module first needs to process the VIO data. The pose of the VIO is converted to the coordinate system of the GPS. The conversion expression is written as:

T_GPS = T_{GPS→VIO} T_VIO  (10)

where T_{GPS→VIO} is the extrinsic parameter between the GPS coordinate system and the VIO coordinate system. The matrix of extrinsic parameters is optimized by the Ceres solver [28]. The calculation formula is described as:

T_{GPS→VIO}* = argmin_T Σ_t || z_t^gps − T p_t^VIO ||²  (11)

The GPS preprocessing module first processes the GPS pseudo-range measurements. This module converts the longitude, latitude, and height to the world coordinate system and puts them into the global variable. Latitude and longitude are the earth's coordinates, but the earth is spherical and must be transformed into a plane coordinate system. We need to convert Longitude-Latitude-Altitude (LLA) coordinates into East-North-Up (ENU) coordinates. The module first converts the LLA coordinate system to the Earth-Centered-Earth-Fixed (ECEF) coordinate system and afterwards converts to the ENU coordinate system. For a certain point in space, the transformation from the LLA coordinate system to the ECEF coordinate system is given by:

X_ECEF = (N + z) cos y cos x
Y_ECEF = (N + z) cos y sin x  (12)
Z_ECEF = (N(1 − e²) + z) sin y

In formula (12), x is the longitude, y is the latitude, and z is the altitude. N represents the radius of curvature, N = a / √(1 − e² sin² y), and e² = (a² − b²)/a². a, b, and e are the semi-major axis, the semi-minor axis, and the first eccentricity of the reference ellipsoid, respectively.
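The LLA-to-ECEF step of equation (12) can be sketched in a few lines using the WGS-84 ellipsoid constants (function and variable names are illustrative):

```python
import numpy as np

# WGS-84 ellipsoid constants.
A = 6378137.0              # semi-major axis a [m]
B = 6356752.314245         # semi-minor axis b [m]
E2 = (A**2 - B**2) / A**2  # first eccentricity squared e^2

def lla_to_ecef(lon_deg, lat_deg, alt):
    """Convert longitude/latitude [deg] and altitude [m] to ECEF [m],
    with N = a / sqrt(1 - e^2 sin^2(lat)) as in equation (12)."""
    lon = np.radians(lon_deg)
    lat = np.radians(lat_deg)
    N = A / np.sqrt(1.0 - E2 * np.sin(lat)**2)
    x = (N + alt) * np.cos(lat) * np.cos(lon)
    y = (N + alt) * np.cos(lat) * np.sin(lon)
    z = (N * (1.0 - E2) + alt) * np.sin(lat)
    return np.array([x, y, z])

# Sanity check: at the equator/prime meridian, ECEF x equals the
# semi-major axis and y = z = 0.
origin = lla_to_ecef(0.0, 0.0, 0.0)
```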
The transformation between the ENU and ECEF coordinate systems is given by the rotation

R(x, y) = [ −sin x,          cos x,          0;
            −sin y cos x,   −sin y sin x,   cos y;
             cos y cos x,    cos y sin x,   sin y ]  (13)

so that, relative to a reference point, the ENU coordinates are obtained as

[E; N; U] = R(x, y) ( [X; Y; Z]_ECEF − [X; Y; Z]_ref )  (14)

and the inverse transformation from ENU to ECEF uses R^T. A schematic representation of the transition matrices between the coordinate systems is shown in Fig. 5 [22], [23].
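The ECEF-to-ENU rotation about the reference point can likewise be sketched as follows (names are illustrative; the ENU-to-ECEF direction uses the transpose of this rotation):

```python
import numpy as np

def ecef_to_enu(p_ecef, ref_ecef, lon_deg, lat_deg):
    """Rotate an ECEF point into the local ENU frame anchored at the
    reference point with the given geodetic longitude/latitude [deg]."""
    lon = np.radians(lon_deg)
    lat = np.radians(lat_deg)
    R = np.array([
        [-np.sin(lon),                np.cos(lon),                0.0],
        [-np.sin(lat) * np.cos(lon), -np.sin(lat) * np.sin(lon),  np.cos(lat)],
        [ np.cos(lat) * np.cos(lon),  np.cos(lat) * np.sin(lon),  np.sin(lat)],
    ])
    return R @ (np.asarray(p_ecef) - np.asarray(ref_ecef))

# At the equator/prime meridian, an ECEF +y offset is "east" and an
# ECEF +x offset (away from the earth's center) is "up".
a = 6378137.0
east = ecef_to_enu([a, 1.0, 0.0], [a, 0.0, 0.0], 0.0, 0.0)
up = ecef_to_enu([a + 2.0, 0.0, 0.0], [a, 0.0, 0.0], 0.0, 0.0)
```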
The GPS/VIO optimization schematic diagram is shown in Fig. 6.

B. VINS COMBINED WITH DOD MODEL

1) TRAINING MODEL
A custom-trained neural network model can detect specific dynamic objects more accurately and quickly. We fed the training images into a neural network to create a custom-trained detector. This paper prepares a BDD100K [29] training set to train the proposed DOD model, which is based on the YOLOv5 network. YOLOv5 is a one-stage object detection network; even at high detection speed, it maintains relatively high accuracy. Like other neural networks, the YOLO network requires a large number of labeled images to be fed into the model in order to train the network parameters. The dynamic objects detected by the DOD model include persons, cars, motorcycles, buses, and trucks. The DOD model cannot identify whether a dynamic object is stationary or moving, so even a stationary car will be rejected.
The accuracy and running speed of the DOD model have a decisive impact on the overall results, and we have done considerable work to improve the accuracy of the DOD model. In practice, adding the DOD model to VINS increases the computational load of the system, but it hardly affects the real-time performance of the system.

2) GET BOUNDING BOXES
The DOD model provides the classes and 2D bounding boxes of dynamic objects. The DOD model requires three steps to obtain the bounding box of a dynamic object. First, the entire image is segregated into equally sized grid cells, and the network runs simultaneously on all the cells combined. Secondly, the network predicts bounding boxes at each cell in the output feature map. Finally, to obtain the actual coordinate values of the predicted bounding box relative to the original image, the algorithm divides (b_x, b_y, b_w, b_h) by the feature map size and then multiplies by the size of the original picture. The coordinates of the corresponding bounding box for a given object are shown in Fig. 7. Here, each grid cell predicts the coordinates of the center (b_x, b_y), and the bounding box's width and height are (b_w, b_h). (c_x, c_y) is the upper-left coordinate of the grid cell in the feature map, and (p_w, p_h) are the width and height of the prior (anchor) box. t_x and t_y are scale factors, and σ is the sigmoid function.
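The decoding step above can be sketched with the classic YOLO box parameterization implied by the text, b_x = σ(t_x) + c_x, b_y = σ(t_y) + c_y, b_w = p_w e^{t_w}, b_h = p_h e^{t_h} (note that YOLOv5's actual implementation modifies these formulas slightly; names and grid sizes here are illustrative):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def decode_box(t_x, t_y, t_w, t_h, c_x, c_y, p_w, p_h, grid_size, img_size):
    """Decode raw network outputs to a pixel-space box center and size.

    The center is offset by the cell's upper-left corner (c_x, c_y) on the
    feature map, the size scales the prior box (p_w, p_h), and the final
    division by the grid size and multiplication by the image size map the
    result back to original-image coordinates.
    """
    b_x = (sigmoid(t_x) + c_x) / grid_size * img_size
    b_y = (sigmoid(t_y) + c_y) / grid_size * img_size
    b_w = p_w * np.exp(t_w) / grid_size * img_size
    b_h = p_h * np.exp(t_h) / grid_size * img_size
    return b_x, b_y, b_w, b_h

# With zero raw outputs, the center lands half a cell into cell (0, 0)
# and the size equals the prior box scaled to image coordinates.
box = decode_box(0.0, 0.0, 0.0, 0.0, 0, 0, 2.0, 2.0, grid_size=20, img_size=640)
```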

V. EXPERIMENTAL RESULTS
The proposed method is based on VINS-Mono [10], VINS-Fusion [11], [22], [23], and YOLOv5. We compare the proposed method with the state-of-the-art VINS-Fusion to verify the effectiveness and improvement of the method. This paper has evaluated the proposed method on the public KITTI dataset [31] and in the real world. Each configuration was run five times, and the median RMSE value is reported. The running environment and the sensors' parameters are shown in Table 1.
Horn's method [33] is utilized to align the estimated trajectory with the ground truth. This paper evaluated accuracy by the absolute trajectory error (ATE) and the absolute pose error (APE), where ATE and APE are calculated by the EVO tools [33]. We employed the root mean square error (RMSE) of the absolute trajectory for quantitative evaluations. ATE compares the absolute distance between the translation components of the estimated and ground truth trajectories. The computation of the ATE at time step i is shown in equation (16):

ATE_i = || trans(G_i) − trans(T E_i) ||  (16)

where T is the transformation aligning the two trajectories, G is the ground truth, and E represents the estimated trajectory. For a sequence of N poses, the RMSE of the ATE is given by equation (17):

RMSE = √( (1/N) Σ_{i=1}^{N} ATE_i² )  (17)

A. DOD MODEL

The model training results are shown in Fig. 8. It can be seen from Fig. 8 that after 300 iterations, the average accuracy of the DOD model is 94%, and the loss function is 0.5. The curves show that the model has high accuracy and fast detection speed. Table 2 shows the evaluation indexes of the DOD model, which shows that the model trained in this paper yields good results on all indexes. Timing statistics are shown in Table 3.
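Assuming the trajectories have already been aligned with Horn's method, the ATE and its RMSE, as in equations (16) and (17), reduce to a few lines (an illustrative sketch, not the EVO implementation):

```python
import numpy as np

def ate_rmse(gt, est):
    """RMSE of the absolute trajectory error between aligned trajectories.

    gt, est: (N, 3) arrays of translation components; the aligning
    transformation T is assumed to have been applied to `est` already.
    """
    errors = np.linalg.norm(gt - est, axis=1)   # ATE_i per time step
    return np.sqrt(np.mean(errors**2))

# Toy trajectories: the estimate drifts sideways by 0.3 m then 0.4 m.
gt  = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
est = np.array([[0.0, 0.0, 0.0], [1.0, 0.3, 0.0], [2.0, 0.4, 0.0]])
rmse = ate_rmse(gt, est)
# rmse = sqrt((0 + 0.09 + 0.16) / 3) ≈ 0.2887
```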

B. EXPERIMENT WITH KITTI DATASET
This paper uses the KITTI dataset to evaluate the proposed method. This dataset provides ground truth for sequences 00-10. Fig. 9 is a reconstructed map of sequences 00-10.
The KITTI dataset is collected from various real-world scenes, such as rural areas, highways, and urban areas. We mainly consider sequence 00 and sequences 05-07, because there are moving vehicles and pedestrians in these urban sequences. The aligned estimated trajectory, APE, and Roll-Pitch-Yaw (RPY) of sequences 00 and 05-07 are shown in Fig. 10. To better demonstrate the effectiveness of the proposed method, we only show four purely visual trajectory error curves. It can be seen from Fig. 10 that after adding the DOD model ("yolo" in the figure), the location error, rotation error, and translation error are significantly reduced. The results show that adding the DOD model to VINS improves the accuracy of the algorithm.
The RMSE of the ATE for sequences 00-10 is shown in Table 4. We can observe that the state estimation accuracy is significantly improved by our proposed method. Compared with VINS-Fusion, the proposed method using the mono+GPS or stereo+GPS combination reduces the estimation error by about 40% on average in the dynamic urban sequences (sequences 00, 05, 06, 07). In addition, the proposed method using the stereo+loop or stereo+IMU+GPS combination reduces the estimation error by about 5% on average. Clearly, our algorithm outperforms VINS-Fusion in all urban sequences. There is often no loop in actual ADV applications; only the stereo+loop combination adds loop detection, and the other three combinations omit loop detection to reduce the computational complexity.
It is worth noting that the error of stereo+imu+gps in Table 4 is larger than that of stereo+gps. It stands to reason that adding IMU sensors can improve the accuracy and robustness of the system. However, during the experiment, we found that the IMU measurement in the KITTI data source is not aligned with other measurements and some data are missing, resulting in a decrease in the accuracy of state estimation. In addition, the proposed method reduces the accuracy in some high-speed (Sequence 01) and static environments (Sequence 10). In a high-speed environment (Sequence 01), the feature points are seriously lost because the vehicle moves too fast. The feature tracking module requires a sufficient number of stable feature points to ensure high-precision pose estimation. In a static environment (Sequence 10), since the DOD model removes the feature points within the dynamic objects that remain stationary, deleting the available feature points will also cause a drop in accuracy.

C. EXPERIMENT IN REAL-WORLD
In the real-world experiment, the dataset was collected from an onboard sensor suite, which is presented in Fig. 11. We drove the car around an outdoor dynamic environment. At the same time, we collected images and synchronized IMU measurements with a MYNTEYE-S1030 camera. GPS measurements were collected using the GPS-U-Blox. To provide an error evaluation for this experiment, we set 16 control points with a real-time kinematics (RTK) GPS sensor and used the control points as ground truth. We align the estimated trajectories with the control points to obtain the absolute error. The experiment environment is presented in Fig. 12. The estimated trajectories of the real-world experiment are shown in Fig. 13(a). Fig. 13(b) shows the absolute error between the estimated trajectories and the control points. Table 5 shows the comparison of control points and statistical errors. It can be seen from Table 5 that the proposed method reduces the localization error by about 30% compared with VINS-Fusion. In addition, we evaluated the running time of the proposed method. Compared to an average of 61 ms/frame for VINS-Fusion, the increase with our system is just 15 ms, which shows that our implementation efficiently introduces the DOD model in the sliding-window front-end. Our method increases the running time by about 25% on average compared with the state-of-the-art VINS-Fusion. However, we believe that it is worthwhile to sacrifice 25% of the time cost to increase the accuracy by more than 25%. In conclusion, the proposed method can operate smoothly and reliably in large-scale or dynamic outdoor environments.

VI. CONCLUSION
Achieving accurate localization via VINS in large-scale or dynamic outdoor environments is challenging. This paper proposes an algorithm that combines VINS and DOD for ADV to improve the accuracy and robustness of navigation. The proposed method removes the influence of dynamic objects by combining the DOD model and eliminates the cumulative error by fusing the GPS sensor.
Experiments on the KITTI dataset and in the real world show that our method improves the localization accuracy by about 40% compared to the state-of-the-art VINS-Fusion in large-scale dynamic urban environments. The comparative experimental results show that our algorithm enhances accuracy and robustness in dynamic environments.
In the future, we will draw attention to designing better navigation algorithms and exploring more robust SLAM systems.