Robot Localization and Navigation using Visible Light Positioning and SLAM Fusion

— Visible light positioning (VLP) is a promising technology since it can provide high accuracy indoor localization based on the existing lighting infrastructure. However, existing approaches often require dense LED distributions and persistent line-of-sight (LOS) between transmitter and receiver. What ’s more, sensors are imperfect, and their measurements are prone to errors. Through multi sensors fusion, we can compensate the deficiencies of stand-alone sensors and provide more reliable pose estimations. In this work, we propose a loosely-coupled multi-sensor fusion method based on VLP and Simultaneous Localization and Mapping (SLAM), using light detection and ranging (LiDAR), odometry, and rolling shutter camera. Our multi-sensor localizer can provide accurate and robust robot localization and navigation in LED shortage/outage situations. The experimental results show that our proposed scheme can provide an average accuracy of 2.5 cm with around 42 ms average positioning latency.


I. INTRODUCTION
Precise localization is a prerequisite for many autonomous systems, such as robotics, unmanned aerial vehicles, etc. Also, indoor positioning is an especially challenging problem, where localization cannot be achieved by GPS due to the satellite signal being greatly attenuated, while the traditional radio-based indoor positioning technologies, such as Bluetooth, Wi-Fi, Radiofrequency Identification (RFID) and ultra-wideband (UWB), still have some disadvantages in terms of low accuracy, high latency, electromagnetic interference or high hardware cost [1].
In recent years, the localization technology based on visible light communication (VLC), which is referred as visible light positioning (VLP), has attracted intensive attention as it can provide high accuracy positioning based on the existing lighting infrastructure. Look up in almost any building and you will be able to see multiple light fittings, demonstrating that at most indoor locations [2]. The modulated LED broadcasts its unique identity (ID) via switching at a high frequency imperceptible to the human eye, but which can be recognized by photodiodes (PD) [3]- [5] and rolling shutter effect (RSE) cameras [6]- [8]. The LED-ID can be mapped/modulated once for all since they are normally fixed and not easily vulnerable to environmental changes. Hence, the "Last Mile Problem" of localization is solved via VLP and the pre-built LED landmark map. Ref. [9]-1 https://en.wikipedia.org/wiki/Robot_navigation [11] design the PD-based indoor positioning system using machine learning to provide around 10 cm positioning accuracy. Besides, some state-of-the-art (STOA) camera-based VLP systems can offer centimeter-level accuracy on commodity smartphones [12], [13] or mobile robots [6], [14]. Despite the promising performance of existing VLP systems, there remain some practical challenges.

A. Motivation
One of the most urgent issues arises from the fact that VLP normally requires multiple LED observations at a time for successful positioning through trilateration or triangulation. This is because the normal LED lights offer less usable point features due to the lack of distinguishable appearance, e.g., one feature for a circular LED [15]. However, the number of decodable LEDs in a camera view is limited by a couple of practical factors, such as deployment density and geometry layout of LEDs, obstruction of the LOS views, limited field-of-view (FOV) of the camera, etc. As such, the shortage or outage of LEDs can severely deteriorate the performance of the camera-only method in reality. To address this problem, different VLP-aided inertial localization methods have been proposed [6], [15]- [17] to provide pose estimation by combining VLP and the inertial measurement unit (IMU). In [17], An Extended Kalman Filter (EKF)-based loosely-coupled VLP-inertial fusion method is proposed. The presented experiments demonstrate the functionality that the IMU helps to overcome the moments with lack of VLP coverage. However, IMU suffers drift over time (cumulative error) or measurement noises (biases). Therefore, these VLP-aided inertial localization methods suffer from errors after long-term running. To address this problem, we are motivated to adopt LiDAR sensor to compensate the accumulated error of the inertial sensors, so that the VLP-aided inertial & LiDAR method can achieve long-term accurate and robust localization.
In addition, sensors are imperfect, and their measurements are prone to errors. All of VLP, odometry, and LiDAR also have outliers. Through multi sensors fusion, we can compensate the deficiencies of stand-alone sensors and provide more reliable and robust estimations. Furthermore, one of the most basic yet important functions of intelligent mobile robots are the ability to navigate autonomously [18]. Navigation 1 denotes the robot's ability to determine its own position in its frame of reference and then to plan a path towards some goal location. However, to the best of our knowledge, most of the works in the field of VLP only can provide discrete position (localization) and lack the ability to navigate and tracking for moving targets, which is caused by the limited VLP coverage range, positioning latency, and the lack of the obstacle detection/avoidance function. The priori pre-built LED landmark map doesn't include any obstacle information around the environment. Based on this point, we are motivated to adopt LiDAR-SLAM for occupancy grid map building and obstacle detection, adding the tracking and navigation capability into the VLP system. Through multi-sensor fusion, we can estimate quantities that are unobservable using single sensors (e.g., VLP cannot detect the obstacle information; LiDAR cannot observe the global position; odometry has cumulative error) so that we can improve robustness, and handle the sensor failures.

B. Contribution
In this work, we propose a loosely-coupled multi-sensor fusion method for VLP and LiDAR-SLAM. Integrating multisensor, including RSE-camera, LiDAR, and odometry, enables a robot to maximize its perceptual awareness of the environment and obtain sufficient measurements to make it promising for robust indoor localization and navigation. More specifically, the LiDAR is employed to detect the surroundings for obstacle avoidance during the navigation, and compensate the cumulative error from the odometry. While the VLP is adopted to provide high accurate pose initialization and correction for LiDAR-SLAM and odometry. Our multi-sensor localizer includes mapfusion and localization-fusion, which firstly builds a map of the previously unknown environment, and then, provide accurate and reliable pose estimation through multi-sensor fusion. In order to realize map-fusion, we develop a VLP-constrained mapping scheme based on LiDAR-SLAM (Gmapping 2 ) to construct the occupancy grid map based on the pre-built LED landmark map. While for the localization-fusion, the looselycoupled EKF using VLP, odometry and LiDAR can provide high accuracy indoor localization based on the previously built VLPconstrained occupancy grid map, which is the premise of reliable navigation.
We highlight the contributions as follows: • Map-fusion: VLP-constrained Gmapping algorithm is proposed to align the occupancy grid map (SLAM-map) with the pre-built LED landmark map, which is the premise of the fusion between VLP and LiDAR-SLAM can be integrated. Meanwhile, the localization result can be simultaneously visualized and aligned in both SLAM-map and floorplan. • Localization-fusion: Loosely-coupled multi-sensor fusion based on the EKF is proposed to achieve high accuracy localization using RSE-camera, LiDAR and odometry. Through the state prediction from odometry motion and the measurement update from LiDAR scanner and VLP, we can relax the requirement on the minimum number of concurrently observable LEDs for positioning to zero. The LiDAR can compensate the cumulative error from the odometry, the odometry contributes to smooth pose estimation, while the VLP measurement can provide high accuracy pose initialization or pose calibration when it is available.
• We embedded the navigation function into our multi-sensor fusion for VLP-SLAM. Owing to the high accuracy performance of our multi-sensor fusion scheme, the robot can provide autonomous navigation. The efficacy of the scheme is verified in real-world experiments.

C. Organization
The remainder of this paper is organized as follows. Section II introduces the related works. Section III explains the methodology of our multi-sensor localizer based on VLP-SLAM, including the VLP calculation based on single LED and odometry, VLP-constrained Gmapping, and the loosely-coupled multi-sensor fusion based on EKF. Section IV presents the experimental evaluation and Section V is the conclusion.

A. RSE Camera-based VLP
LEDs can transmit data over the air by directly modulating the light intensity at a high frequency that is invisible to human eyes but perceivable by camera or PD. Generally, PD is used as the receiver to capture the signal for the VLC system. There are many works use the PD and solar cells [19]- [21] as the receiver of VLP. PD is not an ideal VLP device, since it is sensitive to the light intensity variation and the diffuse reflection of the light signal, which degrades the positioning accuracy [22]. In contrast, camera-based VLP is favored in both commerce and industry due to the high positioning accuracy achievable by imaging geometry, and the good compatibility with user devices, such as mobile robots and smartphones. The pixels on RSE-camera are exposed and read out line by line instead of perceiving light for the whole image at a single moment. Then, the temporallyvarying intensity signals from the LED transmitter are mapped to spatially-varying bright or dark strip patterns on the captured image. For comprehensive understanding of optical camera communication (OCC) and LED-ID decoding for RSE-camerabased VLP, we refer readers to [23]- [28].
Camera-based VLP systems employ modulated LED luminaries mounted at known locations (e.g., on the ceiling) as artificial landmarks, associate each LED measurement with the ID coordinate through RSE-based OCC, and provide information about the camera's relative position and rotation with respect to each detected LED by the LOS and weak perspective projection property. Decimeter/centimeter-level accuracy has been reported in some RSE-camera based VLP systems [7], [8], [13], [29]- [32]. Yet, all these methods treat LED lamps as point sources without geometric information. Thus, multiple known LED anchors are required at a time for trilateration or triangulation calculation. Such approaches rely on dense deployment of LED beacons. To reduce the number of required LEDs in VLP, Ref. [33] no longer treats the captured LED images as a point, but as an image whose geometric features are exploited to determine the receiver's orientation and location. However, this method requires an additional marker to be placed on the LED. Another straightforward method is to employ angular sensors to measure the receiver's orientation information, thus "compensating" the missing information due to the reduced number of LED lamps. Ref. [34] proposed single LED VLP using image and attitude information from mobile phone to achieve an average accuracy of 5.5 cm. Ref. [22] proposed a hybrid positioning algorithm consisting of VLP and pedestrian dead reckoning (PDR) to achieve decimeter-level accuracy. However, the IMU is a local frame with arbitrary yaw at the start of each run, it is not identical to global information (absolute measurements) which can be directly processed with the VLP, it needs pose initialization. Ref. [15], [16] efficiently relax the assumption on the minimum number of simultaneously observable LEDs from three to one through a tightly-coupled VLP and IMU localization system. However, they still built up the experimental platform in the area of 5m×4m×2.3m with 23 LEDs, since their system required more than 4 LEDs for initialization. In our previous work [17], we proposed a looselycoupled VLP-inertial fusion scheme, which can relax the requirement of the observable LEDs to zero. However, the robot has to head along the x-axis of the global frame before running so that it can obtain the initial orientation. Additionally, all of the proprioceptive sensors (IMU, odometry) are subject to cumulative error. Therefore, another exteroceptive sensors acquiring information from the robot's environment, are needed to handle the LED shortage/outage problem (when VLP is unavailable) and deal with the cumulative errors from angle sensors.

B. Simultaneous Localization and Mapping
SLAM is useful for building and updating maps within unknown environments, while the robot maintains the information about its location. It consists of the concurrent construction of a model of the environment (the map) and the estimation of the pose of the robot moving within it [35]. SLAM can be divided into two categories according to the sensors: vision-based and LiDAR-based. Compared to other distance measurement devices, the LiDAR sensor makes measurements around itself with a wide scanning angle and high angular resolution. Besides, it is invariant to illumination. The high reliability and precision make LiDAR sensor a popular option for pose estimation [36]. Cartographer SLAM [37], tinySLAM [38], FastSLAM [39], and Hector SLAM [40] have been widely applied for LiDAR range-finder-based mapping and localization. Gmapping SLAM [41] is the most widely used SLAM method in robotics, it uses the Rao-Blackwellized (RB) particle filter to process the environment perceived by the robot, and works with the wheel odometry to calculate the robot's position and construct the grid map. It uses a relatively small number of particles to represent the SLAM posterior, and reduces the computational effort required to perform resampling to successfully build very accurate maps. This approach has achieved remarkable success and indeed is able to reduce the memory consumption while producing highly accurate grid maps. Therefore, Gmapping is more suitable for low-cost mobile platforms where computational resource is limited. For comprehensive understanding of Gmapping and the RB particle filter, readers are strongly advised to the works [41], [42].
Despite its advantage, the LiDAR sensor is not perfect and has several shortcomings. First of all, the initial pose for LiDAR-SLAM is an unsolved problem, since SLAM commonly suffers 3 https://en.wikipedia.org/wiki/Kidnapped_robot_problem from data sparsity or limited vertical FOV in real-world applications. How to bootstrap the localization without a relatively accurate manually initial guess has not been well considered [43]. Secondly, the features obtained from the LiDAR are limited, which is caused by the low vertical resolution and the sparse point cloud of it. In addition, although most LiDAR offer a horizontal FOV of 360 degrees, it receives only few usable points from the side walls in a narrow corridor environment. In other words, it is difficult to estimate the pose in environments such as large spaces or long corridors without too much variety of observations, since the depth information obtained from the LiDAR rangefinder does not change with time and is considered featureless. Therefore, the robot will easily get lost when the geometry of the environment is quite simple, such as a public space surrounded by a few long walls or circular walls [44]. Finally, the rangefinder of the LiDAR is limited, it only receives the local information around the robot, and still cannot perceive the global information of the whole indoor or map environment.
To tackle these problems mentioned above, we are motivated to adopt VLP to provide the pose initialization or calibration for LiDAR-SLAM. The VLP can be used to make the robot capable of sensing a wide range of environments through the global information provided by LED beacons. While the observation of external landmarks (LED) is also useful to provide high accuracy pose initialization for LiDAR-SLAM. We regard the localization problem as a registration problem between local perceptual information from the LiDAR and global coordinate information from the VLP. It can handle the kidnapped problems 3 via the VLP observation to successfully deal with the uncertainty of the initial guess for LiDAR-SLAM. Thus, combining with VLP and SLAM, we can achieve more refined, accurate, robust, and global pose estimation, which meets the intelligent robot application demand.

III. METHODOLOGY
The proposed multi-sensor fusion for VLP-SLAM aims at enabling sufficient environment perception (mapping), robust and accurate localization, and autonomous navigation. In this section, we firstly introduce the single LED VLP with odometry (SLO-VLP). Then, the loosely-coupled multi-sensor fusion for VLP-SLAM is introduced to show how to integrate two different exteroceptive sensors (camera and LiDAR) information.

A. Visible Light Positioning Calculation
The time-varying VLP signals from LEDs are perceived by the RSE-camera as spatially-varying strip patterns. Through the VLP observation, the pose of the robot can be estimated in time.
To do so, we need to find its centroid imaging location (also called as region-of-interest, ROI), recognize its identity (LED-ID), and retrieve LED's 3D position from the registered lights databases (pre-built LED landmark map). After that we can obtain the global 3D position and azimuth of the robot through SLO-VLP calculation.

a) LED-ROI Detection
The RSE-camera can capture barcode-like strip patterns from the high-frequency flashing LED during underexposure. Natural features are not observable, while bright objects (e.g., LED beacons, background light) can be easily distinguished, as shown in Fig.1 (b). We are interested in the image regions that carrying VLC information, termed as LED-ROI [45]. We firstly binarize the grayscale image by thresholding, and then dilate the binary image to fill the strip gaps. After that, the match template method is utilized to locate the LED-ROI in the captured image for subsequent LED-ID decoding. Since only the contour feature information is needed for ROI detection, we do not take every pixel into the computation but downsize the captured images with a constant scale, e.g., 6. To avoid the errors caused by down sampling, after the boundary of the LED-ROI is found, all the pixels nearby the obtained LED-ROI are checked again to detect the precise boundary. Since the number of pixels in the downscaled image and the double-check pixels are much less than that in the original image, the proposed way to obtain the LED-ROI is efficient. The effectiveness of the LED-ROI detection scheme has been well evaluated in our previous work [46].

) VLC-ID Coding and Decoding
For LED-ID coding, we develop a VLC modulator for retrofitting LED lamp to transmit the ID information to the RSEbased camera receiver, as can been seen in Fig.1 (a). The VLC modulator can be embedded into commercial LEDs. Each LED lamp is assigned with a unique ID, which is associated with its location and stored in an ID-location database (pre-built LED landmark map). The designed ID data packet begins with a 6-bit preamble (header, 011110), proceeded by 8-bit and 8 kHZfrequency OOK-coded data payload data. The payload carries one byte of ID, labeling up to 256 LEDs. The channel capacity is adequate for our experiment implementation and can be extended by a larger packet length. The VLC information is encoded by strips of varying widths. As can be seen in Fig. 1(c), the grayscale pixels in the centering column of the LED-ROI are selected for OOK demodulation. The thresholding method based on local extremum and sliding window [14] is adopted to counter the nonuniform illumination of the LEDs. The decoding rate of our LED-ID decoding method can maintain more than 95% at a height of 2.7 m, which had been evaluated in our previous work [17].

c) Single LED VLP with Odometry (SLO-VLP)
The pinhole camera model is shown in Fig.2. For VLP calculation, the position of the LED in the 3-D world coordinate = [ , , ] ( = 1 , 2 , … ; is associated with the VLP-ID) can be obtained through LED-ID decoding. While the LEDs mapped onto an image point = [ , ] can be measured through the LED-ROI detection. For any LEDs in the world coordinate, we can model the perspective projection model between the world coordinate and the image plane coordinate as: where can be represented as the depth between the LED and the camera. (focal length), / (physical size of the pixel), and 0 / 0 (the center pixel of the image) are formed together as Intrinsic Matrix (denoted as ). represents the translation vector, which is equivalent to the receiver's position in the 3-D world coordinate system. is the rotation matrix from the world coordinate system to the camera coordinate system: (2) where , and is the angle along the X, Y, and Z axis, and refer to (), refer to (). The three rotation angles can be estimated from the proprioceptive sensor (such as IMU or odometry) attached to the robot. Since the robot is moving horizontally on the floor, only the azimuth angle needs to be considered, while the other two rotation angles can be ignored as constant matrixes. where is equal as the vertical distance from center of the camera to the ceiling. The LED image is no longer treated as a point, but as an image whose geometric features are exploited to determine . Then vertical distance between the LED and the lens plane can be expressed as: where is the physical diameter of the LED, is the pixel distance of the LED-ROI, and is the conversion of the pixel distance and physical distance. Substituting (4) and (5) where, , , and , , are the rotation and translation coefficient from the center of the camera to the base_link 5 of the robot. However, the odometry suffers drift over time, in which case the assumption of perfectly-known yaw information will gradually be problematic. While observing a circular LED without any asymmetry appearance cannot give orientation information. Therefore, this may have problems when long-term drifts occur. In this paper, we utilize a LiDAR scan matcher component that corrects the robot's orientation estimated with a raw odometry by matching LiDAR scan data with previous scans, then the cumulative error of the odometry can be compensated: is the raw orientation angle from the odometry, while → is the included angle between the output TF of the LiDAR-SLAM and the map_frame 6 .

B. Multi-Sensor Fusion for VLP and SLAM
The proposed multi-sensor fusion for VLP-SLAM includes map-fusion and localization-fusion. For the map-fusion, the VLP-constrained Gmapping is proposed to construct the occupancy grid map 7 based on the pre-built LED landmark map. While for the localization-fusion, an EKF-based pose integration scheme is designed to stably obtain the global poses of the robot through odometry, LiDAR-SLAM and SLO-VLP localization.

a) VLP-constrained Gmapping for map-fusion
Mapping is the fundamental prerequisites for localization and navigation. Both LiDAR-SLAM and VLP are map-based localization method, the localization outputs of them are corresponding to the origin of their respective map_frame (VLP-4 https://wiki.ros.org/tf 5 http://wiki.ros.org/navigation/Tutorials/RobotSetup/TF 6 https://www.ros.org/reps/rep-0105.html#map map and SLAM-map). As for the VLP-map, it is a pre-build LED landmark feature map, which manually indicate the localization of the LED and store in the registered lights databases through LED-ID coding. While the SLAM-map is an occupancy grid map (binary map) of navigable regions and boundary walls, which are represented through white and black (boundaries and obstacles), respectively, for visualization purposes. The occupancy grid map is created by obtaining the information from odometry and LiDAR scanning.
To some extended, the SLAM-map belongs to the local map, which is based on the starting point of the SLAM Gmapping process (as the origin). Since the position of the obstacles in the grid map, which is obtained from the LiDAR scanner, are related to the odometry during the Gmapping process, while the start position and direction of the robot is viewed as the origin and xaxis in the odom_frame 8 . On the other hand, the VLP-map is global, which is consist with the floorplan/blueprint of the building. The origin of the VLP-map is artificially defined, which is related to the installation coordinate of the LED luminaries. If the origin of the VLP-map and SLAM-map are not aligned, the integration between VLP and SLAM would be mismatched.
To overcome this problem, a VLP-constrained Gmapping method is proposed to ensure the alignment between the SLAMmap and VLP-map. Since the obstacle or occupancy information detected in the SLAM-map is collected from LiDAR based on the current pose of the mobile robot. The SLO-VLP is adopted into the Gmapping SLAM as the initialization point. Then, the accuracy of mileage information from the odometry and LiDAR can be improved and constrained by fusing the SLO-VLP measurement data. The output SLAM-map would be associated with the LED beacons in the environment, so that the localization of the LiDAR-SLAM can be integrated with the VLP measurement. robot starts at Point B, then the origin of the SLAM-map made from Gmapping is Point B, which has a drift from Point A. As can be seen in Fig.3, without the VLP-constrained, the SLAMmap made from Gmapping is termed as the start point (Point B) as the origin of the SLAM-map. While with the VLP-constrained Gmapping, the origin of the SLAM-map can be drifted to the same point as the origin (Point A) of the pre-built LED landmark map. Thus, the proposed method can align the LiDAR-SLAM results with the VLP global map, so that the localization output of the LiDAR-SLAM can be integrated with the SLO-VLP. After the map-fusion, the information about a prior-unknown environment is stored as a certainty grid map, which is a twodimensional array of identical units that keep their probability of being occupied by obstacles. This makes great contributions to the navigation function for VLP system.

b) Loosely-couple EKF for localization-fusion
After map building, we adopt Adaptive Monte Carlo Localization (AMCL 9 ) for LiDAR-SLAM to track the pose of the robot against the map from VLP-constrained Gmapping. AMCL is a probabilistic localization system for a robot moving in 2D. It uses a particle filter to track the pose of a robot against a known map with the LiDAR scanner match and odometry. Since AMCL is not the contribution in this paper, we would not describe the details of it, we refer readers to the book Probabilistic Robotics 10 for the concept of AMCL. The EKF is adopted for state estimation of dynamic systems, which works as formulas (9) ~ (13). First, the current estimated state ̂ of the robot is predicted based on the previous state −1 with the state transition matrices , and the control command . For our application, the prediction step is based on the motion model of the odometry measurements and the twist 11 . During the state prediction, the estimated error covariance, ̂, is projected via , and then perturbed by the process noise covariance . ̂= ( −1 , ) = −1 + (9) ̂= −1 + (10) The second step is the measurement update, the EKF incorporates the measurements from the sensors, including the camera measurements from our SLO-VLP and the LiDAR scanning from AMCL, as input to correct its state estimation. The EKF gain can be calculated by using the observation matrix and the estimated error covariance ̂: is measurement noise covariance. Finally, the gain is also used to update the state vector and covariance matrix： =̂+ ( −̂) (12) = ( − )̂ (13) While the cumulative error or drift from the odometry can be estimated and compensated by the AMCL, since it can publish a map_frame to odom_frame transform, which essentially fixes the pose of the robot in the map frame.

IV. EVALUATION
In this section, we evaluate our system through real-world experiments, shown in Fig.5 (a). We firstly evaluate the performance of the proposed VLP-constrained Gmapping, which shows that our VLP-constrained Gmapping can drift or align the origin of the SLAM building map with the floorplan map. After that, we verify the property of the proposed multi-sensor fusion for VLP-SLAM, and show the strong performance and the robustness of our method compared with the state-of-the-art (SOTA) methods. Also, we embed the navigation function into our system to present the navigation and obstacle avoidance function, as an indication of the overall positioning and navigation performance.

A. Experimental settings
We setup a room-sized (around 8.4×4.9 m 2 ) test field with 4 LEDs mounted on the yellow pole with 2.2 m high. The experiments are performed on a Raspberry Pi 3B mobile robot (Turtlebot 3 Burger 12 with Quad ARM Cortex-A53 Core 1.2 GHz Broadcom BCM2837 64 bits CPU and 1 GB RAM), which runs an Ubuntu Mate 16.04 OS equipped with a robot operating system (ROS 13 ). The LEDs' radiation surface has a circular shape of size 18.0 cm in diameter. We control the RSE camera sensor (MindVision UB-300) settings by minimizing the exposure time, so as to see clear strip patterns from the modulated LEDs. The image stream is captured at around 6 Hz with a 2048 ×1536 resolution and recorded as ROS bags. The LiDAR sensor (HLS-LFCD-LDS 14 ) we used is a 2D LiDAR scanner capable of sensing 360 degrees with one-degree angular resolution and 12cm ~3.5m detection distance. The LiDAR is connected and powered by the Raspberry Pi to collect a set of data around the robot to use for SLAM . The odometry and LiDAR is sampled at around 24 HZ and 5 HZ, respectively. Yet for these sensors, hardware synchronization is not available. Therefore, a multithreading loose synchronization framework is used in our algorithm. We run our algorithm on a laptop computer (Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz 2.11 GHz, Ubuntu 18.04) using the recorded bags from the robot. The major computation is done by the computer, and the physical components like the LiDAR sensors, RSE-camera, odometry, etc. are interfaced through the Raspberry Pi robot, which communicates serially with the computer through publishing or subscribing the message of the related ROS packages. The indoor map of our lab (R&C lab, Innovation and Entrepreneurship Incubation Base, SCUT) is shown in Fig5 (d), and is changed to the occupancy grid map so that it can be visualized in the RVIZ 15 .

B. Map-fusion performance
In this section, we firstly study the performance of our mapfusion. The pre-built LED landmark map is fixed and predefined along with the floorplan map of the building (Fig5 (d)). With the proposed VLP-constrained Gmapping process, the VLP-map and the generated SLAM map can be aligned and integrate together, so that the fusion between the VLP and LiDAR-SLAM is meaningful. It is worth to mention that, in this paper, we don't evaluate the map building accuracy, since the Gmapping is a related accurate map building method [47], [48]. The comparison of grid maps made through the traditional Gmapping (beginning of Point A and Point B in Fig5. d), and the proposed VLP-constrained Gmapping (beginning of Point B) are visualized in Fig.6, where we can see that the resulting maps have similar structure. Although the mapping performances among these three maps are similar and good (fit the floorplan of our lab), the origin of the map is different. As can be seen in Fig.7 (visualize the robot in Fig.6), we visualize the robot (stays at the same Point B with the same SLO-VLP calculation result) in RVIZ. However, in Fig.  7 (b), the position of the robot on the map drifts out of range, which is caused by the drift or mismatch of the origin between 12 https://emanual.robotis.com/docs/en/platform/turtlebot3/overview/ 13 https://www.ros.org/ the VLP-map and the SLAM-map. Since the SLAM and VLP are map-based localization method, when the origin of these two maps is not aligned, the localization fusion between them would be mismatched. Therefore, our VLP-constrained Gmapping can align the LiDAR scan relative to the pre-built LED landmark map. Based on Fig.6 and Fig.7, we can know that our proposed VLP-constrained Gmapping has the same performance with the 14 http://wiki.ros.org/hls_lfcd_lds_driver 15 http://wiki.ros.org/rviz traditional Gmapping, at the same time, it can drift the origin of the occupancy grid map to the origin of the LED feature map, so that it can realize the alignment between the localization output of LiDAR-SLAM and VLP.

C. Localization-fusion performance a) Positioning Accuracy
To evaluate the positioning accuracy of our multi-sensor localizer, two series of experiments are carried out. The first series are used to test the performance of motionless objects. 900 locations were randomly chosen in the experimental field to evaluate the positioning accuracy of the proposed multi-sensor fusion for VLP-SLAM. Without loss of generality, we also simultaneously calculate the positioning result from the SLO-VLP (900 samples) and the AMCL (50 samples) in the experiment. The results can be seen in Fig.9. The average positioning accuracy of the proposed multi-sensor fusion for VLP-SLAM is 2.48 cm with a maximum error of 5.73 cm, which is close to that of SLO-VLP, which is 2.23 cm with maximum error of 5.74 cm (maintains as similar level to our previous work [17] [49]). While the average accuracy of the AMCL (LiDAR-SLAM) is 6.97 cm with a maximum error of 14.18 cm.

b) Robust and Real-time Pose Estimation
To further assess the real-time localization performance, we collect the data in trials to show the positioning performance among the SLO-VLP, AMCL, and our proposed VLP-SLAM during the movement of the robot. Fig.9 shows the estimated trajectories, which travel around 17.6 m. Due to the limitation of our hardware, we cannot provide the motion capture system to obtain the ground-truth of a moving robot. Therefore, we do not discuss the accuracy for the dynamic localization. For the proposed VLP-SLAM method, on the one hand, the SLO-VLP can provide high accuracy positioning, which can not only used as the pose measurement for the robot but also can be used as the pose initialization and calibration for the AMCL. On the other hand, with the AMCL and the odometry, it still can provide localization when the LED is outage (no LED coverage region). Therefore, the derived pose estimates from our VLP-SLAM can get smoothed temporally and can be sustained over a short time when LEDs are not available. It is worth to mention that the AMCL is the calculation of robot pose through the LiDAR scanner. When the robot is under no-LED region, the LiDAR localization calculation (AMCL) can be used to compensate the information loss of the VLP, so that the overall robot localization system can work well without LED density deployment. 16 http://wiki.ros.org/navigation/Tutorials/RobotSetup

c) Real-time Performance
Positioning speed is another key factor for localization systems. Especially, when the positioning terminal is moving, it can continuously receive information from the positioning system and calculate the current position in time. In this section, the computational time for position calculation of our VLP-SLAM is continuously measured 200 times to calculate the average positioning time-consumption, as shown in Fig.10. The average computational time of our VLP-SLAM is around 42ms. There are some special points of positioning time, which fluctuate from about 100 ms to 120 ms. These points are the EKF output when SLO-VLP or LiDAR-SLAM calculation is available, otherwise, the main weight of the EKF output is calculated through the odometry sensor which is high frequency sampled. Please note that the odometry provided data at higher rates than the camera or LiDAR scanner, even when the SLO-VLP is in-calculating (unavailable caused by calculation delay), the odometry prediction step of the EKF is still used. Thanks to the high-rate pose estimation from odometry, our VLP-SLAM can maintain smooth pose estimation and compensate the delay of the VLP or LiDAR-SLAM calculation.

D. Navigation performance
Once mapping and localization are successfully done, the navigation can be easily achieved since our VLP-SLAM can provide real-time and accuracy localization. As can be seen in in Fig.5 (c), we embed our VLP-SLAM method into the Navigation Stack 16 in ROS to navigate the robot. The robot navigation framework is based on our multi-sensor fusion for VLP and SLAM, which uses the proposed multi-sensor localizer to estimate the pose of the robot, and the map data from the proposed VLP-constrained Gmapping for path planning. ROS package move_base 17 is adopted to accomplish autonomous navigation. This package uses the localization information from our multisensor localizer and provides move-commands to the mobile base of the robot to move safely in the environment without colliding with obstacles. The framework maintains two costmap 18 each for the local and global planner. The global planner is based on A Star search algorithm, while the local planner uses the dynamic window approach [50]. Once a navigation goal is received, the global path planning algorithm takes charge of generating an appropriate path from the start pose to the target pose, while the local path planning is responsible for generating velocity commands. Thanks to the reliable and accurate pose estimation from our VLP-SLAM fusion, the robot can move steadily according to the planned path to reach the designated location.
During the automatic navigation, the VLP provides high accuracy pose estimation, initialization, and correction for SLAM, which contributes to pose initialization for global path planning. Odometry makes great contributions to maintain real-time capability and compensate the delay of the VLP or SLAM calculation. While the LiDAR-SALM can compensate the cumulate error from the odometry, and the dynamic obstacles can be detected through the LiDAR scanner via the update of the local cost-map to realize the dynamic obstacle avoidance. All of these contribute to the reliable navigation for robotics. Thus, the crucial problem of autonomous initialization and localization in robotics can be solved. For the details of the navigation test, we strongly recommend readers to our video demo: • https://www.bilibili.com/video/BV1Qw411Z7HX

E. Discussion
In this section, we compare the performance of our VLP-SLAM multi-sensor localizer with the SOTA works in the field of VLP in Table I. The average positioning accuracy, computing time, and the density of the LEDs in the related experimental platform are also reported objectively. Compared to other approaches, our method performs favorably, with less latency (42 ms), high accuracy (2.5 cm) and no LED density deployment required. It is worth to mention that, without losing generality, the proposed multi-sensor fusion for VLP-SLAM method has also been evaluated in the Hong Kong University of Science and Technology to show the reproducibility. Extensive experiments evaluation can be seen in our technology report [49]. Note that the experiment setup is different between this work and our technology report [49] with same algorithm, however, based on the probabilistic theory, the localization result might have slight fluctuation in different platform, but maintains a similar level (2.0 cm in [49]). While the real-time performance is different since the processor is different (52 ms in [49]). But these can show the stability of the accuracy of our method is also evaluated. As for the LED density coverage, the VLP coverage area depends on the FOV of the camera and the height of the LED-installation, our multi-sensor fusion for VLP-SLAM method can perform well under LED outage region through LiDAR and odometry.
Compared to the VLP-aided IMU method [6], [15]- [17], which suffers from bias or drift during long-term running. Our work can cope with the cumulative error through the LiDAR scan matching measurement. Through the multi-sensor fusion, especially the LiDAR scanning compensation, the orientation angle from the odometry can perform well with considerable drift after long-term experiments. The single LED VLP, which is based on the orientation angle from the odometry, can obtain lowdrift and accuracy pose estimation. Therefore, the proposed multi-sensor fusion for VLP-SLAM can not only achieve well balance among accuracy, real-time performance, and robustness, but also compensate the deficiencies of stand-alone sensors and provide more reliable, more robust positioning and navigation.
For the real-time performance, please note that we run the whole image process and data collection on a low-cost embedded platform Raspberry Pi 3B with a large-size captured image of 2048×1536 without any code optimization for ARM processors. After that, the data (including the VLP observation, LiDAR scanning, and odometry data) are transmitted to the laptop which takes charge of multi-sensor fusion pose estimation and visualizes the result in RVIZ. The time cost here is the whole process covering both the image process and pose estimation (runtime on both laptop and Raspberry), and also the time-cost of data transmission from robot to the laptop. Compared to the similar process platform with image size of 1640×1232 in Ref. [15], and 640×480 in Ref. [6], our proposed multi-sensor fusion for VLP-SLAM is more efficient, and hence lightweight to be used on resource-constrained computational platforms.
For the pose initialization, since the odometry only provides relative motion measurements, the initial pose is needed with the direction/orientation with respect to the map. For LiDAR-SLAM localization, the AMCL also needs to initialize its filters, which is generally done with an initial pose message given by the user. In our VLP-SLAM method, the VLP can provide position initialization for the AMCL, while the AMCL can provide reliable orientation estimation for the SLO-VLP. Based on this, the robot no longer needs to head along a fixed direction (e.g. xaxis) before running. When the manually initial pose (position and orientation) of the robot is not accurate, the LiDAR-SLAM can adjust the orientation of the robot, while the SLO-VLP can provide higher accuracy position estimation with the more correct orientation from the LiDAR. Meanwhile, the high position estimation from SLO-VLP can update the overall pose estimation for the EKF, which can also act on the AMCL to improve the measurement from the AMCL. The interaction between VLP and LiDAR-SLAM provides reliable and accurate pose estimation. The details of the random pose initialization can be visualized through the video demo 19 .
Furthermore, for LiDAR-SLAM, it is difficult to estimate the pose in the large space or long corridors without a sufficient variety of observations. Since similar geometric regions in these environment may mislead the convergence of AMCL. When the robot is not under the VLP coverage area, the wrong initialization for LiDAR-SLAM would cause the robot getting lost since the geometric regions of these two parts are quite similar. However, when VLP is available, the position of the robot can be calibrated to the correct position. The evaluation of the VLP calibration for LiDAR-SLAM can be seen in the video demo 20 . 2) "*" means that the LED lamp with beacon marker "#" means that the position estimation was performed in post-processing, not real-time.

V. CONCLUSION
In this paper, we pursued reliable, real-time, and accurate pose estimation for mobile robot through the multi-sensor fusion method based on EKF for VLP-SLAM. We relaxed the assumption on the minimum number of concurrently observable LEDs from three to zero for the RSE-camera based VLP system through integrating with odometry and LiDAR sensor. While the cumulative error of the odometry can be compensated and corrected through the LiDAR scanning. The experiment result shows that the our method has strong robustness with low latency (42 ms) and high accuracy (2.5 cm), which can achieve well balance among accuracy, real-time ability and coverage. In future work, we will deeply study the tightly coupled VLP and SLAM for incorporating the LED markers into the maps, which could construct the global map and optimize poses with sufficient features, together with a method to model and reduce data uncertainty.