Non-learning Stereo-aided Depth Completion under Mis-projection via Selective Stereo Matching

We propose a non-learning depth completion method for a sparse depth map captured using a light detection and ranging (LiDAR) sensor guided by a pair of stereo images. Generally, conventional stereo-aided depth completion methods have two limitations. (i) They assume the given sparse depth map is accurately aligned to the input image, whereas this alignment is difficult to achieve in practice. (ii) They have limited accuracy in the long range because the depth is estimated by pixel disparity. To solve these limitations, we propose selective stereo matching (SSM), which searches for the most appropriate depth value for each image pixel from its neighborly projected LiDAR points based on an energy minimization framework. This depth selection approach can handle any type of mis-projection. Moreover, SSM has an advantage in terms of long-range depth accuracy because it directly uses the LiDAR measurements rather than the depth acquired from the stereo. SSM is a discrete process; thus, we apply variational smoothing with a binary anisotropic diffusion tensor (B-ADT) to generate a continuous depth map while preserving depth discontinuity across object boundaries. Experimentally, compared with the previous state-of-the-art stereo-aided depth completion, the proposed method reduced the mean absolute error (MAE) of the depth estimation to 0.65 times and demonstrated approximately twice the accuracy in the long range. Moreover, under various LiDAR-camera calibration errors, the proposed method reduced the depth estimation MAE to 0.34-0.93 times that of previous depth completion methods.


I. INTRODUCTION
Depth measurement is conducted in several ways, such as time-of-flight (ToF), stereo cameras, and structured light projection [1]. Stereo cameras and structured light projection estimate depth by pixel disparity; hence, their precision reduces dramatically as the distance increases, since a small disparity change indicates a large depth change in the long range. In comparison, ToF sensors have higher precision in the long range. Among ToF sensors, light detection and ranging (LiDAR) is used in various systems that require adaptability to dynamic environments, e.g., automated driving and robots, because of its active sensing capability and robustness to environmental changes. However, in terms of measurement density, LiDAR has a limitation because of the number of lasers in its array and the narrow beam measurement.

Yasuhiro Yao was with the Institute of Industrial Science, the University of Tokyo, Tokyo 153-0041, Japan, and also with the Human Informatics Laboratories, Nippon Telegraph and Telephone Corporation, Tokyo 100-0004, Japan (e-mail: yao@cvl.iis.u-tokyo.ac.jp). Shingo Ando, Kana Kurata, Naoki Ito, and Jun Shimamura were with the Human Informatics Laboratories, Nippon Telegraph and Telephone Corporation, Tokyo 100-0004, Japan (e-mail: shingo.andou.fv@hco.ntt.co.jp; jun.shimamura.ec@hco.ntt.co.jp).

Fig. 1: SSM selects the most appropriate depth for each pixel from its nearby LiDAR projections within a search radius based on stereo correspondence and smoothness. The diagram ignores the smoothness for simplicity. Here, two depths (LiDAR 1 and 2) exist within the search radius from the target pixel in image 1. SSM warps the target pixel to image 2 using the depths of LiDAR 1 and 2. Then, SSM evaluates the correspondence of image 1 at the target pixel and image 2 at the warps (Warp 1 and 2). Here, the correspondence is higher for Warp 1, in other words, when LiDAR 1 is selected.
An important issue in depth completion is mis-projection, where LiDAR points are projected onto different objects in the image. Mis-projection often occurs because of spatial and temporal displacement of sensors, dynamic objects, decalibration, and calibration errors. Because the camera and LiDAR are typically placed at different positions, occlusion is unavoidable for near or dynamic objects. The temporal displacement of each LiDAR beam also causes errors without precise synchronization. Decalibration appears at run time because of oscillations of the vehicle or other mechanical reasons. Furthermore, the extrinsic calibration error between sensors results in mis-projection over the entire image.
Generally, extrinsic LiDAR and camera calibration is difficult because of their different modalities. To build the KITTI dataset [23], Geiger et al. calibrated LiDAR and cameras using multiple calibration boards in a controlled garage environment [24] followed by the manual selection of corresponding points. Although marker-less calibration methods have been examined [25], [26], [27], it remains difficult to stably realize accurate extrinsic calibration in uncontrolled environments.
In recent studies, stereo images have been used for depth completion rather than a single image to solve the mis-projection issue. This is because stereo camera systems are widely available, and such systems can perceive 3D information with the help of stereo matching algorithms. For example, a self-supervised method is applicable under mis-projection caused by displacements of sensors [28]. However, this method still requires a dataset measured with a well-calibrated LiDAR-stereo system to train the neural network (NN). Furthermore, this method estimates depth by pixel disparity and suffers from low depth precision in the long range.
Therefore, in this study, we propose a non-learning depth completion method for a stereo-LiDAR system that is effective in the long range and robust to mis-projection. The proposed method comprises two techniques, i.e., selective stereo matching (SSM) and binary anisotropic diffusion tensor (B-ADT) [8]-aided smoothing. The key proposal is SSM, which searches for an optimal depth value for each pixel from its neighborly projected LiDAR points using an energy minimization framework (Fig. 1). This energy minimization approach can handle any type of mis-projection. Furthermore, SSM directly uses LiDAR depths and is therefore advantageous in terms of long-range accuracy. SSM is a discrete optimization; thus, we apply B-ADT-aided smoothing for continuous depth estimation while preserving discontinuity between different objects.
Our contributions are summarized as follows.
• We propose SSM, which performs stereo matching in a selective manner to upsample LiDAR depths while maintaining the depth precision of LiDAR in the long range and considering mis-projection of LiDAR points.
• We propose a non-learning depth completion framework that combines SSM and B-ADT-aided smoothing. The framework achieves boundary-aware continuous depth estimation in addition to the SSM effects (long-range depth accuracy and robustness to mis-projection).
The rest of the paper is organized as follows. Section II reviews the related works and limitations of existing depth completion methods. Section III explains the proposed method. Section IV shows the evaluations. Section V gives the conclusion and summarizes limitations and future works.
II. RELATED WORKS

A. Stereo matching

In terms of accuracy, the supervised methods perform the best among them. The supervised methods also have the potential to perform well in the long range because the precision of the ground truth disparity, which is usually produced by other sensors such as LiDAR, is high and accurate at the sub-pixel level. However, it is challenging to prepare a large dataset with ground truth disparity to train NNs.
Non-learning and self-supervised methods have difficulty achieving high precision in the long range because their depth estimation is limited by pixel-level stereo matching. As mentioned, a small disparity change indicates a large depth change in the long range. Therefore, there is uncertainty in the depth estimation even if the matching is accurate at the pixel level. Moreover, stereo matching methods still suffer in challenging scenarios such as repetitive patterns, low texture, depth discontinuities causing occlusion, and specular reflection.

B. Single-image-aided depth completion
Depth completion methods generate high-resolution and dense depth maps from sparse or low-resolution depth maps captured using LiDAR or depth cameras. The most common approach uses a single image as guidance. Kopf et al. [2] proposed a method to interpolate low-resolution depth values based on the joint distance of color and space in a high-resolution image. Diebel and Thrun performed upsampling using a Markov random field (MRF) formulation [4]. In this method, the smoothness term is weighted according to texture derivatives; however, the results suffer from surface over-flattening. To address this issue, Ferstl et al. formalized depth completion as ADT-aided and TGV-regularized energy minimization [3], and their method has been successfully used to smooth and optimize depth maps in more recent methods [37], [38]. Recently, Yao et al. proposed B-ADT to achieve depth completion that preserves discontinuity between different objects [8].
In addition, NNs have been applied to depth completion tasks. The most common approach is to train networks with ground truth dense depth maps [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], [21], [22]. Recently, self-supervised and semi-supervised methods have been examined because it is difficult to acquire dense ground truth. Ma et al. [6] and Wong et al. [7] proposed methods with NNs that can be self-supervised using monocular camera frames and sparse depth maps from LiDAR with motion. Yang et al. proposed a method that can train an NN by the likelihood of the observed sparse point cloud under a hypothesized depth map [39].
A major limitation of the single-image-aided depth completion is that mis-projection of LiDAR points is not considered.

Fig. 2: Proposed framework. SSM is composed of candidate assignment and candidate selection via optimization. The candidate assignment takes stereo images (I 1 and I 2 ) and a sparse inverse depth map (D S ) to derive a candidate set (S x ). The candidate selection via optimization derives a dense discrete inverse depth map (D SSM ) and a map to the initial inverse depth location (Y SSM ). B-ADT-aided smoothing first performs the ground detection to derive the ground mask (Γ SSM ), then derives the B-ADT (G x ), and finally performs optimization to derive a dense inverse depth map (D).

C. Stereo-aided depth completion
Stereo images have been used as guides to complete sparse LiDAR measurements. These methods have been developed based on stereo matching, and they perform dense stereo matching using the accurate sparse depth values from LiDAR as clues.
Badino et al. used LiDAR measurements to reduce the search space for stereo matching and provided predefined paths for dynamic programming [40]. Maddern et al. proposed a probabilistic model to fuse LiDAR and disparities by combining priors from each sensor [41], and Park et al. used NNs to learn such a model, which takes two disparities as input, i.e., one from the interpolated LiDAR and the other from semi-global matching [42]. Choe et al. recently proposed a geometry-aware stereo-LiDAR fusion network for long-range depth estimation [43]. Like the single-image-aided methods, these methods do not consider mis-projection in given sparse depth maps.
Several recent methods have attempted to infer dense disparity maps from inaccurately projected LiDAR points with the help of stereo images. For example, Cheng et al. proposed a self-supervised method to train a NN to remove occluded background projection of LiDAR points to infer dense disparity maps [28]; however, this method does not handle incorrect projection caused by extrinsic calibration errors between the LiDAR and camera. Park et al. proposed a supervised method to train a NN to infer dense disparity maps from LiDAR inputs with extrinsic calibration errors between LiDAR and the camera [44]; however, this method requires accurately calibrated LiDAR and cameras to acquire effective training data.
Furthermore, the previous non-learning and self-supervised methods [40], [41], [28] estimate the depth by pixel disparity; thus, their depth precision is limited in the long range.
In summary, there are two major limitations in existing stereo-aided depth completion methods.

Fig. 3: The circles indicate the areas to create the candidate sets for the centered pixels. Although there is mis-projection, the appropriate LiDAR values are located in the areas, e.g., purple for the pole at the left, and yellow for the person at the right.
• These methods require accurate LiDAR-camera extrinsic calibration at some point in their process, which is often difficult to realize.
• The precision in the depth estimation is dramatically reduced as the distance increases because of the nature of disparity estimation using images.

The cost of the proposed method is similar to those of the previous studies [29], [28], whereas the approach to minimizing the cost is different. The proposed method searches for the minimizer by selection from the projected LiDAR depth values. This approach can handle any type of mis-projection without requiring accurate LiDAR-camera extrinsic calibration in any part of the process. Moreover, this selective approach has an advantage in long-range precision because it directly uses the LiDAR depth values.

III. PROPOSED METHOD
As shown in Fig. 2, the proposed method applies SSM followed by B-ADT-aided smoothing [8]. SSM is a discrete optimization, and its output is discrete; thus, smoothing improves the quality of the result. B-ADT allows us to incorporate boundary-direction-aware discontinuity in a variational approach. Both SSM and B-ADT-aided smoothing, and thus the proposed method as a whole, preserve depth discontinuity between different objects.
We give the problem statement in Section III-A, introduce SSM in Section III-B, explain B-ADT-aided smoothing in Section III-C, and describe a practical parameter tuning approach for SSM in Section III-D.

A. Problem statement
Our problem setting assumes that a stereo camera and LiDAR are used to capture the scene; however, the camera-LiDAR calibration contains errors. Such conditions can occur because of the difficulty associated with calibration, particularly when a calibration target is not available. Our aim is to estimate a dense depth map that is aligned with the image. Using mathematical notations, the target problem is defined as follows.
We are given a pair of stereo images (I 1 : Ω 1 → R, I 2 : Ω 2 → R) and a sparse inverse depth map captured by LiDAR (D S : Ω 1 → R ∪ {φ}) with Ω 1 ⊂ R 2 and Ω 2 ⊂ R 2 being the image domains and φ indicating an empty depth. Here, D S is determined by projecting the LiDAR points to the image I 1 ; however, D S is not accurately aligned with I 1 because we expect LiDAR-camera miscalibration and occlusions. Our aim is to derive a dense inverse depth map D : Ω 1 → R aligned with the input image I 1 . Throughout this paper, we normalize I 1 and I 2 to the range [0, 1], and the unit of depth is the meter.
Note that we derive an inverse depth map D rather than directly deriving the depth map. This is performed to balance the contribution of both near and distant depths [45], [8]. Deriving a dense inverse depth map D is equivalent to deriving a dense depth map D −1 or a dense disparity map D ′ . The conversions at x ∈ Ω 1 are given using the camera focal length f and stereo baseline b as

D −1 (x) = 1 / D(x),   (1)
D ′ (x) = f b D(x).   (2)
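The conversions between depth, inverse depth, and disparity referenced in Eq. (1), (2) can be sketched in a few lines. The following is a minimal NumPy illustration; the focal length and baseline values are illustrative, not taken from any dataset:

```python
import numpy as np

def depth_to_inverse(depth):
    """Depth [m] -> inverse depth [1/m]."""
    return 1.0 / depth

def inverse_to_disparity(inv_depth, f, b):
    """Inverse depth -> disparity [pixel]: d' = f * b * d."""
    return f * b * inv_depth

# Example: a point at 5 m with f = 700 px and b = 0.54 m (illustrative values)
d = depth_to_inverse(5.0)                     # 0.2 [1/m]
disp = inverse_to_disparity(d, 700.0, 0.54)   # 75.6 [pixel]
```

Working in inverse depth makes a fixed residual correspond to a larger depth error at long range, which is exactly why it balances near and distant measurements.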
The proposed method is applicable to motion stereo images as input, as in the evaluation using the Komaba dataset (Section IV-B).
In this paper, we use | · | to denote the vector norm. In particular, given a vector p ∈ R K with K being an arbitrary number of dimensions, the norm is given as follows:

|p| = ( Σ_{k=1}^{K} p_k^2 )^{1/2}.   (3)

B. Selective Stereo Matching
SSM searches for the most appropriate inverse depth value for every x ∈ Ω 1 from its neighborly projected LiDAR points. SSM comprises the candidate assignment and candidate selection steps. In the candidate assignment step, each pixel in the image is assigned a set of LiDAR inverse depth values. In the candidate selection step, SSM selects the most appropriate value from the candidate set using an energy minimization framework. Here, the energy is defined as the sum of the stereo matching cost and a smoothness regularization term.
Implementation-wise, both the candidate assignment and the candidate selection are composed of pixel-wise calculations which are parallelized on GPU.
1) Candidate assignment: a) Initial map of candidate sets: First, SSM constructs a candidate set S x (x ∈ Ω 1 ), which is the set of inverse depth values of the surrounding pixels within a pre-defined radius r from x, as shown in Fig. 3:

S x = { D S (y) | y ∈ Ω 1 , |y − x| ≤ r, D S (y) ≠ φ }.   (4)
Note that we introduce an empirical approach to set r in Section III-D. If the cardinality of S x is less than predefined threshold m, the set is assumed to be empty (S x = ∅) to avoid selecting a value from a small number of candidates (we used m = 4 in our evaluations).
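The candidate assignment step can be sketched as follows. This is a brute-force NumPy illustration (the paper's implementation is a parallelized GPU kernel), with `np.nan` standing in for the empty value φ; the dictionary return type and the O(HW·N) double loop are simplifications for readability:

```python
import numpy as np

def assign_candidates(d_sparse, r=5, m=4):
    """Collect, for each pixel, the sparse inverse depths within radius r.
    d_sparse: 2D array with np.nan marking empty pixels (phi).
    Returns {(row, col): list of candidate inverse depths}; sets with
    fewer than m candidates are treated as empty."""
    h, w = d_sparse.shape
    ys, xs = np.nonzero(~np.isnan(d_sparse))   # locations of projected LiDAR points
    sets = {}
    for i in range(h):
        for j in range(w):
            dist2 = (ys - i) ** 2 + (xs - j) ** 2
            near = dist2 <= r * r
            cand = d_sparse[ys[near], xs[near]].tolist()
            sets[(i, j)] = cand if len(cand) >= m else []
    return sets
```

A production version would rasterize the sparse points once and gather candidates per pixel in parallel rather than scanning all points for every pixel.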
b) Candidate set interpolation: To fill the pixels with empty candidate sets, we interpolate the candidate sets using image-guided nearest neighbor search (IGNNS) [8]. IGNNS searches for the nearest neighbor by the cumulative distance of image gradients. Here, let Ψ be the set of pixels where the candidate set is empty (Ψ = {x | S x = ∅}), and let Ψ̄ be the complement of Ψ (Ψ̄ = Ω 1 \ Ψ). We search for an image-guided nearest neighbor of every x ∈ Ψ from Ψ̄ (denoted z) and update the candidate set by S x = S z .
We derive z as

z = argmin_{x̄ ∈ Ψ̄} Σ_{y ∈ π(x, x̄)} ( |∇I 1 (y)| + c ),   (5)

where π(x, x̄) is the set of pixels on the grid path from x to a surrounding non-empty pixel x̄, and c is the constant cost of the unit path length; we set c = 0.04 in our evaluations following the parameter study in the literature [8].

c) Correspondence search: Following the candidate set interpolation, we identify the correspondence x ′ (d) in I 2 of the pixel x ∈ Ω 1 for each candidate d ∈ S x . If I 1 and I 2 are a pair of rectified binocular stereo images, x ′ (d) is derived with the camera focal length f and baseline b, using the floor function ⌊·⌋, as

x ′ (d) = x − ( ⌊f b d⌋, 0 )ᵀ.   (6)

For motion stereo images, x ′ (d) is calculated using the camera intrinsic parameter (K ∈ R 3×3 ), the rotation (R C ∈ SO(3)), and the translation (t C ∈ R 3 ) of the camera motion as

x̃ ′ (d) ∝ K ( d −1 R C K −1 x̃ + t C ),   (7)

where x̃ and x̃ ′ (d) denote the homogeneous coordinates of x and x ′ (d). If two or more d ∈ S x derive the same x ′ (d), we only maintain the one nearest from x among those in S x . Note that this pruning process is performed to realize computational efficiency in the following optimization process.
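For a rectified pair, the correspondence search reduces to a horizontal shift by the floored disparity. A minimal sketch follows; the camera parameters are illustrative, and the sign convention (image 1 as the left camera, so the match lies to the left in image 2's column index after the shift) is our assumption:

```python
import numpy as np

def warp_rectified(x, inv_depth, f, b):
    """Warp pixel x = (col, row) in image 1 to image 2 for a rectified pair:
    shift the column by the floored disparity floor(f * b * d)."""
    disp = int(np.floor(f * b * inv_depth))
    return (x[0] - disp, x[1])

# Two candidate inverse depths warp the same pixel to different columns;
# the photo-consistency at each warp decides which candidate is selected.
x = (400, 120)
near = warp_rectified(x, 0.2, 700.0, 0.54)    # disparity 75 -> (325, 120)
far = warp_rectified(x, 0.05, 700.0, 0.54)    # disparity 18 -> (382, 120)
```

Distinct candidates that floor to the same disparity land on the same pixel, which is why the pruning step described above can discard all but one of them.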
2) Candidate selection via optimization: a) Stereo matching cost: The stereo cost L x (d) evaluates the consistency of the inverse depth d and the pair of input images at location x ∈ Ω 1 .
Similar to the literature [28], we compose the stereo cost L x (d) as the sum of the photometric loss L P x (d), the census loss L C x (d), and the image gradient loss L G x (d) with weights α C and α G :

L x (d) = L P x (d) + α C L C x (d) + α G L G x (d).   (8)

Here, the three losses are calculated using the warped coordinates x ′ (d) in Eq. (6) or (7) with the predefined window W as

L P x (d) = min( |I 1 (x) − I 2 (x ′ (d))|, l P ),   (9)
L C x (d) = min( ‖C 1 (x) − C 2 (x ′ (d))‖ H , l C ),   (10)
L G x (d) = min( |∇I 1 (x) − ∇I 2 (x ′ (d))|, l G ),   (11)

where C 1 and C 2 respectively represent the census transformation of I 1 and I 2 with the window W , ‖·‖ H denotes the Hamming distance, and l P , l C , and l G are the maximum cost values. We set the window W to be an 11×11 square centered at x. In our evaluations, we set α C = α G = 1 and l P = l C = l G = 0.5.

b) Energy definition: SSM searches for the optimal depth value from S x for every x ∈ Ω 1 using energy minimization. Here, the energy follows the conventional stereo disparity estimation [29]. We construct an MRF whose nodes are x ∈ Ω 1 and whose edges E comprise all pairs of adjacent pixels. The energy E SSM is defined as the sum of the stereo matching cost L x (d) defined in Eq. (8) and a smoothness regularization term for the inverse depth:

E SSM (D) = Σ_{x ∈ Ω 1} L x (D(x)) + λ SSM Σ_{e ∈ E} |Δ e D|.   (12)

Here, Δ e represents taking the difference across the edge e, and λ SSM is the regularization weight. We empirically set λ SSM = 10 2 .

c) Optimization: SSM derives a discrete dense inverse depth map (D SSM ) by minimizing the energy (E SSM ) in Eq. (12).
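A simplified sketch of the per-candidate stereo cost follows: a truncated photometric term plus a truncated census term over a small window. The gradient term and the paper's 11×11 window are omitted for brevity, and the normalization of the Hamming distance by the window size is our assumption:

```python
import numpy as np

def census(img, x, y, half=2):
    """Census bit pattern of a (2*half+1)^2 window centered at (y, x):
    each bit records whether the neighbor is brighter than the center."""
    center = img[y, x]
    patch = img[y - half:y + half + 1, x - half:x + half + 1]
    return (patch > center).ravel()

def stereo_cost(i1, i2, x1, x2, y, l_p=0.5, l_c=0.5):
    """Truncated photometric + census cost between I1 at (x1, y)
    and I2 at (x2, y); both costs are capped by their maximum values."""
    phot = min(abs(float(i1[y, x1]) - float(i2[y, x2])), l_p)
    c1, c2 = census(i1, x1, y), census(i2, x2, y)
    ham = np.count_nonzero(c1 != c2) / c1.size   # normalized Hamming distance
    return phot + min(ham, l_c)
```

The census component makes the cost robust to brightness differences between the two images, which matters for outdoor stereo pairs.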
The minimization of E SSM is an optimization of an MRF, which we solve by Loopy Belief Propagation (LBP) [46]. In particular, by setting X ⊂ Ω 1 as the set of four adjacent pixels of x ∈ Ω 1 , we iteratively update the message from x to one of its adjacent pixels y ∈ X by the min-sum algorithm, as shown in Eq. (13) and (14). Here, we denote the iteration index as n, the normalized message from x to y as msg n x→y (d), and the message prior to normalization as msḡ n x→y (d):

msḡ n x→y (d) = min_{d ′ ∈ S x} [ L x (d ′ ) + λ SSM |d ′ − d| + Σ_{z ∈ X∖{y}} msg n−1 z→x (d ′ ) ],   (13)
msg n x→y (d) = msḡ n x→y (d) − min_{d ″ ∈ S y} msḡ n x→y (d ″ ).   (14)

Denoting the message after convergence as msg ∞ , the optimal inverse depth value d x SSM at x is expressed as

d x SSM = argmin_{d ∈ S x} [ L x (d) + Σ_{z ∈ X} msg ∞ z→x (d) ].   (15)

The output inverse depth map D SSM is assigned based on the optimal values:

D SSM (x) = d x SSM .   (16)
D SSM is visually shown in Fig. 2.
In addition, for the ground mask creation in a later process (Section III-C1), we construct a map Y SSM : Ω 1 → Ω 1 to indicate the original location of the inverse depth values. Because D SSM is created by selection, we know where each value of D SSM was initially located in D S . In particular, if the value of D SSM at x ∈ Ω 1 was originally at y ∈ Ω 1 in D S , we set

Y SSM (x) = y.   (17)
C. B-ADT-aided smoothing

D SSM is discrete because it is generated by selection from a finite number of candidates. Here, we apply the B-ADT weighted TGV smoothing from [8] to derive a smooth depth map with discontinuity preservation at boundaries. Below, we explain the ground detection to create the filter for the B-ADT derivation, the B-ADT derivation, and the optimization. Implementation-wise, the ground detection is RANSAC plane segmentation, the B-ADT derivation is a single-step calculation, and the optimization is iterative, pixel-wise, and parallelized on the GPU.

1) Ground detection: We create a ground mask to filter out occlusion boundaries that are faultily detected on the ground.
First, we detect the ground points for the input LiDAR depth map. We convert the LiDAR depth map to the point cloud and apply the RANSAC plane segmentation [48]. The RANSAC plane segmentation iteratively searches the coefficients of a plane having the maximum number of inlier points within the given threshold d rand , by randomly sampling three points from the point cloud to derive the plane coefficients in every iteration. For RANSAC parameters, we set d rand = 0.2 [m] and the number of iterations as 100.
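RANSAC plane segmentation can be sketched in NumPy as follows. The paper uses the PCL implementation [50]; the sampling details below are a generic illustration, not PCL's internals:

```python
import numpy as np

def ransac_plane(points, d_thresh=0.2, iters=100, rng=None):
    """Fit a plane n.p + c = 0 to an (N, 3) point cloud by RANSAC;
    returns (unit normal, offset, boolean inlier mask)."""
    rng = np.random.default_rng(0) if rng is None else rng
    best_mask = np.zeros(len(points), dtype=bool)
    best_n, best_c = None, None
    for _ in range(iters):
        # sample three points and derive the plane coefficients
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(n)
        if norm < 1e-9:            # degenerate (near-collinear) sample
            continue
        n = n / norm
        c = -n.dot(p0)
        mask = np.abs(points @ n + c) < d_thresh   # point-to-plane distances
        if mask.sum() > best_mask.sum():
            best_mask, best_n, best_c = mask, n, c
    return best_n, best_c, best_mask
```

With the thresholds above matching the paper's settings (d_rand = 0.2 m, 100 iterations), the inliers of the winning plane serve as the ground points.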
Then, we project the inlier points of the derived plane to the image domain Ω 1 and acquire the ground mask Γ S : Ω 1 → {0, 1}, where Γ S (x) = 1 if x is the projection of an inlier point and Γ S (x) = 0 otherwise.   (18)

Finally, we create a dense ground mask Γ SSM : Ω 1 → {0, 1} which is aligned with D SSM . Γ SSM is derived by interpolating Γ S based on the selection performed in SSM. In particular, using Y SSM in Eq. (17), Γ SSM is assigned as

Γ SSM (x) = Γ S (Y SSM (x)).   (19)

2) B-ADT derivation: The B-ADT for each pixel is assigned based on the following two conditions: A, i.e., the pixel is on a vertical occlusion boundary, and B, i.e., the pixel is on a horizontal occlusion boundary. Here, a vertical occlusion boundary is a vertical line segment across which the depth is horizontally discontinuous, and a horizontal occlusion boundary is a horizontal line segment across which the depth is vertically discontinuous.
In particular, with a predefined threshold t, we determine that a pixel satisfies condition A (resp. B) if the horizontal (resp. vertical) difference of the adjacent depths at the pixel exceeds t. To mark occlusion boundaries where the adjacent depths change by more than 2 [m], we used t = 2 in our evaluations. Because images are defined on 2D grids, every pixel belongs to one of four sets, i.e., neither A nor B (Ā ∩ B̄), A but not B (A ∩ B̄), not A but B (Ā ∩ B), and both A and B (A ∩ B).
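The conditions A and B reduce to thresholded depth differences between adjacent pixels. A minimal sketch, with an illustrative depth map:

```python
import numpy as np

def occlusion_boundaries(depth, t=2.0):
    """Condition A: horizontal depth jump > t (vertical occlusion boundary);
    condition B: vertical depth jump > t (horizontal occlusion boundary).
    The flag is set on the pixel left of / above the jump."""
    a = np.zeros(depth.shape, dtype=bool)
    b = np.zeros(depth.shape, dtype=bool)
    a[:, :-1] = np.abs(np.diff(depth, axis=1)) > t
    b[:-1, :] = np.abs(np.diff(depth, axis=0)) > t
    return a, b

# A near object (5 m) partially occluding a far background (30 m)
depth = np.array([[5.0, 5.0, 30.0],
                  [5.0, 5.0, 30.0],
                  [5.0, 5.0, 5.0]])
a, b = occlusion_boundaries(depth)
```

Which side of the jump carries the flag is our convention for the sketch; the paper's pixel-level convention may differ.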
Boundary detection by a single threshold can be faulty, particularly in the ground region because the ground is often a large plane parallel to the view direction with a wide depth range. Thus, we filter out occlusion boundaries that are detected on the ground by Γ SSM in Eq. (18).
Finally, by denoting the B-ADT at pixel x ∈ Ω 1 as G x , we set G x based on the boundary conditions and the ground mask: G x suppresses diffusion across the detected occlusion boundaries (horizontally for condition A and vertically for condition B), except where the ground mask indicates Γ SSM (x) = 1, in which case the boundary detection is ignored.
3) Optimization: We minimize the energy with the B-ADT weighted TGV regularization to acquire the output of the proposed framework. By denoting the inverse depth map during optimization as u : Ω 1 → R and the relaxation variable as v : Ω 1 → R 2 , we define the energy E TGV as the sum of the data term C[u] and the smoothness term R[u, v]:

E TGV [u, v] = C[u] + R[u, v],   (21)
C[u] = Σ_{x ∈ Ω 1} w(x) |u(x) − D SSM (x)| 2 ,
R[u, v] = λ A Σ_{x ∈ Ω 1} |G x (∇u(x) − v(x))| + λ B Σ_{x ∈ Ω 1} |∇v(x)|,   (22)

where w is the pixel-wise weight for the data term, and λ A and λ B are the weights for the energy terms. We set w = D SSM −2.5 , λ A = 1.0, and λ B = 8.0 based on the literature [8].
Note that E TGV is convex, and we can derive the optima of u and v using the primal-dual algorithm [49]. In the end, the output of the proposed framework (a dense inverse depth map D) is the optimum of u:

D = argmin_u min_v E TGV [u, v].   (23)
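The full solver is the primal-dual algorithm on the TGV energy [8], [49]. As a simplified illustration of the effect of B-ADT masking alone, the sketch below minimizes a first-order quadratic surrogate (weighted data term plus gradients gated by binary masks) with plain gradient descent; the surrogate, step size, and iteration count are our simplifications, not the paper's solver:

```python
import numpy as np

def smooth_badt(d0, g_h, g_v, w, lam=1.0, step=0.05, iters=1000):
    """Gradient descent on sum w*(u-d0)^2 + lam*(g_h*(horiz. diff)^2
    + g_v*(vert. diff)^2); g_h/g_v are 0 across occlusion boundaries,
    so depth is smoothed within objects but not across them."""
    u = d0.astype(float).copy()
    for _ in range(iters):
        grad = 2.0 * w * (u - d0)                  # data term
        dx = np.diff(u, axis=1) * g_h[:, :-1]      # masked horizontal gradients
        dy = np.diff(u, axis=0) * g_v[:-1, :]      # masked vertical gradients
        grad[:, :-1] -= 2.0 * lam * dx
        grad[:, 1:] += 2.0 * lam * dx
        grad[:-1, :] -= 2.0 * lam * dy
        grad[1:, :] += 2.0 * lam * dy
        u -= step * grad
    return u
```

With the boundary column masked out (mask value 0), the depth jump between two objects survives smoothing untouched; with an all-ones mask, the same jump gets blurred across the boundary.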

D. Parameter setting for SSM
SSM introduces a parameter r as the radius for the candidate set search. Here, we present a practical approach to selecting the value of r. r should be as small as possible while covering the nearest appropriate depth, because the number of candidates increases as r increases, which generally leads to inappropriate selections. We consider two primary causes of mis-projection, i.e., LiDAR-camera calibration error and occlusion.
The mis-projection caused by calibration errors is primarily attributed to rotational errors. At the center of the image, the projection error σ calib caused by a rotation error θ calib can be calculated as

σ calib = f tan θ calib ,   (24)

where f is the camera focal length. Although the exact value of θ calib cannot be known, it can be practically estimated in several ways, e.g., from the error range presented in the reference of the original calibration method or by visually observing the LiDAR points projected onto the image.
To handle mis-projection by occlusion, all pixels typically should have several candidates in range. Empirically, we found that this can be achieved when the radius is set to cover two scanlines. The pixel distance between two scanlines σ scan is estimated using Eq. (24) with the angle θ scan between the scanlines:

σ scan = f tan θ scan .   (25)

Fig. 4: SSM gives discrete point clouds, and our whole framework gives continuous point clouds. The surface shapes of the poles and walls in the background are preserved by the proposed method, while they are missing in the results obtained using Cheng's method [28]. In addition, the error maps demonstrate the smaller errors of our method in the long range compared to Cheng's method [28].
We set the optimal radius r ∗ to be the maximum of σ calib and σ scan to cover mis-projection caused by both calibration errors and occlusion:

r ∗ = max( σ calib , σ scan ).   (26)
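The radius selection described above is a one-line computation. For example (the focal length and angle values are illustrative):

```python
import numpy as np

def optimal_radius(f, theta_calib, theta_scan):
    """r* = max(sigma_calib, sigma_scan), with each sigma = f * tan(theta)
    evaluated at the image center."""
    sigma_calib = f * np.tan(theta_calib)
    sigma_scan = f * np.tan(theta_scan)
    return max(sigma_calib, sigma_scan)

# Example: f = 700 px, 1 deg worst-case calibration error,
# 0.4 deg angular spacing between LiDAR scanlines
r = optimal_radius(700.0, np.radians(1.0), np.radians(0.4))   # ~12.2 px
```

With these illustrative numbers the calibration term dominates, yielding a radius on the order of the r = 15 [pixel] used in the erroneous-calibration experiments.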

IV. EVALUATION
We performed an evaluation that used accurate LiDAR-camera extrinsic calibration (Section IV-A), another that used erroneous LiDAR-camera extrinsic calibration (Section IV-B), and a third for the parameter study (Section IV-C). In the first evaluation, we compared the accuracy of the proposed method to that of existing state-of-the-art methods under common experimental conditions [41], [42], [28]. Moreover, we analyzed the accuracy distribution over the depth range to demonstrate the advantage of the proposed method in the long range. In the second evaluation, we examined the robustness of the proposed method against LiDAR-camera extrinsic calibration errors. In this experiment, we used the KITTI [24] and Komaba [38] datasets with added calibration errors.
In all evaluations, we implemented SSM and B-ADT-aided smoothing on GPU with CUDA and used the RANSAC plane segmentation from the PCL library [50] for the ground detection.

A. Evaluation with accurate calibration
We evaluated the proposed method on a subset of the KITTI dataset, which is commonly used to evaluate stereo-LiDAR fusion [28], [41], [42]. These data comprise 141 sets of left and right images, sparse LiDAR depth maps, dense disparity maps, and dense depth maps. A figure of an example frame of the KITTI dataset is in the supplementary material. Here, we used the ground truths of the dense disparity map [47] and dense depth map [51] for the evaluation. Note that the input sparse depth maps still have mis-projection caused by occlusions, although the extrinsic calibration is accurate, as shown in Fig. 7 (a). In this evaluation, we set the radius for the candidate search to r = 5 [pixel] for our method.

Fig. 5: The stereo-only method [31] has a larger disparity error than the LiDAR-aided methods. Moreover, our method has less error in the repetitive pattern and discontinuity conditions than the state-of-the-art stereo-aided depth completion [28].

We compared the proposed method to non-learning stereo methods [30], [31], non-learning single-image-aided depth completion methods [3], [8], a non-learning stereo-aided depth completion method [41], and supervised stereo-aided depth completion methods [42], [28]. Note that we regard Cheng's method [28] as supervised because it uses an accurately calibrated dataset during training.
Implementation conditions are as follows. We used our own CUDA implementations for several methods [2], [3], [8]. We used the authors' implementation for the non-learning stereo methods [30], [31]. We used the authors' implementation and their trained model for Cheng's method [28]. We referred to the results in the original papers with the same experimental conditions for methods [41], [42].
1) Overall accuracy: Table I compares the accuracy of each method, and Fig. 4 shows the visualized results. This evaluation was based on the error rate measured with the dense disparity maps and the Mean Absolute Error (MAE) measured with the dense depth maps. Here, the error rate is defined as per the literature [47] and is the percentage of stereo disparity outliers that have errors greater than or equal to three pixels. The proposed method outperformed the compared methods in terms of MAE. Moreover, although the proposed method is a non-learning method, it demonstrated a competitive error rate compared to the supervised stereo-aided depth completion [42], [28].
In addition, Table I demonstrates the general advantage of LiDAR-aided methods, including our method, over stereo-only methods [30], [31] in terms of depth accuracy. Furthermore, Fig. 5 shows the advantage of LiDAR-aided methods in challenging conditions such as repetitive patterns, low texture, discontinuity, and specular reflection.
2) Long-range accuracy: Figure 6 shows a breakdown of MAE against the depth range compared to Cheng's method [28]. As shown, the difference in MAE between the proposed method and Cheng's method [28] increased as the distance increased. Cheng's method estimates depth by pixel disparity, and its depth precision decreases as the distance increases. In contrast, the proposed method is based on the selection of LiDAR depths and does not lose depth precision in the long range. The accuracy in the long range is also visible in the point clouds in Fig. 4. The method of Cheng et al. [28] lost the shapes of background objects, e.g., the poles, walls, and cars, whereas the shapes of these objects were retained in the results of the proposed method.
3) Processing time: Table I also shows the processing time in our environment, a laptop computer with an Intel Core i9 and a GeForce RTX 2080. Although the processing time of our method is not real-time, unlike methods [30], [3], [8], our method performed faster than the state-of-the-art non-learning stereo matching [31] and stereo-aided depth completion [28].

B. Evaluations with calibration errors
We evaluated the proposed method with LiDAR-camera extrinsic calibration errors. Here, we applied random errors to the KITTI and Komaba datasets. This comparison was performed against unsupervised methods [2], [3], [8]. Supervised stereo-LiDAR fusion methods were not applied because accurately calibrated scans for training are not available. In this evaluation, the parameter settings were the same as those discussed in Section IV-A, except for the candidate search radius, which was set to r = 15 [pixel].

1) KITTI dataset: We applied the following three error types to the KITTI dataset used in Section IV-A.
• blueprint represents the extrinsic parameters before calibration, derived from the sensor setup blueprint of the KITTI dataset.
• error-1 represents parameters calibrated by a single-frame marker-less method from an initial value with a small error.
• error-2 represents parameters calibrated by the same method from an initial value with a large error.
The single-frame marker-less calibration method used to derive error-1 and error-2 is explained in the supplementary material. The intention of error-1 and error-2 is to emulate the worst-case calibration error expected in practical cases. Table II shows the details of the errors, and Fig. 7 visualizes the calibration errors in the input data. Note that the KITTI data with calibration errors also contain mis-projection caused by temporal and spatial occlusions in the original KITTI dataset.
Table III shows the results obtained on the KITTI dataset. The proposed method outperformed the baselines under all experimental conditions. Moreover, compared with the results in Table I (the same dataset with accurate calibration), the proposed method outperformed Park's method [42] in terms of error rate and Cheng's method [28] in terms of MAE, even though it was applied to data with calibration errors. These results indicate that the proposed method is robust to LiDAR-camera extrinsic calibration errors. Figure 8 shows the results and error maps, indicating that the proposed method successfully densified the depth of thin objects, e.g., poles, even though the LiDAR points were not projected onto these thin objects in the image.
2) Komaba dataset: The Komaba dataset was introduced in the literature [38] and has been used in a previous study [8]. An example frame of the Komaba dataset is shown in the supplementary material. This dataset comprises five frames of data, each including a motion stereo image pair and a dense depth map captured by a FARO FocusS 150. The motion between two scans is estimated by aligning the LiDAR point clouds. There is no spatial or temporal displacement between the camera and LiDAR; thus, occlusions are not expected in the Komaba dataset.
To create the input sparse depth maps, we sampled the original dense depth maps and applied randomly generated calibration errors (Table II). Here, three sampling patterns were applied to simulate different LiDAR resolutions.
• lines-16 sampled 16 scanlines.
• lines-32 sampled 32 scanlines.
• lines-64 sampled 64 scanlines. This condition is similar to the KITTI dataset, which has 64 scanlines.
The density and mis-projection of the input data are visualized in Fig. 9. Table IV shows the results obtained on the Komaba dataset. Here, we evaluated the inverse MAE (iMAE) rather than the error rate because the Komaba dataset does not provide ground-truth disparity maps; the iMAE evaluates the accuracy of the inverse depth, which is proportional to the disparity. The proposed method outperformed the baselines under all experimental conditions. However, we observed a performance degradation with the proposed method as the number of scanlines decreased. This degradation occurred because an appropriate depth value is less likely to be found near the target pixel when the scanlines are sparse. The relationship between the number of scanlines and performance can be visually confirmed in Fig. 10.
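As a hedged sketch of the sampling step above (not the exact procedure of our experiments; the random calibration errors of Table II are omitted), a dense depth map can be reduced to a given number of scanlines by keeping evenly spaced image rows:

```python
import numpy as np

def sample_scanlines(dense_depth: np.ndarray, n_lines: int) -> np.ndarray:
    """Keep n_lines evenly spaced image rows; zero (invalid) elsewhere."""
    h, _ = dense_depth.shape
    rows = np.linspace(0, h - 1, n_lines).round().astype(int)
    sparse = np.zeros_like(dense_depth)
    sparse[rows] = dense_depth[rows]
    return sparse

# Toy dense depth map (all entries valid, i.e., > 0).
dense = np.random.rand(64, 128).astype(np.float32) + 0.1
for n in (16, 32, 64):
    density = (sample_scanlines(dense, n) > 0).mean()
    print(f"lines-{n}: valid-pixel density {density:.3f}")
```

With a 64-row map, lines-16 keeps one quarter of the rows. Note that a real LiDAR projects scanlines as curves rather than straight image rows, so this is only an approximation of the sampling patterns used here.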

C. Parameter study
We evaluated the effect of the value of r on MAE using the KITTI dataset (Section IV-A), the KITTI dataset under the blueprint condition, and the Komaba dataset under the lines-64 condition (Section IV-B). The results are shown in Table V.
We derived r* for each dataset using Eq. (26); note that, in this case, we ignored σ_scan because occlusion was not expected. As shown in Table V, the derived r* resides close to the r value that gives the minimum MAE for every dataset. Hence, the results support our approach to setting r in Section III-D.

V. CONCLUSION
We proposed a non-learning stereo-aided depth completion method that is robust to mis-projection and preserves LiDAR precision in the long range. Unlike previous methods, our method does not require accurate LiDAR-stereo extrinsic calibration parameters in any part of its process; therefore, it is applicable in conditions where such calibration is difficult to conduct. In the evaluations, our method demonstrated smaller MAEs than previous state-of-the-art stereo-aided depth completion methods.
Our proposal comprises SSM and a framework combining SSM with B-ADT-aided smoothing. SSM searches for an optimal depth value for each pixel from its neighborly projected LiDAR points via an energy minimization approach, which can handle any type of mis-projection. In addition, because SSM is a discrete optimization, we apply B-ADT-aided smoothing [8] to generate boundary-preserving continuous depth maps.
The current limitations of the proposed method include the accuracy dependency on the LiDAR scan density, as demonstrated by the evaluation discussed in Section IV-B.
We aim to extend our approach to run in real time so that it can be applied to actual robotic systems. Since most of the processing time is spent on the LBP in SSM candidate selection (0.943 of 0.999 [s]), we will consider improving the selection process so that faster optimizers can be adopted.

A-2. LIDAR-STEREO EXTRINSIC CALIBRATION
Here, we explain the single-frame marker-less calibration method used to create error-1 and error-2 in our evaluation with calibration errors. We intended to emulate the maximum error expected in practical situations; hence, we extrinsically calibrated the LiDAR and camera using the same inputs as our depth completion, i.e., a single pair of images and a single LiDAR scan. Specifically, we search for the rotation matrix R* and translation vector t* that minimize the cost L from the candidate set of rotation matrices R = {R_i | i = 0, 1, ..., N_R − 1, R_i ∈ SO(3)} and translation vectors T = {t_i | i = 0, 1, ..., N_t − 1, t_i ∈ R^3} in a brute-force manner, where N_R and N_t are the numbers of candidates, and R and T are generated around the initial values via grid sampling. For each combination of R ∈ R and t ∈ T, we first project each LiDAR point p ∈ P onto the left image by applying R and t to derive the set of image locations and depths of the points (P_Rt). Here, W and H are the width and height of I_1, respectively.
To construct the cost, we introduce the L_x^B term to prevent LiDAR points from being projected onto areas where no object exists. Using the notation of the stereo matching cost L_x(d) in Eq. (8) of the main paper, L_x^B is defined as follows. Note that L_x^B increases if a LiDAR point is projected onto an image location that appears to have infinite depth (or zero disparity).
The cost L_calib for calibration is then defined as the sum of L_x(d) and L_x^B over the points in P_Rt. Table A-1 shows the extrinsic parameter errors after calibration with and without the L_x^B term. The results indicate that the L_x^B term has little effect when the initial calibration error is small, as in error-1; however, it stabilizes the calibration when the initial error is large, as in error-2.
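The brute-force search described above can be outlined as follows. This is an illustrative sketch under simplifying assumptions: `point_cost` stands in for the per-point cost L_x(d) + L_x^B, a plain pinhole projection with intrinsics K replaces the actual camera model, and the per-pose cost is normalized by the number of projected points (a choice made for this sketch, not taken from our method).

```python
import itertools
import numpy as np

def project(points, K, R, t):
    """Project Nx3 LiDAR points into the image with pose (R, t)."""
    cam = points @ R.T + t               # LiDAR frame -> camera frame
    z = cam[:, 2]
    valid = z > 1e-6                     # keep points in front of the camera
    uvw = cam[valid] @ K.T
    uv = uvw[:, :2] / uvw[:, 2:3]        # perspective division
    return uv, z[valid]

def calibrate(points, K, w, h, rotations, translations, point_cost):
    """Brute-force search over candidate (R, t) pairs minimizing the cost."""
    best, best_cost = None, np.inf
    for R, t in itertools.product(rotations, translations):
        uv, z = project(points, K, R, t)
        inside = ((uv[:, 0] >= 0) & (uv[:, 0] < w) &
                  (uv[:, 1] >= 0) & (uv[:, 1] < h))
        if not inside.any():
            continue
        cost = sum(point_cost(u, v, d)
                   for (u, v), d in zip(uv[inside], z[inside]))
        cost /= inside.sum()             # normalization: an assumption of this sketch
        if cost < best_cost:
            best, best_cost = (R, t), cost
    return best, best_cost
```

The grid-sampled candidate sets R and T are passed in as `rotations` and `translations`; since the search is exhaustive, its cost grows as N_R × N_t, which is why the candidate grids must stay coarse for single-frame calibration.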