Learning Depth for Scene Reconstruction Using an Encoder-Decoder Model

Depth estimation has received considerable attention and is often applied to visual simultaneous localization and mapping (SLAM) for scene reconstruction. To our knowledge, monocular depth estimation-based SLAM still lacks sufficiently reliable depth because previous depth estimation methods rarely re-exploit new image features effectively, easily lose local features, and readily ignore relative depth relationships among depth pixels. With inaccurate monocular depth estimation, SLAM still faces scale ambiguity problems. To achieve accurate scene reconstruction based on monocular depth estimation, this paper makes three contributions. (1) We design a depth estimation model (DEM), consisting of a precise encoder to re-exploit new features and a decoder to learn local features effectively. (2) We propose a loss function using the depth relationship of pixels to guide the training of DEM. (3) We design a modular SLAM system containing DEM, feature detection, descriptor computation, feature matching, pose prediction, keyframe extraction, loop closure detection, and pose-graph optimization for pixel-level scene reconstruction. Extensive experiments demonstrate that the DEM and DEM-based SLAM are effective. (1) On public datasets, our DEM predicts more reliable depth than state-of-the-art methods when inputs are RGB images, sparse depth, or the fusion of both. (2) The DEM-based SLAM system achieves accuracy comparable to that of well-known modular SLAM systems.


I. INTRODUCTION
As an important computer vision topic, depth estimation focuses on predicting depth from RGB images (monocular images), sparse depth, or the fusion of both (RGBd) [1]. The predicted depth is often used for various tasks, such as visual simultaneous localization and mapping (SLAM), robot localization, obstacle avoidance, and semantic segmentation [2]. When depth prediction is applied to SLAM, monocular camera-based SLAM becomes a low-cost and attractive alternative to RGBD camera-based SLAM. Essentially, depth estimation creates a virtual RGBD sensor for monocular SLAM, helping SLAM contribute to industries such as robotics and self-driving cars.
The associate editor coordinating the review of this manuscript and approving it for publication was Jenny Mahoney.
[...] by loss functions combining relative depth relationship with ground-truth metric depth. The relative depth relationship can help loss functions minimize prediction error and penalize larger outliers in depth estimation. To obtain reliable depth results using the relative depth relationship, a loss function needs to be redesigned.
Monocular depth estimation often provides depth for SLAM to reconstruct scenes, as in studies [10], [11], but inaccurate monocular depth estimation brings more error to SLAM. In addition to this problem, other challenges still exist in depth estimation-based SLAM. Specifically, Tateno et al. [10] only refined depth predictions from high-gradient pixels, yet depth predictions of low-gradient pixels also need to be improved to reconstruct reliable scenes. Luo et al. [11] required images with a fixed baseline and handled only horizontal camera motion. Tang et al. [12] faced multi-frame matching setup problems and had difficulty in practical application. Therefore, to better perform monocular reconstruction without issues such as multi-frame matching, a plug-and-play SLAM should be developed that alleviates scale ambiguity.
This paper proposes a dependable encoder, decoder, loss function, and SLAM system. The encoder is designed with dual path networks (DPNs) [13] to reuse and re-exploit features. Specifically, DPNs inherit advantages of ResNets [9] and DenseNets [14] by respectively adopting their residually and densely connected paths. Through the altered DPN structure, our encoder can re-explore new features flexibly. The decoder is proposed with transposed convolution and convolution layers to recover details lost by linear interpolation in existing decoders. Our decoder can learn dense depth maps accurately.
To further boost depth estimation accuracy, we develop a loss function by using relative relationships among depth points in RGB images, guiding the training of the encoder-decoder model (DEM). The loss function encourages our predicted depth to agree with ground-truth depth. Relying on DEM, monocular SLAM is designed, consisting of eight independent modules: DEM, feature detection, descriptor computation, feature matching, pose prediction, keyframe extraction, loop closure detection, and pose-graph optimization. These modules are easy-to-use and plug-and-play.
Experiments show that the DEM and DEM-based SLAM are more accurate than state-of-the-art methods under the same conditions. For example, the RMSE of DEM is 17.4% better than that of the representative work [15] with monocular RGB inputs from the NYU-Depth-v2 dataset (https://cs.nyu.edu/~silberman/datasets/nyu_depth_v2.html). The RMSE of DEM is 29.3% lower than that of the classic method [8] with RGB inputs from the KITTI dataset (http://www.cvlibs.net/datasets/kitti). Our RMSE is 18.1% lower than that of the state of the art [16] with RGBd-500 inputs (each RGB input image contains 500 valid depth samples) on KITTI. As shown in Fig. 1(c), pixel-level depth maps are estimated from RGB images by DEM. The prediction of DEM is identified more easily than that of the study [8] in Fig. 1(b). Based on depth estimated by DEM, our SLAM achieves greater improvement in accuracy on the TUM dataset than others such as SLAM [10]-[12], [17], [18]. Additionally, our SLAM is low-cost as compared with RGBD camera-based SLAM [19], [20]. The results demonstrate that the DEM and DEM-based SLAM are effective.
To summarize, our main contributions are as follows.
• First, we propose the DEM architecture containing a reliable encoder and decoder. Specifically, the encoder effectively reuses and re-explores features relying on existing CNNs. The decoder learns local features which are easily lost by linear interpolation in other decoders. Our decoder is capable of combining different encoders and consistently improves depth estimation performance.
• Second, a loss function is designed to guide the training of DEM. Our results verify that the loss function achieves comparable accuracy, whether inputs are RGB images or sparse depth. Additionally, we deploy and evaluate DEM on embedded devices. Here, DEM satisfies real-time (9 frames per second) and low-power (6.9 W) constraints with RGB image inputs of size 480 × 640 on embedded devices.
• Third, relying on DEM, we develop a low-cost SLAM system including plug-and-play modules for effective scene reconstruction. The DEM-based SLAM estimates real object scales and reconstructs scenes reliably, although using monocular images. Our SLAM achieves more promising results than others on the TUM dataset. We confirm the DEM-based SLAM is useful to researchers who reconstruct scenes with monocular cameras because DEM is precise, monocular cameras employed by our SLAM are generally cheaper than RGBD cameras, and our SLAM modules are easy-to-use. To our knowledge, the gap between depth prediction and its application in SLAM is reduced by this work.
VOLUME 8, 2020

II. RELATED WORK
In this section, we overview recent studies that are most relevant to our work on two subjects: depth prediction and depth estimation-based SLAM systems.
Previous research on depth estimation can be divided into three categories: RGB-based, multimodal data-based, and sparse depth-based depth estimation. The RGB-based depth estimation often relies on traditional machine learning methods or CNNs. For example, Saxena et al. [21], [22] used traditional Markov random fields to predict depth. Karsch et al. [23] leveraged a non-parametric approach to estimate depth of dynamic foreground objects in video and static backgrounds in a single image. These pioneering studies laid the foundation for RGB-based depth estimation. Eigen et al. [4] first used CNNs to infer depth from RGB images. Then, Eigen and Fergus [24] extended two-scale networks to three-scale networks for depth prediction. Roy and Todorovic [25] presented random forests and CNNs to infer depth. Wang et al. [26] employed perceptual losses to estimate depth.
Some work [16], [27] focused on fast depth estimation, but these methods still have a scale ambiguity problem because they improve inference speed at the cost of accuracy. Other research estimated depth with ResNets, such as [5], [6], [8], [15], [28], [29], [30]-[33], [34]. In these studies, ResNets effectively reused features through residually connected paths. Different from ResNets, DenseNets [14] efficiently re-explored new features by densely connected paths. Then, inheriting the backbone architecture of ResNets and DenseNets, DPNs [13] simultaneously reused and re-explored features with residually and densely connected paths. In general, effective feature exploration and utilization in DPNs can bring more accuracy to depth estimation, but few methods leverage DPNs to predict depth.
Multimodal data-based depth estimation commonly uses inputs containing two or three modalities of data [7], [8], [35]-[37]. The method [7] converted depth estimation into distance prediction between reference and true depth maps, performing more effectively than the depth prediction [35]. Wang et al. [36] inferred depth by iteratively changing intermediate representation in pre-trained depth estimation models. Li et al. [37] employed depth samples and RGB images to estimate depth. Sparse depth-based depth prediction also used deep learning models [3], [8], [38] to predict depth. Specifically, Chodosh et al. [3] employed alternating direction neural networks and compressed sensing techniques to extract features. The research [8], [38] selected CNNs to recover dense depth maps with linear interpolation, which easily loses depth features. These problems still need to be addressed for effective depth estimation and depth estimation-based SLAM.
Existing visual SLAM often relies on data from stereo cameras, monocular cameras [10], [11], [17], [18], [39], or RGBD cameras [12], [19], [20]. In research [10]-[12], [20], [39], the SLAM systems all employ deep learning techniques to extract features. Specifically, in approaches [12], [20], keypoints and descriptors are learned by CNN and RNN (recurrent neural network) for SLAM. In methods [10], [11], [39], only CNN was integrated into SLAM to improve scene reconstruction. By using CNN-based depth estimation, the studies [10], [11], [39] focused on issues in conventional triangulation and consecutive view matching. For example, Yang et al. [39] leveraged a monocular camera to reconstruct dense maps with generative adversarial networks. Tateno et al. [10] used CNN to estimate depth maps, which were adopted as the initial guess of keyframes. Then, they revised depth through triangulation and high-gradient pixel matching, but depth estimation of low-gradient pixels is unrefined in CNN-SLAM [10]. Luo et al. [11] fused online-adapted CNN with direct monocular SLAM. The work [11] alleviated scale ambiguity and low map completeness, but it is only applicable to horizontal motion of cameras. To solve these problems and further boost the performance of monocular camera-based SLAM, we propose the depth estimation-based SLAM system for precise scene reconstruction.

III. METHOD
To predict depth maps accurately, we design the depth estimation model (DEM) in Section III-A. In Section III-B, a loss function is proposed to guide the training of DEM. In Section III-C, we present the SLAM system based on DEM.

A. DEM
To improve depth estimation, we develop the DEM architecture containing an encoder and a decoder. Here, the capabilities of DEM are verified by three modalities of data as inputs, namely, RGB images, sparse depth, or RGBd data. The type of inputs is the same for model training and inference. The DEM architecture with RGB inputs is displayed as an example in Fig. 2.
Encoder: The encoder is implemented to extract image features. We investigate classic models, including different layers of CNNs, such as ResNet-18, ResNet-34, ResNet-50, DenseNet-121, DPN-68, DPN-92, and DPN-131. In these CNNs, ResNet-50 is employed as an encoder by famous studies [6], [8], [32], [33]. Although few methods use DPNs to predict depth, we modify DPN-92 as the encoder of DEM after considering the tradeoff between multiply-and-accumulate operations and accuracy of DPNs. The initial DPNs [13] are only proposed based on RGB inputs. To process different modalities of inputs, we alter the structure of DPN-92. Additionally, the original DPNs are usually used for three tasks: semantic segmentation, image classification, and object detection. For depth estimation tasks in this paper, we modify DPN-92 to achieve encoding effectively.
DPN-92 is changed as an encoder in four steps. 1) The first layer in DPN-92 is modified with configuration parameter options. Thus, DPN-92 can learn features from RGB images, sparse depth, or RGBd data. The parameter options change with the type of inputs. 2) To connect decoders and predict pixel-level depth maps, we remove the last classifier layer in DPN-92. 3) DPN-92 is followed by a convolution with a kernel size of 1 × 1 and batch normalization (BN) layers. 4) The pre-trained model of DPN-92 on the ImageNet dataset is selected to initialize the encoder of DEM. The initialization improves prediction accuracy. We will verify the benefit of the initialization in Table 4.
FIGURE 2. The DEM architecture based on RGB image inputs of size 480 × 640. Our DEM consists of an encoder (highlighted in blue) and a decoder (highlighted in red). The training images are augmented to generate RGB images of size 228 × 304. In this figure, a cube represents a feature map. The dimension of each feature map is denoted as #features@height×width. In the encoder, a dotted box means a block in DPN-92, such as block1, block2, block3, and block4. The dotted line in a dotted box represents the shortcut connection between blocks. Concat among dotted boxes means the concat of feature maps. The decoder includes four upsampling layers, two 7 × 7 convolution layers, a 7 × 1 convolution layer, and a 4 × 4 transposed convolution layer. The output of DEM is a pixel-level depth map of size 1@228 × 304.
Decoder: To output high-resolution and precise depth maps, we propose a decoder, as shown in the red part of Fig. 2. The decoder is designed in the following steps. 1) To upsample feature maps from encoders, we use four upsampling layers consisting of transposed convolutions. 2) To further extract features, three convolution layers are employed. These convolution layers also adjust the size of depth images to produce feature maps of size 114 × 152. 3) To generate pixel-wise depth images of size 228 × 304, a transposed convolution layer is leveraged. In dense depth prediction, our decoder is superior because the pixel-level depth maps are learned by transposed convolution and convolution layers. By contrast, state-of-the-art methods [5], [8], [32] use decoders that rely on linear interpolation.
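The size change in step 3 can be checked with the usual output-size rule for a transposed convolution. The following is a minimal sketch, assuming a stride of 2 and a padding of 1 for the 4 × 4 transposed convolution; these two hyperparameters are not stated in the text and are chosen here only because they map 114 × 152 to 228 × 304. The helper name transposed_conv_out is hypothetical.

```python
def transposed_conv_out(size, kernel, stride, padding, output_padding=0):
    """Output length of a transposed convolution along one axis (dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


# Feature maps entering the last decoder layer, as given in the text.
in_h, in_w = 114, 152

# ASSUMPTION: stride 2 and padding 1 (not stated in the paper text).
out_h = transposed_conv_out(in_h, kernel=4, stride=2, padding=1)
out_w = transposed_conv_out(in_w, kernel=4, stride=2, padding=1)

print(out_h, out_w)  # 228 304
```

Under these assumed settings the layer exactly doubles each spatial dimension, which is consistent with the 1@228 × 304 output reported for DEM.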
In decoders that use linear interpolation, local image features are easily lost, and image features are rarely learned and extracted. These two disadvantages arise because linear interpolation fits curves with linear polynomials. For example, simple linear interpolation obtains the depth value y at a position x on the straight line determined by two known depth points (x_0, y_0) and (x_1, y_1), where x lies in the interval [x_0, x_1]. Similarly, bilinear interpolation adopted in existing decoders is essentially linear interpolation in two directions. In linear interpolation methods, the interpolated depth value is therefore not obtained by learning or leveraging local features. In contrast, our decoder learns its upsampling with transposed convolution and convolution layers, which addresses the above issue of losing local features easily and extracting no features.
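To make the point about interpolation concrete, the following minimal sketch implements plain linear and bilinear interpolation on a toy depth grid. It is an illustration only, not code from any of the cited decoders; the function names and the toy values are hypothetical. Note that neither function has a trainable parameter, so nothing about the image content can be learned at this stage.

```python
def lerp(x, x0, y0, x1, y1):
    """Depth y at position x on the line through (x0, y0) and (x1, y1)."""
    return y0 + (x - x0) * (y1 - y0) / (x1 - x0)


def bilinear(depth, row, col):
    """Bilinear sample of a 2D depth grid at a fractional (row, col)."""
    r0, c0 = int(row), int(col)
    r1 = min(r0 + 1, len(depth) - 1)
    c1 = min(c0 + 1, len(depth[0]) - 1)
    fr, fc = row - r0, col - c0
    # Linear interpolation along the columns, then along the rows.
    top = depth[r0][c0] * (1 - fc) + depth[r0][c1] * fc
    bottom = depth[r1][c0] * (1 - fc) + depth[r1][c1] * fc
    return top * (1 - fr) + bottom * fr


coarse = [[1.0, 3.0],
          [5.0, 7.0]]

print(lerp(0.5, 0.0, 1.0, 1.0, 3.0))  # 2.0
print(bilinear(coarse, 0.5, 0.5))     # 4.0
```

The interpolated value is always a fixed weighted average of its neighbors, which is why sharp local structure is smoothed away.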
The following experiments in Table 2 will demonstrate the effectiveness of our decoder.

B. LOSS FUNCTION
To predict precise depth, we propose a loss function L to guide the training of DEM. In general, it is challenging to estimate depth from a monocular image because of scale ambiguity. Although the depth of objects in an image is ambiguous, the depths of different points have a relative relationship, and this relationship is easy to obtain. Therefore, we use the relationship among sparse depth samples to design the loss function L, which combines two parts, L_1 and R. The first part L_1 is
$$L_1 = \frac{1}{N} \sum_{s=1}^{N} \left| P_s - T_s \right|,$$
where N denotes the number of pixels in a depth map; P_s and T_s (s = 1, 2, 3, ..., N) represent depth values of pixels in predicted and ground-truth maps, respectively. The second part R is based on relationships among depth samples. Specifically, we randomly combine two depth points in the set D consisting of sparse depth samples in a depth image. Then, we obtain a combination of depth points from the set D. The second part R has three advantages. 1) R enhances the effect of L_1. Generally, R minimizes prediction error. 2) R pushes predicted depth closer to true depth. 3) R assigns a large penalty to large outliers and still penalizes small outliers.
By combining R and L 1 , we obtain the loss function L. Experimental results will demonstrate that L outperforms other loss functions. In essence, L performs well by taking advantage of L 1 and R. Specifically, the loss function L has two benefits as follows.
First, our loss function L encourages estimated depth to agree with the ground truth. The first part of L is effective. The second part of L fully uses relative relationships among different sparse depth points. The acquired relationships are approximately uniform because we obtain sparse depth points randomly and automatically. Through the use of relative relationships, L is robust and reduces the impact of outliers.
Second, our loss function L outperforms other loss functions, such as L_2 and berHu. As a common loss function for regression problems, L_2 may adjust models based on outliers, which caused poor depth estimation in previous research. In contrast, our loss function L is reliable. To demonstrate the effectiveness of our loss function L, we present an empirical study. The results are shown in Table 3.
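The structure of such a loss can be sketched as follows. This is a minimal illustration under loudly stated assumptions, not the paper's definition: the exact form of R and the way it is combined with L_1 are not recoverable from this text, so the sketch assumes that R is the mean mismatch of pairwise depth differences over randomly drawn pairs and that L is the unweighted sum of the two parts. All function names are hypothetical.

```python
import random


def l1_term(pred, target):
    """Mean absolute error over all pixels (flattened depth maps)."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def relative_term(pred, target, num_pairs, rng):
    """ASSUMED form of R: mean mismatch between predicted and true
    depth differences over randomly drawn pairs of depth points."""
    n = len(pred)
    total = 0.0
    for _ in range(num_pairs):
        j, k = rng.sample(range(n), 2)
        total += abs((pred[j] - pred[k]) - (target[j] - target[k]))
    return total / num_pairs


def loss(pred, target, num_pairs=4, seed=0):
    """ASSUMED combination: unweighted sum of the two parts."""
    rng = random.Random(seed)
    return l1_term(pred, target) + relative_term(pred, target, num_pairs, rng)


target = [1.0, 2.0, 3.0, 4.0]
shifted = [t + 0.5 for t in target]  # constant offset, relative depth intact

print(loss(target, target))                                 # 0.0
print(l1_term(shifted, target))                             # 0.5
print(relative_term(shifted, target, 4, random.Random(0)))  # 0.0
```

In this assumed form, a constant offset is penalized only by the first part, while errors that distort the ordering or spacing of depth points are penalized by both parts.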

C. SLAM SYSTEM BASED ON DEM
Our feature-based SLAM alleviates inherent ambiguity problems in monocular camera-based SLAM. As shown in Fig. 3, each box denotes an independent module in SLAM. The details are as follows. Single RGB images are captured by a monocular camera. The RGB images are used by DEM (pretrained with RGB images) to predict depth maps. Through DEM, the real scales of objects are obtained. The keypoints in the RGB images are detected by the ORB algorithm in OpenCV (https://opencv.org). Based on the detected keypoints, the BRIEF descriptors [40] are computed. Through the modules of feature detection and descriptor computation, we obtain feature points. In the feature matching module, the BRIEF descriptors are matched between two frames of RGB images. Then, we find the minimum distance of the BRIEF descriptors. If the distance of the BRIEF descriptors is less than six times the minimum distance, the matches are considered good matches. Relying on the good matches of feature points, we obtain the feature points' 3D (three-dimensional) positions by using depth maps estimated by DEM. The 3D positions and pixel coordinates of feature points are the inputs of the pose prediction module. The pose prediction module is performed as shown in Fig. 3. First, poses of monocular images are estimated by the random sample consensus Perspective-n-Point (RANSAC PnP) algorithm in OpenCV. Second, we constitute a nonlinear least-squares problem. To optimize the nonlinear least-squares problem, we use G2o [41] to minimize two-dimensional (2D) reprojection errors. The minimization of 2D reprojection errors can generate poses of monocular images. To help accurate pose estimation of monocular images, the results of the RANSAC PnP algorithm are employed as the initial values of the pose optimization in G2o, as shown in Algorithm 1. The initial values in G2o boost the optimization quality of poses. We will describe how to constitute least-squares problems using 2D reprojection errors and optimize the nonlinear least-squares problem by minimizing 2D reprojection errors.
Algorithm 1
Input: The pixel and 3D coordinates of feature points in good matches.
Output: The frame f2's optimized pose T defined by G2o and the number of inliers h.
1: Relying on the 3D and pixel coordinates of feature points, obtain poses using the RANSAC PnP algorithm in OpenCV, and then acquire the 3D rotation vector r, the 3D translation vector t, and the number of inliers h;
2: Create a class "EdgeProjectXYZ2UVPose" relying on Equation (6) [...]
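The good-match rule described above (keep a match if its descriptor distance is below six times the minimum distance) can be sketched as follows. This is a minimal illustration in plain Python, assuming binary descriptors stored as integers and compared by Hamming distance; it does not use the OpenCV matcher that the system relies on. The function names are hypothetical, and the floor of 1 on the minimum distance is an added assumption so that an exact match does not make the threshold zero.

```python
def hamming(a, b):
    """Number of differing bits between two binary descriptors."""
    return bin(a ^ b).count("1")


def good_matches(desc1, desc2, ratio=6):
    """Nearest-neighbor matching followed by the
    'distance < ratio * minimum distance' filter."""
    matches = []
    for i, d1 in enumerate(desc1):
        j = min(range(len(desc2)), key=lambda k: hamming(d1, desc2[k]))
        matches.append((i, j, hamming(d1, desc2[j])))
    min_dist = min(m[2] for m in matches)
    # ASSUMPTION: floor the minimum distance at 1 to avoid a zero threshold.
    min_dist = max(min_dist, 1)
    return [m for m in matches if m[2] < ratio * min_dist]


# Toy 16-bit descriptors from two frames.
frame1 = [0x0001, 0x00FC, 0xFF0F]
frame2 = [0x0000, 0x00FF]

print(good_matches(frame1, frame2))  # [(0, 0, 1), (1, 1, 2)]
```

The third descriptor in frame1 is 12 bits away from its nearest neighbor, which exceeds six times the minimum distance of 1, so that match is discarded.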
Following the method [19], we constitute the nonlinear least-squares problem. Suppose that the 3D positions of n feature points in good matches are denoted as F_i = [X_i, Y_i, Z_i]^T, i = 1, 2, 3, ..., n. The feature points' pixel coordinates are set to g_i = [s_i, t_i]^T, i = 1, 2, 3, ..., n. The Lie algebra of an image pose is defined as ξ. The scale factor and camera intrinsics are l_i and K. Then, the relationship between the feature point's 3D position F_i and the feature point's pixel coordinate g_i is given as
$$l_i \begin{bmatrix} s_i \\ t_i \\ 1 \end{bmatrix} = K \exp(\xi^{\wedge}) \begin{bmatrix} X_i \\ Y_i \\ Z_i \\ 1 \end{bmatrix}. \tag{4}$$
The homogeneous coordinate of the feature point's 3D position F_i is represented as [X_i, Y_i, Z_i, 1]^T. Writing Equation (4) in compact form, with the conversion to and from homogeneous coordinates left implicit, we obtain
$$l_i g_i = K \exp(\xi^{\wedge}) F_i. \tag{5}$$
The left and right sides of Equation (5) differ by a 2D reprojection error because poses of monocular images are unknown, and noise exists among different image observation points. We represent the process of summing the 2D reprojection errors, constituting a least-squares problem, and finding the relatively optimal pose which minimizes the 2D reprojection errors as follows:
$$\xi^{*} = \arg\min_{\xi} \frac{1}{2} \sum_{i=1}^{n} \left\| g_i - \frac{1}{l_i} K \exp(\xi^{\wedge}) F_i \right\|_2^2. \tag{6}$$
Through the minimization of 2D reprojection errors, the least-squares problem is optimized, as presented in Algorithm 1. Here, G2o [41] minimizes the 2D reprojection error by setting vertexes and edges. The Levenberg-Marquardt algorithm [42] is selected as the optimization algorithm in G2o. We perform G2o optimization for 10 iterations. When completing G2o optimization, we acquire reliable poses between frames.
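The quantity being minimized can be illustrated with a small cost function. The following is a minimal sketch, assuming a pinhole camera with intrinsics (fx, fy, cx, cy) and a pose given directly as a rotation matrix and translation vector instead of a Lie algebra element; it only evaluates the summed 2D reprojection error for a given pose and does not perform the G2o optimization. All names and numeric values are hypothetical.

```python
def project(point, rotation, translation, fx, fy, cx, cy):
    """Pinhole projection of a 3D point after applying the pose (R, t)."""
    x, y, z = (
        sum(rotation[r][c] * point[c] for c in range(3)) + translation[r]
        for r in range(3)
    )
    # Dividing by z plays the role of the scale factor of each point.
    return (fx * x / z + cx, fy * y / z + cy)


def reprojection_cost(points, pixels, rotation, translation, intrinsics):
    """Half the sum of squared 2D reprojection errors over all matches."""
    cost = 0.0
    for p, (u, v) in zip(points, pixels):
        pu, pv = project(p, rotation, translation, *intrinsics)
        cost += 0.5 * ((u - pu) ** 2 + (v - pv) ** 2)
    return cost


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
intrinsics = (500.0, 500.0, 320.0, 240.0)  # hypothetical fx, fy, cx, cy
points = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0)]
pixels = [(320.0, 240.0), (570.0, 240.0)]

# The true pose reproduces the observed pixels exactly.
print(reprojection_cost(points, pixels, identity, (0.0, 0.0, 0.0), intrinsics))
# A pose that is off by 0.1 along x yields a positive cost.
print(reprojection_cost(points, pixels, identity, (0.1, 0.0, 0.0), intrinsics))
```

An optimizer such as Levenberg-Marquardt searches for the pose that drives this cost toward its minimum, starting from the RANSAC PnP estimate.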
The modules of feature detection, descriptor computation, feature matching, and pose prediction make up the front-end of our SLAM, namely, visual odometry (VO). As part of our SLAM, the VO estimates the poses of monocular images through G2o optimization, but the VO only focuses on local consistency of camera trajectories. To achieve a globally consistent scene reconstruction, a complete SLAM system is needed as shown in Fig. 3.
Our SLAM system contains VO, keyframe extraction, loop closure detection, and pose-graph optimization. 1) The keyframe extraction selects keyframes to rebuild point cloud maps of scenes. A map does not need to be built from every frame because the relative motion distance between consecutive frames is small. 2) The loop closure detection identifies drift in SLAM by recognizing whether a previously visited area in a map is revisited. 3) The pose-graph optimization minimizes drift and optimizes keyframe poses for consistent maps. By using the above three modules, global consistency is guaranteed in our SLAM.
Algorithm 2
Input: A frame sequence A with n frames.
Output: The keyframe sequence B with o keyframes.
1: o ← 0;
2: for r_i ← 1 to n do
3:   Obtain the pose T and the number of inliers h between the current frame r_i and its previous frame in the frame sequence A using Algorithm 1 based on Equation (6);
4:   Calculate the 3D rotation vector r and translation vector t relying on the pose T;
5:   Calculate the relative motion distance dis between the current frame r_i and its previous frame using r and t by Equation (7);
6:   if (h < 8 or dis < 0.1 or dis > 0.21) then
7:     continue; // Discard the current frame r_i
8:   end if
9:   o ← o + 1; // Add the current frame r_i to the keyframe sequence B
10: end for
11: return The keyframe sequence B with o keyframes.
To this end, the main difference between our SLAM and VO is threefold. 1) The keyframe extraction, loop closure detection, and pose-graph optimization are included in our SLAM, while VO does not include these. 2) The SLAM acquires a globally consistent estimate of a scene map, while VO focuses on local consistency. 3) The SLAM optimizes poses, reduces drift, and achieves global consistency without any prior information when reconstructing point cloud maps of scenes. By contrast, the VO predicts poses incrementally and does not solve a drift problem. We will describe the modules of keyframe extraction, loop closure detection, and pose-graph optimization in detail.
As shown in Algorithm 2, in the keyframe extraction module, we extract keyframes and obtain a keyframe sequence B.
Here, the frame sequence consists of n frames. The frame used in our SLAM contains a frame index r_i (i = 1, 2, 3, ..., n), an RGB map, a depth map estimated by DEM, keypoints extracted by the feature detection module, and descriptors of keypoints corresponding to the frame. Similarly, the keyframes used by our SLAM have the same structure as the frames. The first frame in the frame sequence A is put into the keyframe sequence B.
In Algorithm 2, when the number of inliers and the relative motion distance between the current frame and its previous frame in A are suitable, the current frame is added to the keyframe sequence B. The number of inliers can be obtained from Algorithm 1. The threshold on the number of inliers is set to 8 based on our experience. The relative motion distance dis is computed by the equation
$$dis = a E_r + b E_t, \tag{7}$$
where E_r represents the L2-norm of the rotation vector r, and E_t denotes the L2-norm of the translation vector t. Here, a and b denote weighting factors. We define a = 1/3 and b = 2/3 in the keyframe extraction module. The number of inliers and Equation (7) help us decide if a frame is a keyframe. Specifically, according to Equation (7), we judge whether the relative motion distance between two frames is within a certain range. According to the number of inliers, we determine whether there is enough matching accuracy between adjacent frames.
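The keyframe test can be sketched in a few lines. This is a minimal illustration, assuming that Equation (7) is the weighted sum of the two norms with weights a and b; that form is inferred from the description of E_r, E_t, a, and b, and should be treated as an assumption. The thresholds (8 inliers, distance between 0.1 and 0.21) are taken from Algorithm 2, and the function names are hypothetical.

```python
import math


def motion_distance(rvec, tvec, a=1 / 3, b=2 / 3):
    """ASSUMED form of Equation (7): dis = a * ||r|| + b * ||t||."""
    e_r = math.sqrt(sum(x * x for x in rvec))
    e_t = math.sqrt(sum(x * x for x in tvec))
    return a * e_r + b * e_t


def is_keyframe(inliers, dis, min_inliers=8, min_dis=0.1, max_dis=0.21):
    """Mirror of the rejection test in Algorithm 2."""
    return not (inliers < min_inliers or dis < min_dis or dis > max_dis)


dis = motion_distance((0.0, 0.0, 0.15), (0.15, 0.0, 0.0))

print(round(dis, 3))          # 0.15
print(is_keyframe(20, dis))   # True: enough inliers, moderate motion
print(is_keyframe(5, dis))    # False: too few inliers
print(is_keyframe(20, 0.05))  # False: motion too small
print(is_keyframe(20, 0.5))   # False: motion too large
```

The lower bound skips frames that add little new information, while the upper bound rejects pose estimates that jump implausibly far between frames.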
The loop closure detection is performed to detect accumulated scale drift, as shown in Algorithm 3. Specifically, the keyframe extraction results are used as inputs of Algorithm 3. If a keyframe in the keyframe sequence B closely matches six random keyframes in C and the last seven keyframes in C, the keyframe in B will be put into the keyframe sequence C. The matches are accepted or rejected according to the number of inliers and Equation (7). In Equation (7), we define a = 1/3 and b = 2/3 for the loop closure detection module. The threshold on the number of inliers is set to seven based on our experience. When the process in Algorithm 3 is completed, we obtain a keyframe sequence C that is more accurate than B.
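The choice of comparison candidates (the last seven keyframes plus six random ones) can be sketched as follows. This is a minimal illustration with assumptions: when fewer than seven keyframes are stored, all of them are used, as in the visible part of Algorithm 3; the random keyframes are assumed to be drawn from the keyframes older than the last seven, which the text does not state. The function name is hypothetical.

```python
import random


def loop_candidates(num_keyframes, recent=7, sampled=6, rng=None):
    """Indices of stored keyframes to compare a new keyframe against."""
    rng = rng or random.Random()
    if num_keyframes < recent:
        # Too few keyframes so far: compare against all of them.
        return list(range(num_keyframes))
    last = list(range(num_keyframes - recent, num_keyframes))
    # ASSUMPTION: random candidates come from keyframes before the last seven.
    older = list(range(num_keyframes - recent))
    picked = rng.sample(older, min(sampled, len(older)))
    return sorted(picked) + last


print(loop_candidates(3))  # [0, 1, 2]

candidates = loop_candidates(30, rng=random.Random(0))
print(len(candidates))     # 13
print(candidates[-7:])     # [23, 24, 25, 26, 27, 28, 29]
```

Comparing against recent keyframes catches short loops, while the random older keyframes give a chance of detecting revisits of areas seen much earlier.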
By using the keyframe sequence C as input, we perform pose-graph optimization with G2o. The detailed G2o optimization is presented in Algorithm 4. Based on Algorithm 4, the pose-graph optimization module optimizes a pose graph with loop closure constraints, as shown in Algorithm 5. Specifically, if a keyframe in the keyframe sequence C accurately matches six random keyframes in D and the last seven keyframes in D, the keyframe in C will be put into the keyframe sequence D. The matches are detected based on the number of inliers and Equation (7). In Equation (7), we fix a = 2/3 and b = 1/3 for the pose-graph optimization module relying on our experience. The threshold on the number of inliers is set to seven. The method to detect matched keyframes is the same as that in Algorithm 3. The matched keyframe is optimized by G2o for 80 iterations. Through the G2o optimization, loops are re-detected and corrected. After the pose-graph optimization module, we obtain the keyframe sequence D including optimized poses.
To reconstruct dense scenes, we backproject depth maps predicted by DEM from the optimized poses in the keyframe sequence D. Relying on optimized poses and accurate depth maps outputted by DEM, the point cloud map of an unknown environment is built with monocular images. The reconstruction results can be acquired without indoor or outdoor constraints through the DEM-based SLAM.
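The back-projection step can be sketched as follows. This is a minimal illustration, assuming a pinhole camera model and a camera-to-world pose given as a rotation matrix and translation vector; the toy intrinsics and depth values are hypothetical, and a real system would use the calibrated intrinsics and the optimized keyframe poses.

```python
def backproject(depth, fx, fy, cx, cy, rotation, translation):
    """Turn a depth map into world-frame 3D points.

    ASSUMPTION: (rotation, translation) is a camera-to-world pose."""
    cloud = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:  # skip pixels without valid depth
                continue
            cam = ((u - cx) * z / fx, (v - cy) * z / fy, z)
            world = tuple(
                sum(rotation[r][c] * cam[c] for c in range(3)) + translation[r]
                for r in range(3)
            )
            cloud.append(world)
    return cloud


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
depth = [[2.0, 0.0],
         [2.0, 4.0]]

# Toy intrinsics (fx = fy = 1, cx = cy = 0) and a 1 m shift along z.
cloud = backproject(depth, 1.0, 1.0, 0.0, 0.0, identity, (0.0, 0.0, 1.0))
print(cloud)  # [(0.0, 0.0, 3.0), (0.0, 2.0, 3.0), (4.0, 4.0, 5.0)]
```

Concatenating the clouds of all keyframes in D, each transformed by its optimized pose, gives the dense point cloud map.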
The DEM-based SLAM has three advantages: low manufacturing cost, high accuracy, and ease of use. 1) The SLAM is low-cost as compared with RGBD camera-based SLAM such as [19], [20]. This is because depth is acquired from DEM with monocular cameras in our SLAM. In contrast, RGBD camera-based SLAM uses RGBD cameras to obtain depth. The RGBD cameras are generally more expensive than monocular cameras. 2) Although our SLAM uses a cheap monocular camera, we alleviate absolute scale ambiguities. The initialization trouble of monocular cameras is also avoided by our SLAM. As compared with other monocular camera-based SLAM, the reconstruction accuracy is greatly improved by the DEM-based SLAM. 3) Our monocular camera-based SLAM is easy to use, requiring no additional multi-frame matching setup and imposing no indoor/outdoor limits. As described earlier, each module in our SLAM is plug-and-play, runs independently, and feeds the results to the next module directly.
Algorithm 3
Input: The keyframe sequence B with o keyframes.
Output: The new keyframe sequence C with m keyframes.
1: Create the new keyframe sequence C with m keyframes and initialize m to 0;
2: for r_i ← 0 to o do
3:   if (m < 7) then
4:     for i ← 0 to m do
5:       Obtain poses T and the number of inliers h between the frame r_i in keyframe sequence B and frame i in keyframe sequence C using Algorithm 1, and calculate the 3D rotation vector r and translation vector t based on the pose T, and then use r and t to calculate dis based on Equation (7) [...]
Algorithm 4 G2o Optimization
Input: Two frames f1 and f2.
1: Create new object pointers of a linear solver and a block matrix solver and initialize them;
2: Select the Levenberg-Marquardt algorithm [42] in G2o as the optimization algorithm;
3: Create a sparse optimizer "optimizer";
4: Create a new object pointer of a vertex belonging to the class "VertexSE3";
5: Initialize the vertex and define the vertex number as the index of frame;
6: Set the initial value of the optimization to the identity matrix; [...]

IV. EXPERIMENTS
In Section IV-A, two publicly available datasets are presented to train and test DEM. In Section IV-B, online data augmentation is shown, training DEM for further accurate depth prediction. In Section IV-C, we preprocess the datasets to obtain different modalities of inputs for algorithms. In Section IV-D, error metrics are discussed.

A. DATASETS
In this section, we describe the indoor NYU-Depth-v2 and outdoor KITTI datasets for experiments.

1) NYU-Depth-v2 DATASET
The NYU-Depth-v2 dataset is used to train and assess models. The dataset includes 120,000 pairs of RGB and depth images captured by Microsoft Kinect v1 sensors. The largest ranging distance of Microsoft Kinect v1 is 10 meters. The dataset contains 464 scenes. In general, 249 scenes are adopted for model training and 215 scenes are employed for model testing. Following the classic work [8], 47,584 pairs of RGB-depth maps on the training dataset are leveraged for model training, and the testing dataset containing 654 pairs of RGB-depth images is used to test models. For a fair comparison, the same testing dataset is adopted to evaluate all models in this paper.
Algorithm 5 Pose-Graph Optimization Using Algorithm 4
Input: The keyframe sequence C with m keyframes.
Output: Optimized poses in the keyframe sequence D.
1: Create the keyframe sequence D with s (s = 0) keyframes and perform Step 1 to Step 7 of Algorithm 4;
2: for r_i ← 0 to m do
3:   Obtain optimized poses T and the number of inliers h between the current frame r_i and its previous frame in the keyframe sequence C by using Algorithm 1, and set vertexes and edges in G2o by performing Step 8 to Step 18 of Algorithm 4;
4:   if (s < 7) then
5:     for i ← 0 to s do
6:       Perform [...]

2) KITTI ODOMETRY DATASET
The KITTI dataset is adopted to further evaluate DEM. KITTI contains depth and RGB images of urban, rural, and highway scenes. The depth images are captured by LiDARs, and their maximum ranging distance is 100 meters. The odometry benchmark on KITTI includes 22 stereo sequences: 11 sequences are used for model training, and the others are used for model testing. The resolution of training and testing images is 376 × 1241 and 370 × 1226, respectively. Following the research [8], we train DEM with 46k RGB-depth pairs from the training sequences and assess DEM with 3,200 randomly selected RGB images from the testing sequences. The same KITTI testing dataset is also used to evaluate the other models in this paper. For a fair comparison, we process the training and testing sequences on KITTI following the method [8]. First, LiDAR measurements are projected onto the corresponding RGB images in the training and testing datasets. Second, we remove the sky regions from raw images in both datasets because measurements of the sky are meaningless; this yields training and testing RGB-depth images of size 240 × 1200. Third, online data augmentation is performed on the training dataset, as described in the following section.

B. DATA AUGMENTATION
To further achieve accurate depth estimation, we adopt online data augmentation for model training. The online data augmentation does not increase the number of images on training datasets. Following the work [8], the online data augmentation consists of five steps.
• Flip: we flip depth and RGB images horizontally with a 50% probability.
• Scale: we scale up and down depth and RGB images through a random scaling factor r ∈ [1.0, 1.5].
• Translation: we translate depth and RGB images in a vertical direction.
• Rotation: we rotate depth and RGB images at the random angle a ∈ [−4.5, 4.5].
• Color Normalization: we normalize RGB images by mean subtraction and division.
After color normalization, RGB and depth images are cropped from the center to obtain training images of the same size as that of the method [8]. Specifically, on the NYU-Depth-v2 dataset, training images of size 228 × 304 are obtained through data augmentation. To keep the input size of model inference consistent with that of model training, we crop from the center of RGB maps on the NYU-Depth-v2 dataset to acquire testing images of size 228 × 304. On the KITTI dataset, training images of size 228 × 912 are obtained by data augmentation. Likewise, we crop from the center of RGB images on the KITTI dataset to obtain testing images of size 228 × 912 for inference.
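As an illustration of the flip, color normalization, and center-crop steps above, the following is a minimal sketch in pure Python, not the authors' implementation. Images are represented as nested lists, the function names are hypothetical, and dividing by a per-channel standard deviation is an assumption, since the text only says "division". Scaling, translation, and rotation are omitted for brevity.

```python
import random

def hflip(img):
    """Mirror a 2D image (list of rows) left to right."""
    return [row[::-1] for row in img]

def center_crop(img, out_h, out_w):
    """Crop an out_h x out_w window from the center of a 2D image."""
    h, w = len(img), len(img[0])
    if out_h > h or out_w > w:
        raise ValueError("crop size exceeds image size")
    top, left = (h - out_h) // 2, (w - out_w) // 2
    return [row[left:left + out_w] for row in img[top:top + out_h]]

def normalize(channel, mean, std):
    """Normalize one color channel: subtract the mean, divide by std (assumed)."""
    return [[(v - mean) / std for v in row] for row in channel]

def augment(rgb_channel, depth, out_h, out_w, mean, std, rng=random):
    """Apply the same random flip to an RGB channel and its depth map,
    normalize the RGB channel only, then center-crop both."""
    if rng.random() < 0.5:  # flip with 50% probability
        rgb_channel, depth = hflip(rgb_channel), hflip(depth)
    rgb_channel = normalize(rgb_channel, mean, std)
    return (center_crop(rgb_channel, out_h, out_w),
            center_crop(depth, out_h, out_w))
```

The key design point is that every geometric transform is applied identically to the RGB image and the depth map so that pixels stay aligned, whereas color normalization touches only the RGB data. With the sizes quoted above, the crop would be called with `out_h=228` and `out_w=304` on NYU-Depth-v2, or `out_w=912` on KITTI.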

C. THREE MODALITIES OF INPUTS
After cropping images in Section IV-B, the obtained training and testing images on the same dataset are processed to generate three modalities of model inputs, i.e., RGBd data containing sparse depth and RGB images, sparse depth, and RGB images. The modalities of inputs are the same in model training and inference processes.
• Inputs of RGBd data: RGBd inputs are processed for model training and inference in two steps. 1) A few valid depth points are sampled uniformly at random from depth images on the training and testing datasets, respectively. For a fair comparison, the number of valid depth samples is set to the same value as in research [8], [37]. 2) RGB images and the corresponding depth samples on the training and testing datasets are fused as RGBd data to train and test models, respectively. The RGBd data on the same testing datasets are employed to evaluate models.
• Inputs of sparse depth: the same steps as those of RGBd data are performed to generate sparse depth inputs on training and testing datasets. Every position in a depth map is about equally likely to be sampled because we sample depth points uniformly at random. The number of valid depth samples is set to the same value as in work [8], [37], [38] for a fair comparison. In each depth image, the largest number of valid depth samples is 200, which accounts for 0.04% and 0.06% of all pixels in the KITTI and NYU-Depth-v2 images, respectively. These sampling proportions confirm that our depth sample-based inputs are sparse.
• Inputs of RGB images: RGB images captured by cameras or given by public datasets are used as inputs of model training and testing. For a fair comparison, the same RGB images are used to test DEM and other methods. Compared with depth estimation using RGBd data or sparse depth, predicting depth from a single RGB image is challenging because scale ambiguity exists in monocular depth estimation. Our DEM alleviates the ambiguity and estimates pixel-level depth maps reliably in outdoor or indoor scenarios.
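The sparse-depth and RGBd inputs described above can be sketched as follows. This is a hedged illustration rather than the authors' code: the function names are hypothetical, depth maps are nested lists, and treating a value of 0 as "no measurement" is an assumption.

```python
import random

def sample_sparse_depth(depth, n_samples, rng=random):
    """Keep n_samples valid depth pixels chosen uniformly at random;
    all other pixels become 0.0 (assumed to mean 'no measurement')."""
    valid = [(i, j) for i, row in enumerate(depth)
             for j, d in enumerate(row) if d > 0]
    keep = set(rng.sample(valid, min(n_samples, len(valid))))
    return [[d if (i, j) in keep else 0.0 for j, d in enumerate(row)]
            for i, row in enumerate(depth)]

def fuse_rgbd(rgb, sparse_depth):
    """Append the sparse depth value to each (r, g, b) pixel tuple."""
    return [[(*px, d) for px, d in zip(rgb_row, d_row)]
            for rgb_row, d_row in zip(rgb, sparse_depth)]

def sampling_ratio(n_samples, height, width):
    """Fraction of pixels in a height x width image that carry a sample."""
    return n_samples / (height * width)
```

As a rough check of the sparsity figures, and assuming they refer to the raw image sizes, `sampling_ratio(200, 480, 640)` is about 0.00065 and `sampling_ratio(200, 376, 1241)` is about 0.00043, which is in line with the 0.06% and 0.04% quoted above.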

D. ERROR METRICS
To evaluate DEM quantitatively, we use the error metrics that are also adopted by studies [1], [4]-[8], [15], [16], [24]-[28], [35], [37]-[39], [43]. The error metrics consider global statistics between a ground-truth depth image Y containing N depth pixels and the corresponding predicted depth map P consisting of N depth pixels. These error metrics are absolute relative difference (REL), root mean square error (RMSE), and $\delta_m$, which are

$$\mathrm{REL} = \frac{1}{N} \sum_{i,j} \frac{|p_{i,j} - y_{i,j}|}{y_{i,j}},$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i,j} \left(p_{i,j} - y_{i,j}\right)^2},$$

and

$$\delta_m = \frac{\mathrm{card}\left(\left\{p_{i,j} : \max\left(\frac{p_{i,j}}{y_{i,j}}, \frac{y_{i,j}}{p_{i,j}}\right) < 1.25^m\right\}\right)}{\mathrm{card}\left(\left\{y_{i,j}\right\}\right)},$$

where i and j are the abscissa and ordinate of a pixel in a depth map; $p_{i,j}$ and $y_{i,j}$ denote the predicted and ground-truth depth values of a pixel, respectively; $\delta_m$ represents the percentage of estimated pixels whose relative errors are within a threshold [8], with m = 1, 2, or 3; and card is the cardinality of a set.
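The metrics above can be computed as in the following minimal sketch. The function name is hypothetical, the threshold 1.25^m is the conventional choice in the depth estimation literature that the text attributes to [8], and skipping pixels without a positive ground-truth value is an assumption.

```python
import math

def depth_metrics(pred, gt):
    """Return (REL, RMSE, deltas) for two equally sized depth maps,
    where deltas maps m in {1, 2, 3} to the fraction delta_m.
    Pixels with non-positive ground truth are skipped (assumption)."""
    pairs = [(p, y) for p_row, y_row in zip(pred, gt)
             for p, y in zip(p_row, y_row) if y > 0]
    n = len(pairs)
    if n == 0:
        raise ValueError("no valid ground-truth pixels")
    rel = sum(abs(p - y) / y for p, y in pairs) / n
    rmse = math.sqrt(sum((p - y) ** 2 for p, y in pairs) / n)
    deltas = {}
    for m in (1, 2, 3):
        # A pixel counts when its ratio error stays below 1.25^m.
        hits = sum(1 for p, y in pairs
                   if p > 0 and max(p / y, y / p) < 1.25 ** m)
        deltas[m] = hits / n
    return rel, rmse, deltas
```

REL and RMSE penalize deviations directly, so lower values are better, while each entry of `deltas` is a fraction of well-predicted pixels, so higher values are better.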

V. EVALUATING DEM
The performance of DEM is compared with state-of-the-art studies. The encoder, decoder, and loss function of DEM are evaluated on the NYU-Depth-v2 dataset in Section V-A. In Section V-B, to illustrate the effectiveness and certain generalization of DEM, we evaluate it on the NYU-Depth-v2 and KITTI datasets. In Section V-C, the inference latency of DEM is evaluated on different platforms. The computation resources of DEM are assessed on embedded devices in Section V-D.
Development Environment: In this section, models are trained in the cloud platform equipped with NVIDIA Tesla P100 graphics card and Intel i9 CPU. Following the work [8], the batch size and epochs are set to 16 and 20 when DEM is trained. The model inference is performed in the cloud platform and embedded platform (NVIDIA Jetson TX2). The type of inputs is the same for model training and inference.

A. EVALUATING ENCODERS, DECODERS, AND LOSS FUNCTIONS
In this section, we demonstrate that our encoder, decoder, and loss function are effective.

1) EVALUATING ENCODERS
As presented in Table 1, encoders are assessed. Within each group, the same RGB input images, decoders, and loss functions are adopted. The loss function $L_1$ is chosen for model training. For a fair comparison, the decoder is fixed as Deconv 3 in all groups, where Deconv 3 denotes transposed convolution layers with a 3 × 3 kernel. Here, ResNet-18 represents the encoder containing a convolution (Conv) layer, a batch normalization (BN) layer, and ResNet-18; the other CNN encoders are named in the same way. Lower REL and RMSE are better, and higher $\delta_m$ (m = 1, 2, or 3) is better. RMSE is analyzed as the representative metric. As listed in Table 1, our encoder is more accurate than state-of-the-art encoders on the NYU-Depth-v2 dataset. We highlight better results in bold. The results of [1], [8], [16] are obtained by running the authors' respective code.

2) EVALUATING DECODERS
As shown in Table 2, our decoder provides more reliable depth than state-of-the-art decoders. All models use RGB inputs on the NYU-Depth-v2 dataset for model training and inference. The results of models [5], [24] are taken from their respective papers. The results of studies [8], [16] are obtained by running their corresponding models. We implement the models including ResNet-18 and acquire their results. In Rows 2-4, our decoder is 0.5% more accurate than Deconv 3 and UpProj in RMSE. In Rows 5-10, UpProj is slightly better than UpConv, which is consistent with the results in the work [8]. In the table, our RMSE is always lower than that of research [5], [8], [16], [24] when the proposed decoder is employed. As expected, our decoder outperforms other decoders in accuracy. In summary, our decoder has three advantages. First, the decoder outputs high-resolution depth maps. Second, the decoder recovers, extracts, and learns local features through learnable transposed convolution and convolution networks. In contrast, local features are easily lost by linear interpolation in decoders such as [5], [8]. Third, when properly matched with different encoders, the decoder conveniently improves depth prediction accuracy. The proposed encoder and decoder compose the depth estimation model (DEM). Without additional pre-processing or post-processing steps, we train the end-to-end DEM with the proposed loss function.

3) EVALUATING LOSS FUNCTIONS
As shown in Table 3, our loss function provides comparable accuracy. The results of studies [5], [8], [16], [37] are obtained by implementing their corresponding models. We run the models including ResNet-18 and ResNet-34 and acquire their results. The NYU-Depth-v2 dataset is leveraged for model training and inference. The modalities of inputs are the same for model training and inference.
In all groups of Table 3, our loss function outperforms state-of-the-art loss functions. For example, in the RGB group, where inputs are RGB images for model training and testing, the RMSE of our loss function is at least 0.8% lower than that of $L_1$, $L_2$, and berHu. The results in the sd group also confirm that our loss function L is effective in depth estimation. Here, in the sd group, inputs of model training and testing are sd-200 (200 valid depth points generated by random sampling in each image) on the NYU-Depth-v2 dataset. Consequently, when the proposed loss function L is adopted, the accuracy of depth estimation is higher than that of the state of the art. Our accurate results in Table 3 benefit from the use of relative relationships by the proposed loss function L. The relationships among sparse depth samples encourage the prediction to agree with the ground truth. The ablation experiments of our encoder, decoder, and loss function verify that they all enable finer depth estimation. We compare the encoder-decoder model DEM with other studies in the next section.
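For reference, the three baseline losses named above can be sketched as follows. This block does not reproduce the proposed loss L, whose definition lies outside this section. The function names are hypothetical, inputs are flat lists of per-pixel depths, and the berHu threshold of 0.2 times the largest residual is the common choice in the literature, assumed here.

```python
def l1_loss(pred, gt):
    """Mean absolute residual."""
    return sum(abs(p - y) for p, y in zip(pred, gt)) / len(pred)

def l2_loss(pred, gt):
    """Mean squared residual."""
    return sum((p - y) ** 2 for p, y in zip(pred, gt)) / len(pred)

def berhu_loss(pred, gt, ratio=0.2):
    """Reverse Huber (berHu): linear for small residuals, quadratic
    for large ones. The threshold c = ratio * max residual is assumed."""
    res = [abs(p - y) for p, y in zip(pred, gt)]
    c = ratio * max(res)
    if c == 0.0:  # all residuals are zero
        return 0.0
    return sum(x if x <= c else (x * x + c * c) / (2 * c)
               for x in res) / len(res)
```

All three baselines score each pixel independently of its neighbors, which is the contrast the text draws with a loss that also uses relative depth relationships among pixels.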

B. COMPARISON WITH STATE OF THE ARTS
In this section, we verify the performance of DEM and discuss prediction results on two benchmark datasets.
1) COMPARISON ON THE NYU-Depth-v2 DATASET
In Table 5, we use sd-20, sd-50, and sd-200 as model inputs. Here, sd-20 denotes inputs that contain only 20 valid depth samples for model training and inference; sd-50 and sd-200 are defined in the same way. The depth estimation quality improves as the proportion of depth samples increases. In all settings, the accuracy of DEM compares favorably with the state of the art. For example, our RMSE is respectively 3.1%, 3.9%, 4.1%, 67.6%, 34.9%, and 1.3% lower than that of methods [1], [8], [16], [33], [37], [38] when model inputs are sd-20. The accuracy of DEM is also higher than that of other studies when model inputs are sd-50. Our RMSE is respectively 13.9%, 22.8%, 82.4%, and 42.8% better than that of work [8], [16], [33], [37] when model inputs are sd-200.
In Fig. 4, the visual results also show that DEM boosts the accuracy of depth prediction. Here, depth maps are predicted with RGB images or sd-200 inputs on the NYU-Depth-v2 dataset. The modalities of inputs are the same in training and inference processes. The results of Ma and Karaman [8] are taken from their paper. The differences between methods are marked by boxes or circles in Fig. 4. In each group, our depth maps are clearer than the others. Specifically, the objects predicted by DEM have distinct edges and corners. The estimated details of a sofa, chair, and table are easy to distinguish in Fig. 4(b). The sofa, table, and mural in Fig. 4(d) also have complete contours. In contrast, the sofa and table predicted by the work [8] are unclear, as shown in the first and second columns of Fig. 4(c). The objects predicted by [8] in Fig. 4(e) are still ambiguous. These visual results further confirm that DEM considerably improves depth estimation.

2) COMPARISON ON THE KITTI DATASET
On the outdoor KITTI testing dataset, we evaluate DEM and compare it with the state of the arts, as shown in Table 6. The KITTI dataset is challenging because the largest ranging distance of the KITTI dataset is ten times that of NYU-Depth-v2. The input modalities in model training are the same as those in model inference. The results of models [4], [7], [8], [26], [27], [32], [35], [36] are taken from their respective papers. The result of the research [22] is taken from the paper [4]. The results of models [16], [33] are acquired by implementing their respective methods.
In Table 6, DEM performs better than other methods. For example, when model inputs are RGB images, our RMSE is respectively 38.1%, 29.3%, 49.2%, 29.6%, 14.6%, 41.0%, 14.5%, and 18.5% lower than that of models [4], [7], [8], [16], [26], [32], [33], [35]. When model inputs are RGB images and 500 depth samples, the RMSE of DEM is 65.2% better than that of Cadena et al. [35], even though DEM uses fewer depth samples than the method [35]. Additionally, the output size of DEM is 48.1 times that of the method [35]. Hence, DEM predicts more accurate and higher-resolution depth maps than other models. The RMSE of DEM is reduced by 23.7% compared with the research [7], while DEM uses 55.6% fewer depth samples than the work [7]. Relative to the state of the art [8], [16], [33], [36], DEM also has higher accuracy. These results demonstrate that the encoder, decoder, and loss function of DEM are more precise than others.

3) COMPARISON OF GENERALIZATION ERROR ON THE Make3D TESTING DATASET
To verify the generalization of DEM, we train DEM on one dataset such as KITTI and test it on another dataset like Make3D. For comparison, the baseline is selected from the paper [22]. The results of Saxena et al. are taken from their paper [21]. Similar to DEM, the state of the arts [8], [16] are trained on KITTI and tested on Make3D. When model inputs are RGB images, the REL of DEM is 24.2% lower than that of the baseline [22]. The RMSE of DEM is respectively 40.1% and 2.7% better than that of methods [16], [21]. When model inputs are RGB images and 500 sparse samples, our RMSE is respectively 12.7% and 35.9% lower than that of the studies [8], [16]. Different from DEM, the classic work [21] and the baseline [22] use Make3D to train and test models, but DEM performs better than them. These results show that DEM has a certain generalization ability.

C. EVALUATING RUNTIME ON DIFFERENT PLATFORMS
The inference time of DEM is evaluated on the cloud platform and the embedded platform. The inference time is measured three times, and the average inference time per image is then computed. The inputs on NVIDIA Jetson TX2 are the same as those on the cloud platform for training and inference. In the sd group of Table 8, sd-50 denotes model inputs containing only 50 valid depth samples per image for training and inference. In the RGBd group, RGBd-50 means that the inputs include RGB images and 50 valid depth samples for training and evaluation. All values in Table 8 are in milliseconds (ms). The accuracy metrics are not shown here because they have been listed in Tables 4 and 5. The accuracy does not differ between the cloud platform and the embedded platform when inputs and models are the same.
Table 8 shows that DEM satisfies real-time (9 frames per second) constraints on TX2 when inputs are RGB images of size 480 × 640. The runtime of the RGBd group is the longest of all groups, both on TX2 and on the cloud platform, because DEM predicts depth from both sparse depth points and RGB images. In the sd group, the runtime is short both in the cloud and on TX2 because only valid depth points are used as inputs of DEM. In the RGB group, the runtime is shorter than that of the RGBd group. The runtime of DEM on the KITTI dataset is slightly longer than that on the NYU-Depth-v2 dataset because the inputs on KITTI are larger. The runtime for RGB input images of size 480 × 640 meets real-time requirements on TX2, which is reasonable since the RGB group predicts depth maps from RGB inputs alone. Additionally, RGB images are easily obtained from low-cost monocular cameras. Therefore, RGB images are chosen as inputs of DEM to test computation resources on embedded devices.
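The measurement protocol described above (three timed passes, then the mean time per image) can be sketched as follows. The names are hypothetical, and `infer` stands in for any callable that runs the model on one image.

```python
import time

def mean_latency_ms(infer, images, repeats=3):
    """Run `infer` over all images `repeats` times and return the
    mean wall-clock time per image in milliseconds."""
    if not images:
        raise ValueError("images must not be empty")
    total = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        for img in images:
            infer(img)
        total += time.perf_counter() - start
    return 1000.0 * total / (repeats * len(images))

def fps(latency_ms):
    """Convert a per-image latency in milliseconds to frames per second."""
    return 1000.0 / latency_ms
```

Under this conversion, the 9 frames per second quoted above corresponds to roughly 111 ms per image. When timing a GPU model, one would typically also synchronize the device before reading the clock, because GPU kernels are launched asynchronously.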

D. EVALUATING COMPUTATION RESOURCES ON EMBEDDED DEVICES
To evaluate computation resources of DEM on the representative embedded device NVIDIA Jetson TX2, we measure metrics such as the number of parameters, memory utilization, CPU/GPU consumption with their frequency, power consumption, and energy consumption, as listed in Table 9. These metrics are widely used to evaluate computation resources of algorithms on embedded systems [44], [45]. The result of ResNet-FC-160 × 128 [5] is acquired by implementing the method ResNet-FC-160×128 on TX2. The result of Ma and Karaman [8] is obtained by running their code on TX2.
For comparison, the computation resources of methods [5], [8] are evaluated under the same condition as that of DEM. When stably predicting depth maps of 654 RGB images (the NYU-Depth-v2 testing dataset) by using MAXP_CORE_ARM mode on TX2, we record the metrics of memory utilization, CPU/GPU consumption with their frequency, and power consumption, and we then compute average values of them. The power consumption is computed by multiplying voltage and current. The voltage and current are measured by a power supply. When completing depth estimation of 654 RGB images, we record the runtime and compute the energy consumption. The energy consumption is computed by multiplying the power consumption and runtime of 654 RGB images.
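The power and energy computations described above reduce to two products, sketched below. The function names and the units (volts, amperes, seconds) are assumptions, and the averaging helper mirrors the step of averaging the recorded readings.

```python
def mean(values):
    """Arithmetic mean of recorded readings."""
    if not values:
        raise ValueError("no readings recorded")
    return sum(values) / len(values)

def power_watts(voltage_v, current_a):
    """Power as the product of supply voltage and current."""
    return voltage_v * current_a

def energy_joules(power_w, runtime_s):
    """Energy as the product of average power and total runtime."""
    return power_w * runtime_s
```

For example, with purely hypothetical readings of 19 V and 0.5 A, the power is 9.5 W, and a 10 s run over the test set would consume 95 J.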
Among the metrics in Table 9, fewer parameters are better, and lower memory utilization, CPU/GPU consumption, CPU/GPU frequency, power consumption, and energy consumption are better. The CPU/GPU frequency changes dynamically with CPU/GPU workloads. The percentage of CPU/GPU consumption is relative to the CPU/GPU frequency. Additionally, in the RGB group of Table 9, inputs of all methods are RGB images for model training and inference. In the RGBd group of Table 9, inputs of all methods are RGB images and sparse depth samples for model training and evaluation.
As presented in Table 9, DEM is suited to run on embedded devices. In contrast to the state of the art, DEM boosts the performance of depth prediction with low consumption of GPU, power, and energy. Specifically, in the RGB and RGBd-200 groups, DEM performs better than the work [5], [8], [34] on metrics such as runtime and consumption of GPU, power, and energy. The memory utilization and CPU consumption of the research [8] are slightly better than ours. Also, DEM is less complex and has 28.6% fewer parameters than the algorithm [8]. These results demonstrate that DEM reduces the runtime and the consumption of GPU, power, and energy. Therefore, DEM runs in real time with low consumption of GPU, power, and energy for fast depth estimation on embedded devices.

VI. APPLICATION OF DEM AND EVALUATION OF SLAM
In Section VI-A, we apply DEM to LiDAR super-resolution. The application result is qualitatively compared with the state of the art. In Section VI-B, we quantitatively assess the application of DEM in LiDAR super-resolution. In Section VI-C, we qualitatively evaluate the DEM-based SLAM. In Section VI-D, the DEM-based SLAM is quantitatively compared with well-known SLAM systems, and results are analysed in detail.

A. APPLYING DEM TO LiDAR SUPER-RESOLUTION
As shown in Fig. 5, we apply DEM to LiDAR super-resolution following the representative study [8]. On the KITTI dataset, RGBd-20000 data (20000 depth samples in each image) are used as inputs of models for training and inference. The results of Ma and Karaman [8] are acquired by running their code. A good super-resolution result shows easily recognizable objects, as in Fig. 5. In Fig. 5(c), faraway cars and trees are recognizable in our estimation. In Fig. 5(b), the car and tree at a long distance cannot be identified. The object shapes in our prediction are clear. For instance, nearby cars in Fig. 5(c) are distinguished easily. The window of the car is identified more clearly in our results than in the research [8]. These significant differences between Fig. 5(b) and Fig. 5(c) imply that DEM is more reliable than the method [8]. Additionally, the estimation of DEM has fine object contours compared with the ground-truth depth projected from LiDAR measurements in Fig. 5(d). Our experiments show that DEM can be applied well to LiDAR super-resolution.

B. EVALUATING APPLICATIONS QUANTITATIVELY
The mean absolute error (MAE) metric is computed to further evaluate the application of DEM in LiDAR super-resolution quantitatively. The MAE represents the average of absolute deviations between ground-truth and predicted depth maps. The MAE of DEM is 0.745 when the inputs for model training and testing are RGBd-500 data (500 depth samples in each image) on the KITTI dataset. With the same RGBd-500 inputs on KITTI for model training and testing, we run the source code of [8] and obtain an MAE of 1.209. Our MAE is thus 38.4% lower than that of the study [8]. Table 6 also demonstrates quantitatively that DEM performs better in super-resolution than the work [8]. Specifically, DEM outperforms the state of the art on the metrics of RMSE, REL, $\delta_1$, $\delta_2$, and $\delta_3$ in Table 6. These assessments quantitatively show that applying DEM to LiDAR super-resolution is effective.
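The MAE and the relative improvement quoted above can be sketched as follows. The function names are hypothetical, and restricting the average to pixels with a positive ground-truth value is an assumption, motivated by the sparsity of LiDAR ground truth.

```python
def mae(pred, gt):
    """Mean absolute error over pixels with valid ground truth.
    Pixels with non-positive ground truth are skipped (assumption)."""
    errs = [abs(p - y) for p_row, y_row in zip(pred, gt)
            for p, y in zip(p_row, y_row) if y > 0]
    if not errs:
        raise ValueError("no valid ground-truth pixels")
    return sum(errs) / len(errs)

def relative_reduction(baseline, ours):
    """Fractional reduction of an error metric relative to a baseline."""
    return (baseline - ours) / baseline
```

Applied to the values in the text, `relative_reduction(1.209, 0.745)` is about 0.384, which matches the reported 38.4% reduction in MAE.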

C. EVALUATING DEM-BASED SLAM QUALITATIVELY
As shown in Fig. 6, the DEM-based SLAM is evaluated qualitatively. The reconstruction results of SLAM are presented in Fig. 6(c) and Fig. 6(d). To demonstrate the local consistency in SLAM systems, the front-end of SLAM is shown in Fig. 6(e) and Fig. 6(f). The results in Fig. 6(c) and Fig. 6(e) are obtained by implementing the representative method [8]. Here, we use inputs of monocular RGB images on the NYU-Depth-v2 dataset for model training and testing. The outputs of DEM from RGB images are employed by our SLAM. In contrast, the work [8] employs inputs of RGB and sparse depth on the NYU-Depth-v2 dataset for model training and testing. The outputs of model testing relying on RGB and sparse depth are leveraged as inputs of the SLAM system [8].
In Fig. 6, the reconstruction results of the DEM-based SLAM are close to the ground truth, although using monocular RGB images as inputs. Compared with the state of the art [8] from RGB and sparse depth inputs, the DEM-based SLAM reconstructs the detail of walls from RGB images, as displayed in Fig. 6(d). By contrast, the work [8] has little reconstruction of walls in Fig. 6(c). The reconstruction of textureless regions like walls shows that the DEM-based SLAM is dependable. Additionally, more complete objects are shown in the front-end of our SLAM in Fig. 6(f) than those in Fig. 6(e). The favorable results in Fig. 6(d) and Fig. 6(f) benefit from our reliable DEM and SLAM.
To further justify the effectiveness of the DEM-based SLAM, we provide visual results of optimized poses obtained in the pose prediction module of our SLAM. The monocular camera is moved a certain distance and then returned to the vicinity of its initial position. The poses of the end point and starting point are presented in Fig. 7(a) and Fig. 7(b). Without pose optimization, the pose deviation between the starting point and end point is large, as shown in Fig. 7(b). This deviation is reduced by optimizing poses with RANSAC PnP initial values, as shown in Fig. 7(a). These results confirm that G2o optimization with RANSAC PnP initial values yields accurate poses in the pose prediction module. The optimized poses boost the accuracy of the DEM-based SLAM.

D. EVALUATING DEM-BASED SLAM QUANTITATIVELY
The TUM dataset and the Absolute Trajectory Error (ATE) RMSE (m) are adopted to test SLAM quantitatively. The ATE RMSE metric is computed as the RMSE between the ground-truth and estimated translations for each evaluated sequence. A lower ATE RMSE indicates better SLAM performance. As a common evaluation metric for SLAM, ATE RMSE is widely used in previous research [10]-[12], [17]-[20].
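The ATE RMSE described above can be sketched as follows. The function name is hypothetical, and the sketch assumes that the two trajectories have already been associated by timestamp and expressed in a common frame, which full evaluation tools usually handle with an alignment step before computing the error.

```python
import math

def ate_rmse(gt_positions, est_positions):
    """RMSE of translational differences between associated positions.
    Both inputs are equal-length sequences of (x, y, z) tuples that are
    assumed to be time-associated and aligned beforehand."""
    if not gt_positions or len(gt_positions) != len(est_positions):
        raise ValueError("trajectories must be non-empty and equally long")
    sq_errs = [sum((g - e) ** 2 for g, e in zip(gp, ep))
               for gp, ep in zip(gt_positions, est_positions)]
    return math.sqrt(sum(sq_errs) / len(sq_errs))
```

Because the errors are squared before averaging, a few large trajectory deviations dominate the score.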
On the public TUM dataset, the sequences fr1_desk, fr3_long_office, and fr3_str_tex_far are chosen for SLAM evaluation. On these sequences, monocular RGB images are employed as inputs of the DEM-based SLAM. Here, DEM is trained on the NYU-Depth-v2 dataset and used directly for inference on the TUM dataset. The outputs of DEM are used by the DEM-based SLAM for reconstruction.
In Table 10, the DEM-based SLAM performs well with near state-of-the-art accuracy. The results of RGBD ORB-SLAM2 are used as an upper bound of accuracy because RGBD ORB-SLAM2 yields small error relying on inputs of RGB images and ground-truth depth maps. In contrast to RGBD ORB-SLAM2, the DEM-based SLAM reconstructs scenes through only monocular RGB images, but the DEM-based SLAM outperforms deep learning-based SLAM methods [10]- [12], [20]. The extensive experiments demonstrate that the DEM-based SLAM is reliable.

VII. CONCLUSION
In this paper, we propose the encoder-decoder model (DEM) to estimate depth, the loss function to guide the training of DEM, and the DEM-based SLAM system to reconstruct scenes with monocular cameras. The encoder reuses and re-exploits features effectively; the decoder recovers details that are generally lost by linear interpolation in other decoders; the loss function encourages estimated depth to agree with the ground truth. Our methods achieve comparable results on public datasets. Particularly, DEM outperforms state-of-the-art algorithms in accuracy, whether inputs are monocular images, sparse depth, or the fusion of both. Our DEM runs in nearly real-time (9 frames per second) and is efficient in the consumption of GPU, power, and energy on embedded devices. Additionally, the application of DEM in LiDAR super-resolution performs better than previous algorithms. Based on DEM, the proposed SLAM system reconstructs more accurate scenes than existing depth estimation-based SLAM. Future research will use lightweight depth estimation models to decrease the consumption of memory and CPU.