FastMDE: A Fast CNN Architecture for Monocular Depth Estimation at High Resolution

A depth map helps robots and autonomous vehicles (AVs) visualize the three-dimensional world to navigate and localize neighboring obstacles. However, it is difficult to develop a deep learning model that can estimate the depth map from a single image in real time. This study proposes a fast monocular depth estimation model named FastMDE by optimizing a deep convolutional neural network based on the encoder-decoder architecture. The decoder needs to obtain spatial and semantic feature maps from the encoding phase to improve the depth estimation accuracy. Therefore, we designed FastMDE with two effective strategies. The first involved redesigning the skip connection with a squeeze-excitation module to obtain spatial and semantic feature maps from the encoding phase. The second involved redesigning the decoder with a fusion dense block to allow the high-resolution features learned earlier in the network to be used before upsampling. The proposed FastMDE model utilizes only 4.1 M parameters, far fewer than state-of-the-art models. Thus, FastMDE achieves higher accuracy and lower latency than previous models. This study also demonstrates that MDE can leverage deep neural networks in real time (i.e., 30 fps) on the Linux embedded board Nvidia Jetson Xavier NX. The model facilitates the development of applications with superior performance and easy deployment on an embedded platform.


I. INTRODUCTION
Depth map prediction from a single image is a fundamental aspect of several applications that involve three-dimensional (3D) visualizations of the real world. It can be deployed in multiple applications, such as robotics, autonomous vehicles, and drones [1], [2]. It assists robots in building good simultaneous localization and mapping (SLAM) for autonomous obstacle avoidance [3], [4]. However, existing depth sensors, such as Light Detection and Ranging (LiDAR), structured-light sensors, etc., are typically bulky, heavy, and consume a lot of power. This makes them unsuitable for small robotic platforms. Thus, the development of a depth estimation technique that uses a monocular camera is being explored due to its compact size, low cost, and low power consumption.
Researchers have recently tried to address this problem by performing monocular depth estimation (MDE) through deep learning. Studies [5]-[7] used the encoder-decoder as the backbone of the architecture. The encoder is commonly a complex network designed for object detection and recognition problems, such as VGG-16 [8], ResNet [9], or DenseNet [10], due to their high expressive power and accuracy. For instance, Alhashim and Wonka [11] used transfer learning and DenseNet-169 for high-quality depth estimation. Lee et al. [12] used DenseNet-161 with local planar guidance to extract dense encoding features.

FIGURE 1. MDE on the KITTI dataset. The results of the current method are compared with those of Monodepth2 [5]. The current method produces depth maps with higher quality and sharpness than Monodepth2 [5], despite using 3.5 times fewer parameters. The image resolution in the current study is 1024 × 320.
Godard et al. [5] adopted ResNet-18 for the encoder network architecture. However, such complex and deep convolutional neural network (DCNN) architectures, which require over 14 M parameters, lead to high complexity and high latency. Therefore, it is very difficult to deploy these models in real-time industrial applications because embedded platforms have limited hardware resources. It is thus essential to develop an efficient convolutional neural network (CNN) model that can run in real time on embedded devices.

Previous studies on real-time MDE have adopted network pruning techniques to reduce the size of the model. Wofk et al. [13] used MobileNet [14] for the encoding phase and applied network pruning to their model. They obtained fast depth estimation on the Nvidia Jetson TX2 at 178 fps on a GPU and 27 fps on a CPU with an input image resolution of 224 × 224. However, the model proposed in [13] has low accuracy and supports only low-resolution input images. Other researchers have adopted reinforcement learning to design high-efficiency model architectures. Shaw et al. [15] proposed a fast semantic segmentation model named SqueezeNAS, which utilized reinforcement learning to find an optimal network architecture for segmentation tasks. It utilized only 1.8 M parameters and achieved high accuracy. Several studies [16]-[19] reported high performance by optimizing the hyper-parameters of the neural network architecture. These studies revealed that a computer can adopt the neural architecture search technique to design neural network architectures automatically. However, this approach requires a lot of GPU computational resources and hundreds to thousands of computational days to obtain the optimal neural network.

Thus, this study demonstrates a high-efficiency neural network architecture by accounting for the properties that improve depth prediction results. Since the depth estimation problem requires pixel-based information, the model needs the semantic features and spatial information of an object to predict its boundaries at high resolution. Semantic feature maps are appropriate for semantic segmentation tasks and can be used to produce boundaries between different objects. Spatial information can help visualize the boundaries of objects, thereby enhancing the depth map estimation.
An analysis to infer high-quality depth estimation is conducted to develop a lightweight model for MDE with high accuracy.
• The deep layers in the encoder of the DCNN contain additional channels to extract high-level features from the input image, thereby ensuring that the semantic features are relatively detailed. Therefore, a network is needed that can capture the high-level features in the deep layers of the encoder and merge them into the features within the decoder. This helps the decoder layers construct highly detailed semantic feature outputs.
• Bilinear interpolation or nearest-neighbor interpolation at a scale factor of 2 is usually used for upsampling in the decoder. These two methods generate either overly smooth edges or pixelated, interpolated images with stair-step artifacts [20]. Upsampling an image with blurred or noisy edges makes it difficult for the model to estimate the edge information. Thus, the model must predict a sharp edge at high resolution before upsampling.

Thus, the DCNN architecture is redesigned in this study to achieve high accuracy and low latency for MDE. Our main contributions are summarized below.
• A FastMDE architecture that can perform better than state-of-the-art methods while using 3.5 times fewer weights than Monodepth2 [5] and HR-Depth [7] is proposed in this study. A comparison between the depth map estimation of the current method and Monodepth2 is shown in Fig. 1. The results demonstrate that the FastMDE architecture can predict the depth map at a higher quality with sharper edges than state-of-the-art methods.
• The skip connection is redesigned with a squeeze-excitation (SE) block to extract important features of the encoding phase; the resulting module is referred to as eSE. Thus, the decoder contains more details on the spatial and semantic feature maps.
• The dense connection in the decoder is redesigned as the fusion dense block (fDense) to learn the high-resolution features obtained from the skip connection and encoder, producing highly detailed edge information before the upsampling process. This allows the model to predict sharper edges at higher accuracy.

• A TensorRT engine model, which is easily deployed on Nvidia Linux embedded boards (tested with the Nvidia Xavier NX), is also provided.
The remaining sections of this manuscript are organized as follows. Section II discusses studies related to the proposed model. Section III introduces the proposed FastMDE, which includes the novel skip connection with an SE block and a unique fusion dense module. The training specifications are also described in this section. Section IV compares the results of the FastMDE technique with those of a few state-of-the-art methods. Section V discusses the ablation studies conducted to analyze the performance improvement realized by the FastMDE model; an ablative analysis is performed on the different architectural components introduced over the baseline model Monodepth2 [5]. Section VI concludes this study.

II. RELATED STUDIES
This section describes studies related to the deep learning methods used for MDE, such as supervised learning and unsupervised learning. This is followed by a discussion on the existing lightweight MDE networks.

A. THE MONOCULAR DEPTH ESTIMATION APPROACH

1) SUPERVISED LEARNING
MDE is an auto-encoder problem that feeds an input image and permits it to be transformed according to multiple reasonable depths. Earlier studies were based on supervised learning, wherein the model was trained with a ground truth depth for the loss calculation. The first supervised learning method [21] was trained on an RGB-D dataset. The network designed in [21] predicts the depth map in two stages. The first stage produces a globally coarse depth, whereas the second stage produces a locally fine depth to generate pixel-by-pixel depth values. This local prediction assists the model in generating the depth map in great detail. Researchers later developed new network architectures to improve the model's accuracy such that the model was comparable to a depth sensor. Zhang et al. [22] proposed a hard-mining network that adopted an approach similar to that of [21]. The study used intra-scale and inter-scale refinement sub-networks to accurately localize and refine the hard-mining regions. This assisted the model in improving the MDE performance in hard regions where it is difficult to predict the depth value. Lee et al. [12] used atrous spatial pyramid pooling to extract contextual information and local planar guidance at every scale of the decoding phase. The method of Bhat et al. [23] is the current state-of-the-art for depth estimation based on supervised learning. It introduces an adaptive bin-width estimator block that divides the depth range into bins whose center values are estimated adaptively per image; the depth is then calculated as a linear combination of the bin center values.

However, all the supervised learning methods discussed above require ground truth depth values, which are difficult and expensive to generate. Further, the readings are sparse and flawed even with an expensive depth sensor (e.g., the sensor cannot capture the depth information of a moving object). The lack of ground truth datasets results in poor generalization performance, which leads to biased predictions. As a result, researchers began to explore self-supervised learning (i.e., unsupervised learning) for MDE.

2) UNSUPERVISED LEARNING
The unsupervised learning models for MDE can be trained with multiple monocular images. The input can be a sequence of monocular images or stereo image pairs. Unsupervised learning uses the photometric reprojection error between the corresponding pixels of multiple input images to generate an output depth image. Stereo image pairs were initially used for training unsupervised learning models [6], [24]-[26]. These models used the left and right images of the same scene to compute the disparity map by calculating the displacement between the corresponding pixels of the two images. Several studies [5], [27]-[32] proposed unsupervised learning models that used sequences of images. The depth map prediction was based on the output of two DCNNs, namely the pose estimation network (PoseNet) and the depth estimation network (DepthNet). PoseNet regresses the transformation between adjacent frames, which is used for the reconstruction of the target image. DepthNet predicts the depth map according to the output of PoseNet (additional details are included in Section III). These studies allow the model to be trained entirely with monocular image sequences. Although the pose network and depth network are used simultaneously during training, they can operate separately during testing.

Monodepth2 [5] introduced a loss function that calculates the minimum photometric error instead of the average error, which was the technique used in a previous study [33]. This improves the sharpness of the occlusion boundary, which significantly increases the accuracy of the model. As a result, Monodepth2 became a widely used baseline. Guizilini et al. [32] aimed to further improve the MDE performance by developing a network pre-trained on semantic image segmentation tasks to guide the network learning process. The authors of [34] proposed a 3D packing and unpacking network to preserve the spatial information in images and low-level features. Their study demonstrates that the standard max pooling and bilinear upsampling techniques are not good enough to preserve semantic and spatial information in detail for depth estimation. However, the model uses pack and unpack blocks with 3D convolutions; as a result, the number of parameters greatly increases, making it impossible to deploy the model on embedded devices in real time. These two studies also demonstrate that abundant semantic and spatial information is important to obtain sharp images and improve the accuracy of depth estimation. Jiang et al. [35] predicted the ego-motion through optical flow, which permitted large open sources, such as YouTube videos, to be leveraged without labels. This study thus demonstrates the application of unsupervised learning to predict depth from raw videos. A model pre-trained on semantic segmentation tasks is also applied for depth estimation.
The aforementioned studies use very expensive DCNN architecture to generate spatial and semantic feature maps for estimating the depth with high accuracy. Our design obtains these features with a lightweight DCNN architecture and successfully estimates the depth map with sharp edges.

B. LIGHTWEIGHT MONOCULAR DEPTH ESTIMATION NETWORKS
The proposed model must be optimized to carry out real-time depth estimation on embedded systems and make MDE viable for industrial applications with limited computational resources. Wofk et al. [13] proposed a lightweight architecture that runs in real time on the Jetson TX2 board. The total number of parameters used by the network is 1.34 M after pruning. Elkerdawy et al. [36] reported that a lightweight monocular depth model could be derived from a complex pre-trained model by using pruning methods. Their baseline model was trained according to Monodepth2 [5]. Although such network architectures have few parameters, their performance is also relatively poor. Lyu et al. [7] developed a lightweight MDE model by teaching a lightweight network with a high-performance network. However, this lightweight network also reported a lower accuracy than the original network. Despite using lightweight model architectures, the depth estimation networks proposed in [7], [13], [36] have low accuracy. This study addresses these problems by developing a lightweight DCNN with high accuracy for depth map prediction.

III. METHODOLOGY
This section describes the proposed CNN architecture for fast MDE with high resolution, followed by the training method and implementation of the model.

A. MODEL ARCHITECTURE
Zhang et al. [37] reported that high-level feature maps of the encoder contain more semantic features than low-level feature maps. However, the upsampling procedure based on bilinear interpolation in the decoder causes the model to generate a low-resolution dense output with large gradient regions. Therefore, the following strategies were adopted to achieve high performance at high resolution. First, the skip connection is redesigned so that the important features of the encoder are identified and merged with those of the decoder. Second, the decoding phase is redesigned to preserve the encoder features in maximum detail before upsampling.

1) ENCODING PHASE
Several CNN architectures, such as SqueezeNet [38], MobileNet [14], MobileNetV2 [39], and MobileNetV3 [40], were analyzed in this study. These networks were designed for object classification and do not require complex computations. Therefore, they can be easily deployed on embedded systems and edge devices in real time. The depth estimation problem depends on pixel information, so the encoding phase needs to extract as many semantic features and as much spatial information of the object as possible. Moreover, MobileNetV2 is a more effective feature extractor for pixel-based tasks than the other networks. Thus, the lightweight MobileNetV2 [39] was selected for the design of the encoder. We also used the transfer learning technique with MobileNetV2. Since it is difficult to get a model trained from scratch to converge, our model was trained with initial weights pre-trained on the large ImageNet dataset [41]. Therefore, the channels of the encoder are the same as those of the original MobileNetV2 model: 16, 24, 32, 64, and 160 channels at the 1/2, 1/4, 1/8, 1/16, and 1/32 scales, respectively.
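As an illustration, the following is a minimal sketch, not the authors' exact code, of how torchvision's MobileNetV2 can be tapped at five scales; the slice indices are illustrative choices that yield the channel counts listed above:

```python
# Tapping a pre-trained MobileNetV2 at the five scales used by the encoder.
# Assumes torchvision's implementation; older versions use pretrained=True
# instead of the weights argument.
import torch
import torchvision

class MobileNetV2Encoder(torch.nn.Module):
    def __init__(self):
        super().__init__()
        backbone = torchvision.models.mobilenet_v2(weights="IMAGENET1K_V1").features
        # Stage boundaries where the spatial resolution halves.
        self.stages = torch.nn.ModuleList([
            backbone[0:2],    # 1/2  scale, 16 channels
            backbone[2:4],    # 1/4  scale, 24 channels
            backbone[4:7],    # 1/8  scale, 32 channels
            backbone[7:11],   # 1/16 scale, 64 channels
            backbone[11:17],  # 1/32 scale, 160 channels
        ])

    def forward(self, x):
        features = []
        for stage in self.stages:
            x = stage(x)
            features.append(x)
        return features  # multi-scale maps consumed by the skip connections

encoder = MobileNetV2Encoder()
feats = encoder(torch.randn(1, 3, 192, 640))
for f in feats:
    print(f.shape)
```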

2) DECODING PHASE
The semantic features from the encoder are leveraged during the decoding phase to infer a high-quality depth map. We focus on the design of the skip connection to utilize additional semantic information from the encoding phase. The decoder is redesigned according to the following modules.

a: eSE MODULE
The representation of semantic layers in deep neural networks increases with the increasing depth of the layers. Let $X^e_{i+1}$ denote the high-level encoder feature maps at a scale of $\frac{1}{2^{(n+1)}}$ of the original resolution, and let $X^e_i$ represent the encoder feature maps at a scale of $\frac{1}{2^{n}}$ of the original resolution, where $n$ is a scale factor such that $n \in [1, 4]$. The skip connection is uniquely designed by combining the high-level encoder features $X^e_{i+1}$ and the encoder features $X^e_i$ with the SE block; this is called the eSE module, as shown in Fig. 2, and it improves the accuracy of the predicted depth map. Based on the methodologies adopted by previous studies [7], [42], [43], we used the SE module for channel attention, which is essentially a detector response map of the corresponding filter. The proposed eSE module squeezes the high-level encoder features $X^e_{i+1}$ and the encoder features $X^e_i$ by global average pooling to generate channel information. A fully connected neural network is then used to determine and activate the important channels with relatively high weights. We also used a 1 × 1 convolution to fuse the feature maps. The SE block estimates the important features of the encoding phase so that they can be concatenated with the features of the skip connection and merged with the features within the decoder.
Our proposed eSE module has two advantages. The first is that it reduces the semantic and resolution gap between the encoder and decoder features. The second is that it fuses the channels to obtain high-quality textural information of the image. Notably, our eSE module takes two consecutive layers from the encoder and focuses on fusing the channels of these layers. Therefore, our eSE module helps FastMDE reduce the number of parameters from 4.2 M to 4.1 M compared with using the standard SE module, and the accuracy of the model is also improved. The proposed model architecture is shown in Fig. 3. The output of the encoder node $X^e_i$ is $x^e_i$, the output of the decoder node $X^d_i$ is $x^d_i$, and $x_i$ denotes the output of the central node $X_i$. Consider a single image $I$ supplied as input to the network. The stack of feature maps is calculated as shown in Eq. (1), where $E(\cdot)$ represents a feature extraction block, which is similar to the MobileNetV2 [39] block at every half scale of the input; $eSE(\cdot)$ is the squeeze-excitation block applied to both encoder features $x_i$ and $x_{i+1}$; and $U(\cdot)$ represents an upsampling block that uses nearest-neighbor interpolation with double scaling of the input features.
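For concreteness, the following is a minimal sketch of one plausible reading of the eSE module described above; the reduction ratio r, the channel sizes, and the placement of the interpolation are assumptions, not the authors' implementation:

```python
# eSE sketch: squeeze both encoder maps by global average pooling, produce
# per-channel weights with a small fully connected network, and fuse the
# re-weighted, concatenated maps with a 1x1 convolution.
import torch
import torch.nn as nn
import torch.nn.functional as F

class eSE(nn.Module):
    def __init__(self, ch_low, ch_high, ch_out, r=4):
        super().__init__()
        ch_cat = ch_low + ch_high
        self.fc = nn.Sequential(
            nn.Linear(ch_cat, ch_cat // r), nn.ReLU(inplace=True),
            nn.Linear(ch_cat // r, ch_cat), nn.Sigmoid(),
        )
        self.fuse = nn.Conv2d(ch_cat, ch_out, kernel_size=1)

    def forward(self, x_e_i, x_e_i1):
        # Bring the deeper (half-resolution) map x^e_{i+1} up to the
        # resolution of x^e_i before the channel attention.
        x_e_i1 = F.interpolate(x_e_i1, size=x_e_i.shape[2:], mode="nearest")
        x = torch.cat([x_e_i, x_e_i1], dim=1)
        w = self.fc(x.mean(dim=(2, 3)))          # squeeze: global average pool
        x = x * w.unsqueeze(-1).unsqueeze(-1)    # excitation: re-weight channels
        return self.fuse(x)                      # 1x1 conv fuses the channels

# e.g., fuse the 24-channel 1/4-scale and 32-channel 1/8-scale encoder maps
ese = eSE(ch_low=24, ch_high=32, ch_out=24)
out = ese(torch.randn(1, 24, 48, 160), torch.randn(1, 32, 24, 80))
```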

b: FUSION DENSE BLOCK (FDENSE)
In the dense block, each deep convolutional layer is concatenated channel-wise with all preceding layers. Thus, layer $l_i$ receives the feature maps of all previous layers (i.e., $l_0, l_1, \ldots, l_{i-1}$). Therefore, each layer in the dense block gains additional feature maps due to the ''collective knowledge'' of the previous layers. The dense block does not sum the output feature maps and the incoming feature maps; instead, it concatenates them. The equation is then rewritten as follows:

$$l_i = H_i\left(\left[l_0, l_1, \ldots, l_{i-1}\right]\right),$$

where $H_i(\cdot)$ denotes the convolution operation of layer $i$ and $[\cdot]$ denotes channel-wise concatenation, as in [10].
Herein, $l = 4$ layers with a growth rate of $k = 32$ for each layer. A dense block can alleviate the vanishing-gradient problem, strengthen feature propagation, and encourage feature reuse [10]. Since the feature maps are concatenated, the output channel dimension of the dense block increases by $k$ for every layer. Therefore, the output channel count of a dense block with an input channel count of $k_0$ can be calculated as $k_l = k_0 + k \times (l - 1)$. The number of parameters is further reduced by redesigning the dense block: a convolution with a 1 × 1 kernel is added to fuse the channels so that the output channel count of the block remains equal to the input channel count $k_0$. The fused channels are used to obtain high-quality features while also reducing the number of parameters of the network. We designed the fDense block on the basis of the techniques adopted in [44] and [45]. This allows the higher-resolution features learned previously in the network to be used before upsampling, as shown in Fig. 3.
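A minimal sketch of the fusion dense block, assuming the description above (four densely connected 3 × 3 convolution layers with growth rate 32, followed by a 1 × 1 fusion convolution back to the input width), could look as follows; the normalization and activation choices are illustrative:

```python
# fDense sketch: dense concatenation with growth rate k, then a 1x1
# "fusion" convolution that restores the input channel count k0.
import torch
import torch.nn as nn

class FDense(nn.Module):
    def __init__(self, k0, growth=32, layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        ch = k0
        for _ in range(layers):
            self.layers.append(nn.Sequential(
                nn.Conv2d(ch, growth, kernel_size=3, padding=1),
                nn.BatchNorm2d(growth), nn.ReLU(inplace=True),
            ))
            ch += growth                       # each layer sees all earlier maps
        self.fuse = nn.Conv2d(ch, k0, kernel_size=1)  # fusion back to k0

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        return self.fuse(torch.cat(feats, dim=1))

block = FDense(k0=64)
y = block(torch.randn(1, 64, 24, 80))  # output keeps the 64 input channels
```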

c: dSE MODULE
The SE block is also used in the decoding phase to improve the accuracy; it is referred to as dSE. The input channels of the disparity convolution block need to be re-weighted before the depth map is predicted, so the dSE is applied immediately before the disparity convolution block. The authors of [42] demonstrated that SE blocks provide significant performance improvements in existing models, such as ResNet [9]. Since the SE module uses an inexpensive channel-wise scaling operation, the model requires relatively few additional computational resources; notably, the accuracy of the SE-ResNet-50 model is similar to that of the deeper ResNet-101 network. Thus, we utilize the SE block to enhance the accuracy of the monocular depth prediction.
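A minimal sketch of this step, combining a standard SE re-weighting with the disparity convolution block described in Table 1 (a 3 × 3 convolution with a sigmoid activation), is shown below; the reduction ratio r is an illustrative assumption:

```python
# dSE sketch: channel re-weighting of the decoder features, then the
# disparity convolution block producing a one-channel map in (0, 1).
import torch
import torch.nn as nn

class DSEDisparity(nn.Module):
    def __init__(self, ch, r=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(ch, ch // r), nn.ReLU(inplace=True),
            nn.Linear(ch // r, ch), nn.Sigmoid(),
        )
        self.disp = nn.Sequential(
            nn.Conv2d(ch, 1, kernel_size=3, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))          # squeeze
        x = x * w.unsqueeze(-1).unsqueeze(-1)    # excite: re-weight channels
        return self.disp(x)                      # one-channel disparity map

disp = DSEDisparity(ch=64)(torch.randn(1, 64, 96, 320))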

B. PROBLEM FORMULATION
The approach used in this study is based on a self-supervised learning technique with a sequence of input images. The main idea involves estimating the appearance of a target image from another image's viewpoint. Therefore, a sequence of images is needed as input for the training network. Further, an additional network that estimates the camera pose from a sequence of images is required alongside the depth prediction network, as described in [6], [28]. Thus, the model architecture includes two separate networks, PoseNet and DepthNet. The proposed approach calculates the loss by minimizing the photometric reprojection error of the image sequences while predicting the depth map $D_t$ for the target image. Consider the image sequence $I_t$ and $I_{t-1}$. PoseNet and DepthNet are trained to predict $T_{t \to t-1}$ and $D_t$, respectively, which establish the projection relationship between the images $I_t$ and $I_{t-1}$. The loss is the difference between the real image $I_t$ and the reconstructed image $\hat{I}_t$. The per-pixel minimum photometric loss is used to handle occlusion, as shown below:

$$L_{pe} = \min pe\left(I_t,\ I_{t-1}\left\langle \operatorname{proj}\left(D_t, T_{t \to t-1}, K\right)\right\rangle\right).$$
pe represents the photometric reconstruction error, and K is the camera intrinsic matrix. According to [5], [6], [46], pe can be calculated by using the structural similarity index measure (SSIM) [47] over a 3 × 3 pixel window together with an additional L1 term:

$$pe\left(I_a, I_b\right) = \frac{\alpha}{2}\left(1 - SSIM\left(I_a, I_b\right)\right) + \left(1 - \alpha\right)\left\lVert I_a - I_b \right\rVert_1 .$$

The disparities of texture-less edges and low-gradient regions of the image are difficult to detect. This is overcome by using an edge-aware smoothness term [6]:

$$L_{smooth} = \left|\partial_x d_t^{*}\right| e^{-\left|\partial_x I_t\right|} + \left|\partial_y d_t^{*}\right| e^{-\left|\partial_y I_t\right|},$$

where $I_t$ denotes the target image, $d_t^{*} = d_t / \overline{d_t}$ is the mean-normalized disparity, and $D_t$ represents the depth map estimated by the depth network. The final training loss is a combination of the photometric reprojection error and the smoothness loss, averaged over the output scales:

$$L = \frac{1}{s}\sum\left(L_{pe} + \lambda L_{smooth}\right),$$

where s = 4 represents the number of scales of the decoding phase, and λ = 10^{-3} is the weight of the edge-aware smoothness error term.

TABLE 1. Summary of our FastMDE network architecture for self-supervised monocular depth estimation. U(·) refers to upsampling, and c refers to concatenated feature maps. A disparity convolution block is used at each dSE node; the input channel count of this convolution block equals the output channel count of the dSE, and the block outputs one channel. The disparity convolution block consists of a 3 × 3 convolution with a sigmoid activation function.

FIGURE 3. The FastMDE model used for monocular depth estimation. Every half scale of the input has six main components, namely the eSE block, dSE block, fusion dense block (fDense), upsample block, 3 × 3 convolution block, and disparity convolution block. The architecture can be expressed by equations (1), (2), and (3), where $x^e_i$ is the output of encoder node $X^e_i$, $x^d_i$ is the output of decoder node $X^d_i$, and $x_i$ is the output of the central node $X_i$.
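A minimal sketch of the loss defined above, assuming α = 0.85 as in Monodepth2 [5] and assuming the reprojected (warped) source frames have already been computed elsewhere via proj(·), could look as follows; the per-pixel SSIM is implemented over a 3 × 3 window:

```python
# Photometric (SSIM + L1) loss with per-pixel minimum over the warped
# source frames, plus edge-aware smoothness on the normalized disparity.
import torch
import torch.nn as nn

class SSIM(nn.Module):
    """Per-pixel SSIM-based distance (1 - SSIM)/2 over a 3x3 window."""
    def __init__(self):
        super().__init__()
        self.pool = nn.AvgPool2d(3, 1, padding=1)
        self.C1, self.C2 = 0.01 ** 2, 0.03 ** 2

    def forward(self, x, y):
        mu_x, mu_y = self.pool(x), self.pool(y)
        sig_x = self.pool(x * x) - mu_x ** 2
        sig_y = self.pool(y * y) - mu_y ** 2
        sig_xy = self.pool(x * y) - mu_x * mu_y
        num = (2 * mu_x * mu_y + self.C1) * (2 * sig_xy + self.C2)
        den = (mu_x ** 2 + mu_y ** 2 + self.C1) * (sig_x + sig_y + self.C2)
        return torch.clamp((1 - num / den) / 2, 0, 1)

_ssim = SSIM()

def photometric_error(pred, target, alpha=0.85):
    # pe = alpha/2 * (1 - SSIM) + (1 - alpha) * L1
    l1 = (pred - target).abs().mean(1, keepdim=True)
    return alpha * _ssim(pred, target).mean(1, keepdim=True) + (1 - alpha) * l1

def smoothness_loss(disp, img):
    # Edge-aware smoothness on d* = d / mean(d).
    d = disp / (disp.mean(dim=(2, 3), keepdim=True) + 1e-7)
    dx = (d[:, :, :, :-1] - d[:, :, :, 1:]).abs()
    dy = (d[:, :, :-1, :] - d[:, :, 1:, :]).abs()
    ix = (img[:, :, :, :-1] - img[:, :, :, 1:]).abs().mean(1, keepdim=True)
    iy = (img[:, :, :-1, :] - img[:, :, 1:, :]).abs().mean(1, keepdim=True)
    return (dx * torch.exp(-ix)).mean() + (dy * torch.exp(-iy)).mean()

def total_loss(target, warped_sources, disp, lam=1e-3):
    # Per-pixel minimum over the reprojected source frames handles occlusion.
    pe = torch.stack([photometric_error(w, target) for w in warped_sources])
    return pe.min(dim=0).values.mean() + lam * smoothness_loss(disp, target)
```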

C. IMPLEMENTATION DETAILS
This section contains additional details on network architectures, training specifications, and the datasets that were used to train the FastMDE model.

1) PoseNet
The architecture of the PoseNet model is described in [5]. It is built on ResNet-18. Since the network takes two input frames, the number of input channels of the first convolution layer changes from 3 to 6. The PoseNet network estimates the six degrees of freedom (DoF) of the camera pose relative to a scene: the first three dimensions represent the translation vector, and the last three dimensions represent the Euler angles.
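A minimal sketch of this 3-to-6-channel adaptation, following the common practice from [5] of duplicating and rescaling the pre-trained filters, is shown below; the weight-copying scheme is that convention, not necessarily the authors' exact code:

```python
# Widening the first ResNet-18 convolution from 3 to 6 input channels.
import torch
import torch.nn as nn
import torchvision

resnet = torchvision.models.resnet18(weights="IMAGENET1K_V1")
old_conv = resnet.conv1
resnet.conv1 = nn.Conv2d(6, 64, kernel_size=7, stride=2, padding=3, bias=False)
with torch.no_grad():
    # Duplicate the pre-trained RGB filters for both frames and halve
    # them so the activation magnitudes stay comparable.
    resnet.conv1.weight.copy_(torch.cat([old_conv.weight] * 2, dim=1) / 2)

frames = torch.randn(1, 6, 192, 640)   # two RGB frames stacked channel-wise
features = resnet(frames)              # a pose head then regresses the 6 DoF
```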

2) DepthNet
The DepthNet model uses the standard MobileNetV2 [39] as its encoder. The details of the architecture proposed in this study are listed in Table 1. The disparity map output of each scale is used to calculate the loss during training. The output scale is similar to that of the input image for comparison purposes.

3) DATASET
The KITTI dataset [48], which is most widely used for depth evaluation, is used in this study. The data split was based on [49], and the static frame removal procedure was performed according to [28]. The model was trained, validated, and evaluated with 39810, 4424, and 697 images, respectively. All images have the same intrinsic properties. The principal point of the camera is fixed at the image center. The focal length is equal to the average of the focal lengths in the KITTI dataset. The transformation between the two stereo frames is treated as a purely horizontal translation of fixed length for stereo training.

4) IMPLEMENTATION
The open-source PyTorch library is used to train the models in this study. The hyperparameters are set as follows: the models are trained for 20 epochs with a learning rate of 10^{-4} for the first 15 epochs, reduced to 10^{-5} for the remaining 5 epochs. The Adam optimizer [50] is used for training with exponential decay rates of β1 = 0.9 and β2 = 0.999. The batch size is set to 8. Input images with a resolution of 640 × 192 are trained on a single GTX 2080 GPU; the training process requires 24 hours for completion. However, due to the limited memory of the GPU, images with a resolution of 1024 × 320 are trained on three GPUs for 10 hours. Data parallelism techniques are used to decompose the dataset into subsets, which are consumed in batches on different GPUs through the same model.
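A minimal sketch of this schedule is given below; `model`, `train_loader`, and `compute_loss` are hypothetical placeholders, not the authors' code:

```python
# Adam with beta1 = 0.9, beta2 = 0.999; lr = 1e-4 for the first 15 epochs,
# then 1e-5 for the last 5.
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
# Drop the learning rate by 10x after epoch 15 (1e-4 -> 1e-5).
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=15, gamma=0.1)

for epoch in range(20):
    for batch in train_loader:        # batch size 8, 640 x 192 inputs
        optimizer.zero_grad()
        loss = compute_loss(model, batch)
        loss.backward()
        optimizer.step()
    scheduler.step()
```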

5) TensorRT
The deployment is built on the TensorRT engine, which was developed by Nvidia to accelerate deep learning inference on NVIDIA GPUs. It can be easily deployed on Linux embedded systems that support TensorRT. The Open Neural Network Exchange (ONNX) format [51] is used to convert the PyTorch model into a TensorRT engine. The inference time of the TensorRT engine is approximately 30 milliseconds. This study thus makes it possible for users to utilize TensorRT to easily develop applications.
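A minimal sketch of the PyTorch-to-ONNX step of this path is shown below; the file names and opset version are illustrative assumptions:

```python
# Export the trained DepthNet (`model`, assumed to exist) to ONNX.
import torch

model.eval()
dummy = torch.randn(1, 3, 320, 1024)      # resolution matching deployment
torch.onnx.export(model, dummy, "fastmde.onnx",
                  input_names=["image"], output_names=["disparity"],
                  opset_version=11)
# On the Jetson, the engine can then be built with Nvidia's trtexec tool:
#   trtexec --onnx=fastmde.onnx --saveEngine=fastmde.trt
```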

IV. EXPERIMENTAL RESULTS

A. EVALUATION ON THE KITTI DATASET
An analysis of the MDE performance on the KITTI dataset was carried out by using the evaluation metrics described in [21]. These error metrics are defined as

$$\text{Abs Rel} = \frac{1}{n}\sum_{p} \frac{\left|y_p - \hat{y}_p\right|}{y_p}, \qquad \text{Sq Rel} = \frac{1}{n}\sum_{p} \frac{\left(y_p - \hat{y}_p\right)^2}{y_p},$$

$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{p}\left(y_p - \hat{y}_p\right)^2}, \qquad \text{RMSE log} = \sqrt{\frac{1}{n}\sum_{p}\left(\log y_p - \log \hat{y}_p\right)^2},$$

together with the accuracy under a threshold, i.e., the percentage of pixels for which $\max\left(\frac{y_p}{\hat{y}_p}, \frac{\hat{y}_p}{y_p}\right) = \delta$ is below a given threshold, where $y_p$ is a pixel of the ground truth depth image $y$, $\hat{y}_p$ is the corresponding pixel of the predicted depth image, and $n$ is the total number of pixels in the depth image.
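As a reference, a minimal sketch of these metrics, computed over flattened arrays of valid ground-truth and predicted depths, could look as follows:

```python
# Standard depth evaluation metrics from [21].
import numpy as np

def depth_metrics(y, y_hat):
    """y, y_hat: flattened arrays of valid ground-truth / predicted depths."""
    abs_rel = np.mean(np.abs(y - y_hat) / y)
    sq_rel = np.mean((y - y_hat) ** 2 / y)
    rmse = np.sqrt(np.mean((y - y_hat) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(y) - np.log(y_hat)) ** 2))
    ratio = np.maximum(y / y_hat, y_hat / y)
    a1, a2, a3 = [np.mean(ratio < 1.25 ** i) for i in (1, 2, 3)]
    return abs_rel, sq_rel, rmse, rmse_log, a1, a2, a3
```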
The results of different variants of the model were compared. The variants were trained according to different self-supervision techniques, namely monocular video only (M) and monocular plus stereo (MS). The results obtained from our models are compared with those of state-of-the-art models and other unsupervised learning methods. The evaluation results are shown in Table 2. The results demonstrate that the most lightweight model is PyD-Net [52] with 1.9 M parameters, and the second-most lightweight model is the Lite-HR-Depth model [7].

TABLE 2. Comparison between the results of state-of-the-art techniques and the proposed technique, based on self-supervised learning methods and the KITTI dataset.
Relatively low values of the error metrics, such as the absolute relative difference (Abs Rel), squared relative difference (Sq Rel), linear root-mean-square error (RMSE), and log root-mean-square error (RMSE log), are desirable. In contrast, the accuracy indices, namely the thresholds δ < 1.25, δ < 1.25², and δ < 1.25³, should be as high as possible.
However, these networks have poor accuracy. The accuracy of the proposed network architecture is similar to that of the state-of-the-art HR-Depth model and outperforms a recent model [34] with 128.29 M parameters. In addition, our network uses 3.5 times fewer parameters than the HR-Depth model (the HR-Depth model and the current model use 14.62 M and 4.1 M parameters, respectively). Fig. 4 shows a qualitative comparison between the MDE performance of the proposed FastMDE model and the other state-of-the-art methods. The current model can predict edges with higher quality and more sharpness than the Monodepth2 model [5]. Its performance is comparable to that of the recently developed state-of-the-art HR-Depth [7] architecture while utilizing fewer parameters. Although semantic and spatial feature maps cannot be captured easily by small DCNN architectures, both feature maps are captured well by the proposed model.

B. EVALUATION ON THE MAKE3D DATASET

Table 3 provides the estimation results of our proposed model on the Make3D dataset [53]. We evaluate our model on the 134 images (collected using a 3D scanner) of Make3D with a center crop of 2 × 1 ratio. Therefore, we crop the original images of the Make3D dataset to 1704 × 852, resize them to 640 × 192, and finally pass them through the network. As the table shows, our proposed FastMDE model outperforms all the compared methods that use monocular supervision, such as Monodepth and Monodepth2. Moreover, our estimated depth results are shown in Fig. 5: our method reliably produces depth maps with clear boundaries between various objects.

C. EVALUATION ON REAL IMAGES
To evaluate whether our model architecture, FastMDE, achieves good stability and generalization, we test the model with images captured by a hand-held phone camera. The size of an original image captured by the phone camera is 2048 × 1536. We crop the captured images to 2048 × 640, resize them to 1024 × 320, and apply no image enhancement. Our estimated depth results are shown in Fig. 6. The results show that our model achieves strong generalization on real scenes captured by mobile phones.

V. ABLATION STUDIES

A. ADVANTAGES OF THE SKIP CONNECTION WITH eSE AND dSE BLOCKS
The eSE module and skip connection are redesigned to capture the maximum number of important features from the encoding phase. The selected features belong to the two output layers $X^e_i$ and $X^e_{i+1}$ of the encoder. The eSE re-weights the important channels and then fuses them by using a standard convolution with a 1 × 1 kernel; thus, the eSE module not only reduces the number of parameters but also improves the performance of the network, even beyond that of the SE block, as shown in Table 4. Moreover, it is difficult to fuse features across the large semantic gap between the encoder and decoder. Therefore, the eSE skip connection is used to generate more intermediate semantic features from the encoder, effectively reducing the semantic gap, as shown in Fig. V-A. With more semantic information, the depth map estimation improves significantly. The dSE block is applied in the decoder to re-weight the important features before generating the depth prediction map. The influence of the eSE + dSE blocks with the skip connection is shown in Table 6. The model accuracy improved from 0.114 to 0.109 in terms of the absolute relative difference metric when the eSE + dSE blocks were used.

B. ADVANTAGES OF THE fDense BLOCK CONNECTION
The redesigned dense block connection allows the model to capture additional semantic information. The fDense block connection reuses the feature maps of the previous layers: the features learned in previous layers are passed forward, which removes the need to learn redundant features and encourages each layer to learn a different set of features. This permits the high-resolution features learned earlier in the network to be applied before the upsampling process. Therefore, it improves the accuracy of the model and helps predict sharp edges. The model can produce high-resolution feature maps, as shown in Fig. 7. The model with the fDense connection, shown in the right panel, can easily capture an image's textural information, despite the low resolution. To evaluate our fDense block, we apply the original dense block to the eSE + dSE configuration and compare it with our fDense block. As shown in Table 5, our fDense module reduces the number of parameters from 4.8 M to 4.1 M. Interestingly, the accuracies (i.e., Abs Rel and Sq Rel) achieved by fDense are higher than those of the original dense block. The results of the ablation studies are shown in Table 6. Although the application of the fDense connection in the decoder increases the number of parameters from 3.3 M to 4.1 M, the fDense connection significantly improves the accuracy from 0.114 to 0.109 in terms of the absolute relative difference metric. Even when only one of the two effective modules is applied, the proposed model outperforms the baseline Monodepth2 while using fewer parameters, as shown in Table 6.

C. TensorRT
TensorRT is a software development kit (SDK) developed by Nvidia that is used to build high-performance deep learning inference engines. We used the TensorRT SDK to build the model engine in this study. It can be easily deployed on Nvidia Linux development kits (tested with the Nvidia Xavier NX). The model is converted from the PyTorch model to ONNX and then built with the TensorRT engine. Since the TensorRT engine does not support reflection padding from the ONNX model, we changed it to zero padding, as sketched below. The weights from the reflection-padding model are reused, followed by fine-tuning at a learning rate of 10^{-5} for five epochs to obtain accurate weights for the zero-padding model. The accuracy is slightly reduced due to the influence of zero padding, as shown in Table 7.
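A minimal sketch of this padding swap is shown below; the recursive traversal is a generic, illustrative way to replace the modules in place before fine-tuning, not necessarily the authors' implementation:

```python
# Swap every ReflectionPad2d for a ZeroPad2d with the same padding.
import torch.nn as nn

def reflection_to_zero_pad(module):
    """Recursively replace ReflectionPad2d modules with ZeroPad2d in place."""
    for name, child in module.named_children():
        if isinstance(child, nn.ReflectionPad2d):
            setattr(module, name, nn.ZeroPad2d(child.padding))
        else:
            reflection_to_zero_pad(child)

# reflection_to_zero_pad(model)  # then fine-tune at lr = 1e-5 for 5 epochs
```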
To show that our model architecture, FastMDE, can achieve higher speed than state-of-the-art models, we built TensorRT model engines for various depth estimation algorithms. We then compared the speed of the various model architectures by using the CUDA event functions in the PyTorch package, as shown in Table 8. The results show that the speed of our model outperforms that of all the other methods.
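A minimal sketch of this CUDA-event timing is given below; the warm-up and repeat counts are illustrative, and `model` is assumed to be a trained network:

```python
# GPU latency measurement with torch.cuda.Event.
import torch

model = model.cuda().eval()
x = torch.randn(1, 3, 192, 640, device="cuda")
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

with torch.no_grad():
    for _ in range(10):                   # warm-up iterations
        model(x)
    torch.cuda.synchronize()
    start.record()
    for _ in range(100):
        model(x)
    end.record()
torch.cuda.synchronize()
print(f"average latency: {start.elapsed_time(end) / 100:.2f} ms")
```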

VI. CONCLUSION
A lightweight convolutional network architecture named FastMDE was developed in this study by applying a novel skip connection with the eSE module, the dSE module, and the fDense block. The network utilizes self-supervised learning for fast MDE at high resolution. The eSE block was designed according to an analysis of the properties of the encoder and decoder sections to enhance the quality of depth map prediction and to assist the model in identifying the important features of the encoder and decoder, while the fDense connection captures those features in great detail. Together, our eSE and fDense modules reduce the parameters of FastMDE from 5.4 M to 4.1 M compared to the original SE and dense modules. The number of trainable parameters required by the proposed network architecture is 3.5 times smaller than that of the HR-Depth model, while maintaining similar accuracy. We also provide the TensorRT engine to easily deploy our model on Nvidia Linux embedded boards that support TensorRT. Notably, our model runs in real time (∼33 fps) with an input image resolution of 640 × 192. Deploying the proposed model on an autonomous drone is our goal for future work.