Self-Supervised Monocular Depth Estimation Using Hybrid Transformer Encoder

Depth estimation from a monocular camera is an important technique in computer vision. Supervised monocular depth estimation requires large amounts of data acquired from depth sensors, but acquiring depth data is expensive, and in some settings sensor limitations make it impossible. View-synthesis-based depth estimation is a self-supervised learning method that does not require depth supervision. Previous studies mainly use convolutional neural network (CNN)-based encoders. CNNs are suited to extracting local features through convolution operations, whereas recent vision transformers (ViTs) are suited to extracting global features through multihead self-attention modules. In this article, we propose a hybrid network that combines CNN and ViT components for self-supervised monocular depth estimation. We design an encoder–decoder structure that uses CNNs in the earlier stages to extract local features and a ViT in the later stages to extract global features. We evaluate the proposed network through experiments on the Karlsruhe Institute of Technology and Toyota Technological Institute (KITTI) and Cityscapes datasets. The results show higher performance than previous studies with fewer parameters and computations. Code and trained models are available at https://github.com/fogfog2/manydepthformer.

Monocular camera-based depth estimation is divided into supervised and unsupervised learning-based methods. Supervised learning-based networks directly learn depth data obtained from sensors. The unsupervised learning-based method estimates depth based on a view synthesis technique that reconstructs the current image from temporally adjacent consecutive frames [1]. The supervised learning-based method achieves higher performance than unsupervised learning because it learns directly from depth data. However, acquiring 3-D data is expensive. When 3-D data cannot be acquired, or a depth sensor is not available, unsupervised learning is the only solution. Recently, unsupervised learning-based models have been actively studied in various fields such as autonomous driving [2], smartphone-based AR systems [3], drone avoidance systems [4], and medical systems [5]. However, the unsupervised learning method based on a monocular camera cannot estimate absolute depth. Therefore, the output of unsupervised depth estimation is applicable only to fields where relative depth is sufficient. Some studies additionally train on velocity to estimate absolute depth or use inertial measurement unit (IMU) sensors [6].

In most cases, a convolutional neural network (CNN)-based backbone has been the basic model of existing deep-learning-based monocular depth estimation. This is true not only in depth estimation [7], but also in image classification [8], segmentation [9], and detection [10]. Two CNN-based residual network (ResNet)

In the decoder, channel-spatial attention is additionally applied to the commonly used multiscale fusion module to improve depth estimation performance.

The structure of this article is as follows. Section II investigates related studies, backbone networks, and unsupervised depth estimation. Section III describes the structure of the proposed overall network, and Section IV shows the experimental results of applying the proposed method to the Karlsruhe Institute of Technology and Toyota Technological Institute (KITTI) and Cityscapes datasets. The conclusion is presented in Section V.

A. Backbone Networks

Since the proposal of AlexNet [18], the CNN has been used as the main backbone network in computer vision. Various models have been studied, such as the visual geometry group (VGG) network [19], ResNet [11], MobileNet [20], and EfficientNet [12]. VGG analyzed the effect of network depth, and ResNet proposed a residual structure that merges the input with the output. MobileNet improves network efficiency with depthwise convolution and an inverted residual block. EfficientNet improves performance through compound scaling, which jointly determines the depth, width, and input image size of the network. Recent studies use the above general CNN-based backbone networks with some modifications [7], [8]. An ensemble model is also used for optimization [21].

In particular, ResNet improves the learning speed and training effectiveness of the network without significantly increasing parameters and computation, thanks to its shortcut structure [11], [22]. Even when the number of layers increases, gradient vanishing is prevented by the skip connections. In the field of depth estimation, the ResNet backbone has been widely used since the influential Monodepth2 paper [23]. Following these previous works, we construct an efficient hybrid network based on ResNet.
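To make the shortcut structure concrete, the following is a minimal PyTorch sketch of a basic residual block in the ResNet-18 style; the layer names and channel handling are illustrative assumptions, not the exact configuration used in our encoder.

```python
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Basic ResNet-style block: two 3x3 convolutions plus an identity shortcut."""
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        # 1x1 projection when the shortcut must match shape (stride or channel change).
        self.downsample = None
        if stride != 1 or in_channels != out_channels:
            self.downsample = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )

    def forward(self, x):
        identity = x if self.downsample is None else self.downsample(x)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # The skip connection keeps gradients flowing even in deep stacks.
        return self.relu(out + identity)
```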
The transformer is a model that first showed good performance in natural language processing. ViT is an early vision transformer model that achieved the best performance by applying a transformer to image classification. Data-efficient image transformers (DeiT) [24] proposed a method for distilling the knowledge of a CNN-based network into a transformer and applied various data augmentation techniques. The Swin transformer proposes a hierarchical feature map that reduces image resolution through patch merging at each stage. With the hierarchical feature map, it became possible to apply the transformer not only to classification, but also to image detection and segmentation [16]. In addition, the Swin transformer proposes a cyclically shifted window to improve local feature representation. After that, convolutions to vision transformers (CvT) [25] removes the limitation of nonoverlapping patch embeddings by using convolution for token embedding. A recent study, CNNs meet vision transformers (CMT) [26], proposed a lightweight multihead self-attention that reduces the spatial resolution of keys and values. In addition, an inverted residual feed-forward network (IRFFN) based on depthwise separable convolution is used instead of the standard feed-forward network to improve local representation.

B. Self-Supervised Monocular Depth Estimation
Although the supervised learning method shows relatively good performance in monocular depth estimation, recent unsupervised learning methods also show comparable performance [27]. The unsupervised learning model is a depth estimation method that can be easily applied to images for which depth data are difficult to acquire. Garg et al. [28] proposed a view synthesis technique

There is also a recent study using transformers for self-supervised depth estimation. Varma et al. [50] configured the network to learn camera parameters and compared networks using either a CNN or a transformer. Guizilini et al. [51] proposed a method to generate the cost volume from a cross-attention-based transformer network. However, it requires an additional network instead of the simple difference operation used in previous studies.

Recent studies use a transformer for depth estimation and show good performance, but they require a large amount of computation and many parameters. In this study, we propose a hybrid network that is more efficient than existing CNN-based networks by hierarchically mixing CNNs and transformers.

In this section, we describe the proposed hybrid transformer-based self-supervised depth estimation method. First, the view synthesis method of the self-supervised learning model and the cost-volume-based depth estimation methodology are reviewed. This review covers the equations and geometric models used in the proposed model. Then, the proposed hybrid encoder-decoder network is described. The overall block diagram of the proposed structure is shown in Fig. 1.

In this article, depth and pose networks are trained simultaneously for unsupervised depth estimation, following recent studies [2], [6], [23]. The network is trained through a view synthesis process that minimizes the photometric error between the target image $I_t$ and the target image $\hat{I}_{s \to t}$ reconstructed from the source image $I_s$ at the target viewpoint. The reconstructed image is sampled from the source image using the 2-D homogeneous coordinates obtained by projection with the predicted target depth and the predicted pose. The predicted depth $D_t$ and pose $P_{t \to s}$ are estimated by each network, and the camera parameters $K$ are given as input. The view synthesis process for generating the reconstructed image is as follows:

$$\hat{I}_{s \to t} = I_s \left\langle \mathrm{proj}(D_t, P_{t \to s}, K) \right\rangle$$

where $\mathrm{proj}$ is the camera projection operation and $\langle \cdot \rangle$ is the bilinear sampling operation using the STN [30].

The photometric error $pe$ combines the L1 distance and SSIM [31], which measures the degree of similarity between images. The image reconstruction loss is as follows:

$$L_{reconstruction} = pe\bigl(I_t, \hat{I}_{s \to t}\bigr) = \frac{a}{2}\bigl(1 - \mathrm{SSIM}(I_t, \hat{I}_{s \to t})\bigr) + (1 - a)\bigl\| I_t - \hat{I}_{s \to t} \bigr\|_1$$

where $a$ is the balancing weight and SSIM is a method of evaluating image quality.

The source image consists of temporally adjacent frames of the target image. The reconstructed target image from the source image depends on the number of adjacent frames.

The L1 distance between the feature map $F_t$ for each depth unit and the reconstructed target feature map $F^d_{s \to t}$ is input to each channel of the depth cost volume.
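As an illustration of the view synthesis loss described above, here is a minimal PyTorch sketch of the SSIM-plus-L1 photometric error; the 3x3 window, the pooling-based SSIM approximation, and the default balancing weight are assumptions for illustration, not the exact settings of the proposed network.

```python
import torch
import torch.nn.functional as F

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified SSIM over 3x3 local windows (average pooling as the local mean)."""
    mu_x = F.avg_pool2d(x, 3, 1, padding=1)
    mu_y = F.avg_pool2d(y, 3, 1, padding=1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, padding=1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, padding=1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, padding=1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return torch.clamp(num / den, 0, 1)

def photometric_error(target, reconstructed, a=0.85):
    """pe = a/2 * (1 - SSIM) + (1 - a) * L1, averaged over color channels per pixel."""
    l1 = torch.abs(target - reconstructed).mean(1, keepdim=True)
    ssim_term = (1 - ssim(target, reconstructed)).mean(1, keepdim=True) / 2
    return a * ssim_term + (1 - a) * l1
```

In common implementations such as Monodepth2, the per-pixel error map is then reduced by taking the minimum over the source frames before averaging, which handles occlusions between adjacent frames.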

A cost-volume-based depth estimation that receives multiple frames generally works better than single-frame estimation. However, an object moving in the same direction as the camera or a textureless region is a major cause of failure in cost-volume depth estimation. Because cost-volume-based depth estimation uses the difference in matched features as a learning signal, the estimation fails when the depth difference cannot be recovered, as in the cases above. To solve this problem, a recent study uses a single-image depth constraint network as a teacher model. The network used as the teacher model is the baseline of existing studies [6], [23]. The L1 distance between the depth $D_t$ predicted from the cost volume and the depth $\hat{D}_t$ predicted by the depth constraint network is added to the loss function, preventing the network from becoming excessively dependent on the disparity. The depth constraint loss is written as

$$L_{constraint} = \bigl\| D_t - \hat{D}_t \bigr\|_1 .$$

Additionally, an edge-aware term such as (6), which constrains the gradient of the depth according to the gradient of the image, is added as in previous studies [4], [5], [6]:

$$L_{smooth} = \left| \delta_x D_t \right| e^{-\left| \delta_x I_t \right|} + \left| \delta_y D_t \right| e^{-\left| \delta_y I_t \right|} . \tag{6}$$
The final loss consists of the reconstruction loss, depth constraint loss, and depth smoothness loss and is as follows:

$$L = \alpha L_{reconstruction} + \beta L_{constraint} + \gamma L_{smooth}$$

where $\alpha$, $\beta$, and $\gamma$ are loss function scale correction weights.
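For concreteness, below is a minimal PyTorch sketch of the edge-aware smoothness term in (6) and the weighted total loss; the weight values alpha, beta, and gamma are placeholders rather than the values used in our experiments.

```python
import torch

def smoothness_loss(depth, image):
    """Edge-aware smoothness: penalize depth gradients except where the image has edges."""
    # Finite differences along x (width) and y (height).
    d_dx = torch.abs(depth[:, :, :, :-1] - depth[:, :, :, 1:])
    d_dy = torch.abs(depth[:, :, :-1, :] - depth[:, :, 1:, :])
    i_dx = torch.abs(image[:, :, :, :-1] - image[:, :, :, 1:]).mean(1, keepdim=True)
    i_dy = torch.abs(image[:, :, :-1, :] - image[:, :, 1:, :]).mean(1, keepdim=True)
    return (d_dx * torch.exp(-i_dx)).mean() + (d_dy * torch.exp(-i_dy)).mean()

def total_loss(reconstruction, constraint, smooth, alpha=1.0, beta=1.0, gamma=1e-3):
    """Weighted sum of the three loss terms; the weights here are illustrative."""
    return alpha * reconstruction + beta * constraint + gamma * smooth
```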

C. Hybrid Encoder and Self-Attention Decoder
In this article, we propose a novel encoder-decoder network that transforms the cost volume into depth. The proposed network is efficient in terms of parameters and computation and achieves high accuracy with low error metrics.

In detail, the LFB is composed of two residual blocks of the existing ResNet, and the first residual block uses a stride of 2 to reduce the resolution. When an input $X \in \mathbb{R}^{H \times W \times C}$ is given, the LFB process is as follows:

$$\mathrm{LFB}(X) = \mathrm{Residual}\bigl(\mathrm{Residual}(X, s=2),\; s=1\bigr)$$

where Residual is the existing ResNet block and $s$ is the stride.
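A minimal PyTorch sketch of such a local feature block (LFB) is shown below, reusing the BasicResidualBlock from the earlier sketch; the channel widths and the module name are illustrative assumptions.

```python
import torch.nn as nn

class LocalFeatureBlock(nn.Module):
    """LFB sketch: two ResNet-style residual blocks, the first with stride 2 to halve resolution."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        # BasicResidualBlock is the residual block defined in the earlier sketch.
        self.block1 = BasicResidualBlock(in_channels, out_channels, stride=2)
        self.block2 = BasicResidualBlock(out_channels, out_channels, stride=1)

    def forward(self, x):
        return self.block2(self.block1(x))
```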

The GFB uses a CMT block that mixes convolution and self-attention. When an input feature $X$ is given, the local perception unit (LPU) is as follows:

$$\mathrm{LPU}(X) = \mathrm{DWConv}(X) + X$$

where DWConv is a depthwise convolution.

Next, in the LA module, the 2-D input feature $X \in \mathbb{R}^{H \times W \times C}$ is flattened to $X \in \mathbb{R}^{N \times C}$ for the patch operation.

Here, $N = H \times W$. To reduce the computational complexity of the attention operation, the key and the value reduce the spatial resolution with a $k \times k$ depthwise convolution with stride $k$. Following the standard self-attention operation, the query and the key are linearly transformed into the $d_k$ dimension and the value into the $d_v$ dimension. Thus, they become a query $Q \in \mathbb{R}^{N \times d_k}$, key $K \in \mathbb{R}^{(N/k^2) \times d_k}$, and value $V \in \mathbb{R}^{(N/k^2) \times d_v}$. As in a recent study [16], a relative position bias $B$ is added, and LA is as follows:

$$\mathrm{LA}(X) = \mathrm{Softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}} + B\right) V .$$

The IRFFN is similar to the structure of MobileNetV2 [52], but the connection location of the residual has been changed. The modified IRFFN is as follows:

$$\mathrm{IRFFN}(X) = \mathrm{Conv}\bigl(\mathrm{F}(\mathrm{Conv}(X))\bigr), \qquad \mathrm{F}(X) = \mathrm{DWConv}(X) + X .$$

Finally, the GFB is composed of each block and residual connections as follows:

$$X_i = \mathrm{LPU}(X_{i-1}),$$
$$X_i' = \mathrm{LA}\bigl(\mathrm{LN}(X_i)\bigr) + X_i,$$
$$X_i'' = \mathrm{IRFFN}\bigl(\mathrm{LN}(X_i')\bigr) + X_i'$$

where $X_i$ and $X_i'$ are the outputs of the LPU and LA, respectively, and LN is layer normalization. Multiple GFBs are stacked in each stage.
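To make the GFB concrete, here is a minimal PyTorch sketch of a CMT-style block (LPU, lightweight attention with spatially reduced keys and values, and IRFFN). The single attention head, the omission of the relative position bias, the reduction factor k, and the channel dimensions are simplifications for illustration, not the exact configuration of the proposed encoder; the sketch assumes the feature height and width are divisible by k.

```python
import torch.nn as nn

class LPU(nn.Module):
    """Local perception unit: depthwise 3x3 convolution plus a residual connection."""
    def __init__(self, dim):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)

    def forward(self, x):                                      # x: (B, C, H, W)
        return self.dwconv(x) + x

class LightweightAttention(nn.Module):
    """Single-head attention whose keys/values are reduced by a k x k stride-k depthwise conv."""
    def __init__(self, dim, k=2):
        super().__init__()
        self.scale = dim ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv_reduce = nn.Conv2d(dim, dim, k, stride=k, groups=dim)
        self.kv = nn.Linear(dim, 2 * dim)

    def forward(self, x, h, w):                                # x: (B, N, C), N = h * w
        b, n, c = x.shape
        q = self.q(x)                                          # (B, N, C)
        x2d = x.transpose(1, 2).reshape(b, c, h, w)
        red = self.kv_reduce(x2d).flatten(2).transpose(1, 2)   # (B, N/k^2, C)
        key, value = self.kv(red).chunk(2, dim=-1)
        attn = (q @ key.transpose(1, 2)) * self.scale          # (B, N, N/k^2)
        attn = attn.softmax(dim=-1)                            # relative position bias omitted
        return attn @ value                                    # (B, N, C)

class IRFFN(nn.Module):
    """Inverted residual FFN: expand, depthwise conv with residual, then project back."""
    def __init__(self, dim, expansion=4):
        super().__init__()
        hidden = dim * expansion
        self.expand = nn.Conv2d(dim, hidden, 1)
        self.dwconv = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.project = nn.Conv2d(hidden, dim, 1)
        self.act = nn.GELU()

    def forward(self, x):                                      # x: (B, C, H, W)
        h = self.act(self.expand(x))
        h = self.act(self.dwconv(h) + h)                       # residual inside the expansion
        return self.project(h)

class GlobalFeatureBlock(nn.Module):
    """GFB sketch: LPU -> LA (layer norm + residual) -> IRFFN (layer norm + residual)."""
    def __init__(self, dim, k=2):
        super().__init__()
        self.lpu = LPU(dim)
        self.norm1 = nn.LayerNorm(dim)
        self.attn = LightweightAttention(dim, k)
        self.norm2 = nn.LayerNorm(dim)
        self.irffn = IRFFN(dim)

    def forward(self, x):                                      # x: (B, C, H, W)
        b, c, h, w = x.shape
        x = self.lpu(x)
        tokens = x.flatten(2).transpose(1, 2)                  # (B, N, C)
        tokens = tokens + self.attn(self.norm1(tokens), h, w)
        normed = self.norm2(tokens).transpose(1, 2).reshape(b, c, h, w)
        x = tokens.transpose(1, 2).reshape(b, c, h, w)
        return x + self.irffn(normed)
```

In a hypothetical encoder stage, several GlobalFeatureBlock modules would be stacked after a strided patch-embedding convolution that reduces the spatial resolution between stages.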

Through the feature extractor and hybrid encoder, the spatial resolution of each feature map is reduced to $F_n \in \mathbb{R}^{(H/2^n) \times (W/2^n) \times C_n}$, $1 \le n \le 5$. Here, $n$ is the layer number, with two layers in the feature extractor and three layers in the hybrid encoder (Table I).

The self-attention block consists of a channel attention operation and a spatial attention operation [53]. The self-attention operations consist of the 1-D channel self-attention operation $M_c \in \mathbb{R}^{1 \times 1 \times C}$ and the 2-D spatial self-attention operation $M_s \in \mathbb{R}^{H \times W \times 1}$ for the input feature $X \in \mathbb{R}^{H \times W \times C}$. The total self-attention process is as follows:

$$X' = M_c(X) \otimes X,$$
$$X'' = M_s(X') \otimes X'$$

where $\otimes$ is element-wise multiplication, $X'$ is the result of the channel self-attention operation, and $X''$ is the final result of the attention operation (a minimal code sketch follows the ablation discussion below). After the self-attention operation, each

We compare the proposed method with existing state-of-the-art methods. Table II shows the quantitative performance evaluation on the KITTI dataset for the existing depth estimation models and the proposed hybrid model. Test frames indicate the number of frames used at test time, and the numbers in parentheses (-1, 0, and 1) denote the previous frame, the current frame, and the next frame, respectively. The semantics column shows whether segmentation networks or semantic supervision are used. The T50 variant of the proposed model means that the backbone of the single-depth constraint network is changed from ResNet18 to ResNet50.

The proposed hybrid model shows lower error metrics than the existing models. The proposed method outperforms single-frame-based estimation methods as well as multiframe-based estimation methods. In the accuracy evaluation of $\delta < 1.25^3$, it scored slightly lower than the model using semantic information [41] and the heavier PackNet backbone network [6]. However, the difference is very small, and the proposed model shows high performance in most evaluation metrics without using semantic information. Furthermore, we found better performance when using ResNet50 as the backbone of the single-depth constraint network.

Table III shows comparative experiments on the Cityscapes dataset. Again, we show higher performance than previous studies in all evaluation metrics.

Table IV shows the results of the ablation study that evaluates each module of the proposed method. The base model is the Manydepth model. Performance is evaluated according to the presence of the proposed hybrid encoder and self-attention decoder. The proposed hybrid encoder improves both the error and accuracy metrics. In addition, the use of the hybrid encoder reduces parameters and computation. This means that, compared to the ResNet layers, the lightweight transformer model maintains global feature representation performance with a low amount of computation. The self-attention decoder improves the SqRel and RMSE error metrics but lowers the accuracy metrics. The use of all proposed modules improves both the error and accuracy metrics. However, the accuracy metric is lower than when using only the hybrid encoder.
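As referenced above, here is a minimal PyTorch sketch of the channel/spatial attention used in the decoder's self-attention block, in the style of CBAM [53]; the reduction ratio, kernel size, and pooling choices are illustrative assumptions rather than the exact decoder configuration.

```python
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    """CBAM-style attention: channel attention M_c followed by spatial attention M_s."""
    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        # Channel attention: shared MLP over average- and max-pooled channel descriptors.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
        )
        # Spatial attention: convolution over channel-wise average and max maps.
        self.spatial = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                                      # x: (B, C, H, W)
        avg = x.mean(dim=(2, 3), keepdim=True)
        mx = x.amax(dim=(2, 3), keepdim=True)
        m_c = torch.sigmoid(self.mlp(avg) + self.mlp(mx))      # (B, C, 1, 1)
        x = m_c * x                                            # X' = M_c(X) (x) X
        avg_s = x.mean(dim=1, keepdim=True)
        max_s = x.amax(dim=1, keepdim=True)
        m_s = torch.sigmoid(self.spatial(torch.cat([avg_s, max_s], dim=1)))  # (B, 1, H, W)
        return m_s * x                                         # X'' = M_s(X') (x) X'
```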