Attention-Based Dense Decoding Network for Monocular Depth Estimation



I. INTRODUCTION
Depth information is vital for real-world 3D scenes, and depth estimation is an essential part of understanding the geometric relationships in a scene. It can enhance many recognition tasks and has broad applications in multimodal emotion recognition, 3D reconstruction, simultaneous localization and mapping (SLAM), 3D object detection and autonomous driving. Monocular depth estimation refers to estimating the corresponding depth from a single RGB image. Researchers have studied estimating depth from a scene since as early as the 1960s. Previous methods were generally based on optical geometric constraints or environmental assumptions [1], [2]. In recent years, deep convolutional neural networks have made great progress in image classification [3], semantic segmentation [4] and object detection [5]. At the same time, more and more researchers have applied deep convolutional neural networks to monocular depth estimation [6]-[9].
The associate editor coordinating the review of this manuscript and approving it for publication was Liangxiu Han.

Despite the success of convolutional networks, there are severe challenges in depth estimation. Modern convolutional neural networks stack many convolution operations. As the network goes deeper, the receptive fields of neurons in later layers increase, and features at different levels of the image can be obtained. Through a stack of several convolutions and pooling operations, the deep features of each layer change from generalized features to high-level semantic representations, and operations such as pooling integrate and abstract these features. Although consecutive convolutions and pooling reduce the spatial dimension and prevent overfitting to some extent, they cause local spatial information to be lost. To overcome this problem, some works [6], [10] used skip connections to add relatively shallow generalized features to relatively deep semantic features.
Another challenge comes from image distortion at a distance. As shown in FIGURE 1, the picture frame far away from the camera is blurred. Some methods [12], [13] exploited atrous spatial pyramid pooling (ASPP) to capture objects of different sizes. Although ASPP effectively obtains receptive fields of different sizes without changing the image resolution, there are three problems with dilated convolution: first, the kernel of a dilated convolution is discontinuous, so not all pixels are involved in the operation; second, designing a suitable dilated convolution kernel is the key to handling objects of different sizes; third, this approach does not consider the relationships between different pixel features.
The main goal of this work is to effectively alleviate the problems of blurred local and distant details mentioned above. We propose an attention-dense decoding network, which pays attention to distance within an encoder-decoder framework. The first part of the decoder is the Channel-Spatial Attention Module (CSAM). Since convolution extracts features by mixing cross-channel and spatial information, the dependencies within the feature map are weak, yet informational context is especially important for depth estimation. We apply the CSAM as an intermediate transition layer to enhance the representation of informative features in the channel and spatial dimensions. Channel attention is adopted to capture the dependencies between objects in the depth map regardless of their distance, while spatial attention selectively aggregates the features at each position through a weighted sum of the features at all locations. Following the attention module, we propose a dense decoding module (DDM), which cascades multiple dilated convolutional layers and adds all previous features to the next inputs in a dense up-sampling manner. In this way, multiple dilated convolutions generate wider and denser features. Finally, we propose a distance-aware loss which enforces accurate depth prediction for faraway objects. To summarize, our main contributions are three-fold:
• We cascade a channel-spatial attention module and a dense decoding module to explore the dependencies between arbitrary channel and spatial features in depth estimation, and to enhance the discriminative ability of the attention feature representation in a more intensive way.
• We propose a novel distance-aware loss which predicts more meticulous edges and local details in the distance.
• Experimental results show that the proposed framework achieves state-of-the-art performance on KITTI [14] and NYU Depth V2 [15] datasets.

II. RELATED WORK
In this section, we discuss related deep-learning work from two perspectives: supervised and unsupervised depth estimation.

A. SUPERVISED DEPTH ESTIMATION
In recent years, deep convolutional networks have been applied to depth estimation due to their robust feature representation capability [10], [13], [16]-[21]. Eigen et al. [16] first proposed a multi-scale convolutional neural network for depth estimation: a coarse stage estimates the overall scene structure, and a fine stage then produces a more accurate depth prediction. Building on this work, [22] proposed a three-level framework covering depth estimation, surface normal prediction, and semantic annotation. Liu et al. [18] used a conditional random field (CRF) together with a deep convolutional neural network to explicitly model the complicated relationships between adjacent parts of the depth map; the CNN extracted the relevant image features, and the CRF then enhanced the smoothness and edge preservation of adjacent superpixel blocks. To improve the output resolution, Laina et al. [10] proposed a new practical method for learning feature up-sampling and, for optimization, introduced the reverse Huber loss function [23]. Lee et al. [24] used a depth network to predict depth gradients as local cues, then estimated a global, coarse depth, and finally integrated the complementary predictions into a unified depth CNN framework to estimate the final depth image. These methods formulate depth estimation as a regression problem and train the regression network by minimizing the mean squared error, which suffers from slow convergence and unsatisfactory local solutions. To eliminate or at least significantly reduce these problems, Fu et al. [13] used multi-scale dilated convolution and a spacing-increasing discretization (SID) strategy to avoid unnecessary spatial pooling and capture multi-scale information. Chen et al. [25] proposed an attention-based context aggregation network to remove the grid artifacts in [13].
However, these methods consider information only from the channel attention side and fail to combine pixel-level spatial attention with channel-level attention to obtain more detailed and comprehensive contextual information.

B. UNSUPERVISED DEPTH ESTIMATION
Without requiring ground-truth labels, unsupervised methods use photometric constraints from multiple views to predict depth. The methods in [17], [20] utilized epipolar geometry to train the depth estimation network (DepthNet). SfMLearner [26] was the first framework to learn both depth and ego-motion from monocular videos. Following this framework, [27]-[30] were proposed to handle moving objects that violate the rigid-scene assumption. Recent work [4], [31], [32] jointly learnt multiple tasks based on [26]. All unsupervised methods include a DepthNet; therefore, a well-designed DepthNet is crucial for improving the performance of unsupervised methods.

III. METHOD
In this section, we first introduce the overall framework, then describe the CSAM and the dense decoding module, which capture long-range context information at the channel and spatial levels. Finally, we introduce the distance-aware loss function.

A. OVERVIEW
Inspired by the success of the encoder-decoder framework, our work is based on an encoder-decoder structure, as shown in FIGURE 2. Four intermediate features of the encoder are separately up-sampled, convolved and concatenated with the output features of the decoder, and finally fine-tuned to obtain a depth map with fine edges and local detail. Specifically, (1) the encoder module uses ResNet or another ResNet-based encoding network, such as SENet. For SENet, we select four layers for up-sampling and convolution; the channels of each feature are compressed to 16 and then concatenated along the channel dimension. (2) The first part of the decoder is the element-wise summation of the channel attention module and the spatial attention module, followed by the dense decoding module, which gradually recovers the coarse depth map from the high-level semantics. Specifically, two parallel convolutions each reduce the channels of the encoder output features to 1/4; the upper convolution layer connects to the spatial attention module, while the lower one connects to the channel attention module. (3) The results of (1) and (2) are concatenated along the channel dimension and finally fine-tuned by three convolutions to obtain the final depth map.
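The skip path in step (1) can be sketched at the shape level as follows. This is a minimal NumPy illustration with hypothetical layer sizes, and nearest-neighbor resizing standing in for the paper's learned up-sampling; it only demonstrates the compress-to-16-channels-and-concatenate bookkeeping:

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbor 2x up-sampling over the spatial axes of a (C, H, W) array.
    return x.repeat(2, axis=1).repeat(2, axis=2)

def conv1x1(x, out_ch, rng):
    # A 1x1 convolution is a per-pixel linear map across channels.
    w = rng.standard_normal((out_ch, x.shape[0])) * 0.01
    return np.einsum('oc,chw->ohw', w, x)

rng = np.random.default_rng(0)
# Four hypothetical intermediate encoder features at decreasing resolutions.
feats = [rng.standard_normal((c, h, h)) for c, h in
         [(64, 56), (128, 28), (256, 14), (512, 7)]]

skips = []
for f in feats:
    # Up-sample each feature to a common 112x112 resolution ...
    while f.shape[1] < 112:
        f = upsample2x(f)
    # ... then compress it to 16 channels with a 1x1 convolution.
    skips.append(conv1x1(f, 16, rng))

merged = np.concatenate(skips, axis=0)  # concatenated along the channel axis
assert merged.shape == (64, 112, 112)
```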

B. SPATIAL ATTENTION MODULE
There is a relationship between adjacent pixels in the same feature map because they may share an area under the preceding convolution. As the distance between two pixels in the feature map increases, this relationship weakens or even disappears. Such dependence is necessary for distinguishing objects at different depths. Based on the self-attention mechanism [33], [34], we introduce the spatial attention module to characterize the contours of objects at different depths more accurately by modeling the relationship between any point and the global features. Given a local feature f ∈ R^(C×H×W), we first feed it into three independent 3 × 3 convolution layers to generate three new features q_p, k_p and v_p (q_p ∈ R^(WH×C), k_p ∈ R^(C×WH), v_p ∈ R^(WH×C)), respectively. After that, multiplying q_p by k_p and applying the softmax along the horizontal axis, we obtain the spatial attention 'map' of each pixel to all other pixels. To obtain a more stable gradient, the 'map' is divided by √d_k, where d_k is the number of channels. Then we perform a matrix multiplication between the 'map' and v_p and restore the result to p ∈ R^(C×H×W). Finally, we multiply it by a scale parameter α and perform an element-wise sum with the feature f to obtain P ∈ R^(C×H×W) as follows:

P = α · p + f,  where p = reshape( softmax(q_p k_p / √d_k) · v_p )

C. CHANNEL ATTENTION MODULE
We build a channel attention module by taking advantage of the relationships between the different channels of the feature map. Since each channel of a feature map can be regarded as a feature detector [35], the channels of a feature map focus on 'what' is present. To summarize spatial information, [36] adopted average pooling to compute spatial statistics in their attention module. In addition, [37] used global average pooling and max pooling to collect two important cues that highlight unique object characteristics. [38] combined an attention mechanism and dilated convolution to learn optical flow.
Different from the above works, we describe how to represent complex channel context relationships in depth estimation tasks. Similar to the spatial attention module, we first feed f into convolutions to generate three new features q_c, k_c and v_c (q_c ∈ R^(C×HW), k_c ∈ R^(HW×C), v_c ∈ R^(C×HW)), and obtain the channel attention 'map' through matrix multiplication, softmax and division by √d_k, where d_k is the number of pixels in one channel. Then we perform a matrix multiplication between the channel 'map' and v_c and restore the result to c ∈ R^(C×H×W). Finally, we multiply it by a scale parameter β and perform an element-wise sum with the feature f to obtain C ∈ R^(C×H×W) as follows:

C = β · c + f,  where c = reshape( softmax(q_c k_c / √d_k) · v_c )

D. DENSE DECODING MODULE
When depth estimation is treated as a regression problem, the low-resolution features are restored to the same resolution as the color image by multiple up-sampling steps. To recover local detail information, relatively shallow features are spliced onto relatively deep features via skip connections. Inspired by DenseNet [39] and ASPP [12], dilated convolutions with rates of 3, 6, 12 and 18 are combined into a dense cascade. As shown in FIGURE 3, the dense decoding module consists of multiple up-projections [10] and AsppBlocks. Given an input feature f_i, the output of up-projection_i is u_i, and the output of AsppBlock_i is a_i. Then a_i is summed element-wise with u_i to obtain f_{i+1}, which serves as the input of AsppBlock_{i+1} and a part of up-projection_{i+1}. The formula is as follows:

f_{i+1} = U_i(f_i) + A_i(f_i)

where U_i and A_i denote up-projection_i and AsppBlock_i, respectively. Stacking all the dilated convolutions in a denser manner produces multi-scale features that cover a larger and denser range of scales. All up-projection_i share parameters, so that the features passing through AsppBlock_i obtain the same resolutions and channels as the skip features. For every AsppBlock_i, bilinear interpolation is performed first, and the interpolation size is consistent with e_{5−i} of the encoding layer.
Then two convolutions are applied: the first has a 1 × 1 kernel with equal input and output channels; the second has a 3 × 3 kernel, uses a different dilation rate in each block, and halves the number of channels.
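A minimal NumPy sketch of the two attention branches may make the tensor shapes concrete. Here the outputs of the three convolutions (q, k, v) are treated as given, and for illustration we simply reuse f in their place; the scale parameters start at 0, as in our training setup, so both modules begin as identity mappings:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_attention(f, q, k, v, alpha):
    # f: (C, H, W); q: (HW, C), k: (C, HW), v: (HW, C).
    C, H, W = f.shape
    d_k = C  # number of channels
    attn = softmax(q @ k / np.sqrt(d_k), axis=1)   # (HW, HW) pixel-to-pixel map
    p = (attn @ v).T.reshape(C, H, W)              # restore to (C, H, W)
    return alpha * p + f                           # P = alpha * p + f

def channel_attention(f, q, k, v, beta):
    # q: (C, HW), k: (HW, C), v: (C, HW); d_k is the number of pixels per channel.
    C, H, W = f.shape
    d_k = H * W
    attn = softmax(q @ k / np.sqrt(d_k), axis=1)   # (C, C) channel-to-channel map
    c = (attn @ v).reshape(C, H, W)                # restore to (C, H, W)
    return beta * c + f                            # C = beta * c + f

rng = np.random.default_rng(0)
C, H, W = 8, 4, 4
f = rng.standard_normal((C, H, W))
flat = f.reshape(C, -1)
# Reuse f itself in place of the convolution outputs (illustration only).
P = spatial_attention(f, flat.T, flat, flat.T, alpha=0.0)
Cmap = channel_attention(f, flat, flat.T, flat, beta=0.0)
# With the scale parameters initialized to 0, both branches are identities.
assert np.allclose(P, f) and np.allclose(Cmap, f)
```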
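The decoding recurrence f_{i+1} = U_i(f_i) + A_i(f_i) can be traced with trivial stand-ins. Plain nearest-neighbor resizing below replaces both the learned up-projection and the AsppBlock (our simplification, purely to show how the resolutions line up across four decoding steps):

```python
import numpy as np

def up_projection(f):
    # Stand-in for the shared up-projection U_i: plain 2x nearest-neighbor
    # up-sampling over the two spatial axes.
    return f.repeat(2, axis=-1).repeat(2, axis=-2)

def aspp_block(f, size):
    # Stand-in for AsppBlock_i: the module first interpolates f to the matching
    # encoder resolution; we mimic only that resizing here.
    reps = size // f.shape[-1]
    return f.repeat(reps, axis=-1).repeat(reps, axis=-2)

f = np.ones((16, 7, 7))
for i in range(4):
    u = up_projection(f)             # u_i = U_i(f_i)
    a = aspp_block(f, u.shape[-1])   # a_i = A_i(f_i), resized to match u_i
    f = u + a                        # f_{i+1} = u_i + a_i
assert f.shape == (16, 112, 112)     # 7 -> 14 -> 28 -> 56 -> 112
```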

E. LOSS FUNCTION
Many previous studies treated depth estimation as a regression problem, using l_1 or l_2 losses to minimize the difference between predicted depth values and the ground truth. The l_1 loss is:

l_1 = (1/n) Σ_i e_i,  where e_i = ||y_i^g − y_i^p||_1,

y^g is the ground truth and y^p is the predicted map. We find that most methods are realistic in the near field, but predictions of distant scenes are distorted or fuzzy. We believe the network should pay more attention to distant scenes while still accounting for near ones. With the l_1 loss, a given depth difference between the predicted map and the ground truth contributes equally whether it occurs in a near or a distant region, so near and far scenes are penalized identically. The key idea of the distance-aware loss is to give greater weight to depth differences at larger distances: even if a distant difference is only 1 cm, the total loss remains large because of its weight. Through back-propagation, the network parameters are optimized to make distant depth differences as small as possible; once they are small enough, the total loss stays small even after multiplication by a relatively large weight. Based on the loss function in [11], we propose the distance-aware loss, which pays more attention to long distances:

l_d = (1/n) Σ_i ln(e_i^* + α),

where e_i^* is the depth difference e_i weighted according to the ground-truth depth, and α and λ are hyper-parameters. In practice, α = 0.5 and λ is calculated as: Similarly, it is necessary to penalize errors at the edges of distant objects more, which makes the edge structure clearer.
l_g = (1/n) Σ_i ( |∂_x(e_i^*)| + |∂_y(e_i^*)| ),

where ∂_x(e_i^*) and ∂_y(e_i^*) are the derivatives of e_i^* with respect to x and y at the i-th pixel.
Finally, the angle between the surface normals n_i^p of the predicted depth map and n_i^g of the ground truth is penalized to complement the other two losses and further improve the fine details of the predicted depth map:

l_n = (1/n) Σ_i ( 1 − ⟨n_i^p, n_i^g⟩ / (||n_i^p|| · ||n_i^g||) )
Therefore, our total loss function is:

L = l_d + γ l_g + η l_n,

where γ, η ∈ R. In these experiments, the weights γ of l_g and η of l_n are set to γ = 1 and η = 1. For l_d, taking the derivative with respect to ω, the gradient takes the following form: For the same depth difference, according to this formula, the gradient is relatively large at long distances and relatively small at short ones, so distant errors cause larger weight adjustments. In this way, the loss achieves the goal of focusing on long distances.
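The three loss terms can be sketched in NumPy under our reading of the text. The exact weighting e_i^* and the formula for λ are not fully specified above, so the depth-proportional weight λ·y^g used below is an illustrative assumption, not the paper's definitive choice:

```python
import numpy as np

def distance_aware_loss(y_p, y_g, alpha=0.5, lam=1.0):
    # l_d: log-penalized per-pixel error, weighted more at larger depths.
    # The weight lam * y_g is our illustrative assumption for building e*_i.
    e_star = lam * y_g * np.abs(y_g - y_p)
    return np.mean(np.log(e_star + alpha))

def gradient_loss(y_p, y_g, alpha=0.5, lam=1.0):
    # l_g: penalizes spatial derivatives of the weighted error (sharpens edges).
    e_star = lam * y_g * np.abs(y_g - y_p)
    gx = np.diff(e_star, axis=1)  # discrete derivative along x
    gy = np.diff(e_star, axis=0)  # discrete derivative along y
    return np.mean(np.abs(gx)) + np.mean(np.abs(gy))

def normal_loss(n_p, n_g):
    # l_n: 1 - cosine similarity between predicted and ground-truth normals.
    dot = (n_p * n_g).sum(axis=-1)
    norm = np.linalg.norm(n_p, axis=-1) * np.linalg.norm(n_g, axis=-1)
    return np.mean(1.0 - dot / norm)

def total_loss(y_p, y_g, n_p, n_g, gamma=1.0, eta=1.0):
    # L = l_d + gamma * l_g + eta * l_n
    return (distance_aware_loss(y_p, y_g)
            + gamma * gradient_loss(y_p, y_g)
            + eta * normal_loss(n_p, n_g))

y_g = np.full((4, 4), 10.0)
n = np.tile([0.0, 0.0, 1.0], (4, 4, 1))
# A perfect prediction drives l_g and l_n to zero and l_d to ln(alpha).
assert np.isclose(total_loss(y_g, y_g, n, n), np.log(0.5))
```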

IV. EXPERIMENTS

B. IMPLEMENTATION DETAILS
We implement our proposed depth estimation network on the public deep learning platform PyTorch [40] and train it on a 16 GB Tesla V100 with a batch size of 8. The encoder of the network is initialized with ILSVRC [41] pre-trained weights, and the rest of the network is randomly initialized from a Gaussian distribution. We adopt the Adam optimizer with an initial learning rate of 1e-4; β_1, β_2 and the weight decay are set to 0.9, 0.999 and 0.0001, respectively. The scale parameters α and β are initialized to 0 and gradually updated as the model learns.
We train the proposed model for ten epochs. By the fifth epoch, the performance tends to be stable and no longer increases.

C. EVALUATION METRICS
Following previous work [16], we use the following metrics to evaluate our method:
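The metrics of [16] are the standard ones in the monocular depth literature (mean relative error, RMSE, log10 error, and threshold accuracies δ_k); a sketch of their usual definitions, reconstructed by us from common usage, is:

```python
import numpy as np

def depth_metrics(pred, gt):
    # Standard monocular-depth metrics as commonly defined following Eigen et al.
    pred, gt = pred.ravel(), gt.ravel()
    thresh = np.maximum(pred / gt, gt / pred)
    return {
        'rel':   np.mean(np.abs(pred - gt) / gt),     # mean absolute relative error
        'rms':   np.sqrt(np.mean((pred - gt) ** 2)),  # root mean squared error
        'log10': np.mean(np.abs(np.log10(pred) - np.log10(gt))),
        'd1':    np.mean(thresh < 1.25),              # accuracy under threshold
        'd2':    np.mean(thresh < 1.25 ** 2),
        'd3':    np.mean(thresh < 1.25 ** 3),
    }

# A perfect prediction scores 0 error and 100% threshold accuracy.
m = depth_metrics(np.array([2.0, 4.0]), np.array([2.0, 4.0]))
assert m['rel'] == 0.0 and m['rms'] == 0.0 and m['d1'] == 1.0
```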

D. COMPARISONS WITH THE STATE-OF-THE-ART
In this subsection, we evaluate our method from both qualitative and quantitative points of view on the NYU V2 and KITTI datasets. The results show that our proposed framework achieves state-of-the-art performance.
In TABLE 1, we compare the proposed method with previous state-of-the-art methods using the evaluation metrics mentioned above on the NYU V2 depth dataset. Our model obtains the same scores as [11] on the rel, δ_2 and δ_3 metrics, and achieves the second-best rms of 0.519, only 0.01 worse than the result in [13]; this difference is caused by different iterations and training samples, as we use 50K training samples for 10 epochs. On the other metrics, our method achieves significant gains over all existing state-of-the-art approaches.
Qualitative results are illustrated in FIGURE 4, which shows sample comparisons between our method and previous state-of-the-art methods on the NYU V2 dataset. It is observed that [10], [22] only predict the approximate contours of objects; for example, the computer on the table in (b) and the four chairs on the left in (a) cannot be accurately predicted. Although [11], [13] produce relatively clear details and edges at close range, they focus on the boundaries of closer objects, such as the floor in (d), and weaken local details, such as the table in (e). Our method not only considers the dense intermediate features but also focuses on long-distance features in the image. Therefore, the predicted depth maps provide excellent edges and local details both near (e) and far (c).
In TABLE 2, we also evaluate our method's performance on the KITTI dataset. Following the evaluation protocol in [26], we only evaluate the central crop of images where the depth is less than 80 meters. Our result outperforms most existing state-of-the-art methods except DORN. We suspect one reason is the sparse 3D point clouds in the KITTI raw data: the distance-aware loss does not converge well on sparse depth maps. Although our quantitative results might not be the best, the qualitative results of the proposed method are much better than the state of the art, as shown in FIGURE 5.

E. ABLATION STUDIES
In order to further analyze the proposed loss function, dense decoding module, and CSAM, we conduct three sets of ablation experiments.

TABLE 2 caption: Evaluation results of depth estimation on the KITTI test set. The methods trained on the KITTI raw dataset are denoted by K, and those trained on the virtual KITTI dataset by vK. M, S and D* denote monocular video, stereo supervision and auxiliary depth supervision, respectively; D means depth supervision. The best results in each category are in bold, and the second best are underlined.

1) LOSS FUNCTION
In this ablation experiment, the encoder is SENet, and the decoder comprises the CSAM and DDM. We use three loss functions L_0, L_1 and L_2, whose weighted errors e_i^* are e_i^0, e_i^1 and e_i^2, respectively. L_0 pays more attention to nearby points and less to distant points, while L_1 and L_2 do the opposite. From TABLE 3, L_1 and L_2 are better than L_0. The difference between L_1 and L_2 is that y is used in the weight of L_1, while y² is used in that of L_2. The fuzziness of the predicted map does not increase linearly from nearby points to distant ones; compared with L_1, L_2 uses nonlinear near-to-far weights to penalize this nonlinear fuzziness. Looking at the white boxes in FIGURE 6, the prediction of L_2 in the distance is closer to the ground truth, while the prediction of L_0 is inaccurate or even unusable.
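The contrast between the three weighting schemes can be illustrated as follows. The depth-based weights for L_1 and L_2 (y and y²) follow the text above, but the inverse-depth weight used for L_0 is purely our assumption, since its exact form is not given:

```python
import numpy as np

def weighted_error(e, y_g, scheme):
    # e: per-pixel |y_g - y_p|; y_g: ground-truth depth.
    if scheme == 0:   # L_0: assumed inverse-depth weight, emphasizes near points
        return e / y_g
    if scheme == 1:   # L_1: linear depth weight, emphasizes far points
        return e * y_g
    if scheme == 2:   # L_2: quadratic weight, emphasizes far points nonlinearly
        return e * y_g ** 2
    raise ValueError(scheme)

e = np.array([1.0, 1.0])       # identical raw errors ...
y_g = np.array([2.0, 8.0])     # ... at a near pixel and a far pixel
near0, far0 = weighted_error(e, y_g, 0)
near2, far2 = weighted_error(e, y_g, 2)
assert near0 > far0    # L_0 penalizes the near error more
assert far2 > near2    # L_2 penalizes the far error much more
```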

2) DENSE DECODING MODULE
We conduct ablation experiments to reveal the effectiveness of incorporating the DDM into the proposed modules. Under otherwise identical conditions, we compare plain up-projection against the dense decoding module; the experimental results are shown in TABLE 3 and demonstrate that the DDM provides benefits from several perspectives. Analyzing the decoder: instead of putting all input features of the AsppBlock_j (j < i) through multiple interpolations and then summing them as the input to AsppBlock_i, we up-sample features that have already been up-sampled. This design uses four up-samplings during decoding, while the former may use ten (4+3+2+1). Since each up-sampling contains an up-projection, this significantly reduces the number of parameters. In addition, the DDM has four more AsppBlocks than the baseline, which uses only four up-projections. The number of parameters in AsppBlock_i is C × 1 × 1 × C + C × 3 × 3 × C/2, and the four AsppBlocks total about 1.9M parameters.
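The per-block parameter count can be checked quickly. The helper below evaluates the formula above for a hypothetical channel width (biases ignored); each block costs 5.5·C² weights:

```python
def aspp_block_params(c_in):
    # Weights of the two convolutions in one AsppBlock (biases ignored):
    # a 1x1 conv keeping channels, then a 3x3 dilated conv halving them.
    return c_in * 1 * 1 * c_in + c_in * 3 * 3 * (c_in // 2)

# For a hypothetical C = 100: 100*100 + 100*9*50 = 55,000 = 5.5 * C^2.
assert aspp_block_params(100) == 100 * 100 + 100 * 9 * 50 == 55_000
```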
In addition, we find that after the first epoch of training, δ_1 is already as high as 82.4%. Compared with an ordinary decoding layer [11], the dense decoding layers make full use of the different intermediate features during decoding. These features cover different scale ranges and achieve larger receptive fields while acquiring multi-scale information. At each iteration, the multi-scale information from different receptive fields is reused as much as possible, which accelerates the convergence of the model. Therefore, the network obtains relatively good results at the very beginning of training.

3) CHANNEL-SPATIAL ATTENTION MODULE
We compare two sets of experiments with and without the CSAM. As shown in TABLE 3, the CSAM significantly improves performance. Looking at the red and green boxes in the first row of FIGURE 7, the CSAM predicts richer local details, such as table corners, whereas the prediction without the attention module may be blurred. Besides, observing the red box in the second row (b), the chair and the floor have almost the same color, i.e., the same distance, although their distances differ in the color image (a). It may be that the ground truth lost information during acquisition or data preprocessing. The CSAM can capture dependencies between arbitrary pixels of the color image: there are strong dependencies among the chair's internal pixels, and further dependencies between the chair and the ground make the chair appear darker in the predicted depth map, distinguishing the chair from the floor.

4) FEATURE VISUALIZATION
In order to further verify the benefits of the two modules and the loss function, we visualize the intermediate feature maps generated by the network, as shown in FIGURE 8. The baseline represents a low level of informational context and distant scenes, while our loss function yields larger areas of interest in the distance. The network that does not use the CSAM as the transition layer has a weak ability to discriminate depth features, even though the dense decoding module enhances feature reuse; as the dependence between these features is inferior, far and near regions share similar features, which results in the dispersed feature ranges in (c). (d) and (e) use the attention module to characterize objects at different depths and locations by modeling the relationship between any point and the global features. Compared with (d), (e) reuses the attention features, so the depth layering is stronger.

5) TIMING ANALYSIS
In FIGURE 9, we compare the computational cost of the proposed modules during the test phase. The CSAM and DDM do not significantly increase the required computing resources. Our approach outperforms Xu et al. [9] and Hu et al. [11] in terms of both accuracy and running time. Furthermore, in comparison to Eigen et al. [16], Laina et al. [10] and Hao et al. [21], the proposed method provides a trade-off between accuracy and speed.

V. CONCLUSION
To address the loss of local depth details caused by convolution stacking, we propose a novel encoder-decoder attention-dense decoding network. Our main idea is to take advantage of the channel-spatial attention module, which captures the dependence between different channels and spatial locations through self-attention. Furthermore, a dense decoding module is introduced to capture more massive and denser attention features. Moreover, to reduce the fuzziness of distant scenes, we propose a distance-aware loss function that pays more attention to long-distance objects. Extensive experimental results demonstrate that our method achieves state-of-the-art performance on the KITTI and NYU V2 datasets. For future work, it is worthwhile to explore more accurate unsupervised depth estimation based on these modules.

VI. ACKNOWLEDGMENT
JIANRONG WANG has won many national and regional awards.

GE ZHANG received the B.S. degree in computer science from Tianjin Polytechnic University, in 2018. He is currently pursuing the master's degree in computer technology at Tianjin University. His research interests include 3D pose estimation, (un)supervised depth estimation, scene understanding, and deep learning.
TIANYI XU received the B.S. degree in automation and the M.S. degree in computer science and technology from Tianjin University, in 2012 and in 2015, respectively, where he is currently pursuing the Ph.D. degree. His research interests include data mining, the IoT, and blockchain.
MEI YU is currently an Associate Professor with the College of Intelligence and Computing, Tianjin University, China. Besides, she also serves as the Instructor of the IT innovation and entrepreneurship training base of Tianjin University, where she is responsible for the construction of the base. Her research interests include natural language processing and knowledge mapping.
TAO LUO received the M.S. degree from the School of Precision Instrument and Opto-Electronics, in 2006, and the Ph.D. degree from the School of Electronic Information Engineering, Tianjin University, in 2009. He is currently an Associate Professor with the College of Intelligence and Computing, Tianjin University. His research interests include intelligent vision processing, image sensors, and integrated circuits.