Leveraging Contextual Information for Monocular Depth Estimation

Humans rely strongly on visual cues to understand scenes, for example when segmenting a scene, detecting objects, or judging the distance to nearby objects. Recent studies suggest that deep neural networks can likewise take advantage of contextual representations when estimating a depth map for a given image. Therefore, focusing on the scene context can be beneficial for successful depth estimation. In this study, a novel network architecture is proposed that improves performance by leveraging contextual information for monocular depth estimation. We introduce a depth prediction network with the proposed attentive skip connection and a global context module to obtain meaningful semantic features and enhance the performance of the model. Our model is validated through several experiments on the KITTI and NYU Depth V2 datasets. The experimental results demonstrate the effectiveness of the proposed network, which achieves state-of-the-art monocular depth estimation performance while maintaining a high running speed.


I. INTRODUCTION
Depth estimation is a key problem in computer vision with applications in a variety of fields such as autonomous driving, 3D modeling, and robotics. In particular, monocular depth estimation aims to generate a depth map for a single given image, which is an ill-posed task because many distinct 3D scenes can be projected onto the same 2D image. Nevertheless, humans can estimate the distance to objects even with one eye because they can exploit semantic features [1] and monocular cues. Recent studies suggest that convolutional neural networks (CNNs) take advantage of a similar property. Hu et al. [2] trained an auxiliary mask network that predicts the minimum set of pixels in the image that are relevant to the inference of the depth map. By visualizing the predicted mask, they found that CNNs use visual cues such as edges or boundaries in the input images, as well as the regions inside individual objects. This study indicates that semantic features can play a crucial role in depth estimation for both humans and deep neural networks. Hence, focusing on the contextual information in input images can be beneficial for effective monocular depth estimation.
The associate editor coordinating the review of this manuscript and approving it for publication was Paolo Napoletano.
Since the emergence of deep neural networks, the state-of-the-art performance in monocular depth estimation has risen rapidly. Adopting a strong backbone network trained on a large-scale dataset has made it easier to extract powerful features. Thus, many researchers have studied methods for applying the knowledge acquired by such a powerful encoder to depth estimation [3]-[6]. Moreover, several papers have attempted to leverage contextual features in this area. References [7] and [8] employed an encoder-decoder structure with a skip connection; however, these methods focus more on refining coarse local features than on the contextual information itself. Some studies have used additional knowledge, such as pretrained weights or a segmentation dataset [9], to achieve semantic supervision; however, these methods limit the datasets to which they can be applied and complicate the training methodology. Therefore, it is worth formulating a training strategy that allows the network to concentrate on significant regions and use semantic representations without employing any external information.
This paper proposes a new network architecture that leverages contextual information for effective monocular depth estimation. The first contribution of this paper is the proposed attentive skip connection, which enables the use of encoded features in the decoding phase. As previously discussed, objects with different positions and sizes can play a crucial role in depth estimation. Therefore, a multi-scale skip connection with self-attentive modules is added to highlight the feature maps of diverse objects at different scales. The second contribution of this paper is a novel global context module, which leverages global features to understand the scene context comprehensively at a global scale. The global context module receives the bottleneck feature as an input and captures rich contextual information. These additional units can be adopted in any network and require only a small amount of computation, which yields a high inference speed. By focusing on significant regions and representations with effective, lightweight modules, the model achieves high performance with a reasonable inference time. To summarize, the main contributions of this study are as follows:
• Contextual information plays an important role in many scene understanding tasks, including monocular depth estimation. To generate an accurate depth map for a given image, we introduce a novel network architecture that leverages contextual information using an encoder-decoder structure.
• The novel attentive skip connection delivers the features that are obtained from the encoder to the decoder; hence, the model can take advantage of the encoded features in the decoding phase. In contrast with previous studies involving skip connections [10], an attentive skip connection infers an attention map to learn the regions on which the network should focus.
• We propose a global context module to enhance the obtained bottleneck feature and to exploit the global context for a comprehensive scene understanding.
• The experimental results demonstrate that the proposed model achieves state-of-the-art performance on the KITTI and NYU Depth V2 datasets. Owing to the easy integration of the lightweight modules, the network shows a high running speed while improving the performance in comparison to previous methods.

II. RELATED WORK

A. MONOCULAR DEPTH ESTIMATION
There has been significant development in monocular depth estimation. Wang et al. [11] solved semantic segmentation and depth estimation tasks jointly by developing a unified framework. They employed joint global and regional CNNs to predict potentials and inferred the final results through a hierarchical conditional random field. Laina et al. [12] proposed fully convolutional networks with a fast up-projection method using residual learning to model the mapping between RGB images and depth maps. Furthermore, they introduced the reverse Huber loss, which tackles the heavy-tailed distribution of the depth dataset. Godard et al. [13] suggested an unsupervised training objective to replace the use of labeled depth maps. The network generates the left and right disparity maps and calculates the reconstruction, smoothness, and left-right consistency terms. Kuznietsov et al. [14] introduced a semi-supervised approach to overcome the deficiency and limitation of sparse ground truth lidar maps. They trained the network with sparse depth maps in a supervised manner and added an image alignment loss to generate photoconsistent dense maps based on stereo images. Li et al. [15] presented a two-stream network that produces depth and depth gradients from the given RGB image and combines the two results to obtain a final dense depth map. Fu et al. [3] modeled monocular depth estimation as a classification task and tackled this problem with a spacing-increasing discretization strategy. Gan et al. [5] employed an affinity layer to integrate relative and absolute features within a network. In addition, they used vertical max pooling to focus on the vertical characteristics of depth maps and improved accuracy. Guo et al. [6] incorporated a synthetic depth dataset to acquire a considerable amount of ground truth images. Subsequently, they trained a network with synthetic data and fine-tuned it with a real dataset. Finally, they mitigated the domain gap between the ground truth and synthetic datasets by distilling stereo networks. Qi et al. [16] utilized the relation between the depth and surface normal by employing two networks: depth-to-normal and normal-to-depth networks.
Hu et al. [17] proposed a network that extracts multi-scale features to preserve spatial resolution. In addition, they defined a new loss that considers the depth, gradients, and surface normals of depth maps. Yin et al. [4] emphasized the importance of geometric constraints in 3D space for improving the performance of monocular depth estimation. They generated 3D point clouds from the estimated and ground truth depth maps and then computed the virtual normal loss by randomly sampling points from the two. Zhang et al. [18] suggested a new framework that predicts depth, surface normal, and semantic segmentation jointly. This framework utilizes cross-task patterns by calculating the affinity matrix while performing each task.

B. CONTEXTUAL INFORMATION
Contextual information is an essential cue in many computer vision tasks, especially in scene understanding tasks such as 3D object detection, semantic segmentation, or depth estimation. Reference [10] constructed an encoder-decoder architecture with skip connections to combine contracted high-resolution features with an expanded output for segmentation. To achieve depth estimation, Eigen et al. [7] and Garg et al. [8] employed the encoder-decoder structure with skip connections that use encoded features in the decoding phase. Liu et al. [19] suggested a network that performs semantic segmentation first and uses the predicted labels for depth estimation. Jiao et al. [20] proposed a synergy network to incorporate semantics in depth prediction by using an information propagation strategy as well as knowledge sharing. Amirkolaee and Arefi [21] constructed a depth prediction network with the encoder-decoder and skip connection structure to integrate the global and local contexts. Unsupervised methods use additional information to overcome the absence of labeled data; such methods include those that leverage semantic information. Ochs et al. [9] performed semantic segmentation and depth estimation using two independent CNNs, one for each task. Through this approach, the network learns more stable features and can leverage semantic labels. Chen et al. [22] combined the depth and segmentation modalities by minimizing self-supervised objective losses, the left-right semantic consistency, and the semantics-guided disparity smoothness.
In addition, several papers have employed an attention architecture to focus on the contexts that are significant for depth estimation. Xu et al. [23] proposed an attention module that is parameterized by binary variables to control the flow between the encoder and the decoder. The proposed attention module was then integrated with a conditional random field. Chen et al. [24] proposed an attention-based context aggregation network to solve the depth estimation problem. They placed a pixel-level self-attention module at the bottleneck of the network and trained it with an attention loss. Takagi et al. [25] proposed a two-branch depth estimation network with mutual learning and employed channel attention with the squeeze-and-excitation [26] attention module.

III. METHOD
This section first introduces the overall architecture of the network and then describes the attentive skip connection and the global context module in sequence.

A. NETWORK ARCHITECTURE
As depicted in Fig. 2, the proposed model adopts the encoder-decoder architecture with the suggested attentive skip connections and the global context module. The encoder is initialized with the weights of a classification model pretrained on ImageNet [27] to extract dense features. We develop a remarkably simple decoder network to restore the obtained features to the image scale and to generate a depth map. The decoder is designed to have the same number of blocks as the encoder. Each block consists of a 3 × 3 deconvolution, batch normalization, and a ReLU layer. To strengthen the decoder, residual blocks are placed between the 2nd and 3rd blocks. Similar to the decoder blocks, the residual blocks have two 3 × 3 convolution layers with batch normalization and a ReLU layer.
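As a rough illustration of how a stack of such blocks can restore the input resolution, the sketch below traces the spatial size through a chain of 3 × 3 deconvolutions using the standard transposed-convolution output-size formula. The stride of 2, padding of 1, output padding of 1, the choice of five blocks, and the overall encoder stride of 32 are assumptions made for illustration; none of them is stated in this section.

```python
def deconv_out_size(size_in, kernel=3, stride=2, padding=1, output_padding=1):
    """Output length of a transposed convolution along one axis (dilation 1)."""
    return (size_in - 1) * stride - 2 * padding + (kernel - 1) + output_padding + 1


def trace_decoder(height, width, num_blocks=5):
    """Return the feature-map size before and after each hypothetical decoder block."""
    sizes = [(height, width)]
    for _ in range(num_blocks):
        height = deconv_out_size(height)
        width = deconv_out_size(width)
        sizes.append((height, width))
    return sizes


if __name__ == "__main__":
    # Assumed bottleneck size for a 352 x 704 crop with an encoder stride of 32.
    for h, w in trace_decoder(352 // 32, 704 // 32):
        print(h, w)
```

Under these assumed settings each block doubles the height and width, so five blocks bring an 11 × 22 bottleneck back to 352 × 704.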

B. ATTENTIVE SKIP CONNECTION
We consider a depth estimation network as a mapping function for image-to-depth-map translation, in which the input and output share an underlying structure. The objects and structure in a given RGB image are roughly aligned with those in the output depth map. As previously described, the location of important edges plays a major role in depth inference. Therefore, it is desirable to pass this information through the network. In this study, to shuttle the low-level features, we add skip connections between the encoder and the decoder. Unlike previous studies [4], [7], [8], this study does not simply sum the feature values or apply a simple convolution; instead, an attention mechanism is applied to the skip connection. As discussed earlier, some studies have used an attention mechanism for monocular depth estimation [23]-[25]. However, our approach differs from previous works in two aspects. First, we design a task-specific attention module with two branches and attach it to the skip connection to deliver refined multi-scale features to each of the blocks of the decoder. Second, our module is lightweight and requires only a small amount of additional computation. The proposed attention module is detailed in Fig. 3.
An attentive skip connection is provided for each encoder block to propagate the meaningful features to the decoder block. For every convolutional block in the encoder, the output feature map F_M passes through two branches. Similar to the implementation in [28], the attentive skip connection generates attention maps along the spatial and channel dimensions. The first branch produces a spatial attention map and runs in parallel with the channel attention branch. We consider that the computational graph for the spatial attention map should be task-specific. Previously proposed attention modules are usually employed for the classification task [26], [28], [29]. A classification network needs to derive the single most probable class from the whole image, whereas a regression network infers continuous values for all of the pixels in the image. Therefore, in this study, the attentive skip connection is designed specifically to understand the scene at multiple scales. Because our network must focus on multiple locations, important edges, and objects during depth estimation, we adopt atrous spatial pyramid pooling (ASPP) to broaden the field of view and to capture objects at multiple scales. The intermediate feature map F_M of each encoder block is forwarded to the ASPP module with the dilation rates d = {d_1, d_2, ..., d_n}. The value of d is obtained via experiments; we choose {3, 6, 9} for this investigation, and this choice is discussed in the Experiment section. Furthermore, the feature produced by the ASPP module goes through a 1 × 1 convolution for effective integration. The integrated feature is passed through the ASPP module and the 1 × 1 convolution once more to enhance the effectiveness of the module. Finally, a batch normalization layer is employed at the end of the spatial block to ensure stable training. Thus, the spatial attention map is acquired.
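To make the effect of the dilation rates concrete, the following sketch computes the effective extent of a dilated kernel, k + (k - 1)(d - 1), for each rate. The 3 × 3 kernel size of the ASPP branches is an assumption; this section specifies only the rates {3, 6, 9}.

```python
def effective_kernel(kernel, dilation):
    """Extent (in pixels) covered by a dilated convolution kernel along one axis."""
    return kernel + (kernel - 1) * (dilation - 1)


def aspp_extents(rates=(3, 6, 9), kernel=3):
    """Map each dilation rate to the extent covered by an assumed k x k kernel."""
    return {rate: effective_kernel(kernel, rate) for rate in rates}


if __name__ == "__main__":
    for rate, extent in aspp_extents().items():
        print(f"dilation {rate}: {extent} x {extent}")
```

With these assumed 3 × 3 kernels, the three branches cover 7 × 7, 13 × 13, and 19 × 19 neighborhoods with the same number of weights each, which is how ASPP widens the field of view without enlarging the kernels.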
In the second branch, average pooling is applied to the intermediate feature map F_M in the channel dimension to encode the contextual information in each channel. Then, the pooled feature is forwarded to a multi-layer perceptron (MLP) with one hidden layer. To make the model compact and effective, the hidden layer is constructed to have fewer units than the input and output layers of the MLP. A reduction ratio of 16 is used between the dimensions of the input and hidden layers. Thus, a refined channel attention map is obtained. After the spatial attention map and the channel attention map are acquired, each map is multiplied element-wise with the original feature map F_M, and the two products are merged by summation. Finally, the resulting refined feature is added to the original feature map F_M.
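The fusion step described above can be written compactly as follows. This is a sketch of one reading of the text: the symbols A_s (spatial attention map), A_c (channel attention map), and the superscripts "att" and "out" are names introduced here for illustration, and ⊙ denotes element-wise multiplication with broadcasting over the missing dimensions. No normalizing nonlinearity (such as a sigmoid) is shown because none is described in this section.

```latex
\begin{align}
  F_M^{\mathrm{att}} &= A_s \odot F_M + A_c \odot F_M, \\
  F_M^{\mathrm{out}} &= F_M + F_M^{\mathrm{att}}.
\end{align}
```

The second line is a residual connection: if both attention maps were zero, the skip connection would pass F_M through unchanged.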

C. GLOBAL CONTEXT MODULE
To further exploit a global context representation, we do not directly deliver the bottleneck feature of the encoder to the decoder. Instead, the global context module is placed at the end of the encoder to obtain the global context information and pass meaningful features to the decoder. The structure of the global context module is illustrated in Fig. 4. The bottleneck feature F_B ∈ R^(C×H×W) is fed into two paths. The goal of the global context module is to capture important features at a global scale with little additional computation. Hence, pooling is applied to reduce the dimensions of the feature and obtain significant representations with a small parameter overhead. In the first branch, average pooling is applied in the channel dimension to utilize the inter-dependencies between the channel-wise feature maps and to help the model concentrate on the useful regions, since average pooling has been commonly used for capturing spatial information [26]. Then, the feature is convolved with a kernel having H × W × d_h weights, where d_h denotes the dimension of the intermediate refined feature map. The dimension that yields the best performance is determined to be 512 via ablative experiments. In the second branch, a max-pooling operation on F_B is used to capture the most informative spatial information. Similar to the case of the attentive skip connection, the max-pooled feature is forwarded to a multi-layer perceptron comprising one hidden layer with a reduced dimension, with a reduction ratio of 16. Then, the refined feature is reshaped into d_h × 1 × 1 so that it can be aggregated with the refined feature of the first branch.
VOLUME 8, 2020
After the refined features are obtained from both branches, the output vectors are combined using element-wise summation. Additionally, a 1 × 1 convolution layer is employed to fuse the summed features, and the result is upsampled by bilinear interpolation such that it has the same size as the original feature map F_B. Finally, the obtained feature is multiplied with and added to the original feature map F_B and used as the input to the decoder.
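One possible reading of this aggregation is sketched below. The symbols g_avg and g_max (the refined d_h × 1 × 1 features of the two branches), G, and the superscript "out" are names introduced here for illustration. It is also an assumption that the 1 × 1 convolution maps the d_h channels back to the C channels of F_B so that the element-wise product is well defined.

```latex
\begin{align}
  g &= g_{\mathrm{avg}} + g_{\mathrm{max}}, & g &\in \mathbb{R}^{d_h \times 1 \times 1}, \\
  G &= \operatorname{Up}\bigl(\operatorname{Conv}_{1 \times 1}(g)\bigr), & G &\in \mathbb{R}^{C \times H \times W}, \\
  F_B^{\mathrm{out}} &= F_B + F_B \odot G.
\end{align}
```

Under this reading, bilinear upsampling of a 1 × 1 map simply repeats one value per channel across all H × W positions, so G acts as a per-channel gate on the bottleneck feature.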

D. TRAINING
In the training phase, a scale-invariant log loss function [7] is used as the objective function. For a generated depth map y and the ground truth y*, there are n pixels indexed by i. The final loss function is expressed in terms of the per-pixel log difference d_i = log y_i − log y_i*.
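For reference, the scale-invariant log loss of [7] is commonly written as the mean of d_i squared minus a weight λ times the squared mean of d_i. The sketch below implements that common form in plain Python. The function name, the balancing weight `lam`, and its default of 0.5 are assumptions, since the weight used for this model is not given in this section.

```python
import math


def scale_invariant_log_loss(pred, target, lam=0.5):
    """Scale-invariant log loss over paired, strictly positive depth values.

    `lam` is the balancing weight of the second term; its value is an
    assumption, not a setting taken from this paper.
    """
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equally long")
    # d_i = log y_i - log y_i*
    d = [math.log(p) - math.log(t) for p, t in zip(pred, target)]
    n = len(d)
    return sum(x * x for x in d) / n - lam * sum(d) ** 2 / (n * n)
```

With `lam = 1` the loss does not change when every prediction is multiplied by the same constant, which is the sense in which it is scale-invariant; with `lam = 0` it reduces to the mean squared error in log space.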

IV. EXPERIMENT
The effectiveness of the proposed model is demonstrated by performing various experiments on the KITTI [36] and NYU Depth V2 [37] datasets. For the evaluation, this study uses the metrics from previous works [3], [7], where T is the set of available pixels in the ground truth, y is the predicted value, and y* is the ground truth. After the results on the datasets are presented, an ablation study is provided.
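The metrics themselves are not listed here. As a hedged sketch, the function below computes the error and accuracy measures that are standard in the monocular depth literature (absolute relative error, squared relative error, RMSE, RMSE in log space, and threshold accuracies at 1.25, 1.25² and 1.25³). Which of these this study reports is an assumption, as are the function and key names and the convention that a ground-truth value of zero or less marks a pixel outside T.

```python
import math


def depth_metrics(pred, target):
    """Standard depth metrics over the valid pixels T (assumed: target > 0)."""
    pairs = [(p, t) for p, t in zip(pred, target) if t > 0]
    if not pairs:
        raise ValueError("no valid ground-truth pixels")
    n = len(pairs)
    # Symmetric ratio used by the threshold accuracies.
    ratios = [max(p / t, t / p) for p, t in pairs]
    metrics = {
        "abs_rel": sum(abs(p - t) / t for p, t in pairs) / n,
        "sq_rel": sum((p - t) ** 2 / t for p, t in pairs) / n,
        "rmse": math.sqrt(sum((p - t) ** 2 for p, t in pairs) / n),
        "rmse_log": math.sqrt(
            sum((math.log(p) - math.log(t)) ** 2 for p, t in pairs) / n
        ),
    }
    for k in (1, 2, 3):
        metrics[f"delta{k}"] = sum(r < 1.25 ** k for r in ratios) / n
    return metrics
```

Lower is better for the four error terms and higher is better for the three threshold accuracies; predictions are assumed to be strictly positive so that the logarithm is defined.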

A. IMPLEMENTATION DETAILS
This model is implemented in the open-source deep learning framework PyTorch [38]. The encoder is initialized with the weights of the pretrained networks ResNet-50, ResNet-101 [39], and ResNeXt-101 [40]. We use randomly cropped images with a size of 352 × 704 from the KITTI dataset and images with a size of 448 × 576 from the NYU Depth V2 dataset. Training uses the Adam optimizer; the learning rate starts at 0.0001 with a weight decay of 0.9. The network is trained for 40 epochs with a batch size of four. The images are augmented by applying random brightness, contrast, color adjustment, and rotation; the ranges for these modifications are (0.5, 1.5), (0.8, 1.2), (0.8, 1.2), and (-5, 5) degrees, respectively. In addition, random horizontal flipping is applied.
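The augmentation ranges above can be collected into a small sampling routine, sketched below with the Python standard library only. Uniform sampling within each range, the flip probability of 0.5, and all names in the code are assumptions; the text gives only the ranges and states that flipping is random.

```python
import random

# Ranges taken from the text; sampling uniformly within them is an assumption.
AUGMENTATION_RANGES = {
    "brightness": (0.5, 1.5),
    "contrast": (0.8, 1.2),
    "color": (0.8, 1.2),
    "rotation_deg": (-5.0, 5.0),
}


def sample_augmentation(rng, flip_prob=0.5):
    """Draw one set of augmentation parameters for a training image."""
    params = {
        name: rng.uniform(low, high)
        for name, (low, high) in AUGMENTATION_RANGES.items()
    }
    # The flip probability is not stated in the text; 0.5 is assumed.
    params["horizontal_flip"] = rng.random() < flip_prob
    return params


if __name__ == "__main__":
    print(sample_augmentation(random.Random(0)))
```

The same sampled parameters would need to be applied consistently to the image and, for the geometric operations (rotation and flipping), to the ground-truth depth map as well.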
B. DATASETS

1) KITTI
The KITTI dataset [36] consists of 61 scenes of outdoor images captured by driving a car equipped with cameras and Velodyne sensors. The proposed model is trained based on the split proposed by Eigen et al. [7], who used 56 scenes from the "city", "residential", and "road" categories. The images are split into training and testing sets, which contain 23,488 and 697 images, respectively.

2) NYU DEPTH V2
The NYU Depth V2 dataset [37] contains 464 indoor scenes, of which 249 are used for training and 215 for testing. The proposed model is trained on 24,231 images and tested on 654 images.

C. PERFORMANCE
The results obtained for the KITTI and NYU Depth V2 datasets are listed in Table 1 and Table 2, respectively, where the proposed model is compared with previous works. As the results show, our approach outperforms the other state-of-the-art methods on both the outdoor and indoor datasets, which indicates that the proposed model is suitable for various situations. As presented in Table 1, the results of our method exceed those of the previous works by 2% to 22% in terms of all of the metrics on the KITTI dataset. As shown in Table 2, our model achieves state-of-the-art results for all of the metrics except AbsRel on the NYU Depth V2 dataset. In Fig. 5, the results are compared with those of prior works on the KITTI dataset. As previously highlighted, contextual information is important for depth estimation. The proposed method produces sharp boundaries for objects such as a person, a road sign, or bicycles. In addition, our approach successfully locates the objects in the image, in contrast with the previous methods, even when there are multiple objects. The road sign in the image in the first column does not appear in the resulting depth maps of [4], [5]; in contrast, our model provides an appropriate inference for the depth of this object.
FIGURE 5. Qualitative comparison with the previous methods. The depth maps are generated from the test set of the KITTI dataset. From top to bottom, the images are the input, the depth map of our method, and those of the methods proposed by Yin [4], Fu [3], and Gan [5].
To further emphasize the strength of the proposed method, the mean RMSE versus the running time of the proposed model on the NYU Depth V2 dataset is illustrated along with that of some of the prior works. As shown in Fig. 6, the inference speed of our method is higher than that of the other compared methods; in addition, our model achieves a higher accuracy. The results are reported for different base networks, including ResNet-50, ResNet-101, and ResNeXt-101. Even though there is a trade-off between the performance and the inference time, the suggested model consistently provides reasonable results.

D. ABLATION STUDY
To demonstrate the effectiveness of the proposed model, we conduct several ablation experiments with different settings on the NYU Depth V2 dataset. First, experiments are performed on the baseline method; then, the network is extended with an attentive skip connection and a global context module to verify the performance of the proposed method. The quantitative and qualitative results are shown in Table 3 and Fig. 7, respectively.
As listed in Table 3, the attentive skip connection and the global context module significantly improve the performance of the network. To demonstrate that this strategy can be generalized to a different base network, the model is trained with ResNet-101 and ResNeXt-101. The results show that the proposed approach consistently exhibits good performance even when applied to a different network. Furthermore, the number of parameters increases by only 2.8M, as listed in the table. A significant improvement in the performance and fast inference are achieved with a small number of parameters for the suggested modules. The qualitative results obtained for the proposed modules on the NYU Depth V2 dataset are illustrated in Fig. 7. It can be observed that the boundaries of the objects become more accurate owing to the addition of the proposed modules. Moreover, our model is able to accurately detect the objects on the table (3rd row) that were not detected accurately by the baseline method. Table 4 shows the results of the comparison of the proposed attentive skip connection with previous attention modules. Squeeze-and-excitation (SE) [26], the bottleneck attention module (BAM) [28], and the convolutional block attention module (CBAM) [29] are selected and tested on the KITTI dataset based on the ResNet-101 architecture. The proposed attentive skip connection yields the best performance for all metrics, as indicated by Table 4. This demonstrates that the proposed attentive skip connection is more suitable for depth estimation tasks and that it improves the performance of the network further in comparison with other attention modules.
In addition, experiments are conducted using different dilation rates for the attentive skip connection and various hidden dimensions for the global context module. These experiments are performed to maximize the performance. The results are presented in Table 5. The dilation rates {3, 6, 9} provide the best results among the settings tested. This result supports the notion that applying a well-designed ASPP module to a skip connection can improve the performance of the model by deriving useful features with enlarged receptive fields. With regard to d_h, the value of 512 in the hidden layer provides the best performance among the settings tested. If the size of the hidden dimension increases, the network usually shows a better performance owing to the increase in depth. However, using an excessively high value for this parameter can cause overfitting, and the inference rate will also be adversely affected. Therefore, it is important to find an appropriate value for this task. In summary, based on the ablation study, a d_h value of 512 and dilation rates of {3, 6, 9} are used in this study.

V. CONCLUSION
This paper presents a novel network architecture that leverages contextual information for monocular depth estimation. Using the proposed modules, namely the multi-scale attentive skip connections and the global context module, our network captures meaningful contextual representations at multiple scales and at the global scale. Extensive experiments and an ablation study demonstrate that the proposed model provides more accurate predictions than other state-of-the-art methods. Furthermore, our network achieves a significant performance improvement on the KITTI and NYU Depth V2 datasets. In future work, we plan to investigate the structure of faster and lighter networks to achieve real-time performance.