Transfer2Depth: Dual Attention Network With Transfer Learning for Monocular Depth Estimation

Monocular depth estimation poses a fundamental problem in many tasks. Although recent convolutional neural network-based methods can achieve high accuracy with very deep networks and complex architectures to exploit different cues and features, doing so not only increases the vulnerability of the model, but also increases the difficulty of convergence. Moreover, recent depth estimation methods for indoor environments are impractical for outdoor environments. In this work, we aim to develop a simple deep network structure to improve model effectiveness for depth estimation. We apply a dual attention module that can be inserted into any type of network to improve the power of representation, and additionally propose a training strategy which combines transfer learning and ordinal regression to improve training convergence. Even with a simple end-to-end encoder-decoder type of network architecture, we are able to achieve state-of-the-art performance on two of the biggest datasets for indoor and outdoor depth estimation: NYU Depth v2 and KITTI.


I. INTRODUCTION
Estimating depth from a 2D image is a long-standing challenge in computer vision and scene understanding. A depth map can not only provide considerable help for 3D-related applications such as 3D object detection [1], scene segmentation [2], and simultaneous localization and mapping systems [3], but also help other research areas, such as image dehazing [4], image refocusing [5] and augmented reality [6].
Compared to depth estimation that uses multiple images [7]-[9], monocular depth estimation is more complicated. Depth estimation from a single 2D image suffers from the scale ambiguity problem because a 2D image can be generated from infinite combinations of object sizes and camera movement speeds. Several methods [10]-[12] have attempted to resolve this problem by extracting helpful features from a scene, such as textures, object sizes and occlusions. With the rise of convolutional neural networks (CNNs) in recent years, many works [13]-[20] have introduced CNNs into the task of depth estimation and have significantly improved performance. These works usually treat the monocular depth estimation problem as a regression problem and train the network with mean squared error loss or other pointwise losses. Recently, in order to further improve the estimation accuracy, very deep and complex network architectures have been designed to exploit different types of features in a scene. This not only increases the vulnerability of the model, but also increases the difficulty of model training.
The associate editor coordinating the review of this manuscript and approving it for publication was Shiqi Wang.
In this paper, we propose three strategies to improve depth estimation: 1) a dual attention module, 2) transfer learning, and 3) ordinal regression for outdoor depth estimation. Snapshots from our results are presented in Fig. 1. Humans tend to selectively focus on a specific part of a scene; this is the concept of attention. We simulate this mechanism by enhancing the meaningful features in the feature map with a specially designed spatial-channel dual attention module. The module is designed to have the same input and output shape. Thus, it can be inserted not only into our proposed network, but into any type of model to boost the quality of extracted features. This idea has proven to be very effective: we are able to deliver a consistent improvement of around 5% in accuracy simply by applying our proposed module.
To further improve the estimation performance without increasing convolutions, which would result in greater memory costs and slower inference speed, we delve into research on methods to improve model convergence. Though most researchers train their models from scratch, Zamir et al. [21] indicated that many tasks are highly related and share similar features in their investigation. Motivated by [21], we introduce transfer learning into our training. We pretrain a high-performance model with the ImageNet [22] dataset, and then utilize the pre-trained weights to initialize our depth estimation training. By initializing our training with meaningful weights, we found our model converges significantly faster and requires less training data compared to previous state-of-the-art models.
Since the depth range of outdoor data is much wider than that of indoors, previous depth estimation networks cannot be directly applied to outdoor environments. In this paper, we incorporate the concept of ordinal regression from [17] into our training strategy. The continuous depth values are discretized into 80 intervals, and the regression training in depth estimation is cast as a multi-class classification, where an ordinal loss is introduced to train the network. The proposed network achieves state-of-the-art performance on two of the biggest benchmarks on indoor and outdoor depth estimation, i.e., NYU Depth v2 [23] and KITTI [24], reaching an improvement in performance of up to 10%.
The remainder of this paper is organized as follows. Section II provides a brief review of related previous works. Section III describes our proposed network, the spatial-channel dual attention module and the training strategy. Section IV provides quantitative and qualitative comparisons, together with several experiments analyzing the performance of different parts of our proposed network. Finally, Section V concludes the paper.

II. RELATED WORK
Depth estimation is a crucial problem within 3D computer vision, and how to resolve 3D structural information from 2D RGB images has been a very popular and important research topic. Traditional methods [25]-[27] utilize feature point movements to resolve geometric information from multiple images. Corresponding feature points between images are extracted and triangulation is utilized to estimate depth. Groundbreaking work from Saxena et al. [10] introduced machine learning to estimate the depth for 2D images with monocular cues. Since then, several approaches [11], [12], [28]-[31] following this concept with different representations have been introduced.
In recent years, due to the success of convolutional neural networks in image understanding tasks, many works [1], [8], [9], [13]-[20] have proposed using CNNs for depth estimation. Powerful deep architectures such as VGG, ResNet, and DenseNet have brought the accuracy of depth estimation to a new level. Recent works [32], [33] utilize multilayer deconvolutions to recover fine information. Skip connection design has been introduced in some encoder-decoder research [18], [19], [34] to preserve details from network inputs. Unsupervised and semi-supervised learning have recently been introduced to the depth estimation problem [32], [34]-[37]. These methods usually utilize the estimated depth map to reconstruct a reference image from another image with a different view angle and build up disparity losses between the reference image and the reconstructed image to train the network.
Transfer learning has been proven to be very effective in different cases. Recently, Zamir et al. [21] investigated the relationship and modeled the transfer learning dependency of 26 tasks, 16 of which are 3D or geometric related topics. When Alhashim et al. [19] introduced this concept to the problem of depth estimation, transferring the model for object classification to depth estimation was highly effective.
Plug-in modules for convolutional neural networks are a newly emerging research topic. Recent research [17], [38], [39], [41] has typically designed special modules for specific tasks to improve a model. Some works aim to develop modules that can be inserted into any network without the need for any hyperparameter modifications. Wang et al. [38] developed a non-local module to resolve global feature relationships. Their module has since been ported into many different tasks for performance improvement. Attention type modules [39]-[42] focus on improving the quality of extracted features. Wang et al. [40] proposed an encoder-decoder type of attention module, which achieved good performance but is computationally expensive. Hu et al. [41] proposed using global average pooling to reduce the computational cost and exploit inter-channel relationships. Other research [42], [43] suggests that spatial attention is as important as channel attention for feature enhancement.
Previous state-of-the-art techniques for monocular depth estimation are described as follows. Eigen et al. [13] proposed a multi-scale network consisting of a coarse scale and a fine scale network. Even though the design goal of the multi-scale network is to retain details while resolving global information, the pooling and striding at the beginning of the fine network cause it to lose a lot of information early on. In addition, unlike recently developed densely connected structures, which can preserve information while passing features, more details are lost after repeated convolution layers in the fine network. Fu et al. [17] regarded depth estimation as a multi-class classification problem. They used a dense feature extractor followed by a scene understanding module to extend the field of view to capture global information. However, the lack of connections between layers leads to a lack of detail in the output. Alhashim et al. [19] applied a simple encoder-decoder architecture and trained it using a transfer learning technique. However, their architecture lacks attention to large scale features. Zhang et al. [20] designed a pattern affinitive network which concurrently produces depth, surface normal and semantic segmentation maps. The main hurdle in their approach is that the complex and massive architecture needs to be carefully engineered, which makes it fragile and increases the difficulty of convergence.

III. METHOD
In this section, we introduce the architecture and details of our proposed network.

A. NETWORK ARCHITECTURE
The overall network architecture is shown in Fig. 2. Previous depth estimation networks [13], [15] usually apply multiple layers of convolutional operations on different spatial sizes. However, as the architecture becomes deeper and deeper, the representation power of convolutional neural networks does not increase proportionally. Therefore, we incorporate a different aspect of network architecture called "attention." Previous works have shown that attention not only tells the network where to focus, but also improves the representation of interest.
Encoder-decoder architecture [18], [19], [32]-[34] has been shown to be powerful in addressing the depth estimation problem. To preserve details, most works also introduce skip connections [18], [19], [33], [34] between the encoder and decoder. Our proposed network follows this trend as well. We incorporate the high-performance DenseNet [44] architecture and our proposed spatial-channel attention module as the encoder. The proposed spatial-channel attention module is inserted after each dense block and transition layer. For the decoder, we utilize transpose convolution for feature upsampling. Cross connections between the encoder and decoder are applied to preserve high level features for better output detail and quality.

B. SPATIAL-CHANNEL DUAL ATTENTION MODULE
The spatial-channel dual attention module consists of two main components, a channel attention module and a spatial attention module. The overall architecture of the spatial-channel dual attention module is shown in Fig. 3.
The channel attention module is a combination of average pooling, max pooling and multi-layer perceptron, which is the same as in [43].
Because each channel of a feature map is seen as a distinct feature detector, channel attention focuses on finding out what is meaningful in the input feature map. To preserve computational efficiency, average pooling and max pooling are applied to squeeze the spatial dimensions of the input feature map. Average pooling has been commonly adopted in previous works [41], [45]. However, [43] shows that with average pooling and max pooling both applied, the attention module is able to gather more important clues about distinctive object features. Thus, we simultaneously apply average pooling and max pooling. As shown in Fig. 4, the squeezed feature sequences from both average pooling and max pooling are forwarded through a shared multi-layer perceptron. After the shared multi-layer perceptron is applied to each descriptor, the output features are merged using element-wise summation. A sigmoid activation is then applied to generate the channel attention sequence. The generated attention sequence is multiplied with the input feature maps to emphasize the meaningful features.
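The channel attention path described above can be sketched in NumPy as follows. This is a minimal illustration rather than the network's actual implementation: the hidden ReLU layer, the reduction ratio `R`, the random weights, and all function and variable names are assumptions made for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat, w1, b1, w2, b2):
    """Reweight the channels of feat, shaped (H, W, C).

    Returns the reweighted feature map and the per-channel weights.
    """
    # Squeeze the spatial dimensions with both pooling operations.
    avg_desc = feat.mean(axis=(0, 1))  # (C,)
    max_desc = feat.max(axis=(0, 1))   # (C,)

    def shared_mlp(v):
        # The same weights are applied to both descriptors.
        # Assumption: one hidden layer with ReLU.
        hidden = np.maximum(v @ w1 + b1, 0.0)
        return hidden @ w2 + b2

    # Merge by element-wise summation, then squash to (0, 1).
    attn = sigmoid(shared_mlp(avg_desc) + shared_mlp(max_desc))  # (C,)
    return feat * attn, attn

rng = np.random.default_rng(0)
C, R = 8, 2  # channel count and a hypothetical reduction ratio
feat = rng.standard_normal((4, 5, C))
w1, b1 = rng.standard_normal((C, C // R)) * 0.1, np.zeros(C // R)
w2, b2 = rng.standard_normal((C // R, C)) * 0.1, np.zeros(C)
out, attn = channel_attention(feat, w1, b1, w2, b2)
```

Because the output has the same shape as the input, the function can be placed between any two layers, which is the property the module relies on.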
In contrast to channel attention, which focuses on which channels are meaningful, spatial attention focuses on the meaningful area in each given feature map. Unlike the spatial attention design in [43], our proposed spatial attention module consists of an atrous spatial pyramid pooling (ASPP) [46] module followed by 1 × 1 convolutions. The ASPP extracts features from multiple receptive fields with dilated convolution operations. This avoids the loss of detail resulting from the spatial size reduction of features. After that, 1 × 1 convolutions learn the cross-channel interactions between the extracted features. We merge features from different receptive fields via concatenation, and a convolution with sigmoid activation is applied to further integrate and transform the features into a spatial attention map.
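A rough NumPy sketch of this spatial attention path is given below. The dilation rates, branch width, ReLU activations, and the use of a 1 × 1 kernel for the final fusing convolution are assumptions chosen for illustration, since the text does not specify them; all names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dilated_conv3x3(feat, kernel, rate):
    """Zero-padded 3x3 convolution with dilation `rate`, keeping H and W.

    feat: (H, W, Cin); kernel: (3, 3, Cin, Cout).
    """
    h, w, _ = feat.shape
    padded = np.pad(feat, ((rate, rate), (rate, rate), (0, 0)))
    out = np.zeros((h, w, kernel.shape[-1]))
    for i in range(3):
        for j in range(3):
            # Tap (i, j) reads the input shifted by (i-1, j-1) * rate.
            patch = padded[i * rate:i * rate + h, j * rate:j * rate + w, :]
            out += patch @ kernel[i, j]
    return out

def spatial_attention(feat, branch_kernels, branch_pointwise, fuse_weights, rates):
    branches = []
    for kernel, pointwise, rate in zip(branch_kernels, branch_pointwise, rates):
        x = np.maximum(dilated_conv3x3(feat, kernel, rate), 0.0)
        # A 1x1 convolution is a matrix product applied at every pixel.
        branches.append(np.maximum(x @ pointwise, 0.0))
    merged = np.concatenate(branches, axis=-1)
    attn = sigmoid(merged @ fuse_weights)  # (H, W, 1)
    return feat * attn, attn

rng = np.random.default_rng(0)
C, M = 6, 4        # input channels and a hypothetical branch width
rates = (1, 2, 4)  # hypothetical dilation rates
feat = rng.standard_normal((8, 10, C))
branch_kernels = [rng.standard_normal((3, 3, C, M)) * 0.1 for _ in rates]
branch_pointwise = [rng.standard_normal((M, M)) * 0.1 for _ in rates]
fuse_weights = rng.standard_normal((M * len(rates), 1)) * 0.1
out, attn = spatial_attention(feat, branch_kernels, branch_pointwise,
                              fuse_weights, rates)
```

No pooling or striding is used, so the attention map keeps the full spatial resolution of the input feature map.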

C. TRAINING STRATEGY

1) TRANSFER LEARNING
Alhashim et al. [19] show that with a simple yet effective transfer learning technique, it is possible to significantly boost performance on the depth estimation problem. To further improve the performance of our proposed network, we therefore incorporate transfer learning to give our training a meaningful initialization. We pretrain a DenseNet-169 model on the ImageNet dataset. The pretrained weights are then transferred into our proposed network to initialize the encoder, and the spatial-channel dual attention module and decoder are inserted for depth estimation training.

2) ORDINAL REGRESSION
In most research, the depth estimation problem tends to be seen as a regression problem. However, the depth range of outdoor scenes is much wider than that of indoor scenes, so it is much more complex. Casting the depth estimation problem as a multiclass classification problem [17] can substantially simplify the problem, which leads to better estimation performance. Therefore, in this research we propose a split training strategy, where we switch our estimation method between regression and multiclass classification depending on the input data.
In order to perform ordinal regression for the outdoor environment, a ground truth depth map is first discretized into labels. We follow Fu et al. [17] in using a spacing-increasing discretization strategy, which avoids an over-strengthened loss for large depth values, letting our proposed network focus more on closer regions where 3D structural information is much richer than in the farther regions. The spacing-increasing strategy uniformly discretizes depth maps in log space, resulting in bigger intervals for larger depth values. The depth values are discretized into 80 classes, which are represented by 80 label maps. For each pixel, the labeling formula can be represented as:
$$t_i = \exp\left(\log \alpha + \frac{i \, \log(\beta / \alpha)}{K}\right)$$
where $t_i \in \{t_1, t_2, \ldots, t_K\}$ are discretization thresholds. $K$ is the number of intervals, which is set to 80 in this research. $\alpha$ and $\beta$ are the shifted minimum and maximum depth values of the whole dataset, where we apply a shift $\xi$ to both minimum and maximum depth values so that $\alpha = \text{minimum} + \xi = 1.0$. Once the label of each pixel is decided, the label map is then filled in correspondingly. In label map representation, each label has its own map, which means there are 80 maps $L_i \in \{L_1, L_2, \ldots, L_{80}\}$ in this work. If a pixel $P_{(w,h)}$ is determined to be class 10, the corresponding pixel in label maps 1 through 10, $L_{1(w,h)}, L_{2(w,h)}, \ldots, L_{10(w,h)}$, is set to one while other areas and other maps remain zero.
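The discretization and label-map construction can be sketched as follows. The threshold formula used here is the spacing-increasing discretization of [17] (uniform steps in log space between α and β); the depth range in the example, the interval convention (label i covers depths up to and including t_i), and all names are assumptions made for illustration.

```python
import numpy as np

def sid_thresholds(alpha, beta, k):
    """Thresholds t_1..t_K, uniformly spaced in log space, with t_K = beta."""
    i = np.arange(1, k + 1)
    return np.exp(np.log(alpha) + np.log(beta / alpha) * i / k)

def ordinal_label_maps(depth, alpha, beta, k, shift):
    """depth: (H, W). Returns labels in 1..K and binary maps of shape (K, H, W)."""
    t = sid_thresholds(alpha, beta, k)
    shifted = np.clip(depth + shift, alpha, beta)
    # Label = 1-based index of the first threshold that is >= the shifted depth.
    labels = np.minimum(np.searchsorted(t, shifted, side="left") + 1, k)
    # Map L_i is one wherever the pixel's label is at least i.
    ranks = np.arange(1, k + 1)[:, None, None]
    maps = (ranks <= labels[None]).astype(np.float32)
    return labels, maps

# Hypothetical range: depths in (0, 80] m shifted by xi = 1 so that alpha = 1.
alpha, beta, K, xi = 1.0, 81.0, 80, 1.0
depth = np.array([[0.5, 5.0], [20.0, 80.0]])
labels, maps = ordinal_label_maps(depth, alpha, beta, K, xi)
t = sid_thresholds(alpha, beta, K)
```

Summing the maps over the class axis recovers the label of each pixel, which is a convenient sanity check on the encoding.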

D. TRAINING AND INFERENCE
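For indoor regression training, we use the loss function design from Alhashim et al. [19]. The loss function consists of three parts: a pointwise L1 loss, a gradient loss and an SSIM loss. The overall loss function is outlined below:
$$L(d, \hat{d}) = \sigma L_{depth}(d, \hat{d}) + L_{grad}(d, \hat{d}) + \lambda L_{SSIM}(d, \hat{d})$$
where $d$ is the ground truth depth map, $\hat{d}$ is the estimated depth map and $n$ is the total number of pixels. The pointwise L1 loss is the average of the absolute error of each pixel, representing the overall disparity between the estimated depth map and the ground truth:
$$L_{depth}(d, \hat{d}) = \frac{1}{n} \sum_{p} \left| d_p - \hat{d}_p \right|$$
The gradient loss is an L1 loss defined over the image gradient along the x and y axes:
$$L_{grad}(d, \hat{d}) = \frac{1}{n} \sum_{p} \left( \left| g_x(d_p) - g_x(\hat{d}_p) \right| + \left| g_y(d_p) - g_y(\hat{d}_p) \right| \right)$$
where $g_x$ represents the gradient of the depth map on the x axis and $g_y$ represents the gradient of the depth map on the y axis. Since SSIM has an upper bound of 1 and a lower bound of 0, we define the SSIM loss as:
$$L_{SSIM}(d, \hat{d}) = 1 - \mathrm{SSIM}(d, \hat{d})$$
The weighting in the overall loss function is defined by $\sigma = 0.1$ and $\lambda = 0.5$, which is the same setting as in [19].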
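As a rough illustration of how the three terms combine, the sketch below computes the loss in NumPy. It assumes that σ weights the L1 term and λ weights the SSIM term, which matches the setting of [19]. It also replaces the usual windowed SSIM with a single global window and uses forward differences for the gradients; both are simplifications made for brevity, and all names are hypothetical.

```python
import numpy as np

def l1_loss(d, d_hat):
    return np.mean(np.abs(d - d_hat))

def gradient_loss(d, d_hat):
    # Forward differences stand in for the image gradients g_x and g_y.
    gx = np.diff(d, axis=1) - np.diff(d_hat, axis=1)
    gy = np.diff(d, axis=0) - np.diff(d_hat, axis=0)
    return np.mean(np.abs(gx)) + np.mean(np.abs(gy))

def global_ssim(d, d_hat, data_range, k1=0.01, k2=0.03):
    # Simplification: one window covering the whole map instead of local windows.
    c1, c2 = (k1 * data_range) ** 2, (k2 * data_range) ** 2
    mu_x, mu_y = d.mean(), d_hat.mean()
    var_x, var_y = d.var(), d_hat.var()
    cov = ((d - mu_x) * (d_hat - mu_y)).mean()
    ssim = ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
    # Clip to [0, 1] to match the bounds assumed in the text.
    return float(np.clip(ssim, 0.0, 1.0))

def indoor_loss(d, d_hat, data_range, sigma=0.1, lam=0.5):
    return (sigma * l1_loss(d, d_hat)
            + gradient_loss(d, d_hat)
            + lam * (1.0 - global_ssim(d, d_hat, data_range)))

rng = np.random.default_rng(0)
d = rng.uniform(0.5, 10.0, size=(6, 8))  # synthetic ground truth in metres
loss_same = indoor_loss(d, d, data_range=10.0)
loss_offset = indoor_loss(d, d + 1.0, data_range=10.0)
```

A constant offset leaves the gradient term at zero while the L1 and SSIM terms grow, which shows why the terms are complementary.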
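As for the outdoor dataset, we use the loss function design of Fu et al. [17], which is an ordinal loss that takes the ordinal correlation between discrete labels into account. The ordinal loss function is formulated as the average of the pixelwise ordinal loss $\Psi(w, h)$ over the entire image:
$$L_{ord} = -\frac{1}{n} \sum_{w} \sum_{h} \Psi(w, h)$$
$$\Psi(w, h) = \sum_{k=1}^{l_{(w,h)}} \log P^{k}_{(w,h)} + \sum_{k=l_{(w,h)}+1}^{K} \log\left(1 - P^{k}_{(w,h)}\right)$$
where $k$ is the index of class labels, $l_{(w,h)}$ is the ground truth label of the pixel at position $(w, h)$, $P^{k}_{(w,h)}$ is the confidence of the predicted label of a pixel at position $(w, h)$ and $n$ is the total number of pixels in an image.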
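A minimal NumPy sketch of such an ordinal loss is shown below, assuming the formulation of [17]: every class index up to the ground truth label is treated as a positive binary target and every index above it as a negative one. The function name, the clipping constant and the toy inputs are illustrative assumptions.

```python
import numpy as np

def ordinal_loss(prob, labels, eps=1e-8):
    """prob: (K, H, W) with prob[k-1] the confidence for class index k.
    labels: (H, W) integer ground truth labels in 1..K."""
    k = prob.shape[0]
    ranks = np.arange(1, k + 1)[:, None, None]
    target = (ranks <= labels[None]).astype(prob.dtype)
    p = np.clip(prob, eps, 1.0 - eps)  # avoid log(0)
    # Per-pixel log-likelihood summed over the K binary decisions.
    psi = (target * np.log(p) + (1.0 - target) * np.log(1.0 - p)).sum(axis=0)
    # Negative average over all pixels.
    return -psi.mean()

K = 80
labels = np.array([[10, 40], [1, 80]])
ranks = np.arange(1, K + 1)[:, None, None]
perfect = (ranks <= labels[None]).astype(np.float64)
uninformed = np.full((K, 2, 2), 0.5)
loss_perfect = ordinal_loss(perfect, labels)
loss_uninformed = ordinal_loss(uninformed, labels)
```

With every confidence at 0.5 the loss equals K log 2, and it approaches zero as the predicted confidences approach the label maps.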

IV. EXPERIMENTAL RESULTS
In this section, we demonstrate the efficacy of our proposed method on two challenging datasets: NYU Depth v2 [23] and KITTI [24]. After introducing the implementation details, we compare our performance with state-of-the-art methods [17], [19], [20]. We follow previous work [13] on evaluation metrics and additionally perform ablation studies to further analyze the impact of different parts of our proposed method.

A. DATASETS
NYU Depth v2
The NYU Depth v2 is an indoor dataset with 464 indoor scenes of 640 × 480 resolution captured by a Microsoft Kinect depth camera. The dataset contains around 120K training samples and 654 testing samples pre-defined by previous work [13]. Just as in [19], a 50K image subset was selected as the training set. Since depth maps captured by Kinect usually contain a lot of invalid values, those invalid values are inpainted using the method in [47]. Depth maps in the NYU Depth v2 dataset have an upper bound of 10 meters.
KITTI
The KITTI dataset is an outdoor dataset with a resolution of about 1241 × 375, captured by cameras and lidar sensors mounted on a moving vehicle. We use 22.6K images for training and 697 for testing, following the settings in [13]. Ground truth resolution is reduced via max pooling for output measurements. We train our model with 640 × 480 resolution as the input and 320 × 240 resolution as the output: we crop the input image to 640 × 375 and pad zeros on the top to match the set input resolution.
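The KITTI preprocessing described above might look like the following NumPy sketch. The text does not say which 640-pixel-wide region is cropped, so a center crop is assumed here, and a 2 × 2 max pooling window is assumed for halving the ground truth resolution; the function names are hypothetical.

```python
import numpy as np

def crop_and_pad_top(image, crop_w=640, out_h=480):
    """Crop the width to crop_w (assumption: center crop), then add zero rows
    on top until the height is out_h. Works for (H, W) and (H, W, C) arrays."""
    h, w = image.shape[:2]
    x0 = (w - crop_w) // 2
    cropped = image[:, x0:x0 + crop_w]
    padded = np.zeros((out_h, crop_w) + image.shape[2:], dtype=image.dtype)
    padded[out_h - h:] = cropped
    return padded

def max_pool_2x2(depth):
    """Halve the resolution of a (H, W) map with non-overlapping 2x2 max pooling."""
    h, w = depth.shape
    trimmed = depth[:h - h % 2, :w - w % 2]
    return trimmed.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

rgb = np.ones((375, 1241, 3), dtype=np.uint8)
net_input = crop_and_pad_top(rgb)                 # (480, 640, 3)

gt = np.zeros((375, 1241), dtype=np.float32)
gt_target = max_pool_2x2(crop_and_pad_top(gt))    # (240, 320)
```

If missing lidar returns are stored as zeros, max pooling keeps any valid measurement inside a window, whereas average pooling would dilute it.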

B. IMPLEMENTATION DETAILS
We implement our proposed network with TensorFlow and train it on an NVIDIA TITAN Xp GPU with 12 GB of memory. We pretrain a DenseNet-169 on the ImageNet dataset for encoder weight initialization, while the decoder weights are randomly initialized. We train the network for 20 epochs with the Adam optimizer. The learning rate is set to 0.0001 with parameters $\beta_1 = 0.9$ and $\beta_2 = 0.999$, and the batch size is set to 2.

C. EVALUATION METRICS
We evaluate the proposed method's performance on six metrics with indoor data and seven metrics with outdoor data in line with previous work [13]. The error metrics for indoor data are defined as:
$$\text{rel} = \frac{1}{n} \sum_{p} \frac{\left| d_p - \hat{d}_p \right|}{d_p}$$
$$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{p} \left( d_p - \hat{d}_p \right)^2}$$
$$\log_{10} = \frac{1}{n} \sum_{p} \left| \log_{10} d_p - \log_{10} \hat{d}_p \right|$$
$$\delta_i = \text{percentage of } d_p \text{ such that } \max\left( \frac{d_p}{\hat{d}_p}, \frac{\hat{d}_p}{d_p} \right) < 1.25^{i}, \quad i = 1, 2, 3$$
where $d_p$ is a pixel in the ground truth depth map $d$ and $\hat{d}_p$ is the corresponding depth value in the estimated depth map $\hat{d}$. $n$ is the total number of pixels in each depth map. The seven metrics adopted for outdoor evaluation include rel, RMSE, and the three threshold accuracies adopted for indoor evaluation. The other two metrics are RMSE (log) and squared relative error (Sq Rel), defined as follows:
$$\text{RMSE}_{\log} = \sqrt{\frac{1}{n} \sum_{p} \left( \log d_p - \log \hat{d}_p \right)^2}$$
$$\text{Sq Rel} = \frac{1}{n} \sum_{p} \frac{\left( d_p - \hat{d}_p \right)^2}{d_p}$$

D. PERFORMANCE
Table 1 shows the quantitative evaluation results on the NYU Depth v2 dataset. Our proposed method is able to outperform state-of-the-art methods and achieve up to a 6.2% improvement in performance while requiring only 50K training images and 20 training epochs. The visualized qualitative comparison is shown in Fig. 6. As for the outdoor KITTI data, although our results are slightly behind the previous best score [17] on squared relative error and RMSE, our proposed method achieves a 10% improvement on RMSE (log), as shown in Table 2. This indicates that our proposed method obtains better estimation at the closer range, which is much more important than long range accuracy in most applications. This phenomenon may be derived from the pretraining process of our proposed network on ImageNet, which leads to better feature extraction on close range objects. On the other hand, our proposed method delivers significantly better qualitative results. As can be seen in Fig. 7, our proposed method generates much sharper edges with smoother surfaces. These differences can be clearly observed on columnar objects such as road trees and traffic sign poles. The results suggest that our proposed method provides state-of-the-art accuracy on both indoor and outdoor data.

FIGURE 7. Depth prediction on KITTI. Input RGB image, ground truth depth map, our estimated depth map and depth map estimated by the previous state-of-the-art [17]. Our method shows significantly sharper edges and smoother surfaces.
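For reference, the evaluation metrics of Section IV-C can be computed with a short NumPy routine such as the one below. It assumes the standard definitions used in the depth estimation literature following [13] (relative error, squared relative error, RMSE, log RMSE, log10 error, and threshold accuracies at 1.25, 1.25² and 1.25³), and it expects only valid, strictly positive depth values; the function and key names are illustrative.

```python
import numpy as np

def depth_metrics(d, d_hat):
    """d: ground truth depths, d_hat: estimated depths (same shape, all > 0)."""
    d = np.asarray(d, dtype=np.float64).ravel()
    d_hat = np.asarray(d_hat, dtype=np.float64).ravel()
    ratio = np.maximum(d / d_hat, d_hat / d)
    return {
        "rel": np.mean(np.abs(d - d_hat) / d),
        "sq_rel": np.mean((d - d_hat) ** 2 / d),
        "rmse": np.sqrt(np.mean((d - d_hat) ** 2)),
        "rmse_log": np.sqrt(np.mean((np.log(d) - np.log(d_hat)) ** 2)),
        "log10": np.mean(np.abs(np.log10(d) - np.log10(d_hat))),
        # Fraction of pixels whose ratio to the ground truth is below each threshold.
        "delta1": np.mean(ratio < 1.25),
        "delta2": np.mean(ratio < 1.25 ** 2),
        "delta3": np.mean(ratio < 1.25 ** 3),
    }

truth = np.array([1.0, 2.0, 4.0])
exact = depth_metrics(truth, truth)
doubled = depth_metrics(truth, 2.0 * truth)
```

Doubling every prediction gives a relative error of exactly 1 and a log RMSE of log 2 regardless of the depths themselves, which illustrates that the log-based metrics weight near and far errors by ratio rather than by absolute distance.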

E. ABLATION STUDIES
We performed several experiments to analyze the performance of different parts of our proposed network. All ablation study experiments were conducted on the NYU Depth v2 dataset.

1) SPATIAL ATTENTION MODULE DESIGN
Since the attention module plays a critical role in performance improvement, we run several experiments to verify its effectiveness and optimize its parameters. Table 3 compares the performance of different spatial attention module designs. With the application of our proposed attention module, we are able to gain around 5% improvement in accuracy compared to the same network architecture without the attention module. We also compare our proposed atrous convolution-only spatial design, the pooling-only design of [43], and a spatial design equipped with both atrous convolution and pooling. The evaluation results indicate that our proposed atrous convolution-based spatial attention module is able to extract features with better representation, which leads to better accuracy in depth estimation. The pooling operation in spatial attention is not practical because max and average pooling lose significant amounts of spatial information.

2) TRANSFERRED WEIGHTS AND NETWORK SIZE
In this experiment we test the influence of the transfer learning technique; we also replace the DenseNet-169 architecture with DenseNet-121 to test the performance of different encoder depths. Table 4 shows the comparison of performance with different weight initialization methods and encoder depths. The best performance at each depth is bolded and the second best is underlined. It can be seen that the impact of the transfer learning technique is significant. As shown in the last row of Table 4, training without transfer learning leads to poor performance due to the lack of training data and training epochs. By initializing our network with meaningful weights, we are able to gain a significant amount of improvement. In addition, even with a much smaller encoder, we are able to outperform the previous state-of-the-art techniques in most metrics. Though DenseNet offers a deeper architecture with 201 layers, previous work [19] argued that the performance improvement does not justify the trade-off with the much slower convergence and higher memory usage. Therefore, we conclude that utilizing the DenseNet-169 architecture for our encoder achieves the best balance between performance and speed.

V. CONCLUSION
This paper proposes a convolutional neural network for monocular depth estimation from a single image. We leverage the effectiveness of high performing pre-trained models and a specially designed attention module. Whereas most research focuses mainly on network architecture design, our research aims to point out the importance of other aspects of model learning: training strategy and model effectiveness improvement. We propose a spatial-channel dual attention module which improves the representation power of the encoder, and a training strategy which combines transfer learning and ordinal regression to improve model convergence. Our proposed method can achieve a rate of 18 frames per second. Moreover, our results show that our simple encoder-decoder network with the attention module and ordinal regression is well suited to depth estimation in both indoor and outdoor environments, as demonstrated on NYU Depth v2 and KITTI, two of the biggest datasets for indoor and outdoor images, respectively.