Monocular Depth Prediction With Residual DenseASPP Network

Monocular depth estimation is an ill-posed problem because infinitely many 3D scenes can be projected to the same 2D image. Most recent methods focus on image-level information from deep convolutional neural networks, but training them may suffer from slow convergence and accuracy degradation, especially for deeper networks with more feature channels. Based on an encoder-decoder framework, we propose a novel Residual DenseASPP Network. In our Residual DenseASPP Network, we define features as low/mid/high vision features and use two kinds of skip connections to learn useful features at certain layers: feature concatenation in the dense block generates more features within the same layer, and feature summation in the residual block strengthens the backward gradient. The experimental results show that high vision features require more channels through feature concatenation, while low/mid vision features need better convergence through feature summation. Experiments show that our proposed approach achieves state-of-the-art performance on both the NYUv2 and Make3D datasets.


I. INTRODUCTION
Monocular depth estimation aims to estimate the depth information of a scene from a single RGB image. It is a fundamental problem of computer vision with many potential applications, such as semantic segmentation [1], [2], object detection [3]-[5], human pose estimation [6], [7], 3D reconstruction [8], and simultaneous localization and mapping [9]. Inspired by the remarkable success of image classification, most recent methods tackle this task with deep networks. They learn visual representations for depth estimation in an end-to-end multi-layer fashion [10], in which features of various receptive fields are generated by convolution and pooling operations. The resolution of the features is reduced by the pooling operations, so the final features do not contain sufficient detail information; such features are not beneficial for model convergence. Atrous Spatial Pyramid Pooling (ASPP) [11] has been studied to alleviate this problem; it generates features with abundant receptive fields by concatenating the outputs of multiple atrous convolutional layers. However, two problems remain. (1) Dense skip connections are not necessarily more effective than the simple connections in ResNet. Skip connections among certain features can be invalid, especially when the combinations of these features do not exist in the real world, and existing methods cannot fully interpret which connections are contributing. (2) It is hard to learn the massive parameters introduced by the skip connections in DenseNet. This problem is aggravated in a DenseNet with multi-scale ASPP, which can cause the model to overfit. This suggests that searching for a better architecture to reduce model complexity has not been sufficiently explored in depth estimation.
To solve the above problems, we propose a novel Residual DenseASPP Network based on an encoder-decoder framework. We observe that pattern combinations vary across different receptive fields, and design three decomposed blocks for low/mid/high vision features. We fully exploit two kinds of skip connections in a feed-forward fashion and suggest an effective architecture for depth estimation by fusing them.
To verify the superiority of the proposed model, extensive experiments are performed on two large-scale databases. Experiments show that our proposed approach achieves state-of-the-art performance on both NYUv2 and Make3D datasets.
The main contributions of our work are threefold: (1) We propose a novel Residual DenseASPP Network, in which we fully exploit the network architecture for low/mid/high vision features by fusing two types of skip connections. In our network, certain types of skip connections are designed for certain layers, which improves deep feature learning for depth estimation.
(2) We search for flexible architectures with different model complexities and design various ablation models of our Residual DenseASPP Network. We design a low/mid vision feature module with fewer connections and fewer feature channels, which improves depth estimation.
(3) Experiments show that our proposed approach achieves state-of-the-art performance on both the NYUv2 and Make3D datasets. The visualization results show that our method gives good predictions under many challenging conditions, including small objects, complex boundaries, illuminations, and objects with large depth variation.

II. RELATED WORK
In monocular scene depth estimation, feature extraction from a single image plays a crucial role. We categorize the related depth estimation approaches into four types.

A. DEPTH CONVOLUTION MODEL
A CNN can be extended with harmonizing overcomplete local predictions [21], multiple candidates in the frequency domain [22], and the whole strip masking module [23]. Nevertheless, these works do not consider the relation between two regional features. Conditional Random Fields (CRFs) can model this pair-wise relation and have been successfully introduced into CNNs [24]-[27] to improve the prediction of depth maps. Yan et al. [28] applied the CRF model with additional constraints on the normal vector of the object surface, which can estimate the depth of multi-level scenes at the super-pixel and pixel levels. The key question in these methods is what prior information can be introduced into CNNs to address challenges such as fuzzy structure, unclear layering, and the fusion of object depth and background. Depth gradients [29] and surface normals [30] are important edge cues for monocular depth estimation, and segmentation can provide regional cues to optimize depth estimation [1], [31]-[33]. Zhang et al. [34] provide a joint task-recursive learning framework for semantic segmentation and depth estimation. Some works [35], [36] also introduce a pose predictor into depth prediction. Godard et al. [37] use per-pixel minimum reprojection to address the problem of occluded pixels. As a special application, monocular video can supply optical flow cues to estimate image depth [38]. Recently, Wang et al. [39] used UnOS (Unified Unsupervised Optical-Flow and Stereo-Depth Estimation) to estimate image depth. Zhao et al. [40] designed a geometry consistency loss for binocular depth prediction. Abarghouei and Breckon [41] used synthetic images to predict depth.

B. FULL CONVOLUTION WITH PARALLEL MULTI-SCALE DEPTH MODEL
ASPP applies parallel dilated convolutions with different dilation rates to extract features at different scales. On this basis, the CRF model further optimizes the local inconsistencies between regional features [42]-[44]. Fu et al. [45] used ASPP and cross-channel information to learn multi-scale features, and designed a pixel-wise ordinal regression loss function. However, a full convolution with a parallel multi-scale depth model has many parameters and still suffers from a serious gradient degradation problem.

C. DEPTH CONVOLUTION MODEL WITH DENSE BLOCK
DenseNet alleviates the vanishing gradient problem because it strengthens feature propagation and encourages feature reuse through feature concatenation. Based on ASPP, Fu et al. [45] further used a multi-scale cascade dense CNN to train an RCNN model for depth estimation. Huang et al. [17] introduced the Dense Convolutional Network (DenseNet), which connects each layer to every other layer in a feed-forward fashion. Jégou et al. [46] applied a cascade network of DenseNet blocks with down-sampling and up-sampling for segmentation. Yang et al. [12] combined ASPP with dense connections to construct the DenseASPP module. Gur et al. [47] provided a seminal work with a CNN enhanced by the point spread function and focus cues, adopting a multi-scale sampling strategy to obtain more features. We introduce a Residual DenseASPP module into DeepLab V3+ to find a better network architecture. Our novel Residual DenseASPP module contains three composed blocks to describe low/mid/high vision features separately. Note that the low/mid/high vision features are defined relative to one another among the three blocks. More variants of our DenseASPP module are discussed in Figure 2.

D. DEPTH CONVOLUTION MODEL WITH RESIDUAL BLOCK
Unlike the feature concatenation in DenseNet, a residual module reuses features by feature summation [48]. Cao et al. [49] added a residual module to the basic FCN network as an FC-residual module. Kuznietsov et al. [50] used ResNet as the backbone network and proposed a special up-sampling scheme for depth estimation.
However, these works still do not tell us which architecture is more reasonable. We also focus on what kind of low/mid/high vision features is important in depth estimation. Therefore, we propose a Residual DenseASPP model that fully exploits two kinds of skip connections in a feed-forward fashion and suggests an effective architecture for depth estimation by fusing them.

III. METHODOLOGY
In this section, we describe our Residual DenseASPP model for monocular depth prediction in detail.

A. NETWORK ARCHITECTURE
Our model uses an encoder-decoder structure based on DeepLab V3+ [33], as shown in Fig 1. The encoder module encodes multi-scale contextual information by applying atrous convolution at multiple scales, and the decoder module refines the depth results from these features. In the encoder, we use ResNet-101 as the backbone of the Deep Convolutional Neural Network (DCNN) [33]. Unlike the basic encoder, we change the ASPP module to DenseASPP and integrate the residual module into DenseASPP to address the vanishing gradient problem more effectively.

B. FROM ATROUS CONVOLUTION TO RESIDUAL DenseASPP
1) ATROUS CONVOLUTION
Atrous convolution increases the receptive field without changing the feature map resolution, and ASPP generates multi-scale features by parallel atrous convolutions. In the one-dimensional case, for each location i on the output feature map y and a convolution filter w, atrous convolution is applied over the input feature map x as follows:

y[i] = \sum_{k=1}^{K} x[i + a \cdot k] \, w[k]

where the atrous rate a determines the sampling stride, w[k] denotes the k-th parameter of the filter, and K is the filter size. The receptive field can be adaptively modified by changing the rate value; note that standard convolution is the special case with rate a = 1. For an atrous convolutional layer, the receptive field size R is calculated as:

R = a(K - 1) + 1

2) DenseASPP
Further, DenseASPP connects more atrous layers with dense connections to obtain larger receptive fields. In our network, we use five atrous convolutional layers in a cascade fashion in DenseASPP, where the dilation rate increases layer by layer (3, 6, 12, 18, 24). We feed DenseASPP with a 64-dimensional feature map y_0, which is transformed from the 2048-dimensional feature map y_{ResNet-101} of ResNet-101 by an FC layer. In DenseASPP, each layer accepts a concatenated feature map, which contains the output of every lower atrous layer and the input feature map. The output of each atrous layer is an equal-sized 64-dimensional feature. We can write the output as:

y_l = H_{K, a_l}([y^c_{l-1}, y^c_{l-2}, \cdots, y_0])

where H is the atrous convolution, K is the filter size, a_l represents the dilation rate of layer l, [\cdot] denotes the concatenation operation, [y^c_{l-1}, y^c_{l-2}, \cdots, y_0] is the feature map formed by concatenating the outputs of all previous layers, and the superscript c indicates that the output is extracted through dense connections. The final output of DenseASPP is a feature map generated by multi-rate, multi-scale atrous convolutions.
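To make the formulation concrete, the following is a minimal NumPy sketch of the one-dimensional atrous convolution and the receptive-field rule above. The names `atrous_conv1d` and `receptive_field` are illustrative, not part of the paper's code, and 'valid' padding is assumed.

```python
import numpy as np

def atrous_conv1d(x, w, rate):
    """1-D atrous (dilated) convolution over 'valid' positions only:
    y[i] = sum_k x[i + rate*k] * w[k], with k = 0..K-1."""
    x, w = np.asarray(x, float), np.asarray(w, float)
    K = len(w)
    span = rate * (K - 1)  # distance spanned by the dilated filter
    return np.array([np.dot(x[i:i + span + 1:rate], w)
                     for i in range(len(x) - span)])

def receptive_field(K, rate):
    """Receptive field of a single atrous layer: R = rate*(K-1) + 1."""
    return rate * (K - 1) + 1
```

With a 3-tap filter, rate 1 gives the standard receptive field of 3, while the paper's largest rate of 24 expands it to 49, all without any pooling.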

C. RESIDUAL DenseASPP
Although DenseASPP increases the reusability of the features of each layer, it still cannot tell us which layer's features are important in depth estimation. Therefore, we construct a Residual DenseASPP module, which contains three composed blocks to describe low/mid/high vision features separately. Basic DenseASPP consists only of low and mid vision features, where the low vision features contain the five atrous convolutional outputs and the mid vision features are the final output. We add an additional block that combines the final output with the input from ResNet-101 to extract high vision features.
We design six variants, named RRR, RRD, RDD, DRR, DDR, and DDD, to discuss what kind of low/mid/high vision features is important in Residual DenseASPP. All variants have the same input features and the same dilation rates. Fig 2 shows the six variants, and Fig 3 shows how they combine low/mid/high features. Admittedly, DDD provides the most abundant low/mid/high vision features among all variants, but DDD does not perform best on the NYU Depth v2 dataset, which may be because the dense block (D) cannot provide good gradients during feature learning, while the residual block (R) is an alternative.
RRR. We generate the ablation model RRR, as shown in Fig 2, where R means the corresponding composed block is connected in a residual way. In RRR, each layer accepts a summed feature map, which sums the output of every lower atrous layer and the input feature map. The output of each atrous layer is an equal-sized 64-dimensional feature. We can write the output of the low vision features as:

y_l = H_{K, a_l}(\{y^+_{l-1}, y^+_{l-2}, \cdots, y_0\})

where \{\cdot\} denotes the summation operation, \{y^+_{l-1}, y^+_{l-2}, \cdots, y_0\} is the feature map formed by summing the outputs of all previous layers, and the superscript + indicates that the output is extracted through residual connections. We further adopt residual connections to extract the mid/high vision features of RRR.

RRD. We replace the high vision feature connection in RRR with a dense block to obtain RRD. As shown in Fig 3, RRD can provide more high vision features than RRR.

RDD. RDD can provide more mid/high vision features than RRR.

DRR. DRR can provide more low vision features than RRR.

DDR. DDR can provide more low/mid vision features than RRR. In every variant, the final high vision features are formed in the same way, with each of the three composed blocks applying either the concatenation [\cdot] or the summation \{\cdot\} according to its D/R type.
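The D/R distinction above reduces to two array operations on equal-sized feature maps. A minimal NumPy sketch (the function name `combine` is illustrative):

```python
import numpy as np

def combine(features, block_type):
    """Combine equal-sized feature maps (channels, H, W) the way a composed
    block does: a dense block 'D' concatenates along the channel axis
    (more channels per layer), while a residual block 'R' sums element-wise
    (channel count unchanged, stronger backward gradient)."""
    if block_type == "D":
        return np.concatenate(features, axis=0)  # [y_{l-1}, ..., y_0]
    if block_type == "R":
        return np.sum(features, axis=0)          # {y_{l-1}, ..., y_0}
    raise ValueError("block_type must be 'D' or 'R'")
```

With three 64-channel inputs, 'D' yields a 192-channel map while 'R' stays at 64 channels, which is why variants with more D blocks carry more parameters and more patterns per layer.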

D. LOSS
In our method, we use the L2 loss function. L2 is sensitive to outliers in the training data since it penalizes larger errors more heavily. It minimizes the squared Euclidean norm between the predictions g^{*} and the ground truth g:

\mathcal{L}(g^{*}, g) = \frac{1}{N} \sum_{i=1}^{N} \lVert g^{*}_{i} - g_{i} \rVert_{2}^{2}

where N is the number of valid depth pixels.
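A minimal NumPy sketch of this loss (the name `l2_loss` is illustrative; in practice the same quantity is computed by a framework routine such as PyTorch's MSE loss):

```python
import numpy as np

def l2_loss(pred, gt):
    """Mean squared L2 loss between predicted and ground-truth depth maps;
    squaring penalizes large errors (outliers) more heavily."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    return float(np.mean((pred - gt) ** 2))
```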

IV. EXPERIMENTS
To verify the superiority of our method, we provide extensive experimental results on the challenging NYU Depth v2 and Make3D datasets and compare with state-of-the-art works. The evaluation metrics follow previous works [24].
A. DATASETS
1) NYU DEPTH V2 DATASET [51]
We evaluate our method on NYU Depth v2, one of the largest RGB-D datasets for indoor scene reconstruction. The raw dataset consists of 464 scenes captured with a Microsoft Kinect, with an official split of 249 training and 215 test scenes. In our experiments, we consider a subset of 1449 RGB-D pairs, of which 795 are used for training and the rest for testing. In particular, we conducted data augmentation (scaling, rotation, color transformation, flipping) to enlarge the training set, obtaining 87,000 training pairs in total. We then down-sample the original 640 × 480 frames to half resolution and center-crop them to 304 × 228 pixels as input to the network.
2) Make3D DATASET [52], [53]
We also test our algorithm on the outdoor dataset Make3D. We use the official split with 400 pairs for training and 134 for testing, and train on an augmented set of around 15k samples. Since Make3D expresses depths only up to 80 m, the depths of far objects are often inaccurate. Errors in training are only calculated where the depth is less than 70 meters in a central image crop.

B. IMPLEMENTATION DETAILS
We implement the proposed network in the open-source PyTorch framework on two NVIDIA 1080 GPUs with 16 GB of memory. As the encoder for feature extraction, we use ResNet-101 with weights pre-trained for image classification on the ILSVRC dataset [54]. The stochastic gradient descent (SGD) algorithm starts with a learning rate of 0.001, which is divided by 10 about every 6 epochs; training runs for 24 epochs. The momentum and weight decay are set to 0.9 and 0.0005, respectively, and the batch size is set to 8. It takes around 14 hours to train the Residual DenseASPP Network. To avoid over-fitting, we augment images before input using random horizontal flipping and random contrast, brightness, and color adjustment in the range [0.6, 1.4], each applied with a 50% chance. We also apply a random rotation of the input images in the range [−5, 5] degrees, and train our network on random crops of size 304 × 118.
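The learning-rate schedule described above can be sketched as a simple step decay; in PyTorch it would correspond roughly to `torch.optim.lr_scheduler.StepLR(optimizer, step_size=6, gamma=0.1)`. The exact epoch boundaries are an assumption, since the paper says "around every 6 epochs":

```python
def step_lr(base_lr=0.001, epoch=0, step=6, gamma=0.1):
    """Step decay: the learning rate is multiplied by `gamma` (i.e. divided
    by 10) once every `step` epochs, starting from `base_lr`."""
    return base_lr * gamma ** (epoch // step)
```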

D. ERROR METRICS
For quantitative evaluation, we report errors obtained with the following widely used error metrics. Denote g as the ground truth depth, g_e as the estimated depth, and T as the set of all valid points in the images.
• mean relative error (rel): \frac{1}{|T|} \sum_{g \in T} \frac{|g - g_e|}{g}
• mean log10 error (log10): \frac{1}{|T|} \sum_{g \in T} |\log_{10}(g) - \log_{10}(g_e)|
• root mean squared error (rmse): \sqrt{\frac{1}{|T|} \sum_{g \in T} (g - g_e)^2}
• δ_i threshold: \mathrm{card}(\{g_e : \max(g_e / g, g / g_e) < 1.25^i\}) / \mathrm{card}(T)
where card is the cardinality of a set. A higher δ_i indicates a better prediction. Table 1 shows the comparison with state-of-the-art methods on NYU Depth V2. Our RRD network achieves 84.10% accuracy (threshold < 1.25) on the NYUv2 dataset, which is superior to state-of-the-art methods.
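These four metrics can be computed together over the valid points in a NumPy sketch (the function name `depth_metrics` is illustrative):

```python
import numpy as np

def depth_metrics(gt, pred):
    """Standard monocular-depth error metrics over the valid points T:
    mean relative error, mean log10 error, rmse, and delta_i accuracies."""
    gt, pred = np.asarray(gt, float), np.asarray(pred, float)
    rel = float(np.mean(np.abs(gt - pred) / gt))
    log10 = float(np.mean(np.abs(np.log10(gt) - np.log10(pred))))
    rmse = float(np.sqrt(np.mean((gt - pred) ** 2)))
    ratio = np.maximum(gt / pred, pred / gt)       # symmetric ratio
    deltas = [float(np.mean(ratio < 1.25 ** i)) for i in (1, 2, 3)]
    return rel, log10, rmse, deltas
```

A perfect prediction gives rel = log10 = rmse = 0 and δ_1 = δ_2 = δ_3 = 1.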

E. COMPARISON WITH THE STATE-OF-THE-ART
(1) Our RRD model outperforms most CNNs without parallel multi-scale depth models. However, the CNN with a surface normal cue [30] outperforms RRD in rel, probably because prior knowledge can provide more detailed features for depth estimation.
(2) Our RRD model outperforms the CNN with ASPP [42], which suggests that the multi-scale receptive fields in ASPP are not enough for complex depth patterns.
(3) We notice that our RRD achieves better performance in terms of threshold accuracy and rmse than the CNN with dense block features [45], and worse performance in terms of rel and log10. That is to say, dense block features are efficient, and the network architecture and its optimizer can further impact threshold accuracy.
(4) A residual learning framework can ease the training of networks [48], [49], but these methods do not fully express the different characteristics of low/mid/high vision patterns; therefore, they give worse performance than our RRD method.
(5) Our RRD shows that low/mid/high vision patterns have different characteristics. The number of low/mid vision patterns is lower than the number of high vision patterns, and low/mid vision patterns need to be learned more precisely than high vision patterns, which may be because low/mid vision patterns are the components of high vision patterns, so a small change in low/mid vision patterns can be amplified in high vision patterns. Meanwhile, RRD has fewer feature channels in its low/mid vision features than DDD and gives better performance than DDD.
(6) Comparison with encoder-decoder with ASPP. Our work is similar to DenseASPP with an added attention block [47] and the Deep Ordinal Regression Network [45], because both use an encoder with ASPP and a decoder with upsampling and convolutions. First, our RRD outperforms DenseASPP with an added attention block [47], probably because that method adds a convolutional path to DenseASPP rather than a direct skip connection and appends a self-attention block to this new module; these operations increase the model parameters and cause slight overfitting. Second, the Deep Ordinal Regression Network [45] outperforms our RRD in rel, probably because it uses an ordinal regression module that takes into account the ordering of discrete depth values; this module relaxes spatial constraints to allow relatively larger differences during training. But our RRD still outperforms it in δ_1, δ_2, and rmse, because it only uses parallel multi-scale ASPP, while we enlarge the feature channels with DenseASPP and learn them better with the residual module. Table 2 shows the comparison with state-of-the-art methods on Make3D. Errors are computed for depths less than 70 m in a central image crop [40]. Our methods outperform many methods. Xu et al. [42] outperform DDD in rel and rmse, probably because the CRF smooths some noisy features. Fu et al. [45] give the best performance in rel, which shows that the dense block is important for feature extraction. Our RRD gives the best performance in rmse due to its optimized network architecture.

F. ABLATION STUDIES
In this part, we analyze the results of the six ablation models mentioned above. Each ablation model takes a certain combination scheme for the low/mid/high vision features. The ablation results are shown in Table 3. We find that: (1) RRR performs worst among all variants, which may be because features are excessively reduced by the summation operation.
(2) DDD outperforms RRR by 0.104 in accuracy (threshold < 1.25), which shows that a dense block can provide more abundant features than a residual block.
(3) Admittedly, DDD performs worse than the other four variants, RRD, RDD, DRR, and DDR, which shows that abundant features are hard to learn and that certain residual blocks can facilitate feature training.
(4) Comparing DDD and RDD, better low vision features increase the accuracy (threshold < 1.25) by 0.07. Further, in RRD, better low/mid vision features increase the accuracy (threshold < 1.25) by 0.19, and RRD performs best among all six variants. That is to say, low/mid vision features are fewer than high vision features, and extra low/mid vision features can be redundant. Moreover, a residual block is needed to learn more precise low/mid vision features.
(5) Comparing DDD and DDR, better high vision features increase the accuracy (threshold < 1.25) by 0.03, and in DRR, better mid/high vision features increase it by 0.05. Precise mid/high vision features learned with a residual block are less important than low vision features.
(6) Comparing RRR and RRD, high vision features need more channels, which may be because high vision features, with their larger receptive field sizes, have more complex patterns.

Figure 4 shows the depth prediction results of the ablation models on NYU Depth v2. These instances contain many challenges, including small objects, complex boundaries, illuminations, and objects with large depth variation.
(1) A small object is hard to segment in a scene, and its depth is easily interfered with by background information. The small bins in row (c), small screens in row (i), and small bottles in row (j) can be extracted by RRD. DDD also performs well on small objects, but RRR loses them, which indicates that more high vision features are good for small objects.

(2) Complex boundaries, such as long narrow edges and twisty edges, need complex patterns within a certain receptive field. The chair in row (d) and the sofa in row (e) are well estimated by RRD and DDD. Similarly, high vision features are good for complex boundaries.
(3) The illuminations in rows (f) and (h) greatly interfere with local depth prediction. DDR and DRR perform worse than RRD, which indicates that low/mid vision features need to be reduced and that high vision features can capture illumination invariance within a large receptive field.
(4) Objects with a large change in depth also need the complex patterns of high vision features, such as the wall in row (g). RDD performs worse than RRD, which indicates that precise mid vision features may be helpful in this case.
(5) Interestingly, in row (a), the sculpture is extracted well by DRR, because more low vision features are good for this complex boundary, while DRR does not describe the background depth well. On the whole, RRD and DDD may give more accurate predictions than DRR.
(6) The problem in row (b) is that the wall depth prediction is interfered with by the window and door in all variants; RRD can alleviate this interference.

V. CONCLUSIONS
In this work, we propose a novel Residual DenseASPP Network, in which we define features as low/mid/high vision features and use two kinds of skip connections: feature concatenation in a dense block and feature summation in a residual block. We discuss what kind of low/mid/high features is important in Residual DenseASPP through six variants; one ablation model, RRD, outperforms all other variants and gives better performance in terms of threshold accuracy than state-of-the-art methods. We observe that low/mid/high vision patterns have different characteristics: the residual block facilitates training for low/mid vision feature learning, and the dense block provides abundant features for high vision patterns. We show our depth prediction results on NYU Depth v2 and Make3D, and demonstrate that our method gives good predictions under many challenging conditions, including small objects, complex boundaries, illuminations, and objects with large depth variation.