Deep Convolutional Grid Warping Network for Joint Depth Map Upsampling

Depth maps play an important role in the representation of 3D information. They are often simultaneously acquired with color images; however, their resolution is significantly lower than that of color images owing to hardware limitations. In this paper, we propose a novel approach to upsample depth maps by using geometric deformation instead of pixel value refinement, which is employed in a majority of existing methods. This approach, known as grid warping, displaces the position of blurred pixels around the edge towards the center of the edge. The displacement vector for warping is obtained from an analysis of the corresponding high-resolution color image. Furthermore, we propose an edge signal and displacement vector modeling for a more effective analysis. The experimental results show that the proposed method significantly improves the quantitative and visual performance, as compared to state-of-the-art methods. The source codes of the proposed method will be available at https://github.com/yym064/DeepGridWarp.


I. INTRODUCTION
Owing to the developments in 3D technologies, considerable attempts have been made to apply 3D technologies to various types of applications, including robotics and advanced driver assistance systems [8], [35]. Depth information plays a critical role in these applications for internal as well as external processing.
Passive and active methods are popularly used to acquire depth maps [16], [33], [37], [39], [41]. In the passive method, depth information is obtained indirectly; a typical example of this method is stereo matching, wherein depth information is estimated from two scenes with a binocular parallax. In contrast, in the active method, the depth map is acquired directly: depth information is captured via special devices such as laser range scanners or time-of-flight cameras. Microsoft Kinect and SoftKinect are examples of devices used to directly capture depth information [39], [41]. However, the resolution and quality of the acquired depth map are generally low compared with those of RGB color images, owing to limitations in hardware technology.
The associate editor coordinating the review of this manuscript and approving it for publication was Long Wang.
Even though the insufficient resolution can be partially addressed by many of the existing upsampling schemes, the quality of the upsampled depth map remains significantly lower than that of color images, especially when the resolution is increased considerably. A popular approach to address this problem is joint filtering, i.e., the upsampling filter is derived using the depth map as well as its corresponding color image. Joint filtering assumes that the edge structure of the depth map is highly correlated with that of its corresponding color image. Therefore, a filter is designed to transfer meaningful structural information in the color edge to the depth map.
These approaches commonly suffer from two problems during information transfer. First, they transfer unwanted information such as texture patterns. This is because the information transfer is realized via the kernel method, and the kernel computation is sensitive to small pixel changes such as texture patterns. Second, these methods frequently cause under- or over-shooting artifacts around the edge boundary due to inaccurate kernel estimations.
To further investigate the proposed approach, we first explore the existing works on image upsampling technology. We classify the existing upsampling methods into the following three categories: model-based kernel filtering, optimization problem, and deep learning approaches.

A. MODEL-BASED KERNEL FILTERING
A majority of the methods falling under this category employ variants of adaptive nonlinear filters, such as the bilateral filter (BF) [32] or the guided filter [10]. In [19], a joint bilateral upsampling (JBU) filter is proposed to incorporate the corresponding high-resolution (HR) color image information. Kim et al. proposed a trilateral filter to reduce the blurring artifact caused by the misalignment of the depth edge and the color edge [18]. Jung proposed an adaptive joint trilateral filter, wherein the color and depth maps are simultaneously restored according to the classification of the depth edge [15]. In [24], an extension of the joint bilateral filter is proposed to avoid the texture copying artifact, which occurs when structure information from the color image is transferred to the depth map. To prevent artifacts, it integrates local gradient information during filtering. Chan et al. proposed a noise-aware filter (NAF) for depth upsampling [3]. It acts as a multilateral filter by adjusting the influence of color similarity to prevent texture copying artifacts. In [27], Min et al. proposed a weighted mode filter (WMF) that generates filter coefficients based on local statistical information induced by a histogram. Hua et al. proposed an extended guided filter (EGF) by inserting an additional term that considers the local second-order gradients of the depth map [12]. This filter employs an onion-peel-like filtering order, which significantly improves the performance of the filter. As an extension of EGF, Yang et al. proposed a confidence-based joint guided filter (CJGF) that controls the filtering order using a confidence map derived from the shape of the unreliable region, the depth map, and the color pixel values [40].

B. OPTIMIZATION PROBLEM
The methods falling under this category build the objective function by considering various factors and attempt to minimize it. In [4], a depth upsampling problem is formulated with the Markov random field (MRF), wherein the data term is determined using a given depth map, and the smoothness term is determined using estimated HR depth samples derived from the HR color images. Based on MRF framework, Park et al. [30] proposed to use an additional term known as non-local mean regularization, which is implemented using the anisotropic structure-aware filter. Similar to the non-local mean filter, this term enables the contribution from faraway pixels during processing. Another MRF formulation was suggested by Lu et al. [25], wherein the truncated absolute difference between the estimated and the input depth value is employed for depth map upsampling. Liu and Gong proposed to use anisotropic heat diffusion filtering (ADF) [23], where the known pixels of depth maps are set as heat sources, and depth enhancement is performed by diffusing depth value from sources to unknown pixels based on color similarity.

C. DEEP LEARNING
The recent popularity of deep learning has motivated active research in deep learning-based approaches. In [5], a single image super-resolution network based on a convolutional neural network (CNN) was proposed. Lim et al. proposed a deeper and more complicated network structure that consists of several residual blocks to extract meaningful features from the input image [22]. Haris et al. proposed a different approach using the deep back-projection network (DBPN) [9], where feature maps are first extracted using a convolution layer, and then upsampled and downsampled repeatedly to feed back the error in each stage. When multiple or multimodal input data are available, as in depth map upsampling with corresponding color images, different deep network architectures can be considered. Hui et al. proposed a network which gradually upsamples the depth map by using color images as a reference [13]. Li et al. proposed deep joint filtering (DJF) using a two-stream network, wherein one stream extracts feature maps from the color image and the other stream extracts features from the depth map [21]. Then, the extracted feature maps are combined using a shallow network called the fusion network. In [36], Su et al. proposed pixel-adaptive convolution (PAC), which mimics the bilateral filter. Adopting a different approach, Kim et al. [17] focused on the receptive field for depth map upsampling; the receptive fields were enlarged using the deformable kernel network (DKN).
In this paper, we propose a novel and distinct approach for a joint depth map upsampling algorithm with a deep network. Instead of directly inferring the HR depth map or estimating a locally adaptive kernel, the proposed method reconstructs the low-resolution depth map by warping the pixel position without changing the depth intensity. The major contributions of the study are as follows:
• To the best of our knowledge, this is the first deep learning-based approach to upsample a depth map via the image warping technique.
• We extract the displacement vector for grid warping from the corresponding color image and design the network for an efficient reconstruction of HR depth maps, using the estimated deformation information.
• We validate the proposed approach via mathematical edge modeling, which verifies the robustness of the proposed displacement vector estimation.
The remainder of this paper is organized as follows. In Section II, the warping method is described in detail. The proposed system and its theoretical analysis are introduced in Section III. The implementation details and experimental results are provided in Section IV. Finally, the conclusions of the study are presented in Section V.

II. IMAGE RESTORATION BY GRID WARPING
For the purpose of image restoration, image warping methods were proposed in [1], [20], [28], especially aimed toward image deblurring. The basic assumption in these techniques is that the blurring process distorts the edge by shifting the pixels away from the true edge. Therefore, the remedy for deblurring should be the inverse process of the pixel shift, i.e., shifting the pixels back toward the edge, as shown in Fig. 1(a). In contrast, kernel-based methods reconstruct the distorted pixel by shifting the pixel value, as shown in Fig. 1(b). For convenience of explanation, we use the 1D edge profile of 2D images. Mathematically, in the 1D domain, restoring the blurred signal $I_b(x)$ to the reconstructed signal $\hat{I}(x)$ can be performed by determining the displacement function $d(x)$ as
$$\hat{I}(x + d(x)) = I_b(x). \tag{1}$$
Therefore, the core of a warping-based restoration scheme is to determine the grid displacement function. Nasonova et al. [28] considered that the ideal step edge, i.e., the 1D edge, has the following profile:
$$H(x) = \begin{cases} 1, & x \ge 0, \\ 0, & x < 0. \end{cases} \tag{2}$$
If the blurring effect is modeled via Gaussian filtering, then the blurred edge can be modeled as
$$I_b(x) = \int_{-\infty}^{x} \left[\delta * G_\sigma\right](t)\, dt = \frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^{x} e^{-t^2/(2\sigma^2)}\, dt, \tag{3}$$
where $G_\sigma$ is the Gaussian kernel, $\sigma$ is the blurring parameter, $*$ indicates the convolution operation, and $\delta(t)$ is the delta function. This model shows that the edge profile modified due to blurring has the form of a cumulative Gaussian distribution function, as shown in Fig. 1.
In [28], the displacement function $d(x)$ for (3) is obtained by the spring model in (4), where $\kappa$ controls the sharpness of deblurring. The displacement vector has a positive value when $x < 0$, a negative value when $x > 0$, and its maximum magnitude when $x = \pm\sigma$. As shown in (3) and (4), the overall performance of this scheme depends strongly on the determination of the true edge position (i.e., $x = 0$) and the optimal determination of $\sigma$.
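As an illustration of warping-based deblurring, the following minimal Python sketch warps a Gaussian-blurred step edge with a displacement profile that merely satisfies the properties stated above (positive for x < 0, negative for x > 0, largest magnitude at x = ±σ). The specific profile d(x) = −κ x exp(−x²/(2σ²)), the forward-warping convention (a sample at x moves to x + d(x)), and all function names are assumptions made for this sketch; they are not necessarily the form of (4) in [28].

```python
import math
import numpy as np

def blurred_step(x, sigma):
    """Ideal step edge blurred by a Gaussian of std sigma (a cumulative Gaussian)."""
    erf = np.vectorize(math.erf)
    return 0.5 * (1.0 + erf(x / (sigma * math.sqrt(2.0))))

def assumed_displacement(x, sigma, kappa):
    """Hypothetical profile: d > 0 for x < 0, d < 0 for x > 0, |d| peaks at x = +/- sigma."""
    return -kappa * x * np.exp(-x**2 / (2.0 * sigma**2))

def forward_warp(values, x, d):
    """Move each sample from x to x + d(x), then resample on the regular grid.

    Requires x + d(x) to be strictly increasing, which holds here for 0 < kappa < 1.
    """
    return np.interp(x, x + d, values)

x = np.linspace(-10.0, 10.0, 401)
sigma, kappa = 2.0, 0.8
d = assumed_displacement(x, sigma, kappa)
blurred = blurred_step(x, sigma)
restored = forward_warp(blurred, x, d)

# Compare both signals with the ideal step edge located at x = 0.
ideal = (x >= 0).astype(float)
err_blurred = np.mean(np.abs(blurred - ideal))
err_restored = np.mean(np.abs(restored - ideal))
print(f"mean abs. error before warping: {err_blurred:.4f}, after: {err_restored:.4f}")
```

Because the samples are only moved and never re-valued, the restored profile keeps the value range of the blurred one; a larger κ (below 1 in this sketch) pulls the samples more strongly toward the edge.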

III. PROPOSED METHOD
A. OVERALL SYSTEM STRUCTURE
The fundamental idea of the proposed scheme is to use the grid warping technique for depth map upsampling. As the conventional grid warping technique is designed to reconstruct a blurred image, the input image is resized to the target resolution, and the resized image is then regarded as the blurred version of the ground truth. However, unlike the conventional warping scheme described in Section II, we consider the case in which a pair consisting of an HR reference color image and a low-resolution (LR) depth map is given, which enables us to infer the displacement vector flows in a different manner. As mentioned in Section II, the core step in deblurring via the warping scheme is the determination and localization of the displacement function $d(x)$. In the proposed scheme, we obtain the displacement vector from the given HR reference color image instead of estimating it from the blurred input itself. The overall process is depicted in Fig. 2. First, we downsample the HR color image $I(x)$ to the resolution of the LR depth map $D(x)$, and upsample it to generate $I_u(x)$ via simple, pre-determined downsampling and upsampling methods, i.e.,
$$I_u(x) = f_\uparrow\left(f_\downarrow\left(I(x)\right)\right), \tag{5}$$
where $f_\uparrow$ and $f_\downarrow$ indicate the upsampling and downsampling functions, respectively. $I_u(x)$ is assumed to be the blurred version of $I(x)$. Subsequently, the displacement vector can be extracted by analyzing the relationship between $I(x)$ and $I_u(x)$ as
$$\hat{d}(x) = \arg\min_{d(x)} \left\| I(x + d(x)) - I_u(x) \right\|. \tag{6}$$
Once $\hat{d}(x)$ is obtained, the target upsampled depth map $\hat{D}(x)$ is computed in the same manner:
$$\hat{D}(x + \hat{d}(x)) = D_u(x), \tag{7}$$
where $D_u(x) = f_\uparrow(D(x))$ is the upsampled LR depth map.

B. ANALYSIS
In this Section, we present the theoretical investigation and analysis of the proposed approach via signal modeling.
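Before the formal analysis, the procedure of Section III.A, i.e., blurring the color image by down- and upsampling, estimating the displacement from the color pair, and transferring it to the depth map, can be sketched on a 1D edge. This is a toy sketch under stated assumptions, not the network itself: box-average decimation and linear interpolation stand in for the downsampling and upsampling functions, the displacement is found by inverting a monotonic edge instead of by a learned estimator, and warping is applied in the forward direction (a sample at x moves to x + d(x)). All function names are hypothetical.

```python
import math
import numpy as np

def gaussian_edge(x, sigma, height=1.0):
    """1D edge: a step of the given height smoothed by a Gaussian of std sigma."""
    erf = np.vectorize(math.erf)
    return height * 0.5 * (1.0 + erf(x / (sigma * math.sqrt(2.0))))

def down_up(signal, x, factor):
    """Assumed f_up(f_down(.)): box-average decimation, then linear interpolation."""
    low = signal.reshape(-1, factor).mean(axis=1)
    x_low = x.reshape(-1, factor).mean(axis=1)
    return np.interp(x, x_low, low)

def estimate_displacement(color, color_u, x):
    """Solve color(x + d) = color_u(x) for d; valid for a strictly increasing edge."""
    return np.interp(color_u, color, x) - x

def forward_warp(values, x, d):
    """Move each sample from x to x + d, then resample on the regular grid."""
    return np.interp(x, x + d, values)

factor = 8
x = np.linspace(-3.2, 3.2, 128, endpoint=False)
color = gaussian_edge(x, sigma=0.7)                # HR color edge
depth = gaussian_edge(x, sigma=0.7, height=3.0)    # HR depth edge (ground truth), other scale

color_u = down_up(color, x, factor)                # blurred color image
depth_u = down_up(depth, x, factor)                # upsampled LR depth map

# Drop the borders, where the linear interpolation is clamped and the edge is flat.
core = slice(factor, -factor)
xc, color_c, depth_c = x[core], color[core], depth[core]
color_uc, depth_uc = color_u[core], depth_u[core]

d_hat = estimate_displacement(color_c, color_uc, xc)   # uses the color pair only
depth_hat = forward_warp(depth_uc, xc, d_hat)          # displacement reused on the depth

err_before = np.mean(np.abs(depth_uc - depth_c))
err_after = np.mean(np.abs(depth_hat - depth_c))
print(f"MAE before warping: {err_before:.5f}, after: {err_after:.5f}")
```

The depth edge in this example has a different height than the color edge, yet the displacement estimated from the color pair still restores it, which is the behavior formalized as Property I in the analysis.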
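As in [28], the analysis focuses on modeling the 1D edge profile because edges mainly affect the overall upsampling performance, especially in depth maps. Rather than directly using the ideal step edge model in (2), we generalize it by convolving it with a Gaussian filter as
$$I(x) = \int_{-\infty}^{x} \left[\delta * G_{\sigma_c}\right](t)\, dt, \tag{8}$$
where $G_{\sigma}(t) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-t^2/(2\sigma^2)}$ denotes the Gaussian kernel. Thus, the image edge varies smoothly rather than changing abruptly, and the rate of variation is controlled by a small value $\sigma_c$ (the step edge is the case of $\sigma_c = 0$). When the given image edge is blurred by upsampling, it can be modeled by convolution with another Gaussian filter as
$$I_u(x) = \left[I * G_{\sigma}\right](x) = \int_{-\infty}^{x} \left[\delta * G_{\sqrt{\sigma_c^2 + \sigma^2}}\right](t)\, dt. \tag{9}$$
Similarly, the 1D depth map profile can be obtained as
$$D(x) = \int_{-\infty}^{x} \left[\delta * G_{\sigma_d}\right](t)\, dt, \qquad D_u(x) = \int_{-\infty}^{x} \left[\delta * G_{\sqrt{\sigma_d^2 + \sigma^2}}\right](t)\, dt, \tag{10}$$
where $\sigma_d$ is the edge parameter of the depth map. Based on the 1D modeling, we conclude that the proposed approach can appropriately upsample the depth map owing to the following two properties.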
Property I: The displacement vector is independent of the edge signal scale.
Proof I: In (3), when the signal is scaled by $h$, i.e., the edge signal is formed by a scaled step edge function $hH(x)$, where $H(x)$ is the ideal step edge, its blurred signal is $hI_b(x)$ by the linearity of the convolution operation. Therefore,
$$\arg\min_{d(x)} \left\| hI(x + d(x)) - hI_u(x) \right\| = \arg\min_{d(x)} \left\| I(x + d(x)) - I_u(x) \right\|. \tag{11}$$

Property II: The error caused by replacing the displacement vector in the depth map with that of the color image is approximately proportional to $\sigma_c/\sigma$. Thus, the error reduces when the blurring artifacts are dominant ($\sigma_c \ll \sigma$).

Proof II: As in (9), the edge parameter $\sigma_c$ of the original color signal is changed to $\sqrt{\sigma_c^2 + \sigma^2}$ by blurring. Let the edge parameter for the depth map be $\sigma_d = s\sigma_c$. Then, the edge parameter of the blurred depth signal is $\sqrt{s^2\sigma_c^2 + \sigma^2}$. To determine how strongly a change in the edge parameter influences the displacement vector, we consider the x-directional variation caused by a variation in $\sigma$. We need to find $dx/d\sigma = dI^{-1}/d\sigma$ in (8); however, its closed-form solution cannot be derived. Instead, we approximate (8) by the sigmoid function as in [42]:
$$I(x, \sigma) \approx \frac{1}{1 + e^{-x/(p\sigma)}}, \tag{12}$$
where $p^{-1} = 0.9\sqrt{\pi}$ is a constant. From this equation, with $y = I(x, \sigma)$, we can obtain its inverse function as
$$x = p\sigma \ln\frac{y}{1 - y}. \tag{13}$$
Therefore,
$$\frac{dx}{d\sigma} = p \ln\frac{y}{1 - y}. \tag{14}$$
From (14), the x-directional variation according to $\sigma$ is only a function of $y$. Therefore, we can infer that the optimal value for the displacement vector is proportional to the difference of the two edge parameters, i.e.,
$$d_c = k\left(\sqrt{\sigma_c^2 + \sigma^2} - \sigma_c\right), \qquad d_d = k\left(\sqrt{s^2\sigma_c^2 + \sigma^2} - s\sigma_c\right), \tag{15}$$
where $k$ is a proportional parameter, and $d_c$ and $d_d$ denote the displacements for the color image and the depth map, respectively. Intuitively, $s = 1$ (equivalently $\sigma_c = \sigma_d$) will provide an error-free result. When $s \neq 1$, the relative error, $E_r$, can be computed as
$$E_r = \frac{d_c - d_d}{d_d} = \frac{\left(\sqrt{1 + C^2} - 1\right) - \left(\sqrt{s^2 + C^2} - s\right)}{\sqrt{s^2 + C^2} - s} \approx \frac{s - 1}{C - s}, \tag{16}$$
where $C = \sigma/\sigma_c \gg 1$ (i.e., the blurring parameter for upsampling is significantly larger than the edge parameter) is used for the approximation. Furthermore, $0 \le s < 1$ (i.e., the depth map has a more rapidly varying edge than the color image), and $C \gg s$ in most cases. As a result, we can conclude that $|E_r| \ll 1$. Additionally, this derivation provides a numerical model of the estimated relative error.
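The approximation in Proof II can be checked with a few lines of arithmetic. The sketch below evaluates the relative error between the color-derived and the depth-derived displacement under the proportional model, in which each displacement is proportional to the blurred edge parameter minus the original edge parameter, and compares it with the first-order value (s − 1)/(C − s), where C = σ/σ_c and σ_d = sσ_c. This closed form is reconstructed from the derivation and should be read as an assumption; the function name is hypothetical.

```python
import math

def relative_displacement_error(sigma_c, sigma, s):
    """Relative error of reusing the color displacement for a depth edge with sigma_d = s * sigma_c.

    Assumes each displacement is proportional to
    (blurred edge parameter - original edge parameter) with a shared factor k,
    so that k cancels in the ratio.
    """
    d_color = math.sqrt(sigma_c**2 + sigma**2) - sigma_c
    d_depth = math.sqrt((s * sigma_c)**2 + sigma**2) - s * sigma_c
    return (d_color - d_depth) / d_depth

sigma_c, s = 1.0, 0.5
for sigma in (2.0, 5.0, 10.0, 20.0):
    C = sigma / sigma_c
    exact = relative_displacement_error(sigma_c, sigma, s)
    approx = (s - 1.0) / (C - s)
    print(f"C = {C:5.1f}: exact = {exact:+.4f}, first-order = {approx:+.4f}")
```

The magnitude of the error shrinks roughly as 1/C, which matches the statement of Property II that the error is approximately proportional to σ_c/σ.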

C. NETWORK ARCHITECTURE
The function blocks of the proposed system are presented in Fig. 2 and are implemented as deep networks. The left network, called the displacement network, seeks the displacement vectors from the HR color image. The right network, called the fusion network, attempts to reconstruct the depth map using the transferred displacement vector. Both networks are concatenated and trained end-to-end. Additional details on the network design are presented below.

1) DISPLACEMENT NETWORK
The displacement network is designed to estimate the displacement vector at each pixel position. As stated in Section II, signal blurring is assumed to be the outward shift of pixels from the true edge. Obtaining the displacement vectors between the sharp and blurred images is similar to obtaining an optical flow between two views. Therefore, we adopt the FlowNetS structure [6], a state-of-the-art optical flow network, as shown in Fig. 3(a).
As stated in Section II, the previous approaches in [20], [28] attempted to localize the true edge before applying grid warping. This is because the edge location determines the image warping direction around the true edge and significantly affects the overall performance. However, the proposed approach does not involve this constraint because the corresponding color image is used as a reference, thereby sufficiently guiding the deformation direction. Here, we assume the color and depth maps to be perfectly aligned. However, slight misalignments can also be managed owing to the multiresolution architecture of the network.

2) FUSION NETWORK
The fusion network consists of three steps: feature extraction, feature warping, and reconstruction, as presented in Fig. 3(b).
In the feature extraction part, we adopt an architecture similar to the encoder of an autoencoder. It consists of five convolutional layers. The first two convolutional layers use ordinary convolution, and the remaining three layers employ strided convolution to extend the receptive field with a small number of network parameters. The input of this network is the upsampled depth map $D_u$, and 64 feature maps are extracted at each resolution. During the feature warping step, the extracted feature maps are shifted by the displacement vector obtained from the displacement network. In this process, the spatial transformer network is adopted to realize the shifting operation, because it can represent various operations including scaling, cropping, rotation, and non-rigid deformations [14]. It can also be trained with standard backpropagation, which makes the proposed system end-to-end trainable. Using the spatial transformer network, feature maps are shifted to align the center edge position at each resolution level.
During the reconstruction step, the network fuses the warped feature maps to reconstruct the HR depth map. This network has an architecture similar to the decoder of an autoencoder and, like the feature extraction part, contains five convolutional layers. From the lowest to the highest resolution, the warped feature maps are upsampled via bilinear interpolation and concatenated with the feature map at the next resolution; they are then sequentially convolved in each convolutional layer. In the conventional autoencoder network, skip connections and feature map concatenation are commonly used for fast training and to avoid vanishing gradients, especially when the network is considerably deep. However, note that the proposed method does not employ these techniques in the middle of the reconstruction network, because the feature maps extracted in the encoder are deformed by warping; therefore, skip connections would degrade the performance.
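The feature warping step can be illustrated with a bilinear sampler of the kind used inside a spatial transformer. The following NumPy sketch is a stand-in, not the authors' implementation: it samples backward (the output at a pixel is read from the input at that pixel plus the displacement), stores the displacement as (dx, dy) in pixels, and clamps coordinates at the border. These conventions and the function name are assumptions. In PyTorch, a comparable differentiable sampler is `torch.nn.functional.grid_sample`, which expects normalized coordinates; whether the authors used it is not stated.

```python
import numpy as np

def warp_features(feat, flow):
    """Bilinearly resample feature maps at displaced positions.

    feat: array of shape (C, H, W).
    flow: array of shape (2, H, W) holding (dx, dy) in pixels.
    Returns out with out[:, y, x] = feat sampled at (y + dy, x + dx),
    where sampling positions are clamped to the image border.
    """
    c, h, w = feat.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    sx = np.clip(xs + flow[0], 0, w - 1)
    sy = np.clip(ys + flow[1], 0, h - 1)

    # Integer corners surrounding each sampling position.
    x0 = np.floor(sx).astype(int)
    y0 = np.floor(sy).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx = sx - x0
    wy = sy - y0

    # Interpolate along x on the two rows, then along y.
    top = feat[:, y0, x0] * (1.0 - wx) + feat[:, y0, x1] * wx
    bottom = feat[:, y1, x0] * (1.0 - wx) + feat[:, y1, x1] * wx
    return top * (1.0 - wy) + bottom * wy

feat = np.arange(2 * 3 * 4, dtype=np.float64).reshape(2, 3, 4)
flow = np.zeros((2, 3, 4))
flow[0] = 1.0                      # read every pixel from its right-hand neighbor
warped = warp_features(feat, flow)
```

Because every operation is a weighted sum of input values, the sampler is differentiable with respect to both the features and the displacement, which is what allows the two networks to be trained end-to-end.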

3) LOSS FUNCTION
For network training, the L1 norm is used. For a given network output $D_p$ and the ground-truth depth map $D_{gt}$, the loss function can be formulated as
$$\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \sum_{j} \left| D_p^{(i)}(j) - D_{gt}^{(i)}(j) \right|, \tag{17}$$
where $N$ and $j$ are the number of training samples and the pixel position, respectively.

IV. EXPERIMENT
In this Section, the implementation details and test datasets are introduced. Subsequently, the proposed method is compared with various state-of-the-art depth map upsampling methods, quantitatively and visually. Furthermore, we conducted extensive experiments to further analyze the proposed method in various situations.

A. IMPLEMENTATION DETAILS
Similar to [17], we trained a separate network for each of the scale factors 2, 4, 8, and 16, each with random initialization. We used the Adam optimizer with $\beta_1 = 0.9$ and $\beta_2 = 0.999$. The learning rate starts at $10^{-4}$ and is divided by 5 every 5 epochs. The batch size is 1. Our experiments were performed on an Ubuntu operating system. We trained each network using a GTX 1080Ti GPU card. The network is trained in an end-to-end manner using PyTorch [31].
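The learning-rate schedule described above can be written as a simple step decay. The sketch below is one reading of the description: the epoch index is assumed to start at 0 and the first division is assumed to occur after five completed epochs. The function name is hypothetical.

```python
def learning_rate(epoch, base_lr=1e-4, drop=5.0, step=5):
    """Step decay: divide base_lr by `drop` once per `step` completed epochs (epoch is 0-based)."""
    return base_lr / (drop ** (epoch // step))

for epoch in (0, 4, 5, 10, 15):
    print(epoch, learning_rate(epoch))
```

In PyTorch, the same schedule could presumably be expressed with `torch.optim.lr_scheduler.StepLR` using `step_size=5` and `gamma=0.2`; whether the authors used that scheduler is not stated.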

B. DATASETS
For a fair comparison, we collected four popular datasets. The first two datasets were divided into training and evaluation sets. The other two were only used for evaluation. The details on the datasets are given below:
1) Sintel dataset [2]. This is a computer graphics video with fine textures. It provides color and depth map pair video sequences. Each sequence consists of either 50 or 40 frames. The resolution of the sequences is 1024 × 438. The color image and the depth map are well aligned. A total of 1000 color-depth image pairs are used as the training dataset, and a total of 300 color-depth image pairs as the testing dataset.
2) NYU v2 dataset [34]. This dataset consists of RGB/D image pairs captured with the Microsoft Kinect. The resolution of the image pairs is 640 × 480. We split the image pairs into 800 training pairs and 600 testing pairs.
3) Lu dataset [26]. Six RGB/D image pairs were provided. The resolution of the Lu dataset is 640 × 480. They were acquired using the ASUS Xtion Pro camera. This dataset was only used for evaluation.

C. QUANTITATIVE EVALUATION
For the evaluation of the proposed method, several leading upsampling schemes are selected for comparison, including model-based filtering, optimization, and deep learning-based methods. As mentioned in the previous Section, we evaluate our model using four different evaluation datasets, which have different resolutions and different color-depth alignment quality.
Tables 1 and 2 present the average root mean square error (RMSE) and mean absolute error (MAE) for each scheme, respectively. The lowest RMSE and MAE values are presented in bold red, and the second lowest are presented in blue. The proposed method achieves the best performance for almost all test cases in terms of both RMSE and MAE. In the only exception, it shows the second-best performance with a negligible difference. For the upsampling factor of four, DKN is comparable with the proposed method, whereas the proposed method outperforms it for all other upsampling factors.
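For reference, the two metrics can be computed per image as in the following sketch. The depth unit, any masking of invalid pixels, and the way per-image values are averaged over a dataset are not specified here, so the sketch simply averages over all pixels of one image; the function names are hypothetical.

```python
import numpy as np

def rmse(pred, gt):
    """Root mean square error over all pixels of one depth map."""
    diff = np.asarray(pred, dtype=np.float64) - np.asarray(gt, dtype=np.float64)
    return float(np.sqrt(np.mean(diff ** 2)))

def mae(pred, gt):
    """Mean absolute error over all pixels of one depth map."""
    diff = np.asarray(pred, dtype=np.float64) - np.asarray(gt, dtype=np.float64)
    return float(np.mean(np.abs(diff)))

pred = np.array([[1.0, 2.0], [3.0, 5.0]])
gt = np.array([[1.0, 2.0], [3.0, 1.0]])
print(rmse(pred, gt), mae(pred, gt))
```

RMSE penalizes a few large errors, such as blurred depth discontinuities, more strongly than MAE, which is why the two tables can rank methods slightly differently.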
For the computational complexity, we measure the runtime for the deep learning-based schemes of DJF, PAC, DKN, and the proposed method on the same machine. The proposed scheme took an average of 15 ms for the Middlebury datasets, which was slower than DJF and PAC, but significantly faster than DKN.
Finally, we investigated how much the proposed grid-warping-based scheme could improve the upsampling performance. To evaluate the potential of the proposed approach, we replaced the inputs of the displacement network, $I$ and $I_u$, with $D_u$ and $D_{gt}$, and the network was trained with the same approach. This experiment assumed that ground-truth displacement vectors could be derived from $D_u$ and $D_{gt}$. As shown in Table 4, the warping-based approach was able to recover almost error-free upsampled depth maps. This shows that there is still room for improvement, especially for large scale factors.
[Figure caption, truncated in the source: ... [32], EGF [12], CJGF [40], DJF [21], PAC [36], DBPN [9], DKN [17], Ours, and ground-truth (from left to right).]

D. VISUAL COMPARISON
We examined the effectiveness of the proposed warping-based scheme in handling challenging cases in joint depth upsampling. As shown in Fig. 4, we chose a few complicated regions from various images and compared the performance in the zoomed images. In most cases, our scheme maintained a sharper depth edge than the other schemes. For example, as shown in the first row of Fig. 4, the proposed method achieved the sharpest depth edge while preserving the overall shape of the object well. In the case of multiple overlapping objects in the fifth row, each object is clearly separated by the proposed scheme. We also tested our scheme in a more challenging situation, namely low-resolution images. In general, upsampling a low-resolution image is more challenging than upsampling a high-resolution image, because a low-resolution image contains more complicated patterns than a high-resolution image within a region of the same size. Although we did not train our model using low-resolution datasets, the proposed scheme achieved sharper depth edges than the other schemes, as shown in the second and third rows of Fig. 5. However, when small objects disappeared during downsampling, as shown in the last row of Fig. 4, the proposed method, like the other schemes, was unable to sufficiently restore the depth map. Another weakness of our scheme is its dependency on the color image. When the proposed method restores the edge of a depth map, it tends to mimic the shape of the color image rather than that of the depth map. For example, although the visual results in Fig. 4 are predicted by a single model, the sharpness of each result differs. This phenomenon is clearer when comparing a real image with a computer graphics image. As shown in Fig. 4, the second-row result is much more blurred than the third-row result, because the color image in the third row is much sharper than that in the second row.

E. DISPLACEMENT VECTOR ANALYSIS
In this Section, we visualize the displacement vector field to verify the assumption of grid warping. The displacement vector field is displayed in Fig. 6 in the same manner as the conventional optical flow visualization. Overall, displacement vectors were formed around the edge, in directions orthogonal to the edge and opposite to each other on its two sides. These two observations are consistent with our hypothesis. However, some unexpected chessboard patterns were observed. We attribute these patterns to the transposed convolution operation that was used to increase the resolution of feature maps. It was reported that the transposed convolution often causes chessboard patterns in the output image owing to uneven overlaps [29]. To further confirm the effect of the transposed convolution, we replaced the transposed convolution layer with bilinear interpolation. As shown in Fig. 6, the modified model with bilinear interpolation yielded pattern-free displacement vector fields. Nevertheless, we retain the transposed convolution owing to its higher performance.
FIGURE 6. Displacement vector field visualization (scale 8×): ground-truth depth map, displacement vector field with transposed convolution, and its pattern-free vector field without transposed convolution (from left to right).

F. COMPLEXITY COMPARISON
In this Section, a more detailed complexity analysis and comparison are conducted. We measure the run-time, the number of parameters, and the number of floating-point operations (FLOPs), where the FLOPs are measured with an upsampled 256 × 256 input depth map. As shown in Table 5, the proposed scheme has the largest number of parameters among the compared schemes, approximately 30 times more than DKN. On the other hand, the proposed scheme requires only about 60% of the FLOPs of DKN. This is because the proposed scheme processes feature maps at lower resolutions, which results in fewer operations. For further analysis of the proposed scheme, the complexity of each network is measured. As reported in Table 6, the displacement network accounts for 97% of the total number of parameters, whereas the fusion network performs more FLOPs. This is because the fusion network processes more feature maps at higher resolutions.
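The relationship between parameters and operations noted above can be made concrete with the cost of a single convolutional layer. The sketch below counts parameters and multiply-accumulate operations (MACs) of one k × k convolution with bias; counting conventions vary (one MAC is often counted as two FLOPs), and the convention used for Tables 5 and 6 is not stated, so the numbers are illustrative only. The function name and the layer sizes are hypothetical.

```python
def conv2d_cost(c_in, c_out, kernel, out_h, out_w):
    """Return (parameters, multiply-accumulate operations) of one k x k convolution with bias."""
    weights = kernel * kernel * c_in * c_out
    params = weights + c_out               # one bias per output channel
    macs = weights * out_h * out_w         # every weight is used once per output position
    return params, macs

full = conv2d_cost(64, 64, 3, 256, 256)    # layer applied at full resolution
half = conv2d_cost(64, 64, 3, 128, 128)    # same layer applied at half resolution
print("parameters:", full[0], half[0])
print("MAC ratio full/half:", full[1] / half[1])
```

The parameter count is independent of the feature map size, whereas the operation count scales with the number of output positions; a network that holds most of its parameters in low-resolution layers can therefore be large yet comparatively cheap to run.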

G. NOISY ENVIRONMENT EVALUATION
In this Section, we consider more challenging test scenarios for practical applications. First, we tested the case in which the edge of the color image is more blurred. In the proposed system, the computation of the displacement vector relies heavily on the quality of the HR color image, and the performance could degrade when its quality is low. We assume that the given image is distorted by blurring, which is simply modeled via Gaussian blurring with $\sigma_b$. As shown in Table 7, the blurred color image slightly decreases the performance. For the weakly blurred case (e.g., $\sigma_b \le 2$), the performance degradation of the proposed system was small compared to other deep learning-based joint upsampling methods. However, the performance worsens when $\sigma_b$ is large. This agrees with the mathematical modeling in Section III.B, i.e., the overall performance degradation is still small for $\sigma \gg \sigma_b$, whereas the performance worsens for larger $\sigma_b$. It is also noteworthy that the proposed method is slightly more robust to a blurry reference image than the other joint upsampling methods.
Second, we tested the case of color-depth misalignment. Most joint depth map upsampling schemes assume that the color image and the depth map are well aligned. However, this is not true in some practical situations. To generate the misalignment, we shifted the color images in the test dataset by $n$ pixels in the horizontal and vertical directions. As shown in Table 8, the performance of the proposed model does not degrade much for a small pixel shift, but the performance drop becomes larger for a large misalignment, as is the case for all other joint upsampling methods. We attribute this to the fact that slight misalignment can be managed by the multiresolution structure, whereas misuse of the displacement vector under large misalignment is unavoidable, which results in a large performance degradation.
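The blurred reference image used in this test can be generated as in the following sketch, which applies a separable Gaussian filter with standard deviation σ_b. The kernel truncation at three standard deviations and the edge-replicating border are assumptions, since the exact blurring implementation is not specified; the function names are hypothetical.

```python
import numpy as np

def gaussian_kernel(sigma, radius=None):
    """Normalized 1D Gaussian kernel; the 3-sigma truncation is an assumed default."""
    if radius is None:
        radius = max(1, int(np.ceil(3.0 * sigma)))
    t = np.arange(-radius, radius + 1, dtype=np.float64)
    k = np.exp(-t**2 / (2.0 * sigma**2))
    return k / k.sum()

def gaussian_blur(image, sigma):
    """Separable Gaussian blur of an (H, W) or (H, W, C) image with edge replication."""
    k = gaussian_kernel(sigma)
    r = len(k) // 2
    out = np.asarray(image, dtype=np.float64)
    for axis in (0, 1):                    # filter rows, then columns
        pad = [(0, 0)] * out.ndim
        pad[axis] = (r, r)
        padded = np.pad(out, pad, mode="edge")
        out = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), axis, padded)
    return out

step = np.zeros((6, 20))
step[:, 10:] = 1.0                         # vertical step edge as a toy color image
blurred = gaussian_blur(step, sigma=1.5)
```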
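The misalignment can be simulated by translating the color image while leaving the depth map untouched, as in the sketch below. The shift direction (right and down) and the border handling (edge replication) are assumptions, because neither is specified in the description; the function name is hypothetical.

```python
import numpy as np

def shift_image(image, n):
    """Shift an (H, W) or (H, W, C) image by n >= 0 pixels right and down, replicating the border."""
    img = np.asarray(image)
    if n == 0:
        return img.copy()
    h, w = img.shape[:2]
    # Pad n rows on top and n columns on the left, then crop back to the original size.
    pad = [(n, 0), (n, 0)] + [(0, 0)] * (img.ndim - 2)
    return np.pad(img, pad, mode="edge")[:h, :w]

img = np.arange(20).reshape(4, 5)
shifted = shift_image(img, 1)
```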

H. JOINT SALIENCY MAP UPSAMPLING
In this Section, we explore how effectively the proposed scheme can be applied to saliency map upsampling [38]. To measure the accuracy of the upsampled saliency map, we use the RMSE and the structure-measure (S-measure) metric [7]. Here, the S-measure evaluates the region-aware and object-aware structural similarity between the ground truth and the predicted upsampled saliency map. The DUT-OMRON dataset is used for the evaluation. As reported in Table 9, the proposed scheme shows the best performance among the joint deep learning-based approaches in terms of RMSE (lower is better) and S-measure (higher is better). Fig. 7 shows the visual comparisons. We observe that fine details and structures are better preserved than with the other schemes.
FIGURE 7. Visual comparisons of 16× upsampled saliency map results: Color image, DJF [21], PAC [36], DKN [17], Ours, and ground-truth (from left to right).

V. CONCLUSION
In this paper, we proposed a novel depth map upsampling technique based on image warping. The displacement vector for the image deformation was computed from the corresponding HR color information, which is the major contribution of the study. Furthermore, we provided theoretical edge signal modeling to verify the robustness of the proposed approach. As a result, the proposed scheme outperformed model-based approaches and exhibited the best performance among state-of-the-art deep learning-based schemes in terms of RMSE and MAE. The visual results also validate the superiority of the proposed scheme. In addition, more detailed experiments were provided to analyze the proposed method in various situations. However, the performance of the proposed method relies on the similarity between the color images and the depth maps. This limitation will be addressed in future work.