Multi-Frame Depth Super-Resolution for ToF Sensor With Total Variation Regularized L1 Function

In this paper, we propose a multi-frame depth super-resolution (SR) method based on <inline-formula> <tex-math notation="LaTeX">$L_{1}$ </tex-math></inline-formula> data fidelity with the total variation regularization (TV-<inline-formula> <tex-math notation="LaTeX">$L_{1}$ </tex-math></inline-formula>) model. The majority of time-of-flight (ToF) sensors exhibit limited spatial resolution compared to RGB sensors, and improving the depth image resolution is an inherently ill-posed problem. To overcome this under-determined problem, the solution space is limited by the regularization term through prior knowledge and by the data fidelity term using statistical information of the noise. Firstly, the statistical characteristics of ToF depth images are analyzed to specify an appropriate observation model. Thereafter, the objective function for multi-frame depth SR based on the TV-<inline-formula> <tex-math notation="LaTeX">$L_{1}$ </tex-math></inline-formula> model is designed by considering the prior knowledge of the depth images. This approach enables the sharpness of the edges to be preserved and the noise amplification to be suppressed simultaneously. Furthermore, an efficient solver based on half-quadratic splitting is proposed. The algorithm minimizes the objective function for the multi-frame SR problem consisting of the TV regularization term and the <inline-formula> <tex-math notation="LaTeX">$L_{1}$ </tex-math></inline-formula> data fidelity term. The proposed method is verified on a synthetic dataset and real-world data acquired from a ToF sensor. The experimental results demonstrate that the proposed method can reconstruct high-resolution depth images that preserve sharp depth discontinuities without obvious artifacts, and that it increases robustness to noise.


I. INTRODUCTION
Accurate depth (distance) measurement is an essential requirement in image processing and computer vision. In recent years, the demand for accurate depth information has increased in numerous applications, such as image enhancement [1], virtual reality [2], robotics [3], and autonomous driving [4]. In particular, face recognition technology that utilizes depth information [5] has been adopted in traditional business models to provide access control in various fields (e.g., finance systems or identity verification). Although the significance of depth images is rapidly increasing, several artifacts are still frequently observed. For example, integrated detectors or sensors are usually disturbed by visible light from the background, and certain devices suffer from interference between waves during depth measurements. Moreover, depth images exhibit the additional problem of low quality because depth measurement incurs high costs in terms of complexity or hardware implementation. (The associate editor coordinating the review of this manuscript and approving it for publication was Michele Nappi.)
Various methods for depth sensing have been developed over the past several decades. The structured light scanner can measure the distances to objects using projected light patterns [6]. Light scanners can measure accurate distances in certain environments, such as those with static objects. Nonetheless, these scanners generally capture distorted depth images because the projection onto the surfaces of moving objects is typically biased. A method often referred to as stereo photography uses a stereo camera, a type of camera with two or more lenses, each with a separate image sensor and independent optics, which enables the camera to imitate the human binocular vision system. However, stereo cameras require substantial computation to estimate the correspondence (disparity) between the two independent sensors. Furthermore, although the performance of stereo matching algorithms has improved, stereo cameras may exhibit limitations in environments under low-light conditions or with severe noise.
The time-of-flight (ToF) is the measurement of the time taken by an electromagnetic wave to travel a specific distance to an object. ToF cameras employ the ToF technique to resolve the distance between the camera and objects for each point of the depth image. These cameras measure the time required by near-infrared (NIR) light to reach the object and then return to the camera [7]. The transmitter emits NIR light modulated at a specific frequency, and the receiver detects the phase change with respect to time. That is, the phase difference between the emitted and the reflected signals can easily be measured to calculate the flight time. Although the time can be calculated based on simple phase detection in this manner, the majority of ToF cameras have a limited spatial resolution because ToF sensors experience difficulties in scaling the sensor architecture to large arrays owing to the size of the on-chip time-to-digital converters required [8].
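The phase-to-distance relation described above can be made concrete with a short sketch. For a continuous-wave ToF sensor, a phase shift of 2π corresponds to one modulation wavelength of round-trip travel, giving d = c·Δφ/(4πf). The function below is an illustrative sketch (not code from this paper); the 20 MHz modulation frequency is an assumed example value.

```python
import math

def tof_distance(phase_shift_rad, mod_freq_hz, c=299_792_458.0):
    """Distance from the phase shift of a continuous-wave ToF signal.

    A phase shift of 2*pi equals one modulation wavelength of ROUND-TRIP
    travel, so the one-way distance is d = c * dphi / (4 * pi * f).
    """
    return c * phase_shift_rad / (4.0 * math.pi * mod_freq_hz)

# A quarter-cycle phase shift at an assumed 20 MHz modulation frequency:
d = tof_distance(math.pi / 2, 20e6)  # about 1.87 m
```

Note the unambiguous range of such a sensor is c/(2f); beyond that, the measured phase wraps around, which is one reason modulation frequencies are kept moderate.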
Numerous attempts have been made to overcome the physical limitations of ToF sensors, such as their limited spatial resolution. Algorithms for improving the depth image resolution of ToF sensors can be categorized into three groups, which are described as follows. The first approach is single-frame depth upsampling, which aims to recover a high-resolution (HR) image from a single low-resolution (LR) image [9], [10]. Such methods generally use the statistical characteristics inside the depth image or those of the training images. In recent years, numerous training images have been employed to improve single-frame depth upsampling. For example, deep learning methods, which are used extensively to model the mapping from LR patches to HR patches using a convolutional neural network (CNN), have achieved remarkable success in natural image processing [11]. However, single-frame depth upsampling methods exhibit limited performance owing to issues such as a loss of details and jagging artifacts compared to the methods in the other two categories, because they do not employ additional hardware, extra sensors, or alignment.
The second approach employs an intensity image (e.g., an RGB image) whose field of view is the same as that of the ToF camera as additional information. The benefit of this additional input stems from the correlation between the depth and intensity discontinuities. These techniques assume that the edges of the intensity image coincide with the edges (or discontinuities) of the depth image [12]- [17]. As the additional intensity image is typically captured at a higher resolution, its discontinuities are used to locate the depth discontinuities at a higher resolution than the LR depth image. However, in practice, this approach tends to introduce serious artifacts such as inaccurate depth, depth bleeding, and texture copying artifacts, because the intensity and depth discontinuities do not always coincide.
To avoid artifacts from the intensity image, methods for increasing the spatial resolution only from the depth data have been proposed. The third approach restores the high-frequency information from multiple LR depth images. This approach originates from the classical super-resolution (SR) theories in image processing, and it is assumed that aliasing is always present in the LR images [18]. Hardie et al. have formulated a joint approach for reconstruction and registration using Tikhonov regularization [19]. Farsiu et al. have introduced a fast and robust scheme based on bilateral total variation (BTV) regularization with L 1 data fidelity [20]. An algorithm to solve the nonlinear total variation (TV) based SR problem for digital video has been proposed in [21]. Babacan et al. have utilized the quadratic approximation of the TV prior based on a variational Bayesian formulation [22]. A combination of the sparse and non-sparse priors has been introduced in [23].
In terms of increasing the resolution of the ToF sensor, a technique using multi-frame depth images has been developed by extending the classical SR method for existing cameras [24]. Recently, a method based on the TV prior model has been proposed to preserve the overall sharpness of depth images [25]. However, the multi-frame depth SR techniques for ToF sensors presented in [24] and [25] do not reflect the depth image characteristics, and do not consider problems such as oversmoothing and noise amplification. Therefore, we propose a method based on the L 1 data fidelity with the TV regularization (TV-L 1 ) model to overcome the limitations of other methods in terms of preserving the sharp depth discontinuities and suppressing the noise amplification.
In this paper, a multi-frame depth SR approach is proposed to improve the spatial resolution of the depth image. Our approach can be classified into the third category: multiple LR depth images are used to reconstruct a single HR depth image. The main contributions of this study are as follows:

• We analyze the statistical characteristics of ToF depth images.
• We design an objective function for multi-frame depth SR based on the TV-L 1 model which is robust to noise and produces sharp depth discontinuities. The experimental results verify that the TV-L 1 model is more robust than the L 2 data fidelity with the Tikhonov regularization (Tikhonov-L 2 ) and the L 2 data fidelity with the TV regularization (TV-L 2 ) models.
• We propose an efficient solver based on half-quadratic splitting. The experimental results show that our method is more reliable than previous methods using a similar objective function as well as other non-similar ones.

The remainder of this paper is organized as follows. Section II presents related works on multi-frame depth SR and the TV-L 1 approach for deconvolution and denoising. In Section III, the analysis of the characteristics of ToF images is discussed and the proposed method based on the TV-L 1 model is described in detail. Section IV presents the experimental results. Finally, conclusions are outlined in Section V.

FIGURE 1. Example illustrations of multi-frame depth SR: The HR depth image is reconstructed using multiple LR depth images with different sub-pixel registrations. We present a comparison with the conventional method based on the Tikhonov-L 2 model (top). Note that the conventional method oversmooths the sharp edges and exhibits noise amplification, whereas the proposed method based on the TV-L 1 model (bottom) preserves sharp depth discontinuities without any obvious artifacts and increases the robustness to noise.

II. RELATED WORK
In this section, we present an observation model for LR depth images. Firstly, we briefly review multi-frame depth SR methods, which are related to our method. Thereafter, we introduce an image restoration algorithm based on the TV-L 1 approach. Observed LR depth images are employed to restore the HR depth image.

A. IMAGE OBSERVATION MODEL
SR image reconstruction is the process of estimating an original image from multiple degraded images, with the aim of overcoming the physical limitations of optical and sensor systems. During the process of obtaining depth images, a natural loss of spatial resolution occurs as a result of aliasing owing to the limited spatial resolution of the sensor, optical blur owing to the physical limitations of the optical system, and sensor noise. The observation model can be described as follows:

y_k = D B M_k x + n_k = W_k x + n_k, (1)

where y_k is the kth observed LR depth image, D is a downsampling matrix, B is a blurring matrix, M_k is a warping matrix, x is the HR depth image, and n_k is the corresponding additive noise [18]. This series of degradations is summarized as the single matrix W_k in this study. In the observation model of Eq. (1), the task of restoring x from y_k is an inverse problem that is solved by an optimization process, as follows:

x̂ = arg min_x Σ_{k=1}^{K} { D(y_k − W_k x) + λ_k R(x) }, (2)

where D(·) is the data fidelity term, R(·) is the regularization term for incorporating prior information, and λ_k is the regularization parameter controlling the trade-off between the two terms. In this formulation, K denotes the number of LR images used in the reconstruction. The process of Eq. (2), which is also known as multi-frame SR and is illustrated in Fig. 1, simultaneously performs registration and inverse filtering to obtain the HR image.
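The degradation chain of Eq. (1) can be sketched numerically. The snippet below is an illustrative simulation, not code from the paper: the warp M_k is simplified to an integer circular shift (a true sub-pixel warp would require interpolation), and the blur B is an assumed 3×3 box filter standing in for the sensor PSF.

```python
import numpy as np

def degrade(x, dx, dy, scale=4, noise_sigma=0.01, rng=None):
    """Simulate one LR observation y_k = D B M_k x + n_k.

    M_k: integer translation via circular shift (sub-pixel warps would
    need interpolation); B: separable 3-tap box blur as a stand-in PSF;
    D: decimation by `scale`; n_k: additive Gaussian noise.
    """
    rng = np.random.default_rng() if rng is None else rng
    warped = np.roll(np.roll(x, dy, axis=0), dx, axis=1)          # M_k x
    blurred = warped.copy()                                        # B M_k x
    for ax in (0, 1):
        blurred = (np.roll(blurred, 1, ax) + blurred + np.roll(blurred, -1, ax)) / 3.0
    lr = blurred[::scale, ::scale]                                 # D B M_k x
    return lr + noise_sigma * rng.standard_normal(lr.shape)        # + n_k

x = np.random.default_rng(0).random((64, 64))
# Four LR frames with distinct shifts, as multi-frame SR requires:
lr_frames = [degrade(x, dx, dy) for dx, dy in [(0, 0), (1, 0), (0, 1), (1, 1)]]
```

Each frame carries slightly different aliased content because of its distinct shift, which is exactly the information a multi-frame SR method exploits.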

B. MULTI-FRAME DEPTH SR
The data fidelity term is often derived from the negative log-likelihood; that is, D(y_k − W_k x) = − log P(y_k | W_k x), and it is usually associated with the noise model. In the general case, where n_k is assumed to be additive white Gaussian noise, the data fidelity term utilizes the L 2 -norm, and the corresponding inverse problem is expressed as follows:

x̂ = arg min_x Σ_{k=1}^{K} { ||y_k − W_k x||_2^2 + λ_k R(x) }. (3)

The data fidelity term with the L 2 -norm is frequently used as a standard approach in regression analysis, where it is known as the method of least squares. The regularization term is adopted to stabilize the inverse problem by confining the set of feasible solutions. In early years, Tikhonov regularization was representative of smoothing constraints, which assume that the distribution of the image gradients is smooth [18]. This regularization is constructed by employing the L 2 -norm and is represented as follows:

R(x) = ||Γx||_2^2, (4)

where Γ denotes the Tikhonov matrix, which is usually constructed from a high-pass filter (e.g., the Laplacian) in image restoration. This regularization strategy, also known as L 2 -regularization [26], has been applied extensively because fast transforms can be developed with low computational costs [27].
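Because both terms of the Tikhonov-regularized problem are quadratic, the minimizer follows directly from the normal equations. The following 1-D toy example is an illustrative sketch (the decimation operators and parameter values are assumptions, not taken from the paper):

```python
import numpy as np

def tikhonov_l2(Ws, ys, lam, n):
    """Minimize sum_k ||y_k - W_k x||_2^2 + lam * ||C x||_2^2 for a 1-D signal x.

    C is the discrete 1-D Laplacian; the quadratic objective is solved
    via its normal equations (sum_k W_k^T W_k + lam C^T C) x = sum_k W_k^T y_k.
    """
    C = (np.diag(-2.0 * np.ones(n))
         + np.diag(np.ones(n - 1), 1)
         + np.diag(np.ones(n - 1), -1))
    A = lam * C.T @ C
    b = np.zeros(n)
    for W, y in zip(Ws, ys):
        A += W.T @ W
        b += W.T @ y
    return np.linalg.solve(A, b)

rng = np.random.default_rng(1)
x_true = np.linspace(0.0, 1.0, 8)
Ws = [np.eye(8)[::2], np.eye(8)[1::2]]          # two complementary decimations
ys = [W @ x_true + 0.01 * rng.standard_normal(4) for W in Ws]
x_hat = tikhonov_l2(Ws, ys, lam=0.1, n=8)
```

The combined operator Σ W_k^T W_k + λ C^T C is positive definite here, so the solution is unique, which is precisely the stabilizing effect the regularization term provides.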
The objective function, in which both the data fidelity and regularization terms employ the L 2 -norm, can be formulated as:

x̂ = arg min_x Σ_{k=1}^{K} { ||y_k − W_k x||_2^2 + λ_k ||Cx||_2^2 }, (5)

where C denotes the discrete Laplacian operator. The Tikhonov-L 2 model has become popular in various applications. For example, Hardie et al. formulated the multi-frame SR reconstruction method by minimizing the Tikhonov-L 2 objective function [19]. In terms of increasing the depth resolution, Schuon et al. proposed multi-frame-based depth SR with Eq. (5) as an optimization framework [24]. One of the most successful regularization techniques for image restoration is TV, in which the gradient magnitude of the image is assumed to be sparse. Osher et al. introduced a TV model for regularization, and proposed a denoising and deblurring algorithm based on this regularization strategy [28], [29]. Wang et al. proposed a minimization algorithm using the TV strategy [30]. TV regularization has become a common regularization strategy in image restoration because of its special characteristics. One major problem in image restoration is the ringing artifacts that appear near sharp transitions. Tikhonov regularization suffers from ringing artifacts owing to its assumption of smooth gradients. Unlike Tikhonov regularization, TV regularization preserves edge information more effectively, and the regularization term based on TV is expressed as follows:

R_TV(x) = ||∇x||, (6)

where ∇ denotes the first-derivative linear operator. The norm || · || denotes the L 2 -norm if the TV regularization is isotropic and the L 1 -norm if it is anisotropic. However, we only deal with the anisotropic case, || · || = || · || 1 , because the treatment of the isotropic case is very similar. The TV regularization strategy is generally combined with the data fidelity term using the L 2 -norm, which is usually denoted as the TV-L 2 model.
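A small sketch (illustrative only) shows why anisotropic TV favors edges: a sharp step and a smooth ramp with the same total rise pay the same TV penalty, so TV, unlike a quadratic smoothness prior, does not prefer to smear the step.

```python
import numpy as np

def anisotropic_tv(x):
    """Anisotropic TV: the L1 norm of horizontal and vertical forward differences."""
    dh = np.abs(np.diff(x, axis=1)).sum()
    dv = np.abs(np.diff(x, axis=0)).sum()
    return dh + dv

step = np.zeros((8, 8)); step[:, 4:] = 1.0            # one sharp edge per row
smooth = np.tile(np.linspace(0.0, 1.0, 8), (8, 1))    # same total rise, smeared

tv_step = anisotropic_tv(step)      # 8.0: a jump of 1 in each of 8 rows
tv_smooth = anisotropic_tv(smooth)  # also ~8.0: TV sums the same total variation
```

A quadratic penalty Σ|∇x|² would instead score the step far higher than the ramp (1 vs. 8·(1/7)² per row), which is exactly the oversmoothing bias of Tikhonov regularization.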
The objective function, in which the regularization term employs TV regularization, can be formulated as follows:

x̂ = arg min_x Σ_{k=1}^{K} { ||y_k − W_k x||_2^2 + λ_k ||∇x||_1 }. (7)

Ng et al. formulated the multi-frame SR reconstruction method by minimizing the TV-L 2 objective function [21], which was recently applied to multi-frame depth SR [25]. This method is effective in suppressing ringing artifacts and preventing sharp edges from being oversmoothed.

C. TV-L 1 APPROACH FOR DECONVOLUTION AND DENOISING
The theoretical approach for image restoration begins with acquiring an HR image from a single input LR image. This approach mainly focuses on recovering high-frequency information by inverse filtering of a point spread function (PSF) of the optical system and is known as deconvolution. In deconvolution problems, the blurring matrix B is defined by a certain PSF and the noise term n k is modeled as Gaussian noise.
The majority of deconvolution methods use a certain type of regularization to overcome the issue of noise amplification in the final solution, and consider various noise types by applying different norms to the data fidelity term. The L 1 -norm can also be adopted in the data fidelity to increase the noise immunity of the optimization process. We refer to the objective function with the L 1 -norm-based data fidelity term and TV regularization as the TV-L 1 model. The TV-L 1 model has been employed in several applications. Yang et al. proposed the TV-L 1 model for deblurring images corrupted by random-valued impulse noise [31]. Micchelli et al. presented proximity algorithms for TV-L 1 image denoising [32]. This approach can be modeled as follows:

x̂ = arg min_x ||y − Bx||_1 + λ ||∇x||_1. (8)

The L 2 data fidelity term assumes Gaussian noise, which causes side effects for other noise types. In real imaging systems, LR observations are often degraded by various types of sensor noise. Therefore, it is not always appropriate to assume that the noise follows a Gaussian distribution. In the approach of Eq. (8), the data fidelity term with the L 1 -norm is more effective and robust than that with the L 2 -norm when the images contain non-Gaussian noise. Furthermore, the optimization process based on Eq. (8) conserves strong edges and generates few ringing artifacts. To preserve sharp depth discontinuities and improve robustness to noise, we design an objective function for multi-frame depth SR based on the TV-L 1 model and propose an efficient solver based on half-quadratic splitting.

III. PROPOSED METHOD
In this section, we first analyze the statistical characteristics of ToF depth images, and subsequently design the objective function to restrict the solution space by means of prior knowledge through the regularization term and statistical information of the noise through the data fidelity term. The problem of depth images being corrupted by various noise sources is considered to improve the robustness to noise. Thereafter, we propose an efficient solver based on half-quadratic splitting and describe the proposed method in detail.

A. TV-L 1 FUNCTION FOR MULTI-FRAME DEPTH SR
Depth SR is an inherently ill-posed problem; that is, many solutions exist for any given LR depth images. To overcome this under-determined problem, the development of effective image regularization and data fidelity terms has been the topic of substantial research over the years. Under the Bayesian framework, the maximum a posteriori (MAP) estimation of the HR image x from the observed LR images y_1, y_2, . . . , y_K is given by

x̂ = arg max_x P(x | y_1, y_2, . . . , y_K). (9)
Using Bayes' rule, Eq. (9) can alternatively be expressed as

x̂ = arg max_x P(y_1, y_2, . . . , y_K | x) P(x) / P(y_1, y_2, . . . , y_K), (10)

where P(x) describes the prior probability of the HR image, P(y_1, y_2, . . . , y_K) represents the prior probability of the LR images, and P(y_1, y_2, . . . , y_K | x) represents the conditional probability of the LR images given the HR image.
As P(y_1, y_2, . . . , y_K) has no influence on the maximization of the probability function, the above MAP estimation can be rewritten as

x̂ = arg max_x P(y_1, y_2, . . . , y_K | x) P(x). (11)

It is convenient and equivalent to minimize the negative logarithm of Eq. (11), which yields

x̂ = arg min_x { − log P(y_1, y_2, . . . , y_K | x) − log P(x) }. (12)

In general, the observation of each LR image is independent and identically distributed (i.i.d.). Therefore, Eq. (12) can be simplified as

x̂ = arg min_x { − Σ_{k=1}^{K} log P(y_k | x) − log P(x) }, (13)

where P(y_k | x) represents the conditional probability of the kth LR image given the HR image. Therefore, the prior probability and the conditional probability, which encode prior knowledge based on the statistical information of the HR image and the noise, must be specified.

Firstly, let us consider the prior probability involved in Eq. (13). In general, an image is assumed to be globally smooth, which is incorporated into the reconstruction process through a Gaussian prior [18]. The prior probability is expressed as follows:

P(x) = (1/Z_g) exp(−λ ||Cx||_2^2), (14)

where Z_g is a simplified normalizing constant. This approach is referred to as Tikhonov regularization, which is the representative smoothing constraint. The approach in Eq. (14) smooths the restored image by penalizing the high-frequency components and has been applied extensively owing to its low computational costs. However, strategies that use this regularization inevitably oversmooth the sharp edges and detailed information. Recent research in natural image statistics has demonstrated that real-world scenes exhibit significantly heavy-tailed or hyper-Laplacian distributions in their gradients [33]. Image processing based on such priors has yielded state-of-the-art results in image restoration [34]. As explained previously, an edge-preserving HR depth image can be obtained if the prior probability of the desired depth images is modeled appropriately. Fig. 2a presents the logarithmic first-order derivative distribution of 20 real-world depth images from a ToF sensor. The empirical distribution of the gradient magnitude exhibits a higher correlation with the Laplace distribution than with the Gaussian distribution.
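The heavy-tailed gradient statistic described above can be checked with a small sketch. This is an illustrative experiment on a synthetic piecewise-constant "depth" image (the image and fitting procedure are assumptions for demonstration, not the paper's data): we fit both a Gaussian and a Laplace density to the horizontal gradients and compare average log-likelihoods.

```python
import numpy as np

def gradient_loglik(img):
    """Average log-likelihood of an image's gradients under Gaussian vs. Laplace fits."""
    g = np.diff(img, axis=1).ravel()
    sigma = g.std() + 1e-12                        # ML scale of the Gaussian fit
    b = np.abs(g - np.median(g)).mean() + 1e-12    # ML scale of the Laplace fit
    ll_gauss = (-0.5 * np.log(2 * np.pi * sigma**2)
                - (g - g.mean())**2 / (2 * sigma**2)).mean()
    ll_laplace = (-np.log(2 * b) - np.abs(g - np.median(g)) / b).mean()
    return ll_gauss, ll_laplace

# Piecewise-constant scene: mostly near-zero gradients plus a few large jumps,
# the typical depth-image statistic discussed in the text.
rng = np.random.default_rng(0)
img = np.zeros((64, 64)); img[:, 32:] = 10.0
img += 0.1 * rng.standard_normal(img.shape)
llg, lll = gradient_loglik(img)
```

The Laplace fit wins because the Gaussian must inflate its variance to cover the rare large jumps, ruining its fit to the near-zero bulk; the Laplace density accommodates both with its heavy tails.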
As a result, the prior of the depth images takes the form of a Laplace prior, represented as follows:

P(x) = (1/Z_l) exp(−λ ||Gx||_1), (15)

where Z_l is a simplified normalizing constant and G denotes the derivative matrix corresponding to the first-derivative linear operator. This model is associated with TV regularization, which assumes that the gradient magnitude of the depth image is sparse. Based on the MAP theory, the problem of multiple depth SR with the TV-L 2 model can be transformed into a minimization problem, as follows:

x̂ = arg min_x Σ_{k=1}^{K} { ||y_k − W_k x||_2^2 + λ_k ||Gx||_1 }, (16)

where ||Gx||_1 denotes the TV regularization. Secondly, the distribution of P(y_k | x) is derived from the observation model. In general, the noise in the observation model is assumed to follow a Gaussian distribution. The conditional probability is represented as follows:

P(y_k | x) ∝ exp(−||y_k − W_k x||_2^2 / (2σ_g^2)), (17)

where σ_g denotes the standard deviation of the Gaussian distribution. However, the result obtained by the strategy based on the data fidelity term with the L 2 -norm is only optimal when the depth images contain Gaussian noise. In practical applications, the noises that exist in the LR depth images are usually mixed. Because an image can be degraded by noise from various sources, the noise is typically non-Gaussian distributed. In ToF sensors, one noise source is correlated with the light source (shot noise) and another is an additive noise source (dark current noise) [35]. Furthermore, ToF sensors contain defective pixels that cause bright or dark spots in the depth images. Fig. 2b presents the logarithmic probability density of various noise types. The mixed noise is similar to the Laplace distribution, rather than the Gaussian distribution, in that it exhibits a heavy-tailed distribution. Therefore, the conditional probability takes the form of the Laplace distribution, represented as follows:

P(y_k | x) ∝ exp(−||y_k − W_k x||_1 / σ_l), (18)

where σ_l denotes the standard deviation of the Laplace distribution.
This approach is associated with the L 1 -norm data fidelity term, which is robust to noise regardless of the noise type. As the geometric median is defined as a point in Euclidean space that minimizes the sum of distances, it is related to the L 1 -norm [36]. Thus, the L 1 -norm data fidelity term exhibits the characteristics of a median operator. Consequently, the data fidelity term with the L 1 -norm is more effective and robust than the L 2 -norm when the images contain non-Gaussian noise such as random-valued Laplace noise and Salt-and-Pepper noise.
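The median-operator property of L 1 fidelity can be demonstrated in one dimension (an illustrative sketch with assumed sample values): the minimizer of an L 2 cost is the mean, which a single salt outlier drags far from the true depth, whereas the minimizer of an L 1 cost is the median, which stays put.

```python
import numpy as np

# Five noisy samples of a depth value near 10, with one salt "outlier" at 255.
vals = np.array([10.0, 10.1, 9.9, 10.0, 255.0])

l2_estimate = vals.mean()        # minimizer of sum (t - v)^2
l1_estimate = np.median(vals)    # minimizer of sum |t - v|

# Verify numerically that the L1 cost is minimized at the median:
grid = np.linspace(0.0, 300.0, 30001)
l1_cost = np.abs(vals[None, :] - grid[:, None]).sum(axis=1)
l1_argmin = grid[l1_cost.argmin()]
```

Here the mean lands near 59 while the median remains at 10.0, which is why the L 1 data fidelity term resists impulse-type (Salt-and-Pepper) corruption while the L 2 term does not.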
According to the MAP theory, the problem of multiple depth SR with the TV-L 1 model can be transformed into a minimization problem, as follows:

x̂ = arg min_x Σ_{k=1}^{K} { ||y_k − W_k x||_1 + λ_k ||Gx||_1 }. (19)

In Eq. (19), the L 1 -norm is used for both the data fidelity and regularization terms. The objective function based on the TV-L 1 model is known as the TV-L 1 function. Fig. 3 compares the SR results obtained by the Tikhonov-L 2 , TV-L 2 , and TV-L 1 models. The observed LR depth image is degraded by downsampling with a factor of 4 and adding mixed noise containing Gaussian noise with σ_g = 5, random-valued Laplace noise with σ_l = 5, and Salt-and-Pepper noise with probability p = 0.05. In the SR result restored by the Tikhonov-L 2 model, severe artifacts such as smoothed edges and overshooting on the boundaries are unavoidable owing to the smoothing constraints. Moreover, noise amplification appears during the restoration process. In the SR result generated by the TV-L 2 model, the noise amplification problem is suppressed to a certain extent by adjusting the regularization parameters; however, the sharp depth discontinuities are not preserved. In contrast, the TV-L 1 model outperforms the others in terms of sharpness, without any obvious artifacts, and suppresses the noise amplification. Fig. 3b presents the one-dimensional (1D) sectional view of the dashed line in Fig. 3a. We further investigate the performance of the TV-L 1 model in suppressing noise in Fig. 3c. The histogram of the green square box in Fig. 3a for the TV-L 1 model exhibits the most significant spike compared to the other models. More specifically, the TV-L 1 model produces a lower standard deviation, which indicates the level of noise in a flat area. It is important to measure the accurate distance through the depth image. As the proposed method exhibits the characteristics of a median operator, the mean value is preserved in a depth image that is corrupted by mixed noise, as indicated in Fig. 3c.
Therefore, the proposed method based on the TV-L 1 model overcomes the limitations of the other models in terms of preserving sharp depth discontinuities and suppressing noise amplification. As expected, the objective function based on the TV-L 1 model achieves the highest peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) values for various noise levels (Gaussian noise with σ_g = 0 to 10, random-valued Laplace noise with σ_l = 0 to 10, and Salt-and-Pepper noise with probability p = 0 to 0.1). Moreover, as the noise level increases, the performance difference between the models increases significantly.

B. OPTIMIZATION
In this section, we explain how to solve the optimization problem of Eq. (19), which is computationally difficult owing to the non-differentiability and non-linearity of the TV-L 1 function. To recover the HR depth image x efficiently, we propose an efficient technique based on half-quadratic splitting for TV-L 1 minimization [31]. In this strategy, the auxiliary variables v_k and u are introduced to transfer y_k − W_k x and Gx out of the non-differentiable terms. The objective function can be rewritten as follows:

min_{x, v_k, u} Σ_{k=1}^{K} { ||v_k||_1 + (1/(2β_k)) ||v_k − (y_k − W_k x)||_2^2 } + λ ||u||_1 + (λ/(2θ)) ||u − Gx||_2^2, (20)

where v_k and u are the auxiliary variables, β_k and θ are the penalty parameters, and λ = Σ_{k=1}^{K} λ_k. It is well known that the solution of Eq. (20) converges to that of Eq. (19) as β_k → 0 and θ → 0. The objective function of Eq. (20) is convex, and its quadratic penalty terms are differentiable. Furthermore, if two of the three variables v_k , u, and x are fixed, the minimization of the objective function with respect to the remaining variable has a closed-form solution.
As v_k and u are decoupled for a given x, the objective function is numerically easy to minimize iteratively. For a fixed x, the objective function of Eq. (20) is separable with respect to v_k and u, and it can be efficiently minimized by alternately solving two sub-problems. According to the alternating minimization method [30], the problem can be transformed into two separate objective functions, as follows:

min_u λ ||u||_1 + (λ/(2θ)) ||u − Gx||_2^2, (21)

min_{v_k} Σ_{k=1}^{K} { ||v_k||_1 + (1/(2β_k)) ||v_k − (y_k − W_k x)||_2^2 }. (22)

The objective functions in Eq. (21) and Eq. (22) are each single-variable optimization problems. Firstly, the solution of Eq. (21) is expressed by the shrinkage formula, as follows:

û = max(|Gx| − θ, 0) · (Gx / |Gx|), (23)

where the convention 0 · (0/0) = 0 is followed. Each component of u becomes

û = sgn(Gx) · max(|Gx| − θ, 0), (24)

where sgn(·) is the signum function and all operations are performed component-wise. Secondly, the minimization of the objective function in Eq. (22) is given by the 1D shrinkage, as follows:

v̂_k = sgn(y_k − W_k x) · max(|y_k − W_k x| − β_k, 0), (25)

where all operations involve component-wise processing. For fixed v_k and u, the minimization with respect to x in each iteration is a least-squares problem, as follows:

x̂ = arg min_x Σ_{k=1}^{K} (1/(2β_k)) ||v_k − (y_k − W_k x)||_2^2 + (λ/(2θ)) ||u − Gx||_2^2, (26)

and the objective function is convex and differentiable owing to the quadratic data fidelity and regularization terms. Setting the gradient of the objective function with respect to x equal to 0, Eq. (26) yields the optimal solution of the linear system

( Σ_{k=1}^{K} (1/β_k) W_k^T W_k + (λ/θ) G^T G ) x = Σ_{k=1}^{K} (1/β_k) W_k^T (y_k − v_k) + (λ/θ) G^T u, (27)

and the unique solution can be obtained by deterministic techniques.

Algorithm 1 Multiple Depth SR With TV-L 1 Function
Input: the degraded multiple LR depth images y_k, the degradation matrices W_k, and the regularization parameters λ_k
Output: the reconstructed HR depth image x̂
Initialization: x^0 = 0, β_k ← λ_k
1: repeat
2:   repeat
3:     Solve v_k according to Eq. (25).
4:     Solve u according to Eq. (24).
5:     Solve x according to Eq. (27).
6:   until convergence of x
7:   β_k ← β_k / 2, θ ← θ / 2
8: until β_k < β_min
9: return x̂

FIGURE 5. Six test images from the Middlebury dataset [37]: Art, Book, Dolls, Laundry, Moebius, and Reindeer; three test images from the ToFMark dataset [13]: Books, Devil, and Shark; and six test images acquired from a real-world ToF sensor for evaluation.
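The alternating scheme can be sketched compactly on a 1-D toy problem. This is an illustrative sketch of half-quadratic splitting under simplifying assumptions (dense matrices, a single shared penalty schedule, assumed parameter values), not the paper's implementation: the v and u updates are soft-thresholding steps, and the x update solves the quadratic normal equations.

```python
import numpy as np

def shrink(t, tau):
    """Soft-thresholding: argmin_s ||s||_1 + (1/(2*tau)) * ||s - t||_2^2."""
    return np.sign(t) * np.maximum(np.abs(t) - tau, 0.0)

def tvl1_sr(Ws, ys, G, lam, n, beta0=1.0, beta_min=1e-4, inner=20):
    """Half-quadratic splitting sketch for min_x sum_k ||y_k - W_k x||_1 + lam*||G x||_1."""
    x = np.zeros(n)
    beta = theta = beta0
    while beta > beta_min:                  # continuation: halve penalties
        for _ in range(inner):
            vs = [shrink(y - W @ x, beta) for W, y in zip(Ws, ys)]  # v_k update
            u = shrink(G @ x, theta)                                 # u update
            # x update: normal equations of the quadratic sub-problem
            A = (lam / theta) * G.T @ G
            b = (lam / theta) * G.T @ u
            for W, y, v in zip(Ws, ys, vs):
                A += (1.0 / beta) * W.T @ W
                b += (1.0 / beta) * W.T @ (y - v)
            x = np.linalg.solve(A + 1e-10 * np.eye(n), b)
        beta /= 2.0
        theta /= 2.0
    return x

n = 16
x_true = np.zeros(n); x_true[n // 2:] = 1.0            # one sharp depth edge
G = (np.eye(n) - np.eye(n, k=1))[:-1]                  # forward differences
Ws = [np.eye(n)[::2], np.eye(n)[1::2]]                 # two decimation operators
rng = np.random.default_rng(0)
ys = [W @ x_true + 0.02 * rng.standard_normal(n // 2) for W in Ws]
x_hat = tvl1_sr(Ws, ys, G, lam=0.05, n=n)
```

As the penalties shrink, the quadratic couplings tighten toward the equalities v_k = y_k − W_k x and u = Gx, so the iterates approach a minimizer of the original non-smooth objective while every inner step stays in closed form.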
The proposed multi-frame depth SR process with the TV-L 1 function is summarized in Algorithm 1. A continuation scheme on the penalty parameters β_k and θ is employed to accelerate the overall convergence: their values are initially set and then gradually decreased over the iterations, since the solution of Eq. (20) converges to that of Eq. (19) as β_k and θ approach 0. In terms of convergence, the cost function in Eq. (19) is convex with respect to all variables.

IV. EXPERIMENTAL RESULTS
We verified the performance of the proposed method both quantitatively and qualitatively through various experiments. The experiments were conducted using three degradation types: downsampling degradation (noise-free observation), ToF-like degradation (sub-sampling with mixed noises), and real-world ToF sensor data. In this section, we first describe the datasets used in the experiments. Thereafter, the conventional methods that are compared with the proposed method are briefly outlined. Subsequently, the comparative results on the datasets in terms of the performance metrics and visual quality are presented. Finally, the influence of the parameters and a convergence analysis are discussed.

VOLUME 8, 2020

FIGURE 8. Visual comparison of ×2 and ×4 upsampled depth images, enlarged parts, and difference maps between the ground truth and the corresponding result from ToF-like degradation (sub-sampling with mixed noises): (a) ground truth, (b) bicubic interpolation, (c) SRCNN [11], (d) FGI [15], (e) RCG [16], (f) RSR [20], (g) MTSR [25], and (h) proposed method; (i) to (p) are the same methods as (a) to (h). The first and second rows present the results of Art and Moebius in the Middlebury dataset, respectively.

In the implementation of our proposed method, the following regularization parameters were used for each LR image: λ_k = 0.005 for the downsampling degradation, λ_k = 0.1 for the ToF-like degradation, and λ_k = 0.01 for the real-world ToF data. The same regularization parameters λ_k were applied to each LR image. The criterion ||x^{n+1} − x^n||_2 / ||x^n||_2 < 10^{−6} was used to terminate the iteration. The PSNR, SSIM [38], and percentage of bad matched pixels (BMP) [39] were used as objective performance metrics. The PSNR was employed to evaluate the gray-value similarity, whereas the SSIM was mainly used to evaluate the structural similarity.
To evaluate the depth accuracy, the percentage of BMP between the upsampled depth image x̂(i, j) and the ground truth depth image x(i, j) was measured, which is defined as

BMP = (1/N) Σ_{(i,j)} ( |x̂(i, j) − x(i, j)| > δ_d ) × 100%, (28)

where δ_d is the depth error tolerance and N is the total number of pixels. We used δ_d = 1.0 for the downsampling degradation and δ_d = 255 × 0.05 for the ToF-like degradation. Moreover, the cross-correlation method [40] was applied to estimate the sub-pixel registration between the reference LR frame and the other LR frames. The sub-pixel registration is represented by the warping matrix M_k in Eq. (1).
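The BMP metric is simple to compute; the sketch below is an illustrative implementation of the definition in the text (the toy ground-truth and estimate arrays are assumed values).

```python
import numpy as np

def bad_matched_pixels(x_hat, x_gt, delta_d):
    """Percentage of pixels whose absolute depth error exceeds delta_d."""
    return 100.0 * np.mean(np.abs(x_hat - x_gt) > delta_d)

gt = np.zeros((10, 10))
est = gt.copy()
est[0, :5] = 3.0    # five of 100 pixels off by 3 depth units
bmp = bad_matched_pixels(est, gt, delta_d=1.0)  # 5% of pixels exceed tolerance
```

Unlike PSNR, BMP is insensitive to small sub-tolerance errors and counts only gross depth failures, which makes it a natural accuracy measure for depth maps.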

A. DATASETS
We used a synthetic dataset and a real-world ToF dataset to demonstrate the performance of the proposed method, as illustrated in Fig. 5. The Middlebury dataset [37] and the ToFMark dataset [13], which include HR color images and ground truth HR depth images, were used for evaluation. The former comprises six test images, namely Art, Book, Dolls, Laundry, Moebius, and Reindeer, and the latter comprises three test images, namely Books, Devil, and Shark. As these datasets include ground truth depth images, quantitative analysis was possible. To synthesize the LR depth images y_k for the downsampling degradation, we translated the HR image x and sub-sampled it by a certain upsampling factor in each dimension using the same bicubic kernel. We generated 4 or 16 LR depth images from one HR depth image at the ×2 or ×4 upsampling factor, respectively. The sub-pixel registration of every LR image was simulated as a global motion randomly changing from −1 to 1 in the horizontal and vertical directions. We assumed that the blurring matrix was an identity matrix (B_k = I) to focus on the downsampling and noise in the SR problem in all experiments. For the ToF-like degradation, a combination of Gaussian noise, random-valued Laplace noise, and Salt-and-Pepper noise was added to the LR depth images simultaneously as the mixed noises. The noisy observation for each pixel (i, j) under the mixed noises can be described as

y_(i,j) = d_min, with probability p/2,
y_(i,j) = d_max, with probability p/2,
y_(i,j) = x_(i,j) + n_(i,j) + v_(i,j), with probability 1 − p,

where n_(i,j) is a sample of an i.i.d. zero-mean Gaussian distribution with standard deviation σ_g and v_(i,j) is a sample of an i.i.d. zero-mean Laplace distribution with standard deviation σ_l. Within the given dynamic range [d_min, d_max], d_min and d_max denote the minimum and maximum values of an image pixel, respectively. An LR image was corrupted by Salt-and-Pepper noise when y_(i,j) was either d_min or d_max, each with an equal probability of p/2. For the mixed noises, a pixel was corrupted by Gaussian and Laplace noise with a probability of 1 − p. Fig. 6 presents a noisy LR depth image corrupted with four levels of mixed noises. We considered four values of the Gaussian noise parameter, namely σ_g = 2. Moreover, we acquired multiple depth images from the real-world ToF sensor, which provided 172 × 224 dense depth measurements, as indicated in the six images on the right side of Fig. 5. As multi-frame SR algorithms require LR depth images with sub-pixel registration, global translations were introduced into the real-world ToF data by arbitrarily manipulating the ToF imaging system during acquisition. Alternatively, LR depth images can be obtained using multiple ToF camera arrays. Regardless of which method is used to obtain the LR depth images, the new information contained in each LR image can be exploited to obtain the HR depth image if the sub-pixel registrations of the LR depth images differ from one another and if aliasing is present.
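The LR synthesis described above (translate, sub-sample with B_k = I, then corrupt with the mixed noises) can be sketched as follows; the integer-pixel circular shift (standing in for the sub-pixel warp M_k), the clipping to the dynamic range, and the function names are our simplifications:

```python
import numpy as np

def degrade(x, shift, factor):
    """One LR observation y_k = D M_k x with identity blur (B_k = I):
    translate the HR depth image, then sub-sample by the upsampling factor."""
    shifted = np.roll(x, shift, axis=(0, 1))   # global translation M_k
    return shifted[::factor, ::factor]         # decimation D

def add_mixed_noise(x, sigma_g, sigma_l, p, d_min=0.0, d_max=255.0, rng=None):
    """Mixed-noise model: with probability p the pixel becomes Salt-and-Pepper
    (d_min or d_max, each with probability p/2); otherwise zero-mean Gaussian
    (std sigma_g) plus Laplace (std sigma_l) noise is added.  NumPy's Laplace
    parameter is the scale b, related to the std by sigma_l = b * sqrt(2)."""
    rng = np.random.default_rng() if rng is None else rng
    n = rng.normal(0.0, sigma_g, x.shape)                  # Gaussian component
    v = rng.laplace(0.0, sigma_l / np.sqrt(2.0), x.shape)  # Laplace component
    y = np.clip(x.astype(np.float64) + n + v, d_min, d_max)
    u = rng.random(x.shape)
    y[u < p / 2.0] = d_min                                 # pepper
    y[(u >= p / 2.0) & (u < p)] = d_max                    # salt
    return y
```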

B. COMPARED METHODS
We compared the proposed method with the following three method categories: 1) the single-frame depth image upsampling approach, including general image upsampling methods: bicubic interpolation, self-guided residual interpolation (SG) [9], edge-guided upsampling (EG) [10], and SRCNN [11]; 2) the intensity-guided depth upsampling approach: guided image filtering (GF) [12], fast global image smoothing (FGI) [15], robust color guided restoration (RCG) [16], and joint local structure and nonlocal low-rank regularization (LN) [17]; and 3) the multi-frame depth image SR approach: robust SR (RSR) [20] based on the BTV model, combination SR (CSR) using sparse and non-sparse priors [23], and multi-frame ToF SR (MTSR) [25] based on the TV model. The implementations from the publicly available codes provided by the authors and the appropriate parameters were used in our experiments. However, the source code of LN [17] was executed only for the downsampling degradation on the Middlebury dataset. Experiments were also conducted on the Tikhonov-L2 and TV-L2 models implemented with our scheme (denoted by ''ours''), as proposed in Lidarboost [24] and MTSR [25]. We selected the regularization parameters that generated the highest PSNR values for all of the testing images.

FIGURE 10. Visual comparison of ×4 upsampled depth images, enlarged parts, gradient maps, and 1D sectional view of the purple dashed ellipse from the real-world ToF sensor: (a) bicubic interpolation, (b) SG [9], (c) SRCNN [11], (d) RSR [20], (e) CSR [23], (f) MTSR [25], (g) proposed method, and (h) to (n) are the same methods as (a) to (g).

C. EVALUATION ON DOWNSAMPLING DEGRADATION
For the evaluation on the downsampling degradation, the LR images were generated by sub-sampling the synthetic datasets at upsampling rates of ×2 and ×4. In this case, the degradation matrix W_k represented the operation that first translated the HR depth image and then downsampled it to the desired resolution. The noise n_k was zero. The quantitative evaluation of the upsampled depth images in terms of the PSNR, SSIM, and BMP values is summarized in Table 1 and Table 3. It can be observed that the proposed method achieved the highest PSNR and SSIM values as well as the lowest BMP value among all method types, including the single-frame and intensity-guided approaches. Fig. 7 and the first row of Fig. 9 present the results of three approach types: single-frame depth upsampling (bicubic interpolation and SRCNN [11]), intensity-guided depth upsampling (FGI [15] and RCG [16]), and multi-frame depth SR (RSR [20], MTSR [25], and the proposed method). The bicubic interpolation produced jagging artifacts and oversmoothing. The SRCNN [11] method generated sharper edges than the bicubic interpolation. However, the single-frame depth upsampling approach exhibited low performance compared to the intensity-guided upsampling and multi-frame SR approaches. The intensity-guided upsampling methods yielded sharper depth edges than the single-frame upsampling methods, as the intensity guidance could aid in maintaining the depth edges. However, these methods suffered from depth bleeding or texture copy artifacts owing to the inconsistency between the color edges and depth discontinuities. In particular, FGI [15] resulted in more severe artifacts than RCG [16]. In contrast, no texture transfer artifacts existed in the results of the multi-frame SR methods. Note that RSR [20], MTSR [25], and the proposed method used multiple LR depth images.
Although RSR [20] is based on the BTV-L1 model and solved with gradient descent, a problem similar to ringing artifacts occurred during its restoration process. MTSR [25], based on the TV-L2 model, exhibited relatively good results in a noise-free environment. However, as can be observed, the proposed method based on the TV-L1 model produced much sharper edges than the other methods, without any obvious artifacts. In particular, the proposed method successfully suppressed the ringing artifacts and outperformed the other methods in terms of the sharpness around the depth discontinuity regions. The difference images in the visual comparison intuitively demonstrate that the proposed method achieved low errors.

D. EVALUATION ON ToF-LIKE DEGRADATION
In this case, the degradation matrix W_k was the same as that of the downsampling degradation. However, the noise n_k represented the mixed noises. To verify the robustness of the proposed method to noise, mixed noises including Gaussian noise with σ_g = 5, random-valued Laplace noise with σ_l = 5, and Salt-and-Pepper noise with p = 0.05 were added to the LR depth images. The comparison results in Table 2 and Table 3 indicate that the proposed method achieved the highest PSNR performance in all cases. The proposed method also exhibited the highest SSIM performance, indicating that the reconstructed depth images had high structural similarity to the ground truth depth images. Furthermore, the proposed method yielded the lowest BMP value among the comparison methods and thus achieved the highest depth accuracy. A visual comparison of the upsampling evaluation is presented in Fig. 8 and the second row of Fig. 9. The majority of the single-frame upsampling methods exhibited limitations on noisy depth images. SRCNN [11] generated artifacts in flat areas because the noise was not considered during the training procedure. The intensity-guided upsampling methods reduced the noise relatively effectively. Although FGI [15] and RCG [16] could reduce the noise, several severe outliers such as bleeding artifacts were propagated. RSR [20] removed most of the noise but yielded an unnatural overall result. In MTSR [25], the boundaries of the edges were broken, and the level of the overall depth varied. In contrast, the proposed method based on the TV-L1 model successfully suppressed the noise amplification while preserving the sharpness of the edges.

E. EVALUATION ON REAL-WORLD ToF SENSOR DATA
We also performed comparisons on real-world ToF sensor data. The depth images captured with the real-world ToF camera were degraded by various noise types, such as Gaussian, Poisson, and Speckle noise. A total of 10 LR depth images with different sub-pixel registrations were used in this experiment. We compared the single-frame upsampling methods (bicubic interpolation, SG [9], and SRCNN [11]) and the multi-frame SR methods (RSR [20], CSR [23], MTSR [25], and the proposed method). As no ground truth HR depth images existed, only a subjective comparison of the upsampled depth images could be performed. The gradient maps, as opposed to difference maps, are presented in Fig. 10; these express the absolute magnitude of the gradient at each pixel. The noise in the flat areas also produces local maxima in the gradient map, and the sharpness of the edges appears as strong responses along the depth discontinuities. Several examples of this comparison with an upsampling factor of ×4 are presented in Fig. 10. The results of the bicubic interpolation contained considerable artifacts such as blurred edges, and SG [9] suffered from similar artifacts, although a noise reduction effect could be expected because this method is based on the guided filter [12]. SRCNN [11] generated much sharper edges, but artifacts appeared in the flat areas because the noise was not considered during the training procedure. In general, the methods based on a single-frame depth image lacked sharpness in the high-frequency regions. The multi-frame SR methods produced sharper edges than the single-frame upsampling methods. RSR [20] produced periodic artifacts and yielded unnatural results overall. CSR [23] blurred the edges owing to the use of a combination of sparse and non-sparse priors. Although MTSR [25] suppressed the noise amplification, zipper artifacts could be observed at the edges.
In contrast, the proposed method based on the TV-L1 model exhibited superior performance without any obvious artifacts by reconstructing detailed information and generating sharp depth discontinuities. Therefore, the proposed method provided more reliable results than the previous methods, both those using a similar objective function and those using different ones.
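The gradient maps used in this comparison can be computed with forward differences; the following is a minimal sketch (the border handling and function name are our choices):

```python
import numpy as np

def gradient_magnitude(x):
    """Absolute gradient magnitude at each pixel via forward differences
    (zero-padded at the last row/column of the image)."""
    gy = np.zeros_like(x, dtype=np.float64)
    gx = np.zeros_like(x, dtype=np.float64)
    gy[:-1, :] = np.diff(x, axis=0)   # vertical differences
    gx[:, :-1] = np.diff(x, axis=1)   # horizontal differences
    return np.hypot(gy, gx)
```

On such a map, noise in flat areas shows up as scattered local maxima, while sharp depth edges show up as strong connected responses.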

F. INFLUENCE OF PARAMETERS
The proposed method requires regularization parameters λ_k, which control the trade-off between the data fidelity and regularization terms. Small values of λ_k tend to yield sharper results, but the noise will be amplified. Large values of λ_k tend to yield less noisy results, but the restored image may be over-smoothed. Fig. 11 presents an example of the influence of the regularization parameters in the case of mixed noises. Because the optimal choice of λ_k is not known prior to solving the minimization, λ_k should be selected empirically. An appropriate selection of λ_k according to the noise level can significantly improve the objective performance. Fig. 12 presents a comparison of the objective performance with an upsampling factor of ×4 according to λ_k with various noise levels (σ_g = 0 to 10, σ_l = 0 to 10, and p = 0 to 0.1). Through this experiment, appropriate values of λ_k were established according to the noise level. Moreover, the available information contained in each LR image depends on the image quality, such as the noise level and high-frequency information. Therefore, if a different regularization parameter λ_k is assigned to each LR image, improved results can be expected compared to using the same regularization parameter for all images.
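The trade-off can be made concrete by evaluating a TV-L1 objective for a candidate λ. The sketch below uses anisotropic TV and places λ on the regularization term (so that a small λ weakens the smoothing, matching the behavior described above); the exact per-frame weighting λ_k in the paper's objective may be arranged differently, and the `degrade` callable stands in for the W_k operator:

```python
import numpy as np

def tv(x):
    """Anisotropic total variation: sum of absolute forward differences."""
    return np.abs(np.diff(x, axis=0)).sum() + np.abs(np.diff(x, axis=1)).sum()

def tv_l1_cost(x, ys, degrade, lam):
    """TV-regularized L1 cost: sum_k ||W_k x - y_k||_1 + lam * TV(x).
    `degrade(x, k)` maps the HR estimate onto the k-th LR grid (W_k)."""
    data = sum(np.abs(degrade(x, k) - y).sum() for k, y in enumerate(ys))
    return data + lam * tv(x)
```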

G. CONVERGENCE ANALYSIS
Finally, to demonstrate the convergence behavior of the multi-frame depth SR with the TV-L1 model, the cost function in Eq. (19) versus the iteration number is plotted in Fig. 13 for various conditions. In this test, the initial estimate x^0 = 0, an upsampling rate of ×4, and the test image Art without noise were set as the default values, and only one variable was changed in each experiment. For the initial estimate x^0, we experimented with a blank image (x^0 = 0) and an interpolated version of the first LR image (x^0 = D^T y_1). Furthermore, experiments were conducted with various noise levels (σ_g = 0 to 10, σ_l = 0 to 10, and p = 0 to 0.1). As illustrated in Fig. 13, the objective function of the proposed model converged under the various settings, and the graphs indicate that the cost function decreased rapidly within several iterations.
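To illustrate how half-quadratic splitting alternates a shrinkage step with a linear solve, the following is a 1D toy sketch; note that for brevity it uses an L2 data term (min_x ½||x − y||² + λ||Dx||_1), whereas the paper's solver additionally handles the L1 data term, and all parameter values here are illustrative:

```python
import numpy as np

def hqs_tv_denoise_1d(y, lam=1.0, beta=10.0, iters=100):
    """Half-quadratic splitting on a 1D toy problem:
    min_x 0.5*||x - y||^2 + lam*||Dx||_1, with auxiliary variable z ~ Dx.
    z-step: soft-thresholding; x-step: solving a linear system."""
    n = len(y)
    D = np.eye(n, k=1)[: n - 1] - np.eye(n)[: n - 1]  # forward-difference matrix
    A = np.eye(n) + beta * D.T @ D                    # x-step system matrix
    x = y.copy()
    for _ in range(iters):
        d = D @ x
        z = np.sign(d) * np.maximum(np.abs(d) - lam / beta, 0.0)  # shrinkage
        x = np.linalg.solve(A, y + beta * D.T @ z)                # quadratic solve
    return x
```

Plotting the cost over the loop iterations reproduces the qualitative behavior described above: the objective drops steeply within the first few iterations and then flattens.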

V. CONCLUSION
The multi-frame depth SR method based on the TV-regularized L1 function has been presented. The statistical characteristics of ToF depth images were analyzed, and an objective function based on the TV-L1 model for multi-frame depth SR was designed. The multi-frame depth SR based on the TV-L1 model is robust to noise and produces sharp depth discontinuities without obvious artifacts compared to the Tikhonov-L2 and TV-L2 models. Furthermore, an efficient solver based on half-quadratic splitting has been presented. The cost function decreases within only a few iterations under various conditions. The proposed method was compared with existing state-of-the-art methods using synthetic and real-world ToF data. In both the quantitative and qualitative evaluations, the proposed method outperformed these methods on the synthetic and real-world ToF data.

His current research interests include super-resolution image processing, image restoration, and color interpolation.

JAEDUK HAN received the B.S. degree in electrical and electronic engineering from Yonsei University, South Korea, in 2014, where he is currently pursuing the joint M.S. and Ph.D. degrees with the Department of Electrical and Electronic Engineering.
His current research interests include image restoration and inverse problems in image and video processing.
MOON GI KANG received the B.S. and M.S. degrees in electronics engineering from Seoul National University, Seoul, South Korea, in 1986 and 1988, respectively, and the Ph.D. degree in electrical engineering from Northwestern University, Evanston, IL, USA, in 1994.
He was an Assistant Professor with the University of Minnesota, Duluth, MN, USA, from 1994 to 1997. Since 1997, he has been with the Department of Electronic Engineering, Yonsei University, Seoul, where he is currently a Professor. His current research interests include image and video filtering, restoration, enhancement, and superresolution reconstruction. He has authored more than 100 technical articles in his areas of expertise. He