Video Super-Resolution Using Plug-and-Play Priors

Video super-resolution is a fundamental task in computer vision, aiming to enhance the resolution and visual quality of low-resolution videos. Plug-and-Play Priors is one of the most widely used frameworks for solving computational imaging problems by integrating physical and learned models. Traditional approaches often rely on handcrafted priors, which are computationally expensive and may not generalize well to diverse video content. In this paper, we propose a novel approach for video super-resolution using Plug-and-Play Priors with motion estimation. By leveraging the power of deep learning and the flexibility of the Plug-and-Play framework, our method achieves promising results while maintaining computational efficiency. Experimental results on benchmark datasets demonstrate the superiority of our approach in terms of both quantitative metrics and visual quality.

frames containing distinct information, the HR image can be reconstructed through digital image or video processing [2].
In video super-resolution, accurate motion estimation plays a pivotal role. Enhancing the resolution of low-resolution videos involves aligning and consolidating information from multiple frames to generate a high-resolution output [3].
Image degradation in video super-resolution is commonly modeled as a combination of linear blur, motion, subsampling, and Gaussian noise. This is typically formalized through an observation model that describes how multiple low-resolution (LR) images are acquired [4]. According to this model, the LR input images are obtained from the high-resolution (HR) original scene through operations such as warping, blurring, and downsampling. It is assumed that the HR image remains constant during the acquisition of the LR images [5].
Numerous algorithms and techniques have been proposed over the years to address the enhancement of resolution in both images and videos. The initial attempt was made by Tsai and Huang [6], utilizing the shifting property of the Fourier transform and the aliasing relationship between the continuous Fourier transform (CFT) and the discrete Fourier transform (DFT). Tekalp et al. [7] extended this method, incorporating a least squares approach to solve a system of equations and introducing a linear shift-invariant (LSI) blur point spread function (PSF). Kim et al. [8] further improved this technique by introducing a weighted least squares algorithm to handle noisy data. However, these methods are limited to scenarios where global motion is known in advance. Other spatial domain methods include the projection onto convex sets (POCS) approach introduced by Stark and Oskoui [9]. This method intersects convex constraint sets representing desirable image characteristics, such as positivity, bounded energy, fidelity to the data, and smoothness, with the HR image space. POCS has been extended to handle time-varying motion blur [10], [11], using block matching or phase correlation for registration parameter estimation [10].
Stochastic methods form another category of resolution enhancement algorithms, with maximum likelihood (ML) and maximum a posteriori (MAP) approaches falling under this group [12]. MAP estimation with an edge-preserving Huber-Markov random field image prior is examined in [13], [14], and [15]. Resolution enhancement with simultaneous registration parameter estimation is proposed in [16], [17], [18], and [19]. This method uses a Gibbs-Markov random field (GMRF) image prior with a local clique. The regularization parameter is crucial to the HR image reconstruction, and the L-curve method is employed for its estimation in [20], selecting the desired "L-corner", the point with maximum curvature on the L-curve.
Accurate knowledge of the point spread function (PSF) and precise registration of subpixel motion are crucial for reconstructing high-resolution (HR) images. However, in practical applications, ensuring accurate knowledge of these parameters is often challenging. Lee and Kang [21] presented a regularized adaptive HR reconstruction method that accommodates inaccurate subpixel registration. Assuming Gaussian noise for the registration error, with a standard deviation (STD) proportional to the registration error's magnitude, two approaches were developed to estimate the regularization parameter for each low-resolution (LR) frame (channel). Experimental results demonstrated the convergence of these methods to a unique global solution, although the synergy of these approaches was not extensively demonstrated. In [22], a hierarchical Bayesian framework was employed to address image restoration in the presence of partially known blurs, using stationary zero-mean white noise to model the unknown component of the PSF. Evidence analysis (EA) was used to propose two iterative algorithms resembling the regularized constrained total least squares filter and the linear minimum mean square-error filter [23], [24], [25].
Robust super-resolution techniques have been introduced in [23], [24], and [25], specifically designed to handle anomalies (data that deviate from the model). In [23], the iterative HR image acquisition method incorporates a median filter, showing robustness when errors from outliers are symmetrically distributed. However, determining whether bias arises from aliasing or outlier information requires a threshold, and the method's mathematical justification is not thoroughly examined. In [24] and [25], a robust super-resolution approach was proposed, incorporating the L1 norm in both the regularization term and the measurement term of the penalty function. A robust regularization based on bilateral priors was introduced to accommodate various data and noise models, providing mathematical support for a "shift and add" approach related to L1 norm minimization when the relative motion is purely translational and the PSF and decimation factor are common to all LR images.
Subsequently, the methodology introduced in [16], [17], [18], and [19] was extended to handle scenarios where low-resolution (LR) frames suffer from additive white Gaussian noise (AWGN) with varying variances in each frame [18]. The fundamental idea involves scaling the residual term of the cost function by the inverse of the variance for each frame (channel) when AWGN with distinct variances is the sole additional noise source in the LR images. Moreover, to mitigate errors introduced by other types of noise during the resolution enhancement reconstruction phase, weighting should be applied to each channel. Additionally, He and Kondi proposed an image super-resolution algorithm in [4] that takes into account imprecise estimates of registration parameters and the point spread function. These inaccurate estimates, coupled with additive Gaussian noise in the LR image sequence, result in varying noise levels for each frame. In the proposed algorithm, LR frames are adaptively weighted based on their reliability, and the regularization parameter is simultaneously estimated, assuming a translational motion model.
Image super-resolution using deep learning has gained significant attention due to its ability to generate high-resolution images from low-resolution inputs. Various deep learning architectures and methods have been proposed for image super-resolution, including SRCNN [26], FSRCNN [27], ESPCN [28], VDSR [29], SRGAN [30], EDSR [31], RCAN [32], IDPT [33], and DBTC [34]. The emergence of deep learning has also shown the substantial potential of convolutional neural networks (CNNs) in video super-resolution. Tao et al. [35] introduced a CNN-based framework for video super-resolution that effectively harnessed both spatial and temporal information. Their network learned spatio-temporal dependencies in videos, leading to improved resolution and visual quality.
To further improve the performance of CNN-based video super-resolution, researchers explored the incorporation of recurrent neural networks (RNNs) to model long-term temporal dependencies. Caballero et al. [36] proposed a recurrent video super-resolution network (RVSR) that integrated a recurrent structure to capture temporal information across frames. The recurrent connections facilitated a better understanding of temporal dynamics, resulting in superior super-resolution outcomes.
In addition to approaches based on deep learning, there have been endeavors to exploit alternative priors and constraints in video super-resolution. For example, Huang et al. [37] proposed a method that incorporates non-local self-similarity to harness redundancy within video frames. By enforcing self-similarity constraints, their approach achieved enhanced reconstruction quality and reduced artifacts.
Another direction in video super-resolution involves the use of generative adversarial networks (GANs). Ledig et al. [30] introduced an SRGAN-based framework for single-image super-resolution, later extended to address video super-resolution. With a generator-discriminator architecture, SRGAN effectively captured high-frequency details, resulting in visually pleasing super-resolved videos.
Moreover, researchers have explored the fusion of multiple frames to enhance the resolution of video sequences. Huang et al. [38] proposed a multi-frame video super-resolution method that combines a temporal fusion module with a spatial attention mechanism. By selectively fusing information from multiple frames, their approach achieved improved super-resolution results.
The assessment of video super-resolution methods relies significantly on evaluation metrics and datasets. The use of benchmark datasets, such as Vimeo-90K [39] and REDS [40], has facilitated fair comparisons and benchmarking of various algorithms.
The Plug-and-Play Priors (PPP) framework is one of the most widely used methodologies for addressing computational imaging problems through the integration of physical and learned models. PPP combines high-fidelity physical sensor models with robust machine learning techniques for prior modeling of data, yielding state-of-the-art reconstruction algorithms. PPP algorithms alternate between minimizing a data fidelity term to uphold data consistency and enforcing learned regularization through image denoising [41]. Recent achievements of PPP algorithms span applications in biomicroscopy, computed tomography, magnetic resonance imaging, and joint ptychotomography [42].
This article proposes a video super-resolution method based on the Plug-and-Play (PnP) framework. To our knowledge, this is the first attempt to use the PnP framework with motion estimation for video super-resolution.

II. PLUG-AND-PLAY PRIORS
Plug-and-Play Priors (PPP) is a widely adopted framework that integrates physical and learned models to address computational imaging challenges. It merges conventional optimization techniques with modern denoising methods and priors to efficiently tackle inverse problems [43]. Initially introduced by Venkatakrishnan et al. [42], PPP has garnered significant attention across various domains of computer vision and image processing. This section reviews key contributions that have shaped the development and application of PPP.
The original PPP framework proposed by Venkatakrishnan et al. [42] showed its efficacy in solving inverse problems, such as image denoising and deblurring. Their work demonstrated that by alternately applying denoising and data fidelity steps, PPP achieves state-of-the-art results. The denoising step employs robust algorithms such as Non-Local Means (NLM) or Block-matching and 3D filtering (BM3D) [44] to eliminate noise and enhance image quality. The data fidelity step ensures consistency between the denoised image and the observed measurements. Although the original formulation relies on ADMM [45], PPP is equally effective when combined with other proximal algorithms such as primal-dual splitting (PDS) [46] and the fast iterative shrinkage/thresholding algorithm (FISTA) [47].
To further enhance denoising capabilities within PPP, Zhang et al. [48] introduced a deep denoising network named DnCNN. Integrating DnCNN into the PPP framework demonstrated its effectiveness in tasks such as image super-resolution and inpainting. The use of deep neural networks within PPP provides a more flexible and potent denoising tool, surpassing traditional handcrafted denoisers in performance.
Ghassab and Bouguila [49] explored the use of a Student-t mixture model as a promising tool for video super-resolution reconstruction. The Student-t mixture model, known for its heavy tails, was deemed robust and well-suited as a prior for video frame patches, offering a mixture model with a rich log-likelihood for information retrieval. Edge-preserving filtering was implemented to address potential data uncertainties and preserve areas with abrupt lighting changes in video frames. The Plug-and-Play Priors (PPP) structure was subsequently employed to integrate the Student-t mixture prior model and edge-preserving filtering into the super-resolution algorithm. Empirical evaluations conducted on various video frame sets demonstrated the effectiveness of the proposed algorithm. Comparisons with eight other state-of-the-art super-resolution methods affirmed that the proposed framework generally outperforms the others across different super-resolution scales, even without using motion estimation to exploit frame correlations.
PnP-ADMM is widely recognized for its efficiency and fast empirical convergence for the operators commonly encountered in computational imaging. However, it requires computing the proximal map, in contrast to PnP-FISTA, which requires only the gradient ∇g. Although the gradient is conceptually simpler than the proximal map, in many applications the proximal map can be computed or approximated efficiently. General techniques such as conjugate gradient, or specialized methods, particularly when the forward model includes a spatial blurring operator computed through the fast Fourier transform (FFT), can be used for this purpose [50].
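As a worked illustration of why the proximal map is often tractable, consider a quadratic data-fidelity term. The scaling used here is an assumption made for the example: $g(x) = \tfrac{1}{2}\|y - Ax\|_2^2$ and $\operatorname{prox}_{\gamma g}(z) = \arg\min_x \{\tfrac{1}{2}\|x - z\|_2^2 + \gamma g(x)\}$. Setting the gradient of the objective to zero gives a closed form:

```latex
\begin{aligned}
\operatorname{prox}_{\gamma g}(z)
  &= \arg\min_{x}\Big\{ \tfrac{1}{2}\|x - z\|_2^2 + \tfrac{\gamma}{2}\|y - Ax\|_2^2 \Big\} \\
0 &= (x - z) + \gamma A^{\mathsf T}(Ax - y) \\
\operatorname{prox}_{\gamma g}(z)
  &= \big(I + \gamma A^{\mathsf T} A\big)^{-1}\big(z + \gamma A^{\mathsf T} y\big)
\end{aligned}
```

Under these assumptions, the linear system can be solved approximately with a few conjugate gradient iterations. When $A$ is a circular convolution, $I + \gamma A^{\mathsf T} A$ is diagonalized by the discrete Fourier transform, so the inverse reduces to a pointwise division in the frequency domain.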
Introducing an extra state variable, used as the initialization for the proximal minimization problem, simplifies this procedure. An iterative solver, starting from this initialization, performs a small number of steps to approximate the minimizer. This state variable converges together with the outer loop, which reduces the computational cost through partial updates while maintaining the accuracy of the final solution [51].
In [52], the authors introduce a simple and robust super-resolution framework that applies to individual images and is easily adapted to videos. The framework builds on the observation that effective denoisers exist for both images and videos. By leveraging the Plug-and-Play Priors framework and adopting the Regularization-by-Denoising (RED) approach, the authors show how denoisers can be used to tackle both Single-Image Super-Resolution (SISR) and Video Super-Resolution (VSR) using a unified formulation. Instead of incorporating motion estimation between frames, this approach employs the VBM3D video denoiser.
This paper introduces a PnP method for video super-resolution that uses motion estimation, which, to our knowledge, has not been done before.

III. OUR METHOD
The acquisition model we assume is

$$ y = A x + \varepsilon, \tag{1} $$

where:
• $y$ is the full set of LR frames, $y = [y_1^T, y_2^T, \ldots, y_p^T]^T$, where $y_k$, $k = 1, 2, \ldots, p$, are the $p$ LR images. Each observed LR image is of size $N_1 \times N_2$. The $k$th LR image is written in lexicographic notation as $y_k = [y_{k,1}, y_{k,2}, \ldots, y_{k,M}]^T$, for $k = 1, 2, \ldots, p$ and $M = N_1 N_2$.
• $x$ is the desired HR image, of size $L_1 N_1 \times L_2 N_2$, written in lexicographic notation as the vector $x = [x_1, x_2, \ldots, x_N]^T$, where $N = L_1 N_1 L_2 N_2$ and $L_1$ and $L_2$ represent the up-sampling factors in the horizontal and vertical directions, respectively. $x$ is the ideal un-degraded image, sampled at or above the Nyquist rate from a continuous scene that is assumed to be band-limited.
• $\varepsilon = [\varepsilon_1^T, \varepsilon_2^T, \ldots, \varepsilon_p^T]^T$, where $\varepsilon_k$ is the noise vector for frame $k$ and contains independent zero-mean Gaussian random variables.
• $A = [A_1^T, A_2^T, \ldots, A_p^T]^T$ is the degradation matrix, which performs the operations of blur, motion, and subsampling.

Assuming that each LR image is corrupted by additive noise, we can then represent the observation model as [5]:

$$ y_k = A_k x + \varepsilon_k, \quad k = 1, 2, \ldots, p, \tag{2} $$

where each $A_k$ applies, in order, the motion (warping) matrix $M_k$, the blur matrix $B_k$, and subsampling (Eq. (3)). In our case $B_k = I$, since we assume no added blur on the video frames. The goal is to find the estimate $\hat{x}$ of the HR image $x$ from the $p$ LR images $y_k$ by minimizing the cost function

$$ \hat{x} = \arg\min_x \{ g(x) + h(x) \}, $$

where $g(x) = \tfrac{1}{2}\|y - Ax\|_2^2$ is the "fidelity to the data" term and $h(x)$ is the regularization term, which encodes prior knowledge about $x$. In this work, we use the Plug-and-Play Prior methodology, where $h(x)$ is not explicitly defined. Instead, the ADMM algorithm is modified so that the proximal operator that depends on $h(x)$ is replaced by a denoising neural network [53].
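The observation model can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions that are not taken from the paper: the motion matrix is modeled as a global integer translation with clamped borders, there is no blur ($B_k = I$), and the helper names `shift`, `decimate`, and `observe` are hypothetical.

```python
import random

def shift(img, dy, dx):
    """Translate a 2-D image by integer (dy, dx), clamping at the borders.

    Assumed stand-in for the motion matrix M_k (global translation only).
    """
    h, w = len(img), len(img[0])
    return [[img[min(max(r - dy, 0), h - 1)][min(max(c - dx, 0), w - 1)]
             for c in range(w)] for r in range(h)]

def decimate(img, l1, l2):
    """Keep every l2-th row and every l1-th column (subsampling, B_k = I)."""
    return [row[::l1] for row in img[::l2]]

def observe(x, dy, dx, l1, l2, sigma, rng):
    """Simulate one LR frame: subsample the warped HR image, then add noise."""
    lr = decimate(shift(x, dy, dx), l1, l2)
    return [[v + rng.gauss(0.0, sigma) for v in row] for row in lr]

rng = random.Random(0)
# Toy 8x8 HR image with values in [0, 1].
x = [[(r * 8 + c) / 63.0 for c in range(8)] for r in range(8)]
# Three frames; the middle one has no motion (M_mid = I).
shifts = [(0, 1), (0, 0), (1, 0)]
frames = [observe(x, dy, dx, 2, 2, 0.02, rng) for dy, dx in shifts]
```

Each simulated frame is 4 × 4, matching $N_1 = N_2 = 4$ for an 8 × 8 HR image with $L_1 = L_2 = 2$. A dense, per-pixel motion field would replace `shift` in a more realistic simulation.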
We next outline the steps of the proposed algorithm.
1) The first step of our algorithm is to evaluate the term $M_k$ from Eq. (3) by using optical flow motion estimation. We use a popular optical flow method, the Farneback algorithm, named after its creator, Gunnar Farneback.
The Farneback algorithm generates an image pyramid, where each level has a lower resolution than the previous one. It is a dense method, meaning that it estimates a motion vector for every pixel in the image. The algorithm consists of the following steps [54]:
a) Preprocessing: The input frames are preprocessed to enhance their quality. Preprocessing steps include noise reduction and color space conversion.
b) Image pyramids: The Farneback algorithm constructs a Gaussian pyramid for each frame. This involves creating a series of downsampled versions of the original image, forming a hierarchy of images with decreasing resolution. The pyramids make it possible to capture motion at multiple scales, improving the accuracy of the optical flow estimation.
c) Optical flow estimation: For each level of the pyramid, the Farneback algorithm computes the optical flow using a combination of polynomial expansion and spatial filtering. It estimates the local flow vectors from the change in the polynomial expansion coefficients between corresponding neighborhoods of the two frames.
d) Upsampling and refinement: Once the optical flow is computed at the coarsest level of the pyramid, it is successively refined by upsampling the flow field and incorporating the local information from higher-resolution levels. This refinement process improves the accuracy of the flow estimation, particularly for small and fast-moving objects.
The result of the Farneback method is a dense optical flow field, where each pixel has an associated motion vector. These vectors represent the direction and magnitude of the motion of objects in the scene between consecutive frames. We assume that one of the LR images, $y_{mid}$ (typically the middle one), is produced from the HR image $x$ by applying only downsampling, without motion shift. Thus, $M_{mid} = I$. Optical flow is calculated between $y_{mid}$ and each of the remaining LR images, which yields $M_k$ for the remaining $p - 1$ images.
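2) The second step of our algorithm is based on the PnP-ADMM method. Specifically, we run PnP-ADMM, following the steps described in Algorithm 1, until convergence. Here $x^0$ is the initial value of the HR image, obtained by multiplying $y_{mid}$ by the pseudo-inverse of $A_{mid}$ and then denoising with DnCNN; $D$ is the image denoising operator (a neural network); and $g$ is defined as

$$ g(x) = \tfrac{1}{2}\|y - Ax\|_2^2. $$

Algorithm 1: PnP-ADMM [42]
1: input: $u^0 = 0$, $x^0$, and $\gamma > 0$
2: for $k = 1, 2, \ldots, t$ do
3:   $z^k \leftarrow \operatorname{prox}_{\gamma g}(x^{k-1} - u^{k-1})$
4:   $x^k \leftarrow D(z^k + u^{k-1})$
5:   $u^k \leftarrow u^{k-1} + z^k - x^k$
6: end for

One important property of ADMM is that it does not explicitly require knowledge of $g(x)$ or its gradient, relying instead on the proximal operator, which is defined as

$$ \operatorname{prox}_{\gamma g}(z) := \arg\min_x \left\{ \tfrac{1}{2}\|x - z\|_2^2 + \gamma g(x) \right\}. $$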
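The structure of the PnP-ADMM iteration can be sketched on a toy 1-D problem in plain Python. This is not the implementation used in the paper: the forward operator is assumed to be the identity, so that $g(x) = \tfrac{1}{2}\|y - x\|_2^2$ has a closed-form proximal map, and a three-tap smoothing filter stands in for the DnCNN denoiser. The function names are hypothetical.

```python
def prox_data(v, y, gamma):
    """Proximal map of gamma*g for g(x) = 0.5*||y - x||^2 (identity forward model, assumed)."""
    return [(vi + gamma * yi) / (1.0 + gamma) for vi, yi in zip(v, y)]

def denoise(v):
    """Stand-in denoiser D: [1/4, 1/2, 1/4] smoothing with clamped borders (not DnCNN)."""
    n = len(v)
    return [0.25 * v[max(i - 1, 0)] + 0.5 * v[i] + 0.25 * v[min(i + 1, n - 1)]
            for i in range(n)]

def pnp_admm(y, x0, gamma=1.0, iters=100):
    """Run PnP-ADMM and return the final (x, z, u) triple."""
    x, u = list(x0), [0.0] * len(x0)
    z = list(x0)
    for _ in range(iters):
        # Data-fidelity step: z = prox_{gamma g}(x - u).
        z = prox_data([xi - ui for xi, ui in zip(x, u)], y, gamma)
        # Denoising step: x = D(z + u).
        x = denoise([zi + ui for zi, ui in zip(z, u)])
        # Dual update: u = u + z - x.
        u = [ui + zi - xi for ui, zi, xi in zip(u, z, x)]
    return x, z, u

y = [0.0, 0.1, 0.9, 1.1, 1.0, 0.2, -0.1, 0.0]   # toy noisy 1-D signal
x_hat, z_hat, u_hat = pnp_admm(y, x0=y)
```

For this linear stand-in denoiser the iteration settles to a point where `x_hat` and `z_hat` agree. A learned denoiser and a non-trivial forward operator would replace `denoise` and `prox_data` in a real setting.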

IV. PROOF OF CONVERGENCE
The key conceptual observation is that PnP algorithms with black-box denoisers do not, in general, solve an optimization problem. While the original ADMM algorithm solves the optimization problem, replacing the proximal operator with a black-box denoiser, denoted as D, means that there may be no corresponding function h being minimized. Specifically, numerical evaluation of widely used denoisers, such as BM3D and DnCNN, shows that their Jacobians are not symmetric, which implies that these denoisers are neither gradient descent steps nor proximal maps [55].
Nevertheless, it remains possible to characterize the converged solution of PnP through the consensus equilibrium formulation proposed in [56]:

$$ x = G(x - u) \quad \text{and} \quad x = D(x + u), \tag{6} $$

where $G := \operatorname{prox}_{\gamma g}$ and $x$, $u$ are the converged values of PnP-ADMM. In the consensus equilibrium expression (6), $x$ represents the final reconstruction and $u$ can be interpreted as noise that is removed by the denoiser in $x = D(x + u)$ on the one hand and counterbalanced by the data-fidelity operator in $x = G(x - u)$ on the other. To derive (6), note that the fixed points $z$, $x$, and $u$ of the PnP-ADMM iteration satisfy

$$ z = G(x - u), \qquad x = D(z + u), \qquad u = u + z - x. $$

From the last equation we conclude that $x = z$, which leads directly to (6). Also, the first-order optimality condition for the minimization problem that defines $G$ shows that, for differentiable $g$, $x = G(x - u)$ corresponds to $\gamma \nabla g(x) = -u$.
Monotone operator theory, as outlined in [57], can be used to establish the convergence of PnP algorithms. In this approach, the first step is to identify a high-dimensional operator whose fixed points can be found iteratively, provided the appropriate assumptions are met. In the proof of PnP-ADMM convergence presented in [56] and [58], the initial step is to establish a one-to-one correspondence between the fixed points of PnP-ADMM and those of the operator

$$ T := (2G - I)(2D - I). $$

After a linear change of coordinates, Algorithm 1 is equivalent to the Mann iteration of $T$, expressed as

$$ w^k = \tfrac{1}{2} w^{k-1} + \tfrac{1}{2} T(w^{k-1}) $$

[56]. This results in linear convergence towards a unique fixed point when $T$ is a contraction, a condition satisfied when $g$ is strongly convex and $R := I - D$ is a sufficiently strong contraction [58]. Weaker conditions lead to sublinear convergence to a potentially non-unique fixed point [59]. Other notable theoretical results on PnP-ADMM include its convergence for implicit proximal operators [43], for bounded denoisers [60], and for linearized Gaussian mixture model (GMM) denoisers [50]. Even CNN-based denoisers can be trained to meet these contractive, non-expansive, or Lipschitz conditions through spectral normalization techniques [58], [61]. Conversely, when $g$ is only mildly convex and the denoiser $D$ is strongly non-expansive, the iteration converges sublinearly towards its fixed point [62].
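The contraction condition on $R := I - D$ can be probed numerically. The sketch below, in plain Python, samples random input pairs and records the largest observed ratio $\|R(a) - R(b)\| / \|a - b\|$. It is only an illustration: the three-tap smoothing filter is a stand-in for a real denoiser, the helper names are hypothetical, and random sampling gives only a lower bound on the Lipschitz constant, so it can refute a contraction claim but not prove one.

```python
import math
import random

def denoise(v):
    """Stand-in denoiser D: [1/4, 1/2, 1/4] smoothing with clamped borders (not DnCNN)."""
    n = len(v)
    return [0.25 * v[max(i - 1, 0)] + 0.5 * v[i] + 0.25 * v[min(i + 1, n - 1)]
            for i in range(n)]

def residual(v):
    """R = I - D: the part of the input that the denoiser removes."""
    return [a - b for a, b in zip(v, denoise(v))]

def norm(v):
    """Euclidean norm of a vector stored as a list."""
    return math.sqrt(sum(a * a for a in v))

def lipschitz_lower_bound(op, dim, trials, rng):
    """Largest observed ||op(a) - op(b)|| / ||a - b|| over random pairs."""
    best = 0.0
    for _ in range(trials):
        a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        b = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        num = norm([p - q for p, q in zip(op(a), op(b))])
        den = norm([p - q for p, q in zip(a, b)])
        best = max(best, num / den)
    return best

est = lipschitz_lower_bound(residual, dim=16, trials=200, rng=random.Random(0))
```

For this particular linear filter the residual operator has spectral norm below one, so `est` stays below 1. For a learned denoiser the same probe would need many more samples, and techniques such as spectral normalization are what actually enforce the bound.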

V. RESULTS
We implemented our PnP method in SCICO [63], which is an open source library for computational imaging that includes implementations of PnP algorithms.
We conducted experiments on the benchmark subsets "Calendar" and "City" from the Vid4 dataset to evaluate the performance of our proposed method. Specifically, we used p = 3 frames, with the second in order being the zero-motion image, and we added Gaussian noise with a standard deviation of 0.02. The up-sampling factors in the horizontal and vertical directions were $L_1 = L_2 = 4$. For the denoising operator D, the DnCNN neural network [48] was used, as pre-trained and distributed with SCICO. Finally, we compared our results against other successful video super-resolution techniques in terms of both quantitative metrics, such as PSNR (peak signal-to-noise ratio), and visual quality.
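For reference, PSNR can be computed as in the following plain-Python sketch. The peak value of 1.0 assumes images scaled to the range [0, 1], which is an assumption made for this example, and the function name `psnr` is hypothetical.

```python
import math

def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized 2-D images."""
    diffs = [(r - e) ** 2
             for ref_row, est_row in zip(reference, estimate)
             for r, e in zip(ref_row, est_row)]
    mse = sum(diffs) / len(diffs)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)

ref = [[0.0, 0.5], [1.0, 0.25]]
est = [[0.1, 0.4], [0.9, 0.35]]   # every pixel is off by 0.1, so the MSE is 0.01
value = psnr(ref, est)            # approximately 20 dB
```

Averaging this quantity over all frames of a sequence gives a per-dataset figure of the kind reported in the results.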
Table 1 shows the PSNR results on the two datasets for all the methods tested. The average PSNR of our method is 22.86 dB on the "Calendar" dataset and 25.74 dB on the "City" dataset, while all the other methods have lower values. The highest PSNR values were obtained for Frame 17 of "Calendar" (Fig. 1) and Frame 14 of "City" (Fig. 2). Apart from the numerical results, the visual evidence is also in favor of our method, since the super-resolved images are clearer than those produced by the other methods. Examples can be seen in Fig. 3 and Fig. 4, which show the results for the original images in Fig. 1 and Fig. 2. There is no image result for EDVR, since its results were taken from [67]. Fig. 5 and Fig. 6 show the per-frame PSNR for the "Calendar" and "City" datasets, respectively, for all the methods tested.
Frames 9, 10, and 11 of the "Calendar" dataset show a much lower PSNR for APGM, BM3D, and TV, because these frames differ more from the others and these methods are more sensitive to motion than ours.
The results demonstrate the superior performance of our approach in terms of reconstruction accuracy and preservation of fine details and textures. Our method needs no training, since DnCNN is pre-trained. Finally, the runtime of our method is 12 seconds per frame, measured on Google Colab with a T4 GPU.

VI. CONCLUSION
PnP techniques have established themselves as a standard tool for computational imaging since their introduction in 2013. They have been used in a remarkable variety of applications, where they provide cutting-edge performance. When first introduced, they were arguably the first practical approach to integrating learned models with imaging physics to solve inverse imaging problems. The ease with which they can be implemented was a major factor in their rapid rise in popularity. Since then, alternative strategies have emerged that, in some cases, result in improved reconstruction performance; however, this is achieved at the expense of a potentially time-consuming and data-dependent application-specific training procedure. In this paper, we proposed a PnP method for video super-resolution (resolution enhancement) with motion estimation. The convergence property of the proposed algorithm is analyzed in detail. More importantly, experimental results show the validity of our algorithm and its superiority compared to other state-of-the-art methods.

FIGURE 4. Result of "City" image.
TABLE 1. Average PSNR values for the two datasets for all the methods.

FIGURE 5. PSNR values of the 29 images of the "Calendar" dataset for all the methods tested.

FIGURE 6. PSNR values of the 29 images of the "City" dataset for all the methods tested.