Robust Single-Image Super-Resolution via CNNs and TV-TV Minimization

Single-image super-resolution is the process of increasing the resolution of an image, obtaining a high-resolution (HR) image from a low-resolution (LR) one. By leveraging large training datasets, convolutional neural networks (CNNs) currently achieve the state-of-the-art performance in this task. Yet, during testing/deployment, they fail to enforce consistency between the HR and LR images: if we downsample the output HR image, it never matches its LR input. Based on this observation, we propose to post-process the CNN outputs with an optimization problem that we call TV-TV minimization, which enforces consistency. As our extensive experiments show, such post-processing not only improves the quality of the images, in terms of PSNR and SSIM, but also makes the super-resolution task robust to operator mismatch, i.e., when the true downsampling operator is different from the one used to create the training dataset.


I. INTRODUCTION
I N science and engineering, images acquired by sensing devices often have resolution well below the desired one.Common reasons include physical constraints, as in astronomy or biological microscopy, and cost, as in consumer electronics or medical imaging.Creating high-resolution (HR) images from low-resolution (LR) ones, a task known as superresolution (SR), can therefore be extremely useful in these areas; it enables, for example, the identification of structures or objects that are barely visible in LR images.Doing so, however, requires inferring values for the unobserved pixels, which cannot be done without making assumptions about the class of images to super-resolve and their acquisition process.
Classical interpolation algorithms assume that the missing pixels can be inferred by linearly combining neighboring pixels via, e.g., the application of filters [3], [4], or by preserving the statistics of the image gradient from the LR to the HR image [5].Reconstruction-based methods, on the other hand, assume that images have sparse representations in some domain, e.g., sparse gradients [6]- [13].More recently, data-driven methods have become very popular; their main assumption is that image features can be learned from training data, via dictionaries [14], [15] or via convolutional neural networks (CNNs) [16]- [18].
CNNs were first applied to image SR in the seminal work [16] and have ever since remained the state-of-the-art, both in terms of reconstruction performance and computational complexity (during deployment/testing).By relying on vast databases of images for training, such as ImageNet [19] or T91 [14], they can effectively learn to map LR images/patches to HR images/patches.Although training a CNN can take several days, applying it to an image (what is typically called the testing phase) takes a few seconds or even sub-seconds.Despite these advantages, the knowledge that CNNs extract from data is never made explicit, making them hard to adapt to new scenarios: for example, simply changing the scaling factor or the sampling model, e.g., from bicubic to point sampling, almost always requires retraining the entire network.More conspicuously, however, is that during testing SR CNNs fail to guarantee the consistency between the reconstructed HR image and the input LR image, effectively ignoring precious "measurement" information, as we will illustrate shortly.Ignoring such information makes CNNs prone to generalization errors and, as a consequence, also less robust.
Curiously, adaptability and measurement consistency are the main features of classical reconstruction-based methods, which consist of algorithms designed to solve an optimization problem.To formulate such an optimization problem, one has to explicitly encode the measurement model and the assumptions about the class of images to be super-resolved.Although this explicit encoding confers reconstruction-based methods great adaptability and flexibility, it naturally limits the complexity of the assumptions, which is one of the reasons why reconstruction-based methods are outperformed by datadriven methods (CNNs).This motivates our problem: Problem statement.Can we design SR algorithms that learn from large quantities of data and, at the same time, are easily adaptable to new scenarios and guarantee measurement consistency during the testing phase?In other words, can we design algorithms that have the advantages of both data-driven and model-based methods?
Lack of consistency by CNNs.Before summarizing our method, we describe how CNNs fail to enforce consistency between the reconstructed HR image and the input LR image.Although we illustrate this phenomenon here for the specific SRCNN network [16], more systematic experiments can be found in Section IV.The top-left corner of Fig. 1 shows a ground truth (GT) image X ∈ R M ×N , which the algorithms have access to during training, but not during testing. 1 We will represent X by its column-major vectorization x = vec(X ) ∈ R n , where n = M • N .The GT image x is downsampled via a linear operator represented by A ∈ R m×n 3) We propose an algorithm based on the alternating direction method of multipliers (ADMM) [20] to solve the TV-TV minimization problem.In contrast with most SR methods, which process image patches independently, our algorithm processes full images at once.It also easily adapts to different degradations and scaling factors.4) We conduct extensive experiments that illustrate not only the robustness of our algorithm under different degradation operators, but also how it systematically improves (in terms of PSNR and SSIM) the outputs of state-of-the-art SR networks, such as ESRGAN [17], VDSR [18] and LapSRN [21].We highlight the following differences with respect to our previous work in [1], [2].We now explore and illustrate with experiments the underlying reason why our framework improves the output of state-of-the-art SR CNNs.We also describe how the proposed optimization problem can be solved efficiently using ADMM; in fact, the algorithms used in [1], [2] were different and less efficient than the algorithm we present here.Our experiments are also much more extensive: they consider different sampling operators to illustrate robustness to operator mismatch, and include many more algorithms, e.g., FSRCNN [22], VDSR [18], LapSRN [21], SRMD [23], IRCNN [24], and ESRGAN [17].
Organization.Section II summarizes prior work on SR algorithms, and Section III describes the proposed framework and optimization scheme.Section IV then reports our experimental results, and Section V concludes the paper.

II. RELATED WORK
Super-resolution (SR) schemes are often labeled as interpolation, reconstruction, or data-driven.Interpolation methods infer the missing pixels by locally applying an interpolation function such as the bicubic or bilinear filter [3], [4].As they have been surpassed by both reconstruction and learning SR algorithms, we will limit our review to the latter.

A. Reconstruction-Based SR
Reconstruction-based schemes view SR as an image reconstruction problem and address it by formulating an optimization problem.In general, the optimization problem has two terms: a data consistency term that encodes assumptions about the acquisition process, usually that Ax b (where x is the optimization variable), and a regularization term on x that encodes assumptions about the class of images.Different methods differ mostly on the image assumptions.
Image assumptions.Reconstruction SR methods encode assumptions about the images by penalizing in the optimization problem measures of complexity.These reflect the empirical observation that natural images have parsimonious representations in several domains.Examples include sparsity in the wavelet domain [25], sparsity of image patches in the DCT domain [26] and, as we will explore shortly in more detail, sparsity of image gradients [6]- [13].Since sparsity is well captured by the 1 -norm, the resulting optimization problem is typically convex and can be solved efficiently.A more challenging assumption is multi-scale recurrence [27], [28], which captures the notion that patches of natural images occur repeatedly across the image.For example, [27] proposed an SR algorithm that explores the recurrence of image patches both in the same and in different scales.
Total variation.In natural images, the number of pixels that correspond to an edge, i.e., a transition between different objects, is a small percentage of the total number of pixels.This can be measured by the total variation (TV) of the image [6].Although TV was initially defined in the context of partial differential equations, there has been work that discretizes the differential equations [9], [10] or that directly defines TV in the discrete setting [11], [12], [29].Although there are several definitions of discrete TV, the most popular are the isotropic TV, which consists of the sum (over all pixels) of the 2norms of the vectors containing the horizontal and vertical differences at each pixel, and the anisotropic TV, which is similar to isotropic TV but with the 2 -norms replaced by the 1 -norm.Both definitions yield convex, yet nondifferentiable, functions.Many algorithms have been proposed to solve problems involving discrete TV, including primal-dual methods [8], [30], and proximal and gradient-based schemes [11]- [13].
The concept of TV has been used in many imaging tasks, from denoising [6], [11], [29] to super-resolution [9]- [11].For example, [9] discretizes a differential equation relating variations in the values of pixels to the curvature of level sets, while enforcing fidelity to the LR image.The work in [10] proposed a similar method, but with more complex models for both TV and the image acquisition process.

B. Learning-based Algorithms
Learning-based algorithms typically consist of two stages: training, in which a map from LR to HR patches is learned from a database of training images, and testing, in which the learned map is applied to super-resolve an unseen image.
Manifold learning.Manifold learning relies on the observation that most data (e.g., patches of images) lie on a lowdimensional manifold.An example is locally linear embedding [31] which, like PCA, learns a compact representation of data, but considers nonlinear geometry instead.This idea was applied to SR in [32] by assuming that LR and HR patches lie on manifolds with similar geometry.Thus, given a set of training HR and LR patches and a target LR image, the method in [32] first finds the LR patches in the training set that are neighbors of the target LR patch.Then, it uses the HR version of those neighbors to infer the HR target patches.The resulting algorithm can over-or under-fit the data depending on how many nodes (i.e., patches) define a neighborhood.
Dictionary learning.In dictionary learning, also known as sparse coding, patches of HR images are assumed to have a sparse representation on an over-complete dictionary, which is learned from training images.For example, [14] uses training images to learn dictionaries for LR and HR patches while constraining corresponding patches to have the same coefficients.Other schemes use similar concepts, but require no training data at all.For example, [15] uses self-similarity to learn the LR-HR map without any external database of images.
CNN-based methods.The advent of deep learning and the availability of large image datasets inspired the application of CNNs to SR.Currently, they surpass any reconstructionor interpolation-based method both in reconstruction performance and in execution time (during testing).The first CNN for SR was proposed by [16]; although its design was inspired by dictionary learning methods, the proposed architecture set a new standard for SR performance.
SR networks can be classified as direct or progressive.In direct networks, the LR image is first upscaled, typically via bicubic interpolation, to the required spatial resolution, and then is fed to a CNN, as in [16], [18], [22], [33].In this case, the CNN thus learns how to deblur the upscaled image.As previously mentioned, CNN architectures need to be retrained every time we change the scaling factor.To overcome this, [34] repeatedly applied a recursive convolutional layer to obtain the super-resolved image.However, since the LR input is blurry, the CNN outputs a HR image lacking fine details.Inspired by this observation, [35] proposed the SRGAN, which produces photo-realistic HR images, even though they do not yield the best PSNR.As direct networks operate on high-dimensional images, their training is computationally expensive [36].
Progressive networks, in contrast, have reduced training complexity, as they directly process LR images.Specifically, the upsampling step, performed using sub-pixel or transposed convolution [36], is applied only at the end of the network.For instance, LapSRN [21] used the concept of Laplacian pyramids, in which each network level is trained to predict residuals between the upscaled images at different levels in the pyramid.More recently, [37] proposed a fully progressive network that super-resolves images by an upsampling factor of two at each level until the desired factor is reached.
In spite of achieving state-of-the-art performance, CNNs for SR suffer from two major shortcomings: as already illustrated, they fail to guarantee the consistency between the LR and HR image during testing, and the trained network applies only to a unique scaling factor and degradation function.Most of the CNNs, e.g.[16], [22], [23], are trained by solving minimize where x (t) represents (the vectorization of) the tth image in the training set, A the bicubic sampling operator, and f θ (•) a CNN parameterized by θ (i.e., weights and biases of the neural connections).Most CNNs are trained with images that have been downsampled with a bicubic filter.Thus, their performance quickly degrades when the true downsampling operator is different.Indeed, during testing, the true degradation is unknown.The work in [23] addresses this problem by designing a network that deals with different degradations by accepting as input both the blur kernel and the noise level.

C. Plug-and-Play Methods
A different line or work blends learning-and modelbased methods.The main observation is that, when solving linear inverse problems, proximal-based algorithms separate the operations of measurement consistency and problem regularization (using prior knowledge).The latter usually consists of a simple operation, like soft-thresholding, which encodes the assumptions about the target image and which can be viewed as a denoising step.Given its independence from the measurements, such operation can be replaced by a more complex function, such as a CNN.The resulting algorithms are versatile, as the measurement operator can be easily modified.Most work in this area, however, has focussed on compressed sensing, in which the measurement operator is typically a dense random matrix; see [38]- [40].
The One-Net [41], for example, replaces the proximal operator associated to image regularization in an ADMM algorithm with a CNN trained on a large database.The experiments in [41] considered SR, but the resulting network does not perform as well as current leading CNN-based methods.
The pioneering work in [42] takes this idea further and proposes a scheme that requires no training at all.There, an untrained CNN is used as a prior.Specifically, a linear inverse problem is reparameterized as a function of the weights of a CNN whose input is a noisy/corrupted image and whose output is the denoised/reconstructed image.Such reparameterization provides a type of regularization.The work in [43] combined this idea with regularization by denoising and used ADMM to solve the resulting linear inverse problem.See [24], [44] for related work.These algorithms require no training at all and can be easily adapted to different measurement operators.However, they can be particularly slow, as each iteration requires some backpropagation iterations on the CNN.And, when applied to SR, they are still outperformed by trainingbased CNN architectures.

A. Main Model and Assumptions
We aim to reconstruct the vectorized version of an HR image x ∈ R n from a LR image b ∈ R m , with m < n.We assume that these quantities are linearly related: where A ∈ R m×n represents the downsampling operator.The model in ( 1) is often used in reconstruction-based and dictionary learning algorithms [10], [14], [25], even though many methods also consider additive noise: b = Ax + , where is a Gaussian random vector [13], [41], [45]- [47].More interesting, however, is that CNN-based methods implicitly assume the model in (1), although that is rarely acknowledged.In particular, all the SR networks we know of (e.g., [16], [17], [21], [22]) are trained with HR images that are downsampled according to (1), where A implements bicubic downsampling.We next discuss other possible choices for A.
Common choices for A. Different instances of A ∈ R m×n in (1) have been assumed in the SR literature: • Simple subsampling: A contains equispaced rows of the identity matrix I n ∈ R n×n , i.e., each row of A is a canonical vector (0, ..., 0, 1, 0, ..., 0).This operator is simple to implement, but often introduces aliasing.• Box-averaging: if the scaling factor is s, then each row of A contains s 2 nonzero elements, equal to 1/s 2 , in positions corresponding to a neighborhood of a pixel.In other words, box-averaging replaces each block of s × s pixels by their average.Although simpler than the bicubic operator, it does not introduce the aliasing that simple subsampling does; see, e.g., [41].In our experiments, we will mostly instantiate A as a bicubic operator.The reason is that most SR CNNs assume this operator during training.Simple subsampling and box-averaging will be used to illustrate how our post-processing scheme adds robustness to operator mismatch, i.e., when A is different during training and testing.
Assumptions.We estimate x ∈ R n from b ∈ R m by taking into account two possibly conflicting assumptions: 1) x has a small number of edges, as captured by a small total variation (TV); 2) x is also close to the side information w, the image returned by a learning-based method (CNN), where the notion of distance is also measured by TV.For a given vectorization x ∈ R n of an image X ∈ R M ×N , the anisotropic 2D TV (semi-)norm is defined as [7], [11] x In ( 2), v ij ∈ R n and h ij ∈ R n extract the vertical and horizontal differences at pixel (i, j) of X.By concatenating v ij (resp.h ij ) as rows of V ∈ R n×n (resp.H ∈ R n×n ), we obtain the representation in (3), where • 1 is the 1norm (sum of absolute values).And the matrix D ∈ R 2n×n in (4) is the vertical concatenation of V and H.We assume periodic boundaries, so that both V and H are circulant.
As circulant matrices are diagonalizable by the DFT, matrixvector products by both V and H can be computed via the FFT in O(n log n) time.

B. Our Framework
The framework we propose is shown schematically in Fig. 2. It starts by super-resolving b into w ∈ R n with a base method, which we assume is implemented by a CNN due to their current outstanding performance.As explained in Section I, CNNs fail to enforce measurement consistency during testing, i.e., Aw = b for any matrix A that is assumed to implement the downsampling operation.
We propose to use an additional block that takes in both the HR output w of the CNN and the LR image b, and creates another HR image x.The block implements what we call TV-TV minimization, which enforces measurement consistency while guaranteeing that assumptions 1) and 2) are met.This is explained next.
TV-TV minimization.Given the LR image b and a HR image w, TV-TV minimization consists of minimize where x ∈ R n is the optimization variable, and β ≥ 0 tradeoffs between assumptions 1) and 2).Indeed, the first term in the objective of ( 5) encodes assumption 1), the second term assumption 2), and the constraints enforce measurement consistency.Of course, our assumptions 1)-2) can be easily modified to better capture the class of images to be superresolved.We found that using TV semi-norms in the objective yielded better results.In addition, as these functions are convex, problem ( 5) is convex as well.
Although a problem like (5) has appeared before in [48] in the context of dynamic computed tomography (CT), the side information w there was an image reconstructed by solving the same problem in the previous instant; see also [49]- [51].Our approach is conceptually different in that we use (5) to improve the reconstruction of a CNN-based method.
Next, we show how TV-TV minimization relates to 1 -1 minimization, and how the theory for the latter in [49] suggests that selecting β = 1 in (5) may lead to better performance.
Relation to 1 -1 minimization.Introducing an auxiliary variable u ∈ R 2n and defining w := Dw, we rewrite (5) as minimize Thus, x := (u, x) ∈ R 3n is the full optimization variable.Define A := 0 m×2n A; −I 2n D , and b := b 0 2n , where 0 a×b (resp.0 a ) represents the zero matrix (resp.vector) of dimensions a×b (resp.a×1), and I 2n is the identity matrix in R 2n .This enables us to rewrite (6) as minimize where G 2n ∈ R 2n×3n contains the first 2n rows of the identity matrix I 3n .In other words, for a vector v ∈ R 3n , G 2n v represents the first 2n components of v.The work in [49] analyzes (7) when G 2n is the full identity matrix and the entries of A are drawn from a Gaussian distribution.Specifically, it provides the number of measurements required for perfect reconstruction under these assumptions.It is shown both theoretically and experimentally that the best reconstruction performance is obtained when β = 1.Although the theory in [49] cannot be easily extended2 to (7), our experiments indicate that β = 1 still leads to the best results in our setting.

C. Algorithm for TV-TV minimization
We now explain how to efficiently solve TV-TV minimization (5) with the alternating direction method of multipliers (ADMM) [20].In contrast with the majority of SR algorithms, which operate on individual patches, our algorithm operates on full images.We do that by capitalizing on the fact that matrix-vector multiplications can be performed fast whenever the matrix is D [cf.(2)] or any of the instantiations of A mentioned in Section III-A.
ADMM.The problem that ADMM solves is minimize y,z f (y) + g(z) where f and g are closed, proper, and convex functions, and F and G are given matrices.Associating a dual variable λ to the constraints of ( 8), ADMM iterates on k where ρ > 0 is the augmented Lagrangian parameter.Applying ADMM.Although there are many possible reformulations of (5) to which ADMM is applicable, they can yield different performances.Our reformulation simply adds another variable v ∈ R n to (6) [which is equivalent to (5)]: We establish the following correspondence between ( 8) and ( 10): we set y = (u, x), z = v, and assign where i Ax=b (x) is the indicator function of Ax = b, i.e., it evaluates to 0 if Ax = b, and to +∞ otherwise.This means that we dualize only the last two constraints of (10), and thus λ has two components: λ = (η, µ) ∈ R 2n × R n .The above correspondence yields closed-form solutions for the problems in (9a) and (9b) (see below).Furthermore, even though the objective of ( 10) is not strictly convex, the fact that F and G have full column-rank implies that the sequence (y k , z k ) generated by ADMM (9) has a unique limit point, which solves (10) [52].We now elaborate on how to solve (9a)-(9b).
Solving (9a).Using the above correspondence, problem (9a) decouples into two independent problems that can be solved in parallel: where we defined s k := Dv k + η k and p k := v k − µ k .Problem (11) decomposes further componentwise, and the solution for each component can be obtained by evaluating the respective optimality condition (via subgradient calculus).Namely, for i = 1, . . ., 2n, if w i ≥ 0, then component u k+1 i is given by 12) is the projection of p k onto the solutions of Ax = b.Assuming that A has full row-rank, i.e., AA is invertible, ( 12) also has a closed-form solution: whose computation has a complexity that depends on the properties of the downsampling operator A. When A is simple subsampling or the box-averaging operator, AA is the identity matrix I m or a multiple of it.In that case, computing (15) requires only two matrix-vector operations which, due to the structure of A, can be implemented by indexing.In other words, there is no need to construct A explicitly.On the other hand, when A is the bicubic operator, the inverse of AA can no longer be computed easily, and we solve the linear system in (15) with the conjugate gradient method.In this case, matrix-vector products can be computed in O(n log n) time using the FFT.
Solving (9b).With our choice of g, F , and G, problem (9b) becomes Given the definition of D in ( 3)-( 4), we have where the last step uses the fact that V and H are circulant matrices and, therefore, are generated by some vectors v and h, respectively.Also, C n denotes the DFT matrix in R n , and Diag(x) is a diagonal matrix with the entries of x in its diagonal.This representation of I n +D D not only enables us to compute its inverse in closed-form (just take the inverse of the matrix in parenthesis), but also to do it without constructing any matrix explicitly.Dual updates.Finally, since λ decomposes as (η, µ), the dual variable update in (9c) becomes Applying ADMM (9) to the equivalent reformulation (10) of TV-TV minimization ( 5) therefore yields steps ( 13)-( 14) for each component of u, (15) for x, (16) for v, and ( 17) for the dual variables.These steps are repeated iteratively until a stopping criterion is met; we use the one suggested in [20].

IV. EXPERIMENTS
We now describe our experiments.After explaining the experimental setup, we expand on the phenomenon described in Fig. 1.Then, we consider the case of operator mismatches (i.e., A is different during training and testing), and show how our framework adds significant robustness in this scenario.Finally, we report experiments on standard SR datasets.Code to replicate our experiments is available online. 3

A. Experimental Setup
Algorithm parameters.Most experiments were run using the same algorithm settings, unless indicated otherwise.The hyperparameter β in (5) was always set to 1 and, for most experiments, A was the bicubic operator via MATLAB's IMRESIZE.For ADMM, we adopted the stopping criterion in [20, §3.3.1] with pri = dual = 0.001, or stopped after 1500 iterations.Also, we initialized ρ = 0.5 and adjusted it automatically using the heuristic in [20, §3.4.1].
Computational platform.All experiments were run on Matlab (R2019a) using a workstation with 12 core 2.10GHz Intel(R) Xeon(R) Silver 4110 CPU and two Nvidia GeForce RTX GPUs.
Methods evaluated.We compared our framework against the state-of-the-art methods in Table I and also considered simple TV minimization, i.e., (5) with β = 0, using the TVAL3 solver [8].The table shows the acronyms and references of the methods, their main technique, the scaling factors (S.F.) considered in the original papers, and the datasets used for During training, the HR images are converted to LR images by applying MATLAB's IMRESIZE, as originally done in [16].Other methods, such as [64], work on UINT8 images directly or use Python's IMRESIZE, which is different from MATLAB's.To keep our experiments consistent, we did not consider such methods.
The output images for Kim [57], SRCNN [16] and Self-ExSR [56] were retrieved from an online repository. 4For the remaining methods, we generated the outputs from the available pretrained models.
Performance metrics.We compared different algorithms by evaluating the PSNR (dB) and SSIM [65] on the luminance channel of the output images.As these metrics do not often capture perceptual quality, we also provide sample images for qualitative evaluation.

B. Measurement Inconsistency of CNNs
We show that the phenomenon illustrated in Fig. 1 for SRCNN [16] occurs not only for this network, but is pervasive.That is, CNNs for SR fail to enforce measurement consistency (1) during testing.We chose three images for this purpose: Baboon from Set14 [54], 38092 from BSD100, and img 005 from Urban100 [56].Every image is downsampled with MATLAB's IMRESIZE, which is the procedure executed for training each CNN, and the resulting LR image is fed into the network.We chose a scaling factor of 4.
Results.Table II shows the results for a subset of methods in Table I.In the 3rd column, it displays the 2 -norm of the difference between the downsampled HR outputs, i.e., Aw, and the input LR image b; in the 4th column, it shows the same quantity after feeding the corresponding w (and b, cf.Fig. 2) to our method.It can be seen our post-processing improves consistency by 6 orders of magnitude.Note that even though SRMD models various degradations without retraining, it still fails to ensure consistency.IRCNN is a plug-and-play method and, as a result, can also handle different degradation models.Although it achieves better consistency than pure CNN-based methods, it is still 5 orders of magnitude below our scheme.The last row of Table II shows the consistency of DeepRED [43], which processes the three RGB channels simultaneously.For this reason, it was difficult to provide a fair comparison with our method.

C. Robustness to Operator Mismatch
As previously stated, most SR CNNs are trained by downsampling a HR into a LR image using the bicubic operator.If, during testing, A is different from the bicubic operator then, as we will see, there can be a serious drop in performance.This may indeed limit the applicability of CNNs in real-life scenarios.Our approach, however, mitigates this effect and adds robustness to the SR task.We considered the same images and methods as in Table II, with DeepRed replaced by TVAL3, and considered the operators for A described in Section III-A: bicubic, box averaging, and simple subsampling.Results.Each shaded box in Table III shows, for each subsampling operator, the PSNR values obtained by a given method, and by subsequently processing its output with our scheme.While all methods perform the best under bicubic subsampling, there is a performance drop for box filtering, and an even larger drop for simple subsampling.Note that our method systematically improves the output of all the networks, even for bicubic subsampling.And while the improvement is of less than 1dB for bicubic subsampling, it averages around 2dBs for simple subsampling.Indeed, the performance of the CNNs for this case drops so much that there is a large margin for improvement.Interestingly, TVAL3, which solves (5) with β = 0, is the worst method for bicubic subsampling, but approaches the performance of CNNs for box averaging and, besides ours, becomes the best for simple subsampling.Hence, this illustrates that reconstruction-based methods can be more robust and adaptable than CNN architectures.

D. Standard Datasets With Bicubic Downsampling
We also conducted more systematic experiments using the standard datasets Set5, Set14, BSD100, and Urban100, under different scaling factors and using bicubic downsampling only.
Quantitative results.Table IV displays the average PSNR and SSIM, as well as the average execution time of our method (in seconds), for 2×, 4×, and 8× scaling factors.Each shaded area shows the performance of a given (learning-based) reference method, the performance of our scheme applied to the output of that reference method, and the average execution time (of our method).For easy comparison, the values for TVAL3 occur repeatedly in different vertical sub-blocks of the table.Note that the 4th vertical sub-block of the table is the only one with values for 8× upsampling, since LapSRN is the only method that can handle such a scaling factor.Also, as ESRGAN was designed specifically for 4× upsampling, we do not present its values for other scaling factors.All results for our method were generated with β = 1 in (5), except when we considered LapSRN and ESRGAN for the Urban100 dataset, in which case we set β = 2.
An obvious pattern in the table is that our method consistently improves the outputs of all the methods in terms of PSNR and SSIM, except in a small subset of cases.The improvements range between 0.0023 and 0.3796 dB.One of the exceptions occurs for 8× upsampling with LapSRN.In that case, LapSRN has always better SSIM than our method, even though the opposite happens for the PSNR.As expected, TVAL3 had the worst performance overall and was surpassed by all learning-based methods.
A drawback of our method, however, is its possibly long execution time.We recall that of all downsampling operators mentioned in Section III-A, the most computationally complex is bicubic downsampling, as considered in these experiments.The timing values in Table IV are average values: they report the total execution time of our algorithm over all the images of the corresponding dataset divided by the number of images.While in some cases our algorithm took an average of 16 sec (SRCNN, BSD100, 4×), in others it took more than 250 sec (LapSRN, Urban100, 2×).In fact, for the Urban100 dataset, because of the large size of its images (1024 × 644), we had to reduce the number of simultaneous threads to prevent the GPUs from overflowing.For reference, for (SRCNN, BSD100, 4×), our method takes an average of 9 sec when we use simple subsampling.This is roughly half the execution time it takes for bicubic downsampling.
Qualitative results.Figures 3 and 4 depict the output images of all the algorithms (except IRCNN) for the test images baboon from Set14, and img067 from Urban100.All super-resolved images exhibit blur and loss of information compared with the groundtruth images in Figs.??-??.And as our scheme builds upon the outputs of other methods, it also inherits some of their artifacts.It is difficult to visually assess differences between the outputs of the algorithms and of our method, in part because the improvements, as measured by the PSNR, are relatively small.Yet, as our experiments show, our scheme not only systematically improves the outputs of CNN-based methods, but also adds significant robustness to operator mismatch.

V. CONCLUSIONS
We proposed a framework for single-image super-resolution that blends model-and learning-based (e.g., CNN) techniques.As a result, our framework enables solving the consistency problem that CNNs suffer from, namely that downsampled output (HR) images fail to match the input (LR) images.Our experiments show that enforcing such consistency not only systematically improves the quality of the output images of CNNs, but also adds robustness to the super-resolution task.At the core of our framework is a problem that we call TV-TV minimization and which we solve with an ADMM-based algorithm.Although our implementation is efficient, there is still margin to improve its execution time, for example, by unrolling the proposed algorithm with a neural network.We leave this for future work.

Fig. 1 .
Fig. 1.Illustration of the lack of measurement consistency by CNNs during testing: when the output image w is downsampled using A, it typically differs significantly from b.Our method takes in both w and b, and fixes this problem.

Fig. 2 .
Fig. 2. Our framework: the low-resolution image b and the image w superresolved by a CNN are fed into the TV-TV minimization problem which, in turn, obtains a high-resolution image x with better quality.

Fig. 3 .
Fig. 3. Results on Baboon (Set14) for 4×.Each shaded area (except the top-left) shows the output of a learning-based algorithm and of our method.

Fig. 4 .
Fig. 4. Results on img076 image from Urban100 for 4×.Each shaded area (except the top-left) shows the output of a learning-based algorithm and of our method.

TABLE I METHODS
USED IN OUR EXPERIMENTS.FOR EACH, WE SHOW THE MAIN TECHNIQUE, THE SCALING FACTORS IT CAN HANDLE AND, IF ANY, THE TRAINING DATASET.Both during training and testing, all the CNN-based methods in Table I extract the luminance channel of the YCbCr color space, and then convert the image from UINT8 to DOUBLE.

TABLE II CONSISTENCY
ACHIEVED BY CNN-TYPE METHODS ( Aw − b 2 ) AND BY OUR ALGORITHM ( Ax − b 2 ).

TABLE III OPERATOR
MISMATCH EXPERIMENTS.PSNR VALUES UNDER DIF-FERENT SAMPLING OPERATORS FOR A: BICUBIC, BOX FILTER-ING, AND SIMPLE SUBSAMPLING.WITHIN EACH BOX, THE BEST (HIGHER) VALUES ARE HIGHLIGHTED IN BOLD.

TABLE IV AVERAGE
PSNR (SSIM) RESULTS IN dB AND EXECUTION TIME IN SECONDS OF OUR METHOD USING THE REFERENCE METHODS.