Learning Multiscale Convolutional Dictionaries for Image Reconstruction

Convolutional neural networks (CNNs) have been tremendously successful in solving imaging inverse problems. To understand their success, an effective strategy is to construct simpler and mathematically more tractable convolutional sparse coding (CSC) models that share essential ingredients with CNNs. Existing CSC methods, however, underperform leading CNNs in challenging inverse problems. We hypothesize that the performance gap may be attributed in part to how they process images at different spatial scales: While many CNNs use multiscale feature representations, existing CSC models mostly rely on single-scale dictionaries. To close the performance gap, we thus propose a multiscale convolutional dictionary structure. The proposed dictionary structure is derived from the U-Net, arguably the most versatile and widely used CNN for image-to-image learning problems. We show that incorporating the proposed multiscale dictionary in an otherwise standard CSC framework yields performance competitive with state-of-the-art CNNs across a range of challenging inverse problems including CT and MRI reconstruction. Our work thus demonstrates the effectiveness and scalability of the multiscale CSC approach in solving challenging inverse problems.


I. INTRODUCTION
Convolutional neural networks (CNNs) obtain state-of-theart performance in many image processing tasks. To understand their success, an active line of recent research reduces CNNs into conceptually simpler and mathematically betterunderstood building blocks. Examples of these simplified convolutional models include convolutional kernels [1]- [3], convolutional scattering transforms [4]- [7], and convolutional sparse coding [8]- [10]. In addition to being mathematically tractable, these models have achieved remarkable empirical success, sometimes matching state-of-the-art CNNs.
This work studies convolutional representations arising from the convolutional sparse coding (CSC) paradigm, which provides a natural connection between sparse representation models and CNNs. Indeed, many CNN instances can be interpreted as optimizing a CSC objective through cascaded layers [8]. Moreover, CSC models compete favorably with state-of-the-art CNNs in several image processing tasks including denoising, single image super-resolution, and inpainting [10]- [18].
While these emerging results are promising, the successful applications of CSC in imaging inverse problems are still confined to problems with relatively simple forward operators, including Gaussian noise addition, blurring, and uniformly This work was supported by the European Research Council Starting Grant 852821-SWING. The code to reproduce our experiments is available at https: //github.com/liutianlin0121/MUSC random pixel removal. Common to these forward operators is their spatial locality -they introduce artifacts that are spatially correlated only, if at all, within small pixel neighbourhoods. By contrast, a broad range of imaging inverse problems involve forward models that mix distant parts of the image and are highly spatially heterogeneous; examples include the Radon transform for computed tomography, which computes line integrals along radiating paths, and the Fourier transform for magnetic resonance imaging, which computes inner products with globally-supported sinusoids. Working with these forward models presents different challenges since they introduce structured noise, such as streak artifacts, with long-range spatial correlations. We thus ask a natural question: Can CSC models also yield strong performance on such inverse problems with non-local operators?
To deal with spatially heterogeneous imagery data, one natural strategy is to employ multiscale dictionaries. Indeed, seminal works have shown that multiscale dictionaries, either analytical or learned, are advantageous in representing and processing images [19]- [24]. Separating scales is useful because it gives efficient descriptions of structural correlations at different distances. Yet, these existing CSC models [10], [25]- [28] mostly employ single-scale dictionaries, whose dictionary atoms all have the same size. While there exist proposals for multiscale CSC architectures, they are tailored for specific tasks [29], [30]. In addition, CSC models do away with flexible skip connections between non-consecutive layers, which are nonetheless essential for many successful CNNs such as the U-Net and its variants [31]- [33] to fuse features across scales. This challenge of harnessing multiscale features in the CSC paradigm motivates our work.
To address the challenge, we introduce a multiscale convolutional dictionary inspired by the highly successful U-Net [31]. We then apply the multiscale convolutional dictionary to challenging, large-scale inverse problems in imaging. The main contribution of this paper is twofold: • We propose a new convolutional dictionary, whose representation incorporates atoms of different spatial scales. The proposed multiscale dictionary augments standard, single-scale convolutional dictionaries to exploit the spatially-heterogeneous properties of images. • We study the effectiveness of the multiscale convolutional dictionary through experiments on large-scale datasets. We find that the performance of the multiscale CSC approach is competitive with leading CNNs on datasets including two major CT and MRI benchmarks. We addi-tionally show that our model matches (and slightly improves) the state-of-the-art performance on the deraining task achieved by a deep neural network [34]. Notably, the single-scale CSC model performs significantly worse on this task [27].
Overall, our work makes a step forward in closing the performance gap between end-to-end CNNs and sparsitydriven dictionary models. At a meta level, it (re)validates the fundamental role of sparsity in representations of images and imaging operators [20], [35], [36].
The rest of this article is organized as follows. In Section II, we first briefly review the sparse representation model and its relationship to CNNs. Section III explains how we incorporate multiscale atoms in a dictionary model; we also explain how to learn the multiscale dictionary from data under the task-driven dictionary learning framework. Section IV reports experimental results on tasks including CT reconstruction and MRI reconstruction.

II. BACKGROUND AND RELATED WORK
In this section, we briefly review the related work; a summary of notation is given in Table I.

A. Sparse representation models
Sparse representation has been extensively studied and widely used in imaging inverse problems [37]- [39]. It is motivated by the idea that many signals, images being a prime example, can be approximated by a linear combination of a few elements from a suitable overcomplete basis. The sparse representation framework posits that we can decompose a signal of interest 1 z ∈ R d as z = Dα, where D ∈ R d×N is an overcomplete dictionary of N atoms (N > d) and α ∈ R N is a sparse vector with few non-zero entries. Learning a sparse representation model thus comprises two sub-problems: (i) given a dictionary D, encode the signal z into a sparse vector α (the sparse coding problem), and (ii) given a set of signals, learn an appropriate dictionary D that sparsifies them (the dictionary learning problem). We briefly review these two problems and show how they are related to neural network models such as CNNs.

B. The sparse coding problem
The sparse coding problem is often formulated as basis pursuit denoising [40] or Lasso regression [41]. Most relevant to our work is its formulation with non-negative constraints on the sparse code α: Here, the first term 1 2 z − Dα 2 2 ensures that the code α yields a faithful representation of z, the second term λ α 1 a noisy image to be processed D an overcomplete dictionary x † a ground-truth image x a predicted image α a sparse code λ the thresholding parameters of ISTA controls the sparsity of the code, and the two terms are balanced by a parameter λ > 0. An effective solver for the minimization problem (1) is the iterative shrinkage-thresholding algorithm (ISTA) [42], which executes the following iteration where the superscript [k] denotes the iteration number, η is a step-size parameter, λ is a vector whose entries are all λ, and σ(x) := max(x, 0) is a component-wise rectifier function. For simplicity, we use S(α, z; D, λ) to denote one execution of ISTA with measurement z, sparse code α, dictionary D, and threshold λ. The ISTA algorithm is a composition of such executions; we write ISTA K for the K-fold composition of S with itself: ISTA K (z; D, λ) := S(·, z; D, λ) • · · · • S(·, z; D, λ) where α [0] is the initial sparse code; throughout this work, this initial code α [0] is assumed to contain zero in all entries. We emphasize that ISTA is a nonlinear transform of its input z.

C. The task-driven dictionary learning problem
We now briefly recall the task-driven dictionary learning framework [43]. Consider a supervised learning setting, in which we aim to identify a parametric function that associates each input z (e.g., a corrupted image) with its target x † (e.g., a clean image) for all (z, x † ) ∈ R d × R d drawn from some joint distribution. In the task-driven framework, we proceed by first representing the signal z by a sparse code α z with respect to a dictionary D. One way to achieve this is to let which can be approximated by K iterations of ISTA as in (3). Next, we approximate the desired target x † using the sparse code α z through a regression model f (·, w) with learnable parameter w. For instance, f (·, w) could be a linear regression model with weights and biases w. The model output f (α z , w) thus depends on the regression model parameters w as well as the sparse code α z , which in turn depends on the dictionary D through the ISTA iterations. In this way, the regression parameters w and dictionary D can be jointly optimized, for instance, with respect to the quadratic loss objective evaluated on a dataset of M input-target pairs Importantly, the task-driven objective in (5) implies that the dictionary D is optimized to solve the supervised learning task and not just to sparsely represent data.

D. Convolutional sparse coding
Our work is inspired by the convolutional sparse coding (CSC) model [8], [44]- [47], which bridges deep CNNs and sparse representation models. Concretely, Papyan et al. [8] noticed that if the dictionary D has a convolutional structure and if the sparse code α is assumed to be non-negative, a single iteration of ISTA with α [0] initialized as a zero vector and step-size η = 1 is equivalent to the forward pass of a single-layer convolutional network where b is a vector whose components are −λ (cf. Equation (2)). This single-layer formulation can be extended to characterize a deep CNN of multiple layers. Specifically, the forward-pass of a deep CNN of L-layers can be interpreted to approximate the sparse codes α 1 , · · · , α L sequentially with respect to different dictionaries D 1 , · · · , D L ; the backpropagation pass is interpreted as an update to these dictionaries {D i } L i=1 in a task-driven way.

E. CNNs for solving inverse problems
Deep CNNs achieve state-of-the-art performance in many image processing tasks [48]- [51]. In particular, the U-Net [31] and its variants [32], [33], [52] are among the most extensively used CNN architectures in solving image-to-image learning tasks. U-Nets represent images via multiscale features computed from measurements using an encoding (or downsampling) branch and a synthesized into an estimated image in a decoding (or upsampling) branch ( Figure 1a). In the downsampling branch, the spatial resolutions of feature maps are reduced while the number of feature maps is increased; in the decoding branch, these features are recombined with previous high-resolution features via channel concatenation ("skip connections") and convolution. Heuristically, low-resolution feature maps of a U-Net capture large-scale image properties, whereas the high-resolution feature maps capture more finegrained image properties [52]. In a related line of work, Ye et al. [53]- [55] proposed to use the framelets formalism [56] to study aspects of U-Net-like encoder-decoder CNNs. A key observation they make is that a U-Net model is closely related to convolutional framelets whose frame basis selection depends non-linearly on input data.

III. CSC WITH MULTISCALE DICTIONARIES
The structure of a convolutional dictionary is crucial to a CSC model since the dictionary atoms characterize the signals that can be represented sparsely. In the existing formulation of CSC, atoms of a convolutional dictionary have a single scale, in the sense that they all share the same spatial shape. However, many image classes and imaging artifacts exhibit structured correlations over multiple scales. To exploit these correlations in imaging inverse problems, we construct multiscale convolutional dictionaries.
Our construction is based on the U-Net reviewed in Section II. Indeed, the tremendous success of U-Nets has in part been attributed to their ability to represent images at multiple scales [33], [55], which is achieved by using up-and downsampling operations together with skip connections as in Figure 1a. Another property of the U-Net is its shared parameters across scales: Low-resolution features (the grey boxes at the bottom of Figure 1a) and high-resolution features (the grey boxes at the top of Figure 1a) undergo an overlapping synthesizing path parameterized by shared weights. This weight-sharing strategy has not been employed by existing proposals for multiscale CSC dictionaries [29], [30]. In what follows, we describe the construction process of a linear dictionary inspired by and closely following the standard U-Net.

A. Encoder-decoder dictionaries
We denote the encoding branch of the U-Net by f enc (·, θ enc ) : R d → R N with parameters θ enc ; the encoding branch maps the input z ∈ R d to convolutional feature maps α z = f enc (z, θ enc ) ∈ R N , illustrated as the dark grey boxes in Figure 1a. Note that, for a U-Net, the intermediate feature map dimension N (number of scalar coefficients in α) is typically much greater than the image dimension d. These feature maps are then fed into the decoding branch of the U-Net either through skip connections or through the bottleneck layer. To describe this process, we write the decoding branch of the U-Net as a function f dec (·, θ dec ) : R N → R d with parameters θ dec . That is, the function f dec (·, θ dec ) takes the convolutional feature maps produced by the encoding branch and transforms them to produce the model output. We can thus write the output produced by a U-Net as We now focus on the image synthesis process of the U-Net, described by the decoding function f dec (·, θ dec ). This function synthesizes convolutional feature maps at different spatial scales through skip connections and upsampling. As such, the decoding branch of the U-Net approximates an image x † ∈ R d using multiscale feature maps α z ∈ R N of a much higher dimension, so that x † ≈ f dec (α z , θ dec ). Conceptually, this representation is similar to the sparse and overcomplete representation in a dictionary, except that the U-Net decoder is non-linear.
To construct a multiscale dictionary, we thus consider a stripped-down version of the image synthesis process of U-Net by removing all non-linearities, batch normalization, and additive biases from the function f dec (·, θ dec ), as shown in The dictionary considered in this work is a simplification of the decoder branch of the U-Net: We retain convolution and multiscale representation from the decoder branch of the U-Net but remove all non-linearities, batch-normalization, and additive biases; additionally, we remove a convolution at each spatial resolution level and halve the number of convolutional channels for all convolutions. Grey boxes indicate the multiscale sparse code α = (α 0 , . . . , α 4 ) that the dictionary takes as input. Dashed boxes indicate the position that each α i feed into the dictionary. (c): The proposed as a computational graph that uses multiscale dictionaries D enc , D enc , and D dec ; although each dictionary is linear, the computational graph is nonlinear due to the thresholding operator. Figure 1b; to further simplify the architecture, at each spatial scale, we additionally remove a convolution and halve the number of convolutional channels for all convolutions. The resulting function is then simply a linear transformation where α 0 , . . . , α 4 are sparse code having different resolutions (visualized as the grey boxes in Figure 1b). This dictionary shares the essential ingredients of convolution, multiscale representation, and skip connections with the U-Net decoding branch and therefore we refer to it as the decoder dictionary. A precise definition of the decoder dictionary D dec through convolution and upsampling is provided in Appendix A.

B. The dictionary-based sparsity prior
With a given decoder dictionary D dec to describe the image synthesis process, we next consider how to infer an associated sparse code α, so that D dec α is a good approximation of the image we wish to model. In a supervised learning setting where the input image z is given, it is natural to interpret α as an encoded representation of z. Since the encoding must produce a coefficient vector whose structure is compatible with α, we endow an encoder dictionary D enc ∈ R d×N with the same structure of D dec albeit with a different set of atoms. This setup is analogous to U-Net's encoding and decoding branches: the encoder and decoder dictionaries D enc and D dec are employed to process input signals and produce output signals, respectively. The sparse code α z induced by an input z and the encoder dictionary D enc then facilitate the subsequent task for approximating the ground-truth image x: In what follows, we derive a supervised learning method that turns each z into a prediction x using encoder and decoder dictionaries.

C. The task-driven dictionary learning objective
Under the task-driven framework introduced in Section II, we formulate a supervised learning problem via sparse coding and dictionary learning. We consider the following minimization problem over a dataset of M input-target pairs where α zi := ISTA K (z i ; D enc , λ).
The objective in (9) penalizes the discrepancy between the ground-truth signal x † and the model prediction D dec α z , where the latter is a signal synthesized from a sparse code α z via the decoder dictionary D dec ; the code α z is a sparse representation of the input image z with respect to the encoder dictionary D enc by unrolling a fixed number K of ISTA iterations. The sparsity-controlling parameter λ is multidimensional, weighting codes component-wise. The intuition behind this choice is that the different convolutional features, especially those at different scales, should be thresholded differently. The sparse code α, illustrated as the grey boxes in Figure 1b, is a collection of multi-dimensional tensors, each corresponds a spatial scale. The task driven objective (9) defines a computational graph that transforms an input image z into a prediction D dec α z . We term this computational graph MUSC, since it involves multiscale U-Net-like sparse coding. We note the MUSC is an instance of optimization-driven networks [26] derived by unrolling an optimization algorithm. It incorporates two modules with meaningful objectives, one implementing sparse coding and the other dictionary-based synthesis. This composition is arguably conceptually more interpretable than end-to-end layerwise composition of deep networks.
While a traditional compressed sensing approach uses a single dictionary for reconstruction, our approach uses two dictionaries D enc and D dec in the task-driven learning objective (9). This discrepancy is due to different assumptions in measurement-to-image reconstruction (the compressed sensing approach) and image-to-image reconstruction (our approach). Consider an inverse problem with a forward operator A, a unknown ground-truth signal x † , and measurements y := Ax † ; in CT reconstruction, A is the Radon transform and y is the measured sinogram. The compressed sensing approach estimates x † as Dα * for some dictionary D, where is the inferred sparse code based on the dictionary D. Note that (10) and the synthesis Dα * require only a single dictionary D. However, this approach assumes that we know the measurements y and the forward operator A.
If we were to apply a single dictionary D := D enc = D dec in our image-to-image learning approach in (9), we would find a sparse code α such that Dα ≈ x † and Dα ≈ A + Ax † . This is difficult when A + A significantly differs from the identity operator as in the case of highly ill-posed problems. On the other hand, using two dictionaries D enc and D dec in (9) requires finding a sparse code α such that D dec α ≈ x † and D enc α ≈ A + Ax † , a formulation that is more flexible when A + A substantially differs from the identity. Experiments in Section IV-E confirm that allowing D enc = D dec yields better performance. We note that our approach is morally related to setting D enc = AD in (10), but since we do not know A we have to learn D enc from samples together with D dec . Such a learned encoder dictionary captures information about A, entangled with information about the data distribution.

D. Relaxation on dictionaries
We now describe computational techniques that stabilize the gradient-descent-based dictionary learning of MUSC. Following earlier work [6], [10], [25], [26], [57], we untie the encoder dictionary from its adjoint during dictionary update. That is, we replace the execution in (2) by where the dictionary D enc is initialized to be identical to D enc but is allowed to evolve independently during training. Even though the theoretical effects of this relaxation remain unclear, the dictionary D enc can be interpreted as a learned preconditioner that accelerates training [25], [26]; see also the investigation in [6], [58], [59]. The learned ISTA (LISTA) algorithm [57] corresponding to (11) is written as where λ 1 , . . . , λ K are the soft-thresholding parameters for each ISTA execution. Note that, in (12), the soft-thresholding parameters {λ i } K i=1 depend on the execution step. As shown in [6], incorporating step-dependent soft-thresholding parameters can be beneficial. While [6] uses a homotopy continuation strategy to adjust these parameters we treat them as learnable parameters for simplicity. Taking these considerations into account, we define a new regression loss: Unless mentioned otherwise, we use the loss (13) to train MUSC throughout our paper. In Section IV-E, we compare the performance of trained model using (13) and (9).

E. Training the MUSC
Training the MUSC entails the following three steps: 1) Dictionary initialization: We randomly initialize the dictionary D enc and initialize D dec , and D enc as identical copies of D enc . 2) Model forward pass: For each input image z i , we evaluate the model prediction D dec α zi as in Equation (13). For ISTA executions, we initialize all sparse code α z as a collection of all-zero tensors; the ISTA stepsize parameter η is initialized as the inverse of the dominant eigenvalue of the matrix D enc D enc , which can be approximated using by power iteration (Appendix C).
3) Task-driven dictionary learning: For a mini-batch of input-target pairs, solve the optimization problem in (13) with gradient descent.

IV. EXPERIMENTS
We report the performance of MUSC on deraining, CT reconstruction, and MRI reconstruction tasks. The motivations for choosing these tasks are as follows. First, we note that single-scale CSC models have recently been applied to the deraining task, achieving performance slightly worse than  state-of-the-art deep networks [27]; we thus aim to test the capability of our multiscale approach on the same task. We additionally choose CT and MRI reconstruction tasks as there exist challenging, large-scale, and up-to-date benchmark datasets for these tasks. Two such datasets that we use are the LoDoPaB-CT [64] and the fastMRI [65]. An additional strength of these two datasets is that the model evaluation process is carefully controlled: The evaluation on the challenge fold (for LoDoPaB-CT) or the test fold (for fastMRI) is restricted through an online submission portal with the ground truth hidden from the public. As a result, overfitting to these evaluation folds is difficult and quantitative comparisons are transparent.
Throughout this section, we use the MUSC architecture whose encoder and decoder dictionaries are displayed in Figure 1b and mathematically defined in Appendix A. Hyperparameter choices for the experiments are provided in Appendix D. For each task, we use well-known CNN models as baselines. We note that, for the CT and MRI reconstruction tasks, there are two major approaches to employ CNNs. In the first, model-based approach, one applies neural networks on raw measurement data (sinogram data in CT and k-space data in MRI) by embedding a task-dependent forward operator (the Radon transform for CT and the Fourier transform for MRI) into multiple layers or iterations of the network. Learning methods of this approach can be highly performant at the cost of being computationally expensive, especially during training, since one needs to apply the forward operator (and the adjoint of its derivative) repeatedly [49]. In the second, model-free approach, the (pseudoinverse of the) forward operator is used at most once during data preprocessing and is never used during subsequent supervised training. These preprocessed images contain imaging artifacts. During supervised learning, one applies a CNN directly on these preprocessed images. The proposed MUSC is in this sense a model-free approach and we compare it to model-free baselines. We note that in this case one does not need to know the forward operator at all. The leading model-free baseline CNN methods in this approach are typically U-Net variants tuned to the task at hand. For a more thorough comparison, we also implemented the original U-Net architecture proposed in [31] (schematically illustrated in Figure 1a) in these tasks as additional baselines.
While model-free approaches perform somewhat worse than model-based ones, our purpose here is to show that a generalpurpose multiscale convolutional model can perform as well as convolutional neural networks ceteris paribus, rather than to propose state-of-the-art reconstruction algorithms for specific problems. This general-purpose approach further allows us to tackle structured denoising problems such as deraining where the forward operator is simply the identity.

A. Deraining
Image deraining aims to remove rain streaks from an image. Formally, a rainy image z is expressed as z = x † + s, where x † is a clean image and s is the rain streaks component. The goal is to reconstruct the clean image x † based on the rainy image z. Recently, single-scale CSC models have been applied to the draining task [27]. Despite theoretical progress, these single-scale CSC models still fall short competing with leading deep learning models, as remarked in [27]. In this section, we demonstrate that our multiscale CSC model closes this performance gap.
Throughout this subsection, we follow the experiment setup of [27]. We use 200 clean and rainy image pairs as the training dataset. A rainy image is created by adding synthesized rain streaks to its clean version. We use two test sets, Rain12 [60] and Rain100L [63], to benchmark our results. Similar to [27], we train our model to restore rain streaks based on rainy images; a derained image is then the difference between a rainy image and the restored rain streaks. To be consistent with [27], [34], [63], the evaluation result is calculated after transforming the image into the luma component in the YCbCr domain using the software provided by [34]. Additional details of the experiment are provided in Appendix D.
We report the reconstruction performance in Table II and visualize the reconstruction results in Figure 2. Our multiscale convolutional dictionary approach matches or outperforms baseline methods. Notably, it improves upon the LGM method (the single-scale CSC approach of [27]) by a non-trivial margin.

B. CT reconstruction
Computed tomography (CT) aims to recover images from their sparse-view sinograms. We use the LoDoPaB-CT dataset [64] to benchmark our results. This dataset contains more than 40000 pairs of human chest CT images and their simulated low photon count measurements. The ground truth images of this dataset are human chest CT scans corresponding to the LIDC/IDRI dataset [66], cropped to 362 × 362 pixels. The low-dose projections are simulated using the default setting of [64].
To train our MUSC, we use the default dataset split as recommended in [64]: The dataset is divided into 35820 training samples, 3522 validation samples, 3553 test samples, and 3678 challenge samples. Here, the ground-truth samples from the challenge dataset are hidden from the public; the evaluation on this fold is performed through the online submission system of the LoDoPaB-CT challenge 2 .
We compare the reconstruction performance of MUSCs with five modern CNN baselines, namely CINN [67], U-Net++ [68], MS-D-CNN [69], U-Net [31], and LoDoPaB U-Net [64]; the LoDoPaB U-Net refers to a modified U-Net architecture tailored to the LoDoPaB-CT task. Figure 3 2 https://lodopab.grand-challenge.org/challenge/ shows the reconstruction results of a test sample. In Table III, we quantitatively compare MUSC with two classic methods (FBP and TV) together with five CNN baseline methods mentioned above. As shown in Table III, MUSC outperforms all baselines. The metrics PSNR and PSNR-FR are taken from [49]: For a ground-truth signal x † and its approximation x, we define PSNR x, x † := 10 log 10 max( PSNR-FR x, x † := 10 log 10 1 MSE ( x, x † ) .

C. MRI reconstruction
We further considered the task of accelerated magnetic resonance imaging (MRI) reconstruction using the fastMRI dataset [65] procured by Facebook and NYU. Specifically, we used the single-coil knee dataset with a 4-fold acceleration factor. This dataset contains 973 volumes or 34742 slices in the training set, 199 volumes or 7135 slices in the validation set, and 108 volumes or 3903 slices in the test set. The groundtruth images in the test set are not provided to the public and the evaluation must be made through the fastMRI online submission system 3 .
Following the training protocol of [65], we first transformed the undersampled k-space measurements into the image space using zero-filled Inverse Fast Fourier Transform (IFFT); we use the transformed images as input to MUSC and other CNN baselines. Consistent with previous work [65], we found that U-Net variants deliver exceptional performance on validation samples (Table IV). Remarkably, MUSC performs on-par with U-Net variants, yielding visually indistinguishable results ( Figure 4). We next evaluate the U-Net and the MUSC on test samples through the fastMRI submission system. On the test data, the proposed MUSC produces results comparable to the best-performing U-Net result (fastMRI U-Net-256) provided by the fastMRI challenge organizer while having an order of magnitude fewer parameters (Table V).

D. Single-image super-resolution
We have additionally tested the MUSC on a standard superresolution task, whose results are deferred to Appendix G. The goal of this task is to recover a high-resolution image from its degraded, low-resolution version. Unlike tasks such as CT and MRI reconstruction, in which the image degradation processes introduce long-range spatially correlated noise like streak artifacts, the blurring process in the super-resolution task is spatially local.
In this case, we do not observe a performance gain of using a multiscale model -either U-Net or MUSC -over state-of-the-art single-scale CSC models. Interestingly, MUSC outperforms the U-Net, but is up to 0.5 dB worse than singlescale CSC.
In subsection IV-F, we study this phenomenon by analyzing the sparse code yielded by MUSC. In the super-resolution    tasks, the nonzeros in sparse codes are confined to highresolution channels, or, equivalently, small filter supports which only leverage local information. This is well aligned with the intuition that the blurring forward operator mixes information only locally. It suggests that the right strategy is to use a large number of small-support filters just like CSC does, instead of "wasting" trainable parameters on unused large scales. We similarly find that a single-scale CSC model works better than MUSC on a denoising (Gaussian noise removal) task. Together, these findings suggest that multiscale features are no panacea for imaging inverse problems; the configuration of scales needs to resonate with the task-dependent forward operator that we aim to invert.

E. Ablations on the choices of model components
In Figure 5,     fastest learning speed and highest end-point accuracy. Consistent with findings in [6], [25], [26], we find it advantageous to use untied adjoints as described in (11): Untied dictionaries (Cases 1, 2, and 4 in Figure 5) in general perform much better than tied dictionaries with D enc = D enc = D dec (Cases 5 and Case 6). What is more, we find that learnable threshold λ gives better results than fixed threshold. The non-negative constraint of sparse code α ≥ 0 does not greatly influence the end-point performance of models, although with the constraint the model learns slightly faster (Case 2) than without (Case 1).

F. Probing multiscale dictionary-based representations
Thus far, we have shown that our proposed multiscale CSC approach, dubbed MUSC, performs comparably to state-ofthe-art CNNs in a range of imaging inverse problems. This is noteworthy, as the strong performance is achieved simply by employing a multiscale dictionary -as opposed to a singlescale one -in an otherwise standard CSC paradigm. The strong performance suggests the usefulness of the multiscale representation. We now analyze our learned dictionaries and their induced sparse representations. a) Visualizing dictionary atoms: We visualize dictionary atoms of the MUSC. To extract a dictionary atom from a dictionary D, we first prepare an indicator code δ, which is a collection of multichannel tensors that takes a value 1 at a certain entry and 0 elsewhere; a dictionary atom corresponds to that entry is computed as Dδ. Note that, different positions of the nonzero entry may give rise to atoms of different support sizes. This can be seen in Figure 1b  nonzero entry resides in, the sparse code activates different receptive fields under composite convolutions and transposed convolutions. If the nonzero entry resides in the top-most box, then the support of the atom is 3 as it undergoes only a single 3 × 3 convolution; if the nonzero entry is in one of the lower boxes, the support of the atom is larger as the code undergoes multiple convolutions and one or more transposed convolutions.
In Figure 6, we show samples of multiscale atoms in D dec of varying sizes -we crop these atoms to only show their nonzero support regions. As can be seen in Figure 6b-d, the learned dictionaries contain Gabor-like or curvelet-like atoms with different spatial widths, resolutions, and orientations. Thus the learned dictionaries indeed exploit multiscale features. For comparison, we also show a randomly initialized dictionary (Figure 6a). Unlike a learned dictionary, a random dictionary does not exhibit structures in atoms. We also visualize atoms of encoder dictionaries D enc and D enc in Appendix E. Using a similar technique, we also probe the multiscale representations learned by U-Nets in Appendix F. b) Sparsity levels of representations: We anticipate that the trained dictionaries induce different sparsity levels at different resolution levels in a task-dependent manner: More non-zeros associated with large-support atoms are useful when imaging artifacts have long-range correlations (e.g., streak artifacts in CT) than when the artifacts are localized (e.g., deraining or super-resolution). Figure 7 shows the sparsity levels across tasks, both before and after dictionary learning. We observe that, prior to any learning, the sparsity levels induced by randomly  Fig. 7: Sparsity of dictionary-induced convolutional features maps. Each bar corresponds to the sparsity level of a feature map tensor from the "deepest" activations corresponding to large-support atoms ("Middle") to the "shallowest' activations corresponding to small-support atoms ("Up-4").
initialized dictionaries (grey bars) are approximately uniform across scales. After learning, the sparsity levels of feature maps differentiate in a task-dependent way (orange bars in all panels). This task-dependent differentiation suggests the usefulness of multiscale representations -the learned sparsity levels are neither collapsed to a single scale nor remain uniform across spatial scales; instead, they are weighted and combined across scales in a problem-dependent way. A curious effect of multiscale learning arises in super-resolution (panel d): the activations are nonzero only in high-resolution features ("Up-2", "Up-3", and "Up-4"), corroborating the intuition that low-resolution features are not important for this task. Additionally, comparing the "Middle" bars across panels, we see that CT and MRI reconstruction tasks indeed use more nonzero coefficients on large-support atoms than tasks such as deraining and super-resolution.
V. DISCUSSION The CSC paradigm provides a natural connection between sparse modeling and CNNs. Despite being mathematical principled, existing CSC models still fall short competing with CNNs in terms of empirical performance on challenging inverse problems. In this work, we report one simple and effective way to close the performance gap between CSC and CNN models: incorporating a multiscale structure in the CSC dictionaries. Crucial to our approach is the structure of our constructed multiscale dictionary: It takes inspiration from and closely follows the highly successful U-Net model. We show that the constructed multiscale dictionary performs on par with leading CNNs in major imaging inverse problems. These results suggest a strong link between dictionary learning and CNNs -in both cases, multiscale structures are essential ingredients.
Beyond empirical performance, we believe that the interpretability of the proposed MUSC is showing the way towards an interpretable deep learning model. An interpretable model consists of components whose objectives and functionality have nominal values. The MUSC fulfills this desideratum by incorporating two modules with well-understood objectives, one implementing sparse coding and the other dictionarybased synthesis.
Overall, our work demonstrates the effectiveness and scalability of CSC models on imaging inverse problems. While deep neural networks are profoundly influencing image reconstruction, our work shows promise in a different direction: the principles of sparsity and multiscale representation developed decades ago are still useful in designing performant, parameter-efficient (compared to mainstream CNNs), and interpretable architectures that push the current limits of machine learning for imaging inverse problems.

APPENDIX A THE DEFINITION OF THE MUSC DICTIONARY
In the main text, we illustrate the architecture of a U-Net (Figure 1a) and the corresponding MUSC decoder dictionary (Figure 1b). Loosely speaking, the decoder dictionary is the decoding branch of a standard U-Net with all ReLU activations, batch normalization, and some convolution operations removed. Here we provide a formal definition of the decoder dictionary. While in the main text we assume for simplicty that signals are 1D vectors, in this section we represent RGB images as multichannel tensors. To that end, we first consider single-input single-output (SISO) operations whose input and output are 2D signals having a single channel. We follow by considering multiple-input multiple-output (MIMO) operations whose input and output are 3D tensors having multiple channels. a) SISO convolution and transposed convolution: Let ξ ∈ R H×W be a 2D signal with height H and width W . We regard ξ as a function defined on the discrete domain {1, . . . , H} × {1, . . . , W }. With standard zero padding, we extend the domain of ξ to Z×Z. The notation ξ[i, j] represents the value of the function ξ at the coordinate (i, j). b) SISO convolution: Given a 2D signal ξ ∈ R H×W and parameters (filter weights, filter impulse response) w ∈ R 3×3 , the convolution of ξ and w is defined as c) SISO transposed convolution: A transposed convolution in the U-Net consists of a bed-of-nails upsampling by a factor of two followed by a 2-by-2 convolution with an interpolating filter. The bed-of-nails upsampling interleaves zeros between samples, which can be written as ξ ⊗ [ 0 0 0 1 ] with ⊗ denoting the Kronecker product. The 2-by-2 convolution between the filter v ∈ R 2×2 and signal ξ ∈ R H×W is written as Putting these together, we have the following definition for a transposed convolution.
Proof. Observe that the result of LHS and RHS are both 2nby-2n matrices, each composed of n × n number of 2-by-2 submatrices. It is enough to show that each 2-by-2 sub-matrix of the LHS equals that of the RHS if they have the same position indices. We first expand the RHS. By definition of Kronecker product, we have where the (i, j)-th 2-by-2 sub-matrix of ξ ⊗v has the form for each (i, j) ∈ {1, · · · , n} × {1, · · · , n}.
for each (i, j) ∈ {1, · · · , n} × {1, · · · , n}. Comparing Equation (17) and Equation (18) g) Up-block: An up-block receives multichannel input from (i) low-resolution features in the shape of (2C, H, W ) and (ii) skip-connection features in the shape of (C, 2H, 2W ). With these input features, an up-block applies a transposed convolution to low-resolution features that halves the number of channels, concatenates the resulting features with skipconnection features, and then submits the concatenated results for another convolution. This can be written as where W ∈ R 2C×C×3×3 and V ∈ R 2C×C×2×2 are parameters and [·; ·] denotes channel-wise concatenation.
h) The multiscale dictionary: We now define the multiscale dictionary that is used as encoders and decoders in our work. Here, C out = 1 or 3 for grayscale and RGB images. We construct the dictionary D as a linear transform R that is independent of C out followed by a one-by-one convolution that produce C out channels: The linear map R is defined by cascading 4 Up-blocks with parameters W i and V i , i = 1, . . . , 4: where That is, the function R transforms multiscale sparse code (α 0 , . . . , α 4 ) to a tensor ξ 4 of shape R C/2 4 ×2 4 ·H×2 4 ·W . The 1 × 1 convolution operator C next synthesizes these features into a tensor with channel number 1 or 3:

APPENDIX B CAN A LEARNED SPARSE CODER YIELD DENSE OUTPUTS?
In the main text, we introduced the K-fold ISTA algoritm for sparse coding. The sparse coding result reads ISTA K (z; D enc , λ), where z is a given input, D enc is an encoding dictionary, and λ is a threshold tensor. Since we also learn a λ from data during training, one concern is whether a learned λ be zero. Indeed, if this were to occur, the sparse coding results would be dense and it would defeat the purpose of sparse coding.
In this section, we argue that λ = 0 is (i) not optimal for many inverse problems and (ii) unlikely to be learned from data. To see the non-optimality of λ = 0, we observe that with the iteration number K is large enough, with λ = 0, and with zero initialization of sparse code, ISTA yields the smallest 2 -norm solution which can be written via the Moore-Penrose pseudoinverse, lim K→∞ ISTA K (z; D enc , 0) = D + enc z.
Our task-driven dictionary learning problem thus has the form minimize Denc,Ddec This implies that λ = 0 results in a linear reconstruction method which comes with all the known drawbacks of linear methods. In particular, it cannot do better than the linear minimum mean square error (LMMSE) estimator (a generalized Wiener filter). Since regularization in ill-posed inverse problems entails the use of data models, and most useful data models are nonlinear (e.g., natural and medical images are known to be sparse or compressible in wavelet frames, but they do not belong to any linear subspace), these problems demand λ > 0.
One may still wonder whether our learning procedure will overfit finite datasets with λ = 0. To see that this will not happen, note that D dec and D enc are constrained to have a specific structure: They are multiscale variants of block-Toeplitz matrices. Additionally, filters at high resolutions are convolutions of filters at lower resolutions which induces a rather complicated algebraic structure. As a result, the set of valid dictionaries in our model has a much smaller dimension than the set of all possible dictionaries of correct size, and solving (9) cannot be reduced to finding two generic overcomplete dictionaries that "overfit" to training data to achieve zero loss. In fact, forcing λ = 0 typically incurs a large loss in (9), so λ = 0 is unlikely to be learned from data. This is the magic of multiscale convolutional sparsity.

APPENDIX C POWER ITERATION
We describe how to approximate the dominant eigenvalue of the matrix D enc D enc using by power iteration. We achieve this by first estimating the eigenvector associated with the dominant eigenvalue using the power iteration method, by recursively calculating up to some step K with b 0 being an all-zero vector. The estimated dominant eigenvalue can then be derived from

APPENDIX D DETAILS OF THE EXPERIMENTAL SETUP
In Table VI, we summarize parameters used in each scale of our dictionaries. In Table VII and Table VIII  in-channels out-channels stride in-channels out-channels stride in-channels out-channels stride Scale 1  512  512  1  512  256  2  ---Scale 2  512  256  1  256  128  2  ---Scale 3  256  128  1  128  64  2  ---Scale 4  128  64  1  64  32  2  ---Scale 5  64  32  1  ---32  1 or 3  1   Table VI: Parameters used at each scale of the encoder and decoder dictionaries. Scale 1 corresponds to the low-resolution scale (the bottom-most gray box in Figure 1b) and Scale 5 correspond to the high-resolution scale (the top-most gray box in Figure 1b).  Note that, for CT and MRI tasks, U-Nets and MUSCs are trained on full images; in deraining and super-resolution tasks, U-Nets and MUSCs are trained on cropped images, where sizes of the cropped images are in Table VIII. Since in the deraining and LoDoPaB-CT tasks the range of the target images is non-negative, we clip the negative values of the synthesized image. During model training, we use weight normalization [70], a reparametrization trick that decouples the magnitude of a convolutional filter from its direction. To enforce the positivity of the ISTA parameter λ, we reparametrize λ = ReLU(λ)+1e-5 and perform gradient-based learning oñ λ instead.

APPENDIX E VISUALIZING THE ATOMS IN ENCODER DICTIONARIES
Consistent to how we visualize decoder dictionary atoms in Section IV-F, we visualize atoms from/ encoder dictionaries D enc and D enc in Figure 8 and Figure 9.

APPENDIX F VISUALIZING THE REPRESENTATION OF U-NETS
Similar to how we visualize dictionary atoms in Section IV-F, we visualize prototypical images that U-Nets synthesize through its decoder branch f dec (·, γ). Concretely, we first prepare a set of indicator codes corresponding to different spatial resolutions as described in Section IV-F. We then feed each indicator code δ into the decoder branch of a U-Net to yield f dec (δ, γ). Due to additive biases and batchnorm modules of the U-Net, the synthesized output f dec (δ, γ) has the same support as the full image. To focus on the region influenced by the indicator code, we thus display the support of f dec (δ, γ) − f dec (0, γ), where 0 is an all-zero tensor; the purpose of this subtract is to offset those image values solely influenced by batchnorm and additive biases but not by the indicator code. These synthesized results are visualized in Figure 10.
As it can be seen, compared to the randomly initialized U-Net (Figure 10a), the representations of learned U-Nets ( Figure  10b and c) are organized in a more structured way at each scale. Compared to the (linear) representations learned by the MUSC, the (nonlinear) U-Net atoms much less resemble the classical oriented multiresolution systems such as curvelets.

APPENDIX G SUPER-RESOLUTION
CSC models achieve competitive performance in image super-resolution [71], [72]. We train an out-of-the-box MUSC for this task to study the sparsity patterns of its learned representation in Section IV-F. We follow the protocol in earlier work [73]- [75] to train MUSC on the DIV2K dataset [76]. Low-resolution images are prepared by downscaling high-resolution images by a factor of four. We used bicubic interpolated images up-scaled from low-resolution images as model inputs and high-resolution images as model targets. The trained models were evaluated on standard datasets including Set-14 [77], Set-5 [78], B-100 [79], and Urban-100 [80]. Table IX shows the performance of the trained models. We observe that single-scale CSC has an edge of all other models, indicating the limited usefulness of large filters and the benefits of trading them off for a large number of small-support convolutional channels, as also discussed in the main text.    Learned D enc on the fastMRI dataset