Learning of Continuous and Piecewise-Linear Functions With Hessian Total-Variation Regularization

We develop a novel 2D functional learning framework that employs a sparsity-promoting regularization based on second-order derivatives. Motivated by the nature of the regularizer, we restrict the search space to the span of piecewise-linear box splines shifted on a 2D lattice. Our formulation of the infinite-dimensional problem on this search space allows us to recast it exactly as a finite-dimensional one that can be solved using standard methods in convex optimization. Since our search space is composed of continuous and piecewise-linear functions, our work presents itself as an alternative to training networks that deploy rectified linear units, which also construct models in this family. The advantages of our method are fourfold: the ability to enforce sparsity, favoring models with fewer piecewise-linear regions; the use of a rotation-, scale-, and translation-invariant regularization; a single hyperparameter that controls the complexity of the model; and a clear model interpretability that provides a straightforward relation between the parameters and the overall learned function. We validate our framework in various experimental setups and compare it with neural networks.


I. INTRODUCTION
The primary task in supervised learning is to estimate a target function $f : \mathbb{R}^d \to \mathbb{R}$ from finitely many noisy samples $\{(\mathbf{x}_m, y_m)\}_{m=1}^{M}$, where $y_m \approx f(\mathbf{x}_m)$, $m = 1, \ldots, M$ [1]. Since there are arbitrarily many continuous models that can fit the training data well enough, this problem is ill-posed in general. To address this issue, the learning scheme generally includes regularization and favors certain models based on prior information on the target function [2], [3].
One way to make this problem computationally tractable is to restrict the admissible solutions to a given family of parametric functions $f_{\boldsymbol{\theta}}$, where $\boldsymbol{\theta}$ denotes the vector of the underlying parameters. A celebrated example of this approach is deep learning, whose underlying principle is the construction of an overall map $f_{\boldsymbol{\theta}} : \mathbb{R}^d \to \mathbb{R}$ built as a neural network via the composition of parameterized affine mappings and pointwise nonlinearities known as activation functions. The attribute "deep" refers to the high number of such module compositions (layers), which is instrumental to improve the approximation power of the network [4]–[6] and its generalization ability [7].
Rectified-linear-unit (ReLU) networks, in particular, have been the most prominently used in machine learning [8], [9]. These networks have spline activations of the form $x \mapsto \max(x, 0)$, which results in a continuous and piecewise-linear (CPWL) input-output relationship [4], [10]. Consequently, they can be interpreted as hierarchical splines [11]–[13]. Furthermore, the converse also holds: any CPWL function can be represented by some ReLU network [14]. This leads to the conclusion that ReLU networks provide a nonlinear parameterization of the family of CPWL functions.
A more generic strategy to address supervised-learning problems is to adopt a functional approach where the model is optimized over a well-suited function space [15], [16]. In this paradigm, the learning task is formalized as the minimization
$$\min_{f \in \mathcal{X}} \sum_{m=1}^{M} E\big(y_m, f(\mathbf{x}_m)\big) + \lambda R(f), \qquad (1)$$
where $\mathcal{X}$ is the search space, $E : \mathbb{R} \times \mathbb{R} \to \mathbb{R}_+$ is an arbitrary convex loss function (the data-fidelity metric), $R : \mathcal{X} \to \mathbb{R}$ is the regularization functional, and $\lambda \in \mathbb{R}_+$ its corresponding (adjustable) weight. A one-dimensional (1D) example of (1) is learning with second-order total-variation regularization [17], [18], which promotes sparse piecewise-linear models and leads to an alternative procedure to learn 1D CPWL functions. This problem is formulated as
$$\min_{f \in \mathrm{BV}^{(2)}(\mathbb{R})} \sum_{m=1}^{M} E\big(y_m, f(x_m)\big) + \lambda\, \mathrm{TV}^{(2)}(f), \qquad (2)$$
where the search space $\mathrm{BV}^{(2)}(\mathbb{R})$ contains the functions with bounded second-order total variation, in the sense that $f \in \mathrm{BV}^{(2)}(\mathbb{R}) \Leftrightarrow \mathrm{TV}^{(2)}(f) < +\infty$. In the spirit of reproducing-kernel Hilbert spaces [19], [20], there exists a representer theorem for (2) which states that the extreme points of the solution set are linear splines of the generic form
$$f(x) = \sum_{k=1}^{K} a_k\,(x - x_k)_+ + b_1 x + b_0, \qquad (3)$$
where $\mathbf{a} = (a_k) \in \mathbb{R}^K$ and $\mathbf{b} = (b_0, b_1) \in \mathbb{R}^2$ are expansion parameters, $K < M$, and $\{x_k\}_{k=1}^{K}$ are adaptive (learnable) locations that are known as knots. The regularization term for these functions has the simple expression
$$\mathrm{TV}^{(2)}(f) = \|\mathbf{a}\|_1 = \sum_{k=1}^{K} |a_k|. \qquad (4)$$
As a result, the original infinite-dimensional problem can be recast as a finite-dimensional one, parameterized by $K$, $\mathbf{a}$, $\mathbf{b}$, and $\{x_k\}_{k=1}^{K}$ [21]. The $\mathrm{TV}^{(2)}$ regularization favors simple models with few knots due to (4) and the sparsity-promoting effect of the $\ell_1$-norm [22], [23].
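For intuition, the following short sketch (ours, not from the referenced works) evaluates (4) in the special case of a CPWL function whose knots lie on a uniform grid of spacing $h$, where each slope change is a scaled second difference of the nodal values; the numerical values are arbitrary.

```python
import numpy as np

# TV^(2) of a 1D CPWL function with knots on a uniform grid of spacing h:
# the slope change at knot k is (c[k-1] - 2 c[k] + c[k+1]) / h, and (4) gives
# TV^(2)(f) = sum_k |a_k|. Toy nodal values below.
h = 0.5
c = np.array([0.0, 0.0, 1.0, 2.0, 1.5, 1.0, 1.0])
slope_changes = (c[:-2] - 2.0 * c[1:-1] + c[2:]) / h
tv2 = np.abs(slope_changes).sum()
print(tv2)
```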
Interestingly, one can also use $\mathrm{TV}^{(2)}$ regularization to learn the activation functions of deep neural networks. In this case, it has been shown that neural networks with linear-spline activation functions of the form (3) are optimal [24], [25]. The link between functional approaches to neural networks and splines has also been observed in various works [26]–[30].

A. CONTRIBUTIONS
Our work presents a method to learn sparse CPWL functions in 2D. It is based on three fundamental ingredients.
1) A regularization based on second-order derivatives: the Hessian-nuclear total variation (HTV).
2) A CPWL search space spanned by linear-box-spline basis functions, which allows us to have a closed-form parametric expression for the regularization.
3) An exact discretization of the infinite-dimensional learning problem. The resulting parameterized problem has a structure that is reminiscent of the generalized LASSO [31] and can therefore be efficiently solved using known optimization algorithms. This discretization encapsulates the sparsity-promoting effect of the HTV, while it reveals the link with $\ell_1$ regularization.

Our framework presents itself as an alternative to training ReLU networks for the learning of CPWL functions, with the following advantages:
1) the enforcement of sparsity, in the sense that we follow Occam's razor principle by promoting solutions with the fewest CPWL regions;
2) the use of a rotation-, scale-, and translation-invariant regularization;
3) the reliance on a single hyperparameter: the regularization weight $\lambda$. This is in contrast with the numerous hyperparameters found in neural networks, such as the choice of architecture and its components, learning-rate schemes, and batch size, among others;
4) an improved model interpretability, since we provide a linear parameterization of the learned CPWL mapping.

B. RELATED WORKS
One of the most widely used regularizers in image reconstruction is the Rudin-Osher-Fatemi total-variation (TV) semi-norm, given by the integral of the Euclidean norm of the gradient, $\mathrm{TV}(f) = \int_{\mathbb{R}^2} \|\boldsymbol{\nabla} f(\mathbf{x})\|_2 \,\mathrm{d}\mathbf{x}$ [34]. The success of TV is partly attributable to its capacity to preserve edge information. However, TV has the tendency to promote piecewise-constant solutions (vanishing first-order derivatives), which creates an undesired staircase effect. Such an issue can be dealt with by the adoption of regularizers based on higher-order differentials [35], [36]. The Hessian is the second-order counterpart of the gradient operator. Accordingly, the works in [37], [38] introduce the regularizer
$$R(f) = \int_{\mathbb{R}^2} \big\|\mathbf{H}\{f\}(\mathbf{x})\big\|_* \,\mathrm{d}\mathbf{x}, \qquad (5)$$
where $\|\cdot\|_*$ is the nuclear norm (the $\ell_1$-norm of the singular values) of the Hessian. This regularizer preserves the desirable affine invariance of total variation while it alleviates the staircase effect by promoting piecewise-linear solutions.
Here, in contrast with [37], we address supervised learning rather than inverse problems. Moreover, the approach taken in [37] raises two relevant concerns. First, the regularization term (5) is inoperative for CPWL functions because the Hessian of a CPWL function is zero almost everywhere. Second, [37] resorts to a discretization of the Hessian with second-order finite differences in order to establish a discrete formulation of the problem, which leads to discretization errors. In our work, we address these two concerns by using a novel HTV seminorm as the regularization term. The proposed functional is a generalization of (5). It was introduced in [39] to measure the complexity of neural networks. The foundation of our work is that the HTV seminorm is properly defined for CPWL functions and has an explicit closed-form expression for these mappings. This allows us to discretize our problem exactly.

C. ROADMAP
In Section II, we introduce the two key mathematical elements that underlie our framework: the Hessian-nuclear total-variation (HTV) seminorm and box splines. We then define an adequate CPWL search space in Section III, where we also formulate our HTV-regularized learning problem. In Section IV, we discretize our problem exactly and provide the algorithmic components to solve it. Finally, in Section V, we validate our framework on numerical examples.

II. MATHEMATICAL PRELIMINARIES

A. NUCLEAR NORM
The nuclear norm of a matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ is defined as
$$\|\mathbf{A}\|_* = \sum_{k} \sigma_k(\mathbf{A}),$$
where the $\sigma_k(\mathbf{A})$ are the singular values of $\mathbf{A}$ [40]. Its dual norm is the spectral norm, defined as $\|\mathbf{A}\|_{S_\infty} \triangleq \max_k \sigma_k(\mathbf{A})$. The nuclear norm plays a prominent role in the field of low-rank matrix recovery, due to its sparsity-promoting effect [41]–[43].
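As a minimal numerical illustration (not taken from the paper), both norms can be obtained from the singular values computed with NumPy:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [0.0, 3.0]])
sigma = np.linalg.svd(A, compute_uv=False)  # singular values of A
nuclear_norm = sigma.sum()                  # ||A||_* : l1-norm of the singular values
spectral_norm = sigma.max()                 # ||A||_{S_inf} : dual (spectral) norm
print(nuclear_norm, spectral_norm)
```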

B. GENERALIZED HESSIAN OPERATOR
The Hessian of a twice-differentiable function $f : \mathbb{R}^2 \to \mathbb{R}$ is defined as
$$\mathbf{H}\{f\}(\mathbf{x}) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2}(\mathbf{x}) & \frac{\partial^2 f}{\partial x_1 \partial x_2}(\mathbf{x}) \\ \frac{\partial^2 f}{\partial x_2 \partial x_1}(\mathbf{x}) & \frac{\partial^2 f}{\partial x_2^2}(\mathbf{x}) \end{bmatrix}.$$
Using this matrix, one can readily compute second-order directional derivatives. Let us recall that the second-order directional derivative of a twice-differentiable function $f$ at $\mathbf{x}$ along a direction $\mathbf{u}$ with $\|\mathbf{u}\|_2 = 1$ is defined as
$$\mathrm{D}_{\mathbf{u}}^2 f(\mathbf{x}) = \lim_{t \to 0} \frac{\mathrm{D}_{\mathbf{u}} f(\mathbf{x} + t\mathbf{u}) - \mathrm{D}_{\mathbf{u}} f(\mathbf{x})}{t},$$
where $\mathrm{D}_{\mathbf{u}} f$ denotes the first-order directional derivative. This quantity can be expressed as
$$\mathrm{D}_{\mathbf{u}}^2 f(\mathbf{x}) = \mathbf{u}^T \mathbf{H}\{f\}(\mathbf{x})\,\mathbf{u}.$$
If the Hessian of $f$ is symmetric, for which it suffices that the second partial derivatives be continuous (Schwarz's theorem), then its eigendecomposition has an important geometric interpretation: the two eigenvectors, $\mathbf{v}_1$ and $\mathbf{v}_2$, point in the directions in which the magnitude of the second-order directional derivative is maximal and minimal, respectively. Furthermore, the second-order directional derivatives along these directions are given by the corresponding eigenvalues. The Hessian operator can also be defined for functions that are not twice-differentiable using the notion of weak derivatives, as detailed in Appendix A.
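The quadratic-form expression and the eigendecomposition can be checked numerically with a toy symmetric Hessian (an illustrative sketch of ours, with arbitrary values):

```python
import numpy as np

H = np.array([[3.0, 1.0],
              [1.0, 2.0]])                 # symmetric Hessian of f at some point x
u = np.array([1.0, 1.0]) / np.sqrt(2.0)    # unit direction
d2_u = u @ H @ u                           # second-order directional derivative along u
eigvals, eigvecs = np.linalg.eigh(H)       # eigenvectors: extremal directions
print(d2_u, eigvals)                       # eigenvalues: the corresponding derivatives
```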

C. HESSIAN-NUCLEAR TOTAL VARIATION
In this section, we briefly recall the definition and relevant properties of the HTV seminorm (see [39] for more details). It is based on the nuclear-TV norm, which is defined for any $\mathbf{W} \in \mathcal{S}'(\mathbb{R}^2; \mathbb{R}^{2\times 2})$ as
$$\|\mathbf{W}\|_{*,\mathcal{M}} = \sup\Big\{ \langle \mathbf{W}, \mathbf{F} \rangle : \mathbf{F} \in \mathcal{S}(\mathbb{R}^2; \mathbb{R}^{2\times 2}),\ \sup_{\mathbf{x}\in\mathbb{R}^2} \|\mathbf{F}(\mathbf{x})\|_{S_\infty} \leq 1 \Big\}.$$
We refer to Appendix A1 for more details on the matrix-valued spaces $\mathcal{S}(\mathbb{R}^2; \mathbb{R}^{2\times 2})$ and $\mathcal{S}'(\mathbb{R}^2; \mathbb{R}^{2\times 2})$. What is relevant for this paper is that, for any matrix of absolutely integrable functions $\mathbf{W} \in L_1(\mathbb{R}^2; \mathbb{R}^{2\times 2})$, we have that
$$\|\mathbf{W}\|_{*,\mathcal{M}} = \int_{\mathbb{R}^2} \|\mathbf{W}(\mathbf{x})\|_* \,\mathrm{d}\mathbf{x}.$$
Furthermore, the nuclear-TV norm is well-defined for the Dirac-like distributions that appear in the Hessian of CPWL functions. This motivates the choice of our regularization functional given in Definition 1, which applies the nuclear-TV norm to the generalized Hessian: $\mathrm{HTV}(f) = \|\mathbf{H}\{f\}\|_{*,\mathcal{M}}$.

D. CONTINUOUS AND PIECEWISE-LINEAR FUNCTIONS
A function $f : \mathbb{R}^2 \to \mathbb{R}$ is CPWL if 1) it is continuous; and 2) its domain $\mathbb{R}^2 = \bigcup_{n=1}^{N} P_n$ can be partitioned into a finite set of non-overlapping convex polytopes $P_n$ over which it is affine, with $f|_{P_n}(\mathbf{x}) = \mathbf{a}_n^T \mathbf{x} + b_n$. The gradient of a CPWL function is easily seen to be piecewise-constant with discontinuities at the junctions of the polytopes. For a CPWL function $f$, we have that
$$\boldsymbol{\nabla} f(\mathbf{x}) = \sum_{n=1}^{N} \mathbf{a}_n\, \mathbb{1}_{P_n}(\mathbf{x})$$
for almost every $\mathbf{x} \in \mathbb{R}^2$. We can show that the HTV of a CPWL function $f : \mathbb{R}^2 \to \mathbb{R}$ is finite if and only if it is compactly supported up to an affine term, which means that there exists an affine function $f_{\mathrm{aff}}$ such that $(f - f_{\mathrm{aff}})$ is supported over a convex and compact set $D \subseteq \mathbb{R}^2$. We call such CPWL functions "admissible". We now present a closed-form formula for the HTV of any admissible CPWL function.

Setup:
We consider the partitioning of $D$ into $N$ polytopes $\{P_n\}_{n=1}^{N}$ whose gradients $\boldsymbol{\nabla} f|_{P_n}$ are denoted by $\mathbf{a}_n$. Moreover, we gather the indices of the neighbors of a polytope $P$ into the set $\mathrm{adj}(P)$. Each pair of neighboring polytopes $P_n$ and $P_k$ shares a common line segment (junction) denoted by $L_{n,k} = P_n \cap P_k$, whose length is $\mathrm{len}(L_{n,k})$. Finally, we use the notation $\mathbf{u}_{n,k}$ for the (unit) normal vector that is perpendicular to $L_{n,k}$ (Fig. 1(a)). The proof of Theorem 1 can be found in [39].
Theorem 1 (HTV of admissible CPWL functions): Let $f : \mathbb{R}^2 \to \mathbb{R}$ be an admissible CPWL function. Then, we have that
$$\mathrm{HTV}(f) = \sum_{n=1}^{N} \sum_{\substack{k \in \mathrm{adj}(P_n) \\ k > n}} \|\mathbf{a}_k - \mathbf{a}_n\|_2\, \mathrm{len}(L_{n,k}),$$
where the condition $k > n$ ensures that each junction is counted exactly once.
Given the central role of Theorem 1 in our work, we now discuss it briefly. We first observe that $\alpha_{n,k} = \mathbf{u}_{n,k}^T(\mathbf{a}_k - \mathbf{a}_n)$ is the difference of the directional derivatives inside two neighboring polytopes along the junction normal $\mathbf{u}_{n,k}$. Therefore, it measures the degree of coplanarity of the adjacent piecewise-linear regions. We illustrate how the shape of each neighboring two-polytope region relates to $\alpha_{n,k}$ in Fig. 1: when the two regions are coplanar, $\alpha_{n,k} = 0$; otherwise, $|\alpha_{n,k}|$ gauges how much the slope changes in the direction $\mathbf{u}_{n,k}$ (i.e., how far the function is from being affine in this region), and $\mathrm{sgn}(\alpha_{n,k})$ determines the sign of that change.
It is also interesting to compare the discrete HTV expression (15) with (4), which relates to the $\mathrm{TV}^{(2)}$ regularization in 1D. We observe that, for a CPWL function $f : \mathbb{R} \to \mathbb{R}$,
$$\mathrm{TV}^{(2)}(f) = \sum_{k=1}^{K} |a_k|,$$
where each $a_k$ is the difference of the derivatives in two consecutive linear regions that connect at the knot position $x_k$. In 2D, this result is extended by considering directional derivatives and junctions instead of derivatives and knots, and by taking into account the new factor of the length of a junction. This also highlights the sparsity-promoting effect of HTV regularization: since the HTV seminorm imposes a (weighted) $\ell_1$ penalty on the change of slope between neighboring regions, it favors CPWL functions with few linear regions.
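The closed-form expression of Theorem 1 translates directly into a few lines of code. The sketch below is our own illustration with hypothetical junction data, where each junction is listed exactly once:

```python
import numpy as np

def htv_cpwl(junctions):
    """Theorem 1: sum over junctions of ||a_k - a_n||_2 * len(L_{n,k}),
    where a_n, a_k are the gradients of the two neighboring affine pieces."""
    return sum(np.linalg.norm(np.asarray(ak) - np.asarray(an)) * length
               for an, ak, length in junctions)

# Single ridge junction between two affine pieces (illustrative only; a full
# admissible function would also include the junctions along its boundary).
print(htv_cpwl([((1.0, 0.0), (-1.0, 0.0), 2.0)]))  # 2 * ||(-2, 0)|| = 4
```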
The necessity to restrict our framework to admissible CPWL functions is made evident by (16): the regularization only has a finite value if a function has a finite number of non-coplanar junctions, and each of these has a finite length.
Finally, we remark that, although the HTV of admissible CPWL functions has a simple closed-form expression, it requires the complete knowledge of the domain partition and the gradients in each polytope. To circumvent this, we construct a CPWL search space that is based on a uniform domain partition. This allows us to obtain a tractable formula for computing the HTV of any model in the space.

E. BOX SPLINES
Box splines are a multivariate extension of B-splines [44]. In contrast to tensor products of 1D B-splines, they are non-separable, which makes them suitable for interpolation algorithms tailored to non-Cartesian (and often optimal) sampling lattices [45]–[48]. They also find applications in areas such as finite-element methods [49], computer-aided design [50], and edge detection [51].
In 2D, we denote by $\beta_{\boldsymbol{\Xi}} : \mathbb{R}^2 \to \mathbb{R}$ the box spline associated with the matrix $\boldsymbol{\Xi} = [\boldsymbol{\xi}_1 \cdots \boldsymbol{\xi}_n] \in \mathbb{R}^{2 \times n}$. For $n = 2$ and an invertible matrix $\boldsymbol{\Xi} \in \mathbb{R}^{2\times 2}$, $\beta_{\boldsymbol{\Xi}}$ is a normalized indicator function of the form
$$\beta_{\boldsymbol{\Xi}}(\mathbf{x}) = \frac{1}{|\det \boldsymbol{\Xi}|}\, \mathbb{1}_{S_{\boldsymbol{\Xi}}}(\mathbf{x}),$$
where $S_{\boldsymbol{\Xi}} = \{\boldsymbol{\Xi}\boldsymbol{\alpha} : \boldsymbol{\alpha} \in [0, 1)^2\}$. For $n \geq 3$, the box spline is defined recursively as
$$\beta_{\boldsymbol{\Xi}}(\mathbf{x}) = \int_0^1 \beta_{\boldsymbol{\Xi}'}(\mathbf{x} - t\,\boldsymbol{\xi}_{n})\,\mathrm{d}t, \quad \text{with } \boldsymbol{\Xi}' = [\boldsymbol{\xi}_1 \cdots \boldsymbol{\xi}_{n-1}].$$
Box splines are nonnegative functions and have a unit integral over the entire space, with $\int_{\mathbb{R}^2} \beta_{\boldsymbol{\Xi}}(\mathbf{x})\,\mathrm{d}\mathbf{x} = 1$. Moreover, they are supported over the set $\{\boldsymbol{\Xi}\boldsymbol{\alpha} : \boldsymbol{\alpha} \in [0, 1)^n\}$ and are symmetric with respect to the center of their support [52].
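To make the recursive definition concrete, here is a small numerical sketch of ours (not part of the paper's released code); the three-direction matrix in the last lines is only an assumed hexagonal configuration, not necessarily the exact matrix of Fig. 2.

```python
import numpy as np

def box_spline(xi, x, n_quad=200):
    """Evaluate the box spline of the 2 x n direction matrix `xi` at point `x`:
    normalized indicator for n = 2, numerical quadrature of the recursion otherwise."""
    xi = np.asarray(xi, dtype=float)
    x = np.asarray(x, dtype=float)
    if xi.shape[1] == 2:
        alpha = np.linalg.solve(xi, x)                  # x = Xi @ alpha
        inside = np.all((alpha >= 0.0) & (alpha < 1.0))
        return float(inside) / abs(np.linalg.det(xi))
    t = (np.arange(n_quad) + 0.5) / n_quad              # midpoint rule on [0, 1]
    vals = [box_spline(xi[:, :-1], x - ti * xi[:, -1], n_quad) for ti in t]
    return float(np.mean(vals))

xi_hex = np.array([[1.0, -0.5, 0.5],
                   [0.0, np.sqrt(3) / 2, np.sqrt(3) / 2]])  # assumed hexagonal directions
print(box_spline(xi_hex, np.array([0.3, 0.2])))
```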

III. SEARCH SPACE
Motivated by the results of Section II, we construct a search space that only contains admissible CPWL functions. More precisely, we let this space be spanned by the shifts of a CPWL basis function $\varphi$ that takes the value $1$ at the origin, vanishes at all other lattice sites, and is affine over each triangle of the induced partition (see Fig. 3). We note that $\varphi = \frac{\sqrt{3}}{2} \beta_{\boldsymbol{\Xi}}$ is, in fact, a scaled hexagonal box spline [48], with the three-directional matrix $\boldsymbol{\Xi} = [\boldsymbol{\xi}_1\ \boldsymbol{\xi}_2\ \boldsymbol{\xi}_3]$ (see Fig. 2). Consequently, its Fourier transform is given in closed form by a product of three sinc-like factors, one per column of $\boldsymbol{\Xi}$. We refer to Appendix B for the proof. Additionally, we construct a hexagonal lattice on which we shift these basis functions. This lattice is determined by the primitive vectors $\mathbf{r}_1 = \boldsymbol{\xi}_1$ and $\mathbf{r}_2 = (-\boldsymbol{\xi}_2)$. For ease of representation, we concatenate the primitive vectors into the lattice matrix $\mathbf{R} = [\mathbf{r}_1\ \mathbf{r}_2]$ [53]. Then, the search space with grid size $h \in \mathbb{R}_+$ is defined as
$$X_h^{\mathbf{R}}(\mathbb{R}^2) = \Big\{ f = \sum_{\mathbf{k} \in \mathbb{Z}^2} c[\mathbf{k}]\,\varphi(\cdot/h - \mathbf{R}\mathbf{k}) : \mathrm{HTV}(f) < \infty \Big\}. \qquad (22)$$
We observe that any model $f \in X_h^{\mathbf{R}}(\mathbb{R}^2)$ can be expressed as
$$f(\mathbf{x}) = \sum_{\mathbf{k} \in \mathbb{Z}^2} c[\mathbf{k}]\,\varphi(\mathbf{x}/h - \mathbf{R}\mathbf{k}) \qquad (23)$$
for some set of box-spline coefficients $\{c[\mathbf{k}]\}_{\mathbf{k}\in\mathbb{Z}^2}$.
Analogous to the space of cardinal linear splines [54], [55], our search space satisfies some desirable properties that are listed in Theorem 2. The proof can be found in Appendix C.
Theorem 2: The search space $X_h^{\mathbf{R}}(\mathbb{R}^2)$ satisfies the following properties.
1) It reproduces any affine mapping, in the sense that any function of the form $f(\mathbf{x}) = \mathbf{a}^T\mathbf{x} + b$ can be expressed as (23).
2) The collection $\{\varphi(\cdot/h - \mathbf{R}\mathbf{k})\}_{\mathbf{k}\in\mathbb{Z}^2}$ forms a Riesz basis of $X_h^{\mathbf{R}}(\mathbb{R}^2)$. This ensures a unique and stable link between each model function and its coefficients [56].
3) The basis is interpolatory, with $\varphi(\mathbf{R}\mathbf{k}) = \delta[\mathbf{k}]$ for $\mathbf{k}\in\mathbb{Z}^2$. Consequently, we have that $f(h\mathbf{R}\mathbf{k}) = c[\mathbf{k}]$ for any $f \in X_h^{\mathbf{R}}(\mathbb{R}^2)$ and any $\mathbf{k} \in \mathbb{Z}^2$.
5) The basis element $\varphi$ is refinable, in the sense that $\varphi(\cdot/2h)$ can be exactly represented in $X_h^{\mathbf{R}}(\mathbb{R}^2)$ with finitely many coefficients.

Additionally, this search space has three desirable characteristics. First, the hexagonal lattice provides an optimal packing density [57], [58], which leads to an improved approximation power for a given number of basis functions. Second, the domain of any $f \in X_h^{\mathbf{R}}(\mathbb{R}^2)$ is partitioned into equilateral triangles (see Fig. 3(b)), which results in simplified computations (see Section IV). Third, Property 5 allows for efficient multiresolution algorithms.
Finally, we formalize the functional-learning problem in our search space as the minimization
$$\min_{f \in X_h^{\mathbf{R}}(\mathbb{R}^2)} \sum_{m=1}^{M} E\big(y_m, f(\mathbf{x}_m)\big) + \lambda\, \mathrm{HTV}(f), \qquad (24)$$
where $E : \mathbb{R} \times \mathbb{R} \to \mathbb{R}_+$ is a strictly convex loss function (e.g., $E(y, z) = (y - z)^2$ for the quadratic loss). Let us mention that our framework is compatible with any open, connected, and bounded domain (such as a Euclidean disk) [39]. Nonetheless, working with $\mathbb{R}^2$ has the advantage of learning an affine extrapolation.

IV. DISCRETIZATION OF THE PROBLEM
In this section, we detail the algorithm that we have developed to solve Problem (24). For this purpose, we first show how to efficiently evaluate (23) at a given $\mathbf{x} \in \mathbb{R}^2$, and then exactly express the HTV of any $f \in X_h^{\mathbf{R}}(\mathbb{R}^2)$ in terms of its parameters.

A. DATA FIDELITY
Due to the finite support of the atoms (Fig. 3(b)), we infer that, at each location $\mathbf{x} \in \mathbb{R}^2$, there are at most three basis functions that are nonzero. These active basis functions correspond to the vertices of the triangle that contains $\mathbf{x}$. Denoting by $\mathbf{k}_{m,1}, \mathbf{k}_{m,2}, \mathbf{k}_{m,3} \in \mathbb{Z}^2$ the active indices for the data point $\mathbf{x}_m$, the evaluation of (23) reduces to
$$f(\mathbf{x}_m) = \sum_{n=1}^{3} h_{m,n}\, c[\mathbf{k}_{m,n}],$$
where $h_{m,n} = \varphi(\mathbf{x}_m/h - \mathbf{R}\mathbf{k}_{m,n})$, $n = 1, 2, 3$.
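A direct way to implement this evaluation (a sketch of our own, not the released implementation) is to locate the triangle containing the query point in lattice coordinates; the barycentric weights of the point in that triangle are precisely the values $h_{m,n}$ of the hat-shaped basis function.

```python
import numpy as np

def evaluate_model(x, c, R, h):
    """f(x) = sum_k c[k] phi(x/h - R k), using only the three active basis
    functions of the triangle containing x. `c` is assumed to be a dict mapping
    integer lattice indices (k1, k2) to coefficients (zero if absent)."""
    y = np.linalg.solve(R, np.asarray(x, dtype=float) / h)   # lattice coordinates
    i, j = int(np.floor(y[0])), int(np.floor(y[1]))
    u, v = y[0] - i, y[1] - j                                 # fractional parts
    if u + v <= 1.0:   # "lower" triangle with vertices (i,j), (i+1,j), (i,j+1)
        verts, w = [(i, j), (i + 1, j), (i, j + 1)], [1 - u - v, u, v]
    else:              # "upper" triangle with vertices (i+1,j), (i,j+1), (i+1,j+1)
        verts, w = [(i + 1, j), (i, j + 1), (i + 1, j + 1)], [1 - v, 1 - u, u + v - 1]
    return sum(wn * c.get(k, 0.0) for wn, k in zip(w, verts))
```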

B. REGULARIZATION
The central result of this section is presented in Theorem 3.

Theorem 3: For any $f \in X_h^{\mathbf{R}}(\mathbb{R}^2)$ of the form (23), we have that
$$\mathrm{HTV}(f) = \|\mathbf{D}\{c\}\|_{1,1} = \sum_{n=1}^{3} \sum_{\mathbf{k}\in\mathbb{Z}^2} \big|(d_n * c)[\mathbf{k}]\big|, \qquad (26)$$
where $\|\mathbf{A}\|_{1,1} = \|\mathrm{vec}(\mathbf{A})\|_1$ is the sum of the absolute values of the entries of $\mathbf{A}$, and $\mathbf{D}\{c\}$ collects the outputs of the three 2D digital filters $d_1$, $d_2$, $d_3$ (depicted in Fig. 5) applied to the coefficient sequence $c$.

Due to the specific form of our search space, we can rewrite (28) as a summation over the lattice vertices rather than the triangles and associate three junctions with each vertex. This leads to
$$\mathrm{HTV}(f) = h \sum_{\mathbf{n}\in\mathbb{Z}^2} \sum_{k=1}^{3} \|\mathbf{a}_{\mathbf{n}}^{k} - \mathbf{a}_{\mathbf{n}}\|_2, \qquad (29)$$
where $\mathbf{a}_{\mathbf{n}}$ is the gradient of the triangle $P_{\mathbf{n}}$ associated with the vertex $\mathbf{n}$ (Fig. 4). Similarly, the vector $\mathbf{a}_{\mathbf{n}}^{k}$ is the gradient of the neighboring triangle that shares a border with $P_{\mathbf{n}}$ in the direction of $\mathbf{r}_k$, where $\mathbf{r}_1$ and $\mathbf{r}_2$ are illustrated in Fig. 3(b) and $\mathbf{r}_3 = (-\frac{1}{2}, \frac{\sqrt{3}}{2})$. By changing the order of summation, we obtain that
$$\mathrm{HTV}(f) = h \sum_{\mathbf{n}\in\mathbb{Z}^2} \|\mathbf{a}_{\mathbf{n}}^{1} - \mathbf{a}_{\mathbf{n}}\|_2 + h \sum_{\mathbf{n}\in\mathbb{Z}^2} \|\mathbf{a}_{\mathbf{n}}^{2} - \mathbf{a}_{\mathbf{n}}\|_2 + h \sum_{\mathbf{n}\in\mathbb{Z}^2} \|\mathbf{a}_{\mathbf{n}}^{3} - \mathbf{a}_{\mathbf{n}}\|_2. \qquad (30)$$
Each of the three terms of (30) can be computed via a filtering operation. Here, we just prove this for the last term in the summation and we deduce the other two using similar computations.
Using the notations in Fig. 4, the affine pieces on the two neighboring triangles interpolate the coefficients at their vertices, so that
$$\mathbf{R}_h^T \mathbf{a}_{\mathbf{n}} = \begin{pmatrix} c[\mathbf{k}_1] - c[\mathbf{n}] \\ c[\mathbf{k}_2] - c[\mathbf{n}] \end{pmatrix} \quad \text{and} \quad \mathbf{R}_h^T \mathbf{a}_{\mathbf{n}}^{3} = \begin{pmatrix} c[\mathbf{k}_3] - c[\mathbf{k}_2] \\ c[\mathbf{k}_3] - c[\mathbf{k}_1] \end{pmatrix},$$
where $\mathbf{R}_h = h[\mathbf{r}_1\ \mathbf{r}_2]$ is the lattice matrix. Combining these equations, we obtain that
$$\mathbf{R}_h^T (\mathbf{a}_{\mathbf{n}}^{3} - \mathbf{a}_{\mathbf{n}}) = \big( (1, -1, -1, 1)^T \mathbf{z} \big)\, (1, 1), \qquad (32)$$
where $\mathbf{z} = (c[\mathbf{n}], c[\mathbf{k}_1], c[\mathbf{k}_2], c[\mathbf{k}_3])$. The application of $(\mathbf{R}_h^{-1})^T$ to both sides of (32) leads to
$$(\mathbf{a}_{\mathbf{n}}^{3} - \mathbf{a}_{\mathbf{n}}) = \big( (1, -1, -1, 1)^T \mathbf{z} \big)\, (\mathbf{R}_h^{-1})^T (1, 1).$$
Using the homogeneity of the $\ell_2$-norm, we verify that
$$\|\mathbf{a}_{\mathbf{n}}^{3} - \mathbf{a}_{\mathbf{n}}\|_2 = \big| (1, -1, -1, 1)^T \mathbf{z} \big|\, \big\|(\mathbf{R}_h^{-1})^T (1, 1)\big\|_2 = \frac{2\sqrt{3}}{3h}\, \big| (1, -1, -1, 1)^T \mathbf{z} \big|.$$
By plugging in $\mathbf{k}_1 = \mathbf{n} + (1, 0)$, $\mathbf{k}_2 = \mathbf{n} + (0, 1)$, and $\mathbf{k}_3 = \mathbf{n} + (1, 1)$, we express the last term in (30) as
$$h \sum_{\mathbf{n}\in\mathbb{Z}^2} \|\mathbf{a}_{\mathbf{n}}^{3} - \mathbf{a}_{\mathbf{n}}\|_2 = \frac{2\sqrt{3}}{3} \sum_{\mathbf{n}\in\mathbb{Z}^2} \big| c[\mathbf{n}] - c[\mathbf{n} + (1, 0)] - c[\mathbf{n} + (0, 1)] + c[\mathbf{n} + (1, 1)] \big|.$$
Theorem 3 thus provides a simple algorithm to evaluate $\mathrm{HTV}(f)$ in terms of three convolutions, whose filters are depicted in Fig. 5. We also remark that, for any admissible CPWL model $f$, the outputs of the digital filters $d_n$, $n = 1, 2, 3$, are zero outside of a compact domain. This in effect allows us to consider an equivalent finite lattice to represent $f$ in practice.
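Putting the three terms of (30) together, the HTV of a model can be computed from its coefficient array with a few second differences. In the sketch below (our own working), the third difference is exactly the one derived above; the other two are obtained by the analogous computation along $\mathbf{r}_1$ and $\mathbf{r}_2$, so they may differ from the filters of Fig. 5 by sign or indexing conventions.

```python
import numpy as np

def htv_from_coeffs(c):
    """HTV of a model in the search space from its lattice coefficients
    (2D array `c`, implicitly zero outside), as the weighted l1-norm of second
    differences along the three lattice directions; see (30)."""
    c = np.pad(np.asarray(c, dtype=float), 2)  # zero boundary conditions
    total = 0.0
    n1, n2 = c.shape
    for i in range(1, n1 - 1):
        for j in range(1, n2 - 1):
            d1 = c[i, j] + c[i + 1, j] - c[i, j + 1] - c[i + 1, j - 1]
            d2 = c[i, j] - c[i + 1, j] + c[i, j + 1] - c[i - 1, j + 1]
            d3 = c[i, j] - c[i + 1, j] - c[i, j + 1] + c[i + 1, j + 1]
            total += abs(d1) + abs(d2) + abs(d3)
    return 2.0 * np.sqrt(3.0) / 3.0 * total

# Single hat basis function: its HTV evaluates to 8*sqrt(3) ~ 13.86, independently of h.
print(htv_from_coeffs(np.array([[0, 0, 0], [0, 1, 0], [0, 0, 0]])))
```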

C. GENERALIZED LASSO
We now merge the results of Sections III, IV-A, and IV-B to derive an exact finite-dimensional discretization of Problem (24). We consider a finite lattice of square size $(N \times N)$ such that all training data are contained in it. The lattice coefficients are grouped into the vector $\mathbf{c} \in \mathbb{R}^{N^2}$, which is a (row-wise) vectorization of the 2D array $c[\mathbf{k}]$, $\mathbf{k} \in \Omega$, where $\Omega \subset \mathbb{Z}^2$ is the set of lattice indices. Consequently, we define the regularization matrix $\mathbf{L} \in \mathbb{R}^{3N^2 \times N^2}$ as
$$\mathbf{L} = \begin{bmatrix} \mathbf{L}_1 \\ \mathbf{L}_2 \\ \mathbf{L}_3 \end{bmatrix},$$
where $\mathbf{L}_n$ is a Toeplitz-like matrix associated with the 2D digital filter $d_n$ such that, for $n = 1, 2, 3$, $\mathbf{L}_n \mathbf{c}$ is the vectorized version of $(d_n * c)[\mathbf{k}]$, $\mathbf{k} \in \Omega$. Further, we define the forward matrix $\mathbf{H} \in \mathbb{R}^{M \times N^2}$ such that its $m$th row encodes the evaluation of the model at the data point $\mathbf{x}_m$: its only nonzero entries are $h_{m,1}$, $h_{m,2}$, $h_{m,3}$, located at the positions of the active indices $\mathbf{k}_{m,1}$, $\mathbf{k}_{m,2}$, $\mathbf{k}_{m,3}$, so that $[\mathbf{H}\mathbf{c}]_m = f(\mathbf{x}_m)$. Ultimately, this leads to the finite-dimensional problem
$$\min_{\mathbf{c} \in \mathbb{R}^{N^2}} \sum_{m=1}^{M} E\big(y_m, [\mathbf{H}\mathbf{c}]_m\big) + \lambda \|\mathbf{L}\mathbf{c}\|_1, \qquad (37)$$
which has the composite structure of the generalized LASSO [31] and can be solved efficiently using known convex-optimization solvers (e.g., ADMM or its variants [59]–[61]). We denote the corresponding solution by $\mathbf{c}_0$.
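For prototyping, (37) with the quadratic loss can be handed to a generic convex solver. The paper relies on ADMM-type algorithms; the sketch below instead uses CVXPY, with small random stand-ins for H, L, and y purely so that it runs.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
M, P = 50, 100                      # number of data points, number of coefficients
H = rng.standard_normal((M, P))     # stand-in for the data-fidelity matrix
L = np.diff(np.eye(P), axis=0)      # stand-in for the stacked regularization matrix
y = rng.standard_normal(M)
lam = 0.1

c = cp.Variable(P)
prob = cp.Problem(cp.Minimize(cp.sum_squares(y - H @ c) + lam * cp.norm1(L @ c)))
prob.solve()
c0 = c.value                        # reference solution used in the post-processing step
```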
The discrete formulation (37) also highlights the sparsity-promoting effect of the HTV regularization, due to the presence of the $\ell_1$ penalty in (37). Consequently, we expect to learn models with few linear regions. In order to find a sparser solution, we use the simplex algorithm [62], [63] to solve the minimization
$$\arg\min_{\mathbf{c} \in \mathbb{R}^{N^2}} \|\mathbf{L}\mathbf{c}\|_1 \quad \text{s.t.} \quad \mathbf{H}\mathbf{c} = \mathbf{H}\mathbf{c}_0. \qquad (38)$$
This post-processing step is known to provide an extreme point of the solution set of (37) [21] which, in our case, often leads to a sparser CPWL mapping.
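A problem of the form (38) can be solved with an off-the-shelf simplex-type LP solver after a standard epigraph reformulation (our own recasting, shown with random stand-ins so that the sketch runs on its own):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
M, P = 20, 40
H = rng.standard_normal((M, P))           # stand-in forward matrix
L = np.diff(np.eye(P), axis=0)            # stand-in regularization matrix
c0 = rng.standard_normal(P)               # stand-in reference solution of (37)
Q = L.shape[0]

# minimize 1^T t  subject to  L c - t <= 0, -L c - t <= 0, H c = H c0
obj = np.concatenate([np.zeros(P), np.ones(Q)])
A_ub = np.block([[L, -np.eye(Q)], [-L, -np.eye(Q)]])
A_eq = np.hstack([H, np.zeros((M, Q))])
res = linprog(obj, A_ub=A_ub, b_ub=np.zeros(2 * Q), A_eq=A_eq, b_eq=H @ c0,
              bounds=[(None, None)] * P + [(0, None)] * Q, method="highs")
c_sparse = res.x[:P]                      # extreme-point (typically sparser) solution
```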

D. DISCUSSION
In an attempt to generalize our framework, we could also consider learning higher-order splines. Indeed, the HTV is fully compatible with these smooth functions, and we can even define different flavors of it by choosing other Schatten-norms [39]. The main challenge would be to obtain a parametric expression (akin to (26)) for the HTV of a given element in the restricted search space. This is crucial for obtaining an exact discretization of the continuous-domain learning problem.

V. EXPERIMENTS
In this section, we demonstrate the advantages of our pipeline by comparing it to other existing learning methods. The Python code to reproduce all the experimental results is available on Github. 1

A. MINIMUM-NORM INTERPOLATION
We demonstrate the sparsity-promoting effect of the HTV regularizer in a controlled environment in which we sample $M = 12$ points from a pyramid function $f_{\mathrm{pyr}}$ whose vertices are positioned on the lattice. This ensures that the target function can be represented exactly in our search space. To isolate the effect of the regularization, we use a simplex solver to find
$$\arg\min_{f \in X_h^{\mathbf{R}}(\mathbb{R}^2)} \mathrm{HTV}(f) \quad \text{s.t.} \quad f(\mathbf{x}_m) = y_m, \quad m = 1, \ldots, M,$$
by recasting it as a discrete minimization problem of the form (38). In Fig. 6, we show the results of successive experiments where we use a lattice of size $(20 \times 20)$ (a total number of 421 parameters) with zero boundary conditions. We chose a colormap based on the triangle normals so that coplanarities can be identified. Due to randomness in the implementation of the algorithm, we obtained different solutions of (38). They all resulted in the same minimal $\mathrm{HTV}(f)$. We observe that the algorithm leads to sparse solutions in all cases, with few faces. Indeed, from a search space that can model functions with hundreds of faces (Fig. 6(a)), we reached solutions with just 12 (Fig. 6(b)), 7 (Fig. 6(c)), and 6 (Fig. 6(d)) faces, respectively.

B. DATA FITTING
In this experiment, we tackle a data-fitting problem and compare three approaches: 1) ours, using HTV regularization and the CPWL search space (22); 2) ReLU neural networks, which also construct CPWL models; 3) radial basis functions (RBFs) with Gaussian kernels, a classical approach in supervised learning [64]–[66]. The dataset consists of samples from a CPWL function with noisy labels. More precisely, the labels are of the form
$$y_m = f_{\mathrm{GT}}(\mathbf{x}_m) + \epsilon_m,$$
where $\epsilon_m \sim \mathcal{N}(0, \sigma^2)$ and $f_{\mathrm{GT}}$ is the CPWL ground-truth function shown in Figs. 7(a) and 7(f). We use 200 data points and set $\sigma = \frac{1}{20}\|f_{\mathrm{GT}}\|_{L_\infty}$. Note that the model cannot be represented exactly in our search space since the data points do not fall on the lattice; however, the error can be mitigated by a sufficient reduction of the grid size. The setup is as follows: for the data-fidelity term in (37), we use the quadratic loss $E(y, z) = (y - z)^2$. For the ReLU network, we use a fully connected architecture with 4 hidden layers, each with 256 hidden neurons. The total number of parameters of the neural network is 198401. We train the neural network for 500 epochs using the Adam optimizer [67] with a batch size of 10 and weight decay. The initial learning rate is set to $10^{-3}$ and is decreased by a factor of 10 at epochs 375 and 425. For the HTV, we use a lattice of size $(64 \times 64)$, giving a total of 4225 parameters. In all methods, we tune the corresponding hyperparameter on a validation set (regularization weight $\lambda$ for the HTV and the radial-basis functions [RBF], kernel size $\gamma$ for the RBF, and weight-decay parameter $\mu$ for the neural network) to have a fair comparison. To assess sparsity, we sample the learned neural-network and RBF models at the positions of the lattice vertices and vectorize these values (we denote the resulting vector by $\mathbf{c}$), as done for our method. Finally, for all methods, we compute the percentage of non-negligible "changes of slope" as
$$\frac{\#\{i : |[\mathbf{L}\mathbf{c}]_i| > \epsilon\}}{3N^2} \cdot 100,$$
where $\epsilon = 10^{-4}$ and $3N^2$ is the number of rows of $\mathbf{L}$.
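For reference, this sparsity measure can be computed with a small helper of ours:

```python
import numpy as np

def sparsity_percentage(L, c, eps=1e-4):
    """Percentage of entries of L @ c with magnitude above eps,
    i.e., non-negligible changes of slope."""
    Lc = L @ c
    return 100.0 * np.count_nonzero(np.abs(Lc) > eps) / Lc.size
```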
The results are shown in Table 1 and Fig. 7, along with the ground truth (GT). We observe that the HTV model performs significantly better than the radial basis functions and on par with the neural network. Furthermore, as seen in the last column of the table and from the figures, the HTV leads to a much sparser result. Moreover, we observe that the level of sparsity for the HTV can be controlled with the regularization weight (a higher weight leads to sparser results).

TABLE 1 Test MSE and Sparsity of Each Method in the Data-Fitting Example
In the extreme case $\lambda \to +\infty$, the model should converge to the least-squares linear approximation of the training data. The very high regularization weight $\lambda = 10$ allows us to verify this in practice. Indeed, the resulting model is linear and the data-fitting error is precisely the same as the one obtained with a least-squares fit.

C. REAL DATASET
In this section, we benchmark the three methods of the experiment of Section V-B on a (non-CPWL) facial dataset. This dataset is a 2D height map $f : \mathbb{R}^2 \to \mathbb{R}$ that we construct by cutting a 3D face model 2 (Fig. 8(a)). We then sample 8000 data points for training (Fig. 8(b)).
Relative to Section V-B, the setup has the following differences: for the HTV, we use a lattice of size (194 × 194) (38025 parameters) and skip the simplex post-processing step; for the neural network, we incorporate one additional hidden layer (264193 parameters), increase the number of epochs to 2000 and the batch size to 100, and, lastly, decrease the initial learning rate at epochs 1750 and 1900.
The results are shown in Table 2 and Fig. 8. The HTV achieved the lowest test mean-square error (MSE), on par with the RBF, which is expected to perform well due to the high density of data points and the absence of noise. Regarding the effect of the regularization, we again observe that, for the HTV, increasing it results in a model with fewer faces. In the case of the RBF, the solutions present ringing artifacts, especially in a low-regularization regime.

TABLE 2 Test MSE and Sparsity of Each Method in the Face Dataset
Finally, we remark that the neural network constructs a coarse approximation of the data.

VI. CONCLUSION
We have introduced a method to solve two-dimensional learning problems regularized with the Hessian-nuclear total-variation (HTV) seminorm. The starting point of our work has been the observation that the HTV of (admissible) continuous and piecewise-linear (CPWL) functions has a closed-form expression. Its computation, however, requires knowledge of the gradient and boundaries of each partition element. To circumvent this drawback, we have formulated the problem in a search space consisting of shifts of CPWL box splines on a lattice. By doing so, we are able to evaluate any model in the search space, as well as compute its HTV, from its values at the lattice points (model parameters). In particular, we have shown that the latter can be computed with a three-filter convolutional structure; this allows us to discretize the problem exactly and to recast it in the form of the generalized least-absolute-shrinkage-and-selection operator. Finally, we have demonstrated the sparsity-promoting effect of our framework via numerical examples in which we have compared its performance with ReLU neural networks and radial basis functions.

APPENDIX
A. GENERALIZED HESSIAN OPERATOR
Let $\mathcal{S}(\mathbb{R}^2)$ be the Schwartz space of smooth and rapidly decaying functions over $\mathbb{R}^2$. Its topological dual, denoted by $\mathcal{S}'(\mathbb{R}^2)$, is the space of tempered distributions. An element $w \in \mathcal{S}'(\mathbb{R}^2)$ is a linear functional whose action on any $\varphi \in \mathcal{S}(\mathbb{R}^2)$ is given by the duality product $\varphi \mapsto \langle w, \varphi \rangle$.
For any $w \in \mathcal{S}'(\mathbb{R}^2)$, the weak partial derivative $\frac{\partial}{\partial x_k} : \mathcal{S}'(\mathbb{R}^2) \to \mathcal{S}'(\mathbb{R}^2)$ is a continuous linear map whose action is given by
$$\Big\langle \frac{\partial w}{\partial x_k}, \varphi \Big\rangle = -\Big\langle w, \frac{\partial \varphi}{\partial x_k} \Big\rangle, \quad \varphi \in \mathcal{S}(\mathbb{R}^2),$$
where the partial derivative on the right-hand side is in the usual sense (with the limit definition) $\frac{\partial}{\partial x_k} : \mathcal{S}(\mathbb{R}^2) \to \mathcal{S}(\mathbb{R}^2)$. We refer to [68, Section 3.3.2] for more details on extensions by duality.
To define the generalized Hessian operator, we first need to extend the two spaces $\mathcal{S}(\mathbb{R}^2)$ and $\mathcal{S}'(\mathbb{R}^2)$ to matrix-valued functions. Concretely, any $\mathbf{F} \in \mathcal{S}(\mathbb{R}^2; \mathbb{R}^{2\times 2})$ is of the form
$$\mathbf{F} = \begin{bmatrix} F_{11} & F_{12} \\ F_{21} & F_{22} \end{bmatrix}, \quad F_{ij} \in \mathcal{S}(\mathbb{R}^2).$$
Similarly, any $\mathbf{W} \in \mathcal{S}'(\mathbb{R}^2; \mathbb{R}^{2\times 2})$ is of the form
$$\mathbf{W} = \begin{bmatrix} W_{11} & W_{12} \\ W_{21} & W_{22} \end{bmatrix}, \quad W_{ij} \in \mathcal{S}'(\mathbb{R}^2).$$
Consequently, the duality product associated with these two spaces is the sum of the entrywise duality products
$$\langle \mathbf{W}, \mathbf{F} \rangle = \sum_{i=1}^{2} \sum_{j=1}^{2} \langle W_{ij}, F_{ij} \rangle.$$
Finally, the generalized Hessian operator is denoted by $\mathbf{H} : \mathcal{S}'(\mathbb{R}^2) \to \mathcal{S}'(\mathbb{R}^2; \mathbb{R}^{2\times 2})$ and is defined for any $f \in \mathcal{S}'(\mathbb{R}^2)$ as
$$\mathbf{H}\{f\} = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} \end{bmatrix},$$
where all derivatives are understood in the weak sense.

C. PROOF OF THEOREM 2
We prove the properties for a unit grid size (h = 1), without any loss of generality. To do so, we rely on the Fourier-domain characterization of a generic box spline [69, Proposition (17)].
Item 5 (Refinable search space): We want to show that there exists a refinability filter $h[\cdot] \in \ell_2(\mathbb{Z}^2)$ with finitely many nonzero taps such that
$$\varphi(\mathbf{x}/2) = \sum_{\mathbf{k}\in\mathbb{Z}^2} h[\mathbf{k}]\, \varphi(\mathbf{x} - \tilde{\mathbf{R}}\mathbf{k}),$$
where $\tilde{\mathbf{R}} = [\boldsymbol{\xi}_1\ \boldsymbol{\xi}_2]$. In the Fourier domain, this condition is equivalent to
$$4\, \hat{\varphi}(2\boldsymbol{\omega}) = \hat{h}(\boldsymbol{\omega})\, \hat{\varphi}(\boldsymbol{\omega}), \quad \text{with} \quad \hat{h}(\boldsymbol{\omega}) = \sum_{\mathbf{k}\in\mathbb{Z}^2} h[\mathbf{k}]\, \mathrm{e}^{-\mathrm{j}\boldsymbol{\omega}^T \tilde{\mathbf{R}}\mathbf{k}}.$$
Computing the ratio $4\,\hat{\varphi}(2\boldsymbol{\omega})/\hat{\varphi}(\boldsymbol{\omega})$ from the product form of $\hat{\varphi}$, where we have used the identity $\sin(x) = 2\cos(x/2)\sin(x/2)$, yields a trigonometric polynomial with finitely many terms, which identifies the (finitely supported) refinement filter $h$.