Deep Unrolling for Light Field Compressed Acquisition using Coded Masks

,


I. INTRODUCTION
L IGHT field imaging is becoming an increasingly popu- lar subject of interest and research.Indeed, light fields naturally extend the traditional notion of images as they capture more information about a scene, by recording not only the radiance on pixels on a two-dimensional plane, but also discriminating along the direction of the light rays.Light field acquisition thus makes possible a great deal of applications, like post-capture image refocusing and depth-of-field tuning [1], [2], synthetic aperture imaging [3], microscopy [4], virtual reality, depth estimation [5], [6], etc.However, light field acquisition remains the main problem, as the devices that record light fields are generally very bulky as well as expensive, thus hindering light field capture in real-world applications.Plenoptic cameras based on micro-lenses placed between the main lens and the sensor have been designed [7].Since the sensor resolution is limited, with plenoptic cameras a dense angular sampling is obtained at the expense of a reduced spatial sampling of the different views.This trade-off between spatial and angular resolutions needs to be addressed by using light field super-resolution methods as proposed in [6].Among the other possible solutions to tackle the problem of practical light field acquisition, compressed light field acquisition using color-coded masks (CCM) and color filter array (CFA) is especially interesting.Indeed, compressed sensing is a mathematical framework providing strong guarantees on signal recovery from incomplete measurements [8]- [10].Its application to the compressed acquisition of light fields can be made effective in several ways.Compressed sensing is originally meant to reconstruct a signal that is assumed to be sparse in a given dictionary.While traditional methods using sparse priors have been to some extent successfully applied to light field reconstruction [11]- [14], the iterative nature of the reconstruction algorithm greatly precludes its actual use, notably in real-time applications.Alternatively, deep learning methods have been very successful at reconstructing light fields in a way that is both fast and accurate [15], [16].These methods are orders of magnitude faster than traditional iterative methods like the orthogonal matching pursuit [17], and can outperform sparsity priors for a great number of VOLUME 4, 2016 image reconstruction tasks [18].However, these networks in general have a very large number of parameters, hence require a very large amount of training data.
Unrolled optimization algorithms have emerged as efficient solutions to combine the flexibility of deep learning methods with the mathematical principles underpinning the more traditional optimization methods.Algorithm unrolling can be seen as a paradigm to design deep neural architectures from optimization algorithms while incorporating useful priors into the models in a principled way.Several optimization algorithms have been unrolled in the literature, such as ISTA and the coordinate descent algorithm [19], the gradient descent [20], the proximal gradient [21], the ADMM [22], [23] and the half quadratic splitting (HQS) [24] algorithms.A learned network is used at each iteration of the optimization algorithm, as a regularizer (or as its gradient [20]), or as a proximal operator [23] which can be seen as a denoiser.While usual optimization methods iterate until convergence using an ending condition on the reconstruction error, unrolling an iterative algorithm considers a small number of iterations.This allows training a learned component end-toend within the optimization algorithm, in a way that takes into account the data term, i.e. , the degradation operator.By end-to-end learning we refer to the learning of the weights of the regularizers by back-propagation of the gradients from the last to the first iteration of the unrolled optimization algorithm.
In this paper, we present an unrolled optimization solution with a neural network-based prior for light field reconstruction from a set of 2D measurements.The proposed solution is based on unrolling the half-quadratic splitting (HQS) method, where the proximal operator used in the regularization step is defined by a neural network-based denoiser learned at each iteration of the unrolled optimization.While most compressed light field acquisition methods assume RGB measurements, our acquisition scheme does not make such assumption and relies instead on measurements obtained using a monochromatic sensor equipped with a color filter array.We show that our approach can be applied to any number of measurements and give experimental results for a number of measurements going from 1 to 3. Whereas the methods in [15], [16] are based on convolutional neural networks (CNN) that act as regression networks, we derive instead our architecture from the unrolling of the minimization of a regularized mean-squared problem.The proposed solution is actually based on an unrolled Half Quadratic Splitting (HQS) optimization method with a learned proximal operator, leading to a completely different architecture.We additionally show that, with our acquisition scheme, the extraction of information from the set of measurements is made possible in closed-form.This comes from the specific structure of our sensing matrix which is such that this matrix can be inverted in a very efficient manner using the Sherman-Morrison-Woodbury identity.The unrolled optimization method, together with the closed form introduced to solve the data term minimization, makes the architecture more mathematically grounded, compared with a learned CNN acting as a regression network between the input measurements and the reconstructed data.To the best of our knowledge, a closed-form minimization of the data term has not been proposed before for light field compressed acquisition, as other unrolled methods usually rely on an approximation instead (often a single gradient descent step).Note that, while unrolled optimization has been considered in [25], this was in a context of coded aperture acquisition of each colour component.Here, we consider an acquisition scheme with coded masks that is suitable for monochromatic sensors equipped with CFA.
We have assessed the proposed scheme, considering different acquisition scenarios, i.e. using a coded mask placed on the sensor in comparison with coded aperture designs.For the sake of comparison with the methods in [14], [15] and [16] which also consider masks placed on the sensor, we first consider a mask drawn from a uniform distribution, with a model assuming the presence of pinholes on the aperture plane.This corresponds to the case where the real continuous light field is first discretized and then modulated with the coded mask and the color filter array.We then derive a more realistic model without pinholes and show that this new model does not lead to any drop in quality.We also investigate the benefit of learning the distribution of a realization of the mask.
In summary, our contributions are as follows: • We present a novel architecture for compressed light field acquisition using a color-coded mask and a CFA, that can take different numbers of shots at its input.This architecture is based on an unrolled HQS optimization method with a learned proximal operator.• In the 1-shot case, we show that our architecture yields a 2.5 dB improvement over the deep learning based method in [16], which is already in average 1.97 dB better than the method in [15], using the same acquisition scheme.We also show that our approach yields a significant improvement in PSNR compared to traditional iterative methods using the same acquisition scheme.• We present a closed-form data term minimization solution which not only makes the architecture more mathematically grounded, but also consistently gives a PSNR gain of about 0.6 dB, regardless of the number of shots.• We derive a new realistic sensing model with coded masks placed on the sensor, which, unlike prior work, does not assume the presence of pinholes on the aperture plane to discretize the real continuous light field in the angular dimension before being modulated with the coded mask and the color filter array.

II. RELATED WORK
Many camera designs have been proposed for light field acquisition.The goal of this section is not to give a complete overview of the various designs, which can be found in [26], but rather to recall the designs related to the proposed approach, i.e. .based on coded masks.While programmable aperture approaches with non refractive masks placed at the aperture have been proposed in [27] to sequentially capture subsets of light rays [27], we focus here on solutions based on coded masks placed in front of the sensor.

A. COMPRESSIVE LIGHT FIELD ACQUISITION WITH CODED MASKS
The problem of light field reconstruction from the recorded set of measurements can be placed in a compressed sensing framework.Thanks to the use of a coded mask, the photosensor records a set of linear measurements from which a higher resolution light field can be reconstructed.This problem, being an ill-posed inverse problem, is solved using a least squares minimization with some regularization constraint based on hand-crafted signal priors.Marwah et al. [12] propose a camera architecture that records optically coded projections on a single image sensor using a monochrome mask, while Miandji et al. in [28] and [13] use respectively a random stationary or a moving color-coded mask to extract incoherent measurements.Nguyen et al. [14] introduce an Equivalent Multi-Mask Camera (EMMC) model which allows for a flexible configuration of a variety of sensing schemes.
Whereas the camera designs in [28], [13], [14] place the mask close to the sensor, the coded aperture cameras have the mask placed directly on the aperture plane.This is the approach followed in [29], [30], [25] and [31], where incoherent light field measurements are captured by using a randomly coded mask placed on the aperture plane.In our proposed design, the mask is placed close to the sensor, which actually allows the rays coming from different angles to be multiplexed in a way that is dependent on the spatial position of the incident pixel, thereby increasing the possibility for the various measurements to be mutually incoherent, which is known to be a crucial property of degradation matrices as explained by [32].

B. DEEP COMPRESSIVE LIGHT FIELD ACQUISITION
In the above cases, the light field is reconstructed using a compressive sensing framework and classical sparse recovery methods, assuming the data to be sparse in a domain defined by an overcomplete dictionary [12], [13].However, this problem can also be efficiently solved using deep learning techniques [15], [30], [33], [34].The authors in [15], [33], [34] assume a pre-defined mask pattern and propose convolutional neural network architectures to reconstruct the light field from the set of measurements.Inagaki et al. [30] formulate the coded aperture acquisition and light field reconstruction as an auto-encoder, whereas we consider instead an unrolled optimization technique, and optimize the mask pattern together with the parameters of the reconstruction algorithm in an end-to-end auto-encoder learning.A learned convolutional network architecture is used in [25] to compute the coded sub-aperture images, from which the light field is reconstructed using an iterative optimization approach with a deep spatio-angular regularization prior.

C. UNROLLED OPTIMIZATION METHODS
Recent methods have been introduced with the goal of combining the advantages of well understood iterative optimization techniques with those of learned regularizers allowing us to model more complex image priors.These regularizers can take the form of operators of projection on a learned image subspace or manifold [18], [35], of a denoiser [36], [35], [37], or of the proximal operator for a regularizer [38].
Unrolling a fixed number of iterations of optimization algorithms is an efficient way of coupling optimization and deep learning techniques.Whereas usual iterative methods iterate until convergence using an ending condition on the reconstruction error, unrolling an iterative algorithm considers a small number of iterations.This allows using a learned regularization step trained end-to-end within the optimization algorithm, hence in a way that takes into account the data term, i.e. , the degradation operator.This principle has been applied to several optimization algorithms in the literature, e.g. , to ISTA and the coordinate descent algorithm [19], the gradient descent [20], the proximal gradient [21], the ADMM [22], [23] and HQS [24] algorithms.The learned regularization network can take the form of the gradient of a regularizer [20], or of a proximal operator based on a denoiser [23].The number of iterations of unrolled optimization methods is typically quite small due to difficulties in training networks corresponding to a large number of iterations.The authors in [39] however propose a solution for training arbitrarily deep unrolled optimization networks based on deep equilibrium models [40].

III. MATHEMATICAL FORMULATION OF THE LIGHT FIELD ACQUISITION PROBLEM A. LIGHT FIELDS
A light field is in general defined as the radiance at a given point in space and time, along a given direction and for a given wavelength.As such, it may be described as a real-valued function over a 7D space L(x, y, z, θ, φ, λ, t), where (x, y, z) are the spatial coordinates, and (θ, φ) are the angular coordinates, λ is the wavelength and t is the time.In the context of static light field acquisition by a given camera, assuming free space around the camera, a light field is more adequately described by a well-known twoplane parameterization L(x, y, u, v, λ) where (x, y) are the spatial coordinates, i.e. the coordinates on the sensor plane, and (u, v) are the angular coordinates, i.e. the coordinates on the aperture plane.For the sake of notation simplicity, we write x = (x, y) and u = (u, v).We define X as the range of spatial coordinates x, U as the range of angular coordinates u and Λ as the range of the spectral coordinate λ.

B. MONOCHROME IMAGE FORMATION MODEL
We consider a general camera model in which a light field L is first linearly modulated by one or several optical devices, before being recorded by an ideal 2D photosensor.
In practice, the light field is usually first transformed by a convergent lens, but this process can be ignored, without loss of generality, by considering L to be the in-camera conjugate light field with respect to the lens.For the sake of simplicity we also ignore the vignetting effect in our model, since it can be included in L as well.Assuming X is partitioned into P sub-regions (X p ) 1≤p≤P , each corresponding to an actual pixel, the intensity recorded by the pth pixel is modeled as where ψ is a modulation field (or shield field [41]) conditioned by the optical components of the camera.The whole measurement consists in the combined intensities recorded by all pixels.We may further extend the model by allowing multiple-shot captures.In this case, we consider N modulation fields (ψ n ) 1≤n≤N , each corresponding to a shot.The number of measurements then becomes N × P , and the intensity recorded by the pth pixel at the nth shot is: The whole acquisition process can be formulated as a linear operator: Note that the domain of the operator, i.e. the space of light fields, is infinite-dimensional, whereas the co-domain -the space of measures -is finite.

C. LIGHT FIELD COMPRESSED ACQUISITION USING CODED MASKS AND COLOR FILTER ARRAYS
The goal of multi-shot light field acquisition is, given a number of shots N , to recover the light field L. Of course, since the physical description of light fields is infinite-dimensional, we first need to discretize the space of parameters, i.e.X, U and Λ in order to obtain a computable representation.In the remainder of the article, we use the same notations X, U and Λ for the discretized sets, and assume a uniformly spaced discrete parametrization for X and U , while Λ is reduced to the common {R, G, B} set.By a slight abuse of notation, we will also use X and U to denote the size of those sets, i.e. the number of discrete spatial and angular coordinates respectively.The discretized version of Equation 2 is then: and the domain of the acquisition operator becomes finite-dimensional.Furthermore, we assume that the pixels (X p ) 1≤p≤P are squares and correspond to the discretized version of X.That is, we assume that |X| = P , and that the spatial span of each pixel corresponds to a single discretized element of x: X p = {x p } for each p, where (x p ) 1≤p≤P is an enumeration of X.Note that one could choose a finer discretization for X, in which case the problem would become a joint compressed sensing and super-resolution task.We can thus drop the now redundant notation p and use x instead; Equation 4 becomes: The goal of light field compressed multi-shot acquisition is, given a number of shots N U × Λ, to recover the whole discretized light field L from the set of measurements.The theory of compressed sensing tells us that the ability to reconstruct the whole signal from a small number of measurements greatly depends on the properties of the sensing operator.As a general principle, the sensing matrix should be as random as possible, so that the measurements be as mutually incoherent as possible [32].In our case, the acquisition operator is entirely determined by the modulation fields ψ n .
Our acquisition scheme consists of a color-coded mask (CCM) placed at a small distance in front of the photosensor plane, and a color filter array (CFA) placed directly on the photosensor.These two elements act as linear filters on the input light field before its projection on the photosensor plane.Figure 1 represents the optical device used for acquisition, with the color-coded mask and the CFA.The CCM is a random mask whereas the CFA is usually a periodic array, traditionally a 2x2 periodic Bayer pattern [42].In our acquisition framework, multi-shot acquisition can be made possible by allowing the CCM to move on its plane and be able to be translated by a small amount.Assuming the CCM is characterized by m : Ξ × Λ → R, where Ξ is the CCM's plane, the corresponding modulation field is derived from elementary geometric considerations: where γ = d(Ξ,X) d(U ,X) , d(., .)denoting distance between planes.Equation 6shows that the sub-aperture slices of the modulation (i.e. the restriction of the modulation to a fixed u) are all translated versions of one another.
Whereas we could use fixed values for the pixels of the color-coded mask, for the sake of comparison, we instead follow the approach of [15] and [16], in which a different mask is generated each time, by independently drawing the transmittance value t ∈ R Λ of each of its pixels from a distribution D. Besides, to be able to compare with [15] and [16], we depart from physical realizability in our experiments (unless stated otherwise) by directly drawing a value t ∼ D independently for each element of the modulation field (i.e. for each spatial-angular coordinate (x, u)), instead of computing ψ using Equation ( 6).However, in order to simulate in a more accurate manner what a real-world set-up would do, we also assess the acquisition scheme using, at test time, a modulation field given by Equation ( 6) corresponding to a coded mask for which the transmittance of the pixels are drawn from D. We will see in the experimental section (Table 3) that the reconstruction does not suffer from any significant drop in quality when using a modulation field derived from such a physically realizable mask.
Furthermore, it is worth noting that Equation (5) corresponds to an acquisition device that includes pinholes placed on the aperture plane, corresponding to the fact that the real continuous light field is first discretized, and then modulated by way of coded mask and color filter array.Table 3 shows experimental results obtained when simulating the acquisition without pinholes.In that case, a linear interpolation of the available discrete input views is performed to approximate the missing views.The analytical expression of the modulation field given a mask m corresponding to this model is further detailed in Section III-D below.

D. REALISTIC ACQUISITION MODEL WITHOUT PINHOLES
In this section, we derive a more realistic modulation field ψ from the description of a mask m, assuming a continuous light field that we linearly approximate using the available views.Let us first consider the case of a monochromatic light field for which both the spatial domain and the angular domain are one-dimensional.We assume that the angular domain is discretized into regularly spaced points, so that the angular coordinates where data is available are u j = (j − j 0 )∆u for each j ∈ [0, . . ., N u − 1], with j 0 such that those coordinates are centered around zero and ∆u > 0 corresponding to the angular resolution.Similarly, the spa-tial domain is discretized into points located at coordinates x i = (i − i 0 )∆x where i 0 is such that the points are centered around zero and ∆x > 0 is the dimension of a sensor pixel.Let L(x, u) denote the continuous light field.We only have access to the discrete data L i,j := L(x i , u j ).Let m be composed of N m regularly spaced pixels, such that where m k is the transmittance value on mask pixel M k := [ξ k , ξ k+1 ] with ξ k = (k − k 0 )∆ξ and ∆ξ > 0 denoting the dimension of a mask pixel, and 1 M denoting the indicator function of a any subset M ⊂ Ξ.
Since the light fields in our dataset are densely sampled, we assume that L varies smoothly along the coordinates (x, u).Based on this assumption, we approximate the continuous light field at coordinates that are not on the grid by linear interpolation in the angular domain and nearest neighbour interpolation in the spatial domain.That is, L(., .)L(., .)where: whenever ] is then given by: where ξ(x, u) = (1 − γ)x + γu.Partitioning the angular domain into pixels U j = [u j , u j+1 ], this equation is rewritten: Injecting Equations ( 7) and ( 8) into (10) and rewriting the resulting equation yields where and Further rewriting gives us: where β i,j,k = α i,j,k + α r i,j−1,k , corresponding to a light throughput.Note that the values α i,j,k can be rewritten as where denotes the pre-image of M k by ξ), a convex polygon in the spatio-angular domain corresponding to the set of rays intersecting simultaneously the spatial pixel X i , the mask pixel M k and the angular pixel U j .Applying the same rewriting to α r , one can analytically compute the values for β as a sum of integrals of affine functions over convex polygons.We now consider light fields that are non-monochromatic and two dimensional in each of the spatial and angular domains and assume that the real continuous light field can be approximated by a bilinear interpolation of the available views.We also resume our notations x, u, ξ to denote the discrete coordinates corresponding to i, j and k respectively.Equation ( 14) can be readily extended to obtain the intensity at position x on the sensor for the wavelength λ, before the different wavelengths are modulated by the CFA and integrated over by the sensor, as: From Equation ( 16) it is clear that we can define: as the modulation field corresponding to our color-coded mask in this discretized view-interpolated framework.Equation ( 17) is conveniently rewritten as a matrix product: where B is a R (X•U )×Ξ matrix and M is a R Ξ×Λ matrix.Since the mask, the sensor and the aperture are all discretized along a separable grid of pixels, i.e. all three are the Cartesian product of their respective single-dimensional counterpart with itself, the 2D light throughput β is simply computed from its one-dimensional version by: where ξ = (ξ 1 , ξ 2 ).Note that β(x, u, ξ) equals zero for most combinations of x, u and ξ, and consequently the corresponding matrix B is sparse.The modulation field ψ is therefore efficiently computed as a sparse-dense matrix product.Experimental results using this framework are provided in section V-C.Please note that Marwah et al [12] successfully built a prototype using a monochromatic coded mask.Manufacturing a color-coded mask may be more challenging than a monochromatic one, nevertheless feasible thanks to recent advances in photolithograpy [43], [44].In addition, in practice, the modulation ψ cannot be known exactly by analytic means due to imperfections in the manufacturing process.However this problem could be tackled by using a calibration protocol akin to the ones used by [45] and [12].
Representation of the unrolled half-quadratic splitting algorithm as a two-layer block.The first layer corresponds to a (non-trainable) data-term minimization step: it solves a quadratic problem and takes as input an intermediate reconstruction V k , as well as the measures I, the degradation matrix Ψ and the weight µ k .The second layer (trainable) projects an intermediate reconstruction onto a sub-manifold of more likely -or natural -light fields and is parameterized by θ k .The block consisting of these two steps is iterated K times.

IV. ACQUISITION/RECONSTRUCTION ARCHITECTURE
Let Ψ be the sensing matrix corresponding to the acquisition operator defined in the previous section.We have Assuming L is the input discretized light field reshaped as a XU Λ-dimensional vector, the available information is I = C(ΨL) ∈ R XN , where C is a pixel-wise stochastic corruption process accounting for imperfections in the sensor.A detailed model of C is given in section VI-A.As a simplification, we can assume C to be the addition of centered Gaussian noise.In that case, we have: where n is a XN -dimensional centered Gaussian random variable with variance σ 2 .

A. OPTIMIZATION UNROLLING
The problem of reconstructing the light field L from the measurements I may be formulated in a Bayesian framework, by considering the reconstruction task as a maximum a posteriori problem.In its equivalent negative logarithmic formulation, the problem can be expressed as finding L * such that: Since the sensing noise is assumed to be centered Gaussian noise, we have: The reconstruction problem can then be formulated as a regularized least squares problem: where J = −σ 2 log P.
A popular way to solve this problem is to use the ADMM optimization algorithm [46].A simpler but effective alternative, especially in the context of algorithm unrolling, is to use the half-quadratic splitting (HQS) method [25], [47], [48].This approach is very similar to ADMM.Just like ADMM, HQS aims at solving unconstrained problems of the form: The problem is first reformulated as a constrained problem, introducing another variable w : The values of z and w are then iteratively and alternately optimized in an unconstrained framework by relaxing the equality constraint into a half-quadratic penalty term: where the (µ k ) k≥0 weight the penalty terms.Casting these equations into our regularized least squares problem, and introducing the variables V k we obtain: The update rule for L clearly corresponds to the resolution of a quadratic problem, whereas the update rule for V can be interpreted as applying a proximal operator Prox.We further rewrite in a more synthetic form: Now, while it is clear that the update rule for L can be easily applied, at least theoretically, given I, Ψ and V , the proximal operator actually depends on the underlying probability distribution P on light fields.Traditional methods have been somewhat successful at recovering signals from compressed measurements using hand-crafted choices of P, yet recent achievements in low-level image processing suggest that it is immensely beneficial to learn these priors from data.Algorithm unrolling in deep learning consists in using Equation 28to design a deep learning architecture.Specifically, one usually fixes a number of iterations K and then, instead of considering the Prox k as fixed functions, one replaces them by K trainable parameterized functions Prox(_, θ k ), typically neural networks.The general architecture for the unrolled half-quadratic splitting algorithm is depicted in Figure 2.
To our knowledge, the minimization of the quadratic terms is not solved in an exact manner in methods of the literature using unrolled HQS.Instead, some authors, for instance [25], replace the exact resolution by a single step of gradient descent on the quadratic function.The authors in [47] argue that the problem cannot be resolved in closed form, due to the complexity and size of the sensing matrix Ψ.While this may be true for some inverse problems, we show that in our compressed sensing framework the quadratic problem can actually be solved in closed form efficiently.

B. CLOSED-FORM SOLUTION OF THE DATA-FIDELITY TERM MINIMIZATION
A single step gradient descent is often used in the literature, as in [25], to minimize the data term of unrolled optimization methods, since the inversion of the corresponding degradation matrices is in general computationally untractable.However, this only gives an approximation of the optimal solution to the minimization of the data term.In our proposed scheme, we show that the structure of our sensing matrix is such that this matrix can be inverted in a very efficient manner using the Sherman-Morrison-Woodbury identity.
Taking the gradient of the quadratic form in the L-update of Equation 27 (with respect to L), we get: The single-step gradient descent update rule for L would be: where δ k > 0 is the gradient descent rate.The closed-form solution of the equation ∇ L Q = 0 is given by: Therefore we see that, a priori, in order to compute the closed-form solution L * , one has to invert the XU Λ×XU Λ regularized covariance matrix Ψ Ψ + µ k I.However, recall from section III-C that our sensing matrix Ψ is defined by Equation 5.This equation shows that the multiplexing of information acquired by the sensor is actually performed "spatial pixel-wise".This means that our compressed acquisition sensing matrix is actually block-diagonal.We can write: . . .
where each Ψ x is a N × U Λ matrix characterizing the multiplexing happening on pixel x.Thus the inversion of the matrix can be performed efficiently block-wise -i.e.pixelwise -by computing, possibly in parallel, the matrices . As a notational aside, we drop the subscript x in the remainder, and regard the block-diagonal matrix Ψ as a X ×N ×U Λ tensor; all operations on matrices are regarded as being performed pixel-wise.
Nonetheless, these matrices are still of size U Λ, so a direct computation, though tractable, remains somewhat costly (as U Λ is usually in the order of 10 2 ).This cost can be further alleviated by using the Sherman-Morrison-Woodbury identity [49]: where G is the Gram matrix ΨΨ .Note how the matrices to invert and the product on the right-hand side is efficiently computed by matrix-vector multiplications, once the G + µ k I have been inverted.All these multiplications can be carried out pixel-wise in parallel efficiently.Figure 3 gives a network representation of the L-update step.It is worth noticing that this architecture makes it straightforward and inexpensive to learn not only the parameter µ k , but also the sensing matrix Ψ.This point is further studied in section VI-B.

C. LEARNED PROXIMAL OPERATOR
While the structure of the function corresponding to the proximal operator is the same for each unrolled iteration, the weights between them are not shared.This is customary practice and allows for a greater expressive power of the overall architecture.We use a residual structure for our proximal operator, with a single skip-connection (see [50]), i.e. we have where Res is a function chosen to be a simple 2-dimensional convolutional neural network (CNN) with ELU non-linear activation [51].The input light field L of shape (X, U , Λ) is first reshaped into a (X, U Λ) = (X, Y, U Λ) tensor, which is processed as a 2D image with U Λ channels by the CNN.The output of the CNN is then reshaped back to its original shape.Each convolution except the last one in the residual branch consists in a 3×3 kernel with 128 filters.The last convolution uses a 3×3 kernel with U Λ filters to match the dimensions of the light field.The last convolution block does not have any subsequent non-linear activation, in order to maintain a full range of values for the residual.In our experiments we use a stack of 4 convolutions.The architecture of the proximal network is depicted in Figure 4.

D. OVERALL ARCHITECTURE
The overall reconstruction architecture is composed of the elements described in IV-A, IV-B and IV-C.The initial reconstruction V 0 is set to 0. We also add a final clipping layer, so that the output of the reconstruction architecture is L := max(0, min(1, V K )), to ensure that the intensities of the reconstructed light field lie between 0 and 1.In our experiments we use a number of unrolled iterations K = 12.For each iteration, the trainable proximal operator has 4.7 • 10 5 parameters, in addition to the learned µ k parameter.Because the weights are not shared between the different layers, the reconstruction network has a total of 5.6 • 10 6 parameters.
In addition to the reconstruction system which is purely computational, some physical parameters, namely the pattern of the CFA and the CCM, are learnable in our framework, since it is possible to physically produce an optical device with an arbitrary CFA and CCM.While [16] reported the benefits of learning the CFA and the distribution of the CCM, we found no improvement over using a fixed CFA and fixed CCM distribution in the noiseless case, with the proposed architecture.We instead used randomly generated CCMs in which the transmittance values of the pixels are drawn from a distribution D = U(0, 1), and a fixed 2×2 Bayer CFA (greenblue-red-green).The end-to-end acquisition-reconstruction process is summarized in Figure 5.We hypothesise that the inductive bias towards inverse problem solving in the architecture, enforced by the closed-form data-term minimization layer, makes the network less sensitive to the specifics of the degradation matrix, at least in the absence of noise.Nonetheless, learning the CFA and distribution of the CCM proved to be greatly beneficial in the noisy case, as detailed in section VI-C.

V. EXPERIMENTS A. TRAINING DETAILS
We defined and trained our model using TensorFlow 2 1 .We used the Stanford Lytro light field archive [52] as our training dataset, and used the Linköping University Lytro dataset2 as a validation set (14 light fields).We performed our experiments using light fields with an angular resolution of 5 × 5 views.For practical purpose, and since the overall architecture is convolutional, we can train the model using light field patches of a smaller resolution, and use full-size light fields at test time.We chose a patch resolution of 64 × 64 pixels; patches are extracted from the full-size light fields using a stride of 32 pixels.No data augmentation is applied.We used the 1 norm for our regression as usual for this kind of task.The whole network was trained using the ADAM [53] optimizer, using hyper-parameters (β 1 , β 2 , ) := (0.9, 0.98, 10 −5 ) and gradient clipping to maintain the values within the range [−1, 1].We empirically found that these values prevented explosion of the loss that could happen otherwise during training, especially as the number of unrolled iterations in the model increases.The learning rate was set to 10 −3 for the first 5 • 10 5 steps, and then decreased to 10 −4 until the end of training.Each model was trained for 8•10 5 steps using a batch size of 16.This corresponds to 180 epochs, each epoch containing about 7 • 10 4 sample patches extracted from 476 light fields.For stability purposes, we additionally clip the values of the µ k s at each optimization step so that they remain greater than 10 −2 .

B. RESULTS
We first trained and evaluated our architecture with a number of shots N ∈ {1, 2, 3} in a noiseless framework (that is, where the corruption operation C is the identity).In this first experiment, we use a modulation field in which the transmittance of each of its elements is independently drawn according to a uniform distribution.Each model for a given number of shots was trained separately, and all follow the same training schedule as detailed in section V-A.
Table 1 provides a quantitative comparison of different models, in terms of peak signal-to-noise ratio (PSNR), for a set of light fields used in all the methods referenced in the table.In the sequel, we refer this set of light fields as the "base test set".Table 1 also compares the proposed algorithm with the closed-form solution of the data term minimization against the single-step unrolled HQS architecture, which uses the same architecture for the learned proximal operator, but performs a single gradient descent step of data-term minimization as prescribed by Equation 30 instead of solving it in closed form.These models have an additional trainable parameter δ k for each layer, which we all initialize to 0.05, a hyper-parameter that was tuned to perform reasonably well for all three numbers of shots.We see that using the closed-form solution improves the reconstruction quality by about 0.6 dB, regardless of the number of shots used, and systematically outperform the single-step approach except in one case (see Table 2).We also include in our comparison models from [15], [16] and [14], for which the acquisition scheme is identical to ours.
We additionally compare with [25], which uses a different compressed acquisition scheme based on coded apertures.Since their method reconstructs a light field from a set of RGB measurements (instead of monochrome ones in our method), for a fair comparison we compare their method using a single RGB measurement to our method using 3 monochrome measurements.Note that we used the multiplechannel version of [25] where all three color channels are reconstructed jointly from the RGB measurements, thus allowing the exploitation of cross-channel correlations.The model was retrained using the code provided by the authors.A more extended comparative analysis with the model of [25] is given in the next sections which studies different aspects of our camera design.Tables 3 gives a comparison of the various methods both in terms of average PSNR and average EPI-SSIM (an angular consistency metric that averages the SSIM of all epipolar images) for several datasets.
Figures 6 and 7 visually compare the reconstruction of the central view with different methods, and also show the reconstruction error.Figure 8 shows epipolar plane images (EPI) of the reconstructed views.We can see that our method reconstructs the parallax correctly, an essential aspect of light fields.More material assessing the visual quality of the different methods can be found at clim.inria.fr/research/DeepUnrolling-CSLF.

C. COMPARISON OF DIFFERENT ACQUISITION SCHEMES
We compare our main acquisition scheme that uses a modulation field in which each element is independently drawn from a uniform distribution (thus not physically corresponding to a real coded mask) to a scheme that uses a modulation field corresponding to a real coded mask in the presence of pinholes on the aperture, as given by Equation (6).For this comparison, we did not retrain the whole model,   1: PSNR comparison of the different methods (in dB), with a modulation field in which the transmittance of each element is independently drawn according to a uniform distribution.CF-HQS stands for our "closed-form HQS" architecture in which the data-term minimization is performed using Equation (34), SS-HQS stands for the "single-step HQS" architecture which uses Equation (30) to perform the data-term minimization step.[14] is an iterative dictionary-based approach, while [15], [16] and [25] are based on deep neural networks.Note that [15] and [16] work only with a single-shot acquisition, while on the contrary [25] uses RGB measurements, thus being comparable only to our monochromatic 3-shot acquisition scheme.
but merely changed the structure of the modulation field at test time.The modulation field for each sub-aperture view was obtained by translating the 2D mask by 8 pixels per view, which corresponds to a parameter γ 0.1 if we make the reasonable assumptions that the sensor is about 400 pixels wide and that the 5 × 5 pinholes span an aperture which has approximately the same size as the sensor.Table 3 shows no perceptible drop in performance when using such physically accurate modulation field instead of the one used for training.We hypothesize that this is because the modulation fields used during training are general enough to allow the network to perform well in the specific case in which the modulation field derives from a coded mask.In addition, even though the sub-aperture views of the CCM are translated versions of each other, the norm of the translation is in practice larger than the receptive field of the reconstruction algorithm which may have the effect that the sub-aperture views of the CCM locally look like they have been drawn independently from one another.
In addition to these pinholes-based schemes, we give experimental results when using the scheme without pinholes with bilinear interpolation of the views described in section III-D.For this scheme, we provide the results for one experiment using a new mask randomly generating by drawing the transmittance value of each pixel of the mask from U(0, 1).We performed our experiments with a relative distance parameter γ = 0.1, so as to compare with the pinhole-based model.We show in Table 3 that this model performs well, even outperforms the pinhole-based model on half of the test datasets.
Finally, we compare our coded mask acquisition scheme to a monochromatic coded aperture scheme, which is the one     For a better visual readability, we amplified the error by a factor 10. Lighter means lower error, darker means higher error.
used in [25], [31] and [30].In these designs, the mask is placed at the aperture plane, whereas in our design it is placed close to the sensor.Furthermore, note that all the experiments presented in Table 3 use a Bayer color filter array in addition to the mask, thereby recording only monochromatic measurements but allowing multiplexing in the spectral domain, while [25] and [30] use true RGB sensors, thereby having three times more channels for each shot.
Following the scheme used in our other experiments, each pixel of the coded aperture is randomly drawn from a uniform distribution for each light field sample.Table 3 shows that using a monochromatic coded aperture yields worse results comparing to using a color-coded mask.We hypothesise that its placing close to the sensor actually allows the rays coming from different angles to be multiplexed in a way that is dependent on the spatial position of the incident pixel, thereby increasing the randomness, hence the possibility for the various measurements to be mutually incoherent, which is known to be a crucial property of degradation matrices as explained by [32].
It is also worth noting that in all of our experiments, we do not learn a fixed degradation operator, but instead sample a new degradation operator each time a light field sample is processed.As a consequence, for a given acquisition scheme, our network is not tied to a particular realisation of the degradation operator, which makes our approach more robust to the specifics of the degradation operator (for the purpose of physically realizing our acquisition device, it suffices to fix the degradation to one particular sample from the distribution).This sets us apart from the framework of [25], [31] and [30].By learning the degradation operator (instead of sampling a new one each time a light field is processed), [25] and [31] were able to relax the constraint that the weights of the degradation operator should be shared between the various layers, so as to increase the expressive power of their network, thus learning different virtual degradation matrices at each step of the unrolled algorithm.Note that in the case of [25], [31] and [30], the degradation is obtained by angular    multiplexing of the light field and is thus spatially invariant, allowing the authors of [25] and [31] to implement their relaxed degradation operators at each unrolled iteration as angular convolutions and deconvolutions.In contrast, the degradation operator resulting from a coded mask acquisition scheme is not invariant to translation in the spatial domain and therefore cannot be implemented using angular convolution and deconvolution layers.Besides, the stochasticity of the degradation prevents us from directly learning additional weights corresponding to the relaxed degradation operators.However, relaxing the sharing condition comes at the cost of interpretability.Indeed, this departs from the original unrolled optimization algorithm and prevents us from considering the output of these layers as a projection on the space of measurement-consistent signals.

D. IMPACT OF THE STRUCTURE OF THE REGULARIZER
Table 2 gives averaged PSNR and EPI-SSIM results on several datasets obtained with a regularizer using 2D convolutions in comparison with an architecture using a regularizer consisting of a stack of spatio-angular separable 4D convolutions (as used in [25] and [31]), and also compares with [25] and [31].Using 4D convolution-based regularizers instead of the 2D ones generally yield slightly better results, however at the cost of a greatly increased computation time.More precisely, 4D separable convolutions have sensibly fewer parameters than traditional 2D convolutions performing operations on flattened light fields.The time needed (during training) to process one sample is one order of magnitude longer than that of 2D convolutions, but the learning converges faster, which leaves the overall time required for training approximately unchanged.However, the extra computational time for each sample remains an issue at test time: when tested on an Nvidia Quadro RTX 8000, the reconstruction of a single light field takes about 0.3 seconds with the 2D convolutional regularizer, whereas it takes approximately 4 seconds with the 4D separable approach.

E. IMPACT OF THE NUMBER OF UNROLLED ITERATIONS
We studied the impact of increasing the number of unrolled iterations K.While increasing the number of iterations could in theory improve the final reconstruction quality, we observe a quick saturation.Figure 9 shows the effect of the number of unrolled iterations on the quality of reconstruction.Increasing K from 8 to 12 improves the PSNR by 0.78 dB, while further increasing K from 12 to 16 yields an average improvement of only 0.02 dB.

VI. EXPERIMENTS IN A NOISY SETTING A. CORRUPTION MODEL
While the results presented in V-B were all conducted in a noiseless framework, we additionally evaluated the robustness of our approach in a noisy framework.To that end, we applied a corruption process to the monochromatic images Averaged PSNR and EPI-SSIM measures obtained with different schemes: our unrolled HQS method using single step gradient descent (SS-HQS) or the closed-form expression (CF-HQS), with a regularizer using 2D or 4D convolutions.We compare the results with 1 and 3 shots against the methods in [25] and [31].Note that 1 shot with the acquisition in RGB (as in [25] and [31]) is equivalent to 3 shots with our acquisition scheme, in terms of number of input measurements.The base test set corresponds to the light fields used in Table 1.captured by the photosensor.Following [14] and [16], we include in our noise model three sources of corruption: shot noise, caused by the quantized nature of the light reaching the sensor; readout noise, which originates from the spontaneous emission of electrons by the photoelectric device and quantization noise arising during the analog-digital conversion of electrons into digital units.The corrupted pixel-wise intensity recorded on the photo-sensor is given by:

Acquisition matrix
where int(.)denotes rounding, clip [0,c] (.) = max(min(.,c), 0), B = 2 b − 1, with b the number of bits used to code the digital-converted measure, c is the full-well capacity of the pixel (in number of electrons), g is an intensity gain factor proportional to the ISO gain of the pixel and the exposure time; p(α) is a random variable following a Poisson distribution P(α) and n rd ∼ N (0, σ 2 rd ).However, the discrete nature of some of the sources of randomness makes it difficult to integrate them into a differentiable gradient-based learning approach.For this reason, we approximate the corruption VOLUME 4, 2016 process using continuous distributions.We thus substitute the Poisson distribution P(α) modeling shot noise with a Gaussian distribution N (α, α), and the rounding int(.) with an additive uniform noise u ∼ U(− 1 2 , 1 2 ).Since the shot and readout noises are independent from each other, their sum follows the Gaussian distribution N (gcI, gcI + σ 2 rd ).Our actual model for the corrupted measurements therefore becomes: where n follows the standard normal distribution N (0, 1).

B. LEARNING THE CCM AND CFA
While using a Bayer CFA and uniform CCM may be an efficient way to multiplex spectral information in a noiseless framework, their low transmittance ( 1 3 and 1 2 respectively, yielding an even lower transmittance of 1  6 for the global device) make them ill-suited in a noisy framework, as the signal-to-noise ratio is greatly damaged.This makes it desirable to actually learn the CFA and CCM, along with the weights of the reconstruction network.For the CFA we simply learn a periodic 4 × 4 pattern, which yields 48 additional parameters to the overall model.These weights are initialized following a uniform distribution between 0 and 1.Following the considerations in section III-C and the approach of [16], we learn a distribution of the CCM by learning a distribution D of the transmittance on the pixels of the CCM.This 3-dimensional distribution is learned using a simple multi-layer perceptron with output dimension 3 that we feed with random noise.More precisely, we sample a XU × d-dimensional standard Gaussian random variable s, and set ψ(x, u, .):= MLP Θ (s) -i.e.D is the direct image of N (0, I) under MLP Θ .The parameters Θ of the MLP are then learned using standard back-propagation.In our experiments, we use a stack of three dense layers (i.e.affine transformations) with 32 hidden layers, interleaved with ReLU non-linearities, and with a final logistic non-linearity to ensure that the output transmittance lie in the range (0, 1).This yields 1443 additional parameters, a negligible amount compared to the number of parameters of the whole network.
In addition, we use an entropy-based regularization scheme on the pixels transmittance values distribution, as it was shown in [16] to be an effective way to prevent the learned distribution from falling into a poor local minimum.
Please note also that, even though it might be difficult to create a pixel on the mask with any arbitrary transmittance, the constraints imposed on the pixel's distribution D can be addressed using the framework developed in [57] in which the authors expose a method to incorporate physical realizability constraints at training time into additional regularizers.

C. RESULTS IN THE NOISY FRAMEWORK
We set the characteristics of our noisy photosensor to (b, c, σ rd ) := (14, 2 • 10 4 , 40) which are typical values for   4 gives a quantitative comparison of the performance in the noisy case depending on the level of noise and the number of shots used for reconstruction.Table 4 also includes results in the single-shot case, using the same architecture, but without learning the CFA and CCM.It is worth noticing that learning the CFA and CCM along with the weights of the reconstruction network yields significant improvement, especially when the noise level is high.Table 5 shows examples of learned CCM and CFA for different levels of noise and different numbers of shots.We can see that the transmittance of the pixels of both the CFA and CCM tends to increase as the gain factor g decreases.Figure 10 shows the reconstruction error on the central view for the different levels of noise.

Noise
Low-level High-level

VII. LIMITATIONS
Whereas our approach performs well when the amount of noise is moderate, a significant reduction of the signal-tonoise ratio can greatly degrade the quality of the reconstruction.This makes our setup less efficient when the amount of light is very low.In addition, due to the convolutional nature of the network, light fields with large disparities (i.e.exceeding the size of the receptive field of the CNN) are usually not well recovered.

VIII. CONCLUSION
In this paper, we have presented a new deep architecture, based on unrolled optimization with learned priors, for the reconstruction of compressively acquired light fields via color-coded masks in the presence of color filter arrays.This architecture leverages the power of the algorithm unrolling paradigm, and works with an arbitrary number of shots.
We have shown that this method improves the state-of-theart for this acquisition framework by several dBs, comparing favorably to both traditional and deep methods [14]- [16].We have presented an efficient closed-form data-term minimization layer that is shown to substantially improve the reconstruction quality, while allowing the joint learning of the coded mask and color filter array, along with the weights of the network.In addition, we have presented a new realistic acquisition model together with a method to compute its modulation field.Finally, we have shown that our approach was robust to realistic levels of noise, an important consideration in regard to practical applications.

FIGURE 1 :
FIGURE 1: Model of compressed light field acquisition.A light beam, parameterized by (x, u), is first filtered by the color-coded mask at coordinates ξ on the CCM plane.It is subsequently filtered by the color filter array on the sensor plane X.

FIGURE 4 :
FIGURE 4: Convolutional network corresponding to one learned proximal operator.The residual branch consists in 4 stacked convolutions and element-wise non-linear activation functions.Each convolution in the residual branch has 128 filters, except the last one which has U Λ to recover the full dimension of the signal.

FIGURE 5 :
FIGURE 5: End-to-end architecture.The left-hand part models the physical operations, comprising the modulation by CCM and CFA and photo-sensing (multiplication by the sensing matrix) and the stochastic corruption process C -this part is only present in the noisy framework.The right-hand part represents the deep unrolled algorithm: the reconstruction is first initialized with value 0, and then the following are applied: data-term minimization step, or data-fidelity layer (DF) -whose architecture is detailed in Figure3, intertwined with the application of proximal operators (Prox) -whose architecture is detailed in Figure4.A final clipping operation is applied to ensure that the values of the reconstructed intensities lie in a valid range.

FIGURE 6 :
FIGURE 6: Reconstruction error for the central view of the light field Rock, with a single-shot acquisition.For a better visual readability, we amplified the error by a factor 10. Lighter means lower error, darker means higher error.Figure (a) is cropped in the bottom-right corner (reconstruction provided by the authors).

FIGURE 7 :
FIGURE 7: Reconstruction and reconstruction error for the central view of the light field Rock, with a 3-shot acquisition.For a better visual readability, we amplified the error by a factor 10. Lighter means lower error, darker means higher error.

FIGURE 8 :
FIGURE 8: Reconstructed EPI (right-hand side) and reconstruction error of EPI (left-hand side) of the light field Rock.Figures (a) to (c) are comparable as they correspond to single-shot acquisition methods, while figures (d) to (f) correspond to multi-shot acquisition schemes.For a better visual readability, we amplified the error by a factor 10. Lighter means lower error, darker means higher error.
FIGURE 8: Reconstructed EPI (right-hand side) and reconstruction error of EPI (left-hand side) of the light field Rock.Figures (a) to (c) are comparable as they correspond to single-shot acquisition methods, while figures (d) to (f) correspond to multi-shot acquisition schemes.For a better visual readability, we amplified the error by a factor 10. Lighter means lower error, darker means higher error.

FIGURE 9 :
FIGURE 9: Impact of the number of unrolled iterations on the reconstruction quality.Average reconstruction quality in PSNR (dB) on the test set, for a number of shots N = 1.
(a) Noiseless (b) Low-level noise (c) High-level noise (d) Ground truth

FIGURE 10 :
FIGURE 10: Reconstruction error of the central view of the light field Seahorse from the Kalantari test dataset [58], for different levels of noise.The number of shots is N = 3.
of the data-projection layer by efficient closed-form resolution of the quadratic problem.Multiplication nodes represent either matrixmatrix multiplication or matrix-vector multiplications.Thick lines indicate matrices, normal lines indicate vectors and dotted lines indicate scalars.Note that µ k is applied elementwise in the division node, but only on the diagonal in the matrix-scalar addition node.The nodes neg, inv and indicate multiplication by −1, matrix inversion and matrix transposition respectively.

TABLE 3 :
(18)aged PSNR and EPI-SSIM measures obtained with different acquisition models, with one shot, and different datasets.CCM-independent is our base model, for which the modulation field is generated by randomly drawing the transmittance value of each of its elements from a uniform distribution.CCM-pinhole corresponds to the case where the modulation field is related to a uniformly generated coded mask by Equation(6).CCM-views-interpolation corresponds to the case where the modulation field is related to a uniformly generated coded mask by Equation(18).Finally, MCA stands for monochromatic coded aperture.

TABLE 5 :
Learned color filter arrays and color-coded masks.'Low-level' corresponds to g = 1.0 while 'Highlevel' corresponds to g = 0.33.The CCM displayed are random samples.