A Model-Driven Deep Dehazing Approach by Learning Deep Priors

Photos taken in hazy weather are usually covered with white masks and lose important details. Haze removal is a fundamental task and a prerequisite to many other vision tasks. Single image dehazing is an ill-posed inverse problem that has attracted much attention in recent years. Generally, current single dehazing methods can be categorized into the traditional prior-based methods and the data-driven deep learning methods that respectively investigate haze-related image priors and deep architectures. In this paper, we propose a novel model-driven deep learning approach that combines the advantages of both kinds of methods. First, we build an energy model for single image dehazing with physical constraints in both color image space and haze-related feature space (implemented as dark channel space in this work), regularized by haze-related image priors. Then, we design an iterative optimization algorithm for solving the proposed dehazing energy model based on the half-quadratic splitting algorithm, and the priors are transformed to their corresponding proximal operators. Finally, inspired by the optimization algorithm, we design a deep dehazing neural network, dubbed as proximal dehaze-net, by learning the proximal operators for haze-related image priors using CNNs. Our network incorporates physical model constraints of hazes and haze-related prior learning into a novel deep architecture. Extensive experiments show that our method achieves promising performance for single image dehazing.


I. INTRODUCTION
Haze is an atmospheric phenomenon where dust, smoke, or dry particles obscure the clarity of a scene. Hazes usually degrade the quality of photos by reducing the total contrast in color and occluding objects in photos. It is necessary to reduce the haze effect on these photos to make them visually pleasing and appealing. Moreover, many vision tasks in practical life depend on clean images, such as face detection and automatic license plate recognition from monitoring images, scene analysis from satellite images, etc. However, real captured images even on sunny days often suffer from low visual quality with haze effect. It is therefore essential for a vision system to firstly remove hazes from the captured images and then conduct detection or recognition.
The associate editor coordinating the review of this manuscript and approving it for publication was Hong-Mei Zhang.
In hazy images, only a portion of the reflected light reaches the observer because of the absorption in the atmosphere. Based on this observation, the captured image $I$ of a hazy scene can be modeled as a linear combination of the direct attenuation and the airlight [1]-[4]:
$$I(x) = J(x)T(x) + A(1 - T(x)), \tag{1}$$
where $x$ is the spatial coordinate, $I$ is the image degraded by hazes, $J$ is the scene radiance or haze-free image, $A$ is the global atmospheric light, and $T(x) = \exp(-\eta d(x))$ is the media transmission along the cone of vision, which depends on the scattering coefficient $\eta$ and the scene depth $d(x)$. Depending on the number of given hazy images, the image dehazing task can be divided into single image dehazing and multi-image dehazing. In this paper, we focus on single image dehazing, which requires recovering the unknown haze-free image $J$, atmospheric light $A$, and transmission $T$ from a single input hazy image $I$. Single image dehazing is more challenging than multi-image dehazing, and it is essential to investigate effective haze-related priors to regularize this inverse problem. Previous works can be roughly categorized into the traditional methods that investigate various haze-related priors and the learning-based methods that build learning systems for dehazing. The traditional image dehazing methods [1]-[12] have investigated various haze-related image priors. Tan [2] assumes that the contrast of hazy images is lower than that of haze-free images and proposes to maximize the contrast of hazy images under the MRF framework. Fattal [3] uses independent component analysis to estimate the transmission in hazy scenes, assuming that the transmission and surface shading are locally uncorrelated. He et al. [4] propose the dark channel prior to estimate the transmission map based on the observation that the local minimums of color channels of haze-free images are close to zero. Liu et al.
[12] propose the rank-one prior based on the observation that in most of the regions except the light source area, the imaging scene is covered by spatially homogenous light. Polarization-based methods [13], [14] are also effective for haze removal, which are based on the fact that the airlight scattered by atmospheric particles is partially polarized. These methods are effective in image dehazing due to the full investigation of image prior knowledge and the understanding of physical mechanism of hazes. However, these priors are mainly based on human observations that would not always hold for diverse real-world images. For example, dark channel prior [4] is effective for most outdoor images but usually fails for those containing large areas of white scenery such as white walls or clouds in the sky, as shown in Figure 1 (b). Despite the effectiveness, polarization-based dehazing requires professional optical equipment and multiple image fusion, which is not suitable for single image dehazing.
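The haze imaging model of Eqn. (1) can be illustrated with a small NumPy sketch. The function names, the random test data and the lower bound `t_min` are assumptions made only for this illustration; the sketch also inverts the model with the true $A$ and $T$, which is exactly the information a single-image method does not have.

```python
import numpy as np

def synthesize_haze(J, depth, A, eta=1.0):
    """Apply I = J*T + A*(1 - T) with T = exp(-eta * depth)."""
    T = np.exp(-eta * depth)                        # transmission map, shape (H, W)
    I = J * T[..., None] + A * (1.0 - T[..., None])  # broadcast T over color channels
    return I, T

def recover_scene(I, T, A, t_min=1e-3):
    """Invert the model when A and T are known; t_min (illustrative) avoids division by ~0."""
    T = np.maximum(T, t_min)[..., None]
    return (I - A) / T + A

rng = np.random.default_rng(0)
J = rng.uniform(0.0, 1.0, (4, 5, 3))       # toy haze-free image
depth = rng.uniform(0.1, 2.0, (4, 5))      # toy scene depth
A = np.array([0.9, 0.9, 0.95])             # toy global atmospheric light
I, T = synthesize_haze(J, depth, A, eta=0.8)
J_rec = recover_scene(I, T, A)
```

With known $A$ and $T$ the inversion is exact up to floating-point error, which makes clear that the difficulty of single image dehazing lies in estimating these two quantities from $I$ alone.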
In recent years, deep learning methods [15]- [29] have overwhelmed single image dehazing area. The early learning-based methods usually learn transmission maps or other haze-related variables. For instance, Ren et al. [16] and Cai et al. [17] design CNNs to learn transmission maps and then recover clean images, and Li et al. [18] propose to learn a K -module for the estimation of transmission T and atmospheric light A. More recent works [19], [23], [25], [26], [29]- [31] propose to learn a direct mapping between hazy images and clean images by developing new CNN architectures or investigating novel effective loss functions. Besides, to achieve more realistic dehazing results, generative adversarial networks [20], [21], [25], [28], [32] are also widely used in image dehazing task. The learning-based methods have shown promising results for single image dehazing. However, these methods usually take CNNs to learn a mapping from input hazy images to the transmissions or haze-free images, without considering haze-related image priors to constrain the mapping space compared with the traditional methods.
Our main motivation is to combine haze imaging mechanism and deep learning in a novel model-driven deep learning framework [33], which takes the advantages of both the prior-based methods and the deep learning-based methods. Compared with the traditional prior-based methods, our approach is free of parameter tuning after training and runs at a higher speed. Most importantly, combined with the data-driven method, the dehazing performance is significantly improved. Compared with the deep learning-based methods, our approach integrates physical mechanism constraint into the deep learning framework and can produce stable dehazing results (see Figure 10). Moreover, our method learns to explicitly predict atmospheric light and transmission maps. They may be helpful for model explanation and downstream tasks such as semantic segmentation on hazy images. We build our model-driven learning approach by the following steps. First, based on the haze imaging model, we formulate the inverse problem of single image dehazing as an energy model with physical constraints in image space and haze-related feature space (dark channel space in this paper), regularized by haze-related image priors. Second, we design an iterative optimization algorithm for minimizing the dehazing energy function using the half-quadratic splitting algorithm, with proximal operators for modeling the regularization terms. Third, we propose a deep neural network based on the iterative algorithm, dubbed as proximal dehaze-net, to implicitly learn these image priors by learning their corresponding proximal operators using convolutional neural networks.
VOLUME 9, 2021
In summary, our work makes three main contributions. First, we propose a novel energy model for single image dehazing, which investigates the haze imaging model constraints in both image space and haze-related feature space. Second, based on the iterative algorithm for minimizing the energy model, we design a multi-stage deep neural network, by discriminatively learning haze-related image priors, including dark channel prior, transmission prior and clean image prior, saving the effort of manually designing them. We also learn to predict the atmospheric light instead of estimating it with traditional methods. Third, extensive experiments show the effectiveness of learning haze-related image priors, and the proposed proximal dehaze-net achieves promising results on both synthetic and real-world hazy images.
This paper is an extension of our previous work [22]. In this paper, we extend the contents of our work in the following three aspects. First, we reformulated our dehazing model by introducing an additional clean image prior learning module. We also learned to predict the atmospheric light instead of estimating it with the traditional methods. Second, we improved the performance of our model by replacing lightweight sub-networks with more powerful CNN backbones. Moreover, we treated hyper-parameters that need manual adjusting as learnable variables. Third, for fair comparisons, we evaluated our method on multiple public datasets, including SOTS and NTIRE 2018 datasets, and our method achieves promising results on both synthetic images and real-world images. As a comparison between the original PDN model [22] (denoted as PDN-ECCV) and the current extended PDN model, we show in Figure 2 two examples of synthetic and real hazy image dehazing. We can see that the extended PDN model improves the dehazing ability both quantitatively and qualitatively.

II. RELATED WORK

A. HAZE-RELATED IMAGE PRIORS
Most traditional dehazing methods assume an image prior on haze-free images or latent transmission maps based on human experiences. Researchers have proposed various effective priors for single image dehazing. The most related work to ours is the dark channel prior (DCP) [4]. The dark channel of a color image is defined as the minimum of local image patches:
$$I^d(x) = \min_{c \in \{r,g,b\}} \Big( \min_{y \in \Omega(x)} I^c(y) \Big), \tag{2}$$
where $I^c$ is a color channel of $I$, and $\Omega(x)$ is a local patch centered at $x$. Dark channel prior assumes that, in most non-sky patches, at least one color channel of a haze-free outdoor image has very low intensities at some pixels. According to dark channel prior, the transmission can be estimated by:
$$T(x) = 1 - \omega \min_{c} \Big( \min_{y \in \Omega(x)} \frac{I^c(y)}{A^c} \Big), \tag{3}$$
where $\omega$ is a constant for keeping aerial perspective.
Since image priors usually rely on human observations and experiences, the traditional dehazing methods are not always applicable to diverse scenes. For instance, DCP is effective for dehazing but may fail when the scene color is close to the atmospheric light, e.g., sky regions in the wild environment and light color walls in cityscapes. Instead of constraining dark channel to be close to zero as in DCP, we learn dark channel prior by learning its corresponding proximal mapping from training data using a convolutional neural network, potentially being able to well approximate the dark channels of haze-free images as shown in Figure 1.
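As an illustration of the hand-crafted prior discussed above, the following NumPy sketch computes a dark channel and the corresponding DCP transmission estimate. The patch size, the value `omega=0.95` and the edge padding are illustrative assumptions, not settings taken from this paper.

```python
import numpy as np

def dark_channel(img, patch=3):
    """Minimum over color channels, then minimum over a patch x patch window."""
    min_rgb = img.min(axis=2)
    r = patch // 2
    padded = np.pad(min_rgb, r, mode="edge")   # replicate borders (assumption)
    H, W = min_rgb.shape
    out = np.empty_like(min_rgb)
    for i in range(H):
        for j in range(W):
            out[i, j] = padded[i:i + patch, j:j + patch].min()
    return out

def dcp_transmission(I, A, omega=0.95, patch=3):
    """T(x) = 1 - omega * dark_channel(I / A), following the DCP estimate."""
    return 1.0 - omega * dark_channel(I / A, patch)

A = np.array([0.8, 0.9, 1.0])
fully_hazy = np.ones((4, 4, 3)) * A            # every pixel equals the atmospheric light
T_hazy = dcp_transmission(fully_hazy, A)

clear = np.ones((4, 4, 3)) * 0.6
clear[..., 0] = 0.0                            # one channel is dark everywhere
T_clear = dcp_transmission(clear, A)
```

The second case also hints at the failure mode mentioned above: a bright haze-free region without any dark channel would be assigned a low transmission even though no haze is present.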

B. DEHAZING BY ENERGY MINIMIZATION
It is a common practice to build energy functions for various image restoration and reconstruction problems. For the problem of single image dehazing, there are also several methods proposed based on image priors and energy minimization [2], [7], [10], [34]- [38]. The main idea of these methods is to firstly find effective image priors for describing transmission maps or haze-free images, and then build energy functions with these image priors as regularization terms. The energy function is then minimized for optimal transmission maps or haze-free images in an iterative manner.
However, it highly relies on expert experiences to find or design effective dehazing priors. It is often time-consuming for most energy-minimization methods to dehaze a single image due to a quantity of optimization steps. There also exist hyper-parameters in the energy models that usually have to be carefully adjusted in order to get the best visual effect for each individual image. As a comparison, our proposed model can adaptively learn haze-related image priors. Through discriminative learning, we can reduce the number of iterations of the optimization algorithm to only 2 or 3, which greatly reduces the time cost. In the meanwhile, hyper-parameters in the energy model are also learned during training, saving the effort of manually tuning.

C. DEEP UNFOLDING NETWORKS
Recently, there have been several works to solve image inverse problems under the iterative deep learning framework [22], [24], [33], [39]- [45]. Zhang et al. [40] train a set of effective denoisers and plug them in the scheme of the half-quadratic splitting algorithm as modules. Meinhardt et al. [41] solve the inverse problem in image processing using the primal-dual hybrid gradient method, and replace the proximal operator with a denoising neural network. In [39], [42], [43], the linear inverse problems are solved by learning proximal operators in the scheme of iterative optimization algorithms. These methods can well solve linear inverse problems such as denoising, super-resolution, non-blind deconvolution, compressive sensing MRI, etc.
Compared with these works, we focus on single image dehazing, which is a challenging inverse problem with more unknown variables in the imaging model. Instead of using common linear inverse models in these works, we specify single image dehazing as a non-linear inverse problem with regularization terms for haze-related features. We propose to discriminatively learn effective image priors by learning proximal mappings for the regularization terms using CNNs. The most related works to ours are [24], [45]. In [24], the authors learned deep priors for single image dehazing, but did not investigate image priors on haze-related features. In [45], the authors learned image priors for deraining problem based on the model-driven approach. To the best of our knowledge, our work [22] is the first to learn haze-related priors for image dehazing task.

III. DEHAZING AS AN INVERSE PROBLEM
In this section, we first build an energy function with physical model constraints in both image space and feature space, and then design an iterative algorithm for energy minimization based on the half-quadratic splitting (HQS) algorithm.

A. DEHAZING ENERGY MODEL
Considering the haze imaging model in Eqn. (1), given a hazy image $I \in \mathbb{R}^{M \times N \times 3}$, we assume a known global atmospheric light $A \in \mathbb{R}^3$, and subtract $A$ from both sides of Eqn. (1) in each color channel:
$$I^c - A^c = (J^c - A^c) \circ T, \tag{4}$$
where $c$ is the color channel. For simplicity, let $P^c = I^c - A^c$ and $Q^c = J^c - A^c$. Then $P$ and $Q$ represent the normalized hazy image and clean image respectively. Thus, Eqn. (4) can be rewritten in a concise form as:
$$P = Q \circ T, \tag{5}$$
where $\circ$ is the Hadamard product for matrices. This is a physical constraint in image color space. Now we consider the physical constraint in haze-related feature space. Let $\mathcal{H}$ be a transformation from image space to any haze-related feature space, and we simply apply $\mathcal{H}$ on both sides of Eqn. (5):
$$\mathcal{H}(P) = \mathcal{H}(Q \circ T), \tag{6}$$
in which we let $T$ element-wisely multiply $Q$ channel by channel. There are various choices for the transformation $\mathcal{H}$, such as dark channel, local max contrast, hue disparity and local max saturation, as mentioned by Tang et al. [15]. Since dark channel has been proved to be the most effective [15], we take $\mathcal{H}$ as the dark channel of an image. Thus, we physically constrain our model in the dark channel feature space. According to He et al. [4], we can further assume that $T$ is locally constant, then we have
$$P^d = Q^d \circ T, \tag{7}$$
where $P^d$, $Q^d$ are the dark channels of $P$, $Q$. By enforcing Eqns. (5) and (7) as data fidelity terms, we design a dehazing energy function:
$$E(Q, T) = \frac{\alpha}{2} \sum_c \| Q^c \circ T - P^c \|_F^2 + \frac{\beta}{2} \| Q^d \circ T - P^d \|_F^2 + f(Q^d) + g(T) + h(Q), \tag{8}$$
where $\alpha$ and $\beta$ are coefficients for data terms, $\|\cdot\|_F$ is the Frobenius norm, and $f(Q^d)$, $g(T)$ and $h(Q)$ are regularization terms modeling the priors on dark channel $Q^d$, transmission map $T$ and clean image $Q$. The optimal haze-free image $Q^*$ and transmission map $T^*$ can be obtained by solving the following optimization problem:
$$\{Q^*, T^*\} = \arg\min_{Q, T} E(Q, T). \tag{9}$$
Regularization Terms: We have three regularization terms $f$, $g$ and $h$ that respectively model dark channel prior, transmission prior and clean image prior. Multiple image priors can be taken for them. For example, $f$ for the dark channel can be taken as an $\ell_0$ or $\ell_1$ regularizer, enforcing the dark channel to be sparse and close to zero. The transmission map is closely related to the latent scene depth, which is piecewise-smooth and edge-aligned with the depth, thus its regularizer $g$ can be modeled by MRF [46], [47], or TGV [10], [48]. For clean image prior $h$, we can use common image priors like TV [49].
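The two data terms can be checked numerically with a short NumPy sketch. It verifies that the dark-channel constraint holds exactly for a spatially constant transmission and evaluates an energy of the form described above. The regularizers used here (an $\ell_1$ norm on the dark channel, a plain total variation on $T$, and no term on $Q$) are simple stand-ins chosen for illustration, and all names and values are assumptions.

```python
import numpy as np

def dark_channel(img, patch=3):
    """Minimum over color channels and over a patch x patch window."""
    min_rgb = img.min(axis=2)
    r = patch // 2
    padded = np.pad(min_rgb, r, mode="edge")
    H, W = min_rgb.shape
    out = np.empty_like(min_rgb)
    for i in range(H):
        for j in range(W):
            out[i, j] = padded[i:i + patch, j:j + patch].min()
    return out

def energy(Q, T, P, alpha=1.0, beta=1.0):
    """Data terms in image and dark channel space plus stand-in regularizers."""
    Pd, Qd = dark_channel(P), dark_channel(Q)
    data_img = 0.5 * alpha * np.sum((Q * T[..., None] - P) ** 2)
    data_dark = 0.5 * beta * np.sum((Qd * T - Pd) ** 2)
    f = np.abs(Qd).sum()                                   # l1 on the dark channel
    g = np.abs(np.diff(T, axis=0)).sum() + np.abs(np.diff(T, axis=1)).sum()  # TV on T
    return data_img + data_dark + f + g                    # h(Q) = 0 in this sketch

rng = np.random.default_rng(1)
A = np.array([0.9, 0.85, 0.8])
Q = rng.uniform(0.0, 1.0, (6, 6, 3)) - A       # Q = J - A
T_true = np.full((6, 6), 0.6)                  # constant transmission
P = Q * T_true[..., None]                      # P = Q o T

e_true = energy(Q, T_true, P)
e_wrong = energy(Q, np.full((6, 6), 0.3), P)
```

Because $T$ is positive and constant here, taking the minimum commutes with the multiplication, so `dark_channel(P)` equals `dark_channel(Q) * T_true`; for a transmission that varies inside a patch the relation only holds approximately.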

B. MODEL OPTIMIZATION
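It is non-trivial to directly solve optimization problem Eqn. (9), so we turn to the half-quadratic splitting (HQS) algorithm to break it into easier sub-problems. The HQS algorithm has been widely used to solve image inverse problems [50]-[54]. By introducing an auxiliary variable $U$ to substitute $Q^d$, i.e., the dark channel of the latent haze-free image, we derive the augmented energy function:
$$E(Q, T, U) = \frac{\alpha}{2} \sum_c \| Q^c \circ T - P^c \|_F^2 + \frac{\beta}{2} \| U \circ T - P^d \|_F^2 + \frac{\gamma}{2} \| U - Q^d \|_F^2 + f(U) + g(T) + h(Q), \tag{10}$$
in which $\gamma$ is a penalty weight, and when $\gamma \to \infty$, the solution of minimizing Eqn. (10) converges to that of minimizing Eqn. (8). We minimize Eqn. (10) by alternately updating $U$, $T$ and $Q$ while fixing the other two variables. We initialize $Q_0 = P$ and all elements of $T_0$ are ones, then for the $n$-th iteration of the HQS algorithm, we successively solve the following sub-problems.
Update $U$: Given the estimated haze-free image $Q_{n-1}$ and transmission map $T_{n-1}$ at iteration $n-1$, the auxiliary variable $U_n$ for the $n$-th iteration is updated as:
$$U_n = \arg\min_U \frac{\beta}{2} \| U \circ T_{n-1} - P^d \|_F^2 + \frac{\gamma}{2} \| U - Q^d_{n-1} \|_F^2 + f(U), \tag{11}$$
from which we can derive
$$U_n = \operatorname{prox}_{\frac{1}{u_n} f}(\hat{U}_n), \tag{12}$$
where $\hat{U}_n$ is an intermediate variable defined as:
$$\hat{U}_n = \frac{\beta T_{n-1} \circ P^d + \gamma Q^d_{n-1}}{\beta T_{n-1} \circ T_{n-1} + \gamma}, \tag{13}$$
with element-wise division, and $u_n = \beta T_{n-1} \circ T_{n-1} + \gamma$. The proximal operator prox [55] we used is defined as:
$$\operatorname{prox}_{\lambda f}(V) = \arg\min_X \frac{1}{2} \| X - V \|_F^2 + \lambda f(X), \tag{14}$$
assuming that $f(X)$ is separable for different elements in a matrix $X$ such that $f(X) = \sum_i f(x_i)$. This assumption is reasonable for many common regularizations such as $\ell_1$ or $\ell_2$. In practice, we relax this constraint and extend it to general regularizations. In addition, we also extend $\lambda$ to be a matrix with the same size as $X$.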
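For the special case of an $\ell_1$ regularizer, the proximal operator has the well-known closed form of soft-thresholding, which makes the $U$ update easy to check numerically. The sketch below is only an illustration under that assumption; the function names and the scalar test values are hypothetical, and the brute-force grid search is there solely to confirm the closed form.

```python
import numpy as np

def soft_threshold(v, lam):
    """Proximal operator of lam * |x| applied element-wise; lam may be an array."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def update_U(T_prev, Pd, Qd_prev, beta, gamma):
    """Closed-form intermediate variable, followed by the l1 proximal step."""
    u = beta * T_prev * T_prev + gamma
    U_hat = (beta * T_prev * Pd + gamma * Qd_prev) / u
    return soft_threshold(U_hat, 1.0 / u), U_hat

# Single-pixel example with arbitrary values.
T_prev, Pd, Qd_prev = np.array([[0.5]]), np.array([[0.4]]), np.array([[0.8]])
beta, gamma = 1.0, 2.0
U, U_hat = update_U(T_prev, Pd, Qd_prev, beta, gamma)

# Brute-force minimization of the same sub-problem on a dense grid.
grid = np.linspace(-2.0, 2.0, 40001)
objective = (0.5 * beta * (grid * 0.5 - 0.4) ** 2
             + 0.5 * gamma * (grid - 0.8) ** 2
             + np.abs(grid))
U_brute = grid[np.argmin(objective)]
```

Here `U_hat` is the minimizer of the two quadratic terms alone, and the threshold `1 / u` shrinks it toward zero; both agree with the grid search up to the grid resolution.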
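Update $T$: We next update the transmission map $T_n$. Given $Q_{n-1}$ and $U_n$, $T_n$ is computed as:
$$T_n = \arg\min_T \frac{\alpha}{2} \sum_c \| Q^c_{n-1} \circ T - P^c \|_F^2 + \frac{\beta}{2} \| U_n \circ T - P^d \|_F^2 + g(T). \tag{15}$$
Then we can derive
$$T_n = \operatorname{prox}_{\frac{1}{t_n} g}(\hat{T}_n), \tag{16}$$
where $\hat{T}_n$ is an intermediate variable defined as:
$$\hat{T}_n = \frac{\alpha \sum_c Q^c_{n-1} \circ P^c + \beta U_n \circ P^d}{\alpha \sum_c Q^c_{n-1} \circ Q^c_{n-1} + \beta U_n \circ U_n}, \tag{17}$$
and $t_n = \alpha \sum_c Q^c_{n-1} \circ Q^c_{n-1} + \beta U_n \circ U_n$.
Update $Q$: Given $T_n$ and $U_n$, the haze-free image $Q_n$ is updated as:
$$Q_n = \arg\min_Q \frac{\alpha}{2} \sum_c \| Q^c \circ T_n - P^c \|_F^2 + \frac{\gamma}{2} \| U_n - Q^d \|_F^2 + h(Q). \tag{18}$$
Since computing the dark channel of an image is to extract the smallest value from the local color patch around each pixel, the second term of Eqn. (18) only constrains limited pixels in the original image $Q$, which may cause unstable results. Therefore we ignore the second data term, and the clean image $Q_n$ is updated as:
$$Q_n = \operatorname{prox}_{\frac{1}{q_n} h}(\hat{Q}_n), \tag{19}$$
where $\hat{Q}_n = \frac{P}{\max(T_n, \epsilon)}$, $q_n = T_n$, and $\epsilon$ is a constant to prevent extremely low transmission values.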
After $N$ iterations, the final haze-free image $J$ can be derived by adding $Q_N$ with $A$ in each color channel:
$$J^c = Q^c_N + A^c.$$
In summary, the iterative procedure for optimizing the proposed energy model with the HQS algorithm is shown as Algorithm 1.
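The closed-form intermediate variables for $T$ and $Q$ and the final reconstruction of $J$ can be sanity-checked on synthetic data. In the NumPy sketch below, the helper names, the random data and the value of `eps` are assumptions for illustration, and the dark-channel observation is generated directly as $U \circ T$ so that the dark-channel constraint holds exactly.

```python
import numpy as np

def t_hat(Q_prev, U, P, Pd, alpha, beta):
    """Minimizer of the two quadratic data terms with respect to T (element-wise)."""
    num = alpha * (Q_prev * P).sum(axis=2) + beta * U * Pd
    den = alpha * (Q_prev * Q_prev).sum(axis=2) + beta * U * U
    return num / den

def q_hat(P, T, eps=0.05):
    """Intermediate clean image P / max(T, eps); eps is an illustrative value."""
    return P / np.maximum(T, eps)[..., None]

def reconstruct(Q, A):
    """Add the atmospheric light back in each color channel."""
    return Q + A

rng = np.random.default_rng(2)
A = np.array([0.9, 0.85, 0.8])
J = rng.uniform(0.0, 1.0, (6, 6, 3))
Q = J - A
T = rng.uniform(0.2, 0.9, (6, 6))
U = rng.uniform(-0.9, -0.1, (6, 6))     # stand-in for the dark channel of Q
P = Q * T[..., None]                     # image-space constraint
Pd = U * T                               # dark-channel constraint, assumed exact here

T_est = t_hat(Q, U, P, Pd, alpha=1.0, beta=1.0)
Q_est = q_hat(P, T_est)
J_est = reconstruct(Q_est, A)
```

When the true $Q$ and $U$ are plugged in, the data terms alone recover the true transmission and image; in the actual algorithm these quantities are unknown, which is why the proximal steps for the priors are needed.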

Algorithm 1 Energy Minimization for Single Image Dehazing With Half-Quadratic Splitting
Input: hazy image $I$, atmospheric light $A$.
Initialize $Q_0 = P$ and $T_0 = \mathbf{1}$.
for $n = 1$ to $N$ do
    Update $U_n$ based on Eqn. (12).
    Update $T_n$ based on Eqn. (16).
    Update $Q_n$ based on Eqn. (19).
    Update $\gamma := \delta\gamma$, $\delta > 1$.
end for
Output: haze-free image $J$, obtained by adding $A$ to $Q_N$ in each color channel.

For an instance of our energy model, we concretize these regularization terms in Eqn. (10). Specifically, for dark channel $U$, we use $\ell_1$ regularization for sparse and low pixel values. For transmission $T$, we use ATGV [10], [48] regularization for smooth and edge-aligned maps. For clean image $Q$, we do not add any constraints, i.e., $f(U) = \|U\|_1$, $g(T) = \mathrm{ATGV}(T)$ and $h(Q) = 0$. ATGV is the anisotropic total generalized variation, which is defined with respect to an initial transmission map $\tilde{T}$ and an anisotropic diffusion tensor $D^{\frac{1}{2}}$ decided by the guide image $P$. The dehazing result is shown as Ours-EM (energy minimization) in Figure 3. We also demonstrate several other energy-based methods as comparisons, and we can see that our energy model is effective in removing hazes.

Figure 4. The architecture of proximal dehaze-net. We first estimate the atmospheric light $A$ from input hazy image $I$ by A-Net, then subtract $A$ from $I$ in each color channel to get the scaled hazy image $P$ ($P^c = I^c - A^c$), which will be sent into a multi-stage learning framework to predict the haze-free image. For each stage of the learning framework, we first calculate the auxiliary variables $\hat{U}_n$, $\hat{T}_n$ and $\hat{Q}_n$ following Algorithm 1 and then learn corresponding proximal mappings with CNNs. $F_n$, $G_n$, and $H_n$ in the gray dashed box are sub-modules to be learned at the $n$-th stage. After $N$ stages, we get the final predictions $U_N$, $T_N$ and $Q_N$. We add $A$ back to $Q_N$ to reconstruct the clean image $J$. The whole framework can be trained end-to-end.

IV. PROXIMAL DEHAZE-NET
Although our energy-based method can remove hazes effectively, we can further improve it in visual performance and processing speed by learning haze-related image priors. Instead of designing image priors by hand according to human experiences, we model haze-related priors with convolutional neural networks by learning the proximal mappings that appear in Section III-B. Note that the optimization process introduced above requires the atmospheric light A to be known in advance. In the conference version of our work [22], we followed [4] to pre-estimate the atmospheric light A. In this paper, to estimate A more accurately, we introduce an extra convolutional neural network to predict the atmospheric light. Combining the learning of A and the proposed iterative optimization algorithm, we build a deep learning framework for single image dehazing, denoted by proximal dehaze-net, as illustrated in Figure 4.
The core part of this framework is an architecture with N stages implementing N iterations of Algorithm 1 for solving Eqn. (10). As shown in Figure 4, the atmospheric light A is first estimated by A-Net, then we subtract A from I in each color channel to obtain P, which is a normalized hazy image within the range of [−1, 1]. P is sent into an N-stage learning framework. The n-th stage takes the outputs of the previous stage, U n−1 , T n−1 and Q n−1 , as inputs, and learns to predict U n , T n and Q n for the current stage.
As mentioned above, instead of designing image priors by hand, we model them by using CNNs to learn their corresponding proximal operators $\operatorname{prox}_{\frac{1}{u_n} f}$, $\operatorname{prox}_{\frac{1}{t_n} g}$ and $\operatorname{prox}_{\frac{1}{q_n} h}$ for updating $U_n$, $T_n$ and $Q_n$ in each stage, i.e.,
$$U_n = F_n(\hat{U}_n), \quad T_n = G_n(\hat{T}_n), \quad Q_n = H_n(\hat{Q}_n), \tag{20}$$
where $F_n$, $G_n$ and $H_n$ are sub-modules to be learned for representing the corresponding proximal operators at the $n$-th stage. In this way, we design an end-to-end training architecture, dubbed as proximal dehaze-net (PDN). As shown in Figure 4, each stage of the proximal dehaze-net implements one iteration of model optimization discussed in Section III-B, and the proximal operators are substituted by convolutional neural networks as in Eqn. (20). Note that in Eqn. (14), we assume that the regularization functions are separable, while in this section, we relax this constraint to make full use of the representation ability of convolutional neural networks. We now introduce the structures of these sub-modules for each stage.
At the $n$-th stage, $\hat{U}_n$ is first computed by Eqn. (13), then sent into a convolutional neural network, U-Net$_n$, to learn sub-module $F_n$. The updated dark channel is:
$$U_n = F_n(\hat{U}_n) = \text{U-Net}_n(\hat{U}_n, P),$$
in which we concatenate $\hat{U}_n$ with hazy image $P$ as the input of U-Net$_n$ to prevent information loss. Similarly, $\hat{T}_n$ is first computed by Eqn. (17), concatenated with $P$ and then sent into a convolutional neural network, T-Net$_n$, to learn sub-module $G_n$. Note that T-Net$_n$ is followed by a GIF block (performing guided image filtering) to ensure edge alignment between the transmission map and the original image. The updated transmission map is computed as:
$$\bar{T}_n = \text{T-Net}_n(\hat{T}_n, P), \quad T_n = \mathrm{GIF}(\bar{T}_n, P),$$
where $\bar{T}_n$ is the output of T-Net$_n$, and $\mathrm{GIF}(\bar{T}_n, P)$ is to perform guided image filtering [56] on image $\bar{T}_n$ with $P$ as the guidance image. Thus, the sub-module $G_n$ can be represented as the composition of T-Net$_n$ and the GIF block:
$$G_n = \mathrm{GIF} \circ \text{T-Net}_n.$$
Finally, for $Q_n$, the intermediate variable $\hat{Q}_n$ is first computed using Eqn. (19), and then sent into a convolutional neural network Q-Net$_n$ concatenated with $P$ to perform proximal mapping $\operatorname{prox}_{\frac{1}{q_n} h}$. The updated clean image is:
$$Q_n = H_n(\hat{Q}_n) = \text{Q-Net}_n(\hat{Q}_n, P),$$
where $Q_n$ is the estimation of the haze-free image with $A$ subtracted at the $n$-th stage. $U_n$, $T_n$ and $Q_n$ then serve as the inputs of the $(n + 1)$-th stage of our proximal dehaze-net.
After N stages, we obtain the final outputs U N , T N and Q N , and we use Q N and the predicted atmospheric light A to reconstruct the haze-free image as shown in Figure 4.
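The data flow of one stage can be sketched in NumPy with the learned sub-modules replaced by plain callables. This is only a structural illustration: `pdn_stage`, the callables `F`, `G`, `H`, the patch size and `eps` are hypothetical stand-ins, and no trained network is involved.

```python
import numpy as np

def dark_channel(img, patch=3):
    """Minimum over color channels and over a patch x patch window."""
    min_rgb = img.min(axis=2)
    r = patch // 2
    padded = np.pad(min_rgb, r, mode="edge")
    H, W = min_rgb.shape
    out = np.empty_like(min_rgb)
    for i in range(H):
        for j in range(W):
            out[i, j] = padded[i:i + patch, j:j + patch].min()
    return out

def pdn_stage(P, Pd, T_prev, Q_prev, F, G, H, alpha=1.0, beta=1.0, gamma=1.0, eps=0.05):
    """One stage: closed-form intermediates, each followed by a (learned) mapping."""
    Qd_prev = dark_channel(Q_prev)
    u = beta * T_prev ** 2 + gamma
    U = F((beta * T_prev * Pd + gamma * Qd_prev) / u, P)        # U-Net stand-in
    t = alpha * (Q_prev ** 2).sum(axis=2) + beta * U ** 2
    T_hat = (alpha * (Q_prev * P).sum(axis=2) + beta * U * Pd) / (t + 1e-12)
    T = G(T_hat, P)                                             # T-Net + GIF stand-in
    Q = H(P / np.maximum(T, eps)[..., None], P)                 # Q-Net stand-in
    return U, T, Q

identity = lambda x, P: x
rng = np.random.default_rng(3)
P = rng.uniform(-0.9, -0.1, (6, 6, 3))     # toy normalized hazy image
Pd = dark_channel(P)

# Stage 1 with identity modules, starting from Q_0 = P and T_0 = 1.
U1, T1, Q1 = pdn_stage(P, Pd, np.ones((6, 6)), P, identity, identity, identity)

# Same stage with a toy transmission module that always predicts 0.5.
half = lambda x, P: np.full_like(x, 0.5)
_, T2, Q2 = pdn_stage(P, Pd, np.ones((6, 6)), P, identity, half, identity)
```

With identity mappings the stage leaves the initialization unchanged in this sketch, since the data terms are already satisfied by $Q_0 = P$ and $T_0 = \mathbf{1}$; any actual haze removal therefore has to come from the mappings $F_n$, $G_n$ and $H_n$.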
Sub-Networks: Our proximal dehaze-net includes four kinds of sub-networks, i.e., A-Net, U-Net, T-Net and Q-Net, and they share similar structures. As shown in Figure 5, we adopt the commonly used residual encoder-decoder (RED) as the base architecture for these sub-networks. Accordingly, these sub-networks consist of several stacked down-sampling convolution blocks for the encoder part and up-sampling convolution blocks for the decoder part. The bottleneck is made up of stacked residual blocks [57], [58]. Skip connections between the encoder and the decoder are used to prevent from losing spatial information. For the last layer of an RED, we use Sigmoid for A-Net and T-Net, since the outputs are within the range of [0, 1], and we use Tanh for U-Net and Q-Net since their outputs are within the range of [−1, 1].
GIF Block: GIF block stands for guided filtering [56] computation block within our proximal dehaze-net. GIF block enforces the transmission map learned by T-Net to be well aligned with the image in edges. It takes the hazy image P as guidance and performs guided image filtering on the output of T-Net. As stated in [56], the process of GIF consists of a series of average filtering operations and simple element-wise operations. The computation graph of GIF block is shown in Figure 6, and more details of the GIF algorithm can be found in [56]. Many previous works [16], [17], [56] use GIF as a post-processing of the estimated transmission map. On the contrary, we include the process of guided image filtering within our end-to-end trainable system. For GIF block implementation, we adopt the work of Wu et al. [59] and we will discuss the necessity of GIF block later.
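The sequence of average filtering and element-wise operations mentioned above can be sketched as follows. This is a plain NumPy version of the standard guided filter with a single-channel guidance image; the radius, the regularization `eps`, the edge padding and the use of a grayscale guide are assumptions for illustration and do not reproduce the trainable implementation of [59].

```python
import numpy as np

def box_mean(x, r):
    """Mean over a (2r+1) x (2r+1) window using an integral image and edge padding."""
    k = 2 * r + 1
    padded = np.pad(x, r, mode="edge")
    c = np.cumsum(np.cumsum(padded, axis=0), axis=1)
    c = np.pad(c, ((1, 0), (1, 0)))                  # leading zero row and column
    s = c[k:, k:] - c[:-k, k:] - c[k:, :-k] + c[:-k, :-k]
    return s / (k * k)

def guided_filter(guide, src, r=2, eps=1e-3):
    """Filter src so that its edges follow those of guide (both 2-D arrays)."""
    mean_I, mean_p = box_mean(guide, r), box_mean(src, r)
    var_I = box_mean(guide * guide, r) - mean_I * mean_I
    cov_Ip = box_mean(guide * src, r) - mean_I * mean_p
    a = cov_Ip / (var_I + eps)                       # local linear coefficients
    b = mean_p - a * mean_I
    return box_mean(a, r) * guide + box_mean(b, r)

rng = np.random.default_rng(4)
guide = rng.uniform(0.0, 1.0, (16, 16))
flat = guided_filter(guide, np.full((16, 16), 0.7))  # constant input stays constant
same = guided_filter(guide, guide, eps=1e-8)         # filtering the guide by itself
```

Every step is differentiable with respect to the filtered input, which is what allows such a block to sit inside an end-to-end trainable network rather than being applied as post-processing.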
Loss Functions: To train our proximal dehaze-net, we introduce commonly used loss functions, including $\ell_1$ reconstruction loss, total variation (TV) loss and structure similarity (SSIM) loss on network outputs and the corresponding ground truths. Specifically, the atmospheric light loss is:
$$L_A = \| A^* - A^{gt} \|_1 + \varepsilon_1 L_{tv}(A^*) + \varepsilon_2 L_s(A^*, A^{gt}),$$
where $A^*$ is the atmospheric light map estimated by A-Net, $L_{tv}$ is the TV loss and $L_s(\cdot, \cdot) = 1 - \mathrm{SSIM}(\cdot, \cdot)$ defines the structure similarity loss. As for $U$, $T$ and $Q$, the loss function is defined as:
$$L_z = \sum_{n=1}^{N} \Big( \| z_n - z^{gt} \|_1 + \varepsilon_1 L_{tv}(z_n) + \varepsilon_2 L_s(z_n, z^{gt}) \Big),$$
where $z \in \{U, T, Q\}$, $Q^{gt} = J^{gt} - A^{gt}$, $N$ is the number of PDN stages, $\varepsilon_1$ and $\varepsilon_2$ are coefficients for TV loss and SSIM loss. Thus the total loss function is:
$$L = \lambda_A L_A + \lambda_U L_U + \lambda_T L_T + \lambda_Q L_Q,$$
where $\lambda_A$, $\lambda_U$, $\lambda_T$, $\lambda_Q$ are weights for these loss terms. Empirically, for all experiments, the loss coefficient $\varepsilon_1$ for total variation loss is set to 0.1 for $L_A$, 0.01 for $L_U$ and $L_T$, and 0.001 for $L_Q$. The coefficient $\varepsilon_2$ for SSIM loss is set to 0.01 for all loss terms. Finally we set the coefficients $\lambda_A = \lambda_U = \lambda_T = \lambda_Q = 1$.
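One plausible reading of such a compound loss is sketched below in NumPy. Several choices are assumptions made for brevity: the $\ell_1$ and TV terms are averaged over pixels, SSIM is computed from global image statistics instead of local windows, and the per-stage terms are simply summed.

```python
import numpy as np

def l1_loss(x, y):
    return np.mean(np.abs(x - y))

def tv_loss(x):
    """Mean absolute difference between vertical and horizontal neighbors."""
    return np.abs(np.diff(x, axis=0)).mean() + np.abs(np.diff(x, axis=1)).mean()

def global_ssim(x, y, data_range=1.0):
    """Simplified SSIM from global statistics (the usual SSIM uses local windows)."""
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = np.mean((x - mx) * (y - my))
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def compound_loss(pred, gt, eps1, eps2):
    """l1 reconstruction + eps1 * TV + eps2 * (1 - SSIM)."""
    return l1_loss(pred, gt) + eps1 * tv_loss(pred) + eps2 * (1.0 - global_ssim(pred, gt))

def multi_stage_loss(stage_preds, gt, eps1, eps2):
    """Sum of the compound loss over the outputs of all stages."""
    return sum(compound_loss(p, gt, eps1, eps2) for p in stage_preds)

gt = np.full((8, 8), 0.5)
pred = np.full((8, 8), 0.2)
loss_T = compound_loss(pred, gt, eps1=0.01, eps2=0.01)
loss_perfect = compound_loss(gt, gt, eps1=0.01, eps2=0.01)
loss_two_stages = multi_stage_loss([pred, gt], gt, eps1=0.01, eps2=0.01)
```

For the constant maps used here the TV term vanishes, so the loss is dominated by the $\ell_1$ error with a small SSIM correction, in line with the small coefficients reported above.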

V. EXPERIMENTS
To verify the effectiveness of the proposed proximal dehaze-net, we evaluate our method on different datasets and compare it with other single image dehazing methods.

A. DATASETS
We evaluate our proximal dehaze-net on multiple benchmark datasets for single image dehazing, including RESIDE dataset [60] and NTIRE 2018 single image dehazing challenge [61]. Both RESIDE and NTIRE 2018 datasets consist of indoor and outdoor subsets.

1) RESIDE DATASET
RESIDE dataset consists of indoor and outdoor datasets. The indoor dataset contains 13990 generated hazy/clean training images and 500 test images. The outdoor dataset contains about 8400 haze-free images with depth maps that can be used to synthesize training pairs and 500 hazy images as the test set. Note that for the outdoor dataset, we first remove redundant images from the training set that are overlapped with the test set. For each dataset, we randomly crop 64000 patches of 256 × 256 as the training set and apply horizontal/vertical flipping and random rotation as data augmentation. Considering the natural gap between indoor and outdoor images, we individually train two models for indoor and outdoor situations.
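The patch extraction and augmentation described above can be sketched as follows. The key point is that the hazy and clean images must receive exactly the same crop, flips and rotation. The function name is hypothetical, and restricting rotations to multiples of 90 degrees is an assumption, since the rotation angles are not specified here.

```python
import numpy as np

def random_paired_patch(hazy, clean, size, rng):
    """Crop the same random window from both images and apply identical augmentation."""
    H, W = hazy.shape[:2]
    top = rng.integers(0, H - size + 1)
    left = rng.integers(0, W - size + 1)
    h = hazy[top:top + size, left:left + size]
    c = clean[top:top + size, left:left + size]
    if rng.random() < 0.5:                      # horizontal flip
        h, c = h[:, ::-1], c[:, ::-1]
    if rng.random() < 0.5:                      # vertical flip
        h, c = h[::-1], c[::-1]
    k = int(rng.integers(0, 4))                 # rotation by k * 90 degrees (assumption)
    h, c = np.rot90(h, k), np.rot90(c, k)
    return np.ascontiguousarray(h), np.ascontiguousarray(c)

rng = np.random.default_rng(5)
clean = rng.uniform(0.0, 1.0, (300, 320, 3))
hazy = clean * 0.5 + 0.4                        # toy hazy image with constant transmission
h_patch, c_patch = random_paired_patch(hazy, clean, 256, rng)
```

Because the toy hazy image is a pixel-wise function of the clean one, checking that relation on the returned patches confirms that the pair stays aligned after augmentation.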

3) DATA PRE-PROCESSING
To effectively train our PDN model, we need ground truths for $A$, $T$, $U$ and $Q$. For the RESIDE dataset, $A^{gt}$, $T^{gt}$ and $J^{gt}$ are directly provided, $Q^{gt} = J^{gt} - A^{gt}$, and we compute the ground truth $U^{gt}$ as the dark channel of $J^{gt} - A^*$. However, for I-Hazy and O-Hazy, only $J^{gt}$ is available. To handle this problem, we first treat $A$ and $T$ as unknowns and obtain them by minimizing an energy function over $A \in \mathbb{R}^3$ and $T \in \mathbb{R}^{p \times p}$, which means that we compute $A$ and $T$ within a $p \times p$ local patch ($p = 512$ in our experiments). We use a simple gradient descent algorithm to solve this problem, and after we have $A$ and $T$, we prepare training data with the same setting as RESIDE. To show the effectiveness of the proposed pre-processing method, in Figure 7, we illustrate two examples of the intermediate results of the recovered transmission maps. Although the accuracy of the recovered transmission maps is restricted, since real-world hazy images do not abide strictly by Eqn. (1), the results are reasonable to a certain extent and sufficient to train our dehazing model.

B. IMPLEMENTATION DETAILS
We implement and train our PDN model with the PyTorch [62] framework. Note that the hyper-parameters [α, β, γ] are also trained simultaneously and they are initialized with 1. Since all hyper-parameters should be positive, we learn the exponent of them instead of themselves. We choose the Adam optimizer with default parameters, and the initial learning rate is set to 10^−4, which is decreased by multiplying a factor of 0.75 every 10 epochs. We use a batch size of 16 and it takes about 3 days to train a single-stage network for 100 epochs on a Titan X GPU. For multi-stage networks, we first initialize the weights with the pre-trained one-stage network and then continue the training.
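Two of these training details lend themselves to a tiny illustration: keeping the hyper-parameters positive by learning their exponents, and the step-wise learning rate decay. The sketch below uses plain Python; the variable names and the update applied to `log_params` are hypothetical and only demonstrate the parameterization.

```python
import math

# Learn theta and use exp(theta), so the effective value is always positive.
log_params = {"alpha": 0.0, "beta": 0.0, "gamma": 0.0}   # exp(0) = 1 at initialization

def effective(name):
    return math.exp(log_params[name])

initial_alpha = effective("alpha")
log_params["alpha"] -= 3.0          # an arbitrary large update for illustration
updated_alpha = effective("alpha")  # still positive

def learning_rate(epoch, base_lr=1e-4, factor=0.75, step=10):
    """Initial rate 1e-4, multiplied by 0.75 after every 10 epochs."""
    return base_lr * factor ** (epoch // step)

schedule = [learning_rate(e) for e in (0, 9, 10, 25)]
```

In a PyTorch training script the same schedule would typically be expressed with a step-based learning rate scheduler, but the exact configuration used by the authors is not given beyond the values above.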

C. RESULTS ON SYNTHETIC DATASETS
We first evaluate our proximal dehaze-net on synthetic datasets and compare it with other single image dehazing methods. We select some representative single image dehazing methods, including DCP [4], CAP [8], NLD [9], GRM [10], IDE [11], MSCNN [16], DehazeNet [17], AODNet [18], DcGAN [21], GDN [23], DADN [28] and FDGAN [27]. Among these methods, DCP, CAP, NLD, GRM and IDE are traditional image processing methods based on image priors. MSCNN, DehazeNet, AODNet, DcGAN, GDN, DADN and FDGAN are deep learning methods that predict either transmission maps or clean images. For fair comparisons, we retrained these learning-based methods on the corresponding datasets if the training codes are provided by authors. We evaluate these methods on SOTS and NTIRE 2018 datasets. Both datasets consist of indoor and outdoor subsets. We report the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) as performance metrics. As shown in Table 1, the learning-based methods are usually quantitatively superior to the traditional prior-based methods. Our proposed PDN model surpasses most methods and achieves competitive PSNR and SSIM values on multiple benchmark datasets. Specifically, on SOTS indoor and SOTS outdoor datasets, we achieve the highest PSNRs and SSIMs comparable with the state-of-the-art method GDN [23]. On the realistic dehazing datasets of NTIRE 2018 (I-Hazy and O-Hazy in Table 1), we achieve the best PSNRs and SSIMs among the compared methods in both indoor and outdoor cases. The dehazing results on NTIRE datasets also verify the effectiveness and rationality of the proposed data pre-processing algorithm to generate atmospheric lights and transmission maps from given hazy/clean image pairs.
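For reference, PSNR is computed from the mean squared error between the dehazed image and the ground truth. The sketch below assumes images scaled to [0, 1]; the function name and test values are illustrative.

```python
import numpy as np

def psnr(pred, target, data_range=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(data_range^2 / MSE)."""
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(data_range ** 2 / mse)

target = np.full((8, 8, 3), 0.5)
pred = target + 0.1                 # uniform error of 0.1, so MSE = 0.01
value = psnr(pred, target)
```

A uniform error of 0.1 on a unit range corresponds to 20 dB, which gives a rough sense of scale for the PSNR values compared in Table 1.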
As a visualization, we show some dehazing examples from these benchmark datasets in Figure 8. From Figure 8, we can see that learning-based methods usually perform better than the traditional prior-based methods in visual effect. The traditional methods (DCP, CAP) sometimes over-dehaze the images, causing darkened images and color distortion. On the other hand, the learning-based methods can produce dehazed images that are closer to the ground truths. However, some learning-based methods, such as DehazeNet and AODNet, sometimes fail to remove all the haze in an image. Recent methods, such as GDN, are powerful in removing hazes from synthetic images but perform less well on realistic images, e.g., the last two examples in Figure 8. In comparison, our method can remove hazes effectively while keeping the dehazed images visually pleasing.
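For reference, the PSNR metric reported in Table 1 follows the standard definition 10 log10(MAX^2 / MSE). A minimal dependency-free sketch is given below, assuming images are flattened to sequences of intensities in [0, 1]; the function name `psnr` and the `peak` argument are our own illustrative choices, not part of the released evaluation code.

```python
import math


def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized images.

    Both inputs are flat sequences of pixel intensities; `peak` is the
    maximum possible intensity (1.0 for images scaled to [0, 1]).
    """
    if len(reference) != len(estimate) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


# A uniform error of 0.1 on a [0, 1] image gives an MSE of 0.01, i.e. 20 dB.
example_db = psnr([0.0, 0.0, 0.0, 0.0], [0.1, 0.1, 0.1, 0.1])
```

Under this definition, a gain of 1 dB corresponds to a reduction of the mean squared error by roughly 21%.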

D. RESULTS ON REAL DATASETS
In Figures 9 and 10, we show the dehazing results of our method on real-world hazy images in comparison with the traditional methods and the learning-based methods, respectively. On one hand, as we can see from Figure 9, the traditional methods are usually effective in removing hazes because they exploit useful haze-related image priors. However, methods such as BCCR [7], CAP [8], NLD [9] and IDE [11] tend to overly enhance the hazy images, which causes over-saturation or color distortion. DCP [4] is likely to produce undesirable artifacts, especially in sky regions. GRM [10] suppresses artifacts well but loses detailed textures due to its smoothness regularization term. On the other hand, the learning-based methods are usually able to produce more visually pleasing results, as shown in Figure 10. However, MSCNN [16] and DehazeNet [17] are not always as effective as the traditional methods in haze removal. AODNet [18] often produces darkened images. DCPDN [20] predicts inexact transmission maps in areas with high intensity. GFN [19] and DcGAN [21] directly predict haze-free images and sometimes cause unexpected dark artifacts due to the lack of a haze imaging model constraint. FDGAN [27] and DADN [28] can produce more visually pleasing results, but they both rely on photo-realistic synthetic training data. By learning haze-related image priors, our method combines the advantages of the traditional methods and the learning-based methods, effectively removing hazes while keeping the dehazed images natural.

VI. DISCUSSION AND ANALYSIS
We now discuss the parameter selection, the effectiveness of learning haze-related image priors, and the necessity of the GIF block. We then analyze the effect of the number of stages on the performance of our model. We also compare the running speed with that of recent dehazing methods.

A. EFFECTS OF PARAMETERS FOR LOSS TERMS
As mentioned in Section IV, we use a combined loss function for training our network, which introduces multiple hyper-parameters for the different loss terms, i.e., ε_1 for the TV loss, ε_2 for the SSIM loss, and λ_z, z ∈ {A, U, T, Q}. First, we observe in experiments that the TV and SSIM losses only slightly affect the performance of our model. By removing the TV and SSIM losses, the PSNR on the SOTS indoor dataset decreases from 32.53 to 32.52 for the 1-stage PDN, and from 33.55 to 33.41 for the 2-stage PDN. Thus, we set the values of ε_1 and ε_2 empirically. Second, to evaluate the effect of λ_z, we vary one of them from 0 to 1.8 with a step of 0.2 while fixing the other three at 1, and train for 20 epochs to investigate the performance on a validation dataset (with no overlap with the test dataset). As we can see from Figure 11, when λ_z ≥ 1, our PDN model reaches a relatively steady state. Thus, for simplicity, we set λ_A = λ_U = λ_T = λ_Q = 1 in our experiments.
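The one-at-a-time sweep over λ_z described above can be enumerated as follows. This is a hedged sketch of the experimental grid only; the variable names are hypothetical and the training call itself is omitted.

```python
# Loss-weight names follow the index set z in {A, U, T, Q}.
names = ("A", "U", "T", "Q")

# Candidate values 0, 0.2, ..., 1.8 (rounded to avoid float drift).
values = [round(0.2 * k, 1) for k in range(10)]

# Vary one weight at a time while the other three stay fixed at 1.
configs = []
for varied in names:
    for value in values:
        config = {name: 1.0 for name in names}
        config[varied] = value
        configs.append((varied, config))

# Each entry would correspond to one 20-epoch training run evaluated on the
# validation set; that step is not shown here.
```

The grid contains 4 × 10 = 40 configurations, four of which coincide with the default setting in which all weights equal 1.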

B. LEARNING IMAGE PRIORS
Our network learns multiple haze-related image priors. In this section, we discuss the effect of learning each image prior. To do so, we respectively remove the dark channel prior f, the transmission prior g and the clean image prior h from the dehazing energy function in Eqn. (8). We then derive new proximal dehaze-nets without learning these priors, dubbed Net-ND (without learning the dark channel prior), Net-NT (without learning the transmission prior) and PDN-NC (without learning the clean image prior). We also discuss the effect of learning the atmospheric light. The results are shown in Figure 12. We can see that the PDN model benefits from learning these image priors with CNNs. Notably, learning the clean image prior and the atmospheric light contributes most to the improvement, by about 7 dB and 5 dB in PSNR respectively. Learning the dark channel prior and the transmission prior also brings considerable improvements to our PDN model, by about 1 dB and 1.5 dB respectively. This supports the effectiveness of learning haze-related image priors for the single image dehazing task.
As a visualization, we show an example of the dehazing results of these variants of our model in Figure 13. We can see that our full model achieves the highest PSNR, and that learning all of these image priors helps to remove hazes more effectively. Without learning the dark channel prior or the transmission prior, our model can still dehaze images effectively, but slight haze remains observable. Without learning the clean image prior or the atmospheric light, the remaining haze in the image is still quite obvious.
To illustrate what is learned for the proximal mappings F, G and H, Figure 14 shows an example of the learned proximal mappings for the dark channel, transmission map and clean image prior of our proximal dehaze-net. In Figure 14, the two images in each of the three red boxes denote the input and output of these learned proximal mappings. We can observe that the learned proximal mapping F produces a reasonable dark channel U with lower values than the input dark channel Û. The learned proximal mapping G produces a piecewise-smooth transmission map T that is consistent with the underlying scene depth and is more accurate than the input transmission estimate T̂. Finally, the proximal mapping H further removes haze residuals in the estimated image Q̂ and produces a clearer image Q.

FIGURE 14. The learned haze-related image priors by our proximal dehaze-net. Û, T̂, Q̂ are the dark channel, transmission and haze-free image before the prior learning modules. U, T, Q are the corresponding results after prior learning. We can see that the learned dark channel is darker, the learned transmission map is more accurate, and the learned haze-free image is visually better. Note that P and Q are obtained by subtracting A from I and J, and we add A back for better visualization.

C. NECESSITY OF GIF BLOCK
The GIF block is a part of the learned proximal mapping G. As stated in Section IV, the GIF block serves as a guided image filter. Its effect is to force edge alignment between the estimated transmission map and the original image. To verify the effectiveness of the embedded GIF block, we train a PDN without the GIF block (denoted as PDN-NG) and evaluate it on the SOTS indoor dataset. The performance is shown in Figure 12. We can see that PDN-NG achieves almost the same quantitative performance as the full PDN model. We also found that, for most images, the PDN model without the GIF block performs as well as our full model. The reason for this may be the powerful learning ability of Q-Net. However, as shown in Figure 15, for some examples the dehazed image produced by PDN-NG has a visible halo effect around image edges, which degrades the visual quality. In comparison, the result of our full model with the GIF block is clearer and more natural.

D. EFFECT OF MULTI-STAGE NETWORK
Our proximal dehaze-net model is a multi-stage learning architecture based on the dehazing optimization algorithm, and more network stages are expected to achieve higher performance on the benchmarks. However, since we use the residual encoder-decoder (RED) backbone for the sub-networks, which is powerful in general image restoration tasks, we are able to achieve satisfactory results with only one stage in both quantitative performance and visual effect. As shown in Table 1 and Figure 12, we use a two-stage PDN to achieve the best PSNR on the SOTS dataset. Compared with the one-stage PDN, the two-stage PDN improves the PSNR values on the indoor and outdoor datasets by 1.02 dB and 0.42 dB respectively. The improvements in SSIM values are less than 0.01. On the NTIRE I-Hazy and O-Hazy datasets, we achieve the best performance with a one-stage PDN. Adding more stages does not continue to improve the quantitative results significantly. Furthermore, the visual effects are similar among PDN models with different numbers of stages. For simplicity, we use a two-stage PDN (PDN-S2) for SOTS and a one-stage PDN (PDN-S1) for NTIRE in our paper.

E. RUNNING TIME
The most complex part of our model to implement is the dark channel computation and its backward process. In the conference version of our work [22], we implemented this in low-level CUDA for high-speed computation. In [32], the dark channel and its backward pass are realized with a look-up table technique. In [63], the dark channel is calculated by extracting local patches and finding the minimum. In this paper, we find that the dark channel can be simply computed as a negated one-stride max-pooling operation, which densely extracts the local maximum of a negated image, i.e., I_d = −MP(−min(I), w, 1), where min(I) is the pixel-wise minimum over the color channels and MP is the max-pooling operator with kernel size w and stride 1. Therefore, our implementation is purely PyTorch-based, and our PDN model can process images at high speed on GPU devices. In Table 2, we show the average running time (in seconds) of single image dehazing methods on 500 images of 460 × 620 resolution, i.e., the average time over 500 evaluations. For a fair comparison, we compute the running times on both CPU and GPU devices where possible. For GPU time, we test on a Titan X GPU device with 12 GB of memory. For CPU time, we test on an Intel Xeon E5-2650 CPU @ 2.20 GHz without parallel acceleration.
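The max-pooling identity above can be checked on a toy example. The sketch below is a dependency-free illustration rather than our PyTorch implementation: the helper names are hypothetical, images are nested lists indexed as image[channel][row][column], and windows at the image border are assumed to be clipped to the image.

```python
def channel_min(image):
    """Pixel-wise minimum over colour channels; image[c][y][x]."""
    h, w = len(image[0]), len(image[0][0])
    return [[min(ch[y][x] for ch in image) for x in range(w)] for y in range(h)]


def local_extreme(plane, k, reduce_fn):
    """Apply `reduce_fn` (max or min) over k x k stride-1 windows (k odd)."""
    h, w = len(plane), len(plane[0])
    r = k // 2
    return [[reduce_fn(plane[j][i]
                       for j in range(max(0, y - r), min(h, y + r + 1))
                       for i in range(max(0, x - r), min(w, x + r + 1)))
             for x in range(w)]
            for y in range(h)]


def dark_channel_maxpool(image, k):
    """Dark channel as the negated stride-1 max pooling of the negated minimum."""
    negated = [[-v for v in row] for row in channel_min(image)]
    pooled = local_extreme(negated, k, max)
    return [[-v for v in row] for row in pooled]


def dark_channel_patch_min(image, k):
    """Reference definition: minimum over channels and over the local patch."""
    return local_extreme(channel_min(image), k, min)


# Toy 3-channel, 3 x 3 image.
image = [
    [[0.9, 0.8, 0.7], [0.6, 0.5, 0.4], [0.3, 0.2, 0.1]],
    [[0.5, 0.5, 0.5], [0.5, 0.5, 0.5], [0.5, 0.5, 0.5]],
    [[0.2, 0.9, 0.9], [0.9, 0.9, 0.9], [0.9, 0.9, 0.9]],
]
dc = dark_channel_maxpool(image, 3)
```

Both functions return the same map because the minimum of a set equals the negated maximum of its negated elements. Expressing the operation through max pooling means that a framework's built-in pooling layer and its gradient can be reused directly.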

TABLE 2.
Computation time for different methods to dehaze a color image of 460 × 620. PDN-S1 and PDN-S2 denote our PDN model with 1 and 2 stages. We report the running times on CPU and GPU devices for fair comparison.

A. EXTENSIONS TO MORE APPLICATIONS
Though our network is trained for image dehazing, we can apply it to other similar tasks without retraining on their corresponding datasets. Figure 16 (a)-(c) shows an example of anti-halation enhancement using our method. Although halation has a different imaging model, it brings haze-like effects to images [17], [18]. Our proximal dehaze-net can therefore be directly applied to anti-halation image enhancement. In Figure 16 (d)-(f), we show an example of underwater image enhancement. Ignoring the forward scattering component, the simplified underwater optical model has a formulation similar to the haze imaging model [64], [65]. Like methods that are specifically designed for this task, such as [64], our network can effectively remove haze-like effects in this underwater image.

B. FAILED CASES
While our method performs well on most natural images, it has limitations in certain situations where the photo is taken at night or under heavy haze. For night-time hazy images, there are usually multiple light and color sources, and the image quality is degraded by low-light conditions. The commonly used haze imaging model is not sufficient to describe the night-time haze phenomenon, and our PDN model has no access to this kind of training data. As shown in Figure 17 (a)-(c), our method fails to remove all the haze in the image. On the other hand, the method [66] designed for night-time dehazing is capable of removing haze more adequately, but the visual quality is also lowered. As for images with very thick fog, too many details are overwhelmed by haze, and it becomes quite difficult for current methods to achieve satisfactory results. As shown in Figure 17 (d)-(f), visible haze still remains in the image dehazed by our method. In comparison, iPal-DH [67], trained on the Dense Haze dataset [68], is able to dehaze images with dense haze more effectively, but it also fails to recover the details covered by haze.

VIII. CONCLUSION
In this paper, we propose a model-driven deep learning approach, proximal dehaze-net, for single image dehazing. We first build an energy function based on haze imaging model constraints in both image space and haze-related feature space, and then design an iterative algorithm for solving the energy model. We unfold the iterative algorithm into a multi-stage network by learning proximal operators using CNNs. The proposed proximal dehaze-net achieves promising results for single image dehazing.
Although the proposed PDN model is effective in most cases, our method shares some inherent drawbacks with other learning-based methods. First, the learning scheme is built upon the haze imaging model, which may limit its application to more complex real-world situations, such as non-uniform haze or night-time haze. This problem could be addressed by considering imaging models that better describe real situations and by better techniques for simulating realistic datasets. Second, our method has difficulty in handling cross-domain image dehazing problems. As shown in Table 1, we have to train separate models to achieve the best performance on different benchmarks, and a model trained on one dataset may perform poorly on another. This problem can be mitigated by a simple early-stopping strategy that prevents over-fitting to a specific dataset, at the cost of lower performance on that dataset. There is thus a trade-off between the performance on one dataset and the generalization ability of the trained model. In the future, we will consider domain adaptation to better handle this problem.