A Progressive Single-Image Dehazing Network with Feedback Mechanism

In the past decade, deep learning methods, especially convolutional neural networks, have received much attention in single-image dehazing. However, the haze in a hazy image cannot be cleanly separated because it is intricately mixed with the background components. If the haze is removed too coarsely, the background tone determined by the global atmospheric light may also be destroyed. To resolve this problem and reconstruct clearer, higher-quality dehazed images, we introduce a progressive feedback network (PFBN), a recurrent structure tied to a feedback mechanism. The feedback mechanism is implemented by stacking feedback blocks that contain feedback connections among iterations. At the input layer of each feedback block, the block's hidden state from the previous iteration is delivered by a feedback connection to the present block as part of the input. This hidden state, also referred to as high-level information, is fused with the low-level information output by the previous block to generate an effective feature representation. Moreover, we propose a self-ensemble enhancement strategy that reduces the random error of the network and helps reconstruct clearer dehazed images. Finally, we conduct extensive experiments to verify the performance of our method.


I. INTRODUCTION
ONE of the most common types of weather is haze, which generates complicated noise. It therefore poses great challenges when hazy images are used in downstream applications. Removing haze is necessary for many outdoor applications, such as object recognition and video surveillance [1]. However, because the amount of useful information in a single hazy image is insufficient, single-image dehazing is considered an ill-posed task [2]. Today, this topic is one of the most actively studied problems in image reconstruction and artificial intelligence [3].
Many algorithms focusing on single-image dehazing have been proposed in recent years. These algorithms can be roughly classified into two categories: model-based approaches and data-driven approaches, that is, deep learning-based methods. According to previous studies [4] [5] [6], the classical model for generating a hazy image, I, is as follows:
$$I(x) = J(x)\,T(x) + A\,(1 - T(x)) \tag{1}$$
where J refers to the haze-free scene radiance; I refers to the observed hazy image; A denotes the global atmospheric light, indicating the strength of the environmental light; T refers to the transmission map; and x is the pixel location.
Most methods use this model and design two networks to obtain A and T; however, this approach is computationally intensive, and the resulting networks are difficult to train and have demanding hardware requirements.
According to [2], a simpler but more effective model is employed in this paper:
$$J(x) = K(x)\,I(x) - K(x) + b \tag{2}$$
In the equation above, K(x) is a new unified variable that combines 1/T(x) and A. The variable b is a constant bias whose default value is set to 1, as in [2].
Therefore, the network only needs to estimate one unified parameter matrix K(x) rather than the two quantities in Eqn. 1, which is more effective and computationally efficient.
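The relation between the two models can be checked numerically. The sketch below is illustrative only: it works on a single scalar pixel, and it assumes the closed form of K(x) given in the AOD-Net formulation [2], K(x) = ((I(x) − A)/T(x) + (A − b)) / (I(x) − 1), which is undefined where I(x) = 1. The function names and sample values are hypothetical.

```python
def hazy_pixel(j, t, a):
    """Atmospheric scattering model: I = J*T + A*(1 - T)."""
    return j * t + a * (1.0 - t)


def unified_k(i, t, a, b=1.0):
    """Unified variable K (form assumed from [2]); undefined at I == 1."""
    if i == 1.0:
        raise ValueError("K(x) is undefined where I(x) == 1")
    return ((i - a) / t + (a - b)) / (i - 1.0)


def dehaze_with_k(i, k, b=1.0):
    """Simplified model: J = K*I - K + b."""
    return k * i - k + b


# Illustrative values for one pixel (not taken from the paper).
j_true, t, a = 0.30, 0.60, 0.90
i_obs = hazy_pixel(j_true, t, a)      # synthesize the hazy observation
k = unified_k(i_obs, t, a)            # fold T and A into a single K
j_rec = dehaze_with_k(i_obs, k)       # recover the haze-free value
```

With these values, `j_rec` agrees with `j_true` up to floating-point error, which is why estimating K(x) alone is sufficient to recover J(x).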
Currently, there are many downstream applications in computer vision and image enhancement, such as object detection and autonomous driving, and all of them require clean, high-quality data. The performance of these applications therefore depends to a large extent on whether the preceding data processing was successful. Consequently, it is vital to enhance the captured visual data, for example through dehazing and deraining, to obtain cleaner data that can then be used in downstream applications.

II. RELATED WORK
A. DATA-DRIVEN SINGLE-IMAGE DEHAZING METHODS
To date, in the field of computer vision and image enhancement, most of the designed networks have been end-to-end networks that reconstruct clean images directly from the original hazy images. There is a wide consensus that the key to reconstructing clean images is obtaining a precise medium transmission map.
The recurrent structure is also employed in image dehazing. These methods aim to remove the haze accumulated in images iteratively and progressively. Jiang et al. [9] proposed a lightweight recurrent network for video dehazing, which makes it possible to train on such a large dataset in a limited amount of time. Tan et al. [10] introduced a novel method in which a Markov random field was employed to obtain the transmission map T.

B. FEEDBACK MECHANISM
The feedback mechanism is a simple but effective approach to enhancing the performance of networks with recurrent structures. The core of this mechanism is delivering high-level information to low-level layers through a feedback connection to build a powerful feature representation. The output of the network, and thus the reconstructed clean image, then becomes clearer with each iteration.
Many previous studies have successfully applied feedback mechanisms to various tasks. The authors of [11] first introduced a feedback mechanism, adopted it for image classification, and designed several helpful strategies that obtained outstanding performance. Later, [12] utilized a feedback strategy for super-resolution and replaced the LSTM [13] block with their own feedback block to achieve a better result. Then, Shama et al. [14] obtained an improvement by combining a feedback approach with generative adversarial networks, in which the feedback loop transmits the discriminator's spatial information to the generator. Recently, a feedback strategy has been employed in action detection with graph neural networks to enhance the representation ability of the networks [15].
In this paper, we introduce a feedback block designed specifically for the feedback mechanism. After several feedback blocks are stacked, the resulting deep neural network can extract more precise information and features from the original hazy images, and this strong representation helps reconstruct the dehazed images. Because each iteration can correct the previous state, the haze is eliminated gradually while the background becomes clearer with each iteration.

III. PROPOSED METHODS
Two components are required in a feedback mechanism: 1) iterativeness and 2) rerouting the output back into the system to amend the input in each loop. To meet these requirements, a recurrent structure is employed, and feedback loops are adopted in each feedback block. As shown in Fig. 1, the whole network can be unfolded into T sub-networks, where T is a preset parameter that represents the number of iterations. The parameters of the sub-networks are shared globally across iterations. In addition, each sub-network is forced to estimate K(x) and reconstruct a dehazed image for calculating its sub-loss.

1) Input Layer:
The input layer $f_{in}^{t}$ in the $t$-th iteration receives the original hazy image $I$ and the dehazed image $J^{t-1}$ output by the previous iteration:
$$F_{in}^{t} = f_{in}^{t}\left([I \,|\, J^{t-1}]\right)$$
where $[\cdot|\cdot]$ refers to the concatenation operation and $F_{in}^{t}$ is the feature map produced by the input layer. The input layer is essentially a fusion layer that merges the high-level information of the previous iteration with the low-level original information. The input layer consists of a convolutional operation Conv(6, m), where m is the number of channels of $F_{in}^{t}$; we set m = 64 in this paper.
2) Feedback Backbone:
The feedback backbone $f_{backbone}^{t}$ is a stack of feedback blocks $\{FB_1, FB_2, ..., FB_B\}$ (the feedback block is described in detail in III-B, and the exact number of feedback blocks is explored in IV-A). The feedback backbone receives the feature map $F_{in}^{t}$ output by the input layer:
$$F_{backbone}^{t} = f_{backbone}^{t}\left(F_{in}^{t}\right)$$
where $F_{backbone}^{t}$ refers to the hidden state produced by the feedback backbone in the $t$-th iteration. Note that each feedback block inside the feedback backbone also receives extra information; more details can be found in III-B.
3) Output Layer:
In the output layer $f_{out}^{t}$, the feature map $F_{backbone}^{t}$ is used to calculate the unified parameter matrix $K^{t}(x)$:
$$K^{t}(x) = f_{out}^{t}\left(F_{backbone}^{t}\right)$$
The output layer contains a convolutional operation Conv(m, 3) that transforms the m-channel $F_{backbone}^{t}$ into the 3-channel $K^{t}(x)$.
Fig. 1. The red lines denote the feedback connections designed for information transmission. The black lines from the dehazed images to the input layer indicate that the input layer receives the dehazed image reconstructed in the previous iteration.
4) Recurrent Loss:
In each iteration, after obtaining the estimated $K^{t}(x)$, the sub-network is forced to generate a haze-free scene and calculate the sub-loss. Note that the parameters are shared globally across iterations. The network is forced to restore the background at each iteration, and the T sub-losses are combined to obtain the final loss:
$$loss = \sum_{i=1}^{T} w_i \, loss_i$$
where loss is the final loss of the whole network, $loss_i$ is the sub-loss of the $i$-th sub-network, and $w_i$ is the weight of each sub-loss. In this paper, we set $w_i = 1$ for i = 1, 2, ..., T, as in [11], where T is the number of iterations of the proposed network (the exact value of T is discussed in IV-A). The feedback mechanism requires rerouting the output back into the system in each iteration [11]. Consequently, the main difference between a feedback network and a recurrent feedforward network is the recurrent loss: simply adopting the loss of the last iteration as the final loss would lead to a feedforward network rather than a feedback network. Therefore, the sub-loss of every iteration must be considered in our proposed network.
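The unfolded computation and the recurrent loss can be sketched as follows. This is a minimal illustration rather than the actual network: `toy_step` is a hypothetical stand-in for one shared-parameter sub-network, images are flattened lists of pixel values, MSE is used as the sub-loss, and initialising the first "previous output" with the hazy image is an assumption that the text does not specify.

```python
def mse(pred, target):
    """Mean squared error between two flattened images."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)


def run_unfolded(hazy, target, step, num_iterations, weights=None):
    """Apply the shared sub-network `step` T times and sum the weighted sub-losses."""
    if weights is None:
        weights = [1.0] * num_iterations      # w_i = 1, as in the text
    previous = hazy                           # assumption: J^0 starts as the hazy image
    outputs, total = [], 0.0
    for t in range(num_iterations):
        previous = step(hazy, previous)       # same `step` every time: shared parameters
        outputs.append(previous)
        total += weights[t] * mse(previous, target)   # every sub-loss contributes
    return outputs, total


def toy_step(hazy, previous):
    """Hypothetical sub-network: moves the estimate halfway towards (hazy - 0.4)."""
    return [p - 0.5 * (p - (h - 0.4)) for h, p in zip(hazy, previous)]


hazy = [0.6, 0.8]
target = [0.2, 0.4]
outputs, recurrent_loss = run_unfolded(hazy, target, toy_step, num_iterations=3)
last_only_loss = mse(outputs[-1], target)     # what a feedforward-style loss would use
```

Here `recurrent_loss` accumulates the errors of all three iterations, whereas `last_only_loss` sees only the final output, so the earlier iterations would receive no direct supervision.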

B. FEEDBACK BLOCK
A feedback block is composed of a fusion layer and the K-estimation module proposed by [2]. The fusion layer of the $i$-th feedback block in the $t$-th iteration receives the output of the $(i-1)$-th feedback block in the $t$-th iteration, $FB_{i-1}^{t}$, and the output of the $i$-th feedback block in the $(t-1)$-th iteration, $FB_{i}^{t-1}$. The fusion layer is essentially an averaging operator: it delivers the average feature map of $FB_{i-1}^{t}$ and $FB_{i}^{t-1}$ to the K-estimation module. As shown in Fig. 2, the K-estimation module contains 5 convolutional operations and 3 concatenation operations. The first concatenation layer concatenates the hidden states produced by Conv1x1 and Conv3x3. Similarly, the second concatenation layer processes those of Conv3x3 and Conv5x5. The final concatenation layer receives all hidden states produced by the previous 4 convolutional operations and delivers the concatenated feature map to the final convolutional layer.

C. SELF-ENSEMBLE
The self-ensemble method is an enhancement method that decreases the variance of the model [16]. By applying combinations of horizontal, vertical, and diagonal flips to the input hazy image, we obtain eight augmented hazy images $[I_{B1}, I_{B2}, ..., I_{B8}]$, as shown in Fig. 3. Inspired by the great success of this strategy in the field of super-resolution, we adopt it for single-image dehazing.
After feeding the network these eight hazy images, we obtain eight dehazed images $[O_{B1}, O_{B2}, ..., O_{B8}]$, flip them back, and average them to obtain the final output of the network.
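The procedure can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions: images are single-channel lists of rows, the "diagonal flip" is taken to be a transpose, the eight variants are taken to be every on/off combination of the three flips, and `identity_network` is a hypothetical placeholder for the trained PFBN.

```python
def hflip(img):
    """Horizontal flip: reverse every row."""
    return [row[::-1] for row in img]


def vflip(img):
    """Vertical flip: reverse the order of the rows."""
    return [row[:] for row in img[::-1]]


def transpose(img):
    """Diagonal flip (assumed here to mean a transpose)."""
    return [list(col) for col in zip(*img)]


def make_transforms():
    """Eight (forward, inverse) pairs, one per on/off combination of the three flips."""
    pairs = []
    for use_t in (False, True):
        for use_h in (False, True):
            for use_v in (False, True):
                def forward(img, t=use_t, h=use_h, v=use_v):
                    if h:
                        img = hflip(img)
                    if v:
                        img = vflip(img)
                    if t:
                        img = transpose(img)
                    return img

                def inverse(img, t=use_t, h=use_h, v=use_v):
                    # Each flip undoes itself, so apply them in reverse order.
                    if t:
                        img = transpose(img)
                    if v:
                        img = vflip(img)
                    if h:
                        img = hflip(img)
                    return img

                pairs.append((forward, inverse))
    return pairs


def self_ensemble(network, hazy):
    """Dehaze all eight variants, flip the outputs back, and average them pixel-wise."""
    restored = [inv(network(fwd(hazy))) for fwd, inv in make_transforms()]
    rows, cols = len(hazy), len(hazy[0])
    return [[sum(img[r][c] for img in restored) / len(restored)
             for c in range(cols)] for r in range(rows)]


def identity_network(img):
    """Hypothetical placeholder for the trained dehazing network."""
    return img


hazy = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
ensembled = self_ensemble(identity_network, hazy)
```

Because the network is evaluated once per variant, the inference cost grows roughly eightfold, while no additional parameters are introduced.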

IV. EXPERIMENTAL RESULTS
In this section, we conduct a series of ablation experiments to determine the key components of our proposed network (PFBN). Then, we quantitatively and qualitatively compare our network with state-of-the-art methods on several common benchmark datasets.
In our implementation, the number of channels m is set to 64, and the kernel size of the convolutional operations is 3. All experiments are implemented with the PyTorch framework [17]. Training is conducted on a Linux PC equipped with NVIDIA 2080Ti GPUs. In the experiments, the ADAM optimizer [18] is adopted to train the models with an initial learning rate of 1e-5.

A. ABLATION STUDIES
All ablation studies are carried out on a benchmark test dataset called SOTS [12], which contains 1,000 pairs of hazy images and clean images. The evaluation metrics are PSNR and SSIM [19].
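For reference, PSNR follows the standard definition 10·log10(MAX² / MSE), where MAX is the largest possible pixel value. The sketch below is illustrative: it assumes flattened images with pixel values in [0, 1], and the function name is hypothetical.

```python
import math


def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB; higher means closer to the ground truth."""
    mse = sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf          # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Each pixel is off by 0.1, so MSE = 0.01 and PSNR is about 20 dB.
example_psnr = psnr([0.5, 0.5], [0.4, 0.6])
```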

1) Study of Iterations
In this subsection, we discuss the influence of the number of iterations (denoted as T) while fixing the number of blocks at 3. Fig. 4 shows the PSNR values of the proposed PFBN for T = 1, 2, ..., 8. Notably, when T > 1 (with feedback connections), the metric of the reconstructed images is significantly higher than when T = 1, which means that PFBN indeed benefits from the feedback strategy. Moreover, the performance continues to rise as T increases, and it almost converges to a constant when T > 7. Although there is still a slight improvement when T > 7, considering GPU memory and running time, we set T = 7 as the final number of iterations in the proposed network structure.

2) Study of Feedback Blocks
Similar to the study of iterations, we explore the number of feedback blocks (denoted as B) by setting B = 1, 2, 3, 4 while fixing T = 7. Fig. 5 shows that B = 3 is the most suitable value. Therefore, the final structure of our proposed PFBN uses B = 3 and T = 7.

3) Study of loss function
There are several widely adopted loss functions, such as MSE loss, L1 loss, negative SSIM loss and recurrent MSE loss:
$$L_{MSE}(X, Y) = \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} \left(X_{ij} - Y_{ij}\right)^2$$
$$L_{1}(X, Y) = \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} \left|X_{ij} - Y_{ij}\right|$$
$$L_{SSIM}(X, Y) = -\frac{(2\mu_X \mu_Y + C_1)(2\sigma_{XY} + C_2)}{(\mu_X^2 + \mu_Y^2 + C_1)(\sigma_X^2 + \sigma_Y^2 + C_2)}$$
where X, Y are the reconstructed image and the ground truth, respectively; m, n denote the height and width, respectively; µ refers to the mean of all pixel values of an image; σ refers to the standard deviation; σ_XY denotes the covariance of images X and Y; and C_1, C_2 are small constants that stabilize the division. More details about SSIM can be found in [19]. In recurrent MSE loss, each iteration is forced to reconstruct the dehazed image and calculate its MSE loss, and the losses of all iterations are then summed to obtain the recurrent loss.
We compare these loss functions by setting B = 3, T = 7 and training a separate network with each loss. The results are shown in Table 1. Recurrent MSE loss is clearly the most suitable, so we choose it as the loss function of our method.
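The single-image losses compared above can be sketched as follows. This is a simplified illustration: images are flattened lists of pixel values in [0, 1], and SSIM is computed from global statistics over the whole image rather than the local windows used in [19]. The constants C1 = 0.01² and C2 = 0.03² are the commonly used defaults for this value range and are an assumption here, as are the function names.

```python
def mean(values):
    return sum(values) / len(values)


def mse_loss(x, y):
    """Mean of squared pixel differences."""
    return mean([(a - b) ** 2 for a, b in zip(x, y)])


def l1_loss(x, y):
    """Mean of absolute pixel differences."""
    return mean([abs(a - b) for a, b in zip(x, y)])


def ssim_global(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """SSIM from whole-image statistics (simplification: no sliding window)."""
    mu_x, mu_y = mean(x), mean(y)
    var_x = mean([(a - mu_x) ** 2 for a in x])
    var_y = mean([(b - mu_y) ** 2 for b in y])
    cov_xy = mean([(a - mu_x) * (b - mu_y) for a, b in zip(x, y)])
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def negative_ssim_loss(x, y):
    """Negated SSIM, so that a lower loss means higher structural similarity."""
    return -ssim_global(x, y)


reconstruction = [0.0, 1.0]
ground_truth = [0.0, 0.5]
```

For this pair, `mse_loss` is 0.125 and `l1_loss` is 0.25; for two identical images, `negative_ssim_loss` reaches its minimum of -1.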

4) Study of activation function
Similarly, we investigate the influence of the activation function. The activation functions widely used in image reconstruction include ReLU [21], leaky ReLU [22], and Gaussian error linear units (GELU) [23]. The formulas for these activation functions are as follows:
$$ReLU(x) = \max(0, x)$$
$$LeakyReLU(x) = \begin{cases} x, & x \geq 0 \\ x / a, & x < 0 \end{cases}$$
$$GELU(x) = 0.5x\left(1 + \tanh\left(\sqrt{2/\pi}\,\left(x + 0.044715x^3\right)\right)\right)$$
where a in leaky ReLU is a constant that satisfies $a \in [1, +\infty)$, and $\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$. A comparison of the activation functions is shown in Table 2. The network with ReLU clearly achieves the highest performance. Therefore, ReLU is employed in our network.
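The three candidates can be written as scalar functions for a quick comparison. This is a sketch under assumptions: leaky ReLU is taken in the x/a form implied by the constraint a ≥ 1, the default a = 100 (a negative slope of 0.01) is illustrative and not a value from the paper, and GELU uses the common tanh approximation.

```python
import math


def relu(x):
    """ReLU: passes positive inputs, zeroes out negative ones."""
    return max(0.0, x)


def leaky_relu(x, a=100.0):
    """Leaky ReLU in the x/a form; a >= 1 (the default here is illustrative)."""
    return x if x >= 0 else x / a


def gelu_tanh(x):
    """GELU via the tanh approximation."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


samples = [-2.0, 0.0, 1.0]
table = [(x, relu(x), leaky_relu(x, a=4.0), gelu_tanh(x)) for x in samples]
```

For x = -2 the three functions differ most visibly: ReLU returns 0, leaky ReLU with a = 4 returns -0.5, and GELU returns a small negative value.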

B. COMPARISON WITH THE STATE-OF-THE-ART METHODS
In this subsection, we evaluate our PFBN and other dehazing methods on three benchmark datasets, i.e., SOTS [12], I-HAZE [24], and O-HAZE [25]. Note that the networks are trained on 27,256 selected training pairs from the NYU2 dataset [26]. SOTS consists of 500 pairs of test images, I-HAZE consists of 35 pairs of hazy and haze-free images, and O-HAZE contains 45 different outdoor scenes.
Our PFBN is compared with several state-of-the-art dehazing methods, including NLD [27], AODNet [2], NLDN [28] and a Geometric-Pixel Guided CNN proposed by [20]. In addition, the proposed self-ensemble method is applied to enhance the reconstruction ability. The only difference between PFBN and PFBN+ is that PFBN+ is evaluated with the self-ensemble strategy. Table 3 presents the comparison with state-of-the-art methods on the three benchmark datasets. The results clearly indicate that our PFBN has a stronger capability of reconstructing a clean image than the other methods, and that PFBN+ indeed benefits from such a simple strategy. According to [29], the self-ensemble strategy decreases the variance in each pixel of the output while keeping its bias unchanged. The visual results are shown in Fig. 6. It can be clearly observed from Fig. 6(a) that haze covers the image. The comparison methods reduce this haze noticeably but do not remove it effectively. These visual results illustrate that the proposed PFBN can reconstruct high-quality clean images.
The most critical advantage of our model is its performance. The number of optimized parameters of PFBN and of the other state-of-the-art methods is shown in Table 4. Although the number of parameters, GPU memory usage and running time of PFBN are slightly larger than those of the other state-of-the-art methods, PFBN has a more powerful ability to reconstruct a haze-free scene. Note that the self-ensemble strategy is used only at test time; it does not require any extra parameters, but it increases GPU memory usage during testing, and its running time is about 8 times that of PFBN.

V. CONCLUSIONS
In this paper, a progressive feedback network for single-image dehazing was proposed, which is easy to train and achieves high dehazing performance. The employed feedback connections fuse high-level and low-level information across iterations, which enhances the representation ability of the network. In the training process, the sub-losses are tied together to train the network according to the rules of the feedback mechanism. Additionally, the self-ensemble method further enhances the performance of the network. The extensive experimental results show that the proposed methods reconstruct clearer dehazed images than state-of-the-art methods on several common benchmark datasets.