Feature Attention Parallel Aggregation Network for Single Image Haze Removal

Images captured in hazy weather often suffer from color distortion and texture blur due to turbid media suspended in the atmosphere. In this paper, we propose a Feature Attention Parallel Aggregation Network (FAPANet) to restore a clear image directly from the corresponding hazy input. It adopts an encoder-decoder structure and incorporates residual learning and attention mechanisms. FAPANet consists of two key modules: a novel feature attention aggregation module (FAAM) and an adaptive feature fusion module (AFFM). FAAM recalibrates features by integrating channel attention and pixel attention in parallel to emphasize useful information and suppress redundant features. The shallow and deep layers of neural networks tend to characterize the low-level and high-level semantic features of images, respectively, so we introduce AFFM to fuse these two kinds of features adaptively. Meanwhile, a joint loss function, composed of L1 loss, perceptual loss, and structural similarity (SSIM) loss, is employed in the training stage to obtain results with more vivid colors and richer details. Comprehensive experiments on both synthetic and real-world images demonstrate the impressive performance of the proposed approach.


I. INTRODUCTION
Photos taken in hazy weather are usually degraded in contrast and color fidelity, as shown in Fig. 1. There are two main reasons for this image quality degradation. First, the scene radiance is attenuated by the absorption and scattering of small particles floating in the air. Second, the irradiance ultimately captured by the camera is blended with ambient light reflected by atmospheric particles. The presence of haze may impair the performance of subsequent computer vision algorithms and industrial applications such as object detection and tracking in traffic monitoring systems. Therefore, it is important to recover the clear scene from hazy images.
According to the atmospheric scattering model [1,2], the generation of a hazy image can be mathematically formulated as follows:

I(x) = J(x) t(x) + A (1 − t(x)),

where I(x) and J(x) are the hazy image and the clear counterpart to be recovered, respectively. The variable A represents the global atmospheric light, which describes the intensity of the ambient light, x represents the pixel position, and t(x) is the medium transmission, i.e., the fraction of the scene radiance that is neither scattered nor absorbed. Furthermore, the transmission t(x) is a distance-dependent parameter that can be calculated as t(x) = e^(−βd(x)), where β is the atmospheric scattering coefficient and d(x) is the scene depth. Single image haze removal aims to recover the clear image J(x) from a given hazy image I(x). Obviously, this is an ill-posed problem due to the unknown atmospheric light A and transmission t(x).
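For intuition, the scattering model can be applied directly to synthesize a hazy image from a clean one given a depth map. A minimal NumPy sketch (the function name and the example values of A and β are illustrative, not from the paper):

```python
import numpy as np

def synthesize_haze(J, d, A=0.9, beta=1.0):
    """Apply the atmospheric scattering model I(x) = J(x)t(x) + A(1 - t(x)),
    with transmission t(x) = exp(-beta * d(x)).

    J: clean RGB image, shape (H, W, 3), values in [0, 1]
    d: scene depth map, shape (H, W)
    """
    t = np.exp(-beta * d)      # medium transmission, shape (H, W)
    t = t[..., None]           # broadcast over the RGB channels
    return J * t + A * (1.0 - t)
```

As d(x) grows, t(x) → 0 and the observed pixel converges to the atmospheric light A, which is exactly the washed-out appearance of distant scenery in haze.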
Some researchers tried to supplement auxiliary information to reconstruct the original clean image in early attempts at hazy image restoration. For example, photos of the same scene taken with different degrees of polarization [3,4] or captured under diverse weather conditions [5] were used to remove the haze. In addition, depth information was also utilized for dehazing [6,7]. These multiple-image-based methods can effectively solve the degradation problem of hazy images; however, the extra information is not always available in practical applications. Therefore, haze removal using a single image deserves more attention. Image enhancement methods [8,9,10,11,12,13] usually do not consider the atmospheric scattering model but instead improve the visual quality of degraded images by modifying contrast, color, and brightness. Because they ignore the fundamental causes of image degradation, these enhancement methods cannot adequately restore clear pictures. Rather than supplementing additional information or ignoring the physical model, prior-based restoration methods [1,14,15,16,17] constrain the physical model through priors derived from rational hypotheses, observations, and statistics, thereby restoring clear, haze-free images. Unfortunately, these methods may not work in particular scenarios where the priors do not hold.
With the great success of deep learning in various visual tasks, many dehazing methods based on convolutional neural networks (CNNs) [18,19,20,21,22,23] have emerged in recent years. These methods exploit massive amounts of training data and the powerful learning capability of CNNs to capture the intricate patterns of hazy images. Many previous studies estimated the unknown transmission map and the global atmospheric light with a CNN, ultimately obtaining the haze-free image by plugging these estimates into the physical model. However, the performance of this kind of two-stage approach [18,19] degrades rapidly when the intermediate variables cannot be estimated accurately. For this reason, many end-to-end CNN-based methods [20,21,22,23] have been proposed, which directly learn the sophisticated mapping between hazy images and their clear counterparts.
In this paper, we propose an effective neural network for single image haze removal, inspired by three main ideas. First, some recent dehazing methods [21,23,24] based on the encoder-decoder structure have achieved good results, and UNet [25], a classic encoder-decoder network, also performs excellently in other dense prediction tasks such as image segmentation. However, in traditional UNet-like neural networks, multiple downsampling and upsampling operations lose some low-level information of the input images and cause feature misalignment, which degrades the performance of the whole network. Therefore, the network proposed in this paper adopts a simplified UNet structure: only two strided convolution layers are used in the encoder for downsampling, and two strided deconvolution layers in the decoder restore the feature resolution to match the input. Second, the attention mechanism is widely used in various visual tasks as a powerful feature recalibration tool that effectively captures non-local information in images. This paper proposes a novel and practical feature attention aggregation module that connects channel attention and spatial attention in parallel; a skip connection between the module's input and output aids gradient flow. Finally, as the number of intermediate layers in a deep neural network increases, the receptive field of the feature map gradually expands, but the hierarchical information between different spatial scales, which has proven helpful for image reconstruction in previous research [21,24,26,27], is not fully utilized. Therefore, we introduce an adaptive feature fusion module, which learns weights for the feature maps of the shallow and deep layers and combines them by weight, achieving a delicate aggregation of features with different receptive fields.
The contributions of this research can be summarized as follows:
• We propose an end-to-end convolutional neural network, named FAPANet, to restore clean images from hazy inputs. We also introduce a joint loss function consisting of L1 loss, perceptual loss, and SSIM loss to train the proposed FAPANet, which dramatically enhances the details of the restored images.
• To enhance the network's features, we propose the feature attention aggregation module (FAAM), an effective hybrid attention scheme that recalibrates feature maps with parallel channel and spatial attention. In this module, the channel attention uses the maximum, minimum, and mean values of each feature channel simultaneously, while the spatial attention part adopts a simple, parameter-free strategy.
• A simple and effective feature fusion strategy is presented to assemble features at different spatial scales. It makes full use of the shallow and deep features, which substantially benefits image reconstruction.
• Extensive experiments on synthetic and natural images have been conducted. Quantitative and qualitative comparisons with various state-of-the-art methods demonstrate the effectiveness of the proposed method. In addition, an adequate ablation study reveals the importance of each sub-module in the network.

II. RELATED WORK
For many years, the problem of image dehazing has attracted much attention, and many excellent methods have been proposed. In this section, we provide a brief review of existing techniques. As mentioned previously, we classify haze removal methods into four categories: multi-image-based methods, image enhancement methods, prior-based methods, and deep-learning-based methods; the latter three dehaze using only a single degraded image. We introduce representative studies of each category below.

A. MULTI-IMAGE-BASED METHODS
Early methods added constraints to the ill-posed equation via supplementary information for the existing atmospheric scattering model, thereby restoring haze-free images. Kopf et al. [6] and Narasimhan et al. [7] removed haze with the assistance of the 3D model of the scene or depth information provided interactively by users. Nayar and Narasimhan [5] restored the visibility of hazy images using multiple images of the same scene captured under diverse weather conditions. In [28], Schaul et al. presented a near-infrared-based method for image dehazing. In addition, some polarization-based techniques have been proposed in [3,4]. However, these methods are often difficult to deploy in real-world applications since the geometrical information and multiple images are not always available.

B. IMAGE ENHANCEMENT METHODS
Faced with the difficulty of reliably evaluating the transmission map and ambient light through a single hazy input, some researchers have attempted to bypass the physical model and use image enhancement methods to improve the visual quality of images. With some minor adjustments to the original histogram equalization algorithm, Stark et al. [8] and Joung-Youn Kim et al. [9] proposed adaptive histogram equalization and partially overlapped sub-block histogram-equalization to improve image contrast. Furthermore, some fusion-based methods [10,11] and retinex-based methods [12,13] have also been presented to enhance the visibility of hazy images.

C. PRIOR-BASED METHODS
Prior-based methods estimate the albedo of the scene and the atmospheric light using various hypotheses, statistical rules, and observations. Tan et al. [14] found that photos taken in hazy weather always have low color contrast; they therefore maximized the contrast of image patches to enhance image visibility. He et al. [1] estimated the transmission map using a dark channel prior, which was derived from the intriguing observation that the patches of a haze-free image often have one color channel whose intensity is close to 0. Fattal [15] observed that the local pixels of image patches typically exhibit a one-dimensional distribution and introduced a color-line method based on this discovery. Berman et al. [16] proposed an algorithm based on a non-local prior, which presupposes that the colors of a haze-free image are well approximated by a few hundred distinct hues. Zhu et al. [17] offered a dehazing method based on a color attenuation prior, which assumes that the difference between brightness and saturation is positively correlated with haze concentration. In general, these prior-based methods rely heavily on the validity of handcrafted features. For this reason, they may fail in real-world scenes where the assumptions are invalid.
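As an illustration of how such a prior is computed, the dark channel of He et al. [1] takes, for each pixel, the minimum intensity over a local patch and over the three color channels. A minimal, unoptimized NumPy sketch:

```python
import numpy as np

def dark_channel(img, patch=15):
    """Dark channel of an RGB image: the per-pixel minimum over both the
    three color channels and a local patch (He et al.'s observation is that
    this is close to 0 for haze-free outdoor images)."""
    m = img.min(axis=2)                 # min over color channels, (H, W)
    r = patch // 2
    padded = np.pad(m, r, mode='edge')  # replicate borders
    H, W = m.shape
    out = np.empty_like(m)
    for i in range(H):
        for j in range(W):
            out[i, j] = padded[i:i + patch, j:j + patch].min()
    return out
```

A hazy image lifts the dark channel toward the atmospheric light, which is what allows the transmission map to be estimated from it.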

D. DEEP-LEARNING-BASED METHODS
Over the past several years, many learning-based approaches have been introduced to cope with the degradation problem of hazy images. Cai et al. [18] constructed DehazeNet to predict transmission maps via a novel nonlinear activation function known as the Bilateral Rectified Linear Unit (BReLU). Similarly, Ren et al. [19] employed a multiscale convolutional neural network (MSCNN) to estimate and refine transmissions. Li et al. [20] proposed the AOD-Net by reformulating the physical model and incorporating transmissions and atmospheric light into a new K-module. The Densely Connected Pyramid Dehazing Network (DCPDN) [21] has been proposed to jointly learn transmission maps and atmospheric light. All of these CNN-based methods have achieved promising improvements for single image dehazing, but compared with traditional methods, they scarcely consider haze-related priors to constrain the learning space. To that end, Yang et al. [22] presented a CNN-based dehazing approach known as Proximal Dehaze-Net that learns both dark channel and transmission priors via their introduced energy functions. Different from these methods, Ren et al. [23] first combined fusion strategy with deep neural networks to dehaze, exploiting an encoder-decoder network to nonlinearly fuse three inputs derived from the same hazy image by applying white balance (WB), contrast enhancing (CE), and gamma correction (GC). In addition, given the great success of generative adversarial networks (GANs) in various computer vision tasks, many GAN-based dehazing methods [29,30] have also been proposed. Recently, Kuanar et al. [31] proposed a deep learning-based model, DeGlow-DeHaze, to solve the problem of dehazing for images taken at night. The method proposed in this study uses a single image to remove haze without auxiliary geometrical information. 
Furthermore, it is an end-to-end haze removal network that can potentially learn the complex mapping between haze and haze-free images without strictly referring to the ideal atmospheric scattering model, thereby avoiding the error magnification caused by separately estimating transmission maps and atmospheric ambient light.

III. PROPOSED METHOD
In this section, we introduce FAPANet in detail. First, we describe the architecture of our network and the motivation behind it. Second, we present two novel sub-modules: the Feature Attention Aggregation Module (FAAM), which integrates features enhanced by channel and spatial attention in parallel to progressively recalibrate them, and the Adaptive Feature Fusion Module (AFFM), which fuses shallow and deep features with learnable weights. Finally, the joint loss function is introduced.

A. NETWORK ARCHITECTURE
As shown in Fig. 2, FAPANet is a classic encoder-decoder network, which can be roughly divided into three stages. First, four consecutive conventional and strided convolutional layers serve as an encoder: the strided convolutional layers reduce the feature resolution, thereby saving a considerable amount of training memory, while the convolutional layers gradually expand the receptive field to strengthen the representation of high-level semantic features. After that, the features produced by the encoder are progressively enhanced by 10 cascaded FAAMs, and then two AFFMs fuse the multi-scale spatial features. A detailed description of these two essential modules is given in Section III-B and Section III-C. Finally, the decoder, symmetrical to the encoder, restores the feature maps to the input resolution via two strided deconvolution operations and decodes them into haze-free results.
In the process of feature encoding and decoding, the original UNet employs multiple upsampling and downsampling operations, which inevitably leads to the loss of some image information, especially the low-level features that are crucial for image recovery. In contrast, mapping features at the original input resolution, without any downsampling, requires too much memory in the training stage and causes excessive information redundancy. Therefore, FAPANet is designed as a simplified UNet structure, which performs only two downsampling and two upsampling operations in the encoder and decoder, respectively. At the same time, it employs feature attention aggregation modules and feature fusion modules to compensate, as much as possible, for the information loss caused by downsampling and upsampling, thereby obtaining representative hazy-image features.
The general convolution layer with a fixed grid kernel could lead to image texture degradation and artifacts [32]. In contrast, the deformable convolution operation has a dynamic and flexible kernel, which allows it to capture more information [24,33]. Therefore, we introduce two deformable convolution blocks in FAPANet to strengthen the feature representative capability and get more realistic dehazing results. Following the previous work [24], the deformable convolution is cascaded after the FAAMs to achieve better performance.
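The simplified backbone described in this subsection can be sketched in PyTorch as follows. Channel widths, layer counts, and the plain middle block (standing in for the cascaded FAAMs and deformable convolutions) are illustrative assumptions, not the exact configuration in Table 1:

```python
import torch
import torch.nn as nn

class SimpleEncoderDecoder(nn.Module):
    """Sketch of a simplified UNet-like backbone: two stride-2 convolutions
    downsample to 1/4 of the input resolution, and two stride-2 transposed
    convolutions restore it."""
    def __init__(self, ch=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, ch, 7, stride=1, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),  # 1/2
            nn.Conv2d(ch, ch, 3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),  # 1/4
        )
        self.middle = nn.Sequential(  # stand-in for the FAAM/deformable stack
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(ch, ch, 4, stride=2, padding=1), nn.ReLU(inplace=True),  # 1/2
            nn.ConvTranspose2d(ch, ch, 4, stride=2, padding=1), nn.ReLU(inplace=True),  # 1/1
            nn.Conv2d(ch, 3, 7, padding=3), nn.Tanh(),
        )
    def forward(self, x):
        return self.decoder(self.middle(self.encoder(x)))
```

The kernel-4/stride-2/padding-1 transposed convolutions exactly double the spatial size, so the output resolution matches the input, as required.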

B. FEATURE ATTENTION AGGREGATION MODULE
Although only two downsampling operations are used in the encoder of the proposed FAPANet, compressing the feature map to a quarter of the input resolution, some low-level information is still lost. To compensate for this loss, we introduce the Feature Attention Aggregation Module (FAAM) and cascade 10 FAAMs to enhance the features progressively. The attention mechanism is a common feature enhancement method in computer vision, which stimulates valuable features and suppresses useless ones from the perspective of the channel or the spatial pixel, called channel attention and spatial attention, respectively. The FAAM integrates channel and spatial attention in parallel, achieving channel-wise and pixel-wise feature recalibration simultaneously. It is known that the probability of vanishing and exploding gradients in the training phase increases as a network deepens. Fortunately, by adding a shortcut connection between the input and output of a network layer, ResNet [35] explicitly lets the layers learn the residual of the input feature maps, which makes deep networks easier to train. Similarly, we also deploy a shortcut connection in the FAAM. As shown in Fig. 3(b), FAAM consists of three parts. First, the input feature flows through a residual block composed of two convolutional layers. After that, the feature is recalibrated by two parallel attention blocks: the 3-D spatial attention block and the enhanced channel attention block. Finally, the features recalibrated by the parallel attention blocks and the input feature are added together as the final output of the FAAM.
It should be noted that the FAAM adopts a hybrid attention mechanism with two main innovations compared to existing hybrid attention methods [34,36]. First, existing channel attention methods [37,38] tend to use global average pooling when modeling the global importance of channels, while we believe that the mean alone cannot model channel information well. Therefore, we combine the mean, maximum, and minimum of each channel to model the global channel information. Second, existing spatial attention methods [34,36] extract the pixel-wise weights of features for recalibration by convolution operations, ignoring the channel information, while we adopt a novel and simple three-dimensional spatial attention proposed in recent research [39], which estimates the importance of each element of the feature without additional parameters. As shown in Fig. 3(a), CBAM, a typical example of the hybrid attention mechanism proposed in [34], employs a cascaded channel attention block and spatial attention block, which differs from our proposed FAAM in both the overall structure and the techniques used in the internal sub-modules. In the ablation study in Section IV-E1, we compare the effects of these two structures, and the results show that our proposed scheme performs better in the haze removal task.
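To make the structure concrete, the sketch below shows one possible PyTorch reading of the FAAM: a two-convolution residual block, then channel attention built from mean/max/min statistics and a parameter-free 3-D spatial attention applied in parallel, plus a shortcut from the module input. A SimAM-style energy function stands in for the parameter-free attention of [39]; all sizes and the exact wiring are our assumptions, not the paper's verified configuration:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention using mean, max, and min statistics per channel
    (instead of global average pooling alone), as described for the FAAM."""
    def __init__(self, ch, reduction=8):
        super().__init__()
        hidden = max(ch // reduction, 1)
        self.mlp = nn.Sequential(
            nn.Conv2d(ch, hidden, 1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, ch, 1),
        )
    def forward(self, x):
        stats = [x.mean(dim=(2, 3), keepdim=True),
                 x.amax(dim=(2, 3), keepdim=True),
                 x.amin(dim=(2, 3), keepdim=True)]
        w = torch.sigmoid(sum(self.mlp(s) for s in stats))
        return x * w

def spatial_attention_3d(x, lam=1e-4):
    """Parameter-free 3-D attention in the spirit of SimAM: every element is
    weighted by an energy-based importance score, with no learned weights."""
    n = x.shape[2] * x.shape[3] - 1
    d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)
    v = d.sum(dim=(2, 3), keepdim=True) / n
    e = d / (4 * (v + lam)) + 0.5
    return x * torch.sigmoid(e)

class FAAM(nn.Module):
    """Residual block -> parallel channel / 3-D spatial attention -> shortcut."""
    def __init__(self, ch=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1),
        )
        self.ca = ChannelAttention(ch)
    def forward(self, x):
        f = self.body(x) + x                       # residual block
        return self.ca(f) + spatial_attention_3d(f) + x  # parallel + shortcut
```

Because the two attention branches are applied to the same feature and summed, each can specialize (channel-wise vs. element-wise recalibration) without one gating the other, unlike CBAM's cascaded design.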

C. ADAPTIVE FEATURE FUSION MODULE
Previous studies [21,24,26,27] have shown that fusing features at different spatial scales is conducive to improving network performance. In this paper, we propose the Adaptive Feature Fusion Module (AFFM) to achieve a fine-grained fusion of deep and shallow features. As shown in Fig. 4, the AFFM consists of three parallel processing pipelines. First of all, two feature maps with the same resolution from the shallow and deep layers in FAPANet are concatenated along the channel dimension. Then, the merged feature map is fed into a convolutional layer to reduce its channels and learn the potential interaction between the deep and shallow features. Finally, the outputs of the sigmoid activation layer are used as coefficients to linearly combine the deep and shallow features pixel by pixel. The module can be formulated as:

w = Sigmoid(F_conv(concat(x_deep, x_shallow))),
y = w ⊗ x_deep + (1 − w) ⊗ x_shallow,

where x_deep and x_shallow represent the input deep and shallow features, F_conv represents an abstract function that maps the implicit relationship between x_deep and x_shallow, and ⊗ denotes element-wise multiplication. The concatenation operation and sigmoid activation function are denoted by concat and Sigmoid, respectively. Therefore, y is the linear combination of x_deep and x_shallow with the learnable coefficient w.
Unlike existing feature fusion methods [21,26,27] that directly merge feature maps of different scales by addition or concatenation operations, our approach achieves fine-grained feature fusion by explicitly modeling the weights of each position of the deep and shallow features, thereby obtaining effective feature information.
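A minimal PyTorch sketch of this weighted fusion, concatenating the two inputs, predicting a per-pixel weight map, and blending; the kernel size of the fusion convolution is our assumption:

```python
import torch
import torch.nn as nn

class AFFM(nn.Module):
    """Adaptive fusion of same-resolution deep and shallow features:
    w = sigmoid(conv(concat(deep, shallow))), y = w*deep + (1-w)*shallow."""
    def __init__(self, ch=64):
        super().__init__()
        self.conv = nn.Conv2d(2 * ch, ch, 3, padding=1)
    def forward(self, deep, shallow):
        w = torch.sigmoid(self.conv(torch.cat([deep, shallow], dim=1)))
        return w * deep + (1 - w) * shallow
```

Because w is predicted per element, the network can favor shallow features where fine texture matters and deep features where semantic context matters, rather than committing to one global mixing ratio.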

D. LOSS FUNCTION
To restore realistic haze-free images, we propose a joint loss function that linearly combines L1 loss, SSIM loss, and perceptual loss by weight. In the following, we present a detailed introduction to these three kinds of loss functions. For uniform expression, we specify that I and J represent the hazy images and the corresponding haze-free images, respectively.
L1 Loss. L1 loss measures the pixel-wise mean absolute error between the recovered image and its corresponding ground truth and is widely used in image reconstruction. Following previous studies [22,24,27,40], we also adopt L1 loss to optimize our model. Denoting it by L_1, it can be formulated as follows:

L_1 = (1 / (N·H·W)) Σ_{i=1}^{N} ‖F_FAPANet(I_i) − J_i‖_1,

where N, H, and W represent the batch size, height, and width of the image, respectively. The variable i indicates the index of the sample in a batch of training data, and F_FAPANet expresses the latent mapping function between the hazy images and the clear images learned by the proposed FAPANet. Accordingly, F_FAPANet(I_i) denotes the restored image corresponding to the hazy input I_i.

SSIM Loss. Since the L1 loss function has equal gradient values in most cases, even small loss values have large gradients, which is not conducive to the convergence of the model. Also, a model trained with L1 loss alone tends to produce texture collapse and halo artifacts. To avoid these problems, we combine L1 loss with the SSIM loss function following the strategy in [40]. By assessing images from the perspective of perceived differences, the SSIM model helps restore visually clear and realistic images. The SSIM value at a given pixel p between the restored image and the clear image can be calculated as follows:

SSIM(p) = [(2 μ_x μ_y + C_1) / (μ_x² + μ_y² + C_1)] · [(2 σ_xy + C_2) / (σ_x² + σ_y² + C_2)],

where x and y represent the image patches centered at p in the restored image and the clear image J, respectively. The variables μ_x, σ_x², and σ_xy represent the mean of x, the variance of x, and the covariance of x and y, respectively; similarly, μ_y and σ_y² are the mean and variance of image patch y, and C_1 and C_2 are small constants that stabilize the division. According to the above equation, the SSIM loss between the recovered image and the corresponding ground truth can be defined as

L_SSIM = 1 − (1 / (N·M)) Σ_{i=1}^{N} Σ_{p=1}^{M} SSIM(p),

where N is the batch size and M is the total number of pixels in a single training sample.

Perceptual Loss.
It is well known that the shallow layers of a CNN often capture low-level image features, such as edges and contours, which are critical for image dehazing. Therefore, we introduce the perceptual loss function [21,29], which can be expressed as

L_VGG = Σ_j (1/N) Σ_{i=1}^{N} ‖V_j(F_FAPANet(I_i)) − V_j(J_i)‖²_2,

where we inherit the notations F_FAPANet, N, i, I, and J from L_1, and V_j denotes the feature map produced by the j-th selected layer of the pre-trained VGG network [41]; the "relu1-1" and "relu2-1" layers of VGG-16 were used in this study. Essentially, L_VGG estimates the mean squared error between the restored image and the corresponding clear image in feature space, and therefore serves as an effective complement to pixel-wise losses for image reconstruction tasks.

Total Loss Function. The proposed FAPANet is trained by minimizing a total loss function that combines the L1 loss, SSIM loss, and perceptual loss described above by weight:

L = α L_1 + β L_SSIM + γ L_VGG,

where α, β, and γ are the weights of the corresponding losses. This combined loss function helps to restore clear images with more natural colors and richer details.

(Table 1 caption: "Conv2d" and "ConvT2d" denote the convolution layer and the deconvolution layer; "CA" means Channel Attention, and "SA" means Spatial Attention. In the last column, "ic", "oc", "ks", "st", and "p" indicate the input channels, output channels, kernel size, stride, and padding value, respectively.)
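The joint objective can be sketched as follows. The SSIM term here uses a simplified uniform window rather than the Gaussian window of the original SSIM, and the VGG feature extractor is passed in as an optional callable (`feat_fn`) so the sketch stays self-contained; both are our simplifications:

```python
import torch
import torch.nn.functional as F

def ssim(x, y, C1=0.01 ** 2, C2=0.03 ** 2, win=11):
    """Mean SSIM over the image, using uniform windows via avg_pool2d."""
    mu_x = F.avg_pool2d(x, win, 1, win // 2)
    mu_y = F.avg_pool2d(y, win, 1, win // 2)
    sx = F.avg_pool2d(x * x, win, 1, win // 2) - mu_x ** 2
    sy = F.avg_pool2d(y * y, win, 1, win // 2) - mu_y ** 2
    sxy = F.avg_pool2d(x * y, win, 1, win // 2) - mu_x * mu_y
    num = (2 * mu_x * mu_y + C1) * (2 * sxy + C2)
    den = (mu_x ** 2 + mu_y ** 2 + C1) * (sx + sy + C2)
    return (num / den).mean()

def joint_loss(pred, gt, feat_fn=None, alpha=5.0, beta=1.0, gamma=1.0):
    """L = alpha*L1 + beta*(1 - SSIM) + gamma*L_VGG, with feat_fn standing in
    for the pre-trained VGG feature extractor (omitted when None)."""
    loss = alpha * F.l1_loss(pred, gt) + beta * (1.0 - ssim(pred, gt))
    if feat_fn is not None:
        loss = loss + gamma * F.mse_loss(feat_fn(pred), feat_fn(gt))
    return loss
```

In practice `feat_fn` would wrap the relu1-1/relu2-1 activations of a frozen VGG-16, e.g. via `torchvision.models.vgg16(pretrained=True).features`.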

IV. EXPERIMENTS
A. IMPLEMENTATION DETAILS
We implement the proposed FAPANet using the PyTorch framework. All convolutional layers in FAPANet are followed by the ReLU activation function, except for the last layer, which adopts the Tanh activation function. Similarly, the filter size is set to 3×3 in all convolutional layers, except for the first and last convolutional layers, which use 7×7 kernels. The detailed network structure is described in Table 1. FAPANet was trained on the RESIDE dataset [45], and all training samples were randomly cropped to 256×256. During the training stage, we utilized the Adam optimizer with an initial learning rate of 0.001 and a batch size of 16. The network was trained for 120 epochs with a learning rate decay of 0.1 every 30 epochs. Following the practice of previous studies [21,29] and tuning in our experiments, we set α=5, β=1, and γ=1 in the total loss function. Training and testing were performed on the same personal computer, accelerated by an RTX 3090 GPU.
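The optimizer and schedule above correspond to a standard PyTorch setup; in the sketch below, `model` is a stand-in for FAPANet and the data loop is elided:

```python
import torch

# Stand-in for FAPANet; the schedule matches the stated settings:
# Adam, initial lr 0.001, batch size 16, 120 epochs, lr x0.1 every 30 epochs.
model = torch.nn.Conv2d(3, 3, 3, padding=1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(120):
    # ... iterate over random 256x256 crops in batches of 16, compute the
    # joint loss, call loss.backward() and optimizer.step() here ...
    scheduler.step()  # decay the learning rate by 0.1 every 30 epochs
```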

B. HAZE REMOVAL ON SYNTHETIC IMAGES
To fairly evaluate the dehazing performance of the proposed FAPANet, we compared it with nine state-of-the-art methods on synthetic datasets of both indoor and outdoor scenes. We adopt the Synthetic Objective Testing Set (SOTS) subset of the RESIDE dataset as the test set, which contains 500 synthetic indoor hazy images and 500 synthetic outdoor hazy images, called SOTS-indoor and SOTS-outdoor, respectively. None of the test images overlap with samples from either the indoor or outdoor training set of the RESIDE dataset. Due to inaccuracies in atmospheric light and transmission map estimation, some color distortions appear in the results of the prior-based DCP and CEP algorithms, as shown in Fig. 5(a). The HazeLine and IDE methods, on the other hand, can only remove local haze, leaving some fog behind. DehazeNet and AOD are unable to thoroughly remove the haze, resulting in unclean results. GDN, MSBDN, and GCANet produce better results in terms of overall quality, but we notice that GDN and MSBDN cause minor color distortion in the first image's floor area and the third image's ceiling, particularly around the mirror, while GCANet leaves some haze in the leftmost part of the third image. In contrast, our method achieves better dehazing performance for the indoor scene. Fig. 5(b) shows the results of the various methods on outdoor images. DCP and CEP cause obvious color distortion in the third test image, and slight halo artifacts appear in the sky of the first and second test images. In addition, significant local contrast reduction is observed in all results of the CEP and DCP algorithms. The IDE method brightens the results through exposure processing, but it does not remove the haze well, as shown in the first test image. Similar to the results in Fig. 5(a), HazeLine, DehazeNet, and AOD cannot completely eliminate haze either.
In general, GCANet, GDN, MSBDN, and FAPANet have better dehazing effects in outdoor scene images.
Furthermore, we quantitatively compare our proposed FAPANet with the state-of-the-art methods using two authoritative full-reference metrics: the peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM). PSNR is the most commonly used metric to evaluate the pixel-level difference between an image and its reference; generally, a higher PSNR indicates a higher-quality reconstruction. The SSIM index is widely used to predict the perceived quality of a given image by measuring its similarity to the distortion-free reference in terms of structure and texture; the higher the SSIM value, the more similar the restored image is to the reference clear image.
As shown in Table 2, our method achieves the best PSNR and SSIM values on the SOTS-Indoor benchmark. Its PSNR is 35.907 dB, exceeding the result of the second-best model (MSBDN) by 10.45%, while its SSIM reaches 0.993, meaning that our FAPANet can restore high-quality haze-free images. For the sake of fairness, all of the data in this table are obtained by repeating the experiments with all compared methods. The results of DehazeNet, AOD, and HazeLine are almost the same as those released in [45], and our GDN and MSBDN results are also consistent with those reported in the corresponding papers [26,27]. It should be noted that our PSNR and SSIM values are calculated by the scikit-image library [46] with default parameters. For the SOTS-Outdoor dataset, the PSNR and SSIM results of FAPANet are only slightly lower than those of MSBDN and GDN, but it also produces satisfying dehazed images.
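For reference, PSNR reduces to a one-line formula; the NumPy sketch below mirrors what `skimage.metrics.peak_signal_noise_ratio` computes with default parameters (SSIM additionally involves windowed statistics and is best left to the library's `structural_similarity`):

```python
import numpy as np

def psnr(ref, img, data_range=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(R^2 / MSE),
    where R is the data range (1.0 for float images in [0, 1])."""
    mse = np.mean((ref.astype(np.float64) - img.astype(np.float64)) ** 2)
    return 10.0 * np.log10(data_range ** 2 / mse)
```

Note that PSNR is undefined (infinite) for identical images, and the `data_range` must match the image representation (255 for uint8), otherwise scores are not comparable across papers.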

C. HAZE REMOVAL ON NTIRE DATASETS
In recent years, the New Trends in Image Restoration and Enhancement (NTIRE) workshop and challenges on image and video processing have become the most popular academic activity in the field of low-level image and video processing, and haze removal is one of its most important topics. Four benchmark datasets have been released through the NTIRE dehazing challenges: I-HAZE, O-HAZE [48], DenseHaze [49], and NH-HAZE [50]. I-HAZE contains 35 pairs of hazy and corresponding haze-free indoor images, O-HAZE and NH-HAZE are both composed of outdoor paired hazy samples, and DenseHaze contains 50 pairs of indoor and outdoor images with extremely thick haze. Different from the RESIDE dataset, which consists of synthetic hazy images, the NTIRE datasets were constructed from images with real haze generated by a professional haze machine. Therefore, performance on the NTIRE datasets is possibly more indicative of dehazing ability in the real world, so in addition to RESIDE, we also conducted quantitative experiments on the four NTIRE datasets. We still use PSNR and SSIM to evaluate the dehazing methods, and the results are shown in Table 3. The proposed FAPANet obtains the best PSNR results on three datasets, O-HAZE, DenseHaze, and NH-HAZE, and the second-best PSNR on the I-HAZE dataset after the MSBDN algorithm. In terms of SSIM, FAPANet performs best on the NH-HAZE dataset, third-best on the I-HAZE dataset after MSBDN and GCANet, and second-best on both the O-HAZE and DenseHaze datasets after the GCANet algorithm. It can be seen that GCANet achieves good SSIM scores, while our FAPANet has better PSNR values along with competitive SSIM scores. These quantitative results show the powerful dehazing ability of FAPANet on real-world hazy images, and we believe that this strong ability benefits from the FAAM, AFFM, and joint loss function proposed in this paper.

D. HAZE REMOVAL ON REAL-WORLD IMAGES
On the four NTIRE datasets, we conducted comparison experiments with existing state-of-the-art algorithms. However, the haze in the NTIRE samples was generated by a haze machine, so verification on natural hazy scenes is still required. We therefore used FAPANet to dehaze nine challenging real-world images collected from the internet, and compared the results with those of the nine state-of-the-art methods. Although the prior-based DCP, CEP, and HazeLine methods remove haze well, their results have low color saturation and significant haloes, as shown in the first, third, and eighth tests in Fig. 6. The neural-network-based dehazing algorithms perform well on thin-haze images but poorly on dense-haze scenes, as demonstrated in the second, third, and eighth tests. IDE produces visually pleasing results thanks to its combination of exposure processing and dehazing, but its inability to handle thick haze is revealed in the third and eighth tests. Our method, in contrast, removes haze and restores clear, natural images with vivid colors and high visual quality, and produces competitive results even when the input images contain dense haze.
For real-world degraded images it is difficult to obtain the corresponding haze-free ground truth, so full-reference assessments cannot be used to measure the performance of a dehazing algorithm. We therefore additionally use four commonly used no-reference evaluation metrics: e, r, and σ, proposed in [51], and the Fog Aware Density Evaluator (FADE) [52]. The e metric indicates the capability of the dehazing algorithm to restore new visible edges in the dehazed image; the r metric estimates the visibility enhancement level of the restored image; σ calculates the ratio of saturated pixels; and FADE predicts the fog density of a hazy image from statistical features. Smaller values of σ and FADE indicate better visual quality, while for e and r larger values are better. For fairness, Table 4 lists, for every comparison algorithm, the no-reference scores averaged over all nine test samples shown in Fig. 6. The proposed FAPANet achieves good scores on the e, r, and FADE metrics while maintaining a respectable σ result, which shows that our algorithm reconstructs clear images with good visual quality.
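The exact definitions of these metrics are given in [51] and [52]; as an illustration, the σ metric (ratio of pixels that end up fully saturated, i.e. pure black or white, after restoration) can be sketched as below. The clipping thresholds assume images normalized to [0, 1] and are an illustrative assumption, not the reference implementation:

```python
import numpy as np

def sigma_metric(restored):
    """Ratio of saturated pixels (clipped to pure black or white) in the
    restored image, assuming values in [0, 1]; smaller is better."""
    img = np.asarray(restored, dtype=np.float64)
    saturated = np.logical_or(img <= 0.0, img >= 1.0)
    return saturated.mean()

img = np.array([[0.0, 0.5], [1.0, 0.25]])
print(sigma_metric(img))  # 2 of 4 pixels saturated -> 0.5
```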

E. ABLATION STUDY
We conducted extensive ablation experiments on three factors: the feature enhancement strategy, the feature fusion technique, and the loss function used in the training stage, denoted in Table 5 by "FE," "FF," and "Loss," respectively. Following previous research [24,25], we used a simplified UNet with two downsampling operations as the baseline, combined it with various feature enhancement and feature fusion methods to create a series of variant models, and trained them with different loss functions. Finally, all models were quantitatively evaluated on the SOTS-Indoor dataset using PSNR and SSIM. Table 5 displays all of the comparison results.

1) EFFECTIVENESS OF THE FAAM
Model (a) is the baseline, in which the encoder features are passed directly to the decoder without any feature fusion structure. "FFA*6" denotes six FFA modules in series (used in [24,53]), which serve as the feature enhancement in model (b). We build model (c) by replacing FFA with RB, the widely used Residual Block proposed in [35], and model (d) by using six successive CBAM [34] blocks as the feature enhancement module. Our FAAM is a parallel structure of spatial attention (SA) and channel attention (CA); we also try the cascaded modes, "SACA" and "CASA," which represent the "SA first, CA second" and "CA first, SA second" structures, respectively. In this experiment, element-wise addition of features with the same resolution in the decoder and encoder is used as the feature fusion strategy, and all models are trained with the L1 loss function. Our FAAM clearly outperforms the other feature enhancement methods, and its PSNR value is 45.73% higher than that of the baseline. We also find that the parallel combination of spatial and channel attention is superior to the cascaded mode, regardless of the order.
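To make the parallel-versus-cascaded distinction concrete, the sketch below contrasts the two wiring schemes. The gating functions here (a global-average channel gate and a per-pixel sigmoid gate) are simplified stand-ins for the learned attention layers in FAAM, chosen only to show the data flow:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat):
    # feat: (C, H, W); gate each channel by its global average response
    gate = sigmoid(feat.mean(axis=(1, 2), keepdims=True))  # (C, 1, 1)
    return feat * gate

def pixel_attention(feat):
    # gate each spatial position by its mean activation across channels
    gate = sigmoid(feat.mean(axis=0, keepdims=True))  # (1, H, W)
    return feat * gate

def parallel_attention(feat):
    # FAAM-style parallel aggregation: both branches see the same input
    return channel_attention(feat) + pixel_attention(feat)

def cascaded_attention(feat):
    # "CA first, SA second" cascade, for comparison with the ablation
    return pixel_attention(channel_attention(feat))

x = np.random.randn(8, 16, 16)
assert parallel_attention(x).shape == x.shape
assert cascaded_attention(x).shape == x.shape
```

In the parallel wiring both branches recalibrate the original features independently before aggregation, whereas in the cascade the second branch only sees features already reweighted by the first.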

2) THE QUANTITY OF THE FAAM
In addition to the effectiveness of the FAAM, we also explore the impact of the number of FAAMs on performance. In models (g), (h), (i), and (j), we employed 6, 8, 10, and 12 FAAMs, respectively, keeping all other components the same. We find that performance gradually improves as the number of FAAMs increases, but drops once the number exceeds 10. We believe this is because the network becomes too deep, which makes gradient propagation difficult and thus impairs the learning process. We therefore adopt 10 FAAMs to construct the FAPANet.

3) EFFECTIVENESS OF THE AFFM
Similarly, we verified the effectiveness of the AFFM proposed in Section III-C. Based on model (i), we constructed models (l) and (m) by replacing its addition operation with concatenation and with AFFM, respectively, and we removed the addition operation of model (i) to build model (k), which has no feature fusion module. The quantitative comparison shows that AFFM helps the network recover higher-quality images.
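The exact form of AFFM is given in Section III-C; the sketch below only illustrates the underlying idea of fusing shallow and deep features with a learnable weight rather than a fixed addition (the scalar `w_logit` stands in for what would be a learned parameter or sub-network in the real model):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adaptive_fusion(shallow, deep, w_logit):
    """Blend low-level (shallow) and high-level (deep) features with a
    learnable weight w in (0, 1), instead of plain element-wise addition."""
    w = sigmoid(w_logit)
    return w * shallow + (1.0 - w) * deep

shallow = np.ones((4, 8, 8))   # toy low-level features
deep = np.zeros((4, 8, 8))     # toy high-level features
out = adaptive_fusion(shallow, deep, w_logit=0.0)  # sigmoid(0) = 0.5
print(out[0, 0, 0])  # even blend -> 0.5
```

Because `w` is trained jointly with the rest of the network, the fusion ratio can adapt to whichever feature level is more informative, which plain addition cannot do.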

4) EFFECTIVENESS OF THE JOINT LOSS
In this experiment, we compared FAPANets trained with L1 loss (model (m)), L2 loss (model (n)), SSIM loss (model (o)), and the joint loss (model (p)) proposed in Section III-D. As shown in Table 5, the joint loss, abbreviated "JLoss" in the table, yields a 5.2% improvement in PSNR compared with L2 loss while keeping a competitive SSIM value.
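The structure of such a joint objective can be sketched as below. This is a simplified illustration, not the paper's exact loss: the SSIM term uses a single global window rather than the standard local-window formulation, the perceptual term is omitted because it requires a pretrained VGG network, and the weight `lam_ssim` is an assumed placeholder:

```python
import numpy as np

def l1_loss(pred, gt):
    return np.mean(np.abs(pred - gt))

def ssim_global(pred, gt, c1=0.01 ** 2, c2=0.03 ** 2):
    # Single-window (global) SSIM for illustration only; the real SSIM
    # loss averages the index over local sliding windows.
    mu_x, mu_y = pred.mean(), gt.mean()
    var_x, var_y = pred.var(), gt.var()
    cov = ((pred - mu_x) * (gt - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

def joint_loss(pred, gt, lam_ssim=0.5):
    # Perceptual (VGG feature) term omitted in this sketch.
    return l1_loss(pred, gt) + lam_ssim * (1.0 - ssim_global(pred, gt))

x = np.random.rand(32, 32)
print(joint_loss(x, x))  # identical images -> loss of (approximately) 0
```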

V. ANALYSIS AND DISCUSSIONS
In this section, we first visualize the effect of the feature attention aggregation module proposed in this paper, then discuss the running time, and finally present some failure cases.

A. VISUALIZATION OF THE FEATURE ATTENTION
The design of the FAAM is one of our main contributions, and the ablation study in Section IV-E quantitatively verifies its usefulness. To further demonstrate its effect on feature enhancement, we visualize the feature maps before and after FAAM enhancement. We use models (b), (c), (d), (e), (f), and (g) described in Table 5 to remove the haze in two hazy images, samples with heterogeneous and thick haze from the NH-HAZE [50] dataset. We then show the original features computed by the encoder of each model and the features recalibrated by the first feature enhancement block that follows the encoder. As shown in Fig. 7, no obvious feature excitation or suppression occurs in the results of models (b) and (c), while model (d), which employs CBAM [34] as the feature enhancement module, over-excites features globally and blurs the texture, leaving a large amount of haze or halo artifacts in the dehazed images, as shown in the red region of the first sample and the green region of the second sample. In the last row of Fig. 7, model (g), which uses the proposed FAAM, exhibits suitable feature excitation in the dense-haze region and thus achieves good dehazing results.
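The paper does not spell out its visualization recipe; a common approach, sketched below under that assumption, is to average the feature map over channels and min-max normalize it to an 8-bit heatmap:

```python
import numpy as np

def feature_to_heatmap(feat):
    """Collapse a (C, H, W) feature map to a single-channel uint8 image
    for visualization: average absolute activations over channels, then
    min-max normalize to [0, 255]."""
    heat = np.abs(feat).mean(axis=0)  # (H, W), channel-averaged magnitude
    heat -= heat.min()
    rng = heat.max()
    if rng > 0:
        heat /= rng                   # scale to [0, 1]
    return (heat * 255).astype(np.uint8)

feat = np.random.randn(16, 32, 32)    # toy encoder feature map
img = feature_to_heatmap(feat)
assert img.shape == (32, 32)
```

Bright regions in such a heatmap indicate where the attention block excites features, which is how the excitation in the dense-haze areas becomes visible.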

B. RUNNING TIME
To compare the computational time of the proposed method with some state-of-the-art algorithms [43,44], we calculated the average time each method takes to dehaze 500 images from the SOTS-Indoor dataset at a resolution of 460×620. All tests were performed on the same computer (Windows 10, Intel(R) Core(TM) i7-10700 CPU @ 2.90 GHz, 32.0 GB RAM, and an NVIDIA GeForce RTX 3090 GPU). The average time cost and the implementation platform of each algorithm are shown in Table 6.

Methods          Platform        Time (Seconds)
DCP [1]          Matlab (CPU)    1.395
CEP [42]         Python (CPU)    0.061
HazeLine [16]    Matlab (CPU)    3.906
IDE [43]         Matlab (CPU)    0.720
AOD [20]         PyTorch (GPU)   0.017
GCANet [44]      PyTorch (GPU)   0.071
GDN [27]         PyTorch (GPU)   0.076
MSBDN [26]       PyTorch (GPU)   0.064
FAPANet (ours)   PyTorch (GPU)   0.062

DCP [1], CEP [42], HazeLine [16], and IDE [43] run on the CPU without GPU acceleration because they are traditional algorithms based on the physical model. Although FAPANet adopts 10 cascaded FAAMs for feature enhancement, the network down-samples twice to reduce the input of each FAAM to a quarter of the original resolution, and the spatial attention in FAAM is parameter-free and simple, which dramatically reduces the computational complexity of the overall model. The results show that our algorithm is competitive with other state-of-the-art methods in time overhead while offering significantly better dehazing capability.
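A per-image average of this kind can be measured with a simple timing harness like the sketch below (the actual protocol may differ; for GPU models one would additionally synchronize the device, e.g. with `torch.cuda.synchronize()`, before reading the clock):

```python
import time
import numpy as np

def average_runtime(dehaze_fn, images, warmup=5):
    """Average per-image wall-clock time of dehaze_fn; a few warm-up
    runs are discarded so one-off startup costs do not skew the mean."""
    for img in images[:warmup]:
        dehaze_fn(img)                      # warm-up, not timed
    start = time.perf_counter()
    for img in images:
        dehaze_fn(img)
    return (time.perf_counter() - start) / len(images)

# toy stand-in for a dehazing model, at the paper's test resolution
identity = lambda img: img.copy()
batch = [np.zeros((460, 620, 3), dtype=np.float32) for _ in range(10)]
t = average_runtime(identity, batch)
assert t >= 0.0
```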

C. FAILURE CASES
While our algorithm works well on most daytime images, it fails when the input image is taken at night or in extremely dense fog. As shown in the first row of Fig. 8, our method cannot remove the haze well and even produces halos in the area marked by the green rectangle. This is because nighttime images differ significantly from daytime images in light distribution, and the physical degradation models of hazy images differ between day and night; our algorithm is trained on a large number of daytime images and therefore does not generalize well to nighttime scenes. For scenes with very thick haze, the loss of image detail is so severe that our model cannot recover enough content from the subtle cues that remain. As shown in the second example in Fig. 8, our algorithm removes the haze in the near part of the scene, marked by the red rectangle, but does not help with the dense fog in the distance. We will address these two issues in future work.

VI. CONCLUSION
In this study, we have presented an effective single image dehazing method. The new approach removes haze in an end-to-end way, thereby avoiding the error amplification caused by separately estimating the global atmospheric light and the transmission map. To extract more effective haze-relevant features, we utilize the specifically designed feature attention aggregation module to progressively enhance the feature map of the hazy image. The adaptive feature fusion mechanism is then employed to effectively merge the features computed by the shallow and deep layers of the network. To recover realistic clear images, we combine L1 loss, perceptual loss, and SSIM loss to train our network. Quantitative and qualitative experimental results demonstrate that the proposed algorithm outperforms existing advanced algorithms in dehazing both synthetic and real-world hazy images.