Dual Multi-Scale Dehazing Network

Single-image haze removal is a challenging ill-posed problem. Recently, methods trained on synthetic data have achieved good dehazing results. However, we note that these methods can be further improved. In this paper, a novel deep learning-based method is proposed to obtain better results for single-image dehazing. Specifically, we propose a dual multi-scale network to learn dehazing knowledge from synthetic data. The coarse multi-scale network is designed to capture large variations in object size, and fine multi-scale blocks are then designed to capture small variations in object size at each scale. To show the effectiveness of the proposed method, we perform experiments on a synthetic dataset and on real hazy images. Extensive experimental results show that the proposed method outperforms the state-of-the-art methods.


I. INTRODUCTION
The turbid medium in the atmosphere often degrades image quality. Outdoor images taken in bad weather tend to show a hazy and blurry appearance. Atmospheric absorption and scattering cause haze, which reduces the contrast and fades the colors of outdoor images. The light that reaches the camera from the scene objects is attenuated along the line of sight and blended with the atmospheric light. The absorption and scattering processes are commonly modeled by a linear combination of the direct attenuation and the air-light [1]:
$$I(x) = J(x)\,t(x) + A\,\bigl(1 - t(x)\bigr), \tag{1}$$
where $I$ is the input hazy image, $J$ is the corresponding clean image, $t$ represents how much of the light reflected from objects is received by the camera, and $A$ is the air-light. Single-image dehazing, which aims at removing as much haze as possible from a single input image, has a wide variety of applications, such as autonomous driving, semantic segmentation, and image recognition. Because of these applications, dehazing has attracted much attention. There are two key steps in the dehazing process: 1) estimating the transmission map and the atmospheric light, and 2) computing the final dehazed result. Prior-based methods [2], [3] have been proposed to remove haze following these two steps. However, such priors are based on simple statistical laws, which real scenes do not always satisfy. For example, the dark channel prior (DCP) [2] cannot handle white objects well.
The associate editor coordinating the review of this manuscript and approving it for publication was Yi Zhang.
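The scattering model of Eq. (1) can be illustrated with a few lines of code. The sketch below assumes the standard form I = J·t + A·(1 − t), with pixel values in [0, 1], one transmission value per pixel, and one air-light value per color channel. The function name `apply_haze` and the toy values are hypothetical and chosen only for illustration.

```python
def apply_haze(clean, transmission, airlight):
    """Blend a clean RGB image with air-light: I = J * t + A * (1 - t).

    clean:        rows of (r, g, b) tuples with values in [0, 1]
    transmission: rows of per-pixel transmission values t in [0, 1]
    airlight:     one (r, g, b) tuple A shared by all pixels (assumption)
    """
    hazy = []
    for row_j, row_t in zip(clean, transmission):
        hazy_row = []
        for pixel, t in zip(row_j, row_t):
            # Direct attenuation (c * t) plus air-light contribution (a * (1 - t)).
            hazy_row.append(tuple(c * t + a * (1.0 - t)
                                  for c, a in zip(pixel, airlight)))
        hazy.append(hazy_row)
    return hazy


# Toy 1x2 image: the same color seen without haze (t = 1.0) and through haze (t = 0.5).
clean = [[(0.2, 0.4, 0.6), (0.2, 0.4, 0.6)]]
transmission = [[1.0, 0.5]]
airlight = (1.0, 1.0, 1.0)
hazy = apply_haze(clean, transmission, airlight)
# hazy[0][0] keeps the clean color; hazy[0][1] is pulled toward the air-light,
# approximately (0.6, 0.7, 0.8).
```

With t = 1 the pixel is unchanged, and as t decreases the pixel moves toward the air-light, which is the loss of contrast and color described above.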
Inspired by the success of data-driven methods, many researchers have proposed end-to-end CNN models [4], [5], [6], [7], [8] for single-image dehazing. Although these methods are effective on synthetic datasets, they are limited by the large, arbitrary variations in scale caused by haze. Furthermore, the distribution of haze depends on depth, so different receptive field sizes are needed to estimate the depth at each pixel.
To overcome these two issues jointly, we propose a dual multi-scale dehazing network. The formation of haze can be affected by various factors, such as temperature, altitude, and humidity, which makes the distribution of haze space-variant and non-homogeneous across spatial locations. To capture this distribution, the proposed network uses different receptive fields and captures objects of different sizes. We compare our method with a traditional method and a learning-based method [2], [9] in Fig. 1. The dehazed result of DCP tends to look dark, and the tree area is not recovered well compared with Fig. 1(d). The dehazed result of PhysicsGan looks better than that of DCP, but there is still room for improvement. Compared with DCP and PhysicsGan, our method generates a more visually favorable result. The main contributions of this work are as follows:
• We propose a dual multi-scale dehazing network, which can capture both large and small variations in object size and model the distribution of haze. Because the haze distribution varies widely across an image, we employ a coarse multi-scale network to capture the global haze distribution. We then capture small local variations of the haze distribution via fine multi-scale blocks. The proposed model captures the global and local distribution of haze well and effectively improves the dehazing performance.
• We propose a fine multi-scale block, which can capture small variations in object size. The distribution of haze depends on depth, which differs between objects. However, the haze within one object tends to be homogeneous. It is therefore critical to design a network that can capture small variations in object size at each scale, which motivates the fine multi-scale block.
• We conduct extensive experiments to quantitatively and qualitatively compare the proposed method with the state-of-the-art single-image dehazing methods and demonstrate the effectiveness of the proposed model.

II. RELATED WORK
Single-image dehazing methods can be broadly grouped into two categories: physical-model-based recovery methods and color-information-based enhancement methods.

A. SINGLE IMAGE DEHAZING METHODS
Physical-model-based methods [10], [11] assume that hazy images follow Eq. (1), which expresses a hazy image as a linear combination of the clean image and the atmospheric light [12].
Here, the clean image is the scene information that is not affected by medium particles. Based on this model, most existing algorithms focus on recovering the scene information that does not reach the camera sensor, i.e., on estimating the transmission map t(x) for each hazy image. For example, an improved image formation model is proposed in [13] to estimate the transmission map and the surface shading. The hazy image is treated as regions of constant albedo, from which the scene transmission can be inferred. The dark channel prior (DCP) is derived from the statistics of non-sky haze-free images. The DCP assumes that, within a local patch, at least one pixel has a color channel whose intensity is close to zero.
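The dark channel that underlies the DCP can be sketched as follows. This is a minimal illustration, assuming the usual definition (minimum over the color channels, then minimum over a square local window); the function name `dark_channel`, the `radius` parameter, and the toy image are hypothetical.

```python
def dark_channel(image, radius=1):
    """Per-pixel minimum over RGB and over a (2 * radius + 1)^2 window.

    image: rows of (r, g, b) tuples with values in [0, 1].
    """
    height, width = len(image), len(image[0])
    # Step 1: minimum over the color channels at each pixel.
    min_rgb = [[min(pixel) for pixel in row] for row in image]
    # Step 2: minimum over the local window, clipped at the image border.
    dark = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [
                min_rgb[yy][xx]
                for yy in range(max(0, y - radius), min(height, y + radius + 1))
                for xx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            row.append(min(window))
        dark.append(row)
    return dark


# A 1x3 strip: the middle pixel has a near-zero green channel.
strip = [[(0.9, 0.8, 0.7), (0.5, 0.05, 0.6), (0.9, 0.9, 0.9)]]
print(dark_channel(strip, radius=1))  # [[0.05, 0.05, 0.05]]
print(dark_channel(strip, radius=0))  # [[0.7, 0.05, 0.9]]
```

With `radius=0` the bright, nearly white pixel on the right keeps a high dark-channel value even though it is haze-free, which is the kind of region where the prior is known to struggle.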
Reference [10] extends the DCP and proposes a more general boundary constraint. Four haze-relevant priors are studied in [14], and a multi-scale dehazing method is designed to improve the dehazing performance. Reference [15] finds a relation between brightness and saturation in clear image patches and proposes a color attenuation prior to compute transmission maps. Reference [3] finds that a clean image can be represented by a few hundred color clusters, whereas a hazy image cannot. Based on this observation, [3] designs a non-local method to compute the transmission map. However, these hand-crafted priors are statistical properties over a large number of images and thus do not always hold in practical scenarios. For example, when the scene objects are close in color to the airlight, the dark channel has bright values near such objects, which means that the dark channel prior does not hold and the haze layer is overestimated [2]. To avoid designing statistical features, several algorithms employ deep convolutional neural networks (CNNs) to improve image dehazing. Both DehazeNet [11] and MSCNN [16] use a deep neural network for transmission estimation and then follow the conventional method to estimate the atmospheric light and the haze-free image. Instead of computing the transmission map and the atmospheric light separately, AOD-Net [17] incorporates the transmission and the airlight into a new variable and designs a lightweight dehazing method. However, this method tends to leave residual haze in the dehazed result. DCPDN [18] and DDN [19] incorporate the scattering model into deep networks. These methods need two networks to compute the transmission map and the atmospheric light first, and then restore the final dehazed image by inverting Eq. (1). An end-to-end fusion-based dehazing network (GFN) [20] is proposed to predict weight maps that combine three derived inputs into a single one by choosing their most important features.
However, GFN still computes its three inputs using traditional methods, and intermediate confidence maps need to be computed. Qin et al. design a novel pixel and channel attention mechanism [5] to improve the dehazing performance. Pan et al. design physics-based generative networks [9] for image restoration, which incorporate the physical model to boost the dehazing performance. Dong et al. employ a boosting strategy to design a multi-scale dehazing network [21]. Zheng et al. study ultra-high-definition image dehazing [22] based on the physical model. Although promising results have been obtained, the assumption that a hazy image is the sum of the clean image and the airlight does not hold in complex real scenes, especially when the haze is heavy and contains noise. To improve the dehazing performance on natural hazy images, Shao et al. propose a domain adaptation dehazing method [23]. Different from these methods, our method builds multi-scale ability into the network and achieves fast dehazing.
Prior-based dehazing methods can restore sharp dehazed results, at the expense of low quantitative results on synthetic images. Data-driven dehazing methods obtain high quantitative results on synthetic images but cannot completely remove haze from real hazy images. To address the disadvantages of both, neural augmentation based dehazing methods [24], [25], [26] have been proposed. These methods first estimate the atmospheric light and the transmission map, and then use data-driven methods to refine them. The dehazed results are obtained from the physical model with the estimated atmospheric light and transmission map.
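The last step shared by these model-based pipelines, recovering the scene from an estimated transmission map and atmospheric light, amounts to inverting Eq. (1). The sketch below assumes the standard form I = J·t + A·(1 − t) and solves it for J. The lower bound `t_min` is an assumption commonly used to avoid dividing by very small transmission values, and the function name `recover_scene` is hypothetical.

```python
def recover_scene(hazy, transmission, airlight, t_min=0.1):
    """Invert I = J * t + A * (1 - t) for J, given estimates of t and A.

    hazy:         rows of (r, g, b) tuples with values in [0, 1]
    transmission: rows of per-pixel transmission estimates
    airlight:     one (r, g, b) tuple A
    t_min:        assumed lower bound on t to keep the division stable
    """
    restored = []
    for row_i, row_t in zip(hazy, transmission):
        out_row = []
        for pixel, t in zip(row_i, row_t):
            t = max(t, t_min)
            # J = (I - A) / t + A, applied per color channel.
            out_row.append(tuple((c - a) / t + a
                                 for c, a in zip(pixel, airlight)))
        restored.append(out_row)
    return restored


hazy = [[(0.6, 0.7, 0.8)]]
restored = recover_scene(hazy, [[0.5]], (1.0, 1.0, 1.0))
# restored[0][0] is approximately (0.2, 0.4, 0.6)
```

Because the result divides by t, errors in the estimated transmission are amplified in dense haze, which is one reason refinement of the transmission map matters in these methods.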

III. PROPOSED METHOD
The proposed model is a dual multi-scale dehazing network; the overall framework is shown in Fig. 2. The dual multi-scale ability comes from the coarse multi-scale network and the fine multi-scale blocks. We first introduce the motivation and then the dual multi-scale dehazing network, which learns to dehaze from synthetic images.

A. MOTIVATION
Objects in a scene often have different sizes, which makes dehazing difficult. As shown in Fig. 3, the persons in the red rectangles have different sizes due to their different depths, which results in different haze densities in these areas. The trees in the black rectangles also have different sizes. Objects in near areas appear large, while objects in far areas appear small. To capture such dramatic variations in object size, we propose a coarse multi-scale network that increases the receptive field via down-sampling. As shown in Fig. 3(c) and (d), the sheep and the cranes have similar sizes and show only small variations in object size; capturing such variations is also important for image dehazing. To capture these small variations, we propose a fine multi-scale block, which employs different dilation rates to model the variations in local areas. To capture both large and small variations, we combine the coarse multi-scale network with fine multi-scale blocks.
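How much a dilation rate widens the area seen by one convolution can be made concrete with the standard formula for the effective kernel size of a dilated convolution, k + (k − 1)(d − 1). The sketch below evaluates it for a 3 × 3 kernel with the dilation rates 1, 2, 4, and 8 reported in the implementation details; the function name is hypothetical.

```python
def effective_kernel_size(kernel_size, dilation):
    """Spatial extent covered by one dilated convolution kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


sizes = {d: effective_kernel_size(3, d) for d in (1, 2, 4, 8)}
print(sizes)  # {1: 3, 2: 5, 4: 9, 8: 17}
```

A single 3 × 3 layer therefore covers between 3 and 17 pixels per side depending on the rate, without adding weights, which suits the small size variations within one scale. Down-sampling, by contrast, enlarges the receptive field of every subsequent layer.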

B. DUAL MULTI-SCALE DEHAZING NETWORK
Based on the analysis in Section III-A, we propose a dual multi-scale dehazing network (DMSDN); the network details can be found in Fig. 2. The dual multi-scale dehazing network consists of a coarse multi-scale network and fine multi-scale blocks. The coarse multi-scale network contains three scales. The first scale (coarse scale) contains six fine multi-scale blocks, the second scale (middle scale) contains six fine multi-scale blocks, and the third scale contains six fine multi-scale blocks. To capture global and local features, the model employs three scales of information to explore useful features for dehazing. The learned feature maps contain redundant information, which is one reason why deep learning models fail to learn effective features for dehazing. To boost learning efficiency, we propose a fine multi-scale block (FMB), which consists of a multi-scale information extraction module and an attention module, as shown in Fig. 4. Because the feature map contains redundant information, applying a single convolution to it cannot extract as much information as possible. We therefore split the feature into four sub-features, each containing part of the information of the original feature. We apply a convolution to one sub-feature and obtain a new feature ($O_1$). We concatenate $O_1$ with another sub-feature to obtain a concatenated feature ($C_1$), and then apply a convolution to $C_1$ to obtain a new feature ($O_2$). We repeat this process to obtain $O_3$ and $O_4$. We concatenate $O_1$, $O_2$, $O_3$, and $O_4$, and then apply channel attention to the concatenated feature to obtain the activated feature for dehazing. The proposed module reduces the computation time and model complexity.
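The claim that the split-and-concatenate design lowers model complexity can be checked with a rough weight count. The sketch below is not the authors' implementation. It assumes that the input has C channels split into four equal sub-features, that every branch convolution is 3 × 3 and outputs C/4 channels, and that biases and the channel attention are ignored. Both function names are hypothetical.

```python
def fmb_weight_count(channels, groups=4, kernel_size=3):
    """Convolution weights in the split/convolve/concatenate chain.

    Assumes each branch outputs channels // groups feature maps.
    """
    if channels % groups:
        raise ValueError("channels must be divisible by groups")
    sub = channels // groups
    total = 0
    for branch in range(groups):
        # Branch 0 sees one sub-feature; each later branch sees the previous
        # output concatenated with the next sub-feature (2 * sub channels).
        in_channels = sub if branch == 0 else 2 * sub
        total += in_channels * sub * kernel_size * kernel_size
    return total


def plain_conv_weight_count(channels, kernel_size=3):
    """Weights of one ordinary convolution mapping C channels to C channels."""
    return channels * channels * kernel_size * kernel_size


print(fmb_weight_count(64))         # 16128
print(plain_conv_weight_count(64))  # 36864
```

Under these assumptions the chained branches use 7/16 of the weights of a single full 3 × 3 convolution over the same 64 channels, while the last output still depends on all four sub-features through the chain.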
To further improve the information flow, we propose an adaptive fusion module (AFM), which adaptively fuses the features from each scale of the coarse multi-scale network. As shown in Fig. 5, we first concatenate the high-level, middle-level, and low-level features, and then apply a convolution with a 1×1 kernel to obtain the fused feature.
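At a single pixel, concatenation followed by a 1×1 convolution is a learned linear mix of the channels from the three levels. The sketch below illustrates this for one pixel, assuming the three feature maps have already been brought to the same spatial resolution (the text does not say how) and ignoring the bias. The function name `fuse_pixel` and the weight values are hypothetical.

```python
def fuse_pixel(low, mid, high, weights):
    """Concatenate three per-pixel feature vectors and apply a 1x1 convolution.

    weights: one row per output channel, one column per concatenated input channel.
    """
    stacked = list(low) + list(mid) + list(high)
    return [sum(w * v for w, v in zip(row, stacked)) for row in weights]


low, mid, high = [1.0, 2.0], [3.0, 4.0], [5.0, 6.0]
third = 1.0 / 3.0
# Hypothetical weights that average corresponding channels across the three levels.
weights = [
    [third, 0.0, third, 0.0, third, 0.0],
    [0.0, third, 0.0, third, 0.0, third],
]
fused = fuse_pixel(low, mid, high, weights)
# fused is approximately [3.0, 4.0]
```

In a trained network the weights are learned, so the fusion can favor a different level for each output channel, which is what makes the fusion adaptive.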

C. TRAINING LOSS
Let $F$ denote the mapping function learned by the network, and let $\theta$ represent the parameters of the network. Let $\{I_i, i = 1, 2, \cdots, N\}$ and $\{J_i, i = 1, 2, \cdots, N\}$ denote the hazy input images and the corresponding clean ones, respectively. It has been widely acknowledged that the $L_2$ loss tends to produce blurry dehazed results [18]. To address this issue, we introduce a novel edge-preserving loss, which is composed of two parts: an $L_1$ loss and a perceptual loss. The $L_1$ loss is defined as follows:
$$L_1 = \frac{1}{N} \sum_{i=1}^{N} \bigl\| F(I_i; \theta) - J_i \bigr\|_1,$$
where $N$ is the number of training pairs.
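As a small numerical illustration of the L1 term, the sketch below assumes the usual definition, the sum of absolute pixel differences per image averaged over the N training pairs, with each image flattened to a list of values. The function name `l1_loss` is hypothetical, and implementations often also divide by the number of pixels.

```python
def l1_loss(predictions, targets):
    """Mean over N image pairs of the summed absolute pixel differences."""
    per_image = [
        sum(abs(p - t) for p, t in zip(pred, target))
        for pred, target in zip(predictions, targets)
    ]
    return sum(per_image) / len(per_image)


predictions = [[0.5, 0.5], [1.0, 0.0]]   # two flattened network outputs
targets = [[0.0, 0.5], [0.0, 0.0]]       # the corresponding clean images
print(l1_loss(predictions, targets))     # 0.75
```

Unlike a squared error, the penalty grows linearly with the residual, which is consistent with the sharper results attributed above to the L1 loss.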
To eliminate visual artifacts in the dehazed images, we employ a perceptual loss to train the model. The perceptual loss consists of a feature reconstruction loss and a style reconstruction loss. Instead of encouraging the pixels of the dehazed image $\hat{J}$ to exactly match the pixels of the ground-truth image, the feature reconstruction loss encourages them to have similar feature representations. The perceptual loss is computed on features extracted by $\phi$, the VGG-19 network trained on ImageNet; $N$ denotes the number of training samples, and $j$ denotes the layer index. We select the layers 'conv1-2', 'conv2-2', 'conv3-2', 'conv4-2', and 'conv5-2' of the VGG-19 network to compute the feature reconstruction loss. Our overall loss function combines the $L_1$ loss and the perceptual loss, where $\lambda_2$ controls the contribution of the perceptual loss.
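The exact expressions for the perceptual loss and the overall objective are not given in the text above. As a hedged sketch only, one common form that is consistent with the symbols defined there ($\phi$, $N$, $j$, $\lambda_1$, $\lambda_2$) is shown below. The squared $\ell_2$ feature distance, the layer set $\mathcal{S}$ (standing for the five selected VGG-19 layers), and the use of $\lambda_1$ as the weight of the $L_1$ term are assumptions, and the style reconstruction term is omitted.

```latex
L_{p} = \frac{1}{N} \sum_{i=1}^{N} \sum_{j \in \mathcal{S}}
        \bigl\| \phi_j\bigl(F(I_i; \theta)\bigr) - \phi_j(J_i) \bigr\|_2^2,
\qquad
L = \lambda_1 L_1 + \lambda_2 L_{p}
```

Under this form, $\phi_j(\cdot)$ denotes the activation of layer $j$, so the loss compares images in feature space rather than pixel by pixel.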

D. IMPLEMENTATION DETAILS AND DATASET
In the proposed model, we set the kernel size to 3 × 3 for all convolution layers except those in the AFM. In our experiments, we set the number of scales to 3. For each scale-aware attention module, we set the dilation rates to 1, 2, 4, and 8. All dilated layers are initialized using an identity initializer [27]. We set λ1 = 0.01 and λ2 = 0.01 in all experiments. We use the leaky rectified linear unit (LReLU) as the activation function. We use the Adam optimizer with β1 = 0.9 and β2 = 0.9999 to train the network. The batch size and the learning rate are 1 and 0.0005, respectively. During training, we halve the learning rate every 30 epochs. The network was trained for a total of 100 epochs in PyTorch on an Nvidia GTX 2018Ti GPU. Following the state-of-the-art dehazing methods [5], [29], we train the proposed network on the SOTS dataset from RESIDE [28].
Fig. 9. The best result is marked in red, while the second best is marked in blue.
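The step schedule described in the implementation details (initial learning rate 0.0005, halved every 30 epochs, 100 epochs in total) can be written out as a short sketch. The function name `learning_rate` and the assumption that epochs are counted from zero are illustrative choices, not taken from the source.

```python
def learning_rate(epoch, base_lr=0.0005, step=30, factor=0.5):
    """Step decay: multiply the base rate by `factor` once per `step` epochs."""
    return base_lr * factor ** (epoch // step)


for epoch in (0, 29, 30, 60, 99):
    print(epoch, learning_rate(epoch))
# The rate stays at 0.0005 through epoch 29, drops to 0.00025 at epoch 30,
# to 0.000125 at epoch 60, and ends at 6.25e-05 for epochs 90 to 99.
```

Over 100 epochs the rate is therefore halved three times, so the final epochs run at one eighth of the initial rate.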

A. QUANTITATIVE COMPARISON
1) RESIDE DATASET
REalistic Single-Image DEhazing (RESIDE) [28] is the first large-scale simulated haze dataset, and it provides indoor and outdoor hazy images. Because the hazy images in this dataset have ground truth, we can evaluate the dehazing performance using PSNR and SSIM. The indoor part of the RESIDE dataset simulates hazy images using the NYU indoor dataset [33] and contains 500 hazy images for testing. We evaluate the performance of the proposed network on the SOTS dataset from RESIDE [28]. The comparison results on SOTS are shown in Table 1. The experimental comparisons show that the proposed method outperforms recent state-of-the-art methods [21], [29] with substantial improvements. It should be pointed out that FFA-Net achieves the best scores on the RESIDE dataset; however, its performance on real hazy images is poor. We refer to GridDehazeNet [29] as GDN.
84704 VOLUME 11, 2023 Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
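Of the two metrics used here, PSNR is simple enough to sketch directly. The example below follows the standard definition, 10·log10(MAX² / MSE), on flattened images with values in [0, 1]; the function name `psnr` and the toy values are illustrative. SSIM is omitted because it needs local window statistics.

```python
import math


def psnr(prediction, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two flattened images."""
    mse = sum((p - t) ** 2 for p, t in zip(prediction, target)) / len(target)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


dehazed = [0.5, 0.5, 0.5, 0.5]
ground_truth = [0.6, 0.6, 0.6, 0.6]
print(round(psnr(dehazed, ground_truth), 2))  # 20.0
```

A uniform error of 0.1 on a [0, 1] scale gives an MSE of 0.01 and therefore 20 dB. Each tenfold reduction in MSE adds 10 dB.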

B. QUALITATIVE COMPARISON
To further evaluate the proposed method, we compare it with different state-of-the-art methods on real images. Fig. 6 shows the qualitative comparison with state-of-the-art dehazing algorithms [2], [9], [11], [17], [18], [20], [21], [31], [32] on challenging real-world images. As shown in Fig. 6(b), most of the haze is removed by DCP, and the details of the scenes and objects are well restored. However, the results suffer significantly from over-enhancement (for instance, the building regions of the first and second images are much darker than they should be). The results of DehazeNet, AOD-Net, FFA-Net, MSBDN, Dehamer, GridDehazeNet, and DCPDN do not have the overestimation problem and maintain the original colors of the objects, as shown in Fig. 6, but some haze remains in their dehazed results. AirNet and GFN tend to estimate the haze concentration non-uniformly, which results in inhomogeneous dehazed images, as in Fig. 6(k). PhysicsGan and EPDN generate relatively clear results, but the images show some color distortions. In contrast, the dehazed results of our method are clear, and the details of the scenes are enhanced moderately, as shown in Fig. 6(n).
We note that some works employ hazy images to train dehazing networks [6], [31]. PSD [6] employs hazy images to fine-tune the dehazing network. AirNet [31] designs an encoder-decoder network that can handle images with unknown corruption. As shown in Fig. 7, the dehazed results of AirNet and PSD still look hazy. We also note some color distortion in the dehazed result of PSD in the second row of Fig. 7. The dehazed result of AirNet looks darker than those of the proposed method and PSD. In contrast, the proposed method restores image details and recovers a reasonable global appearance. AirNet assumes that the degradations within the same image are similar, which is not true for image dehazing. We show an example in Fig. 8, in which the degradations within the same image are not similar. PSD employs several well-grounded physical priors to fine-tune the dehazing model. However, the physical priors do not hold for all hazy images. The proposed method employs a haze-aware model to fuse the dehazed result, which helps the model restore high-quality dehazed results.
We further compare the proposed method with some recent end-to-end dehazing methods [5], [6], [21], [30], [31], [34], [35], [36]. We show an example in Fig. 9. The dehazed results of FAMED, FFA-Net, MSBDN, AECR, and AirNet tend to retain haze. EPDN can remove haze, but its dehazed result tends to lose image details. SGID-PFF can remove haze, but some areas of its dehazed result are completely dark. PSD can enhance the hazy image, but its result tends to retain haze and show color distortion. In contrast, the proposed method restores image details and recovers a reasonable global appearance with a colorful dehazed result.
To show the effectiveness of the proposed method, we compare it with other dehazing methods in two ways. First, we report the dehazing performance of the methods on real hazy images. Second, we report the haze densities of the dehazed results obtained by different methods. As shown in Table 2, the proposed method achieves the second-best dehazing performance under the DHQI metric [37]. The proposed method is data-driven and therefore may not perform well on real-world images; nevertheless, it is better than the other data-driven dehazing methods. To further show the performance of the proposed method, we report the haze density of the results obtained by the dehazing methods. As shown in Table 3, the proposed method removes haze better than the other dehazing methods.

C. ABLATION STUDY
To better show the effectiveness of the proposed modules, we design an ablation study covering the coarse multi-scale network, the fine multi-scale blocks, and the adaptive fusion module. We construct a series of variants with different proposed modules: 1) To show the effectiveness of the coarse multi-scale network, we design a single-scale model, termed Base; 2) We add coarse multi-scale ability while removing the AFM and replacing the FMBs with traditional dense blocks, which is termed BaseNet. We show the architecture of the BaseFMB in Fig. 10; 3) We add the AFM to BaseNet, and we term it BaseAFM; 4) We replace the traditional dense blocks with FMBs, and we term it BaseFMB; 5) The architecture proposed in Section III, which is termed Full. All models are trained in the same way and tested on the indoor part of RESIDE. As shown in Table 4, each proposed module contributes to image dehazing.
To show the influence of the loss function, we add an experiment that uses only the L1 norm to train the proposed model. As shown in Table 4, the model trained with only the L1 norm obtains lower-quality dehazing results.
To show the efficiency of the proposed model, we report the run time of its variants. As shown in Table 5, the proposed model runs faster than the other models, such as Base, BaseNet, and BaseAFM. The models are tested on a computer equipped with an Nvidia GeForce 1060.

D. RUN TIME
Although dehazing performance has been greatly improved, dehazing speed remains slow. In this subsection, we compare the proposed model with several dehazing methods that achieve high dehazing performance. We test the dehazing speed on a server platform equipped with eight TITAN V GPUs. The CPU of the platform is an Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz, and the memory is 512 GB. We resize the hazy images to a fixed size of 512 × 512. We show the dehazing speed of state-of-the-art methods in Table 6. The proposed method is almost two times faster than MSBDN.

E. LIMITATION
Although the proposed model is effective for most hazy images, it may fail for some dense hazy images. We address this problem by using a DCP loss. We show an example in Fig. 11, which is from the prior work [38]. As shown, the DCP loss may result in artifacts around depth discontinuities. To further improve the dehazing quality, we design a novel method to improve the accuracy of the transmission map. We show the difference between DCP and the proposed method in Fig. 11. Although the proposed model is trained with the transmission maps estimated by DCP, it is also trained with a synthetic dataset, which improves the accuracy of the estimated transmission maps. As shown in Fig. 11, the transmission maps estimated by DCP contain more details, whereas the result of the proposed method is much smoother. We can use the new transmission map prediction network and real hazy images to boost the dehazing performance on real hazy images.
As shown in [39], [40], DNN-based methods often learn low-frequency functions while ignoring high-frequency information. The neural augmentation framework [24] was proposed to address this problem. In the future, we will also adopt the neural augmentation framework to improve the dehazing quality of the proposed method.

V. CONCLUSION
In this paper, we design a dual multi-scale dehazing network for single-image dehazing. The model contains a coarse multi-scale network and fine multi-scale blocks. The coarse multi-scale network is designed to capture large variations in object size, while the fine multi-scale blocks are designed to capture small variations in object size. The coarse multi-scale network contains three scales, which extract pyramidal features from the input image. To further explore the multi-scale information, we develop a fine multi-scale block, which extracts multi-scale information using dilated convolutions with different dilation rates and channel-wise attention. The adaptive fusion module is designed to improve the information flow. Extensive experiments are conducted on public synthetic indoor images and natural hazy images to show the effectiveness of the proposed method.