Enhanced Multiscale Attention Network for Single Image Dehazing

Under severe weather conditions, the quality of images taken outdoors is directly degraded by floating atmospheric particles, so haze removal methods play a critical role in preserving image quality. The most difficult part of haze removal is eliminating haze that spreads over the entire image. Many CNN-based dehazing methods have been proposed, and they can be divided into two types: those that use a multi-scale structure and those that stack layers. The former suffers image degradation because some of the original information in the image is lost, while the latter incurs high computational complexity because the resolution is never reduced. In addition, a large number of parameters is required to secure the expressive power of the model, which leads to a huge memory footprint. To tackle these problems, we aim to 1) downsample the image while saving parameters and maintaining the quality of the generated image, and 2) exploit information from the entire image to remove the haze. For the first goal, we use a feature extractor that has been used in other tasks, learn to optimize the output image at low resolution, and prepare kernels with various dilation rates to expand the receptive field. For the second goal, we use an attention structure to determine which parts of the feature map should be focused on. By incorporating these modules, our method achieves better results on both synthetic and real-world images than state-of-the-art methods.


I. INTRODUCTION
In recent years, the demand for high-quality images has been […] color of the image. The physical haze model [5], [6], [7] is expressed by the following equation:

I(x) = t(x)J(x) + (1 − t(x))A        (1)

where I, J, t, and A are the input hazy image, the clean image, the transmission coefficient, and the ambient light, respectively. Recently, owing to the development of hardware, deep learning methods have become widely used for single image dehazing, and there are mainly two types of methods. One is to directly output a clean image in an end-to-end manner [8], [9], [10]; the other is to output a haze-free image by estimating the transmission coefficient and ambient light through a network and substituting these values into Eq. (1). Unlike other degradation factors, haze is greatly affected by distance, and it is very difficult to train a network without considering this distance information. However, depth information is required to obtain the correct label of the transmission map, which is […] overload and processing-speed reduction. To address these issues, we propose the Enhanced Multi-Scale Attention Network (EMSAN), which reduces the number of parameters. Furthermore, various modules, such as attention, are added to achieve more global and advanced feature extraction and processing.

(The associate editor coordinating the review of this manuscript and approving it for publication was Amin Zehtabian.)
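As a concrete illustration, Eq. (1) can be applied in both directions: synthesizing a hazy image from a clean one, and recovering the clean image when t and A are known. The following NumPy sketch shows this; the function names are ours (not from the paper), and it assumes J has shape (H, W, 3), t has shape (H, W), and A is a scalar or per-channel vector.

```python
import numpy as np

def apply_haze(J, t, A):
    """Synthesize a hazy image: I(x) = t(x)*J(x) + (1 - t(x))*A  (Eq. 1)."""
    t = t[..., None]                      # broadcast over color channels
    return t * J + (1.0 - t) * A

def dehaze(I, t, A, t_min=0.1):
    """Invert Eq. 1: J(x) = (I(x) - (1 - t(x))*A) / t(x)."""
    t = np.maximum(t, t_min)[..., None]   # clamp t to avoid division blow-up
    return (I - (1.0 - t) * A) / t
```

Clamping t with a small lower bound is a common safeguard: where the transmission approaches zero (dense haze), the division would otherwise amplify noise without bound.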

To expand the receptive field, it is necessary to reduce the size of the image before processing it, but information contained in the image is lost through downsampling during encoding. To address this trade-off, we propose the Mixed Encoder (ME), an encoder consisting of a pre-trained VGG16 and DenseNet. By using these networks as the encoder, the ME not only avoids losing information during encoding but also performs high-level feature extraction and prevents color distortion. Additionally, we propose the Multi Output Branch (MOB) structure to improve the accuracy of the low-scale branches in a multi-scale network, which reduces deterioration. MOB generates low-resolution output images from the low-scale features through a Refine Block [8]. By taking the loss between these output images and the downsampled ground truth, feature extraction at the lower scales is optimized.
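The MOB idea — supervising each low-scale branch against a correspondingly downsampled ground truth — can be sketched as follows. This is a minimal NumPy illustration with our own helper names; the paper's actual resizing operator and loss weighting may differ.

```python
import numpy as np

def downsample2x(img):
    """2x average-pool downsampling (a stand-in for the paper's resizing)."""
    H, W, C = img.shape
    return img[:H // 2 * 2, :W // 2 * 2].reshape(H // 2, 2, W // 2, 2, C).mean(axis=(1, 3))

def multi_output_branch_loss(outputs, gt):
    """outputs[0] is the full-resolution prediction; outputs[k] is 2^k times
    smaller. Each low-scale output is compared against a matching
    downsampled ground truth, so every branch receives direct supervision."""
    total = 0.0
    target = gt
    for k, out in enumerate(outputs):
        if k > 0:
            target = downsample2x(target)   # shrink GT to the branch's scale
        total += np.abs(out - target).mean()  # per-scale L1 term
    return total
```

With perfect predictions at every scale the loss is zero; any deviation at any branch contributes, which is what optimizes the low-scale feature extraction.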

In addition, we have incorporated a new block structure, the Enhanced Feature Attention (EFA) module, inspired by the FA module proposed in FFA-Net [10]. This module uses dilated convolution, pixel attention, and channel attention to consider the entire feature map. The previous serial pixel- and channel-attention structure is likely to cause a bottleneck in one of the two, so we address this problem by parallelizing them.

Processing at the original resolution has a significant impact on the final output. Therefore, instead of using a structure based on convolutional layers, which consider only local regions, we develop a structure based on attention so that information from the entire image can be considered. The attention part is structured around the Multi-head Self Channel Attention (MH-SCA) block. Replacing the fully connected layer of channel attention with multi-head self attention allows processing according to the semantic features of the image.

We summarize the contributions of our work as follows. […]

[…] Scaled Dot-Product Attention is expressed as

Attention(Q, K, V) = softmax(QK^T / √d)V        (2)

where d denotes the number of dimensions. In Eq. (2), QK^T represents the inner product of Q and K, and the value is calculated based on the similarity between the query and key.
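Eq. (2), together with the self-attention variant discussed next, can be written compactly in NumPy. This is a generic sketch of the standard operation, not the paper's exact implementation; the projection matrices Wq, Wk, Wv are our notation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # stabilized softmax
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V  (Eq. 2)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)   # similarity of every query to every key
    return softmax(scores, axis=-1) @ V

def self_attention(X, Wq, Wk, Wv):
    """Self Attention: Q, K, V are all projected from the same input X."""
    return scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv)
```

Passing an identity matrix as V exposes the attention weights themselves: each row is a probability distribution over keys and sums to one.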

Self Attention [27] is a form of Scaled Dot-Product Attention in which Q, K, and V are obtained from the same vector. This makes it possible to capture the relationships between words in a sentence or between pixels in an image.
When Q, K, and V are calculated from the same vector, as in Self Attention, the similarity of each element to itself inevitably increases, making it difficult to capture the relationships between different elements. Therefore, this method compresses the heads to change the viewing position.

The overall structure of the proposed method is shown in Fig. 2. The feature map generated by the Mixed Encoder (ME) is processed and upsampled in a branch composed of Enhanced Feature Attention (EFA) blocks. The feature map is added to the features extracted one scale higher by […] the trade-off between the number of parameters and the performance improvement.

The structure of the Enhanced Feature Attention (EFA) module, an enhancement of the Feature Attention (FA) module proposed in FFA-Net, is shown in Fig. 3. EFA is divided into two parts: the dilation part and the attention part. The dilation part uses dilated convolutions to expand the receptive field of the model. To enable feature extraction over a wide range of feature sizes, dilated convolutions with dilation rates of 1, 2, 4, 8, and 16 are connected in parallel. In the attention part, channel attention and pixel attention are processed in parallel, concatenated, and propagated to the next block through a convolution. This is reasonable because both channel attention and pixel attention see the same feature map, and it prevents either one from becoming a bottleneck.
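The attention part of EFA — channel and pixel attention applied in parallel to the same feature map, then concatenated and fused — can be sketched as follows. This is a NumPy, channels-last illustration; all weight shapes and names are our assumptions, and the 1×1 convolutions are written as plain matrix products.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(F, W1, W2):
    """GAP -> two 1x1 convs (matmuls here) -> per-channel weights."""
    v = F.mean(axis=(0, 1))                   # (C,) global average pooling
    w = sigmoid(np.maximum(v @ W1, 0) @ W2)   # (C,) channel weights
    return F * w                              # rescale each channel

def pixel_attention(F, W1, W2):
    """Two 1x1 convs -> a per-pixel spatial weight map."""
    h = np.maximum(F @ W1, 0)                 # (H, W, C//r)
    m = sigmoid(h @ W2)                       # (H, W, 1) weight map
    return F * m                              # rescale each pixel

def efa_attention_part(F, ca_w, pa_w, W_fuse):
    """Both attentions see the same feature map in parallel; their outputs
    are concatenated and fused back to C channels by a 1x1 conv."""
    ca = channel_attention(F, *ca_w)
    pa = pixel_attention(F, *pa_w)
    return np.concatenate([ca, pa], axis=-1) @ W_fuse
```

Because neither branch feeds the other, neither channel nor pixel attention can throttle the information reaching the other — the serial bottleneck the text describes.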

Self Channel Attention (SCA) introduces the concept of Self Attention into conventional Channel Attention. In conventional channel attention, a feature map is processed by Global Average Pooling (GAP), and the resulting vector is passed through fully connected layers to generate a new vector that serves as per-channel weights. If all channels are used to determine the attention weights, unrelated channels can be negatively affected, because some channels are correlated with each other while others are not. By applying Self Attention to the vector obtained from the feature map by Global Average Pooling, as shown in Fig. 6, the relationships between channels can be considered. The feature map is updated according to the following formula. […]

The overall loss function is expressed as […] where L_p, L_s, and L_ms represent the perceptual loss, smooth L1 loss, and […] loss, respectively. The perceptual loss is expressed as follows.
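A minimal sketch of the SCA idea — self attention applied to the GAP channel vector instead of fully connected layers — might look like the following. Treating each pooled channel value as a one-dimensional token is our simplification, and the projection shapes and the final sigmoid gating are assumptions, not the paper's exact formula.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_channel_attention(F, Wq, Wk, Wv):
    """SCA sketch: GAP the (H, W, C) map to a channel vector, embed each
    channel as a token, and let self attention relate the channels."""
    C = F.shape[-1]
    tokens = F.mean(axis=(0, 1)).reshape(C, 1)        # (C, 1) GAP vector
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv   # (C, d) each
    d = Q.shape[-1]
    att = softmax(Q @ K.T / np.sqrt(d), axis=-1)      # (C, C) channel relations
    w = 1.0 / (1.0 + np.exp(-(att @ V).sum(axis=-1))) # (C,) sigmoid weights
    return F * w                                      # reweight channels
```

The (C, C) attention matrix is the point of the construction: each channel's weight depends on its learned similarity to the other channels, rather than on a fixed fully connected mixing of all of them.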
where φ_j(Ĵ) represents the VGG16 feature map from the j-th layer […]. The smooth L1 loss L_s uses the L1 norm, which is less sensitive to outliers than the MSE loss. In addition, its gradient is smoother, making it less prone to gradient explosion; it is expressed as follows [29]:

L_s = (1/N) Σ_i f(Ĵ_i − J_i),  where  f(x) = 0.5x² if |x| < 1, and |x| − 0.5 otherwise

[…] where α, β_j, and γ_j are the default parameters determined through experiments.
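The smooth L1 term has the standard piecewise form — quadratic inside a threshold, linear outside. A minimal NumPy version, with a generic `beta` threshold (beta = 1 recovers the form above; the parameter name is ours):

```python
import numpy as np

def smooth_l1_loss(pred, target, beta=1.0):
    """Quadratic for small residuals (smooth gradient near zero),
    linear for large ones (less outlier-sensitive than MSE)."""
    r = np.abs(pred - target)
    return np.where(r < beta, 0.5 * r ** 2 / beta, r - 0.5 * beta).mean()
```

A residual of 0.5 falls in the quadratic region (0.5 · 0.5² = 0.125 per element), while a residual of 2.0 falls in the linear region (2.0 − 0.5 = 1.5), which is exactly where the reduced outlier sensitivity comes from.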

In this section, we conduct an ablation study to demonstrate the effectiveness of each of the proposed components by constructing models with and without these modules. We also compare our method quantitatively and qualitatively with conventional dehazing methods.

A. ABLATION STUDY

To investigate the effectiveness of the various architectures proposed in this study, we carried out an ablation study on the NH-HAZE dataset. In addition to the base model (Fig. 7a) and the proposed network (Fig. 2), three other models were developed and tested, as shown in Fig. 7, to better demonstrate the effect of each component.

The base model is shown in Fig. 7a. The encoder is a pre-trained […] to the MH-SCA block.

The results of the experiments using these models are shown in […]. We also show visual comparisons with SOTA methods in Fig. 8, Fig. 9, and Fig. 10. Although there is little visible difference on the synthetic image dataset SOTS, as indicated by Fig. 8, the proposed method is able to recover objects in distant areas that are covered by dense haze and similar in color to the haze.

Fig. 9 shows the results on the real image dataset Dense-HAZE. As can be seen from Fig. 9, most methods are unable to produce high-quality images due to the dense haze, but EMSAN is the closest to the ground truth in terms of object structure. This shows that ME successfully extracts the background features.

[…] the Multi-Head Self Channel Attention block. By devising this new network structure, the proposed EMSAN achieves higher quantitative and qualitative evaluations than previous methods. Future work will explore more lightweight dehazing networks while further improving performance.