CA-GAN: Class-Condition Attention GAN for Underwater Image Enhancement

Underwater images suffer from serious color distortion and detail loss because of wavelength-dependent light absorption and scattering, which seriously degrades subsequent underwater object detection and recognition. The latest methods for underwater image enhancement are based on deep models that learn a mapping function from the underwater image subspace to a ground-truth image subspace, but they neglect the diversity of underwater conditions, which leads to different background colors in underwater images. In this paper, we propose a Class-condition Attention Generative Adversarial Network (CA-GAN) to enhance underwater images. We build an underwater image dataset containing ten categories generated by a simulator with different water attenuation coefficients and depths. Relying on the underwater image classes, CA-GAN creates a many-to-one mapping function for underwater images. Moreover, an attention mechanism is utilized to generate realistic images: in the channel attention block, the feature maps of the front-end and back-end layers are fused along channels, and in the spatial attention block, feature maps are fused pixel-wise. Extensive experiments are conducted on synthetic and real underwater images. The experimental results demonstrate that CA-GAN effectively recovers the color and detail of various underwater scenes and is superior to the state-of-the-art methods.

Inspired by the success of deep learning in object classification and speech recognition, deep models are increasingly applied to image restoration, such as dehazing [9], deraining, deblurring, and image inpainting. As is well known, a good deep model requires a large labeled training dataset. However, only a few underwater image datasets exist so far, so there is still much room to improve the performance of underwater image enhancement. Recently, the Generative Adversarial Network (GAN) [3] has risen to prominence, and GANs are recognized to favor visual effect over other deep-learning-based models. Gradually, GANs have been employed for underwater image enhancement. Most existing GAN-based methods treat underwater image enhancement as an image-to-image translation [10], [18] and have achieved promising results. However, the existing GAN models for underwater image enhancement are limited in three respects: 1) the underwater image datasets are deficient in diversity; 2) the existing methods usually learn only one mapping function for all underwater images, neglecting that underwater imaging varies with water depth [11]; 3) the existing GAN models can still be improved with the development of GAN techniques.
In this paper, we build an underwater image dataset in which underwater images are synthesized by a simulator based on a physical optical model with randomly generated depth parameters. Our dataset contains 70,000 image pairs, classified into 10 classes according to water depth. Moreover, we propose a Class-condition Attention GAN (CA-GAN) for underwater image enhancement, in which an underwater image is classified first and the class label then guides the generation of the enhanced image. Specifically, we design a water class embedding block (WCEB) that normalizes different classes of underwater images toward an in-air image. Furthermore, we apply an attention mechanism to the generator to amplify the relevant features and suppress the irrelevant ones for underwater image enhancement. A concurrent channel and spatial attention feature fusion block (CS-AFFB) is employed to recalibrate the feature maps of shallow and deep layers.
The contributions are summarized as follows: • We propose a class-conditioned attention GAN for underwater image enhancement. The class embedding block normalizes the color of an underwater image. With the help of class prediction, our GAN model focuses on mapping different classes of underwater images to an in-air image.
• We build a new dataset that contains 70,000 image pairs in ten classes, classified according to water depth.

II. RELATED WORKS
Underwater image enhancement is a challenging yet important task, as underwater images suffer from dramatic degradation of visibility and color distortion. There are mainly two classes of methods for underwater image enhancement: hand-crafted prior based methods and deep models.

A. HAND-CRAFTED PRIORS BASED METHODS
The traditional methods mainly depend on hand-crafted priors together with the physical scattering model. Yang et al. [21] use the dark channel prior to estimate the depth map of an underwater image and adopt a color contrast method to enhance contrast. Carlevaris et al. [22] compare the maximum intensity among different color channels to estimate the scene depth, and use a MAP estimator to recover the scene. Different from [21] and [22], Peng et al. [23] utilize image blurriness and light absorption to estimate the scene depth, which produces better enhancement results under different light conditions and color tones. Li et al. [24] restore an underwater image by minimizing the information loss, and then enhance the contrast and brightness based on the histogram distribution. Yang et al. [25] first apply histogram equalization to an underwater image and then decompose the result into structure and texture layers based on total variation and L1-norm minimization, followed by enhancing the contrast of the structure layer and denoising the texture layer. Ancuti et al. [25], [30] unify intensity correction with edge sharpening in a multi-scale fusion manner. The hand-crafted prior based methods are deficient in generalization: when their assumptions are not satisfied, they cannot achieve good performance. Thus, in real-world scenes they often perform poorly, producing undesired artifacts, color distortion, and so on.

B. DEEP MODELS FOR UNDERWATER IMAGE ENHANCEMENT
Recently, many deep and GAN-based methods have been employed for underwater image enhancement. In [11], UIE-Net was proposed to enhance underwater images; it contains a Color Correction subnetwork (CC-Net) that estimates the attenuation coefficient and a Haze Removal subnetwork (HR-Net) that estimates the transmission map, after which the enhanced image is obtained from the underwater image formation model. Different from [11], many works enhance underwater images based on CycleGAN, which enables unpaired image-to-image translation. Li et al. [19] proposed WaterGAN to generate underwater image sets relying on a physical model of underwater imaging, and train a two-stage image restoration network for color correction using the synthetic images. Although WaterGAN achieves state-of-the-art results, it needs to train separate models for different water scenes and cannot cope with diverse water environments. Fabbri et al. [18] directly adopt CycleGAN to generate paired images from subsets of ImageNet [26] and train the Underwater GAN (UGAN) restoration network with L1 loss and Gradient Difference Loss as the objective function. Li et al. [10] designed a GAN model to map real underwater images to in-air images by minimizing the adversarial loss, cycle consistency loss, and SSIM loss, making the content and structure of the recovered image look as if taken without water. Different from these works, we use in-air images and their corresponding depth maps to synthesize underwater images and construct a dataset containing ten types of images at various water depths. We also propose a GAN model embedded with water class information that maps various classes of underwater images to one in-air scene with a single trained model.

III. DATASET CONSTRUCTION
We employ an underwater imaging model similar to [1] to synthesize underwater images, formulated as

I_c(p) = J_c(p) t_c(p) + A_c (1 − t_c(p)), (1)

where c ∈ {R, G, B}, I_c(p) is the captured underwater image in color channel c, J_c(p) is the scene radiance, A_c is the homogeneous global veiling-light component, and p is a pixel location in the underwater scene. t_c(p) is the transmission of color channel c, which depends on the scene distance d(p) and the water attenuation coefficient β_c:

t_c(p) = e^{−β_c d(p)}. (2)

Different from the image dehazing model [9], β_c is wavelength-dependent and is affected by variations of season, geography, and climate, causing objects to appear blue, green, or yellow.
To synthesize the underwater images, we use the attenuation coefficients measured by Jerlov and colleagues [6] during the Swedish Deep Sea Expedition of 1947–48. They cover five coastal water types (1, 3, 5, 7, 9) and five oceanic water types (I, IA, IB, II, III). Coastal water 1 is the clearest and water 9 the most turbid; likewise, oceanic water I is the clearest and water III the most turbid.
We build the Realistic Single Underwater Image Enhancement (RSUIE) dataset based on the RESIDE dataset [7], a large-scale benchmark dataset for image dehazing algorithms that contains 8,970 natural scene images and their corresponding depth maps generated by [5]. To adapt it to the problem of underwater image enhancement, we screen the original RESIDE dataset and delete images that look too dark or too foggy, finally obtaining 7,000 clean images, from which our dataset of 70,000 underwater images is synthesized.
We apply Eq.(1) and Eq.(2) to RESIDE to synthesize the 10 types of coastal-water and oceanic-water images. The coastal water classes suffer huge attenuation in deep water; for example, for water type 9 beyond 5 m and water type 3 beyond 10 m, objects become almost invisible. Conversely, some water classes have small attenuation in shallow water, e.g., water types I, IA, and IB at 1 m to 5 m, where the water has little effect on objects. Therefore, we set different depth ranges for different water classes: [1, 5] m for water types 9, 7, and 5; [1, 15] m for water types 3 and 1; and [5, 20] m for water types I, IA, IB, II, and III. At the same time, we select a random global veiling light A_c ∈ [0, 1]. Finally, we use the first 6,600 clean color images and depth maps to synthesize the training set and the remaining 400 to synthesize the validation (test) set. The RSUIE dataset therefore contains a training set of 66,000 samples and a validation (test) set of 4,000 samples. Figure 4 shows the 10 different samples generated from one RGB image and its corresponding depth map.
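The synthesis procedure of Eq.(1) and Eq.(2) can be sketched as follows. This is a minimal NumPy sketch; the β and A values below are illustrative placeholders, not the measured Jerlov coefficients.

```python
import numpy as np

def synthesize_underwater(J, depth, beta, A):
    """Apply the underwater imaging model of Eq.(1)-(2):
    I_c(p) = J_c(p) * t_c(p) + A_c * (1 - t_c(p)),  t_c(p) = exp(-beta_c * d(p)).

    J     : clean image, float array of shape (H, W, 3) in [0, 1]
    depth : scene depth map in meters, shape (H, W)
    beta  : per-channel attenuation coefficients (R, G, B), shape (3,)
    A     : global veiling light per channel, shape (3,)
    """
    # Per-channel transmission from depth and attenuation coefficient (Eq. 2)
    t = np.exp(-depth[..., None] * np.asarray(beta)[None, None, :])
    # Attenuated scene radiance plus veiling light (Eq. 1)
    return J * t + np.asarray(A)[None, None, :] * (1.0 - t)

# Toy 2x2 scene: red attenuates fastest, mimicking a blue-green water type.
J = np.full((2, 2, 3), 0.8)
d = np.full((2, 2), 5.0)  # 5 m scene depth everywhere
I = synthesize_underwater(J, d, beta=[0.40, 0.10, 0.05], A=[0.2, 0.6, 0.7])
```

Because β_R > β_B here, the red channel of the synthesized image is attenuated far more than the blue channel, which is exactly the color cast the dataset is designed to cover.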

IV. PROPOSED METHOD
The goal of this paper is to obtain the clear image J given an underwater image I and its corresponding water type c as inputs. The enhancement is performed by a Class-condition Attention Generative Adversarial Network (CA-GAN); the details of the proposed network architecture are shown in Fig. 1. For each image I, the generator estimates its corresponding clear image J. The generator of CA-GAN contains two strided convolution blocks with stride 2, nine residual blocks, and two upsampling blocks with 3×3 convolution layers. Every convolution layer except the last one in CA-GAN is followed by spectral normalization [13], a water class embedding block (WCEB), and a ReLU activation layer. The spectral normalization layer used here not only avoids unusual gradients but also reduces the number of discriminator updates [13]. We supply the generator with class-conditional gains and biases in the WCEB to map different types of underwater images to one clear natural scene image. In particular, to recalibrate the front-end feature maps produced in the encoder layers and the back-end feature maps produced in the decoder layers, we introduce a concurrent channel and spatial attention feature fusion block (CS-AFFB). In addition, during the training phase, we apply the SNGAN discriminator [13] and train both networks in an adversarial manner.
In the following subsections, we introduce the details of WCEB and CS-AFFB in Section 4.1 and Section 4.2, respectively. Then, we describe our loss functions, which allow the network to simultaneously recover the general content of underwater images and enhance their details.

A. WATER CLASS EMBEDDING BLOCK
The class information can be fed into the network in various ways. In [14], Odena et al. concatenate a one-hot class vector with the input vector and maximize the log-likelihood of the real image and the log-likelihood of the correct class. In [15], conditional instance normalization was proposed to generate images in completely different styles by feeding the generator with class-conditional gains and biases in the instance normalization layer.
In this paper, we propose a water class embedding block (WCEB) to constrain the feature space of every class of underwater images. As shown in Figure 2, we first encode the water class into a one-hot vector. Next, we use a fully connected layer to map the encoded vector to a 48-dimensional vector. Then two fully connected layers map the 48-dimensional vector to the gain γ and bias β, respectively. Finally, we obtain the class-conditional instance normalization result

WCEB(x, c) = γ (x − μ(x)) / σ(x) + β,

where x is the feature map, c is the class of the input image, and μ(x) and σ(x) are the mean and standard deviation of the feature map, computed across spatial dimensions independently for each channel and each sample.
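The WCEB computation can be sketched as a class-conditional instance normalization. This is a minimal NumPy sketch; the randomly initialized `W_embed`, `W_gamma`, and `W_beta` stand in for the learned embedding and fully connected layers and are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_CLASSES, EMBED_DIM, CHANNELS = 10, 48, 64

# Stand-in weights for the class embedding and the two FC heads
# (these are learned in the real network).
W_embed = rng.normal(0, 0.1, (NUM_CLASSES, EMBED_DIM))
W_gamma = rng.normal(0, 0.1, (EMBED_DIM, CHANNELS))
W_beta = rng.normal(0, 0.1, (EMBED_DIM, CHANNELS))

def wceb(x, c):
    """Class-conditional instance normalization:
    WCEB(x, c) = gamma_c * (x - mu(x)) / sigma(x) + beta_c,
    with mu, sigma computed per channel over spatial dimensions.

    x : feature map of shape (C, H, W); c : integer water-class label.
    """
    onehot = np.zeros(NUM_CLASSES)
    onehot[c] = 1.0                            # one-hot class vector
    e = onehot @ W_embed                       # 48-dimensional class embedding
    gamma, beta = e @ W_gamma, e @ W_beta      # per-channel gain and bias
    mu = x.mean(axis=(1, 2), keepdims=True)
    sigma = x.std(axis=(1, 2), keepdims=True) + 1e-5
    return gamma[:, None, None] * (x - mu) / sigma + beta[:, None, None]

x = rng.normal(0, 1, (CHANNELS, 8, 8))
y = wceb(x, c=3)
```

Note how the same feature map normalized under two different class labels yields different outputs: this is precisely how the class label steers the generator toward one in-air target.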

B. CONCURRENT CHANNEL AND SPATIAL ATTENTION FEATURE FUSION
To combine the front-end feature maps with the back-end feature maps produced by CA-GAN, we introduce the concurrent Channel and Spatial Attention Feature Fusion block (CS-AFFB). CS-AFFB consists of a Channel Attention Feature Fusion Branch (C-AFFB) and a Spatial Attention Feature Fusion Branch (S-AFFB), which allow us to recalibrate the front-end and back-end feature maps along the channel and spatial dimensions. Fig. 3 illustrates the inner structure of CS-AFFB.

1) CHANNEL ATTENTION FEATURE FUSE BRANCH
We introduce the channel attention feature fusion branch (C-AFFB) to emphasize the important channels and suppress the less important ones. We consider the given front-end input feature map (represented by the dotted line) F_l = [f_l1, ..., f_li, ..., f_lC] and back-end input feature map (represented by the solid line) F_h = [f_h1, ..., f_hi, ..., f_hC] as combinations of channels f_li ∈ R^{H×W} and f_hi ∈ R^{H×W}, respectively, where C is the number of channels and W and H are the width and height of the feature map. As shown in Fig. 3, the channel feature fusion branch reweights the channels of the front-end feature map and fuses it with the back-end feature map. First, the two feature maps F_l and F_h are concatenated. Next, the concatenated feature map is transformed into a 1 × 1 × C tensor by a global average pooling layer followed by two fully connected layers and a ReLU layer. Then, the weight vector W_C = [w_1, w_2, ..., w_C] for F_l is constrained to [0, 1] by a Sigmoid layer. Finally, the weighted F_l is added to F_h to obtain the channel fusion feature map F_cf = W_C ⊗ F_l + F_h, where ⊗ denotes channel-wise multiplication.
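The steps above can be sketched as follows. This is a minimal NumPy sketch; the random `W1`/`W2` weights stand in for the learned fully connected layers, and the `C // 2` bottleneck width is an assumption mirroring the usual squeeze-and-excitation design, not a value stated in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
C, H, W = 16, 8, 8

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Stand-in FC weights (learned in the real network).
W1 = rng.normal(0, 0.1, (2 * C, C // 2))
W2 = rng.normal(0, 0.1, (C // 2, C))

def c_affb(F_l, F_h):
    """Channel attention feature fusion:
    concat -> global average pool -> FC -> ReLU -> FC -> Sigmoid yields
    per-channel weights W_C in [0, 1]; output F_cf = W_C * F_l + F_h."""
    cat = np.concatenate([F_l, F_h], axis=0)        # (2C, H, W)
    pooled = cat.mean(axis=(1, 2))                  # global average pool -> (2C,)
    w = sigmoid(np.maximum(pooled @ W1, 0.0) @ W2)  # channel weights in [0, 1]
    return w[:, None, None] * F_l + F_h

F_l = rng.normal(0, 1, (C, H, W))
F_h = rng.normal(0, 1, (C, H, W))
F_cf = c_affb(F_l, F_h)
```

Because every channel weight lies in [0, 1], the fused map never deviates from F_h by more than the magnitude of the corresponding F_l entry, so the back-end features are preserved while the front-end contribution is gated.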

2) SPATIAL ATTENTION FEATURE FUSE BRANCH
We introduce the spatial attention feature fusion branch (S-AFFB) to squeeze the feature map spatially for fine-grained underwater image enhancement. Here we consider the front-end and back-end feature maps as combinations of spatial locations f_ls^{i,j} ∈ R^{1×1×C} and f_hs^{i,j} ∈ R^{1×1×C}. The spatial feature weight W_s = [w_s^{1,1}, ..., w_s^{i,j}, ..., w_s^{H,W}] is obtained through a 1×1 convolution layer followed by a Sigmoid layer, and the spatial fusion feature map F_sf is formulated as F_sf = W_s ⊗ F_l + F_h, where ⊗ denotes pixel-wise multiplication.
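The spatial branch can be sketched analogously to the channel branch. This is a minimal NumPy sketch; the random `W_conv` vector stands in for the learned 1×1 convolution, and the fusion form F_sf = W_s ⊗ F_l + F_h mirrors the channel branch by symmetry.

```python
import numpy as np

rng = np.random.default_rng(2)
C, H, W = 16, 8, 8

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Stand-in 1x1-convolution weights over the concatenated channels
# (learned in the real network).
W_conv = rng.normal(0, 0.1, (2 * C,))

def s_affb(F_l, F_h):
    """Spatial attention feature fusion: a 1x1 convolution over the
    concatenated maps followed by a Sigmoid yields one weight per pixel
    W_s in [0, 1]; output F_sf = W_s * F_l + F_h (per-location gating)."""
    cat = np.concatenate([F_l, F_h], axis=0)          # (2C, H, W)
    w = sigmoid(np.einsum('c,chw->hw', W_conv, cat))  # (H, W) spatial weights
    return w[None, :, :] * F_l + F_h

F_l = rng.normal(0, 1, (C, H, W))
F_h = rng.normal(0, 1, (C, H, W))
F_sf = s_affb(F_l, F_h)
```

Whereas the channel branch shares one weight per channel across all pixels, this branch shares one weight per pixel across all channels; the two views are complementary, which motivates combining them in CS-AFFB.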

3) CONCURRENT CHANNEL AND SPATIAL ATTENTION FEATURE FUSION
Finally, we obtain the concurrent channel and spatial attention fusion feature map F_csf by element-wise addition of F_cf and F_sf: F_csf = F_cf + F_sf. This fusion method encourages the network to pay more attention to meaningful features, which have higher activations both channel-wise and spatial-wise.

C. LOSS FUNCTION
We formulate the loss function as a combination of the Mean Squared Error (MSE) loss, the feature-based loss L_F, and the adversarial loss L_G. The MSE loss and feature-based loss focus on enhancing general content, while the adversarial loss focuses on enhancing texture details.

1) MSE LOSS
The MSE loss between the generated image I_en and the target image I_t is

L_MSE = (1 / (W H)) Σ_{x=1}^{W} Σ_{y=1}^{H} (I_en(x, y) − I_t(x, y))^2. (4)

2) FEATURE-BASED LOSS
The feature-based loss measures the difference between the CNN feature maps of the generated and target images. L_F is defined as

L_F = (1 / (C W H)) ||F(I_en) − F(I_t)||_2^2, (5)

where F represents a non-linear CNN transformation and the size of the feature map is W × H with C channels. In this paper, the feature map is obtained from the 3rd convolution layer before the 3rd pooling layer of the VGG19 model.
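The loss in Eq.(5) can be sketched independently of any particular backbone. This is a minimal NumPy sketch; `toy_extract` is a hypothetical stand-in for the VGG19 feature extractor, used only so the example is self-contained.

```python
import numpy as np

def feature_loss(I_en, I_t, extract):
    """Feature-based loss L_F = (1 / (C*W*H)) * || F(I_en) - F(I_t) ||_2^2,
    where F is a CNN feature extractor (VGG19 conv features in the paper).
    `extract` is passed in so any feature network can be plugged in."""
    f_en, f_t = extract(I_en), extract(I_t)
    # Squared L2 distance, normalized by the number of feature elements.
    return np.sum((f_en - f_t) ** 2) / f_en.size

# Hypothetical stand-in extractor (a real model would use VGG19 features):
def toy_extract(img):
    return img[::2, ::2]  # crude downsampling in place of conv features

a = np.ones((8, 8, 3))
b = np.zeros((8, 8, 3))
loss = feature_loss(a, b, toy_extract)
```

In practice one would freeze the pretrained VGG19, run both images through it, and apply the same normalized squared distance to the chosen layer's activations.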

3) ADVERSARIAL LOSS
In CA-GAN, the discriminator and generator are trained in an alternating fashion by minimizing the hinge version of the adversarial loss:

L_D = E[max(0, 1 − D(I_t))] + E[max(0, 1 + D(I_en))],
L_G = −E[D(I_en)]. (6)

Combining (4), (5), and (6), we obtain our final objective function

L = L_MSE + λ_F L_F + λ_G L_G, (7)

where λ_F and λ_G are weighting parameters that control the trade-off among the MSE loss, the feature-based loss, and the adversarial loss.
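The hinge losses in Eq.(6) can be sketched directly. This is a minimal NumPy sketch over toy discriminator scores; the array values are illustrative only.

```python
import numpy as np

def d_hinge_loss(d_real, d_fake):
    """Discriminator hinge loss (SNGAN-style):
    L_D = E[max(0, 1 - D(real))] + E[max(0, 1 + D(fake))]."""
    return (np.mean(np.maximum(0.0, 1.0 - d_real))
            + np.mean(np.maximum(0.0, 1.0 + d_fake)))

def g_hinge_loss(d_fake):
    """Generator hinge loss: L_G = -E[D(fake)]."""
    return -np.mean(d_fake)

# Toy discriminator scores: real samples scored high, fakes scored low.
d_real = np.array([1.5, 0.5])
d_fake = np.array([-1.2, -0.8])
ld = d_hinge_loss(d_real, d_fake)
lg = g_hinge_loss(d_fake)
```

The hinge clips the discriminator's incentive once a sample is confidently classified (margin beyond ±1 contributes nothing), which is the property that pairs well with spectral normalization.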

V. EXPERIMENTAL RESULTS
In this section, we present the implementation details and evaluation results on synthetic and real underwater images. The proposed CA-GAN algorithm is compared with several state-of-the-art underwater image enhancement methods on both synthetic and real-world underwater images. These methods can be separated into prior-information-based methods and deep-CNN-based methods. The prior-information-based methods include the Histogram Distribution Prior (HDP) [24], a Retinex-based enhancing approach (Retinex) [29], Fusion enhancement (Fusion) [30], and White Balance Correction and Image Decomposition (WBCID) [31]. The deep-CNN-based methods are DPATN [27], UGAN-P [18], Water-Net [35], and Un-GAN [33]. We run all the comparison methods using their recommended parameter settings; in particular, we test UGAN-P and Un-GAN using their provided models trained on their own datasets.

A. IMPLEMENTATION DETAILS
In our experiments, CA-GAN was implemented using the PyTorch deep learning framework and executed on a computer with an Intel(R) Core(TM) i7-7700 CPU @ 4.2 GHz and an Nvidia GTX 1080Ti GPU. During training, we use random rotation and horizontal flipping for data augmentation, and images in the RSUIE training set are resized to 256 × 256. We train the proposed CA-GAN in an alternating manner that performs one gradient descent step on the discriminator and then one step on the generator, using the Adam optimizer with initial learning rate lr = 0.0002, β_1 = 0.5, β_2 = 0.999. The batch size is set to 1, the weighting parameter λ_F is set to 0.5, and λ_G is set to 0.1. At test time, enhancement takes about 0.14 s for a 512 × 512 image on the above-mentioned machine, i.e., about 7 FPS. In particular, the underwater images belong to 10 different classes corresponding to the synthesized data, and we train a VGG16 classifier on the RSUIE training set to predict the water class of an input image at test time. The test accuracy of the VGG16 classifier is 95.68% on the RSUIE test set.

B. ABLATION STUDY
In this subsection, we demonstrate the effectiveness of the proposed WCEB and CS-AFFB in CA-GAN through an ablation study. Both quantitative and qualitative evaluations on the RSUIE test set are reported by removing the WCEB (-woWCEB) and CS-AFFB (-woCS-AFFB) components, respectively. Due to space limitations, we only tabulate the average PSNR, SSIM, and MSE results on the Type-9, Type-3, and Type-II synthetic images of the RSUIE test set in Table 1, as these types show clearly different colors. From Table 1, we can see that feeding the water type information into the network through the WCEB significantly improves the performance of underwater image enhancement, especially for the turbid Type-9 subset. We also observe that CS-AFFB achieves better performance than other feature fusion strategies. A visual comparison is presented in Fig. 5. As we can see, the model without WCEB cannot enhance all types of images, while the model with CS-AFFB produces much more realistic results with fewer artifacts.

C. RESULTS ON SYNTHETIC DATASET
The performance of CA-GAN on the RSUIE synthetic dataset is evaluated in terms of PSNR, SSIM, and MSE. We report the average scores for the 10 different classes of underwater images in Table 2 and bold the best results. As we can see, Un-GAN and Fusion are effective only on the coastal water classes, i.e., Type-1 and Type-3, while HDP, HLP, and WBCID can hardly achieve good performance on any class of underwater images. On the contrary, the proposed CA-GAN achieves the best PSNR, SSIM, and MSE values on various classes of underwater images, which indicates that our CA-GAN better restores the content, structure, and texture information of an underwater image.
Furthermore, Fig. 6 presents a visual comparison for the 10 different classes of underwater images in the RSUIE synthetic test set. It is visible that most methods cannot recover the appropriate scene colors and produce color deviations. Among them, DPATN introduces halo artifacts and color distortion in most images, and the Fusion method obviously improves the contrast but over-enhances the red channel in some images. In comparison, the results of the proposed CA-GAN have more realistic colors that are closer to the ground-truth images, and it is effective even for severely degraded deep-water images.

D. RESULTS ON REAL-WORLD UNDERWATER IMAGES
FIGURE 6. Visual comparison of underwater image enhancement results for 10 types of water from the RSUIE test set. From top to bottom: Type-9, Type-7, Type-5, Type-3, Type-1, Type-III, Type-II, Type-IB, Type-IA, and Type-I.

We employ the underwater image quality set (UIQS) [28] and the underwater color cast set (UCCS) [28], two recent benchmark datasets, to evaluate the performance of CA-GAN on real-world underwater images. The images in RUIE were taken in the water near Zhangzi Island in the Yellow Sea; the water depth varies from 5 to 9 meters, and the scene depths range from 0.5 to 8 meters. Specifically, UIQS contains 3,630 images captured in various underwater conditions and is used to test the algorithms' improvement of image visibility, while UCCS contains 300 images with bluish, greenish, and blue-green tones and is used to test the algorithms' ability to correct color cast.
The qualitative comparison results are presented in Fig. 7. We can observe that the proposed CA-GAN gives the best visual experience on all images: it not only corrects the color cast but also enhances the visibility of details. In contrast, WBCID has almost no effect on the various underwater images. HDP, HLP, Fusion, and DPATN can partially enhance the underwater images but can hardly recover the scene colors, and the results of Un-GAN tend toward a yellowish color.
As ground-truth images are unavailable for real underwater images, we employ the PI [12], a computable index that has been verified to be effective in the single image super-resolution task, to evaluate the visual quality; the smaller the PI, the better the result. From Fig. 7, we can also observe that CA-GAN achieves the best PI among the state-of-the-art methods.
We also present visual results on several widely used real-world test images in Fig. 8. For a fair comparison, we directly download the results of Fusion17 [25] and HLP18 [34] from their project pages. Fig. 8 also demonstrates the effectiveness of CA-GAN on underwater images of various color tones. As we can see, unlike Fusion17, which over-enhances image details and noise, the proposed CA-GAN properly restores the severe detail loss of underwater images. At the same time, the results of the proposed CA-GAN display more realistic scene colors and fewer artifacts compared with HLP18.

VI. CONCLUSION
In this paper, we have introduced a class-condition attention generative adversarial network (CA-GAN) for underwater image enhancement. It employs a water class embedding block to map different types of underwater images to one clear natural scene image, and introduces a concurrent channel and spatial attention feature fusion block to recalibrate the front-end feature maps produced in the encoder layers and the back-end feature maps produced in the decoder layers. Experimental results on both synthetic and real underwater images demonstrate that CA-GAN effectively recovers the color and detail of various underwater scenes and is superior to the state-of-the-art methods.