Denoising of Maritime Unmanned Aerial Vehicles Photography Based on Guidance Correlation Pixel Sampling and Aggregation

In this paper, we propose a guidance correlation pixel sampling and aggregation image denoising method for maritime Unmanned Aerial Vehicle (UAV) photography, providing a reliable data basis for maritime reconnaissance work. The overall pipeline consists of two parts: pixel sampling and pixel aggregation. To improve the content correlation of sampled pixels, we propose a guided sampling scheme based on a basic estimated map and extend the algorithm to the restoration of maritime UAV images. Finally, a UAV image denoising system is presented. Experimental results show that the proposed algorithm effectively removes noise, achieving 32.98 dB and 31.83 dB on the Set12 and BSD68 datasets with less image distortion. In the actual scene, the PSNR of our denoising algorithm reaches 35.33 dB, meeting the basic needs of practical vision and follow-up research.


I. INTRODUCTION
Driven by economic globalization, trade and communication between countries have become more frequent, and the requirements for the transportation of goods are becoming higher and higher [1], [2]. Owing to its large capacity and low cost, ship waterway transportation holds irreplaceable importance in the field of transportation. However, maritime incursions occur from time to time. To address this problem, ships can be identified using aerial photography. With advances in energy, materials, and digital technologies, camera-equipped Unmanned Aerial Vehicles (UAVs) are widely used for aerial photography and maritime reconnaissance, due to their stability and safety. UAVs can perform tasks in place of humans in some tough environments.
At present, UAV maritime patrol is gradually on the rise, which makes it possible to conduct maritime investigation by UAVs. The identification of maritime intrusions by UAVs depends on computer vision and information science technology [3]. In the object detection task, the Region Proposal Network (RPN) first searches for Regions of Interest (ROIs) that contain objects in the image; the final recognition results are then obtained through ROI feature extraction, bounding box regression, and classification [4]. However, the performance of an object detection network depends heavily on image quality. Degraded images can have devastating effects on detection results, leading to security issues and economic losses. Affected by the natural environment, aerial images taken by UAVs may be degraded by a variety of factors, such as fog, rain, noise [5], and low light [6]. These degrading factors directly affect the UAV's identification performance. Affected by noise, the recognition algorithm cannot accurately capture the information in the image, resulting in false positives or missed alarms. Therefore, it is indispensable to research and design image enhancement and restoration algorithms for UAV imaging systems. This paper focuses on the denoising of maritime UAV photography.
As an important component of computer vision and pattern recognition technology, image denoising is a fundamental and active low-level problem in image processing (e.g., [7]-[10]). Image quality affects high-level computer vision tasks such as semantic segmentation and object detection; image denoising has therefore been widely studied. Image denoising has developed through frequency- and spatial-domain filtering, sparse coding, and deep learning techniques. At present, model-based denoising has become the mainstream, and plenty of approaches have been proposed, such as DnCNN [11], FFDNet [12], CBDNet [13], BM3D [14], and NLNet [15]. However, denoising for UAV aerial photography has not been widely studied or commercialized.
As one of the core computer science technologies, deep learning is widely used in image restoration. Deep learning models estimate a potential clean image from a noisy image. Through supervised training on large amounts of labeled data, a deep network can learn the mapping from noisy to clean images. Although these methods achieve end-to-end training and satisfactory performance, the spatially-invariant convolution kernel makes the network less flexible to different inputs, and the overall performance depends heavily on the training data. In addition, during end-to-end training, the spatially-invariant convolution kernel automatically adapts to the noisy training inputs under the labeled data, which may lead to a lack of texture in the denoising results.
Recently, a denoising scheme based on correlated pixel aggregation has been proposed [16]. This technique is analogous to constructing an image denoising dataset by averaging N repeated observations of the same scene. By predicting coordinate offsets, the network samples each pixel's neighborhood to emulate repeated observations, and weighted-averages the observed values to obtain a basic noise-free estimate of the original pixel. However, the performance of the network depends considerably on the accuracy of coordinate offset estimation and weight allocation. As the noise increases, the offset prediction network tends to search pixels from a wider area. This increases the likelihood that unrelated pixels will be selected. As a result, the denoising result suffers from distortion.
Unlike [16], which works in the raw domain, our model works in the grayscale and color domains. To deal with more challenging inputs, we propose to correct the offsets through a secondary sampling of the basic noise-free estimate. Concretely, we introduce a basic-denoising map as the basic noise-free estimate, and predict the offset of each central pixel on the basic estimate rather than on the noisy image. Since the basic estimate has less distortion than the noisy image, this avoids widening the sampling area under high-noise interference while obtaining more relevant sampled pixel values. Experimental results show that the proposed scheme finds better sampling locations for content-relevant pixels. Our method achieves better noise suppression and less image distortion than previous convolution-filtering denoising algorithms, thanks to the pixel aggregation operation replacing a large stack of convolutional layers.
Since the human eye is most sensitive to image brightness, our offset estimation is carried out on the grayscale image. Based on this prior, in the process of color aerial photograph denoising, we first convert the color space from RGB to YCbCr and estimate the offset on the Y component. Finally, we retrain the aggregation weights for color images.
To sum up, our contributions are summarized as follows:
• A novel denoising algorithm for UAV photography is presented by considering both offset estimation and pixel aggregation.
• Different from block matching, our similar-pixel matching samples the entire input image at the pixel level, which greatly improves matching efficiency.
• Since the denoising performance of the network depends on the estimated correlated pixels, we propose a coordinate offset estimation scheme based on the basic-denoising image that improves the sampling effect and thus the denoising results.
• We adopt a multi-step training strategy; experimental results show that our approach achieves good performance in terms of both quantitative metrics and visual quality.
This paper is organized as follows. Section II introduces the problem this paper aims to solve and reviews related work. Section III presents our proposed method. In Section IV, extensive experiments are reported to validate the effectiveness of our method. Finally, concluding remarks are given in Section V.

II. RELATED WORK
A. IMAGE DENOISING
For traditional non-learning image denoising, methods based on similar-image-block aggregation, such as NLM [7] and BM3D [14], have been proposed and are well developed for Gaussian noise removal. The non-local means (NLM) [7] searches for similar blocks in a search window around the target block and allocates weights by block similarity. The final noise-free estimate is obtained by a weighted combination of similar blocks. The BM3D [14] process is divided into basic estimation and final estimation. In the first stage, the basic estimate is obtained by a hard-thresholding method. In the second stage, the basic estimate guides a second search for similar blocks, and the aggregation of collaborative filtering results serves as the final denoised estimate.
In recent years, end-to-end learning-based methods have been developed. DnCNN [11] first proposed training a deep convolutional network for image denoising and achieved good results in Gaussian denoising. This baseline method predicts the noise residual by stacking multiple convolutional and BatchNorm layers, improving denoising performance over traditional methods. FFDNet [12] considers the generalization from Gaussian noise to more complex real noise and takes a noise level map as part of the network input, further balancing noise reduction and detail preservation. CBDNet [13] achieves blind denoising of Gaussian and real noise by predicting the noise level map. BRDNet [17] uses two sub-networks to increase network width and obtain more precise features, improving the final denoising performance. Meanwhile, batch renormalization (BRN) [18] and dilated convolution improve the convergence of non-independent identically distributed mini-batch training and enlarge the receptive field. BMCNN [19], a CNN-based 3D filtering method, processes the 3D block-matched noisy group with 1 × 1 or 3 × 3 convolutional networks according to the known noise level. [20] proposes a generative adversarial network (GAN) based framework to complete noise modeling and address the problem of blind image denoising. In addition, learning-based methods have also made significant progress in medical image denoising [21], [22]. For image noise modeling and estimation, [23] proposes a data-driven normalizing flow model based on the Glow architecture [24] that can estimate the density of a real noise distribution; it synthesizes training data for a denoising CNN, resulting in significant improvements in PSNR.

B. PIXEL AGGREGATION DENOISING
Different from block matching, the Pixel Aggregation Network (PAN) [16] aims to find a series of relevant pixels around each pixel of the image, and takes the weighted average of the relevant pixels as the final noise-free estimate. Conventional convolution uses fixed weights for different input images; since PAN produces unique sampling coordinates for each pixel of each image, it is more flexible. Similarly, [25] introduces deformable convolution [26] to adapt to spatial textures and edges for the image denoising task.
The offsets of deformable convolution need to be quantized to integers, while the offsets in pixel sampling can take floating-point values, with the sampled value obtained by bilinear interpolation [27]. Therefore, the latter has higher coordinate flexibility.

III. PROPOSED METHOD
This section presents our proposed method. Overall, our UAV image denoising model is built and trained in three steps. In the first step, we build and pre-train a Pixel Aggregation Network (PAN) for grayscale images. In the second step, to obtain better offset and weight estimates, we adopt the output of the PAN as a feature map to guide a second sampling and weighted averaging of the input noisy image. In the third step, we extend the denoising model from the grayscale domain to the sRGB domain, realize UAV image denoising, and propose the Maritime image denoising method based on a Guidance PAN (MGPAN).

A. PIXEL SAMPLING AND AGGREGATION
The center pixel and its surrounding pixels are content-dependent and numerically dependent. Based on this prior knowledge, we design a correlated pixel sampling model for each center pixel of the denoised image. To describe the position of pixels in the image, we use grid coordinates, as shown in Figure 1: the position of each pixel is represented by an x-coordinate and a y-coordinate. Therefore, to find the value of a pixel relevant to a target pixel, we only need to estimate the offsets of the x and y coordinates between the relevant pixel and the target pixel. The pixel sampling operation is illustrated in Figure 2: given a coordinate offset, and taking a 3 × 3 sampling matrix as an example, the red and green squares in coordinate(x) and coordinate(y) represent the coordinates of the pixels related to the pixel at that position, and the red squares and green dots in the image matrix represent the sampled pixels. Note that for fractional coordinates, we perform bilinear interpolation to obtain the sub-pixel value as the sampled relevant pixel value.
A rigid grid with the same size as the image is defined in advance. We combine the normalized offsets and the rigid grid to form the sampling coordinate map. The sampling coordinates of each pixel position consist of an x-coordinate and a y-coordinate. According to each sampling coordinate, bilinear interpolation is used to obtain the final sampled pixel value at fractional coordinates.
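The grid construction and bilinear sampling described above can be sketched as follows. This is a minimal NumPy sketch under our own assumptions: the function names and the (n, 2, H, W) layout of the predicted offsets are illustrative, not the paper's notation.

```python
import numpy as np

def bilinear_sample(img, ys, xs):
    """Sample a 2-D image at fractional coordinates via bilinear interpolation."""
    h, w = img.shape
    y0 = np.clip(np.floor(ys).astype(int), 0, h - 2)
    x0 = np.clip(np.floor(xs).astype(int), 0, w - 2)
    dy, dx = ys - y0, xs - x0
    top = img[y0, x0] * (1 - dx) + img[y0, x0 + 1] * dx
    bot = img[y0 + 1, x0] * (1 - dx) + img[y0 + 1, x0 + 1] * dx
    return top * (1 - dy) + bot * dy

def sample_group(img, offsets):
    """offsets: (n, 2, H, W) predicted (dy, dx) per pixel; returns (n, H, W)."""
    h, w = img.shape
    # Rigid grid of integer pixel coordinates, shared by all n samples.
    ys, xs = np.meshgrid(np.arange(h, dtype=float),
                         np.arange(w, dtype=float), indexing="ij")
    return np.stack([bilinear_sample(img, ys + off[0], xs + off[1])
                     for off in offsets])
```

With all offsets zero, `sample_group` simply reproduces the input image n times; nonzero fractional offsets yield sub-pixel samples.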
After n relevant pixels of each target pixel are sampled, we obtain a sampling group with n relevant images. The processed image y is then obtained by weighted averaging of the n sampled images. The weighted averaging operation is defined as:

y(p, q) = Σ_{i=1}^{n} x(p + p_i, q + q_i) · W(p, q, i),    (1)

where x is the noisy image and W is the aggregation weight, p and q represent the coordinates, and p_i and q_i represent the offsets of p and q, respectively.
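The aggregation step above can be sketched as follows, assuming (as is common, though the paper does not say) that the n per-pixel weights are softmax-normalized so they sum to one at every pixel:

```python
import numpy as np

def normalize(logits):
    """Softmax over the n sampled maps (axis 0), per pixel."""
    e = np.exp(logits - logits.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def aggregate(samples, weights):
    """Weighted average of n sampled maps.
    samples: (n, H, W) relevant-pixel values; weights: (n, H, W), sum to 1 per pixel."""
    return (samples * weights).sum(axis=0)
```

With uniform logits the weights are 1/n everywhere and the result is a plain average of the n observations.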

B. FIRST STEP: PIXEL AGGREGATION NETWORK FOR GRAY IMAGE
Inspired by denoising work on raw images, we design a pixel aggregation network for gray images as our first step. In the first stage, we pre-train a pixel aggregation network in the grayscale domain. The framework is shown in Figure 3, which consists of an Offset Estimation Network (OEN) and a Weight Estimation Network (WEN).
Specifically, the Offset Estimation Network predicts the offsets of each noisy input x relative to a rigid sampling grid. By integrating the offsets V_1 with the rigid sampling grid, we obtain the normalized sampling coordinates. Then, as shown in Figure 2, we sample pixels from the noisy input according to the normalized coordinates. To assign weights to the pixels of the sampled images, we concatenate the extracted feature f_2, the sampled images x_sp1, and the input noisy image x and feed them into the weight estimation network to estimate the weights W_1. Finally, the basic noise-free estimate is generated by averaging the sampled pixels with the learned weights W_1:

y_basic(p, q) = Σ_{i=1}^{n} x_sp1(p + p_i, q + q_i, i) · W_1(p, q, i),    (2)

where x is the noisy image and y_basic represents the denoising result. Denoting the ground truth y_gt and batch size N, the training loss L_step1 in the first stage is defined as:

L_step1 = (1/N) Σ_{j=1}^{N} || y_basic^(j) − y_gt^(j) ||²_2.    (3)

C. SECOND STEP: GUIDED SAMPLING AND AGGREGATION
The performance of the pixel aggregation network mainly depends on the prediction of the relative pixel coordinate offsets and the weight of each pixel. As the noise increases, the PAN tends to search pixels from a wider area, which increases the likelihood that unrelated pixels are selected. Subsequently, the network deviates in predicting the weights of the sampled pixels. As a result, the denoising performance of the framework degrades.
To reduce the offset and weight deviations, we design a guided sampling aggregation strategy, whose framework is shown in Figure 4. In the guided sampling strategy, since the noise intensity of y_basic is lower than that of the original noisy image x, the image texture is less affected by noise, and the offset prediction network can more easily capture image information to infer more reasonable offsets. We call the process of sampling the original noisy image using pixel offsets estimated from a reference map "guidance". Therefore, the offset prediction can be guided by the relatively clean image y_basic. The new noisy sampled pixels obtained from the new offsets are conducive to image denoising. Since the offsets are estimated on the basic-denoising result y_basic, we replace the noisy information with the basic-denoising information when estimating the averaging weights. The averaging weights are decided jointly by the newly sampled noisy pixels, the offset features, and the basic-denoising result y_basic.
Concretely, we calculate the coordinate offsets V_2 of the relevant points on the basic-denoising result y_basic in a second sampling operation, using an offset estimation network with the same structure as in the first stage. We send the offsets V_2 and the noisy image x to the sampler and obtain a new noisy sampled pixel group x_sp2 with the same operation shown in Figure 2. Since the offsets V_2 in guided sampling are determined by y_basic, y_basic also needs to participate in the weight estimation inference. By concatenating x_sp2, the basic-denoising result y_basic, and the offset features f_2 and sending them into the weight estimation network, the averaging weights W_2 are obtained.
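The data flow of this guidance step can be reduced to a short sketch. This is hedged pseudocode in Python: `offset_net`, `sampler`, and `weight_net` are placeholders standing in for the paper's offset estimation network, bilinear sampler, and weight estimation network, whose internals are not specified here.

```python
def guided_denoise(x, y_basic, offset_net, sampler, weight_net):
    """Second-stage sketch: offsets are inferred from the cleaner basic
    estimate y_basic, but pixel values are still sampled from the noisy x."""
    v2, f2 = offset_net(y_basic)          # offsets predicted on y_basic, not on x
    x_sp2 = sampler(x, v2)                # (n, H, W) noisy samples taken with V_2
    w2 = weight_net(x_sp2, f2, y_basic)   # weights see the basic estimate, not x
    return (x_sp2 * w2).sum(axis=0)       # weighted aggregation -> y_out
```

The key asymmetry is visible in the first two lines: the cleaner map decides *where* to sample, while the noisy input supplies *what* is sampled.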
Finally, we generate the denoised output y_out by aggregating the sampled pixels with the learned weights W_2. The training loss has the same form as in the first stage:

L_step2 = (1/N) Σ_{j=1}^{N} || y_out^(j) − y_gt^(j) ||²_2.    (4)

Note that while the second-stage network structure is the same as the PAN, the parameters are not shared.

D. THIRD STEP: EXTEND MODEL TO UAVS RGB IMAGE DOMAIN
For UAV color image denoising, due to the channel-number mismatch between color and gray images, a color image cannot be directly input into the network, while retraining the whole model from scratch would repeat previous work. To achieve effective color image denoising, we fix the weights of the previously trained sampling network and retrain only the weight estimation module, eventually realizing model transfer learning.
Since the pixel offset estimation network is trained on gray images and estimates offsets based on pixel gray values, the network is sensitive to luminance. Therefore, we first convert the RGB image to YCbCr and extract the Y component as the network input. After guidance and offset estimation yielding V_F, we sample the color noisy image with offset V_F and obtain a sampled color noisy image group x_sp, as shown in Figure 5.
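The luminance extraction can be written with the standard BT.601 luma coefficients, as used in JPEG's YCbCr; this is a common convention we assume here, since the paper does not state which coefficients it uses:

```python
import numpy as np

def rgb_to_y(rgb):
    """Luma (Y) channel of an (H, W, 3) RGB image, BT.601 coefficients."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    return 0.299 * r + 0.587 * g + 0.114 * b
```

The coefficients sum to 1, so a white pixel maps to full luminance and the Y channel stays in the same value range as the input.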
Since the weights of the offset estimation network are already trained, it can be used directly to accurately estimate the offset from Y. Note that the parameters of the trained offset estimation network are fixed in this step. We then retrain the weight estimation network with the input features f_2, the color noisy image x_c, the sampled group x_sp, and the Y component. Finally, we apply pixel aggregation to the sampled color noisy image group x_sp with weights W_F and obtain the denoising result y_F.

FIGURE 4: Second stage, the framework of our proposed guided sampling aggregation strategy. We first use a network to estimate offsets V_2 on the basic-denoising result y_basic. We then sample pixels from the noisy input x according to the predicted offsets V_2 with the method shown in Figure 2. We concatenate the new noisy group x_sp2, the feature f_2, and y_basic into the weight estimation network to jointly determine new pixel weights W_2 for x_sp2. The denoising result y_out is then obtained by weighted averaging.
Denoting y_cgt as the ground truth of the color noisy UAV maritime image x_c, the objective function in this step is defined as:

L_step3 = (1/N) Σ_{j=1}^{N} || y_F^(j) − y_cgt^(j) ||²_2.    (5)

Note that L_step3 is only used to constrain the weight estimation network.

IV. EXPERIMENT
In this section, we first describe the datasets and implementation details, then quantitatively evaluate the proposed algorithm for maritime image denoising.

A. EXPERIMENTAL DATASET
For maritime image denoising, we collect 677 maritime photographs from the UAV database, each with a resolution of 1080 × 1920 pixels. We divide all maritime photographs into training, validation, and test sets, containing 470, 34, and 173 images, respectively. In addition, we adopt the super-resolution dataset DIV2K and the standard public test datasets BSD68 and Set12 to train and compare algorithm performance.

FIGURE 5: Third stage, the framework for color image denoising. The color image is first converted into YCbCr. Since the offset estimation network is most sensitive to luminance, the Y component is input into the network to estimate the offset. The offset estimation procedure is the same as in the second step. Then, we sample the relevant noisy image group from the color noisy image with offset V_F. The weight W_F is estimated from the sampled noisy image group x_sp, the color noisy image, the feature, and Y. Finally, the denoising result y_F is obtained by weighted averaging.

B. EXPERIMENTAL HYPER-PARAMETER SETTING
The number of network layers and the weight settings are the same as in [16]. The initial learning rate is 1 × 10⁻⁴ and is decreased by a factor of 0.999 per epoch until it reaches 2 × 10⁻⁵. The batch size is set to 32. We randomly crop 128 × 128 patches from the original noisy images for training. The number of sampled pixels per pixel is set to 25.
The synthesized Gaussian noise standard deviation σ ranges from 0 to 60 during training. In the test experiments, we evaluate the denoising performance of the proposed model for σ = 15, 25, and 50.
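The training-pair synthesis described above (a random 128 × 128 crop plus AWGN with σ drawn from [0, 60]) could be implemented as follows; the uniform distribution over σ and the 0-255 pixel range are our assumptions, as the paper specifies only the range of σ:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_training_pair(clean, patch=128, sigma_max=60.0):
    """Random crop plus additive white Gaussian noise of random strength."""
    h, w = clean.shape[:2]
    top = rng.integers(0, h - patch + 1)
    left = rng.integers(0, w - patch + 1)
    gt = clean[top:top + patch, left:left + patch].astype(float)
    sigma = rng.uniform(0.0, sigma_max)            # noise level for this patch
    noisy = gt + rng.normal(0.0, sigma, gt.shape)  # assumes 0-255 pixel range
    return noisy, gt
```

Drawing a fresh σ per patch exposes the network to the whole noise-level range within each training epoch.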

C. PROGRESSIVE TRAINING SCHEME
As described in Section III, our model training process is divided into three steps. In the first step, we train the pixel aggregation network PAN on Gaussian synthetic noisy images from DIV2K by minimizing L_step1. Note that the color images in DIV2K are converted into grayscale. For the guided offset estimation in the second step, to speed up training, the guided offset estimation network is initialized with the trained PAN from the first step. We train the overall network by minimizing L_step2. In the final step, we extend the model from the grayscale domain to the color domain. Since the guided offset estimation network can effectively estimate the coordinates of relevant pixels, we directly estimate the offset V_F from the Y component of the color image with the trained guided offset estimation network. Different from sampling the gray image before, we use the offset V_F to sample the original noisy color image, and only retrain the weight estimation network to achieve pixel aggregation denoising for color images. The third step is implemented on maritime images.

D. EVALUATION ON THE PUBLIC DATASET
We compare the proposed model on noisy grayscale images corrupted by Additive White Gaussian Noise (AWGN). Comparisons between the proposed method and other methods are presented in Table 1. Compared with the other algorithms, our denoising performance on Set12 and BSD68 is better. Note that the original PAN [16] was trained in the raw domain; for a fair comparison, we retrain the PAN in the same way.
As shown in Table 1, the denoising performance of MGPAN is better than that of the other denoising algorithms, especially when the noise level σ is larger than 25. MGPAN significantly outperforms DDFN and outperforms PAN at various noise levels on BSD68 and Set12, owing to the more accurate offset estimation guided by the basic denoising. MGPAN is slightly worse than BRDNet when the noise level is low; this may be due to the trade-off between training iterations and modeling capability. Relatedly, BM3D uses similar-block matching aggregation and 3D filtering for image denoising, with weights based on Wiener filtering coefficients. Our experiments show that relevant-pixel sampling is more flexible than block matching, and that learning-based weight allocation achieves better performance.
Figure 6 shows the denoising results of various algorithms on the same noisy image. From Figure 6 (a), it is observed that when the noise level is low (σ = 15), the proposed MGPAN achieves better results in both denoising and texture preservation. The processing area used by convolution filtering is fixed, i.e., the sampled pixels are the same for both smooth and textured areas. In relevant-pixel sampling, by contrast, only the pixels with the most relevant spatial structure are selected, which is more relevant and flexible; therefore, the denoising effect is further improved, and the texture protection in low-noise-level image denoising is attributed to relevant pixel aggregation.

E. EVALUATION ON THE UAVS MARITIME DATASET
In this section, we use real UAV maritime images to test the proposed model and conduct comparison experiments. We use PSNR to evaluate the denoising results, shown in Figure 7. Here we present six sets of compared results, covering a variety of scenarios.
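PSNR, the metric used throughout the evaluation, is computed from the mean squared error against the reference image; the sketch below assumes an 8-bit (0-255) pixel range:

```python
import numpy as np

def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB between two images."""
    mse = np.mean((reference.astype(float) - estimate.astype(float)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```

Higher is better: each 10x reduction in MSE adds 10 dB, and identical images give infinite PSNR.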
From Figure 7, we can observe that the proposed MGPAN algorithm achieves the best results among all comparison methods on noisy images with different signal-to-noise ratios. BM3D is a traditional denoising algorithm, and DnCNN is a convolution filtering algorithm based on a residual network. Our MGPAN algorithm, which integrates a convolutional neural network with correlated-pixel sampling and aggregation, shows superior performance on the experimental data, exceeding both the traditional algorithm BM3D and the convolutional algorithm DnCNN.
In addition, to facilitate subjective visual analysis, the image patches shown in Figure 8 are extracted from the red bounding boxes in Figure 7. In Figure 8 (b)(c)(e), comparing the denoised results with the reference ground truth, we observe that BM3D eliminates some of the original image textures while reducing noise. The processing results of DnCNN are not smooth and contain artifacts, which are undesirable in the image denoising task. Excessive smoothing, loss of texture information, and synthetic artifacts all lead to degraded image quality and reduced PSNR values. Compared with the comparison algorithms, our results are closer to the ground truth. These performance improvements benefit from the guided offset estimation and pixel aggregation. Different from convolution-kernel filtering, the input of pixel aggregation consists of pixels sampled from the original image, and the aggregation result is a weighted sum of original-image pixels. This operation is equivalent to n separate observations of the original image, and the aggregated result of multiple observations is closer to the desired ground truth. Therefore, we conclude that our method is more suitable for the maritime scene: it retains more image detail while ensuring denoising and achieves better image quality.

V. DISCUSSION AND FUTURE WORK
Maritime video surveillance has become an important part of the ship traffic service system, aiming to ensure the safety of ship traffic and marine applications. To make maritime surveillance more feasible and practical and to promote the development of smart vision-enabled technologies, we proposed a novel maritime UAV image denoising network based on a guided pixel aggregation network. It provides a reliable data repair method for vision technology, and an important foundation for object detection of marine vessels. We believe several applications of this technique should be explored further. In future work, we will further distill the proposed algorithm to reduce the number of model parameters while losing as little accuracy as possible. Using the restored images, we will further study object detection technology related to maritime surveillance, plan to embed this research into an object detection system, and finally form a complete maritime surveillance pipeline.