Marine Organism Detection Based on Double Domains Augmentation and an Improved YOLOv7

Existing object detection methods face significant challenges when applied to marine environments, such as underwater image degradation caused by the absorption and scattering of light, and domain shift across water bodies of different qualities. In this letter, we propose a marine organism detection framework to improve both detection performance and domain generalization performance. First, a double domains data augmentation method is proposed, which combines underwater image enhancement and water quality transfer to increase the domain diversity of the original dataset. Second, we combine self-attention operations with convolution to improve the detection performance of YOLOv7, fully utilizing the advantages of self-attention and convolutional computation. Meanwhile, the model uses the SIoU loss to accelerate convergence and improve regression accuracy. Experiments on the URPC2019 and URPC2020 datasets show that the proposed object detection method achieves mean average precisions of 82.3% and 83.6%, respectively, which is superior to all other methods used for comparison.


I. INTRODUCTION
The exploitation and utilization of marine resources are of vital importance to humankind, and the detection and identification of marine organisms are key steps in this process. Conventional underwater detection relies on manual sampling, which is not only harmful to divers' health but also inefficient. In contrast, visual object detection based on underwater optical images is becoming increasingly popular as an effective, portable, non-invasive, and non-destructive detection method and has achieved excellent results. However, it remains a challenging task due to complicated underwater environments and lighting conditions.
Existing object detection methods based on computer vision have achieved remarkable results in common object detection. Nevertheless, it is difficult for underwater object detection (UOD) in marine environments to obtain satisfactory results, for three main reasons. Firstly, underwater images acquired in complicated underwater environments often suffer from serious degradation, which dramatically affects detection accuracy. It is well known that in underwater scenes, wavelength-dependent absorption and scattering significantly degrade the quality of underwater images. This leads to problems such as low contrast, image blurring, and color cast, thus bringing significant challenges to the detection task. Many underwater image processing technologies have emerged to solve these problems. Among them, underwater image enhancement (UIE) can effectively improve the quality of underwater images and provide powerful help for underwater object detection in marine environments. In existing studies, underwater image enhancement technologies work as a pre-processing operation that improves the accuracy of downstream detection tasks by improving the visual quality of underwater images [1]. However, UIE may remove features indispensable to detection, thus hurting performance [2].
Secondly, the degradation is affected by a large number of complex factors, including water composition, quality, and temperature [3]. The underwater scene can also change rapidly due to the wave-induced movement of aquatic plants and variable lighting [4]. In other words, different underwater source domains may affect the accuracy of underwater object detection. An underwater object detector should work regardless of water quality, whether in shallow water, deep water, or the open ocean. This can be seen as a domain generalization problem, in which a model is trained on one domain but evaluated on other domains [5].
Finally, complicated conditions such as irregular movement, occlusion, overlapping, and blurring also reduce the performance of existing object detectors. Traditional object detection frameworks based on handcrafted feature engineering, such as the scale-invariant feature transform (SIFT) [6] and the histogram of oriented gradients (HOG) [7], have received much attention. However, handcrafted features impose some limitations on traditional object detection. The great successes of deep learning in computer vision tasks have already demonstrated that deep learning-based methods significantly outperform methods based on handcrafted features. However, because of the complicated underwater scenes, existing underwater object detection still suffers from various challenges, such as small targets, occlusion, overlapping, and blurring of underwater targets, which hinder the application of existing object detection methods in the ocean [8].
To address these problems, in this paper, a method based on an improved YOLOv7 and domain generalization is proposed and applied to marine organism detection. Our contributions are detailed as follows: • We propose an underwater image enhancement method.
We compensate the red and blue channels of the raw degraded underwater images, then use contrast-limited adaptive histogram equalization to increase the contrast of the image, and finally use Restormer to denoise. The proposed method improves the visual quality of the image without excessive loss of the features required for object detection.
• We propose a domain generalization method called double domains data augmentation (DDA). Inspired by literature [2] and literature [5], we first transfer the original underwater images to a specific water quality type and then merge them with the original dataset; we call this underwater domain data augmentation (UDA). We also merge the enhanced images with the original images, which we call enhanced domain data augmentation (EDA). Finally, we combine UDA and EDA to obtain the DDA dataset, which is used to train the object detector.
• We also improve the YOLOv7 network to increase the performance of underwater object detection. First, we integrate the ACmix module into the original detection network to improve its ability to recognize targets against complex backgrounds. Second, we use the SIoU regression loss function to improve the convergence speed and detection performance of the network.
The remainder of this paper is organized as follows. Section II surveys related research on underwater image enhancement, data augmentation, object detection, and self-attention. Section III introduces the proposed method, covering double domains data augmentation and the improved YOLOv7. Section IV demonstrates the advantages of our method through tests on different datasets, control experiments, and ablation studies. Finally, Section V concludes this paper.

II. RELATED WORKS

A. UNDERWATER IMAGE ENHANCEMENT
The complex underwater imaging environment seriously affects the quality of underwater images. Underwater turbulence, the absorption and scattering of light by water bodies, various types of noise, contrast degradation, uneven lighting, color distortion, and complex underwater backgrounds all result in serious degradation of underwater images. It must be noted that low-quality underwater images greatly reduce the performance of downstream object detection methods. UIE technology can effectively improve the quality of underwater images and provide powerful help for object detection [9]. Many methods have been used to improve the quality of underwater images to facilitate downstream detection tasks, and they can be roughly divided into three subclasses: physical model-based methods, non-physical model-based methods, and data-driven methods. Physical model-based methods, also called underwater image restoration techniques, usually recover the image by estimating an underwater image degradation model. The dark channel prior (DCP) method [10] restored clear images from blurred images by estimating the atmospheric light and the transmission map. Due to its simplicity and effectiveness, DCP was widely used in restoration-based underwater image enhancement [11], [12]. Inspired by hyper-Laplacian reflectance priors, Zhuang et al. [13] proposed a Retinex variational model to enhance both salient structures and fine-scale details and to recover the naturalness of authentic colors. Liang et al. [14] proposed an image restoration method based on a generalized image formation model (GIFM), which describes the image as a light attenuation process that includes the light source-scene path and the scene-sensor path. However, these methods require prior knowledge or extra assumptions, and the restored results are often unsatisfactory, especially with respect to color correction. Non-physical model-based methods, also called underwater image enhancement methods, operate by transforming pixel intensities in the spatial/frequency domain, specifically aiming to correct color, enhance contrast, and denoise in order to improve the visual quality of the image. Dixit et al. [15] combined DCP and adaptively clipped contrast-limited histogram equalization (ACCLAHE) to estimate the blurred areas of the image and remove them; they further used homomorphic transformation technology to maintain the edges of the image and intermediate transformation technology to eliminate noise. Zheng et al. [16] fused the contrast-limited adaptive histogram equalization (CLAHE) [17] converted image and the unsharp masking (USM) converted image using a weighted mixture method. Hu et al. [18] combined CLAHE and DWT to improve image contrast and sharpness. Li et al. [19] utilized a color recognition strategy to determine the hue of the image to guide adaptive color compensation, and then dehazed the image according to the haze-line theory, which improved the quality of underwater images to a certain extent. Zhang et al. [20] calculated attenuation matrices among color channels and compensated the degraded color channels accordingly. A dual-histogram-based iterative threshold method and a limited histogram method were then used to improve the global and local contrast of the images, and a multiscale fusion strategy was used to fuse the results. Finally, they used a multiscale unsharp masking strategy to further sharpen local details and edge textures. Furthermore, Zhang et al.
[21] proposed a novel underwater image enhancement method based on Retinex-inspired color correction and detail-preserved fusion. This method obtains better enhancement performance by fusing a local contrast-enhanced version, a detail-enhanced version, and a global contrast-enhanced version. Literature [22] proposed to cope with the aforementioned issues via piecewise color correction and dual-prior optimized contrast enhancement (PCDE). It first uses the maximum mean and two gain factors to correct the color cast of each color channel; spatial and texture priors are then used to optimize the contrast enhancement strategy. This method also generalizes well to foggy and low-light images. MLLE [23] proposed an efficient and robust underwater image enhancement method that locally adjusts the color and details of the input image according to the principle of minimum color loss, implements a fusion strategy guided by the maximum attenuation map, and then uses integral and squared-integral maps to quickly calculate the mean and variance of local image blocks, thereby adaptively adjusting the contrast of the input image. Kang et al. [24] proposed a structural patch decomposition and fusion method (SPDF). SPDF is based on the fusion of two complementary pre-processed inputs in perception-aware and conceptually independent image spaces and obtains an enhancement effect that is superior to several of the most advanced UIE methods.
With the development of deep learning technology, data-driven methods are becoming more and more widely used for underwater image enhancement. Xu et al. [25] combined a convolutional neural network (CNN) with Retinex theory to solve the underwater image enhancement problem. This approach first utilizes a CNN to decompose the original image into an illumination map and a reflection map. Then, dynamic threshold white balance is used to correct the color cast of the reflection map. Next, gamma correction is applied to improve the brightness and contrast of the illumination map. Finally, CLAHE is applied to deblur the fused image and obtain the enhanced result. Zhang et al. [26] used three submodules, covering denoising, color correction, and deblurring, to complete image enhancement and jointly optimized this method with the subsequent detection module. Yeh et al. [27] designed a lightweight color conversion network to correct the color cast of underwater images and jointly trained the color conversion network and the object detection network. In general, supervised learning models require a large number of clear/degraded image pairs to train the network. However, in the harsh and complex underwater environment, it is difficult to obtain such image pairs. Zhu et al. [28] introduced two adversarial losses and cycle consistency losses to propose CycleGAN, which realizes image-to-image style transfer without paired training sets. Jiang et al. [29] proposed an adaptive framework based on transfer learning. They first used CycleGAN to transfer images from the underwater domain to the in-air domain, and then used a dehazing network to enhance the transferred images, thus removing the dependence on paired images. Chen et al. [30] combined DCP and GAN to improve underwater image quality and significantly improve the detection precision of underwater object detection.

B. DATA AUGMENTATION
In the above works, underwater image enhancement technologies work as a pre-processing operation to boost the detection accuracy of underwater object detection (UOD) by improving the visual quality of underwater images. However, most of the existing strategies treat UIE and UOD as two separate pipelines, and optimizing the two tasks separately results in an inconsistency between image quality and detection accuracy [1]. Peng et al. [2] argued that using underwater image enhancement for data augmentation improves detection precision more than using it as a pre-processing step for detection. The reason might be that image enhancement removes certain features essential for detection, thus damaging detection performance. With their data augmentation method, the mAP was improved by 1.9%.
In practice, we hope a detector can be applied in any underwater circumstances. However, the absorption of light of different wavelengths differs among water bodies, and the distribution of suspended particles also varies across water environments, so underwater images exhibit different water quality characteristics [28]. The generalization between different waters is not considered in the above studies, so the existing underwater image enhancement results cannot be readily extended to different water environments. Liu et al. [5] proposed a data augmentation method based on water quality transfer (WQT). WQT used WCT2 to transfer underwater images to seven different water quality types in order to expand datasets and increase domain diversity. The mean average precision of WQT's Full_WQT setting combined with conventional data augmentation was 7.38% higher than that of the network trained on the original dataset.
The above studies have brought about great achievements in their respective fields. Underwater image enhancement technology improves image quality in order to improve the results of downstream detection tasks. The use of underwater image enhancement for data augmentation further expands the dataset. Moreover, WQT considers the generalization problem between water areas with different water qualities. However, each of these works focuses only on its own aspect, leaving room for further improvement.

C. OBJECT DETECTION
Currently, object detection methods based on deep learning provide a powerful framework for the detection and recognition of marine organisms. Object detection algorithms based on deep learning can be classified into two categories: two-stage and one-stage algorithms. Faster R-CNN [32] is a typical representative of the two-stage object detection model. Due to its excellent performance, Faster R-CNN has been widely used in underwater object detection and recognition research [33], [34]. Although the two-stage method possesses high accuracy, its inference speed is slow. YOLO [35] is a typical one-stage object detection method. It takes the whole picture as the input of the network and directly regresses the bounding box position and category at the output layer, thus obtaining a much faster inference speed than Faster R-CNN. In subsequent research, researchers pursued more robust performance with higher accuracy and faster inference, forming a series of YOLO-family detection methods [35], [36], [37], [38], [39]. Pedersen et al. [40] used YOLOv2 and YOLOv3 for marine organism identification in temperate saline water. Kandimalla et al. [41] developed an automatic real-time deep learning framework combining YOLOv3 and the Kalman filter to accurately detect and classify fish. Hu et al. [18] proposed a fish behaviour detection network based on an underwater imaging system and a deep learning framework; their improved YOLOv3-Lite network was used to detect fish behaviour, providing a good compromise between accuracy and speed. Lei et al. [8] applied the YOLOv5 network to underwater object detection and obtained a mean average precision of 87.2%. Sun et al. [42] proposed BiFA-YOLO, a network with bidirectional feature fusion and angular classification. To handle the multi-scale, arbitrarily oriented, and densely arranged ships in high-resolution SAR images, a bidirectional feature fusion module was added to the YOLOv5 network to detect multi-scale ships, an angle classification structure was added to obtain ship orientation, and the influence of angle imbalance was mitigated by random-rotation Mosaic data augmentation. Furthermore, Li et al. [43] used an improved YOLOv5 for underwater scallop recognition. Liu et al. [5] proposed the DG-YOLO algorithm, combining domain adversarial training and invariant risk minimization to further mine the semantic information of underwater images and improve the domain generalization ability of the network. In view of the hardware limitations of edge computing applications, Ma et al. [44] combined pruning, knowledge distillation, and quantization to improve YOLOv4. Their Light-YOLOv4 improves detection speed by a factor of four with only a slight decrease in detection precision.
Compared with the above algorithms, the YOLOv7 [39] model possesses faster speed and higher precision. Compared with previous YOLO base models, YOLOv7 reduces the parameters by 40% and the amount of computation by 50%. It achieved an average precision of 56.8% on the MS-COCO dataset, reaching the same level of performance as state-of-the-art object detection algorithms. Literature [45] combined the HorBlock module, the CoordAtt module, and SIoU to improve YOLOv7 for detecting minor defects in high-voltage transmission line insulators. Wu et al. [46] combined various data augmentation methods to establish the DA-YOLOv7 model for detecting Camellia oleifera fruit in complex scenes. At present, however, there is little research on applying the YOLOv7 model to marine organism detection, leaving room for improving the mean average precision of marine organism detection.

D. SELF-ATTENTION
A self-attention mechanism is a variant of the attention mechanism that reduces the dependence on external information and captures the internal correlation of data or features. Swin Transformer [47] utilized window attention, restricting self-attention to tokens within the same local window, to save computation and achieve fast inference. Moreover, BoTNet [48] introduced multi-head self-attention (MHSA) into ResNet's bottleneck structure. Most of the above studies focused on modifying the self-attention operator to further improve model performance. Pan et al. [49] showed that convolution and self-attention share much of their computation and thus possess a strong internal relationship. They split both paradigms into two stages, a shared 1 × 1 projection stage followed by operation-specific aggregation (shift-and-sum for convolution, attention-weighted aggregation for self-attention), and combined them in the proposed ACmix operator, which enjoys the benefits of convolution and self-attention at the same time. Compared with pure convolution or self-attention, ACmix adds minimal computational overhead while delivering consistent improvements in image recognition and downstream tasks.

III. PROPOSED METHOD
In this section, we describe in detail a simple and efficient marine organism detection pipeline that combines data augmentation with an improved object detection framework. The overall framework of the proposed method is shown in Fig. 1. Our method combines DDA with the improved YOLOv7 to increase domain generalization performance and improve detection accuracy for small targets in complicated scenes.
First, we perform DDA on the original dataset to increase domain generalization capacity. DDA includes EDA and UDA. In the EDA pipeline, we compensate the red and blue channels of the raw underwater image, apply CLAHE to increase the contrast of the image, and then use Restormer to denoise it. Finally, the enhanced images are merged with the original dataset. EDA boosts underwater object detection accuracy. In the UDA pipeline, we transfer the original images to a specific water quality type and then merge them with the original images. UDA significantly improves the domain generalization capacity of the detector.
Second, we optimize the network structure of YOLOv7. We add the ACmix module to the network to extract useful information from images more effectively and suppress irrelevant features. Furthermore, SIoU is used to improve non-maximum suppression and the regression loss, which accelerates convergence during training and improves inference accuracy.

A. DOUBLE DOMAINS DATA AUGMENTATION
Compared with the land environment, underwater environments are much more complicated due to the color cast, blurring, and low contrast of underwater images caused by the scattering and absorption of the water medium. Moreover, water bodies absorb different wavelengths of light to different degrees, so underwater images usually show a bluish-green tone. Low-quality underwater images seriously weaken the performance of object detectors. Consequently, an effective underwater image enhancement method can improve the quality of underwater images and, at the same time, provide the necessary support for object detection in complex environments.

1) UIE
To solve the above problems, a novel underwater image enhancement method is proposed in this paper. The pipeline of the proposed method is shown in Fig. 1. Our method includes three steps: (1) color correction; (2) contrast enhancement; and (3) image denoising. Overall, the channel with the largest average value is first selected as a reference channel to compensate the other two heavily attenuated color channels using a gain-based compensation strategy, and then CLAHE is used to improve contrast. Finally, we denoise the image with Restormer to obtain the final enhanced image.
(1) Color correction. Different wavelengths of light have different attenuation rates when propagating in an underwater medium, which leads to blurred details and color distortion of underwater images [50]. Generally, the absorption of light in the underwater environment makes the average values of the red and blue channels of an underwater image far lower than the average value of the green channel. In addition, traditional color correction methods tend to over-compensate or under-compensate the red and blue channels, respectively [21]. We utilize adaptive color compensation [21] to compensate the red and blue channels of underwater degraded images. Images captured in shallow water show a green tone, and the average value of the green channel is greater than that of the blue channel. In this case, the red and blue channels need to be further compensated.
where I r (i, j), I g (i, j), and I b (i, j) represent the red, green, and blue channels of the image, respectively, and (i, j) represents the pixel coordinates. Ī g and Ī b stand for the average values of I g (i, j) and I b (i, j), respectively. α and β are gain factors used to adjust the compensation degree of the red channel and blue channel in order to prevent excessive compensation.
In a deeper underwater scene, the absorption of red and green light is far greater than that of blue light. The image shows a blue tone, and the average value of the blue channel is greater than that of the green channel. In this case, the red and green channels need to be compensated, where I r (i, j), I g (i, j), and I b (i, j) represent the red, green, and blue channels of the image and (i, j) represents the pixel coordinates.
Ī g and Ī b represent the average values of I g (i, j) and I b (i, j), respectively. α and β are gain factors used to adjust the compensation degree of the red and green channels in order to prevent excessive compensation.
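For concreteness, one plausible gain-based form of the compensation for the shallow-water (green-dominant) case is given below. This is an assumed Ancuti-style expression shown only for illustration; the exact formulation used in [21] may differ, and Ī r here denotes the average of the red channel:

```latex
\begin{aligned}
I_r'(i,j) &= I_r(i,j) + \alpha\,\bigl(\bar{I}_g - \bar{I}_r\bigr)\bigl(1 - I_r(i,j)\bigr)\,I_g(i,j),\\
I_b'(i,j) &= I_b(i,j) + \beta\,\bigl(\bar{I}_g - \bar{I}_b\bigr)\bigl(1 - I_b(i,j)\bigr)\,I_g(i,j).
\end{aligned}
```

In the deep-water (blue-dominant) case, the blue channel takes over as the reference and the red and green channels are lifted toward it in the same way.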
(2) Contrast enhancement. Generally, the underwater imaging environment is turbid, and the scattering and refraction of background light by suspended particles in the water blur the underwater image and reduce its contrast. To address this, we utilize CLAHE for contrast enhancement. Histogram equalization (HE) and adaptive histogram equalization (AHE) may over-enhance the local contrast of the image and over-amplify noise in homogeneous areas. CLAHE limits the enhancement amplitude of the local contrast by clipping the height of the local histogram, thus limiting noise amplification and over-enhancement of local contrast. The research in [51] shows that the red, green, and blue pixels in an image enhanced by CLAHE are evenly distributed, without excessive enhancement, and give better definition than the AHE method.
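As an illustration (not the authors' released code), CLAHE can be applied to the luminance channel with OpenCV as sketched below; the clip limit and tile size are assumed values, not parameters reported in this paper.

```python
import cv2

def apply_clahe(img_bgr, clip_limit=2.0, tile_grid_size=(8, 8)):
    """Contrast-limited adaptive histogram equalization on the L channel.

    Working on the L channel of the LAB color space boosts local contrast
    while leaving chrominance (and hence the corrected colors) largely
    untouched. The clip limit caps the local histogram height, which is
    what prevents the noise over-amplification of plain AHE.
    """
    lab = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=tile_grid_size)
    l_eq = clahe.apply(l)
    return cv2.cvtColor(cv2.merge((l_eq, a, b)), cv2.COLOR_LAB2BGR)
```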
(3) Denoising. CLAHE can effectively improve image contrast by enhancing local details, but it may also amplify noise. We therefore use Restormer [52] to denoise the image. Restormer is a recent transformer model for low-level vision. It replaces the spatial self-attention of the transformer with channel-wise self-attention built on depth-wise separable convolutions, and it replaces the convolutional feed-forward network with a gated feed-forward network that also uses depth-wise separable convolutions. Restormer achieves state-of-the-art results in tasks such as deraining, deblurring, and denoising; for both Gaussian and real-noise denoising it is superior to existing algorithms and can reconstruct visually clearer images with finer-grained textures. Fig. 2 shows the processing results of the proposed underwater image enhancement method. The color cast of the original degraded image (Fig. 2a) is significantly improved after adaptive color correction (Fig. 2b). The contrast and sharpness of the color-corrected image are further improved after CLAHE (Fig. 2c) and Restormer denoising (Fig. 2d). As can be seen in Fig. 2, the detail texture is clearer, which is conducive to subsequent detection tasks.
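A minimal sketch of the full three-step pipeline is shown below, assuming 8-bit BGR input. The gain-based compensation again follows the assumed Ancuti-style form rather than the exact expression of [21], apply_clahe is the helper from the previous snippet, and the denoiser argument is a hypothetical hook standing in for a pretrained Restormer model.

```python
import cv2
import numpy as np

def compensate_channels(img_bgr, alpha=1.0, beta=1.0):
    """Gain-based color compensation (assumed form, see Section III-A.1).

    The channel with the larger mean (green in shallow water, blue in deep
    water) is used as the reference, and the attenuated channels are
    raised toward it.
    """
    img = img_bgr.astype(np.float32) / 255.0
    b, g, r = cv2.split(img)
    mb, mg, mr = b.mean(), g.mean(), r.mean()
    if mg >= mb:        # shallow water: green reference, lift red and blue
        ref, m_ref = g, mg
        r = r + alpha * (m_ref - mr) * (1.0 - r) * ref
        b = b + beta * (m_ref - mb) * (1.0 - b) * ref
    else:               # deep water: blue reference, lift red and green
        ref, m_ref = b, mb
        r = r + alpha * (m_ref - mr) * (1.0 - r) * ref
        g = g + beta * (m_ref - mg) * (1.0 - g) * ref
    return (np.clip(cv2.merge((b, g, r)), 0.0, 1.0) * 255).astype(np.uint8)

def enhance(img_bgr, denoiser=None):
    """Color correction -> CLAHE -> denoising (e.g. a pretrained Restormer)."""
    out = compensate_channels(img_bgr)
    out = apply_clahe(out)          # defined in the previous snippet
    if denoiser is not None:
        out = denoiser(out)
    return out
```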

2) DDA
Existing works point out that improving the quality of underwater images can improve performance on computer vision tasks such as underwater object detection. In these works, underwater image enhancement serves as an image pre-processing operation for underwater object detection, and researchers consider it reasonable to use UIE as a pre-processing step before feeding data to the object detector. However, most existing works treat UIE and UOD as two separate steps and optimize them separately, even though their evaluation indicators and optimization goals are different. UIE may add noise and artifacts and remove features indispensable to detection, thus hurting detection performance. Consequently, applying an enhancement method to the training data and testing the trained model on images with or without enhancement may not help in terms of accuracy.
We believe that UIE can correct the color cast of the original underwater image and improve the contrast and clarity of the image, which is essentially beneficial for downstream tasks. Therefore, the EDA we perform on the training set, which merges the enhanced images with the original images, retains both the good visual quality of the enhanced images and the essential features of the original images, improving accuracy. The experiments show that EDA can significantly improve the detection accuracy.
In application, we hope a trained detector can be applied in any underwater environment. However, traditional object detection methods suffer severely from domain shift: a model trained on a source domain performs unsatisfactorily when evaluated on an unseen domain. Underwater optical images are often affected by environmental factors such as wavelength-dependent light absorption and scattering. On one hand, the selective absorption of light by water causes color distortion. On the other hand, scattering is the energy loss caused by changes in light direction during transmission due to suspended particles or ambient light. Scattering reduces image contrast and blurs features such as textures, edges, and colors. In different water qualities, the type, number, and distribution of suspended particles differ, leading to different scattering characteristics of light. Consequently, image degradation differs across water qualities, and even a trained object detector suffers from the domain generalization problem.
In this paper, we use UDA to improve the domain generalization capacity of the model. First, eight reference images with different water quality types are selected, as shown in Fig. 3, and the entire original dataset is transferred into each of these water quality types using WCT2; the resulting domains are denoted type1 to type8. The reference style images are the eight selected images, and the content images come from the original training and validation sets. WCT2 effectively preserves the important structural characteristics of the source images and produces photorealistic results. Finally, the transferred images are added to the original dataset. Note that type8 is reserved for testing the domain generalization capacity of the model and is therefore never used for UDA.
In practice, we add the type4 dataset to the original dataset instead of all seven types of transferred datasets. First, the various transferred datasets contain many redundant features, which may lead to overfitting. Second, even SOTA image-to-image translation methods can still introduce artifacts and noise, and a large number of transferred images would increase the weight of these defects during training, thus affecting performance. Third, data augmentation with all water quality types would greatly slow down training. The subsequent experiments confirm this hypothesis: the model trained on the ori+type4 dataset achieves a 1.2% higher mAP than the model trained on ori+typeall, with only one third of the training time. We finally add the enhanced images and the type4 images to the original dataset, which we call double domains data augmentation; a sketch of assembling the resulting training set is given below.
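The DDA training set can be assembled with straightforward file operations, as sketched below. The directory layout and YOLO-style .txt labels are assumptions for illustration, not the authors' released tooling; because neither UIE nor WCT2 style transfer moves objects, every augmented image simply reuses the annotation of its source image.

```python
import shutil
from pathlib import Path

def build_dda_dataset(ori_dir, enhanced_dir, type4_dir, out_dir):
    """Merge original, enhanced (EDA) and type4-transferred (UDA) images
    into a single training set, duplicating the unchanged annotations."""
    out_img = Path(out_dir) / "images"
    out_lbl = Path(out_dir) / "labels"
    out_img.mkdir(parents=True, exist_ok=True)
    out_lbl.mkdir(parents=True, exist_ok=True)

    ori_labels = Path(ori_dir) / "labels"
    for tag, src in (("ori", ori_dir), ("eda", enhanced_dir), ("uda", type4_dir)):
        for img in sorted(Path(src).glob("images/*.jpg")):
            shutil.copy(img, out_img / f"{tag}_{img.name}")
            # The label always comes from the original annotation.
            shutil.copy(ori_labels / f"{img.stem}.txt",
                        out_lbl / f"{tag}_{img.stem}.txt")
```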

B. IMPROVED YOLOv7
Most object detectors based on computer vision are designed for land environments and face a huge challenge in identifying marine organisms against complex backgrounds. They do not consider complicated underwater conditions such as scene overlap, occlusion, and blur, so the detection accuracy for small targets is low. Therefore, a new detection method for marine environments with complex backgrounds is urgently needed to improve detection performance.
In this study, we improve the original YOLOv7 object detection network. First, we add the ACmix module to the YOLOv7 network to improve its ability to extract global features. Second, we use the SIoU loss to accelerate the convergence of the network during training and improve the detection precision. The improved network is called YOLOv7-ACmix, and its structure is shown in Fig. 4.
First, the images augmented by DDA are fed to the network input. Then, conventional data augmentation operations, such as random cropping, scaling, and rotation, are performed. The backbone network mainly utilizes the ELAN and MP structures with SiLU activations. ELAN helps the deeper network learn and converge effectively by controlling the shortest and longest gradient paths. The MP structure utilizes both max pooling and CBS to absorb the advantages of the two down-sampling methods, where CBS is composed of a stride-2 convolution, batch normalization, and an activation function. The neck and head networks still adopt an anchor-based mechanism. After down-sampling by a factor of 32, the feature map size changes from 640 × 640 × 3 to 20 × 20 × 1024, and a feature fusion module, SPPCSPC with an SPP structure, transmits the features extracted from the backbone network to the neck network. YOLOv7's neck network fuses context features through top-down and bottom-up paths; unlike YOLOv5, the ELAN-W module replaces the CSP module. The network outputs feature maps with sizes of 80 × 80, 40 × 40, and 20 × 20 through three feature layers. Finally, three heads composed of RepConv and convolution modules output the network prediction results.
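For illustration, the CBS block described above can be written as the following PyTorch module (a sketch rather than the official YOLOv7 code; channel sizes are arbitrary):

```python
import torch
import torch.nn as nn

class CBS(nn.Module):
    """Convolution + BatchNorm + SiLU, the basic block of the YOLOv7 backbone.

    With stride=2 it also serves as the convolutional down-sampling branch
    of the MP structure, alongside the max-pooling branch.
    """
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, stride,
                              padding=kernel_size // 2, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.SiLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

# A stride-2 CBS halves the spatial resolution:
# CBS(256, 512, stride=2)(torch.randn(1, 256, 80, 80)).shape == (1, 512, 40, 40)
```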

1) ELAN-ACMIX
The improved YOLOv7-ACmix network adds the ACmix structure to the last ELAN-W layer of the head network. Specifically, we use the ACmix structure to replace the last CBS module of the ELAN-W module, forming the ELAN-ACmix module (shown in Fig. 5). The improved module combines the advantages of self-attention and convolution, which lets the model direct more attention to valuable information and thus improves the detection precision.
Convolution is one of the most essential parts of modern ConvNets and is widely adopted in computer vision tasks such as object detection and semantic segmentation, achieving state-of-the-art performance. The attention mechanism has also been widely adopted in computer vision tasks, focusing on important regions within a large context. However, in most research, self-attention and convolution are regarded as different operations, and the internal relationship between them is not fully exploited. The ACmix hybrid module elegantly integrates self-attention and convolution with minimal overhead (as shown in Fig. 6) and effectively enjoys the advantages of both. In this study, the ACmix hybrid module is added to the ELAN-W module of YOLOv7.
The ELAN-W module of YOLOv7 has two branches. The first branch performs channel transformation through a 1 × 1 convolution. The second branch first performs channel transformation through a 1 × 1 convolution and then extracts features through four 3 × 3 convolution modules. The outputs of these five convolutions and the output of the first branch are fused by concatenation. We replace the last 3 × 3 convolution module of the second branch with the ACmix module, yielding the ELAN-ACmix module; its structure is shown in Fig. 5. The improved ELAN-ACmix module has the advantages of both convolution and self-attention, which makes the model focus more on valuable information and improves the detection precision.
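The sketch below illustrates the ACmix idea of reusing shared 1 × 1 projections for both a self-attention path and a convolution path whose outputs are blended with learnable scalars. It is a deliberately simplified approximation, not the official ACmix implementation: in particular, the convolution path's shift-and-sum stage is replaced here by a plain 3 × 3 convolution over the concatenated projections.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleACmix(nn.Module):
    """Illustrative mix of self-attention and convolution (cf. Pan et al. [49]).

    Stage I: three shared 1x1 projections produce q, k, v.
    Stage II (attention path): single-head self-attention over all spatial
              positions (acceptable for the small feature maps of the head).
    Stage II (conv path): the projected features are aggregated with a 3x3
              convolution, standing in for ACmix's shift-and-sum stage.
    The two paths are blended with learnable scalars alpha and beta.
    """
    def __init__(self, channels):
        super().__init__()
        self.q = nn.Conv2d(channels, channels, 1)
        self.k = nn.Conv2d(channels, channels, 1)
        self.v = nn.Conv2d(channels, channels, 1)
        self.conv = nn.Conv2d(3 * channels, channels, 3, padding=1)
        self.alpha = nn.Parameter(torch.ones(1))
        self.beta = nn.Parameter(torch.ones(1))

    def forward(self, x):
        b, c, h, w = x.shape
        q, k, v = self.q(x), self.k(x), self.v(x)

        # Self-attention path over flattened spatial positions.
        qf = q.flatten(2).transpose(1, 2)          # (b, hw, c)
        kf = k.flatten(2)                          # (b, c, hw)
        vf = v.flatten(2).transpose(1, 2)          # (b, hw, c)
        attn = F.softmax(qf @ kf / c ** 0.5, dim=-1)
        att_out = (attn @ vf).transpose(1, 2).reshape(b, c, h, w)

        # Convolution path reuses the same projections.
        conv_out = self.conv(torch.cat([q, k, v], dim=1))
        return self.alpha * att_out + self.beta * conv_out
```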

2) SIOU
The total loss function of YOLOv7 is composed of three parts: the box loss L box, the confidence loss L conf, and the classification loss L cls (Equation (5)). Binary cross-entropy loss is used for the confidence loss and the classification loss, whereas the CIoU loss function is used for the box loss.
Loss = w 1 L box + w 2 L cls + w 3 L conf (5)

where w 1, w 2, and w 3 are the weights of the three losses. In this paper, we use the SIoU loss function to replace the CIoU loss function for box regression.
The original YOLOv7 [39] uses CIoU [53] to calculate the regression localization loss. CIoU addresses the problem that IoU loss cannot distinguish boxes with the same overlap area but different aspect ratios by adding a width-to-height ratio factor, as shown in Equation (6).
However, the CIoU does not consider the direction of the mismatch between the real box and the prediction box.
As the prediction box may wander during training and eventually produce a worse model, this deficiency may lead to slow convergence and low efficiency. By considering the vector angle between the desired regressions, Gevorgyan [54] redefined the penalty terms and proposed the SIoU loss function (as shown in Fig. 7). The SIoU loss function consists of the angle cost, the distance cost, the shape cost, and the IoU cost.
Angle cost. The model first brings the prediction box to the X or Y axis (whichever is closer) and then continues the approach along that axis. During this convergence process, α is minimized if α ≤ π/4; otherwise β = π/2 − α is minimized. The angle cost is defined as

Λ = 1 − 2 sin²(arcsin(x) − π/4), with x = c h / σ = sin(α),

where σ is the distance between the center points of the ground-truth box and the prediction box and c h is the vertical distance between the two centers.

Distance cost. The distance cost is redefined by taking the angle cost into account:

Δ = Σ_{t=x,y} (1 − e^(−γ ρ_t)), with ρ_x = ((b_cx^gt − b_cx) / c_w)², ρ_y = ((b_cy^gt − b_cy) / c_h)², γ = 2 − Λ,

where (b_cx, b_cy) and (b_cx^gt, b_cy^gt) are the centers of the prediction and ground-truth boxes, and c_w and c_h here denote the width and height of the smallest box enclosing both.

Shape cost. The shape cost is defined as

Ω = Σ_{t=w,h} (1 − e^(−ω_t))^θ, with ω_w = |w − w^gt| / max(w, w^gt), ω_h = |h − h^gt| / max(h, h^gt),

where (w, h) and (w^gt, h^gt) are the widths and heights of the prediction and ground-truth boxes, and θ controls how much attention is paid to the shape cost.

IoU cost. The IoU cost is defined as

IoU = |B ∩ A| / |B ∪ A|,

where B and A represent the prediction bounding box and the ground truth, respectively.
Finally, the SIoU regression loss function is defined as

L box = 1 − IoU + (Δ + Ω) / 2.

SIoU effectively reduces the total number of degrees of freedom of the loss and improves the convergence process and the training effect. Our improved YOLOv7-ACmix uses SIoU to calculate the regression loss. Compared with the traditional CIoU, SIoU has a faster convergence speed and higher localization precision.
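For illustration, a self-contained re-implementation of the SIoU loss as described above is sketched below (boxes are assumed to be in (x1, y1, x2, y2) format; this is written from the published formulation, not taken from the authors' training code):

```python
import math
import torch

def siou_loss(pred, target, theta=4.0, eps=1e-7):
    """SIoU loss for boxes in (x1, y1, x2, y2) format, shape (N, 4)."""
    # IoU term.
    inter_w = (torch.min(pred[:, 2], target[:, 2]) - torch.max(pred[:, 0], target[:, 0])).clamp(min=0)
    inter_h = (torch.min(pred[:, 3], target[:, 3]) - torch.max(pred[:, 1], target[:, 1])).clamp(min=0)
    inter = inter_w * inter_h
    w1, h1 = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    w2, h2 = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    iou = inter / (w1 * h1 + w2 * h2 - inter + eps)

    # Centers and smallest enclosing box.
    cx1, cy1 = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    cx2, cy2 = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0]) + eps
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1]) + eps

    # Angle cost: push the regression toward the nearer axis first.
    sigma = torch.sqrt((cx2 - cx1) ** 2 + (cy2 - cy1) ** 2) + eps
    sin_alpha = ((cy2 - cy1).abs() / sigma).clamp(-1 + eps, 1 - eps)
    angle = 1 - 2 * torch.sin(torch.asin(sin_alpha) - math.pi / 4) ** 2

    # Distance cost, modulated by the angle cost through gamma.
    gamma = 2 - angle
    rho_x = ((cx2 - cx1) / cw) ** 2
    rho_y = ((cy2 - cy1) / ch) ** 2
    dist = (1 - torch.exp(-gamma * rho_x)) + (1 - torch.exp(-gamma * rho_y))

    # Shape cost.
    omega_w = (w1 - w2).abs() / torch.max(w1, w2).clamp(min=eps)
    omega_h = (h1 - h2).abs() / torch.max(h1, h2).clamp(min=eps)
    shape = (1 - torch.exp(-omega_w)) ** theta + (1 - torch.exp(-omega_h)) ** theta

    return 1 - iou + (dist + shape) / 2
```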

IV. EXPERIMENTS

A. DATASETS
We used the URPC2019 and URPC2020 datasets to evaluate the effectiveness of our proposed methods. The URPC datasets are provided by the ''Underwater Robot Picking Competition'' organized by the Natural Science Foundation of China and are widely used in underwater object detection research [2], [5], [26], [55], [56], [57], [58], [59]. The URPC2019 dataset includes 3765 training samples and 942 validation samples, covering five underwater object categories: echinus, starfish, holothurian, scallop, and waterweeds. The URPC2020 dataset includes 4200 training samples, 800 validation samples, and 1200 test samples, covering four types of marine organisms: echinus, starfish, holothurian, and scallop. We resized all images to 416 × 416, consistent with the literature [5].

B. IMPLEMENTATION DETAILS
In our experiments, we utilized the most basic model in the YOLOv7 series, which contains 37.2 M parameters. We trained for 100 epochs with a batch size of 8, an initial learning rate of 0.01, a momentum of 0.937, and a weight decay of 0.0005. We otherwise used the network's default parameters without any additional optimization tricks. The hardware was a desktop computer equipped with a GeForce GTX 1080 Ti GPU, 32 GB of memory, and an i7-6800K CPU. The software environment consisted of Ubuntu 18.04, CUDA 10.2, Python 3.6, and PyTorch 1.10.
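For reference, the training configuration described above can be summarized as follows (the dictionary layout is illustrative and is not the YOLOv7 hyperparameter file format):

```python
# Hyperparameters used in this work (Section IV-B); everything else is left
# at the YOLOv7 defaults.
train_cfg = {
    "model": "yolov7",        # basic YOLOv7 model, about 37.2M parameters
    "img_size": 416,          # images resized as described in Section IV-A
    "epochs": 100,
    "batch_size": 8,
    "lr0": 0.01,              # initial learning rate
    "momentum": 0.937,
    "weight_decay": 0.0005,
}
```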

C. EVALUATION METRICS
To comprehensively and objectively evaluate the performance of the proposed method, we used the mean average precision (mAP), recall, precision, F1-score, and PR curve.
Precision and recall are defined as follows:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN) (13)

where TP (true positive) denotes a correct detection, i.e., the model predicts positive and the true value is also positive; FP (false positive) denotes a false detection, i.e., the model predicts positive but the true value is negative; and FN (false negative) means the model predicts negative but the true value is positive.
As its name suggests, the mAP score is the mean of the average precision (AP) scores over all classes. AP is defined as

AP = ∫₀¹ p(r) dr,

where p is the precision and r is the recall. In this paper, mAP scores are reported at an intersection-over-union (IoU) threshold of 0.5. Generally, the higher the mAP score, the better the detection result and the higher the network performance. The F1-score provides a more realistic measure of a test's performance by balancing precision and recall, and is defined as

F1 = 2 × Precision × Recall / (Precision + Recall).

The F1-score, which ranges from 0 to 1, is often used as the final measurement in multi-class machine learning problems.
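A minimal sketch of computing AP by integrating the precision-recall curve is given below (the all-point interpolation shown is one common convention; the exact protocol of the URPC evaluation may differ):

```python
import numpy as np

def average_precision(recall, precision):
    """Area under the precision-recall curve (all-point interpolation).

    recall and precision are arrays obtained by sweeping the confidence
    threshold over the detections of a single class; mAP is the mean of
    this value over all classes.
    """
    r = np.concatenate(([0.0], np.asarray(recall), [1.0]))
    p = np.concatenate(([0.0], np.asarray(precision), [0.0]))
    # Make precision monotonically non-increasing from right to left.
    p = np.maximum.accumulate(p[::-1])[::-1]
    # Sum rectangle areas where recall changes.
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))
```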

D. URPC2019 EXPERIMENTAL RESULTS
We compared the mAP of various UDA settings, from ori+type1 to ori+type7. Here, ori+type1 denotes adding type1 images to the original dataset, and so on, and typeall denotes adding all seven types of water quality images. We used the URPC2019 dataset as the original dataset; the training set includes 3765 images and the validation set includes 942 images. Unless otherwise specified, the basic YOLOv7 model was used as the detection network with default hyperparameters and no bells and whistles, epoch = 100, batch size = 8, and 5 classes. The experimental results are shown in Table 1 and Table 2. Fig. 8 shows some results on the URPC2019 dataset; it can be seen that the method proposed in this paper can effectively detect small and fuzzy objects. Table 1 compares the UDA results of the method proposed in this paper with the method proposed in the literature [5]. The first row is the test result of the literature [5], which is far lower than the test result of the method proposed in this paper (the last row). The second row is the test result of the basic YOLOv7 model after training on the original URPC2019 dataset, which is used as the baseline. Rows 3-9 of the table are the test results of the various UDA settings. The experimental data show that the ori+type4 data augmentation achieves the highest mAP. Row 11 of the table shows the detection results of the DDA proposed in this paper with the unmodified YOLOv7. The last row is the result of the DDA and YOLOv7-ACmix proposed in this paper. The mAP is improved by 2-3.5% using only UDA, which is lower than the improvement of the DDA method proposed in this paper (4.2%). The mAP of DDA + YOLOv7-ACmix proposed in this paper is 82.3%, which is 4.8% higher than the baseline. We also applied DDA with the typeall and UIE images; the test results are shown in row 10 of the table, where the mAP is only 80.3%. Obviously, DDA with ori+typeall has no obvious advantage in detection precision, which also confirms our earlier hypothesis. Therefore, we chose the ori+type4 DDA mode as our final choice. Columns 8-9 of the table show that the recall and F1-score of our proposed method are 8.1% and 4.3% higher than the baseline. In Fig. 9, the PR curve also shows that the proposed method has the best performance. Next, we compared the detection results of EDA using the method proposed in this paper (see Table 2). The underwater image enhancement methods used for comparison include the classic CycleGAN [28] and two recent advanced underwater image enhancement methods, MLLE [23] and SPDF [24]. The experimental results are shown in Table 2. The first row of the table represents the baseline. Rows 2-7 of the table show the detection results when CycleGAN, MLLE, and SPDF are used as a pre-processing step for the detection task and when they are used for data augmentation. It can be seen from the table that UIE helps subsequent detection tasks more when used for data augmentation than when used for pre-processing. However, the improvement from a single EDA (up to 3.2%) is still lower than that obtained by the DDA method (up to 4%) and far lower than that obtained by the DDA + YOLOv7-ACmix proposed in this paper (up to 5.6%).
At the same time, we also find that when CycleGAN is used for pre-processing, the mAP decreases by 2.2% compared with the detection result of the baseline. This may be because the generative network introduces additional noise and artifacts, which damage the subsequent detection task. Columns 8-9 of the table show that our method achieves the highest recall and F1-score among the compared methods. In Fig. 10, the PR curve also shows that the proposed method has the best performance.
In addition, following the method in the literature [5], we also generated underwater domain images of type8 to compare the domain generalization performance of our method. Fig. 11 shows some of the domain generalization results. As can be seen from Fig. 12, underwater object detection methods suffer from domain generalization problems: the original YOLOv7 misses many targets, and in some images no targets are detected at all (the first and second rows). Specific test data are shown in Table 3, which demonstrates the clear advantages of our method. The test results show that the detection precision of the YOLOv7 network without any data augmentation on type8 images is only 18.3%, which is far lower than that of the YOLOv3 network used in the literature [5]. This shows that improving the network structure alone cannot significantly improve domain generalization performance. Furthermore, the generalization performance of the DDA method proposed in this paper (row 5 of the table) is lower than that of the ori+MLLE and ori+SPDF data augmentation methods (rows 3 and 4 of the table). We believe that MLLE and SPDF help the detection network focus on information common to underwater images rather than information specific to a single underwater image. To this end, we randomly select images (which we call typex) from the other seven underwater domain image sets to replace the type4 images for DDA. The experimental results show that our proposed method achieves the highest mAP (56.2%), recall (50.5%), and F1-score (61.1%). Moreover, the generalization performance is also greatly improved.
It is worth noting that precision and recall are conflicting performance measures. From Tables 1, 2, and 3, we find that the precision of our proposed method is not the highest and is 5% lower than that of the compared methods. We therefore conducted a comparative analysis of the detection results. As shown in Fig. 13, from left to right are the ground truth, ori+MLLE, ori+SPDF, and our method. From the first row, we can see that our proposed method detects a scallop that is not labeled in the ground truth, whereas ori+MLLE and ori+SPDF do not detect any targets. In the second row, compared with the control methods, our proposed method produces better detection results, finding the starfish in the middle of the image and the scallop in the lower right. We conclude that, due to the imperfect labeling of targets in the dataset [55], true positives detected by our method are counted as false positives, resulting in lower precision. Although the comparison methods have higher precision, they miss many detections.
In order to compare with other studies that also utilize the URPC2019 dataset, we tested the detection precision on four target categories, excluding waterweeds (see Table 4 for the test results). To facilitate comparison, the table lists the total number of images in the datasets used by the various methods, as well as the number of images in the training, validation, and test splits and the final test results (columns 2-6). The methods in rows 1-5 use the URPC2019 refined dataset, which contains 4757 pictures. The methods in rows 6-8 use the URPC2019 dataset, which contains 4707 pictures. From the mAP in the last column of the table, it can be seen that the proposed method is superior to all the other control methods.

E. URPC2020 EXPERIMENTAL RESULTS
We also verified our method on the URPC2020 dataset. We randomly selected 4200 images from the URPC2020 dataset as the training set, 800 images as the validation set, and 1200 images as the test set. Fig. 14 shows some results on the URPC2020 dataset, and the test results are shown in Table 5.
As shown in Table 5, although the training and test splits used in the comparison literature differ, the proposed method achieves a higher mAP on a larger test set with fewer training samples (rows 2-4 of the table); it is far higher than the mAP of the literature [58] (row 1) and 5.4% higher than the YOLOv7 baseline. Without loss of generality, it can be considered that the precision of our method is superior to all control methods.

F. NMS
In this study, we also compared the effects of different NMS methods. Note that the baseline YOLOv7 uses CIoU [53] as the regression loss function by default. We compared the performance of several common methods, including DIoU [53], CIoU, EIoU [60], and SIoU. The results are shown in Fig. 15, which illustrates that SIoU is 1.8% higher than the CIoU used by the native YOLOv7 and outperforms the other control methods.

Figure caption: Comparative performance for marine organism detection on URPC2020. From left to right: the original image with ground truth, the results of YOLOv7, and the results of the proposed method. The proposed method detects more occluded and background-similar targets (for example, the holothurian, starfish, and echinus in the first row).

G. ABLATION EXPERIMENT
Apart from the experiments mentioned above, we also conducted ablation experiments to confirm the contribution of each improvement to the final detection task. The ablation experiments were all conducted on the URPC2019 dataset with five target categories, training for 100 epochs with batch size = 8 and default parameters, and the test set was the original validation set. The experimental results are shown in Table 6. The first row of the table shows the test results of the original YOLOv7 after training on the training set without DDA; this setup is used as the control group. The second row represents the detection result when the training set is converted to the type4 style, and the third row is the detection result when underwater image enhancement is used as a pre-processing step for the detection task. In rows 4-5, ACmix and SIoU are added to the YOLOv7 network, respectively, leading to improvements of 3.7% and 1.5%. Rows 6-7 show the results of UDA and EDA separately. The experimental results show that data augmentation in a single domain contributes a 2.6-3.5% improvement in precision. When DDA is used, the mAP reaches 81.5% (row 8), which is 4.0% higher than the baseline and 0.5% and 0.8% higher than the single-domain results. These data show that our proposed DDA is effective in improving the mAP of downstream detection tasks.
Rows 9-10 of the table show the detection results of adding ACmix and SIoU to the YOLOv7 network while applying DDA. The data show that the mAP after adding ACmix reaches 82.0%, which is 4.5% higher than the baseline and 0.5% higher than DDA alone. The mAP in the last row is 82.3%, which is the detection result of our final scheme, in which ACmix and SIoU are added to the original YOLOv7 network and DDA is applied; this yields an improvement of 4.8% over the baseline. In addition, the improved YOLOv7-ACmix has a detection speed of 56 fps, higher than the 52 fps of YOLOv7. This is attributed to the lower computational overhead of the ACmix module compared with pure convolution or self-attention.

H. DISCUSSION
Although our proposed method achieves good results, it also has certain limitations. First, the experimental results indicate that detection performance in an unseen domain still cannot reach a level similar to that in the seen domains. This indicates that the degradation differences of underwater images in different water qualities have a significant impact on detection performance, affecting the generalization performance of the model. Second, we still treat image enhancement and object detection as two independent models, without joint optimization between them to further improve detection performance. Therefore, there is still much to explore in this field.
Considering these limitations, we will do more work to improve the performance of the model. A better UIE method could enhance the expression of useful features and further improve domain generalization performance. In addition, considering the interaction between image enhancement and object detection could guide the enhancement model to generate images that are more conducive to detection. Based on these two aspects, we will explore more methods to improve the generalization ability and accuracy of our model.

V. CONCLUSION
In this paper, a new underwater object detection method is proposed. First, we combine the underwater domain and the enhancement domain to augment the underwater dataset. The experiments demonstrate that data augmentation in both domains can maximize the useful information in the data and improve the performance of downstream detection tasks. The application of the self-attention mechanism enables the model to focus on important areas over a larger spatial extent, and ACmix integrates convolution and self-attention, combining the advantages of both while reducing the computational cost. Adding ACmix to the native YOLOv7 network improves the expression of useful features and optimizes the feature representation of underwater organism targets. Finally, SIoU guides the prediction to the x or y axis by introducing the angle between the ground-truth box and the prediction box, and then continues the regression along the relevant axis, accelerating convergence and improving training. Through the above strategies, our method improves the detection precision of the original network by 4.8% on the URPC2019 dataset. At the same time, our strategy combines only one type of underwater-domain image with the enhanced-domain images for data augmentation, which reduces the difficulty of network training.
The proposed method still has room for improvement. Better underwater image enhancement methods can better express features that are useful for downstream detection tasks and can be used in DDA, which may further improve the detection results. We will therefore further study underwater image enhancement methods to improve detection performance.