A Deep Learning Method for Offshore Raft Aquaculture Extraction Based on Medium-Resolution Remote Sensing Images

Aquaculture has experienced significant growth, contributing to resolving the global food crisis and delivering substantial economic benefits. Nevertheless, the uncontrolled expansion of aquaculture activities has led to an ecological crisis in offshore waters. This highlights the critical need for precise delineation and monitoring of aquaculture areas in these regions to ensure scientific management and sustainable development of coastal areas. In this article, we introduce SRUNet, a Swin Transformer-based model for accurately extracting offshore raft aquaculture areas from medium-resolution remote sensing images. SRUNet combines the UNet model with the Swin Transformer block and the residual block to account for multiscale features, yielding excellent extraction performance in diverse and complex sea areas. To evaluate the model, we selected four typical raft aquaculture areas and compared SRUNet with other network models. The results reveal that SRUNet outperforms all comparison models, with an F1 score of 86.52% and an MIoU of 87.22%. The model reduces the loss of feature information and the misclassification of aquaculture areas, generating extraction results that align closely with the real shapes of aquaculture areas. Additionally, we tested the performance of each component of the SRUNet model. The results indicate that the model exhibits strong robustness and effectively filters out irrelevant information. These results demonstrate the model's potential for large-scale extraction of offshore aquaculture areas.


I. INTRODUCTION
According to the Food and Agriculture Organization of the United Nations (FAO), China, the largest aquaculture producer in the world, has accounted for over 20% of global production in the last decade [1], [2]. While aquaculture has generated substantial economic gains for China, the industry has also caused significant ecological problems in coastal regions [3]. The unchecked expansion of aquaculture areas into offshore waters and increasing farming densities have produced amounts of excrement and bait that surpass the carrying capacity of the natural environment, leading to severe pollution of offshore aquaculture waters [4], [5], [6], [7]. Therefore, it is essential for fishery authorities to monitor and regulate offshore aquaculture practices responsibly to ensure sustainable development [8]. Traditional monitoring methods, such as field visits, are inefficient and cannot cover large areas. With the development of earth observation technology, remote sensing has become a crucial research direction in marine fisheries and marine environment studies [9], as it overcomes the shortcomings of traditional monitoring methods.

Offshore aquaculture activities are primarily concentrated within 15 km of the coast in open sea areas and at water depths of less than 20 m [10], [11]. The farming methods primarily used are net aquaculture and floating raft aquaculture. The former is typically constructed from wood and plastic materials, and its reflective properties differ significantly from those of the surrounding water body, making it easier to extract from optical remote sensing images. The latter mainly cultivates seafood such as kelp and seaweed and is composed of floating rafts and underwater ropes. The reflectivity of raft-type aquaculture areas is lower than that of seawater, so they appear as dark rectangular strips with a relatively uniform tone in optical remote sensing images [11]. In recent years, the primary image data used for research in this field have been optical remote sensing images, such as Landsat-TM/OLI, Sentinel2 MSI, GF-1/2, and Worldview-2, and synthetic aperture radar (SAR) images, such as Sentinel1 and Radarsat-2.
Research on aquaculture information extraction using optical remote sensing image data can be broadly classified into three categories: 1) extraction based on spectral features; 2) object-oriented extraction methods; and 3) deep-learning-based information extraction [12], [13]. Some researchers [11], [14], [15], [16] found that in optical remote sensing images, the feature difference between the raft aquaculture area and the surrounding seawater background is more obvious in the near-infrared, red, and green bands. On this basis, they established spectral feature indices that widen the differences between aquaculture areas and the seawater background, thereby enhancing the feature information of aquaculture areas. These methods are better suited to areas with significant spectral differences between the aquaculture areas and the surrounding seawater; otherwise, classification problems such as "salt and pepper" noise and "foreign objects with the same spectra" easily arise. Building on these works, object-oriented methods based on image segmentation can effectively avoid "salt and pepper" noise. Because aquaculture areas have regular shapes, they appear as stripes in optical images and differ obviously from the seawater background in shape and texture features. Therefore, some researchers [17], [18], [19] focused more on the shape and texture features of the culture area, using high-resolution optical remote sensing images combined with object-based methods to extract the aquaculture area. Compared with extraction methods using only spectral features, a higher level of accuracy can be obtained. In recent years, deep learning has developed rapidly in remote sensing image segmentation research; convolutional networks in particular have performed well in various segmentation tasks [20]. As a result, various deep learning network models have been applied to many marine research fields [21], [22], [23]. Among them, some network models designed specifically for aquaculture area extraction have emerged, such as the improved UNet model with pyramid upsampling and squeeze-and-excitation structures [24], RaftNet with dual-channel and residual hybrid dilated convolution blocks [25], the hierarchical cascade convolutional neural network HCNet [26], and the hierarchical cascade homogeneous neural network HCHNet [10]. Compared with other methods, deep-learning-based methods reduce intensive parameter tuning work.
Although optical remote sensing images are effective for aquaculture information extraction, the complex and variable natural environment of the coastal areas where aquaculture is located can pose challenges. High levels of water vapor evaporation result in cloud and fog formation, limiting the imaging capability of optical remote sensing as a passive technique and severely affecting the availability of large-scale, high-quality optical remote sensing images [27]. To overcome these shortcomings, researchers have turned their attention to SAR image data. The key advantage of SAR as an active remote sensing technique is that its electromagnetic waves can penetrate clouds, fog, and certain water depths, enabling the monitoring of coasts and oceans and reducing the imaging limitations imposed by weather conditions [28], [29]. Furthermore, SAR can capture aquaculture areas that are not detectable in optical images due to weak reflection signals. Studies [30], [31], [32], [33] have shown that VV polarization in SAR images penetrates more deeply and highlights the characteristics of raft aquaculture areas; thus, SAR images with VV polarization can be combined with deep learning methods to extract information about these areas. These studies demonstrate that SAR images have great potential for aquaculture information extraction. However, SAR images are grayscale, lack spectral information, and contain significant speckle noise.
Taking into account the advantages and limitations of both optical and SAR image data, researchers investigated the potential of combining these data types for aquaculture information extraction. In previous studies [34], [35], [36], machine learning methods were applied to extract aquaculture areas using a combination of Sentinel1 SAR and Sentinel2 MSI images. The results demonstrated that the addition of SAR data significantly improved classification accuracy and reduced the omission of aquaculture areas compared to using only a single optical image. Building on these findings, this study aimed to leverage deep learning methods to extract offshore raft aquaculture areas by combining Sentinel1 SAR and Sentinel2 MSI data.
Most previous deep learning models focused on improving classical semantic segmentation networks. These models learn different features through convolutional and pooling layers, mapping low-dimensional shape information into a high-dimensional feature space and obtaining deep features of aquaculture areas by alternating convolution and pooling across multiple layers [37], [38]. However, this process inevitably loses target details and makes it difficult to exchange information across scales, resulting in the incomplete extraction of large targets and the omission of small targets. Recently, the Transformer has emerged as a new feature extraction approach in computer vision. It utilizes the self-attention mechanism, allowing all pixels to acquire global information. Some researchers have applied Transformers to image segmentation and detection tasks and achieved significant improvements in accuracy. In 2021, Liu et al. [39] proposed the Swin Transformer model. Due to its advantage in obtaining global information, an increasing number of researchers have applied this model to remote sensing image recognition [40], [41]. By incorporating the Swin Transformer into the semantic segmentation network UNet, researchers [40], [42], [43], [44] achieved satisfactory results on various publicly available remote sensing datasets. Therefore, the feasibility of combining the Swin Transformer with convolutional networks has been validated.
To sum up, the primary contributions of this article can be summarized as follows.
1) We proposed the SRUNet model for extracting raft aquaculture information in complex sea environments. The model uses the Swin Transformer block as its feature extraction network, which helps to extract multiscale feature information. The experimental results showed that the shapes of the extraction results matched the real shapes more closely and that more small targets could be identified, overcoming the lack of detail extraction in existing work.

2) We optimized the loss function of the SRUNet model, jointly using two loss functions, focal loss and dice loss, to alleviate the class imbalance between raft and nonraft areas and improve model accuracy.

3) We fully considered the advantages and disadvantages of optical and radar images. To compensate for the limitations of a single image source, Sentinel1 SAR images, Sentinel2 MSI images, and spectral feature indices (NDWI, NDVI, and a ratio Index) were used to enrich the feature information of the raft aquaculture areas.

II. STUDY AREA AND DATA

A. Study Area
China is a vast maritime nation with a lengthy coastline and abundant coastal resources. Taking advantage of its unique geography, China is home to numerous aquaculture areas, scattered from south to north along its coasts. In this study, we selected four offshore raft aquaculture areas with distinctive characteristics. They are located near Changhai County in Liaoning Province, Rongcheng Bay in Shandong Province, Haizhou Bay in Jiangsu Province, and Sansha Bay in Fujian Province, respectively. The geographical distribution of these four typical offshore raft aquaculture areas is illustrated in Fig. 1.
Changhai County, one of the eight major island counties in China, is situated in the northern Yellow Sea on the east side of the Liaodong Peninsula, between 38°55′N–39°18′N and 122°13′E–123°17′E. As illustrated in Fig. 1(a), the aquaculture area is mainly dispersed around the Changhai Islands and parts of the sea east of the Dalian Peninsula, exhibiting a fragmented distribution. Due to the deeper water and lower water temperature, the water appears darker in optical images, making it difficult to distinguish aquaculture areas using ordinary optical images alone.
Rongcheng Bay, situated at the easternmost point of the Shandong Peninsula, is surrounded by the Yellow Sea on the north, east, and south. The bay boasts rich biological resources, including rockweed and kelp, with ample water exchange inside and outside the bay, moderate water depth, and a large area. As illustrated in Fig. 1(b), the aquaculture area here is distinguished by its large scale, high density, and regular shape, and the rafts used here are larger and longer than those in the other areas.
Haizhou Bay, spanning 34°30′N–35°10′N and 119°10′E–119°40′E, sits at the intersection of Jiangsu Province and Shandong Province. This open sea area faces the Yellow Sea, and its favorable natural conditions foster an ideal growth environment for seaweed rafts. As depicted in Fig. 1(c), the aquaculture area is concentrated primarily within the bay and consists of small, regularly shaped rafts that are approximately square.
Sansha Bay, situated in the northeastern part of Fujian Province, China, lies between 26°27′N–26°49′N and 119°33′E–120°03′E. It is a typical semienclosed bay connected to the East China Sea only through a narrow channel, and the surrounding mountains protect it from typhoons and waves. The bay is enriched with abundant nutrients injected by several rivers, which provide an optimal growth environment for aquaculture. The aquaculture area in Sansha Bay, as depicted in Fig. 1(d), is mainly dispersed near the islands within the bay and presents a high density. Furthermore, the area combines various types of culture, such as raft aquaculture and net aquaculture, with different shapes and sizes.

B. Dataset and Processing
The primary dataset used in this study consisted of Sentinel1 SAR and Sentinel2 MSI images. The Sentinel satellite system is a critical component of the European Copernicus program, developed jointly by the European Union and the European Space Agency. Sentinel1 is composed of two satellites that provide repeat observations of C-band SAR data with a revisit period of 6 days. SAR images are active remote sensing images that can overcome the limitations of weather conditions such as clouds, rain, and fog; however, they have a single waveband and cannot provide rich spectral information. Sentinel2 MSI is composed of two multispectral satellites, A and B, and revisits the same location every 5 days. Optical remote sensing images provide rich spectral information but are susceptible to weather conditions that affect imaging. To leverage the advantages of both, this study employed a combination of Sentinel1 SAR and Sentinel2 MSI images to enhance the accuracy of offshore raft aquaculture area extraction, enabling effective long-term monitoring of aquaculture at a broader scale. These images are available on Google Earth Engine (GEE, https://earthengine.google.com/), facilitating rapid and multiscale analysis.
Based on relevant data and field inspection, we found that local fishermen typically begin deploying floating rafts and cultivating seaweed and other seafood around September of each year and finish harvesting around May of the following year, gradually recycling the floating rafts. Therefore, the images selected for this study were taken from September 2021 to May 2022, the planting season of 2021. Sentinel1 SAR images were selected in the VV polarization band, and Sentinel2 MSI images were selected in the red, green, and blue bands, with cloudiness controlled to below 5%. We resampled both image types to a resolution of 10 m. In addition, we further filtered the selected SAR images to eliminate those whose imaging quality was degraded by complex sea conditions, yielding the final image collection. To reconcile the different imaging times of the images, we calculated per-pixel mean values and used them to synthesize a composite image for each study area. The sizes of the four resulting images were 4446 × 10447 pixels for Changhai County, Liaoning Province; 6207 × 3325 pixels for Rongcheng Bay, Shandong Province; 6128 × 6128 pixels for Haizhou Bay, Jiangsu Province; and 3944 × 5348 pixels for Sansha Bay, Fujian Province.
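As a minimal sketch, composites of this kind can be assembled with the GEE Python API roughly as follows. The dataset IDs and filter properties are real GEE identifiers, but the study-area rectangle and the exact date window are illustrative assumptions, not the authors' exact configuration.

```python
import ee

ee.Initialize()

# Illustrative area of interest (roughly Sansha Bay); the paper's AOIs differ.
aoi = ee.Geometry.Rectangle([119.55, 26.45, 120.05, 26.82])

# Sentinel-2 MSI: RGB bands, <5% cloud, mean composite over the 2021 planting season.
s2 = (ee.ImageCollection('COPERNICUS/S2_SR')
      .filterBounds(aoi)
      .filterDate('2021-09-01', '2022-05-31')
      .filter(ee.Filter.lt('CLOUDY_PIXEL_PERCENTAGE', 5))
      .select(['B4', 'B3', 'B2'])   # red, green, blue (10 m)
      .mean())

# Sentinel-1 SAR: IW mode, VV polarization, mean composite to suppress speckle.
s1 = (ee.ImageCollection('COPERNICUS/S1_GRD')
      .filterBounds(aoi)
      .filterDate('2021-09-01', '2022-05-31')
      .filter(ee.Filter.eq('instrumentMode', 'IW'))
      .filter(ee.Filter.listContains('transmitterReceiverPolarisation', 'VV'))
      .select('VV')
      .mean())
```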
In particular, the approach taken in this study acknowledges that the Sentinel2 MSI images exhibit varying shades of color in the raft aquaculture areas due to factors such as sediment content and seawater color in different sea areas. To address this issue and enhance the feature information of the aquaculture areas, the three original bands of red, green, and blue from the Sentinel2 MSI images were used, along with additional feature information to assist the deep learning model in the extraction task. Spectral feature indices are introduced to widen the difference in spectral features between the raft aquaculture area and the background, thus improving the accuracy of the extraction results. After analyzing the spectral characteristics of the aquaculture area in each band of the Sentinel2 MSI images, it was observed that the offshore aquaculture area exhibited more pronounced differences in reflectance from the surrounding seawater in the near-infrared, red, and green bands. To emphasize the target aquaculture areas, this article selected spectral feature indices constructed from these three bands, namely the normalized difference water index (NDWI), the normalized difference vegetation index (NDVI), and a ratio index (Index) [14]. The first two indices are calculated as

$$\mathrm{NDWI} = \frac{G - \mathrm{NIR}}{G + \mathrm{NIR}}, \qquad \mathrm{NDVI} = \frac{\mathrm{NIR} - R}{\mathrm{NIR} + R}$$

where R represents the red band in the Sentinel2 MSI image, G represents the green band, and NIR represents the near-infrared band; the ratio Index is the band-ratio index constructed from the same bands as defined in [14]. These spectral indices can be calculated directly on the GEE platform and added to the original image bands.

Furthermore, the selected Sentinel1 SAR satellites offer four polarization modes: HH, VV, HH+HV, and VV+VH. Among them, the co-polarized bands (HH, VV) have better penetration capability than the cross-polarized bands, and observation of the raft aquaculture areas in the study area under different polarization types showed that the co-polarized images present more prominent characteristic information of the target areas. Since HH polarization is more often used for polar and sea ice monitoring, the VV polarization band was chosen as one of the bands for the experimental image data in this study. The RGB bands and spectral feature indices of the Sentinel2 MSI images and the Sentinel1 SAR images were then fused into experimental image data with seven channels. The appearance of the raft aquaculture areas in the Sentinel2 MSI images, Sentinel1 SAR images, and spectral indices is shown in Fig. 2.
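Continuing the GEE sketch above, the indices and the seven-channel fusion might be computed as follows. The NDWI and NDVI lines use the standard definitions; the `Index` line is a hypothetical placeholder only, since the exact ratio-index formula is the one given in [14].

```python
# Recompute the Sentinel-2 composite with the NIR band included.
s2_full = (ee.ImageCollection('COPERNICUS/S2_SR')
           .filterBounds(aoi)
           .filterDate('2021-09-01', '2022-05-31')
           .filter(ee.Filter.lt('CLOUDY_PIXEL_PERCENTAGE', 5))
           .select(['B4', 'B3', 'B2', 'B8'])   # red, green, blue, NIR
           .mean())

ndwi = s2_full.normalizedDifference(['B3', 'B8']).rename('NDWI')   # (G-NIR)/(G+NIR)
ndvi = s2_full.normalizedDifference(['B8', 'B4']).rename('NDVI')   # (NIR-R)/(NIR+R)
# Placeholder band ratio standing in for the Index of [14] (formula not reproduced here).
index = s2_full.select('B8').divide(s2_full.select('B3')).rename('Index')

# RGB + NDWI + NDVI + Index + VV -> the 7-channel experimental image.
fused = ee.Image.cat([s2_full.select(['B4', 'B3', 'B2']), ndwi, ndvi, index, s1])
```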
After acquiring the experimental image data, the target raft aquaculture areas were visually interpreted using ArcGIS software. The raft aquaculture areas were labeled as "1" and all other features as "0" to generate raster maps of the target feature labels. Each raster image and its corresponding label raster were uniformly cropped into tiles of 128 × 128 pixels to form the experimental dataset, with an overlap ratio of 0.075 between adjacent tiles to preserve image edge information. The dataset was then divided into training, validation, and test sets in a ratio of 6:2:2. To prevent overfitting during training, this study also applied data augmentation methods such as horizontal flipping, vertical flipping, and diagonal mirroring to expand the training set. The dataset processing details are illustrated in Fig. 3.
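A minimal sketch of this tiling and augmentation step, assuming channel-first numpy arrays; the function names and the reading of the 0.075 ratio as tile overlap are our assumptions.

```python
import numpy as np

def tile_with_overlap(img, label, size=128, overlap=0.075):
    """Crop an image (C, H, W) and label (H, W) pair into size x size tiles;
    neighbouring tiles share `overlap` of their extent to preserve edges."""
    stride = int(size * (1 - overlap))   # 118 px for size=128, overlap=0.075
    h, w = label.shape
    tiles = []
    for y in range(0, h - size + 1, stride):
        for x in range(0, w - size + 1, stride):
            tiles.append((img[:, y:y+size, x:x+size], label[y:y+size, x:x+size]))
    return tiles

def augment(img, label):
    """Return the original tile plus its horizontal, vertical, and diagonal mirrors."""
    return [
        (img, label),
        (img[:, :, ::-1].copy(), label[:, ::-1].copy()),    # horizontal flip
        (img[:, ::-1, :].copy(), label[::-1, :].copy()),    # vertical flip
        (img.transpose(0, 2, 1).copy(), label.T.copy()),    # diagonal mirror
    ]
```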

III. METHOD
In this section, we provide a detailed description of the SRUNet network model constructed in this study. First, we present the general framework of the network, followed by a comprehensive discussion of the encoder, decoder, and other essential components. The SRUNet network proposed in this article is designed to enhance the accuracy of extracting offshore raft aquaculture areas in complex marine environments. A simplified diagram of the SRUNet structure is depicted in Fig. 4. The overall structure of SRUNet is based on the simple yet elegant symmetric "U" structure of the classical semantic segmentation network UNet. The encoder employs the Swin Transformer block as the backbone network, which is responsible for extracting multiscale features from the images, and the residual block, consisting of multiple convolutional layers, is used as the backbone of the decoder. Convolution is a local operation that establishes relationships between pixels within a neighborhood, whereas the Swin Transformer is a global operation that establishes relationships among all pixels. The Swin Transformer block is therefore introduced to compensate for the inability of the convolution operation to interact with distant information, making the network better suited to the extraction of raft aquaculture areas. A structural sketch of this dataflow is given below.
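The following PyTorch skeleton illustrates the U-shaped dataflow just described. It is a sketch, not the authors' implementation: plain strided convolutions stand in for the Swin Transformer stages and residual blocks detailed in Sections III-A and III-B, and the channel widths are assumptions borrowed from the Swin defaults.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SRUNetSketch(nn.Module):
    """Structural sketch of SRUNet: hierarchical encoder, 3-block decoder,
    and UNet-style skip connections between matching scales."""
    def __init__(self, chans=(96, 192, 384, 768), n_classes=2):
        super().__init__()
        # Patch Partition + Linear Embedding: 7-channel input -> C tokens at H/4 x W/4.
        self.embed = nn.Conv2d(7, chans[0], kernel_size=4, stride=4)
        # Encoder stand-ins: each halves resolution and doubles channels.
        self.enc = nn.ModuleList(nn.Conv2d(chans[i], chans[i + 1], 2, stride=2)
                                 for i in range(3))
        # Decoder stand-ins: fuse upsampled features with the matching skip.
        self.dec = nn.ModuleList(
            nn.Sequential(nn.Conv2d(chans[i + 1] + chans[i], chans[i], 3, padding=1),
                          nn.ReLU(inplace=True))
            for i in reversed(range(3)))
        self.head = nn.Conv2d(chans[0], n_classes, kernel_size=1)

    def forward(self, x):
        x = self.embed(x)
        skips = [x]
        for stage in self.enc:
            x = stage(x)
            skips.append(x)
        skips.pop()                                  # deepest feature enters the decoder
        for block in self.dec:
            x = F.interpolate(x, scale_factor=2)     # restore resolution step by step
            x = torch.cat([x, skips.pop()], dim=1)   # UNet-style skip connection
            x = block(x)
        x = F.interpolate(x, scale_factor=4)         # back to the input resolution
        return self.head(x)                          # per-pixel raft / nonraft logits
```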

A. Swin Transformer Block
The Swin Transformer model was introduced by Liu et al. in 2021 [39] as a variant of the Transformer that employs a sliding window [45]. The model exploits locality by limiting the attention computation to a fixed-size window and builds hierarchical feature maps [46], [47]. These hierarchical features enable the Swin Transformer to leverage the multiscale feature processing techniques used for dense prediction, similar to the popular semantic segmentation network UNet. In comparison to other Transformer variants, the Swin Transformer is notably less computationally intensive and operates more efficiently. Furthermore, its introduction can effectively compensate for the limitations of convolutional networks in capturing global information. Therefore, the Swin Transformer is used as the encoder backbone in SRUNet. It is composed of four components: Patch Partition, Linear Embedding, the Swin Transformer block, and Patch Merging.
The input image is initially partitioned into nonoverlapping chunks of equal size using a 4 × 4 window in the Patch Partition layer. Each chunk contains 4 × 4 = 16 pixels, which are flattened along the channel direction. The image thus changes from H × W × 7 at the input to H/4 × W/4 × 112, expanding the number of channels from 7 to 112 while the resolution decreases from H × W to H/4 × W/4. Next, the chunks are passed through the Linear Embedding layer, which maps the feature dimension from 112 to C. The Swin Transformer block and the Patch Merging layer are then applied sequentially to generate feature information at different scales: the Swin Transformer block learns the features, and the Patch Merging layer performs a downsampling operation that halves the resolution of the input feature maps.
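In practice, Patch Partition and Linear Embedding are commonly fused into a single strided convolution, since a non-overlapping 4 × 4 convolution is equivalent to flattening each 4 × 4 × 7 patch (112 values) and projecting it linearly. A small sketch, with C = 96 assumed (the Swin-T/S default; the paper's exact C depends on the variant):

```python
import torch
import torch.nn as nn

# Patch Partition + Linear Embedding in one step: 7 input channels -> C = 96.
patch_embed = nn.Conv2d(in_channels=7, out_channels=96, kernel_size=4, stride=4)

x = torch.randn(1, 7, 128, 128)   # one 7-channel 128 x 128 training tile
tokens = patch_embed(x)           # -> (1, 96, 32, 32): H/4 x W/4 at C channels
print(tokens.shape)
```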
Compared to other variants of the Transformer model, the Swin Transformer block introduces the hierarchical construction commonly used in CNNs to build hierarchical Transformers. The Patch Merging layer in the Swin Transformer block serves a similar purpose to the pooling operation in CNNs. In the SRUNet model, the Patch Merging layer reduces the resolution of the feature map input to the layer by half while simultaneously expanding the dimensionality of the feature map to twice the input. Fig. 4(b) illustrates the detailed architecture of the Swin Transformer block, which consists of a LayerNorm (LN) layer, a windows multihead self-attention (W-MSA) block, a shifted windows multihead self-attention (SW-MSA) block, and a multilayer perceptron (MLP). The LN layer is similar to the batch normalization operation in CNNs, and the MLP adds nonlinearity to the network. W-MSA and SW-MSA are the core components of the Swin Transformer and are responsible for learning the features.
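To make the W-MSA/SW-MSA pairing concrete, the sketch below shows window partitioning and the half-window cyclic shift. The window size of 8 and the feature-map sizes are illustrative, and the attention mask that the real Swin block applies after the shift is omitted.

```python
import torch

def window_partition(x, ws):
    """Split a (B, H, W, C) feature map into non-overlapping ws x ws windows,
    returning (num_windows * B, ws*ws, C) token groups for windowed attention."""
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

feat = torch.randn(1, 32, 32, 96)          # e.g., the H/4 x W/4 x C feature map

# W-MSA: self-attention is computed independently inside each window.
windows = window_partition(feat, ws=8)      # (16, 64, 96)

# SW-MSA: cyclically shift by half a window first, so the next attention round
# exchanges information across the previous window boundaries.
shifted = torch.roll(feat, shifts=(-4, -4), dims=(1, 2))
shifted_windows = window_partition(shifted, ws=8)
print(windows.shape, shifted_windows.shape)
```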

B. Residual Block
The residual neural network (ResNet) was proposed by He et al. [48] at Microsoft Research. The ResNet structure accelerates the training of neural networks and significantly improves model accuracy. Compared to other classic networks, its key feature is the addition of a directly connected channel in each residual block, allowing the original input information to be passed directly to later layers. In the SRUNet model, the residual block is used as the backbone of the decoder. The residual block not only preserves the integrity of feature information during propagation but also effectively improves the utilization of location information, and it alleviates the gradient vanishing and degradation problems of deep networks.

The output of the Swin Transformer block is fed into the decoder, which consists of three residual blocks whose structure is shown in Fig. 4(c). In each residual block, the feature information passes through a 1 × 1 convolution, a 3 × 3 convolution, and a 1 × 1 convolution, with batch normalization and ReLU applied after each convolutional layer to speed up training, mitigate overfitting, and thus improve accuracy. At the shortcut connection, a 1 × 1 convolution layer is used to adjust the number of channels. After passing through a residual block, the size of the output feature map is restored to twice its input size, and the number of channels is halved. After three decoder operations, the size of the feature map gradually increases and the number of channels gradually decreases until the feature map is restored to the original resolution. Finally, a 1 × 1 convolutional layer predicts the raft aquaculture area, with the output being the same size as the original input image.
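A sketch of one such decoder block under the description above; the exact placement of the 2× upsampling relative to the convolutions is our assumption.

```python
import torch
import torch.nn as nn

class DecoderResBlock(nn.Module):
    """One SRUNet-style decoder block: 2x upsampling, a 1x1 -> 3x3 -> 1x1
    bottleneck with BN + ReLU after each convolution, and a 1x1 shortcut
    that adjusts the channel count for the residual addition."""
    def __init__(self, c_in):
        super().__init__()
        c_out = c_in // 2                   # channels are halved per block
        self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)
        self.body = nn.Sequential(
            nn.Conv2d(c_in, c_out, 1), nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
            nn.Conv2d(c_out, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
            nn.Conv2d(c_out, c_out, 1), nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
        )
        self.shortcut = nn.Conv2d(c_in, c_out, 1)   # channel-matching shortcut

    def forward(self, x):
        x = self.up(x)                       # output: 2x spatial size, half the channels
        return self.body(x) + self.shortcut(x)
```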
In the original UNet network structure, a skip connection is utilized to fuse multiscale features from the encoder with the upsampled features. This architecture is retained in SRUNet because downsampling during feature compression can cause a loss of useful spatial information. The skip connection structure connects shallow and deep features, reducing the loss of spatial information due to downsampling [49].

C. Loss Function
The loss function is used mainly in the training phase of the model. When the training dataset is fed into the model and the predicted values are output through forward propagation, the loss function calculates the difference between the predicted and true values, known as the loss value. Once the loss values are obtained, the model updates each parameter by backpropagation to reduce the difference between the true and predicted values, bringing the predictions closer to the true values and thus improving accuracy. In this article, focal loss [50] and dice loss [51] are used jointly

$$L = L_{\text{Focal}} + L_{\text{Dice}}$$

where $L_{\text{Dice}}$ is the dice loss function and $L_{\text{Focal}}$ is the focal loss function.

When building the training dataset, the concentration of raft aquaculture areas within certain local sea areas made the number of raft samples significantly smaller than the number of nonraft samples, producing a serious class imbalance between positive and negative samples. To alleviate this problem, the focal loss function is selected as part of the SRUNet loss to calculate the training error. Focal loss improves the standard cross-entropy function by increasing the weight of positive samples in the loss, allowing the model to focus more on them. It is calculated as

$$L_{\text{Focal}} = -\alpha (1 - y)^{\gamma} \log(y)$$

where α is a balancing factor that balances the number of positive and negative samples, γ is a modulation factor that reduces the loss of simple samples and makes the model focus more on difficult samples, α and γ are both hyperparameters, and y is the probability with which the model predicts the true category.

The dice loss function is mainly applied to binary classification problems, and the extraction of raft aquaculture is such a problem. Dice loss is derived from the dice coefficient, a measure of similarity between two samples ranging from 0 to 1, with higher values indicating greater similarity. The dice coefficient is defined as

$$\text{Dice} = \frac{2|X \cap Y|}{|X| + |Y|}$$

where $|X \cap Y|$ is the size of the intersection between the predicted set X and the ground-truth set Y, and |X| and |Y| are the numbers of elements in X and Y, respectively. Dice loss can therefore effectively handle the imbalance between positive and negative samples in semantic segmentation tasks. It can be expressed as

$$L_{\text{Dice}} = 1 - \frac{2|X \cap Y|}{|X| + |Y|}.$$

By jointly using focal loss and dice loss in place of the standard cross-entropy loss, the sample imbalance between raft and nonraft aquaculture areas is effectively alleviated, thereby improving the extraction accuracy of the model.
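A compact PyTorch rendering of this joint loss for the binary case. The α = 0.25 and γ = 2 defaults come from [50] rather than the paper, and the unweighted sum of the two terms is an assumption, since the paper does not state the term weights.

```python
import torch

def focal_loss(logits, target, alpha=0.25, gamma=2.0):
    """Binary focal loss on raw logits; target holds 0/1 labels."""
    p = torch.sigmoid(logits)
    t = target.float()
    pt = p * t + (1 - p) * (1 - t)            # probability of the true class
    at = alpha * t + (1 - alpha) * (1 - t)    # class-balancing factor
    return (-at * (1 - pt) ** gamma * torch.log(pt.clamp(min=1e-7))).mean()

def dice_loss(logits, target, eps=1.0):
    """1 - Dice coefficient between the soft prediction and the label map."""
    p = torch.sigmoid(logits).flatten()
    t = target.flatten().float()
    inter = (p * t).sum()
    return 1 - (2 * inter + eps) / (p.sum() + t.sum() + eps)

def joint_loss(logits, target):
    # Unweighted sum assumed; reweight the terms if the application needs it.
    return focal_loss(logits, target) + dice_loss(logits, target)
```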

D. Evaluation Metrics
The prediction results of the various models can be evaluated from two aspects: 1) qualitative analysis; and 2) quantitative analysis. The qualitative analysis considers whether the extracted target features are complete, whether they are misclassified, and whether the edge information of the extracted features is clear and consistent, among other things. For quantitative evaluation, the same test dataset is used as the baseline to estimate the extraction accuracy achieved by the different network models after training and to analyze their extraction results.
The main evaluation metrics used include precision, recall, F1 score, mean intersection over union (MIoU), and the Kappa coefficient. Precision is the proportion of predicted positive samples whose predictions are correct, and recall is the proportion of correctly extracted target features among all target features. Because precision and recall can constrain each other, the F1 score, the harmonic mean of precision and recall, is introduced to evaluate model performance by considering both together. The MIoU is the ratio of the intersection to the union of the ground-truth and predicted sets, averaged over classes, and provides a global evaluation of the classification results. The Kappa coefficient is used to measure classification accuracy and can also be used to test consistency. These metrics are calculated as

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$

$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

$$\text{MIoU} = \frac{1}{k} \sum_{i=1}^{k} \frac{TP_i}{TP_i + FP_i + FN_i}, \qquad \text{Kappa} = \frac{p_0 - p_e}{1 - p_e}$$

where TP is true positive, TN is true negative, FP is false positive, and FN is false negative; p_0 (the observed agreement) and p_e (the chance agreement) are calculated based on the confusion matrix; and k is the number of categories.
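For the binary raft/nonraft case, these metrics reduce to the confusion-matrix computation sketched below (numpy arrays of 0/1 values assumed).

```python
import numpy as np

def binary_metrics(pred, truth):
    """Compute precision, recall, F1, MIoU, and Kappa from 0/1 maps."""
    tp = np.sum((pred == 1) & (truth == 1))
    tn = np.sum((pred == 0) & (truth == 0))
    fp = np.sum((pred == 1) & (truth == 0))
    fn = np.sum((pred == 0) & (truth == 1))
    n = tp + tn + fp + fn

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    # MIoU: mean of the raft-class and background-class IoU.
    miou = 0.5 * (tp / (tp + fp + fn) + tn / (tn + fn + fp))
    # Kappa: observed agreement p0 vs. chance agreement pe.
    p0 = (tp + tn) / n
    pe = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
    kappa = (p0 - pe) / (1 - pe)
    return precision, recall, f1, miou, kappa
```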

IV. EXPERIMENTS AND RESULTS

A. Experiment Details
The experiments were conducted on a server equipped with an NVIDIA GeForce RTX 2080Ti 11GB GPU, and the models were implemented using the PyTorch framework. Training used the Adam optimizer with an initial learning rate of 0.01, a batch size of 16, 50 training epochs, and the Softmax function as the classifier.
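A standalone sketch mirroring this configuration. The one-layer model and the single random batch are placeholders so the snippet runs on its own; in the real pipeline, SRUNet and a proper dataloader take their places, and the joint focal + dice loss of Section III-C replaces the cross-entropy loss (whose internal softmax here plays the role of the classifier).

```python
import torch
import torch.nn as nn

model = nn.Conv2d(7, 2, kernel_size=1)                     # placeholder for SRUNet
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)  # Adam, initial lr 0.01

images = torch.randn(16, 7, 128, 128)                      # one batch of 16 tiles
labels = torch.randint(0, 2, (16, 128, 128))               # 0 = nonraft, 1 = raft

for epoch in range(50):                                    # 50 training epochs
    optimizer.zero_grad()
    logits = model(images)                                 # (16, 2, 128, 128)
    loss = nn.functional.cross_entropy(logits, labels)     # softmax applied internally
    loss.backward()
    optimizer.step()
```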

B. Experiment Results
To better evaluate the performance of the SRUNet model, we used the same dataset and experimental settings to train and test our model and the other neural network models. Table I shows the extraction accuracy of each neural network model on the test set. The main evaluation metrics are precision, recall, F1 score, MIoU, and the Kappa coefficient. The results show that the SRUNet model, based on the Swin Transformer block and the residual block, outperforms all other comparison models in the four metrics of recall, F1 score, MIoU, and Kappa coefficient.
In Table I, the experimental results show that the SRUNet model achieved the highest levels in the four metrics of recall, F1 score, MIoU, and Kappa coefficient, with clear improvements over all other models. The UNet model achieved unsatisfactory figures in all indicators; facing multiple complex sea environments, it showed poor adaptability and could not perform the extraction task well. The UNet++ model preserved feature information as much as possible through dense connections, which significantly improved its evaluation indicators. Comparing the ResNet series models with the DeepLab V3+ series models, although the ResNet series models achieve a high level in the precision metric, they perform far worse than the DeepLab V3+ series models in the recall metric, indicating that while the ResNet series models extract raft areas more precisely, they also miss more of them. The original Swin Transformer model and the SwinUnet model [52] constructed from Swin Transformer blocks were also slightly below the SRUNet model in the various indicators. In addition, we compared the computational complexity and parameter counts of the different models. As shown in Table I, compared with the UNet++ model of similar accuracy, the SRUNet model has a larger number of parameters, but its computational complexity is much lower, sitting at the middle level of all comparison models. Based on the above evaluation results, the effectiveness of the SRUNet model can be verified.

Fig. 5 shows the prediction results of the SRUNet model and the comparison models for raft aquaculture areas, with several special scenes selected to compare the extraction results of the different models. The first two rows of Fig. 5 correspond to the ground-truth environment and the corresponding labels, respectively, and the next nine rows show the extraction results of the different models. Yellow boxes identify scenes with missed aquaculture areas, and green boxes identify nonraft features incorrectly identified as raft areas. It is undeniable that every model shows some degree of misclassification and omission. Specifically, the UNet model shows more misclassifications and extracts aquaculture areas with varying degrees of incompleteness, which is not ideal. The ResNet series and DeepLab V3+ series models show similar results, performing significantly better than the UNet model in complex environments; however, the DeepLab V3+ series models show more severe "adhere" phenomena. The Swin Transformer and SwinUnet models show serious misclassification. The UNet++ model also achieved good extraction results. Our SRUNet model shows the best extraction results. From scene (e), it can be seen that SRUNet still extracts relatively complete results when facing small targets. In terms of extraction results, it overcomes the problems of convolutional neural network extraction to a certain extent and performs well in a variety of complex environments, although a small amount of omission remains.
The qualitative and quantitative analyses of each network model show that the SRUNet model is significantly superior to the other models. The SRUNet model is also the most faithful to the real labels, with clearer edge information. Therefore, the SRUNet model can perform the task of extracting offshore raft aquaculture areas very well.

C. Extended Experiments
Following the work in [37], we constructed four versions of the SRUNet model with Swin Transformer backbones of different scales. In Table II, we again use precision, recall, F1 score, MIoU, and the Kappa coefficient as evaluation metrics. The experiments show that Swin-L achieves the highest level in the four metrics of recall, F1 score, MIoU, and Kappa coefficient, with Swin-B the next best. From the perspectives of computational complexity and parameter count, however, both indicators rise severalfold with each step up in scale, and Swin-L has the greatest number of parameters and the highest computational complexity. Overall, the four versions of the SRUNet model do not differ much in each metric, and accuracy generally increases as the complexity of the network increases.

V. DISCUSSION
In this section, we discuss four main aspects: 1) the impact of each component of the SRUNet model; 2) the impact of the loss function used in the SRUNet model; 3) the impact of the input dataset; and 4) the application of the SRUNet model. It is important to note that, given the computational complexity and parameter count of the network, the SRUNet models used in this section are all the Swin-B version.

A. Ablation Study of Backbone
The SRUNet model is based on the UNet model and uses the Swin Transformer block and the residual block. Therefore, we conducted ablation experiments on the backbone, comparing SRUNet with SwinUnet and SCUNet. SwinUnet [52] is an improved UNet model based purely on Swin Transformer blocks, while the SCUNet model uses the Swin Transformer block for the encoder and a traditional convolutional block for the decoder.
In Table III, the experimental results show that applying the Swin Transformer significantly improves the extraction accuracy of the model. This shows that relying only on the local feature information obtained by traditional convolution is insufficient for the extraction task in a complex sea environment; on the contrary, the strength of the Swin Transformer in global feature information can effectively compensate for the shortcomings of convolution. On the other hand, using convolutional networks in the decoder fuses the multiscale information and further improves the accuracy of the model.

B. Ablation Study of Loss Function
The SRUNet model replaces the standard cross-entropy loss function (CE loss) with focal loss and dice loss. To verify the effect of the optimized loss function on model accuracy, four sets of experiments were set up: CE loss in the first group, dice loss in the second group, focal loss in the third group, and the combination of dice and focal loss in the fourth group.
In Table IV, the experimental results show that, compared with the CE loss function, the accuracy of the SRUNet model improves significantly when either of the two loss functions is used alone, and the two contribute to a similar extent, with little difference between them. When the two loss functions are used jointly, the accuracy of the model improves further. Therefore, both loss functions contribute significantly to model accuracy: in the face of the serious imbalance between positive and negative samples in raft aquaculture areas, they allow the model to focus better on the positive samples, thereby improving its accuracy.

C. Ablation Study of Dataset
The input dataset for the SRUNet model is a fusion of Sentinel2 MSI and Sentinel1 SAR images and spectral feature indices. Five sets of experiments were set up to verify the effect of the input dataset on experimental accuracy: Group1 uses the RGB bands of the Sentinel2 MSI images; Group2 adds NDWI to Group1; Group3 further adds NDVI; Group4 further adds the ratio Index; and Group5 further adds the SAR band.
In Table V, the experimental results show that accuracy improves significantly after the addition of NDWI, indicating that NDWI greatly enriches the feature information of the raft aquaculture area in the images. The NDVI, Index, and SAR bands added afterward also improve the extraction accuracy of the model to some extent. Therefore, the combined use of Sentinel2 MSI and Sentinel1 SAR, together with the associated feature indices, can effectively enhance the feature information of the raft aquaculture area and help improve the extraction accuracy of the aquaculture area. In summary, the reliability of the combined use of Sentinel2 MSI and Sentinel1 SAR images for offshore raft aquaculture area extraction is verified.

D. Model Applications
To further validate the reliability of the SRUNet model, we selected Sentinel1 SAR and Sentinel2 MSI images of the four study areas from 2020, acquired and processed in the same way as described above. We then employed the SRUNet model established in this article to predict the offshore raft aquaculture areas in the study areas in 2020, obtaining the distribution of culture areas illustrated in Fig. 6. Compared with the original 2020 images, the result maps extracted by the SRUNet model were consistent with the distribution range of coastal marine aquaculture areas in the real images. The aquaculture areas within each sea area were extracted with high accuracy, and other features such as net aquaculture areas, coastal land, and seawater were not identified as raft aquaculture areas. Furthermore, we performed a quantitative evaluation of the extracted 2020 aquaculture distribution map. Specifically, we randomly selected 200 sampling points in each sea area and visually interpreted the real ground conditions at the selected points, using Sentinel2 MSI images supplemented with high-resolution Google Earth images as the judgment basis. We chose overall accuracy (OA) and the Kappa coefficient as evaluation metrics for each of the four sea areas: OA represents the percentage of correctly classified pixels, and the Kappa coefficient assesses the agreement between the predicted and actual results, with higher values of both indicating better extraction results. The results showed that the OA of the four seas, the Changhai County sea area in Liaoning, Rongcheng Bay in Shandong, Haizhou Bay in Jiangsu, and Sansha Bay in Fujian, was 96.70%, 94.93%, 95.97%, and 96.21%, respectively, and the corresponding Kappa coefficients were 0.76, 0.82, 0.88, and 0.81. In summary, the improved SRUNet can reliably identify offshore raft aquaculture areas in medium-resolution images.
In Fig. 6, the extraction results for selected scenes are presented in the third row. The SRUNet model achieved outstanding extraction results in various complex scenes, including densely distributed aquaculture areas, turbid and sediment-laden waters, mixed aquaculture areas of different modes, and aquaculture areas with indistinct optical features. The model successfully extracts the aquaculture areas while avoiding misidentifying other nonraft aquaculture areas and land as raft aquaculture areas. Overall, the SRUNet model demonstrated excellent performance and accuracy in extracting offshore aquaculture areas and has the potential for large-scale extraction of seawater aquaculture areas. However, some raft aquaculture areas in certain waters were not identified in the 2020 distribution map, indicating a certain degree of omission, and the "adhere" phenomenon is observed in densely distributed waters. These issues require further investigation in future studies.
We also compared our results with a previous related study, in which Fu et al. used the HCHNet model [10] to extract offshore aquaculture information for China in 2019. Unlike other publicly available datasets, which label continuous aquaculture patches, the targets extracted in both studies are individual aquaculture areas, so the two results are directly comparable. We overlaid the extraction results of the two deep learning models and selected several different scenes for comparison, as shown in Fig. 7. The comparison shows that the extraction results of the SRUNet model are more consistent with the real feature shapes, with the phenomena of omission and misclassification significantly reduced. Some "adhere" phenomena still appear in dense aquaculture areas, but they are reduced compared with the HCHNet model.

VI. CONCLUSION
In this article, we proposed SRUNet for offshore raft aquaculture area extraction. Different from existing raft aquaculture information extraction network models, SRUNet introduces the Swin Transformer as the feature extraction network to obtain multiscale feature information of raft aquaculture areas, and uses the residual block to compensate for the lack of local information. This effectively overcomes the drawback that a pure Swin Transformer block focuses only on global feature information and allows raft aquaculture areas to be extracted more completely. To evaluate the extraction performance of SRUNet, four typical raft aquaculture areas were selected for experiments, and SRUNet was compared with several neural networks. Through quantitative and qualitative comparisons, the effectiveness of SRUNet was demonstrated. We also evaluated the impact of different backbones, loss functions, and datasets on SRUNet. In addition, we applied the SRUNet model in practice to extract the aquaculture information of the study areas in 2020. In this practical application, however, we found that SRUNet still suffers from problems such as omission and "adhere." Therefore, in the future, we will collect more samples and further optimize the network model to achieve large-scale extraction of aquaculture information.