SA$^{2}$Net: Ship Augmented Attention Network for Ship Recognition in SAR Images

Maritime surveillance is of broad concern to authorities worldwide, and ship recognition in synthetic aperture radar (SAR) images is a significant and fundamental component of it. Although progress has been made in SAR ship recognition, two areas remain inadequately explored: the comprehensive utilization of multiscale features and the use of prior knowledge of the ship shape. In this article, a novel ship augmented attention network (SA$^{2}$Net) for ship recognition is proposed, which comprehensively utilizes the multiscale features and integrates the ship shape prior into an end-to-end network. On the one hand, because different scales contribute unequally, a scale attention module (SAM) is proposed to adaptively select and assign weights to desired feature scales while disregarding irrelevant scales. Moreover, a feature weaving module (FWM) is constructed to merge semantic and detailed features produced by the high-to-low backbone, enriching representations across all scales of ship targets. On the other hand, to incorporate the prior knowledge of the ship shape into the network, we develop a feature augmentation module (FAM) to further boost the ship recognition accuracy. This module provides rectangular receptive fields that align with the shape of ships, which traditional square convolutions cannot offer. Comprehensive experiments on the representative three- and six-category OpenSARShip tasks and the seven-category FUSAR-Ship task show that SA$^{2}$Net outperforms current state-of-the-art methods.

Ji Yang is with the Unit 31308 of the PLA, China (e-mail: yangji@163.com).
Jianqi Wu is with the East China Research Institute of Electronic Engineering, Hefei 230031, China, and also with the School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu 610056, China (e-mail: wujianqi38@163.com).
Digital Object Identifier 10.1109/JSTARS.2023.3317489

In maritime surveillance tasks, ship monitoring plays a key role in both military and civil activities, such as trade management, marine traffic, transportation monitoring, and national maritime safeguarding [1]. Automatic identification system (AIS) and vessel traffic service are conventional techniques for ship monitoring. However, neither of these techniques is sufficient to achieve general-purpose vessel monitoring with the required independence, temporal coverage, and spatial coverage [2]. Synthetic aperture radar (SAR), with its all-day monitoring capability, can monitor large areas independently of meteorological conditions [3]. It stands out as an effective alternative and has been extensively studied for ship recognition in recent decades.
In the past few decades, many hand-crafted feature methods have been introduced for SAR ship recognition, such as scattering statistics features, texture features, geometric features, moment features, scale-invariant features [4], and HOG features [5]. To increase the recognition accuracy, machine learning classifiers are used jointly with these features, including K-nearest neighbor (KNN) [6], support vector machine (SVM) [7], and random forest (RF) [8].
Although these traditional recognition algorithms have produced good results, they rely on hand-crafted features. Such features may suit specific data but lack adaptability. In contrast, deep learning leverages neural networks to extract features autonomously through end-to-end learning. This data-driven paradigm has achieved remarkable success across various domains, particularly in image recognition [9]. In recent years, convolutional neural network (CNN)-based methods have become the mainstream for SAR ship recognition [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], [21], [22], [23], [24], [25]. To improve recognition accuracy, many studies have been put forward and have achieved good SAR ship recognition performance in various aspects. To solve the issue of class imbalance, Li et al. [12] proposed a dense residual network (DRNet) combining upsampling data augmentation and ratio batching. Shao et al. [13] proposed a balanced batch-based sampling method to avoid learning imbalance during training. Zhang et al. [14] presented a method for training CNNs that integrates deep metric learning (DML) with progressively balanced sampling. Raj et al. [15] proposed a one-shot learning-based deep learning model. To address the problem of small training datasets caused by the scarcity of available data, Lu et al. [16] established a CNN with data augmentation. Yuanyuan et al. [17] studied small-sample SAR ship recognition based on transfer learning. To tackle the challenge of large intraclass variation and small interclass separation of ships, Xu and Lang [18] and He et al. [19] used a DML scheme to expand the distance between different classes. To address the weak robustness of individual models in high-risk scenarios, Zheng et al. [10] introduced an automated approach for ensemble modeling of heterogeneous deep convolutional neural networks (DCNNs), employing a two-stage filtration process. This self-configuring algorithm dynamically determines the optimal combination of base classifiers by automatically identifying the suitable types and quantities.
Although these approaches have achieved notable success, most of the aforementioned works focus on iterative modifications of network structures, training-trick optimizations, loss function adjustments, and so on, rather than designing a task-specific network from the characteristics of SAR images and empirical knowledge of ships. In recent years, a growing number of scholars have taken notice of this issue. Huang introduced a new Deep SAR-Net [20] that considers complex-valued SAR images to learn the spatial texture information and backscattering patterns of ships. Zhang et al. [21] fused HOG features into CNNs and proposed four mechanisms to ensure superior recognition accuracy. Zeng et al. [22] designed a hybrid channel feature loss that jointly utilizes the information contained in the polarized channels (VV and VH). He et al. [23] established a group bilinear pooling and an MPFL loss to fully exploit dual-polarized SAR images for fine-grained ship recognition. Xiong et al. [24] developed a miniature hourglass region extraction network dedicated to dual-channel feature fusion. Zhang and Zhang [25] designed SE-LPN-DPFF to perform dual-polarization feature fusion and balance the contribution of each polarization feature.

Although the above methods achieve good results, there is still room for further improvement in network performance. First, the comprehensive utilization of multiscale features is of paramount importance for SAR ship recognition, yet it remains inadequately investigated. Ships usually appear with diverse sizes, which makes it challenging to achieve state-of-the-art (SOTA) recognition results using single-scale CNN features [26]. This is why maximizing the use of multiscale features is crucial. Xu [27] and Zhang [21] have made some preliminary explorations of this issue. However, they simply flattened the multiscale features without fusing them, and thus failed to provide sufficient features at all scales. In addition, these previous works aggregated the multiscale CNN features using unified weights, e.g., a simple summation, which ignores the different importance of different scales. Second, the ship class in SAR imagery has special shape prior characteristics. To the best of our knowledge, no work has yet integrated the ship shape prior into an end-to-end network to perform SAR ship recognition.
Based on the analysis above, we propose a task-specific ship augmented attention network (SA$^{2}$Net) that comprehensively utilizes the multiscale features and integrates the ship shape prior into an end-to-end network. Within SA$^{2}$Net, the feature weaving module (FWM) is designed to generate rich and reliable representations at all scales. The scale attention module (SAM) is constructed to select and assign weights to relevant feature scales while disregarding irrelevant scales. The feature augmentation module (FAM) is designed to enhance ship features by incorporating the prior knowledge of the ship shape. Comprehensive experiments demonstrate the superiority of SA$^{2}$Net compared with several SOTA methods. In contrast to previous works, the novelties and contributions can be summarized as follows:

1) We propose SA$^{2}$Net, which jointly applies SAM and FWM to fully exploit the multiscale features of ship targets. The shallow-scale features contain more detailed information while the deep-scale features contain more semantic information, and the two are unequally effective for recognition. Instead of simply combining different-scale features, the proposed SAM controls the information flow of different scales using a learned weight vector, and then adaptively selects and assigns weights to desired feature scales while disregarding irrelevant scales. The proposed FWM aggregates semantic and detailed features of different scales by integrating high-level semantic information and low-level detailed information through a weaving-like process, resulting in rich representations at all scales.

2) FAM is proposed for the first time in this article to leverage empirical knowledge regarding ships, which are commonly elongated and narrow. In contrast to prior approaches that employ square convolution kernels for ship feature extraction, the FAM introduces rectangular convolutions. This design provides rectangular receptive fields that align with the shape of ships, which traditional square convolutions cannot offer.

3) We conduct extensive experiments on the benchmark datasets OpenSARShip [28] and FUSAR-Ship [29]. The results show that SA$^{2}$Net outperforms existing methods, including traditional feature-based methods, classic object recognition CNNs, and novel task-specific SAR ship recognition CNNs. The experimental results demonstrate the effectiveness of our method.

The rest of this article is organized as follows. In Section II, we present the details of the proposed SA$^{2}$Net. In Section III, implementation details are reported and extensive experimental results are provided. Section IV presents the conclusion.

A. Network Structure
The overall framework of our ship augmented attention network (SA$^{2}$Net) for SAR ship recognition is illustrated in Fig. 1. To recognize ships of diverse sizes, we propose SAM and FWM, which comprehensively utilize multiscale features. In addition, FAM is designed to enhance ship features by incorporating the prior knowledge of the ship shape. In SA$^{2}$Net, the pretrained ResNet-50 [30] is used as the backbone for its strong performance in feature extraction. Given a SAR ship image, the FWM integrates high-level semantic information and low-level detailed information by repeatedly fusing the representations produced by the high-to-low backbone to obtain better representations at all scales. Besides, in view of the distinctive prior characteristics pertaining to the shape of the ship class, FAM has been devised to augment ship features by integrating the prior knowledge of ship shape. Finally, to select the effective feature scales for the final recognition, SAM utilizes the relevance scores to select and assign weights to relevant feature scales while disregarding irrelevant scales. By exploiting the benefits of the multiscale features and integrating the ship shape prior, this network achieves more accurate SAR ship recognition. Details are provided in Algorithm 1.

Fig. 1. Overall architecture of the ship augmented attention network (SA$^{2}$Net). The architecture consists of a backbone for feature extraction and three modules to refine the extracted features for final precise recognition. ResNet-50 is adopted as the backbone due to its impressive performance. The three modules are FWM, FAM, and SAM.

Algorithm 1: SA$^{2}$Net for SAR Ship Recognition.
...
Gain final enhanced features $f$ with $f_i$ and $w$;
11: Predict the SAR ship recognition scores

B. Feature Weaving Module
Robust SAR ship recognition across diverse sizes is challenging when using a single-scale feature representation from a CNN. Leveraging the multiscale features obtained from intermediate layers of the CNN is a viable solution. In a CNN, the receptive field of a layer becomes larger as the layer becomes deeper. The feature maps obtained from the lower layers focus on detailed information, while the feature maps obtained from the deeper layers focus on semantic information. Inspired by HRNet [31], and with a focus on making sufficient use of multiscale features to obtain rich representations at all scales, FWM is proposed.
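The growth of the receptive field with depth can be made concrete with the standard recurrence for stacked convolutions. The sketch below is illustrative only: the function name `receptive_field` and the toy layer stack are assumptions for this example and do not correspond to the actual ResNet-50 configuration.

```python
def receptive_field(layers):
    """Return the receptive field (in input pixels) after each layer.

    `layers` is a list of (kernel_size, stride) pairs, applied in order.
    """
    rf, jump = 1, 1  # receptive field and cumulative stride at the input
    sizes = []
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the field by (k - 1) input-space steps
        jump *= stride             # deeper layers step over more input pixels
        sizes.append(rf)
    return sizes


# Hypothetical toy stack: a 7x7 stride-2 stem followed by three 3x3 stride-2 stages.
toy_stack = [(7, 2), (3, 2), (3, 2), (3, 2)]
print(receptive_field(toy_stack))
```

Each deeper stage sees a wider patch of the input, which is why shallow maps retain fine detail while deep maps summarize larger structures.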
As shown in Fig. 2, FWM mines and combines the feature maps of different scales through a feature fusion mechanism called feature weaving. It generates rich and reliable feature representations by repeatedly fusing the representations produced by the high-to-low backbone convolutional layers. The details of FWM are presented as follows.
ResNet-50 is utilized as the backbone. The output of the last layer of the residual blocks at the Conv3, Conv4, and Conv5 levels of ResNet-50 is denoted as $C_i$ ($i = 3, 4, 5$). The specific pattern of generating each $M_i$ layer corresponding to the $C_i$ layer is shown in Fig. 2(a), with $i \in \{3, 4, 5\}$. Higher level, same level, and lower level features are processed by bilinear interpolation upsampling, 1 × 1 convolution, and convolution layer downsampling, respectively. In this step, the channel dimension is uniformly adjusted to 256. Finally, the different layers are consolidated with elementwise summation. In the FWM computation, Conv(·) denotes the 1 × 1 convolution that aligns the channel dimensions, ConvU(·) is the bilinear interpolation upsampling, and ConvD(·) indicates downsampling by a 3 × 3 convolution with stride 2.
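Read literally, the description above suggests the following form for the three woven outputs. This is a sketch inferred from the text rather than a verbatim formula. It assumes that every source level is resampled directly to the target resolution (so that ConvU and ConvD are applied as many times as needed, e.g., twice between levels 3 and 5) and that the resampling paths also bring the channels to 256.

```latex
\begin{aligned}
M_3 &= \mathrm{Conv}(C_3) + \mathrm{ConvU}(C_4) + \mathrm{ConvU}(C_5),\\
M_4 &= \mathrm{ConvD}(C_3) + \mathrm{Conv}(C_4) + \mathrm{ConvU}(C_5),\\
M_5 &= \mathrm{ConvD}(C_3) + \mathrm{ConvD}(C_4) + \mathrm{Conv}(C_5).
\end{aligned}
```

Under this reading, each $M_i$ receives one contribution from every backbone level, so detailed and semantic information is available at all three scales.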

C. Feature Augmentation Module
As shown in Fig. 4, the ship class in SAR images exhibits a prominent geometric characteristic: a large aspect ratio. In addition, in contrast to natural images captured from a horizontal view, SAR images are acquired from a top-down perspective, so objects appear at arbitrary orientations. Traditional convolution operations commonly employ square kernels such as 3 × 3 and 5 × 5, which are well suited for capturing block-like structures such as vehicles and buildings. However, the strip-like structure and arbitrary orientations of ships pose challenges for effective feature extraction with traditional square kernels. Therefore, rectangular convolutions with different directions are introduced, which provide rectangular receptive fields that match the shape and arbitrary orientations of ships. We develop the FAM to replace the traditional square convolution with a combination of a square convolution and four rectangular convolutions implemented through separate branches. The original features are preserved by the square convolution branch, while the horizontal, vertical, left diagonal, and right diagonal convolutions refine the details by providing rectangular receptive fields.

Fig. 3 shows the structure of FAM, in which five branches with different kernel shapes are organized in parallel. Assume that the convolutional layers take a $C$-channel feature map as input. As illustrated in Fig. 3, FAM incorporates rectangular convolutions in four distinct orientations: horizontal, vertical, left diagonal, and right diagonal. Concretely, to replace a 3 × 3 square kernel $S \in \mathbb{R}^{3 \times 3 \times C}$, FAM comprises five parallel branches: four rectangular convolution kernels and a square convolution kernel. The horizontal kernel $S_1$, vertical kernel $S_2$, left diagonal kernel $S_3$, and right diagonal kernel $S_4$ are rectangular kernels that align with the shape of ships. Let $M_l \in \mathbb{R}^{H \times W \times C}$ be the input of FAM, with $l \in \{3, 4, 5\}$. $M_l$ is fed into five juxtaposed paths, yielding five output feature maps $M_{lh}, M_{lv}, M_{ll}, M_{lr}, M_{ls} \in \mathbb{R}^{H \times W \times C}$. These five feature maps are then concatenated to obtain $\tilde{M}_l \in \mathbb{R}^{H \times W \times 5C}$. This process can be described as

$$M_{lh} = M_l * S_1, \quad M_{lv} = M_l * S_2, \quad M_{ll} = M_l * S_3, \quad M_{lr} = M_l * S_4, \quad M_{ls} = M_l * S$$

$$\tilde{M}_l = \mathrm{cat}(M_{lh}, M_{lv}, M_{ll}, M_{lr}, M_{ls})$$

where $*$ indicates the convolution operation and cat is the concatenation operator.

We define comp(·) as a composite function that produces the final output of FAM. comp(·) consists of three consecutive operations: batch normalization (BN), a rectified linear unit (ReLU), and a 3 × 3 convolution (conv). For the concatenated feature map $\tilde{M}_l$, the corresponding output of FAM can be denoted as

$$A_l = \mathrm{comp}(\tilde{M}_l), \quad l \in \{3, 4, 5\}.$$

The use of the cat(·) operator and the comp(·) function, rather than a simple summation of the outputs of the five branches, is motivated by DenseNet [32]. First, combining $M_{lh}, M_{lv}, M_{ll}, M_{lr}, M_{ls}$ by summation may impede the information flow in the network [32], leading to insufficient ship feature extraction. Second, the statistical characteristics of the five juxtaposed branches differ from each other, e.g., there may be large differences in the mean and variance of the pixels in each branch. It is therefore important to apply the batch normalization (BN) [33] layer after concatenation rather than before, which is the key design point of the comp(·) function. Applying the BN layer before the concatenation operator may result in an internal covariate shift in the new feature maps, reducing the generalization capability of the network. Finally, the outputs $A_3$, $A_4$, $A_5$ are fed into SAM for the next step.
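The effect of orientation-specific receptive fields can be illustrated with a small pure-Python sketch. It makes several simplifying assumptions that are not taken from the paper: a single input channel, fixed binary kernels instead of learned weights, diagonal kernels modeled as masked 3 × 3 supports, and a dictionary standing in for the channel concatenation. The names `conv2d_same`, `KERNELS`, and `fam_branches` are hypothetical.

```python
def conv2d_same(image, kernel):
    """Single-channel 3x3 cross-correlation with zero padding (output keeps H x W)."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            total = 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < h and 0 <= cc < w:  # zero padding outside the image
                        total += image[rr][cc] * kernel[dr + 1][dc + 1]
            out[r][c] = total
    return out


# Assumed supports for the five branches (1 = active tap, 0 = masked tap).
KERNELS = {
    "horizontal": [[0, 0, 0], [1, 1, 1], [0, 0, 0]],
    "vertical":   [[0, 1, 0], [0, 1, 0], [0, 1, 0]],
    "left_diag":  [[1, 0, 0], [0, 1, 0], [0, 0, 1]],
    "right_diag": [[0, 0, 1], [0, 1, 0], [1, 0, 0]],
    "square":     [[1, 1, 1], [1, 1, 1], [1, 1, 1]],
}


def fam_branches(image):
    """Run the five parallel branches; the dict plays the role of the 'cat' step."""
    return {name: conv2d_same(image, k) for name, k in KERNELS.items()}


# Toy 5x5 chip containing a horizontal "ship" three pixels long.
chip = [[0] * 5 for _ in range(5)]
for col in (1, 2, 3):
    chip[2][col] = 1

responses = fam_branches(chip)
center = {name: fmap[2][2] for name, fmap in responses.items()}
print(center)
```

At the center of the toy target, the horizontal branch collects the whole ship with three taps, matching the square branch that needs nine, while the other orientations respond only to the center pixel. In the actual module the kernel weights are learned and a BN-ReLU-conv composite follows the concatenation.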

D. Scale Attention Module
Most previous works aggregate the multiscale CNN features for ship recognition using unified weights, e.g., a simple summation, which ignores the unequal effectiveness of different scales. To address this problem, we propose SAM, as shown in Fig. 5. This module weights the desired feature scales according to the relevance scores between each scale and the final recognition probabilities, selecting the effective feature scales while excluding irrelevant ones.
Let $f_i \in \mathbb{R}^{d}$ ($i = 1, 2, 3$) denote the feature vectors of the three scales. They are concatenated as

$$f = \mathrm{cat}(f_1, f_2, f_3)$$

where $f \in \mathbb{R}^{3d}$ is the feature vector after concatenation.

To make the module automatically select the desired feature scales and obtain preferable recognition scores, SAM generates learned scale relevance scores. The weight vector $p = [w_1, w_2, w_3] \in \mathbb{R}^{3}$ for selecting feature scales is computed from $f$ using the attention weight $w_a \in \mathbb{R}^{3d \times 3}$, which maps the concatenated features of the distinct scales to a three-dimensional weight vector. Based on the predicted weights $w_i$, the scale features $f_i$ are weighted and summed to obtain the final enhanced features $\hat{f} \in \mathbb{R}^{d}$ for SAR ship recognition:

$$\hat{f} = \sum_{i=1}^{3} w_i f_i.$$

Subsequently, a fully connected (FC) layer and a softmax function produce the final recognition result.
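The weighting step can be sketched in a few lines of plain Python. The normalization of the relevance scores is an assumption here: a softmax is used so that the three weights are positive and sum to one, which may differ from the activation used in the actual module. The function names and the toy inputs are likewise hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a short list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scale_attention(feats, w_a):
    """Fuse three d-dimensional scale features with learned scale weights.

    feats: list of three vectors of length d.
    w_a:   attention weight given as 3d rows of 3 values each (a 3d x 3 matrix).
    """
    f_cat = [x for f in feats for x in f]  # concatenation, length 3d
    scores = [sum(f_cat[k] * w_a[k][j] for k in range(len(f_cat))) for j in range(3)]
    weights = softmax(scores)              # assumed normalization of the relevance scores
    d = len(feats[0])
    fused = [sum(weights[i] * feats[i][k] for i in range(3)) for k in range(d)]
    return weights, fused


# Toy example with d = 2; the attention matrix favors the third scale.
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_a = [[0.0, 0.0, 1.0] for _ in range(6)]
weights, fused = scale_attention(feats, w_a)
print(weights, fused)
```

With an all-zero attention matrix the three scales receive equal weight, which reduces to plain averaging; a trained $w_a$ lets the network shift weight toward the scales that matter for a given input.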

III. EXPERIMENTS AND RESULTS
In this section, we perform extensive experiments to verify the effectiveness of the proposed method on the benchmark datasets OpenSARShip and FUSAR-Ship. First, we describe the datasets and give the dataset settings. Then, we present the implementation details, including image preprocessing, parameter settings, loss function, evaluation metrics, and backbone. Next, the experimental results are presented for OpenSARShip under a three-category recognition task and a six-category recognition task, and for FUSAR-Ship under a seven-category recognition task. In addition, we perform a comprehensive comparison between the proposed method and several SOTA methods, encompassing traditional classifiers, classic CNN methods, and modern CNN methods designed for SAR ship recognition. Finally, ablation studies and a discussion are presented.

A. Dataset

1) OpenSARShip: In our study, we utilize OpenSARShip, a benchmark dataset built from Sentinel-1 satellite imagery. OpenSARShip possesses five critical properties, namely specificity, large-scale coverage, diversity, reliability, and public availability, which collectively contribute to its significant value in practical applications. The ship labels in the OpenSARShip dataset are assigned through a semiautomated process with support from AIS, which ensures their accuracy. The data used in our experiments are ground range detected (GRD) products captured in the Sentinel-1 IW mode, with a resolution of 20 m × 22 m and a pixel size of 10.0 m × 10.0 m in both the azimuth and range directions [28]. Based on OpenSARShip, two recognition datasets are constructed. Fig. 6 shows some SAR ship samples in OpenSARShip.
a) Three-Category: Container ships, tankers, and bulk carriers are chosen to establish the representative dataset. These three classes are the most common and representative ships, occupying 80% of the international shipping market [28]. The number of ships in each class is uneven in OpenSARShip. To avoid the effect of class imbalance, the number of training samples is set to be equal for each class. Table I shows the training-testing sets of the three-category dataset.
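A class-balanced training split of this kind can be sketched as follows. The helper `balanced_split`, the fixed seed, and the toy class sizes are assumptions made for illustration; they are not the counts in Table I, and the paper does not state how the samples were drawn.

```python
import random


def balanced_split(samples_by_class, n_train, seed=0):
    """Draw the same number of training samples from every class.

    The remaining samples of each class form the (possibly unbalanced) test set.
    """
    rng = random.Random(seed)
    train, test = {}, {}
    for label, samples in samples_by_class.items():
        if len(samples) < n_train:
            raise ValueError(f"class {label!r} has fewer than {n_train} samples")
        shuffled = list(samples)
        rng.shuffle(shuffled)
        train[label] = shuffled[:n_train]
        test[label] = shuffled[n_train:]
    return train, test


# Made-up class sizes, purely for illustration.
toy = {
    "bulk_carrier": [f"bulk_{i}" for i in range(10)],
    "container":    [f"cont_{i}" for i in range(7)],
    "tanker":       [f"tank_{i}" for i in range(5)],
}
train, test = balanced_split(toy, n_train=4)
print({k: len(v) for k, v in train.items()}, {k: len(v) for k, v in test.items()})
```

Every class contributes the same number of training chips, so the classifier is not biased toward the most frequent class during training, while the test set keeps whatever samples remain.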

TABLE II
TRAINING-TESTING DIVISION OF THE SIX-CATEGORY DATASET IN OPENSARSHIP

TABLE III
TRAINING-TEST DIVISION OF THE FUSAR-SHIP
b) Six-Category: On the basis of the three-category dataset, another three classes, cargo ship, fishing, and general cargo, are selected to organize a more challenging six-category recognition experiment. Based on the detailed ship classes provided by the Maritime Traffic AIS information, these six ship classes are selected because their sample numbers exceed 200. Categories with insufficient samples in the raw OpenSARShip dataset are excluded to ensure a more reasonable experimental setup. Table II shows the training-testing sets of the six-category dataset.
2) FUSAR-Ship: Another benchmark dataset, FUSAR-Ship, is introduced to further confirm the effectiveness of SA$^{2}$Net. The high-resolution FUSAR-Ship dataset originates from China's Gaofen-3 (GF-3) satellite, the country's first civil C-band fully polarimetric spaceborne SAR. The GF-3 SAR images possess an azimuth resolution of 1.124 m and a slant range resolution ranging from 1.700 to 1.754 m. The FUSAR-Ship dataset is assembled through an automatic SAR-AIS matchup procedure encompassing over 100 GF-3 scenes, and contains over 5000 ship image chips integrated with AIS information. In this article, FUSAR-Ship consists of seven main categories, namely bulk carriers, container ships, fishing vessels, tankers, general cargo ships, other cargo ships, and others. Table III shows the ship sample numbers of each category in FUSAR-Ship. Fig. 7 presents several SAR ship samples from the FUSAR-Ship dataset.

B. Implementation Details
All the experiments are implemented on a personal computer (PC) with an NVIDIA GeForce RTX 2060 VENTUS (12 GB) GPU and 24 GB of RAM. The software is developed in Python using the open-source PyTorch machine learning library. CUDA 10.1 is employed to accelerate training and inference.
1) Image Preprocessing: The backbone we utilize is the pretrained ResNet-50. Its pretrained weights are based on natural images, which are three-channel images, whereas the SAR images used in this article are single-channel. To utilize the pretrained ResNet-50, we replicate the grayscale value across all three channels, so that the three channels hold the same value at each pixel position. Such a conversion can also be found in other classic work [34].
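The channel replication amounts to copying the single grayscale plane three times. A minimal sketch with nested lists is given below; the function name `gray_to_three_channels` and the channel-first layout are assumptions for this example (in a tensor library the same effect is typically obtained with a repeat or expand along the channel axis).

```python
def gray_to_three_channels(gray):
    """Replicate an H x W grayscale image into a 3 x H x W channel-first image."""
    return [[row[:] for row in gray] for _ in range(3)]


# Toy 2 x 3 SAR amplitude chip (values are arbitrary).
chip = [[0.1, 0.5, 0.9],
        [0.2, 0.4, 0.6]]
rgb_like = gray_to_three_channels(chip)
print(len(rgb_like), len(rgb_like[0]), len(rgb_like[0][0]))
```

Because all three planes are identical, the pretrained first-layer filters receive an input with the expected shape without any change to the information content of the SAR chip.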
2) Parameter Setting: All experiments are trained with the same parameters. The input images in OpenSARShip are resized to 224 × 224. The proposed network is trained for 10 000 iterations using the stochastic gradient descent (SGD) optimizer with a weight decay of 0.001 and a momentum of 0.9. The batch size is set to 16 due to the limited GPU memory. To alleviate the adverse impact of vanishing training gradients, we use a relatively low learning rate of 0.0001, which is appropriate for our method.
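For reference, one common formulation of SGD with momentum and weight decay (the one followed by PyTorch's `torch.optim.SGD` with zero dampening and no Nesterov term) is sketched below with the hyperparameters listed above. The function `sgd_step` is a hypothetical helper operating on plain lists; the exact optimizer options used in the experiments beyond the stated values are not specified.

```python
def sgd_step(weights, grads, velocity, lr=0.0001, momentum=0.9, weight_decay=0.001):
    """One SGD update with momentum and L2 weight decay on flat parameter lists."""
    new_weights, new_velocity = [], []
    for w, g, v in zip(weights, grads, velocity):
        g = g + weight_decay * w   # weight decay is folded into the gradient
        v = momentum * v + g       # momentum buffer accumulates past gradients
        new_weights.append(w - lr * v)
        new_velocity.append(v)
    return new_weights, new_velocity


# Toy parameter vector with a zero-initialized momentum buffer.
w, v = [1.0, -2.0], [0.0, 0.0]
w, v = sgd_step(w, [0.5, 0.1], v)
print(w, v)
```

With a learning rate of 0.0001, each step moves the parameters only slightly, which is consistent with fine-tuning a pretrained backbone.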
3) Loss Function: The cross-entropy (CE) loss serves as the loss function

$$L_{CE} = -\frac{1}{N} \sum_{m=1}^{N} y_m \log \hat{y}_m$$

where $\hat{y}_m$ denotes the recognition result of the $m$th sample, $y_m$ denotes the ground truth of the $m$th sample, and $N$ denotes the total number of training samples.

4) Evaluation Metrics: Following most previous studies, accuracy (%) is used as the core evaluation criterion to measure recognition performance and confirm the effectiveness of the proposed modules. For a comprehensive evaluation of the SAR ship recognition results, four additional performance metrics are employed in the experiments: 1) F1; 2) precision; 3) recall; and 4) confusion matrix. These metrics are defined as follows.
Accuracy is defined by

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{8}$$

where TP denotes true positives, TN denotes true negatives, FP denotes false positives, and FN denotes false negatives. In other words, the numerator denotes the number of correctly recognized ship samples, and the denominator denotes the number of all test ship samples.
Recall [21] is defined by

$$\text{Recall} = \frac{TP}{TP + FN}.$$

Precision [21] is defined by

$$\text{Precision} = \frac{TP}{TP + FP}.$$

F1 [21] is defined by

$$\text{F1} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}.$$

Furthermore, to evaluate the ship recognition performance in a more specific manner, a confusion matrix is adopted as a classwise measure of the recognition ability for each category. This evaluation method has also been commonly used in previous studies on SAR ship recognition.
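These definitions can be evaluated directly from a confusion matrix. The sketch below uses a made-up three-class matrix (not the values in Tables VI to VIII), and the helper `metrics_from_confusion` is hypothetical. Overall accuracy is taken as the fraction of correctly recognized samples, and precision, recall, and F1 are computed per class in a one-vs-rest manner.

```python
def metrics_from_confusion(confusion):
    """Compute overall accuracy and per-class precision, recall, and F1.

    confusion[i][j] is the number of class-i samples predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))
    per_class = []
    for k in range(n):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                        # class k predicted as another class
        fp = sum(confusion[i][k] for i in range(n)) - tp   # other classes predicted as k
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        per_class.append({"precision": precision, "recall": recall, "f1": f1})
    return correct / total, per_class


# Made-up confusion matrix for three classes (rows: ground truth, columns: prediction).
toy_confusion = [
    [8, 2, 0],
    [1, 7, 2],
    [0, 1, 9],
]
accuracy, per_class = metrics_from_confusion(toy_confusion)
print(round(accuracy, 4), per_class[0])
```

Reading the matrix row by row gives recall, and reading it column by column gives precision, which is why the diagonal dominating each row indicates that most samples of that class are recognized correctly.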
5) Backbone: Generally speaking, the backbone directly influences the recognition performance. In the SAR ship detection task, ResNet serves as the most favored backbone in some popular and substantial works [35], [36], [37]. Thus, we also apply it empirically to the SAR ship recognition task. To choose the most suitable backbone for our task, we conduct experiments with ResNet-18, ResNet-34, ResNet-50, and ResNet-101 as the backbone networks of SA$^{2}$Net. The experimental results are presented in Table IV. ResNet-50 yields the best accuracy on both OpenSARShip and FUSAR-Ship. The primary reason for this observation is that the features learned by ResNet-18 and ResNet-34 are insufficient, whereas ResNet-101 is prone to overfitting due to its depth. Therefore, we choose the pretrained ResNet-50 as the backbone in the subsequent experiments.

The recognition accuracy of SA$^{2}$Net is 82.91% on the three-category OpenSARShip task, 61.10% on the six-category OpenSARShip task, and 88.28% on the seven-category FUSAR-Ship task. For OpenSARShip, the six-category accuracy is significantly lower than the three-category accuracy, primarily because the six-category recognition task is inherently more complex. In addition, the number of training samples available for the six-category task is smaller than that of the three-category task, which further increases the recognition difficulty. Because FUSAR-Ship has a finer resolution, more detailed ship representations can be learned, and the recognition accuracy reaches 88.28%.

2) Confusion Matrix: The recognition performance for each ship class is given in confusion-matrix form in Tables VI, VII, and VIII for the three-category OpenSARShip, six-category OpenSARShip, and seven-category FUSAR-Ship tasks, respectively. In all three tables, most diagonal values are higher than the other values in the same row, which indicates that most ships are recognized correctly, but some classes are still easily confused. Table VI shows that bulk carriers tend to be mistaken for container ships. This may arise because the outline of the ship is too vague and the ship appears as a strong scattering point, which limits effective recognition. Table VII shows that general cargo ships tend to be mistaken for cargo ships. In fact, the difference between these two classes is rather small, and general cargo can be regarded as a special kind of cargo ship. In Table VIII, the primary source of confusion lies among the categories of fishing, other, and other cargo, which can be attributed to the similar geometries shared by these three ship classes.

D. Comparison With Traditional Methods and Modern CNN-Based Methods
To thoroughly evaluate the efficiency of the proposed method, we compare the experimental results with the state-of-art methods, including the traditional feature-based methods [6], [7], [8], classic object recognition CNNs [9], [38], [39], [30], [40], and novel task-specific SAR ship recognition CNNs [21], [22], [25], [27], [28], [29].The comparison methods are our reappearance and our experiments are as consistent as possible with their original reports.It should be noted that the inputs of Zeng et al. [22] and SE-LPN-DPFF [25] are paired VV-VH SAR amplitude images.More specific, the input of training sample number is 150 VV-VH SAR amplitude images for three-category task and 100 VV-VH SAR amplitude images for six-category task.For other approaches, the inputs consist of unpaired VV and VH SAR amplitude images, wherein single-channel VV and single-channel VH SAR images are sequentially fed directly into the networks.Please note that the FUSAR-Ship dataset solely offers single-channel SAR images, thereby preventing the reappearance of Zeng [22] and SE-LPN-DPFF [25].Table IX shows the quantitative SAR ship recognition performance with traditional methods and modern CNN-Based methods.From Table IX, the following conclusions can be drawn: 1) Among all traditional methods, on the three-category OpenSARShip dataset, the optimal recognition accuracy is 61.72% from KNN, but is still greatly lower than our SA 2 Net (61.72%<<82.91%).On the six-category OpenSARShip dataset, among all traditional methods, the optimal recognition accuracy is 43.54% achieved by SVM.However, this accuracy remains significantly lower compared to our proposed SA 2 Net (43.54% 60.10%).On the FUSAR-Ship dataset, among all traditional methods, the optimal recognition accuracy is 77.19% achieved by RF.However, this accuracy remains significantly lower compared to our proposed SA 2 Net (77.19% 88.28%).
Modern CNN-based models typically exhibit higher recognition accuracies than traditional methods, in line with expectations. This observation suggests that the features extracted by modern CNNs may possess enhanced characterization capabilities. 2) On the three-category OpenSARShip dataset, SA$^{2}$Net offers the highest recognition accuracy among the modern CNN-based methods. The second-best result is 80.82%, from SE-LPN-DPFF [25]. However, it is still lower than that of our SA$^{2}$Net by 2.09%, which shows the state-of-the-art SAR ship recognition performance of our proposed SA$^{2}$Net. 3) On the six-category OpenSARShip dataset, SA$^{2}$Net also offers the highest recognition accuracy. The second-best result is 59.73%, from SE-LPN-DPFF [25]. Nevertheless, our SA$^{2}$Net achieves a 1.37% higher accuracy, showcasing its superior performance as a state-of-the-art SAR ship recognition model. 4) On the seven-category FUSAR-Ship dataset, SA$^{2}$Net also offers the highest recognition accuracy. The second-best result is 86.69%, from HOG-ShipCLSNet [21]. Nevertheless, our SA$^{2}$Net achieves a 1.59% higher accuracy, indicating its superior performance. 5) Although SE-LPN-DPFF uses dual-polarization coherence features to characterize ship feature relationships across polarization channels and thereby improve recognition accuracy, the method neither comprehensively utilizes the multiscale features nor leverages empirical knowledge regarding ships. Thus, its recognition performance is inferior to that of SA$^{2}$Net. In addition, although HOG-ShipCLSNet [21] utilizes the multiscale features, it simply flattens them and uses each feature scale equally, which reduces the ability of the network to extract and choose effective features for precise recognition.

E. Ablation Study
In this part, a series of ablation studies on OpenSARShip and FUSAR-Ship is performed to verify the effectiveness of FWM, FAM, and SAM. For a fair comparison, all subsequent studies are performed with the same settings. The overall comparisons are displayed in Table X. More specifically, adding either FWM or FAM, i.e., the first three rows of Table X, boosts the recognition accuracy of our model, owing to the powerful feature supplementation and refinement provided by our task-specific modules. Besides, as can be seen from the fourth and fifth columns of Table X, the accuracy improves further when two modules are enabled. Finally, as can be seen from the sixth column of Table X, when FWM, FAM, and SAM are applied together, our method achieves the highest accuracy on both the OpenSARShip and FUSAR-Ship datasets compared with the baseline. Next, we analyze the effectiveness of FWM, FAM, and SAM in detail.
1) Effect of FWM: Most existing methods simply extract the multiscale features of the network, which limits the performance of SAR ship recognition. To obtain rich representations at all scales, we leverage the semantic and detailed features of different scales extracted by the backbone to construct FWM. Through feature weaving, FWM combines high-level and low-level features to obtain enriched representations. This approach enhances feature discrimination compared with directly using the multiscale features extracted solely by the backbone.
Table X provides the results of the FWM in the ablation experiments for both datasets. It should be noted that "×" means that only the last-layer features of the backbone network are utilized, ignoring the multiscale features of the middle layers of the CNN. From Table X, compared with the baseline, FWM gains a 3.21% accuracy boost on the three-category OpenSARShip task, a 2.73% accuracy boost on the six-category OpenSARShip task, and a 4.79% accuracy boost on the FUSAR-Ship task, which is an impressive improvement. To gain a comprehensive understanding of FWM, another experiment, named "ablation study intra FWM," is conducted to validate the effectiveness of feature weaving. Table XI provides the results. The "×" means that our SA$^{2}$Net does not perform feature weaving. In other words, the multiscale features are not fused in SA$^{2}$Net. The results in Table XI show that feature weaving achieves 0.52% and 0.89% improvements in accuracy under the three-category OpenSARShip and seven-category FUSAR-Ship tasks. Although the improvements from feature weaving are not as impressive as those of the other modules, they still demonstrate that integrating high-level and low-level features benefits recognition. The improvements in the two groups of ablation studies indicate the necessity and effectiveness of combining different scales of the CNN to recognize SAR ships of various sizes.
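To make the idea of feature weaving more concrete, the sketch below combines three single-channel feature maps of different resolutions so that every output scale receives information from all input scales. It is only an illustration under several assumptions of ours: nearest-neighbor upsampling stands in for bilinear interpolation, 2 × 2 average pooling stands in for the strided 3 × 3 convolutions, the 1 × 1 channel-alignment convolutions are omitted, the maps are fused by element-wise summation, and all maps are square with power-of-two sizes. The function names are hypothetical.

```python
def upsample2x(fmap):
    """Nearest-neighbor 2x upsampling (a stand-in for bilinear interpolation)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def downsample2x(fmap):
    """2x2 average pooling (a stand-in for a strided 3x3 convolution)."""
    return [
        [(fmap[r][c] + fmap[r][c + 1] + fmap[r + 1][c] + fmap[r + 1][c + 1]) / 4.0
         for c in range(0, len(fmap[0]), 2)]
        for r in range(0, len(fmap), 2)
    ]


def resize_to(fmap, size):
    """Resize a square map to size x size by repeated 2x up- or downsampling."""
    while len(fmap) < size:
        fmap = upsample2x(fmap)
    while len(fmap) > size:
        fmap = downsample2x(fmap)
    return fmap


def add_maps(*maps):
    """Element-wise sum of equally sized 2-D maps."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*maps)]


def weave(c3, c4, c5):
    """Build one woven map per scale by summing all inputs resized to that scale."""
    woven = []
    for target in (c3, c4, c5):
        size = len(target)
        woven.append(add_maps(*(resize_to(c, size) for c in (c3, c4, c5))))
    return woven


# Toy single-channel maps: 4x4 (fine), 2x2 (middle), 1x1 (coarse).
c3 = [[1.0] * 4 for _ in range(4)]
c4 = [[2.0] * 2 for _ in range(2)]
c5 = [[3.0]]
m3, m4, m5 = weave(c3, c4, c5)
print(len(m3), len(m4), len(m5))
```

Each woven map keeps the resolution of its own scale while mixing in content from the other two, which is the property that the ablation on feature weaving examines.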
2) Effect of FAM: The proposed FAM is introduced to incorporate the prior knowledge of the ship shape into the network by providing rectangular receptive fields that align with the shape of ships. In addition, the directional rectangular kernels can deal with the challenge of arbitrary ship orientations. Table X provides the results of the FAM in the ablation experiments for both datasets. The "×" means that only the square kernel is employed for feature extraction. The results in Table X show that FAM achieves reasonable improvements of 2.69%, 1.53%, and 3.36% in accuracy under the three-category OpenSARShip, six-category OpenSARShip, and FUSAR-Ship tasks compared with the baseline. The improvements show that introducing the rectangular kernels breaks through the limitation of the traditional fixed kernel, making the feature extraction more powerful. FAM is therefore well suited to the recognition of SAR ship targets with large aspect ratios and arbitrary orientations.
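The intuition behind rectangular receptive fields can be illustrated with a toy example: an elongated, ship-like bar produces a much stronger response under a kernel aligned with its long axis than under a square or misaligned kernel. The sketch below is an illustration only. The kernel sizes (1 × 5, 5 × 1, and 3 × 3), the all-ones weights, and the helper names are our assumptions rather than the configuration of FAM, and the diagonal branches are omitted.

```python
def correlate2d_valid(image, kernel):
    """Plain 'valid' 2-D cross-correlation on nested lists (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        out.append(row)
    return out


def peak(feature_map):
    """Largest response in a 2-D feature map."""
    return max(max(row) for row in feature_map)


# Toy 7x7 image containing a horizontal bar (length 5, thickness 1) in row 3.
image = [[0.0] * 7 for _ in range(7)]
for col in range(1, 6):
    image[3][col] = 1.0

horizontal = [[1.0] * 5]                  # 1 x 5 kernel, aligned with the bar
vertical = [[1.0] for _ in range(5)]      # 5 x 1 kernel, perpendicular to the bar
square = [[1.0] * 3 for _ in range(3)]    # 3 x 3 kernel, conventional shape

peak_h = peak(correlate2d_valid(image, horizontal))
peak_v = peak(correlate2d_valid(image, vertical))
peak_s = peak(correlate2d_valid(image, square))
print(peak_h, peak_v, peak_s)
```

With equal unnormalized weights, the aligned kernel covers the whole bar, the square kernel covers part of it, and the perpendicular kernel covers a single pixel, which suggests why parallel branches with different orientations may help when ship heading is arbitrary.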
3) Effect of SAM: Although FWM can provide rich representations at all scales, features of different scales are not equally effective for recognition. Compared with deep features, shallow features are often not discriminative enough. To adaptively select and assign weights to the desired feature scales while disregarding irrelevant scales, we propose SAM to control the information flow of the different scales. Table X shows the recognition results with and without SAM. The "×" means that the multiscale features are fused by a simple summation. From Table X, the results show that SAM achieves reasonable improvements of 1.04%, 0.81%, and 1.22% in accuracy under the three-category OpenSARShip, six-category OpenSARShip, and FUSAR-Ship tasks compared with SA$^{2}$Net without SAM. This is in line with our knowledge of CNNs. The shallow features contain more detailed information, which is less discriminative, so the features of each scale are not equally effective for recognition. However, HOG-ShipCLSNet reached the opposite conclusion: its authors found that the average weighting type achieves slightly better accuracy than the adaptive type. We analyzed their network carefully to find the underlying reasons. One possible reason is that HOG-ShipCLSNet applies too many FC layers in the network. Many of them have more than 2000 neurons, and a few have as many as 32 768. When the adaptive type is used, the network might fail to find suitable weight parameters due to the heavy computational burden, which also leads to insufficient ship feature extraction.
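The difference between a simple summation and adaptive scale weighting can be sketched as follows. Each scale is reduced to a descriptor by global average pooling (GAP), the descriptors are concatenated, and a fully connected (FC) layer followed by Softmax yields one relevance score per scale. This is a minimal sketch under our own assumptions: the FC weights are fixed by hand rather than learned, the fusion is applied to the pooled descriptors, and all function and variable names are hypothetical.

```python
import math


def global_average_pool(feature_map):
    """feature_map is a list of channels (2-D lists); return one mean per channel."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature_map]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(x, weight, bias):
    """Fully connected layer: one output per row of `weight`."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def scale_attention(scale_features, fc_weight, fc_bias):
    """Return (scale relevance scores, score-weighted sum of the GAP descriptors)."""
    descriptors = [global_average_pool(f) for f in scale_features]
    concatenated = [v for d in descriptors for v in d]
    scores = softmax(linear(concatenated, fc_weight, fc_bias))
    channels = len(descriptors[0])
    fused = [sum(w * d[c] for w, d in zip(scores, descriptors)) for c in range(channels)]
    return scores, fused


# Three toy scales, each with 2 channels of size 2x2 (values are illustrative).
toy_scales = [
    [[[1.0, 1.0], [1.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]]],
    [[[3.0, 3.0], [3.0, 3.0]], [[4.0, 4.0], [4.0, 4.0]]],
    [[[5.0, 5.0], [5.0, 5.0]], [[6.0, 6.0], [6.0, 6.0]]],
]
zero_weight = [[0.0] * 6 for _ in range(3)]

# With zero weights and biases every scale gets the same score (uniform fusion).
scores_uniform, fused_uniform = scale_attention(toy_scales, zero_weight, [0.0, 0.0, 0.0])
# A large bias on the third output makes the third scale dominate the fusion.
scores_biased, fused_biased = scale_attention(toy_scales, zero_weight, [0.0, 0.0, 10.0])
print(scores_uniform, scores_biased)
```

The uniform case behaves like a rescaled simple summation, whereas non-uniform scores let the network emphasize the scales it finds most relevant, which is the behavior that the SAM ablation compares.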
From the ablation study, FWM, FAM, and SAM have different effects on the recognition of SAR ships with scale variance, large aspect ratios, and arbitrary orientations. The components of SA$^{2}$Net complement one another to achieve the optimal recognition performance and tackle the problems of SAR ship recognition.

4) t-SNE: To provide a comprehensive understanding of the impact of FWM, FAM, and SAM, we visually present the qualitative results of the three-category task using t-distributed stochastic neighbor embedding (t-SNE) [41] in Fig. 8. In the t-SNE visualization, the greater the distance between different categories, the higher the recognition accuracy achieved by the model. Fig. 8(a)-(d) illustrate the visualizations based on SA$^{2}$Net without FWM, SA$^{2}$Net without FAM, SA$^{2}$Net without SAM, and the full SA$^{2}$Net, respectively. It can be seen that, with all three modules in place, the recognition error is alleviated and the feature embeddings from the same class are more tightly aggregated, as shown in Fig. 8(d). The combination of the three modules separates the features of different categories and brings the intraclass features of the same category closer together. These results indicate that the three modules complement one another to achieve the optimal recognition performance and tackle the challenges of SAR ship recognition.
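The visual impression of "well separated classes with compact clusters" can be summarized by a simple number. The sketch below computes the ratio between the mean distance separating class centroids and the mean distance of points to their own centroid on a 2-D embedding. This ratio is our own illustrative heuristic, not a metric used in the paper, and the points and class names are hypothetical. Because t-SNE does not preserve global distances, such a ratio should be read only as a rough indicator.

```python
import math
from itertools import combinations


def centroid(points):
    """Mean point of a list of equally sized coordinate tuples."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(len(points[0])))


def separation_ratio(embeddings_by_class):
    """Mean inter-centroid distance divided by mean point-to-own-centroid distance."""
    centroids = {label: centroid(pts) for label, pts in embeddings_by_class.items()}
    intra = [
        math.dist(p, centroids[label])
        for label, pts in embeddings_by_class.items()
        for p in pts
    ]
    inter = [math.dist(a, b) for a, b in combinations(centroids.values(), 2)]
    mean_intra = sum(intra) / len(intra)
    mean_inter = sum(inter) / len(inter)
    if mean_intra == 0.0:
        return math.inf  # all points coincide with their centroids
    return mean_inter / mean_intra


# Hypothetical 2-D embeddings: same centroids, different cluster spread.
tight = {
    "class_a": [(0.0, 0.0), (0.0, 2.0)],
    "class_b": [(10.0, 0.0), (10.0, 2.0)],
}
loose = {
    "class_a": [(0.0, -1.0), (0.0, 3.0)],
    "class_b": [(10.0, -1.0), (10.0, 3.0)],
}
ratio_tight = separation_ratio(tight)
ratio_loose = separation_ratio(loose)
print(ratio_tight, ratio_loose)
```

A larger ratio corresponds to embeddings whose clusters are compact relative to the spacing between classes, which matches the qualitative reading of Fig. 8(d).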

F. Discussion
In this section, we further discuss and explain FWM. A discussion of integrated detection and recognition networks is also included.
1) FWM: The impressive improvement of FWM stems from two aspects. One is leveraging the multiscale features obtained from the intermediate layers. The other is fully mining and combining the feature maps of different scales through feature weaving. We first conduct a comprehensive ablation study to analyze how much the different scale features are related to the final recognition probability on three-category OpenSARShip and FUSAR-Ship. Then, the feature visualization results of FWM are given.
Table XII shows how much the different scale features are related to the final recognition probability on three-category OpenSARShip and seven-category FUSAR-Ship. From Table XII, when a single scale is used, the recognition results only reach 79.70% on three-category OpenSARShip and 84.67% on seven-category FUSAR-Ship. When two scales are employed, SA$^{2}$Net improves the results by 1.87% on three-category OpenSARShip and by 2.41% on FUSAR-Ship. The optimal recognition results are obtained when all three layers are utilized, showing the necessity of leveraging multiscale features to recognize ships of various sizes.
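To validate the effectiveness of feature weaving, we present some qualitative visualization results of feature maps C3 and M3 in Fig. 9. The sizes of C3 and M3 are 28 × 28 pixels. The activation heatmap of the extracted feature is obtained by summing the feature values along the channel dimension. Fig. 9(a) shows the original SAR ship images. As illustrated in Fig. 9(b), although the features extracted by C3 focus on ship targets, they are not sufficient. To deal with this problem and capture more information, feature weaving integrates high-level and low-level information through a weaving process, resulting in rich representations. As shown in Fig. 9(c), the network with feature weaving has more information and pays more attention to the distinguishable regions, so these important parts have higher activation scores. This indicates that feature weaving effectively enriches the representations of SAR ship targets.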
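The activation heatmaps of Fig. 9 are described as a summation of feature values along the channel dimension. A minimal sketch of that operation is given below, assuming a feature map stored as `feature_map[channel][row][column]`. The min-max normalization step and the function names are our additions for display purposes and are not stated in the paper.

```python
def activation_heatmap(feature_map):
    """Sum a C x H x W feature map over its channel axis to get an H x W heatmap."""
    channels = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    return [
        [sum(feature_map[c][y][x] for c in range(channels)) for x in range(width)]
        for y in range(height)
    ]


def normalize(heatmap):
    """Min-max scale a heatmap to [0, 1]; a constant map is mapped to all zeros."""
    flat = [v for row in heatmap for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0] * len(row) for row in heatmap]
    return [[(v - lo) / (hi - lo) for v in row] for row in heatmap]


# Toy feature map with 2 channels of size 2x2 (a real C3 or M3 map would be 28x28).
toy_features = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 1.0], [0.0, 1.0]],
]
heatmap = activation_heatmap(toy_features)
scaled = normalize(heatmap)
print(heatmap, scaled)
```

Positions that are strongly activated in many channels receive the highest scores, which is why more informative regions appear brighter in such a visualization.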
2) Detection and Recognition Integrated Network: Nowadays, an increasing number of scholars are paying attention to establishing a unified SAR ship detection and recognition network [42], [43], [44]. However, the detection and recognition parts are independent and unrelated in classical algorithms. In practical applications, by contrast, it is often necessary to perform detection and recognition in SAR images simultaneously. To achieve satisfactory unified detection and recognition performance, one necessary step is to inject the more discriminative features extracted by SAR ship recognition methods into SAR ship detection methods. As shown in Fig. 10, the most recent detection algorithms [36], [45] utilize the oriented bounding box (OBox) to tackle the challenge of arbitrarily oriented ships. In SA$^{2}$Net, the proposed FAM utilizes directional rectangular convolution kernels to solve the same problem. In future work, we believe that the joint utilization of FAM and OBox may boost the performance of integrated detection and recognition networks.

IV. CONCLUSION
In this article, we propose SA$^{2}$Net to further improve the performance of ship recognition in SAR images. ResNet-50 is adopted as the backbone to extract SAR ship features. Taking into account the special shape prior of the ship class, the FAM in SA$^{2}$Net is designed to enhance the semantic features of ships by incorporating the prior knowledge of the ship shape. The proposed FAM breaks through the limitation of traditional square kernels. In addition, to recognize ships of diverse sizes, the comprehensive utilization of multiscale features is of paramount importance. Instead of aggregating multiscale features with uniform weights, the SAM in SA$^{2}$Net adaptively weights the desired feature scales and disregards the irrelevant scales. The proposed FWM in SA$^{2}$Net generates rich and reliable representations by repeatedly fusing the representations produced by the backbone to obtain better representations at all scales. The experimental results, comparisons, and ablation studies on representative three- and six-category OpenSARShip tasks show that SA$^{2}$Net greatly improves the recognition performance.

His research interests include radar signal processing, target detection, machine learning, and automatic target recognition.
Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.

Manuscript received 30 June 2023; revised 17 August 2023; accepted 6 September 2023. Date of publication 20 September 2023; date of current version 14 November 2023. This work was supported by the National Natural Science Foundation of China under Grant 61901091 and Grant 61901090. (Corresponding author: Wei Pu.)

Fig. 2. (a) Overall depiction of FWM. (b)-(d) Specific details of generating M3, M4, and M5, respectively. The pathway represented by the blue square signifies upsampling using bilinear interpolation, the pathway indicated by the green square signifies downsampling using one or two 3 × 3 convolutions, and the pathway denoted by the red circle signifies aligning channel dimensions through 1 × 1 convolutions.

Fig. 3. Illustration of the FAM. In this figure, the FAM contains five parallel layers with kernel sizes.

Fig. 8. Three-category task t-SNE feature visualization of the embedding vector distribution. (a) Our network without FWM. (b) Our network without FAM. (c) Our network without SAM. (d) Our network SA$^{2}$Net.

Fig. 9. Visualization of the features of different ships. (a) Original images. Visualization of feature map C3 (b) and M3 (c) with FWM.


Yuanzhe Shang (Graduate Student Member, IEEE) received the B.S. degree in electronic engineering from the School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu, China, in 2020, where he is currently working toward the Ph.D. degree in electronic engineering.

hyperparameters setting;
2: Extract hierarchical feature maps $C_3$, $C_4$, $C_5$ with backbone ResNet-50, build enhanced feature maps $M_3$, $M_4$, $M_5$ with $C_3$, $C_4$, $C_5$.
3: Stage 1:
4: Input $M_l$ with $l$ as {3, 4, 5}, obtain $M_{lh}$, $M_{lv}$, $M_{ll}$, $M_{lr}$, $M_{ls}$ by five parallel branches with distinct convolution kernels;
5: Concatenate $M_{lh}$, $M_{lv}$, $M_{ll}$, $M_{lr}$, $M_{ls}$ to get $M_l$;
6: Perform composite function comp(·) on $M_l$ to get output $A_l$.
7: Stage 2:
8: Generate feature vector of each scale $f_i$ with $A_l$ by GAP;
9: Concatenate $f_i$ and obtain the learned scale relevance scores $w$ with FC and Softmax function;
10:

TABLE I TRAINING-TESTING DIVISION OF THE THREE-CATEGORY DATASET IN OPENSARSHIP

TABLE IV RECOGNITION PERFORMANCE OF DIFFERENT BACKBONES ON THREE-CATEGORY OPENSARSHIP AND FUSAR-SHIP

Table V shows the evaluation metrics of SA$^{2}$Net on the three-category OpenSARShip task, six-category OpenSARShip task, and seven-category FUSAR-Ship task. From Table V, the SAR ship recognition

TABLE V EVALUATION METRICS OF SA$^{2}$NET ON OPENSARSHIP AND FUSAR-SHIP TASKS

TABLE VI CONFUSION MATRIX OF SA$^{2}$NET RECOGNITION RESULTS ON THREE-CATEGORY OPENSARSHIP

TABLE VII CONFUSION MATRIX OF SA$^{2}$NET RECOGNITION RESULTS ON SIX-CATEGORY OPENSARSHIP

TABLE VIII CONFUSION MATRIX OF SA$^{2}$NET RECOGNITION RESULTS ON FUSAR-SHIP

TABLE IX COMPARISON OF SA$^{2}$NET ON THE THREE-CATEGORY AND SIX-CATEGORY UNDER OPENSARSHIP

TABLE X ABLATION STUDIES OF EACH MODULE IN SA$^{2}$NET

TABLE XII COMPARISON OF QUANTITATIVE EVALUATION INDICES WITH DIFFERENT NUMBER SCALES IN FWM