Fine-Grained Ship Detection in High-Resolution Satellite Images With Shape-Aware Feature Learning

Fine-grained ship detection is an important task in high-resolution satellite remote sensing applications. However, large aspect ratios and severe category imbalance make fine-grained ship detection a challenging problem. Current methods usually extract square-like features, which do not work well for ships with large aspect ratios, and the resulting misalignments in feature representation severely degrade the performance of ship localization and classification. To tackle this, we propose a shape-aware feature learning method that mitigates the misalignments during feature extraction. Furthermore, to address category imbalance, we design a shape-aware instance switching strategy to balance the quantity distribution of ships across categories, which greatly improves the network's ability to learn rare instances. To verify the effectiveness of the proposed method, we contribute a multicategory ship detection dataset (MCSD) that contains 4000 images carefully labeled with oriented bounding boxes, including 16 types of ship objects and nearly 18 000 instances. We conduct experiments on our MCSD and ShipRSImageNet, and extensive experimental results demonstrate the superiority of the proposed method over several state-of-the-art methods.

I. INTRODUCTION

Fine-grained ship detection in high-resolution satellite images is crucial for marine monitoring and management, maritime rescue, coastal military defense early warning, etc. [1], [2], [3], [4]. In recent years, the resolution of satellite images has been increasing with the development of Earth observation technology, which lays a strong foundation for fine-grained ship detection and recognition.
Dataset and code will be available at https://guobo98.github.io/shape-awareshipdet.

Digital Object Identifier 10.1109/JSTARS.2023.3241969

Benefiting from the powerful feature representation of deep convolutional neural networks (DCNNs), object detection algorithms based on DCNNs have achieved superior detection results in natural scene images, such as faster R-CNN [5], YOLO [6], EfficientDet [7], and Dynamic R-CNN [8]. However, satellite images differ from natural scene images in imaging angle and resolution. The massive variations in the scale and orientation of objects caused by the bird's-eye view bring broad challenges to object detection in satellite images [9]. State-of-the-art satellite image detectors, such as the region of interest (RoI) transformer [10] and oriented R-CNN [11], adopt a two-stage method. They first use a backbone, such as ResNet [12] with a feature pyramid network (FPN) [13], to extract features from the satellite images and then generate RoI proposals with the region proposal network (RPN) for subsequent classification and regression. These methods mainly focus on general object detection and have achieved good results. However, ships have large aspect ratios, resulting in inaccurate location regression by the above detectors. In addition, the above detectors only conduct general classification, and their performance is not prominent on the fine-grained ship classification task. The challenges of fine-grained ship detection are as follows.
Large Aspect Ratio: Ships in satellite images usually have large aspect ratios, which brings about a mismatch between the slender ground-truth region and the receptive field of square-like RoI align. Specifically, the long and short sides of the ship need different receptive field sizes and feature representation parameters. However, current RoI extraction methods obtain square-like features [14] on a single-level feature map [13]. A single-level feature map does not contain features extracted from multiscale receptive fields, and the square-like RoI align treats the long and short sides equally, wasting parameters on the representation of the short side.
Severe Category Imbalance: The numbers of instances in different ship categories vary greatly and present a long-tail distribution, leading to severe quantity imbalance across categories. The numbers of ships in certain categories are very small, and the network pays little attention to these categories during training, which degrades the performance of the detectors.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

To overcome the above problems, we design a novel ship detection network with shape-aware feature learning and an instance switching (IS) method. We follow the state-of-the-art two-stage satellite image detectors, whose RoI extraction proceeds as follows. After the RPN generates the RoIs, feature extraction for each RoI usually takes two steps: first, mapping the RoI back to an appropriate level of the FPN feature maps based on the size of the RoI, and then pooling the RoI feature to a fixed dimension for classification and regression to obtain the rotated bounding box. However, this RoI extraction method leads to two misalignments between the RoI and the feature map. First, for a rotated object represented by the parameters (x, y, w, h, θ) [10], there is a big gap between the initial rotated RoI and the final detection bounding box, so using only the size of the RoI to select the target FPN level leads to a misalignment between the RoI and the feature map. Second, traditional RoI pooling or RoI align treats the width and height as equally important and therefore outputs a square-like (1:1) feature map (e.g., 7 × 7). Ships in satellite images usually have large aspect ratios, and the discriminative structures on a ship are also laid out along the ship's shape; hence, the features on the short side are relatively sparse compared to the long side. For example, if the aspect ratio of a ship instance is 7:1 but the ratio of the RoI align feature map is 1:1, there is a severe scale mismatch between the feature map and the instance. Our proposed feature learning method performs RoI extraction at each level of the FPN and fuses these features based on a spatial attention mechanism.
It also contains a shape-aware rotated RoI align to balance the fine-grained information of the short and long sides and fix the misalignment problem above. For the category imbalance problem, we design a shape-aware instance switching (SIS) method. It selectively replaces common ship instances with rare ones so that the numbers of ships of different categories become similarly distributed. To ensure the consistency of the background, we use a 2-D Gaussian distribution to model the ship's shape, which enhances the ship's texture information and suppresses the ocean background.
To verify the effectiveness of our method, we need to conduct experiments on a large-scale fine-grained dataset. However, the existing multiclass ship datasets [15], [16] are small, and detection performance on them has shown a saturation tendency. Therefore, we construct a high-resolution optical multicategory ship detection dataset (MCSD) to explore the effectiveness of our method. The dataset covers 38 principal harbors worldwide and contains nearly 4000 images carefully labeled with oriented bounding boxes (OBBs), including 16 types of ship objects and about 18 000 instances. The dataset also contains 8600 images labeled with a single ship category and nearly 130 000 ship instances.
Our contributions can be summarized as follows.
1) We design a shape-aware feature learning method that contains a shape-aware RoI extractor (SRE) and a shape-aware rotated RoI align (SRRA). It performs RoI feature extraction at each level of the FPN and fuses these features based on the attention mechanism. It also introduces the shape-aware rotated RoI align to adapt to the ship's shape, which mitigates the problem of feature misalignment during feature extraction.
2) We design an SIS method to cope with the problem of quantity imbalance across categories. It ensures the consistencies of the ocean background and ship directions when replacing common ship instances with rare ones.
3) We contribute a high-resolution optical multicategory ship detection dataset with OBB annotations, expanding the size of the existing multiclass ship dataset.

The rest of this article is organized as follows. The related works are reviewed in Section II. In Section III, our multicategory ship detection dataset is introduced in detail. The proposed method is described in Section IV. The experiments and their analysis are shown in Section V. Finally, Section VI concludes this article.

II. RELATED WORK
This section first reviews some advances in object detection using satellite remote sensing images and then introduces related algorithms designed for ship detection with large aspect ratios and category imbalances.

A. Object Detection
With the powerful feature extraction ability of the artificial neural network, object detection based on deep learning has made a significant breakthrough [17]. Detectors based on the convolutional neural network are mainly divided into two-stage and single-stage detectors.
Faster R-CNN [5] uses an anchor-based region proposal network and applies RoI pooling to extract fixed-length RoI features for object classification and regression. SPP-Net [18] proposes a spatial pyramid pooling method to extract features from feature maps at multiple scales, improving inference speed. To enhance the regression accuracy of bounding boxes, Cascade R-CNN [19] successively cascades three R-CNN networks with the same structure and sets increasing IoU thresholds to make the distribution of predicted RoIs gradually approach the actual object. DetectoRS [20] proposes a recursive feature pyramid and switchable atrous convolution to further improve detection performance. Two-stage algorithms, which can obtain features from the RoI, have high accuracy, but their efficiency is low due to many redundant computations. One-stage detectors simultaneously regress bounding boxes and predict object classes, improving the efficiency of object detection. YOLO [6] divides the image into grid cells and simultaneously predicts the bounding box and the category of the object to which each cell belongs. RetinaNet [21] proposes the focal loss, which sets higher weights for hard-to-learn positive samples, alleviating the imbalance between positive and negative samples during network training. Xu et al. [22] proposed a new evaluation metric dubbed the normalized Wasserstein distance and a new ranking-based assigning strategy for tiny object detection. The above methods are based on anchor boxes. Recently, anchor-free detection methods, such as CenterNet [23], CornerNet [24], and the transformer-based DETR [25], have further advanced research on object detection.
Objects in remote sensing images have the characteristics of arbitrary orientation, large aspect ratio, and cluttered distribution, which brings new challenges to detection tasks. The representation based on the horizontal bounding box (HBB) is not very appropriate for tilted objects in remote sensing images, which will introduce too much background noise. Methods based on the OBB can solve this problem well. Using the rotated boxes to represent objects in arbitrary directions can obtain more accurate location information.
Inspired by faster R-CNN, Ma et al. [26] proposed rotated anchors to fit rotated objects and a rotated RoI pooling to extract their features. However, this method generates many anchor boxes and causes a huge parameter overhead. Ding et al. [10] proposed the RoI transformer, which learns the rotation angle from horizontal bounding boxes, improving inference speed and detection accuracy. Xie et al. [11] proposed oriented R-CNN to generate high-quality oriented proposals at a minimal cost, further improving detection efficiency. Wang et al. [27] adopted an instance segmentation-based method, called mask OBB, to generate OBBs from instance masks. Yang et al. [28] proposed SCRDet, which uses feature fusion and multidimensional attention networks to refine the feature representations of small and cluttered objects. Yang et al. [29] proposed KLD, which models the object with a Gaussian kernel and replaces the IoU metric with the Gaussian Wasserstein distance, solving the problem of discontinuous boundaries during rotated bounding box regression. These methods have achieved good results in general object detection but lack specific designs for ship detection. The following section reviews and analyzes current ship detectors for remote sensing images.

B. Ship Detection in Satellite Images
In addition to general detectors for arbitrarily oriented objects in satellite remote sensing images, detection methods designed specifically for ships have been proposed. Zhu et al. [1] presented a rotated RetinaNet network and introduced a refinement network and an IoU constant factor to solve the boundary discontinuity problem. To generate a rotated bounding box, Chen et al. [30] proposed an anchor-free detection framework that detects three key points (the ship's bow, stern, and center) and the angle. Guo et al. [31] proposed a rotated Libra network that balances three levels of the network for ship position prediction with rotation angle information to achieve consistent accuracy. Liu et al. [32] proposed GRS-Det, which contains a feature extraction network with a selection cascade module and a rotated Gaussian mask model to classify and regress the pixels of ships. Considering that current general feature extraction methods are not well suited to ships with large aspect ratios, we propose a shape-aware feature learning method to fix the feature misalignments during RoI assignment and extraction. Our SRE optimizes the feature extraction of ship instances: it performs RoI feature extraction at each level of the FPN and fuses these features based on the attention mechanism. In addition, our SRRA optimizes the aspect-ratio matching between instances and the corresponding feature maps; it balances the fine-grained features of the short and long sides to achieve a more precise RoI align with fewer parameters.
The number of instances in different categories in the real world follows a long-tailed distribution [33]. In ship datasets, the numbers of particular types of ships are small, which hampers network training. Kang et al. [34] decoupled the learning procedure into representation learning and classification and explored the effect of the resampling strategy on network learning. Cui et al. [35] designed a reweighting scheme that uses each class's effective number of samples to rebalance the loss and guide the model. Menon et al. [36] revisited the classic idea of logit adjustment based on label frequencies and modified the standard softmax cross-entropy training to cope with the imbalance problem. Tan et al. [37] proposed an equalization loss that tackles the problem of long-tailed categories by simply ignoring the gradients for rare categories. These methods are complicated in design and difficult to transfer between tasks. Copy-Paste [38] and IS [39] perform data augmentation at the instance level to improve the diversity of data. Guo et al. [40] proposed a background-consistent IS method to enrich a bridge dataset. However, this method does not work well in ship detection due to the complex textures of the ocean background and ship instances. In this work, we design an SIS method to relieve the problem of quantity imbalance across categories and increase instance diversity. It can be easily transferred between tasks to enhance the network's learning ability on rare instances.

III. MULTICATEGORY SHIP DETECTION DATASET

A. Dataset Background
Recently, the convolutional neural network has been widely used in many remote sensing tasks and has surpassed the performance of traditional methods by a large margin. However, it is data-driven, and its good performance relies heavily on the support of a large amount of labeled data.
In the early stage, the development of ship object detection in remote sensing images was relatively slow due to the limitation of remote sensing datasets. As Table I shows, the first public dataset for ship detection, NWPU [16], was released in 2014, and it is very small. It was not until 2016 that the publication of the HRSC dataset [15] provided basic support for research on ship detection with deep neural network models. However, previous works mainly investigated single-category ship detection on this dataset, and the detection performance has shown a saturation tendency due to its small capacity. In recent years, some new remote sensing image datasets [41], [42], [43] have been proposed. However, those datasets mainly focus on general remote sensing object detection; therefore, they contain only one ship category, and the scenes covered are not comprehensive. In 2021, Zhang et al. [44] published the ShipRSImageNet dataset for ship detection, which contains 50 types of ship objects, providing data support for multicategory ship object detection. To further promote the development of multicategory ship object detection, we construct a high-resolution optical multicategory ship detection dataset (MCSD). The dataset covers 38 major harbors worldwide and contains nearly 4000 carefully labeled images, including 16 types of ship objects and about 18 000 instances. MCSD also has 8600 images labeled with a single ship category and nearly 130 000 ship instances.

B. Dataset Description

1) Data Collection: In this article, we construct a high-resolution optical multicategory ship detection dataset named MCSD. The images in the dataset are mainly obtained from Gao-Fen satellites and Google Earth, and the scenes cover 38 important harbors around the world. The location distribution of the harbors is shown in Fig. 1. The dataset contains two parts: 1) multicategory (16 categories) ships and 2) single-category ships; the size of the images in both parts is 1024 × 1024. The former contains nearly 4000 images and 18 000 ship instances, and the latter contains 8600 images and 130 000 ship instances. The spatial resolution of the images ranges from 0.25 to 2 m, which makes the ship texture clear enough to support the multicategory classification task. The multicategory ship instances in MCSD are shown in Fig. 2.
2) Data Annotation: Ship instances in MCSD are labeled in OBB format, defined by a 5-D vector (x, y, w, h, θ), where (x, y) is the center of the OBB and (w, h) are its width and height. We assume that the initial bounding box is an HBB, where the side parallel to the X-axis is defined as w and the other side as h. The target OBB is obtained by rotating the initial HBB clockwise by θ, where 0 ≤ θ < π. The details are illustrated in Fig. 3.
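To make the (x, y, w, h, θ) convention concrete, the sketch below converts an OBB to its four corner points by rotating an axis-aligned box about its center. The exact rotation sign depends on the image coordinate system (y pointing down); the convention used here is an assumption consistent with a clockwise θ in image coordinates.

```python
import math

def obb_corners(x, y, w, h, theta):
    """Corner points of an OBB: start from an axis-aligned w x h box
    centered at (x, y) and rotate its corners by theta (radians,
    0 <= theta < pi), following the annotation convention above."""
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    corners = []
    for dx, dy in [(-w / 2, -h / 2), (w / 2, -h / 2),
                   (w / 2, h / 2), (-w / 2, h / 2)]:
        # rotation in image coordinates (y axis points down)
        corners.append((x + dx * cos_t - dy * sin_t,
                        y + dx * sin_t + dy * cos_t))
    return corners
```

With θ = 0 the corners are those of the initial HBB; with θ = π/2 the roles of the width and height extents are swapped, as expected.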
3) Data Distribution: Fig. 4 shows the long-tail distribution of the numbers of instances in the different categories of the dataset.
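The common/rare split that the SIS method (Section IV-B) later relies on can be derived directly from such a long-tailed label distribution; a minimal sketch, using the paper's criterion of comparing each class count to the per-class average:

```python
from collections import Counter

def split_common_rare(labels):
    """Classes with more instances than the per-class average are
    'common'; the rest are 'rare' (the criterion used by SIS)."""
    counts = Counter(labels)
    avg = sum(counts.values()) / len(counts)
    common = {c for c, n in counts.items() if n > avg}
    rare = set(counts) - common
    return common, rare
```

On a long-tailed dataset this yields a small set of head (common) classes and a larger set of tail (rare) classes.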

IV. METHODOLOGY
Our proposed method consists of two modules: the shape-aware RoI feature extractor and SIS. The former is designed for accurate ship feature extraction, and the latter handles the problem of category imbalance. We now describe these two parts in detail.

A. Shape-Aware Rotate RoI Extractor
The ship detection task is to obtain the bounding box and category of each ship in a remote sensing image. Since bounding box regression and classification are both based on the RoIs generated by the RPN, it is critical to extract the RoI features accurately. Current detectors commonly perform feature extraction on a single-level feature map: RoIs of different scales are first assigned to level k of the FPN by

k = ⌊k0 + log2(√(wh)/224)⌋  (1)

and then the features are extracted by RoI pooling [5] or RoI align [14], where k0 is the highest-level feature map used in the FPN, w and h represent the width and height of the RoI, and 224 is the canonical ImageNet image size. This assignment makes sense because it allows larger RoIs to be assigned to higher FPN feature maps with larger receptive fields. However, for the RoIs of ships with large aspect ratios, the initial RoIs differ greatly from the final regression boxes due to the rotation angle, so w × h cannot accurately describe the shape of each RoI. As shown in Fig. 5, the size (w × h) of the green RoI is equal to that of the blue one. However, since the green RoI has a larger aspect ratio, it needs a larger receptive field to harvest the features of the long side and a smaller receptive field for the short side. It is inappropriate to assign these two different RoIs to feature maps of the same level according to (1). To overcome this, our SRE is designed to mitigate the problem of feature misalignment during RoI extraction.
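The level-assignment heuristic (1) is easy to state in code; the sketch below uses the standard FPN form with k0 = 4 and clamps k to the available pyramid levels (the clamping range {2, …, 5} matches the {P2, …, P5} pyramid used here).

```python
import math

def fpn_level(w, h, k0=4, k_min=2, k_max=5):
    """Eq. (1): larger RoIs map to higher, coarser pyramid levels;
    224 is the canonical ImageNet pretraining size."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / 224))
    return max(k_min, min(k_max, k))
```

Note that the rule depends only on the area w*h: a 448 × 112 RoI lands on the same level as a 224 × 224 one, which is exactly the aspect-ratio blindness the SRE is designed to fix.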
Shape-Aware RoI Extractor: For the input satellite image, we use ResNet as the backbone to obtain the four-level feature maps {C2, C3, C4, C5}. From these we construct the feature pyramid {P2, P3, P4, P5} through the FPN, where each pyramid level corresponds to {4, 8, 16, 32} times downsampling of the original image. The higher the level of a feature map in the FPN, the larger its receptive field. Inspired by [49], our proposed SRE is shown in Fig. 6. For each RoI generated by the RPN, instead of using only a single-level FPN feature, shape-aware rotated RoI align is performed on the feature map of each FPN level. Finally, the extracted multilevel features are accumulated, followed by a spatial attention module that weighs the fused features to guide the learning of the detector.
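The accumulate-then-reweight step can be sketched as follows. This is a deliberately minimal stand-in: the single softmax map computed from the channel mean is an assumption standing in for the paper's multi-head spatial attention module.

```python
import numpy as np

def fuse_roi_features(per_level_feats):
    """Sum RoI features extracted from every FPN level, then reweight
    them spatially with a softmax attention map (simplified stand-in
    for the paper's multi-head spatial attention)."""
    fused = np.sum(per_level_feats, axis=0)   # (C, H, W)
    logits = fused.mean(axis=0)               # (H, W) channel-mean logits
    weights = np.exp(logits - logits.max())   # numerically stable softmax
    weights /= weights.sum()
    return fused * weights[None, :, :]
```

The output keeps the per-RoI feature shape, so it can feed the same classification and regression heads as a single-level extractor.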
Shape-Aware Rotated RoI Align: To adapt to the ship's shape, we propose a shape-aware rotated RoI align, as shown in Fig. 7. Different from the traditional square-like RoI align [14], the shape-aware rotated RoI align has two characteristics: first, it is a rotated feature extraction paradigm, which makes feature localization more accurate and prevents one box from containing multiple instances; second, it takes the shape characteristics into account, paying more attention to the long side to balance the fine-grained features of the short and long sides. The SRRA thus attains a more precise RoI align with fewer parameters.
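One way to realize "more bins on the long side under a fixed parameter budget" is to pick a rectangular output grid whose aspect ratio tracks the RoI's. The paper only states that the output is non-square; the specific grid rule below is an assumption for illustration.

```python
import math

def shape_aware_bins(aspect_ratio, budget=49):
    """Split a fixed bin budget (49 = the 7x7 RoI-align default) into
    a rectangular grid matching the RoI aspect ratio, so a slender
    ship gets more bins along its long side. Hypothetical rule, not
    the paper's exact parameterization."""
    out_w = max(1, round(math.sqrt(budget * aspect_ratio)))
    out_h = max(1, round(budget / out_w))
    return out_w, out_h
```

A square RoI keeps the usual 7 × 7 grid, while a 49:1 ship would get a 49 × 1 grid, spending the whole budget along the hull.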
Generalized Spatial Attention: Simply aggregating the features of different levels is not enough to properly characterize the RoI features. For the classification and bounding box regression, we need to perform some transformations on the coarse features to increase the capacity of the network. We introduce the spatial attention mechanism [50] after feature aggregation to find the most important discriminant features and then guide the subsequent learning of the network.
The essence of the attention mechanism is to derive a weight map from the relationship between the query, key, and value [51]. Given a query x_q, we need to consider both the content and the position embedding between the query and key when determining the attention weight of the corresponding key. In this article, the contents of the query and key are the feature pixels on the feature map after shape-aware RoI align. We design a spatial attention module in which A_nm(x_q, x_k), the attention weight in the mth attention head of the nth attention module, is computed from the contents and the relative position of the query and key, with N = 2 attention modules, M = 6 attention heads, and learnable weights W_nm. Note that when calculating relative positions, since the output of our proposed shape-aware RoI align is not square, we need to calculate the embeddings for w and h separately. The corresponding embeddings are then passed through a linear layer to obtain the embedding feature for the subsequent calculation of the attention weight map.

B. Shape-Aware Instance Switching
As shown in Fig. 4, the number of ship instances of different categories varies greatly and presents a long-tail distribution, leading to the problem of quantity imbalance across categories. We design an SIS method to solve this problem. SIS also ensures the consistencies of the background and ship directions when replacing common ship instances with rare ones, as shown in Fig. 8. Specifically, our method can be divided into the following three key steps.
1) Instance Gallery Building: First, we cut out all the OBBs that contain ship instances and collect them into an instance gallery. Then, the size, aspect ratio, background color, and other characteristics of each instance in the gallery are recorded. We build a ball tree for KNN search over these characteristics, which is used in instance matching to select suitable instances from the gallery. Finally, we count the number of instances of each category in the dataset; a category with more instances than the average is defined as a common class, and one with fewer instances than the average is defined as a rare class. Instances of the common classes will be replaced by those of the rare classes to ease the instance imbalance problem.

2) 2-D Gaussian Modeling: Simple copy and paste cannot guarantee the direction consistency of the ship, which may cause label noise. IS improves on this but cannot guarantee the consistency of the background: the border of the switched instance is discontinuous, which may make the network mistakenly believe that there is a ship object wherever such a discontinuity appears, leading to overfitting. Assuming the instance to be replaced has a size of w × h, a Gaussian(0, σ²) profile is applied along w and h, respectively, where w_η and h_η are the fusion weights of w and h; the bigger w_η and h_η, the higher the weight of the center part. The final values of h_η and w_η are 3.5 and 4, respectively. We construct the mask M of the ship's shape through this Gaussian kernel, and the selected instance and the original instance are fused with the weights M and 1 − M. Note that the mask weight threshold is 0.8: weights greater than 0.8 in the mask are set to 1 to make the texture information of the ship instance more prominent.

3) Ship Instance Switching: Our ship IS works before the backbone. The input is the original image, and the output is the switched one.
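The 2-D Gaussian modeling in step 2 can be sketched as follows. The mapping from the fusion weights (h_η = 3.5, w_η = 4) to the Gaussian widths is an assumption (σ taken as the side length divided by the weight, so larger weights concentrate the mask on the center); the 0.8 snap-to-one threshold follows the text.

```python
import numpy as np

def gaussian_ship_mask(h, w, h_eta=3.5, w_eta=4.0, threshold=0.8):
    """2-D Gaussian over an h x w instance patch: weight peaks at the
    ship center and decays toward the border, so the pasted instance's
    edges blend into the original background. Values above 0.8 are
    snapped to 1 to keep the ship texture prominent."""
    ys = np.arange(h) - (h - 1) / 2
    xs = np.arange(w) - (w - 1) / 2
    # larger h_eta/w_eta -> narrower Gaussian -> more center weight
    gy = np.exp(-(ys / (h / h_eta)) ** 2 / 2)
    gx = np.exp(-(xs / (w / w_eta)) ** 2 / 2)
    mask = np.outer(gy, gx)
    mask[mask > threshold] = 1.0
    return mask

def switch_instance(background_patch, new_instance, mask):
    """Fuse with weights M and 1 - M, as in step 2."""
    return mask * new_instance + (1 - mask) * background_patch
```

The center of the patch keeps the new instance at full weight, while the patch border stays close to the original ocean background, avoiding the hard seam that plain copy-paste leaves.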
For each instance in the input image, we judge whether its category is rare; for instances belonging to a rare class, we calculate the size, aspect ratio, and background color and feed these characteristics into the KNN [52] for the subsequent instance matching. In the KNN matching stage, different characteristics have different impact factors; ranked from high to low, they are aspect ratio, size, and color. In the final stage of SIS, we also perform brightness compensation on the matched instance to fit the original background. To prevent the algorithm from overfitting, we randomly select an instance from the nearest 10% of matched instances for switching.
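The matching stage can be sketched with a weighted nearest-neighbour search over (aspect_ratio, size, background_color) features. The numeric weights encoding the ranking aspect ratio > size > color are assumptions, and brute force replaces the paper's ball tree for brevity; the random pick from the nearest 10% follows the text.

```python
import numpy as np

def match_rare_instance(query, gallery, weights=(3.0, 2.0, 1.0),
                        top_frac=0.1):
    """Weighted nearest-neighbour matching over
    (aspect_ratio, size, background_color) feature vectors.
    Returns the gallery index of a randomly chosen instance among
    the nearest top_frac matches (anti-overfitting, as in SIS)."""
    q = np.asarray(query, dtype=float) * weights
    g = np.asarray(gallery, dtype=float) * weights
    dists = np.linalg.norm(g - q, axis=1)
    k = max(1, int(len(gallery) * top_frac))
    nearest = np.argsort(dists)[:k]
    return int(np.random.choice(nearest))
```

In practice the gallery would be the rare-class instance library built in step 1, and a ball tree would replace the brute-force distance computation for speed.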

V. EXPERIMENTS AND ANALYSIS

A. Dataset and Implementation Details
Datasets: Experiments are performed on MCSD and ShipRSImageNet [44] to verify the effectiveness of our proposed method. As described in Section III, MCSD contains nearly 4000 images and 18 000 instances of 16 categories of ships; the size of the images is 1024 × 1024, and their spatial resolution ranges from 0.25 to 2 m. We divide MCSD into a train-val set and a test set at a ratio of 4:1. ShipRSImageNet contains 3435 images with 17 573 ship instances in 50 categories. Note that the dock does not belong to any ship category but is labeled in ShipRSImageNet. The size of the images in ShipRSImageNet ranges from 930 × 930 to 1024 × 1024, and their spatial resolution ranges from 0.12 to 6 m. For convenience, we pad the images in ShipRSImageNet to 1024 × 1024 in the experiments. In addition, our experiments focus on the rotated ship detection task, and the instances in both datasets are labeled in the form of OBBs, as illustrated in Fig. 3.

Implementation Details: Our method and the baselines are implemented in PyTorch and trained with the MMDetection [53] framework. All the experiments are performed on four NVIDIA TITAN X GPUs with a batch size of 2. We choose stochastic gradient descent as the optimizer; the momentum is set to 0.9 and the weight decay to 0.0001.
All the models are pretrained on ImageNet. To be consistent with the ShipRSImageNet benchmark, all the networks are trained for 100 epochs. The initial learning rate is set to 0.005 and decays by a factor of 0.1 at the 66th and 90th epochs. Other hyperparameters follow the MMDetection default settings.
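The step schedule above is a standard multi-step decay and can be written explicitly:

```python
def learning_rate(epoch, base_lr=0.005, milestones=(66, 90), gamma=0.1):
    """Multi-step schedule from the paper: 10x decay at epochs 66 and 90."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr
```

This is equivalent to PyTorch's `MultiStepLR` with `milestones=[66, 90]` and `gamma=0.1`.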
Evaluation Metrics: In this article, we use the average precision (AP) scores obtained by the standard MS COCO [54] evaluation metric to evaluate the performance of different methods on ship detection. Since the COCO API does not provide an interface for rotated intersection over union (IoU) calculation, we convert all OBB detection results into the COCO mask form, and the mask IoU provided by the COCO API is used to measure the effectiveness of each method.
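The OBB-as-mask trick can be illustrated with a simple rasterized stand-in (the actual evaluation uses the COCO API's run-length-encoded masks): each OBB is drawn as a binary mask, and IoU is computed on the masks.

```python
import math
import numpy as np

def obb_mask(x, y, w, h, theta, grid=64):
    """Rasterize an OBB on a grid x grid canvas: a pixel is inside if,
    after rotating it back by -theta around the box center, it falls
    within the axis-aligned w x h box."""
    ys, xs = np.mgrid[0:grid, 0:grid]
    dx, dy = xs + 0.5 - x, ys + 0.5 - y
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    u = dx * cos_t + dy * sin_t    # coordinate along the w axis
    v = -dx * sin_t + dy * cos_t   # coordinate along the h axis
    return (np.abs(u) <= w / 2) & (np.abs(v) <= h / 2)

def mask_iou(m1, m2):
    inter = np.logical_and(m1, m2).sum()
    union = np.logical_or(m1, m2).sum()
    return inter / union if union else 0.0
```

Because the IoU is evaluated on rasterized shapes rather than on the (x, y, w, h, θ) parameters, boxes differing only by the θ/θ+π ambiguity produce identical masks and identical IoU.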

B. Ship Detection Benchmark and Results
To compare with existing methods, we build a benchmark for rotated ship detection based on ShipRSImageNet and MCSD.
In the early stage, the results of semantic segmentation of ship objects could be converted into the form of OBBs, so we include the classic mask R-CNN [14] and cascade mask R-CNN [19] in the benchmark. Our benchmark also contains classic two-stage detectors, including faster R-CNN (OBB) [5] and gliding vertex [55]; one-stage detectors, including RetinaNet (OBB) [21] and S2A-Net [56]; and the anchor-free detector FCOS (OBB) [57]. In recent years, some new rotated object detection paradigms have been proposed: they no longer generate rotated anchors but detect rotated objects by learning the transformation from the HBB to the OBB. The RoI transformer [10] and oriented R-CNN [11] are two representatives and have achieved good performance in the ship detection task. Our benchmark includes the algorithms mentioned above, and we choose [10] and [11], which have better performance, as the baselines to verify the effectiveness of our proposed method. The experimental results show that our method can effectively improve the performance of these two networks on AP 0.50, AP 0.75, and AP 0.50:0.95.
As shown in Table II, on the ShipRSImageNet dataset, our method improves by 3 points on AP 0.50 and 4.4 points on AP 0.75 compared to the RoI transformer. Similarly, compared with oriented R-CNN, our method achieves gains of 2.8 points on AP 0.75 and 4 points on AP 0.50:0.95. The benchmark's overall performance is better on MCSD, but our method still contributes significantly to the performance improvement. Compared to the RoI transformer, we achieve a gain of 4.9 points on the AP 0.75 metric, and there is also a 3.8-point improvement over oriented R-CNN on AP 0.50:0.95. Note that our method brings larger improvements on AP 0.75 and AP 0.50:0.95, indicating higher accuracy of rotated ship localization. This is because we adopt the shape-aware feature extractor and RoI align to avoid misalignment in ship RoI assignment and extraction. The visualization of the detection results is shown in Fig. 9.

C. Ablation Study
In this section, we investigate the impact of each proposed module on the final performance. Our method is mainly composed of SIS, shape-aware rotate RoI align (SRRA), and the shape-aware RoI extractor (SRE). First, we discuss the contribution of these three components to the overall performance.
Components of the Proposed Method: We add SIS, SRRA, and SRE to the baseline to observe the function of each module. According to Table III, after adding SIS to the baseline, AP0.50 and AP0.75 are both improved by 2 points. During training, SIS changes the distribution of the number of instances of different categories in the dataset: it replaces common ships with ships of rare categories so that the instance counts tend to be balanced, as shown in Fig. 10. SIS also considers the consistency of ship direction and background during replacement, introducing no additional noise.
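The balancing step of SIS can be sketched as a simple planning pass over the annotations: count instances per class, then mark instances of over-represented classes for replacement by the rarest classes. This is an illustrative sketch only; the class names are hypothetical, and the actual SIS additionally enforces orientation and background consistency (via the 2-D Gaussian modeling), which is omitted here:

```python
from collections import Counter

def switch_plan(instances):
    """Decide which instances of over-represented classes to overwrite
    with rare-class crops so per-class counts move toward balance.

    `instances` is a list of (instance_id, class_name) pairs; returns a
    list of (instance_id, replacement_class) switches."""
    counts = Counter(cls for _, cls in instances)
    target = len(instances) // len(counts)   # balanced count per class
    plan = []
    for inst_id, cls in instances:
        rare = [c for c, n in counts.items() if n < target]
        if counts[cls] > target and rare:
            new_cls = min(rare, key=counts.__getitem__)  # rarest class first
            counts[cls] -= 1
            counts[new_cls] += 1
            plan.append((inst_id, new_cls))
    return plan
```

With six "cargo" instances and one "tug", the plan switches two cargo instances to tug, leaving a near-uniform 4/3 split; applying such a plan during training is what flattens the category histogram in Fig. 10.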
With the addition of SRRA, AP0.50 gains 1.7 points and AP0.50:0.95 gains 2 points. SRRA assigns different numbers of parameters to the long and short sides of the ship, describing the ship more accurately without increasing the total parameter count, which improves bounding box regression and classification. SRE improves AP0.50 by 2.4 points and AP0.75 by 2.6 points, and the combination of SRE and SRRA gains 2.2 points on AP0.50 and 3.6 points on AP0.75. Because ships have large aspect ratios, even slightly inaccurate localization sharply reduces the IoU of ship instances. The improvement is even larger under the more stringent AP0.75 metric, indicating that SRE avoids misalignment during RoI assignment and makes ship localization more accurate. When the three proposed modules are applied to the baseline simultaneously, the performance is further improved, which shows that they complement and reinforce each other.
Level Selection in FPN: When performing feature extraction, we assign RoIs to all levels of the FPN. Due to the top-down pathway and lateral connections in the FPN, features at adjacent levels are correlated. To avoid unnecessary computation, we design ablation experiments to explore the importance of features at different levels. Table IV shows that combining the P2 and P3 level features already achieves an effect similar to the baseline. After integrating the P4 level, AP0.50:0.95 reaches 0.661 and is close to saturation; the additional P5 level still contributes a slight improvement on AP0.50 and AP0.75. In practical applications, we can therefore choose the combination {P2, P3, P4} to balance performance and computational cost, as shown in Fig. 11.
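For contrast with the multi-level aggregation above, baseline two-stage detectors typically route each RoI to a single pyramid level with the standard FPN heuristic k = floor(k0 + log2(sqrt(wh)/224)). A minimal sketch, with the clamp set to the {P2, P3, P4} subset suggested by the ablation (the parameter defaults follow the common FPN convention, not values stated in this article):

```python
import math

def fpn_level(w, h, k0=4, canonical=224, min_level=2, max_level=4):
    """Standard FPN level-assignment heuristic for an RoI of size w x h,
    clamped to the configured level range (here P2..P4)."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / canonical))
    return max(min_level, min(max_level, k))
```

A 224 x 224 RoI lands on P4, a 56 x 56 RoI on P2, and anything larger than about 448 pixels saturates at P4 under this clamp; the SRE instead aggregates the selected levels, avoiding the hard per-RoI choice.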
Spatial Attention Mechanism: The proposed shape-aware feature learning method can accurately extract ship features and aggregate them. After aggregation, we introduce spatial attention to guide the further learning of the network. As described in Section IV, feature attention and position attention are applied to generate the attention weight map. We design ablation experiments to explore the contribution of these two kinds of attention to the final results. Note that the baseline in this experiment is SRE and SRRA without any spatial attention. As Table V shows, position attention performs better than feature attention. This is because SRE has already aggregated the features of each level by addition; what remains is to find the locations with discriminative features. The points on the feature map are encoded by position attention, and a linear transformation yields their position features. The weight map is then generated from the point features and position features to guide the network to focus on the essential parts of the feature map. The performance is further improved after combining the two attentions, indicating that they are complementary to a certain extent.
Size of Shape-Aware RoI Align: Ships are characterized by a large aspect ratio, and a traditional square RoI align does not match their shape: it pays insufficient attention to the long side of the ship and wastes parameters on the short side. We propose a shape-aware RoI align that adapts to the ship's shape for better feature extraction, and design ablation experiments to select an appropriate grid shape. To avoid extra parameter overhead, we fix the product W × H of the shape-aware RoI align to at most 48, which is slightly smaller than the 7 × 7 = 49 of the square RoI align. According to Table VI, the performance is optimal when we set the shape to 4 × 12. As Fig. 12 shows, compared with the baseline, our method regresses more accurate bounding boxes for objects with large aspect ratios. In future designs, the shape of the feature map after RoI align could also be selected adaptively according to the shape of the RoI; note that the size of the feature maps output by RoI align must remain uniform for the subsequent classification and regression.
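The shape search under the W × H ≤ 48 budget can be written down directly: enumerate grids within the budget and pick the one whose aspect ratio best matches the RoI, preferring more bins on ties. A sketch of this selection rule (the tie-breaking preference for larger grids is our assumption, not stated in the article):

```python
def candidate_shapes(budget=48, max_side=16):
    """All (w, h) RoI-align grids with w <= h whose bin count stays
    within the 7 x 7 = 49 parameter budget."""
    return [(w, h) for w in range(1, max_side + 1)
                   for h in range(w, max_side + 1) if w * h <= budget]

def best_shape(roi_aspect, budget=48):
    """Grid whose aspect ratio h/w is closest to the RoI's aspect ratio;
    among ties, the grid with the most bins (finest sampling) wins."""
    return min(candidate_shapes(budget),
               key=lambda s: (abs(s[1] / s[0] - roi_aspect), -s[0] * s[1]))
```

For an aspect ratio of 3 this rule recovers exactly the 4 × 12 grid that Table VI finds optimal (1 × 3, 2 × 6, and 3 × 9 match the ratio too, but 4 × 12 uses the full 48-bin budget), while a square RoI would fall back to 6 × 6.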
Effectiveness of 2-D Gaussian Modeling: Fig. 8 shows that 2-D Gaussian modeling can guarantee the consistency of the background during the SIS process. We also design an ablation experiment to evaluate its effectiveness. According to Table VII, with the addition of SIS, AP0.50 and AP0.75 both gain 2.0 points, and AP0.50:0.95 gains 1.7 points. If the 2-D Gaussian modeling is removed from the SIS module, all three indicators, AP0.50, AP0.75, and AP0.50:0.95, drop significantly, which illustrates the effectiveness of the 2-D Gaussian modeling.
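One way 2-D Gaussian modeling can keep the pasted instance consistent with the ocean background is as a soft blending mask: weights near 1 at the ship centre decaying toward the patch border, so the switched instance fades into the surrounding water instead of leaving a hard seam. A minimal sketch under that interpretation (the `sigma_scale` value is an assumption, and real images would blend per channel):

```python
import math

def gaussian_mask(w, h, sigma_scale=0.25):
    """2-D Gaussian blending weights over a w x h patch: ~1 at the
    centre, falling off toward the borders."""
    cx, cy = (w - 1) / 2, (h - 1) / 2
    sx, sy = sigma_scale * w, sigma_scale * h
    return [[math.exp(-((x - cx) ** 2 / (2 * sx ** 2)
                        + (y - cy) ** 2 / (2 * sy ** 2)))
             for x in range(w)] for y in range(h)]

def blend(patch, background, mask):
    """Per-pixel convex combination: mask*patch + (1-mask)*background."""
    return [[mask[y][x] * patch[y][x] + (1 - mask[y][x]) * background[y][x]
             for x in range(len(patch[0]))] for y in range(len(patch))]
```

Without such a mask, a hard copy-paste boundary introduces exactly the kind of artificial edge noise that the ablation in Table VII shows to be harmful.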

D. Discussion
Extensive experiments demonstrate the effectiveness of our proposed method. However, when a ship is small, the information it provides is limited, and the performance of our method is not much different from that of existing feature learning methods. Shape-aware ship instance switching changes the distribution of the number of samples across categories and alleviates the problem of class imbalance; it is simple but effective. In the future, a classwise loss could also be designed to balance the network's learning ability for rare instances and further relieve this problem. In addition, it is essential to distinguish the bow and stern of the ship, which matters for subsequent feature representation and situation awareness. This task depends heavily on strong supervision in the form of directional OBBs. Since refined directional OBB annotation is expensive, it is worth introducing detection methods based on semisupervised or weakly supervised learning.

VI. CONCLUSION
In this article, we have presented a shape-aware feature learning method for detecting ships with large aspect ratios, which fixes the problem of feature misalignment during RoI assignment and extraction. To address the quantity imbalance across categories, we propose a shape-aware ship instance switching method. It guarantees the consistency of the ocean background and ship direction when replacing common ship instances with rare ones, which greatly improves the network's ability to learn rare instances. We also contribute the MCSD dataset to verify the effectiveness of the proposed method. Experiments on both MCSD and ShipRSImageNet demonstrate the effectiveness of our proposed method.