A Deformable Spatial Attention Mechanism-Based Method and a Benchmark for Dock Detection

Docks are significant sites in the shipbuilding industry, and their detection contributes to many important fields. With abundant methods and datasets available, deep learning-based object detection in remote sensing images has received wide attention. However, no existing dataset includes the dock class. This article first proposes a dock dataset to build a benchmark and advance dock detection research. Further, existing object detection methods cannot yield convincing results on docks because of the characteristics of docks. To meet the challenges in dock detection, a novel deformable spatial attention module (DSAM) is proposed to enhance the feature representation and localization of docks. Based on the DSAM, a novel network architecture is proposed to perform accurate and efficient dock detection. The ablation and comparison experiments show that the proposed methods are accurate and effective and are superior to the existing methods.

Digital Object Identifier 10.1109/JSTARS.2023.3265700
Docks are located near the seashore or waterfront and are used to construct new ships and repair old ones, playing a crucial role in the shipbuilding industry [5], [6]. The most complex process in shipbuilding is carried out on the docks [7]. Therefore, the dock is the most critical component of a shipyard. From the perspective of industrial production, docks of different scales can be used to produce ships with various throughputs, so detecting the docks directly and effectively reflects the production capacity of each shipyard. From the perspective of land resource utilization, timely detection of long-term vacant docks and timely adjustment of the land use type help improve land resource utilization in coastal areas. From the perspective of the ecological environment, welding, painting, and outfitting during ship construction release large amounts of wastewater, exhaust gas, and dust, which directly affect the ecological environment of the coastal area around the docks. The detection of docks therefore contributes to the protection of the ecological environment in coastal areas. At present, docks are traditionally surveyed with manual statistics, which lack a macroscopic view, are inefficient, and incur high labor costs. Satellite remote sensing (RS) can observe objects from a macroscopic perspective and can thus overcome the disadvantages of the traditional methods. However, to the best of our knowledge, few studies worldwide address the detection of docks based on RS images (RSIs). For instance, Firat et al. [8] proposed an approach for end-to-end object detection that leverages a large amount of unlabeled data and single-layer convolutional sparse autoencoders, and evaluated it on dry docks. Combining multisource optical RSIs and deep learning methods, this article aims to achieve macroscopic and efficient detection of docks with low labor cost.
Object detection has been widely applied in RSIs and has received extensive attention in recent years [9]. So far, several public geospatial object detection datasets for RSIs are available. Examples include the TAS dataset for automobile detection in aerial images [10], the SZTAKI-INRIA dataset for building detection [11], the RSOD dataset with four classes [12], [13], the UCAS-AOD dataset for vehicles and planes [70], the NWPU VHR-10 dataset containing 10 categories [14], [39], the HRSC2016 dataset for ship detection [15], the DOTA dataset [16] including 15 categories, and the DIOR benchmark in optical RSIs [18] containing 20 categories. In particular, the DOTA dataset has been expanded from DOTAv1.0 to DOTAv1.5 and DOTAv2.0 [17]. [...] learning-based detection of docks. To build a benchmark for dock detection, a novel dataset for deep learning-based dock detection is first proposed in this article.
For traditional object detection in natural scene images, the current deep learning-based methods can be divided into two main streams. The first is the two-stage stream, which detects objects with two-stage convolutional neural network (CNN) methods. The best-known two-stage methods include R-CNN [19], Fast R-CNN [20], Faster R-CNN [21], Mask R-CNN [22], R-FCN [23], and Cascade R-CNN [24]. The second is the single-stage stream, which, unlike the two-stage methods, does not need proposals generated by an extra network and therefore has a faster and simpler architecture [25]. The best-known single-stage methods include the YOLO series [26], [27], [28], [29], [30], SSD [31], RetinaNet [32], LADet [33], and so on.
Compared with object detection in natural scenes, object detection in RSIs faces more challenges [34], such as rotation, efficiency, and accuracy, which have been intensively studied. In the past few years, many studies have focused on two-stage methods, for example by adopting the R-CNN architecture to detect various geospatial objects in RSIs [35], [36], [37], [38], [39]. Cheng et al. [39] proposed a rotation-invariant CNN model for multiclass geospatial object detection, which adds a novel rotation-invariant layer to the R-CNN. However, the methods based on R-CNN are time consuming. To further improve the accuracy and efficiency of object detection in RSIs, Faster R-CNN-based object detection methods for RSIs have been developed and applied [40], [41], [42], [43], [44], [45]. For example, Li et al. [43] proposed a rotation-insensitive region proposal network (RPN) by introducing multiangle anchors into the RPN of the Faster R-CNN framework, which can effectively address the rotational variations of geospatial objects. Tang et al. [44] developed a hyperregion proposal network for vehicle detection and used hard negative sample mining to further improve the accuracy. Most of the above methods are based on horizontal bounding boxes (HBBs). With the release of the DOTA and DIOR benchmarks, object detection methods in RSIs have gradually shifted toward oriented bounding boxes (OBBs) [46], [47], [48], [49], [50], [51], [52], [53], [54], [55], and their baselines are not restricted to two-stage methods. Although OBBs can match the real forms of geospatial objects, they require extra parameters, such as the orientation of objects, and therefore more training samples. Because the number of dock samples in the proposed dock dataset is limited, this article focuses on the detection of docks based on HBBs. Therefore, the relatively traditional Faster R-CNN is adopted as the baseline in this article because of its universality, stability, and reliability for object detection in RSIs with HBB annotations.
With the rapid development of deep learning-based methods, the attention mechanism has arguably become one of the most important concepts [56]. Attention not only indicates where to focus, but also improves the representation of the regions of interest [57]. Attention in object detection can be categorized into two types: spatial attention and channel attention. Spatial attention allows networks to learn which positions to focus on along the spatial dimensions [56]. For instance, Mnih et al. [58] presented a spatial attention model as a single RNN, which takes a glimpse window as its input, selects the next location to focus on using the internal state of the network, and generates control signals in a dynamic environment. Channel attention allows networks to learn which channels to focus on [56]. For example, Hu et al. [59] proposed a squeeze-and-excitation (SE) network that adaptively recalibrates the channel-wise feature responses by explicitly modeling the interdependencies among different channels. In the SE network, the channel-wise attention is obtained from the global average-pooled features. To consider both the spatial and the channel dimensions, Chen et al. [60] proposed a novel spatial and channel-wise attention CNN that merges spatial and channel attention in a CNN. Woo et al. [57] developed a convolutional block attention module (CBAM), comprising sequentially connected channel and spatial attention modules (SAMs), which improves the performance while keeping the overhead small. Unlike CBAM, the BAM [61] arranges the channel attention module and the SAM in parallel, so that it can efficiently learn what and where to focus or suppress through two separate pathways and can refine the intermediate features effectively.
Located in shipyards, docks are used for ship construction and repair. Because ship construction takes a long time, the materials stacked on the same dock often change, so the appearance of a dock varies over time. Moreover, docks exhibit flexible orientations and various spatial distributions. Accordingly, the detection of docks in RSIs is more challenging. Nowadays, the mainstream object detection baseline is a backbone combined with a feature pyramid network (FPN) [62], which provides multilevel features to the subsequent architecture. However, the Faster R-CNN with FPN fails to consider the orientations and the features of the real dock forms within the HBBs, as they are usually mixed with background features inside the HBBs. Meanwhile, HBBs tend to overlap each other when docks are densely distributed, so that the network recognizes multiple docks as one. In addition, docks are easily confused with other geospatial objects in the shipyard, and even with the background.
To remedy the above deficiencies, a novel deformable spatial attention module (DSAM) is proposed based on the CBAM [57] and the deformable convolution [63], [64]. In the proposed DSAM, extra deformable convolution layers are merged with the traditional SAM to learn irregular location offsets that can match the real dock forms. Based on these location offsets, the network can focus on the informative features of the real dock forms. Therefore, with the DSAM, the dock features can be better represented, and the overlaps and omissions caused by HBB annotation in densely distributed cases can be relieved. Further, a novel network architecture is proposed based on the DSAM and the Faster R-CNN with FPN, which enhances the localization and feature representation of docks and thereby improves the detection accuracy. Finally, the effectiveness of the proposed methods is verified experimentally, and their accuracy is shown to be better than that of the existing methods.
The rest of this article is organized as follows. In Section II, the dock dataset is introduced and analyzed. In Section III, [...]
[...] [73], and ZY-3 [74]. The sources of the RSIs include 315 ZY-3 images, 575 Google Earth images, and 590 MW images, and the corresponding resolutions are 2 m, 0.5 m, and 0.5 m, respectively.
In the dock dataset, all the RSIs include three RGB visible bands and cover the complete area of each shipyard, and the docks are labeled by HBBs in the VOC annotation format [65]. Some typical RSIs labeled by HBBs in the dock dataset are given in Table I, in which the docks are marked by green HBBs. Further, the image size, the instance size (HBB size), and the dock density are illustrated in Fig. 1, while the ranges of the data are given in Table II. Because ship construction takes a long time, the same dock appears differently in different RSIs, for example stacked with materials or simply vacant. Moreover, docks of different scales have different characteristics. For instance, large-scale docks normally contain large guide rails and gantry cranes, which make them distinctly interpretable. In contrast, the characteristics of small docks are indistinct because of their limited scale. Depending on the location of the shipyard, docks may lie inland or adjacent to the water, so their orientations are not fixed. Therefore, it can be concluded that docks have high intraclass diversity. In addition, docks are densely distributed in some cases, as shown in shipyard 5. The above discussion confirms that the detection of docks is more challenging than that of other objects.
Because of the unique appearance of docks in RSIs, the correlation coefficients between each pair of bands of a dock instance are calculated to further reflect the characteristics of the docks (see Fig. 2). Each coefficient is calculated from the flattened pixel vectors of two bands, which are obtained by flattening the dock instance pixels within the HBBs in the images. The distributions of the correlation coefficients that are larger than 0.9 and 0.8 are given in Table III. Table III reveals that the correlations between the bands are high. A novel SAM and a novel network architecture for dock detection are proposed in the next section to overcome the above challenges and improve the detection accuracy.
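The band correlation described above can be sketched in a few lines. The following is a minimal illustration, assuming the Pearson correlation coefficient is the one intended and that the HBB coordinates are inclusive pixel indices; the function names and the toy image are hypothetical and are not taken from the dataset.

```python
import math


def pearson(u, v):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)


def band_correlations(image, box):
    """Correlate the R, G, and B bands inside one HBB.

    image: nested list indexed as image[row][col] -> (r, g, b)
    box:   (xmin, ymin, xmax, ymax) in pixels, inclusive (assumed convention)
    """
    xmin, ymin, xmax, ymax = box
    # Flatten the pixels inside the box into one vector per band.
    bands = [[], [], []]
    for row in range(ymin, ymax + 1):
        for col in range(xmin, xmax + 1):
            for k in range(3):
                bands[k].append(image[row][col][k])
    red, green, blue = bands
    return {
        "RG": pearson(red, green),
        "RB": pearson(red, blue),
        "GB": pearson(green, blue),
    }


# Hypothetical 2 x 3 toy image; the values are made up for illustration.
toy = [
    [(10, 12, 11), (50, 55, 52), (90, 95, 93)],
    [(20, 22, 21), (60, 66, 61), (30, 31, 33)],
]
corr = band_correlations(toy, (0, 0, 2, 1))
```

Counting how many dock instances have all three coefficients above 0.9 or 0.8 would then give a distribution of the kind summarized in Table III.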

III. METHODOLOGY
Because it can handle multiscale feature representation, the FPN [62] has been merged into mainstream network structures in recent years. Besides the FPN, attention mechanisms enable the network to focus more on the significant features in images, which also enhances the feature representation capability of the network. The CBAM [57] is one of the most popular attention mechanisms; it consists of a channel attention module (CAM) and an SAM, which together focus on the significant features along the channel and spatial dimensions.
Although these extra network structures enhance the expression of features, the existing methods are not well adapted to dock detection because of the dock characteristics. Specifically, the CBAM is usually merged with the backbone to enhance the feature representation of the regions of interest, and the SAM in CBAM enables the network to focus on the spatially informative parts. Nevertheless, the receptive field of the conventional convolution in the SAM is commonly a square area, which leads the network to focus on parts that do not match the real dock forms. Moreover, the orientations of docks vary widely within HBBs. Consequently, within limited square areas, the SAM cannot make the network focus well on the features of the real dock forms, and it unduly merges the features of dock and background. On the other hand, the CAM in CBAM enables the network to focus on the informative channels among the many channels of the features. However, the correlations between the visible bands of the docks are strong according to the results in Table III. Excessive channel attention brings more parameters to the model, which may result in overfitting and reduce the final accuracy on the test set. Thus, this article focuses only on spatial attention, which can improve the detection accuracy of the docks. Further, the way the attention module is combined with the network structure is critical. Owing to the multiblock structure of the backbone, the sizes of the feature maps differ. Therefore, where the attention module is merged into the network directly affects the ability of the network to localize and detect docks.
To address the above issues and improve the feature expression and the localization accuracy of docks, a novel DSAM and a novel network structure are proposed in this article. Inspired by the deformable convolution [63], [64], the DSAM (see Fig. 3) is proposed, in which the traditional convolution of the SAM is replaced with the deformable convolution. The deformable convolution is able to learn irregular offsets with additional convolutional layers, which enables the network to focus on the features of irregular dock forms. The principle of DSAM is expressed in (1)-(3). In (1), F represents the input feature of DSAM, and F_Avg and F_Max are the outputs of the sibling average pooling and max pooling, respectively. Then, the two outputs are aggregated and input to the deformable convolution layer, as calculated in (2). Here, σ represents the sigmoid function, while N represents the sampling kernel size. F_DSA is the deformable spatial attention map, that is, the spatial weighting map. The deformable convolution operation is calculated by (3), where Δp_n and Δm_n are the learnable offset and modulation scalar for the nth location, respectively. The value range of Δm_n is [0, 1], while Δp_n is a real number with unconstrained range. The w_n and p_n denote the weight and the prespecified offset for the nth location, respectively. The sampling takes place at the irregular offset locations p + p_n + Δp_n, which are usually fractional, so the values x(p + p_n + Δp_n) are calculated by bilinear interpolation. Finally, F_DSA and F are multiplied element-wise to produce the spatially weighted feature (the output feature in Fig. 3), which is input to the subsequent structures.
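Read from the symbol definitions in this paragraph, (1)-(3) can be sketched as follows. This is a reconstruction under assumptions rather than the authors' exact notation: the operator names AvgPool, MaxPool, and DConv are placeholders, the concatenation order is an assumption, and the form of (3) follows the modulated deformable convolution of [64], with the sum running over the sampling locations of the kernel.

```latex
\begin{align}
F_{\mathrm{Avg}} &= \mathrm{AvgPool}(F), \qquad
F_{\mathrm{Max}} = \mathrm{MaxPool}(F) \tag{1} \\
F_{\mathrm{DSA}} &= \sigma\!\left( \mathrm{DConv}^{N \times N}\!\left( [F_{\mathrm{Max}}, F_{\mathrm{Avg}}] \right) \right) \tag{2} \\
y(p) &= \sum_{n} w_n \cdot x(p + p_n + \Delta p_n) \cdot \Delta m_n \tag{3}
\end{align}
```

Under the same assumptions, the output feature of Fig. 3 would be the element-wise product of F_DSA and F, with F_DSA broadcast over the channels of F.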
In the DSAM, the multichannel information is first compressed by the sibling channel-wise pooling operations. Then, the aggregated feature [F_Max, F_Avg] is input to the deformable convolution layer to obtain the irregular directional offsets. Finally, the deformable spatial attention map is obtained and used to weight the input feature spatially. With the DSAM, the network can learn to focus on features within irregular shapes that approximately resemble the forms of docks. Therefore, the characteristics of docks can be better represented with the DSAM, because the attention is no longer restricted to the square receptive field of the traditional convolution.
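The data flow of the DSAM can be illustrated with a small, dependency-free sketch. It is a simplification under stated assumptions, not the authors' implementation: the offsets and modulation scalars, which the module learns with extra convolution layers at every position, are passed in by hand and shared across positions, samples outside the map count as zero, and all names are hypothetical.

```python
import math


def channel_pool(feature):
    """Compress a C x H x W feature into average and max maps (each H x W)."""
    channels = len(feature)
    height, width = len(feature[0]), len(feature[0][0])
    avg = [[sum(feature[c][i][j] for c in range(channels)) / channels
            for j in range(width)] for i in range(height)]
    mx = [[max(feature[c][i][j] for c in range(channels))
           for j in range(width)] for i in range(height)]
    return avg, mx


def bilinear(fmap, y, x):
    """Sample fmap at a fractional location; outside the map counts as 0."""
    height, width = len(fmap), len(fmap[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < height and 0 <= xx < width:
                value += (1 - abs(y - yy)) * (1 - abs(x - xx)) * fmap[yy][xx]
    return value


def deformable_response(maps, weights, p, offsets, modulation, k):
    """Modulated deformable convolution response at one position p.

    weights[m][n] is the weight for input map m at kernel location n;
    offsets[n] = (dy, dx) and modulation[n] in [0, 1] are given by hand here.
    """
    half = k // 2
    # Prespecified regular grid offsets p_n of a k x k kernel.
    grid = [(dy, dx) for dy in range(-half, half + 1)
            for dx in range(-half, half + 1)]
    total = 0.0
    for n, (gy, gx) in enumerate(grid):
        y = p[0] + gy + offsets[n][0]
        x = p[1] + gx + offsets[n][1]
        for m, fmap in enumerate(maps):
            total += weights[m][n] * bilinear(fmap, y, x) * modulation[n]
    return total


def dsam(feature, weights, offsets, modulation, k=3):
    """Return the attention map and the spatially weighted feature."""
    avg, mx = channel_pool(feature)
    height, width = len(avg), len(avg[0])
    attn = [[1.0 / (1.0 + math.exp(-deformable_response(
        [mx, avg], weights, (i, j), offsets, modulation, k)))
        for j in range(width)] for i in range(height)]
    out = [[[feature[c][i][j] * attn[i][j] for j in range(width)]
            for i in range(height)] for c in range(len(feature))]
    return attn, out


# Hypothetical 2-channel 3 x 3 feature and hand-picked parameters.
feature = [
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]],
    [[9.0, 8.0, 7.0], [6.0, 5.0, 4.0], [3.0, 2.0, 1.0]],
]
k = 3
weights = [[0.01] * (k * k), [0.01] * (k * k)]
offsets = [(0.5, -0.25)] * (k * k)
modulation = [1.0] * (k * k)
attn, out = dsam(feature, weights, offsets, modulation, k)
```

With all offsets set to zero and all modulation scalars set to one, the sampling falls back to the regular square grid, which corresponds to the conventional SAM convolution that the DSAM replaces.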
Based on the proposed DSAM and the ResNet50 with FPN, a novel feature extraction network architecture is proposed, which is illustrated in Fig. 4. Because the ResNet50 is a bottom-up structure and the FPN is a top-down structure, the DSAM should be merged into each of the two structures. For precise localization and feature representation, the DSAM is added to the ResNet50 before Resblock1 and after Resblock4. Because the feature map output by Conv layer1 contains low-level semantic information and extensive location information, we add a DSAM between Conv layer1 and Resblock1, mainly to enhance the localization of docks. Further, the feature representation and classification capability of the network should also be taken into account. The C5 feature contains extensive high-level semantic information. Hence, a DSAM is added after Resblock4, mainly to enhance the feature representation of docks. With these DSAMs, the localization and feature representation of docks in ResNet50 can be improved without excessive expansion of the network. On the other hand, the FPN is an extra structure attached to ResNet50, whose input features are the outputs of ResNet50. We set C2, C3, C4, and the output of the top DSAM in ResNet50 as the inputs to the FPN. In the FPN, the inner block (conv2d 1 × 1 in Fig. 4) unifies the dimensions of the input features to 256. Moreover, there is an upsampling layer between adjacent FPN layers, which expands the spatial size of the feature map. During upsampling, spatial attention is significant for preserving high-level semantic information while accurately locating docks. Thus, a DSAM is added after each inner block in the FPN to further reinforce the localization and feature representation of docks.
With the addition of the DSAM to the ResNet50 with FPN, the backbone can focus on the informative parts that match the real dock forms and reinforce the representation of dock features during the feature extraction stage. Finally, the multiscale features P2-P6 are input to the RPN, and the subsequent processes are the same as in the Faster R-CNN.
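The placement described above can be summarized as a small plan of feature sizes. This is a sketch under assumptions: the strides and channel widths are the standard ResNet50 values, the bottom DSAM is assumed to sit directly after the first convolution (before the stem max pooling), P6 is assumed to be a stride-2 subsampling of P5 without its own DSAM, and the function and stage names are hypothetical.

```python
def ceil_div(value, stride):
    """Spatial size after a total downsampling of `stride`, rounded up."""
    return -(-value // stride)


def dsam_backbone_plan(height, width):
    """List each stage with its channels, spatial size, and DSAM placement."""
    # Bottom-up pathway: (name, total stride, channels, followed by a DSAM?)
    bottom_up = [
        ("Conv layer1", 2, 64, True),        # bottom DSAM: localization
        ("Resblock1 (C2)", 4, 256, False),
        ("Resblock2 (C3)", 8, 512, False),
        ("Resblock3 (C4)", 16, 1024, False),
        ("Resblock4 (C5)", 32, 2048, True),  # top DSAM: representation
    ]
    plan = []
    for name, stride, channels, has_dsam in bottom_up:
        plan.append({
            "name": name,
            "channels": channels,
            "size": (ceil_div(height, stride), ceil_div(width, stride)),
            "dsam": has_dsam,
        })
    # Top-down pathway: every 1 x 1 inner block outputs 256 channels and is
    # followed by a DSAM.
    for name, stride in (("P5", 32), ("P4", 16), ("P3", 8), ("P2", 4)):
        plan.append({
            "name": name,
            "channels": 256,
            "size": (ceil_div(height, stride), ceil_div(width, stride)),
            "dsam": True,
        })
    # P6 has no inner block, so no DSAM is attached to it in this sketch.
    plan.append({
        "name": "P6",
        "channels": 256,
        "size": (ceil_div(height, 64), ceil_div(width, 64)),
        "dsam": False,
    })
    return plan


plan = dsam_backbone_plan(512, 512)
dsam_count = sum(1 for stage in plan if stage["dsam"])
for stage in plan:
    marker = " + DSAM" if stage["dsam"] else ""
    print(f'{stage["name"]}: {stage["channels"]} x {stage["size"]}{marker}')
```

The plan makes the asymmetry visible: the bottom DSAM operates on a large map with few channels, whereas the top DSAM and the FPN DSAMs operate on small maps, which is relevant when choosing the kernel size of each module.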

IV. EXPERIMENTS AND RESULTS
The proposed dock dataset and methods are used to conduct the experiments. The experiments are performed with PyTorch on Ubuntu, using one RTX 3070 GPU with 8 GB of memory. We use 70% of the dataset as the training set and 30% as the test set. To thoroughly evaluate the effectiveness of the proposed methods, we first perform extensive ablation experiments and verify that the proposed methods outperform the existing methods. Further, we analyze the effect of the kernel size hyperparameter on the final detection results. Finally, we compare the predicted results with the ground truth boxes in the test set. The mAP in the following results is calculated according to the evaluation metric of the MS COCO dataset [67] at an intersection over union (IoU) threshold of 0.5. The pretrained weights [68] trained on MillionAID [69] are used in this article.
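The role of the IoU threshold of 0.5 can be illustrated with a short sketch for a single class in a single image. It is a simplification and not the MS COCO evaluation code: the average precision is computed here as the area under the interpolated precision-recall curve, whereas the official tool samples the curve at fixed recall points and aggregates over the whole test set. The boxes and scores are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two HBBs given as (xmin, ymin, xmax, ymax)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def average_precision(predictions, ground_truth, threshold=0.5):
    """AP for one class; predictions is a list of (score, box) pairs."""
    if not ground_truth:
        return 0.0
    matched = set()
    hits = []
    # Visit predictions from the most to the least confident.
    for _, box in sorted(predictions, key=lambda p: p[0], reverse=True):
        best, best_iou = None, threshold
        for index, gt in enumerate(ground_truth):
            overlap = iou(box, gt)
            if index not in matched and overlap >= best_iou:
                best, best_iou = index, overlap
        if best is None:
            hits.append(0)  # false detection
        else:
            matched.add(best)  # each ground truth box is matched once
            hits.append(1)
    precisions, recalls, true_positives = [], [], 0
    for rank, hit in enumerate(hits, start=1):
        true_positives += hit
        precisions.append(true_positives / rank)
        recalls.append(true_positives / len(ground_truth))
    # Interpolate: precision at a recall is the best precision to its right.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, previous_recall = 0.0, 0.0
    for precision, recall in zip(precisions, recalls):
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap


# Hypothetical example: two docks, two good detections, one false detection.
ground_truth = [(0, 0, 10, 10), (20, 0, 30, 10)]
predictions = [
    (0.9, (1, 0, 11, 10)),
    (0.8, (50, 50, 60, 60)),
    (0.7, (20, 1, 30, 11)),
]
ap = average_precision(predictions, ground_truth)
```

A single predicted box that covers two adjacent docks can match at most one ground truth box, so the other dock counts as missed; this is how merged detections in dense distributions lower the score.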

A. Comparison and Ablation Experiments
To verify the effectiveness of the proposed network architecture, experiments with different network architectures are conducted first. As analyzed in Section III, the localization and the feature representation of docks should both be considered. Thus, the top and bottom DSAMs in both ResNet50 and FPN are required. We compare the cases with and without DSAM in the two intermediate layers of the FPN; these results and those of other existing methods are given in Table IV.
According to the results in Table IV, the proposed network architecture achieves the best detection result and is superior to the existing methods. Further, the results show that the DSAM reinforces the feature representation and localization of docks during upsampling in the FPN. Spatial attention is significant for preserving high-level semantic information while accurately locating docks, which directly affects the final detection results. Therefore, the proposed network architecture is reasonable: based on the DSAM, it focuses on the informative dock area during the whole bottom-up and then top-down feature extraction process.
To further verify the effectiveness and superiority of the proposed methods, more ablation experiments are conducted based on the proposed network architecture. Different network architectures are compared in Table V, which also reports the frames per second (FPS) and the number of parameters of the whole network (Param.).
As given in Table V, the mAP of the proposed architecture (with DSAM only) is the best; it is 1.9% higher than that of the baseline with CBAM (CAM+SAM) and 4.9% higher than that of the baseline. However, the combination of DSAM and CAM (MCBAM) performs the worst among all the improved architectures, which demonstrates that excessive channel-wise attention has a negative effect. As shown in Fig. 2, there are high correlations between the visible bands of the docks. In the MCBAM, the DSAM is influenced by the excessive channel-wise attention. As a result, the DSAM cannot perform its original function, and the detection accuracy is worse than with the DSAM alone. The proposed DSAM focuses on the features within the approximate forms of the real docks, which spatially enhances both the feature representation and the localization of docks. Hence, the proposed network architecture achieves the best accuracy.

B. Hyperparameter Experiments
The effectiveness of the proposed methods has been verified by the above experiments. Further, comparison experiments on a detailed hyperparameter are performed. Depending on the kernel size of the deformable convolution layer in the DSAM, the receptive fields of the attention parts differ, which affects the detection accuracy. The detection result is better when the network matches the real form of the docks more precisely. Thus, the kernel size of the DSAM is a significant hyperparameter. In ResNet50, the bottom feature has a large spatial dimension and a narrow channel dimension and contains abundant location information, while the top feature has a small spatial dimension and a wide channel dimension and contains abundant high-level semantic information. We empirically set the kernel sizes of the top and the bottom DSAM in ResNet50 to 3 and 7, respectively. Further, considering the top-down structure of the FPN and the fact that all its input features have the same channel size, the kernel sizes in the FPN are empirically set to the same value. We compare the effect of different DSAM kernel sizes in the FPN, and the results are given in Table VI. As given in Table VI, the best result is obtained when the kernel size is set to 7. Because of the sizes of docks, a small kernel limits the feature representation and the localization of docks and cannot match the real forms of the docks well. Therefore, a kernel size of 7 × 7 is the optimal setting for dock detection.
TABLE VII: GROUND TRUTH BOUNDING BOXES AND THE PREDICTION RESULTS

C. Detection Results
Based on the optimal parameters of the trained network, the test set is employed to compare the prediction results with the ground truth (GT) boxes. In Table VII, the detection results can be classified into various cases. Specifically, the detection results in shipyards 1 and 2 are correct, which means that the number and the area of the docks are all basically correct. The detection results in shipyard 3 contain missed detections because two vacant docks are not detected, possibly because vacant docks lack sufficiently distinctive features. Further, there are few vacant docks in the dock dataset, so they are not "learned" by the network. In shipyards 4-7, the detection results are acceptable in the case of dense distribution, which indicates that the proposed DSAM and network can detect densely distributed docks by focusing on the characteristics of each dock. Shipyard 8 shows a false detection: a material area is incorrectly identified as a dock because of the similar characteristics of the two types of objects, that is, both contain materials for ship construction and gantry cranes. In shipyards 9 and 10, two adjacent docks are identified as one, which may be caused by the unclear boundary between the docks. Moreover, some typical prediction results in the case of dense distribution are given in Table VIII. The proposed methods identify the docks more accurately than the baseline in the case of dense distribution because the latter tends to identify multiple neighboring docks as one. In summary, the proposed methods exhibit better detection results than the existing methods, with an mAP of 71.02%, although there are still some special cases of missed detections and false detections.
(In Table VII, the green boxes indicate the correct detections, the blue ones indicate the false detections, the red ones indicate the syncretic detections, and the yellow ones indicate the missed detections.)

V. CONCLUSION
To perform accurate and efficient dock detection, a novel dock dataset is first proposed in this article, which includes 1480 images and 4704 docks labeled by HBBs. To overcome the challenges in dock detection, a novel DSAM is proposed to drive the network to learn features that match the actual form of the docks. Based on the DSAM, a novel feature extraction network is proposed, which merges the DSAM with the backbone and focuses on both the feature representation and the localization of docks. Further, the experiments demonstrate that the proposed methods are accurate and efficient and are superior to the existing methods. The mAP of the proposed methods reaches 71.02%. Finally, the predicted results are compared with the ground truth (GT) boxes, which demonstrates that the predicted results are compelling.