MashFormer: A Novel Multiscale Aware Hybrid Detector for Remote Sensing Object Detection

Object detection is a critical and demanding topic in the field of satellite and airborne image processing. The targets acquired in remote sensing imagery are of various sizes, and the backgrounds are complicated, which makes object detection extremely challenging. We address these issues in this article by introducing MashFormer, an innovative multiscale aware hybrid detector that integrates a convolutional neural network (CNN) and the transformer. Specifically, MashFormer employs the transformer block to complement the CNN-based feature extraction backbone, which can capture the relationships between long-range features and enhance the representational ability in complex background scenarios. Since object sizes vary greatly in remote sensing scenarios, a multilevel feature aggregation component, incorporating a cross-level feature alignment module, is designed to improve detection performance for multiscale objects and to alleviate the semantic discrepancy between features from shallow and deep layers. To verify the effectiveness of the suggested MashFormer, comparative experiments are carried out against other cutting-edge methodologies on the publicly available high resolution remote sensing detection (HRRSD) and Northwestern Polytechnical University (NWPU) VHR-10 datasets. The experimental findings confirm the effectiveness and superiority of our suggested model, indicating that our approach achieves greater mean average precision (mAP) than the other methodologies.

I. INTRODUCTION

Optical remote sensing imagery is capable of providing rich details and spatial structure information, which makes it possible to detect objects of interest within images [1]. Object detection in remote sensing data has grown in popularity over the past several decades, since it is essential for a variety of real-world applications, including environmental monitoring, natural hazard warning, precision agriculture, and military operations [2].
Object detection, a vital yet complicated task in the study of satellite and airborne images, aims to locate objects of interest in a given remote sensing image and determine their categories. Numerous approaches for object detection in optical images have been developed over the past few decades, and these approaches can be broadly categorized into two divisions: 1) traditional methods and 2) deep learning-based methods.
The majority of traditional methods rely on manually crafted features, such as threshold segmentation, texture feature extraction, and geometric feature extraction. The extracted features are used to identify the target of interest through template matching or machine learning. The machine learning-based method is the most representative among these approaches; it treats object detection as a classification problem. Feature extraction, optional feature fusion, and dimensionality reduction are typically performed first, followed by classifier training. The histogram of oriented gradients [3] has been widely used for feature extraction. Zhong et al. [4] proposed an integrated model of multiple conditional random fields to enhance feature extraction performance. Classifiers have been trained using a wide range of techniques, such as support vector machine (SVM), AdaBoost, k-nearest neighbor, and artificial neural network methods [5]. Because the effectiveness of feature extraction using traditional approaches is constrained by the varied properties of remote sensing targets, these methods are mostly applicable only in specific situations. Besides, they require a large number of experiments to design suitable feature extraction schemes, which is time-consuming.
Owing to its remarkable feature representation capabilities, deep learning has been extensively used in computer vision applications, greatly enhancing the effectiveness of optical object detection techniques. There are two primary types of deep learning-based detection techniques: one-stage and two-stage. Girshick et al. [6] introduced the two-stage region-based convolutional neural network (R-CNN), the first method to use deep learning for object detection. With the aim of speeding up the processing pipeline, fast R-CNN [7], faster R-CNN [8], and mask R-CNN [9] were developed. Meanwhile, one-stage approaches were proposed, including the you only look once (YOLO) series [10], [11], [12], the single shot detector (SSD) [13], RetinaNet [14], and fully convolutional one-stage object detection (FCOS) [15]. These methods conduct target localization and classification in a single network without a region proposal stage, significantly reducing model complexity and enabling real-time processing. Keypoint-based methods, which are one-stage detectors, regress the center location and bounding box information; thus, they are considered anchor-free methods. Examples include CornerNet [16], CenterNet [17], and CenterNet-Triplets [18].
Although deep learning has made breakthroughs in optical image target detection, the results of remote sensing object detection are not satisfactory because commonly used detection methods have not considered the unique characteristics of objects in remote sensing imagery.
First, objects captured from the top view are characterized by multiple scales; even targets in the same category may have different sizes. Therefore, a robust detector suitable for multiscale objects is required. Second, many objects of interest are small and closely spaced; as a result, the properties of the extracted features are diminished or lost after passing through the pipeline, which consists of convolutional operations and pooling layers. Third, since the sensor has a large imaging range, images captured by remote sensing have backgrounds that are more complicated and varied than those of other images. In most cases, the background accounts for a larger areal proportion than the objects of interest. Accordingly, the detector needs to be able to focus on the pixels of the target. Therefore, it is critical for a multiscale aware detector to enhance the characteristics of objects in remote sensing imagery.
We suggest a specialized detector called MashFormer based on the properties of remote sensing imagery discussed previously. It takes the CenterNet anchor-free detection network as its baseline. In the feature extraction stage, we use the transformer [19] block to augment the original residual module-based backbone. Owing to its capacity to model long-range correlation dependencies, the transformer has been widely employed in natural language processing (NLP). In contrast, the convolutional layer is more suitable for extracting effective local information from neighboring pixels. The proposed backbone is a hybrid design that combines convolutional layers and transformer blocks to extract information from the entire image. This approach is better suited for scenarios with a complex background and high object density. Besides, we adjust the lateral connection module according to the abovementioned backbone design and suggest a feature alignment module to improve the effectiveness of feature fusion. Since detailed information is easily lost in the cascaded convolutional layers, a common practice is to establish a feature pyramid network (FPN) [20], [21], [22] to collect features with thorough semantic representations and precise spatial information. The Hourglass-104 network constructs feature maps by combining lower level spatial features and higher level semantic features in a top-down manner. However, this strategy produces semantic gaps during each step of the forward propagation of information. Our proposed feature alignment module utilizes an attention mechanism to automatically learn the offset between features from different levels. It then uses dynamic convolution to align the features from the higher level, with the sampling positions of the dynamic convolution kernel adjusted by the offset. The aligned features are fused with high efficiency; therefore, our model can detect multiscale objects, especially small targets.
Experiments on the remote sensing HRRSD dataset indicate that the mean average precision (mAP) of MashFormer is demonstrably greater than that achieved by other algorithms. The method is also highly applicable to small-scale targets. A comparative experiment is carried out on the NWPU VHR-10 dataset to validate that the suggested method is compatible with various datasets. The following is a summary of our contributions.
1) We suggest an innovative multiscale aware hybrid detector called MashFormer for the purpose of detecting objects in remote sensing. It performs better than other cutting-edge approaches in situations with complex backdrops, various scales, and small targets.
2) We design a hybrid backbone that consists of convolutional layers and transformer blocks; it models the correlation between neighboring pixels as well as between long-range features, improving detection performance in complex background scenarios.
3) A feature alignment module is developed to align higher level features with the corresponding lower level features, improving the propagation of semantic information during feature fusion and preventing the loss of detailed information. This approach enhances the performance of multiscale object detection, particularly for small targets.
The rest of this article is organized as follows. Current remote sensing detection related techniques are covered in Section II. Our suggested approach is introduced in Section III. In Section IV, the experimental findings for the suggested method and alternative approaches are presented. Finally, Section V concludes this article.

II. RELATED WORK

A. Object Detection in Optical Remote Sensing Imagery
Deep learning-based object detection techniques have been widely developed and have achieved considerable success. Modern deep learning-based object detection techniques fall into two broad categories: two-stage detectors and one-stage detectors. The two-stage detector separates the detection pipeline into two stages: stage one generates high-quality candidate regions from the input image, and stage two performs fine-grained object classification and bounding box regression. The R-CNN series is a typical representative of two-stage detectors. The original R-CNN [6] leverages the selective search algorithm to generate region proposals, which are then fed into a convolutional neural network (CNN) for feature extraction. An SVM classifier is employed to classify the region proposals. Fast R-CNN [7] draws on OverFeat and the spatial pyramid pooling net, incorporating a region of interest pooling layer. By sharing the feature extraction computation across region proposals, this model improves upon R-CNN. Faster R-CNN [8] further improves region proposal extraction with a region proposal network, which enables end-to-end training of object detection. The one-stage detector replaces region proposals with anchor boxes. The location and category information of an object can be obtained from the extracted features via a single CNN backbone. Representative methods include SSD [13], the YOLO series [10], [11], [12], RetinaNet [14], and FCOS [15]. SSD employs feature maps from convolutional layers of various resolutions to recognize objects of various sizes. YOLOv1 [10] scales the image to be detected to a fixed size and divides it into grids; each grid cell is responsible for detecting objects whose centers fall within it. Unlike YOLOv1, YOLOv3 [12] reuses the classifier or locator for detection and applies the model to multiple locations and image scales; areas with higher scores are regarded as valid detection results. Recently, novel anchor-free detection methods have attracted the attention of many scholars, such as CornerNet [16] and CenterNet [17]. These detectors directly extract network features to classify the category and regress the position of a target. The CenterNet network structure is very simple: it only predicts the center point of an object. Three branches are obtained through the feature extraction network: the heatmap branch, the width-height (WH) branch, and the offset branch. The target's size and center are then determined. CenterNet is a one-stage end-to-end target detection method.
Object detection techniques based on deep learning have also been developed for optical remote sensing imagery. For instance, Cheng et al. [23] added a rotation-invariant layer and a Fisher discriminative layer to an existing CNN model to deal with object rotation. Zhou et al. [24] introduced rotation-sensitive feature maps for regression and rotation-invariant feature maps for classification to ensure that the detector was rotation invariant. Guo et al. [25] suggested a unified multiscale CNN for multiscale objects, especially small objects in high-resolution satellite images. Cheng et al. [26] combined a cross-scale feature fusion framework with a squeeze and excitation block inserted into the top layer of an FPN to obtain discriminative multilevel feature representations. Qian et al. [27] proposed a multilevel feature fusion module and incorporated it into an existing hierarchical deep network to exploit multilevel features for each region proposal. Lei et al. [28] added a saliency restriction and a multilayer fusion approach to a CNN model to deal with complex scenes in remote sensing imagery. Hu et al. [29] proposed a sample update-based CNN framework for object detection in images with complicated backgrounds and various ground cover types. Ye et al. [30] proposed an adaptive attention fusion mechanism (AAFM), which employs parallel spatial and channel attention to obtain the optimal feature representation; the AAFM can be easily integrated into the basic convolution block of the backbone.

B. Transformer Block
The transformer [19] was originally proposed for NLP and has become the most cutting-edge technology for these tasks.
The advantage of the transformer lies in its self-attention block. Let the query/key dimension be d, and let the query, key, and value matrices be Q, K, and V, respectively. The self-attention with relative position bias B is computed as

Attention(Q, K, V) = SoftMax(QK^T / √d + B) V  (1)

where the relative position information is injected through B. The success of the transformer in NLP has led many researchers to investigate its use in computer vision applications. Some researchers incorporated transformers into standard CNNs. Carion et al. [31] combined a CNN and the transformer and proposed an end-to-end object detection framework, in which transformers replace the custom-designed components after the CNN backbone to output the final predictions. Zhu et al. [32] combined the spatial sampling ability of deformable convolution layers with the relationship modeling capability of the transformer. Chi et al. [33] presented a transformer-based decoder module to transform other representations into a typical single format. Another strategy is to build a backbone network based on the transformer block. The vision transformer (ViT) [34] is a ground-breaking method for classifying images using a pure transformer architecture. It divides the image into patches and has shown an impressive accuracy-speed tradeoff compared with CNN-based models. The Swin transformer [35] constructs a hierarchical representation and computes self-attention locally within nonoverlapping windows, achieving linear computational complexity.
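For concreteness, a minimal PyTorch sketch of the scaled dot-product attention in (1) is given below. The function name and toy tensor sizes are illustrative assumptions; the real Swin implementation additionally indexes a learned bias table per window.

```python
import torch
import torch.nn.functional as F

def attention_with_relative_bias(q, k, v, rel_bias):
    """Scaled dot-product attention with an additive relative position
    bias B, as in (1). q, k, v: (num_tokens, d); rel_bias: (n, n)."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5  # pairwise similarities
    scores = scores + rel_bias                   # inject relative positions
    return F.softmax(scores, dim=-1) @ v         # weighted sum of values

# Toy usage: 16 tokens of dimension 32.
q = k = v = torch.randn(16, 32)
out = attention_with_relative_bias(q, k, v, torch.zeros(16, 16))  # (16, 32)
```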

C. FPN and Feature Alignment Module
In the pyramidal structure of CNNs, high-level semantic information is represented by low-resolution feature maps. To recover spatial detail, a common strategy is to build an encoder-decoder structure, i.e., top-down and bottom-up connections. The FPN [20] is a representative top-down method: it propagates high-level semantic information downward through upsampling operations and injects spatial detail through lateral connections, generating feature maps with multiscale feature representations. The path aggregation network [21] shortens the information flow by adding a bottom-up path on top of the FPN and improves the feature pyramid with precise localization features from low levels. A weighted bidirectional FPN [22] uses learnable weights to evaluate the relative importance of different input features. REFB [36] uses deformable convolution to expand the receptive field on top of the backbone; the expanded feature is then directly added to different levels of the FPN through lateral connections.
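As a reference point for the fusion problem discussed next, a minimal sketch of one FPN top-down fusion step is shown below (layer widths are illustrative assumptions, not the exact configuration of any cited network):

```python
import torch.nn as nn
import torch.nn.functional as F

class FPNTopDownStep(nn.Module):
    """One top-down fusion step of an FPN: a 1x1 lateral convolution
    projects the backbone feature, the coarser pyramid map is upsampled
    2x, and the two are added pixelwise (the step that can misalign)."""
    def __init__(self, c_backbone, c_pyramid=256):
        super().__init__()
        self.lateral = nn.Conv2d(c_backbone, c_pyramid, kernel_size=1)

    def forward(self, c_i, p_coarser):
        up = F.interpolate(p_coarser, scale_factor=2, mode="nearest")
        return self.lateral(c_i) + up  # pixelwise addition
```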
Nevertheless, feature fusion in the FPN adds pixels indiscriminately; the resulting misalignment during feature aggregation causes significant performance degradation. Recently, many researchers have focused on feature fusion. SegNet [37] utilizes a decoder with the pooling indices obtained in the max-pooling step to map the low-resolution encoder features to full input resolution. IndexNet [38] uses an index-guided encoder-decoder framework in which adaptively learned indices guide the upsampling operators. By introducing a learnable transformation, the guided upsampling network [39] uses a guided upsampling module to enhance the upsampling operators. AlignSeg [40] uses learnable interpolation to determine the transformation offsets between multiresolution features. The semantic flow network [41] learns the semantic flow between feature maps of neighboring levels and propagates high-level features to high-resolution features.

III. PROPOSED METHOD
The framework of the suggested MashFormer is shown in Fig. 1. MashFormer is based on the classical one-stage detection network CenterNet.
First, we employ the transformer block and the CNN layer to design a hybrid feature extraction backbone. To acquire the correlations between long-range features while maintaining linear computational complexity, the Swin transformer is selected to complement the original residual-based feature extraction module.
Second, we design efficient lateral connections for lower level features in the backbone so that they can be forward propagated. Before fusing the features from adjacent levels, we suggest an attention-based feature alignment component to minimize semantic gaps and improve the network's performance. The designed components can all be easily integrated into the current detection networks. More information is covered in Sections III-A through III-C.

A. Overall Architecture
The suggested network's structure is depicted in Fig. 1. The optical remote sensing image displayed on the left is used as the input to the pipeline, and the suggested feature extraction network is represented by the blue trapezoid box in the center. The original CenterNet uses three alternative feature extraction networks (backbones), i.e., ResNet-18, DLA-34, and Hourglass-104. Since targets in optical remote sensing images have variable features, a shallow feature extraction network cannot adequately distinguish the target from the background. The Hourglass-104 [42] network maps small-scale features back to the original scale through upsampling and fuses the input features of the previous layer for feature extraction. Therefore, we select the CenterNet network with the Hourglass-104 backbone as the baseline network and make several improvements. First, the input image is cropped to a size of 512 × 512 and downsampled by a factor of 4, and a feature map with a size of 128 × 128 × 256 is obtained by the feature extraction network. Three neck branches are then constructed to generate the predictions: the heatmap branch, the WH branch, and the offset branch of the center point.
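A minimal sketch of the three prediction branches follows; the intermediate 64-channel head width is an assumption, and only the input/output shapes are taken from the text:

```python
import torch
import torch.nn as nn

class CenterNetHeads(nn.Module):
    """The three neck branches described above: each head is a small conv
    stack applied to the 128 x 128 x 256 backbone feature map."""
    def __init__(self, c_in=256, num_classes=13):  # 13 HRRSD categories
        super().__init__()
        def head(c_out):
            return nn.Sequential(
                nn.Conv2d(c_in, 64, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(64, c_out, 1))
        self.heatmap = head(num_classes)  # per-class center heatmaps
        self.wh = head(2)                 # object width and height
        self.offset = head(2)             # sub-pixel center offset

    def forward(self, feat):
        return self.heatmap(feat), self.wh(feat), self.offset(feat)

feat = torch.randn(1, 256, 128, 128)   # 512 / 4 = 128 after downsampling
hm, wh, off = CenterNetHeads()(feat)   # (1,13,128,128), (1,2,...), (1,2,...)
```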

B. Hybrid Feature Extraction Network
The intricacy of the background makes it difficult to distinguish an object from it in optical remote sensing imagery. Most current detection methods use different CNN structures to deal with complex backgrounds, but their detection results are not satisfactory. A CNN uses convolution operations to extract information from a small neighborhood around each input point, so each convolution layer has only a modest receptive field. To improve the modeling capability of the network, a common method is to cascade conv-pool-norm structures and establish a hierarchical feature extraction backbone. Since the background occupies most of a remote sensing image, background features consequently tend to dominate the output of a CNN feature extraction network. In contrast, we utilize the transformer block to complement our backbone, improving feature extraction globally and enhancing the deep characterization capability.
In order to enhance object detection performance in optical remote sensing imagery with a complicated background, this research suggests a technique for merging the CNN and the transformer block to gather complementary information. As depicted in Fig. 2, the hourglass-shaped feature extraction network is built by cascading two distinct varieties of transformer-based modules. The Swin transformer [35] family consists of Swin-T, Swin-S, Swin-B, and Swin-L, which have different model sizes and computational complexities ranging from small to large. Each of the first four stages is made up of a patch partitioning component and a Swin-T module. It creates 2 × 2 nonoverlapping patches from the supplied RGB image, just like the ViT. The feature of each patch, which is interpreted as a "token," is the raw pixel values. A linear embedding function is employed to project its channel dimension to the hidden dimension specified in Swin-T. Finally, the tokens are transmitted to the transformer encoder component, and the shape of the input feature is not changed by the encoder. Given an input feature of size H × W, the first four stages produce H/2 × W/2, H/4 × W/4, H/8 × W/8, and H/16 × W/16 tokens, respectively, resulting in a hierarchical representation. Each of the last four stages is composed of an upsampling module and a Swin-T module. The upsampling layer uses a 2× upsampling module to increase the number of tokens and a 1 × 1 convolution kernel to project the channel dimension, mirroring the first four stages in reverse order.
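One of the last four stages can be sketched as follows; this is a sketch under the assumption that `swin_block` is a callable Swin-T module, and the channel widths are illustrative:

```python
import torch
import torch.nn as nn

class UpStage(nn.Module):
    """One decoder stage: 2x upsampling increases the number of tokens,
    a 1x1 convolution projects the channel dimension, then a Swin-T
    block (passed in as `swin_block`, an assumed callable) is applied."""
    def __init__(self, c_in, c_out, swin_block):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear",
                              align_corners=False)
        self.proj = nn.Conv2d(c_in, c_out, kernel_size=1)
        self.swin = swin_block

    def forward(self, x):              # x: (N, c_in, H, W)
        x = self.proj(self.up(x))      # -> (N, c_out, 2H, 2W)
        return self.swin(x)

stage = UpStage(512, 256, nn.Identity())          # identity stands in for Swin-T
print(stage(torch.randn(1, 512, 16, 16)).shape)   # torch.Size([1, 256, 32, 32])
```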

C. Feature Alignment Module
As stated in Section III-B, the hourglass-like backbone is designed using recursive step-by-step downsampling and upsampling operators. The original hourglass-like backbone uses direct pixelwise addition to aggregate the upsampled feature with the same-resolution feature from a lower level. This approach causes significant misalignment and decreases the detection precision. To solve this problem, we suggest an attention-based feature alignment module.
As depicted in Fig. 3, features from the upsampled layer P_i and the lower level layer C_i are aggregated. In our backbone design, these features have the same number of channels. P_i is upsampled from P_{i-1} by standard regular grid sampling-based bilinear interpolation. Before feature fusion, P_i must be aligned with C_i. In our module, the upsampled P_i and C_i are concatenated and passed to the attention module to automatically predict the offset Δ_i, which is more suitable for complex scenes than an offset calculated by simple mathematical operations. The attention mechanism is composed of two submodules (see Fig. 4): an efficient channel attention module [44] and a spatial attention module [43]. The channel attention module filters all feature channels to optimize the features, increasing the weight coefficients of effective channels and reducing those of invalid channels. The spatial attention feature map is generated from the spatial correlations between features; the spatial attention mechanism is a complement to the channel attention mechanism. Mathematically, the aforementioned steps can be represented as

Δ_i = f_att([P_i, C_i])  (2)

where [·, ·] denotes channel-wise concatenation and f_att denotes the cascaded channel and spatial attention. After calculating the offset, we use it to guide the kernel sampling positions of the deformable convolution network (DCN). The output feature at any position p_0 can be obtained by

y(p_0) = Σ_{p_n ∈ R} w(p_n) · x(p_0 + p_n + Δp_n)  (3)

where w(p_n) and p_n are the predefined weight and offset for the p_0 sampling location, respectively, and Δp_n is the additional offset computed by incorporating features from the high-level and low-level layers. When we apply the deformable convolution before the original feature fusion, the kernel adjusts its sampling locations according to the offsets obtained from (2) and aligns P_i according to the spatial difference between P_i and C_i. The additional operator does not affect the performance of the detection network.
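A minimal sketch of this alignment step using torchvision's deformable convolution is shown below. The attention stack of [43] and [44] is stubbed with a plain convolution for brevity, so the offset predictor here is an assumption rather than the exact module:

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class FeatureAlign(nn.Module):
    """Sketch of (2)-(3): offsets are predicted from the concatenated
    features, then a deformable convolution resamples the upsampled
    high-level feature before pixelwise fusion."""
    def __init__(self, channels, k=3):
        super().__init__()
        # Stand-in for the channel + spatial attention stack of (2);
        # 2*k*k offset channels, one (dy, dx) pair per kernel sample.
        self.offset_pred = nn.Conv2d(2 * channels, 2 * k * k, 3, padding=1)
        self.align = DeformConv2d(channels, channels, k, padding=k // 2)

    def forward(self, p_up, c_i):      # p_up: upsampled P_i, c_i: C_i
        offset = self.offset_pred(torch.cat([p_up, c_i], dim=1))  # Δ_i
        return self.align(p_up, offset) + c_i  # align P_i, then fuse

x = torch.randn(1, 256, 64, 64)
print(FeatureAlign(256)(x, x).shape)   # torch.Size([1, 256, 64, 64])
```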

IV. EXPERIMENTS
This section describes the implementation details, including the datasets, evaluation metrics, and parameter settings. The outcomes of the network and the comparison experiments are then presented and discussed.

A. Datasets and Evaluation Metrics
Two publicly available datasets are used to evaluate the suggested approach.

1) High Resolution Remote Sensing Detection (HRRSD):
The Optical Image Analysis and Learning Center at the Xi'an Institute of Optics and Fine Mechanics published the HRRSD dataset with the goal of studying, analyzing, and performing target detection on high-resolution remote sensing imagery. The dataset includes 13 different types of optical remote sensing targets and 21 761 color remote sensing images with a resolution from 0.15 to 0.2 m. Table I contains a list of the dataset's specifics.
2) Northwestern Polytechnical University (NWPU) VHR-10: The NWPU VHR-10 dataset was designed to evaluate remote sensing target detection. The collection includes 800 color optical remote sensing photos, of which 150 are background images and 650 contain targets. Among these, 750 have a spatial resolution from 0.5 to 2 m and were retrieved from Google Maps. The dataset includes ten different kinds of targets. The target categories are listed in Table II.
3) Evaluation Metric: We use the intersection over union (IoU) between the target detection result and the ground truth as the criterion for target detection. An IoU value greater than 0.5 represents a successful detection in the optical object detection approach. The value of the overlap rate indicates the accuracy of the target localization. The IoU, the ratio of the intersection to the union of the detected bounding box B_p and the ground truth bounding box B_gt, is defined as

IoU = area(B_p ∩ B_gt) / area(B_p ∪ B_gt).  (4)

The detection results are examined using two measures: 1) the average precision (AP) of each category and 2) the mAP of all categories. For a specific type of target, the AP is used to measure the performance of target detection. The recall rate is plotted on the abscissa and the precision rate on the ordinate to draw the precision-recall curve. The AP value represents the region enclosed by the curve:

AP = ∫_0^1 P(r) dr.  (5)
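A direct implementation of (4) for axis-aligned boxes is simple; the (x1, y1, x2, y2) coordinate convention below is an assumption:

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2),
    i.e., the ratio in (4) written out directly."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 0.1428..., below the 0.5 cutoff
```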
The recall rate r measures how many targets in a category were correctly detected out of all targets actually present in that category. This score, defined in (6), indicates how well a model can find all relevant instances:

r = TP / (TP + FN)  (6)

where TP and FN denote true positives and false negatives, respectively. The precision rate p is the proportion of correctly detected targets to all detected targets. This value reflects a model's capacity to extract only the relevant objects and is defined as

p = TP / (TP + FP)  (7)

where FP denotes false positives. It is more thorough and convincing to use the AP to assess target detection performance than to use either the precision rate or the recall rate alone.
For multicategory targets, the mAP is applied to evaluate the general detection capability. The mAP is calculated by averaging the APs of all categories:

mAP = (1/N) Σ_{i=1}^{N} AP_i  (8)

in which N represents the total number of categories. The mAP is defined in the same way as in the PASCAL VOC 2012 object detection challenge.
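The following sketch computes (5) and (8) from sorted precision-recall points using the all-points interpolation of PASCAL VOC; the function names are illustrative:

```python
import numpy as np

def average_precision(recall, precision):
    """AP as the area under the precision-recall curve, (5), with the
    VOC-style monotone precision envelope (recall sorted ascending)."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    for i in range(len(p) - 2, -1, -1):   # make precision non-increasing
        p[i] = max(p[i], p[i + 1])
    idx = np.where(r[1:] != r[:-1])[0]    # points where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def mean_average_precision(aps):
    """mAP as in (8): the mean of per-category APs."""
    return sum(aps) / len(aps)
```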

B. Experimental Setup
We trained and tested the MashFormer network using the PyTorch deep learning framework. A random Gaussian distribution was applied to initialize the network's parameters. The network was trained on two NVIDIA RTX 2080 Ti GPUs for 140 epochs, which required two days for the HRRSD dataset; training on the NWPU VHR-10 dataset took 3 h. The training batch size was 8, the learning rate was set to 0.000125 initially, and the maximum training epoch was 140. The model parameters were saved every five epochs. We adopted the CenterNet [17] detection network as the baseline network and then introduced the multiarchitecture hybrid backbone and the feature alignment module. We used the same loss function as the standard CenterNet, which contains three parts: L_k, the center point loss of the heatmaps; L_off, the offset loss of the target center point; and L_size, the loss of the target length and width:

L_det = L_k + λ_size L_size + λ_off L_off  (9)

where λ_size and λ_off are the weighting coefficients of the standard CenterNet.
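A simplified sketch of this combined objective is given below. It is an assumption-laden illustration: the weights λ_size = 0.1 and λ_off = 1 are taken from the original CenterNet paper, and the L1 terms are averaged over all locations rather than only at object centers as CenterNet does:

```python
import torch

def centernet_loss(pred, target, lambda_size=0.1, lambda_off=1.0):
    """Combined loss (9). `pred`/`target` are dicts with 'hm', 'wh',
    'off' tensors; L_k is the penalty-reduced focal loss on heatmaps."""
    hm_p = pred["hm"].clamp(1e-4, 1 - 1e-4)
    hm_t = target["hm"]
    pos = hm_t.eq(1).float()                 # ground-truth center peaks
    l_k = -(pos * (1 - hm_p) ** 2 * hm_p.log()
            + (1 - pos) * (1 - hm_t) ** 4 * hm_p ** 2 * (1 - hm_p).log())
    l_k = l_k.sum() / pos.sum().clamp(min=1)
    l_size = torch.abs(pred["wh"] - target["wh"]).mean()    # L1, simplified
    l_off = torch.abs(pred["off"] - target["off"]).mean()   # L1, simplified
    return l_k + lambda_size * l_size + lambda_off * l_off
```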

C. Experimental Results
Fig. 5 demonstrates the outcomes of multicategory object detection on the HRRSD test dataset using the proposed MashFormer. The detection results show the location of the target, the category name, and the confidence score. The visualization results demonstrate that the suggested method can be used in detection scenarios with a single target and when it is challenging to tell the target from the backdrop. Most of the objects in Fig. 5 were successfully detected, resulting in high detection confidence. The effectiveness of the suggested methodology is demonstrated by its capability to identify small items against a complicated background, such as basketball courts and vehicles. The method has strong adaptability and high target detection accuracy, and it overcomes the key difficulties of target detection in optical remote sensing imagery. We compare the performance of MashFormer and the original CenterNet method on the HRRSD dataset to validate the usefulness of the proposed hybrid backbone and feature alignment module. Table III summarizes the detection accuracy of these two approaches for each object category on the HRRSD dataset. The presented method has the greatest accuracy in almost all categories on the test dataset, with an average accuracy of 88.61%. The mAP is 1.90% points greater than that achieved by the original CenterNet, demonstrating the effectiveness of the suggested approach using the multiarchitecture hybrid backbone and feature alignment module. For large targets, such as baseball diamond, bridge, ground track field, harbor, and ship, the proposed MashFormer's detection performance noticeably surpasses that of the CenterNet method (by 6% points). It also distinguishes the targets from the background, especially the baseball diamond. The detection accuracy for vehicles and basketball courts is also significantly improved, demonstrating the efficiency of the method and its ability to detect small-scale targets with detailed features, such as lines.
To give a clearer evaluation of the performance of the proposed method on targets of different sizes, we follow the definition in COCO [45] and calculate the detection accuracy for three types of targets: large targets (area > 96²), medium targets (32² < area < 96²), and small targets (area < 32²); the results are shown in Table IV. It is noteworthy that our proposed method achieves a notable 5.9% mAP improvement in the small target category, which further demonstrates the effectiveness of our proposed modules for small-scale object detection.
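The size buckets reduce to a one-line rule (a trivial sketch of the COCO convention used in Table IV):

```python
def coco_size_bucket(area):
    """Assign a COCO [45] size category to a box area in pixels^2."""
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"

print(coco_size_bucket(30 * 30), coco_size_bucket(100 * 100))  # small large
```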
Besides, we also conducted a comparative experiment on inference speed between the proposed MashFormer and the baseline network. On the HRRSD dataset, MashFormer is further compared with faster R-CNN [8], mask R-CNN [9], FCOS [15], MSHEMN [46], SGFTHR [47], and GLFPN [48]. Table V demonstrates that even though MashFormer does less well than some of the other approaches in some categories, such as basketball court, ground track field, and tennis court, the detection results for airplane, bridge, baseball diamond, crossroad, ship, T junction, and others are excellent. In addition, MashFormer achieves the highest final mAP, demonstrating the capability of the suggested approach.
On the NWPU VHR-10 dataset, MashFormer is contrasted with six other methodologies, including faster R-CNN, mask R-CNN, FCOS, MSHEMN, SGFTHR, and EGAT-LSTM [49]. Table VI is a list of the test results. The suggested approach has the highest average accuracy of 93.80% and performs well in nearly all dataset categories. These outcomes demonstrate our method's superiority for target detection in optical remote sensing imagery.
In conclusion, considering that the transformer block has the ability to model the relationships between long-range features, we employ it to complement the original residual module based backbone; our brand-new hybrid architecture backbone provides superior results for detecting objects in remote sensing scenarios where the background is difficult to distinguish. The feature alignment module alleviates the semantic gap before feature fusion and enables the backbone to extract features more effectively without losing small target features. Therefore, it has advantages in remote sensing scenarios where targets have multiscale characteristics and small sizes.

D. Ablation Study
To assess the effectiveness of the suggested components, an ablation experiment using the same parameter settings is conducted on the HRRSD dataset. The ablation experiment proceeds from the baseline and sequentially incorporates each component, including the hybrid backbone, the attention module, and the feature alignment module. The numerical outcomes of the ablation experiment on the HRRSD dataset are listed in Tables VII and VIII. Using the multiarchitecture hybrid backbone results in an average improvement of 0.97%. Targets in scenarios with complex backgrounds, such as baseball diamond, basketball court, tennis court, and vehicle, are usually close to vegetation and buildings, which interfere with feature extraction by neural networks. The improvement brought by the hybrid backbone therefore shows that our model can detect targets in complex backgrounds.
The incorporation of the feature alignment module into the backbone improves the mAP by 0.79%. The improvements in the mAP for multiscale targets, such as baseball diamond, ship, and bridge, are 3.16%, 1.37%, and 1.75%, respectively. Small-scale targets also achieve improvements (2.77% for basketball court). Therefore, aligning features before fusion improves the detector's multiscale feature representation ability.
The attention module integrated into the feature alignment module is trained in a supervised manner and can learn the desired offset from the concatenated features in a more general and robust way. As a result, the introduced module improves the mAP by 0.38% compared with a feature alignment module without attention, which simply predicts the offset from the concatenated features. The incorporation of all modules improves the performance by 2% points. The results of these studies show that the suggested strategy achieves high detection accuracy and excellent effectiveness.

V. CONCLUSION
In order to detect targets of interest in optical remote sensing images, this research presented a brand-new detector called MashFormer. We selected the anchor-free detection network CenterNet as the baseline network and made several improvements. The transformer block was used in conjunction with CNN layers as a hybrid feature extraction backbone. This strategy can learn the correlation between long-range features and extract critical information from images with a complex background. A powerful feature alignment module was embedded in the backbone before the feature fusion module to alleviate the semantic discrepancy between features from shallow and deep layers. To assess the effectiveness of the suggested methodology on the HRRSD and NWPU VHR-10 datasets, we conducted comparative experiments using several detectors (such as mask R-CNN and FCOS) and an extensive ablation study. The outcomes proved that the suggested approach for target detection was superior.