Building Detection From Panchromatic and Multispectral Images With Dual-Stream Asymmetric Fusion Networks

Building detection from panchromatic (PAN) and multispectral (MS) images is an essential task for many practical applications. In this article, a dual-stream asymmetric fusion network is proposed, named DAFNet. DAFNet can achieve effective information fusion at the feature level. It obtains better building detection performance from the following three perspectives: a two-stream network structure is designed to guarantee the ability to extract information from PAN and MS images; an asymmetric feature fusion module is proposed to fuse features efficiently and concisely; and two consistency regularization losses, i.e., PAN information preservation loss and cross-modal semantic consistency loss, are applied to further explore the consistency between features for better fusion. The experiments are conducted on a challenging building detection dataset collected from GaoFen-2 satellite images. Comprehensive evaluations on 12 popular detection methods demonstrate the superiority of our DAFNet compared with the existing state-of-the-art fusion methods. We reveal that feature-level fusion is more suitable for building detection from PAN-MS images.


I. INTRODUCTION
BUILDING detection from remote sensing (RS) images is an important research topic since it provides basic information for a wide range of applications such as urban planning [1], earthquake disaster reduction [2], and mapping [3]. Thanks to powerful deep neural networks, research in this field has been advancing rapidly in recent years. Many of these methods draw inspiration from generic object detection approaches that aim to detect everyday objects in natural scenes. However, real-world RS image processing and analysis systems usually accept inputs in two modalities: panchromatic (PAN) and multispectral (MS), because many Earth observation satellites cannot provide images with both high spatial and spectral resolutions. Alternatively, these satellites acquire images in two modalities: PAN images that are rich in spatial information with fine details and textures, and MS images with rich spectral information complementary to the PAN images. To leverage the advantages of both modalities and facilitate subsequent processing steps, researchers have developed pan-sharpening techniques that fuse PAN and MS images to generate high-resolution MS images. The pan-sharpened images have been proven to be able to achieve a fairly good recognition performance [4] because they can preserve the spatial detail and spectral information of the image content [5]. Pan-sharpening-then-understanding has become a standard pipeline for many RS image interpretation systems.
However, our experiments show that constructing RS image interpretation models (e.g., building detection models) based on pan-sharpened images might be suboptimal. Pan-sharpening methods have a significant impact on building detection performance. We test various image fusion methods, including traditional [6], [7], [8], [9] and deep learning ones [10], [11], [12], [13]. Some of them degrade detection performance significantly compared with using only PAN images. There are two reasons for this: 1) these pan-sharpening methods are not optimized for downstream tasks, such as building detection, even though they can produce good pan-sharpened images; 2) PAN and MS are considered to contribute equally to fusion; however, our experiments show that PAN is more important than MS for object detection. Compared with pan-sharpening methods, feature-level fusion [14], [15], [16], [17], [18], [19] combines the fusion process with the downstream tasks to alleviate this suboptimality. Nonetheless, there are still several concerns that must be taken into account when applying feature fusion to building detection. 1) As an effective multiscale feature fusion method, the feature pyramid network (FPN) [20] is widely used in object detection. The strategy of combining multimodal fusion with the FPN affects the detection performance, and an effective combination strategy remains to be studied. 2) Most current fusion modules fuse features using a symmetrical structure. However, PAN and MS are not equally helpful for building detection. More research is needed on asymmetric fusion structures that highlight the information of PAN. 3) Attention mechanisms [14], [15], [16], [17], [18], [19] are widely employed in existing fusion methods; they push the model to focus on essential parts and obtain complementary information from multimodal data [14]. However, they ignore the heterogeneous gap [21] existing in multimodal features, i.e., features from different modalities are located in unequal subspaces, which weakens the benefits of multimodal fusion.
We propose the dual-stream asymmetric fusion (DAF) network to deal with these issues. The commonly used two-stream architecture [22], [23] in multimodal fusion is adopted to extract MS and PAN features. To better adapt to the FPN, we propose a dual fusion FPN, which first performs scale fusion and then modality fusion. An asymmetric feature fusion (AFF) module and a PAN information preservation (PiP) loss are designed to avoid losing PAN information. Motivated by DCCA [24], a cross-modal semantic consistency (CSC) loss is introduced to alleviate the heterogeneous gap so that the fused feature does not contain noise and is more robust.
In summary, our contributions are as follows. 1) We reveal that models aiming to detect buildings from RS images should be well-designed. Performing detection from fused images may not be a good solution. Detection from joint inputs of PAN and MS images has great potential to be investigated. 2) A dual-stream asymmetric fusion network, termed DAF, is proposed for building detection. DAF takes advantage of the original information of PAN and MS images and fuses them using an AFF module. PAN information preservation (PiP) loss and cross-modal semantic consistency (CSC) loss are proposed to augment building detection further. 3) Experiments demonstrate that the proposed losses and AFF module are highly adaptable: they are applicable to various detectors and can boost the detectors' performance on building detection without bells and whistles.

A. Generic Object Detection
Object detection is a fundamental task in computer vision. It has achieved great success thanks to powerful deep neural networks. Most of the existing detectors can be grouped into two families, namely two-stage detectors [25], [26], [27], [28], and single-stage detectors [29], [30], [31], [32], [33]. Two-stage detectors resolve the detection with a two-stage pipeline, in which the first stage generates a set of candidate proposals, and the second stage performs category classification and bounding box regression simultaneously. The second stage can be considered a refinement process. Thus, two-stage detectors generally show high detection performance; however, they always suffer from low inference speed. Single-stage methods discard the proposal generation stage and directly conduct detection from features. These methods are more computationally efficient than two-stage detectors but have lower accuracy.
In order to achieve a better performance, single-stage methods usually place a large number of preset dense anchors over images, and then, predict the final detection boxes by scoring the anchors and estimating relative offsets to them. Anchors play a similar role to proposals, thus enabling detection performance promotion. These methods [25], [29] are widely known as anchor-based detectors. However, to guarantee better performance, there might be more than 10K anchors required, which significantly decreases the training and inference speed, and more importantly, results in extremely unbalanced positive and negative samples during training. Anchor-free models (e.g., FSAF [31], FCOS [30]) are proposed to solve these problems. They use the center points or center areas as positive sample areas and directly predict detection boxes and categories in these areas.
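The scale of the anchor problem can be illustrated with a short calculation. The sketch below counts the preset anchors for one image under assumed settings that are not taken from this article: RetinaNet-style pyramid strides of 8 to 128 and nine anchors per location, applied to an 800 × 800 input.

```python
import math

def anchors_per_image(height, width, strides=(8, 16, 32, 64, 128),
                      anchors_per_location=9):
    """Count dense anchors over all pyramid levels (strides and 9 anchors are assumptions)."""
    total = 0
    for stride in strides:
        rows = math.ceil(height / stride)   # feature-map height at this level
        cols = math.ceil(width / stride)    # feature-map width at this level
        total += rows * cols * anchors_per_location
    return total

total_anchors = anchors_per_image(800, 800)
print(total_anchors)
```

Under these assumptions the count is well above 10K, and only a small fraction of the anchors can overlap a building, which is the source of the positive/negative imbalance mentioned above.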

B. Building Detection
Benefiting from generic object detectors, detecting ground objects in RS images has been advancing rapidly in recent years. Many methods have emerged and achieved astonishing performance. Among all concerning objects in aerial images, buildings are the most important and challenging ones. Great efforts have been devoted to solving the building detection problem.
Vakalopoulou et al. [34] propose an automatic building detection framework based on deep features and SVM classifiers. Zhang et al. [35] design a coarse-to-fine detection framework, which uses saliency maps to locate built-up regions, followed by an R-CNN [36] like pipeline to detect buildings. Li et al. [37] develop a cascaded network, where they incorporate the Hough transform to highlight the boundaries of buildings. Li et al. [38] design a multibranch network to capture contextual and structural features for better identification of buildings.
Many methods solve building extraction with instance segmentation approaches. Alshehhi et al. [39] address road and building extraction with a single-branch CNN. Hamaguchi et al. [40] address the multiscale problem with a multitask framework. The framework consists of multiple U-Net models. Each model is devoted to a specific size of building. Yang et al. [41] design a dense-attention network for building extraction. The attention mechanism can strengthen features, thereby enabling better performance. Griffiths et al. [42] argue that label quality is critical for model training. They propose to improve building footprint masks using morphological geodesic active contours. Han et al. [43] combine the advantages of traditional image processing methods and deep models. They use traditional methods to enhance the dataset, and then, employ a Mask R-CNN for building detection. Sirko et al. [44] study continental-scale building detection. They also utilize U-Net to segment buildings.
In addition to the perspective of segmentation, some works seek better representations for buildings, e.g., polygon and vector fields. Castrejon et al. [45] cast instance-level building segmentation as a contour polygon prediction task, inspiring more subsequent building detection works. Li et al. [46] circumvent the conventional pixel-wise segmentation of aerial images and directly predict buildings and roads in a vector representation. They developed a method named PolyMapper to achieve this goal. Wei et al. [47] propose a two-step method. They first introduce an improved fully convolutional network to obtain masks of building footprints, and then, use a polygon regularization algorithm to transfer the masks into polygons. Li et al. [48] propose a hybrid model for building polygon extraction, in which they employ several networks to obtain bounding boxes, segmentation masks, and corners of buildings, and then, use Delaunay triangulation to construct building polygons. Zhu et al. [49] present an adaptive polygon generation algorithm (APGA), which first generates sequences of building vertexes and then arranges them to form polygons.
In real applications, images may come from different platforms and thus have different resolutions. To bridge the resolution gap, Guo et al. [50] adopt a super-resolution method to bring them to the same resolution and perform building segmentation. Chen et al. [51] extract buildings from PAN and MS imagery to fully explore the spatial-spectral information. They propose to use a multiscale spatial-spectral contextual information mining CNN for this goal.

C. Multimodal Fusion
Different modalities captured from the same platform usually carry distinct yet complementary information. Combining them, such as RGB-Depth [17], RGB-Thermal [17], [52], Audio-Visual [53], RGB-LiDAR [54], and RGB-Radar [53], is believed to enable considerable and consistent perception improvement compared with a single modality. Bin et al. [55], [56] use an adaptive multimodal mechanism to deal with the real-world inverse synthetic aperture radar (ISAR) object recognition problem at the feature and decision levels. Bin et al. [57] propose deep geometric learning to strengthen the capability of the CNN in multimodal scenarios.
Multimodality fusion can be divided into three categories: early fusion, mid fusion, and late fusion. These fusion strategies operate at the pixel level, feature level, and decision level, respectively. Early fusion is widely used in the RS field, such as pan-sharpening [10], [11]. However, pan-sharpening is independent of downstream tasks; therefore, it may not be beneficial for interpretation models. Late fusion makes decisions based on the predictions obtained from each modality. However, poor predictions from one modality are likely to damage the final performance. The mid-fusion strategy has been widely studied. The key to achieving effective multimodal fusion is to filter useless information in each modality and combine the rest. This idea coincides with attention, which is why attention mechanisms are widely used in multimodal fusion [17], [22], [53], [58], [59], [60].
Satellites usually carry two kinds of sensors providing two modalities: PAN and MS. In addition to combining the strengths of the two modalities using pan-sharpening techniques, researchers also explore interpreting RS images using mid-fusion strategies. Li et al. [22] design an attention-based heterogeneous gated fusion network to fuse the optical and SAR features for land cover classification. Kang et al. [15] propose a fully convolutional network using a cross-gate module to fuse features from optical and SAR images.

A. Overview
The overall pipeline of our method is shown in Fig. 1. Two CNNs are used to extract features from the input PAN and MS images. Since the modalities carry distinct information, these two networks do not share weights. Then, FPNs [20] are attached to obtain multiscale features for better detecting buildings at various scales. Finally, an AFF module is proposed to fuse the features of PAN and MS images. Two loss functions are introduced to enforce fusion: PAN information preservation (PiP) loss and cross-modal semantic consistency (CSC) loss. Our fusion strategy occurs in the feature extraction stage and is independent of detection heads, making it applicable to a variety of detection methods.

B. Feature Extraction
Given a pair of PAN and MS images $\{I^p, I^m\}$, where superscript $p$ denotes PAN and $m$ denotes MS, two ResNet50 [61] networks with unshared parameters are used as the backbones to extract their visual features. During the construction of the dataset, MS images are upsampled by bilinear interpolation to the same size as PAN images, which gives the same resolution to the features obtained by the two branches. ResNet50 is composed of one input block $B$ and four stages. ResNet50 accepts three-channel RGB images as inputs, which is inconsistent with PAN and MS images. Modifying the input channel of the input block will destroy the pretrained parameters, whereas selecting only three channels as input will damage the multispectral information [62], [63]. Motivated by the 3-D CNN [64] used in hyperspectral image classification, we develop a sliding strategy to fill this gap. The PAN image is replicated three times and stacked together to form a three-channel input. Then, the new input is fed into the input block of the PAN branch, $B^p$, to obtain the feature $C^p_1$. For an MS image that contains four channels, we slide the input block of ResNet50 along its channel dimension to obtain the features of the input block in the MS branch, where $I^m_{i:i+3}$ represents the $i$th to the $(i+3)$th channels of the MS image. Through the sliding strategy, the network can process MS images while preserving the pretrained weights.
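The sliding strategy can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' implementation: `toy_input_block` is a hypothetical stand-in for the pretrained three-channel input block, the window covers three consecutive channels, and averaging the per-window outputs is an assumed aggregation because the text does not state how they are combined.

```python
import numpy as np

def pan_to_three_channels(pan):
    """Replicate a single-band PAN image (H, W) into a (3, H, W) input."""
    return np.stack([pan, pan, pan], axis=0)

def sliding_channel_windows(ms, width=3):
    """Return every run of `width` consecutive channels of an MS image (C, H, W)."""
    num_channels = ms.shape[0]
    return [ms[i:i + width] for i in range(num_channels - width + 1)]

def ms_input_block(ms, input_block, width=3):
    """Apply the same three-channel block to each window, then average (assumed)."""
    feats = [input_block(window) for window in sliding_channel_windows(ms, width)]
    return np.mean(feats, axis=0)

def toy_input_block(x):
    """Hypothetical stand-in for the pretrained input block: a fixed channel weighting."""
    weights = np.array([0.2, 0.3, 0.5]).reshape(3, 1, 1)
    return (x * weights).sum(axis=0)

pan = np.ones((4, 4))                                     # single-band PAN image
ms = np.arange(4 * 4 * 4, dtype=float).reshape(4, 4, 4)   # four-band MS image
pan_feature = toy_input_block(pan_to_three_channels(pan))
ms_feature = ms_input_block(ms, toy_input_block)
```

Because the same block is reused on every window, its pretrained three-channel weights never need to be reshaped, which is the property the sliding strategy is meant to preserve.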
After the feed-forward propagation, features in a pyramid style are obtained, with channels of {256, 512, 1024, 2048}, respectively; $R_i$ denotes the $i$th layer of the backbone. Then, the features pass through two independent FPNs to obtain multiscale features. The FPN conducts multiscale feature fusion through a top-down pathway with lateral connections [20], which produces the features $\{X_i\}_{i=2}^{5}$, all with 256 channels. Finally, the two pyramidal-style features are fused through AFF modules for detection, where $\mathrm{AFF}_i$ denotes the $i$th AFF module. Considering that PAN image features play the dominant role in the detection, a skip connection is added to ease the gradient update of the PAN branch, as shown in the AFF module in Fig. 1. In its formulation, Conv is a convolutional layer with kernel size 3 × 3.
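One plausible reading of the AFF module is sketched below: the PAN and MS features at one pyramid level are added, passed through a 3 × 3 convolution, and the PAN feature is added back through the skip connection. The exact form is an assumption, since the formula is not reproduced here; the names `conv3x3` and `aff` are illustrative.

```python
import numpy as np

def conv3x3(x, kernel):
    """3x3 convolution with zero padding. x: (C_in, H, W); kernel: (C_out, C_in, 3, 3)."""
    _, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((kernel.shape[0], h, w))
    for dy in range(3):
        for dx in range(3):
            patch = padded[:, dy:dy + h, dx:dx + w]            # shifted view of the input
            out += np.einsum("oc,chw->ohw", kernel[:, :, dy, dx], patch)
    return out

def aff(x_pan, x_ms, kernel):
    """Assumed asymmetric fusion: Conv(X_pan + X_ms) plus a PAN-only skip connection."""
    return conv3x3(x_pan + x_ms, kernel) + x_pan

rng = np.random.default_rng(0)
x_pan = rng.normal(size=(4, 5, 5))
x_ms = rng.normal(size=(4, 5, 5))
identity_kernel = np.zeros((4, 4, 3, 3))
identity_kernel[np.arange(4), np.arange(4), 1, 1] = 1.0       # convolution acts as identity
fused = aff(x_pan, x_ms, identity_kernel)
```

The asymmetry is visible in the result: with an identity kernel the output is `2 * x_pan + x_ms`, and with an all-zero kernel the PAN feature still passes through unchanged, whereas the MS feature has no such direct path.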

C. Consistency Regularization of the AFF Module
For object detection, each level of the FPN is supervised by the regression loss and classification loss [20] to learn features with semantic and spatial information. The semantic information helps the detector to distinguish objects in each region. The spatial information, such as contours and edges, helps to identify object boundaries [65]. By using global average pooling (GAP) [66], the global-level representations of the whole image could be obtained. The maximum value along channels describes the spatial information to some extent [67].
There are two consistencies essential for fusion and detection.
1) The features of the PAN and MS images should hold a semantic consistency since they are captured over the same site. 2) Both semantic and spatial information of PAN images should be preserved after fusion since PAN images play a decisive role in detection.
Two regularization losses are proposed to achieve these consistencies: semantic consistency loss and spatial preservation loss.
Semantic consistency loss ensures that the semantic information of two inputs is as close as possible. To this end, the global-level representations $G_i$ are obtained by applying GAP [66], followed by a 1 × 1 convolutional layer without bias. To avoid a trivial solution, that is, features collapsing to 0, an orthogonal regularization [68] is applied to constrain the parameter of the convolutional layer. The parameter is denoted as $W_c \in \mathbb{R}^{D \times D}$, where $D$ is the dimension of the features. Finally, the L2 distance of features in the latent space is calculated to obtain the semantic consistency loss $L_c$, where $Y_i$ denotes the features to be constrained, and $I$ denotes the identity matrix with ones on the diagonal and zeros elsewhere. The CSC loss is the sum of the semantic consistency losses between PAN and MS features at each stage.
The spatial preservation loss measures spatial information agreement between two inputs. Considering that spatial information lies in the activations of feature maps, a max-pooling operation along the channel axis is performed to obtain the spatial feature map. Finally, the loss is calculated using the L2 distance, where $S_i$ denotes the spatial feature map, $L_s$ is the spatial preservation loss, and $Y_i$ denotes the features to be constrained. The parameters of the spatial preservation loss will not collapse to zero since its optimization difficulty is much lower than that of the semantic consistency loss. The PiP loss is the overall preservation loss between the PAN and fused features.
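The two regularizers described above can be sketched as follows. This is a minimal NumPy illustration under assumptions that the text does not confirm: one projection matrix per pyramid level shared by both inputs, an orthogonal penalty of the form $\|W_c^T W_c - I\|^2$ with unit weight, and a PiP loss formed by summing the semantic and spatial terms between PAN and fused features.

```python
import numpy as np

def global_representation(feat, w_c):
    """GAP over the spatial axes of a (D, H, W) feature, then a bias-free linear map."""
    return w_c @ feat.mean(axis=(1, 2))

def orthogonal_penalty(w_c):
    """Squared Frobenius distance between W_c^T W_c and the identity matrix."""
    return float(np.sum((w_c.T @ w_c - np.eye(w_c.shape[0])) ** 2))

def semantic_consistency_loss(feat_a, feat_b, w_c, ortho_weight=1.0):
    """Squared L2 distance in the latent space plus the orthogonal penalty (weight assumed)."""
    diff = global_representation(feat_a, w_c) - global_representation(feat_b, w_c)
    return float(np.sum(diff ** 2)) + ortho_weight * orthogonal_penalty(w_c)

def spatial_preservation_loss(feat_a, feat_b):
    """Squared L2 distance between channel-wise max maps, averaged over pixels (assumed)."""
    diff = feat_a.max(axis=0) - feat_b.max(axis=0)
    return float(np.mean(diff ** 2))

def csc_loss(pan_feats, ms_feats, w_cs):
    """Sum of semantic consistency losses between PAN and MS features over all levels."""
    return sum(semantic_consistency_loss(p, m, w)
               for p, m, w in zip(pan_feats, ms_feats, w_cs))

def pip_loss(pan_feats, fused_feats, w_cs):
    """Assumed PiP loss: semantic plus spatial terms between PAN and fused features."""
    return sum(semantic_consistency_loss(p, f, w) + spatial_preservation_loss(p, f)
               for p, f, w in zip(pan_feats, fused_feats, w_cs))

rng = np.random.default_rng(0)
pan_feats = [rng.normal(size=(8, 6, 6)), rng.normal(size=(8, 3, 3))]
ms_feats = [f + 1.0 for f in pan_feats]     # MS features offset by one in every channel
w_cs = [np.eye(8), np.eye(8)]               # orthogonal projections, zero penalty
```

The penalty explains why the projection cannot collapse: an all-zero `w_c` would make the distance term vanish, but `orthogonal_penalty(np.zeros((8, 8)))` is 8, so the trivial solution is no longer a minimum.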

D. Detection
To coordinate optimization with the detection task, the CSC loss and PiP loss are optimized jointly during training. Let $L_{det}$ be the detection loss; the total loss of our model combines $L_{det}$ with the CSC loss and the PiP loss. The detection loss $L_{det}$ depends only on the detector and is independent of our method. Our experiments are performed on 12 popular object detectors, including Faster R-CNN (FR-CNN) [25], FoveaNet (FvNet) [69], FSAF [31], GA Faster R-CNN (GFRCNN) [28], Grid R-CNN (GRCNN) [26], RetinaNet (RtnNet) [29], ATSS [32], Cascade R-CNN (CRCNN) [70], Dynamic R-CNN (DRCNN) [71], Reppoints [72], Sparse R-CNN (SRCNN) [73], and the newest RTMDet [33]. The first 11 models use FPN for multiscale feature extraction. These models are trained for 12 epochs, and the learning rate decays by a factor of 10 at epochs 8 and 11. Models except for Sparse R-CNN [73] are optimized with the SGD optimizer with an initial learning rate of 0.01. For Sparse R-CNN [73], the SGD optimizer is replaced with the AdamW optimizer, and the initial learning rate is reduced to 0.000025. RTMDet [33] is an efficient real-time detector equipped with an FPN. AdamW with a weight decay of 0.05 and cosine annealing [74] with a minimum learning rate of 0.0002 are adopted for optimizing RTMDet. Among the five available model sizes of RTMDet, the medium one is chosen for our experiments. A warm-up strategy with a ratio of 0.33 is adopted for the first 500 iterations to stabilize the training process. Gradient clipping with a maximum norm of 35 is also utilized to avoid gradient explosion. The experiments run on a single NVIDIA 2080TI GPU with a batch size of 4. For testing, non-maximum suppression (NMS) with an intersection over union (IoU) threshold of 0.3 is leveraged to remove duplicated bounding boxes. In addition, boxes with scores less than 0.05 are removed to further reduce false detections.
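The step schedule with warm-up described above can be written as a small function. The sketch assumes a linear warm-up that starts at 0.33 times the base rate (the convention used by MMDetection's linear warm-up) and zero-based epoch counting for the decay points; `iters_per_epoch` is a hypothetical value chosen only for the example.

```python
def learning_rate(iteration, iters_per_epoch, base_lr=0.01,
                  warmup_iters=500, warmup_ratio=0.33,
                  decay_epochs=(8, 11), gamma=0.1):
    """Step-decayed learning rate with linear warm-up (warm-up shape is an assumption)."""
    epoch = iteration // iters_per_epoch
    num_decays = sum(epoch >= e for e in decay_epochs)   # how many decay points were passed
    lr = base_lr * gamma ** num_decays
    if iteration < warmup_iters:
        # Ramp linearly from warmup_ratio * lr at iteration 0 to lr at warmup_iters.
        k = (1 - iteration / warmup_iters) * (1 - warmup_ratio)
        lr *= 1 - k
    return lr

ITERS_PER_EPOCH = 1000   # hypothetical; depends on dataset size and batch size
schedule = [learning_rate(e * ITERS_PER_EPOCH, ITERS_PER_EPOCH) for e in range(12)]
```

With these settings the rate starts at 0.0033, reaches 0.01 after 500 iterations, and drops to 0.001 and 0.0001 at the two decay points.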

IV. EXPERIMENTS
In this section, we first introduce the building dataset for evaluation. Then, the impacts of different fusion levels on detection are validated, revealing the disadvantages of image-level fusion and result-level fusion methods in building detection. The alternative multiscale architectures for multimodal fusion are discussed afterward. This is followed by the ablation study of our proposed CSC loss and PiP loss. Finally, the proposed DAFNet is compared with other feature fusion strategies.

A. Dataset
Experiments are conducted on the 5M-Building dataset [75], which comprises images captured by the GaoFen-2 satellite over Shandong province of China. This dataset contains 109 PAN images and their corresponding MS images. The spatial resolution is about 3. Statistics of the training set w.r.t. building size, aspect ratio, and instances in each sample are shown in Fig. 3. The dataset contains many small objects; 37 299 buildings are smaller than 32 × 32 pixels. It can also be seen that buildings vary significantly in aspect ratio; 11 341 buildings have an aspect ratio greater than 4.

B. Experiment Details
Our model is implemented with MMDetection [76]. All models use ResNet50 as the backbone network. The backbone is initialized with ImageNet [77] pretrained weights. The first layer of the backbone is frozen to match the default configuration in MMDetection [76]. During training, the MS images and PAN images are resized to 800 × 800 through bilinear interpolation, and then random horizontal flips with a probability of 0.5 are applied for data augmentation. The images are normalized with the mean and variance obtained from the ImageNet images, since the pretrained parameters are derived from the ImageNet classification task. Image preprocessing in the test phase is consistent with training, except that no data augmentation is used. The performance is measured by COCO [78] metrics, including mean average precision (mAP) and AP50.

C. Impacts of Fusion Levels
The impacts of different fusion levels on detection are validated in this section, including image-level fusion, feature-level fusion, and decision-level fusion.
Eight pan-sharpening methods are selected for image-level fusion, including Brovey [6], fast intensity-hue-saturation (FIHS)-based [7], principal component analysis (PCA)-based [8], à trous wavelet transform (ATWT)-based [9], PGMAN [11], PNN [79], PanNet [13], and PSGAN [12]. These methods are either widely used in practical applications or new deep fusion approaches. In our experiments, the traditional methods are directly applied to obtain fused images without training. The CNN-based methods are trained with openly accessible GaoFen-2 images and then used for image fusion. Two nonreference metrics, $D_\lambda$ [80] and $D_S$ [80], are used to evaluate the performance of the pan-sharpening methods. Faster R-CNN [25] and RetinaNet [29] are used for evaluation. The results are shown in Table I. It can be seen that PAN images are far better than MS images for building detection, indicating that spatial information is essential for the detection task. Simply concatenating PAN and MS images together is not a good solution: the results degrade compared with those obtained using PAN images alone. A possible reason is that MS channels dominate the input, which makes it hard for the network to learn the textural and structural features that reside mostly in PAN. The decision-level fusion based on the detections of PAN and MS images is also investigated. We use PAN and MS as training data and train two independent detectors based on Faster R-CNN. Then, the results of the two detectors are merged using the non-maximum suppression (NMS) algorithm. The results are shown in the NMS row of Table I; they are worse than PAN but slightly better than [PAN, MS]. Since the performance gap between PAN and MS images is huge, MS drags down the performance of NMS. Also, pan-sharpening has a significant impact on building detection. As shown in Table I, although CNN-based methods achieve much better fusion results in terms of $D_\lambda$ and $D_S$, their detection performance shows the opposite trend.
The ATWT-based method is the worst among the eight methods. The Brovey-based method produces the best results, although PAN is slightly better for the RetinaNet detector. Detections from the remaining pan-sharpening methods are not as good as those from PAN images, indicating substantial information loss during pan-sharpening.
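The decision-level baseline, in which detections from the PAN and MS detectors are pooled and merged by NMS, can be sketched in plain Python. The IoU threshold of 0.3 and score threshold of 0.05 are the test-time values reported in this article; applying the same values to the merging step is an assumption, and the example boxes are made up for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(detections, iou_thr=0.3, score_thr=0.05):
    """Greedy NMS over detections given as (x1, y1, x2, y2, score)."""
    candidates = sorted((d for d in detections if d[4] >= score_thr),
                        key=lambda d: d[4], reverse=True)
    kept = []
    for det in candidates:
        # Keep a box only if it does not overlap a higher-scoring kept box too much.
        if all(iou(det[:4], k[:4]) <= iou_thr for k in kept):
            kept.append(det)
    return kept

def decision_level_fusion(pan_dets, ms_dets, iou_thr=0.3, score_thr=0.05):
    """Pool the detections of both detectors and merge them with NMS."""
    return nms(list(pan_dets) + list(ms_dets), iou_thr, score_thr)

pan_dets = [(0, 0, 10, 10, 0.9), (20, 20, 30, 30, 0.04)]
ms_dets = [(1, 1, 11, 11, 0.6), (50, 50, 60, 60, 0.5)]
fused_dets = decision_level_fusion(pan_dets, ms_dets)
```

In the toy example the MS box that duplicates a PAN box is suppressed, but the isolated MS box survives regardless of whether it is correct. This mirrors the observation above that a weak modality can drag down decision-level fusion.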

D. Multiscale Multimodal Fusion
An FPN [20] is widely used in object detection to address the multiscale problem. In this work, the strategies of combining multiscale features for multimodal fusion are discussed.
A straightforward approach is to perform multiscale and multimodal feature fusion simultaneously, termed SiMM, as shown in Fig. 4(b). At each scale, except for the lowest one, the fusion module accepts the multimodal features at the same scale and the fused features of the lower scale. In the fusion process, $C^p_i$ and $C^m_i$ denote features from PAN and MS after the $i$th stage, respectively; $L_i$ denotes the $i$th lateral connection; $X^f_i$ denotes the fused feature; and $\mathrm{Fusion}_i$ denotes the fusion module. During fusion, the last obtained feature is upsampled by a factor of two.
The second architecture is dual FPN fusion (DuFF), which is also implemented in our DAF network. DuFF first builds feature pyramids for different modalities, and then, performs fusion in each scale, as shown in Fig. 4(a).
The experiments evaluate two fusion strategies, i.e., element-wise addition (ADD) and concatenation (CAT). ADD applies element-wise addition to combine features and uses a 3 × 3 convolutional layer to obtain fused features. CAT concatenates features and then compresses the dimension through a 3 × 3 convolutional layer. Both operations are tested in SiMM and DuFF. The results are shown in Table II, with the "Source" column indicating the data source utilized for training. DuFF delivers better performance than SiMM, and the ADD operation is better than CAT. The reason is that DuFF is a progressive fusion method that first performs multiscale feature fusion and then completes the multimodal feature fusion. SiMM, in contrast, accomplishes multiscale and multimodal fusion simultaneously, so the network can be confused about what is important and what should be preserved during fusion. On the other hand, the large semantic gaps between different modalities and different scales ... Table II. It is found that the performance of the dual-stream network using PAN+PAN as the data source is lower than that of the single network using only PAN. This phenomenon indicates that the performance improvement brought by DuFF is due to the spectral information from MS images rather than the extra computation of the multibranch structure. Thus, DuFF with ADD is chosen as our multiscale architecture for fusion.
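A quick calculation shows one structural difference between the two operations: the 3 × 3 convolution after CAT sees twice as many input channels as the one after ADD. The sketch assumes 256-channel FPN features, 256 output channels, and a bias term in the convolution; it counts parameters per fusion layer only and says nothing about accuracy.

```python
def conv_params(in_channels, out_channels, kernel_size=3, bias=True):
    """Number of learnable parameters in a standard 2-D convolution layer."""
    weights = in_channels * out_channels * kernel_size * kernel_size
    return weights + (out_channels if bias else 0)

CHANNELS = 256                                     # FPN feature width
add_params = conv_params(CHANNELS, CHANNELS)       # ADD: conv applied to X_pan + X_ms
cat_params = conv_params(2 * CHANNELS, CHANNELS)   # CAT: conv applied to [X_pan, X_ms]
print(add_params, cat_params)
```

Under these assumptions ADD needs 590 080 parameters per fusion layer and CAT needs 1 179 904, so ADD is the lighter option in addition to being the better performer in Table II.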

E. Consistency Loss
In addition to our CSC loss, there are two options for imposing consistencies between two features. The first is simply minimizing the L2 distance between features [81], and the second is maximizing the cosine similarity between the two modalities [82]. All losses are computed in a latent space where features are projected with a linear mapping. The results are shown in Table III. The baseline simply adds the two features element-wise without imposing any constraint. It can be seen that the L2 loss decreases the detection performance. In particular, FSAF [31] and RetinaNet [29] do not converge. Cosine loss and ours improve the performance, while ours obtains the best results. To further investigate why this happens, the semantic agreement and diversity of features before and after the linear mapping are computed. The cosine metric is used to measure semantic agreement, and the L2 distance is used to measure the diversity of features. If two features have strong semantic consistency, their cosine similarity should be high. Features should also be diverse so that the model can learn good decision boundaries. The results are shown in Fig. 5.
As can be observed, all losses improve semantic consistency. Cosine loss obtains the best results since it imposes the cosine similarity directly. However, it boosts the diversity of features, which may increase model instability, as shown in the bar graph. L2 loss significantly decreases the diversity, which hampers the detection. The proposed loss improves the semantic consistency while still maintaining an appropriate diversity of the features.
Fig. 7. (a) Channel-wise weighted feature fusion (CWF) [14]. (b) Cross gates (CRGs) [15]. (c) Cross reference module (CRM) [16]. (d) Gated information fusion (GIF) [17]. (e) Adaptive feature fusion module (AFFM) [18]. C: Concatenate operation. R: ReLU. ⊗: Tensor product. ⊕: Element-wise addition. σ: Sigmoid activation. "GAP": Global average pooling.
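The two measurements used in the consistency analysis, cosine similarity for semantic agreement and L2 distance for feature diversity, can be sketched as below. How diversity is aggregated is not specified in the text; the mean pairwise distance over a set of feature vectors is an assumption made for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def mean_pairwise_l2(features):
    """Average L2 distance between all distinct pairs of rows in an (N, D) array."""
    diffs = features[:, None, :] - features[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)        # (N, N) matrix with a zero diagonal
    n = len(features)
    return float(dists.sum() / (n * (n - 1)))

pan_vec = np.array([1.0, 0.0])
ms_vec = np.array([2.0, 0.0])                     # same direction, different magnitude
agreement = cosine_similarity(pan_vec, ms_vec)
diversity = mean_pairwise_l2(np.array([[0.0, 0.0], [3.0, 4.0]]))
```

The example shows why the two quantities are complementary: the two vectors agree perfectly in direction, yet a plain L2 penalty would still pull them together and thereby reduce the spread of the features.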
In addition, the parameters $W_c$ and $W_c^T W_c$ in (7) are visualized. The first 25 rows and 25 columns of the parameters used in the first three layers are selected for visualization, as shown in Fig. 6. It can be seen that the orthogonal regularization effectively drives the matrix toward orthogonality, so the mapping only transforms the features into a new space in which the constraint is applied, and no feature collapse occurs. We also study the effectiveness of each term of our loss. The results are shown in Table IV. The ADD strategy is considered as the baseline. It can be seen that the CSC loss increases AP50 by an average of 0.55, the PiP loss increases it by an average of 0.62, and the combination achieves the best results, increasing it by an average of 0.73.

F. Overall Results
To further demonstrate the effectiveness of our fusion method, we compare it with other fusion strategies that are widely used in RGB-Depth and RGB-Thermal perception tasks, including channel-wise weighted feature fusion (CWF) [14], cross gates (CRGs) [15], cross reference module (CRM) [16], gated information fusion (GIF) [17], and a fusion method for PAN and MS data fusion, i.e., the adaptive feature fusion module (AFFM) [18]. For CWF, CRGs, and GIF, we reimplement them in strict accordance with the original articles; for CRM and AFFM, we use the codes the authors provided.
CWF [14] first concatenates features of PAN and MS, and then, fuses them using a convolutional layer. Afterward, a weight vector is generated from the fused features using GAP, which will be used to reweight the PAN and MS features, as shown in Fig. 7(a). Finally, the fused features are obtained by elementwise addition of the weighted PAN and MS features.
CRGs [15] generates channel weights for PAN and MS modalities, respectively, and then, applies them crosswise, as shown in Fig. 7(b).
CRM [16] first obtains channel attention vectors for each modality, and then, mines the most discriminative features among them through element-wise addition. Finally, the channel features are fused according to the weights of each mode and the common important region, as shown in Fig. 7(c).
GIF [17] uses a spatial gate fusion mechanism. It generates spatial weight maps for each modality based on their concatenated features. Fusion is achieved through weighted concatenation, as shown in Fig. 7(d).
AFFM [18] generates weights from the concatenation of features after two convolutional layers. A softmax operation then normalizes the weights along the channel. After that, AFFM computes an element-wise weighted sum to fuse spatial and spectral features, as shown in Fig. 7(e).
Detection results using PAN images are taken as the baseline and compared with the ADD fusion strategy described in Section IV-D and detections based on Brovey pan-sharpened images. The performance of these methods is illustrated in Table V. In general, the performance of all detectors on the 5M-Building dataset does not exceed 70% AP50. This is mainly because the 5M-Building dataset covers complex scenes and has more diverse building styles and larger scale variations than other building datasets, as shown in Figs. 2 and 8. These diversities make the 5M-Building dataset more challenging, so the detection performance is relatively lower. It can be seen that, according to the average improvements, ADD, Brovey pan-sharpening, and GIF slightly improve the detection performance. All the other three fusion approaches decrease the detection performance. In sum, these four fusion approaches do not contribute much to the detection. Our method achieves an average improvement of 1.27% AP50. Furthermore, we promote Grid R-CNN to achieve 70% AP50, which is the best performance on the 5M-Building dataset, and improve Sparse R-CNN by 3.7% AP50.
Table VI shows all methods' running time and complexity based on Faster R-CNN [25]. Our DAFNet has fewer parameters than other feature fusion methods and achieves better performance. In particular, DAFNet reaches 68.2% AP50 with 74.2 M parameters and 12.5 FPS during inference, indicating that it is an effective way to realize feature fusion compared with other methods. In addition, the CSC loss and PiP loss introduced by ... Some results are visualized in Fig. 8. Detection results from PAN and Brovey pan-sharpened images using Faster R-CNN are shown in the second and third rows. As can be observed, our model has fewer missed detections and false positives. The proposed fusion method effectively combines the strengths of PAN and MS images, enabling augmented features of buildings, thus leading to more accurate localization and classification.

V. CONCLUSION
In this article, we have conducted in-depth studies of building detection from remote sensing images. We reveal that pan-sharpening may degrade the building detection performance. The building detection problem is addressed from a multimodality feature fusion view, and a dual-stream asymmetric fusion network is proposed to effectively fuse and augment PAN and MS features for building detection. The fusion is realized with an AFF module and two consistency regularization losses, i.e., CSC loss and PiP loss. Extensive experiments on the 5M-Building dataset demonstrate the effectiveness and superiority of the proposed approach.
Although the proposed DAFNet was motivated by the PAN and MS fusion problem in remote sensing, the method is a general framework that can be applied to other data sources, such as optical images and photogrammetric point clouds [84]. Additionally, we noticed that the independent dual-branch structure would bring too many parameters. A future direction is to use siamese networks combined with joint learning [85] to achieve a tradeoff between speed and accuracy.