Axial Cross Attention Meets CNN: Bibranch Fusion Network for Change Detection

In recent years, the vision transformer has demonstrated a capability for global information extraction that the convolutional neural network (CNN) lacks. However, because the vision transformer lacks inductive bias, it requires a large amount of data to support its training, and in the field of remote sensing, obtaining a significant number of high-resolution images is costly. Most existing change detection networks based on deep learning rely heavily on the CNN, which cannot effectively exploit the long-distance dependencies between pixels for difference discrimination. This work therefore aims to use a high-performance vision transformer to conduct change detection research with limited data. A bibranch fusion network based on axial cross attention (ACABFNet) is proposed. The network extracts local and global information of images through a CNN branch and a transformer branch, respectively, and then fuses local and global features by a bidirectional fusion approach. In the upsampling stage, similarity and difference features of the two branches are explicitly generated by feature addition and feature subtraction. Considering that the self-attention mechanism is not efficient enough for global attention on small datasets, we propose axial cross attention: global attention is first performed along the height and width dimensions of images separately, and cross attention is then used to fuse the global feature information of the two dimensions. Compared with the original self-attention, the structure is more graphics processing unit friendly and efficient. Experimental results on three datasets reveal that the ACABFNet outperforms existing change detection algorithms.


I. INTRODUCTION
Change detection of remote sensing images refers to the process of feature recognition in a collection of multitemporal remote sensing images captured by a satellite or unmanned aerial vehicle (UAV) over the same area at different times, aiming to identify changed and unchanged areas or to identify different types of changed areas [1]. In recent years, change detection has become one of the research hotspots in the field of Earth observation, playing a very important role in urban development planning [2], disaster loss assessment [3], water body change monitoring [4], vegetation cover change monitoring [5], and other practical applications.
With the rapid advancement of remote sensing image observation techniques, the Earth observation system of space information presents six characteristics [6], [7]: high spatial resolution, high spectral resolution, high temporal resolution, multiplatform, multisensor, and multiangle, providing reliable data sources for obtaining rich spatial information of the surface and promoting the further development of change detection research. Traditional remote sensing image change detection methods occupied the mainstream prior to the rise of deep learning. The most popular is the direct comparison method, which includes the difference method [8], the ratio method [9], and the change vector analysis method [10]. The method is straightforward to use: after image preprocessing, it only has to execute pixel-level calculations on the remote sensing images and then choose a suitable threshold to separate the changed areas from the unchanged areas. However, the performance of the method relies heavily on circumstances, which means that change detection methods should be chosen appropriately for different scenarios. With the rapid growth of change detection technologies, object-level change detection methods [11] began to arise. These methods use geographical objects in remote sensing images as the primary classification unit, classifying them comprehensively using texture, shape, spectrum, and other factors to reduce intraclass variation and eliminate the salt-and-pepper effect caused by misclassification. Compared with pixel-level change detection methods, object-level change detection methods can obtain richer feature representations and better model image context information. However, when processing remote sensing images from different imaging environments, these algorithms cannot effectively extract the rich feature information in the images, which makes it difficult to achieve high-precision detection results.
In recent years, deep learning has significantly promoted the development of semantic segmentation. Considering that this article only considers the changed category and the unchanged category, remote sensing image change detection can be regarded as binary semantic segmentation. In comparison with traditional change detection methods, methods based on deep learning can process remote sensing images with a vast quantity of data [12], [13]. Their capacity to characterize features is far superior, and the step of manually designing a feature extraction method is avoided [14]. In 2015, Long et al. [15] proposed fully convolutional networks, removing the final fully connected layer of a standard convolutional neural network (CNN) and allowing the network to output a prediction the same size as the input. In 2017, Chen et al. [16] introduced atrous convolution into fully convolutional networks and utilized conditional random fields for postprocessing, which effectively enlarged the receptive field of the network without increasing parameters and optimized the segmentation boundary. In the same year, Zhao et al. [17] proposed the PSPNet, which took advantage of the pyramid pooling module to aggregate contextual information from different areas, improving the global expression ability of the network. In 2019, Fu et al. [18] proposed the DANet, a dual-attention scene parsing network, modeling global semantic interdependencies with a self-attention mechanism instead of the previous multiscale feature fusion.
In addition, Daudt et al. [19] introduced fully convolutional neural networks into the field of change detection in 2018 and proposed three change detection algorithms (FC-EF, FC-Siam-conc, and FC-Siam-diff). They used the Siamese network structure for change detection for the first time. FC-EF is an early fusion-based model, which connects bitemporal images in the channel dimension and feeds them into the fully convolutional network. FC-Siam-conc and FC-Siam-diff are both Siamesebased structures. Siamese network is used to extract the features of bitemporal images, respectively, and then, concatenation or difference operation is used to obtain the differences between them. This article mainly conducts a series of change detection research based on the early fusion method.
In 2021, Dosovitskiy et al. [20] migrated the transformer [21] from natural language processing to computer vision and proposed the vision transformer, introducing a new idea in image processing. Subsequently, networks based on the vision transformer have been constantly emerging in the field of computer vision. Thanks to local correlation and translational invariance, the CNN can perform well on small- and medium-sized datasets [22]. The vision transformer lacks this inductive bias, so it needs to be backed by a massive amount of data to outperform the CNN. Since the CNN can effectively model local detailed information, while the vision transformer excels at modeling global image information, combining the CNN and the vision transformer becomes a viable option. In 2021, Peng et al. [23] proposed the Conformer, which used a parallel structure of the CNN and transformer, with a feature coupling unit to fuse local and global information of images. In the same year, Guo et al. [24] proposed a hybrid series structure of the CNN and transformer, replacing the multilayer perceptron in the transformer with convolution and achieving a balance between speed and accuracy. In addition, Srinivas et al. [25] proposed the bottleneck transformer, which replaced the 3 × 3 convolution in the bottleneck layer with multihead self-attention, significantly improving the baseline of downstream tasks. However, considering that the aforementioned networks are basically trained with the support of large datasets such as ImageNet [26] and COCO [27], the question is whether these networks can maintain the same excellent performance with a small amount of data. Besides, the aforementioned methods are basically aimed at improving the convolution structure or the connection mode between the CNN and transformer, but ignore the important impact of the huge computational cost of self-attention on the results.
In the field of remote sensing, because of the high cost of acquiring plenty of high-resolution remote sensing images, how to achieve the best performance under the premise of a small amount of data based on the vision transformer is the starting point of this article. In view of this, a bibranch fusion network based on axial cross attention (ACABFNet) is proposed in this article, the structure of which is depicted in Fig. 1. The overall structure of the network is a parallel dual-branch structure of the CNN and Transformer. The CNN is used to extract fine-grained features of images, while the transformer is used to extract global features of images. We fuse the two different features through a bidirectional interactive structure. In the upsampling stage, the overall local features and overall global features are integrated on the two branches, respectively, and the similarity and difference of the two feature information are explicitly modeled by addition and subtraction. It is worth mentioning that, considering the low efficiency of global attention of the original self-attention, we propose the axial cross attention. The axial attention is utilized to pay global attention to the images along the height and width dimensions, respectively, and then, the cross attention is used to fuse global feature information on the two dimensions. The structure is more efficient at extracting global features. In conclusion, our contributions are as follows.
1) A bibranch fusion network based on axial cross attention (ACABFNet) is proposed. Different from existing classification-based algorithms, the network is designed for semantic segmentation and change detection tasks, fully exploiting the fine-grained and global representation characteristics of images during the downsampling and upsampling stages.
2) Axial cross attention is proposed. Axial attention is used to model the global representation along the height and width dimensions, respectively, and cross attention is used to fuse the global feature information of the two directions. Compared with the original self-attention, the structure captures global feature representation with higher efficiency and accuracy and is more graphics processing unit (GPU) friendly.
3) Experimental results on three remote sensing image change detection datasets reveal that the ACABFNet is superior to existing change detection algorithms based on semantic segmentation.

II. BIBRANCH FUSION NETWORK BASED ON AXIAL CROSS ATTENTION (ACABFNET)
At present, the CNN and the vision transformer are two mainstream directions in the field of computer vision. Owing to its inherent inductive bias, the CNN can extract local neighborhood features of images layer by layer using convolution [28], while a transformer can pay global attention to image patches through the self-attention mechanism. Local representation and global representation are complementary. Based on this idea, we propose the ACABFNet, the details of which are shown in Table I. The ACABFNet is made up of a CNN branch and a transformer branch running in parallel; the CNN is used for local refinement, while the transformer is used for global generalization. Feature fusion is carried out by 1 × 1 convolutions with bidirectional intersection. In the upsampling stage, the local features and global features are fused by a 3 × 3 convolution and a multilayer perceptron (MLP), respectively, on the two branches. The two features are then added and subtracted, allowing the network to explicitly model the similarity and difference between local and global features, as well as filter out redundant features.

A. CNN Branch
A lot of existing work has demonstrated that ResNet [29] is a deep model with excellent performance thanks to its residual connection structure, so we adopt ResNet as the CNN branch. Since the transformer uses nonoverlapping image patches for global attention and an MLP for global information fusion, it inevitably loses local details. Thanks to local correlation and translational invariance, the CNN can effectively use prior information to model local fine-grained information [30], which makes up for this deficiency of the transformer. To be specific, the CNN branch consists of five stages. First, the CNN stem layer rapidly downsamples the input image to obtain a feature map 1/4 the size of the input. Four successive residual layers follow, with 3, 4, 6, and 3 residual blocks, respectively. As shown in Table I, the feature map is downsampled layer by layer from the D2 to the D5 stage, finally yielding a feature map 1/32 the size of the input. Through the CNN branch, the network obtains the local feature representation of the input image.
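The CNN branch described above can be sketched in PyTorch as follows. This is a minimal illustration rather than the authors' released code; the channel widths (64 to 512) and stage strides follow ResNet-34 conventions and are assumptions.

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Residual block with two 3x3 convolutions (ResNet-34 style)."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, 1, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        self.down = None
        if stride != 1 or in_ch != out_ch:  # match identity to the new shape
            self.down = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride, bias=False),
                nn.BatchNorm2d(out_ch))

    def forward(self, x):
        identity = x if self.down is None else self.down(x)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)

class CNNBranch(nn.Module):
    """Stem (1/4 downsampling) followed by four residual stages with
    3, 4, 6, and 3 blocks, as in the paper's CNN branch."""
    def __init__(self):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3, bias=False),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2, padding=1))  # 1/4 of the input size
        chans, blocks, strides = [64, 128, 256, 512], [3, 4, 6, 3], [1, 2, 2, 2]
        stages, in_ch = [], 64
        for ch, n, s in zip(chans, blocks, strides):
            stage = [BasicBlock(in_ch, ch, s)] + [BasicBlock(ch, ch) for _ in range(n - 1)]
            stages.append(nn.Sequential(*stage))
            in_ch = ch
        self.stages = nn.ModuleList(stages)

    def forward(self, x):
        feats = []
        x = self.stem(x)
        for stage in self.stages:
            x = stage(x)
            feats.append(x)  # resolutions: 1/4, 1/8, 1/16, 1/32
        return feats

feats = CNNBranch()(torch.randn(1, 3, 256, 256))
```

A 256 × 256 input thus yields four local feature maps, the last being 1/32 of the input size, ready for bidirectional fusion with the transformer branch.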

B. Transformer Branch
Given the excellent performance of the feature pyramid, we adopt the same structure when designing the transformer branch. In addition, the existing work shows that the performance can be effectively improved by using the CNN for downsampling in the initial stage of the transformer network. Therefore, the input first passes through the CNN-based stem layer, which conducts downsampling on the input image using convolution and maxpooling to obtain a feature representation with 1/2 the size of the input, capturing the shallow information of the image quickly and effectively. The structure of stem in the transformer branch is shown in Fig. 2. Then, four consecutive axial cross attention layers are used to capture the long-term dependencies of images. Each layer contains 3, 4, 6, and 3 axial cross attention blocks with 4, 8, 16, and 32 heads, respectively. The feature map is downsampled layer by layer to obtain the feature hierarchical representation structure similar to the CNN branch. The design is based on the following two considerations.
1) The hierarchical representation structure is more flexible in model design.
2) It is convenient for bidirectional fusion with the CNN branch.
Self-attention is strong at modeling the long-term dependencies of images [31], but it has to attend to all patches of an image, which makes feature extraction inefficient [32]. There are many background interference factors in remote sensing images; if global attention is paid to all patches directly and indiscriminately, feature redundancy occurs. Such redundancy not only affects the classification accuracy of the model but also heavily occupies GPU memory during training, which is unfriendly to hardware devices. Therefore, we propose the axial cross attention, which consists of axial attention, cross attention, and a feed forward network. The structure of the axial cross attention is shown in Fig. 3.
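A minimal PyTorch sketch of a single axial attention layer, restricting attention to the height or width axis, is given below. The head splitting, scaling factor, and learned positional embedding follow common transformer practice and are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class AxialAttention(nn.Module):
    """Multi-head attention along one spatial axis ('h' or 'w').
    Each column (or row) attends only within itself, so the attention
    matrix is H x H (or W x W) instead of HW x HW."""
    def __init__(self, dim, heads, axis, max_len=64):
        super().__init__()
        assert axis in ("h", "w") and dim % heads == 0
        self.axis, self.heads = axis, heads
        self.pos = nn.Parameter(torch.zeros(max_len, dim))  # P_H or P_W
        self.qkv = nn.Linear(dim, dim * 3)                  # the linear map phi
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):  # x: (B, C, H, W)
        b, c, h, w = x.shape
        if self.axis == "h":  # fold W into the batch, attend over H
            seq = x.permute(0, 3, 2, 1).reshape(b * w, h, c)
        else:                 # fold H into the batch, attend over W
            seq = x.permute(0, 2, 3, 1).reshape(b * h, w, c)
        n = seq.shape[1]
        seq = seq + self.pos[:n]
        q, k, v = self.qkv(seq).chunk(3, dim=-1)
        def split(t):  # (B*, n, C) -> (B*, heads, n, C/heads)
            return t.reshape(t.shape[0], n, self.heads, c // self.heads).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)
        attn = torch.softmax(q @ k.transpose(-2, -1) / (c // self.heads) ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(-1, n, c)
        out = self.proj(out)
        if self.axis == "h":
            return out.reshape(b, w, h, c).permute(0, 3, 2, 1)
        return out.reshape(b, h, w, c).permute(0, 3, 1, 2)

y = AxialAttention(dim=32, heads=4, axis="h")(torch.randn(2, 32, 16, 16))
```

Running a height-axis and a width-axis instance and fusing their outputs (here via cross attention in the paper) yields a global receptive field at a fraction of the cost of full self-attention.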
Patch embedding (PE) is first applied to transform the input f_in ∈ R^{C×H×W} into a sequence f_p ∈ R^{(H×W)×C}. After Layer Norm (LN), f_p is reshaped into f_{H/W} ∈ R^{H/W×C} in preparation for the following global attention along the height and the width, respectively. Since the transformer captures global semantic information while ignoring positional information, positional embeddings are introduced before axial attention and cross attention: P_H denotes the positional embedding along the height dimension, and P_W denotes the positional embedding along the width dimension. This process can be expressed as f_p = PE(f_in) and f_{H/W} = Reshape(LN(f_p)) + P_{H/W}. Axial attention carries out global semantic modeling only along the height or the width dimension of the image. Compared with the original self-attention, the structure is more efficient in global attention and more GPU friendly. As shown in Fig. 4, the input feature map f_{H/W} ∈ R^{H/W×C} first passes through a linear transformation φ to obtain the Query (Q), Key (K), and Value (V). The feature relationship matrix W_{H/W} ∈ R^{H×H} (or R^{W×W}) of the height or width dimension is obtained by the matrix multiplication of Q and K, activated by the Softmax function; matrix multiplication of W_{H/W} and V then yields the output f'_{H/W} ∈ R^{H/W×C}. Because it is modeled only along a single dimension, axial attention is a computationally efficient structure: Q, K, V = φ(f_{H/W}), W_{H/W} = Softmax(QK^T), and f'_{H/W} = W_{H/W}V. Given that a single axial attention loses information from the other dimension, we introduce cross attention to fuse the global semantic information of the two dimensions; its structure is shown in Fig. 5. Like the original transformer, the feed forward network (FFN) is used to fuse the global features of axial cross attention. As shown in Fig.
3, the input feature f_FFN ∈ R^{(H×W)×C} first goes through the LN and then two linear transformations φ_1 and φ_2, with a GELU activation between them. Finally, the output f_out ∈ R^{(H×W)×C} of axial cross attention is obtained by adding the result to the shortcut connection. Through the FFN, the global feature information along the height and width of the images can be effectively fused and enhanced. The aforementioned process can be expressed as f_out = φ_2(GELU(φ_1(LN(f_FFN)))) + f_FFN.
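The FFN described above maps directly to a few lines of PyTorch. The 4× hidden expansion below is a common transformer convention, assumed here rather than stated in the paper.

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """FFN of the axial cross attention block:
    f_out = phi2(GELU(phi1(LN(f)))) + f."""
    def __init__(self, dim, expansion=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.phi1 = nn.Linear(dim, dim * expansion)
        self.phi2 = nn.Linear(dim * expansion, dim)
        self.act = nn.GELU()

    def forward(self, f):  # f: (B, H*W, C)
        return self.phi2(self.act(self.phi1(self.norm(f)))) + f

y = FeedForward(dim=64)(torch.randn(2, 256, 64))
```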

C. Feature Recovery
In order to efficiently integrate global and local features in the upsampling stage, we first fuse the features of the last three stages of the transformer branch and the CNN branch, respectively, and then add and subtract the fusion features of the two branches. The purpose is to enable the network to explicitly distinguish the similarity and difference between global and local features and to extract discriminant features more effectively. Specifically, for the transformer branch, a linear transformation φ_i is first performed on each layer's features to obtain f^i_TC, to which the LN and GELU are then applied. For the CNN branch, a 3 × 3 convolution ψ_i is applied to each layer's features to obtain f^i_NC, which then goes through batch norm (BN) and ReLU. Through these operations, the channels of each layer's features of the two branches are transformed to the same size. Next, in order to concatenate all the features, each layer's features f^i_TC are reshaped into f^i_TI ∈ R^{C×H_i×W_i} and upsampled so that all the features have the same size C×H×W, where both H and W are 1/8 of the input image's side length. It is worth noting that both bilinear interpolation and transpose convolution are employed for upsampling: the former does not require training, whereas the latter is trained with the network, so they complement each other well. Each layer's features f^i_NC in the CNN branch are also transformed into f^i_NI ∈ R^{C×H×W} by a similar upsampling operation. Finally, a concatenation operation is performed on each layer's features of the two branches to obtain the global fusion feature and the local fusion feature of the whole network.
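The per-branch fusion above can be sketched as follows. This is a simplified illustration: only bilinear interpolation is shown (the paper pairs it with transpose convolution), and the common channel width and 1/8 target resolution are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def recover(feats, out_ch=64, target_hw=(32, 32)):
    """Project each stage's feature map to a common channel width,
    upsample everything to the 1/8 resolution, and concatenate."""
    projs = [nn.Sequential(nn.Conv2d(f.shape[1], out_ch, 3, padding=1),
                           nn.BatchNorm2d(out_ch), nn.ReLU())
             for f in feats]
    aligned = [F.interpolate(p(f), size=target_hw, mode="bilinear",
                             align_corners=False)
               for p, f in zip(projs, feats)]
    return torch.cat(aligned, dim=1)  # (B, out_ch * num_stages, H/8, W/8)

# Last three stages of one branch for a 256x256 input (channels assumed).
feats = [torch.randn(1, c, s, s) for c, s in [(128, 32), (256, 16), (512, 8)]]
fused = recover(feats)
```

The same routine would be applied once per branch, producing the global fusion feature and the local fusion feature that are subsequently added and subtracted.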
Existing semantic segmentation-based change detection algorithms do not take into account the similarity and difference between global and local feature information when combining the transformer and the CNN for feature recognition, resulting in significant feature redundancy. Therefore, at the end of the network, we explicitly model the similarity and difference of global and local features by addition and subtraction operations and optimize the features with a 3 × 3 convolution block. Fig. 6 shows the heat maps of the two discriminant features. For the change detection task, similarity features mainly focus on the unchanged areas of the bitemporal remote sensing images, while difference features focus on the changed areas, as shown in the red sections of the figure.
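A minimal sketch of this similarity/difference head follows; the shared channel width and the exact makeup of the 3 × 3 convolution block are assumptions.

```python
import torch
import torch.nn as nn

class SimDiffHead(nn.Module):
    """Explicitly model similarity (addition) and difference (subtraction)
    between the global (transformer) and local (CNN) fusion features,
    then refine each with a 3x3 convolution block."""
    def __init__(self, ch):
        super().__init__()
        def conv_block():
            return nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1),
                                 nn.BatchNorm2d(ch), nn.ReLU())
        self.sim_conv = conv_block()
        self.diff_conv = conv_block()

    def forward(self, f_global, f_local):
        sim = self.sim_conv(f_global + f_local)    # attends to unchanged areas
        diff = self.diff_conv(f_global - f_local)  # attends to changed areas
        return sim, diff

s, d = SimDiffHead(64)(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
```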

III. DATASETS
To comprehensively verify the effectiveness of the ACABFNet proposed in this article, we conduct training and testing of the model on three different remote sensing image change detection datasets, namely BTCDD [1], CDD [33], and LEVIR-CD [34].

A. BTCDD
The BTCDD dataset is a remote sensing image change detection dataset that we proposed in our previous work [1]. It contains 5281 pairs of high-resolution bitemporal remote sensing images with 256×256 pixels each, among which 4224 pairs are used as the training set and 1057 pairs as the test set. All of the images were taken in different regions of China from 2010 to 2020. The types of changed areas include factories, farmland, roads, buildings, and mining areas.

B. CDD
The CDD dataset consists of seven pairs of bitemporal remote sensing images with 4725×2700 pixels each and four pairs with 1900×1000 pixels each. The eleven pairs of images are synchronously cropped into 16 000 pairs of image patches with 256×256 pixels each, of which 10 000 pairs constitute the training set, 3000 pairs the validation set, and the remaining 3000 pairs the test set. Seasonal variations are taken into account to make the trained networks more convincing.

C. LEVIR-CD
The LEVIR-CD dataset is composed of 637 pairs of highresolution Google Earth images with 1024×1024 pixels each. All the images were taken in 20 different areas of Texas between 2002 and 2018. The dataset focuses on significant changes in buildings, including villas, apartments, garages, warehouses, and so on. In addition, seasonal variations and illumination variations are taken into consideration, which help to develop high-performance models.

IV. EXPERIMENTS

A. Evaluation Indicators
For assessing the performance of the ACABFNet in the change detection task, we adopt four evaluation indicators, namely Precision, Recall, MIOU, and F1 score:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 × Precision × Recall / (Precision + Recall)
MIOU = (IOU_changed + IOU_unchanged) / 2, where IOU = TP / (TP + FP + FN) for the changed class

where TP, FP, and FN denote true positives, false positives, and false negatives, respectively.
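These standard indicator definitions can be computed directly from a binary confusion matrix, as the short sketch below shows (the example counts are illustrative, not from the paper).

```python
def change_metrics(tp, fp, fn, tn):
    """Precision, Recall, F1, and MIOU from a binary confusion matrix.
    MIOU averages the IoU of the changed and unchanged classes."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou_changed = tp / (tp + fp + fn)
    iou_unchanged = tn / (tn + fp + fn)
    miou = (iou_changed + iou_unchanged) / 2
    return precision, recall, f1, miou

# Illustrative pixel counts for one prediction map.
p, r, f1, miou = change_metrics(tp=80, fp=20, fn=20, tn=880)
```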

B. Experimental Details
In this article, all the experiments are implemented on a GeForce RTX 2080Ti GPU based on PyTorch. BCEWithLogitsLoss is utilized as the loss function for training the models, and Adam is used as the optimizer. The batch size is set to 4, and the initial learning rate (lr) is set to 0.0001. On BTCDD, the maximum number of epochs (max_epoch) is set to 250; on CDD and LEVIR-CD, max_epoch is set to 200. In view of the effectiveness of dynamically adjusting the learning rate, we adopt the poly learning rate reduction strategy, in which the learning rate of each epoch is lr × (1 − epoch/max_epoch).
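The schedule above is the poly decay with power 1, which in code is a one-liner (the `power` argument is the general poly form and defaults to the paper's setting):

```python
def poly_lr(base_lr, epoch, max_epoch, power=1.0):
    """Poly learning-rate schedule: lr * (1 - epoch / max_epoch) ** power.
    The paper's formula corresponds to power = 1."""
    return base_lr * (1 - epoch / max_epoch) ** power

# Learning rates at the start, middle, and last epoch of a 250-epoch run.
lrs = [poly_lr(1e-4, e, 250) for e in (0, 125, 249)]
```

In PyTorch this can be attached to the Adam optimizer via `torch.optim.lr_scheduler.LambdaLR`.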

C. Ablation Experiments
To verify the effectiveness of the presented ACABFNet, we conduct ablation experiments on the BTCDD dataset. The experimental results are shown in Table II.

1) Transformer branch: We remove the transformer branch from the overall network structure, so the network degenerates into ResNet34 [29]. It is worth noting that we retain the original prediction head to complete change detection. It can be concluded from Table II that the transformer branch brings significant improvements to our model, with increases of 3.26% and 3.62% on the two evaluation indicators, respectively.

2) Axial cross attention: We conduct ablation experiments on axial attention and cross attention, respectively. From Table II, we can find that after removing either of the two attention modules, the model degrades on the two important indicators to varying degrees. The combination of the two enables the model to effectively fuse the global feature information along the two spatial dimensions of the images, improving the feature expression performance. Axial attention improves MIOU and F1 score by 0.49% and 0.68%, respectively; cross attention improves them by 0.43% and 0.67%, respectively.

3) +/− operations:
The addition and subtraction operations explicitly distinguish the similarity and difference between the global features of the transformer branch and the local features of the CNN branch, effectively filtering out the redundant feature information. As can be seen from Table II, the operations make the model improve MIOU and F1 score by 0.28% and 0.34%, respectively.

D. Comparative Experiments
The performance of the ACABFNet is evaluated on three datasets, namely BTCDD, CDD, and LEVIR-CD. Extensive comparative experiments fully demonstrate the outstanding performance of the ACABFNet. For the fairness of the experiments, the hyperparameter settings of all experiments on a given dataset are the same. In addition, for the single-input semantic segmentation-based models (SETR/HRNet/PSPNet, etc.), we directly concatenate the bitemporal images along the channel dimension and then send the result to the network. For the dual-input change detection models (FC-EF/FC-Siam-conc/FC-Siam-diff), the bitemporal images are sent to the network at the same time.

1) Comparative Experiments on BTCDD:
We first conduct comparative experiments on the BTCDD dataset. To make the experimental results more convincing, we compare change detection algorithms based on both the CNN and the vision transformer. The experimental results are shown in Table III, where * denotes a transformer-based method. From the table, it can be seen that some CNN-based algorithms are significantly better than transformer-based algorithms, which may be due to the small amount of data in BTCDD. In comparison to existing change detection methods based on deep learning, our ACABFNet achieves the best accuracy on most indicators; its MIOU and F1 score are 0.76% and 1.07% higher than PSANet's, respectively. In addition, compared with BiSeNet, HRNet, PVT, and SegFormer, our algorithm achieves significant improvement on the four indicators at the cost of only a small amount of additional computation. It should be emphasized that, if N (N = H × W) denotes the sequence length and d the feature dimension, the computational complexity of self-attention is O(N^2 · d). Our axial cross attention processes the H and W dimensions separately, and its computational complexity is O(L^2 · d) when L = H = W; compared with the original self-attention, O(L^2) times the calculation is saved. Therefore, axial cross attention can effectively obtain the global representations of the images with a relatively low amount of calculation, which is one of the advantages of our algorithm. However, the number of parameters of our model is somewhat large, and we will address this problem in future work. Fig. 7 shows the prediction results of different algorithms; we display two groups of images from the 1057 groups of prediction results. As shown in the figure, the performance of the transformer-based models represented by SETR, PVT, and SegFormer is unsatisfactory. Due to the lack of inductive bias, the transformer usually needs training on large datasets to achieve better performance.
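The complexity comparison above can be made concrete with a per-sequence multiply count: full self-attention flattens the image into one sequence of N = H × W tokens, whereas each axial attention operates on sequences of length H or W. The sketch below uses this per-sequence accounting (it deliberately ignores the fact that axial attention runs over many rows or columns in parallel); the example sizes are illustrative.

```python
def attn_cost(seq_len, dim):
    """Dominant multiply count of one attention over a length-n sequence:
    Q @ K^T and attn @ V each cost n * n * dim multiplies."""
    return 2 * seq_len * seq_len * dim

# Full self-attention sees one sequence of N = H * W tokens;
# axial attention sees sequences of length L = H = W instead.
N, L, d = 64 * 64, 64, 32
full_cost = attn_cost(N, d)   # O(N^2 * d) = O(L^4 * d)
axial_cost = attn_cost(L, d)  # O(L^2 * d) per axial sequence
```

The per-sequence ratio full_cost / axial_cost equals L^2, matching the saving stated in the text, and the attention matrix shrinks from (HW × HW) to (L × L), which is the source of the GPU-memory friendliness.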
Moreover, existing CNN-based change detection algorithms have difficulty modeling global feature information effectively, so false detections and missed detections are serious when predicting the changed areas. As shown in the first group of comparison diagrams in Fig. 7, some CNN-based models incorrectly predict the land around the blue factories as changed areas (indicated in red), resulting in serious false alarms. In addition, in the second group of images, the CNN-based detection models exhibit obvious missed detections (indicated in green). Our proposed ACABFNet combines the advantages of the CNN in local feature extraction and the transformer in global feature extraction to effectively improve the accuracy of change detection. Compared with existing change detection algorithms, the ACABFNet is better at detecting the boundaries of changed areas. Fig. 8 shows the heat maps of the intermediate layers of the ACABFNet, where red indicates higher attention and blue indicates lower attention. From the figure, we can see that the CNN branch pays more attention to fine-grained boundary information, while the transformer branch focuses more on the overall representation of the images. The two are complementary, which enables the model to capture richer information about the images. Fig. 9 shows the accuracy curves of our ACABFNet and several other models on BTCDD. It can be seen from the figure that our ACABFNet is superior to the other algorithms on the test set, which means that the ACABFNet has better generalization performance. In addition, under the same number of training iterations, our model achieves the optimal detection accuracy on the test set, which may be attributed to the effective complementarity between the CNN branch and the transformer branch.
The simultaneous development of local features and global features makes it easier for the model to detect more complex changed areas.
2) Comparative Experiments on CDD: A single dataset is insufficient to comprehensively examine the performance of the model, so we also experiment on the CDD dataset. All of the models are retrained and retested on the CDD dataset. The experimental results are shown in Table IV; as before, transformer-based algorithms are denoted by *. Considering that the training set of the CDD dataset contains only 10 000 pairs of images, a fraction of the size of large datasets such as ImageNet and COCO, it is understandable that transformer-based models perform poorly. Thanks to the high efficiency of axial cross attention, the ACABFNet can quickly converge to the optimal accuracy. As shown in Table IV, the ACABFNet improves MIOU and F1 score by 0.72% and 0.67%, respectively, compared with the suboptimal model DANet. Fig. 10 shows the prediction results of various algorithms on the CDD dataset. We select two groups of prediction results from the 3000 groups for display. As shown in the figure, most of the existing deep learning-based models can only predict a portion of the track when detecting narrow roads, leading to serious omission alarms (indicated in green), and the coherence of their predictions is poor. Our ACABFNet can detect the entire changed track coherently, and the detection edges are smoother. This is due to our axial cross attention attending to the image from a global perspective, together with the explicit discrimination of similarity and difference between global and local features.
3) Comparative Experiments on LEVIR-CD: We carry out the third group of comparative experiments on the LEVIR-CD dataset. Considering the limitation of GPU memory, the original image pairs with 1024×1024 pixels each are synchronously cropped into image patches with 256×256 pixels each. The training set includes 7120 pairs of images, and the test set includes 2048 pairs. Similarly, all of the models are retrained and retested on the LEVIR-CD dataset. The experimental results are shown in Table V, where * indicates a transformer-based method. Our proposed ACABFNet is superior to the existing deep learning-based models on all four indicators. Especially on MIOU and F1 score, the ACABFNet is 1.31% and 1.5% higher than the transformer-based model SegFormer, and 0.98% and 1.12% higher than the suboptimal model DANet, respectively, which adequately proves the effectiveness of our algorithm. Fig. 11 shows the prediction results of different algorithms on the LEVIR-CD dataset. The two groups of prediction images are from the 2048 groups of predicted images in the test set. As can be seen from the figure, when detecting changes in several adjacent buildings, most of the prediction boundaries of existing deep-learning-based models are serrated, and adhesion even occurs, resulting in false detections and missed detections (indicated in red and green, respectively). Our ACABFNet can distinguish each changed building clearly, and the prediction boundaries are smoother, which effectively reduces the occurrence of false alarms and omission alarms.

V. DISCUSSION
The aforementioned extensive experiments effectively demonstrate the advantages of our algorithm from multiple perspectives. Conventional change detection methods cannot cope with change detection tasks in different imaging environments because of their simple feature extraction methods; as shown in the two visualization maps of PCA-Means in Fig. 7, there are dense areas of false detection and missed detection. Most of the existing learning-based change detection methods rely heavily on the CNN framework, which is limited by the size of the convolution kernel and cannot effectively distinguish the differences between two images or model the relationships among changed areas from a global semantic perspective. As shown in Figs. 10 and 11, there are strong semantic associations among the narrow roads and the dense buildings; ignoring them leads to serious omission alarms. Some existing transformer-based methods, such as SETR, PVT, and SegFormer, model the long-distance dependence with self-attention to solve the aforementioned problem. However, using only self-attention to process the images often leads to the loss of local details, especially for small objects. Different from the existing change detection methods, our proposed ACABFNet utilizes both the local attention capability of the CNN and the global semantic modeling capability of axial cross attention, achieving a reliable improvement in detection accuracy with relatively low FLOPs. The curves in Fig. 9 indicate that our model converges faster under the same number of iterations and outperforms the other algorithms on the test set.

VI. CONCLUSION
In this article, we present the ACABFNet. The network is composed of a CNN branch and a transformer branch in parallel, with two-way interaction carried out through 1×1 convolutions to fuse local and global feature information. Considering that the global attention of the self-attention mechanism is inefficient, we propose axial cross attention, which first pays global attention to the images along the height and width dimensions, respectively, and then fuses the global feature information of the two dimensions through cross attention. Compared with the original self-attention, axial cross attention extracts global features more efficiently and is more GPU friendly. Furthermore, at the end of the network, we explicitly discriminate the similarity and difference between global and local features by addition and subtraction operations, effectively filtering out redundant features. Experimental results on the BTCDD, CDD, and LEVIR-CD datasets show that the ACABFNet outperforms existing conventional change detection algorithms and semantic segmentation-based algorithms.
Code Availability
Name of the library: ACABFNet
Hardware requirements: GeForce RTX 2080Ti (12 GB)
Software required: Python 3.8
Packages: PyTorch
The source code is available for download at: https://github.com/SONGLEI-arch/ACABFNet