Geospatial Contextual Attention Mechanism for Automatic and Fast Airport Detection in SAR Imagery

The automatic extraction of airport runway areas from high-resolution Synthetic Aperture Radar (SAR) images is of great research significance in both military and civilian fields. However, it is still challenging to distinguish airports from surrounding objects in SAR images. In this article, a new framework is proposed to extract airport runway areas (runways, taxiways, parking aprons, and aircraft) in a fast and automatic manner. The framework is based on the Geospatial Contextual Attention Mechanism (GCAM) for geospatial feature learning and classification, employed together with down-sampling and coordinate mapping modules. To evaluate the performance of the proposed framework, three large-scale Gaofen-3 SAR images with 1-m resolution are utilized in the experiment. According to the results, the Mean Pixel Accuracy (MPA) and Mean Intersection Over Union (MIOU) of GCAM are 0.9850 and 0.9536, respectively, outperforming RefineNet, DeepLabV3+, and MDDA. The training time of GCAM on the dataset is 2.25 h, and the average testing time for the five SAR images is only 18.15 s. Therefore, GCAM offers rapid and automatic airport detection from high-resolution SAR images with high accuracy, and the extracted runway areas can further be used to mark the airport and thus greatly improve the accuracy of aircraft detection.


I. INTRODUCTION
As airports are important transportation hubs and military facilities, their detection from Synthetic Aperture Radar (SAR) images has attracted considerable interest for a long time [1]. SAR [2] provides all-weather, day-and-night observation and can penetrate clouds and fog. However, SAR images are more difficult to interpret than optical images, and their analysis is usually more complicated. Previously, most airport detection was based on optical remote sensing images, which is time-consuming and labor-intensive [1].
The associate editor coordinating the review of this manuscript and approving it for publication was Nadeem Iqbal.
With the rapid development of SAR imaging techniques, research on extracting airports from SAR images has gradually increased in recent years, and related analytical approaches have also flourished [3]. Therefore, automatic and fast detection of airport runway areas from high-resolution SAR images has become feasible and has attracted considerable research interest. Moreover, once runway areas are extracted, the accuracy of aircraft detection from high-resolution SAR images can be greatly improved.
VOLUME 8, 2020 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

The SAR imaging process is noise-prone, and its side-looking imaging geometry tends to produce shadows of various objects. Therefore, in SAR images, the runway areas of an airport resemble shadows (e.g., shadows cast by vegetation), and their representative features are difficult to distinguish using traditional detection methods. In recent years, deep learning has attracted more and more attention [3]. With sufficient training data and elaborate network design, deep learning can mine implicit features of targets and is robust to various types of noise. We integrate domain knowledge of SAR image analytics with deep learning techniques to tackle the challenges in airport detection, and the main contributions of this article are as follows: (a) Multiscale analysis has become routine within deep learning techniques for SAR research. To improve the efficiency and effectiveness of extracting runway areas from high-resolution SAR images, down-sampling and coordinate mapping modules are developed using domain knowledge of SAR image interpretation to handle scale challenges. (b) The Geospatial Contextual Attention Mechanism (GCAM) is proposed, which effectively integrates the geospatial domain knowledge of SAR image analytics, contextual information, and an attention mechanism. The satisfying results prove its success, and it can also be transferred to other networks to detect various targets. The rest of this article is organized as follows.
Section II presents the state of the art, describing the development of airport detection in the field of deep learning in detail. Section III introduces the framework proposed in this article, elaborating the sub-modules and working mechanisms of the network. In Section IV, experiments are carried out on high-resolution SAR images, and the extracted runway areas are evaluated. Section V gives a brief summary of the proposed method, a discussion of the network, and the future work. Finally, the conclusion is given in Section VI.

II. STATE OF THE ART
Airport detection has a wide range of applications in navigation, accident detection, rescue, and aircraft positioning [4]. Existing research on airport detection is mostly based on optical remote sensing images [5], [6]. The traditional approach of extracting airport edge line segments is the most commonly used [5]-[8], but it assumes that linear features can be easily extracted from all airports, which is very challenging for airports with numerous terminals and irregular buildings. Zhang [9] used Sparse Reconstruction Saliency (SRS) and a Target Aware Active Contour Model (TAACM) to implement airport detection and to address the extraction of airport details; Zhang et al. [10] combined a visual saliency analysis model, a two-way complementary saliency analysis module, and a saliency active contour model to extract airport contours. The performance of these methods was largely constrained by the quality of the optical remote sensing images, especially under adverse weather conditions. In contrast, SAR systems have strong penetration capabilities and can work even under clouds and fog. This advantage has made SAR images increasingly popular for airport detection. Liu et al. [11] integrated a traditional line segment grouping method and a saliency analysis model to detect airports in small-scale SAR images, but the method was not robust for large-scale SAR images; Zhang et al. [12] proposed a Polarimetric Synthetic Aperture Radar (PolSAR) airport runway detection algorithm combining optimized polarization features and random forests, but it focused only on extracting parallel runway features in airports.
In recent years, deep learning has achieved excellent results in semantic segmentation [13], [14]. Airport detection needs to extract all pixels of the runway areas, which is consistent with the idea of semantic segmentation [15]. Yu et al. [16] proposed an airport detection method combining the YOLO model and a saliency analysis model. Xiao et al. [17] integrated the Google-LF network and a Support Vector Machine (SVM) to detect airports. Zeng et al. [18] fused the Faster R-CNN network and a spatial analysis method to identify airports. Li et al. [19] constructed an end-to-end deep transferable convolutional neural network to recognize airports. However, these methods all apply deep learning to extract airports from optical remote sensing images, and deep learning models often tend to overfit when trained on specific airport datasets. Aiming at detecting airports from high-resolution SAR images, Chen et al. [1] proposed a deep learning network, Multi-level and Densely Dual Attention (MDDA), to extract runway areas. It achieved high-precision airport extraction, but the network required a large quantity of high-quality labeled data and long training time. Therefore, it is very practical and necessary to find a deep learning network that is suitable for small sample datasets, efficient in extracting airports, and works under different weather conditions.
Deep learning networks are developing rapidly, and the DeepLab series has shown noticeable performance in the field of semantic segmentation [20], [21]. In 2014, DeepLabv1 [22] was proposed, which employed atrous convolution for the first time to solve the problems of signal down-sampling and spatial deformation in traditional CNN algorithms [23]. It also improved the ability of the model to capture fine details by using a Conditional Random Field (CRF), and won second place in the PASCAL semantic segmentation challenge. In 2016, DeepLabv2 [24] added Atrous Spatial Pyramid Pooling (ASPP) on the basis of DeepLabv1 to capture contextual semantic information at multiple scales, and replaced the VGG-16 backbone [25] with ResNet [26] to overcome the feature resolution degradation caused by pooling in traditional CNNs. In 2017, DeepLabv3 [27] improved ASPP to achieve better overall performance in object detection. In 2018, DeepLabv3+ introduced encoder and decoder modules, designed an effective decoder module, and adopted depth-wise separable convolution as well, enabling the model to reduce the amount of computation and the number of parameters while maintaining satisfactory performance.
In recent years, deep learning has developed rapidly, and preliminary results have been achieved in the field of SAR image airport extraction [1]. However, how to use deep learning networks to extract more effective airport features still needs further research. To address this problem, the Geospatial Contextual Attention Mechanism (GCAM) is presented in this article, which integrates geospatial information, an attention mechanism, and contextual information. It can better extract effective features of the runway areas while suppressing interference from similar features.

III. METHODOLOGY
In this article, a new framework is proposed for rapid and automatic detection of airport runway areas, which includes three parts: down-sampling, the Geospatial Contextual Attention Mechanism (GCAM) for the extraction of geospatial and contextual features, and up-sampling using coordinate mapping. The overall framework is shown in Fig. 1. First, high-resolution SAR images are down-sampled to generate medium-resolution images, so that the samples contain adequate features of the airport [1]. Second, these samples are input into GCAM, which consists of an encoder and a decoder. The encoder comprises the improved ResNet_101 backbone [29], the Multi-scale Squeeze Pyramid (MSP), and the Edge Detection Module (EDM) [13]. First, the improved ResNet_101 extracts features from the samples. These features are then input into MSP to capture geospatial information and contextual semantic information at multiple scales, and, in parallel, into EDM to enhance the ability of edge extraction. The encoder fuses the MSP and EDM outputs into deep features, which are passed to the decoder for edge refinement to generate the runway area results. Finally, the results are processed with coordinate mapping to achieve runway detection on the original high-resolution SAR images.

A. DOWN-SAMPLING
In this article, we down-sample by a factor of 5 to illustrate the framework; the factor can be amended for other tasks. The down-sampled data include two types of SAR images: the large-scale SAR images used to generate the training datasets, and the high-resolution SAR images used to test the proposed framework. After down-sampling, medium-resolution SAR images are produced. Then the ground truth is attached to these datasets for training.
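The paper does not specify the interpolation scheme used for the 5× down-sampling; the sketch below assumes simple block (mean) averaging, which also slightly suppresses speckle. The array shapes are illustrative stand-ins for the SAR images in Section IV.

```python
import numpy as np

def downsample(image: np.ndarray, factor: int = 5) -> np.ndarray:
    """Reduce resolution by averaging non-overlapping factor x factor blocks.

    Block averaging is an assumption here; the paper only states the factor.
    """
    h, w = image.shape
    h, w = h - h % factor, w - w % factor  # crop to a multiple of the factor
    blocks = image[:h, :w].reshape(h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(1, 3))

# A 12000 x 15000 image would become 2400 x 3000, matching Airport I in Section IV.
sar = np.random.rand(100, 150)   # stand-in for a SAR amplitude image
low = downsample(sar, 5)
print(low.shape)                 # (20, 30)
```

With this convention, every medium-resolution pixel summarizes a 5 × 5 block of the original image, which is what the coordinate mapping in Section III-C inverts.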

B. THE GEOSPATIAL CONTEXTUAL ATTENTION MECHANISM (GCAM)
To extract the airport, the GCAM (as shown in Fig. 1) is proposed in this article, which includes two parts: the encoder and the decoder. In the encoder, the improved ResNet_101 residual network performs preliminary feature extraction on the input dataset, and MSP and EDM further extract and fuse these preliminary features. MSP obtains global contextual information and geospatial information from features at different resolutions, and EDM strengthens the edge extraction capability of the network and further fuses multi-level features. The decoder has two inputs: the multi-level enhanced features output by the encoder, and the preliminary features from the improved ResNet_101. GCAM then performs the segmentation of the runway areas.

1) THE ENCODER
The internal structure of the encoder is shown in Fig. 1, which consists of three parts: the improved backbone network ResNet_101, MSP, and EDM.

a: IMPROVED RESNET_101
In this article, ResNet_101 with atrous convolution [29] is adopted as the backbone network. The improved ResNet_101 has skip connections and residual optimization, a structure that can accelerate training and improve model accuracy, which is very attractive for building a semantic segmentation network. Atrous convolution avoids the risk of losing detailed features caused by pooling. Moreover, it does not introduce extra parameters, but keeps larger feature maps in the later convolution layers, which benefits target detection and improves the overall performance of the model. For atrous convolution, for any pixel position j in the image, a filter ω of size k is applied on the input feature x, and the output is

y[j] = Σ_k x[j + r · k] ω[k],

where the rate r introduces r − 1 zero values between the sampling points, so the receptive field is effectively extended from k × k to k + (k − 1)(r − 1) without increasing the number of parameters or the amount of computation. The structure of the improved ResNet_101 is shown in Fig. 2; the original structure of ResNet_101 is given in [15]. In the default ResNet_101, the ResNet blocks use consecutive convolutions with different strides, which compress features into smaller feature maps in the final layers, so detailed deep semantic information may get lost in the compression process. To address this issue, we copy the last block of the default ResNet_101 four times, arrange the copies in parallel, and replace their two-dimensional convolutions with atrous convolutions of rates 2, 4, 8, and 16 to improve the final output stride. The atrous convolutions change the resolutions of some feature maps, so the final output of the improved ResNet_101 contains not only low-resolution, high-dimension feature maps but also some high-resolution, low-dimension features. By this means, it achieves effective extraction of multi-level features.
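The atrous convolution formula and the receptive-field expansion above can be checked numerically. The 1-D sketch below is a minimal illustration (valid region only, no padding), not the network's actual 2-D implementation:

```python
import numpy as np

def effective_receptive_field(k: int, r: int) -> int:
    """Receptive field of a k x k kernel with atrous rate r: k + (k-1)(r-1)."""
    return k + (k - 1) * (r - 1)

def atrous_conv1d(x: np.ndarray, w: np.ndarray, r: int) -> np.ndarray:
    """1-D atrous convolution: y[j] = sum_k x[j + r*k] * w[k] (valid region)."""
    k = len(w)
    span = (k - 1) * r   # distance covered by the dilated kernel
    return np.array([sum(x[j + r * i] * w[i] for i in range(k))
                     for j in range(len(x) - span)])

# A 3x3 kernel with rate 2 sees a 5x5 window; with rate 4 a 9x9 window,
# in both cases with only 3x3 = 9 weights.
print(effective_receptive_field(3, 2), effective_receptive_field(3, 4))  # 5 9
```

The rates 2, 4, 8, and 16 used in the four parallel blocks therefore give the same kernel progressively wider context at no extra parameter cost.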

b: MULTI-SCALE SQUEEZE PYRAMID
The Multi-scale Squeeze Pyramid (MSP) is a new network module proposed in this paper. It mainly includes two parts: a multi-receptive-field parallel pooling layer and the effective Squeeze-and-Excitation (eSE) module [30]. The feature maps generated by the improved ResNet_101 contain 256 channels and rich semantic information. They are first input into the multi-receptive-field parallel pooling layer, which consists of several convolution and pooling operations in a parallel structure: one 1 × 1 convolution with an atrous rate of 1, three 3 × 3 convolutions with atrous rates of 6, 12, and 18, one Global Average Pooling (GAP) [31], and one Strip Pooling (SP) [32]. The four convolutions with different atrous rates effectively capture multi-scale features from different receptive fields; GAP down-samples the features to prevent overfitting; and SP captures local information of the features. The multi-receptive-field parallel pooling layer thus achieves multi-scale feature fusion. After receiving the multi-scale features, eSE screens them using channel information.
The attention module plays an important role in target classification and detection [33]. Wang et al. [34] and Cui et al. [35] both combined channel attention and spatial attention to detect ships. However, dual attention brings additional parameters, which greatly increases the training time of the neural network. Lee and Park [30] extended the Squeeze-and-Excitation (SE) module [36] into eSE, which focuses only on channel attention. Squeeze-and-excitation is a representative channel attention method in neural networks, which directly models the channel relationships among feature maps so as to enhance the feature learning of the network. eSE first summarizes the features through GAP, then extracts useful channel features through a Fully Connected (FC) layer and a Sigmoid function, and finally re-weights the input feature maps. If the input feature map is X_i ∈ R^{C×W×H}, the effective channel attention map A_eSE(X_i) ∈ R^{C×1×1} can be calculated as

A_eSE(X_i) = σ(W_C · F_gap(X_i)),
X_refine = A_eSE(X_i) ⊗ X_i,

where F_gap(X_i) is the GAP of the channel information, W and H denote the width and height of the input feature, W_C is the weight of the FC layer, σ is the sigmoid function, ⊗ denotes channel-wise multiplication, and X_refine is the final output after weighting by A_eSE(X_i).
The input X_i is the multi-scale feature map output by the multi-receptive-field parallel pooling layer of MSP. Applying A_eSE(X_i) as channel attention to the multi-scale feature map X_i makes it more informative: the attention map re-weights the input feature maps element by element along the channel dimension to produce X_refine. The specific structure of eSE is shown in Fig. 3.
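A minimal numpy sketch of the eSE computation above (GAP, one FC layer, sigmoid, channel-wise re-weighting); the square FC weight `w_c` and the feature shape are illustrative assumptions, not the paper's actual dimensions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ese_attention(x: np.ndarray, w_c: np.ndarray) -> np.ndarray:
    """eSE channel attention: A = sigmoid(W_C . GAP(X)); output = A (x) X.

    x   : feature map of shape (C, H, W)
    w_c : (C, C) weight of the single FC layer (assumed square so the
          channel count is preserved, as in Lee and Park [30]).
    """
    gap = x.mean(axis=(1, 2))            # F_gap(X): global average pool -> (C,)
    a = sigmoid(w_c @ gap)               # A_eSE(X) in R^{C x 1 x 1}
    return a[:, None, None] * x          # X_refine: channel-wise re-weighting

x = np.random.rand(4, 8, 8)
out = ese_attention(x, np.eye(4))
print(out.shape)   # (4, 8, 8)
```

Unlike dual (channel + spatial) attention, this adds only the C × C FC weights, which is why eSE keeps training time low.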
SP [32] overcomes the shortcoming of general pooling, which is prone to false detections. For a two-dimensional input feature tensor x ∈ R^{H×W}, SP uses pooling windows of H × 1 and 1 × W to perform pooling operations in the horizontal and vertical directions, respectively, averaging the element values within each pooling kernel. The output y^h ∈ R^H of SP in the horizontal direction is

y^h_i = (1/W) Σ_{0≤j<W} x_{i,j}.

Similarly, the output y^v ∈ R^W of SP in the vertical direction is

y^v_j = (1/H) Σ_{0≤i<H} x_{i,j}.

Following the H × 1 and 1 × W pooling, two one-dimensional convolutions are utilized to expand the outputs in the horizontal and vertical directions. After the expansion, the two feature maps have the same size and are fused. Finally, the original data and the sigmoid-processed data are multiplied to produce the result. The specific structure of strip pooling is shown in Fig. 4.
In the horizontal and vertical directions of SP, the discretely distributed pixel regions and the band-shaped pixel regions are related to each other. Since the pooling kernel is long along one direction and narrow along the orthogonal direction, it can easily capture the local information of band-shaped features. These characteristics make strip pooling more suitable than average pooling based on square kernels.
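The two strip-pooling averages above are easy to sketch; the example uses a synthetic bright horizontal band as a stand-in for a runway-like structure (the 1-D expansion convolutions and sigmoid fusion are omitted for brevity):

```python
import numpy as np

def strip_pool(x: np.ndarray):
    """Strip pooling sketch: per-row and per-column averages of x in R^{H x W}.

    Returns the horizontal profile y_h (one value per row) and the vertical
    profile y_v (one value per column), matching the formulas in the text.
    """
    y_h = x.mean(axis=1)   # average across each row  -> R^H
    y_v = x.mean(axis=0)   # average down each column -> R^W
    return y_h, y_v

# A bright horizontal band stands out in y_h but is diluted in y_v,
# which is why square-kernel average pooling would blur it.
x = np.zeros((6, 6))
x[2, :] = 1.0            # synthetic band-shaped target
y_h, y_v = strip_pool(x)
print(y_h)               # row 2 pools to 1.0, all other rows to 0.0
```

This asymmetry is exactly the property the text describes: elongated targets survive pooling along their own direction.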

c: EDGE DETECTION MODULE
The Edge Detection Module (EDM) [13] contains a Global Convolutional Block (GCB) and a Boundary Refinement (BR) unit. EDM works in parallel with MSP, receiving the same output feature maps from the improved ResNet_101. GCB enhances the correlation between the feature maps and the pixel classification layer and improves the ability to process feature maps of different resolutions to generate global information. BR further improves the edge extraction capability of the encoder based on this global information. Fig. 5 shows the detailed structure of the EDM.
The GCB and BR in the EDM effectively address pixel classification and localization in semantic segmentation. GCB extends the size of the convolution kernel to the spatial size of the feature map, which keeps the feature map closely connected with the pixel classification layer, enhancing the ability to handle different features and obtain global information. The BR unit is then introduced to further improve the edge extraction capability of the network.
GCB adopts a convolutional construction to make full use of the multi-channel information of the features. For pixel classification, GCB employs a large convolution kernel, so that the semantic information of each pixel is not changed by image transformations (i.e., translation, flipping), and the relationship between pixels becomes closer. For pixel localization, GCB uses full convolution and introduces the matrix decomposition principle: convolutions of 1 × k + k × 1 and k × 1 + 1 × k are used to replace the large-kernel k × k convolution, which reduces the number of parameters and the amount of computation, and matches each pixel with the correct category to achieve accurate pixel segmentation. Since GCB has neither a Batch Normalization (BN) layer nor an activation function, a BR unit with a small convolution kernel is introduced to prevent pixel misclassification at object boundaries, achieving better classification and localization accuracy.
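The parameter saving of the GCB decomposition can be quantified. The sketch below counts weights per input-output channel pair, ignoring intermediate channel widths (an assumed simplification; the real module stacks two convolutions per branch):

```python
def gcb_params(k: int) -> tuple[int, int]:
    """Weight counts per channel pair: dense k x k kernel vs. GCB's branches.

    GCB replaces one k x k convolution with a (1 x k then k x 1) branch and
    a (k x 1 then 1 x k) branch: 2k weights per branch, 4k in total, vs. k^2.
    """
    dense = k * k
    separable = 4 * k   # two branches of 2k weights each
    return dense, separable

# For the large kernels GCB targets, the saving grows quickly with k.
for k in (7, 15, 31):
    d, s = gcb_params(k)
    print(k, d, s)   # e.g. k=15: 225 dense weights vs. 60
```

For small kernels the decomposition barely helps, which is consistent with GCB being motivated by kernels approaching the feature-map size.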

2) THE DECODER
The internal structure of the decoder is shown in Fig. 1. It is an edge-refinement decoder with two inputs: the feature maps generated by the encoder and the features output by the improved ResNet_101. The features from the encoder are first reduced by a 1 × 1 convolution, their edge information is decoded by EDM, and then 4× bilinear up-sampling is performed. These operations fully decode the edge information while reducing the number of feature channels. The resulting features are then concatenated with the features of the same spatial resolution output by the improved ResNet_101. Because the features from the improved ResNet_101 contain low-level features with a large number of channels, a 1 × 1 convolution is applied to them to reduce the number of channels and avoid unnecessary computation. After the concatenation, a 3 × 3 convolution refines the features, and a final 4× bilinear up-sampling produces the segmentation result.
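The shape bookkeeping of the decoder can be sketched as follows. All channel counts here are illustrative assumptions (the paper does not list them), nearest-neighbour up-sampling stands in for bilinear interpolation, and the EDM edge-decoding step is omitted to keep the sketch short:

```python
import numpy as np

def upsample4(x: np.ndarray) -> np.ndarray:
    """4x spatial up-sampling (nearest-neighbour stand-in for bilinear)."""
    return np.kron(x, np.ones((1, 4, 4)))

def conv1x1(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """1 x 1 convolution = per-pixel linear map over channels, w: (C_out, C_in)."""
    return np.einsum('oc,chw->ohw', w, x)

# Illustrative decoder shape flow (EDM step omitted):
deep = np.random.rand(256, 30, 30)    # encoder output (hypothetical size)
low = np.random.rand(64, 120, 120)    # low-level improved ResNet_101 features
deep = upsample4(conv1x1(deep, np.random.rand(48, 256)))  # reduce, then 4x up
fused = np.concatenate([deep, conv1x1(low, np.random.rand(16, 64))], axis=0)
print(fused.shape)   # (64, 120, 120); a 3x3 conv and another 4x up-sampling follow
```

The point of the two 1 × 1 convolutions is visible in the shapes: both inputs are slimmed before concatenation so the refinement convolution stays cheap.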

C. COORDINATE MAPPING
After GCAM extracts the runway areas from the medium-resolution SAR image, coordinate mapping is executed to produce the final result on the original high-resolution SAR image. The segmentation result and the original SAR image are then fused to generate the visualization results.
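Since the medium-resolution image was produced by a factor-5 down-sampling, each predicted pixel maps back to a 5 × 5 block of the original grid. The paper does not detail the interpolation used, so the sketch below assumes simple nearest-neighbour (block replication) mapping:

```python
import numpy as np

def map_to_full_resolution(mask: np.ndarray, factor: int = 5) -> np.ndarray:
    """Map a medium-resolution segmentation mask back to the original grid.

    Medium-resolution pixel (i, j) corresponds to the factor x factor block
    starting at (factor*i, factor*j) in the high-resolution image, so the
    mask is replicated block-wise (nearest-neighbour mapping assumed).
    """
    return np.kron(mask, np.ones((factor, factor), dtype=mask.dtype))

mask = np.array([[0, 1], [1, 0]], dtype=np.uint8)   # toy runway mask
full = map_to_full_resolution(mask, 5)
print(full.shape)   # (10, 10)
```

The up-scaled mask can then be overlaid on the original 1-m image to produce the fused visualizations shown in Figs. 7-9.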

IV. EXPERIMENTS

A. DATASET USED IN THE EXPERIMENT
To validate the proposed GCAM framework, a number of large-scale Gaofen-3 SAR images with 1-m resolution containing airports are utilized in the experiment. First, the SAR images are down-sampled by a factor of five to generate medium-resolution images. Then the LabelImage software is used for pixel labeling, with two classes: runway area and background. We cut the down-sampled medium-resolution SAR images into 480 × 480 patches to build a small dataset of 466 images in total, with a 4:1 ratio of training set to validation set. Fig. 6 (a)-(b) show samples of the dataset: the SAR image and the ground truth, in which the black areas denote the background and the red denotes the runway area. Fig. 6 (c) gives the corresponding optical remote sensing images. The runway area contains the runways, taxiways, parking aprons, and aircraft; Fig. 6 (d) shows a specific example of a local runway area, where the red area marks the parking aprons, the green area the runways, the yellow area the taxiways, and the light-green box shows the aircraft located on the parking aprons. Fig. 6 (e) shows the details of (d). In the experiments, however, we merge these target types into one category, namely the runway area.
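Cutting the medium-resolution images into 480 × 480 patches can be sketched as below. Overlap and border handling are not described in the paper, so this sketch assumes non-overlapping tiles with leftover border pixels dropped:

```python
import numpy as np

def tile_image(image: np.ndarray, tile: int = 480):
    """Cut a down-sampled SAR image into non-overlapping tile x tile patches.

    Border pixels that do not fill a whole tile are dropped (an assumption;
    the paper does not specify edge handling).
    """
    h, w = image.shape
    return [image[i:i + tile, j:j + tile]
            for i in range(0, h - tile + 1, tile)
            for j in range(0, w - tile + 1, tile)]

# A 2400 x 3000 medium-resolution image (Airport I scale) yields 5 x 6 tiles.
patches = tile_image(np.zeros((2400, 3000)), 480)
print(len(patches), patches[0].shape)
```

Each tile is then paired with the corresponding crop of the labeled ground truth to form one training sample.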

B. PARAMETERS SETTING
Heuristically, the learning rate is set to 0.00001 and the weight attenuation coefficient to 0.995. The batch size is 1 during training. The network is trained for 100 epochs, and a checkpoint is saved every 5 epochs. During training, the input images are randomly cropped with a window size of 480 × 480.
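The random 480 × 480 cropping used during training can be sketched as a joint crop of the image and its ground truth (the exact augmentation pipeline is not given in the paper; this is a minimal illustration):

```python
import numpy as np

def random_crop(image: np.ndarray, label: np.ndarray, size: int = 480, rng=None):
    """Random size x size crop applied jointly to an image and its ground truth."""
    rng = rng or np.random.default_rng()
    h, w = image.shape[:2]
    i = rng.integers(0, h - size + 1)   # top-left corner, kept inside the image
    j = rng.integers(0, w - size + 1)
    return image[i:i + size, j:j + size], label[i:i + size, j:j + size]

img, lab = random_crop(np.zeros((600, 700)), np.zeros((600, 700)))
print(img.shape, lab.shape)   # (480, 480) (480, 480)
```

Cropping the image and label with the same offsets keeps the pixel-level supervision aligned, which is essential for semantic segmentation.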

C. EVALUATION MEASUREMENTS
In this paper, Pixel Accuracy (PA) and Intersection Over Union (IOU) are utilized to evaluate the accuracy of runway extraction [1]. PA denotes the ratio of correctly segmented pixels to the total pixels of the same category in the ground truth, and IOU is the ratio of the intersection to the union of the extracted result and the ground truth. Mean Pixel Accuracy (MPA) is the average of the per-category PAs, and Mean Intersection Over Union (MIOU) is the average of the per-category IOUs. The specific calculation formulas are:

PA_i = P_ii / Σ_{j=0}^{k} P_ij,
IOU_i = P_ii / (Σ_{j=0}^{k} P_ij + Σ_{j=0}^{k} P_ji − P_ii),
MPA = (1/(k+1)) Σ_{i=0}^{k} PA_i,
MIOU = (1/(k+1)) Σ_{i=0}^{k} IOU_i,   (10)

where k + 1 is the total number of categories (the background is also one category). P_ij denotes the number of pixels that belong to category i but are predicted as category j (false positives), P_ji denotes the number of pixels that belong to category j but are predicted as category i (false negatives), and P_ii is the number of pixels correctly classified in category i.
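The MPA and MIOU definitions above amount to simple row/column sums over a confusion matrix; the toy matrix values below are illustrative only:

```python
import numpy as np

def mpa_miou(conf: np.ndarray):
    """MPA and MIOU from a (k+1) x (k+1) confusion matrix.

    conf[i, j] = number of pixels of true category i predicted as category j,
    so the diagonal conf[i, i] corresponds to P_ii in the text.
    """
    tp = np.diag(conf).astype(float)
    per_class_pa = tp / conf.sum(axis=1)                            # PA_i
    per_class_iou = tp / (conf.sum(axis=1) + conf.sum(axis=0) - tp)  # IOU_i
    return per_class_pa.mean(), per_class_iou.mean()

# Two categories: background (0) and runway area (1), toy pixel counts.
conf = np.array([[90, 10],
                 [5, 95]])
mpa, miou = mpa_miou(conf)
print(round(mpa, 4), round(miou, 4))   # 0.925 0.8604
```

Note that IOU penalizes false positives (off-diagonal column entries) while PA does not, which is why the gap between PA and IOU grows with the number of false detections, as discussed in Section IV-D.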

D. EXPERIMENT ANALYSIS AND EVALUATION
To test the proposed GCAM framework, five large-scale Gaofen-3 SAR images covering airports not used in the training dataset are employed. The sizes of the three large-scene high-resolution SAR images analyzed below are 12000 × 15000, 22850 × 20985, and 15000 × 17500. The dataset used in the experiment consists of 466 small samples, labelled manually and confirmed by SAR experts. Furthermore, two popular deep neural networks for semantic segmentation (RefineNet [14] and DeepLabV3+ [27]) and one neural network we designed before (MDDA [1]) are used as reference methods.
We analyze the training time of the network, the test time of the images, and the extraction accuracy of the runway areas before and after sampling.

1) THE EXTRACTION RESULT OF AIRPORT I
The size of the Airport I image is 12000 × 15000. Fig. 7 shows the detection results of the runway area for Airport I. Fig. 7 (a) is the SAR image from the Gaofen-3 system with 1-m resolution, and Fig. 7 (b) is the corresponding down-sampled SAR image; the medium-resolution image of Airport I input into GCAM has a size of 2400 × 3000. The texture of the targets in (a) is much clearer than in (b). Fig. 7 (c) is the ground truth of the airport corresponding to (b). Fig. 7 (d)-(g) are the extraction results generated by RefineNet, DeepLabV3+, MDDA, and the proposed GCAM, respectively. Fig. 7 (h)-(k) are the visualization results of fusing (d), (e), (f), and (g) with (b), respectively. Fig. 7 (l)-(o) are the fusion maps after coordinate mapping of Fig. 7 (h)-(k). The yellow boxes mark missed detection areas and the green boxes mark false detection areas.
As shown in Fig. 7 (a), Airport I is composed of a large-scale long runway area and an aircraft landing area. There are a large number of aircraft in the airport, which appear as obvious highlighted targets. In the non-airport area, there are clustered housing areas and intricate transportation lines. We test the medium-resolution SAR image (Fig. 7 (b)), whose size is 2400 × 3000. The segmentation results are shown in Fig. 7 (d)-(g), from which we can see that GCAM is the closest to the ground truth of Fig. 7 (c). RefineNet has the worst extraction result, with the most false detections and a considerable number of missed edge detections; DeepLabV3+ has a small number of missed detections in the runway area; in MDDA, parts of the extracted edges of the runway areas are incomplete. In contrast, there is no obvious missed or false detection in the GCAM result. Comparing our network result (g) with the DeepLabV3+ result (e) shows that EDM enhances the learning of edge features.

2) THE EXTRACTION RESULT OF AIRPORT II
The runway area of Airport II is relatively regular, mainly composed of long straight runways. There are no large residential areas except for small building groups near the airport edge, but there are many water areas around it. Water areas in SAR images show dark values similar to runway areas, which hampers runway area detection. The size of Airport II is 22850 × 20985, and we test its medium-resolution SAR image with a size of 4570 × 4197. Fig. 8 shows the detection results of runway areas for Airport II; Fig. 8 (a)-(o) are the same types of images as in Fig. 7. When a large-scene SAR image with a very small airport target is used for testing, all four networks can detect the runway area, but their false alarm performance differs. As shown in Fig. 8 (g), the GCAM result is the closest to the ground truth (Fig. 8 (c)). Regarding Fig. 8 (h)-(k): in terms of missed detections, GCAM, RefineNet, and MDDA have almost none, while DeepLabV3+ produced one; in terms of false detections, RefineNet detected almost all water areas as runway areas, and DeepLabV3+ and MDDA also produced many false detections, which shows that these three networks lack the ability to suppress false detections. GCAM has almost no false detections. MSP can fully extract the geographic information features of the runway area and effectively distinguish the features of the non-runway areas, which the other networks lack; this also proves the advantage of GCAM.

3) THE EXTRACTION RESULT OF AIRPORT III
The structure of the runway area and the surrounding features of Airport III are the most complicated of the three airports, with more airstrips, taxiways, rest stops, and parking aprons. Airport III is a civil airport whose runways are mostly short. Its edge features are more complicated, which requires the network to have a better ability to learn global semantic information and to effectively decode edge information. The size of the high-resolution SAR image of Airport III is 15000 × 17500, and the medium-resolution image is 3000 × 3500. Fig. 9 (d)-(l) show the segmentation results of the runway areas by the four networks. According to the results, RefineNet has a large number of false detections and the worst extraction result. DeepLabV3+ has a great many missed detections, indicating that its ability to learn edge information is not strong. The MDDA detection has two obvious boxes of false detection. The extraction result of our network is again the best, with only a few small areas missed, which demonstrates the effectiveness of the edge decoding.

4) ANALYSIS AND EVALUATION
To better evaluate the performance of the different networks for airport runway area detection, Fig. 10 shows the detection details of the four networks for Airport III. Tab. 1 lists the accuracy of these networks on the three medium-resolution SAR airport images. Tab. 2 lists the training time of these networks on the small dataset used in this paper and the test time of the four networks on the medium-resolution SAR images.
Comparing the detection results for Airport I and Airport II by the four networks (Fig. 7 and Fig. 8), for large-scene images with a large airport target, the extraction of the edge details of the runway area best highlights the performance of a network. RefineNet, DeepLabV3+, and MDDA all fail to extract the edge details of the runway area completely, while GCAM can extract almost all the edge details effectively, which fully shows that EDM strengthens the network's ability to learn edge information. As shown in the detection results for Airport II (Fig. 8), geographic information of the larger non-runway area is likely to interfere with the detection result: for a large-scene image with a small airport target, all four networks can extract the runway areas almost completely, but this also introduces many false detections. The IOU values of the runway areas by RefineNet, DeepLabV3+, and MDDA are all lower than 0.6, while that of GCAM is higher than 0.85, which shows the outstanding performance of GCAM. GCAM has almost no false detections, which fully demonstrates that MSP is a very powerful module that can effectively extract target geographic contextual information.
As shown in Fig. 10, the details of the detection results generated by the four networks in Airport III clearly demonstrate the advantages of GCAM. Comparing Fig. 10 (c)-(f), GCAM is the closest to the label for runway area extraction, which indicates that it can better extract the details and edge information. RefineNet is a typical semantic segmentation network, but its decoding network simply transfers features stage by stage, so its extraction result is poor, as evidenced by the many missed detections. DeepLabV3+ adds atrous convolution to expand the receptive field, but the lack of an attention mechanism makes its feature learning redundant, so it cannot extract the detailed information well. MDDA is suited to training on large datasets and cannot effectively learn features from a small-sample dataset, so it easily produces missed detections. The proposed GCAM alleviates these problems, and the details show that GCAM extracts the runway areas much better. In addition, more detailed edge information can be seen in Fig. 10 (f) than in Fig. 10 (c)-(e), owing to the higher resolution of its result.
As shown in Tab. 1, the MPA and MIOU of GCAM over the three airports are 0.9850 and 0.9536, which are higher than those of MDDA, DeepLabV3+, and RefineNet. According to calculation formula (9), the more false detections, the greater the difference between the values of PA and IOU. The difference between the PA and IOU of GCAM in the same airport runway area is very small, which indicates that GCAM can extract runway areas with limited false detections. Among the four networks, RefineNet has the lowest PA and IOU values, showing that its feature extraction ability for the runway areas is relatively weak. Its difference between PA and IOU is also the largest, which means that it easily produces false detections. DeepLabV3+ achieves better MPA and MIOU than RefineNet, so it acquires better extraction results of the runway areas with fewer false detections. Although the overall extraction result of MDDA is not bad, it is still not as good as that of GCAM, and its ability to learn details from a small-sample dataset is clearly insufficient. The performance of MDDA trained on a large dataset is given in [1].
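Since formula (9) is not reproduced here, the following minimal Python sketch uses one common definition of per-class PA (the fraction of labeled pixels recovered) and IOU for a binary runway mask; the function name pa_iou and the toy masks are illustrative assumptions, not the authors' implementation. It shows the relationship stated above: extra false detections leave PA unchanged while lowering IOU, widening the PA-IOU gap.

```python
import numpy as np

def pa_iou(pred, label):
    """Pixel Accuracy and Intersection-over-Union for a binary runway mask.

    pred, label: boolean arrays of the same shape (True = runway area).
    """
    tp = np.logical_and(pred, label).sum()    # runway pixels correctly detected
    fp = np.logical_and(pred, ~label).sum()   # false detections
    fn = np.logical_and(~pred, label).sum()   # missed detections
    pa = tp / (tp + fn)        # insensitive to false detections
    iou = tp / (tp + fp + fn)  # penalized by false detections
    return pa, iou

# Toy masks: both predictions recover every labeled runway pixel, but the
# second adds spurious detections, so PA stays at 1.0 while IOU drops.
label = np.zeros((10, 10), dtype=bool)
label[2:6, 2:6] = True                  # 16 runway pixels
clean = label.copy()                    # perfect prediction
noisy = label.copy()
noisy[7:9, 7:9] = True                  # 4 false detections
print(pa_iou(clean, label))             # (1.0, 1.0)
print(pa_iou(noisy, label))             # PA = 1.0, IOU = 16/20 = 0.8
```

Under this definition, a small PA-IOU gap (as reported for GCAM) directly indicates few false detections, while RefineNet's large gap indicates many.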
The EDM adds a small number of parameters to the network, which is why the training and test times of GCAM are slightly longer than those of DeepLabV3+. Nevertheless, for processing small-sample SAR image datasets, GCAM achieves fast, high-precision extraction of the runway areas, which is very useful in practice.
Based on the above analysis, GCAM can achieve fast and automatic extraction of the airport runway areas from high-resolution SAR images. GCAM is designed as a lightweight network, which greatly reduces the iteration time of the network layers, the training time, and the image testing time. The MSP enables the network to learn global features at multiple scales and to encode effective features. The EDM can completely decode and extract the edge information, and the parallel working mode of the EDM and MQF strengthens the learning of contextual semantic information. At the same time, GCAM is more suitable for training on small-sample datasets. At present, there is no large public dataset for airport segmentation in SAR images, so the data can only be annotated manually; the use of small datasets is therefore more practical to save time and cost. In terms of detection accuracy, dataset training time, and image testing time, GCAM is superior to RefineNet and DeepLabV3+. In overall performance, GCAM also outperforms the MDDA network we proposed before.

V. CONCLUSION
To achieve automatic, fast, and accurate detection of airports from high-resolution SAR imagery, a new framework based on GCAM is proposed. It includes three parts: down-sampling, GCAM, and coordinate mapping. The down-sampling introduces multiscale training samples of airports, which is favorable when data availability is limited. The GCAM proposed in this article can learn features at multiple scales to encode more contextual information and edge semantics. The coordinate mapping generates the detection results of the runway areas in the original high-resolution SAR images. According to the experimental results, GCAM outperforms several popular existing networks (DeepLabV3+, RefineNet, and MDDA) with an MPA of 0.9850 and an MIOU of 0.9536. With GCAM, there are nearly no false alarms and very few missed detections, which proves the effectiveness of the proposed network for airport detection. In addition, GCAM takes only 2.25 h to train, and the average testing time over the three airports is only 18.15 s, indicating its high efficiency. The main contributions of this article and the following research are summarized as follows: (a) Multiscale analysis using down-sampling and coordinate mapping can largely improve the detection efficiency of the runway areas in high-resolution SAR images; (b) The proposed GCAM can effectively handle similar image features within airports, which are a major problem for DeepLabV3+. This verifies the importance of effectively integrating geospatial knowledge, the attention mechanism, and contextual information.
(c) The proposed GCAM is only tested on Gaofen-3 SAR images with 1 m resolution. Next, we will study the generalization of the network to SAR images with different bands and different resolutions, for which transfer learning [37] will be introduced. In conclusion, GCAM is proposed in this article to detect airport runway areas; it can fully extract geospatial features and edge features to achieve fast and automatic airport detection with high precision, and it can greatly improve the detection efficiency in practical projects. Furthermore, once the runway areas are extracted, they can be utilized to enhance the detection results of aircraft by eliminating false detections. In addition, this mechanism may also be applied to the detection of other types of targets.
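The article does not give the coordinate mapping step of the framework explicitly; the sketch below assumes a simple nearest-neighbour mapping that projects a mask predicted on the down-sampled grid back onto the original high-resolution SAR grid (e.g. a 3000 × 3500 prediction back to the 15000 × 17500 image of Airport III). The helper map_mask_to_full_res is hypothetical, not the authors' implementation.

```python
import numpy as np

def map_mask_to_full_res(low_res_mask, full_shape):
    """Map a runway mask predicted on the down-sampled image back onto the
    original high-resolution grid (hypothetical nearest-neighbour sketch)."""
    H, W = full_shape
    h, w = low_res_mask.shape
    # for each high-res pixel, index of the low-res pixel that covers it
    rows = np.arange(H) * h // H
    cols = np.arange(W) * w // W
    return low_res_mask[np.ix_(rows, cols)]

# Miniature example: a 3x3 prediction mapped onto a 15x15 grid (factor 5),
# mirroring the factor-5 down-sampling used for Airport III.
small = np.zeros((3, 3), dtype=bool)
small[1, 1] = True                       # one detected runway cell
full = map_mask_to_full_res(small, (15, 15))
print(full.shape)                        # (15, 15)
print(full.sum())                        # 25: the cell expands to a 5x5 block
```

Because only integer index arithmetic and one fancy-indexing lookup are involved, the mapping cost is negligible next to network inference, consistent with the fast overall testing times reported.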