Edge Enhanced Channel Attention-Based Graph Convolution Network for Scene Classification of Complex Landscapes

Monitoring the land covers in complex landscapes is of great significance for the sustainable development of mine geo-environments. As most existing remote sensing scene datasets are composed of RGB images, there is a lack of multimodal datasets for complex landscapes with mining land covers (MLCs) at a fine scale. In this study, a new dataset was created by the China University of Geosciences (CUG), Wuhan (named CUG-MLCs) using ZiYuan-3 imagery-based multispectral and topographic data. Moreover, multisize objects, irregular or blurred edges, and spectral-spatial-topographic heterogeneity and variability limit the classification accuracy. Therefore, an edge enhanced channel attention-based graph convolution network (ECA-GCN) was proposed and tested. The proposed ECA-GCN includes three key modules: 1) multiscale and shallow feature fusion, used to fuse the multiscale convolutional features and shallow features, which helps present MLC features at various scales; 2) edge enhanced channel attention, used to further select effective channels after a spatial edge feature enhancement, which helps identify irregular or blurred MLCs; and 3) edge detection-based GCN, which uses an edge feature-based adjacency matrix and the feature maps from 2) to construct a GCN that obtains edge node relations and global contextual information. This framework improved the representation of complex landscape characteristics. The proposed ECA-GCN achieved an overall accuracy of 66.60% ± 1.39%, average accuracy of 36.25% ± 1.50%, and Kappa of 55.91% ± 2.05%, thus outperforming other models. In general, the proposed dataset and model benefit the fine classification of complex landscapes.


I. INTRODUCTION
Since the Industrial Revolution, greenhouse gas emissions have been associated with numerous issues [1], [2], [3]. Achieving carbon peaks [4] and carbon neutralization [5], [6] can effectively prevent the increase in greenhouse effects and mitigate the risk of climate change [7]. Technologies related to carbon neutralization have become a research hotspot in China and other major economies.
The mining areas are special, artificial, or semi-artificial terrestrial ecosystems, whose core is the mining operation area [8].
Mining areas and surrounding farmland, woodland, and other mining land cover constitute a complex geological environment. In these regions, plenty of geological and environmental issues may occur due to mining activities [9], [10], [11], [12], [13]. The consequences of the inflicted soil and vegetation damage are usually irreversible; therefore, it is important to restore and protect the environment in mining areas.
Land cover classification is a hot topic in the sphere of remote sensing due to various issues and challenges [14], [15], [16], [17], [18], [19]. Furthermore, monitoring land cover types in mining areas is important for ecological, environmental, and social development [20].
Pixel-oriented [21] and object-oriented [22], [23] methods have achieved impressive performances in MLC classification tasks; however, they have little semantic meaning. To obtain a semantic-level understanding of the meaning and the content of remote sensing images [24], we performed scene classification in complex landscapes. Although there are multiple available datasets such as the UC Merced Land-Use dataset [25], WHU-RS19 dataset [26], and SIRI-WHU dataset [27], the existing remote sensing scene datasets are basically composed of red, green, and blue bands (i.e., RGB images). Only a few datasets, such as SAT-4 airborne [31] and SAT-6 airborne datasets [31] are composed of RGB and near infrared (NIR) images. Table I shows a comparison of remote sensing scene classification datasets.
With respect to the classification methods of MLCs at the fine-scale, there are some advances. Complex landscape features in mining areas, especially the remarkable stereo topographic features and spectral-spatial variability, severely restrict the improvement of MLC classification [33], [34], [35], [36].
Previous  complex landscape areas. For MLC tasks, if we can ensure high spatial resolution as well as a rich number of images while introducing NIR and digital elevation model (DEM) data bands, the classification precision for complex landscapes will be further improved. Therefore, a multimodal scene dataset for the classification of MLCs at the fine-scale is necessary.
Nevertheless, the combination of artificial feature calculation methods and machine learning algorithms (MLAs) [37], [38], [39] is insufficient to achieve satisfactory classification accuracy. Compared with traditional MLAs, deep learning methods are much more accurate and efficient, and have been extensively used in the remote sensing domain.
For example, Li et al. [33] developed deep belief network (DBN)-based models for fine classification of MLCs using ZiYuan-3 (ZY-3) imagery-based multispectral and topographic data. They further developed a multilevel output-based DBN model [39]. Compared with DBN, a convolution neural network (CNN) can extract spatial features more effectively, and it performs well in remote sensing classification tasks. For instance, Zhao et al. [40] proposed a dense connection and dilated convolution-based model to capture more comprehensive spatial information.
Alternatively, some studies have investigated multiscale feature fusion strategies. Liu et al. [41] proposed a context-aware spectral-spatial feature extraction module to capture the multiscale features of scale invariance. Xia et al. [42] proposed a multiscale feature fusion network with a series of redesigned skip pathways. Zhang et al. [43] constructed a three-branch feature fusion network that uses Dual-Anchor triplet loss and nonlocal operation. Mei et al. [44] proposed a multilevel features fusion framework based on sparse representation. Gu et al. [45] proposed a generative adversarial networks (GANs) structure with a pyramidal multiscale structure. They achieve good classification results by multiscale feature fusion, multibranch fusion, or loss function design. There is a consensus that with the increase of network depth, features become more and more abstract. However, shallow features are also important for complex landscapes. Therefore, the integration of multiscale features and shallow features is worth exploring.
Some researchers attach importance to the attention mechanism. Tong et al. [46] proposed a densely connected convolutional network (DenseNet) with channel attention and a label-smoothing-based cross-entropy loss function. Ouyang et al. [47] proposed multichannel-feature-fusion landform recognition networks based on channel attention. Chen et al. [48] used global context spatial attention and DenseNet to obtain multiscale global scene features. Liu et al. [49] proposed a multidimensional CNN model with improved channel-spatial attention. Guo et al. [50] proposed a self-attention GAN with similarity loss.
MLCs are characterized by large edge differences. For example, a mining catchment has highly irregular edges and a concentrator has blurred edges. Therefore, the effective extraction of edge information will effectively improve the performance of MLC classification. Ma et al. [51] proposed a foreground activation framework with a dual-branch decoder and collaborative probability loss. It distinguishes foreground objects from the background, achieving similar effects to using edge information. Yang et al. [52] proposed an architecture with a block shuffle structure, super-pixel branch and self-boosting method to obtain precise edge contour. Liang et al. [53] proposed a dual-stream system structure that combines global visual features with positional functions to improve feature representation by using edge node relationships. Zhang et al. [54] proposed an architecture that uses an edge guidance module to learn edge attention representation and aggregate it with feature information. Wang et al. [55] proposed an architecture to extract multigranularity edge features, and jointly learn the segmentation object mask and edge detection. Zhang et al. [56] proposed an architecture that uses soft boundary detection to transform raw data features and obtain global context information. As mentioned above, combining edge information to improve attention is a direction that can be explored.
In recent years, increasingly more attention has been focused on graph convolution networks (GCN) which utilize the correlation between land cover categories by encoding remote sensing images to form maps. Compared with CNN, GCN improves the inapplicability of translation invariance on non-matrix structured data, and its essence is to extract the spatial features of topological graphs. Liu et al. [57] proposed a heterogeneous deep network combining CNN with GCN for pixel and super-pixel feature fusion. Zhou et al. [58] proposed a depth-wise separable GCN model in which the feature graph adjacency matrix was constructed using a Sobel operator.
In addition to the above models, a deep learning model called transformer has gradually been applied in the field of remote sensing. The model divides the image into blocks. The context is captured using the relationship between image blocks. Bazi et al. [59] applied an attention mechanism to focus on different areas of the image and integrate global information. Tang et al. [60] proposed a transformer that used multilevel features to mine the potential context information of remote sensing scenes. However, we believe that in complex landscapes, the transformer model has limitations for feature modeling and high computational complexity.
In this study, a new multimodal dataset was constructed by the China University of Geosciences (CUG), Wuhan (named CUG-MLCs), using ZY-3 imagery. Moreover, in order to extract complex landscape features of multisize objects, irregular or blurred edges, and spectral-spatial-topographic heterogeneity and variability, we proposed and tested an edge enhanced channel attention-based graph convolution network (ECA-GCN). ECA-GCN has the following three key structures: 1) multiscale and shallow feature fusion, which fuses the multiscale convolutional features and the shallow features, thus aiding the presentation of MLC features at various scales; 2) an edge enhanced channel attention that further selects effective channels after a spatial edge feature enhancement, which helps identify irregular or blurred MLCs; and 3) an edge detection-based GCN that uses the edge feature-based adjacency matrix and the feature maps from the edge enhanced channel attention to construct the GCN, which can obtain edge node relations and global contextual information. This framework improved the representation of complex landscape characteristics. The remainder of the article is organized as follows: Section II introduces the CUG-MLCs dataset; Section III introduces the details of ECA-GCN; Section IV illustrates the experimental settings and classification results; Section V discusses the algorithm and dataset; and Section VI concludes the article and gives an outlook for future work.

II. CUG-MLCS DATASET
This section introduces the content, production process, and classification system of the CUG-MLCs dataset. Furthermore, the characteristics of the CUG-MLCs dataset are described.

A. Description of the Proposed CUG-MLCs Dataset
The CUG-MLCs dataset developed for this study contains 6125 images. It is a multimodal dataset meant for MLCs at a fine scale. It contains five channels, namely the DEM, RGB, and NIR bands. All images have a size of 64 pixels × 64 pixels. The dataset consists of 20 land covers. The number of samples in each class of the CUG-MLCs dataset varies from 19 to 2333. Table II shows the basic information on the CUG-MLCs dataset. Fig. 1 illustrates example images from the dataset.

B. Remote Sensing Data Acquisition and Preprocessing
The study area is located in Wuhan City, China. The range of longitude and latitude is: 114°16' E-114°20' E and 30°16' N-30°18' N. The study area is 109.4 km² and presents typical surface mining and agricultural landscape characteristics [61]. Fig. 2 shows the RGB imagery of the study area and division of scene data.
ZY-3 satellite imagery was selected as the data source and the images used are from scenes obtained with different cameras on June 20, 2012. Multispectral images were obtained using some preprocessing methods.
A DEM with a resolution of 10 m was generated using the stereo image pair data. Subsequently, the DEM was resampled to 2.1 m to match the following multispectral image.
Combining the generated DEM data and the rational polynomial coefficients, orthorectification was conducted on the panchromatic and multispectral data. Next, based on corrected panchromatic images, geometric registration of multispectral data was performed using a quadratic polynomial function. The resampling error was controlled within 0.5 pixels using the cubic convolution method. Finally, the Gram-Schmidt method was used to fuse the registered panchromatic and multispectral data. A fused multispectral image with a resolution of 2.1 m was obtained.

C. Visual Interpretation, Cropping, and Classification System
Manual visual interpretation and image clipping were performed on the preprocessed data to form the final CUG-MLCs dataset. In manual visual interpretation, the representative and significant land cover types investigated in this study were selected. In the clipping process, the image was sequentially clipped from the upper left corner of the image. The image size was 64 pixels × 64 pixels. Finally, a total of 6125 images were divided. Table III shows the description of the MLC classification system and the number of images.
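The sequential clipping described above can be sketched as follows. This is a minimal illustration, not the authors' tool: the function name `clip_tiles`, the non-overlapping stride, the discarding of incomplete border tiles, and the dummy scene size are all assumptions that the text does not specify.

```python
import numpy as np

def clip_tiles(image, tile=64):
    """Clip a (bands, H, W) array into tile x tile patches, scanning
    row by row from the upper-left corner.

    Assumptions: patches do not overlap and incomplete patches at the
    right and bottom borders are discarded.
    """
    bands, height, width = image.shape
    tiles = []
    for top in range(0, height - tile + 1, tile):
        for left in range(0, width - tile + 1, tile):
            tiles.append(image[:, top:top + tile, left:left + tile])
    return tiles

# Hypothetical 5-band scene of 200 x 130 pixels (not the real study area).
scene = np.zeros((5, 200, 130), dtype=np.float32)
patches = clip_tiles(scene)
print(len(patches), patches[0].shape)
```

Each patch keeps all five bands, so the multispectral and DEM values for one scene stay aligned.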

D. CUG-MLCs Dataset Size
Take stope and asphalt road as examples (see Fig. 3; the pictures on the right are the ground truth of the corresponding area after manual visual interpretation). In a complex landscape, images of different sizes covering the same area contain different mixtures of categories. Consider the pictures labeled as stope. A 48 × 48 image contains few categories, but the proportion of stope in it is small. In larger images, such as 128 × 128, 224 × 224, and 256 × 256, stope accounts for a large proportion, but there are too many categories, which may make feature extraction difficult. For the pictures labeled as asphalt road, the categories and proportions in the 48 × 48 and 64 × 64 images are similar, and larger images face the same problems as the stope pictures. Therefore, based on qualitative and quantitative considerations, we chose 64 × 64 as the image size. The effective use of edge node information will help to improve the classification accuracy at the fine scale.

III. METHODS
The network structure of the ECA-GCN is shown in Fig. 4. It contains the following four parts.
1) Multiscale and shallow feature fusion: The size of land cover types in different mining areas varies greatly, and there is much background information; therefore, we extracted multiscale information using multiscale convolution kernels to focus on multiscale feature representation. As the number of network layers increases, some features are inevitably lost in the deep layers, which may affect the accuracy for small-sized objects. Therefore, we added a shallow feature branch to fuse the deep features with the shallow features.
2) Edge enhanced channel attention: The importance of multiscale features differs from that of shallow features; thus, after obtaining the features from 1), we added the channel attention module. In addition, to further highlight the importance of local edge information, we enhanced the edge information in the attention module. This edge-dependent design highlights the importance of local image details.
3) Edge detection-based graph convolution network: We used the Canny operator to detect edges. After processing, the detection result was used as the adjacency matrix of the GCN. After propagating through each layer of the GCN, all features were fused. The GCN received the features from the above modules and used the edge node relationship to further capture the global context information.
4) Classification: After the GCN, we stacked three convolutional pooling structures to reduce the channel dimensions and feature map size. Layer-by-layer dimension reduction ensures the preservation of representative features. Ultimately, the classification result is output through a 1 × 1 convolution.

A. Multiscale and Shallow Feature Fusion
The deep learning method exhibits improved feature extraction as well as classification performance when compared to traditional machine learning methods, and is characterized by increased robustness and easy migration. CNN is an important part of a deep learning method, and its strong representational learning ability has attracted wide attention.
The land cover classification network of the mining area accepts RGB + NIR + DEM images as input. As the number of convolution layers increases, the extracted features progress from fine to coarse. However, a classical CNN usually extracts features by stacking identical convolution layers. In contrast, convolution kernels of different sizes acquire features at different scales, effectively expanding the information flow, which is helpful for the recognition of small-sized objects. To avoid losing significant image detail information with the increase in depth, we added a shallow feature extraction branch so that multiscale deep features are fused with shallow features.
The multiscale feature extraction part of the multiscale and shallow feature fusion module comprises four branches. The first branch uses a 1 × 1 convolution kernel. The second branch uses a 3 × 3 convolution kernel, with padding set to 1 to ensure that the size of the feature map is unchanged. The third branch uses a 5 × 5 convolution kernel, with padding set to 2. The fourth branch uses two 3 × 3 convolution kernels. After the convolution layer, a BN layer and the ReLU activation function are added. Finally, a 2 × 2 MaxPool is used to reduce the feature map size. The shallow feature extraction part of the multiscale and shallow feature fusion module consists of a 1 × 1 convolution and a MaxPool. The 1 × 1 convolution is used to normalize the number of feature channels, and the MaxPool is used for feature dimensionality reduction. Ultimately, we fuse the deep and shallow features through the channel-stacking operation to obtain the module output. The structure of the multiscale and shallow feature fusion module is shown in Fig. 5.
Here, we stack three multiscale and shallow feature fusion modules as our feature extraction structure. This is mainly because the input image is only 64 pixels × 64 pixels, and the feature map shrinks after each pooling-induced dimension reduction. To keep the feature map at a moderate size while retaining sufficient network depth, we chose to stack three modules.
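The effect of stacking three modules on the feature map size can be checked with the standard output-size formula for convolution and pooling layers. The sketch below is only a bookkeeping aid under stated assumptions: the helper names are hypothetical, the MaxPool stride is assumed to be 2, the two 3 × 3 convolutions of the fourth branch are assumed to use padding 1, and pooling is applied once after the size check, which yields the same spatial size as pooling inside each branch.

```python
def conv_out(size, kernel, padding=0, stride=1):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

def fusion_module_out(size):
    """Spatial size after one multiscale and shallow feature fusion module."""
    branches = [
        conv_out(size, kernel=1),                             # branch 1: 1 x 1
        conv_out(size, kernel=3, padding=1),                  # branch 2: 3 x 3
        conv_out(size, kernel=5, padding=2),                  # branch 3: 5 x 5
        conv_out(conv_out(size, 3, padding=1), 3, padding=1),  # branch 4: two 3 x 3
        conv_out(size, kernel=1),                             # shallow branch: 1 x 1
    ]
    # Channel stacking requires every branch to have the same spatial size.
    assert len(set(branches)) == 1
    return conv_out(branches[0], kernel=2, stride=2)          # 2 x 2 MaxPool

sizes = [64]
for _ in range(3):
    sizes.append(fusion_module_out(sizes[-1]))
print(sizes)  # [64, 32, 16, 8]
```

Under these assumptions a 64 × 64 input leaves an 8 × 8 feature map after three modules, so a fourth module would shrink it further to 4 × 4.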

B. Edge Enhanced Channel Attention
SENet [62] is an image recognition structure published in 2017. By learning the correlation between feature channels, SENet strengthens the channels with strong representation ability and weakens the secondary channels. It alleviates the loss caused by different channel representation capabilities. The convolutional block attention module (CBAM) [63] was proposed based on SE attention; it adds a spatial attention module after the channel attention module. The attention weights are derived along the channel and spatial dimensions, so the module can learn the importance of features and of spatial positions, respectively.
The fusion module combines multiscale and shallow features. We select channels with strong representation ability by adjusting the channel weights. To adjust the weights promptly, we add an edge enhanced channel attention mechanism after each multiscale and shallow feature fusion module. The improved edge enhanced channel attention is shown in Fig. 6.
SE attention includes two stages: squeeze and excitation. In the squeeze stage, the input feature map has a size of C × H × W, where C represents the number of channels, and H and W represent the height and width of the feature map, respectively. We used AdaptiveAvgPool to compress each feature map into a single value; thus, the feature map becomes a C × 1 × 1 vector. The output of the squeeze stage can be calculated as follows:
$$z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j)$$
where $u_c$ denotes the $c$th channel of the input feature map and $z_c$ is the corresponding squeezed value.
In the excitation stage, to effectively use the descriptors obtained from the squeeze stage, we ensure that the channel-wise dependencies are fully captured. Two nonlinear fully connected layers are used. The ReLU activation function is added after the first fully connected layer, and Sigmoid is added after the second. The output of the excitation stage can be calculated as follows:
$$s = \sigma\left(W_2\, \delta\left(W_1 z\right)\right)$$
where $\delta$ denotes ReLU, $\sigma$ denotes Sigmoid, and $W_1$ and $W_2$ are the weights of the two fully connected layers.
Next, we assigned the corresponding weight to each channel and obtained the final output through channel-wise multiplication:
$$\tilde{x}_c = s_c \cdot u_c$$
The edge enhancement mechanism used the Canny operator to extract the edge information from each channel feature map. The edge enhancement matrix is formed through superposition and normalization. The edge information of the original feature map is enhanced through matrix multiplication. Then, the squeeze and excitation stages are performed. Refer to Section III-C for the calculation process of the Canny operator.
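The squeeze, excitation, and channel-wise rescaling steps can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function name `se_attention`, the reduction ratio `r`, the random weights, and the omission of bias terms are assumptions, and the Canny-based edge enhancement that precedes the squeeze stage in the proposed module is left out.

```python
import numpy as np

def se_attention(u, w1, w2):
    """Squeeze-and-excitation over a (C, H, W) feature map (biases omitted)."""
    z = u.mean(axis=(1, 2))                    # squeeze: global average pooling -> (C,)
    hidden = np.maximum(w1 @ z, 0.0)           # first fully connected layer + ReLU
    s = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))   # second fully connected layer + Sigmoid
    return s[:, None, None] * u, s             # channel-wise multiplication

rng = np.random.default_rng(0)
C, H, W, r = 8, 16, 16, 4                      # r: reduction ratio (assumed value)
u = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // r, C))          # random stand-ins for trained weights
w2 = rng.standard_normal((C, C // r))
out, weights = se_attention(u, w1, w2)
print(out.shape, weights.round(3))
```

The output keeps the input shape; only the relative scale of each channel changes, which is what allows the module to be inserted after every fusion module.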

C. Edge Detection-Based GCN
GCN utilizes the graph structure and aggregates node information from the neighborhoods [64]. Therefore, GCN performs well in modeling long-range spatial relations [65].
The formula of graph convolution used in this article is defined as follows:
$$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right)$$
where $\sigma(\cdot)$ denotes an activation function, $\tilde{A}$ represents the adjacency matrix, and $\tilde{D}_{ii} = \sum_{j} \tilde{A}_{ij}$. $W^{(l)}$ is a layer-specific trainable weight matrix and $H^{(l)} \in \mathbb{R}^{N \times D}$ is the matrix of activations in the $l$th layer. The edge detection-based GCN uses the Canny operator to extract the edge information and construct the graph structure, allowing it to achieve a better feature representation.
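A single propagation step of this kind can be sketched in NumPy as below. The sketch assumes that self-loops are added to the adjacency matrix before normalization and that ReLU is the activation; neither choice is stated in the text, and the two-node graph is a toy example rather than an edge map from the dataset.

```python
import numpy as np

def gcn_layer(adj, h, w):
    """One graph convolution step with symmetric degree normalization."""
    a_tilde = adj + np.eye(adj.shape[0])       # add self-loops (assumption)
    deg = a_tilde.sum(axis=1)                  # D_ii = sum_j A_ij
    d_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    a_hat = d_inv_sqrt @ a_tilde @ d_inv_sqrt  # D^(-1/2) A D^(-1/2)
    return np.maximum(a_hat @ h @ w, 0.0)      # ReLU activation (assumption)

adj = np.array([[0.0, 1.0],
                [1.0, 0.0]])                   # toy graph: two connected nodes
h = np.eye(2)                                  # one-hot node features
w = np.eye(2)                                  # identity weights for readability
out = gcn_layer(adj, h, w)
print(out)  # every entry is 0.5: each node averages itself and its neighbor
```

With identity features and weights, the output is simply the normalized adjacency matrix, which makes the neighborhood aggregation easy to inspect.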
The key to a GCN is generating the adjacency matrix for the input tensor $H \in \mathbb{R}^{N \times D}$. In the complex landscape, a road presents a linear structure in the image; although its proportion is small, its edge is relatively regular. The mining catchment occupies a small area in the image and has no stable shape. We believe that for multisize MLCs with irregular edges, edge nodes contain more effective information. The edge nodes can highlight the characteristics of multiscale land cover types against the background information, which indicates the efficiency of building edges between these nodes. The edge detection-based GCN is shown in Fig. 7.
1) Graph Construction: The Canny operator is a commonly used edge detection filter. The calculation process can be divided into four stages:
a) Image filtering: First, the Canny operator uses a blur filter to eliminate noise from the input image. Here we use a Gaussian filter, and the Gaussian kernel size is set to 3 × 3
$$G(x, y) = \frac{1}{2\pi\sigma^{2}} e^{-\frac{x^{2} + y^{2}}{2\sigma^{2}}} \tag{5}$$
where $\sigma$ represents the parameter of the Gaussian filter, which controls the blurring degree of the image (set to 1).
b) Image gradient calculation: The amplitude and direction of the image gradient should be calculated. We use the Sobel filter with a kernel size of 3 × 3, decomposed into two filters. The first kernel is used to extract the horizontal gradient $G_x$ and the second kernel is used to extract the vertical gradient $G_y$. The gradient size $G$ can be calculated as follows:
$$G = \sqrt{G_x^{2} + G_y^{2}}$$
The templates of the Sobel operator for gradient calculation are as follows:
$$S_x = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}, \qquad S_y = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix}$$
The gradient direction is calculated as follows:
$$\theta = \arctan\left(\frac{G_y}{G_x}\right)$$
c) Nonmaximum suppression (NMS): Edge refinement is performed using the NMS method, ensuring that each edge is a single pixel in width. This step requires examining the 8-neighborhood of each pixel. If a pixel has the maximum intensity compared with its neighbors, it is a local maximum, and the pixel is retained. First, we created the kernel $\mathrm{NMS}_{0^{\circ}}$ in the 0° direction with a size of 3 × 3 and constructed the direction matrix $R(\theta)$.
d) Checking and connecting edges: The double threshold method is used to select the edge points after NMS. Pixels whose gradient amplitude is lower than the low threshold are selected as nonedge points. Pixels whose gradient amplitude is higher than the high threshold are selected as edge points. Pixels whose gradient amplitude lies between the low and high thresholds are selected as candidate edge points. When a candidate edge point is directly connected with an edge point, it is considered a part of the edge; otherwise, the point is discarded.
Therefore, the adjacency matrix of the graph convolution can be calculated as follows:
$$\tilde{A} = F(i)$$
where $F$ represents the Canny operator, and $i$ represents the original image.
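The first two Canny stages described above (Gaussian filtering and Sobel gradient calculation) can be sketched in NumPy as follows. This is an illustrative sketch only: the helper names are hypothetical, the kernel is renormalized to sum to 1, zero padding is used at the borders, and the Sobel sign convention is the common one; NMS and double thresholding are not shown.

```python
import numpy as np

def gaussian_kernel(size=3, sigma=1.0):
    """3 x 3 Gaussian kernel sampled from G(x, y), renormalized to sum to 1."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    g = np.exp(-(x**2 + y**2) / (2 * sigma**2)) / (2 * np.pi * sigma**2)
    return g / g.sum()

def filter2d(img, kernel):
    """Same-size cross-correlation with zero padding."""
    kh, kw = kernel.shape
    padded = np.pad(img, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def gradient(img):
    """Gradient amplitude and direction after Gaussian smoothing."""
    smoothed = filter2d(img, gaussian_kernel())
    gx = filter2d(smoothed, SOBEL_X)           # horizontal gradient
    gy = filter2d(smoothed, SOBEL_Y)           # vertical gradient
    return np.hypot(gx, gy), np.arctan2(gy, gx)

img = np.zeros((8, 8))
img[:, 4:] = 1.0                               # toy image with a vertical step edge
magnitude, direction = gradient(img)
print(magnitude[4].round(2))
```

On the toy image, the amplitude is nonzero only around the step between columns 3 and 4 (plus border effects caused by zero padding), and the direction there is close to 0 because the intensity changes only horizontally.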
2) Graph Convolution: After the Canny operator was used to generate the adjacency matrix, the fully connected graph was generated, and the multiscale features extracted after edge enhanced channel attention were then inputted into the GCN. Finally, the GCN output was fed into the designed classifier to obtain the classification results. The GCN should not be too deep; generally, two to three layers are used. The GCN proposed in this article uses a three-layer graph convolution. Compared with the standard convolution, GCN can expand the receptive field, obtain long-distance dependence relationships, and efficiently exchange information in a larger range. Therefore, GCN has better feature expression ability.

D. Classification
To effectively utilize the representative features, we carefully designed the classification structure. The classification module accepts feature maps from the GCN. We use a 3 × 3 convolution to reduce the number of channels while keeping the feature map size constant. We then reduce the size of the feature map using maximum pooling, with each convolution layer followed by a BN layer and the ReLU activation function. To reduce the size of the feature map to 1, we stacked three convolution-pooling structures. Ultimately, a 1 × 1 convolution layer was used to classify the land covers in mining areas.

A. Experimental Settings
1) Machine Configuration: The experiments were run on the CentOS 7 system. The scene classification algorithms were implemented using the PyTorch framework. The machine had 128 GB of memory, the GPU was an RTX 2080 Ti with 11 GB of memory, and the CPU was an Intel Xeon(R) E5-2620 v4.
2) Parameter Optimization: At the beginning of the experiment, the images were normalized. The parameters were adjusted according to the accuracy on the validation set. We set the learning rate to 0.0001. The batch size was 64 or 32, depending on the algorithm. Each training run comprised 200 iterations. We used cross entropy as the loss function. Each experiment was performed five times to obtain the average value.
In addition, the model with the best validation accuracy was the optimal model and was saved. Fig. A1 shows the optimization of the parameters of the selected algorithm.
3) Model Comparison: We conducted a comparative experiment between the proposed ECA-GCN and other scene classification models. Four classical CNN models, VGGNet-16, ResNet-18, ResNet-101, and DenseNet-121, were used to conduct experiments on the CUG-MLCs dataset to provide benchmark results for subsequent studies. Based on these classical networks, two attention mechanisms, SE and CBAM, were added to the four CNN models to form attention networks for experiments. For the VGGNet-16 network, the SE and CBAM modules were each added before the first fully connected layer. For the ResNet-18 and ResNet-101 networks, the CBAM module was added to the last residual block, and the SE module was added to the residual block after the second BN layer. For the DenseNet-121 network, the SE and CBAM modules were added before the last fully connected layer.
Data fusion is a current research hotspot in the field of remote sensing. Therefore, based on the classical CNNs, dual-stream and three-stream fusion network experiments were carried out. For the dual-stream network, the first branch took a multispectral image as input, while the second branch took the DEM. For the three-stream network, the first branch took RGB, the second branch took NIRRG, and the third branch took the DEM. VGGNet-16, ResNet-18, ResNet-101, and DenseNet-121 were respectively used as the backbone network of each branch, and feature concatenation was used as the fusion mode for model building. The VGGNet-16 network performs data fusion before the last fully connected layer, the ResNet-18 and ResNet-101 networks perform data fusion after the first convolution layer, and the DenseNet-121 network performs data fusion before the last fully connected layer. Furthermore, we selected some recent models, such as ShuffleNet V2 [66], EfficientNet [67], CAD [46], MF²CNet [68], PDCNet [40], GCSANet [48], and EMTCAL [57], to compare the effectiveness of our proposed model.
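How a five-channel patch might be split into the branch inputs described above is sketched below. The channel order (R, G, B, NIR, DEM) and the function names are assumptions made for illustration; the text does not state how the bands are stored.

```python
import numpy as np

# Assumed channel order within a patch; the actual order is not given in the text.
R, G, B, NIR, DEM = range(5)

def dual_stream_inputs(patch):
    """Multispectral branch (R, G, B, NIR) and DEM branch."""
    return patch[[R, G, B, NIR]], patch[[DEM]]

def three_stream_inputs(patch):
    """RGB branch, NIRRG branch (NIR, R, G), and DEM branch."""
    return patch[[R, G, B]], patch[[NIR, R, G]], patch[[DEM]]

patch = np.arange(5 * 64 * 64, dtype=np.float32).reshape(5, 64, 64)
ms, dem = dual_stream_inputs(patch)
rgb, nirrg, dem3 = three_stream_inputs(patch)
print(ms.shape, dem.shape, rgb.shape, nirrg.shape, dem3.shape)
```

In the three-stream split the R and G bands appear in two branches, so the branches do not carry fully independent information.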

B. Results of Accuracy Assessment
We calculated the overall accuracy (OA), average accuracy (AA), and Kappa values of the test set. Table IV shows the evaluation indicators of the CUG-MLCs dataset. Among all models, the proposed model achieved an OA of 66.60%, AA of 36.25%, and Kappa of 55.91%, the best performance among the analyzed models on all evaluation indicators. The results show the effectiveness of our proposed multiscale and shallow feature fusion module, edge enhanced channel attention module, and GCN.
1) RGB: Among the four classical CNNs, VGGNet-16 showed the best classification effect. It achieved an OA of 57.65%, AA of 21.15%, and Kappa of 43.46%. VGGNet performed better on the CUG-MLCs dataset than ResNet and DenseNet. This might be due to the obvious imbalances between the categories of this dataset. In the mining area scenario, the shallow features have a greater impact on the classification of land cover categories than the deep features. Therefore, when designing the network, considering both the deep and shallow features will effectively improve the classification accuracy. Fig. A2 shows the confusion matrix.
2) NIRRG: Among the four classical CNNs, VGGNet-16 showed the best classification effect. It achieved an OA of 60.74%, AA of 26.52%, and Kappa of 48.53%. Compared with the RGB band, the OA and Kappa increased by 3.09% and 5.07%, respectively. We hypothesize this is because the NIR band has advantages in reflecting the difference of infrared characteristics reflected or radiated by vegetation and other ground objects.
3) Five Channels: The classical CNN with the same network structure only changes the input data from 3-band to 5-band. According to our observations, the OA of VGGNet-16, ResNet-18, ResNet-101, and DenseNet-121 increased by 4.00%, 10.59%, 6.58%, and 6.83%, respectively. It can be concluded that the network with the same structure can effectively improve the model accuracy by adding data sources, that is, adding NIR and DEM bands to their RGB basis.
The dynamic selection of feature information was achieved through the attention mechanism. The SE module focuses on the relationship between different channels and helps the model automatically learn the importance of different channel features. In the CBAM module, the channel attention module is followed by a spatial attention module to achieve the dual mechanism of channel attention and spatial attention. Rather than a single maximum pooling or average pooling, CBAM combines both: their outputs are added in the channel attention module and stacked in the spatial attention module. The two attention mechanisms were added to the four networks and compared with the original networks. Compared with VGGNet-16, neither VGGNet-16-SE nor VGGNet-16-CBAM led to a significant accuracy improvement. In contrast, the OA of ResNet-18-SE and ResNet-18-CBAM increased by 1.69% and 2.61%, while the Kappa increased by 1.59% and 3.08%, respectively, when compared to ResNet-18. Compared with ResNet-101, the OA of ResNet-101-SE and ResNet-101-CBAM increased by 1.2% and 1.32%, while Kappa increased by 1.25% and 1.09%, respectively. Compared with DenseNet-121, the OA of DenseNet-121-SE and DenseNet-121-CBAM increased by 0.99% and 0.7%, respectively, while their Kappa increased by 1.18% and 0.43%, respectively. This indicates that, except for VGGNet-16, the classification accuracy of the networks improved after adding the SE and CBAM modules, with the most significant improvement being associated with the ResNet-18 network. For VGGNet-16, we hypothesize that the model was already overfitting before attention was added; therefore, adding these parameters exacerbated the overfitting problem, resulting in no improvement in performance.
The dual-stream network uses different branches to extract the features of different bands and obtain difference information. Feature fusion was used to realize feature complementation, with each branch of the four dual-stream networks using a network with the same structure for feature extraction. Compared with the networks that take only RGB band images as input, the accuracy of the dual-stream VGGNet-16 was not significantly improved. In contrast, the OA of the dual-stream ResNet-18 increased by 10.39% and its Kappa increased by 14.51%. The OA of the dual-stream ResNet-101 increased by 6.71% and its Kappa increased by 11.63%. The OA of the dual-stream DenseNet-121 increased by 6.64% and its Kappa increased by 9.63%. Compared with the single-branch networks that take five bands as input, the OA of VGGNet-16 was 4.96% higher than that of the dual-stream VGGNet-16, the OA of ResNet-18 was 0.2% higher than that of the dual-stream ResNet-18, and the OA of DenseNet-121 was 0.19% higher than that of the dual-stream DenseNet-121; only for ResNet-101 was the dual-stream variant better, by 0.13% in OA. The fusion strategy of the dual-stream network we used was simple channel superposition. It can be seen that directly inputting 5-band images is generally better than using two branches. This may be because different data bands have different importance for land cover classification in the mining area. The fusion method of channel superposition is equivalent to giving the same weight to the features extracted by the two branches, which makes the classification effect not ideal.
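The point about channel superposition treating both branches equally can be illustrated with a small sketch. The names `concat_fusion` and `weighted_concat_fusion`, the branch weights, and the toy feature values are assumptions made for illustration; the weighted variant is a hypothetical alternative, not a method evaluated in this article.

```python
def concat_fusion(feat_a, feat_b):
    """Channel superposition: stack the channels of both branches unchanged."""
    return feat_a + feat_b

def weighted_concat_fusion(feat_a, feat_b, weight_a, weight_b):
    """Hypothetical alternative: scale each branch before stacking."""
    scaled_a = [[v * weight_a for v in ch] for ch in feat_a]
    scaled_b = [[v * weight_b for v in ch] for ch in feat_b]
    return scaled_a + scaled_b

# Each branch output is a list of channels; each channel is flattened to a list.
branch_a = [[1.0, 2.0], [3.0, 4.0]]   # two channels from the first branch
branch_b = [[10.0, 20.0]]             # one channel from the second branch

fused = concat_fusion(branch_a, branch_b)
reweighted = weighted_concat_fusion(branch_a, branch_b, 0.8, 0.2)
```

Plain superposition is the special case in which both branch weights equal 1, so no branch can be emphasized over the other at the fusion step itself.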
Like the dual-stream network, the three-stream network uses different branches to extract differential information. Compared with the networks that take only RGB band images as input, the accuracy of the three-stream VGGNet-16 was not significantly improved. The OA of the three-stream ResNet-18 increased by 7.18% and its Kappa increased by 7.77%. The OA of the three-stream ResNet-101 increased by 4.89% and its Kappa increased by 8.35%. The OA of the three-stream DenseNet-121 increased by 3.33% and its Kappa increased by 4.64%. Compared with the dual-stream network, the accuracy of the three-stream network was not improved. We hypothesize that this is because the three-stream network produces a large number of low-efficiency, redundant features. Thus, to extract more representative features, inefficient features can be removed or efficient features can be enhanced.
Compared with the analyzed networks, the ECA-GCN proposed in this article performed best. Compared with ShuffleNet V2, the OA of ECA-GCN increased by 15.11% and its Kappa increased by 23.13%. Compared with EfficientNet, the OA of ECA-GCN increased by 14.53% and its Kappa increased by 22.48%. Compared with CAD, the OA of ECA-GCN increased by 7.55% and its Kappa increased by 10.6%. Compared with MF²CNet, the OA of ECA-GCN increased by 9.31% and its Kappa increased by 13.14%. Compared with PDCNet, the OA of ECA-GCN increased by 0.87% and its Kappa increased by 1.29%. Compared with GCSANet, the OA of ECA-GCN increased by 7.29% and its Kappa increased by 10.82%. Compared with EMTCAL, the OA of ECA-GCN increased by 13.74% and its Kappa increased by 19.07%. We also verified the effectiveness of the GCN through an ablation experiment: compared with ECA-GCN (without GCN), the OA of ECA-GCN increased by 7.62% and its Kappa increased by 10.21%. This indicates the effectiveness of the ECA-GCN proposed in this article. We used multiscale and shallow feature fusion to obtain multiscale information on deep as well as shallow features and filtered important features using the edge enhanced channel attention. The GCN was constructed on the relationship of image edge nodes. Finally, the resulting features were fed into the designed classifier for classification. ECA-GCN can effectively improve the classification accuracy of difficult land cover samples of the mining area, such as concentrator, dump, paddy field, mining catchment, asphalt road, and dirt road. As shown in Fig. A3, the accuracy of ECA-GCN for dryland was 91%, the accuracy for pond was 87%, and the accuracy for nursery and orchard was 63%. Competitive classification accuracy was thus maintained for the land cover categories that have a large number of samples in the mining area. The accuracy of ECA-GCN for concentrator was 54%, while the accuracies for dump, paddy field, and rural settlement were 10%, 38%, and 55%, respectively.
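As a rough illustration of how a graph convolution propagates information between related nodes, the following sketch applies one layer with the symmetric normalization that is common in GCNs. It is an assumption that ECA-GCN uses this exact propagation rule; the adjacency matrix, node features, and weight matrix are hypothetical toy values, and the construction of the adjacency matrix from edge detection is not reproduced here.

```python
import math

def normalize_adjacency(adj):
    """Return D^(-1/2) (A + I) D^(-1/2) for a square adjacency matrix A."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]   # >= 1 because of the self-loops
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def gcn_layer(adj, feats, weight):
    """One propagation step: ReLU(A_norm @ H @ W)."""
    propagated = matmul(normalize_adjacency(adj), feats)
    return [[max(0.0, v) for v in row] for row in matmul(propagated, weight)]

# Hypothetical graph: three nodes in a chain (0 - 1 - 2).
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 nodes x 2 features
weight = [[1.0], [1.0]]                        # 2 features -> 1 output

a_norm = normalize_adjacency(adj)
out = gcn_layer(adj, feats, weight)
```

Each output row mixes a node's own features with those of its neighbors, which is how relations between nodes and wider contextual information enter the representation.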
Our structure effectively extracts edge information and improves the classification of a few categories. However, the performance of ECA-GCN remains unsatisfactory for some categories. For example, greenhouse was misclassified as dry land and rural settlement, cement road was misclassified as dry land, and urban land was misclassified as dry land and rural settlement. This is due to the small number of samples and the extremely unbalanced categories. In addition, these land covers occupy small areas in the mining area and are surrounded by extensive background information. Other construction land was misclassified as rural settlement. This is due to the high similarity between these two types: although their color characteristics are relatively different, their other characteristics are relatively consistent, which leads to misclassification.
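The effect of class imbalance on the reported metrics can be made concrete with a small sketch that computes OA, AA, and Kappa from a confusion matrix. It assumes that AA denotes the mean of the per-class accuracies (recalls) and that Kappa is Cohen's kappa; the 2-class confusion matrix is a hypothetical toy example, not data from this study.

```python
def overall_accuracy(cm):
    """Fraction of all samples on the diagonal (rows = true classes)."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total

def average_accuracy(cm):
    """Mean per-class recall, ignoring classes without samples."""
    recalls = [cm[i][i] / sum(cm[i]) for i in range(len(cm)) if sum(cm[i]) > 0]
    return sum(recalls) / len(recalls)

def cohen_kappa(cm):
    """Agreement corrected for chance: (po - pe) / (1 - pe)."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    po = overall_accuracy(cm)
    pe = sum(sum(cm[i]) * sum(row[i] for row in cm) for i in range(n)) / total ** 2
    return (po - pe) / (1.0 - pe)   # undefined when pe == 1

# Hypothetical imbalanced case: 100 majority samples, 10 minority samples.
cm = [[90, 10],
      [8, 2]]

oa = overall_accuracy(cm)
aa = average_accuracy(cm)
kappa = cohen_kappa(cm)
```

In this toy case the majority class dominates OA, while the poorly recognized minority class pulls AA and Kappa down, which mirrors the gap between OA and AA observed on unbalanced categories.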
To further analyze the classification of MLCs in complex landscapes, we selected some test set samples (such as asphaltroad, miningcatchment, and dump) and output the classification probability of each category. Refer to Table V for details, where the red box in each sample image marks the approximate extent of the MLC. In addition to the asphaltroad, the first image contains dryland, havewoodland, cementroad, and other categories, and the area of the asphaltroad is relatively small. Nevertheless, the classification probability shows that the ECA-GCN can effectively enhance the features of the asphaltroad and improve classification accuracy. In addition to the miningcatchment, the second image contains stope, dump, and other categories. Here, neither the classification probability nor the classification result is ideal. This may be because the miningcatchment is not only small in area but also shares similar characteristics with other categories (such as concentrator and fallowland). In the future, geoscientific prior knowledge could be added to the network to improve the classification accuracy of difficult-to-distinguish samples. The third image includes stressvegetation, cementroad, and other categories in addition to dump. The dump is very similar to the background, and its edge is very irregular. The classification probability also indicates that feature extraction is difficult. However, the ECA-GCN classifies this sample correctly, which supports the effectiveness and pertinence of the algorithm.

V. DISCUSSIONS
This section discusses the ablation experiments of different modules of the proposed ECA-GCN to verify the effectiveness of the proposed modules. The limitations of the CUG-MLCs dataset established in this article are also discussed.

A. Effectiveness of the Multiscale and Shallow Feature Fusion
The earliest multiscale module is arguably the inception module [69]. There are two main ways to improve network performance: increasing the width and/or the depth of the network. However, the deeper and wider the network, the larger the number of parameters. When the dataset is small, the network easily overfits. Moreover, when the network is deep, vanishing and exploding gradients easily occur. This restricts the development of CNNs. The inception module addresses these problems well through two main contributions: one is to use 1 × 1 convolutions to increase and reduce dimensions; the other is to perform convolutions at multiple scales simultaneously and then aggregate the results. Drawing on this idea, we designed the multiscale and shallow feature fusion module described in this article.
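The parameter saving obtained by reducing dimensions with a 1 × 1 convolution can be checked with simple arithmetic. The channel counts (256 and 64) and the 5 × 5 kernel below are illustrative assumptions, not values taken from the proposed module.

```python
def conv_params(in_ch, out_ch, kernel, bias=False):
    """Number of learnable parameters of a 2D convolution layer."""
    return in_ch * out_ch * kernel * kernel + (out_ch if bias else 0)

# Direct 5 x 5 convolution from 256 to 256 channels.
direct = conv_params(256, 256, 5)        # 1,638,400 parameters

# 1 x 1 reduction to 64 channels, followed by the 5 x 5 convolution.
reduced = conv_params(256, 64, 1) + conv_params(64, 256, 5)   # 425,984 parameters

saving_ratio = direct / reduced
```

With these assumed sizes, the bottleneck variant needs roughly a quarter of the parameters of the direct convolution, which is why 1 × 1 convolutions make wider, multibranch designs affordable.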
To verify the effectiveness of the multiscale and shallow feature fusion module, we carried out experiments with single-branch, double-branch, and four-branch structures. The single branch was composed of two 3 × 3 convolution layers and one maximum pooling layer. In the double-branch structure, the first branch consisted of two 3 × 3 convolution layers and one maximum pooling layer, while the second branch consisted of one 1 × 1 convolution layer and one maximum pooling layer. A BN layer and a ReLU activation function were added after each convolution layer. As shown in Table VI, the four-branch structure of the multiscale and shallow feature fusion proposed in this article exhibited the best classification effect, with an OA 0.74% higher than that of the single branch and 0.86% higher than that of the double branch. This indicates that convolution kernels of different sizes acquire features at different scales, which is helpful for multiscale object recognition.

B. Effectiveness of the Edge Enhanced Channel Attention
To verify the effectiveness of the edge enhanced channel attention, we compared the designed module with the original SE attention and channel shuffling methods.
Channel shuffle [70] is used to address the accuracy loss caused by the constraints between channels that pointwise convolution induces in small networks. It effectively strengthens the information flow between channel groups and thus the information representation ability, which enhances the channel representation capability from another perspective. Therefore, we used it as a comparison strategy in the experiments.
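The channel shuffle operation itself is simple: the channels are viewed as a (groups, channels per group) grid, the grid is transposed, and the result is flattened. The sketch below shows this on channel labels rather than feature maps; the function name and the labels are assumptions for illustration.

```python
def channel_shuffle(channels, groups):
    """Interleave channels across groups (reshape, transpose, flatten)."""
    n = len(channels)
    if n % groups != 0:
        raise ValueError("number of channels must be divisible by groups")
    per_group = n // groups
    # Take the i-th channel of every group in turn.
    return [channels[g * per_group + i]
            for i in range(per_group)
            for g in range(groups)]

# Two groups, "a" and "b", with three channels each.
channels = ["a0", "a1", "a2", "b0", "b1", "b2"]
shuffled = channel_shuffle(channels, groups=2)
# shuffled == ["a0", "b0", "a1", "b1", "a2", "b2"]
```

After the shuffle, any subsequent grouped operation sees channels from every original group, which is how information flows between channel groups.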
As shown in Table VII, the edge enhanced channel attention proposed in this article has the best classification effect, with OA increases of 1.1%, 0.63%, and 1.1% compared with no enhancement strategy, SE attention, and channel shuffle, respectively. This shows that the feature map is enhanced by the edge enhancement strategy. Fig. 8 shows the attention maps of the edge enhanced channel attention for three scenarios: asphaltroad, ruralsettlement, and dump. It can be seen that the edge enhanced channel attention effectively captures MLCs while enhancing edge features. Thus, the edge enhanced channel attention is effective in strengthening the features of land cover categories in mining areas that have multiple sizes and irregular edges.

C. Limitations of CUG-MLCs Dataset
Our CUG-MLCs dataset has some limitations. As shown by the aforementioned experiments, although the proposed ECA-GCN performs best in OA, AA, and Kappa, the accuracy of all models is poor. From a dataset perspective, we identified three reasons that limit the accuracy of the model.
1) Visual Interpretation Deviation: Before the dataset was established, labels had to be assigned according to representative and significant land cover types. Because large images were used, we divided them into several areas and set up a team to interpret them block by block. Such manual interpretation is associated with potential errors. Therefore, the label accuracy of some land cover types in the mining area is affected.
2) Dataset Size: Because dataset production is laborious and time-consuming, we used only one remote sensing image and the corresponding DEM for dataset construction. This inevitably leads to a small dataset. Moreover, due to the characteristics of the mining scene itself, the number of samples per category is extremely unbalanced, which also limits the capacity to improve the model accuracy.
3) Strong Homogeneity: In the mining area scenario, some categories are highly similar to one another, such as dump and fallowland. When visual features are this consistent, targeted modules must be designed to effectively improve classification accuracy, which is part of our future work.
To address the above limitations, we will make the following improvements. a) Dataset improvement: We will use ZY-3 images from different time phases and GF-7 images of the study area to expand the dataset. In addition, the polymetallic mining areas in the southeast of Hubei Province and the phosphate mining areas in the west of Hubei Province will be included in the scope of our research. b) Model improvement: We will experiment with a multibranch fusion strategy to explore the possibility of using multisource data to improve the accuracy of MLC classification. Data augmentation is an important strategy for alleviating class imbalance. In addition to translation, rotation, and similar strategies, and as GANs are currently a research hotspot, we will consider using GANs to expand the categories with few samples and improve classification accuracy. To avoid excessive computation, the GCN is currently placed after the feature extraction structure; we will study the influence of the GCN location on model accuracy.
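Rotation-based augmentation of under-represented categories, one of the strategies mentioned above, can be sketched as follows. This is only an illustration under stated assumptions: the function names, the `target` count, and the tiny single-band patches are hypothetical, and for multiband samples the same rotation would have to be applied to every band.

```python
def rotate90(patch):
    """Rotate a 2D patch by 90 degrees clockwise."""
    return [list(row) for row in zip(*patch[::-1])]

def augment_with_rotations(patch):
    """Return the patch followed by its 90, 180, and 270 degree rotations."""
    out = [patch]
    for _ in range(3):
        out.append(rotate90(out[-1]))
    return out

def oversample_minority(samples_by_class, target):
    """Add rotated copies to classes with fewer than `target` samples."""
    balanced = {}
    for label, patches in samples_by_class.items():
        pool = list(patches)
        if len(pool) < target:
            for patch in patches:
                pool.extend(augment_with_rotations(patch)[1:])  # rotated copies
        # Never drop originals; cap augmented classes at `target`.
        balanced[label] = pool[:max(target, len(patches))]
    return balanced

# Hypothetical toy dataset: one rare patch and five common patches.
rare = [[1, 2], [3, 4]]
common = [[0, 0], [0, 0]]
dataset = {"dump": [rare], "dryland": [common] * 5}

balanced = oversample_minority(dataset, target=3)
```

Rotations alone can raise a class to at most four times its original size, and symmetric patches yield duplicates, so this kind of augmentation only partly offsets a strong imbalance.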

VI. CONCLUSION
Due to the urgent demand for a semantic-level understanding of MLCs at a fine scale, this study constructed a multimodal dataset named CUG-MLCs. In view of the multisize objects and irregular edges of the MLCs, we proposed an ECA-GCN with the following key points.
1) A multiscale and shallow feature fusion module was used to extract multiscale information and fuse the multiscale convolutional features with the shallow features. 2) The edge enhanced channel attention was used to select effective channels after a spatial edge feature enhancement. 3) The edge detection-based GCN was used to construct an adjacency matrix from edge node relationships and learn the global contextual information. Our results indicate that the ECA-GCN constructed in this study achieved an OA of 66.60%, AA of 36.25%, and Kappa of 55.91% on the CUG-MLCs dataset, and outperformed the classical CNNs and recent networks. Thus, the proposed model is suitable for the fine classification of complex landscapes. In the future, on the one hand, we will aim to improve the dataset. On the other hand, we will focus on multimodal and multibranch feature learning and fusion, class imbalance learning, and multilabel scene classification.