A Stepwise Framework for Fine-Scale Mining Area Types Recognition in Large-Scale Scenes by GF-5 and GF-2 Images

Quickly obtaining fine-scale mining area types information in large-scale scenes is significant for dynamically detecting mineral resources. Currently, mining area types recognition methods encounter challenges such as low recognition accuracy and difficulty detecting small mining areas. To address these issues, this article proposes a stepwise top-down mining area types recognition framework. The framework consists of two steps. First, a GF-5 spectral index named the Normalized Difference Mining Area Index (NDMAI) is constructed to quickly obtain the rough position of the mining area. Then, the identification network of Mine Types with Transformer (Mitformer) is proposed for accurate type recognition of the candidate mining area regions. Mitformer combines a multiscale feature enhancement module and a decoder based on multilevel skip connections. The former achieves a sufficient fusion of features at each layer of the deep feature maps, and the latter adds skip connections between low-level and high-level feature maps; together, they improve the accuracy of type identification and the detection rate of small-scale mining areas. Moreover, this framework reduces, as far as possible, the misclassification caused by different objects with similar spectra. Two independent study areas with a large spatial extent are selected, in Hebei Province and Anhui Province, respectively. The imagery used for these regions is obtained from the Chinese GF-2 and GF-5 satellites. Multiple experiments are conducted to verify the superiority of NDMAI and Mitformer and the effectiveness of this framework. The experimental results illustrate that this framework can provide adequate technical support for the dynamic detection of mineral resources.

Currently, illegal and unauthorized mining activities are widespread globally [2]. They not only disrupt the order of mineral resource development but also cause irreversible damage to the ecological environment [3], [4]. Rapid identification and monitoring of mining areas and illegal mining have become important tasks for natural resource management departments at all levels of government. The topic of this article is therefore of practical significance, as it aligns with the policy demands of various levels of government. In early work, the location of mining areas and information on the extent of mining were obtained through field surveys [5]. This method was time-consuming, labor-intensive, and unsuitable for large-scale, routine inspections. High spatial resolution remote sensing (HRS) images can quickly provide clear information about the land surface over a large area. At the same time, the continuous improvement of spectral resolution has made it possible to obtain fine-scale mining area types information for large-scale scenes [6].
Since the 1970s, remote sensing technology has been widely used in land cover detection [7], [8] and land use mapping [9] tasks. According to the research unit, the technology can be divided into pixel-based and object-based methods [10], [11], [12]. The former takes a single pixel in an HRS image as the research object, constructs shallow features of ground features, and then uses a simple threshold method to classify the targets. In the early research on remote sensing detection of mining areas, spectral index methods [13], [14] and edge detection methods [15] were the most typical methods of that period. Although these methods were simple and easy to operate, their detection accuracy was relatively low. Object-based methods study objects composed of multiple related pixels. They consider the spatial structural characteristics between ground features and can achieve higher detection accuracy than pixel-based methods. However, adaptively selecting a reasonable segmentation scale is still a bottleneck of these methods [16]. With the development of machine learning, scholars have gradually replaced thresholding with various machine learning algorithms for mining the features of mining areas, such as decision trees [17], support vector machines [18], and deep belief networks [19]. Although machine learning methods have achieved good results, more powerful algorithms are needed to explore deeper semantic features because of the complexity of land cover and the irregularity of spatial distribution in mining areas. In recent years, the rise of deep learning has brought new opportunities for the innovation of remote sensing algorithms [20].
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Deep learning is good at automatically mining features from massive data and has good generalization ability, attracting the attention of more and more researchers [21]. In particular, deep learning models represented by convolutional neural networks (CNNs) [22], [23], [24] have been widely applied in remote sensing surveys of mining areas because of their superior performance in image processing. Models such as Unet [25], Unet+ [26], Deeplab [27], and SegNet [28] have demonstrated far superior performance to traditional machine learning methods in open-pit mining identification tasks. Similarly, methods that combine object-based image segmentation with CNNs [29], attention-based convolutional neural networks [30], and complex frameworks that combine semantic segmentation networks with object detection networks [31] have also shown satisfactory results. However, in large-scale scene mining area recognition, the existing methods still have serious problems, such as misclassification and difficulty identifying small mining areas. At the same time, as far as we know, few application studies introduce deep learning models for fine-scale identification of mining area types, a task that also places higher demands on the models.
Recent studies have shown that CNNs can efficiently learn spatial features owing to their shared convolution kernels and shift-invariant characteristics [32]. However, the receptive field of CNNs is limited [33], which means that even in large-scale HRS images, the features of objects can be captured only within extremely small windows. Therefore, the primary task for current research on large-scale remote sensing target recognition is to break through this limitation and efficiently extract both local detail features and global contextual features [34]. The emergence of the Transformer provides a new solution to this problem. Recently, networks based on the Transformer have gradually become the new champions in computer vision [35], with models like ViT [36], Segformer [37], and Swin [38] achieving impressive results in multiple competitions. This is because combining patch embeddings with the multihead self-attention module gives the Transformer a robust local-global feature learning capability [39]. Whether a more robust deep learning model for mine-type recognition can be developed based on the Transformer is an urgent research question.
For mining area detection tasks in large-scale scenes, relying on a single deep learning model is extremely time-consuming. Reducing the search range before inputting the image to the network is an efficient and feasible solution. Based on the above problems, this article proposes a stepwise framework for fine-scale recognition of mining area types in large-scale scenes. The framework consists of two main steps. The first step is to quickly locate the mining area in the large-scale scene using the mining area spectral index. The second step is to use a novel Transformer-based semantic segmentation network to accurately identify the mining area type and coverage range at the candidate locations. This framework can effectively avoid misclassification caused by spectral similarity between mining areas and urban areas or bare soil areas. Through this framework, intelligent extraction of fine-scale mining area types can be achieved from top to bottom and from coarse to fine, providing basic data for ecological environment governance and dynamic detection of mineral resources.

II. METHODOLOGY
The mining area extraction framework proposed in this article consists of two main steps: 1) the determination of mining area candidate regions and 2) the precise recognition of mining area types. Fig. 1 illustrates this framework. In the first step, this article constructs a Normalized Difference Mining Area Index (NDMAI) from GF-5 data based on the differences in the reflectance spectral curves of different ground features, and then uses this index to delineate candidate mining areas across the whole large-scale scene. In the second step, the GF-2 image at the same location is cropped based on the candidate area. For accurate recognition, the cropped image patches are input into the fine-scale identification network of Mine Types with Transformer (Mitformer). Finally, a position coordinate is assigned to each prediction result of the network to obtain the mining area type recognition results. The following two sections describe the key technologies in the proposed framework in detail.
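The hand-off between the two steps can be illustrated with a minimal sketch. It assumes, beyond what the article states, that the coarse index map and the fine image are co-registered with a common upper-left origin and that one coarse pixel covers 30 × 30 fine pixels; the names `candidate_bbox`, `to_fine_window`, and `SCALE` are hypothetical.

```python
# Minimal sketch: turn a coarse candidate mask into a crop window on the fine image.
# Assumptions (not from the article): both rasters share the same upper-left
# origin and orientation, and one coarse pixel covers SCALE x SCALE fine pixels.

SCALE = 30  # assumed ratio of coarse (index map) to fine (GF-2) pixel size


def candidate_bbox(mask):
    """Return (row_min, col_min, row_max, col_max) of the True cells, or None."""
    rows = [r for r, line in enumerate(mask) if any(line)]
    cols = [c for line in mask for c, v in enumerate(line) if v]
    if not rows:
        return None
    return min(rows), min(cols), max(rows), max(cols)


def to_fine_window(bbox, scale=SCALE):
    """Map an inclusive coarse-pixel bbox to a half-open fine-pixel window."""
    r0, c0, r1, c1 = bbox
    return r0 * scale, c0 * scale, (r1 + 1) * scale, (c1 + 1) * scale


# Toy 3 x 4 candidate mask (True marks a thresholded candidate pixel).
mask = [
    [False, False, False, False],
    [False, True, True, False],
    [False, False, True, False],
]
bbox = candidate_bbox(mask)
window = to_fine_window(bbox)
print(bbox, window)
```

The window offsets also give the position coordinate that would be attached to each prediction when the patch results are placed back into the full scene.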

A. Normalized Difference Mining Area Index
The spectral characteristics of different land-cover types vary greatly, so studying the spectral reflectance of multiple land-cover types is a prerequisite for identifying mining areas. For large-scale HRS images, spectral indices are a simple and effective method for quickly obtaining the target's approximate location. Considering the diversity of land-cover types in the study area, this article mainly focuses on vegetation, water, built-up areas, and mining areas.
Based on the spectral response function of GF-5 images and the spectral reflectance of typical land-cover types, four spectral curves of vegetation, water, built-up areas, and mining areas were calculated, as shown in Fig. 2. Because of the red edge effect, the spectral reflectance of vegetation fluctuates markedly in the near-infrared band. This is one of the main reasons vegetation is often easily distinguished from other land-cover types [40]. In addition, water reflects mainly in the blue and green bands and absorbs strongly in other bands, especially in the near-infrared band [41]. Consequently, water appears as an almost straight line parallel to the horizontal axis in the spectral response curve, and it is easy to distinguish mining areas from water and vegetation using spectral characteristics. However, as shown in the figure, the spectral curves of built-up areas and mining areas have similar fluctuations. The two curves are almost parallel, which poses a significant challenge to distinguishing between the two land-cover types.
The normalized difference index (NDI) is commonly used to highlight the difference between the strongest and weakest spectral responses of target land-cover types [42]. Analyzing the spectral response curves of built-up areas and mining areas, we found that the reflectance of mining areas is higher than that of built-up areas in the visible band, whereas it is lower than that of built-up areas in the near-infrared band. Consequently, the normalized value of mining areas is lower than that of built-up areas. In this article, five bands with local maximum or minimum values were selected to represent the visible and near-infrared regions: two bands in the visible spectrum and three bands in the near-infrared spectrum. This selection scheme aims to prevent the normalized values from becoming negative, thereby reducing the computational complexity. The selected bands are B1 (390.32 nm), B48 (591.46 nm), B100 (813.82 nm), B128 (933.60 nm), and B155 (1038.48 nm) of GF-5 imagery. In the NDMAI formula, $R_{390.32}$, $R_{591.46}$, $R_{813.82}$, $R_{933.60}$, and $R_{1038.48}$ represent the reflectance of the 1st, 48th, 100th, 128th, and 155th bands of the GF-5 image, respectively.
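To make the normalized-difference idea concrete, the following is a hypothetical sketch rather than the published NDMAI. It assumes the index takes the form (sum of the three near-infrared bands minus sum of the two visible bands) divided by the sum of all five, which is only one reading consistent with the description above. The function name `ndmai_sketch` and the reflectance values are illustrative, not measured data.

```python
# Hypothetical NDMAI-style index. The exact published formula is not available
# in this text; the form below is an ASSUMPTION:
#   (sum of NIR bands - sum of visible bands) / (sum of all five bands)


def ndmai_sketch(r390, r591, r813, r933, r1038, eps=1e-12):
    """Works on floats or, element-wise, on NumPy arrays of reflectance."""
    visible = r390 + r591        # B1 and B48
    nir = r813 + r933 + r1038    # B100, B128, and B155
    return (nir - visible) / (nir + visible + eps)


# Made-up reflectances: the "mining" pixel is brighter in the visible bands and
# darker in the near-infrared bands than the "built-up" pixel.
mining = ndmai_sketch(0.20, 0.30, 0.25, 0.25, 0.25)
built_up = ndmai_sketch(0.15, 0.22, 0.30, 0.30, 0.30)
print(round(mining, 3), round(built_up, 3))
```

Under this assumed form, using three near-infrared bands against two visible bands keeps the numerator positive for typical spectra, and the mining pixel receives the lower value, in line with the qualitative argument in the text.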

B. Fine-Scale Identification Network of Mine Types With Transformer
While simple and convenient, a single spectral index cannot achieve satisfactory results in fine-scale land cover classification tasks [43]. The emergence of deep learning has brought a new turning point in solving this problem. Accordingly, this article develops a deep learning semantic segmentation network for mine classification called the fine-scale identification network of Mine Types with Transformer (Mitformer). This network fully integrates features of different levels and scales. It has clear advantages in addressing misclassification in mining areas and the difficulty of detecting small-scale mining areas.
The Mitformer consists of an encoder, a feature enhancement layer, and a decoder, as shown in Fig. 3. The Transformer-based model Segformer is used as the encoder to extract deep features of different types of mining areas. Segformer has been widely applied in various remote sensing information extraction tasks and has achieved good results. The encoder has four Segformer blocks, each containing an efficient self-attention mechanism, a mix-feedforward network, and an Overlapped Patch Merging module. The Segformer block performs well in extracting both large-scale, coarse-grained features and small-scale, fine-grained features. Although Segformer has strong capabilities in mining global contextual information, its learning of rich detailed information in HRS images still needs to be improved [44]. To address this issue, this article introduces a multiscale feature enhancement module and a decoder based on multilevel skip connections in the Mitformer. The details of these two modules are introduced in the following two sections.
1) Multiscale Feature Enhancement Module: To achieve sufficient fusion of feature information at each layer of deep features, this article proposes a multiscale feature enhancement (MSFE) module, whose structure is shown in Fig. 3. After the image goes through the encoder, a feature map of size (256 × H/32 × W/32) is obtained. This feature map is then input into the MSFE module. In the MSFE module, multiple convolution kernels are connected in parallel, including three dilated convolutions with different dilation rates. These are used to obtain multiscale spatial relational features with different receptive fields. The kernel sizes of these three dilated convolutions are all 3 × 3, and their dilation rates are 6, 12, and 18, respectively. In addition, a 1 × 1 convolution kernel is connected in parallel to aggregate feature information across different layers. Each branch of this parallel structure outputs a feature map of size (256 × H/32 × W/32). The four branch outputs are then stacked with the input feature map to create a feature map of size (1280 × H/32 × W/32). This design aims to prevent the loss of original information and avoid excessive compression of features at different levels, thereby improving the robustness and stability of the model. Subsequently, this feature map is passed through a 1 × 1 convolutional layer to reduce the number of channels, resulting in a final feature map size of (256 × H/32 × W/32). In the formula that represents the main implementation process of this module, $x_{\mathrm{in}}$ represents the feature output of the fourth stage of the encoder, $\mathrm{Conv}^{i}_{3\times3}(\cdot)$ refers to a 3 × 3 dilated convolutional operation with a dilation rate of $i$, $x_{\mathrm{fuse}}$ represents the fused feature obtained by stacking multiple feature maps together, and $x_{\mathrm{out}}$ is the output feature of this module.
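One way to write the MSFE computation, reconstructed from the prose description alone, is sketched below. The operator names $\mathrm{Concat}$ and $\mathrm{Conv}_{1\times1}$ are assumptions, and any normalization or activation layers that the original network may include are omitted.

```latex
% Reconstruction from the textual description only; operator names are illustrative.
\begin{aligned}
x_{\mathrm{fuse}} &= \mathrm{Concat}\bigl(x_{\mathrm{in}},\;
    \mathrm{Conv}_{1\times1}(x_{\mathrm{in}}),\;
    \mathrm{Conv}^{6}_{3\times3}(x_{\mathrm{in}}),\;
    \mathrm{Conv}^{12}_{3\times3}(x_{\mathrm{in}}),\;
    \mathrm{Conv}^{18}_{3\times3}(x_{\mathrm{in}})\bigr) \\
x_{\mathrm{out}} &= \mathrm{Conv}_{1\times1}(x_{\mathrm{fuse}})
\end{aligned}
```

The channel count is consistent with the sizes given in the text: the input and the four branch outputs each carry 256 channels, so $x_{\mathrm{fuse}}$ has $5 \times 256 = 1280$ channels before the final 1 × 1 convolution maps it back to 256.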
2) Decoder With Multilevel Skip Connections: To fully utilize feature information from different levels, this article proposes a decoder with multilevel skip connections, as shown in Fig. 3. First, the output feature maps of each stage are upsampled layer by layer, and each upsampling operation produces a new feature map with the same size as the feature map of the preceding encoder stage. For example, if the output feature map size of the third-stage encoder is (160 × H/16 × W/16), it needs to be upsampled twice to scale it to the unified feature map size. The first upsampling operation enlarges the input feature map by a factor of 2 and reduces the number of channels, resulting in an output feature map size of (64 × H/8 × W/8). The second upsampling operation works in the same way, resulting in an output feature map size of (32 × H/4 × W/4). It is worth noting that this article does not rely solely on transposed convolution in each upsampling step. Instead, it first increases the resolution of the feature map to the size of the output feature map using interpolation and then applies a 3 × 3 transposed convolution with a stride of 1. This design avoids the checkerboard effect, which produces many fragmented, regularly distributed patches in the network's prediction images and seriously degrades the accuracy of the semantic segmentation results.
In this decoder, a multilevel upsampling structure is designed to fuse the decoded results of different stages, achieving feature reuse by indirectly fusing high-level and low-level features. This fusion is implemented through multilevel skip connections, where the output high-level features are directly added to the low-level features of the same size during the upsampling process. For example, the feature map output by the third-stage encoder goes through two upsampling operations. After the first upsampling, the feature map has the same size as the low-level feature output by the second-stage encoder, so the two features are added through a skip connection. Similarly, the feature map obtained by the second upsampling is added to the low-level feature output by the first-stage encoder, achieving the effect of multilevel skip connections. Feature maps at different depths have different impacts on semantic segmentation results. Low-level features mainly contain edge information about objects, while high-level features contain more spatial position information. Therefore, this multilevel feature fusion design is expected to significantly improve the accuracy of small object recognition. In the formula that describes the multilevel skip connection decoder, $m$ represents the number of upsampling operations performed, with $m \in [0, 3]$; $i$ represents the stage number of the encoder, with $i \in [1, 4]$; $x^{m}_{i}$ is a feature map obtained after $m$ upsampling operations, with the same size as the output feature map of the $i$th-stage encoder; and $x^{0}_{i}$ represents the feature map output by the $i$th-stage encoder without upsampling.
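A possible form of the decoder recursion, inferred only from the worked example in the text, is sketched below. The operators $\mathrm{Up}$, $\mathrm{Interp}_{\times 2}$, and $\mathrm{TConv}_{3\times3}$ are illustrative names, and the choice of adding the encoder output $x^{0}_{i}$ (rather than an earlier decoded map) follows the third-stage example and is an assumption for the other stages.

```latex
% Assumed reading of the prose; not the article's original equation.
\begin{aligned}
\mathrm{Up}(x) &= \mathrm{TConv}_{3\times3}\bigl(\mathrm{Interp}_{\times 2}(x)\bigr) \\
x^{m}_{i} &= \mathrm{Up}\bigl(x^{m-1}_{i+1}\bigr) + x^{0}_{i},
\qquad 1 \le m \le 3,\quad i + m \le 4
\end{aligned}
```

For the third-stage example this gives $x^{1}_{2} = \mathrm{Up}(x^{0}_{3}) + x^{0}_{2}$ followed by $x^{2}_{1} = \mathrm{Up}(x^{1}_{2}) + x^{0}_{1}$, which matches the two additions described above.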

A. Study Area and Data
To validate the effectiveness of the proposed framework, experiments were conducted in two large-scale mining areas located in different geographic locations. Study area 1 is located in Chengde, Hebei Province, where the geological structure is complex, the resource reserves are abundant, and there are many types of mines, including metal, gravel, and coal. The distribution of mineral deposits is relatively concentrated. Small deposits are far more numerous than large deposits, which greatly challenges the model's ability to recognize small mining areas. For the data required by the framework, a GF-5 image taken on July 21, 2019, and a GF-2 image taken on June 1, 2021, were selected as the experimental data. Study area 2 is situated in Wuhu, Anhui Province, a region rich in mineral resources. Early volcanic activity in the area provided favorable geological conditions for mineralization, gradually forming medium and small-sized deposits and mineralization points of metal and nonmetal minerals. Similarly, a GF-5 image taken on May 22, 2019, and two GF-2 images taken on December 5, 2021, were selected as experimental data. Fig. 4 shows the 1-m resolution GF-2 fused images of the two study areas.
To highlight the extraction performance of the proposed framework in large-scale scenes, the two selected study areas are sufficiently large. Study area 1 is a rectangular area with a length of 22 771 m and a width of 14 619 m. Similarly, study area 2 is a rectangular area with a length of 17 163 m and a width of 12 113 m. The study areas have many mining areas, mainly metal mines and gravel mines. Gravel mines are all open-pit mines, while metal mines are divided into underground and open-pit mines. Therefore, this article mainly focuses on identifying open-pit metal mines and gravel mines. The ground tailings pond is also a major component of the metal mine area and is classified as a metal mine. The information regarding the types and boundaries of mining areas used in the experiments was obtained through queries on Google Maps. Additionally, this information was validated using well-known online mapping software such as Baidu Maps and Tianditu. Therefore, it is considered to be authentic and reliable.

B. Experimental Settings and Evaluation Metrics
All network training experiments were conducted with PyTorch 1.7.1 on an NVIDIA GeForce RTX 2080 Ti GPU. We trained the models with the AdamW optimizer for 160 000 iterations with a learning rate of 6 × 10^(−5).
The results of all experiments were evaluated using class pixel accuracy (CPA), intersection over union (IoU), mean pixel accuracy (MPA), and mean intersection over union (mIoU). CPA represents the accuracy of correctly predicting pixels that belong to a specific class. MPA calculates the proportion of correctly classified pixels for each class and then takes the average. IoU measures the ratio of the intersection to the union of predicted and ground truth values for a specific class. Additionally, mIoU represents the average IoU across all classes. These metrics are calculated by formulas (5)-(8), where TP (true positive) refers to the number of correctly predicted pixels, FP (false positive) refers to the number of pixels incorrectly predicted as belonging to a specific class, FN (false negative) refers to the number of pixels belonging to a specific class but incorrectly predicted as not belonging to that class, i represents the class index, and N represents the total number of classes.
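The four metrics can be sketched from a confusion matrix as follows. IoU and mIoU follow the standard definitions; CPA is taken here as TP / (TP + FP), which is an assumption, since some works instead use TP / (TP + FN). The function names and the toy confusion matrix are illustrative and are not results from the experiments.

```python
# Sketch of CPA, IoU, MPA, and mIoU from a confusion matrix.
# ASSUMPTION: CPA_i = TP_i / (TP_i + FP_i); the alternative TP_i / (TP_i + FN_i)
# is also common, and the text does not settle which one is used.


def per_class_counts(conf):
    """conf[t][p] = number of pixels of true class t predicted as class p."""
    n = len(conf)
    counts = []
    for i in range(n):
        tp = conf[i][i]
        fp = sum(conf[t][i] for t in range(n)) - tp  # predicted i, but wrong
        fn = sum(conf[i]) - tp                       # true i, but missed
        counts.append((tp, fp, fn))
    return counts


def metrics(conf):
    counts = per_class_counts(conf)
    cpa = [tp / (tp + fp) if tp + fp else 0.0 for tp, fp, fn in counts]
    iou = [tp / (tp + fp + fn) if tp + fp + fn else 0.0 for tp, fp, fn in counts]
    return {
        "CPA": cpa,
        "IoU": iou,
        "MPA": sum(cpa) / len(cpa),
        "mIoU": sum(iou) / len(iou),
    }


# Toy 3-class confusion matrix (rows: ground truth, columns: prediction).
conf = [
    [90, 5, 5],
    [10, 80, 10],
    [0, 20, 80],
]
result = metrics(conf)
print(result)
```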
For the experiments, we manually annotated labels on other GF-2 fused images from around the two study areas. The labels consisted of three land-cover types: metal mines, gravel mines, and nonmining areas. The images were randomly cropped to a size of 1000 × 1000 and augmented using data augmentation algorithms to produce 10 000 sets of images with their corresponding binary labels, which were used for training the Mitformer model.

C. Results
The results of the mining area identification in the two selected study areas are shown in Fig. 5. It can be seen that there are many mining areas in both study areas. The mining areas in study area 1 are more concentrated, mostly located in the southern part of the study area, while the mining areas in study area 2 are relatively evenly distributed. In addition, the mining areas in study area 2 occupy a larger area. In contrast, study area 1 has many small mining areas, and different types of mining areas are very close to each other, which greatly increases the difficulty of the mine type identification task. As can be seen from Fig. 5(c) and (f), the framework proposed in this article not only accurately extracts large mining areas but also performs well on small mining areas. Table I shows the accuracy metrics of mine type identification using the framework proposed in this article. It can be observed from the table that the MPA of study area 1 is 89.35% and the mIoU is 85.16%, while the MPA of study area 2 is 91.05% and the mIoU is 89.02%. All of these overall metrics are above 85%, and those of study area 2 are close to or above 90%. Regarding specific mine type identification accuracy, the CPA and IoU of nonmining areas are above 99%, and the various indicators of metal mines are around 85%. The various indicators of gravel mines in study area 2 are also around 85%. However, the CPA and IoU of gravel mines in study area 1 are less than 80%. This is because gravel mines in study area 1 are generally small and densely distributed, making them difficult to detect. Therefore, the accuracy of gravel mine extraction in study area 1 is slightly lower.

A. Effectiveness of NDMAI
In the framework proposed in this article, NDMAI is designed to extract candidate mining areas, i.e., to use a simple spectral index to quickly screen out possible locations of mining areas within the study area, which makes the subsequent identification of mining area types more efficient, faster, and more accurate. Research on using a dedicated spectral index to identify mining areas is limited. However, whether it is an open-pit mining site or a tailings pond used for waste disposal, from the perspective of the appearance of land features, both appear as bare land. Therefore, three commonly used bare soil indices, BI [45], NDSI [46], and NDBSI [47], were selected for experimental comparison with the NDMAI proposed in this article to demonstrate the effectiveness of this spectral index in identifying candidate mining areas. In the calculation formulas of these three indices, Blue, Red, Green, NIR, and SWIR represent the reflectance of the blue, red, green, near-infrared, and short-wave infrared bands, respectively. In this article, the Blue, Red, Green, NIR, and SWIR bands are represented by B22 (480.18 nm), B63 (655.55 nm), B41 (561.46 nm), B112 (865.16 nm), and B222 (1603.38 nm) of the GF-5 data, respectively.
First, the effectiveness of NDMAI in identifying candidate mining areas was analyzed from a qualitative perspective. The ground truth mining area boundaries shown in Fig. 6 come from Google Maps and were manually delineated. BI and NDSI cannot completely separate the mining areas from nonmining areas, especially BI in small mining areas [see Fig. 6(c)] and NDSI in urban areas [see Fig. 6(i)]. In contrast, NDBSI and NDMAI perform well in this regard. Details in the figure reveal that mining areas in the NDBSI image appear in red and yellow [yellow in part of the metal tailings pond in Fig. 6(e)], while in the NDMAI image they mainly appear in red. This indicates that, compared with the NDBSI image, the numerical range of mining areas is more concentrated in the NDMAI image, so the index has a stronger separation ability. Therefore, the NDMAI proposed in this article shows higher accuracy in identifying potential mining areas.
Table II provides a quantitative analysis of the effectiveness of NDMAI in identifying mining areas. The results of each index were classified by thresholding and then evaluated for accuracy, with the threshold values for each index listed in Table II. The highest score among these spectral indices is highlighted in bold, and the second-highest score is underlined. In study area 1, NDMAI achieved an MPA of 68.14% and an mIoU of 52.41%, while in study area 2, NDMAI achieved an MPA of 72.80% and an mIoU of 47.46%. These values are not very high, which shows why it is difficult for a single spectral index to achieve highly accurate mining area identification results. Relative to the other indices, the NDMAI proposed in this article demonstrated high accuracy in both MPA and mIoU. From the perspective of the stability of the accuracy metrics, the accuracy of NDMAI in extracting candidate mining areas is also more stable and reliable.

B. Effectiveness of Mitformer
In this section, the effectiveness of the Mitformer for mining area types recognition is validated through comparative experiments. Several state-of-the-art semantic segmentation networks are selected as the comparison methods, including the CNN-based FCN [48], Deeplabv3 [49], PSPNet [50], and HRNet [51], and the Transformer-based Swin [38] and Segformer [37]. The training samples and parameter settings used by these networks are the same as those used by the Mitformer. Fig. 7 shows that FCN, Deeplabv3, and PSPNet have poor extraction results, with serious misclassification and salt-and-pepper effects. Comparing the structures of these three networks with those of the other networks suggests that feature maps at different levels contain different types of information and that the fusion of high-level and low-level features is necessary. This is also why HRNet can achieve good extraction results. Comparing Swin, Segformer, and Mitformer shows that the former two networks perform poorly in extracting detailed information at the edges of targets, resulting in adhesion when multiple targets are close to each other. They also have difficulty ensuring the completeness of the extraction results for small mining areas. In addition, comparing HRNet and Mitformer shows that Mitformer captures the details of targets more accurately. Therefore, Mitformer is more suitable for mining area types recognition.
Next, we analyze the advantages of Mitformer from a quantitative perspective. Table III shows that Mitformer performs stably and has relatively high accuracy, both in the overall accuracy indicators MPA and mIoU and in the per-class accuracy indicators CPA and IoU. Similarly, in study area 2 (see Table IV), although Mitformer does not have the highest scores in some indicators, the difference between Mitformer and the highest score is very small. From an overall perspective, our network has the most stable accuracy. These results demonstrate the advantages of Mitformer in mining area types recognition tasks.

C. Effectiveness of Framework for Fine-Scale Recognition of Mining Area Types in Large-Scale Scenes
The mining area types identification framework proposed in this article is divided into two steps. The first step is to use spectral indices to obtain candidate locations in the mining area, and the second step is to perform overlapping cropping within the candidate areas. The cropped results are then input into Mitformer to identify mining area types accurately. This section analyzes the effectiveness of the framework in terms of the accuracy and speed of mining area identification by comparing it with the fixed step nonoverlapping sliding window prediction method.
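A minimal sketch of overlapping cropping inside a candidate region is given below. The tile size of 1000 pixels mirrors the training crop size and is an assumption for inference, the 50% overlap follows the value mentioned later in this section, and the helper names `tile_starts` and `crop_windows` are hypothetical. Setting `overlap=0.0` gives the fixed-step nonoverlapping sliding window used as the baseline.

```python
# Sketch of overlapping cropping; names and the tile size are assumptions.


def tile_starts(length, tile, overlap=0.5):
    """Start offsets of tiles along one axis.

    The last tile is shifted back so that it ends exactly at the border.
    If the axis is shorter than one tile, a single start at 0 is returned
    and the caller would need to pad the image.
    """
    if length <= tile:
        return [0]
    stride = max(1, int(tile * (1 - overlap)))
    starts = list(range(0, length - tile + 1, stride))
    if starts[-1] != length - tile:
        starts.append(length - tile)
    return starts


def crop_windows(height, width, tile, overlap=0.5):
    """Half-open (row0, col0, row1, col1) windows covering the region."""
    return [
        (r, c, r + tile, c + tile)
        for r in tile_starts(height, tile, overlap)
        for c in tile_starts(width, tile, overlap)
    ]


# Hypothetical candidate region of 2500 x 2000 pixels.
windows = crop_windows(2500, 2000, tile=1000, overlap=0.5)
print(len(windows), windows[0], windows[-1])
```

Because neighboring windows share half of their pixels, predictions near a tile border can be taken from the tile in which they lie closer to the center, which is one common way to reduce seam artifacts.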
The first aspect to be analyzed is the accuracy of the framework. From the perspective of algorithm logic, the framework can reduce the occurrence of misclassification because the spectral index takes advantage of multiband spectral information to significantly reduce the adverse effects of "foreign objects with the same spectrum" that resemble mining areas in the R, G, B, and NIR bands. Therefore, the framework has a lower misclassification probability than the sliding window method, which is also evident from the indicator scores in Table V. Additionally, Fig. 8 shows that the sliding window method is prone to produce results with a cut-and-paste effect, significantly reducing the integrity of the extracted results. In contrast, the proposed framework does not exhibit such issues owing to overlapping cropping. The sliding window method can also use overlapping to improve detection accuracy, but a higher overlap degree reduces the detection speed, which is a fatal weakness for mining area recognition in large-scale scenes.
Then, the effectiveness of the framework is analyzed from the perspective of recognition speed. The advantage of this framework in terms of speed is obvious. For example, there is currently a very large study area, but there is only one mining area in this study area. Imagine if the sliding window method is used for calculation; hundreds or thousands of image patches would need input into the network. However, if this framework is used, the spectral index can directly locate the area where the mining area is located. Then, only a few image patches need to be input into the network through overlapping cropping. This order of magnitude comparison can prove the advantage of the framework in terms of speed. This section does not provide the corresponding network prediction time for different frameworks. Because, to demonstrate the universality of the method, the study areas used in this article have a relatively large number of mining areas, and the high overlap degree (50%) of the overlapping cropping algorithm causes the framework to require predicting many more image patches than the sliding window method.
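The order-of-magnitude argument above can be made concrete with a small patch-count calculation. The scene size, candidate-region size, and patch size below are hypothetical values chosen only to illustrate the comparison; they are not measurements from the study areas.

```python
import math


def sliding_window_patches(height, width, patch):
    """Patch count for a fixed-step, nonoverlapping sliding window over a scene."""
    return math.ceil(height / patch) * math.ceil(width / patch)


def overlapping_patches(height, width, patch, overlap=0.5):
    """Patch count for overlapping cropping over a region of the given size."""
    stride = max(1, int(patch * (1 - overlap)))

    def along(length):
        if length <= patch:
            return 1
        return math.ceil((length - patch) / stride) + 1

    return along(height) * along(width)


# Hypothetical scene of 20000 x 20000 pixels processed with 512-pixel patches.
whole_scene = sliding_window_patches(20000, 20000, 512)    # 1600 patches

# One candidate mining region of 1500 x 1500 pixels, cropped with 50% overlap.
one_candidate = overlapping_patches(1500, 1500, 512)       # 25 patches

# If the candidate regions covered almost the whole scene, the 50% overlap
# would instead require more patches than the sliding window.
dense_candidates = overlapping_patches(20000, 20000, 512)  # 6084 patches
```

Under these assumptions, a scene with a single mining area needs 25 patches instead of 1600, whereas a scene almost entirely covered by candidate regions needs 6084. The second case mirrors the situation in the study areas, which is why prediction times are not compared in this section.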

D. Limitations of the Proposed Method
In this framework, overlapping cropping is a last resort. Due to computational performance limitations, the size of the network's input cannot be arbitrarily large, which makes target recognition in large-scale remote sensing scenes challenging. It is therefore natural to use overlapping cropping to achieve large-scale scene detection. However, overlapping cropping causes a serious problem: when the target is large enough, cropping leaves the target incomplete within the image patch. At the same time, cropping also destroys the structural information of the scene formed by the target and its surrounding environment. Therefore, how to overcome the limitations on network input and design a deep learning network that is truly suited to the characteristics of remote sensing data is currently the most important issue that research should address.

E. Advantages and Practicality of China's Domestically Produced High-Resolution Satellite Data in Land and Resources Monitoring
With the rapid development of China's aerospace industry, a large number of domestically produced high-resolution satellites have been launched into space, playing an irreplaceable role in urban development planning, environmental monitoring, and disaster assessment in China. Especially in the field of land and resources monitoring, where annual comprehensive monitoring has given way to quarterly or even monthly monitoring, these satellites provide timely and effective data support for real-time supervision of natural resources. High-frequency monitoring in turn places higher requirements on the speed and accuracy of remote sensing data processing. Therefore, it is essential to conduct relevant technical research on natural resource monitoring based on China's domestically produced high-resolution satellite data.
Among the high-resolution satellites launched in China, the successful launch of GF-2 marks the beginning of the high-resolution era with a 1-m resolution. GF-5 is a rare comprehensive observation satellite for the atmosphere and land in the full spectrum range, and it is also an important scientific research satellite in China's high-resolution project. The hyperspectral camera carried by GF-5 is the world's first hyperspectral camera that simultaneously considers wide coverage and wide spectral range. It has 330 spectral channels in the spectral range from visible light to short-wave infrared (400-2500 nm), and its extremely high spectral resolution enables precise detection of object composition. The GF-5 image provides rich spectral information, which is suitable for spectral-based mineral exploration and can assist in quickly identifying the approximate location of the mining area. The high spatial resolution of the GF-2 image compensates for the lower spatial resolution of the GF-5 image and allows the scope and type of the mining area to be identified accurately. Therefore, this article uses the complementary advantages of China's domestic GF-2 and GF-5 data to address the challenges in fine-scale recognition of mining area types and to provide feasible technical support for land and resource monitoring.

V. CONCLUSION
This article proposes a stepwise framework and applies it to the recognition of mining area types using HRS imagery from Chinese satellites. The framework combines a novel mining area spectral index called NDMAI and a robust deep learning network called Mitformer. The main contributions and innovations of this article are as follows.
1) This article constructs a top-down, coarse-to-fine framework for mining area types recognition in large-scale scenes. The framework first utilizes GF-5 imagery to quickly determine the location of mining areas, followed by fine-scale types recognition using GF-2 imagery. By fully leveraging the advantages of different types of Chinese high-resolution satellite data, the framework is shown through experiments to effectively improve the accuracy of mining area types recognition and to avoid, as far as possible, misclassification caused by different objects with similar spectra. In classification tasks with sufficiently large scenes, it can significantly increase the speed of target recognition.
2) According to the spectral reflection characteristics of different land-cover types, this article uses GF-5 images to construct a simple and easy-to-calculate spectral index named NDMAI for the preliminary location of candidate mining areas. Qualitative and quantitative analysis shows that the index exhibits more stable and reliable performance in the study areas, which are mainly composed of gravel mines and metal mines. In particular, NDMAI shows a stronger ability to separate mining areas from built-up areas.
3) To efficiently obtain the category information of mining areas, this article proposes a deep learning network named Mitformer, consisting of an encoder, a feature enhancement layer, and a decoder. Mitformer introduces a novel multiscale feature enhancement module and a decoder based on multilevel skip connections, which achieve a full fusion of features at different layers of deep feature maps and add skip connections between low-level and high-level feature maps. The experimental comparison shows that Mitformer can achieve higher mining area recognition accuracy than other state-of-the-art networks. To some extent, it alleviates the problems of misclassifying mining area types and of difficulty detecting small mining areas.
This article provides a new approach for mining area types recognition research. However, it still has many limitations. For example, due to the difficulty of data acquisition, the selected study areas in this article mainly focus on sand and metal mines, with a relatively small number of types. In the future, the work could be expanded to more types of mining areas, such as coal mines. Alternatively, a more refined classification of metal mines, such as copper mines and gold mines, could also be explored. Additionally, incorporating multisource data (e.g., SAR and GIS data) will likely enhance the feature representation of different types of mining areas, enabling high-precision extraction of various types of mining areas.