Land-Cover Classification With High-Resolution Remote Sensing Images Using Interactive Segmentation

Deep convolutional neural networks (CNNs) have been increasingly applied in the interpretation of remote sensing images, such as automatically mapping land cover. Although automatic CNN methods achieve relatively high accuracy, many areas are still misclassified. Considering that this is still far from practical application, this paper proposes a semi-automatic auxiliary scheme for land cover classification whose core idea is an interactive segmentation network. A CNN first classifies images in a patch-wise manner to infer the rough positions and categories of objects. An interactive segmentation method is then proposed that accepts user clicks inside and outside an object to guide the model in segmenting the patches. The model also introduces different interaction modules to better integrate features of different scales. In addition, we create a large-scale sample library containing five common land cover categories, which covers Jiangsu Province, China, and includes both aerial and satellite imagery. On this sample library, we give a thorough evaluation of recent deep learning-based methods. The experimental results of our interactive segmentation also far outperform recent semantic segmentation methods, which provides a reference for semi-automatic land cover mapping.


I. INTRODUCTION
Land cover classification is of great significance to urban planning, environmental protection and ecological monitoring. Traditional classification methods mainly rely on machine learning algorithms to extract spectral and spectral-spatial features [1], [2], [3], [4]. Since the features extracted by these methods are handcrafted, their generalization and application to remote sensing images are limited. As the spatial resolution of captured imagery becomes higher and higher, the detailed information and structural features of ground object categories become clearer. Meanwhile, the powerful hierarchical feature extraction capability of convolutional neural networks (CNNs) can automatically obtain the contextual semantic information of objects in images, which makes pixel-level classification in high-resolution remote sensing images (HRRSI) possible. Many end-to-end CNN-based networks have been designed for remote sensing image interpretation.
Maggiori et al. [5] design a two-stage training method to cope with the lack of sufficiently accurate labeled training data, in which a multi-scale module is embedded to extract a fine classification map. Kussul et al. [6] use a multi-level CNN architecture combining supervised and unsupervised neural networks to classify land cover and crop types in multi-source satellite images; the final experimental results exceed those of a traditional fully connected multilayer perceptron (MLP). Liu et al. [7] utilize skip connections with residual units and an inception module in a CNN encoder-decoder architecture; the entire optimized architecture obtains excellent segmentation results on the Vaihingen and Potsdam datasets. Li et al. [8] propose a network named DASSN_RSI, which consists of a unique encoder-decoder architecture and a weight-adaptive loss function based on focal loss, and achieve decent results on the GID dataset [9]. In order to improve the generalization ability of models to multi-source satellite images, Tong et al. [9] designed a transfer learning scheme to implement land-cover classification. Certainly, the premise of such networks exerting powerful feature extraction performance is learning and training with large sample data. The construction of remote sensing sample platforms also promotes the iterative updating of land-cover semantic segmentation models. However, there are still two main issues regarding the land cover classification application of CNNs with HRRSI.

A. INSUFFICIENT PRACTICALITY OF MODELS
Many CNN-based deep models have achieved optimal classification for automatic segmentation on specific datasets, but further improving the generalization of a model requires continuous fine-tuning on new training data. Moreover, the imperfect segmentation of these models does not meet the accurate boundary requirements of actual land cover maps, which greatly reduces the versatility and practicability of the models in land cover classification.

B. LACK OF CHALLENGING LAND-COVER DATASET
The improvement of model segmentation performance is inseparable from more difficult and more accurately labeled data. At present, existing datasets lack complex scenarios, and model performance on them is close to saturation. High-resolution datasets are also relatively scarce and tend to have a single image data source, including only satellite images or only aerial images. Many land cover datasets annotate multiple categories on the same image, which can easily lead to imbalance between categories.
The main contributions of this paper are as follows. An auxiliary semantic segmentation framework for land cover mapping is proposed, in which the interactive segmentation model outperforms fully automatic semantic segmentation models. Compared with manual annotation, the whole scheme reduces human participation and can greatly improve efficiency.
A large-scale sample library is introduced which includes five datasets for semantic segmentation, namely a building dataset, greenhouse dataset, agricultural dataset, forest dataset, and road dataset. It mainly consists of 0.3-m resolution aerial images and 1-m resolution satellite images, covering most of Jiangsu Province, China. At the same time, a baseline for semantic segmentation using deep learning approaches is provided for subsequent research.
The remainder of the paper is organized as follows. In Section II, related works are introduced. In Section III, our auxiliary semantic segmentation framework and the experimental details are presented. In Section IV, the details and properties of JSample are described. Section V gives the results and discussion. Section VI concludes the paper.

II. RELATED WORK
A. LAND-COVER SEMANTIC SEGMENTATION DATASETS
WHU [10], AIRS [11] and SpaceNet MVOI [12] are all datasets used to study the extraction of building outlines. WHU provides both aerial and satellite images with a spatial resolution of 0.075 m, and the images, collected from all over the world, are accurately annotated. The area covered by the AIRS data is also very broad, reaching 457 km²; it provides roof outlines as ground truth for roof segmentation. SpaceNet MVOI annotates building footprints in Multi-View Overhead images with 27 unique looks, which greatly increases the challenge due to the different perspectives. CHN6-CUG [13] is constructed for road extraction and includes railways, highways, urban roads, and rural roads. The SpaceNet dataset [14] is created for building footprint and road network extraction from satellite imagery. Agriculture-Vision [15] annotates field anomaly patterns in 0.1-m resolution aerial images consisting of RGB and near-infrared (NIR) channels. The above datasets all focus on mapping and analysis of specific categories. ISPRS Vaihingen, ISPRS Potsdam, Zurich Summer [16] and Zeebruges [17] all annotate multiple land cover categories in high spatial resolution images, but mainly study semantic parsing in urban scenes, while DeepGlobe [18] and LandCover.ai [19] annotate images covering more rural areas. LandCoverNet [20] provides a global land cover dataset with medium resolution; due to the limitation of image spatial resolution, its annotated categories cannot be extended in a finer-grained manner. Both iSAID [21] and SkyScapes [22] have highly accurate, fine-grained annotations for pixel-level semantic labeling, and the labeled objects reach the instance level; however, the expensive labeling cost prevents them from covering a large area.
The MiniFrance suite [23], GID and LoveDA are all datasets made for specific tasks. The MiniFrance suite contains both labeled and unlabeled data, facilitating the study of semi-supervised learning. GID [9] contains 150 Gaofen-2 satellite images in which 5 major categories and 15 sub-categories are annotated for transfer learning of land-cover classification. The LoveDA [24] dataset collects images with 166,768 annotated objects in urban and rural areas for domain-adaptive semantic segmentation.

B. INTERACTIVE SEGMENTATION
Interactive segmentation allows human participation in segmenting objects of interest by providing user inputs such as bounding boxes or clicks. It has been developed for many years, and especially with the development of CNNs, many interactive deep segmentation models have emerged. Xu et al. [25] generate foreground and background maps from user-provided positive and negative clicks respectively and concatenate them with the input image to train an end-to-end CNN. Xu et al. [26] also transform a user-provided rectangle around an object into a Euclidean distance map, using the bounding box as a soft constraint, and concatenate it with the input image to feed into a CNN.
To make the user interaction information cooperate with the model to achieve better segmentation results, Liew et al. [27] propose RIS-Net, which expands the view of the given inputs to augment each local region and improve feature representation. Mahadevan et al. [28] develop iterative training of an interactive segmentation network by iteratively adding clicks based on the error area of the predicted mask during training, improving the click refinement process. Jang et al. [29] design the backpropagating refinement scheme (BRS), which constrains user-specified locations to correct mislabeled pixels. Hu et al. [30] propose a two-stream late fusion network (TSLFN) to improve the impact of user interactions on the prediction results. In contrast to methods which allow clicks at non-fixed positions on an object, DEXTR [31] uses extreme points (left-most, right-most, top, and bottom pixels) on the objects, which are encoded as Gaussians and concatenated to the image as CNN input to obtain precise object segmentation. IOG [32] leverages an inside point and two outside points to further simplify the interactive annotation process. Unlike pixel-level segmentation, Polygon-RNN [33], Polygon-RNN++ [34] and Curve-GCN [35] treat interactive segmentation as a polygon prediction task. Most interactive tools only separate foreground from background for binary interactive segmentation, while DISIR [36] focuses on interactive multi-class segmentation.

III. METHODOLOGY
To efficiently classify high-resolution RS images at the pixel level, we propose a scheme that trains two CNNs, which are pre-trained on labeled land cover datasets and can be applied to unlabeled high-resolution RS images. Suppose we define two kinds of data: an annotated source dataset (SD) and target data with no label information (TD). Our aim is to use the information learned by the CNN models from the source data (SD) to classify the target data (TD). Given a target image belonging to TD, it is divided into patches with a non-overlapping grid partition by sliding a window.
As shown in Fig. 1, the proposed scheme is divided into two stages, classification and positioning and refinement segmentation, which are presented in Sections III-A and III-B, respectively. Rough localization and category prediction of the target are performed using a CNN to search for useful segmentation regions, and then the interactive CNN is used to segment the specific outline of the target on the selected grid.

A. CLASSIFICATION AND POSITIONING
To efficiently use CNNs to classify large-scale high-resolution RS images, we use a semantic segmentation network called U-Net [37], trained with labeled data, to extract the rough location and category of the target in the large-scale image, which can filter out the background. The U-Net architecture is mainly composed of an encoder and a decoder. The encoder consists of repeated convolutional layers, ReLU activation layers and max pooling layers, and the different-level features of the target are extracted by continuous downsampling. The feature maps upsampled by the decoder are concatenated with the features of the corresponding encoder level to recover the semantic spatial information of the target. The entire architecture is an end-to-end fully convolutional network, and the final feature map is a pixel-level classification map for the image.
The target image is partitioned into non-overlapping patches by a grid whose cell size is equivalent to the input and output size of the U-Net model. Each patch m_i is input to the U-Net that has been pre-trained on SD. After forward propagation, the classification probability map at each pixel position x within the patch is obtained from the sigmoid layer, since the SDs mentioned above are all datasets with single-category labels.
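The non-overlapping grid partition can be sketched as follows. This is a minimal illustration; the patch size of 512 is an assumed value, since the paper leaves the grid size generic.

```python
import numpy as np

def partition_into_patches(image, patch_size):
    """Split an H x W x C image into non-overlapping patches on a sliding grid.

    Edges that do not fill a whole patch are dropped for simplicity; in
    practice the image could instead be padded to a multiple of patch_size.
    """
    h, w = image.shape[:2]
    patches, positions = [], []
    for y in range(0, h - patch_size + 1, patch_size):
        for x in range(0, w - patch_size + 1, patch_size):
            patches.append(image[y:y + patch_size, x:x + patch_size])
            positions.append((y, x))  # top-left corner of each patch
    return patches, positions

# Example: a 1024 x 1024 RGB image split into four 512 x 512 patches.
image = np.zeros((1024, 1024, 3), dtype=np.uint8)
patches, positions = partition_into_patches(image, 512)
print(len(patches))  # 4
```

Each returned patch would then be passed through the pre-trained U-Net to obtain its per-pixel probability map.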
Although U-Net can classify pixel by pixel, it is still difficult to obtain a sufficiently accurate segmentation of parts of the target; however, a rough overall positioning of the target is enough to meet the requirements of this stage. We compare the probability values in patch m_i with a threshold σ. If a value is less than σ, the pixel is classified as background; if it is greater than σ, it is considered part of the target area. After removing plain background patches from the patch set M, the remaining set of patches is kept for accurate segmentation in the next stage.
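The background-filtering step can be illustrated as below. This is a sketch under assumptions: the threshold value and the use of the per-patch maximum probability are hypothetical choices, since the paper does not specify how the per-pixel probabilities are aggregated to decide whether a patch is plain background.

```python
import numpy as np

SIGMA = 0.5  # illustrative threshold; the paper's sigma is a tunable value

def filter_background_patches(prob_maps):
    """Keep indices of patches whose probability map suggests a target.

    prob_maps: list of 2-D arrays of per-pixel sigmoid probabilities,
    one per patch. A patch is kept if any pixel exceeds the threshold
    (an assumed aggregation rule for this sketch).
    """
    kept = []
    for i, prob in enumerate(prob_maps):
        if prob.max() >= SIGMA:  # some pixel looks like the target class
            kept.append(i)       # index into the original patch set M
    return kept

probs = [np.full((4, 4), 0.1), np.full((4, 4), 0.9)]
print(filter_background_patches(probs))  # [1]
```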

B. REFINEMENT SEGMENTATION
Inspired by interactive segmentation methods, we propose an auxiliary segmentation method to finely segment the target. The idea of the interactive segmentation model (ISM) is to accurately guide the model to extract the features of the target by adding additional annotation points. Our scheme obtains more accurate segmentation results by combining the prior location information of the target clicked by the user with the feature extraction advantages of CNNs.
Given the patch set of a target image X_T, the user labels three points of a selected object and obtains a mask prediction for it. Looping through each object, we end up with the desired segmentation maps for the patches. As shown in Figure 2, in an interactive interface, three guiding points are provided by clicking on the target in the patch, as in IOG [32]: either the top-left and bottom-right or the top-right and bottom-left clicks, plus an inner-center click. The first two clicks, which indicate background regions, form a roughly tight bounding box surrounding the target. The third click (indicating the foreground region) is sampled around the object center, farthest from the object boundary. The foreground and background clicks are encoded as two heatmaps by centering a 2D Gaussian around each click. In order to obtain more contextual information, the bounding box defined by the background clicks is relaxed by a certain width before cropping the region of interest from the image. The RGB channels of the cropped region are concatenated with the two generated heatmaps to form a 5-channel input for the segmentation network. Our segmentation network adopts DeepLabV3+ [38] as the baseline, as shown in Figure 3. Multi-level feature maps are obtained from a pre-trained ResNet101 backbone [39]. Self-Interaction Modules (SIM) and Aggregate Interaction Modules (AIM), the same as in MINet [40], are connected in a cascaded order. They are utilized to integrate the features at different levels and scales extracted by the backbone network, to better deal with the scale variation of objects in RS images. Features of adjacent layers from DeepLabV3+ are efficiently integrated by the AIMs. This ensures that the output features of each AIM not only contain the information at the current resolution, but also efficiently supplement and fuse the relevant information of the higher and lower resolutions. The SIMs further learn from the multi-scale information generated by the AIMs.
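The click encoding into a 5-channel input can be sketched as follows; the Gaussian standard deviation and the helper names are assumptions for illustration, not the authors' exact implementation.

```python
import numpy as np

def gaussian_heatmap(shape, center, sigma=10.0):
    """Render a 2-D Gaussian centered on a click position (row, col)."""
    yy, xx = np.mgrid[0:shape[0], 0:shape[1]]
    d2 = (yy - center[0]) ** 2 + (xx - center[1]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

def build_five_channel_input(rgb, bg_clicks, fg_click):
    """Concatenate the RGB crop with background/foreground click heatmaps.

    The two background (bounding-box) clicks share one heatmap channel,
    taken as the pixel-wise maximum of their Gaussians; the foreground
    (inner-center) click gets its own channel.
    """
    h, w = rgb.shape[:2]
    bg = np.max([gaussian_heatmap((h, w), c) for c in bg_clicks], axis=0)
    fg = gaussian_heatmap((h, w), fg_click)
    return np.dstack([rgb, bg, fg])  # H x W x 5 network input

rgb = np.zeros((64, 64, 3), dtype=np.float32)
x = build_five_channel_input(rgb, bg_clicks=[(0, 0), (63, 63)], fg_click=(32, 32))
print(x.shape)  # (64, 64, 5)
```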
The AIMs and SIMs are integrated into a cascaded architecture. The whole model is supervised via a binary cross-entropy loss.
1) EVALUATION METRICS
To evaluate the performance of our dataset on semantic segmentation models, Intersection over Union (IoU) and F1-score are used as evaluation metrics. The IoU and F1-score are formulated as follows:

IoU = TP / (TP + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = (2 × Precision × Recall) / (Precision + Recall)

where TP, FP and FN represent the number of pixels correctly classified as objects (true positives), the number of pixels misclassified as objects (false positives) and the number of pixels misclassified as background (false negatives), respectively. Precision is the proportion of true positives among all pixels predicted as positive, while recall is the proportion of true positives among all actually positive pixels.
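A minimal implementation of these metrics from binary masks:

```python
import numpy as np

def iou_f1(pred, gt):
    """Compute IoU and F1 from binary masks (1 = object, 0 = background)."""
    tp = np.sum((pred == 1) & (gt == 1))  # true positives
    fp = np.sum((pred == 1) & (gt == 0))  # false positives
    fn = np.sum((pred == 0) & (gt == 1))  # false negatives
    iou = tp / (tp + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return iou, f1

pred = np.array([[1, 1], [0, 0]])
gt   = np.array([[1, 0], [1, 0]])
iou, f1 = iou_f1(pred, gt)
print(round(iou, 3), round(f1, 3))  # tp=1, fp=1, fn=1 -> 0.333 0.5
```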
To verify the results of our interactive segmentation network, we additionally adopt the contour accuracy metric [41], [42] to focus on the boundary quality of the segmentation, defined as below:

F(i pixel) = (2 × Precision(i pixel) × Recall(i pixel)) / (Precision(i pixel) + Recall(i pixel))

where Precision(i pixel) and Recall(i pixel) are the contour-based precision and recall under a pixel tolerance of i, obtained by dilation. Tolerances of 0, 1 and 2 pixels are set on our dataset to measure how close the prediction is to the ground truth. F(0-2 pixel) is the average of F(0 pixel), F(1 pixel) and F(2 pixel).
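A rough sketch of contour accuracy under a pixel tolerance, using simple 4-neighbour boundary extraction and dilation. The cited works [41], [42] define the metric via boundary matching; this simplified dilation-based variant is an assumption for illustration, not the benchmark implementation.

```python
import numpy as np

def boundary(mask):
    """Boundary pixels: object pixels with at least one background 4-neighbour."""
    padded = np.pad(mask, 1)
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1] &
                padded[1:-1, :-2] & padded[1:-1, 2:])
    return mask & ~interior

def dilate(mask, rounds):
    """rounds of 4-neighbour binary dilation (0 rounds = identity)."""
    out = mask.copy()
    for _ in range(rounds):
        padded = np.pad(out, 1)
        out = (padded[1:-1, 1:-1] | padded[:-2, 1:-1] | padded[2:, 1:-1] |
               padded[1:-1, :-2] | padded[1:-1, 2:])
    return out

def contour_f(pred, gt, tol):
    """F-measure between predicted and ground-truth contours at tolerance tol."""
    bp, bgt = boundary(pred), boundary(gt)
    precision = (bp & dilate(bgt, tol)).sum() / max(bp.sum(), 1)
    recall = (bgt & dilate(bp, tol)).sum() / max(bgt.sum(), 1)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

pred = np.ones((8, 8), dtype=bool)
gt = np.ones((8, 8), dtype=bool)
print(contour_f(pred, gt, 0))  # identical contours -> 1.0
```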

2) TRAINING AND TESTING DETAILS
In order to verify each subset of each dataset, each subset is divided into training, validation, and test sets at a ratio of 6:2:2 for the experiments. Common data augmentation methods such as random cropping, horizontal flipping and Gaussian noise were applied when training the models.
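The augmentations named above can be sketched jointly on an image/mask pair; the crop size and noise scale are illustrative assumptions, not the paper's training settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image, mask, crop=256):
    """Random crop + horizontal flip + Gaussian noise.

    The crop and flip are applied jointly to the image and its label mask
    so they stay aligned; the noise is added to the image only.
    """
    h, w = image.shape[:2]
    y = rng.integers(0, h - crop + 1)
    x = rng.integers(0, w - crop + 1)
    image = image[y:y + crop, x:x + crop].astype(np.float32)
    mask = mask[y:y + crop, x:x + crop]
    if rng.random() < 0.5:                              # horizontal flip
        image, mask = image[:, ::-1], mask[:, ::-1]
    image = image + rng.normal(0.0, 5.0, image.shape)   # Gaussian noise
    return image, mask

img = np.zeros((512, 512, 3), dtype=np.uint8)
msk = np.zeros((512, 512), dtype=np.uint8)
a_img, a_msk = augment(img, msk)
print(a_img.shape, a_msk.shape)  # (256, 256, 3) (256, 256)
```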
In the dataset evaluation stage, all data are fed into the model at their original size without scaling. When training the interactive segmentation model, we resized the cropped instance target to 512 * 512 before inputting it into the model. All networks were implemented using the PyTorch 1.7 framework in a Windows 10 environment. All experiments were performed on a single NVIDIA GeForce RTX 3090 with 24 GB of memory. The backbones used in all networks were ResNet [39] pre-trained on ImageNet. For the interactive segmentation model, the Adam optimizer was used with a momentum of 0.9, a learning rate of 10^−8, a batch size of 10 and a weight decay of 5 × 10^−4. For the semantic segmentation models, the learning rate, momentum, batch size and weight decay are set to 10^−4, 0.9, 8 and 10^−4, respectively.

IV. DATA DESCRIPTION FOR JSAMPLE
A. GEOGRAPHIC DISTRIBUTION OF CATEGORY
We selected five common categories for sample production: agriculture, forest, building, road, and greenhouse. First, we calculated the spatial distribution of these ground objects in Jiangsu Province based on the geographic and national conditions data products of Jiangsu Province. Second, areas of interest were selected according to the distribution density of each feature in Jiangsu Province: we expanded the sampling ratio in dense areas and also conducted a certain amount of sampling in sparse areas. Data sources with different resolutions and different sensors were obtained to construct the five datasets, respectively (Figure 4). We annotated these five feature types using ArcGIS software to produce high-quality maps. Unprocessed images have extremely large sizes, which poses significant challenges to deep network training in terms of computation time and memory consumption, so we performed cropping operations on the labeled data for ease of use. Manual editing, expert inspection and refinement were done on the aerial and satellite imagery to build up JSample, and field verification ensures the accuracy of JSample.

B. STATISTICS FOR JSAMPLE
1) BUILDING DATASET
The data for this dataset come from 0.3-m and 0.5-m resolution aerial imagery and 1-m and 2-m resolution satellite imagery. We provide six sub-datasets from different data sources (Figure 5). Among them, the 0.3-m resolution aerial image data have a higher spatial resolution and contain more specific feature information; due to the different perspectives, high-rise buildings in particular show more side features in addition to roof features, which places higher requirements on the robustness of the model. For the 0.5-m resolution aerial images, we strictly divided the rural and urban areas. Rural areas are often dominated by low-rise, scattered individual buildings; most of them are irregularly arranged, and the surrounding areas are mostly natural landscape. On the contrary, the buildings in urban areas are neatly arranged and surrounded by man-made landforms. There is a huge gap in the contextual semantic information of buildings in different areas, which is helpful for research on model domain adaptation. On the 1-m resolution satellite imagery, we also labeled the visible part of each building. On the 2-m resolution satellite imagery, building groups of different sizes are formed because many buildings are adjacent to each other. Due to the decline in image resolution, the interval between buildings may be less than one pixel, and it is difficult to distinguish the clear boundaries of individual buildings; therefore, pixel-level instance annotation of these building groups is difficult to complete. We finely annotated the individual buildings and roughly annotated the building groups in the selected area, and they are divided into two sub-datasets.
The dataset contains 240,323 individual buildings, covering both urban and rural areas in various regions of Jiangsu Province. In total our dataset covers an area of 1141 km². The labels, which are transformed into binary rasters, and the corresponding original images are batch cropped to the same size, wherein the aerial imagery is seamlessly cropped to 1024 * 1024, while the satellite imagery is cropped to 256 * 256. Building pixels account for 23% of the annotations, and the remaining 77% is background, more than three times the size of the target class (Table 1).
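The class-balance statistics reported in Table 1 amount to a positive-pixel ratio over the cropped binary label rasters, which can be computed as below (a trivial sketch with hypothetical toy masks):

```python
import numpy as np

def positive_pixel_ratio(masks):
    """Fraction of annotated (positive) pixels across a set of binary masks.

    This is the quantity reported per dataset in Table 1, e.g. 23% building
    pixels versus 77% background.
    """
    pos = sum(int(m.sum()) for m in masks)
    total = sum(m.size for m in masks)
    return pos / total

masks = [np.array([[1, 0], [0, 0]]), np.array([[1, 1], [1, 0]])]
print(positive_pixel_ratio(masks))  # 4 positives out of 8 pixels -> 0.5
```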

2) AGRICULTURE DATASET
Agriculture data were collected from 0.3-m resolution aerial imagery and 0.8-m and 1-m resolution satellite imagery over numerous fields in Jiangsu Province, primarily consisting of paddy land, dry land and irrigated land in rural areas. All images in the agriculture dataset were collected during the growing seasons between 2017 and 2019. In order to reduce the redundancy of the sample distribution in our farmland dataset, we selected farmland samples in different landscapes.
As shown in Figure 5, parcels have different shapes and sizes under the natural scene distribution. They are regularly arranged and are often separated by roads, rivers, etc. Compared with satellite images, the texture features of parcels are particularly clear in the high-resolution aerial imagery. Adjacent parcels that cannot be separated by a clear dividing line are merged into the same target for labeling.
It covers a total of 691 km². We simply cropped the annotated region into non-overlapping 256 * 256 tiles on satellite imagery and 1024 * 1024 tiles on aerial imagery. Farmland pixels account for 48.9% of the entire sample data (Table 1), which fully retains enough contextual semantic information. The dataset contains agricultural data with different texture structures, which provides a basis for research on farmland extraction with visible-light data.

3) FOREST DATASET
Our forest-grass dataset consists of three types, green belt, woodland and garden land, covering both rural and urban areas. It consists of 0.3-m resolution aerial imagery and 1-m resolution satellite imagery including only RGB (Figure 2). Woodland often has curved and irregular borders, while green belts often appear as small patches on both sides of roads, which makes labeling very difficult and requires a lot of checking time, finally yielding a clean dataset.
The forest dataset contains two sub-datasets covering 212 km², namely aerial data and satellite data, which are cropped into 1024 * 1024 and 256 * 256 tiles, respectively. The average proportion of forest pixels is 46.9% (Table 1). With enough forest pixels and sufficient semantic information, the dataset is well suited to studying forest extraction.

4) ROAD DATASET
We collected regions of interest with different types of road surfaces, in rural and urban areas, to construct our road dataset in Jiangsu Province. The data sources are 0.8-m resolution satellite images and 0.3-m resolution aerial images, which consist of 3 channels (red, green and blue). Due to the particularity of the railway road surface, its difference from other roads is particularly obvious in high-resolution images. We divided the collected 0.3-m resolution aerial image data into two subsets, a railway dataset and a non-rail road dataset, which helps promote algorithm models for the extraction of fine-grained targets in remote sensing images. Our final dataset contains three subsets, shown in Figure 5, with roads under different scenarios.
The spatial distribution of roads in an image is uneven; roads are usually coherent, but their proportion of pixels is not high, so large-scale images and the corresponding annotated masks are batch cropped to produce our road sample. The sizes of the aerial data and satellite data are 1024 * 1024 and 256 * 256, respectively. The final road dataset consists on average of 5.6% positive pixels and 94.4% negative pixels (Table 1). It consists of a total of 4360 images and spans a total land area of 347 km².

5) GREENHOUSE DATASET
Greenhouses are often located in rural areas. They are made of white translucent plastic film, which is quite different from farmland in remote sensing images. Accurate detection of greenhouses is conducive to ensuring China's agricultural modernization and sustainable development. There are many such ground objects in Jiangsu Province, and we treat them as a category for which to establish a dataset. As shown in Figure 5, the interior of a greenhouse is neatly arranged, in sharp contrast with the surrounding environment, and the texture characteristics of the greenhouse itself are easy to distinguish from the surroundings. The greenhouse dataset was collected from 0.3-m resolution aerial images and 1-m resolution satellite images.
It covers 211 km² and includes a total of 20,228 instances. The final dataset contains two sub-datasets according to the different data sources. All samples are cropped to a size of 256 * 256. Greenhouse pixels account for an average of 60.8% of the total samples (Table 1). Objects with a high pixel ratio better reflect the feature information of the target while still retaining semantic information.

1) BUILDING DATASET
It is obvious that the IOU score of U-Net++ on the building dataset is better than those of the other models, and its scores on the 0.5-m and 1-m resolution buildings are the highest; a possible reason is that its nested cross-scale connections are more suitable for extracting and retaining multi-scale features of buildings. The comparison results suggest that buildings in rural areas may be easier to detect than those in urban areas, due to the relatively denser buildings and more complex environments in cities. The main difference between Buildinggroup and Singlebuilding is the fineness of the labeling: high-density building instances in 2-m resolution images are combined into a Buildinggroup for labeling, while Singlebuilding is annotated as independent individuals. We found that Buildinggroup performs very stably under all baseline methods, while Singlebuilding performs relatively poorly. Our analysis is that the single-building targets are relatively small in the image, which increases the difficulty of model learning, while the Buildinggroup targets are obvious, which is more conducive to positioning and segmentation. Overall, as the resolution of our building dataset decreases, the segmentation results get worse and worse, which may be caused by the following reasons. 1. More training data come from aerial imagery than from satellite imagery, so the model can learn more fully from the aerial datasets. 2. The reduction in resolution directly affects the clarity of the internal structure of buildings, which cannot provide sufficient texture information for the model to learn. 3. Buildings in aerial images contain more side information and are easier to distinguish, while satellite images mainly show roof information.

2) AGRICULTURAL DATASET
On the agricultural dataset, similar segmentation scores are obtained within the corresponding sub-datasets. The IOU scores of the different baseline segmentation models can reach about 88%, while the satellite image data show the opposite. In particular, our two satellite subsets perform very differently on the IOU metric, with a difference of nearly 30%. Our analysis suggests the following reasons. 1. The farmland in satellite images has certain chromatic aberration disturbances in different regions. 2. The farmland at 0.8-m resolution is more regularly divided and its internal texture is clear, while the farmland at 1-m resolution lies mainly in hilly areas, where the plots are much more cluttered, which challenges the segmentation ability of the models. As a whole, the results of the various baseline models on the agricultural dataset show that they are still far from practical application, and the generalization ability of the models needs to be further improved by researchers.

3) ROAD DATASET
The average road/non-road pixel ratio of our data is highly skewed, indicating an obvious imbalance between positive and negative training samples. The results on our road dataset do not perform very well compared to the other categories, making it the most challenging. On the 0.3-m resolution data source, our non-rail road subset outperforms the railway subset on the semantic segmentation models by almost 25%. This further verifies that there are obvious feature differences between railways and non-rail roads, and that railways are more difficult to identify. The segmentation results on our satellite dataset also need to be improved; the IOU value of each model is only about 40%.

4) FOREST DATASET
Our forest dataset achieves an average IOU of 65% across the baseline semantic segmentation models, which fully demonstrates the feasibility of using the RGB visible-light bands to extract forest in high-resolution images. The extraction results on satellite images are not as good as on aerial images, with an average IOU of only about 55%, which leaves more room for model improvement and optimization. Through careful comparison, we find that there are often holes in the segmentation results for large forest areas; the poor segmentation results may be caused by two aspects. 1. The heterogeneity within the forest class confuses model recognition. 2. In rural areas, a large forest can occupy the entire image and lack contextual information.

5) GREENHOUSE DATASET
Compared with the other four categories, our greenhouse dataset achieves about 92% IOU on both aerial images and satellite images for each semantic segmentation model, which is the best result. The particular shape and structure of greenhouses make them easier to distinguish from the background than the other classes. The semantic segmentation models can fully learn the features and show good generalization ability on this dataset.

B. PERFORMANCE OF INTERACTIVE SEGMENTATION NETWORK UNDER THE ISM FRAMEWORK
Our interactive segmentation model is evaluated on each of our five datasets. Its results on each dataset are compared with the best-performing non-interactive semantic segmentation model (baseline) to verify whether click interaction based on bounding-box points and interior points (BBI) [32] can significantly enhance the segmentation ability of the model. In addition, to further improve segmentation of multi-scale targets, we use either the Feature Pyramid Network (FPN) [50] or the Multiscale Interactive Network (MIN) [40] in the structural part of the interactive segmentation model to fuse the multi-scale feature layers extracted by the backbone network, and thereby verify the effectiveness of the MIN used in this paper. Likewise, quantitative and qualitative results are provided in Figures 8 and 9.
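The BBI interaction supplies the network with user clicks in addition to the RGB bands. As a hedged sketch of one common encoding (the exact encoding used in [32] may differ), the clicks can be rendered as Gaussian heat maps and concatenated with the image as extra input channels:

```python
import numpy as np

def click_map(h, w, clicks, sigma=10.0):
    """Render a list of (row, col) clicks as one Gaussian heat-map channel."""
    ys, xs = np.mgrid[0:h, 0:w]
    m = np.zeros((h, w), dtype=np.float32)
    for r, c in clicks:
        g = np.exp(-((ys - r) ** 2 + (xs - c) ** 2) / (2 * sigma ** 2))
        m = np.maximum(m, g)          # keep the strongest response per pixel
    return m

def build_input(image, pos_clicks, neg_clicks):
    """Stack RGB with positive (interior) and negative (box) click maps."""
    h, w = image.shape[:2]
    pos = click_map(h, w, pos_clicks)
    neg = click_map(h, w, neg_clicks)
    return np.dstack([image, pos, neg])   # H x W x 5 network input
```

With this encoding, the backbone's first convolution simply takes five input channels instead of three; each additional user click only changes the heat-map channels, so refinement requires a new forward pass but no retraining.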

1) EFFECTIVENESS OF THE BBI
As shown in Figure 8, the BBI module brings a large improvement over the baseline model on every metric across the five datasets. The IOU scores on the greenhouse, forest, agricultural, road, and building datasets increased by an average of 3.9%, 19 26.82 in the F(0-2 pixel) indicator. Buildings account for a large proportion of the elements in the final map, and BBI brings the most significant improvement on the building dataset; the greenhouse dataset improves the least, but the baseline model there already performs strongly, leaving little room for improvement.

2) EFFECTIVENESS OF THE MIN
We compared the MIN and FPN plugins under our ISM framework and find that BBI+MIN further improves all three indicators over BBI+FPN (Figures 8, 9). This shows that MIN combines high-level semantic features with low-level precise localization better than FPN, achieving the best segmentation results. In particular, the F(0-2 pixel) indicator improves by 3.42%, 2.57%, 2.36%, 1.87%, and 1.65% on the five datasets in our sample library, respectively, which shows that the MIN module can further improve segmentation of target boundaries.

3) ARTIFICIAL REFINEMENT
The last step of our auxiliary segmentation tool requires human operation. The coarse segmentation map obtained through BBI+MIN cannot be used directly for mapping or as a high-precision label. We first convert the coarse segmentation map into a vector map. As shown in Figure 10, the main body of the target needs no modification; the errors lie mainly on the boundary. The contour points are then fine-tuned against the target boundary in the actual image, and finally our high-precision vector product is generated. The whole process greatly reduces manual mapping time.
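Converting the raster mask to a vector map leaves contour vertices that the operator drags onto the true boundary, so keeping the vertex count manageable typically relies on polygon simplification. Below is a minimal sketch of the standard Douglas-Peucker algorithm, assuming the contour has already been extracted from the coarse mask; the actual vectorization routine of our tool is not specified in the paper.

```python
def point_line_dist(p, a, b):
    """Perpendicular distance from 2-D point p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return ((px - ax) ** 2 + (py - ay) ** 2) ** 0.5
    # |cross product| / base length = height of the triangle
    return abs(dy * px - dx * py + bx * ay - by * ax) / (dx * dx + dy * dy) ** 0.5

def simplify(points, eps):
    """Douglas-Peucker: keep only vertices deviating more than eps."""
    if len(points) < 3:
        return list(points)
    # Find the vertex farthest from the chord between the endpoints.
    dmax, idx = 0.0, 0
    for i in range(1, len(points) - 1):
        d = point_line_dist(points[i], points[0], points[-1])
        if d > dmax:
            dmax, idx = d, i
    if dmax <= eps:
        return [points[0], points[-1]]   # everything within tolerance
    # Otherwise split at the farthest vertex and recurse on both halves.
    left = simplify(points[: idx + 1], eps)
    right = simplify(points[idx:], eps)
    return left[:-1] + right
```

A near-collinear run of boundary pixels collapses to its two endpoints, while genuine corners survive, so the operator only adjusts vertices that carry real shape information.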

VI. CONCLUSION
A human-machine-assisted land cover mapping framework is proposed. First, a semantic segmentation network roughly locates the objects in the remote sensing image and classifies their categories; then an interactive segmentation model finely segments the objects; finally, the objects are refined manually. Our experiments show that the interactive segmentation model greatly improves segmentation accuracy and makes semi-automatic mapping possible.
We introduce JSample, a large sample library of remote sensing scenes for semantic segmentation. JSample consists of common land cover categories; for each category we constructed a high-quality single-element segmentation dataset derived from multi-source aerial and satellite imagery. Complex scenes, high-precision annotation, and multiple data sources characterize JSample, which helps researchers evaluate and validate new algorithms. In addition, we evaluated the current mainstream semantic segmentation models and analyzed their performance on JSample in detail. The results show that the fully automated semantic segmentation model is still far from practical application.

AUTHOR CONTRIBUTIONS
Leilei Xu: Validation, formal analysis, investigation, writing-original draft preparation, and writing-review and editing; Yujun Liu and Shanqiu Shi: Conceptualization, methodology, supervision, and project administration; Hao Zhang and Dan Wang: Software, resources, data curation, and visualization. All authors have read and agreed to the published version of the article.
LEILEI XU received the M.S. degree in surveying and mapping engineering from Hohai University, Nanjing, China. His research interests include object detection and semantic segmentation.
YUJUN LIU is currently pursuing the Ph.D. degree with Institute of Geographic Sciences and Natural Resources Research, CAS with a focus on deep learning applied to remote sensing imagery processing. His research interests include computer vision and machine learning.
SHANQIU SHI is currently a Senior Engineer with the Jiangsu Provincial Geomatics Centre. His research interests include geographical national condition monitoring, spatial data processing, and deep learning.
HAO ZHANG is currently a Senior Engineer with the Jiangsu Provincial Geomatics Centre. His research interests include geographical national condition monitoring, spatial data processing, and deep learning.
DAN WANG was born in 1987. She received the M.S. degree in cartography and geographic information system from Nanjing Normal University, in 2012. She is currently a Senior Engineer with the Jiangsu Provincial Geomatics Centre. Her research interests include technical research and engineering practice in smart cities, spatial data processing, and deep learning.
VOLUME 11, 2023