Edge Detection With Direction Guided Postprocessing for Farmland Parcel Extraction

Farmland is a significant resource for human survival and development. Rapid acquisition of farmland information is the basis for dynamic crop detection and sustainable land development. The continuous development of high-resolution remote sensing imagery makes refined earth observation over wide areas possible. With their stronger image interpretation ability, image segmentation methods based on deep learning can produce detailed results from high-resolution imagery and are widely used in remote sensing. However, existing methods based on semantic segmentation have difficulty extracting refined farmland parcels. In this article, a deep neural network is used to detect farmland edges. We use a high-resolution network for feature extraction that retains high-resolution features, strengthen the representation of context information with an object-contextual representations module, and thereby achieve a more complete interpretation of farmland and its boundaries. We then design a farmland edge postprocessing method that connects disconnected boundaries based on the direction information generated by a connectivity attention module, and finally obtain farmland boundaries that are complete enough to be closed into farmland parcels. To verify our method, we labeled farmland boundaries on Google Earth imagery and conducted experiments. The results show that the proposed model achieves higher precision in farmland edge detection, and that the boundary connection postprocessing effectively closes the boundary lines and yields more detailed and complete farmland parcels.

requirement for precision agriculture and sustainable development [2]. Precision agriculture seeks to obtain benefits efficiently and economically with the support of advanced technology [3]. Farmland is the basic unit of agricultural operation and production, so efficiently obtaining refined farmland parcels is the basis of precision agriculture research.
Manual ground surveys are costly and inefficient. The development of remote sensing technology provides a data basis for large-scale and efficient farmland parcel research, and producing farmland parcels through manual visual interpretation has become the mainstream method in practical application and research [4]. However, this method still fails to solve the problem of inefficiency. Classic image segmentation and edge detection methods, such as watershed algorithms [5] and Canny [6], were first tried in farmland extraction [7], [8], [9]. Object-based image analysis is widely used in automated farmland parcel extraction. Based on the brightness gradient and texture information of the image, this type of method uses super-pixel methods to segment the remote sensing image into multiple independent and homogeneous objects, and classifies the segmented objects with SVM, random forest, or other machine-learning methods to obtain farmland parcel results [10], [11], [12], [13], [14]. However, as the resolution of remote sensing images improves, the richer texture of ground objects limits the performance of traditional image segmentation methods.
Deep-learning models have achieved good results in different types of image processing problems, and scholars have also applied them in remote sensing [28], [29], [30], [31], [32], [33], [34], [35], [36], [37]. Research has shown that semantic segmentation methods give more accurate and robust predictions on high-resolution satellite or aerial imagery [38], [39], [40], [41], [42]. Semantic segmentation models have been applied and modified for farmland extraction [43], [44], [45], [46], [47], based on context representations, multiscale or pyramid features, and attention modules. These networks benefit from enhanced semantic representations [6] obtained through larger receptive fields and deeper networks. As a side effect, complex contours between objects may be neglected or become ambiguous, and this issue is more severe in refined farmland parcel extraction. Details inside farmland are easily lost, and the models can only produce coarse results. Hence, edge detection methods have been tried for farmland parcel extraction, since they can focus on the junction area between farmland parcels and recover more details [48], [49]. However, breakpoints occur in the results of edge detection models, which makes it difficult to produce vector farmland parcels.
In this article, we propose a process based on edge detection that extracts the boundaries between farmland parcels with a deep-learning semantic edge detection model and then closes the boundaries into vector parcel results. The proposed semantic edge detection model is built on a high-resolution feature extraction network, and its representation of context information is enhanced by the object-contextual representations (OCR) module. So that as many extracted boundaries as possible can be closed into polygons, we propose a postprocessing method based on breakpoint connection. It introduces the connectivity attention module (ConAttn) into the edge detection model to generate direction information through supervised training, and connects the breakpoints based on this direction information and the edge confidence. The extracted boundaries thus become as complete as possible, and more accurate farmland parcels are obtained.
The rest of this article is organized as follows. Section II describes the proposed network architecture. Section III presents experiments and the corresponding analysis of the edge detection models and the proposed postprocessing methods. Section IV discusses the advantages of edge training labels. Finally, Section V concludes this article with some remarks and hints at plausible future research lines.

II. METHODS
In this section, we design an edge detection network based on a semantic segmentation structure. The feature extraction part contains multiple parallel feature branches at different resolutions, which enable the network to learn strong semantic information and precise position information at the same time; each branch is composed of multiple residual blocks. The upsample part adds the enhanced feature representation to the context information and, through convolution, generates multiscale features that represent both object information and background information. This information strengthens the extraction of farmland features. Further components are designed to enhance the farmland edge: a ConAttn generates direction information, and an automatic boundary line connection method is introduced to close the edge. Finally, closed edge lines are converted into polygons as farmland parcels.

A. Semantic Edge Detection Architecture
For refined farmland parcel extraction, a traditional semantic segmentation model trained with polygon labels can accurately identify large farmland areas. However, such a model is not sensitive to the edges between farmland parcels: adjacent parcels cannot be identified separately, and refined parcels cannot be obtained. Therefore, we use farmland edges as labels, which calls for a semantic edge detection model. In our experiments, traditional edge detection models had difficulty converging on the farmland edge detection task. Such networks are designed for the edges of all objects in natural images and focus on low-level image features, whereas farmland edges carry stronger semantic information, so the network needs to fully extract the semantic features of farmland. We therefore use a semantic-segmentation-based network as our final farmland edge detection network structure.
A semantic segmentation network contains a feature extraction part and an upsample part. The feature extraction part computes image representations with multiple convolutional layers and obtains deep semantic features as the feature resolution decreases with network depth. Based on the spatial position information of the extracted deep semantic features, the upsample part restores the resolution of the image representations to that of the input image, and a classifier layer finally outputs the class of each pixel to obtain the pixel-level classification result. In the edge detection task in this article, the semantic-segmentation-based structure takes high-resolution remote sensing satellite imagery as input, extracts farmland semantic information, and outputs pixel-level farmland edge results.

B. High-Resolution Network
High-resolution network (HRNet), originally proposed by [50], is a feature extraction network that maintains high-resolution feature information. Commonly used semantic segmentation networks take the feature extraction layers of an image classification network as the backbone, so these backbones usually shrink the feature resolution as the network deepens in order to learn deep semantic features. However, a semantic segmentation network ultimately needs to output pixel-level predictions, so it must also upsample the low-resolution feature maps. This process of downsampling and then upsampling the feature maps loses many spatial details, which limits the accuracy of the pixel-level classification results. To alleviate this problem, common semantic segmentation networks, such as SegNet [21] and U-net [22], introduce features from the feature extraction layers into the upsampling process, but a simple skip connection combined with shallow features lacks feature consistency. HRNet retains feature information at different scales through parallel multiresolution branches, and preserves both strong semantic information and spatial location information in the features through information interaction between the branches, which makes it appropriate for high-resolution satellite imagery with complicated details.
The overall structure of our proposed network is shown in Fig. 1. The HRNet contains four stages. As the network deepens, a lower resolution branch is added between stages, and features at different resolution scales interact through the transition layer. The resolution of feature $N_1$ is $r_1$, and the resolution of the $q$th branch is $r_q = 1/2^{q-1}$. To realize information interaction between multiscale branches, features are transferred through dense connections between the branches in the feature transition layer. Taking a feature transition layer with three branches as an example, the features corresponding to the three resolutions in the three branches can be represented as
$$\{N^{i}_{1},\ N^{i}_{2},\ N^{i}_{3}\}.$$
After passing through the transition layer, the output feature of the branch with resolution $r_q$ can be represented as
$$N^{i+1}_{q} = \sum_{x=1}^{3} f_{x-q}\left(N^{i}_{x}\right)$$
where $f_{x-y}(\cdot)$ denotes a transform module inside the feature transition layer, as shown in Fig. 2. When $x = y$, the feature maps are passed through unchanged; when $x < y$, the feature maps need to be downsampled to a lower resolution; when $x > y$, the feature maps need to be upsampled to a higher resolution, so $f_{x-y}(\cdot)$ corresponds to bilinear upsampling to the resolution of the target branch, followed by a 1 × 1 convolution that aligns the number of channels.
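The multiresolution exchange described above can be illustrated with a minimal NumPy sketch. It is not the HRNet implementation: average pooling stands in for the learned downsampling, nearest-neighbor repetition stands in for bilinear upsampling, a 1 × 1 convolution is applied in both directions, and the function names (`resolution`, `transform`, `fuse`) and the weight layout are assumptions made only for illustration.

```python
import numpy as np


def resolution(q: int) -> float:
    """Relative resolution of branch q (1-indexed): r_q = 1 / 2**(q - 1)."""
    return 1.0 / 2 ** (q - 1)


def downsample(x: np.ndarray, factor: int) -> np.ndarray:
    """Average-pool a (C, H, W) map by an integer factor (stand-in for learned downsampling)."""
    c, h, w = x.shape
    return x.reshape(c, h // factor, factor, w // factor, factor).mean(axis=(2, 4))


def upsample(x: np.ndarray, factor: int) -> np.ndarray:
    """Nearest-neighbor upsampling of a (C, H, W) map (stand-in for bilinear upsampling)."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)


def transform(x, src, dst, weight=None):
    """Hypothetical f_{src-dst}: move branch `src` to the shape of branch `dst`."""
    if src == dst:
        return x  # same branch: pass through unchanged
    if src < dst:
        x = downsample(x, 2 ** (dst - src))
    else:
        x = upsample(x, 2 ** (src - dst))
    # 1 x 1 convolution: a (C_out, C_in) matrix applied at every pixel
    return np.einsum("oc,chw->ohw", weight, x)


def fuse(features, weights):
    """Sum the transformed features of all branches into every target branch."""
    fused = {}
    for dst in features:
        total = 0
        for src, x in features.items():
            total = total + transform(x, src, dst, weights.get((src, dst)))
        fused[dst] = total
    return fused
```

Each output branch keeps its own resolution and channel count while receiving information from every other branch, which is the property the transition layer relies on.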

C. Object-Contextual Representations
The proposed method needs more semantic information to detect farmland boundaries, and adjacent farmland parcels are contextually connected. Thus, the upsample part uses the OCR module, shown in Fig. 3 and first presented by [51], which strengthens context aggregation in semantic segmentation. The module enhances the representation of context information based on the object region to which each pixel belongs, thereby refining the pixel-level classification results with context information. First, the spatial features extracted by the backbone network are passed through a convolutional layer to output the pixel representation.
To better integrate object-based semantic information and dense spatial context information, the model borrows the cross-attention structure of the transformer [52]. The soft object regions, which contain object category information, are normalized by a Softmax function to generate weights, which are multiplied by the pixel representation. The object region representation is then obtained as the weighted sum of the pixel-level feature representations. In this calculation, p denotes the pixel representation, o denotes the soft object regions, and φ(·) denotes a transformation function consisting of 1 × 1 convolution, batch normalization, and ReLU. The object region representation obtains object-based context information through an inner product with the pixel representation and implements cross-attention with it, mapping the object information containing context to a dense feature space to obtain the OCR.
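The two steps above (region aggregation, then cross-attention) can be sketched in NumPy as follows. This is an illustrative simplification rather than the module used in the paper: the transformation φ(·) is omitted (treated as the identity), features are flattened to `(N, D)` with `N = H × W`, and the function name `object_context` is hypothetical.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def object_context(pixels: np.ndarray, regions: np.ndarray):
    """pixels: (N, D) pixel representation p; regions: (K, N) soft object region scores o.

    Returns the (N, D) object-contextual representation and the (N, K)
    pixel-to-region attention weights.
    """
    # 1) Normalize every soft object region over the N spatial positions.
    region_weights = softmax(regions, axis=1)          # (K, N)
    # 2) Object region representation: weighted sum of pixel representations.
    region_repr = region_weights @ pixels              # (K, D)
    # 3) Pixel-region relation from inner products, normalized over the K regions.
    attention = softmax(pixels @ region_repr.T, axis=1)  # (N, K)
    # 4) Map the region information back to the dense pixel space.
    context = attention @ region_repr                  # (N, D)
    return context, attention
```

Because every row of `attention` sums to one, each pixel's contextual feature is a convex combination of the object region representations.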

D. Postprocessing
Edges obtained by deep-learning-based edge detection cannot guarantee the connectivity of boundary lines because they are generated from probability maps. However, generating parcels from boundary lines requires that the boundaries be connected and closed, so additional postprocessing needs to be introduced. In this article, an edge breakpoint connection method based on direction information is proposed. First, breakpoints are detected on the skeletonized boundary, and all breakpoints are pushed onto a stack. In each iteration, a breakpoint is popped from the stack, and the direction information at that point is combined with the edge probability map in its neighborhood to determine the connection direction. Both the direction information and the boundary probability map are produced by the deep-learning model; the module that outputs the direction information is introduced in the next section. Once the target connection direction is determined, the pixel in that direction is marked as a new edge pixel. Finally, if the new boundary pixel is itself a breakpoint, it is pushed onto the stack. Pseudocode of the edge breakpoint connection is shown in Algorithm 1. Through this iterative process, more of the edge detection results can be closed, thereby improving the integrity of the farmland parcel results.
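The iteration described above can be sketched in Python as follows. This is a hedged illustration, not Algorithm 1 itself: a breakpoint is assumed to be a skeleton pixel with exactly one 8-connected edge neighbor, the direction and edge confidences are combined by a simple product, and the names `connect_breakpoints`, `min_score`, and `max_steps` as well as the `(H, W, 8)` channel order of `conn` are assumptions.

```python
import numpy as np

# Assumed channel order of the eight neighbors (row offset, column offset).
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]


def edge_neighbors(edge: np.ndarray, i: int, j: int) -> int:
    """Count the 8-connected edge pixels around (i, j)."""
    h, w = edge.shape
    return sum(
        1
        for di, dj in OFFSETS
        if 0 <= i + di < h and 0 <= j + dj < w and edge[i + di, j + dj]
    )


def is_breakpoint(edge: np.ndarray, i: int, j: int) -> bool:
    """Assumed definition: an edge pixel with exactly one edge neighbor."""
    return bool(edge[i, j]) and edge_neighbors(edge, i, j) == 1


def connect_breakpoints(skeleton, edge_prob, conn, min_score=0.1, max_steps=10000):
    """Grow edges from breakpoints using direction (conn) and edge confidence."""
    edge = skeleton.astype(bool).copy()
    h, w = edge.shape
    stack = [tuple(p) for p in np.argwhere(edge) if is_breakpoint(edge, *p)]
    steps = 0
    while stack and steps < max_steps:
        i, j = stack.pop()
        steps += 1
        if not is_breakpoint(edge, i, j):
            continue  # already joined to another boundary
        best, best_score = None, min_score
        for c, (di, dj) in enumerate(OFFSETS):
            ni, nj = i + di, j + dj
            if not (0 <= ni < h and 0 <= nj < w) or edge[ni, nj]:
                continue
            # Assumed scoring: direction confidence times edge confidence.
            score = conn[i, j, c] * edge_prob[ni, nj]
            if score > best_score:
                best, best_score = (ni, nj), score
        if best is None:
            continue  # no direction is confident enough; leave the gap open
        edge[best] = True
        if is_breakpoint(edge, *best):
            stack.append(best)  # keep extending from the new end point
    return edge
```

The `min_score` threshold and the `max_steps` cap are safeguards added for the sketch so that extension stops in regions where neither the direction map nor the probability map supports an edge.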

E. Connectivity Attention
As described in the previous section, direction information is critical for connecting breakpoints. If only the probability map is used as a guide, the method connects the low-confidence areas adjacent to the boundary, and the results are similar to those of the double-threshold postprocessing in Canny [6]. The problem remains that some breakpoints cannot be connected to others during the connection process. In this article, a connectivity map representing direction information is used, and the ConAttn, shown in Fig. 4, is introduced into the edge detection model to train and predict the connectivity map corresponding to the boundaries in the image. The connectivity map then provides direction information to guide the breakpoint connection during postprocessing. It quantifies and decomposes the connection direction of the edge to strengthen the model's ability to learn connection direction features.
For parcel boundary labels, connectivity maps $O \in \mathbb{R}^{H \times W \times C}$ are generated, where $C$ is the number of neighbors sampled around a given pixel; $C = 8$ for a 2-D edge image. $O_{i,j,c}$ represents the connectivity between a pixel and one specific neighbor, where $i, j$ give the spatial position of the pixel and $c$ indexes the location of the adjacent pixel. If the two pixels are connected, the entry is recorded as a positive label, indicating that both belong to a connected farmland boundary. Finally, the connectivity information for the eight neighbors is calculated and concatenated to obtain the final connectivity map.
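As an illustration of the definition above, a connectivity map can be derived from a binary edge label with a few array shifts. The sketch assumes that "connected" means both the pixel and its neighbor are edge pixels, and the channel order in `OFFSETS` is an arbitrary choice for this example.

```python
import numpy as np

# Assumed channel order of the eight neighbors (row offset, column offset).
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]


def connectivity_map(edge: np.ndarray) -> np.ndarray:
    """Build an (H, W, 8) connectivity map from an (H, W) binary edge label.

    Channel c at (i, j) is 1 when both (i, j) and its neighbor in direction
    OFFSETS[c] are edge pixels, and 0 otherwise.
    """
    edge = edge.astype(bool)
    h, w = edge.shape
    # Pad with background so neighbors outside the image count as non-edge.
    padded = np.pad(edge, 1, constant_values=False)
    channels = []
    for di, dj in OFFSETS:
        neighbor = padded[1 + di : 1 + di + h, 1 + dj : 1 + dj + w]
        channels.append(edge & neighbor)
    return np.stack(channels, axis=-1).astype(np.uint8)
```

For a horizontal boundary line, only the two horizontal channels are positive along the line, which is the directional signal the postprocessing later relies on.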
The module that generates the connectivity map follows the design of the squeeze-excitation [19] structure. The features generated by the backbone network are passed in turn through 3 × 3 convolutions with dilation rates of 1 and 3; using convolutions with different dilation rates expands the receptive field and mines a larger range of local information. Then, the features are pooled globally, which compresses each channel to a single value, and two fully connected layers learn the feature attention as a vector with values in the range (0, 1). This vector is multiplied by the input feature to obtain the connectivity attention, which is finally trained under supervision with the connectivity map ground truth generated from the labels. In the loss function, $C_O$ denotes the number of channels in the connectivity map; $N = H \times W$ is the total number of pixels; $y^{C}_{i}$ denotes the

III. EXPERIMENTS

A. Datasets
We build a farmland edge dataset from Google Earth high-resolution satellite imagery. Images are uniformly resampled to a ground resolution of 0.5 m using bilinear interpolation. First, satellite images covering rural areas in Guangdong province are selected, and we screen out images in mountainous and plain areas. After manual annotation by visual interpretation, we obtain a farmland edge dataset containing 2873 tiles of 512 × 512 pixels. The farmland edge labels are converted into binary maps with a line width of three pixels as the training labels. Fig. 5 shows example images from the dataset. In the subsequent experiments, the dataset is divided into a training set, a validation set, and a test set at a ratio of 6:2:2. The training and validation sets are used during training, and the test set is used for assessment and analysis.

B. Evaluation Metrics
For evaluation, we use four pixel-based metrics: precision, recall, F1-score, and intersection over union (IoU). These metrics are widely used for accuracy evaluation in semantic segmentation tasks, and in this article they are used as the criteria for edge detection. F1 is the harmonic mean of precision and recall. IoU is the ratio of the intersection to the union of the predicted category and the real category. The metrics are defined as
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \qquad \mathrm{IoU} = \frac{TP}{TP + FP + FN}$$
where TP, FP, and FN denote the numbers of true positive, false positive, and false negative pixels, respectively.
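A minimal sketch of how these four pixel-based metrics can be computed from a binary prediction and a binary label is given below. The function name `edge_metrics` and the convention of returning 0 when a denominator is empty are assumptions for illustration.

```python
import numpy as np


def edge_metrics(pred: np.ndarray, label: np.ndarray) -> dict:
    """Pixel-based precision, recall, F1, and IoU for binary edge maps."""
    pred = pred.astype(bool)
    label = label.astype(bool)
    tp = int(np.sum(pred & label))    # edge predicted and edge in label
    fp = int(np.sum(pred & ~label))   # edge predicted but background in label
    fn = int(np.sum(~pred & label))   # edge in label but missed

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "iou": iou}
```

On a single pair of maps, the two summary scores are tied by F1 = 2·IoU / (1 + IoU), so F1 is never lower than IoU.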

C. Implementation Details
Data augmentation strategies are applied to enhance the generalization ability of the model. Spatial augmentation methods include random crop and resize, random rotation at different angles (90°, 180°, 270°, and 360°), and random vertical or horizontal mirror flips. Spectral augmentation methods include brightness and contrast enhancement, blurring, and Gaussian noise.
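The rotation and flip part of the spatial augmentation can be sketched as below. The key point is that the image and its edge label must receive the same transform so that they stay aligned. The function name `augment` and the 0.5 flip probability are assumptions; cropping, resizing, and the spectral augmentations are left out of this sketch.

```python
import numpy as np


def augment(image: np.ndarray, label: np.ndarray, rng: np.random.Generator):
    """Apply the same random rotation and mirror flips to image and label.

    image: (H, W, C) array; label: (H, W) binary edge map.
    """
    # Rotate by a random multiple of 90 degrees (k = 0 leaves the tile as is,
    # which corresponds to the 360-degree case).
    k = int(rng.integers(0, 4))
    image = np.rot90(image, k, axes=(0, 1))
    label = np.rot90(label, k, axes=(0, 1))
    if rng.random() < 0.5:  # horizontal mirror flip
        image, label = image[:, ::-1], label[:, ::-1]
    if rng.random() < 0.5:  # vertical mirror flip
        image, label = image[::-1], label[::-1]
    # Return contiguous copies so later code does not hold strided views.
    return np.ascontiguousarray(image), np.ascontiguousarray(label)
```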
Network training was implemented in PyTorch on two GeForce RTX 2080Ti GPUs. The backbone parameters of all edge detection networks are pretrained on ImageNet [53], and the remaining parameters are initialized with the Kaiming method [54]. AdamW [55] with an initial learning rate of 0.0002 is adopted as the optimizer so that training converges quickly. The batch size is set to 16. During training, the learning rate is adjusted adaptively based on the loss: when the loss stops decreasing for three consecutive epochs, the learning rate is multiplied by 0.5, and training stops when the learning rate falls below 1e-7. We use binary cross-entropy loss as the loss function in the experiments.
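The learning rate rule above can be simulated in plain Python as follows. This is a sketch of the stated rule only, under the assumption that "stops decreasing" means no new best loss; the function name `plateau_schedule` is hypothetical, and the paper does not state which scheduler implementation was used (PyTorch provides `torch.optim.lr_scheduler.ReduceLROnPlateau` for this kind of rule).

```python
def plateau_schedule(losses, lr=2e-4, factor=0.5, patience=3, min_lr=1e-7):
    """Return the learning rate in effect after each epoch.

    The rate is halved whenever the loss has not improved for `patience`
    consecutive epochs, and the run stops once the rate drops below `min_lr`.
    """
    best = float("inf")
    bad_epochs = 0
    history = []
    for loss in losses:
        if loss < best:
            best, bad_epochs = loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                lr *= factor
                bad_epochs = 0
        history.append(lr)
        if lr < min_lr:
            break  # stopping criterion from the training setup
    return history
```

Starting from 0.0002 with a decay factor of 0.5, the rate falls below 1e-7 after eleven halvings, so under this reading training ends at the eleventh plateau at the latest.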

D. Compared Models
To evaluate our proposed methods in farmland edge detection, some classic semantic segmentation networks, such as DeepLab V3+ and LinkNet, are compared with our proposed model. D-LinkNet is also included in the comparison for its good performance in road extraction from high-resolution satellite imagery, a task similar to the farmland edge detection in this article. A brief introduction to these models is as follows.
1) DeepLab V3+ [26]: Optimized from the series of DeepLab networks, this network uses Xception [53] as its backbone. An atrous spatial pyramid pooling module is used in the network to encode multiscale spatial contextual information with convolution kernels at multiple rates. DeepLab V3+ achieved state-of-the-art performance on the PASCAL VOC 2012 semantic image segmentation benchmark [56].
2) LinkNet [57]: LinkNet brings the residual connection into the most typical symmetrical encoder-decoder architecture, U-net [22]. When feature maps are upsampled, a cascading contraction path brings the corresponding shallow features to the decoder. The modified architecture of LinkNet uses fewer parameters but gives state-of-the-art performance on CamVid [58] and comparable results on the Cityscapes dataset [59].
3) D-LinkNet [60]: D-LinkNet is built on LinkNet for its efficiency in computation and memory. Dilated convolution is introduced to enlarge the receptive field of feature points without reducing the resolution of the feature maps, so that it performs better on satellite imagery. The network

E. Experiments and Analysis
In this part, we first compare the proposed farmland edge detection model with other typical semantic segmentation and road extraction models, and then present the results after postprocessing. The effectiveness of the network module is then demonstrated by ablation experiments.
1) Edge Detection Model Comparison: Table I shows the accuracy evaluation, including IoU, recall, precision, and F1 score. The proposed model is the best on all indicators, with an IoU of 0.7398 and an F1 score of 0.8505. Fig. 6 visualizes the predictions of the compared models. The proposed model recognizes farmland edges better in different scenarios. As a traditional semantic segmentation network, DeepLab V3+ produces more incoherent edge segments in the boundary extraction task. This also occurs in our proposed method, which is likewise based on a traditional semantic segmentation network. However, our model can effectively identify more edge areas, which is consistent with its relatively high IoU value in the accuracy evaluation. LinkNet and D-LinkNet use the skip connection design of the U-Net structure and use transposed convolution in the decoder to increase the feature resolution gradually, so the overall coherence of their results is stronger. However, their overall edge detection accuracy is lower, and they miss more of the fine details of farmland boundaries. Since the proposed model enhances the extraction of contextual information, its interpretation of farmland areas in the image is stronger, so more accurate boundaries can be obtained. For large farmland parcels, where other models cannot fully identify the boundaries of a single parcel, the proposed model is able to extract the complete edge. For densely distributed small parcels, other models tend to lose the detailed edges of complex small parcels, while our model gives richer and more complete details. However, it is worth noting that although the boundaries extracted by our model are more accurate, the number of breakpoints is significantly higher than that of LinkNet and D-LinkNet, so postprocessing needs to be introduced.
2) Postprocessing Results: After obtaining the farmland edge extracted by the model, we extract the edge centerline by skeletonization and then evaluate the effect of postprocessing. The comparison includes results with no postprocessing at all, with the double-threshold method proposed in Canny [6], and with breakpoint connection postprocessing without direction information. Fig. 7 shows that a large number of disconnected edge lines are connected after postprocessing. Most of these connected areas have weak edge features. Deep-learning models often assign them lower probability values because their features are less obvious than those of strong edges, so the edges obtained after final binarization contain many gaps. The proposed connection postprocessing strategy focuses on filling in these low-confidence areas to close more boundaries. The final parcel vector results show that, since more boundaries can be closed into polygons, more parcels are generated after vectorization. Without postprocessing to connect the breakpoints, even if most of the parcel boundary is identified and the accuracy evaluation above looks normal, vectorization still misses a large number of farmland parcels. When the postprocessing method connects more breakpoints, the resulting parcel vectors are more refined.
To numerically evaluate the effect of postprocessing, we re-thicken the postprocessed boundary to match the three-pixel boundary line width of the labels used in model training, and evaluate accuracy against the labels using the thickened result. Table II shows that breakpoint connection postprocessing effectively improves the accuracy of boundary extraction, and that the accuracy improves further after direction information is introduced. The postprocessing also completes the breakpoint connection as far as possible in areas where the network output has low confidence. Fig. 7 further shows that, once direction information is introduced, an extended breakpoint no longer fails to reach a boundary and is instead effectively connected to another boundary.
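One simple way to re-thicken a one-pixel centerline to a three-pixel line is a 3 × 3 binary dilation, sketched below in NumPy. Whether the authors used exactly this operation is not stated, so the choice of a square 3 × 3 neighborhood and the function name `thicken_3px` are assumptions.

```python
import numpy as np


def thicken_3px(edge: np.ndarray) -> np.ndarray:
    """Dilate a binary (H, W) centerline with a 3 x 3 square neighborhood."""
    edge = edge.astype(bool)
    h, w = edge.shape
    # Pad with background so the shifted views stay inside the array.
    padded = np.pad(edge, 1, constant_values=False)
    out = np.zeros((h, w), dtype=bool)
    for di in range(3):
        for dj in range(3):
            # A pixel becomes edge if any pixel in its 3 x 3 neighborhood is edge.
            out |= padded[di : di + h, dj : dj + w]
    return out
```

The thickened map can then be compared with the three-pixel labels using the same pixel-based metrics as for the raw network output.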

3) Ablation Experiments: To demonstrate the effectiveness of the ConAttn added to the edge detection model, we performed ablation experiments on this module using our farmland edge dataset. The results are shown in Table III. The baseline is the object-contextual representations network (OCRNet) used in our proposed method. "ConAttn without fusion" denotes that the ConAttn is added and trained under supervision, but its outputs are not fused with the edge prediction probability maps. After direction information is introduced through supervised training of the ConAttn, the model becomes more sensitive to the edge and the surrounding connected boundary, and all indicators improve to some degree. When all the direction information from the connectivity attention result is fused with the edge probability map to enhance the edge detection results, the accuracy indicators improve further, indicating that the ConAttn provides effective direction and connection information. Thus, even a simple probability map fusion can help refine the boundaries.

IV. DISCUSSION
In this section, we compare the proposed method trained on different types of farmland parcel labels, namely edge labels and polygon labels, and discuss possible directions for improvement.
In Section II, we mentioned that semantic segmentation methods trained with traditional polygon labels are not suitable for refined farmland parcel extraction. To verify the proposed farmland parcel extraction process based on edge detection, we compare its results with the output of a semantic segmentation model trained with polygon labels. In this comparison, the same OCRNet-based network is trained with edge labels and with polygon labels from the same dataset, respectively. After vectorization, the model trained with polygon labels has higher recognition accuracy for farmland areas. The extracted farmland areas are more accurate, and the model can effectively separate thick boundaries between farmlands, such as roads, from the farmland itself. However, a large number of boundary details inside the farmland are clearly missing, and many parcels with obvious boundaries are merged together, so such methods struggle to generate refined farmland parcels. The polygon label contains the spatial texture information of the farmland area, and although the boundary area provides a supervision signal as a negative sample label, this signal is easily diluted by other features. As a result, the network cannot pay enough attention to the edge area of the farmland, and the farmland texture information is blended together as the feature resolution decreases during downsampling. In the farmland edge detection process we use, the supervision signal is the edge itself, so the network can focus on the boundary features between parcels and output edges containing more detailed information. Our postprocessing greatly compensates for the farmland that would be missed when edges fail to close into polygons, so the edge-detection-based results contain fewer missing parcels.
However, this experiment also shows that the semantic segmentation method based on polygon labels still has advantages for extracting the farmland extent. The edge-detection-based method cannot delineate thick boundaries between farmlands well, and the thick boundary areas are easily misclassified as farmland. How to combine our process with segmentation methods trained with polygon labels to obtain more accurate farmland parcel results is a direction that needs further consideration and optimization.

V. CONCLUSION
In this article, we extract farmland edges and generate more refined farmland parcels from the edge detection results. For this task, we propose an edge detection model based on connectivity attention, which retains multiscale image features through a high-resolution network structure and integrates object-based semantic information with dense spatial context information through the OCR module. The ConAttn improves the model's perception of boundary directions while outputting direction information for postprocessing. The results show that although the proposed model extracts the edge area of farmland more accurately, many breakpoints remain. We therefore propose a postprocessing method based on breakpoint connection, which connects the incoherent locations of the extracted boundaries based on confidence and direction information and allows more edges to be closed into parcels, thereby achieving more refined farmland parcel extraction.
In the experiments in this article, most of the farmland comes from plain areas. In future work, we will continue to improve our methods and explore the farmland extraction ability of the model in different regions to improve its generalization and transferability. At the same time, we will optimize the efficiency of postprocessing so that it can be used for farmland mapping over wide areas.