Road Extraction From High Spatial Resolution Remote Sensing Image Based on Multi-Task Key Point Constraints

To address the difficulty and low precision of road extraction from high spatial resolution remote sensing images caused by land cover, building occlusion and the shading of trees, a road extraction method based on multi-task key point constraints is put forward in this article on the basis of LinkNet. At the preprocessing stage, an auxiliary constraint task is designed to solve the connectivity problem caused by shading during road extraction from remote sensing images. At the encoding and decoding stage, first, a position attention (PA) mechanism module and a channel attention (CA) mechanism module are applied to realize the effective fusion of contextual semantic information during road extraction. Second, a multi-branch cascade dilated spatial pyramid (CDSP) is established with dilated convolution, which reduces the loss of partial information during feature extraction from remote sensing road images and further improves detection accuracy. The proposed method is verified through experiments on public and private datasets, revealing that it performs better than several state-of-the-art techniques in terms of detection accuracy, recall, precision and F1-score.


I. INTRODUCTION
In modern society, roads are important identification objects in maps and geographic information systems. As a kind of important infrastructure, roads play an important role in the construction of cities and towns, transportation and other fields [1]-[4]. Therefore, how to extract road information from remote sensing images quickly and accurately has attracted the attention of many scholars [5]-[8].
During road extraction from remote sensing images, data preprocessing is usually carried out first to extract road features, and roads are then classified at the pixel level to obtain the final result. According to the research method adopted, road extraction methods can be classified into traditional methods and deep learning methods.
The associate editor coordinating the review of this manuscript and approving it for publication was Stefania Bonafoni.
With the traditional method, roads are usually recognized and extracted by establishing a feature model based on basic road features and human experience. Common features include the background feature [9], path morphology [10], feature templates [11] and dynamic contour features [12]. Chaudhuri et al. [13] extracted roads by enhancing directional morphological features and deduced road sections accurately based on the features of road sections. Miao et al. [14] put forward a semi-automatic road detection method based on mean shift, in which the initial road section of a preset road seed point was extracted with the geodesic method, and roads and non-roads were then segmented with a threshold value to extract a smooth and correct road centerline. Unsalan and Sirmacek [15] extracted the initial edge features of roads and then extracted roads with the binary balloon algorithm and graph theory, reducing the time spent on road extraction. Song and Civco [9] put forward methods of detecting road regions with region growing technology based on a support vector machine [10] and a similarity criterion respectively. With this method, road features can be extracted through simple threshold segmentation of the shape index and the density feature of roads, achieving good results on images composed of urban and rural features.
The traditional method has played an active role in certain application scenarios, but the many threshold parameters involved in feature design and connection algorithms achieve good results only on some types of images and only after manual adjustment, which restricts its application to large-scale data.
Compared with the traditional method, road extraction methods based on deep learning have considerable strengths. Convolutional networks in deep learning can explore high-level road features to improve the effectiveness of computer vision tasks. They also have a powerful self-learning ability and an efficient feature fitting ability, showing clear advantages in road extraction precision and automation. For example, Sun et al. [16] proposed a road extraction method based on stacked U-Nets and a multi-output network model, solving the sample imbalance problem in multi-goal classification. Chen et al. carried out super-resolution extraction of narrow roads with a one-class support vector machine (OCSVM) classifier, enabling the extraction of roads only several pixels wide. Mnih and Hinton [17] proposed a method in which road regions in high-resolution aerial imagery were detected with a restricted Boltzmann machine, and the dimensionality of the input data was reduced by preprocessing to make the segmented lines smoother. Saito et al. [18] designed a road recognition model based on sliding-window feature extraction, refining the road extraction process.
However, the above deep learning methods apply models mechanically without considering the specific scenario of road extraction: the resolution of feature maps declines gradually during down-sampling, so information about roads only a few pixels wide is lost severely. Some of this lost road information cannot be recovered effectively during up-sampling, which significantly reduces the road extraction performance of these networks.
As deep learning develops rapidly [19]-[21], convolutional networks such as AlexNet [22], VGGNet [23] and ResNet [24] have shown an outstanding ability to extract feature information, which promotes further development of road extraction technology. Zhou et al. [25] put forward the D-LinkNet structure, which fuses ResNet with LinkNet [26], expands the convolutional layers in the central part and widens the receptive field [27] of feature points. Li et al. [28] improved the network model in [25] and established a D-LinkNetPlus model based on DBlockPlus, which reduced the model parameters and improved the calculation efficiency. These two methods only pay attention to local road extraction and do not take the distinctiveness of road scenarios into consideration, so the extracted roads can be discontinuous. Kestur et al. [29] designed a model for road extraction from remote sensing images based on the U-shaped fully convolutional network (U-FCN). This model is built on the classical FCN [30] and is composed of a group of stacked convolutions and corresponding stacked mirroring deconvolutions [31]. This method merely uses a general segmentation network structure to finish the road extraction task, wasting a large quantity of the semantic information in remote sensing road images. Hu et al. [32] designed a graph generation model based on deep learning and a recurrent attention network, which learns the road generation mode by itself according to the probability distribution of road edges in road images to produce the final generated road map. However, neither of these methods considers the role of road intersections during road extraction, which can be fused as prior information to improve the connectivity of the extracted roads.
Although the above methods have made great progress in road extraction from remote sensing images, many problems remain, such as building coverage, tree shading and the diversity of roads, bringing many challenges to the extraction and segmentation of road details. In essence, road extraction is the binary segmentation of roads in remote sensing images, which is one of the semantic segmentation tasks, and many new models have appeared in the field of semantic segmentation. Yu et al. proposed the BiSeNet [33] model and the DFN [34] model for image semantic segmentation. BiSeNet completes the extraction of image features by constructing a feature fusion module and a refinement module, while the discriminative feature network (DFN) constructs two sub-networks to solve the problems of intra-class inconsistency and inter-class indistinction. Zhang et al. [35] built an EncNet model, which introduces a context encoding module to capture the context information of the image. Zhang et al. [36] and Di et al. [37] respectively constructed a further encoding model and the segmentation framework MSCI to improve image segmentation performance. However, the above segmentation models usually extract local target objects, so they cannot effectively address the difficulties of road extraction. The difficulty of road segmentation lies in the complicated road structure. First, roads are usually similar to their surrounding environments (for example, the textures of building roofs are usually similar to those of roads in city images, and traces left by wind are similar to roads in desert images).
Secondly, roads may be blocked by barriers such as shadows or visual occlusion, which makes them hard to recognize (for example, the shadows of buildings block roads in city images). Recognizing roads that cannot be seen at all is particularly difficult; for example, roads with dense afforestation may be covered by trees, making the road completely invisible.
LinkNet is an effective semantic segmentation neural network that takes advantage of skip connections, residual blocks and an encoder-decoder structure. LinkNet also has few calculation parameters, runs fast in its encoding and decoding network, and achieves high extraction accuracy. In this paper, we put forward a new improvement strategy based on LinkNet. This strategy comprises three stages: preprocessing, encoding and decoding, and road pixel binarization output. At the preprocessing stage, an auxiliary constraint task is designed, and the connectivity problem during target extraction from remote sensing road images is solved by calculating the semantic constraint angle. At the encoding and decoding stage, the PA mechanism and the CA mechanism [38] are applied to the intermediate features generated by the encoder before decoding. With these two modules, global semantic information of positions and channels can be considered during feature extraction, which solves the semantic split caused by the fact that only local information is emphasized in existing road extraction models. These two modules can better capture intersections at different positions and integrate intersections, as prior information, into our road extraction model more efficiently. In addition, a multi-branch CDSP module is established with dilated convolution [39] and a spatial pyramid [40] after the attention mechanism. This module contains four different branches. In each branch, feature extraction at a different scale is carried out on the feature map, and the results of the four branches are fused so that the model takes diversified multi-level features into consideration and recovers the original features of images at the decoding stage, especially the features of narrow roads.
The main contributions of this article are as follows:
I. A multi-task intersection point constraint (IPC) module is designed. The road extraction ability in complicated scenarios is improved by adding an auxiliary constraint task to the segmentation task and calculating the semantic constraint angle between road points.
II. The position and channel attention (PCA) modules are introduced to improve road extraction precision in high-resolution images; they fuse contextual semantic information during semantic segmentation and filter out redundant information in remote sensing road images.
III. A multi-branch CDSP module containing four different branches is established. In each branch, feature extraction at a different scale is carried out on the feature map, and the results of the four branches are fused to enhance the ability to recover the original image information during up-sampling.
The structure of the article is as follows. Part II briefly introduces the basic framework used in this article; Part III introduces the general framework of the proposed method and the road extraction procedure; Part IV presents the experimental results and the comparison experiments; Part V summarizes the whole article and gives an outlook on future work.

II. BRIEF REVIEW
The LinkNet CNN is used as the basic road extraction network in this article. Its structure is shown in Figure 1: the input is a high-resolution remote sensing road image, which is first processed in the Encoder module to extract road features and obtain a feature map, and then processed in the Decoder module to recover the size of the road feature map and finally obtain the road segmentation graph. Compared with other segmentation networks, LinkNet differs in the information interaction between the Encoder and the Decoder: the feature map produced by each Encoder block during down-sampling is added to the feature map produced by the corresponding Decoder block during up-sampling, which alleviates the spatial information loss during feature extraction. Denoting the output feature map of an Encoder block as E_i and the output feature map of a Decoder block as D_j, the input of each Decoder block is E_i + D_j, i.e. the sum of the output feature map of the previous Decoder block and the output feature map of the corresponding Encoder block via the bypass link. LinkNet not only has a very good segmentation and extraction effect, but can also run efficiently on embedded GPUs, accelerating model training.
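The bypass link described above can be sketched in a few lines; this is a minimal illustration with hypothetical feature-map shapes, not the full network:

```python
import numpy as np

# Hedged sketch of a LinkNet-style bypass link: the output of an encoder
# block E_i is added element-wise to the output of the matching decoder
# block D_j, and the sum feeds the next decoder block.
def decoder_input(E_i: np.ndarray, D_j: np.ndarray) -> np.ndarray:
    """Input to the next decoder block: E_i + D_j (element-wise sum)."""
    assert E_i.shape == D_j.shape, "bypass link requires matching shapes"
    return E_i + D_j

E2 = np.ones((64, 32, 32))        # hypothetical encoder feature map (C, H, W)
D3 = np.full((64, 32, 32), 2.0)   # hypothetical decoder feature map
out = decoder_input(E2, D3)
```

The element-wise sum (rather than channel concatenation) is what keeps LinkNet's parameter count low.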

III. METHODOLOGY
The structure of the proposed model is shown in Figure 2.
The road extraction procedure is as follows. First, data enhancement operations, such as HSV contrast transformation, geometric transformation and image cropping, are applied to the input high-resolution remote sensing road images. Second, the features of the input images are extracted with the Encoder, and the resulting feature map passes through the PCA module, where a large quantity of road semantic information at different positions and in different channels is fused effectively. The feature map then passes through the CDSP module, where the severe loss of road information during CNN down-sampling is alleviated. At last, the feature map passes through two up-sampling feature recovery decoders with similar structures, the IPC Decoder and the Seg Decoder. As the main task, the Seg Decoder outputs the segmentation result, while the IPC Decoder, as the auxiliary task, outputs the matrix of semantic constraint angles of the image.
The input of the model has two parts: the original image and the label of each pixel. Under such a framework, road image extraction can be converted into the loss calculation between the labels extracted by the CNN and the original labels. The label maps correspond one-to-one to the spatial dimensions w × h of the original images, where w and h represent the width and height of the image. In the label images, 0 and 1 represent the background and the road to be extracted respectively.
In terms of feature extraction, CNNs perform well on segmentation tasks, as shown by U-Net [41], DenseNet [42] and DeepLabv3+ [39]. Therefore, the encoder feature extraction module of such CNNs is selected for extracting road image features in this article.
Next, we will elaborate on the composition of the network module of each part in the framework. First, the establishment mode of the multi-task constraint module is described. Second, the specific establishment process of PCA is discussed. At last, the composition principle of multi-branch CDSP is described in detail.
A. MULTI-TASK IPC
Figure 2 reveals that the IPC output is a feature matrix of semantic constraint angles. The IPC module is an additional task that exploits prior information in order to solve the problem of road disconnection in road extraction. As shown in Figure 3, a general network model is likely to produce disconnected roads when the road is covered by trees. From Figure 3(c), it can be seen that the role of the auxiliary constraint task is to provide prior road features by calculating the semantic constraint angle. This constraint task lets the network learn the angle autonomously by calculating, for each pixel in the road label map, the angle to its neighboring road points, generating deep-level feature information that cannot be perceived by human intuition. In this way, the auxiliary task of calculating the semantic constraint angle effectively solves the road disconnection problem in remote sensing road extraction.
Key road pixel points (i.e. road intersection points) in the prediction matrix and in the ground truth semantic constraint angle matrix are constrained via the loss function. First, the skeleton of the road label images is extracted with the Zhang and Suen [43] thinning algorithm. Each iteration of the algorithm erodes target pixels (road pixel labels in this article) that satisfy specific conditions in order to thin the target. Iterations are carried out continuously until no pixel eroded in the previous round is eroded in the current round. The ground truth semantic constraint angle is then calculated. The neighborhood of a pixel point is shown in Figure 4. Each pixel point is represented by 0 or 1, where 0 represents the background and 1 represents the road, and a pixel is deleted only if it meets the four conditions of Formulas (1), (2), (3) and (4):

2 ≤ B(P1) ≤ 6 (1)
A(P1) = 1 (2)
P2 × P4 × P6 = 0 (3)
P4 × P6 × P8 = 0 (4)

Formula (1) requires that the sum B(P1) of the target pixels (binary value 1) among the eight neighbors of the central pixel P1 is between 2 and 6. Formula (2) requires that the number A(P1) of 0→1 transitions between adjacent pixels, traversing the neighbors P2, P3, ..., P9, P2 clockwise, is exactly 1. Formulas (3) and (4) require that the products of the indicated neighbor pixels equal 0. Pixels meeting these four conditions are marked as ''deleted.'' The algorithm is applied to all pixels of the whole image until no pixel meets the above conditions. The results of the thinning algorithm are shown in Figure 5: the left side shows the original labels, and the right side shows the skeleton map obtained by the thinning algorithm.
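The thinning step above can be sketched as follows. This is a minimal illustration of the Zhang-Suen scheme; note that the second sub-pass uses the standard mirrored products (P2·P4·P8 and P2·P6·P8), which is an assumption on our part since the text lists only the four first-pass conditions:

```python
import numpy as np

def zhang_suen_thin(img: np.ndarray) -> np.ndarray:
    """Minimal sketch of Zhang-Suen thinning on a binary map (1 = road)."""
    img = img.copy().astype(np.uint8)
    def neighbours(y, x):
        # P2..P9, clockwise starting from the pixel directly above P1
        return [img[y-1, x], img[y-1, x+1], img[y, x+1], img[y+1, x+1],
                img[y+1, x], img[y+1, x-1], img[y, x-1], img[y-1, x-1]]
    changed = True
    while changed:
        changed = False
        for first_pass in (True, False):
            to_delete = []
            for y in range(1, img.shape[0] - 1):
                for x in range(1, img.shape[1] - 1):
                    if img[y, x] != 1:
                        continue
                    P = neighbours(y, x)
                    B = sum(P)                                  # condition (1)
                    A = sum(P[k] == 0 and P[(k + 1) % 8] == 1   # condition (2)
                            for k in range(8))
                    if first_pass:                              # conditions (3), (4)
                        c3, c4 = P[0]*P[2]*P[4] == 0, P[2]*P[4]*P[6] == 0
                    else:                                       # mirrored sub-pass
                        c3, c4 = P[0]*P[2]*P[6] == 0, P[0]*P[4]*P[6] == 0
                    if 2 <= B <= 6 and A == 1 and c3 and c4:
                        to_delete.append((y, x))
            for y, x in to_delete:
                img[y, x] = 0
            changed = changed or bool(to_delete)
    return img

bar = np.zeros((7, 9), dtype=np.uint8)
bar[2:5, 1:8] = 1                      # a 3-pixel-thick horizontal "road"
skeleton = zhang_suen_thin(bar)
```

Pixels are removed only if all four conditions hold, so endpoints (fewer than two neighbors) survive and connectivity is preserved.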
After obtaining the skeleton map, it is processed to obtain the intersections and edges required for the calculation of semantic constraint angles. First, a node mapping of the binary label image is executed: background pixels are marked as 0, and edge and intersection pixels are marked as 1 and 2 respectively. Second, intersection domains are extracted. Adjacent intersection pixels compose an intersection domain; domain labels start from 10 and increase gradually, with 10 marking the first intersection, so as to avoid confusion with the mapping values 0, 1 and 2. At last, recursion is carried out from each intersection, searching for edges in all of its neighborhoods. If a point found in a neighborhood is mapped to the background value 0, this branch of the recursion stops and backtracks to the last intersection or edge pixel. If another intersection is found in a neighborhood, the traversed edge serves as the connecting edge of the two intersections.
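The node mapping can be sketched as follows. Classifying a skeleton pixel as an intersection when it has three or more skeleton neighbors is our assumption (the text does not state the exact rule), and the domain labeling from 10 onward is omitted for brevity:

```python
import numpy as np

def map_nodes(skel: np.ndarray) -> np.ndarray:
    """Label skeleton pixels: 0 = background, 1 = edge, 2 = intersection."""
    out = np.zeros_like(skel, dtype=np.int32)
    H, W = skel.shape
    for y in range(H):
        for x in range(W):
            if skel[y, x] != 1:
                continue
            # count road pixels among the 8-neighbours (slicing clips edges)
            block = skel[max(y - 1, 0):y + 2, max(x - 1, 0):x + 2]
            n = int(block.sum()) - 1
            out[y, x] = 2 if n >= 3 else 1
    return out

# a small T-junction skeleton: a horizontal road with a stub going down
skel = np.zeros((5, 5), dtype=np.uint8)
skel[2, :] = 1
skel[3, 2] = skel[4, 2] = 1
labels = map_nodes(skel)
```

On this toy input the junction pixel receives label 2 while line endpoints receive label 1.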
After that, the semantic constraint angle of each road point is calculated. First, the road intersections extracted on the skeleton map of each connected road are marked as [z_1, z_2, ..., z_k], and the pixel points between two neighboring road intersections are marked as [p_1, p_2, ..., p_n]. For each pixel point p_i, the shortest path distance and azimuth angle to each of the two nearby intersections are calculated, and the azimuth angle corresponding to the intersection with the smaller shortest path distance is taken as the semantic constraint angle of this point.
Here, the A* [44] algorithm is used to calculate the shortest path distance. The A* algorithm finds the shortest path from a starting point X to an ending point Y. The path cost is calculated as

F(n) = G(n) + H(n)

where G(n) represents the shortest distance from the starting point to the current point (straight or diagonal moves are allowed), and H(n) represents the Manhattan distance from the current point to the endpoint [45]: only the horizontal and vertical distances are counted, oblique distances are not, and possible barriers are not taken into consideration. The calculation process is as follows: (1) A NewList (storing coordinates of pixel points that may be involved), an OldList (storing coordinates of pixel points that need not be considered during calculation) and a ShortList (storing the nodes of the shortest path between the two points) are established. Set the current node as Z.
(2) Let the current node Z = X, and judge in sequence whether each of the eight neighbor points Z_i (1 ≤ i ≤ 8) of Z can be accessed (that is, whether it is a road pixel point). If Z_i cannot be accessed, Z_i is stored in the OldList (if Z_i is already in the OldList, it is ignored). If Z_i can be accessed and is already in the NewList, the cost of reaching Z_i through Z is calculated and marked as New_F(Z_i). If this new F value is smaller than the old one, New_F(Z_i) is adopted, and Z is set as the father node of Z_i. If Z_i can be accessed but is not in the NewList, Z is set as its father node and Z_i is added to the NewList.
(3) Repeat step (2) until all eight neighbor points have been examined. Add Z to the OldList, traverse all the points in the NewList, select the node with the minimum F value (marked Min_F), take it as the new Z node and repeat the above steps.
(4) When Z = Y, the search is finished, and recursion starts to trace back the father nodes of node Y.
(5) Set the current node as T and initialize it to Y, i.e. T = Y. Add the node represented by T to the ShortList and then set T to the father node of the current node, i.e. T = T.father. (6) The recursion is finished when T = X. All the nodes in the ShortList are the nodes passed by the shortest path from X to Y. Assuming that there are n nodes in the ShortList, the function ϕ(t) sums the F values of the recorded father nodes. The ϕ(t) values of p_i with respect to its two neighboring intersections are compared via the above calculation method, and the intersection with the smaller ϕ(t) is selected to calculate the azimuth angle, which serves as the semantic constraint angle. Assuming that the coordinates of the intersection obtained with the shortest path calculation and of the point on the road are z_i: [x_i, y_i] and p_j: [x_j, y_j] respectively, and letting dx = x_i − x_j and dy = y_i − y_j, the azimuth angle a_ij is defined as the direction angle of the vector (dx, dy).

B. PCA
Compared with ordinary images, the image information contained in high spatial resolution remote sensing images is thousands or even tens of thousands of times that of ordinary images, and the semantic information contained is several orders of magnitude larger. Therefore, how to exploit the semantic information in remote sensing images is the key to improving the precision of road extraction.
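The shortest-path search and azimuth computation of the IPC module can be sketched as follows. This is a hedged, minimal version: a priority queue replaces the explicit NewList scan, diagonal moves cost √2, and the arctan2 convention for the azimuth angle is our assumption:

```python
import heapq, math

def a_star_length(grid, start, goal):
    """A* over road pixels (grid value 1); F = G + H, H = Manhattan distance."""
    def h(p):  # Manhattan heuristic: horizontal + vertical distance only
        return abs(p[0] - goal[0]) + abs(p[1] - goal[1])
    open_heap = [(h(start), 0.0, start)]       # "NewList" as a priority queue
    best_g = {start: 0.0}
    closed = set()                             # "OldList"
    while open_heap:
        f, g, z = heapq.heappop(open_heap)
        if z == goal:
            return g
        if z in closed:
            continue
        closed.add(z)
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                if dy == 0 and dx == 0:
                    continue
                ny, nx = z[0] + dy, z[1] + dx
                if not (0 <= ny < len(grid) and 0 <= nx < len(grid[0])):
                    continue
                if grid[ny][nx] != 1:          # not a road pixel
                    continue
                ng = g + math.hypot(dy, dx)    # straight or diagonal move
                if ng < best_g.get((ny, nx), float('inf')):
                    best_g[(ny, nx)] = ng
                    heapq.heappush(open_heap, (ng + h((ny, nx)), ng, (ny, nx)))
    return float('inf')

def azimuth(z, p):
    """Azimuth angle between intersection z and road point p, in degrees."""
    dy, dx = z[0] - p[0], z[1] - p[1]
    return math.degrees(math.atan2(dy, dx)) % 360.0

grid = [[1, 1, 1, 1],
        [0, 0, 0, 1]]
dist = a_star_length(grid, (0, 0), (1, 3))
```

On this toy grid the optimal path uses one diagonal step, giving a length of 2 + √2.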
On this basis, PCA is introduced into our model, i.e., the PCA module in Figure 2. This module includes the CA mechanism and the PA mechanism. The attention mechanism modules make the network model pay attention to the features and functions of road semantic information at the position and channel levels more effectively during road feature extraction, and the effect of PCA is more significant when the remote sensing images contain more semantic information.
The position attention module is shown in Figure 6. The feature of any position is updated via a weighted aggregation of the road features of all positions. The weight used in the update is determined by the similarity of the road features at the two positions: the more similar they are, the greater the weight, and the less similar they are, the smaller the weight. The weight is irrelevant to the distance between the two pixel points. As Figure 6 shows, an input road feature map X satisfies X ∈ R^(C×H×W), where C represents the number of channels, H the height of the image and W its width. Three new feature maps A, B and D are obtained through three convolutions [47], each with a BN layer [46] and a ReLU layer. A and B are reshaped to R^(C×N), denoted A* and B*, where N = H × W. The PA mechanism computes an attention map S whose element S_ij represents the impact of position i on position j: the more similar the road features at these two positions are, the larger this value becomes. Road feature map D is likewise reshaped to R^(C×N), denoted D*. The output Y is calculated as

Y_j = α · Σ_{i=1}^{N} (S_ij · D*_i) + X_j (10)

where α is initialized to 0 and obtained through learning. Formula (10) shows that the final feature Y is updated via a weighted aggregation of the road features at all positions in the image, so it can aggregate global semantic information.
The structure of the CA mechanism is shown in Figure 7. The CA mechanism is similar to the PA mechanism; they simply emphasize different perspectives. For any two channels, as long as their road features are similar, a higher weight is obtained. Each channel can be seen as a semantic response to a different road feature, and the channel weight represents the degree of dependence between the semantic responses of specific roads: the greater the weight, the more dependent they are on each other, and the smaller the weight, the less dependent they are. Via the CA mechanism, the network learns by itself how to express the semantic features of a specific road.
In the CA mechanism, an attention map H is computed whose element H_ji represents the impact of channel i on channel j, and the final result Y_j is obtained by the weighted aggregation

Y_j = β · Σ_{i=1}^{C} (H_ji · X_i) + X_j

where β is initialized to 0 and updated via training. It can be seen that the final feature is updated via a weighted aggregation of the road features on all channels, so it can aggregate global semantic information along the channel dimension. At last, the feature maps of the PA mechanism and the CA mechanism are fused: the two feature maps are stacked along the channel dimension, which ensures that the original features are retained completely and undamaged and that the network can learn by itself how to integrate this complete road feature information during interaction.
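The two attention mechanisms can be sketched in NumPy as follows. This is a hedged illustration only: the three convolutions producing A, B and D are replaced by identity projections, and the softmax normalization axis is our assumption; in the real module these are learned layers:

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def position_attention(X, alpha=0.0):
    """PA sketch on X of shape (C, H, W); identity stand-ins for the convs."""
    C, H, W = X.shape
    N = H * W
    A = B = D = X.reshape(C, N)        # stand-ins for the three conv outputs
    S = softmax(A.T @ B, axis=0)       # S[i, j]: impact of position i on j
    Y = alpha * (D @ S) + X.reshape(C, N)
    return Y.reshape(C, H, W)

def channel_attention(X, beta=0.0):
    """CA sketch: the same aggregation, but over channels instead."""
    C, H, W = X.shape
    Xf = X.reshape(C, H * W)
    Hmat = softmax(Xf @ Xf.T, axis=1)  # Hmat[j, i]: impact of channel i on j
    Y = beta * (Hmat @ Xf) + Xf
    return Y.reshape(C, H, W)

X = np.random.default_rng(0).random((4, 3, 3))
```

Because α and β are initialized to 0, both modules start as the identity mapping and the attention contribution is learned gradually.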

C. CDSP
The CDSP module follows the PCA module, as shown in Figure 2. To solve the loss of local spatial road information caused by pooling operations and the problem that detailed road features cannot be restored completely during up-sampling, a multi-branch CDSP is designed in this paper. The CDSP module uses dilated convolution to counter the continuous decline of image resolution and the loss of spatial abstract information caused by feature extraction in traditional convolutional networks. The structure of CDSP is shown in Figure 8. This module includes four branches: a dimensionality reduction branch, a self-adaptive pooling branch, an even-numbered cascade branch of dilated convolutions and an odd-numbered cascade branch of dilated convolutions.
(1) Dimensionality reduction branch: the first branch is a convolutional dimensionality reduction branch constructed with a standard 1 × 1 convolution, mainly used to reduce the dimensionality of feature maps while preserving the feature information of the original road feature map. The convolution is followed by a BN layer and a ReLU layer.
(2) Self-adaptive pooling branch: a self-adaptive global pooling branch whose convolution kernel is as large as the input feature map is constructed. The output size of a convolution is defined as

o = ⌊(n + 2p − k)/s⌋ + 1 (14)

where n is the size of the input feature map, k is the size of the convolution kernel, p is the padding, and s is the stride of the convolution. Since this is a global pooling branch, k equals the size of the input feature map, p defaults to 0, and s defaults to 1. Because the kernel is as large as the feature map itself, one up-sampling step must be conducted before channel stacking with the other branches.
(3) Even-numbered cascade branch of dilated convolutions: three dilated convolution modules with different dilation ratios (2, 4 and 8 respectively) are established and cascaded to enlarge the receptive field of the output feature map. The output size of a dilated convolution is

o = ⌊(n + 2p − k − (k − 1)(d − 1))/s⌋ + 1

where d represents the dilation ratio and the other parameters are the same as those in Formula (14). To keep the size of the output feature map unchanged with a 3 × 3 convolution kernel, p = d and s = 1 are used. When dilated convolutions are cascaded, the receptive field F(i, j) of the j-th layer with respect to the i-th layer grows with the dilation: with kernel size k (set to 3 in this article), stride s and dilation d, each cascaded layer enlarges the receptive field, and the larger d is, the larger the receptive field becomes.
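The size and receptive-field arithmetic above can be checked with a short sketch. The receptive-field recursion used here is the standard one for stride-1 cascades and is our assumption, since the paper's exact formula is described only in words:

```python
def conv_out(n, k, p, s):
    """Standard convolution output size: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1

def dilated_out(n, k, p, s, d):
    """Dilated convolution output size; effective kernel is k + (k-1)(d-1)."""
    return (n + 2 * p - (k + (k - 1) * (d - 1))) // s + 1

def receptive_field(dilations, k=3):
    """Receptive field after cascading dilated k x k convolutions (stride 1)."""
    rf = 1
    for d in dilations:
        rf += (k - 1) * d  # each stride-1 layer widens the field by (k-1)*d
    return rf

# With k = 3, s = 1 and p = d, the spatial size is preserved in every branch:
assert all(dilated_out(64, 3, d, 1, d) == 64 for d in (2, 4, 8, 3, 5, 9))
```

For the even-numbered branch (d = 2, 4, 8) this gives a receptive field of 29 pixels, and for the odd-numbered branch (d = 3, 5, 9) 35 pixels, confirming that larger dilations widen the field.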
(4) The odd-numbered cascade branch of dilated convolutions is similar to the third branch, except that the three dilation ratios are 3, 5 and 9 respectively. The other operations are the same as in the third branch.
For the different pyramid branches, different sub-regions of the feature map are pooled, producing feature maps of different sizes at different levels and positions of the pyramid pooling module. To maintain the weight of the global feature, if the pyramid has N levels, a 1 × 1 convolution is used after each level to reduce the number of channels of that level to 1/N of the original, and the low-dimensional feature maps are then up-sampled via bilinear interpolation to obtain feature maps with the same dimensions as the original feature map. At last, the feature maps of the different levels are stacked along the channel dimension as the final global output of the pyramid pooling module.
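The pool-reduce-upsample-stack pipeline can be sketched as follows. This is a hedged toy version: the 1 × 1 convolution is replaced by a fixed averaging projection, nearest-neighbour repetition stands in for bilinear interpolation, and the pyramid levels are illustrative:

```python
import numpy as np

def pyramid_pool(X, levels=(1, 2, 4, 8)):
    """PPM sketch: pool to each level, reduce channels to C/N, upsample, stack."""
    C, H, W = X.shape
    Nlev = len(levels)
    outs = [X]
    for L in levels:
        # adaptive average pooling down to an L x L grid
        pooled = X.reshape(C, L, H // L, L, W // L).mean(axis=(2, 4))
        # 1x1 convolution stand-in: fixed projection to C // Nlev channels
        Wproj = np.ones((C // Nlev, C)) / C
        reduced = np.einsum('oc,chw->ohw', Wproj, pooled)
        # nearest-neighbour stand-in for bilinear upsampling back to H x W
        up = reduced.repeat(H // L, axis=1).repeat(W // L, axis=2)
        outs.append(up)
    return np.concatenate(outs, axis=0)  # channel-dimension stacking

X = np.ones((8, 8, 8))
out = pyramid_pool(X)
```

With C = 8 and four levels, each level contributes C/N = 2 channels, so the output has 8 + 4 × 2 = 16 channels at the original spatial size.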

D. LOSS FUNCTION
Dice coefficient loss plus BCELoss is used as the loss function of the main task of the model in this article. It is defined as

L_main = (1/N) · Σ_{i=1}^{N} [ 1 − (2 · Σ(P_i · GT_i)) / (Σ P_i + Σ GT_i) + BCELoss(P_i, GT_i) ] (17)

where P_i represents the i-th predicted image, GT_i represents the i-th label image, and N represents the batch size. The numerator of the Dice term represents the total number of correct predictions of the positive sample, and the denominator represents the total number of positive and negative samples; the coefficient of the numerator is 2 because the common elements are counted twice in the denominator. For the loss function of the auxiliary task, the CrossEntropy [48] loss function is used:

L_aux = −(1/N) · Σ_{i=1}^{N} [ ŷ^(i) · log y^(i) + (1 − ŷ^(i)) · log(1 − y^(i)) ] (18)

where i represents the i-th sample, N represents the batch size, y^(i) represents the predicted semantic constraint angle of the sample, and ŷ^(i) represents the semantic constraint angle of the sample label. The positive class is 1, and the negative class is 0.
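The two losses can be sketched in NumPy as follows; this is a minimal single-image illustration (the batch average and the clipping epsilon are our additions for numerical safety):

```python
import numpy as np

def dice_bce_loss(P, GT, eps=1e-7):
    """Main-task loss sketch: Dice coefficient loss plus binary cross-entropy."""
    P = np.clip(P, eps, 1 - eps)
    dice = 1 - (2 * (P * GT).sum()) / (P.sum() + GT.sum() + eps)
    bce = -(GT * np.log(P) + (1 - GT) * np.log(1 - P)).mean()
    return dice + bce

def cross_entropy_aux(y_pred, y_true, eps=1e-7):
    """Auxiliary-task binary cross-entropy on the constraint-angle maps."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)).mean()

P = np.array([0.9, 0.1, 0.8, 0.2])
GT = np.array([1.0, 0.0, 1.0, 0.0])
loss = dice_bce_loss(P, GT)
```

A perfect prediction drives both terms toward zero, which is why the sum works as a single training objective balancing region overlap (Dice) and per-pixel calibration (BCE).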

IV. EXPERIMENTAL VERIFICATION
A. BRIEF DESCRIPTION OF DATASETS
To verify the effectiveness of the proposed method, two different datasets are used to train and verify our model. One is the public Massachusetts dataset; the other is a private dataset annotated and extracted from Google Maps, named the RSR (Remote Sensing Road) dataset. The deep learning framework used is [49]. The two datasets are briefly described as follows: (1) Massachusetts road dataset. This dataset is composed of aerial imagery covering all areas of Massachusetts, including cities, rural areas and suburbs, with a total area of more than 2,600 km². Each original high spatial resolution remote sensing image is 1500 × 1500 pixels, there are 1,171 images in total, and the resolution is 1 m. The dataset is split into 1,108 training images, 49 test images and 14 validation images. The label image of this dataset is a binary image in which roads are marked as the foreground and other objects are marked as the background.

B. EVALUATION INDEXES
Precision, recall and F1-score are calculated as follows:

$$Precision = \frac{TP}{TP + FP}, \quad Recall = \frac{TP}{TP + FN}, \quad F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$$

where TP is True Positive: the prediction is positive and the ground truth is positive; FP is False Positive: the prediction is positive but the ground truth is negative; FN is False Negative: the prediction is negative but the ground truth is positive; TN is True Negative: the prediction is negative and the ground truth is negative. A greater F1 means the predicted image is more similar to the provided ground truth. IoU (Intersection over Union) is used to evaluate the precision of the pixels marked in image segmentation and has become a standard measure in semantic segmentation. It is calculated as follows:

$$IoU = \frac{1}{2}\sum_{i=0}^{1}\frac{p_{ii}}{\sum_{j=0}^{1} p_{ij} + \sum_{j=0}^{1} p_{ji} - p_{ii}}$$

where $i, j = 0$ represents the background and $i, j = 1$ represents the road, $p_{ij}$ represents the number of pixels that belong to class $i$ but are predicted as class $j$, $p_{ji}$ represents the number of pixels that belong to class $j$ but are predicted as class $i$, and $p_{ii}$ represents the number of correct predictions.
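A small Python sketch of these pixel-wise metrics for the binary road/background case; the function name is illustrative, and the example masks are toy data:

```python
def seg_metrics(pred, gt):
    """Pixel-wise precision, recall, F1 and road-class IoU for binary
    masks (1 = road, 0 = background), given as flat 0/1 sequences."""
    tp = sum(p == 1 and g == 1 for p, g in zip(pred, gt))  # true positives
    fp = sum(p == 1 and g == 0 for p, g in zip(pred, gt))  # false positives
    fn = sum(p == 0 and g == 1 for p, g in zip(pred, gt))  # false negatives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)  # intersection over union of the road class
    return precision, recall, f1, iou

pred = [1, 1, 0, 0, 1, 0]
gt   = [1, 0, 0, 1, 1, 0]
print(seg_metrics(pred, gt))  # (0.667, 0.667, 0.667, 0.5)
```

Note that IoU is always no greater than F1 on the same confusion counts, which is why IoU figures in the tables below are lower than the corresponding F1 scores.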

C. ANALYSIS OF THE RESULTS OF THE COMPARISON EXPERIMENT
To improve the generalization ability of the network, we use horizontal flipping, random rotation, mirror flipping and HSV color-space transformation for data augmentation. In the HSV transformation, H represents hue, S represents saturation, and V represents brightness (value); transforming in HSV space can reduce the impact of lighting, brightness and color differences on the images. The purpose of these augmentations is to obtain more training data from the existing data and to alleviate the under-fitting or over-fitting problems caused by poor data quality or an insufficient amount of data. A training strategy with 300 epochs and a batch size of 8 is used. A stochastic gradient descent (SGD) optimizer with a momentum of 0.9, a weight decay of 0.0005 and an initial learning rate of 0.003 is used in the experiment, and the learning rate is divided by 10 after every 100 epochs.
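The optimizer and schedule above can be sketched in PyTorch as follows; the one-layer `model` is a stand-in for the actual road extraction network, and the training loop body is elided:

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 1, kernel_size=3, padding=1)  # stand-in for the road network
optimizer = torch.optim.SGD(model.parameters(), lr=0.003,
                            momentum=0.9, weight_decay=0.0005)
# Divide the learning rate by 10 after every 100 epochs (300 epochs total).
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.1)

for epoch in range(300):
    # ... train one epoch over batches of size 8 of augmented images ...
    optimizer.step()   # placeholder: no gradients are computed in this sketch
    scheduler.step()

print(optimizer.param_groups[0]["lr"])  # 0.003 reduced by 10x at epochs 100, 200, 300
```

StepLR multiplies the learning rate by `gamma` at every `step_size`-th call to `scheduler.step()`, which matches the "drop by 10 every 100 epochs" schedule described above.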
To demonstrate the superiority of our method, UNet, LinkNet, DLinkNet, BDNet and GLNet are compared with our model. Some methods with complex and deep models, such as Exfuse, BiseNet, the Context Encoding Network (EncNet), the Discriminative Feature Network (DFN) and Multi-Scale Context Intertwining (MSCI), are also compared with our proposed method. The experimental results on the private dataset and the public dataset are shown in Table 1 and Table 2.
When using the loss function described in this paper, the CE and DL values of each model at the end of training are as follows: UNet: 0.069, 0.204; LinkNet: 0.066, 0.201; DLinkNet: 0.059, 0.176; BDNet: 0.056, 0.174; GLNet: 0.061, 0.177; Ours: 0.051, 0.158. The proposed model improves road extraction accuracy through the multi-task key point constraint module, the dual attention mechanism module and the multi-branch cascaded dilated spatial pyramid module, so its CE and DL values are the lowest among all compared models.
Results in Table 1 and Table 2 reveal that our method is superior to the compared models and achieves the best performance. On the private dataset and the public dataset respectively, the key indexes are as follows: IoU: 69.32% and 60.32%; F1: 81.43% and 75.17%; precision: 77.14% and 74.30%; recall: 86.23% and 76.10%.
Among the compared models in Table 1 and Table 2, UNet and LinkNet are early segmentation models that use encoder-decoder networks to achieve target segmentation. Their structures are simple, however, and their results are worse than those of the other compared models on both the RSR and Massachusetts road datasets. DLinkNet improves the accuracy of road extraction by adding a dilated convolution structure to LinkNet to enlarge the receptive field of the feature map; its IoU, F1-score, precision and recall are 67.63%, 80.34%, 75.28% and 86.13% on the RSR dataset. BDNet adopts a similar dilated convolution structure, so the results of BDNet and DLinkNet are very close on both datasets. GLNet is a dense network model based on DenseNet that improves the road extraction effect by increasing the complexity of the network, and it achieves good results: its IoU, F1-score, precision and recall are 67.15%, 79.81%, 75.76% and 84.32% on the RSR dataset. However, GLNet also involves a large amount of computation, and its per-frame time (denoted Fps in this paper) is 9.8 ms.
We also conduct comparative experiments with the latest complex models, namely Exfuse, BiseNet, EncNet, DFN and MSCI. The experimental results are shown in Table 1 and Table 2. It can be seen that our model performs better in terms of precision, recall, F1-score and IoU. These complex deep learning models were proposed for conventional segmentation tasks, such as the PASCAL VOC 2012, COCO, PASCAL-Context and NYUDv2 datasets, in which the characteristics of the segmentation object and the background differ clearly. For road extraction in remote sensing images, by contrast, the difficulty of road segmentation lies in the complicated road structure.
Based on LinkNet, the proposed model not only adopts dilated convolution but also combines branches of dilated and ordinary convolution, integrates high-dimensional and low-dimensional feature information, and adds the road constraint angle and PCA to further improve the effect. Therefore, our method has clear strengths in terms of IoU, F1, precision, recall, etc.
In addition, we take Fps as an indicator of computational cost. As LinkNet is the basic network for road segmentation, its Fps can reach 22.5 ms on the RSR dataset. Compared with LinkNet, UNet uses channel concatenation in feature fusion, which increases the testing time. Based on LinkNet, DLinkNet uses dilated convolution to improve road segmentation accuracy, but the additional parameters also add computation time. Compared with DLinkNet, although BDNet reduces the number of parameters by constructing D-Blocks, it also increases the network depth, so the Fps of BDNet increases as well. GLNet uses a dense connection scheme and has the deepest network among these models, so its computational cost is relatively high. The Fps of the complex deep learning models is comparable to that of our proposed model: the Fps of Exfuse, BiseNet, EncNet, DFN and MSCI is 3.6 ms, 6.8 ms, 9.1 ms, 5.2 ms and 7.7 ms respectively, while the Fps of our model is 4.4 ms. Figure 11 and Figure 12 display the segmentation results of different roads, in which the second and fourth rows are local amplifications of the first and third rows. It can be seen that the method proposed in this article can reduce false and missed detection of roads that occupy a small proportion of pixels, improve segmentation performance during road extraction, and improve the comprehensive evaluation indexes.
From the second row of Figure 11(c)-(f), it can be seen that BDNet, DLinkNet, GLNet and LinkNet are more likely to wrongly recognize non-road objects, such as shadows or junctions between different objects, as roads. In Figure 11(g), the segmentation results of UNet even show omissions, resulting in road disconnections. Only the road extracted by our model is closest to the label image. For the results in the fourth row of Figure 11, the advantages of our method are even more obvious: it can be seen that the other models generate some non-existent roads.
In Figure 12, it can be seen that the segmentation results of Exfuse, DFN, EncNet and MSCI contain road recognition errors. Although BiseNet produces no wrong roads, its segmented roads are blurry at the connections and even broken. Comparing all methods, only the road extracted by our model is clear and continuous.
In conclusion, the good performance of the proposed method on these two datasets reveals that our model is robust for road extraction from high spatial resolution remote sensing images and proves the effectiveness and reliability of the model in road extraction.
To further verify the effectiveness of the proposed method on remote sensing road images in different scenarios, the test images of the RSR dataset are classified into the non-shading scenario (N), general shading scenario (S) and multi-shading scenario (M) according to the complexity of the scenes in the images. In the N scenario, roads are clearly visible. In the general shading scenario, some parts of the roads are shaded by buildings, trees and shadows, which affects the road extraction effect. In the multi-shading scenario, a large portion of the roads is covered by other objects, such as trees or shadows, so the roads are very hard to extract.
Next, we compare only the results of UNet, LinkNet, DLinkNet, BDNet and GLNet. Partial experimental comparison results for the three scenarios are shown in Table 3 and Figure 14, where N_IoU represents the IoU index in the N scenario, F1_N represents the F1 index in the N scenario, and so on. Mean_IoU and Mean_F1 represent the mean IoU and mean F1 over the three scenarios. Results show that the IoU, F1, precision and recall of our model in all three scenarios are superior to those of the other compared models, and the mean IoU and mean F1-score are 69.32% and 81.43% respectively. Even compared with the best-performing baseline, the mean IoU and mean F1 of our model over the three scenarios increase by 1.4% and 0.9% respectively.
The road extraction results are shown in Figure 13. The first row shows road extraction in the N scenario, the second row in the S scenario, and the third row in the M scenario. From the results of the first and second rows, it can be seen that for some inconspicuous roads, although some other models can also extract the roads to some extent, their extraction effect is not as good as that of our proposed model, which also achieves the best results in terms of clarity, road continuity and extensibility. In particular, from the results of the third row, it can be seen that none of the other models can detect the section of road covered by trees, resulting in obvious disconnections in their extraction results. Our proposed model can detect this section of the road, which proves its superiority.

D. ANALYSIS OF RESULTS OF ABLATION EXPERIMENT
To verify the superiority of our multi-task IPC, PCA and multi-branch CDSP in terms of road extraction and segmentation, an ablation experiment is conducted for three modules in this article.

1) MULTI-TASK IPC
The superiority of the IPC module in road extraction is verified on the private dataset and the public dataset respectively. Experimental results are shown in Table 4 and Table 5. On the private dataset, the IoU and F1 of LinkNet with the IPC module increase by 3.0% and 2.2% respectively compared with LinkNet without the IPC module. On the public dataset, the IoU and F1 increase by 1.8% and 1.8% respectively. Results show that the multi-task IPC can significantly improve the performance of road extraction models, because the IPC module can effectively exploit road intersections as prior knowledge.

2) PCA
To verify the impact of PCA on road extraction, an ablation experiment is carried out for the PA mechanism and the CA mechanism respectively. Results on the private dataset and the public dataset are shown in Table 6 and Table 7. They show that performance on both datasets is improved by adding either PA or CA, but the effect is best when both are added. Compared with the ordinary LinkNet, when both modules are added, IoU increases by 1.8% and F1 increases by 1.3% on the private dataset, while IoU increases by 1.6% and F1 increases by 1.4% on the public dataset. This verifies the effectiveness of PCA in road extraction and further indicates that fusing the semantic information of images can improve the precision of road extraction.
The visualization results for PCA on the RSR Dataset and Massachusetts Road Dataset are shown in Figure 14 and Figure 15.
From the extracted road results, it can be seen that in the segmentation results of the LinkNet model, the roads generated in the area marked by the red circle are very blurred and contain a lot of noise, as shown in Figure 14(c). After adding the position attention mechanism and the channel attention mechanism respectively, the roads segmented by LinkNet+PA and LinkNet+CA are smoother, as shown in Figure 14(d) and (e), but wrongly segmented or disconnected roads still appear. Only when the position attention mechanism and the channel attention mechanism are added at the same time does LinkNet+PCA avoid generating wrong roads, and the extracted road is closest to the label image. The superiority of the PCA module is thus proved by the road extraction results on the private dataset. Similar results can be verified on the public dataset, as shown in Figure 15.

3) MULTI-BRANCH CDSP
To verify the role of multi-branch CDSP in performance improvement of models, on the basis of multi-task IPC and PCA, results before and after stacking of multi-branch CDSP are compared. The comparison results are shown in Table 8 and Table 9.
Results show that compared with models without multi-branch CDSP, the indexes on both the public dataset and the private dataset are improved. Although the improvement brought by this module is limited, multi-branch CDSP enhances the ability to restore information via feature fusion.
Comprehensive results indicate that compared with the basic LinkNet model (see the results in Tables 6 and 7), IoU and F1 on the private dataset increase by 3.6% and 2.5% respectively, and IoU and F1 on the public dataset increase by 2.5% and 2.3% respectively, further proving the effectiveness of the method proposed in this article.

V. CONCLUSION
To improve LinkNet, a network model for extracting target roads from multi-scenario remote sensing images is proposed in this article. To solve the problems of unstable connections between road intersections in high spatial resolution remote sensing images, ineffective fusion of semantic information, and loss of information during down-sampling, the multi-task IPC, PCA and multi-branch CDSP are designed based on LinkNet. Model analysis and experimental results indicate that all the modules designed in this article improve the final performance of road extraction, especially in challenging scenarios such as coverage and shading.
The method improves the performance of road extraction from remote sensing images to some extent, but the following aspects remain to be improved: (1) due to the auxiliary task of multi-task key point constraints, the network needs to additionally compute the skeleton information of the road label map; (2) the subsequent burning algorithm and path-finding algorithm increase the computational cost; (3) the multiple scales of dilated convolution in the multi-branch cascaded dilated spatial pyramid (CDSP) module inevitably increase the number of network parameters and the time cost; (4) although our model has been proven to improve the precision of road extraction, the angle and other parameters added during road extraction, the fusion of multi-branch and multi-scale road features, and the extraction of channel and position road features prolong the road extraction time of the network. We need to explore new methods that minimize the road extraction time without affecting the accuracy of the model.