AD-RoadNet: An Auxiliary-Decoding Road Extraction Network Improving Connectivity While Preserving Multiscale Road Details

Obtaining road information from high-resolution remote sensing images is gaining attention in intelligent transportation systems. Existing road extraction methods tend to improve road connectivity with graph convolution or global attention but ignore the damage that the resulting excessive effective receptive field (ERF) does to multiscale road details. In this study, we propose an auxiliary-decoding road extraction network named AD-RoadNet, which decouples multiscale road representation from connectivity improvement using two modules: the hybrid receptive field module (HRFM) and the topological feature representation module (TFRM). The HRFM is introduced in the encoder to emphasize target road features by adaptively matching the receptive field (RF) size to roads of various scales, which benefits multiscale road representation. The TFRM is introduced in an auxiliary decoder to represent topological features with the position information encoded in the shared encoder; it then helps the main decoder reason about occluded roads, thus improving connectivity. A high-resolution feature is also preserved between the encoder and main decoder. The proposed model has a parameter scale similar to that of HRNetV2 and outperforms the state-of-the-art ResUnet, D-LinkNet, and HRNetV2 by 3.34%, 2.03%, and 1.53%, respectively, in mean intersection over union on the DeepGlobe road dataset. Ablation analysis, the effect of inference size, and robustness to unseen occlusion scenarios, low-quality labels, and inference images of various quality are further presented to evaluate the proposed AD-RoadNet.


I. INTRODUCTION
Obtaining road information is vital in many applications, such as autonomous navigation [1], autonomous driving [2], and intelligent transportation systems [3]. High-resolution remote sensing images (HRSI) are widely used for producing rich road information [4]. However, extracting roads from HRSI remains an immense challenge because of the following difficulties. First, there are diverse road scales and types, including urban trunk roads, urban overpasses, and rural roads. Not just the wide main roads but also trails and narrow rural roads are essential for road networks. Second, different lanes or adjacent roads are commonly confused due to unclear lane markings and diverse median strips in HRSI. Third, the road connectivity in HRSI is easily affected by building shadows, green belts, trees, vehicles, etc. [5].
With the continuous development of deep learning technology, end-to-end semantic segmentation models have become the mainstream approach to extracting roads from HRSI and have made remarkable achievements [6], [7]. The first attempt to use an encoder-decoder structure for semantic segmentation was the fully convolutional network [8]. Then, a series of improvements for common semantic segmentation were presented, from the classical models (Unet [9] and SegNet [10]) to the modern models (DeepLab families [11], [12], [13], [14] and HRNetV2 [15]) and then to the current transformer models (SegFormer [16], UNetFormer [17]), designed for spatial information recovery, high-resolution semantic segmentation, and high-shape-bias semantic segmentation, respectively. In addition, some domain adaptation methods are used to improve the robustness of road extraction models, such as [18].
To improve road connectivity based on these methods, road extraction models frequently design elaborate modules with graph convolution or global attention to encode more global contextual features, such as spatial and interaction space graph reasoning [19], the global context-aware (GCA) block [20], the spatial intensifier (DULR module) [21], and the separable graph convolutional network (SGCN) [4]. However, these methods excessively enlarge the effective receptive field (ERF) and damage multiscale road details while improving connectivity. Taking a two-lane road as an example, a large ERF blurs the lane boundary and causes the lanes to be confused, as shown in Fig. 1. Although a few high-quality works pay attention to multiscale road details, such as DDU-Net [22] and Richer U-net [23], giving overall consideration to both road connectivity and multiscale details remains a big challenge.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
To overcome this problem, an auxiliary-decoding road extraction model termed AD-RoadNet is developed in this article to improve road connectivity while preserving multiscale details. The essential ideas behind AD-RoadNet are as follows.
1) Adaptively matching the RF size to roads of various scales according to their surrounding features to optimize the extraction of multiscale and multilane roads.
2) Modeling the spatial location relation among pixels with the position information encoded through extensive zero-padding.
3) Decoupling connectivity improvement and multiscale road representation by adding an auxiliary decoder to help the main decoder reason about occluded multiscale roads.
Compared to other research studies on road extraction that focus on elaborate modules designed to learn long-range contextual information, our research contributions can be summarized as follows.
1) A novel auxiliary-decoding road extraction network named AD-RoadNet is proposed to give overall consideration to road connectivity and multiscale details, improving the overall performance of road extraction, especially for intersections and multilane roads. In the proposed AD-RoadNet, an auxiliary decoder is embedded before the main decoder to help connect fragmented segments.
2) HRFM is introduced in the encoder to adjust the RF size for multiscale roads, which is beneficial for the detection of lanes and rural roads. In the HRFM, feature maps of branches with various RF sizes are obtained by stacking basic convolutions, and the weights for these maps are computed with respect to their surrounding features to match the suitable branch.
3) TFRM is introduced in the auxiliary decoder to represent topological information and help reason about occluded roads. This module utilizes the position information encoded in the shared encoder to model spatial location correlation among pixels and does not disturb the multiscale semantic features fed from the encoder to the main decoder.

II. RELATED WORK
A. Road Network Extraction
Numerous techniques have been developed in the literature to extract road networks from remote sensing data. Traditional methods improve road connectivity with probabilistic models that incorporate contextual priors, such as spectral features [22], [23], road geometry [26], and marked point processes [27]. Song and Civco [24] used statistical spectral and geometric information as classification criteria to segment road pixels. Mena and Malpica [28] proposed a technique called texture progressive analysis to extract road networks in rural and semiurban areas. These methods rely on handcrafted features and require complex optimization techniques [29].
In recent deep learning-based techniques, road extraction is formulated as a segmentation problem [19], [30], [31], [32] using convolutional encoder-decoder structured models. Among them, Mnih and Hinton [33] made the first attempt to apply a convolutional neural network (CNN) to classify roads, operating on patches. Máttyus et al. [30] proposed an encoder-decoder structure model and used shortest path algorithms to improve the connectivity in the postprocessing step. Unet [9] and LinkNet [34] are two well-known encoder-decoder structures, and there are many improvements based on these models. Zhang et al. [35] proposed the ResUnet, which combines the strengths of residual learning and U-Net. Chen et al. [36] proposed a reconstruction bias U-Net, which increased the decoding branches to obtain multilevel semantic information during up-sampling. Wang et al. [34] optimized the D-LinkNet with nonlocal blocks and gained better performance with less computational cost and faster convergence. Zao and Shi [23] enhanced the U-Net with an enhanced detail recovery structure and an edge-focused loss function to obtain complete and accurate results. The abovementioned methods perform well in segmentation but fail to detect roads obscured by trees or other objects, and they produce many fragmented segments.
To improve road connectivity, one main idea is to enlarge the receptive field (RF) using dilated convolution [11] and detect occluded roads with extra contextual information. Zhou et al. [37] improved LinkNet with dilated convolution in the DeepGlobe road extraction subchallenge [38]. Tao et al. [39] designed a spatial information inference structure to collect contextual information without introducing invalid context. Another idea is to additionally learn road orientation, which was first proposed by Batra et al. [40]. Yi et al. [5] proposed the Efficient UNet multitask joint learning model, incorporating an orientation learning decoding branch to solve the discontinuity problem in road extraction. This idea is effective but requires extra effort to produce road orientation ground truth. Currently, because graph structures and attention mechanisms are good at extracting dependencies over distant regions, many works [4], [20], [21], [32] utilize them in the encoder to improve road connectivity. Bandara et al. [19] introduced graph convolution to model dependencies between different spatial regions and other contextual information to represent road connectivity. Zhu et al. [20] added the GCA block to the encoder-decoder structure to effectively integrate global context features. Zhou et al. [4] proposed a split depth-wise (DW) SGCN to capture global contextual road information in channel and spatial features and extract covered roads. These methods are effective in improving road connectivity. However, the blindly introduced excessive ERF destroys multiscale road features and does not benefit multiscale and multilane roads, which require an accurate ERF.

B. Receptive Field in ConvNets
RF is a term originally coined by [41] to describe an area of the body surface where a stimulus can produce a reflex. For existing ConvNets, an RF can be described as the region of the input that affects the value of an output unit [42]. The RF size affects the scope of extracted information as well as the expression of semantic information. Generally, it can be calculated layer by layer as
$$R_n = R_{n-1} + (k_n - 1)\prod_{i=1}^{n-1} S_i \tag{1}$$
where $R_n$ and $R_{n-1}$ are the RF sizes of the $n$th and $(n-1)$th layers, respectively; $k_n$ is the kernel size of the $n$th layer; and $S_i$ is the stride of the $i$th layer.
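As an illustration, the layer-by-layer calculation can be sketched with the standard recurrence, in which the RF grows by $(k_n - 1)$ times the product of all earlier strides. The function name `receptive_field` and the `(kernel_size, stride)` layer description are hypothetical choices made for this sketch and are not part of AD-RoadNet.

```python
def receptive_field(layers):
    """Return the RF size after each layer.

    `layers` is a list of (kernel_size, stride) pairs ordered from input
    to output. The starting RF is 1 (a single input pixel).
    """
    rf = 1      # R_0
    jump = 1    # product of the strides of all previous layers
    sizes = []
    for kernel_size, stride in layers:
        rf = rf + (kernel_size - 1) * jump
        jump *= stride
        sizes.append(rf)
    return sizes


# Three stacked 3x3 convolutions with stride 1.
stacked = receptive_field([(3, 1), (3, 1), (3, 1)])
# A stride-2 3x3 convolution (subsampling) followed by a 3x3 convolution.
strided = receptive_field([(3, 2), (3, 1)])
```

Both configurations reach the same final RF size, the first by stacking three layers and the second by subsampling, which mirrors the two common ways of enlarging the RF.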
Two common operations to increase RF size are stacking more convolutional layers and subsampling. For large-scale image classification, a larger RF size means more effectiveness, as demonstrated by the large improvements from AlexNet [43] and VGG [44] to ResNet [45]. For semantic segmentation (also called pixel-level classification), one extra requirement is preserving spatial information while enlarging the RF size. Atrous convolution [11], multilevel feature structured models, such as Unet [9], high-resolution models, such as HRNetV2 [15], and transformer models, such as SegFormer [16], were developed to deal with this problem. Another problem is RF imbalance [46], [47]: the widely used multiscale and multilevel architectures [21], [23] provide an unsuitable RF for some objects, negatively impacting the segmentation of objects of varying sizes. For this problem, Liu et al. [48] designed the scale-layer attention module and scale-feature attention module to weigh useful information after atrous spatial pyramid pooling (ASPP) and skip connection, respectively. Li et al. [49] proposed an adaptive multiscale deep fusion residual network that uses an adaptive feature fusion module to emphasize useful information and suppress useless information during multilevel feature fusion (MLFF). Wang et al. [50] designed an adaptive multiscale feature extraction module that sets the RF according to feature map size to avoid introducing invalid information. These methods provide solutions from various perspectives but do not consider the impact of surrounding features on the RF required by target roads.

III. METHODOLOGY
To improve road connectivity while preserving multiscale details, we propose AD-RoadNet, which is introduced in detail in this section. Specifically, we first give an overview of AD-RoadNet for road extraction from HRSI. Then, the two designed modules, HRFM and TFRM, are introduced in turn.

A. Pipelines of Proposed Model
The proposed AD-RoadNet comprises the following four parts: an encoder, an auxiliary decoder, a high-resolution connection, and the main decoder. Our pipeline is shown in Fig. 2.
The encoder is designed to extract multiscale semantic features with hybrid RFs. Specifically, each intermediate feature map is followed by a basic residual block [as shown in Fig. 3(a)] to provide a basic RF size. Where output_stride >= 4, an HRFM is additionally arranged in front of the residual block to guarantee the flexibility of the RF size. This combination goes deeper with a ratio of 1:2:4 to extract superior road information. The ratio is determined experimentally, and the output_stride of the final feature map is 16, similar to common semantic segmentation networks [50], [51]. In addition, we experimentally observe that two subsampling operations, max-pooling and convolution with a stride of 2, have their own advantages in filtering invalid textures and refining road edges. We therefore combine the two operations in parallel to subsample the feature map, as shown in Fig. 3.
The auxiliary decoder utilizes TFRM to represent road topological information, using deep feature maps from various levels as input. This part allows multilevel topological features to be extracted without disturbing the multiscale semantic features fed to the main decoder.
Fig. 2. Pipeline of the proposed networks. The HRFM is introduced in the encoder to adjust the RF size for target roads according to their surrounding features, while the TFRM is used in the auxiliary decoder to represent topological features, and the high-resolution feature is preserved.
We repeat residual blocks four times in place of the skip connection between high-resolution feature maps. With this connection, we refine the extracted road details without the common MLFF, which reduces the invalid context information introduced [48].
The main decoder receives the following three features: the multiscale semantic feature, the multilevel topological feature, and the high-resolution feature, which come from the encoder, the auxiliary decoder, and the high-resolution connection, respectively, and are decoded cooperatively to extract satisfactory results.
In terms of loss functions, we use a combination of binary cross-entropy loss and dice loss, as shown in (2). Binary cross-entropy is a common loss function for semantic segmentation, while the dice coefficient is widely used to highlight the foreground class (here, the road) [52]. In (2), $N$ is the number of pixels; $y_i$ is the predicted probability that pixel $i$ is road; $\hat{y}_i$ is the ground truth; $X$ denotes the set of pixels predicted as road; $Y$ denotes the set of road pixels; and smooth is used to smooth the loss curve and is set to 1 here.
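A minimal sketch of such a combined loss is given here for illustration. It assumes an unweighted sum of the two terms and a soft dice term computed directly from probabilities; both choices are assumptions of this sketch, and the function name `bce_dice_loss` and the clamping constant `eps` are hypothetical rather than taken from the original implementation.

```python
import math


def bce_dice_loss(probs, labels, smooth=1.0, eps=1e-7):
    """Binary cross-entropy plus dice loss over a flat list of pixels.

    probs  : predicted road probabilities in [0, 1]
    labels : ground-truth values, 1 for road and 0 for background
    smooth : smoothing constant of the dice term (set to 1 in the text)
    """
    n = len(probs)

    # Binary cross-entropy averaged over the N pixels.
    bce = 0.0
    for p, t in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp so that log() stays finite
        bce -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    bce /= n

    # Soft dice coefficient: 2|X ∩ Y| / (|X| + |Y|), smoothed.
    intersection = sum(p * t for p, t in zip(probs, labels))
    dice = (2.0 * intersection + smooth) / (sum(probs) + sum(labels) + smooth)

    # Assumed combination: unweighted sum of both loss terms.
    return bce + (1.0 - dice)


perfect = bce_dice_loss([1.0, 0.0, 1.0, 0.0], [1, 0, 1, 0])
uncertain = bce_dice_loss([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])
```

A perfect prediction drives both terms toward zero, whereas the uniformly uncertain prediction is penalized by both the cross-entropy and the dice term.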

B. Hybrid RF Module
The HRFM matches a suitable RF size to the target roads according to their surrounding features in order to extract multiscale details. Compared to other multiscale techniques, such as spatial pyramid pooling (SPP) [53] and ASPP [13], HRFM produces no feature maps at an unsuitable scale, and each pixel in a feature map has its own customized RF. Fig. 4(a) illustrates its diagram. Given an intermediate feature map $x \in \mathbb{R}^{C\times H\times W}$ as input, the HRFM feeds it to three branches simultaneously, which provide different RF sizes by varying the number of stacked convolutions. Based on the surrounding features of each pixel in the three output feature maps $F_i \in \mathbb{R}^{C\times H\times W}$ $(i = 1, 2, 3)$, a receptive field weighting learner (RFWL) generates a weight vector $W \in \mathbb{R}^{3\times 1\times H\times W}$, which represents the weights of the three RF sizes for each pixel. Then, we take the dot product of $F \in \mathbb{R}^{3\times C\times H\times W}$ and $W \in \mathbb{R}^{3\times 1\times H\times W}$ to mix the features extracted by the three branches, while using a residual connection to avoid gradient vanishing [45]. Moreover, since each channel of a feature map can be considered a feature detector [54], we then use a basic residual block with channel attention [55] to capture meaningful information and keep the mixed feature detectors focused. In the overall formulation of HRFM, $x$ is the input feature map; $F$ consists of the output feature maps $F_i \in \mathbb{R}^{C\times H\times W}$ $(i = 1, 2, 3)$ from the three branches; $W$ is the output of the RFWL with $F$ as input; $\cdot$ denotes the dot product; and $f_c$ denotes the basic residual block with channel attention. The following describes the details of RFWL and channel attention.
1) RFWL: Based on the distribution of each pixel's surrounding features in the different branches, we generate the weight maps of the three RF sizes.
Specifically, for each input $F_i \in \mathbb{R}^{C\times H\times W}$ $(i = 1, 2, 3)$, since applying pooling operations along the channel axis has proved effective in highlighting information [44], [56], we first aggregate feature information using average-pooling and max-pooling operations along the channel axis and then utilize a 7 × 7 convolution layer to represent the surrounding feature of each pixel, generating an efficient feature descriptor $D_i \in \mathbb{R}^{1\times H\times W}$ $(i = 1, 2, 3)$. To learn the most suitable RF size for each pixel, the three feature descriptors are then concatenated and fed to a bottleneck block containing two basic convolutions, producing the weight map vector $W \in \mathbb{R}^{3\times 1\times H\times W}$. Fig. 4(b) shows the whole process. In its formulation, $\sigma$ denotes the softmax activation function; $ex$ is an expansion ratio, set to 16 here; $f^{1}_{7\times 7}$ represents a convolution operation with a filter size of 7 × 7 and 1 output channel; similarly, $f^{3}_{3\times 3}$ represents a convolution operation with a filter size of 3 × 3 and 3 output channels; and $f^{3\times ex}_{3\times 3}$ represents a convolution operation with a filter size of 3 × 3 and 3 × ex output channels.
2) Channel Attention: Channel attention was originally proposed by [55] to capture meaningful information in the channel dimension. The process is illustrated in Fig. 4(c). Specifically, after two basic convolutions, we aggregate the spatial information of a feature map using both average-pooling and max-pooling operations, generating two spatial information descriptors. Both descriptors are then fed to a shared multilayer perceptron with one hidden layer to produce the attention map [55]. In the corresponding formulation, $F_{in}$ is the input feature map; $f^{C}_{3\times 3}$ represents a convolution operation with a filter size of 3 × 3 and $C$ output channels; $\sigma$ denotes the ReLU activation function; $\sigma'$ denotes the sigmoid function; $\otimes$ denotes element-wise multiplication; and $F_{out}$ is the output feature map.
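To make the per-pixel weighting of the three RF branches concrete, the following sketch mixes the branch outputs at a single pixel. It assumes that the RFWL yields one unnormalized score per branch, that softmax turns the scores into weights, and that the input is added back as a residual. The function names `softmax` and `mix_branches` are hypothetical, and the convolutions that produce the scores as well as the channel attention block $f_c$ are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a short list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_branches(x, branches, scores):
    """Mix three branch outputs at one pixel.

    x        : input feature vector at the pixel (length C)
    branches : three feature vectors F_1..F_3 at the same pixel (length C each)
    scores   : three unnormalized branch scores (stand-in for the RFWL output)
    """
    weights = softmax(scores)  # one weight per RF size, summing to 1
    mixed = [
        # Weighted sum over branches plus the residual connection.
        sum(w * f[c] for w, f in zip(weights, branches)) + x[c]
        for c in range(len(x))
    ]
    return weights, mixed


# Equal scores: every RF size contributes equally.
w_equal, y_equal = mix_branches([0.0, 0.0], [[3.0, 0.0], [0.0, 3.0], [3.0, 3.0]], [0.0, 0.0, 0.0])
# One dominant score: the pixel effectively selects a single RF size.
w_sharp, y_sharp = mix_branches([1.0, 1.0], [[3.0, 0.0], [0.0, 3.0], [3.0, 3.0]], [100.0, 0.0, 0.0])
```

Because the weights are computed per pixel, neighbouring pixels can favor different branches, which is the sense in which each pixel has its own customized RF.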

C. Topological Feature Representation Module
Based on the shared encoder, the TFRM models the spatial location correlation among pixels to represent road topological information and help reason about occluded roads.
This module can be explained as follows. Suppose $F_1 \in \mathbb{R}^{C\times H\times W}$ is the final output of the encoder, and $F_e \in \mathbb{R}^{C'\times H'\times W'}$ is the final feature map at a certain level. Given the extensive use of zero-padding, $F_1$ and $F_e$ have been shown to encode the position information of pixels [57]. $x_i \in \mathbb{R}^{1\times C'}$ is a 1-D vector representing pixel $i$ of $F_e$, which can be regarded as a feature descriptor [54]. We use (10) to model the global spatial location correlation (GSLC) on pixel $i$
$$x_i^{new} = f(\theta(x_i), \gamma(F_1)) \tag{10}$$
where $\theta$ and $\gamma$ denote projection functions embedding $x_i$ and $F_1$ into the topological space; $f$ is a function for relationship calculation; and $x_i^{new}$ is the generated feature descriptor, which represents the GSLC on pixel $i$.
To convert the abovementioned equation into a computable neural network module, the functions $\theta$, $\gamma$, and $f$ need to be instantiated. Naturally, we set $\theta$ and $\gamma$ as simple point-wise convolutions since linear transformations are enough. As for the function $f$, the dot product is a common operation to represent the relation between vectors. Then, (10) can be written as
$$x_i^{new} = \mathrm{Conv}^{C}_{1\times 1}(x_i) \bullet \mathrm{Conv}^{C}_{1\times 1}(F_1) \tag{11}$$
where $\mathrm{Conv}^{C}_{1\times 1}(x_i)$ represents a convolution operation with a filter size of 1 × 1 and an output channel of C'; $\bullet$ denotes the dot product; and $x_i^{new}$ is the generated feature descriptor, which represents the GSLC on pixel $i$.
For all pixels in $F_e$, we repeat the process to model their GSLCs. In practice, we compute all pixels simultaneously by matrix multiplication and finally generate a new feature map $F_S \in \mathbb{R}^{(H\times W)\times(H'\times W')}$.
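The simultaneous computation by matrix multiplication can be sketched as follows. The sketch flattens both feature maps to one row per pixel and replaces the point-wise convolutions with plain projection matrices without bias, which is a simplifying assumption. The names `gslc`, `theta`, and `gamma` and the shared projection width D are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def gslc(f_e, f_1, theta, gamma):
    """Pairwise dot products between projected pixels of F_e and F_1.

    f_e   : (H'W') x C' matrix, one row per pixel of the level feature map
    f_1   : (HW)  x C  matrix, one row per pixel of the final encoder output
    theta : C' x D projection (stand-in for a 1x1 convolution on F_e)
    gamma : C  x D projection (stand-in for a 1x1 convolution on F_1)
    Returns an (HW) x (H'W') matrix; column i holds the descriptor of pixel i.
    """
    q = matmul(f_e, theta)  # (H'W') x D
    k = matmul(f_1, gamma)  # (HW)  x D
    return matmul(k, transpose(q))


identity = [[1, 0], [0, 1]]
f_e = [[1, 0], [0, 1], [1, 1]]   # three pixels, C' = 2
f_1 = [[1, 2], [3, 4]]           # two pixels, C = 2
f_s = gslc(f_e, f_1, identity, identity)
```

The result has one row per pixel of $F_1$ and one column per pixel of $F_e$, matching the $(H\times W)\times(H'\times W')$ shape stated in the text.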
After modeling the GSLC on all pixels, we utilize a bottleneck block to represent the topological relationship among pixels. Specifically, our bottleneck block first uses a 1 × 1 convolution to reduce the channels of the new feature map to C', the same as $F_e$; then, two basic convolutions are used to extract topological features for each pixel. Moreover, a residual connection [45] is used to avoid gradient vanishing. Fig. 5 shows the whole process in TFRM.

IV. EXPERIMENTS
The proposed model is evaluated on two datasets: the Massachusetts road dataset [58] and the DeepGlobe road dataset [38]. In this part, the two datasets are first introduced, and then implementation details and evaluation metrics are given. Finally, the proposed network is compared with several state-of-the-art networks (SegNet [10], Unet [9], HRNetV2 [15], D-LinkNet [37], Residual Unet [35], DDU-Net [22], and GAMSNet [59]). Note that although transformer-based models such as UNetFormer [17] show exciting performance, their high requirement for graphics processing unit (GPU) memory limits their application in many situations (comparisons of computational cost between CNN-based and transformer-based models can be found in [60], [61], and [62]); hence, the transformer families are not among the models we compare against. We also make a brief comparison of model efficiency in Section V-A.

A. Datasets
1) The Massachusetts Road Dataset: The dataset consists of training, validation, and test sets with 1108, 14, and 49 images, respectively. It covers a wide variety of road features in rural, suburban, and urban areas. The spatial resolution of these RGB images is 1 m. The annotations are road centerlines obtained from OpenStreetMap, and all centerlines are converted to raster with a line thickness of 7 pixels [51]. The original image size is 1500 × 1500 pixels. Before being fed to the segmentation model, images are cropped to 512 × 512 pixels with an overlap of 18 pixels. Moreover, we filter out images with heavy abnormal occlusion, since they could seriously disturb the performance of the network. After the abovementioned operations, the Massachusetts road dataset contains 6856, 126, and 441 images of size 512 × 512 pixels in the training, validation, and test sets, respectively.
2) The DeepGlobe Road Dataset: The DeepGlobe dataset consists of 6226 satellite images, each paired with a mask of road labels. These images have a size of 1024 × 1024 pixels and a spatial resolution of 50 cm/pixel [39]. As with the Massachusetts road dataset, we crop these images to 512 × 512 pixels, here without overlap, before feeding them to the test network. Since the background (nonroad) pixels far outnumber the road pixels in the satellite images, we follow Tao et al. [39] and remove some images with an extremely small foreground ratio to alleviate the problem of class imbalance during optimization. After these preprocessing operations, a total of 11 350 images are obtained and divided into training, validation, and test sets with 9500, 350, and 1500 images, respectively.
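The cropping arithmetic for both datasets can be checked with a short sketch. It reads the overlap of 18 as 18 pixels, which is consistent with the reported validation and test counts (14 × 9 = 126 and 49 × 9 = 441). The function name `patch_starts` is hypothetical, and the border handling is an assumption that neither dataset actually triggers.

```python
def patch_starts(image_size, patch_size, overlap=0):
    """Top-left coordinates of the patches along one image axis.

    Assumes image_size >= patch_size. If the regular grid does not reach
    the border, one extra patch aligned with the border is appended.
    """
    stride = patch_size - overlap
    starts = list(range(0, image_size - patch_size + 1, stride))
    if starts[-1] + patch_size < image_size:
        starts.append(image_size - patch_size)
    return starts


# Massachusetts: 1500 x 1500 images, 512 x 512 patches, 18-pixel overlap.
mass = patch_starts(1500, 512, overlap=18)
mass_per_image = len(mass) ** 2

# DeepGlobe: 1024 x 1024 images, 512 x 512 patches, no overlap.
globe = patch_starts(1024, 512)
globe_per_image = len(globe) ** 2
```

With an 18-pixel overlap, three 512-pixel patches tile the 1500-pixel side exactly, giving nine patches per Massachusetts image, while each DeepGlobe image yields four patches. The training and DeepGlobe totals reported in the text are smaller than these products because of the filtering steps.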

B. Training Details
1) Avoid Overfitting: Given the relatively small size of the processed training data, we utilize several techniques to avoid overfitting, including online data augmentation, batch normalization [63], L2 regularization, and early stopping (the model is evaluated on the validation set every 200 iterations). Concretely, random flip, random rotation (by 90°), and random crop and resize are used for data augmentation.
2) Configuration of Hyperparameters: All models are trained with the same parameter settings and in the same environment. Specifically, we conduct all experiments with PyTorch [64] and train the models using the AdamW [65] optimizer on one RTX3060 (12 GB of memory), which allows a batch size of 4 images. The weight decay is set to 5e−4. We use a cosine annealing learning rate scheduler [66] with an initial learning rate of 0.001, and the warm-up and restart strategies are applied to avoid premature convergence. Concretely, 126 epochs are trained: the first epoch is used for warming up, the next 50 epochs start from the initial learning rate of 0.001, and the learning rate is then reset to 0.0005 for another 75 epochs.
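The schedule described above can be sketched at epoch granularity as follows. Several details are assumptions of this sketch rather than reported settings: the warm-up epoch is given half the initial rate, each cosine cycle decays toward a minimum of 0, and the rate changes once per epoch rather than per iteration. The function name `learning_rate` is hypothetical.

```python
import math


def learning_rate(epoch):
    """Learning rate for a 0-indexed epoch in [0, 126).

    Epoch 0 is the warm-up epoch, epochs 1-50 form the first cosine cycle
    starting at 0.001, and epochs 51-125 form the second cycle, restarted
    at 0.0005.
    """
    warmup_epochs, first_cycle, second_cycle = 1, 50, 75
    if epoch < warmup_epochs:
        # Assumed linear warm-up towards the initial rate.
        return 0.001 * (epoch + 1) / (warmup_epochs + 1)
    epoch -= warmup_epochs
    if epoch < first_cycle:
        base, step, period = 0.001, epoch, first_cycle
    else:
        base, step, period = 0.0005, epoch - first_cycle, second_cycle
    # Cosine annealing from `base` towards an assumed minimum of 0.
    return 0.5 * base * (1.0 + math.cos(math.pi * step / period))


schedule = [learning_rate(e) for e in range(126)]
```

The restart at epoch 51 raises the rate again after the first cycle has decayed, which is the mechanism the text relies on to avoid premature convergence.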

C. Evaluation Metrics
To quantitatively evaluate the performance of the proposed network architecture, five common and widely accepted metrics are utilized here: precision, recall, F1 score [67], [68], intersection over union (IoU) [20], and mean IoU (mIoU) [39], [69]. Before introducing the definitions of these metrics, it is necessary to define the following four quantities: TP, FP, TN, and FN, where TP represents the number of correctly classified foreground pixels, and FP, TN, and FN represent the numbers of false positives, true negatives, and false negatives, respectively [48].
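With these quantities, precision and recall can be determined as shown in (12) and (13), respectively
$$P = \frac{TP}{TP + FP} \tag{12}$$
$$R = \frac{TP}{TP + FN}. \tag{13}$$
The F1 score is the harmonic mean of precision (P) and recall (R), and it can be calculated by the following equation:
$$F1 = \frac{2 \times P \times R}{P + R}. \tag{14}$$
The IoU evaluates the ratio of the intersection value (TP) to the union value [the sum of FP, FN, and TP, as shown in (15)]
$$IoU = \frac{TP}{TP + FP + FN}. \tag{15}$$
The mIoU is the average ratio of the correctly classified pixels in a class to the union of the predicted pixels of this class and the ground truth, and it can be calculated by (16)
$$mIoU = \frac{1}{k}\sum_{j=1}^{k}\frac{TP_j}{TP_j + FP_j + FN_j} \tag{16}$$
where $k$ is the number of classes and the subscript $j$ denotes the quantities computed for class $j$.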
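For illustration, the five metrics can be computed from the four pixel counts as in the following sketch. It assumes that mIoU averages the IoU of the road class and the IoU of the background class, which is one common reading of the definition for binary segmentation. The function name `road_metrics` and the example counts are hypothetical.

```python
def road_metrics(tp, fp, tn, fn):
    """Precision, recall, F1, IoU, and mIoU from pixel counts.

    tp, fp, tn, fn are counted with road as the positive class.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou_road = tp / (tp + fp + fn)
    # For the background class, TN plays the role of TP, and the roles of
    # FP and FN are swapped; their sum is unchanged.
    iou_background = tn / (tn + fn + fp)
    # Assumed definition: mean over the road and background classes.
    miou = (iou_road + iou_background) / 2
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "iou": iou_road,
        "miou": miou,
    }


# Hypothetical counts for a 1000-pixel image.
scores = road_metrics(tp=80, fp=20, tn=880, fn=20)
```

Because background pixels dominate road imagery, the background IoU is usually high, so the mIoU tends to be noticeably larger than the road IoU for the same prediction.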

D. Results
In this section, performance comparisons between the proposed AD-RoadNet and several state-of-the-art networks (SegNet [10], Unet [9], HRNetV2 [15], D-LinkNet [37], Residual Unet [35], DDU-Net [22], and GAMSNet [59]) are conducted quantitatively and qualitatively on the abovementioned datasets. Table I lists the accuracy assessment results on the Massachusetts road dataset. All six methods present a good performance in road extraction. SegNet [10] presents the lowest accuracy among the six networks, while the proposed AD-RoadNet performs best, with an IoU 1.02%-3.13% higher than the other methods. In addition, the improvement in recall (from 77-78 to 79.44), which is rarely achieved, indicates the effectiveness of our HRFM and TFRM. Fig. 6 displays the extracted road features on the Massachusetts road dataset. Overall, all methods can detect road features with relatively high accuracy, but the results differ when the HRSI contains complex road structures and intersections. AD-RoadNet is superior to the other methods in the extraction of multiscale roads and obscured roads, which requires the ability to obtain road feature details and connection information. To further illustrate the advantages of AD-RoadNet, five typical scenes, which include trails close to main roads, urban overpasses, adjacent roads, a two-lane road covered by trees, and roads covered by shadows and vehicles, are selected, as shown in Fig. 6.
To further verify the advantages of the proposed AD-RoadNet, another comparison with the same five networks, namely SegNet, Unet, Residual Unet, D-LinkNet, and HRNetV2, is conducted on the DeepGlobe dataset. The quantitative results are shown in Table II. The ground truth of the DeepGlobe dataset has true road width, and the scenes in the dataset are more comprehensive and complex. The results show that the proposed AD-RoadNet significantly outperforms the others. Specifically, AD-RoadNet achieves the best results, with an IoU 2.46%-6.03% higher than the other methods and an improvement of 1.61%-4.25% in the F1 score. Fig. 7 displays road extraction details from the DeepGlobe road dataset for the abovementioned networks. It can be seen that the proposed AD-RoadNet performs fairly well in multiscale road extraction from very complex satellite images.
To illustrate the effectiveness of AD-RoadNet on the DeepGlobe road dataset, four typical scenes are selected, as shown in Fig. 7: expressways with vehicles, median strips, and lane markings; multiscale rural roads; roads obscured by trees; and intersections connecting roads of various types.

V. ANALYSIS
The abovementioned comparison experiments have demonstrated the effectiveness of the proposed AD-RoadNet for multiscale road extraction in HRSI. The results can be interpreted based on the structure of AD-RoadNet. The encoder with hybrid RFs extracts features while still preserving road details, such as trails and lanes. While the main decoder decodes roads from the extracted multiscale semantic features, the TFRM in the auxiliary decoder represents topological information and helps reason about obscured roads. Moreover, the high-resolution feature prevents very narrow roads from disappearing during subsampling. The combination of these features guarantees rich multiscale road details as well as improved connectivity.
This section further evaluates the proposed method in the following four parts. Section V-A analyzes the complexity of our model, Section V-B performs comprehensive ablation to validate proposed modules, Section V-C discusses the matter of inference patch size, and Section V-D evaluates the robustness of the proposed AD-RoadNet.

A. Parameter Scales and Model Complexity
The parameter scale, floating point operations (FLOPs), and inference time are widely used to evaluate model complexity. The parameter counts and FLOPs are taken from the articles [10], [15], [35] or calculated with PyTorch [64]. The FPS (inference time) is measured on our hardware platform. Fig. 8 shows the parameter scales and performance of the methods. From Fig. 8, we can see that the parameter scale of our model is similar to that of HRNetV2, while the performance is remarkably improved.
We also compare the efficiency of several models under our hardware (RTX3060 and i9-10900) and explain why we exclude the Transformer-based model from our analysis. Table III shows the approximate number of parameters, FLOPs, and FPS (inference times) for each model with an input size of 512 × 512.
We can see that the transformer-based model, SegFormer (B4), has the longest inference time (lowest FPS) among all the models. Because it requires more training epochs and has a longer inference time, the transformer-based model takes longer to train; this explains why we do not include it in our comparison.

B. Model Analysis
We perform comprehensive ablations to examine the effectiveness of the proposed HRFM and TFRM. Concretely, we first train a baseline whose pipeline is the same as AD-RoadNet except that it has neither HRFM nor TFRM, then add HRFM and TFRM step by step, conduct experiments on the two datasets, and report the performance via F1 score and mIoU. Table IV lists all experimental results, where H abbreviates HRFM and T abbreviates TFRM. Compared to the baseline, HRFM increases the F1 and mIoU by 0.63% and 0.45% on the Massachusetts dataset and by 0.9% and 0.63% on the DeepGlobe dataset, respectively. TFRM improves the baseline by 0.9% F1 and 0.59% mIoU on the Massachusetts dataset and by 1.01% F1 and 0.7% mIoU on the DeepGlobe dataset. This indicates that when HRFM and TFRM are added to the baseline independently, the improvement in performance is limited. If only HRFM is used, various visible roads may be extracted, but the RF size matched closely to the target road weakens the network's robustness to obscured roads, for which a large RF is commonly shown to be effective [19], [37], as shown in Fig. 9(a)-(c). On the other hand, if TFRM is used independently, the encoder with a uniform RF size is not good at extracting multiscale roads in an HRSI. Many feature details, such as narrow roads and lane markings, are missed, resulting in incomplete and inaccurate extraction results, as shown in Fig. 9(d) and (e). Overall, HRFM and TFRM complement each other in the road extraction process, and the best accuracy is achieved with the complete AD-RoadNet (baseline+H+T), which supports our hypothesis to some degree.
Table V lists the experimental results of our model with different effective receptive fields in the RFWL module. This experiment illustrates the effectiveness of the RFWL module. We force the three branches of the RFWL module to use pooling operations of the same size; in this way, we can control the size of the ERF.
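For readers who wish to reproduce the F1 and mIoU figures used in the ablation, the following is a minimal sketch of how such scores can be computed from binary road masks. It assumes that mIoU is the mean of the road IoU and the background IoU, which the text does not state explicitly; the function name `road_metrics` is hypothetical and not part of the released code.

```python
import numpy as np


def road_metrics(pred, gt):
    """Pixel-wise scores for binary masks where 1 marks road and 0 background."""
    pred = np.asarray(pred).astype(bool)
    gt = np.asarray(gt).astype(bool)

    tp = int(np.sum(pred & gt))    # road predicted as road
    fp = int(np.sum(pred & ~gt))   # background predicted as road
    fn = int(np.sum(~pred & gt))   # road predicted as background
    tn = int(np.sum(~pred & ~gt))  # background predicted as background

    def ratio(num, den):
        # Return 0 for an empty denominator instead of dividing by zero.
        return num / den if den > 0 else 0.0

    iou_road = ratio(tp, tp + fp + fn)
    iou_background = ratio(tn, tn + fp + fn)
    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
        "iou": iou_road,
        # Assumption: mIoU averages the road and background classes.
        "miou": 0.5 * (iou_road + iou_background),
    }
```

Calling `road_metrics(prediction > 0.5, label)` on a thresholded score map returns all five values at once, so the same routine can serve both the ablation tables and the precision/recall curves.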

C. Inference Size Matters
In practical applications, roads need to be extracted over a large region to indicate the overall distribution. However, limited by GPU memory, it is not feasible to directly infer on such a large remote sensing image. The common practice is to crop the image into patches and merge them after inference. The patch size used for inference affects the performance [13], [48]. Thus, we look for the optimal patch size by cropping patches of different sizes. Fig. 10 displays the performance trend as the inference patch size varies. Precision shows an upward trend in the range of 128-768 and a downward trend in the range of 768-1532, while recall shows the opposite trend. The best performance (the highest IoU) occurs when the input size is 768, reaching 66.22%. This phenomenon may imply that when the input size is small, the road features in the image are insufficient. Thus, distinguishing roads from objects with similar textures, such as parking lots and cement ground, is difficult, resulting in a low precision. On the other hand, a small input size also decreases the variance of road widths. Detecting roads with this input becomes easier, resulting in a high recall. When the input size is larger than 768, the road features the model can extract reach saturation, while the more severe scale problem begins to degrade the precision of our model. However, the long-range contextual information introduced yields a higher recall. When the input size is 768, the road features, scale problem, and long-range contextual information reach the best tradeoff.

Fig. 9. Visualization of results for different module combinations. From left to right: raw images, ground truth, the model with only HRFM, the model with only TFRM, and the complete AD-RoadNet. HRFM is better at detecting narrow roads, as in (d) and (e), while TFRM has advantages on obscured roads, as in (a), (b), and (c), but both are limited. The complete AD-RoadNet makes full use of their advantages.
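The crop-and-merge inference described in this subsection can be sketched as follows. This is an illustrative outline rather than the authors' implementation: `tiled_inference`, `patch_starts`, the `stride` parameter, and the averaging of overlapping predictions are all assumptions, and `predict_fn` stands in for any model that maps a patch to a per-pixel road score.

```python
import numpy as np


def patch_starts(size, patch, stride):
    """Offsets such that patches of length `patch` cover the range [0, size)."""
    if size <= patch:
        return [0]
    starts = list(range(0, size - patch + 1, stride))
    if starts[-1] != size - patch:
        starts.append(size - patch)  # extra patch flush with the far border
    return starts


def tiled_inference(image, predict_fn, patch=768, stride=None):
    """Predict patch by patch and average the scores where patches overlap.

    image: H x W x C array.
    predict_fn: maps an (h, w, C) patch to an (h, w) score map.
    """
    stride = patch if stride is None else stride
    if not 0 < stride <= patch:
        raise ValueError("stride must be in (0, patch] so patches leave no gaps")

    h, w = image.shape[:2]
    score_sum = np.zeros((h, w), dtype=np.float64)
    hits = np.zeros((h, w), dtype=np.float64)
    for y in patch_starts(h, patch, stride):
        for x in patch_starts(w, patch, stride):
            tile = image[y:y + patch, x:x + patch]
            th, tw = tile.shape[:2]
            score_sum[y:y + th, x:x + tw] += predict_fn(tile)
            hits[y:y + th, x:x + tw] += 1.0
    return score_sum / hits
```

Sweeping `patch` over the sizes of interest (for example 128 up to the largest size that fits in GPU memory) and scoring each merged map against the ground truth would reproduce the kind of trend plotted in Fig. 10.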

D. Robustness of AD-RoadNet
We discuss the robustness of the proposed AD-RoadNet in the following three aspects: unseen occlusion scenarios, low-quality labels, and varying quality of inference images.

1) Robustness Analysis in Unseen Occlusion Scenarios:
To assess the robustness of AD-RoadNet and rule out that it merely memorizes similar occlusion scenes from the training samples, we manually attach occlusions and test the robustness by strengthening them gradually. Specifically, three occlusion levels are set: the raw road image, the road image with weak occlusion, and the road image with strong occlusion. The performance changes of all methods are shown in Fig. 11. The visualized road extraction results are shown in Fig. 12, in which the rows correspond to occlusion levels and the columns correspond to the applied models. All five methods infer relatively accurate results in the first row, while SegNet, Unet, and D-LinkNet are slightly disturbed by weak occlusion. Under the third occlusion level, all extraction results are greatly affected, but the one from AD-RoadNet still preserves basic road elements and their connectivity, which indicates its robustness.
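The text does not specify how the occlusions are attached, so the following is only a hypothetical sketch of one way to build graded synthetic occlusions. It pastes flat-coloured squares centred on randomly chosen road pixels; the function name `add_occlusions`, the fill colour, and the schedule of four occluders and four pixels of half-width per level are all assumptions chosen for illustration.

```python
import numpy as np


def add_occlusions(image, road_mask, level, fill=(40, 70, 40), seed=0):
    """Return a copy of `image` with square occluders placed on road pixels.

    level 0 leaves the image unchanged; level 1 and 2 stand for weak and
    strong occlusion. Count and size per level are assumed values.
    """
    out = image.copy()
    ys, xs = np.nonzero(road_mask)
    if level <= 0 or len(ys) == 0:
        return out

    rng = np.random.default_rng(seed)
    n_occluders = 4 * level  # assumed: more occluders at higher levels
    half = 4 * level         # assumed: larger occluders at higher levels
    h, w = road_mask.shape
    for i in rng.integers(0, len(ys), size=n_occluders):
        y0, y1 = max(ys[i] - half, 0), min(ys[i] + half + 1, h)
        x0, x1 = max(xs[i] - half, 0), min(xs[i] + half + 1, w)
        out[y0:y1, x0:x1] = fill  # flat patch imitating canopy or shadow
    return out
```

Evaluating each model on `add_occlusions(image, mask, level)` for levels 0, 1, and 2 gives three test sets of increasing difficulty that share the same ground truth, which mirrors the structure of the comparison in Fig. 11 and Fig. 12.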
2) Robustness Analysis for Low-Quality Labels: Training and optimizing road extraction networks with fully correct ground truth would release their full potential. However, this is not feasible, since image labeling is error-prone and frequently follows different standards for ambiguous situations. Moreover, recent research using open spatial vector data to create training datasets introduces further labeling errors [70]. Therefore, it is necessary to evaluate the robustness of the proposed model to low-quality labels. In the road extraction dataset, most incorrect labels mislabel road as background, while there is almost no case where background has been mislabeled as road. We randomly select some images with obvious labeling mistakes from our training dataset and then predict on these images again to test whether the mistakes can be corrected. Fig. 13 shows these images and the corresponding prediction results. For patterns that have been labeled correctly in many samples, such as main roads, AD-RoadNet can correct the erroneous mask, as shown in the red box. However, for patterns that are often labeled incorrectly, such as the short roads in front of houses, the predictions tend to keep the incorrect label, as shown in the magenta box. The result implies that although the proposed model tolerates some mislabeling, a certain amount of correct labeling should be ensured for each possible pattern to obtain good extraction results.
3) Robustness Analysis for Quality of Inference Images: There are many data sources for HRSI, such as unmanned aerial vehicles, Google Earth, and WorldView-4. Different acquisition conditions and equipment performance enlarge the variance of HRSI quality. To test the robustness of AD-RoadNet to inference image quality, we manually degrade the inference images by adding noise or perturbations of varying degrees. Specifically, we test five severities for three kinds of perturbation (Gaussian Blur, Pepper Noise, and Contrast) and compare the performance of AD-RoadNet with D-LinkNet and HRNetV2. The performance changes are shown in Fig. 14. Overall, the three compared methods are at the same level of robustness to inference image quality. For images with Gaussian Blur, however, the proposed AD-RoadNet suffers a relative degradation and performs much worse than HRNetV2. This may be explained by the structure of HRFM. As the weights of the various RF sizes for the target road are designed to consider its surrounding features, Gaussian Blur smooths the target and surrounding objects and passes a wrong message for RF size matching, thus resulting in weak feature extraction. This hypothesis is supported to some degree by the similar performance changes of AD-RoadNet and baseline+H and by the relatively better robustness of baseline+T, as shown in Fig. 14(d). From the perspective of training, it can be regarded as overfitting to the almost invariable ground sample distance (GSD). With this consideration, we retrain AD-RoadNet with the same configuration but stronger data augmentation that adds a random blur. The resulting performance changes over the five severities of Gaussian Blur are shown in Fig. 14(e). AD-RoadNet with stronger data augmentation (AD-RoadNet-Aug) gains a significant improvement in its robustness to Gaussian Blur while retaining its initial performance.
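The three perturbations can be illustrated with the minimal NumPy sketch below. The severity tables (`SIGMAS`, `PEPPER_RATES`, `CONTRAST_FACTORS`) are placeholder values chosen for illustration, since the paper does not list the parameters behind its five severities, and the function names are hypothetical.

```python
import numpy as np

# Assumed parameters for severities 1..5; the paper does not report them.
SIGMAS = [0.5, 1.0, 1.5, 2.0, 3.0]              # Gaussian Blur std in pixels
PEPPER_RATES = [0.01, 0.03, 0.06, 0.10, 0.15]   # fraction of pixels set to 0
CONTRAST_FACTORS = [0.8, 0.6, 0.45, 0.3, 0.2]   # 1.0 keeps original contrast


def gaussian_blur(image, severity):
    """Separable Gaussian blur applied along the height and width axes."""
    sigma = SIGMAS[severity - 1]
    radius = int(np.ceil(3 * sigma))
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-(offsets ** 2) / (2 * sigma ** 2))
    kernel /= kernel.sum()

    out = image.astype(np.float64)
    for axis in (0, 1):
        pad = [(0, 0)] * out.ndim
        pad[axis] = (radius, radius)
        padded = np.pad(out, pad, mode="reflect")
        out = np.apply_along_axis(
            lambda v: np.convolve(v, kernel, mode="valid"), axis, padded
        )
    return out


def pepper_noise(image, severity, seed=0):
    """Set a random fraction of pixels to black in all channels."""
    rng = np.random.default_rng(seed)
    out = image.astype(np.float64)
    out[rng.random(image.shape[:2]) < PEPPER_RATES[severity - 1]] = 0.0
    return out


def reduce_contrast(image, severity):
    """Shrink each channel's deviation from its mean intensity."""
    x = image.astype(np.float64)
    mean = x.mean(axis=(0, 1), keepdims=True)
    return (x - mean) * CONTRAST_FACTORS[severity - 1] + mean
```

Each function returns a float array in the original intensity range, so the corrupted image can be passed to the same preprocessing as the clean one before comparing the models severity by severity.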
The above analysis may imply that, since the HRFM dynamically adjusts the RF size for a target road according to its surrounding features, more targeted feature extraction carries a risk: the model only remembers the distribution of surrounding objects but does not recognize the corresponding pattern, thus overfitting to a certain GSD. Data augmentation techniques that change the GSD, such as Random Resized Crop and Random Blur, are essential to improve the robustness of the proposed AD-RoadNet.
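To make the link between Random Resized Crop and GSD concrete, the sketch below crops a random window and resamples it to a fixed output size, returning the factor by which the apparent GSD changes. It is an illustrative NumPy version under assumptions: the area range, the nearest-neighbour resampling, and the name `random_resized_crop` are not taken from the paper, and a training pipeline would more likely use the equivalent transform of its deep learning framework.

```python
import numpy as np


def random_resized_crop(image, mask, out_size, area_range=(0.25, 1.0), rng=None):
    """Crop a random window and resample image and mask to out_size x out_size.

    Returns (image_out, mask_out, gsd_scale), where gsd_scale is the number of
    source pixels per output pixel along the height axis. For square inputs the
    same factor applies along the width axis.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = mask.shape

    # Sample the kept area fraction, then convert it to a side fraction.
    side = np.sqrt(rng.uniform(*area_range))
    ch = max(1, int(round(h * side)))
    cw = max(1, int(round(w * side)))
    y0 = int(rng.integers(0, h - ch + 1))
    x0 = int(rng.integers(0, w - cw + 1))

    # Nearest-neighbour resampling keeps the mask strictly binary.
    rows = y0 + (np.arange(out_size) * ch) // out_size
    cols = x0 + (np.arange(out_size) * cw) // out_size
    image_out = image[rows][:, cols]
    mask_out = mask[rows][:, cols]
    return image_out, mask_out, ch / out_size
```

A returned scale above 1 means each output pixel covers more ground than in the source image (coarser apparent GSD), and a scale below 1 means the opposite, so the network sees the same road at several apparent widths during training.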

VI. CONCLUSION
In this article, we proposed AD-RoadNet, which decouples the representation of multiscale road details from the improvement of connectivity while giving overall consideration to both. HRFM is introduced in the encoder to adaptively provide each unit with a suitable RF size, and high-resolution feature information is preserved by residual blocks to detect very narrow roads. Compared to previous research on multiscale road feature extraction (see [17], [21], [35], [69]), this study introduces a topological feature representation module (TFRM) that encodes the connectivity and directionality of roads as additional features. Experimental results demonstrate that the TFRM improves the performance of the network by reducing false positives and enhancing the continuity and smoothness of the extracted roads.
The proposed network achieves state-of-the-art performance on two benchmark datasets, outperforming existing methods in terms of recall and IoU, and its effectiveness is supported by extensive experiments. For future research, we suggest exploring more topological features that could enhance road extraction performance, such as curvature, width, or intersection angles. We also recommend testing our network on different types of HRSI with varying spectral, spatial, and temporal resolutions to evaluate its adaptability and robustness. Furthermore, we propose developing a more interpretable representation of topological features that could provide insights into how they affect the network's decision-making process.

ACKNOWLEDGMENT
The authors would like to thank the editors and anonymous reviewers for their instructive comments.
Declaration of Competing Interest: The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this article.