An Ensemble Wasserstein Generative Adversarial Network Method for Road Extraction From High Resolution Remote Sensing Images in Rural Areas

Road extraction from high resolution remote sensing (HR-RS) images is an important yet challenging computer vision task. In this study, we propose an ensemble Wasserstein Generative Adversarial Network with Gradient Penalty (WGAN-GP) method, called E-WGAN-GP, for road extraction from HR-RS images in rural areas. The WGAN-GP model modifies standard GANs with the Wasserstein distance and a gradient penalty. We add a spatial penalty term to the loss function of the WGAN-GP model to address the class imbalance problem typical of road extraction. Parameter experiments are undertaken to determine the best spatial penalty and weight terms in the new loss function based on the GaoFen-2 dataset. In addition, we apply an ensemble strategy in which we first train two WGAN-GP models using the U-Net and BiSeNet as the generator, respectively, and then intersect their inferred outputs to yield better road vectors. We train our new model with GaoFen-2 HR-RS images of rural areas in China as well as the DeepGlobe Road Extraction dataset. Compared with the U-Net, BiSeNet, D-LinkNet and WGAN-GP methods without ensemble, our new method makes a good trade-off between precision and recall with F1-score = 0.85 and IoU = 0.73.


I. INTRODUCTION
Up-to-date road networks are crucial for vehicle navigation, urban-rural planning, disaster relief and traffic management [1], [2]. However, manual road extraction is costly and slow, especially for rural areas, which are hard to reach and, unlike urban areas and highways, have limited access to GPS data [3], [4]. Automatic road extraction from high resolution remote sensing (HR-RS) images has therefore attracted much attention, since HR-RS images such as GaoFen-2 (GF-2) offer wide imaging coverage, frequent revisit and rich spectral and spatial information of land covers [5], [6]. Most researchers regard road extraction from HR-RS images as a pixel-wise segmentation task which classifies each pixel into road or non-road according to geometric, photometric and texture features of roads (e.g. [2], [7], [8]). Thus, the performance of road extraction highly depends on the chosen semantic segmentation method.
In the last decade, deep convolutional neural networks (CNN) have been introduced to road extraction due to their impressive performance in computer vision [9]-[11]. CNN is a powerful feature extractor for road extraction which can automatically learn low- to high-level semantic features with consecutive convolutional operations. Based on CNN, the fully convolutional network (FCN) and its variants use encoder-decoder structures with skip connections to fuse multi-scale and multi-level features, and have achieved state-of-the-art accuracy in road extraction tasks [8], [12]. Nevertheless, despite the numerous methods proposed, road extraction from HR-RS images is still far from large-scale practical application. The spectral heterogeneity in HR-RS images [6], [13] and the occlusion of roads by trees, shadows and other non-road objects are the key factors that make the segmented road networks messy and unclean [14], [15]. A more general and robust road segmentation method is needed to improve the performance of road extraction from HR-RS images.
Recently, the generative adversarial network (GAN) has taken the deep learning community by storm due to its unparalleled flexibility and accuracy in image segmentation tasks [16]. It consists of a generator G and a discriminator D, where G generates images that are indistinguishable from the ground truth and D distinguishes between the ground truth and the generated images. A few works have used standard GANs for road extraction from HR-RS images and achieved good performance [4], [7], [15], [17], [18]. However, standard GANs may suffer from vanishing gradients, mode collapse, and unstable training [19], [20]. To solve these problems, [4] recently introduced the Wasserstein GAN with gradient penalty (WGAN-GP) model into road extraction. The WGAN-GP modifies the standard GAN using the Wasserstein distance and a gradient penalty, which makes it more stable and easier to train [21]. However, that work neglects the class imbalance problem arising because roads are spatially sparse in HR-RS images. In addition, it only uses the U-Net [12] as G, which could be improved upon by other state-of-the-art segmentation models such as BiSeNet [22]. Furthermore, creating ensembles of multiple models may be an effective way to improve the overall performance of CNN methods [40]-[42].
In summary, the main contributions of this study are as follows:
1. A new WGAN-GP-based method named E-WGAN-GP is introduced for road extraction from HR-RS images in rural areas. The E-WGAN-GP method trains two WGAN-GP models with U-Net and BiSeNet as G, respectively, and then uses an ensemble strategy of intersection calculation to improve the overall performance.
2. A spatial penalty term is added to the loss function of the WGAN-GP model to solve the class imbalance issue typical of road extraction.
3. Parameter experiments are undertaken to determine the best spatial penalty and weight terms of the E-WGAN-GP method on the GF-2 dataset.
4. The E-WGAN-GP method outperforms the state-of-the-art methods U-Net, D-LinkNet, BiSeNet and WGAN-GP without ensemble in terms of F1-score and IoU on both the GF-2 and DeepGlobe datasets.

II. RELATED WORKS
Road extraction from HR-RS images has been studied intensively. Most studies use classification-based methods, which can be further divided into supervised and unsupervised [23]. For supervised classification, [2] propose an artificial neural network (ANN) method with a Restricted Boltzmann Machine (RBM) initialization procedure to segment roads. [24] use an object-oriented method based on the support vector machine (SVM) to extract road centerlines. [25] compare Markov random fields (MRF) with a hybrid model of the SVM and Fuzzy C-Means (FCM) for road extraction. For unsupervised classification, [26] apply K-means clustering to extract road areas, followed by mathematical morphology to remove non-road areas. [27] present a semi-automatic approach to delineate road networks incorporating geodesic and mean shift methods. [28] introduce probability and graph theory to infer road networks. Other classic methods, represented by the snake and conditional random field (CRF) models, are also adopted in road extraction. For example, [29] evolve a snake to capture road networks by minimizing an energy function which specifies geometric and appearance constraints of roads. [30] employ a high-order CRF to capture the structures of road networks.
Inspired by the outstanding performance of deep learning in computer vision since 2012, CNN methods have been introduced into road extraction from HR-RS images. Compared with traditional road extraction methods, which mostly utilize low-level handcrafted features, CNN can automatically extract low- to high-level semantic features with consecutive convolutional operations [31], [32]. On the basis of CNN, FCN and its variants use encoder-decoder structures with skip connections to fuse multi-scale and multi-level features. In addition, they can output pixel-wise segmentation images with the same size as the input images, which suits end-to-end road extraction tasks [8], [12]. For example, [33] refine the U-Net with residual units to improve its performance. [34] propose the D-LinkNet, which combines the LinkNet architecture with dilated convolutional layers. Other CNN methods are also adopted in road extraction. For example, [35] build the cascaded CNN (CasNet) to address road segmentation and centerline extraction tasks. [14] create RoadTracer, which derives road network graphs through an iterative search procedure guided by a CNN-based decision function. [36] introduce DeepLab to detect roads with MobileNet-v2 as the backbone. However, despite numerous studies, road extraction from HR-RS images is still far from large-scale practical application. The spectral heterogeneity in HR-RS images [6], [13] (e.g. large intra-class or small inter-class spectral variability) and the presence of occlusions (e.g. shadows and trees) often make segmented road networks unclean and noisy [13]-[15]. A more general and robust road segmentation method is still needed to improve the performance of road extraction from HR-RS images.
GAN, a revolutionary technique in computer vision, is promising for fixing these problems [16]. In a GAN we train two different neural networks, a generator G and a discriminator D, each adversarial to the other. GAN-based methods regard road extraction as an image-to-image translation task and work in the same way as a conditional GAN (cGAN) [17], [37]. A few works have adopted the cGAN in road extraction from HR-RS images and made some progress. For example, [15] present a two-stage framework to extract roads, in which two GANs are used to detect roads and intersections, followed by a smoothing-based optimization algorithm. [7] propose an improved GAN using the U-Net as G and suggest a simple loss function with an L2 loss and a cGAN loss. [18] create a multi-supervised GAN with two discriminators to infer road networks with improved topology. However, these methods are complex either in architecture (e.g. two GANs in [15]) or in loss function (e.g. four loss functions in [18]). Moreover, standard GANs suffer from vanishing gradients, mode collapse, and unstable training, which makes them hard to train [19], [20]. Recently, [4] introduced a WGAN-GP method to translate HR-RS images into cleaner road networks. The WGAN-GP model was proposed by [21] in 2017. It replaces the JS divergence in standard GANs with the Wasserstein distance and employs a gradient penalty to enforce a Lipschitz constraint on the loss function. The WGAN-GP model trains more stably and outperforms other methods in road extraction [4]. Our study further refines the WGAN-GP model to improve its performance in road extraction from HR-RS images in rural areas, as presented in the following sections.

III. METHODOLOGY

A. MODEL ARCHITECTURE
The architecture of the WGAN-GP model follows pix2pix [37], in which two adversarial models (a generator G and a discriminator D) are trained simultaneously. G is given the image x together with a random noise z and aims to produce a realistic image G(x, z) similar to the label y, while D has to decide whether an image belongs to y or was produced by G. The detailed architecture of the WGAN-GP model is shown in Figure 1. Since G can be any segmentation model, we choose the widely used U-Net and the state-of-the-art BiSeNet as G, respectively, in this study. The U-Net is characterized by a symmetric encoder-decoder pathway with 4 down-sampling modules and 4 up-sampling modules (Figure 2a) [12]. The down-sampling module consists of two 3 × 3 convolutions, each followed by batch normalization (BN) and a rectified linear unit (ReLU), and a 2 × 2 max pooling operation. The up-sampling module contains a 2 × 2 up-convolution, a concatenation with the corresponding feature map from the encoder pathway at the same level, and two 3 × 3 convolutions, each followed by BN and a ReLU. The BiSeNet is composed of a Spatial Path with a small stride to preserve spatial information, and a Context Path with a fast down-sampling backbone (e.g. ResNet-101) to obtain a sufficient receptive field (Figure 2b) [22]. To refine the output feature of each stage in the Context Path, the BiSeNet proposes an Attention Refinement Module (ARM), which employs global average pooling to capture global context and computes an attention vector to guide feature learning. To fuse the output features of the two paths, the BiSeNet proposes a Feature Fusion Module (FFM), which concatenates the features and uses a weight vector to re-weight them.
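As an illustration, the down- and up-sampling modules described above can be sketched in PyTorch as follows. This is a minimal sketch: the channel widths and the exact wiring are illustrative and not necessarily the paper's configuration.

```python
import torch
import torch.nn as nn

class DownBlock(nn.Module):
    """One U-Net down-sampling module: two 3x3 convs (each with BN + ReLU),
    then a 2x2 max pooling. Returns the pooled map and the pre-pool skip tensor."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        skip = self.convs(x)           # feature map kept for the skip connection
        return self.pool(skip), skip   # halved spatial size, plus skip tensor

class UpBlock(nn.Module):
    """One U-Net up-sampling module: 2x2 up-convolution, concatenation with the
    encoder skip at the same level, then two 3x3 convs (each with BN + ReLU)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, 2, stride=2)
        self.convs = nn.Sequential(
            nn.Conv2d(out_ch * 2, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = self.up(x)                   # doubles the spatial size
        x = torch.cat([x, skip], dim=1)  # fuse encoder features at the same level
        return self.convs(x)
```

Stacking four of each, with channel widths doubling on the way down and halving on the way up, reproduces the symmetric encoder-decoder shape of Figure 2a.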
To restrict our attention to the structure of local image patches, we follow pix2pix and use a PatchGAN [37] discriminator with 5 down-sampling modules as D in the WGAN-GP model. It classifies whether each N × N patch in an image is real or fake and averages the patch-wise outputs to obtain the final result (Figure 2c).
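A PatchGAN-style D with 5 strided down-sampling stages can be sketched as below. The input channel count (image plus candidate road map) and the layer widths are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

def patchgan_discriminator(in_ch=4, base=64):
    """PatchGAN-style discriminator sketch: 5 stride-2 conv stages map the input
    (here assumed to be a 3-channel image concatenated with a 1-channel road map)
    to a grid where each element scores one N x N receptive-field patch."""
    layers, ch = [], in_ch
    for i in range(5):
        out = base * min(2 ** i, 8)   # cap channel growth, as in pix2pix-style critics
        layers += [nn.Conv2d(ch, out, 4, stride=2, padding=1),
                   nn.LeakyReLU(0.2, inplace=True)]
        ch = out
    layers += [nn.Conv2d(ch, 1, 4, stride=1, padding=1)]  # one score per patch
    return nn.Sequential(*layers)
```

Averaging the output grid gives the single image-level decision described in the text.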

B. LOSS FUNCTION
The total loss function L_total(G, D) of our WGAN-GP model mixes the cGAN loss with the pixel-wise loss [37], which can be expressed as follows:

L_total(G, D) = L_cGAN(G, D) + β L_L1(G),   (1)

where L_cGAN(G, D) is the standard cGAN loss, L_L1(G) is the pixel-wise loss, and β is the weight balancing the two losses. The L_L1(G) evaluates the pixel-wise quality of the image generated by G and encourages less blurring than an L2 loss [36]. Since roads are spatially sparse and account for only a small proportion of an image, as shown in Figure 4, the class imbalance problem cannot be ignored [39], [40]. We therefore add a spatial penalty term ω to counter the impact of class imbalance: ω is set to 1 if the pixel belongs to the background and to α (α > 1) if the pixel belongs to a road. Accordingly, there is a larger punishment if a road pixel is misclassified as background:

L_L1(G) = E_{x,y,z}[ω ⊙ ‖y − G(x, z)‖_1],   (2)

where x, y and z are drawn from the distributions of the input image, the label image and the random noise, respectively, and E_{x,y,z} denotes the expected value of the pixel-wise loss.

The L_cGAN(G, D) is the standard cGAN loss and can be expressed as:

L_cGAN(G, D) = E_{x,y}[log D(x, y)] + E_{x,z}[log(1 − D(x, G(x, z)))],   (3)

where G tries to minimize this loss while D tries to maximize it, E_{x,y} denotes the expected value over real data instances, and E_{x,z} denotes the expected value over generated instances G(x, z). However, [19] argued that the JS divergence used in standard GANs fails to provide a usable gradient when the two distributions are disjoint, which makes training difficult. They suggest replacing the JS divergence with the Wasserstein distance, informally defined as the minimum cost of transporting mass from one distribution to another [20]. In contrast to the JS divergence, the Wasserstein distance is continuous and differentiable almost everywhere.
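A minimal NumPy sketch of the ω-weighted pixel-wise loss, assuming binary labels with road = 1 and taking alpha as the spatial penalty:

```python
import numpy as np

def weighted_l1(pred, label, alpha=100.0):
    """Pixel-wise L1 loss with the spatial penalty omega: road pixels (label == 1)
    are weighted by alpha, background pixels by 1, so a missed road costs more."""
    omega = np.where(label == 1, alpha, 1.0)
    return np.mean(omega * np.abs(label - pred))
```

With alpha = 100, missing a single road pixel is punished 100 times as hard as falsely marking a background pixel, which is the intended counterweight to the sparsity of roads.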
We modify the L_cGAN(G, D) to the WGAN loss with the Wasserstein distance as follows:

L_cGAN(G, D) = E_{x,y}[D(x, y)] − E_{x,z}[D(x, G(x, z))].   (4)

In addition, realizing the WGAN loss requires enforcing a Lipschitz constraint on D, for which a gradient penalty is considered better than weight clipping [21]. We add a gradient penalty term to our final L_cGAN(G, D) as follows:

L_cGAN(G, D) = E_{x,y}[D(x, y)] − E_{x,z}[D(x, G(x, z))] − λ E_x̂[(‖∇_x̂ D(x, x̂)‖_2 − 1)^2],   (5)

where x̂ = εy + (1 − ε)G(x, z), with ε ~ U[0, 1], is sampled uniformly along straight lines between pairs of points from the distribution of the label image y and the distribution of the generated image G(x, z), λ is the penalty coefficient, ∇_x̂ D(x, x̂) is the gradient of D at x̂, and E_x̂ is the expected value of the gradient penalty term.
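The gradient penalty term can be sketched in PyTorch as follows. This is a simplified version in which D takes a single tensor; in the paper D is additionally conditioned on the input image x.

```python
import torch

def gradient_penalty(D, real, fake, lam=10.0):
    """WGAN-GP term (sketch): sample x_hat uniformly on straight lines between
    real and fake samples, then penalize the squared deviation of the gradient
    norm ||grad_x_hat D(x_hat)|| from 1."""
    eps = torch.rand(real.size(0), 1, 1, 1)                 # one epsilon per sample
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    grads = torch.autograd.grad(D(x_hat).sum(), x_hat, create_graph=True)[0]
    norms = grads.flatten(1).norm(2, dim=1)                 # per-sample gradient norm
    return lam * ((norms - 1) ** 2).mean()
```

`create_graph=True` keeps the penalty differentiable so it can itself be back-propagated when updating D.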

C. OPTIMIZATION PROCEDURE
The training dataset, including images x and labels y, is fed to the model. The overall training objective optimizes the total loss function L_total(G, D) to yield the final model G*, which can be expressed as follows:

G* = arg min_G max_D L_total(G, D).   (6)

Specifically, an alternate optimization procedure is adopted: we first optimize D with G fixed (Equation 7) and then optimize G with D fixed (Equation 8):

max_D E_{x,y}[D(x, y)] − E_{x,z}[D(x, G(x, z))] − λ E_x̂[(‖∇_x̂ D(x, x̂)‖_2 − 1)^2],   (7)

min_G −E_{x,z}[D(x, G(x, z))] + β E_{x,y,z}[ω ⊙ ‖y − G(x, z)‖_1].   (8)

To accelerate optimization, mini-batch Adam is used, and the training process is repeated until the training loss converges. A pseudocode for the optimization procedure is presented in Table 1.
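The alternating procedure can be sketched as below. Toy 1 × 1-convolution models stand in for the real generator and critic, and the gradient penalty and spatial penalty weights are omitted for brevity; this is an illustrative skeleton, not the paper's training script.

```python
import torch
import torch.nn as nn

G = nn.Conv2d(3, 1, 1)                 # toy generator: image -> road map
D = nn.Conv2d(4, 1, 1)                 # toy critic: image (3ch) + road map (1ch)
opt_d = torch.optim.Adam(D.parameters(), lr=5e-4, betas=(0.3, 0.99), weight_decay=5e-5)
opt_g = torch.optim.Adam(G.parameters(), lr=5e-4, betas=(0.3, 0.99), weight_decay=5e-5)

x = torch.randn(2, 3, 8, 8)                       # mini-batch of images
y = torch.randint(0, 2, (2, 1, 8, 8)).float()     # binary road labels
beta = 10.0

for step in range(3):
    # 1) update D with G fixed: maximize E[D(x,y)] - E[D(x,G(x,z))]
    fake = G(x).detach()
    loss_d = -(D(torch.cat([x, y], 1)).mean() - D(torch.cat([x, fake], 1)).mean())
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    # 2) update G with D fixed: minimize -E[D(x,G(x,z))] + beta * L1 term
    fake = G(x)
    loss_g = -D(torch.cat([x, fake], 1)).mean() + beta * (y - fake).abs().mean()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```

In the full method, each D step would also subtract the gradient penalty of Equation (5), and the L1 term would carry the spatial penalty ω.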

D. ENSEMBLE STRATEGY
Due to the spectral heterogeneity in HR-RS images and the occlusions on roads, mistakes in road extraction are inevitable even for state-of-the-art models. Creating ensembles of multiple models has proved to be an effective way of improving accuracy in CNN methods, because models trained with different architectures or hyper-parameters may carry complementary information [40]-[42].
In this study, the ensemble strategy is carried out as follows. In the training stage, two WGAN-GP models, with U-Net and BiSeNet as G respectively, are trained on the same training dataset with the same hyper-parameters. We do not train more models by tuning the hyper-parameters, since too many models would affect the inference time. In the inference stage, we use the trained models to infer the testing images and yield different sets of segmentation outputs. Since intersecting vector objects is faster than fusing pixel values in segmentation images, we then vectorize the segmentation outputs and calculate the intersected vectors. Finally, we convert the intersected vectors back to segmentation images and apply the skeleton thinning method, followed by another vectorization, to yield the final road vectors. Our ensemble strategy can take advantage of the best of different segmentation models and produce road vectors with high precision, as shown in Figure 5. The overall ensemble strategy is shown in Figure 3.
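As a raster-level stand-in for the vector intersection step (the actual pipeline intersects vectorized outputs), the ensemble idea can be illustrated on binary masks:

```python
import numpy as np

def ensemble_intersect(mask_a, mask_b):
    """Keep only the pixels both models predict as road. This mirrors the paper's
    vector intersection step at the raster level: the intersection retains roads
    both models agree on, which raises precision at some cost to recall."""
    return np.logical_and(mask_a > 0, mask_b > 0).astype(np.uint8)
```

The precision/recall trade-off is the design rationale: each model's false positives are unlikely to coincide, so the intersection suppresses them, while true roads found by both survive.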

IV. EXPERIMENTAL RESULTS AND ANALYSIS
A. DATASETS
We use GF-2 images in this study. Launched in 2014, the GF-2 satellite is the first civil optical remote sensing satellite in China with a spatial resolution of 0.8 m. GF-2 is an ideal HR-RS data source for road extraction because of its high resolution, wide imaging coverage, frequent revisit and high image quality [6], [38].
We select over 30 scenes of GF-2 images in Fujian and Hainan Provinces of China, covering various types of rural areas. The scenes are all in the visible spectrum with RGB bands. Pre-processing consists of radiometric calibration, atmospheric correction and orthorectification. The original scenes are cropped into 512 × 512 pixel tiles, yielding 40000 images. Data augmentation methods including rotation and flipping are also applied to expand the dataset. We randomly split the dataset into a training set of 36000 images and a validation set of 4000 images. For testing, we choose three extra typical scenes of rural areas in Fujian and Hainan Provinces covering mountainous, plain and sub-urban areas, each with a size of 60000 × 60000 pixels. Each image x has an accompanying binary label y indicating whether a pixel belongs to a road (denoted as 1) or non-road (denoted as 0), as shown in Figure 4. To generate the label images, we follow the centerline-based approach, labelling road vectors manually in the ArcGIS software and then rasterizing them with a line width of 5 pixels [7], [39].
To verify the effectiveness of our method, we also conduct experiments on the well-known DeepGlobe Road Extraction Challenge 2018 dataset [44]. We only use the training portion of the DeepGlobe dataset, since label images for its validation and testing sets have not been published. It consists of 6,226 HR-RS images of 1024 × 1024 pixels. We randomly split these into a training dataset of 5500 images, a validation dataset of 500 images and a testing dataset of 300 images. The spatial resolution of the dataset is 0.5 m, higher than that of the GF-2 images.

B. EXPERIMENTAL SETTINGS
We implement our E-WGAN-GP method with the PyTorch framework and train it on NVIDIA Titan RTX GPUs with HR-RS images and paired label images from the training dataset. The spatial penalty term α in L_L1(G) and the weight term β in L_total(G, D) are set to 100 and 10, respectively, according to our parameter-sensitivity experiment. The penalty coefficient λ is set to 10 following [21]. For optimization, we use mini-batch Adam with β1 = 0.3, β2 = 0.99 and weight decay = 5e-5. The initial learning rate is set to 0.0005 and is halved every 50 epochs. The maximum number of epochs is 300 and the batch size is 20.

C. COMPARISON METHODS AND EVALUATION METRICS
Our new method is compared with four other methods: the U-Net [33], the BiSeNet [22], the D-LinkNet [34] and the WGAN-GP method using BiSeNet as G.
Traditional pixel-based segmentation metrics such as precision, recall and IoU are unsuitable as evaluation metrics in this study, since a slight error in road width heavily penalizes them while a small gap in an inferred road barely does [43]. Since our final outputs are road vectors, we instead use vector-based precision, recall, F1-score and IoU to evaluate the performance of our method. Precision and recall show the correctness and completeness of the inferred road vectors, respectively. F1-score balances precision and recall and reaches its best value at 1. IoU represents the intersection of the prediction and the ground truth over their union. Precision (P), recall (R), F1-score (F1) and IoU can be expressed as follows:

P = TP / (TP + FP),
R = TP / (TP + FN),
F1 = 2PR / (P + R),
IoU = TP / (TP + FN + FP),

where TP, FN and FP denote the true positives, false negatives and false positives, respectively.
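As a sketch, the four metrics follow directly from the TP/FP/FN counts. Shown here on binary arrays for simplicity; the paper evaluates matched road vectors, but the formulas are the same.

```python
import numpy as np

def road_metrics(pred, gt):
    """Precision, recall, F1 and IoU from binary prediction/ground-truth arrays.
    Assumes at least one positive in both pred and gt (no zero-division guard)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()        # correctly extracted road
    fp = np.logical_and(pred, ~gt).sum()       # extracted but not a road
    fn = np.logical_and(~pred, gt).sum()       # road that was missed
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * p * r / (p + r)
    iou = tp / (tp + fn + fp)
    return p, r, f1, iou
```
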

D. EVALUATION ON GF-2
We use hit/miss images, which superpose the extracted and labelled road vectors on the original images, to highlight the hit and miss areas (Figure 5). We choose three typical rural scenes to demonstrate the experimental results. Scene 1 represents a mountainous area where roads are distributed along valleys. Scene 2 displays a plain area with a river running through it. Scene 3 is a sub-urban area with a large body of water and many residences.
In the hit/miss images, yellow lines denote roads that are extracted correctly (TP), red lines denote roads that are missed (FN), and blue lines denote roads that do not exist but are extracted incorrectly (FP). The E-WGAN-GP, WGAN-GP (BiSeNet) and U-Net show many more yellow lines and fewer red lines than the other methods, suggesting they are superior in terms of R. The E-WGAN-GP and the D-LinkNet show far fewer blue lines than the other methods, indicating they outperform the others in terms of P. Considering both R and P, the E-WGAN-GP method yields the best results in the hit/miss images. Table 2 shows a quantitative comparison of the five methods, including P, R, F1 and IoU. The E-WGAN-GP and U-Net give the highest R (0.85-0.87), and the E-WGAN-GP and D-LinkNet yield the highest P (0.86-0.90). As a result, our E-WGAN-GP makes the best trade-off between P and R with the highest F1 (0.85) and IoU (0.73) and is more reliable for large-scale practical applications.
To better understand the errors our E-WGAN-GP makes on the GF-2 dataset, Figure 6 shows one result from Scene 2 with the paired original image and hit/miss image. Typical FNs occur where roads are thin and narrow or hard to distinguish from the background due to shadows and trees (purple box). Possible remedies include using different road widths in dataset generation [7] and improving the post-processing of the segmentation images. Typical FPs are caused by either narrow roads (blue box) or interior roads in residences (orange box). One possible way to tackle this problem is adding more corresponding negative samples to the training dataset.

E. PARAMETER ANALYSIS
Since the performance of the E-WGAN-GP method is highly sensitive to the parameters of its loss function, we undertake experiments to determine the best parameters on the GF-2 dataset. The parameters involved are the spatial penalty term α in L_L1(G) and the weight term β in L_total(G, D), the most critical parameters during training. From Figure 7 and Table 3, we observe that adding the spatial penalty term α to L_L1(G) significantly improves the performance. In addition, as we increase α and β (α from 5 to 100 and β from 1 to 10), IoU increases from 0.57 to 0.74. These results suggest that the spatial penalty term α helps mitigate the class imbalance issue typical of road extraction tasks [39], [40].
However, when α and β become too large (α from 100 to 200 and β from 10 to 100), IoU instead decreases from 0.74 to 0.69. This may be because overly large α and β push the model to label many non-road objects as roads. We suggest proper ranges for α and β of 50-150 and 5-50, respectively; in this study, we set α and β to 100 and 10.

F. EVALUATION ON DEEPGLOBE
To further verify the effectiveness of our method, we also compare its performance on the public DeepGlobe Road Extraction Challenge 2018 dataset. The comparison method is the D-LinkNet, the champion method of the 2018 challenge. The hit/miss images of three typical scenes are shown in Figure 8. Scene 1 and Scene 2 are rural areas and Scene 3 is an urban area. Our method yields more yellow lines and fewer red lines than the D-LinkNet, suggesting it is superior in terms of R. It is noteworthy that our method may be better suited to rural areas, since it produces many more blue lines on the interior roads of urban areas than the D-LinkNet. The quantitative comparison including P, R, F1 and IoU is displayed in Table 4. Our method achieves higher average R, F1 and IoU than the D-LinkNet, confirming its effectiveness in road extraction from HR-RS data.

V. CONCLUSION
In this study, we present an E-WGAN-GP method for road extraction from HR-RS images in rural areas. The WGAN-GP model used in our method modifies standard GANs with the Wasserstein distance and a gradient penalty. To solve the class imbalance problem typical of road extraction, we add a spatial penalty term to the loss function. Parameter experiments are undertaken to determine the best parameters of the new loss function on the GaoFen-2 dataset. We also adopt an ensemble strategy: we first train different WGAN-GP models with different generators (U-Net and BiSeNet) and then intersect their results to yield better road vectors. The E-WGAN-GP method outperforms the state-of-the-art methods U-Net, D-LinkNet, BiSeNet and WGAN-GP without ensemble in terms of F1-score and IoU on both the GF-2 and DeepGlobe datasets.