A Novel Boundary Loss Function in Deep Convolutional Networks to Improve the Buildings Extraction From High-Resolution Remote Sensing Images

Abstract—In recent years, there has been a significant increase in the production of high-resolution aerial and satellite imagery. Analyzing and extracting urban features from these images, especially building components, is a significant challenge in photogrammetry and remote sensing. Meanwhile, deep convolutional neural networks have been used as powerful models for the semantic segmentation of building features. However, due to the structure of these deep convolutional networks, accurately retrieving building boundary information during training is difficult, which leads to ambiguous boundary areas. On the other hand, using distribution-based loss functions, such as the binary cross-entropy loss, alone cannot improve segmentation accuracy in the boundary areas of building features. Accordingly, in this article, a differentiable boundary loss function is introduced to optimize and enhance the extraction of buildings in border areas. To calculate the proposed boundary loss function, a distance transform image (DTI) is obtained from the ground-truth images and from the predicted images derived from the segmentation network. In this article, a differentiable method for calculating the DTI is presented. Another advantage of the proposed loss function is that it can be used in a wide range of deep convolutional segmentation networks. The proposed method was investigated and tested using the ISPRS Potsdam and Vaihingen datasets. The results show superior evaluation scores and improved building extraction when the proposed boundary loss function is used together with a distribution-based loss function.

I. INTRODUCTION

A notable aspect of recent progress and advancement is the increased spatial resolution of images, down to the centimeter level, now accessible in huge volumes and on a daily basis [1]. High spatial resolution images contain precise spatial and spectral information about the earth, so they are applied to a wide range of applications such as urban planning [2], precision agriculture [3], management of environmental resources [4], monitoring of events and natural disasters [5], and three-dimensional (3-D) reconstruction of buildings [6]. Among these diverse applications, the automatic extraction of buildings, as the most critical urban objects, from high-resolution images has always attracted the attention of researchers in this field. Given the varied characteristics of buildings (e.g., size and shape) and the presence of visually similar objects in high-resolution satellite and aerial images (such as roads and parking lots), obtaining features that are consistent within the building class while remaining discriminative against other classes is very challenging in automatic building extraction [7]. As a result, in photogrammetry and remote sensing, automatic building extraction from high spatial resolution images is a difficult problem. Thanks to their capabilities in feature learning, deep convolutional neural networks (DCNNs) improve the accuracy of semantic segmentation [9]. Convolutional and pooling operations are iterated in a DCNN to enlarge the receptive field over the original images and obtain deep semantic features. Nonetheless, the downsampling reduces the dimensions of the original image. This process leads to the loss of critical spatial details, resulting in inappropriate segmentation with incorrect boundaries.
Computer vision scholars have developed different encoder-decoder architectures to effectively merge high-level semantic information with low-level spatial information, aiming to reduce the loss of spatial details and generate more reliable segmentation results [10]. The idea of merging feature maps produced by the encoder part is generally adopted in DCNN-based building extraction approaches and helps maintain spatial details [9], [11].
In some research, multilevel features extracted from encoder-decoder architectures have contributed significantly to improving segmentation. However, low-level features, including boundary information extracted directly from independent segmentation operations, are error-prone and may adversely impact dense pixel-based segmentation results [12]. Accordingly, researchers have recently studied the joint learning of semantic segmentation and auxiliary tasks associated with the shape of the objects of interest [13], [14]. Some researchers enhance segmentation accuracy through a multitask network that combines object segmentation with boundary estimation [12], [15]. Yet, using binary values of the extracted boundaries does not consider the connection between neighboring pixels, thus reducing the robustness of these methods. Meanwhile, a distance transform image (DTI) measures the distance of each pixel from the closest edge of the objects [16]. The pixel values change gradually at boundary areas, and the advantage of the DTI is that it establishes a complementary relationship among neighboring pixels.
In this article, the idea of applying the DTI is used to define a boundary loss function in DCNNs so that the building extraction accuracy near the boundaries between buildings and background is improved. The DTI can serve as an effective index feature for detecting boundary areas in the classifier network. In other words, the DTI offers an alternative to the classical ground truth: a binary mask can be converted to a gray-level image in which the intensity of each foreground pixel is set according to its distance from the closest boundary [14]. Based on this idea, one of the goals of this study is to present a loss function based on the DTI, named the boundary loss function. The purpose of this loss function is the more accurate optimization of object boundaries in the building extraction process. Moreover, previous methods for calculating the DTI are not differentiable and cannot be directly used for network training. The present study therefore introduces a novel differentiable approach that calculates the DTI in an iterative process based on the morphological dilation operator. By doing this, the network is able to compute the DTI, and it can be merged with the DCNN to calculate the boundary loss function.
The rest of this article is set out as follows. Section II discusses previous relevant research on building extraction methods based on deep learning. Section III contains a detailed description of the proposed network and boundary loss functions. Then, in Sections IV and V, the details of the implementation experiments and discussion are addressed. Finally, Section VI contains the article's conclusion.

II. RELATED WORK

A. Semantic Segmentation Based on DCNN
The process of assigning a class label to each pixel in an image is known as semantic segmentation. In recent years, DCNNs have been widely used in semantic segmentation; one such example is the fully convolutional network (FCN) [17], [18]. In DCNN-based semantic segmentation, downsampling operations are used to obtain semantic features in deeper layers. However, essential spatial features are lost as a result. Deep and shallow features are therefore connected via summation or concatenation operators to combine information from different spatial scales. The popular UNet [10], hourglass-shaped networks [19], [20], and RefineNet [21] adopt such a structure in their architectures. In addition, to extract objects of different sizes, features at diverse scales are merged through dilated convolutions and multiscale pooling layers, as in the popular PSPNet [22] and DeepLabv3+ [23]. Generally, multilevel feature merging significantly enhances the performance of segmentation networks.

B. Building Extraction Based on DCNN
In the past decades, researchers in photogrammetry and remote sensing have intensively studied the extraction of buildings from aerial and satellite images with high spatial resolution [7]. Color, spectrum, edges, shapes, and shadows are among the features derived from the images in these studies. One of these features, or a combination of them, is then used to classify buildings, either in an unsupervised process [24]-[28] or in supervised classification [29]-[32]. Due to the complexity of building appearance in high spatial resolution satellite and aerial imagery and its similarity to other urban objects such as roads, the handcrafted features of traditional building extraction methods are often not accurate enough, which sometimes leads to weak generalization capability in such approaches [33].
Many novel semantic segmentation approaches in remote sensing have recently been developed based on DCNNs, which can learn highly discriminative features [34]. DCNN-based approaches fall into two types: patch-based and FCN-based methods. In the patch-based process, a pixel is labeled by classifying a patch centered on it [35]-[37]. To classify all of the image pixels, the patch-based segmentation process must be repeated many times, resulting in a high computational cost. To address this issue, approaches based on the FCN have been designed that can handle input images of arbitrary dimensions [12], [38]-[40]. A set of hourglass-shaped networks like UNet has been developed to fuse multiscale features, reduce the loss of spatial details, and generate accurate building extraction outcomes [41]-[46]. In recent years, multimodal fusion networks have been introduced to enrich feature representations by modeling semantic dependencies in the channel and spatial dimensions, and some scholars have adopted this idea in segmentation networks to boost the features learned for building representation [47]-[50].
Recent research works have employed additional tools in the segmentation process, such as the shape structure of objects [51], [52], using morphological information such as boundary learning and DTIs to enhance the capability of segmentation networks. To this end, using a DCNN to detect the boundary areas of objects has been proposed as a solution, and the learned boundaries can be used to improve the segmentation process. In [46], for instance, an additional network is used to estimate the boundaries; the estimated boundaries, together with features generated from the image, are then used as inputs to a segmentation network. In [53], to boost the performance of building extraction, a DTI and the estimated image resulting from classification are combined in parallel in different encoder-decoders. In [54], a signed DTI is presented for building extraction; the idea is to provide a boundary between buildings at large scales, which improves the detection of different building areas with complicated spatial relations.
In some research, boundary information is incorporated into the network optimization process through a boundary loss function. For instance, Bokhovkin and Burnaev [55] propose a boundary loss function that uses boundary information to improve the segmentation of high-resolution images. On the other hand, the DTI can be used to supply morphological information and improve semantic segmentation. In computer vision, the researchers in [56] and [57] adopt this idea and introduce loss functions based on the shape of the foreground objects: Karimi and Salcudean [56] calculate the loss function by minimizing the Hausdorff distance, while in [57] the loss value is derived from the shared surface between the predicted image and the ground truth. The methods mentioned above rely on developing various network structures to obtain robust features and only partially address the problem of asymmetric class distribution. As a consequence, building extraction is biased toward the background, producing results with missing building boundaries. Therefore, a novel boundary loss function is presented in this study based on the power of the DTI to represent information about object boundaries. The segmentation network is optimized using the boundary loss function in conjunction with a distribution-based loss function. During training, this combined learning is explicitly enforced in the optimization task, which helps produce correct building extraction results. Fig. 1 illustrates the architecture of the proposed method for extracting building objects from high spatial resolution images. In the building extraction phase, a convolutional network similar to UNet is used; this architecture is known as one of the successful deep learning architectures in medical image segmentation.
The UNet architecture [10] is inspired by FCNs, and its central idea is merging shallow layers with deep layers through concatenation operations. This provides a more suitable distribution of the gradient values during network training and improves classification accuracy. Moreover, in contrast to the SegNet architecture [20], an upsampling structure is used instead of an unpooling structure. The related details are discussed in Section III-A.

III. PROPOSED METHOD
Nonetheless, because of the downsampling operations and the resulting loss of building edge details, the accuracy of the segmentation network is low in boundary areas. To address this issue, in addition to the binary cross-entropy loss, a loss function based on boundary areas and the DTIs is introduced in the proposed architecture. As shown in Fig. 1, this boundary loss function is calculated independently and does not change the main structure of the segmentation network. Additionally, this loss function can be used in the design of other segmentation neural networks. Section III-B provides the details of the boundary loss function and shows how it is calculated.

A. Segmentation Network
As previously mentioned, the structure of the proposed segmentation network is similar to that of the UNet model. However, some changes have been applied to the encoder and decoder parts of the original UNet structure. Deep features are extracted in the encoder part using the ResNet-50 architecture [58]. The fully connected layers of this architecture are removed, and the decoder part of the network is added for dense pixel estimation. The use of deep residual networks helps overcome the vanishing gradient problem during network training. In the training phase, the parameters of ResNet-50 pretrained on the ImageNet dataset [59] are used to initialize the learning parameters and increase the convergence speed of the model. As seen in Fig. 1, the ResNet-50 network consists of five blocks, named Conv(t, s) in this article, where t and s denote the block number and the number of residual convolutional layers in it, respectively. The first block (Conv(1, 0)) in ResNet-50 lacks residual convolutional layers and uses only one 7×7 convolutional layer. Assuming that the input images have dimensions W×H, the output dimensions of the feature space of each convolutional block E_i (i ∈ {1, 2, 3, 4, 5}) will be (W/2×H/2), (W/4×H/4), (W/8×H/8), (W/16×H/16), and (W/32×H/32), respectively. The feature space E_6 is associated with the final pooling layer and has dimensions (W/64×H/64). Similar to the UNet structure, the side features of the encoder blocks are merged, via concatenation operations before the upsampling operations, with the corresponding features of the decoder flow.
The flow of the decoder part corresponding to each encoder block is represented by D_m, and its value is calculated as follows:

D_m = RUM+(conv_1×1(E_m © D_(m+1)))    (1)

where conv_1×1 represents the convolution operation with a kernel size of one, E represents the encoder features, © denotes the concatenation operation, and RUM+ denotes the residual upsampling module. In this article, in contrast to conventional upsampling methods [see Fig. 2(a)], the idea is to use residual connection layers in the upsampling convolution block structure [60]. Fig. 2(b) shows the simple residual unit module (RUM) architecture [28]. Based on the design of this module, the RUM+ structure has been developed in this research. The input features to this module enter two different flows. In one flow, once the dimensions of the feature map are doubled through bilinear upsampling, a residual block is applied. In the other flow, opposite to the previous case, a residual block is applied first, and then the dimensions of the feature space are doubled by the bilinear upsampling layer. Eventually, the results of these two flows are averaged. In the RUM, convolutional operations with 3×3 kernels are performed in the residual block after batch normalization (BN) [61] and the ReLU activation function [62]. (In Fig. 2, "BN" represents the batch normalization layer, "ReLU" the activation function, "Conv" the 3×3 convolution layer, and ⊕ the summation operation; "UP" represents the upsampling layer that doubles the image dimensions, and "A" the averaging operation.) In the last layer of the proposed model, the predicted building map is calculated as follows:

P_rgb = sig(conv²_1×1(D_1))    (2)

In the above equation, P_rgb is the predicted map of building objects, and conv²_1×1(.) reduces the dimensions of the feature channels.
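The two-flow design of the RUM+ module can be sketched in PyTorch as follows. This is a minimal illustrative reading of the description above, not the authors' code: the exact layer counts inside the residual block and the channel handling are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResBlock(nn.Module):
    # Pre-activation residual block: BN -> ReLU -> 3x3 Conv (twice) plus identity skip.
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(ch), nn.ReLU(inplace=True), nn.Conv2d(ch, ch, 3, padding=1),
            nn.BatchNorm2d(ch), nn.ReLU(inplace=True), nn.Conv2d(ch, ch, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)

class RUMPlus(nn.Module):
    # Hypothetical sketch of RUM+: one flow upsamples then applies a residual
    # block, the other applies the residual block first and then upsamples;
    # the two flows are averaged, as described for Fig. 2(b).
    def __init__(self, ch):
        super().__init__()
        self.res_a = ResBlock(ch)
        self.res_b = ResBlock(ch)

    def forward(self, x):
        up = lambda t: F.interpolate(t, scale_factor=2, mode="bilinear",
                                     align_corners=False)
        a = self.res_a(up(x))   # flow 1: upsample, then residual block
        b = up(self.res_b(x))   # flow 2: residual block, then upsample
        return (a + b) / 2      # average the two flows
```

Either flow alone would double the spatial dimensions; averaging them is what distinguishes RUM+ from the plain RUM in this reading.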
Also, the kernel size is one, and the number of output channels is 2, corresponding to the building class and the background. The sigmoid function, represented by sig(.), is adopted to produce the estimation map for buildings. Once the estimation map is obtained, the network is trained in a supervised mode with ground-truth images using the binary cross-entropy (BCE) loss function (L_BCE) and the proposed boundary loss function (L_boundary):

L(iter) = α L_BCE + β(iter) L_boundary    (3)

In the above equation, α and β are the balancing coefficients of the loss functions L_BCE and L_boundary, and "iter" denotes the number of training iterations. The value of β is kept at zero until the iteration count reaches half of the total number of iterations N; the remaining iterations of the training step continue with a balance between α and β. Details of the implementation and the values of these parameters are given in Section IV-D.
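The combined objective can be sketched as below. This is a minimal sketch of our reading of (3); `boundary_loss` is a stand-in for the proposed L_boundary (detailed in Section III-B) and is purely a placeholder here.

```python
import torch
import torch.nn.functional as F

def combined_loss(pred, target, alpha, beta, boundary_loss=None):
    """Sketch of the total loss: alpha * L_BCE + beta * L_boundary.

    `boundary_loss` is a hypothetical callable standing in for the
    proposed boundary term; when beta is zero it is skipped entirely,
    matching the first half of training.
    """
    l_bce = F.binary_cross_entropy(pred, target)
    if beta == 0.0 or boundary_loss is None:
        return alpha * l_bce
    return alpha * l_bce + beta * boundary_loss(pred, target)
```

During the first half of training, calling it with `beta=0.0` reduces to plain BCE, which matches the schedule described above.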

B. Boundary Loss Function
The proposed boundary loss function is calculated from the ground-truth images and the maps predicted by the segmentation network [16]. After the predicted maps are obtained from the sigmoid function, a binary image is computed from the estimated map using a differentiable approximation of the binarization function. Next, the boundary areas of the binary images are extracted for both the ground-truth and the predicted maps. Then, DTIs are obtained for both boundary images. In this research, inspired by the method presented in [16], the DTIs are calculated using the morphological dilation method. Finally, the calculations required to determine the boundary loss function (L_boundary) are discussed.
1) Calculation of the Boundary Map: Assume that P_rgb is the map predicted by the segmentation network. This section introduces a function that is responsible for the binarization of P_rgb during network training. An essential property of this function is that it is differentiable; otherwise, error backpropagation would not be possible in the network optimization step. The plot of this function over the range 0 to 1 is shown in Fig. 3(a). This function is calculated as follows:

B_Pred = 1 / (1 + exp(−γ (P_rgb − T)))    (4)

where γ and T are constants of the binarization function, set to 15 and 0.5 in the present study. The input of this function is the map predicted by the segmentation network, P_rgb, and its output is the predicted binary image, B_Pred [see Fig. 3(b)]. The Sobel operator [63] is the inspiration for extracting the boundary areas of the ground-truth and predicted binary images. To this end, convolutional kernels with constant parameters are used along the length and width of the image:

S_x = [[−1, 0, 1], [−2, 0, 2], [−1, 0, 1]],  S_y = S_x^T    (5)

If the ground-truth image and the image estimated by the classifier network are denoted B_GT and B_Pred, their boundary maps are found as follows:

Ψ_gt = |B_GT * S_x| + |B_GT * S_y|    (6)
Ψ_pred = |B_Pred * S_x| + |B_Pred * S_y|    (7)

In (6) and (7), * is the standard convolution operation, and |.| denotes the absolute value.
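The steps above can be sketched in NumPy. The sigmoid-shaped step is the form we assumed for (4); `conv2d` is a naive zero-padded convolution written out only for self-containment, and all names are illustrative.

```python
import numpy as np

def soft_binarize(p, gamma=15.0, t=0.5):
    # Differentiable soft threshold assumed for Eq. (4): values above the
    # threshold t are pushed toward 1, values below toward 0.
    return 1.0 / (1.0 + np.exp(-gamma * (p - t)))

def conv2d(img, k):
    # Naive zero-padded 3x3 "convolution" (actually correlation, which is
    # equivalent here because Sobel responses are taken in absolute value).
    h, w = img.shape
    pad = np.pad(img, 1)
    out = np.zeros((h, w))
    for i in range(3):
        for j in range(3):
            out += k[i, j] * pad[i:i + h, j:j + w]
    return out

SX = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float)

def boundary_map(b):
    # Eqs. (6)/(7): sum of absolute horizontal and vertical Sobel responses.
    return np.abs(conv2d(b, SX)) + np.abs(conv2d(b, SX.T))
```

On a binary square, the resulting map is nonzero only along the square's edges, which is exactly the boundary region the loss will focus on.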
2) Calculation of Distance Transform Image: Given the binary image B, its DTI (i.e., B_DTI) is given by the following:

B_DTI(x) = min over y ∈ ∂G of ‖x − y‖_2  for x ∈ G_in, and 0 otherwise    (8)

In the above equation, ‖x − y‖_2 is the Euclidean distance between pixels x and y. In the binary image, G_in and ∂G represent the foreground space (building areas), with pixel values of 1, and the boundary areas, with pixel values of 0. Based on this definition, the pixel values B_DTI(i, j) are always positive in building areas and zero in background areas, and B_DTI(i, j) increases when moving away from the building boundaries. To calculate the proposed boundary loss function, the DTIs of the boundary areas extracted from the ground truth and the predicted map must first be generated. However, in the DTIs calculated for these binary images, the pixel values decrease when moving away from the boundary areas, which poses challenges for the optimization of the deep network. To overcome this problem, the boundary areas with pixel values of 1 are treated as the background, with a value of zero, in the DTI calculation; conversely, the background areas take nonzero values. Based on this DTI, the boundary loss function (L_boundary) is introduced in a differentiable form during network training. In this method, the DTI is calculated in k steps based on the morphological dilation operator; for brevity, the proposed distance transform image is named k-DTI. Fig. 4 shows a flowchart of the k-DTI method; the details of its calculation are provided in the following. Parameter k is treated as a hyperparameter during network training. Taking the boundary image (I⁰_Mask), calculated in the previous section, as the input image, for k ≥ 1 the transformed image is obtained as follows:

I^k_Mask = Dilate(I^(k−1)_Mask)    (9)
I^k_DT = I^(k−1)_DT ⊕ I^k_Mask,  with I⁰_DT = I⁰_Mask    (10)
B_DTI = I^k_DT    (11)

In the above equations, Dilate(.) denotes the morphological dilation operation, ⊕ represents element-wise addition in tensor mode, and I^k_DT is the distance transform image obtained at step k.
Except for Dilate(.), all of the above functions are differentiable. To make the dilation differentiable during network training, a constant averaging kernel with dimensions 3×3 is used:

K = (1/9) [[1, 1, 1], [1, 1, 1], [1, 1, 1]]    (12)

After applying this kernel to the input binary image through the convolution operator, (4) is used to binarize the result. Equations (9) and (10) are repeated k times, for the previously selected value of k. Finally, by applying (11), the DTI is calculated.
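A toy NumPy sketch of this iterative scheme, under our reading of (9)–(11): a hard 3×3 dilation stands in for the differentiable average-kernel-plus-binarization pair, and the dilated masks are accumulated so that values decay with distance from the boundary. Function names and details are illustrative, not the paper's code.

```python
import numpy as np

def dilate3x3(mask):
    # Hard 3x3 morphological dilation (max filter). In the paper this is
    # approximated differentiably by convolving with the averaging kernel
    # of Eq. (12) and then applying the soft binarization of Eq. (4).
    h, w = mask.shape
    pad = np.pad(mask, 1)
    out = np.zeros((h, w))
    for i in range(3):
        for j in range(3):
            out = np.maximum(out, pad[i:i + h, j:j + w])
    return out

def k_dti(boundary, k):
    # Accumulate the boundary mask and its k successive dilations: a pixel
    # at Chebyshev distance d <= k from the boundary ends up with value
    # k - d + 1, so values are largest on the boundary and decay outward.
    mask = boundary.astype(float)
    acc = mask.copy()
    for _ in range(k):
        mask = dilate3x3(mask)
        acc = acc + mask
    return acc
```

Because every operation is a convolution, a pointwise nonlinearity, or an addition in the differentiable variant, the whole k-step transform can sit inside the training graph.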

3) Calculation of Boundary Loss Function:
The boundary loss function is obtained by measuring the distance between the boundary areas of the predicted map and the ground-truth image. Assume that Ψ_pred and Ψ_gt represent the boundary areas of the predicted map and the ground-truth image, respectively. Consider a point M on the boundary area Ψ_pred. The distance between point M and Ψ_gt is the shortest distance between M and the points of the ground-truth boundary area Ψ_gt, which is calculated as follows:

d(M, Ψ_gt) = min over N ∈ Ψ_gt of ‖M − N‖_2    (13)

Assume that Ω_gt and Ω_pred are the DTIs of Ψ_gt and Ψ_pred, respectively. According to the calculation of the DTIs, the distance between point M and Ψ_gt in (13) is equal to the value of Ω_gt at the location of point M:

d(M, Ψ_gt) = Ω_gt(M)    (14)

Therefore, the distance between the boundary areas of the estimated map and the ground-truth image is found based on the DTIs:

Dist(Ψ_pred, Ψ_gt) = (1/|Ω_pred|) Σ over M ∈ Ψ_pred of Ω_gt(M) + (1/|Ω_gt|) Σ over N ∈ Ψ_gt of Ω_pred(N)    (15)

The values |Ω_pred| and |Ω_gt| in the above equation denote the number of points in the boundary areas Ψ_pred and Ψ_gt. Equation (15) is the discrete version of the distance between the two boundary areas; it cannot be used in the optimization process because it is not differentiable. Based on the operations available in deep learning software packages, the continuous version of (15) has been developed as follows:

Dist(Ψ_pred, Ψ_gt) = GAP(Ψ_pred ⊙ Ω_gt) / (GAP(Ψ_pred) + ε) + GAP(Ψ_gt ⊙ Ω_pred) / (GAP(Ψ_gt) + ε)    (16)

In the above equation, ⊙ represents the element-wise multiplication of two tensors, GAP(.) is global average pooling, and ε is a normalizing value that avoids division by zero. Assuming that the total number of training samples is Q, the boundary loss function is computed as follows:

L_boundary = (1/Q) Σ over q = 1, …, Q of Dist(Ψ^q_pred, Ψ^q_gt)    (17)
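Under the reconstruction of (16) above, with GAP realized as a mean and ⊙ as element-wise multiplication, the per-sample boundary distance can be sketched as:

```python
import numpy as np

def boundary_distance(psi_pred, psi_gt, omega_pred, omega_gt, eps=1e-6):
    # Mean of the ground-truth DTI over the predicted boundary, plus the
    # symmetric term; eps guards against division by zero when a boundary
    # map is empty. This follows our reconstruction of Eq. (16), not
    # verified against the authors' implementation.
    d1 = (psi_pred * omega_gt).mean() / (psi_pred.mean() + eps)
    d2 = (psi_gt * omega_pred).mean() / (psi_gt.mean() + eps)
    return d1 + d2
```

With a standard DTI that is zero on the boundary itself, a perfectly predicted boundary yields a distance of zero, and any displacement of the prediction increases the value.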

IV. EXPERIMENTS

A. Dataset
Two aerial image datasets, namely the ISPRS Potsdam and Vaihingen datasets, were used to assess the proposed segmentation network and loss function. Both datasets were specially designed for developing algorithms related to segmentation and fusion methods. The first dataset covers part of the city of Potsdam, and the second corresponds to the city of Vaihingen, both in Germany.
The Potsdam dataset includes 38 orthophoto images with a spatial resolution of 5 cm and dimensions of 6000 × 6000 pixels, containing buildings of different sizes and appearances. Each image is available in one of two modes: red (R), green (G), and blue (B), or near-infrared (NIR), red (R), and green (G). The ground-truth images include six classes: impervious surfaces, buildings, low vegetation, trees, cars, and clutter. According to the information provided by the dataset provider, the training set consists of 24 images, while the evaluation data comprise the remaining images. The Vaihingen dataset is cropped from one large orthophoto image. Unlike the Potsdam images, the image dimensions in this dataset are not uniform; on average, they are 2500 × 2000 pixels. The spatial resolution of the images is 9 cm. Each image is presented in only one mode with three bands: near-infrared (NIR), red (R), and green (G). The classes in this dataset are the same as in the Potsdam dataset. Sixteen images were used as training data in the network training procedure, and the remaining images were used for testing, following the split provided by the dataset provider. In both datasets, the building class is assigned the value 1, and the other classes are labeled 0. Examples of images from these datasets are shown in Fig. 5.

B. Evaluation Criteria
The effectiveness of the proposed method is investigated by analyzing the evaluation criteria on the test dataset. As in previous studies, three typical criteria are used to analyze the accuracy of the extracted buildings: overall accuracy (OA), F-score, and intersection over union (IoU). The OA is the ratio of all correctly predicted pixels to the total number of pixels in the image:

OA = (TP + TN) / (TP + TN + FP + FN)    (18)

where TP stands for true positive predictions, TN for true negatives, FP for false positives, and FN for false negatives. Precision measures the accuracy of the minority class (the building class) as the fraction of TP among all positive predictions, while Recall measures the fraction of actual building pixels that were correctly predicted. The F-score is the harmonic mean of Precision and Recall:

Precision = TP / (TP + FP)    (19)
Recall = TP / (TP + FN)    (20)
F-score = 2 × Precision × Recall / (Precision + Recall)    (21)

The IoU is a criterion for determining how closely the prediction map and the ground truth overlap. It is defined as follows:

IoU = |P_pred ∩ P_gt| / |P_pred ∪ P_gt|    (22)

where P_pred is the set of predicted building pixels, P_gt is the set of ground-truth building pixels, and |.| denotes the number of members of a set. The intersection and union of the two sets are represented by ∩ and ∪, respectively.
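The three criteria above can be verified with a compact NumPy computation on binary maps:

```python
import numpy as np

def metrics(pred, gt):
    # pred, gt: binary arrays (1 = building). Computes OA, F-score, and IoU
    # from the TP/TN/FP/FN counts, as in Eqs. (18)-(22).
    tp = np.sum((pred == 1) & (gt == 1))
    tn = np.sum((pred == 0) & (gt == 0))
    fp = np.sum((pred == 1) & (gt == 0))
    fn = np.sum((pred == 0) & (gt == 1))
    oa = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)   # |P_pred ∩ P_gt| / |P_pred ∪ P_gt| in counts
    return oa, f1, iou
```

Note that IoU in terms of counts is TP / (TP + FP + FN), which is exactly the set-based definition of (22).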

C. Implementation of the Proposed Method
To start the network training process, the encoder part of the segmentation network is initialized with the weights of ResNet-50 trained on the ImageNet dataset. The weight parameters of the other parts of the network were initialized using the method presented in [58]. The proposed method is implemented with the PyTorch package [64], and all experiments were run on a single Tesla K80 graphics processing unit (GPU). Data augmentation was used to expand the input data and mitigate overfitting during training [65]; it involves randomly rotating images in 90° increments from 0° to 270° and randomly flipping images in the horizontal or vertical direction. Due to GPU memory limitations, the input images of the training set were cropped into 640 × 640 subimages. In each training dataset, 80% of the images were used to train the network, and the rest were used for validation. The AdaMax optimizer, a variant of the well-known Adam algorithm [66], was used to optimize the network weights, with a weight decay of 0.0009 and an initial learning rate of 0.002. During training, the learning rate decays by multiplying its value by (1 − iter/max_iter)^power, following a "poly" schedule [67]; "power" equals 0.35, and "max_iter" is the total number of epochs multiplied by the number of training samples in each dataset. The number of training epochs was set to 60. Up to epoch 30, the coefficient of the proposed boundary loss function is set to zero, and this function is not evaluated during training; from epoch 30 to 60, the boundary loss function is taken into account in the training calculations. The next section describes how this loss function is weighted.
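The "poly" decay described above is a one-liner; the values here (base rate 0.002, power 0.35) follow the text.

```python
def poly_lr(base_lr, it, max_iter, power=0.35):
    # "Poly" schedule: lr = base_lr * (1 - it / max_iter) ** power.
    # Decays smoothly from base_lr at it = 0 down to 0 at it = max_iter.
    return base_lr * (1.0 - it / max_iter) ** power
```

In practice the returned value would be assigned to each optimizer parameter group before the corresponding update step.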
At each training stage, the network weight parameters associated with the best validation values are stored. In the test process, as in the training phase, the input images are first cropped. No data augmentation is needed at this stage, and the final results are obtained by fusing the predictions of the cropped subimages and averaging their overlapping areas.

D. Evaluation of the Proposed Boundary Loss Function
In the first epochs of model training, the values obtained from the final layer of the segmentation network become close to zero and lead to foreground pixels with zero values. Hence, it is not possible to train the network by relying merely on the boundary loss function; in this case, the error values do not converge. In other words, the role of the boundary loss function, as an additional loss function, is to help improve segmentation results in boundary areas by giving them larger weights. According to (3), only the distribution-based binary cross-entropy loss (L_BCE) was used up to the 30th epoch to avoid nonconvergence during the 60 epochs of training. From the 30th to the 60th epoch, training continues with the proposed boundary loss function merged with the binary cross-entropy loss. To set the parameters α and β of (3), three strategies are considered. For this purpose, the value of parameter k in the DTI calculation was first set to one. In the first strategy, α is set to one, and β is kept at a constant value throughout training. In the second strategy, α is the same as in the first strategy, and β increases at a constant rate with each training epoch: β starts at 0.05, and the rate of increase is adjusted so that its value in the last epoch equals the constant α value of one. In the third strategy, α is set to (1 − β); as in the second strategy, β starts at 0.05 and increases during training. In the first epochs, more weight is thus assigned to the L_BCE function, and in later epochs this weight is reduced.
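The third strategy, as we read it, can be sketched as follows: β is off for the first half of training, then ramps linearly from 0.05 with α = 1 − β, capped so that β never exceeds α. The cap value and the linear ramp shape are assumptions on our part.

```python
def loss_weights(epoch, n_epochs, beta0=0.05, beta_max=0.5):
    # Sketch of weighting strategy 3: alpha = 1 - beta; beta is 0 for the
    # first half of training, then ramps linearly from beta0 to beta_max
    # over the second half, so that beta never exceeds alpha.
    half = n_epochs // 2
    if epoch < half:
        return 1.0, 0.0
    frac = (epoch - half) / max(n_epochs - 1 - half, 1)
    beta = beta0 + (beta_max - beta0) * frac
    return 1.0 - beta, beta
```

With `n_epochs=60`, the weights switch on at epoch 30 at (0.95, 0.05) and finish at (0.5, 0.5), which mirrors the shift of emphasis from L_BCE to L_boundary described above.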
The increase in β is scheduled so that its value does not exceed the α value during training. Table I reports the network training results for the proposed boundary loss function under the different strategies. The results show an improvement in the evaluation criteria compared to training with the L_BCE function alone. In strategy 1, the constant β value is increased, and the best result is obtained at β = 1. By comparison, each of the two scheduling strategies (strategies 2 and 3) produces better outcomes than any constant β (strategy 1) without requiring extensive testing. The best result is obtained with strategy 3, which is therefore used in the subsequent tests.
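The three weighting schedules can be sketched as below. The helper name is ours, and the ramp endpoints are our reading of the description above: 1.0 for strategy 2 (β equals α in the last iteration), and 0.5 for strategy 3 so that β never exceeds α = 1 − β.

```python
def loss_weights(epoch, strategy, warmup=30, total=60, beta0=0.05):
    """Return (alpha, beta) for the combined loss
    L = alpha * L_BCE + beta * L_boundary at a given training epoch.

    Up to `warmup` epochs only L_BCE is active (beta = 0); afterwards
    beta follows one of the three strategies described in the text.
    """
    if epoch < warmup:
        return 1.0, 0.0
    # endpoint of the linear ramp over the boundary-loss epochs
    end = {1: beta0, 2: 1.0, 3: 0.5}[strategy]
    t = (epoch - warmup) / max(total - 1 - warmup, 1)
    beta = beta0 + (end - beta0) * t
    alpha = (1.0 - beta) if strategy == 3 else 1.0
    return alpha, beta
```

In strategy 3 the BCE term starts with weight 0.95 and hands over influence to the boundary term gradually, matching the intent that early iterations favor L_BCE.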
As mentioned in Section III-B2, the value of the boundary loss function depends on the parameter k. Hence, to analyze the effect of this loss function on the building extraction results, the parameter k is varied between 1 and 6. Fig. 6 illustrates the IoU and OA values during the network training phase, calculated between the 30th and 60th iterations. As shown, the best performance on the network evaluation data during training is obtained for k = 3. Table II shows the quantitative results on the test images for choosing the best value of k: the results of merging the L_BCE and L_boundary functions are reported for different values of k in network training. Comparing these results, one can conclude that using the L_boundary function improves the building extraction results; moreover, the effect of different values of k can be analyzed. In this test, k = 3 yields the best accuracy in the evaluation metrics for the two datasets: the IoU metric for the building class increases by 0.57% and 0.39% for the ISPRS Potsdam and Vaihingen datasets, respectively. Also, as k increases beyond 3 (k > 3), the contribution of the L_boundary term decreases. The experimental results in Fig. 7 demonstrate that our proposed boundary loss function extracts more accurate building boundaries than the method based on the BCE loss function alone. One of the main reasons is that the number of background pixels in high spatial resolution remote sensing images is typically much larger than the number of building pixels, so training with a distribution-based loss function alone tends to fall into a local optimum in the building extraction process. As a result, the extraction outputs are biased toward the background, and building components are missed.
Therefore, we add learnable boundary information by introducing the DTI to jointly optimize the building outline. Our proposed method differentiates the boundaries and shapes of buildings from the background more accurately.
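For intuition, a k-step, dilation-based distance image can be sketched as follows. This is only one plausible reading of the DTI construction (the exact derivative formulation is given in the paper's Section III-B2); each dilation is written as a 3×3 sliding max, the max-pooling view that keeps the operation usable inside a differentiable network.

```python
import numpy as np

def dilate3x3(mask):
    """One binary dilation step with a 3x3 structuring element,
    expressed as a sliding max over the 8-neighborhood (equivalent
    to a stride-1 max pooling with kernel 3 and padding 1)."""
    h, w = mask.shape
    p = np.pad(mask, 1)  # zero padding
    shifts = [p[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
              for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    return np.maximum.reduce(shifts)

def truncated_dti(mask, k):
    """Chamfer-style distance image truncated at k: pixels inside the
    mask get 0, pixels i dilation steps away get i, and pixels farther
    than k steps saturate at k.  Larger k therefore widens the band of
    boundary pixels that receive nonzero weight."""
    dti = np.zeros_like(mask, dtype=np.float64)
    cur = mask.astype(np.float64)
    for _ in range(k):
        dti += 1.0 - cur        # +1 for each step a pixel stays uncovered
        cur = dilate3x3(cur)
    return dti
```

Applying the same construction to both the ground-truth and the predicted binary masks gives the two DTIs from which the boundary loss is formed; the saturation at k explains why the boundary term's influence flattens out for large k.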

A. Comparison With the State-of-the-Art Loss Functions
The proposed boundary loss function is compared with its counterparts in the literature. Accordingly, boundary loss functions based on different approaches, such as Hausdorff loss [56], Surface loss [57], and RS-Boundary loss [55], are used. Both Hausdorff loss and Surface loss incorporate the DTI of the ground truth. Hausdorff loss weights the softmax prediction outputs using not only the ground-truth DTI but also the DTI of the predicted segmentation, whereas Surface loss assigns weights to the softmax probability outputs based on the ground-truth DTI alone. RS-Boundary loss is computed from a differentiable metric of boundary detection. The loss functions mentioned in this section were implemented according to the authors' suggestions, and for fairness, the same training and test data were used in all experiments. Table III presents the comparison outcomes, in which the proposed boundary loss function performs best. The potential reason may be that building extraction from high spatial resolution aerial or satellite images is much more challenging than the segmentation problems common in the computer vision community: buildings vary in location, shape, and size, whereas these characteristics are relatively fixed in most computer vision semantic segmentation tasks.
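As background, the two DTI-weighted competitors can be sketched as follows. The exact forms are our reading of the cited works [56], [57], not this article's own formulation; the distance maps are assumed to be precomputed, and the function names are ours.

```python
import numpy as np

def hausdorff_style_loss(pred, gt, dt_gt, dt_pred, alpha=2.0):
    """Hausdorff-distance-style loss in the spirit of [56]: the squared
    prediction error is weighted by distance maps of BOTH the ground
    truth and the predicted segmentation, so mistakes far from either
    boundary are penalized most."""
    return float(np.mean((pred - gt) ** 2 * (dt_gt ** alpha + dt_pred ** alpha)))

def surface_style_loss(pred, signed_dt_gt):
    """Surface-loss-style term in the spirit of [57]: the softmax
    probabilities are weighted by the ground-truth (signed) distance
    map alone, with no dependence on the predicted DTI."""
    return float(np.mean(pred * signed_dt_gt))
```

A perfect prediction drives the Hausdorff-style term to zero regardless of the distance maps, which is what makes it usable as a training objective.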
To investigate the effect of the boundary loss functions, the prediction heat maps of some sample test images from both the ISPRS Vaihingen and Potsdam datasets are shown in Fig. 8.
Employing previously developed boundary loss functions in conjunction with the BCE loss function improves the qualitative results, and noise is reduced in the boundary areas of the extracted buildings. The areas indicated by yellow square boxes further illustrate the effects of using boundary loss functions. The comparison of the output prediction maps of the extracted buildings suggests that the maps obtained with our proposed loss function retain more semantic information and contain less noise in the boundary areas of the buildings.

B. Evaluation of the Segmentation Network
As previously mentioned in Section III-A, there is no limitation on using the L_boundary function in other segmentation networks; the proposed boundary loss function can be adopted in various segmentation networks for extracting the building area. In this section, the proposed segmentation network is compared to previously presented popular deep segmentation models: SegNet [20], PsPNet [68], Res-U-Net [45], and DeepLabV3+ [69]. These networks were selected for their popularity and the availability of their implementation code. For a fair comparison, all training settings are kept the same in all tests, and, according to Table II, k = 3 is chosen for the boundary loss function. Table IV lists the results of the comparison between the different segmentation networks.
As shown in Table IV, the proposed segmentation model achieves the best evaluation metrics on the Potsdam and Vaihingen datasets. Columns 1 to 3 of Table IV evaluate the computational efficiency, using the multiply-and-accumulate (MACs) volume of the networks as well as the training and testing throughput in images per second. The difference in computational efficiency between these methods is primarily due to the time cost of the training and testing processes, which is generally related to the complexity of the segmentation models. The proposed model reaches a MACs value of 83.42 and processes 1.73 and 0.163 images per second in the training and testing phases, respectively, which indicates a suitable computational speed compared to the other models except PsPNet. It is worth noting that PsPNet has only one convolution block in its decoder, which reduces the computation volume but also the accuracy of the evaluation results; accordingly, the weakest segmentation results were obtained by this model, with IoU values of 84.6% and 89.65% for the Vaihingen and Potsdam datasets, respectively. The evaluation criteria of the popular DeepLabV3+ and the model proposed in this research differ only slightly; however, DeepLabV3+ has higher MACs (342.73) than the other models due to its complex encoder and decoder structure. Like our segmentation model, Res-U-Net uses residual blocks in its encoder architecture; our model, however, is faster and more efficient than Res-U-Net, owing to the proposed RUM+ block, which uses two residual units in its structure for accurate upsampling. The role of the decoder structure in increasing the accuracy of building extraction was also investigated. The literature has paid little attention to the decoder parts of the networks, and shallow convolution layers are generally used to restore the lost features.
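For context on the MACs figures, the per-layer count of a standard 2-D convolution follows a simple formula; the layer dimensions below are hypothetical, chosen only to illustrate the scale of the numbers in Table IV.

```python
def conv2d_macs(c_in, c_out, k_h, k_w, h_out, w_out):
    """Multiply-accumulate count of one 2-D convolution layer:
    each of the c_out * h_out * w_out output elements costs
    k_h * k_w * c_in multiply-accumulates."""
    return k_h * k_w * c_in * c_out * h_out * w_out

# hypothetical example: a 3x3 conv, 64 -> 64 channels, 256x256 output
# costs roughly 2.4 GMACs; a full network's figure is the sum over layers
example = conv2d_macs(64, 64, 3, 3, 256, 256)
```

Summing this quantity over all layers (plus the cheaper pointwise operations) is how profiling tools arrive at whole-network MACs totals of the kind reported in Table IV.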
The design of the proposed RUM+ in this article enhances the restoration of the deep features produced in the encoder part of the network. According to Fig. 2, the segmentation model is tested with three decoders to evaluate the ability of the proposed RUM+. The evaluation criteria for the ISPRS Vaihingen and Potsdam datasets are compared in Fig. 9. The experiments demonstrate that the model with RUM+ (Model-RUM+) outperforms the models that employ only conventional convolution (Model-CD) or a simple RUM (Model-RUM) in the decoder.
The building extraction results for some test images from the ISPRS Potsdam and Vaihingen datasets are shown in Fig. 10. Our proposed segmentation network outperforms the existing advanced methods and successfully handles the inadequate extraction of buildings with complicated shapes and sizes. According to the qualitative results, the building extraction results in the boundary areas show reduced noise in all models. This means that network training based on coded boundary information (i.e., the DTI), generated using the last layer of semantic features, can improve the robustness of building extraction in edge parts. On the other hand, the building extraction results in Fig. 10 still show misclassification in some buildings. The causes of these errors fall into two categories. The first is errors in the training datasets, including incorrect building labeling, noise and distortion in the satellite or aerial images, and mismatches between the ground-truth and color images. Manually correcting these problems is not recommended, because it would artificially reduce the uncertainty inherent in the building extraction problem. The second is related to the structure of hourglass deep learning models. In these models, which mainly consist of an encoder and a decoder, the dimensions of the input images are reduced to extract the feature space and to overcome computational limitations, which can cause the loss of important information about building parts. Moreover, traditional methods of semantic feature recovery in the decoder part cannot optimally retrieve the buildings, which leads to the extraction of incomplete buildings. In this research, our segmentation method produced a smaller number of FN (pink) and FP (cyan) pixels in the building extraction results than the other methods. It can be concluded that the performance improvement of the proposed segmentation model is largely attributable to the proposed RUM+.

VI. CONCLUSION
In this article, a new boundary loss function was introduced to enhance the extraction of building areas in high-resolution aerial and remote sensing images. The boundary loss function is calculated from the binary predicted image of the segmentation network and the ground-truth image, using DTIs that remain differentiable throughout the network training phase. The DTI is calculated based on the proposed derivative morphological dilation in k steps. Since building extraction from VHR images is characterized by an unbalanced class distribution, integrating the proposed boundary loss with a distribution-based loss function can improve the final result. The proposed approach was tested on the ISPRS Potsdam and Vaihingen datasets. The implementation results show that the proposed approach enhances the building extraction results, particularly in the boundary areas. The best result was obtained when the parameter k was set to 3 in the boundary loss calculation; in this case, the IoU metric improved by 0.57% and 0.39% for the Potsdam and Vaihingen datasets, respectively. In addition, another advantage of the proposed loss function is that it does not depend on the segmentation network and can be used in a wide range of DCNNs.
In future research, the proposed approach might be extended to the multiclass semantic segmentation of urban areas. In this study, only VHR orthophoto images were used to classify buildings; however, height data may be useful in future studies to ensure accurate building extraction. In other words, high-resolution 3-D data, such as aerial 3-D point clouds or digital surface models, could improve the discrimination between the classes identified in this study.