BSSNet: Building Subclass Segmentation From Satellite Images Using Boundary Guidance and Contrastive Learning

Building subclass segmentation, which aims to predict building classes (high-rise zone, low-rise zone, single high-rise, and single low-rise) from satellite images, benefits numerous applications, including human geography, urban planning, and humanitarian aid. However, problems such as complex scenes and similar characteristics across building categories make it difficult for general models to balance localization and classification accuracy. Therefore, this article proposes a novel network called the building subclass segmentation network (BSSNet), which uses two subnetworks to divide and conquer the problem. The first, the localization network, guides building locations through binary building segmentation. Its spatial gradient fusion module improves the binary segmentation result by supervising the spatial gradient map of the prediction. The second is a classification network, which predicts building subclasses. Its intermediate features are optimized with a contrastive learning loss to improve feature consistency. Finally, the predictions of the two networks are combined to obtain the final result. Experimental results demonstrate that BSSNet achieves significant improvements on the Hainan dataset we produced and on the xBD dataset. In particular, BSSNet achieves the best performance among the compared methods on the Hainan dataset.


I. INTRODUCTION
BUILDING segmentation is widely studied in the field of remote sensing. Most studies [1], [2], [3], [4] focus on binary building segmentation (whether a pixel is a building). Still, in many applications users need building subclass information (what type of building a pixel belongs to). As a meaningful extension of building segmentation, however, automatic segmentation of building subclasses has rarely been studied. As shown in Fig. 1, building subclass data provide information such as building location and category, which can be of great help to many fields, including human geography [5], urban planning [6], and humanitarian aid [7]. But most of the building subclass data used in these fields come from manual labeling, which is slow, costly, and laborious. Accurate and efficient automatic segmentation of building subclasses would therefore benefit these fields. However, problems such as within-class feature variation and between-class feature similarity make it difficult for general semantic segmentation networks to maintain localization and classification accuracy simultaneously. Existing studies on building subclass segmentation either combine images from different angles [8] or incorporate shadow detection with high-resolution images [9]. Damage assessment [10], [11], [12], which classifies the damage level of buildings by using pre- and postdisaster images, is also a branch of building subclass segmentation. However, few studies focus on using a single image to segment buildings into subclasses. Motivated by these circumstances, we study a method specifically for building subclass segmentation, which classifies pixels into five classes: the four classes shown in Fig. 1 and background.
Our method can fill the gap in building subclass segmentation and quickly provide a large amount of accurate data for fields that need building subclass information for analysis.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Generally, experts identify the building class by the arrangement, the density, the shape, and the texture of buildings. Nevertheless, in automatic segmentation of building subclasses, the complexity of the task makes it difficult for a single network to ensure accurate classification and localization simultaneously. When the problem is decomposed, one network will focus on only one problem and learn more effectively. Therefore, we adopt a divide-and-conquer approach by using two networks to perform different tasks separately and combining them in terms of features and predicted results. Specifically, we divide the task into the following two parts: 1) binary building extraction; 2) building subclass extraction. The purpose of the division is to disassemble the task so that only the accuracy of the building localization is concerned in the first task, not the multiclass extraction accuracy, and vice versa. We add the feature fusion module (FFM) to reinforce the link between these two models. The boundary information can make the localization more accurate, so we propose the spatial gradient fusion (SGF) module to improve the boundary by refining the spatial gradient map. The subtle difference between building classes is also one of the reasons why segmenting building subclasses is difficult. Therefore, we introduce the contrastive learning loss to improve the representation of features in the building subclass segmentation network (BSSNet). Lastly, predictions of the two tasks are merged by intersection.
Our main contributions can be summarized as follows.
1) We propose a novel BSSNet that has two subnetworks for building subclass segmentation, combining binary building segmentation and multiclass building segmentation.
2) In BSSNet, an SGF module is proposed to refine the boundaries of binary building segmentation, while a pixel contrastive learning loss is introduced to enhance the representation of features in multiclass building segmentation.
3) The Hainan building subclass dataset that we propose enriches the datasets available for building subclass segmentation and can support research on subclass object extraction or fine-grained land cover classification from remote sensing images. The dataset can be accessed at the address given in footnote 1.

A. Building Segmentation
Building segmentation is one of the most popular research focuses in remote sensing information extraction. Novel machine learning and remote sensing technologies have allowed automatic building segmentation, reducing manual work in recent years. Nevertheless, building segmentation remains a long-term challenge in remote sensing because of buildings' complex appearance in complicated environments.
Traditional building segmentation of aerial and remote sensing imagery typically relies on hand-crafted features, such as color [13], texture [14], edge [15], [16], and spectrum [14], [17]. However, these features may vary significantly with lighting, shooting angle, and sensor. With the development of convolutional neural networks (CNNs), deep-learning-based methods have been broadly used to segment buildings in remote sensing images [1], [2], [3], [4]. With multilayer convolution, a CNN can obtain multiscale features that are more robust than hand-crafted ones. Yuan et al. [2] integrated features from multiple scales and combined the building boundaries to improve the performance of building segmentation. Maggiori et al. [3] designed a new architecture and a two-step training approach to address the problem of inaccurate training data. To acquire precise building boundaries, Bischke et al. [4] proposed a multitask network that predicts segmentation and distance masks simultaneously.

B. Building Subclass Segmentation
Although the automatic recognition of building subclasses is meaningful to urban planning [6], humanitarian aid [7], and other fields [5], few studies focus on building subclass segmentation. Peng et al. [8] detect built-up areas by using stereo imagery that incorporates height information. Taoufiq et al. [18] and Huang et al. [19] focus on building subclass classification. Sirmacek et al. [9] incorporate shadow detection with high-resolution images. However, no current method segments building subclasses from a single optical remote sensing image.
xBD [10] presents a task to assess building damage level, which can be considered an extension of building subclass segmentation. Various methods have been proposed to evaluate building damage [11], [12], which use a two-stream CNN architecture for pre- and postdisaster images. Nonetheless, building damage assessment relies on sequential images, assessing the damage level by comparing images before and after a disaster.
We extend building segmentation from binary to subclass, as shown in Fig. 1. We propose a two-stream end-to-end network. One stream performs multiclass segmentation, and the other predicts binary building locations to provide localization guidance.
CASENet [27] introduced a new task called semantic boundary detection, which aims at finding category-aware boundaries. Cheng et al. [21] employed boundary detection in a multitask network to improve the result of object segmentation. Zhen et al. [20] combined semantic boundary detection and semantic segmentation using the spatial gradient to improve boundary pixel accuracy.
In the field of remote sensing, edge detection is also widely used to improve the effect of building detection. Jung et al. [22] adopted HED [26] and combined the boundary and segmentation mask to obtain an enhanced segmentation result. To improve building extraction, He et al. [24] embedded the boundary detection task into their framework by using spatial variation fusion to couple these two tasks.
Our method follows the idea of combining boundary and segmentation in a multitask way to enhance the accuracy of building localization. Instead of concatenating features or using postprocessing, we concatenate the spatial gradient of the segmentation with the boundary prediction to improve the mask boundary efficiently.

D. Contrastive Learning
Contrastive learning is one category of self-supervised learning [28], whose core goal is to discover discriminative representations. Another category of self-supervised learning is generative learning [29], [30], [31], [32], whose primary purpose is to generate feature vectors that retain the essential parts of the original data and can reconstruct it. Contrastive learning approaches representation learning from a different angle: learning to compare [28]. In this way, contrastive learning avoids pixel-level learning and is more stable. Through noise contrastive estimation [33], contrastive methods learn meaningful representations by attracting positive pairs and repulsing negative pairs. Recently, many methods have focused on constructing positive and negative sets [34], [35], [36], [37]. Hadsell et al. [35] first regarded contrastive learning as a dictionary lookup. He et al. [34] developed this method by building a dynamic dictionary with a queue and a moving-averaged encoder. Khosla et al. [38] extended the self-supervised batch contrastive approach to a fully supervised learning task, allowing label information to be leveraged effectively. Normalized embeddings from the same class are drawn closer together than those from other classes. The latest works address contrastive learning in dense image prediction [39], [40], [41]. Wang et al. [39] implemented supervised contrastive learning at the pixel level for semantic segmentation.
There are also some methods that use contrastive learning in remote sensing tasks [42], [43], [44]. For instance, contrastive learning has been used in hyperspectral image (HSI) classification to address the small-sample problem of HSIs. It has also been adopted in synthetic aperture radar image classification to overcome insufficient labeled data [45], [46]. However, few methods apply contrastive learning to remote sensing image segmentation.
Given the complexity of remote sensing image scenes, even objects in the same class may differ vastly in their embeddings, making the application of semantic segmentation in remote sensing images difficult. Hence, we employ contrastive learning to gather clusters of pixel embeddings belonging to the same category while pushing apart different categories' embeddings.

III. PROPOSED METHOD
A. Network Architecture
1) Architecture Overview: Fig. 2 gives an overview of BSSNet, which consists of the following two parts: the classification network (the upper one) and the localization network (the lower one). Each of the two networks uses its own HRNet [47] backbone. HRNet can be divided into four stages according to the number of branches and resolutions. Stage n includes n branches corresponding to n resolutions. For ease of presentation, we simplify each stage to its number in Fig. 2 without showing its details.
The localization network predicts the binary mask of building objects from images, so that building objects are predicted with complete extents and well-defined shapes. The localization network first concatenates feature maps from the HRNet backbone and feeds them into the predictor, which uses a 3×3 convolution (3×3 Conv) followed by a batch normalization (BN) layer and ReLU to reduce the feature dimension to 256; a 1×1 convolution then produces the mask predictions. Although the predicted masks provide relatively accurate building locations, they can still be rough and fuzzy because boundary information is ignored.
This issue can be largely alleviated by exploiting building boundaries to improve localization. Therefore, to utilize boundary information, we propose a boundary-predicting head. It uses the same predictor as the binary building mask head and takes the boundary of the binary ground truth as its ground truth. However, simply adding a boundary-predicting head cannot effectively pass boundary information to the mask predictions. Thus, we propose the SGF module, which combines the spatial gradient of the mask predictions with the predictions of the boundary-predicting head to obtain the final boundary predictions. It is explained in detail later.
Likewise, the classification network employs the same predictor with a different number of classes to generate the building subclass segmentation prediction. After the backbone network, we add a projection head. The projection head outputs 256-dimensional features, which are used in the contrastive learning loss. The role of the contrastive learning loss is to cluster features from the same class and scatter features from different classes.
In addition, we fuse two features of each HRNet in the same resolution with a simple FFM. The FFM consists of concatenation and two convolution blocks (3×3 Conv+BN+ReLU). It is a simple yet effective module to exchange information between the localization network and the classification network.
Finally, the binary building prediction and the building subclass prediction are combined to get the final prediction. We simply combine the intersection of nonbackground parts of the two predictions, and take the predicted values of building subclasses corresponding to these nonbackground pixels as the final prediction result.
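As a minimal sketch of this merging step (an illustration under stated assumptions, not the authors' implementation), the snippet below keeps a subclass label only where both predictions are nonbackground. The function name `merge_predictions` and the convention that label 0 denotes background are assumptions made for the example.

```python
def merge_predictions(binary_mask, subclass_map, background=0):
    """Keep a subclass label only where BOTH predictions are nonbackground.

    binary_mask:  2-D list, 1 = building, 0 = background (localization network).
    subclass_map: 2-D list of subclass labels, 0 = background (classification network).
    The label convention (0 = background) is an assumption for this sketch.
    """
    merged = []
    for mask_row, cls_row in zip(binary_mask, subclass_map):
        merged.append([
            cls if (m != background and cls != background) else background
            for m, cls in zip(mask_row, cls_row)
        ])
    return merged


binary_mask = [[0, 1, 1],
               [0, 1, 0]]
subclass_map = [[2, 2, 0],
                [0, 3, 3]]
merged = merge_predictions(binary_mask, subclass_map)
print(merged)
```

A pixel that only one of the two networks marks as a building falls back to background, so each network can veto the other's false positives.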
2) SGF Module: An accurate boundary is essential to the quality of binary building segmentation because it allows neighboring buildings to be separated effectively. Recently, most methods that use boundary information to improve segmentation have added boundary-predicting branches. However, these methods do not use the boundary ground truth to supervise the binary building segmentation itself, which makes their use of boundary information ineffective. Therefore, we propose to combine the output of the boundary prediction branch with the spatial gradient of the binary segmentation result to obtain the final boundary prediction, so that boundary information is learned simply and directly.
From the boundary-predicting head and the mask-predicting head of the localization network, we generate the boundary probability map $B \in \mathbb{R}^{H \times W \times 1}$ and the mask probability map $M \in \mathbb{R}^{H \times W \times 1}$, respectively. The mask boundary can then be obtained easily by deriving the spatial gradient. Here, we use adaptive pooling to derive the spatial gradient $\nabla M$, which is
$$\nabla M(i, j) = \left| M(i, j) - \mathrm{pool}_k\big(M(i, j)\big) \right| \tag{1}$$
where $i$ and $j$ are the coordinates of the mask probability prediction and $|\cdot|$ denotes the norm function. $\mathrm{pool}_k$ is an adaptive average pooling operation with kernel size $k$, which controls the width of the generated boundary. The default setting of $k$ is 3.
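The spatial gradient can be illustrated with a small pure-Python sketch. It assumes the gradient is the absolute difference between the mask probability map and its k×k average-pooled version (stride 1), and that windows are simply truncated at the image border; the border handling and the helper names are assumptions for this example rather than details given in the text.

```python
def avg_pool_same(prob, k=3):
    """k x k average pooling with stride 1; windows are truncated at the border
    (an assumed border policy) so the output keeps the input size."""
    h, w = len(prob), len(prob[0])
    r = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            window = [prob[y][x]
                      for y in range(max(0, i - r), min(h, i + r + 1))
                      for x in range(max(0, j - r), min(w, j + r + 1))]
            out[i][j] = sum(window) / len(window)
    return out


def spatial_gradient(prob, k=3):
    """|M - pool_k(M)| per pixel: large near mask edges, zero in flat regions."""
    pooled = avg_pool_same(prob, k)
    return [[abs(p - q) for p, q in zip(p_row, q_row)]
            for p_row, q_row in zip(prob, pooled)]


# A 5 x 5 mask probability map with a 3 x 3 "building" in the centre.
mask = [[1.0 if 1 <= i <= 3 and 1 <= j <= 3 else 0.0 for j in range(5)]
        for i in range(5)]
grad = spatial_gradient(mask, k=3)
for row in grad:
    print(["%.2f" % v for v in row])
```

In this toy example the response vanishes at the centre of the building and is nonzero along its outline, which is why a larger k widens the derived boundary.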
To supervise the mask boundary directly and efficiently, we concatenate the boundary probability map $B$ and the derived boundary map $\nabla M$. The concatenated map is fed into a convolution layer to obtain the final boundary map, which is compared with the ground truth in the boundary loss function. This process can be formulated as
$$b = \mathrm{conv}(B \oplus \nabla M) \tag{2}$$
where $\oplus$ is the concatenation operation, and conv is a simple convolution layer. The final boundary prediction map is $b$. In this way, we simultaneously supervise the boundary accuracy of the mask prediction and of the boundary prediction. Moreover, the resulting building outlines contain few impurities and tend to be continuous.

3) FFM: As shown in Fig. 3, FFM uses concatenation and two convolution blocks (3×3 Conv+BN+ReLU). The output features can be formulated as
$$\hat{X}^{s}_{loc} = f^{s}_{loc}\big(X^{s}_{loc} \oplus X^{s}_{cls}\big) \tag{3}$$
$$\hat{X}^{s}_{cls} = f^{s}_{cls}\big(X^{s}_{loc} \oplus X^{s}_{cls}\big) \tag{4}$$
where $\hat{X}^{s}_{loc}$ and $\hat{X}^{s}_{cls}$ are the localization feature and the classification feature after FFM in stage $s$, $X^{s}_{loc}$ and $X^{s}_{cls}$ denote the localization feature and the classification feature before FFM in stage $s$, $f^{s}_{loc}$ and $f^{s}_{cls}$ denote the two convolution blocks of FFM in stage $s$, and $\oplus$ is the concatenation operation.
FFM links two subnetworks to build bilateral information exchanges. This module makes the classification network focus more on areas predicted as buildings and lets the localization network be more robust to various constructions.

1) Pixel Cross-Entropy Loss: We obtain the logit prediction $y \in \mathbb{R}^{HW \times C}$, i.e., the unnormalized prediction vector from the last layer of the network. In the classical cross-entropy loss for semantic segmentation, $y$ is normalized using softmax and then multiplied with the one-hot ground-truth vector $\hat{y} \in \mathbb{R}^{HW \times C}$
$$L_{CE}(y, \hat{y}) = -\hat{y}^{T} \log(\mathrm{softmax}(y)). \tag{5}$$
However, it computes loss pixel by pixel, so it does not consider the relationship between pixels. This may result in different classes of pixels with very similar characteristics that are difficult to distinguish. Thus, we propose pixel contrastive loss to cluster pixels in the same class and push away pixels in different classes, as shown in Fig. 4.
2) Pixel Contrastive Loss: First, we introduce the InfoNCE loss used in unsupervised representation learning. Unsupervised representation learning aims to train an encoder that generates an effective embedding (feature vector) $v_I$ of image $I$. Contrastive learning is the current mainstream way to achieve this goal.
In contrastive learning, $v_I$ should be similar to the positive embedding $v_I^{+}$ (the feature vector of an augmented version of the same image $I$) and dissimilar to every embedding $v^{-}$ in the negative embedding set $N_I$ (feature vectors of other images). Driven by this motivation, InfoNCE is the commonly used contrastive learning loss function
$$L^{NCE}_{I} = -\log \frac{\exp(v_I \cdot v_I^{+} / \tau)}{\exp(v_I \cdot v_I^{+} / \tau) + \sum_{v^{-} \in N_I} \exp(v_I \cdot v^{-} / \tau)} \tag{6}$$
where $v^{-}$ is a negative embedding in $N_I$, $\cdot$ is the inner product operation, and $\tau$ is the temperature hyperparameter. We now extend the InfoNCE loss to the pixel level. Positive embeddings are then pixel embeddings of the same class, while negative embeddings are pixel embeddings of different classes. Our goal is to attract the positive embeddings and repulse the negative embeddings.
We denote the embedding of pixel $i$ as $v_i \in \mathbb{R}^{D}$, where $D$ is the dimension of the embedding. $P_i$ and $N_i$ are the sets of positive and negative pixel embeddings for pixel $i$, respectively; i.e., $P_i$ contains the embeddings of samples whose class is the same as that of pixel $i$, and $N_i$ contains those of samples from other classes. Accordingly, our pixel contrastive loss for pixel $i$ is defined as
$$L_{NCE} = \frac{1}{|P_i|} \sum_{v^{+} \in P_i} -\log \frac{\exp(v_i \cdot v^{+} / \tau)}{\exp(v_i \cdot v^{+} / \tau) + \sum_{v^{-} \in N_i} \exp(v_i \cdot v^{-} / \tau)}. \tag{7}$$
Note that the positive and negative samples come from the same batch as pixel $i$. Furthermore, all embeddings are normalized before being sent into the pixel contrastive loss.
Finally, we obtain the overall loss function of the classification network by adding $L_{CE}$ and $L_{NCE}$
$$L_{cls} = L_{CE} + \lambda_{NCE} L_{NCE} \tag{8}$$
where $\lambda_{NCE}$ is the weight that controls the importance of $L_{NCE}$.
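The pixel-level contrastive loss can be sketched in pure Python as follows. The sketch assumes an InfoNCE-style form in which, for each anchor pixel, every same-class embedding in the batch is a positive and every other-class embedding is a negative, and the per-anchor losses are averaged. The temperature value `tau=0.1`, the averaging over anchors, and all function names are assumptions made for illustration; they are not specified in the text.

```python
import math


def normalize(v):
    """L2-normalize an embedding (embeddings are normalized before the loss)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def pixel_contrastive_loss(embeddings, labels, tau=0.1):
    """Supervised pixel-level InfoNCE sketch; tau=0.1 is an assumed value."""
    emb = [normalize(v) for v in embeddings]
    total, anchors = 0.0, 0
    for i, (v_i, c_i) in enumerate(zip(emb, labels)):
        pos = [emb[j] for j in range(len(emb)) if j != i and labels[j] == c_i]
        neg = [emb[j] for j in range(len(emb)) if labels[j] != c_i]
        if not pos or not neg:
            continue  # an anchor needs at least one positive and one negative
        neg_sum = sum(math.exp(dot(v_i, v_n) / tau) for v_n in neg)
        loss_i = 0.0
        for v_p in pos:
            e_pos = math.exp(dot(v_i, v_p) / tau)
            loss_i += -math.log(e_pos / (e_pos + neg_sum))
        total += loss_i / len(pos)
        anchors += 1
    return total / anchors if anchors else 0.0


embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
clustered = pixel_contrastive_loss(embeddings, [0, 0, 1, 1])  # classes well separated
mixed = pixel_contrastive_loss(embeddings, [0, 1, 0, 1])      # classes interleaved
print(clustered, mixed)
```

The loss is small when same-class embeddings point in similar directions and large when nearby embeddings carry different labels, which is the behavior the classification network is trained to exploit.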

3) Boundary Loss: We view boundary prediction as a binary semantic segmentation problem, similar to the practice of joint boundary detection. We concatenate the prediction of the boundary-predicting head and the spatial gradient of the mask prediction to acquire the final boundary prediction. Supervising it with binary cross-entropy loss alone yields complete boundaries. However, because the proportion of boundary pixels varies from image to image, the response of the boundary prediction remains weak even if the positive sample weight is carefully set. Dice loss [48] avoids the difficulty of setting the positive sample weight by directly optimizing the F1 score, but because of its instability, the boundaries it produces are often incomplete. Therefore, we combine these two losses and exploit their complementarity to improve boundary prediction.
To generate soft boundaries from the ground truth of the binary mask, we utilize the Laplacian operator, a second-order gradient operator for generating boundaries. The generated soft boundary maps are converted to binary maps by a threshold value of 0. We utilize binary cross-entropy loss and Dice loss to improve the learning of boundaries. Dice loss calculates the ratio of overlap between prediction and ground truth, independent of the number of foreground/background pixels. We define the boundary loss $L_B$ as follows:
$$L_B(b, \hat{b}) = L_{BCE}(b, \hat{b}) + \lambda_{Dice} L_{Dice}(b, \hat{b}) \tag{9}$$
where $\lambda_{Dice}$ is the weight that controls the importance of $L_{Dice}$, and $b, \hat{b} \in \mathbb{R}^{H \times W}$ are the prediction and ground truth of the boundary, respectively. The Dice loss $L_{Dice}$ is given as follows:
$$L_{Dice}(b, \hat{b}) = 1 - \frac{2 \sum_i b_i \cdot \hat{b}_i + \epsilon}{\sum_i b_i + \sum_i \hat{b}_i + \epsilon} \tag{10}$$
where $b_i$ and $\hat{b}_i$ are the $i$th pixel in the boundary prediction and ground truth, respectively, and $\cdot$ is the inner product operation. $\epsilon$ is added to the numerator and denominator to avoid division by zero (default $\epsilon = 1$). This formula is similar to the F1 score at the pixel level.
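To make the combination of the two boundary losses concrete, the following pure-Python sketch adds a binary cross-entropy term and a Dice term on flattened boundary maps. The exact Dice variant is an assumption here (a soft Dice with unsquared sums and a smoothing constant of 1 in numerator and denominator), as are the clamping constant and the function names.

```python
import math


def bce_loss(pred, target, clamp=1e-7):
    """Mean binary cross-entropy over pixels; probabilities are clamped
    away from 0 and 1 to keep the logarithm finite (assumed constant)."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, clamp), 1.0 - clamp)
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(pred)


def dice_loss(pred, target, smooth=1.0):
    """Soft Dice loss with unsquared sums (an assumed variant)."""
    inter = sum(p * t for p, t in zip(pred, target))
    return 1.0 - (2.0 * inter + smooth) / (sum(pred) + sum(target) + smooth)


def boundary_loss(pred, target, lambda_dice=1.0):
    """BCE plus weighted Dice on flattened boundary maps."""
    return bce_loss(pred, target) + lambda_dice * dice_loss(pred, target)


target = [1.0, 0.0, 1.0, 0.0]                          # binary boundary ground truth
perfect = boundary_loss(target, target)                # prediction equals ground truth
wrong = boundary_loss([0.0, 1.0, 0.0, 1.0], target)    # every pixel mispredicted
print(perfect, wrong)
```

Because the Dice term depends on the overlap ratio rather than on pixel counts, it stays informative when boundary pixels are a tiny fraction of the image, while the cross-entropy term supplies a stable per-pixel gradient.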

4) Multitask Learning Loss: Our network accomplishes three tasks, namely binary building segmentation, building subclass segmentation, and building boundary detection. To train the network efficiently, we train it in a multitask learning manner, jointly optimizing the classification loss $L_{cls}$, the boundary loss $L_B$, and the localization loss $L_{loc}$. Note that $L_{loc}$ is the binary cross-entropy loss used in the binary building segmentation task.

A. Dataset and Evaluation Metric
1) Hainan Building Subclass Dataset:
As far as we know, few public datasets specifically for remote sensing building subclass segmentation are available. Thus, to facilitate the training of our proposed method, we construct a subclass dataset for buildings in Hainan Province, China. The Hainan dataset we presented compensates for the absence of building subclass segmentation datasets. We will continue to expand this dataset as the research progresses. Four building subclasses exist in this dataset: high-rise zone (HZ), low-rise zone (LZ), single high-rise (SH), single low-rise (SL), which are identified by experts from the Shanxi provincial mapping agency.
The dataset contains 42 images with resolutions ranging from 0.8 to 2 m per pixel and sizes ranging from 2000×2000 to 5000×6000 pixels. We crop the images into 512×512 patches. This gives a total of 1348 image patches, divided into 70% for training and 30% for testing; that is, a training set of 944 patches and a test set of 404 patches. The proportion of each category (ignoring background) is shown in Table I. The data are imbalanced, and the proportion of SL in the dataset is low. The reason is that the images are geographically concentrated in urban areas, where most low-rise buildings are clustered. To mitigate this imbalance, we set class weights in the cross-entropy loss [see (5)].
2) xBD Dataset: Since it is challenging to obtain datasets for building subclass segmentation, the building damage assessment dataset xBD [10] is employed to evaluate our proposed method. This publicly available, large-scale satellite image dataset for building damage level assessment addresses a task similar to ours. The difference is that the xBD dataset contains images before and after disasters, so the network should also account for the changes brought by disasters.
Although change information matters in the xBD dataset, building damage level is primarily evaluated using postdisaster images, whereas predisaster images are mainly used to locate buildings. In addition, building damage level can be viewed as a variation of building subclass. These characteristics are consistent with our proposed work, so we select the xBD dataset to evaluate its effectiveness. The dataset covers 19 diverse disasters in different locations (such as forest fires, earthquakes, floods, and hurricanes). It contains pre- and postdisaster image pairs of size 1024×1024. Each image is in the visible spectral bands (red, green, and blue) with a spatial resolution of 0.8 m. Four building damage levels exist: no damage, minor damage, major damage, and destroyed. Table II shows the number and distribution of samples in the dataset.
3) Evaluation Metric: To evaluate the performance of our method, we perform qualitative and quantitative analyses in our experiments. We use the F1 score ($F1_b$) to evaluate the results of binary building segmentation, and the harmonic mean of the per-class F1 scores ($F1_c$) to evaluate the effectiveness of building subclass segmentation. The metrics are defined as follows:
$$F1_b = \frac{2\,TP}{2\,TP + FP + FN}$$
$$F1_c = \frac{n}{\sum_{i=1}^{n} 1 / F1_{c_i}}$$
where TP, FP, and FN are the numbers of true-positive, false-positive, and false-negative pixels in the segmentation results, respectively, $n$ is the number of classes, and $F1_{c_i}$ is the F1 score of class $i$.
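The two metrics can be computed as in the following sketch, which assumes the F1 score is obtained from pixel counts as 2TP / (2TP + FP + FN) and that the harmonic mean is taken to be zero whenever any class has an F1 of zero. The function names and the zero-handling convention are assumptions for this example.

```python
def f1_score(tp, fp, fn):
    """Pixel-level F1 from true-positive, false-positive, false-negative counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def harmonic_mean_f1(per_class_f1):
    """Harmonic mean of per-class F1 scores.

    Returns 0.0 if any class has F1 == 0 (the limit of the harmonic mean);
    this convention is an assumption of the sketch.
    """
    if any(f == 0 for f in per_class_f1):
        return 0.0
    return len(per_class_f1) / sum(1.0 / f for f in per_class_f1)


binary_f1 = f1_score(tp=8, fp=2, fn=2)
overall_f1 = harmonic_mean_f1([0.5, 1.0])
print(binary_f1, overall_f1)
```

Unlike the arithmetic mean, the harmonic mean is dominated by the weakest class, so a model cannot obtain a high overall score while neglecting a rare subclass such as SL.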

4) Experimental Settings: All the experiments are run on four GeForce RTX 2080 Ti GPUs with a PyTorch implementation. In training, we crop images to 512×512 patches. We use HRNet-32 as the backbone for our networks, with pretrained weights downloaded from the PyTorch library. For the pixel contrastive loss, we randomly select 1024 pixels in the same batch as the positive and negative embedding sets in all experiments, and set the loss weight $\lambda_{NCE} = 0.1$. For the boundary loss, we set the kernel size of the Laplacian operator to 3 and the loss weight $\lambda_{Dice} = 1.0$. The model is trained using the Adam optimizer with an initial learning rate of 0.0001. On the Hainan dataset, the batch size is 4 and training runs for 60 000 iterations; on the xBD dataset, the batch size is 8 and training runs for 100 000 iterations. We decay the learning rate with the "poly" policy, in which the initial learning rate is multiplied by $(1 - \frac{iter}{max\_iter})^{power}$ with power = 0.9. Random cropping and horizontal flipping are also applied.
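The "poly" learning rate policy can be written in a few lines. The sketch below uses the initial learning rate, power, and iteration count quoted above for the Hainan dataset; the function name `poly_lr` is hypothetical.

```python
def poly_lr(base_lr, iteration, max_iter, power=0.9):
    """'Poly' schedule: base_lr * (1 - iteration / max_iter) ** power."""
    return base_lr * (1.0 - iteration / max_iter) ** power


base_lr, max_iter = 1e-4, 60000
schedule = [poly_lr(base_lr, it, max_iter) for it in (0, 15000, 30000, 45000, 60000)]
print(schedule)
```

The rate starts at the initial value, decreases monotonically, and reaches zero at the final iteration; with power below 1 the decay is slightly slower than linear for most of training.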

1) Hainan Dataset:
To demonstrate the effectiveness of our proposed BSSNet, we first compared our method with several SOTA segmentation methods on the Hainan building subclass dataset.
2) OCR [51] uses HRNet48 as the backbone network. Features in different levels are concatenated, and the feature before using the OCR module is also used to generate auxiliary prediction.
3) MANet [52] and MCFINet [53] use ResNet101 as the backbone network. The last layer of features is directly used to predict segmentation results.
In addition to our full framework, we also compare a clean two-subnetwork baseline without the FFM, the SGF module, and the pixel contrastive loss, which we call the vanilla network. A single network with two heads is compared as well. The functions of its two heads are similar to those of the two subnetworks, performing localization and classification, respectively, and the SGF module and pixel contrastive loss are also added. According to Table III, the proposed framework using the vanilla network alone is competitive with existing methods in terms of all metrics. Bold entries mark the best result on the corresponding metric.
We also compare our method with popular and SOTA methods for natural image and remote sensing image segmentation. As shown in Table III, our proposed method outperforms these methods by a clear margin in F1 score. On the overall F1 metric, BSSNet produces a 4.0% improvement over the previous best result. Our vanilla network is only 0.2% below FPN on the overall F1 metric because our two-subnetwork framework divides and conquers the task and gives a better localization result. We also observe a 1.4% increase in the overall F1 score when splitting the single network into the two-subnetwork framework, which again shows that the two-subnetwork framework is effective. Furthermore, with the help of the other modules, including FFM, SGF, and the contrastive loss, the performance of our method is significantly improved over the vanilla network. The SH F1 score of BSSNet is 2.8% higher than that of FPN because the contrastive loss makes features more robust and makes the difference between the features of LZ and those of SL more distinct. Fig. 5 shows a visual comparison of the building subclass segmentation results of different networks. The masks predicted by our method are more precise and coincide closely with the building boundaries. Our method also better predicts small, isolated objects, such as SL buildings in rural areas. Additionally, in comparison with FPN and MCFINet, our method better separates buildings that are close to each other.
2) xBD Dataset: Moreover, we present quantitative and qualitative comparisons of building disaster damage assessment on the xBD dataset. We compare our method with the xBD baseline [10], BDANet [56], RescueNet [54] and the method of Weber et al. [55], which are popular and typical methods in building disaster damage assessment. All results indicate that our method can be competitive in building disaster damage assessment.
In Fig. 6, we give qualitative results on a small but diverse sample of the dataset. In these results, our method is visibly better than the baseline model. The baseline model produces quite a few false-positive and false-negative errors that our model eliminates, because the contrastive loss in our method makes features more consistent. In addition, the SGF module makes our predictions sharper and more accurate at the boundary.
According to Table IV, our method produces 0.1% and 3.8% improvements in the major and destroyed damage levels, respectively, but the overall F1 metric is 0.3% lower than the maximum value. Although the overall F1 is slightly lower than that of RescueNet, our method shows a significant improvement in binary F1 (2.3%, 84.0%→86.3%), which is also reflected in Fig. 6. The reason is that we use a localization network to focus on binary segmentation, which is a relatively simple task, and an SGF module to improve the boundary of the binary segmentation results. The overall F1 score is slightly lower than that of RescueNet because our network is not designed to take advantage of the differences between pre- and postdisaster images.

C. Ablation Study
To understand how our proposed method works, we perform a full set of experiments to study its components. Table V shows the results of the vanilla network combined with each component individually. The FFM and SGF modules achieve 0.9% and 1.4% improvements in $F1_c$, respectively. The contrastive loss improves overall performance by 2.8%. Adding the contrastive loss enhances the network's ability to distinguish categories with similar characteristics, increasing the F1 scores of HZ, LZ, and SL by 1.8%, 1.1%, and 3.7%, respectively. The influence of adding multiple modules is also explored. The FFM together with the contrastive loss improves the overall F1 score of the vanilla network by 2.9%. Each component is analyzed in more detail in the subsequent sections.

1) FFM in Different Stages:
To validate the effect of FFM, we add FFM stage by stage, as shown in Table VI. As FFMs are added, the per-category and overall F1 scores generally maintain an upward trend. The improvement in the overall F1 score is 0.4% when FFM is added to the first stage and 1.1% when FFM is added to all three stages. By fusing classification and localization features, our network pays more attention to the locations recognized as buildings. The F1 score of LZ fluctuates as FFM is added in different stages because manual labeling in the Hainan dataset tends to annotate an LZ as a single contiguous region. Hence, accurate localization information may be of little help for this category.
2) Pixel Contrastive Loss: In (7), we use the positive embedding set $P_i$ and the negative embedding set $N_i$ to compute the contrastive loss $\mathcal{L}_{\mathrm{NCE}}$. The way these two sets are obtained greatly affects the network's performance. Simply using all pixels in the same batch may be computationally expensive. Accordingly, we first randomly sample a specified number of pixel embeddings. Then, to balance the number of embeddings per class, we set a hyperparameter named view number that limits the maximum number of embeddings of each class. Table VII shows the results for various sample numbers and view numbers. Both the computational cost and the performance of the network grow as the sample and view numbers increase. With sample number=4096 and view number=800, the network achieves the best overall F1 score of 60.3%, but efficiency is low. Therefore, balancing accuracy and efficiency, we choose sample number=1024 and view number=200 as our default setting.
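The two-step sampling described above can be sketched as follows. This is a minimal illustration under stated assumptions: embeddings are flattened to an (N, D) array with one label per pixel, and the function name `sample_pixel_embeddings` and its arguments are hypothetical, not taken from the authors' code.

```python
import numpy as np

def sample_pixel_embeddings(embeddings, labels, sample_number=1024,
                            view_number=200, rng=None):
    """Randomly sample pixels, then cap the count kept per class.

    embeddings: (N, D) array of pixel embeddings.
    labels:     (N,) array of integer class labels.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = embeddings.shape[0]

    # Step 1: draw at most `sample_number` pixels without replacement.
    idx = rng.choice(n, size=min(sample_number, n), replace=False)

    # Step 2: keep at most `view_number` of the sampled pixels per class.
    keep = []
    for c in np.unique(labels[idx]):
        cls_idx = idx[labels[idx] == c]
        if cls_idx.size > view_number:
            cls_idx = rng.choice(cls_idx, size=view_number, replace=False)
        keep.append(cls_idx)
    keep = np.concatenate(keep)
    return embeddings[keep], labels[keep]

# Toy data: 5000 pixels, 256-dimensional embeddings, 5 classes.
rng = np.random.default_rng(0)
emb = rng.normal(size=(5000, 256))
lab = rng.integers(0, 5, size=5000)
sub_emb, sub_lab = sample_pixel_embeddings(emb, lab, rng=rng)
```

With the default setting (sample number=1024, view number=200), at most 1024 embeddings enter the loss and no class contributes more than 200 of them, which bounds the size of the pairwise similarity matrix.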
Moreover, the number of dimensions of the projection head in our network is crucial. Thus, we study the effect of the embedding dimension, as shown in Table VIII. The larger the dimension, the richer the embedded information, but the less efficient the computation. At dimension=256, the overall F1 score reaches its highest value of 59.6%. However, as the dimension continues to increase, the network's performance decreases: at dimension=1024, the overall F1 score drops to 58.4%. The reason is that excessively high-dimensional embeddings carry redundant information, which degrades the network's performance.
To study the effect of the contrastive loss, we perform experiments using different contrastive loss weights $\lambda_{\mathrm{NCE}}$. As shown in Fig. 10(b), the best overall F1 score is obtained when $\lambda_{\mathrm{NCE}} = 0.1$. Interestingly, as the weight grows further, the overall F1 score stays at around 60.0%. This suggests that the contrastive loss brings only limited additional benefit once the features have been learned to a certain extent. This is because the features of …
In addition, to understand the improvement from the pixel contrastive loss, we use t-SNE [57] to visualize the embedding space before and after the contrastive loss is added. Fig. 7 shows that, after the contrastive loss is added, the boundaries between features of different categories are more apparent and the embeddings of the same category cluster more compactly. As a result, the network can better distinguish different categories. As the predicted masks in Fig. 8 show, our method produces more accurate segmentation with the pixel contrastive loss.
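The role of the pixel contrastive loss and its weight can be illustrated with a short sketch. It assumes the common InfoNCE form, in which each anchor is pulled toward same-class embeddings and pushed away from other-class embeddings; the temperature value, the additive combination with a segmentation loss, and all names below are assumptions for illustration and may differ from (7).

```python
import numpy as np

def pixel_info_nce(embeddings, labels, temperature=0.1):
    """InfoNCE-style pixel contrastive loss over a set of embeddings.

    Positives of an anchor: other embeddings with the same label.
    Negatives of an anchor: embeddings with a different label.
    """
    labels = np.asarray(labels)
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = (z @ z.T) / temperature                  # (N, N) scaled cosine similarity
    same = labels[:, None] == labels[None, :]
    pos = same & ~np.eye(len(labels), dtype=bool)  # same class, excluding self
    neg = ~same
    exp_sim = np.exp(sim)
    neg_sum = np.sum(exp_sim * neg, axis=1, keepdims=True)
    # log( exp(s_ij) / (exp(s_ij) + sum over negatives of exp(s_ik)) )
    log_prob = sim - np.log(exp_sim + neg_sum)
    n_pos = pos.sum(axis=1)
    valid = n_pos > 0                              # anchors with at least one positive
    if not valid.any():
        return 0.0
    per_anchor = -(log_prob * pos).sum(axis=1)[valid] / n_pos[valid]
    return float(per_anchor.mean())

# Four embeddings forming two tight clusters.
emb = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
loss_clustered = pixel_info_nce(emb, [0, 0, 1, 1])  # labels agree with clusters
loss_mixed = pixel_info_nce(emb, [0, 1, 0, 1])      # labels cut across clusters

lambda_nce = 0.1   # weight reported as best in the text
seg_loss = 0.5     # hypothetical segmentation loss value
total_loss = seg_loss + lambda_nce * loss_clustered  # assumed additive combination
```

The loss is close to zero when same-class embeddings already coincide and large when they do not, which is consistent with the observation that its contribution saturates once the features are well separated.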
3) Boundary Loss: Although additional constraints on the boundary can improve the segmentation, the choice of loss function used to learn the boundary has a great impact on the result. Table IX reports the influence of boundary constraints with different loss functions on the binary segmentation F1 score. BCE, weighted BCE, and Dice loss yield increases of about 0.3%, 0.4%, and 0.2% in binary F1 score, respectively. The largest gain, 0.6%, is obtained by using BCE and Dice loss together in the network.
To find out why combining Dice loss and BCE loss is so effective, we analyze in detail the visualization results obtained with different boundary loss functions. As shown in Fig. 9, weighted BCE tries to solve the data imbalance problem by applying balancing weights. Still, this hard balancing brings little improvement because the proportion of boundary pixels fluctuates over a wide range across images, and fixed weights cannot cope with this variation. Dice loss makes boundaries clearer, but buildings are incomplete because it effectively uses the F1 score as the loss function, thus ignoring the influence of individual pixels. Consequently, combining Dice loss and BCE loss lets the two complement each other and generates precise, clear, and complete building boundaries. It is worth noting that the boundary lines predicted with Dice loss are much brighter than those in the other comparisons. According to (10), the gradient with respect to the boundary prediction is $b(b^2-\hat{b}^2)/(b^2+\hat{b}^2)^2$, which penalizes wrong predictions much more strictly than the gradient of BCE loss. Therefore, compared with BCE loss, correct predictions under Dice loss tend toward 1 while wrong predictions tend toward 0, resulting in brighter predictions.
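The shape of this gradient can be checked numerically. The sketch below assumes a single-pixel Dice coefficient with squared terms in the denominator, 2·b_true·b_pred / (b_true² + b_pred²); under that assumption the derivative with respect to the prediction matches the expression above up to a constant factor of 2. The variable names and the exact form of the coefficient are assumptions, since (10) is not reproduced here.

```python
import numpy as np

def dice_coeff(b_pred, b_true):
    """Single-pixel Dice coefficient with squared denominator (assumed form)."""
    return 2.0 * b_true * b_pred / (b_true ** 2 + b_pred ** 2)

def dice_coeff_grad(b_pred, b_true):
    """Analytic derivative of dice_coeff with respect to b_pred."""
    return (2.0 * b_true * (b_true ** 2 - b_pred ** 2)
            / (b_true ** 2 + b_pred ** 2) ** 2)

b_true = 1.0                       # a boundary pixel in the label
preds = np.array([0.1, 0.5, 0.9])  # increasingly confident predictions
h = 1e-6

# Central finite differences as an independent check of the closed form.
numeric = (dice_coeff(preds + h, b_true)
           - dice_coeff(preds - h, b_true)) / (2.0 * h)
analytic = dice_coeff_grad(preds, b_true)
max_abs_diff = float(np.max(np.abs(numeric - analytic)))
```

For a true boundary pixel the derivative is positive for every prediction below 1, so gradient ascent on the coefficient (equivalently, descent on one minus the coefficient) keeps pushing the prediction toward 1.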
Fig. 10(a) shows experiments on the impact of different Dice loss weights in the boundary loss function. The weight parameter $\lambda_{\mathrm{Dice}}$ noticeably affects performance. We find that $\lambda_{\mathrm{Dice}} = 1.0$ achieves the best binary building segmentation performance. A larger $\lambda_{\mathrm{Dice}}$ causes the network to focus too much on the boundary, leaving predictions incomplete.
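A combined boundary loss of this kind can be sketched as BCE plus a weighted Dice term. The squared-denominator Dice form, the additive combination, and the function names are assumptions for illustration; only the weight value 1.0 comes from the text.

```python
import numpy as np

def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross-entropy between probabilities and 0/1 targets."""
    p = np.clip(pred, eps, 1.0 - eps)
    return float(np.mean(-(target * np.log(p)
                           + (1.0 - target) * np.log(1.0 - p))))

def dice_loss(pred, target, eps=1e-7):
    """1 - Dice coefficient, with squared terms in the denominator."""
    inter = np.sum(pred * target)
    denom = np.sum(pred ** 2) + np.sum(target ** 2)
    return float(1.0 - 2.0 * inter / (denom + eps))

def boundary_loss(pred, target, lambda_dice=1.0):
    """BCE plus weighted Dice loss on a boundary probability map."""
    return bce_loss(pred, target) + lambda_dice * dice_loss(pred, target)

# Toy boundary label: a vertical line in a 2x3 map.
target = np.array([[0.0, 1.0, 0.0],
                   [0.0, 1.0, 0.0]])
good = 0.98 * target + 0.01   # confident and correct
bad = 1.0 - good              # confident and wrong

loss_good = boundary_loss(good, target, lambda_dice=1.0)
loss_bad = boundary_loss(bad, target, lambda_dice=1.0)
```

Setting `lambda_dice=0.0` recovers plain BCE, while larger values shift the emphasis toward the region-level overlap term.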

V. CONCLUSION
This article proposes BSSNet, a CNN-based learning framework with two subnetworks for building subclass segmentation from satellite images. The first network performs binary building segmentation and guides the building locations in building subclass segmentation. An SGF module is added to the first network to improve the binary segmentation result by supervising the spatial gradient map of the prediction. The second network predicts the building subclasses (HZ, LZ, SH, and SL). Intermediate features of the second network are supervised using a contrastive learning loss to improve feature consistency. Finally, the predictions of the two networks are combined to generate the final result. Experiments on the Hainan and xBD datasets demonstrate that our proposed framework yields significant improvements and support the effectiveness of our method.
For future work, it would be interesting to divide buildings into more fine-grained subclasses. Another possible direction is extending subclass segmentation to other classes, such as vegetation and roads. These classes are more challenging than buildings. For vegetation, the concept of an object ceases to exist, and there is less difference in features between subclasses. The scale of roads is more flexible, and their subclasses may need to be determined by features that are far apart on the same road.