Supervised Contrastive Embedding for Medical Image Segmentation

Deep segmentation networks generally consist of an encoder that extracts features from an input image and a decoder that restores them to the original input size to produce segmentation results. Ideally, the trained encoder should possess a semantic embedding capability: it should map a pair of features close to each other when they belong to the same class and map them far apart when they correspond to different classes. Recent deep segmentation networks do not directly address this embedding behavior of the encoder, so we cannot expect the features embedded by the encoder to have the semantic embedding property. If the model can be trained to have this embedding ability, performance should improve further, as restoring segmentation results from such features is much easier for the decoder. To this end, we propose supervised contrastive embedding, which employs a feature-wise contrastive loss on the feature map to enhance segmentation performance on medical images. We also introduce a boundary-aware sampling strategy, which focuses on the features corresponding to image patches located near the boundaries of the ground-truth annotations. Through extensive experiments on lung segmentation in chest radiographs, liver segmentation in computed tomography, and brain tumor and spinal cord gray matter segmentation in magnetic resonance images, we demonstrate that the proposed method improves the segmentation performance of the popular U-Net, U-Net++, and DeepLabV3+ architectures. Furthermore, we confirm that the proposed contrastive embedding enhances the robustness of segmentation models to domain shifts.


I. INTRODUCTION
Recently, learning methods based on deep neural networks have been rapidly changing the field of medical imaging analysis and are being applied for clinical purposes, e.g., early detection or classification of lesions to improve current practice. Deep neural networks learned from large amounts of data support precise diagnoses and accelerate time-consuming processes that require medical expertise [1]–[3]. Semantic segmentation is a particularly important area of medical imaging analysis and is essential for diagnosis, monitoring, and treatment [4]. Despite such significant progress, better and more reliable performance is required for segmentation models to be widely used in clinical settings. To achieve more precise and reliable segmentation, recent studies have proposed several strategies: designing advanced model architectures [5]–[8], including new information delivery methods [9]–[11], or designing loss functions specific to a segmentation task [12]–[18].
Typically, segmentation networks consist of an encoder that extracts features (i.e., representations) from an input image and a decoder that generates prediction maps from those extracted features. If a model is ideally trained to meet the original purposes of the encoder and decoder, the feature vectors extracted by the encoder should have the semantic embedding property, i.e., they should be easily distinguishable according to the classes of their corresponding receptive fields. That is, feature vectors representing regions of the same class should be located close to each other, and feature vectors representing different classes should be well separated in the embedding space, as illustrated in Fig. 1. However, we cannot expect the encoders of existing segmentation models to have this property, since previous studies have focused on improving segmentation performance without considering this aspect of encoders.
In this study, we propose a novel framework termed supervised contrastive embedding (SCE), which utilizes a contrastive loss [19] to encourage the encoder to learn the semantic embedding property during training. Contrastive loss has been actively used for self-supervised learning, which aims to learn generic representations from large-scale unlabeled data [20]. To apply contrastive loss to segmentation tasks, we extend the existing contrastive loss by contrasting the feature vectors extracted from the encoder. Each feature vector corresponds to a specific patch in the input image and can be labeled by its position on the ground-truth segmentation mask. Thus, the proposed contrastive loss encourages the encoder to embed the feature vectors associated with positive regions close to each other and far from the feature vectors associated with negative regions in the embedding space. With the embedded feature vectors contrasted to be easily distinguishable in the feature space, the decoder can generate more accurate segmentation results.
Meanwhile, deep neural networks are known to suffer performance degradation when they encounter data from a domain even slightly different from that of the training data [21]. Especially in medical imaging, such situations, termed domain shift, are commonly caused by diverse image acquisition equipment and the personal characteristics of patients. Models that fail to generalize to target domains that share semantic information with the source domains but have slightly different visual characteristics are unreliable. Therefore, domain robustness is considered an important research direction alongside improvements to model performance [22], [23]. Our SCE contrasts positive and negative regions across different input images to learn the desirable embedding property. Therefore, by utilizing multi-source domains during training, it can be expected to learn domain-invariant features.
We comprehensively validated the effectiveness of our proposed method on four segmentation tasks from diverse imaging modalities: liver (CT), brain tumor (MR), lung (X-ray), and spinal cord gray matter (MR) segmentation. For the experiments, we adopted model architectures that are widely used in medical image segmentation: U-Net, U-Net++, and DeepLabV3+. Our contributions can be summarized as follows:
• We provide qualitative and quantitative results demonstrating that the encoders of existing segmentation models do not have the desirable semantic embedding property. Specifically, we show that segmentation models learned by the conventional training strategy do not consider the semantic relationship between feature vectors in the embedding space.
• We propose a novel framework called supervised contrastive embedding that utilizes a contrastive loss to make the encoder learn the ideal embedding property. We also propose an effective sampling strategy called boundary-aware sampling, which further enhances segmentation performance by sampling features from the boundary region between foreground and background. Extensive experiments demonstrating the advantages of the proposed method were conducted on various medical image datasets and state-of-the-art architectures.
• We demonstrate that our proposed method is effective in improving the domain robustness of segmentation models trained on multi-source domains, validated under a standard domain generalization evaluation protocol.
The remainder of this paper is organized as follows. Section II introduces related work on semantic segmentation and contrastive learning. Section III provides details of the proposed supervised contrastive embedding with boundary-aware sampling. Section IV presents our experimental settings and results for semantic segmentation and domain robustness.
Finally, conclusions and discussions are given in Section V.

A. SEMANTIC SEGMENTATION
Several methods have been proposed to improve segmentation performance from the perspective of deep neural network architecture. One of the pioneering works is on fully convolutional networks (FCNs) [24]. It successfully outperformed previous segmentation models and significantly influenced subsequent studies on neural network-based semantic segmentation. To further enhance FCNs, encoder-decoder architectures with symmetrical structures were proposed to restore feature maps to the original input size through a trainable decoder [25], [26]. The DeepLab family [5]–[7], [27] is another notable architectural advance, utilizing dilated convolutions to increase the receptive field size while maintaining the number of trainable parameters. DeepLabV3+ [7], the latest of the DeepLab series, combines depthwise separable convolutions with dilated convolutions and thereby conspicuously reduces the amount of computation while maintaining performance. U-Net [28] and its variants have been proposed for medical image segmentation and gained popularity. Because of its superior performance, U-Net has been widely used for biomedical image segmentation, and many subsequent works have adopted the U-shaped architecture as a basic design [29], e.g., U-Net++ [9], MultiResUNet [10], TransUNet [11], etc. Besides advanced architectures, employing a better loss function also effectively improves segmentation performance. Focal loss [30] weights the cross entropy loss according to the difficulty of classifying each pixel so that the model is more affected by pixels that are hard to classify. Loss functions that directly optimize the target overlap metrics of segmentation, e.g., Dice loss [31], have also been introduced. Salehi et al. [32] proposed a loss function based on the Tversky index, and Abraham and Khan [14] introduced Focal Tversky loss, a focal loss generalized with the Tversky index. ClDice [17], [18] is a similarity measure that considers the connectedness of segmented regions, and its differentiable form is used as a loss function. Such pixel-level and region-level losses can also be naturally combined [13], [15]. However, none of the previous approaches explicitly encourages the encoder of a segmentation network to learn the semantic embedding property.

FIGURE 2. Framework overview of the proposed SCE. The feature map extracted from the encoder is resized through bilinear interpolation and then a predefined number of feature vectors is randomly sampled. Here, we adopt a sampling strategy that pays more attention to the boundaries of segmentation targets.

B. CONTRASTIVE LEARNING
The operating principle of contrastive learning is to learn representations by contrasting positive pairs against negative pairs. Based on this concept, contrastive loss has been extended in various forms and has achieved remarkable progress in recent years [20], [33]–[35]. For example, SimCLR [20] used contrastive loss to learn generic representations from a large-scale unlabeled dataset in a self-supervised manner, treating augmented versions of an image as a positive pair and other images as negative samples. While most studies on contrastive learning deal with unsupervised learning, Khosla et al. [35] extended contrastive learning to a fully supervised setting that leverages label information by considering all samples from the same class as positives.
There are several works where the contrastive learning strategy is applied to the semantic segmentation task. Chaitanya et al. [36] proposed contrasting strategies that leverage structural similarities across volumetric medical images under self-supervised learning. This approach treats representations of slices from similar volumes as positive pairs and representations of slices from dissimilar volumes as negative pairs. Zhao et al. [37] proposed a fine-tuning approach after pretraining the encoder using contrastive loss on semantic segmentation tasks. However, these works were motivated to achieve better performance with a small amount of training data rather than train a segmentation network encoder to learn the semantic embedding property.
Concurrently with our work, Wang et al. [38] proposed applying a contrastive loss to embedded features to learn semantic relations between features across different images. Their core idea is similar to ours, but they contrast features transformed by projection layers and employ different sampling strategies. In contrast, we apply the contrastive loss directly to the encoder's features and verify an additional benefit: robustness to domain shifts.

A. FRAMEWORK OVERVIEW
The overall framework of our proposed method is illustrated in Fig. 2. In our work, the objective of semantic segmentation is to obtain a binary segmentation map ŷ ∈ R^{H×W} that consists of class predictions for each pixel of the input image x ∈ R^{H×W}. To enable a model to learn the semantic embedding property, we introduce the supervised contrastive loss for feature vectors extracted from the encoder. Our proposed contrastive loss aligns positive feature vectors closely with each other and separates them from the negative feature vectors in the embedding space. To this end, we use the resized ground-truth masks for supervision as depicted in Fig. 2. Each feature vector in the feature map represents a specific patch in the input image. Therefore, each pixel of the resized label, which has the same size as the feature map, indicates whether the patch represented by the corresponding feature vector belongs to a positive or negative class.
To prevent the loss of information caused by an overly down-sampled label, we first expand the size of the feature map using bilinear interpolation and then resize the ground-truth label to the same size as the interpolated feature map. It is worth noting that our proposed contrastive loss is computed on a mini-batch rather than an individual-image basis, to encourage the model to learn representations that are invariant across images. However, computing the contrastive loss for every possible pair of feature vectors in a given mini-batch might be computationally too expensive. Hence, we present the boundary-aware sampling approach, which effectively samples valuable subsets of feature vectors in terms of semantic embedding.
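For illustration, the label-alignment step can be sketched in NumPy. The `resize_nearest` helper below is our own illustrative code, not part of the released implementation; in practice the feature map itself is enlarged with bilinear interpolation and the label is resized to match, but the idea of assigning one label per feature-grid cell is the same:

```python
import numpy as np

def resize_nearest(mask, out_h, out_w):
    """Nearest-neighbor resize of a 2-D ground-truth mask to the feature-grid size."""
    h, w = mask.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return mask[rows[:, None], cols[None, :]]

# Toy example: an 8x8 binary mask resized to match a 4x4 feature grid,
# yielding one positive/negative label per feature vector.
mask = np.zeros((8, 8), dtype=np.int64)
mask[2:6, 2:6] = 1                      # square foreground region
labels = resize_nearest(mask, 4, 4)     # shape (4, 4): one label per feature
```

Each entry of `labels` then tells whether the patch represented by the corresponding feature vector is positive or negative.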

B. CONTRASTIVE LOSS
Since semantic segmentation is inherently a pixel-level classification task, the commonly used loss function for binary segmentation is the binary cross entropy (BCE) loss. For a predicted segmentation map ŷ with a total of N pixels and the ground-truth label y, the BCE loss is computed as:

L_{BCE} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right],   (1)

where y_i ∈ {0, 1} and ŷ_i ∈ [0, 1] are the ground-truth class and the predicted probability for the i-th pixel, respectively. One can also consider the Dice loss as a segmentation loss, which can be computed as:

L_{Dice} = 1 - \frac{2 \sum_{i=1}^{N} y_i \hat{y}_i}{\sum_{i=1}^{N} y_i + \sum_{i=1}^{N} \hat{y}_i}.   (2)

In this work, we used a combination of the BCE loss and the Dice loss as the segmentation loss [39]. We argue that these commonly used loss functions do not guarantee that the encoder of a segmentation network will learn the semantic embedding property, as validated in Section IV-D. To obtain an encoder with this desirable property, we apply the contrastive loss to the embedded feature vectors. Consider a feature map interpolated with an up-sampling factor c to the spatial size H' × W'. Accordingly, the resized label will be y' ∈ R^{H'×W'}, and its i-th pixel is assigned as the label of the i-th vector z_i ∈ R^{C_f} in the interpolated feature map Z'. For a single positive feature vector z_i in a mini-batch B, the proposed contrastive loss is defined as follows:

L_{con}(z_i) = -\frac{1}{|\{z^+\}|} \sum_{z^+ \in \{z^+\}} \log \frac{\exp(\mathrm{sim}(z_i, z^+)/\tau)}{\sum_{z \in \{z^+\} \cup \{z^-\}} \exp(\mathrm{sim}(z_i, z)/\tau)},   (3)

where {z^+} and {z^-} are the sets of positive and negative feature vectors in B, respectively. Here, sim(·, ·) denotes the cosine similarity between two feature vectors, and τ represents a temperature parameter.
In this work, we set the temperature τ to 1 without tuning the hyperparameter.
It is important to note that L_con is computed over a mini-batch rather than over individual images, i.e., |{z^+}| + |{z^-}| = H' × W' × |B|. If L_con were computed for each image, the model would only learn the relationship between positive and negative features within each image independently. To learn this relationship across all training samples, our proposed contrastive loss L_con is computed over all images in a mini-batch. Hence, L_con is minimized when a given positive feature is aligned with the other positive features (i.e., the cosine similarities are nearly one) in the representation space and orthogonal to all negative features (i.e., the cosine similarities are nearly zero). It can be inferred from (3) that L_con does not force the negative features to be located close to each other in the representation space. This is reasonable since the backgrounds in medical images (i.e., negative patches) usually have heterogeneous characteristics, while foregrounds such as lesions or organs (i.e., positive patches) are relatively homogeneous.
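A minimal NumPy sketch of this per-feature contrastive loss follows (our illustrative code; the actual implementation operates on GPU tensors over sampled features of a mini-batch). It takes one positive anchor `z_i`, the remaining positives `pos`, and the negatives `neg`, with τ = 1 as in the paper:

```python
import numpy as np

def contrastive_loss(z_i, pos, neg, tau=1.0):
    """Contrastive loss for a single positive feature z_i, in the style of Eq. (3).

    pos: (P, C) positive feature vectors; neg: (N, C) negative feature vectors.
    """
    def sim(a, B):
        # cosine similarity between vector a and each row of B
        return (B @ a) / (np.linalg.norm(B, axis=1) * np.linalg.norm(a))

    logits = np.concatenate([sim(z_i, pos), sim(z_i, neg)]) / tau
    log_den = np.log(np.exp(logits).sum())          # log of the softmax denominator
    return -np.mean(logits[: len(pos)] - log_den)   # averaged over the positives
```

As expected, an anchor aligned with the positives and orthogonal to the negatives yields a lower loss than a misaligned one.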
Consequently, we define the total loss function L_total for a given mini-batch B as follows:

L_{total} = \frac{1}{|B|} \sum_{k=1}^{|B|} \left[ L_{BCE}(\hat{y}_k, y_k) + L_{Dice}(\hat{y}_k, y_k) \right] + \lambda L_{con},   (4)

where ŷ_k and y_k are the prediction result and the ground-truth label of the k-th sample, and λ is a weight for the contrastive loss, set to 1 in all experiments. Training a segmentation model with L_total encourages the encoder to have the semantic embedding property and the decoder to predict segmentation results based on the well-embedded features.
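The combined objective can be sketched as follows (illustrative NumPy code with our own helper names; `l_con` stands for an already-computed contrastive term, and the clipping constant `eps` is an assumption for numerical stability, not specified in the paper):

```python
import numpy as np

def bce_loss(y, p, eps=1e-7):
    """Binary cross entropy over all pixels (Eq. (1)-style)."""
    p = np.clip(p, eps, 1 - eps)                  # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def dice_loss(y, p, eps=1e-7):
    """Dice loss (Eq. (2)-style), with a small smoothing term."""
    return 1.0 - (2.0 * (y * p).sum() + eps) / (y.sum() + p.sum() + eps)

def total_loss(y, p, l_con, lam=1.0):
    # segmentation loss (BCE + Dice) plus the weighted contrastive term
    return bce_loss(y, p) + dice_loss(y, p) + lam * l_con
```

A perfect prediction drives the segmentation part of the loss to (nearly) zero, leaving only the weighted contrastive term.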

C. BOUNDARY-AWARE SAMPLING
Computing the proposed contrastive loss in (3) for all feature pairs in a mini-batch is computationally expensive since the number of pairs is too large even for a moderately sized mini-batch. For example, given an encoded feature map with a spatial resolution of r × r and a mini-batch size of b, the number of pairs to be examined is \frac{1}{2} r^2 b (r^2 b - 1). Therefore, in practice, the feature vectors must be sampled to compute the contrastive loss.
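For concreteness, the pair count can be computed directly (illustrative code):

```python
def num_pairs(r, b):
    """Unordered feature pairs in a mini-batch: n(n-1)/2 with n = r^2 * b."""
    n = r * r * b
    return n * (n - 1) // 2

# Even a modest setting is expensive: a 32x32 feature map with batch size 16
# already gives 134,209,536 pairs (~1.3e8).
pairs = num_pairs(32, 16)
```

This quadratic growth in the number of features is what motivates sampling.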
One can simply adopt a random sampling strategy. However, a more effective strategy can be designed based on the observation that boundary regions are generally more challenging to segment correctly. In terms of learning the semantic embedding property, it is helpful for the encoder to focus on hard-to-classify regions rather than easy-to-classify ones. Hence, we propose an effective sampling method, boundary-aware sampling, which pays more attention to the feature vectors corresponding to the boundary regions.
Suppose that the numbers of positive (s_p) and negative (s_n) feature samples are given. Note that s_n is known to need to be much larger than s_p for contrastive learning [20], and this is also shown in our experimental results. The proposed boundary-aware sampling can be applied in three ways according to how s_p^b, the number of positive feature samples associated with the boundary area, is determined: fixed, random, and linear sampling.
• Fixed: use a fixed proportion to decide the number of boundary features, i.e., s_p^b = αs_p where α ∈ [0, 1]. In our experiments, we manually set α to 0.2.
• Random: use a random proportion, i.e., s_p^b = αs_p where α ∼ Unif(0, 1). Here, α is randomly sampled at each training iteration.
• Linear: use a proportion that linearly increases with the epoch, i.e., s_p^b = αs_p = (t/T) s_p, where t and T represent the current and total number of epochs, respectively. We can expect a curriculum learning effect, which progressively focuses on the hard examples.
Note that the parameter α for random or linear sampling is determined automatically at each iteration, whereas for fixed sampling it must be defined a priori. For all three sampling strategies, the number of negative feature samples representing the boundary area, s_n^b, is set to s_p^b. Consequently, the feature vectors corresponding to non-boundary areas are sampled in the amounts of s_p − s_p^b and s_n − s_n^b for positive and negative samples, respectively. To demonstrate the efficacy of the proposed sampling methods, we examine performance gains according to the sampling method in Section IV.
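The three quota rules above can be sketched as follows (illustrative code; the function and variable names are ours):

```python
import numpy as np

def boundary_quota(s_p, mode, t=None, T=None, alpha=0.2, rng=None):
    """Number of positive boundary samples s_p^b for each sampling strategy."""
    if mode == "fixed":
        a = alpha                                       # predefined proportion
    elif mode == "random":
        a = (rng or np.random.default_rng()).uniform()  # fresh alpha per iteration
    elif mode == "linear":
        a = t / T                                       # grows with training epoch
    else:
        raise ValueError(f"unknown mode: {mode}")
    return int(a * s_p)
```

For example, with linear sampling at the halfway point of training (t = 60 of T = 120 epochs), half of the positive samples are drawn from the boundary area.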

IV. EXPERIMENTS AND RESULTS
We performed two sets of experiments to demonstrate the efficacy of our proposed method in terms of segmentation (referred to as source segmentation) and domain robustness. We show that applying the contrastive loss with boundary-aware sampling yields better segmentation performance on liver and brain tumor segmentation tasks across state-of-the-art architectures. We also provide quantitative and qualitative analyses showing that the encoder of a standard segmentation model does not learn the semantic embedding property. Additionally, for domain robustness, we present experimental results showing that our method improves the segmentation model's generalization performance on unseen domains for the lung and spinal cord gray matter segmentation tasks.

1) Liver Segmentation
The Liver Tumor Segmentation Challenge (LiTS) dataset [40] consists of 201 3D abdominal CT scans with annotations of the liver and tumor regions. We used the 131 CT scans that have publicly available labels and focused on the liver segmentation task by considering only the liver as the positive class and the remainder as the negative class. We truncated the pixel intensity values of all scans to the range of [-200, 200] Hounsfield units (HU) to remove irrelevant details and enhance the contrast between the liver and other tissues. Then, we normalized them to the range [0, 255]. For training the segmentation models, we used 2D slice images from the 3D CT scans. In total, we used 19,163 slices with 512×512 resolution, excluding the images that do not show the liver. We randomly selected 103 CT scans as training data and the remaining 28 CT scans as test data.

2) Brain Tumor Segmentation
The BraTS 2018 training dataset [41] provides multimodal 3D brain MRIs annotated with three ground-truth segmentation labels: necrotic and non-enhancing tumor, edema, and enhancing tumor. The dataset contains 210 high-grade (HG) and 75 low-grade (LG) cases, and each case has four MRI modalities: FLAIR, T1, T1c, and T2. We randomly selected 20 cases as test data and used the remaining cases as training data. For simplicity, we converted the task into a binary problem by considering all three labels as the positive class and the remaining background as the negative class, following [9]. In training, we used 2D slice images from the original 3D scans as inputs. In evaluation, we measured the segmentation performance on 3D scans by stacking the 2D slice-level predictions.

3) Lung Segmentation
To evaluate the generalization performance on unseen domains, we used three chest X-ray datasets acquired from different sources: Japanese Society of Radiological Technology (JSRT) [42], Montgomery County (MC) [43], and Shenzhen (SZ) [43]. The JSRT dataset consists of 247 chest X-ray images, of which 154 cases have lung nodules and 93 cases do not. The MC dataset contains 138 X-rays from Montgomery County's tuberculosis screening program, including 80 normal cases and 58 cases with manifestations of tuberculosis. The JSRT and MC datasets have manually segmented lung masks for evaluating automatic lung segmentation methods. However, the SZ dataset does not officially provide ground-truth segmentation labels. Therefore, we used the labels released by [44] and kept 521 images, excluding cases without annotations or with mislabeled masks. Since the image acquisition equipment and the visual characteristics of the lung area differ, we consider each dataset an individual domain. Two datasets were used as source domains for training, and the remaining dataset was used as the target (i.e., unseen) domain. Specifically, we randomly selected 30% of each source domain to evaluate segmentation performance on the source domains, and the remaining 70% was used for training.

4) Spinal Cord Gray Matter Segmentation
We used the spinal cord gray matter segmentation challenge dataset [45], which contains MRI axial slices of healthy spinal cord. Since the dataset was acquired at four different medical centers with different MRI systems, we considered the data collected from each site as an individual domain. The four domains, named site1, site2, site3, and site4, have 30, 133, 177, and 134 MRI axial slices, respectively, after the slices having no annotations were excluded. Similar to the lung segmentation, we adopted one domain as the target unseen domain to evaluate the domain generalization performance. We sampled 30% of each source domain as test data and the remaining data as training data. Since the positive regions were too small compared to the input size, we applied center cropping to the images before training.

B. IMPLEMENTATION DETAILS
To demonstrate the efficacy of SCE, we set the model trained with only the segmentation loss as the baseline (i.e., λ = 0 in Eq. (4)). For the segmentation architectures, we employed U-Net [28] and U-Net++ [9], which are popular in medical image analysis, and DeepLabV3+ [7], the latest version of the DeepLab series. For the liver segmentation task, ResNet34 [46] was used as the segmentation model's encoder, and ResNet50 was used for the remaining segmentation tasks. All models were trained using the SGD optimizer with a momentum of 0.9 and a weight decay of 1e-4 for 120 epochs. The initial learning rate was set to 0.01 and decayed at the 80th and 100th epochs by a factor of 0.1. For data augmentation, we applied color jittering that randomly adjusts the brightness and contrast of an input; the factors for brightness and contrast adjustment are both sampled from Unif(0.6, 1.4) at each iteration. The batch size is one of the factors that significantly affect the performance of deep neural networks. When a large batch size is used for a small-scale dataset (e.g., the lung and spinal cord segmentation datasets in our experiments), performance generally degrades since the advantages of stochastic gradient descent cannot be fully utilized. Thus, we determined the batch size for each dataset based on the validation performance: 64 for both liver and brain tumor segmentation, 32 for lung segmentation, and 16 for spinal cord gray matter segmentation.
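The described step schedule can be expressed as a small function (illustrative sketch; a framework scheduler such as a multi-step learning-rate scheduler would typically be used instead):

```python
def learning_rate(epoch, base_lr=0.01, milestones=(80, 100), gamma=0.1):
    """Step learning-rate schedule: decay by `gamma` at each milestone epoch."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr
```

So the learning rate is 0.01 until epoch 80, 0.001 until epoch 100, and 0.0001 for the remainder of training.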
For all tasks, the number of positive features used to compute the contrastive loss was set to twice the batch size, and the number of negative samples was set to six times the number of positive samples, based on the empirical findings from the ablation study (see Fig. 5). For the proposed method, we simply set the value of λ in (4) to 1.0 for all experiments and manually set the parameter α to 0.2 in the case of fixed boundary-aware sampling. It should be noted that we did not tune these hyperparameters (the numbers of positive and negative features and the weight λ) for each experiment, in order to demonstrate that the proposed method is insensitive to them.

C. EVALUATION METRICS
We used six different metrics to evaluate the performance of the segmentation models from various perspectives: precision, recall, Dice coefficient, average contour distance (ACD), average surface distance (ASD), and negative log likelihood (NLL). The ACD and ASD are distance metrics that measure how far the predicted positive region is from the actual positive region. Let s_i be the i-th pixel on the boundary of the predicted segmentation mask S and g_j be the j-th pixel on the boundary of the ground-truth mask G. With n_S and n_G denoting the total numbers of boundary pixels in S and G, respectively, ACD and ASD are defined as:

ACD(S, G) = \frac{1}{2} \left( \frac{1}{n_S} \sum_{i=1}^{n_S} d(s_i, G) + \frac{1}{n_G} \sum_{j=1}^{n_G} d(g_j, S) \right),   (5)

ASD(S, G) = \frac{1}{n_S + n_G} \left( \sum_{i=1}^{n_S} d(s_i, G) + \sum_{j=1}^{n_G} d(g_j, S) \right),   (6)

where d(s_i, G) is the Euclidean distance between s_i and the closest pixel on G. Furthermore, we use the NLL metric, which can be interpreted as a cross entropy loss, to assess the impact of the contrastive loss on the quality of the estimated probabilities.
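The two distance metrics can be sketched in NumPy as follows (our illustrative implementation of the standard definitions these symbols suggest; the paper's exact code may differ, e.g., in how boundary pixels are extracted). `S` and `G` are arrays of boundary-pixel coordinates:

```python
import numpy as np

def directed_dists(A, B):
    """For each boundary point in A, Euclidean distance to the closest point in B."""
    diff = A[:, None, :] - B[None, :, :]
    return np.sqrt((diff ** 2).sum(-1)).min(axis=1)

def acd(S, G):
    """Average contour distance: mean of the two directed average distances."""
    return 0.5 * (directed_dists(S, G).mean() + directed_dists(G, S).mean())

def asd(S, G):
    """Average surface distance: pooled average over both boundaries."""
    return (directed_dists(S, G).sum() + directed_dists(G, S).sum()) / (len(S) + len(G))
```

Both metrics are symmetric in the sense that they aggregate distances in both directions, but they weight the two boundaries differently when n_S ≠ n_G.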

D. SOURCE SEGMENTATION
The comparative results on the LiTS and BraTS datasets are reported in Table 1. Our proposed method is referred to as SCE, SCE+fixed, SCE+random, or SCE+linear according to the boundary-aware sampling method. Note that we did not search for the best-performing sampling strategy on each dataset, since we aimed to demonstrate that our proposed contrastive loss with boundary-aware sampling is robust and widely applicable to various imaging modalities. We report the means and standard deviations over five runs and conducted one-tailed t-tests to rigorously verify the statistical significance of the performance improvements. The results on the LiTS dataset show that the proposed supervised contrastive embedding considerably improves the baseline regardless of the network architecture. For example, SCE improves the Dice coefficient from 85.43%, 86.11%, and 86.41% to 87.57%, 88.34%, and 87.58% for U-Net, U-Net++, and DeepLabV3+, respectively. In addition, we observe that applying the contrastive loss with boundary-aware sampling further enhances the segmentation performance. For U-Net++ and DeepLabV3+, random and linear boundary-aware sampling achieved better performance than plain SCE on the majority of metrics. In the case of U-Net, linear boundary-aware sampling outperformed the other sampling methods. Note that these performance improvements are statistically significant. For the BraTS dataset, we observe similar results: the proposed supervised contrastive embedding improves the segmentation models, and boundary-aware sampling further enhances their performance. For example, all performance metrics of U-Net benefit from the proposed contrastive loss, especially when combined with the linear boundary-aware sampling strategy.
For U-Net++ and DeepLabV3+, the baseline trained with only the segmentation loss can be boosted by using supervised contrastive embedding, while their recall, precision, and NLL values are further improved via our boundary-aware sampling method.
The experimental results confirm that learning the semantic embedding property is required for superior segmentation models and that all state-of-the-art architectures can benefit from learning that property. The linear boundary-aware sampling method was selected for the subsequent experiments since it showed consistent performance improvements across all datasets and architectures.
For qualitative comparison, we provide some segmentation results of U-Net on the LiTS and BraTS datasets in Fig. 3. On the LiTS dataset, we observe that applying the contrastive loss with boundary-aware sampling decreases false positive predictions and improves the prediction of very small liver regions compared to the baseline. For example, in the second row of Fig. 3(a), U-Net trained with our proposed method correctly classifies all areas on the right side of the image that are misclassified as positive by the baseline. Similar results can be observed on the BraTS dataset. For instance, the third row of Fig. 3(b) shows a case where the positive region is very small. In this case, SCE localized the positive region, and performed even better with boundary-aware sampling, whereas the baseline could not find any positive area at all.
We conducted further analysis to identify whether the performance improvements are indeed due to the encoder's embedding property. Fig. 4(a) shows the U-Net Dice score curves and contrastive loss on the LiTS training dataset. The baseline's contrastive loss does not decrease as training progresses, whereas the Dice score increases. This implies that the general training scheme does not guarantee that encoders will acquire the semantic embedding property. Without any constraints on feature representation, the segmentation models learn how to produce precise segmentation results without attention to learning semantically meaningful representations.
To provide a more intuitive understanding of the embedding property, we visualize randomly sampled feature vectors from the encoder of baseline and SCE in Fig. 4(b) by using t-SNE. The baseline visualization shows that the positive and negative vectors are confused and the feature vectors representing the boundary area are barely distinguishable. However, the positive and negative feature vectors are significantly more separable when applying contrastive loss. When applying contrastive loss with boundary-aware sampling, the feature vectors representing boundary areas are well grouped according to their corresponding class. These results confirm that, as we intended, the proposed contrastive loss helps the encoder to learn the semantic embedding property, thereby inducing the reported improvements.
Two hyperparameters, the numbers of positive and negative feature samples, must be determined for SCE. We conducted an ablation study to investigate the impact of these hyperparameters on the segmentation performance. Fig. 5 shows the precision, recall, and Dice scores of the proposed method with respect to the number of positive and negative samples drawn from each mini-batch. The number of positive samples does not significantly affect the segmentation performance, as shown in Fig. 5(a). Conversely, Fig. 5(b) shows that the number of negative samples is important for the segmentation performance of SCE: the more negative samples are used, the better the performance. Note that this finding is consistent with previous studies [20], [34], which showed that a large number of negative samples is required for self-supervised contrastive learning.
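For illustration, a feature-wise supervised contrastive loss with fixed numbers of sampled positives and negatives might be sketched in NumPy as follows. The sampling counts, temperature, and exact loss form here are our own assumptions and may differ from the paper's formulation.

```python
import numpy as np

def supervised_contrastive_loss(feats, labels, n_pos=32, n_neg=256,
                                tau=0.1, rng=None):
    """Contrast sampled positive anchors against sampled negatives.

    feats:  (N, D) L2-normalized feature vectors from the encoder.
    labels: (N,) feature-level labels in {0, 1} derived from the mask.
    """
    rng = rng or np.random.default_rng(0)
    pos_idx = rng.choice(np.flatnonzero(labels == 1), n_pos, replace=True)
    neg_idx = rng.choice(np.flatnonzero(labels == 0), n_neg, replace=True)

    anchors = feats[pos_idx]                      # (n_pos, D)
    pos_sim = anchors @ feats[pos_idx].T / tau    # (n_pos, n_pos)
    neg_sim = anchors @ feats[neg_idx].T / tau    # (n_pos, n_neg)

    # For each anchor: -log( sum exp(pos) / (sum exp(pos) + sum exp(neg)) )
    all_sim = np.concatenate([pos_sim, neg_sim], axis=1)
    log_denom = np.log(np.exp(all_sim).sum(axis=1))
    log_num = np.log(np.exp(pos_sim).sum(axis=1))
    return float(np.mean(log_denom - log_num))
```

Increasing `n_neg` enlarges the denominator's negative term, which is consistent with the observation that more negatives sharpen the contrast.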

E. DOMAIN ROBUSTNESS
We conducted additional experiments to verify that the proposed method can also be effective for improving domain robustness. Since SCE contrasts the feature vectors from different classes across inputs, it can be expected to learn domain invariant representations when trained on multi-source domains. Intuitively, such representations should improve the domain robustness of a model and may also improve performance on unseen domains.
To examine the effect of the proposed method on domain robustness, we use the lung segmentation and spinal cord segmentation datasets, which consist of three and four independent domains, respectively. Specifically, one domain was adopted as the target domain and the remaining domains were considered source domains. All models were trained on the source domains and evaluated on both the source and target domains. Note that the target domain is not accessible during training and can thus be interpreted as an entirely unseen domain. We evaluated the baseline, SCE, and SCE with linear boundary-aware sampling using three representative measures: the overlap measure Dice score, the distance measure ACD, and the probabilistic measure NLL.
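Two of these measures are straightforward to compute; the following NumPy sketch gives plausible implementations of the Dice score and NLL (ACD requires contour extraction and is omitted). The function names and smoothing constant are our own assumptions, not the paper's exact definitions.

```python
import numpy as np

def dice_score(pred, target, eps=1e-7):
    """Overlap measure for binary masks of the same shape."""
    inter = np.sum(pred * target)
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def nll(prob, target, eps=1e-7):
    """Mean negative log-likelihood of the ground truth under the
    predicted foreground probabilities (binary cross-entropy)."""
    prob = np.clip(prob, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(prob)
                          + (1 - target) * np.log(1 - prob)))
```

Dice is computed on thresholded predictions, whereas NLL is computed on the raw probabilities, which is why it serves as the probabilistic measure here.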

1) Lung Segmentation
The comparison results on the lung segmentation task are summarized in Table 2. From the results, we observe that applying contrastive loss improves all performance measures on the unseen target domain. For example, in the setting of "JSRT, MC (source domains) → SZ (target domain)", the proposed method improves the Dice scores of U-Net, U-Net++, and DeepLabV3+ from 94.37%, 94.51%, and 94.26% to 94.97%, 94.83%, and 94.58%, respectively. The performance of U-Net++ and DeepLabV3+ was further enhanced by linear boundary-aware sampling. In the setting of "MC, SZ → JSRT", all architectures benefit from SCE, especially DeepLabV3+. The results from the "SZ, JSRT → MC" setting lead to a similar conclusion, except that the effect of boundary-aware sampling is not clearly observed.
On the source domains, the proposed method enhances the segmentation performance in only a few cases. We believe this is because lung segmentation on CXRs is a relatively easy task and even the baseline achieves fairly high segmentation performance. Nevertheless, the proposed SCE clearly improves the robustness of the models with respect to domain generalization because it enables them to learn domain invariant representations.
The qualitative results of lung segmentation on the target domains are presented in Fig. 6(a). Compared to the baseline, we observe that applying contrastive loss reduces false positive regions, which are reduced even further with boundary-aware sampling. These results confirm that the proposed method can also improve segmentation performance on unseen domains by contrasting positive and negative regions across multi-source domains.

2) Spinal Cord Gray Matter Segmentation
We next consider the spinal cord gray matter segmentation task to demonstrate the effectiveness of our proposed method on domain generalization. Table 3 summarizes the comparison results. From the results, we can see that the proposed method enhances segmentation performance in the majority of cases on both the source and target domains. For example, in the "2, 3, 4 → 1" setting with U-Net++, SCE improves the Dice score from 87.83% and 84.21% to 88.37% and 85.39% on the source and target domains, respectively. However, the advantages of boundary-aware sampling were not clearly observed in this experiment. Nevertheless, these results are sufficient to confirm that ensuring the semantic embedding property improves the segmentation performance on the source domains and helps the model generalize to the unseen domain.

The qualitative comparison results of spinal cord gray matter segmentation on each target domain are shown in Fig. 6(b). When the target domain was site 1 or site 3, the small false positive regions generated by the baseline decreased with SCE and were entirely removed when boundary-aware sampling was also applied. Conversely, the baseline misclassified some positive regions as negatives when the target domain was site 2 or site 4. These false negative regions were completely corrected by supervised contrastive embedding with boundary-aware sampling. These results, which are consistent with the other qualitative results, confirm the advantages of our proposed method.

V. CONCLUSION
In this work, we argue that typical training schemes for deep segmentation networks do not ensure the desirable semantic embedding property of encoder-decoder networks. To enable the encoder to learn this property during training, we introduce a novel framework that utilizes supervised contrastive learning for the segmentation task. We extend the existing contrastive loss to contrast feature representations from the encoder by assigning feature-level labels derived from the ground-truth masks. In addition, we propose a boundary-aware sampling strategy that samples the feature vectors in a given mini-batch that correspond to the boundary area, which can be treated as a particularly informative area for improving segmentation performance. Extensive experiments on liver and brain tumor segmentation with popular architectures show that the proposed SCE method effectively improves segmentation performance across several metrics. Through further analyses of training curves and feature visualization, we confirm that the performance improvements indeed stem from the acquired semantic embedding property. We conducted further experiments on lung and spinal cord gray matter segmentation to demonstrate that the proposed method boosts the domain robustness of segmentation networks by learning domain invariant representations. From the results, we confirm that segmentation models trained with the proposed method are more robust to domain shift.
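As an illustration of the boundary-aware sampling idea summarized above, the following NumPy sketch marks the feature-map positions whose 3x3 neighborhood in the downsampled ground-truth mask contains both classes; sampling would then be restricted to (or weighted toward) this band. The neighborhood size and zero-padding at the border are implementation assumptions for illustration only.

```python
import numpy as np

def _shifted_stack(m):
    """Stack of m shifted by one pixel in all 8 directions plus itself."""
    p = np.pad(m, 1)  # zero-pad so border positions have a full 3x3 window
    return np.stack([p[1 + dy:1 + dy + m.shape[0], 1 + dx:1 + dx + m.shape[1]]
                     for dy in (-1, 0, 1) for dx in (-1, 0, 1)])

def boundary_band(mask):
    """1 where a 3x3 neighborhood contains both classes (the boundary area)."""
    s = _shifted_stack(mask)
    return (s.max(axis=0) != s.min(axis=0)).astype(mask.dtype)
```

Positions where `boundary_band` is 1 would supply the boundary feature vectors for the contrastive loss, focusing the contrast on the most ambiguous region of the prediction.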
Although the proposed method clearly shows advantages in terms of segmentation performance and domain robustness, some limitations exist. The first is the slightly increased computational cost: we mitigate this issue through boundary-aware sampling, but the method still requires more computation than standard training schemes. Another limitation is that boundary-aware sampling may not be effective for datasets with small regions of interest when the segmentation network has a large receptive field. Addressing these limitations would be valuable future research.