Unsupervised Domain Adaptation Network With Category-Centric Prototype Aligner for Biomedical Image Segmentation

With the widespread success of deep learning in biomedical image segmentation, domain shift has become a critical and challenging problem, as the gap between two domains can severely affect model performance when a model is deployed on unseen data with heterogeneous features. To alleviate this problem, we present a novel unsupervised domain adaptation network for generalizing models learned from a labeled source domain to an unlabeled target domain in cross-modality biomedical image segmentation. Specifically, our approach consists of two key modules, a conditional domain discriminator (CDD) and a category-centric prototype aligner (CCPA). The CDD, extended from conditional domain adversarial networks in classification tasks, is effective and robust in handling complex cross-modality biomedical images. The CCPA, improved from the graph-induced prototype alignment mechanism in cross-domain object detection, exploits precise instance-level features through an elaborate prototype representation. In addition, it addresses the negative effect of class imbalance via an entropy-based loss. Extensive experiments on a public benchmark for the cardiac substructure segmentation task demonstrate that our method significantly improves performance on the target domain.


I. INTRODUCTION
Deep neural networks have achieved remarkable success in recent years; a variety of challenging medical imaging problems, such as biomedical image segmentation [1], [2], have witnessed breakthroughs when a large quantity of labeled data is used and the training and testing data are sampled from the same distribution [3]. However, deep neural networks deployed in real-life applications usually suffer from the domain shift problem [4]. In biomedical imaging scenarios, this problem is even more pronounced, as biomedical images have very different characteristics when they are acquired with different acquisition parameters or modalities [5], [6], such as magnetic resonance imaging (MRI) and computed tomography (CT). In addition, manual annotation of biomedical images is a time-consuming and expensive task. Compared to natural images, domain adaptation is therefore more challenging on cross-modality biomedical images.
To address these problems, unsupervised domain adaptation (UDA) has been intensively studied to generalize well-trained models to unlabeled target data, seeking to transfer knowledge from labeled training data (source domain) to test data (target domain). To obtain domain-invariant feature representations across both domains, common methods for domain adaptation can be roughly divided into two types: 1) minimizing an explicitly defined distance measure [7], [8], and 2) applying adversarial training to align the latent feature space [9], [10].

(* indicates equal contribution. P. Gong, W. Yu, Q. Sun, and J. Hu are with the School of Medical Imaging, Xuzhou Medical University, Xuzhou, 221000, China. R. Zhao is with the Department of Computing, The Hong Kong Polytechnic University, Kowloon 999077, Hong Kong. Email: W. Yu, yuwenwen62@gmail.com.)
For instance, Gretton et al. [11] minimized the maximum mean discrepancy (MMD) distance between the source and target domains, and Long et al. [8] proposed the multikernel MMD distance. Other works based on adversarial training employed a domain classifier to facilitate domain invariance at the input level [12], feature level [9] (Figure 1(a)), output level, or their combinations [6] (Figure 1(b)). Nevertheless, balancing the ratio between multilevel discriminators requires empirical design and tuning time for UDA and can severely affect the domain adaptation performance. Furthermore, multilevel discriminators may conflict with one another in the adversarial adaptation procedure, given the complex cross-modality combinations between domains. In addition, domain classifiers tend to capture global-level discrepancies across the two domains while ignoring the distinct modal information of instances in the cross-modality scenario, especially when neighboring structures have unclear boundaries or relatively homogeneous tissues [6]; as a result, combining feature-level and instance-level alignment of the source and target domains makes it difficult to perfectly address domain shift [13]-[16].
Motivated by these problems, we propose an unsupervised domain adaptation network with a category-centric prototype aligner for biomedical image segmentation, illustrated in Figure 2. Specifically, we introduce two key components, a conditional domain discriminator (CDD) and a category-centric prototype aligner (CCPA). Our method incorporates the CDD module, inspired by [17], into the existing architecture to model the multimodal information and joint distributions of multilevel features via a randomized router, which makes it easier to fully capture multiplicative interactions between feature representation and segmentation prediction for domain adaptation. In the CCPA, we extend category-level prototype alignment to cross-domain segmentation tasks, inspired by [13], [14], to guarantee discriminability by exploiting more precise instance-level features. In addition, the CCPA mitigates the negative effect of class imbalance on domain adaptation via an entropy-based loss that controls the adaptation process during training. The main contributions of this paper can be summarized as follows:
• We present a novel framework for UDA that is more effective and robust in handling complex cross-modality biomedical image segmentation. It efficiently balances multilevel feature discriminators without the contradictions that hamper UDA.
• We introduce the CCPA module into the image segmentation framework, which exploits more precise instance-level features and mitigates the negative effect of class imbalance via an entropy-based loss.
• We conduct experiments on a public benchmark for cardiac substructure segmentation and show that our method significantly improves performance on the target domain.

II. RELATED WORK
The purpose of UDA is to generalize a model learned from the labeled source domain to the unlabeled target domain. A large number of adaptive methods have been proposed from different perspectives, including input-level, feature-level, and output-level adaptation and their combinations. Broadly, UDA methods fall into two categories: 1) minimizing a specific domain discrepancy metric, and 2) applying adversarial training to align the latent feature space. In this section, we give a brief review of related work in both areas. More detailed reviews of UDA for image segmentation can be found in [18]-[22].
In the field of UDA, early research mainly focused on aligning the distributions of the feature space by minimizing distance measures. For example, Tzeng et al. [23] used the MMD distance as a minimization target between the source and target domains. Shen et al. [24] learned domain-invariant feature representations via Wasserstein-distance-guided representation learning. Long et al. [8] presented the multikernel MMD distance to reduce the domain discrepancy. Yan et al. [7] further extended this work and employed a weighted MMD with a task-specific loss for domain adaptation. Ding et al. [25] proposed an adaptive exploration method that maximizes the distances between all target images, minimizes the distances between similar target images, and addresses the domain-shift problem for person re-identification (re-ID) in an unsupervised manner. Fan et al. [26] proposed a progressive unsupervised learning method that transfers pretrained deep representations to unseen domains via clustering, improving re-ID accuracy and producing CNN models with high discriminative ability. Nevertheless, both [25] and [26] target re-ID. In addition, they differ from our method in that [25] minimizes distances between an image and its neighbors and maximizes distances between an image and other images at the feature level, and [26] performs clustering at the feature level. In contrast, our method not only applies a conditional domain discriminator at the feature and instance levels, but also aligns category-centric prototypes at the instance level.
More recently, with the advent of generative adversarial networks (GANs) [27], another line of research has been based on adversarial training. For instance, Ganin et al. [9] aligned the feature distributions across the two domains through standard backpropagation training via a simple new gradient reversal layer. Tzeng et al. [28] introduced a more flexible adversarial learning framework with untied weight sharing to address the problem of domain shift. With the widespread success of CycleGAN [29] in unpaired image-to-image translation, many previous image adaptation efforts were based on modified CycleGANs, with applications to both natural datasets [30], [31] and medical image segmentation [32]-[34]. Overall, our framework belongs to the latter, adversarial-learning-based category.
For biomedical image segmentation applications, where modalities, scanners, and imaging protocols commonly vary, domain shift has become a critical and challenging problem. Ghafoorian et al. [35] reduced the required number of labels in the target domain via transfer learning for brain MRI lesion segmentation. Opbroek et al. [36] presented a novel patch-based output-space adversarial learning framework to jointly and robustly segment the optic disc and optic cup from different fundus image datasets, achieving effective feature alignment. However, these works did not target unsupervised domain adaptation for cross-modality biomedical images. In the field of cross-modality biomedical segmentation, Dou et al. [5], [6] employed adversarial learning to adapt the early-layer feature distributions while keeping the higher-layer features fixed; however, their method requires comprehensive empirical studies to determine the optimal adaptation depth. Chen et al. [33], [34] used deeply synergistic image and feature alignment for unsupervised bidirectional cross-modality adaptive segmentation and achieved better domain performance, but because of its GAN-based nature, the training process is complex.
The works most related to our method are [13], [17], based on the graph-induced prototype alignment mechanism and conditional domain adversarial networks, respectively, but they still differ from ours in several aspects. Neither is used for unsupervised domain adaptation in cross-modality biomedical image segmentation. More specifically, the difference between [17] and our proposed CDD module is that conditional adversarial domain adaptation is a common operation in classification tasks but is rarely applied to segmentation tasks, especially unsupervised domain adaptation for cross-modality medical image segmentation. The difference between [13] and the proposed CCPA is that [13] is only suitable for object detection because it performs graph-induced prototype alignment at the bounding-box level, whereas the proposed CCPA module extends [13] to the pixel level for segmentation. We use a learnable graph attention layer to obtain the adjacency matrix of the graph at the pixel level instead of calculating weights between region proposals at the bounding-box level.

III. METHOD
In this section, we provide a detailed description of our proposed method for UDA in biomedical image segmentation. Figure 2 gives the overall architecture, which contains three modules. To ease understanding, our full model is described in parts. We begin by introducing the notation used in this paper in Section III-A. Our segmenter is described in Section III-B, and the proposed CDD mechanism in Section III-C. Finally, Section III-D shows how the CCPA module works.

A. Notation
In UDA, we are given N_s labeled examples from the source domain and N_t unlabeled examples from the target domain, denoted by D_s = {(x_1^s, y_1^s), ..., (x_{N_s}^s, y_{N_s}^s)} and D_t = {x_1^t, ..., x_{N_t}^t}, respectively, where x_i^s is the image of the i-th source-domain sample and follows the distribution P_s, and y_i^s is its corresponding one-hot segmentation label. Similarly, x_i^t follows the distribution P_t, and the i.i.d. assumption is violated since P_s ≠ P_t.
For ease of notation, we directly use x^s and y^s to represent a sample and its label from the source domain, omitting the subscript index i in the following subsections.

B. Segmenter
As shown in Figure 2, the segmenter module occupies the top of the diagram and involves two steps for the source domain. Specifically, an image sample x^s is first forwarded through the feature extractor G, yielding image features F^s ∈ R^{H×W×d} in the latent feature space, where H, W, and d denote the height and width of the output map and the dimension of the feature embedding, respectively. Next, to perform segmentation on the image features, the mask predictor P learns a mapping from image features to the label space Y^s by supervised learning. Formally, the prediction map M^s ∈ R^{H×W×N_c}, where N_c denotes the number of classes, is defined as

M^s = P(G(x^s; Θ_g); Θ_p),    (1)

where Θ_p and Θ_g represent the learned parameters of the mask predictor and the feature extractor, respectively. Similarly, as shown at the bottom of Figure 2, the target-domain sample x^t goes through the same procedure to generate F^t ∈ R^{H×W×d} and M^t ∈ R^{H×W×N_c} during the domain adaptation phase, and these are also fed into the CDD and CCPA. Θ_p and Θ_g are shared between the source and target domains. In practice, we apply dilated residual blocks [41] in the feature extractor G to obtain a large receptive field and preserve the spatial acuity of feature maps. For dense prediction in the segmentation task, an upsampling operation followed by a softmax layer is used in the mask predictor P to produce per-pixel probability predictions.
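The interface above (feature extractor G followed by mask predictor P) can be sketched as a toy PyTorch module; the layer sizes are illustrative assumptions, not the paper's DRN backbone:

```python
import torch
import torch.nn as nn

class Segmenter(nn.Module):
    """Minimal sketch of the segmenter: a feature extractor G followed by a
    mask predictor P (1x1 conv + upsampling + softmax). The paper uses
    dilated residual blocks (DRN) for G; this toy version only mirrors the
    interface and output shapes, and all layer sizes here are illustrative."""

    def __init__(self, in_ch=1, d=32, num_classes=5):
        super().__init__()
        self.G = nn.Sequential(                      # feature extractor
            nn.Conv2d(in_ch, d, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(d, d, 3, padding=1), nn.ReLU(),
        )
        self.P = nn.Sequential(                      # mask predictor
            nn.Conv2d(d, num_classes, 1),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
        )

    def forward(self, x):
        feat = self.G(x)                              # F: (B, d, H/2, W/2)
        mask = torch.softmax(self.P(feat), dim=1)     # M: per-pixel class probs
        return feat, mask
```

Both F and M are returned because, as described above, they are consumed downstream by the CDD and CCPA modules.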
To transfer label information from the source domain to the target domain, supervised learning on the source domain D_s is an essential part of domain adaptation. The loss L_seg is defined as

L_seg = L_ce(M^s, y^s) + η · L_Dice(M^s, y^s),    (2)

where y^s ∈ R^{H×W×N_c} denotes the one-hot labels and η ∈ [0, 1] is a trade-off parameter. The first term is the cross-entropy loss for pixelwise classification. The second term is the Dice loss over the multiple cardiac structures, which is commonly employed in biomedical image segmentation. We combine these two complementary losses into a hybrid objective to address the challenging heart segmentation task.
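A minimal sketch of this hybrid loss, assuming η weights the Dice term and that both terms are computed from softmax probabilities (the paper's exact normalization may differ):

```python
import torch

def hybrid_seg_loss(pred, target_onehot, eta=1.0, eps=1e-6):
    """Cross-entropy plus eta * soft Dice over a (B, C, H, W) probability
    map and one-hot target. A sketch of the hybrid objective described in
    the text, not the authors' exact implementation."""
    # pixel-wise cross-entropy against one-hot labels
    ce = -(target_onehot * torch.log(pred.clamp_min(eps))).sum(dim=1).mean()
    # soft Dice averaged over classes (batch and spatial dims pooled)
    dims = (0, 2, 3)
    inter = (pred * target_onehot).sum(dims)
    union = pred.sum(dims) + target_onehot.sum(dims)
    dice = 1.0 - ((2.0 * inter + eps) / (union + eps)).mean()
    return ce + eta * dice
```

A perfect prediction drives both terms to zero, so the loss vanishes exactly when the softmax output matches the one-hot labels.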

C. Conditional Domain Discriminator
Existing UDA works [5], [6] use dual domain discriminators whose inputs are, respectively, the latent features and the predicted segmentation masks of the source and target domains. These works aim to align the latent feature space of the target domain to that of the source domain, with a more explicit constraint on the shape of the segmentation masks. Although the motivation of [5], [6] is sound and performs well on their datasets, balancing the ratio between the dual discriminators can severely affect domain adaptation performance, as it requires empirical design and tuning time. Additionally, owing to the lack of a supervisory signal on the target domain, the predicted map M^t likely contains noise or inaccurate boundaries, which makes it harder to align the source and target domains at the instance level when the data distributions embody complex multimodal structures.
In this regard, we incorporate the CDD, inspired by [17], into the existing architecture (left-middle of Figure 2) to effectively align the multimodal distributions native to segmentation problems across domains. This differs from aligning the features and segmentation maps separately [42], [43]. Notably, this module captures the cross-covariance between feature representations and segmentation predictions to improve transferability. The key to the CDD is a conditional domain discriminator conditioned on the cross-covariance of domain-specific feature representations and segmentation predictions. Figure 1(c) gives a detailed illustration of the CDD. Given the two inputs F and M from the source and target domains, the CDD models the multimodal information and the joint distribution of F and M via a randomized router, which makes it easier to fully capture multiplicative interactions between feature representation and segmentation prediction. The conditional domain adversarial loss then measures the domain discrepancy by training the domain discriminator D_cdd in a conditional manner:

L_cdd = −E_{x^s ∼ D_s}[log D_cdd(J(F^s, M^s))] − E_{x^t ∼ D_t}[log(1 − D_cdd(J(F^t, M^t)))],    (3)

where J is an explicit randomized multilinear map that converts F and M into a single tensor, the joint variable of domain-specific feature representation and segmentation prediction used for adversarial adaptation. We define

J(F, M) = (1/√d_o) (R_F F) ⊙ (R_M M),    (4)

where ⊙ denotes the element-wise product, d_o is the dimension of the output tensor, and R_F ∈ R^{d×d_o} and R_M ∈ R^{N_c×d_o} are random matrices whose elements R_ij follow a symmetric distribution. In practice, R_F and R_M are sampled only once, from a uniform or Gaussian distribution, and fixed during training. We restrict d_o ≪ d × N_c to avoid dimension explosion [17], [44], [45].
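The randomized multilinear map J described above can be sketched as follows; the output dimension d_out and the Gaussian sampling are assumptions within the constraints stated in the text (random matrices sampled once, fixed during training, d_o much smaller than d × N_c):

```python
import torch
import torch.nn as nn

class RandomizedMultilinearMap(nn.Module):
    """CDAN-style conditioning: approximate the flattened outer product of
    features and predictions with two fixed random projections combined by
    an element-wise product. Per-pixel shapes: f is (N, d), m is (N, Nc);
    the output is (N, d_out). A sketch; d_out and the sampling distribution
    are assumptions."""

    def __init__(self, d, num_classes, d_out=1024):
        super().__init__()
        # sampled once and frozen (registered as buffers, not parameters)
        self.register_buffer("Rf", torch.randn(d, d_out))
        self.register_buffer("Rm", torch.randn(num_classes, d_out))
        self.scale = d_out ** -0.5

    def forward(self, f, m):
        # (R_F f) ⊙ (R_M m) / sqrt(d_out) approximates f ⊗ m
        return self.scale * (f @ self.Rf) * (m @ self.Rm)
```

Registering R_F and R_M as buffers rather than parameters keeps them out of the optimizer, matching the "sampled once and fixed" behavior.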
One common optimization approach for adversarial networks in the unsupervised setting follows the training rules of GANs [46], implemented as a minimax two-player game between a generator and a discriminator. An alternative utilizes a gradient reversal layer (GRL) [9] inserted between the feature extractor and the CDD. In practice, we use the latter, aligning the domain statistics through the CDD that predicts the domain.
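A minimal implementation of the gradient reversal layer used in this optimization scheme (the scaling coefficient `lam` is the usual GRL hyper-parameter):

```python
import torch

class GradReverse(torch.autograd.Function):
    """Gradient reversal: identity on the forward pass, multiplies the
    incoming gradient by -lam on the backward pass, so the feature
    extractor is trained to fool the domain discriminator."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        # no gradient w.r.t. lam itself, hence the trailing None
        return -ctx.lam * grad_out, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)
```

Inserting `grad_reverse` between G and the discriminator lets a single optimizer step update both adversarially, without alternating GAN-style updates.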

D. Category-Centric Prototype Aligner
The gap between the two domains can severely degrade model performance. Furthermore, as different instances commonly embody distinct modal information in the cross-modality scenario, especially when neighboring structures have unclear boundaries or relatively homogeneous tissues [6], combining feature-level and instance-level alignment of the source and target domains makes it difficult to perfectly address domain shift [13]-[16]. In this regard, in addition to the CDD module, we propose the CCPA, shown in the right-middle of Figure 2 and inspired by [13], which aligns the source and target domains with category-centric prototype representations in five steps. The detailed flowchart is illustrated in Figure 3. In addition, to relieve the negative effect of class imbalance [47] on domain adaptation, we design a category-reweighted contrastive loss to balance the training process.

Graph Attention Layer. As described in Section III-B, the feature representation F and segmentation prediction M are generated by the segmenter on both domains, which is the first step of the flowchart in Figure 3(a). We then use a graph attention layer [48], [49] to produce an adjacency matrix A ∈ R^{HW×HW} that models the relationships between pixels. Intuitively, two spatially closer pixels are more likely to depict the same class and should be assigned a higher connection weight. Following this idea, the adjacency matrix is obtained as

A_ij = softmax_j(LeakyReLU(w_1^T F_i + w_2^T F_j)),    (5)

where w_i ∈ R^d are learnable weight vectors. To avoid vanishing gradients during training, we use LeakyReLU instead of the ReLU activation. In practice, to limit computational and memory cost, we perform graph attention on a downsampled feature representation F and then upsample the matrix A to the original resolution.

Graph Convolution Layer. Because segmentation predictions deviate around the ground-truth boundary, the initial segmentation prediction conveys incomplete instance information, which leads to inaccurate instance representations. A natural remedy is to aggregate the segmentation prediction features belonging to the same instance to obtain exact instance-level feature representations. Specifically, more exact instance-level representations are calculated from the adjacency matrix A, which encodes the spatial relevance, the image features F, and the segmentation confidence M:

F̃ = σ(A F W_F),    (6)
M̃ = σ(A M W_M),    (7)

where W_F ∈ R^{d×d} and W_M ∈ R^{N_c×N_c} are learnable weight matrices and σ(·) = max(0, ·) is a nonlinear activation function. After the graph convolution in Eqs. 6 and 7, F̃ and M̃ aggregate more precise instance-level information through information propagation among adjacent pixels.
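The graph attention and graph convolution steps above can be sketched on a flattened pixel grid as below; the attention parameterization through two learnable vectors `w1`, `w2` is an assumption consistent with the learnable weight vectors mentioned in the text:

```python
import torch
import torch.nn.functional as F

def pixel_adjacency(feat, w1, w2):
    """Graph attention sketch: a score for every pixel pair built from two
    learnable vectors w1, w2 (the paper's exact parameterization may
    differ). feat: (N, d) flattened pixel features; returns a
    row-normalized (N, N) adjacency matrix."""
    src = feat @ w1                     # (N,) per-pixel "source" score
    dst = feat @ w2                     # (N,) per-pixel "target" score
    scores = F.leaky_relu(src[:, None] + dst[None, :], negative_slope=0.2)
    return torch.softmax(scores, dim=1)

def graph_aggregate(adj, feat, pred, Wf, Wm):
    """Graph convolution in the style of Eqs. 6-7: propagate features and
    segmentation confidence along the pixel graph, then apply ReLU."""
    feat_agg = torch.relu(adj @ feat @ Wf)   # F-tilde, (N, d)
    pred_agg = torch.relu(adj @ pred @ Wm)   # M-tilde, (N, Nc)
    return feat_agg, pred_agg
```

Because the adjacency matrix is HW × HW, running this on a downsampled grid, as the text describes, is what keeps the memory footprint manageable.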
Category-Centric Prototype. After obtaining feature representations aggregated at the instance level, we employ confidence-guided merging to integrate the multimodal information reflected by different instances into prototype representations. The category-centric prototypes C ∈ R^{N_c×d} are computed as the confidence-weighted mean of the feature representation F̃:

C = α^T F̃,    (8)

where α ∈ R^{HW×N_c} is the segmentation confidence normalized per class. The derived prototypes C serve as proxies of each class during subsequent domain alignment.

Prototype Aligner. Following the heuristic rules of prototype-based methods [13], [15] for unsupervised domain alignment, we minimize an intraclass loss, L_intra, to narrow the distance between prototypes of the same category across the two domains, and an interclass loss, L_inter, to keep prototypes of different categories apart. Furthermore, since segmentation tasks usually exhibit class imbalance [47], we reweight L_intra and L_inter using an entropy-based map [50]. The underlying idea is that hard samples or sample-scarce categories produce high-entropy predictions on both domains, whereas categories with abundant samples are trained more sufficiently, are better aligned, and thus yield higher-confidence predictions.
In this regard, we assign higher weights to the sample-scarce categories during the training process of domain adaptation.
The weight w_k assigned to the k-th category is computed from the entropy-based map, so that higher-entropy (sample-scarce) categories receive higher weights. The intraclass loss requires prototypes of the same category to be as close as possible, while the interclass loss constrains the distance between prototypes of different classes to be larger than a margin. The category-centric prototype domain adaptation loss L_ccpa consists of one intraclass loss and three interclass losses, covering all pairwise relations between the prototypes of the two domains:

L_ccpa = L_intra(D_s, D_t) + (1/3) [L_inter(D_s, D_s) + L_inter(D_s, D_t) + L_inter(D_t, D_t)],    (10)

where Φ(c, c') = ||c − c'||_2 is the Euclidean distance between two prototypes, {C_i^s}_{i=1}^{N_c} and {C_i^t}_{i=1}^{N_c} represent the prototypes of the source and target domains, D and D' denote the two domains from which prototype pairs of different categories are drawn in L_inter(D, D'), and m is the margin term, fixed to 1.0 in all experiments.
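The prototype computation and alignment losses can be sketched as follows; how the entropy-based weights enter the intraclass term and the squared hinge in the interclass term are assumptions, since the exact formulas are not reproduced in the text:

```python
import torch

def category_prototypes(feat, conf):
    """Confidence-weighted class prototypes: conf (N, Nc) is normalized per
    class so each prototype is a weighted mean of the aggregated pixel
    features feat (N, d). Returns (Nc, d)."""
    alpha = conf / conf.sum(dim=0, keepdim=True).clamp_min(1e-8)
    return alpha.t() @ feat

def intra_loss(proto_s, proto_t, weights=None):
    """Pull same-category prototypes of the two domains together.
    `weights` is the optional entropy-based per-class reweighting (an
    assumption on how the reweighting enters the loss)."""
    d2 = ((proto_s - proto_t) ** 2).sum(dim=1)
    if weights is not None:
        d2 = d2 * weights
    return d2.mean()

def inter_loss(proto_a, proto_b, margin=1.0):
    """Margin-based interclass term: push prototypes of different
    categories at least `margin` apart (squared hinge on the pairwise
    Euclidean distance; the squaring is an assumption)."""
    dist = torch.cdist(proto_a, proto_b)                    # (Nc, Nc)
    off_diag = ~torch.eye(len(proto_a), dtype=torch.bool)
    return torch.clamp(margin - dist[off_diag], min=0).pow(2).mean()
```

Identical same-class prototypes give zero intraclass loss, and prototypes farther apart than the margin give zero interclass loss, matching the intended behavior of the aligner.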
The parameters of the whole network are jointly trained by minimizing the total loss

L_total = L_seg + λ_1 L_cdd + λ_2 L_ccpa,    (13)

where L_seg, L_cdd, and L_ccpa are defined in Eqs. 2, 3, and 10, respectively, and λ_1, λ_2 ∈ [0, 1] are trade-off parameters.

A. Experimental Details
Datasets. We use the medical cross-modality domain adaptation benchmark proposed in [6] to validate our unsupervised cross-modality domain adaptation method for biomedical image segmentation. This dataset contains training (16 subjects) and testing (4 subjects) sets for each modality and is collected from the public MICCAI 2017 Multi-Modality Whole-Heart Segmentation dataset [51], which consists of 20 unpaired CT and 20 MRI images from 40 patients. The CT and MRI images were obtained in different clinical centers, and the cardiac structures were manually annotated by radiologists for both modalities. Our segmenter aims to automatically segment four cardiac structures: the ascending aorta (AA), the left atrium blood cavity (LA-blood), the left ventricle blood cavity (LV-blood), and the myocardium of the left ventricle (LV-myo). All volumetric MRI and CT images and the corresponding labels were preprocessed as in [5], [6].

Evaluation metrics. We employ the Dice coefficient ([%]) to evaluate the agreement between the ground truth and the predicted segmentation of the cardiac structures. In addition, we calculate the average surface distance (ASD [voxel]) to measure segmentation performance from the perspective of the boundary. A higher Dice and a lower ASD reflect better segmentation. Both metrics are reported as mean±std, showing the average performance as well as the cross-subject variation of the results.
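The Dice metric used for evaluation can be computed per structure as a simple overlap ratio; this sketch uses the standard definition and omits any resampling or per-subject averaging the benchmark may apply:

```python
import numpy as np

def dice_coefficient(pred, gt, eps=1e-8):
    """Dice overlap in percent between two binary masks:
    100 * 2|P ∩ G| / (|P| + |G|). Identical masks score 100."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    inter = np.logical_and(pred, gt).sum()
    return 100.0 * 2.0 * inter / (pred.sum() + gt.sum() + eps)
```

The ASD, by contrast, is a boundary-distance metric and typically requires surface extraction (e.g., via a morphological border), which is why the two metrics together capture complementary aspects of segmentation quality.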

Network architectures
In the segmenter, we use DRN101 [41], composed of stacked dilated residual blocks, as the base semantic segmentation architecture to capture a large receptive field and the spatial context of feature maps. The DRN hyper-parameters are the same as in [6].
The multiple stages of layers [1, 3, 5, 7] are concatenated as the image features F, with feature dimension d = 512. We modify the stride and dilation rate of the last layers, followed by an upsampling layer, to produce denser feature maps with a larger field of view for segmentation. The CDD consists of several stacked residual blocks followed by a sigmoid layer to predict the domain, and the randomized multilinear map outputs a d_o-dimensional tensor.

Training details. The proposed model is implemented in PyTorch and trained on 4 NVIDIA RTX 2080 Ti GPUs with 44 GB of total memory. We train with the Adam optimizer [23] in three stages with a batch size of 16, split equally between source and target samples. First, the model is trained from scratch on the source domain with a learning rate of 1e−3 to minimize the segmentation loss L_seg; the trade-off parameter η in L_seg (Eq. 2) is 1. Second, the network is trained with the source supervision and the unsupervised domain classification loss, L_seg + λ_1 L_cdd, at a learning rate of 3e−4.
Then, the overall loss function (Eq. 13) is optimized, applying the conditional domain adversarial loss L_cdd and the category-centric prototype alignment loss L_ccpa jointly, with the learning rate set to 1e−4. The trade-off parameters λ_1 and λ_2 are both set to the default value of 1. We also perform a series of experiments analyzing the impact of the hyper-parameters λ_1, λ_2, and η in the ablation study section. We use dropout with a ratio of 0.1 and batch normalization in all convolutional layers. At inference, the model directly predicts the most likely class for every pixel via the segmenter on the source or target domain.
Experimental settings To verify the performance of our proposed method, we conducted extensive experiments to demonstrate the performance of the domain adaptation method.
In our experiments, biomedical cardiac MRI images are the source domain and CT images the target domain. In practice, we carry out the following experimental settings: 1) training and testing the segmentation network only on the source domain (referred to as Seg-MRI); 2) training on the source domain and directly testing on target data with no domain adaptation, as a lower bound (Seg-CT-noDA); 3) training and testing the segmentation network on annotated target-domain images, as an upper bound (Seg-CT); and 4) our method for unsupervised domain adaptation (Seg-CT-DA).

B. Experimental Results
We report quantitative segmentation results in this section, which demonstrate the superiority of the proposed method in the UDA scenario for cardiac structure segmentation, as shown in Table I. In addition, Figure 4 presents qualitative segmentation results for CT images. First, we validated the performance of the segmenter in the Seg-MRI setting, which serves as the basis for the subsequent domain adaptation procedures. Our segmenter achieved competitive performance on most of the four cardiac structures compared with the standard U-Net [52] and cascaded-FCN [53] methods. With this segmenter architecture, we performed the following experiments to validate the effectiveness of our unsupervised domain adaptation framework.
We then confirmed the upper-bound performance of the segmenter on the target domain, reported as Seg-CT. These results are generally comparable to the standard U-Net [52] and cascaded-FCN [53] methods. Furthermore, comparing Seg-MRI with Seg-CT, we found a significant performance gap, demonstrating the severe domain shift between the source and target domains. The domain-shift problem inherent in cross-modality biomedical images is also illustrated by the degraded performance of Seg-CT-noDA. This indicates that although cardiac MRI and CT images share similar high-level representations and an identical label space, the geometric pattern and boundary of each category or instance differ, which makes domain adaptation extremely difficult.
A striking observation from this table is that Seg-CT-DA outperforms Seg-CT-noDA on all metrics, with the largest increase in the Dice of LV-blood. These results show the benefit of using category-centric prototypes to distinguish modal information in the cross-modality scenario, especially when neighboring structures have unclear boundaries or relatively homogeneous tissues. Furthermore, as seen in the last part of Table I, our method shows significant improvement over other domain adaptation methods on half of the metrics. These results demonstrate the superiority of the proposed method in the UDA scenario for cardiac structure segmentation and underline the importance of applying the CDD and CCPA across domains.

C. Ablation Studies 1) Influence of Key Component:
To evaluate the contribution of each component of our model, we perform ablation studies in this section. As described in Table II, when we remove either the CDD module (λ_1 = 0) or the CCPA module (λ_2 = 0) from our method, performance decreases on all metrics. This indicates that both modules play an important role in addressing domain shift in UDA. Furthermore, when we remove both modules (λ_1 = 0, λ_2 = 0), significant performance degradation occurs, showing that domain adaptation helps improve the model's generalization capability on cross-modality segmentation.
2) Influence of Key Hyper-parameters: We also perform a series of ablation studies to analyze the impact of different hyper-parameters on domain adaptation performance. All models are trained from scratch with the default settings. Here, we study three key hyper-parameters: the trade-off ratios λ_1 and λ_2 of L_cdd and L_ccpa in Eq. 13, respectively, and the trade-off ratio η of the Dice loss in Eq. 2. The results are shown in the three groups of experimental comparisons at the bottom of Table II.
Fixing λ_2 = 1 and η = 1, we vary λ_1 over [0.3, 0.6, 0.9, 1]. We first observe that as λ_1 increases, the domain adaptation performance increases gradually. Compared to λ_1 = 0.3, λ_1 = 1 improves performance on all metrics, especially the Dice of LV-blood and the ASD of LV-myo; these tissues are difficult and have unclear boundaries. We believe this is because the introduced CDD better captures the multimodal structures shared across domains. In practice, a task-specific value of λ_1 should be set to ensure optimal results. Similarly, we evaluate the trade-off ratio of L_ccpa in Eq. 13 with λ_2 = [0.3, 0.6, 0.9, 1]. Interestingly, the change in the metrics is monotonic as we adjust this ratio, and λ_2 = 1 likewise achieves the best performance on all metrics, with a significant improvement in the Dice of LV-blood and the ASD of LV-myo. This indicates the importance of the CCPA module in cross-modality domain adaptation segmentation.
Finally, we perform a group of experiments on the hyper-parameter η to investigate how it affects model performance. We report the results at the bottom of Table II, adjusting η over [0, 0.2, 0.4, 0.6, 0.8, 1]. We observe in Table II that our method achieves the best performance under the setting η = 1 compared with the other settings. Note that the cardiac structure datasets exhibit class imbalance. In this regard, combining the Dice loss with the cross-entropy loss is beneficial for challenging heart segmentation tasks. In practice, the optimal ratio η should be adjusted according to the degree of class imbalance in the dataset.
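The Dice-plus-cross-entropy combination discussed above can be sketched as follows. This is a hedged numpy illustration assuming Eq. 2 takes the common form L_seg = L_ce + η·L_dice; the function names and the exact per-class averaging are assumptions, not the paper's implementation:

```python
import numpy as np

def cross_entropy(probs, onehot, eps=1e-12):
    # pixel-wise cross-entropy; probs and onehot have shape (N, C)
    return -np.mean(np.sum(onehot * np.log(probs + eps), axis=1))

def soft_dice(probs, onehot, eps=1e-6):
    # soft Dice loss averaged over classes; less sensitive to class imbalance
    inter = (probs * onehot).sum(axis=0)
    denom = probs.sum(axis=0) + onehot.sum(axis=0)
    return 1.0 - np.mean((2.0 * inter + eps) / (denom + eps))

def seg_loss(probs, onehot, eta=1.0):
    """Assumed form of Eq. 2: L_seg = L_ce + eta * L_dice.
    eta = 0 recovers plain cross-entropy; larger eta emphasizes the
    Dice term, which helps under class imbalance."""
    return cross_entropy(probs, onehot) + eta * soft_dice(probs, onehot)
```

With imbalanced classes, the Dice term weights small structures more evenly than cross-entropy alone, which is consistent with η = 1 performing best on the cardiac data.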
3) Reverting Domain Adaptation Direction: To investigate whether the reverse adaptation direction, from CT to MRI, can also be achieved, and what the impact of D_s in domain adaptation is, we applied the same model setup with CT as the source domain and MRI as the target domain. The quantitative performance of reverting the domain adaptation direction is shown in Table III. Unsurprisingly, using the CT segmenter directly on MRI data also fails. Our proposed method recovers the average segmentation performance via unsupervised domain adaptation, indicating that our model generalizes robustly across different types of datasets and that cross-modality domain adaptation can be achieved in both directions. Interestingly, the best-recovered structure in this reverse setup is the LV-blood, the same structure as in the MRI-to-CT direction. It is also worth mentioning that the Dice of LV-blood increases from a complete failure (5.8) to a considerably higher value (79.2), and the ASD of LV-blood shows the same trend. Compared to the adaptation from MRI to CT, the reverse direction generally yields lower performance, showing that the difficulty of D_s can affect the performance of domain adaptation and that this difficulty is not symmetric. Segmentation of cardiac MRI is itself more difficult than segmentation of cardiac CT. This is also evident from Table I, where the CT segmentation Dice is higher than the MRI result for all four structures. In these respects, transferring the CT segmenter to MRI appears to be more challenging.

V. CONCLUSIONS
In this paper, we studied the problem of unsupervised domain adaptation on cross-modality biomedical images. We introduced the CDD and CCPA modules into the model to align the latent feature space and the category-centric prototypes of the target domain with those of the source domain. Extensive experiments show the superior performance of our approach.

Figure 1: Frameworks of different domain discriminators. (a) Feature-level domain discriminator-based method. (b) Dual-domain discriminator (feature-level and instance-level)-based method. (c) Our proposed model, named the conditional domain discriminator.

Figure 3: Category-Centric Prototype Aligner flowchart. (a) Feature representation F and segmentation prediction M are generated by the segmenter. (b) Pixel-level relational weights A are obtained by the graph attention layer. (c) More accurate instance-level feature representations F and M are acquired through the graph convolution layer. (d) Category-centric prototype C is derived via confidence-guided merging. (e) Category-level domain alignment is performed by minimizing the loss L_ccpa calculated by the prototype aligner.
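The confidence-guided merging step (d) and the aligner loss (e) can be sketched as a confidence-weighted average of pixel features per class, followed by a source/target prototype distance. This is a simplified numpy illustration under assumed details: the weighting scheme and the squared-L2 aligner are plausible instantiations, and the graph attention/convolution layers of steps (b)-(c) are omitted:

```python
import numpy as np

def category_prototypes(feats, probs, eps=1e-8):
    """feats: (N, D) pixel/instance features; probs: (N, C) softmax confidences.
    Each class prototype is the confidence-weighted mean of the features,
    so high-confidence pixels dominate the merged prototype."""
    weights = probs / (probs.sum(axis=0, keepdims=True) + eps)  # normalize per class
    return weights.T @ feats  # (C, D): one prototype per category

def prototype_align_loss(proto_s, proto_t):
    # mean squared L2 distance between paired source/target class prototypes
    return float(np.mean(np.sum((proto_s - proto_t) ** 2, axis=1)))
```

Minimizing this distance pulls each target-domain class prototype toward its source-domain counterpart, which is the category-level alignment the figure describes.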

Figure 4: Results of different methods for CT image segmentation. Each row presents one typical example; from left to right: (a) raw CT image, (b) directly applying the MRI segmenter on CT data, (c) our unsupervised cross-modality domain adaptation result, (d) the segmenter trained from scratch with CT labels, and (e) ground truth labels. The structures of AA, LA-blood, LV-blood and LV-myo are indicated by brown, white, red and light red colors, respectively.

Figure 2: Overview of the framework. The segmenter is shared between the source and target domains. The conditional domain discriminator and the category-centric prototype aligner are trained on both source and unlabeled target data. During inference, the target domain is predicted by the segmenter.

great potential of unsupervised domain adaptation for biomedical image segmentation. For example, Kamnitsas et al. [37] made the earliest attempts to align feature distributions with an adversarial loss for unsupervised domain adaptation in cross-protocol MRI segmentation and achieved promising adaptation performance. Degel et al. [38] and Zhang et al. [39] combined regularization with adversarial training and obtained better adaptation results on ultrasound datasets and cardiac MRI segmentation, respectively. Wang et al. [40] …
• Segmenter: This module consists of the feature extractor G and the mask predictor P. The feature extractor G encodes source/target domain (MRI/CT) images using a CNN to obtain image features F_s/F_t. F_s and F_t represent the semantic information of the source and target domains in the latent feature space, respectively. Then the mask predictor P takes the image features F_s/F_t as input to predict the segmentation maps M_s/M_t of the source and target domains individually.
• Conditional Domain Discriminator: This module captures the latent relation between the image feature F and the prediction map M through a randomized router, in which the image features interact with the prediction map. In addition, it captures the cross-covariance between F and M to improve transferability.
• Category-Centric Prototype Aligner: This module performs category-level prototype alignment to guarantee discriminability by exploiting more precise instance-level features. In addition, it mitigates the negative effect of class imbalance on domain adaptation via an entropy-based loss that controls the adaptation process during the training phase. Thus, CCPA obtains the capacity to transfer from the source domain to the target domain through UDA.
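The interaction between F and M inside the conditional domain discriminator can be sketched with a multilinear (outer-product) map and a randomized low-dimensional variant; the latter is one plausible reading of the "randomized router" above. A hedged numpy sketch under that assumption (not the paper's exact discriminator input):

```python
import numpy as np

def multilinear_map(F, M):
    """F: (N, D) features, M: (N, C) prediction maps.
    The per-sample outer product F ⊗ M captures the cross-covariance
    between features and class predictions before the discriminator."""
    return np.einsum('nd,nc->ndc', F, M).reshape(F.shape[0], -1)  # (N, D*C)

def randomized_map(F, M, Rf, Rm):
    """Randomized multilinear map: project F and M with fixed random
    matrices Rf (D, d) and Rm (C, d), then take the element-wise product.
    This keeps the conditioning dimension at d instead of D*C."""
    d = Rf.shape[1]
    return (F @ Rf) * (M @ Rm) / np.sqrt(d)  # (N, d)
```

Either map would then be fed to a binary domain classifier, whose adversarial gradient encourages domain-invariant joint feature-prediction statistics.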

Table I: Quantitative performance comparison between different methods on cardiac datasets (MRI → CT). (Note: Bold represents the best performance; "-" denotes results not reported by that method.)

Table II: Performance comparison of the different components of our model on cardiac datasets (MRI → CT). The results indicate the importance of the proposed modules for unsupervised domain adaptation. (Note: Bold represents the best performance.)

Table III: Quantitative performance of reverting the domain adaptation direction, comparing different methods on cardiac datasets (CT → MRI). (Note: Bold represents the best performance.)