Learning to Adapt to Label-Scarce Image Domain via Angular Distance-Based Feature Alignment

Most recent domain adaptation (DA) methods deal with the unsupervised setup, which requires numerous target images for training. However, constructing a large-scale image set of the target domain is occasionally much harder than preparing a small number of image and label pairs. To cope with this problem, great attention has recently been paid to supervised domain adaptation (SDA), which takes an extremely small amount of labeled target images for training (e.g., at most three examples per category). In the SDA setup, adapting deep networks towards the target domain is very challenging due to the lack of target data, and we tackle this problem as follows. Given labeled images from the source and target domains, we first extract deep features and project them onto a hyper-spherical space via l2-normalization. Afterwards, an additive angular margin loss is embedded so that deep features of both domains are compactly grouped on the basis of shared class prototypes. To further relieve the domain discrepancy, a pairwise spherical feature alignment loss is incorporated. All of our loss functions are defined in the hyper-spherical space, and the advantage of each ingredient is analyzed in detail. Comparative evaluation results demonstrate that the proposed approach is superior to existing SDA methods, achieving 60.7% (1-shot) and 64.4% (3-shot) average accuracies on the DomainNet benchmark dataset using the ResNet-34 backbone. In addition, by applying a semi-supervised learning scheme to a network initialized by our SDA method, we achieve state-of-the-art performance on semi-supervised domain adaptation (SSDA) as well.


I. INTRODUCTION
Deep convolutional neural networks (DCNNs) have shown promising results on various visual recognition tasks. However, due to the strong dependency on training data, the performance of DCNNs is usually degraded when tested on a shifted domain [1]. Preparing large-scale training resources for every target domain could be a simple solution, but it is time-consuming and costly. To resolve this problem, domain adaptation (DA) has been actively studied in recent years [1], [2].
The goal of DA is to overcome the deficiency of training resources in a target domain by leveraging knowledge from a resource-sufficient source domain. Depending on the type of target data available for training, DA schemes can be roughly categorized into unsupervised, semi-supervised, and supervised approaches, as indicated in Table 1. Most previous studies on DA are devoted to unsupervised domain adaptation (UDA) [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], which assumes that a large number of unlabeled target images are available for training. The assumption behind UDA is that the expense of assigning labels is much larger than that of collecting images from a target domain. Similarly, semi-supervised domain adaptation (SSDA) [14], [15], [16], [17] assumes that a small number of labeled target images are given for training, while leaving the rest of the target images unlabeled. However, mining a large number of images from a target domain is occasionally very hard, allowing only a few image and label pairs for training (e.g., one or three examples per category). To cope with those problems, supervised domain adaptation (SDA) is addressed by several previous works [18], [19], [20]. Previous SDA methods are focused on matching feature representations by using semantic priors. In [18] and [20], pairwise distance and similarity metrics are introduced to facilitate feature alignment based on the class equivalency of feature pairs. This pairwise alignment scheme is rebuilt as an adversarial learning model in [19]. Although these methods show promising results on SDA, their principle for feature alignment is limited to the pairwise comparison scheme.

The associate editor coordinating the review of this manuscript and approving it for publication was Juan Wang.
VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Based on this observation, we aim to develop a new approach for SDA beyond the existing methodology.
In this paper, we propose a new SDA framework for image classification. Given labeled images from both domains, we aim to learn highly discriminative representations by encouraging between-class variations and penalizing cross-domain variations of features in a spherical space. Firstly, we extract deep features and normalize them to unit vectors via l2-normalization. Geometrically, this procedure is the same as projecting a feature vector from the Euclidean space to a hyper-spherical (or spherical, for simplicity) space. In this setup, a feature is solely represented by its direction, and a class prediction score is computed based on the angular distance from normalized class prototypes. An illustration that describes the difference between the Euclidean and spherical feature spaces is given in Fig. 1(a) and (b). Compared to the Euclidean space, features in the spherical space have a relatively lower degree of freedom. This characteristic of the spherical feature space relieves the risk of overfitting [21], and implicitly enforces matching the feature distributions of both domains, as illustrated in Fig. 1(b). Secondly, inspired by [22], an additive angular margin loss is embedded as a classification loss function. Compared to the vanilla softmax loss, the additive angular margin loss encourages a smaller disparity between a feature and its corresponding class prototype by inserting a non-negative margin penalty into the prediction. In our method, the additive scheme is extended to SDA by applying an angular margin for each domain. Thirdly, to further alleviate the domain discrepancy, we propose a pairwise spherical feature alignment loss, which operates in the hyper-spherical feature space. Compared to training with a vanilla softmax loss function in the spherical space (Fig. 1(b)), our proposed loss function generates more compact feature distributions, as described in Fig. 1(c).
Our method can be generically applied for various deep networks from CNNs to recent vision Transformer (ViT) [23] without requiring any domain classifier. Extensive experimental results show that the proposed SDA method is superior to the existing methods on various datasets. Furthermore, our SDA method can be extended to semi-supervised domain adaptation (SSDA) by applying a semi-supervised learning (SSL) scheme to a network initialized by our SDA method. Experimental results demonstrate that this extended version of our method surpasses recent state-of-the-art methods which are specifically designed for SSDA.
The contributions of this paper can be summarized as follows:
• We propose a novel SDA scheme which conducts feature alignment in a spherical space. In contrast to the Euclidean space in Fig. 1(a), the spherical space can reduce the risk of overfitting by enforcing feature alignment, as illustrated in Fig. 1(b).
• For SDA, we propose a domain-wise additive margin loss based on the spherical feature embedding model. In addition, we propose a pairwise spherical feature alignment loss, which aligns feature distributions based on the class equivalency of feature pairs. By incorporating the proposed losses, the spherical features can be compactly grouped on the basis of class prototypes, as depicted in Fig. 1(c).
• For SSDA, we show that a deep network pre-trained by our SDA scheme can be further extended to SSDA by applying a semi-supervised learning scheme to target labeled and unlabeled data.
• By means of extensive experiments, we demonstrate that state-of-the-art results are achieved both on SDA and SSDA.

II. RELATED WORK
A. DOMAIN ADAPTATION
Based on the configuration of target data given for training, DA methods for image classification can be categorized into unsupervised, semi-supervised, and supervised schemes. A comparison on the three schemes is shown in Table 1.
To cope with cases in which only a few target images are available with labels, several SDA methods [18], [19], [20] have been proposed. In [18], a pairwise loss function, named the classification and contrastive semantic alignment (CCSA) loss, is proposed to match features across domains based on the class equivalency of image pairs. This pairwise alignment scheme is reconstituted by means of the adversarial learning approach (FADA [19]) and the stochastic neighborhood embedding method via a modified Hausdorff distance (d-SNE [20]). Apart from the pairwise loss functions, the previous SDA methods commonly adopt the vanilla softmax loss, which is an ordinary classification loss, equally for each domain. Unlike these methods, we introduce a new loss function, which is motivated by a margin-based metric learning scheme [22], and apply the loss in a domain-adaptive manner.

B. FEW-SHOT LEARNING
The goal of few-shot learning [33], [34], [35], [36] is to learn discriminative feature representations for novel classes with a few training samples. Our work is partially related to few-shot learning since a few training samples are given for the target task. However, the problem statements of the two topics are different from each other as follows: few-shot learning aims to adapt to novel classes within a fixed domain, whereas the goal of SDA is to adapt to a novel domain with a fixed set of classes. Chen et al. [21] introduce a baseline feature embedding scheme, and empirically demonstrate that the l2-normalization-based embedding scheme can be generically applied to various few-shot setups. This approach is adopted for feature alignment in several DA methods [10], [14], [16], yet the l2-normalization is applied to the feature space only. In our work, this embedding scheme is further extended to a fully hyper-spherical space by applying l2-normalization to class prototypes (i.e., classifiers) as well.
In this setup, we propose three loss functions for few-shot DA, as addressed in Section IV.

C. MARGIN-BASED DEEP METRIC LEARNING
A metric is a function that defines a distance between each pair of elements in a set, and the goal of metric learning is to learn a task-specific metric in a data-driven manner. Recently, by treating the deep feature space as a metric space, deep metric learning has been explored for various tasks such as image retrieval [37], [38], person re-identification [39], [40], and face verification [22], [41], [42]. Learning a discriminative metric space is very challenging when image pairs are similarly structured, as in face verification. To address this issue, margin-based methods have been proposed, which insert a non-negative margin that penalizes the distance between predictions and their class centers. Inserting a margin makes deep features of the same class compactly grouped by pushing them farther from decision boundaries. Representative margin-based methods for face verification are SphereFace [41], CosFace [42], and ArcFace [22]. In our work, we extend ArcFace to domain adaptation to reduce the intra-class and cross-domain variations and to enlarge the between-class variations.

III. BACKGROUND: SPHERICAL FEATURE SPACE
An image classification network consists of a feature extractor $F(\cdot; \psi_F)$ and a classifier $C(\cdot; \psi_C)$, where $\psi_F$ and $\psi_C$ represent their corresponding weights. For the $i$-th input image $x_i$, its feature vector is embedded as
$$f_i = F(x_i; \psi_F) \in \mathbb{R}^d, \tag{1}$$
where $d$ indicates the feature dimension. Following [21], the classifier $C$ is modeled as a task-specific fully connected layer $W \in \mathbb{R}^{d \times K}$. Here, $K$ indicates the number of image categories, and the bias parameters of the classifier are fixed to zero for simplicity [41]. In this setup, the matrix $W$ can be understood as a set of class prototypes $W = [W_1, \ldots, W_K]$, where $W_k \in \mathbb{R}^d$ represents the weight vector of the $k$-th class prototype. In the Euclidean space, a classification logit vector $z_i$ is obtained as
$$z_i = W^{\top} f_i. \tag{2}$$
To establish a spherical feature space, we apply l2-normalization to both features and weight vectors, i.e., $\hat{f}_i = f_i / \|f_i\|_2$ and $\hat{W}_k = W_k / \|W_k\|_2$. Given the l2-normalized feature $\hat{f}_i$ and weight vector $\hat{W}_k$, a classification logit $\hat{z}_{i,k}$ is obtained as follows:
$$\hat{z}_{i,k} = \hat{W}_k^{\top} \hat{f}_i = \cos \theta_{x_i,k}, \tag{3}$$
where $\theta_{x_i,k} = \arccos(\hat{z}_{i,k})$ indicates the angle between the normalized feature of the $i$-th image and the $k$-th normalized class prototype. Thus, class prediction in the spherical feature space solely depends on the angle between $\hat{f}_i$ and $\hat{W}_k$.

FIGURE 2. Overall framework of the proposed SDA scheme. For each training iteration, mini-batch images are sampled from each domain and embedded as spherical features ($\hat{f}_i^s$, $\hat{f}_j^t$). Afterwards, for each domain, an additive angular margin loss ($L_s$, $L_t$) is embedded as a classification loss, where $m_s$ and $m_t$ are the additive margins for the source and target domains, respectively; we set $m_s < m_t$ to impose stronger regularization on the target domain. In addition, for each pair of source (blue) and target (red) features, a pairwise spherical feature alignment loss is embedded based on the class equivalency of the pair: for feature pairs of the same class (blue and red circles, $y_i^s = y_j^t$), a pairwise similarity loss ($L_{ps}$) is embedded, and for feature pairs of different classes (blue triangle and red circle, $y_i^s \neq y_j^t$), a pairwise dissimilarity loss ($L_{pd}$) is embedded. Best viewed in color.
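The spherical embedding above is straightforward to sketch in code. The snippet below is a minimal NumPy illustration of the l2-normalized cosine logits (function names are ours, not from the paper's released code):

```python
import numpy as np

def l2_normalize(v, axis=-1, eps=1e-12):
    """Project vectors onto the unit hypersphere."""
    return v / (np.linalg.norm(v, axis=axis, keepdims=True) + eps)

def spherical_logits(features, prototypes):
    """Cosine logits: z_hat[i, k] = cos(theta) between feature i and prototype k.

    features:   (n, d) raw features from the extractor F.
    prototypes: (d, K) classifier weight matrix W (bias fixed to zero).
    """
    f_hat = l2_normalize(features, axis=1)    # (n, d) normalized features
    w_hat = l2_normalize(prototypes, axis=0)  # (d, K) normalized prototypes
    return f_hat @ w_hat                      # (n, K), values in [-1, 1]
```

Because both factors are unit vectors, each logit is exactly the cosine of the angle between a feature and a class prototype, so the magnitude of the raw feature plays no role in the prediction.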

IV. PROPOSED METHOD
A. PROPOSED SDA SCHEME
1) PROBLEM FORMULATION
In SDA, we are given a set of labeled source images $D_s = \{(x_i^s, y_i^s)\}_{i=1}^{N_s}$ and a set of labeled target images $D_t = \{(x_j^t, y_j^t)\}_{j=1}^{N_t}$ for training. The amount of target data $N_t$ is very small (e.g., at most three examples per category), and is usually much smaller than that of source data (i.e., $N_t \ll N_s$).

2) ADDITIVE ANGULAR MARGIN LOSS
In the hyper-spherical feature space, a vanilla softmax loss on the source data ($D_s$) is represented as follows:
$$p_i^s = \frac{\exp(r \cos \theta_{x_i^s, y_i^s})}{\sum_{k=1}^{K} \exp(r \cos \theta_{x_i^s, k})}, \tag{4}$$
where $\cos \theta_{x_i^s,k}$ indicates a classification logit which represents the similarity between the feature of the $i$-th source sample ($x_i^s$) and the $k$-th class prototype, and $y_i^s$ is the ground-truth label. In Eq. (4), $r$ denotes a scaling parameter that controls the magnitude of the radius. This widely-used classification loss, however, does not explicitly regularize features to have higher intra-class similarity and larger between-class variations. Inspired by ArcFace [22], we insert an additive margin to enforce features to be compactly grouped on the basis of their corresponding class prototypes. In our work, source and target data are separately embedded with domain-dependent margin parameters as follows:
$$p_i^s = \frac{\exp(r \cos(\theta_{x_i^s, y_i^s} + m_s))}{\exp(r \cos(\theta_{x_i^s, y_i^s} + m_s)) + \sum_{k \neq y_i^s} \exp(r \cos \theta_{x_i^s, k})}, \tag{5}$$
$$p_j^t = \frac{\exp(r \cos(\theta_{x_j^t, y_j^t} + m_t))}{\exp(r \cos(\theta_{x_j^t, y_j^t} + m_t)) + \sum_{k \neq y_j^t} \exp(r \cos \theta_{x_j^t, k})}, \tag{6}$$
where $m_s$ and $m_t$ are margin parameters for the source and target domains, respectively. The margin parameters are non-negative constants that induce features to move closer to their corresponding class prototypes. In our work, we set $m_s < m_t$ to impose stronger regularization on the target domain.
Discussions on the magnitude of the margin parameters are given in Section V-C1. By adopting Eq. (5) and (6), cross-entropy losses for each domain are given as follows:
$$L_s = -\frac{1}{n_b} \sum_{i=1}^{n_b} \log p_i^s, \tag{7}$$
$$L_t = -\frac{1}{n_b} \sum_{j=1}^{n_b} \log p_j^t, \tag{8}$$
where $L_s$ and $L_t$ are the additive angular margin losses on the source and target domains, respectively, and $n_b$ is the number of images sampled from each domain per iteration.
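The additive angular margin loss can be sketched in a few lines of NumPy. This is an illustrative implementation in the spirit of the loss described above (the function name is ours; setting `margin=0` recovers the vanilla softmax loss):

```python
import numpy as np

def angular_margin_ce(features, prototypes, labels, margin, r=20.0):
    """Cross-entropy with an additive angular margin (ArcFace-style).

    The true-class logit cos(theta) is replaced by cos(theta + margin)
    before scaling by r and applying softmax, which penalizes features
    that are not tightly grouped around their class prototype.
    """
    f_hat = features / np.linalg.norm(features, axis=1, keepdims=True)
    w_hat = prototypes / np.linalg.norm(prototypes, axis=0, keepdims=True)
    cos = np.clip(f_hat @ w_hat, -1.0, 1.0)   # (n, K) cosine logits
    rows = np.arange(len(labels))
    logits = cos.copy()
    logits[rows, labels] = np.cos(np.arccos(cos[rows, labels]) + margin)
    logits *= r                                # radius scaling
    logits -= logits.max(axis=1, keepdims=True)  # numerically stable softmax
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[rows, labels].mean()
```

In a two-domain setting, this function would be called once per mini-batch with `margin=m_s` for source samples and `margin=m_t` for target samples; a positive margin strictly increases the loss for any sample whose feature is not already at its prototype, which is the regularization effect discussed above.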

3) PAIRWISE SPHERICAL FEATURE ALIGNMENT LOSS
To further match the feature distributions of the two domains, we propose a pairwise spherical feature alignment scheme. Suppose a pair of normalized features $(\hat{f}_i^s, \hat{f}_j^t)$ is given on the spherical space, derived from the $i$-th source and the $j$-th target images, respectively. The similarity between the two normalized features $\hat{f}_i^s$ and $\hat{f}_j^t$ is obtained by a dot product as follows:
$$\hat{f}_i^{s\top} \hat{f}_j^t = \cos \phi_{x_i^s, x_j^t}, \tag{9}$$
where $\phi_{x_i^s, x_j^t}$ denotes the angle between the normalized features of $x_i^s$ and $x_j^t$. Our goal is to enhance the similarity of a pair sharing the same class. To this end, we define a metric $M_{ps}(\cdot, \cdot)$ as follows:
$$M_{ps}(\hat{f}_i^s, \hat{f}_j^t) = 1 - \cos \phi_{x_i^s, x_j^t}. \tag{10}$$
On the contrary, we aim to penalize the pairwise similarity of features of different classes, and a metric $M_{pd}(\cdot, \cdot)$ for this objective is defined as follows:
$$M_{pd}(\hat{f}_i^s, \hat{f}_j^t) = \max(\cos \phi_{x_i^s, x_j^t} + 1 - \epsilon,\ 0), \tag{11}$$
where $\epsilon$ is a constant that relieves the objective. Setting $\epsilon = 0$ enforces every different-class feature pair to have an opposite direction, which is infeasible in multiclass classification. In our work, we set $\epsilon$ to 0.5 for all experiments. By taking Eq. (10) and (11), pairwise loss functions are given as follows:
$$L_{ps} = \frac{1}{|P_s|} \sum_{(i,j) \in P_s} M_{ps}(\hat{f}_i^s, \hat{f}_j^t), \tag{12}$$
$$L_{pd} = \frac{1}{|P_d|} \sum_{(i,j) \in P_d} M_{pd}(\hat{f}_i^s, \hat{f}_j^t), \tag{13}$$
where $P_s$ and $P_d$ denote the sets of same-class and different-class cross-domain pairs in a mini-batch, and $L_{ps}$ and $L_{pd}$ are the pairwise similarity and dissimilarity losses, respectively.
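A vectorized sketch of the pairwise alignment scheme follows. This is an illustrative reconstruction of the losses described above (function and variable names are ours), computing all cross-domain pair similarities in one matrix product:

```python
import numpy as np

def pairwise_spherical_losses(fs, ft, ys, yt, eps=0.5):
    """Pairwise spherical feature alignment (illustrative sketch).

    Same-class source/target pairs are pulled together via 1 - cos(phi);
    different-class pairs are pushed toward opposite directions via
    max(cos(phi) + 1 - eps, 0), where eps relaxes the objective.
    """
    fs = fs / np.linalg.norm(fs, axis=1, keepdims=True)  # project to sphere
    ft = ft / np.linalg.norm(ft, axis=1, keepdims=True)
    cos = fs @ ft.T                                      # (n_s, n_t) cos(phi)
    same = ys[:, None] == yt[None, :]                    # class-equivalency mask
    l_ps = (1.0 - cos[same]).mean() if same.any() else 0.0
    l_pd = np.maximum(cos[~same] + 1.0 - eps, 0.0).mean() if (~same).any() else 0.0
    return l_ps, l_pd
```

With `eps=0.5`, a different-class pair incurs no penalty once its cosine similarity drops below `eps - 1 = -0.5`, while a same-class pair is penalized until its features coincide on the sphere.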

4) OVERALL OBJECTIVE FUNCTION
The overall objective function is given as follows:
$$L = L_s + L_t + \alpha (L_{ps} + L_{pd}), \tag{14}$$
where $\alpha$ is a balancing constant, set to 0.1 in our implementation. The overall pipeline of the proposed SDA scheme is illustrated in Fig. 2. For every iteration, $n_b$ images are randomly sampled from each domain (thus, $2n_b$ images per iteration). The training procedure is continued until the overall loss converges.
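Combining the four terms is then a single weighted sum; the helper below simply mirrors the overall objective (the individual loss values are assumed to be computed elsewhere, e.g. by the per-domain margin losses and the pairwise alignment losses):

```python
def overall_objective(loss_s, loss_t, loss_ps, loss_pd, alpha=0.1):
    """Overall SDA loss: L = L_s + L_t + alpha * (L_ps + L_pd).

    alpha balances the classification losses against the pairwise
    spherical alignment losses (0.1 in the paper's implementation).
    """
    return loss_s + loss_t + alpha * (loss_ps + loss_pd)
```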

5) FINE-TUNING ON TARGET DATA
In SDA, the amount of source data is generally much larger than that of target data. As a result, the expectation of the empirical risk on the source data remains larger than that on the target data even after convergence. This makes a deep network more biased toward the source domain, which is sub-optimal for our task. To further align to the target domain, we conduct fine-tuning with the target data by means of $L_t$. Although the amount of target data is small, this fine-tuning stage is resistant to overfitting since the spherical feature embedding scheme is inherently robust to few-shot setups. This simple method enhances the final accuracies, as analyzed in Section V-C1.

B. EXTENSION TO SSDA
In SSDA, apart from $D_s$ and $D_t$, a set of unlabeled target images $D_u = \{x_i^u\}_{i=1}^{N_u}$ is additionally given for training. Thus, in the target domain, a labeled image set $D_t$ and an unlabeled image set $D_u$ are available, and this setup corresponds to that of semi-supervised learning. Our key motivation is that a classification model pre-trained by our SDA scheme can be further extended to SSDA by applying a semi-supervised learning scheme. To this end, we employ a modified version of MixMatch [43], one of the representative semi-supervised learning schemes.
The MixMatch algorithm is a holistic approach which integrates dominant paradigms for semi-supervised learning, such as consistency regularization [44], [45], pseudo-labeling [46], the mixup model [47], and entropy minimization [48]. Since learning with unlabeled data is heavily dependent on the output predictions, a well-established initial feature embedding space is crucial for semi-supervised learning. In [43], initialization is conducted with labeled data and a linear ramp-up policy is applied. In contrast, we directly employ network weights pre-trained via the proposed SDA scheme. By initializing with the pre-trained network, a stable initial feature embedding space is provided for semi-supervised learning. In this stage, we employ classification weights without l2-normalization to relax the feature space. To verify the effectiveness of the SDA pre-training stage, comparative evaluation results on SSDA according to various initial network weights are reported in Section V-C3.
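Two MixMatch ingredients most relevant here, prediction sharpening (entropy minimization) and mixup, can be sketched as follows. This is a simplified NumPy illustration of [43], not the exact released implementation, and the function names are ours:

```python
import numpy as np

def sharpen(p, T=0.5):
    """Sharpen averaged class predictions into pseudo-labels.

    Raising probabilities to 1/T and renormalizing lowers the entropy
    of the prediction, which is MixMatch's entropy-minimization step."""
    p = p ** (1.0 / T)
    return p / p.sum(axis=-1, keepdims=True)

def mixup_pair(x1, y1, x2, y2, alpha=0.75, rng=None):
    """MixUp a (pseudo-)labeled pair.

    lam is clamped to >= 0.5 so the mixed sample stays closer to its
    first argument, as in the MixMatch variant of mixup."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    lam = max(lam, 1.0 - lam)
    return lam * x1 + (1.0 - lam) * x2, lam * y1 + (1.0 - lam) * y2
```

In the full algorithm, `sharpen` is applied to the average prediction over several augmentations of each unlabeled image, and `mixup_pair` is applied between the concatenated labeled and unlabeled batches before computing the losses.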

V. EXPERIMENTS
A. EXPERIMENTAL SETUPS
1) DATASETS
For comparative evaluations, we used the three benchmark datasets as follows. DomainNet [49] is a large-scale benchmark dataset for domain adaptation, which involves six domains with 345 classes. Office [50] involves three domains (Amazon, Webcam, and DSLR) with 31 classes. Office-Home [51] contains four domains (Real, Clipart, Art, and Product) with 65 classes. We used DomainNet as the primary benchmark dataset for comparative evaluation on SDA and SSDA. For further validation, we used the Office dataset for evaluation on SDA, and the Office-Home dataset on SSDA.
The setup for comparative evaluation on each dataset is as follows. For DomainNet, we addressed seven adaptation scenarios, which are derived from four domains (Real, Clipart, Painting, and Sketch) with 126 classes, following [14]. For Office, six adaptation scenarios are addressed [18], [19], [20]. For Office-Home, five adaptation scenarios are selected [17]. All adaptation scenarios in our experiments are given one or three examples per category in the target domain (1-shot and 3-shot, respectively). For fair comparison, we adopted the labeled target image lists released in [14] for the DomainNet and Office-Home datasets. For the Office dataset, we randomly selected source and target data, following [18], [19], and [20]. For the DomainNet and Office-Home datasets, all available source-domain data are given for training. Meanwhile, for the Office dataset, the number of source images per category is 20 in the Amazon domain, and 8 in the Webcam and DSLR domains, following [18], [19], [20]. We employed ResNet-34 [52] for the DomainNet and Office-Home datasets, and VGG-16 [53] for the Office dataset. For all experiments in SDA, we adopted ImageNet pre-trained models for weight initialization.

2) BASELINE METHODS FOR COMPARISON
We conducted comparative evaluations with three SDA methods and four SSDA methods. Baseline methods on SDA are CCSA [18], FADA [19], and d-SNE [20]. Baseline methods on SSDA are MME [14], SagNet [15], Meta-MME [17], and APE [16]. It is worth noting that these methods are specifically designed solely for SSDA, whereas our SSDA method is a simple extension of the proposed SDA scheme. All numerical measurements of other methods in this paper were directly quoted from the original papers, except the SDA results on the DomainNet dataset, which are our reproduced measurements.

3) IMPLEMENTATION DETAILS
For every iteration of the proposed SDA scheme, we randomly sampled 24 labeled images from each domain. The angular margin parameters are set to m_s = 0.1 and m_t = 0.25 for the source and target domains, respectively. The scaling parameter r is set to 2 for the Office dataset and 20 for the other datasets. The maximum number of iterations is 50k, and training is early-stopped based on validation accuracy. For every iteration of the SSDA extension, we randomly sampled 12 labeled images and 144 unlabeled images from the target domain. The maximum number of iterations for SSDA is 10k. Unlike [43], we employed a cross-entropy loss for the mixup sets of labeled and unlabeled images. The other setups for the SSDA extension are identical to those in [43]. For both SDA and SSDA, we used the Stochastic Gradient Descent (SGD) optimizer with an initial learning rate of 0.01, a momentum of 0.9, and a weight decay of 5 × 10^−4. All experiments in this paper were implemented in PyTorch [54] using a single NVIDIA RTX A5000 GPU.
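For reference, the hyper-parameters above can be collected into a single configuration sketch. The values are the ones reported in this section; the dictionary layout itself is illustrative, not part of the released code:

```python
# Hyper-parameters of the proposed method, as reported in this section.
SDA_CONFIG = {
    "batch_per_domain": 24,     # labeled images sampled per domain per iteration
    "margin_source": 0.1,       # m_s
    "margin_target": 0.25,      # m_t (m_s < m_t)
    "scale_r": 20.0,            # 2.0 for the Office dataset
    "alpha": 0.1,               # weight of the pairwise alignment losses
    "max_iterations": 50_000,   # early-stopped on validation accuracy
    "optimizer": {"name": "SGD", "lr": 0.01,
                  "momentum": 0.9, "weight_decay": 5e-4},
}

SSDA_CONFIG = {
    "labeled_batch": 12,        # labeled target images per iteration
    "unlabeled_batch": 144,     # unlabeled target images per iteration
    "max_iterations": 10_000,
    "optimizer": SDA_CONFIG["optimizer"],
}
```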

1) RESULTS ON SDA
In Table 2, comparative evaluation results on the DomainNet dataset are reported. On average, the proposed SDA method outperforms the previous state-of-the-art by 1.8% and 2.2% on the 1-shot and 3-shot setups, respectively. Among the seven adaptation scenarios, our SDA scheme leads to the largest performance gain over previous methods on the R→S scenario, surpassing them by 3.8% and 3.9% on the 1-shot and 3-shot setups, respectively. It is worth noting that the R→S scenario involves a large domain shift, and the accuracy without adaptation (Source only) on this scenario is the lowest among the seven adaptation scenarios. This indicates that the advantage of our SDA scheme becomes prominent for challenging scenarios involving a large domain disparity. The results on the Office dataset are reported in Table 3. Following [18], [19], and [20], we report an average accuracy with deviations over three runs for each adaptation scenario. On average, our SDA scheme surpasses the previous state-of-the-art method by 1.38%. The accuracy gains of our SDA scheme are relatively larger on adaptation scenarios involving the Amazon domain (e.g., A→D, W→A) than on scenarios between the Webcam and DSLR domains, which involve small domain shifts (e.g., W→D, D→W).

2) RESULTS ON SSDA
We report the SSDA results on the DomainNet dataset in Table 4. Our SSDA extension achieves state-of-the-art accuracies except in one scenario (R→P with 1-shot). On average, our method surpasses the other SSDA methods by 3.0% and 1.9% on the 1-shot and 3-shot setups, respectively. The proposed SSDA scheme is particularly robust to 1-shot setups, and the average accuracy of our method with the 1-shot setup is superior or competitive to the 3-shot accuracies of the previous methods. The results on the Office-Home dataset are reported in Table 5. On average, our SSDA extension surpasses previous methods by 1.4%. It is worth noting that the previous methods [14], [15], [16], [17] are specifically designed for SSDA, whereas our SSDA method is a semi-supervised extension of the proposed SDA scheme.

C. ANALYSIS AND DISCUSSIONS
1) ABLATION STUDY ON SDA
To investigate the impact of each proposed module, we conducted ablation studies on the seven adaptation scenarios in DomainNet, and the results are reported in Table 6. When a vanilla softmax is employed, the spherical feature space leads to better results than the Euclidean feature space, demonstrating its robustness to few-shot setups. Based on the spherical feature space, the proposed additive angular margin losses (L s and L t ) considerably enhance the performance. In addition, the pairwise spherical feature alignment losses (L ps and L pd ) and the following fine-tuning stage further improve the performance.
In our proposed additive angular margin loss for SDA, the margin parameters m s and m t are incorporated with the default values of 0.1 and 0.25, respectively. To investigate the impact of each margin parameter, we report ablation study results in Table 8. As we can see in the table, the margin parameters of both domains contribute to the performance of SDA. On the other hand, the effectiveness of the angular margin loss is significantly degraded when the magnitude of m s becomes larger than that of m t , as indicated in the last row in Table 8. This implies that the magnitude of a margin parameter is closely related to the priority (or importance) of data in a certain domain. Based on this observation, we always set m s < m t to make a deep network more focused on learning with target domain data.

2) VALIDATION ON VISION TRANSFORMER (ViT)
To further investigate the robustness of our SDA method across various backbone architectures, we report comparative evaluation results on the vision Transformer (ViT) [23]. Unlike CNN-based models (e.g., ResNet and VGGNet), ViT is free from convolution operators and is composed of multiple self-attention blocks [55]. For our experiments, we adopted the ViT base model with a 16 × 16 patch size (ViT-B/16), pre-trained on ImageNet-21k [56]. Table 7 shows the results. In SDA, our proposed method outperforms the previous works on the DomainNet dataset. This result demonstrates that our SDA scheme can be broadly applied across various backbones, from convolutional models (CNNs) to self-attention models (Transformers).

3) IMPACTS OF SDA PRE-TRAINING ON SSDA EXTENSION
As explained in Section IV-B, the SSDA extension scheme assumes that a network is initialized by our SDA scheme. To investigate the impact of the initialization method in the SSDA extension stage, we conducted an ablation study on SSDA by varying the SDA pre-training methods. The results are reported in Table 9. Without SDA pre-training (i.e., initializing with an ImageNet pre-trained model), the SSDA results are far lower than those with SDA pre-training. This indicates that the pre-training stage is highly demanded for semi-supervised learning with a small number of labeled data. By comparing the two SSDA results which are driven by SDA pre-trained weights, it can be confirmed that our SDA scheme leads to higher SSDA accuracies than applying a vanilla softmax loss. This implies that the SDA pre-training stage has a significant impact on the performance of SSDA.

FIGURE 3. t-SNE visualization results of the S→P scenario in the DomainNet dataset. Each two-dimensional point in the figure represents a feature vector obtained by the t-SNE visualization scheme [57]. ResNet-34 is adopted as the backbone architecture. Best viewed in color.

4) QUALITATIVE ANALYSIS VIA t-SNE VISUALIZATION
To qualitatively analyze the impact of our proposed methods on feature alignment, we present the t-SNE [57] visualization results in Fig. 3. t-SNE (t-distributed Stochastic Neighbor Embedding) is a statistical method for visualizing high-dimensional data by projecting them into a two- or three-dimensional space [57]. In our work, we applied the t-SNE scheme to visualize the high-dimensional feature vectors (e.g., 512 dimensions for ResNet-34) in the two-dimensional space. The upper row of Fig. 3 presents the visualization results showing source and target features. Comparing the three learning schemes (i.e., Source only, SDA, and SSDA), we can see that the features from the two distinct domains are aligned by our proposed DA methods. On the other hand, the target features without a DA scheme (Source only) are not well aligned with the source features. The benefits of our DA schemes become more apparent when we focus on the target features, as shown on the second row of Fig. 3. For 15 categories randomly selected from the DomainNet dataset, each target feature is colorized based on its category. By comparing the target features of the three methods, we can see that our DA schemes enhance the discriminative ability in the target domain.

TABLE 8. Ablation study results on the additive margin parameters (m_s and m_t): average accuracies on the seven adaptation scenarios in DomainNet (%). Note that the measurements are obtained by the additive angular margin losses only (L_s and L_t).

5) COMPARISON WITH UDA
In this subsection, we introduce a comparative analysis of SDA and UDA. As indicated in Table 1, SDA assumes that a small number of image and label pairs are given, whereas UDA assumes that a relatively larger number of unlabeled images are accessible in the target domain. Thus, SDA would be more practical than UDA if the cost of collecting target images is larger than that of assigning labels to target images. To provide a quantitative comparison of SDA and UDA, we report the average accuracies over the 12 adaptation scenarios in Office-Home in Table 10. Given three examples per class (3-shot), our proposed SDA scheme is competitive with most recent UDA methods. It is worth noting that 65 (1-shot) and 195 (3-shot) labeled target images are provided in SDA, whereas the number of given images in UDA ranges from 2,427 to 4,439 (3,897 on average) in Office-Home. We expect this comparison between the two topics to serve as a practical and intuitive benchmark when constructing training sets for domain-adaptive representation learning.

VI. CONCLUSION
In this paper, a novel feature alignment scheme is proposed to address label-scarce DA scenarios, namely few-shot supervised DA (SDA) and semi-supervised DA (SSDA). For SDA, we propose to align source and target features based on angular distances in the spherical space. To this end, the additive angular margin loss and the pairwise spherical feature alignment loss are introduced. By means of these loss functions, features are encouraged to have lower intra-class variations and higher between-class variations in a domain-adaptive manner. The proposed SDA scheme outperforms the previous state-of-the-art methods, achieving 60.7% (1-shot) and 64.4% (3-shot) average accuracies on the DomainNet benchmark dataset using the ResNet-34 backbone. In addition, our SDA scheme is further extended to state-of-the-art SSDA by applying a semi-supervised learning scheme. We expect that the proposed learning scheme can be practically applied to reduce labeling costs, and can also be extended to various other tasks such as semantic segmentation and object detection.