Adversarial Networks With Circular Attention Mechanism for Fine-Grained Domain Adaptation

Fine-grained Image Analysis (FGIA), as a branch of image analysis tasks, has received increasing attention in recent years. Compared with ordinary image analysis tasks, FGIA requires more detailed human annotation, which not only requires annotators to have professional knowledge but also incurs greater labor costs. An effective solution is to apply domain adaptation (DA) methods to transfer knowledge from existing fine-grained image datasets to massive unlabeled data. This paper presents a circular attention mechanism that cyclically extracts deep-level image features to match the label hierarchy from coarse to fine. Moreover, the networks effectively improve the distinguishability and transferability of fine-grained features based on an adversarial learning framework. Experimental results show that our proposed method achieves excellent transfer performance on three fine-grained recognition benchmarks.


I. INTRODUCTION
Fine-grained Image Analysis (FGIA), also called sub-category image analysis, aims to categorize an object among a large number of subordinate categories within the same meta-category. In previous FGIA tasks, datasets required manual annotation by professionals, which demanded great time and manpower. Therefore, people have tried to use machine learning models as substitutes for manual fine-grained image recognition and annotation. However, unlike general image analysis tasks, different sub-category images in FGIA tasks may share similar shapes, sizes, and even textures. The large intra-class differences and subtle inter-class differences in FGIA tasks pose challenges to mainstream machine learning models.
To address this issue, much effort has been made and great advances have been achieved in fine-grained recognition tasks in recent years. On one hand, many studies [1], [2] are dedicated to extracting local discriminative features to improve the ability of deep networks to identify subtle differences between similar fine-grained image samples. On the other hand, the number of fine-grained image datasets has increased significantly in recent years, covering different sample types such as birds [3], [4], flowers [5], [6], cars [7]-[10], dogs [11], [12], etc.
The associate editor coordinating the review of this manuscript and approving it for publication was Shagufta Henna .
Still, it is unrealistic to expect the labels of fine-grained images to cover all datasets on demand. Therefore, scholars try to use computers to replace human experts in the fine-grained annotation of large-scale datasets. One promising way is to apply domain adaptation approaches [13] to fine-grained recognition tasks. For example, we may transfer knowledge from existing labeled bird datasets to massive unlabeled bird images in the wild to save tedious fine-grained annotation work.
However, fine-grained domain adaptation algorithms face great challenges in many aspects. In fine-grained domain adaptation tasks, we not only have to face the common problem of inter-domain distribution differences in domain adaptation algorithms, but also have to solve the problems of large intra-class differences and subtle inter-class differences that are unique to fine-grained tasks. Traditional image domain adaptation algorithms [14]-[16] usually establish a connection between the two domains by finding correlations between the source domain and the target domain in the feature space, thereby reducing the inter-domain distribution differences. But when it comes to fine-grained domain adaptation, the situation becomes more complicated, because we have to confront the tough issues brought by fine-grained categorization. As shown in Figure 1, birds under different fine labels may share similar characteristics, such as similar feather colors and beak shapes. This makes it difficult for feature-based domain adaptation algorithms to achieve satisfactory results.
This paper aims to address these challenges by designing adversarial networks with a circular attention mechanism for fine-grained domain adaptation. We use the attention mechanism to locate the most discriminative region in an image. Furthermore, the circular attention mechanism is designed to locate multiple discriminative regions for fine-grained image analysis tasks by recursively dropping the previous discriminative region and applying the attention mechanism again. The general idea of our domain adaptation method is to extract fine features in the fine-grained images from the multiple discriminative regions learned by the circular attention mechanism, and to use the adversarial learning network to enable domain adaptation progressively from coarse-grained categories to fine-grained categories.
We evaluate our method on three benchmarks. Two of them are based on the domain adaptation of bird images, involving the CUB-200-2011 [17], CUB-200-Painting [18], NABirds [4], and iNaturalist2017 [19] datasets, and the other is based on the domain adaptation of vehicle images, including the Stanford [8] dataset. Extensive experimental results show that the proposed adversarial networks with circular attention mechanism achieve excellent performance in fine-grained domain adaptation tasks.
The rest of this paper is organized as follows. Section II gives a brief description of related work. In Section III, the adversarial networks with circular attention mechanism are introduced in detail. Section IV provides the comparison experiments and ablation experiments on three different benchmarks. Finally, we conclude the paper in Section V.

II. RELATED WORK
A. FINE-GRAINED IMAGE CLASSIFICATION
In recent years, fine-grained image classification, as the basis of fine-grained image tasks, has received increasing attention in the field of computer vision. Since the differences between fine-grained categories are subtle, it is difficult for traditional CNNs to obtain features sufficient to support fine-grained image classification.
To address this issue, researchers have proposed three solutions. The first is to enhance fine-grained classification ability by introducing additional labels such as part annotations and visual attributes to the images [20]-[23]. Another solution is to improve the feature representation ability of the network. For example, Lin et al. [24] proposed a bilinear model to fuse features in different dimensions of images to obtain features that are more suitable for fine-grained image recognition tasks. Building on this work, Gao et al. [25] proposed a compact bilinear pooling method, which reduces the computational complexity. The third and most mainstream method is to locate the position of the object to be classified in the image, so that CNNs can provide more refined features. Hu et al. [26] first proposed attention mechanisms to locate the object. Similarly, Yang et al. [27] proposed the Region Proposal Network (RPN), which concatenates original features and partial features together to perform object localization. The above methods have achieved fairly good performance in fine-grained image classification tasks.

B. DOMAIN ADAPTATION
The domain adaptation problem is a representative problem in transfer learning, which aims to use labeled data (the source domain) to learn a classifier and use it to predict the labels of unlabeled data (the target domain). The most commonly used approach for domain adaptation is to transform the data features of the source domain and the target domain into a unified feature space through feature transformation, so as to reduce the discrepancy between the two domains [15], [28], [29]. Pan et al. [30] proposed the Transfer Component Analysis (TCA) method, which uses the Maximum Mean Discrepancy (MMD) [31] as a metric to minimize the distribution discrepancy between the source and target domains. In recent years, feature-based domain adaptation methods have usually been combined with neural networks. Long et al. [32] integrated the idea of adversarial learning into domain adaptation and proposed the Conditional Adversarial Domain Adaptation (CDAN) method. The above methods have made great contributions to domain adaptation algorithms, but unfortunately, none of them is aimed at fine-grained image adaptation tasks. Because they ignore the hierarchical labeling of fine-grained images, it is difficult for these methods to achieve satisfactory results in fine-grained domain adaptation.

III. PROPOSED METHOD
In this section, our proposed adversarial networks with circular attention mechanism are introduced in detail, and some mathematical notation is defined to describe our method. In the fine-grained domain adaptation task, a source domain is given with both a fine label y_f and coarse labels {y_c^k}_{k=1}^K in a K-layer class hierarchy. In contrast, a target domain T consists of n_t unlabeled examples. The joint distributions on the source and target domains are denoted as P(x, y) and Q(x, y), respectively.

A. CIRCULAR ATTENTION MECHANISM
The overall framework of our networks is shown in Figure 2. The networks are designed to extract fine features in the fine-grained images from the multiple discriminative regions learned by the circular attention mechanism.
We first introduce the circular attention mechanism, which is shown in Figure 3. In our work, we adopt an attention mechanism on bilinear pooling to train the attention maps. Bilinear pooling was first adopted by Lin et al. [24] to improve performance on fine-grained image classification tasks. The flow of the circular attention mechanism is summarized in Algorithm 1.

Algorithm 1: Circular Attention Mechanism
Input: input image I = R, i = 1, K (K-layer class hierarchy), threshold δ
Output: attention areas {L_1, L_2, . . . , L_n}
while i < K do
1. Generate attention maps A with spatial attentional bilinear pooling;
2. Generate mA by averaging the attention maps over channels;
3. Binarize mA according to the threshold δ;
4. Locate the discriminative region and sample the local image L_i from the raw image R;
5. Generate the drop image D by dropping the discriminative region from the input image I;
6. i ← i + 1, I ← D.

By iteratively dropping the previous discriminative region from the raw image, the circular attention mechanism proposes a set of local images {L_1, L_2, . . . , L_n} ordered from high to low information content. It is natural to associate these local images with the class hierarchy of fine-grained labels, which is also in line with human cognitive habits: to distinguish fine-grained differences in images, people focus more attention on the details of the object. The circular attention mechanism filters out the background of the images and selects the local images that receive the most attention in the raw image to assist fine-grained classification.
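One iteration of this loop can be sketched in a few lines of NumPy. The function below is an illustrative sketch, not the paper's implementation: the attention maps are assumed to be given (the spatial attentional bilinear pooling that produces them is omitted), the local image is taken as the bounding box of the binarized mask, and dropping is done by zeroing the masked pixels.

```python
import numpy as np

def circular_attention_step(image, attention_maps, delta):
    """One iteration of the circular attention loop (illustrative sketch).

    image: (H, W) raw image; attention_maps: (C, H, W) maps assumed to come
    from the attentional bilinear pooling; delta: binarization threshold.
    Returns the local image L_i and the drop image D for the next iteration.
    """
    # Step 2: average the attention maps over channels -> mA
    mA = attention_maps.mean(axis=0)
    # Step 3: binarize mA according to the threshold delta
    mask = mA > delta
    # Step 4: sample the local image as the bounding box of the mask
    ys, xs = np.where(mask)
    y0, y1, x0, x1 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1
    local = image[y0:y1, x0:x1].copy()
    # Step 5: drop the discriminative region from the input image
    drop = image.copy()
    drop[mask] = 0.0
    return local, drop
```

Iterating this step, feeding the drop image back in as the next input, yields the local image set {L_1, . . . , L_n}.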

B. PROGRESSIVE GRANULARITY LEARNING
After extracting the local images with the circular attention mechanism, the progressive granularity learning method [18] is used to complete training from coarse-grained to fine-grained recognition. As shown in Figure 2, the coarse labels are divided into K levels. A CNN is introduced with a coarse feature extractor G and K label predictors C_k, k = 1, 2, · · · , K. The image data x with coarse labels y_c^k, k = 1, 2, · · · , K, is fed into the coarse-grained CNN and trained on the source domain by minimizing the cross-entropy loss:

L_c = Σ_{k=1}^{K} L_y(C_k(G(x)), y_c^k),

where C_k(G(x)) is the k-th coarse predicted distribution and L_y is the cross-entropy (CE) loss.
On the other hand, the fine labels of the images are explored by the fine feature extractor F and the fine label predictor Y, which are trained by minimizing a coarse-fine hybrid loss:

L_h = (1 − ε) L_y(ŷ, y_c) + ε L_y(ŷ, y_f),

where ŷ = Y(F(x)) is the fine predicted distribution and y_f is the corresponding ground-truth label. During training, ε changes from 0 to 1 following [33]:

ε = 2 / (1 + exp(−10ρ)) − 1,

where ρ is the ratio of the training iteration progress. As training progresses, ε gradually approaches 1, so the influence of the coarse labels disappears and the coarse-fine hybrid loss converges to the fine-grained loss

L_f = L_y(ŷ, y_f),

which plays the same role as the CE loss.
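The scheduled mixing above can be sketched as follows. This is an illustrative sketch: the sigmoid-style schedule with γ = 10 is the common choice from [33] and is assumed here, and `coarse_ce`/`fine_ce` stand for already-computed cross-entropy values.

```python
import math

def epsilon_schedule(rho, gamma=10.0):
    """Progress factor that rises smoothly from 0 to 1 as the training
    progress ratio rho goes from 0 to 1 (schedule form assumed from [33])."""
    return 2.0 / (1.0 + math.exp(-gamma * rho)) - 1.0

def hybrid_loss(coarse_ce, fine_ce, rho):
    """Coarse-fine hybrid loss sketch: (1 - eps) * coarse + eps * fine.
    Early in training the coarse term dominates; late in training the
    loss converges to the fine-grained CE term."""
    eps = epsilon_schedule(rho)
    return (1.0 - eps) * coarse_ce + eps * fine_ce
```

At ρ = 0 the hybrid loss equals the coarse term exactly; at ρ = 1 the coarse term's weight has shrunk to about 10⁻⁴, so the loss is effectively the fine-grained CE loss.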

C. ADVERSARIAL LEARNING
After progressive granularity learning, domain adversarial networks are used for domain adaptation. We first establish the relationship between the predicted distribution ŷ and the fine feature f = F(x). In this paper, we employ a bilinear transformation with a residual connection [34] to combine ŷ and the feature f. Embedding the feature with the predicted class information enhances its discriminability. Moreover, the residual connection retains the subtle differences between features in the fine-grained images. The bilinear transformation is expressed as follows:

h = f ⊕ (A (f ⊗ ŷ) + b),

where A and b are the weight and bias of the bilinear transformation, ⊗ denotes the outer product of the feature and the predicted distribution, and ⊕ represents the residual connection. Both A and b are learned through the following adversarial learning. Common domain adversarial networks consist of three modules: the feature extractor F, the domain discriminator D, and the label predictor Y. F and Y are trained to extract transferable features. At the same time, F and D conduct adversarial training: D aims to distinguish the source domain from the target domain, while F is trained to prevent D from making correct judgments. In our method, the coarse predictors {C_k}_{k=1}^K, the fine label predictor Y, and the domain discriminator D are trained for adversarial learning. The overall loss of the network is as follows:
L = L_c + L_h + λ L_d, with L_d = (1/n) Σ_{i=1}^{n} L_y(D(h_i), d_i),

where λ is a trade-off hyperparameter, d is the domain label of x, and n = n_s + n_t is the total sample size of the source and target domains. The overall loss of the network can be divided into three parts, as shown in the formula. The first is L_c, the cross-entropy loss for coarse recognition, which is minimized by G and {C_k}_{k=1}^K. The second is L_h, the coarse-fine hybrid loss for fine recognition, which is minimized by Y and F. Both losses were introduced in the previous sections. The last part is L_d, the cross-entropy loss for domain discrimination, which is minimized by D while F is trained adversarially to confuse it. The adversarial training of the network reduces these losses simultaneously to obtain better fine-grained recognition accuracy. Compared with previous domain adversarial networks, our networks can gradually align the feature distribution between domains from coarse-grained to fine-grained.
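As a concrete illustration of the bilinear transformation with residual connection described above, the sketch below conditions the fine feature f on the predicted distribution ŷ via their outer product. The shapes of the weight A (d × d·c) and bias b (d) are our assumption, chosen so that the bilinear term maps back to feature size; this is a sketch, not the paper's implementation.

```python
import numpy as np

def conditioned_feature(f, y_hat, A, b):
    """Bilinear conditioning with a residual connection (illustrative sketch).

    f: (d,) fine feature, y_hat: (c,) predicted class distribution,
    A: (d, d * c) weight, b: (d,) bias -- hypothetical shapes.
    """
    # Bilinear term: linear map of the flattened outer product f (x) y_hat
    bilinear = A @ np.outer(f, y_hat).ravel() + b
    # Residual connection keeps the subtle differences in the raw feature
    return f + bilinear
```

The conditioned feature h would then be fed to the domain discriminator D, with A and b learned along with the rest of the model during adversarial training.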

IV. EXPERIMENTS
We compare the proposed adversarial networks with circular attention mechanism against state-of-the-art domain adaptation models on three benchmarks. Table 1 records the dataset sources of the three benchmarks and the specific numbers of images. Figures 4, 5, and 6 show example images for the fine-grained domain adaptation tasks in the three benchmarks.

3) BENCHMARK III: BIRDS-31
Benchmark III (Birds-31) can also be split into two domains: NABirds (N) [4] and iNaturalist2017 (I) [19]. Not all images from the two datasets are incorporated into the benchmark: 31 categories with balanced sample sizes are selected, and the labels span four levels. Specifically, there are 31 Species, 25 Genera, 16 Families, and 4 Orders.

B. IMPLEMENTATION AND RESULTS
All comparison experiments are carried out in PyTorch. We fine-tune a ResNet-50 [34] model pretrained on ImageNet. For the fairness of the experiments, the parameters in all domain adaptation tasks are kept consistent and unchanged. Mini-batch SGD with a momentum of 0.9 is adopted as the optimizer, and the batch size is fixed at 36. The learning rate strategy is the same as in [33].
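The learning-rate annealing of [33] is commonly written as η_p = η_0 / (1 + αp)^β with α = 10 and β = 0.75; the sketch below assumes those values and an initial rate η_0 = 0.01, since the paper does not state them explicitly.

```python
def dann_lr(p, eta0=0.01, alpha=10.0, beta=0.75):
    """Annealed learning rate as a function of training progress p in [0, 1]
    (parameter values assumed from the common setting of [33])."""
    return eta0 / (1.0 + alpha * p) ** beta
```

Under these assumed values, the rate decays smoothly from 0.01 at the start of training to roughly 0.0017 at the end.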
As shown in Table 2, our method performs best across both transfer tasks on CompCars. It outperforms the second-best method, CDAN+BSP, by more than 1.5% in average accuracy, and raises the average accuracy from the DANN baseline's 73.03% to 80.66%, an increase of more than 7 percentage points. Similarly, the experimental results on CUB-Birds and Birds-31 are recorded in Table 3 and Table 4. On CUB-Birds, our method achieves the best performance among all domain adaptation methods, improving accuracy by more than 8% over the DANN baseline. Our algorithm also achieves the best performance on Birds-31, with an accuracy 6.3% higher than the DANN baseline and 2.2% higher than the second-best method. From the experimental results on the three benchmarks, we notice that the improvement of our method on CUB-Birds is larger than that on CompCars and Birds-31. There are two reasons. First, the basic recognition accuracy on CUB-Birds is relatively low, which leaves the domain adaptation algorithm more room for improvement. Second, as can be seen from Figure 5, the inter-domain variations of CUB-Birds are much larger than those of CompCars and Birds-31. Unlike CompCars and Birds-31, where the images in both domains are all real photos, the images in the CUB-200-Painting (P) dataset include watercolors, oil paintings, cartoons, etc. This shows that the circular attention mechanism in our algorithm locates the details of the object itself, thereby reducing the influence of image style and background on the domain adaptation task.

C. ABLATION STUDY
We design ablation experiments by removing the circular attention mechanism. The results of the ablation experiments on the three benchmarks are recorded in Tables 5, 6, and 7. They show that the circular attention mechanism improves accuracy by about 5% on all three benchmarks, demonstrating that it is effective at locating fine-grained features. With the gradual learning of labels from coarse to fine, the attention mechanism effectively reduces the inter-domain variations in the datasets, thereby achieving better domain adaptation accuracy.

V. CONCLUSION
In this paper, we propose adversarial networks with a circular attention mechanism to solve the fine-grained domain adaptation problem. The key idea of our model is to locate multiple discriminative areas in the image through the circular attention mechanism and gradually align them with the multiple levels of the fine-grained label hierarchy. On this basis, we design an adversarial training network to complete the domain adaptation task for fine-grained images. We compare our method with other state-of-the-art methods on three benchmarks for fine-grained domain adaptation. The experimental results show that the proposed method is effective and achieves the best performance on all three benchmarks.
NINGYU HE received the B.S. degree from the School of Electronic Engineering, Xidian University, China, in 2017. He is currently pursuing the Ph.D. degree with Shanghai Jiao Tong University, China. He also works with the Department of Electronic Engineering, Shanghai Jiao Tong University. His research interests include audio signal processing, image processing, and deep learning.
JIE ZHU received the Ph.D. degree in communications and information systems from Shanghai Jiao Tong University. He went to Bell Labs, Murray Hill, NJ, USA, in 1997, for cooperative scientific research. He was a Senior Visiting Scholar with the Dresden University of Technology, Germany, in 2000, for visiting research. He is currently a Professor with the Department of Electronic Engineering and a Ph.D. Supervisor in electronic science and technology. He went to the USA, Europe, Japan, South Korea, and other countries to participate in international conferences and academic exchanges for many times.