Multi-Source Transfer Network for Cross Domain Person Re-Identification

Unsupervised person re-identification has been significantly improved by the development of cross-domain person re-identification models, which apply useful knowledge from source data to completely unlabeled target data. However, existing cross-domain re-identification models share a major limitation: they are all based on a single-source, single-target setting. A single source domain may exhibit a tremendous gap from the target, producing a negative effect on model training in the target domain. To overcome this drawback, this paper proposes a Multi-Source Transfer Network that learns a shared target-biased feature space between multiple source domains and the target domain, achieving transfer learning at the feature level, pixel level, and task level through the proposed target-biased multi-source transfer learning module, relativistic adversarial learning module, and task-gap bridging module, respectively. By bridging the domain gaps at the feature, pixel, and task levels, the network can synthetically learn a discriminative model from multiple source domains to conduct re-identification effectively in the target domain. Furthermore, this paper conducts extensive experiments on three widely recognized person re-identification datasets; the proposed network achieves rank-1 accuracies of 80.9% and 74.6% on the Market-1501 and DukeMTMC-reID datasets, respectively. The results demonstrate the contribution of the proposed method compared with state-of-the-art methods, including hand-crafted feature, clustering, and transfer learning based methods.


I. INTRODUCTION
Person re-identification (re-id) is a complex image retrieval task that aims to find a target person across multiple camera views without any overlapping areas. This topic has attracted a large amount of research attention in recent years on account of its important applications in automatic video analysis for public security [16], [17], [25], [27]. Despite considerable progress in the development of person re-id, unresolved challenges remain due to large variations in pose, illumination intensity, random occlusion, and even real-time switching of background and camera views. With the development of deep learning technology, plenty of person re-id models have focused on supervised learning [2], [28], [32], which has proved effective. Although supervised frameworks have achieved several successful applications, they require large amounts of annotated pedestrian data to train a serviceable model, which may be infeasible in realistic scenarios. (The associate editor coordinating the review of this manuscript and approving it for publication was Lefei Zhang.)
To overcome this obstacle, many researchers employ unsupervised models [15], [26], [29], [30] that exploit discriminative information from completely unlabeled pedestrian data, which is easier to obtain in video surveillance. Nevertheless, they typically develop distance-based clustering methods to learn the latent feature space without any guidance from annotations, which results in poor performance when these methods are applied to unseen domains, as illustrated in [26]. In another direction, domain adaptation focuses on establishing a knowledge-transferable discriminative model from labeled source datasets for application to an unlabeled target domain, and it has already been employed in person re-identification [13], [19], [21]. In these cross-domain person re-identification approaches, the most challenging task is to alleviate the domain gap between the source and target domains when transferring source knowledge into the target domain. A series of domain adaptation approaches [18], [19], [21] have made significant progress in tackling the domain gap in the cross-domain person re-id problem. However, they share a major limitation: they usually learn discriminative knowledge from only one source dataset and transfer it into one target domain (single-source to single-target transfer learning). Adopting only one source domain often has a limited positive effect on the target due to the existence of negative samples. Although existing transfer learning person re-id models can bridge the domain gap between a single source and a single target to a certain extent, their most challenging limitation is that the single source domain may have a dubious transfer effect on the unlabeled target data.
To address this limitation, this paper focuses on multi-source domain adaptation for the unsupervised person re-id task, learning extensive knowledge from multiple source domains and selecting genuinely target-positive information to train a discriminative person re-id model on the unlabeled target domain. Specifically, this paper assumes that different sources carry unequal positive knowledge for the target domain, which motivates transferring discriminative information from multiple source domains into the target dataset.
In this paper, the idea of Multi-Source Transfer Learning (MSTL) is introduced into person re-identification. Two major problems need to be settled in MSTL. First and foremost, how to bridge the various domain gaps of different scales between multiple sources and the target domain has no established solution in person re-identification. Second, the discriminative tasks in each domain also contain diverse task gaps when person re-identification is conducted in a multi-source transfer learning process. These two difficulties remain unresolved in existing domain adaptation approaches and are the focus of this paper, as shown in Figure 1.
To achieve MSTL in the person re-id problem, this paper designs a Multi-Source Transfer Network (MSTNet) that bridges the domain gaps and task gaps between multiple sources and the target domain while exploiting discriminative information for person re-identification. Specifically, the proposed MSTNet bridges domain gaps at both the pixel level and the feature level through the relativistic adversarial learning and target-biased multi-source transfer learning modules, respectively, and incorporates a task-level transfer module to bridge the task gap between different domains. After bridging the gaps at these different levels, MSTNet can learn a shared target-biased feature space, learned from both the labeled multiple sources and the unlabeled target data, in which to conduct re-identification.
Finally, the major motivations and contributions of the proposed multi-source transfer network are summarized below.

A. MOTIVATIONS
• Existing cross-domain person re-identification methods mostly conduct single-source to single-target transfer learning, exploiting transferable knowledge from only one source domain; they suffer from a biased domain gap because of the dubious effect of source samples, a limitation that multi-source transfer learning can handle.
• Single-source transfer learning approaches cannot alleviate the multiple domain gaps of various scales that arise in the multi-source transfer learning task for cross-domain person re-identification, at either the feature level or the pixel level.
• Another major domain gap, at the task level, is often ignored: the pedestrian-matching similarities in multiple domains follow inconsistent distance metrics. This task-level domain gap has not been settled in existing methods, and it is more severe in a multi-source transfer learning framework.

B. CONTRIBUTIONS
• To relax the limitation of single-source to single-target transfer learning in cross-domain person re-identification models, this paper proposes a Multi-Source Transfer Network (MSTNet), which can exploit synthetic discriminative information from multiple labeled source domains at different levels to learn a robust target-biased feature space for the unlabeled target data.
• To bridge the multiple domain gaps, MSTNet designs a target-biased multi-source transfer learning module to alleviate the distribution gap at the feature level, and proposes a relativistic adversarial learning module to transform images from whichever source into the target style, aiming to bridge the pixel-level gaps between the multiple sources and the target domain.
• To bridge the task-level gap across domains, MSTNet involves a task-level transfer learning module that enforces consistent distance metrics across the multiple sources, following the matching task in the target domain, thereby transferring the different metric learning tasks into the target-biased feature space.

II. RELATED WORK
In this section, recent person re-identification research is briefly reviewed and its progress discussed. Person re-id models are divided into supervised, unsupervised, and cross-domain approaches, according to the annotations they utilize on pedestrian data.

A. SUPERVISED PERSON RE-IDENTIFICATION
Most existing person re-identification models [1], [28], [33] build on supervised frameworks with supervision from a completely labeled dataset. Bai et al. [1] designed an end-to-end long short-term memory method to model the sequence of body parts from head to foot, which strengthens the discriminative ability of local feature learning by integrating contextual information. Zhang et al. [33] treated person identity as a one-vs-all linear classification problem, constructed all classifiers into a task-specific projection matrix, utilized the matrices to form a tensor structure, and jointly trained all tasks in the uniform tensor space. Tay et al. [28] proposed an attribute attention network that integrates person attributes and attribute attention maps into a classification framework to solve the person re-id problem, building on a baseline model that uses body parts and integrating the key attribute information in a unified learning framework. These methods almost all employ person identity annotations, and even attribute labels, which are time-consuming and labor-expensive to obtain in realistic applications.

B. UNSUPERVISED PERSON RE-IDENTIFICATION
In contrast to supervised frameworks, unsupervised person re-id, which does not require any pedestrian annotations [6], [15], [30], [31], has drawn close attention. Yang et al. [31] focused on multi-level feature representations of pedestrian data to conduct weighted linear coding, balancing robustness and distinctiveness, with a rank-1 accuracy of 51.4% on the VIPeR dataset. Generally, these methods achieve poor performance compared with supervised frameworks because, lacking annotation supervision, they have no guidance for the training direction. In this category, this paper adopts BUC and PAUL as representative compared methods, along with the BoW [34] approach, which contributed a high-quality dataset together with an unsupervised Bag-of-Words method, achieving 17.1% rank-1 accuracy and 8.3% mAP on the Market-1501 dataset.

C. CROSS DOMAIN PERSON RE-IDENTIFICATION
To take full advantage of existing labeled pedestrian data, several works [5], [9], [13], [21], [23] introduce domain adaptation to address the lack of annotations in the target domain (without labeling new data). These methods learn a transferable discriminative model from labeled source data and then transfer it to the unlabeled target domain, focusing on bridging the domain gap when transferring the model between domains. For example, Deng et al. [5] translated labeled images from the source to the target domain and trained the re-id model in an unsupervised manner (PTGAN), achieving a rank-1 accuracy of 58.1% and mAP of 26.9% on the DukeMTMC-reID dataset.
Huang et al. [11] addressed cross-domain re-id and contributed to both model generalization and adaptation with part-aligned pooling (EANet), which brings a significant improvement, with a rank-1 accuracy of 67.7% on the Market-1501 dataset. Fu et al. [8] proposed a self-similarity grouping approach (SSG), exploiting the potential similarity of unlabeled samples to build multiple clusters from different views; it achieved 80.0% rank-1 accuracy in the unsupervised manner and 86.2% under a semi-supervised setting on the Market-1501 dataset. Khan and Brémond [13] employed a fine-tuning strategy to transfer knowledge from the source to the target domain via residual learning, and discussed the effectiveness of a hybrid network embedding statistical similarity learning on a small multi-view person re-id dataset. Peng et al. [23] proposed a cross-dataset transfer learning approach that learns a discriminative representation through a multi-task dictionary learning method, which is able to learn a dataset-shared but target-data-biased representation. Genç and Ekenel [9] utilized and analyzed state-of-the-art Convolutional Neural Network (CNN) architectures for the cross-domain person re-id task, since they show significant performance in widely used image classification tasks. In this category, this paper employs PTGAN [5], EANet [11], and SSG [8] as compared methods, along with a hand-crafted feature based transfer learning method (UMDL) [23], which utilizes a dictionary transfer learning model to solve cross-domain re-id with a rank-1 accuracy of 34.5% on Market-1501. In terms of domain settings, all of these methods adopt a single-source transfer learning framework, whose performance is often constrained by the chosen source domain rather than by the architecture complexity. The most dubious point is that the single chosen source may generate a negative transfer effect for the target domain.
Therefore, this paper aims to solve this drawback and proposes a multi-source transfer network for the cross-domain person re-identification task, detailed in the following section.

III. THE PROPOSED APPROACH
This section presents the detailed architecture of the proposed Multi-Source Transfer Network (MSTNet), including the feature extractor, feature-level target-biased transfer learning, pixel-level relativistic adversarial learning, and task-level task-gap bridging stages. It first gives an overview of MSTNet and then describes the feature extractor module, the target-biased multi-source transfer learning module, the relativistic adversarial learning module, and the task-gap bridging module in turn.

A. APPROACH OVERVIEW
The framework of the proposed multi-source transfer network is illustrated in Figure 2 for the setting of two source domains. The network first comprises three feature extractors, for the i-th source, j-th source, and target domains respectively, each pre-trained on ImageNet [4]. These feature extractors are in charge of learning robust representations from each domain and are constrained by the target-biased transfer loss and the task-gap bridging loss to conquer the domain gap at the feature level. The feature representations are then fed into an image generator that reconstructs images following the target image style, guided by the target-biased reconstruction loss and a relativistic discriminator. This strategy guarantees that pedestrian images from the multiple sources can be transferred into the target domain at the target-biased pixel level. By overcoming the feature-level and pixel-level domain gaps, and with the assistance of the proposed task-gap bridging module, MSTNet can learn a target-biased feature space free of the multiple domain gaps at the different levels.

B. FEATURE EXTRACTION FOR MULTIPLE DOMAINS
Assume that MSTNet learns a knowledge-transferred model from K source domains S_1, S_2, ..., S_K and then adapts it to a target domain T. Note that the pedestrian image data from the K source domains is completely labeled, while the target domain's is fully unannotated. Let y_j^i denote the identity (ID) label of image x_j^i, and let N_j be the number of images in the j-th source domain. For the target domain T, the pedestrian images are denoted x_T^i without any annotations, where N_T is the number of pedestrian images in the target domain. First and foremost, the initial stage of MSTNet is to learn robust feature representations for images from every domain, including the multiple sources and the target. Considering that a single shared feature extractor would damage the peculiar identity information of the multiple domains, MSTNet adopts K feature extractors {f_j}_{j=1}^K for feature learning on each source domain and employs a single feature extractor f_T for the target domain. Through this network configuration, MSTNet can learn sufficient pedestrian information from each domain with its own feature extractor and thereby obtain feature representations for each image x_j^i and x_T^i in the multiple sources and the target domain, where h_j^i denotes the feature of the i-th image in the j-th source domain and h_T^i is the feature of the i-th image in the target domain.

C. FEATURE-LEVEL TRANSFER LEARNING
After learning the representations of pedestrian images, MSTNet turns to moderating the feature-level gap in the learned feature space. To constrain the shared feature space between the multiple sources and the target domain, MSTNet chooses the target as the baseline for normalizing the feature space, ensuring that the image representations from the source domains follow a distribution as similar to the target domain's as possible.
To achieve this, this paper introduces a Target-biased Transfer (TT) loss on the learned pedestrian features from each domain. The TT loss L_tt is defined as

L_tt = Σ_{j=1}^{K} [ D_MMD(X_j, X_T) + D_KL(X_j, X_T) ],

where D_MMD is the maximum mean discrepancy [22] constraint between each source and the target domain, and D_KL is the distribution discrepancy function between each source and the target domain, embedding a target-biased Kullback-Leibler (KL) divergence [24] factor. In detail, D_MMD(X_j, X_T) attaches the maximum mean discrepancy to the learned features of the j-th source and target domains. The purpose of the MMD term in the proposed domain adaptation method is to confine the feature distributions of the different source domains to follow the target domain's, as is common in generic domain adaptation methods. However, such methods often ignore the distribution of each individual sample across domains. Thus, this paper introduces the target-biased KL-divergence factor D_KL(X_j, X_T), which forces the score distribution of each image x_j^i in the source domains to move closer to the target data's, where P_j^i denotes the score distribution of image x_j^i in a source domain and P_T^i represents the score distribution of target sample x_T^i. This term measures the sample-level domain discrepancy and employs KL divergence to eliminate the distribution distance between each source and the target domain.
Through this target-biased transfer loss, MSTNet enforces the learned feature space to follow a distribution shared across each source domain and the target, fixed by the maximum mean discrepancy loss on the average feature values and by the KL-divergence constraint at the individual sample level.
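A minimal sketch of the TT loss for one source-target batch pair follows. Two assumptions are made where the paper leaves details open: the MMD term uses a linear kernel (distance between mean embeddings), and the "score distribution" of a sample is taken as the softmax over its feature vector.

```python
import torch
import torch.nn.functional as F

def target_biased_transfer_loss(h_src, h_tgt):
    """Sketch of the TT loss for one source domain: a linear-kernel MMD
    term on the domain-mean embeddings plus a sample-level KL term that
    pulls each source score distribution toward the target's (the
    target-biased direction). Kernel and score definitions are assumptions."""
    # MMD term: squared distance between the domain-mean embeddings
    mmd = (h_src.mean(dim=0) - h_tgt.mean(dim=0)).pow(2).sum()
    # KL term: each source sample's score distribution vs. the mean
    # target score distribution
    log_p_src = F.log_softmax(h_src, dim=1)
    p_tgt = F.softmax(h_tgt, dim=1).mean(dim=0, keepdim=True)
    kl = F.kl_div(log_p_src, p_tgt.expand_as(log_p_src), reduction="batchmean")
    return mmd + kl
```

Summing this quantity over the K source domains gives the L_tt of the equation above.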

D. PIXEL-LEVEL TRANSFER LEARNING
The domain gap between the multiple source domains and the target domain appears not only in the extracted features but also in the other major modality, the image pixels, so MSTNet designs a novel adversarial learning strategy to reduce the pixel-level gap between the multi-source and target domains. To achieve pixel-level transfer learning, MSTNet attaches a generator G to the learned features to produce target-style images for every source image. Generating the target-style image for images from different sources is severely challenging, since it requires the generator to handle the multitudinous style gaps among all source-target domain pairs. To achieve this goal, MSTNet deploys an enhanced discriminator D_e, inspired by the relativistic GAN [12].
The innovation of the proposed discriminator D_e is that, rather than identifying whether an input image comes from the target or a source (the generic discriminator in conventional GANs), D_e estimates a score indicating how much more target-stylized the real target image x_T is than the images generated from the source images X_j. The comparison between the conventional discriminator and the proposed enhanced discriminator is discussed below.
(a) The generative learning in conventional GANs focuses on transforming a source image into the target domain, generating a fake image that should be identified as a real target image by the discriminator, while the adversarial learning trains the discriminator D_c to identify the generated image as a fake from the source domain. That is,

D_c(x_T) = σ(C(x_T)),   (5)
D_c(G(h_j)) = σ(C(G(h_j))),   (6)

where C denotes the feature extractor in the discriminator and σ is the sigmoid function. These two terms intend to identify whether an image follows the target or the source style by Eq. 5 and Eq. 6, respectively. Such a discriminator only treats single-source-to-target image style transfer and performs poorly in multi-source-to-target image style transformation. (b) The enhanced discriminator D_e in the proposed relativistic adversarial learning method instead judges whether the generated images from the multi-source domains are more realistic than the target image style. In detail,

D_e(x_T, G(h_j)) = σ(C(x_T) − C(G(h_j))),   (7)
D_e(G(h_j), x_T) = σ(C(G(h_j)) − C(x_T)),   (8)

where Eq. 7 scores how much more realistic the real target images are than the generated images, and Eq. 8 scores how much more realistic the generated target-style images are than the real target images.

Through this mechanism, MSTNet can transform each source image into the target style, regardless of which source domain it comes from.
With the enhanced discriminator analyzed, the relativistic adversarial learning architecture of this paper can be stated. The overall discriminator and generator losses are

L_{D_e} = −E_{x_T, x_j} [ log D_e(x_T, G(h_j^i)) ],   (9)
L_G = −E_{x_T, x_j} [ log D_e(G(h_j^i), x_T) ],   (10)

where both source and target images are contained in the generative and adversarial losses and h_j^i denotes the learned feature of the source image x_j^i. Therefore, MSTNet can transform each image from any source domain into the target style, alleviating the pixel-level gap in the multi-source transfer learning task.
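Following the relativistic GAN formulation [12] on which the enhanced discriminator is based, the two losses might be sketched as below. The sigmoid-on-score-difference form is an assumption consistent with [12], expressed here via the numerically stable binary cross-entropy with logits; `c_real` and `c_fake` stand for the discriminator feature scores C(x_T) and C(G(h)).

```python
import torch
import torch.nn.functional as F

def relativistic_d_loss(c_real, c_fake):
    """Discriminator: real target images should score as relatively more
    target-stylized than generated images (Eq. 7 direction)."""
    return F.binary_cross_entropy_with_logits(
        c_real - c_fake, torch.ones_like(c_real))

def relativistic_g_loss(c_real, c_fake):
    """Generator: push generated images to score as relatively more
    target-stylized than real target images (Eq. 8 direction)."""
    return F.binary_cross_entropy_with_logits(
        c_fake - c_real, torch.ones_like(c_fake))
```

At an equilibrium where real and fake scores coincide, both losses equal log 2, mirroring the relativistic average behavior described in [12].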

E. TASK-LEVEL TRANSFER LEARNING
Combining feature-level and pixel-level transfer learning in the multi-source transfer task, MSTNet obtains a shared target-biased feature space in which to conduct pedestrian matching. To exploit discriminative information in this learned target-biased feature space, MSTNet applies the triplet loss to images in the multi-source domains,

L_t = Σ_{(a,p,n)} max(0, d(a, p) − d(a, n) + m),   (11)

where a, p, and n denote the anchor, positive, and negative pedestrian images, d(·,·) is the distance measure between two samples, and m is the margin. This constraint ensures

d(a, p) + m ≤ d(a, n),   (12)
a, p, n ∈ {S_k}_{k=1}^K,   (13)

where {S_k}_{k=1}^K represents all multi-source samples. Note that n is selected from all samples in the multi-source domains that have a negative relation with a.
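The triplet term can be sketched directly on feature batches. The margin m = 0.35 follows the training configuration reported in Section IV; the hard-sample mining strategy, which the paper does not detail, is not reproduced here.

```python
import torch

def triplet_loss(anchor, positive, negative, margin=0.35):
    """Margin triplet loss over multi-source feature batches: push the
    anchor-negative distance at least `margin` beyond the
    anchor-positive distance, as in Eq. 11."""
    d_ap = (anchor - positive).pow(2).sum(dim=1).sqrt()
    d_an = (anchor - negative).pow(2).sum(dim=1).sqrt()
    return torch.clamp(d_ap - d_an + margin, min=0).mean()
```

When the inequality of Eq. 12 already holds for every triplet in the batch, the loss is exactly zero and no gradient flows.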
The matching task in Eqs. 12 and 13 still leaves a task-level gap between source and target samples: the distance metrics in each domain are not consistent with the target domain's. Therefore, MSTNet attaches the task-gap bridging loss L_tgb to the metrics,

L_tgb = | d(h_S^i, h_S^j) − d(h_T^i, h_T^j) |,   (14)

where (h_S^i, h_S^j) and (h_T^i, h_T^j) are randomly selected sample pairs from the multi-source and target domains, respectively.
Through the task-gap bridging loss, MSTNet can neutralize the distance-metric inconsistencies between each source and the target domain.
From the analysis above, the overall loss function is given by Eq. 15,

L = L_t + λ_1 (L_{D_e} + L_G) + λ_2 L_tt + λ_3 L_tgb,   (15)

where λ_1, λ_2, and λ_3 are balance parameters weighing the importance of each sub-loss. The overall MSTNet procedure is summarized in Algorithm 1.
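Eq. 15 is a straightforward weighted sum, shown here with the balance parameter values reported in the training configuration of Section IV:

```python
def total_loss(l_t, l_de, l_g, l_tt, l_tgb, lam1=0.8, lam2=0.5, lam3=0.6):
    """Overall objective of Eq. 15: triplet loss plus the weighted
    adversarial, target-biased transfer, and task-gap bridging terms."""
    return l_t + lam1 * (l_de + l_g) + lam2 * l_tt + lam3 * l_tgb
```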

IV. EXPERIMENTS
In this section, the implementation of the Multi-Source Transfer Network (MSTNet) for the cross-domain person re-identification task is described. Three widely acknowledged person re-id datasets are employed to validate the performance of MSTNet. The experiments are detailed in the following subsections.

A. DATASETS
The proposed MSTNet is evaluated on three widely acknowledged person re-id datasets, Market-1501 [34], DukeMTMC-reID [35], and CUHK03 [14] (the last used only as a source domain), to prove the effectiveness of the proposed method.
Market-1501 [34] contains 32,668 annotated pedestrian bounding boxes produced by the DPM detector [7]. The images capture 1,501 persons walking through six non-overlapping cameras. In the experiments, MSTNet follows the standard training/test split and the single-query evaluation setting [18]. Specifically, it employs 12,736 images of 751 identities for training, and the remaining 19,732 images of 750 pedestrians are used for testing.
DukeMTMC-reID [35] contains 36,411 annotated person images of 1,404 identities who pass through eight non-overlapping high-resolution cameras. Following [18], the experiment randomly selects 702 identities with 16,522 images to train MSTNet, and the remaining 1,110 persons are used for testing.
Note that MSTNet conducts experiments by taking Market-1501 and DukeMTMC-reID as the target domain in turn. When one dataset is regarded as the target domain, the other, combined with a third dataset, serves as the multi-source domains. CUHK03 [14] is adopted as the additional source dataset; it captures 14,096 pedestrian images with two cameras, annotated into 1,467 identities, with human bodies detected by a combination of the Deformable Part Model (DPM) [7] and manual annotation. Because CUHK03 is seldom employed as a target in cross-domain person re-id models, making comparison difficult, this paper conducts re-id experiments only on the Market-1501 and DukeMTMC-reID datasets, and CUHK03 is utilized only as a source dataset.

B. IMPLEMENTATION 1) MODEL AND PREPROCESSING
This paper implements MSTNet in the PyTorch framework on an Ubuntu system with eight NVIDIA Titan XP GPUs. MSTNet utilizes the ResNet50 [10] architecture as the feature extractor, initialized with parameters pre-trained on ImageNet [4]. The target-biased generator is constructed from the decoder in CycleGAN [36], and the enhanced discriminator follows the relativistic GAN [12]. The numbers of neurons in each layer follow the referenced components in ResNet50 [10], CycleGAN [36], and the relativistic GAN [12]. Before being fed into the network, all images are resized to 384 × 128 × 3 with padding 10, and MSTNet adopts random flipping as the only data augmentation.

2) TRAINING CONFIGURATION
For network optimization, the experiment utilizes the Stochastic Gradient Descent (SGD) optimizer with an initial learning rate of 5 × 10^-4. The learning rate stays unchanged for the first 10 epochs and then decays linearly to 0 over the last twenty epochs. The margin m is set to 0.35, λ is set to 0.8, and the balance parameters are λ_1 = 0.8, λ_2 = 0.5, and λ_3 = 0.6. In all experiments, each batch consists of 8 pedestrian images, and the maximum number of epochs is 30. The feature extractors are first trained on the source domains, optimized by the triplet loss L_t to increase their representative ability. The multi-source transfer learning framework in this paper is time-consuming because of the large amount of data in the multiple domains; training the whole MSTNet takes around 7.5 hours.
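The reported schedule (constant for 10 epochs, then linear decay to 0 by epoch 30) maps naturally onto a `LambdaLR` scheduler. The placeholder linear module stands in for the full MSTNet parameters and is only there to make the sketch self-contained.

```python
import torch

model = torch.nn.Linear(2048, 751)  # placeholder for the MSTNet parameters
optimizer = torch.optim.SGD(model.parameters(), lr=5e-4)

def lr_lambda(epoch):
    """Multiplier on the base lr: 1.0 for the first 10 epochs, then a
    linear ramp down to 0.0 at epoch 30."""
    return 1.0 if epoch < 10 else max(0.0, (30 - epoch) / 20.0)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```

Calling `scheduler.step()` once per epoch reproduces the described decay.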

3) EVALUATION METRICS
To measure the performance of MSTNet, three widely recognized evaluation metrics are adopted: rank-n matching accuracy [34], the Cumulative Matching Characteristic (CMC) curve [3], and mean average precision (mAP) [20]. Rank-n accuracy measures the true matching rate when the n gallery samples with the top-n similarities are predicted for a query image. This metric is of practical convenience in realistic person re-id applications, where n candidates are presented for users to select the target identity.
The CMC curve visualizes detection or recognition performance at each rank. mAP is the mean average precision for evaluating classifiers and is a relatively better comparative measurement in person re-id evaluation.
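A minimal single-query implementation of rank-n accuracy and mAP from a query × gallery distance matrix is sketched below. The camera-ID filtering used by the standard Market-1501 protocol is deliberately omitted for brevity, so this sketch is simpler than the official evaluation code.

```python
import numpy as np

def rank_n_and_map(dist, query_ids, gallery_ids, n=1):
    """Rank-n accuracy and mAP given a (num_query x num_gallery)
    distance matrix and identity labels for both sets."""
    gallery_ids = np.asarray(gallery_ids)
    hits, aps = [], []
    for i, qid in enumerate(query_ids):
        order = np.argsort(dist[i])                # gallery sorted by distance
        matches = gallery_ids[order] == qid        # true-match mask, in rank order
        hits.append(bool(matches[:n].any()))       # rank-n hit for this query
        match_pos = np.flatnonzero(matches)
        # average precision: precision evaluated at each true-match rank
        precisions = [(k + 1) / (pos + 1) for k, pos in enumerate(match_pos)]
        aps.append(float(np.mean(precisions)) if precisions else 0.0)
    return float(np.mean(hits)), float(np.mean(aps))
```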
C. COMPARED METHODS
Because MSTNet is built on convolutional neural networks, this paper chooses two hand-crafted feature based unsupervised person re-id models to demonstrate its stronger representative ability compared with hand-crafted features. Thus, BoW [34] and UMDL [23] are introduced as the two baselines. Specifically, BoW [34] integrates Bag-of-Words to learn local features and conducts rapid global feature matching. UMDL [23] learns view-invariant and identity-discriminative information from unlabeled target data through an asymmetric multi-task dictionary transfer learning model. Both models focus on hand-crafted pedestrian features under unsupervised frameworks and are the most competitive methods in their category.
To show the superiority of the proposed MSTNet over clustering based methods, three recently proposed clustering based unsupervised person re-id approaches (PUL [6], BUC [15], and PAUL [30]) are employed. PUL [6] is an effective baseline for unsupervised re-id feature learning, iterating between pedestrian clustering and fine-tuning of the convolutional neural network to improve the initialized discriminative model. BUC [15] designs a bottom-up clustering model to jointly optimize the convolutional neural network and the relationships among individual samples. PAUL [30] introduces a patch-based unsupervised clustering framework aimed at learning discriminative features from pedestrian patches instead of the global image, optimized by an unsupervised patch-based discriminative feature learning loss. Note that these representative clustering based methods are all deep learning frameworks and cover direct clustering combined with fine-tuning [6], a novel clustering strategy on global images [15], and patch-based clustering [30].
For the compared cross-domain person re-id models, MSTNet also introduces three recently proposed methods as baselines: PTGAN [5], EANet [11], and SSG [8]. In particular, PTGAN [5] preserves the self-similarity of an image before and after translation and the domain dissimilarity between a translated source image and a target image, enforced by a Siamese network and CycleGAN. EANet [11] designs part-aligned pooling and part segmentation constraints to enhance domain adaptation and improve model generalization. SSG [8] exploits the potential similarity from the global body to local parts of unlabeled samples to build multiple clusters from different views automatically, hence the name self-similarity grouping. These domain adaptation models are also based on deep learning frameworks and cover the major transfer categories in the unsupervised person re-id task, including image translation [5] and part feature adaptation [8], [11].

D. RESULTS
This part reports the rank-n accuracies on the Market-1501 and DukeMTMC-reID datasets in Tables 1 and 2 and draws the corresponding CMC curves in Figures 3 and 4, compared with the baselines. Based on the rank-n accuracies and CMC curves, the quantitative evaluation of MSTNet is analyzed, followed by comparisons with hand-crafted feature based methods, clustering based deep learning models, and domain adaptation based deep learning methods.

1) QUANTITATIVE EVALUATION
This paper uses the Euclidean distance to measure the similarities between query and gallery images and obtains the rank-n matching accuracies. From Tables 1 and 2, it can be observed that MSTNet achieves 80.9% and 74.6% rank-1 accuracy on the Market-1501 and DukeMTMC-reID datasets respectively, and obtains mean average precisions of 55.2% and 53.6% on the two evaluated person re-id datasets. In terms of rank-5 to rank-10 accuracies, the proposed MSTNet is also effective on pedestrian images, through multi-source transfer learning among the CUHK03, Market-1501, and DukeMTMC-reID datasets. This demonstrates that the multi-source setting in transfer learning can exploit valuable transferred knowledge between multiple domains in the unsupervised person re-identification task.

2) COMPARISON TO HAND-CRAFTED FEATURE BASED METHODS
This paper introduces BoW [34] and UMDL [23] as baselines, and their results are shown in Tables 1 and 2. In the quantitative comparison, MSTNet outperforms them by at least 45.1% (80.9%-35.8%) in rank-1 accuracy. The CMC curves reveal that the overall performance of MSTNet across different ranks is better than that of the hand-crafted feature based methods. This comparison demonstrates that MSTNet not only learns more representative features than BoW, but also transfers more useful knowledge than UMDL.

3) COMPARISON TO CLUSTERING BASED MODELS
To demonstrate the superiority of MSTNet over clustering models, this paper employs three recently proposed clustering methods (PUL [6], BUC [15], and PAUL [30]) as baselines. The best among them is PAUL, with rank-1 accuracies of 68.5% and 72.0% on the Market-1501 and DukeMTMC-reID datasets, respectively, which still falls behind the proposed MSTNet by 12.4% and 2.6%. The CMC curves in Figures 3 and 4 also reveal MSTNet's better performance across different rank-n accuracies. From this comparison, the transfer learning of MSTNet, which exploits useful knowledge from source domains, is more effective than clustering based models, which directly learn similarity metrics on unlabeled target samples.

4) COMPARISON TO DOMAIN ADAPTATION METHODS
Finally, this paper discusses the differences between MSTNet and state-of-the-art domain adaptation based person re-identification models, including PTGAN [5], EANet [11], and SSG [8]. According to Tables 1 and 2, the proposed MSTNet holds a clear advantage in rank-n and mAP performance over PTGAN and EANet. As for SSG, the proposed MSTNet achieves a superiority of 0.9% (80.9%-80.0%) in rank-1 accuracy on Market-1501, but shows little difference in rank-5, rank-10, and mAP. On the DukeMTMC-reID dataset, however, MSTNet achieves better rank-n accuracy and mAP than SSG, as well as better CMC curves in Figures 3 and 4. This demonstrates that the multi-source transfer network can transfer more useful knowledge from the source domains, and proves the effectiveness of the proposed MSTNet.

E. ABLATION STUDY
In this subsection, the influences of the main components of the multi-source transfer network are discussed, including the multi-source setting and the task-level and pixel-level transfer learning modules. The analysis of these modules is based on the Market-1501 dataset and is implemented by modifying the modules under identical experimental settings (such as training epochs, training and testing data division, learning rate, etc.). In addition, this part conducts a parameter analysis of the balance parameters to determine their optimal values.

1) INFLUENCE OF MULTI-SOURCE DOMAINS
The multi-source setting in MSTNet allows learning various discriminative transferable knowledge from different source domains, rather than from a single source. To validate the influence of the multi-source setting, a single-source variant of MSTNet (Single Transfer Network, STNet) is implemented. Specifically, STNet utilizes either the CUHK03 or the DukeMTMC-reID dataset as the single source domain and evaluates the effectiveness on Market-1501; the results are summarized in Table 3. It can be observed that the multi-source setting improves the rank-1 accuracy by at least 6.3% (80.9%-74.6%) and the mAP by at least 5.5% (55.2%-49.7%). This result proves that the multi-source setting can transfer more valuable knowledge than the single-source setting.

2) INFLUENCE OF PIXEL-LEVEL TRANSFER LEARNING
Another major novelty of the proposed MSTNet is the pixel-level transfer learning module, which is achieved by a target-biased generator with a relativistic discriminator. To evaluate the effectiveness of this module, the generator and its corresponding discriminator are removed from MSTNet, yielding a variant named MSTNet-pixel. From the results reported in Table 3, this modified method reaches a rank-1 accuracy of 75.8% and an mAP of 50.4%, falling behind MSTNet by 5.1% in rank-1 accuracy and 4.8% in mAP. This comparison shows the importance of the pixel-level transfer learning module in the proposed MSTNet, which ensures that images from the multiple source domains can be transformed into the target domain to boost the feature learning of the feature extractor.

3) INFLUENCE OF TASK-LEVEL TRANSFER LEARNING
To alleviate the task-level gap between the different source and target domains, MSTNet introduces the task-gap bridging loss to conduct task-level transfer learning. Therefore, the task-gap bridging loss is removed to validate the influence of the task-level transfer learning module, yielding a variant named MSTNet-task. From the results reported in Table 3, MSTNet-task obtains a rank-1 accuracy of 76.2% and an mAP of 51.1%, leaving a considerable gap to MSTNet. This analysis shows that task-level transfer learning plays an important role in the multi-source transfer network, aligning the various task-level gaps into a consistent distribution.

4) PARAMETER ANALYSIS
To show how the balance parameters are decided, this part selects values from the range [0.4, 1.0] with a step of 0.1 for the balance parameters λ1, λ2, and λ3, and implements MSTNet with each of them. When one parameter is evaluated, the others are fixed and kept at the same experimental settings as the original MSTNet implementation. Taking the experimental results on Market-1501 as an example, this paper reports the rank-1 accuracies with different parameter values in Table 4. It can be observed that MSTNet achieves the optimal rank-1 accuracy with λ1 = 0.8, λ2 = 0.5, and λ3 = 0.6, which demonstrates the different contributions of each term (L_De + L_G, L_tt, L_tgb) to the final person re-id performance. Moreover, the parameters λ and m are set to 0.8 and 0.35, respectively, through similar evaluations.
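The one-at-a-time sweep described above can be sketched as a coordinate-wise grid search. Here `train_and_eval` is a hypothetical callback, assumed to train MSTNet with the given λ1, λ2, λ3 and return the rank-1 accuracy on Market-1501; it stands in for the full training pipeline, which is not shown.

```python
def grid_search_balance_params(train_and_eval, init=(0.5, 0.5, 0.5)):
    """Coordinate-wise search: sweep one balance parameter over
    [0.4, 0.5, ..., 1.0] while the other two stay fixed, and keep
    the value giving the best rank-1 accuracy."""
    grid = [round(0.4 + 0.1 * i, 1) for i in range(7)]   # 0.4 to 1.0, step 0.1
    best = list(init)
    for k in range(3):                                   # lambda_1, lambda_2, lambda_3 in turn
        scores = {}
        for v in grid:
            trial = best.copy()
            trial[k] = v
            scores[v] = train_and_eval(*trial)           # rank-1 accuracy for this setting
        best[k] = max(scores, key=scores.get)            # keep the best value, fix it
    return tuple(best)
```

This sweeps only 3 x 7 = 21 configurations instead of the 343 of a full grid, which is why the remaining parameters are held fixed while one is evaluated.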

V. CONCLUSION
In this paper, a Multi-Source Transfer Network (MSTNet) is proposed to solve the cross domain person re-identification task, which leverages the domain gaps in feature-level, pixel-level, and task-level between multiple source domains and the target domain through the novel target-biased multi-source transfer learning module, relativistic adversarial learning module, and task-gap bridging module. As a result, the proposed MSTNet can learn a target-biased feature space that relaxes the single-source to single-target limitation of previous domain adaptation person re-id models. Through extensive experiments compared with state-of-the-art methods, the proposed MSTNet achieves competitive performance, which sufficiently demonstrates its effectiveness in the cross domain person re-identification task.