Deep Multi-Task Transfer Network for Cross Domain Person Re-Identification

As a prominent application of surveillance video analysis, person re-identification has recently attracted growing research attention. Existing person re-identification models mostly rely on supervision from pedestrian identity annotations, which limits their scalability in realistic scenarios. Although several unsupervised person re-identification studies address this problem, they are either clustering based or cross domain based, and both conventionally assume that the number of identities in the target dataset is known. To relax this assumption, we propose a Deep Multi-task Transfer Network (DMTNet) for cross domain person re-identification, which conducts classification, attribute attention and identification tasks across the source and target domains. DMTNet has three main novelties: a cluster number estimation algorithm that learns prior knowledge from source data to estimate the identity number, attribute attention importance learning instead of direct use of attribute information, and a multi-task transfer learning mechanism that transfers specific tasks across domains. To evaluate DMTNet, we conduct comparative experiments on the DukeMTMC-reID and Market-1501 datasets, and the results show the advantage of our network. Moreover, the discussion of the different modules points out the significance of the specific tasks.


I. INTRODUCTION
Person re-identification (re-id) is a prominent technology in surveillance systems because of its significance in pedestrian retrieval tasks such as criminal tracking and pedestrian locating. The goal of person re-id is to identify a specific identity in gallery images, given a single probe image. This task faces formidable challenges from severe variations in resolution, viewpoint, pose, occlusion and illumination across different cameras, which most person re-id models focus on solving. Figure 1 illustrates the challenges in different datasets. Though existing person re-id approaches achieve adequate performance, they work under a supervised framework and need large amounts of annotations to train the models. That seriously limits the scalability of person re-id in realistic scenes.
The associate editor coordinating the review of this manuscript and approving it for publication was Wenming Cao.
To broaden the applicability of person re-id models, a number of studies focus on unsupervised person re-identification, which is divided into clustering based methods [4], [13], [27] and domain adaptation models [5], [12], [16], [17], [26]. Clustering based person re-id models analyze only the unlabeled dataset and generally yield poor pedestrian matching performance because they lack strong supervised tuning and optimization. Cross Domain Person Re-id (CDPR) models are a preferable solution that overcomes the shortcomings of clustering methods. A CDPR model learns transferable knowledge from a completely labeled dataset (denoted as the source domain) and embeds it into an unlabeled dataset (denoted as the target domain), which differs in its objective variations and contains no pedestrian identities overlapping with the source domain. The CDPR model can therefore handle urgent person re-identification tasks in which there is not enough time to annotate a vast number of identity labels. Besides relaxing the constraint of unlabeled data, the CDPR model faces more complex situations than the conventional person re-id problem. The most challenging task in CDPR is to bridge the distribution gap between the source and target domains (shown in Figure 1), which attracts much of the research attention in existing CDPR approaches [16], [26]. Xu et al. [26] proposed an attribute feature learning based deep neural network that transfers augmented attribute features across domains to resolve the cross dataset person re-id task, so as to distinguish pedestrians with similar attributes by the additionally learned image information. Lv et al. [16] employed a spatial-temporal information transfer mechanism to conduct unsupervised learning in the target domain, which trained a robust classification network in the source domain as guidance to boost the discriminative ability of the person features extracted from the target domain. After obtaining the final feature, it adopted a learning-to-rank based boosting process to further train the classification based model of the unlabeled target domain.
VOLUME 8, 2020
Compared with clustering based person re-identification models, which often annotate soft labels on pedestrian images to provide identity information, CDPR models focus only on alleviating the domain gap between the source and target domains through the matching model trained in the source domain. They ignore the pedestrian identity information and are therefore less effective in realistic scenes. Furthermore, directly introducing the soft-label method into the target domain is not effective, because clustering based person re-id models conventionally assume that the number of pedestrian identities is known. This assumption is impractical for the target domain, and these models cannot estimate the cluster number needed to provide sufficient soft identity information.
To address the drawbacks analyzed above, we propose a novel multi-task transfer learning framework for cross domain person re-identification, named the Deep Multi-task Transfer Network (DMTNet). It not only learns a discriminative feature representation from the source domain but also transfers it into the target domain with a cluster estimation algorithm, which supports a soft multi-task learning procedure there as in the source domain. It can thus perform soft-label clustering with an unknown number of classes in unsupervised person re-identification. DMTNet is composed of an identity information classification module, an attributes-attention module and an identification module in the source domain, an identity information estimation module in the target domain, and a multi-task transfer module across both domains, which together guarantee the representative ability of the features and the identity estimation during multi-task transfer learning.

II. RELATED WORK
In this section, we review research on unsupervised person re-identification, which is divided into clustering based person re-id and cross domain person re-id methods according to the surveys [9], [24]. In addition, this paper introduces a multi-task learning architecture into the CDPR model, so we also discuss some multi-task learning approaches in supervised person re-identification.

A. CLUSTERING BASED PERSON RE-IDENTIFICATION
Existing clustering based person re-identification models often minimize the distance between similar pedestrian images and maximize the distance between dissimilar images using the soft labels provided by clustering algorithms [3], [13], [27]. Specifically, Lin et al. [13] focused on diversity across different identities and similarity within the same identity, and utilized a diversity regularization term in the bottom-up clustering procedure to balance the data volume of each cluster, which achieved an effective trade-off between diversity and similarity. Yang et al. [27] proposed a patch-based unsupervised learning framework to learn discriminative features from patches instead of whole images. The patch-based learning leveraged the similarity between patches to learn a discriminative model, guided by an unsupervised patch-based discriminative feature learning loss and an image-level feature learning loss. Ding et al. [3] introduced the statistical concept of 'dispersion' to constrain the clustering algorithm according to the dispersion state, and proposed a clustering based unsupervised person re-id model that can exploit the underlying feature space of unlabeled pedestrian image data.
Though these clustering based unsupervised person re-id models have made progressive improvements, they remain far from realistic application because they lack any prior knowledge of the unlabeled data. Their chief drawback is the assumption that the unlabeled person re-id dataset has a known number of identities, which is hard to satisfy in real scenes.

B. CROSS DOMAIN PERSON RE-IDENTIFICATION
Another frequently used strategy is the cross domain person re-id solution. These methods either reduce the feature gap or transform the image style to bridge the distribution gap [2], [10], [23]. Li et al. [10] proposed a pose disentanglement and adaptation network that learns a deep image representation with pose and domain information properly disentangled, and it can perform pose disentanglement across domains without identity supervision. Chen et al. [2] proposed an instance-guided context rendering scheme for cross-domain person re-identification, which transfers the source person identities into diverse target domain contexts to enable supervised re-id model learning in the unlabeled target domain. Wei et al. [23] proposed a person transfer generative adversarial network to bridge the domain gap among datasets, which imposes extra constraints on the person foregrounds to ensure the stability of their identities during transfer.
These cross domain person re-id models alleviate the domain gap by image style transfer or by learning a shared feature space, and achieve competitive results compared with clustering based methods. However, they rely heavily on the source domain when training a discriminative feature representation and do not consider the identity information of the target data.

C. MULTI TASK PERSON RE-IDENTIFICATION
Both clustering based and cross domain models have their weaknesses. We therefore design a novel multi-task learning framework that can not only estimate the cluster number in the target domain to utilize identity information but also bridge the domain gap between the source and target domains. It takes full advantage of the soft identity information obtained during cluster estimation to overcome the weakness of existing cross domain person re-identification models. In this subsection, we describe existing multi-task learning based person re-identification models.
Existing multi-task person re-id models are almost all developed under the supervised framework [1], [14], [21]. Chen et al. [1] were the first to integrate a binary classification task and a ranking task into a unified framework, named MTDnet. Inspired by MTDnet, Ling et al. [14] proposed a multi-task learning network with four different losses for person re-identification, covering the person re-identification, pedestrian identity and pedestrian attribute tasks, which provide complementary information from different perspectives. Wang et al. [21] proposed a multi-task attentional network with curriculum sampling, which contains a fully attentional block and a curriculum sampling method for training with ranking losses.
These multi-task learning approaches integrate several task-specific goals into a unified network to boost feature representation learning. Different from them, DMTNet combines clustering, attribute learning and domain adaptation into a multi-task cross domain person re-identification approach, which can conduct unsupervised person re-identification without prior knowledge of the number of pedestrians in the target domain. A detailed description of DMTNet is given in Section III.

III. OUR APPROACH
In this section, we describe our proposed Deep Multi-task Transfer Network (DMTNet) in detail, and illustrate the optimization of the whole algorithm.

A. APPROACH OVERVIEW
To solve the cross-domain person re-identification problem, we propose a Deep Multi-task Transfer Network (DMTNet), which consists of two main parts: an identity information classification task, an attributes-attention task and an identification task in the source domain, and an identity information estimation module and a multi-task transfer module in the target domain. As a novelty, we design an attributes-attention mechanism, motivated by the positive effects of introducing attributes in existing person re-id methods. This module learns the importance of each learned attribute feature and then combines all attribute features, weighted by their importance, into a final attribute feature. Then, we employ part of the source domain as guidance for estimating the number of pedestrian clusters in the target domain, which drives a soft classification task for the target data. Finally, the estimated soft labels of the target data and the attribute network are combined to conduct multi-task network training in the target domain with a transfer function. Therefore, our multi-task transfer module can bridge the domain gap between the source and target domains, which is achieved by the attributes-attention and identity information estimation modules. The detailed architecture is shown in Figure 2.

B. MULTI-TASK LEARNING IN SOURCE DOMAIN
In cross domain person re-identification, we assume a labeled source domain $D^s = \{(x_i^s, y_i^s) \mid i = 1, \cdots, N^s\}$ and an unlabeled target domain $D^t = \{x_i^t \mid i = 1, \cdots, N^t\}$, where $D^t$ contains the target matching data, $x_i^t$ is the $i$-th image in the target domain and $N^t$ is the number of pedestrian images in the target domain. Then, we introduce the attribute labels in the source domain to build the multi-task network, and define the attribute annotations $a_i = (a_i^1, \cdots, a_i^j, \cdots, a_i^m)$ for the $i$-th image $x_i^s$ in the source domain, where $a_i^j$ represents the $j$-th attribute of $x_i^s$. Note that there are no attribute labels in the target domain $D^t$; we aim at learning the attribute attention network in the source domain and transferring it to the target domain.
FIGURE 2. Architecture of our proposed deep multi-task transfer network. Both source and target data are fed into a shared CNN module to obtain their basic pedestrian features $F_0^{s(i)}$, on which an MMD constraint is imposed. Then, the identity information classification task, the attributes-attention task and the identification task are deployed in the source domain, while the identity information task and a multi-task transfer module are deployed in the target domain. Finally, this deep multi-task transfer network outputs a fused feature for pedestrian representation through a concatenation layer. In addition, this architecture employs a novel identity estimating cluster algorithm to annotate soft labels, which are used to conduct identification training for the target data.
First, we design the multi-task network for the source domain to achieve the identity-classification, attribute-attention and identification tasks. These processes are implemented by a backbone network that extracts basic pedestrian features and a series of attention operations, both of which are constrained by related objective functions. In detail, a backbone feature extraction network $\gamma(x_i^s, \theta_0)$ is employed for the source pedestrian image $x_i^s$, where $\gamma(\cdot, \theta_0)$ is the backbone convolutional neural network with parameters $\theta_0$. The basic pedestrian feature $F_0^{s(i)}$ for $x_i^s$ is therefore obtained by the backbone network as $F_0^{s(i)} = \gamma(x_i^s, \theta_0)$. After extracting the basic feature, we add several parallel fully-connected layers to transform it into pedestrian classification, attribute attention and identification features. They are expressed by $f(\cdot, \theta_{cl})$, $f(\cdot, \theta_{at})$ and $f(\cdot, \theta_{id})$ respectively, where $\theta_{at} = \{\theta_{at(1)}, \cdots, \theta_{at(j)}, \cdots, \theta_{at(m)}\}$ is the collection of attribute layer parameters and $\theta_{at(j)}$ is the parameter of the fully-connected layer for the $j$-th attribute. Through these transformations, we acquire the task-specific features for $x_i^s$ in the different tasks, namely the classification-specific feature $F_{cl}^{s(i)} = f(F_0^{s(i)}, \theta_{cl})$, the attribute-specific feature $F_{at}^{s(i)} = f(F_0^{s(i)}, \theta_{at})$ and the identification-specific feature $F_{id}^{s(i)} = f(F_0^{s(i)}, \theta_{id})$. Because all three task-specific features are derived from it, the basic feature $F_0^{s(i)}$ carries class, attribute and identity information, all of which are important for person re-identification. To strengthen the representative ability of the basic feature, we devise three objective loss functions for the task-specific features.
The first one is the classification loss on $F_{cl}^{s(i)}$, for which we employ a softmax layer to generate the probabilities of belonging to each identity in the source domain. This term classifies each image into pedestrian classes and makes the basic feature preserve pedestrian identity characteristics. Moreover, DMTNet introduces an attribute attention loss function on the attribute-specific features to retain the attribute information in the basic pedestrian features. For the attribute-specific feature $F_{at}^{s(i)}$, we assume an attention matrix $U^{s(i)} = f(F_{at}^{s(i)}, \theta_u)$, which estimates the importance of each attribute. This part is achieved by a constraint between the attribute labels and the attention matrix, where $U_a^{s(i)} \in \mathbb{R}^{m \times m}$ is composed of the pedestrian attribute labels, which are expanded from a vector of size $1 \times m$. Each element of this vector is in $[0, 1]$ and denotes whether the pedestrian has the attribute. From this constraint, the attribute attention coefficient of each pedestrian image can be obtained to strengthen the representative ability of the attribute-specific feature, where $J$ denotes the column sum calculation.
Finally, person re-identification is in essence a matching problem. We have acquired the identification-specific feature for each pedestrian image. However, the matching feature requires not only the identification-specific feature but also the pedestrian identity information and its attribute information. Therefore, we combine the three task-specific features into a fused feature representation $F_{mat}^{s(i)}$ by concatenation, denoted $[\cdot]$. With the final feature representation of each pedestrian image, we employ the triplet loss function in the source domain. For each positive pedestrian image pair, we generate a triplet consisting of the positive pair $(x_i^s, x_j^s)$ and a negative sample $x_k^s$. We select as the negative sample the image ranked first by its similarity to the positive pair. Their final feature representations should satisfy the condition used in FaceNet [20], $\|F_{mat}^{s(i)} - F_{mat}^{s(j)}\|_2^2 + m < \|F_{mat}^{s(i)} - F_{mat}^{s(k)}\|_2^2$, where $m$ is the margin. Therefore, we introduce the triplet loss function to ensure the matching performance. With the devised feature extraction mechanism and multi-task learning loss functions, DMTNet can conduct the matching procedure in the source domain. The remaining problem is how to transfer this network into the target domain and how to implement multi-task learning in the completely unlabeled target domain.
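The triplet constraint described above can be illustrated with a minimal plain-Python sketch. It assumes the standard FaceNet-style hinge form on squared Euclidean distances and picks the hardest negative (the impostor closest to the anchor); the function names are hypothetical, and the default margin 0.35 is taken from the implementation details of this paper. It is an illustration under these assumptions, not the authors' implementation.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def hardest_negative(anchor, negatives):
    """Return the negative feature closest to the anchor (top-ranked impostor)."""
    return min(negatives, key=lambda n: sq_dist(anchor, n))


def triplet_loss(anchor, positive, negatives, margin=0.35):
    """Hinge triplet loss: max(0, d(a, p) - d(a, n) + margin)."""
    negative = hardest_negative(anchor, negatives)
    gap = sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin
    return max(0.0, gap)
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin, which is exactly the FaceNet condition stated in the text.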

C. MULTI-TASK LEARNING IN TARGET DOMAIN
The last subsection described the DMTNet model in the source domain. This part builds the multi-task transfer model that handles multi-task learning in the target domain.
In the target domain, three specific tasks are considered. DMTNet focuses on transferring the multi-task learning from the source domain, including the classification, attribute learning and identification tasks. The most intractable task to transfer is classification, because the source and target domains do not share any common pedestrian. For the attribute learning and identification tasks, the domain gap can be alleviated by the Maximum Mean Discrepancy (MMD) constraint [7].
Based on the model pre-trained in the source domain, DMTNet outputs a basic feature $F_0^{t(i)}$, a classification-specific feature $F_{cl}^{t(i)}$, an attribute-specific feature $F_{at}^{t(i)}$ and an identification-specific feature $F_{id}^{t(i)}$ for an input pedestrian image $x_i^t$ in the target domain. We impose the MMD constraint on the basic feature, the attribute-specific feature and the identification-specific feature to make the source and target distributions consistent. Note that the MMD loss function does not constrain the classification-specific feature, because doing so may lose important identity information. It is better to train the classification task independently, but conventional unsupervised person re-identification clustering algorithms rely on an exact pedestrian number, which cannot be known in advance in realistic scenes. Thus, DMTNet proposes a Novel Identity Estimating Cluster (NIEC) algorithm that discovers the number of pedestrian identity clusters and annotates soft labels on the target data, which are then used to train the classification task.
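As a rough illustration of the MMD constraint mentioned above, the following sketch computes a biased estimate of the squared MMD between a set of source features and a set of target features. The RBF kernel and its bandwidth `gamma` are assumptions made for this example, since the kernel used in the paper is not stated in this text, and the function names are hypothetical.

```python
import math


def rbf_kernel(a, b, gamma=1.0):
    """Gaussian (RBF) kernel between two feature vectors."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))


def mmd_squared(source, target, gamma=1.0):
    """Biased estimate of the squared MMD between two lists of feature vectors."""
    def mean_kernel(xs, ys):
        total = sum(rbf_kernel(x, y, gamma) for x in xs for y in ys)
        return total / (len(xs) * len(ys))

    return (mean_kernel(source, source)
            + mean_kernel(target, target)
            - 2.0 * mean_kernel(source, target))
```

The value is close to zero when the two feature sets follow the same distribution and grows as they drift apart, which is why minimizing it pulls the source and target features into a shared space.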
The Novel Identity Estimating Cluster algorithm aims at discovering new pedestrian identities using the experience gained from the source domain. NIEC is inspired by Deep Embedding Clustering (DEC) [25]. However, the goal of NIEC is not only to determine the cluster points but also to discover the number of clusters in the target domain.
Following the DEC approach, let $P^s(c^s \mid i)$ be the probability that the source pedestrian image $x_i^s$ belongs to identity cluster $c^s \in \{1, \cdots, C^s\}$, where $C^s$ denotes the number of identities in the source pedestrian images. DEC employs a Student's t distribution as the initial parameterization of this probability, where $\mu_{c^s}$, $c^s = 1, \cdots, C^s$, is the center of the $c^s$-th identity cluster. Assuming that pedestrian data indices are sampled uniformly (i.e. $P^s(i) = 1/C^s$), the joint distribution can be written as $P^s(i, c^s) = P^s(c^s \mid i)/C^s$. In the target domain, we suppose that the pedestrian data also follow the t distribution, denoted as $P^t(c^t \mid i)$, where $c^t \in \{1, \cdots, C^t\}$ and $C^t$ denotes the number of identities in the target pedestrian images.
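For readers unfamiliar with DEC, the Student's t parameterization can be sketched as follows. This follows the soft assignment published with DEC, with the degrees of freedom `alpha` defaulting to 1; which feature of DMTNet is fed into it is not specified here, so `feature` and `centers` are placeholders and the function name is hypothetical.

```python
def soft_assign(feature, centers, alpha=1.0):
    """DEC-style Student's t soft assignment of one feature to cluster centers.

    Returns one probability per center; the probabilities sum to 1.
    """
    weights = []
    for mu in centers:
        d2 = sum((x - y) ** 2 for x, y in zip(feature, mu))
        weights.append((1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0))
    total = sum(weights)
    return [w / total for w in weights]
```

For example, a feature lying exactly on the first of two centers that are one unit apart receives probability 2/3 for that center and 1/3 for the other.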
Because the classification task outputs the class probability of each pedestrian image, the domain distribution gap of the predicted probabilities is reduced by minimizing the KL divergence between the joint distributions $P^s(i, c^s) = P^s(c^s \mid i)/C^s$ and $P^t(i, c^t) = P^t(c^t \mid i)/C^t$. We employ the symmetrized version of the KL divergence.

This term keeps the source and target distributions consistent, but it requires the explicit identity number of the target domain. In realistic cross domain person re-identification, it is very hard to know the number of pedestrians in the target data. Therefore, DMTNet devises an identity number estimation mechanism to find the person count in the target domain, so that the classification model learned from the source domain can be transferred.
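The symmetrized KL divergence referred to above can be sketched for two discrete distributions. This example assumes both distributions are defined over the same support and uses the plain sum KL(p||q) + KL(q||p); the exact form and normalization used in the paper are not given in this text, so treat the helper names and the small smoothing constant `eps` as illustrative assumptions.

```python
import math


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions on the same support."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def symmetric_kl(p, q):
    """Symmetrized KL divergence: KL(p || q) + KL(q || p)."""
    return kl(p, q) + kl(q, p)
```

Unlike the one-sided divergence, this quantity does not depend on which domain is treated as the reference, so neither the source nor the target prediction is privileged.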
Through the MMD and KL loss functions, the pedestrian images of both the source and target domains are transformed into a shared feature space. This makes the identity number estimation model trained on source data appropriate for the target data. On this basis, we utilize the source data to train our NIEC algorithm. In detail, we split the $C^s$ known classes in $D^s$ into a training subset $D^s_t$ with $C^s_t$ classes and a validation subset $D^s_v$ with $C^s_v$ classes. Moreover, the target data in $D^t$ is regarded as the testing subset.
Then we run a semi-supervised k-means clustering method on $D^s_t \cup D^t$ to estimate the number of identities in $D^t$. Namely, during k-means, we force the images in the training subset $D^s_t$ to map to clusters following their ground-truth labels, while the images in the validation subset $D^s_v$ are treated as additional "unlabeled" data. We run this constrained k-means multiple times, sweeping the total number of categories $C$ in $D^s_t \cup D^t$, and measure the constrained clustering quality on $D^s_t \cup D^t$ by the estimation on $D^s_v$. To this end, we employ two evaluation criteria [6] on the estimated result for $D^s_v$ to evaluate the clustering effectiveness under $C$ classes.
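The constrained k-means step can be sketched in plain Python as below. This is a simplified illustration rather than the NIEC implementation: labeled points always stay in the cluster of their label, unlabeled points go to the nearest center, and the extra centers are initialized at the unlabeled points farthest from the existing centers (an initialization chosen here for determinism). All function and parameter names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of feature vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def constrained_kmeans(labeled, labels, unlabeled, k, iters=20):
    """k-means in which labeled points are pinned to the cluster of their label.

    labels are ints in [0, n_known) and k >= n_known is the total cluster count.
    Returns (centers, cluster index of every unlabeled point).
    """
    n_known = max(labels) + 1
    # Known clusters start at the mean of their labeled members.
    centers = [mean([x for x, y in zip(labeled, labels) if y == c])
               for c in range(n_known)]
    # Extra clusters start at the unlabeled point farthest from all centers.
    while len(centers) < k:
        centers.append(max(unlabeled,
                           key=lambda x: min(sq_dist(x, c) for c in centers)))
    assign = []
    for _ in range(iters):
        # Unlabeled points move freely; labeled points never change cluster.
        assign = [min(range(k), key=lambda c: sq_dist(x, centers[c]))
                  for x in unlabeled]
        for c in range(k):
            members = [x for x, a in zip(unlabeled, assign) if a == c]
            members += [x for x, y in zip(labeled, labels) if y == c]
            if members:
                centers[c] = mean(members)
    return centers, assign
```

In the sweep described in the text, this routine would be called once per candidate total cluster count, and each result would then be scored on the held-out validation classes.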
The first criterion is the Overall Clustering Accuracy (ACC), which is applicable to the $C^s_v$ labeled classes in the validation subset $D^s_v$ and is given by Eq. (11), where $N^s_v$ is the number of validation images, $g(y_i^s)$ denotes the ground-truth label and $\bar{y}_i^s$ is the estimated clustering assignment of each image in $D^s_v$. This term ensures that as many labels as possible are estimated correctly for the validation subset $D^s_v$. The other criterion is the Cluster Metric Measurement (CMM), which captures notions of intra-identity cohesion and inter-identity separation based on the estimated clusters and is applicable to the unlabeled data $D^t$.

Algorithm 1 Deep Multi-Task Transfer Network (DMTNet)
Initialization: The backbone network $\gamma(x_i^s, \theta_0)$ and the task-specific feature extractors $f(\cdot, \theta_{at})$, $f(\cdot, \theta_{cl})$ and $f(\cdot, \theta_{id})$, trained with the data in the labeled source domain $D^s = \{(x_i^s, y_i^s) \mid i = 1, \cdots, N^s\}$. Given a candidate identity number $C_0 \le C^* \le C_{max}$ for $D^s_t \cup D^t$, and parameter $m = 0.35$ in Eq. 7.
Training in source domain:
for $t \in 1, \cdots, N^s$ do
    Train $\theta_0$, $\theta_{at}$, $\theta_{cl}$, $\theta_{id}$ on $D^s$ by Eq. 2, 3 and 7.
end for
for $C_0 \le C^* \le C_{max}$ do
    K-means cluster and annotate soft labels for the target data.
    Training in target domain:
    for $t \in 1, \cdots, N^t$ do
        Train $\theta_0$, $\theta_{at}$, $\theta_{cl}$, $\theta_{id}$ on $D^s$ by Eq. 2, 3, 7, 8 and 10 for the target data.
        Train the whole network by Eq. 11 and 12.
        If the error has converged, return $\theta_0$, $\theta_{at}$, $\theta_{cl}$, $\theta_{id}$.
    end for
    Select the optimal value of $C^*$.
end for
Train in the target domain again with the selected $C^*$.
Return the parameters of DMTNet.
When these two criteria converge at a fixed value, the cluster number $C^*$ in $D^s_t \cup D^t$ is obtained, and the identity number of the target domain is $C^* - C^s_t$. With the obtained clusters, we annotate a soft identity label for each pedestrian image in the target domain $D^t$. Thus, the whole multi-task learning approach can be transferred to the target data to produce the fused feature $F_{mat}^{t(i)}$. The testing procedure between probe and gallery in the target domain is conducted on these fused features, where $c_i^t$ is the predicted label of a pedestrian image in the target domain. The DMTNet algorithm is summarized in Algorithm 1.
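The clustering accuracy criterion used to select $C^*$ can be illustrated with a small sketch. It assumes the common definition of clustering accuracy, in which cluster indices are matched one-to-one to ground-truth labels so that the number of agreements is maximal; the brute-force search over all mappings is a simplification chosen to keep the example dependency-free, and the function name is hypothetical.

```python
from itertools import permutations


def clustering_accuracy(true_labels, cluster_ids):
    """Best one-to-one matching accuracy between labels and cluster indices.

    Tries every mapping from cluster ids to labels, so it is only practical
    for a handful of classes.
    """
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_ids))
    # Pad with None so that surplus clusters can stay unmatched.
    padded = classes + [None] * max(0, len(clusters) - len(classes))
    best = 0
    for perm in permutations(padded, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(mapping[c] == y for y, c in zip(true_labels, cluster_ids))
        best = max(best, hits)
    return best / len(true_labels)
```

With realistic numbers of identities the same matching is usually solved with the Hungarian algorithm, for example `scipy.optimize.linear_sum_assignment`, instead of enumerating permutations.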

IV. EXPERIMENTS

A. DATASETS AND EXPERIMENTAL SETTINGS
To validate the efficacy of DMTNet, we conduct evaluation experiments on two large scale datasets, DukeMTMC-reID [19], [29] and Market-1501 [28], which have attribute labels and are widely used for cross domain person re-identification models.
The DukeMTMC-reID dataset [19], [29] is a large scale person re-identification dataset that is suitable for training deep neural networks. The pedestrian images are captured by 8 surveillance cameras and comprise 36,411 annotated images of 1,404 persons. They are split into 16,522 training images of 702 persons, and 2,228 probe images and 17,661 gallery images of the other 702 persons.
The Market-1501 dataset [28] contains 32,668 annotated pedestrian images of 1,501 persons, captured by 6 surveillance cameras. This large scale dataset is divided into two partitions by purpose: 12,936 images of 751 individuals are used for training, and the remaining 19,732 images of 750 individuals are used for testing. The 3,368 probe images of the 750 testing persons are used to match target identities in the gallery.
The evaluation protocols are kept consistent with conventional person re-id models. We utilize the Cumulative Matching Curve (CMC) to produce the ranking accuracy, and adopt the mean Average Precision (mAP), which reflects the overall precision and recall, to evaluate the performance of our approach. In this section, we report the rank-1, rank-5 and rank-10 accuracies and mAP on the two experimental datasets (Table 3), and draw the CMC curves for rank-n ($1 \le n \le 20$) (Figures 3 and 4).
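To make the two metrics concrete, the sketch below computes rank-k accuracy and mAP from per-probe ranking results. Each probe is represented by a list of booleans over the gallery sorted by decreasing similarity, where `True` marks a gallery image of the same identity. This is a simplified illustration: the standard protocols for these datasets also discard same-camera matches, which is omitted here, and the function names are hypothetical.

```python
def rank_hit_and_ap(ranked_matches, k):
    """Return (hit within top-k, average precision) for one probe."""
    rank_hit = any(ranked_matches[:k])
    hits, precisions = 0, []
    for pos, is_match in enumerate(ranked_matches, start=1):
        if is_match:
            hits += 1
            precisions.append(hits / pos)
    ap = sum(precisions) / len(precisions) if precisions else 0.0
    return rank_hit, ap


def evaluate(all_ranked_matches, ks=(1, 5, 10)):
    """Return ({k: rank-k accuracy}, mAP) over all probes."""
    n = len(all_ranked_matches)
    cmc = {k: sum(rank_hit_and_ap(r, k)[0] for r in all_ranked_matches) / n
           for k in ks}
    mean_ap = sum(rank_hit_and_ap(r, 1)[1] for r in all_ranked_matches) / n
    return cmc, mean_ap
```

Rank-k accuracy only asks whether any correct match appears in the top k, while mAP rewards placing all correct matches early, which is why the two numbers can differ substantially for the same model.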

B. IMPLEMENTATION DETAIL
To train our deep multi-task transfer network, we adopt ResNet50 with parameters pre-trained on ImageNet as the basic feature extractor, following Luo et al. [15], and implement the network in PyTorch. The objective function of DMTNet is optimized by the Adam solver [8] with a mini-batch size of 32 on an Ubuntu 16.04 system with an NVIDIA GeForce RTX 2080 Ti GPU. The learning rate of the whole network is initialized to 2e-4 and kept for the first half of training, and is then decayed to 0 by the end of training. The parameter $m$ is set to 0.35, and the dimension of the output matching feature is 256.
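The learning rate schedule described above can be sketched as a small function. The text only states that the rate starts at 2e-4 and reaches 0 at the end of training; the linear shape of the decay in the second half is an assumption made for this example, and the function name and step-based interface are hypothetical.

```python
def learning_rate(step, total_steps, base_lr=2e-4):
    """Constant rate for the first half, then linear decay to 0 (assumed shape)."""
    half = total_steps / 2
    if step <= half:
        return base_lr
    remaining = (total_steps - step) / (total_steps - half)
    return base_lr * max(0.0, remaining)
```

With 100 training steps, for instance, the rate stays at 2e-4 through step 50, drops to 1e-4 at step 75 and reaches 0 at step 100.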
Specifically, LOMO [11] is an effective hand-crafted feature representation based on local maximal occurrence, which analyzes the horizontal occurrence of local features and maximizes the occurrence to obtain a representation that is stable against viewpoint changes. This feature is often used in conventional machine learning models for person re-identification. BoW [28] is a Bag-of-Words model, which accommodates local features and enables fast global feature matching. UMDL [18] is an asymmetric multi-task dictionary learning model that learns view-invariant and identity-discriminative information from unlabeled target data. Among the deep clustering based methods, BUC [13] is a bottom-up clustering approach that jointly optimizes a convolutional neural network and the relationship among the individual samples. DBC [3] is a clustering based unsupervised person re-id model that exploits the underlying feature space of unlabeled pedestrian image data through the statistical concept of 'dispersion'. PAUL [27] is a patch-based unsupervised learning framework that learns discriminative features from patches instead of whole images, combined with an unsupervised patch-based discriminative feature learning loss. Furthermore, two state-of-the-art cross domain approaches (TJ-AIDL [22] and CR-GAN [2]) also report experiments on Market-1501 and DukeMTMC-reID. TJ-AIDL [22] transfers the label information of an existing dataset to a newly seen unlabeled target domain for person re-id without any supervision in the target domain, and simultaneously learns an attribute-semantic and identity-discriminative feature representation space that is transferable to the target domain. CR-GAN [2] formulates a dual conditional generative adversarial network that augments each source person image with rich contextual variations, and leverages abundant unlabeled target instances as contextual guidance for image generation.

2) PERFORMANCE ON DukeMTMC-reID
For the DukeMTMC-reID dataset, we use Market-1501 as the source domain and DukeMTMC-reID as the target domain, which matches the setting of the compared cross domain methods. Table 1 reports the comparison between DMTNet and the compared methods. DMTNet obtains a rank-1 accuracy of 72.9%, a rank-5 accuracy of 83.0%, a rank-10 accuracy of 87.5% and an mAP of 53.8%, and its CMC curve is drawn in Figure 3. These results are superior to most recent approaches and are significant for cross domain person re-identification.
Compared with hand-crafted feature based machine learning methods, DMTNet extracts more discriminative information according to the different tasks and outperforms them by a large margin in rank-n accuracy and mAP. In contrast to clustering based unsupervised person re-identification models, our approach can estimate the cluster points for multi-task learning in the target domain, improving rank-1 accuracy by at least 0.9 percentage points (72.9 vs. 72.0). Compared to cross domain models, our method preserves the identity information in the multi-task learning procedure and improves rank-1 accuracy by 16.9 points (72.9 vs. 56.0) and mAP by 20.5 points (53.8 vs. 33.3). The CMC curves of the compared models are also drawn in Figure 3.

3) PERFORMANCE ON Market-1501
For this dataset, we choose DukeMTMC-reID as the source domain, and the cross domain person re-id models are kept consistent with this setting. We report the results in Table 2, which show the superiority of our proposed method over hand-crafted feature based machine learning, clustering based and domain adaptation based person re-id models. DMTNet obtains a rank-1 accuracy of 71.5% and an mAP of 42.3%, which surpass the baselines and state-of-the-art methods, and the CMC curves also show this result (Figure 4). The performance on the two datasets shows that our cross domain person re-identification model with the multi-task transfer learning framework outperforms the compared unsupervised person re-identification methods.

D. ABLATION AND DISCUSSION
This paper proposes a novel multi-task transfer network that integrates classification, attribute learning, and identification tasks into a unified framework, and designs a cluster estimating algorithm for the target domain. In this part, we evaluate the influence of each component and discuss several aspects of our DMTNet.

1) EVALUATION OF ATTRIBUTE ATTENTION TASK
We employ attribute learning as one of the tasks because of the effectiveness of existing attribute based person re-id models, and extend it with our attribute attention subnet to supplement the attribute-specific information in the final matching features. As a comparison, we remove this attention task and combine the attribute feature directly with the other tasks in both the source and target domains; this variant is named Multi Task Without Attribute Attention (MTwAANet). Table 3 shows the results of MTwAANet: it achieves a rank-1 accuracy of 65.5% on DukeMTMC-reID and 60.6% on Market-1501, which is lower than that of our DMTNet, and its mAP is lower as well. This comparison demonstrates the positive effect of the attribute attention task on both rank-1 accuracy and mAP.

2) EVALUATION OF CLASSIFICATION TASK
In existing cross domain person re-identification methods, the classification task is usually employed only in the source domain, because the target domain lacks the annotations needed to support classification. We instead estimate the cluster number and assign a soft class label to each image in the target domain. This makes it possible to integrate the classification task into the target domain, and to preserve soft identity information in the matching feature of each target image. To evaluate this task, we remove the classification term from the whole network; this variant is named Multi Task without Classification (MTwCNet).
MTwCNet obtains a rank-1 accuracy of 68.6% (66.2%) on the DukeMTMC-reID (Market-1501) dataset, and an mAP of 48.1% (38.3%). Our DMTNet is superior by at least 4.3% in rank-1 accuracy, which shows that identity information is a significant component of pedestrian feature matching.
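To make the idea of a soft class label concrete, the sketch below shows one plausible way to turn cluster assignments into soft labels: a softmax over negative squared distances between an image feature and each cluster centroid. This section does not specify the actual assignment rule used by DMTNet, so the function `soft_labels`, the `temperature` parameter, and the toy centroids are purely illustrative assumptions.

```python
import math


def soft_labels(feature, centroids, temperature=1.0):
    """Softmax over negative squared distances to each cluster centroid.

    Hypothetical rule for illustration only; not taken from the paper.
    """
    sq_dists = [
        sum((f - c) ** 2 for f, c in zip(feature, centroid))
        for centroid in centroids
    ]
    logits = [-d / temperature for d in sq_dists]
    top = max(logits)  # subtract the max logit for numerical stability
    exps = [math.exp(value - top) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy 2-D features: three estimated clusters and one target-domain image.
centroids = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
probs = soft_labels([0.2, 0.1], centroids)
print([round(p, 3) for p in probs])
```

The resulting vector sums to one and puts most of its mass on the nearest cluster, so it can stand in for a one-hot identity label in a classification loss when no annotation is available.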

3) EVALUATION OF IDENTIFICATION TASK
The identification task is in charge of the basic matching ability: it assimilates the identity and attribute information to constitute the matching feature. To evaluate this basic feature extraction, we remove the basic feature and directly fuse the classification-specific feature and the attribute-specific feature for the final matching process; this variant is named Multi Task without Identification (MTwINet).
The performance comparison between MTwINet and our DMTNet reveals the decisive effect of the identification task on pedestrian feature matching. The difference between them is at least 15.3%, which is the largest gap among all the tasks.

4) DISCUSSION OF CLUSTER NUMBER ESTIMATION
Our DMTNet approach proposes a cluster number estimating algorithm, which uses KL divergence to measure the distribution gap between the source and target domains for the classification task, and learns cluster knowledge from the source domain to estimate the target cluster number. Because the identity number is generally unknown in realistic unsupervised person re-identification, this strategy addresses a practical problem. To validate the cluster number estimating algorithm, we replace the estimated number with a constant identity number taken from the real count of the target domain; this variant is named Multi Task with identity Number (MTwNNet).
With the guidance of the real identity number in the target domain, MTwNNet achieves a rank-1 accuracy of 73.3% (72.5%) on DukeMTMC-reID (Market-1501), which is higher than our original DMTNet. The gap between them is only 0.4% to 1%, which shows that our DMTNet can cope with the lack of a target identity number in realistic settings while staying close to MTwNNet.
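As a reminder of the quantity involved, the following sketch computes the KL divergence between two discrete distributions. How DMTNet builds the source and target distributions is not described in this section, so the histograms below, the function name `kl_divergence`, and the `eps` floor are assumptions made only for illustration.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of weights."""
    p_sum, q_sum = sum(p), sum(q)
    total = 0.0
    for pi, qi in zip(p, q):
        pi, qi = pi / p_sum, qi / q_sum  # normalize to probabilities
        if pi > 0:
            # Floor q at eps to avoid division by zero.
            total += pi * math.log(pi / max(qi, eps))
    return total


# Made-up class-probability histograms for a source and a target batch.
source_hist = [0.5, 0.3, 0.2]
target_hist = [0.4, 0.4, 0.2]
gap = kl_divergence(source_hist, target_hist)
print(gap, kl_divergence(target_hist, source_hist))
```

The divergence is zero only when the two distributions coincide, and it is not symmetric, so the direction in which it is evaluated (source to target or the reverse) is a design choice.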

5) EVALUATION OF METRICS
To evaluate the matching metric for pedestrian features, we adopt the re-ranking [30] technique to improve the performance of our DMTNet. Following the parameter setting of re-ranking, our method improves by a margin of 1.2% (4.5%) on DukeMTMC-reID (Market-1501); this is the DMTNet+ReRanking entry in Table 3. This illustrates that our DMTNet has room for improvement when combined with different metrics or more complex feature extractors.
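The rank-1 margins quoted across the ablation subsections above can be re-derived from the reported numbers. The snippet below uses only rank-1 values stated in the text; MTwINet is omitted because its absolute accuracy is not quoted in this section, and the dictionary layout and variable names are illustrative.

```python
# Rank-1 accuracy (%) as quoted in the text: (DukeMTMC-reID, Market-1501).
rank1 = {
    "DMTNet": (72.9, 71.5),
    "MTwAANet": (65.5, 60.6),
    "MTwCNet": (68.6, 66.2),
    "MTwNNet": (73.3, 72.5),
}

full = rank1["DMTNet"]
# Positive gap: DMTNet is better; negative gap: the variant is better.
gaps = {
    name: tuple(round(f - v, 1) for f, v in zip(full, values))
    for name, values in rank1.items()
    if name != "DMTNet"
}
for name, (duke, market) in gaps.items():
    print(f"{name}: DMTNet - variant = {duke:+.1f} (Duke), {market:+.1f} (Market)")
```

The MTwCNet gaps (4.3 and 5.3) agree with the "at least 4.3%" statement, and the MTwNNet gaps (0.4 and 1.0 in favour of MTwNNet) match the 0.4% to 1% range reported for the cluster number discussion.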

V. CONCLUSION
In this paper, we present a novel multi-task transfer network for cross domain person re-identification. This approach aims to solve the problems of preserving target identity information and estimating the target cluster number, by means of a soft classification task and an identity cluster estimating algorithm. It not only learns discriminative feature representations from the source domain, but also transfers them to the target domain, where cluster estimation supports a soft multi-task learning procedure like that of the source domain. Furthermore, extensive experiments demonstrate the effectiveness of our proposed DMTNet method.
HUAN WANG received the master's degree and the Ph.D. degree in system analysis and integration from Yunnan University, in 2008 and 2013, respectively. She is currently an Associate Professor with the Baoji University of Arts and Sciences. Her research interests include deep learning, complex networks, complex systems, and pattern recognition research.
JINGBO HU received the master's degree in computer architecture from Yunnan University, in 2010. He is currently an Associate Professor with the Baoji University of Arts and Sciences. His research interests include target detection, object recognition, and complex system research.