Learning Domain-Specific Features From General Features for Person Re-Identification

Person re-identification (re-id) plays a vital role in surveillance and forensics applications. Since the labeled images available for the person re-id task are limited, the generalization ability of existing person re-id models is poor. On the other hand, images of different classes (pedestrian and non-pedestrian images) share some general features. To this end, this paper aims to improve the performance of person re-id by designing a relearning network which can learn domain-specific features and general features simultaneously. The proposed relearning network consists of a pretrained backbone network, which provides the general features, and several attention-based subnetworks that learn domain-specific features from general features of different levels. Besides, we propose a coarse-fine loss to improve the generalization of the person re-id model by making full use of the massive labeled non-pedestrian images. Experimental results on the publicly available Market-1501, DukeMTMC-reID and CUHK03 pedestrian re-id datasets demonstrate the effectiveness of the proposed relearning network and coarse-fine loss.


I. INTRODUCTION
Person re-identification (re-id) is a crucial technique in automated surveillance and forensics applications. Given an image of a target pedestrian (probe), the task of person re-id aims at retrieving images of the same pedestrian from a large set of candidate images (gallery) captured by different non-overlapping cameras. The person re-id problem has attracted considerable research attention during the past decades [5], [14], [20], [23], [25], [41], [45], [50], [52]. However, considering the large variations in illumination [21], [30], [33], viewpoint [16] and resolution [22], and the limited labeled data [2] in practice, it is still a challenging problem.
Existing works solve the person re-id problem from two perspectives. The first one focuses on developing discriminative feature representations for each image of a pedestrian. The second one attempts to learn a powerful distance metric so that the distance between similar images is smaller than that between dissimilar images. Deep learning, which can learn discriminative feature representations and powerful distance metrics simultaneously, has achieved great success in many fields of computer vision during the past years, such as image classification/retrieval [12], [24], [32], object detection [18] and segmentation [49]. As a sub-field of image retrieval, the person re-id task has developed rapidly in recent years because of the application of deep learning [1], [10], [28], [31], [48]. For example, a multi-channel part-based convolutional neural network with an improved triplet loss function is proposed by Cheng et al. [10] to jointly learn both the global full-body and local body-part features of the person in images. Wei et al. [46] design a PTGAN to perform style transfer between different datasets while keeping the pedestrian's identity information, such that the domain gap between different person re-id datasets can be bridged. Similarly, ATNet [29] introduces CycleGAN [53] to perform cross-domain transfer. To address the significant and unknown cross-view feature distortion caused by the difference of camera views, Chen et al. [9] design a CRAFT framework which performs cross-view adaptation by automatically measuring camera correlation from the cross-view visual data distribution.
Though exciting results have been achieved by these deep learning person re-id methods, there are some limitations in these methods. First, an inherent defect of deep learning models is that they need a large amount of labeled data to train models which have a large number of parameters. However, it is expensive, labor-intensive and time-consuming to obtain a large amount of labeled data for the person re-id task, which severely limits the generalization ability of the model. This limitation inspires us to design a new deep learning method which only needs a few labeled pedestrian images for model training. Second, there are lots of large-scale labeled non-pedestrian image datasets, such as ImageNet [11] and CIFAR-100 [24]. However, only a few person re-id methods use these datasets, and even these methods use the large-scale non-pedestrian labeled datasets only to pretrain a model, which is then finetuned with the limited labeled pedestrian images. It should be noted that the number of parameters contained in most of these models is always large, thus the limited labeled pedestrian images cannot finetune all these parameters well. Some of these methods fix most of the parameters of the models and finetune only the parameters of the last few layers. This leads to another problem: can the layers with fixed parameters at the front of the network extract features that meet the needs of the person re-id task? Third, for human beings, general knowledge (features) enables us to learn discriminative representations with only a few labeled samples when solving new, similar problems, because different objects share a limited set of general features.
Since the components and combinations of general features of different objects may be different, the descriptions based on general features of different objects may vary greatly. For example, humans and dogs are totally different in shape and appearance, but both of them can be described by combinations of several simple general features. The limited number and sharing of general features make them suitable for describing different objects and classifying them. On the other hand, general features are insensitive to small differences between similar things. For example, generally speaking, all people have similar shape and appearance, e.g., two legs, two hands and one head. This means that we cannot handle the task of distinguishing different individuals of the same object class (e.g., person re-id) well by only using general features. To this end, in this paper, we focus on designing a network to learn domain-specific features for the person re-id task from the general features exploited from large-scale labeled non-pedestrian datasets. We expect to enhance the generalization ability of the model in this way so as to improve the accuracy of person re-id. Overall, the contributions of this work are summarized as follows: (1) We design a relearning network to learn domain-specific features for the person re-id task from the general features exploited from a large-scale labeled non-pedestrian dataset. Specifically, our relearning network adopts a backbone network which is trained with a large amount of labeled non-pedestrian images to provide general features of strong generalization ability, and applies several attention-based subnetworks attached to different nodes of the backbone network to extract domain-specific features for the person re-id task.
(2) We propose a coarse-fine metric loss, which enables the relearning network to learn both coarse-grained and fine-grained metric spaces simultaneously. Through the coarse-fine loss, the generalization ability of the features extracted by the relearning network can be guaranteed, and the domain-specific features for the person re-id task can be learned at the same time.
The rest of this paper is organized as follows. In Section II, we briefly review the most recent developments related to this work and discuss how they differ from our approach. Details of the proposed relearning network and coarse-fine loss are described in Section III. Experimental results are reported in Section IV to show the superiority of the proposed relearning network and coarse-fine loss over existing methods. Finally, some concluding remarks and future works are presented in Section V.

II. RELATED WORKS
In this section, we briefly review two types of works that are related to our approaches: (1) Fine-tuned models for person re-id; (2) Metric learning.

A. FINE-TUNED MODELS FOR PERSON RE-ID
As we mentioned in Section I, there are lots of large-scale labeled non-pedestrian image datasets, but the number of labeled pedestrian images is limited. Considering that some features are shared by most images (pedestrian images and non-pedestrian images), some works design their person re-id models based on a model pretrained on these large-scale labeled non-pedestrian images. They first use these images to pretrain a model on the original task, then modify the last several layers of the pretrained model and fine-tune all or part of the modified network's parameters on the limited labeled pedestrian images. For example, in [47], the bottom 4 stages of a pretrained ResNet-50 provide mid-level semantic representations, such that the tasks of attribute recognition and re-identification can be trained sequentially by fine-tuning one based on the other. The AlexNet model trained on the ImageNet dataset is adopted in MTDnet [6] to initialize the kernel weights of the first two convolutional layers of the proposed network. PCB [43] takes ResNet [19] as the backbone and replaces the original average pooling layer with a p-category part classifier. The authors then use the labeled pedestrian images to train the part classifier but leave the parameters of the pretrained layers fixed. Original models are pretrained on the ImageNet dataset for the unsupervised person re-id model in [13].
As we can see, all these methods use models pretrained on non-pedestrian image datasets as backbone networks or original models, and re-train some or all of the parameters of the network on the labeled pedestrian dataset. Though these methods achieve exciting results by making use of the large-scale labeled non-pedestrian images, there are some limitations in the way they use the pretrained model. On one hand, some methods (e.g., PCB) fix the parameters of the pretrained model and train a few layers attached to the end of the pretrained model, based on the high-level features it provides, to adapt to the person re-id task. As a result, they cannot learn domain-specific features from low-level features for the person re-id task. On the other hand, some methods fine-tune all parameters of the pretrained model with the limited labeled pedestrian images, which means that they cannot train this large number of parameters well enough. Besides, changing the values of all parameters results in the loss of the network's ability to extract general features. Different from them, our method fixes the parameters of the pretrained model, that is, we do not reduce the ability of the network to extract general features, and at the same time, we can learn domain-specific features for the person re-id task from several stages of the pretrained network.

B. METRIC LEARNING
Metric learning is a type of method that tries to learn a metric space in which the distance between similar objects is smaller than the distance between dissimilar objects. So far, an enormous amount of research effort has gone into metric learning theory and applications. The contrastive loss [17] is designed to train the Siamese network and can be defined as follows:

L(W, Y, X_1, X_2) = (1 - Y) * (1/2) * D_W^2 + Y * (1/2) * [max(0, m - D_W)]^2,

where W is the parameter of the network, and Y is the pair label, which is set to 0 when both inputs X_1 and X_2 belong to the same category or identity, and to 1 otherwise. D_W represents the Euclidean distance between X_1 and X_2 in the latent variable space mapped by the network, and m is the margin parameter. If Y = 0, the contrastive loss trains the network to minimize D_W; if Y = 1 and D_W < m, the contrastive loss trains the network to increase D_W to m. Though the contrastive loss is simple and powerful, it predefines an absolute margin m, which is based on the strong assumption that the distributions of all kinds of samples are the same. To relax this strong assumption, the triplet loss [37] is proposed:

L_triplet = Σ_i [ D(f(x_i^a), f(x_i^p)) - D(f(x_i^a), f(x_i^n)) + α ]_+,

where x_i^a, x_i^p and x_i^n are the anchor, positive and negative samples of the i-th triplet, and [z]_+ = max(z, 0). The triplet loss tries to ensure that the Euclidean distance between feature vectors with the same label (x_i^a and x_i^p) is always at least α smaller than the Euclidean distance between feature vectors with different labels (x_i^a and x_i^n). Subsequently, some works [3], [10] improve the performance of the triplet loss by designing better ways to construct triplets, while others extend triplets to quadruplets [7] or add additional constraints to the relationships between positive samples.
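As a concrete illustration, the two losses above can be sketched in PyTorch (a minimal sketch; the function names and margin values here are our own, not from the cited works):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(f1, f2, y, m=1.0):
    # y = 0 for a pair of the same identity, 1 otherwise (as defined above)
    d = F.pairwise_distance(f1, f2)
    return (0.5 * (1 - y) * d.pow(2) + 0.5 * y * F.relu(m - d).pow(2)).mean()

def triplet_loss(fa, fp, fn, alpha=0.3):
    # hinge on (anchor-positive distance) - (anchor-negative distance) + margin
    d_ap = F.pairwise_distance(fa, fp)
    d_an = F.pairwise_distance(fa, fn)
    return F.relu(d_ap - d_an + alpha).mean()
```

Note that the triplet loss only penalizes a triplet when the margin α is violated, which is why it needs no absolute margin on each pair.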
Though these works have conducted in-depth research on metric learning and achieved excellent results, they learn either only a coarse-grained metric space or only a fine-grained metric space. For example, some works apply metric learning algorithms to learn a coarse-grained metric space for general object recognition [39], while others use metric learning algorithms to learn a fine-grained metric space for specific tasks such as face recognition [37] and person re-id [7]. Different from them, we learn the fine-grained metric space for the person re-id task from the coarse-grained metric space learned for general object recognition, such that we can make full use of the massive labeled non-pedestrian images to improve the performance of person re-id.

III. THE PROPOSED RELEARNING NETWORK AND COARSE-FINE LOSS
As we have mentioned in Section I, our relearning network consists of a backbone network, which is well trained on a general image classification dataset, and several levels of attention-based subnetworks. The backbone network provides general features while the attention-based subnetworks exploit domain-specific features for the person re-id task. Then we fuse the general features and domain-specific features to produce features with both generalization ability and domain adaptability. A diagram of the proposed relearning network is shown in Fig. 1. Finally, we propose a coarse-fine loss to train the relearning network. In the following sections, we will give the details of the relearning network and the coarse-fine loss.

A. RELEARNING NETWORK
In this paper, we adopt the pretrained ResNet-50 [19] network as the backbone network for its excellent recognition performance. We replace the layers after ''layer4'' in the backbone network with a convolutional layer (noted as ''Conv_G'' in Fig. 1), so that the backbone network can extract general features more suitable for the person re-id task. To make full use of the general features, we fix the parameters of the backbone network, thus we can obtain general features with strong generalization ability. However, since the backbone network is trained on ImageNet [36] for image classification, the high-level features it provides are not fully applicable to the person re-id task. For example, to distinguish between a human and a dog, we only need outline information, due to the fact that the shapes of a person and a dog are quite different. However, in order to distinguish two people with different identities, we may need some domain-specific information (features) about the color and style of their clothes, body shape, etc. It should be noted that we cannot extract domain-specific features from the final high-level features produced by the backbone network since they are customized for generic classification tasks. An intuitive idea is to design another network to exploit domain-specific features from the original images and combine them with the general features provided by the backbone network. However, considering that labeled pedestrian images are limited, it is difficult to provide enough training data to train a model from scratch. Fortunately, intermediate features at different stages of the backbone network can provide different levels of features. To this end, we attach three attention-based subnetworks to three different nodes of the backbone network to relearn domain-specific features from these intermediate general features for the person re-id task. Now we will introduce the details of the attention module that makes up the attention-based subnetworks.
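The parameter-freezing scheme just described can be sketched as follows, with small stand-in convolutional stages in place of the actual pretrained ResNet-50 layers (the class and layer names are our own; in practice the frozen stages would come from a pretrained model, e.g. torchvision's ResNet-50):

```python
import torch
import torch.nn as nn

class FrozenBackbone(nn.Module):
    """Stand-in for the frozen pretrained backbone; stage1-stage3 play the
    role of ResNet-50 stages and are fixed, while "Conv_G" stays trainable."""
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Conv2d(3, 8, 3, stride=2, padding=1)
        self.stage2 = nn.Conv2d(8, 16, 3, stride=2, padding=1)
        self.stage3 = nn.Conv2d(16, 32, 3, stride=2, padding=1)
        self.conv_g = nn.Conv2d(32, 32, 1)  # replaces the layers after "layer4"
        for stage in (self.stage1, self.stage2, self.stage3):
            for p in stage.parameters():
                p.requires_grad = False     # fix the backbone parameters

    def forward(self, x):
        i1 = self.stage1(x)   # intermediate general features, node 1
        i2 = self.stage2(i1)  # node 2
        i3 = self.stage3(i2)  # node 3
        return self.conv_g(i3), (i1, i2, i3)
```

The intermediate outputs returned alongside the ''Conv_G'' features are what the three attention-based subnetworks would consume.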
To reduce the complexity of the model, we apply a convolutional layer with a kernel size of 1 × 1 at the head of each sub-network to reduce the channel number of feature maps, and use another convolutional layer with a kernel size of 1 × 1 at the end of each sub-network to increase the channel number of feature maps.
Attention Module-An attention module consists of two branches, i.e., a trunk branch and a mask branch. The trunk branch is used for feature extraction, and the mask branch is designed for attention mask generation. Given a set of feature maps I_κ(x), where κ denotes the node index, we input it to both the trunk branch and the mask branch. The trunk branch is formed by two serial residual units [19]. Inspired by the idea of the residual unit [19], we build a shortcut connection between the input feature maps I_κ(x) and the output feature maps of the second residual unit to produce the final output feature maps T_κ^λ(x). For the mask branch, we first apply a convolutional layer followed by a max-pooling layer to the input feature maps I_κ(x). Then an interpolation layer followed by a convolutional layer is adopted to restore the feature maps. Similarly, we also build a shortcut connection between I_κ(x) and the output of the second convolutional layer. We use a ReLU layer and a Softmax layer to produce the attention mask M_κ^λ(x) from the restored feature maps, where λ refers to the λ-th attention module adopted in the κ-th attention-based sub-network and M_κ^λ(x) ∈ [0, 1]. Finally, the output of the attention module can be formulated as:

O_κ^λ(x) = M_κ^λ(x) ⊗ T_κ^λ(x),

where ⊗ denotes element-wise multiplication. Note that the scales of O_κ^λ(x) and I_κ(x) are the same.
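A minimal PyTorch sketch of one such attention module follows; since the text does not fully specify the layer hyperparameters or the axis of the Softmax, the choices below (3×3 convolutions, Softmax over spatial positions) are our assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualUnit(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1), nn.BatchNorm2d(c), nn.ReLU(),
            nn.Conv2d(c, c, 3, padding=1), nn.BatchNorm2d(c))

    def forward(self, x):
        return F.relu(x + self.body(x))

class AttentionModule(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.trunk = nn.Sequential(ResidualUnit(c), ResidualUnit(c))
        self.mask_conv1 = nn.Conv2d(c, c, 3, padding=1)
        self.mask_conv2 = nn.Conv2d(c, c, 3, padding=1)

    def forward(self, x):
        t = x + self.trunk(x)                    # trunk branch with shortcut
        m = F.max_pool2d(self.mask_conv1(x), 2)  # conv + max-pooling
        m = self.mask_conv2(F.interpolate(m, size=x.shape[2:]))  # restore scale
        m = F.relu(x + m)                        # shortcut + ReLU
        # Softmax over spatial positions (our assumption) keeps the mask in [0, 1]
        m = torch.softmax(m.flatten(2), dim=-1).view_as(x)
        return m * t                             # O = M (x) T, element-wise
```

The returned tensor has the same scale as the input, matching the note above.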
Since the scales of different I_κ(x) are different, we attach a max-pooling layer to some of the subnetwork outputs to adjust their scales so that all the final output feature maps O_κ(x) have the same size, where O_κ(x) is the final output feature map corresponding to I_κ(x). Then we fuse all O_κ(x) using a convolutional layer (noted as ''Conv_S'' in Fig. 1) to produce the final domain-specific features. Once we get the domain-specific features and the general features, we apply a convolutional layer (noted as ''Conv_F'' in Fig. 1) to fuse them.
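The scale-adjustment and fusion step can be sketched as follows (a hypothetical helper; we use adaptive max pooling to bring all subnetwork outputs to the smallest spatial size before the 1×1 ''Conv_S'' fusion):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse_domain_features(outputs, conv_s):
    """Pool each subnetwork output to the smallest spatial size among them,
    concatenate along channels, and fuse with a 1x1 conv (the "Conv_S" layer)."""
    h = min(o.shape[2] for o in outputs)
    w = min(o.shape[3] for o in outputs)
    pooled = [F.adaptive_max_pool2d(o, (h, w)) for o in outputs]
    return conv_s(torch.cat(pooled, dim=1))
```

The same pattern (pool to a common size, concatenate, 1×1 conv) would also serve for the final ''Conv_F'' fusion of domain-specific and general features.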

B. COARSE-FINE LOSS
1) COARSE LOSS
Let f_G(x_i) denote the features produced by the ''Conv_G'' layer corresponding to the input image x_i. To make sure that the backbone network can provide general features suitable for the person re-id task, we apply the standard cross-entropy loss I_G with a fully connected layer to f_G(x_i) to train the network to classify x_i. Note that I_G here only classifies which category the input sample belongs to and does not distinguish the identities of different pedestrians. For example, given a sample x_i, I_G can tell us whether x_i is a dog, a flower or a pedestrian, and so on.
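A minimal sketch of this category-level classification head (the class name and dimensions are our own, not from the text):

```python
import torch
import torch.nn as nn

class CoarseClassifier(nn.Module):
    """Fully connected layer + cross-entropy on the "Conv_G" features f_G(x_i);
    labels are coarse categories (dog, flower, pedestrian, ...), not identities."""
    def __init__(self, feat_dim=2048, num_categories=1000):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_categories)
        self.ce = nn.CrossEntropyLoss()

    def forward(self, f_g, category_labels):
        return self.ce(self.fc(f_g), category_labels)
```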
Given a set of triplets {(x_i^P, x_j^P, x_k^NP)}, where x_i^P and x_j^P are the pedestrian images corresponding to the i-th and j-th identities in the training set, respectively, and x_k^NP is the k-th non-pedestrian image in the training set. To make the features learned by the network more robust to environmental noise, we expect that:

D(f_G(x_i^P), f_G(x_j^P)) + β ≤ D(f_G(x_i^P), f_G(x_k^NP)),

where f_G(*) represents the feature vector corresponding to the sample * produced by ''Conv_G'', D(f_1, f_2) represents the distance between feature vectors f_1 and f_2, and β is the margin parameter. Furthermore, for each pedestrian sample x_i^P in the batch, we select the most dissimilar pedestrian sample x_j^P (P-P) and the most similar non-pedestrian sample x_k^NP (NP-P) in the batch to form the triplet for computing the loss, which we call the coarse triplet loss:

L_C = (1/M) Σ_{i=1}^{M} [ max_{j=1,...,M} D(f_G(x_i^P), f_G(x_j^P)) - min_{k=1,...,N} D(f_G(x_i^P), f_G(x_k^NP)) + β ]_+,    (6)

where M and N are the numbers of pedestrian images and non-pedestrian images in the batch, respectively. Through Equation 6, pedestrians and non-pedestrians are separated in the metric space.
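The batch-hard mining described above can be sketched as follows (a hypothetical implementation; we assume Euclidean distances via torch.cdist):

```python
import torch
import torch.nn.functional as F

def coarse_triplet_loss(f_ped, f_nonped, beta=5.0):
    # f_ped: (M, d) "Conv_G" features of the pedestrian images in the batch
    # f_nonped: (N, d) features of the non-pedestrian images in the batch
    d_pp = torch.cdist(f_ped, f_ped)      # pedestrian-pedestrian distances
    d_pn = torch.cdist(f_ped, f_nonped)   # pedestrian-non-pedestrian distances
    hardest_pp = d_pp.max(dim=1).values   # most dissimilar pedestrian (P-P)
    hardest_pn = d_pn.min(dim=1).values   # most similar non-pedestrian (NP-P)
    return F.relu(hardest_pp - hardest_pn + beta).mean()
```

When every non-pedestrian already lies at least β beyond the farthest pedestrian, the loss vanishes and the gradient is zero.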

2) FINE LOSS
To further distinguish pedestrians of different identities, we adopt the Batch Hard triplet loss [3] introduced in Section II-B as the fine triplet loss to learn a metric space from the pedestrian samples for the person re-id task:

L_F = (1/M) Σ_{i=1}^{M} [ max_p D(f(x_i^a), f(x_i^p)) - min_n D(f(x_i^a), f(x_i^n)) + α ]_+,

where x_i^a, x_i^p and x_i^n are the anchor, positive and negative samples, respectively. As we have mentioned before, the distance between different pedestrians should be smaller than the distance between a pedestrian and a non-pedestrian, thus we set α < β.
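A sketch of the Batch Hard fine triplet loss under the same assumptions (Euclidean distances; each anchor's hardest-positive search includes the anchor itself at distance 0, which cannot affect the maximum):

```python
import torch
import torch.nn.functional as F

def batch_hard_triplet_loss(feats, labels, alpha=2.0):
    d = torch.cdist(feats, feats)                      # pairwise distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)  # same-identity mask
    d_ap = d.masked_fill(~same, float('-inf')).max(dim=1).values  # hardest positive
    d_an = d.masked_fill(same, float('inf')).min(dim=1).values    # hardest negative
    return F.relu(d_ap - d_an + alpha).mean()
```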
Inspired by [43], we impose a part-based identity loss I_PCB with a part number of 6 on the features produced by the ''Conv_F'' layer to learn fine-grained features. Finally, we can write the overall coarse-fine loss function as follows:

L = I_PCB + I_G + L_C + L_F.

IV. EXPERIMENTS
A. IMPLEMENTATION DETAILS
The configuration of the proposed relearning network is listed in Table 1. Please note that we concatenate the feature maps produced by both the backbone network and the attention-based subnetworks before inputting them into the ''Conv_S'', ''Conv_G'' and ''Conv_F'' layers. The batch size for training the relearning network is 256. We use the SGD algorithm to train the proposed relearning network for 1000 epochs. The learning rate in our experiments is 1e-2. We randomly mirror all the images in each training iteration to increase the diversity of the training data. In the testing phase, we do not apply any data augmentation method. The proposed relearning network is implemented using the PyTorch [34] framework on 2 NVIDIA GTX 1080 Ti GPUs.
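The optimizer setup described above can be sketched as follows (a hypothetical helper; only the parameters left trainable, i.e. those outside the frozen backbone, are passed to SGD):

```python
import torch

def make_optimizer(model, lr=1e-2):
    # Optimize only the parameters that still require gradients;
    # the frozen backbone parameters are excluded automatically.
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.SGD(trainable, lr=lr)
```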

B. DATASETS AND EXPERIMENTAL SETTINGS
We conduct experiments on three public person re-id datasets, including CUHK03 [26], DukeMTMC-reID [35] and Market-1501 [51], to verify the effectiveness of the proposed relearning network and coarse-fine loss. CUHK03 - There are 13,164 images of 1,360 pedestrians captured by 5 disjoint camera pairs in CUHK03. Each identity was captured by two disjoint cameras. CUHK03 provides both automatically detected bounding boxes using the DPM detector [15] and manually labeled bounding boxes. In our experiments, we only use the automatically detected bounding boxes. The challenges of CUHK03 are the detection errors, viewpoint variations, and occlusions existing in the dataset.
Market-1501 - The Market-1501 dataset contains 1,501 identities shot by 6 cameras in front of a supermarket. The bounding boxes are first detected by DPM and further validated by manual annotations. Among them, 12,936 images of 751 identities are used as the training set and the testing set consists of the other 19,732 images of 750 identities. When it comes to testing, we use only one image from each camera for each identity as a query to construct the 3,368 images of the query set.
To increase the diversity of the available datasets, we randomly mirror all the images in a batch in the training phase. In the testing phase, we simply use the relearning network to extract a feature vector for each input image without any data augmentation.

C. ABLATION STUDY
In this section, we will investigate the effects of the domain-specific features and the coarse-fine loss through experiments conducted on the three datasets.

1) THE EFFECT OF RELEARNING NETWORK
Since the key difference between our relearning network and the backbone network is the three subnetworks listed in Table 1, we study the effect of the relearning network by attaching these three subnetworks one by one to the backbone network. In this section, we change the number of input channels of ''Conv_S'' adaptively to ensure that the network can work normally, and take the network with none of the three subnetworks as the baseline. All experiments are based on models trained with the full coarse-fine loss. Results in Table 2 show that all three subnetworks help improve person re-id accuracy and verify that these subnetworks can focus on exploiting the domain-specific features from general features. These results are interesting as they show that ''subnetwork-1'' improves the performance most, while ''subnetwork-3'' improves the performance least, and the best performance is achieved when ''subnetwork-1'', ''subnetwork-2'' and ''subnetwork-3'' are all adopted. We conjecture that this may be due to the facts that 1) the higher the abstraction level of features in the backbone network, the fewer pedestrian re-id domain-specific features they contain; and 2) domain-specific feature maps extracted from different levels of general features produced by the backbone network provide different information and can complement each other.

2) THE EFFECT OF COARSE-FINE LOSS
To verify the effect of the coarse-fine loss, we train the relearning network with several different loss function combinations. Specifically, we take the model trained only with the part-based identity loss I_PCB as the baseline in this part, then we add the cross-entropy loss I_G, the coarse triplet loss L_C and the fine triplet loss L_F one by one. For simplicity, in Table 2, we use I_PCB, I_G, L_C, L_F and their combinations to represent the models trained with them. Note that we set α = 2 and β = 5 in this part. As we can see, compared with I_PCB, the re-id accuracy is improved from 86.1% to 88.3% when it comes to ''I_PCB + I_G'', which demonstrates that preserving the general information (general features) is beneficial to person re-id. Comparing ''I_PCB'' and ''I_PCB + L_C'', we can conclude that the coarse triplet loss can improve the re-id performance by making the model more robust to non-pedestrian noise. We also observe that when L_F is included (''I_PCB + L_C + L_F''), the re-id accuracy is further improved. The improvement validates that the difference between different pedestrians is smaller than the difference between a pedestrian and a non-pedestrian. Last but not least, the best results are achieved when we use ''I_PCB + I_G + L_C + L_F'', which demonstrates the effectiveness of the coarse-fine loss.
In addition, we fix α = 2 and vary the value of β from 0.5 to 8 to further verify the effect of L_C. Here we use the model trained without L_C as the baseline. Results in Figure 2 show that when β ≤ α, we can observe a significant decrease in the rank-1 accuracy. This confirms once again that the difference between different pedestrians should be smaller than the difference between a pedestrian and a non-pedestrian. As we can see, the performance of the model increases with β within a certain range, after which a slight decrease can be observed. We conjecture that this may be because pedestrian images and non-pedestrian images share some features, so trying to make the difference between pedestrian and non-pedestrian features too large increases the difficulty of model training.

1) EVALUATION ON CUHK03
In Table 3, we report the comparison of our approach with eight existing state-of-the-art person re-id methods on the CUHK03 dataset. As seen from the comparison, our model outperforms all compared methods. More concretely, our model achieves an improvement of 3.2% rank-1 accuracy over the best compared method Deep-Person and 1.1% mAP over the best compared method KPM. The improvements demonstrate the effectiveness of our approach. Fig. 3 shows four retrieval examples. Specifically, some query images in the third and fourth rows are heavily occluded, and the ninth image in the third row contains multiple pedestrians. Besides, the pedestrians contained in these query and gallery images are captured from different views. Our method succeeds in finding correctly matched candidates at the top-1 rank in the first two examples, and finds correctly matched candidates within the top-3 ranks in all four examples, which demonstrates the effectiveness and robustness of the proposed method.

2) EVALUATION ON DukeMTMC-reID
Table 4 compares our approach to seven state-of-the-art person re-id methods on the DukeMTMC-reID dataset. The results show clearly that with the proposed approach, both the rank accuracy and the mAP are improved significantly. For instance, our approach improves the rank-1 matching rate by 2.8% compared to the second best method AANet. This validates the effectiveness of the proposed approach.

3) EVALUATION ON MARKET-1501
The comparison with state-of-the-art methods on the Market-1501 dataset is depicted in Table 5. Our results show clearly that the proposed approach obtains better matching rates and mAP than all compared methods. More specifically, our method is 1.5% and 2.7% higher than the best compared method AANet at rank-1 and mAP, respectively.

Fig. 3. Retrieval results of the proposed approach on the testing set of CUHK03. Images in the first column represent the queries. The second column contains candidates in the gallery. Candidates with green boxes belong to the same pedestrian as the query, while red boxes denote wrongly matched images.

V. CONCLUSION
In this paper, we develop a relearning network to learn domain-specific features from general features for the person re-id task. By fixing the parameters of the pretrained backbone network, we obtain general features with powerful generalization ability. By attaching three attention-based subnetworks to different positions of the backbone network, we learn discriminative domain-specific features at different levels. Then we fuse the general features and domain-specific features to produce the final discriminative representation for the person re-id task. Besides, we propose a coarse-fine loss that uses non-pedestrian images to ensure the generalization ability of the model and improve its robustness to environmental noise. Experimental results on three public person re-id datasets demonstrate the effectiveness of the proposed relearning network and coarse-fine loss.
Future work may focus on developing more efficient methods to use non-pedestrian images to improve person re-id accuracy.
JUNYING FENG received the M.S. degree in business administration from Central South University. She is currently a Senior Economist with the School of Economics and Management, Hunan Institute of Technology. Her current research interests include human resource management and data analysis.
RUILING ZHOU received the master's degree in computer application from Hunan University, in 2012. She is currently a Lecturer with the School of Computer and Information Science, Hunan Institute of Technology. Her current research interests include big data, recommendation systems, pattern recognition, and image processing.