Unifying Person and Vehicle Re-Identification

Person and vehicle re-identification (re-ID) are important challenges for the analysis of the burgeoning collection of urban surveillance videos. To efficiently evaluate such videos, which are populated with both vehicles and pedestrians, it would be preferable to have one unified framework with effective performance across both domains. Unfortunately, due to the contrasting composition of humans and vehicles, no architecture has yet been established that can adequately perform both tasks. We release a Person and Vehicle Unified Data Set (PVUD) comprising both pedestrians and vehicles from popular existing re-ID data sets, in order to better model the data that we would expect to find in the real world. We exploit the generalisation ability of metric learning to propose a re-ID framework that can learn to re-identify humans and vehicles simultaneously. We design our network, MidTriNet, to harness the power of mid-level features to develop better representations for the re-ID tasks. We help the system to handle mixed data by appending unification terms with additional hard negative and hard positive mining to MidTriNet. Training on PVUD attains accuracy comparable to training on its constituent data sets separately, supporting the system’s generalisation power. To further demonstrate the effectiveness of our framework, we also obtain results better than, or competitive with, the state-of-the-art on each of the Market-1501, CUHK03, VehicleID and VeRi data sets.


I. INTRODUCTION
Re-identification (re-ID) is a core challenge for the computer vision community whereby a detection is required to be matched with another detection of the same object, typically from a different viewpoint. With the increasing volume of large-scale urban surveillance data, re-ID has started to attract a large amount of attention. In the past few years, deep learning techniques have gained popularity by significantly improving the performance of both pedestrian [1]-[3] and vehicle [4]-[6] re-ID. In the real world, person and vehicle re-ID often need to be used together, e.g. when a person of interest boards a vehicle and gets off somewhere else. We would prefer re-ID systems to be able to handle this occurrence for continuous tracking. For this reason, Wei et al. [7] attempt to develop an integrated application by using existing person and vehicle re-ID architectures. However, this does not truly unify the tasks, as the system accuracy depends on sub-systems, which is not optimal. It requires an additional component to classify between pedestrians and vehicles, which could introduce inaccuracies. We instead train person and vehicle re-ID in a unified manner. This approach allows us to discover underlying principles of re-ID, whereas handling the two systems separately does not allow us to explore this direction.
The challenges of re-identifying vehicles and persons have significant differences. For wide area video surveillance on humans, the same identity viewed from a different pose angle usually looks fairly alike. The shape of the detection remains upright and the colour information, predominantly extracted from articles of clothing, is of a similar pattern. The same condition cannot be satisfied for vehicles. Colour information can become far more distorted in different lighting due to the reflectiveness of the body of a car. The shape information of a car viewed from the front is significantly different from that viewed from a 45° or 90° angle. Conversely, many high-end vehicle re-ID algorithms use license plate information [4], [8], [9], which is not applicable in the human domain. Moreover, pedestrians are more likely to undergo significant changes over time or viewpoint, e.g. a person's appearance is greatly altered after they put on a coat. In general, changes to vehicles between viewpoints are high variance but predictable whereas the change in a person's colour representation is usually lower variance but prone to much more extreme outliers.
We propose that there are underlying principles of re-ID that hold regardless of the composition of the object worked upon. Unifying person and vehicle re-ID allows us to explore and discover these underlying principles, precisely because they are so different. Traditional works split the two tasks and design a network that can specifically target the individual task's respective challenges. However, we argue that this is inadequate. In the real world, urban surveillance videos provide a mixed stream of data, consisting of both vehicles and pedestrians, on which analysis is required. Mixing this data allows us to discover techniques and good practices, which are likely to extend to other re-ID tasks. For example, one may improve person re-ID performance by generating a better feature representation of pedestrians, e.g. by introducing squeeze and excitation modules [10]. This tells us nothing about the framework's actual ability to re-identify an object, despite the accuracy increasing. Our data set helps to solve this issue.
FIGURE 1. We propose a unified framework for pedestrian and vehicle re-identification using a new unified data set, PVUD, which challenges re-ID systems to be capable of handling both tasks simultaneously. Our framework includes MidTriNet to harness the power of mid-level features for re-ID, and a Unification Loss Function to better handle the mixed data stream.
In this paper, we present an approach to unify the two tasks, summarised in Figure 1. We construct the Person and Vehicle Unified Data Set (PVUD) from other popular data sets, which is more representative of raw video surveillance data extracted from the real world. The data set is designed to be challenging and well-balanced, in order to prioritise re-ID systems that excel on both tasks. To the best of our knowledge, this is the first proposed re-identification data set containing both domains. We propose a triplet loss function that can be trained on either person or vehicle data and achieve state-of-the-art performance on each task. As the proposed framework is a form of metric learning, it does not require specific, domain-based design in order to re-identify objects. It is inherent in the framework to separate object classes in the same way that it separates different identities from one another, which makes it ideal to handle the challenges within the database that we design. We exploit information from mid-level layers which are more appropriate for the task of re-ID than the more abstract, final layer representations. In addition, we introduce hard negative and hard positive mining to our framework, with associated unification terms in the loss function, to improve its ability to handle multiple data streams. We extensively test our proposed framework, attaining an 88.52% top-1 matching rate on PVUD, and competitive results with state-of-the-art methods on each of its components. The strong performance we obtain on the unified data shows that, contrary to discussion in [5], this is a realistic task on which to focus attention, particularly due to the presence of both pedestrians and vehicles within the vast majority of real-world surveillance data. This paper presents the following contributions.
To facilitate research in this area, we open our data set and source code at https://github.com/PVUD.
1) The Person and Vehicle Unified Data Set (PVUD): Motivated by the composition of large-scale surveillance data, we compose a challenging data set containing pedestrian and vehicle information to encourage the re-identification community to pursue the development of frameworks which are applicable to real-world data.
2) Harnessing information from earlier layers: We propose MidTriNet, a triplet framework which exploits information from mid-level layers. This information is more valuable than features from the deepest layers across both person and vehicle re-identification tasks.
3) A unified framework: We append unification terms to the triplet loss to derive a unification triplet loss function. We also introduce term-specific mining algorithms to discover the most important data for the unification terms to focus on.
The rest of the paper is organised as follows: Section II contains an overview of related work. Section III introduces PVUD and provides details on how it is constructed and how data imbalance is avoided. Section IV describes MidTriNet, the framework designed to improve re-ID performance, and details the construction of the proposed unification terms to handle the unification task. Section V shows our experimental results and ablation studies. Section VI concludes the paper and discusses future directions.

II. RELATED WORK

A. PERSON AND VEHICLE RE-IDENTIFICATION
Historically, popular methods for person re-ID typically comprised two components: designing hand-crafted features and learning distance metrics [11], [12]. Most works focused on developing features invariant to variations in light, pose and viewpoint while using conventional distance metrics like the Mahalanobis distance [13], Bhattacharyya distance, and the $\ell_1$- and $\ell_2$-norms. Research has also been performed on a post-processing technique called re-ranking [14], [15]. We do not include re-ranking in any of our experiments as it does not evaluate the core performance of the framework.
Although similar to person re-ID, vehicle re-ID has received comparatively little attention. This is inconsistent with other computer vision tasks in the vehicle domain, like detection and classification, which have received increased attention in recent years. This lack of popularity can be attributed to the inferiority of large-scale vehicle re-ID data sets compared with their human re-ID counterparts. This is beginning to change as two large-scale data sets, VeRi and VehicleID have more recently been released and have started to attract more research attention.
Research focus in re-identification has shifted towards deep learning methods, which are routinely used to obtain state-of-the-art results over a wide variety of challenges in computer vision and machine learning. Typically, two types of CNN model have been employed to solve the person re-ID task: the classification model that is used across a broad spectrum of computer vision problems [16] and, more commonly for re-ID, the Siamese model which takes multiple images as input, such as pairs [17], [18], triplets [10], [19], and quadruplets [2].
As there is typically more variance between viewpoints within vehicle re-ID compared to person re-ID (Figure 2), more creative methods have been proposed to obtain satisfactory results. Liu et al. [20] developed a two-branch CNN to learn deep features and the distance metric simultaneously. Liu et al. [4] combined hand-crafted features and high-level attributes learned by a CNN with license-plate recognition and spatio-temporal information. Zhou et al. [6] trained a model on a toy car data set in order to infer a multi-view vehicle representation from any input view. Due to the proficiency of deep learning at handling large-scale databases like the one we construct, we elect to utilise it in our experiments.
A recent trend in person re-ID has been to design attention modules that can extract colour information from clothing [21]-[25]. While these do attain strong performance, they are too heavily tailored towards being effective at re-identifying people. The most popular attention mechanisms for person re-ID are part-based systems [26], [27], which split the image into several parts, so the head, torso, legs, and feet are separated from each other. Although these would perform well on the human proportion of our unified data set, such modules are not able to effectively re-identify vehicles. More generic attention modules also struggle. Colour and shape information remains consistent across different viewpoints in the person domain, but is highly inconsistent in the vehicle domain. Therefore, attention modules easily become confused when attempting to tackle these challenges simultaneously. For these reasons, we did not include attention within our unified system.
Recently, there has also been focus on unsupervised re-ID by domain adaptation [28], because traditional supervised re-ID cannot generalise to additional data sets. Fan et al. [29] develop a progressive unsupervised learning method that iterates between person clustering and CNN fine-tuning during training. Zhong et al. [30] explore three types of invariance that hinder the ability of the re-ID model to generalise to new domains: example invariance, camera invariance and neighbourhood invariance. Deng et al. [31] translate images to the target domain using CycleGAN [32] then enforce domain-dissimilarity between the translated image and other images in the data set. Ding et al. [33] use adaptive exploration to learn discriminative features in the target domain. Whereas unsupervised learning requires generalisation to unlabelled data, PVUD requires generalisation between data types.

B. TRIPLETS
Triplets have been used extensively in the field of person re-ID. Triplets are generated by pairing query images with one image of the same identity and one with a different identity. Wang et al. [34] proposed to use the triplet loss function to learn image similarity. Cheng et al. [35] introduced an improved triplet loss function that decreases the distance of similar IDs and increases the distance of dissimilar IDs. Hermans et al. [1] proposed Batch Hard mining in order to find harder triplets to improve the efficacy of training, however the inter-class variance remains too close. Chen et al. [2] train with an additional negative pair and form a quadruplet loss to enlarge inter-class variations while Yuan et al. [36] attempt to get the same improvement by adding a similar term without needing to mine an additional image. Wu et al. [37] combine triplet loss with identification loss and centre loss. Tian et al. [38] mine more informative triplets via their re-weighting strategy.
Triplet-wise training has also been effectively applied for vehicle re-ID. Zhang et al. [39] combined the triplet loss with a classification loss and also ensured negative samples in one triplet act as positive samples in another triplet. Bai et al. [40] fed groups of images into their triplet network to mitigate inter-class variance and propose a mean-valued triplet loss to enhance learning. Due to the success of the triplet loss across both tasks, it is chosen as a strong backbone to our framework. We additionally mine batch-hard positive and negative examples and introduce respective positive-sample and negative-sample loss functions to further supplement the network in separating identity classes.
TABLE 1. Overview of the raw data sets.
Data set       Train IDs   Test IDs   Images
Market-1501    750         751        32669
CUHK03         1372        95         14297
VeRi           576         200        40395
VehicleID      13134       13133      221763
TABLE 2. Composition of PVUD.
Data set        Train IDs   Train Images   Test IDs   Test Images
Market-1501     751         12936          200        3486
CUHK03          1372        13176          95         921
VeRi            676         14632          100        2116
VehicleID       1500        12964          500        2085
Person Total    2123        26112          295        4407
Vehicle Total   2176        27596          600        4085

C. MID-LEVEL FEATURES
Yu et al. [41] concatenate features from earlier ResNet layers with the final layer representation for cross-domain image matching. However, although their approach works well when it uses the triplet loss for sketch-based image retrieval, it does not work well with the triplet loss for re-ID, so they switch to a classification loss. Zhu et al. [42] also fuse mid-level features with final-level ones as part of a two-stream pose-based and part-based architecture. Zeng et al. [43] perform an extensive analysis on the performance of each layer to develop a hierarchical deep learning feature, which fuses features from several earlier layers. Although their method works well with their newly defined metric, their model is heavily engineered for person re-ID, and thus would struggle to adapt to vehicle data. Lin et al. [44] align mid-level features to boost the performance of unsupervised re-ID. This provides further evidence that mid-level features are an important tool for re-ID and not just specifically useful for supervised person re-ID. A number of works have been proposed to select features automatically [45]-[47] for machine learning classification. However, re-ID differs significantly from standard classification tasks in several ways: i) our model exploits metric learning rather than using a classification loss, ii) testing identities in re-ID do not appear in training, whereas all classes appear in training for regular classification tasks, iii) the evaluation metric is based on information retrieval rather than classification accuracy. Therefore, we do not use any automatic feature selection method.

III. PVUD: PERSON AND VEHICLE UNIFIED DATA SET
There is no publicly available data set for re-identification that contains objects from both person and vehicle classes. As re-ID frameworks are mostly applicable to surveillance data, which generally consists of pedestrians and vehicles, it is imperative for re-ID to be able to handle both streams simultaneously if it is to be applicable to real-world data. Moreover, testing on multiple domains concurrently allows us to be more confident that any adjustments made to the network are beneficial for the re-identification task in general, rather than just for a specific domain. To facilitate the research in this direction, we release a unified data set based on existing ones in the field.
We select the two most popular data sets in each domain: Market-1501 [48] and CUHK03 [17] for pedestrians, and VeRi [49], [50] and VehicleID [20] for vehicles. An overview of the raw data sets, containing the number of identities for training and testing, along with the total number of images, can be found in Table 1. The final composition of PVUD can be found in Table 2.

A. SOURCE DATA SETS
CUHK03: The CUHK03 data set contains 14297 bounding boxes of 1467 persons. It contains two settings: one with manually annotated bounding boxes and one with automatically detected bounding boxes. We only consider the automatically detected setting as it contains some misplaced bounding boxes making it more challenging and more similar to what we would expect when applying re-identification to real-world tasks.
Market-1501: The Market-1501 data set contains 32668 automatically detected bounding boxes of 1501 individuals.
VeRi: The VeRi data set has 37,781 images of 576 vehicles for training and 11,579 images of 200 vehicles for testing. In order to obey the 'Balance' design principle, we move 100 vehicles from the test set to the train set. We also use a maximum of 20 images per vehicle. Rather than having standalone images from different viewpoints, VeRi contains 'tracks' of vehicles which are extracted as several consecutive frames from a video source. This means that images from all angles are available. Thus, VeRi usually requires image-to-track calculation rather than the standard image-to-image metric that other data sets use. To maintain consistency across data sets, we use the image-to-image testing on PVUD.
VehicleID: The VehicleID data set has 221763 images of 26267 identities. VehicleID contains 'Small', 'Medium', and 'Large' settings for testing. As Market-1501, CUHK03 and VeRi are much smaller, for easier integration, we only take data from the 'Small' set. Contrary to the VeRi data set, VehicleID only contains images from the front and back of the vehicle.

B. DESIGN PRINCIPLES
As discussed in [51], imbalanced data sets are inherently complex. When constructing this data set, it is important to ensure that person and vehicle data are equally balanced to accurately assess how strong a method is at re-identifying humans and vehicles simultaneously. As can be seen in Table 1, if we blindly conjoin the four data sets, there will be much more vehicle data than person data. This will result in the data set being biased towards vehicle re-identification methods, rather than methods which are effectively able to generalise across both tasks. We lay out the following design principles to ensure the data set is as fair and balanced as possible without sacrificing difficulty. We provide full details of our constructed data set in Table 2.
FIGURE 3. An overview of the architecture with unification terms. Each batch of images is processed with MidTriNet. We take the final layer of the network as the embedding space. We design unification terms specifically to make the network more robust against the mixed data that is present in PVUD and append them to the triplet loss function. Finally, we mine hard triplets, positive pairs and negative pairs to feed into our unification loss function.
Balance: A critical property of the data set is balance between different domains. In this regard, we have two options. We may either equate the number of vehicle IDs with person IDs in the data set, or the number of vehicle images with person images. We find that equating IDs leads to too many vehicle test images, which may result in weak person re-ID frameworks attaining an artificially high result. Balancing the number of images facilitates a more challenging data set that better represents the real world. Our studies show that balancing IDs gives an mAP of 84.83%, whereas balancing images reduces the mAP to 77.51%.
Size: A data set with both pedestrians and vehicles is already challenging. We wish to take this challenge further.
A larger testing set means more negative images to compare against, i.e. more likelihood to find a negative with a high similarity score, which makes testing more challenging. We also want to ensure that the data set is large enough for deep learning models, which demonstrate much greater efficacy at handling large-scale, real-world surveillance data. As we release the data set to motivate the re-ID community to design frameworks which can handle real-world data, it is imperative that it is suitable for deep learning frameworks. For these reasons, we select the design principle of maximising the size of the data set.
Random Sampling: In the interests of fairness, we randomly sample from the four comprising data sets rather than hand selecting examples.
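The Balance and Random Sampling principles can be illustrated with a small sketch. This is not the procedure used to build PVUD, whose exact sampling steps are not specified here. It assumes a hypothetical record format of `(identity, image_name)` tuples, a hypothetical helper `sample_subset`, and a toy image budget; only the cap of 20 images per vehicle comes from the text.

```python
import random
from collections import defaultdict


def sample_subset(records, max_per_id, image_budget, seed=0):
    """Randomly pick whole identities until `image_budget` images are reached.

    records: list of (identity, image_name) tuples (hypothetical format).
    max_per_id: cap on images kept per identity (the text uses 20 for VeRi).
    The last identity added may push the total slightly past the budget.
    """
    rng = random.Random(seed)
    by_id = defaultdict(list)
    for identity, name in records:
        by_id[identity].append(name)

    identities = sorted(by_id)
    rng.shuffle(identities)  # random rather than hand-picked identities

    chosen = []
    for identity in identities:
        if len(chosen) >= image_budget:
            break
        images = by_id[identity]
        if len(images) > max_per_id:
            images = rng.sample(images, max_per_id)
        chosen.extend((identity, name) for name in images)
    return chosen


# Toy pools: vehicles have many images per identity, persons have fewer.
vehicles = [(f"v{i}", f"v{i}_{j}.jpg") for i in range(50) for j in range(30)]
persons = [(f"p{i}", f"p{i}_{j}.jpg") for i in range(80) for j in range(10)]

# Balancing by images: both domains receive the same image budget.
vehicle_subset = sample_subset(vehicles, max_per_id=20, image_budget=400)
person_subset = sample_subset(persons, max_per_id=20, image_budget=400)
print(len(vehicle_subset), len(person_subset))
```

With the same image budget per domain, the vehicle subset ends up with fewer identities than the person subset, which mirrors the trade-off between balancing IDs and balancing images discussed under Balance.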

1) Design Choices for VeRi:
VeRi requires image-to-track re-identification rather than image-to-image. Each track is composed of several consecutive frames from a video. We choose to include the entire track in order to more accurately model real-world surveillance videos. Many of the images used for training from VeRi are therefore similar to one another, so there is less effective information. This has two main consequences: (1) as VehicleID contains more effective training data, any framework must be capable of transferring knowledge between the data sets for accurate vehicle re-identification results, (2) models have to be robust against overfitting as VeRi training data can be very similar.
VOLUME 4, 2016

IV. A UNIFIED FRAMEWORK FOR PERSON AND VEHICLE RE-IDENTIFICATION
In this section, we will detail the implementation of MidTriNet and provide motivation for the design of our unification terms.
We present our architecture in Figure 3. The input batch is generated by taking four images of P identities, which are processed by MidTriNet and mapped into the embedding space. The distance matrix between all feature vectors is then calculated via a Euclidean distance and the hardest samples are mined. These samples are then fed into our novel loss function.
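As a rough sketch of the batch construction and distance matrix described above, the following uses only the Python standard library. The helper names `sample_pk_batch` and `pairwise_euclidean`, the toy identities, and the fallback of sampling with replacement when an identity has fewer than four images are all assumptions for illustration; in the real framework the feature vectors come from MidTriNet.

```python
import math
import random


def sample_pk_batch(images_by_id, num_ids, per_id=4, seed=0):
    """Draw `num_ids` identities and `per_id` images of each (4 in the text)."""
    rng = random.Random(seed)
    ids = rng.sample(sorted(images_by_id), num_ids)
    batch = []
    for identity in ids:
        pool = images_by_id[identity]
        if len(pool) >= per_id:
            picks = rng.sample(pool, per_id)
        else:
            # Assumption: sample with replacement when an identity is small.
            picks = [rng.choice(pool) for _ in range(per_id)]
        batch.extend((identity, img) for img in picks)
    return batch


def pairwise_euclidean(features):
    """Return the full matrix of Euclidean distances between feature vectors."""
    n = len(features)
    return [[math.dist(features[i], features[j]) for j in range(n)]
            for i in range(n)]


# Toy usage: 10 identities with 6 images each, batch of P = 3 identities.
images_by_id = {f"id{k}": [f"id{k}_{j}.jpg" for j in range(6)] for k in range(10)}
batch = sample_pk_batch(images_by_id, num_ids=3)

# Stand-in 2-D "embeddings"; real ones would be network outputs.
features = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
dist = pairwise_euclidean(features)
print(len(batch), dist[0][1], dist[0][2])
```

The batch holds 4P entries, and the distance matrix is symmetric with a zero diagonal, which is what the mining step later scans.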

A. TRADITIONAL TRIPLET LOSS FUNCTION
The triplet loss function has seen extensive use for both person and vehicle re-ID due to its proven ability to attain state-of-the-art results and its efficacy in being able to handle difficult examples through specifically mining such examples during training. Traditional triplet models take three images as input: one query image, one image with the same identity as the query (positive), and one image with a different identity to the query (negative). The margin α is enforced to ensure distance between positive and negative pairs.
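We denote a triplet, $t = (x, x^+, x^-)$, where $x$ is the query image, $x^+$ is a positive image, and $x^-$ is a negative image. The triplet loss function is formulated as follows:
$$L_{\mathrm{triplet}} = \sum_{t \in T} \max\left(0,\; \alpha + \lVert f(x) - f(x^+) \rVert_2 - \lVert f(x) - f(x^-) \rVert_2\right) \qquad (1)$$
where $f(x)$ is the feature vector of image $x$ and $T$ is the set of mined triplets.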
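A minimal numerical sketch of a margin-based triplet loss of the kind described above follows. It assumes unsquared Euclidean distance and a plain sum over triplets; both choices, along with the function name `triplet_loss` and the toy vectors, are assumptions made for illustration.

```python
import math


def triplet_loss(triplets, margin):
    """Hinge-style triplet loss summed over (query, positive, negative) features."""
    total = 0.0
    for f_query, f_pos, f_neg in triplets:
        d_pos = math.dist(f_query, f_pos)  # distance to the same identity
        d_neg = math.dist(f_query, f_neg)  # distance to a different identity
        # The loss is zero once the negative is at least `margin` farther away.
        total += max(0.0, margin + d_pos - d_neg)
    return total


easy = ((0.0, 0.0), (0.0, 1.0), (0.0, 5.0))  # negative already far away
hard = ((0.0, 0.0), (0.0, 3.0), (0.0, 2.0))  # negative closer than positive
loss_easy = triplet_loss([easy], margin=0.5)
loss_hard = triplet_loss([hard], margin=0.5)
print(loss_easy, loss_hard)
```

The easy triplet contributes nothing, while the hard triplet is penalised, which is why mining difficult examples matters for this family of losses.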

B. MIDTRINET
Contrary to most deep learning classification tasks, mid-level layers have been shown to have similar importance to higher-level layers for constructing effective feature embeddings for re-ID [41], [43]. Re-identification relies on matching human-understandable information such as colour of clothes, and features are required to be viewpoint invariant. Mid-level information such as colours and textures, which is robust to viewpoint changes, is extremely useful for discerning whether an individual in the gallery set is the same as that in the query image. The very abstract features in the final layers are therefore not necessarily optimal for comparison, particularly within a triplet loss framework which attempts to differentiate between identities by directly comparing the feature representations of each image. To exploit the important information generated by the mid-level layers, we develop MidTriNet, which incorporates two major design choices used throughout our experiments. These choices are supported by the ablation studies on the stride length (Table 10) and ResNet blocks (Table 11) provided in Section V-D.
Layer removal: We remove the final two conv5 blocks to strike a balance between the powerful representation ability that is characteristic of conv5 blocks and the re-identification task-specific efficacy of mid-level layers. Not only does this improve the feature embedding for re-ID, but it also helps to protect the model against overfitting, which is extremely important for this data set as described in Section III-B1. Through our extensive experimentation, we find that removing the final two ResNet blocks works best.
Stride: To best exploit mid-level features for re-ID, we reduce the stride length in the conv4 block from 2 to 1. This ensures that we have more informative feature maps at the important mid-level layers, which enriches the output of those layers to improve the final feature representation. The conv5 stride length is typically 1 for this reason. Reducing the stride length of conv4 to 1 allows us to focus on those features in the same way. This allows us to better compare the similarity between two images which benefits the network at all stages of training and testing.
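The effect of the stride change on feature-map resolution can be checked with simple arithmetic. The sketch below assumes the usual ResNet stage strides (conv1 and max-pool with stride 2, conv2 with stride 1, conv3 and conv4 with stride 2) and a hypothetical 256×128 input; the input size is not stated in this section and is used only for illustration.

```python
def feature_map_size(height, width, strides):
    """Spatial size after applying each stage stride in turn (ceil division)."""
    for s in strides:
        height = -(-height // s)
        width = -(-width // s)
    return height, width


# Stage strides up to and including conv4: conv1, max-pool, conv2, conv3, conv4.
standard_strides = [2, 2, 1, 2, 2]
reduced_strides = [2, 2, 1, 2, 1]   # conv4 stride reduced from 2 to 1

standard_size = feature_map_size(256, 128, standard_strides)
reduced_size = feature_map_size(256, 128, reduced_strides)
print(standard_size, reduced_size)
```

Under these assumptions, the conv4 output doubles in each spatial dimension, so the mid-level feature map carries four times as many spatial locations.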

C. UNIFICATION TERMS
The triplet loss aims to simultaneously pull images of the same identity closer together whilst pushing away an image of a different identity. This can be difficult when dealing with unified data. Different data sets have different characteristics (camera intrinsics, lighting conditions, etc.), so the feature representations are likely to be further apart from one another on average. This means that it is more difficult to find hard negatives (and thus hard triplets), so the model risks being unable to handle difficult situations when it comes to testing.
To counteract this, we mine the hardest negatives and positives across the batch. We design unification terms to separate hard negatives and compress hard positives, and append them to the loss function.

1) Loss Function
Let $T$ be the set of triplets, where $t = (x_0, x_0^+, x_0^-) \in T$ is a triplet comprising a query image $x_0$, a positive image $x_0^+$ from the same identity as $x_0$ and a negative image $x_0^-$ from a different identity. Let $f(x)$ be the feature vector of an image $x$. $H^+$ is the set of the hardest positive pairs $h^+ = (x_1, x_1^+)$ with lowest similarity and $H^-$ is the set of the hardest negative pairs $h^- = (x_2, x_2^-)$ with highest similarity. We set $|H^+| = |H^-| = |T| = 4P$, where $P$ is the number of identities in each batch. Throughout this section, we refer to $D$ as the distance between two feature representations.
The first term we use is a modified triplet loss function presented in [1]. Their analysis shows that replacing the traditional hard margin $\alpha$ from (1) with the softplus function $\mathrm{softplus}(D) = \log(1 + e^{D})$ is beneficial. The softplus function is shown in Figure 4(a). Overall, we have
$$L_t = \sum_{t \in T} \mathrm{softplus}\big(D(f(x_0), f(x_0^+)) - D(f(x_0), f(x_0^-))\big) \qquad (2)$$
The second term focuses on pulling together the positive pairs. We design the function $\Psi(D) = \psi^{D} - 1$, where $\psi > 1$ is a constant, in order to heavily punish large distances between positive pairs as seen in Figure 4(b). This forces the network to pull images from the same class together during training in order to keep the loss minimal. In our experiments we use $\psi = 1.1$. The positive unification term is written as:
$$L_p = \sum_{h^+ \in H^+} \Psi\big(D(f(x_1), f(x_1^+))\big) \qquad (3)$$
Note that this is especially important in the vehicle domain. As discussed in Section I, vehicle shape can change drastically in different viewpoints. One of the reasons why our network is so robust to this shape deformation is that we force the model to learn from additional hard positives, of which a large proportion will typically be vehicles in a very different pose.
The third term works similarly but aims to push negative images away from each other. We adopt $\Phi(D) = \frac{\varphi}{\frac{\varphi}{10} + D}$ to punish negative pairs with small distances and reward pairs with large distances as seen in Figure 4(c). Throughout our experiments, we set $\varphi = 30$. The negative unification term is written as:
$$L_n = \sum_{h^- \in H^-} \Phi\big(D(f(x_2), f(x_2^-))\big) \qquad (4)$$
From Equations (2), (3) and (4), we obtain our unification loss function:
$$L = \alpha_t L_t + \alpha_p L_p + \alpha_n L_n \qquad (5)$$
where $\alpha_t$, $\alpha_p$ and $\alpha_n$ are weights for their relative losses. Empirically, we found that setting $\alpha_t = 0.05$, $\alpha_p = 0.5$ and $\alpha_n = 0.5$ performs best.
FIGURE 4. Visualisations of the softplus, Ψ and Φ functions used to calculate the overall loss function found in Equation (5). (a) softplus(D).
One of the most important elements of building a framework which utilises a triplet loss function is effective mining.
We require the framework to effectively match vehicles with significant distortions in shape, thus it is imperative that the model is trained on the most difficult samples available. Likewise, we wish for it to be able to handle outlier scenarios, e.g. where a person is wearing a bag, thus having a highly different appearance in different viewpoints. To challenge the framework to be able to handle these tough cases, sufficiently difficult triplets need to be mined. However, if the model is only trained on the most difficult triplets, it will not be representative of the entire data set and could struggle on easier examples. Let $p$ be the identity of the image $x_{p,i}$ in the batch, $B$, and let $f(x_{p,i})$ be its feature vector, where $p = 1, \ldots, P$ and $i = 1, \ldots, 4$. Each query image $x_{p,i}$ is paired with its hardest positive image $x^+$ and hardest negative image $x^-$, where:
$$x^+ = \arg\max_{x_{p,j},\; j \neq i} D\big(f(x_{p,i}), f(x_{p,j})\big), \qquad x^- = \arg\min_{x_{q,j},\; q \neq p} D\big(f(x_{p,i}), f(x_{q,j})\big)$$
Together, we obtain the triplet $t_{p,i} = (x_{p,i}, x^+, x^-)$ and these form the set of triplets, $T$, where $|T| = 4P$.
In a similar manner, we scan across the entire distance matrix to find the set of hardest positive pairs, $H^+$, and the set of hardest negative pairs, $H^-$, with $|H^+| = |H^-| = 4P$.
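The mining and loss terms of this section can be sketched end to end in a few lines. This is an illustrative reading of the text and not the released implementation: the per-term sums (rather than means), the helper names, and the 1-D toy embeddings are assumptions. The constants ψ = 1.1, φ = 30 and the weights 0.05, 0.5, 0.5 are taken from the text.

```python
import math


def softplus(d):
    return math.log1p(math.exp(d))


def psi(d, base=1.1):
    """Penalty that grows quickly with the distance between a positive pair."""
    return base ** d - 1.0


def phi(d, c=30.0):
    """Penalty that is large when a negative pair is close together."""
    return c / (c / 10.0 + d)


def mine(labels, dist):
    """Batch-hard triplets plus the 4P hardest positive and negative pairs."""
    n = len(labels)  # n = 4P images in the batch
    triplets, pos_pairs, neg_pairs = [], [], []
    for i in range(n):
        pos = [j for j in range(n) if j != i and labels[j] == labels[i]]
        neg = [j for j in range(n) if labels[j] != labels[i]]
        hardest_pos = max(pos, key=lambda j: dist[i][j])  # farthest same-ID image
        hardest_neg = min(neg, key=lambda j: dist[i][j])  # closest other-ID image
        triplets.append((i, hardest_pos, hardest_neg))
        pos_pairs += [(dist[i][j], i, j) for j in pos if j > i]
        neg_pairs += [(dist[i][j], i, j) for j in neg if j > i]
    pos_pairs.sort(reverse=True)  # largest positive distances first
    neg_pairs.sort()              # smallest negative distances first
    return triplets, pos_pairs[:n], neg_pairs[:n]


def unification_loss(labels, dist, a_t=0.05, a_p=0.5, a_n=0.5):
    trips, hard_p, hard_n = mine(labels, dist)
    l_t = sum(softplus(dist[i][p] - dist[i][q]) for i, p, q in trips)
    l_p = sum(psi(d) for d, _, _ in hard_p)
    l_n = sum(phi(d) for d, _, _ in hard_n)
    return a_t * l_t + a_p * l_p + a_n * l_n


def line_distances(xs):
    """Distance matrix for toy 1-D embeddings."""
    return [[abs(a - b) for b in xs] for a in xs]


labels = [0, 0, 0, 0, 1, 1, 1, 1]  # P = 2 identities, 4 images each
separated = line_distances([0.0, 0.1, 0.2, 0.3, 5.0, 5.1, 5.2, 5.3])
mixed = line_distances([0.0, 1.0, 2.0, 3.0, 0.5, 1.5, 2.5, 3.5])

triplets, hard_pos, hard_neg = mine(labels, separated)
loss_separated = unification_loss(labels, separated)
loss_mixed = unification_loss(labels, mixed)
print(triplets[0], loss_separated < loss_mixed)
```

In the toy data, the batch whose identities are interleaved receives the larger loss, mainly through the negative term, which is the behaviour the unification terms are designed to produce.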

V. EXPERIMENTAL RESULTS
In this section, the proposed architectures are exhaustively evaluated on the most popular modern data sets for person re-ID and vehicle re-ID. Our data set and source code can be found at https://github.com/PVUD. We give results for MidTriNet and MidTriNet+UT (Unification Terms). MidTriNet is the baseline TriNet model, with the addition of the design choices described in Section IV-B to harness mid-level features: reduction of the stride length in the conv4 block, and removal of the final two conv5 blocks. MidTriNet+UT includes the additional terms from Section IV-C which specifically help the system to handle the mixed data in our unified data set.
In addition to PVUD, we test our framework on individual data sets. For person re-ID, the Market-1501 [48] and CUHK03 [17] data sets are selected. Meanwhile for vehicle re-ID, we use the widely used VeRi [49], [50] and VehicleID [20] data sets.

A. EVALUATION PROTOCOL
For PVUD and person re-ID, we use the standard 'mean average precision' (mAP) and 'rank-1' metrics to evaluate our framework against the state-of-the-art methods. As many vehicle re-ID methods do not report mAP, we additionally report our 'rank-5' score for better comparison. For details on the individual data sets, see Section III-A.
The rank-$x$ matching rate is defined as the percentage of query images with a correct match within the highest $x$ ranks. The precision, $P_x$, at rank $x$ is calculated via
$$P_x = \frac{\text{true positives}}{\text{true positives} + \text{false positives}}.$$
The average precision for a given query, $q$, is calculated by taking the average of the precision scores at each true positive in the ranking list:
$$AP(q) = \frac{1}{N} \sum_{i=1}^{N} P_i^+,$$
where $N$ is the number of true positives in the gallery and $P_i^+$ denotes the precision at the $i$-th true positive in the ranking list. The mAP is then calculated via
$$mAP = \frac{1}{|Q|} \sum_{q \in Q} AP(q),$$
where $Q$ is the set of queries.
PVUD: For PVUD, we take subsets of the standard train/query/gallery splits of each of the four individual data sets. These are procured by following standard procedures as described below.
Note: as described in Section III-B, the training set of PVUD contains some instances from the VeRi test set. For fairness, we exclude these when we train on PVUD and test on VeRi to analyse the robustness of our system.
Market-1501: For the Market-1501 data set, either the single query or multiple query setting can be used. We evaluate on the single query setting as it is more challenging and applicable to real world scenarios.
CUHK03: Recently, a new train/test split has gained popularity for the CUHK03 data set. However, as discussed in Section III-B, to keep our unified data set balanced, we were required to use a train/test split similar to the initial CUHK03 split. For this reason, we conduct our tests on the original split.
VeRi: The VeRi data set differs from other re-ID data sets as it maps temporally close images in the gallery onto tracks. The re-identification is computed from the query to the entire track (image-to-track) rather than just to gallery images (image-to-image). We follow the standard procedure for computing the similarity between a query image and a track: we calculate the similarity between the query image and every image on the track and then take the maximum.
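The image-to-track matching used for VeRi can be sketched as follows. This is an illustrative example only: the function names are hypothetical, and the negative Euclidean distance is assumed as the similarity because the text does not fix a particular measure.

```python
import math


def negative_euclidean(u, v):
    """Assumed similarity: larger (closer to zero) means more similar."""
    return -math.dist(u, v)


def image_to_track_similarity(query_feature, track_features, similarity=negative_euclidean):
    """Similarity between a query and a track: the maximum over the track's images."""
    return max(similarity(query_feature, g) for g in track_features)


def rank_tracks(query_feature, tracks, similarity=negative_euclidean):
    """Return track ids sorted from most to least similar to the query."""
    scores = {
        track_id: image_to_track_similarity(query_feature, feats, similarity)
        for track_id, feats in tracks.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy gallery with two tracks; track "a" contains one image very close to the query.
toy_query = [0.0, 0.0]
toy_tracks = {
    "a": [[5.0, 5.0], [1.0, 0.0]],
    "b": [[2.0, 0.0], [3.0, 0.0]],
}
toy_ranking = rank_tracks(toy_query, toy_tracks)
```

Taking the maximum means a track is judged by its single best-matching image, so one favourable viewpoint on the track is enough for a correct match.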
VehicleID: For the VehicleID data set, we follow the standard procedure as described in [20]. Given an identity $i$ with $N_i$ images in the test set, $\max(6, N_i - 1)$ images of identity $i$ are placed into the gallery set, and the remaining images are put into the query set.
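The rank-x, average precision and mAP definitions given at the start of this subsection can be sketched in a few lines. The sketch assumes that each query's ranking has already been reduced to a list of booleans marking which gallery entries share the query identity; the function names are hypothetical, and rates are returned as fractions rather than percentages.

```python
def precisions_at_true_positives(ranked_matches):
    """Precision P_i^+ at each true positive in a ranked list of booleans."""
    precisions, hits = [], 0
    for rank, is_match in enumerate(ranked_matches, start=1):
        if is_match:
            hits += 1
            precisions.append(hits / rank)  # true positives / items retrieved so far
    return precisions


def average_precision(ranked_matches):
    """AP(q): mean of the precisions at the N true positives (0.0 if there are none)."""
    precisions = precisions_at_true_positives(ranked_matches)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(all_ranked_matches):
    """mAP: average of AP(q) over all queries."""
    return sum(average_precision(r) for r in all_ranked_matches) / len(all_ranked_matches)


def rank_x_rate(all_ranked_matches, x):
    """Fraction of queries with at least one correct match in the top x ranks."""
    return sum(any(r[:x]) for r in all_ranked_matches) / len(all_ranked_matches)


# Two toy queries over a three-image gallery.
toy_rankings = [
    [True, False, True],   # AP = (1/1 + 2/3) / 2 = 5/6
    [False, True, False],  # AP = (1/2) / 1 = 1/2
]
toy_map = mean_average_precision(toy_rankings)
toy_rank1 = rank_x_rate(toy_rankings, 1)
toy_rank2 = rank_x_rate(toy_rankings, 2)
```

For the toy rankings the mAP is the mean of 5/6 and 1/2, i.e. 2/3, while the rank-1 and rank-2 rates are 0.5 and 1.0 respectively.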

B. COMPARISON WITH BASELINES
PVUD: Our results on PVUD can be found in Table 3. It can be observed that the unification terms introduced in Section IV-C boost the mAP by 0.92% and increase the rank-1 matching rate by 1.13%. Moreover, we attain a significant improvement over the standard TriNet with both networks. We also compare with ResNet where the triplet loss is replaced with a cross-entropy loss, and with two other state-of-the-art re-ID methods: Part-based Convolutional Baseline (PCB) and Harmonious Attention Network (HA-CNN) [21], [27]. PCB is designed specifically to handle person data, and was trained with a ResNet-50 backbone for fair comparison. HA-CNN uses an attention module that learns from the data that is provided, so it could be applied to vehicles as easily as it is to persons.
Both methods perform significantly worse than the standard ResNet, implying that the part-based and attention mechanisms diminish performance in both cases. For PCB, this is because the part models cannot adequately handle vehicle data, as discussed in Section II-A. PCB encodes better feature representations of pedestrians but cannot generalise to vehicle data that it was not designed to handle. Therefore, PCB obtains moderately better performance at re-identifying pedestrians within PVUD but drastically worse performance at re-identifying vehicles. This results in a net performance decrease of 4.4% when applied to the overall data set. In contrast, HA-CNN has demonstrated strong performance when trained on each domain individually. However, the attention mechanism of HA-CNN becomes confused when trying to handle two drastically different data types simultaneously, which results in a sharp decrease in performance. We discuss the impact of the individual unification terms in more detail in Section V-D.
We compare against the baseline TriNet in two separate settings in Table 4. First, we train and test on individual data sets in the standard way. Second, we train all models on PVUD and test on the individual data sets to analyse how robust the models are at handling data from different sources. This is very challenging, as the model is required to use information learned from one data set when testing on another. Despite this, we see very little performance loss on any of the data sets. Both of our models considerably outperform the baseline in both settings for all data sets.
Further, we can see that the unification terms benefit performance across all data sets when the models are trained on PVUD. This demonstrates that the additional sample mining for the unification terms helps to create a much stronger model for handling mixed data.

C. COMPARISON WITH THE STATE OF THE ART
CUHK03: Our results on the CUHK03 data set are presented in Table 5. In particular, MidTriNet outperforms the mAP of the popular LSRO [59] by 1.1% on the detected data and exceeds the second best rank-1 score by 2.6%.
Market-1501: Our results on the Market-1501 data set can be found in Table 6. Our results are competitive with the state of the art, attaining an mAP just 1.7% lower than HA-CNN and achieving a 4.9% improvement over the original TriNet.
VeRi: We present our results on the VeRi data set in Table 7. Our method clearly outperforms the state of the art at the vehicle re-ID task. We obtain a rank-1 score almost 6% higher than the next best result.
VehicleID: Our results on the VehicleID data set can be found in Table 8. Our MidTriNet model consistently attains the highest mAP and rank-1 results across all three settings. Our methods considerably outperform the state of the art, achieving a 4% rank-1 improvement over the best method that does not use a triplet loss.
On all five data sets presented in Tables 3-8, MidTriNet significantly outperforms the baseline TriNet model. This consistent performance enhancement across domains provides conclusive experimental evidence that harnessing mid-level information is an underlying principle of re-ID.

D. ABLATION STUDIES
In this section, we present our ablation studies to demonstrate the benefits of a) mid-level information for re-ID and b) unification terms. All experiments in this section are performed on PVUD. We include confidence intervals at a 95% confidence level to demonstrate the significance of our design choices. We calculate these confidence intervals using the guidance for information retrieval tasks in [66].
Table 9 shows our ablation studies on the batch size. In particular, we find that larger batch sizes attain greater re-identification performance. This is because we can mine harder triplets, negatives and positives for our loss function, so the framework learns more efficiently.
Our results with different stride sizes are presented in Table 10. We see that reducing the stride from 2 to 1 in the third ResNet block boosts both the mAP and rank-1 performance by over 1.7%. This shows that the more informative mid-level feature maps are very important in boosting re-ID performance. Table 11 shows that removing the final two conv5 blocks significantly boosts performance. This supports the notion that mid-level features are more suitable than final-level features for a triplet loss re-ID framework.
We perform ablation studies on our unification terms in Table 12. We see that both the positive and the negative term contribute to the overall score. When the negative term is excluded, the positive term provides a performance improvement of 0.43% on the mAP metric. Likewise, when the positive term is excluded, the mAP is 0.34% higher than that of the standard MidTriNet. We arrived at the highest performance with the unification terms weighted equally and very large compared to the standard triplet loss term.

VI. CONCLUSION
In this paper, we have unified person and vehicle re-identification: firstly, by constructing a balanced, challenging data set that combines the two most popular data sets in each domain; secondly, by designing a triplet loss framework that beats or is competitive with state-of-the-art methods on both tasks and also attains high performance on our newly designed data set. We propose MidTriNet to demonstrate that utilising mid-level features is an underlying principle of re-ID. Our design to exploit them boosts performance across all data sets. Finally, we show the value of our data set by appending terms to the loss function, specifically to improve the accuracy when the data is merged. The unification terms presented in this paper have been demonstrated to benefit the network, specifically when targeting mixed data streams. As future work, we wish to explore this potential by deriving more complex mechanisms which target multi-domain data. Many popular re-ID approaches currently make use of attention modules and part-based representations to learn better feature representations by giving less weight to background pixels. One planned direction is to incorporate these ideas into our framework to further improve the network's robustness when dealing with data from different domains.
Transfer learning, whereby a model is trained on one data set and tested on another, is a largely under-researched area in re-ID. Our data set forces models to be able to use data from one data set to help the training of another set in the same domain. Our framework also attains strong performance when trained on the unified data set and tested on individual data sets, suffering little performance loss. We wish to explore transfer learning in the future by training state-of-the-art methods on this data set and testing on a data set that is not a component of the unified one, such as DukeMTMC-reID. We hypothesise that the model is forced to be more robust and is less likely to overfit on our unified data.
We also plan to apply this to real-world problems as future work, extending the architecture so that vehicle re-ID can be used to identify the area that a person of interest has travelled to via vehicle. This would significantly narrow down the number of cameras to evaluate, providing us with a much greater likelihood of re-identifying the individual.
HUBERT P. H. SHUM is an Associate Professor (Reader) at Northumbria University, U.K., as well as the Director of Research and Innovation of the Computer and Information Sciences Department. Before this, he worked as a Senior Lecturer at Northumbria University, U.K., a Lecturer in the University of Worcester, U.K., a post-doctoral researcher in RIKEN, Japan, as well as a research assistant in the City University of Hong Kong. He received his Ph.D. degree from the School of Informatics at the University of Edinburgh, U.K. His research interests include computer graphics, computer vision, motion analysis and machine learning.
VOLUME 4, 2016