An Efficient Person Re-Identification Model Based on New Regularization Technique

The aim of person re-identification (ReID) is to recognize the same persons across different scenes. Because many demanding applications involve large-scale data, increasing attention has been devoted to matching efficiency as well as accuracy. Many binary-coding-based methods have been proposed for efficient ReID. These methods either learn projections that map high-dimensional features into compact binary codes, or rely on deep neural networks into which an extra fully connected layer with a tanh-like activation is simply inserted. However, the former approach requires hand-crafted feature extraction and time-consuming, complex (discrete) optimizations, whereas the latter loses much of the essential discriminative information because of the simplistic activation functions. In the current work, a ReID framework is proposed that is inspired by the adversarial framework and depends on a new regularization approach (ABC-NReg). We embedded the discriminative network of adversarial binary coding (ABC) with our new regularizer, which improved the discriminative power when combined with a triplet network. The ABC-NReg and triplet networks were jointly optimized, and three large-scale benchmark datasets, namely CUHK03, Market-1501, and DukeMTMC-reID, were utilized to test the performance of the proposed model. We further compared the simulation results with existing hashing and non-hashing algorithms. Our model provided better results than existing models on the Market-1501 and DukeMTMC-reID datasets at Rank-1. For the CUHK03 dataset, the proposed model exceeded the performance of other works at Rank 5 and Rank 20.


I. INTRODUCTION
Given one or more images of a pedestrian, person re-identification (ReID) recognizes the person with the same identity across multiple images taken from different angles and viewpoints and in various scenes. ReID supports many potential applications, such as long-term cross-view tracking and criminal retrieval. However, the task remains challenging because of the remarkable differences in pose, viewpoint, and illumination across cameras. Many ReID methods have been presented [1]-[6]. Most of them extract high-dimensional (usually thousands of dimensions or more) features to comprehensively represent persons through different clues (e.g., color, texture, and spatial-temporal clues). This brings a high level of computational complexity to the subsequent similarity measurement (e.g., metric learning). Moreover, recent large-scale ReID benchmarks include many identities and cameras to simulate real-world scenarios, which makes existing state-of-the-art ReID approaches computationally unaffordable [7]. Thus, despite the considerable improvement in matching accuracy, the computational and memory requirements remain a serious challenge. Binary coding (i.e., hashing), as utilized by [8], [9], maps high-dimensional features into compact binary codes and calculates similarities efficiently in the low-dimensional Hamming space; it is therefore a promising direction for efficient ReID. Hashing-based ReID methods fall into two main categories. The first category learns various projection matrices to map the original features to a simultaneously low-dimensional and discriminative Hamming space.
Its objective is typically a non-convex joint function of multiple subtasks (e.g., similarity-preserving mapping and binary transformation), which requires the explicit design of sophisticated functions and time-consuming non-convex or discrete optimizations. Both memory storage and computational efficiency become serious issues, particularly when working with large-scale data. The second category uses deep neural network-based methods, which can process large-scale data much more efficiently than traditional methods by exploiting mini-batch learning algorithms and advanced GPUs. Here, the binary codes are obtained by inserting hashing layers at the end of the network. A hashing layer is a fully connected layer followed by a tanh-like activation that transforms the outputs into a near-binary form. This simple scheme barely constrains the outputs to satisfy the essential principles of hashing (e.g., balancedness and independence [10]) needed to generate high-quality binary codes. Additionally, to preserve discriminability, the hashing layers' outputs tend to lie in the approximately linear part of the tanh-like functions, so directly binarizing the outputs with the sign function loses much of the discriminative information. This work presents a new regularization-based adversarial framework combined with a triplet network to address these issues. Our main contributions are summarized as follows: • We introduce a deep adversarial binary transformation strategy. The proposed model consists of a CNN-based generative model that extracts features and a discriminative model that differentiates between the binary and real-valued features. The CNN is trained to produce binary features in order to confuse the discriminator.
• The discriminative model is integrated with the new regularizer, which regularizes the features into binary codes based on the standard deviation of the weight values.
• The adversarial binary coding model is incorporated into an end-to-end, new-regularization-based deep neural network for efficient ReID. The binary transformation and similarity measurement are jointly optimized, so that the discriminative information is preserved to the maximum extent during feature binarization.
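The tanh-style hashing layer criticized above can be sketched in plain NumPy (the feature dimension, code length, and small-weight initialization here are illustrative assumptions, not the paper's configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2048-dim features for a mini-batch of 4 images.
features = rng.standard_normal((4, 2048))

# A "hashing layer": fully connected projection followed by tanh,
# squashing outputs into (-1, 1) so they approximate binary codes.
W = rng.standard_normal((2048, 128)) * 0.01  # small init keeps tanh near-linear
soft_codes = np.tanh(features @ W)

# Direct binarization with the sign function discards magnitude information.
binary_codes = np.sign(soft_codes)

# With small weights the outputs sit in tanh's near-linear region, so the
# quantization gap |tanh(x) - sign(x)| stays large on average.
quant_error = np.abs(soft_codes - binary_codes).mean()
print(binary_codes.shape, float(quant_error))
```

This is exactly the failure mode the adversarial scheme is meant to avoid: the soft codes stay far from {-1, +1}, so hard binarization loses discriminative detail.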

II. RELATED WORK
Traditional approaches normally present specific feature learning algorithms for ReID, such as low-level color features [1], [16], [17], local gradients [4], [18], [19], and high-level features [2], [3], [20]. Owing to the outstanding performance of deep neural networks, deep learning-based ReID methods [13], [21], [26] have been presented progressively. For example, Siamese CNNs [27], [29] and triplet CNNs [6], [30], [31] are widely used for similarity measurement. Recently, several binary coding-based approaches [8], [9], [32], [33] have appeared to deal with the high computation and storage costs of ReID. Moreover, generative adversarial nets (GANs) [11], [12] provide a methodology for mapping random variables from a simple distribution to a complex one. GANs have been widely used in image generation [12], [34]-[36], style transfer [37], [38], and latent feature learning [39]-[41]. To stabilize and quantify the training of GANs, a breakthrough known as the Wasserstein GAN (WGAN) was presented in [42] and enhanced in [43]. GANs have also been applied to image retrieval problems. In [44], GANs were employed to differentiate between synthetic and real images so as to enhance the discriminability of binary codes. GANs were also used to improve the intermediate representation of the generator in [45]. However, these studies simply used tanh-like activations for binarization. Apart from the aforementioned works, there are other deep ReID models [46]-[48] that present different network architectures, whose differences are mainly determined by the training losses/objectives. The authors in [49] presented a framework based on adversarial binary coding (ABC) for efficient person re-identification, which efficiently produces binary and discriminative features from pedestrian images.
A trained discriminator network was then employed to differentiate between the real-valued features and the binary ones, guiding the feature extractor network to produce features in binary form under the Wasserstein loss.

III. APPROACH
In the current work, a new GAN model is proposed based on the new regularization. The proposed framework employs the existing GAN model combined with the proposed regularizer to compress high-dimensional real-valued data into binary codes. In the first section, we introduce the new regularization model based on the standard deviation of the weight values. Next, we shed light on the principles of GANs. Finally, we describe the integration of the new regularizer with adversarial binary coding (ABC-NReg).

Generalization Error and Regularization: During training, a predictive model sometimes begins to memorize the data, which increases the generalization error: the model performs well on the training data, but its performance deteriorates drastically on unseen or test data. Regularization is introduced to prevent the model from memorizing; it penalizes the learning model so that it generalizes and performs well on unseen data. Different regularizers impose different penalty terms on the predictive model. The most widely used are the L1 and L2 regularizers.
A. L1 OR LASSO REGULARIZER
The L1 regularizer penalizes the absolute values of the weight matrix, preventing them from attaining high values. Its major benefit is that it reduces the weights of less important features to zero; these features are then not used in defining the classifier boundaries, so the L1 regularizer is employed for feature selection or reduction. The penalty is given in Equation 1:

L1 = λ Σ_{i=1}^{n} |w_i|    (1)

where n is the number of features in the data, and w_i is the corresponding weight value of each feature.
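As a concrete illustration (the weight vector and λ below are arbitrary example values, not taken from the paper), the L1 penalty can be computed as:

```python
import numpy as np

def l1_penalty(w, lam):
    """L1 (lasso) penalty: lam times the sum of absolute weight values."""
    return lam * np.abs(w).sum()

w = np.array([0.5, -0.25, 0.0, 1.0])  # example weight vector
print(l1_penalty(w, lam=0.1))         # lam * (0.5 + 0.25 + 0.0 + 1.0)
```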

B. L2 OR RIDGE REGULARIZER
Sometimes the weight values of less important features need not be exactly zero; instead, the corresponding coefficient values are decreased but kept greater than zero. For this purpose, the squared magnitudes of the weight-matrix values are penalized, which is known as the L2 regularizer. The penalty is given in Equation 2:

L2 = λ Σ_{i=1}^{n} w_i²    (2)

The λ parameter puts an extra penalty on the corresponding weight values. Since λ controls the magnitude of the coefficient values, selecting a good λ value is a critical task. Here, n is the number of features, and w_i is the coefficient value of each feature.
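For the same example weight vector (again arbitrary values, not from the paper), the L2 penalty shrinks coefficients without zeroing them:

```python
import numpy as np

def l2_penalty(w, lam):
    """L2 (ridge) penalty: lam times the sum of squared weight values.
    Coefficients shrink toward zero but are not driven exactly to zero."""
    return lam * np.square(w).sum()

w = np.array([0.5, -0.25, 0.0, 1.0])
print(l2_penalty(w, lam=0.1))  # lam * (0.25 + 0.0625 + 0.0 + 1.0)
```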
C. THE PROPOSED REGULARIZER
The new regularizer is defined in Equation 3, where k denotes the number of rows in the weight matrix and σ denotes the standard deviation of the weight values, computed as in Equation 4. As in the standard deviation, θ is squared; θ_i is the i-th feature vector, and θ_{d,i} is the d-th feature in the i-th feature vector. Here, the parameter λ controls the values of the weight matrix, and n is the weight-vector size; thus, n depends on the number of features in the dataset, specifically the number of columns of the weight matrix.
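As an illustrative sketch only (not the paper's exact Equations 3-4), a standard-deviation-based weight penalty consistent with the description above can take the following form:

```python
import numpy as np

def std_penalty(W, lam):
    """Illustrative std-based penalty (an assumption, NOT the paper's
    exact formula): lam times the sum over the k rows of the weight
    matrix of each row's standard deviation."""
    return lam * W.std(axis=1).sum()  # axis=1: std across each row

W = np.array([[0.5, -0.5],   # row std = 0.5
              [1.0,  1.0]])  # row std = 0.0 (constant row)
print(std_penalty(W, lam=0.1))
```

The intuition matches the text: rows whose weights spread widely are penalized, which keeps weight values small and clustered.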
Figure 1 shows the contours of the penalty functions, with the penalty set equal to 1 for all regularizers (L1, L2, and the new one). To keep the dimensions comparable, the sum factor is excluded from the new regularizer. The spread of the new regularizer depends on the penalty term λ: the spread increases as λ decreases, and decreases as λ increases. We observed that the L2 norm behaves circularly and encloses the L1 region, while the new regularizer behaves like a parabola relative to the other regularizers. This allows the new regularizer to take values beyond the L2 limits; it enlarges the space of admissible values and further squeezes or expands this space according to the penalty term λ.

1) A BRIEF REVIEW OF GANs
In the GAN model, the following two models are trained simultaneously: the generative model G and the discriminative model D. The generative model is trained to map a prior distribution p_z(z) to a generative space G(z; θ_g), while the discriminative model D(x; θ_d) takes an input x and yields the probability that the input comes from the real distribution rather than from G. Training G amounts to maximizing the probability that D makes a mistake. The loss function of the GAN model is:

min_G max_D E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))]    (5)

WGAN is a variant of GANs presented by Arjovsky et al. [42], [50] to overcome the difficulty of training the GAN model. To evaluate the similarity between distributions, the WGAN model optimizes the Wasserstein loss in place of the Jensen-Shannon loss:

min_G max_D E_{x∼p_data(x)}[D(x)] − E_{z∼p_z(z)}[D(G(z))]    (6)

In the present study, we followed the WGAN training process for adversarial learning, whose gradients are computed from the Wasserstein-1 distance and hence provide stronger stability.
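The Wasserstein objective above can be illustrated with a minimal NumPy sketch; the toy critic scores below stand in for a real critic network's outputs (this is an illustration, not the paper's implementation, and it omits the Lipschitz constraint that WGAN requires):

```python
import numpy as np

def wasserstein_critic_loss(d_real, d_fake):
    """The critic maximizes E[D(x_real)] - E[D(G(z))]; written as a
    loss to minimize, it is the negation of that difference."""
    return -(np.mean(d_real) - np.mean(d_fake))

def wasserstein_generator_loss(d_fake):
    """The generator minimizes -E[D(G(z))], pushing fake scores up."""
    return -np.mean(d_fake)

d_real = np.array([0.9, 0.8, 1.1])  # critic scores on real samples
d_fake = np.array([0.1, 0.2, 0.0])  # critic scores on generated samples
print(wasserstein_critic_loss(d_real, d_fake))
print(wasserstein_generator_loss(d_fake))
```

Unlike the log-based loss of Equation 5, these scores are unbounded, which is what gives WGAN its more stable gradients.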

2) ADVERSARIAL BINARY CODING COMBINED WITH THE PROPOSED REGULARIZER
Our binary coding method builds on the method of Liu et al. [49], who proposed adversarial binary coding inspired by GANs. Our method differs from theirs in that we apply our proposed regularization in the discriminative model. A deep neural network learns the mapping from the original data distribution, i.e., the input images, to binary vectors within a GAN model. Below, we explain the binary mapping of the end-to-end efficient ReID framework.
As shown in figure 2, we utilized ResNet-50, a CNN-based feature extractor, in our proposed adversarial binary coding architecture. ResNet-50 is available as a pretrained model in the PyTorch library [51]. The feature extractor transforms the input images into feature vectors. For every bit of the binary vector, we simultaneously executed random sampling through a binary code sampler. The Bernoulli distribution was utilized for sampling because it satisfies the principles of effective binary coding [10], with the different bits independent of each other. With the binary vector treated as the positive sample and the real-valued feature vector as the negative sample, the discriminator is trained with the new regularization (Eq. 4) to perform this classification task. Likewise, the extractor is trained to produce feature vectors under the Wasserstein loss given in Eq. 6. In GAN terms, the feature extractor can be considered a function f(K) that plays the role of the generator G under an encoding distribution q(O|K), where K is a batch of n images, K = {K_1, K_2, . . . , K_n}, and O = {o_1, o_2, . . . , o_n} are the feature vectors extracted by the generator. The aim of q(O|K) is to map data from the original distribution p_k to another distribution q. The Wasserstein distance is utilized to regularize the extractor by matching the posterior q to a prior binomial distribution; a binomial distribution is equivalent to multiple Bernoulli samplings with the same probability. For the feature extractor, the ResNet-50 model is used with the Rectified Linear Unit (ReLU) as the activation function. ReLU yields non-negative values, which helps in representing binary codes by {0, 1}. Furthermore, the network weights are initialized from a Gaussian distribution with small values.
Our new regularization prevents the weights from attaining high values, thus avoiding gradient vanishing, and the weights consequently take values near zero. Moreover, unstable optimization may occur in the absence of normalization; therefore, we followed the method of [49] and normalized both the output feature vectors and the sampled binary codes.
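The Bernoulli code sampler and the normalization step described above might be sketched as follows (batch size is illustrative; the 2048-bit code length matches the codes reported in the comparisons, and p = 0.5 gives the balancedness that effective binary coding requires):

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_binary_codes(batch_size, code_len, p=0.5):
    """Draw each bit independently from a Bernoulli(p) distribution.
    p = 0.5 yields balanced codes (roughly half the bits are 1)."""
    return rng.binomial(n=1, p=p, size=(batch_size, code_len)).astype(np.float64)

def l2_normalize(x, eps=1e-12):
    """Row-wise L2 normalization, as applied to both the sampled codes
    and the extractor's feature vectors before the discriminator."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

codes = sample_binary_codes(batch_size=4, code_len=2048)
normed = l2_normalize(codes)
print(codes.shape, set(np.unique(codes)))
```

In training, `codes` would serve as the discriminator's positive samples and the normalized extractor features as its negatives.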

D. ReID FRAMEWORK BASED ON TRIPLET LOSS
To measure similarity among the binary codes, to transform features into binary form, and to ensure the discriminative behavior of the learned binary codes, we integrated the new ABC-NReg into a triplet network. The mathematical formulation of the triplet loss [52] is:

L_triplet = Σ max(d(f_i, f_j) − d(f_i, g_k) + α, 0)    (8)

where f_i, f_j, and g_k are the input features, α denotes the distance margin imposed between negative and positive pairs, and d(·) denotes the similarity between features. f_i and f_j are features from the same class, while g_k is a feature from another class. The triplet loss is used because it forces the distance between positive pairs to be smaller than the distance between negative pairs by at least the margin α.
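Assuming d(·) is the Euclidean distance (an assumption for illustration; toy 2-D features replace real embeddings), Equation 8 for a single triplet can be sketched as:

```python
import numpy as np

def triplet_loss(f_i, f_j, g_k, alpha=0.2):
    """Hinge-style triplet loss: the anchor-positive distance must be
    smaller than the anchor-negative distance by at least the margin alpha."""
    d_pos = np.linalg.norm(f_i - f_j)  # distance within the same identity
    d_neg = np.linalg.norm(f_i - g_k)  # distance to a different identity
    return max(d_pos - d_neg + alpha, 0.0)

f_i = np.array([0.0, 0.0])  # anchor
f_j = np.array([0.1, 0.0])  # positive (same identity)
g_k = np.array([1.0, 0.0])  # negative (different identity)
print(triplet_loss(f_i, f_j, g_k, alpha=0.2))  # margin satisfied -> loss 0.0
```

When the negative drifts inside the margin (e.g., g_k = [0.15, 0.0]), the loss becomes positive and pushes the embeddings apart.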

IV. EXPERIMENTS

A. DATASET AND SETTINGS
In the present study, three large-scale ReID benchmark datasets are utilized. The CUHK03 dataset comprises 14,096 images of 1,467 identities captured by six surveillance cameras, with both manually and automatically labeled bounding boxes of different sizes. All images are resized to 160 × 60 as in [49], and the 20% train/test split criterion of [13] is followed. The number of iterations was set to 6,000, and α was set to 0.2 and increased to 0.5 in increments of 0.1 after every 1,000 iterations.
The Market-1501 dataset comprises 32,668 images with 128 × 64 bounding boxes of 1,501 pedestrians. The number of iterations was set to 8,000, and α was set to 0.2 and increased by 0.1 after every 2,000 iterations.
The DukeMTMC-reID dataset comprises 36,411 manually labeled bounding boxes of 1,401 identities captured by 8 cameras, with a fixed training/testing split. In our simulations, all images are resized to 128 × 64. The number of iterations was set to 8,000, and α was set to 0.2, updated to 0.3 after iteration 2,000 and to 0.4 after iteration 5,000.
Simulations were performed on an Amazon EC2 g3.4xlarge instance, which contains one NVIDIA Tesla M60 GPU with 16 vCPUs and 122 GB of RAM. To augment the training images, each image was flipped horizontally. The batch size is set to 128, and the extractor's learning rate is set to 10^-3 and decreased to 10^-4 during training. Similarly, the discriminator's learning rate is set to 10^-2.
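The schedule and augmentation above can be sketched as follows (the iteration at which the extractor's rate drops is an assumption for illustration; the text only states that it decreases during training):

```python
import numpy as np

def extractor_lr(iteration, total_iters=8000, decay_frac=0.5):
    """Extractor learning rate: starts at 1e-3, decays to 1e-4.
    The halfway switch point is an illustrative assumption."""
    return 1e-3 if iteration < total_iters * decay_frac else 1e-4

def horizontal_flip(image):
    """Augmentation used here: flip an (H, W, C) image left-right."""
    return image[:, ::-1, :]

img = np.arange(12).reshape(2, 2, 3)  # tiny stand-in image
flipped = horizontal_flip(img)
print(extractor_lr(0), extractor_lr(7999), flipped.shape)
```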

B. EVALUATION

C. COMPARISON WITH BINARY CODING BASED METHODS
In this section, our proposed method is compared with the following state-of-the-art hashing-based ReID methods:

1) Deep hashing, consisting of Deep Semantic Ranking Based Hashing (DSRH) [32] and Deep Regularized Similarity Comparison Hashing (DRSCH) [33].
2) Non-deep hashing, comprising Cross-Camera Semantic Binary Transformation (CSBT) [9]. Since CSBT outperformed the earlier methods, it is the main method against which we compare our results.
3) 512-bit ABC+triplet [49].

Results on the CUHK03 and Market-1501 datasets are displayed in Tables 1 and 2, respectively. Figures 3 to 7 present the triplet and discriminator losses on the three datasets. The progressions of both losses are similar to those presented in [49]; the slight variations are caused by the new regularization in the discriminative model. Since our work is closest to that of [49], we mainly compare our performance with their results. For the CUHK03 dataset, at Rank 1 our method performs slightly below [49]; notably, our results are better at Rank 5 and Rank 20, and the mAP value is also slightly higher.
For Market-1501, as displayed in Table 2, our method provides notably better results than [49] because the new regularization method increases the discrimination power of the discriminator. In other words, the discriminator is well trained to distinguish between its two inputs: the binary codes from the sampler and the features produced by the feature extractor.

D. COMPARISON WITH THE STATE-OF-THE-ART METHODS
Aside from the comparison with hashing methods, our proposed method based on the new regularization was also compared with current non-hashing ReID methods, which consist of the following algorithms:
1) Deep learning based methods, mainly including ABC+triplet [49], DeepReID [13], SIR+CIR [6], Improved Deep [21], EDM [28], Gated CNN [29], Deeply-learned Part-Aligned Representation [53], Multi-scale Deep Learning [54], Pose-Driven Deep Convolutional Model [55], Spindle Net [56], and SVDNet [57].
2) Metric learning based methods, including KISSME [16], NSL [58], and XQDA [2].
3) Local patch matching based methods, consisting of SDC [59] and BoW [14].

All comparison results for the CUHK03 (detected), Market-1501, and DukeMTMC-reID datasets are displayed in Tables 3, 4, and 5, respectively. Compared with the state-of-the-art methods that adopt high-dimensional real-valued features, our method exhibited good performance and produced good matching accuracy; our framework exceeds the present non-hashing algorithms in matching accuracy. It is worth noting that the query time is the same as that of [49], which is considerably faster than the non-hashing methods on the existing datasets. Likewise, the memory requirements were decreased by representing images as 2048-bit binary codes rather than the 26,960-dimensional real-valued features adopted by the Local Maximal Occurrence Representation (LOMO).

V. CONCLUSION
In the present work, a framework is proposed for efficient person re-identification based on recent advances in adversarial learning, particularly those inspired by the study of Liu et al. [49]. We integrate the discriminator of the ABC framework with the new regularization based on the standard deviation of the weight matrix. Furthermore, to improve the discrimination power, a deep triplet network is embedded in the ABC framework. The CUHK03, Market-1501, and DukeMTMC-reID datasets were utilized to test the performance of the proposed model. We further compared the simulation results with existing hashing and non-hashing algorithms. Our model provided better results than other existing models on the Market-1501 and DukeMTMC-reID datasets at Rank-1. For the CUHK03 dataset, the proposed model exceeded the performance of other works at Rank 5 and Rank 20.

ACKNOWLEDGMENT
The authors gratefully acknowledge DSR technical and financial support.
ABDULAZIZ ALI ALBAHR is currently an Assistant Professor with the College of Applied Medical Sciences, King Saud Bin Abdulaziz University for Health Sciences. His main research interests include natural language processing and data mining, particularly extracting keyphrases, summarizing text, and identifying prerequisite relations from educational data.
MUHAMMAD HATIM BINSAWAD received the master's degree in applied information technology and a post-baccalaureate certificate in information systems management from Towson University, USA, and the Ph.D. degree in information systems from the University of Technology Sydney (UTS), in 2019. He is currently an Assistant Professor with the Department of Computer Information Systems, Faculty of Computing and Information Technology, King Abdulaziz University.