Pedestrian Re-Identification Monitoring System Based on Deep Convolutional Neural Network



I. INTRODUCTION
The establishment of large-scale distributed camera networks has drawn considerable attention to massive video surveillance systems. Surveillance cameras produce large volumes of video with multiple data types, a low value density and strong real-time requirements, which poses many challenges to security inspection [1], [28], [29], [37], [49]. However, the shooting range of current monitoring systems cannot cover all key monitoring areas, and in general the areas monitored by multiple cameras do not overlap. This makes it extremely difficult for relevant departments to track suspect trajectories through monitoring systems. In current practical applications, the target pedestrians are matched via manual real-time monitoring, consuming a large amount of human resources and yielding a low information utilization rate. Furthermore, current methods are associated with unstable accuracies and low efficiency levels [2], [30], [35]. The bottlenecks of manual camera network monitoring methods are even more pronounced in big data environments [31], [32], [37]. Thus, in order to keep up with the rapid development of computer vision technology, a video analysis system that can simultaneously process multiple surveillance videos and accurately identify target pedestrians is urgently required [4].
The associate editor coordinating the review of this manuscript and approving it for publication was Mu-Yen Chen.
Pedestrian re-identification refers to the similarity matching of pedestrian targets across multiple surveillance cameras with non-overlapping fields of view. Its application in monitoring systems is a hot topic in the field of computer vision [5], [36]. As shown in Figure 1 [6], the target (the pedestrian, point 1) can be identified and locked in multiple cameras in different locations or in the same video (point 1, dotted circles). As the camera resolution in such systems is generally low, it is difficult to obtain identifiable features such as faces. Furthermore, the process is affected by additional factors (e.g. illumination and viewing angles), resulting in discrepancies in the appearance of the same pedestrian across different cameras. It is often difficult to capture the invariant factors of the sample due to the extreme variability and high dimensionality of the visual characteristics [7]. The simultaneous identification of multiple pedestrians is a difficult task for traditional pedestrian re-identification methods, which are also associated with long operation times and low recognition efficiency. Therefore, it is necessary to further improve the pedestrian re-identification method.
Deep learning has become a popular topic in the field of machine learning in recent years. A convolutional neural network (CNN) is a representative and widely adaptable model in the field of deep learning. In the current paper, we propose a deep learning model suitable for pedestrian re-identification. In addition, a pedestrian re-identification system based on a deep convolutional neural network was designed, whose outputs have significant social benefits for accident prevention and emergency treatment. Moreover, our proposed model can potentially generate economic benefits by saving human resources and improving the information usage, identification accuracy and work efficiency of camera monitoring systems.
More specifically, we propose a deep convolutional neural network structure that employs deep learning to solve the pedestrian re-identification problem in monitoring systems [38], [39]. The convolutional neural network extracts the feature vector of each pedestrian image, while the Euclidean distance between the feature vectors is then calculated to measure the similarity between pedestrian images. Following this, the loss function is determined, with the focal loss used to solve the sample imbalance, thus reducing the model image feature storage space, accelerating the image feature matching speed and improving the recognition accuracy [40]- [46]. Moreover, a surveillance system for pedestrian re-identification is implemented by a host computer and several high-speed network dome cameras to acquire images of the target. Images of the same person are then selected from similar image groups. The feature extraction and automatic learning of targets in complex environments are the core modules of the system, and can be applied to video surveillance and intelligence analysis in the live broadcast, education, film and television industries. In addition, our proposed model has key applications in human body recognition for complex environments such as machine vision human-computer interactions and automated robot tracking.
Recent studies have applied Siamese networks for re-identification [47], [48], demonstrating that novel deep CNNs and integrated focal loss are the two key directions in the field of person re-identification. In the current paper, we explore these two directions to solve the problem in a unified framework. Our study makes the following contributions: (1) an end-to-end pedestrian feature extraction architecture is proposed that learns hard examples using a specific fully connected layer; (2) the algorithm architecture is developed and implemented in our monitoring prototype system using digital matrix hardware and high-speed domes, and can be extended to additional tasks in future applications; and (3) experiments are conducted on a public benchmark dataset, where the proposed method achieves performance competitive with existing methods.
The rest of this paper is organized as follows. Related work is described in Section 2, and the framework of pedestrian feature extraction is explained in Section 3. The system design and implementation is introduced in detail in Section 4. The experimental results and analysis are presented in Section 5, while the conclusions and suggestions for future work are proposed in Section 6.

II. RELATED WORKS
Pedestrian re-identification originates from multi-camera tracking problems, and early research on the two topics was closely related. Since pedestrian re-identification was originally employed for tracking pedestrians in videos, most studies focused on the matching of pedestrian images [8], [12]. For example, Huang and Russell (1997) proposed the use of the Bayesian method to estimate the posterior probability of a target appearing under another camera by observing the appearance of the target under the field of view of a known camera [9]. Zajdel et al. (2005) first proposed the term ''pedestrian re-identification'' [10], [2], [33], [34]. Gheissari et al. (2006) followed this by investigating pedestrian re-identification as an independent visual problem and proposed the use of a space-time segmentation algorithm to detect the foreground area of pedestrians [8].
The complexity of pedestrian re-identification subsequently increased. Farenzena et al. (2010) extended pedestrian re-identification research from images to videos by proposing a feature extraction scheme. This was an important step for the application of pedestrian re-identification technology [11]. Following the success of convolutional neural networks in pattern recognition and classification, Ouyang Wanli et al. (2013) proposed joint deep learning for pedestrian re-identification, achieving promising results [13]. Anelia et al. (2015) combined the soft-cascade with CNN networks of different complexities to obtain a highly accurate and real-time pedestrian detection system [14]. Weihua Chen et al. (2016) proposed a multi-task deep network (MTDnet) and an across-domain architecture, with the employment of the auxiliary set to assist in the training of small target sets [26]. Song Bai et al. (2017) proposed an unconventional manifold-preserving algorithm that was able to take full advantage of the supervision of training data as well as demonstrating applications as a post-processing program for most existing algorithms, thus improving recognition accuracy [4].
After much research, pedestrian re-identification is currently in the application stage. With further studies focusing on deep learning, the performance stability of pedestrian re-identification has gradually improved for small datasets. Applying deep learning to pedestrian re-identification and monitoring has become a clear trend. Recently, a number of methods have applied Siamese networks for re-identification [47], [48]. In particular, Lin Wu et al. (2016) proposed PersonNet, integrating an adaptive root-mean-square gradient descent algorithm to seek robust features within images [47], [48]. In addition, Dangwei used dilated convolution to extract multi-scale features and reduce the information loss of traditional CNN feature extraction [25]. Furthermore, Yeong-Jun Cho et al. (2017) performed pedestrian re-identification by estimating human posture and establishing a multi-posture matching model [52]. As for feature selection, Luo Hao et al. (2019) proposed BagTricks, in which global feature learning directly learns the representation of the whole image without local reduction [51].

III. PEDESTRIAN FEATURE EXTRACTION
The techniques used for searching in pedestrian re-identification generally include feature extraction and feature similarity measurements. Traditional feature extraction methods include color histograms, LBP, Gabor and Local Patch. Feature similarity measurements are often performed using M-distance, LFDA, MFA, etc. [6]. In the current paper, we propose a deep convolutional neural network structure that employs deep learning to solve the pedestrian re-identification problems within the monitoring system. The monitoring system acquires a target image and a set of similar images, with the subsequent task of matching different images of the same person from a similar group of images.

A. NEURAL NETWORK ARCHITECTURE
The neural network architecture of the proposed model is presented in Figure 2. It is a six-component scalable network with the generalization ability to extract image features. Two-view RGB images of size 60 * 160 * 3 are input into the network, where context information of the human body is extracted. First, the first tied convolution structure with 20 filters of size 5 * 5 * 3 is deployed to calculate the high-order features of the two input images. Note that the network weights are shared by the parameter sharing mechanism to ensure that the features of the two views are calculated with the same filters. The length and width of the feature maps are then halved using the max pooling layer. Next, the second tied convolution structure is employed to determine the feature maps with 25 filters of size 5 * 5 * 20, followed by the second max pooling layer to obtain 25 feature maps with a size of 12 * 37.
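The feature-map sizes quoted above can be checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes unpadded convolutions with a stride of 1 and non-overlapping 2 * 2 max pooling, neither of which is stated explicitly in the text, and the helper names are hypothetical.

```python
def conv_out(size, kernel, stride=1):
    """Output length along one axis of an unpadded convolution (assumed)."""
    return (size - kernel) // stride + 1


def pool_out(size, window=2):
    """Output length along one axis of non-overlapping max pooling (assumed)."""
    return size // window


def front_end_shapes(shape=(60, 160)):
    """Trace the spatial size through conv, pool, conv, pool."""
    trace = [shape]
    for kernel in (5, 5):  # the two tied 5 * 5 convolution structures
        shape = tuple(conv_out(s, kernel) for s in shape)
        trace.append(shape)
        shape = tuple(pool_out(s) for s in shape)
        trace.append(shape)
    return trace


print(front_end_shapes())
# [(60, 160), (56, 156), (28, 78), (24, 74), (12, 37)]
```

Under these assumptions the trace ends at 12 * 37, which agrees with the size reported for the 25 feature maps.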
In the second component, a cross-input neighborhood difference structure is implemented to compare the feature maps at adjacent neighborhood positions. The 25 feature maps of the two views are recorded as $A_i$ and $B_i$ ($0 < i \le 25$). The output feature maps are the differences between each feature value of $A_i$ and the 5 × 5 neighborhood around the corresponding position in $B_i$. The neighborhood difference map $C_i$, of size 12 * 37 * 5 * 5, is generated from $A_i$ and $B_i$ as follows:

$$C_i(x, y) = A_i(x, y)\,\alpha(5, 5) - \omega[B_i(x, y)] \tag{1}$$

where $\alpha(5, 5) \in \mathbb{R}^{5 \times 5}$ is a 5 × 5 matrix and $\omega[B_i(x, y)] \in \mathbb{R}^{5 \times 5}$ is the neighborhood matrix of $B_i$ centered on $(x, y)$. $C'_i$ is determined by exchanging $A_i$ and $B_i$ in Eq. (1). A total of 50 neighborhood difference maps are calculated, with the ReLU activation function used to obtain the final outputs. The third component is a patch summary feature that uses 25 filters of size 5 * 5 * 25 with a step size of 5 to convolve and sum each 5 * 5 matrix in $C_i$ to obtain the overall difference. The output is recorded as Z. $C'_i$ undergoes the same calculation without weight sharing, with the output denoted as Z'.
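The cross-input neighborhood difference described above can be illustrated on a single pair of feature maps. This is a minimal sketch rather than the authors' implementation: it assumes that the 5 × 5 matrix multiplying the feature value is all ones, and that positions outside the map border are zero-padded. Both are assumptions, and `neighborhood_difference` is a hypothetical name. The ReLU step is omitted.

```python
def neighborhood_difference(a, b, k=5):
    """For each position, subtract the k x k neighborhood of b from a's value.

    a, b: 2-D lists of equal size (one feature map per view).
    Returns a nested list c where c[y][x] is a k x k matrix.
    Assumptions: all-ones weighting matrix, zero padding at the border.
    """
    height, width = len(a), len(a[0])
    r = k // 2
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            patch = []
            for dy in range(-r, r + 1):
                line = []
                for dx in range(-r, r + 1):
                    yy, xx = y + dy, x + dx
                    inside = 0 <= yy < height and 0 <= xx < width
                    neighbor = b[yy][xx] if inside else 0.0
                    line.append(a[y][x] - neighbor)
                patch.append(line)
            row.append(patch)
        result.append(row)
    return result


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[1.0, 2.0], [3.0, 4.0]]
c = neighborhood_difference(a, b)
print(c[0][0][2][2])  # center entry: a[0][0] - b[0][0]
```

Calling the function a second time with the arguments swapped gives the map for the other view.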
The fourth component involves the learning of the spatial relationships of the neighborhood differences. In particular, Z, which consists of 25 feature maps of size 12 * 37, is convolved with 25 filters of size 3 * 3 * 25 and a step size of 1. Max pooling then halves the feature maps, yielding 25 feature maps of size 5 * 18, denoted as Y. Y' is calculated from Z' in the same way.
The fifth component of the network obtains high-order connections through the fully connected layer, combining information from positions that are far apart in the feature maps. The information from Y and Y' is combined to generate a 500-dimensional vector. The ReLU function is first used to activate the features, which are then classified with the two-node focal loss layer.

B. NETWORK TRAINING BASED ON FOCAL LOSS
In the vast majority of existing pedestrian re-identification datasets, the number of potentially generated negative samples is much larger than the number of positive samples [15]. In this case, if all the samples are indiscriminately concentrated and used in the training of the network model (after scrambling the order) to minimize the risk of prediction errors, the trained network model will predict that ''the two images do not belong to the same person'', regardless of the specific content of the input image. If the number of negative samples in the training sample is limited, only a limited amount of negative samples are retained from those generated. In addition, the information from the large set of negative samples that is helpful for the training of the network model will be seriously wasted [7]. This imbalance between positive and negative samples increases the difficulty in training network models due to the large number of differences across training samples.
In order to reduce the experimental error caused by this imbalance, our proposed model uses focal loss to replace the traditional SoftMax loss calculation method. Focal loss is a standard cross-entropy loss scaled by a factor $(1 - p_t)^{\gamma}$ that reduces the loss of well-classified samples [15]. Intuitively, this dynamic scaling automatically reduces the contribution of simple samples, such that more attention is focused on difficult samples and the impact of data imbalance is mitigated.
In our model, we define focal loss for binary classification as follows. Let CE denote cross entropy, and let $y \in \{-1, +1\}$ represent the true label of the sample, with −1 indicating a negative sample and +1 a positive sample. $p \in (0, 1)$ represents the sample prediction value produced by the classifier. For y = 1, the larger the value of p, the higher the confidence of the classifier in predicting the sample as a positive example, which indicates that the classifier performs well and the loss is low. Similarly, when y = −1, a small p (large 1 − p) indicates strong classification performance. In particular, the cross-entropy loss is defined as follows:

$$CE(p, y) = \begin{cases} -\log(p), & y = 1 \\ -\log(1 - p), & y = -1 \end{cases} = -\log(p_t)$$

where $p_t$ represents the degree of matching between the predicted value produced by the classifier and the true value of the sample, namely,

$$p_t = \begin{cases} p, & y = 1 \\ 1 - p, & y = -1 \end{cases}$$

$p_t$ is used to calculate the focal loss as follows:

$$FL(p_t) = -(1 - p_t)^{\gamma} \log(p_t)$$

For large $p_t$, the degree of matching between the predicted value and the true value is high and the value of $(1 - p_t)^{\gamma}$ is small ($\gamma > 0$). As an example, let us assume that the predicted value of a positive sample is 0.99. This sample is then a ''simple sample'' in the training of the learner; that is, the learner is able to judge the true category of the sample well. Focal loss scales the loss value of this sample by $(0.01)^{\gamma}$. Assuming that $\gamma = 2$, the loss is attenuated 10,000 times; namely, the total focal loss of 10,000 samples with $p_t = 0.99$ equals the cross-entropy loss of a single sample with $p_t = 0.99$. This property reduces the influence of a large number of simple samples throughout the training process.
In the actual experiment, the model also adds weights to the positive and negative samples, further reducing the impact of the sample size on the experimental results. For example, when negative samples are frequent, the weight of the negative samples is reduced, while if the number of positive samples is small, the weight of the positive samples is increased. The formula is as follows:

$$FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$$

where $\alpha_t$ denotes the weight assigned to the class of the sample. Assuming a balanced sample size, the loss value of a ''simple sample'' under focal loss-based training is small, and its influence on the reduction of the loss during training is correspondingly small. This method can effectively train the pedestrian re-identification model by increasing the weight of positive samples and reducing the weight of negative samples, greatly reducing the experimental error. Thus, the problem of sample imbalance in pedestrian re-identification is minimized.
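The weighted focal loss described above can be sketched for a single sample as follows. This is an illustrative sketch, not the training code used in the experiments. The function names are hypothetical, and the class weight follows the common convention of `alpha` for positive samples and `1 - alpha` for negative samples, which is an assumption here.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=None):
    """Focal loss of one sample.

    p: predicted probability of the positive class, in (0, 1).
    y: true label, +1 or -1.
    gamma: focusing parameter; gamma = 0 recovers cross entropy.
    alpha: optional class weight (assumed convention: alpha for y = +1,
           1 - alpha for y = -1); None means no class weighting.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    p_t = p if y == 1 else 1.0 - p
    weight = 1.0
    if alpha is not None:
        weight = alpha if y == 1 else 1.0 - alpha
    return -weight * (1.0 - p_t) ** gamma * math.log(p_t)


def cross_entropy(p, y):
    """Plain cross entropy, i.e. focal loss with gamma = 0."""
    return focal_loss(p, y, gamma=0.0)


# Worked example from the text: a positive sample predicted at 0.99.
ratio = cross_entropy(0.99, 1) / focal_loss(0.99, 1, gamma=2.0)
print(round(ratio))  # attenuation factor for gamma = 2
```

With gamma set to 2, the ratio reproduces the attenuation of roughly 10,000 times discussed above, while a hard sample such as p = 0.1 for a positive label keeps most of its cross-entropy loss.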
Following the completion of the network model training, image traversal retrieval is executed, with the results presented in Figure 3. This process begins with the determination of a target and a detection group. The traversal method is then used to compare the images in the target and detection groups using the trained network model. Lastly, the results are divided into two categories, namely ''same'' and ''different''. Hence, a similar image of the same pedestrian from different cameras is obtained, and the entire pedestrian re-identification is thus completed.
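The traversal retrieval described above amounts to a loop over the detection group. The sketch below assumes a hypothetical scoring callable that returns the probability that two images show the same person. The trained network is replaced by a toy distance-based scorer on short feature vectors, and the threshold of 0.5 is likewise an assumption.

```python
import math


def traverse_gallery(target, gallery, same_probability, threshold=0.5):
    """Compare the target with every gallery image and split the results.

    same_probability: hypothetical callable (target, candidate) -> float in
    [0, 1], standing in for the trained network model.
    Returns a dict with (index, score) pairs under 'same' and 'different'.
    """
    same, different = [], []
    for index, candidate in enumerate(gallery):
        score = same_probability(target, candidate)
        if score >= threshold:
            same.append((index, score))
        else:
            different.append((index, score))
    same.sort(key=lambda pair: pair[1], reverse=True)  # best match first
    return {"same": same, "different": different}


def toy_score(u, v):
    """Toy stand-in for the network: closer vectors give a higher score."""
    return 1.0 / (1.0 + math.dist(u, v))


target = [0.0, 0.0]
gallery = [[0.0, 0.1], [3.0, 4.0], [0.5, 0.0]]
result = traverse_gallery(target, gallery, toy_score)
print([index for index, _ in result["same"]])       # [0, 2]
print([index for index, _ in result["different"]])  # [1]
```

In a real deployment the scorer would be the two-node output of the trained model, and the ''same'' list would hold the images shown to the operator.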

IV. SYSTEM DESIGN AND IMPLEMENTATION
The proposed pedestrian re-identification and monitoring system consists of five components: digital matrix, network switch, storage server, master computer and camera matrix. Figure 4 depicts the monitoring system framework. The high-speed network dome cameras and the network switch form the camera network. The master computer and storage server are connected by the network switch, for system control and dataset storage, respectively. The digital matrix is connected to the network switch in order to make the monitoring process visible in real time. The PC component is equipped with the ''Person re-identification-collection'' and ''Person re-identification-handle'' security system software, whereby searching commences when the target person and dataset are provided. Figure 5 presents the monitoring system identification process. The user is able to perform the video monitoring and data collection operations through the ''Person re-identification-collection'' step.
Note that in practice, the input of the system should be a complete surveillance video. An in-depth algorithm is involved here to transform the video collected by ''Person re-identification-collection'' into a database that can be processed by ''Person re-identification-handle''. This means that the system will automatically extract single-person images from the video and assign a unique label to each image, so that the position of each recognized person in the original video can be determined. Due to space limitations, this will be investigated further in future research. For our experiments, we treated the data segmentation as already completed, obtaining the dataset of single body images.
The left of Figure 6 is the pedestrian image captured by the left camera, while the right is the same pedestrian captured by the right camera at different angles in the same position. Note that the image patches in our pedestrian re-identification experiment and monitoring system were obtained from public datasets.
During the ''Person re-identification-handle'' step, the path of the target person is selected for retrieval using the ''Open the file'' and ''Open a folder'' buttons above. The imported dataset is then retrieved to start the pedestrian re-identification search. Once the search ends, the information is stored and the search result can be visually displayed. Figure 7 presents an example search result, with the left showing the target character, and the right the result images with high similarity to the target.

V. EXPERIMENTAL VERIFICATION

A. TRAINING NETWORK
In the experimental phase, network model training was first performed on the host computer. The main control computer had the following hardware configuration: one NVIDIA GTX 1080 Ti GPU with a single-precision floating-point performance of approximately 10.7 TFLOPS, 11 GB of memory and 3584 stream processor cores; an i7 920 CPU at 2.66 GHz with four cores and eight threads; and 18 GB of RAM. The network parameter settings are reported in Table 1. The initial learning rate during training was set to 0.01, a network model was generated and saved every 10,000 iterations, and the entire network was trained for 200,000 iterations. Gamma in Table 1 denotes the focal loss hyperparameter used to control the ''simple sample'' loss rate. Moreover, Alpha is the weight applied to balance the positive and negative samples in order to reduce the influence of sample size on the model.

B. DATASET AND EVALUATION FRAMEWORK
We employed the CUHK03-Labeled public dataset to train our proposed network. As shown in Figure 8, the CUHK03-Labeled dataset contains a total of 13,164 images of 1360 pedestrians. The images of the dataset came from 6 (3 pairs) different cameras on the campus of the Chinese University of Hong Kong, and each pedestrian appears in just one pair of cameras. Since the images were taken from real scenes, images of the same pedestrian under different cameras exhibited large variations in posture and mutual occlusion, amongst other issues.
As pedestrian re-identification is a similarity ordering problem, we adopted the Cumulative Matching Characteristic (CMC) curve to evaluate the experimental results [50]. The images corresponding to 1260 of the 1360 pedestrian IDs were randomly selected as the training set. The images corresponding to the remaining 100 pedestrian IDs were used as the test set, and the first camera image under each pair of cameras was used as the target image. In each sub-test, an image of the second camera was randomly selected for each pedestrian ID in the test set, thereby forming an image library in which the ID of each image is known. The cumulative matching characteristic of each retrieved image in the image library was calculated and averaged to form a cumulative matching characteristic curve for the sub-test. The random sampling was repeated 100 times. The average of the 100 cumulative matching characteristic curves was then taken, resulting in the cumulative matching characteristic curve of the entire testing process.
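The CMC evaluation described above can be sketched as follows. The sketch assumes that each query has already been turned into a ranked list of gallery IDs, most similar first. The function names and the toy rankings are hypothetical and do not come from the experiments.

```python
def cmc_curve(ranked_ids_per_query, true_ids, max_rank):
    """Fraction of queries whose true ID appears within the top r results.

    ranked_ids_per_query: one list of gallery IDs per query, best match first.
    true_ids: the correct ID for each query.
    Returns a list whose entry r - 1 is the Rank-r accuracy.
    """
    hits = [0] * max_rank
    for ranked, true_id in zip(ranked_ids_per_query, true_ids):
        top = ranked[:max_rank]
        if true_id in top:
            first = top.index(true_id)
            for r in range(first, max_rank):
                hits[r] += 1  # a hit at rank r also counts for all later ranks
    return [h / len(true_ids) for h in hits]


def average_curves(curves):
    """Average several CMC curves rank by rank (e.g. over repeated sub-tests)."""
    return [sum(values) / len(values) for values in zip(*curves)]


# Toy example with three queries and a three-image gallery.
rankings = [[1, 2, 3], [2, 3, 1], [1, 3, 2]]
truth = [1, 1, 3]
curve = cmc_curve(rankings, truth, max_rank=3)
print(curve)  # Rank-1, Rank-2 and Rank-3 accuracy
```

In the protocol above, `cmc_curve` would be evaluated once per randomly sampled image library and `average_curves` applied to the 100 resulting curves.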

C. EXPERIMENTAL RESULTS AND ANALYSIS
The proposed network model was tested using the single-shot setting with the CUHK03-Labeled dataset. The Rank-1, Rank-10 and Rank-20 accuracies were recorded and compared with current methods. The results are reported in Table 2, which compares the methods on the CUHK03 dataset. Our network reaches a Rank-1 accuracy of up to 76 percent. PersonNet, which is also an improvement on IDLA, replaces the shallow convolution layers with multiple 3 × 3 convolutional layers, which deepens the network and significantly improves its performance. However, data imbalance makes the model difficult to train, and deeper network layers aggravate this problem. At Rank-1, the score of our model is about 11% higher than that of PersonNet. This suggests that, compared with deepening the network, reducing the experimental error caused by negative samples is more effective for improving the accuracy of the model. Furthermore, SSM surpassed all of the other pedestrian re-identification methods in terms of judging the similarity of two pedestrian images by comparing their feature maps. The SSM method uses the CNN feature as the input of the algorithm, while the supervised embedded manifold is employed to reduce the dimensionality of the pedestrian feature vector. At the same time, it guarantees the constraints on intra-class and inter-class distance during dimension reduction, and thus belongs to subspace learning [27]. Different from SSM, the network model proposed in this paper uses a deep convolutional neural network to learn features and the corresponding similarity metrics simultaneously, introducing a cross-input neighborhood difference layer. The local correlation is calculated according to the convolutional feature maps of the image pair. The neighborhoods of the output feature maps are then summed by the patch summary feature, and the correlation of distant pixel points is subsequently calculated.
VOLUME 8, 2020
The traditional SoftMax loss calculation was replaced with the focal loss, allowing the data imbalance in pedestrian re-identification to be handled efficiently. Figure 9 compares the CMC scores of our model and IDLA. The focal loss reduces the loss contribution of well-classified samples, automatically down-weighting simple samples so that training focuses more quickly on difficult samples and the data imbalance is alleviated. This effectively improves the Rank-1 score of the model, meaning that the model can retrieve the target pedestrian from the pedestrian database more quickly and accurately. It also illustrates that the focal loss is more suitable than the traditional SoftMax loss for the pedestrian re-identification model.
Due to practical constraints such as funding and privacy, our experiments were conducted on the open source CUHK03 dataset. As the model has not been further verified using real data, its identification accuracy may decline sharply in a real environment, where pedestrians may be affected by factors such as illumination, angle of view and pose. These factors lead to larger differences in the visual appearance of the same pedestrian across different cameras. To solve this problem, images of existing datasets can be generated from different perspectives by combining human posture estimation with generative adversarial networks. This will standardize the posture of pedestrians in real data, thus helping the model to better acquire pedestrian characteristics and improve its recognition accuracy for real data.

VI. CONCLUSION
Despite the gradual improvement in performance stability for small pedestrian re-identification datasets, massive video datasets and manual analysis remain severe bottlenecks. The application of deep learning to pedestrian re-identification has become a current trend. In this paper, a convolutional neural network was employed to extract the feature vector of each pedestrian image. The Euclidean distance between the feature vectors was then calculated to measure the similarity between pedestrian images, followed by the determination of the loss function. The focal loss was used to address the sample imbalance, reduce the model image feature storage space, speed up the image feature matching and improve the recognition accuracy. In addition, we implemented an end-to-end monitoring system for pedestrian re-identification based on the deep convolutional neural network and focal loss. The real-time, simple and rapid characteristics of the proposed method are in line with practical application requirements, and the method will be extended to other tasks in the future.
WENZHENG QU was born in Shandong, China, in 1997. He received the bachelor's degree from the Xinhua College of Sun Yat-sen University, in 2019. He was with Tencent Youtu Lab, Shenzhen, as an AI Engineer. He has published five articles. His research interests include computer vision and deep learning.
ZHIMING XU was born in Guangdong, China, in 1996. He received the bachelor's degree from the Xinhua College of Sun Yat-sen University, in 2019. He was with the School of Information Science, Xinhua College of Sun Yat-sen University, as a Lab Manager. He holds over three patents and two inventions. His research interests include embedded systems and wireless sensor networks.
BEI LUO was born in Guangdong, China, in 1996. He received the bachelor's degree from the Xinhua College of Sun Yat-sen University, in 2019. His research interest includes embedded systems.
HAIHUA FENG was born in Guangdong, China, in 1996. He received the bachelor's degree from the Xinhua College of Sun Yat-sen University, in 2019. His research interest includes automatic systems.
ZHIPING WAN was born in Hubei, China, in 1980. He received the bachelor's degree from the Huanggang Normal College, in 2003, and the master's degree from the Guangdong University of Technology, in 2008. He was a Visiting Scholar with Sun Yat-sen University. He was a Teacher with the Huali College Guangdong University of Technology, from 2004 to 2008. Since 2010, he has been with the School of Information Science, Xinhua College, Sun Yat-sen University, and an Associate Professor, in 2018. He has published 60 articles. He holds over two patents and one invention. His research interests include wireless sensor networks, cognitive radio networks, and network security.