Occluded Person Re-Identification Method Based on Multiscale Features and Human Feature Reconstruction

The occluded person Re-ID task is proposed mainly because people are often occluded by various obstacles in the real world, which greatly affects the matching accuracy of models. In view of the challenges of occluded Re-ID, the main work of this paper is as follows: (i) To address the incompleteness of the human body under occlusion, an occluded Re-ID method based on multi-scale features is proposed. In this method, a partial human body locator is constructed with an object detection algorithm to automatically recognize and crop the partial human body. A horizontal pyramid pooling strategy is then designed to extract multi-scale features and enhance the robustness of the model under occlusion. Experiments show that this method achieves better matching accuracy in the occluded Re-ID task. (ii) To address the difficulty of aligning local features between different person images under occlusion, an occluded Re-ID method based on human feature reconstruction is proposed. This is an alignment-free method that uses sparse representation to reconstruct human body features. The difficult sample triplet loss function is improved by using the human feature reconstruction distance, which increases the weight of similar parts in the matching correlation. Experiments show that this method can effectively improve the occlusion resistance of the model.

model based on complete person images is not suitable for such scenes.

The associate editor coordinating the review of this manuscript and approving it for publication was Donato Impedovo.

The deficiencies of existing methods in solving the problem of occluded person Re-ID mainly come from the following aspects [9], [10]:

(1) The incompleteness of the human body under occlusion. Early representation learning and metric learning methods mainly focused on human body matching, but could not extract effective person features well under occlusion. Some person Re-ID methods rely on manual cropping to obtain partial human body images that are not covered. Manual cropping is time-consuming and laborious, and also introduces human bias into the cropping results.

(2) It is difficult to align local features between different person images under occlusion. Although the person Re-ID

Harmonious Attention CNN (HACNN) model for joint learning of soft pixel attention and hard regional attention along with simultaneous optimisation of feature representations, dedicated to optimise person Re-ID in uncontrolled (misaligned) images.

Local feature-based methods [10], [11], [12], [13], [14], [15], [16] usually generate local features of various body parts, either by using human pose estimation to produce the corresponding key points or by rough horizontal division, so as to make them more robust to real-world person Re-ID problems that contain a lot of noise. Miao et al. [10] proposed a pose-guided feature alignment (PGFA) method to match the local patches of probe and gallery images based on human semantic key points. Sun et al. [11] designed the PCB (Part-based Convolutional Baseline) method, which uniformly divides the human body into multiple horizontal parts to learn partial human body features [12]. Similarly, Zhao et al. [13] decomposed the human body into discriminative regions for matching, computed the representations of these regions, and aggregated the similarities computed between corresponding regions of a pair of query and gallery images into an overall matching score. On the other hand, Su et al. [14] proposed PDC (Pose-driven Deep Convolutional model), which exploits learned pose information and integrates the global and local features of the human body for matching. Zhao et al. [16] proposed a new type of convolutional neural network based on multi-stage feature decomposition and tree-structured feature fusion guided by human body regions, called Spindle Net. Although local feature matching is considered in these models, they ignore that the difficulty of occluded person Re-ID in real environments lies in incomplete human body information and spatial imbalance. Therefore, such methods often rely heavily on the local alignment of images or the strict alignment of human key points. When the images in the query set are occluded, methods based on local features perform poorly.

Partial person Re-ID matches a partial image against the holistic images in the gallery. This problem arises because, in complex real-world environments, the camera often cannot capture the complete human body. Therefore, some scholars manually crop the occluded human body image and retain the partial human body for model matching and training. Zheng

In the process of matching partial human body images with complete human body images, there will be some prob-

fully connected layer to train its loss function. The specific partitioning process is shown in Fig. 3:

In the horizontal pyramid pooling shown in Fig. 3, because of the particular shape of feature map blocks in the person Re-ID task, the size of the pooled feature map may be inconsistent with the expected size when the pooling kernel and stride are computed as in general spatial pyramid pooling [33]. The influence of the padding layer therefore needs to be considered. If the feature blocks are $n_h \times n_w$ ($1 \times 1$, $2 \times 1$, $3 \times 1$ in this paper), the kernel size, stride and padding of spatial pyramid pooling can be calculated as follows [33]:

where $K_h$, $S_h$, $p_h$ denote the height of the pooling kernel, the stride and the padding in the height direction, respectively; $K_w$, $S_w$, $p_w$ denote the width of the pooling kernel, the stride and the padding in the width direction, respectively. $K_h$ and $S_h$, $K_w$ and $S_w$ are calculated using the same formula, and $h_{in}$ and $w_{in}$ are the height and width of the feature map. For each number of horizontal pyramid bins, the required pooling kernel, stride and padding can be calculated.

As shown in Fig. 4, the images input to the model for matching training are actually a partial human body image and a complete human body image. It can be seen that in the actual model training process, the partial human body image is essentially only locally similar to the complete image, but due to the multi-scale problem, it is difficult to carry out the

where X and Y are the vectorized tensors of size $w_x \times h_x \times d$ and $w_y \times h_y \times d$, respectively; $w_*$ and $h_*$ are the width and height of the feature map, and $d$ is the number of channels.
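The kernel, stride and padding computation discussed above can be sketched as follows. The formula itself is not reproduced in this text, so the sketch assumes the convention commonly used in spatial pyramid pooling implementations (kernel and stride both equal to the ceiling of the input size divided by the number of bins, with symmetric padding to cover the remainder). The function names `pyramid_pool_params` and `pooled_size` and the 24 × 8 feature map are illustrative assumptions, not taken from the paper.

```python
import math


def pyramid_pool_params(size_in, n_bins):
    """Return (kernel, stride, padding) along one axis for n_bins output bins.

    Assumed convention (not confirmed by the source): kernel and stride share
    the same formula, and padding makes up for any remainder.
    """
    kernel = math.ceil(size_in / n_bins)
    stride = kernel
    padding = (kernel * n_bins - size_in + 1) // 2
    return kernel, stride, padding


def pooled_size(size_in, kernel, stride, padding):
    """Output length of a pooling layer along one axis (floor mode)."""
    return (size_in + 2 * padding - kernel) // stride + 1


if __name__ == "__main__":
    h_in, w_in = 24, 8  # hypothetical feature map height and width
    for n_h, n_w in [(1, 1), (2, 1), (3, 1)]:  # block layouts named in the text
        k_h, s_h, p_h = pyramid_pool_params(h_in, n_h)
        k_w, s_w, p_w = pyramid_pool_params(w_in, n_w)
        out_h = pooled_size(h_in, k_h, s_h, p_h)
        out_w = pooled_size(w_in, k_w, s_w, p_w)
        print((n_h, n_w), "kernel", (k_h, k_w), "stride", (s_h, s_w),
              "padding", (p_h, p_w), "output", (out_h, out_w))
```

When the feature map height is not divisible by the number of bins (for example a height of 7 split into 3 bins), the padding term becomes non-zero, which is the case the paragraph above is concerned with.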

For the sparse representation of images, in order to better describe local features and reduce the amount of subsequent computation, the feature maps X and Y obtained by convolution first need dimension reduction to facilitate the subsequent linear representation. To this end, the feature maps X and Y are divided into N and M blocks respectively, which are expressed as follows:

where $x_n$ and $y_n$ represent feature blocks of size $1 \times 1 \times d$.

Since the partial human body image I and the complete human body image J are similar in some areas, the feature matrix X of the former can theoretically be approximately linearly represented by the feature matrix Y of the latter; that is, there exists a linear representation coefficient matrix W of Y with respect to X, so that:

According to the above deduction, the sparse representation model of X with respect to Y can be obtained as follows:

where formula (4) is equivalent to formula (5), $\beta$ is the regularization parameter, $\|\cdot\|_1$ denotes the 1-norm of a matrix, and $\|\cdot\|_2$ denotes the 2-norm of a matrix.

Solving the equivalent formulas (4) and (5) by the least squares method, the linear representation coefficient matrix W of Y with respect to X is as follows:

Therefore, the feature matrix $\hat{X}$ of the partial human body image I reconstructed from the feature matrix Y of the complete human body image J can be expressed as follows:

According to the reconstructed feature matrix $\hat{X}$, the feature matrix X of the partial human body image I and the feature matrix Y of the complete human body image J, the human feature reconstruction distance between the partial human body image I and the complete human body image J can be calculated by the following formula:
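The reconstruction steps above (solve for W, rebuild the partial-image features from Y, measure the residual) can be illustrated with a small NumPy sketch. The paper's own formulas are not reproduced in this text, so the sketch swaps in a ridge-regularized least squares solution, $W = (Y^\top Y + \beta I)^{-1} Y^\top X$, as a stand-in for the sparse model, and uses the mean squared residual per block as the distance. The column layout (one $d$-dimensional block per column), the function name `reconstruction_distance` and the default `beta` are all assumptions made for illustration.

```python
import numpy as np


def reconstruction_distance(X, Y, beta=0.01):
    """Distance between partial-image features X and full-image features Y.

    X: array of shape (d, N), one column per 1x1xd block of the partial image.
    Y: array of shape (d, M), one column per 1x1xd block of the complete image.
    Assumption: ridge-regularized least squares replaces the sparse solver.
    """
    m = Y.shape[1]
    gram = Y.T @ Y + beta * np.eye(m)      # (M, M), regularized Gram matrix
    W = np.linalg.solve(gram, Y.T @ X)     # (M, N), coefficients of Y for X
    X_hat = Y @ W                          # reconstruction of X from Y
    residual = X - X_hat
    return float(np.mean(np.sum(residual ** 2, axis=0)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Y = rng.normal(size=(16, 6))           # hypothetical complete-image blocks
    X_same = Y[:, :3]                      # blocks that also appear in Y
    X_other = rng.normal(size=(16, 3))     # unrelated blocks
    print("same person  :", reconstruction_distance(X_same, Y, beta=1e-3))
    print("other person :", reconstruction_distance(X_other, Y, beta=1e-3))
```

Because the distance depends only on how well blocks of X can be rebuilt from blocks of Y, no spatial alignment between the two images is required. Note that it is not symmetric in X and Y.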

2) IMPROVED DIFFICULT SAMPLE TRIPLET LOSS FUNCTION
The difficult sample triplet loss function selects, for each image in the batch, the most difficult positive sample and the most difficult negative sample, and forms a triplet with them. The triplet loss computes the similarity of two images in the embedded feature space, that is, their distance in the feature space.

VOLUME 10, 2022

In this paper, sparse representation method is used to

partial human body. In the second step, because ResNet-50 is the backbone network most commonly used by person Re-ID algorithms, we also use ResNet-50 as the backbone for a fair comparison. ResNet-50 extracts the human body image features, and the down-sampling stride in conv5_1 is set to 1 to retain more local features and details of the human body. The third step is the horizontal pyramid pooling of the human body feature maps, which divides the human body features into different horizontal feature blocks and integrates multi-scale human body features. In the fourth step, the improved difficult sample triplet loss function and the cross-entropy loss function are used to train the extracted human body features respectively. In this step, the similarity measure used for the difficult sample triplet loss function is the human feature reconstruction distance.
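The difficult sample triplet loss described above (often called batch-hard triplet loss elsewhere) can be sketched in plain Python as follows. The sketch takes a precomputed pairwise distance matrix, so any distance can be plugged in, including a reconstruction-style distance. The function name `hard_triplet_loss`, the list-of-lists input format and the toy distances are illustrative assumptions; only the margin of 1.2 comes from the experimental settings reported in this paper.

```python
def hard_triplet_loss(dist, labels, margin=1.2):
    """Mean difficult-sample triplet loss over a batch.

    dist[a][j] is the distance from anchor a to sample j (it need not be
    symmetric); labels[j] is the identity of sample j.
    """
    losses = []
    for a, row in enumerate(dist):
        positives = [row[j] for j in range(len(row))
                     if j != a and labels[j] == labels[a]]
        negatives = [row[j] for j in range(len(row))
                     if labels[j] != labels[a]]
        if not positives or not negatives:
            continue  # anchor has no valid triplet in this batch
        hardest_positive = max(positives)   # farthest same-identity sample
        hardest_negative = min(negatives)   # closest different-identity sample
        losses.append(max(0.0, hardest_positive - hardest_negative + margin))
    return sum(losses) / len(losses) if losses else 0.0


if __name__ == "__main__":
    labels = [0, 0, 1, 1]
    dist = [
        [0.0, 0.2, 1.0, 2.0],
        [0.2, 0.0, 1.5, 0.9],
        [1.0, 1.5, 0.0, 0.3],
        [2.0, 0.9, 0.3, 0.0],
    ]
    print(hard_triplet_loss(dist, labels, margin=1.2))
```

Keeping the distance computation outside the loss makes it straightforward to compare a Euclidean distance with a reconstruction-based one under the same mining rule.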

In order to verify the effectiveness of the proposed algorithm on occluded person Re-ID, the datasets adopted in this paper are Occluded-DukeMTMC, Partial-REID [17] and Partial-iLIDS [34].

The batch size adopted in this experiment was set to 32, the number of epochs to 120, the optimizer to ADAM, the weight decay to 5e-4, the momentum to 0.9, the loss functions to cross-entropy loss and difficult sample triplet loss, the margin to 1.2, and learning

Table 2 and Table 3.

The training sets and test sets of the comparison methods in

The last column of Table 2 shows the training time. As can be seen from the results in Table 2, the accuracy of the proposed method based on multi-scale features and of the method based on human feature reconstruction is significantly better than that of most methods. At the same time, compared with recent algorithms of similar accuracy, the training time is clearly lower. The algorithm model has stronger robustness and generalization ability in the occluded person Re-ID task. Compared with PCB, the baseline algorithm in the person Re-ID task, the proposed method achieves the best CMC and mAP performance. The comparison experiment shows that the proposed human feature reconstruction distance is very effective in improving the difficult sample triplet loss function built on multi-scale features. The new model further improves the matching accuracy in the person Re-ID task and gives the model higher anti-occlusion performance.

TABLE 3. Performance comparison of different methods on Partial-REID and Partial-iLIDS datasets.

As can be seen from the results in Table 3, the method proposed in this paper also performs well on partial person Re-ID datasets. In comparison with other partial person Re-ID methods, the accuracy of the proposed method on the Partial-REID and Partial-iLIDS datasets exceeds that

exceeds that of the original method on the above two datasets, and the model accuracy is improved again. This verifies that the human feature reconstruction distance is very effective in improving the occlusion resistance of the model.

The quality of the detection boxes needs to be improved when the object detection algorithm is used to automatically locate the partial human body. Although the proposed method performs better under occlusion than most existing methods, its matching accuracy still has a lot of room for improvement. The bottleneck is the following: this paper improves on partial person Re-ID algorithms based on manual cropping by constructing a partial human body locator that makes the model automatically focus on the visible human body, but the locator itself also affects the cropping results. Under severe occlusion it is prone to false detections and missed detections, which hinders further improvement of the model's accuracy. Therefore, how to design an algorithm that makes the model pay more attention to the visible parts of the human body and eliminates the interference of the occluded area is a major open problem.

Finally, this paper presents the change of the loss function of the model during training, as shown in Fig. 8. It can be seen that the proposed model converges well during training.

This paper studies several problems facing the occluded person Re-ID task. These include the incompleteness of the human body under occlusion, which prevents effective human body features from being extracted well, and the difficulty of aligning local features between different human body images under occlusion, where the effect of the model depends heavily on the degree of alignment. This paper proposes an occluded person Re-ID method based on multi-scale features, and improves the difficult sample triplet loss function by using the human feature reconstruction distance, which increases the weight of similar parts in the matching correlation. Experiments show that the proposed method based on multi-scale features is significantly more accurate than most of the compared methods. The algorithm model has stronger robustness and generalization ability in the occluded person Re-ID task. Adding human feature reconstruction to the multi-scale feature method improves the accuracy of the model again, which further verifies that the human feature reconstruction distance is very effective in improving the anti-occlusion ability of the model.

The authors would like to express their sincere thanks to the editors and the anonymous reviewers for their valuable comments and contributions.