Anonymization of Human Gait in Video Based on Silhouette Deformation and Texture Transfer

These days, a lot of videos are uploaded onto web-based video sharing services such as YouTube. These videos can be freely accessed from all over the world. On the other hand, they often contain the appearance of walking private people, who could be identified by the silhouette-based gait recognition techniques rapidly developed in recent years. This causes a serious privacy issue. To address it, this paper proposes a method for anonymizing the appearance of walking people, namely human gait, in video. In the proposed method, we first crop human regions from all frames in an input video and binarize them to get their silhouettes. Next, we slightly deform the silhouettes in terms of static body shape and dynamic walking rhythm so that the person in the input video cannot be correctly identified by gait recognition techniques. After that, the textures of the original human regions are transferred onto the deformed silhouettes. We achieve this by a displacement field-based approach, which is training-free and thus robust to a wide variety of clothes. Finally, the anonymized human regions with the transferred textures are filled back into the input video. In our experiments, we successfully degraded the accuracy of CNN-based gait recognition systems from 100% to as low as 1.57% without causing serious distortion in the appearance of the human regions, which demonstrates the effectiveness of the proposed method.

Kazuaki Nakamura is with the Faculty of Engineering, Tokyo University of Science, Katsushika, Tokyo 125-8585, Japan.
Naoko Nitta is with the Department of Informatics and Mediology, School of Human Environmental Sciences, Mukogawa Women's University, Nishinomiya, Hyogo 663-8558, Japan.
Noboru Babaguchi is with the Institute for Datability Science, Osaka University, Suita 565-0871, Japan, and also with the Department of Management and Information Sciences, Fukui University of Technology, Fukui 910-8505, Japan.
Digital Object Identifier 10.1109/TIFS.2022.3206422

I. INTRODUCTION

These days, people can easily publish their own videos on web-based video sharing services (VSS) such as YouTube as a publisher, as well as freely access other people's contents as a viewer from all over the world. As a result, a massive number of web videos are stored and shared on the Internet, and their amount is still growing. Importantly, these web videos often contain the appearance of private people.

At the same time, techniques for identifying humans by their walking motion, so-called gait recognition, have been actively studied and rapidly developed in recent years [1], [2], [3], [4]. The current mainstream of gait recognition techniques is silhouette-based [2], [3], [4]: a human region cropped from an input video is first binarized and then fed into a recognition system that identifies who s/he is. State-of-the-art methods can achieve good performance even when the resolution of the input human region is quite low.

If gait recognition techniques are maliciously applied to web videos containing private people, it can be disclosed who they are. This is a serious privacy issue. Moreover, people's locations and behaviors can also be revealed by gait recognition, since web videos often include information such as the shooting time and shooting location. This means that a human's gait, i.e., her/his walking motion, has become privacy-sensitive information similar to other biometric features such as the face, voice, and fingerprint. To solve this privacy issue, it is desirable that human regions in the videos uploaded to a VSS be anonymized by the provider before being published. From this background, in this paper we propose a method for anonymizing human gait in a given video.

There are two possible approaches to human gait anonymization: visual abstraction and replacement. In the former, we anonymize the gait of the private people contained in a web video by pixelizing or blurring the corresponding human regions [5]. However, this approach makes the human regions visually unnatural and degrades the quality of the web video. This would not be preferred by the VSS provider because viewers generally favor high-quality videos over low-quality ones that may visually frustrate them. Hence, we employ the latter, i.e., the replacement-based approach for gait anonymization, whose specific procedure is as follows (see also Fig. 1).

(1) Detect and crop the human regions from each frame of an input video.

(2) Binarize the cropped regions to obtain their silhouettes.
(3) Slightly deform the silhouettes so that gait recognition systems cannot successfully identify who they are.

(4) Transfer the textures of the original human regions onto the deformed silhouettes, by which an anonymized version of the human regions is obtained.
(5) Fill the anonymized human regions with the transferred textures back into the input video.

The remainder of this paper is organized as follows. In Section II, we review previous work in the fields of visual content abstraction and gait recognition. Next, in Section III, we describe the details of the method for gait silhouette deformation, whose key component is the feature perturbation. After that, in Section IV, we describe the method of human region texture transfer in detail, which is based on displacement field estimation as mentioned above. The performance of the proposed method is then experimentally evaluated in Section V. Finally, we conclude this paper in Section VI.
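To make the five-step procedure above concrete, the following minimal Python sketch composes the pipeline; every callable here is a hypothetical placeholder for the concrete components developed in Sections III and IV.

    def anonymize_gait_video(frames, detect, binarize, deform, transfer, paste_back):
        # Replacement-based gait anonymization pipeline (steps 1-5).
        # All processing components are injected as callables.
        regions = [detect(f) for f in frames]            # (1) crop human regions
        sils = [binarize(r) for r in regions]            # (2) binarize to silhouettes
        deformed = deform(sils)                          # (3) silhouette deformation
        colored = [transfer(I, S, T)                     # (4) texture transfer
                   for I, S, T in zip(regions, sils, deformed)]
        return [paste_back(f, c)                         # (5) fill back into the frame
                for f, c in zip(frames, colored)]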

Note that this paper is an extended version of our previous conference paper [8], in which we did not focus on human region texture transfer. The differences between the conference paper and this paper are summarized as follows. First, in this paper we propose a method of human region texture transfer in addition to the method of gait silhouette deformation. The final output of our current method is a colored video of anonymized human regions, whereas the output of the previous method was just a video of anonymized silhouettes. Details of the human region texture transfer method are described in Section IV. Second, we further improve the method for perturbing shape and posture features, as described in Sections III-D and III-E. Third, we experimentally evaluate the performance of the proposed method from a broader point of view, whose results are reported in Section V.

II. RELATED WORK
Since human gait anonymization is an emerging topic even in the field of multimedia, there are only a few studies focusing on this novel task. In contrast, methods for face anonymization, which are deeply related to gait anonymization, have been actively studied in the past decade. Hence, we first review previous work on face anonymization in Subsection II-A and then move our focus to gait anonymization in Subsection II-B. After that, in Subsection II-C, we review some existing gait recognition techniques, another topic deeply related to this study.

A. Face Anonymization
Visual contents such as images and videos often contain privacy-sensitive information, a typical example of which is the human face. Therefore, methods for anonymizing face regions in visual content have been widely studied.

In classic methods, simple image processing techniques such as blocking-out, pixelization, and blurring are applied to the face regions to anonymize them [9], [10]. This is called visual abstraction, and it works well in terms of human perception; however, it is not always effective for preventing automatic face recognition systems [11], [12], [13]. Once the category and the strength of the visual abstraction filter used for anonymization are revealed, its impact can be easily canceled [12]. Moreover, visual abstraction techniques often yield an unnatural appearance.

To cope with the above drawbacks, another kind of face anonymization technique has been proposed, whose key concept is replacement. Specifically, face regions in a visual content are anonymized by being replaced with other face images. For each face that should be anonymized, Bitouk et al. proposed to find its nearest neighbors in a pre-constructed face library and use them for the replacement process [14].
Here, we point out that gait anonymization is somewhat related to adversarial attacks [28], which aim to fool a machine learning-based pattern recognition system, without affecting human perception, by adding a small perturbation to an input pattern. In face anonymization, anonymized face images should be incorrectly recognized by both humans and automated systems. In contrast, in gait anonymization, it is enough for anonymized gait videos to be incorrectly recognized only by systems, since most humans inherently lack the ability of gait recognition. Therefore, gait anonymization might seem achievable with adversarial examples. There have been several studies of adversarial attacks against gait recognition [29], [30]. They add a small noise to a sequence of gait silhouette images; in other words, these adversarial attack methods aim to fool gait recognition systems at the silhouette level. Since silhouette images consist of only two colors, black and white, the noise generated by adversarial attacks becomes salt-and-pepper-like, and such a noised silhouette is difficult to colorize naturally. This means that adversarial attacks are actually not suitable for gait anonymization. In this paper, we aim to fool gait recognition systems at the colored video level, which is a major difference between our work and adversarial attacks. For this purpose, directly adding a small noise to the original colored video is not a good solution, because such a noise is mostly canceled by the silhouette extraction performed as a preprocess of gait recognition.

C. Gait Recognition
Gait recognition is a relatively new biometric identification technique. Although its accuracy is currently not comparable to that of well-studied face recognition, it has a unique advantage: gait recognition can be successfully performed even on low-resolution videos captured from far away, in which human faces cannot be clearly observed. Because of this advantage, gait recognition has already been utilized for crime investigation in some countries [3], [4].

There are two categories of gait recognition methods: model-based and silhouette-based. The former uses three-dimensional human body models, such as surface models and skeleton models, to estimate the postures of the humans in a given video and identifies them based on the sequence of estimated postures [1]. The performance of the model-based approach heavily depends on the video resolution because it is difficult to accurately estimate human postures from low-resolution video; this negates the advantage mentioned above. On the other hand, the latter, i.e., the silhouette-based approach, is suitable for low-resolution video.

III. GAIT SILHOUETTE DEFORMATION

Given an input gait silhouette sequence $S = \{S_1, \cdots, S_M\}$, our goal is to deform it into a sequence $T = \{T_1, \cdots, T_M\}$ so that gait recognition techniques cannot correctly identify the person in $T$. To this end, our method performs the deformation frame-by-frame; that is, we first deform each frame $S_i$ into $T_i$ separately and then concatenate the deformed frames into a single video. Therefore, we focus on the frame-wise deformation process in the remainder of this section.

As mentioned in Section I, a gait silhouette is determined by two factors: shape (the shape of the body and that of the clothes) and posture. The former represents the static aspect of the input gait, while the latter represents the dynamic aspect. Basically, the same person's silhouettes always have the same shape within a single video regardless of posture, i.e., regardless of the frame ID. On the other hand, the posture differs from frame to frame and can be expressed by a phase $\theta \in [0, 2\pi)$ because of the periodicity of walking motion. Based on the above discussion, we can rewrite the frame $S_i$ as $S_i = \mathrm{Sil}_a(\theta_i)$, which denotes the gait silhouette of a person $a$ with phase $\theta_i$. We can anonymize $\mathrm{Sil}_a(\theta_i)$ from both the static and the dynamic aspects by changing it to $\mathrm{Sil}_{a'}(\theta'_i)$, where $a'$ is a fictional person different from $a$ and $\theta'_i$ is a value different from $\theta_i$.

The concrete procedure of the proposed deformation method is as follows.
(i) A shape code $z_a$, i.e., a feature vector representing the shape of $\mathrm{Sil}_a(\theta_i)$, is extracted by a certain encoder $E$. At the same time, the value of the phase $\theta_i$ is estimated from $\mathrm{Sil}_a(\theta_i)$.

(ii) Perturbations $\Delta z$ and $\Delta\theta_i$ are added to the shape code and the phase value, respectively, to get $z_{a'} = z_a + \Delta z$ and $\theta'_i = \theta_i + \Delta\theta_i$.
(iii) A silhouette image $\mathrm{Sil}_{a'}(\theta'_i)$ is generated from the perturbed shape code $z_{a'}$ and phase value $\theta'_i$ by a certain decoder $D$.
The silhouette image obtained in the third step is finally used as the anonymized version of the frame $S_i$; that is, $T_i = \mathrm{Sil}_{a'}(\theta'_i)$. Note that the perturbation $\Delta z$ for the shape code is kept the same for all $i \in \{1, \ldots, M\}$ to avoid unnatural shape changes within $T$.
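This encode-perturb-decode loop can be sketched as follows; a minimal sketch under assumed interfaces, where encoder, decoder, and estimate_phase are placeholders for the components described in the following subsections.

    def deform_sequence(sils, encoder, decoder, estimate_phase,
                        delta_z, delta_theta):
        # Encode-perturb-decode deformation of one gait cycle.
        #   sils        : list of M binary silhouette frames
        #   delta_z     : shape perturbation shared by all frames (step ii)
        #   delta_theta : per-frame phase perturbations, length M
        deformed = []
        for i, sil in enumerate(sils):
            z = encoder(sil)                            # step (i): shape code
            theta = estimate_phase(sils, i)             # step (i): phase value
            deformed.append(decoder(z + delta_z,        # steps (ii) + (iii)
                                    theta + delta_theta[i]))
        return deformed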

Hereafter, we describe how to estimate the phase value in Subsection III-B and how to train the encoder $E$ and the decoder $D$ in Subsection III-C. Then, the strategies for determining the perturbations $\Delta z$ and $\Delta\theta_i$ are described in detail in Subsections III-D and III-E, respectively. For convenience of explanation, we assume that every gait silhouette sequence used in the subsequent subsections includes just one cycle of walking motion, which can easily be extracted by an autocorrelation analysis as a pre-process. In practice, there are of course cases where an input sequence is longer than one cycle. In these cases, we extract the first cycle from the input sequence and apply the proposed method to it. Then, for every remaining frame with a phase value $\theta_i \geq 2\pi$, we anonymize the $i$-th frame in the same way as the $j$-th frame that is its nearest neighbor in terms of phase; that is, the perturbations $\Delta z$ and $\Delta\theta_j$ used to anonymize the $j$-th frame are also used for the $i$-th frame, where $0 \leq \theta_j \approx \theta_i - 2\pi n < 2\pi$ for some integer $n$. A gait sequence whose length is shorter than one cycle does not have to be anonymized, because gait recognition systems cannot reliably identify a person from less than one cycle of walking motion.

B. Phase Estimation

To estimate the phase values, we compare the input sequence $S$ with a reference gait silhouette sequence $R$ by dynamic programming (DP) matching. Specifically, we first make a periodically shifted version of $S$ as

$$S^l = (S_{l+1}, \cdots, S_M, S_1, \cdots, S_l),$$

and then apply DP matching to $R$ and $S^l$. Let $C(R, S^l)$ be the resulting matching cost. We perform the above process for all $l \in \{0, 1, \cdots, M-1\}$ and find the best $\hat{l}$ that minimizes the matching cost (see Appendix A for details).

For any phase value $\theta$, we convert it to a two-dimensional vector $p_\theta = (\cos\theta, \sin\theta)^\top$ before feeding it into the decoder $D$. This allows us to treat the phase values $0$ and $2\pi$ equally despite their gap on the real number line.
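A small sketch of the cycle alignment and the phase-to-vector conversion; as a stated simplification, the DP matching cost is replaced here by a plain sum of frame-wise differences, and $R$ and $S$ are assumed to have the same length.

    import numpy as np

    def phase_vector(theta):
        # Map a phase in [0, 2*pi) onto the unit circle so that
        # theta = 0 and theta = 2*pi are represented identically.
        return np.array([np.cos(theta), np.sin(theta)])

    def best_cyclic_shift(R, S):
        # R, S: arrays of shape (M, H, W) holding binary silhouettes.
        M = len(S)
        costs = []
        for l in range(M):
            S_l = np.roll(S, -l, axis=0)      # periodically shifted version of S
            costs.append(np.abs(R.astype(int) - S_l.astype(int)).sum())
        return int(np.argmin(costs))          # the best shift l-hat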

C. Training of the Encoder and Decoder
In the training process for $E$ and $D$, we do not add any perturbation to the shape code or the phase value. In this case, $E$ and $D$ behave like an autoencoder, and we can therefore employ the mean squared error (MSE) between the input and output silhouette images as a loss function. Unlike autoencoders, however, $E$ should output the same shape code $z_a$ from $\mathrm{Sil}_a(\theta_i)$ for all $\theta_i$. To ensure this property under the strategy of using the MSE loss, we have to define the ground truth of $z_a$ in some way. To this end, we employ the following approach. First, using a large number of gait silhouette sequences as a training dataset, we train a variational autoencoder (VAE). Let $E_{\mathrm{vae}}$ be the encoder part of the trained VAE. Next, each silhouette image $\mathrm{Sil}_a(\theta_i)$ in the training dataset is compressed into a feature vector $\xi_a(\theta_i) = E_{\mathrm{vae}}[\mathrm{Sil}_a(\theta_i)]$, whose average with respect to $\theta_i$ is then calculated and used as the ground truth of $z_a$, i.e.,

$$z_a^{\mathrm{gt}} = \frac{1}{M_a} \sum_{i=1}^{M_a} \xi_a(\theta_i), \tag{3}$$

where $M_a$ is the length of the gait silhouette sequence of the person $a$ in the training dataset.
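A sketch of this ground-truth construction, where E_vae is assumed to be a callable mapping one silhouette frame to a 1-D feature vector:

    import numpy as np

    def ground_truth_shape_code(sils_of_person, E_vae):
        # Average the VAE features of one person's silhouettes (Formula (3)).
        feats = np.stack([E_vae(s) for s in sils_of_person])   # (M_a, dim)
        return feats.mean(axis=0)                              # z_a^gt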

Using the above $z_a^{\mathrm{gt}}$, we simultaneously train $E$ and $D$ by minimizing

$$L = \big\| \mathrm{Sil}_a(\theta_i) - D\!\left(E[\mathrm{Sil}_a(\theta_i)],\, \theta_i\right) \big\|^2 + \lambda \big\| E[\mathrm{Sil}_a(\theta_i)] - z_a^{\mathrm{gt}} \big\|^2, \tag{4}$$

where $\lambda$ is a weighting constant to control the balance between the first and the second terms. Note that the dataset for training $E$ and $D$ is exactly the same as the one used for training the VAE. See also Appendix B for more details on the training process of $E$ and $D$.
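A PyTorch-style sketch of one optimization step with this simplified loss (the full loss used in practice is given in Appendix B; E, D, and the optimizer are assumed to be set up elsewhere):

    import torch

    def train_step(E, D, sil, theta, z_gt, lam, optimizer):
        # One step of the joint training of E and D with the loss of Formula (4).
        z = E(sil)                                           # predicted shape code
        p = torch.stack([torch.cos(theta), torch.sin(theta)], dim=-1)
        recon = D(z, p)                                      # reconstructed silhouette
        loss = ((recon - sil) ** 2).mean() + lam * ((z - z_gt) ** 2).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()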

D. Shape Perturbation
Now we proceed to how to determine the shape perturbation $\Delta z$. The norm of $\Delta z$ is an important factor. If $\|\Delta z\|$ is too large, the shape of an output silhouette image will differ greatly from that of the input silhouette and might no longer look human. On the other hand, if $\|\Delta z\|$ is extremely close to $0$, the anonymization capability is significantly weakened. In our method, the perturbed shape code $z_{a'}$ is expressed as a linear combination of the $K$ nearest neighbors $\{z_{a,k}\}_{1 \leq k \leq K}$ of $z_a$ in the training dataset; arranging these neighbors as the columns of a matrix $Z$, we write $z_{a'} = Zc$ with a coefficient vector $c$.

To make $\|\Delta z\|$ large enough to obtain sufficient anonymization capability, we could naively try to find

$$\hat{c} = \operatorname*{argmax}_c \, \| Z c - z_a \|^2 \tag{6}$$

and calculate $z_{a'}$ as $z_{a'} = Z\hat{c}$. This allows us to obtain a $z_{a'}$ that is definitely not close to $z_a$. However, this solution has the problem mentioned above; namely, a seriously distorted appearance would be caused in the output gait silhouette because $\|\Delta z\|$ becomes too large. To prevent this, we impose the following two constraints on $c$. One is

$$\mathbf{1}^\top c = 1, \tag{7}$$

where $\mathbf{1} = (1 \cdots 1)^\top$ is the $K$-dimensional vector all of whose elements are $1$, and the other is a bound on the norm of $c$:

$$\|c\|^2 \leq 1. \tag{8}$$

These constraints are expected to keep $z_{a'}$ close to at least one of $\{z_{a,k}\}_{1 \leq k \leq K}$, achieving a natural appearance in the output gait silhouette. For convenience of computation, we do not directly consider the second constraint (8). Instead, we introduce it as a regularization term in the objective function (6); that is, we actually find

$$\hat{c} = \operatorname*{argmax}_c \left\{ \eta \, \| Z c - z_a \|^2 - \|c\|^2 \right\} \tag{9}$$

under the constraint (7), where the second term corresponds to the constraint (8) and $\eta$ is a weighting constant. Since the above optimization problem (9) is quadratic with respect to $c$, its solution is easily obtained in closed form as

$$\hat{c} = G^{-1}\!\left( \frac{\mu}{2}\mathbf{1} - \eta Z^\top z_a \right), \qquad G = I - \eta Z^\top Z, \tag{10}$$

where $I$ is the $K \times K$ identity matrix and $\mu$ is the Lagrange multiplier determined so that constraint (7) is satisfied. Note that the quadratic term of the objective function (9) must be negative definite for (10) to be the maximizer; for this property, $\eta$ should be smaller than $1/\tau$, where $\tau$ is the largest eigenvalue of $Z^\top Z$. Hence, we set $\eta = \omega/\tau$ in our experiments by introducing a parameter $\omega$ that satisfies $0 \leq \omega < 1$; the setting $\omega = 1$ would make $G$ singular, in which case $G^{-1}$ cannot be computed.
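A numerical sketch of this closed-form solution as reconstructed above; the Lagrange multiplier is eliminated by enforcing the sum-to-one constraint directly, and the exact bookkeeping of the paper's Formula (10) may differ.

    import numpy as np

    def shape_perturbation_coeffs(Z, z_a, omega=0.99):
        # Z   : (dim, K) matrix whose columns are the K nearest-neighbor codes
        # z_a : (dim,) original shape code
        K = Z.shape[1]
        ZtZ = Z.T @ Z
        tau = np.linalg.eigvalsh(ZtZ).max()      # largest eigenvalue of Z^T Z
        eta = omega / tau                        # omega < 1 keeps G invertible
        G = np.eye(K) - eta * ZtZ
        ones = np.ones(K)
        u = np.linalg.solve(G, ones)
        v = np.linalg.solve(G, eta * (Z.T @ z_a))
        half_mu = (1.0 + ones @ v) / (ones @ u)  # enforces 1^T c = 1
        c = half_mu * u - v
        return c                                 # perturbed code: z' = Z @ c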

E. Phase Perturbation
The dynamic aspect of human gait is represented not by a single phase value but by a sequence of them. Hence, the phase perturbations $\Delta\theta_i$ should not be determined separately frame-by-frame. Instead, our proposed method determines the $\Delta\theta_i$ for all $i$ ($1 \leq i \leq M$) simultaneously.

In general, a person's walking direction is consistent and her/his posture changes continuously in a single video. Hence, a sequence of phase values $(\theta_1, \cdots, \theta_M)$, which represents one cycle of walking motion, satisfies

$$\sum_{i=1}^{M} (\theta_i - \theta_{i-1}) = 2\pi,$$

where we introduce $\theta_0 = \theta_M - 2\pi$ for the sake of convenience. Since $\theta_i - \theta_{i-1} \geq 0$ is also satisfied for all $i$, we can consider that $\Phi = (\phi_1, \cdots, \phi_M)$ with $\phi_i = (\theta_i - \theta_{i-1})/2\pi$ is virtually a probability distribution (see Fig. 4). Similarly, for the phase value sequence after perturbation $(\theta'_1, \cdots, \theta'_M)$, we can consider that $\Phi' = (\phi'_1, \cdots, \phi'_M)$ is a probability distribution, where $\phi'_i = (\theta'_i - \theta'_{i-1})/2\pi$. To anonymize the dynamic feature of the input sequence, we should make $\Phi'$ as dissimilar to $\Phi$ as possible.

To measure the dissimilarity between the two probability distributions $\Phi$ and $\Phi'$, we employ the Jensen-Shannon (JS) divergence, which is calculated as

$$\mathrm{JS}(\Phi \,\|\, \Phi') = \frac{1}{2}\,\mathrm{KL}\!\left(\Phi \,\Big\|\, \frac{\Phi + \Phi'}{2}\right) + \frac{1}{2}\,\mathrm{KL}\!\left(\Phi' \,\Big\|\, \frac{\Phi + \Phi'}{2}\right),$$

where $\mathrm{KL}(\cdot\,\|\,\cdot)$ denotes the Kullback-Leibler divergence. The optimal $\Phi'$ that maximizes $\mathrm{JS}(\Phi \,\|\, \Phi')$ can be obtained in closed form, controlled by a parameter $\alpha$ whose tuning is discussed in Subsection V-B.

Using the optimal $\Phi'$ found by the above process, we compute the perturbed phase values as

$$\theta'_i = \theta'_{i-1} + 2\pi\,\phi'_i,$$

which simply inverts the definition of $\phi'_i$.
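A small sketch of the JS-divergence computation and of the cumulative-sum resynthesis of the perturbed phases (the closed-form maximizer itself is omitted here, as its exact form depends on the parameter $\alpha$ discussed in Subsection V-B):

    import numpy as np

    def js_divergence(p, q, eps=1e-12):
        # Jensen-Shannon divergence between two discrete distributions.
        p, q = np.asarray(p, float), np.asarray(q, float)
        m = 0.5 * (p + q)
        kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
        return 0.5 * kl(p, m) + 0.5 * kl(q, m)

    def phases_from_distribution(phi_pert, theta_start=0.0):
        # Invert phi'_i = (theta'_i - theta'_{i-1}) / (2*pi) by a cumulative sum.
        increments = 2.0 * np.pi * np.asarray(phi_pert)
        return (theta_start + np.cumsum(increments)) % (2.0 * np.pi)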

Algorithm 1 Texture Transfer Based on Displacement Field
Input: $I_t$, $S_t$, $T_i$.   Output: $J_i$
1: Calculate a DF between $S_t$ and $T_i$, whose horizontal and vertical components are denoted by $f$ and $g$, respectively.
2: for $y = 0$ to $H - 1$ do
3:   for $x = 0$ to $W - 1$ do
4:     $u = x + f(x, y)$
5:     $v = y + g(x, y)$
6:     Copy the color of the pixel $I_t(u, v)$ to the pixel $J_i(x, y)$, i.e., $J_i(x, y) \leftarrow I_t(u, v)$.
7:   end for
8: end for

IV. HUMAN REGION TEXTURE TRANSFER

A displacement field (DF) assigns a two-dimensional displacement vector $(f(x, y), g(x, y))$ to every pixel $(x, y)$; the pair of $f$ and $g$ constitutes the DF. A typical example of a DF is an optical flow field densely computed between two consecutive frames of a video.

With the method proposed in Section III, we have already obtained a sequence of deformed gait silhouettes $T$ from an input sequence $S$. Let $T_i(u, v)$ be the $i$-th frame of $T$ and $S_t(u, v)$ the $t$-th frame of $S$. In addition, let $I_t(u, v)$ be the original colored version of $S_t(u, v)$. In this situation, once an appropriate DF between $T_i$ and $S_t$ is obtained, we can easily transfer the texture of $I_t$ onto $T_i$ and obtain its colored version $J_i$ by Algorithm 1, where $f$ and $g$ are the DF satisfying $T_i(x, y) \approx S_t(x + f(x, y), y + g(x, y))$, as shown in Fig. 5. $H$ and $W$ are the image height and width, respectively. Thus, how to estimate appropriate $f$ and $g$ from a pair of $T_i$ and $S_t$ plays a key role here, and this estimation becomes more difficult as the phase value of $S_t$ differs more from that of $T_i$. Hence, it is also important to select in-phase pairs of silhouette images as $T_i$ and $S_t$. We describe how to achieve these two sub-tasks in the subsequent subsections.
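For concreteness, a direct NumPy transcription of Algorithm 1; nearest-neighbor sampling and the clamping of out-of-range source coordinates are assumptions, since the paper does not specify them.

    import numpy as np

    def texture_transfer(I_t, f, g):
        # I_t  : (H, W, 3) colored human region
        # f, g : (H, W) horizontal / vertical DF components, indexed [y, x],
        #        so that J_i(x, y) = I_t(x + f(x, y), y + g(x, y))
        H, W = f.shape
        J_i = np.zeros_like(I_t)
        for y in range(H):
            for x in range(W):
                u = int(round(x + f[y, x]))      # source column
                v = int(round(y + g[y, x]))      # source row
                u = min(max(u, 0), W - 1)        # clamp to the image bounds
                v = min(max(v, 0), H - 1)
                J_i[y, x] = I_t[v, u]
        return J_i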

For each $T_i \in T = \{T_1, \cdots, T_M\}$, we have to select the best counterpart from $S = \{S_1, \cdots, S_M\}$. This is achieved by calculating the silhouette similarity between $T_i$ and $S_t$ for all $t \in \{1, \cdots, M\}$ and then selecting the most similar one. Performing this process separately for each $i \in \{1, \cdots, M\}$ is the simplest way. However, this strategy often causes a perceptible flicker in the output image sequence $\{J_1, \cdots, J_M\}$, because the counterpart of $T_i$ and that of $T_{i+1}$ are not always temporally adjacent in $S$.

Fig. 6. Whole image region A and its boundary region B.
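A sketch of this simplest frame-wise selection, with intersection-over-union as an assumed silhouette similarity measure:

    import numpy as np

    def best_counterpart(T_i, S):
        # Pick the index t of the frame of S most similar to T_i.
        def iou(a, b):
            inter = np.logical_and(a, b).sum()
            union = np.logical_or(a, b).sum()
            return inter / max(union, 1)
        scores = [iou(T_i > 0, S_t > 0) for S_t in S]
        return int(np.argmax(scores))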
The boundary constraint is introduced to make the magnitude of the displacement vectors become $0$ on the boundary of the image $T_i$. This is achieved by minimizing

$$Q_4 = \sum_{(x, y) \in B} \left( f(x, y)^2 + g(x, y)^2 \right),$$

where $B$ is the boundary region of the image $T_i$, as shown in Fig. 6. In combination with the smoothness constraint, this constraint allows us to avoid obtaining a large-magnitude DF.

Using the above constraints, we define the overall cost function for the DF estimation as

$$Q = Q_1 + \beta_2 Q_2 + \beta_3 Q_3 + \beta_4 Q_4, \tag{25}$$

where $\beta_2$, $\beta_3$, and $\beta_4$ are positive constants for adjusting the balance between the four terms. The minimization of the cost function $Q$ can be realized by a gradient descent algorithm; initializing $f$ and $g$ as $f(x, y) = g(x, y) = 0$ for all $(x, y) \in A$ suffices, because we use the in-phase silhouette pairs selected in the previous subsection as $T_i$ and $S_t$.

V. EXPERIMENTAL EVALUATION
We evaluated the effectiveness of the proposed method in several experiments, whose results are reported in this section. We first summarize the experimental setup in Subsection V-A and describe how to tune the hyperparameters in Subsection V-B. Then, we show the experimental results and discuss them in Subsection V-C.

A. Experimental Setup

We used two kinds of datasets: gait silhouette videos and colored gait videos. For the former, we employed the treadmill datasets A and B from the OU-ISIR Gait Database [42]. The treadmill dataset A includes 612 gait silhouette videos of 34 people, whose walking speeds range from 2 km/h to 10 km/h. However, since some of them look unusual, i.e., too slow or too fast, we used only the 204 videos in which a person walks at a speed of 4, 5, or 6 km/h. We refer to this set of 204 videos as $DS_a$ in the remainder. Meanwhile, the treadmill dataset B includes 2176 gait silhouette videos of 68 people whose walking speeds are all ordinary. Thus, we used all of them, and we call this set $DS_b$ in the remainder.

For the latter, namely colored gait videos, we constructed our own dataset by shooting 14 people with a web camera, 13 or 14 times per person. The number of collected videos is 190. The human region in each frame was extracted with a chroma-key system, and the remaining background region was filled with gray; this is a pre-process for easily binarizing the colored videos to obtain their silhouettes. We refer to this dataset as $DS_c$ in the remainder. Some of these videos were used for another purpose, as described later.

To separately evaluate the effect of the shape perturbation and that of the phase perturbation, we tested and compared the following three methods: shape-only, phase-only, and both. In the shape-only method, we did not add any phase perturbation to the input gait videos. Similarly, in the phase-only method, we did not add any shape perturbation.

In the both method, we added the shape perturbation as well as the phase perturbation.

After training the three GEINet-based networks, we anonymized the remaining 64 videos in $DS_c$ as mentioned above. We then evaluated the anonymization results in the following two ways. One is to directly input the results of the gait silhouette deformation proposed in Section III into the three networks, before performing the human region texture transfer (HRTT) proposed in Section IV. The other is to input the re-binarized version of the final anonymization results after performing HRTT. We refer to these two ways as before-HRTT and after-HRTT in the remainder of this section.

Regarding visual naturalness, it is not easy to evaluate gait videos objectively from such an aspect. Tieu et al. employed a human action recognition system for a similar purpose [26], based on the consideration that gait videos with visually natural motion tend to be correctly recognized as "walking" by action recognition systems, whereas videos with unnatural motion tend to be wrongly recognized. Inspired by them, we employed a pre-trained CNN model for action recognition called 3D-ResNet [43]. In addition, we also employed an object detection model called YOLO [44], under the consideration that human regions in gait videos with a visually natural appearance tend to be correctly detected by YOLO. We input the gait videos anonymized by the proposed method into 3D-ResNet video-by-video and into YOLO frame-by-frame, and measured the recognition/detection accuracy, where higher accuracy indicates better visual naturalness.

B. Hyperparameter Tuning
As introduced in Subsections III-D and III-E, the proposed method has several hyperparameters that can significantly affect its anonymization performance as well as the visual naturalness of the resulting gait videos: $\omega$ and $K$ for the shape perturbation and $\alpha$ for the phase perturbation. We first examined the impact of these hyperparameters in order to tune them; the results are reported and discussed in this subsection. As described later in Subsection V-C, we finally employed $\omega = 0.99$, $K = 20$, and $\alpha = -0.5$ through this examination. However, due to page limitations, we only report the cases in which two of them were fixed and the remaining one was varied. The hyperparameters for the human region texture transfer, namely $\beta_2$, $\beta_3$, and $\beta_4$ in Formula (25), were empirically set to $\beta_2 = 0.3$, $\beta_3 = 10$, and $\beta_4 = 5$.

1) Hyperparameters for the Shape Perturbation:
In the proposed method, shape perturbation is achieved by replacing the original shape code with a linear combination of its $K$ nearest neighbors, whose coefficients are determined by solving the optimization problem (9). $\omega$ is a weighting constant for balancing the first and the second terms in Formula (9). Their impacts were examined first, with the results shown in Fig. 7 and Fig. 8. The horizontal axis in these figures indicates the GEI-based gait recognition accuracy when the before-HRTT silhouettes obtained by the shape-only strategy were used, and the vertical axis indicates how accurately the anonymized gait videos are recognized as the "walking" action by 3D-ResNet. This means that the left area of each figure is desirable.

Fig. 7. Impact of K on anonymization performance and visual naturalness of the resultant gait video.

2) Hyperparameter for the Phase Perturbation: The anonymization performance is expected to be enhanced as $\alpha \to -\infty$, whereas this would also degrade the visual naturalness of the anonymized gait videos. We experimentally examined this trade-off, whose result is shown in Fig. 9. The meanings of the horizontal and vertical axes are almost the same as those in Fig. 7 and Fig. 8, except that the phase-only strategy is used instead of the shape-only one. As we theoretically expected, an $\alpha$ with a larger absolute value leads to higher anonymization performance as well as lower visual naturalness. In particular, when $\alpha < -0.5$, the visual naturalness degrades rapidly, because such an $\alpha$ makes it difficult to successfully achieve the human region texture transfer. For these reasons, we employed $\alpha = -0.5$ to strike a balance between the anonymization performance and a visually natural appearance.

C. Experimental Results
Using the hyperparameters tuned as above, we anonymized the test videos in $DS_c$ and evaluated the results qualitatively and quantitatively.

1) Qualitative Evaluation: Fig. 10 shows two examples of original (non-anonymized) gait videos and their anonymized versions obtained by the proposed method. We can see from this figure that the anonymized videos keep an appearance as natural as their originals. There are no serious differences between the original and the anonymized videos, even in the case of the both strategy. The same property can be observed in most of the other test data, which demonstrates that our proposed method does not degrade the visual naturalness of input gait videos.

Of course, the visual naturalness of the anonymization results is affected by the performance of HRTT, which depends on the three constraints introduced in Subsection IV-C. We qualitatively verified their effects. Fig. 11 shows an example of the displacement field (DF) estimation results and the final anonymization results with and without each constraint. We can see from Fig. 11 that omitting the consistency constraint does not cause any serious distortion. This is because the consistency constraint can be implicitly satisfied by the use of the smoothness constraint if the magnitude of the estimated DF is low. Since we use in-phase silhouette pairs for the DF estimation, a low-magnitude DF is obtained in most cases.

Fig. 12. Convergence trend of the DF estimation process. Since we initialize the DF as a zero vector field, Q_2, Q_3, and Q_4 are initially zero and slightly increase along with the decrease of Q_1.

One remaining drawback comes from the smoothness constraint introduced in Section IV-C. The smoothness constraint weakens the high-frequency components of the estimated DF, by which the local shape information of the human regions in the original video might be reflected in the anonymized one, as shown in Fig. 14. Even though the effect of the smoothness constraint is not serious, it is undesirable, because attackers who want to extract private information from web videos can input a re-binarized version of an anonymized human gait video into their gait recognition systems. In our future work, we will reconsider how to appropriately impose the smoothness constraint on the DF estimation process.

The visual abstraction-based approaches, namely pixelization and blurring, cannot reduce the gait recognition accuracy as much as our method. In particular, an accuracy of around 80% is still retained in the case of pixelization. This is because CNN-based gait recognition systems such as GEINet have an internal process similar to pixelization (e.g., average pooling) and are therefore robust to pixelized images.

There is a concern that data augmentation by slight silhouette deformation might enhance the capability of a gait recognition system so that it can defeat our anonymization method. To address this concern, we also tested the effect of data augmentation on gait recognition accuracy. Specifically, we applied dilation, erosion, and scaling operators to the human silhouette in each training sample in $DS_a$ and $DS_c$, and then re-trained the GEINet-based gait recognition systems. For the dilation and erosion operators, we used a 3 x 3 rectangular kernel. For the scaling operator, we used the following four settings of the horizontal scaling factor $\tau_x$ and the vertical one $\tau_y$: $(\tau_x = 1.0, \tau_y = 0.95)$, $(\tau_x = 1.0, \tau_y = 1.05)$, $(\tau_x = 0.95, \tau_y = 1.0)$, and $(\tau_x = 1.05, \tau_y = 1.0)$. Table II shows the result. The anonymization performance of shape-only is slightly decreased; however, that of phase-only and both is not affected by the data augmentation. This demonstrates that the phase perturbation, which is the main advantage of our proposed method over existing ones, is robust against data augmentation-based gait recognition.
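A sketch of this augmentation step with OpenCV; the refitting of a rescaled silhouette onto the original canvas is an assumption, since the paper does not describe it.

    import cv2
    import numpy as np

    KERNEL = np.ones((3, 3), np.uint8)       # 3 x 3 rectangular kernel
    SCALES = [(1.0, 0.95), (1.0, 1.05), (0.95, 1.0), (1.05, 1.0)]

    def augment_silhouette(sil):
        # Yield dilated, eroded, and rescaled variants of a binary
        # silhouette given as a (H, W) uint8 array.
        h, w = sil.shape
        yield cv2.dilate(sil, KERNEL)
        yield cv2.erode(sil, KERNEL)
        for tx, ty in SCALES:
            scaled = cv2.resize(sil, (int(w * tx), int(h * ty)),
                                interpolation=cv2.INTER_NEAREST)
            canvas = np.zeros_like(sil)      # refit to the original H x W canvas
            hh, ww = min(h, scaled.shape[0]), min(w, scaled.shape[1])
            canvas[:hh, :ww] = scaled[:hh, :ww]
            yield canvas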

3) Quantitative Evaluation in Terms of Visual Naturalness:
As mentioned in Subsection V-A, the visual naturalness of the anonymized gait videos was evaluated using two criteria: the frame-by-frame person detection accuracy of YOLO and the video-by-video "walking" action recognition accuracy of 3D-ResNet. Table III shows the evaluation results. In addition, Fig. 15 presents these results in more detail.

Fig. 14. The original gait frame has a dent in its foot region. This local shape is once removed by the silhouette deformation but restored by the human region texture transfer (HRTT) in the anonymized gait frame, because of a too-smooth displacement field (DF).

In contrast, the visual abstraction-based approaches cannot keep the visual naturalness at all, as mentioned in Section II.

VI. CONCLUSION
In this paper, we proposed a method for anonymizing gait information in web videos to reduce the risk of privacy leakage caused by gait recognition systems. The proposed method consists of two modules: silhouette deformation and human region texture transfer. The former slightly deforms the silhouette of each frame in an input gait video in terms of not only the static aspect (i.e., shape) but also the dynamic aspect (i.e., phase), so that the person in the input video cannot be correctly recognized. Anonymization from the dynamic aspect is an important contribution of this paper. The latter, namely HRTT, is achieved by the use of a displacement field in order to handle various types, colors, and designs of clothes without any training process, which is another contribution.

In our experiments, the proposed method succeeded in reducing the gait recognition accuracy from 100% to at most 12.1% (4.73% in the lowest case) by static anonymization alone. More importantly, the accuracy was further reduced to at most 8.42% (1.57% in the lowest case) by the combination of static and dynamic anonymization. This trend was observed for every kind of gait feature used for recognition (i.e., GEI, FDF, and SFDEI), which demonstrates the high anonymization capability of the proposed method. In addition, the proposed method also succeeded in preserving the visual naturalness of the walking people's appearance in terms of the action recognition accuracy of 3D-ResNet, which was 75.6% before and 73.0% after anonymization. This result indicates that the proposed method does not significantly affect the visual naturalness of the input video. We also verified this fact by the qualitative evaluation.

Importantly, the proposed dynamic anonymization technique (i.e., phase perturbation) can be combined with not only our own static anonymization technique (i.e., shape perturbation) but also any other existing shape-based gait anonymization method, as follows. First, the phase perturbation technique is applied to an input gait video (this process is equivalent to the phase-only strategy), and the result is then input into the existing shape-based method. Then, the anonymized gait silhouette is colorized by the proposed texture transfer technique. With this procedure, the phase perturbation technique can strengthen existing shape-based methods as well.

APPENDIX A

The best shift is determined by DP matching between the reference sequence $R$ and the shifted input sequences $S^l$, as mentioned in Subsection III-B (see also the right side of Fig. 16). Let $\hat{P}_l$ be the best path between $R$ and $S^l$, that is, $\hat{P}_l = \operatorname{argmin}_P C(R, S^l; P)$. The globally best path is obtained as $\hat{P}_{\hat{l}}$, where

$$\hat{l} = \operatorname*{argmin}_l C(R, S^l; \hat{P}_l) = \operatorname*{argmin}_l C(R, S^l). \tag{30}$$

Finally, for each frame $S_j$, we construct an index set $N_j = \{\, i \mid (i, j) \in \hat{P}_{\hat{l}} \,\}$ and estimate the phase value of $S_j$ as the average of the phases of the reference frames indexed by $N_j$ (Formula (31)).

APPENDIX B

As mentioned in Subsection III-C, we first train a VAE to obtain $z_a^{\mathrm{gt}}$; its network structure is shown in Fig. 17. In the figure, "FC" means a fully-connected layer, where $n$ is the number of units in it. "Conv" and "Deconv" mean a convolution layer and a transposed convolution layer, respectively, where "KS" and "Ch" denote their kernel size and number of channels. Hereafter, let $E_{\mathrm{vae}}$ and $D_{\mathrm{vae}}$ be the encoder and decoder parts of the VAE, respectively. We use a sigmoid activation after the last layer of $D_{\mathrm{vae}}$, no activation after the last layer of $E_{\mathrm{vae}}$, and ReLU activations after all other layers.

Using the above $E_{\mathrm{vae}}$, we extract a feature vector $\xi_a(\theta_i)$ from $\mathrm{Sil}_a(\theta_i)$ for all $i$ and obtain $z_a^{\mathrm{gt}}$ based on Formula (3). After that, using the $z_a^{\mathrm{gt}}$, we train $E$ and $D$. The network structure of $E$ is identical to that of $E_{\mathrm{vae}}$ except that it does not have a fully-connected layer for calculating the standard deviation; $E$ and $E_{\mathrm{vae}}$ share the same structure, but their parameters are trained independently of each other. The network structure of $D$ is shown in Fig. 18. Importantly, we design the decoder $D$ by cascading a subnetwork $F$ and the above $D_{\mathrm{vae}}$, whose parameters are fixed. This means that we only tune the parameters of the subnetwork $F$ during the training process of $D$.

Fig. 18. Network structure of the gait silhouette decoder D, which is designed by cascading F and D_vae. We use ReLU activation after all FC layers.

We previously showed the loss function as Formula (4), but that is a simplified version; we actually use an extended loss function.

Fig. 19. To enable the F to precisely reconstruct ξ_a(θ_i) from
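A PyTorch-style sketch of this cascaded decoder design, with $D_{\mathrm{vae}}$ frozen and only the subnetwork $F$ trainable; the layer sizes are placeholder assumptions, since the exact architecture is given only in Fig. 18.

    import torch
    import torch.nn as nn

    class Decoder(nn.Module):
        # D = D_vae o F: only the subnetwork F is trainable.
        def __init__(self, d_vae, code_dim=32, hidden=256):
            super().__init__()
            self.F = nn.Sequential(            # maps (z, p_theta) to a VAE latent
                nn.Linear(code_dim + 2, hidden), nn.ReLU(),
                nn.Linear(hidden, code_dim),
            )
            self.d_vae = d_vae
            for p in self.d_vae.parameters():
                p.requires_grad = False        # keep D_vae fixed

        def forward(self, z, p_theta):
            latent = self.F(torch.cat([z, p_theta], dim=-1))
            return self.d_vae(latent)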