GaitCopy: disentangling appearance for gait recognition by signature copy

This paper addresses the problem of gait-based people identification by copying optical flow-based signatures. The proposed model, coined GaitCopy, receives as input a stack of gray images and returns the gait signature of the represented subject. The novel property of this network is that it is not trained only to generate discriminative signatures, but to copy signatures generated by a master network trained on optical flow inputs. Thus, GaitCopy is forced to extract signatures based on motion and not on appearance, despite having been trained with pixel inputs. We implement two different versions of GaitCopy: one mainly composed of 3D convolutional layers to capture local temporal information; and a second one based on GaitSet, which uses 2D convolutional layers under a temporal setup. We evaluate our approach on two public gait datasets: CASIA-B and TUM-GAID. We observe that compact networks, up to ×4.2 smaller for TUM-GAID, can be obtained with our approach, while keeping competitive recognition accuracy with respect to the state of the art and without the need for explicit optical flow computation. Even with such network compression, the results obtained on TUM-GAID are comparable to those of the state of the art, with an average accuracy of 97% on the test set. (Code will be publicly available upon acceptance.)


I. INTRODUCTION
The term 'gait' refers to the manner in which a person walks or runs. Since each person walks in a very particular way (i.e. her 'gait signature'), gait has become a popular biometric cue that enables people identification, even at a distance [1], [2]. In the last few years, there has been significant progress in gait recognition [3]; however, the task is far from solved.
Traditionally, gait recognition has been tackled using appearance-related features such as RGB images [4] or shape-based features such as silhouettes [5]. However, these types of features are sensitive to changes either in the scene (e.g. lighting or cluttered background) or in the subject itself (changes in clothing or other factors produced by the passage of time). These problems can be mitigated by using motion-based features of the subject rather than appearance-based ones, as gait, i.e. walking, is closely related to motion.
In computer vision, the lowest level feature used to encode motion in a sequence of images is optical flow (i.e. the direction and magnitude of each pixel displacement). There exist some works in the field of gait recognition that use optical flow [6], [7]. However, the computation of optical flow might be considered computationally expensive for some particular applications.
In this context, this work proposes GaitCopy, a deep neural network for gait recognition that receives as input a sequence of gray-scale images containing a person walking, and generates a gait signature equivalent to the one that would be obtained by a model receiving as input a sequence of optical flow maps, thus alleviating the computational cost of optical flow, as illustrated in Fig. 1. Note that, in that figure, for the sake of simplicity, we just represent the magnitude of the optical flow vectors with different colours, where the brighter the pixel, the higher the magnitude.
The main contribution of this paper is a novel learning framework for gait signature extraction able to disentangle appearance from identity thanks to motion cues. In particular, GaitCopy generates gait signatures that copy the properties of the optical flow features without the need for explicit optical flow computation at inference time. The experimental results on the CASIA-B and TUM-GAID datasets indicate that the recognition accuracy achieved by using the learned signatures is on par with, or improves on, the one obtained with the optical flow-based signatures. In addition, the number of parameters used in the student networks on TUM-GAID is up to ×4.2 smaller compared to the master ones, which is an important feature for deploying the networks on low-resource systems.
The rest of the paper is organized as follows. After presenting the related work in Sec. II, our model is described in Sec. III. Then, in Sec. IV, we define and discuss the experimental results. Finally, Sec. V concludes the paper.

II. RELATED WORK
Nowadays, most gait recognition approaches use silhouettes as input data, since they encode the body shape information ignoring details such as textures or clothing. The simplest approaches use Gait Energy Images (GEI) or features derived from it [8], [9]. In these methods, the whole video sequence is summarized into a single image containing spatio-temporal information. Recent approaches [5], [10] use stacks of random silhouettes to represent the gait information, where each frame is handled independently to extract features that are combined with other frame features to build the final gait signature. However, all these approaches are sensitive to changes in the body shape produced by clothing or camera view-points. Recently, optical flow maps have been used to represent gait, since they are easy to compute and focus on the subject's motion, independently of body shape. For example, Castro et al. [6] use different CNN models to identify subjects from optical flow maps.
A direct extension of these previous gait approaches, which focus on a single kind of input data, is the use of multimodal models, where more than one kind of data (e.g. optical flow, gray images, depth maps) is used as input. Kumar et al. [11] use data obtained from multiple inertial sensors to obtain a 3D-skeleton representation together with video images. In [7], a CNN model is proposed that uses optical flow, depth and gray images at the same time to improve the global gait accuracy of the model. Zhang et al. [12] use RGB inputs to obtain an intermediate representation based on skeletons to disentangle appearance details. Then, global temporal descriptors are obtained with an LSTM from the intermediate skeletons.
In a non-gait context, Nagrani et al. [13] combine faces and voice to identify people, trying to find a common embedding. In [14], the authors propose a multimodal CNN for action recognition where the model learns a global multimodal embedding from multiple modalities and uses this embedding to recalibrate the input features, obtaining better global performance. Garcia et al. [15] assign teacher-student roles to each modality according to its loss value, updating the teacher branch to maximize accuracy and the student branches to replicate the behavior of the teacher branch. A different approach is presented in [16], [17], where a student model learns from RGB information to produce features or signatures similar to the optical flow ones obtained from a teacher model. Then, both are combined to boost the final results, without the need for optical flow computation at test time. In our case, to the best of our knowledge, we apply a similar idea for the first time to the task of gait recognition, where, in contrast to [16], no feature fusion is performed, as we want to get rid of optical flow computation at inference time and directly use the data stream (i.e. pixels) provided by the camera.
Teacher-student approaches also use signatures to compress deep learning models through knowledge distillation between the predictions of a big model (i.e. the teacher) and a smaller one (i.e. the student) [18], [19]. Other recent works have tried to improve the resulting compressed models. Thus, in [20], an ensemble of students learns collaboratively, its members teaching each other during training. Distillation performance can degrade when the difference in size between the teacher and student models is large. To alleviate this problem, multi-step knowledge distillation can be applied [21]. In [22], a teacher-free knowledge distillation is proposed in which the student is able to learn from itself.

III. PROPOSED APPROACH
As illustrated in Fig. 1, we start by training a ConvNet for gait recognition, using as input optical flow, that will act as the master network (Sec. III-B). Then, by using the training loss function described in Sec. III-D, we train the student network (Sec. III-C), i.e. the one that has to copy the gait signatures generated by the master network, but using as input a different modality (i.e. gray). Finally, at inference time, we only use the student network to generate gait signatures.

A. INPUT DATA 1) Optical Flow
We define Optical Flow (OF) [23] as the motion pattern in a scene caused by the relative motion between an observer and the scene between two instants of time. OF has shown excellent results in the characterization of gait [7], since it is a motion-centric representation that ignores appearance and focuses on describing a subject by a set of local and subtly varying motions.
The OF is divided into two components or channels, one representing the y-axis and the other representing the x-axis, where most of the gait motion flow is concentrated.
Let F_t be an OF map at time t. Then, F_t(x, y, c) is the value of the OF component c of the flow vector located at coordinates (x, y), where c indexes the horizontal or vertical component.
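For concreteness, the definition above can be illustrated with a small NumPy sketch (our own illustration, not the paper's code), storing an OF map as an array of shape (H, W, 2):

```python
import numpy as np

# An OF map F_t stored as an array of shape (H, W, 2): channel 0 holds the
# horizontal (x) displacement and channel 1 the vertical (y) displacement
# of each pixel (the channel ordering here is an assumption).
H, W = 60, 60
F_t = np.zeros((H, W, 2), dtype=np.float32)
F_t[30, 20] = (1.5, -0.5)      # pixel at row 30, col 20 moves right and up

u = F_t[30, 20, 0]             # component c = 0 (horizontal)
v = F_t[30, 20, 1]             # component c = 1 (vertical)
magnitude = np.hypot(u, v)     # per-pixel motion magnitude, as in Fig. 1
```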

2) Gray
In contrast to OF, gray pixels are intended to capture the subject's appearance, keeping in mind that color is not informative for this task and that gray allows CNNs to focus more on gait information.

B. MASTER NETWORK ARCHITECTURE
We consider two possible master network architectures: a classic one [6], and a more modern one [5]. Both are described below and will be evaluated in the experimental section (Sec. IV).

1) Master-Conv2D: Classic network
This network architecture follows the one described in [6]. It is composed of the following layers: 4 Conv2D layers with [96, 192, 512, 4096] filters and kernel sizes [7 × 7, 5 × 5, 3 × 3, 2 × 2], respectively. Then, two fully-connected layers are added with 4096 and 2048 units. The final layer is a softmax with as many units as subject identities. The input is a tensor of dimensions (50, 60, 60), whose first dimension interleaves the X and Y components of an optical flow sequence of 25 frames of 60 × 60 pixels, with the person centered in frame #13, as in [6]. This network is trained using a loss function that combines the Triplet loss [24] and the Cross-entropy loss, with loss weights of 1.0 and 0.1, respectively.
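The assembly of the input tensor can be sketched as follows (a minimal NumPy illustration; the exact channel ordering, X before Y, is our assumption):

```python
import numpy as np

# Sketch of building the Master-Conv2D input: 25 OF maps of 60x60 pixels
# whose X and Y components are interleaved along the first axis, yielding
# a (50, 60, 60) tensor.
def stack_flow_sequence(flow_maps):
    """flow_maps: array of shape (25, 60, 60, 2) -> tensor (50, 60, 60)."""
    T, H, W, _ = flow_maps.shape
    stacked = np.empty((2 * T, H, W), dtype=flow_maps.dtype)
    stacked[0::2] = flow_maps[..., 0]   # X components at even positions
    stacked[1::2] = flow_maps[..., 1]   # Y components at odd positions
    return stacked

clip = np.random.rand(25, 60, 60, 2).astype(np.float32)
x = stack_flow_sequence(clip)
assert x.shape == (50, 60, 60)
```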

2) Master-GaitSet: GaitSet-style network
This second type of network follows the architecture described in [5], i.e. GaitSet. This state-of-the-art network has been chosen because it is invariant to permutations of the frames, so it can use frames from different videos recorded under different scenarios, viewing angles or clothing conditions. These characteristics have made it the state of the art on CASIA-B. Also, its publicly released implementation has allowed us to replicate its results accurately.
We have performed a small modification of it to decrease its inference time. In particular, we have changed the output of each fully-connected layer from 256 features to 128, in order to compress the model and speed up training and inference. This way, the final gait signature is reduced from 15872 to 7936 dimensions. The scheme of the architecture is shown in Figure 3.
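As a quick sanity check of the signature sizes quoted above, assuming the 62 horizontal-pyramid strips of the original GaitSet (31 bins across two branches), the dimensions follow directly:

```python
# Signature size = number of HPP strips x per-strip feature size.
# The 62-strip figure is from the original GaitSet design (an assumption
# here, consistent with the quoted totals).
strips = 62
original = strips * 256    # original GaitSet signature
compressed = strips * 128  # our reduced variant
assert original == 15872 and compressed == 7936
```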

C. STUDENT NETWORK ARCHITECTURE
Depending on the master network, we propose the use of the student architectures described below: Master-Conv2D is paired with Student-Conv3D, and Master-GaitSet with Student-GaitSet. For completeness, we summarize in Tab. 1, column '# of parameters', the number of parameters (in millions) of each network used in this work.

1) Student-Conv3D: 3D convolutions
We consider the use of 3D convolutional filters (Conv3D) to define the student network, in order to capture local temporal information, as the optical flow does. The sequence of Conv3D blocks is represented in Fig. 2, which includes two versions (i.e. Student-Conv3D-A and Student-Conv3D-B) depending on the number of filters used in each layer, illustrated with different font colors inside each block. The first component of the stride (in brackets) corresponds to the temporal dimension. All blocks but the last include ReLU activations. A dropout layer (rate 0.2) is inserted before the last block, whose kernel is L2-regularized (1e-3).
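To illustrate why a kernel spanning several frames reacts to motion, here is a minimal NumPy sketch of a 3D convolution (illustration only; the actual student uses stacked Keras Conv3D layers as in Fig. 2):

```python
import numpy as np

# Naive valid 3D convolution: the kernel spans kt consecutive frames, so
# its response depends on local temporal changes, unlike a per-frame 2D
# kernel applied independently to each image.
def conv3d_valid(clip, kernel):
    """clip: (T, H, W); kernel: (kt, kh, kw) -> valid convolution output."""
    T, H, W = clip.shape
    kt, kh, kw = kernel.shape
    out = np.zeros((T - kt + 1, H - kh + 1, W - kw + 1))
    for t in range(out.shape[0]):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[t, i, j] = np.sum(clip[t:t + kt, i:i + kh, j:j + kw] * kernel)
    return out

# A temporal-difference kernel: responds only where pixels change over time.
k = np.zeros((3, 1, 1))
k[0, 0, 0], k[2, 0, 0] = -1.0, 1.0
static = np.ones((5, 4, 4))                  # no motion: zero response
assert np.allclose(conv3d_valid(static, k), 0.0)
```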

2) Student-GaitSet: GaitSet-style
The corresponding student network of the Master-GaitSet is based on GaitSet as well. In this student network, due to the way gait signatures are built, the general architecture remains unchanged with respect to its master. Only the first convolutional block is modified, adapting its kernels to deal with gray images and their single input channel. However, this change is minimal, and the architecture still matches the one shown in Figure 3. As can be seen in Table 1, the number of parameters of the master and student networks based on GaitSet is similar.

D. LOSS FUNCTION FOR TRAINING THE STUDENT NETWORK
A student network (e.g. a gray-based one) has to learn to generate gait signatures as similar as possible to those generated by the master network (i.e. the optical flow-based one). To teach the student network to copy flow signatures, a regression function L_R seems a natural choice. As L_R, we use the Huber loss (i.e. smooth-L1) for training our student models, as it offered better results in our preliminary experiments than the mean squared error:

L_R(y_t, y_p) = 0.5 (y_t − y_p)^2 if |y_t − y_p| ≤ δ; δ |y_t − y_p| − 0.5 δ^2 otherwise, (1)

where y_t and y_p are the true and the predicted values, respectively, and δ is a hyperparameter to be cross-validated. In addition, as the copied signatures will probably not be identical to those produced by the master network, we also encourage the student network to generate discriminative signatures. To achieve this second goal, we propose the use of the Triplet loss [24]:

L_T(A, P, N) = max(0, ||g(A) − g(P)||^2 − ||g(A) − g(N)||^2 + α), (2)

where g(·) is a deep neural network, A is an anchor sample, P is a positive sample w.r.t. A, N is a negative sample w.r.t. A, and α is a margin value to be cross-validated.

[Fig. 3 caption: Each blue block represents a Conv2D with max pooling and leaky ReLU activation, each green block is a Conv2D, and the red block is a horizontal pyramid pooling (HPP). Vertical green arrows represent a set pooling operation (SP) applied to feature maps coming from different samples. K is the kernel size, and P is the filter padding. (Best viewed in digital format)]

VOLUME 23, 2021
Then, our combined loss function L_MS is defined as:

L_MS = L_R + β L_T, (3)

where β is a hyperparameter, to be cross-validated, that controls the contribution of L_T.
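The combined objective described above (Huber regression plus a triplet margin term, i.e. L_R + β·L_T) can be sketched in NumPy as follows; the β value shown is illustrative only, and the batch-all triplet mining used in the paper is omitted for brevity:

```python
import numpy as np

def huber(y_t, y_p, delta=0.5):
    """L_R (smooth-L1) between master and student signatures."""
    r = np.abs(y_t - y_p)
    return np.mean(np.where(r <= delta, 0.5 * r ** 2, delta * r - 0.5 * delta ** 2))

def triplet(a, p, n, alpha=0.25):
    """L_T on embeddings: anchor a, positive p, negative n, margin alpha."""
    return max(0.0, np.sum((a - p) ** 2) - np.sum((a - n) ** 2) + alpha)

def l_ms(master_sig, student_sig, anchor, pos, neg, beta=0.1, delta=0.5):
    # beta weighs the triplet term; its value here is illustrative, not
    # the cross-validated one from the paper.
    return huber(master_sig, student_sig, delta) + beta * triplet(anchor, pos, neg)
```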

IV. EXPERIMENTS AND RESULTS
We start by briefly describing the datasets and metrics used in our experiments. Then, we provide the implementation details of our models. We then present the experimental setup, and the section concludes with a discussion of the results.

B. IMPLEMENTATION DETAILS
We use the Keras version of TensorFlow 2 for implementing and training our models. Regarding the master network, we have trained an OF-based master network from scratch for each dataset. Adam is the optimizer used for training all the student networks. Each minibatch contains 150 samples with the same proportion of samples from the different scenarios (e.g. 50 N, 50 B, 50 S for TUM-GAID). We use the Batch All (BA+) triplet loss [27] to train the networks, with a batch size of p × k samples, where p denotes the number of different subjects and k the number of training samples per subject in the batch. The learning rate starts at 0.0001 and decreases by 0.2 when the validation Mean Absolute Error (MAE) stops decreasing, with a maximum of 500 epochs. Selected by cross-validation, the parameter δ (Eq. 1) is set to 0.5 and α (Eq. 2) to 0.25.

Input data. Since each original video sequence has a different temporal length, we extract L-frame subsequences (L = 25) from the full-length sequences, with an overlap of 80% between subsequences. In this way, we unify the size of the input samples while making sure that the temporal information of the video is not lost. In our approach, optical flow is obtained using SPyNet [28], pre-trained on MPI Sintel (more details in Sec. IV-D3).

Data augmentation. We randomly apply a horizontal flip (see left image in Figure 4) to the training clips to increase the number of samples. In addition, we randomly swap the target signature of a given input (see right image in Figure 4) by choosing one of the same subject that may come from a different sequence (i.e. including different scenarios). We name this latter type of data augmentation 'cross-signature', which is experimentally studied in Sec. IV-D2.

Video sequence classification. Note that our models receive as input a clip of 25 frames, returning the corresponding gait signature for that clip.
Then, we L2-normalize the signatures and classify each individual clip by using a simple 3 Nearest Neighbor (3NN) classifier. Finally, in order to obtain a global label (i.e. subject identity) for the whole video sequence, we use a classical majority voting scheme on the individual clip labels.
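A minimal sketch of this classification stage (function and variable names are ours, not from the paper's code):

```python
import numpy as np
from collections import Counter

def l2_normalize(x):
    """Row-wise L2 normalization of signatures."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def classify_video(clip_sigs, gallery_sigs, gallery_ids, k=3):
    """Each clip signature is classified with k-NN against a labeled
    gallery; the video label is the majority vote over clip labels."""
    clip_sigs = l2_normalize(np.asarray(clip_sigs, dtype=float))
    gallery_sigs = l2_normalize(np.asarray(gallery_sigs, dtype=float))
    votes = []
    for s in clip_sigs:
        d = np.linalg.norm(gallery_sigs - s, axis=1)          # distances
        knn_ids = [gallery_ids[i] for i in np.argsort(d)[:k]]  # k nearest
        votes.append(Counter(knn_ids).most_common(1)[0][0])    # clip label
    return Counter(votes).most_common(1)[0][0]                 # video label
```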

C. EXPERIMENTAL SETUP
In order to validate our proposed approach, we train the diverse student network architectures described in Sec. III-C and compare their performance against both their master networks and networks trained from scratch (no copy) directly on the gray modality. In all our experiments, we use the standard data partitions (training and test) of the respective datasets, allowing a direct comparison with previous (Sec. IV-D4) and future works. Note that we follow an open-world setup for evaluating our models. That means that we always test on a set of subjects that has never been seen by the networks during training. This is more challenging than its counterpart (i.e. closed-world), where a portion of the videos of the test set can be used to fine-tune the networks (e.g. to train a softmax classifier).
Also, in our experiments, we use Master-Conv2D, Student-Conv3D-A and Student-Conv3D-B for TUM-GAID, and Master-GaitSet and Student-GaitSet for CASIA-B, as master and student networks.

1) Performance of the student networks
We observe in Fig. 5 that, in general terms, the student networks achieve a Rank-1 accuracy comparable to that attained by their masters (green bars). Note that, in the CASIA-B dataset, there is a remarkable improvement in the 'bg' scenario and, as a consequence, in the average over the scenarios. In our opinion, as the copies are not exact, the student networks have been able to discover identity features more robust to the 'carrying-bag' covariate than the master ones. In fact, the same behavior is observed in the 'B' scenario of TUM-GAID. Focusing on the different versions of the student networks, we notice that the smallest one, 'Student-Conv3D-B', is even able to achieve slightly better results than its counterparts on TUM-GAID. We hypothesize that, when there is a bigger change in appearance (e.g. scenario 'B'), a lower number of parameters helps to reduce overfitting, i.e. to improve generalization.
In CASIA-B, we verified that our method easily extrapolates to other state-of-the-art models, such as GaitSet. As can be seen in Fig. 5-right, the model learns to mimic OF-based signatures by using pixel data.

Visual verification of the student learning. We use the UMAP projections [29] of the signatures to visually verify that the student networks are learning to copy the master signatures. This correct behavior can be observed in Fig. 6, which has been generated on samples from TUM-GAID. In Fig. 7, we show both the signatures obtained by a model trained from scratch on the gray modality (no copy) and the signatures obtained by a student. We notice that both types of signatures are clearly separated (i.e. 'triangles' do not overlap 'circles' of the same color). This fact reinforces our hypothesis that the student is correctly disentangling appearance from identity via motion. Finally, in Fig. 8 we represent the signatures obtained for the 16 subjects of TUM-GAID belonging to the 'temporal' set (where clothing was changed drastically). We observe that the student is able to generate signatures that noticeably separate the different subjects into clusters.

Computational performance. Tab. 1 shows the different networks together with their number of parameters (in millions), the achieved clips per second (CPS) and the number of floating point operations (in billions) per second (GFLOPS) at inference time, with a batch size of a single sample. CPS and GFLOPS have been obtained on an NVIDIA Titan Xp 12GB using the NVIDIA profiler. It can be observed that the Conv3D models are around 58% faster than the Master-Conv2D. Also, the GaitSet-based models are faster than the other models presented, because they are Conv2D-based models with a lower number of parameters, which increases their CPS by approximately 300%.
Regarding the GFLOPS metric, it depends on the number of parameters and on the kind of layers of the model, so it is difficult to directly compare different architectures. When comparing similar architectures that only differ in the output feature size, it is clear that the more parameters, the more GFLOPS.

2) Ablation study
We have already studied, in the experiment summarized in Fig. 5, how the performance of the method changes as a function of the number of parameters of the student network. Then, we present in Tab. 2 an ablation study of the remaining components. For completeness, we have used a set of values for β and δ. For each dataset and row in Tab. 2, we have selected by cross-validation the best combination of β and δ values. Then, we report the average over the corresponding scenarios of the datasets. Note that, in contrast to the previously reported results in this paper, here we report the Rank-1 accuracy directly on the isolated input samples, without any majority voting applied at video level. This helps to disentangle the actual performance of the network (which only sees a 25-frame sample at a time) from the contribution of majority voting. In fact, these results indicate that, by using only 25 frames, a subject can be correctly identified ∼92% of the time in TUM-GAID and up to ∼68% in CASIA-B.
Depending on the complexity of the dataset, especially due to the changes in clothing, we observe: (i) the lowest MAE is obtained in both datasets when both 'Tri' and 'Xs' are disabled; (ii) in TUM-GAID, the only use of L R is enough for achieving top results; (iii) the use of 'Xs' generally increases the MAE, as there is not a 1:1 relation anymore between master and student signatures, but a set of possible signatures; in contrast, (iv) CASIA-B, which includes drastic changes in clothing in scenario 'cl', benefits from its use, according to the increase in the Rank-1.

3) Optical flow study
Nowadays, there exist many optical flow approaches relying on different ways of computing this kind of low-level feature. On the one hand, there are relatively old hand-crafted approaches such as [34], [35]; on the other hand, there are new approaches based on deep learning such as [28], [36]. In this experiment, we compare two widely used optical flow methods, a hand-crafted approach (Farneback) [34] and a deep learning approach (SPyNet) [28], in order to select the best optical flow approach for gait recognition. Fig. 9 shows the output of both approaches obtained on the same frames. Images a) and b) represent the vector map and the magnitude of the optical flow, respectively, obtained with Farneback; c) and d) represent the vector map and the magnitude, respectively, obtained with SPyNet. Comparing both approaches, it is clear that SPyNet produces sharper and cleaner results, both in the magnitude and the vector maps. Thus, since gait recognition must focus on subtle details of the movement, SPyNet brings an extra quality that can be decisive for gait recognition. Finally, Tab. 4 compares the execution time of both methods computed on CPU and GPU, over 25 frames of 640x480 pixels. Note that, since SPyNet is a deep learning approach, we only report execution times computed on GPU due to the excessive time required on CPU (around 15 times slower). Farneback-CPU is the slowest method, requiring 2.16s to process an input sample. On the other hand, when Farneback is computed on the GPU, it is the fastest one, requiring 0.36s per sample. Focusing on SPyNet, it is slower than Farneback-GPU, but faster than Farneback-CPU. However, although Farneback-GPU requires less computational time, SPyNet provides a more robust and accurate optical flow, as shown visually in Fig. 9 and empirically in the original paper [28]. Thus, we use it to compute our optical flow maps.

4) Comparison with the state of the art
We compare here our results with previous works. On TUM-GAID (Tab. 3), we achieve state-of-the-art results with the network 'Student-Conv3D-B' (97% vs 96.4% [33]). Regarding CASIA-B-90, we have only found a comparable setup in [7], where a non-cross-view test setup is used. We use that setup because the goal of our approach is to produce a student model that, using gray information, is able to emulate the output of a master model trained with optical flow; thus, cross-view recognition is not the goal of this work. In this experiment, the best Rank-1 (case 'ResGray-SMP') obtained in [7] is 96, 91, 46 and 77.7 (nm, bg, cl, Avg), while our best model achieves 98, 75, 65 and 79 (Fig. 5: Student-GaitSet). Thus, on average, our approach achieves better results: only for the 'bg' case are our results lower than those of [7], while for the 'cl' scenario our results clearly outperform [7], yielding a better average accuracy.

Tab. 4: Execution time per input sample (25 frames of 640x480 pixels).

Method              | Time per input sample (s)
Farneback-CPU [34]  | 2.16
Farneback-GPU [34]  | 0.36
SPyNet [28]         | 1.94

V. CONCLUSIONS AND FUTURE WORK
This paper has introduced a new approach, named GaitCopy, that is able to mimic the behavior of optical flow-based networks for gait recognition while using gray-level pixel inputs. The student network learns to copy the gait signatures generated by the master network, but using a different input modality. We experimentally show on CASIA-B and TUM-GAID that, compared to the master networks trained on optical flow data, a similar accuracy can be obtained by the student networks trained on gray inputs, without the need for explicit optical flow computation and even using fewer network parameters (up to ×4.2 fewer).
As future work, we plan to investigate the performance of the model on other combinations of master-student modalities, such as depth maps or silhouettes, and in a multi-view setup.