Flexible Gait Recognition Based on Flow Regulation of Local Features Between Key Frames

The information contained in individual gait frames differs, as does the contribution of each frame to the recognition task. However, every frame receives the same degree of attention at the input layer, which prevents the network from focusing on key frames. We therefore propose a key frame extraction module based on information weighting, which lets the network pay more attention to high-contribution frames at the input layer and improves the extraction of distinctive features. Moreover, the range of motion differs across parts of the human body, so the temporal and spatial correlation of local features between silhouettes also differs. Based on this observation, we propose a Local Features Flow Regulation module that computes correlation coefficients for the local features of each silhouette and derives regulation coefficients from them. The regulation coefficients are applied to regulate the flow of local features, enabling the network to capture areas carrying more spatial and temporal information. Through the extraction of frame-level features and the interaction of local features between frames, the network can flexibly extract the most discriminative features from the global to the local level. During training, each horizontal part is trained separately; this training adjusts the regulation coefficients and makes the network more flexible and expressive. Our model performs well under cross-view and complex conditions on the CASIA-B dataset. Under the normal condition and the complex conditions (pedestrians with backpacks and in coats), the rank-1 accuracy of the proposed model is 95.1%, 87.9% and 74.0%, respectively, higher than the state of the art.


I. INTRODUCTION
Gait is a non-cooperative biometric feature that can be used to identify a pedestrian from several tens of meters away. It is difficult to disguise and requires no subject cooperation at test time. However, owing to changes in viewpoint, changes in pedestrian clothing and environmental covariates, gait recognition still faces many problems.
In the early stages, traditional gait recognition methods were built on templates [1]-[6]. Methods of this kind generate gait templates based on mathematical geometry or other principles, and a template is used to extract features from the gait sequence and produce an image for recognition. Template-based methods extract gait features directly and eliminate some covariates to improve recognition accuracy. However, gait templates fit poorly and are not flexible enough, so performance degrades significantly under cross-view recognition. In contrast, silhouette-based gait recognition methods do not assume a particular model of the human subject, but analyze the spatio-temporal shape and motion features of the silhouette images extracted from the sequences [7]. In addition, researchers have applied Bayesian probability and time-frequency signals of gait for verification [8], [9]. These are two very effective representations of gait; however, neither time-frequency representations nor probabilities can be easily integrated into a neural network. In contrast, Euclidean distance or cosine similarity is easily computed by matrix operations in a neural network, which is convenient for training. Gait recognition methods based on view transformation have been proposed to convert gait sequences from different views to the same view [10], to project gait sequences at different views onto a view-independent plane [11], or to exploit the correlation between gait sequences from different views [12]. Identification matching is then performed under the same view.
The view transform model solves the problem of cross-view recognition to some extent. However, it pays little attention to the temporal and fine-grained spatial information of gait sequences, so the model does not adapt well to complex environments.
In recent years, deep learning has been used extensively across computer vision. Convolutional neural networks are particularly strong at image feature extraction: the non-linear mapping of multiple convolutional layers can adaptively extract view-independent features [13]-[15], and the extracted features have strong expressiveness and generalization. Deep-learning methods achieve a 95% average cross-view rank-1 accuracy under normal walking conditions. Gait Energy Images (GEIs) [16] and gait silhouette sequences are used as input by many deep learning models, which extract features directly with a backbone network. Such input makes every frame equally important to the network, without considering the importance of each frame or attending to the local spatiotemporal correlation within the gait silhouette sequences. Long Short-Term Memory networks (LSTM) and 3-dimensional convolutional neural networks (3D-CNN) are utilized to extract temporal features of gait sequences [14], [17], but these networks are computationally inefficient and difficult to train. Moreover, current deep-learning gait recognition methods mainly extract global features from each image in the sequence. Such network structures cannot capture the correlation and spatiotemporal information of local features between frames, so the information in the gait sequence is not fully utilized.
Observing human walking, we notice that the range of motion of each horizontal part of the human body differs. This means the gait sequence of each horizontal part contains different amounts of spatiotemporal information: some parts provide a lot, others very little. For example, when walking, the feet swing the most, while the head is usually still. Therefore, the spatiotemporal information on the feet is more helpful for gait recognition.
Current gait recognition methods do not consider screening the input data, and there is also a lack of methods for analyzing the correlation between local features of gait silhouette frames and for extracting local features of the gait silhouette. In this paper, based on the above motivation and the current state of gait recognition research, we propose a flexible and simple network structure with three contributions: 1) To increase the contribution of key frames to feature extraction, the gait silhouette sequence is reweighted in the network by an attention mechanism based on the information each frame contains. 2) The gait silhouette of each frame is uniformly divided in the horizontal direction. After obtaining the correlation between the local features of a frame and the corresponding parts of other frames, the correlation coefficients are used to regulate the flow of fine-grained local features in each frame. The reorganized features are more flexible and more expressive, and the model becomes more stable after training on many samples.
3) The proposed method achieves state-of-the-art results on the CASIA-B dataset: the rank-1 accuracy of the proposed model is 95.1%, 87.9% and 74.0% on NM, BG and CL, respectively.
The proposed network can select key gait frames from the information in each frame, and can further capture the correlation and spatiotemporal information of local features between frames, finally extracting more discriminative features. The remainder of this paper is organized as follows: Section 2 reviews related work applied in the proposed method. Section 3 elaborates on the principles of each module in our network. Section 4 presents extensive exploratory, comparative and ablation experiments to verify the effectiveness, stability and generalization ability of the network. Section 5 summarizes the work of the paper.

II. RELATED WORK
Traditional gait recognition methods usually suffer from underfitting and cannot fully exploit the key information in gait sequences, so they cannot be applied to complex scenes. In this paper, we mainly build on methods based on deep learning.

A. EXTRACTION OF KEY INFORMATION IN SEQUENCE DATA
Video person re-identification is similar to gait recognition: it also identifies people from image sequences. Recently, the mainstream approach in video person re-identification extracts a feature vector from each frame of a video with a convolutional neural network [18], feeds these features to a Recurrent Neural Network (RNN) to extract temporal features, and finally max-pools or mean-pools the feature vectors of all frames to represent the video sequence. In 2018, Song et al. proposed the region-based quality estimation network [20] to assign high weights to images in which pedestrians are not occluded. This work selected the images containing more useful information and improved identification accuracy, demonstrating that the selection of key information helps identification.
The attention mechanism is very helpful in deep learning. In 2017, Hu et al. proposed Squeeze-and-Excitation Networks (SENet), which improved accuracy by modeling the correlation between feature channels and strengthening key feature maps [19]. Yang et al. proposed hierarchical attention networks for document classification [21], demonstrating the effectiveness of attention mechanisms in long-text classification. Inspired by their work, we propose attention-based key frame extraction through information weighting in gait sequences, which simply and efficiently extracts the key frames.

B. SPATIOTEMPORAL INFORMATION EXTRACTION BETWEEN SEQUENCE FRAMES
In 2006, Han et al. proposed Gait Energy Images (GEIs) [16] for recognizing human identities. This average-silhouette representation of a gait sequence is used as image input to various network models. The gait energy image retains some of the temporal motion features of the gait sequence, and feeding GEIs into the network as images saves substantial computing resources; however, it loses much of the spatio-temporal information between frames. GEIs are used as input to many deep-learning-based gait recognition networks.
Since 2016, convolutional neural networks have been widely used in gait recognition. Zhang et al. proposed a convolutional-network-based representation of gait characteristics named Deep Gait [22]. Shiraga et al. proposed GEINet, which uses GEIs as input to convolutional networks to extract features [13]. Wu et al. extensively studied cross-view gait recognition [14]. Deep convolutional methods take advantage of the highly non-linear mapping of convolutional layers and their many trainable parameters, making it possible to learn stable features that overcome view changes and environmental effects. Deep-learning-based gait recognition networks showed good performance across multiple views and environments, surpassing traditional gait recognition algorithms.
Gait sequences and GEIs [16] are used as inputs to many deep-learning-based gait recognition networks, but GEIs contain little spatio-temporal information. When the gait sequence itself is used as network input, the silhouette sequences must be fed into an LSTM [14] or 3D-CNN [17] in order, which takes substantial computation time and makes the network difficult to train. Moreover, these methods extract only the spatio-temporal information of the global features of images between frames, and do not capture the spatio-temporal information of local features.
In 2019, Chao et al. proposed GaitSet [15]. The authors assume that each frame carries its ordinal information within the gait sequence, so GaitSet does not take a gait sequence as input; instead, the input takes the form of a set, which simplifies the network structure, speeds up computation, and makes it easy to embed new modules. Images are randomly drawn from the gait sequence to form a set and fed into the network. The backbone of GaitSet consists only of convolutional layers. The set is specially handled when it enters and leaves each convolutional layer, specifically by reshaping the set's tensor. Inspired by GaitSet, our model also uses the set as input. GaitSet also proposed a method, named Set Pooling, for aggregating all gait frames into a single image.

C. METHODS OF HUMAN BODY FEATURE DIVISION
Region of Interest (RoI) extraction is generally used in object detection, and has often been used in re-identification to divide global features into local features [23], [24]. However, these methods generate a large number of unnecessary proposals, which slows computation.
In 2018, Sun et al. proposed evenly dividing the global person features in the horizontal direction and then training each part separately [25]. This method is simple: the pedestrian's global feature map is divided horizontally to extract local features. It achieves good results. We adopt this kind of feature division, and it is also used in the training part of our network, which brings the feature blocks at the end of the network into contact with those in the middle of the network.

III. METHOD
In this section, we describe the main modules we propose in the global network. The set of gait silhouettes is a four-dimensional tensor whose second dimension is the number of frames in the set. This tensor is fed into the network for forward propagation. During forward propagation, the feature map set is first input to the key frame extraction module to retrieve key frames. The feature map set is then sent to the proposed Local Features Flow Regulation module for flow regulation. Next, after multiple downsamplings in the convolutional layers, the feature map set is aggregated. The final feature is used to compute multiple loss functions with joint training. The architecture of the global network is shown in FIGURE 1.

A. KEY FRAME EXTRACTION BASED ON INFORMATION WEIGHTING
If the input set is randomly composed of framenum silhouette images, the data extraction is not targeted, and some unidentifiable and highly similar silhouettes are drawn into the set. This is detrimental to the extraction of temporal and spatial features, and it affects both the training of the network and the recognition accuracy of the final model. To solve these problems, we propose a key frame extraction module based on information weighting. A feature map is extracted for each frame inside the network, keeping the network end-to-end. Using a frame-level attention mechanism, a key score is predicted for each frame and weighted onto its original feature map.
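The frame-level scoring and weighting described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation; the two fully connected layers and the ReLU non-linearity between them are assumptions, and the convolution/max-pooling reduction is assumed to have produced the per-frame descriptors already.

```python
import numpy as np

def key_frame_weights(F, W1, b1, W2, b2):
    """Frame-level attention scores for a set of per-frame descriptors.
    F: (framenum, featdim) pooled per-frame feature vectors.
    W1, b1, W2, b2: parameters of the two shared fully connected layers.
    Returns softmax-normalised key scores S, one score per frame."""
    h = np.maximum(F @ W1 + b1, 0.0)      # first FC layer + ReLU (assumed)
    s = (h @ W2 + b2).ravel()             # raw key score s_i for each frame
    e = np.exp(s - s.max())               # probabilistic (softmax) operation
    return e / e.sum()

def apply_key_scores(feature_maps, S):
    """Weight each frame's feature map by its key score via broadcasting.
    feature_maps: (framenum, C, H, W)."""
    return feature_maps * S[:, None, None, None]
```

High-scoring frames are thus amplified and low-information frames suppressed before the backbone extracts further features.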
The feature map set F is first reshaped and then input into the key frame extraction module. F passes through a dimensionality-reducing convolutional layer and a max pooling layer to output F*. After being flattened, the feature map of each frame is fed into the same fully connected layers to calculate the key score S.
where W_k_c, W_k_fc1 and W_k_fc2 are the parameters of the convolutional layer, the first fully connected layer, and the second fully connected layer, respectively, and b_k_fc1 and b_k_fc2 are the bias parameters of the two fully connected layers. s_i is the key score of each frame, i ∈ {1, 2, 3, ..., framenum}. A probabilistic operation over the s_i yields S = [s_1, s_2, ..., s_framenum]. To multiply each key score in S by the corresponding frame feature map in F, the tensor must be expanded and the shape of S changed.

B. LOCAL FEATURES FLOW REGULATION
The purpose of the key frame extraction module is to extract features at the frame level; the purpose of the Local Features Flow Regulation module is to regulate the local features between frames. Building on the frame level, it performs a more subtle, local-level regulation of features. We note that the correlation between the corresponding local features of different frames can be used to compute correlation coefficients that regulate the local features of each frame. The correlation between frames is measured by projections between corresponding local features. Pr_a^a is the self-projection of a local feature vector a onto itself; Pr_a^b is the projection of a onto the corresponding local feature vector b of another frame. If the Euclidean distance between the two projections is large, the correlation between b and a is small; conversely, a small distance indicates high correlation. The information contained in the N randomly extracted images is limited, so it is necessary to extract the most discriminative features from the local parts of each frame. We therefore propose Local Features Flow Regulation, which uses the correlation between the local features of the frames to regulate the flow of local features in each frame, and then extracts the most discriminative features in each frame.

C. CORRELATION MATRIX CORRESPONDING TO LOCAL FEATURES BETWEEN FRAMES
The feature maps of all frames are horizontally divided into stripenum stripes, each of which has a correlation matrix. When calculating the correlation matrix, the tensor of features input to the module has 4 dimensions, (batchsize, framenum, stripenum, featurenum). The second and third dimensions are transposed, so the tensor's shape becomes (batchsize, stripenum, framenum, featurenum). To describe the calculation of the correlation matrix clearly, we ignore the first two dimensions and describe the calculation for a single stripe; that is, the input tensor becomes (framenum, featurenum). In the following explanation, we therefore focus on a two-dimensional matrix.
First, the self-projection of the same stripe in each frame is calculated, as presented in Equation 5, Equation 6 and Equation 7, where i denotes the i-th stripe, i ∈ {1, 2, 3, ..., stripenum}. To make subsequent calculations more convenient, the feature vector of the i-th stripe is unitized before it is used in the following calculations.
Equation 5 multiplies the corresponding elements in the i-th stripe to prepare for the self-projection calculation.

A^T = [a_1, a_2, ..., a_framenum]  (9)

Second, we calculate the mutual-projection matrix of the same stripe between frames using the following equations. The multiplication by 2 in Equation (11) is explained by the addition of the two matrices in Equation (10).
In essence, this obtains the cosine similarity of two different feature vectors. For ease of description, assume there are 3 frames, with α⃗, β⃗ and χ⃗ denoting the feature vectors of the i-th stripe of these 3 frames, and assume each feature vector has length 3. The specific calculation process is illustrated in the example of Equation (11). AB denotes the mutual-projection matrix of the i-th stripe between frames and is generated by multiplying A by its transpose. Finally, the two matrices are subtracted to obtain the correlation matrix of the i-th stripe: the mutual-projection matrix AB is subtracted from the self-projection matrix AA, giving the correlation matrix of the i-th stripe across frames, i.e. the difference between the self-projection and the mutual projection. Both the self-projection and the mutual projection are measured by cosine similarity.
The value of a self-projection is greater than or equal to that of a mutual projection, so the entries of the correlation matrix of the i-th stripe are greater than or equal to 0. In the actual calculation, the correlation matrices of all stripes are computed simultaneously and are denoted M.
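The per-stripe correlation matrix can be sketched compactly. On unitized vectors the self-projection of every frame's stripe equals 1 and the mutual projection is exactly cosine similarity, so the matrix is ones minus the cosine-similarity matrix; this sketch assumes that simplification.

```python
import numpy as np

def stripe_correlation(Ai):
    """Correlation matrix of one stripe across frames.
    Ai: (framenum, featurenum), the i-th stripe's feature vector per frame.
    Each vector is unitized first; the mutual-projection matrix AB is then
    the cosine-similarity matrix A A^T, the self-projection matrix AA is all
    ones, and the correlation matrix M = AA - AB has entries >= 0."""
    A = Ai / np.linalg.norm(Ai, axis=1, keepdims=True)   # unitize each frame
    AB = A @ A.T                                         # mutual projections
    AA = np.ones_like(AB)                                # self-projections = 1
    return AA - AB
```

Entries near 0 indicate highly correlated (static) stripes, while large entries indicate stripes that change strongly between frames.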

D. FLOW REGULATION COEFFICIENT DETERMINATION
The flow regulation coefficients of the local features are calculated from the correlation matrix M. The purpose of flow regulation is to extract and enhance the features of the stripes that change most between frames, thereby improving the extraction of spatial and temporal information. The goal is to extract the most discriminative features of the gait set.
For convenience, we again explain the calculation of the regulation coefficient for the i-th stripe. The main principle is to weaken the features on stripes with strong similarity across frames and enhance the features on stripes with weak similarity. The relevant factors are measured by three projection-difference statistics: the maximum projection difference max_D_i, the minimum projection difference min_D_i, and the mean projection difference mean_D_i. The regulation coefficient is not determined by the three projection differences alone: w_i is an adaptive parameter of the i-th stripe that makes the calculation of the regulation coefficient adaptive. The calculation process of the regulation coefficient is as follows.
where w_i appears in the numerator and adjusts the generation of the regulation coefficient c_i. When the denominator is too large or too small, w_i prevents c_i from becoming too small or too large; it increases the network's adaptability and helps avoid gradient explosion and vanishing. In Equation 13, c_i increases as max_D_i, min_D_i and mean_D_i increase. When these differences are large, the features of the i-th stripe differ greatly between frames, i.e. pedestrians have more motion information on that stripe, so the features of the i-th stripe are enhanced by c_i.

After the above calculation, the flow regulation coefficients of all stripes are obtained: c = [c_1, c_2, ..., c_stripenum]. Let h_f be the number of rows of each frame's feature map in f. The feature map of each frame in f is divided into stripenum stripes; each stripe corresponds to one regulation coefficient and contains h_f/stripenum rows. To put the regulation coefficients in one-to-one correspondence with the rows of each frame in f, each c_i is copied h_f/stripenum times, preserving the order of the coefficients. After some necessary shape changes to c, c and f are multiplied to regulate the feature flow of each stripe in f.
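The coefficient computation and row-wise broadcasting can be sketched as below. Equation 13 itself is not reproduced in this excerpt, so the way the three statistics and w combine here is only a placeholder that, like the paper's c_i, grows with max_D_i, min_D_i and mean_D_i; the off-diagonal masking is also an assumption.

```python
import numpy as np

def flow_regulation(f, M, w):
    """Regulate stripe features with coefficients derived from the
    correlation matrices.
    f: (framenum, h_f, width) feature maps.
    M: (stripenum, framenum, framenum) correlation matrices.
    w: (stripenum,) adaptive parameters.
    Returns the regulated features and the coefficients c."""
    stripenum, framenum, _ = M.shape
    rows = f.shape[1] // stripenum              # h_f / stripenum rows each
    off = ~np.eye(framenum, dtype=bool)         # ignore self-projection entries
    max_D = M[:, off].max(axis=1)
    min_D = M[:, off].min(axis=1)
    mean_D = M[:, off].mean(axis=1)
    c = w * (max_D + min_D + mean_D)            # placeholder for Equation 13
    c_rows = np.repeat(c, rows)                 # copy each c_i, order preserved
    return f * c_rows[None, :, None], c
```

Stripes whose features change strongly between frames thus receive larger coefficients and are enhanced.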

E. ASYMMETRIC CORRESPONDENCE OF LOCAL FEATURES
After the Set Pooling module, the feature set g is generated and horizontally divided into partnum regions. This division is the same as in the Local Features Flow Regulation module, differing only in the number of divisions. The horizontal division creates a local correspondence between the features at the two positions, and this correspondence can be adjusted by changing stripenum and partnum. Because of its flexibility, it is named asymmetric correspondence; it is shown in FIGURE 4. When partnum is less than stripenum, one part in g corresponds to multiple stripes in f_flow. When partnum is greater than stripenum, multiple parts in g correspond to the same stripe in f_flow. When partnum equals stripenum, one part in g corresponds to exactly one stripe in f_flow. In most cases this relationship is not an integer multiple.
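The correspondence can be made concrete with a small row-bookkeeping sketch. This is an illustration of the overlap pattern only; the row count and the assumption that it divides evenly into both stripenum and partnum are ours.

```python
def stripe_part_overlap(stripenum, partnum, rows):
    """For each part at the end of the network, list which stripes in the
    middle of the network cover the same feature-map rows.
    rows is assumed divisible by both stripenum and partnum."""
    stripe_of_row = [r * stripenum // rows for r in range(rows)]
    part_rows = rows // partnum
    return [sorted({stripe_of_row[r]
                    for r in range(p * part_rows, (p + 1) * part_rows)})
            for p in range(partnum)]
```

For example, with stripenum = 8, partnum = 6 and 48 rows, each part overlaps two adjacent stripes, so training one part adjusts two neighbouring stripes; with stripenum = partnum the mapping degenerates to one-to-one.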
In the training part of the network, a joint training method of classification and metric learning is adopted, and each part is trained separately. There are partnum parts in g, and a classifier is trained separately for each part. The metric training is organized the same way: each part has independent parameters. The asymmetric correspondence between f_flow and g lets the parts communicate with each other during training. This fuzzy correspondence improves the flexibility and stability of the network, as well as its ability to express discriminative features.

F. JOINT TRAINING OF MULTIPLE LOSS FUNCTIONS
During training, the network jointly trains a classification loss and a metric loss to make full use of the category labels, which are the only labels in the dataset. Joint training lets the model classify gaits correctly while expanding inter-class distances and reducing intra-class distances, improving the stability and robustness of the model.
Following the loss function settings in Beyond Part Models [25] and GaitSet [15], we calculate a separate metric loss [31], [32] for each part, and then sum the losses of all parts to form the total loss.
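The per-part joint loss can be sketched as below. The cross-entropy classifier term and a margin-based triplet term as the metric loss are common choices consistent with [25] and [15], but the exact loss forms and margin value here are assumptions, not the paper's stated configuration.

```python
import numpy as np

def cross_entropy(logits, label):
    """Softmax cross-entropy for one part's classifier output."""
    e = np.exp(logits - logits.max())
    p = e / e.sum()
    return -np.log(p[label] + 1e-12)

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Margin-based triplet (metric) loss on one part's embedding."""
    d_ap = np.linalg.norm(anchor - positive)
    d_an = np.linalg.norm(anchor - negative)
    return max(0.0, d_ap - d_an + margin)

def total_loss(part_logits, label, part_triplets):
    """Sum independent classification and metric losses over all parts;
    each part has its own classifier and its own triplet terms.
    part_logits: list of (num_classes,) arrays, one per part.
    part_triplets: list of (anchor, positive, negative) tuples per part."""
    loss = 0.0
    for logits, (a, p, n) in zip(part_logits, part_triplets):
        loss += cross_entropy(logits, label) + triplet_loss(a, p, n)
    return loss
```

Because the parts are summed rather than averaged jointly, each part's gradients flow back through its own classifier and embedding independently.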

IV. EXPERIMENT A. DATASETS AND TRAINING DETAILS
We conduct the experiments on the CASIA-B dataset [26], which contains 124 subjects. Each subject has 6 sequences of normal walking (NM), 2 sequences of walking with a bag (BG), and 2 sequences of walking in a coat (CL). Walking with a bag interferes with gait recognition, and wearing a coat camouflages the gait well and can cover large gait postures. Each walking condition is captured from 11 views (0°, 18°, ..., 180°), with 18° between adjacent views, so each subject has 110 gait sequences. CASIA-B covers various views and three different walking conditions, which makes it challenging in ways many other gait datasets are not.
The input to the network is a set of aligned silhouettes; the alignment follows the method proposed by Takemura et al. [27]. Silhouettes of size 64 × 44 are sampled from the training set in batches of P different subjects with K gait sequences per subject. In the experiments, P is set to 6 and K to 12, giving 72 gait sequences per batch. 30 silhouettes are randomly extracted from each sequence to constitute each set, so a batch contains 72 sets. FIGURE 5 shows gait data under the three conditions. Because the dataset is not officially divided into training and test sets, we use the divisions popular in current papers, mainly the division in GaitSet [15]: small-sample training (ST), medium-sample training (MT) and large-sample training (LT). In ST, the first 24 subjects (001-024) are used for training and the remaining 100 subjects for testing. In MT, the first 62 subjects are used for training and the remaining 62 for testing. In LT, the first 74 subjects are used for training and the remaining 50 for testing. In the test sets of all three settings, the first 4 sequences of the NM condition (NM #1-4) are kept in the gallery, and the remaining 6 sequences are divided into 3 probe subsets: the NM subset (NM #5-6), the BG subset (BG #1-2) and the CL subset (CL #1-2).
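The batch composition above can be sketched as a simple sampler. This is an illustration of the P × K × 30 structure only; the data-structure layout and the use of sampling with replacement for the 30 silhouettes (to tolerate short sequences) are our assumptions.

```python
import random

def sample_batch(subject_sequences, P=6, K=12, frames=30, rng=None):
    """Compose one training batch: P subjects, K gait sequences per subject,
    and `frames` randomly drawn silhouettes per sequence (one 'set').
    subject_sequences: {subject_id: [list of silhouette indices per sequence]}.
    Returns a list of (subject_id, silhouette_indices) pairs."""
    rng = rng or random.Random()
    subjects = rng.sample(sorted(subject_sequences), P)   # P distinct subjects
    batch = []
    for sid in subjects:
        for seq in rng.sample(subject_sequences[sid], K): # K sequences each
            batch.append((sid, rng.choices(seq, k=frames)))
    return batch                                          # P * K sets
```

With P = 6 and K = 12 this yields the 72 sets per batch used in the experiments.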
In testing, each sample q in a probe set Q is passed through the network to produce a final representation F_q. All samples in the gallery G are processed the same way to obtain each F_g. Euclidean distance is then used to compare F_q with each F_g, and rank-n results are reported. The learning rate lr is set to 0.0001. For small-sample training (ST) we train 60,000 iterations, for medium-sample training (MT) 70,000 iterations, and for large-sample training (LT) 80,000 iterations. All experimental results are averaged over the 11 views, excluding the identical-view case. All experiments are performed on a GTX 1080Ti graphics card.
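The evaluation protocol can be sketched as a nearest-neighbour search. This is a minimal sketch of rank-1 under the identical-view-excluded setting described above; the array-based interface is an assumption.

```python
import numpy as np

def rank1_accuracy(probe_feats, probe_ids, probe_views,
                   gallery_feats, gallery_ids, gallery_views):
    """Rank-1 accuracy by Euclidean distance between final representations,
    excluding gallery samples captured from the same view as the probe.
    All arguments are NumPy arrays; features have shape (n, dim)."""
    correct = 0
    for f, pid, view in zip(probe_feats, probe_ids, probe_views):
        mask = gallery_views != view                     # exclude same view
        d = np.linalg.norm(gallery_feats[mask] - f, axis=1)
        correct += int(gallery_ids[mask][np.argmin(d)] == pid)
    return correct / len(probe_ids)
```

Rank-n generalizes this by checking whether the true identity appears among the n nearest gallery representations rather than only the nearest one.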

B. REGULATION OF BLOCK CORRESPONDENCE
We run a large-sample (LT) experiment on the CASIA-B dataset to explore the regulation of block correspondence. At the middle and end of the network, f_flow and g are respectively subjected to horizontal division operations. This makes them correspond locally to each other, and the correspondence can be controlled by changing stripenum and partnum. The correspondence between partitions falls into three cases: stripenum smaller than partnum, stripenum equal to partnum, and stripenum larger than partnum. The three different correspondences are shown in FIGURE 6.
In the experiment, we fix partnum to 6 and change only stripenum to adjust the correspondence. To cover all three cases comprehensively, stripenum ∈ {2, 4, 6, 8, 16, 32}. For each stripenum, we run comparative experiments under the three conditions (NM, BG and CL). The experiments show that different stripenum values yield different rank-1 recognition accuracies.
In FIGURE 7, the dotted-line plot clearly shows how rank-1 changes with stripenum under the three conditions. The accuracy curves of NM, BG and CL lie at three different levels from high to low, but their trends are synchronous: under NM and BG the variation is small, while under CL it is larger. Rank-1 first increases and then decreases as stripenum grows, peaking at stripenum = 8. TABLE 1 lists the specific rank-1 of each stripenum under the three conditions.
When stripenum is less than partnum, one stripe contains multiple parts. During back-propagation of the separate training of each part at the end of the network, the coefficients and related parameters in the Local Features Flow Regulation module are adjusted; but one stripe has only one flow regulation coefficient, so multiple parts jointly control the regulation coefficient of a single stripe. This makes it hard for adjacent stripes to connect well, and the trained model is less effective. When stripenum equals partnum, the one-to-one correspondence between stripes and parts makes the network inflexible.
The experiments show that rank-1 is best when stripenum is larger than partnum. In particular, when stripenum is set to 8, one part in g contains two stripes in f_flow. Under this correspondence, the separate training of each part lets one part affect two adjacent stripes, connecting them better when adjusting the feature weights and the flow regulation coefficients. Setting stripenum to 8 does not make one part contain too many stripes. When stripenum is set to 16 or 32, the same part affects too many stripes, and the recognition accuracy of the model decreases. The three graphs in FIGURE 8 clearly show how rank-1 changes with stripenum under the three conditions: accuracy when stripenum is larger than partnum is generally higher than when stripenum is less than partnum.
The experiments in this section verify that the asymmetric correspondence of local features between the middle and the end of the network plays a positive role in adjusting the network parameters. Setting appropriate stripenum and partnum significantly improves recognition accuracy, especially under the most challenging CL condition. The experiments also verify the validity of the Local Features Flow Regulation module, which combines well with the separate training of each part: the module and the per-part training adjust each other.

C. COMPARISON WITH STATE-OF-THE-ART METHODS
In TABLE 2, although small-sample training uses only 24 subjects, our model achieves very good results. Under the NM and BG walking conditions, the rank-1 of our model is slightly higher than the highest rank-1 of current gait recognition models. The CL condition, in which a person wears a coat, camouflages the gait well and can cover large gait postures; under this complex condition, the rank-1 of our model is significantly higher than that of current gait recognition models. Our rank-1 exceeds the highest rank-1 of current gait recognition models in all views.
Comparative experimental results for medium sample training (MT) and large sample training (LT) are listed in TABLE 3 and TABLE 4, respectively. As the number of training samples increases, the rank-1 of our model improves steadily over current gait models. Especially under the complex CL walking condition with large sample training (LT), our rank-1 is higher than the best rank-1 of current gait recognition models under all views, and the mean rank-1 is 3.6% higher than the current best.
At the 36° and 144° views, the network can extract sufficient gait characteristics, so the rank-1 of these two views is higher. At the 90° view, much of the subjects' walking posture is occluded, and the network cannot extract enough features, which leads to a lower rank-1. At the 0° and 180° views, many gait features are occluded by the subject's own body, so rank-1 is the lowest among the 11 views.
The small sample training (ST) experiment shows that the proposed model achieves a rank-1 higher than the best rank-1 of current gait recognition models using only 24 training subjects, which proves the validity and good generalization ability of our model. The rank-1 under the three walking conditions with medium sample training (MT) and large sample training (LT) is also higher than the best rank-1 of current gait recognition models, which proves the stability of our network.

D. ABLATION EXPERIMENT
To verify the validity of each module and of their combination, we conduct large sample training (LT) on networks with each module alone and with different module combinations. The specific results are shown in TABLE 5.
Both KFS and LFFR outperform the single backbone, and LFFR outperforms KFS under the CL walking condition. These experiments verify the validity of KFS and LFFR individually.
Combining KFS and LFFR yields better results; especially under the CL walking condition, the recognition accuracy is greatly improved. The experiments demonstrate that combining the KFS and LFFR modules can further improve the recognition accuracy of the network.

V. CONCLUSION
In this paper, we propose a key frame extraction module based on information weighting and a Local Features Flow Regulation module. The frame weighting module enables the network to pay more attention to key frames at the input layer and improves the extraction of highly discriminative features. The Local Features Flow Regulation module controls the local channel size in each frame and flexibly adjusts the flow of local features, so that the network can extract the temporal and spatial features of the gait silhouette locally and increase its expressive capability. The Local Features Flow Regulation module and the individual training of each part adjust each other; this adjustment increases the flexibility of the network, which in turn improves its expressive ability and the accuracy of gait recognition. Compared to other networks, our network pays more attention to local features and the spatial and temporal relationships between them, thus mining more valuable information for identity recognition. Extensive experiments prove the effectiveness of the key frame extraction module and the Local Features Flow Regulation module. With proper stripenum and partnum selected, our algorithm achieves a significant improvement on the CASIA-B dataset: under all three walking conditions, the recognition accuracy of our network surpasses the current state of the art, and under the most complex walking condition, CL, the rank-1 of our network reaches 74%.
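The core idea of the Local Features Flow Regulation module — turning a correlation coefficient between corresponding local features of adjacent silhouettes into a coefficient that scales (regulates the flow of) those features — can be sketched as below. This is a simplified illustration only: the function name, the use of cosine similarity as the correlation measure, the sigmoid normalization, and the last-frame padding are our assumptions, not the paper's exact formulation.

```python
import numpy as np

def regulate_local_features(local_feats: np.ndarray) -> np.ndarray:
    """Hypothetical sketch of local features flow regulation.

    local_feats: array of shape (T, S, C) -- T frames, S horizontal
    stripes, C channels per stripe.
    """
    # Cosine similarity between each stripe and the same stripe in the
    # next frame: the "correlation coefficient" of local features.
    a, b = local_feats[:-1], local_feats[1:]
    num = (a * b).sum(axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + 1e-8
    corr = num / den                                   # shape (T-1, S)
    # Pad by repeating the last value so every frame gets a coefficient.
    corr = np.concatenate([corr, corr[-1:]], axis=0)   # shape (T, S)
    # Squash into (0, 1) to obtain a per-stripe regulation coefficient,
    # then scale the local features with it.
    reg = 1.0 / (1.0 + np.exp(-corr))
    return local_feats * reg[..., None]

feats = np.random.randn(30, 8, 128)   # 30 frames, 8 stripes, 128 channels
out = regulate_local_features(feats)
print(out.shape)                      # (30, 8, 128)
```

Because the regulation coefficient lies in (0, 1), stripes whose features change strongly between frames (low correlation) are attenuated differently from stable stripes, which is the flow-regulation behavior the module aims for.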
The proposed modules can also be embedded into any network that processes video streams, to extract the correlation and spatiotemporal information of local features between frames. In this paper, we use a fixed division method for the global features. This division is simple, but because it is a hard division, it results in less precise alignment, which is a limitation of the method. In the future, we will research an adaptive global feature division to improve our modules and further increase the flexibility of the network.