Joint Metric Learning and Hierarchical Network for Gait Recognition

Gait recognition aims to identify a person by his or her walking pattern, and has valuable application prospects in identity authentication, public security, and medical diagnosis. At present, most gait recognition methods are designed around the human gait information itself. In contrast, this paper focuses on the feature space of gait feature learning, and optimizes that space by introducing general criteria that improve the discriminability of deep features. Recent literature has shown that optimizing the feature space benefits personal identification. Taking these insights together, we propose a novel end-to-end gait recognition method based on joint metric learning and a hierarchical network, which improves the discriminability of the learned features. The contribution of this paper is two-fold. First, a joint metric learning and hierarchical network realizes the learning of fine-grained information and accelerates discriminative feature learning. Second, to learn discriminative features, we use a fused loss function combining triplet loss and classifier loss. It guides the training process, pulls samples in the same set close to each other, pushes samples in different sets away from each other, and treats the silhouettes in each set as a whole. We evaluate the proposed method on two widely used gait datasets, i.e., CASIA-B and OU-LP-Bag. Experimental results demonstrate the effectiveness of the proposed method, which significantly improves gait recognition performance under cross-view and bag-carrying walking conditions, respectively. In particular, it achieves a Rank-1 accuracy of 93.2% on the OU-LP-Bag dataset, better than the existing state of the art.


I. INTRODUCTION
Gait [1] is a unique biometric feature that can identify a person at a distance, non-invasively, and without the subject's cooperation. Gait recognition aims to identify a person by his or her walking pattern. It therefore has valuable application prospects in identity authentication, public security, and medical diagnosis. Nonetheless, in real-world scenarios, the accuracy of gait recognition is easily affected by external factors such as camera viewpoint, walking speed, clothing, and carrying condition.
To alleviate these issues, recent works usually compress the whole gait silhouette sequence into one image, or preprocess it into a gait template, before extracting features [2]-[5]. These methods are easy to perform but lose temporal and spatial information. A better way is to extract gait features directly from the original gait silhouette sequences [6]-[8], for example with 3D-CNNs [9], [10]. Previous methods [6], [7] have used a loss function to improve feature discriminability, but because they use only one loss function, the classification effect is limited; as shown in Fig. 1, the combination of two losses is more effective than using triplet or cross-entropy loss alone, and with a single loss it is harder to obtain a well-trained model. Moreover, existing methods are mainly based on data processing and network design, are complex and difficult to execute, and do not effectively address the main difficulties in gait recognition, such as cross-view and carrying conditions. To date, few studies have focused on the spatial optimization of features. Therefore, we propose a joint metric learning and hierarchical network approach for gait recognition, which improves feature discriminability by improving the loss function of deep metric learning. Our framework consists of two modules, responsible for visual representation and high-level semantic metric learning, respectively. The visual representation module obtains higher-level feature maps and retains temporal and spatial information better than template-based methods. The high-level semantic metric module is designed to help the deep network extract both local and global features. After global pooling of the visual features, the feature embedding space is learned with two fully connected (FC) layers. We further propose an objective consisting of triplet loss and cross-entropy loss to accelerate discriminative feature learning.
It guides the training process, pulls samples in the same set close to each other, pushes samples in different sets away from each other, and treats the silhouettes in each set as a whole. This architecture therefore optimizes the gait features to reduce the intra-class distance and increase the inter-class distance, as shown in Fig. 2. Compared to using triplet or cross-entropy loss alone, the combination of the two losses is more effective; the comparison is shown in Sec. IV-D. The advantages of the proposed method are summarized as follows: • We focus on the feature space of gait feature learning and improve feature discriminability by improving the loss function of deep metric learning. This paper proposes a joint metric learning and hierarchical network for gait recognition, which can learn discriminative features.
• To learn discriminative features, we use a fused loss function combining triplet loss and classifier loss. It guides the training process, pulls samples in the same set close to each other, pushes samples in different sets away from each other, and treats the silhouettes in each set as a whole.
• We evaluate the proposed method on CASIA-B and OU-LP-Bag, two widely used datasets. The proposed method significantly improves recognition performance under cross-view and bag-carrying walking conditions, respectively. In particular, it achieves a Rank-1 accuracy of 93.2% on OU-LP-Bag, better than the existing state of the art.

II. RELATED WORK
A. MAIN METHODS OF GAIT RECOGNITION
Most current gait recognition methods enhance the discriminability of visual features. Zhang et al. [7] presented an auto-encoding method that separates appearance and posture features from RGB frames and then combines them with an LSTM to form a gait feature for a video segment. To extract discriminative gait features, Yu et al. [11] presented GaitGAN, a method based on generative adversarial networks. A multi-task generative adversarial network (MGAN) [3], [12] was presented for learning view representations. Template-generation methods compress the original gait silhouettes into a single image, or template, such as the gait energy image (GEI) [5] and the chrono-gait image (CGI) [13]. Among template matching methods, the view transformation model (VTM) [14] learns the projection between different views, and view-invariant discriminant projection (ViDP) [3] was proposed to mine the low-dimensional geometric structure of the data and learn a view-invariant discriminant projection matrix. Das et al. [15] presented a deep-learning-based method that accurately identifies affected frames and predicts missing frames to rebuild the complete gait period. Zhang et al. [8] proposed JUCNet, which combines the advantages of cross-gait and unique-gait representations of single gait images, together with a quintuplet loss that minimizes the intra-class difference while maximizing the inter-class difference. Although these methods achieve promising performance, they extract only motion features and neglect some discriminative regions, so the performance of appearance features is restricted. They also ignore differences in gait under certain conditions, such as carrying a bag or wearing a coat. Compared with those works, our model not only enhances discriminative ability but also achieves better performance.

B. DEEP METRIC LEARNING
In recent years, deep metric learning has become one of the most attractive research fields in machine learning, where effectively measuring the similarity between objects is the key problem. In open-set classification tasks such as face recognition and fingerprint recognition, the number of classes is often large while the number of samples per class is relatively small. In this case, classification methods based on deep learning often show limitations, such as a lack of intra-class constraints and difficulties in classifier optimization. These limitations can be addressed by deep metric learning.
Contrastive loss [16] marks the beginning of deep metric learning, as it first introduced deep neural networks into metric learning. Contrastive loss only constrains pairs within a class to have features as close as possible, while pairs between classes have features as far apart as possible. Triplet loss [17], [18] additionally considers the relative relationship between intra-class and inter-class pairs on the basis of contrastive loss. Since the triplet loss samples three examples at a time and, while accounting for both intra-class and inter-class distances and their relative relationship, does not make full use of all samples in each batch during training, [19] proposed establishing dense pair-wise connections within one batch. Sohn [20] attributed the slow convergence of contrastive and triplet loss to mining only one negative sample at a time during training and the lack of interaction with other negative samples, and therefore proposed the multi-class N-pair loss. Liu et al. [21] proposed a coupled cluster loss, in which the center of the positive samples is computed first and a triplet is then constructed from the sample center, the positive samples, and the hardest negative sample. Kumar et al. [22] proposed a global loss, and Wen et al. [23] proposed a center loss that learns a clustering center for each category of feature. Ustinova and Lempitsky [24] proposed a histogram loss, which minimizes the probability that a randomly sampled negative pair is more similar than a randomly sampled positive pair.
The penultimate-layer features of a deep neural network, combined with a classifier trained with softmax and cross-entropy loss, can be applied to many tasks based on deep metric learning [25]. The classifier is designed to find the best classification surface, while metric learning is designed to learn a feature embedding in which samples of the same category are compact and samples of different categories are far apart. This prompts us to explore the relationship between metric learning and classifier training. We find that a classifier trained by combining the triplet loss with the cross-entropy loss is also suitable for many tasks based on deep metric learning: it further reduces the intra-class distance, enlarges the inter-class distance, and improves the discriminative performance of the algorithm.

III. PROPOSED METHOD
A. OVERVIEW
In this work, we propose a gait recognition method based on joint metric learning and a hierarchical network. The proposed method extracts gait features directly from the original gait silhouette sequences. Because gait is a periodic motion, a silhouette sequence can be represented by a single gait period, in which the silhouette at each phase has a unique appearance. We therefore assume that the appearance of a silhouette contains its position information. Under this assumption, Chao et al. [6] proved the validity of frame disorder within a sequence, and our method builds on this basis.
A deep convolutional network can obtain good visual representations given plenty of labeled data and an appropriate metric function; the metric function is particularly important once the training data are fixed. However, recent works often neglect reducing the intra-class distance, which hinders the learning of discriminative features and weakens feature robustness. Based on this, we design a new metric function that combines cross-entropy loss and triplet loss. The cross-entropy loss guides the network to learn the features of each ID. However, gait is a process of motion, and gait may vary only slightly among different people, which makes learning discriminative features difficult. To this end, we follow existing work and introduce the triplet loss, which pulls samples of the same ID closer in the feature space and pushes samples of different IDs further apart, thus improving the discriminative ability of gait features.
The input of the framework is a set of gait silhouettes. First, the first-level visual representation module extracts frame-level features of each silhouette independently and aggregates them into global features. Second, the high-level semantic metric module learns from the extracted visual features and preserves the spatio-temporal information.

B. VISUAL REPRESENTATION
The visual representation module aims to obtain the feature representation of a gait sequence. It extracts the features of each silhouette in the sequence and aggregates them into the global features of the sequence. Inspired by [26]-[30], we note that in a CNN the receptive field determines the size of the input region corresponding to an element of a given layer's output, and in a hierarchical CNN different layers have different receptive fields: the more layers, the larger the receptive field and the better the performance. In this situation, shallow features pay more attention to fine-grained details, while deep features contain more global, coarse-grained information. For this reason, we propose a multi-level CNN visual representation extraction module, detailed in Table 1. The first-level CNN extracts frame-level features of each silhouette independently; the second- and third-level CNNs aggregate the frame-level features into independent sequence-level features to obtain the visual features.
Given any gait dataset, suppose there are N identities, each indexed by i (i = 1, 2, . . . , N); one or more sequences corresponding to identity i can be denoted X_i = {X_i^j | j = 1, 2, . . . , n}, where n denotes the number of frames in a sequence. The task of obtaining visual features can be expressed as H(C(X_i)), where C is the convolutional network that extracts the frame-level features of each gait silhouette independently, and H aggregates a group of frame features into global features through the hierarchical convolutional network, thus forming higher-level visual features. The formula is V = H(Z), where V is the visual feature and Z = {z_j | j = 1, 2, . . . , n} are the frame-level features. It is worth noting that when a sequence is taken as a set input, there is no spatial position relationship between the gait features in the set, which is expressed as H(Z) = H(π(Z)), where π denotes any permutation of the frame features. Since the number of gait silhouettes of a person is arbitrary in real-life scenes, the function H should accept a sequence of any cardinality. To improve the performance of the visual representation module, we use global information to learn a visual attention map for each frame feature map and refine it. Therefore, we apply an element-wise maximum (frame-max pooling) operation across the set.
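The permutation-invariant aggregation above can be sketched as a frame-max pooling over a stack of frame-level feature maps. The following is a minimal illustration assuming PyTorch tensors; the function name and tensor shapes are illustrative, not the paper's actual code:

```python
import torch

def frame_max_pool(frame_feats: torch.Tensor) -> torch.Tensor:
    """Aggregate per-frame feature maps into one set-level feature map.

    frame_feats: (n, C, H, W) tensor of frame-level features z_j.
    Returns: (C, H, W) element-wise maximum over the n frames.
    """
    return frame_feats.max(dim=0).values

# Permutation invariance: shuffling the frames does not change the result,
# so H(Z) = H(pi(Z)) holds for this choice of H.
feats = torch.randn(10, 32, 16, 11)   # n=10 frames, C=32, 16x11 feature maps
perm = torch.randperm(10)
assert torch.equal(frame_max_pool(feats), frame_max_pool(feats[perm]))
```

Because the maximum is taken element-wise over the frame axis, the output is also independent of the number of frames n, matching the requirement that H accept a set of any cardinality.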

C. HIGH-LEVEL SEMANTIC METRIC
The high-level semantic metric module aims to learn from the extracted visual features while preserving the spatio-temporal information. To extract local and global features, the visual feature map is split horizontally; each part is then processed by max pooling and average pooling, and the two results are added element-wise. The horizontal split lets different strips describe different receptive fields at different scales, and different strips describe different spatial positions within each scale. At the same time, to make the appearance features discriminative, we design two fully connected (FC) layers with different constraint losses and propose a new learning strategy that joins cross-entropy loss and triplet loss. The first FC layer maps features into the discriminative subspace. The second FC layer performs ID classification, so as to improve the expressiveness of the discriminative subspace. Specifically, inspired by [31], we add two FC layers instead of just using a 1 × 1 convolution. The visual features have S scales. At scale s ∈ (1, 2, . . . , S), the visual feature map extracted by the hierarchical network is divided equally along the height dimension into 2^(s-1) strips, giving a total of Σ_{s=1}^{S} 2^(s-1) strips. Let each strip be V_{s,t}, where t ∈ (1, 2, . . . , 2^(s-1)) indexes the strip within scale s. Denoting the pooled visual feature by f, the global pooling is f_{s,t} = MP(V_{s,t}) + AP(V_{s,t}), where MP and AP represent max pooling and average pooling, respectively. MP is good at extracting highly discriminative information, while AP is good at extracting global information; integrating the two extracts both global and local information.
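The horizontal pyramid pooling above can be sketched as follows. This is a minimal sketch assuming a PyTorch feature map whose height is divisible by 2^(S-1); function names and dimensions are illustrative:

```python
import torch

def pyramid_pool(v: torch.Tensor, num_scales: int = 3) -> torch.Tensor:
    """Split a (C, H, W) feature map into 2^(s-1) horizontal strips per
    scale s, then pool each strip as f_{s,t} = MP(V_{s,t}) + AP(V_{s,t}).

    Returns: (sum_s 2^(s-1), C) tensor, one row per strip.
    """
    parts = []
    for s in range(1, num_scales + 1):
        strips = v.chunk(2 ** (s - 1), dim=1)   # split along the height axis
        for strip in strips:                    # strip: (C, h, W)
            mp = strip.amax(dim=(1, 2))         # max pooling over the strip
            ap = strip.mean(dim=(1, 2))         # average pooling over the strip
            parts.append(mp + ap)               # element-wise sum of MP and AP
    return torch.stack(parts)

feats = torch.randn(128, 16, 11)                # e.g. the 16 x 11 maps of Sec. IV-B
out = pyramid_pool(feats, num_scales=3)         # 1 + 2 + 4 = 7 strips
assert out.shape == (7, 128)
```

With S = 3 scales, the map is pooled once as a whole, once as two halves, and once as four quarters, so coarse global strips and fine local strips coexist in the output.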
Second, the first FC layer maps the feature f into the discriminative subspace, constrained by the triplet loss. The second FC layer takes the discriminative space output by the first FC layer and performs ID classification. We then need to consider the design and definition of the constraint functions. Most existing methods use the ID as the classifier label; in this case, the cross-entropy loss is L_c = (1 / (M S)) Σ_{i=1}^{M} Σ_{s=1}^{S} CE(ŷ_i^s, y_i), where M is the batch size, S is the number of vertically divided strips, y_i is the ground-truth label, ŷ_i^s is the predicted output of the image at strip s, and CE is the cross-entropy loss. The classifier is implemented as a fully connected layer whose output length is the total number of training IDs, and the ID classification loss is the cross-entropy error between the predicted labels and the ground-truth labels. However, the cross-entropy loss alone does not ensure that the learned features are discriminative. Gait silhouette information in particular varies greatly with walking conditions, i.e., the same ID varies greatly under different walking conditions. Therefore, we introduce the triplet loss, an end-to-end deep metric that pulls samples of the same ID closer in the feature space and pushes samples of different IDs further apart.
Given an image triplet, i.e., an anchor a, a positive p, and a negative n, the triplet loss is L_t = (1 / M) Σ_{i=1}^{M} max(D(f(x_a^i), f(x_p^i)) − D(f(x_a^i), f(x_n^i)) + m, 0), where f(x_a^i), f(x_p^i), and f(x_n^i) are the features of the anchor, positive, and negative samples, respectively, D(·, ·) is the Euclidean distance between features, and m is the margin between the positive and negative distances of the triplet function.
According to the above analysis, the triplet loss and cross-entropy loss are fused under the same gait recognition framework. The final objective function is L = λ L_c + L_t, where L_c is the cross-entropy loss, L_t is the triplet loss, and λ ∈ [0, 1] weights the two loss terms. When λ is set to 0, only L_t is considered; when λ is set to 1, both L_c and L_t are fully considered. The effect of λ is discussed in Sec. IV-D.
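The fused objective L = λ L_c + L_t can be sketched with PyTorch's built-in losses. This is a hedged illustration, not the paper's code: the feature dimension, number of IDs, and batch size are assumed for demonstration (the margin m = 0.2 and λ = 0.01 follow Sec. IV):

```python
import torch
import torch.nn as nn

cross_entropy = nn.CrossEntropyLoss()
triplet = nn.TripletMarginLoss(margin=0.2)   # m = 0.2, Euclidean distance by default

def joint_loss(logits, labels, anchor, positive, negative, lam=0.01):
    """lambda-weighted cross-entropy plus triplet loss: L = lam * L_c + L_t."""
    l_c = cross_entropy(logits, labels)      # ID classification term
    l_t = triplet(anchor, positive, negative)  # metric learning term
    return lam * l_c + l_t

# Illustrative shapes: batch of 8, 74 training IDs, 256-dim embeddings.
logits = torch.randn(8, 74)
labels = torch.randint(0, 74, (8,))
a, p, n = (torch.randn(8, 256) for _ in range(3))
loss = joint_loss(logits, labels, a, p, n)
assert loss.item() >= 0.0                    # both terms are non-negative
```

Setting `lam=0.0` recovers the pure triplet objective, matching the λ = 0 case discussed in the ablation.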

IV. EXPERIMENT
Our experiments consist of three parts. The first part is the experimental settings related to this method. The second part compares the proposed method with other advanced algorithms on CASIA-B and OU-LP-Bag datasets. The third part is ablation experiments on CASIA-B dataset.

A. SETTINGS
We use two public gait datasets, CASIA-B [32] and OU-LP-Bag [33], to evaluate our method. Their details are as follows. CASIA-B [32] is a large-scale, multi-view gait dataset; samples are shown in Fig. 3. It contains 124 subjects, each with 11 views (0°, 18°, 36°, . . . , 180°) and three walking conditions: normal walking (NM), walking in a coat (CL), and walking with a bag (BG). Each subject has 6 video sequences under NM, 2 under CL, and 2 under BG. With 11 views per walking condition, each subject has 110 video sequences. The dataset is divided into three settings according to the size of the training set. In the first setting, the training set comprises the first 24 subjects and the test set the remaining 100 subjects, called small-sample training (ST). In the second, the training set comprises the first 62 subjects and the test set the remaining 62 subjects, called medium-sample training (MT). In the third, the training set comprises the first 74 subjects and the test set the remaining 50 subjects, called large-sample training (LT). The test protocol is the same in all three settings: the first 4 sequences under NM form the gallery, and the remaining 6 sequences form the probe.
OU-LP-Bag [33] is currently the world's largest gait recognition database focusing on real-life carried objects. It consists of 62,528 subjects with seven distinct labels for the carrying status; samples are shown in Fig. 3. Each subject has up to three sequences (A1, A2, and A3): 60,450 subjects have the A1 sequence, 58,859 the A2 sequence, and 58,709 the A3 sequence. The A1 sequence may be captured with or without carried objects, while the A2 and A3 sequences are both without carried objects. Following the settings in [33], 58,199 subjects who have samples in the A1 sequence and in the A2 or A3 sequence were selected from the 62,528 subjects in the dataset. These 58,199 subjects were then randomly divided by subject ID into a training set of 29,097 subjects and a test set of 29,102 subjects. The test set was further divided into gallery and probe sets. To better reflect real scenes, there are cooperative and uncooperative settings. In the cooperative setting, the A2 or A3 sequence is placed in the gallery and the A1 sequence in the probe. In the uncooperative setting, the A1 sequence and either the A2 or A3 sequence are placed in the gallery, and the remaining sequence (A2 or A3) is kept in the probe.

B. IMPLEMENTATION DETAILS
The proposed method is implemented in PyTorch and trained on a TITAN Xp GPU. During training, the gait silhouettes are gray-scale images, and the input image size is 44 × 64. The silhouettes are aligned following [2]. Adam is chosen as the optimizer [40]. After two down-sampling operations, the feature map becomes 16 × 11. A batch of size P × K is selected from the training set, where P is the number of people and K is the number of training samples per person in the batch; here P is set to 8 and K to 16. The margin m for the triplet loss is set to 0.2, and the learning rate to 1e-4. On CASIA-B, due to the different training set sizes, we train our model for 50,000, 60,000, and 80,000 iterations under ST, MT, and LT, respectively. On OU-LP-Bag, we train for 50,000 iterations. During testing, for both query and gallery, frame-level features are extracted by the visual representation module and aggregated into the global features of the sequence.
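The P × K batching described above can be sketched as follows. This is an illustrative sampler under assumed data structures (a flat list mapping sample index to identity), not the paper's actual data loader:

```python
import random
from collections import defaultdict

def pk_batch(labels, P=8, K=16, seed=0):
    """Sample a P x K batch: P identities, K sequences per identity.

    labels: list mapping sample index -> identity id.
    Returns a list of P*K sample indices.
    """
    rng = random.Random(seed)
    by_id = defaultdict(list)
    for idx, pid in enumerate(labels):
        by_id[pid].append(idx)
    ids = rng.sample(sorted(by_id), P)          # P distinct identities
    batch = []
    for pid in ids:
        pool = by_id[pid]
        # sample with replacement if a subject has fewer than K sequences
        picks = rng.choices(pool, k=K) if len(pool) < K else rng.sample(pool, K)
        batch.extend(picks)
    return batch

toy_labels = [i // 20 for i in range(20 * 30)]  # 30 subjects, 20 sequences each
batch = pk_batch(toy_labels, P=8, K=16)
assert len(batch) == 8 * 16
```

Grouping each batch by identity in this way guarantees that valid anchor-positive-negative triplets exist within every batch, which the triplet loss requires.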
They are then sent to the high-level semantic metric module to obtain the final feature representation. Finally, each query is compared with all gallery samples using the Euclidean distance to compute Rank-1 recognition accuracy. Table 2 reports the performance of our approach and existing state-of-the-art methods on CASIA-B. The results are averaged over the 11 views, excluding the identical-view case. For example, the accuracy of the 18° probe is the average over the 10 gallery views other than 18°.
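The Rank-1 evaluation just described can be sketched as a nearest-neighbor match under Euclidean distance. A minimal sketch with assumed 2-D toy features:

```python
import torch

def rank1_accuracy(probe_feats, probe_ids, gallery_feats, gallery_ids):
    """Rank-1: each probe matches its nearest gallery feature by Euclidean
    distance; the match counts as correct if the identities agree."""
    dists = torch.cdist(probe_feats, gallery_feats)   # (n_probe, n_gallery)
    nearest = dists.argmin(dim=1)                     # closest gallery index
    correct = (gallery_ids[nearest] == probe_ids).float()
    return correct.mean().item()

# Toy example: two well-separated identities.
gallery = torch.tensor([[0.0, 0.0], [10.0, 10.0]])
g_ids = torch.tensor([0, 1])
probe = torch.tensor([[0.5, 0.5], [9.0, 9.5]])
p_ids = torch.tensor([0, 1])
assert rank1_accuracy(probe, p_ids, gallery, g_ids) == 1.0
```

In the identical-view-excluded protocol, the same function would simply be called once per probe/gallery view pair before averaging.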

C. COMPARISON WITH STATE-OF-THE-ART METHODS
1) RESULT ON CASIA-B
Gait information includes components parallel to the walking direction as well as components perpendicular to it. Although part of the gait information is easily lost at the parallel view (90°) and the vertical views (0° and 180°), Table 2 shows that the proposed method still performs well there. Compared with the current best GaitSet [6] algorithm, the accuracies at 90°, 0°, and 180° increase by an average of 1.8%, 2.4%, and 2.7%, respectively, and the average over 11 views improves by 0.8%.
The proposed method also improves under small-sample training (ST), with only 24 subjects, across the three walking conditions (NM, BG, CL), outperforming most recent work, e.g., AE [36], CMCC [34], CNN-LB [35], and GaitSet [6]. It improves by 2.8%, 4.4%, and 0.7% over the current best GaitSet [6] method. The reason is that, because our model takes the input as a set, it can exploit more silhouettes in the same batch than template-based methods (AE [36], CMCC [34], and CNN-LB [35]), and therefore obtains more temporal and spatial information. Compared with GaitSet [6], more discriminative gait features are acquired through metric learning, yielding a better-trained model. In addition, the overall average Rank-1 accuracy over the 11 views increases by 2.6% under the three walking conditions. To obtain fairer and more convincing results, we ran 6 groups of experiments and averaged them; the results are shown in Table 3.
Appearance change is a big challenge in gait recognition. Our method performs well on the BG sequences of CASIA-B. Table 4 shows that our method achieves very good results on the BG subset: it is better than the current best algorithm under ST, MT, and LT, is the best across all 11 views, and its average Rank-1 accuracy is 4.4%, 1.3%, and 1.5% higher than the other models, respectively. This also shows that the learned gait features are strongly invariant to major gait variations.
Some works [7], [35], [37], [38] also address cross-view conditions. Even when the training set size and data interference increase, our model remains the most accurate. Table 5 shows that under large-sample training (LT) with 74 subjects, across the walking conditions NM, BG, and CL and cross-view angles (0°, 54°, 90°, 126°), the average accuracy of our model achieves a gain of 7%.
Some works [7], [8], [35], [41], [42] also address appearance changes. Following the settings in the relevant literature, the training set covers 54° to 144° under the BG condition, and the test sets contain different combinations of probe views and appearance conditions (BG). Table 6 shows that our performance on the BG subset surpasses the current state-of-the-art JUCNet [8] method. Notably, owing to the challenge of appearance change, there is a large performance gap between BG and NM for all methods except ours, which again demonstrates that our gait features are strongly invariant to major gait changes.

2) RESULT ON OU-LP-BAG
Following [33], we report Rank-1 accuracy on the OU-LP-Bag dataset, currently the world's largest gait recognition database focusing on real-life carried objects. Rank-1 and Rank-5 results are reported in Table 7. Under the cooperative and uncooperative settings, our method achieves the best Rank-1 accuracies of 93.2% and 92.6%, respectively, outperforming most recent work by large margins; e.g., under the cooperative setting it surpasses PCA_LDA [47], SIAME [16], JITN [48], and ICDNet [49] by 52.4%, 43.4%, 18.8%, and 6.2% in Rank-1 accuracy, respectively. Note that JITN introduces a complex joint intensity transformer network and uses contrastive loss and triplet loss; however, contrastive loss only constrains intra-class pairs to be as close as possible and inter-class pairs to be as far apart as possible. Compared with JITN and ICDNet, our method achieves excellent performance over these recent works.
Across all the evaluations, our model achieves excellent performance, showing that it is highly robust in learning gait features.

D. ABLATION EXPERIMENTS
1) IMPACT OF METRIC LEARNING
To evaluate the effectiveness of the proposed method on CASIA-B, ablation experiments are conducted under the ST setting with three walking conditions and 11 views. We consider four settings with different objective functions: L_c, L_t, L_c + L_t, and λL_c + L_t. All results are shown in Table 8. These experiments examine whether the proposed method is suitable for feature learning from small samples, whether it performs well under all three walking conditions (NM, BG, CL), and whether the fusion of loss functions is effective. When only one loss is used, the triplet loss is more effective than the cross-entropy loss. From the results we draw some key observations. To optimize the joint training, the parameter λ is added to adjust the weight of the cross-entropy loss. By adjusting λ, the overall average Rank-1 accuracy improves by 1.7% compared with setting λ to 0. Under the NM, BG, and CL settings, the average Rank-1 accuracy increases by 2.1%, 2.5%, and 0.5%, respectively; in particular, on the BG sequences the Rank-1 accuracy improves by 2.5%. This indicates that the proposed method is suitable for feature learning from small samples, performs well under all three walking conditions, and that fusing the loss functions is effective.

2) IMPACT OF λ
We evaluate the influence of λ on Rank-1 accuracy under the ST, MT, and LT settings of CASIA-B for the BG condition. The impact of λ is shown in Fig. 4. Note that when λ is set to 0, only the triplet loss is considered; when λ equals 1, both the triplet and cross-entropy losses are fully considered. When λ is set to 0.01, the overall average Rank-1 accuracy improves by 1.4% compared with considering only the triplet loss; at the same time, under the ST, MT, and LT settings, the average Rank-1 accuracy increases by 1.4%, 1.3%, and 1.5%, respectively. It is clear that λ = 0.01 achieves the best performance, so we set λ to 0.01 in all experiments.

3) IMPACT OF POOLING
In this paper, to maintain the global representation while preserving its discriminative parts, we use both average pooling (AP) and max pooling (MP), combined rather than applied separately. Table 9 shows the results with different pooling strategies. MP performs better than AP in most cases. The reason is that AP takes into account all locations of a particular part, with every location contributing equally to the final part representation, so unrelated background patterns can easily degrade the representations generated by AP. Conversely, MP retains only the maximum response within the local view. We consider these two pooling strategies complementary, generating feature representations from global and local perspectives, and therefore integrate them into a unified model to take full advantage of both. The results in Table 9 demonstrate that mixing the two pooling strategies achieves better results than using either alone.

4) IMPACT OF GLOBAL FEATURE LAYERS
To evaluate the impact of the global feature layers G1 and G2, we conducted experiments on the CASIA-B dataset. The results are shown in Table 10. The combination of G1 and G2 achieves better performance. This is because shallow features pay more attention to fine-grained details, while deep features contain more global, coarse-grained information. Thus, our multi-level CNN visual representation extraction module, which has a structure similar to the CNN in the main pipeline, incorporates the features extracted at different layers into G1-G2.

5) IMPACT OF FULLY CONNECTION LAYERS
To help the deep network focus on features of different sizes and gather both local and global information, we adapt the high-level semantic metric module to the gait recognition task. We design two fully connected (FC) layers: the first FC layer maps each pooled feature into the discriminative space, and the second FC layer performs ID classification. According to [31], 1 × 1 convolutional layers can be used to reduce the dimensions of the column feature maps; however, in this paper a classifier is needed after the pooling, which a 1 × 1 convolutional layer cannot provide, so fully connected layers are used instead. To better understand the effect of the two FCs, we conducted the ablation experiments shown in Table 11. Two FCs outperform one FC by a large margin.
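The two-FC head described above can be sketched as follows. The dimensions (128-dim pooled strip features, 256-dim embedding, 74 training IDs as in the LT split) are assumptions for illustration, not the paper's stated sizes:

```python
import torch
import torch.nn as nn

class TwoFCHead(nn.Module):
    """FC1 maps a pooled strip feature into the discriminative subspace
    (constrained by the triplet loss); FC2 classifies it into training IDs
    (constrained by the cross-entropy loss)."""
    def __init__(self, in_dim=128, embed_dim=256, num_ids=74):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, embed_dim)   # discriminative embedding
        self.fc2 = nn.Linear(embed_dim, num_ids)  # ID classifier

    def forward(self, f):
        embed = self.fc1(f)       # used by the triplet loss and at test time
        logits = self.fc2(embed)  # used only by the cross-entropy loss
        return embed, logits

head = TwoFCHead()
embed, logits = head(torch.randn(8, 128))
assert embed.shape == (8, 256) and logits.shape == (8, 74)
```

At test time only the FC1 embedding is retained for Euclidean-distance matching; the FC2 classifier exists solely to shape the embedding during training.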

V. CONCLUSION
This paper presents a gait recognition method based on joint metric learning and a hierarchical network. Compared with existing methods, our method focuses on feature space optimization, extracts spatial and temporal information more effectively, and greatly improves the discriminability of the gait features. View variation and changes of carried objects, two main factors affecting gait recognition accuracy, are effectively alleviated in this paper. Experiments on two open benchmark gait datasets show that, compared with other state-of-the-art algorithms, our method achieves the highest recognition accuracy and shows good flexibility in complex environments. Gait recognition still faces open problems such as occlusion and dress change. In the future, we will focus on these problems and extend the applications of gait recognition, such as to person re-identification and behavior recognition.