Batch Hard Contrastive Loss and Its Application to Cross-View Gait Recognition

Biometric person authentication comprises two tasks: the identification task (i.e., one-to-many matching) and the verification task (i.e., one-to-one matching). In this paper, we propose a loss function called batch hard contrastive loss (BHCn) for the deep learning-based verification task. For this purpose, we consider batch mining techniques developed in the identification task and translate them to the verification task. More specifically, inspired by batch mining triplet losses to learn a relative distance for the identification task, we propose BHCn to learn an absolute distance that better represents verification in general. Our method preserves the identity-agnostic nature of the contrastive loss by selecting the hardest pair of samples for each pair of identities in a batch instead of selecting the hardest pair for each sample. We validate the effectiveness of the proposed method in cross-view gait recognition using three networks: a lightweight input, structure, and output network we call GEI + CNN (Gait Energy Image Convolutional Neural Network) as well as the widely used GaitSet and GaitGL, which have sophisticated inputs, structures, and outputs. We trained these networks with the publicly available silhouette-based datasets, the OU-ISIR Gait Database Multi-View Large Population (OU-MVLP) dataset and the Institute of Automation Chinese Academy of Sciences Gait Database Multiview (CASIA-B) dataset. Experimental results show that the proposed BHCn outperforms other loss functions, such as a triplet loss with batch mining as well as the conventional contrastive loss.


I. INTRODUCTION
Biometrics are significant assets for many applications such as surveillance, access control, and forensics. Biometrics have a substantial impact on society, which is why they draw the research community's attention. Many biometrics are available, such as the face, finger veins, fingerprints, voice, iris, and gait. Each biometric has its advantages and disadvantages that should be considered for an application. Using gait as a biometric is highly suitable for applications such as surveillance because gait biometrics are observable even at a distance, unobtrusively, and without requiring the subject's cooperation. At the same time, gait is harder to hide because of its behavioral rather than physical nature. Anyone can hide their face, for example, by simply wearing a mask. Almost all recent research on this topic uses deep neural networks (DNNs). Researchers have improved many aspects of DNNs, such as the network structure, pre/post-processing, sampling, and loss functions. Among these, the loss function is one of the most essential components of a DNN, and suitable loss functions depend on the target scenario.
We note there are two main scenarios that use biometrics: identification and verification. In an identification scenario, given an instance of biometrics, the goal is to determine the best match in a gallery of biometric instances, i.e., a one-to-many comparison. In the case of gait, a police officer might want to re-identify a suspect from one closed-circuit television (CCTV) video in another CCTV video that contains many other non-suspect gaits. For identification, the triplet loss [1] is widely adopted [2], [3], [4], [5], [6], [7], [8], [37]. The triplet loss operates on three embeddings (i.e., sample points in the discriminative feature space), which are the outputs of the DNN, where exactly two out of the three input samples have the same identity (i.e., an anchor sample, a positive sample, and a negative sample). This loss penalizes the relative distance between the distance of a positive pair (the anchor and positive embeddings) and that of a negative pair (the anchor and negative embeddings) to make the positive pair relatively closer than the negative pair. Even if the distance of a positive pair is large in absolute terms (i.e., larger than that of some other positive pair), the loss can be small as long as the distance of the negative pair is larger. This corresponds well with the identification scenario but not with the following scenario.
In verification scenarios, given two instances of biometrics, the goal is to determine whether these two instances are of the same identity or not, i.e., a one-to-one comparison. In the case of a verification scenario using gait, a police officer might want to compare the gait of the perpetrator to the gait of a suspect. For verification, the contrastive loss (Cn) [9] is more suitable than the triplet loss [2], [10], [11] and its variations have been proposed in [7], [12]. The contrastive loss is a pair loss, and therefore it handles the positive/negative pairs differently. To reduce this loss, the embeddings of the positive pair have to be as close as possible to each other (i.e., the distance should be ideally 0), whereas the embeddings of the negative pair have to be farther apart than a margin distance (i.e., greater than margin m). It is sufficient for a positive pair to be relatively closer than a negative pair in a triplet. In contrastive loss, the positive pair has to be absolutely close (i.e., closer than margin m at the least), and the negative pair has to be absolutely far (i.e., farther than margin m). We call this kind of distance learned with such loss an absolute distance, which corresponds well with the one-to-one matching.
In addition to the loss function itself, effectively sampling the data for training with a triplet or a pair loss dictates the performance of any deep learning model. We identified three main sampling methods: random sampling, sample mining, and batch sampling, as described below.

A. RANDOM SAMPLING
This sampling method is the simplest way to create a group of samples (e.g., pairs or triplets): samples are taken randomly from the training dataset [9], [13], [14], [15], [16]. However, after a few epochs of training, the network performance increases, and the embeddings of these random groups become more likely to be well separated in the discriminative feature space. The loss of such groups is then zero, which means they do not participate in updating the weights (i.e., the loss is inactive because the gradients of a zero loss are also zero). The number of these inactive groups gradually increases, and in the worst case, all groups could be inactive; therefore, training would require additional iterations to find active groups. Consequently, the training process may become slow because of these inactive groups with zero losses.

B. SAMPLE MINING
In this method, an algorithm searches the dataset to determine active groups, i.e., it mines the dataset, to obtain a more efficient training process. When each training iteration has an adequate number of active groups, the network learns a more discriminative space efficiently. Studies such as [17], [18], [19], [20] start with random initial samples and mine the dataset for the corresponding effective groups, i.e., those in which the positive pair embeddings are far apart and the negative pair embeddings are close to each other. The mining process requires running inference on the samples to obtain their embeddings and then searching for the most effective groups while discarding the others, which is computationally expensive. This creates a problem of assigning computational resources between mining and training.

C. BATCH SAMPLING
This sampling method does not base the mining process on random initial samples. It instead starts with a mini-batch and considers all available groups created in the batch. A mini-batch has K random samples from each of P random identities. After embedding, samples are selected based on the pair-wise distance matrix of all P × K embeddings of the mini-batch (see Fig. 1 for an illustration of this method). Current methods reduce the waste of mining and achieve better performance [3], [4], [5], [6], [7], [8], [11], [21], [22], [23], [24]. Most of these methods can be adapted to work with the triplet loss, where there is an anchor sample, a positive sample, and a negative sample. As shown in Fig. 1a, they follow batch-all sampling for triplets, i.e., each sample in the mini-batch is an anchor. Samples of the same identity as the anchor are candidates for the positive sample, and samples of a different identity from the anchor are candidates for the negative sample. The simplest approach is to compute the loss for all possible triplets. Alternatively, some methods [5], [8], [11] only use hard triplets, that is, the (+) and (−) in Fig. 1b, which illustrates such hard examples in a mini-batch.
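The starting point shared by these batch-sampling strategies can be sketched as follows; this is an illustrative NumPy snippet (the batch shape and random embeddings are toy values, not taken from any of the cited networks):

```python
import numpy as np

# Toy mini-batch: P identities, K samples each, embedded to d dimensions.
P, K, d = 3, 2, 4
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(P * K, d))

# Pair-wise Euclidean distance matrix of all P*K embeddings, from which
# batch-all or batch-hard strategies select their pairs or triplets.
D = np.linalg.norm(embeddings[:, None, :] - embeddings[None, :, :], axis=-1)
```

The matrix is symmetric with a zero diagonal, so the strategies below differ only in which of its entries are allowed to contribute to the loss.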
The studies mentioned above developed their sampling strategy according to the triplet loss, which is suitable for the identification scenario. By contrast, a batch-sampling method for verification is still missing from the literature, to the best of our knowledge.
We therefore propose a batch-sampling method for verification scenarios. Our batching method is heavily inspired by the progress on the triplet loss in the identification literature. However, there is a mismatch between the relative distance learned by the triplet loss and verification, which requires absolute distances. Therefore, we construct our method on top of the contrastive loss. Moreover, instead of using the batch all sampling method (Fig. 1a), our method uses hard sampling. However, the sample-based batch hard sampling method introduces an imbalance in the sampled identities, which is reflected in the asymmetric samples in the pair-wise distance matrix (Fig. 1b). This imbalance does not correspond well with the one-to-one comparison nature of verification. Our method instead samples the hardest pair for each pair of identities (as illustrated in Fig. 1c). We name this batching method, when coupled with the contrastive loss, batch hard contrastive loss (BHCn).
In addition to BHCn, we define and investigate the batch all contrastive loss (BACn), which is a natural extension of the state-of-the-art batch all triplet loss (BATr) obtained by replacing the triplet loss with the contrastive loss. Furthermore, an extensive evaluation is performed on two popular gait recognition datasets: the OU-ISIR Gait Database Multi-View Large Population (OU-MVLP) dataset [25] and the Institute of Automation Chinese Academy of Sciences Gait Database Multiview (CASIA-B) dataset [26]. Different network architectures are used: a light convolutional neural network (CNN), GaitSet [3], and GaitGL [4].
The outline of this paper is as follows: We summarize related work in Section 2, and describe the training network and loss for gait recognition in Section 3. In Section 4, we present the performance evaluation for gait verification, and discuss the proposed loss in Section 5. Finally, we present our conclusions and discuss future work in Section 6.

II. RELATED WORK
In this section, we introduce each family of sampling methods for mini-batch construction and the rationale behind our proposed loss.

A. BATCH ALL SAMPLING
This family makes the most of the information in each batch, not by increasing the batch size but by constructing all possible pairs from a given batch [5], [21], [22]. A batch is generally composed of multiple identities, and each identity contains multiple samples (the number of samples per identity is sometimes the same among the identities, as in [3], [4], [5], [37]). Each sample in the batch is regarded as an anchor and then all possible positive pairs (i.e., the anchor and a sample of the same identity) and negative pairs (i.e., the anchor and a sample of a different identity) are composed. All the possible triplets for the anchor are further constructed by combining the positive and negative pairs that have a common anchor.
Some techniques apply different grouping strategies, for example, [23] applies batch all sampling to make negative pairs only, whereas [8] applies it to make positive pairs only. Moreover, Sohn [24] generalized the triplet loss to accept a positive pair and multiple negative embeddings at one time. The mini-batch contains 2n samples that form n pairs of positive samples from n different identities, i.e., each positive pair has a unique identity in the mini-batch. They then apply the loss on each positive pair and the n − 1 negative samples from the other pairs. Chen et al. [6] proposed quadruplet loss as a constraint to the triplet loss so that the minimum inter-class distance is more prominent than the maximum intra-class distance. They consider all possible triplets and quadruplets in a batch.

B. BATCH WEIGHTED SAMPLING
Instead of giving all groups the same importance, which could cause the model to become stuck in a local minimum, batch weighted sampling gives different groups different weights when contributing to the training gradient. Wu et al. [7] focus on the distances between samples in a batch. This approach treats each sample as an anchor, and for each anchor, it selects all positive embeddings but samples negative embeddings according to their distance to the anchor.

C. BATCH HARD SAMPLING
This family takes the most difficult samples in the batch for efficient training. Hermans et al. [5] propose a standard framework on a batch of P identities and K samples for each identity, i.e., PK samples in total. Once a sample is set as an anchor, the other samples are categorized into (K − 1) positive samples (i.e., the same identity) and (P − 1)K negative samples (i.e., different identities). They then select the hardest positive sample (i.e., the one whose distance to the anchor is the largest among the (K − 1) positive samples) and the hardest negative sample (i.e., the one whose distance to the anchor is the smallest among the (P − 1)K negative samples), as shown in Fig. 1b. Gao et al. [23] propose a fusion of the batch all and batch hard sampling strategies, i.e., employing batch hard sampling for positive pairs while employing batch all sampling for negative pairs.
Yuan et al. [11] propose a cascade-structure model in which only a percentage of each hardest negative and hardest positive pairs are forwarded to the next sub-model in the cascade. Therefore, the last model trains on the hardest positive and the hardest negative pairs. Song et al. [8] start their sampling method by considering all positive pairings in a batch and then selecting the hardest negative sample for each sample in the positive pairs. Yuan et al. [27] train using a quadruplet loss with exactly three samples of the same identity. They take the hardest positive pair and the hardest negative pair for each identity.

D. OTHER METHODS
There are some other studies that improve discrimination capability by designing more suitable loss functions. Lezama et al. [28] use the matrix trace norm to push embeddings of the same identity into a low-rank subspace and embeddings of different identities to be linearly independent, i.e., orthogonal. The columns of the intra-identity matrices are created from the embeddings of each identity in the batch, whereas the inter-identity matrix is created from all embeddings in the batch. However, this loss is set up to support a softmax loss, not to act as a standalone loss. Another approach [29], [30], [31], [32], [33], [34] reformulates the metric learning problem as a classification problem. These methods perform sampling in the same manner as classification, i.e., by feeding the data iteratively to the model for training. Wang et al. [29] proposed two reformulations of the contrastive loss and triplet loss into a classification problem by comparing the samples with class vectors instead of other samples. Other studies [30], [31], [32], [33], [34] proposed adding an angular margin-based modification to the cross-entropy softmax classification loss to better embed high-level features.

1) RATIONALE BEHIND BHCn LOSS
Our BHCn loss, which is based on the Euclidean distance, is suitable for the gait verification task for the following reasons. Many other biometrics often suffer from illumination variation (e.g., faces and contactless fingerprints) or measured intensity variation (e.g., low-contrast latent fingerprint images), and hence an intensity-normalized dissimilarity measure such as the cosine distance is more appropriate for them. By contrast, gait silhouettes are binary and do not suffer from illumination variation. This binary nature of gait silhouettes is why the Euclidean distance is effective for gait recognition. Moreover, target scenarios of biometrics fall into two main types of task, identification and verification, and suitable loss functions depend on the target scenario [2]. In identification scenarios, the relative distances of positive and negative pairs are important, and hence a triplet loss with an anchor, a positive sample, and a negative sample (or its variant) is often employed. In verification scenarios, however, the absolute distances of positive and negative pairs are important, and hence a contrastive loss is a sensible choice. Our proposed BHCn aims to apply the contrastive loss to the most meaningful pairs while preserving the identity balance in the batch; this is the rationale behind our method.

III. GAIT RECOGNITION WITH BHCn

A. NETWORK STRUCTURE
We design a network that takes a gait image as an input and outputs an embedding (i.e., a discriminative feature vector).
There are many gait representations that can be used as input. The two main representations are 1) a temporally compressed image over a gait cycle (i.e., a kind of gait template image) and 2) a sequence or set of gait images. The most common representation for the first category is the gait energy image (GEI) [35], which is generated by averaging the aligned silhouettes over one gait cycle. This representation is compact and lightweight yet effective and hence has been employed for a long time in the gait recognition community. We design a simple CNN that takes a GEI and is named GEI + CNN. The details of its network architecture are shown in Fig. 2.
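As a small illustration of how a GEI is formed, the following sketch averages toy binary silhouettes over one cycle (the 4 × 3 frames are made up for illustration; real size-normalized inputs are much larger):

```python
import numpy as np

# Two toy 4x3 binary silhouette frames standing in for one gait cycle.
cycle = np.array([
    [[0, 1, 0], [0, 1, 0], [1, 1, 1], [1, 0, 1]],
    [[0, 1, 0], [1, 1, 1], [0, 1, 0], [1, 0, 1]],
], dtype=float)

# The GEI is the per-pixel temporal average of the aligned silhouettes.
gei = cycle.mean(axis=0)
```

Pixels that are foreground in every frame stay at 1, whereas pixels covered during only part of the cycle take intermediate values; this is how the GEI encodes motion in a single image.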
The temporally compressed image, however, loses the fine-grained temporal information, and hence gait representations from the second category have become more popular recently. We therefore adopt GaitSet [3] from the second category. GaitSet directly feeds an unordered set of silhouettes to the network so as not to lose each frame's information. Moreover, GaitSet employs horizontal pyramid pooling, in which a gait silhouette is vertically divided into parts (i.e., horizontal stripes) at multiple scales so as to capture both local and global information. As such, GaitSet outputs multiple vectors corresponding to each horizontal stripe, and the loss function is computed for each output separately, unlike existing GEI-based networks (e.g., GEINet [36]), which output a single vector for the whole body.
The unordered image set does not contain the characteristics of gait that change over time, and hence, more recently, the sequence representation has become more popular [4], [37]. We also adopted GaitGL [4] from the second category so as to cover both set and sequence gait representations. GaitGL represents time as the third dimension and employs 3D convolutions to extract spatio-temporal features. Moreover, GaitGL applies these convolutions both on the whole body (i.e., globally) and on horizontal stripes of the body (i.e., locally) to capture both local and global spatio-temporal information. The final features are computed per horizontal stripe, and as such, like GaitSet, GaitGL outputs a multi-vector gait representation.

B. LOSS FUNCTIONS
We first introduce widely used relevant loss functions, namely the triplet loss and its extensions and the conventional contrastive loss, as preliminaries to make this paper self-contained. The proposed method, named BHCn, is then introduced as an extension of the contrastive loss [9].

1) TRIPLET LOSS AND ITS VARIANTS
The triplet loss is defined on a triplet of an anchor, a positive sample whose identity is the same as the anchor's, and a negative sample whose identity is different from the anchor's. More specifically, the triplet loss penalizes the triplet if the relative distance of the negative pair to the positive pair does not exceed a pre-defined margin m. Given the embeddings of the anchor, the positive sample, and the negative sample as y_a, y_p, and y_n, respectively, the triplet loss is defined as

L_Tr(y_a, y_p, y_n) = [m + D(y_a, y_p) − D(y_a, y_n)]_+,  (1)

where D(·, ·) is a distance function (typically, the Euclidean distance) and [·]_+ is a non-negative clipping function defined as [·]_+ = max(0, ·).
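A minimal NumPy sketch of this definition (the function name, margin value, and toy embeddings below are illustrative only):

```python
import numpy as np

def triplet_loss(y_a, y_p, y_n, m=0.2):
    # Hinge on the relative distance: the positive pair must be closer
    # than the negative pair by at least the margin m.
    d_pos = np.linalg.norm(y_a - y_p)
    d_neg = np.linalg.norm(y_a - y_n)
    return max(0.0, m + d_pos - d_neg)
```

Note that a triplet whose negative is already far away yields exactly zero loss; this is the "inactive group" problem discussed for random sampling.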
Hermans et al. [5] extend the triplet loss to the batch all triplet loss to leverage the available information in a batch. Assume that a batch is composed of P identities and each identity has K samples, i.e., PK samples in total. Denoting the embedding of the j-th sample of the i-th identity as y^i_j and the set of embeddings as Y = {y^i_j}, the batch all triplet loss is defined as

L_BATr(Y) = (1/|L^+|) Σ_{i=1}^{P} Σ_{a=1}^{K} Σ_{p=1, p≠a}^{K} Σ_{j=1, j≠i}^{P} Σ_{n=1}^{K} [m + D(y^i_a, y^i_p) − D(y^i_a, y^j_n)]_+,  (2)

where |L^+| is the number of nonzero triplet losses over all triplet combinations and 1/|L^+| is the active triplets averaging weight, i.e., the active losses are averaged and the inactive losses are discarded.
They also proposed the batch hard triplet loss, which, for each anchor, selects the positive and negative samples on which the anchor performs the worst in the batch:

L_BHTr(Y) = (1/|L^+|) Σ_{i=1}^{P} Σ_{a=1}^{K} [m + max_{p=1,...,K} D(y^i_a, y^i_p) − min_{j≠i, n=1,...,K} D(y^i_a, y^j_n)]_+,  (3)

where 1/|L^+| is the active anchors averaging weight.
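The batch hard selection above can be sketched as follows, assuming a (P, K, dim) batch tensor; the helper name, margin, and toy batches are ours, not the authors':

```python
import numpy as np

def batch_hard_triplet(Y, m=0.2):
    """Batch hard triplet loss sketch over a (P, K, dim) batch: for each
    anchor, take its farthest positive and closest negative, then average
    only the nonzero (active) anchor losses."""
    P, K, _ = Y.shape
    flat = Y.reshape(P * K, -1)
    ids = np.repeat(np.arange(P), K)
    # Pair-wise Euclidean distance matrix of all P*K embeddings.
    D = np.linalg.norm(flat[:, None, :] - flat[None, :, :], axis=-1)
    losses = []
    for a in range(P * K):
        pos = (ids == ids[a])
        hard_pos = D[a][pos].max()   # farthest same-identity sample
        hard_neg = D[a][~pos].min()  # closest different-identity sample
        losses.append(max(0.0, m + hard_pos - hard_neg))
    active = [l for l in losses if l > 0]
    return sum(active) / len(active) if active else 0.0
```

Averaging only the active anchors mirrors the 1/|L^+| weighting used throughout this section.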

2) CONTRASTIVE LOSS AND ITS VARIANTS
The contrastive loss is defined on a pair of embeddings. Unlike the triplet loss, which considers the relative distance, the contrastive loss considers the absolute distance of positive/negative pairs. This attribute makes the contrastive loss better suited for verification scenarios. Given a pair of embeddings y_i and y_j, the contrastive loss is defined as

L_Cn(y_i, y_j) = s_{ij} D(y_i, y_j)^2 + (1 − s_{ij}) [m − D(y_i, y_j)]_+^2,  (4)

where s_{ij} = 1 if y_i and y_j share the same identity and s_{ij} = 0 otherwise, and m is a margin for the negative pair.
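A sketch of this pair loss in the squared form of [9] (whether the terms are squared is a design choice that varies across the literature; the function name and margin here are illustrative):

```python
import numpy as np

def contrastive_loss(y_i, y_j, same_identity, m=1.0):
    # Positive pairs are pulled toward distance 0; negative pairs are
    # penalized only while they remain inside the margin m.
    d = np.linalg.norm(y_i - y_j)
    if same_identity:
        return d ** 2
    return max(0.0, m - d) ** 2
```

A negative pair already beyond the margin contributes zero loss, which is exactly the absolute-distance behavior that suits one-to-one verification.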
We then consider extending the contrastive loss so as to leverage all the available information in a batch. We propose a batch hard variant of the contrastive loss, analogous to the batch hard triplet loss. A straightforward application of the sample-based selection of the batch hard triplet loss (Fig. 1b) may lead to a problem: the hardest negative for all the samples may be biased toward a specific identity (i.e., a so-called lamb in the Doddington biometric zoo). The training result may consequently over-fit to the lamb.
The proposed method therefore selects the most challenging sample pair in an identity pair-wise way instead of a sample-wise way. As in the batch all/hard triplet loss case, assume that a batch contains P identities and each identity contains K samples, i.e., a total of PK samples. More specifically, given a pair of identities i, j ∈ {1, . . . , P} with i ≤ j, the method constructs the K^2 pairs of samples between the two identities and then selects the hardest sample pair (i.e., the pair with the largest contrastive loss) from the K^2 pairs. In summary, the BHCn loss is defined as

L_BHCn(Y) = (1/|L^+|) Σ_{i=1}^{P} Σ_{j=i}^{P} max_{a,b ∈ {1,...,K}} L_Cn(y^i_a, y^j_b),  (5)

where 1/|L^+| is the active identity pairs averaging weight.
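The identity pair-wise hard selection can be sketched as follows; this is our illustrative NumPy reading of the rule (single output vector per sample, squared contrastive loss, toy margin), not the authors' implementation:

```python
import numpy as np

def contrastive(d, same, m=1.0):
    # Squared contrastive loss on a precomputed distance d (illustrative margin).
    return d ** 2 if same else max(0.0, m - d) ** 2

def bhcn_loss(Y, m=1.0):
    """BHCn sketch on a (P, K, dim) batch: for every identity pair (i, j)
    with i <= j, evaluate the contrastive loss on all K*K sample pairs,
    keep only the hardest (largest) one, then average the active losses."""
    P, K, _ = Y.shape
    selected = []
    for i in range(P):
        for j in range(i, P):
            same = (i == j)
            hardest = max(
                contrastive(np.linalg.norm(Y[i, a] - Y[j, b]), same, m)
                for a in range(K) for b in range(K)
            )
            selected.append(hardest)
    active = [l for l in selected if l > 0]
    return sum(active) / len(active) if active else 0.0
```

Note that exactly one pair is selected per identity pair, so the number of selected pairs is P(P + 1)/2, independent of K.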

a: GROUPED BHCn
When the network structure outputs multiple embedding vectors (e.g., GaitSet and horizontal pyramid mapping (HPM) [38]), the loss function is usually defined for each embedding (i.e., each group) separately, and the losses are then summed. Given G embedding groups, where the g-th embedding is indicated by a subscript g, the grouped version of BHCn is defined as

L_gBHCn(Y) = (1/|L^+|) Σ_{g=1}^{G} Σ_{i=1}^{P} Σ_{j=i}^{P} max_{a,b ∈ {1,...,K}} L_Cn(y^i_{g,a}, y^j_{g,b}),  (6)

where 1/|L^+| is the active identity pairs averaging weight. To test the effectiveness of the proposed BHCn loss, we also define two batch all versions of it. First, the BACn loss is defined in the same way as BATr, by replacing the triplet loss in BATr with the contrastive loss, i.e., single-step averaging over the sample pairs of all pairs of identities:
L_BACn(Y) = (1/|L^+|) Σ_{g=1}^{G} Σ_{i=1}^{P} Σ_{j=i}^{P} Σ_{a=1}^{K} Σ_{b=1}^{K} L_Cn(y^i_{g,a}, y^j_{g,b}),  (7)

where 1/|L^+| is the active pairs averaging weight. Second, the BACn2 loss is defined by replacing the hard selection in BHCn with averaging over all pairs, i.e., two-step averaging, which consists of averaging over the sample pairs of each pair of identities and then averaging over the pairs of identities:
L_BACn2(Y) = (1/|L^+|) Σ_{g=1}^{G} Σ_{i=1}^{P} Σ_{j=i}^{P} (1/|L^+_{ijg}|) Σ_{a=1}^{K} Σ_{b=1}^{K} L_Cn(y^i_{g,a}, y^j_{g,b}),  (8)

where, for a pair of identities i and j, 1/|L^+_{ijg}| is the active sample pairs averaging weight in group g, and 1/|L^+| is the active identity pairs averaging weight.
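For contrast with the hard selection in BHCn, the two-step averaging of BACn2 can be sketched in the same toy setting (single output group, squared contrastive loss, illustrative names and margin):

```python
import numpy as np

def contrastive(d, same, m=1.0):
    # Squared contrastive loss on a precomputed distance d (illustrative margin).
    return d ** 2 if same else max(0.0, m - d) ** 2

def bacn2_loss(Y, m=1.0):
    """BACn2 sketch (single group): average the active contrastive losses
    over the K*K sample pairs of each identity pair, then average those
    per-pair values over the active identity pairs."""
    P, K, _ = Y.shape
    per_pair = []
    for i in range(P):
        for j in range(i, P):
            losses = [contrastive(np.linalg.norm(Y[i, a] - Y[j, b]), i == j, m)
                      for a in range(K) for b in range(K)]
            active = [l for l in losses if l > 0]
            if active:
                per_pair.append(sum(active) / len(active))
    return sum(per_pair) / len(per_pair) if per_pair else 0.0
```

Unlike the hard variant, every active sample pair contributes here, which makes the updates smoother but dilutes the influence of the hardest pairs.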

IV. PERFORMANCE EVALUATION

A. DATASETS

1) OU-MVLP DATASET
The OU-MVLP [25] is, as of now, the largest publicly available silhouette-based gait dataset. It includes 10,307 subjects, each captured from 14 camera views. For each view, each subject has two walking sequences, labeled ''00'' and ''01,'' which are assigned to the gallery and probe, respectively. As a result, the total number of sequences per subject is 14 × 2 = 28.
For evaluation, we use the same training/test split defined in [2], where there are 5,153 training subjects and the rest are the 5,154 test subjects.
2) CASIA-B DATASET

For evaluation, we use three training/test splits that are well known in the literature: small-sample training (ST), where the subjects labeled 1, . . . , 24 are used for training and the rest for testing; medium-sample training (MT), where the subjects labeled 1, . . . , 64 are used for training and the rest for testing; and large-sample training (LT), where the subjects labeled 1, . . . , 74 are used for training and the rest for testing.

B. PRE-PROCESSING
Both the OU-MVLP and CASIA-B datasets provide binary images of the extracted silhouettes at the original image resolution; hence, the apparent heights differ among subjects.
Because the networks take a set of fixed-height binary silhouettes as an input, the preprocessing registers and size-normalizes the extracted silhouettes based on the method of [2], [3]. One exception is the use of bilinear interpolation instead of cubic interpolation, because cubic interpolation returns negative pixel values at the silhouette edges, which are clipped when the image is converted to 8-bit storage. GEIs are then generated from the size-normalized silhouettes by averaging over one gait cycle. The image resolution of the size-normalized silhouettes (i.e., for the GaitSet and GaitGL networks) is 64 × 44 pixels, whereas that of the GEI (i.e., for GEI + CNN) is 128 × 88 pixels, to be compatible with the respective network architectures.

C. LOSSES
To evaluate our proposed method (BHCn), we compare it with different losses. The first loss is the state-of-the-art loss for gait identification, BATr. The second loss is the traditional contrastive loss (Cn). The last losses are the defined batch all contrastive losses, BACn and BACn2. For the experiments using GaitGL, we also compare the combined BATr and cross-entropy (BATr_CE) loss, as detailed in [4].
Although we wanted to compare the proposed loss with the batch hard triplet loss, the network did not converge under this loss.

D. SETUP
The batch size is set depending on the dataset and the network architecture, as presented in Table 1. The margin m for the triplet-type losses was 0.2, whereas that for the contrastive-type losses was 256. The network structure parameters of GaitSet and GaitGL were set depending on the dataset in the same manner as in their respective original papers [3], [4]. The Adam optimizer is used for training, with learning rates selected between 10^-6 and 10^-2. The number of training iterations is set depending on the network architecture and dataset by considering the number of network parameters and the number of available training samples, as described in Table 2.
As for the evaluation metrics, we consider the equal error rate (EER) for the verification scenario, i.e., the trade-off point between the false acceptance rate (FAR) and the false rejection rate (FRR) [39]. We also report the standard deviation of the FRR at the EER threshold, based on [40].
To mitigate the relatively large variation in accuracies due to the training/test split, which stems from the limited number of training/test samples in CASIA-B, we repeated each experiment 10 times and report the average EER and the FRR standard deviation at the EER over the 10 runs.
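For reference, the EER can be approximated from genuine (same-identity) and impostor (different-identity) distance scores as sketched below; the threshold sweep and function name are our illustration, not the evaluation code used in the paper:

```python
import numpy as np

def eer(genuine, impostor):
    """Approximate the EER by sweeping thresholds over the pooled distance
    scores and returning the error rate where FRR and FAR are closest
    (distances above the threshold are rejected)."""
    genuine = np.asarray(genuine, dtype=float)
    impostor = np.asarray(impostor, dtype=float)
    best_gap, best_eer = np.inf, 1.0
    for t in np.sort(np.concatenate([genuine, impostor])):
        frr = np.mean(genuine > t)    # genuine pairs wrongly rejected
        far = np.mean(impostor <= t)  # impostor pairs wrongly accepted
        gap = abs(frr - far)
        if gap < best_gap:
            best_gap, best_eer = gap, (frr + far) / 2.0
    return best_eer
```

In practice, a finer threshold grid or interpolation between the FAR and FRR curves is used; sweeping only the observed scores is the coarsest version.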

E. RESULTS

1) OU-MVLP
We report the EERs averaged over gallery views for the five benchmarks BHCn, BATr, BACn, BACn2, and the conventional Cn, as presented in Tables 3 to 5. The results reveal that our BHCn outperforms the other losses on all architectures and almost all views. GEI + CNN, GaitSet, and GaitGL trained with BHCn have average EERs of 0.66%, 0.50%, and 0.53%, respectively.

2) CASIA-B
We compare the accuracies for the same benchmarks as on OU-MVLP, as listed in Tables 6 to 8. BHCn outperforms the other losses in all settings for GEI + CNN, while it achieves second place for GaitSet and GaitGL. We consider that these results derive from a trade-off among the number of network parameters (i.e., the model complexity), the number of sampled pairs in each loss, and the training set size. GEI + CNN is a simpler network than GaitSet and GaitGL (i.e., its number of parameters is smaller), and hence the proposed BHCn, which uses a smaller number of sampled pairs than BACn, still works well. By contrast, GaitSet and GaitGL have more parameters, and hence the batch all variants, which use the full combination of pairs in a batch, work better, in particular for the CASIA-B dataset, whose training set is smaller than that of OU-MVLP.

F. ABLATION STUDIES OF THE SAMPLING METHODS
To verify that identity-based batch sampling is a better option for verification than other well-established batch-sampling methods, we conducted ablation studies on it. We show the EERs of BHCn with sample-based batch sampling and with identity-based batch sampling (the proposed method) in Table 9. We confirmed that identity-based batch sampling performs better than sample-based batch sampling.

G. SENSITIVITY ANALYSIS OF THE BATCH SIZE
The batch hard strategy used in our loss dramatically reduces the number of active samples (i.e., the samples that directly contribute to the loss function and update the network parameters). For P identities in a batch, our batch hard method selects P(P + 1)/2 active pairs, regardless of the number K of samples per identity. As shown in Table 10, for the same number of samples per identity K, the EERs steadily increase as the number of identities is doubled (i.e., PK = 512) and quadrupled (i.e., PK = 1,024). Moreover, we notice that K = 8 is a reasonable choice because it achieves the best or second-best EERs on average under a fixed total number of samples PK.
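The pair count stated above follows directly from counting identity pairs (i, j) with i ≤ j; a one-line check (the function name is ours):

```python
def active_pairs(P):
    # P positive pairs (i == j) plus P * (P - 1) / 2 negative pairs,
    # independent of the number K of samples per identity.
    return P * (P + 1) // 2
```

Doubling P roughly quadruples the number of selected pairs, whereas increasing K changes only which pair is selected, not how many.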

H. SENSITIVITY ANALYSIS OF THE LEARNING RATE
Because the learning rate impacts performance, we analyzed the sensitivity to the learning rate for GEI + CNN with the CASIA-B ST setting. As shown in Fig. 3, most methods have the best accuracy for learning rates between 10^-6 and 10^-5. Although the BATr loss seems less sensitive to the learning rate than the proposed BHCn, BHCn still performs better for learning rates between 10^-6 and 10^-5.

I. DISTANCE DISTRIBUTIONS
To evaluate how well each loss trains the network to be robust against view angles, we compared the L2 distance distribution for each view-angle pair on the training data of the OU-MVLP dataset using the GaitSet network, as shown in Fig. 4a. The distributions for the proposed BHCn are more consistent among view-angle pairs (i.e., the distributions overlap well) than those of the other losses (i.e., the distributions are diverse). Moreover, to determine how well the embedding generalizes to unseen subjects, we show the L2 distance distributions on the testing data in Fig. 4b. Similar to the training data, the proposed BHCn exhibits better consistency among view-angle pairs than the other losses.

V. DISCUSSION
This section discusses the advantage of the BHCn over its batch all counterpart.

A. CONTRASTIVE LEARNING: BATCH ALL
At the start of training, the embeddings lie randomly in the embedding space, and any loss can bring overall improvements. As the learning continues, most training data will be clustered appropriately, and thus their contrastive loss will be insignificant. In BACn, the positive pairs then contribute the most to the training loss, because fewer negative pairs contribute to the loss owing to the hinge. Thus, the batch all approach has an imbalance between positive and negative cases, which leads to rigid updates: the loss ensures the stability of the positive pairs and refuses to change for any negative ones.

B. CONTRASTIVE LEARNING: BATCH HARD
By contrast, batch hard sampling reduces this rigidness by reducing the constraints. The hard case yields a more considerable loss for the negative pairs than for the positive pairs, which is enough to push an embedding out of a stale configuration.
There is thus a trade-off between the rigid but stable batch all sampling and the flexible but less stable batch hard sampling. The batch all method requires a good starting point; otherwise, it is likely to become stuck in a local minimum. In contrast, the batch hard method requires much care in parameter selection, such as the number of identities per batch and the number of samples per identity. Most important of all is the learning rate, as it controls the strength of the deformation.

VI. CONCLUSION
In this paper, we argued for investing more research effort into gait-based verification scenarios because they are equally as important as identification scenarios in security applications. Owing to the massive interest in identification in the research community, high verification performance is attainable by modifying identification components so that they are more suitable for verification. Because the loss function is tightly coupled with the target task, we proposed BHCn to train deep embedding networks on verification tasks. This loss can replace the batch all (or batch hard) triplet loss not just in gait recognition, but in any task. We applied our proposed loss to cross-view gait recognition using the OU-MVLP and CASIA-B datasets. The experimental results demonstrated our proposal's superior performance compared with that of traditional verification losses, and we achieved state-of-the-art performance using GaitSet.
In this paper, we focused on the cross-view challenge of gait recognition because it impacts other challenges. However, many other challenges exist, such as clothing, carrying status, and walking speed. In addition, state-of-the-art networks such as GaitSet are designed with the identification task in mind. Redesigning such a network for verification, coupled with our loss design, is a promising direction for future work.