Prototype Memory for Large-scale Face Representation Learning

Face representation learning using datasets with a massive number of identities requires appropriate training methods. The softmax-based approach, currently the state of the art in face recognition, is not suitable in its usual "full softmax" form for datasets with millions of persons. Several methods based on the "sampled softmax" approach were proposed to remove this limitation. These methods, however, have a set of disadvantages. One of them is the problem of "prototype obsolescence": classifier weights (prototypes) of rarely sampled classes receive gradients too rarely, become outdated and detached from the current encoder state, and produce incorrect training signals. This problem is especially serious in ultra-large-scale datasets. In this paper, we propose a novel face representation learning model called Prototype Memory, which alleviates this problem and allows training on a dataset of any size. Prototype Memory consists of a limited-size memory module for storing recent class prototypes and employs a set of algorithms to update it appropriately. New class prototypes are generated on the fly using exemplar embeddings in the current mini-batch. These prototypes are enqueued to the memory and used as classifier weights for softmax classification-based training. To prevent obsolescence and keep the memory closely connected to the encoder, prototypes are regularly refreshed, and the oldest ones are dequeued and disposed of. Prototype Memory is computationally efficient and independent of dataset size. It can be used with various loss functions, hard example mining algorithms and encoder architectures. We demonstrate the effectiveness of the proposed model by extensive experiments on popular face recognition benchmarks.


I. INTRODUCTION
Face recognition is one of the most established technologies [1], [2] of computer vision. With the help of modern deep neural architectures [3] and large-scale training datasets [4], [5], current face recognition models [6]-[9] perform at superhuman levels [10] on the million-scale face identification and verification tasks [11]. However, there are still a number of open problems in face recognition, such as large pose and age variations [12]-[15], racial bias [16], unconstrained face image conditions [17], usage of heavy makeup, disguise [18], and face masks [19].
The most straightforward way to combat these problems is the construction of large and (preferably) balanced training datasets with sufficient diversity of face image variations. Indeed, the best performing models in popular face recognition benchmarks are trained on datasets containing hundreds of thousands [5], [8], [20] or even millions [21], [22] of persons.
Training on datasets of a million-person scale poses a set of challenges. One of them is the choice of training method. The first possible way to train a model on a dataset with a large number of persons is to rely on a triplet-based training scheme [21]. It is scalable, but usually slow and less accurate than alternatives like softmax-based methods [6], [7], [23].
Softmax-based methods consider face recognition as a classification task, where persons are classes, and the model is trained to classify an image as belonging to one of the classes in the training set. These methods are proven to perform well, but they have a problem with scaling. Computation and GPU memory requirements of these methods depend on the number of classes in the dataset. Small and medium-sized [24]-[26] datasets with fewer than 100,000 classes can be used for softmax-based methods efficiently, but training on datasets with millions of classes becomes too computationally demanding or even infeasible. To overcome the difficulties of large-dataset softmax-based training, several methods were proposed [5], [27]-[29]. These methods are based on the "sampled softmax" approach, where only a small subset of the total set of classes is used in the current training iteration. The advantage of this approach is that there is much less computation and GPU memory involved compared to the "full softmax" training. Only selected class prototypes (classifier weights) are moved to the GPU memory to perform a training iteration. All other prototypes stay in RAM or in other large-capacity memory. Whereas the "full softmax" approach has the advantage of more complete information available at the loss computation stage, comparable results can be achieved with "sampled softmax" using smart selection of sampled classes [27], [29], [30] and large enough training datasets [5].
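To make the scaling argument concrete, the short calculation below estimates the size of the classifier weight matrix alone. The numbers (10 million identities, 512-dimensional float32 prototypes, 100,000 sampled classes) are illustrative assumptions and are not taken from the paper; gradients, optimizer state and the logit matrix would add to these figures.

```python
def classifier_memory_gib(num_classes: int, dim: int, bytes_per_value: int = 4) -> float:
    """Size in GiB of a dense (num_classes x dim) classifier weight matrix."""
    return num_classes * dim * bytes_per_value / 2**30


# Assumed, illustrative sizes: 10M identities, 512-d float32 prototypes.
full = classifier_memory_gib(10_000_000, 512)   # "full softmax": every prototype on the GPU
sampled = classifier_memory_gib(100_000, 512)   # "sampled softmax": only a subset on the GPU

print(f"full softmax weights:    {full:.2f} GiB")
print(f"sampled softmax weights: {sampled:.2f} GiB")
```

Under these assumptions the sampled subset is 100 times smaller, which is the saving that "sampled softmax" methods exploit.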
"Sampled softmax"-based methods are good solutions for some problems in face recognition model training on large-scale datasets, but they also have a set of disadvantages. First, they still need to store prototypes of all classes in memory. For multi-million-scale datasets and large prototype size it could become an unnecessary bottleneck. Also it means that the dataset must be fixed during the training process, and no new classes could be added. Second, operations of sampled prototype transfer from RAM to GPU memory and back add extra computational overhead. Third, class prototypes, which are not sampled in the current iteration, do not get gradients, and over time become detached from the encoder and begin to represent their class incorrectly in the embedding space. It leads to inaccurate training signals and subsequent performance degradation. We refer to this situation as a problem of "prototype obsolescence". It is even more dangerous in large-scale datasets, where the frequency of sampling each individual class is very low.
In this paper, we propose Prototype Memory, a novel face representation learning model suitable for training on a dataset regardless of the number of identities in it. Prototype Memory is based on the idea of "online" class prototype generation. New class prototypes are approximated using groups of same-class exemplar embeddings in the current mini-batch, and stored in a memory of limited size. Prototypes in memory are used as classifier weights to perform softmax-based training. Prototypes are updated using gradients and also refreshed with recent exemplar embeddings when they are available. The memory is continuously filled with prototypes until it reaches its full capacity. At that point, to free the memory for new prototypes, the oldest ones are removed from the memory and disposed of.
Prototype Memory shares the advantages of other "sampled softmax"-based methods, but is free from many of their disadvantages. First, there is no need to store all class prototypes, only those in current Prototype Memory. This behavior allows us to use datasets with an unlimited number of persons and train in class-incremental style. Second, the proposed approach has no overhead on RAM to GPU memory transfer operations: all Prototype Memory operations can be performed entirely on GPU. Third, class prototypes, used for training, are continuously updated and kept up-to-date by means of gradients and refreshing algorithm. Outdated prototypes are disposed of. This way we ensure that class prototypes are closely connected to the current encoder state and correctly represent their corresponding classes in the embedding space. It helps to solve the problem of "prototype obsolescence".
We summarize the contributions as follows:
• We propose the Prototype Memory model for face representation learning, which can be used to efficiently train face recognition models on very large datasets.
• We demonstrate the effectiveness of the proposed model with extensive experiments on different face recognition benchmarks.
• We present algorithms of hard example mining and knowledge distillation suitable for Prototype Memory.

II. RELATED WORK

A. FACE RECOGNITION
Current state-of-the-art methods in the area of face recognition are based on the training of deep neural networks on large face image datasets [21], [39]. The usual pipeline of face recognition is to detect faces in an image with a face detector [40], [41], align and crop them, and then pass them through the encoder (a deep neural network) to get face representations in the form of L2-normalized vectors (embeddings). The encoder is trained to represent different face images of the same person close to each other in the embedding space (according to the cosine similarity measure), and face images of two different persons far from each other. There are two main ways to train an encoder to perform this task. The first one is to group images in pairs [42] or triplets [21] and use their embeddings to compute gradients, pushing them towards each other for same-class pairs, and apart from each other for pairs of different classes. The second way to train proper face encoders is to present the task of face recognition in the form of softmax-based image classification [10] and apply additional constraints to the resulting image representations [43], [44] to ensure they possess the necessary properties. Classifier weights in this approach can be viewed as class representatives (prototypes) in the embedding space, and exemplar embeddings are trained to move closer to their corresponding prototypes and away from the prototypes of other classes. With proper normalization, prototypes bear conceptual similarity to the class centers [45] in the embedding space. This approach is currently considered state-of-the-art and is represented by such methods as SphereFace [23], ArcFace [7], CosFace [6], CurricularFace [32] and many others [29], [33], [46]-[49].
There are also many complementary methods proposed to build better face recognition models by promoting desired properties of the produced face representations, such as robustness to noisy labels [8] and low image resolution [50], invariance to age [51] and pose [52], ability to mitigate racial bias [53] and domain imbalance [35], [54], to improve the fairness of representations [55]. There are also methods, proposed to overcome the problems with the situations of difficult face appearance variations, like deliberately disguised faces [31], [56] or faces in medical masks [57].
All these methods are used to improve face recognition models, but the main key to success is the usage of large training datasets. For example, datasets used to train state-of-the-art models in the MegaFace challenge [11] are composed of hundreds of thousands [20] and even millions [22] of persons. To use this kind of dataset with the softmax-based approach, some kind of softmax acceleration method is needed.

B. SOFTMAX ACCELERATION
Utilization of softmax function is currently the main approach to perform classification tasks in the field of machine learning. In case of small number of classes, softmax is very fast, but when the number of classes reaches millions, softmax computation becomes a bottleneck. Recent works propose a number of ways to accelerate the computation of softmax-based classification. One line of research is looking for the opportunities to accelerate softmax calculation with more efficient algorithms [58]- [60] and parallelization [5], [7], the other one is trying to approximate softmax [61], [62], split it into several parts [63], use softmax hierarchically [64], [65] or sample classes stochastically [66]- [69].
In the area of face recognition, the latter approach is the most promising. For example, a variant of "sampled softmax" approach, Selective Softmax [27] reduces the computation cost and memory demand of softmax computation by selecting only a subset of "active classes" for each mini-batch based on a set of dynamic class hierarchies, constructed on-the-fly. D-Softmax-K [28] randomly samples a subset of negative classes and uses two different loss functions for intra-class and inter-class training. Random Prototype Softmax and Dominant Prototype Softmax, proposed in [70], store a matrix of prototypes in the non-GPU memory and use different ways to select and move prototypes to the GPU memory to perform training in the bisample face recognition scenario. Hard Prototype Mining [29] is used to adaptively select a small number of hard prototypes, most similar to the samples in the mini-batch in each iteration. Only these prototypes are used in softmax calculation, thus accelerating the training.
Positive Plus Randomly Negative (PPRN) [5] is another softmax acceleration method. Its core idea is to keep a full classifier weight matrix in RAM and sample positive classes and a subset of random negative classes at each iteration to perform training. Provided with a large-scale training dataset, this method of class sampling achieves the results, comparable to the "full softmax"-based training methods. VOLUME 10, 2022
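The class selection step of such a scheme can be sketched as follows. This is a minimal illustration of "positives plus random negatives" sampling under our own assumptions, not the implementation of [5]; the function name `sample_classes` and the label remapping are hypothetical.

```python
import numpy as np


def sample_classes(labels, num_classes, num_sampled, rng):
    """Return class ids for one iteration: all positives in the batch plus random negatives."""
    positives = np.unique(labels)
    num_neg = max(num_sampled - positives.size, 0)
    candidates = np.setdiff1d(np.arange(num_classes), positives)
    negatives = rng.choice(candidates, size=num_neg, replace=False)
    return np.concatenate([positives, negatives])


rng = np.random.default_rng(0)
labels = np.array([7, 7, 42, 42, 99])          # ground-truth classes in the mini-batch
active = sample_classes(labels, num_classes=1000, num_sampled=10, rng=rng)

# Remap dataset-level labels to row indices of the small sampled classifier.
index = {int(c): i for i, c in enumerate(active)}
local_labels = np.array([index[int(y)] for y in labels])
```

Only the rows of the weight matrix listed in `active` would need to be moved to the GPU for this iteration; `local_labels` then serve as targets for the reduced softmax.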

C. WEIGHT IMPRINTING
In classic softmax-based training, classifier weights for all training classes are initialized with random values at the beginning of the learning process. However, with this kind of initialization all classes need to be determined before the training, and their set cannot be altered after that. To make it possible to insert new classes into the trained network, [71] proposed a method of weight imprinting. With this method, embedding vectors of the examples of the new class can be used to initialize weights for this class, thereby extending the classifier. To generate a weight for a new class, embeddings of its examples are averaged, and the resulting vector is L2-normalized and inserted into the classifier. Another closely related model is Prototypical Networks [72]. They were proposed for the task of few-shot learning and use a support set of several examples per class to approximate the prototype of this class. Prototypes are calculated in the embedding space using the mean embedding of the support set. Classification is then performed by finding the nearest class prototype. There are a number of ways to use and adapt prototype-based learning in different areas [73]-[75]. In the case of face recognition, class prototypes could be calculated using averaged and L2-normalized face image embeddings belonging to the same class.

D. EXEMPLAR MEMORY
Exemplar memory is a way to improve the neural network training by keeping examples or their embeddings in memory between iterations and use them to get better training information. Exemplar memory was used in [76] to store the features of target domain and enforce invariance constraints for the task of person re-identification. Another proposed variant of exemplar memory is called Cross-Batch Memory (XBM) [77]. It was used to improve embedding learning by keeping embeddings from past iterations and using them to find better hard examples and learn better embeddings. Memory-based Jitter [78] enhances intra-class diversity for the tail classes in the long-tail learning problem by keeping in a memory bank different versions of exemplar features from previous iterations. BroadFace [34] is another variant of exemplar memory, proposed for face recognition. BroadFace stores a large number of exemplar embeddings from previous iterations and uses them to compute better training signals for the classifier weight matrix updates. Since the exemplar embeddings are computed with different versions of encoder from different training steps, they are suffering from the problem of obsolescence. To prevent it, authors proposed compensation function: using the difference between prototypes at current iteration and at the iteration, when the embedding was enqueued to the memory, calculate the compensation value and apply it to the embedding. Compensation function is helpful for embedding obsolescence, but it cannot be applied to prevent prototype obsolescence, because in contrast with the BroadFace setting, in the "sampled softmax" situation there is no "actual" version of each class prototype at each iteration to get the compensation value from.

III. PROPOSED METHOD
In this section we propose Prototype Memory model for face representation learning. It combines the advantages of weight imprinting (online generation of class prototypes using groups of class exemplars), exemplar memory (keeping useful information between training iterations) and accelerated softmax methods (faster and more memory-efficient than "full softmax") and could be used to train face recognition architectures on datasets of any size without the problem of prototype obsolescence.

A. MOTIVATION
Current state-of-the-art face recognition models [7] use softmax-based training strategy. When the dataset is very large (millions of persons), this type of strategy becomes too memory and computationally expensive: for the training with "full softmax", all classifier weights should be placed in the GPU memory, and for each training example all classes in the classifier are needed to compute the loss function and gradients.
Current "sampled softmax"-based methods [5], [27]-[29], while solving some of the problems of "full softmax", still have the need to keep all classifier weights somewhere, and suffer from the problem of "prototype obsolescence": situations in which a class prototype has not received gradients for so long that it no longer represents its class correctly. This problem is unavoidable in current methods if just a subset of classifier weights is updated at each iteration.
One more problem we want to solve is the inability of the current face recognition methods to use training datasets in class-incremental fashion, when the dataset could be updated with new classes at any time, without training problems.

B. PROTOTYPE MEMORY
The scheme of face representation learning with the proposed Prototype Memory model is presented in Fig. 1. Prototype Memory includes a memory module and algorithms for prototype generation, prototype refreshing and prototype disposal.
At each training iteration, several images of some class are passed through the encoder network, producing L2-normalized embeddings. These embeddings are used to generate a new prototype for the corresponding class. New class prototypes are enqueued to the Prototype Memory module. Prototypes in this module are later used as the classifier weights to perform usual softmax-based face recognition model training with some appropriate loss function.
Prototypes in the Prototype Memory are updated with the gradients. If a class that is already placed in the Prototype Memory gets new examples in the current mini-batch, it also uses their embeddings to "refresh" its prototype and move it to the start of the queue. When the Prototype Memory is filled to its capacity, the oldest prototypes, which have not been refreshed with exemplar embeddings for many iterations and have become outdated, are dequeued and disposed of to free the memory slots for the new ones.
The Prototype Memory model has a set of hyperparameters:
• Prototype Size (D): the size of the prototype vectors. It must be the same as the size of the embedding vectors produced by the encoder.
• Memory Size (M): the number of prototypes that can be placed in the memory module.
• Number of images per class (k): the number of images used for prototype generation.
• Refresh Ratio (r): the hyperparameter that controls the strength of prototype refreshing.
A more detailed description of the Prototype Memory algorithms is presented below and in Fig. 3.

1) Prototype Generation
The prototype generation scheme is presented in Fig. 3a. A mini-batch containing images of several classes, with a fixed number of images per class (set by the hyperparameter k), is passed through the encoder network. The resulting embeddings for each person (identity) are averaged, and the average embedding is L2-normalized:
$$P_{new} = F_{norm}\left(\frac{1}{k}\sum_{j=1}^{k} x_j\right),$$
where $x_j$ is the j-th exemplar embedding produced by the encoder, k is the number of images used for prototype generation, and $F_{norm}$ is the L2-normalization function. The resulting vector $P_{new}$ is the new class prototype. New prototypes are calculated for each class in the mini-batch and then enqueued to the Prototype Memory (Fig. 3b).
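The averaging and normalization step can be sketched in NumPy as follows. This is a minimal illustration, assuming that the mini-batch is laid out so that every k consecutive embeddings belong to the same class; the helper names `l2_normalize` and `generate_prototypes` are ours.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit L2 norm."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def generate_prototypes(embeddings, k):
    """embeddings: (num_classes_in_batch * k, D); each k consecutive rows share a class.

    Returns one unit-norm prototype per class, shape (num_classes_in_batch, D).
    """
    groups = embeddings.reshape(-1, k, embeddings.shape[-1])
    return l2_normalize(groups.mean(axis=1))


rng = np.random.default_rng(0)
emb = l2_normalize(rng.normal(size=(8, 16)))   # stand-in for encoder outputs: 2 classes, k = 4
protos = generate_prototypes(emb, k=4)         # shape (2, 16)
```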

2) Prototype Update and Refreshing
The Prototype Memory module has a fixed size, set by the hyperparameter M. New prototypes are enqueued to the Prototype Memory (and later dequeued from it) in order of appearance in training. While the prototypes are in the Prototype Memory, they are used as the classifier weights for usual softmax classifier-based training with a margin-based face-recognition-specific loss function like ArcFace [7]. These prototypes receive gradients and are updated according to them.
When a class has new examples in the current mini-batch and already has a prototype in the Prototype Memory, then in addition to the usual gradient-based update it is also "refreshed" using the current mini-batch embeddings:
$$P_{upd} = F_{norm}\left((1-r)\,P_{mem} + r\,P_{new}\right),$$
where $P_{upd}$ is the updated prototype, $P_{mem}$ is the prototype from memory, $P_{new}$ is the prototype calculated with the embeddings in the current mini-batch, $F_{norm}$ is L2-normalization, and r is a hyperparameter (refresh ratio). Prototype refreshing is performed to keep the prototypes in the Prototype Memory up to date with the current encoder state. When a prototype is refreshed, it is moved to the start of the queue as if it were a new prototype.
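A small numerical illustration of refreshing is given below. It assumes that the refresh is a convex combination of the stored and the newly generated prototype, weighted by r and followed by L2-normalization; this form is consistent with the description (r = 0 leaves the stored prototype unchanged, larger r refreshes more strongly), but the function `refresh_prototype` is our sketch rather than the reference implementation.

```python
import numpy as np


def refresh_prototype(p_mem, p_new, r):
    """Blend the stored prototype with the fresh one and re-normalize (assumed form)."""
    mixed = (1.0 - r) * p_mem + r * p_new
    return mixed / np.linalg.norm(mixed)


p_mem = np.array([1.0, 0.0])   # prototype kept in memory
p_new = np.array([0.0, 1.0])   # prototype from the current mini-batch
p_upd = refresh_prototype(p_mem, p_new, r=0.2)
print(np.round(p_upd, 3))      # approximately [0.97, 0.243]
```

With r = 0.2 the refreshed prototype stays close to the stored one and is only rotated slightly towards the new estimate.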

3) Prototype Disposal
The prototype disposal scheme is presented in Fig. 3b. When the Prototype Memory is full and has no free slots for new prototypes, the oldest prototypes need to be disposed of. These are the prototypes that have resided in the memory for the longest time, have received no refreshing from their examples for a while and have become outdated. Most likely, the "real" prototypes of these classes, if computed using the current encoder state and the images of this class, would be far from these prototypes in memory. To free the memory space for the new prototypes, these prototypes are dequeued and disposed of. Freed memory slots are then used for new prototypes.
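The interplay of enqueueing, refreshing and disposal can be sketched with an ordered dictionary. This is a simplified CPU-side illustration under our own assumptions: gradient updates are omitted, the refresh is assumed to be a normalized convex combination, and the class `PrototypeMemory` with its `put` method is hypothetical. A practical implementation would more likely keep a fixed-size tensor on the GPU.

```python
from collections import OrderedDict

import numpy as np


class PrototypeMemory:
    """Fixed-capacity queue of class prototypes; the last item is the most recent."""

    def __init__(self, capacity, refresh_ratio=0.2):
        self.capacity = capacity
        self.r = refresh_ratio
        self.slots = OrderedDict()  # class id -> unit-norm prototype

    def put(self, class_id, p_new):
        """Insert or refresh a prototype; return ids of disposed (oldest) classes."""
        if class_id in self.slots:
            # Class already in memory: refresh instead of overwriting (assumed form).
            mixed = (1.0 - self.r) * self.slots[class_id] + self.r * p_new
            p_new = mixed / np.linalg.norm(mixed)
        self.slots[class_id] = p_new
        self.slots.move_to_end(class_id)  # treat as the newest entry
        disposed = []
        while len(self.slots) > self.capacity:
            old_id, _ = self.slots.popitem(last=False)  # drop the oldest entry
            disposed.append(old_id)
        return disposed


e = np.eye(4)
mem = PrototypeMemory(capacity=3)
for cid, vec in [("a", e[0]), ("b", e[1]), ("c", e[2])]:
    mem.put(cid, vec)
mem.put("a", e[3])            # "a" is refreshed and becomes the newest entry
dropped = mem.put("d", e[1])  # memory is full, so the oldest entry ("b") is disposed of
```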

C. SOLUTION TO PROTOTYPE OBSOLESCENCE
Prototype obsolescence problem of "sampled softmax"-based training methods is illustrated in Fig. 2a. Left part of the figure demonstrates a situation, when the class prototype is up-to-date, and the distance between prototype and class center (mean embedding for all class examples) is small.
Right part of the figure shows a situation after a number of iterations, during which this class has not been sampled, and the prototype was not updated. Encoder has evolved, and now it produces exemplar embeddings in different points of the hypersphere. This resulted in a new position of the class center. Class prototype, however, stayed in the same place, and now the distance between prototype and class center became large. This prototype became obsolete, it is not representing the class correctly in the embedding space anymore. Fig. 2b illustrates a similar situation, but for Prototype Memory. Instead of using the same old prototype from previous iterations, a new prototype is generated using exemplar embeddings in the current mini-batch. In this way it is kept up-to-date, even after a large number of iterations between the mini-batches, when the same class is sampled. Since new class prototypes are generated using up-to-date encoder, the distance between class centers and prototypes stays small, so the problem of prototype obsolescence is solved.

1) Group-based Mini-batch Sampling
For the prototype generation stage, to create prototypes closer to the corresponding class centers, we need to ensure that each class in the mini-batch has some minimum number of images k, sampled together. So, we need to sample images into the mini-batches accordingly. In this paper we utilize two different methods to sample mini-batches with groups of at least k images for each class:
• Group-based iterate-and-shuffle: First, images in the training dataset are randomly combined into single-class groups of the same size k. Then these groups are shuffled and iteratively sampled into the mini-batch. When all groups are used, the process is repeated. With this kind of mini-batch sampling, we ensure that all images in the dataset are sampled an approximately equal number of times.
• Group-based classes-then-images: First, a number of classes are sampled from the training dataset. Then for each class a group of k random images is sampled into the current mini-batch. With this kind of mini-batch sampling, we ensure that all classes in the dataset are sampled an approximately equal number of times.
These methods could be combined with algorithms of hard class mining [30] and hard example mining [31], or used together as parts of a composite mini-batch [56].
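Both group-based sampling strategies can be sketched in plain Python. The handling of edge cases is our assumption and is marked in the comments: leftover images that do not fill a whole group are skipped for the current pass, and classes with fewer than k images are sampled with replacement. The function names are hypothetical.

```python
import random
from collections import defaultdict


def index_by_class(labels):
    """Map each class id to the list of image indices that belong to it."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    return by_class


def iterate_and_shuffle_groups(labels, k, rng):
    """Split every class into random groups of k images, then shuffle all groups."""
    groups = []
    for idxs in index_by_class(labels).values():
        rng.shuffle(idxs)
        # Assumption: images that do not fill a complete group are skipped in this pass.
        groups += [idxs[i:i + k] for i in range(0, len(idxs) - k + 1, k)]
    rng.shuffle(groups)
    return groups


def classes_then_images(labels, k, classes_per_batch, rng):
    """Sample classes first, then k random images for each sampled class."""
    by_class = index_by_class(labels)
    chosen = rng.sample(sorted(by_class), classes_per_batch)
    # Assumption: sample with replacement so that small classes still yield k images.
    return [rng.choices(by_class[c], k=k) for c in chosen]


rng = random.Random(0)
labels = [0] * 8 + [1] * 5 + [2] * 4
groups = iterate_and_shuffle_groups(labels, k=4, rng=rng)               # 4 single-class groups
batch = classes_then_images(labels, k=4, classes_per_batch=2, rng=rng)  # 2 groups of 4 images
```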

2) Loss Functions
Prototype Memory is a model that is independent of the choice of the loss function [79]. Most loss functions suitable for usual softmax-based face representation learning can also be used to perform training with Prototype Memory. In our experiments, we use Large Margin Cosine Loss [6]:
$$L_{lmc} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s(\cos\theta_{y_i}-m)}}{e^{s(\cos\theta_{y_i}-m)}+\sum_{j=1,\,j\neq y_i}^{M}e^{s\cos\theta_j}},$$
where N is the size of the mini-batch, M is the size of the memory module, $y_i$ is the ground-truth class, $\theta_j$ is the angle between the i-th exemplar embedding and the j-th class prototype, s is a scale parameter, and m is the margin.
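For illustration, Large Margin Cosine Loss computed against the prototypes currently held in memory can be sketched in NumPy. The default values s = 64 and m = 0.35 are common choices for this loss and are our assumption, not settings reported in this section; `targets` are assumed to be the memory slot indices of the ground-truth classes.

```python
import numpy as np


def cosface_loss(embeddings, prototypes, targets, s=64.0, m=0.35):
    """embeddings: (N, D) and prototypes: (M, D), both L2-normalized; targets: (N,) slot indices."""
    cos = embeddings @ prototypes.T                        # cos(theta_j) for every memory slot
    logits = s * cos
    rows = np.arange(len(targets))
    logits[rows, targets] = s * (cos[rows, targets] - m)   # margin applies to the target class only
    logits -= logits.max(axis=1, keepdims=True)            # for numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[rows, targets].mean()


prototypes = np.eye(3)                 # toy memory with M = 3 orthogonal prototypes
embeddings = np.eye(3)[:2]             # two embeddings aligned with slots 0 and 1
easy = cosface_loss(embeddings, prototypes, np.array([0, 1]))   # correct targets
hard = cosface_loss(embeddings, prototypes, np.array([1, 0]))   # swapped targets
```

The loss is close to zero when each embedding coincides with its own prototype and large when the targets are swapped.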
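We also performed experiments with the D-Softmax [28] loss:
$$L_{D} = \frac{1}{N}\sum_{i=1}^{N}\left[\log\left(1+\frac{\epsilon}{e^{s\cos\theta_{y_i}}}\right)+\log\left(1+\sum_{j=1,\,j\neq y_i}^{M}e^{s\cos\theta_j}\right)\right],$$
where $\epsilon = e^{ds}$, d is a hyperparameter (optimization termination point), s is a scale parameter, N is the size of the mini-batch, M is the size of the memory module, $y_i$ is the ground-truth class, and $\theta_j$ is the angle between the i-th exemplar embedding and the j-th class prototype. Other similar loss functions [7], [32], [47] are applicable too.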

E. HARD EXAMPLE MINING FOR PROTOTYPE MEMORY
Prototype Memory could be used together with hard example mining [31] and hard class mining methods like Doppelganger Mining [30], DP-Softmax [70] or HPM [29]. Careful selection of hard classes and examples is used to decrease the number of required training iterations and to achieve better performance in difficult face recognition scenarios. Most methods could be used for Prototype Memory without modifications, but some of them need to be adapted to the "sampled softmax" scenario.

1) Multi-Doppelganger Mining
Doppelganger Mining [30] is a hard class mining method, proposed for the "full softmax" classifier-based face representation learning scenario. Its main idea is to maintain a list with the most similar persons for each person in the training set. These most similar persons are called "doppelgangers", and each person in the dataset has exactly one doppelganger in the list. Doppelgangers are updated at each training iteration, using softmax classification scores of non-target classes: the class with the largest score becomes the new doppelganger for the class whose exemplar is being classified. At the mini-batch generation stage, the doppelganger list is used to sample pairs of similar classes together.
Doppelganger Mining has demonstrated the ability to improve face recognition models, but it is based on the assumption that all training classes are simultaneously present in the softmax classifier at each training iteration. This is true for "full softmax"-based training, but false for models based on "sampled softmax". In the latter case, only a subset of classes is used in the classifier at each iteration, so there is no guarantee that the most similar class (the "global" doppelganger) is present in the current classifier. As the model is forced to update the doppelganger list at each iteration, it may overwrite the "global" doppelganger class in the list with a "local" one, found by searching only in the subset of currently sampled classes. When the total number of classes in the dataset is large and the number of sampled classes is small, there is a high chance that the doppelganger list will be filled with mostly "local" doppelgangers, missing the "global" ones, which are preferable.
To prevent this, we propose Multi-Doppelganger Mining, a modification of Doppelganger Mining adapted for use with "sampled softmax" models like Prototype Memory. With Multi-Doppelganger Mining, each class in the training dataset has a set of doppelgangers in the doppelganger list (instead of just one doppelganger per class as in the original version). Doppelgangers are added to this set using classifier scores in "sampled softmax": the non-target class with the largest classification score is added as a new doppelganger to the set. If a class in the doppelganger set is present in the current "sampled softmax" classifier but does not get the largest classification score, it is removed from the doppelganger set. At the stage of mini-batch generation, the doppelganger class is selected randomly from the doppelganger set.
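The update rule for the doppelganger sets can be sketched as follows. This is a minimal illustration of the add and remove logic described above; the data layout (a dictionary of sets and a per-example dictionary of scores) and the function names are our assumptions.

```python
import random
from collections import defaultdict


def update_doppelgangers(sets, target, scores):
    """Update the doppelganger set of `target`.

    scores: {class id: classification score} for the classes present in the
    current sampled classifier, including the target class itself.
    """
    non_target = {c: v for c, v in scores.items() if c != target}
    if not non_target:
        return
    best = max(non_target, key=non_target.get)
    # Classes that were present but did not get the largest score are removed.
    sets[target] -= set(non_target) - {best}
    sets[target].add(best)


def pick_doppelganger(sets, target, rng):
    """Choose a random doppelganger for mini-batch generation, or None if the set is empty."""
    candidates = sorted(sets[target])
    return rng.choice(candidates) if candidates else None


dsets = defaultdict(set)
update_doppelgangers(dsets, 1, {1: 0.9, 2: 0.5, 3: 0.1})   # class 2 is added
update_doppelgangers(dsets, 1, {1: 0.8, 4: 0.6, 5: 0.2})   # class 2 is absent, so it is kept; 4 is added
update_doppelgangers(dsets, 1, {1: 0.9, 2: 0.1, 6: 0.7})   # class 2 is present but not the top one: removed
partner = pick_doppelganger(dsets, 1, random.Random(0))
```

After the three updates the set for class 1 contains classes 4 and 6. Class 4 survives because it was never observed in a classifier where another non-target class scored higher.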

2) Hardness-aware Example Mining
In the "group-based classes-then-images" mini-batch generation strategy, when the classes are selected (randomly or using hard class mining algorithm), there is a need to select examples of these classes to be used in the mini-batch. While random sampling could be performed successfully in most cases, sometimes the number of examples per class is very large, and only a small portion of them possess enough utility for the training process.
To find the most useful examples and put them in the mini-batches more frequently, we propose a hardness-aware example mining method. Its main idea is to use cosine similarity scores, which are calculated between exemplar embeddings and class prototypes in the process of softmax-based classification, to measure the "hardness" of each example in the dataset, and then use these hardness values in the mini-batch generation stage to sample hard examples more frequently. Calculated hardness values are kept between training iterations and rewritten every time the example is classified again. The hardness value of exemplar i is calculated from $\cos\theta_{y_i}$, where $y_i$ is the ground-truth class for exemplar i, and $\theta_{y_i}$ is the angle between the exemplar embedding and the class prototype.
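One possible realization of hardness-aware sampling is sketched below. Both the hardness mapping (1 - cos θ)/2 and the sampling rule (probability proportional to hardness plus a small floor) are our illustrative assumptions; the paper's exact hardness formula is not reproduced here.

```python
import numpy as np


def hardness(cos_target):
    """Assumed mapping: 0 for a perfectly aligned example, 1 for an opposite one."""
    return (1.0 - np.asarray(cos_target, dtype=float)) / 2.0


def sample_examples(hardness_values, k, rng, floor=0.05):
    """Draw k distinct example indices with probability proportional to hardness + floor."""
    w = np.asarray(hardness_values, dtype=float) + floor
    return rng.choice(len(w), size=k, replace=False, p=w / w.sum())


# Stored cosine similarities between four images of one class and its prototype.
stored = hardness([0.90, 0.80, 0.10, 0.95])
rng = np.random.default_rng(0)
picked = sample_examples(stored, k=2, rng=rng)
```

The `floor` term keeps easy examples from being starved completely, so every image retains a nonzero chance of being revisited and having its hardness value rewritten.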

F. PROTOTYPE MEMORY KNOWLEDGE DISTILLATION
Knowledge distillation [80], [81] is a useful method of transferring knowledge from large teacher models to the smaller student models. There are several knowledge distillation methods, suitable for face recognition models [82]- [84], but they are not adapted to the Prototype Memory approach. For example, to initialize the training, [83] needs class centers for each class, calculated by teacher network. They could either be taken from the classifier weight matrix, or computed by averaging all teacher embeddings of each class in the dataset. This method requires memory space for prototypes of all dataset classes, thus limiting the ability of Prototype Memory to scale to the unlimited dataset sizes.
Here we propose Prototype Memory Knowledge Distillation (PMKD), a knowledge distillation approach suitable for Prototype Memory. Its essence is to calculate embeddings of the training dataset images with a teacher network (this can be done offline or online, in parallel with the training of the student, on a dataset of any size), and to use these embeddings, instead of those produced by the student encoder, to generate prototypes for the Prototype Memory.
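The data flow of PMKD can be sketched as follows. Random linear maps stand in for the teacher and student encoders, and all names and sizes are hypothetical; the only point illustrated is that prototypes are generated from teacher embeddings while the student embeddings are classified against them.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit L2 norm."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


rng = np.random.default_rng(0)
k, num_classes, dim = 4, 3, 8
images = rng.normal(size=(num_classes * k, 32))   # stand-in for a mini-batch, grouped by class
teacher_w = rng.normal(size=(32, dim))            # hypothetical frozen teacher encoder
student_w = rng.normal(size=(32, dim))            # hypothetical trainable student encoder

teacher_emb = l2_normalize(images @ teacher_w)
student_emb = l2_normalize(images @ student_w)

# PMKD: prototypes are generated from the teacher's embeddings, not the student's.
prototypes = l2_normalize(teacher_emb.reshape(num_classes, k, dim).mean(axis=1))

# Student embeddings are scored against the teacher-derived prototypes;
# these cosine values would be fed to the softmax-based loss.
cos_logits = student_emb @ prototypes.T
```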

IV. EXPERIMENTS
In this section we perform experiments with different Prototype Memory hyperparameters. We demonstrate the ability of Prototype Memory to solve the problem of prototype obsolescence. We also compare the performance of the proposed model with different "sampled softmax"-based alternatives and evaluate the effectiveness of the proposed hard example mining and knowledge distillation methods. Finally, we provide the comparison of the proposed Prototype Memory model with the state-of-the-art results on popular face recognition benchmarks.

3) Testing Settings
For testing we used the L2-normalized average embedding of an image and its horizontally flipped copy. For testing with the ResNet-100 model on the MegaFace dataset we used L2-normalized concatenated embeddings of an image and its horizontally flipped copy. For the metrics we employ verification accuracy on the LFW, CFP-FP, AgeDB-30, CALFW and CPLFW datasets. We also perform evaluations on TrillionPairs (identification TPR@FAR=1e-3 and verification TPR@FAR=1e-9) and on MegaFace, original and refined (identification rank-1 and verification TPR@FAR=1e-6).
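The two test-time embedding variants can be sketched as below. The function `flip_embedding` and the toy encoder are hypothetical stand-ins; images are assumed to be arrays of shape (H, W, C), so that reversing the second axis gives a horizontal flip.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale a vector to unit L2 norm."""
    return x / max(np.linalg.norm(x), eps)


def flip_embedding(encoder, image, mode="average"):
    """Combine embeddings of an image and its horizontal flip, then L2-normalize.

    mode="average" averages the two embeddings; any other value concatenates them.
    """
    e = encoder(image)
    e_flip = encoder(image[:, ::-1])   # flip along the width axis
    combined = (e + e_flip) / 2 if mode == "average" else np.concatenate([e, e_flip])
    return l2_normalize(combined)


def toy_encoder(img):
    """Hypothetical stand-in for a face encoder: one value per image column."""
    return img.mean(axis=(0, 2))


img = np.arange(2 * 3 * 3, dtype=float).reshape(2, 3, 3)
avg = flip_embedding(toy_encoder, img)                  # same dimensionality as one embedding
cat = flip_embedding(toy_encoder, img, mode="concat")   # twice the dimensionality
```

Concatenation doubles the embedding size, which is a trade-off between descriptor length and the extra information retained from the flipped view.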

B. HYPERPARAMETER EFFECTS
We have performed the experiments with different hyperparameters of Prototype Memory.

1) Number of images per class (k) for prototype generation
We have performed experiments with the number of images per class (k) in the mini-batch. We used a ResNet-34 model with a memory size of M = 36,000 and refresh ratio r = 0.2, trained on Glint360k with a mini-batch of size 512, and tested on LFW, CFP-FP, AgeDB and TrillionPairs. The results are in Table 1.
The best results were achieved with k = 4 images per class. A smaller value of k results in less accurately generated prototypes, while a larger k results in a smaller number of different identities in each mini-batch. With larger mini-batch sizes and with datasets with many images per class it is reasonable to try larger values of k.
For the refresh ratio r, the best results were achieved with r = 0.2, demonstrating the effectiveness of prototype refreshing. With r = 0.0, prototypes receive updates from their class exemplars only in the form of infrequent gradients, leading to a mild form of "prototype obsolescence". Large values of r refresh prototypes too strongly, resulting in the loss of valuable information accumulated by the prototypes in memory during training. The value r = 0.2 achieves a balance between these two situations.
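The trade-offs for k and r can be made concrete with a short sketch. It assumes that refreshing is a linear interpolation between the stored prototype and the newly generated one, followed by L2 normalization; this rule is an assumption chosen to match the described behavior at r = 0.0 and at large r, and the function names are hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def identities_per_batch(batch_size, k):
    """Number of distinct identities when each identity contributes k images."""
    return batch_size // k

def refresh_prototype(stored, generated, r):
    """Assumed refresh rule: blend stored and freshly generated prototypes."""
    mixed = [(1.0 - r) * s + r * g for s, g in zip(stored, generated)]
    return l2_normalize(mixed)

n_ids = identities_per_batch(512, 4)  # mini-batch of 512 images, k = 4

stored = [1.0, 0.0]     # prototype currently in memory
generated = [0.0, 1.0]  # prototype built from the current mini-batch
kept = refresh_prototype(stored, generated, 0.0)      # no refreshing
replaced = refresh_prototype(stored, generated, 1.0)  # full replacement
blended = refresh_prototype(stored, generated, 0.2)   # value used in the experiments
```

Under this assumed rule, r = 0.0 leaves the stored prototype untouched, r = 1.0 discards it completely, and r = 0.2 keeps the result close to the stored prototype while moving it toward the current encoder state.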

3) Memory Size
We have performed experiments with different memory sizes M for Prototype Memory. We used a ResNet-34 model with memory size ranging from 36,000 to 288,000, with the other hyperparameters set to r = 0.2 and k = 4 images per class in a mini-batch of size 512. We trained the models on Glint360k-M and tested on the MegaFace (R) identification benchmark. The results are presented in Fig. 4.
The best results were achieved with a memory size of 216,000. As the memory size increased from 36,000 to 216,000, the performance of the model improved. The presence of a sufficient number of prototypes in memory is essential for the construction of informative training signals. A large memory with correctly calculated prototypes provides a comprehensive model of the class distribution in the embedding space, which is used for accurate calculation of gradients to train precise face recognition models. However, it is hard to maintain the correctness of all prototypes in a large-scale memory when the encoder, which produces face representations, is evolving and the prototypes in the memory are updated with delays and imperfect information.
The decrease in face recognition accuracy for larger memory sizes in our experiments illustrates the harmful impact of infrequent target class exemplar-based gradient updates. With large memory sizes, most gradient updates received by the prototypes in memory come from non-target class examples, pushing these prototypes away from those examples in the embedding space. Target class exemplar-based updates, on the contrary, are rare and thus provide insufficient training signals to keep the prototypes in close connection with exemplar embeddings. Prototype refreshing helps in this situation, but it also depends on the presence of target class examples in the mini-batch, and for large memory sizes its effect is limited. Therefore, to achieve good results, the memory should be large enough, but not larger than a certain size.

C. EXPERIMENTS WITH PROTOTYPE OBSOLESCENCE
To measure the effect of prototype obsolescence, we calculated average distances between class prototypes and class centers for the 100 longest unsampled classes for the Positive Plus Random Negative (PPRN) model [5] and for Prototype Memory. Class prototypes were taken from the weight matrix in RAM for the PPRN model, and generated using k = 4 random images of the considered class for Prototype Memory. Class centers were calculated as mean embeddings for the corresponding classes, using all class images, both for PPRN and Prototype Memory. We have performed experiments on the cleaned MS-Celeb-1M dataset with 90,382 classes and on the Glint-360k-R dataset with 353,658 classes. We used a softmax size of 36,000 (≈ 0.1 sampling rate) for both PPRN and Prototype Memory, and trained for 50,000 iterations.
On the MS-Celeb-1M dataset (Fig. 5a), where the number of classes is moderate, prototype obsolescence is not present. After a few iterations, class prototypes for the longest unsampled classes move close to the class centers, both for PPRN and Prototype Memory. For the Glint-360k-R dataset (Fig. 5b), however, even after a large number of iterations the distance between class prototypes and class centers for PPRN remains considerably large. This discrepancy between prototypes and class centers is a variant of the problem of prototype obsolescence. It results in incorrect training signals, directing the updates in wrong directions. For larger datasets this effect will be even more severe.
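The measurement described above can be sketched as follows. The sketch assumes cosine distance as the distance measure (the text does not name the metric) and uses a hypothetical bookkeeping dictionary `last_sampled` that records the iteration at which each class was last sampled.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; assumed here as the prototype-to-center distance."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def class_center(embeddings):
    """Mean embedding over all images of a class."""
    dim = len(embeddings[0])
    return [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]

def longest_unsampled(last_sampled_iter, n):
    """Return the n class ids whose last sampling iteration is the oldest."""
    return sorted(last_sampled_iter, key=last_sampled_iter.get)[:n]

def mean_prototype_drift(class_ids, prototypes, class_embeddings):
    """Average distance between each prototype and its class center."""
    dists = [cosine_distance(prototypes[c], class_center(class_embeddings[c]))
             for c in class_ids]
    return sum(dists) / len(dists)

# Toy data: one up-to-date prototype and one that no longer matches its class.
prototypes = {"fresh": [1.0, 0.0], "stale": [0.0, 1.0]}
class_embeddings = {"fresh": [[2.0, 0.0], [4.0, 0.0]],
                    "stale": [[1.0, 0.0], [3.0, 0.0]]}
last_sampled = {"fresh": 990, "stale": 12}

drift_all = mean_prototype_drift(list(prototypes), prototypes, class_embeddings)
oldest = longest_unsampled(last_sampled, 1)
drift_oldest = mean_prototype_drift(oldest, prototypes, class_embeddings)
```

In the experiments the selection size would be 100 rather than 1, and a drift value that stays large over training would indicate prototype obsolescence.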
In contrast to PPRN, the distance between class centers and prototypes for Prototype Memory remains insignificant both for the MS-Celeb-1M and Glint-360k-R datasets. Thus we demonstrate that prototype obsolescence is not present in Prototype Memory. Thanks to the prototype refreshing algorithm, the disposal and subsequent re-generation of the oldest prototypes, and the limited size of the memory module, prototypes in memory remain up-to-date and close to class centers at each training iteration. This results in more accurate training signals and the ability to train face recognition models regardless of the dataset size.
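The limited-size memory with disposal of the oldest prototypes can be sketched as a first-in, first-out store. This is an illustrative sketch only: the class name `PrototypeQueue` is hypothetical, refreshing is reduced to overwriting in place, and it is assumed that an overwritten prototype keeps its position in the queue.

```python
from collections import OrderedDict

class PrototypeQueue:
    """Fixed-capacity FIFO store of class prototypes (illustrative sketch)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._store = OrderedDict()  # class label -> prototype, oldest first

    def __len__(self):
        return len(self._store)

    def __contains__(self, label):
        return label in self._store

    def put(self, label, prototype):
        """Insert or overwrite a prototype; return the labels that were evicted."""
        # Assigning to an existing key keeps its original queue position.
        self._store[label] = prototype
        evicted = []
        while len(self._store) > self.capacity:
            old_label, _ = self._store.popitem(last=False)  # drop the oldest
            evicted.append(old_label)
        return evicted

    def labels(self):
        """Class labels currently in memory, oldest first."""
        return list(self._store)

memory = PrototypeQueue(capacity=3)
for label in ["id0", "id1", "id2"]:
    memory.put(label, [1.0, 0.0])
memory.put("id1", [0.0, 1.0])            # class already in memory: overwritten in place
evicted = memory.put("id3", [1.0, 0.0])  # memory is full: the oldest class is disposed of
```

Since the store never holds more than `capacity` entries, its footprint does not depend on the total number of classes in the dataset. An evicted class is simply re-generated from its exemplars the next time it appears in a mini-batch.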

D. COMPARISON WITH OTHER "SAMPLED SOFTMAX"-BASED METHODS
We have performed a comparison with other "sampled softmax"-based methods: Positive Plus Random Negative (PPRN) [5] and D-Softmax-K [28]. For Prototype Memory we used a ResNet-34 model with a memory size of 36,000 (≈ 0.1 sampling rate), r = 0.2 and k = 4 images per class. For PPRN we used a sampling rate of 0.1 and CosFace with s = 64 and m = 0.4. For D-Softmax-K we used d = 0.9, s = 64 and a sampling rate of 0.1. Training was performed on Glint360k-M, and testing on MegaFace, original and cleaned variants (the latter is more reliable). Results of the experiments are presented in Table 3.
Prototype Memory outperformed the other sampling-based methods in terms of face recognition accuracy. These results prove the effectiveness of the proposed model for learning face representations on large-scale training datasets.
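For reference, the CosFace loss with s = 64 and m = 0.4 used in these experiments can be sketched over a small set of prototypes. CosFace subtracts the margin m from the cosine similarity of the target class and scales all similarities by s before the softmax. The function names and the toy data are illustrative; in sampled training the list `prototypes` would hold only the sampled or memorized classes.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosface_logits(embedding, prototypes, target_index, s=64.0, m=0.4):
    """Scaled cosine logits with an additive margin on the target class."""
    logits = []
    for j, proto in enumerate(prototypes):
        c = cosine(embedding, proto)
        if j == target_index:
            c -= m  # the margin makes the target class harder to satisfy
        logits.append(s * c)
    return logits

def cross_entropy(logits, target_index):
    """Numerically stable softmax cross-entropy for one example."""
    mx = max(logits)
    log_sum = mx + math.log(sum(math.exp(z - mx) for z in logits))
    return log_sum - logits[target_index]

emb = [1.0, 0.0]
prototypes = [[1.0, 0.0], [0.0, 1.0]]  # class 0 is the target
logits = cosface_logits(emb, prototypes, target_index=0)
loss = cross_entropy(logits, 0)
```

With a perfectly aligned embedding the target logit is s(1 - m) = 38.4 and the loss is close to zero. An obsolete prototype would lower the target cosine and produce a large, misleading loss.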
Besides being accurate, Prototype Memory is also more memory-efficient than other softmax-based methods. In Table 4 we compare GPU and non-GPU memory requirements of Prototype Memory, "full softmax"-based training, PPRN and D-Softmax-K. For the case of "full softmax", the full weight matrix of size D × N should be placed in GPU memory, where D is the embedding size and N is the total number of classes in the training dataset. There is no need for additional non-GPU memory. For PPRN and D-Softmax-K, only a small part of the full weight matrix is placed in GPU memory; however, there is a need to keep the full weight matrix in non-GPU memory. The size of the required GPU memory is D × M, where M < N is the number of classes in sampled softmax. The size of the required non-GPU memory is D × N.
For the case of Prototype Memory, there is no need to keep a full weight matrix in non-GPU memory; only D × M of GPU memory is required. Prototype Memory provides a memory-efficient way of training face representations. Memory requirements of the proposed model are independent of the number of classes N, making it especially useful for training on datasets with large numbers of persons.
Besides being memory-efficient, Prototype Memory is also more computationally efficient than other softmax-based methods. In Table 5 we compare average computation times of the main (and most costly) operations in these methods: cosine similarity calculation between embeddings in the mini-batch and prototypes (weights of classes) in the classifier, prototype generation (for Prototype Memory) and RAM-to-GPU transfer of class weights (for other "sampled softmax"-based methods). Computations are performed on 6 NVIDIA GTX 1080 Ti GPUs, with a mini-batch of size 128 and an embedding (and also prototype) size of D = 256, using the parallel acceleration strategy described in [7]. The total number of classes is set to 1 million (1M), and the numbers of sampled classes are set to 1M, 500k, 200k, 100k and 50k. The number of images per class for Prototype Memory is k = 4. Computation times are averaged over multiple iterations (for fair comparison, for Prototype Memory we start the measurements after the memory is filled to its limit), and are given in milliseconds.
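The memory requirements compared above can be made concrete with a short calculation. It assumes 32-bit floating-point weights and counts only the classifier weight (prototype) matrix, ignoring activations, gradients and optimizer state. The values D = 256 and N = 1M come from the timing setup, and M = 100k is one of the sampled sizes listed there.

```python
BYTES_PER_FLOAT32 = 4  # assumption: weights are stored as 32-bit floats

def weight_bytes(dim, n_classes, bytes_per_value=BYTES_PER_FLOAT32):
    """Size in bytes of a dim x n_classes weight (prototype) matrix."""
    return dim * n_classes * bytes_per_value

def memory_requirements(dim, n_total, n_sampled):
    """GPU and non-GPU bytes needed for classifier weights, per method."""
    full = weight_bytes(dim, n_total)       # D x N
    sampled = weight_bytes(dim, n_sampled)  # D x M
    return {
        "full softmax": {"gpu": full, "non_gpu": 0},
        "sampled softmax (PPRN, D-Softmax-K)": {"gpu": sampled, "non_gpu": full},
        "Prototype Memory": {"gpu": sampled, "non_gpu": 0},
    }

req = memory_requirements(dim=256, n_total=1_000_000, n_sampled=100_000)
for method, r in req.items():
    print(f"{method}: GPU {r['gpu'] / 1e6:.1f} MB, non-GPU {r['non_gpu'] / 1e6:.1f} MB")
```

Under these assumptions the full matrix takes about 1 GB and the sampled one about 102 MB. Only the Prototype Memory row stays constant when N grows.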

Table 5 columns: Method; Average computation time (ms): Cosine Sim., Prototypes, RAM-to-GPU. First row: Full Softmax (1M), 139.
As we can see from Table 5, "sampled softmax"-based methods are more efficient than full softmax when the number of sampled classes is small. However, the D-Softmax-K and PPRN methods include a costly RAM-to-GPU transfer of class weights. When the number of sampled classes is larger, this transfer introduces significant additional computational overhead. In some cases this operation can be optimized (for example, when some class weights are already on the GPU from the previous training iteration), but in the general case it remains an unavoidable cost. In contrast, Prototype Memory keeps all computations entirely on the GPU and is free of this drawback. Prototype Memory does introduce the operation of prototype generation, which requires some extra computation, but this operation is independent of the number of classes and much faster.

E. EXPERIMENTS WITH MULTI-DOPPELGANGER MINING AND HARDNESS-AWARE EXAMPLE MINING
To evaluate the effectiveness of the proposed Multi-Doppelganger Mining and Hardness-aware example mining algorithms, we have performed experiments with the PM-100 model, pre-trained on Glint360k-R for 540,000 iterations using Prototype Memory with M = 200,000, r = 0.2, k = 4 and CosFace with s = 64 and m = 0.4 as the loss function. This model was fine-tuned for 55,000 more iterations, with and without hard example mining methods applied.
For the case of usual training, we used a composite mini-batch [56] containing 128 images sampled with the "group-based iterate-and-shuffle" strategy and 384 images sampled with the "group-based classes-then-images" strategy. We also used m = 0.5 as the margin value in CosFace.
For the case of hard example mining, we used a composite mini-batch containing 128 images sampled with the "group-based iterate-and-shuffle" strategy and 384 images sampled using a combination of Multi-Doppelganger Mining and Hardness-aware example mining, with h = 0.5, 12 classes sampled at random and 84 classes sampled using doppelgangers. A margin value of m = 0.5 was used in CosFace. The results of the experiments are presented in Table 6.
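The composition of the mined mini-batch can be checked with simple arithmetic. The sketch assumes that each of the randomly sampled and doppelganger-sampled classes contributes k = 4 images, which is consistent with the stated total of 384 mined images; the function name and dictionary keys are hypothetical.

```python
def composite_batch(iterate_shuffle_images, random_classes, doppelganger_classes, k):
    """Count the images contributed by each part of a composite mini-batch."""
    mined_classes = random_classes + doppelganger_classes
    mined_images = mined_classes * k  # assumption: k images per mined class
    return {
        "iterate_and_shuffle_images": iterate_shuffle_images,
        "mined_classes": mined_classes,
        "mined_images": mined_images,
        "total_images": iterate_shuffle_images + mined_images,
    }

plan = composite_batch(iterate_shuffle_images=128, random_classes=12,
                       doppelganger_classes=84, k=4)
```

With these settings, 12 + 84 = 96 classes yield 96 × 4 = 384 mined images, which together with the 128 "iterate-and-shuffle" images give a mini-batch of 512.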

G. COMPARISON WITH STATE-OF-THE-ART
We have compared Prototype Memory to the state-of-the-art models from the literature. To perform a fair evaluation [85], we used a strict protocol with test / train identity overlaps removed. For the evaluation we used the PM-100 model, fine-tuned with Multi-Doppelganger Mining and Hardness-aware example mining, as mentioned above. We also used PM-100 to fine-tune another model, using Prototype Memory with M = 200,000, k = 4, r = 0.2 and the D-Softmax [28] loss function with d = 0.9 and s = 64. Fine-tuning was performed on the Glint-360k-R dataset for 55,000 iterations with a mini-batch of 512, containing 128 images sampled with the "group-based iterate-and-shuffle" strategy and 384 images sampled with the "group-based classes-then-images" strategy. We evaluate both of these models on a variety of face recognition benchmarks.

1) CFP-FP, AgeDB-30, CALFW and CPLFW
We have compared Prototype Memory models to the state-of-the-art on the CFP-FP [12] and CPLFW [13] testing datasets, which contain large pose variations. We also used the AgeDB-30 [14] and CALFW [15] testing datasets, which contain large age variations. Results are in Table 8. Models that explicitly state that train / test identity overlaps were removed (i.e. that use the strict evaluation protocol) are placed separately from other models, as training with train / test overlaps could produce unfair and unreliable test results.
For the strict evaluation protocol, Prototype Memory models achieved state-of-the-art results on all four datasets. The Prototype Memory model trained using CosFace with hard example mining performed better than the Prototype Memory model with D-Softmax on all datasets except AgeDB-30, where the latter model achieved slightly superior results.

2) MegaFace
We have compared Prototype Memory models to the state-of-the-art on the large-scale MegaFace benchmark, in identification and verification scenarios, on the original and cleaned versions. Results are presented in Table 9. For the strict evaluation protocol, and among the models trained using publicly available face datasets (with the exception of WebFace42M, which is a much larger training dataset), Prototype Memory models achieve state-of-the-art results. The Prototype Memory model with D-Softmax outperforms the Prototype Memory model with CosFace and hard example mining on the verification tests and also on the identification test for the noisy version of MegaFace. On the identification test of cleaned MegaFace, the CosFace-based model achieves better results. These results prove the ability of the models trained with Prototype Memory to perform accurate face recognition in large-scale scenarios.
Even better results could be achieved with larger training datasets [22], better encoder architectures [115], data augmentation [116] and other methods [117], providing models with the robustness to large age [51] and pose [118] variations and ability to overcome the problems of racial bias [119], domain imbalance [54] and bad image quality [120].

V. CONCLUSION
In this paper we proposed Prototype Memory, a novel face representation learning model that opens the possibility of training state-of-the-art face recognition architectures on datasets of any size. Prototype Memory consists of a memory module for storing class prototypes and algorithms to perform operations with it. Prototypes are generated online, using exemplar embeddings present in the current mini-batch. New prototypes are enqueued to the memory. Prototypes in memory are refreshed and kept up-to-date with the state of the encoder. When the memory is filled, the oldest prototypes are dequeued and disposed of. Prototypes in memory are used in the role of classifier weights for softmax-based face representation learning. Prototype Memory is useful for preventing the problem of "prototype obsolescence". Like other "sampled softmax"-based models, it is computationally and memory-efficient. We have performed extensive experimental evaluations on popular face recognition benchmarks and proved the effectiveness of the proposed model. We have compared the performance of Prototype Memory to other "sampled softmax"-based models and demonstrated its superiority both in terms of face recognition accuracy and memory efficiency. We proposed Multi-Doppelganger Mining and Hardness-aware example mining methods to improve the training of models based on Prototype Memory by generating more informative mini-batches. We also described a knowledge distillation method suitable for use with Prototype Memory, and proved its effectiveness.