Towards Accurate and Lightweight Masked Face Recognition: an Experimental Evaluation

Given the current COVID-19 pandemic, most people wear a mask to prevent the spread of the disease. This sanitary measure has caused a significant drop in the effectiveness of current face recognition methods when handling masked faces in practical applications such as face access control, face attendance, and face authentication-based mobile payment. In response, recent efforts have focused on boosting the performance of existing face recognition technology on masked faces. Some solutions fine-tune existing deep learning face recognition models on synthetic masked images, while others use the periocular region as a naive way to eliminate the adverse effect of COVID-19 masks. Although the accuracy of masked face recognition remains an important issue, in the last few years the development of efficient and lightweight face recognition methods has received increased attention in the research community. In this paper, we study the effectiveness of three state-of-the-art lightweight face recognition models for accurate and efficient masked face recognition, considering both fine-tuning on masked faces and periocular images. For the experimental evaluation, we create both real and simulated masked face databases as well as periocular datasets. Extensive experiments are conducted to determine the most effective solution and to outline further steps for the research community. The results show that fine-tuning existing state-of-the-art face models on masked images achieves better performance than using periocular-based models. In addition, we evaluate and analyze the effectiveness of the trained masked-based models on well-established unmasked benchmarks for face recognition and assess the efficiency of the lightweight architectures in comparison with state-of-the-art face models.


I. INTRODUCTION
The COVID-19 pandemic has changed the world in all dimensions. The widespread wearing of face masks in public places has imposed new challenges for the research community. Many applications based on face recognition techniques, such as face access control, face attendance, and face authentication-based mobile payment, have nearly failed to recognize masked faces effectively. At the moment, removing masks to pass authentication systems is not recommended, since this can considerably increase the transmission of the COVID-19 virus. Furthermore, because the virus can be spread through contact, systems based on passwords or fingerprints are less safe than face recognition solutions, which do not require touching any device. Therefore, masked face recognition has become a crucial computer vision task to help society reduce virus infection.
Current advanced face recognition methods are based on deep learning models [10], [21], which have achieved impressive performance on public benchmarks. However, deep face recognition performs poorly under new challenging conditions, such as the occlusion caused by face masks. Face occlusion has been widely addressed by the research community in the scope of face recognition solutions [43]. Most existing works consider general occlusions that commonly appear in unconstrained capture conditions, such as sunglasses, scarves, or other random objects like books and cups [34]. The performance of these methods tends to degrade by a large margin when faced with specific objects such as COVID-19 masks, which occlude a large part of the face, including important facial regions such as the mouth and the nose [29]. This is why recognizing masked faces is currently an active research topic [7], [18], [37]. In the last year, several works [4], [8], [29] have evaluated the effect of wearing a mask on automatic face recognition systems based on state-of-the-art deep Convolutional Neural Networks (CNNs). However, these studies focus mainly on the performance of common deep face recognition models with high computational cost. Moreover, since deep learning-based approaches depend on massive training data, databases with real face masks have been collected and tools for generating synthetic masked images have been developed [2]. Nevertheless, most of these collected datasets are not publicly available, and the models are trained and evaluated under different experimental settings, which makes it difficult to understand their behavior.
The use of some of the ocular traits that have been proposed for human recognition can be regarded as a naive way to eliminate the adverse effect of the mask. Unlike other ocular traits such as the iris, retina and conjunctival vasculature, the acquisition of periocular biometrics does not require high user cooperation or a close capture distance [31], which is particularly useful during the COVID-19 pandemic. Recent methods for periocular recognition based on deep learning models have shown promising results even for in-the-wild images [13], [16], [36]. Nevertheless, since periocular biometrics covers only the immediate vicinity of the eyes, i.e., a sub-region of the face, it captures less information than the full face.
On the other hand, although the accuracy of face recognition systems is very important, in the last few years the development of efficient and lightweight face recognition methods has received increased attention in the literature. This interest has been motivated by the demand for deploying face recognition models in embedded domains and other use cases constrained by devices with low computational power and by high throughput requirements. Recently, the effectiveness of several lightweight face architectures was demonstrated for different face recognition scenarios, reaching high levels of accuracy and compactness at a very low computational cost [24].
In this paper, we study the effectiveness of three state-of-the-art lightweight face recognition models to enable the future development of solutions for accurate and efficient masked face recognition. We investigate two different approaches to enhance the performance of these models on masked faces. The first approach consists of including masked facial images in the learning process of the lightweight models. Due to the lack of publicly available masked face databases, we create face datasets with synthetic masks and propose a real masked face dataset that simulates a realistic, varying collaborative face scenario. The created real masked database is the first version of an ongoing data collection process and will be available upon request for future research and comparisons. In the experimental evaluation, we cover the matching of masked faces against masked faces as well as masked faces against unmasked faces. Considering periocular biometrics as a naive way to address the adverse effect of the mask, the second approach applies the lightweight networks in the context of periocular recognition. In order to analyze the feasibility of using periocular information and compare its performance against that obtained by the models trained with masked faces, we create periocular datasets from the same datasets used in the masked face recognition scenario. In addition, we investigate the effect of employing the lightweight models trained with masked images on well-established benchmarks of unmasked images. To evaluate the deployment capacity of the lightweight face networks, we assess their computational requirements and compare them with state-of-the-art models.
The main contributions of this work are summarized as follows:
• The collection of a real masked face dataset including verification and identification protocols for unconstrained masked face recognition, which will be available upon request to encourage and support future solutions for this problem.
• An extensive evaluation of the performance of lightweight deep models from two different approaches, including (real and simulated) masked face images and periocular images. This covers extensive experiments with three state-of-the-art lightweight face architectures, which show that the masked face models should be used as the primary solution.
• Performance assessment of face models when matching masked vs. masked face images and masked vs. unmasked face images.
• Comparison of the proposed lightweight masked and periocular models with several state-of-the-art deeper models for the problem of wearing a mask in face recognition.
• A study of the effect of using lightweight masked face models on unmasked face recognition benchmarks with different covariates such as pose, age, and large scale (unmasked vs. unmasked face images).
• An analysis and comparison of the deployment capacity of the proposed lightweight solutions with several face recognition models, taking into account the storage space (model size), the compactness (number of parameters) and the number of floating point operations (FLOPs).
The remainder of this paper is organized as follows. In Section II, we provide an overview of related work covering both masked face recognition and periocular face recognition. In Section III, the datasets employed for the study are described. The selected lightweight face models and the implementation details are presented in Section IV. The experimental evaluation is provided in Section V, which is the last section included here.

II. RELATED WORK
In this section, we review the face recognition methods that have been proposed to address the effect of wearing a mask in the era of the COVID-19 pandemic. We group existing solutions into two approaches: methods that deal directly with the mask and methods that use the periocular region as the biometric trait for recognition.

A. MASKED FACE RECOGNITION
Wearing a mask can be characterized as a kind of facial occlusion. Although a large number of approaches have been developed for face recognition in the presence of occlusions [43], most of them consider the occlusion of small regions of the face due to sunglasses, mustaches, bangs or hats [34]. However, the face masks used to prevent COVID-19 occlude around 70% of the face area [29], including the mouth, chin, and nose. Thus, specific studies and methods for masked face recognition have arisen during the last year.
Due to the lack of training and testing datasets of faces wearing masks, the first works on masked face recognition during the COVID-19 pandemic are based on the creation of databases with real or simulated face masks. Wang et al. [38] were the first to propose real and simulated masked face datasets, including a large Real-world Masked Face Recognition Dataset (RMFRD) and a Simulated Masked Face Recognition Dataset (SMFRD). Although the authors claim to improve the recognition accuracy from 50% to 95%, the dataset is difficult to use because it does not clearly define an evaluation protocol. Moreover, details of their method and baseline are not clearly specified. In [2], a methodology and an open-source masking tool are presented to augment existing face datasets for training masked face recognition algorithms. The authors also create a real-world masked face database (MFR2) for testing that contains masked face images of celebrities and politicians. As a result, they report a considerable increase in the accuracy of the existing FaceNet model for both masked and unmasked faces, which extends to real-life masked faces.
On the other hand, some works have focused on enhancing the recognition performance on masked faces. In [28], a Support Vector Machine classifier is trained on the feature embeddings provided by the FaceNet model, using a small database collected for this purpose. The authors claim 99% accuracy, but the database and evaluation protocols used are neither provided nor detailed. An alternative solution based on recovering unmasked faces for feature extraction was proposed in [19]. For this, a de-occlusion distillation framework is introduced, where appearance information is first recovered using a generative inpainting network and rich structural knowledge is then transferred from a high-performance pretrained general recognizer in a teacher-student model. The method presented in [37] uses the state-of-the-art ArcFace model [10] to extract deep features from the detected and normalized face images, which are then combined with LBP features extracted from the eyes and eyebrows. The ArcFace network is also used in [26], with several modifications to the backbone and the loss function. The network, based on ResNet-50, is modified to output the probability that a face is wearing a mask. In addition, the ArcFace loss is combined with a mask-usage classification loss to train a mask-robust facial feature embedding.
In the last year, different challenges and studies have been conducted to benchmark the performance of masked face recognition methods. The behavior of three face recognition algorithms in the presence of masked face probes is evaluated in [7]. The authors collect their own database for the evaluation and show how two of the top-performing deep face recognition models (ArcFace and SphereFace) and a COTS algorithm (from Neurotechnology) are affected by the presence of masks. The National Institute of Standards and Technology (NIST) reports the performance of a large number of face recognition algorithms on faces occluded by masks, run under the Ongoing Face Recognition Vendor Test (FRVT) [29]. This study shows that error rates for unmasked versus masked faces have been decreasing as algorithms have been developed since the start of the pandemic. However, some of the algorithms that are quite competitive with unmasked faces still fail to authenticate between 10% and 40% of masked images. Although the results show that a number of developers have adapted their algorithms to support masked face recognition, the design details of the tested algorithms are not provided. Moreover, the dataset used in the NIST study is not publicly available and contains synthetic masked images from controlled scenarios; real masked images in unconstrained scenarios were not considered.
Recently, the IJCB Masked Face Recognition Competition 2021 (IJCB-MFR-2021) [4] evaluated the solutions submitted by 10 teams. The database used in the competition represents a collaborative face verification scenario, but only 47 subjects wearing real face masks were considered. Moreover, most of the submitted solutions, especially the top-performing ones, are based on heavier ResNet architectures with between 23 and 108 million parameters. Another evaluation was conducted in the Face Biometrics under COVID Workshop and Masked Face Recognition Challenge at ICCV 2021 [8]. A large number of recent solutions were evaluated on a dataset containing 6,964 masked facial images and 13,928 non-masked facial images. In this case, ResNet is also used as the baseline model, and the details of the submitted solutions are not provided. Unfortunately, the test data will not be released to the public, but we are able to submit our proposal to be evaluated on this large dataset. In general, although some restrictions are imposed on the solutions submitted to these competitions, lightweight architectures are not considered. VOLUME 4, 2016

B. PERIOCULAR FACE RECOGNITION
In the situation caused by the COVID-19 pandemic, periocular recognition has gained direct relevance. The periocular region refers to the facial area in the immediate vicinity of the eyes [16]. Although there are no specific guidelines for the size and bounds of the periocular region, some studies suggest that considering the eyelids, eyelashes, eyebrows, tear duct, eye shape, and the surrounding skin can result in higher recognition rates [30].
Early periocular recognition approaches used monocular information, separating the left region from the right region and performing the matching individually. Park et al. [30] were the first to study the feasibility of using the periocular region as a biometric trait and to evaluate its performance using different matchers based on global and local handcrafted feature extractors. The authors also examine the effectiveness of the periocular region in non-ideal scenarios and suggest including the eyebrows and using a neutral facial expression, as well as combining the results of matching the left and right sides of the periocular images for more accurate recognition. In [3], different methods using the left eye, the right eye and both eyes were evaluated for recognition. In all cases, an improvement of between 3 and 5 percent was achieved when using both eyes instead of only one. Thus, most state-of-the-art methods use bi-ocular information, some of them analyzing the left and right eyes separately and then combining the results [36], while others use both eyes within a single image [14].
With the emergence of deep learning, the focus of researchers has moved to learning robust representations with deep Convolutional Neural Networks (CNNs) for periocular recognition, achieving visible improvements in the performance of periocular biometric systems [17], [44], [45]. The semantics-assisted convolutional neural network (SCNN) [44] was one of the first proposals to use a deep learning-based representation for periocular images. By incorporating explicit semantic information (gender and side), it is shown to offer better discriminating power with a relatively small number of training samples. In [45], the authors apply existing pre-trained architectures, originally proposed to classify generic objects, to the task of periocular recognition. The results show that these networks are able to outperform reference periocular features. Similarly, seven different off-the-shelf deep CNNs were applied through transfer learning in [17] to analyze the utility of the periocular region in non-ideal scenarios. A new method for masked face recognition was proposed in [20] by integrating a cropping-based approach with the Convolutional Block Attention Module (CBAM) to focus on the regions around the eyes.
On the other hand, some works propose a feature fusion approach that combines handcrafted features (e.g., LBP and HOG) with features extracted using pretrained CNN models [18], [36]. Another hybrid model is introduced in [1] for ocular smartphone authentication (Selfie Biometrics). The proposal is a fusion of a stacked unsupervised convolution-based model with a stacked supervised convolution-based model, which is combined with Root SIFT. A recent selfie periocular verification method is presented in [35]. It consists of a two-stage approach based on a CNN with pixel shuffle and a new loss function based on a sharpness metric, aiming at enhancing the periocular images through super-resolution.
Despite the significant and encouraging research progress made by the aforementioned works on recognizing faces wearing masks, the study of lightweight deep networks for this problem deserves further attention. Moreover, existing methods from the two approaches reviewed in this section have not been evaluated and compared under the same scenarios and conditions, which would be very helpful for establishing the most suitable way to deal with the problem of masked face recognition.

III. DATASETS
In this section, we present the datasets used for studying the masked face recognition problem. Due to the lack of publicly available large-scale datasets for training and testing, we generate face images with simulated masks from existing unmasked face databases. We test the trained models on some state-of-the-art masked datasets, and we also collect a real masked dataset from subjects in our laboratory. To analyze the feasibility of periocular information, we construct periocular images from the same datasets used for the masked face recognition scenario.

A. SIMULATED MASKED FACE DATASETS
In order to create simulated masked face datasets, we use the open-source tool MaskTheFace [2] to convert existing face datasets into masked face datasets. It uses the face landmark detector provided by the Dlib library [15] to identify the face tilt and six key features of the face that are necessary for applying a mask. Based on the face tilt, the corresponding mask template is selected from the library of masks and then transformed according to the six key features to fit the face. MaskTheFace provides five different mask types (cloth, surgical, N95, KN95 and gas) and supports 24 patterns that can be applied to these mask types to create more variations.
For training, we select CASIA-WebFace [42], a face dataset that contains 494,414 images of 10,575 identities (about 47 images per identity on average), varying in pose, age, ethnicity and illumination. From this dataset, we create a Simulated Masked (SM) CASIA-WebFace by augmenting it with the MaskTheFace tool described above. Specifically, for each unmasked face image of every subject, we generate four masked images using the cloth, gas, KN95 and surgical-green synthetic mask types. The SM CASIA-WebFace dataset is composed of both the original unmasked images and the synthetic masked images, so that the trained networks perform well on both masked and unmasked images. Figure 1 shows some examples of the synthetic masked face images obtained for different unmasked subjects from the CASIA-WebFace dataset. For testing, we select different face benchmarks, including LFW [12], AgeDB-30 [27] and CALFW [46], to generate simulated masked datasets. For this, we use the MaskTheFace tool with one randomly selected mask applied to each image.
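The composition of SM CASIA-WebFace described above can be sketched as follows. This is only an illustration: the file-naming scheme is hypothetical, and the size computation assumes that a mask can be fitted to every image, so it should be read as an upper bound rather than a reported figure.

```python
import os

# Mask types named in the text; the string labels are illustrative only.
MASK_TYPES = ("cloth", "gas", "KN95", "surgical-green")

def sm_variants(image_name):
    """Return the unmasked image plus one (hypothetical) file name per mask type."""
    stem, ext = os.path.splitext(image_name)
    return [image_name] + [f"{stem}_{mask}{ext}" for mask in MASK_TYPES]

def sm_dataset_size(num_unmasked_images):
    """Upper bound on the dataset size: each image keeps its original
    and gains one masked copy per mask type."""
    return num_unmasked_images * (1 + len(MASK_TYPES))

print(sm_variants("0000045/001.jpg"))
print(sm_dataset_size(494_414))  # 2472070
```

Under these assumptions, the augmented training set is five times the size of the original one.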
Labeled Faces in the Wild (LFW) [12] is a standard face recognition benchmark that contains 13,233 web-collected images of 5,749 different identities, with large variations in pose, expression and illumination. AgeDB [27] is an in-the-wild dataset with large variations in pose, expression, illumination, and age. It contains 16,488 images of 568 distinct subjects. The average age range for each subject is 50.3 years. There are four groups of test data with different year gaps (5, 10, 20 and 30 years, respectively) for age-invariant face verification. In this paper, we only use the most challenging subset, AgeDB-30, to report the performance. Cross-Age LFW (CALFW) [46] is a recently introduced dataset that shows higher age variations, with the same identities as the LFW database. These three databases define an evaluation protocol based on matching 6,000 face pairs, which are divided into ten subsets, each having 300 positive pairs and 300 negative pairs. To analyze the performance of the trained networks, we compute the verification accuracy (Acc) and the Equal Error Rate (EER) on the established 6,000 face pairs of each database.
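As an illustration of the two verification metrics, the following minimal sketch computes Acc and EER from a list of pair similarity scores and same/different labels. It is a simplification and not the exact procedure used in the paper: it sweeps a single threshold over all pairs, whereas the ten-subset protocol is typically used to select the threshold by cross-validation. The function names and the toy scores are hypothetical.

```python
def error_rates(scores, labels, threshold):
    """Return (FAR, FRR) when score >= threshold is read as 'same identity'."""
    genuine = [s for s, y in zip(scores, labels) if y]
    impostor = [s for s, y in zip(scores, labels) if not y]
    far = sum(s >= threshold for s in impostor) / len(impostor)
    frr = sum(s < threshold for s in genuine) / len(genuine)
    return far, frr

def accuracy_at(scores, labels, threshold):
    """Fraction of pairs classified correctly at the given threshold."""
    correct = sum((s >= threshold) == bool(y) for s, y in zip(scores, labels))
    return correct / len(scores)

def best_accuracy(scores, labels):
    """Accuracy at the best threshold among the observed scores."""
    return max(accuracy_at(scores, labels, t) for t in set(scores))

def equal_error_rate(scores, labels):
    """Approximate EER: mean of FAR and FRR where they are closest."""
    candidates = sorted(set(scores)) + [max(scores) + 1.0]
    far, frr = min((error_rates(scores, labels, t) for t in candidates),
                   key=lambda r: abs(r[0] - r[1]))
    return (far + frr) / 2

# Toy example: four positive pairs followed by four negative pairs.
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
print(best_accuracy(scores, labels))     # 0.875
print(equal_error_rate(scores, labels))  # 0.25
```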

B. REAL MASKED FACE DATASETS
To assess the performance of the trained models on real masked face images, we create a small face database of persons from our laboratory wearing real masks. This database (RMFR-CEN) is an initial version, and further data collection efforts are ongoing. The data tries to simulate a collaborative, yet varying, scenario where the mask, illumination, pose and background can change for each participant. In total, we collected 395 images of 100 identities. Each identity has an average of 4 images, including both masked and unmasked faces. The dataset is processed in terms of face alignment and image dimensions. As a result, each image has a dimension of 112 × 112 × 3. Figure 2 shows some examples of the images collected in our laboratory. For performance evaluation, we design both face verification and identification protocols. The verification protocol specifies 469 positive pairs and 10,000 negative pairs, each composed of one masked face and one unmasked face. For performance measurement, each pair is evaluated by computing a matching similarity score, and the True Acceptance Rate (TAR) at different False Acceptance Rates (FAR), the Equal Error Rate (EER) and the Area Under the ROC Curve (AUC) are used as evaluation metrics. For face identification, we construct the evaluation setup for the closed-set scenario, where, for each subject, we use the most frontal unmasked image as the gallery, while the masked ones are used as probes. To report the identification performance, we select the Cumulative Matching Characteristic (CMC) [32] and the mean Average Precision (mAP) measures.
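The closed-set identification setup can be illustrated with a minimal sketch of the CMC computation, assuming one gallery image per subject and a precomputed probe-by-gallery similarity matrix. The function name and the toy data are hypothetical and are not taken from the RMFR-CEN protocol files.

```python
def cmc_curve(similarity, probe_ids, gallery_ids, max_rank):
    """Identification rates at ranks 1..max_rank.

    similarity[i][j] is the score between probe i and gallery entry j.
    Closed-set assumption: every probe identity appears in the gallery.
    """
    hits = [0] * max_rank
    for row, true_id in zip(similarity, probe_ids):
        # Gallery indices sorted from most to least similar.
        order = sorted(range(len(gallery_ids)), key=lambda j: row[j], reverse=True)
        rank = [gallery_ids[j] for j in order].index(true_id)  # 0-based rank
        for k in range(rank, max_rank):
            hits[k] += 1
    return [h / len(probe_ids) for h in hits]

gallery_ids = ["a", "b", "c"]          # one unmasked image per subject
probe_ids = ["a", "b", "c", "a"]       # masked probes
similarity = [
    [0.9, 0.2, 0.1],   # correct match at rank 1
    [0.6, 0.5, 0.1],   # correct match at rank 2
    [0.1, 0.3, 0.8],   # correct match at rank 1
    [0.2, 0.7, 0.4],   # correct match at rank 3
]
print(cmc_curve(similarity, probe_ids, gallery_ids, max_rank=3))  # [0.5, 0.75, 1.0]
```

The first value of the curve is the rank-1 identification rate, which is the figure most commonly reported.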
In order to broaden the evaluation and compare the trained lightweight masked models with existing masked face recognition solutions, we use the Masked Faces in Real World for Face Recognition (MFR2) dataset [2], the test set of the InsightFace track of the Masked Face Recognition Challenge at ICCV 2021 [8], and the AR Face dataset [23].
MFR2 is a small dataset with 53 identities of celebrities and politicians and a total of 269 images collected from the internet. Each identity has 5 images on average, including both masked and unmasked faces. In Figure 3, we show some sample images from the MFR2 dataset. For the evaluation of network performance, a total of 848 image pairs (424 positive pairs and 424 negative pairs) are defined, and the Max Accuracy and TPR@FAR=0.2% metrics are used to report the verification performance.
The AR Face Database [23] contains around 4,000 images of 126 subjects captured in two different sessions. Each person has up to 13 images per session with different expressions, illuminations and occlusions. The occlusions included in the dataset are sunglasses and scarves. Although this is not a masked face dataset, it has been used for evaluating masked face solutions [19], [34], since the scarf occlusions cover roughly the same region as a face mask (see Figure 5). Thus, we use the Scarf subset of this database in the evaluation in order to compare with state-of-the-art methods. Following previous protocols, we randomly select 100 subjects (50 males and 50 females) and conduct identification experiments using one neutral image per subject (the first image in the first session) to form the gallery.

C. PERIOCULAR DATASETS
For the periocular face datasets, we use bi-ocular information, including both eyes in a single image and considering the eyelids, eyelashes, eyebrows, tear duct, eye shape and the surrounding skin. To obtain this periocular region, we crop the face images using the algorithm from [18] for extracting the region of interest. This algorithm takes the canthus points as reference points, which are detected automatically with the Dlib landmark detector. Finally, the obtained periocular regions are geometrically normalized.
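To make the idea of a canthus-referenced crop concrete, the sketch below derives a bi-ocular bounding box from the two outer canthus points. It is not the algorithm of [18], whose details are not given here: the `margin` and `aspect` parameters are hypothetical, and the face is assumed to be already aligned so that no rotation is needed.

```python
import math

def periocular_box(canthus_a, canthus_b, margin=0.25, aspect=0.4):
    """Axis-aligned box (x0, y0, x1, y1) around both eyes.

    canthus_a, canthus_b: (x, y) outer eye corners, e.g. landmarks 36 and 45
    if the common Dlib 68-point model is used (an assumption).
    margin: extra width on each side, as a fraction of the canthus distance.
    aspect: box height divided by box width. Both values are illustrative.
    """
    (xa, ya), (xb, yb) = canthus_a, canthus_b
    cx, cy = (xa + xb) / 2, (ya + yb) / 2
    dist = math.hypot(xb - xa, yb - ya)
    width = dist * (1 + 2 * margin)
    height = width * aspect
    return (cx - width / 2, cy - height / 2, cx + width / 2, cy + height / 2)

# Outer eye corners on a 112 x 112 aligned face (toy coordinates).
print(periocular_box((30.0, 50.0), (82.0, 50.0)))
```

Scaling the box by the inter-canthus distance keeps the cropped region comparable across faces of different sizes, which is one simple way to obtain a geometrically consistent crop.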
For a fair comparison between the periocular-based and the mask-based lightweight models, we employ the same datasets and evaluation metrics. Thus, we use Periocular CASIA-WebFace for training, while Periocular LFW, AgeDB-30, CALFW and RMFR-CEN are used for testing. Figure 6 shows some examples of the training images from Periocular CASIA-WebFace. In addition, to broaden our study, we compare the obtained periocular lightweight models with the state-of-the-art methods reported on the periocular images from the AR Face database.

IV. LIGHTWEIGHT FACE RECOGNITION MODELS
In recent years, developing efficient and lightweight face recognition networks has become an active research topic, with the aim of making deep CNNs feasible for real-time applications and resource-limited devices. Existing lightweight models have been shown to perform very similarly to larger and heavier deep models in different face recognition scenarios. In this study, we select three state-of-the-art lightweight CNN face models that were the top performers in [24]: VarGFaceNet [41], MobileFaceNet [5] and ShuffleFaceNet [22].
VarGFaceNet [41] is an efficient variable group convolutional network based on VarGNet for lightweight face recognition. Unlike the blocks in VarGNet, it adds a squeeze-and-excitation (SE) block and employs the PReLU activation function instead of ReLU to increase the discriminative ability of its blocks. Moreover, VarGFaceNet removes the downsampling step at the start of the network to preserve more information: a 3 × 3 convolution with stride 1 is used instead of the 3 × 3 convolution with stride 2 of VarGNet, which preserves the discriminative ability of the lightweight network. It also applies a variable group convolution after the last convolution to shrink the feature tensor to 1 × 1 × 512 before the FC layer.
MobileFaceNet [5] and ShuffleFaceNet [22] have shown competitive performance with respect to highly accurate, very deep face models on several benchmarks for unconstrained face recognition. The major contribution of these networks lies in the use of a Global Depth-wise Convolution (GDC) layer instead of a Global Average Pooling (GAP) layer in order to obtain a more discriminative face representation, and of the Parametric Rectified Linear Unit (PReLU) as the non-linear activation function due to its accuracy improvement over the Rectified Linear Unit (ReLU).
Specifically, MobileFaceNet [5] uses the residual bottlenecks proposed in MobileNetV2 as its main building blocks, while ShuffleFaceNet [22] is based on the extremely efficient network ShuffleNetV2, where the building blocks in stages 2-4 consist of DenseNet blocks and the number of channels in each block is scaled to generate four networks of different complexities, denoted as 0.5×, 1×, 1.5× and 2×. Taking into account the results obtained in [22], where ShuffleFaceNet 1.5× presented the best trade-off between speed and accuracy, we use this model and refer to it as ShuffleFaceNet in the remainder of this work. In addition, both lightweight networks adopt a fast downsampling strategy at the beginning of the network, an early dimension-reduction strategy at the last several convolutional layers, and a linear 1 × 1 convolution layer following a linear global depthwise convolution layer as the feature output layer.
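The difference between GAP and GDC mentioned above can be illustrated with a small plain-Python sketch on a toy feature map. It is not taken from either network's implementation: it only shows that GAP weights every spatial position equally, while a global depthwise convolution (ignoring any bias term) learns one weight per channel and position, so GAP corresponds to the special case of uniform weights. All values below are made up for illustration.

```python
def global_average_pool(feature_map):
    """feature_map[c][i][j] -> one value per channel, all positions weighted equally."""
    return [sum(map(sum, fmap)) / (len(fmap) * len(fmap[0])) for fmap in feature_map]

def global_depthwise_conv(feature_map, kernels):
    """Depthwise convolution whose kernel covers the whole map:
    one weight per channel and spatial position, one output per channel."""
    outputs = []
    for fmap, kernel in zip(feature_map, kernels):
        outputs.append(sum(f * w
                           for f_row, w_row in zip(fmap, kernel)
                           for f, w in zip(f_row, w_row)))
    return outputs

features = [[[1.0, 2.0], [3.0, 4.0]]]        # 1 channel, 2 x 2 map
uniform = [[[0.25, 0.25], [0.25, 0.25]]]     # reproduces GAP
weighted = [[[0.0, 0.0], [0.5, 0.5]]]        # emphasizes the bottom row

print(global_average_pool(features))              # [2.5]
print(global_depthwise_conv(features, uniform))   # [2.5]
print(global_depthwise_conv(features, weighted))  # [3.5]
```

Letting the weights differ per position allows the network to give more importance to some regions of the final feature map than to others, which is the motivation usually given for preferring GDC in face models.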

A. IMPLEMENTATION DETAILS
The lightweight face CNNs, pretrained on the cleaned MS1M dataset [11], are independently fine-tuned on masked and periocular images created from the CASIA-WebFace dataset. For all the models, random horizontal flipping is used as the augmentation strategy. We adopt the Stochastic Gradient Descent (SGD) optimizer with a batch size of 128/256/512 due to limited GPU memory, and the fine-tuning is carried out on two Nvidia GeForce GTX 1080Ti (11GB) GPUs. The learning rate is initialized to 0.1 and decreased by a factor of 10 at 100K, 140K and 160K iterations. The total number of iterations is set to 200K. The momentum is set to 0.9 and the weight decay to 5e-4. The convolution parameters are initialized with Xavier initialization, sampling from a Gaussian distribution. For VarGFaceNet, MobileFaceNet and ShuffleFaceNet, we use the ArcFace loss function with an angular margin m = 0.5, which turned out to be the best setting as specified in [24]. For all face models, we directly take the embedding feature after the last convolutional layer as the face representation and use the cosine similarity to obtain the matching scores.
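Three elements of this setup can be sketched in a few lines: the step learning-rate schedule, the additive angular margin applied by ArcFace to the ground-truth class, and the cosine similarity used for matching. This is a minimal illustration rather than the training code: the feature scale `scale=64.0` is an assumption (a commonly used value that is not given in the text), the drop is assumed to take effect at the milestone iteration itself, and the usual safeguard for angles that exceed pi after adding the margin is omitted.

```python
import math

# Schedule stated in the text: start at 0.1, divide by 10 at 100K, 140K and 160K.
MILESTONES = (100_000, 140_000, 160_000)

def learning_rate(iteration, base_lr=0.1, milestones=MILESTONES, factor=0.1):
    """Step schedule: multiply by `factor` once for every milestone reached."""
    drops = sum(iteration >= m for m in milestones)
    return base_lr * factor ** drops

def arcface_target_logit(cos_theta, margin=0.5, scale=64.0):
    """Logit of the ground-truth class: scale * cos(theta + margin).

    scale=64.0 is an assumed value; margin=0.5 is the value from the text.
    """
    theta = math.acos(max(-1.0, min(1.0, cos_theta)))
    return scale * math.cos(theta + margin)

def cosine_similarity(a, b):
    """Matching score between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

for it in (0, 100_000, 140_000, 160_000, 199_999):
    print(it, learning_rate(it))

# The margin lowers the target logit compared with the plain scaled cosine.
print(arcface_target_logit(0.8), 64.0 * 0.8)
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # 1.0
```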
For data preprocessing, the RetinaFace detector [9] is applied to detect all faces and landmark points, which are used to align and crop each face into a template of size 112 × 112. Each pixel of the RGB images (in the range [0, 255]) is then normalized to approximately [-1, 1] by subtracting the mean pixel value, i.e., 127.5, and dividing by 128.
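The pixel normalization step can be written directly from this description. The sketch below uses nested Python lists for simplicity; the function names are illustrative, and a real pipeline would apply the same arithmetic to an array type.

```python
def normalize_pixel(value):
    """Map an intensity in [0, 255] to roughly [-1, 1]."""
    return (value - 127.5) / 128.0

def normalize_image(image):
    """image[row][col][channel] with 0..255 intensities -> normalized copy."""
    return [[[normalize_pixel(c) for c in pixel] for pixel in row] for row in image]

print(normalize_pixel(0))    # -0.99609375
print(normalize_pixel(255))  # 0.99609375
print(normalize_image([[[0, 127, 255]]]))
```

Because the divisor is 128 rather than 127.5, the extreme values map to about ±0.996 instead of exactly ±1.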

V. EXPERIMENTAL EVALUATION
In this section, we assess the effectiveness of the trained ShuffleFaceNet, MobileFaceNet and VarGFaceNet on both masked and periocular face recognition datasets and compare them with state-of-the-art solutions. Moreover, we analyze the effect of using the lightweight masked face models for face recognition on unmasked benchmarks. In addition, we analyze the computational efficiency of these lightweight models compared with some state-of-the-art deep face models used for the problem of masked face recognition.

A. MASKED FACE RECOGNITION
In this scenario, all lightweight deep models trained on the Simulated Masked CASIA-WebFace dataset are tested on several datasets for both simulated and real masked face recognition. As a baseline, we compare these models with their original versions, which were not fine-tuned on masked face images.

1) Results on simulated masked datasets
In order to assess the performance of recognizing faces with and without masks, we first conduct experiments on the simulated masked versions of LFW, AgeDB-30 and CALFW. We evaluate two matching configurations: a) both faces of each pair wear a simulated mask (masked vs. masked), and b) each pair is composed of one masked face and one unmasked face (masked vs. unmasked). In all cases we evaluate the original models and their fine-tuned versions.
In Table 1 and Table 2, we present the face verification results obtained for the two matching configurations, in terms of Equal Error Rate (EER) and Accuracy (Acc). For the three datasets, all the lightweight models fine-tuned on masked images outperform the models that have not been trained on masked facial images. The greatest improvements are obtained for masked faces with age variations. Among the models, MobileFaceNet-Mask achieves the best results on the three databases. Comparing Table 1 against Table 2, better results are obtained when at least one of the images is unmasked. This is a desirable property for real applications, where the enrolled images are usually captured in normal conditions (unmasked).
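For readers less familiar with the two verification metrics, the sketch below shows one common way of computing EER and accuracy from similarity scores of genuine and impostor pairs. It is a simplified illustration, not the evaluation code used in this work: the function names are hypothetical, a pair is assumed to be accepted when its score is greater than or equal to the threshold, and the EER is approximated by sweeping the observed scores as candidate thresholds.

```python
def far_frr(genuine, impostor, threshold):
    """False accept rate and false reject rate at a given similarity threshold."""
    if not genuine or not impostor:
        raise ValueError("both score lists must be non-empty")
    far = sum(s >= threshold for s in impostor) / len(impostor)
    frr = sum(s < threshold for s in genuine) / len(genuine)
    return far, frr


def equal_error_rate(genuine, impostor):
    """Return (eer, threshold) at the observed score where |FAR - FRR| is smallest."""
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        far, frr = far_frr(genuine, impostor, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, t)
    return best[1], best[2]


def accuracy(genuine, impostor, threshold):
    """Fraction of pairs classified correctly at the given threshold."""
    correct = sum(s >= threshold for s in genuine) + sum(s < threshold for s in impostor)
    return correct / (len(genuine) + len(impostor))
```

For example, with genuine scores [0.9, 0.8, 0.7, 0.4] and impostor scores [0.1, 0.2, 0.3, 0.6], the sweep finds FAR = FRR = 0.25 at threshold 0.6, where the accuracy is 0.75.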

2) Results on RMFR-CEN dataset
In order to study the effectiveness of face models trained with simulated masks on real-world masked faces, we use the RMFR-CEN dataset collected in our laboratory. We follow the verification and identification protocols defined for this dataset in Section III-B, which consider comparisons of masked face images against unmasked images. The obtained results are presented in Table 3 and Table 4, respectively. It can be seen that, also on real masked images, for both the verification and identification protocols, the models fine-tuned with synthetic masks considerably improve the performance of the original face models. Specifically, all mask-based models increase the TAR@FAR=1% and the Rank-1 results by at least 9% with respect to the unmasked models. Among the masked models, VarGFaceNet-Mask obtains the best verification results, while MobileFaceNet-Mask achieves the highest identification performance, especially at Rank-1. When we test the lightweight masked models on face images with real masks, the improvements over the original models are more remarkable. However, the overall performance is not as good as on the synthetic datasets, which still leaves a large margin for improvement.
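The two figures of merit used here, TAR at a fixed FAR and Rank-k identification accuracy, can be illustrated with the following plain-Python sketch. It is an assumption-laden illustration rather than the protocol implementation: names are hypothetical, scores are similarities (higher means more similar), a comparison is accepted when its score is greater than or equal to the threshold, and the gallery is assumed to contain one entry per identity.

```python
def tar_at_far(genuine, impostor, target_far):
    """Highest true accept rate among observed thresholds whose FAR <= target_far.

    Returns 0.0 if no observed threshold meets the FAR constraint.
    """
    if not genuine or not impostor:
        raise ValueError("both score lists must be non-empty")
    best_tar = 0.0
    for t in sorted(set(genuine) | set(impostor)):
        far = sum(s >= t for s in impostor) / len(impostor)
        if far <= target_far:
            tar = sum(s >= t for s in genuine) / len(genuine)
            best_tar = max(best_tar, tar)
    return best_tar


def rank_k_accuracy(score_matrix, probe_labels, gallery_labels, k=1):
    """Fraction of probes whose true identity is among the k best gallery matches.

    score_matrix[i][j] is the similarity between probe i and gallery entry j.
    """
    hits = 0
    for scores, true_label in zip(score_matrix, probe_labels):
        order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        top_labels = [gallery_labels[j] for j in order[:k]]
        hits += true_label in top_labels
    return hits / len(probe_labels)
```

In the setting described above, the probes would be masked face images and the gallery entries unmasked enrolment images.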

3) Comparison with state-of-the-art
In order to compare the lightweight masked face models with existing solutions for the masked face recognition problem, we assess their performance on the MFR2 dataset, the InsightFace track of the Masked Face Recognition Challenge at ICCV 2021, and the AR Face dataset. Table 5 presents comparative results achieved on the MFR2 dataset by the lightweight face models and the FaceNet model with and without fine-tuning. The results are reported using the evaluation criteria Max Accuracy and TPR@FAR=0.2%, described in [2]. As can be seen, all the lightweight masked face models achieve higher verification performance than FaceNet-FT, with MobileFaceNet-Mask being the best one in terms of TPR@FAR=0.2%. In addition, we can observe that fine-tuning face recognition models with masked face images (real or simulated) improves the verification performance. In Table 6, we present the verification performance on the InsightFace track of the Masked Face Recognition Challenge at ICCV 2021, where TPR is measured on the mask-to-non-mask 1:1 protocol, with FAR less than 0.01% (1e-4). Further details are also given, such as the training dataset, the model size and the inference time. In all cases the ArcFace loss function was used. We compare our lightweight masked face models with the provided baseline solutions based on the ResNet architecture. It is important to note that we did not participate in the competition; we only assess the performance of our models on the test set. Thus, the comparison with the reported baseline results is not entirely fair, since most of the baselines employ different and bigger training sets such as Glint360K and MS1MV3, which contributes to the differences in performance. It can be seen that all ResNet baseline models significantly increase their verification accuracy by using the Glint360K dataset. For example, the R100 backbone trained on the Glint360K dataset outperforms by more than 40% the results obtained with the CASIA dataset, which is the one we used.
Although our lightweight masked face models were not trained with the datasets provided by the competition (MS1M, Glint360K), they considerably improve on the verification performance of the R100 trained on CASIA. In particular, MobileFaceNet-Mask, the best performing one, surpasses the R100 trained on CASIA by more than 28%. Moreover, MobileFaceNet-Mask obtains better results than both versions of R18 trained on bigger training sets. In the future, we plan to employ some of these datasets for fine-tuning our masked face models in order to increase their accuracy. In addition, all lightweight models have the smallest model sizes and very low inference times.
In Table 7, recognition accuracy at Rank-1 is reported on the Scarf subset of the AR Face database. The performance of the lightweight face models fine-tuned with mask images is compared with that reported by state-of-the-art methods proposed for face recognition under occlusions. As we can see, the three masked face models achieve perfect recognition rates under this kind of occlusion, which is somewhat similar to the one caused by the presence of masks. These results outperform specific methods devoted to handling occlusions, such as ArcFace-FT [34] and PDifferentialSiamese [34], that have been evaluated on this database.

B. PERIOCULAR FACE RECOGNITION
In this section, we test the lightweight face models that were trained on the periocular CASIA-WebFace dataset on the periocular datasets obtained by cropping the images of the LFW, AgeDB-30, CALFW and RMFR-CEN databases. To evaluate the effectiveness of using the periocular region, we compare their performance with the results obtained by fine-tuning on masked face images. Moreover, we compare the periocular models with some state-of-the-art periocular algorithms.
Compared with the results in Tables 1 and 2, it can be appreciated that by using the periocular information we are not able to improve on the results achieved by the models fine-tuned on masked images, especially in the presence of age variations. In Table 9 and Table 10, we present the verification and identification performance obtained on the Periocular RMFR-CEN dataset, respectively. Specifically, if we compare the performance of the periocular models with that of the masked models (Tables 3 and 4, respectively), we can observe that in the case of verification, for high FAR values (e.g. 30%), the degradation in performance is bigger, while in the case of identification experiments, for a lower FAR=1% the results are closer. Also in this case, MobileFaceNet achieves the best results and exhibits the greatest differences between the masked and periocular versions.

2) Comparison with state-of-the-art
In order to compare the periocular lightweight models with state-of-the-art periocular algorithms, we follow the protocol used in [36] for the AR Face database, where a large number of periocular methods have been evaluated. The identification accuracies obtained at Rank-1 and Rank-5 are listed in Table 11. As we can observe, similar to the masked-based models, the three periocular lightweight models reach 100% identification accuracy, outperforming all the reported state-of-the-art methods.

Table 12 presents the verification results on the LFW and AgeDB-30 datasets, while Table 13 shows those obtained on the CALFW and CPLFW benchmarks. We can see that, for general unconstrained images in the LFW dataset, the lightweight masked models obtain results very close to those of their unmasked versions and even to the state of the art. However, for age and pose variations (Table 13), the impact on performance is greater. Among the masked models, MobileFaceNet-Mask is the least affected by the different covariates.
In order to evaluate the masked models on large-scale datasets, we use the Janus Benchmark-B (IJB-B) dataset [40], which consists of 1,845 subjects with 21,798 still images and 55,026 frames from 7,011 videos. Specifically, we follow [...] Table 14 presents the verification and identification performance of the masked face models and of the state-of-the-art methods evaluated in [24]. For the verification task, IJB-B provides 12,115 templates with 10,270 genuine matches and 8M impostor matches. In the case of verification, we compare the TAR at FAR values of 0.01% (1e-4) and 0.1% (1e-3), while for 1:N identification, we compare the Rank-1 and Rank-5 accuracy for the closed-set protocol and IET@FPIR=0.1% for the open-set protocol [40]. For both verification and identification, the masked-based models also achieve worse results than their unmasked versions. The greatest differences in performance are shown by VarGFaceNet-Mask and ShuffleFaceNet-Mask, while MobileFaceNet-Mask presents the lowest drops. From the results presented in Table 14, in unmasked environments it is still a better option to use models trained on images without masks. Unsurprisingly, current models are not able to fully generalize over masked and unmasked faces. Thus, for applications where both masked and unmasked faces can be present, an alternative to consider is to introduce a preliminary stage in which face masks are detected.
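The two-stage alternative mentioned above, a mask detector followed by the model best suited to the detected condition, could be organized as in the minimal sketch below. Everything here is hypothetical: `mask_detector`, `masked_model` and `unmasked_model` are placeholder callables standing in for real components, and no such pipeline is evaluated in this work.

```python
def make_router(mask_detector, masked_model, unmasked_model):
    """Build an embedding function that picks a model according to the mask state.

    mask_detector(face) -> bool, True when a mask is detected (hypothetical).
    masked_model(face), unmasked_model(face) -> embedding (hypothetical).
    """
    def embed(face):
        # Route masked probes to the mask fine-tuned model, others to the original one.
        model = masked_model if mask_detector(face) else unmasked_model
        return model(face)

    return embed
```

One design consideration with this kind of routing is that embeddings produced by two separately trained models are generally not directly comparable, so the enrolled gallery would likely need to be embedded with both models.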

D. NETWORK VISUALIZATION
In order to qualitatively analyze and interpret the experimental results, the Gradient-weighted Class Activation Mapping (Grad-CAM) [33] technique is adopted to localize the discriminative areas. Grad-CAM uses the gradient information flowing into the last convolutional layer of the CNN to assign importance values to each neuron for a particular decision of interest, without any modification of the network architecture.
The generated Grad-CAM maps of the lightweight masked and periocular models are shown in Figure 7. In addition, we include the Grad-CAM maps of the R100-ArcFace model. In the figure, the first two columns correspond to the visualizations over masked face images from the Simulated Masked CASIA-WebFace, and the last two columns to periocular images from the Periocular CASIA-WebFace. As we can see, the Grad-CAM maps vary greatly across models. However, all the masked models assign a very low weight to the mask region. In general, the regions activated in the feature maps of the MobileFaceNet models are bigger than those of the other models. On the other hand, if we compare the maps of the masked models against those of the periocular ones, we find that the masked models are more focused on the area around the eyes.
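The core Grad-CAM computation can be sketched as follows, following the formulation of the original Grad-CAM method [33]: each feature map of the last convolutional layer is weighted by the spatial average of its gradients, the weighted maps are summed, and a ReLU keeps only the positively contributing regions. The sketch assumes that the activations and the gradients of the score of interest have already been extracted from the network (in practice via an automatic differentiation framework); the function name and the nested-list layout are illustrative only.

```python
def grad_cam(activations, gradients):
    """Compute a Grad-CAM map from K feature maps and their gradients.

    activations[k][i][j]: value of feature map k at position (i, j).
    gradients[k][i][j]: gradient of the score of interest w.r.t. that value.
    Returns an H x W map (nested lists) before any upsampling or normalization.
    """
    if not activations or len(activations) != len(gradients):
        raise ValueError("need the same non-zero number of activation and gradient maps")
    height, width = len(activations[0]), len(activations[0][0])
    # Channel weights: global average pooling of the gradients of each feature map.
    weights = [sum(sum(row) for row in grad) / (height * width) for grad in gradients]
    cam = [[0.0] * width for _ in range(height)]
    for alpha, fmap in zip(weights, activations):
        for i in range(height):
            for j in range(width):
                cam[i][j] += alpha * fmap[i][j]
    # ReLU: keep only locations with a positive influence on the score.
    return [[max(0.0, v) for v in row] for row in cam]
```

To obtain a visualization, the resulting map is usually upsampled to the input resolution and overlaid on the face image as a heat map.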

FIGURE 7: Grad-CAM visualizations of the fine-tuned lightweight face networks (ShuffleFaceNet, VarGFaceNet, MobileFaceNet) and the state-of-the-art R100-ArcFace model for face images from the Simulated Masked and the Periocular CASIA-WebFace training datasets. The first two columns correspond to visualizations over masked face images and the last two to periocular images.

The lightweight models remarkably improve on the efficiency of the considered state-of-the-art models in all the requirements measured. The size of the biggest lightweight model (VarGFaceNet) is 10 times smaller than that of the well-known R100-ArcFace model, while the number of parameters is 13 times lower. The results indicate that the lightweight models have the best deployment capacity, which makes them potentially suitable and practical for use in embedded and low computational power devices.

VI. SUMMARY AND CONCLUSIONS
In this paper, we have presented a comprehensive study and evaluation of the performance of three state-of-the-art lightweight face models in order to address the effect of wearing masks on face recognition scenarios. To this end, two different approaches are investigated: on the one hand, the models are fine-tuned with several masked face images and on the other hand, the periocular information is considered.
Due to the lack of public datasets containing real masked face images, we created simulated masked datasets by placing synthetic masks over the face images from the CASIA-WebFace dataset for training, and from well-established face benchmarks for testing. Moreover, we tested the trained models on some real masked datasets and also collected a database, named RMFR-CEN, of 100 subjects from our laboratory with real masked face images. The proposed dataset is part of an ongoing effort to gather a larger-scale database with realistic variations and will be available upon request. In order to compare the two considered approaches under the same conditions, periocular versions of these datasets were also constructed for training and evaluation.
From the experimental evaluation, our study pointed out the significant drop in the performance of existing face recognition solutions when considering masked face probes, especially in realistic scenarios. We found that by fine-tuning the models on masked faces, we are able to achieve better results than by using the periocular region. Moreover, we corroborated that the models obtain a higher accuracy when matching masked vs. unmasked images, which is an important aspect in the development of real applications. Compared with existing solutions for the masked face recognition problem, which are based on heavier deep networks, the considered lightweight models showed a very competitive performance. This indicates that using larger and deeper deep learning models does not, by itself, necessarily lead to higher recognition performance.
In addition, we observed that the masked-based models can recognize unmasked faces in general unconstrained scenarios. However, there is still room for improvement when there are more drastic appearance variations in the faces, such as those caused by aging and larger variations in pose. These conclusions open opportunities to propose new methods, algorithms, architectures, and/or loss functions that yield models able to generalize better in the presence of facial artifacts such as masks, in order to provide existing face recognition systems with greater robustness to people wearing such a necessary accessory in times of COVID-19. Also, detecting masks as an additional functionality is a real necessity.
Regarding the efficiency of the lightweight face architectures employed in this study, we assessed their computational requirements and compared them with some of the state-of-the-art methods that have been used in the literature for recognizing faces wearing masks. As a result, we showed that the lightweight models are potentially suitable for deployment in embedded and low computational power devices.
As future work, we plan to continue collecting more real masked images in our laboratory in order to enrich the proposed RMFR-CEN database. Moreover, although masked-based models allow us to obtain higher recognition performance than periocular-based models, we think that combining both approaches could improve the performance of current masked face recognition solutions.