M2FRED: Mobile Masked Face REcognition Through Periocular Dynamics Analysis

Recent regulations to block the widespread transmission of the COVID-19 disease impose the use of facial masks indoors and outdoors. Such a restriction becomes critical in all those scenarios where access controls benefit from biometric recognition systems. The occlusions due to the presence of a facial mask make a significant portion of the human face unavailable for feature extraction and analysis. This work explores the contribution of the periocular region of the face alone to achieve a robust recognition approach suitable for mobile devices. Rather than performing a static analysis of the facial features, as is largely done by work on periocular recognition in the literature, the proposed study focuses on the analysis of face dynamics, so that the spatio-temporal features make the recogniser frame-independent and tolerant to user movements during the acquisition. To obtain lightweight processing, compliant with the limited computing power of mobile devices, the spatio-temporal representation of the periocular region has been analysed and classified through Machine Learning approaches. The experimental discussion has been performed on a new dataset, the Mobile Masked Face REcognition Database, specifically designed to analyse the dynamics of the periocular region in the presence of facial masks. For a wider comparative analysis, a publicly available dataset called XM2VTS has been considered, and Deep Learning solutions have been experimented with to discuss the challenging aspects of the recognition problem. Moreover, a summary of the state of the art on periocular recognition driven by the COVID pandemic is presented, showing how the research efforts in this field have focused on the recognition of still images. Experimental results show promising levels of performance as well as limitations of the proposed approach, creating the premises for future directions.

recognition system in the presence of facial masks and based on Machine Learning approaches. The literature presents a wide collection of approaches and results for periocular analysis for biometric purposes. However, the systems proposed in the literature consider the periocular region as a static biometric trait. Geometric features and landmarks, as well as pixel appearance and their distribution, are often considered in a static way. Even though the experimental results show high performance of several models on datasets built on this premise, the operative conditions of face recognition in real scenarios suggest that those high levels of performance are hard to achieve. In fact, facial expressions and the presence of several disturbing factors (e.g., eyeglasses, makeup or, recently, facial masks) may introduce insightful new dynamic patterns that can uniquely trace back to the identity of a single individual. In general terms, facial expressions undoubtedly affect periocular recognition approaches [12]. The impact can be potentially negative when the analysis is performed on a static representation of the facial features. The key point of our work is indeed exploring the dynamics of facial features and taking into consideration the involuntary micro/macro expressions that people perform while speaking. In this direction, working with video acquisitions may provide more reliable information but, more importantly, makes it possible to provide an expression-invariant solution that can deal with those (in)voluntary expressions affecting the recognition performance. To the best of our knowledge, this work is the first attempt at exploiting the dynamics of facial features rather than working on still images or single frames of a video sequence.
We tried to put these challenges together by providing a new dataset of video recordings of people wearing facial masks and assessing the level of performance that standard classifiers can achieve by exploiting/analysing the dynamics of periocular features, versus the performance achieved by using still images of the same subject. The main contributions of this work can be summarised as follows:
• Realise a robust biometric recognition system, which is tolerant to noise or expressions of a single acquisition, so reducing the misclassification error.

• Combine the periocular region features over time to achieve a compact temporal representation that can be used as input for Machine Learning classifiers.
The rest of the paper is organised as follows: Section II discusses similar approaches in the literature that deal with periocular analysis, Section III presents the proposed approach, while Section IV focuses on a wide experimental discussion that involves two datasets; Section V draws the Conclusions of this work.
ysed over the past decade. According to [21], in [22] [36] present a useful and comprehensive review of artificial intelligence models that have been used to detect face masks. The body of research in the literature is on the production of masked face image datasets and algorithms for detecting whether a subject is wearing a mask or not. In [37], the authors propose an advanced deep learning model for face mask detection in real-time video streams. In this regard, Negi et al.
[38] employ two well-known deep neural network architectures with transfer learning for face mask detection using the Simulated Masked Face Recognition Dataset. Recently, the focus has shifted to subject recognition in the presence of masks, which is nowadays known as Masked Face Recognition (MFR). As an example, Li et al. [39] propose a new method for masked face recognition by integrating a cropping-based approach with the Convolutional Block Attention Module (CBAM). Optimal cropping is explored for each case, while the CBAM module is adopted to focus on the region around the eyes. In this direction, Damer et al. [40] present a study on the impact of facial masks on the behaviour of renowned face recognition systems. Their findings revealed the expected observation that wearing a mask has a considerable negative impact on the recognition performance, thus highlighting the need for alternative approaches to traditional facial recognition algorithms to deal with the difficulties introduced by the presence of masks.

The proposed approach is based on the hypothesis that facial dynamics can be an added value to biometric face recognition. In fact, they can be seen as a kind of signature that identifies
The product of the number of features and the number of frames, K × N, represents the size of the DEF vector. Finally, a normalisation procedure is applied to the DEF vector to make the data structurally homogeneous and ready for classification.
applies a three-dimensional convolutional filter to the dataset, and the filter moves in three directions (x, y, z) to calculate the low-level feature representations. The output shape is a 3D volume space such as a cube or cuboid.
Each participant has pronounced the following sentences:
took place on a separate day of the week, interspersed with indoor and outdoor acquisitions. Figure 6 shows some sample images extracted from M2FRED videos.
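The DEF construction described above can be sketched as follows; the min-max normalisation is an assumption on our part, since the exact normalisation procedure is not spelled out here:

```python
import numpy as np

def build_def_vector(frame_features):
    """Stack per-frame periocular feature vectors into a single
    Dynamic Evolution of Features (DEF) vector.

    frame_features: array of shape (N, K) -- N frames, K features
    per frame. Returns a 1-D vector of length K * N, min-max
    normalised to [0, 1] (assumed normalisation).
    """
    feats = np.asarray(frame_features, dtype=float)
    def_vec = feats.reshape(-1)            # flatten to size K * N
    lo, hi = def_vec.min(), def_vec.max()
    if hi > lo:                            # avoid division by zero
        def_vec = (def_vec - lo) / (hi - lo)
    return def_vec
```

The flattened layout keeps each frame's features contiguous, so the classifier sees the temporal evolution of every feature across the window.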

In order to emphasise the contribution of M2FRED, a comparison of the proposed dataset against existing masked face datasets currently available in the literature is presented in

The following experiments use the geometric features. The results reported in Table 2 summarise the levels of performance in the dynamic and static settings. In addition, the number of frames was tuned, considering a maximum number of frames (sample-61) corresponding to a time window size of 61 consecutive video frames and a minimum number of frames (sample-10) corresponding to a 10-frame time window size (such a double setting is not presented in Table 2 for XM2VTS with 294 subjects due to the insufficient discriminating information with the sample-10 configuration).
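The sample-61 and sample-10 settings above amount to cutting each video into fixed-size time windows. A minimal sketch, assuming consecutive non-overlapping windows (the overlap policy is not stated here):

```python
def sample_windows(num_frames, window_size):
    """Split a video of num_frames into consecutive, non-overlapping
    time windows of window_size frames (sample-61 or sample-10 in
    the terminology above). Returns (start, end) index pairs; a
    trailing partial window is dropped.
    """
    return [(s, s + window_size)
            for s in range(0, num_frames - window_size + 1, window_size)]
```

A 130-frame clip thus yields two sample-61 windows but twelve sample-10 windows, which is why the smaller window produces more (albeit shorter) training samples.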

Although with the sample-10 setting there is a loss of information in terms of frames considered, Table 3 shows how the performance on 50 subjects, for 2 out of 4 classifiers, is slightly better than with the maximum number of frames considered. Instead, static landmark experiments produce weaker effects than dynamic landmarks, with a general lowering of performance (about 10% in all three classifiers). This decrease in performance can also be seen in the ROC curves in Figure 9, showing the smallest AUC score for all the classifiers used.

It is useful to compare the results reported in Table 3 (50 subjects from XM2VTS) with Table 4 (43 subjects from M2FRED) to understand how the uncontrolled environment negatively affects the performance. In fact, a general decrease in accuracy can be noted for all the models used: e.g., the accuracy for the MLP reaches 89% in Table 3, while in Table 4 it is equal to 60%. Finally, for the experiments in Table 3 the best configuration is the sample-61 setting, while in Table 4 the sample-10 setting shows better results.

The contribution of the dynamics is also visible in the
section III-C1. It is possible to notice a general improvement in the classification performance in comparison to the geometric features using the same datasets and the same number of subjects. Another improvement can be found in the performance achieved on M2FRED compared to XM2VTS. The former shows better results even though the video acquisitions are made in-the-wild, compared to the more controlled acquisitions of XM2VTS. Moreover, the experiments on M2FRED were conducted on subjects both with and without masks. The increase in performance is also reflected in the ROC curves which, in the best case, for the M2FRED dataset, show an AUC equal to 0.99 (Figure 12).
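The appearance-based descriptor used in these experiments, LBP-TOP, concatenates Local Binary Pattern histograms computed on three orthogonal planes of the video volume. A simplified sketch under stated assumptions: basic 8-neighbour LBP without interpolation, and only the three central slices (the full descriptor averages histograms over all slices of each plane):

```python
import numpy as np

def lbp_hist(plane):
    """Normalised 8-neighbour LBP histogram of one 2-D plane
    (simplified: no circular interpolation, no uniform patterns)."""
    c = plane[1:-1, 1:-1]
    neighbours = [plane[0:-2, 0:-2], plane[0:-2, 1:-1], plane[0:-2, 2:],
                  plane[1:-1, 2:],   plane[2:, 2:],     plane[2:, 1:-1],
                  plane[2:, 0:-2],   plane[1:-1, 0:-2]]
    codes = np.zeros_like(c, dtype=np.uint8)
    for bit, n in enumerate(neighbours):
        codes |= ((n >= c) << bit).astype(np.uint8)   # one bit per neighbour
    hist, _ = np.histogram(codes, bins=256, range=(0, 256))
    return hist / hist.sum()

def lbp_top(volume):
    """LBP-TOP descriptor of a (T, H, W) grey-level video volume:
    LBP histograms on the XY, XT and YT central planes, concatenated."""
    t, h, w = volume.shape
    xy = volume[t // 2]          # spatial appearance
    xt = volume[:, h // 2, :]    # horizontal motion
    yt = volume[:, :, w // 2]    # vertical motion
    return np.concatenate([lbp_hist(p) for p in (xy, xt, yt)])
```

The XT and YT planes are what make the descriptor dynamic: they encode how pixel patterns evolve over time rather than how they look in a single frame.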

The following experiments use a 3D CNN: a Deep Learning approach described in Section III-C2. As for the previous methods, the comparative analysis is carried out considering the XM2VTS and M2FRED datasets. Two experimental settings were taken into consideration, tuning the depth level of the 3D CNN input space: depth-16 and depth-10. As shown in Table 5, the 3D CNN reaches the best results in comparison with all other methods in terms of accuracy (96% in its best configuration). Unlike the geometric features and LBP-TOP, the experiments that involve the XM2VTS dataset show better results in comparison with the M2FRED dataset. According to the ROC curves in Figure 12(b), the best results are obtained with XM2VTS#50, which shows an AUC score of 0.96. Even considering all possible optimisations for the 3D CNN, such a temporal deep model cannot run smoothly on mobiles. From this point of view, the MLP is a more adequate choice for mobile devices, despite the hardware limitations for real-time processing. All other standard classifiers suffer when the number of different subjects to recognise increases. By an accurate inspection of the ROC curves, separately from the aggregated results in Table 5, it can be observed that the performance achieved is not suitable for a reliable biometric system in the majority of the settings. On the other hand, the results achieved on M2FRED show again that the challenging acquisition conditions and the diverse contributions of the noise affecting the video recordings do not limit the performance of the SVM over the competitors. It even overtakes, in accuracy, the performance of the 3D CNN, thus resulting, for that dataset, in the choice to be preferred. Moreover, the SVM classifiers can be feasibly implemented on mobile devices, so meeting another goal of our work.
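The core operation of the 3D CNN, a filter sliding along the three directions (x, y, z) of the input volume, can be illustrated with a naive valid-mode convolution. This is an illustration only; a real network stacks many such filters with learned weights, non-linearities and pooling:

```python
import numpy as np

def conv3d(volume, kernel):
    """Valid-mode 3-D convolution (cross-correlation form): the
    kernel slides along all three axes of the input volume and
    produces the cuboid of low-level spatio-temporal responses a
    3D CNN layer computes before its non-linearity."""
    kd, kh, kw = kernel.shape
    d, h, w = volume.shape
    out = np.empty((d - kd + 1, h - kh + 1, w - kw + 1))
    for z in range(out.shape[0]):
        for y in range(out.shape[1]):
            for x in range(out.shape[2]):
                out[z, y, x] = np.sum(
                    volume[z:z + kd, y:y + kh, x:x + kw] * kernel)
    return out
```

The triple loop makes the cost of the operation explicit, which is precisely why such temporal deep models struggle on mobile hardware.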
Comparing the experiments on a fair number of subjects in XM2VTS and M2FRED, it can again be observed that the MLP is able to effectively deal with the stabilised acquisition conditions of XM2VTS, training the hyperparameters to correctly classify the subjects. SVM and RF instead collapse in performance on XM2VTS#50, while on M2FRED#43 they show promising classification accuracy. As happens for the geometric approach, the SVM does not take advantage of the laboratory acquisition conditions of the XM2VTS dataset; rather, the noise affecting the uncontrolled acquisition of M2FRED proves useful in separating the classes. In the appearance-based setting, the SVM is moreover the best-performing as an authentication approach. The ROC curve in Figure 12 clearly shows the superior trend of its curve compared to the others, even when compared with the behaviour of the 3D CNN.
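A minimal sketch of the classification stage with two of the classifiers compared above, using scikit-learn; the data below are synthetic placeholders, not samples from M2FRED or XM2VTS, and the hyperparameters are illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Synthetic stand-ins for pre-computed DEF vectors and identities.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 40))        # 200 DEF vectors, 40 features each
y = rng.integers(0, 4, size=200)      # 4 synthetic identities
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# The two classifiers discussed in the text; RBF kernel and one
# hidden layer are assumed configurations.
svm = SVC(kernel="rbf", probability=True).fit(X_tr, y_tr)
mlp = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500,
                    random_state=0).fit(X_tr, y_tr)
print(svm.score(X_te, y_te), mlp.score(X_te, y_te))
```

With `probability=True`, the SVM also exposes per-class scores, which is what an authentication setting needs to threshold genuine versus impostor attempts.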

An analysis of the failure cases was carried out to identify any significant drawbacks of the implemented approaches as well as possible gender- or age-related biases. Through a targeted inspection of the experimental results achieved on the two datasets considered in this study, we identify three classes of failures for which the proposed approach misclassifies the identities.

They are gender, age, and the presence of glasses; this last one can be considered as noise in periocular-based identification systems. We identify three classes for age in
be supposed that their presence is a sufficient condition for the extraction algorithms to fail. But this is not true. As can be seen in Table 6, there is a good percentage of people wearing glasses (34.58% for M2FRED and 39.53% for XM2VTS, respectively) and most of them are correctly acquired and classified. In some cases, the eyeglass frames occlude the periocular region and add features that are hard to separate from the true facial features of the acquired subject. In the case of non-ideal conditions, as discussed above, the reflections of the light can again be considered as a case of occlusion. However, eyeglasses are a true source of issues when the lenses alter the shape and the proportions of the face (this can be better appreciated in the samples shown in Figure 15). This, in turn, causes the loss of facial geometries and, hence, the failure of periocular feature extraction. Moreover, since the proposed method in this study is based on a continuous collection of frames, it is important to point out that the geometric feature extraction algorithm might fail for some of the frames in the time window. A failure in these cases is explained by the fact that the number of faulty frames is high: the biometric template becomes unavailable and the training of the classifiers impossible to perform properly. The number of incorrect classifications was calculated during the testing phase by using the static and dynamic features
percentage of the subjects belonging to each class. Dynamic and static features are considered respectively. It can be observed that the percentage of males is higher than that of females, in each configuration.
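The frame-failure condition described above can be expressed as a simple usability check on a time window; the 20% tolerance threshold below is a hypothetical value introduced for illustration, not one taken from the experiments:

```python
def window_is_usable(frame_ok_flags, max_faulty_ratio=0.2):
    """Decide whether a time window can still produce a biometric
    template. frame_ok_flags holds one boolean per frame, True when
    the geometric feature extractor succeeded on that frame. The
    max_faulty_ratio tolerance is an assumed, illustrative value.
    """
    faulty = frame_ok_flags.count(False)
    return faulty / len(frame_ok_flags) <= max_faulty_ratio
```

When too many frames in the window fail extraction, the window is discarded rather than padded, since an incomplete DEF vector would corrupt classifier training.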
The age-related attribute is the most variable of the three, while the presence of glasses, in general, does not appear to generate specific issues for one classification algorithm or another. Based on the overall percentage of total subjects (summarised in Table 6), the gender bias in XM2VTS is significant, given the balanced proportion of males and females in this dataset. The same cannot be inferred from M2FRED, where the ratio of males is considerably higher than the ratio of females. Similar conditions occur for age in both datasets, while in M2FRED with LBP-TOP no failures are reported, a result that is largely justified by the limited number of subjects in the dataset but that, on the other side, confirms the feasibility of the proposed method in practical authentication scenarios.

The periocular region of the face represents a valid biometric trait to be used as an authentication mechanism. Over the years, the state of the art has focused on techniques for the extraction and analysis of static periocular features, meaning features coming from a single acquisition of the face. Recently, the analysis of the periocular region has gained a privileged position. The COVID-19 pandemic has led several public health institutes worldwide to impose the use of facial masks to counter the transmission of the virus among people. Such restrictions represent a crucial factor for face recognition systems, which are nowadays adopted in several public areas as well as used as unlocking systems for personal mobile devices (e.g., smartphones and tablets). In this work, we focus the attention on the analysis of periocular recognition, aiming at discussing the contribution of the dynamics of the facial periocular
improvement has been reported for the geometric modality when the representation of the features was dynamic. In all cases of comparison with static features, the classifiers exhibited better performance in accuracy, precision, and recall, thus confirming the positive contribution that the dynamics of the features introduced in the analysis. Apart from the significance of the level of performance achieved, another relevant contribution of using geometrical features is that they are relatively easy to extract from the video stream. Once the periocular area has been correctly detected in a video frame, the extraction of the landmarks and their tracking over time and space is computationally light. This is a suitable condition to consider the geometrical approach as a viable solution for mobile device computing. On the other hand, the expectation that the appearance-based model could perform better has been confirmed.
In the second modality, the analysis is carried out at pixel level. The experimental results show that the overall level of performance achieved is comparable with the results of the geometric-based approach, but a slightly higher level of performance has been observed. This, however, comes at a computational demand of the feature extraction process and the analysis that is notably higher in many cases. Moreover, some extra considerations can be made, especially by comparing with the Deep Learning solution included in the experimentation. On one side, unstable results can be observed with classical classifiers, except for the MLP. On the other side, the 3D CNN better generalises the problem among different datasets, thus representing an interesting result that may reduce the significance of the results achieved by the ML classifiers. However, the objectives of our work were analysing the feasibility of solutions for mobile devices, and 3D CNNs are undoubtedly not the choice to prefer. Conversely, the SVM exhibited a reasonable level of performance on the M2FRED dataset as an authentication system, confirming it as a feasible option for an authentication approach to implement on mobile devices. Biometric recognition systems can be relatively accurate and precise, especially when conditions are favourable. The challenging conditions on which this paper has been conceived do not enable a level of performance comparable with well-assessed authentication systems. On the other hand, it shows how traditional Machine Learning solutions can compete, under some assumptions, with well-established Deep Learning solutions. The experimental analysis is supported by a discussion of the failure cases, which is useful to focus on those conditions that make the problem even more challenging.

The experimental results also emphasise the performance gap towards a truly reliable face biometric recognition suitable for mobile devices in in-the-wild conditions. As a future improvement of this work, an accurate analysis of the facial features and their individual contribution to the overall recognition could allow us to understand which facial features weigh most heavily on recognition and which, if any, can even be ignored if they have a negative impact. Evaluating the importance of facial features would thus enable a model that is not only more robust but also computationally lighter, in case an effective strategy for mining facial features can be identified.