A Deep Learning Based Light-Weight Face Mask Detector with Residual Context Attention and Gaussian Heatmap to Fight against COVID-19

Coronavirus disease 2019 has seriously affected the world. One major protective measure for individuals is to wear masks in public areas. Several regions applied a compulsory mask-wearing rule in public areas to prevent transmission of the virus. Few research studies have examined automatic face mask detection based on image analysis. In this paper, we propose a deep learning based single-shot light-weight face mask detector to meet the low computational requirements for embedded systems, as well as achieve high performance. To cope with the low feature extraction capability caused by the light-weight model, we propose two novel methods to enhance the model’s feature extraction process. First, to extract rich context information and focus on crucial face mask related regions, we propose a novel residual context attention module. Second, to learn more discriminating features for faces with and without masks, we introduce a novel auxiliary task using synthesized Gaussian heat map regression. Ablation studies show that these methods can considerably boost the feature extraction ability and thus increase the final detection performance. Comparison with other models shows that the proposed model achieves state-of-the-art results on two public datasets, the AIZOO and Moxa3K face mask datasets. In particular, compared with another light-weight model, you only look once version 3 tiny (YOLOv3-tiny), the mean average precision of our model is 1.7% higher on the AIZOO dataset and 10.47% higher on the Moxa3K dataset. Therefore, the proposed model has a high potential to contribute to public health care and fight against the coronavirus disease.


I. INTRODUCTION
The World Health Organization (WHO) has stated that coronavirus disease 2019 (COVID-19) had infected over 160 million people and caused over 3.4 million deaths worldwide as of May 2021 [1]. Related large-scale respiratory diseases, severe acute respiratory syndrome (SARS) and Middle East respiratory syndrome (MERS), have occurred in the last two decades [2], [3]. SARS coronavirus 2 (SARS-CoV-2), the viral agent of COVID-19, has a higher reproductive number than SARS [4]. Increasing numbers of people are concerned about their health, and public health is a major priority of governments [5]. Various machine learning based methods have been applied in health care to assist the detection of COVID-19 cases from medical images [6]-[8]. One issue that limits machine learning methods for detecting COVID-19 cases is the lack of data. Fortunately, generative adversarial network based methods can be adopted to increase the size of datasets, as in [9], [10].
The associate editor coordinating the review of this manuscript and approving it for publication was Huiyu Zhou.
For individuals, face masks could reduce the spread of coronaviruses by decreasing their emission in respiratory droplets [11]. N95 masks, medical masks, and homemade masks can block approximately 100%, 97%, and 95% of virus particles, respectively [12]. Currently, the WHO recommends that people should wear face masks if they have respiratory symptoms, or if they are taking care of people with symptoms [13]. A recent study pointed out that most environments and contacts are under virus-limited conditions, in which wearing face masks can effectively prevent virus spread [14]. Regions with universal face mask wearing have controlled COVID-19 more effectively than those without this requirement [15]. Many public service providers require customers to wear masks. However, some people still do not wear masks in public areas, which might lead to infection of themselves or others. Therefore, automatic detection of the wearing of face masks may help global society, but research related to this is limited.
The task of detecting face masks, or their being worn, refers to localizing faces and judging whether masks are worn or not. Other recognition tasks relating to face masks include identifying their service stage [16] and efficiency [17], which are useful for determining whether face masks can be re-used and for assessing their quality. These methods could play a complementary role with face mask detection algorithms to protect people from COVID-19. Face mask detection systems could be deployed in surveillance systems, internet of things systems, or smart cities to help public area managers ensure that all visitors are wearing masks, reducing the risk of the spread of COVID-19. Face mask detection systems could take the place of workers who need to check the mask wearing status of visitors at supermarkets, universities, libraries, and similar locations.
Several studies have explored the detection of face masks. One approach is a two-step method, which first detects faces using face detectors and then separately classifies whether a face mask is worn using face mask classifiers [18], [19]. Although two-step methods may be sufficient in some scenarios, passing the results from the first step to the second step can degrade the speed significantly. End-to-end convolutional neural network (CNN) based face mask detectors, which jointly detect faces and recognize face masks, may be more suitable for real-time face mask detection. A you only look once (YOLO) model with a residual network (ResNet) based face mask detector [20] can achieve high detection accuracy, but the network is heavy and not fast enough for edge devices. RetinaFaceMask proposed a light-weight version with MobileNet as its backbone, but it did not solve the problem of the light-weight model substantially decreasing the detection performance [21]. Other challenges in face mask detection come from the diversity of in-the-wild scenarios, which include non-mask occlusion, various types of masks, different face orientations, and small or blurred faces (Fig. 1).
In this paper, we propose a novel single-shot light-weight face mask detector (SL-FMDet), which is able to detect face masks accurately and has a low hardware requirement. SL-FMDet uses a depthwise separable convolution based MobileNet as its backbone. It utilizes a feature pyramid network (FPN) to fuse high-level semantic information with low-level layers, and performs detection on multi-scale feature maps. However, FPN does not solve the problem that a light-weight model leads to worse feature extraction, so we propose two novel methods to address this. First, to extract rich context features and focus on crucial face mask related regions, we propose a novel residual context attention module (RCAM). Second, to learn more discriminating features for faces with and without masks, a novel auxiliary task is used to perform synthesized Gaussian heatmap regression (SGHR).
Evaluations of this study were performed on two publicly available face mask datasets, the AIZOO [22] and Moxa3K [23] face mask datasets. Experimental results showed that the proposed model achieved state-of-the-art results on both datasets. Compared with another light-weight model, YOLOv3-tiny, the mean average precision (mAP) of our model was 1.7% higher on the AIZOO dataset, and 10.47% higher on the Moxa3K dataset. The source code of our work is publicly available online.
The rest of this paper is organized as follows. In Section II, we review related work on object detection and face mask detection. The proposed methodology is presented in Section III. Section IV describes the datasets, implementation details, evaluation metrics, an ablation study, and quantitative and qualitative results. Section V concludes the paper and discusses future work.

II. RELATED WORK
A. OBJECT DETECTION
The Viola-Jones detector [24] achieves real-time detection of objects by an algorithm that extracts features using a Haar feature descriptor with an integral image method and a cascaded detector. It is still computationally expensive, even though it uses integral images to speed up the computation. An effective feature extractor to detect humans, called histogram of oriented gradients (HOG), computes the directions and magnitudes of oriented gradients over image cells [25]. The deformable part-based model [26] detects object parts and then connects them to judge the classes that objects belong to.
Deep learning based detectors can perform well due to their robustness and high ability to extract features [27]. There are two popular categories, one- and two-stage object detectors. One-stage detectors directly regress the bounding boxes in a single step. The approach in YOLOv1 [28] divided the image into several cells and tried to find objects in each cell, but this was not good for small objects. YOLOv1 does not perform well by only using the last feature output, as the last feature map has a fixed receptive field and can only observe certain areas of the original images. Therefore, multi-scale detection was introduced in the single shot detector (SSD) to conduct detection on several feature maps and detect objects of different sizes [29]. To improve detection accuracy, Lin et al. [30] proposed RetinaNet by combining an SSD and an FPN architecture, which included a novel focal loss function to mitigate the class imbalance problem. In terms of architecture, YOLOv2 has a similar improvement to SSD in using multi-scale features, and YOLOv3 is similar to RetinaNet in utilizing an FPN.
Two-stage detectors generate region proposals in the first stage and then fine-tune these proposals in the second stage. Two-stage detectors can provide high detection performance but at a low speed. Region-based CNN (R-CNN) [31] uses selective search to propose candidate regions that may contain objects. The proposals are fed into a CNN model to extract features, and a support vector machine (SVM) is used to recognize the classes of objects. However, the second stage of R-CNN is computationally expensive, since the network has to process proposals one by one and uses a separate SVM for final classification. Fast R-CNN solved this problem by introducing a region of interest (ROI) pooling layer to input all proposed regions at once [32]. A region proposal network (RPN) introduced by Faster R-CNN took the place of selective search, the speed-limiting step of two-stage detectors [33]. Faster R-CNN integrated each detection component (region proposal, feature extractor, and detector) into an end-to-end neural network architecture.

B. FACE MASK DETECTION
Face mask detection algorithms have become more topical recently, since masks can help control the spread of COVID-19 during the pandemic. The algorithmic task focuses only on detecting physical masks, as shown in [18], [20], [21], [23], [34], [35]. Among these, YOLO based models are the most popular detectors. ResNet based YOLOv2 was used by [20] to improve feature extraction for face mask detection. To enhance the robustness of detection by YOLOv3, an image mix-up and multi-scale method was utilized in [34]. A distance intersection over union non-maximum suppression (DIOU-NMS) algorithm was used to improve the post-processing stage of YOLOv3 [35]. YOLOv3 achieved the highest mAP in a comparison of YOLOv3, YOLOv3-tiny, SSD, and Faster R-CNN on the newly-established Moxa3K face mask detection dataset [23]. A person tracking system with a three-part face mask recognition system (a person detector, a tracker, and a face mask classifier) was developed to facilitate face mask detection applications in smart cities [36]. Face mask classification or recognition, assuming faces were detected, has also been studied [19], [37], [38].

III. METHODOLOGY
The overall pipeline of the proposed SL-FMDet is shown in Fig. 2. We first introduce the general architecture of the SL-FMDet, followed by two novel modules, RCAM and SGHR. Finally, we discuss the loss function and the inference procedure.

A. NETWORK ARCHITECTURE
To reduce the size of the neural network, we propose to use a depthwise separable convolution network based backbone, MobileNet [39], which uses a depthwise convolution and a pointwise convolution in series to reduce the computational load. Assume the output shape of a standard convolution is C × H × W, and there are C standard 2D convolution kernels of size [...]. Since the number of channels significantly influences the speed, we use the thinnest MobileNet, MobileNet 0.25, with 0.25 times the number of channels of the ordinary MobileNet, to make it smaller and have lower latency. Then, since each feature map corresponds to a different receptive field on the input images, we apply a multi-scale strategy to perform detection on three feature maps to find faces of different sizes. However, lower layers do not contain high-level semantic information, so we apply the FPN [40] to fuse high-level semantic information with lower layer feature maps. The sizes of the three feature maps used are [...]. We then generate anchors of two different sizes on each feature map, and the details are given in section IV-B.
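The saving from splitting a convolution into a depthwise and a pointwise step can be checked with a few lines of arithmetic. The sketch below uses the cost accounting commonly given for MobileNet; the kernel size `k` and input channel count `d` are assumed symbols, since the corresponding expression is not recoverable from the text above, and the example values are illustrative only.

```python
def standard_conv_macs(k, d, c, h, w):
    """Multiply-accumulates of a standard k x k convolution.

    d: input channels (assumed symbol), c: output channels,
    h, w: output height and width.
    """
    return k * k * d * c * h * w


def depthwise_separable_macs(k, d, c, h, w):
    """Depthwise k x k convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = k * k * d * h * w  # one k x k filter per input channel
    pointwise = d * c * h * w      # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


def cost_ratio(k, d, c, h, w):
    """Separable cost divided by standard cost; equals 1/c + 1/k**2."""
    return depthwise_separable_macs(k, d, c, h, w) / standard_conv_macs(k, d, c, h, w)


# Illustrative values, not taken from the paper.
example_ratio = cost_ratio(k=3, d=64, c=64, h=80, w=80)
```

For a 3 × 3 kernel the ratio is dominated by the 1/9 term, which is why reducing the channel count (as in MobileNet 0.25) is the remaining lever for speed.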
Although FPN can use high-level semantic information, it does not solve the problem caused by the separation of convolutions, which reduces the capability of feature extraction. To cope with this problem, we propose two novel modules: RCAM, to focus on learning important information (section III-B), and SGHR, to learn more discriminating features for faces with and without masks (section III-C). RCAMs are directly applied to the fused feature maps from FPN. Then, we add a heatmap branch by applying a 1 × 1 convolution kernel to the output of RCAM to generate a one-channel map for SGHR. For the detection heads, we use a 1 × 1 convolutional kernel to form a 4 × 2 dimensioned bounding box of coordinates and $n_c \times 2$ dimensioned classes, where the size 4 dimension is formed by the top-left corner $(x_1, y_1)$ and bottom-right corner $(x_2, y_2)$ coordinates, $n_c$ is the number of classes, and the size 2 dimension is formed by the two prior anchors of different sizes for each pixel.
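As a rough illustration of the head dimensions described above, the following sketch counts the output channels of each 1 × 1 head and the total number of prior anchors. The strides (8, 16, 32) are an assumption made for the example, since the feature map sizes are not recoverable from the text; the 640-pixel input and the three-column confidence output are taken from the implementation and inference sections.

```python
def head_channels(num_classes, anchors_per_location=2):
    """Output channels of the 1 x 1 convolution heads."""
    return {
        "box": 4 * anchors_per_location,            # (x1, y1, x2, y2) per anchor
        "cls": num_classes * anchors_per_location,  # class scores per anchor
        "heatmap": 1,                               # one-channel SGHR branch
    }


def total_anchors(input_size, strides, anchors_per_location=2):
    """Number of prior anchors p summed over all feature maps."""
    return sum((input_size // s) ** 2 * anchors_per_location for s in strides)


channels = head_channels(num_classes=3)
# Strides are hypothetical; they are not stated in the text.
p = total_anchors(input_size=640, strides=(8, 16, 32))
```

Under these assumed strides the detector would score 16,800 anchors per image, which is the quantity denoted p in the loss and inference sections.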

B. RESIDUAL CONTEXT ATTENTION MODULE
Compared with face detection, the task of face mask detection is more difficult, because it has to locate the face as well as distinguish faces with and without masks. To focus on face areas where masks may appear, we propose a novel RCAM (Fig. 3 (a)). RCAM contains three major blocks: a context enhancement block (CEB), a channel attention block (CAB), and a spatial attention block (SAB).
For the CEB, we form three parallel branches with 3 × 3, 5 × 5 and 7 × 7 receptive fields to enhance context information, similar to the context module in the single stage headless (SSH) face detector [41]. To reduce the number of parameters while maintaining the same receptive field size, all branches are implemented by 3 × 3 convolution kernels. The branch with a 5 × 5 receptive field is implemented by two consecutive 3 × 3 convolution kernels, and that with a 7 × 7 receptive field is realized by three consecutive 3 × 3 convolution kernels. We then concatenate all feature maps from the branches to form an enhanced context feature map.
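The equivalence between stacked 3 × 3 kernels and larger receptive fields, and the resulting parameter saving, can be verified directly. This is a minimal sketch assuming stride-1 convolutions without dilation; the function names are illustrative.

```python
def stacked_receptive_field(num_layers, kernel=3):
    """Receptive field of num_layers stride-1 convolutions with the same kernel."""
    return 1 + num_layers * (kernel - 1)


def stacked_weights(num_layers, kernel=3):
    """Weights per input/output channel pair for the stacked branch."""
    return num_layers * kernel * kernel


def single_kernel_weights(kernel):
    """Weights per input/output channel pair for one large kernel."""
    return kernel * kernel


# Branches with one, two and three consecutive 3 x 3 convolutions.
fields = [stacked_receptive_field(n) for n in (1, 2, 3)]
```

Two 3 × 3 kernels use 18 weights per channel pair instead of 25 for a single 5 × 5 kernel, and three use 27 instead of 49 for a 7 × 7 kernel.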
To focus on the important face mask related features, we cascade a convolutional block attention module (CBAM) [42] after the CEB, and add a skip connection. This attention module consists of a CAB (Fig. 3 (b)) and a SAB (Fig. 3 (c)). The CAB assigns weights to each channel of the input features, while the SAB calculates a spatial attention map to focus on specific parts of the input feature. The computation of the CAB with input $F \in \mathbb{R}^{D \times H \times W}$ is
$$A_c = \sigma\big(\mathrm{MLP}(\mathrm{GAP}(F)) + \mathrm{MLP}(\mathrm{GMP}(F))\big),$$
and that of the SAB is
$$A_s = \sigma\big(\mathrm{Conv2D}(\mathrm{Concat}(\mathrm{CAP}(F), \mathrm{CMP}(F)))\big),$$
where $A_c \in \mathbb{R}^{D}$ and $A_s \in \mathbb{R}^{H \times W}$ denote the channel and spatial attention; $\sigma$ is the sigmoid function to normalize the output to (0, 1); MLP refers to the multi-layer perceptron, which is a 3-layer fully connected network with $D/8$ neurons in the intermediate layer; GAP and GMP stand for global average pooling and global maximum pooling; CAP and CMP stand for channel average pooling and channel maximum pooling; Conv2D represents 2-dimensional convolution; Concat is the channel concatenation operation. Finally, we add a skip connection to avoid information loss and gradient vanishing.
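The two attention blocks described above (global pooling with a shared MLP for channels, channel pooling with a convolution for space, each followed by a sigmoid) can be sketched on nested Python lists. This is an illustrative simplification, not the authors' implementation: the weights `w1`, `w2`, `w_avg` and `w_max` are hypothetical, a ReLU hidden activation is assumed, and the spatial convolution is reduced to a 1 × 1 kernel to keep the example short.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def channel_attention(feat, w1, w2):
    """Sigmoid of MLP(GAP(F)) + MLP(GMP(F)) for feat of shape [D][H][W]."""
    gap = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feat]
    gmp = [max(map(max, ch)) for ch in feat]

    def mlp(v):  # shared weights; ReLU hidden layer is an assumption
        hidden = [max(0.0, h) for h in matvec(w1, v)]
        return matvec(w2, hidden)

    return [sigmoid(a + b) for a, b in zip(mlp(gap), mlp(gmp))]


def spatial_attention(feat, w_avg, w_max, bias=0.0):
    """Sigmoid of a 1 x 1 convolution over [CAP(F), CMP(F)] (simplified kernel)."""
    d, h, w = len(feat), len(feat[0]), len(feat[0][0])
    att = []
    for i in range(h):
        row = []
        for j in range(w):
            column = [feat[c][i][j] for c in range(d)]
            cap, cmp_ = sum(column) / d, max(column)
            row.append(sigmoid(w_avg * cap + w_max * cmp_ + bias))
        att.append(row)
    return att
```

With D = 8 channels the hidden layer has D/8 = 1 neuron, so `w1` has shape 1 × 8 and `w2` has shape 8 × 1. Each attention value lies in (0, 1) and would be multiplied with the feature map before the skip connection adds the input back.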

C. SYNTHESIZED GAUSSIAN HEATMAP REGRESSION
Although the light-weight network is small and fast, it has a relatively weak feature extraction ability. To address this problem, and to enhance the learning of discriminating features for face areas with and without masks, we propose a novel auxiliary learning task, SGHR.
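We consider an image containing $n_1$ bounding boxes of face masks and $n_2$ bounding boxes of faces. For the $n_1$ face mask bounding boxes, we first generate the face Gaussian heatmaps $H^m_{j1}$, $j \in \{1, \ldots, n_1\}$, as
$$H^m_{j1}(x, y) = \exp\left(-\left(\frac{(x - c_{jx})^2}{2\sigma_{jx}^2} + \frac{(y - c_{jy})^2}{2\sigma_{jy}^2}\right)\right), \tag{3}$$
where $(c_{jx}, c_{jy})$ is the central position, and $h_j$ and $w_j$ are the height and width of the $j$th face bounding box; $\sigma_{jx}$ and $\sigma_{jy}$ control the radii of the corresponding heatmaps, with $\sigma_{jx} = h_j/6$ and $\sigma_{jy} = w_j/6$. Then, we generate the Gaussian heatmaps for masks as
$$H^m_{j2}(x, y) = \exp\left(-\left(\frac{(x - \tilde{c}_{jx})^2}{2\tilde{\sigma}_{jx}^2} + \frac{(y - \tilde{c}_{jy})^2}{2\tilde{\sigma}_{jy}^2}\right)\right),$$
where $(\tilde{c}_{jx}, \tilde{c}_{jy})$ is the estimated central position of face mask $j$, which is calculated by $\tilde{c}_{jx} = c_{jx} + h_j/4$ and $\tilde{c}_{jy} = c_{jy}$, with $\tilde{\sigma}_{jx} = h_j/12$ and $\tilde{\sigma}_{jy} = w_j/6$. Then we sum $H^m_{j1}$ and $H^m_{j2}$ to obtain the Gaussian heatmap for face masks,
$$H^m_j = H^m_{j1} + H^m_{j2}.$$
For the $n_2$ bounding boxes for faces without masks, their heatmaps only contain single face Gaussian heatmaps $H^f_i$, $i \in \{1, \ldots, n_2\}$, which are calculated in the same way as in (3). Finally, we sum the face mask and face heatmaps and suppress the maximum value to obtain the final synthesized Gaussian heatmaps (SGHs) as
$$H = \mathrm{clip}\left(\sum_{j=1}^{n_1} H^m_j + \sum_{i=1}^{n_2} H^f_i,\ 1\right),$$
where $\mathrm{clip}(H, 1)$ prevents the maximum of $H$ from exceeding 1.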
An example for computing an SGH is shown in Fig. 4. The objective of SGHR is to predict heatmaps as close as possible to ground truth SGHs. Thus, an $\ell_2$ loss performs regression between the predicted heatmap $\hat{H}$ and the ground truth heatmap $H$ as
$$L_h = \lVert \hat{H} - H \rVert_2^2. \tag{8}$$
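The SGH construction described above can be sketched in a few lines. The example assumes the usual Gaussian form with 2σ² in the denominator and an unnormalized squared-error loss; both are assumptions where the text does not spell out the expressions. Boxes are given as hypothetical tuples `(cx, cy, h, w)`, following the text's convention that the x axis is paired with the box height.

```python
import math


def gaussian(x, y, cx, cy, sx, sy):
    """2D Gaussian centred at (cx, cy) with radii controlled by sx and sy."""
    return math.exp(-((x - cx) ** 2 / (2 * sx ** 2) + (y - cy) ** 2 / (2 * sy ** 2)))


def synthesized_heatmap(size, masked_faces, plain_faces):
    """Build an SGH on a size x size grid from (cx, cy, h, w) boxes.

    masked_faces get a face Gaussian plus a mask Gaussian shifted by h / 4;
    plain_faces get the face Gaussian only.
    """
    heat = [[0.0] * size for _ in range(size)]
    for x in range(size):
        for y in range(size):
            total = 0.0
            for cx, cy, h, w in masked_faces:
                total += gaussian(x, y, cx, cy, h / 6, w / 6)           # face
                total += gaussian(x, y, cx + h / 4, cy, h / 12, w / 6)  # mask
            for cx, cy, h, w in plain_faces:
                total += gaussian(x, y, cx, cy, h / 6, w / 6)
            heat[x][y] = min(total, 1.0)  # clip(H, 1)
    return heat


def l2_loss(pred, target):
    """Sum of squared differences between two heatmaps."""
    return sum((p - t) ** 2 for pr, tr in zip(pred, target) for p, t in zip(pr, tr))
```

For the same box, the masked heatmap has a second peak in the lower half of the face, which is the extra signal that lets the auxiliary task separate the two classes.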

D. LOSS FUNCTION
The model gives three outputs for each input image: a localization offset prediction $\hat{Y}_l \in \mathbb{R}^{p \times 4}$, a classification confidence prediction $\hat{Y}_c \in \mathbb{R}^{p \times n_c}$, and a predicted heatmap $\hat{H}$, where $p$ and $n_c$ denote the number of generated anchors and the number of classes. We also have the prior anchors $P \in \mathbb{R}^{p \times 4}$, the ground truth boxes $Y_l \in \mathbb{R}^{o \times 4}$ and the classification labels $Y_c \in \mathbb{R}^{o \times 1}$, where $o$ refers to the number of objects. Before calculating losses, we match and decode anchors $P$ with the ground truth boxes $Y_l$ and the classification labels $Y_c$ to obtain $P_{ml} \in \mathbb{R}^{p \times 4}$ and $P_{mc} \in \mathbb{R}^{p \times 1}$, where each row in $P_{ml}$ or $P_{mc}$ denotes the offsets or top classification label for each anchor, respectively. The positive localization prediction and class are defined as $\hat{Y}^+_l \in \mathbb{R}^{p^+ \times 4}$ and $\hat{Y}^+_c \in \mathbb{R}^{p^+ \times 1}$. The positive matched anchors' localization offsets and class are defined as $P^+_{ml} \in \mathbb{R}^{p^+ \times 4}$ and $P^+_{mc} \in \mathbb{R}^{p^+ \times 1}$, where $p^+$ denotes the number of anchors whose top classification label is not zero.
To be robust to outliers, we use the smooth L1 loss [33] to regress the localization offsets as
$$L_l = \mathrm{SmoothL1}(\hat{Y}^+_l, P^+_{ml}).$$
Hard negative mining [43] is performed to obtain sampled negative matched anchors and the corresponding predictions, $P^-_{mc} \in \mathbb{R}^{p^- \times 1}$ and $\hat{Y}^-_c \in \mathbb{R}^{p^- \times 1}$, where $p^-$ is the number of sampled negative anchors. The classification loss is computed from positive and negative samples using cross-entropy (CE) as
$$L_c = \mathrm{CE}(\hat{Y}^+_c, P^+_{mc}) + \mathrm{CE}(\hat{Y}^-_c, P^-_{mc}).$$
Together with the heatmap loss $L_h$ in (8), we derive the total loss as
$$L = \frac{1}{N}\left(L_c + \alpha L_l\right) + \beta L_h,$$
where $N$ is the number of matched default anchors, and $\alpha$ and $\beta$ are hyperparameters to weight the losses.
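The pieces of the loss described above can be sketched as follows. Several details are assumptions made for illustration and are not confirmed by the text: the smooth L1 transition point `beta=1.0`, the 3:1 negative-to-positive ratio in hard negative mining, and the way the three terms are combined in `total_loss` (an SSD-style normalization by the number of matched anchors, with the heatmap term weighted separately). The default weights α = 2 and β = 10⁻³ are the values reported in the implementation details.

```python
import math


def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 loss summed over all coordinates."""
    total = 0.0
    for p, t in zip(pred, target):
        diff = abs(p - t)
        total += 0.5 * diff * diff / beta if diff < beta else diff - 0.5 * beta
    return total


def cross_entropy(logits, label):
    """Cross-entropy of one anchor computed from raw class scores."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[label]


def hard_negative_indices(logits, labels, neg_pos_ratio=3):
    """Keep the background anchors with the largest loss (ratio is an assumption)."""
    num_pos = sum(1 for y in labels if y != 0)
    negatives = [i for i, y in enumerate(labels) if y == 0]
    negatives.sort(key=lambda i: cross_entropy(logits[i], 0), reverse=True)
    return negatives[: neg_pos_ratio * num_pos]


def total_loss(loc_loss, cls_loss, heatmap_loss, num_matched, alpha=2.0, beta=1e-3):
    """Assumed combination: (L_cls + alpha * L_loc) / N + beta * L_h."""
    return (cls_loss + alpha * loc_loss) / max(num_matched, 1) + beta * heatmap_loss
```

Hard negative mining keeps the classification loss from being dominated by the thousands of easy background anchors, since only the most confidently wrong ones contribute.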

E. INFERENCE
In the inference stage, the model produces the object localization $\hat{L} \in \mathbb{R}^{p \times 4}$ and object confidence $\hat{Y}_c \in \mathbb{R}^{p \times 3}$. The second column of $\hat{Y}_c$ is the confidence of faces, $\hat{Y}_{cf} \in \mathbb{R}^{p \times 1}$, and the third column of $\hat{Y}_c$ is the confidence of face masks, $\hat{Y}_{cm} \in \mathbb{R}^{p \times 1}$. Then, we remove objects with confidence lower than $t_c$ and perform non-maximum suppression (NMS) with a threshold $t_{nms}$ to produce the final localization and confidence of faces and those of face masks, where $n_f$ and $n_m$ denote the numbers of selected faces and masks.
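The post-processing step described above can be illustrated with a plain greedy NMS. This is a generic sketch rather than the authors' code; it would be run once on the face confidence column and once on the face mask column. The default `t_nms=0.3` follows the value given in the implementation details, while the confidence threshold used in any call is illustrative.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def filter_and_nms(boxes, scores, t_c, t_nms=0.3):
    """Drop boxes scoring below t_c, then greedily suppress overlapping boxes."""
    order = sorted((i for i, s in enumerate(scores) if s >= t_c),
                   key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap any higher-scoring kept box.
        if all(iou(boxes[i], boxes[k]) <= t_nms for k in keep):
            keep.append(i)
    return keep
```

The returned indices select the rows that form the final localization and confidence outputs for one class.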

IV. EXPERIMENT AND RESULT
A. DATASET
1) AIZOO FACE MASK DETECTION DATASET
The AIZOO face mask detection dataset is a public open-source dataset created by AIZOOTech [22] that comprises approximately 8,000 images selected from the WIDER FACE [44] and MAsked FAces (MAFA) [45] datasets, re-annotated to fit the face mask detection context. To cover more real-world conditions, most normal faces came from WIDER FACE (50%), while faces wearing masks were from MAFA (50%), giving the dataset a good balance among different scenarios. A subset of 1,839 images was pre-defined for testing.

2) Moxa3K FACE MASK DETECTION DATASET
The Moxa3K face mask detection dataset is a public dataset to facilitate face mask research [23]. It contains 3,000 images with 2,800 for training and 200 for testing. The dataset was constructed by combining images from a Kaggle dataset and Internet images. The disadvantage of the dataset is that it contains only a few faces without masks.

TABLE 1. Settings for the generation of prior anchors.

C. EVALUATION METRICS
For each class, average precision (AP) serves as a comprehensive indicator of the area under the precision and recall curve,
$$AP = \int_0^1 P(R)\, dR,$$
where the precision (P) and recall (R) are defined as [47],
$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN},$$
where TP, FP and FN denote the true positive, false positive and false negative counts, respectively. The calculation of precision and recall is based on predictions ranked in descending order by their predicted confidence scores, with a minimum confidence of 0.02. As in the PASCAL VOC [48] new evaluation metrics, all-point interpolation is used to smooth the zigzag precision and recall curve to obtain AP as
$$AP = \sum_{n} (R_{n+1} - R_n)\, P_{interp}(R_{n+1}), \qquad P_{interp}(R_{n+1}) = \max_{\tilde{R} \ge R_{n+1}} P(\tilde{R}).$$
We use $AP_F$ and $AP_M$ to denote APs for faces and face masks. mAP was used to evaluate the performance of the models [47] and can be calculated by taking the mean of AP over the classes as
$$mAP = \frac{1}{n_c} \sum_{j=1}^{n_c} AP_j,$$
where $n_c$ is the number of classes, and $AP_j$ is the AP for the $j$th class. We use an intersection over union (IOU) threshold of 0.5 to judge whether a prediction is correct, which is denoted as mAP@0.5 in the literature.
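The metric described above can be computed as in the following sketch of all-point interpolated AP. It assumes that each detection has already been marked as a true or false positive by matching against ground truth at the IOU threshold of 0.5; the function and argument names are illustrative.

```python
def average_precision(scores, is_true_positive, num_ground_truth):
    """All-point interpolated AP from the ranked detections of one class."""
    ranked = sorted(zip(scores, is_true_positive), key=lambda pair: pair[0], reverse=True)
    recalls, precisions = [0.0], [1.0]
    tp = fp = 0
    for _, hit in ranked:
        tp, fp = tp + bool(hit), fp + (not hit)
        recalls.append(tp / num_ground_truth)
        precisions.append(tp / (tp + fp))
    # Interpolate: precision at recall r becomes the best precision at any recall >= r.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum the rectangles under the smoothed precision-recall curve.
    return sum((recalls[i + 1] - recalls[i]) * precisions[i + 1]
               for i in range(len(recalls) - 1))


def mean_average_precision(ap_per_class):
    """Mean of the per-class APs."""
    return sum(ap_per_class) / len(ap_per_class)
```

For three detections ranked as true positive, false positive, true positive against two ground truth objects, the interpolation lifts the precision at recall 0.5 to 1.0 and at recall 1.0 to 2/3, giving an AP of 5/6.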

D. ABLATION STUDY
To demonstrate the effectiveness of the proposed components, we performed ablation studies on RCAM, SGHR, and the position of the SGHR branch. The experiments based on the AIZOO dataset are summarized in Table 2 with details below.

1) RCAM
We compared the detector without and with RCAM attached to the outputs of the FPN feature maps. Using RCAM gave a 0.7% increase in the AP for faces, a 1.8% increase in the AP for face masks, and a 1.2% increase in mAP. This suggests that the proposed RCAM is able to enlarge and focus on useful context information for face mask detection.

2) SGHR AND ITS POSITION
We added SGHR to the model to show the effectiveness of the SGHR auxiliary task and ran three experiments to find the best position for the SGHR branch. The auxiliary branch was placed on the output of the RCAM applied to FPN feature $f_1$, $f_2$, or $f_3$. These positions are denoted as 1, 2 and 3 for brevity. The highest AP and mAP were achieved by placing the SGHR auxiliary task branch at feature $f_2$. This may be because the $f_2$ feature maps have appropriate anchor scales for the majority of objects. Compared with the model without the SGHR branch, a maximum increase of 2.8% in mAP was observed, and the APs for each class also improved observably.

E. VISUALIZATION OF ATTENTION MAP
In the above ablation studies, SGHR enhanced the final face mask detection performance. In this section, we visualize the spatial attention of RCAM to qualitatively demonstrate how SGHR helps learn more discriminating features to distinguish between the object and the background. In Fig. 5, the first row is generated from the model without SGHR, while the second row used SGHR. The spatial attention maps generated by the model with SGHR could differentiate between the object and the background. This shows that the proposed SGHR auxiliary task can boost the performance of RCAM, and thus the overall detection performance.

F. COMPARISON WITH OTHER MODELS ON AIZOO
The performance of our model on the AIZOO face mask dataset was compared with existing models used in face mask detection. The baseline model is a modified SSD with a light-weight backbone [22]. Faster R-CNN is a well-regarded two-stage detector using an RPN [33]. YOLOv3 uses Darknet-53 as its backbone and three detection heads to process three-scale features enhanced by FPN. YOLOv3-tiny is a lighter and faster version of YOLOv3 with a light backbone and only two detection heads. RetinaFace is a high performance face detector that uses an FPN to fuse high-level semantic information [50]. RetinaFaceMask is a dedicated face mask detector, and its light-weight version powered by MobileNet is denoted as RetinaFaceMask-M [21].
The mAP and APs of faces and face masks are given in Table 3. The proposed SL-FMDet achieved the highest mAP and APs among all the models. Compared with the baseline SSD model, SL-FMDet increased mAP by 3.0%, and the APs of faces and face masks were improved by 4.0% and 2.1%, respectively. YOLOv3 and RetinaFace had the closest performance to our model, but they used heavy backbones, Darknet-53 and ResNet-50, which are computationally expensive. YOLOv3-tiny is a lighter version of YOLOv3, but its mAP was 1.7% lower than that of the proposed model. RetinaFaceMask-M is also a light-weight model, but it performed poorly at finding face masks, with a low $AP_M$ of 90.4%. We demonstrate some qualitative results in Fig. 6. The model can successfully distinguish some confusing occlusions, such as occlusion by hands, hair or other objects (Fig. 6(a)), and all diverse mask types were detected (Fig. 6(b)). Side views of faces with masks could be detected (Fig. 6(c)), and results on small and blurred faces are shown in Fig. 6(d).

G. COMPARISON WITH OTHER MODELS ON Moxa3K
Experiments were also conducted on the Moxa3K face mask dataset, and the mAP and APs are summarized in Table 4. We compared our model with the best results reported by [23]. SL-FMDet achieved state-of-the-art performance on Moxa3K, outperforming the previous best, YOLOv3. The light-weight model with RCAM and SGHR achieved better performance than heavy models like YOLOv3. YOLOv3-tiny is a light-weight model, so it provides another insight into our model's performance on the Moxa3K dataset. SL-FMDet's performance exceeded YOLOv3-tiny by 10.47% in terms of mAP. However, as the Moxa3K dataset was created for closed circuit television applications, it contains more blurred or small faces, which are hard to detect and result in overall low performance. In addition to the results reported by [23], we conducted experiments on RetinaFace and RetinaFaceMask-M, and these models give 1-2% lower performance than SL-FMDet in terms of mAP. In Fig. 6(d), SL-FMDet can find most of these blurred or small faces in the wild. Although there are some failure cases, due to occlusions by people or objects, the result seems satisfactory.

H. COMPARISON WITH OTHER MODELS IN TERMS OF FLOPs AND THE NUMBER OF PARAMETERS
SL-FMDet requires the smallest number of floating point operations (FLOPs) and the smallest number of parameters (Params) of the methods examined (Table 5). SL-FMDet takes 1.01G FLOPs and has 0.43M parameters, which is less than 10% of the requirement of YOLOv3-tiny.

V. CONCLUSION
In this paper, we proposed a novel SL-FMDet, which is efficient and has low hardware requirements. To overcome the lower feature extraction capability caused by its light-weight backbone, we proposed RCAM and SGHR. RCAM can extract rich context information and focus on crucial face mask related areas. By using SGHR as an auxiliary task, the model is able to learn more discriminating features for faces with and without masks. The model with SGHR yielded a better attention map, which qualitatively supports the effectiveness of this auxiliary task. The proposed model achieved state-of-the-art results on two public face mask datasets, AIZOO and Moxa3K. Compared with another light-weight model, YOLOv3-tiny, the mAP of our detector is 1.7% higher on AIZOO and 10.47% higher on Moxa3K. Experimentally, we have shown that light-weight models can achieve similar or even better performance than heavy models by using RCAM and SGHR. The qualitative results also show the model is capable of tackling the challenges present in face mask detection. Therefore, the proposed face mask detector has a high potential to contribute to public health care to control the spread of COVID-19. One drawback of the method is the extra computation required for generating heatmaps. In addition, due to limitations of the datasets, the method cannot distinguish between correct and incorrect mask wearing.
In future work, we would like to build face mask detection datasets with no, correct and incorrect mask wearing states, or use a zero-shot learning approach to make the model able to detect incorrect mask wearing states. New deep learning detectors may be used to further improve the performance. Recently, advanced work on anchor-free deep learning detectors, such as CenterNet [51] or CornerNet [52], has appeared. We believe anchor-free detectors operate more like human beings detecting objects than anchor-based methods such as ours. CenterNet first detects the center of the objects, and then regresses the coordinates of corners relative to the centers. DEtection TRansformer (DETR), a newly-proposed transformer-based deep learning detector [53], borrows advantages from language transformers to use patch-based sequential information, and does not require post-processing. In addition, we will develop a real-world face mask detection system on high performance edge devices, and integrate it with internet of things systems.

FIGURE 1. Challenges in face mask detection.

FIGURE 2. The pipeline of the proposed SL-FMDet. The backbone uses depthwise separable convolutions; FPN is used to fuse the high-level semantic information; RCAM can extract rich context information and focus on crucial face mask related regions; SGHR learns more discriminating features for faces with and without masks.

FIGURE 3. Illustration of the RCAM. (a) Overall architecture of the RCAM. (b) The structure of the CAB. (c) The structure of the SAB.

B. IMPLEMENTATION DETAIL
In the experiments, we employed an adaptive moment (Adam) optimizer with an initial learning rate of $\alpha_{LR} = 10^{-3}$. A reduce-on-plateau learning rate scheduler was used to dynamically reduce the learning rate by a factor of 10 if there was no improvement in the validation loss over 20 epochs. The hyperparameters of the loss were $\alpha = 2$ and $\beta = 10^{-3}$. The network was initialized by weights pre-trained on ImageNet. The models were trained on an NVIDIA GeForce RTX 2080 Ti and an Intel Xeon Silver 4108. The algorithm was developed with the PyTorch [46] deep learning framework. Each experiment ran for $n_{ep} = 250$ epochs with batch size $m = 32$. The threshold of NMS was $t_{nms} = 0.3$. The number of anchors, the coordinates of the anchors' centers and the anchor sizes are given in Table 1. The details of the training of our models are shown in Algorithm 1, where MiniBatchSampler refers to the operation of randomly selecting $m$ pairs of samples from dataset $D$, denoted as $B$; DataAugmentation is the data augmentation operation including random image cropping, distorting and flipping; Preprocess resizes the image to 640 × 640 pixels and normalizes the pixel values by subtracting the mean red green blue (RGB) values.

Algorithm 1 Details of the Training Procedure
Require: Training set $D_{train} = \{(x_i, y_i)\}_{i=1}^{n}$; Validation set $D_{val} = \{(x_i, y_i)\}_{i=1}^{n}$; A parameterized model $f_\theta$; ImageNet pretrained weights $\tilde{\theta}$; Number of epochs $n_{ep}$; Batch size $m$; Learning rate $\alpha_{LR}$; Loss hyperparameters $\alpha$, $\beta$; Minimal validation loss $L_{min} = +\infty$
Ensure: A parameterized model after training $f_\theta$
1: Initialize the model parameters $\theta$ by $\theta = \tilde{\theta}$.
2: for $i = 1$ to $n_{ep}$ do
3:   for $j = 1$ to $n/m$ do
4:     $B \leftarrow$ MiniBatchSampler($D_{train}$, $m$)
5:     $B \leftarrow$ DataAugmentation($B$)
6:     $B \leftarrow$ Preprocess($B$)
[...]
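The learning rate schedule described above can be sketched in plain Python. The class below is a simplified stand-in written for illustration; PyTorch provides `torch.optim.lr_scheduler.ReduceLROnPlateau` for this purpose, and its exact patience semantics may differ slightly from this sketch. The defaults mirror the reported settings (initial rate 10⁻³, division by 10, 20 epochs).

```python
class PlateauScheduler:
    """Divide the learning rate by 10 after `patience` epochs without improvement."""

    def __init__(self, lr=1e-3, factor=0.1, patience=20):
        self.lr, self.factor, self.patience = lr, factor, patience
        self.best = float("inf")  # minimal validation loss seen so far
        self.stale_epochs = 0     # epochs since the last improvement

    def step(self, val_loss):
        """Update the schedule with one epoch's validation loss; return the new rate."""
        if val_loss < self.best:
            self.best, self.stale_epochs = val_loss, 0
        else:
            self.stale_epochs += 1
            if self.stale_epochs >= self.patience:
                self.lr *= self.factor
                self.stale_epochs = 0
        return self.lr
```

In a training loop, `step` would be called once per epoch with the validation loss, and the returned rate applied to the optimizer before the next epoch.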

FIGURE 5. Visualization of spatial attention yielded by RCAM without (upper) and with (lower) SGH.

FIGURE 6. Qualitative results on AIZOO and Moxa3K datasets demonstrating the capability of our model on face mask detection challenges.

TABLE 2. Ablation study of the proposed model (%).

TABLE 3. Comparison with other models on the AIZOO dataset (%).

TABLE 4. Comparison with other models on the Moxa3K dataset (%).

TABLE 5. FLOPs and the number of parameters of different models.