Facemask Wearing Alert System Based on Simple Architecture With Low-Computing Devices

The Covid-19 epidemic has caused heavy losses to humanity in terms of population, economy, and political stability. To deal with outbreaks of the pandemic, countries have been racing to develop vaccines and have issued many regulations for daily life. Wearing a facemask in public is mandatory, and violations are severely punished. In addition to these mandatory regulations, tools are needed that give an early warning when a person is not wearing a facemask in public places such as offices, schools, supermarkets, and train stations. This paper proposes a facemask wearing alert system based on simple convolutional neural networks (CNNs) operating on low-computing devices. The system works in two stages: face detection and facemask classification. In the first stage, a face detection network exploits the benefits of convolution, depthwise separable convolution, and double detector layers to extract the face region of interest (RoI). This image area then goes through a facemask classification network that exploits the advantages of convolution, depthwise separable convolution, and skip connection layers to classify facemask wearing (Mask or NoMask). The proposed networks are trained and evaluated on benchmark datasets. Thanks to simple designs and network parameters optimized without sacrificing accuracy, the system works in real time at 33.17 and 26.18 frames per second (FPS) on an Intel Core I7-4770 CPU @ 3.40 GHz (personal computer, PC) and an Nvidia Maxwell GPU (Jetson Nano device), respectively. The demo video can be found at https://bit.ly/3yUgb8f.


I. INTRODUCTION
Covid-19 is a dangerous pandemic that originated in Wuhan, China. According to statistics of the World Health Organization, as of December 23, 2021, the world had recorded about 276,436,619 infections and 5,374,744 deaths from Covid-19, and the numbers are increasing day by day [1]. Many strains of SARS-CoV-2, such as the Delta and Omicron strains, are serious and fast-spreading. Covid-19 is considered the biggest pandemic that has happened to humans, affecting the economy, politics, and social life of most countries in the world. Some studies have shown that facemasks can prevent infection with the coronavirus [2]. The World Health Organization also recommends that people wear a facemask when they have any respiratory symptoms [3].
The associate editor coordinating the review of this manuscript and approving it for publication was Jolanta Mizera-Pietraszko.
The governments of many countries have introduced initiatives and mandatory regulations, especially the wearing of facemasks in public places to limit airborne infection. However, compliance is still limited among some individuals, and not wearing a facemask can spread the disease. In addition, managing facemask wearing in crowded public places is difficult for the authorities because it requires considerable manpower and exposes the staff to the risk of Covid-19 infection. For poor countries, deploying technical equipment to prevent the spread of the virus is even more difficult. Therefore, developing a toolkit for automatic early alerts about facemask wearing is essential. With the rapid growth of CNN architectures in machine learning, many applications have been deployed to detect and localize facemasks in images, but practical application is still a challenge. In particular, deployment on low-computation devices such as CPUs and the Jetson Nano requires optimizing many factors so that the system operates smoothly with high accuracy. Based on this analysis, this paper focuses on designing simple convolutional neural networks with few parameters to build a facemask wearing alert system for public places. This design can take advantage of existing devices such as PCs, laptops, Jetson Nano devices, cameras, and speakers without additional deployment cost. The core contributions of this research are as follows:
1) Simple and lightweight convolutional neural network architectures are proposed to build the facemask wearing alert system. This is a two-stage system that consists of face detection and facemask classification. The network architectures exploit the benefits of convolution, depthwise separable convolution, and residual layers while maintaining accuracy.
2) A complete facemask wearing alert system is developed that is easily deployed and used with existing equipment to detect facemask wearing with high accuracy.
The rest of this paper is organized as follows: Section II introduces facemask detection methods and their pros and cons. Section III details the architecture of the proposed convolutional neural networks used to build the system. Section IV analyzes and interprets the experimental results. Section V presents the conclusion and future work.

II. RELATED WORK
Since the Covid-19 pandemic broke out, many studies have focused on detecting and classifying the wearing of facemasks. These methods can be separated into two groups: traditional methodology and CNN-based methodology.

A. TRADITIONAL METHODOLOGY
These studies mainly apply traditional methods to extract facial features and then classify them with simple approaches. The authors in [4] use the Viola-Jones algorithm to detect human faces in images and then apply Principal Component Analysis (PCA) to extract facemask features and classify each image as facemask or non-facemask. The method proposed in [5] uses the OpenCV library for preliminary face detection and the dlib library for facial feature extraction, then applies a hierarchical framework and similarity algorithms for sample classification. Based on available machine learning packages, [6] applies TensorFlow, Keras, and OpenCV to directly detect facemasks appearing in images. These approaches are easy to implement but remain computationally complex because they rely on traditional algorithms, which limits their feature extraction, detection, and classification abilities.

B. CNN-BASED METHODOLOGY
Facemask wearing detection based on convolutional neural networks has developed rapidly since the outbreak of the Covid-19 epidemic. These methods use deep learning network architectures either to detect facemasks directly or to perform face detection combined with facemask wearing classification. The work in [7] refines the InceptionV3 architecture to directly classify non-masked faces. In another approach, [8] uses image processing methods to segment and extract features, then applies the VGG-16 architecture to classify facemask wearing. That system is tested on a Raspberry Pi device with an image test set. The method introduced in [9] uses a ResNet-50 backbone network to extract the feature maps and a YOLOV2 network to detect medical facemasks. In the object detection field, [10] proposes a method that uses the R-CNN network family (R-CNN, Fast R-CNN, and Faster R-CNN) and YOLOV2 to detect facemasks and social distancing. The authors in [11] propose a two-stage neural network architecture named SSDMNV2, which uses an SSD network for face region detection and a MobileNetV2 network for facemask classification. Also exploiting the computational efficiency of the MobileNetV2 network, [12] develops a mask detector deployed on embedded systems. The work in [13] applies a hybrid of transfer learning and machine learning methods to build a facemask detection system. Several recent studies implement real-time facemask detection systems using the NVIDIA DeepStream SDK platform [14] and an improved YOLOV4 [15]. Most of the above approaches exploit modern convolutional neural network architectures for feature extraction and facemask detection. However, their application to real-time systems is limited because they have a large number of parameters, handle only an individual object, are evaluated only on image datasets, or are implemented only on a GPU.

III. PROPOSED METHOD
The entire facemask wearing alert system is detailed in Figure 1. The system is divided into two main stages: face detection and facemask classification. In the face detection stage, the paper proposes a simple convolutional neural network with double detectors to extract face regions (face RoIs) at multiple scales. This output is passed to the facemask classification stage, a lightweight convolutional neural network that classifies whether a facemask is worn. Additionally, the system is integrated with a camera to capture images from public places and a speaker system to broadcast warnings.

A. FACE DETECTION NETWORK
The face detection network architecture is described in Figure 2. This network consists of a backbone network and double detectors.

1) BACKBONE NETWORK
The backbone is built from four convolution layers. The first convolution layer uses a 7 × 7 kernel to quickly downsize the feature map. This layer loses some information but still preserves the basic features of large and medium faces. The three subsequent convolution layers all use 3 × 3 kernels, and each reduces the feature map size with a stride of two. The last convolution layer is followed by a depthwise separable convolution [16], which makes the detector fast and saves network parameters. For an input image of size 1024 × 1024, the backbone produces a final feature map of size 8 × 8; that is, the backbone reduces the input image dimension by a factor of 128.
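The parameter saving attributed to the depthwise separable convolution can be illustrated with a short calculation. This is a minimal sketch, not the authors' code: the channel widths (128 and 256) are hypothetical values chosen for illustration, and bias terms are ignored.

```python
def conv_params(kernel, c_in, c_out):
    """Weight count of a standard kernel x kernel convolution (bias ignored)."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel, c_in, c_out):
    """Weight count of a depthwise convolution followed by a 1 x 1 pointwise one."""
    depthwise = kernel * kernel * c_in  # one kernel x kernel filter per input channel
    pointwise = c_in * c_out            # 1 x 1 convolution that mixes the channels
    return depthwise + pointwise


# Hypothetical channel widths, not taken from the paper.
C_IN, C_OUT = 128, 256
standard = conv_params(3, C_IN, C_OUT)
separable = depthwise_separable_params(3, C_IN, C_OUT)
saving = standard / separable

# Overall downsampling factor stated in the text: 1024 x 1024 input, 8 x 8 output.
downsampling = 1024 // 8

print(standard, separable, round(saving, 2), downsampling)
```

For these assumed widths the separable layer needs roughly one ninth of the weights of the standard 3 × 3 layer, which is the kind of reduction the backbone relies on.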

2) DOUBLE DETECTORS
To obtain face RoIs (regions of interest) at multiple scales, the network uses double detectors. Each detector uses two sibling 3 × 3 convolution layers, one for classification and one for bounding box regression. The detectors operate on the last feature maps, of sizes 16 × 16 and 8 × 8. They use square anchor boxes of different sizes to predict the location of faces in the input image: four anchor boxes of sizes 32, 64, 96, and 128 for small faces, one anchor box of size 256 for medium faces, and one anchor box of size 512 for large faces. Finally, each detector generates a four-dimensional vector (x, y, w, h) as the offset of the face position, where (x, y) is the center coordinate, w is the width, and h is the height of the bounding box, and a two-dimensional vector (face or non-face) for label classification.
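The anchor layout described above can be sketched as follows. The text does not state which anchor sizes belong to which of the two detectors, so the split used here (small anchors on the 16 × 16 map, the 256 and 512 anchors on the 8 × 8 map) is an assumption, as is placing anchor centers at cell centers; `make_anchors` and `DETECTORS` are hypothetical names.

```python
def make_anchors(image_size, feature_size, anchor_sizes):
    """Return square anchors (cx, cy, w, h) centered on every feature-map cell."""
    stride = image_size // feature_size
    anchors = []
    for row in range(feature_size):
        for col in range(feature_size):
            cx = (col + 0.5) * stride
            cy = (row + 0.5) * stride
            for size in anchor_sizes:
                anchors.append((cx, cy, float(size), float(size)))
    return anchors


IMAGE_SIZE = 1024
# ASSUMPTION: assignment of anchor sizes to the two detectors is not given in the text.
DETECTORS = {
    16: (32, 64, 96, 128),  # 16 x 16 feature map
    8: (256, 512),          # 8 x 8 feature map
}

all_anchors = []
for feature_size, sizes in DETECTORS.items():
    all_anchors.extend(make_anchors(IMAGE_SIZE, feature_size, sizes))

print(len(all_anchors), all_anchors[0])
```

Under this assumed split, every anchor would receive one four-dimensional box offset and one two-dimensional face/non-face score from its detector.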

3) LOSS FUNCTION
The face detection network uses the MultiBox loss [17] during training. The total loss combines a classification loss $L_{cls}(c_i, c_i^*)$ and a regression loss $L_{reg}(r_i, r_i^*)$, and $B$ denotes the number of matched boxes. The classification loss is a softmax loss in which $x_i$ is the matching indicator between the $i$-th anchor and a ground-truth box, and $c_i^0$ is the confidence score for no object. The regression loss applies the smooth L1 function $H$ to the predicted and ground-truth boxes. Here, $c_i$ is the predicted label and $c_i^*$ is the ground-truth label of the $i$-th anchor; $r_i$ holds the center coordinate (x, y) and dimensions (width, height) of the predicted bounding box, and $r_i^*$ holds those of the ground-truth bounding box.
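The loss terms named in this paragraph can be written out as below. This is a reconstruction under assumptions rather than the paper's exact formulas: it follows the SSD-style form of the MultiBox loss, weights the two terms equally, and introduces the index sets $Pos$ (matched anchors) and $Neg$ (unmatched anchors), which do not appear in the text. Only the smooth L1 definition of $H$ is the standard, unambiguous form.

```latex
\begin{align}
L &= \frac{1}{B}\Big( L_{cls}(c_i, c_i^*) + L_{reg}(r_i, r_i^*) \Big), \\
L_{cls}(c_i, c_i^*) &= -\sum_{i \in Pos} x_i \log(c_i) - \sum_{i \in Neg} \log(c_i^0), \\
L_{reg}(r_i, r_i^*) &= \sum_{i \in Pos} x_i \, H(r_i - r_i^*), \\
H(z) &=
\begin{cases}
0.5\, z^2 & \text{if } |z| < 1, \\
|z| - 0.5 & \text{otherwise.}
\end{cases}
\end{align}
```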
The bounding box regression in this paper applies the four-coordinate parameterization of [18] to the predicted, anchor, and ground-truth bounding boxes:

$$t_x = \frac{x - x_a}{w_a}, \quad t_y = \frac{y - y_a}{h_a}, \quad t_w = \log\frac{w}{w_a}, \quad t_h = \log\frac{h}{h_a},$$

$$t_x^* = \frac{x^* - x_a}{w_a}, \quad t_y^* = \frac{y^* - y_a}{h_a}, \quad t_w^* = \log\frac{w^*}{w_a}, \quad t_h^* = \log\frac{h^*}{h_a},$$

where (x, y) denotes the center coordinate, w the width, and h the height of a bounding box. The variables $x$, $x_a$, and $x^*$ refer to the predicted bounding box, the anchor bounding box, and the ground-truth bounding box, respectively (and likewise for y, w, and h).
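A minimal sketch of the four-coordinate parameterization and the smooth L1 function follows, assuming the usual anchor-relative form (center offsets scaled by the anchor size, log-scaled width and height). The function names and the example boxes are hypothetical.

```python
import math


def encode(box, anchor):
    """Express a box (x, y, w, h) as offsets relative to an anchor (xa, ya, wa, ha)."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha))


def decode(offsets, anchor):
    """Invert encode(): recover (x, y, w, h) from offsets and the same anchor."""
    tx, ty, tw, th = offsets
    xa, ya, wa, ha = anchor
    return (tx * wa + xa, ty * ha + ya, wa * math.exp(tw), ha * math.exp(th))


def smooth_l1(z):
    """Quadratic near zero, linear for |z| >= 1."""
    a = abs(z)
    return 0.5 * z * z if a < 1.0 else a - 0.5


# Hypothetical anchor and ground-truth box, both given as (cx, cy, w, h) in pixels.
anchor = (96.0, 96.0, 64.0, 64.0)
ground_truth = (100.0, 90.0, 80.0, 72.0)

target = encode(ground_truth, anchor)
recovered = decode(target, anchor)
# Regression loss of a prediction that outputs all-zero offsets for this anchor.
loss = sum(smooth_l1(p - t) for p, t in zip((0.0, 0.0, 0.0, 0.0), target))

print(target, recovered, loss)
```

Encoding relative to the anchor keeps the regression targets small and roughly scale-independent, which is why the same detector head can serve anchors from 32 to 512 pixels.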

B. FACEMASK CLASSIFICATION NETWORK
The details of the facemask classification network architecture are shown in Figure 3. This network includes three modules: stem, skip connection, and classification.

1) STEM MODULE
This module is built from five convolution blocks: two depthwise separable convolution blocks and three standard convolution blocks. Each convolution block except the last one is followed by an average pooling layer. The kernel size used in the blocks decreases from 7 × 7 (first block) to 5 × 5 (second block) to 3 × 3 (third, fourth, and fifth blocks). The convolution with a large kernel size at the beginning of the module is intended to increase the receptive field and capture the basic, useful information about the object to be classified in the image. Its output is also the input of the first skip connection. Large kernels can increase the number of network parameters, but this problem is overcome by interleaving depthwise separable convolution layers, whose factorized computation helps to keep the parameter count low.

2) SKIP CONNECTION MODULE
As a neural network goes deeper, the feature maps lose a large amount of information. Therefore, combining current and previous level feature maps is necessary to maintain and enrich the extracted information. This is the working principle of ResNet [19], as shown in Figure 4. Inspired by this network, the skip connection module uses four skip connection layers at four network levels: 56 × 56, 28 × 28, 14 × 14, and 14 × 14. The high-level feature maps are combined with the lower-level features by element-wise addition to create new feature maps whose numbers of channels at the four levels are 32, 64, 128, and 256, respectively. Repeating this combination at different levels helps the entire network preserve the extracted information from end to end. Each skip connection uses a 1 × 1 convolution layer followed by a batch normalization layer, and its output is then combined with a higher-level feature map.
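The skip connection described above (1 × 1 convolution, batch normalization, then element-wise addition) can be sketched in plain Python on tiny nested lists. This is an illustrative sketch under assumptions: both feature maps are taken to share the same spatial size, batch normalization is shown in inference form with given statistics, and all weights and values are hypothetical.

```python
def conv1x1(feature_map, weights):
    """1 x 1 convolution: feature_map is H x W x C_in, weights is C_out x C_in."""
    return [[[sum(w * v for w, v in zip(w_row, pixel)) for w_row in weights]
             for pixel in row] for row in feature_map]


def batch_norm(feature_map, mean, var, gamma, beta, eps=1e-5):
    """Per-channel normalization with fixed statistics (inference form)."""
    return [[[g * (v - m) / (s + eps) ** 0.5 + b
              for v, m, s, g, b in zip(pixel, mean, var, gamma, beta)]
             for pixel in row] for row in feature_map]


def add(a, b):
    """Element-wise addition of two H x W x C feature maps."""
    return [[[u + v for u, v in zip(pa, pb)]
             for pa, pb in zip(ra, rb)] for ra, rb in zip(a, b)]


# Hypothetical 1 x 2 feature maps: the lower level has 2 channels, the higher level 3.
low_level = [[[1.0, 2.0], [3.0, 4.0]]]
high_level = [[[10.0, 10.0, 10.0], [20.0, 20.0, 20.0]]]
weights = [[1, 0], [0, 1], [1, 1]]  # projects 2 channels to 3 channels

projected = conv1x1(low_level, weights)
# Identity statistics and eps=0 keep the numbers easy to follow.
normalized = batch_norm(projected, mean=[0.0] * 3, var=[1.0] * 3,
                        gamma=[1.0] * 3, beta=[0.0] * 3, eps=0.0)
fused = add(normalized, high_level)

print(fused)
```

The 1 × 1 convolution is what lets the lower-level map adopt the channel count of the higher-level map before the addition.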

3) CLASSIFICATION MODULE
The classification module is the last part of the facemask classification network. It comprises two main layers: a global average pooling layer and a softmax layer. While most current popular classification networks use fully connected and softmax layers to calculate the class probabilities, this network replaces the fully connected layers with global average pooling followed by softmax. This greatly reduces the number of parameters while still ensuring classification accuracy.
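Global average pooling followed by softmax can be sketched as below. The sketch assumes that the final feature map carries one channel per class and that the class order is (Mask, NoMask); neither detail is stated in the text, and the feature values are hypothetical.

```python
import math


def global_average_pooling(feature_map):
    """Reduce an H x W x C feature map to a list of C channel means."""
    height, width = len(feature_map), len(feature_map[0])
    channels = len(feature_map[0][0])
    totals = [0.0] * channels
    for row in feature_map:
        for pixel in row:
            for k, value in enumerate(pixel):
                totals[k] += value
    return [t / (height * width) for t in totals]


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical 2 x 2 feature map with one channel per class (ASSUMED order: Mask, NoMask).
feature_map = [[[1.0, 0.0], [3.0, 0.0]],
               [[2.0, 2.0], [2.0, 2.0]]]

pooled = global_average_pooling(feature_map)
probabilities = softmax(pooled)
label = ("Mask", "NoMask")[probabilities.index(max(probabilities))]

print(pooled, probabilities, label)
```

Global average pooling has no trainable weights, which is where the parameter reduction relative to fully connected layers comes from.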

4) LOSS FUNCTION
During training, the facemask classification network uses the categorical cross-entropy loss function, which is defined as:

$$L_{CE} = -\sum_{j=1}^{i} t_j \log(p_j),$$

where $i$ is the number of classes (in this case, $i = 2$), $t_j$ is the target indicator for class $j$ ($t_j = 0$ or $t_j = 1$), $p_j$ is the predicted probability for class $j$, and $\log$ is the natural logarithm.
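The categorical cross-entropy for the two-class case can be checked with a few lines of Python. This is a minimal sketch; the small `eps` clamp is an addition for numerical safety and the example probabilities are hypothetical.

```python
import math


def categorical_cross_entropy(targets, probabilities, eps=1e-12):
    """Sum of -t * log(p) over classes; eps guards against log(0)."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(targets, probabilities))


# One-hot target for the first class with two hypothetical predictions.
loss_confident = categorical_cross_entropy([1, 0], [0.9, 0.1])
loss_wrong = categorical_cross_entropy([1, 0], [0.1, 0.9])

print(loss_confident, loss_wrong)
```

Because only the term of the true class is non-zero, the loss reduces to the negative log-probability assigned to that class, so a confident wrong prediction is penalized far more than a confident correct one.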

C. REAL-TIME PROCESSING SYSTEM
After the separate networks have been trained and evaluated on their respective datasets, the two proposed networks are combined as described in Figure 1 for testing in the real-time system. The entire proposed facemask wearing alert system includes the face detection network (using the weight file trained on the AFW dataset), the facemask classification network (using the weight file trained on the CDD dataset), a camera (TGCAM-2000STAR), and an ordinary mini speaker. In the first stage, the face detection network detects the face positions in the video obtained from the camera. The input video resolutions are VGA (640 × 480 pixels) and HD (1280 × 720 pixels). The Non-Maximum Suppression (NMS) algorithm is used to remove redundant bounding boxes and ensure that each face is matched by a single bounding box. The system then crops the image areas containing only the faces into separate regions and resizes them to 224 × 224 pixels. In the second stage, the facemask classification network processes the cropped image areas and classifies each face as wearing or not wearing a facemask. During the classification process, if any face is without a facemask, the system generates an alert sound (a beep) through the speaker.
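The NMS step and the alert decision can be sketched as follows. This is a generic greedy NMS, not necessarily the authors' variant; the IoU threshold of 0.5, the corner box format (x1, y1, x2, y2), the example detections, and the helper names are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best-scoring box, drop boxes that overlap it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


def should_alert(labels):
    """The alert sounds if at least one detected face is classified as NoMask."""
    return any(label == "NoMask" for label in labels)


# Hypothetical detections: the first two boxes cover the same face.
boxes = [(100, 100, 200, 200), (105, 105, 205, 205), (300, 120, 380, 200)]
scores = [0.90, 0.80, 0.95]
kept = nms(boxes, scores)

print(kept, should_alert(["Mask", "NoMask"]))
```

In the full pipeline each kept box would be cropped, resized to 224 × 224 pixels, and passed to the classifier before `should_alert` is evaluated on the resulting labels.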

IV. EXPERIMENTS

A. DATASET PREPARATION

1) FACE DETECTION DATASET
The face detection network is trained on the WIDER FACE dataset [20] and evaluated on two datasets: Annotated Faces in the Wild (AFW) [21] and PASCAL FACE [22]. The WIDER FACE dataset is a benchmark dataset that is more challenging than other datasets. It comprises 32,203 images with 393,703 face labels that vary in scale, pose, and occlusion. The images are mainly selected from the WIDER dataset. The AFW dataset includes 203 images with 473 labeled faces collected from the Flickr website. This dataset contains abundant backgrounds and challenges.

2) FACEMASK CLASSIFICATION DATASET
The facemask classification network was trained and evaluated on four datasets: Simulated Masked Face Dataset (SMFD) [23], Real-World Masked Face Dataset (RMFD) [24], Face Mask Lite Dataset (FMLD) [25], and Chandrika Deb's Dataset (CDD) [12]. The SMFD contains 1,376 images, of which 690 are simulated facemask images and 686 are images without facemasks. The RMFD consists of 5,000 images with facemasks and 5,000 images without facemasks. All images in the FMLD are produced using the StyleGAN2 network; the dataset comprises 10,000 HD images in each of two folders, with facemasks and without facemasks.

1) FACE DETECTION NETWORK
The face detection network is trained for 300 epochs with a batch size of 16, a weight decay of 5 × 10⁻⁴, and a momentum of 0.9.

2) FACEMASK CLASSIFICATION NETWORK
The facemask classification network is implemented using the Keras framework. It is trained with basic configurations commonly used for popular classification networks. In particular, the network is trained for 200 epochs with the Adam optimizer and a batch size of 16. The learning rate starts at 10⁻⁴ and is multiplied by a coefficient of 0.75 when the accuracy has not increased after 10 epochs.
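The learning rate rule can be read as a reduce-on-plateau schedule, and the sketch below simulates it in plain Python. The interpretation (a patience of 10 epochs without improvement, then multiplication by 0.75) is an assumption about the text, and the accuracy sequence is hypothetical. In Keras, a callback such as `ReduceLROnPlateau` with `factor=0.75` and `patience=10` would be one way to obtain similar behavior.

```python
def plateau_schedule(accuracies, initial_lr=1e-4, factor=0.75, patience=10):
    """Return the learning rate used in each epoch under a reduce-on-plateau rule."""
    lr = initial_lr
    best = float("-inf")
    wait = 0
    history = []
    for accuracy in accuracies:
        history.append(lr)  # learning rate in effect during this epoch
        if accuracy > best:
            best = accuracy
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                lr *= factor  # shrink after `patience` epochs without improvement
                wait = 0
    return history


# Hypothetical run: accuracy peaks in the first epoch and never improves afterwards.
accuracies = [0.90] + [0.89] * 25
history = plateau_schedule(accuracies)

print(history[0], history[11], history[21])
```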

1) FACE DETECTION NETWORK
The face detection network is trained on the WIDER FACE dataset and evaluated on two datasets, AFW and PASCAL FACE. The network achieves an AP of 97.46% and 94.44% on the AFW and PASCAL FACE datasets, respectively. It outperforms traditional image processing and other CNN-based methods on both datasets. Compared to FaceBoxes [26] and the FFCPU network [27] (the latest network architectures for face detection on CPU devices), its detection ability is weaker on the AFW and PASCAL FACE datasets. However, it contains only 545,092 parameters, which is 1.55 times fewer than the FaceBoxes network (844,610 parameters) and 1.82 times fewer than the FFCPU network (989,832 parameters). Figure 5 shows the evaluation and qualitative results of the face detection network on the AFW and PASCAL FACE datasets. These results show that the network detects faces of different sizes and poses well. In addition, it can detect multiple faces appearing in a single image.
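The parameter ratios quoted above follow directly from the reported counts, as this short check shows. The variable names are illustrative only.

```python
# Parameter counts reported in the text.
proposed = 545_092
faceboxes = 844_610
ffcpu = 989_832

ratio_faceboxes = round(faceboxes / proposed, 2)
ratio_ffcpu = round(ffcpu / proposed, 2)

print(ratio_faceboxes, ratio_ffcpu)
```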

2) FACEMASK CLASSIFICATION NETWORK
The facemask classification network was trained and evaluated on the four datasets mentioned above and achieved accuracies of 95.52%, 99.83%, 98.94%, and 100% on SMFD, RMFD, CDD, and FMLD, respectively. In addition, this work fine-tunes and retrains popular classification networks on the SMFD, RMFD, FMLD, and CDD datasets for comparison.
To fine-tune these networks, all fully connected layers in each original network are replaced with a global average pooling layer to significantly reduce the network parameters. With only 413,616 parameters, the proposed network outperforms the popular classification networks as well as Das et al. [6] on SMFD, Chandrika Deb et al. [12] on CDD, and Loey et al. [13] on the SMFD and RMFD datasets. On the SMFD dataset, it is behind only SqueezeNet in classification ability, by a margin of just 0.24%, and it outperforms the other networks. The proposed facemask classification network is also faster than all other networks, reaching 39.88 FPS on an Intel Core I7-4770 CPU @ 3.40 GHz (PC) with a 224 × 224 input size. Table 1 compares the accuracy (%) and FPS of the facemask classification network with popular classification networks and other networks on the four datasets. The qualitative results of the facemask classification network on the four datasets are presented in Figure 6. The confusion matrix in Figure 7 also demonstrates that the classification ability of the proposed network is balanced across all four datasets.

3) REAL-TIME ANALYSIS
The real-time system process is described in Section III. The system is tested on VGA (640 × 480 pixels) and HD (1280 × 720 pixels) live-stream videos obtained from a TGCAM-2000Star camera connected to a PC and a Jetson Nano device, and on FHD (1920 × 1080 pixels) videos downloaded from YouTube. The experiments were carried out in a laboratory environment with male and female participants. As the results in Figure 8 show, the system reaches up to 33.17 FPS on the PC, and the speed is reduced to 18.12 FPS with FHD resolution on the Jetson Nano device. The detailed system speed in Table 2 shows that the proposed system can run in real time with negligible latency on low-computing devices such as CPUs and Jetson Nano devices. This study also compared its real-time performance with several experiments in [14] on the Jetson Nano device and found that the proposed method outperforms the methods on the DeepStream SDK platform. Compared to the experiments in [15], the proposed method uses only the CPU but reaches speeds equivalent to SSD and outperforms Faster R-CNN, YOLOV3, and YOLOV4 running on a GPU. It is slower only than the GPU-based method proposed by Jimin Ju et al. Figure 9 and Figure 10 show the speed comparison results (FPS) in real-time testing. During testing, this work found that the performance of the system may be affected by several environmental factors such as illumination, camera quality, and especially the camera-object distance. In addition, the number of participants in a test greatly affects the performance of the entire system. If the scene is crowded or the participants do not fit within the camera's frame, the processing speed of the system is reduced. Based on these tests, this paper recommends keeping the number of people under three to ensure the operating speed and stability of the system. Table 3 shows the speed (FPS) of the system in real-time testing on a PC with VGA resolution and different numbers of participants (one, two, and three).
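Frame rates such as those reported above are typically derived from per-frame timestamps. The sketch below shows one simple way to compute an average FPS and the corresponding per-frame time budget; it is not the authors' measurement code, and the helper names and example timestamps are hypothetical.

```python
def average_fps(frame_timestamps):
    """Average frames per second over a list of frame completion times (seconds)."""
    if len(frame_timestamps) < 2:
        return 0.0
    elapsed = frame_timestamps[-1] - frame_timestamps[0]
    return (len(frame_timestamps) - 1) / elapsed if elapsed > 0 else 0.0


def frame_budget_ms(fps):
    """Time available to process one frame, in milliseconds, at a given FPS."""
    return 1000.0 / fps


# Hypothetical timestamps: one frame finished every 0.5 s.
fps = average_fps([0.0, 0.5, 1.0, 1.5, 2.0])

# Per-frame budgets implied by the speeds reported in the text.
budget_pc = round(frame_budget_ms(33.17), 2)
budget_nano_fhd = round(frame_budget_ms(18.12), 2)

print(fps, budget_pc, budget_nano_fhd)
```

Since the classifier runs once per detected face, the time spent per frame grows with the number of people in view, which is consistent with the slowdown reported for crowded scenes.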

4) ABLATION STUDY
Face detection is an important module in the facemask wearing alert system. To evaluate the face detection ability, this work tested three networks with different detector and anchor configurations, called single detection, double detection (the proposed network), and triple detection. The single detector uses all square anchor boxes of sizes 32, 64, 96, 128, 256, and 512 at the 8 × 8 feature map. The triple detector uses four square anchor boxes of sizes 32, 64, 96, and 128 at the 32 × 32 feature map, one square anchor box of size 256 at the 16 × 16 feature map, and one square anchor box of size 512 at the 8 × 8 feature map. The results in Figure 11 show that single detection is quite weak because it only focuses on detecting large faces (at close range); it achieved 95.37% and 90.81% AP on the AFW and PASCAL FACE datasets, respectively. Increasing the number of detectors to three (triple detection) did not help: the detection ability decreased compared to single detection by 1.46% and 1.56% AP, respectively, while the network grew by 13,830 parameters. With double detection, the number of parameters lies between those of the other two networks, yet the network achieves the highest detection accuracy, with 97.46% and 94.44% AP on the AFW and PASCAL FACE datasets, respectively. Therefore, the proposed system uses the network with double detection in the first stage to accurately detect faces.
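The three detector configurations can be compared by the number of anchors they place on the image. The single and triple layouts below follow the text; the double layout is an assumption, because the text does not say which anchor sizes are attached to the 16 × 16 and 8 × 8 maps. The AP values of the triple detector are derived from the reported decreases.

```python
def count_anchors(config):
    """Total anchors for a mapping {feature_map_size: anchor_sizes}."""
    return sum(size * size * len(anchors) for size, anchors in config.items())


single = {8: (32, 64, 96, 128, 256, 512)}
triple = {32: (32, 64, 96, 128), 16: (256,), 8: (512,)}
# ASSUMPTION: split of anchors between the two detectors of the proposed network.
double = {16: (32, 64, 96, 128), 8: (256, 512)}

anchor_counts = {
    "single": count_anchors(single),
    "double": count_anchors(double),
    "triple": count_anchors(triple),
}

# AP of the triple detector implied by the reported drops relative to single detection.
triple_ap_afw = round(95.37 - 1.46, 2)
triple_ap_pascal = round(90.81 - 1.56, 2)

print(anchor_counts, triple_ap_afw, triple_ap_pascal)
```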
The facemask classification network is used in the last stage of the facemask wearing alert system. It is responsible for classifying facemask wearing and triggering the alerts broadcast through the speaker. This work also evaluates the classification ability of the proposed network through several designs. The results in Table 4 show that using fully connected layers increases the accuracy only slightly (by 0.33%) while significantly increasing the network parameters (by 184,450 parameters) compared to GAP. In addition, combining the fully connected layer with the skip connection both increases the network parameters (up to 598,066 parameters) and reduces the accuracy (down to 97.80%). Based on these experiments, this study uses GAP combined with the skip connection, which achieves the best classification result of 99.02%.
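The parameter figures in this paragraph are mutually consistent, as the following check shows. It assumes that the 184,450 figure is the number of parameters added by the fully connected layers and that 413,616 is the size of the proposed GAP plus skip connection design; variable names are illustrative.

```python
gap_with_skip = 413_616   # parameters of the proposed classifier (GAP + skip connection)
fc_increase = 184_450     # ASSUMED to be the extra parameters of the fully connected layers
fc_with_skip = gap_with_skip + fc_increase

# Accuracy gap between the proposed design and the fully connected + skip design.
accuracy_gap = round(99.02 - 97.80, 2)

print(fc_with_skip, accuracy_gap)
```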

V. CONCLUSION
This paper proposed a facemask wearing alert system based on simple and lightweight convolutional neural network architectures, including face detection and facemask classification networks. The face detection network uses basic convolutional layers and double detection to detect faces in the scene. The facemask classification network is designed with convolution layers, depthwise separable convolution layers, and skip connections to classify faces with and without facemasks. If a face is without a facemask, the system changes the color of the face bounding box to red and plays a sound to warn about not wearing a facemask in public places. With network parameter optimization, the system achieved up to 33.17 FPS and 26.18 FPS on CPU and Jetson Nano devices, respectively. The system can be deployed on low-cost, readily available, low-computing devices such as CPUs and edge devices. In the future, the system will be further developed to detect distant faces, very small faces, and larger numbers of people. The system will also be integrated with a social distancing alert system for dual duty.