CloudNet: A LiDAR-Based Face Anti-Spoofing Model That Is Robust Against Light Variation

Face anti-spoofing (FAS) is a technology that protects face recognition systems from presentation attacks. A current challenge in FAS research is building a model that generalizes across light variations, because face data are sensitive to the light domain. FAS models that use only red green blue (RGB) images perform poorly when the training and test datasets have different light variations. To overcome this problem, this study focuses on light detection and ranging (LiDAR) sensors. LiDAR is a time-of-flight depth sensor that is included in the latest mobile devices. It is negligibly affected by light and provides 3D coordinate and depth information of the target. Thus, a model that is resistant to light variations and exhibits excellent performance can be created. For the experiment, datasets collected with a LiDAR camera are built and CloudNet architectures for RGB, point clouds, and depth are designed. Three protocols are used to confirm the performance of the model according to variations in the light domain. Experimental results indicate that for protocols 2 and 3, CloudNet error rates increase by 0.1340 and 0.1528, whereas the error rates of the RGB model increase by 0.3951 and 0.4111, respectively, as compared with protocol 1. These results demonstrate that the LiDAR-based FAS model with CloudNet generalizes better than the RGB model.


I. INTRODUCTION
Face recognition systems are widely used in various applications owing to their convenience and excellent performance. However, this technology is vulnerable to presentation attacks, such as print, replay, and 3D masks. In particular, 2D printers and mobile devices can easily generate print and replay attacks. Advances in scanners and 3D printers have enabled the production of high-quality 3D masks. Now, obtaining images of certain people's faces through the Internet is easy; consequently, sophisticated spoofs can be created for malicious purposes. Therefore, numerous studies have been conducted to improve the face anti-spoofing (FAS) model.
The associate editor coordinating the review of this manuscript and approving it for publication was Alberto Cano.
Traditionally, FAS has adopted handcrafted methods, such as eye blinking [1] or gaze tracking [2]. Owing to the rapid developments in deep learning technology, end-to-end deep learning-based FAS models have also been studied extensively [3], [4], [5]. Several of these studies focused on commercial red green blue (RGB) cameras because they offer an excellent balance between performance and cost [6], [7], [8]. However, some industries, such as mobile payments, require a secure model with lower errors, even if the costs are higher. Therefore, numerous studies have recently been conducted to further improve the performance of FAS models using advanced sensors [9], [10], [11], [12], [13]. Advanced sensors include near-infrared (NIR), short-wavelength infrared (SWIR), depth, thermal, light-field, and polarization cameras. In practice, these sensors perform excellently at detecting presentation attacks. Light detection and ranging (LiDAR) sensors also have these advantages. A LiDAR-based multi-modal FAS model uses 3D spatial and depth information as well as color information, which leads to excellent performance.
LiDAR sensors have another advantage in that they provide FAS models that are robust against light variations [14], [15]. One of the challenges that FAS studies face is creating a model that generalizes across environmental factors, such as light and background [16], [17], [18], [19], [20]. In particular, face data are significantly affected by the intensity of light [17]. This implies that when a FAS model is deployed in an actual service, face data obtained under different illuminations may be misidentified. This can have catastrophic consequences for financial services, such as mobile payments. This problem can be overcome by collecting training data from numerous environments. However, face data are difficult to collect because of the nature of biometric data, and collecting such data in various environments is even more challenging. To solve this problem, this study focused on LiDAR sensors. LiDAR sensors measure distance by calculating the round-trip delay of the light signal emitted by the laser to the target. This yields the 3D spatial coordinates of the points that make up the target, which are called point clouds. Compared with RGB data, whose values fluctuate with light variation, LiDAR point clouds are negligibly affected by light. Therefore, using LiDAR sensors can reduce the impact of light on the model performance. Finally, LiDAR sensors have recently been integrated into mobile devices, making it convenient to create multi-modal models without the need for additional hardware. This is beneficial because it enables real-world mobile applications that use both RGB cameras and LiDAR sensors.
In this study, a LiDAR-based FAS model called CloudNet is proposed. As shown in Fig. 1, CloudNet determines the liveness of a face using RGB images, point clouds, and depth images obtained from a LiDAR-based camera. CloudNet consists of an RGB space network and a LiDAR space network so that separate weights are learned for the RGB data and for the point cloud and depth data. The architecture of CloudNet is a binary classifier based on Resnet34. This choice was made because recent multi-modal FAS methods have adopted Resnet [21], [22], VGG [23], and similar networks as backbones for image classification tasks. To verify the model performance, a dataset collected by the LiDAR sensor was required. Because no public face dataset has been built with LiDAR sensors, the LiDAR dataset for FAS (LDFAS) was built in this study using an Apple iPad equipped with LiDAR sensors. Three protocols were used to confirm the superiority of the model under light variation. In protocol 1, the training and test sets had the same light domains. Protocols 2 and 3 constructed these sets with different light domains. The RGB model and CloudNet had error rates of 0.0667 and 0 for protocol 1, 0.4618 and 0.1340 for protocol 2, and 0.4778 and 0.1528 for protocol 3, respectively. Relative to protocol 1, the errors of CloudNet increased by 0.1340 and 0.1528, whereas those of the RGB model increased by 0.3951 and 0.4111. This demonstrates that CloudNet with LiDAR sensors is a more generalized model than the RGB model. In addition, we investigated the trade-offs caused by using the LiDAR sensor. The costs are detailed in the experimental results and discussion parts of Section V. The contributions of this study can be summarized as follows.
• A method to create a generalized model for the light domain was devised using a LiDAR sensor.
• The LDFAS, which contains point clouds and depth using LiDAR sensors, was built.
• CloudNet was designed to efficiently train point clouds using a LiDAR sensor.
The remainder of this paper is organized as follows. Section II discusses related work. Section III describes the dataset built herein (LDFAS). Section IV explains the proposed method in detail. Section V covers the experimental setup, evaluation metrics, experimental results, and ablation studies; additionally, it discusses the results. Finally, Section VI concludes the study.
VOLUME 11, 2023

II. RELATED WORK

A. MULTI-MODAL FACE ANTI-SPOOFING
Deep learning-based FAS studies can be divided into two categories depending on the sensor used [24]. The first category uses only a commercial RGB camera [25], [26]. As previously mentioned, RGB cameras are an excellent way to create low-cost, high-performance FAS models. However, certain high-security scenarios, such as face payment and vault entrance, require an extremely low false acceptance rate. As a result, a second category has been introduced that uses a specialized sensor, with or without a commercial RGB camera. These specialized sensors, including NIR sensors [9], [27], [28], SWIR sensors [9], [28], depth sensors [9], [10], [27], [28], thermal sensors [9], [11], [28], light-field cameras [12], and four-way polarization cameras [13], increase the accuracy of FAS models. SWIR sensors are known to be effective against 3D mask attacks because they respond to the moisture present in real faces [9]. An ablation study in [27] showed that adding depth and IR sensors reduces the error rate of FAS models. Thermal sensors effectively block attacks based on the fact that the average temperature of the human face is 36-37 °C [11]. Additionally, a light-field camera and a four-directional polarization sensor improve the FAS model performance [13]. This study belongs to the second category. Herein, a LiDAR sensor, which is a time-of-flight-based depth sensor, was used.

B. MULTI-MODAL FUSION
Multi-modal fusion is a method of combining data collected from different modalities to achieve more accurate results [29]. It is widely used in various fields, from affective computing [30] to autonomous driving [31]. Recent studies have demonstrated that combining visual, vocal, and textual data makes it possible to identify psychological patterns more accurately from multiple perspectives [30], [32]. In the field of autonomous driving, multi-modal fusion has also been used [29], [33]. RGB images provide rich visual information but are sensitive to light variation. Point clouds are not affected by light but have limited resolution. Autonomous driving studies fuse RGB images and point clouds so that the two data types complement each other and overcome their individual limitations [29]. Currently, multi-modal models often use one of three methods for combining data: early fusion, middle fusion, and late fusion. Early fusion combines data at the pre-processing stage, middle fusion combines them during the feature extraction phase, and late fusion combines the outputs from multiple models to produce the final result [29].
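The three fusion points described above can be illustrated with a minimal Python sketch. The function names, the use of concatenation for combining inputs or features, and the averaging of scores in late fusion are illustrative assumptions; [29] covers many other variants.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def concat(*parts: Sequence[float]) -> Vector:
    """Concatenate several vectors into one (assumed fusion operator)."""
    out: Vector = []
    for part in parts:
        out.extend(part)
    return out


def early_fusion(inputs: Sequence[Vector],
                 model: Callable[[Vector], float]) -> float:
    # Combine the raw inputs first, then run a single model on the result.
    return model(concat(*inputs))


def middle_fusion(inputs: Sequence[Vector],
                  extractors: Sequence[Callable[[Vector], Vector]],
                  head: Callable[[Vector], float]) -> float:
    # Extract features per modality, fuse the features, then run one head.
    features = [extract(x) for extract, x in zip(extractors, inputs)]
    return head(concat(*features))


def late_fusion(inputs: Sequence[Vector],
                models: Sequence[Callable[[Vector], float]]) -> float:
    # Run one complete model per modality and average the output scores.
    scores = [model(x) for model, x in zip(models, inputs)]
    return sum(scores) / len(scores)
```

With toy "models" such as a mean over the input values, the three functions generally return different scores for the same inputs, which is the practical reason the fusion point is treated as a design choice.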

C. LiDAR SENSOR AND POINT CLOUD
The LiDAR sensor measures distance by calculating the round-trip delay of a light signal emitted from the laser toward the target [34]. It has been used as an observation technology for precise atmospheric analysis and global environmental observation when mounted on aircraft and satellites, and as an important technology for laser scanners and 3D imaging cameras in autonomous driving. Recently, mobile applications that use LiDAR for face recognition and clothes measurement have also been studied [35], [36]. The sensor generates point cloud data, which are a 3D representation of the target. Deep learning models can learn from point clouds via three approaches [37]. The first projects the point cloud onto a 2D plane and then learns the features using conventional 2D convolutional neural networks (CNNs). The second is a voxel-based method that learns with a 3D CNN over a 3D space-based representation called voxels. The third learns pixel-by-pixel. The first method was used in the present study. As is well known, the feature distribution of LiDAR images changes drastically at different image locations despite the similarities between regular RGB and LiDAR images [38]. Recently, some methods have been devised so that deep learning models can learn effectively from point cloud data. For example, SqueezeSegV3 adopts the SAC block [38], whereas FPS-Net uses the MRF-RDB block [39]. To address this problem, a separate network for each type of data was designed herein. More information is provided in Section IV.
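The time-of-flight principle mentioned at the start of this subsection amounts to halving the distance that light travels during the measured round trip. The following Python sketch shows this arithmetic only; the function names are hypothetical, and a real sensor additionally handles calibration and noise, which are not modeled here.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0  # speed of light in vacuum


def tof_distance_m(round_trip_delay_s: float) -> float:
    """Distance to the target from a measured round-trip delay."""
    if round_trip_delay_s < 0:
        raise ValueError("delay must be non-negative")
    # The signal travels to the target and back, hence the division by 2.
    return SPEED_OF_LIGHT_M_PER_S * round_trip_delay_s / 2.0


def round_trip_delay_s(distance_m: float) -> float:
    """Inverse relation: delay expected for a target at distance_m."""
    return 2.0 * distance_m / SPEED_OF_LIGHT_M_PER_S
```

For a face at roughly 0.8 m, the round-trip delay is only a few nanoseconds, which gives a sense of the timing resolution such a sensor needs.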

III. LiDAR DATASET FOR FACE ANTI-SPOOFING (LDFAS)
The LDFAS was built to develop a LiDAR-based FAS model. The dataset comprises 8,640 face samples collected from 36 Korean participants (2,880 images in each of the RGB, point cloud, and depth modalities). This section describes the LiDAR application, data collection procedure, comparison with multi-modal public datasets, and evaluation protocols. Examples from the dataset are presented in Fig. 2.

A. LiDAR APPLICATION
An ARKit-based mobile app that simultaneously generates RGB, point cloud, and depth data was used [35]. This camera application also provides the mapping from each point in the point cloud to a specific pixel of the RGB image. Depth images are derived from the point clouds. Each 3D point cloud had 45,192 points, the depth image had a resolution of 256 × 192, and the RGB image was generated at 1440 × 1080 pixels. The RGB image and the point cloud were captured with the 12 MP Wide Camera and the TOF 3D LiDAR scanner, respectively.
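How a depth image can be derived from a point cloud is sketched below under a simple pinhole camera assumption. The intrinsic parameters fx, fy, cx, cy, the camera-coordinate convention (z pointing forward), and the function name are hypothetical; the mapping used by the actual application is provided by the app itself and may differ.

```python
from typing import Iterable, List, Optional, Tuple

Point = Tuple[float, float, float]  # (x, y, z) in camera coordinates, metres


def project_to_depth(points: Iterable[Point], width: int, height: int,
                     fx: float, fy: float, cx: float, cy: float
                     ) -> List[List[Optional[float]]]:
    """Project 3D points onto an image plane and keep the nearest depth."""
    depth: List[List[Optional[float]]] = [[None] * width for _ in range(height)]
    for x, y, z in points:
        if z <= 0:
            continue  # points behind the camera cannot be projected
        # Pinhole projection (assumed model) to integer pixel coordinates.
        u = int(round(fx * x / z + cx))
        v = int(round(fy * y / z + cy))
        if 0 <= u < width and 0 <= v < height:
            current = depth[v][u]
            # When several points fall on one pixel, keep the closest one.
            if current is None or z < current:
                depth[v][u] = z
    return depth
```

Pixels that receive no point stay `None`; a practical pipeline would typically fill or interpolate them before feeding the image to a CNN.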

B. DATA COLLECTION PROCEDURE
The participants were instructed to sit in front of the camera and look towards the sensor. The LDFAS dataset is divided into three subsets: indoor, outdoor, and indoor (dark). Table 2 describes these subsets. For the indoor subset, the participants were positioned 70-90 cm away from the camera while the bonafide and 3D mask samples were photographed. The lighting was maintained at 170-180 lux. The participants were also asked to slightly rotate their heads, and 20 still images were captured instead of recording video. The outdoor subset was collected during the day at different locations and at different times. The distance between the camera and the participants was adjusted randomly. The indoor (dark) subset was collected in a dark indoor environment with varying degrees of darkness. The distance between the camera and the participants was again adjusted randomly, the participants were asked to slightly rotate their heads, and 20 still images were captured instead of recording video. Print and replay attacks were created from the bonafide images. All print attacks and all replay attacks in the three subsets were photographed under the same lighting conditions, at the same location, and at the same time, respectively. An interesting point is that, as shown in Fig. 2, replay attacks captured by the LiDAR sensor did not appear as a completely flat surface. This phenomenon occurred because the light emitted by the LiDAR sensor was reflected by the surface of the device displaying the replay attack. Therefore, it was important to collect all replay attacks in the LDFAS under the same conditions. Print attacks were produced with a laser printer, and replay attacks were produced by replaying the images on an Apple iPad. The 3D mask was made of thermoplastic polyurethane (TPU) using the Cubicon 3DP 320C Single Plus 3D printer. The numbers of bonafide samples and attacks are listed in Table 3.

C. COMPARISON WITH MULTI-MODAL BASED PUBLIC DATASETS
In recent years, several public datasets have been built for research on multi-modal FAS. Table 1 shows the novel aspect of LDFAS in comparison with these datasets. To the best of our knowledge, the most recent datasets are CeFA, HQ-WMCA, and PADISI-Face [9], [27], [28]. CeFA is a large dataset with 1,607 participants; it was collected using RGB, depth, and IR sensors and includes presentation attacks in the form of impersonation [27]. HQ-WMCA is also a large dataset; it was constructed using RGB, depth, NIR, SWIR, and thermal sensors and includes presentation attacks in the form of impersonation and obfuscation [9]. PADISI-Face is also a large dataset, with 360 participants; it is composed of various modalities similar to those of HQ-WMCA and includes presentation attacks in the form of impersonation and obfuscation [28]. The main difference of the LDFAS dataset built herein is the use of a new modality, LiDAR. The LiDAR sensor generates point cloud data, which is why the dataset is composed of RGB, point cloud, and depth map data.

D. EVALUATION PROTOCOLS
The goal of this study was to develop a generalized FAS model considering light variations. Three protocols were designed for this purpose. Protocol 1 corresponds to the case in which the training and test datasets share the same light conditions. By contrast, protocols 2 and 3 use different light conditions for training and testing. The models were trained only on the indoor set and tested on the indoor, outdoor, and indoor (dark) sets. Details of each protocol are listed in Table 4.
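The protocol design can be summarized as a small configuration, sketched here in Python. The subset identifiers are hypothetical, and the assignment of the outdoor set to protocol 2 and the indoor (dark) set to protocol 3 is inferred from the conclusion's description of protocol 2 as a brighter and protocol 3 as a darker light domain. How the indoor data are split between training and testing in protocol 1 is not specified here and is therefore not modeled.

```python
from typing import Dict

# Hypothetical identifiers for the three LDFAS subsets.
INDOOR, OUTDOOR, INDOOR_DARK = "indoor", "outdoor", "indoor_dark"

# Training always uses the indoor subset; only the test subset changes.
PROTOCOLS: Dict[int, Dict[str, str]] = {
    1: {"train": INDOOR, "test": INDOOR},       # same light domain
    2: {"train": INDOOR, "test": OUTDOOR},      # assumed: brighter domain
    3: {"train": INDOOR, "test": INDOOR_DARK},  # assumed: darker domain
}


def has_light_shift(protocol: int) -> bool:
    """True when the test light domain differs from the training one."""
    cfg = PROTOCOLS[protocol]
    return cfg["train"] != cfg["test"]
```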

IV. PROPOSED METHOD
In this section, the CloudNet architecture is explained. The structure is composed of an RGB space network and a LiDAR space network. These networks extract facial features from the RGB data and the LiDAR data (point cloud and depth), respectively. CloudNet performs both early fusion and late fusion to classify bonafide and spoofing images. Herein, binary cross-entropy was used as the loss function. The architecture of the model is shown in Fig. 3.

A. ARCHITECTURE
The input data for CloudNet are RGB, point clouds, and depth. CloudNet performs two fusion operations. The first one is an early fusion of point cloud and depth. The second one is a late fusion of the RGB space network and the LiDAR space network. The fusion operation is represented as follows.
where F represents the fusion operation. Accordingly, the entire CloudNet network can be described as follows.
where $I_{rgb}$, $I_{pc}$, and $I_{d}$ represent the RGB, point cloud, and depth, respectively; $N_{rgb}$ and $N_{lidar}$ represent the RGB and LiDAR networks, respectively; and σ denotes the sigmoid function, which is a non-linear activation function [40]. Herein, both the RGB and LiDAR networks were implemented using Resnet34, which is a CNN-based network exhibiting outstanding performance in image classification [41]. Unlike the networks used in previous studies, the Resnet34 used herein does not have a fully connected layer. After the first (early) fusion, the input images pass through the Resnet34-based inner networks for the second (late) fusion. When the late fusion operation is completed, the fused features go to the fully connected layer and finally pass through the activation function. CloudNet consists of two networks owing to the characteristics of point clouds. The feature distribution of LiDAR images differs significantly from that of RGB images. As shown in Fig. 4, the feature distribution of RGB was confirmed to be different from that of the point cloud and depth images. A CNN applies the same weight matrix to all channels of the input image. Therefore, herein, a model that learns features from RGB and LiDAR data separately was designed.
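One way to write the two fusion steps consistently with the definitions above is sketched below. It assumes that F is a concatenation (written $\oplus$) and introduces $\mathrm{FC}$ for the fully connected layer and $\hat{y}$ for the predicted liveness score; these symbols are assumptions for illustration and are not taken from the original equations.

```latex
\begin{align}
  F(x_1, x_2) &= x_1 \oplus x_2, \\
  \hat{y} &= \sigma\!\left( \mathrm{FC}\!\left( F\!\left( N_{rgb}(I_{rgb}),\;
             N_{lidar}\!\left( F(I_{pc}, I_{d}) \right) \right) \right) \right).
\end{align}
```

The inner $F(I_{pc}, I_{d})$ corresponds to the early fusion of point cloud and depth, and the outer $F$ to the late fusion of the outputs of the RGB space and LiDAR space networks, followed by the fully connected layer and the sigmoid.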

B. LOSS FUNCTION
The FAS model is a binary classifier that classifies input images as bonafide or spoofing. Herein, a binary cross-entropy loss function was used to train the proposed network; this function was also used in [42], [43], and [44]. The loss function can be described as follows.
$$\mathcal{L} = -\left[\, y \log p + (1 - y) \log (1 - p) \,\right]$$
where y is the ground truth value and p is the predicted value.
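A minimal Python sketch of the standard binary cross-entropy, averaged over a batch, is given below. The averaging over samples and the clipping constant `eps` (used to avoid taking the logarithm of zero) are assumptions for illustration; the paper's implementation details are not stated beyond the use of PyTorch.

```python
import math
from typing import Sequence


def binary_cross_entropy(y_true: Sequence[float], p_pred: Sequence[float],
                         eps: float = 1e-12) -> float:
    """Mean binary cross-entropy between labels and predicted probabilities."""
    if len(y_true) != len(p_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for y, p in zip(y_true, p_pred):
        # Clip the prediction so that log() never receives 0.
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(y_true)
```

A confident wrong prediction is penalized far more heavily than a confident correct one, which is the property that makes this loss suitable for the bonafide-versus-spoofing decision.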

V. NUMERICAL EXPERIMENTS
This section presents the experimental setup, the evaluation metrics, the experimental results obtained under the three protocols presented in Section III, the ablation study, and a discussion.

A. EXPERIMENTAL SETUP
This study argues that a LiDAR sensor can provide an FAS model that is robust against light variation. Additionally, CloudNet is proposed as a model suitable for training with RGB, point cloud, and depth images. To support this argument, three models, namely Resnet34 with RGB, Resnet34 with three shots, and CloudNet with three shots (referred to as the RGB, LiDAR, and CloudNet models, respectively), were employed herein. The models were trained and tested according to the three protocols described in Section III. RGB, point cloud, and depth images were all resized to 180 × 180.
In the training stage, we used the Adam optimizer and set the learning rate to 1e-3. The batch size was 4 on a single 2080Ti GPU. We trained the models for a maximum of 1000 epochs. He initialization was used as the weight initialization method. All code was implemented in PyTorch.

B. EVALUATION METRICS
To evaluate the performance of CloudNet, first, the bonafide presentation classification error rate (BPCER), attack presentation classification error rate (APCER), and average classification error rate (ACER) were used as the evaluation metrics. These metrics were proposed in ISO/IEC 30107-3:2017 for performance assessment of presentation attack detection mechanisms [45]. BPCER is the proportion of bonafides incorrectly rejected as an attack. APCER is the percentage of attacks incorrectly accepted as bonafides. ACER is the average of BPCER and APCER. Additionally, a receiver operating characteristic (ROC) curve was used.
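The three error rates can be computed from binary decisions as in the following Python sketch. The label convention (1 for bonafide, 0 for attack) and the function name are assumptions, and all attack types are pooled into a single APCER, following the description above rather than a per-attack-type breakdown.

```python
from typing import Sequence, Tuple


def pad_metrics(labels: Sequence[int],
                predictions: Sequence[int]) -> Tuple[float, float, float]:
    """Return (BPCER, APCER, ACER); 1 = bonafide, 0 = attack."""
    bonafide = [p for y, p in zip(labels, predictions) if y == 1]
    attacks = [p for y, p in zip(labels, predictions) if y == 0]
    if not bonafide or not attacks:
        raise ValueError("need at least one bonafide and one attack sample")
    # BPCER: bonafide presentations incorrectly rejected as attacks.
    bpcer = sum(1 for p in bonafide if p == 0) / len(bonafide)
    # APCER: attack presentations incorrectly accepted as bonafide.
    apcer = sum(1 for p in attacks if p == 1) / len(attacks)
    # ACER: average of the two error rates.
    acer = (bpcer + apcer) / 2.0
    return bpcer, apcer, acer
```

For example, one rejected bonafide out of four and one accepted attack out of five give a BPCER of 0.25, an APCER of 0.2, and an ACER of 0.225.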
To quantitatively compare the ROC curves, the area under the curve (AUC) values of the graphs were determined.

C. EXPERIMENTAL RESULTS
Table 5 reports the models' BPCER, APCER, and ACER values under protocol 1. The ACER values for the RGB, LiDAR, and CloudNet models were 0.0667, 0.025, and 0, respectively. CloudNet performed the best, followed by the LiDAR and RGB models. The experimental results indicate that when the test set is in the same light domain as the training set, using point cloud and depth improves the performance of the FAS model, although the improvement is subtle. Further, CloudNet learns from point cloud and depth images more effectively.
Table 6 reports the models' BPCER, APCER, and ACER values under protocol 2. The ACER for the RGB model was 0.4618, an increase of 0.3951 compared with protocol 1. By contrast, the ACER values for the LiDAR and CloudNet models were 0.1958 and 0.1340, corresponding to increases of 0.1708 and 0.1340, respectively. CloudNet had the smallest increase in errors, followed by the LiDAR and RGB models. This increase in the error rate indicates how well each model generalizes with respect to light variation.
Table 7 reports the models' BPCER, APCER, and ACER values under protocol 3. Compared with protocol 1, the ACER values of the RGB, LiDAR, and CloudNet models increased by 0.4111, 0.3340, and 0.1528, respectively. Similar to the experimental results obtained under protocol 2, CloudNet exhibited the smallest ACER growth, followed by the LiDAR and RGB models.
Furthermore, the performance of the models was compared based on AUC values. First, the ROC curves shown in Fig. 5 were plotted. The AUC values were then measured to quantitatively compare the ROC curves. Table 8 reports the AUC values of the three models. For protocol 1, the RGB, LiDAR, and CloudNet models had AUCs of 0.9956, 0.9931, and 1.0, respectively. For protocol 2, the AUC values decreased by 0.4594, 0.0661, and 0.1198, respectively. For protocol 3, they decreased by 0.3705, 0.1552, and 0.1094, respectively. This reduction in AUC values also supports the argument that LiDAR data render FAS models robust against light variation. The AUC value under protocol 3 also demonstrates that CloudNet is a better model than the LiDAR model. However, under protocol 2, the LiDAR model had a higher AUC value than CloudNet.
Finally, we investigated the trade-offs caused by using the LiDAR sensor. Table 9 shows the number of model parameters, the latency, and the number of multiply-adds (MAdds). The latency was measured by running a program that tests 100 samples 100 times and taking the average and standard deviation. The latency of the LiDAR model was 2% higher than that of the RGB model, and its numbers of parameters and MAdds increased by 0.01M and 0.1G, respectively. On the other hand, the latency of CloudNet was 16% higher than that of the RGB model, and its numbers of parameters and MAdds nearly doubled.
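The ACER increases quoted in the results can be re-derived from the reported ACER values, as the short Python check below illustrates. Only values that are stated explicitly in the text are included (the protocol 3 ACER of the LiDAR model is given only as an increase and is therefore omitted); the dictionary layout and function name are illustrative.

```python
# Reported ACER values per model and protocol (from the text).
ACER = {
    "RGB": {1: 0.0667, 2: 0.4618, 3: 0.4778},
    "LiDAR": {1: 0.025, 2: 0.1958},
    "CloudNet": {1: 0.0, 2: 0.1340, 3: 0.1528},
}


def acer_increase(model: str, protocol: int) -> float:
    """ACER increase of a model relative to protocol 1, rounded to 4 places."""
    return round(ACER[model][protocol] - ACER[model][1], 4)


for model, values in ACER.items():
    for protocol in sorted(values):
        if protocol != 1:
            print(model, protocol, acer_increase(model, protocol))
```

Because CloudNet has an ACER of 0 under protocol 1, its increases under protocols 2 and 3 are equal to its ACER values under those protocols.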

D. ABLATION STUDY
In addition, ablation studies were performed. Additional experiments were performed using the point cloud and depth data. For the multi-modal models, early fusion, late fusion, and hybrid fusion approaches were applied. Experiments in Section IV were conducted for the following cases: point cloud only, depth only, RGB and point cloud combined, and RGB and depth combined. The experimental results are listed in Table 10. According to the results of the point cloud and depth experiments conducted under protocol 1, these data are less suitable than RGB for performing FAS on their own. Whereas the RGB model had an error rate of 0.0667, the point cloud and depth models exhibited high error rates of 0.3028 and 0.2750, respectively. Essentially, if the training and test datasets are in the same domain, RGB provides stronger discrimination than point cloud or depth. However, the experimental results obtained under protocols 2 and 3 suggest that the point cloud and depth are negligibly affected by light variations. This is a clear advantage that RGB does not have. Next, the model performance was investigated using RGB and point cloud, and RGB and depth. The experimental results indicated that models built using RGB and depth performed better than those constructed using RGB and point cloud. Furthermore, additional ablation studies confirmed that training RGB and LiDAR data separately was effective. All models, including those using RGB and point cloud; RGB and depth; and RGB, point cloud, and depth, demonstrated better performance with a late fusion approach than with an early fusion approach. Additionally, when training RGB, point cloud, and depth together, the use of both early and late fusion, as in the CloudNet model, resulted in better performance than using only late fusion.
Lastly, the extracted features were visualized to determine how well the proposed model classifies bonafide and spoofing images. The t-distributed stochastic neighbor embedding (t-SNE) technique was used to transform the high-dimensional features extracted by the deep learning models into 2D features [46]. This technique was applied to the models we experimented with in Section IV. Fig. 6 shows the feature distributions of the models expressed by t-SNE.

E. DISCUSSION
Through the experiments, we found that the RGB model performs very poorly when tested on datasets with a domain shift in light. Compared with protocol 1, where the light domains of the training and test sets were the same, the performance of the RGB model decreased greatly in protocols 2 and 3, where the light domains were different. On the other hand, LiDAR sensors were found to improve the performance of FAS models and make them more robust to light changes, as confirmed by the experimental results and the ablation study. In addition, CloudNet further improved on the performance of the LiDAR model. This suggests that optimizing the way LiDAR data are trained can improve the FAS model. Meanwhile, there is a trade-off involved. The computational cost of the LiDAR model, measured in the number of model parameters and MAdds, is similar to that of the RGB model, with only small differences of 0.01M and 0.1G, respectively. However, the computational cost of CloudNet is twice that of the RGB model. This suggests that the increase in cost is primarily due to the CloudNet structure rather than the use of the LiDAR sensor. Therefore, reducing the cost of the CloudNet model while maintaining its performance is left as future work.

VI. CONCLUSION
In this study, an FAS model that uses a LiDAR sensor with an RGB camera was proposed. LiDAR provides 3D coordinate and depth information and has the advantage of robustness to light variation. Herein, the LDFAS was constructed to verify the superiority of the model. LDFAS consists of three subsets with different light variations. Based on these subsets, three protocols were chosen for the experiments, in which the test set has 1) the same light domain, 2) a brighter light domain, and 3) a darker light domain, compared with the training set. Additionally, CloudNet was designed to learn separate weights for the RGB and LiDAR data (point cloud and depth). The experimental results revealed that using a LiDAR sensor provides robustness to light variation compared with the RGB model. In addition, CloudNet performed better than the RGB and LiDAR models. However, the current CloudNet model has the drawback of being heavier than a regular LiDAR model. This indicates that there is room for improvement in LiDAR-based FAS models. Developing a better LiDAR-based FAS model through model lightweighting is left as future work.