“Blinks in the Dark”: Blink Estimation With Domain Adversarial Training (BEAT) Network

Blink detection plays an important role in many consumer-facing human-computer interaction applications. Unfortunately, deep neural network-based blink detection methods are not only susceptible to poor lighting conditions, but are also prone to bias caused by imbalanced dataset distributions. To solve these problems, we propose the Blink Estimation with Domain Adversarial Training (BEAT) network, which robustly detects blinks in unseen out-of-sample images, even those captured under poor lighting conditions, by extracting domain-invariant features. The BEAT network is inspired by the domain-adversarial neural network (DANN) but adds several improvements, including a lambda scheduler to stabilize adversarial training and a gradient decay layer to prevent the discriminative loss from overwhelming the classification loss. As a result, BEAT achieves faster and more accurate blink detection than other domain generalization methods on unseen target domains. In particular, BEAT’s feature extractor achieves state-of-the-art performance in terms of AUPR on popular benchmark datasets. We also suggest a practical optimal threshold for blink detection in consumer applications, based on insights gained from our experiments.


I. INTRODUCTION
Blink detection is an essential task in various human-computer interaction (HCI) scenarios such as gaze estimation, deception detection [2], driver fatigue detection [3], face anti-spoofing [4], and dry eye syndrome recovery [5]. For these applications, researchers have been working to improve the performance of eye blink detection [6], [7], [8], [9]. Some of these methods, for example [10], [11], have been reported to work well in real-world environments.
Nevertheless, limitations of publicly available training datasets, due to racial differences, lighting, data imbalance, etc., are not sufficiently considered in previous studies and impede the practical application of blink detection. These limitations become more severe when test or target datasets have different distributions, for example, different races and lighting conditions. In real-world applications, it is very likely that practitioners will face different racial distributions between the training dataset and the test (target) dataset. This problem becomes more evident in target countries or regions with homogeneous ethnic groups. Also, lighting conditions can be stringent for some specific applications where lighting needs to be reduced in order not to disturb the user. Target images captured in these environments tend to be darker than images from publicly available training datasets.
Another issue that has not been fully addressed is data imbalance in blink datasets: most blink datasets have significantly more open-eye images than closed-eye images. As a consequence, deep learning models can be skewed to one side because the samples of the two classes are not evenly represented.
To address these issues, we propose and discuss several strategies in this paper. For racial bias and poor lighting conditions in datasets, we apply domain generalization based on a domain adversarial training scheme. "Domain" in this context indicates a group of images with a similar distribution (e.g., races, bright and dark lighting conditions, various backgrounds). Thus, each set of images with a similar racial distribution or lighting condition forms a "domain". The main goal of domain generalization is to detect blinks correctly without being confused by irrelevant domain information (racial bias or different lighting conditions). We also suggest a practical guideline for determining the threshold, derived from our experiments, to mitigate the data imbalance issue.
To implement these strategies, we first present a baseline network (i.e., a feature extractor) that outperforms the latest results on public eye blink datasets. On top of this baseline, we design three versions of the Blink Estimation with Domain Adversarial Training (BEAT) network, which generalize to unseen domains using adversarial training and a KL divergence loss. To stabilize adversarial training, we design and apply a gradient regularization method to BEAT. Figure 1 visualizes the generalization result of BEAT using the t-SNE method [1]. Without domain generalization, the features from an unseen target domain (highlighted in blue) form a separate cluster from the source domains on the left. Since the BEAT network extracts domain-agnostic features, the feature map on the right shows that the features of the target domain become indistinguishable from those of the source domains in an aggregated cluster.
Our contributions can be summarized as follows. 1) Performance: Our baseline network (i.e., the feature extractor) outperforms the latest results on the RT-BENE [10] dataset, which is the de facto standard for eye blink detection. For instance, our baseline network performs 1.57% higher in the AUPR (area under precision-recall curve) and is 2.86 times faster than the latest method [12] (Table V).
2) Optimal Thresholds on Data-Imbalanced Blink Datasets: As blink datasets are highly imbalanced, we discuss appropriate evaluation metrics. Based on this discussion, we propose optimal thresholds that maximize the F1-score for binary classification on imbalanced datasets. Furthermore, we propose and discuss how to find the optimal sampling rate for a given optimal threshold in various cases.
3) Domain Generalization for Real Applications: We propose a domain adversarial training method for domain generalization using a gradient decay layer which enables stable adversarial training. The results show that our domain generalization method improves binary classification performance in the AUROC (area under receiver operating characteristic curve) and the AUPR by 2.99% and 50.24% in the Eyeblink8 target domain, 7.21% and 4.47% in the BID target domain, and 2.14% and 23.76% in the RT-BENE target domain, respectively.

II. BACKGROUND
Blink detection is usually implemented independently prior to gaze estimation as a natural design choice. This matters for appearance-based gaze estimation methods [13], [14], [15], which are based on deep neural networks (DNNs), because they predict gaze positions even when the user closes their eyes or when eyes are not recognized correctly in the face images. Therefore, one of the important roles of the blink detection stage is to act as a fail-safe that avoids unreliable outputs in the gaze estimation stage. For example, Fig. 2 shows an electric wheelchair controlled by gaze estimation, developed for people with disabilities who can only move their eyes and cannot operate control sticks with their arms and hands. If blink detection does not work correctly, users run a serious risk from unreliable gaze predictions when they close their eyes (e.g., sudden random changes in gaze position). In this wheelchair application, we have found that previous blink detection methods easily fail in backlit or very dim lighting conditions. Therefore, we need a robust way to predict blinks in backlit or very dark environments.
Another interesting consumer application for blink detection is indoor golf driving ranges (Fig. 3). Some novice golfers tend to involuntarily blink during the swing motion or at the moment of impact, resulting in unsatisfactory results. As a natural consequence, indoor golf application developers want their systems to be able to check the condition of the golfer's eyes during the swing motion and provide useful feedback.
Fig. 3. Left: our lab environment simulating a typical indoor golf driving range where our dataset (BID) has been prepared (Section IV). Right: schematic diagram of the lab environment. Lighting and camera positioning lead to dark, backlit face images as shown in Fig. 4.
The problem is that lights are usually installed on the ceiling to illuminate downward, and a camera is installed on the floor to capture the golfer's face upward as shown in Fig. 3, so the face images captured by the camera are very dark (Fig. 4). Additional upward lighting is generally prohibited so as not to obstruct the golfer's vision. As we will present later, previous methods do not effectively detect blinks in these backlit face images, so we need a practical way to address this issue.
Some consumer applications may rely on blink detection to prevent catastrophic accidents. For example, car or truck drivers who drive long distances are often exposed to fatal risks due to fatigue, and reliable blink detection is required while driving in low-light conditions. In prior studies, infrared (IR) light sources are often adopted to capture the driver's face or eyes in dark lighting conditions, since IR illumination does not obstruct the driver's vision when driving at night [3], [16]. Typically, the IR light source is installed near the driver's face along with the camera. However, applying IR illumination to the human eye can introduce complications. First, some studies [17], [18], [19], [20] report that IR illumination near the eye can harm the eyes. Second, we cannot directly estimate eye states from IR face images using deep neural network models trained on regular RGB images. Since most public and private datasets are in RGB format, the ability to detect blinks in IR images can be severely limited. Finally, IR cameras and lights are less accessible and less common to average consumers than RGB cameras, limiting their application.
Taken together, the discussed applications have common points. First, eye blink detection in backlit or very dark environments has various useful consumer applications, and failing to detect blinks could lead to catastrophic hazards in some cases. Second, additional directional lights directed toward the user's face should be avoided so as not to obstruct the user's view. Finally, consumer-level RGB cameras may be preferred over IR devices for economic and medical reasons. On the other hand, it is not ideal if the system is only good at detecting blinks in dark conditions but inaccurate in brighter environments. Therefore, a reliable system that detects blinks regardless of lighting conditions is needed.
As mentioned, the appearance-based gaze estimation technique adopts a deep neural network approach, and the eye blink estimation step is usually applied before the gaze estimation step. Considering that both steps have many possibilities to share useful information between neural networks, it would be natural to choose deep neural networks (DNNs) to implement effective blink estimation that meets all the requirements discussed so far.
The recent great success of artificial intelligence (AI) stems from the fact that rich public datasets, on which DNNs can be trained, are easily accessible. However, we need to overcome the following dataset issues for training and testing blink estimation.

Different Brightness Levels in Datasets:
Most publicly available datasets have human faces in normal lighting conditions. As a result, when generic DNNs are trained on those public datasets with normal brightness levels, they do not readily detect blinks in test images captured in dim lighting conditions.

Racial Bias in Datasets:
In real-world consumer applications, it is very likely that practitioners will have different racial distributions in the images between the training dataset and the test (target) dataset. This problem becomes more evident in target countries or regions with homogeneous ethnic groups.
Data Imbalance for Classification:
Most blink datasets have significantly more open-eye images than closed-eye images. As a consequence, deep learning models can be skewed to one side because the samples of the two classes are not evenly represented.
To address these issues, we propose and discuss several strategies in this paper. For racial bias and poor lighting conditions in datasets, we apply domain generalization based on the domain adversarial training scheme. Another possible approach would be to first adjust the brightness of dark images and then address racial bias by applying domain generalization separately. However, this "two-step" approach requires additional image processing procedures, and simply increasing the brightness of a dark image tends to introduce image noise that degrades classification performance. We avoid the two-step approach, solving both problems in a single domain generalization scheme, which makes our method suitable for mobile applications. We also suggest a practical guideline for determining the threshold, derived from our experiments, to mitigate the data imbalance issue.

A. Blink Detection
Eye blink images can be captured by either a near-infrared (IR) camera or a regular RGB camera for blink detection. When capturing IR images, the IR camera and IR light source are typically placed close to the human eye to capture high-resolution images and provide better performance even in poor external lighting conditions. However, as discussed earlier, not only are IR cameras expensive, but IR sources are also claimed to cause eye damage, due to the close distance between the light source and the eye, in [17], [18], [19], [20]. Therefore, consumer-grade RGB cameras have a higher potential for safe blink detection.
Blink detection methods can also be divided into two categories: one exploits multiple image sequences from a video stream, and the other analyzes only a single image. In general, single-frame methods are reported to have faster processing times and lower computational costs [21]. Some blink detection studies based on multiple frames employ LSTM and RNN models to exploit features across time series [11], [22], [23]. On the other hand, some studies focus on classical image processing techniques without relying on deep learning. Some methods extract and classify eye regions using classical techniques, including SIFT and HOG, to detect blinks [24], [25]. Others determine the degree of eye closure based on the eye aspect ratio (EAR) computed from eye landmarks [26], [27]. Unfortunately, those approaches tend to be vulnerable to changes in face angle or skin color. Also, since the extracted eye landmarks differ for each person, different thresholds are required per person. Recently, interest in CNN-based blink detection techniques that can exploit rich eye image datasets has been increasing [10], [21], [28], [29], [30]. Researchers have proposed various CNN-based methods, such as a two-way approach that splits the input images into two streams to extract features for eye detection [29], and a curriculum-learning-based approach [12].
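As an illustration of the EAR approach mentioned above, a minimal sketch follows. The landmark ordering assumes the common six-point convention (p1/p4 as horizontal corners, p2/p6 and p3/p5 as vertical pairs); this is an illustrative reimplementation, not the code of [26], [27], and the sample landmark sets are fabricated for demonstration.

```python
import numpy as np

def eye_aspect_ratio(landmarks):
    """Eye aspect ratio (EAR) from 6 eye landmarks.

    p1/p4 are the horizontal eye corners; (p2, p6) and (p3, p5) are the
    two vertical landmark pairs. Values near 0 suggest a closed eye.
    """
    p1, p2, p3, p4, p5, p6 = np.asarray(landmarks, dtype=float)
    vertical = np.linalg.norm(p2 - p6) + np.linalg.norm(p3 - p5)
    horizontal = np.linalg.norm(p1 - p4)
    return vertical / (2.0 * horizontal)

# Toy landmark sets: a "tall" open eye and a nearly flat closed eye.
open_eye = [(0, 0), (1, 2), (2, 2), (3, 0), (2, -2), (1, -2)]
closed_eye = [(0, 0), (1, 0.1), (2, 0.1), (3, 0), (2, -0.1), (1, -0.1)]
```

As the text notes, the EAR value at which an eye counts as "closed" varies per person, which is one reason learned features are preferred here.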
In this paper, we design a CNN-based model that detects eye blinks in a single image, inspired by [10] and other recent studies. Based on the single-frame approach, our model provides a fast inference rate suitable for mobile devices and high blink detection performance in various environments.

B. Data Imbalance
Data imbalance occurs when there are large differences in the amount of data between classes. One of the common issues with blink detection is that there are often far fewer images with eyes closed than with eyes open. Such imbalanced datasets cause neural networks to be biased towards the prevalent (majority) class [31], [32]. Thabtah et al. [33] have also shown that evaluation measures such as precision and recall change with data imbalance. One approach to addressing data imbalance is preprocessing, usually via undersampling or oversampling [34]. Undersampling removes some majority-class samples to restore balance, but runs the risk of losing information about the majority class. Conversely, oversampling increases the number of minority-class samples by synthesizing new samples or augmenting existing ones. Some literature, including [35], [36], argues from experiments that oversampling provides more accurate classification than undersampling. However, we have found that oversampling does not always work in our cases, as we will discuss in a later section. Researchers have also worked on how to determine practical thresholds for binary classification problems with data imbalance. Provost [37] discusses the intricacies involved in classifying imbalanced datasets and suggests adjusting the output threshold. To analyze the interplay between classifiers, data imbalance, and thresholds, various approaches have been proposed, including the ROC convex hull method [38] and cost curves [39].
In this paper, we present our approach to finding an optimal threshold, and report test results for both undersampling and oversampling across multiple imbalance ratios, treating blink detection as a binary classification problem.
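The threshold-selection idea can be sketched as follows: sweep candidate thresholds over the predicted blink probabilities and keep the one that maximizes the F1-score on the minority (closed-eye) class. The function name and the exhaustive sweep are our own illustrative choices, not the paper's exact procedure.

```python
import numpy as np

def optimal_f1_threshold(scores, labels):
    """Return the decision threshold that maximizes F1 on class 1.

    scores: predicted probabilities in [0, 1]; labels: 0/1 ground truth.
    Every distinct score is tried as a candidate threshold.
    """
    best_t, best_f1 = 0.5, -1.0
    for t in np.unique(scores):
        pred = scores >= t
        tp = np.sum(pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        fn = np.sum(~pred & (labels == 1))
        if tp == 0:
            continue  # F1 undefined/zero without true positives
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```

On imbalanced data this threshold is typically far from the default 0.5, which is the motivation for reporting an explicit operating point.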

C. Domain Generalization
In machine learning, researchers and practitioners frequently encounter domain shifts, defined as differences in distribution between training and test datasets. To address domain shift, two methods are usually applied: domain adaptation and domain generalization. Domain adaptation trains models to reduce domain shift by learning the distribution difference between source and target domains. While domain adaptation uses both the source and target domains, domain generalization uses only the source domains to generalize to a target domain that may be out-of-distribution. Therefore, when obtaining information on the target domain is impossible or too expensive, the domain generalization approach is preferred. Common strategies for domain generalization include data augmentation, domain alignment, meta-learning, and ensemble learning [40]. Among them, we apply data augmentation and domain alignment to our blink detection for domain generalization. Data augmentation is commonly used to avoid overfitting and improve generalization performance. For image data, datasets are usually augmented using image transformations, including random flips, rotations, and brightness and contrast modifications. However, while image transformations help enrich datasets with different brightness levels or skin tones, they cannot create plausible variations of some meaningful features, such as individual eye shapes and skin textures. Recent approaches for domain alignment aim to align domains by reducing the means and variances of the distributions of transformed features across domains [41]; by considering KL divergence [42], [43]; or by applying adversarial learning [44], [45], [46].
In particular, domain adversarial training is a min-max game in which the discriminator is optimized to distinguish between domains while the feature extractor is trained to extract domain-agnostic features that prevent the discriminator from differentiating domains. Some studies extend domain adversarial training to multi-source domains [47], [48]. Other studies report that when domain labels and class labels are dependent, domain adversarial training can degrade classification performance [49]. The AFLAC network can learn domain-invariant features without interfering with the classification task [48].
Inspired by the AFLAC network, we use an adversarial network to train our model, but add regularization terms to reduce the effect of the discriminator loss on classification performance.
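The core mechanic of domain adversarial training described above can be illustrated with a minimal gradient reversal sketch in the spirit of DANN [57]: the layer is the identity on the forward pass, but negates (and scales) the gradient flowing back from the discriminator, so the feature extractor is pushed to increase the discriminator loss. The standalone functions below are hypothetical, standing in for what a deep learning framework would implement as a custom autograd layer.

```python
import numpy as np

def grl_forward(features):
    """Gradient reversal layer, forward pass: plain identity."""
    return features

def grl_backward(grad_from_discriminator, lam=1.0):
    """Backward pass: multiply the incoming gradient by -lam, so the
    feature extractor ascends the discriminator loss instead of
    descending it."""
    return -lam * np.asarray(grad_from_discriminator)
```

In a real framework this pair would be registered as one differentiable operation; the split here just makes the two passes explicit.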

IV. EYE BLINK DATASETS
Deep neural networks usually need as much training data as possible to guarantee performance. For a fair comparison, we select RT-BENE [10], UnityEyes [50], and Eyeblink8 [51] as experimental datasets. We also prepare and use our own dataset ("Blinks in the Dark", or BID), extracted from video clips of golfers' eye blink moments in a golf driving range under poor lighting conditions (Figures 3 and 4). We chose the golf driving range as the data collection site because it allows us to simulate real-world situations more safely than other candidate situations (e.g., blinking while driving a car at night or maneuvering an electric wheelchair based on gaze estimation). Although BID has been collected in the context of an indoor golf driving range, we believe our dataset can be applied to other blink detection contexts and applications.
A. Dataset Details
1) RT-BENE: Cortacero et al. [10] created the RT-BENE dataset by extracting and labelling the eye blink regions from the RT-GENE dataset [52], which was originally prepared for a gaze estimation task. RT-BENE collects 17 subjects without glasses, excluding subjects who wore glasses in RT-GENE. The collected images are categorized into open, closed, and uncertain for the 17 subjects. We apply the same data split criteria as [10] in our experiments. As in [10], we ignore the data from subject 6 and images tagged as uncertain, for a like-for-like comparison. We present the details of the dataset splits (i.e., train, validation, and test sets) in Table I.
2) UnityEyes: Wood et al. [50] have proposed a synthetic method (a.k.a. UnityEyes) to create training data for appearance-based gaze estimation using a game engine. Since this approach is designed for gaze estimation, it cannot directly generate images with closed eyes. To overcome this limitation, we reverse engineered the UnityEyes executable to generate closed-eye images. In our configuration, we set the camera angles to random values from 0 to 30 degrees. We apply a random eyeball pitch angle from 5 to 20 degrees to generate open-eye images and from 40 to 45 degrees to generate closed-eye images, respectively. We also synthesize eye images with the uncertain tag by applying a random eyeball pitch angle from 30 to 40 degrees. Since UnityEyes generates images in which the eyeball is located at the center, we crop a 60 × 36 pixel box around the center that includes the eye and eyebrow.
3) Eyeblink8: Eyeblink8 [51] is a dataset with 70,992 frames at 640×480 resolution, capturing people sitting and behaving naturally in front of the camera. Because the images were captured under natural conditions, the number of closed-eye images is very small compared to the number of open-eye frames, as shown in Table I. Also, Fogelton and Benesova [53] have pointed out that the Eyeblink8 dataset may include labeling mistakes. Therefore, we preprocess the eye images in the Eyeblink8 dataset again, using the same method as our BID dataset preprocessing. We selected videos 1, 3, 8, and 10 (by folder name); each video features a different subject. We prepare them into 60×36 pixel eye images using MediaPipe and the face normalization method in [54]. A total of 44,319 cropped images are labeled and classified into 43,225 open images and 1,094 closed images.

4) Our Dataset [Blinks in the Dark (BID)]:
Gaze estimation performance for blurred or dark images tends to be poor [55]. From this, it is reasonable to assume that blink detection performance also deteriorates in blurry or dark images. As discussed, the ability to detect blinks in dark images is directly related to user safety in some applications (e.g., drowsy driving detection at night). Therefore, it is necessary to guarantee blink estimation performance even in dark or backlit environments.
In order to measure the performance change in dark images, we have collected our own dataset, Blinks in the Dark (BID), with a relatively long distance between the camera and the face in an indoor golf driving range (Figure 3). Our subjects consist of 11 males and 2 females, aged from 21 to 39 with an average age of 29. Four of them wear glasses. All subjects are Asian except one female Caucasian, which means the BID dataset has a significant racial bias (12:1), as commonly occurs in the user distributions of some East Asian countries. We have collected eye images with various face angles and actions, based on three scenarios in which the subject turns their head and blinks differently during a golf swing. First, the subjects turned their heads with their eyes open. Second, the same subjects turned their heads with their eyes closed. Finally, the same subjects blinked three times without turning their heads. Among subjects 1 to 13, we excluded subject 7 from the training and test sets due to poor image quality (see Table VI). We recorded video at 1440×1080 resolution and 50 frames per second (FPS). The distance between the face and the camera is larger than 1 meter. After recording, we extracted face landmarks and the 6 landmarks corresponding to each eye using MediaPipe [56]. Because the eye images vary with face yaw and pitch, we used the normalization method proposed in [54], which provides canonical eye images by removing various parameters, including head rotation and eye-to-camera distance. We created a bounding box containing the 6 landmarks of each eye and cropped the face image to a region scaled ×1.5 vertically and horizontally. Then we resized the cropped image to 60×36 pixels. We used both eyes by flipping the right-eye image to the left.

B. Statistics of Datasets
The four datasets have very diverse characteristics, including skin color, brightness, and sharpness (Figure 6). To measure the brightness of an image, we convert the RGB image into an HSV image and average the brightness (V) values of all pixels. To obtain the sharpness value, we convert the RGB image to grayscale and pass it through a Laplacian filter, which emphasizes edges in the image; we then compute the sharpness as the variance of all filtered pixel values. Table II summarizes the mean and standard deviation of brightness and sharpness for each dataset, and Figure 5 shows their density functions. Mean brightness and mean sharpness are highest in the UnityEyes dataset and lowest in the BID dataset. This is because the BID dataset was captured with backlighting in an indoor environment, while the UnityEyes dataset was synthesized and rendered under ideal lighting conditions using a game engine. The RT-BENE and Eyeblink8 datasets have higher mean brightness than the BID dataset because they were collected in natural real-world environments. However, the mean sharpness of the RT-BENE dataset is lower than that of the Eyeblink8 and UnityEyes datasets, because the RT-BENE dataset was captured at a longer camera-to-subject distance, as described in [10]. Table I lists the number and proportion of open- and closed-eye images for each dataset. As shown in Table I, Eyeblink8 is the most imbalanced (39.51:1) dataset and UnityEyes is the most balanced (1:1).
Fig. 6. Images from the datasets. Top to bottom: RT-BENE, UnityEyes, Eyeblink8, BID, BID with improved brightness. The last row is shown here for reference only and is not used for training and testing.
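The brightness and sharpness measurements described above can be sketched as follows. A plain channel maximum is used for the HSV value channel and a plain channel mean for grayscale conversion; the 3×3 Laplacian kernel and the restriction to the valid interior region are our illustrative choices.

```python
import numpy as np

def brightness(rgb):
    """Mean of the HSV value channel: V = max(R, G, B) per pixel."""
    return rgb.max(axis=-1).mean()

def sharpness(rgb):
    """Variance of the Laplacian-filtered grayscale image."""
    gray = rgb.mean(axis=-1)
    k = np.array([[0, 1, 0],
                  [1, -4, 1],
                  [0, 1, 0]], dtype=float)
    h, w = gray.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(3):            # small manual 2-D convolution
        for j in range(3):
            out += k[i, j] * gray[i:i + h - 2, j:j + w - 2]
    return out.var()
```

A flat image gives zero sharpness, while strong edges (as in crisp UnityEyes renders) drive the variance up, matching the ordering reported in Table II.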

V. METHOD
Our goal is to improve the overall performance of blink estimation in the target (test) domain, which has a different distribution than the source (training) domains. To perform reliable classification on an unseen target domain, a feature extractor needs to extract domain-invariant features that do not contribute to distinguishing individual domains. To this end, we propose the Blink Estimation with Domain Adversarial Training (BEAT) network, a model that can improve performance on unseen target blink datasets. The BEAT network is inspired by the DANN [57] and AFLAC [48] networks. Although Akuzawa et al. [48] argue that domain adversarial training can hurt classifier performance when domains and classes are not independent, we find that domain adversarial training can actually help learn domain-invariant features. From these observations, we combine the ideas of DANN and AFLAC in BEAT. As depicted in Figure 7, BEAT consists of a feature extractor, a blink classifier, and a domain discriminator, which can be formulated as

f = F(I; θ_f), ĉ = C(f; θ_c), d̂ = D(f; θ_d),

where I ∈ R^(60×36×3) is an eye image from a source domain S; F is the feature extractor with parameters θ_f; C is the blink classifier with θ_c; and D is the domain discriminator with θ_d. The feature extractor extracts a feature vector f ∈ R^36 that is fed as input to both the classifier (C) and the discriminator (D).

A. Optimization
1) Classification Loss: Since the blink datasets are highly imbalanced, we adopt the focal loss

L_cls = −α (1 − ĉ)^γ c log(ĉ) − (1 − α) ĉ^γ (1 − c) log(1 − ĉ),

where ĉ is the classifier output and c is the ground truth classification label. The classifier and the feature extractor minimize this loss to distinguish open-eye and closed-eye images. We use hyperparameters α = 0.5 and γ = 2 for all experiments.
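The classification loss can be sketched with the binary focal loss in its standard form (Lin et al.), using the α = 0.5 and γ = 2 values quoted above; the mean reduction over samples and the clipping constant are our assumptions.

```python
import numpy as np

def focal_loss(c_hat, c, alpha=0.5, gamma=2.0, eps=1e-7):
    """Binary focal loss: down-weights easy, well-classified examples
    so hard (often minority-class) examples dominate the gradient.

    c_hat: predicted blink probabilities; c: 0/1 ground-truth labels.
    """
    c_hat = np.clip(c_hat, eps, 1 - eps)           # numerical safety
    pos = -alpha * (1 - c_hat) ** gamma * c * np.log(c_hat)
    neg = -(1 - alpha) * c_hat ** gamma * (1 - c) * np.log(1 - c_hat)
    return np.mean(pos + neg)
```

With γ = 0 and α = 0.5 this reduces to half the usual binary cross entropy, which is a convenient sanity check.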
2) Adversarial Loss: To extract domain-invariant features, we apply the domain adversarial training method proposed in [57]. The key idea is that the discriminator and the feature extractor play a zero-sum game: the discriminator separates features according to domains, while the feature extractor is optimized not to produce domain-discriminating features. For the discriminator loss, we adopt the cross-entropy loss

L_adv = −Σ_i d_i log(d̂_i),     (5)

where d̂_i represents the predicted probability of domain i, and d_i ∈ {0, 1} indicates whether the image is from domain i. The discriminator is trained to distinguish domains by minimizing Equation (5), while the feature extractor is trained to extract domain-invariant features by maximizing the same equation. This allows the model to ignore information irrelevant to the main classification task and improves generalization performance.
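The discriminator loss can be sketched as follows, assuming softmax outputs over domains and one-hot domain labels; the clipping constant is our addition for numerical stability.

```python
import numpy as np

def domain_ce_loss(d_hat, d_onehot, eps=1e-7):
    """Multi-domain cross entropy, averaged over a batch.

    d_hat: rows of softmax probabilities over domains;
    d_onehot: one-hot ground-truth domain labels.
    The discriminator minimizes this; the feature extractor, through
    the adversarial term, effectively maximizes it.
    """
    logp = np.log(np.clip(d_hat, eps, 1.0))
    return -np.sum(d_onehot * logp, axis=-1).mean()
```

A perfectly confident, correct prediction yields zero loss, while a uniform prediction over D domains yields log(D), the discriminator's "maximally confused" value.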
3) KL-Divergence Loss: Akuzawa et al. [48] have proposed a domain generalization method using the KL divergence. Inspired by this, we adopt the KL-divergence loss

L_KL = D_KL( p(d|c) ‖ d̂ ),

where p(d|c) denotes the conditional probability of the domain label d given the classification label c. Akuzawa et al. [48] have proved that the worst case for the discriminator, in which it cannot distinguish between domains, occurs when the entropies H(d) and H(p(d|c)) are equal. Because the feature extractor should not discriminate domains, it is trained to minimize the KL-divergence loss, reducing the distribution difference between d̂ and p(d|c). AFLAC [48] uses only the KL-divergence loss, whereas the BEAT network linearly combines the adversarial loss and the KL-divergence loss.

4) Objective Functions: The objective functions are

θ̂_d = argmin_{θ_d} L_adv,     (8)
(θ̂_f, θ̂_c) = argmin_{θ_f, θ_c} L_cls − λ_adv L_adv + λ_KL L_KL,     (9)

where λ_adv is a hyperparameter for adversarial training and λ_KL is for the KL-divergence loss. In Equation (8), the discriminator is optimized to minimize the discriminator loss, while in Equation (9) the feature extractor is optimized to maximize it.

B. Feature Extractor Details
The dilated convolution operation [58] has the advantage of extending the receptive field without lowering the resolution of the input image [58], [59]. Chen and Shi [59] have constructed a network (DilatedNet) based on the dilated convolution to extract features robust to eye shape changes in gaze estimation tasks. Inspired by this, we set the DilatedNet as a baseline model for the feature extractor.
The vanilla DilatedNet consists of a convolution stage and a dilated convolution stage. The original convolution stage has four convolution layers; to reduce computation, we modify the first and last of them into depth-wise convolution layers. In the dilated convolution stage, we use four dilated convolution layers identical to the configurations in [59], but we change the dilation rates to (2,3), (2,3), (2,4), (2,4) due to the size difference of the input image. See Table III for details.

C. Blink Classifier Details
The blink classifier predicts the blink probability from the extracted features. The architecture details of the blink classifier are described in Table IV. Batch normalization, leaky ReLU, and dropout layers are added between layers. The last layer is a sigmoid function that outputs the blink probability.

D. Domain Discriminator Details
The domain discriminator predicts which domain the input image comes from. The ground-truth domain is labeled as a one-hot vector. The last layer of the discriminator is a softmax function that predicts which of the multiple domains the image most likely belongs to. Through hyperparameter tuning experiments, we have found the optimal values of λ_adv and λ_KL to be 0.01 and 1, respectively (see Tables XI and XII).

1) Lambda Scheduler (Sc):
We use a scheduler for λ_adv, based on Ganin et al.'s finding [57] that scheduling makes the feature extractor less sensitive to noisy datasets during the early training epochs. λ_adv is defined as

\lambda_{adv}(k) = \lambda_0 \left( \frac{2}{1 + e^{-\sigma k}} - 1 \right) \hfill (10)

where k denotes the number of epochs. The scheduler ramps λ_adv from 0 toward λ_0. As shown in Table XI, the best AUPR is achieved when λ_adv is 0.01. We apply σ = 0.5 and λ_0 = 0.01 in the experiments.
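As a sketch, the schedule can be implemented in a few lines, assuming the sigmoid-shaped ramp of Ganin et al. [57] with the σ and λ_0 values above; the function name is ours:

```python
import math

def lambda_adv(k, lam0=0.01, sigma=0.5):
    """Adversarial weight schedule: 0 at epoch 0, ramping toward lam0."""
    return lam0 * (2.0 / (1.0 + math.exp(-sigma * k)) - 1.0)

print(lambda_adv(0))   # → 0.0 (no adversarial signal at the start)
print(lambda_adv(50))  # ~0.01 (schedule saturated at lam0)
```

Early in training the feature extractor therefore optimizes almost purely for classification, and the adversarial pressure grows only once the features are reasonably stable.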
2) Gradient Decay (GD) Layer: While experimenting with the values of λ_adv listed in Table XI, we found that classification performance degrades as λ_adv increases. If the model is trained with an excessive discriminative loss, the classification loss plays little role in training, which degrades classification performance. To prevent the discriminative loss from overwhelming the classification loss, we adopt a gradient decay layer that regularizes the gradient transferred from the discriminator to the feature extractor:

\delta_f = t \tanh\!\left( \frac{\delta_d}{t} \right) \hfill (11)

where t is a scale factor, δ_f is the gradient at the last layer of the feature extractor, and δ_d is the gradient at the first layer of the discriminator connected to the feature extractor. The gradient decay layer (Equation (11)) is tanh-shaped: δ_f converges to the scale factor ±t as δ_d → ±∞, so the gradient values cannot diverge. Given that adversarial training tends to be unstable due to unbounded gradients, we believe this gradient regularization improves the stability of the adversarial training. We use the scale factor t = 4 in the experiments.
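A minimal sketch of the gradient decay operation, assuming the tanh-shaped soft clip δ_f = t·tanh(δ_d/t) with the stated ±t limit (the function name is ours; in a real framework this would be applied in the backward pass):

```python
import math

T = 4.0  # scale factor t used in the experiments

def gradient_decay(delta_d, t=T):
    """Soft-clip an incoming discriminator gradient: near-identity for
    small gradients, bounded by +/- t for large ones (tanh-shaped)."""
    return t * math.tanh(delta_d / t)

print(gradient_decay(0.5))    # ~0.497: small gradients pass almost unchanged
print(gradient_decay(100.0))  # ~4.0: large gradients are capped near t
```

Unlike a hard clip, the soft clip keeps the mapping smooth and differentiable, so small gradients are left essentially intact while outliers are bounded.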

VI. EXPERIMENTS
We test the ideas discussed so far to choose the best configuration with the smallest classification loss. For our experiments, we use one RTX 2070Ti GPU, taking an average of 5 hours per experiment. During training, the batch size is set to 256 and the Adam optimizer (β1 = 0.9, β2 = 0.999) is used. We adopt a warm-up cosine annealing scheduler with a learning rate ranging between 10^-6 and 10^-4. For the loss functions, we apply the cross-entropy loss for the domain discriminator and the focal loss with a label smoothing value of 0.1 for the blink classifier.
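For illustration, a warm-up cosine annealing schedule between 10^-6 and 10^-4 can be sketched as follows; the warm-up length and function name are hypothetical choices for the example, not values from our experiments:

```python
import math

def lr_at(step, total_steps, warmup_steps, lr_min=1e-6, lr_max=1e-4):
    """Warm-up cosine annealing: linear ramp to lr_max over warmup_steps,
    then cosine decay back down to lr_min by total_steps."""
    if step < warmup_steps:
        return lr_min + (lr_max - lr_min) * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))
```

The schedule peaks at lr_max exactly when the warm-up ends and decays smoothly afterwards, which avoids both the instability of a large initial learning rate and an abrupt drop later in training.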

A. Interpretation of Experiment Results
The tables in this section summarize the main experiment results measured on the test dataset shown in Table VI. Since the test dataset is statistically independent of the training and validation datasets, the results presented in this section can be regarded as a statistically rigorous case study. Table X compares the performance of our blink detection method. Each column of Table X lists the results for three training datasets (source domains) and one test dataset (target domain). For example, the header of the second column, "D_R, D_B, D_U → D_E", means that we have trained on the source datasets D_R, D_B, and D_U and tested our method on the target dataset D_E. As Table X shows, the three versions of the BEAT network are superior in most cases, except where the UnityEyes dataset D_U is the target domain. However, since the UnityEyes dataset is a synthetic dataset created with a game engine for training purposes only, it is not practical to use it as target or test data. Table X also shows the performance of our method when the network is trained on images from good lighting conditions (D_R, D_U, D_E) and independently tested on blink classification for images from a dark lighting condition (D_B). As shown in the third column, our method successfully detects blinks with the highest AUROC and AUPR under dark lighting conditions (Figs. 4 and 8).

B. Baseline Models for Feature Extractor
In order to evaluate and determine the most suitable baseline model for the feature extractor in BEAT, we compare different structures on one common dataset. For this purpose, we use the RT-BENE dataset for training and evaluation. Table V lists the comparison results for our DilatedNet-based model [58], ResNet50 [60], MobileNetV2 [61], DenseNet121 [62], and DenseNet121 with ensemble models [10]. For an unbiased comparison with other reported methods, pretrained weights from the ImageNet dataset are applied to ResNet50, MobileNetV2, and DenseNet121. Table VI shows the RT-BENE subjects that we have assigned for a fair comparison with [10].
To measure the precision and recall in the experiments, we set the threshold to 0.5. However, a threshold of 0.5 may not be a proper choice because the datasets are imbalanced. Therefore, we choose the AUROC and the AUPR as the primary metrics. As shown in Table V, our design (DilatedNet) scores the highest in the AUPR and achieves the fastest inference speed, which is highly demanded in mobile applications. In detail, our DilatedNet-based design is approximately 28.4% better in precision, 6.38% better in recall, 18.6% better in the AUPR, and 12.8 times faster than [10]. The model proposed in [12] is also compared with ours under equal conditions, using only its augmentation method without curriculum learning, for an unbiased comparison. As a result, our DilatedNet-based design scores 1.57% higher in the AUPR than [12]. Even though the authors of [12] used a better GPU (NVIDIA Titan V), our inference speed is 2.86 times faster. Based on these results, we claim that our feature extractor achieves state-of-the-art performance in terms of AUPR and inference speed on the RT-BENE dataset.
1) Undersampling Test: To deal with the class imbalance, we have tried the undersampling technique [63] on our feature extractor, using a random subset of the majority class at ratios of 1:1, 1:5, 1:10, and 1:15 on the RT-BENE dataset. Table VII summarizes the results according to the undersampling ratio. Note that the unsampled ratio of closed-eye to open-eye images in the original RT-BENE dataset is 1:23.
As shown in Table VII, recall increases as the sampling ratio approaches 1:1, and precision increases as the sampling ratio (open to closed) increases. Note that the unsampled case scores highest on both the AUROC and the AUPR. We conjecture that this is because undersampling removes majority-class samples that carry information important for classification.
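A minimal sketch of the undersampling procedure above, assuming the samples of each class are referenced by index lists (function and variable names are hypothetical):

```python
import random

def undersample(open_idx, closed_idx, ratio, seed=0):
    """Keep all minority (closed-eye) samples and a random subset of the
    majority (open-eye) class, at `ratio` majority per minority sample."""
    rng = random.Random(seed)  # fixed seed for a reproducible subset
    n_keep = min(len(open_idx), ratio * len(closed_idx))
    return rng.sample(open_idx, n_keep) + list(closed_idx)
```

At a 1:1 ratio the resulting set is balanced but discards most open-eye samples, which is consistent with the information loss we conjecture above.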
2) Oversampling Test: We have tried two oversampling methods to increase the number of minority-class samples. Table VIII shows the results of the oversampling test using our feature extractor. All oversampled (i.e., augmented) datasets have lower AUPR values than the raw dataset.
3) Optimal Threshold: Through the undersampling and oversampling tests, we have learned that neither approach always improves binary classification performance. An important caveat for these tests is that the precision and recall in Tables VII and VIII are based on a threshold of 0.5, which may not be suitable for imbalanced classes. Therefore, we define an optimal threshold \hat{T} based on the F1-score for varying sampling ratios in practical blink estimation applications as

\hat{T} = \arg\max_{T} \; \mathrm{F1}(T) \hfill (12)

where T is a threshold. From Equation (12), we find that the optimal threshold for the raw RT-BENE dataset is 0.4598 (see Table IX). The evaluation metrics recalculated at this threshold are a precision of 0.9307, a recall of 0.8409, and an F1-score of 0.8835. As shown in Table IX, the optimal threshold increases the F1-score and reduces the gap between precision and recall.
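Equation (12) amounts to a one-dimensional search over candidate thresholds. A minimal sketch, scanning the predicted scores themselves as candidates (function names are ours, for illustration):

```python
def f1_at(scores, labels, threshold):
    """F1-score when predicting blink for scores >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def optimal_threshold(scores, labels):
    """Candidate threshold maximizing the F1-score (Equation (12))."""
    return max(set(scores), key=lambda t: f1_at(scores, labels, t))
```

Because F1 balances precision and recall, the resulting threshold tends to sit below 0.5 on datasets where the positive (closed-eye) class is rare, matching the 0.4598 value reported above.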

C. Domain Generalization Performance of BEAT
We evaluate the domain generalization performance of BEAT by selecting one domain dataset as the target domain and the other datasets as source domains. We train the BEAT network with the adversarial lambda scheduler (BEAT+Sc) and with both the scheduler and the gradient decay layer (BEAT+Sc+GD). For comparison, the baseline network without adversarial training, DANN [57], and AFLAC [48] are also evaluated. Table X shows that the BEAT+Sc+GD network achieves better AUROC and AUPR values than the baseline network for all target and source domain combinations, except when the UnityEyes dataset is the target domain. More specifically, the BEAT+Sc+GD network improves the AUROC and the AUPR by 2.99% and 50.24% on Eyeblink8, 7.21% and 4.47% on BID, and 2.14% and 23.76% on RT-BENE, respectively.
It is interesting that the baseline network scores higher in the AUROC and AUPR than the other networks, including DANN and AFLAC, when the UnityEyes dataset is set as the target domain. The results demonstrate that the decision boundary learned from the RT-BENE, BID, and Eyeblink8 datasets can discriminate well between open- and closed-eye images, even without the domain adversarial training that produces domain-invariant features.
From the results, we speculate that, since the UnityEyes images contain little noise (and therefore less domain-specific information), a model trained on the other (source) datasets can easily detect eye-shape features in the (target) UnityEyes dataset without adversarial training. However, the last column (i.e., D_R, D_B, D_E → D_U) describes a scenario that seldom happens in practice because D_U (UnityEyes) is intended for training, not testing, as discussed.
In summary, as shown in Table X, our BEAT+Sc+GD performs better than other methods, except in rare cases where synthetic datasets are tested as targets.
D. Hyperparameter Tuning
1) Adversarial Parameter: In order to find an optimal value for the adversarial parameter λ_adv in Equation (9), we have trained the BEAT network without applying the KL-divergence loss. In this configuration, the feature extractor considers only the classification and adversarial losses and is equivalent to the DANN network. We have tried λ_adv = 0.001, 0.01, 0.1, 1, and 10 under the same conditions depicted in Table I. As a result, we have found that λ_adv = 0.01 performs best (see Table XI).
2) KL-Divergence Parameter: We have also tested to find an optimal λ_KL in Equation (9). To evaluate the effect of λ_KL more accurately, we have trained the BEAT network by optimizing the feature extractor without the adversarial loss, using the RT-BENE, BID, and UnityEyes datasets as source domains and the Eyeblink8 dataset as the target domain. We have evaluated the performance with λ_KL = 0.01, 0.1, 1, and 10 and achieved the highest average AUPR with λ_KL = 1, as shown in Table XII.

E. Ablation Study
1) Lambda Scheduler: We have conducted an ablation study for the lambda scheduler (Sc). Note that the results in the sixth row (BEAT) of Table X are based on a constant λ_adv = 0.01 without the lambda scheduler. We have evaluated the lambda scheduler by ramping λ_adv from 0 to 0.01; the results are shown in the seventh row (BEAT+Sc). Although the intended purpose of the scheduler is to stabilize gradients, the results show that the lambda scheduler even improves the AUROC and the AUPR in some target domains: by 0.07% and 6.22% on Eyeblink8, and by 9.18% and 9.72% on BID, respectively. However, the scheduler degrades the AUROC and the AUPR by 1.48% and 2.53% on the RT-BENE dataset, and by 1.08% and 0.87% on the UnityEyes dataset, respectively.
2) Gradient Decay Layer: We have also conducted another ablation test to examine the effect of the gradient decay (GD) layer described in Equation (11). As shown in the eighth row (BEAT+Sc+GD) of Table X, the BEAT+Sc+GD combination outperforms the BEAT+Sc combination in AUPR by 2.38% on Eyeblink8, 0.47% on BID, and 3.09% on RT-BENE, with the exception of the UnityEyes target dataset. The results show that the gradient decay layer also helps to improve generalization performance.

VII. CONCLUSION
Our network for Blink Estimation with domain Adversarial Training (BEAT) robustly detects eye blinks in unseen out-of-sample images captured even under poor lighting conditions in a variety of consumer applications. BEAT generalizes across domains by extracting domain-invariant features through adversarial training and the KL-divergence loss. We also add a gradient decay layer that regularizes gradients for stable domain adversarial training. Based on the experiments, we conclude that our approach achieves better performance than DANN [57] and AFLAC [48] on unseen target domains.
The proposed feature extractor based on DilatedNet applied to BEAT achieves state-of-the-art performance in terms of AUPR and high inference speed on the RT-BENE dataset. We also experimentally determine the optimal threshold applicable to the RT-BENE [10] dataset.
Based on the improved classification performance and inference efficiency, we believe BEAT is suitable for a wide variety of consumer applications, including safety-critical ones, where robust blink detection is required even on mobile devices.