Bias Remediation in Driver Drowsiness Detection systems using Generative Adversarial Networks

Datasets are crucial when training a deep neural network. When datasets are unrepresentative, trained models are prone to bias because they are unable to generalise to real-world settings. This is particularly problematic for models trained in specific cultural contexts, which may not represent a wide range of races, and thus fail to generalise. This is a particular challenge for driver drowsiness detection, where many publicly available datasets are unrepresentative as they cover only certain ethnicity groups. Traditional augmentation methods are unable to improve a model's performance when tested on other groups with different facial attributes, and it is often challenging to build new, more representative datasets. In this paper, we introduce a novel framework that boosts the performance of drowsiness detection for different ethnicity groups. Our framework improves a Convolutional Neural Network (CNN) trained for prediction by using Generative Adversarial Networks (GANs) for targeted data augmentation, based on a population bias visualisation strategy that groups faces with similar facial attributes and highlights where the model is failing. A sampling method selects faces where the model is not performing well, and these are used to fine-tune the CNN. Experiments show the efficacy of our approach in improving driver drowsiness detection for under-represented ethnicity groups. Here, models trained on publicly available datasets are compared with a model trained using the proposed data augmentation strategy. Although developed in the context of driver drowsiness detection, the proposed framework is not limited to this task and can be applied to other applications.


I. INTRODUCTION
The ability of Artificial Intelligence (AI) systems to automate decision-making in daily human life is increasing rapidly. As a result, these systems influence human interaction with the real world and are transforming the future. Their decision-making capabilities typically rely on large training datasets, from which useful patterns are learned and extracted in an automated way. Unfortunately, if these systems are trained on datasets that do not have a complete representation of real-world scenarios, they may be prone to bias and may prejudice society.
One application where the use of deep learning techniques is increasingly gaining popularity, and in which unrepresentative training datasets can lead to negative consequences, is driver drowsiness detection. One of the primary objectives of the motor industry is passenger safety. Road related accidents are a primary cause of injuries and death among the human population [1]. Among the factors leading to accidents, driving while drowsy is of particular concern. This has prompted the automobile industry to make efforts to develop detection systems that improve driver safety. Monitoring driver behaviour through computer vision and machine learning techniques that detect drowsiness and warn the driver is an increasingly popular technique under investigation by the motor industry. Statistics reveal that a high rate of accidents is caused by drowsy drivers, and that 20% of serious accidents arise from a failure of the driver's judgement and an inability to control the vehicle in the drowsy state [2]. In addition, the World Health Organisation (WHO) reports that deaths arising from road traffic crashes increased to 1.35 million in 2018 [3]. The WHO report further shows that nearly 3,700 people die on roads every day. This is a particular concern in Africa, which has only 2% of the world's cars, but has the highest accident rates and accounts for about 20% of road deaths [4]. These are alarming findings, which urgently need to be addressed. However, the development of robust driver drowsiness detection systems is still a challenge in both academia and industry.
In the automobile industry, several attempts have been made to monitor driver's state over time by considering various methods such as vehicle-based measures, physiological signals, and behavioural measures. Vehicle-based methods make use of car electronics together with appropriate sensors [5]. These sensors are usually placed on the pedals, steering wheels and often include cameras around the car [6]. Unfortunately, these methods mostly rely on the state of the car and the surrounding environment and focus less on the driver's state. On the other hand, physiological methods monitor the driver's state using physiological signals which can be recorded using devices such as Electroencephalogram (EEG), Electrooculogram (EOG), Electrocardiogram (ECG), or Electromyogram (EMG) [7,8,9]. These devices yield accurate results because the readings measure brain activity [10]. However, physiological methods are invasive as they typically require a device to be placed on the driver to record signals. These devices can make the driver uncomfortable and are considered impractical for real-time drowsiness detection systems.
In contrast, behavioural methods are non-invasive methods that make use of a mounted camera to track the face of the driver and measure the level of drowsiness based on facial features. There are numerous facial features that can be used to measure drowsiness from the camera feed including eye state, yawning, and head position [11,12,13]. Behavioural methods can be combined with machine learning techniques to produce more robust systems. A meta-review [14] examined Hidden Markov Models (HMM), CNNs, and Support Vector Machines (SVM) and concluded that CNNs performed better than other techniques, although SVMs were most commonly used. The success of CNNs has been shown in many computer vision tasks including object classification, segmentation, and object tracking [15,16,17]. CNN architectures require a large amount of training data to learn a suitable representation for a given task. Unfortunately, when it comes to driver drowsiness detection, there are a limited number of publicly available training datasets, and some datasets are not published because of security and privacy reasons preventing the publication of people's faces. In addition, publicly available datasets are often unrepresentative as these may not cover a wide variety of ethnicities. For African contexts, this poses a challenge since the population is diverse and individuals can have many different facial attributes. Models trained using publicly available datasets do not generalise well in an African context [18].
The limitations of datasets that fail to cover a wide range of ethnicities lead to bias in trained models when these are applied in contexts with different nationalities. Prior work has shown that visualisation techniques can be used to identify bias in training datasets by identifying population groups where a classifier tends to fail [18]. This paper makes the following contributions in addressing population bias in driver drowsiness training datasets:
• Introduces a novel framework that remedies generalisation failures in under-represented population groups in the training dataset, which boosts the performance of drowsiness detection across all population groups.
• Introduces a sampling algorithm that identifies individuals with facial features where the network is failing.
• Shows how a GAN that generates realistic images can be used to produce training data for those races or individuals where the model is failing.
The framework relies on two primary components, population bias visualisation and a Generative Adversarial Network (GAN). The population bias visualisation groups races by similarity and identifies where the model is failing to generalise. The GAN generates realistic images of individuals (drowsy and awake) in these population groups, which are used for retraining the ResNet model used for drowsiness detection with new parameters (i.e. learning rate and number of epochs) to reduce overfitting and improve the detection accuracy. This process is repeated iteratively until convergence.
This paper is organised as follows. Section II provides an overview and background work, which is followed by a discussion on GANs, population bias and CNN visualisation. Details of our framework are discussed in Section III, along with training information and a description of the datasets used in this paper. Experimental details and results are presented in Section IV. Finally, Section V provides a brief conclusion of our work.

II. BACKGROUND AND RELATED WORK
In this paper, a novel framework that boosts the performance of CNNs for driver drowsiness detection is presented. This is accomplished by highlighting regions where the model is failing and passing similar GAN generated images to the model for retraining. This strategy is based on boosting, where a weak classifier is iteratively re-weighted to make it a strong classifier. In this section, we explain related works on data augmentation and visualisation strategies.

A. Generative Adversarial Networks
Since their introduction in 2014 by Goodfellow et al. [19], GANs have shown great success in many computer vision tasks, including pose guided person image generation [20], domain transfer [21], super-resolution [22], and text to image applications [23]. Many variants of GAN architectures have been developed, such as Wasserstein Generative Adversarial Networks (WGAN) [24], Wasserstein Generative Adversarial Networks with Gradient Penalty (WGAN-GP) [25], and Deep Convolutional Generative Adversarial Network (DCGAN) [26]. These architectures have been proposed to improve the original GAN architecture for various tasks.
The original GAN architecture is composed of two neural networks, namely a generator G and a discriminator D, which are trained by playing a mini-max game against one another. In the case of data augmentation, we utilise the domain shift of an image from one domain to another, here the transfer of awake states to drowsy states. To learn a generator distribution $p_g$ over data $x$, the generator creates a mapping function $G(z; \theta_g)$, parameterised by $\theta_g$, from a prior latent distribution $p_z(z)$ to data space. The discriminator $D(x; \theta_d)$, on the other hand, learns parameters $\theta_d$ to distinguish whether images are from the training data or from the generator. The mini-max game function $V(G, D)$ is expressed as follows:

$$\min_G \max_D V(G, D) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \qquad (1)$$

Unfortunately, the original architecture is limited in that there is no flexibility in generating desired outputs. To overcome this problem, conditional GANs were introduced as an extension that introduces additional information to both networks [27]. This additional information allows the flexibility of producing controllable outputs from the training dataset. The additional information, $y$, is typically a label which is applied to the resulting image; in our case this is the awake or sleepy state. The mini-max objective function from equation (1) is then updated as follows:

$$\min_G \max_D V(G, D) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x \mid y)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z \mid y)))] \qquad (2)$$

In this paper, a controllable GAN is used as a data augmentation technique to balance the training dataset for a driver drowsiness detection task. Data augmentation is a technique that increases the size of a training dataset to reduce the chances of overfitting by the network. In computer vision, the most common way to perform data augmentation is by applying parameterised transformations including random cropping, rotation, scaling, and jittering. In the case of driver drowsiness detection, many available datasets are unrepresentative as these are often captured in specific cultural contexts. In addition, there is a distinct lack of datasets captured in African contexts [28]. Applying common data augmentation strategies to the training dataset improves the results by a small amount. However, standard augmentation is severely limited and is unable to generalise to more complex domain shift problems.
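Returning to the conditional mini-max objective described above, the following minimal sketch estimates the value function from discriminator outputs on a small batch. It is an illustration only: the function name `gan_value` and the toy probabilities are assumptions and are not taken from our implementation.

```python
import math

def gan_value(d_real, d_fake):
    """Monte Carlo estimate of V = E[log D(x|y)] + E[log(1 - D(G(z|y)))].

    d_real: discriminator outputs on real images, each in (0, 1)
    d_fake: discriminator outputs on generated images, each in (0, 1)
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# Hypothetical outputs: a discriminator that separates real from fake well,
# and one that is completely fooled (outputs 0.5 everywhere).
confident = gan_value([0.9, 0.8], [0.1, 0.2])
fooled = gan_value([0.5, 0.5], [0.5, 0.5])
```

The discriminator seeks to maximise this value while the generator seeks to minimise it, so the well-separated case yields the larger value of the two.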
There is much work using GAN architectures for data augmentation. Gupta [29] used conditional GANs for sentiment classification, obtaining a significant improvement against a baseline model that was only trained on real data. The conditional GAN was trained using different strategies including pre-training and noise injection on the training data. Mok and Chung [30] proposed an automatic data augmentation that enables machine learning methods to learn from the available annotated samples efficiently. Their architecture consists of a coarse-to-fine generator which captures the manifold of the training sets. Their proposed method was used on Magnetic Resonance Imaging (MRI) images and achieved improvements of about 3.5% over the traditional augmentation approaches that were compared against. In addition, Wu et al. [31] used a multi-scale class conditional GAN to perform contextual in-filling to synthesize lesions onto healthy screening mammograms. For experimentation, three classifications were compared and their method substantially outperformed a baseline model. Antoniou et al. [32] used a conditional GAN to augment data in another domain. They named their architecture Data Augmentation Generative Adversarial Networks (DAGAN), which can be trained for low-data tasks using standard stochastic gradient descent approaches. It is clear that GANs can be used as a substitute for traditional augmentation techniques and are particularly valuable where more sophisticated augmentation strategies are required.

B. Population Bias
Racial bias is a problem that has been raised in the computer vision community, with a specific focus on how to develop machine learning models that guarantee fairness across all ethnicity groups. This problem has been identified as a result of investigations of fairness in machine learning systems that involve humans. Racial bias has been reported in various areas including criminal justice, employment, education, and face recognition systems [33,34,35]. This bias comes from unbalanced training datasets that favour the demographics of the contexts that the applications were developed for. As a result, when tested outside of these conditions, these algorithms begin to fail. In addition, publicly available datasets often do not capture a wide range of races, while policies on publishing data that contains people's faces prevent some datasets from being published.
Buolamwini et al. [36] found that the performance of three commercial gender classification algorithms decreases dramatically for darker-skinned female faces. Moreover, in [37] it is revealed that Amazon's facial recognition tool misidentified photos of 28 members of the US Congress as criminals because of their skin complexion. De-Arteaga et al. presented a large scale study of gender bias in occupation classification [38]. Here, a machine learning algorithm learns to classify gender based on first names and pronouns from online biographies. Their study showed that there are gaps when using three different semantic representations, namely bag-of-words, Deep Recurrent Neural Networks (DRNN), and word embedding [38]. Benthall and Haynes investigated supervised learning algorithms and revealed that they are exposed to racial bias because of the differentiation that is embedded in systematic patterns [39].
Abiteboul [40] investigated issues in ethical data management, where he considered bias and violation of data privacy in data analysis. His work discusses factors to be considered when working with data such as fairness, transparency, neutrality, and diversity. To overcome the challenges this introduces, he proposed an unsupervised learning technique that dynamically detects patterns of segregation in order to mitigate the root causes of social disparities and other factors that can lead to biased models.
In this paper, we seek to address the racial bias introduced by imbalanced training sets, by generating images of awake and drowsy people that look like those for whom the model is failing, so as to improve generalisation.

III. IMPLEMENTATION
This section discusses our framework. Figure 1 illustrates the general architecture of the framework. A pre-trained ResNet model is used for classifying the driver's state, after fine-tuning the final layers, which comprise fully connected layers and a binary classification layer. The framework boosts the performance of the ResNet classification model on population groups where it is failing (in our case this is darker skinned individuals).
The framework is composed of four primary components. Firstly, a GAN architecture produces synthetic images of individuals with facial attributes that can be used when retraining the network. The second component is the CNN architecture that predicts the state of the driver, while the third component is a population bias visualiser that highlights regions where the model is performing well and where it is failing to generalise. Lastly, the sampler targets images where the model is not performing well and searches through the synthetic images to find similar images, which are used for model retraining.
Fig. 1. The figure shows the proposed bias remediation framework. The first step uses a GAN architecture to generate synthetic data, which is used to train a CNN to detect driver state. A population bias visualisation is applied to the testing dataset and highlights where the model is failing to generalise. Images are then sampled where the model is not generalising, and are used to find similar images from the GAN generated images to continue training the model.
GAN architecture - We adopt the architecture for our generative network from Choi et al. [41], who have shown impressive results for generating realistic synthetic images in different domains. Our architecture is a conditional GAN that is conditioned to translate facial attributes such as awake or drowsy across multiple ethnicity groups. This translation of attributes helps in improving the detection model by supplying images with appropriate features where the drowsiness detector fails to generalise. The generator was modified by replacing the standard convolutions with depthwise separable convolutions [42]. The benefit of this is to have fewer trainable parameters, while retaining the performance of the network. Furthermore, the generator uses a stride of two for downsampling and 11 depthwise separable convolutions. We used instance normalisation [43] instead of batch normalisation for the generator. For the discriminator network, standard convolutions were retained because the discriminator acts as a classifier and requires greater capacity in order to distinguish between real and fake images. In addition, PatchGANs [44] were adopted for the discriminator network because they make use of a fixed-size patch discriminator that is easily applied to 256x256 images.
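To make the parameter saving from depthwise separable convolutions concrete, the short calculation below compares weight counts for a single layer. The kernel size and channel counts are hypothetical values chosen for illustration and do not describe the actual layers of our generator; biases are ignored.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a k x k standard convolution (biases ignored)."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """Weights in a k x k depthwise convolution followed by a 1 x 1 pointwise convolution."""
    return k * k * c_in + c_in * c_out

# Hypothetical layer sizes, chosen only for illustration.
k, c_in, c_out = 3, 128, 256
standard = standard_conv_params(k, c_in, c_out)          # 294912 weights
separable = depthwise_separable_params(k, c_in, c_out)   # 33920 weights
```

For this example layer the separable form needs roughly one ninth of the weights, which is the kind of reduction that motivates its use in the generator.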
ResNet architecture - We used a pre-trained ResNet model which comprises 50 layers and was originally trained on the Canadian Institute For Advanced Research (CIFAR-10), ImageNet Large Scale Visual Recognition Competition (ILSVRC) and Common Objects in Context (COCO) [45] datasets. We fine-tuned the last layers of the pre-trained model, and modified the prediction function to a sigmoid for binary drowsiness classification.
Population bias visualisation - The population bias visualisation component relies on Principal Component Analysis (PCA) [46] for dimension reduction. This is followed by sorting the images by similarity and overlaying the prediction error to visualise where the model is failing. Image features are extracted by projecting validation images onto a 2-dimensional grid using PCA. The validation dataset is transformed into an orthogonal subspace whose axes (principal components) align with the directions of maximum variance in the data. Here, a matrix of images is formed by reshaping images $x_i$ into row vectors (where $i = 1, \ldots, N$, and $N$ is the number of images in the dataset) and stacking these vertically to form an $N \times P$ matrix $X$. The number of pixels in each image is denoted by $P$. The PCA data transformation starts by mean centring the matrix of images, which is accomplished by subtracting the mean image from each $1 \times P$ dimensional row vector $X_i$ in the matrix of images, $\tilde{X}_i = X_i - \mu$, where $\mu = (\mu_1, \ldots, \mu_P)$, $\mu_j = \frac{1}{N}\sum_{i=1}^{N} X_{ij}$, and $X_{ij}$ is the $j$-th pixel of image $i$. The mean-centred $N \times P$ matrix of images $\tilde{X}$ is then decomposed using singular value decomposition (SVD), applied to its transpose,

$$\tilde{X}^{\top} = U \Sigma V^{\top}$$

Here, $U$ is a $P \times N$ matrix with orthonormal columns, $V$ is an $N \times N$ unitary matrix and $\Sigma$ is a diagonal matrix comprising the singular values of $\tilde{X}$ in decreasing order [47]. A reduced dimensional representation of $\tilde{X}$ can be obtained by discarding columns of $U$ and $V$,

$$\tilde{X}^{\top} \approx U_{0:j} \, \Sigma_{0:j} \, V_{0:j}^{\top}$$

Here, $j$ denotes the number of columns retained. As shown above, PCA can project data into a low dimensional coordinate system, with axes provided by the columns of $U_{0:j}$, and data coordinates given by the rows of $V_{0:j}$. In this work, we retain only two columns ($j = 2$), and project images into a two-dimensional coordinate system. Figure 2 shows the 2D projection (coordinates obtained from $V_{0:2}$) of facial images in our test dataset. We use this projection to construct a grid of images, grouped by similarity.
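The projection step can be sketched as follows. This is a minimal NumPy illustration under the assumption that the SVD is applied to the transposed mean-centred image matrix, so that pixel-space axes come from the left singular vectors and per-image coordinates from the right singular vectors. The function name `pca_project` and the random toy images are hypothetical stand-ins, not our actual code or data.

```python
import numpy as np

def pca_project(images, j=2):
    """Project images onto their first j principal components via SVD.

    images: array of shape (N, H, W) or (N, P); each image becomes one row.
    Returns (coords, axes, singular_values) with shapes (N, j), (P, j), (K,).
    """
    X = np.asarray(images, dtype=float).reshape(len(images), -1)   # N x P matrix
    X_centred = X - X.mean(axis=0)                                 # subtract the mean image
    # Economy SVD of the transposed matrix; singular values are returned in decreasing order.
    U, S, Vt = np.linalg.svd(X_centred.T, full_matrices=False)
    axes = U[:, :j]        # principal axes in pixel space
    coords = Vt[:j].T      # one j-dimensional coordinate per image
    return coords, axes, S

rng = np.random.default_rng(0)
toy_images = rng.random((20, 8, 8))    # random stand-ins for face images
coords, axes, singular_values = pca_project(toy_images, j=2)
```

Multiplying the mean-centred images by `axes` recovers `coords` scaled by the corresponding singular values, so either form can be used to lay the images out in two dimensions.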
We create a uniform coordinate grid and search for the closest image (in the reduced dimensional coordinate system) to each point in the grid. Each grid point is assigned its closest image, and an image is removed from the list of available images once it has been allocated a grid coordinate, which ensures that no image is duplicated. This produces a grid of images that groups individuals by facial similarity, as shown in Figure 3. This process successfully groups faces of similar state and complexion together, with darker skinned individuals located towards the top of the image, and lighter skinned individuals towards the bottom. For each image selected, we calculate the error in prediction, to produce a saliency map indicating model quality for the constructed grid of images.
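A minimal sketch of the grid construction is given below. It assumes a greedy, row-by-row traversal of the grid points, which is one possible ordering and is not necessarily the one used in our experiments; the function name `assign_to_grid` and the toy coordinates are illustrative.

```python
def assign_to_grid(coords, rows, cols):
    """Place the nearest unused image at each point of a uniform grid.

    coords: list of (x, y) image coordinates in the reduced space.
    Returns a rows x cols list of image indices (None if images run out).
    """
    xs = [c[0] for c in coords]
    ys = [c[1] for c in coords]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    available = set(range(len(coords)))
    grid = [[None] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if not available:
                return grid
            # Uniformly spaced grid point spanning the range of the data.
            gx = x_min + (x_max - x_min) * (c / (cols - 1) if cols > 1 else 0.5)
            gy = y_min + (y_max - y_min) * (r / (rows - 1) if rows > 1 else 0.5)
            nearest = min(
                available,
                key=lambda i: (coords[i][0] - gx) ** 2 + (coords[i][1] - gy) ** 2,
            )
            grid[r][c] = nearest
            available.remove(nearest)   # each image is used at most once
    return grid

toy_coords = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
grid = assign_to_grid(toy_coords, rows=2, cols=2)
```

The prediction error of the image placed in each cell can then be overlaid on the grid to form the saliency map.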
Image Re-sampling - This step is performed by randomly selecting images according to the error in prediction, termed failure probability here. The failure probability is calculated by determining the difference between the CNN sigmoid prediction output $y_i$ and the true label $y_{t,i}$ of image $i$ (a binary label encoding drowsy or awake). This is normalised by the total probability of failure over all $N$ images in the dataset,

$$p_i = \frac{|y_i - y_{t,i}|}{\sum_{k=1}^{N} |y_k - y_{t,k}|}$$

The random selection is performed by categorically sampling images using the weights above. This ensures that images with greater probability of failure are more likely to be sampled. Similar images in the GAN generated dataset are then selected by finding close matching images in the PCA space. The selected images are then used to continue training the ResNet model to increase classification performance.
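The re-sampling step can be illustrated with the sketch below. It assumes that the failure probability is the absolute prediction error normalised over the dataset, that sampling is done with replacement, and that similarity in PCA space is measured by Euclidean distance; the function names and toy values are hypothetical.

```python
import random

def failure_probabilities(predictions, labels):
    """Normalised absolute prediction error for each validation image."""
    errors = [abs(p - t) for p, t in zip(predictions, labels)]
    total = sum(errors)
    if total == 0:
        # No failures at all: fall back to uniform sampling.
        return [1.0 / len(errors)] * len(errors)
    return [e / total for e in errors]

def sample_failures(weights, k, seed=None):
    """Categorically sample k image indices (with replacement) using the weights."""
    rng = random.Random(seed)
    return rng.choices(range(len(weights)), weights=weights, k=k)

def nearest_synthetic(target, synthetic_coords):
    """Index of the synthetic image closest to target in the reduced space."""
    return min(
        range(len(synthetic_coords)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(target, synthetic_coords[i])),
    )

# Hypothetical sigmoid outputs and binary labels for four validation images.
preds = [0.9, 0.2, 0.6, 0.1]
labels = [1, 0, 0, 1]
weights = failure_probabilities(preds, labels)
picked = sample_failures(weights, k=3, seed=0)
match = nearest_synthetic((0.0, 0.0), [(1.0, 1.0), (0.1, -0.1)])
```

In this toy example the fourth image has the largest error and is therefore the most likely to be drawn, after which its nearest synthetic neighbour would be added to the retraining set.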

A. Training
In order to train the GAN architecture, the Adam optimiser [48] was used with $\beta_1 = 0.3$ and $\beta_2 = 0.6$, along with a batch size of 32. We applied horizontal flipping as data augmentation with a probability of 0.5. To train the model, we used a learning rate of 0.0001 for 100 epochs and then linearly decayed the learning rate over the next 100 epochs. This strategy compensates for the fact that our training data is limited. All experiments were carried out on a single Nvidia Tesla K20c GPU, where the training took approximately 12 hours.
Initially, the models were trained on three different datasets (NTHU-drowsy, DROZY, and CEW) and were compared with our framework as illustrated in the results section. The pre-trained ResNet model initially used a learning rate of 0.0001, which was then varied between 1e-3 and 1e-6 for the rest of the experiments, with the learning rate modified at each iteration of our framework. We also used an early stopping strategy to prevent overfitting of the model.
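The GAN learning rate schedule described above can be written as a small function. This is a sketch that assumes the linear decay ends at zero after the final epoch, which the description above does not specify; the function and argument names are illustrative.

```python
def gan_learning_rate(epoch, base_lr=1e-4, constant_epochs=100, decay_epochs=100):
    """Learning rate for a 0-indexed epoch: constant, then linear decay towards zero."""
    if epoch < constant_epochs:
        return base_lr
    remaining = constant_epochs + decay_epochs - epoch
    return base_lr * max(remaining, 0) / decay_epochs

# Learning rate for each of the 200 training epochs.
schedule = [gan_learning_rate(e) for e in range(200)]
```

Under this assumption the rate stays at 0.0001 for the first 100 epochs and then shrinks by an equal amount each epoch.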

B. Datasets
This section describes the datasets that were used for training and testing the ResNet model. For this work, the NTHU-drowsy, DROZY, and CEW datasets are used.
NTHU-drowsy was introduced at the 13th Asian Conference on Computer Vision (ACCV2016) [49]. The dataset is split into test and training sets. For training, there are 18 participants (10 men and 8 women) pretending to drive, with 5 scene scenarios for each participant: no glasses, glasses, glasses at night, no glasses at night, and sunglasses. For evaluation, there are images of 2 men and 2 women. Videos combining drowsy, normal and sleepy states are provided.
DROZY consists of 14 participants (3 males and 11 females) [50]. Each video is approximately 10 minutes long and is accompanied by the results of psychomotor vigilance tests (PVTs) regarding the drowsiness state. For each participant, the dataset contains a time-synchronised Karolinska Sleepiness Scale (KSS) score [50].
CEW is a collection of online images of different races (for example Asians and non-Asians with light-skinned faces) and contains 2423 participants [51]. Among the participants, 1192 have both eyes closed and 1231 have their eyes open. These images were selected from the Labeled Faces in the Wild database.
Validation dataset - Our validation set contains 1500 faces, some of which are obtained from the validation sets of the three datasets used. To have a balanced and representative validation dataset, we added African faces, which we collected for the purpose of this drowsiness detection task. These images cover many ethnicity groups and drowsiness states.

IV. EXPERIMENTAL RESULTS
We first compared our proposed framework with results on the publicly available datasets using the pre-trained ResNet models. All parameters were kept the same for the first experiments, but thereafter the learning rate and training epochs were modified to prevent overfitting of the models. Figure 4 shows the eye state attribute transfer results on the validation dataset. We observed that it is easier to transfer from eyes open to eyes closed. The first row in Figure 4 shows that the network was not performing well in transferring from eyes closed to eyes open, as shown by the blurriness of the image. Despite these limitations, these images still boosted performance on the driver drowsiness detection task.

B. Population Bias Visualisation Results
Population bias visualisation highlights faces where the model fails to generalise. A re-sampling strategy randomly targets highlighted images where the model is failing and uses similar GAN images for model retraining. In these experiments, we compared our framework to models that are trained on the CEW, NTHU-drowsy, and DROZY datasets. These models are then tested on the prepared test set. The GAN synthetic data was generated from the validation set, which was prepared to represent a wide variety of races. These synthetic generated images cover a wide variety of races and facial attributes. Fig. 7 shows baseline experimental results where we sampled randomly from the augmentation data. It was noted that when we used all the GAN generated images and did not use targeted sampling, model improvement was limited. Targeted sampling is clearly a more effective strategy. Fig. 5 shows the population bias visualisation for the three datasets. It is clear that the models were failing to generalise on darker complexion ethnicity groups, but the proposed framework corrects this. Table I shows the accuracy results from the three experiments, which show how accuracy is improved with additional iterations of targeted sampling. The table also contains results from previously unseen test images as the model is updated using the sampling technique, which show improvements in line with the validation set. Fig. 6 highlights the improvement of the driver drowsiness detection performance using the proposed framework. Given enough data, the model was able to generalise well to all population groups in the test set.
Fig. 6. The figure shows the progressive improvement using the framework. As the models were fine-tuned and the learning rate was modified, there was improvement and the models reached more certain classification probabilities (0.70 to 0.99). At the seventh cycle we managed to reach optimal performance on the validation set.

C. Learning rate Results
Table II shows the influence of different learning rates on the performance of the model. A learning rate of 1e-3 was too high and limited model learning. The best performance was observed from 1e-4 to 1e-6. The early stopping strategy was also applied to prevent overfitting, which can improve the generalisation of the model. At first, the models were trained with the same number of epochs. It was observed that using the same number of epochs with different learning rates showed traits of overfitting. Therefore, training was stopped early once there was no further change in the accuracy.
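The early stopping rule can be sketched as below. The patience of three epochs and the minimum improvement threshold are assumptions introduced for illustration, as are the function name and the toy accuracy history; they are not the settings used in our experiments.

```python
def early_stopping_epoch(val_accuracies, patience=3, min_delta=1e-4):
    """Return the 0-indexed epoch at which training would stop.

    Training stops once validation accuracy has failed to improve by more than
    min_delta for `patience` consecutive epochs; otherwise the last epoch is returned.
    """
    best = float("-inf")
    stale = 0
    for epoch, acc in enumerate(val_accuracies):
        if acc > best + min_delta:
            best = acc
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return len(val_accuracies) - 1

# Hypothetical validation accuracies that plateau after the fourth epoch.
history = [0.70, 0.78, 0.83, 0.85, 0.85, 0.85, 0.85, 0.85]
stop_at = early_stopping_epoch(history)
```

With these toy values, training stops at epoch index 6, after three epochs without any change in accuracy.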

V. CONCLUSION
In this paper, we introduced a novel framework that can be used to boost the performance of driver drowsiness detection models by reducing bias in the training dataset. Here, a GAN produces realistic images, and a ResNet model is retrained on synthetic data chosen based on failure cases in the validation data. Faces where the model fails are used to search for similar images in a synthetic dataset produced by the GAN. These images are used to fine-tune a CNN model, and this process is repeated until convergence. A strategy of choosing different learning rates for each iteration helped to reduce the chances of overfitting the model. Importantly, the proposed approach does not rely on any meta-data or assumptions about the race or ethnicity of individuals in the datasets, although such information is commonly used to determine algorithmic fairness or bias. Requiring this knowledge is potentially problematic as it tends to rely on subjective and controversial racial classifications.
This work has shown that bias in datasets can be addressed to some extent using targeted sampling and generative adversarial networks. However, this process still requires that some training data be available for various population groups, and does not eliminate the need for good, representative datasets. Rather, the proposed approach is intended to remedy more subtle bias introduced by imbalanced datasets, where images of a particular group may be more numerous than those of another.