Fatigue Driving Detection Based on Deep Learning and Multi-Index Fusion

To reduce traffic accidents caused by fatigue driving, a fatigue driving detection algorithm based on deep learning and the fusion of multiple facial indexes is proposed, working from the driver's facial features. Because real driving scenes are complex and changeable, the algorithm first improves the multi-task cascaded convolutional neural network (MTCNN) so that it can quickly and accurately locate the face and detect the facial key points. From the facial key points, the driver's eye and mouth regions are determined. Second, these regions are input into the eyes and mouth state recognition network (E-MSR Net) for state recognition. The E-MSR Net is a depthwise separable convolutional neural network improved and optimized from MobileNetV2. Finally, three facial indexes, the eye closure rate (ECR), mouth opening rate (MOR), and head non-positive face rate (HNFR), are fused to judge the driver's fatigue state. The algorithm can make quick and accurate judgments in complex and changeable scenes, and it avoids the failure that occurs when the eyes or mouth are occluded by sunglasses or masks during driving. The proposed algorithm achieved an accuracy of 97.5% on a self-made data set, which supports the feasibility of the algorithm.


I. INTRODUCTION
Fatigue driving endangers road traffic safety and has become an important cause of traffic accidents. A 2011 World Health Organization survey found that 1.3 million people died each year from road traffic accidents and about 50 million were disabled. By 2020, traffic accidents were the fifth leading cause of death worldwide, killing about 2.4 million people each year. The American Automobile Association (AAA) Traffic Safety Foundation reported that nearly one-fifth of the traffic-accident deaths in the United States each year, about 4.55 million people, were caused by fatigue driving.
How to effectively monitor and determine the fatigue driving state so as to issue a fatigue warning has become a hot topic in scientific research. At present, there are three main driving fatigue detection methods. The first is driver fatigue detection based on driving data [1], [2]. This technology detects the driver's fatigue state by monitoring, in real time, the driving trajectory of the vehicle, the rotational speed of the steering wheel under a certain hand-held pressure [3], the degree of lane deviation reported by the driving monitoring device, and other indirect driving data. Fatigue detection based on driving data is non-invasive and does not affect normal driving behavior during monitoring. When the monitoring shows that the driving trajectory of the vehicle has deviated, or that the grip pressure on the steering wheel has dropped, the system determines that the driver is currently fatigued and wakes the driver by vibration and voice. However, the indirect driving data of the vehicle are also affected by the driver's personal driving habits, driving skills, and other non-fatigue factors.
The associate editor coordinating the review of this manuscript and approving it for publication was Wei Liu.
Wang et al. [4] combined the steering wheel angle information at different time points with the longitudinal and lateral acceleration of the vehicle, and used the random forest algorithm to predict the fatigue state of the driver. After comparing and analyzing these indicators, they found that the lateral acceleration of the vehicle is the best indicator of the driver's fatigue state. Mao and Du [5] used a simulated driver to obtain a variety of physical characteristics of the vehicle during driving, analyzed the normal and abnormal states of the vehicle, and constructed a classifier according to the different characteristics.
The second is the driving fatigue detection method based on the driver's physiological parameters [6]-[9]. Signals from electrode patches placed on the forehead, over the heart, and on the muscles are received through special medical equipment and transmitted to the system for real-time analysis [10], [11]. The clinically defined thresholds of the physiological parameter signals are used as the basis for fatigue identification [12]. Physiological parameters are mostly collected by special signal acquisition instruments. To ensure the precision of the collected data and the correctness of the analysis, the medical equipment is relatively large and occupies the driving space. The driver cannot concentrate fully on driving the vehicle, which unintentionally increases the risk of accidents. In addition, because one end of the data acquisition equipment must be attached to the skin, placed on the head, or placed on the trunk, it restricts the driver's movements to some degree and increases the uncertainty of the driving process. Most importantly, because of individual differences among drivers, the parameter thresholds for fatigue identification are not the same for each driver, which makes fatigue detection harder. The system design must be tailored to the individual and cannot be universal.
Chui et al. [13] proposed an EEG signal processing method based on the SVM algorithm. This method reduces the amount of calculation and speeds up the operation, which helps improve the speed of fatigue driving detection. Piazzi et al. found that fatigue appears when the driver's heartbeat frequency is lower than 20% of the normal state, so fatigue can be judged from the driver's heartbeat frequency.
The third is the fatigue detection method based on computer vision [14], [16]. For fatigued drivers, some visual changes can be easily observed in their facial features, such as changes in the eyes, head, and face. These include longer blinking time, slow eyelid movement, a smaller degree of eye opening (or even closed eyes), frequent nodding, yawning, a fixed gaze (narrow line of sight), slow facial expression, and a drooping posture [17]. Computer vision is a natural and non-invasive method. The facial visual features that characterize the driver's fatigue level are extracted from the images captured by the camera in front of the driver. Using machine vision, image processing, pattern recognition, and related technologies, the driver's eye state, mouth state, and head motion state are analyzed to judge the driver's fatigue state [18], [19]. Compared with many traditional methods, the fatigue detection method based on computer vision has the advantages of being non-contact, non-interfering, and highly accurate.
Zhang and Su [20] designed a new yawning detection method for fatigue detection. They used a convolutional neural network to extract spatial image features and a long short-term memory (LSTM) network to analyze temporal features. This method can solve the problem of eye occlusion caused by sunglasses. However, under insufficient light at night, the algorithm produces missed detections and false detections. Zhuang et al. [21] proposed an effective fatigue detection method based on eye status with pupil and iris segmentation. The method separates the pupil and iris of the human eye through a streamlined network, and then determines the eye state according to the characteristics of the pupil and iris. This method is less affected by changes in sunlight, but it is not very effective when the driver wears sunglasses. Zhang and Wang [22] proposed an algorithm that uses image processing technology to process the images and then evaluates them with an SVM model. Finally, the sequential floating forward selection algorithm is used to select the optimal parameters, and the fatigue detection model is established. Li et al. [34] used the improved Yolov3-micro to extract facial features, and then judged the state of the driver's eyes and mouth. This method combines a variety of deep learning algorithms to improve the accuracy of fatigue detection. However, the deep learning network model used is large, which makes it hard to port to mobile terminals.
The existing detection methods suffer from low comfort, vulnerability to external factors such as sunlight, and low accuracy caused by relying on a single index. To solve these problems, this paper studies a driving fatigue detection algorithm based on deep learning and facial multi-index fusion, which mainly includes a face location and facial key points detection module, a state judgment module, and a fatigue judgment module.

II. FATIGUE DETECTION ALGORITHM
This algorithm first collects the driver's images through the camera in front of the driver, and then passes the images to the improved MTCNN (multi-task cascaded convolutional neural network) to locate the face and detect the facial key points. Based on the detected key points, the eye and mouth regions are determined and input into the eyes and mouth state recognition network (E-MSR Net) for state judgment. Finally, the fatigue state of the driver is judged by the facial multi-index fusion strategy. The overall framework of the algorithm is shown in Figure 1.

A. FACE LOCATION AND FACIAL KEY POINTS DETECTION BASED ON IMPROVED MTCNN
MTCNN contains three cascaded multi-task convolutional neural networks, namely the Proposal Network (P-Net), the Refine Network (R-Net), and the Output Network (O-Net) [23]. Each multi-task convolutional neural network has three learning tasks, namely face classification, bounding box regression, and key point positioning. However, the focus of the three stages is different.
The network model in the first stage is called the P-Net. The network structure is shown in Figure 2. Its main function is to obtain the candidate windows of the face region together with the bounding box regression. The candidate windows are corrected by the results of the bounding box regression, and non-maximum suppression (NMS) is then used to merge the overlapping windows.
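The NMS step mentioned above can be illustrated with a minimal greedy sketch. This is not the authors' implementation: the box format `(x1, y1, x2, y2)`, the function names `iou` and `nms`, and the IoU threshold of 0.5 are assumptions chosen for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes kept by greedy NMS, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # highest-scoring remaining window
        keep.append(best)
        # Drop every remaining window that overlaps the kept one too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


# Two heavily overlapping candidate windows and one separate window.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)
```

In this example the second window overlaps the first with an IoU of about 0.82, so only the first and third windows survive.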
The network model in the second stage is called the R-Net. It filters out most of the non-face candidate windows through a more capable CNN, continues to correct the results with bounding box regression, and uses NMS to merge overlapping windows. The network structure is shown in Figure 3.
The network model in the third stage is called the O-Net. It takes the output of the second stage for further feature extraction and finds the five landmark points on the face through a more capable network. The network structure is shown in Figure 4. In the process of fatigue detection, MTCNN often fails to detect the face because of changes in human posture and scene light intensity. In response to this problem, MTCNN is improved in two ways. Adding an SPP layer after the last convolution layer of the O-Net improves the accuracy of the network in complex scenes, and adding the BN algorithm to MTCNN improves network performance.

1) ADDING THE SPATIAL PYRAMID POOLING (SPP) LAYER TO O-NET NETWORK STRUCTURE
To handle the complex scene changes in the driving process, an SPP layer [24] is added after the last convolution layer of the O-Net, which outputs a fixed-length vector for feature maps of different sizes. As shown in Figure 5, the SPP layer of this algorithm adopts a three-level pyramid with 1 × 1, 2 × 2, and 4 × 4 pooling grids. The SPP layer therefore divides each feature map into 1 + 4 + 16 = 21 parts and performs a maximum pooling operation on each part. Through the SPP layer, the feature map is transformed into a 21 × 128 matrix, which is sent to the fully connected layer and flattened into a one-dimensional vector. By adding the SPP layer, feature maps of any size can be converted into feature vectors of fixed size, which improves the recognition accuracy of the network model for features of different scenes and sizes, and reduces the amount of calculation.
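The fixed-length property of SPP can be checked with a small pure-Python sketch. It is only an illustration under stated assumptions: the bin boundaries use the common floor/ceil split, the helper names `spp_channel`, `spp`, and `make_map` are hypothetical, and the feature maps are synthetic values rather than real O-Net activations.

```python
import math


def spp_channel(fmap, levels=(1, 2, 4)):
    """Max-pool one H x W channel (a list of rows) into sum(n * n) values."""
    h, w = len(fmap), len(fmap[0])
    out = []
    for n in levels:
        for i in range(n):
            r0, r1 = (i * h) // n, math.ceil((i + 1) * h / n)
            for j in range(n):
                c0, c1 = (j * w) // n, math.ceil((j + 1) * w / n)
                out.append(max(fmap[r][c]
                               for r in range(r0, r1)
                               for c in range(c0, c1)))
    return out


def spp(feature_map, levels=(1, 2, 4)):
    """feature_map is a list of channels; returns one flat vector."""
    vec = []
    for channel in feature_map:
        vec.extend(spp_channel(channel, levels))
    return vec


def make_map(channels, h, w):
    """Synthetic feature map used only to exercise the pooling code."""
    return [[[float((c + r * w + k) % 7) for k in range(w)]
             for r in range(h)]
            for c in range(channels)]


# Two inputs of different spatial size, both with 128 channels.
vec_a = spp(make_map(128, 5, 7))
vec_b = spp(make_map(128, 9, 6))
print(len(vec_a), len(vec_b))
```

Both vectors have 21 × 128 entries regardless of the input height and width, which is what lets the following fully connected layer keep a fixed input size.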

2) ADDING THE BATCH NORMALIZATION (BN) ALGORITHM BEFORE EACH LAYER OF MTCNN NETWORK ACTIVATION FUNCTION
Differences in the distribution of input values affect the training of the MTCNN network. When the input feature values differ widely, these differences propagate to the later layers as training proceeds, and they can also lead to gradient explosion during back propagation.
The function of the BN algorithm [25] is to standardize the input values and bring them into the same range. On the one hand, it improves the convergence of the gradient and accelerates the training of the model. On the other hand, each layer sees inputs with, as far as possible, the same feature distribution, which reduces the uncertainty caused by distribution changes and their impact on the later layers. Each layer thus becomes relatively independent, which alleviates the problem of vanishing gradients in training and improves the performance of the network.
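The standardization performed by BN can be sketched for a single feature over one mini-batch. This is a minimal training-mode illustration, not the MTCNN code: the function name `batch_norm`, the default `gamma`, `beta`, and `eps` values, and the sample inputs are assumptions.

```python
import math


def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch, then scale and shift."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n   # biased batch variance
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]


# Inputs with very different magnitudes are brought into a common range.
raw = [1.0, 200.0, 35.0, 120.0]
normalized = batch_norm(raw)
out_mean = sum(normalized) / len(normalized)
out_var = sum((v - out_mean) ** 2 for v in normalized) / len(normalized)
print(normalized, out_mean, out_var)
```

With `gamma = 1` and `beta = 0` the output has approximately zero mean and unit variance; in a trained network `gamma` and `beta` are learned, and running statistics replace the batch statistics at inference time.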

B. EYE AND MOUTH STATE RECOGNITION NETWORK (E-MSR NET)
After the images pass through the face location and facial key points detection module, the eye and mouth regions of the face are identified from the detected feature points. These regions are sent to the depthwise separable convolutional neural network E-MSR Net, which judges the state of the driver's eyes and mouth. The E-MSR Net is a lightweight network. It adds the SE network to the inverted residual structure of the MobileNetV2 network [26] to improve accuracy, and replaces the sigmoid activation function in the MobileNetV2 network with the h-swish activation function to reduce the amount of calculation.
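To see why a depthwise separable design keeps the network light, the weight counts of a standard convolution and a depthwise separable convolution can be compared. The layer sizes below (3 × 3 kernel, 32 input channels, 64 output channels) are illustrative assumptions, not values taken from the E-MSR Net, and biases are ignored.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """Weights in a k x k depthwise conv followed by a 1 x 1 pointwise conv."""
    return k * k * c_in + c_in * c_out


std = standard_conv_params(3, 32, 64)
sep = depthwise_separable_params(3, 32, 64)
print(std, sep, sep / std)
```

For these sizes the separable layer needs 2336 weights instead of 18432, roughly one eighth of the standard layer.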

1) ADDING THE SQUEEZE-AND-EXCITATION (SE) NETWORKS
The SE network [27] consists mainly of squeeze and excitation operations, as shown in Figure 6. First, a global average pooling operation is performed on the input feature map. Next, two fully connected layers are applied to add nonlinearity and fit the complex correlation between channels.
The SE network is added into the bottleneck of the MobileNetV2 network, as shown in Figure 7. With it, the E-MSR Net can automatically learn the importance of each feature channel, and then use this result to enhance useful features and suppress features that are not useful for the current task. The features are recalibrated so that effective channels receive large weights and ineffective channels receive small weights, which improves the accuracy of the network.
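The squeeze and excitation steps can be sketched on a toy feature map. This is a minimal illustration and not the E-MSR Net code: the weights `w1`, `b1`, `w2`, `b2` are hand-picked hypothetical values, and the sigmoid gate follows the original SE design [27], which may differ from the gate used in the E-MSR Net.

```python
import math


def global_avg_pool(x):
    """x is a list of C channels (H x W lists of rows); returns C means."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]


def dense(vec, weights, bias):
    """Fully connected layer; weights holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def se_block(x, w1, b1, w2, b2):
    z = global_avg_pool(x)                              # squeeze
    hidden = [max(0.0, h) for h in dense(z, w1, b1)]    # FC + ReLU
    gates = [1.0 / (1.0 + math.exp(-g))                 # FC + sigmoid gate
             for g in dense(hidden, w2, b2)]
    # Excitation: rescale every channel by its learned importance.
    scaled = [[[g * v for v in row] for row in ch] for g, ch in zip(gates, x)]
    return scaled, gates


# Two 2 x 2 channels, reduced to one hidden unit (reduction ratio 2).
x = [[[1.0, 1.0], [1.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]]]
w1, b1 = [[0.5, 0.5]], [0.0]
w2, b2 = [[1.0], [-1.0]], [0.0, 0.0]
scaled, gates = se_block(x, w1, b1, w2, b2)
print(gates)
```

With these weights the first channel receives a gate of about 0.82 and the second about 0.18, so the first channel is emphasized and the second is suppressed.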

2) CHANGING THE ACTIVATION FUNCTION
The sigmoid function [28] used in the MobileNetV2 network is given by

$$\sigma(x) = \frac{1}{1 + e^{-x}} \qquad (1)$$

As the formula shows, the sigmoid activation function requires an exponential in both its evaluation and its derivative, which consumes considerable computing resources. This algorithm replaces it with the h-swish function:

$$\text{h-swish}(x) = x \cdot \frac{\mathrm{ReLU6}(x + 3)}{6} \qquad (2)$$

The h-swish function has several advantages while maintaining accuracy. First, ReLU6 is available in many hardware and software frameworks. Second, it avoids the loss of numerical precision during quantization. In addition, the reduced amount of calculation speeds up inference. The function image is shown in Figure 8.
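As a quick numerical check, h-swish in its standard form, h-swish(x) = x · ReLU6(x + 3) / 6, can be written with only comparisons and one multiplication. The function names below are illustrative.

```python
def relu6(x):
    """ReLU clipped to the range [0, 6]."""
    return min(max(x, 0.0), 6.0)


def h_swish(x):
    """Piecewise-linear gate times x; no exponential is needed."""
    return x * relu6(x + 3.0) / 6.0


samples = [-4.0, -3.0, 0.0, 1.0, 3.0, 5.0]
values = [h_swish(x) for x in samples]
print(values)
```

The function is zero for inputs at or below -3, equals the identity for inputs at or above 3, and blends smoothly in between.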

III. FATIGUE STATE JUDGMENT ALGORITHM BASED ON MULTI-INDEX FUSION
To avoid the inaccuracy caused by relying on a single evaluation index in fatigue detection, this algorithm defines three evaluation indexes: the eye closure rate (ECR), the mouth opening rate (MOR), and the head non-positive face rate (HNFR). The three evaluation indexes are fused to determine whether the driver is fatigued.

A. EYE STATE EVALUATION INDEX
PERCLOS [29] is an internationally recognized fatigue judgment criterion, which refers to the time proportion of eyes closed within a certain time. The calculation formula is as follows.

$$\text{PERCLOS} = \frac{\text{Number of closed-eye frames}}{\text{Total number of frames in the detection period}} \times 100\% \qquad (4)$$

The PERCLOS judgment criterion includes P70, P80, and EM, under which the eye is counted as closed when the eyelid covers more than 70%, 80%, and 50% of the pupil, respectively. The proportion of time the eyes are closed within a certain period is then computed. Among them, P80 is considered to be the criterion most responsive to fatigue. The principle is shown in Figure 9.
In this paper, the eye state evaluation index follows the P80 criterion of PERCLOS. The ECR, denoted $f_e$, is defined to judge the driver's eye fatigue state. The calculation formula is as follows:

$$f_e = \frac{t_e}{T_e} \qquad (5)$$

where $t_e$ represents the number of eye-closing frames during the detection time, and $T_e$ represents the total number of frames during the detection time.

B. MOUTH STATE EVALUATION INDEX
The mouth state evaluation index is similar to the eye state evaluation index. The MOR, denoted $f_m$, is defined to judge the driver's mouth fatigue state. The calculation formula is as follows:

$$f_m = \frac{t_m}{T_m} \qquad (6)$$

where $t_m$ represents the number of mouth-opening frames during the detection time, and $T_m$ represents the total number of frames in the detection time.

C. HEAD POSTURE EVALUATION INDEX
The improved MTCNN detects five 2D key points on the human face: the left eye, the right eye, the nose tip, and the left and right mouth corners. The POSIT algorithm converts the five 2D feature points into five 3D feature points in the world coordinate system by rotation, translation, and other operations. Next, the transformation parameters are estimated. Finally, the human head posture parameters in the 2D plane are obtained, namely the yaw angle (Yaw), pitch angle (Pitch), and roll angle (Roll). Based on the actual head posture of a fatigued driver, Pitch and Roll are selected as the evaluation indexes.
The head posture evaluation index is similar to the eye and mouth state evaluation indexes. The HNFR, denoted $f_h$, is defined to judge the driver's head posture state. The calculation formula is as follows:

$$f_h = \frac{t_h}{T_h} \qquad (7)$$

where $t_h$ represents the number of head posture fatigue frames during the detection time, and $T_h$ represents the total number of frames in the detection time.
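ECR, MOR, and HNFR share the same form: the fraction of flagged frames within the detection time. A minimal sketch of such a frame-ratio index is given below. The class name `RateIndex`, the sliding-window update, and the 10-frame window are assumptions for illustration; the paper defines the ratio over a detection period and does not specify how the window advances.

```python
from collections import deque


class RateIndex:
    """Running fraction of flagged frames over the last `window` frames."""

    def __init__(self, window):
        self.frames = deque(maxlen=window)   # oldest frame drops out first

    def update(self, flagged):
        """Add one per-frame decision and return the current rate."""
        self.frames.append(bool(flagged))
        return sum(self.frames) / len(self.frames)


# Eye state per frame: six open frames followed by six closed frames.
ecr = RateIndex(window=10)
history = [ecr.update(closed) for closed in [False] * 6 + [True] * 6]
print(history[9], history[-1])
```

After the first ten frames the rate is 0.4 (four closed frames out of ten); after two more closed frames the window holds six closed frames and the rate rises to 0.6. The same object can track mouth-opening frames or head posture fatigue frames.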

D. FACIAL MULTI-INDEX FUSION
In the actual driving environment, fatigue driving is a complex psychological and physiological state, and the detection results are easily disturbed by environmental factors. If only eye features or only mouth features are extracted to determine the driver's fatigue, the accuracy of detection is greatly reduced when the driver wears sunglasses, a mask, or other facial occlusions. A fatigue detection method based on a single feature therefore works only in an ideal situation. This algorithm fuses the three evaluation indexes ECR, MOR, and HNFR to determine the driver's fatigue, in order to improve the performance and robustness of the system.
The eye state parameters were calculated from the duration of single eye closures measured in many tests under different eye states [30]. In the normal state, a single eye closure usually lasts 0.12 to 0.15 seconds. In the fatigue state, a single eye closure lasts 0.5 seconds or longer, significantly longer than in the awake state. Therefore, the parameter threshold of ECR is set to 0.5; that is, when the eye closure rate within the unit time reaches 0.5, the eyes are judged to be in a state of fatigue. A yawn usually lasts 3 to 5 seconds [31]. In this algorithm, the unit detection time is 20 seconds, and the number of yawns within 20 seconds should not exceed two. Therefore, this paper sets the parameter threshold of MOR to 0.3, which matches two yawns of at least 3 seconds each in a 20-second window (6/20 = 0.3); that is, when the yawning rate within the unit time reaches 0.3, the mouth is judged to be in a state of fatigue.
Following the PERCLOS judgment criterion, when the detected change in the Pitch and Roll angles of the head posture exceeds 20%, the head is judged to be in an incorrect posture. Based on multiple tests of how the head posture changes when the driver is fatigued during actual driving, the parameter threshold of HNFR is set to 0.5. That is, when the proportion of improper head posture within the unit time reaches 0.5, the head posture is judged to indicate fatigue.
The parameter thresholds of the three evaluation indexes ECR, MOR, and HNFR are fused to determine whether the driver is fatigued. The specific criterion is as follows:

$$\text{state} = \begin{cases} \text{normal}, & f_e < 0.5 \ \text{and}\ f_h < 0.5 \ \text{and}\ f_m < 0.3 \\ \text{danger}, & f_e \ge 0.5 \ \text{or}\ f_h \ge 0.5 \ \text{or}\ f_m \ge 0.3 \end{cases} \qquad (8)$$
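The fusion rule in (8) can be written as a short function. The thresholds 0.5, 0.3, and 0.5 come from the text above; the function name `judge_fatigue`, the constant names, and the sample index values are illustrative assumptions.

```python
ECR_THRESHOLD = 0.5    # eye closure rate
MOR_THRESHOLD = 0.3    # mouth opening rate
HNFR_THRESHOLD = 0.5   # head non-positive face rate


def judge_fatigue(f_e, f_m, f_h):
    """Return "danger" if any index reaches its threshold, else "normal"."""
    if (f_e >= ECR_THRESHOLD or f_m >= MOR_THRESHOLD
            or f_h >= HNFR_THRESHOLD):
        return "danger"
    return "normal"


# Alert driver: all three indexes are below their thresholds.
alert = judge_fatigue(f_e=0.10, f_m=0.05, f_h=0.10)

# Driver wearing sunglasses: the eye index carries no signal (0.0 here),
# but repeated yawning still triggers the warning through MOR.
occluded = judge_fatigue(f_e=0.0, f_m=0.35, f_h=0.10)
print(alert, occluded)
```

Because the rule is a logical OR over the three indexes, an occluded eye or mouth region does not prevent the remaining indexes from raising the warning.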

IV. EXPERIMENT AND RESULT ANALYSIS
A. FACE LOCATION AND KEY POINTS DETECTION MODEL TRAINING AND RESULT ANALYSIS
In this algorithm, the improved MTCNN uses the WIDER FACE database [32] as the training data set for face location. WIDER FACE is a mainstream face detection database with 32203 face images and 393703 labeled faces. Its scenes are highly varied and complex, so it can evaluate the performance of the algorithm from many directions and angles. The MTFL database is used as the training data for facial feature points. It contains 12995 face images annotated with 5 key points, and it also provides information on gender, smile, glasses, and head posture. Sample images from the data sets are shown in Figure 10.
The hardware configuration of the experimental computer is shown in Table 1. The experiments in this paper are based on Python 3.7 and TensorFlow 2.2 in a Windows 10 environment. The CPU is an Intel(R) Core(TM) i7-10870H, and an NVIDIA GeForce RTX 2060 GPU is used for model training. The experimental results are shown in Figure 11.
Figure 12. The experimental environment is the same as that of the previous experiment. Training runs for 50 iterations; the learning rate is set to 0.01 for the first 20 iterations and to 0.001 for the last 20 iterations. The training process curve is shown in Figure 13. The algorithm in this paper is compared with the MobileNetV2 algorithm. The graph shows that this algorithm converges faster than MobileNetV2.
The test set is used to evaluate the model trained by this algorithm and the model trained by MobileNetV2. The test results are shown in Table 2. The table shows that the accuracy of the trained E-MSR Net model is higher than that of the MobileNetV2 model.

C. FATIGUE DETECTION EXPERIMENT SIMULATION
The fatigue detection experiment uses self-made fatigue detection videos as the test data. The test data include fifty videos, each of which contains fatigue states such as closed eyes, yawning, and nodding. The detection effect is shown in Figure 14. The single evaluation indexes and the multi-feature fusion evaluation index are each used to judge the fatigue state, and the results of the comparison are shown in Table 3.
The figure shows that the algorithm in this paper can judge the driver's fatigue state accurately and in time. Accuracy is used as the evaluation index of the algorithm, and the formula is as follows:

$$\text{Accuracy} = \frac{N_t}{N} \times 100\% \qquad (9)$$

where $N_t$ represents the number of fatigue driving states detected by the algorithm in this paper, and $N$ represents the number of fatigue driving states that actually exist in the test video. This algorithm is compared with other algorithms, and the results are shown in Table 4.

V. SUMMARY
The algorithm in this paper can handle driver fatigue detection in various complex environments. The improved MTCNN can accurately and quickly detect the face, eye, and mouth regions. The E-MSR Net can accurately and quickly judge the state of the mouth and eyes. The facial multi-feature fusion algorithm can accurately judge the fatigue state of drivers.
The experimental results show that this algorithm detects different fatigue states well.