A Methodological Review on Prediction of Multi-Stage Hypovigilance Detection Systems Using Multimodal Features

Several hypovigilance detection systems (HDx) have been developed to prevent road accidents caused by driver fatigue, but they suffer from several limitations. Notably, many of them rely on the center-head position to define an area of interest (often associated with PERCLOS, the percentage of eye closure) without considering face occlusion or varying illumination, and they suffer from poor response time. These HDx systems mostly depend on image-processing, vision-based, and multisensor-based features. To address these problems, authors have utilized vision, sensor, environmental, and vehicular features, integrated through fusion, to predict multiple stages of hypovigilance. Lately, a few studies have combined multimodal features with deep learning (DL) architectures. Those multimodal-based systems (M-HDx) were able to predict the stages of driver fatigue (multi-stage). However, there is a need to critically measure the performance of these M-HDx systems through a comparative analysis of multi-stage fatigue recognition on hardware-based benchmarks. Moreover, it is important to evaluate M-HDx systems using different feature sets with both traditional and advanced machine learning techniques. Therefore, the primary aim of this work is algorithm and feature modeling, followed by a comparison of the advantages and differences with other work. Unlike state-of-the-art survey articles, this paper measures performance statistically. After experiments on M-HDx systems, this paper concludes that there is still a research gap in the real-time development of multistage M-HDx systems. Finally, the paper summarizes the directions, challenges, and applications in the development of HDx systems to assist other researchers in further research.


I. INTRODUCTION
Fatigue and loss of vigilance (hypovigilance) among drivers are very common problems, and they are especially acute for long-haul logistics drivers. These risks can translate into hazardous accident situations. During the last few years, researchers have developed many hardware- and software-based techniques to avoid such risks [1]. In particular, effort has gone into techniques for automated monitoring of drivers' activities. These systems can provide intelligent feedback and generate alert messages so that drivers can recognize uncontrollable situations. According to a report published by the World Health Organization (WHO) [2], more than 1 million people lose their lives due to traffic accidents, and approximately 50 million more are injured, causing severe disabilities. In particular, more than seven thousand deaths and thirty-eight thousand injuries are recorded due to road accidents every year in the Kingdom of Saudi Arabia (KSA). The development of a real-time driver fatigue detection and prediction system is a challenging task related to computer vision technologies. In such systems, detection of low vigilance and a high fatigue level triggers an alarm that warns the driver about his/her poor driving state. As a result, the development of such a driver drowsiness system is critical for assessing drivers' perception, recognition, and vehicle-control abilities during road driving.
The associate editor coordinating the review of this manuscript and approving it for publication was Juan A. Lara.
Most existing real-time hypovigilance detection (HDx) techniques have a poor response time, as noted in the literature [3]. This calls for systems that respond effectively in less time. The desired technique should also be non-intrusive, making it independent of any specific hardware on the driver's body. One project worth citing is Smart Car, by MIT researchers [4]. This system, however, is intrusive, as the driver needs to wear a wristband on the arm or wrist to measure the heart rate. Some other methods track the eye and gaze movements of drivers using devices installed on a helmet or special lenses. Though useful, the intrusive nature of these methods makes them practically unacceptable to many communities.
Hypovigilance detection (HDx) systems were developed in the past to detect driver drowsiness or inattention [5]. To measure the drivers' state, several systems such as CogBeacon-ML [124], BROOK-DenseNet [126], Ford-dataset [74], and Riani-M-HDx [128] utilized different measures, as shown in Figure 1. Past HDx systems were based on the integration of PERCLOS (percentage of eye closure) and multisensor fusion approaches [6] to detect and predict driver drowsiness. Those systems relied on advanced image processing, signal processing, and machine learning techniques to define PERCLOS measures. Health measures and vitals such as EEG and/or heart rate monitoring can also be used to detect driver fatigue [7], [8]. Some other systems use non-visual features based on driver physiological measures and vehicle parameters. In the case of vehicle parameter measurements, researchers predict driver fatigue from parameters such as steering-wheel movement, acceleration pedal, and speed. In practice, those approaches depend heavily on the road shape, the way of driving, and the performance of the vehicle. Other authors utilized electroencephalography (EEG), electrocardiogram (ECG), electrooculography (EOG), and surface electromyogram (EMG) [9] sensors to predict driver fatigue. Yet these techniques depend on contact sensors, which degrade the user experience and increase hardware cost. On the other hand, they might increase the effectiveness of HDx systems.
Driver hypovigilance [10], [11] can be determined by applying a combination of the following measures. Vehicle-dependent measures are determined by monitoring the deviation from the lane direction, the rotation of the steering wheel, the acceleration pedal pressure, etc. If the calculated metrics exceed a small threshold, the driver's risk of hypovigilance is very high. Physiological measures are also used to determine the fatigue state using medical vitals such as ECG, EMG, and EEG. Any visible distortion in these vitals can signal a change in the driver's state, which should generate an alarm. Several studies have been published on the identification of driver drowsiness or the detection of driver inattention from a physiological signal. Driver activity such as yawning, eye closing, eye blinking, and head position is tracked by a vision-based camera, and the device warns the driver if hypovigilance is observed. A visual example of these different measures is displayed in Figure 1.
The development of an HDx system is difficult owing to many factors, such as the choice of machine-learning method and inadequate parameter settings, which result in false negatives and false positives. In such systems, the prediction of a driver's drowsiness or fatigue level using different vision and multisensor-based features [14], [15] does not achieve absolute accuracy. Combining multiple modalities is a possible way to resolve some of these constraints [16]. This can be achieved with modality fusion, which is characterized as the incorporation and combination of various types of data collected from a subject. Modality fusion [17] may be done at any level of data processing, such as sensor-level, feature-level, or decision-level fusion. In practice, many researchers have utilized feature-level fusion techniques to implement HDx systems in a real-time environment. In feature-level fusion, the first step merges features from different modalities into a single vector in an early- or late-binding fashion. It has been shown that integrating different modalities [18] at the feature level is an efficient way to create a model for driver hypovigilance detection. One of the main challenges facing effective modality fusion is the criterion for combining modalities at the chosen stage of fusion (sensor, feature, score, or decision). So far, feature-level fusion has been commonly used to assess the stress levels and fatigue of drivers. Hence, it is desirable to develop an approach that integrates different modalities to design a more effective and robust hypovigilance prediction system with multistage output. This paper focuses on exactly this concern, as shown in Figure 2.
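The difference between feature-level and decision-level fusion described above can be illustrated with a minimal sketch. It is not the method of any surveyed system: the function names, the column-wise z-scoring step, and the unweighted averaging rule are all assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def standardize_columns(rows):
    """Z-score each feature column across samples (constant columns become 0)."""
    scaled_columns = []
    for column in zip(*rows):
        mu, sigma = mean(column), pstdev(column)
        scaled_columns.append(
            [(value - mu) / sigma if sigma > 0 else 0.0 for value in column]
        )
    # Transpose back so that each inner list is one sample again.
    return [list(sample) for sample in zip(*scaled_columns)]


def feature_level_fusion(modalities):
    """Early fusion: concatenate standardized features of every modality per sample.

    `modalities` is a list of feature tables, one per modality, each holding
    one row per sample (hypothetical layout).
    """
    scaled = [standardize_columns(rows) for rows in modalities]
    return [sum(parts, []) for parts in zip(*scaled)]


def decision_level_fusion(class_probabilities):
    """Late fusion: average class-probability vectors from per-modality classifiers."""
    count = len(class_probabilities)
    return [sum(values) / count for values in zip(*class_probabilities)]


# Two samples: a vision table (PERCLOS, blinks per minute) and a vehicle table
# (steering-angle deviation). The values are made up for the example.
vision = [[0.10, 12.0], [0.40, 20.0]]
vehicle = [[1.5], [3.5]]
fused_vectors = feature_level_fusion([vision, vehicle])
fused_probabilities = decision_level_fusion([[0.8, 0.2], [0.6, 0.4]])
```

In the early-fusion path a single classifier would be trained on `fused_vectors`, whereas in the late-fusion path each modality keeps its own classifier and only the outputs are combined.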
Various methodological techniques for the development of HDx systems are described and discussed in this paper. We have compared the efficiency and performance of these techniques against the technique proposed by the authors in [127]. In that paper, we describe the implementation of an IoT-based architecture for detecting driver drowsiness through mobile, energy computing in 5G and cloud-based environments. The comparison carried out in this paper highlights the need to integrate modern innovations in image processing, machine learning, and intelligent algorithms for multimodal and deep-learning-based computing. Although several survey articles [12] have addressed the same issue, none of these studies carried out a methodological comparison in the domain of deep learning and multimodal feature learning. A recent survey article described different papers on real-time driver drowsiness detection [13] without a comparison involving multimodal and deep learning architectures. Therefore, in this paper, we review methods for the detection of driver fatigue based on vision-based and multisensor-based features.
VOLUME 9, 2021
FIGURE 2. A visual example of state-of-the-art hypovigilance prediction using any one of the following measures, or a combination of these measures, with multimodal features.

A. MAJOR CONTRIBUTIONS
The paper's primary work is algorithm and feature modeling, followed by a comparison of the advantages and differences with other work. The main contributions of this article are as follows.
1) Different machine learning algorithms are evaluated on hardware-based benchmarks (CPU and GPU).
2) Methodological reviews are carried out to highlight recent trends in the development of hypovigilance detection (HDx) systems, covering vision-based (V-HDx), sensor-based (S-HDx), and multimodal-based (M-HDx) systems.
3) Different unimodal and multimodal features are compared using various machine learning algorithms and state-of-the-art HDx systems.
4) Online data sources and parameters are provided to assist other researchers in training the network.
5) Current and future directions for the development of HDx systems are provided to support further research in this domain.
6) The importance of multimodal-feature-based driver fatigue recognition in the deep-learning context is explained, which is new for a review article in this domain.
7) State-of-the-art comparisons are performed on four recent multimodal-based HDx systems to further discuss challenges in this domain.

B. PAPER ORGANIZATION
The remainder of this paper is organized as follows.
Section II describes the research protocol, which consists of the research questions, comparisons with other survey articles, and the selection of papers for this review article.
Section III details recent trends in the state of the art in machine and deep learning for detecting driver fatigue, covering vision-based (V-HDx), sensor-based (S-HDx), and multimodal-based (M-HDx) systems. Section IV describes different deep-learning-based models used in the past. In Section V, we perform experiments on selected datasets and HDx systems based on different statistical metrics. The results highlight the merits, demerits, and limitations of different parameter settings and environments for developing real-time HDx systems. This section also highlights recently used datasets. Section VI discusses the obtained results in detail and presents the current limitations, challenges, and future directions in this domain to help other researchers develop HDx systems that are up to the mark. Finally, Section VII concludes the paper.

II. RESEARCH PROTOCOL
The research protocol defines the detailed layout and the method planned for the review. It also defines the rules and instructions for conducting a survey of existing work on recent state-of-the-art hypovigilance detection (HDx) systems, which offers assistance to newcomers to research in this domain. It further provides information on traditional machine learning and deep learning-based approaches to developing HDx systems with multimodal features, as shown in Figure 3. The survey helps the reader with the performance comparison of traditional and deep learning-based HDx systems, the impact of deep learning and multimodal technologies, the kinds of deep models already implemented, and how the performance of HDx systems can be improved. Moreover, the survey covers cost, data analysis and management, and achieved results, and identifies the challenges related to both traditional and deep learning-based HDx systems.

A. RESEARCH QUESTIONS
Research questions are the core of any literature review or project, and they form the methodological starting point of scientific study in all disciplines. Table 1 lists the research questions on HDx system traits, together with the motivation for each and the section of the paper to which it is linked.

B. COMPARISONS WITH SURVEY ARTICLES
We have carried out extensive methodological comparisons of different features and machine-learning algorithms, in contrast to state-of-the-art survey articles in this domain. Table 2 shows the differences between our study and existing survey articles.

C. PAPERS QUALITY EVALUATION
Papers were searched using keywords covering visual, non-visual, and multimodal features together with machine-learning and deep-learning architectures. In general, the selection standard is set after the research questions are defined and before the research process is carried out. Here, for instance, irrelevant and out-of-scope papers were rejected. A study from a broader field was considered if it related to our subject. Research manuscripts published in top journals (SCI and ESCI) and top conferences were always considered, whereas old studies and research that did not fulfill our goal were excluded.

III. RECENT TRENDS OF HYPOVIGILANCE DETECTION SYSTEMS
Detection of driver drowsiness and fatigue has been an area of research for a while now. In early research, the steering motion of automobiles was regarded as an indication of hypovigilance [6], [8]. More recently, techniques have used face detection and recording for this purpose. In addition, some methods work with health vitals such as brain waves and heart rate and relate them to states of sleepiness [9]-[11]. Other studies combined vision-based methods with multisensor-based methods to define multimodal features for the detection of hypovigilance. Some authors used physiological measurements [12], extracted with different sensors that account for heart rate, skin conductance, and respiration rate, to detect driver hypovigilance. Orientation-based fatigue detection methods detect and verify the position of the eyes, using techniques such as template matching and appearance-based features to detect the state. Nevertheless, these methods faced problems because of errors arising from signal disambiguation. They also suffered from failed facial detection [14] resulting from abrupt movements, and from the discomfort of constantly worn hardware. Medical signals such as electroencephalography (EEG) and electrocardiogram (ECG) [15] have also been analyzed to differentiate between the sleepy and alert states of drivers. As a result, some authors suggested integrating vision-based methods with multisensor-based [16] techniques to extract multimodal features. Wu et al. [17] studied methods adopted to detect the alertness of drivers and categorized them. One category, related to the condition of drivers, involved metrics such as eyelid motion and percentage of eye closure; another comprised methods related to the efficiency of drivers, such as vehicle distance and lane detection; and a further category covered multimodal methods that incorporated these approaches. A single camera, multiple cameras, and several sensors have been used in the past to detect the hypovigilance state of drivers.
The major limitation of this type of system is its inability to detect fatigue early and accurately [18]. Early detection is needed so that the warning is issued at the initial stages. The literature is replete with parameters and measures that can be used to detect driver fatigue. A combination of these measures is needed, because a single measure or parameter may not be able to detect fatigue [19] across multiple stages. If the driver wears glasses or the face is hidden, the eyes or facial features cannot be detected easily. Therefore, using more than one measure or parameter is more suitable for this type of system.
A comprehensive literature review suggests that authors are deploying a combination of techniques, instead of a single machine learning technique, to optimize performance. To achieve this goal, some recent studies [107] have employed hybrid solutions to enhance the accuracy of the fatigue detection system. The system employed various vehicle parameters such as speed, acceleration, vehicle lane position, steering angle, and braking. These features were then combined with facial features to predict driver drowsiness. Research shows that these hybrid systems provide more accurate results at the cost of increased computation time. Several studies also used deep-learning models to classify driver fatigue. The above-mentioned parameters are the main parameters of fatigue detection due to their ability to monitor variation. However, environmental factors may yield inaccurate data for these parameters. The lighting conditions may fluctuate because of diverse weather conditions and because driving takes place by day or by night. The background behind the driver also affects these types of systems. Another factor that affects such systems is the vibration of the car while visual data are collected.
Researchers are constantly working on the development of HDx systems based on multimodal features instead of a single modality to overcome the above-mentioned limitations. By monitoring the heart rate of drivers [19] with a wearable tracker or extracting their facial expressions with an RGB camera, driving fatigue can be observed more effectively and robustly. However, wearable trackers can be intrusive and inconvenient, while an RGB camera can again be affected by light and other such factors. Also, the temporal details of fatigue characteristics and the relationships between the characteristics cannot be neglected if we rely completely on such approaches. To approximate the driver's fatigue condition, these fusion methods need to combine transient fatigue features into the classifier's input vector. Features therefore need to be obtained over a regular period of time to recognize the multilevel driver drowsiness state (drowsy, alert, very alert, very drowsy, or normal). While the methods described above are encouraging, further challenges must be considered, as mentioned in Table 3, Table 4, and Table 5. According to these tables, a new multimodal fusion-based approach to developing an HDx system is needed so that the maximum performance level may be achieved.
The precision of fatigue detection may be significantly improved by integrating physical characteristics and vehicular features. A technique that incorporates both driver characteristics and vehicle characteristics was proposed by Cheng et al. [20]. The parameters selected for fatigue identification included PERCLOS, blink rate, maximum eye-closure time, non-steering percentage, percentage of on-center driving, standard deviation of steering wheel angle, and standard deviation of lane direction. Twenty participants took part in the trial. To fuse data from both the driver and the car, the authors developed a model based on a multisource data fusion approach. A more recent study [21] introduced a self-adaptive dynamic recognition model that used the most powerful features for feature fusion to detect fatigue. These included physical characteristics as well as vehicle characteristics, based on visual measurements of the driver and on the actions of the car. This technique achieved an accuracy of 90.8 percent using only the vehicle action features and 91.6 percent using only facial features. The fusion of all fatigue features produced an accuracy of 92.1 percent, while an accuracy of 93.8 percent was achieved when using only the most powerful features.
Apart from the detection of drowsiness, another challenge is to forecast the degradation of a car driver's operating state [22]. In that paper, the authors investigated whether standard sources of information can be used both to detect drowsiness and to predict when a certain degree of drowsiness will be reached. The authors developed the HDx system on behavioral, vehicular, and physiological factors to define multimodal features. For the behavioral features, the authors used head and eyelid movements (blink duration, frequency, and PERCLOS), whereas for the physiological features they used heart rate variability and respiration rate. In addition, the authors used speed, steering wheel angle, and position on the lane as vehicular features. Two intelligent models were created to measure drowsiness and the time required to reach a certain level of drowsiness. The best result in both identification and prediction was obtained with behavioral metrics and additional information. Because they used multimodal features, they obtained better prediction results than other approaches. Based on this point, we have methodologically reviewed many papers in terms of vision-based, multisensor-based, and multimodal-based features. These technical details are presented in the upcoming sections.
In this section, we review state-of-the-art hypovigilance detection (HDx) systems. These HDx systems are broadly categorized into vision-based (V-HDx), multisensor-based (S-HDx), and multimodal-based (M-HDx) systems.

A. VISION-BASED (V-HDx) SYSTEMS
Much research has gone into the development of vision-based hypovigilance detection (V-HDx) systems that diagnose drowsiness by analyzing driver behaviors. The main parameters for identifying drowsiness are the eye state, eye blinking level, mouth state, and yawning frequency of a driver [39]. An important parameter for the diagnosis of driver drowsiness is the eye closure duration. V-HDx systems that use this parameter normally measure eye states and the location of the iris over a particular period to approximate the extent of eye blinking and the length of eye closure. PERCLOS (percentage of eye closure) is a consistent and accurate measure for assessing the driver's alertness level [12], [49]-[51] in such systems. Eye-state research typically uses PERCLOS as a drowsiness measure, commonly with an 80% eye-closure criterion [52], [53]. Generally, if the driver is sleepy, the eye-closure time increases and the PERCLOS value is greater than when the driver is awake.
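The PERCLOS measure discussed above can be sketched in a few lines. The sketch assumes that an upstream eye tracker already supplies a per-frame eye-openness value between 0 (fully closed) and 1 (fully open); the function names, the 0.2 threshold (eye at least 80% closed), and the sliding-window helper are illustrative assumptions rather than the procedure of any cited system.

```python
def perclos(eye_openness, closed_threshold=0.2):
    """Fraction of frames in which the eye is judged closed.

    eye_openness: per-frame openness values in [0, 1] (1 = fully open).
    closed_threshold: openness at or below which a frame counts as closed;
        0.2 mirrors an "at least 80% closed" criterion (assumed default).
    """
    if len(eye_openness) == 0:
        raise ValueError("at least one frame is required")
    closed_frames = sum(1 for value in eye_openness if value <= closed_threshold)
    return closed_frames / len(eye_openness)


def sliding_perclos(eye_openness, window, closed_threshold=0.2):
    """PERCLOS over every full window of `window` consecutive frames."""
    return [
        perclos(eye_openness[start:start + window], closed_threshold)
        for start in range(len(eye_openness) - window + 1)
    ]


# Five made-up frames: the eye closes in the third and fourth frame.
frames = [1.0, 0.9, 0.1, 0.0, 0.8]
overall = perclos(frames)                 # 2 closed frames out of 5
per_window = sliding_perclos(frames, 2)   # one value per 2-frame window
```

A rising value in `per_window` over time would correspond to the lengthening eye closures that the text associates with a sleepy driver.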
In these techniques, the eye area including the pupil region needs to be separated to measure PERCLOS. The main limitations of these techniques are improper lighting conditions, face occlusion, and sunglasses. To solve these issues, some V-HDx systems use IR (infrared) cameras [24], [49], [54], [55]. The most useful V-HDx methods that focus on visual characteristics are explained in detail in the following paragraphs and compared in Table 3. In the past, authors utilized different strategies for the detection of drowsiness based on V-HDx approaches. To classify these states, the authors used many machine learning approaches (traditional ones and recent technologies such as deep learning).
TABLE 3. Summary of state-of-the-art vision-based hypovigilance (V-HDx) systems using visual features and machine learning algorithms.

Artificial neural networks (ANNs) have been used in various research and engineering areas [27]. The fuzzy inference system (FIS) [50] is another method that has been used for the development of V-HDx systems. The support vector machine (SVM), used in [12], is based on statistical learning theory and can be used for the classification of patterns and the inference of nonlinear interactions between variables. The learning methodology of the SVM makes it appropriate for assessing humans' cognitive states. SVMs can produce both linear and nonlinear models, and the nonlinear models can be calculated as accurately as the linear ones. AdaBoost uses a boosting algorithm [51] for pattern recognition. Its benefits include high detection performance, quick identification time, and the potential to extend recognition functionality. Bayesian network (BN) classifiers have also been applied to the simulation of human behavior and used to diagnose inattention [52]. Despite these benefits, constructing a correct and reliable BN model takes considerable computing capacity and a vast volume of training data.
Another recent development has been to borrow techniques based on hidden Markov models (HMMs) from the field of speech processing and language technology [53]. In [53], the authors built an HMM to predict route schemes using vehicle speed, steering-wheel angle, and braking force. To detect visual features from the driver's face, some authors suggested a single vision-based camera, while others used multiple cameras to detect multiview features effectively. The researchers in [23] proposed a hybrid visual framework based on recognition of the driver's eyes for tracking drowsiness. Using two cameras working in the visible and near-infrared spectra respectively, they achieved reliable operation under in-car conditions and processing under both day and night conditions. A cascade of two classifiers conducts image recognition in both of these spectra. In [24], the exact description of the eye state is treated as a criterion for avoiding car crashes due to driver sleepiness. These studies highlighted that previous classification approaches are susceptible to eye localization errors and visual obstructions.
In [25], another approach was presented that combines image processing and machine learning techniques to detect driver drowsiness. The authors used a Haar cascade classifier with an SVM to recognize whether the eyes are currently open or closed. In [26], a multilevel fusion system was developed through the measurement of eye blinking and 3D head pose estimation. The authors estimated the head rotation in the three directions by using only three interest points on the driver's face. This system was evaluated on both the DEAP and MiraclHB databases. In [27], the authors used two descriptors, namely Pyramid transform domain (PLBP) and Multi-Block Histogram LBP (BHLBP), with an SVM to detect only human eyes in grayscale images. In [28], an AdaBoost and Contour Circle (ACC) algorithm was developed for recognizing whether the eyes are open or closed.
In [29], a driver monitoring system for hypovigilance (fatigue and distraction) was developed based on symptoms associated with the facial and eye regions. A fuzzy expert system (FIS) was used to combine the symptoms and estimate the level of driver hypovigilance. In [30], an Intelligent Drowsy Eye Detection using Image Mining (IDEDIM) system was developed. In [31], the authors used a cascaded regression in which different visual features were used jointly to estimate the eye orientation and likelihood of openness. Another study used three of the most powerful contextual attributes, i.e., continuous driving time, sleep length time, and current time, with a multi-class SVM classifier to enable real-time (online) identification of the fatigue condition. In [33], the authors built a set of geometric features on only three facial keypoints, namely the middle of the eyes, the corner of the mouth, and the tip of the nose.
In [34], a new V-HDx method was developed to incorporate machine vision for intelligent transportation through a deep learning (DL) architecture and action units (AUs). In that study, the authors demonstrated an innovative driver fatigue detection method based on the identification of fatigue-related facial action units and reported a significant improvement in accuracy (95%) compared with other methods. In [35], a DL architecture referred to as the deep drowsiness detection (DDD) network takes an RGB input video of a driver to learn effective features and detect drowsiness. Experimental findings show that DDD achieves 73.06 percent detection accuracy on the NTHU-drowsy driver detection benchmark dataset. Eye detection is an active area of study, and [36] addresses the verification of eye detections. In [37], a method called DriveCare was proposed, which uses video clips to assess the fatigue condition of drivers from cues such as the length of eye closing, blinking, and yawning. To enhance the tracking precision, Multiple Convolutional Neural Networks with kernelized correlation filters (MCNN-KCF) were used. Experimental results showed that DriveCare reached about 95 percent accuracy. In [38], the authors first used infrared imagery to capture images in a vehicle at night and then created an algorithm to detect the face of the driver. Next, an eye-detection algorithm combined a Gabor filter with a similar prototype to find the location of the corners of the eye and introduced an eye-validation process, thus improving the detection rate. As the third step, they used a spline to match the eyelid curve. This system was tested with more than 200 faces from the IMM Face Database, as well as in a real-time simulation.
In [39], researchers proposed a vision-based fatigue warning system for monitoring bus drivers. The system consists of head-shoulder detection, face detection, eye detection, estimation of eye openness, fusion, estimation of the percentage of eyelid closure (PERCLOS) sleepiness measure, and classification of the fatigue level. To approximate the eye state, a fusion algorithm based on adaptive convergence over the multimodal detections of both eyes was integrated. In [40], facial landmarks are located on the detected face; the eye aspect ratio, mouth opening ratio, and nose length ratio are then measured, and drowsiness is assessed based on their values. Deep learning algorithms were also introduced in an offline way. Via SVM-based classification, a sensitivity of 95.58 percent and an accuracy of 100 percent were achieved. In [41], a template matching method for feature extraction was extended to the kinematics of gait cycles, segmented by a stepwise search-based segmentation algorithm, with an SVM model for classification. The fatigue identification results on data from 20 recruited participants showed an accuracy of 90 percent.
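The eye aspect ratio mentioned for [40] can be illustrated with the commonly used six-landmark definition, in which the two vertical eyelid distances are divided by twice the horizontal eye width. Whether [40] uses exactly this formula, landmark ordering, or any particular threshold is an assumption; the coordinates below are made up.

```python
import math


def eye_aspect_ratio(landmarks):
    """Eye aspect ratio from six (x, y) eye landmarks p1..p6.

    Assumed ordering: p1 and p4 are the horizontal eye corners, p2 and p3 lie
    on the upper eyelid, and p6 and p5 lie on the lower eyelid below them.
    """
    p1, p2, p3, p4, p5, p6 = landmarks
    horizontal = math.dist(p1, p4)
    if horizontal == 0:
        raise ValueError("eye corners must not coincide")
    vertical = math.dist(p2, p6) + math.dist(p3, p5)
    return vertical / (2.0 * horizontal)


# Hypothetical landmark sets for an open and a nearly closed eye.
open_eye = [(0, 0), (1, 1), (2, 1), (3, 0), (2, -1), (1, -1)]
closed_eye = [(0, 0), (1, 0.1), (2, 0.1), (3, 0), (2, -0.1), (1, -0.1)]
ratio_open = eye_aspect_ratio(open_eye)
ratio_closed = eye_aspect_ratio(closed_eye)
```

The ratio drops toward zero as the eyelids approach each other, which is why a per-frame series of this value can feed closure-based measures such as PERCLOS.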
The first technique that senses blink completeness was implemented in [42]. Because blinks vary in speed and duration, a Recurrent Neural Network (RNN) is used as the classifier, given its suitability for sequence-based features. In [43], a new modular architecture for early identification of driver fatigue was proposed, in which system output is optimized by particle swarm optimization (PSO). The reported accuracy (90.4 percent), sensitivity (92.6 percent), and specificity (90.7 percent) are in line with the state of the art. In [44], the authors used the LBP technique to implement a fast and robust face detection algorithm that describes and normalizes facial expression images, and then used SVM to detect driver fatigue. A novel computer vision-based technique was developed in [45] to detect driver sleepiness from video taken by a camera. The suggested approach was tested on the public YawDD video dataset.
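The accuracy, sensitivity, and specificity figures quoted for these systems follow the usual confusion-matrix definitions. The sketch below shows how the three values relate; the counts are made up for illustration and do not come from any of the cited studies.

```python
def binary_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, and specificity from confusion-matrix counts.

    "Positive" means a drowsy sample, "negative" means an alert sample.
    """
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("at least one sample is required")
    positives = tp + fn  # truly drowsy samples
    negatives = tn + fp  # truly alert samples
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / positives if positives else float("nan"),
        "specificity": tn / negatives if negatives else float("nan"),
    }


# Made-up counts for illustration only: 100 drowsy and 100 alert clips.
metrics = binary_metrics(tp=90, fp=20, tn=80, fn=10)
print(metrics)
```

Accuracy is a weighted average of sensitivity and specificity, so on an imbalanced dataset a high accuracy can hide a poor detection rate for the drowsy class. This is one reason to compare systems on all three values.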
A V-HDx system was developed in [46] that combines head-shoulder detection, face detection, eye detection, eye-openness estimation, fusion, estimation of the percentage of eyelid closure (PERCLOS) as a drowsiness measure, and fatigue-level classification. Using another approach in [47], the authors developed multi-task cascaded convolutional neural networks (MTCNNs) to detect driver fatigue from multiple facial features. In [48], the authors developed a V-HDx system based on an architecture that detects the sleepiness of drivers from RGB videos of the drivers. They mainly used different transfer learning algorithms to classify the extracted features into four classes. In [49], the authors used the discrete wavelet transform (DWT) and entropy analysis to detect face features. Moreover, in [50], the authors developed a V-HDx system based on the hierarchical temporal Deep Belief Network (HTDBN) method. In that research, the authors first extracted high-level facial and head features from images and then used them to identify symptoms linked to drowsiness. These features are used to model and capture the interactions between the gestures of the eyes, mouth, and head. To cover broad variations in driver footage, they also collected a large, detailed dataset comprising different ethnicities, lighting conditions, and driving scenarios.
[Figure caption, partially recovered: ... [58] across each spectral band (delta, theta, alpha, beta, and gamma) for two mental states, alert and drowsy, in all subjects with different stages.]

B. MULTISENSOR-BASED (S-HDx) SYSTEMS
In comparison to V-HDx systems, multisensor-based S-HDx approaches can help to detect driver drowsiness in multiple stages (normal, drowsy, very drowsy, and extremely drowsy). S-HDx systems use physiological or non-visual features to detect drowsiness. They are broadly classified into two main groups: driver-based and vehicular-based hypovigilance detection systems. Driver-based characteristics typically relate to a driver's brain activity and heart rate, while vehicle-based characteristics include features such as brake pressure, vehicle speed variations, and wheel angles. Table 4 describes most of the existing S-HDx systems in terms of different parameters.

1) DRIVER-BASED S-HDX SYSTEMS
Exhaustion and sleepiness directly influence a person's physiological state and can lead to dangerous situations. It was mentioned in [66] that the physiological indices of the sleepy state can be differentiated from those of the normal state through different sensors. To assess physiological features, many authors used an individual's health measures, such as ECG, EOG, and EEG, either individually or in combination. Among these methods, the most promising and feasible approach is based on measuring the EEG, as seen in Figure 4. The EEG describes the state of activity in the brain in terms of alert and drowsy states. It is stated that the activity of delta and theta waves is significantly increased and the activity of alpha waves is marginally increased in a state of drowsiness [55]. The EEG is commonly recognized by researchers as a measure of the transition between various stages of sleep [67].
TABLE 4. Summary of state-of-the-art driver's fatigue detection systems using non-visual features and machine learning algorithms.
Several studies have used EEG signals to classify driver drowsiness [55]-[70]. These methods are intrusive because drivers must wear electrode helmets while driving so that EEG signals can be acquired for hypovigilance detection. However, some authors used other sensors that also contribute to the detection of multiple stages of driver drowsiness. In practice, EEG-based [12] S-HDx systems have a temporal resolution of 0.001 s and a spatial resolution of 20 mm and are commonly used in research on brain function. A driver's fatigue can be effectively detected using the frequency-domain features of EEG data (e.g., mean frequency, center of gravity of the EEG spectrum, and energy content of the α, β, θ, and δ bands). Similarly, the time-domain characteristics of EEG data, such as the standard deviation, the average value, and the sum of the squares of the amplitudes, provide useful information on brain function. Because of their complicated structure, intrusive nature, and effect on the driver's performance, these techniques are still not very feasible for driving in a real environment. In the studies of [67], [68], the authors used heart rate (HR) in beats per minute to detect the sleepy stage. It was noted that HR differs at different times; for example, a decrease in HR is reported during long drives at night. Moreover, the attention, mental behavior, and body energy of a driver are also factors that affect HR [69], [70]. Respiration rate (RR), the number of breaths inhaled and exhaled in one minute, is also a valuable measure.
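The frequency-domain and time-domain EEG features listed above can be sketched as follows. This is a minimal illustration under stated assumptions: the band edges are one common convention (they vary between studies), the spectrum comes from a naive DFT written with the standard library only, and the input is a synthetic 10 Hz tone rather than real EEG. A practical pipeline would normally use an FFT-based estimator such as Welch's method.

```python
import cmath
import math

# Assumed band edges in Hz; conventions differ between studies.
BANDS = {"delta": (0.5, 4.0), "theta": (4.0, 8.0),
         "alpha": (8.0, 13.0), "beta": (13.0, 30.0)}


def power_spectrum(signal, fs):
    """Return (frequencies, power) for the non-negative half of a naive DFT."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]  # remove the DC offset
    freqs, power = [], []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centered))
        freqs.append(k * fs / n)
        power.append(abs(coeff) ** 2 / n)
    return freqs, power


def band_powers(freqs, power):
    """Sum the spectral power that falls inside each band."""
    return {name: sum(p for f, p in zip(freqs, power) if lo <= f < hi)
            for name, (lo, hi) in BANDS.items()}


def spectral_centroid(freqs, power):
    """Power-weighted mean frequency (the spectrum's center of gravity)."""
    return sum(f * p for f, p in zip(freqs, power)) / sum(power)


def time_domain_features(signal):
    """Mean, standard deviation, and sum of squared amplitudes."""
    n = len(signal)
    mean = sum(signal) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in signal) / n)
    return {"mean": mean, "std": std, "sum_squares": sum(x * x for x in signal)}


fs = 128  # assumed sampling rate in Hz
signal = [math.sin(2 * math.pi * 10 * t / fs) for t in range(2 * fs)]

freqs, power = power_spectrum(signal, fs)
powers = band_powers(freqs, power)
dominant = max(powers, key=powers.get)
centroid = spectral_centroid(freqs, power)
features = time_domain_features(signal)
print(dominant, round(centroid, 2), round(features["sum_squares"], 2))
```

For a drowsiness detector, the band powers would typically be computed per channel and per time window and then passed to a classifier.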
The authors of [77] attempted to establish a relation between somnolence and RR, according to which RR begins to fall as drowsiness sets in and continues to fall until sleep begins. Figure 6 shows a visual example of this five-stage drowsiness detection system. Electrooculography (EOG) is another sensing method that provides a measure for eye monitoring [78]. The EOG signals are changed by eye movements, such as eye motion and blinking [60], [79]. In collecting EOG data, the positioning of the EOG electrodes takes on special significance [85]. Additionally, hypovigilance identification systems focused on blink activity are highly person-dependent. Such systems do not work well with people suffering from mental disorders, since they may blink more often in wakeful conditions, or their eyes can stay open even in sleepy conditions. Electromyography (EMG) is another sensing method for measuring and recording the electrical signal produced by the contraction of the muscle [86]-[88]. HDx based on EMG data has been researched by several scholars [89]-[93]. A shift of the mid-frequency content toward the lower spectral band is observed during muscle contraction [94]-[96]. The Galvanic Skin Response (GSR) is another sensor used to measure skin conductance [92], [96]-[98]. Electro-Dermal Activity (EDA) offers a measure of skin conductance that varies due to sweat gland secretion. This technique, however, is particularly susceptible to ambient moisture and temperature. Also, some authors used Skin Temperature (ST) to measure the temperature of the surface of the skin. For example, five degrees of drowsiness are defined in [99] by measuring the temperature of the nasal skin, the temperature of the forehead, and the temperature of the muscles.
Compared to systems based on visual characteristics (V-HDx), the physiological properties extracted from EEG, ECG, ST, or EOG provide more reliability and precision [51]. A major challenge for these measures, however, is to design systems that overcome the intrusive nature of the required hardware. The use of wireless technology such as Zigbee and Bluetooth [52] to acquire physiological signals in a non-intrusive fashion is a potential way to address this restriction, but it can affect reliability.

2) VEHICULAR-BASED S-HDX SYSTEMS
Research is ongoing into vehicular-based features that assess the driver's state of drowsiness from vehicle movements, such as steering wheel rotation, lane keeping, acceleration pedal movement, and braking. Steering wheel rotation and the standard deviation of the lateral (lane) position are the two most widely used vehicle movement characteristics for identifying the degree of driver drowsiness. Steering wheel movement (SWM) [53] is measured by a steering angle sensor mounted on the steering column. Drivers continually make micro-corrections in steering to compensate for environmental conditions such as minor road bumps and crosswinds. With rising drowsiness, drivers tend to make fewer micro-corrections in the motions of the steering wheel. Because of their specialized technological needs, SWM-based systems have limited applicability.
We now examine the most recent studies that utilized the driver's behavior, physiological, and vehicular-based features to develop HDx systems. An EEG-based sensor was used in [55] to detect driver drowsiness. A hybrid deep generic model (DGM)-based support vector machine technique was developed for this purpose. The experimental results showed a sensitivity of 91.10% and an accuracy of 55.48%. In another work [56], the authors used EEG data to study multiple entropy fusion and compare several channel areas. An accuracy of 98.3%, a sensitivity of 98.3%, and a precision of 98.2% were achieved. On the other hand, the authors in [57] developed a technique to predict driver fatigue using heart rate variability (HRV) derived from ECG and a deep learning model. In another work [58], the authors used EEG signals to detect driver behavior using the Fisher score. In another recent work [59], the authors developed an EEG-based system with pre-processing filtering to detect the hypovigilance state. 
They used the AlexNet transfer learning architecture to classify signals as either a normal or a fatigue condition with an accuracy of 90%. In [60], the authors used four types of entropy to extract features from EEG signals, with SVM as the classifier to detect the driver's stage. The authors reported a detection accuracy of 98.75%.
An EEG-based framework was developed in [61] to detect driver drowsiness by evaluating EEG signals with several entropy metrics, including spectral entropy, approximate entropy, sample entropy, and fuzzy entropy. With a gradient-boosted decision tree (DT), the average classification accuracy was found to be greater than 94%. The authors in [62] collected ECG signals and then applied multi-index fusion theory to detect the level of fatigue. ECG signals were also collected in [63] and used with a convolutional neural network (CNN) model to detect the fatigue level; the highest detection accuracy achieved was 98.79%. Likewise, in [64], the authors used energy parameters (α, β, θ) from EEG signals to study driver fatigue. Hierarchical deep learning algorithms with EEG signals were also used in [65] to detect driver fatigue on the publicly available SEED-VIG dataset. Similarly, in [66], a stack-based autoencoder (SAE) was used to detect four fatigue levels from EEG signals. In [67], the authors used EEG sensors along with features associated with electrocortical activities and eyeblink recognition analysis. The spectral analysis of the EEG samples immediately preceding lane departure events showed changes in the spectral density.
In [68], the authors used optimization algorithms to detect the hypovigilance state of drivers from EEG signals. It was a combinatorial technique that used a hierarchical extreme learning machine with particle swarm optimization, known as PSO-H-ELM. To test the performance of the PSO-H-ELM algorithm, they used several sensors, such as EEG, electrocardiogram (ECG), electrooculography (EOG), and surface electromyogram (sEMG). On average, the PSO-H-ELM algorithm achieved a detection accuracy of 83.12%. The authors in [69] detected the drowsiness stage by applying different entropy techniques to EEG signals. In contrast with these approaches, the authors in [70] suggested using multiple techniques for monitoring the drowsiness of the driver. They used a low-cost ECG sensor to extract heart rate variability (HRV) data for fatigue detection. In another work [71], the authors used heart rate variability (HRV) characteristics over successive driving levels, determined the fatigue level by sample entropy (SampEn), and reported a high detection rate.
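Sample entropy, which several of the studies above apply to EEG or HRV series, can be sketched in a few lines. The version below follows the standard SampEn(m, r) definition with Chebyshev distance and self-matches excluded; the embedding dimension m = 2 and the tolerance r = 0.2 times the standard deviation are common defaults assumed here, not values taken from the cited papers, and both input series are synthetic.

```python
import math
import random


def sample_entropy(series, m=2, r_factor=0.2):
    """SampEn(m, r) with r = r_factor * standard deviation of the series."""
    n = len(series)
    if n <= m + 1:
        raise ValueError("series is too short for the chosen m")
    mean = sum(series) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / n)
    r = r_factor * std

    def count_matches(length):
        # Use the same n - m starting points for both template lengths.
        templates = [series[i:i + length] for i in range(n - m)]
        count = 0
        for i in range(len(templates)):
            for j in range(i + 1, len(templates)):
                distance = max(abs(a - b)
                               for a, b in zip(templates[i], templates[j]))
                if distance <= r:
                    count += 1
        return count

    b = count_matches(m)      # matching template pairs of length m
    a = count_matches(m + 1)  # matching template pairs of length m + 1
    if a == 0 or b == 0:
        return float("inf")   # entropy is undefined; report it as unbounded
    return -math.log(a / b)


# Synthetic series: a perfectly regular rhythm and an irregular one.
regular = [0.0, 1.0] * 50
rng = random.Random(0)
irregular = [rng.random() for _ in range(100)]

regular_entropy = sample_entropy(regular)
irregular_entropy = sample_entropy(irregular)
print(regular_entropy, irregular_entropy)
```

A perfectly periodic series yields an entropy of zero, while an irregular series yields a larger value. Detectors of this kind use such differences in signal regularity as a feature.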

C. MULTIMODAL-BASED (M-HDx) SYSTEMS
In recent studies, multimodal-based (M-HDx) systems have gained a lot of traction because of their ability to use deep-learning architectures to recognize a driver's different activities and fatigue at different levels. The features of M-HDx systems are described in Table 5 and visually presented in Figure 5. Many authors now use various forms of data [71], such as the driver's physical condition, audio and visual features, and vehicle information. To enhance the generalization ability of HDx systems, the authors suggested integrating sensor data into the vision-based models. In recent years, early and late fusion techniques have been applied to combine multisensor and vision-based features into a single feature vector, known as a multimodal feature vector. This integration of sensor data with vision-based driver detection significantly increases the overall performance of M-HDx systems. In this section, we describe some of the current state-of-the-art M-HDx systems.
[Figure caption: Grand-averaged scalp topographies [77] across each spectral band (delta, theta, alpha, beta, and gamma) for two mental states (a) and five drowsiness levels.]
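Early (feature-level) fusion, as described above, amounts to scaling each modality's features and concatenating them into one vector per sample. The sketch below is a minimal illustration; the three modalities, their feature meanings, and all numeric values are hypothetical.

```python
import math


def standardize(rows):
    """Z-score each feature column of a list of equal-length rows."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    # A constant column has zero spread; fall back to 1.0 to avoid dividing by 0.
    stds = [math.sqrt(sum((v - mu) ** 2 for v in col) / n) or 1.0
            for col, mu in zip(columns, means)]
    return [[(v - mu) / sd for v, mu, sd in zip(row, means, stds)]
            for row in rows]


def early_fusion(*modalities):
    """Concatenate standardized per-sample features from every modality."""
    sample_counts = {len(m) for m in modalities}
    if len(sample_counts) != 1:
        raise ValueError("all modalities need the same number of samples")
    scaled = [standardize(m) for m in modalities]
    fused = []
    for i in range(sample_counts.pop()):
        row = []
        for block in scaled:
            row.extend(block[i])
        fused.append(row)
    return fused


# Hypothetical features for three time-aligned samples.
vision = [[0.30, 0.10], [0.12, 0.40], [0.25, 0.20]]  # eye and mouth opening
eeg = [[1.2, 0.8, 0.5], [2.1, 1.9, 0.4], [1.5, 1.0, 0.6]]  # band powers
vehicle = [[2.0], [9.5], [4.1]]  # steering variability

fused = early_fusion(vision, eeg, vehicle)
print(len(fused), len(fused[0]))
```

Standardizing before concatenation keeps a modality with large numeric ranges from dominating the fused vector. Late fusion would instead train one model per modality and combine their outputs.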
A hybrid multimodal NN architecture was developed in [15] to detect driver drowsiness by integrating EEG data, gyroscope data, and vision-based features. In another work [74], the authors suggested that, given developments in sensor technology, a multimodal solution can be more effective in detecting the level of alertness of drivers. They developed a multimodal alertness dataset comprising physiological, environmental, and vehicular features provided by Ford Auto Company. Some other companies are trying to develop intelligent vehicles that use advanced driver assistance systems (ADASs) [75]. In [19], the authors noted that past work mostly tracked the heart rate of drivers with a wearable tracker or extracted their facial expressions with an RGB camera. To overcome the intrusive nature of existing hardware, they used a single RGB-D camera to derive three fatigue characteristics: pulse rate, level of eye opening, and level of mouth opening. More significantly, that paper presented a new multimodal fusion recurrent neural network (MFRNN) that combines the three features to increase the accuracy of driver fatigue detection. To capture temporal information, a recurrent neural network (RNN) layer is used in the MFRNN. In particular, the authors used different fusion strategies to obtain effective fatigue features. The MFRNN model was used to improve the efficiency of the drowsiness detection system.
In another work [72], distracted driving was determined from images of the driver, including the face, arms, and hands, taken by a digital camera. The authors also suggested that it is necessary to integrate multiple sensors with a vision-based camera to define multimodal features, which they reported helps to improve the performance of M-HDx. In the first step, vision-based convolutional neural network (CNN) models were generated using transfer learning and fine-tuning approaches. Afterward, an LSTM-RNN model was built in the second stage by using sensor and vision-based features together. In [73], the authors used a mixture of camera video recordings and sensor data obtained on a cell phone to assess the behavior of the driver. Image and sensor data were used together for the first three actions, and only image data was used for the last three actions. Three separate deep learning methods were used for the classification process: CNN, CNN + RNN, and CNN + SVM. The highest classification rate, 87%, was reached by using CNN + RNN.
For the detection of driving fatigue, physiological signals such as EEG and EOG [76] have been successfully used as single modalities. A multimodal-based hypovigilance detection system was developed by integrating partial EEG and forehead EOG sensors to improve driving fatigue detection. The researchers found that the main region of the brain is an important place to detect effective EEG signals that can easily be combined with forehead EOG signals. Their experimental findings showed that the temporal EEG signals from six channels provide the best output when combined with forehead EOG to obtain mutual characteristics. In addition, to learn a better mutual representation, they suggested a novel multimodal fusion approach using the deep stack-based autoencoder (SAE) model. They measured useful components of three characteristics: saccade, blink, and fixation (the length of blink or saccade). Some characteristics displayed an insignificant association with fatigue states and were therefore not all used. Afterward, they used forehead EOG rather than conventional EOG. To monitor precise eye movement data, eye-tracking glasses were used, and a new PERCLOS measure is computed from Eq. (1) and Eq. (2) as follows:
$$\mathrm{Inval}(n) = \mathrm{blink} + \mathrm{saccade} + \mathrm{fixation} + \text{PES msr} \tag{2}$$
The work in [77] introduces mEBAL, a multimodal database for the identification of eye blinks and the measurement of attention levels. The frequency of eye blinks is related to cognitive function, and automated eye blink detectors have been suggested for several tasks, including estimating the level of attention, evaluating neurodegenerative disorders, identifying deceit, detecting driver fatigue, and face anti-spoofing. The mEBAL dataset was created to help other researchers effectively train networks for recognition of the hypovigilance stage based on multimodal features. 
The multimodal features were extracted in several experiments using vision-based cameras (both NIR and RGB) and EEG sensors. In total, the authors provided almost 6,000 samples from 38 different persons engaged in e-learning tasks. The preliminary study also included eye blink detection using a CNN, and each person's attention level was estimated from the eye blink frequency. The overall system diagram of the mEBAL system is visually represented in Figure 7.
FIGURE 7. An example of the mEBAL [77] dataset, which was collected in a constrained environment but is rich in pose, illumination changes, and other naturally occurring factors.
In [78], surface electromyography (sEMG), electroencephalography (EEG), seat interface pressure, blood pressure, heart rate, and oxygen saturation level data were collected as multimodal features.
Twenty male participants volunteered to drive for 60 minutes on a static simulator in this study. In the back and shoulder muscle groups, findings from sEMG showed significant physical fatigue (p < 0.05). The EEG showed significant (p < 0.05) increases in alpha and theta activity and significant decreases in beta activity during monotonous driving. Another M-HDx system was developed in [79] using physiological signals such as ECG, galvanic skin response, and respiration, which were recorded in real-world driving environments during 14 drives performed in a specified direction. Time, spectral, and wavelet multi-fusion features were extracted extensively. Afterward, prediction was performed based on sparse Bayesian learning (SBL), with principal component analysis (PCA) used to search for the optimal feature sets. An average accuracy of 89% was achieved. In another study [80], the authors used EEG, EOG, and ECG measurements for real-time mental fatigue detection. By using SVM, the authors achieved classification scores ranging from 80 ± 3 percent with a 4-s time window to 94 ± 2 percent with a 30-s time window. Similarly, in [81], 68 electrodes of EEG/ECG/EOG data and 8 channels of fNIRS data were collected simultaneously to develop an M-HDx system.
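The window lengths mentioned above (4 s versus 30 s) control how much signal each classification decision sees. The sketch below shows the segmentation step only, with an assumed sampling rate of 10 Hz and a synthetic 120-s recording; it does not reproduce the feature extraction or the SVM used in [80].

```python
def segment(signal, fs, window_s, step_s=None):
    """Split a signal into fixed-length windows.

    fs is the sampling rate in Hz, window_s the window length in seconds,
    and step_s the hop between windows (defaults to no overlap).
    """
    size = int(round(window_s * fs))
    step = size if step_s is None else int(round(step_s * fs))
    if size <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    return [signal[i:i + size]
            for i in range(0, len(signal) - size + 1, step)]


fs = 10  # assumed sampling rate in Hz
recording = list(range(120 * fs))  # synthetic 120-second recording

short_windows = segment(recording, fs, window_s=4)
long_windows = segment(recording, fs, window_s=30)
print(len(short_windows), len(long_windows))
```

Shorter windows give more frequent decisions and lower latency, while longer windows give each decision more evidence, which is consistent with the higher scores reported for the 30-s window.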
EEG and EOG signals were used in [82] to detect the hypovigilance state of drivers. The authors used the PERCLOS index, collected from eye-tracking glasses, as the vigilance annotation. A novel electrode placement was then identified for forehead EOG to enhance feasibility and wearability. Similarly, in [83], a new multimodal architecture for in-vehicle vigilance estimation from electroencephalogram and electrooculogram signals was developed. A deep Long Short-Term Memory (LSTM) network was used to develop the HDx system. In [84], the authors proposed a multimodal deep learning-based (LSTM) method that recognizes both visual and physiological changes in drowsiness. The combination of EEG and EOG signals was used in [85] to detect the hypovigilance state of drivers and was evaluated on the publicly available SEED-VIG dataset.
In [86], the authors used multimodal data from 58 drivers to test the HDx system. In that research, they used blink rate and posture information to form multimodal features. For detecting all stages of drowsiness, they reported an F1-score of 53.6% and a root mean square error of 0.620. In [87], the authors used multimodal physiological signals to assess the alertness level of drivers. In [88], the authors used CNN and DRL models to detect driver fatigue based on EEG signals. In [89], the authors collected data from 8 participants, covering motion signals (accelerometer and gyroscope), electrocardiogram (ECG), galvanic skin response, and CAN-Bus signals. Those signals are combined into a single multimodal feature vector to enhance the accuracy of the M-HDx system. To reduce the feature space, the authors also contributed an optimal selection of multimodal features.
A multimodal method for detecting the drowsy state in humans was studied in [90]. Two modalities were considered for analysis: video information and multisensor signals. Visual data conveys a great deal about human alertness, so the exact indicators in the video must be identified and captured for analysis and detection. The sensor signal indicating the alertness of the human brain is the EEG signal. Both physical and mental alertness were analyzed, and a framework for real-time drowsiness detection was proposed. In [91], a methodology was developed for drowsiness detection based on eye patterns monitored through video streams. In that study, the authors utilized data from a vision-based camera and sensor-based EOG signals. In [125], the authors developed the Face2Multi-modal dataset based on in-vehicle multimodal data. In particular, they investigated estimates of the drivers' heart rate, skin conductance, and vehicle speed to determine multimodal features. Those multimodal features are available through the Face2Multi-modal open-source link. According to the authors, Face2Multi-modal offers a dataset in which the physiological status and vehicle status of drivers are already recorded, and the initiative serves as a building block for many current or future customized in-vehicle designs for drivers. More details and updates about the project are available online at https://github.com/unnc-ucc/Face2Multimodal/. To estimate the multimodal states [126] of the driver (skin conductance and heart rate) and the driving status (speed) from facial expressions, the authors presented an in-vehicle real-time system. They used DenseNet as the model architecture for each type of data flow. The training information includes heart rate (beats per minute), skin conductance (µS), and velocity (km/h). Training is performed with PyTorch. According to the authors, further training is a work in progress, and they are optimistic that accuracy can be improved through comprehensive hyperparameter tuning.

D. AVAILABLE DATA SOURCES
Different HDx systems have utilized different data sources, which are either freely available online or exist as private datasets. These datasets are used to extract both visual and non-visual features. Tables 4, 5, and 6 show some of the online datasets used by researchers to extract vision-based, multisensor-based, and multimodal-based features, respectively, for training and testing machine learning classifiers. Examples of their use include Ultra-RLDD in [112], NTHU-DDD in [50], MultiPIE in [113], 3MDAD in [114] (as shown in Figure 8), MiraclHB in [26], BU-3DFE in [115], and AUC-DD in [116]. These datasets rely on computer vision technology to define visual features of driver fatigue. It can be noted from Figures 4 and 5 that a dataset of RGB images with 65 landmark points can be used to train the network classifier for defining the features. Developing a robust HDx system requires that these online and private data sources be used to train the machine learning algorithms for the selection of effective visual features. In addition, the classifier must be trained to recognize driver fatigue on smartphone or cloud computing-based platforms.
To develop hybrid HDx systems, several researchers are also using visual features together with various EEG-based biosensors to predict drowsiness. In practice, EEG signals are sometimes used to detect drowsiness, with three main building blocks. Several data sources (see Table 8) are also available online to train and test the machine learning algorithms. A visual example of EEG spectrogram images for the drowsy and alert states is displayed in Figure 5. Most past studies based on EEG biosensors used datasets that are available online, such as Min et al.'s Fatigue-EEG [117], Cao et al.'s Fatigue-Multi-channel EEG [118], [119], and G. Cattan et al.'s EEG-Alphawave [120].
One important dataset is mEBAL [77], a multimodal database for blink detection and attention level measurement. The mEBAL dataset is used to train networks for defining effective visual and non-visual features of drivers. In this dataset, different parameters are measured and suggested for functions including attention level estimation, analysis of neuro-related diseases, detection of deception, fatigue detection, and anti-spoofing. The mEBAL data were acquired on the basis of sensor-based blink detection and calibrated camera sampling. In particular, three different sensors are monitored simultaneously: near-infrared (NIR) cameras and an RGB camera capturing the face, and an electroencephalography (EEG) band capturing the user's cognitive activity and blinking events.
mEBAL has a total of 3,000 blinking samples from both eyes, captured by 1 RGB camera and 2 NIR cameras. Each sample consists of 19 frames (around 600 ms), giving a total of 342,000 images (3,000 × 19 × 2 × 3). Factors such as user status and light changes were considered at the time of acquisition to mimic real-life e-learning situations. Eleven of the 38 students wore glasses. A visual example of the mEBAL multimodal dataset is displayed in Figure 7. Several other important multimodal-feature datasets that can be used to train models are described in Table 7. The most frequently used multimodal datasets from this table for developing M-HDx systems are also displayed visually.

IV. TYPES OF DEEP LEARNING MODELS FOR HDx
In this section, we describe deep-learning architectures (DLAs), with special emphasis on modern machine learning algorithms. For the detection of driver drowsiness, several studies reported that DLA-based algorithms achieved high accuracy [127], [128], which suggests their suitability for use by other researchers. We describe the concepts, architectures, and techniques commonly utilized to detect driver drowsiness through visual features. Deep learning algorithms are recent techniques utilized for real-time detection of driver fatigue [95], [97], [129]-[131]. These DLA methods were applied mainly to video frames for pattern recognition and feature learning. Several variants of DLAs have been used in state-of-the-art systems for the detection of driver fatigue. Those DLA systems are described and compared in Table 4. Based on the aim of the DLAs, we have divided them into sub-categories to help potential readers. Deep learning models for classification include Convolutional Neural Networks (CNN) [124], [129]-[131], Recurrent Neural Networks (RNN) [95], [97], [132], Stack-based Autoencoders (SAEs) [133], [134], Restricted Boltzmann Machines (RBM), and Deep Belief Networks (DBN). For feature engineering, SAEs, RNNs, and CNNs were applied in earlier driver fatigue systems but failed to classify the data. RBM and DBN are the best deep learning algorithms for data representation, but they have not been tested in past driver fatigue detection systems. For our comparative analysis, we have selected CNN, RNN-LSTM, stack-based autoencoders (SAEs), and pre-trained transfer learning models. Transfer learning (TL) algorithms [135]-[141] are also used for comparison with these DL models.

V. METHODOLOGICAL COMPARISONS
As a contribution of this work, the authors have compared the current leading research in the domain of driver fatigue detection using visual feature-based techniques. In past systems, prediction from these features was performed by model-based, rule-based, supervised, unsupervised, and deep-learning algorithms (DLAs). In this paper, we focus on the DLAs that were applied in the past to predict driver drowsiness, rather than on conventional machine learning techniques. As discussed earlier, there is a dire need to develop an HDx system that detects driver fatigue in real time. Several traditional (MLP, SVM, or Boosting) and advanced deep learning algorithms [82], [89]-[128] (CNN, RNN, or DBN) are often used to develop a hypovigilance monitoring system. Compared to traditional machine learning approaches, there is a recent trend toward multilayer deep-learning algorithms (DLAs). These DLAs have reported very encouraging performance in a wide range of applications, in particular the detection of driver drowsiness. Moreover, the parameter settings and the selection of layers are the most important factors for DLAs in real-time driver fatigue detection.
To implement a real-time HDx system, researchers have used sets of rules or models to classify driver states while driving under different conditions. These systems are summarized in Tables 3, 4, and 5 for visual features, sensor-based features, and multimodal features, respectively. While reviewing these features, it was noticed that multimodal features combined with deep-learning algorithms (DLAs) achieved higher performance than other approaches. In addition to plain DLAs, authors are also combining different DLAs with traditional machine learning algorithms to develop hybrid classifiers. No trend toward transfer learning algorithms for HDx systems was observed in past work, so they are also considered in this paper to test their performance. Researchers have also utilized different variants of deep learning algorithms, such as CNN and RNN models. Those models are compared in Table 8 in terms of architecture and accuracy. However, setting the parameters of CNN and RNN models requires a large training dataset, and time complexity is the biggest threat for classification tasks. Table 8 lists various hybrid and recent DLAs along with their detection accuracy, which is not up to the mark when factors such as face occlusion, camera position, and sensor type are considered. In a real-time HDx system, these models have certain limitations and problems, which are described in the upcoming sections. One of the main issues is that setting up the parameters of DLAs requires very specialized knowledge of the model. For example, Table 8 lists some parameters of the CNN model, which has mostly been used in the past as a feature selector and recognizer for driver drowsiness. It was observed that researchers focused mainly on designing assistance programs that help drivers identify periods of distraction and raise a warning alarm. 
Such hybrid or multimodal systems provide a solid and reliable solution for predicting driver drowsiness. To increase the robustness of multimodal systems (M-HDx), it is possible to assign a weight to each sensor after training the classifier. In these systems, the choice of machine learning algorithm is also an important task. Time and space complexity always involve a trade-off when deploying these systems to monitor driver inattention.
Several recent studies have focused on the use of deep learning algorithms (DLAs) with multimodal features to develop an M-HDx system without a practical analysis of this topic. In this paper, we have focused on overcoming this shortcoming. We have examined the impact of the design of deep architectures on the detection of driver fatigue. In particular, we are interested in studying the structural properties of driver fatigue in terms of visual and non-visual characteristics. Recently, the detection of driver fatigue using deep learning algorithms has been greatly improved by increasing the number of layers and the number of processing units per layer, keeping performance up to the mark through the latest advances in computer vision systems. However, it is important to notice that deep architectures are only as powerful as the computational capabilities available for real-time driver fatigue recognition. In this article, the datasets used in the past and some HDx systems are compared in different scenarios.

A. SELECTIVE M-HDx SYSTEMS
To assess the performance of current state-of-the-art M-HDx systems, a qualitative comparative analysis is performed in this paper. The results of that comparison are presented here by using different datasets and environment scenarios. We have selected some recent systems to complete the state-of-the-art comparisons on real-time datasets, namely CogBeacon-ML [124], BROOK-DenseNet [126], Ford-dataset [74], and Riani-M-HDx [128]. Those M-HDx systems were selected because they are easy to integrate into a real-time system and focus on multimodal datasets; they have been briefly explained in the previous sections. Real-time processing of multisensor, vehicular, and environmental parameters was considered in comparison to the IoT-based architecture [126].
CogBeacon-ML [124] is a multimodal dataset designed to target the effects of cognitive fatigue on human performance. Because the CogBeacon-ML system is built on multimodal features, this dataset can easily be used to train a network for predicting real-time driver fatigue. The dataset consists of 76 sessions obtained from 19 male and female participants performing different variations of the Wisconsin Card Sorting Test (WCST), a common experimental and clinical psychological test designed to examine cognitive flexibility, reasoning, and aspects of cognitive functioning. During each session, the user's EEG activity, facial keypoints, real-time self-reports on cognitive fatigue, and detailed performance measurements obtained during the cognitive task (success rate, response time, number of errors, etc.) are recorded and thoroughly annotated. An open-source machine learning analysis is also available that predicts cognitive fatigue from the multimodal features. Rather than facial keypoints, the authors mainly used EEG sensors at different sampling frequency rates to detect cognitive fatigue. Afterward, they calculated the absolute band power for a given frequency range as the logarithm of the sum of the Power Spectral Density of the EEG data over that frequency range by using Eq. (3).
$$P_x = \log \sum_{f=I}^{n} \mathrm{PSD}_G(f), \quad x \in \emptyset \tag{3}$$

where $\mathrm{PSD}_G$ is the power spectral density obtained from the FFT of the raw EEG signal G, I and n are the minimum and maximum frequencies of the frequency band x, and ∅ is the set of five frequency bands. The relative frequency bands (R) γ, β, α, θ, and δ are measured during each task at a sampling frequency of 10 Hz. The authors also recorded facial movements to capture behavioral changes, capturing a set of 68 facial keypoints and the four corners of the bounding box with a webcam mounted on top of the screen at a frame rate of 2 FPS. As a baseline technique, a regression tree was applied to classify the facial keypoints.
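As an illustration of the band-power definition behind Eq. (3), the following minimal Python sketch computes the logarithm of the summed power spectral density over one frequency band. The function names, the periodogram normalization, the 128 Hz sampling rate, the synthetic test signal, and the 8-13 Hz and 13-30 Hz band limits are assumptions made for this example only; they are not taken from CogBeacon-ML [124].

```python
import cmath
import math


def power_spectral_density(signal, fs):
    """Return (freqs, psd) from a naive DFT periodogram of a real signal.

    Assumption: an unscaled one-sided periodogram |X_k|^2 / (fs * N).
    """
    n = len(signal)
    freqs, psd = [], []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(signal))
        freqs.append(k * fs / n)
        psd.append(abs(coeff) ** 2 / (fs * n))
    return freqs, psd


def absolute_band_power(signal, fs, f_min, f_max):
    """Logarithm of the summed PSD between f_min and f_max (inclusive)."""
    freqs, psd = power_spectral_density(signal, fs)
    total = sum(p for f, p in zip(freqs, psd) if f_min <= f <= f_max)
    # math.log raises ValueError if the band contains no power at all.
    return math.log(total)


# Synthetic one-second signal: strong 10 Hz component, weak 20 Hz component.
fs = 128  # assumed sampling rate in Hz
signal = [math.sin(2 * math.pi * 10 * t / fs)
          + 0.1 * math.sin(2 * math.pi * 20 * t / fs)
          for t in range(fs)]

alpha_power = absolute_band_power(signal, fs, 8.0, 13.0)   # assumed alpha band
beta_power = absolute_band_power(signal, fs, 13.0, 30.0)   # assumed beta band
```

For long recordings, an FFT-based or Welch-style estimator would normally replace the naive DFT used here for self-containment.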
BROOK-DenseNet [126] is another platform with an M-HDx architecture, built on the BROOK multimodal dataset. The new version of BROOK covers 34 participants, each driving in automatic and manual modes for about 20 minutes. BROOK recorded their facial videos, multimodal details, and driving status. The experiment included three driving conditions, and driving data were registered throughout. For each driving session, the participants were asked to fill out a questionnaire on the cognitive facets of their experience, including an appraisal of perceived confidence, relaxation, and situational awareness. For each driver, the study lasted roughly one hour. The BROOK dataset now consists of 11 data dimensions: Facial Video, Vehicle Speed, Vehicle Acceleration, Vehicle Coordinate, Vehicle Ahead Size, Steering Wheel Coordinates, Throttle Status, Brake Status, Heart Rate, Eye Monitoring, and Skin Conductance. A visual example of the BROOK data is displayed in Figure 9. In BROOK-DenseNet, the authors developed a DenseNet-BC-100 [126] convolutional neural network architecture for detecting the driver drowsiness stage. The BROOK features describing the driving status include vehicle speed, vehicle acceleration, vehicle coordinate, distance ahead, brake status, heart rate, skin conductance, and eye tracking. Those features are recorded in different sessions. The dataset records heart rate (per minute), skin conductance (µS), rpm, and speed (km/h). The training procedure is implemented in PyTorch. The authors also used an optimization technique to tune the hyperparameters, which is expected to enhance the accuracy of further training. In all, 320,000 frames are used in the ready-to-train BROOK dataset, divided into training, test, and validation sets at a ratio of 8:1:1.
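The 8:1:1 partition described above can be sketched in a few lines of Python. This is only an illustration: the function name, the fixed random seed, and the frame-level shuffling are assumptions, and the source does not state whether BROOK was split by frame or by driver (a per-driver split would avoid near-identical neighboring frames landing in different subsets).

```python
import random


def split_indices(n_items, ratios=(8, 1, 1), seed=0):
    """Shuffle item indices and split them into train/test/validation parts."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    total = sum(ratios)
    n_train = n_items * ratios[0] // total
    n_test = n_items * ratios[1] // total
    train = indices[:n_train]
    test = indices[n_train:n_train + n_test]
    val = indices[n_train + n_test:]  # remainder goes to validation
    return train, test, val


# 320,000 frames, as reported for the ready-to-train BROOK dataset.
train_idx, test_idx, val_idx = split_indices(320_000)
```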
BROOK-DenseNet was trained on an NVIDIA GeForce GTX 1660 GPU, and the specifics of the DenseNet architecture are as follows. The authors largely adopted the configurations recommended for DenseNet. The model has a depth of 100 layers and a growth rate of 12 across its dense blocks, and it is trained with a batch size of 128 and an initial learning rate of 0.1 for 50 epochs. The learning rate is divided by 10 at 50 percent and 75 percent of the total number of training epochs.
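The step schedule described above (initial rate 0.1, divided by 10 at 50% and 75% of training) can be written as a small function. This is a minimal sketch: the function and parameter names are hypothetical, and the exact epoch at which each drop takes effect (here, the first epoch at or beyond the milestone) is an assumption.

```python
def step_learning_rate(epoch, total_epochs=50, initial_lr=0.1,
                       milestones=(0.5, 0.75), factor=10.0):
    """Learning rate for a zero-based epoch under a step-decay schedule."""
    lr = initial_lr
    for fraction in milestones:
        # Divide once for every milestone that has already been reached.
        if epoch >= int(total_epochs * fraction):
            lr /= factor
    return lr


# Learning rate for each of the 50 epochs.
schedule = [step_learning_rate(epoch) for epoch in range(50)]
```

With 50 epochs, the drops fall at epochs 25 and 37 under this rounding convention.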
Ford provided another multimodal dataset [74] (Ford-dataset) with training, validation, and test sets covering physiological, environmental, and vehicular modalities. A simulator was used to collect data from driving sessions. The data came from 100 drivers of different ages and genders, and sequential measurements were collected every 100 milliseconds during each two-minute trial. The features were distributed across the three modalities as 8 physiological, 11 environmental, and 11 vehicular features. The total number of instances was 604,329 for the training set (merging the previously separate training and validation sets) and 120,840 for the test set, drawn from 469 training, 31 validation, and 100 test trials. Alert and drowsy instances may occur within the same trial. The dataset can be visualized along its three modalities. We made two interesting observations before processing the results. First, across all instances in the training and test sets, one physiological feature and two vehicular features always had a value of zero. Thus, these three features, namely P8, V7, and V9, were eliminated, resulting in a final set of 7 physiological, 11 environmental, and 9 vehicular features. The Ford-dataset system mainly uses the ''Phys + Env + Veh'' combination, fusing the different modalities to select all 27 features. The authors used a Naive Bayes classifier to predict driver drowsiness based on these 27 features.
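The elimination of always-zero features can be illustrated with a short Python sketch. The helper name, the toy data, and the E1 to E11 naming of environmental features are assumptions for this example; only the names P8, V7, and V9 and the 30-to-27 feature count come from the description above.

```python
def drop_constant_zero_features(rows, names):
    """Remove every feature column whose value is zero in all rows."""
    keep = [j for j in range(len(names)) if any(row[j] != 0 for row in rows)]
    kept_names = [names[j] for j in keep]
    kept_rows = [[row[j] for j in keep] for row in rows]
    return kept_rows, kept_names


# Assumed naming: 8 physiological, 11 environmental, 11 vehicular features.
names = ([f"P{i}" for i in range(1, 9)]
         + [f"E{i}" for i in range(1, 12)]
         + [f"V{i}" for i in range(1, 12)])

# Toy instances with strictly positive values, then three columns forced to 0.
rows = [[(i + j) % 5 + 1 for j in range(len(names))] for i in range(4)]
for row in rows:
    for zero_name in ("P8", "V7", "V9"):
        row[names.index(zero_name)] = 0

kept_rows, kept_names = drop_constant_zero_features(rows, names)
```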
Riani-M-HDx [128] provided another dataset of multimodal features that uses action units (AUs) extracted from the facial deformation of drivers. The authors built an M-HDx system based on the OpenFace platform to identify 18 different AUs. In practice, the OpenFace framework provides the best visual features for detecting the driver's yawn. In the Riani-M-HDx dataset, the authors used four physiological sensors to extract statistical features. The final feature collection consisted of a total of 77 physiological features, including 50 features of Blood Volume Pulse (BVP), 7 features of Skin Conductance (SC), 9 features of Respiration Rate (RR), 7 features of Skin Temperature (ST), and 4 features derived from the combined BVP and RR sensors, such as the mean and max-min heart rate differential, which is a measure of the variability of breath to heart rate. After computing the maximum, mean, minimum, and standard deviation of each feature to produce four distinct vectors, these were concatenated into a new vector of 308 measurements. The authors used a decision tree (DT) classifier to detect driver drowsiness.
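The arithmetic behind the 308-element vector (77 features times four statistics) can be made concrete with a minimal Python sketch. The function name, the toy sample windows, the ordering of the four blocks, and the use of the population standard deviation are assumptions for illustration, not details confirmed by [128].

```python
import statistics


def summarize_features(windows):
    """Concatenate per-feature max, mean, min, and standard deviation.

    `windows` holds one list of samples per feature; the result has
    4 * len(windows) entries ordered as [max..., mean..., min..., std...].
    """
    maxima = [max(w) for w in windows]
    means = [statistics.fmean(w) for w in windows]
    minima = [min(w) for w in windows]
    stds = [statistics.pstdev(w) for w in windows]  # population std (assumed)
    return maxima + means + minima + stds


# Toy data: 77 physiological features, five samples each.
windows = [[float(i + k) for k in range(5)] for i in range(77)]
vector = summarize_features(windows)
```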
The DROZY dataset, shown in Figure 10, is also available to train and test M-HDx systems. However, we did not use it for comparisons because it does not come with source code to analyze the data. In contrast, both the software and the data sources that we used, aligned to the same time flow, are consolidated in this comparison study, as displayed in Figure 11. To perform comparisons, we have used all four data sources against our IMSIU-DFD system [127], and the results are reported in the upcoming sections. The major components and devices used to compare state-of-the-art hybrid systems in the IMAM University driver's simulator environment (IMSIU-DFD) [127] are presented in Table 9.

B. EXPERIMENTAL SETUP
To perform statistical comparisons, we have fixed binary, ternary, and five-stage driver drowsiness problems in real-time and simulator environments. For all of these multistage problems, we considered various parameters such as accuracy, sensitivity, specificity, reliability, F1-score, and area under the receiver operating curve (AUC). The testing environment contains face occlusion, including sunglasses, and night-time driving conditions. In general, visual, vehicular, and non-visual features are extracted in this comparison study to assess the performance of various machine learning and deep learning algorithms for predicting driver drowsiness. The systems with benchmarks selected for these comparisons, CogBeacon-ML [124], BROOK-DenseNet [126], Ford-dataset [74], and Riani-M-HDx [128], are shown in Figure 11.

TABLE 9. Performance comparisons of state-of-the-art driver fatigue detection based on visual-features and deep learning algorithms.

FIGURE 11. A visual example of four different datasets and software to perform comparisons, where CogBeacon-ML [124], BROOK-DenseNet [126], Ford-dataset [74], Riani-HDx [128] and IMSIU-DFD [127].

C. STATISTICAL METRICS FOR ASSESSMENTS
To evaluate the performance of HDx systems, we have used several statistical metrics to check the effectiveness of the classifier with respect to the dataset and environmental settings. In these comparisons, we have also used the AUC to show the impact of different classification algorithms. The statistical metrics for evaluating the framework performance are measured as follows:

1) Accuracy (ACC) is the first measure; it assesses the overall correctness of the HDx system and is calculated by the following equation.

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$

2) Specificity (SP) is the second statistical measure; it reflects how precisely the M-HDx system identifies the true negative results and is measured by the following equation.

$$\mathrm{SP} = \frac{TN}{TN + FP}$$

3) Sensitivity (SE) is the third statistical measure; it accounts for the capacity of a classification model to correctly recognize the fatigue stage and is measured by the following equation.

$$\mathrm{SE} = \frac{TP}{TP + FN}$$

Here, TP denotes the cases predicted positive that are actually positive, and TN denotes the cases predicted negative that are actually negative. FN denotes the cases predicted negative that are actually positive; these instances are referred to as type II errors. FP denotes the cases predicted positive that are actually negative; these instances are referred to as type I errors. The true positive rate (TPR) indicates the proportion of correct decisions in identifying the class. The overall detection accuracy (ACC) of the HDx systems is calculated as an average. The performance of different HDx systems is evaluated by the estimators of precision (PR), sensitivity (SE), and specificity (SP). To compute these estimators, the true positive rate (TPR), true negative rate (TNR), false positive rate (FPR), and false negative rate (FNR) should first be measured. For multi-class classification, we divided the comparison results into two major steps. First, we performed comparisons on the 2-class (binary), 3-class (ternary), and 5-class classification problems as described in Table 8. The final estimators are calculated by averaging the results over all these experiments.
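The estimators above follow directly from the four confusion counts. The Python sketch below computes ACC, SE, SP, and PR for one class and averages them over classes in a one-vs-rest fashion. The function names and the toy labels are hypothetical, and treating "the average results" as an unweighted per-class (macro) average is an assumption about how the averaging was done.

```python
def _ratio(numerator, denominator):
    """Safe division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity, and precision from raw counts."""
    return {
        "ACC": _ratio(tp + tn, tp + tn + fp + fn),
        "SE": _ratio(tp, tp + fn),
        "SP": _ratio(tn, tn + fp),
        "PR": _ratio(tp, tp + fp),
    }


def one_vs_rest_counts(y_true, y_pred, positive):
    """Confusion counts when `positive` is treated as the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn


def macro_average(y_true, y_pred, labels):
    """Unweighted mean of each metric over all classes (assumed averaging)."""
    per_class = [binary_metrics(*one_vs_rest_counts(y_true, y_pred, label))
                 for label in labels]
    return {name: sum(m[name] for m in per_class) / len(per_class)
            for name in per_class[0]}


# Toy three-stage example using the stage codes V, N, and E.
y_true = ["V", "V", "N", "E", "E", "N"]
y_pred = ["V", "N", "N", "E", "V", "N"]
averaged = macro_average(y_true, y_pred, labels=["V", "N", "E"])
```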

D. EXPERIMENTAL ANALYSIS
Extracted from physiological cues, the feature set can contain redundant information (features that cannot distinguish between alertness and drowsiness), which, when fed directly to the classifier, may lead to a loss of performance. Therefore, prior to the classification stage, all the feature sets were subjected to a feature selection process. We used a paired t-test to select only the statistically significant features (p < 0.05) that can distinguish between the alert and drowsy states across all participants. Before carrying out the t-test, we verified that the data were normally distributed by means of the Kolmogorov-Smirnov test [41], as this is a requirement for applying the t-test. Various EEG-based systems for detecting drowsiness have been developed by using multiple sensors. EEG signals are also used for the detection of drowsiness in this paper, with three key building blocks forming the HDx system. The proposed building blocks use both raw EEG signals and their corresponding spectrograms. In the first building block, the energy distribution and zero-crossing distribution properties are measured from the raw EEG signals, while spectral entropy and instantaneous frequency characteristics are extracted from the EEG spectrogram images. In the second building block, deep feature extraction is applied directly to the EEG spectrogram images by using a pre-trained RNN-LSTM model to extract non-visual features. In the third building block, the discrete wavelet transform (DWT) is used to decompose the EEG signals into related sub-bands. The spectrogram representations of the sub-bands yield their instantaneous frequencies, from which statistical characteristics such as the mean and standard deviation are collected. For classification, each feature group from each building block is fed to a long short-term memory (LSTM) network.

Afterwards, the ECG data channel was used to obtain time series of heart rate variability to measure the movements of the human body, which were treated as statistical samples. The distribution of the values in these samples was then analyzed by calculating the mean, standard deviation, skewness, and kurtosis. These measures were obtained through the IMSIU-DFD [127] M-HDx system, which is used as the base system for testing and comparison. We mostly implemented behavioral and physiological features to develop the IMSIU-DFD system, but to perform the other comparisons based on environmental and vehicular features, we mounted a camera and used a USB OBD scanner to capture real-time vehicular data from a 2013 Toyota RAV4. Multimodal classification was carried out by incorporating the different elements. Four different methods were used in two stages. First, by concatenating the features obtained from the visual and physiological sources, we conducted an early modality fusion to create a single feature vector, which was then used for classification. The fused decision is computed by using Eq. (4) based on the multistage classification problem:

$$\ldots \cdot \min_{n}(\ln(\mathrm{Phych})) + \mathrm{Environment} + \mathrm{Vehicular}) \tag{4}$$

where the parameter n is the number of features extracted from each modality, and the remaining terms denote the visual features and the physiological features of the driver. A majority voting scheme is also used to predict the state of driver drowsiness, and the environmental and vehicular features are integrated to compare the performance with other state-of-the-art M-HDx systems. To evaluate the performance, we used four different M-HDx systems: CogBeacon-ML [124], BROOK-DenseNet [126], Ford-dataset [74], and Riani-HDx [128]. The training and validation accuracies are also computed over 10 epochs, and the corresponding accuracy graphs are shown in Figure 12.

To perform the experiments, we used various state-of-the-art machine learning approaches to characterize the driver's behavior, especially fatigue. These comparisons help us to draw a clear distinction between drowsiness and fatigue parameters under different conditions and levels. Five stages were used, namely V: Very alert (without distraction), F: Fairly alert (drowsiness with distraction), N: Neither alert nor drowsy, M: Moderate drowsy, and E: Extremely drowsy, and they were measured by SE: sensitivity, SP: specificity, and AUC: area under the receiver operating curve. In all subsequent tables and paragraphs, the notation ''*/*/*'' corresponds to SE/SP/AUC values. The highest performance values are displayed in bold and the second highest are underlined. We ran a 10-leave-one-out cross-validation scheme for our evaluation, and the average performance is presented.
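The two fusion ideas mentioned above, early fusion by feature concatenation and a majority vote over stage predictions, can be sketched in plain Python. This is an illustrative sketch only: the function names, the toy feature values, and the tie-breaking rule (the first label seen among tied labels wins) are assumptions, and the sketch does not implement Eq. (4).

```python
from collections import Counter


def early_fusion(*modalities):
    """Concatenate per-modality feature lists into one feature vector."""
    fused = []
    for features in modalities:
        fused.extend(features)
    return fused


def majority_vote(predictions):
    """Return the most frequent label; ties go to the earliest such label."""
    if not predictions:
        raise ValueError("at least one prediction is required")
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


# Toy feature vectors for the visual and physiological modalities.
visual = [0.31, 0.12, 0.88]
physiological = [72.0, 4.5]
fused_vector = early_fusion(visual, physiological)

# Stage predictions (V, F, N, M, E codes) from four hypothetical classifiers.
stage = majority_vote(["M", "E", "M", "N"])
```

In early fusion a single classifier sees the fused vector, whereas a voting scheme combines the outputs of separately trained classifiers.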
Experiment I: Table 12 presents state-of-the-art comparison results for five-stage hypovigilance detection (HDx) systems using the driver's behavioral features on 15 different subjects and 25 minutes of recorded time. Behavioral features are unavailable in Ford-dataset [74], so it was excluded from this comparison. On average, the CogBeacon-ML [124] system achieved somewhat lower results on four stages (V, F, M, E) than on the N-stage. The performance of BROOK-DenseNet [126] is somewhat similar to that of the CogBeacon-ML [124] system. It was noticed that on the four stages (V, F, M, E), both approaches achieved the same performance level when using the driver's behavioral features. CogBeacon-ML also achieved better N-stage performance than BROOK-DenseNet and Riani-M-HDx. Compared to these methods, Riani-M-HDx [128] achieved better performance, but it is still not up to the mark for implementation in a real-time environment. As shown in this table, the Riani-M-HDx system achieved results of 50.6/52.5/0.51 for the V-stage, 43.5/44.3/0.44 for the F-stage, 40.6/41.5/0.40 for the M-stage, and 44.6/47.5/0.45 for the E-stage. This performance was achieved due to the use of the best features and a deep-learning (DL) architecture, compared to the traditional logistic regression technique used in the CogBeacon-ML system. Moreover, the deep-learning architectures used in the BROOK-DenseNet and Riani-M-HDx systems required many training samples, fine-tuning, and a careful selection of layers. However, if camera-based visual input is unavailable due to environmental or face occlusion factors, all of these methods fail completely.
Experiment II: Table 13 describes five-stage hypovigilance detection (HDx) systems using physiological features on 15 different subjects and 25 minutes of recorded time. On average, the Ford-dataset system achieved considerably lower results on four stages (V, F, N, E) than on the M-stage, where it obtained its second-highest result. The CogBeacon-ML system achieved performance somewhat similar to the Ford-dataset approach, and BROOK-DenseNet performed similarly to both the Ford-dataset and CogBeacon-ML systems. Compared to these methods, Riani-M-HDx achieved better performance. Its higher results are due to the better use of physiological features than in the other systems. This system also performed well under environmental and face occlusion conditions. Because it relies on physiological rather than behavioral features, the Riani-M-HDx system achieved the best performance. With behavioral features, DL methods achieved higher detection accuracy, but this is still not up to the mark for implementation in a real-time environment.
Experiment III: Table 14 presents the comparison results for five-stage hypovigilance detection (HDx) systems using vehicular features on 15 different subjects and 25 minutes of recorded time. Among the four systems, only two HDx systems (BROOK-DenseNet [126] and Ford-dataset [74]) use vehicular features to detect driver fatigue. The results in this table show that using vehicular features alone yields poor performance, even with DL architectures and under normal environmental conditions. On average, Ford-dataset and BROOK-DenseNet achieved similar performance when compared in a simulator, with the Ford-dataset HDx system performing better than BROOK-DenseNet.
Experiment IV: The comparative results obtained from Experiment IV are presented in Table 15. This table reports state-of-the-art comparisons for five-stage hypovigilance detection (HDx) systems using environmental features on 15 different subjects and a recorded time of 25 minutes. Of the four systems, only Ford-dataset [74] uses environmental features to detect the driver fatigue stage.
Experiment V: The comparison results are presented in Table 16 by using multimodal (Behavioral + Psychological) features, which have been mostly utilized in the past to develop HDx systems. These results are based on five-stage HDx systems using multimodal (Behavioral + Psychological) features with early fusion, tested on 15 different subjects and 25 minutes of recorded time. On average, Riani-M-HDx achieved better performance than the other three systems. Similarly, BROOK-DenseNet obtained the second-highest results in most of the stages when multimodal features were used to detect driver fatigue.
The multimodal representations show a gradual improvement in multiclass classification relative to the use of features from individual modalities. When using early fusion, the precision exceeds 45% for the five-class M-HDx systems. In addition, with early fusion the per-class AUC reached its highest value in all multi-class methods. The precision of the late fusion process, however, was not comparable to that of early fusion. Most interestingly, early modality fusion outperforms all other approaches. This may suggest a close association between visual characteristics and the states identified by the classes ''Alert'' and ''Drowsy'', but also their failure to perform as well when distractions are targeted as explicit states. Physiological data tend to give somewhat more consistent findings and are likely to have a positive impact on our multimodal studies as well. The small amount of available data, which does not accurately reflect the targeted groups, particularly in the 5-stage and 3-stage problems, may be a potential explanation for this finding. The problems related to face occlusion, head position, and environmental conditions are addressed by using multimodal features in a DL architecture. We expect to explore these observations further.
Next, there is a dire need to determine the effect of machine learning algorithms on multimodal-feature-based (M-HDx) systems for the detection of different levels of driver fatigue. We have selected traditional machine learning algorithms such as ANN, SVM, and Logistic Regression + ANN, and hybrid DL architectures (HDLA) such as CNN + SVM, CNN + Naïve Bayes, CNN + RNN-LSTM, and DenseNet-BC100, all based on early fusion. These algorithms were selected for Experiment VI because they have mostly been utilized in the past to predict the driver's hypovigilance states. These experiments are conducted based on the SE, SP, ACC, PR, MCC, AUC, and F1 statistical measures on two- and three-stage DFD systems. The notation ''*/*/*'' denotes drowsy/alert/distracted values, and higher values indicate better performance. The results are presented in Table 18 to Table 20. Based on these experiments, BROOK-DenseNet and Riani-M-HDx achieved higher performance values than the Ford-dataset and CogBeacon-ML fatigue detection systems. However, different trends are observed across the hybrid deep-learning algorithms (HDLA). Still, the HDLA provide the best performance compared to the other machine learning algorithms when using M-HDx features.
Experiment VII: The results of Experiment VII are presented in Table 21. The performance of the M-HDx system improves when CNN and Naïve Bayes classifiers are used to detect driver fatigue, compared to the DenseNet-BC100 model. These results may change if the training data size is increased. The Riani-M-HDx system is also used to test the performance of M-HDx features with different machine learning algorithms on a three-stage prediction of fatigue. It was noticed that the DenseNet-BC100 and CNN + RNN-LSTM models achieved the best performance, although it was comparable to that of the other machine learning algorithms. From Table 22, it was noticed that the DenseNet-BC100 classifier outperformed the other learning algorithms.
Experiment VIII: It is also important to measure the computational time when the same learning strategy is implemented on different benchmarks. This experiment is presented in Table 23 for five-stage detection of driver fatigue. For benchmarks, we used a Tesla K80 GPU with 2496 CUDA cores and 12 GB of dedicated GDDR5 VRAM, and an Intel(R) Xeon(R) CPU at 2.3 GHz (1 core, 2 threads) with 12.6 GB of RAM. The results are reported in Table 23 for the different M-HDx systems. The complexity is calculated based on 20 minutes of data recorded from 10 persons. On average, the Riani-M-HDx system achieved the highest accuracy of 91.5% with low training (6.35 s) and testing (0.35 s) times compared to the other systems. These times were measured on the GPU, which provides faster calculations due to its dedicated memory, as compared to the shared memory of the CPU. Moreover, it is noticed that when seven processor cores are used, the testing time improves slightly, but the GPU still provides faster testing and training times. This point was overlooked by many past M-HDx systems developed for multistage driver fatigue detection.
Experiment IX: Transfer learning (TL) algorithms are the latest trend in deep-learning models. Therefore, separate experiments are conducted to check the performance of TL algorithms on Riani-M-HDx, chosen over the other M-HDx systems because it mostly outperformed them. The TL algorithms used in this comparison are VGG-16, VGG-19, ResNet-50, DenseNet, InceptionV3, and Pretrain-CNN models. From these experiments, it was observed that the hybrid HDLA and transfer-learning (TL) algorithms achieve high detection and prediction accuracy compared to standard machine learning and DL algorithms. Therefore, the TL algorithms are selected based on the experimental analysis for predicting driver fatigue. To set up the TL algorithms, we report the validation loss, epochs, and training and testing times in Table 24 for the Riani-M-HDx system. Compared to VGG-16, VGG-19, InceptionV3, and the pre-trained CNN, ResNet-50 and DenseNet achieved the highest detection accuracies with a lower computational burden. These results are described in Table 25 and are shown graphically in terms of AUC and precision-recall in Figure 13. These results are obtained even for 5-class detection of driver drowsiness. Hence, transfer learning algorithms provide the best performance compared to traditional machine learning, hybrid, and standard deep learning algorithms. However, the results can be further improved by increasing the training dataset and tuning the selection of deep layers. They may also be enhanced by including different environmental conditions and camera positions when defining the visual features.

VI. DISCUSSIONS
Having presented the results in detail in the previous section, we now discuss their significance. Several hypovigilance detection (HDx) systems have been developed based on vision-based features, sensor-based features, and multimodal approaches. With vision-based features, a person's drowsiness reveals itself in physical and physiological changes that can be observed visually through the eyes (PERCLOS measure), mouth activity, and head nodding. These visual features are extracted and then used to classify driver hypovigilance or drowsiness with traditional machine learning (TML), deep-learning (DL), and transfer learning (TL) architectures. Drowsiness can also be detected from digitally registered signals via EEG and/or heart rate monitoring systems. In past systems, non-visual features were extracted from the driver's physiological measures and from vehicle parameters. In the case of vehicle parameter measurements, the authors predicted driver fatigue from parameters such as steering wheel, acceleration pedal, and speed. In practice, those approaches depended mostly on the road shape, the way of driving, and the performance of the vehicle. Other studies utilized electroencephalography (EEG), electrocardiogram (ECG), electrooculography (EOG), and surface electromyogram (sEMG) sensors to predict driver fatigue. Based on these sensors, the authors detected the drowsy and alert conditions of drivers. Later, authors used multimodal features to develop M-HDx systems that integrate vision-based, vehicular-based or environmental-based, and sensor-based features to classify driver fatigue into multiple stages: very alert (without distraction), fairly alert (drowsiness with distraction), neither alert nor drowsy, moderately drowsy, and extremely drowsy.
According to the literature review, we noticed that multimodal features provided higher accuracy compared to unimodal techniques. Therefore, in this review paper, multimodal feature approaches are described and compared in detail, with emphasis on machine-learning algorithms (TML, DL, and TL).
Numerous survey articles have been written to establish the need for driver fatigue detection (DFD) systems. However, many research questions remain to be addressed, as listed in Table 1, and those survey articles do not address all the problems described in Table 2. A recent survey shows that considerable research is under way to develop an automatic solution [127] for detecting and predicting driver fatigue using multimodal features and deep learning (DL) architectures to recognize multistage drowsiness. An automatic solution for driver fatigue [4] is considered significant for improving the visual attention of drivers. Past studies found that road attention is very sensitive to driver fatigue, as discussed in [130]. It is also very important to regularly monitor the traffic environment as well as vigilance [131]. For sustained driver attention, real-time fatigue detection helps to prevent accidents. Alert systems have also been designed in past techniques to detect upcoming hazards in the driver's path [132]. An automatic driver fatigue detection system is therefore required to advance the development of road safety systems.
Visual features also provide an efficient way to detect driver fatigue in real time without using non-visual features, as suggested by many authors [12], [45]-[64]. In particular, the authors of [63] observed that light illumination has a considerable effect on extracting features based on the PERCLOS measure. Also, in [67], the authors reported that face occlusion by large dark glasses and head position might affect the reliability and accuracy of the measurement. Many studies employed the PERCLOS measure with a single camera to detect driver fatigue. However, in real-time detection, if the driver's head is not centered and is out of focus, PERCLOS becomes very hard to measure. As a result, domain-expert knowledge and image processing must be made robust by using at least three cameras in three different positions to capture the driver's face. Night vision is another limitation of the PERCLOS measure for detecting visual features. To minimize these light-illumination problems, many authors suggest using IR cameras.
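To make the PERCLOS limitation concrete, the sketch below computes PERCLOS over a window of frames and also reports how many frames were usable. It is an illustration under stated assumptions, not the measure as implemented in the cited studies: the per-frame eye-openness score in [0, 1], the use of `None` for frames where the face could not be tracked (occlusion, poor illumination, off-center head), and the 0.2 closure threshold (eyes at least 80% closed) are all assumptions made here.

```python
def perclos_with_coverage(openness, closed_below=0.2):
    """Return (PERCLOS in percent, fraction of usable frames).

    `openness` holds one eye-openness score per frame in [0, 1], or None
    when the eyes could not be measured in that frame (assumed convention).
    """
    if not openness:
        raise ValueError("need at least one frame")
    valid = [o for o in openness if o is not None]
    coverage = len(valid) / len(openness)
    if not valid:
        return None, coverage  # PERCLOS is undefined without usable frames
    closed = sum(1 for o in valid if o < closed_below)
    return 100.0 * closed / len(valid), coverage


if __name__ == "__main__":
    # Two of five frames are lost, for example to sunglasses or head turn.
    frames = [1.0, None, 0.1, 0.9, None]
    print(perclos_with_coverage(frames))
```

A low coverage value signals that the PERCLOS estimate rests on few frames, which is the situation that motivates multiple camera positions or IR cameras.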
Studies [122]-[127] suggested using a single-view camera to detect visual features, but it is very difficult to detect facial features reliably through one camera. As a result, multiview cameras might be required. Several datasets are available online for researchers to test their M-HDx systems, but they cover limited real-time conditions. The YawDD video database [142] is available for testing newly developed techniques. Its videos were acquired in daytime and nighttime driving conditions, with drivers of different genders and races. Another challenging aspect of conventional fatigue detection systems is keeping them compatible with changing times and trends.
During our studies, we observed that numerous visual and non-visual features (described in Table 3 and Table 4) have been considered in the past to detect and predict driver drowsiness levels. However, the sensors utilized in [76]-[79] must not interfere with the real-time driving process, and their life expectancy and their effect on the health of drivers must also be considered. Instead of using a camera, many researchers have worked with health vitals from EEG, ECG, and EMG sensors [80]-[94] to extract biological features from the driver. Some important visual-feature-based datasets are described in Table 6. These physiological features are used for multistage fatigue detection, which is not possible using behavioral features alone.
Our study finds that no single visual or non-visual feature can be used to detect driver fatigue in real time under all environmental conditions. Some research suggests using hybrid systems that combine both visual and non-visual features to obtain higher accuracy. Mostly, the authors utilized PERCLOS for visual features and EEG and ECG sensors for non-visual features. Those hybrid or multimodal features are feasible for detecting driver fatigue. In Table 8, we presented online data resources for different multimodal-based HDx (M-HDx) systems. Those datasets can be used to train and test M-HDx systems. The correctness of driver fatigue detection methods, in this case, depends on various factors, such as how fast and how accurate the results are under real-time processing. A comparison of the fatigue detection techniques is presented in Table 5. We conclude that the most suitable approach for driver fatigue detection combines visual features with machine learning algorithms, especially deep-learning methods.
The latest trend is the use of deep learning algorithms (DLAs) [95], [97], [128]-[141] and multimodal features to detect and predict driver drowsiness. Those DLA-based DFD systems are presented in Table 9. In particular, a convolutional neural network (CNN) method is used in [35] to recognize driver fatigue based on a minimal network structure and facial points. The authors reported a classification accuracy of 73.06%. They used CNN and SoftMax algorithms together with three dropout hidden layers for the final prediction. We noticed that the authors did not train the model on a large dataset and that they used the CNN for the prediction of driver fatigue. The CNN model is used to detect features from images, and the selection of feature maps is a difficult step for real-time detection of fatigue. As a result, the computational complexity of CNN methods is high, and it is difficult to detect multistage drowsiness. Similarly, other deep-learning (DL) architectures require extensive data augmentation and the selection of multiple layers to train and test the model. To evaluate the performance of traditional machine-learning, deep-learning, and transfer-learning algorithms on multistage fatigue detection, we used four different M-HDx systems: CogBeacon-ML [124], BROOK-DenseNet [126], Ford-dataset [74], and Riani-HDx [128]. The training and validation accuracies were also computed using 10 epochs.
Our comparative analysis based on different machine-learning algorithms and distinct features is described in Table 12 to Table 22. In these experiments, deep learning (DL) models provided the best performance compared to traditional machine learning algorithms. In particular, the hybrid DL (HDL) algorithms performed best in three-stage and five-stage detection of drowsiness. We noticed that many authors used unimodal or multimodal features with DL architectures to obtain state-of-the-art prediction results. This distribution can be seen in Figure 14 and Figure 15, which show that the application of DL with transfer learning models is highly prevalent in both visual and non-visual feature-based fatigue detection systems. We observed that transfer learning (TL) algorithms were used in only 5% of cases in the past. However, our comparison based on four M-HDx systems indicates that TL algorithms are the best candidates for multistage fatigue detection. With TL algorithms, it is difficult to provide sufficient training samples based on visual and non-visual features; if enough training data can be provided, they can deliver effective performance. To assess computational performance, we conducted another experiment, whose results are given in Table 23. For this table, we used a Tesla K80 GPU with 2496 CUDA cores and 12 GB of dedicated GDDR5 VRAM, and an Intel(R) Xeon(R) CPU at 2.3 GHz (1 core, 2 threads) with 12.6 GB of RAM. On average, the Riani-M-HDx system achieved the highest accuracy of 91.5% with low training (6.35 s) and testing (0.35 s) times compared to the other systems. These times were measured on the GPU, which provides faster calculations because of its dedicated memory, as opposed to the shared memory of the CPU.
In this review and comparative analysis, we included a preliminary comparison to explore the use of multimodal driver alertness features with transfer learning algorithms. We also identified the available multimodal datasets and M-HDx systems in order to conduct a pilot study that differentiates among 2-stage, 3-stage, and 5-stage driver drowsiness. This is our first attempt to present a detailed study that uses physiological, visual, and environmental modalities together to track the different driver states. In addition, this study covered three key aims. First, because it is a widely under-researched process, we explored the benefits of integrating multimodal features for multistage fatigue detection. Second, our comparisons extended the standard binary classification problem, which was mostly addressed in the past, into three-class and five-class problems to detect different levels of driver alertness. Finally, we studied which methods have a greater ability to discriminate between alertness levels in drivers. Our experimental findings demonstrated the benefits of multimodal feature learning and highlighted substantial drawbacks of the individual modalities in both multiclass classification systems. Compared to individual modalities for multiclass classification, early modality fusion leads to improved performance, with an overall accuracy of 79.73% for the three-class system and 55.12% for the five-class system when all parameter settings and different facial occlusions, such as a scarf or large sunglasses, are used. Finally, our study seeks not only to discover universal trends across the multiple driver states, but also to detect personalized features associated with the five stages of driver drowsiness. This study shows that the state-of-the-art M-HDx systems are based on unimodal or multimodal features to detect multiple stages of fatigue.
In the following subsections, we discuss the limitations of the training datasets, the challenges of deep-learning methods, and future directions to assist other researchers.

A. LIMITED DATASETS FOR TRAINING
In addition to the above technological problems of deep learning (DL) algorithms, there is an increasing need for multiple very large datasets to carry out a successful training process. In fact, in terms of the classes and instances that affect the training accuracy of DL algorithms, the datasets available online are small and have limited variability in environments. Nevertheless, we have identified the most recent datasets, including vision-based, sensor-based, and multimodal-based ones. Those datasets can be easily integrated into any real-time application for training the network, but we did not find a dataset that covers all aspects of multistage driver drowsiness recognition during both daytime and nighttime driving. In the literature, the datasets mentioned are accompanied by trained networks that can be used as pre-trained models for transfer learning, which can help to increase classification accuracy.

B. CHALLENGES OF DL ARCHITECTURES
The development of multilayer deep-learning (DL) architectures, including hybrid learning (HL) and transfer learning (TL), is a recent trend in machine learning for developing hypovigilance detection (HDx) systems. In a large variety of applications, these DL algorithms [143]-[147] have obtained very promising results. For real-time multistage detection of driver fatigue, however, DL architectures still face the challenges of computational cost, complexity of network parameters, and system performance, because the authors of many HDx systems did not focus on computing cost and parameter complexity when using DL algorithms. Therefore, a robust DL-based architecture needs to be developed for feature extraction and classification tasks. Moreover, the computational complexity of a DL model is very high during the training phase because of the large number of parameters used in the network. As a result, model reduction methods, such as pruning and the removal of redundant neurons and layers, are required to solve complex problems. Compared with classical DL models implemented on a CPU, GPU-based implementations of DL also offer high performance and can thus reduce the high computation cost.
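As one concrete instance of the model reduction methods mentioned above, the sketch below applies unstructured magnitude pruning to a flat list of weights: the fraction of weights with the smallest absolute value is set to zero. This is a generic illustration chosen here, not a method taken from the reviewed systems; the function name `magnitude_prune` and the example weights are hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude.

    `weights` is a flat list of floats; `sparsity` is in [0, 1].
    Returns a new list and leaves the input unchanged.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    k = int(len(weights) * sparsity)  # number of weights to remove
    if k == 0:
        return list(weights)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


if __name__ == "__main__":
    layer = [0.5, -0.01, 0.2, -0.9]
    print(magnitude_prune(layer, 0.5))
```

In practice a pruned network is usually fine-tuned afterwards to recover accuracy, and the savings in computation depend on whether the hardware and framework exploit the resulting sparsity.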
Following these new developments, HDx architectures have recently been expanded and are becoming deeper in terms of the number of layers and the number of processing units per layer. Still, in terms of computing power and training sophistication, HDx architectures need to be more efficient. In what follows, we discuss certain recommendations to improve the computing capacity of DL architectures for the development of M-HDx systems in a real-time environment.
In this work, we have studied and compared various DL architectures based on multimodal-based HDx (M-HDx) systems to predict multistage driver fatigue. The latest techniques for M-HDx systems are DL models, where the emphasis is on extracting and fusing multimodal features without defining handcrafted features. Multimodal feature fusion mitigates the deficiencies of both visual and non-visual features, thus enhancing HDx efficiency. Although multimodal M-HDx has improved dramatically over the past few years, current works fall short of optimum efficiency when integrating multiple modalities. A major difficulty is deciding at which stage the modalities should be fused: early fusion, feature-level fusion, or decision/late fusion [15]. Multimodal fusion aims to obtain complementary knowledge from the modalities in order to perform the analysis task correctly. The major concern in multimodal fusion is to find the best stage at which to combine the modalities. Based on this philosophy, the widely employed techniques are data-level or early fusion, feature-level or intermediate fusion, and decision-level or late fusion [16]. Some new fusion models have arisen with the advent of deep learning. Since feature extraction can be performed at any layer in deep learning models, particularly CNNs, we can now have early and late feature-level fusions.
The most commonly used technique for incorporating knowledge in deep learning models is feature-level fusion. The biggest advantage of early fusion is that it exploits the similarity between the modalities at an early stage. Besides, only one classifier is required to execute a task, which makes the training process less repetitive. However, time synchronization is a noteworthy constraint of feature-level fusion, because the data are collected at various rates and in various formats in the different modalities [17]. Decision-stage fusion is the other broadly employed fusion tactic. The major advantage of this strategy is that it allows each modality to be analyzed directly, thus significantly reducing the chance that one modality dominates the other. Most current works use either feature-level fusion or decision fusion, losing the chance to fuse the rich representations of mid-level features available in a CNN-based architecture. Hence, new deep architectures with fusion frameworks for HDx need to be discussed to resolve the aforementioned shortcomings and to exploit various fusion strategies. The major drawback of current HDx deep learning-based fusion approaches using deep and inertial sensors is that the fusion is carried out at a single level or point, without passing semantic knowledge from the features to the classifier.
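The difference between the fusion stages described above can be sketched in a few lines. This is a simplified illustration with made-up numbers, not the fusion scheme of any reviewed system: `early_fusion` concatenates per-modality feature vectors so that a single classifier can be trained on the result, while `late_fusion` combines the class probabilities produced by separate per-modality classifiers using a weighted average. All function names, weights, and values are hypothetical.

```python
def early_fusion(*feature_vectors):
    """Feature-level fusion: concatenate the feature vectors of all modalities."""
    fused = []
    for vector in feature_vectors:
        fused.extend(vector)
    return fused


def late_fusion(prob_vectors, weights=None):
    """Decision-level fusion: weighted average of per-modality class probabilities."""
    n = len(prob_vectors)
    if weights is None:
        weights = [1.0 / n] * n  # equal trust in every modality
    num_classes = len(prob_vectors[0])
    fused = [
        sum(w * p[c] for w, p in zip(weights, prob_vectors))
        for c in range(num_classes)
    ]
    total = sum(fused)
    return [f / total for f in fused]  # renormalize to sum to 1


def predict_stage(probs):
    """Return the index of the most probable fatigue stage."""
    return max(range(len(probs)), key=probs.__getitem__)


if __name__ == "__main__":
    # Made-up three-stage probabilities from a vision and an EEG classifier.
    vision = [0.7, 0.2, 0.1]
    eeg = [0.2, 0.5, 0.3]
    print(predict_stage(late_fusion([vision, eeg])))
    print(predict_stage(late_fusion([vision, eeg], weights=[0.25, 0.75])))
    # Made-up visual and physiological features fused at the feature level.
    print(early_fusion([0.31, 0.12], [72.0]))
```

The weights in `late_fusion` make the trade-off explicit: changing how much each modality is trusted can change the predicted stage, whereas early fusion leaves that balance to the single classifier and requires the modalities to be time-aligned first.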

C. FUTURE DIRECTIONS AND IMPROVEMENTS
After extensive experiments on different features and classification algorithms, we conclude that there is still a large research gap in implementing DL architectures that recognize the different drowsiness states of drivers. The following points describe how the architecture of DL models can be improved in the future.
1) Create deeper DL architectures by adding more layers to increase performance while overcoming the degradation problem.
2) The selection of multimodal features and of the fusion approach will be an important concern in developing an accurate model for predicting multistage fatigue.
3) Increase the training datasets and use more than 50 epochs to build the transfer-learning (TL) model.
4) The number of parameters of DL methods should be decreased, and extensive optimization techniques should be compared and tested for performance.
5) Designing an appropriate loss function is an important factor in improving the discriminating power of a DL network.
6) Feature extraction and fine-tuning are important steps for the accurate development of the deep learning model.
7) Learn using hierarchical functions, i.e., learn different layers with different characteristics.
8) To make use of the complementary benefits of these DL models, hybrid models should be developed instead of using a single DL model for feature extraction or classification tasks.
9) Different kinds of DL architecture should be investigated, such as few-shot learning (FSL) and semantic-based learning (SBL). These kinds of DL models significantly improve identification accuracy with fewer training samples.
10) The performance of vision transformers (VTs) should also be compared with that of DL models for the development of M-HDx systems.
11) Instead of using generalized DL architectures, DL models specific to HDx-related problems should be developed.
12) To detect driver fatigue, many researchers [10]-[30] used visual features defined through vision-based techniques. The authors used single-view or multi-view cameras to define the PERCLOS measure, but it is very difficult to detect the driver's features in the case of face occlusion, varying light illumination, or a head that is not center-aligned.
13) A few authors used a combined approach, with vision-based and sensor-based devices together, to detect driver fatigue while avoiding the face occlusion and light illumination problems.
14) Few current models collect both temporal feature data and temporal interaction information between driver fatigue detection features. A greater degree of mouth opening, for instance, does not generally reflect fatigue, which may be reassessed over time through changes in heart rate and level of eye opening.
15) To build an integrated framework that proactively identifies driver tension levels, it is important to capture, pass, pre-process, reduce, and integrate data and to use it automatically to make the final decision.
16) Implement TL algorithms on a GPU-based platform combined with a cloud-computing platform.
A comprehensive literature review has shown that each approach has its drawbacks. Therefore, a combination of several methods is best. The features considered are derived from non-intrusive sensors and relate to changes in driving behavior and visual facial expressions.
To obtain enhanced visual facial features, three cameras can be deployed at different angles. The system should then be trained on large multimodal datasets with transfer learning algorithms, which can better support automatic drowsiness detection in a real-time environment. However, time is also an important factor in a real-time environment. Hence, if transfer learning is implemented on a GPU, training and testing times improve by 35% on average compared to CPU-based processing.

VII. CONCLUSION
This article presents the state of the art in current research and development efforts on the recognition of driver drowsiness through vision-based, sensor-based, and multimodal-based feature techniques. The literature shows that, in the past, the prediction of the required features was performed by both standard and recent deep learning (DL) algorithms. In this paper, we have focused on deep learning with transfer learning algorithms that have been utilized to predict driver drowsiness, instead of conventional machine learning techniques. A critical review is also presented of the different factors that influence driving and of traditional and advanced deep-learning algorithms, to help researchers identify the research gap. We conclude that many multimodal-based M-HDx systems were developed and tested on different hardware benchmarks, such as CPU and GPU. Although there are several review articles in this domain, to the best of our knowledge none of them focused on comparing multimodal-based M-HDx systems in different benchmark settings. In the future, we will focus on more comparison studies related to hybrid systems for the implementation of multistage fatigue recognition systems. Moreover, we have assessed and described the latest methodologies used in this domain, such as deep-learning algorithms. To our knowledge, several review articles have been written to address driver fatigue detection problems, but none of them focused on the domain of deep-learning multi-layer architectures. Future trends will combine supervised and unsupervised algorithms to enhance driver adaptability and cognitive performance without giving more details. With the development of hardware, the internet of things (IoT), 5G networks, and energy computing, more transfer-learning-based algorithms will be tested in the future.