An Efficient Deep Learning Framework for Distracted Driver Detection

The number of road accidents has been rising steadily around the world. According to an investigation by the National Highway Traffic Safety Administration, 45% of vehicle crashes are caused by a distracted driver. We aim to build an accurate and robust framework for detecting distracted drivers. Existing work on distracted driver detection covers a limited set of distractions (mainly cell phone usage). This paper uses the first publicly accessible dataset, the State Farm distracted driver detection dataset, which contains eight classes performed by 26 subjects: calling, texting, everyday driving, operating on radio, inactiveness, talking to a passenger, looking behind, and drinking. The transfer values of the pretrained EfficientNet model are used, as it is the backbone of EfficientDet. The EfficientDet model, in turn, detects the objects involved in these distracting activities and the region of interest of the body parts in the images, which makes predictions more robust and achieves state-of-the-art results. We implement five EfficientDet variants (D0-D4) for detection and compare the best variant with Faster R-CNN and Yolo-V3. Experimental results show that our approach outperforms earlier methods in the literature and that EfficientDet-D3 is the best model for detecting distracted drivers: it achieves a mean average precision (mAP) of 99.16% with a learning rate of 1e-3, 50 epochs, a batch size of 4, and a step size of 250, demonstrating that it can potentially help drivers maintain safe driving habits.


I. INTRODUCTION
Numerous road accidents happen because the driver is distracted. The AAA Foundation for Traffic Safety found that 6 out of 10 road accidents are caused by distracted drivers [1]. According to the World Health Organization (WHO), 1.2 million deaths are due to road accidents, and 45% of road accidents are due to distracted drivers. The number of worldwide deaths in traffic accidents is nearly the same as the number of deaths due to Hepatitis and HIV (1.3 million [2] and 1.1 million [3], respectively). The Centers for Disease Control and Prevention (CDC) provides a more precise definition of distracted driving, categorizing it into three types: 1) visual: looking around and not concentrating visually on the road, 2) cognitive: looking at the road but not concentrating mentally on it, and 3) manual: taking the driver's hands off the steering wheel [4].
Road accidents cause loss of property and sometimes life.
The number of road accidents caused by distracted driving has increased. According to the National Highway Traffic Safety Administration (NHTSA), distracted driving refers to any activity that diverts attention from driving, including (a) talking or texting on the phone, (b) eating and drinking, (c) talking to others in the vehicle, or (d) using the radio, entertainment, or navigation system. In 2015, car accidents caused by distracted drivers killed 3,477 people and injured 391,000. The most common cause of these reported accidents was texting or talking on a mobile phone while driving [5].
In Pakistan, about 15,000 people died in road accidents (see footnote 3). Distracted driving detection systems can issue early warnings that alert drivers to hazardous driving behavior, including using a cell phone to call or text, using navigation applications, or selecting radio frequencies or music [6]. Distracted driving detection techniques are predominantly based on the driver's facial expression, head movement, line of sight, or body movement [7]. The driver's behavior and physiological state can be recognized through visual tracking, object detection, action recognition, and other techniques. Detecting distracted driving has gained much attention from the research community, government agencies, and industry [8]- [10]. A distraction is any activity that diverts the driver's attention from driving, such as interacting with passengers, making phone calls, or adjusting the multimedia and navigation tools. Most existing approaches detect the driver's behavior using traditional computer vision and machine learning algorithms. We use deep learning algorithms to analyze the driver's abnormal behavior, since deep learning techniques have significantly improved the accuracy of vision-related tasks. Recent technological progress makes it possible for real-time algorithms to detect the distracting activity and to assist and alert the distracted driver.
With improved deep learning classification and detection techniques, a growing number of researchers have analyzed driving behavior with Convolutional Neural Networks (CNNs), and more of them have begun to build their own research datasets. A combination of pre-trained sparse filters and convolutional neural networks [11] raised the classification accuracy on the SUE-DP dataset to 99.78%. Yan [12] improved the regions with CNN features (R-CNN) framework by replacing the conventional skin-like region extractor algorithms, scoring 97.76% on the SUE-DP dataset. The authors in [13] developed a dual-input deep network based on a three-dimensional convolutional neural network (3DCNN), achieving 98.41% accuracy on a rail transit dataset. The authors in [14] trained two independent convolutional neural networks by optimizing the size and number of convolution kernels; the networks reliably detect cell phones and hands, achieving 144 frames per second (fps) and 95.7% accuracy for cell phone usage on a self-built dataset. In [15], a multiscale attention CNN was proposed for recognizing driver actions.
Existing literature focuses on face localization, hand/head recognition, eye movement analysis, and facial landmark detection. The definitions presented by the CDC (footnote 4) guide research in the field of distracted driving detection, which covers manual, visual, and cognitive distractions. This paper focuses on "manual" distractions using eight State Farm Distracted Driver Dataset (SDDD) classes, as manual distractions are primarily concerned with driver activities unrelated to safe driving. The EfficientDet model is used to detect the objects involved in these distracting activities and the region of interest (ROI) of the body parts in the images, which yields a better mAP than previous detectors. In this paper, we use deep learning to detect and identify the distractions of a driver. A camera mounted above the dashboard captures RGB images. We use networks pre-trained on the ImageNet dataset in a "transfer learning" mode. We convert the ten original SDDD classes into eight categories by merging calling (left hand) with calling (right hand) and texting (left hand) with texting (right hand). After cleaning the dataset, we keep 1000 images in each class and annotate every image. The EfficientDet model is used for detection in five variants; it achieves state-of-the-art accuracy while being up to 9x smaller and using up to 42x fewer FLOPs than earlier state-of-the-art detectors such as Yolo-V3, Faster R-CNN, RetinaNet, and NAS-FPN [16]. EfficientNet surpasses state-of-the-art accuracy with up to 10x better efficiency compared with previously common backbones (i.e., ResNet, AmoebaNet). The proposed approach can help drivers maintain safe driving habits and reduce accidents.
Footnote 3: https://www.pbs.gov.pk/
Footnote 4: https://www.cdc.gov/motorvehiclesafety/distracted driving/
The main contributions of this study are:
• Propose a model for distracted driver detection evaluated on the State Farm distracted driver dataset.
• Implement five variants of the EfficientDet model to determine the most suitable model for driver distraction detection.
• Detect the objects and the region of interest of the body parts to detect distracted drivers.
• Results conclude that EfficientDet-D3 is the best model for detecting distracted drivers as it achieves a mean average precision (mAP) of 99.16%.
The rest of the paper is arranged as follows. Section II reviews related work, Section III presents the methodology, and Section IV introduces experimental analysis and results. Finally, Section V concludes the paper.

II. RELATED WORK
Distracted driver behavior has been recognized using two different approaches. One relies on standard handcrafted features and then trains a machine learning model on the extracted features [17]. Such methods generally use the driver's eyes, head, and face to predict the driver's behavior. For example, the authors in [18] detected the driver's distraction by estimating head rotations. First, they detected the driver's head and then continuously tracked its movement. Haar-wavelet AdaBoost cascades were used to detect the head, and the localized gradient orientation was used to estimate the pose. The main disadvantage of this method is that it only considers the driver's head pose, even though most approaches to detecting driver inactivity rely on facial features.
The authors in [19] fine-tuned the CNN-based Inception-ResNet model on self-collected images of driver distraction activities to improve accuracy. Their self-created dataset contains six classes, and every class includes 1000 images. First, they apply several preprocessing methods to the training and test images, including cropping, adjustment, and flipping. The preprocessed data is then passed to the Inception-ResNet model for training. The model is pre-trained on the ILSVRC 2012 dataset. They achieved 83% accuracy on the test data. The authors in [20] present a deep learning-based architecture that recognizes the inactiveness of the driver. They used the features of the pre-trained VGG-19 and fine-tuned them on a publicly available dataset. They achieved a best test accuracy of 95% and 80% accuracy on the validation data of each class. They also claimed that their model did not overfit, even on the self-collected dataset. Their results showed that the proposed model beats XGBoost in accuracy by nearly 7%.
The authors in [21] fine-tuned three CNNs and applied an SVM to the combined features of those three networks to classify each frame into four classes: alert, nodding, drowsy with blinking, and yawning. It is worth noting that the accuracy reported in the analysis was 65.2%, which was lower than the precision of 73%. Moreover, the technique was tested on simulated data, where the signs of drowsiness were generally clearly visible, so classification relied on apparent signs of drowsiness. The authors in [22] proposed a CNN named DarNet for inactive driver recognition. They created a dataset of driver inactiveness containing six distinct categories: everyday driving, talking, texting, reading, makeup, and eating. They fine-tuned the Inception v3 module, which was first trained on the State Farm dataset from Kaggle for inactive drivers, and achieved 87.02% accuracy with the proposed design. The authors in [23] proposed a CNN-based architecture by modifying the VGG-16 network, using the Leaky ReLU activation function rather than ReLU and applying different regularization strategies to address overfitting. The system achieved 96.31% accuracy on the AUC Distracted Driver dataset.
The authors in [24] used several CNN models, including AlexNet, VGG-16, and ResNet152. For training, they used the State Farm distracted driver dataset with seven classes of 3000 frames each; 2300 images from each class were used for training and 700 for testing. They fine-tuned the above models on various versions of the ILSVRC dataset. AlexNet, fine-tuned on the ILSVRC2012 dataset, achieved 70% accuracy on the test data. The authors in [25] present a large, publicly accessible real-life drowsiness dataset (RLDD), which contains 30 hours of video, for detecting signs of drowsiness. The core of their method is a Hierarchical Multiscale Long Short-Term Memory (HMLSTM) network, which processes the detected blink features as a sequence. This baseline method achieves an accuracy of 80%, which is higher than human judgment. The authors in [26] present a deep learning-based model for detecting driver drowsiness in Android applications. Their model is based on the detection of facial landmark points.
The authors in [27] proposed a driving-behavior recognition framework based on a deep CNN model. They used Kinect cameras to collect images of distracted drivers, and the raw images were processed with a GMM-based segmentation algorithm. CNN models were then used as a binary classifier, which achieves 91% accuracy. The authors in [28] proposed a deep learning-based technique for collecting distracted driving data and built an analysis framework called DarNet, which achieved a classification accuracy of 87.02% on their collected dataset. The authors in [29]- [31] proposed deep learning models for anomaly detection, intrusion detection, and botnet attack detection. The authors in [32] proposed Drive-net, a technique that combines a CNN and a random decision forest to classify driver images. Drive-net achieved a recognition accuracy of 95% on the Kaggle dataset. The authors in [33] proposed a driver distraction recognition framework that uses generative adversarial networks (GANs) and showed that generative models can produce images of drivers in various driving situations. Augmenting the training set with these images improved the framework's image classification performance by 11.45%.
The authors in [34] observe that both SM and UC cannot join and exit flexibly. Because of abnormal events, SM might be passively disconnected from UC. However, UC is unaware of the disconnection and keeps monitoring the channel. Another possible case is that SM and UC have observed the malicious activities of RA, yet neither can actively quit. Both conditions result in unnecessary energy costs and potential risks. In [35], a particularly significant research topic is physiological signal encryption and secure transmission related to security protection; some emerging technologies provide an important reference. Liu et al. [36] proposed a method for tracking the driver's facial region and used the yaw angle to estimate the driver's facial movement and recognize the driver's facial orientation. The authors in [37] developed a driver facial movement recognition framework based on binocular stereo vision, using a hidden Markov model to predict the driver's facial movement. Later, with the improvement of machine learning technology and the public release of driving behavior datasets, a growing number of studies analyzed the driver's phone calling, drinking, eating, and other dangerous driving behaviors. The Southeast University driving posture (SUE-DP) dataset [38] was proposed in 2011. It covers four categories of driving behaviors: "grasping the steering wheel," "operating the shift lever," "eating," and "talking on the phone." Using the SUE-DP dataset, the authors in [38] applied the multiwavelet transform and a multilayer perceptron (MLP) classifier to recognize four predefined driving postures and obtained an accuracy of 90.61%. The authors in [39] used support vector machine (SVM) classification and obtained 94.25% accuracy.
Existing literature focuses on face localization, hand/head recognition, eye movement analysis, and facial landmark detection, and it detects distractions that are manual, visual, or cognitive. Furthermore, existing work on distracted driver detection covers a limited set of distractions (mainly cell phone usage). At present, computer vision methods are generally used to extract features and to classify and detect images. Such classification tasks are carried out by different deep neural network models running on high-performance computers to achieve better recognition accuracy.

III. PROPOSED FRAMEWORK
The proposed model contains two steps. In the first step, we perform preprocessing on the data. In the second step, we detect the objects involved in these distracting activities and the ROI of the body parts from the dataset's images.

A. PRE-PROCESSING OF DATASET
In April 2016, State Farm started a competition on the Kaggle website (footnote 5). The competition provides images of distracted driver behavior. We use the State Farm Distracted Driver Dataset (SDDD) to train our proposed algorithm. State Farm collected the 2D images with a camera placed on the vehicle's dashboard. The purpose of these images is to produce results that will directly or indirectly help reduce casualties caused by distracted driver behavior. The original SDDD has two folders containing 22400 training images and 79727 testing images. Each image has a resolution of 640 x 480 pixels. The training folder includes a total of 10 categories: calling (left hand), calling (right hand), texting (left hand), texting (right hand), everyday driving, operating on radio, inactiveness, talking to a passenger, looking behind, and drinking. Each category contains a different number of images, as shown in Tab. 1. The data in the testing folder is unlabeled, so we only use data in the training folder to evaluate our method. We convert the ten original SDDD classes into eight categories by merging calling (left hand) with calling (right hand) and texting (left hand) with texting (right hand). After cleaning the dataset, we keep 1000 images in each category. We annotate every image from each class using the LabelImg annotation tool, and each annotation is stored in a .xml file. The annotation highlights the parts of the image that contain distracting objects and the region of interest of the body parts. There are 1000 annotated files in each class, which gives 8000 RGB images and 8000 annotation files in total. The data distribution for each class is shown in Tab. 2. We split the dataset for each category into 80% training data and 20% validation data.
Footnote 5: https://www.kaggle.com/c/state-farm-distracted-driver-detection/data
The split is made so that the validation set does not overlap with the training set. Some frames from the distracted driver dataset are shown in Figure 1.
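The class merging and the 80/20 split described above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing script: the function name `merge_and_split`, the label strings, and the random selection of 1000 images per class are assumptions (the paper selects images by manual cleaning), and the sketch splits per image rather than per subject.

```python
import random
from collections import defaultdict

# Ten original categories mapped to the eight merged categories named in the text.
MERGE_MAP = {
    "calling (left hand)": "calling",
    "calling (right hand)": "calling",
    "texting (left hand)": "texting",
    "texting (right hand)": "texting",
    "everyday driving": "everyday driving",
    "operating on radio": "operating on radio",
    "inactiveness": "inactiveness",
    "talking to a passenger": "talking to a passenger",
    "looking behind": "looking behind",
    "drinking": "drinking",
}


def merge_and_split(samples, per_class=1000, train_frac=0.8, seed=0):
    """samples: list of (image_path, original_label) pairs.

    Returns (train, val), each a list of (image_path, merged_label) pairs.
    Hypothetical helper: random sampling stands in for manual cleaning.
    """
    by_class = defaultdict(list)
    for path, label in samples:
        by_class[MERGE_MAP[label]].append(path)

    rng = random.Random(seed)
    train, val = [], []
    for label in sorted(by_class):
        paths = sorted(by_class[label])
        rng.shuffle(paths)
        paths = paths[:per_class]            # keep at most `per_class` images
        cut = int(len(paths) * train_frac)   # 80% train, 20% validation
        train += [(p, label) for p in paths[:cut]]
        val += [(p, label) for p in paths[cut:]]
    return train, val
```

With 1000 images in each of the eight merged categories, this yields 6400 training and 1600 validation images with no image appearing in both sets.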

B. EFFICIENTDET
We use the preprocessed and annotated data to train the EfficientDet model. After training, a video frame is input to the model to predict whether the driver is distracted. We convert the image into a textual label from our defined classification and detection categories. For image classification, we use the transfer values of EfficientNet. The EfficientDet model is used to detect the objects and the region of interest of the body parts involved in these distracting activities in order to make accurate predictions and achieve state-of-the-art results. The steps are: extracting the convolutional features of the input image, classifying the image, detecting the objects and the ROI of the body parts, and predicting by combining the classification and detection labels. The flow diagram of our methodology is shown in Figure 2.
Our proposed deep learning framework takes frames from a video stream captured by a camera placed on the vehicle's dashboard. These input images pass through a preprocessing phase in which the images of each class are cleaned and then annotated. We then fine-tune EfficientNet, a model pretrained on ImageNet, for image classification. After that, EfficientDet detects the objects and the region of interest of the body parts involved in these distracting activities and produces the final result, which identifies the driver's distracted activities.

C. EFFICIENTDET ARCHITECTURE
The EfficientDet model was developed by the Google Brain team. It improves the multiscale feature fusion structure of Feature Pyramid Networks (FPN) and borrows the model scaling technique of EfficientNet, which makes it a scalable detection model. EfficientDet comprises three parts. The first part is EfficientNet, pretrained on ImageNet, which serves as the backbone network. The second part is a bidirectional feature pyramid network (BiFPN), which repeatedly performs top-down and bottom-up feature fusion on the output features of levels 3-7 of EfficientNet. The third part is the class and box prediction network, which is used to classify and localize the distracted driver. The modules of parts two and three can be repeated several times depending on hardware constraints. In this paper, EfficientNet is used as the backbone model; it extracts features from the images fed to the network, and rich information can be extracted with few parameters, which helps ensure detection speed and accuracy. Features P3-P7 are then input to the BiFPN for feature fusion. To capture semantic information at different scales, BiFPN adopts weighted feature fusion. Because EfficientDet is an anchor-based object detector, the initial anchor sizes should be adjusted appropriately to achieve better results. The architecture of EfficientDet is shown in Figure 3. BiFPN serves as the feature network: it takes features from levels 3-7 of the backbone network and applies the BiFPN repeatedly. The fused features are fed into a class and box network to predict the object's class and bounding boxes.

1) BiFPN
Conventional FPN aggregates multiscale features in a top-down manner, as shown in Figure 4(a). Convolving the input feature map of layer 7 gives the output feature map of layer 7. The output feature map of layer 7 is up-sampled and added to the input feature map of layer 6, and convolving this fused map yields the output feature map of layer 6. The same procedure continues down the pyramid: the output feature map of layer 4 is up-sampled, added to the input feature map of layer 3, and convolved to yield the output feature map of layer 3. PANet improves FPN's feature fusion strategy, as shown in Figure 4(b), by adopting both a top-down and a bottom-up path. Figure 4(c) shows NAS-FPN, an irregular feature network topology discovered using neural architecture search (NAS); a significant amount of GPU processing time is required to find it. The BiFPN shown in Figure 4(d) is improved in three ways:
• If a node has only one input, its contribution to feature fusion is small, and it can be removed.
• At each level, an extra connection is added so that more features can be fused at a low cost.
• Each BiFPN is treated as a module, and the output of the previous BiFPN is used as the input of the following BiFPN. The number of such modules depends on the circumstances.

2) Feature Fusion Methods
When features of different scales are fused, the common practice is to resize them to the same resolution first and then add the corresponding features. This assumes that all input features contribute equally to the fused feature. In fact, different input features should contribute differently to the output because of their different resolutions.
Unbounded fusion is calculated using Eq. 1:
$$O = \sum_{i} w_i \cdot I_i \qquad (1)$$
where $w_i$ is a learnable weight that can be a scalar (per-feature), a vector (per-channel), or a multi-dimensional tensor (per-pixel). The scalar weight is the most cost-effective choice and does not sacrifice accuracy, so it is used. The drawback of this weighting strategy is that the weight is unbounded, which can destabilize model training.
Softmax-based fusion is calculated using Eq. 2:
$$O = \sum_{i} \frac{e^{w_i}}{\sum_{j} e^{w_j}} \cdot I_i \qquad (2)$$
This normalizes the weights of the previous equation. However, this strategy has the disadvantage of increasing the computational cost. A fast fusion approach is therefore proposed to reduce the extra latency.
Fast normalized fusion is calculated using Eq. 3:
$$O = \sum_{i} \frac{w_i}{\epsilon + \sum_{j} w_j} \cdot I_i \qquad (3)$$
Here $w_i \geq 0$ is guaranteed by applying a ReLU after each $w_i$, and $\epsilon = 0.0001$ is a small value that avoids numerical instability.
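The three fusion rules named in this subsection can be compared with a small sketch. It assumes the formulations given in the EfficientDet paper [16], treats each feature map as a flat list of floats already resized to a common resolution, and uses scalar per-feature weights; the function names are illustrative and not taken from any library.

```python
import math


def _weighted_sum(features, weights):
    """Element-wise weighted sum of equally sized feature lists."""
    return [sum(w * f[k] for w, f in zip(weights, features))
            for k in range(len(features[0]))]


def unbounded_fusion(features, w):
    # O = sum_i w_i * I_i ; the weights are used as they are (unbounded).
    return _weighted_sum(features, w)


def softmax_fusion(features, w):
    # O = sum_i softmax(w)_i * I_i ; weights are normalized to sum to 1.
    m = max(w)
    exps = [math.exp(wi - m) for wi in w]  # subtract max for stability
    total = sum(exps)
    return _weighted_sum(features, [e / total for e in exps])


def fast_normalized_fusion(features, w, eps=1e-4):
    # O = sum_i w_i / (eps + sum_j w_j) * I_i, with ReLU keeping w_i >= 0.
    relu_w = [max(0.0, wi) for wi in w]
    total = eps + sum(relu_w)
    return _weighted_sum(features, [wi / total for wi in relu_w])
```

With equal weights, the softmax and fast normalized variants both reduce to (almost exactly) the average of the inputs, while the fast variant avoids the exponentials.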

3) Compound Scaling
Inspired by the compound scaling used in EfficientNets, a new compound scaling method for object detection was proposed. This method uses a single coefficient ϕ to jointly scale up the backbone network, the BiFPN network, the class/box network, and the input resolution. The scaling of each network component is as follows.
Backbone Network: The same width/depth scaling coefficients as EfficientNet-B0 to EfficientNet-B6 are used so that their ImageNet pre-trained weights can be reused.
BiFPN: The width (channels) is grown exponentially, and the depth (layers) is increased linearly. The width and depth are scaled using the following equation:
$$W_{bifpn} = 64 \cdot (1.35^{\phi}), \qquad D_{bifpn} = 3 + \phi$$
Box/class Prediction Network: The width is fixed to be the same as in the BiFPN, and the depth is increased linearly:
$$D_{box} = D_{class} = 3 + \lfloor \phi / 3 \rfloor$$
Input Image Resolution: Because feature levels 3-7 are used in the BiFPN, the input resolution must be divisible by 2^7 = 128, so the resolution is increased linearly:
$$R_{input} = 512 + \phi \cdot 128$$
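To make the compound scaling concrete, the sketch below evaluates the scaling rules for the five variants used in this paper (ϕ = 0 to 4). It assumes the rules stated in the EfficientDet paper [16]: BiFPN width 64·1.35^ϕ, BiFPN depth 3 + ϕ, head depth 3 + ⌊ϕ/3⌋, and input resolution 512 + 128·ϕ. The function name is hypothetical, and the width is the raw formula value; released configurations may round it to hardware-friendly channel counts.

```python
def efficientdet_scaling(phi):
    """Return the compound-scaled settings for EfficientDet-D<phi>.

    Assumed rules (from the EfficientDet paper), not the authors' code.
    """
    return {
        "bifpn_width": 64 * 1.35 ** phi,      # channels, grows exponentially
        "bifpn_depth": 3 + phi,               # BiFPN layers, grows linearly
        "head_depth": 3 + phi // 3,           # box/class network layers
        "input_resolution": 512 + 128 * phi,  # always divisible by 128
    }


if __name__ == "__main__":
    for phi in range(5):  # D0 to D4, the variants compared in this paper
        print("D%d" % phi, efficientdet_scaling(phi))
```

Under these rules, EfficientDet-D3 uses an 896 x 896 input, six BiFPN layers, and four layers in each prediction head.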

IV. EXPERIMENTAL ANALYSIS AND RESULTS
To train the EfficientDet model, we first annotate the data, that is, we mark the specific objects in the dataset's images. For training, we divide the data into two parts: 80% training data and 20% test data. The EfficientDet model is trained on an Nvidia GPU using PyCharm 2020 (version 1.4.0) and Python 3.6. The experimental system is Linux Ubuntu 16.04, and the graphics card is a GeForce GTX 1080Ti with 11GB of memory. The software used the Windows 10 operating system and the Keras and TensorFlow deep learning frameworks. We then fine-tune the pre-trained EfficientDet model for 50 epochs with a step size of 250, a batch size of 4, a threshold of 0.3, and a learning rate of 1e-3.

A. LOSS AND MAP
The experimental results of the different variants of the pretrained EfficientDet model are as follows. The losses and mAP of EfficientDet-D0, EfficientDet-D1, EfficientDet-D2, EfficientDet-D3, and EfficientDet-D4 are presented in Figure 5 and Figure 6, respectively. Among the five variants, EfficientDet-D3 achieves the highest mAP and the lowest loss, while EfficientDet-D0 has the lowest mAP and the largest loss. For the task of distracted behavior detection, EfficientDet-D3 achieves an mAP of 99.16%. EfficientDet-D4 has the second-highest mAP, 98.89%. Thus, the pretrained model can detect distracted driving behaviors in the State Farm dataset. Moreover, the pretrained EfficientDet model also shows small losses, which suggests that EfficientDet can detect distracted drivers in images beyond this specific dataset. The EfficientNet family contains eight models, EfficientNet-B0 to EfficientNet-B7, which share the final layers. Each of them contains seven blocks, and these blocks have a varying number of sub-blocks, which increases from EfficientNet-B0 to EfficientNet-B7. The total number of layers in EfficientNet-B0 is 237, and in EfficientNet-B7, the total comes to 813. EfficientDet-D3 has more parameters because the number of feature maps (channels) grows with the variant; this scaled configuration is the reason it performs well.
The model checkpoints are automatically saved in a temporary directory during training. When the training script is run, it first tries to resume training from the most recent checkpoint in this temporary directory. If none is found, training starts from the original pre-trained checkpoint that we saved in the efficientdet folder. As a result, to retrain from the original pre-trained model, this temporary folder must be deleted. Likewise, to avoid strange errors, the temporary directory should be deleted whenever hyper-parameters are changed and the fine-tuning process is restarted.
Table 3 shows the mean average precision and loss of EfficientDet-D0 to EfficientDet-D4. mAP compares the ground-truth bounding boxes with the detected boxes and returns a score; the higher the score, the more accurate the model's detections. From EfficientDet-D0 to EfficientDet-D3, mAP progressively increases at epoch 50: EfficientDet-D3 reaches an mAP of 99.16, whereas for EfficientDet-D4 the mAP decreases to 98.89. Similarly, the loss gradually decreases from EfficientDet-D0 to EfficientDet-D3: EfficientDet-D3 has a loss of 0.1114, whereas for EfficientDet-D4 the loss increases to 0.1243. Hence, EfficientDet-D3 has the highest mAP and the lowest loss, so it is the best detection method.
Table 4 shows the evaluation results of EfficientDet-D0 to EfficientDet-D4. EfficientDet-D2 shows the lowest classification loss, 0.0713, while the classification loss of D3 is 0.0001 higher than that of D2. EfficientDet-D3 has the lowest regression loss, 0.04, of all variants.
The per-class average precision (AP) is as follows. For the drinking class, EfficientDet-D4 achieves the highest AP (0.9964) and EfficientDet-D0 the lowest (0.9805). For the inactiveness class, EfficientDet-D2 achieves the highest AP (0.9718) and EfficientDet-D0 the lowest (0.8279). For the looking-behind class, EfficientDet-D4 achieves the highest AP (0.9999) and EfficientDet-D1 the lowest (0.9945). For the normal-driving class, EfficientDet-D4 achieves the highest AP (0.9942) and EfficientDet-D0 the lowest (0.9513). For the operating-on-radio class, EfficientDet-D3 achieves the highest AP (0.9995) and EfficientDet-D0 the lowest (0.9912). For the talking-to-passenger class, EfficientDet achieves the highest AP of 0.9996, and EfficientDet-D2 the lowest (0.9055). For the texting class, EfficientDet-D2 achieves the highest AP (0.9989) and EfficientDet-D0 the lowest (0.9922).
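The comparison between ground-truth and detected boxes that underlies mAP is usually based on intersection over union (IoU), and mAP is then the mean of the per-class average precisions. The sketch below illustrates both steps under these common conventions; the (x_min, y_min, x_max, y_max) box format and the function names are assumptions, and the computation of each class's AP from its precision-recall curve is omitted.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def mean_average_precision(ap_per_class):
    """mAP as the unweighted mean of per-class AP values (dict: class -> AP)."""
    return sum(ap_per_class.values()) / len(ap_per_class)
```

A detection is typically counted as correct when its IoU with a ground-truth box of the same class exceeds a chosen threshold; the paper does not state which IoU threshold was used.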

C. COMPARISON OF DETECTION MODELS
We train three detection models on our dataset: EfficientDet, Faster R-CNN, and Yolo-V3. Because EfficientDet-D3 has the highest mAP among the EfficientDet variants, we compare it with Faster R-CNN and Yolo-V3.

D. TEST RESULTS
The retrained model is saved as a checkpoint file containing only the model weights, not the model architecture. To perform inference with our fine-tuned model, we must first export the model into the saved model (.pb format). It is worth noting that the pre-processing and post-processing operations will be incorporated into the exported model. As a result, some parameters should be specified during the exporting process, such as the minimum score threshold to filter out low confidence-bound boxes. Finally, we apply the model to the distracted driver dataset we prepared and decode the prediction image shown in Figure 9.
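The effect of the minimum score threshold can be illustrated with a short post-processing sketch. The detection format (a dict with `label`, `score`, and `box` keys) and the function name are hypothetical and do not reflect the exported model's actual output signature; the default of 0.3 mirrors the threshold reported in the experimental setup, although the text does not state that the two values are the same.

```python
def filter_detections(detections, min_score=0.3):
    """Drop low-confidence boxes and sort the rest by descending score.

    detections: list of dicts with hypothetical keys "label", "score", "box".
    """
    kept = [d for d in detections if d["score"] >= min_score]
    return sorted(kept, key=lambda d: d["score"], reverse=True)


# Example with made-up detections for a single frame.
frame_detections = [
    {"label": "texting", "score": 0.92, "box": (120, 80, 300, 260)},
    {"label": "drinking", "score": 0.12, "box": (10, 10, 60, 60)},
    {"label": "calling", "score": 0.35, "box": (200, 40, 320, 180)},
]
confident = filter_detections(frame_detections)
```

Here only the texting and calling boxes survive the filter, and the highest-scoring one comes first.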
At epoch 50, mAP increases progressively from EfficientDet-D0 to EfficientDet-D3. EfficientDet-D3 has the highest mAP, 99.16, and the lowest loss, 0.1114, of the variants, and its running speed is 30% faster on GPU. EfficientDet-D4 has a lower mAP and a higher loss than EfficientDet-D3, which shows that EfficientDet-D3 is the best model for detection purposes.

V. CONCLUSION
In this paper, the EfficientDet detection model is used to efficiently detect distracted drivers and the ROI of the body parts, with the aim of reducing serious incidents. Our analysis suggests that, when designing larger networks, depth, width, and resolution must be increased simultaneously, as in EfficientNet compound scaling. As a result, the model has a faster running speed and a higher success rate than previous object detection models. In the future, two further variants of EfficientDet, EfficientDet-D5 and EfficientDet-D6, can be used for detection purposes. Furthermore, an extended version of the dataset can be presented to improve the distraction detection system by incorporating additional sensing modalities. For example, a microphone can record the sound and voices in the car, which provide valuable clues for detecting various distracted driving behaviors. In addition, we can improve the models by using dynamic data.

ABDUL REHMAN JAVED is a lecturer at the Department of Cyber Security, Air University, Islamabad, Pakistan. He has worked with the National Cybercrimes and Forensics Laboratory, Air University, Islamabad, Pakistan. He received his Master's degree in Computer Science from the National University of Computer and Emerging Sciences, Islamabad, Pakistan. He is a member of both IEEE and ACM. He is a cybersecurity researcher and practitioner with industry and academic experience. He has reviewed over 150 scientific research articles for various well-known journals. He is a TPC member of CID2021 (Fourth International Workshop on Cybercrime Investigation and Digital Forensics) and the 44th International Conference on Telecommunications and Signal Processing. He has served as moderator in the 1st IEEE International Conference on Cyber Warfare and Security (ICCWS).
He has authored over 50 peer-reviewed research articles and is supervising/co-supervising several graduate (BS and MS) students on topics related to health informatics, cybersecurity, mobile computing, and digital forensics. His current research interests include but are not limited to mobile and ubiquitous computing, data analysis, knowledge discovery, data mining, natural language processing, smart homes, and their applications in human activity analysis, human motion analysis, and e-health. He aims to contribute to interdisciplinary research of computer science and human-related disciplines.
ASMA BASHARAT joined the Department of Computer Science, Kinnaird College for Women, Lahore, as an Assistant Professor. She has authored or co-authored several peer-reviewed articles in professional journals and the proceedings of conferences. Her research interests include the areas of machine learning algorithms, wireless sensor networks, mobile computing, self-organized networks, big data analytics, and the Internet of Things.
ADIL AFZAL'S research interests include but are not limited to computer forensics, machine learning, criminal profiling, software watermarking, intelligent systems, and data privacy protection.
In 2017, he joined the Department of Computer Science, Kinnaird College for Women, Lahore, as an Assistant Professor.
MUHAMMAD RIZWAN received the M.Sc. degree from PUCIT, Lahore, Pakistan, in 2006, the M.S. degree from CIIT, Lahore, Pakistan, in 2012, and the Ph.D. degree from HUST, Wuhan, China, in 2017. He has authored or co-authored several peer-reviewed articles in professional journals and the proceedings of conferences. His research interests include the areas of machine learning algorithms, wireless sensor networks, mobile computing, self-organized networks, big data analytics, and the Internet of Things. VOLUME 4, 2016