Optimally-Weighted Multi-Scale Local Feature Fusion Network for Driver Distraction Recognition

Distracted driving is one of the main contributors to traffic accidents. In this work, we propose a novel multi-scale local feature fusion network for image-based distracted-driver detection. Since the driver is the most informative region for inferring distracted actions in a single image, our method first locates the driver's body using person detection, then captures abundant local body features through a repeated multi-scale feature fusion module. In addition to the features extracted from the whole image, our network also incorporates these important local body features. The global and local features are finally fused by an optimally-weighted strategy (OAWS). Experimental results show that our method achieves competitive performance on both our own HY Large Vehicle Driver Dataset and the public AUC Distracted Driver Dataset.


I. INTRODUCTION
Distracted driving is a main cause of traffic accidents. According to data from the Department of Transportation, about 2 million traffic accidents occur every year, and over 80% of them are caused by distracted driving. In recent years, Advanced Driver Assistance Systems (ADAS) have been adopted in many vehicles. ADAS use a range of sensors to collect data inside and outside the vehicle, and to detect and recognize static and dynamic objects inside and outside the car. Driving behavior detection is a key technology in ADAS: it can effectively alert the driver and help avoid traffic accidents. Therefore, driver distraction recognition has broad research prospects in computer vision and autonomous driving.
With the rapid development of deep learning and computer vision, many researchers have studied distracted driving detection in various ways. (The associate editor coordinating the review of this manuscript and approving it for publication was Jeon Gwanggil.) In recent years, deep learning has been widely used in the field of driver distraction detection. Compared with traditional algorithms, deep learning methods have greatly improved performance and accuracy. Various models have been applied to driver distraction recognition, such as AlexNet, VGGNet, GoogLeNet, and ResNet, and these have achieved excellent performance in computer vision tasks such as driver distraction recognition. However, most of these methods use only the global information of the image, which is easily affected by noise from complex driving environments. To extract richer, more discriminative driver features while also accounting for cues from the scene, we propose an optimally-weighted multi-scale local feature fusion network for driver distraction recognition, as shown in Fig. 1. In this task, we extract the driver's comprehensive multi-scale features by person detection and use repeated feature fusion to obtain more low-level detail features and high-level abstract features. In addition, we propose an optimally-weighted module that focuses on learning representative global and local features. Fig. 1 shows the overall framework of our proposed method. The major contributions of this article are summarized as follows:
1) A local multi-scale repeated feature fusion module is proposed to efficiently extract and aggregate rich features from the driver's low-level and high-level features.
2) An optimally-weighted module is proposed to weigh the local body features and global features, so that the model can fully consider context information and learn more representative features.
3) The proposed method is trained end to end and achieves competitive performance on both our own HY Large Vehicle Driver Dataset and the public AUC Distracted Driver Dataset.

II. RELATED WORK
Due to the frequent occurrence of traffic accidents caused by distracted driving, distracted driving detection has attracted a lot of attention from industry and academic research groups. The following works are devoted to the detection of various distracted behaviors [1], [2], [3].
Using a mobile phone while driving is a main factor increasing the probability of traffic accidents. Seshadri et al. [4] created a dataset for detecting mobile phone usage and proposed a HOG-based detection method; their AdaBoost classifier achieved 93.9% accuracy. Le et al. [5] proposed a Faster R-CNN for hand and face detection and achieved 94.2% accuracy, higher than all previous methods. Zhao et al. [6] proposed a hidden random forest method to detect drivers using small mobile phones.
Guo and Lai [7] used color and shape information to detect driving behavior. Yan et al. [8] combined a motion history map and a pyramid histogram of oriented gradients for driver behavior recognition. Tran et al. [9] proposed a real-time distraction detection system using dual cameras; their results show that using two sources of data from different-view cameras achieves better performance than a single camera, reaching 96.7% accuracy on their own dataset. Li et al. [10] designed a lightweight network combining depthwise and pointwise convolutions for driver distraction detection on the TX2 platform, achieving 95.98% accuracy at a latency of only about 32.8 ms. Baheti et al. [11] presented a modified VGG-16 network using various regularization techniques such as L2 regularization, achieving an improved accuracy of 96.13%. Koesdwiady et al. [12] used a fine-tuned VGG-19 to detect driver distraction, demonstrating that CNNs achieve better performance than traditional algorithms in distraction recognition. Hu et al. [13] used a multi-scale attention convolutional neural network to refine features in image-based distraction recognition. Wagner et al. [14] proposed a driver action recognition system to detect cell phone usage and food consumption based on two IR cameras; the combined result achieves a test accuracy of 92.54% on their dataset. Baheti et al. [15] proposed a network called MobileVGG that replaces all of VGG's fully connected layers with a 1×1 convolutional layer reduced to 512 neurons; it achieves 95.24% accuracy on the AUC dataset.
Although the above works have achieved good results, most of them focus on global information from the whole image. The core of driver distraction recognition is the driver's body, but driver feature extraction is easily disturbed by noise in complex and changeable driving environments. To obtain richer features of the driver's body while also considering the cues of the global background, this paper proposes an optimally-weighted multi-scale local feature fusion network to recognize distracted driving actions. While comprehensively extracting driver body features, the effect of the global background on the prediction is also taken into account.

III. THE PROPOSED METHOD
A. OVERVIEW
The method proposed in this paper is shown in Figure 2. Human body information plays an important role in this method. To locate the driver more accurately, we use YOLOv5 to detect the driver's body and normalize the body coordinates. Three scales of feature maps from the backbone network are used to extract driver features and are fed into the repeated multi-scale feature fusion module; the result is then fused with the global features of the backbone network under the optimally-weighted strategy to identify the driver's behavior. Our method consists of two parallel branches: one branch uses a basic CNN to obtain a high-level feature map of the whole image, while the other focuses on capturing rich multi-scale local driver features, from low level to high level, via YOLOv5 and the multi-scale architecture. Finally, we propose a weight optimization strategy that combines global average pooling and global max pooling.

B. GLOBAL FEATURE
The global feature plays an important role in driver behavior recognition, describing the relationship between driver actions and the whole image. We adopt the widely used ResNet-50 as the backbone network, with the whole image as input to extract global features. The ResNet-50 is pre-trained on ImageNet and fine-tuned, and its last fully connected layer is modified to fit the driver behavior recognition task. After the entire image passes through the backbone network, the feature map F is obtained; it then passes through a residual block (Res) to produce the global feature.
C. PERSON DETECTION
Since there are many publicly available person image datasets and high-performance object detectors, it is easy to detect human bodies using existing detectors. In this paper, YOLOv5 is used as the person detector to predict the position of the human body in the image.
The YOLOv5 network takes the whole image as input and outputs a feature map of grid cells. For a cell with top-left offset (c_x, c_y) and a prior (anchor) box of width p_w and height p_h, the network predicts box coordinates (t_x, t_y, t_w, t_h), which are decoded as b_x = σ(t_x) + c_x, b_y = σ(t_y) + c_y, b_w = p_w e^{t_w}, and b_h = p_h e^{t_h}. The detected human bounding box is then described by its center point (b_x, b_y) and its width b_w and height b_h.
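This standard YOLO-style box decoding can be sketched as follows (a minimal illustration, not the authors' code; the grid-cell offsets and anchor sizes are assumed inputs):

```python
import math

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Decode one YOLO-style box prediction into center/size coordinates.

    (tx, ty, tw, th): raw network outputs for the box.
    (cx, cy): top-left offset of the grid cell.
    (pw, ph): width/height of the anchor (prior) box.
    """
    sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))
    bx = sigmoid(tx) + cx   # box center x, constrained inside the cell
    by = sigmoid(ty) + cy   # box center y, constrained inside the cell
    bw = pw * math.exp(tw)  # width as an exponential scaling of the prior
    bh = ph * math.exp(th)  # height as an exponential scaling of the prior
    return bx, by, bw, bh
```

With zero raw outputs the decoded box sits at the cell center with exactly the anchor size, e.g. `decode_box(0, 0, 0, 0, 3, 2, 4, 5)` gives `(3.5, 2.5, 4.0, 5.0)`.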

D. MULTI-SCALE LOCAL FEATURE FUSION MODULE
This branch focuses on learning abundant local driver body features. Following the person detection step, we apply region-of-interest (RoI) pooling on the driver regions to extract features, followed by a residual block (Res) and global average pooling (GAP) to obtain a single-scale driver body feature f_h.
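As an illustration of this step, the sketch below approximates RoI pooling by a plain crop followed by GAP (real RoI pooling quantizes the region into fixed-size bins; the function name and shapes here are hypothetical):

```python
import numpy as np

def roi_gap_feature(feature_map, box):
    """Crop a detected driver region from a feature map and apply
    global average pooling to obtain a single-scale body feature f_h.

    feature_map: array of shape (C, H, W).
    box: (x0, y0, x1, y1) in feature-map coordinates.
    """
    x0, y0, x1, y1 = box
    roi = feature_map[:, y0:y1, x0:x1]  # crop the driver region
    return roi.mean(axis=(1, 2))        # GAP over the spatial dimensions
```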
Multi-scale feature fusion can effectively aggregate the low-level detailed features and high-level semantic information of images. The multi-scale feature fusion module takes three scales of input features from the backbone network, P^in = {P^in_l1, P^in_l2, P^in_l4}, and produces the output feature P^out = h(P^in), where h(·) is the multi-scale fusion operation. P^in_ln denotes a feature map whose resolution is 1/2^n that of the input image. For example, if the input image resolution is 512 × 512, P^in_l1 denotes the feature map with resolution 256 × 256 after downsampling, and P^in_l2 the feature map whose resolution is 1/4 of the original input. The repeated multi-scale fusion operation combines feature maps of different scales.
where Conv denotes a convolution operation, DownSample a downsampling operation, and UpSample an upsampling operation.
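One round of such cross-scale fusion can be sketched as below, using 2×2 average pooling for DownSample and nearest-neighbour repetition for UpSample, with the Conv step omitted for brevity (a simplified NumPy sketch, not the paper's exact module):

```python
import numpy as np

def downsample(x):
    """Halve the spatial resolution of a (C, H, W) map by 2x2 average pooling."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def upsample(x):
    """Double the spatial resolution by nearest-neighbour repetition."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def fuse(p1, p2, p3):
    """One fusion round: each output scale sums its own input with the
    neighbouring scales resampled to match its resolution."""
    out1 = p1 + upsample(p2)
    out2 = p2 + downsample(p1) + upsample(p3)
    out3 = p3 + downsample(p2)
    return out1, out2, out3
```

Repeating `fuse` several times gives the "repeated" fusion: low-level detail and high-level semantics mix a little more with each round, while every scale keeps its original resolution.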

E. OPTIMALLY-WEIGHTED MODULE
Global features can provide distinguishable cues for driver behavior recognition, while the local multi-scale fusion features provide rich human features. Normally, such features are aggregated with global average pooling (GAP) or global max pooling (GMP), which replaces the traditional fully connected layer and can be regarded as a structural regularization that forces features to map onto the N classes. However, the choice between GAP and GMP depends entirely on the specific task. GAP averages all values, which is more suitable for local human features, but it tends to pay too much attention to frequently occurring patches; GMP is the opposite: it attends only to the largest value in the feature and thus loses some information important for action recognition. The two are therefore complementary. To obtain more comprehensive and reasonable information, this paper introduces a weight matrix ρ over the pooled global features and allocates them adaptively, where the weight matrix W_ω and the bias b_ω are learnable parameters, f_GMP is the feature map after global max pooling, and f_GAP is the feature map after global average pooling.
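The paper does not state the fusion formula explicitly; one plausible reading, sketched below, computes a per-channel weight ρ from the pooled feature with (W_ω, b_ω) and takes a convex combination of GAP and GMP (all names, shapes, and the exact formula are assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def weighted_pool(feature_map, W, b):
    """Fuse GAP and GMP with a learned weight rho in [0, 1].

    feature_map: (C, H, W); W: (C, C); b: (C,).
    """
    f_gap = feature_map.mean(axis=(1, 2))     # global average pooling
    f_gmp = feature_map.max(axis=(1, 2))      # global max pooling
    rho = sigmoid(W @ f_gap + b)              # learned per-channel weight
    return rho * f_gap + (1.0 - rho) * f_gmp  # convex combination of the two
```

With W and b at zero, ρ = 0.5 and the module reduces to the plain average of GAP and GMP; training moves ρ away from 0.5 wherever one pooling is more informative.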

IV. EXPERIMENTS
A. DATASET

1) HY LARGE VEHICLE DRIVER DATASET
The dataset comes from in-vehicle surveillance videos of trucks and buses provided by our industry partners. The cameras are installed at different positions in the vehicle, and the in-vehicle environment is complex and changeable, which effectively improves the richness of the data. Figure 3 shows sample images of the five actions in the dataset. First, the long surveillance videos are cut into short clips containing distracting behaviors; the clips are then split into frames, and representative frames are selected for the dataset. The human body is annotated by combining automatic YOLOv5 detection with manual annotation. The dataset contains 43,776 images in total: 38,756 are used for training and 5,020 for testing. Table 1 shows the details of the dataset.

B. IMPLEMENTATION DETAILS
For the global feature, we use ResNet-50 as the backbone network, initialized with pre-trained ImageNet weights. The input image is scaled to 224 × 224. For local features, YOLOv5 detects the position of the human body, and the normalized body position is used to crop local features at different scales of the backbone network. Local features are extracted by an RoI pooling operation, and the backbone network provides features at three scales: 28 × 28, 14 × 14, and 7 × 7. After repeated multi-scale feature fusion, the features at the three scales still retain their original resolutions.
The network is trained with the cross-entropy loss to update the model weights. The batch size is 64, and a stochastic gradient descent (SGD) optimizer with a momentum of 0.99 is applied. A multi-step learning-rate schedule is used, with an initial value of 1e-3 decayed by a factor of 0.1 every 20 epochs. The model is trained on an NVIDIA Tesla V100 (16 GB) under CentOS 8.0; the implementation is based on Python 3.8 and PyTorch 1.8.
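The multi-step schedule described above can be expressed as a one-line rule (a sketch of the stated hyperparameters, not the actual training code):

```python
def multistep_lr(epoch, base_lr=1e-3, gamma=0.1, step=20):
    """Multi-step learning-rate schedule: scale base_lr by gamma
    once every `step` epochs."""
    return base_lr * gamma ** (epoch // step)
```

So epochs 0-19 train at 1e-3, epochs 20-39 at 1e-4, and so on; PyTorch's `torch.optim.lr_scheduler.MultiStepLR` implements the same behavior given explicit milestones.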

C. RESULTS AND ANALYSIS
This study addresses a classification problem, for which the most common metrics are accuracy, recall, and the confusion matrix. The formulas for precision and recall are Precision = TP / (TP + FP) and Recall = TP / (TP + FN), where TP, FP, and FN denote true positives, false positives, and false negatives, respectively.
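These metrics can be computed directly from the predicted and true labels; a minimal sketch:

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Build a confusion matrix: rows are true classes, columns predictions."""
    m = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        m[t][p] += 1
    return m

def precision_recall(m, c):
    """Per-class precision TP/(TP+FP) and recall TP/(TP+FN)
    for class index c of confusion matrix m."""
    tp = m[c][c]
    fp = sum(m[r][c] for r in range(len(m))) - tp  # column sum minus TP
    fn = sum(m[c]) - tp                            # row sum minus TP
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

The diagonal of the matrix holds the correct predictions; the per-class recalls discussed below (e.g. for "looking around" vs. "safe driving") come from exactly this row-wise computation.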

1) HY LARGE VEHICLE DRIVER DATASET
In our experiments, we evaluate our proposed method against the original ResNet-50 on the HY Large Vehicle Driver Dataset, and then compare it with state-of-the-art approaches on the AUC Distracted Driver Dataset. The validation accuracies are listed in Table 2. It is clearly evident that our method (94.34%) outperforms the ResNet-50 baseline (89.25%). In particular, the recall of safe driving increased by 13%, and the recall of smoking increased by 11%. This is because the network obtains more representative features while also considering background cues, so it substantially improves on the global features of a single ResNet-50.

2) AUC DISTRACTED DRIVER DATASET
We compare our performance with recent methods [11], [16], [17], [18], [19], [20] on the public AUC Distracted Driver Dataset. Our method achieves 95.84% and 90.25% accuracy on AUC-V1 and AUC-V2, respectively. Because the training and test sets of AUC-V2 are split by individual driver, as in the real driving environment, the two sets are less correlated, so accuracy on AUC-V2 is generally lower than on AUC-V1. In [21], InceptionV3, ResNet, and a cascaded RNN are used to extract and fuse image features simultaneously, and [18] uses a densely connected network with a huge number of parameters and computations, fusing the part vector field maps and part heat maps of human pose estimation; however, using multiple networks at the same time also increases the number of parameters and computations, which is not conducive to the practicality of the algorithm. Meanwhile, [17] and [19] use sliding-window detectors to locate human body parts; such detectors are easily affected by noise in complex environments, which seriously degrades the recognition accuracy of the subsequent behavior stage. The results show that our method outperforms both DenseNet and AlexNet, and our 95.84% is close to the 96.31% of the best method. In the confusion matrices, the two values in each square represent the proportion of the predicted category and the corresponding number of samples, respectively.
In the confusion matrix, the values on the diagonal are the correct predictions, and the other values are wrong predictions. On our Large Vehicle Driver dataset, the overall accuracy and recall of this method are both 94.30%, but the recall of looking around is only 86.7%, with 9.98% of its samples (90 images) incorrectly predicted as safe driving. Although our method obtains driver features through prior person detection and eliminates interference from background noise, the posture change between looking around and safe driving is very small; the main difference lies in the driver's face, so the model easily confuses these two categories. The recall of talking on the phone is 94.97%, with 4.74% of its samples (129 images) misclassified as smoking, while the recall of smoking is 93.4%, with 5.21% (83 images) confused as calling. This is because the postures of smoking and making a phone call are very similar, and both the cigarette and the mobile phone are relatively small objects that are easily lost during the convolution process, making it difficult for the model to distinguish the two categories.
On the public AUC-V1 dataset, the overall accuracy and recall of this method are both 95.61%. The recalls of reaching behind and makeup are relatively low, at 93.69% and 94.14%, respectively, with 15 and 11 images wrongly predicted as safe driving. This is because some images of these categories in the test set have relatively small pose changes, so the model produces some misjudgments. Overall, because the AUC test set is not very large, the model's test performance is good, and only a small number of images are mispredicted. Although high accuracy is achieved on both datasets, how to make the neural network model pay finer-grained attention to the small intra-class gaps of these parts in specific tasks remains an issue for future work.

D. ABLATION STUDY
We also conducted several experiments to analyze the effectiveness of the components of our method on three datasets (HY Large Vehicle, AUC-V1, and AUC-V2), where G represents the global feature, LF the local fusion module, and OW the optimally-weighted module. The results are shown in Table 4. The performance of local fusion features is significantly improved over global features on the HY Large Vehicle Driver Dataset, but the accuracy improvement on the AUC dataset is less noticeable. The viewpoint and environment in the HY Large Vehicle dataset are more complex than in the AUC dataset, so person detection reduces the noise from the complex driving environment and changeable viewpoints, while local feature fusion obtains more abundant body features from the low and high levels. When the three components are combined, the model achieves the best accuracy. The results show that driver distraction recognition combined with human detection performs better in complex driving environments and under different viewpoints.
The ablation study confirms the effectiveness of the proposed components, which improve recognition accuracy. Combining the local fusion module and the optimally-weighted module yields accuracy 4.8% higher than the global feature alone on the HY Large Vehicle Driver Dataset, and improvements of more than 3% on the AUC dataset.

V. CONCLUSION
This paper proposes an optimally-weighted multi-scale local feature fusion network. Person detection and a local multi-scale repeated feature fusion structure are combined to obtain rich human features, and a weight optimization strategy based on GMP and GAP focuses the network on learning representative global and local human features. While fully considering human-centered driving behavior recognition, the method also attends to global context cues, and it achieves competitive results on both the self-built HY Large Vehicle Driver Dataset and the public AUC Distracted Driver Dataset. How to use the driver's body keypoint positions to obtain more detailed and discriminative visual and spatial features is our next research direction.