Toward Extremely Lightweight Distracted Driver Recognition With Distillation-Based Neural Architecture Search and Knowledge Transfer

The number of traffic accidents has been continuously increasing in recent years worldwide. Many accidents are caused by distracted drivers, who take their attention away from driving. Motivated by the success of Convolutional Neural Networks (CNNs) in computer vision, many researchers have developed CNN-based algorithms to recognize distracted driving from a dashcam and warn the driver against unsafe behaviors. However, current models have too many parameters, which makes them infeasible for vehicle-mounted computing. This work proposes a novel knowledge-distillation-based framework to solve this problem. The proposed framework first constructs a high-performance teacher network by progressively strengthening the robustness to illumination changes from shallow to deep layers of a CNN. Then, the teacher network is used to guide the architecture searching process of a student network through knowledge distillation. After that, we use the teacher network again to transfer knowledge to the student network by knowledge distillation. Experimental results on the Statefarm Distracted Driver Detection Dataset and AUC Distracted Driver Dataset show that the proposed approach is highly effective for recognizing distracted driving behaviors from photos: (1) the teacher network's accuracy surpasses the previous best accuracy; (2) the student network achieves very high accuracy with only 0.42M parameters (around 55% of the previous most lightweight model). Furthermore, the student network architecture can be extended to a spatial-temporal 3D CNN for recognizing distracted driving from video clips. The 3D student network largely surpasses the previous best accuracy with only 2.03M parameters on the Drive&Act Dataset. The source code is available at https://github.com/Dichao-Liu/Lightweight_Distracted_Driver_Recognition_with_Distillation-Based_NAS_and_Knowledge_Transfer.


I. INTRODUCTION
As defined by the National Highway Traffic Safety Administration in the United States (NHTSA), distracted driving is "any activity that diverts attention from driving" [1], [2], such as drinking or talking to passengers. Distracted driving has become a serious threat to modern society. For example, as reported by the NHTSA, traffic accidents caused by distracted driving led to 3,142 deaths in the United States in 2019, or 8.7 percent of all traffic fatalities [3].
Recently, Advanced Driver Assistance Systems (ADAS) have been developed to provide technologies that alert the driver to potential problems and prevent accidents. As one of the basic and most important technologies of ADAS, distracted driver recognition (DDR) has attracted much interest from the academic community [4]-[7]. Many approaches have been developed that use the images taken by a dashcam to recognize whether the driver is driving safely or performing some category of distracted driving action [8]-[13]. Thanks to these efforts, the recognition accuracy of the DDR task has been increasing, especially since convolutional neural networks (CNNs) were introduced into this field [8], [10], [14], following their success in many other fields. However, the accuracy improvement is generally obtained at the cost of an increased CNN parameter size. The huge parameter size becomes a serious problem for real-world applications because of the limitations of vehicle-mounted computing equipment. The purpose of this paper is to design a lightweight and fast network for DDR with high accuracy, which will be very useful for intelligent transportation system (ITS) applications. In the remainder of this section, we start with a review of the existing DDR methods and then briefly present a general overview of our approach.

A. Existing Distracted Driver Recognition Approaches
Recently, with the success of CNNs in the computer vision field, it has become common to use deep learning models to solve distracted driver recognition (DDR) tasks [8], [15], [16]. For example, Yan et al. [16] embedded local neighborhood operations and trainable feature selectors within a deep CNN, and by doing so, meaningful features could be selected automatically to recognize distracted drivers.
Fig. 1. Examples of images taken by a camera monitoring the driver's behavior under different illumination conditions. The ground-truth label of the images is "Drink". The images are from the AUC Distracted Driver Dataset [24].
However, the introduction of CNNs causes the problem of huge parameter size. There are some recent lightweight networks designed for general-purpose computer vision, such as MobileNet [17], MobileNetV2 [18] and SqueezeNet [19]. However, these lightweight networks are not specifically designed for DDR, and therefore there is still room for improvement regarding DDR accuracy and the number of parameters.
There are now also some lightweight networks designed specifically for DDR by hand. For example, Baheti et al. [20] proposed MobileVGG, which reduces the number of parameters by replacing the traditional convolution in the classical VGG structure with depth-wise and point-wise convolutions. D-HCNN [21] is another example, which uses an architecture containing four convolution blocks with filters of rather large spatial sizes and achieves high performance with a small number of filters. However, these networks were designed entirely by hand based on experience with networks used for general-purpose computer vision tasks, so the full potential of the network structure may not be reached. Moreover, D-HCNN requires the histogram of oriented gradients (HOG) [22] in addition to the RGB image as input. HOG counts occurrences of gradient orientation in localized portions of an image and describes the appearance and shape of the local objects. The computation of HOG requires extra processing effort and is not favorable for real-world applications.
In this work, we search for an optimal architecture for the DDR task, which has fewer parameters and higher accuracy than the above studies. Our approach is designed by NAS rather than entirely by hand and only requires RGB images as inputs.

B. A Brief Overview of the Proposed Approach
To solve this problem, we propose a distillation-based neural architecture search and knowledge transfer framework. Overall, the proposed framework is based on knowledge distillation [23], which refers to the process of transferring knowledge from a large model (teacher network) to a smaller one (student network). The proposed framework includes three steps: (i) constructing a strong teacher network; (ii) searching for and defining the architecture of a student network under the supervision of the teacher network; (iii) transferring the knowledge from the teacher network to the student network.
Teacher Network. The teacher network is built based on progressive learning (PL). PL is a training strategy that starts the training from shallow layers and then progressively deepens the model by adding new layers to the model [8], [15], [16]. In some studies, PL is also regarded as partitioning a network into several segments and progressively training the segments from shallow to deep [25], [26]. PL was originally proposed for generative adversarial networks [27]. It started with low-resolution images, and then progressively increased the resolution by adding layers to the networks. For example, Wang et al. [28] proposed to progressively cascade residual blocks to increase the stability of processing extremely low-resolution images with very deep CNNs. Shaham et al. [29] proposed to reconstruct high-resolution images by a progressive multi-scale approach that progressively up-samples the output from the previous level. Recently, PL has also been applied to fine-grained image classification tasks. For example, Du et al. [25] and Zhao et al. [26] used PL to fuse information from previous levels of granularity and aggregate the complementary information across different granularities.
In this work, we introduce PL into DDR to solve the problem caused by various illumination conditions, such as sunlight and shadow. As shown in Figure 1, in the real world, the dashcam commonly records the driver's behavior under different illuminations, while color itself is susceptible to the influence of illumination. RGB information changes considerably under different illuminations, which causes strong intra-class variance in the DDR task. Such intra-class variance affects CNNs from shallow to deep layers. The shallow layers of a CNN tend to learn basic patterns, such as different orientations, parallel lines, curves, and circles, while the deep layers tend to encode the patterns learned by shallow layers to capture more semantically meaningful information, such as hands and body [30]. If the shallow layers rely only on what they learned under bright illumination about which basic patterns are discriminative, they might fail to find enough discriminative basic patterns in the shadows.
In this work, we progressively train the teacher network for several stages. During the stages, the training starts from shallow layers and progressively goes deeper with random brightness augmentation [31] to increase the robustness to the illumination of the layers from shallow to deep. Thereafter, we use the original image to train the aggregation of the models of all stages, considering that the random brightness augmentation might lose some visual information.
Student Network. The student network is a compact network that should be able to achieve high recognition performance. This leads to a research question: how to define the architecture of the student network to make it compact, lightweight, yet powerful for DDR, by utilizing the knowledge of the teacher network as supervision?
To answer this question, we turn to neural architecture search (NAS). NAS refers to the process of automating architecture engineering to learn a network topology that can achieve the best performance on a certain task [32]-[34].
The major components of NAS include the searching space, the searching algorithm, and the evaluation strategy [32]. With prior knowledge about typical properties of architectures, NAS approaches commonly define the searching space as a large set of operations (e.g., convolution, fully-connected, and pooling). Each possible architecture in the searching space is evaluated by a certain evaluation strategy [32], [33], and the searching process is controlled by a certain searching algorithm, such as reinforcement learning [33], [35], [36], evolutionary search [37], differentiable search [38], or other learning algorithms [34], [39]-[41]. NAS commonly defines a searching space first and then uses a certain policy to generate a sequence of actions in the searching space to specify the architecture.
In this work, we propose a new searching approach for DDR based on the characteristics of the images in the DDR task. The searching space and the searching strategy are defined as follows.
Searching Space. The images in the DDR task have less diversity and much stronger inter-class similarity than those in many other image recognition tasks. For example, in the fine-grained image recognition task of CUB Birds [42], the images contain birds of different species, backgrounds of different habitats, etc. However, in the DDR task, almost all the images can be roughly described as "a human is driving." Thousands of images showing different driving behaviors might depict the same person, and the backgrounds of all the images are actually the interior of the same car.
Due to the above reason, a large proportion of the visual information does not provide discriminative clues in the DDR task. For example, in CUB Birds, the color of wings, the shape of heads, etc. all provide useful information. Sometimes, even the background provides useful information as a bird image with the sea as the background highly likely shows a certain sea bird. In contrast, in the DDR task, the color of the driver's clothes, the shape of the driver's glasses or hat, almost all the background, etc. are useless information.
Consequently, the models for the DDR task do not need a huge number of object detectors. The key is to explore some discriminative objects, which are quite universal among different driving behaviors, such as hands, body pose, steering wheel, etc. In CNNs, depth influences the flexibility, and each channel of the filters acts as an object detector [30]. Thus, the architecture for DDR does not require a very deep structure and a huge number of channels. The above claim is backed up by some earlier observations that the architecture of a decreased number of layers and channels can achieve good results in DDR [20], [21].
On the other hand, the architecture for DDR must be able to effectively find and capture useful clues from the limited discriminative objects, which is very difficult because: (i) the inter-class similarity is strong; (ii) the key objects vary largely in size (e.g., hands and body). In this work, we introduce pyramidal convolution (PyConv) [43] into the DDR task. In a standard convolution layer, all the filters have the same spatial size. In contrast, a PyConv layer uses convolution filters of different spatial sizes, and the filters can be divided into several groups. Thus, PyConv has very flexible receptive fields, which is beneficial for capturing key objects of different sizes. Also, due to its flexibility, PyConv provides a large pool of potential network architectures. In this work, the main searching space is defined as the candidate combinations of the filters' spatial sizes and the number of groups. Moreover, the pooling method applied in the model also influences the performance of capturing key objects [44]. We therefore also search whether to use max pooling or average pooling in the layers.
Searching Strategy. Most NAS methods train the possible candidate networks one by one and evaluate the performance of the trained candidate networks on a validation set [32], [45]. The evaluation results are used as metrics to update the architecture searching process. However, the process of candidate evaluation can be very expensive in terms of time, memory, and computation. In this work, since we have already constructed a powerful teacher network, we directly use the teacher network to guide the searching. Specifically, we first build a super student network that aggregates all the candidates with a weighted sum, whose weights are regarded as the probabilities of choosing each candidate. Then the super student network is trained to learn from the teacher network by knowledge distillation. After the training, the candidates with the maximum weight are chosen to build the architecture of the student network.
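To make the effect of mixed kernel sizes and grouping on model size concrete, the following minimal sketch counts the weights of a standard convolution and of a two-level PyConv-style layer. The function names and the example configuration (64 input channels, two levels of 32 output channels each) are illustrative assumptions and are not taken from Table I; biases are ignored.

```python
def conv_params(kernel, in_ch, out_ch, groups=1):
    """Weights of one 2D convolution: kernel * kernel * (in_ch / groups) * out_ch."""
    assert in_ch % groups == 0 and out_ch % groups == 0
    return kernel * kernel * (in_ch // groups) * out_ch


def pyconv_params(in_ch, levels):
    """levels: list of (kernel, out_ch, groups); every level reads all input channels."""
    return sum(conv_params(k, in_ch, c, g) for k, c, g in levels)


# A single 7x7 convolution with 64 input and 64 output channels.
standard = conv_params(7, 64, 64)

# Hypothetical pyramid: a 3x3 level (1 group) and a 7x7 level (8 groups),
# each producing 32 of the 64 output channels.
pyramid = pyconv_params(64, [(3, 32, 1), (7, 32, 8)])

print(standard, pyramid)
```

In this toy configuration the large 7 × 7 filters are kept, but grouping makes them far cheaper than in the standard layer, which is the kind of trade-off the search over spatial sizes and group counts explores.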
After defining the architecture of the student network, the teacher network is utilized again to transfer knowledge to the student network.
Our contributions are summarized as follows:
- We propose a novel framework for solving the DDR task with high accuracy and a small number of parameters. The research question is solved by the proposed searching strategy.
- We mainly carried out the experiments of training the teacher network, defining the student network, and evaluating the performance of the teacher and student networks on two image-based DDR datasets, namely the AUC Distracted Driver Dataset (AUCD2) [24] and the Statefarm Distracted Driver Detection Dataset (SFD3) [46]. The experimental results show that the teacher network achieves 96.35% on the AUCD2 and 99.86%-99.91% in different splitting settings on the SFD3 with 44.62M parameters, which outperforms the previous state-of-the-art approaches on both datasets. Note that the previous best approach on the AUCD2 requires 140M parameters.
- The student network achieves 95.64% on the AUCD2 and 99.86%-99.91% in different splitting settings on the SFD3 with only 0.42M parameters.
- The student network architecture can be extended into a spatial-temporal 3D convolutional neural network by replacing the 2D layers with spatial-temporal 3D layers [47]-[50]. We carried out comprehensive experiments on all the tasks of the Drive&Act Dataset (DAD) [51], which is a video-based DDR dataset. The 3D student network is 0.89%-29.00% higher than the previous best accuracy on the validation set and 2.05%-30.88% higher than the previous best accuracy on the test set. The 3D student network requires only 2.03M parameters.

A. Teacher Network Construction
Algorithm 1. Building the teacher network based on progressive learning.
Inputs: the dataset $D$ ($I$ is the total number of images in $D$), and $N$ stages $\{s_1, s_2, \ldots, s_n, \ldots, s_N\}$ of the backbone feature extractor $E$.
    ...
    input_n = BrightnessAugmentor(input)
    x_n = s_n(input_n)
    ...
    BACKPROP(L_n)
    end for
    for n ∈ [1, N] do
        input_n = input
        ...
    end for
    ...
    end for

In this subsection, we introduce the details of the teacher network. Let $E$ be the backbone feature extractor, which can be based on any state-of-the-art model, such as SKResNeXt50 [52]. The layers of $E$ are divided into $N$ segments $\{m_1, m_2, \ldots, m_n, \ldots, m_N\}$. Let $\{s_1, s_2, \ldots, s_n, \ldots, s_N\}$ be $N$ consecutive stages from shallow to deep. At each stage, the training always starts from the first layer of $E$. From $s_1$ to $s_N$, the training gradually goes deeper and covers more layers of $E$. That is, the segments under training at stage $s_n$ are $m_1 + m_2 + \cdots + m_n$. Let $\{x_1, x_2, \ldots, x_n, \ldots, x_N\}$ denote the output feature maps at $\{s_1, s_2, \ldots, s_n, \ldots, s_N\}$, where $x_n \in \mathbb{R}^{H_n \times W_n \times C_n}$ is the output feature map at the stage $s_n$, and $H_n$, $W_n$, and $C_n$ respectively denote the height, width, and number of channels of $x_n$. We use a set of operations to process each $x_n$ into a vector $v_n$. In these operations, $f^{maxpool}_{H \times W}(.)$ denotes a max-pooling operation whose window size is $H \times W$. $f^{conv}(.)$ denotes a 2D convolution operation, subscripted by its kernel size. For example, $f^{conv}_{1 \times 1 \times C \times \frac{L}{2}}(.)$ denotes a 2D convolution operation whose kernel size is $1 \times 1 \times C \times \frac{L}{2}$ ($1 \times 1$ is the spatial size, $C$ is the number of input channels, and $\frac{L}{2}$ is the number of output channels). $f^{bn}(.)$ denotes the batch normalization operation [53], and $f^{ReLU}(.)$ denotes the ReLU operation.
Thereafter, we use a set of operations $\{\psi_1(.), \psi_2(.), \ldots, \psi_n(.), \ldots, \psi_N(.)\}$ to respectively process $\{v_1, v_2, \ldots, v_n, \ldots, v_N\}$ and predict the probability distributions $\{p_1, p_2, \ldots, p_n, \ldots, p_N\}$ over the classes at each stage, i.e., $p_n = \psi_n(v_n)$, where $p_n \in \mathbb{R}^K$, and $K$ denotes the number of classes of driving behaviors. In $\psi_n(.)$, $f^{fc}_{\frac{L}{2} \times K}(.)$ denotes a fully connected layer whose input size is $\frac{L}{2}$ and output size is $K$, and $f^{fc}_{L \times \frac{L}{2}}(.)$ denotes a fully connected layer whose input size is $L$ and output size is $\frac{L}{2}$. After the last stage $s_N$, we add an additional stage that concatenates $v_1, v_2, \ldots, v_N$ and maps the concatenated vector $f^{concat}(v_1, v_2, \ldots, v_N)$ to the probability distribution $p_{N+1}$ over the classes, where $f^{concat}(.)$ denotes the concatenation operation. The teacher network is trained by using a cross entropy loss $L_{cls}(.)$ to minimize the distance between the ground truth label $p_{truth}$ and each predicted probability distribution in $\{p_1, p_2, \ldots, p_n, \ldots, p_N, p_{N+1}\}$:

$$L_{cls}(p_n, p_{truth}) = -\sum_{k=1}^{K} p^{(k)}_{truth} \log p^{(k)}_n,$$

where $p^{(k)}_n$ denotes the probability that the input belongs to the category $k$ at the stage $s_n$. $p^{(k)}_{truth}$ equals 1 if the input belongs to the category $k$, and equals 0 otherwise.
The overall algorithm of building the teacher network is given in Algorithm 1. For the stages $s_1$ to $s_N$, the input images are augmented with Imgaug [31].
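As a small worked example of the cross entropy loss described above, the sketch below evaluates it for each of the N + 1 predictions of a toy teacher. The probability values, the choice of K = 3 classes and N = 2 stages, and the function name are hypothetical and serve only as an illustration.

```python
import math


def cross_entropy(p, true_class):
    """Cross entropy against a one-hot label.

    With p_truth equal to 1 for the true class and 0 elsewhere,
    -sum_k p_truth[k] * log(p[k]) reduces to -log(p[true_class]).
    """
    return -math.log(p[true_class])


# Hypothetical predictions over K = 3 classes.
predictions = [
    [0.50, 0.30, 0.20],  # p_1, shallowest stage
    [0.70, 0.20, 0.10],  # p_2
    [0.90, 0.05, 0.05],  # p_3 = p_{N+1}, from the concatenated features
]

losses = [cross_entropy(p, true_class=0) for p in predictions]
print(losses)
```

Each prediction receives its own loss value, so a confident correct prediction (the last one here) is penalized less than an uncertain one.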
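The progressive schedule can be sketched as follows. This is a simplified illustration and not the authors' implementation: `adjust_brightness` is a plain-Python stand-in for a random brightness augmentation (it is not the Imgaug API), the augmentation range of 0.5 to 1.5 is an assumption, and the image is represented as a flat list of 8-bit values.

```python
import random


def adjust_brightness(pixels, factor):
    """Scale 8-bit pixel values by `factor` and clip to [0, 255]."""
    return [min(255, max(0, round(v * factor))) for v in pixels]


def progressive_steps(num_stages, pixels, rng):
    """Yield (stage index, segments under training, network input) for one image."""
    # Stages s_1..s_N: brightness-augmented input; stage n trains segments m_1..m_n.
    for n in range(1, num_stages + 1):
        factor = rng.uniform(0.5, 1.5)  # assumed augmentation range
        yield n, list(range(1, n + 1)), adjust_brightness(pixels, factor)
    # Additional stage N+1: the original image is used and the features
    # of all stages are aggregated.
    yield num_stages + 1, list(range(1, num_stages + 1)), list(pixels)


steps = list(progressive_steps(3, [10, 128, 250], random.Random(0)))
for stage, segments, image in steps:
    print(stage, segments, image)
```

The sketch only lists what is trained and on which input at each step; the forward pass, loss, and backpropagation of each stage are omitted.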

B. Distillation-Based Neural Architecture Search for the Student Network
The computation overhead, including speed and parameter size, plays an extremely crucial role in DDR. According to the experience of previous studies [12], [21], it is much more favorable to use large convolution filters rather than deep layers, because the former can be computed in parallel to achieve a fast processing speed that satisfies the requirements of real-world applications. Thus, in this work, we design the student network to have four convolutional blocks, which are followed by a global average pooling (GAP) layer and a fully connected (FC) layer for predicting the probability distribution over the classes.
For each block, we use pyramidal convolution (PyConv) [43] rather than standard convolution. PyConv contains a pyramid of kernels, where each level involves different types of filters with varying sizes. Using PyConv for DDR has two benefits. First, PyConv can capture different levels of details in the scene. A filter of a smaller kernel size has a smaller receptive field and thus can capture more local information and more detailed clues. A filter of a bigger kernel size has a bigger receptive field and thus can "see" more information at once and capture relatively more global information, such as the dependencies among some local patterns, some large objects, etc. Such multi-level details are very important for recognizing driver behaviors. Second, PyConv is flexible and extensible, giving a large space of potential architecture designs. That is, PyConv offers strong potential for finding a lightweight architecture by search.
Fig. 2. Illustration of the searching process. "CAND" is the abbreviation for "candidate". In the super student network, there are several candidates for the convolutional architectures of each block. Besides, there are two candidates of the pooling method, namely average pooling and max pooling, in each block. The candidates of convolutional architecture and pooling methods for each block are aggregated by the weighted sum. α and β are the learnable weights. "GAP" means global average pooling and "FC" means the fully-connected operation. The super student network is trained to learn from the teacher network by knowledge distillation. After the training, only the candidates with the maximum weight are kept and form the student network.
At the end of each block, we use a pooling layer to downsample the feature maps. Two types of pooling layers are widely used for this objective: max pooling and average pooling. We define our search space as the candidates of different designs of PyConv and different pooling types in the four convolutional blocks.
As shown in Figure 2, the overall process of defining the architecture of the student network is given as: at first, we construct a super student network covering all the candidates of each block. In the super student network, the output feature maps of the candidates of each block are aggregated by a weighted sum to become the input of the next block. The sum weights are learnable and represent the probability of choosing the candidates. Then the super student network is trained to learn from the teacher network. Thereafter, the final architecture of the student network is derived by selecting the candidate with the maximum probability.
Specifically, let $\{b_1, b_2, b_3, b_4\}$ denote the four blocks of the student network and super student network. For each block $b$, one set of operations denotes the candidates of the convolutional architecture, and a second set denotes the candidates of using an average pooling or max pooling layer at the end of the block $b$. Given the feature map $X^{in}_b$ outputted by the previous block, the output feature map $X^{out}_b$ of the block $b$ in the super student network is defined as the weighted sum of the candidate outputs, where the weights are the probabilities of choosing the corresponding candidates. These probabilities are computed from learnable parameters (α and β in Fig. 2) that are all initialized as 1 and optimized during the training.
All the blocks $\{b_1, b_2, b_3, b_4\}$ of the super student network are constructed by the process introduced above. The output feature map of $b_4$ is processed by the GAP and FC layers to predict the probability distribution over the classes ($p_{super}$). After the training of the teacher network, we only use $p_{N+1}$ of the teacher network for category prediction. The super student network is trained with the search loss $L_{search}(.)$, in which $L_{mse}(.)$ denotes the mean squared error loss and λ is a manually set hyperparameter. During the training of the super student network, the parameters of the teacher network are fixed. After the training, we only keep the candidate with the maximum probability and prune all the other candidates for each block to construct the student network.
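The aggregation and selection steps can be illustrated with a minimal sketch. It assumes that the candidate probabilities are obtained by a softmax over the learnable parameters, which is a common choice in differentiable architecture search but is an assumption here; the toy candidate outputs and the trained parameter values are hypothetical.

```python
import math


def softmax(weights):
    """Normalize learnable parameters into probabilities (assumed normalization)."""
    peak = max(weights)
    exps = [math.exp(w - peak) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(outputs, probs):
    """Element-wise sum of the candidate outputs weighted by their probabilities."""
    length = len(outputs[0])
    return [sum(p * out[i] for p, out in zip(probs, outputs)) for i in range(length)]


def select(weights):
    """Index of the candidate kept after the search (maximum weight)."""
    return max(range(len(weights)), key=lambda i: weights[i])


alpha = [1.0, 1.0, 1.0]  # all parameters initialized as 1
probs = softmax(alpha)   # uniform probabilities before training
candidate_outputs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # toy flattened feature maps
mixed = weighted_sum(candidate_outputs, probs)

trained_alpha = [0.2, 0.9, 2.1]  # hypothetical values after training
kept = select(trained_alpha)
print(probs, mixed, kept)
```

Because every candidate contributes to the block output during the search, a single training run of the super network scores all candidates at once instead of training each candidate network separately.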

C. Knowledge Transfer
In the former subsection, we used the teacher network to guide the search of the student network architecture; in this subsection, we use it to transfer knowledge to the student network. Assume that $p_{student}$ is the probability distribution over the classes predicted by the student network. The student network is trained with the knowledge transfer loss $L_{trans}(.)$, which, like $L_{search}(.)$, is weighted by the hyperparameter λ.
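A distillation objective of this kind can be sketched as below. The exact weighting is an assumption: here λ weights a mean squared error term between the student and teacher distributions, and (1 − λ) weights a cross entropy term against the ground truth label. The function names and the probability values are hypothetical.

```python
import math


def mse(p, q):
    """Mean squared error between two probability vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) / len(p)


def cross_entropy(p, true_class):
    """Cross entropy against a one-hot ground truth label."""
    return -math.log(p[true_class])


def distillation_loss(p_student, p_teacher, true_class, lam=0.7):
    # ASSUMPTION: lam weights the teacher-matching term and (1 - lam) the label term.
    teacher_term = mse(p_student, p_teacher)
    label_term = cross_entropy(p_student, true_class)
    return lam * teacher_term + (1.0 - lam) * label_term


p_teacher = [0.90, 0.05, 0.05]  # fixed teacher prediction
loss_far = distillation_loss([0.40, 0.30, 0.30], p_teacher, true_class=0)
loss_near = distillation_loss([0.85, 0.10, 0.05], p_teacher, true_class=0)
print(loss_far, loss_near)
```

A student whose prediction is closer to the teacher's and to the label obtains the lower loss, and only the student's parameters would be updated since the teacher is kept fixed.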

A. Dataset Description
The experiments are conducted using two types of datasets: image-based DDR dataset and video-based DDR dataset. The image-based DDR task requires recognizing the driver's behavior from each given image. The video-based DDR task requires recognizing the driver's behavior from each given video clip containing several frames. We mainly carried out the experiments of training the teacher network, defining the student network, and evaluating the performance of the teacher and student networks on the image-based DDR datasets. Then, we obtained an extremely lightweight yet powerful student network for the image-based DDR task. Thereafter, following Hara et al. [47], we extended the student network from 2D to 3D for the video-based DDR task.
For the image-based DDR task, we carried out experiments on two standard benchmark datasets for DDR: the Statefarm Distracted Driver Detection Dataset (SFD3) [46] and the AUC Distracted Driver Dataset (AUCD2) [24]. These two datasets are the most widely used datasets, and have been used for many studies on DDR. Both of the two datasets are composed of one safe driving action and nine distracted driving actions including (i) text right, (ii) talk right, (iii) text left, (iv) talk left, (v) adjust radio, (vi) drink, (vii) reach behind, (viii) hair and makeup, and (ix) talk to passenger. The images of both datasets are taken by dashboard cameras recording the driver's behavior. The sample images of the SFD3 and AUCD2 are shown in Figure 3 and Figure 4, respectively.
SFD3 is one of the most influential public datasets in the field of DDR. There are 22,424 images for training (around 2,000 images in each category) and 79,728 unlabeled images for testing. Since SFD3 does not provide the labels for the testing images, we follow the common practice of previous studies and perform experiments on the training dataset. We randomly split the training dataset of SFD3 into training and testing images at ratios of 7:3 [13], [21], 7.5:2.5 [21], [54], [55], 8:2 [14], [21], [56]-[58], and 9:1 [21], [59]. In this work, for each train-test ratio, we randomly split the images 10 times and report the average accuracy.
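The repeated random splitting protocol can be sketched as follows. The helper names, the seed handling, and the placeholder `evaluate` callback are assumptions for illustration; in practice `evaluate` would train a network on the training part and return its accuracy on the testing part.

```python
import random


def random_split(items, train_fraction, rng):
    """Shuffle the items and cut them into training and testing parts."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def mean_over_splits(items, train_fraction, evaluate, repeats=10, seed=0):
    """Average evaluate(train, test) over several independent random splits."""
    rng = random.Random(seed)
    scores = [evaluate(*random_split(items, train_fraction, rng)) for _ in range(repeats)]
    return sum(scores) / len(scores)


image_ids = range(22424)  # number of labeled SFD3 images
train, test = random_split(image_ids, 0.7, random.Random(0))
print(len(train), len(test))
```

Averaging over ten splits reduces the dependence of the reported accuracy on any single random partition.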
AUCD2 is another widely used public dataset for DDR. It has 17,308 RGB frames, of which 12,977 are for training, while the remaining 4,331 are for testing.
For the video-based DDR task, we utilized the Drive&Act Dataset (DAD) [51]. This is a large-scale video dataset consisting of various driver activities, with more than 9.6 million frames. As shown in Figure 5, the DAD provides multiple annotations for performing three types of recognition tasks on the video clips. The first task is the scenario recognition task, which requires recognizing the top-level activities (e.g., work and drink) from each given video clip. There are 12 different scenario categories in total. The second task is the fine-grained activity recognition task, which requires recognizing the specific semantic actions (e.g., open laptop, close bottle, etc.) from each video clip. There are 34 different categories of fine-grained activities in total. The third task is the atomic action unit recognition task. The atomic action units portray the lowest degree of abstraction and are basic driver interactions with the environment. The annotations of the atomic action units involve triplets of atomic action, object, and location, which are detached from long-term semantic meaning and can be regarded as building blocks for complex activities. There are five categories of atomic actions (e.g., reach for), 17 categories of objects (e.g., automation button), 14 categories of locations (e.g., center console back), and 372 possible combinations.
Super Student Network. As mentioned above, we define our search space as the candidates of different designs for the four convolution blocks of the student network and construct a super student network to cover all the candidates. The specific candidates are shown in Table I. In Table I, the design of the filters is described by kernel size, number of channels, and number of groups. For example, [11 × 11, 16, 1; 7 × 7, 16, 1] denotes a PyConv layer with two types of filters: one has an 11 × 11 kernel size and the other has a 7 × 7 kernel size. Both filters have 16 channels and 1 group. The pooling layers are described by the type and window size. For example, "Avg. Pool 2 × 2" denotes an average pooling layer with a 2 × 2 window size. The stride of all the convolution layers is set as 1 and the padding size is set as $\frac{\theta - 1}{2}$, where θ is the spatial size of the filter. Thus, the convolution layers do not change the spatial size of the feature maps. The stride of the pooling layers is set as 2, so the height and width of the feature maps are halved after each pooling layer.
Training Details. For the experiments on the image-based datasets, all learning rates are set as 0.002 with cosine annealing [60] during the training. Weight decay is set as $5 \times 10^{-4}$. The input images are resized to 256 × 256; a random crop of a 224 × 224 region is applied for training and a center crop of a 224 × 224 region for testing. We set the batch size as 32 and train each network for 300 epochs. The manual hyperparameter λ in Equation 11 and Equation 12 is set as 0.7, which is a common setting for distillation. For the experiments on the video-based dataset, we follow the settings of Hara et al. [47]. Specifically, the learning rate is set as 0.001 with a plateau scheduler [47]. Weight decay is set as $1 \times 10^{-5}$. The batch size is set as 32, and 16 frames (16 × 3 × 112 × 112) are sampled from each video clip by uniform sampling.
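The effect of these stride and padding settings on the feature map size can be checked with the standard output-size formula. The sketch below assumes odd kernel sizes, a 2 × 2 pooling window in every block, and a 224 × 224 input crop; the list of kernel sizes is illustrative and is not a statement of which sizes appear in Table I.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output height or width of a convolution layer."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, window=2, stride=2):
    """Output height or width of a pooling layer without padding."""
    return (size - window) // stride + 1


size = 224  # spatial size of the training crop

# With stride 1 and padding (theta - 1) / 2, an odd kernel keeps the spatial size.
for theta in (3, 5, 7, 9, 11):
    assert conv_out(size, theta, stride=1, padding=(theta - 1) // 2) == size

# Four blocks, each ending in a stride-2 pooling layer (assumed 2 x 2 window).
sizes = [size]
for _ in range(4):
    sizes.append(pool_out(sizes[-1]))
print(sizes)  # [224, 112, 56, 28, 14]
```

Under these assumptions, only the pooling layers reduce the resolution, so the feature map entering the GAP layer is 14 × 14.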
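For reference, a cosine annealing schedule with the image-based settings above can be sketched as follows. The sketch assumes a single decay cycle over the 300 epochs, a per-epoch update, and a minimum learning rate of 0; these details are not stated in the text.

```python
import math


def cosine_annealing_lr(epoch, total_epochs, base_lr, min_lr=0.0):
    """Learning rate at `epoch` for one cosine decay cycle from base_lr to min_lr."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cosine)


# 300 epochs with an initial learning rate of 0.002.
schedule = [cosine_annealing_lr(e, 300, 0.002) for e in range(301)]
print(schedule[0], schedule[150], schedule[300])
```

The rate starts at the initial value, reaches half of it at the midpoint, and decays smoothly toward the minimum at the end of training.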

A. Student Network Architecture Definition
As mentioned above, we first train the super student network to approximate the prediction distribution of the teacher network. We carry out this experiment on the AUCD2, as it is a more challenging dataset than SFD3. The probability of choosing each candidate is shown in Table II. The candidates with the highest probability are marked with a gray background. For convolutional layers, the searching guided by the teacher network chooses the third candidate for $b_1$ and the first candidate for all the other blocks. For pooling layers, the second candidate (max pooling) is selected for all the blocks. The reason might be that max pooling selects the brighter pixels or the features corresponding to the sharp pixels, and is therefore more robust to illumination changes.
Referring to Table I and Table II, we define the architecture of the student network as shown in Table III. This architecture only requires 0.42M parameters. In the following experiments, we use this architecture as the student network on both datasets.

B. Recognition Performance of the Teacher and Student Network
In this subsection, we compare the recognition performance of the teacher network with and without progressive learning (PL), the student network trained from scratch and finetuned after transferring the teacher network's knowledge to the student network. The results are shown in Table IV.
narrow possible improvement space, we suppose PL can be still regarded as effective on the SFD3. In the following experiments, we use the teacher network with PL to guide the search of the student network architecture and transfer knowledge to the student network.
On both datasets, the student network trained from scratch already achieves very high accuracy, which shows that the architecture obtained by the proposed search approach is effective for the DDR task. Knowledge distillation improves the accuracy by 0.52% on the AUCD2 and by 0.03%-0.05% on the SFD3.
Considering that the accuracy on these datasets is almost saturated, it is notable that our proposed method still leaves room for improvement.
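To make the role of the weight λ concrete, the sketch below shows a generic distillation objective that mixes a soft term (KL divergence from the teacher's softened distribution) with a hard cross-entropy term. This is a common textbook formulation and not necessarily the exact form of Equation 11 or Equation 12; which term λ multiplies, the temperature, and all function names are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label, lam=0.7, temperature=1.0):
    """lam * KL(teacher || student) + (1 - lam) * cross-entropy(student, label).

    Assumption: lam weights the soft (teacher) term; the paper's equations may differ.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)
    hard = -math.log(softmax(student_logits)[label])
    return lam * soft + (1.0 - lam) * hard


# A student that matches the teacher incurs no soft penalty.
print(distillation_loss([2.0, 0.5, 0.1], [2.0, 0.5, 0.1], label=0))
# A student that disagrees with the teacher is penalized by both terms.
print(distillation_loss([0.1, 2.0, 0.5], [2.0, 0.5, 0.1], label=0))
```

With λ = 0.7, most of the training signal in this formulation comes from matching the teacher's distribution rather than the one-hot labels.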
In addition, since the AUCD2 dataset is somewhat unbalanced, we also report the F1-score obtained on this dataset in Table V. PL improves the teacher network by 0.32%-1.54% across categories. Knowledge distillation improves the student network by 0.17%-0.84% across categories.
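For reference, a per-category F1-score can be computed from a confusion matrix as sketched below. The matrix values are made up for illustration and are unrelated to Table V; the function name is hypothetical.

```python
def per_class_f1(confusion):
    """Per-class F1 from a square confusion matrix.

    confusion[i][j] is the number of samples of true class i predicted as class j.
    """
    n = len(confusion)
    scores = []
    for c in range(n):
        tp = confusion[c][c]
        fp = sum(confusion[r][c] for r in range(n)) - tp  # predicted c, but wrong
        fn = sum(confusion[c]) - tp                       # true c, but missed
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


# Hypothetical two-class example.
example = [[8, 2],
           [1, 9]]
print(per_class_f1(example))
```

Because F1 combines precision and recall per category, it is less dominated by the majority categories than plain accuracy, which is why it is informative on an unbalanced dataset.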

C. Comparison with State-of-the-art Distracted Driver Recognition Approaches
In this subsection, we compare our performance with the state-of-the-art approaches on the AUCD2 and SFD3. Table VII shows the results on the AUCD2. The accuracy of the teacher network (96.35%) surpasses the best previous accuracy (96.31%), which is achieved by Regularized VGG-16 [12]. Regularized VGG-16 has 140M parameters, whereas the teacher network in this work has 44.62M parameters (i.e., 31.87% of the Regularized VGG-16 parameters), which shows the effectiveness of the teacher network on this dataset. The student network achieves 95.64% with 0.42M parameters. For comparison, the original VGG-16 achieves 94.44% with 140M parameters (i.e., 333.33 times the student network's parameters), and the modified VGG-16 achieves 95.54% with 15M parameters (i.e., 35.71 times the student network's parameters) [12]. Table VIII shows the results on the SFD3. Both the teacher and student networks achieve 99.86%-99.91%, which outperforms the best previous accuracy. The student network is recommended because it requires fewer parameters. D-HCNN [21] also achieves good accuracy on both datasets with few parameters. However, our student network is preferable for three reasons: (i) the student network has better accuracy than D-HCNN on both datasets; (ii) the student network has only about 55.26% as many parameters as D-HCNN; (iii) D-HCNN requires HOG images in addition to RGB images as input. Therefore, D-HCNN needs to compute the HOG feature [22] of every image, which is unfavorable for real-world applications.
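The parameter ratios quoted above can be checked with a few lines of arithmetic. The sketch below uses only the parameter counts stated in this subsection; the dictionary keys and helper names are hypothetical.

```python
# Parameter counts in millions, as stated in the text above.
PARAMS_M = {
    "regularized_vgg16": 140.0,
    "original_vgg16": 140.0,
    "modified_vgg16": 15.0,
    "teacher": 44.62,
    "student": 0.42,
}


def percent_of(part, whole):
    """Return `part` as a percentage of `whole`, rounded to two decimals."""
    return round(100.0 * part / whole, 2)


def times_larger(big, small):
    """Return how many times larger `big` is than `small`, rounded to two decimals."""
    return round(big / small, 2)


print(percent_of(PARAMS_M["teacher"], PARAMS_M["regularized_vgg16"]))
print(times_larger(PARAMS_M["original_vgg16"], PARAMS_M["student"]))
print(times_larger(PARAMS_M["modified_vgg16"], PARAMS_M["student"]))
```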
Moreover, the student network has better real-time performance than other lightweight models. As shown in Table VI 10 GFLOPs and takes 7.40 ms. As D-HCNN requires HOG images as additional input, it takes an additional 1.48 ms per image to compute the HOG features. Compared to previous lightweight networks, our network has no significant advantage in terms of GFLOPs but is clearly faster. This is because the parallelism of a convolutional network lies mainly in the computation within each layer, and there is generally no parallelism across layers. Therefore, for convolutional neural networks used in high-speed DDR, larger convolutional filters are preferable to a very deep stack of layers. This was also pointed out and experimentally verified by Qin et al. [21]. Another advantage of our network is the aforementioned lower number of parameters, which means that our network requires less storage and memory space and can be deployed more easily on in-vehicle devices. Since GPUs can process multiple images in parallel, we also compare the time consumption of our network with that of other lightweight networks when processing multiple images in parallel. For processing one batch of images (
As the proposed teacher and student networks achieve very high accuracy on both image-based DDR datasets [24], [46], it is important to know which images cause the small number of recognition failures. Figure 6 shows typical failure cases together with their ground-truth labels and the predictions given by the proposed networks (the teacher network or the student network). These failure cases are confusing even for humans.

D. Extending the Student Network to 3D for the Video-based Distracted Driver Recognition
The above experiments have established a lightweight yet powerful network architecture (i.e., the student network) for image-based DDR. In this subsection, we extend the student network into a spatial-temporal 3D network to evaluate whether the 3D student network can reproduce, on the video-based DDR dataset [51], the success of the student network architecture proposed for image-based DDR. This experiment is inspired by the experiments of Hara et al. [47], in which the researchers replaced the 2D layers (e.g., 2D convolutional layers, 2D batch normalization layers, etc.) of the ResNet architectures [64] with 3D layers (e.g., 3D convolutional layers, 3D batch normalization layers, etc.) and showed that using 3D ResNet architectures together with Kinetics [65] can retrace the successful history of 2D CNNs on ImageNet [66].
Following Hara et al. [47], we set the size of the third dimension of each 3D convolutional kernel to be the same as the size of the first and second dimensions. For example, a 2D convolutional kernel of a 3 × 3 kernel size is extended to a 3D convolutional kernel of a 3 × 3 × 3 kernel size.
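The kernel extension rule can be sketched as follows, together with its effect on the weight count of a single convolutional layer. The channel sizes are hypothetical, bias terms are ignored, and the helper names are not from the paper.

```python
def inflate_kernel(kernel_2d):
    """Extend a square 2D kernel size (k, k) to a 3D kernel size (k, k, k)."""
    k_h, k_w = kernel_2d
    if k_h != k_w:
        raise ValueError("this sketch only handles square 2D kernels")
    return (k_h, k_h, k_w)


def conv_weight_count(in_channels, out_channels, kernel):
    """Number of weights in a convolution, excluding bias."""
    count = in_channels * out_channels
    for k in kernel:
        count *= k
    return count


kernel_3d = inflate_kernel((3, 3))
# Hypothetical layer with 16 input and 32 output channels.
print(kernel_3d)
print(conv_weight_count(16, 32, (3, 3)), conv_weight_count(16, 32, kernel_3d))
```

Each k × k layer gains a factor of k in its weight count when extended this way, which is consistent with the 3D student network having more parameters (2.03M) than the 2D student network (0.42M).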
We conducted comprehensive experiments to evaluate the performance of the 3D student network on all the tasks of the DAD. The accuracy on each split and the average accuracy over the three splits are shown in Table IX. The comparison with the state-of-the-art approaches on the DAD is shown in Table X. The 3D student network outperforms the state-of-the-art approaches by a large margin on both the validation and test sets. Our approach is 0.89%-29.00% higher than the previous best accuracy on the validation set and 2.05%-30.88% higher than the previous best accuracy on the test set. Besides, the 3D student network has only 2.03M parameters and is much more lightweight than the state-of-the-art approaches. The parameter size of the 3D student network is only 16.48% of the parameter size of C3D [48], 3.09% of the parameter size of P3D ResNet [49], and 2.60% of the parameter size of I3D [50]. Moreover, the student network has better real-time performance than those 3D convolutional neural networks. As shown in Table VI

E. Discussion on the Implications of the Proposed Framework for ITS Applications
The implications of our approach for applications are as follows:
- We construct a powerful teacher network using progressive learning to increase robustness to illumination changes from shallow to deep layers of a backbone CNN. The classification accuracy of the teacher network exceeds that of all existing approaches, so it is well suited to DDR applications that prioritize high accuracy over a small computational overhead.
- Using NAS and knowledge distillation, we generate an effective student network with the guidance of the teacher network. The student network achieves high DDR accuracy with a smaller parameter count and a shorter inference time than any existing lightweight DDR network. The student network is suitable for applications with strict requirements on parameter count and inference time.
- We extend the student network into a spatial-temporal 3D network for performing DDR based on short video clips. The 3D student network has better DDR accuracy, a smaller parameter size, and faster speed than the existing approaches. The 3D student network is suitable for applications built on video clips.
- Our proposed framework combining knowledge distillation and NAS has the potential to become a general DDR network design framework for different applications.

V. ADDITIONAL EXPERIMENTS
We also evaluate our approach on three additional datasets, which are not designed for the DDR task but share the same characteristics: small diversity and strong inter-class similarity. The three additional datasets are the Sign Language Digits Dataset (SLD2) [67], the Gesture Dataset 2012 (Gesture2012) [68], and the UIUC Sports Event Dataset (USED) [69]. SLD2 and Gesture2012 are image datasets for hand sign language recognition, which are also used by Qin et al. [21] as additional datasets to evaluate D-HCNN [21]. USED is an image dataset for sports event recognition.
On the three additional datasets, we compare the recognition performance of the teacher network with and without progressive learning (PL), the student network trained from scratch, and the student network finetuned after knowledge transfer. The results are shown in Table XI. We also compare our approach with the state-of-the-art approaches on the three additional datasets, and the results are shown in Table XII. Both the teacher and student networks achieve 99.74% on the SLD2 and 100% on the Gesture2012, reaching state-of-the-art performance on the two datasets. The student network has far fewer parameters than the other state-of-the-art approaches on these two datasets.
On the USED, the improvement brought by PL and knowledge transfer is obvious. PL improves the teacher network by 0.84% and knowledge transfer improves the student network by 2.91%. The accuracy of the teacher network is 98.75%, which surpasses the best previous accuracy by 3.55%. The student network achieves 92.08% with 0.42M parameters.

VI. CONCLUSION
In this paper, we proposed a novel framework for distracted driver recognition that achieves high accuracy with a small number of parameters. This framework first builds a powerful teacher network based on progressive learning and then uses the teacher network to guide the search for an optimal architecture for a student network, which is lightweight but can achieve high accuracy. Thereafter, the teacher network is used again to transfer knowledge to the student network. The teacher network outperforms the previous state-of-the-art approaches on the Statefarm Distracted Driver Detection Dataset and the AUC Distracted Driver Dataset. The student network achieves high accuracy with an extremely small number of parameters on both datasets. The student network architecture can be extended into a spatial-temporal 3D convolutional neural network for recognizing distracted driving behaviors from video clips. The 3D student network significantly outperforms the previous state-of-the-art approaches with only 2.03M parameters on the Drive&Act Dataset.