1. Introduction
Human-machine interfaces (HMIs) for car information and entertainment systems play an important role in safe driving and offer a convenient means of controlling navigation and other automotive functions. Speech interfaces are currently employed in car HMIs to reduce driving distraction. In practice, drivers must handle complex situations inside and outside the car, such as difficult traffic conditions, unclear navigation instructions, and limited visibility. In such conditions, drivers may become confused because they lack information about how to proceed. Often, the needed information is available via the HMI, but the driver does not have enough time to retrieve it using speech or manual interfaces. If the system can anticipate these situations, it can proactively provide more helpful information. We therefore propose to detect driver confusion in order to provide a more proactive interface.

There has been some prior work directed at detecting the driver's state, or likely actions, using sensor data available in the vehicle. Available data may include traffic conditions, navigation status, vehicle status, and information about the driver's behavior that can be extracted from sensors such as cameras and microphones. In prior work, corpora of such data have been recorded during driving and annotated according to driver status and driving conditions [1]–[3]. In these studies, data-driven approaches were used for prediction. For example, the driver's emotional state was detected using a Bayesian network trained on multimodal data consisting of traffic conditions, driving conditions, and the driver's facial expressions [4]. Gaussian mixture models estimated from speech signals have been used to detect driver stress [5]. In addition, destination prediction and driver action prediction have been investigated using driving condition histories, obtained from the controller area network (CAN) bus, and the navigation system status [6]. All of these approaches employed classification without modeling the dynamics of the signals. However, it has been suggested, in the context of stress detection in speech, that the temporal dynamics of sensor data and the dependencies between multiple features are important [7].

Recently, neural network models such as feed-forward deep neural networks (DNNs), recurrent neural networks (RNNs) and related architectures such as long short-term memory (LSTM) RNNs, and convolutional neural networks (CNNs) have been shown to dramatically improve the performance of speech and image recognition. In addition, speaker emotion detection from speech signals has been investigated using RNN and LSTM models [8], [9]. The sensor data involved in driver state prediction are challenging to model because of their large variability and wide dynamic range. Deep network models may be better able to capture the dynamics and interdependencies in such sensor data than previous approaches. In this study, therefore, we propose, as a proof of concept, to apply deep network architectures to the problem of predicting driver confusion. Since the complexity of the problem relative to the amount of data in our corpus is unknown, we compare performance across a variety of models: logistic regression (LR), DNNs, RNNs, and LSTM-RNNs.
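As a rough illustration of the sequence models compared here, the following is a minimal sketch, not the authors' implementation, of an LSTM-based binary classifier over multimodal sensor feature sequences; the feature dimension, hidden size, and sequence length are illustrative placeholders.

```python
# Minimal sketch (assumed PyTorch implementation, not the authors' code):
# an LSTM that maps a sequence of per-frame sensor features to a
# per-sequence probability of driver confusion.
import torch
import torch.nn as nn

class ConfusionLSTM(nn.Module):
    def __init__(self, feat_dim=32, hidden_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, 1)  # confused vs. not confused

    def forward(self, x):
        # x: (batch, time, feat_dim) sequence of sensor features
        _, (h_n, _) = self.lstm(x)                # final hidden state
        return torch.sigmoid(self.out(h_n[-1]))  # sequence-level probability

# Example: 8 sequences, 100 frames each, 32 features per frame
model = ConfusionLSTM()
probs = model(torch.randn(8, 100, 32))  # shape: (8, 1)
```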