Accident Prediction Model Using Divergence Between Visual Attention and Focus of Expansion in Vehicle-Mounted Camera Images

Recently, accident prediction models, which predict the occurrence of traffic accidents through deep learning algorithms, have been proposed. The application of these models demands both high precision and visualization of the decision basis. Current models use the motion features of objects in the surrounding environment, but they do not predict well when the motion feature of the risk factor is small. Meanwhile, drivers can avoid accidents because they utilize visual attention functions. This study focuses on the divergence between visual attention and focus of expansion (FOE), which are highly correlated in normal driving situations, as the basis for an accident prediction method. By combining this divergence with Dynamic-Spatial-Attention, a deep-learning-based accident prediction method, the proposed model predicts accidents with high accuracy and visualizes the decision basis, even when the motion feature of the risk factors is small. In the experiment, we classified data from the Dashcam Accident Dataset, a widely used accident dataset, into accident categories. On this dataset, the proposed method achieves higher accident prediction performance in categories for which the motion feature of risk factors tends to be small, while maintaining the same performance as the baseline Dynamic-Spatial-Attention method in categories for which the motion feature of risk factors tends to be large. In addition, the proposed method visualizes the risk factors using visual attention and FOE to provide a visual explanation of the decision basis.

Meanwhile, the visual attention of the driver is focused on the risk factor even when the motion feature is small, regardless of whether the object belongs to a class known to the object detector [31], [32], [33]. This is because, unlike current bottom-up models based on low-dimensional image features, driver visual attention has top-down properties that are based on driving experience. This visual function makes avoiding accidents possible, suggesting that driver visual attention is useful for accident prediction tasks. When using visual attention for accident prediction tasks, it is difficult to measure the driver visual attention in real time with an eye tracker, because of the need to adapt to automated driving and the cost of installing an eye tracker. Therefore, we use a visual attention model that can estimate the driver visual attention from in-vehicle camera images by using actual driver gaze data for training. Estimated driver visual attention is strongly correlated with the focus of expansion (FOE) during normal driving [34], [35], [36]. However, during abnormal events, visual attention is likely to diverge from FOE because visual attention is directed toward risk factors. In this study, we propose an accident prediction model that incorporates the divergence between visual attention and FOE, as shown in Fig. 2. Furthermore, even a highly accurate, open-source saliency-based model such as DRIVE [27] cannot explicitly show the risk factors, because the predicted saliency map itself is used for visualization. In contrast, the proposed model visualizes the risk factors by using visual attention and FOE, enabling a visual explanation of the decision basis of the model. Visualization of the decision basis is an important element for the practical application of accident prediction models because it reduces the black-box behavior of deep learning models. The contributions of this study are as follows.
1) By using a visual attention model trained with driver gaze data, top-down knowledge of the driver is utilized in an accident prediction model.
2) An accident prediction model is proposed that incorporates the divergence between driver visual attention and FOE and enables the prediction of accidents that are difficult to estimate from object and motion features alone.
3) The proposed accident prediction method achieves high accuracy in accident prediction using the Dashcam Accident Dataset, including accidents such as rear-end collision, head-on collision, turn, and crossing collision.
4) The visualization of risk factors from visual attention and FOE can provide a visual explanation of the decision basis for accident prediction.

1) MODELS BASED ON MOTION FEATURES
These models use the motion features of objects, capturing abnormal motion by extracting motion features from time-series data with optical flow or LSTM to predict accidents. Anopred [17] uses optical flow and U-Net [37] to predict future frames. Anomalies are detected by comparing the object motion in the predicted future frame with the actual motion. Kataoka et al. [18] used semantic flow to separate the background for risk factor recognition. CST-S3D [19] extracts motion features based on 3D CNN [38], which can process spatio-temporal information, on training data augmented by an image transformation model. SP [20] uses LSTM to account for longer-term temporal relationships. This mechanism learns the general movements of pedestrians and predicts their future trajectories. ConvLSTMAE [21] also uses LSTM to extract motion features and detects anomalies using the reconstruction error between the input image and the image obtained by an autoencoder. Karim et al. [22] also used a recurrent neural network and proposed a model based on the Gated Recurrent Unit (GRU) [39], which is more computationally efficient than LSTM. AdaLEA [23] employed a quasi-recurrent neural network [40], which can learn faster than LSTM by parallelizing the computational process, to capture spatio-temporal relationships. The method introduces Adaptive Loss, which adaptively changes the penalty weights during each epoch of training. Although these methods can extract abnormal risk factors from motion features, they cannot estimate accident risk when the motion features of the risk factors are small. In addition, they do not focus on the major risk factors such as vehicles and pedestrians. Therefore, in recent years, object detection has been used to detect vehicles and pedestrians and to recognize risk factors based on the relationships among objects.
2) MODELS BASED ON MULTIMODAL DATA
Models using multimodal data are those that use not only video images but also audio information and data obtained from various sensors. Yamamoto et al. [24] proposed a method that combines sensor data, such as acceleration and speed, with video images. Tanno et al. [25] also proposed a configuration with three streams that extract time-series multimodal data consisting of moving images, voice information, and sensor data such as speed. Their study confirms the improvement in accuracy obtained by using voice information and shows the effectiveness of this method. Monjuru et al. [26] proposed a model that uses textual data to handle cases in which extracting features from moving images is difficult, such as at night. However, the above-mentioned models have not yet been combined with methods that show rational decision bases, such as the visualization of risk factors.

3) MODELS BASED ON SALIENCY MAP
A saliency-based model is a model that uses saliency maps showing the salient regions in an image. DRIVE [27] is an open-source and highly accurate accident prediction model. It uses reinforcement learning to estimate the probability of an accident occurring at each time point, while simultaneously predicting the saliency of the next frame and providing feedback. Depending on the estimated value, the probability of the occurrence of an accident in the next frame is predicted, and the weights utilized in the predicted saliency are changed. However, many datasets that include accident scenes do not contain annotated gaze data. In such cases, the entire model cannot be trained properly because part of the reward in reinforcement learning is removed. Therefore, the estimated probability of accidents may be high even in normal driving scenarios in which no accidents occur, which may degrade the accuracy of the prediction. In addition, because a saliency map is used for the visualization of the decision basis, a map is output not only in accident scenes but also in normal driving scenes. Therefore, the visualized maps cannot always indicate the risk factors, making interpretation of the decision basis difficult. Hence, the accident prediction model is required to visualize only the areas judged to be risk factors.

4) MODELS BASED ON MOTION AND OBJECT FEATURES
Models using both object and motion features extract object features through object detection and estimate the probability of accidents by acquiring motion features through optical flow and recurrent neural networks. DSA [1], CDAP [3], L-RAI [5], DSTA [6], Yamamoto et al. [7], and FA [16] have proposed models that process object detection results in conjunction with LSTM and other recurrent neural networks to extract location relationships among surrounding objects. Ustring [2] uses graph convolution (GNN) to explicitly map the distance between objects onto the edges of the graph for object-position extraction. Ichiki et al. [12] used features such as dynamic obstacle presence and static road information by combining semantic segmentation and object detection. Object detection is also used in FRPN [15] and Zhou et al. [14], which use changes in the size of the detected bounding box and shifts in its center of gravity. SSC [13] proposed an unsupervised accident prediction method using the predictions of object movement and whole frames. FOL-Ensemble [8], AM-Net [9], MAMTCF [10], and THAT-Net [11] are models that utilize optical flow. AM-Net, MAMTCF, and THAT-Net generate object-level flow images by using the center coordinates of the object's bounding box. FOL-Ensemble uses optical flow in the same manner but predicts the position in the next frame from the optical flow. This position information is used to calculate the probability of an accident occurrence. Some models, such as DSA [1], [5], [6], [7], [8], [9], [12], [13], [14], calculate the risk per bounding box and visualize risk factors by using the bounding box with the highest risk. However, when using object features, if the risk factor is an unexpected object, such as an object falling from a vehicle in front, it may be out-of-class data for object detection, making estimation difficult. In addition, when using motion features, there are cases in traffic scenes in which it is difficult to obtain the motion features of risk factors. One example is the collision course phenomenon, in which the apparent movement of a risk factor is reduced by relative factors such as the angle and speed between vehicles. Models that use motion features to determine hazards cannot predict such accidents well, and similar phenomena can occur in crossing collisions, curves, and other situations. Therefore, there is a need for an accident prediction model that achieves high accuracy and can visualize the basis of a decision, even for risk factors with small motion features.

B. VISUAL ATTENTION MODEL WITH DEEP LEARNING
A model for estimating human visual attention from input images is called a visual attention model. A variety of methods have been proposed since the preliminary study on visual attention models by Itti et al. [41] in 1998. In recent years, deep-learning-based visual attention models have been proposed [31], [34], [42], [43], [44], [45], [46], [47], [48], [49], [50], [52], [53], [54], [55], [56], [57], [59]. These models use human gaze data measured by an eye tracker for training and estimate visual attention based on factors such as depth, context, and flicker in the image. UNISAL [46] uses domain adaptation to predict visual attention with a unified model for different types of datasets of still and moving images. However, the model uses a recurrent neural network, which is not capable of simultaneously encoding both spatial and temporal information. Therefore, methods [42], [45], ViNet [47], HD2S [48], and STSANet [49], which can simultaneously process spatio-temporal information based on 3D CNN, have been implemented. VSFT [50] incorporates Transformer [51] into the model structure to consider longer-term spatio-temporal dependencies compared with 3D CNN. Moreover, top-down characteristics, such as those based on task and experience, are important in estimating the visual attention of a driver. Therefore, the methods proposed for driving tasks [34], [52], [53], [54], [55], [56], [57], [59] are trained on driver gaze data. BDD-A [31] proposed Human Weighted Sampling, a method in which the sampling rate is varied according to the degree of separation between the average map of the driver visual attention, which is the correct image during training, and the visual attention in each frame. This allows the model to learn effective visual attention to accident scenes by identifying important frames in the dataset. DR(eye)VE [34] applied semantic segmentation to explicitly extract the respective relationships among people, vehicles, and roads. Visual attention is estimated by adding the outputs of branches that process RGB images, semantic segmentation images, and images showing optical flow. Meanwhile, SCAFNet [54] and Rui et al. [55] proposed a method of concatenating features obtained from segmentation images using Convolution LSTM to take advantage of features obtained from 3D CNN. Watanabe et al. [57] created a predictive model that reproduces human vision on the basis of PredNet [58], which incorporates predictive coding. Using this predictive model, Seki et al. [33] demonstrated the characteristics of gaze regarding potential dangers during driving. ARAGAN [59] proposed a method that combines Conditional Generative Adversarial Network [60] and Multi-Head Attention algorithms to generate a driver visual attention map from input RGB images. In this study, by using a visual attention model trained on driver gaze data, we apply driver top-down knowledge to the accident prediction model.

III. METHOD
The proposed method consists of a base model and a visual attention module. DSA, a highly accurate open-source accident prediction model, is used as the base model, and the divergence between visual attention and FOE is calculated in the visual attention module. The two outputs are combined to calculate the probability of accident occurrence for the input image. The model structure of the proposed method is shown in Fig. 3.

A. BASE MODEL (DYNAMIC SPATIAL ATTENTION)
The proposed method uses DSA as a base model to predict accidents from object and motion features. We use Faster R-CNN [61], which was pre-trained on the KITTI dataset [62], to extract object features, such as vehicles and pedestrians. Faster R-CNN is a typical end-to-end object detection model. Vehicles and pedestrians in the input image are detected by determining the location and rectangular shape of the object using the region proposal network [61]. Here, $\omega$, $\omega_e$, $U_e$, and $b_e$ are learning parameters. $\rho$ indicates the tanh function, which improves the expressive power of the model by performing a nonlinear transformation. $h_{t-1}$ is calculated at the output gate of the RNN in the previous frame and has a size of 10 × 512. Then, at each time step, the weight $\alpha_t^j$ for each object is embedded into the Holistic Feat $X_t$, which is the set of $x_t^j$ at each time, to obtain $\phi(X_t, \alpha_t)$. The formula for embedding is as follows.
From $\phi(X_t, \alpha_t)$, the RNN calculates $A_t$ at time $t$ based on the weights of detected objects such as vehicles and pedestrians.

FIGURE 3. Model structure of the proposed method. The proposed method consists of "DSA", "Visual attention module", and "Accident occurrence probability calculation" and outputs the accident occurrence probability $P_t$. A graph is created using the probability of accident occurrence $P_t$ for each frame. GT in the graph shows the frames after the collision in red.
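The attention step described above can be sketched as follows. This is a minimal illustration that assumes the soft-attention form used in the original DSA paper [1]: a tanh-activated score per detected object, a softmax over objects, and a weighted sum of object features. The function name, argument names, and array shapes are illustrative assumptions and are not taken from the authors' implementation.

```python
import numpy as np

def dynamic_spatial_attention(h_prev, X_t, w, W_e, U_e, b_e):
    """Soft attention over the objects detected in one frame (assumed form).

    h_prev: (H,)   previous hidden state of the RNN
    X_t:    (J, D) features x_t^j of the J detected objects
    w: (A,), W_e: (A, H), U_e: (A, D), b_e: (A,)  learnable parameters
    Returns (alpha, phi): weights of shape (J,) and embedded feature of shape (D,).
    """
    # Unnormalized score per object: e_j = w^T tanh(W_e h + U_e x_j + b_e)
    scores = np.tanh(W_e @ h_prev + X_t @ U_e.T + b_e) @ w
    # Softmax over objects, shifted by the maximum for numerical stability
    scores = scores - scores.max()
    alpha = np.exp(scores) / np.exp(scores).sum()
    # Embedding: weighted sum of object features, phi(X_t, alpha_t)
    phi = alpha @ X_t
    return alpha, phi
```

With all parameters set to zero, every object receives the same weight and the embedded feature reduces to the mean of the object features, which is a convenient sanity check.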

B. VISUAL ATTENTION MODULE

1) VISUAL ATTENTION MODEL
The proposed method utilizes a visual attention model to predict the probability of occurrence of accidents for risk factors with small image variation. Highly accurate and open-source visual attention models include ViNet [47], HD2S [48], and STSANet [49]. However, STSANet uses images up to frame t + 16 to predict visual attention at frame t. Therefore, it is inappropriate to incorporate it into an accident prediction model. In addition, because HD2S has a smaller model size than ViNet, we use a visual attention model [36] that is composed of HD2S. HD2S estimates visual attention by combining the outputs of four streams encoded by 3D convolutional layers for each level of abstraction. For pre-training of HD2S, we use the BDD-A dataset [31], which, among the driver gaze datasets, includes dangerous traffic scenes. Through this, we obtain top-down visual attention suited to the traffic accident prediction task. The estimated visual attention is output as a 128 × 192 grayscale image.

2) FOE (FOCUS OF EXPANSION)
FOE is defined as the origin of the optical flow [63], [64]. Fig. 4 shows FOE and optical flow in an in-vehicle camera image. As shown in Fig. 4, the norm of the optical flow increases with distance from FOE. Therefore, FOE is obtained as a weighted average that assigns larger weights to optical flow with smaller norms, and it is then extended to a two-dimensional image distribution by a Gaussian filter. The calculation formula for the x-coordinate of FOE is shown below. The y-coordinate is calculated using the same formula.
140120 VOLUME 11, 2023 Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.
$x_1$ and $x_2$ represent the coordinates of the starting points of the optical flow in the 10% and 20% with the smallest norms, and $x_3$ represents the coordinates of the remaining starting points. In this paper, we set $\omega_3 = 1$, $\omega_2 = 2$, and $\omega_1 = 3$ so that the smaller the norm, the larger the weight. There are two methods for determining optical flow: dense optical flow [65], [66] and sparse optical flow [67]. In the sparse method, optical flow is calculated only for pixels where feature points in the image can be obtained. In contrast, the dense method calculates optical flow for all pixels. It is therefore calculated even in areas where the norm of the optical flow is close to 0, such as the sky. Because we calculate FOE using optical flow with a small norm, optical flow in areas such as the sky may become noise. For this reason, we use the Lucas-Kanade method [67], which is a method for finding sparse optical flow. The estimated FOE is output as a 128 × 192 grayscale image.
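The weighted-average step described above can be sketched as follows. This is a hedged illustration, not the authors' code: it assumes that the weight 3 applies to the smallest 10% of flow norms, the weight 2 to the next 10%, and the weight 1 to the rest, and that the flow start points and vectors have already been obtained from a sparse tracker. The function names and the Gaussian width `sigma` are hypothetical.

```python
import numpy as np

def estimate_foe(starts, flows, weights=(3.0, 2.0, 1.0)):
    """Weighted average of flow start points, favoring small-norm vectors.

    starts: (N, 2) array of (x, y) start points of the sparse optical flow
    flows:  (N, 2) array of flow vectors at those points
    weights: (w1, w2, w3) for the smallest 10%, the next 10%, and the rest
             (this grouping is an assumption about the description above)
    """
    norms = np.linalg.norm(flows, axis=1)
    q10, q20 = np.quantile(norms, [0.10, 0.20])
    w1, w2, w3 = weights
    w = np.where(norms <= q10, w1, np.where(norms <= q20, w2, w3))
    # The same weighted mean is applied to the x and y coordinates
    return (w[:, None] * starts).sum(axis=0) / w.sum()

def foe_map(foe_xy, shape=(128, 192), sigma=10.0):
    """Spread the FOE point into a 2-D Gaussian map that sums to 1.

    sigma is a hypothetical value; the source does not state it.
    """
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    g = np.exp(-((xs - foe_xy[0]) ** 2 + (ys - foe_xy[1]) ** 2) / (2 * sigma ** 2))
    return g / g.sum()
```

Normalizing the map to sum to 1 is a design choice made here so that it can be compared directly with a normalized visual attention map.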

3) CALCULATION OF DIVERGENCE
The divergence between the obtained visual attention and FOE is calculated as presented here. SIM, an index for evaluating the overlap of histograms, is used for the divergence. In SIM, each distribution is normalized so that its total value is 1, and the sum of the minimum values at each corresponding pixel $i$ is calculated. The definition formula for SIM is as follows.
A SIM value of 1 indicates that the distributions are perfectly matched, and a value of 0 indicates that there is no overlap between the distributions. Therefore, the value obtained by subtracting SIM from 1 is used as the divergence. The equation for calculating the divergence $D_t$ is shown below. Since this paper uses real images, FOE may fluctuate as the vehicle shakes, and SIM may be affected. Therefore, to absorb the fluctuation, we set $n = 5$ and take the average over the previous 5 frames.
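The SIM-based divergence described above can be sketched as follows. This is a minimal illustration under two assumptions that the text does not settle: the averaging window of n frames is taken to include the current frame, and SIM is averaged before being subtracted from 1 (which gives the same result as averaging 1 − SIM). The function names are hypothetical.

```python
import numpy as np

def sim(p, q, eps=1e-12):
    """Histogram intersection of two maps, each normalized to sum to 1."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / (p.sum() + eps)
    q = q / (q.sum() + eps)
    # Sum of the pixel-wise minimum: 1 for identical maps, 0 for no overlap
    return float(np.minimum(p, q).sum())

def divergence(attention_maps, foe_maps, n=5):
    """D_t = 1 - mean SIM over the last n (attention, FOE) map pairs.

    attention_maps, foe_maps: sequences of 2-D arrays ordered in time,
    with the current frame last (an assumption about the window).
    """
    sims = [sim(a, f) for a, f in zip(attention_maps[-n:], foe_maps[-n:])]
    return 1.0 - float(np.mean(sims))
```

For example, when the attention map and the FOE map are identical over the window, the divergence is close to 0, and when they never overlap it is 1.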

C. ACCIDENT OCCURRENCE PROBABILITY CALCULATION
The final accident probability $P_t$ is calculated from the base model output $A_t$ and the divergence $D_t$, as shown below.
When the output from the base model is low, the coefficient $k$ increases the weight of the output from the visual attention module, allowing the treatment of accidents that are difficult to estimate from only object and motion features. For normal driving scenes, the divergence between visual attention and FOE is low, allowing the calculation of accident probability without over-detection. With $P_t$ as the probability of accident occurrence, the model is trained using the learning process of DSA, which is the base model. The loss used for training the accident prediction model makes the penalty for a prediction failure in a frame close to the accident larger than that in a frame far from the accident. In addition, cross-entropy error is used in scenes where no accidents occur. Therefore, each loss is set as follows.
$$L_p = \sum_t -e^{-\max(0,\, y - t)} \log(P_t) \qquad (9)$$
Here, $L_p$ indicates the loss in a positive scene, in which an accident occurs, and $y$ is the time of occurrence of the accident. In addition, $L_n$ indicates the loss in a negative scene, in which no accident occurs. Finally, the sum of the losses is calculated.
Here, let P be the set of positive scenes and N be the set of negative scenes.
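The loss described above can be sketched as follows. The positive loss follows Eq. (9). The negative loss is assumed here to be the standard cross-entropy term for the "no accident" label, and the total is assumed to be a plain sum over both sets, since those two formulas are only described in words. Measuring time in seconds via `fps` is also an assumption, as are all function names.

```python
import numpy as np

EPS = 1e-12  # guards against log(0)

def positive_loss(probs, accident_frame, fps=20.0):
    """Eq. (9): exponentially weighted log loss for an accident clip.

    Frames close to the accident get weights near 1; early frames get
    small weights. Times are converted to seconds (an assumption).
    """
    p = np.clip(np.asarray(probs, dtype=float), EPS, 1.0 - EPS)
    t = np.arange(len(p)) / fps
    y = accident_frame / fps
    weight = np.exp(-np.maximum(0.0, y - t))
    return float(np.sum(-weight * np.log(p)))

def negative_loss(probs):
    """Assumed cross-entropy for a clip with no accident: sum of -log(1 - P_t)."""
    p = np.clip(np.asarray(probs, dtype=float), EPS, 1.0 - EPS)
    return float(np.sum(-np.log(1.0 - p)))

def total_loss(positive_clips, negative_clips):
    """Sum of losses over positive clips [(probs, accident_frame), ...]
    and negative clips [probs, ...]."""
    return (sum(positive_loss(p, y) for p, y in positive_clips)
            + sum(negative_loss(p) for p in negative_clips))
```

Under this weighting, a low predicted probability at the accident frame is penalized far more than the same low probability several seconds earlier.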

D. VISUALIZATION OF RISK FACTORS
The proposed accident prediction model enables visual explanation through the visualization of only risk factors. Risk factors are visualized using the output from the base model DSA, visual attention, and FOE. Visual attention and FOE are output as two-dimensional image distributions, and the difference image between them is computed. In addition, the bounding box coordinates of the highly hazardous object obtained by DSA object detection are extended to a distribution by the Gaussian filter. By adding these two images, a heat map is created for the risk factor in frames where risk factors are present. No heat map is created in a typical normal driving scene without a risk factor. Visualization of solely risk factors allows for the rational explanation of the basis for model decisions.
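The visualization steps above can be sketched as follows. This is a rough illustration under several assumptions that the text leaves open: both maps are scaled to a peak of 1 before subtraction, negative differences are clipped to zero, and the Gaussian for the bounding box uses half the box width and height as its standard deviations. The function names are hypothetical.

```python
import numpy as np

def _unit_peak(m):
    """Scale a map so that its maximum is 1 (all-zero maps are returned as-is)."""
    m = np.asarray(m, dtype=float)
    peak = m.max()
    return m / peak if peak > 0 else m

def box_gaussian(box, shape=(128, 192)):
    """2-D Gaussian centered on a bounding box (x1, y1, x2, y2), peak value 1."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    sx, sy = max(x2 - x1, 1) / 2.0, max(y2 - y1, 1) / 2.0
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    return np.exp(-((xs - cx) ** 2 / (2 * sx ** 2) + (ys - cy) ** 2 / (2 * sy ** 2)))

def risk_heatmap(attention, foe, box=None):
    """Difference of attention and FOE maps, plus an optional box Gaussian."""
    # Keep only regions where attention exceeds FOE (clipping is an assumption)
    diff = np.clip(_unit_peak(attention) - _unit_peak(foe), 0.0, None)
    heat = diff if box is None else diff + box_gaussian(box, diff.shape)
    return _unit_peak(heat)
```

When attention and FOE coincide and no hazardous box is supplied, the result is an all-zero map, which matches the behavior described for normal driving scenes.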

IV. EXPERIMENTS
To verify the effectiveness of the proposed model, categories were generated representing each type of accident, and accident prediction experiments were conducted.

A. DATA SET
For the experiments, DAD [1], a widely used dataset containing accident scenes, is used. DAD consists of various accident scenes involving cars, pedestrians, and motorcycles, which were filmed with in-vehicle cameras such as drive recorders and published online. The resolution of the dataset is 720 × 1280, and the frame rate is fixed at 20 fps. In this study, accident scenes from the dataset are classified into four categories. In the categories "Rear end collision" and "Head on collision," it is assumed that risk factors exist in the center of the images. When the given vehicle moves, the flow of the risk factor becomes relatively large in the direction of motion, because the optical flow of the peripheral background is small. Therefore, these categories are assumed to be accident scenes in which the image variation of the risk factors tends to be large. In contrast, in the cases of "Turn" and "Crossing collision," the motion features cancel each other out owing to the relative movement between the vehicles, so these categories are assumed to be accident scenes in which the image variation of the risk factors tends to be small. In addition, the dataset includes some scenes in which the recording vehicle is not moving, such as those captured by fixed-point cameras, and such scenes are excluded.

B. EXPERIMENTAL CONDITIONS
Experimental conditions are shown in Table 1. For the experiment, we used DAD [1], with its accident scenes classified by accident category. The learning rate of the model is 0.0001, the number of epochs is 30, and the batch size is 10. Adam is used as the optimizer. These conditions are set so that the training loss converges sufficiently. DSA and DRIVE [27] are used as methods for comparison in the experiment. The CPU is an Intel Core i9-9900K @ 3.60 GHz and the GPU is an NVIDIA GeForce RTX 2080Ti. Time to accident (TTA) is used as an evaluation index. TTA is the time range of risk perception before an accident occurs and is defined as the difference between the time $t_1$ when the predicted accident probability exceeds the threshold value and the time $t_2$ when the accident occurs. False Time (FT) is defined as the average of the time corresponding to false negatives in accident scenes and false positives in normal scenes, with a smaller value indicating better results.
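The TTA definition above can be sketched as follows. This is a minimal illustration that assumes $t_1$ is the first frame at which the predicted probability exceeds the threshold, and that frame indices are converted to seconds with the 20 fps frame rate of DAD. The function name and the choice to return `None` when the threshold is never exceeded are hypothetical.

```python
import numpy as np

def time_to_accident(probs, accident_frame, threshold=0.5, fps=20.0):
    """Return TTA in seconds, or None if the threshold is never exceeded
    at or before the accident frame.

    probs: per-frame accident probabilities P_t
    accident_frame: index t_2 of the frame in which the accident occurs
    """
    p = np.asarray(probs[:accident_frame + 1], dtype=float)
    above = np.flatnonzero(p > threshold)
    if above.size == 0:
        return None
    t1 = above[0]  # first frame exceeding the threshold (an assumption)
    return float(accident_frame - t1) / fps
```

For example, if the probability first exceeds 0.5 at frame 50 and the accident occurs at frame 90, the function returns 2.0 seconds.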

C. RESULTS
Estimated accident prediction curves from the Dashcam Accident Dataset are shown in Fig. 5. The horizontal axis is time, and the vertical axis is accident occurrence probability.

TABLE 2. Quantitative evaluation in DAD. "Rear end collision" and "Head on collision" are categories that tend to have large motion features, and "Turn" and "Crossing collision" are categories that tend to have small motion features.

The period that exceeds a threshold of 0.5 (50%) in the accident prediction task is indicated in red. The proposed method can recognize the danger even in an accident scene in which the image variation of the risk factor tends to be small, and the predicted accident probability remains low in normal scenes. A quantitative assessment is shown in Table 2. The proposed method exhibits the same performance as DSA in the "Rear end collision" and "Head on collision" accident scenes, where the image variation of the risk factors tends to be large, and improves F1 by 12% compared with DSA in the "Turn" and "Crossing collision" accident scenes, where the image variation of the risk factors tends to be small. In addition, the TTA is 0.5 s better than that of DSA, and the FT is better than those of both DSA and DRIVE. Our proposed model performs 5% better than DSA and 49% better than DRIVE in terms of F1 for all test images. These results show that the proposed method can predict the probability of accidents with high speed and accuracy while maintaining predictability in categories for which the image variation of risk factors tends to be large and suppressing over-detection in categories for which the image variation of risk factors tends to be small.

Fig. 6 shows the results of the visualization of the decision basis for the model. Risk factors in the input image are indicated by yellow boxes. In each accident scene, DRIVE highlights risk factors such as cars and motorcycles, but it also highlights vehicles traveling ahead that are not risk factors, as in the third scene from the left. This can be considered the gazing area during normal driving, and similar output can be seen in the three non-accident scenes shown on the right. This constant display of the human gazing area is not appropriate for visualizing the basis for decisions in the accident prediction task because it makes distinguishing risk factors impossible. Meanwhile, the proposed method provides output only for the risk factors that suddenly present themselves, such as the motorcycle in the first and third accident scenes from the left. Accordingly, in the normal driving scenes shown on the right, nothing is displayed because there are no risk factors. Although DSA also provides visualization for risk factors, it displays many bounding boxes containing features other than risk factors. These qualitative evaluations confirm that the proposed method adequately captures risk factors and provides reasonable visual explanations.

We now discuss the limitations of the proposed method. The proposed method has the same performance as the baseline in "Rear end collision" and "Head on collision," and no major changes due to the visual attention module can be observed. This could be because the vehicle that is the risk factor is likely to be located in the center of the screen. In that case, because the visual attention and FOE regions overlap, no divergence occurs and each evaluation index has the same value.

V. CONCLUSION
This study proposes an accident prediction model based on DSA, which predicts accidents based on object and motion features, combined with the divergence between visual attention and FOE. By applying a visual attention model trained on driver gaze data, the driver top-down knowledge is incorporated into the accident prediction model. The proposed method is applied to the DAD dataset and compared with DSA and DRIVE. The results show its effectiveness in the metrics F1, TTA, and FT, confirming that it is possible to predict accidents with high accuracy even for accident scenes containing risk factors with small motion features. Visualization of the risk factors using differential images of visual attention and FOE demonstrates that a visual explanation of the basis for decision making is possible. These qualitative evaluations confirm that the proposed method adequately captures risk factors and provides reasonable visual explanations.

FIGURE 1. Overview of accident prediction task. The accident prediction model takes in vehicle-mounted camera images as input and outputs the accident probability for each frame. The vertical axis of the graph shows the accident probability $P_t$ estimated by the model, and the horizontal axis shows the frame number. The horizontal dotted line shows a threshold value of 0.5. Frames with a probability of accident occurrence higher than the threshold are shown in red.

FIGURE 2. Overview of the proposed method. The proposed method incorporates a visual attention module consisting of visual attention and FOE into the base accident prediction model. The driver visual attention can be estimated by a visual attention model trained using eye tracking data measured by an eye tracker.

FIGURE 4. FOE and optical flow. This shows FOE and optical flow in the in-vehicle camera images. Optical flow is indicated by the direction and size of the red arrows, and FOE is shown as a green circle.

FIGURE 5. Accident prediction curves in DAD. Risk factors in the input image are shown in yellow. The two on the left show accident scenes, and the third from the left is a normal scene. In each graph, the vertical axis shows the estimated accident prediction probability, and the horizontal axis shows the frame number. In addition, FT is the estimation failure time. TTA is the difference between time $t_1$ when the predicted accident probability exceeds the threshold and time $t_2$ when the accident occurs.

FIGURE 6. Visual description of the model in DAD. Risk factors for the input image are shown in yellow. The three on the left are accident scenes, and the three on the right are normal scenes. In the proposed method, only the areas determined to be risk factors are visualized using a heat map.