CrowdFix: An Eyetracking Dataset of Real Life Crowd Videos

Understanding human visual attention and saliency is an integral part of vision research. In this context, there is an ever-present need for fresh and diverse benchmark datasets, particularly for insight into special use cases like crowded scenes. We contribute to this end by: (1) reviewing the dynamics behind saliency and crowds; (2) using eye tracking to create a dynamic human eye fixation dataset over a new set of crowd videos gathered from the Internet, annotated into three distinct density levels; and (3) evaluating state-of-the-art saliency models on our dataset to identify possible improvements for the design and creation of a more robust saliency model.


Introduction
Saliency studies form the intersection between natural and computer vision. A quantitative study of saliency provides structured insight into what the human mind perceives to be important in a scene. Visual attention then guides gaze to focus on and further explore that region of interest. To achieve near-human accuracy in predicting gaze locations, saliency models need to be able to approximate gaze over a wide variety of stimuli (Borji, 2019). We approach this problem in two ways: first, we discuss static and dynamic stimuli used for modelling saliency, as well as the need for specialized datasets to boost saliency modelling. Traditionally, most active research has been on images, but in recent years, work using dynamic content as the subject of saliency studies has picked up pace. The pace of this research is determined by the public availability of diverse video datasets covering a multitude of natural scenes.
Datasets such as DIEM (Mital, Smith, Hill, & Henderson, 2011), HOLLYWOOD-2 (Mathe & Sminchisescu, 2014), UCFSports (Mathe & Sminchisescu, 2014), LEDOV (L. Jiang, Xu, Liu, Qiao, & Wang, 2018), and DHF1K (Wang, Shen, Guo, Cheng, & Borji, 2018) are dynamic datasets that cover a range of natural scenes. However, there is an obvious gap for specialized datasets targeting a particular category of natural scenes. Our study focuses on the category of crowded scenes because it presents an interesting use case: the stimuli competing for attention in crowd scenes are larger in number, and the crowd activity is far more random and attention-grabbing than in ordinary scenes containing one or two objects of interest (Yoo et al., 2016). This insight proves useful for monitoring, managing, and securing crowds (Gupta & Gupta, 2014). To date, there has been only one crowd saliency dataset, namely EyeCrowd, consisting of 500 natural images (M. Jiang, Xu, & Zhao, 2014).
Our research contributes by adding the first saliency dataset of crowd videos, called 'CrowdFix', and its corresponding saliency information to the pool of publicly available saliency datasets. CrowdFix consists of high-definition (720p) RGB videos of real-life, moving crowds. Eyetracking results benefit from higher-quality datasets (Vigier, Rousseau, Da Silva, & Le Callet, 2016).
For this reason, we chose not to include videos from pre-existing crowd video datasets, since their quality falls below 720p. The dataset has been further annotated into three distinct crowd density levels to facilitate understanding of attention modulation within each level. This also helps in producing better, more generalized saliency models, particularly deep models, by providing a finer categorization of salient images and videos (He, Tavakoli, Borji, Mi, & Pugeault, 2019). We assess the attentional impact of the different crowd levels on individuals and further evaluate three state-of-the-art deep-learning-based saliency models on our dataset, to judge how well general saliency models perform at crowd saliency prediction. This analysis serves as a baseline for the future design of a crowd saliency model.

Computational Models for Visual Attention
Older models integrated complicated characteristics of the Human Visual System (HVS) and reconstructed the visual input by hierarchically combining low-level features. The bottom-up mechanism is the most common one found in these models (Le Meur, Le Callet, Barba, & Thoreau, 2006). The core indication of bottom-up attention is the uncommonness and distinctiveness of a feature in a given circumstance (Mancas, 2010). Bottom-up models use a feed-forward method to process visual input: they apply sequential transformations to visual features collected over the entire visual field to highlight regions which are the most attention-grabbing, significant, eye-catching, or so-called salient (Borji & Itti, 2012). However, existing models of visual attention present a reductionist view of visual attention, because fixations are influenced not only by bottom-up saliency as determined by the models, but also by various top-down influences. Consequently, comparing bottom-up saliency maps to eye fixations is demanding and requires that one attempt to minimize top-down impacts (Volokitin, Gygli, & Boix, 2016). One way is to focus on early fixations, before top-down influences have come into effect, such as by the use of jump cuts in videos, in our case an MTV-style video stimulus (Carmi & Itti, 2006).
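To make the feed-forward pipeline concrete, the following is a minimal sketch of a classic bottom-up saliency computation over a single frame, using OpenCV's spectral-residual model (available in opencv-contrib-python). The input file name is a hypothetical placeholder, and this illustrates the general bottom-up idea rather than any specific model evaluated in this paper.

```python
# Minimal bottom-up (feed-forward) saliency sketch using OpenCV's
# spectral-residual model; illustrative only.
import cv2

def bottom_up_saliency(frame_bgr):
    """Return a float32 saliency map in [0, 1] for a single frame."""
    model = cv2.saliency.StaticSaliencySpectralResidual_create()
    ok, saliency_map = model.computeSaliency(frame_bgr)
    if not ok:
        raise RuntimeError("saliency computation failed")
    return saliency_map

frame = cv2.imread("crowd_frame.png")  # hypothetical input frame
smap = bottom_up_saliency(frame)
cv2.imwrite("saliency_map.png", (smap * 255).astype("uint8"))
```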

Our Contribution: The CrowdFix Dataset
In the only crowd eye-tracking experiments that have been done before, images were used. No HD (720p), FHD (1080p), or 4K crowd video dataset exists; instead, all existing datasets have low resolutions. A higher-quality dataset leads to better eye fixation information, because HD and FHD show a finer level of detail in the video and allow more possibilities for visual exploration (Vigier et al., 2016). Most datasets cater exclusively to high-density or abnormal crowds, which established the need for diversity in the dataset according to crowd density level. To the best of our knowledge, no such categorization has been performed on existing crowd video datasets.
We collected a crowd dataset consisting of videos that depict real-life scenarios, categorized into three distinct crowd density levels: sparse, dense free-flowing, and dense congested. This dataset is built for studying influence and saliency in crowds, and therefore consists of diverse real-life, moving crowds. It has a total of 89 videos cut into 434 clips for the MTV-style stimuli. With high resolution as a key starting point, our dataset has a resolution of 1280×720 at 30 frames per second. None of the videos is taken from any previously existing dataset. To maintain the clarity and simplicity of the videos, none is fast-forwarded or watermarked, and all are in RGB.

To generate the dataset, we picked crowd videos under Creative Commons licenses depicting multiple real-life crowded scenes; simulated crowd videos were not considered at all. We collected a wide variety of moving crowd scenes while assessing the varying crowd densities, and then finalized the videos. The categorization into crowd density levels is concluded from the results of the participants.

The major step in creating the stimulus was to maximize bottom-up attention. Since bottom-up attention is involuntary, it follows that the stimulus should change frequently and abruptly. For videos, this can be achieved with jump cuts, by combining videos of very short length back to back. We call each of these very short videos a 'clip'. Based on the research of Carmi and Itti (2006), each clip's duration varies from 1 to 3 seconds; any clip longer than this would invoke top-down attention. To create the clips from the crowd videos, we take all the videos from each density level and randomly shuffle them, ensuring there is no predictable sequence based on crowd density. The clips are then randomly combined into two videos of approximately 10 minutes each, which we present to the participants, as sketched below. Table 1 shows the attributes of the real-life crowd dataset.
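The following is a minimal sketch of this clip-and-shuffle construction, assuming a moviepy (1.x API) implementation; the source file names are hypothetical placeholders, and the exact tooling used for CrowdFix is not specified here.

```python
# Sketch of MTV-style stimulus construction: cut source videos into 1-3 s
# clips, shuffle them so no density-level ordering is predictable, and
# concatenate them into two ~10-minute presentation videos.
import random
from moviepy.editor import VideoFileClip, concatenate_videoclips

def cut_into_clips(path, min_len=1.0, max_len=3.0):
    """Split one source video into consecutive clips of 1-3 s each."""
    video = VideoFileClip(path)
    clips, t = [], 0.0
    while t < video.duration:
        length = min(random.uniform(min_len, max_len), video.duration - t)
        clips.append(video.subclip(t, t + length))
        t += length
    return clips

source_videos = ["sparse_01.mp4", "dense_flow_02.mp4", "congested_03.mp4"]  # hypothetical
clips = [c for path in source_videos for c in cut_into_clips(path)]
random.shuffle(clips)  # remove any density-based ordering

# Pack the shuffled clips into two MTV-style videos.
half = len(clips) // 2
for i, part in enumerate([clips[:half], clips[half:]], start=1):
    concatenate_videoclips(part).write_videofile(f"MTV{i}.mp4", fps=30)
```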

Dataset Annotation
The objective behind dataset annotation is to divide the dataset into distinct crowd density levels. All previously available crowd video datasets lacked this density feature. The major attributes of crowds include density, orientation, and the time, location, and type of event. Annotators free-viewed the crowd videos; after viewing each video, they were given time to mark it as one of the levels explained at the beginning (a sketch of this protocol is given below). Figure 1 shows the distribution of the categories chosen by the annotators, which were then assigned to all the videos in the dataset. Since each decision about a video's density was saved right after that video was shown, we can be fairly certain that the participant's judgment was not influenced by other videos; the participant could also pause as long as they wished before moving on to the next video. Figure 2 shows sample images of the different crowd density levels; the rows represent sparse, dense free-flowing, and dense congested crowds, respectively.

Fixations were computed from both MTV1 and MTV2. The minimum and maximum numbers of fixations were also calculated for both parts. The values show that MTV1 has more fixations than MTV2, and therefore a higher average number of fixations as well. Table 3 shows the fixation data gathered from all participants for MTV parts 1 and 2.
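For concreteness, here is a minimal sketch of the per-video annotation protocol described above, in which each density judgment is saved immediately after the video is shown; the player command, file names, and CSV format are illustrative assumptions, not the tooling actually used.

```python
# Sketch of the annotation protocol: show each video, then record the
# density label before moving on, so one judgment cannot be influenced
# by later videos.
import csv, random, subprocess

LEVELS = {"1": "sparse", "2": "dense free-flowing", "3": "dense congested"}
videos = ["vid_001.mp4", "vid_002.mp4"]  # hypothetical file list
random.shuffle(videos)

with open("annotations.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["video", "density_level"])
    for video in videos:
        subprocess.run(["ffplay", "-autoexit", video])  # blocks until playback ends
        choice = ""
        while choice not in LEVELS:  # annotator may pause as long as they wish
            choice = input("Density level (1=sparse, 2=free-flowing, 3=congested): ")
        writer.writerow([video, LEVELS[choice]])  # saved right after viewing
```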

Gaze Data Visualization
The final analysis is done with respect to the crowd density levels and their numbers of videos. Figure 4 shows the distribution of videos over the different crowd levels; each level has a different number of videos, as presented in the figure. Table 4 shows the results of the evaluation on the sparse, dense free-flowing, and dense congested crowd levels. Of these levels, dense congested has the highest number of fixations, since there are more people to look at compared to dense free-flowing and sparse. But although sparse has fewer fixations, it has the highest fixation duration on the screen. Fixation location is also an important aspect to examine when interpreting the results, as it reveals the areas of the screen where the participants fixate. The images below are graphs of the fixation locations across all participants for the different crowd levels.
It can be seen that all the fixations lie close to the center and form one big, tightly packed cluster. Figure 5 shows the fixation locations of the participants throughout the experiment for the different crowd density levels; the graphs represent the sparse, dense free-flowing, and dense congested categories from left to right, respectively.

Figure 5: Fixation Locations of participants in Different Crowd Density Levels
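As a companion to Figure 5, here is a minimal sketch of how such a pooled fixation-location scatter plot can be produced for one density level on the 1280×720 display; the input array and file name are assumptions.

```python
# Sketch of a fixation-location scatter plot for one density level.
import matplotlib.pyplot as plt
import numpy as np

def plot_fixation_locations(xy, title):
    """xy: (N, 2) array of fixation coordinates pooled over participants."""
    plt.figure(figsize=(6.4, 3.6))
    plt.scatter(xy[:, 0], xy[:, 1], s=4, alpha=0.3)
    plt.xlim(0, 1280)
    plt.ylim(720, 0)  # flip y-axis: screen origin is top-left
    plt.title(title)
    plt.xlabel("x (px)")
    plt.ylabel("y (px)")
    plt.show()

sparse_xy = np.load("fixations_sparse.npy")  # hypothetical pooled coordinates
plot_fixation_locations(sparse_xy, "Sparse")
```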
The fixation coordinates of all the participants were extracted and used to calculate the distance from the center. Figure 6 shows the results for the different crowd levels. A peak at the first frame can be seen, due to bottom-up saliency influences. The fixations for the sparse density level lie closer to the center, as those scenes contain fewer things to look at and these revolve around the center of the screen; the fewer entities in the scene hold attention for longer periods, keeping gaze consistently near the center. The fixations for dense free-flowing and dense congested are more distributed than for sparse. Dense free-flowing shows the greatest distance from the center of all the categories, both because it has more videos and because attention is drawn to individual entities as well as to the salient regions of the scene. Dense congested shows less spread in distance than dense free-flowing but more than sparse, because the scene is so congested that the viewer is unable to focus on any one thing and instead struggles to explore the screen before the scene changes.

The spread of the recorded data samples was also evaluated to judge the closeness of agreement between the results. The standard deviation is the measure most commonly used for estimating variability. The fixation coordinates were again used to assess the dispersion of the data around the mean, which was then averaged across participants for all density levels, as sketched below. Figure 7 shows the results for the different crowd levels.
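The two analyses above reduce to simple coordinate arithmetic. The following is a sketch for a 1280×720 display, assuming the fixation data are stored as a (participants, frames, 2) coordinate array; the array name, shape, and the exact dispersion definition are assumptions.

```python
# Sketch: (1) mean fixation distance from screen center per frame, and
# (2) dispersion as the deviation of fixations around the mean fixation
# point, averaged over participants.
import numpy as np

CENTER = np.array([640.0, 360.0])  # center of a 1280x720 screen

def distance_from_center(fixations):
    """fixations: (participants, frames, 2) array of (x, y) coordinates.
    Returns the mean distance from screen center per frame."""
    d = np.linalg.norm(fixations - CENTER, axis=-1)  # (participants, frames)
    return np.nanmean(d, axis=0)                     # average over participants

def dispersion(fixations):
    """Per-frame deviation of fixation locations around the mean fixation
    point, averaged over participants (cf. Figure 7)."""
    mean_xy = np.nanmean(fixations, axis=0, keepdims=True)  # (1, frames, 2)
    return np.nanmean(np.linalg.norm(fixations - mean_xy, axis=-1), axis=0)

fix = np.load("fixations_dense_congested.npy")  # hypothetical (P, F, 2) array
print(distance_from_center(fix)[:5], dispersion(fix)[:5])
```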

Performance Evaluation over Existing Saliency Models
Deep learning models are trained by combining tasks such as feature extraction, integration, and saliency value prediction in an end-to-end manner. Their performance is superior to that of classic saliency models. Keeping this in mind, we select three of the latest state-of-the-art deep learning models (Borji, 2019), chosen for their performance on pre-existing saliency benchmarks: ACL (ResNet variant) (Wang et al., 2018), DeepVS (L. Jiang et al., 2018), and SAM. We create a benchmark of these models over our dataset; the three models were tested on videos from each crowd category. We choose four of the most common saliency evaluation metrics, AUC-J, NSS, KL-Div, and CC, to provide an easy comparison to other saliency benchmarks such as the MIT Saliency Benchmark (Bylinskii et al., 2019) and the DHF1K video saliency leaderboard (Cheng, 2019). We also provide a baseline from each model's own performance results over its original dataset, and we average our results to compare against this baseline and evaluate the performance difference. Figure 8 shows an original image and its ground-truth saliency map for the dense free-flowing crowd category. Table 5 shows the results of the evaluation with the DeepVS, ACL, and SAM models over the different crowd density levels. Based on the results, ACL performs best of the three models over all three video categories individually and on average. However, the difference between these results and ACL's original results is enough to prompt improvements in model parameter design and architecture to bring saliency prediction in crowds up to par with general saliency prediction. Even for the other two models, the difference between the average results and the baseline shows that crowd videos need customized saliency prediction models to reach state-of-the-art performance.
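For reference, here are minimal implementations of three of the four metrics (NSS, CC, and KL-Div), following their standard definitions in the saliency literature; AUC-J is omitted for brevity. Here smap is a predicted saliency map, fmap a binary fixation map, and gt_density a ground-truth fixation density map of the same shape; these names are assumptions, and benchmark implementations may differ in details such as smoothing and epsilon handling.

```python
# Sketch of standard saliency evaluation metrics.
import numpy as np

def nss(smap, fmap):
    """Normalized Scanpath Saliency: mean of the z-scored saliency map
    at fixated locations."""
    z = (smap - smap.mean()) / (smap.std() + 1e-8)
    return z[fmap.astype(bool)].mean()

def cc(smap, gt_density):
    """Linear correlation coefficient between the predicted map and a
    ground-truth fixation density map."""
    a = (smap - smap.mean()) / (smap.std() + 1e-8)
    b = (gt_density - gt_density.mean()) / (gt_density.std() + 1e-8)
    return (a * b).mean()

def kl_div(smap, gt_density, eps=1e-8):
    """KL divergence from the ground-truth distribution to the prediction
    (lower is better); both maps are normalized to sum to 1."""
    p = gt_density / (gt_density.sum() + eps)
    q = smap / (smap.sum() + eps)
    return np.sum(p * np.log(p / (q + eps) + eps))
```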

Discussion and Conclusion
Crowd scenes provide a richer set of dynamics and stimuli. These can be used to test whether general saliency judgments and models hold true for crowd scenes as well, and to provide insights on how to bring about improvements.
In this work, we studied crowd characteristics and categorized crowds into different density levels. The fixation and dispersion analysis shows that attention does vary with the number of people in the crowd. As the crowd gets bigger, most of the time is spent viewing more objects in the scene rather than paying attention to any one particular object. With a decrease in the number of entities, salient features in individual objects become more spontaneously noticeable. As a future avenue, to bridge the gap between human performance and predicted saliency, it would be prudent to include more cognitive information about crowded stimuli in the computational models (Feng, Borji, & Lu, 2016). The importance of different features, particularly facial features, in the context of crowd videos is still an unexplored area. The evaluation metric results on general saliency datasets and on ours reflect quite a big performance gap. There is an obvious opportunity to improve deep saliency models to work equally well, if not better, for crowds. We reiterate the need to investigate which features should be reinforced in model design to better predict crowd scene saliency.