Audiovisual Saliency Prediction in Uncategorized Video Sequences based on Audio-Video Correlation

Substantial research has been done in saliency modeling to develop intelligent machines that can perceive and interpret their surroundings. However, existing models treat videos merely as image sequences, excluding any audio information and thus failing to cope with inherently varying content. Based on the hypothesis that an audiovisual saliency model will improve over traditional saliency models for natural, uncategorized videos, this work provides a generic audiovisual saliency model that augments a visual saliency map with an audio saliency map computed by synchronizing low-level audio and visual features. The proposed model was evaluated using different criteria against eye-fixation data from the publicly available DIEM video dataset. The results show that the model outperformed two state-of-the-art visual saliency models.


INTRODUCTION
Though a lot of research has been done in the general field of unimodal saliency models for both images and videos, no substantial contributions exist for bimodal models. Of more consequence is the lack of a model for computation of audiovisual saliency in complex video sequences. Existing literature for audio-video saliency modeling is scarce and often targets a specific class of videos [10], [27], [28]. Therefore, an extended saliency model to predict salient regions in complex videos with different sound classes is required.
Many existing saliency algorithms are designed for images [6], [16], [24] using visual cues such as color, intensity and orientation, while other models [7], [14], [22] take social cues like faces into account, resulting in more accurate eye movement predictions. Spatiotemporal saliency models [11], [15], [21] usually incorporate temporal cues like motion but ignore the effect on human gaze of audio stimuli, an integral component of video content. Consequently, such models are classified as unimodal models [4], where only visual stimuli are used.
Interestingly, audio stimuli are relevant to human eye movements. In [25] the authors, using an eye-tracking experiment on images with spatially localized sound sources under three conditions, auditory (A), visual (V) and audiovisual (AV), find eye movements to be spatially biased towards the source of audio. Moreover, another study [29] analyzed the effects of different types of sounds on human gaze in an experiment with thirteen sound classes under audiovisual and visual conditions. The sound classes were further clustered into on-screen with one sound source, on-screen with more than one sound source, and off-screen sound source. The results show that human speech, singer(s) and human noise (on-screen sound-source clusters) strongly affect gaze and, more importantly, that synchronized audiovisual stimuli have a greater effect than unsynchronized audiovisual events.

Maryam Q. Butt and Anis Ur Rahman are with the School of Electrical Engineering and Computer Science (SEECS), National University of Sciences and Technology (NUST), Islamabad, Pakistan. Corresponding author: Anis U. Rahman, e-mail: anis.rahman@seecs.edu.pk
The focus of this work is to propose a generic audiovisual saliency model for complex video sequences. The work differs from previous research [10], [27], [28] in that it does not restrict input videos to a certain category. To accomplish this, an audio source localization method was used to relate an audio signal to an object in the video frames in a rank correlation space. The proposed model was evaluated against eye-fixation ground truth from the DIEM dataset.
The original contributions of this study are as follows: 1) We propose an audiovisual saliency model for complex scenes that, unlike existing literature, does not restrict videos to any specific category. 2) We present and analyze the results of an experimental evaluation on a publicly available dataset to examine how our proposed saliency model compares to two state-of-the-art visual saliency models. The remainder of the paper is organized as follows: Section 1 narrates background knowledge of saliency modeling and identifies the novel contribution of this work. Section 2 provides a detailed review of state-of-the-art literature, while Section 3 describes the proposed solution. Section 4 summarizes the implementation details and outlines the properties of the video sequences used for experimentation; this section also explains the different saliency evaluation metrics. Section 5 presents our results, followed by a discussion in Section 6. Section 7 summarizes our findings and concludes with future perspectives.

RELATED WORK
Classical visual saliency models compute saliency from low-level color, intensity and orientation features [1], [3], [24]. Other biologically-inspired models [20], [21] exploit spatial contrast and motion, and simulate interactions between neurons using excitation and inhibition mechanisms, while others [18], [19] propagate spatial/temporal saliency using multiscale color and motion histograms as features. In [19] pixel-level spatiotemporal saliency is computed from spatial and temporal saliencies via interaction and selection driven by superpixel-level saliency. In [18] temporal saliency is propagated forward and backward via inter-frame similarity matrices and graph-based motion saliency, whereas spatial saliency is propagated over a frame using temporal saliency and intra-frame similarity matrices. In most of these models, conspicuity maps are constructed using a variety of approaches with different visual features and are later integrated into a final saliency map.
Given that the eyes are the sensory organs providing most of the information humans receive about their surroundings, many state-of-the-art visual models [18], [19] aim at saliency computation for complex dynamic scenes. But such unimodal models tend to overlook other influential social cues like faces in social interaction scenes, and hence exhibit lower predictability [2], [30]. Moreover, social scenes involve many more sensory signals that influence eye movements spatially, such as auditory information including voice tone, music, etc., and different kinds of sounds affect eye fixations differently [25], [29]. Thus, there is a need for a bimodal saliency model incorporating both visual and audio information channels.
Rapantzikos et al. [26] proposed an audiovisual saliency model for movie summarization. The visual saliency map is constructed using traditional features such as intensity, color and motion, simulating feature competition as energy minimization via gradient descent; this map is thresholded and averaged per frame to compute a 1D visual saliency curve. As audio features, the maximum average Teager energy, mean instant amplitude and mean instant frequency are extracted by applying the Teager-Kaiser energy operator and an energy separation algorithm to the audio signal. The resulting feature vector is normalized to the range [0, 1], followed by weighted fusion to get an audio saliency curve. The final audiovisual saliency curve is a weighted linear combination of the audio and visual saliency curves, and its local maxima are used for key-frame selection. The experiments are conducted on the A.U.T.H. movie database, but no comparison or evaluation is given.
Coutrot and Guyader [9] proposed an audiovisual saliency model for natural conversation scenes: a linear combination of low-level saliency, a face map, and a center bias. The low-level saliency map is constructed via Marat's spatiotemporal saliency model [21]. For face-map construction, a speaker diarization algorithm is proposed that uses motion activity of faces and 26 Mel-frequency cepstral coefficients (MFCCs) as visual and audio features, respectively. The center bias is a time-independent 2D Gaussian function centered on the screen. The three maps are linearly combined into the final audiovisual saliency map, with expectation maximization used to determine the weight of each. The resulting model performs better than the same model without differentiation between speaking and mute faces. However, the target video dataset belongs to a limited category: conversation scenes only.
Sidaty et al. [28] proposed an audiovisual saliency model for teleconferencing and conversational videos. The three best-performing models on the target database, i.e., Itti et al. [13], Harel et al. [12] and Tavakoli et al. [31], are selected as spatial models. Acoustic energy is computed per frame, and a block-matching algorithm is used to construct an audio map from the face stream of the video; peak matching is then used for audiovisual synchronization. Five fusion schemes are used to obtain a final map. Experiments performed on the XLIMedia database, created by the authors, showed that the proposed model performed better than the spatial models. Again, the limitation of this work is that it only targets conferencing and conversational videos.
All in all, one of the major limitations of the aforementioned visual models is that they treat videos as mute sequences of images and ignore any influence of audio stimuli. This results in inaccurate predictions where sound guides eye movement. Another limitation of the literature is the absence of an audiovisual model for complex dynamic scenes; that is, many state-of-the-art models restrict the dataset used to only one specific category, for instance conversational videos. This limits the models' performance when dealing with videos containing different sound classes.

PROPOSED SOLUTION
This section explains the proposed solution for audiovisual saliency computation for videos. The framework consists of five major stages, as illustrated in Figure 1. The first stage extracts audio energy descriptors and object motion descriptors per frame, using the audio and visual stimuli as separate channels. The next stage computes an audio saliency map from these descriptors. In parallel, further stages compute a visual saliency map and a motion map: the former from low-level features, and the latter from color-coded optical flow similar to that used for the audio map. The last stage normalizes and combines all these maps into a unified audiovisual saliency map.

Feature Extraction
In this stage, we extracted visual and acoustic features from a given input video. The stage comprised two phases of feature extraction, one for audio features and the other for visual features.

Audio Feature Extraction
This step outputs an audio energy descriptor a(t), extracted from the audio signal, which captures the changing patterns of the audio signal's strength. Note that the descriptor was obtained with the same temporal resolution as the video frames: the signal was first segmented into frames according to the frame rate of the video so that each audio frame corresponds to a video frame. Using the short-term Fourier transform (STFT), this framed signal was transformed into the time-frequency domain to get a spectrogram of the signal at each frame. The descriptor a(t) was computed by integrating the spectrogram energy over frequency within each window,

a(t) = Σ_f |X(t, f)|²,

where X(t, f) is the STFT of the signal and the windowing function W(t) is defined so that neighboring windows overlap by 50%. The final descriptor was post-processed using a 1D Gaussian kernel.
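As an illustration, the descriptor extraction above can be sketched in Python. This is a minimal sketch, not the authors' MATLAB implementation: it assumes one STFT analysis window per video frame with the 50% overlap described in the text, and the smoothing width `sigma_frames` is an illustrative choice.

```python
import numpy as np
from scipy.signal import stft

def audio_energy_descriptor(signal, sr, fps, sigma_frames=2):
    """Per-video-frame audio energy a(t): spectrogram energy summed
    over frequency, with neighboring analysis windows overlapping 50%."""
    samples_per_frame = int(round(sr / fps))
    nperseg = 2 * samples_per_frame                 # window length
    _, _, Z = stft(signal, fs=sr, nperseg=nperseg,
                   noverlap=samples_per_frame)      # 50% overlap
    energy = np.sum(np.abs(Z) ** 2, axis=0)         # integrate over frequency
    # post-process with a 1D Gaussian kernel, as in the text
    radius = 3 * sigma_frames
    x = np.arange(-radius, radius + 1)
    g = np.exp(-x ** 2 / (2.0 * sigma_frames ** 2))
    g /= g.sum()
    return np.convolve(energy, g, mode="same")
```

The result is a 1D descriptor with roughly one value per video frame, ready for the correlation stage.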

Visual Feature Extraction
Based on the assumption that a moving object is a prime candidate to be an audio signal source, the per-frame acceleration of all moving objects in a given input video was computed as the motion descriptor. First, the moving objects were segmented per frame using optical flow estimation and tracked across frames via color histograms of the regions in HSV color space. The process is as follows:
1) Optical flow computation. The method proposed by [8] was used to compute dense optical flow and corresponding color-coded optical flow images per video frame. The method uses the apparent motion of each pixel to compute forward and backward optical flows, where the former depicts the motion of pixels of frame t with reference to frame t + 1 and the latter the motion of pixels of frame t with respect to frame t − 1. The resulting flows were averaged to get a mean optical flow per frame, later used to compute the audio saliency map.
2) Frame segmentation. The color-coded mean optical flow per frame was used as input for the segmentation step. Mean shift, a nonparametric clustering algorithm, was applied to segment the input image in LUV color space. The oversegmented result was refined by a simple region merging technique based on DeltaE, a color difference score, to merge closely similar regions. Regions smaller than 200 pixels were filtered out as noise or insignificant regions.
3) Region tracking. Once individual frames were segmented, a number of tracks were initialized in the first frame using the segmented regions' location and appearance features. Each region in the following segmented input frames was either assigned to an existing track or used to initialize a new track based on its location and appearance similarities. The location similarity was computed as the Euclidean distance d_E between the centroid of a new region C_n and that of an existing track C_e,

d_E = ||C_n − C_e||_2.

This resulted in a list of candidate tracks similar to the region under consideration, within a specified search radius r. For appearance similarity, the LUV histograms of existing candidate tracks H_e were compared to the new region's histogram H_n using the cosine similarity

cos θ = (H_e · H_n) / (||H_e|| ||H_n||).

The region C_n was assigned to the track whose cos θ was maximum and greater than a specified threshold; the centroid of that track was updated to the centroid of the newly assigned region, and its histogram was replaced with the mean of the existing histogram and the new region's histogram. Otherwise, if cos θ was less than the specified threshold, the region was used to initialize a new track.
4) Calculate acceleration. In this step the objects' acceleration was computed from the motion descriptors. The average of the forward and backward optical flows gives the acceleration at each pixel (x, y, t) per frame,

a(x, y, t) = (F+(x, y, t) + F−(x, y, t)) / 2,

where x and y are spatial coordinates, t is the frame number, and F+ and F− denote the forward and backward optical flows.
The acceleration of region ST_i^t, where i is the region index in frame t, was computed as the average acceleration of all pixels belonging to that region:

ST_i^t = (1 / |R_i|) Σ_{(x,y) ∈ R_i} a(x, y, t),

where R_i is the set of pixels in region i. The resulting acceleration vector was filtered using a Gaussian kernel to remove noise. The result is a motion descriptor of the objects in a given input video.
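The acceleration computations of steps 4 and the per-region averaging can be sketched as follows. This is a hypothetical NumPy sketch assuming the forward/backward flows (H x W x 2 arrays) and a segmentation label map are already available; it is not the authors' implementation.

```python
import numpy as np

def pixel_acceleration(flow_fwd, flow_bwd):
    """Per-pixel acceleration magnitude: average the forward and
    backward flows (H x W x 2), then take the vector magnitude."""
    mean_flow = 0.5 * (flow_fwd + flow_bwd)
    return np.linalg.norm(mean_flow, axis=-1)

def region_acceleration(labels, accel):
    """Mean acceleration per segmented region.
    labels: H x W integer region ids (0..K-1); accel: H x W."""
    counts = np.bincount(labels.ravel())
    sums = np.bincount(labels.ravel(), weights=accel.ravel())
    return sums / np.maximum(counts, 1)   # average over each region's pixels
```

`region_acceleration` returns one value per region, i.e., the motion descriptor entries for one frame.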

Audio Saliency Map Computation
For the audio saliency map computation, we used the audio-video correlation method proposed in [17]. The correlation between the aforementioned audio and motion descriptors was used to localize the source of the sound signal in the input video frames, indicating audio saliency. Winner-Take-All (WTA) hashing [33], a subfamily of hash functions controlled by the number of permutations N and the window size S, was used to transform both descriptors into a rank correlation space. Once in this common space, the Hamming distance was used to relate the audio signal to an object.
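A minimal sketch of WTA hashing and the Hamming comparison follows, under the assumption (consistent with [33]) that each of the N permutations keeps its first S permuted entries and records the index of the maximum; the function names are illustrative.

```python
import numpy as np

def wta_hash(x, perms, S):
    """WTA hash of vector x: for each permutation keep the first S
    permuted entries and record the argmax index (a rank statistic)."""
    return np.array([np.argmax(x[p[:S]]) for p in perms])

def hamming(code_a, code_b):
    """Number of positions where two WTA codes differ."""
    return int(np.sum(code_a != code_b))
```

Because the code depends only on value ordering, it is invariant to any monotonically increasing transform of the descriptor, which is what makes the rank correlation space suitable for comparing audio and motion descriptors from different domains.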

Visual Saliency Map Computation
A classical visual saliency map was used as proposed in [12]. The model approaches saliency computation by defining Markov chains over feature maps, extracted for intensity, color, orientation, flicker, and motion features, and treats equilibrium distributions as saliency values. In detail, each value of a feature map is considered a node, and the connectivity between nodes is weighted by their dissimilarity. Once a Markov chain is defined on this graph, its equilibrium distribution, computed by repeated multiplication of the Markov matrix with an initially uniform vector, accumulates mass at highly dissimilar nodes, providing activation maps. A similar mass-concentration process is applied to these activation maps and the output is summed into a final saliency map.
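The equilibrium computation can be sketched as follows. This simplified Python sketch omits the spatial-distance weighting of the full model in [12] and adds a small teleport term (an assumption of this sketch) so the power iteration converges.

```python
import numpy as np

def gbvs_activation(feature_map, iters=100, alpha=0.95):
    """Equilibrium distribution of a Markov chain on the feature-map
    graph; edge weights are pairwise value dissimilarities."""
    v = feature_map.ravel().astype(float)
    n = len(v)
    W = np.abs(v[:, None] - v[None, :])              # dissimilarity weights
    np.fill_diagonal(W, 0.0)
    P = W / np.maximum(W.sum(axis=0, keepdims=True), 1e-12)  # column-stochastic
    P = alpha * P + (1.0 - alpha) / n                # teleport term for ergodicity
    pi = np.full(n, 1.0 / n)                         # initially uniform vector
    for _ in range(iters):
        pi = P @ pi                                  # repeated multiplication
    return pi.reshape(feature_map.shape)             # mass gathers at dissimilar nodes
```

Applied to a feature map, the returned activation peaks at values that differ most from their surroundings, which is the behavior the text describes.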

Motion Map Computation
The motion map indicates the regions of high motion, computed using the mean optical flow per frame as described in Section 3.1.2. Adaptive thresholding as proposed in [5] was applied to the flows to discard inconsequential low motion,

I_p = 0 if I_p < (100 − T)/100 × mean(N_p),

where pixel I_p is set to zero if its brightness is T percent lower than the average brightness mean(N_p) of its surrounding pixels.
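A minimal sketch of this thresholding rule, assuming a square local neighborhood computed with a uniform (mean) filter; the window size and T value are illustrative defaults, not the paper's parameters.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def adaptive_threshold(img, window=15, T=15.0):
    """Zero out pixels that are more than T percent darker than the
    mean brightness of their (window x window) neighborhood."""
    local_mean = uniform_filter(img.astype(float), size=window, mode="nearest")
    out = img.astype(float).copy()
    out[out < local_mean * (100.0 - T) / 100.0] = 0.0
    return out
```

Applied to a flow-magnitude image, pixels with motion well below their local average are suppressed, leaving only the high-motion regions.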

Normalization and Combination
In this final stage, the three computed maps, a) the visual saliency map, b) the audio saliency map, and c) the motion map, were normalized before being combined into a final audiovisual saliency map. Here the visual saliency map was a sum of normalized activation maps computed using the mass-concentration algorithm. The other two maps were normalized to the range [0, 1] using simple linear transformations. The resulting normalized maps were linearly combined to obtain the final audiovisual saliency map.
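The normalization and fusion stage can be sketched as follows; the equal default weights are an assumption of this sketch, as the text does not specify the combination weights.

```python
import numpy as np

def minmax(m):
    """Linearly rescale a map to [0, 1] (constant maps go to zero)."""
    lo, hi = m.min(), m.max()
    return (m - lo) / (hi - lo) if hi > lo else np.zeros_like(m, dtype=float)

def combine(visual, audio, motion, w=(1.0, 1.0, 1.0)):
    """Normalize the three maps and take a weighted linear combination."""
    maps = [minmax(np.asarray(x, dtype=float)) for x in (visual, audio, motion)]
    fused = sum(wi * mi for wi, mi in zip(w, maps))
    return minmax(fused)   # keep the final audiovisual map in [0, 1]
```

The weights could equally be learned or tuned per dataset; the sketch only shows the linear fusion described in the text.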

IMPLEMENTATION AND EVALUATION
The proposed solution was implemented in MATLAB 2014b on a 64-bit Windows 10 machine with an Intel i5 processor. The same setup was used for evaluation. The parameters used for the proposed solution are given in Table 1.

Dataset
The Dynamic Images and Eye Movements (DIEM) dataset [23] was used for the evaluation of the proposed approach. The dataset comprises 85 videos, with or without audio, of varying genres. Eye fixation data was collected via a binocular eye-tracking experiment with 250 participants in total, aged between 18 and 36 years, with normal or corrected-to-normal vision. In this work, 25 videos with audio were randomly selected for evaluation. The video sequences are listed in Table 2 along with their properties.

Evaluation Metrics
The proposed solution was evaluated using four criteria.

1) Area under the curve (AUC) is a location-based metric in which saliency pixels equal in number to the total recorded fixations are randomly extracted. True positives (TP) and false positives (FP) are calculated for different thresholds, treating the saliency map as a classifier; the resulting values are used to plot an ROC curve and compute the AUC. The ideal score is 1.0, while a value of 0.5 indicates random classification.
2) Kullback-Leibler divergence (D_KL) is a distribution-based dissimilarity measure,

D_KL(M_f || M_s) = Σ_x M_f(x) log( M_f(x) / M_s(x) ),

which estimates the loss of information when the saliency map M_s is used to approximate the fixation map M_f, both considered as probability distributions. The ideal D_KL score is zero, meaning the saliency and fixation maps are exactly the same; larger values indicate a poorer saliency model.
3) Normalized scanpath saliency (NSS) is computed as

NSS = (1/N) Σ_i M̄_s(x_i),

where the saliency map M_s is normalized to zero mean and unit standard deviation (denoted M̄_s) and then averaged over the N fixation locations x_i. A score of zero means a chance prediction, whereas a high score indicates high predictability of the saliency model.
4) Linear correlation coefficient (CC) is another distribution-based metric,

CC = cov(M_s, M_f) / (σ_{M_s} σ_{M_f}),

whose output ranges between −1 and +1; the closer the score is to either extreme, the better the predictability of the saliency model.

[Table 2 (fragment; video sequence / scene type): game trailer lego indiana jones / Computer Game; harry potter 6 trailer / Movie; home movie Charlie bit my finger again / Movie; news bee parasites / News; news sherry drinking mice / News; news us election debate / News; planet earth jungles monkeys; sport football best goals / Sports. The audio-source column of the original table is not recoverable here.]
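Two of the metrics above, NSS and CC, can be sketched directly from their definitions; a minimal NumPy sketch with illustrative function names, not a reference implementation.

```python
import numpy as np

def nss(sal, fixations):
    """Normalized scanpath saliency: z-score the saliency map, then
    average its values at the fixation locations (row, col) pairs."""
    z = (sal - sal.mean()) / sal.std()
    return float(np.mean([z[r, c] for r, c in fixations]))

def cc(sal, fix_map):
    """Pearson linear correlation between saliency and fixation maps,
    treated as flattened distributions."""
    a, b = sal.ravel(), fix_map.ravel()
    return float(np.corrcoef(a, b)[0, 1])
```

For example, a map whose single peak coincides with the only fixation yields a large positive NSS, while a map identical to the fixation map yields CC = 1.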

Comparison Methods
Based on our literature review, we found no other audiovisual saliency model for complex dynamic scenes. For comparison purposes, we therefore evaluated our proposed audiovisual saliency model against two state-of-the-art visual saliency models. The first model, proposed in [19], derives a pixel-level spatiotemporal saliency map from superpixel-level spatial and temporal saliency maps constructed using motion and color histogram features. The other spatiotemporal saliency detection model, proposed by Liu et al. [18], is based on a superpixel-level graph and temporal propagation.

RESULTS
For evaluation, we computed saliency maps for the selected videos from the DIEM dataset using the two state-of-the-art visual saliency models and the proposed model. The evaluation scores (Table 3) for the resulting saliency maps, computed over the first 300 frames per video, were compared to assess eye movement predictability.
We observe that the proposed model not only outperforms both comparison methods but also achieves a satisfactorily higher average score in terms of AUC. Moreover, its lower D_KL score indicates a better saliency model with less dissimilarity to the ground truth. For the remaining evaluation metrics, CC and NSS, the proposed method yields slightly lower scores; however, the results still suggest that the proposed model makes better eye movement predictions overall, supporting the idea of incorporating audio features when computing spatiotemporal saliency for unconstrained videos. Some video sequences performed particularly well, for instance stewart lee, news us election debate and one show, which feature an on-screen sound source with no object occlusion or interaction. Figure 2 illustrates the saliency maps obtained by the different methods. The visual comparison demonstrates that our proposed model performs comparatively better. For instance, for the video sequence with an on-screen audio source type in the third row, the visual models failed to correspond to the ground truth (GT) as they considered both faces salient; by contrast, the proposed audiovisual model marked the talking face as salient.

DISCUSSION
Spatiotemporal saliency detection is a challenging problem. It is worth noting that existing models ignore the audio signal in the input media. However, a number of experimental studies [25], [29], [32] discuss the influence of aural stimuli on early attention when viewing complex scenes; that is, audio stimuli can provide useful information for guiding eye movements. This influence can be incorporated into existing bottom-up models by including low-level audio properties like energy, frequency, and amplitude. The resulting audiovisual saliency model makes more sense in application areas like video summarization/compression, event detection, gaze prediction, and robotic vision and interaction. There exist some models in the literature based on multiple stimuli [9], [26], [28], but they lack a generic solution, limiting the models to specific categories of videos.
A major reason for this gap in the literature is one of the foremost challenges of audiovisual saliency models: localization of the audio source in a given frame. Some methods use microphone arrays to triangulate a single source, or only target stationary sources in a scene; such models fail on dynamic videos, as they assume a single audio source. An approach that overcomes these restrictions uses correlation analysis between audio and video segments, where the audio source is a set of relevant pixels rather than an object. This approach has been used in more recent works where object segmentation precedes audiovisual correlation, so that audio source separation preserves the shape of the source object. Since audio and video signals come from different domains, reliable correlation requires feature transformation into a suitable common space. Moreover, it requires a method to relate an audio descriptor to object descriptors in a video frame; segmentation and tracking of diverse objects in a video is itself a challenging task. To be precise, the literature lacks techniques for multiple objects, which is the case in our dataset, with no a priori information about objects such as shape, color, or size.
In terms of eye movement predictability, the proposed audiovisual saliency model performed better on two evaluation metrics and achieved comparable scores on the other two. This result can be attributed to the difficulty of segmenting and tracking multiple interacting objects under varying conditions such as motion blur and crowding. Moreover, multiple and/or off-screen audio sources make locating an audio source more challenging, which in turn affects the model's performance.
The proposed saliency model exhibits higher time complexity (Table 4), attributable to the dense optical flow computation, which is inherently compute-intensive as an optimization problem. The main advantage of the method is that it estimates both forward and backward flows, so the optical flow of occluded regions is also computed correctly.
Alternative motion estimation approaches such as block matching and phase correlation could be used instead for a more efficient solution. Likewise, segmentation of multiple objects is a complex task involving mean-shift segmentation, a nonparametric clustering method based on kernel density estimation; the approach does not scale well due to its large feature-space dimensionality. Alternatively, simpler histogram- or superpixel-based segmentation methods could be used to reduce computational complexity as well as increase model predictability.
A shortcoming of the current study is the use of a subset of the available dataset for evaluation. It would be interesting to perform the evaluation on the entire video dataset and/or other available datasets to reinforce our finding that aural stimuli, alongside visual stimuli, can provide useful information for guiding eye movements.
All in all, the proposed solution scored reasonably well; however, it can be further improved. Improvements in segmentation and tracking techniques may contribute to a better audio saliency map, and in turn to a better final saliency map. Furthermore, the use of a more sophisticated visual saliency model, as well as more suitable combination techniques, could improve the final result.

CONCLUSION
Existing bottom-up saliency models use only visual stimuli, while the audio stimuli available in the input media remain unused. In this paper, we proposed an audiovisual saliency model incorporating both low-level visual and audio information to produce three different saliency maps: an audio saliency map, a motion saliency map, and a visual saliency map. These maps were linearly combined to obtain a final saliency map, which was evaluated on the DIEM dataset using four different criteria. The results show an overall improvement over two state-of-the-art visual saliency models and reinforce the idea that aural stimuli can provide useful information to guide eye movements.