A Comprehensive Survey on Video Saliency Detection with Auditory Information: the Audio-visual Consistency Perceptual is the Key!

Video saliency detection (VSD) aims at quickly locating the most attractive objects/things/patterns in a given video clip. Existing VSD-related works have mainly relied on the visual system but paid less attention to the audio aspect, while, actually, our audio system is the most vital complementary part to our visual system. Also, audio-visual saliency detection (AVSD), one of the most representative research topics for mimicking human perceptual mechanisms, is currently in its infancy, and none of the existing survey papers have touched on it, especially from the perspective of saliency detection. Thus, the ultimate goal of this paper is to provide an extensive review to bridge the gap between audio-visual fusion and saliency detection. In addition, as another highlight of this review, we provide a deep insight into the key factors that could directly determine the performance of AVSD deep models, and we claim that the audio-visual consistency degree (AVC), a long-overlooked issue, can directly influence the effectiveness of using audio to benefit its visual counterpart when performing saliency detection. Moreover, to make the AVC issue more practical and valuable for future followers, we have newly equipped almost all existing publicly available AVSD datasets with additional frame-wise AVC labels. Based on these upgraded datasets, we have conducted extensive quantitative evaluations to ground our claim on the importance of AVC in the AVSD task. In a word, both our ideas and new sets serve as a convenient platform with preliminaries and guidelines, all of which have the potential to help future works further advance state-of-the-art (SOTA) performance.


I. Introduction
We humans tend to be attracted by specific things, and this mechanism follows a common basic principle. Outwardly, however, it can vary across people and scenes, and, directly or indirectly, such differences are usually caused by either personality and individual differences or the exact environment [1], [2], [3]. For example, in the open wild, we may be attracted by a fantastic natural scene and pay less attention to artificial subjects. However, things go differently in a downtown area, where magnificent artificial buildings could keep drawing our attention. Also, our attention could shift to "rare" elements, i.e., patterns that are anomalies with respect to their nearby surroundings, and we have an academic name for all these objects/things/patterns attracting our attention: saliency.
†Corresponding author: Li Guo (ally kwok@163.com). The first two authors contribute equally to this paper.
Fig. 1: The difference between video salient object detection (VSOD) and video fixation prediction (VFP). The GTs for VSOD (the 2nd column) are the human-annotated object-level binary masks, while the GTs used in VFP (the 3rd column) are the human-eye fixations (without clear object boundary) recorded automatically by using eye-trackers.
In general, the saliency-related research activities [4] should come with a specific venue, e.g., the visual saliency, which aims at segmenting the most eye-attracting objects or regions in a given scene. The scenes are usually "expressed" in the form of images or videos. Since video data is the main focus of this survey, we shall omit image-based saliency works.
The current visual saliency detection research field can be roughly divided into two groups, i.e., video salient object detection (VSOD) and video fixation prediction (VFP). The basic methodologies of VSOD and VFP are almost the same, where the existing hand-crafted methods [5], [6], [7], [8] mainly follow either a top-down or a bottom-up rationale. After entering the deep learning era, most of the existing works [9], [10], [11], [12] have adopted the end-to-end encoder-decoder network architecture, which, generally, belongs to the typical top-down category. Hence, the difference between VSOD and VFP lies in the exact training ground truth data and the training loss functions. For a better understanding, Fig. 1 demonstrates such a difference.
Though our visual system is one of the most important venues for us to perceive the environment that we are in, our auditory system also plays an important role. For example, our attention could quickly shift to a sounding object, showing that our auditory system can really complement our visual system. Despite being complementary in general, these two venues have completely different perceptual mechanisms.
The visual venue is very informative yet has a rather limited sensing scope (because of the limited field of view, FOV). In contrast, the auditory venue is less informative, yet its sensing scope is clearly dead-angle-free. Besides, different from the visual saliency research field, a quite mature topic, audio-related saliency is in its infancy. Moreover, different from the visual saliency, a single-modality task with abundant accessible training data, the available training data for the audio-visual saliency detection (AVSD) is in critical shortage, which has resulted in a clear performance bottleneck, especially in this deep learning era. Meanwhile, we have noticed that there exists a large body of research [18], [19], [20] regarding visual and auditory fusion, whose main interests usually focus on multimedia applications, e.g., multi-modality information processing, filtering, and understanding, and these works rarely intersect with the saliency detection research field. Though some of the existing fusion methods proposed in previous literature [21], [22] can really inspire and help the network design toward the saliency detection task, none of them has covered both saliency detection and audio-visual fusion. Thus, as shown in Fig. 2, this review mainly focuses on two topics, and we choose three concrete research fields as the main focus, i.e., video saliency detection (VSD), audio-visual correspondence (AVC), and audio-visual saliency detection (AVSD). Also, the differences between several existing reviews on audio-visual representation learning and ours are illustrated in Table I.
Fig. 2: The structure of our review, which covers two significant topics: 1) Video Saliency Detection and 2) Audio-Visual Multi-Modality Fusion. The most representative applications are highlighted with red rectangular boxes. Also, we have newly argued that the audio-visual semantical consistency perceptual (highlighted by the blue box) is the key factor in determining the AVSD performance.
TABLE I: Illustration of the main differences between the existing reviews and ours.
Reviews                    Year  Publication  Contents
Katsaggelos et al. [13]    2015  P-IEEE       Audio-visual Fusion
Baltrusaitis et al. [14]   2018  T-PAMI       Multi-modality Machine Learning
Cong et al. [15]           2018  T-CSVT       RGBD/Video/Co-saliency Detection
Wang et al. [16]           2019  T-PAMI       Video Saliency Detection
Zhu et al. [17]            2021  IJAC         Audio-visual Localization/Correspondence
Chen et al. (Ours)         2022  T-CSVT       Audio-visual Saliency Detection
Besides providing an extensive review, we have noticed that the audio-visual consistency (AVC), a representative task considered in the multimedia research field [23], [24], [25], is the key factor in determining the overall performance of AVSD, while its importance has long been overlooked by the AVSD research field. To verify our claim, we have newly labeled all publicly available AVSD clips frame-by-frame and conducted extensive quantitative experiments with them. This new finding has the potential to benefit the audio-visual saliency detection research field in the near future.
In a brief summary, significant highlights and contributions of this review include the following aspects:
As shown in Fig. 1, the GTs used for the VSOD task are binary masks, where all salient objects have been well annotated/segmented by humans. In contrast, the GTs used in the VFP task are human-eye fixations (i.e., individual pixel-wise coordinates) collected by eye-trackers directly, representing raw image regions that humans would pay attention to. In a word, GTs for the VSOD task are object-aware, while GTs for the VFP task are scattered locations. Also, the widely-used loss function in the VSOD task is the cross-entropy loss, while the Kullback-Leibler (KL) divergence is the most frequently used one in the VFP task. In the following two subsections, we will review these two research branches respectively.
TABLE II: Comparison of optical flow [35], LSTM [36], ConvLSTM [37] and 3D convolution [38].
Optical Flow [35]      easy  expensive  light
LSTM [36]              hard  expensive  heavy
ConvLSTM [37]          hard  expensive  heavy
3D Convolution [38]    easy  cheap      light

A. Video Salient Object Detection (VSOD)
As can be seen in Fig. 1, the major difference between the deep learning-based VSOD and VFP tasks is the training ground truth data, i.e., the VFP task simply aims at simulating human eye-fixations, while the VSOD task is more biased towards the object aspect and can be treated as a combination of object segmentation and object localization [11], [34], [9]. The rationale of the localization process is similar to that of the VFP task, and all non-salient objects are filtered out by this process.
Despite using different GTs, there also exist multiple other distinguishing differences: 1) The VSOD task should additionally consider the integrity of the detections, i.e., the detected salient regions should precisely comprise the entire salient object with all its subparts included. However, the VFP task aims at simulating the human eye's fixations, thus the detected results are not required to highlight the entire object.
2) The most common VSOD scenario is fully automatic video segmentation. In this application, the saliency ranks of different objects tend to stay unchanged for a long period of time. However, the human eye's fixations are usually scattered locations, which are rather weak in indicating the corresponding objects. In other words, fixations usually shift between objects.
Besides, due to the differences mentioned above, the loss functions adopted by VSOD and VFP are also different: the VSOD task mainly uses the cross-entropy loss, while the VFP task usually prefers the Kullback-Leibler (KL) divergence, linear correlation coefficient (CC) loss, normalized scanpath saliency (NSS) loss, and similarity (SIM) loss, all of which are designed for measuring the consistency degree between the predicted scattered fixations and the real human eye fixations.
The bi-stream-based models usually consist of two sub-branches, one for the motion saliency clues, whose input focuses on the temporal information (e.g., optical flow data); the other is the conventional color branch, which could be any off-the-shelf image salient object detection deep model. Note that the network architectures of these two branches could be the same, and the only difference is their training input, i.e., optical flow result vs. color image.
The single-stream-based methods have abandoned the individual temporal computation, e.g., the time-consuming optical flow [35]. Instead, they take multiple frames as input each time, and then use either LSTM [36], ConvLSTM [37] or 3D convolution [38] to sense temporal information. Detailed comparison results of these methods are shown in Table II. Compared with the bi-stream-based methods, this type of work has a significant advantage, i.e., it could be 10 times faster in computation, because the individual temporal information computation is the major efficiency bottleneck for the bi-stream-based approaches. More details regarding this issue can be found in [54].
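To make the measures named above concrete, the following minimal sketch computes the binary cross-entropy, KL divergence, CC, NSS, and SIM on flattened maps, using their commonly used textbook definitions. The function names, the `EPS` constant, and the plain-list representation are assumptions made for illustration; actual VSOD/VFP implementations operate on tensors and may differ in normalization details.

```python
import math

EPS = 1e-12  # assumed small constant guarding against log(0) and division by zero


def _mean(values):
    return sum(values) / len(values)


def _std(values):
    m = _mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))


def _to_distribution(values):
    """Rescale a non-negative map so that its entries sum to (almost) one."""
    total = sum(values)
    return [v / (total + EPS) for v in values]


def binary_cross_entropy(pred, mask):
    """Mean cross-entropy between predicted probabilities and a binary mask (VSOD)."""
    terms = [m * math.log(p + EPS) + (1 - m) * math.log(1 - p + EPS)
             for p, m in zip(pred, mask)]
    return -sum(terms) / len(terms)


def kl_divergence(pred, gt):
    """KL(gt || pred), with both maps treated as probability distributions."""
    p, q = _to_distribution(pred), _to_distribution(gt)
    return sum(qi * math.log(qi / (pi + EPS) + EPS) for pi, qi in zip(p, q))


def cc(pred, gt):
    """Linear (Pearson) correlation coefficient between two maps."""
    mp, mg = _mean(pred), _mean(gt)
    cov = sum((p - mp) * (g - mg) for p, g in zip(pred, gt)) / len(pred)
    return cov / (_std(pred) * _std(gt) + EPS)


def nss(pred, fixations):
    """Mean z-scored prediction at fixated pixels (fixations is a 0/1 list)."""
    m, s = _mean(pred), _std(pred)
    scores = [(p - m) / (s + EPS) for p, f in zip(pred, fixations) if f > 0]
    return sum(scores) / len(scores)  # assumes at least one fixated pixel


def sim(pred, gt):
    """Histogram intersection between the two normalized maps."""
    p, q = _to_distribution(pred), _to_distribution(gt)
    return sum(min(a, b) for a, b in zip(p, q))
```

When used as training losses, KL is minimized directly, whereas CC, NSS, and SIM are similarity scores and are therefore typically negated.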

B. Video Fixation Prediction (VFP)
Different from the VSOD task, which uses well-annotated object-wise binary masks as training objectives, the training GTs for the VFP task are scattered human-eye fixation locations collected by an eye tracker (e.g., Tobii, EyeLink, Smart Eye, and GazeTech). The earliest deep learning-based VFP approaches [55] followed the bi-stream structure, which clearly belongs to the multi-task rationale, where one stream handles the fixation predictions in the spatial domain, and another stream focuses on the fixation predictions over the temporal scale. Thus, the key problem of the bi-stream-based VFP models is how to achieve the fusion balance between the sub-streams. Clearly, this methodology is quite similar to that of the bi-stream-based VSOD approaches, the only minor differences being the loss functions and GTs.
Optical Flow-based Approaches. The primary way for the bi-stream VFP models to sense temporal information is to take the optical flow results as the models' input. As shown in Fig. 3, almost all existing bi-stream VFP models have adopted the optical flow (e.g., the most representative conventional one [35] and the deep learning-based ones, such as FlowNet [56], [57]) as the temporal sub-stream to sense temporal information. Here we just name a few of the most representative ones.
In [33], Lai et al. have made two key innovations: 1) a novel way of performing early fusion between spatial and temporal feature backbones, and 2) the first application of the convolutional Gated Recurrent Unit (convGRU) to learning the temporal attention transitions across time, which makes the predicted video fixation maps temporally smooth. The major highlight of the fusion scheme is that the deep features, obtained by the spatial and temporal feature backbones respectively, are connected densely via a residual attention mechanism in a multi-scale way. Specifically, the exact fusion operation is clearly biased towards the spatial information, where the deep features from the temporal backbone only serve as auxiliary stimuli. As a variant of the classic LSTM, the adopted convGRU has two advantages: 1) a simpler network design, and 2) a slight performance improvement (less than 0.5%).
Fig. 3: Method pipeline of the optical flow-based bi-stream approaches, which mainly contains a temporal stream and a spatial stream followed by a fusion module and a decoder.
Following the bi-stream structure [32] also, Zhang et al. [58] have devised a novel fusion scheme. The key idea of the proposed fusion is to perform a selective combination of spatial and temporal information. The channel-wise attention has been used as the indicator to guide the selection process, and the rationale is that only those deep features with strong feature responses would be able to benefit the detection task. In addition, the authors have devised a novel strategy which takes the spatial position of the salient objects in previous consecutive frames as an additional input, aiming at facilitating the estimation of temporal saliency by shrinking the problem domain. Consequently, the network's output could stay consistent (smooth) over the temporal direction.
In summary, the major advantage of the optical flow-based approaches is the strong temporal sensing ability, while the disadvantage is also clear, i.e., the optical flow computation process takes additional time. Also, the exact fusion strategy determines the overall performance directly, because the spatial and temporal streams actually interact very little.
Long Short-Term Memory-based Approaches. Actually, most of the existing state-of-the-art (SOTA) VFP approaches have adopted the long short-term memory (LSTM) to sense temporal information. Compared with the optical flow-based ones, the LSTM-based approaches usually follow the single-stream methodology. As can be seen in Fig. 4, this type of approach usually adopts convolutional neural networks (CNN) to compute spatial deep features for each single frame. Then, in order to sense temporal information, all deep features computed individually via CNN are fed into the input gate of the LSTM. Finally, a decoder is applied to produce the pixel-wise fixation prediction.
In [59], Wang et al. have completely followed the structure demonstrated in Fig. 4. However, some modifications have been made in the spatial stream, including: 1) several residual layers were used to compensate for the loss of receptive field caused by removing the last two pooling layers of the VGG16 feature backbone; 2) the spatial attention mechanism was applied to the spatial stream for facilitating the network training, where the static fixation GTs could be used as the attentions helping the network's training (i.e., the dynamic saliency), which could be able to alleviate the demand of large
Similar to [59], Linardos et al. in [60] have placed the LSTM in the middle stage of a typical encoder-decoder CNN. The LSTM collects the output of the encoder, and then its output, representing the spatiotemporal information, is fed to the decoder to formulate the fixation prediction. The major highlight of this work is the proposed recurrent mechanism, where the LSTM's output is used as an intra-attention to enhance the input data flow. Consequently, the network's ability to sense temporal information is improved significantly.
To further enhance the sensing ability of temporal information, Chen et al. in [61] have taken 3 frames as the network's input each time. Then, the deep features respectively computed from these frames are combined as the input of the LSTM. Compared with the conventional LSTM-based approaches, which take only 1 frame as input each time, this method considers 3 frames, and thus its temporal sensing ability could be enhanced. Since spatial displacement occurs along the time scale, deep features computed from consecutive video frames are usually misaligned, which could confuse the subsequent learning process and blur the final prediction results. To solve this problem, the authors have resorted to deformable convolution, an off-the-shelf tool that could dynamically learn the spatial positions of the adopted convolutional kernels. By using the deformable convolution, the deep features are aligned before being input into the LSTM.
It is worth mentioning that the LSTM can also be replaced by other networks which have the ability to sense temporal information. For example, Droste et al. in [62] have adopted the recurrent neural network (RNN), the early prototype of the LSTM, for sensing temporal information, where the RNN is placed between the encoder and decoder, sharing a similar overall network structure to that of [61].
Apart from the single-stream LSTM-based approaches mentioned above, there also exist several works [63] following the bi-stream methodology, where the spatial and temporal information interact with each other as an early fusion. Jiang et al. in [63] followed the typical bi-stream structure, in which a pruned structure of YOLO is applied as the subnet for sensing the spatial information, and the temporal stream is a pruned FlowNet [56]. The multi-scale deep features provided by the spatial stream are collected via the concatenation and batch normalization operations, formulating a coarse localization mask to suppress those clearly non-salient backgrounds in the deep spatial features. Meanwhile, the deep features of the temporal stream are also assembled in a way identical to that used in the spatial stream. Finally, the multi-scale deep features assembled individually from the spatial and temporal streams are concatenated and fed to an LSTM.
Fig. 5: Method pipeline of the 3D convolution-based approaches, and the major highlight of these approaches is their capability of sensing both spatial and temporal information in a cubic way.
Similar to the early fusion adopted in [63], Wu et al. in [64] have made one modification to enhance the temporal sensing ability: the inter-frame correlations are explored by performing the simple dot-product operator along the channel dimension. Besides, the authors have adopted the spatial attention-based shuffle operation to enhance the spatial stream, where the multi-level deep features are combined and later shuffled. Both these strong spatial deep features and cross-frame correlation features are fed into a variant of the LSTM, named the correlation-based ConvLSTM, where the input gate has been modified to an addition operation-based feature fusion, so that it can simultaneously take two different sources as its input.
In a brief summary, the major advantage of the LSTM-based approaches is their faster computational speed compared with the optical flow-based methods. However, some of the most recent works [54] have argued that the nature of the LSTM might not be sensing the temporal information but constraining its individual inputs to stay partially consistent, and thus it could be capable of eliminating intermittent external disturbances. In a word, the LSTM is clearly inferior to the optical flow in sensing temporal information.
3D Convolution-based Approaches. Compared with the widely-used 2D convolution that can only sense spatial information, 3D convolution is capable of sensing both spatial and temporal information in a cubic way. As discussed in [54], 3D convolution is generally inferior to its competitors (e.g., LSTM [36] and optical flow [65]) in sensing temporal information, but it still has several unique advantages, i.e., fast computation and good compatibility. Also, to the best of our knowledge, 3D convolution-based VFP models [66] are generally leading the SOTA performance, and the overall method pipeline of this type of approach is provided in Fig. 5. We shall review several of the most representative ones here.
Min et al. in [67] have directly applied the 3D convolution to the conventional 2D encoder-decoder architecture, where the exact implementation is straightforward, i.e., all 2D convolutions are replaced by 3D versions. Though the newly applied 3D convolution is capable of providing some temporal information, there exists one critical problem in the decoder. That is, the widely-used unpooling operation cannot provide the exact spatial locations over the temporal scale, limiting the decoder's performance. To alleviate it, the authors have devised an auxiliary pooling scheme, whose key rationale is to record all spatial, temporal, and channel locations when performing pooling operations. Therefore, the unpooling operations in the decoder layers can eventually re-use the recorded locations.
Recently, Bellitto et al. in [68] have followed the 3D encoder-decoder network structure for the VFP task. The highlight of this approach is the newly proposed decoder, where two new concepts have been considered. To handle the domain shift problem, each side output of the encoder is assigned to an unsupervised binary classifier, whose primary objective is to follow the adversarial training that minimizes the gap between features respectively learned from the source and target domains. Besides, for each layer in the decoder, multiple domain-specific priors are dynamically learned and incorporated to make the network domain-specific, and this strategy could further improve the quantitative scores significantly.
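As an illustration of the dot-product correlation mentioned above, the sketch below takes one plausible reading of it: for two frames' feature maps of shape channels x height x width, the features at the same spatial location are multiplied and summed over the channel dimension, yielding one correlation value per location. The function name and the nested-list layout are assumptions; the exact formulation in [64] may differ (e.g., it may correlate across neighboring locations).

```python
def channel_dot_correlation(feat_a, feat_b):
    """Per-location dot product over channels of two [C][H][W] feature maps.

    Returns an [H][W] map whose entries are large where the two frames'
    feature vectors point in similar directions.
    """
    channels = len(feat_a)
    height = len(feat_a[0])
    width = len(feat_a[0][0])
    return [
        [
            sum(feat_a[c][y][x] * feat_b[c][y][x] for c in range(channels))
            for x in range(width)
        ]
        for y in range(height)
    ]


# Toy example: 2 channels, 1 row, 2 columns.
frame_t = [[[1.0, 0.0]], [[2.0, 1.0]]]
frame_t1 = [[[3.0, 5.0]], [[1.0, 2.0]]]
correlation_map = channel_dot_correlation(frame_t, frame_t1)
```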
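The idea of recording pooling locations for later unpooling can be sketched as follows for a single-channel time x height x width volume with non-overlapping cubic windows. This is a generic max-pooling-with-indices illustration under assumed names (`max_pool3d_with_indices`, `max_unpool3d`), not the authors' implementation, which additionally records channel locations.

```python
def max_pool3d_with_indices(volume, k=2):
    """Max-pool a [T][H][W] volume with k*k*k windows, recording argmax positions."""
    t_len, h_len, w_len = len(volume), len(volume[0]), len(volume[0][0])
    pooled, indices = [], []
    for t0 in range(0, t_len - k + 1, k):
        plane_vals, plane_idx = [], []
        for y0 in range(0, h_len - k + 1, k):
            row_vals, row_idx = [], []
            for x0 in range(0, w_len - k + 1, k):
                # Scan the window and remember where its maximum sits.
                best = (t0, y0, x0)
                for t in range(t0, t0 + k):
                    for y in range(y0, y0 + k):
                        for x in range(x0, x0 + k):
                            if volume[t][y][x] > volume[best[0]][best[1]][best[2]]:
                                best = (t, y, x)
                row_vals.append(volume[best[0]][best[1]][best[2]])
                row_idx.append(best)
            plane_vals.append(row_vals)
            plane_idx.append(row_idx)
        pooled.append(plane_vals)
        indices.append(plane_idx)
    return pooled, indices


def max_unpool3d(pooled, indices, shape):
    """Place each pooled value back at its recorded (t, y, x) position; zeros elsewhere."""
    t_len, h_len, w_len = shape
    out = [[[0.0] * w_len for _ in range(h_len)] for _ in range(t_len)]
    for plane_vals, plane_idx in zip(pooled, indices):
        for row_vals, row_idx in zip(plane_vals, plane_idx):
            for value, (t, y, x) in zip(row_vals, row_idx):
                out[t][y][x] = value
    return out
```

Because the temporal index is stored together with the spatial ones, the decoder can restore a value to the frame it came from rather than to an arbitrary frame in the window.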

III. Audio-Visual Multi-Modality Fusion
Different from the visual signal, which usually determines human attention directly, the audio signal usually plays an auxiliary role, influencing human attention in a subtle way. For example, our attention can be easily attracted by a sounding object, e.g., the sound of a dropping box hitting the floor. However, some audio signals are completely unhelpful in drawing our attention, e.g., background music. Thus, since the human visual field has blind spots, the audio signal, whose perception scope is almost 360°, should be appropriately used to complement the visual signal in practice. With the development of deep learning techniques, more and more research attention has been paid to how to combine/fuse audio and visual information for vision-related tasks, e.g., sounding object localization [18], audio-visual synchronization [69], object tracking [70], and saliency detection [71]. Though the primary focus of this review is on saliency detection, we shall still review several of the most representative audio-visual-related tasks [72], [73], [74] in advance, because these fusion-related techniques can be directly referred to and offer deep insight into audio-visual saliency detection. For better readability, we introduce the three most representative tasks here, including audio-visual correspondence (AVC), audio-visual localization (AVL), and face and audio matching (FAM).

A. Preliminaries on Audio Feature Representation
Given any 1-dimensional raw audio data, we can directly input it to an off-the-shelf feature backbone (e.g., SoundNet [75] or VggSound [76]), where the raw audio is sequentially convolved by a series of 1D kernels. Alternatively, the 1D audio signal can be transformed to a 2D spectrogram, so that we can adopt the existing popular backbones (e.g., VggNet [77] or ResNet [78]) instead, where the audio signal's 2D spectrogram is visualized in the middle of Fig. 7. To make the audio's 2D spectrogram more discriminative and sensitive to our human auditory system, we can use the Mel filter, a predefined linear transformation [79], to convert the 2D spectrogram to a Mel spectrogram (see the right part of Fig. 7).
Fig. 7: The raw audio can be transformed to a 2D spectrogram, so that the existing backbones (e.g., VggNet [77] or ResNet [78]) can be used. Next, the Mel filter is utilized to make the 2D spectrogram more discriminative and sensitive to our human auditory system.
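To illustrate why the Mel filter is a predefined linear transformation, the sketch below builds a bank of triangular filters spaced evenly on the Mel scale and applies it to one spectrogram frame as a matrix-vector product. It uses the widely used formula mel = 2595 log10(1 + f / 700); the function names and parameter choices are assumptions for illustration and are not taken from [79].

```python
import math


def hz_to_mel(hz):
    """Convert frequency in Hz to the Mel scale (common 2595/700 variant)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft_bins, sample_rate):
    """Return n_mels rows of triangular weights over n_fft_bins frequency bins."""
    max_hz = sample_rate / 2.0
    bin_hz = [i * max_hz / (n_fft_bins - 1) for i in range(n_fft_bins)]
    # n_mels + 2 edge points, equally spaced on the Mel axis.
    mel_points = [i * hz_to_mel(max_hz) / (n_mels + 1) for i in range(n_mels + 2)]
    hz_points = [mel_to_hz(m) for m in mel_points]
    bank = []
    for m in range(1, n_mels + 1):
        left, center, right = hz_points[m - 1], hz_points[m], hz_points[m + 1]
        row = []
        for f in bin_hz:
            rising = (f - left) / (center - left)
            falling = (right - f) / (right - center)
            row.append(max(0.0, min(rising, falling)))
        bank.append(row)
    return bank


def apply_filterbank(bank, spectrum_frame):
    """Linear map from one magnitude/power spectrum frame to Mel-band energies."""
    return [sum(w * s for w, s in zip(row, spectrum_frame)) for row in bank]
```

Applying `apply_filterbank` to every time frame of the spectrogram gives the Mel spectrogram; a logarithm is commonly taken afterwards, although that step is not part of the linear transformation itself.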

B. Audio-visual Correspondence (AVC)
As a young task proposed in 2017 [80], the AVC task takes both audio and visual information as input, aiming at making binary predictions on whether the given audio event is synchronized with the current visual event. For example, a barking dog might be out of the visual field, making the audio event unsynchronized with the visual event. In this case, the AVC task should make a negative prediction, and vice versa. For a better understanding, we have provided a pictorial demonstration of the AVC task's overview in the bottom-left of Fig. 2. Also, Fig. 6 (a) demonstrates the AVC task more clearly from the network perspective. Clearly, the nature of the AVC task is a typical binary classification, and the technical key is how to align and fuse audio and visual streams.
Arandjelovic et al. in [80] followed an identical network structure to that of Fig. 6 (a). Instead of focusing on the feature representation aspect, the primary interest of this work is to learn the relationship between single static frames and their audio counterparts. To fuse deep features respectively derived from audio and visual sources, the authors have resorted to a series of feature reshape layers (i.e., pooling layers). Hence, the features of the audio and visual streams are reshaped to an identical size and later fused via multiple fully-connected layers.
Following the bi-stream structure also, the same authors in [18] have made one significant modification regarding the audio-visual fusion part. In the early version [80], features respectively derived from audio and visual streams are simply fused via the widely-used feature concatenation operation. However, the concatenation-based fusion tends to misalign the audio and visual signals, resulting in the fused audio-visual features being inadequate for cross-modal retrieval. Thus, [18] has adopted the Euclidean distance-based fusion scheme to enforce the feature alignment process.
Also using the bi-stream framework, Cheng et al. in [81] have presented a novel fusion scheme, where deep features respectively derived from the audio stream and the visual stream are firstly combined by the newly designed "co-attention" operation, which is shown in Fig. 6 (c). The primary objective of this co-attention operation is twofold: 1) enhance audio-visual consistencies and 2) suppress inconsistencies. As shown in Fig. 6 (c), the outputs of the co-attention operations can be regarded as upgraded versions of the original inputs, i.e., A+ and V+, where all clearly unsynchronized information can be effectively excluded. In addition, the exact implementation of co-attention could be either the widely-used spatial attention [82] or the transformer [83].
To further promote SOTA performance, the existing learning strategies (e.g., contrastive learning [21]) can be used directly. Morgado et al. in [84] have applied contrastive learning to the AVC task, whose core idea can be briefly summarized as increasing the inter-class distance and decreasing the intra-class distance. In the implementation, training instances belonging to the intra-class are audio and visual pairs whose semantical feature distances are below the given hard threshold, and the rest of the audio-visual pairs are the inter-class cases. There also exist some other similar works (e.g., [85]) which have adopted the existing learning strategies targeting better audio-visual feature embedding. W.r.t. the embedding aspect, we will provide a brief review in the next subsection.
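A distance-based correspondence decision of the kind described above can be sketched as follows: both embeddings are L2-normalized, their Euclidean distance is computed, and a small distance is read as "corresponding". The fixed `threshold` and the function names are assumptions made for illustration; in a trained network the mapping from distance to the binary prediction would be learned rather than hand-set.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so distances are comparable across modalities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / (norm + 1e-12) for x in vector]


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def predict_correspondence(visual_embedding, audio_embedding, threshold=0.5):
    """Return (is_corresponding, distance) for one audio-visual pair.

    threshold is a hypothetical hand-set value, not one reported in the literature.
    """
    distance = euclidean_distance(
        l2_normalize(visual_embedding), l2_normalize(audio_embedding)
    )
    return distance < threshold, distance
```

Because the decision depends only on the distance, the two streams are pushed to place matching audio and visual content close together in a shared space, which is what makes the features usable for cross-modal retrieval.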
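The "decrease intra-class distance, increase inter-class distance" idea can be illustrated with the classic margin-based pairwise contrastive loss sketched below. This generic form is swapped in for illustration only; it is not claimed to be the objective used in [84], and the `margin` value and function name are assumptions.

```python
def pairwise_contrastive_loss(distance, is_positive_pair, margin=1.0):
    """Margin-based contrastive loss for one audio-visual pair.

    Positive (intra-class) pairs are penalized by their squared distance,
    so they are pulled together. Negative (inter-class) pairs are penalized
    only while they are closer than the margin, so they are pushed apart.
    """
    if is_positive_pair:
        return distance ** 2
    return max(0.0, margin - distance) ** 2


# A matched pair that is already close incurs a small loss, a mismatched
# pair at the same distance incurs a larger one, and a mismatched pair
# beyond the margin incurs none.
close_positive = pairwise_contrastive_loss(0.2, True)
close_negative = pairwise_contrastive_loss(0.2, False)
far_negative = pairwise_contrastive_loss(1.5, False)
```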

C. Audio-Visual Feature Embedding
Recently, there have been several works that have focused on the audio-visual feature embedding [21], [22], [86].The main objective of audio-visual feature embedding is to obtain a generic feature representation, and thus in the spanned feature space, the embedded features can be informative and discriminative enough for specific applications, e.g., the multimodality image retrieval [87].
Tian et al. [88] have applied channel-wise attention to help selectively fuse audio and visual features.The motivation is very straightforward, which is based on an assumption, i.e., that either visual or audio features might benefit the subsequent classification task, and thus the one with a higher feature response should be considered more during the fusion.Following this rationale, the channel-wise attention has been applied to both audio and visual streams at the same time, then the exact modality-wise selection is achieved by performing a softmax.Note that, this channel-wise attention-based multimodality selective fusion has also been used in some existing VSOD approaches, e.g., the classic MGA [89].Recently, Gao et al. in [90] have adopted a distillation network to compute audio-visual features.A teacher network was initially trained on the visual domain only, where the video tags were used as the classification supervision.Then, a student audio-visual network was trained by taking the predictions from the teacher network as its supervision, and thus the learned intermediate features can achieve automatic alignment between audio and visual and finally obtain a strong audio-visual feature embedding.
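The channel-wise selective fusion described above can be sketched as follows: each modality's per-channel response is summarized by global average pooling, a softmax across the two modalities turns the responses into per-channel weights, and the fused feature is the weighted sum. Using the raw pooled response as the attention score is a simplifying assumption; in practice the scores would come from learned layers, and the names below are hypothetical.

```python
import math


def channel_responses(feature):
    """Global average pooling: one scalar per channel of a [C][N] feature."""
    return [sum(channel) / len(channel) for channel in feature]


def selective_fusion(visual, audio):
    """Fuse two [C][N] features with a per-channel softmax over modalities.

    Returns (fused_feature, weights), where weights[c] = (w_visual, w_audio).
    """
    fused, weights = [], []
    pairs = zip(visual, audio, channel_responses(visual), channel_responses(audio))
    for v_channel, a_channel, v_resp, a_resp in pairs:
        # Numerically stable two-way softmax over the modality responses.
        peak = max(v_resp, a_resp)
        exp_v, exp_a = math.exp(v_resp - peak), math.exp(a_resp - peak)
        w_v, w_a = exp_v / (exp_v + exp_a), exp_a / (exp_v + exp_a)
        fused.append([w_v * v + w_a * a for v, a in zip(v_channel, a_channel)])
        weights.append((w_v, w_a))
    return fused, weights
```

With this weighting, a channel in which the visual stream responds strongly is dominated by the visual feature, and vice versa, which mirrors the assumption that the modality with the higher response should contribute more.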
The AVC task can also be extended to tell if the current visual information is appropriate with the corresponding audio information.For example, it is inappropriate for a video frame to contain happy faces yet with a sad melody.To achieve this goal, Verma et al. in [91] have "weakly" divided the input audio-visual signals into three categories according to their intrinsic emotions, i.e., positive, neutral, and negative, Fig. 9: The demonstration of the SOL tasks.The objective of SOL is to locate the sounding object in the visual space, e.g., given a visual scene, the snake can be located with the snake hiss, while the person can be located with the human voice.
whose structure is almost identical to that of Fig. 6 (c).A similar solution can be found in [92], where the authors have adopted the video theme as an additional information source to boost the AVC performance.The "video theme" adopted in this paper is the manual video-level category tags.And the rationale of this work is to use the theme-based classification responses to eliminate instances whose audio-visual semantics are unsynchronized.
Some other representative applications include the face and audio matching (FAM), and this task's overview can be seen in Fig. 8.The methodology of the FAM task is quite similar to that of the person re-identification (ReID) [93], [94], [95], [96], while the major difference relies only on their feature modalities, where the ReID task only needs to consider the visual domain, while the FAM task needs to consider both audio and visual.Also, the key to succeeding in matching faces and voices heavily relies on the design of an appropriate audio-visual fusion.
Similar to the AVC task, Nagrani et al. [97] directly treated the FAM task as a binary classification. In this work, the face and voice features are obtained by respectively feeding the given face and voice to existing feature backbones. The audio-visual fusion is simply implemented by the widely-used concatenation operation, and the final classification is fulfilled by conventional fully-connected layers. As with the AVC task, existing learning strategies can also be directly applied to the FAM task and bring solid performance gains, e.g., triplet loss [98], [99] or contrastive loss [100]. Meanwhile, the FAM research field [101], [102] has also focused on feature embedding. Rather than performing binary classification on the matching problem, some other weakly-supervised classifications (e.g., identity, gender, and nationality [103]) on a single modality can also be used to implicitly obtain aligned face-voice deep features.

D. Sounding Object Localization (SOL)
Correlation Analysis-based SOL Approaches. The objective of SOL is to locate the sounding object in visual space, and the task overview can be seen in Fig. 9. The research of SOL has a long history, with the earliest work dating back to 1999 [104]. In this work, Hershey et al. [104] have explored the correlation between audio and video signals. The idea itself is straightforward: a spatial region containing a sounding object should have a large probability of exhibiting a strong correlation with the audio signal. After that, several works have adopted various correlation analysis methods for the SOL task, and we shall briefly review them.
Izadinia et al. [105] have applied canonical correlation analysis (CCA) [106] to identify the moving objects which are heavily correlated with the audio signal, and similar attempts can be found in [107], [108]. Besides, several existing works [109], [110] have considered mutual information as an alternative, whose rationales are very similar to those of the CCA-based ones. In a word, the correlation analysis-based approaches are usually hand-crafted ones, which can only perform well when the visual information has strong consistency with its audio counterpart.
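The shared rationale of these correlation-based approaches can be sketched as follows. This is a toy illustration that uses a plain per-pixel Pearson correlation as a stand-in for the CCA or mutual-information tools used in the cited works; the input shapes and the synthetic clip are assumptions made for the example.

```python
import numpy as np

def correlation_map(frames, audio_energy):
    """Per-pixel Pearson correlation between a visual signal and the audio.

    frames: (T, H, W) array, e.g. per-pixel motion magnitude over T frames.
    audio_energy: (T,) array, e.g. short-time audio energy per frame.
    """
    v = frames - frames.mean(axis=0, keepdims=True)
    a = audio_energy - audio_energy.mean()
    cov = np.tensordot(a, v, axes=(0, 0)) / len(a)   # (H, W) covariance
    denom = frames.std(axis=0) * audio_energy.std() + 1e-8
    return cov / denom

# Toy clip: only pixel (2, 3) changes in sync with the audio.
rng = np.random.default_rng(0)
T, H, W = 200, 5, 6
audio_energy = rng.random(T)
frames = rng.random((T, H, W))
frames[:, 2, 3] = audio_energy + 0.05 * rng.random(T)

corr = correlation_map(frames, audio_energy)
peak = np.unravel_index(np.argmax(corr), corr.shape)  # location of the sounding pixel
```

When the audio is unrelated to every visible region, all entries of `corr` stay near zero, which is exactly the failure case noted above for inconsistent audio-visual content.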
Different from the correlation analysis-based approaches, which are mainly interested in the SOL task and are designed mainly for videos with plain audio signals, there also exist several works [111], [112] which have investigated the stereo cases, i.e., videos with a stereo audio signal. The key idea of this branch of work is very simple: the sounding object's spatial location can be coarsely determined by analyzing the difference between the individual soundtracks. Taking the dual soundtrack as an example, the audio signal of a sounding object should first arrive at the left microphone if the sounding object is located on the left. Theoretically, this type of approach can achieve the best SOL performance. However, the requirement of a stereo audio signal inevitably narrows its range of applications.
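The arrival-order cue in the dual-soundtrack example can be estimated with a simple cross-correlation, as in the sketch below. It is an illustrative simplification rather than the method of [111] or [112]; the function name and the synthetic delay of 5 samples are assumptions.

```python
import numpy as np

def arrival_lag(left, right):
    """Lag (in samples) of the right channel relative to the left channel.

    A positive lag means the sound reached the left microphone first,
    i.e. the source is more likely located on the left.
    """
    xcorr = np.correlate(right, left, mode="full")
    # Index 0 of the 'full' output corresponds to lag -(len(left) - 1).
    return int(np.argmax(xcorr)) - (len(left) - 1)

rng = np.random.default_rng(0)
source = rng.normal(size=256)
delay = 5                                                    # assumed inter-microphone delay
left = source
right = np.concatenate([np.zeros(delay), source[:-delay]])  # delayed copy of the source

lag = arrival_lag(left, right)
side = "left" if lag > 0 else "right" if lag < 0 else "center"
```

Only a coarse left/right decision follows from the sign of the lag; mapping it to an image location requires the microphone geometry, which is not modeled here.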
Deep Learning-based SOL Approaches. Recent SOL-related works are all based on deep learning [113], [114], [115], [116], whose key idea is to perform audio and visual feature embedding. Most of them can be roughly divided into two groups: 1) the class activation mapping (CAM)-based ones and 2) the feature similarity-based ones.
The CAM-based approaches [108], [117], [19], [20] usually adopt a conventional classification network, e.g., one for image scene classification. Outwardly, the primary objective of their network training is to achieve good classification performance. Actually, the real purpose is to utilize the classification task to formulate the audio-visual feature embedding. Because the sounding objects' audio signal can significantly contribute to the classification task, we can infer that image regions with strong audio-visual feature responses tend to comprise the sounding object. Following this rationale, the CAM-based SOL methods can directly utilize the feature response map provided at the fusion module's last layer as the SOL result. Since the CAM's computation is fully automatic, the nature of the CAM-based SOL methods is implicit.
The feature similarity-based SOL methods [80], [18], [118] are slightly different from the CAM-based ones. Instead of the implicit manner, this branch of work has adopted an explicit way. That is, after the classifier training, two separate deep feature representations can be derived from the bottom layers of the feature backbones (e.g., Vgg and VggSound), i.e., a deep visual feature (a 3-dimensional tensor) and a deep audio feature (a 1-dimensional vector). Then, because the pixels belonging to the sounding object tend to have a strong audio-visual correlation, a pixel-wise audio-visual feature similarity (e.g., the widely-used Euclidean distance or cosine similarity) can be applied to locate the sounding pixels. To facilitate a better understanding, we have provided a pictorial demonstration in Fig. 10.
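The pixel-wise similarity step can be sketched in a few lines. The sketch assumes the visual tensor has shape (C, H, W) and the audio vector has length C, and it uses random arrays in place of real backbone features; the function name is hypothetical.

```python
import numpy as np

def sounding_map(visual_feat, audio_feat, eps=1e-8):
    """Cosine similarity between the audio vector and every spatial location.

    visual_feat: (C, H, W) deep visual feature.
    audio_feat:  (C,) deep audio feature.
    Returns an (H, W) map with values in [-1, 1].
    """
    v = visual_feat / (np.linalg.norm(visual_feat, axis=0, keepdims=True) + eps)
    a = audio_feat / (np.linalg.norm(audio_feat) + eps)
    return np.einsum("chw,c->hw", v, a)

rng = np.random.default_rng(0)
C, H, W = 16, 4, 5
audio_feat = rng.normal(size=C)
visual_feat = rng.normal(size=(C, H, W))
visual_feat[:, 1, 2] = 3.0 * audio_feat   # one location aligned with the audio

sim_map = sounding_map(visual_feat, audio_feat)
loc = np.unravel_index(np.argmax(sim_map), sim_map.shape)
```

Thresholding or upsampling `sim_map` to the frame resolution would then give the localization result; both steps are omitted here.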

A. Audio Saliency Detection (ASD)
Different from the 2-dimensional visual signal, audio signals can only carry temporal information representing the signals' variations over time. Thus, it is very difficult to build a clear spatial alignment from audio to visual. By considering audio solely, saliency detection can still be performed, a.k.a. audio saliency detection or salient event detection. There exist multiple works [119], [120], most of which are non-deep learning-based ones, and we shall briefly review them here.
Following the rationale proposed in Itti's classic work [121], i.e., salient regions should exhibit high contrast to their surroundings, Kayser et al. [122] have investigated the audio saliency detection task. In this work, the authors have adopted multiple filters to measure the audio signal's changing tendency, i.e., the first derivative of intensity and frequency over time. Because, within a short time span, salient audio fragments usually differ greatly from the rest, their temporal changing tendency can be very effective in evaluating saliency. Following a similar idea, Schauerte et al. [123] have adopted the KL-divergence between two audio fragments' 2D spectral histograms. Compared with the previous work [122], which could be regarded as a "local" audio saliency approach, this work is clearly a non-local one. Also, based on the 2D spectral histogram, Tsuchida et al. [124] have proposed a novel non-local signal feature representation method. For each cell in the 2D spectral histogram, the authors have used principal component analysis (PCA) to extract the non-local feature. Based on these features, audio saliency can be obtained by performing contrast computation over the newly devised feature subspace.
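A minimal sketch of the "local" idea, i.e., scoring each audio frame by how abruptly its intensity changes, is given below. It is a deliberate simplification and not the filter bank of [122]: only short-time energy is used, frequency is ignored, and the frame length and synthetic signal are assumptions.

```python
import numpy as np

def local_audio_saliency(signal, frame_len=100):
    """Score each frame by the absolute first difference of its energy."""
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).mean(axis=1)                  # short-time energy
    change = np.abs(np.diff(energy, prepend=energy[0]))  # temporal derivative
    return change / (change.max() + 1e-8)                # normalize to [0, 1]

rng = np.random.default_rng(0)
signal = 0.01 * rng.normal(size=2000)       # quiet background
signal[1000:1200] += rng.normal(size=200)   # loud burst covering frames 10 and 11

saliency = local_audio_saliency(signal)
most_salient = int(np.argmax(saliency))     # onset (10) or offset (12) of the burst
```

A non-local variant in the spirit of [123] would instead compare each frame's spectral histogram against those of other fragments, e.g., with a KL-divergence.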
Also, the audio signal's amplitude and frequency are widely-used computational units for salient event detection [125]. Zlatintsi et al. [126] have converted both audio amplitude and frequency to a 3D feature via the Teager energy [127]. This work has made a very strong assumption that people tend to be attracted by sudden loudness. Thus, the authors have directly considered the averaged audio amplitude, frequency, and the newly devised energy to measure the saliency degree. Beyond the amplitude and frequency-based representations, Merve et al. [128] have further devised several novel feature representations (e.g., envelope, bandwidth, rate, and pitch features) for the audio signal over time. Finally, this work follows the conventional common thread, i.e., contrast computation, for each of the newly devised features to obtain multiple low-level audio saliency cues, and the final saliency is achieved by combining all of them via simple linear fusion.
There also exists a research branch focusing on the relationship between text information and audio saliency [129], [130]. Different from the above-mentioned audio saliency methods, which consider the audio signal only, this branch has additionally considered text information. Zlatintsi et al. [131] have fused text information and the audio signal before computing the audio saliency, where the key rationale of the adopted fusion is to compute the feature similarity (e.g., mutual information) between text and audio. Another representative work is [132], which has adopted a non-negative matrix factorization model to measure the consistency between text and audio.
In a word, the advances in audio saliency detection are relatively slow: the widely-used methodologies are still limited to conventional hand-crafted ones, and deep learning-related research is quite rare. Considering the importance of audio saliency, this field deserves intense research attention in the near future.

B. Audio-visual Saliency Detection (AVSD)
As the main topic of this review, we shall give a more detailed introduction and discussion of the SOTA audio-visual saliency detection approaches. However, to the best of our knowledge, this topic is still in its infancy, and only several deep learning-based works exist. Thus, we decide to take the exact audio-visual fusion scheme as the starting point, and some related works mentioned in the previous sections might be referenced here for a better understanding.
Hand-crafted Naive Fusions for AVSD. Most of the existing hand-crafted approaches [133], [134] follow the bi-stream structure, which is almost the same as that of the AVC task reviewed above. Given a video sequence, saliency detection over the audio and visual channels is computed first, and the audio-visual saliency is then derived by designing an appropriate fusion scheme. Clearly, any off-the-shelf audio/visual saliency detection method can be used directly, making the exact fusion scheme the key to the overall performance.
Many works [135], [136], [137], [138] have simply adopted the multiplicative-based fusion, because it can effectively enhance the consistency and suppress the inconsistency between audio and visual saliency-related features, given that truly salient regions tend to be salient in both the audio domain and the visual domain simultaneously. The limitation of the multiplicative-based fusion is also quite clear: it tends to get confused if there exist multiple audio and visual saliency features. For cases with multiple audio and visual features, Coutrot et al. [139] have adopted the linear fusion, where the fusion weights are computed via the classic expectation-maximization (EM) algorithm, a statistical method using training samples to estimate the relative importance of each feature, aiming to maximize the global likelihood of the mixture model. Further, Sidaty et al. [140] have conducted an extensive evaluation of different fusion schemes, including maximum, addition, average, multiplication, and non-linear combination-based fusion schemes. As expected, all the simple fusions are inferior to the non-linear fusion. In most cases, the audio saliency and the visual saliency contribute differently to the final audio-visual saliency, and the exact contribution degree is usually determined by the given video scene and content, yet these naive fusion schemes are not flexible enough and thus fail to achieve the optimal balance between audio and visual.
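The naive schemes compared above differ only in a single element-wise operation, as the following sketch shows. The function name, the fixed weight `w_audio`, and the toy maps are assumptions; in [139] the linear weights would be estimated with EM rather than fixed by hand.

```python
import numpy as np

def fuse(audio_sal, visual_sal, scheme="multiplication", w_audio=0.5):
    """Combine two saliency maps of equal shape with a naive fusion scheme."""
    if scheme == "multiplication":
        out = audio_sal * visual_sal
    elif scheme == "maximum":
        out = np.maximum(audio_sal, visual_sal)
    elif scheme == "average":
        out = 0.5 * (audio_sal + visual_sal)
    elif scheme == "linear":
        out = w_audio * audio_sal + (1.0 - w_audio) * visual_sal
    else:
        raise ValueError(f"unknown fusion scheme: {scheme}")
    return out / (out.max() + 1e-8)   # rescale so the peak is close to 1

# Two visually salient spots; only the one at (2, 2) is also salient in audio.
visual_sal = np.zeros((3, 3))
visual_sal[0, 0], visual_sal[2, 2] = 1.0, 0.9
audio_sal = np.full((3, 3), 0.1)
audio_sal[2, 2] = 1.0

fused_mul = fuse(audio_sal, visual_sal, "multiplication")
fused_max = fuse(audio_sal, visual_sal, "maximum")
```

Multiplication keeps only the spot supported by both modalities, whereas the maximum keeps both spots; neither can adapt its behavior to the scene content, which is the inflexibility discussed above.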
Correlation Analysis-based Fusions for AVSD. From the experimental perspective, Min et al. [141] have conducted extensive verifications of human eye fixations in conditions with and without audio signals. Their results indicate that audio signals can significantly influence human attention only if the salient object is visually non-salient yet salient in the audio channel; otherwise, the audio information is of no help. This work also suggests that an audio-visual saliency detection method should be biased more towards visual signals in most cases. Following the same rationale, Min et al. [142] have adopted the classic canonical correlation analysis (CCA) to localize spatial regions which demonstrate strong audio-visual consistency. Since the audio and visual saliency cues have been computed in advance, the fusion process mainly aims at highlighting the visual regions that correlate well with the audio. More recently, Min et al. [143] have further considered deep learning-based saliency cues, and the CCA has been replaced by its upgraded variant, the kernel canonical correlation analysis (KCCA), to measure the audio-visual correlation. The main reason is that the CCA can only capture linear relationships, while the KCCA is able to map features to higher-dimensional feature spaces and increase the nonlinearity, which could be more practical in the audio-visual saliency detection task.
A Brief Summary of Hand-crafted Fusions. To further explore the advantages and disadvantages of the existing fusion schemes, Tsiami et al. [144] have compared three widely-used audio-visual fusion schemes, i.e., direct fusion (the multiplicative-based fusion), linear correlation coefficient [145], and mutual information [146]. The authors have combined the existing visual saliency models with off-the-shelf audio saliency models by using each of these fusion schemes in turn, and the quantitative results have reached a clear conclusion: the optimal fusion scheme is determined by multiple factors, including both the quality of the low-level saliency cues and the input video data. For "raw" hand-crafted saliency cues computed by models which are good at measuring saliency over time, the correlation coefficient could be the best choice, since it mainly considers the temporal consistency between audio and visual. For the case where the saliency cues have incorporated spatial information, mutual information could be the optimal choice. However, things change for "refined" saliency cues, i.e., saliency cues obtained via deep learning-based top-down models, where the direct fusion usually exhibits the best fusion performance, because the refined saliency cues are usually more trustworthy than the raw ones, and thus they can be directly used to complement their counterparts.
SOTA Deep Learning-based AVSD Methods. After entering the deep learning era, massive deep learning-based visual saliency models have been proposed. However, to the best of our knowledge, there only exist five deep learning-based audio-visual saliency detection models [147], [148], [149], [150], [151]. Here we shall provide a detailed review of these works. For a better understanding, we have provided multiple method pipelines to clarify the audio-visual fusion methodology of these SOTA deep learning-based audio-visual saliency detection works. As can be seen in Fig. 11, the three sub-figures respectively correspond to the SOTA models mentioned above: sub-figure (a) [147], [151], sub-figure (b) [150], and sub-figure (c) [149], [148].
We shall first introduce [147], [151]. As shown in Fig. 11 (a), the audio-visual fusion adopts the conventional plain concatenation operation, which takes both audio and visual feature tensors as input, and the saliency predictions are obtained via a typical decoder after concatenating the audio and visual tensors. Specifically, because the audio modality has a completely different form from the visual modality, it is required to ensure that the audio feature tensor has the same size as its visual counterpart. Clearly, the overall rationale of these works is very straightforward, and the concatenation-based fusion could be replaced by other existing ones, e.g., direct fusions and correlation analysis tools, which have been widely adopted by the AVC task reviewed in Sec. III-B.
As illustrated in Fig. 11 (b), the spatial alignment-based audio-visual fusion can bias the fusion toward the visual part, where the deep audio feature, which usually is a 1-dimensional vector with the same size as the visual tensor's channel number, is either de-convolved or copied to correlate with each spatial location. This implementation treats the audio as auxiliary information, where the embedded semantic consistency is the key factor in highlighting the corresponding spatial regions as the salient ones. The "copy" scheme has also been widely used by the AVC task. However, to the best of our knowledge, [150] is the first attempt to use de-convolution for the audio-visual alignment. Also, either the copy or the de-convolution-based alignment can be combined with the popular "residual" operation to focus the fusion process on the visual signal, because, in most cases, the visual signal is stronger than the audio signal in determining human attention.
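The "copy" alignment combined with a residual connection can be sketched as below. This only illustrates the tensor shapes involved and is not the architecture of [150]; in particular, the element-wise modulation is an assumed choice of how the copied audio interacts with the visual tensor.

```python
import numpy as np

def copy_align_fuse(visual_feat, audio_feat):
    """Spatial alignment by copying the audio vector to every location.

    visual_feat: (C, H, W) visual tensor.
    audio_feat:  (C,) audio vector, one value per visual channel.
    """
    C, H, W = visual_feat.shape
    audio_map = np.broadcast_to(audio_feat[:, None, None], (C, H, W))  # the "copy" step
    modulated = visual_feat * audio_map      # assumed interaction: channel-wise modulation
    return visual_feat + modulated           # residual: the visual signal is always kept

rng = np.random.default_rng(0)
visual_feat = rng.normal(size=(8, 4, 4))
audio_feat = rng.normal(size=8)

fused = copy_align_fuse(visual_feat, audio_feat)
visual_only = copy_align_fuse(visual_feat, np.zeros(8))  # silent or uninformative audio
```

With a zero audio vector the output equals the visual tensor, which shows how the residual path keeps the fusion anchored on the visual signal.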
Lastly, as demonstrated in Fig. 11 (c), we introduce the bilinear audio-visual fusion, which has been adopted by [148], [149] and has achieved the leading SOTA performance. Compared with either the plain or the spatial alignment-based fusion, the bilinear fusion has one significant advantage: it doesn't require the individual audio and visual saliency cues to have an identical dimension size, since a dimension transformation matrix, i.e., the M in sub-figure (c), is adopted to handle the dimension mismatch problem. However, the bilinear fusion also has its own limitation: the semantic correspondence between the audio and visual channels is destroyed, making the modeling of complex audio-visual interactions very difficult. In sharp contrast, the spatial alignment-based fusion can make full use of the semantic information provided by either an off-the-shelf visual (e.g., ResNet50) or audio (e.g., VggSound) feature backbone, where the learned semantic information can shrink the problem domain effectively. As a result, a complementary audio-visual fusion status can be easily reached even in a complex audio-visual environment.
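One common form of bilinear fusion is sketched below to show why mismatched dimensions are not a problem. The tensor `M` here is a hypothetical stand-in for the transformation matrix M of Fig. 11 (c), with one (Da, Dv) slice per output channel; the exact parameterization used in [148], [149] may differ.

```python
import numpy as np

def bilinear_fuse(audio_feat, visual_feat, M):
    """Bilinear interaction a^T M_k v evaluated at every spatial location.

    audio_feat:  (Da,) audio vector.
    visual_feat: (Dv, H, W) visual tensor; Da and Dv may differ.
    M:           (K, Da, Dv) learnable weights (random here), K output channels.
    """
    return np.einsum("a,kab,bhw->khw", audio_feat, M, visual_feat)

rng = np.random.default_rng(0)
Da, Dv, K, H, W = 6, 10, 4, 3, 3        # audio and visual sizes intentionally differ
audio_feat = rng.normal(size=Da)
visual_feat = rng.normal(size=(Dv, H, W))
M = rng.normal(size=(K, Da, Dv))

fused = bilinear_fuse(audio_feat, visual_feat, M)            # (K, H, W)
manual = audio_feat @ M[0] @ visual_feat[:, 0, 0]            # same value, one location
```

Because every audio dimension is mixed with every visual dimension through `M`, no individual output channel keeps a one-to-one link with a backbone channel, which relates to the loss of semantic correspondence noted above.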

A. Preliminary
Existing audio-visual saliency detection (AVSD) works mainly adopt a bi-stream network architecture, where audio saliency and visual saliency are computed individually and combined later as the final output. In fact, when the audio signal is not consistent with the visual signal, the audio saliency cannot complement the visual saliency at all, and such inconsistent cases account for about 60% of all cases. For example, in an image, two persons are talking while the background music comes from the outside, and, in this case, the audio signal cannot benefit the visual in determining saliency.
Inspired by previous multimedia related works [23], [24], [25], we propose to introduce the "audio-visual consistency (AVC)" into our saliency detection research field. The major highlight of our approach is its generic usage, which is capable of upgrading any SOTA bi-stream-based AVSD model from "AVC-unaware" to "AVC-aware". In fact, the optimal audio-visual fusion is very difficult to achieve if the adopted AVSD model is AVC-unaware, because the model is blind to the consistency and thus cannot omit the audio when the audio does not correspond to the visual. Thus, when facing unhelpful audio signals, an AVSD model taking both audio and visual is clearly inferior to a model using the visual solely, yet this "binary switch" cannot be achieved if the model is AVC-unaware.

Fig. 12: Detailed demonstration of our AVC annotation. In the selected video clip, when two men are talking, the sounds of the 1st and 5th seconds are background music, whose AVC labels are set to 0, meaning that the audio and the visual are semantically mismatched. In contrast, from the 2nd to the 4th seconds, the audio signals are talking, fetching, and getting-up sounds respectively, and thus we label them as 1 because these audio-visual fragments are clearly matched.
An intuitive way to convert an AVC-unaware AVSD model to an AVC-aware one is to resort to an additional module, i.e., a classifier, which can automatically predict whether the currently given audio is consistent with the visual. Therefore, to fully realize our idea of making any existing bi-stream AVSD model AVC-aware, two things should be prepared in advance: 1) train the aforementioned classifier, and 2) integrate the classifier into the AVSD model. Next, we shall detail each of them in the following subsections.

B. Audio-visual Consistency Labeling
The AVC classifier can of course be trained via the above-mentioned weakly-supervised method, which has been shown in Fig. 6 (b). However, the overall performance of this method is usually too limited to benefit the saliency detection task. Thus, we propose to utilize the fully-supervised method to train the audio-visual consistency classifier.
To achieve this goal, we manually equip each video frame with AVC labels. That is, we manually provide all the existing benchmark AVSD datasets with binary AVC labels, so that each audio-visual fragment is assigned a label of 1 or 0. Suppose all existing training instances (with N frames) can be represented as $\{A_i, V_i, Ls_i\}$, where $i \le N$, $A$ and $V$ respectively denote the audio and visual, and $Ls$ is the corresponding fixation map. During the annotation process, if the sound is made by the salient object, we regard the audio and visual as consistent and assign the AVC label as 1. Otherwise, if the audio is unseen background music or off-screen sound, the AVC label of this audio-visual fragment is set to 0. For a better understanding, a pictorial demonstration of the proposed AVC labeling process can be found in Fig. 12.
After the annotation process, each training instance can be converted to:

$$\{A_i, V_i, Ls_i, Lc_i\}, \quad i \le N,$$

where $Lc$ denotes the newly annotated audio-visual consistency label, and $N$ denotes the total frame number. We have newly annotated all publicly available AVSD benchmarks, 5 sets in total (or 6 if the Coutrot set is divided into the Coutrot1 set and the Coutrot2 set), consisting of 241 video clips involving 300,000 frames. These newly annotated datasets are now publicly available.

C. The Proposed AVC-aware AVSD Model
The conventional audio-visual saliency detection (AVSD) training and testing protocol has been shown in Fig. 13 (a), where the AVSD model is a typical bi-stream fusion net, which combines its AVSD stream and visual saliency detection (VSD) stream to formulate the final result. Clearly, the VSD stream is the main stream, and the AVSD stream is the auxiliary stream, where audio and visual are fused early via the fusion schemes shown in Fig. 11 to further promote the VSD stream. As we have mentioned before, this typical AVSD training and testing protocol is completely AVC-unaware, so the later fusion (i.e., fusing VSD with AVSD) could even degrade the overall performance when the given audio and visual are mismatched.
To handle the above-mentioned problem, we propose the AVC-aware training and testing protocol, which has been shown in Fig. 13 (b). Its major difference from (a) is the newly provided AVC classifier, which can be trained by using the newly equipped AVC labels. In our implementation, we simply use a classifier structure identical to AVID [84] to automatically predict the AVC degree of the current input audio-visual fragment, outputting 0 or 1. Notice that other classifier structures can also be used, and we have tested several others, where the quantitative result (Table V) suggests that AVID is the best choice.
As shown in Fig. 13 (b), the newly proposed AVSD model can be trained in the typical end-to-end way, where the AVC classifier serves the existing SOTA bi-stream AVSD model, i.e., Fig. 13 (b), as a "binary switch" to control the INPUT of the adopted SOTA AVSD model. In other words, the output of the AVC classifier determines whether only the single V flow or both the V and AV flows are to be used in the subsequent SOTA AVSD model. That is, when the output of the AVC classifier is 0, the current audio is inconsistent with the current visual, suggesting removing the AV flow from the subsequent model:

$$\hat{L}c = \mathrm{AVC}_{cls}(A, V),$$

$$\mathrm{OUTPUT} \leftarrow \hat{L}c \cdot \mathrm{Fuse}(AV, V) + (1 - \hat{L}c) \cdot V, \tag{3}$$

where $\mathrm{AVC}_{cls}$ represents the AVC classifier, and $\hat{L}c$ is its binary prediction regarding the AVC of the current input V and AV.
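The binary switch described above can be sketched as follows. The sketch assumes that a label of 0 falls back to the visual-only flow V, as the text states, and it uses a simple average as a placeholder for Fuse, which in practice is the fusion of the wrapped AVSD model; all names are hypothetical.

```python
import numpy as np

def avc_aware_output(v_pred, av_pred, avc_label, fuse=lambda av, v: 0.5 * (av + v)):
    """Select the data flow according to the binary AVC label.

    v_pred:    saliency from the visual-only (V) flow.
    av_pred:   saliency from the audio-visual (AV) flow.
    avc_label: 1 if audio and visual are consistent, 0 otherwise.
    fuse:      placeholder for the model's own Fuse(AV, V); averaging is an assumption.
    """
    return avc_label * fuse(av_pred, v_pred) + (1 - avc_label) * v_pred

v_pred = np.array([[0.2, 0.8]])
av_pred = np.array([[0.6, 0.4]])

out_inconsistent = avc_aware_output(v_pred, av_pred, avc_label=0)  # V flow only
out_consistent = avc_aware_output(v_pred, av_pred, avc_label=1)    # fused AV and V flows
```

Passing the classifier's prediction as `avc_label` gives the AVC-aware model, while passing the annotated label gives the idealized upper bound of the same model.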
The training process of our AVC-aware AVSD model consists of two tasks, i.e., 1) the conventional audio-visual saliency detection task, which takes the saliency labels (Ls) as GT, and 2) the newly added AVC classifier training, which takes the AVC labels (Lc) as GT. Thus, the overall loss function $L_{all}$ combines two terms, $L_{cls}$ and $L_{sal}$, balanced by a factor $\rho$, where $L_{cls}$ is a typical cross-entropy loss for training the AVC classifier, $L_{sal}$ is the Kullback-Leibler (KL) divergence loss, the most widely-used loss function in AVSD model training, and $\rho$ is empirically set to 0.5. In the testing phase, the exact data flows are dynamically controlled by the AVC classifier in an identical way to the training phase. In brief, the major highlight of our approach is its generic design, which can serve any existing bi-stream SOTA AVSD model as a plug-in and improve its performance consistently. Though a fancier network design could bring additional performance gain, we shall leave it to future work to stay focused on our main topic.
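A possible form of this two-term objective is sketched below. The source text does not make clear which term the balancing factor multiplies, so weighting the classifier term by `rho` is an assumption, as are the function names and the small `eps` used for numerical stability.

```python
import numpy as np

def kl_div(sal_pred, sal_gt, eps=1e-8):
    """KL divergence between two saliency maps treated as distributions."""
    p = sal_pred / (sal_pred.sum() + eps)
    g = sal_gt / (sal_gt.sum() + eps)
    return float(np.sum(g * np.log(eps + g / (p + eps))))

def binary_cross_entropy(prob, label, eps=1e-8):
    """Cross-entropy for the binary AVC prediction (prob = P(consistent))."""
    return float(-(label * np.log(prob + eps) + (1 - label) * np.log(1 - prob + eps)))

def total_loss(sal_pred, sal_gt, avc_prob, avc_label, rho=0.5):
    # Assumption: rho scales the classifier term; the original weighting is not recoverable.
    return kl_div(sal_pred, sal_gt) + rho * binary_cross_entropy(avc_prob, avc_label)

sal_gt = np.array([[0.0, 1.0], [1.0, 2.0]])
sal_bad = np.array([[2.0, 1.0], [1.0, 0.1]])

loss_good = total_loss(sal_gt, sal_gt, avc_prob=0.999999, avc_label=1)
loss_bad = total_loss(sal_bad, sal_gt, avc_prob=0.1, avc_label=1)
```

Both terms vanish for a perfect saliency map and a confident, correct AVC prediction, and each grows independently as its own prediction degrades.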

A. Datasets
There exist six publicly available datasets in our AVSD research field, including DIEM [152], AVAD [153], Coutrot1 [154], Coutrot2 [155], SumMe [156], and ETMD [157]. Different from the conventional VSD sets, the eye fixations in these six sets are collected in an audio-visual environment, while, in the VSD sets, the eye fixations are simply collected without using any audio information. We briefly introduce these six sets here, and more details can be found via the links in Table III. Some qualitative demonstrations can be found in Fig. 14.
The DIEM set consists of 84 film clips, covering 26 films, including commercials, documentaries, game trailers, movie trailers, music videos, and news clips. The video scenes in this set are generally complex with strong background interference.
The AVAD set aims at exploring the effects of highly correlated audio and motion on eye movements. The authors of this set tested human eye fixations on 45 video sequences, which are 5 to 10-second video clips containing various scenes, e.g., instrument playing, dancing, and dialogue.
The Coutrot set includes the Coutrot1 and Coutrot2 subsets. The dynamic natural scenes in the Coutrot1 set are divided into 4 visual categories: single moving object, multiple moving objects, natural landscapes, and human faces. The Coutrot2 set's scenes are all conversations, and it can be found that the fixations are most likely to be located on the speaker's face.

TABLE IV: Quantitative comparisons between our method and other fully-/weakly-/un-supervised methods on all 6 datasets. The best result is marked in bold font. * means that the target models (e.g., STANet*, STAViS*, and AVINet*) are trained by the whole pipeline in Fig. 13 with the AVC classifier; # denotes that the target models (i.e., STANet#, STAViS#, and AVINet#) are trained by removing the AVC classifier, and their OUTPUTs are manually reformulated by using Lc (Eq. 3) as the indicator, standing for the ideal cases.
The SumMe set contains 25 unstructured videos collected from videos taken by users, whose lengths range from 1 minute to 6 minutes. Since all videos in this set are homemade ones, the corresponding background sounds tend to be very noisy, and most of them are irrelevant to the salient objects, making the audio-visual fusion process very challenging.
The ETMD set contains 12 videos, which are all collected from 6 existing Hollywood movies. Each video in this set ranges from 3 to 3.5 minutes, and its contents mainly consist of action scenes and dialogues.

B. Evaluation Metrics
There are five quantitative metrics in total that have been widely used in the saliency detection field. Since the objective of measuring the saliency detection performance in an audio-visual environment is almost the same as in the conventional saliency detection field, all five metrics can be directly used here, and we shall briefly introduce them. These metrics include AUC-Judd (AUC-J), similarity metric (SIM), shuffled AUC (s-AUC), normalized scanpath saliency (NSS), and linear correlation coefficient (CC).
CC measures the linear correlation between the predicted saliency (S) and the ground truth (GT), which can be formulated as:

$$\mathrm{CC}(S, GT) = \frac{\mathrm{cov}(S, GT)}{\sigma(S) \cdot \sigma(GT)},$$

where cov denotes the covariance, and σ is the standard deviation. SIM measures the similarity between two distributions. Given S and GT as input, SIM first normalizes them respectively, then sums the minimum values pixel-by-pixel (denoted by i). This process can be detailed as:

$$\mathrm{SIM}(S, GT) = \sum_i \min\big(Z(S)_i, Z(GT)_i\big),$$

where Z and min respectively denote the normalization operation and the minimum operation. AUC measures the area under the receiver operating characteristic (ROC) curve, which has been widely used in saliency evaluation. Given the ground-truth fixations, the fixated points are regarded as the positive set, and the others are regarded as the negative set. Then, the computed saliency map is binarized into salient regions and non-salient regions by using a hard threshold. The AUC-Judd (AUC-J) computes two items: 1) the true positive rate, as the proportion of saliency map values above a threshold at fixated pixels, and 2) the false positive rate, as the proportion of saliency map values above a threshold at non-fixated pixels. The s-AUC samples the negatives from the fixated locations of other images/frames. This sampling scheme can be greatly influenced by center bias and border cuts.
The NSS is designed to evaluate a saliency map over fixation locations. Given a saliency map S and a binary fixation map GT, NSS is defined as:

$$\mathrm{NSS}(S, GT) = \frac{1}{\sum_i GT_i} \sum_i \frac{S_i - \mu}{\sigma} \cdot GT_i,$$

where µ and σ are the mean and standard deviation of the predicted saliency map. This metric is calculated by taking the mean of the scores assigned by the unit normalized saliency map (with zero mean and unit standard deviation) at human eye fixations.
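The three closed-form metrics (CC, SIM, and NSS) can be computed directly from their definitions, as in the minimal sketch below. It assumes that the normalization Z in SIM rescales each map to sum to one; benchmark toolkits may apply extra preprocessing, such as resizing or min-max scaling, that is not shown here.

```python
import numpy as np

def cc(sal, gt, eps=1e-8):
    """Linear correlation coefficient between two saliency maps."""
    s = (sal - sal.mean()) / (sal.std() + eps)
    g = (gt - gt.mean()) / (gt.std() + eps)
    return float((s * g).mean())

def sim(sal, gt, eps=1e-8):
    """Similarity: sum of pixel-wise minima of two sum-normalized maps."""
    s = sal / (sal.sum() + eps)
    g = gt / (gt.sum() + eps)
    return float(np.minimum(s, g).sum())

def nss(sal, fixations, eps=1e-8):
    """Mean of the standardized saliency values at fixated pixels."""
    s = (sal - sal.mean()) / (sal.std() + eps)
    return float(s[fixations > 0].mean())

sal = np.array([[0.1, 0.2], [0.3, 0.9]])
fixations = np.array([[0, 0], [0, 1]])      # a single fixation on the brightest pixel

cc_self = cc(sal, sal)
sim_self = sim(sal, sal)
nss_score = nss(sal, fixations)
```

A map compared with itself reaches the maximum of CC and SIM (both 1), while NSS is positive whenever the fixated pixels receive above-average saliency.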

C. Quantitative Evidences towards the Effectiveness of the proposed AVC Classifier
As we have mentioned before, our approach is generic and is compatible with almost all existing bi-stream SOTA AVSD models. With a few lines of code modification, the proposed AVC classifier can be integrated into the target model. To verify this, we have deployed our AVC classifier into 3 top-tier SOTA AVSD models, including STANet [150], STAViS [148], and AVINet [149]. In fact, we would have incorporated our AVC classifier into more SOTA models, yet, in the AVSD research field, most of the existing papers haven't released their codes. Also, w.r.t. the model training, we follow the widely-used training/testing split [148] over all 6 datasets. To demonstrate the superiority of our approach, we have compared the upgraded versions of the three target models (denoted by *) with 12 other SOTA methods, including 4 unsupervised methods, 4 weakly-supervised methods, and 4 fully-supervised methods. For a fair comparison, we use either the code implementations with default parameter settings or the saliency maps provided by the authors. For methods without codes, we simply refer to the numeric results reported in the papers.
As shown in Table IV, all three upgraded target models (denoted by * and highlighted in PINK) achieve consistent performance improvements. For example, our method improves STANet, STAViS, and AVINet by an average of 1.9%, 1.5%, and 2.7%, respectively, in terms of the AUC-J metric on the six widely-used benchmark datasets. Also, the promoted model STANet* outperforms all weakly-supervised methods significantly, and AVINet* performs the best among all fully-supervised methods. The main reason is that the AVSD models equipped with the newly proposed AVC classifier can filter out the unrelated audio-visual pairs, so that the side effects of those mismatched audio-visual fragments can be avoided.
To further investigate the importance of our key idea, i.e., that audio-visual consistency really matters when performing AVSD, we have additionally removed the proposed AVC classifier from the upgraded target AVSD models. Instead, we directly use the original versions, yet their outputs are "manually reformulated" according to our newly provided AVC labels (see Fig. 13). That is, the target model's output is derived directly from either AV or V, and this process is formulated as Eq. 7, where all symbols are identical to those in Eq. 3; the major difference is that the Lc predicted by the AVC classifier has been replaced by its ground-truth counterpart taken from the AVC labels. Actually, the OUTPUT from Eq. 7 represents the ideal situation, which tends to persistently outperform that of the upgraded version powered by the AVC classifier (i.e., Eq. 3). The reason is clear: erroneous binary predictions cannot be completely avoided by our AVC classifier. The quantitative results of these ideal versions are marked by # with a BLUE background color, and the detailed results can be found in Table IV.
Further, as mentioned above, the classification accuracy of the AVC classifier slightly affects the AVSD performance. Thus, we have tested three AVC classifiers to verify this issue, i.e., L3Net [80], AVENet [166], and AVID [84]. Among them, AVID is our default setting, and the other two classifiers can simply be used to replace AVID in our method, as shown in Fig. 13 (b). That is, in each experiment, we only replace the target AVSD model's AVC classifier with either L3Net, AVENet, or AVID. The experimental results are shown in Table V.
Actually, the influence of the classification result depends on the number of corresponding audio-visual pairs: the more corresponding audio-visual pairs there are, the better the target models perform; otherwise, the target models degenerate into their original versions. According to the results, the AVID-based AVC classifier achieves the best accuracy (i.e., 87.59%), and thus, as expected, the corresponding AVSD performance outperforms the others. In short, the higher the accuracy of the AVC classifier, the better the performance of the target models.
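The AVC-controlled data flow evaluated in this subsection, in both its classifier-driven form and its label-driven (ideal) form, can be sketched roughly as follows. This is only an illustration under stated assumptions: the function `avc_gated_saliency` and its parameters are hypothetical names, the two branches are treated as opaque callables, and the sketch covers inference only; it does not reproduce Eq. 3 or Eq. 7.

```python
from typing import Callable, List, Optional, Sequence, TypeVar

Frame = TypeVar("Frame")
Audio = TypeVar("Audio")
Saliency = TypeVar("Saliency")


def avc_gated_saliency(
    frames: Sequence[Frame],
    audios: Sequence[Audio],
    av_branch: Callable[[Frame, Audio], Saliency],
    v_branch: Callable[[Frame], Saliency],
    avc_classifier: Optional[Callable[[Frame, Audio], int]] = None,
    avc_labels: Optional[Sequence[int]] = None,
) -> List[Saliency]:
    """Route each frame to the audio-visual (AV) or visual-only (V) branch.

    Pass `avc_classifier` for the classifier-driven setting, or `avc_labels`
    (frame-wise ground-truth AVC labels) for the ideal setting.
    """
    if (avc_classifier is None) == (avc_labels is None):
        raise ValueError("provide exactly one of avc_classifier or avc_labels")
    if len(frames) != len(audios):
        raise ValueError("frames and audios must have the same length")
    if avc_labels is not None and len(avc_labels) != len(frames):
        raise ValueError("avc_labels must have one entry per frame")

    outputs: List[Saliency] = []
    for t, (frame, audio) in enumerate(zip(frames, audios)):
        # Binary consistency decision: 1 = audio matches the visual content.
        if avc_labels is not None:
            consistent = avc_labels[t]
        else:
            consistent = avc_classifier(frame, audio)
        if consistent == 1:
            outputs.append(av_branch(frame, audio))  # trust the audio cue
        else:
            outputs.append(v_branch(frame))  # fall back to visual-only
    return outputs
```

Swapping the classifier for the ground-truth labels changes only where the binary decision comes from, which is why the ideal setting isolates the effect of classification errors.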

VII. Conclusions and Future Work
In this paper, we present the first comprehensive review covering topics ranging from saliency detection to audio-visual fusion. Based on this extensive review, we have also provided a deep insight into the audio-visual saliency detection task, and reached our new claim about the importance of an AVSD model being audio-visual consistency aware (AVC-aware). We have also devised a generic method to convert the existing AVC-unaware SOTA AVSD models into AVC-aware ones, and the key is the newly proposed AVC classifier, which, as a plug-in, controls the data flow of the bi-stream target AVSD model to avoid side-effects caused by mismatched audio-visual training fragments. Specifically, to train the proposed AVC classifier, we have newly labeled all existing publicly available AVSD datasets, equipping them with AVC labels. Lastly, we have conducted extensive experiments to verify the effectiveness of our claim. We hope that this review will draw more research attention to the AVSD research field, and that the newly claimed AVC-aware issue will inspire future works on performance improvement.
Although audio-visual saliency detection has made notable progress over the past several decades, there is still significant room for improvement, i.e., current AVSD models can only obtain limited performance. Thus, in the near future, we are particularly interested in designing a more reasonable AVC classifier to improve the performance of audio-visual correspondence.

Fig. 4: Method pipeline of the long short-term memory (LSTM)-based approaches, which usually follow the single-stream methodology.

Fig. 6: (a) The widely-used network architecture for the audio-visual correspondence (AVC) task, which is a typical binary classification problem, where the key is how to align and fuse the audio and visual streams. (b) The widely-used AVC training data formulation. The synchronized video and audio pairs are set to positive, whereas the unsynchronized video and audio pairs are set to negative. (c) Co-attention-based audio and visual feature fusion, where the outputs of the co-attention operation can be regarded as upgraded versions, i.e., A+ denotes the upgraded audio features and V+ denotes the upgraded visual features, from which all clearly unsynchronized information can be effectively excluded.

Fig. 7: Audio feature computation details. First, the raw 1D audio signal is transformed into a 2D spectrogram by the fast Fourier transform (FFT), so that the existing popular backbones (e.g., VggNet [77] or ResNet [78]) can be used. Next, the Mel filter is utilized to make the 2D spectrogram more discriminative and better matched to the sensitivity of the human auditory system.
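The two steps named in the Fig. 7 caption can be sketched with NumPy as below. This is a hedged illustration rather than the pipeline of any surveyed model: it assumes the FFT is applied frame by frame with a Hann window (a short-time transform), it adds a log compression that the caption does not mention, and all function names and default parameters (`n_fft`, `hop`, `n_mels`) are hypothetical.

```python
import numpy as np


def hz_to_mel(hz):
    """Convert frequency in Hz to the mel scale (common 2595*log10 formula)."""
    return 2595.0 * np.log10(1.0 + np.asarray(hz, dtype=np.float64) / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=np.float64) / 2595.0) - 1.0)


def mel_filterbank(n_mels: int, n_fft: int, sample_rate: float) -> np.ndarray:
    """Triangular mel filters, shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    # n_mels + 2 edge frequencies, equally spaced on the mel scale.
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_mels + 2))
    bank = np.zeros((n_mels, fft_freqs.size))
    for m in range(n_mels):
        left, center, right = edges[m], edges[m + 1], edges[m + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        bank[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank


def log_mel_spectrogram(signal, sample_rate, n_fft=512, hop=160, n_mels=64):
    """Raw 1D audio -> 2D log-mel spectrogram of shape (n_mels, n_frames)."""
    signal = np.asarray(signal, dtype=np.float64)
    if signal.size < n_fft:
        raise ValueError("signal is shorter than one analysis window")
    window = np.hanning(n_fft)
    n_frames = 1 + (signal.size - n_fft) // hop
    # Slice the signal into overlapping windowed frames.
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    # Step 1: FFT of each frame gives a 2D power spectrogram.
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Step 2: project the linear-frequency bins onto mel bands.
    mel = power @ mel_filterbank(n_mels, n_fft, sample_rate).T
    return np.log(mel + 1e-10).T
```

The resulting 2D array can be treated like a single-channel image, which is what allows image backbones to be reused on audio.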
connected layers. The proposed training process requires no additional supervision data, where image and audio training instance pairs are automatically obtained by sampling two different videos, i.e., picking a random frame from video-1 and a random 1-second audio clip from video-2; please see Fig. 6 (b) for more details. Note that, as the default training protocol, this strategy has been widely used in the AVC research community.
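The pair construction described above, together with the positive/negative convention of Fig. 6 (b), can be sketched as follows. The sketch only samples timestamps rather than decoding media; the names `AVPair` and `sample_avc_pair` are hypothetical, and taking the frame from the middle of the audio clip for positive pairs is an assumption made here so that the two overlap in time.

```python
import random
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class AVPair:
    frame_video: str    # video the frame is taken from
    frame_time: float   # timestamp of the frame, in seconds
    audio_video: str    # video the audio clip is taken from
    audio_start: float  # start time of the audio clip, in seconds
    label: int          # 1 = corresponding, 0 = not corresponding


def sample_avc_pair(
    durations: Dict[str, float],
    positive: bool,
    rng: random.Random,
    clip_len: float = 1.0,
) -> AVPair:
    """Sample one training pair; `durations` maps video id -> length in seconds."""
    if any(d < clip_len for d in durations.values()):
        raise ValueError("every video must be at least clip_len seconds long")
    ids = sorted(durations)
    if positive:
        # Positive: frame and audio come from the same video and moment.
        vid = rng.choice(ids)
        start = rng.uniform(0.0, durations[vid] - clip_len)
        return AVPair(vid, start + clip_len / 2.0, vid, start, 1)
    if len(ids) < 2:
        raise ValueError("negative pairs need at least two videos")
    # Negative: random frame from video-1, random audio clip from video-2.
    v1, v2 = rng.sample(ids, 2)
    frame_time = rng.uniform(0.0, durations[v1])
    audio_start = rng.uniform(0.0, durations[v2] - clip_len)
    return AVPair(v1, frame_time, v2, audio_start, 0)
```

Because the labels follow from how the pair was sampled, no manual annotation is required, which is the sense in which the protocol needs no additional supervision data.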

Fig. 8: Matching between faces and voices. The matched face-voice pairs are set as positive, whereas the unmatched face-voice pairs are set as negative.

Fig. 10: Demonstration of the CAM-based SOL and the feature similarity-based SOL. The former searches for strong audio-visual feature responses to localize the sounding object, while the latter uses the pixel-wise audio-visual feature similarity instead.

Fig. 11: The most representative fusion schemes for audio-visual saliency detection. Among them, (a) merely utilizes the conventional plain concatenation operation to integrate audio and visual features; (b) treats the audio part as auxiliary information, and the embedded semantic consistency is used to highlight the corresponding spatial regions; (c) adopts a dimension transformation matrix to handle the dimension mismatch problem, which does not require the individual audio and visual saliency cues to have identical dimension sizes.

Fig. 13: Demonstration of the differences between the conventional audio-visual saliency detection model training/testing pipeline (a) and the newly modified training/testing pipeline (b). The advocated AVC classifier can be trained with the newly annotated AVC labels, and it then dynamically controls the data flow of the adopted bi-stream SOTA AVSD model. Lc denotes the binary output of the AVC classifier. Clearly, by equipping the existing SOTA AVSD model with the AVC classifier, we can make the original AVC-unaware AVSD model AVC-aware, achieving persistent performance improvement.

TABLE III: Details of the existing AVSD sets.

TABLE V: Ablation study regarding different AVC classifiers, e.g., L3Net, AVENet, and AVID. The target AVSD model used here is AVINet [149]. AVID+ws denotes that the AVID-based AVC classifier is trained in a weakly-supervised manner, the same as [80]. The best results are highlighted in bold font.