Audio-Visual Cross-Attention Network for Robotic Speaker Tracking

Audio-visual signals can be used jointly for robotic perception as they complement each other. Such multi-modal sensory fusion has a clear advantage, especially under noisy acoustic conditions. Speaker localization, as an essential robotic function, was traditionally solved as a signal processing problem and now increasingly finds deep learning solutions. The question is how to fuse audio-visual signals effectively. Speaker tracking is not only more desirable, but also potentially more accurate than speaker localization because it exploits the speaker's temporal motion dynamics for smoothed trajectory estimation. However, due to the lack of large annotated datasets, speaker tracking is not as well studied as speaker localization. In this paper, we study robotic speaker Direction of Arrival (DoA) estimation with a focus on audio-visual fusion and tracking methodology. We propose a Cross-Modal Attentive Fusion (CMAF) mechanism, which uses self-attention to learn intra-modal temporal dependencies and cross-attention for inter-modal alignment. We also collect a realistic dataset on a robotic platform to support the study. The experimental results demonstrate that our proposed network outperforms the state-of-the-art audio-visual localization and tracking methods under noisy conditions, with an improved accuracy of 5.82% and 3.62% at SNR = −20 dB, respectively.

music information processing [3], [4]. They can be estimated via the arrival time or energy level differences between signals from two spatially separated microphones [5], [6]. The Signal Processing (SP)-based Sound Source Localization (SSL) techniques are analytical solutions under certain assumptions about the signal, noise type, and environmental conditions, which may vary in practice. As an alternative, recently, researchers have proposed Deep Learning (DL)-based approaches that build machine learning models to bypass explicit sound propagation modeling and other required priors [7]. Those approaches model the mapping from acoustic features to speaker locations, and have demonstrated significant performance gain over the SP-based methods, unless the training and testing data are of different conditions [8]. Despite much progress, the techniques that solely rely on acoustic signals are always affected by adverse acoustic conditions [9].
Humans use multi-modal cues to explore, capture, and perceive the real world. In addition to audio, vision is another primary stream that conveys significant information [10]. Many studies confirmed the advantages of audio-visual fusion, such as visually indicated sound separation [11], video-infused audio in-painting [12], and embodied navigation [13]. If a visually tracked object emits sound, its location can also be inferred using SSL techniques. Audio and vision offer complementary characteristics [9]. For example, one may achieve improved tracking accuracy by using sound to estimate the speaker trajectories in unseen regions of a camera [14] or by using visual cues to predict a target location during silent periods [15].
Since audio and vision operate in different spaces, i.e., audio in a 3D space and video on a 2D image plane, in most audio-visual localization-based applications, sensor calibration information, which constructs the mapping between different coordinates, is required. Using calibrated sensors, one can align a DoA to specific 2D locations on an image plane [16], [17], or map a target image location to a 3D spatial space [18], [19]. However, such a calibration process is labor-intensive [20] and precise calibration information is hard to come by.
Not to be burdened with such sensor calibration, one solution is using DL techniques to transform the localization task into a data-driven optimization problem, where a model gradually learns the mapping from input signals to speaker location ground truth [28]. It is noted that the success of DL techniques is based on a large amount of training data. However, most existing datasets provide data of audio-only localization [28], [33], [34] or audio-visual localization, but with a short recording duration [35]. The scarcity of datasets hinders the DL-based speaker tracking studies [9].
In this paper, we tackle the speaker location estimation problem with signals captured by multi-modal sensors mounted on a real robot. We are particularly interested in exploiting three unique properties of audio-visual signals for speaker tracking under noisy acoustic conditions: (1) both audio and visual signals are sequential; (2) the speaker locus at either an audio or visual frame is temporally correlated to that at the neighboring frames; and (3) audio and visual signals from the same speaker are highly correlated as far as the speaker locus is concerned. To this end, we propose an audio-visual cross-attention network. The main contributions of this work are as follows.
1) We make the first attempt at audio-visual speaker spatial DoA estimation using DL-based techniques and a tracking strategy.
2) We propose a Cross-Modal Attentive Fusion (CMAF) architecture that explores the self-attention mechanism to learn intra-modal temporal dependencies and the cross-attention mechanism for inter-modal alignment.
3) We develop a DL-enabled audio-visual dataset with signals captured by a real robot. The monocular image sequences, multi-channel microphone array signals, and speaker 3D location annotations are provided.
4) We demonstrate that the proposed CMAF outperforms the state-of-the-art uni-modal and multi-modal approaches. We will release the dataset and the source code.
The rest of the paper is organized as follows. Section II gives a comprehensive review of the related works. Section III formulates the research problem. Section IV first characterizes the audio and video processing, and then elaborates the proposed audio-visual tracking network. Section V summarizes the existing DL-based datasets, followed by a detailed description of our self-collected audio-visual dataset. Experiments are conducted and analyzed in Section VI. Limitations and future works are discussed in Section VI-H. Finally, we conclude in Section VII.

II. RELATED WORK
Let us start with a review of the significant SSL approaches. Then, we discuss how speaker tracking is different from the localization task, and how the incorporation of vision can help improve performance. The most relevant DL-based SSL and tracking approaches are summarized in Table I.

A. Speaker Localization
Speaker localization using sound is a well-established area of research. Conventional SP-based speaker localization methods generally belong to four categories: (1) time delay estimation, e.g., Time Difference of Arrival (TDoA), (2) sub-space methods, (3) beamforming methods, and (4) histogram analysis methods. Among the four categories, time delay-based methods attract the most attention. In particular, Generalized Cross Correlation (GCC) estimates the sound location at the maximum correlation between the inter-microphone signals. As only phase information conveys TDoA, Generalized Cross Correlation with Phase Transform (GCC-PHAT) eliminates the amplitude and uses only the phase of the cross spectrum for robustness against noise interference [5]. Studies show that speaker localization benefits from the use of a multi-channel microphone array [36]. By aggregating information from multiple microphone pairs, we overcome errors from an individual microphone pair. As an example, Steered Response Power PHAse Transform (SRP-PHAT) [6] promotes this concept by estimating the speaker DoA at the hypothesis with the maximized accumulated GCC-PHAT value. Despite much progress, SP-based methods still leave room for improvement under adverse acoustic conditions [28]. Nonetheless, the SP-based features are shown to be effective under controlled acoustic conditions.

B. Speaker Tracking
In speaker localization, we tackle frame-level localization, which does not consider sound source motion dynamics. Unlike speaker localization, speaker tracking not only locates the speaker, but also follows the speaker's motion over consecutive frames. Speaker tracking is required in many real-world robotic applications. As it benefits from sound source motion dynamics, it potentially provides more accurate position information than speaker localization.
Traditional parametric tracking approaches often rely on a recursive Bayesian estimation paradigm, such as Kalman Filter (KF) [38] and Particle Filter (PF) [16], [18], [39]. PF is a stochastic inference strategy that relies on sequential importance sampling to approximate the target posterior distribution using weighted samples. To complement audio cues, vision is shown effective [17], [18], [40]. For example, [18] uses PF for speaker tracking, where audio-visual fusion is implicitly handled during the recursive particle likelihood update stage. [41] adopts the Extended Kalman Filter (EKF) paradigm to achieve dynamic weighting of audio-visual streams according to the instantaneous sensor reliability measure. However, those parametric approaches require prior knowledge of the scene, such as spatial distribution and respective velocity of a sounding object, or manual tuning of specific hyperparameters according to the scene composition and dynamics.
Unlike parametric approaches, DL-based trackers do not require problem-specific parameter settings. For example, the Recurrent Neural Network (RNN) learns to capture long-term information. Others [27], [32] use consecutively stacked CNN and RNN layers, which is referred to as a Convolutional Recurrent Neural Network (CRNN). In terms of audio-visual fusion, it is common to track a sound source on an image plane. For example, [42] generates a heat map to visualize the image location of a sounding object. [43] represents an audible and visible object as the trajectory of a potential sound source through space and time. When it comes to the spatial DoA domain, few studies have tackled the audio-visual tracking problem, let alone their fusion mechanism. One of the reasons that hinder the research is the lack of a large annotated dataset. As a first attempt, we tackled this problem with a simple but effective visual simulation method in our prior work [9] by taking advantage of sensor calibration information, where a straightforward audio-visual weighting mechanism was studied. Nonetheless, in [9], we did not investigate the temporal dynamics of speaker motion, as the involved sound source was stationary.
It is tempting to combine parametric methods with DL techniques for speaker tracking. For example, [44] proposes the Backprop Kalman Filter (BKF), which downsizes the observations to a lower-dimensional space through a nonlinear encoding network. Inspired by [44], a DL-based KF extension for audio-visual speaker tracking was introduced in [45], where dynamic multi-modal stream weights are jointly learned during model optimization. Despite comparable performance with standard KF-based methods, this approach still requires manual parameter tuning.

C. Summary
Most of the previous DL-based studies are conducted on synthetic recordings with a spatially static sounding object, either using audio signals [21], [22], [23], [24], [25], [26], [27], [29], [30], [31], [32], or audio-visual signals [9]. None of them explore the problem of DL-based audio-visual speaker spatial DoA tracking, which is the focus of this paper. Fig. 1 illustrates a general block diagram of our proposed architecture. In addition to multi-modal input data capture, it consists of three stages: (1) the front-end feature processing block, (2) the multi-modal fusion block, and (3) the speaker DoA classifier. We discuss this in further detail next.

III. PROBLEM FORMULATION
We start by defining the notation. Unless otherwise specified, matrices are in bold uppercase, e.g., X, Y, vectors are in bold lowercase, e.g., x, y, variables are in lowercase, e.g., x, y, and functions are in calligraphic font, e.g., $\mathcal{F}$. Let us denote the synchronized audio and image sequences captured by an M-channel microphone array and a camera mounted on a robotic platform as $s_{1:M}$ and $I$. We aim to estimate the DoA of a speaker at each time frame $t$. We formulate the problem as a regression problem, which seeks to predict DoA as one of the 360 discrete classes, denoted as $\theta = \{j \mid j \text{ is an integer, and } 1 \le j \le 360\}$.
As DoAs are spatially continuous, instead of adopting the one-hot output coding, we use a Gaussian-like vector [28], denoted as $p_t(\theta)$, to represent the posterior probability of a speaker's presence in the direction of $\theta_t$, where $p_t(\theta)$ is centered on the ground truth $\theta_t$ with a standard deviation $\sigma_\theta$. Note that the prefix $\frac{1}{\sqrt{2\pi}\sigma_\theta}$ of the Gaussian distribution is dropped in Eq. (1), as in [28], since it is a constant that has no impact on the algorithm.
As will be discussed later, we adopt a DL technique to learn the mapping from the multi-modal inputs to the speaker DoA posterior probability through a network $\mathcal{F}(\cdot)$ with learnable parameters $\Omega$. In this way, the locus of the speaker is approximated by the DoA value with the highest probability. We adopt the Mean Square Error (MSE) loss for the posterior probability-based coding in (1). For brevity, we drop the time index $t$ hereafter.
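The coding, decoding, and loss described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the factor 2 in the exponent, the wrap-around angular distance, and the helper names `encode_doa`, `decode_doa`, and `mse_loss` are assumptions introduced here; only the 360 classes and $\sigma_\theta = 8$ come from the paper.

```python
import numpy as np

NUM_CLASSES = 360    # DoA classes 1..360 degrees, as in the text
SIGMA_THETA = 8.0    # value reported in the implementation details

def circular_diff(a, b):
    """Smallest absolute angular difference in degrees (wrap-around assumed)."""
    d = np.abs(np.asarray(a, dtype=float) - b) % 360.0
    return np.minimum(d, 360.0 - d)

def encode_doa(theta_gt, sigma=SIGMA_THETA):
    """Gaussian-like target vector centered on theta_gt, constant prefix dropped."""
    classes = np.arange(1, NUM_CLASSES + 1)
    d = circular_diff(classes, theta_gt)
    return np.exp(-d ** 2 / (2.0 * sigma ** 2))   # factor 2 is an assumption

def decode_doa(p):
    """Return the class (1..360) with the highest posterior value."""
    return int(np.argmax(p)) + 1

def mse_loss(p_pred, p_true):
    """Mean square error between predicted and target posterior vectors."""
    return float(np.mean((p_pred - p_true) ** 2))

target = encode_doa(45)
print(decode_doa(target))   # 45
```

In training, `p_pred` would be the network output for one frame and `p_true` the encoded ground truth; at test time only `decode_doa` is needed.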

IV. PROPOSED METHODS
Audio-visual signals provide rich spatial and temporal cues to track the speaker in the scene [10]. It was shown that directly concatenating the frame-level multi-modal features is vulnerable to temporal misalignment and uni-modal outliers [46]. We start with the basics of audio and visual characterization. We then formulate CMAF, the cross-modal attentive fusion architecture for speaker tracking.

A. Audio Processing
In acoustic speaker localization, time-delay based methods have achieved remarkable success with widespread applications thanks to their simple and effective computation. In particular, GCC-PHAT, which facilitates the TDoA estimation between any two arbitrary microphones, is more robust to noise and room reverberations [47]. Herein, we use it as the acoustic feature. Let $S_{n_1}$ and $S_{n_2}$ denote the STFT of short-time audio signals at a microphone pair $\{(n_1, n_2), \forall n_1 < n_2 \le N\}$, where $N$ is the total number of microphones. We use $m \le M$ to index the microphone pair, with $M$ the total number of pairs. Then, the GCC-PHAT feature with time delay $\tau$ is computed as

$$g_m(\tau) = \sum_{k=0}^{N_s-1} \frac{S_{n_1}[k]\, S_{n_2}[k]^{*}}{\left| S_{n_1}[k]\, S_{n_2}[k]^{*} \right|}\, e^{i \frac{2\pi k}{N_s} \tau} \quad (5)$$

where $i$ indicates the imaginary unit, $*$ denotes the complex conjugate, $k$ indicates the frequency bin, and $N_s$ is the STFT length. GCC-PHAT is a feature in the time domain that peaks at the actual time delay. Note that its performance is unstable for signals under a low Signal-to-Noise Ratio (SNR).
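A minimal NumPy sketch of the GCC-PHAT feature for one microphone pair is given below. It is an illustration under stated assumptions, not the authors' code: the function name `gcc_phat`, the small constant `eps`, and the use of an inverse real FFT (which equals the sum over frequency bins up to a constant scale) are choices made here. The lag range of ±10 samples and the 1024-sample STFT follow the implementation details of the paper.

```python
import numpy as np

def gcc_phat(x1, x2, max_lag=10, n_fft=1024, eps=1e-12):
    """GCC-PHAT coefficients for lags -max_lag..max_lag (in samples)."""
    X1 = np.fft.rfft(x1, n=n_fft)
    X2 = np.fft.rfft(x2, n=n_fft)
    cross = X1 * np.conj(X2)
    cross /= np.abs(cross) + eps           # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n_fft)      # circular cross-correlation
    # Reorder so that index 0 corresponds to lag -max_lag.
    return np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))

rng = np.random.default_rng(0)
ref = rng.standard_normal(1024)
delayed = np.roll(ref, 3)                  # delayed[n] = ref[n - 3]
g = gcc_phat(delayed, ref)
est_lag = int(np.argmax(g)) - 10
print(len(g), est_lag)                     # 21 3
```

With this sign convention a positive lag means the first signal lags the second. For an array with N microphones, the function would be called once per pair to build the M-channel feature.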
Since frequency-domain DoA estimation methods are more robust than time-domain methods, in particular in the presence of background noise and reverberation [48], we also incorporate the log-mel spectrogram, obtained by passing the STFT through a mel filter bank, to provide a compact representation. The mel spectrogram, which operates in the frequency domain, is supplementary to GCC-PHAT. We use $M_n$ to denote the one calculated at the $n$-indexed microphone ($n \le N$). Since DL features are automatically deduced and optimally tuned for the desired DoA outcome, incorporating features from both domains brings the greatest benefits.
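The log-mel feature for a single channel and frame could be computed along the following lines. This is a sketch under assumptions: the HTK-style mel formula, the triangular filter shape, the Hann window, and the function names are not specified in the paper; the 21 filters, the 20 Hz to 8 kHz range, the 16 kHz sampling rate, and the 1024-sample frame are taken from its implementation details.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels=21, n_fft=1024, sr=16000, fmin=20.0, fmax=8000.0):
    """Triangular mel filters with shape (n_mels, n_fft // 2 + 1)."""
    freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    edges = mel_to_hz(np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2))
    fb = np.zeros((n_mels, freqs.size))
    for i in range(n_mels):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def log_mel(frame, fb, n_fft=1024, eps=1e-10):
    """Log-mel energies of one windowed frame (Hann window assumed)."""
    spectrum = np.fft.rfft(frame * np.hanning(len(frame)), n=n_fft)
    return np.log(fb @ (np.abs(spectrum) ** 2) + eps)

fb = mel_filterbank()
frame = np.random.default_rng(0).standard_normal(1024)
print(fb.shape, log_mel(frame, fb).shape)   # (21, 513) (21,)
```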
In summary, we extract two location-related features from the audio processing block: the time-domain GCC-PHAT features, and the frequency-domain log-mel spectrogram features.

B. Video Processing
The advances in object detection have enabled many real-world applications, such as autonomous driving, robot vision, and surveillance [49]. Human face detection has served as the front-end of many audio-visual tasks, such as speaker tracking [18], diarization [50], and speaker extraction [51].
Object detection results are typically represented by bounding boxes [52]. In this paper, we adopt the tracking-by-detection methodology [53], and use face detection as the front-end for video encoding. Let us denote a detected face bounding box at $t$ as

$$b_t = (u_t, v_t, w_t, h_t) \quad (6)$$

where $(u_t, v_t)$ indicates the image position of the top-left corner, and $(w_t, h_t)$ the width and height of the bounding box. We propose to encode the visual locus of a speaker as the concatenation of two Gaussian-like vectors, $\rho_t(u)$ and $\rho_t(v)$, which represent the likelihoods of a speaker being visually present along the image's horizontal axis $u$ and vertical axis $v$. We use the same formulation as [9] in (7), where $\mu_{u,t} = u_t + \frac{1}{2} w_t$ is the horizontal center of $b_t$, and $\sigma_u$ is the standard deviation. The vertical representation $\rho_t(v)$ has the same format as (7). Notice that when there is no face detection, the visual features are set to all-zero vectors.
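The visual encoding can be illustrated with the short sketch below. Several details are assumptions made for the example and are not stated in this section: the image axis is uniformly quantized into 21 bins, $\sigma_u = \sigma_v = 3$ is interpreted in bin units, the Gaussian uses the factor 2 in the exponent, and the names `axis_vector` and `encode_face` are hypothetical. The 960 × 540 resolution and the 21-dimensional vectors follow the dataset and implementation descriptions.

```python
import numpy as np

IMG_W, IMG_H = 960, 540   # image resolution reported for the dataset
DIM = 21                  # feature length per image axis
SIGMA = 3.0               # sigma_u = sigma_v, assumed to be in bin units

def axis_vector(center_px, size_px, dim=DIM, sigma=SIGMA):
    """Gaussian-like vector over `dim` bins centered on a pixel coordinate."""
    bins = np.arange(dim)
    mu = center_px / size_px * (dim - 1)          # assumed pixel-to-bin scaling
    return np.exp(-(bins - mu) ** 2 / (2.0 * sigma ** 2))

def encode_face(box):
    """box = (u, v, w, h) with (u, v) the top-left corner; None if no face."""
    if box is None:
        return np.zeros(2 * DIM)                  # all-zero features, as in the text
    u, v, w, h = box
    rho_u = axis_vector(u + 0.5 * w, IMG_W)       # horizontal center of the box
    rho_v = axis_vector(v + 0.5 * h, IMG_H)       # vertical center of the box
    return np.concatenate((rho_u, rho_v))

feat = encode_face((430, 220, 100, 100))          # face centered in the image
print(feat.shape, int(np.argmax(feat[:DIM])))     # (42,) 10
```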

C. Cross-Modal Attentive Fusion (CMAF)
We extract the location-related features from audio and video, respectively. Then, we design a deep module to refine the location-related cues by taking into account the temporal variations among cross-modalities.
Self-attention in neural networks learns which element to focus on in a sequential signal, while cross-attention learns the interaction between audio and visual signals. Self-attention makes use of historical data to make a decision, that is particularly useful for speaker tracking because it considers a receptive field instead of a single frame, and benefits from the temporal correlation properties of audio and visual signals separately. Cross-attention exploits the synchronization information between audio and visual signals, which takes advantage of the audio-visual correlation properties. We believe that by employing both self-attention and cross-attention in the proposed CMAF network, we make full use of multi-modal cues.
We detail the architecture of CMAF in Fig. 3. For brevity, we exclude the input streams and DoA outputs from the figure. We first apply a fully-connected (FC) layer to project the multi-channel information from each of the three individual modalities (i.e., the $M$-channel GCC-PHAT $g_m(\tau)$, the $N$-channel log-mel spectrogram $M_n$, and the two-channel visual features) into the latent representations, denoted as $X_g$, $X_M$, and $X_\rho$, respectively. Then, we adopt six parallel CMAF blocks, which enumerate all ordered pairs of the three latent representations. Finally, the resulting cross-attentive features are concatenated for speaker DoA estimation.
Specifically, in Fig. 4, we illustrate one CMAF block that deals with two arbitrary inputs, denoted $X_\alpha$ and $X_\beta$. Two attention modules, Multi-head Self-Attention (M-SAtt) and Multi-head Cross-Attention (M-CAtt), are detailed in the left-most and right-most panels.
Fig. 4. An individual CMAF block (Fig. 3) given the input streams $X_\alpha$ and $X_\beta$ from modality $\alpha$ and $\beta$. The self-attention and the cross-attention block are displayed on the left-most and the right-most sides, respectively (⊕ indicates the pair-wise addition operation).

1) Self-Attention (SAtt):
SAtt allows the model to attend to all features at the scale of the input sequences. As described in [54], it maps a query to a set of key-value pairs. For instance, given a sequential input $X_\alpha$ from modality $\alpha$, the scaled dot-product self-attention matrix, which represents the energy, is computed as

$$\mathrm{SAtt}(X_\alpha) = \mathrm{softmax}\left( \frac{Q_\alpha K_\alpha^{\top}}{\sqrt{d_k}} \right) V_\alpha \quad (8)$$

where $W^Q_\alpha$, $W^K_\alpha$ and $W^V_\alpha$ are the trainable parameters of three individual FC layers which project $X_\alpha$ into a common space. $Q_\alpha = X_\alpha W^Q_\alpha \in \mathbb{R}^{T_\alpha \times d_q}$ is a set of queries, $K_\alpha = X_\alpha W^K_\alpha \in \mathbb{R}^{T_\alpha \times d_k}$ is a set of keys, and $V_\alpha = X_\alpha W^V_\alpha \in \mathbb{R}^{T_\alpha \times d_v}$ is the corresponding set of values (where $T$ and $d$ denote the sequence length and feature dimension, respectively). The softmax(·) is used to normalize the weights. To make full use of self-attention, a multi-head attention mechanism is applied [54], which enables a model to attend to various projections of an input. The resulting multi-head self-attention, denoted as M-SAtt, is then formulated as

$$\text{M-SAtt}(X_\alpha) = \mathrm{cat}(\mathrm{SAtt}_1, \ldots, \mathrm{SAtt}_H)\, W_\alpha \quad (10)$$

where cat denotes concatenation, $H$ is the total number of heads, and $W_\alpha$ is the set of trainable parameters associated with the concatenated SAtt(·) representation. Moreover, each self-attention module indexed by $h \le H$ consists of independent trainable parameters $W^Q_{\alpha,h}$, $W^K_{\alpha,h}$ and $W^V_{\alpha,h}$, respectively.
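The multi-head self-attention described above follows the standard formulation of [54] and can be sketched in NumPy as below. The random weights, the sequence length of 5, and the function names are placeholders for illustration; the 128-dimensional latent space and H = 4 heads follow the implementation details, while the per-head dimension of 32 is an assumption.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention on one sequence X of shape (T, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))   # (T, T) attention map
    return weights @ V, weights

def multi_head_self_attention(X, heads, Wo):
    """Concatenate the per-head outputs, then apply the output projection."""
    outs = [self_attention(X, *h)[0] for h in heads]
    return np.concatenate(outs, axis=-1) @ Wo

rng = np.random.default_rng(0)
T, d, H = 5, 128, 4
d_h = d // H                                            # assumed head dimension
X = rng.standard_normal((T, d))
heads = [tuple(rng.standard_normal((d, d_h)) * 0.1 for _ in range(3))
         for _ in range(H)]
Wo = rng.standard_normal((H * d_h, d)) * 0.1
out = multi_head_self_attention(X, heads, Wo)
_, w = self_attention(X, *heads[0])
print(out.shape, w.shape)                               # (5, 128) (5, 5)
```

Each row of the attention map sums to one, so every output frame is a convex combination of the value vectors of all frames in the receptive field.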

2) Cross-Attention (CAtt):
Technically, CAtt queries the feature of one modality to the other modality and vice versa. Since the learned audio and visual features share the same spatial correspondence, we use the CAtt module to achieve collaborative fusion while preserving the intra-modal characteristics. The right-most part of Fig. 4 illustrates the details. With the cross-attention module, each modality keeps updating its features via external information from the other modality.
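Given the latent representations $X_\alpha$ and $X_\beta$ of modalities $\alpha$ and $\beta$, the scaled dot-product cross-attention (i.e., a variant of cross-correlation) achieves latent adaptation across modalities through a formulation similar to (8). This process is formulated as

$$\mathrm{CAtt}(X_\alpha, X_\beta) = \mathrm{softmax}\left( \frac{Q_\alpha K_\beta^{\top}}{\sqrt{d_k}} \right) V_\beta \quad (11)$$

where $Q_\alpha = X_\alpha W^Q_\alpha$, $K_\beta = X_\beta W^K_\beta$, and $V_\beta = X_\beta W^V_\beta$, with $W^Q_\alpha$, $W^K_\beta$ and $W^V_\beta$ the projection parameters of the query and the key-value pairs, respectively. Accordingly, the M-CAtt mechanism follows the same principle as the self-attention counterpart (10),

$$\text{M-CAtt}(X_\alpha, X_\beta) = \mathrm{cat}(\mathrm{CAtt}_1, \ldots, \mathrm{CAtt}_H)\, W_{\alpha,\beta} \quad (13)$$

where we abbreviate the cross-attention representation (11) as $\mathrm{CAtt}_{1,\ldots,H}$ with the subscript indexing the head number, and $W_{\alpha,\beta}$ indicates the set of trainable parameters.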
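A single cross-attention head can be sketched as follows, with queries taken from modality α and keys and values from modality β. The weights are random placeholders and the two sequence lengths (6 and 9) are chosen only to make the shapes visible; in the paper all features are resampled to a common frame rate, so the lengths would normally match.

```python
import numpy as np

def cross_attention(X_a, X_b, Wq, Wk, Wv):
    """Queries from modality alpha; keys and values from modality beta."""
    Q, K, V = X_a @ Wq, X_b @ Wk, X_b @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
d, d_k = 128, 32
X_a = rng.standard_normal((6, d))                     # alpha stream, 6 frames
X_b = rng.standard_normal((9, d))                     # beta stream, 9 frames
Wq, Wk, Wv = (rng.standard_normal((d, d_k)) * 0.1 for _ in range(3))
out, w = cross_attention(X_a, X_b, Wq, Wk, Wv)
print(out.shape, w.shape)                             # (6, 32) (6, 9)
```

The output keeps the time axis of the querying modality, which is what lets each stream be updated with information from the other without losing its own temporal indexing.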
3) CMAF:
We jointly model the temporal recurrence, co-occurrence, and synchrony of the multi-modal features in the CMAF block through the self-attention and cross-attention modules. The intermediate CMAF features are computed with skip connections, which help preserve the identity information from the input stream $X_\alpha$.
The final CMAF outcome is then obtained using layer normalization (LN) and an MLP, which includes two FC layers with a ReLU activation function. The overall CMAF block models the interactions between each cross-modal pair. With three input multi-modal streams, six parallel CMAFs are adopted (as illustrated in Fig. 3), where all the cross-attentive features are concatenated and passed to the back-end DoA classifier at the last step.
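The assembly of the six parallel blocks can be sketched as below. The internal wiring of `cmaf_block` is an assumption made for illustration (a residual sum of the input, self-attention, and cross-attention outputs, followed by LN and a two-layer MLP with another skip connection); the exact equations are not reproduced here. For brevity the sketch uses single-head attention with identity projections and shares the MLP weights across blocks, which the actual model presumably does not.

```python
from itertools import permutations
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(X_q, X_kv):
    """Single-head attention with identity projections (simplification)."""
    return softmax(X_q @ X_kv.T / np.sqrt(X_q.shape[-1])) @ X_kv

def layer_norm(X, eps=1e-5):
    mu = X.mean(axis=-1, keepdims=True)
    var = X.var(axis=-1, keepdims=True)
    return (X - mu) / np.sqrt(var + eps)

def cmaf_block(X_a, X_b, W1, W2):
    """Assumed wiring: residual sum of input, SAtt and CAtt, then LN + MLP."""
    Z = X_a + attend(X_a, X_a) + attend(X_a, X_b)
    mlp = np.maximum(layer_norm(Z) @ W1, 0.0) @ W2    # two FC layers with ReLU
    return layer_norm(Z + mlp)

rng = np.random.default_rng(0)
T, d = 8, 128
streams = {name: rng.standard_normal((T, d)) for name in ("rho", "g", "M")}
W1 = rng.standard_normal((d, d)) * 0.05
W2 = rng.standard_normal((d, d)) * 0.05

pairs = list(permutations(streams, 2))                # six ordered pairs
fused = np.concatenate(
    [cmaf_block(streams[a], streams[b], W1, W2) for a, b in pairs], axis=-1)
print(len(pairs), fused.shape)                        # 6 (8, 768)
```

Enumerating ordered rather than unordered pairs matters because the block is asymmetric: the first stream supplies the queries and the skip connection, the second only the keys and values.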

V. DATASET
We first review the existing datasets that are suitable for DL-based speaker DoA estimation, to motivate the design of a new dataset for Audio-Visual Robotic Interface (AVRI).

A. Existing Datasets
We summarize all datasets in Table II and characterize them in terms of modality, sensing platform, type of sounding source, recording duration, and data annotation.
1) The LOCalization And TrAcking (LOCATA) dataset [55] is recorded with different array configurations in a highly reverberant indoor environment. Annotations include microphone and target 3D locations and Voice Activity Detector (VAD) labels.
2) The Realistic Speech Localization (RSL) dataset [34] is recorded in a low-noise and nearly reverberation-free environment, using a 4-channel ReSpeaker microphone array. A loudspeaker is placed at two different heights (1 m and 1.5 m from the ground plane) with the DoA of the sound source ranging from 1 to 360 degrees (5-degree resolution).
3) The Sound Source localization for Robots (SSLR) dataset [28] is recorded using the humanoid Pepper robot, where four microphones and a stereo-vision system are mounted on the robot head. It mostly uses a loudspeaker for recording, whereas human recordings only last for 4 minutes.
4) The MultiModal Mall Entertainment Robot (MuMMER) dataset [35] uses the Pepper robot to record the audio-visual streams. However, it only provides 2D face location annotations, which are not suitable for speaker spatial location estimation.
In summary, LOCATA and RSL only include audio signals. Among the audio-visual datasets, SSLR only has 4 minutes of human video, while MuMMER lacks the annotated 3D speaker location. To support our study, we propose a large-scale AVRI dataset, which is elaborated in the following.

B. AVRI Dataset
We collect the AVRI dataset, as summarized in Table III, which features several unique properties: (1) it involves both multi-channel audio signals and images, (2) it is recorded from on-site human speakers, (3) it includes 3D location annotations, and (4) it not only enables DL-based applications, but also supports SP-based techniques. To promote research in DL-based audio-visual speaker localization and tracking, the AVRI dataset and the source code of this work are publicly available together with this paper.
3 Pepper robot: https://www.softbankrobotics.com/emea/en/pepper
For HRI applications in smart home scenarios, the recording environment makes the most difference. Thus, we design the recording in a real indoor reverberant room equipped with various furniture (e.g., table, sofa, and chairs), where the data are captured from a real robot and affected by ego and background noise. The reverberation time is approximately RT60 ≈ 0.35 s according to [56]. The whole recording procedure lasts for two months, during which we do not constrain the robot location and room arrangements. This makes the data more realistic in a robotic scenario. Specifically, a Kinect sensor is used to capture RGB image sequences (with a resolution of 960 × 540 pixels), and audio signals are captured with a four-channel circular ReSpeaker microphone array (at a sampling frequency of 16 kHz). The setup of the multisensory equipment is illustrated in Fig. 5(b), which is mounted on a KINOVA robot (Fig. 5(a)).
We invite 6 female and 5 male speakers, that is, 11 participants in total. For each speaker, we record 3∼4 clips with each length varying from 6 to 18 minutes. During the recording, the participants freely move around the robot while reading a given script. In particular, 22 recordings were read in Chinese, while 21 were read in English. Moreover, participants wore a face mask in 22 recordings.
To annotate the speech activities, we adopt a wireless presenter system where a lavalier microphone is clipped on the collar of each participant to acquire the close-talk speech. Moreover, to facilitate speaker spatial localization, we incorporate an OptiTrack system to annotate the 3D locations of speakers and sensors. In Fig. 5(b), the small silvery dots attached to the sensors are the reflection points (annotators) of the OptiTrack system.

VI. EXPERIMENTS
As studies show that DL-based localization methods are superior to SP-based methods [9], [21], [26], [32], we focus on comparing our proposed methods with the significant DL-based DoA estimation methods. The experiments are conducted on AVRI dataset to compare CMAF with the competitive baselines. We also provide qualitative analysis and visualization in a comparative study. Furthermore, we would like to show that appropriate audio-visual fusion greatly improves the robustness of speaker tracking in noisy acoustic conditions.
All methods are tested with the same parameter settings for fair comparison. We briefly summarize the methodology of the baseline methods as follows:
1) GCC-MLP [28]: It incorporates GCC-PHAT as the acoustic feature, where an MLP network with three fully-connected hidden layers is used as the DoA classifier.
2) STFT-ResNet [29]: It uses STFT as input, which captures richer temporal and spatial information than GCC-PHAT. A CNN network with residual blocks [57] is used to avoid the problem of vanishing gradients and to learn the DoA feature representation.
3) A-CRNN [32]: It stacks GCC-PHAT and log-mel spectrogram as the inputs to a CRNN network to capture the temporal motion dynamics. Although this work was originally designed for sound event tracking, it is easily extendable to track a speaker.
4) AV-MLP [9]: It uses GCC-PHAT and Gaussian-encoded visual features (Eq. (7)) as input. The two modalities are concatenated to form a global audio-visual representation for the back-end MLP-based DoA classifier.
Note that since AV-MLP [9], the only comparable DL-based DoA estimation work, is our previous method that relies on visual augmentations, we provide a new data collection and propose novel methods to encourage more exploration in this field.
Fig. 6. The network architecture of the contrastive model, i.e., AV-CRNN. The encoded multi-modal features are first concatenated and then fed into a stack of CNN blocks to learn a high-level representation. Then, a GRU module is adopted to incorporate the temporal information before the final DoA classifier (⊗ indicates the concatenation operation, which aggregates the multi-modal features as the entire input representation).

B. A Contrastive Model: AV-CRNN
Besides the four baselines in Section VI-A, we also implement a contrastive model to examine the contributions of the proposed self-attention and cross-attention mechanism in CMAF. We introduce a CRNN module, i.e., a stacked CNN and GRU architecture, in place of the parallel fusion module in CMAF, referred to as AV-CRNN in Fig. 6, where the dashed box denotes the CRNN architecture.
AV-CRNN has the same front-end feature extractor and back-end classifier as CMAF, but employs a different multi-modal fusion mechanism without cross-attention, which allows us to clearly show the effect of the proposed CMAF. In the CRNN module, since the translation-invariant characteristics of CNN are effective in processing multi-modal data [58], we concatenate these features and feed them into stacked CNN layers. Each 2D CNN block is followed by an average pooling, a batch normalization, and a ReLU activation, to extract high-level location-related information. To model speaker temporal dynamics that are not captured by CNN layers, we apply a Gated Recurrent Unit (GRU) module where each unit consists of multiple gates to identify the temporal information to store, ignore, and eventually trigger the output. In this way, the recurrent layers can accumulate the evolution of spatial parameters from neighboring time frames to facilitate speaker tracking. Finally, the GRU outcome is taken by the classifier to produce class-wise DoA posterior probabilities ($p_t(\theta)$ in (2)).

C. Evaluation Metrics
We evaluate the methods in terms of mean absolute error (MAE) and accuracy (ACC). The symbols '↑' and '↓' indicate the desired direction of improvement. MAE (°) is calculated as the average absolute difference between the ground truth and the estimated DoA. ACC denotes the ratio of correctly estimated frames, where ρ is the accuracy tolerance, which is set to 10°.
TABLE IV. COMPARISON OF OUR PROPOSED METHODS WITH THE SIGNIFICANT STATE-OF-THE-ART DL-BASED DOA ESTIMATION METHODS ON THE AVRI DATASET
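The two metrics can be computed along the following lines. The wrap-around angular error and the inclusive tolerance comparison are assumptions of this sketch, since the exact definitions are not reproduced here; the 10° tolerance is the value stated above.

```python
import numpy as np

def angular_error(est_deg, gt_deg):
    """Absolute DoA error in degrees with wrap-around at 360 (assumed)."""
    diff = np.asarray(est_deg, dtype=float) - np.asarray(gt_deg, dtype=float)
    d = np.abs(diff) % 360.0
    return np.minimum(d, 360.0 - d)

def mae(est_deg, gt_deg):
    """Mean absolute error over all frames, in degrees."""
    return float(np.mean(angular_error(est_deg, gt_deg)))

def acc(est_deg, gt_deg, tol=10.0):
    """Fraction of frames whose error is within the tolerance ('<=' assumed)."""
    return float(np.mean(angular_error(est_deg, gt_deg) <= tol))

est = [10, 355, 180, 90]
gt = [15, 5, 150, 90]
print(mae(est, gt), acc(est, gt))   # 11.25 0.75
```

In the example the per-frame errors are 5°, 10°, 30°, and 0°; the second frame shows why the wrap-around matters, since a naive difference would report 350°.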

D. Implementation Details
We split the AVRI dataset (Section V) into non-overlapping training and test sets of 70% and 30%, respectively. We apply a face detector to extract the face bounding boxes (i.e., $b_t$ in (6)) in each frame of the image. We chose [59] considering its high accuracy and robustness for masked faces. A face detection rate of 67.07% is achieved on the whole dataset, where 59.43% is for the training set and 84.9% is for the test set. Frames without detections occur mainly because the speakers move out of the camera's sight. Note that the different face detection rates between the training and test sets result from subconscious actions of the different participants, since we do not constrain their motions. With such face detection rates, visual cues contribute over half of the time for speaker DoA tracking. Whenever no visual cue is available, the audio-visual algorithms rely only on the audio modality.
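As a quick consistency check on the reported detection rates, the overall rate can be recovered as a weighted average of the per-split rates. The sketch assumes that the 70%/30% split applies at the frame level, which is not stated explicitly.

```python
TRAIN_FRACTION, TEST_FRACTION = 0.70, 0.30   # split proportions from the text
TRAIN_RATE, TEST_RATE = 59.43, 84.9          # face detection rates in percent

# Weighted average of the two per-split rates (frame-level split assumed).
overall = TRAIN_FRACTION * TRAIN_RATE + TEST_FRACTION * TEST_RATE
print(round(overall, 2))                     # 67.07
```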
We set the standard deviation $\sigma_\theta$ of the DoA-based posterior probability (Eq. (1)) to 8, the same as [9], [28]. For the audio building block, the STFT is computed over an audio segment of 64 ms (1024 samples) with 50% overlap. Considering the maximum available TDoA given the inter-distance of a microphone pair, we compute the 21-dimensional GCC-PHAT coefficients with the time delays ranging from −10 to 10 samples. The log-mel spectrogram is computed with 21 mel-scale filters over a frequency range from 20 Hz to 8 kHz. For the visual building block, to be consistent with the audio features, each of the resulting horizontal and vertical visual image features, i.e., $\rho_t(u)$ and $\rho_t(v)$, is of dimension 21 as well. The standard deviation $\sigma_u$ is empirically set to 3 (Eq. (7)), and the same for $\sigma_v$. Moreover, since audio and video have different sampling rates, we resample all features to the same frame rate, i.e., 32 frames per second.
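The settings above can be cross-checked with a few lines of arithmetic. The speed of sound (343 m/s) is an assumed value used only to translate the ±10-sample lag range into a microphone spacing; the remaining numbers come from the text.

```python
from math import comb

SAMPLE_RATE = 16_000     # Hz, from the dataset description
WIN_SAMPLES = 1024       # STFT segment length in samples
OVERLAP = 0.5            # 50% overlap between segments
MAX_LAG = 10             # GCC-PHAT lags span -10..10 samples
NUM_MICS = 4             # four-channel ReSpeaker array
SPEED_OF_SOUND = 343.0   # m/s, assumed value for this check

win_ms = 1000 * WIN_SAMPLES / SAMPLE_RATE          # 64.0 ms
hop = int(WIN_SAMPLES * (1 - OVERLAP))             # 512 samples
native_fps = SAMPLE_RATE / hop                     # 31.25 frames per second
lags = list(range(-MAX_LAG, MAX_LAG + 1))          # 21 lag values
n_pairs = comb(NUM_MICS, 2)                        # 6 microphone pairs
max_spacing_m = SPEED_OF_SOUND * MAX_LAG / SAMPLE_RATE

print(win_ms, hop, native_fps, len(lags), n_pairs, round(max_spacing_m, 3))
```

Under these assumptions the lag range covers microphone spacings up to roughly 0.21 m, and the native audio frame rate of 31.25 frames per second is close to the 32 frames per second to which all features are resampled.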
For CMAF, as shown in Fig. 3, we first use separate FC layers to derive the latent representations X_ρ, X_g, and X_M, each of 128 dimensions. Since we have three input modalities, six parallel CMAFs are used. For each M-SAtt and M-CAtt block involved, we empirically employ H = 4 attention heads for satisfactory performance. For the back-end DoA classifier, we use the same architecture as [28] for all compared methods, which consists of three FC layers. Except for the output layer, each layer is followed by batch normalization, a ReLU activation function, and a dropout layer. We use the Adam optimizer [60] with a learning rate of 0.001 and a batch size of 32.
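The following NumPy sketch illustrates generic multi-head cross-attention with the stated sizes (128-dimensional latents, H = 4 heads). It is not the CMAF module itself: the M-SAtt stage, residual connections, and normalization are omitted, the weights are random, and the function names are hypothetical. It also assumes that the six parallel modules correspond to the six ordered (query, context) pairs of the three modalities, which is one plausible reading of the text.

```python
import itertools

import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_cross_attention(query_seq, context_seq, weights, num_heads=4):
    """Scaled dot-product attention with queries from one modality and
    keys/values from another. Inputs have shape (T, d_model); `weights` holds
    the 'q', 'k', 'v', 'o' projection matrices of shape (d_model, d_model)."""
    t_q, d_model = query_seq.shape
    d_head = d_model // num_heads

    def split_heads(x):
        # (T, d_model) -> (num_heads, T, d_head)
        return x.reshape(x.shape[0], num_heads, d_head).transpose(1, 0, 2)

    q = split_heads(query_seq @ weights["q"])
    k = split_heads(context_seq @ weights["k"])
    v = split_heads(context_seq @ weights["v"])

    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, T_q, T_ctx)
    heads = softmax(scores, axis=-1) @ v                  # (heads, T_q, d_head)
    merged = heads.transpose(1, 0, 2).reshape(t_q, d_model)
    return merged @ weights["o"]


rng = np.random.default_rng(0)
d_model, T = 128, 16
# Random stand-ins for the latent sequences X_rho, X_g, and X_M.
latents = {name: rng.standard_normal((T, d_model)) for name in ("rho", "g", "M")}


def random_weights():
    return {k: rng.standard_normal((d_model, d_model)) / np.sqrt(d_model) for k in "qkvo"}


# Assumption: one cross-attention per ordered (query, context) pair, 3 * 2 = 6.
outputs = {
    (q_mod, c_mod): multi_head_cross_attention(latents[q_mod], latents[c_mod], random_weights())
    for q_mod, c_mod in itertools.permutations(latents, 2)
}
```

Each of the six outputs keeps the temporal length of its query modality, so the results can be concatenated along the feature axis before a classifier.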

E. Comparative Study on AVRI Dataset
The performance of the different methods is summarized in Table IV, where the localization and tracking results are grouped separately. We observe that our proposed methods significantly outperform the others. We also report the number of trainable parameters (in millions).
For audio-only methods, GCC-MLP simply feeds the GCC-PHAT features to the classifier and achieves an MAE of 19.03° and an ACC of 66.00%. STFT-ResNet [29] performs slightly better than GCC-MLP, with an MAE of 17.21° and an ACC of 67.63%. The tracking method A-CRNN [32] uses an RNN to temporally filter the input acoustic features. The reduction of MAE to 7.91° and the increase of ACC to 79.28% highlight the benefit of exploiting the speaker's motion dynamics across consecutive frames.
For audio-visual methods, AV-MLP outperforms GCC-MLP with a lower MAE of 17.55° and an improved ACC of 68.78%, which is attributed to the visual cues in (7). Comparing A-CRNN with the group of systems designed for localization, it is clear that temporal filtering (tracking) has a greater impact than visual cues. The results of our contrastive model AV-CRNN corroborate the influence of tracking, while its improvement over A-CRNN, to an MAE of 7.58° and an ACC of 79.72%, shows the benefit of vision. The proposed CMAF model further boosts the performance with the parallel CMAF modules (described in Section IV-C), reducing MAE to 7.26° and improving ACC to 80.86%. In addition to superior performance, CMAF has only 3.825 million (M) trainable parameters, about 50% fewer than the CRNN-based tracking methods (7.713 M for A-CRNN and 7.715 M for AV-CRNN).
To give an intuitive illustration, Fig. 7 displays the speaker trajectory and the DoA estimates of the different methods, where the horizontal and vertical axes correspond to time and DoA, respectively. We plot the localization and tracking methods in blue and orange, respectively. Moreover, results in non-speech segments are marked with cyan crosses, while frames with face detections have a grey background. From the figure, we can see that for the localization methods, i.e., STFT-ResNet (Fig. 7(a)), GCC-MLP (Fig. 7(b)), and AV-MLP (Fig. 7(c)), although the vast majority of DoA estimates follow the ground truth (black curve), there are still some spikes, resulting from intermediate speech pauses or background noise, that are randomly distributed over various DoAs. Nevertheless, with the help of vision, Fig. 7(c) shows fewer spikes than Fig. 7(b). Observing the estimated speaker trajectories of the tracking methods, i.e., A-CRNN (Fig. 7(d)), AV-CRNN (Fig. 7(e)), and CMAF (Fig. 7(f)), we find that the tracking mechanism helps remove the DoA outliers (especially in non-speech segments) and thus results in smoother DoA trajectories. Furthermore, when comparing tracking methods with and without visual input (AV-CRNN and CMAF vs. A-CRNN), we can observe the contribution of face detections. Overall, our proposed CMAF produces the smoothest trajectory among the competing methods.

F. Noise Robustness
In real-world applications, audio signals are almost always corrupted by noise. We therefore compare the proposed model with the other models under noisy conditions. Additive White Gaussian Noise (AWGN) is added to the multi-channel audio signals, with the resulting SNR ranging from −20 dB to 10 dB. The results are summarized in Table V.
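A minimal sketch of corrupting a multi-channel recording with AWGN at a target SNR is given below. The function name is hypothetical, and computing a single noise power from the average power over all channels is an assumption; the text does not state whether the SNR is set per channel or globally.

```python
import numpy as np


def add_awgn(audio, snr_db, rng=None):
    """Add white Gaussian noise to `audio` (channels, samples) at `snr_db` dB.

    Assumption: the signal power is averaged over all channels and samples.
    """
    rng = np.random.default_rng() if rng is None else rng
    audio = np.asarray(audio, dtype=np.float64)
    signal_power = np.mean(audio ** 2)
    # SNR_dB = 10 * log10(P_signal / P_noise)  =>  P_noise = P_signal / 10^(SNR_dB / 10)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.standard_normal(audio.shape) * np.sqrt(noise_power)
    return audio + noise


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = np.arange(16000) / 16000.0
    clean = np.sin(2 * np.pi * 440.0 * t) * np.ones((4, 1))   # toy 4-channel tone
    for snr_db in (-20, -10, 0, 10):
        noisy = add_awgn(clean, snr_db, rng)
        measured = 10 * np.log10(np.mean(clean ** 2) / np.mean((noisy - clean) ** 2))
        print(snr_db, round(float(measured), 2))
```

The measured SNR deviates slightly from the target because the noise power is estimated from a finite number of samples.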
From Table V, we can see that the tracking methods using either audio (A-CRNN) or audio-visual inputs (AV-CRNN and CMAF) are superior to the localization methods (STFT-ResNet, GCC-MLP, and AV-MLP). When SNR ≥ 0 dB, they maintain an MAE < 10° and an ACC > 70%. As the SNR degrades, the contribution of visual signals increases. For example, AV-MLP has a slightly higher localization accuracy (59.02%) than GCC-MLP (54.26%) at SNR = 10 dB, a gain of 4.76 percentage points, whereas the gain grows to 12.48 percentage points (40.40% vs. 27.92%) at SNR = −20 dB.
The same holds for the two tracking methods (A-CRNN and AV-CRNN). With visual cues, the tracking accuracy improves from 76.26% to 78.18% at SNR = 10 dB, and from 33.06% to 42.60% at SNR = −20 dB. Notably, the audio-visual localization method AV-MLP (ACC = 40.40%) outperforms the audio tracking method A-CRNN (ACC = 33.06%) at SNR = −20 dB, which indicates that video has a greater impact than the tracking mechanism under strong noise interference. In summary, incorporating visual and temporal information improves the system's robustness. As Table V shows, the proposed CMAF always achieves the best results, with absolute accuracy improvements of 5.82% and 3.62% over AV-MLP and AV-CRNN, respectively, at SNR = −20 dB.

G. Feature Visualization
One of the reasons that DL-based methods are superior to SP-based methods is the encapsulation of the feature extraction stages into a learning framework. Thus, we illustrate in Fig. 8 the t-SNE visualization [61] of the feature representations extracted from the penultimate layer of the speaker DoA classifier. The color gradient from purple to yellow corresponds to the DoA range from 1° to 360°. It should be noted that since the azimuth is cyclic, the DoA estimates near 1° and 360° are spatially close. Moreover, since we treat speaker localization as a regression problem on discretized DoA labels, an ideal feature distribution should have high inter-class variance and low intra-class variance.

Fig. 8. t-SNE visualization [61] of feature representations extracted from the penultimate layer of the DoA classifier, given the same test-set inputs to the different methods. The color gradient from purple to yellow corresponds to the DoA range from 1° to 360°. The top row shows the audio-only baselines [28], [29], [32], and the bottom row shows the audio-visual baseline [9] and two of our proposed methods, i.e., AV-CRNN and CMAF.

Fig. 8(a) shows the feature representation of the GCC-MLP network [28]. Given the concatenated multi-channel GCC-PHAT as input, the network successfully distinguishes different DoA classes in most cases. However, some features of distinct classes gather around the origin, where they are not informative for the classifier. Fig. 8(b) visualizes the features from STFT-ResNet [29], where we observe the same limitation as in Fig. 8(a): the network fails to separate the distinct-class features located around the origin. Fig. 8(c) uses the CRNN module to incorporate a tracking mechanism. By considering the feature variations among neighboring frames, the model successfully disambiguates the inter-class features gathered around the origin. Fig. 8(d) shows the use of audio-visual cues for DoA classification. We observe that the features are better clustered than in Fig. 8(a), although some distinct-class features are still distributed around the origin. Fig. 8(e) corresponds to the contrastive AV-CRNN network, which considers the temporal dependency among audio-visual features. The feature distribution is similar to that of Fig. 8(c). For our proposed CMAF in Fig. 8(f), features of the same class are cohesive, while features of distinct classes are well separated. Moreover, the DoA features of neighboring classes are spatially close, leading to a smooth temporal transition between adjacent DoAs, which is beneficial to the tracking task.
In summary, the t-SNE visualizations in Fig. 8 corroborate the numerical results in Table IV. First, visual cues improve the audio-only SSL performance. Second, incorporating CRNN-based tracking helps to overcome the inter-class ambiguity. Finally, our proposed CMAF yields the best feature distribution, with tightly clustered intra-class features and well-separated inter-class features, in which the distances between features of different classes reflect their spatial DoA difference.

H. Limitation and Future Work
The collected AVRI dataset is device-specific. Our proposed method, which is trained on AVRI, is therefore limited to the specific recording setup, i.e., a Kinect with a 4-channel ReSpeaker microphone array mounted on top of it (Fig. 5). For a different sensor setup, recordings need to be re-collected and the model retrained. Nevertheless, in theory, as long as the same sensor setup is used, our method can adapt to different room conditions and robotic platforms.
Future work includes DL-based multi-speaker localization and tracking, in particular, how to disentangle the identity-specific features from the multi-modal inputs. For observation-to-track assignment, the permutation invariant training (PIT) mechanism [62] used in speech separation is a potential direction. Different room conditions and robotic platforms should be investigated as well. Moreover, how to handle non-tracked point noise sources needs to be explored. Last but not least, deploying an end-to-end network is another promising direction for real-world robotic applications.

VII. CONCLUSION
Multi-modal processing endows the robot with a higher capability of scene understanding. Despite the success of deep learning, most existing multi-modal localization works still rely on SP techniques, and their development is hindered by the lack of datasets, the complex coordinate transformations among heterogeneous sensors, and the scarcity of open-source algorithms. We consider our work to be of significant importance in the field of audio-visual speaker location estimation. Specifically, we contributed a newly annotated multi-modal dataset that enables the exploration of DL techniques. Moreover, we proposed a CMAF framework as the first attempt at DL-based audio-visual speaker DoA localization and tracking. The introduced CMAF module incorporates self-attention and cross-attention to jointly explore the intra- and inter-modality relations for more accurate and robust speaker tracking. The experimental results demonstrate the superiority of CMAF over the other methods. We will make the dataset and the source code publicly available.