The Neural-SRP Method for Universal Robust Multi-Source Tracking

Neural networks have achieved state-of-the-art performance on the task of acoustic Direction-of-Arrival (DOA) estimation using microphone arrays. Neural models can be classified as end-to-end or hybrid, each class showing advantages and disadvantages. This work introduces Neural-SRP, an end-to-end neural network architecture for DOA estimation inspired by the classical Steered Response Power (SRP) method, which overcomes limitations of current neural models. We evaluate the architecture on multiple scenarios, namely, multi-source DOA tracking and single-source DOA tracking in the presence of directional and diffuse noise. The experiments demonstrate that our proposed method compares favourably in terms of computational and localization performance with established neural methods on various recorded and simulated benchmark datasets.

Direction-of-Arrival (DOA) estimation uses the signals from a microphone array to estimate the angular position of one or more active sound sources relative to the array. Applications include event detection [1], [2], [3], camera steering [4] and sound source separation [5], [6], [7]. Although many classical, signal-processing-based methods such as Multiple Signal Classification (MUSIC) [8], Estimation of Signal Parameters via Rotational Invariance Techniques (ESPRIT) [9] and SRP [10], [11] have been extensively explored over the last decades, state-of-the-art localization performance is now usually obtained using deep learning methods [12], where a neural network model is trained to estimate the location of the desired sources using a feature representation of the multi-channel microphone signals.
Neural DOA estimators can be classified according to their input features as Time/Frequency (T/F) or hybrid. T/F networks (e.g., DOANet [13]) typically process features such as the multichannel Short Time Fourier Transform (STFT), Generalized Cross-Correlation with Phase Transform (GCC) or the raw audio signal. A disadvantage of these networks is their inflexibility to the microphone geometry, i.e., the number of microphones and their respective positions in the array. This requires retraining for each array geometry, a cumbersome task which limits their off-the-shelf usage as a general tool. It also requires companies providing multiple array geometries within their line of products, such as voice assistants, to maintain multiple training pipelines. In contrast, current hybrid networks (e.g., Cross3D [14]) overcome this limitation by processing an input feature set that is independent of the number of microphone channels and their geometry, typically obtained using a classical signal processing DOA estimator such as the SRP method, which will be described in Section II-A. A drawback of this approach is that it inherits the limitations of the underlying DOA estimator, such as an assumption of anechoic propagation and a lack of robustness to directional noise sources.
The main contribution of this work is Neural-SRP, a T/F neural localization method which overcomes the limitations of previous models. Table 1 shows a qualitative comparison of Neural-SRP with respect to Cross3D [14] and DOANet [13], arguably the literature's most established single and multi-source DOA estimation models. Unlike DOANet, Neural-SRP is causal, and therefore applicable to real-time applications, and universal, and therefore applicable to arbitrary microphone geometries. In addition, unlike the Cross3D method, Neural-SRP is able to localize multiple sources simultaneously, as illustrated in Fig. 1. Finally, the proposed network is significantly smaller than the baselines. Code for the Neural-SRP architecture that can reproduce the experiments in this paper is available on GitHub.
Geometric independence is achieved by the introduction of two concepts, pairwise processing and metadata fusion. The former is inspired by the conventional SRP method, where a local feature is extracted for each microphone pair and the local features are then summed to create a global feature. By providing the network with the microphone positions using a metadata fusion procedure, it is able to produce an encoded pairwise spatial likelihood map. After summation, the global feature is decoded to estimate the sources' locations.
This paper continues as follows. Section I presents the signal model which will be used throughout this work. Section II presents a literature review of relevant neural methods for Sound Source Localization (SSL), followed by a description of the conventional SRP method, from which our model takes inspiration. Section III describes our proposed model, followed by our experimental validation in Section IV. The results are discussed in Section V.

I. PROBLEM DEFINITION AND SIGNAL MODEL
We define a 3-dimensional Cartesian system of coordinates centred at the position of a microphone array containing M microphones, whose known positions at discrete time index t are

$$V(t) = \{v_1(t), \ldots, v_M(t)\}, \qquad v_m(t) = [v_m^x(t) \;\; v_m^y(t) \;\; v_m^z(t)]^T.$$

The goal of a DOA estimator is to provide an estimate of the set of positions $U(t) = \{u_1(t), \ldots, u_N(t)\}$, where $u_n$ is defined analogously to $v_m$, of the N active sound sources at time t. Each microphone m receives a signal frame of length L,

$$x_m(t) = \sum_{n=1}^{N} h_{nm}(t) * s_n(t) + \epsilon_m(t),$$

where the convolution operator is represented by $*$, $h_{nm}(t) \in \mathbb{R}^R$ is the Room Impulse Response (RIR) vector of length R between source n and microphone m at time t, $s_n(t)$ is the signal frame emitted by source n at time t, and $\epsilon_m(t)$ is the noise term. In the case of Gaussian sensor noise, $\epsilon_m(t) \sim \mathcal{N}(0, \sigma_m^2 I)$, where $\sigma_m$ controls the Signal-to-Noise Ratio (SNR). In the case of a directional noise source, such as a fan, the noise term is defined as the impulse response $h_{\epsilon m}$ convolved with a random signal $\epsilon \sim \mathcal{N}(0, I)$ and scaled by a factor $\sigma_m$. Note that the noise impulse response is not time-dependent, as we assume directional noise sources to remain spatially stationary at an unknown position $u_\epsilon$.
Although the sources can be located anywhere in the room, we are interested in their DOA, which we represent as a point on the unit sphere, i.e., $\|u_n(t)\| = 1$. DOAs are also often represented as two angles, namely, azimuth and elevation. The azimuth is the angle between the x axis and the projection of $u_n(t)$ onto the horizontal xy plane, whereas the elevation is the angle between $u_n(t)$ and the xy plane itself.
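The signal model above can be illustrated with a short sketch for a single microphone. The function name `received_frame` and its arguments are hypothetical, the noise is assumed to be Gaussian sensor noise, and the convolution output is truncated to the frame length for simplicity.

```python
import numpy as np

def received_frame(source_frames, rirs, noise_std, rng):
    """Sum of each source frame convolved with its RIR, plus Gaussian sensor noise."""
    frame_len = len(source_frames[0])
    x = np.zeros(frame_len)
    for s_n, h_nm in zip(source_frames, rirs):
        # Convolve the source with the RIR towards this microphone,
        # keeping only the first frame_len samples (a simplification).
        x += np.convolve(s_n, h_nm)[:frame_len]
    return x + noise_std * rng.standard_normal(frame_len)
```

Calling the function once per microphone, each with its own set of RIRs, yields the multichannel frame used by the localization methods discussed in this paper.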
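As a minimal sketch of the two DOA representations, the conversion between a direction vector and the (azimuth, elevation) pair defined above can be written as follows. The function names are illustrative and angles are in radians.

```python
import math

def doa_to_angles(u):
    """Return (azimuth, elevation) in radians for a 3-D direction vector u."""
    x, y, z = u
    azimuth = math.atan2(y, x)                    # angle from the x axis within the xy plane
    elevation = math.atan2(z, math.hypot(x, y))   # angle between u and the xy plane
    return azimuth, elevation

def angles_to_doa(azimuth, elevation):
    """Return the unit vector corresponding to an azimuth and elevation (radians)."""
    return (
        math.cos(elevation) * math.cos(azimuth),
        math.cos(elevation) * math.sin(azimuth),
        math.sin(elevation),
    )
```

The round trip `doa_to_angles(angles_to_doa(az, el))` recovers the angles for elevations strictly between -90 and 90 degrees; at the poles the azimuth is undefined.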

II. PRIOR ART
Many approaches have been developed for the task of DOA estimation in the last decades. Arguably, the most established signal-processing-based approach is the Steered Response Power (SRP) method, which was shown to be applicable to realistic scenarios containing noise and reverberation [15]. On the other hand, neural approaches have achieved state-of-the-art performance at the cost of higher computation and limited generalizability to unseen scenarios, a limitation which is overcome by our proposed method. The following sections provide a review of the SRP method and neural network approaches for DOA estimation.

A. STEERED RESPONSE POWER
The main idea behind the SRP method [10], [11] is to map the temporal cross-correlation between a pair of microphone signal frames $(x_i(t), x_j(t))$, as well as their associated microphone positions, into a Spatial Likelihood Function (SLF) [16], which associates a value $\mathrm{SRP}_{ij}(p)$ with each candidate location $p = [p_x \;\; p_y \;\; p_z]^T$ and is maximized at the true source locations. Note that the index t is omitted hereafter for conciseness. The pairwise SRP for a candidate location p is defined as [10], [11]

$$\mathrm{SRP}_{ij}(p; x_i, x_j, v_i, v_j) = (x_i \star x_j)(\tau_{ij}(p)), \tag{3}$$

where the cross-correlation, represented by $\star$, between frames $x_i$ and $x_j$ is evaluated at the theoretical Time-Difference-of-Arrival (TDOA)

$$\tau_{ij}(p) = \left\lfloor \frac{f_s}{c} \left( \|p - v_i\| - \|p - v_j\| \right) \right\rceil, \tag{4}$$

the difference, in samples, between the propagation delays from a source at p to the microphones located at $v_i$ and $v_j$, rounded to the nearest integer. The speed of sound is c and $f_s$ is the system's sampling frequency. In practice, GCC [17] is commonly used instead of classical temporal cross-correlation. Finally, the global SRP is defined as the sum of all SRP pairs,

$$\mathrm{SRP}(p; \{x_1, \ldots, x_M\}, \{v_1, \ldots, v_M\}) = \sum_{i=1}^{M} \sum_{j=i+1}^{M} \mathrm{SRP}_{ij}(p; x_i, x_j, v_i, v_j). \tag{5}$$

This represents the likelihood of a source being located at a candidate point p, and the source location is estimated as

$$\hat{p} = \arg\max_{p} \mathrm{SRP}(p). \tag{6}$$

Other functions more flexible than the peak-picking in (6) can be used for the case of multiple active sources, such as peak subtraction [18] or sparsity-based [19] techniques. Lower computational cost can be achieved through the use of volumetric SRP variations [20], [21]. Also, the robustness of SRP can be improved in the case of moving sources or microphones by the inclusion of tracking algorithms [22], [23]. However, due to its formulation, the SRP method may exhibit multiple peaks in reverberant environments or in the presence of directional noise sources, as can be seen in Fig. 2.
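The pairwise-then-sum structure of SRP can be sketched in a few lines. This is an illustrative implementation using plain temporal cross-correlation rather than GCC, with the TDOA rounded to the nearest sample. The function names and the default speed of sound of 343 m/s are assumptions.

```python
import numpy as np

def tdoa_samples(p, v_i, v_j, fs, c=343.0):
    """Theoretical TDOA, in samples, between microphones at v_i and v_j for a source at p."""
    return int(round(fs / c * (np.linalg.norm(p - v_i) - np.linalg.norm(p - v_j))))

def srp(candidates, frames, mic_positions, fs, c=343.0):
    """Sum the pairwise cross-correlations evaluated at each candidate's TDOA."""
    n_mics, frame_len = frames.shape
    power = np.zeros(len(candidates))
    for i in range(n_mics):
        for j in range(i + 1, n_mics):
            # Full cross-correlation; index frame_len - 1 corresponds to zero lag.
            cc = np.correlate(frames[i], frames[j], mode="full")
            for k, p in enumerate(candidates):
                tau = tdoa_samples(p, mic_positions[i], mic_positions[j], fs, c)
                power[k] += cc[frame_len - 1 + tau]
    return power
```

The estimated location is then `candidates[np.argmax(power)]`. The sketch assumes the array is compact enough that every TDOA is smaller than the frame length.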

B. NEURAL NETWORKS FOR SSL
Neural networks have been widely applied to the task of DOA estimation using a centralized microphone array [12]. Multiple architectures have been proposed, including Convolutional Neural Networks (CNNs) [24], Multi-layer Perceptrons (MLPs) [25] and Convolutional Recurrent Neural Networks (CRNNs) [13]. They can also be classified by their output strategy, namely, regression or classification [26]. Finally, networks can be classified according to the input feature used, such as the complex-valued multichannel STFT [27], its phase [24], or the GCC between all microphone pairs [13], [25]. If the input feature consists of the output of a classical signal processing method, such as the SRP maps shown in Fig. 2, we classify the network as hybrid. Otherwise, we classify it as T/F. In [28], the concept of a dual-input neural network capable of jointly processing signals and metadata, such as the microphone positions, room dimensions and reverberation time, was introduced for the task of positional SSL, allowing a T/F neural model to operate on distributed microphone arrays of unseen geometries, but with a fixed number of microphones. This constraint is removed in [29], where a spatial approach involving Graph Neural Networks (GNNs) is applied to the enhancement of SRP maps. In [30], an initial version of the Neural-SRP method is introduced for single-source positional SSL, where a network is trained to generate a likelihood map for each microphone pair. This work extends [29], [30] to the task of multi-DOA tracking. The remainder of this section focuses on the Cross3D [14] and DOANet [13] methods, which are respectively state-of-the-art hybrid and T/F models and which serve as comparison baselines for our work.
The Cross3D method was proposed by Diaz-Guerra et al. [14] for the application of single-source DOA tracking. Their method can be interpreted as an image processing network whose input is the 2D power map produced by the SRP method. The model's name is due to its architecture being a 3-dimensional causal CNN, where the three dimensions are azimuth, elevation and time. The authors show that the model can be trained on simulated data generated using the image source method [31] and tested on a dataset of real recordings. Recent work by the authors modified the approach to use icosahedral networks [32], significantly reducing the computational cost of Cross3D. Multi-source capabilities were also recently introduced in [33].
The DOANet method was proposed by Adavanne et al. [13] for tracking up to two simultaneous sound events. The main model used is a bidirectional CRNN [34]. The authors show that including tracking metrics defined in [35] significantly improved the model's performance. The output of the network consists of a vector of size 8, where the first 6 elements refer to the estimated source positions, and the last two represent the activity of each track, similar to a Voice Activity Detector (VAD).

III. NEURAL-SRP

A. INPUT FEATURE SET
The input feature of Neural-SRP consists of the GCC of all pairs of microphone signal frames $(x_i, x_j)$, defined as the L-sized Inverse Discrete Fourier Transform (IDFT) of the element-wise product of the normalized frequency-domain frames $X_i$ and $X_j$,

$$g_{ij} = \mathrm{IDFT}\left( \frac{X_i \odot X_j^{*}}{|X_i| \odot |X_j|} \right), \tag{7}$$

where $X_k = \mathrm{DFT}(x_k)$, $\odot$ is the element-wise product, $^{*}$ denotes complex conjugation, the division is element-wise and $|X_k|$ is the element-wise magnitude. Computing the GCC for all microphone pairs generates an input of shape $(M(M-1)/2, T, G)$, where T is the number of time frames and G is the number of central GCC delays used. Keeping only the central delays has the advantage of reducing the input size and removing delays which are larger than the maximum theoretical Time-Difference-of-Arrival (TDOA) for the microphone array. G is computed from the maximum theoretical TDOA over all microphone pairs $1 \le i < j \le M$, together with a parameter $G_0 \ge 0$ that increases the feature size beyond the maximum theoretical TDOA, which improves performance in practice [13]. This input feature is also used by the DOANet model. However, while the DOANet model jointly processes all input features using a single network, our proposed model processes each pairwise feature independently to create a summable encoded likelihood map, allowing the network to accept any number of microphone pairs as its input.
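A minimal sketch of the GCC computation for one microphone pair is given below. The function name `gcc_phat`, the small constant `eps` used to avoid division by zero, and the assumption that `n_central` is even are illustrative choices rather than details taken from the paper.

```python
import numpy as np

def gcc_phat(x_i, x_j, n_central, eps=1e-12):
    """GCC-PHAT between two frames, cropped to the n_central delays around zero lag."""
    X_i = np.fft.rfft(x_i)
    X_j = np.fft.rfft(x_j)
    cross = X_i * np.conj(X_j)
    cross /= np.abs(cross) + eps            # phase transform: keep the phase, discard the magnitude
    gcc = np.fft.irfft(cross, n=len(x_i))   # L-point inverse DFT
    gcc = np.fft.fftshift(gcc)              # move zero lag to index L // 2
    centre = len(x_i) // 2
    half = n_central // 2
    return gcc[centre - half : centre + half]
```

In the cropped output, zero lag sits at index `n_central // 2`. Stacking the result over all pairs and frames gives a tensor of the shape $(M(M-1)/2, T, G)$ described above, with `n_central` playing the role of G.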

B. ARCHITECTURE
The Neural-SRP network is divided into two sub-networks, namely, a pairwise network P and a global decoder D. The architecture is shown in Fig. 3 and is summarized as

$$\hat{U} = D\left( \sum_{i=1}^{M} \sum_{j=i+1}^{M} P(g_{ij}; v_i, v_j) \right). \tag{9}$$

The goal of P is to create an encoded and summable spatial likelihood feature for each signal pair, using the GCC $g_{ij}$ along with its respective microphone coordinates $(v_i, v_j)$. These features are then summed together, creating a global feature which is decoded by D to estimate a set of locations $\hat{U}$. The proposed method's name derives from the structural similarity between (9) and (5).
The pairwise network consists of a modified Convolutional Recurrent Neural Network (CRNN) architecture. The parameters of the pairwise network are shared across all pairs. Each pairwise GCC is first processed by a sequence of 2D convolutional blocks. To maintain causality, the kernel size in the time dimension is set to 1 and no pooling is applied in that dimension. Unit strides were used on convolutional layers. The resulting feature of shape $(T, C_c^0, C_c^1)$ is transformed into shape $(T, C_c)$ by flattening the last two dimensions, of sizes $C_c^0$, the number of output kernels, and $C_c^1$, the number of GCC bins after pooling.
To improve tracking performance, the resulting feature is then processed by a unidirectional Recurrent Neural Network (RNN) of type Gated Recurrent Unit (GRU) [36]. To produce a spatially-aware feature, the coordinates of each microphone in the pair are concatenated to each channel, and this feature is then transformed into an encoded likelihood map of shape $C_p$ through the application of another MLP. An interpretation of this step is 'steering' the feature produced by the RNN according to the direction of the segment connecting the microphone pair's positions. We refer to [28] for a detailed discussion of methods for incorporating metadata, namely microphone position information, to improve SSL methods.
The decoder D consists of two independent MLPs, as in the DOANet model. The first is an activity detector similar to a multichannel VAD, while the second outputs the N estimated locations. These outputs are implicitly related, in the sense that if the n-th activity detector indicates no activity, the values of the n-th estimated DOA should be ignored.
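The structure of the pairwise network followed by summation and decoding can be illustrated with a toy sketch. This is not the architecture of Fig. 4: the convolutional, recurrent and MLP layers are replaced by single random linear maps, and all names and sizes (`G`, `C_P`, `W_pair`, `W_dec`) are hypothetical. The sketch only shows how shared pairwise weights and a sum make the model accept any number of microphones.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
G, C_P = 64, 16                                    # GCC bins and encoded feature size (illustrative)
W_pair = rng.standard_normal((G + 6, C_P)) * 0.1   # shared weights of the toy pairwise network P
W_dec = rng.standard_normal((C_P, 3)) * 0.1        # weights of the toy decoder D

def pairwise_net(gcc, v_i, v_j):
    """Toy stand-in for P: fuse one GCC frame with the pair's coordinates."""
    fused = np.concatenate([gcc, v_i, v_j])        # metadata fusion by concatenation
    return np.tanh(fused @ W_pair)

def decoder(global_feature):
    """Toy stand-in for D: map the summed feature to a unit-norm DOA."""
    doa = global_feature @ W_dec
    return doa / np.linalg.norm(doa)

def neural_srp_forward(gccs, mic_positions):
    """Sum the encoded pairwise features over all pairs, then decode."""
    pairs = itertools.combinations(range(len(mic_positions)), 2)
    total = sum(pairwise_net(gccs[(i, j)], mic_positions[i], mic_positions[j]) for i, j in pairs)
    return decoder(total)
```

Because the same `pairwise_net` is applied to every pair and the results are summed, the forward pass works unchanged for 4, 6 or any other number of microphones, and the output does not depend on the order in which the pairs are visited.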

C. TRAINING
Both pairwise and global networks are jointly optimized using the network's output. In the following, we define the loss function for each temporal instant t and therefore omit this index. We define $U = [u_1 \ldots u_N]$ and $\hat{U} = [\hat{u}_1 \ldots \hat{u}_N]$ as the target and output DOA matrices respectively, where each column is a unit vector representing a true or estimated DOA. We also define z and $\hat{z}$, N-dimensional binary vectors which refer to the target and output activities. In the case where only a single source exists, the loss function is defined as

$$\mathcal{L}(u, \hat{u}, z, \hat{z}) = \alpha\, z\, \|u - \hat{u}\| + \beta\, \mathrm{BCE}(z, \hat{z}), \tag{10}$$

where the first term is the Euclidean localization error between the true and estimated DOA, weighted by the true activity so as to ignore silent frames, and the second term is the binary cross-entropy (BCE) between the true and estimated activities. The Euclidean error is employed in favour of the more interpretable angular error as previous works [14], [26] found it to yield better training results. The weighting factors α and β are hyperparameters.
To prevent the loss function from diverging when log(·) tends to −∞, we clamp its value at a constant B. When two or more sources are active, the training must take the assignment problem into account, so as not to penalize equivalent permutations of the targets and outputs [37]. This problem can be defined as finding the association matrix A, a permutation of the rows of the identity matrix of size N. The optimal A minimizes the multi-source localization error, defined as

$$\mathcal{L}_{\mathrm{loc}} = \frac{|A \odot D|}{|z|}, \tag{11}$$

where $[D]_{ij} = \|u_i - \hat{u}_j\|$ is the distance matrix between all target and output combinations, $\odot$ is the element-wise product, $|\cdot|$ is the matrix norm and $|z|$ is the number of active sources. Although A can be deterministically computed using the Hungarian algorithm [38], the latter is not differentiable, hindering its application for training the neural network using a backpropagation procedure. We solve this problem in the same manner as [13], where a neural network is used to approximate the Hungarian method, and then used for training.
The association matrix is also used for aligning the target and output activities z and ẑ, after which the binary cross-entropy function is applied for each entry.
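For a small number of sources, the assignment problem can be illustrated by an exhaustive search over permutations. This sketch is neither the Hungarian algorithm nor the differentiable neural approximation used for training. It assumes that all N sources are active, and the function name `best_assignment` is hypothetical.

```python
import itertools
import math

def best_assignment(targets, outputs):
    """Return (perm, mean_error), where outputs[perm[i]] is matched to targets[i]."""
    n = len(targets)
    best_perm, best_cost = None, math.inf
    for perm in itertools.permutations(range(n)):
        # Mean Euclidean distance between each target and its assigned output.
        cost = sum(math.dist(targets[i], outputs[perm[i]]) for i in range(n)) / n
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return best_perm, best_cost
```

For example, with targets `[(1, 0, 0), (0, 1, 0)]` and outputs listed in the opposite order, the function returns the permutation `(1, 0)`, so that the swapped outputs are not penalized. The cost of exhaustive search grows factorially with N, which is acceptable only for the two-source case considered here.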

IV. EXPERIMENTATION
Experiments consisted of training and evaluating Neural-SRP and the baselines on datasets of different complexities and characteristics, each serving the purpose of evaluating the method's performance in different conditions. Five datasets were used, three simulated and two recorded, which are described below. The Cross3D and DOANet baselines use the same architectural parameters and training procedures described in their respective original papers [13], [39].
The network parameters for Neural-SRP are summarized in Fig. 4, where the tensor output shapes are shown for each of the network's layers. Convolutional kernels of size (3,3) were used on all convolutional layers. Max pooling with a kernel size of 2 was applied to the GCC dimension after all but the last convolutional layer. Parametric Rectified Linear Unit (PReLU) activation was used for all of the network's layers, apart from the RNN and DOA MLP output, which used a Hyperbolic Tangent (TANH) activation, and the activity output layer, which used sigmoidal activation. This architecture was chosen empirically. All the networks were implemented using the PyTorch library. The Adam optimizer was used for training.
A rectangular grid of size 64 × 32 was used for SRP, where the first dimension represents azimuth and the second elevation. The same configuration was used for the generation of the input maps for the Cross3D baseline. The parameters used for the latter and the DOANet baseline were chosen similarly to those used in the respective original papers [13], [14].

A. EVALUATION METRICS
The main metric used for the single-source experiment was the Root Mean Square Angular Error (RMSAE) [40], defined for a pair of positions $(u, \hat{u})$ with azimuth and elevation $(\theta, \varphi)$ and $(\hat{\theta}, \hat{\varphi})$ respectively, as

$$E(u, \hat{u}) = \arccos^{2}\left( \cos\theta \cos\hat{\theta} + \sin\theta \sin\hat{\theta} \cos(\varphi - \hat{\varphi}) \right), \tag{12}$$

where (12) was averaged over all frames in the dataset. For multiple sources, the localization error is defined for each correctly detected source using the ground truth association matrix A. For the multi-source experiment, the detection metrics of precision, recall and the F1 score were also used, as defined in [13], [35]. These metrics are computed for each frame. As in the single-source experiments, the final metrics are obtained for the proposed method and baselines by averaging all frame metrics in the dataset.
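The evaluation metrics can be sketched as follows. The angular error is computed here from the dot product of unit DOA vectors, which is an equivalent way of obtaining the angle between two directions, and the detection scores follow the standard textbook definitions. The exact definitions used in [13], [35] may differ, and all function names are illustrative.

```python
import math

def angular_error(u, u_hat):
    """Angle in radians between two unit DOA vectors."""
    dot = sum(a * b for a, b in zip(u, u_hat))
    return math.acos(max(-1.0, min(1.0, dot)))   # clip to guard against rounding

def rmsae_degrees(targets, estimates):
    """Root mean square of the per-frame angular errors, in degrees."""
    squared = [angular_error(u, v) ** 2 for u, v in zip(targets, estimates)]
    return math.degrees(math.sqrt(sum(squared) / len(squared)))

def detection_scores(tp, fp, fn):
    """Precision, recall and F1 from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Note that the F1 score computed this way is the harmonic mean of precision and recall.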

B. DATASETS
The first three experiments were performed using simulated datasets, which we refer to as SimSW, SimDirect and SimRandMic. All datasets contain samples of a source moving in a 3-dimensional sinusoidal trajectory inside a cuboid-shaped reverberant room containing a compact stationary microphone array. The trajectories were generated by randomly selecting a start and end point inside the room, followed by randomly assigning a 3-dimensional vector referring to the frequency of oscillations within each direction. Finally, a second 3-dimensional vector is randomly generated representing the amplitude of each direction's oscillation. As in [14], the simulated datasets follow an "infinite-style" paradigm, meaning acoustic scenarios are randomly generated using the image source method [31] during training, i.e., no data is stored. The duration of each sample is 20 s. The ranges of the parameters are shown in Table 2. The sampling rate used for the simulations was 16 kHz. Both the first and second datasets, named SimSW and SimDirect, use the pseudo-spherical array geometry of the NAO robot as described in the LOCATA dataset [41]. SimSW and SimDirect differ in the type of noise used, respectively, spatially white (SW) sensor noise and directional noise. The goal of these datasets is to assess the robustness of the algorithms to different types of noise. For the third dataset, named SimRandMic, a random array geometry was generated for each dataset sample. The goal of this dataset was to assess the methods' generalizability to unseen microphone geometries.
The datasets were generated using the gpuRIR Python library [39], which can simulate audio recordings of cuboid-shaped, reverberant rooms including an arbitrary number of moving sources and microphone arrays. Simulating moving sources or microphones is a computationally expensive task, as high quality scenes are typically rendered by generating one RIR using the image source method between each source-microphone pair every few milliseconds, and auralizing audio signals by convolving the source signals and RIRs using the Overlap-Add strategy [39]. gpuRIR significantly reduces the computational time in comparison to other libraries such as Pyroomacoustics [42] by generating the RIRs on a Graphics Processing Unit (GPU).
In the SimSW and SimRandMic datasets, random Gaussian white noise is added to the auralized signals at the desired SNR, computed using (15). In the SimDirect dataset, a second source emitting random Gaussian noise is randomly placed at a static position inside the room. The auralized noise signal is added to the source signal by scaling it to the desired SNR, computed using the mean energy of both auralized signals across all frames. In the SimRandMic dataset, a spherical microphone array is generated for every sample, first by uniformly sampling its radius and number of microphones from the value ranges shown in Table 2, followed by randomly placing the microphones on the sphere's boundary. An utterance from the LibriSpeech dataset [43] is randomly chosen as the source signal for each dataset sample. Each epoch consists of a network pass through all of the LibriSpeech dataset, although different scenes are generated for each epoch.
We define $y_m(t) = x_m(t) - \epsilon_m(t)$ as the idealized noiseless signal frame received at microphone m. We define the average array-wide power of all signal frames, $p_y$, as the power of $y_m(t)$ averaged over all microphones and frames, where the ideal binary voice activity detector z is used to ignore silent frames. The array-wide power of each noise frame, $p_\epsilon$, is defined analogously. We compute the array-wide spatially white SNR as

$$\mathrm{SNR}_{sw} = 10 \log_{10} \frac{p_y}{p_\epsilon}. \tag{15}$$

The LOCATA dataset [41] was released as part of the 2018 IEEE AASP Challenge on acoustic source LOCalization And TrAcking. It consists of 6 tasks of increasing complexity. In this work, we select tasks 1, 3 and 5, namely, static source, moving source and moving microphone localization. The dataset provides recordings from multiple microphone arrays. In this work, we only use recordings originating from the NAO robot, which contains a pseudo-spherical 12-channel microphone array. The goal of this dataset is to assess the performance of the algorithms in a real environment, as well as their ability to generalize to a real environment through training on simulated data. The sampling rate of the dataset is 48 kHz.
The TAU-NIGENS Spatial Sound Events dataset [44] was originally released for the Sound Event Localization and Detection task of the 2021 Detection and Classification of Acoustic Scenes and Events (DCASE) challenge. It was generated by filtering source signals from the NIGENS Sound Events database [45] using time-varying RIRs recorded in 13 different rooms of Tampere University, Finland. These RIRs were recorded using a 32-channel Eigenmike spherical microphone array and a Genelec G Three loudspeaker. Instead of providing the full 32-channel recordings, equivalent compressed 4-channel tetrahedral signals are provided. The dataset is subdivided into 400 training, 100 validation and 100 testing 1-minute recordings of up to two simultaneous moving sources. The samples may be corrupted by directional, moving interference emitting signals belonging to a noise class from the NIGENS database. The goal of this dataset is to assess the performance of the Neural-SRP method for tracking multiple sources. The sampling rate of the dataset is 24 kHz.

All values are expressed in degrees. LOCATA (O) and (D) are the results following training using SimSW and SimDirect respectively. Both entries in the aforementioned columns show the same value for SRP, as it is not trained.
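One possible reading of the trajectory generation described above is a straight line between the start and end points with a sinusoidal oscillation added on each axis. The sketch below follows that assumption. The function name, the normalization of time to [0, 1] and the absence of any check that the trajectory stays inside the room are all simplifications.

```python
import math

def sinusoidal_trajectory(start, end, freqs, amps, n_points):
    """Straight line from start to end with a sinusoidal oscillation added on each axis."""
    points = []
    for k in range(n_points):
        s = k / (n_points - 1)   # normalized time in [0, 1]
        points.append(tuple(
            a + s * (b - a) + amp * math.sin(2 * math.pi * f * s)
            for a, b, f, amp in zip(start, end, freqs, amps)
        ))
    return points
```

In a data generator, `start`, `end`, `freqs` and `amps` would be drawn at random for every sample. With integer oscillation frequencies, the trajectory begins and ends exactly at the chosen start and end points.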

C. EXPERIMENT 1: SPATIALLY WHITE NOISE
In this experiment, we evaluate the performance of Cross3D, Neural-SRP and conventional SRP in the presence of independent White Gaussian Noise (WGN) added to each sensor. The neural models are trained for 80 epochs using a learning rate of $10^{-4}$. As in [39], we use a frame size of 256 ms and a hop size of 192 ms. The noise signals are generated with unit variance, then scaled to the randomly selected $\mathrm{SNR}_{sw}$ by inverting (15), and then added to the noiseless received signals. Both networks are trained using the range of SNRs defined in Table 2, and tested using a simulated dataset of unseen source signals from the LibriSpeech test set, as well as the unseen LOCATA dataset. The results are shown in Table 3. We also report the dependence of the localization error on reverberation and noise on the test dataset, as shown in Fig. 5.
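Scaling the noise to a target SNR can be sketched as below, assuming the SNR is expressed in decibels as ten times the base-10 logarithm of the ratio between signal and noise power. The function name `noise_gain` is hypothetical, and the averaging of powers over active frames and microphones is left to the caller.

```python
import math

def noise_gain(signal_power, noise_power, snr_db):
    """Amplitude gain for the noise so that 10*log10(signal_power / scaled_noise_power) == snr_db."""
    target_noise_power = signal_power / (10 ** (snr_db / 10))
    return math.sqrt(target_noise_power / noise_power)
```

For instance, with a signal power of 4, unit-variance noise and a target of 20 dB, the gain is 0.2, since the scaled noise power of 0.04 is one hundredth of the signal power.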

D. EXPERIMENT 2: DIRECTIONAL NOISE
In this experiment, the Neural-SRP method is applied to the task of localizing a single speech source in a directional noise scenario, which is arguably more realistic than the diffuse case. For example, a directional noise source could be a fan or a washing machine. The main difference from the experiment described in Section IV-C is that, instead of adding independent noise to each microphone, the noise is itself modeled as a source in the room. In other words, for each training sample, the interferer is randomly placed within the room, with the restriction of being at least one meter away from the source and array. Then, an RIR between the microphones and interferer is computed, which is then convolved with a random unit-variance Gaussian signal. Finally, the auralized result is scaled to the randomly assigned SNR in the same manner as (15). The results are shown in Table 3.

E. EXPERIMENT 3: TESTING ON AN UNSEEN GEOMETRY
To assess the proposed model's ability to generalize to unseen microphone geometries, we trained it using a dataset of multiple microphone array geometries, while testing it on the microphones mounted on the NAO robot head of the LOCATA dataset, a geometry which is unseen in the training dataset. Although the Cross3D method can theoretically be trained using variable microphone geometries, we were unable to train it using the SimRandMic dataset, as the initialization of the SRP method proved prohibitively costly, resulting in each epoch taking several hours on a GPU-enabled server. As a means of comparison, we use conventional SRP, as well as Neural-SRP trained using the SimSW dataset, i.e., a matched array geometry. The results can be seen in Table 4.

F. EXPERIMENT 4: MULTI-SOURCE TRACKING
In this experiment, the Neural-SRP method is applied to the task of multi-source tracking. We compared our method to the state-of-the-art DOANet model with parameters described in [13] on the TAU-NIGENS dataset. The network was trained three times for 80 epochs using a learning rate of $10^{-4}$. The average localization error was computed on the validation dataset at the end of each epoch, and the network weights that obtained the lowest validation localization error were used for evaluating the unseen test set. The results are shown in Table 5, where each value is the metric averaged over the training rounds. The metrics used were the localization error in degrees, for true positive matches, as well as classical tracking metrics, namely, precision, recall, and the F1 score, defined as the harmonic mean of the two aforementioned scores. An example output of Neural-SRP successfully tracking two simultaneous sources is shown in Fig. 1. As in [13], a frame of size 20 ms with a hop size of 10 ms was used.

G. COMPLEXITY COMPARISON
In this section, the complexity of the proposed Neural-SRP model and baselines is presented in terms of number of parameters, computational time and number of Floating-Point Operations (FLOPS) for microphone array sizes {4, 8, 12}. The number of FLOPS is obtained through the use of the THOP Python library. This library is not compatible with the SRP, so we compute the complexity of the latter theoretically, as in [46]. The inference time is measured as the clock time taken for the model to produce an output for an input stimulus with a duration of one second. These results were obtained using a 16 GB MacBook Pro with an M1 chip and are shown in Table 6.

V. DISCUSSION AND ANALYSIS
The single-source experiments summarized in Table 3 show that Neural-SRP obtains favourable results in comparison to the Cross3D method both in the spatially white and directional noise scenarios, despite using a significantly smaller and more computationally efficient model. Like Cross3D, Neural-SRP can be trained using simulated data and tested using real recordings, as seen in the LOCATA results in Table 3. This is remarkable as, unlike Cross3D, Neural-SRP is required to learn its own spatial representation of sound. In other words, Neural-SRP is able to generalize despite having a less stringent inductive bias. Another relevant remark is that, unlike SRP, Neural-SRP and Cross3D were able to eliminate the effect of a directional noise source, which typically manifests as an additional peak in the GCC (and therefore SRP) features.
The dependence of the localization error on reverberation and SNR is shown in Fig. 5. The error of SRP increases significantly with high reverberation and low SNR, whereas Neural-SRP's error increases less in those conditions. Fig. 5 also shows consistent incremental gains of Neural-SRP in comparison to the Cross3D baseline throughout all reverberation times and SNRs.
As shown in Table 4, Neural-SRP can be trained on a set of microphone geometries and tested on an unseen microphone geometry with only a small reduction in localization performance. This reduction is however expected, as the prolate spheroid (American football) geometry of the NAO array is not represented in the training dataset.
Turning to the multi-source experiment shown in Table 5, Neural-SRP achieves improved localization performance in comparison to the DOANet method, as well as comparable tracking metrics. An important remark is that this increased performance is achieved despite the fact that DOANet has access to non-causal frame information, as it employs a bidirectional RNN, which also incurs a greater number of parameters. A possible explanation for this increased performance is that the Neural-SRP pairwise architecture is more parameter-efficient than DOANet's global architecture, which has to employ neurons to replicate information for each pair.
Finally, as shown in Table 6, Neural-SRP uses significantly fewer parameters than the other neural baselines, namely, over 6 times fewer parameters than Cross3D and a little over half as many as DOANet. In terms of computational complexity, Neural-SRP is positioned in between Cross3D and DOANet, being at least 3 times faster than the former, and showing comparable performance to the latter in the case of a 4-microphone array. The proposed method's increase in computational cost is due to its pairwise formulation, which introduces a quadratic dependence on the number of microphones M, as the number of microphone pairs is M(M − 1)/2. However, this pairwise formulation also introduces flexibility, as microphone selection procedures such as [47] can be applied to reduce the number of pairs. The pairwise formulation also allows for distributed computing and only requires pairs to be synchronized, which is of particular relevance when using a distributed microphone network [48].
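The growth in the number of microphone pairs for the array sizes considered in the complexity comparison can be checked with a short sketch. The helper name `microphone_pairs` is illustrative.

```python
from itertools import combinations

def microphone_pairs(n_mics):
    """All unordered microphone index pairs (i, j) with i < j."""
    return list(combinations(range(n_mics), 2))

for m in (4, 8, 12):
    print(m, len(microphone_pairs(m)))   # 6, 28 and 66 pairs respectively
```

Tripling the array size from 4 to 12 microphones therefore multiplies the number of pairwise network evaluations by 11, which is the cost that a pair selection procedure would aim to reduce.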

VI. CONCLUSION
We have presented Neural-SRP, a state-of-the-art localization neural network which is able to overcome limitations of previous neural methods. Besides providing incremental gains in terms of localization performance, Neural-SRP is causal and shows a low computational complexity. Finally, Neural-SRP is the first method that has been shown to work on unseen array geometries.
Future research directions include exploring microphone pair selection methods which may further reduce cost without significantly affecting performance, and extending to locating three or more simultaneous sources.

FIGURE 1. Example of Neural-SRP's output when tracking two moving sources. The panel shows the target and predicted azimuth and elevations.

FIGURE 2. SRP maps generated for a simulated cuboid room containing one microphone array in its centre, as well as a source: (a) ideal, (b) reverberant, and (c) noisy scenarios. The arrows point to the true source and interferer locations.

FIGURE 4. Detailed view of Neural-SRP architecture, where the numbers show the output dimension of each layer. The dotted line separates the pairwise network P from the global decoder D, which receives the sum of pairwise features as its input. The input layer consists of T frames of 64 central GCC bins each. The mic. coords. layer is of shape (T, 6), where the three coordinates for each of the microphones in the pair are replicated for all frames.

FIGURE 5. Localization error comparison between Neural-SRP, Cross3D and SRP for increasing levels of reverberation and SNR. The curves were smoothed using cubic interpolation.