Parametric Ambisonic Encoding of Arbitrary Microphone Arrays

This article proposes a parametric signal-dependent method for the task of encoding microphone array signals into Ambisonic signals. The proposed method is presented and evaluated in the context of encoding a simulated seven-sensor microphone array, which is mounted on an augmented reality headset device. Given the inherent flexibility of the Ambisonics format, and its popularity within the context of such devices, this array configuration represents a potential future use case for Ambisonic recording. However, due to its irregular geometry and non-uniform sensor placement, conventional signal-independent Ambisonic encoding is particularly limited. The primary aims of the proposed method are to obtain Ambisonic signals over a wider frequency bandwidth, and at a higher spatial resolution, than would otherwise be possible through conventional signal-independent encoding. The proposed method is based on a multi-source sound-field model and employs spatial filtering to divide the captured sound-field into its individual source and directional ambient components, which are subsequently encoded into the Ambisonics format at an arbitrary order. It is demonstrated through both objective and perceptual evaluations that the proposed parametric method outperforms conventional signal-independent encoding in the majority of cases.

playback setup, dictate the listener's perception of the spatial sound scene in the desired manner. Examples of this channel-based workflow include employing binaural microphones for headphone playback, and multi-microphone arrangements for stereo [1] and surround loudspeaker formats [2]-[4]. However, such approaches may be considered inflexible, as there is often no clear solution for reproducing a recording intended for one specific playback setup over a different playback setup, or account for a different listener head orientation in the case of binaural microphone array recordings.
Scene-based alternatives, on the other hand, aim to circumvent these limitations by describing the captured sound scene using a format that is independent of the array and playback setups. Perhaps the most widespread scene-based framework is the one popularised under the name of Ambisonics [5]. This refers to the two-step processing paradigm of: 1) employing a linear signal-independent mapping of the input microphone signals to intermediate spherical harmonic (SH) signals [6], often referred to as Ambisonic encoding; and 2) a linear mapping of these SH signals to the target binaural [7] or loudspeaker [8] setup, which is commonly referred to as Ambisonic decoding. Other linear signal-independent alternatives include beamforming designs that resemble head-related transfer functions (HRTFs) for headphone rendering [9], [10], or loudspeaker panning functions [11], [12]. However, contrary to Ambisonics, the decoding filters then need to be designed specifically for the particular recording array or device; or, alternatively, the array specifications may also be transmitted to the reproduction side. Since the Ambisonics framework has the benefit of decoupling the recording and the playback setups, it can afford greater practical flexibility and portability. Furthermore, spatial transformations, such as sound-field rotations [13], which are important for head-tracked virtual or augmented reality applications, are well defined and easily realised compared to other spatial audio formats.
The maximum spatial resolution afforded by a linear signal-independent Ambisonic workflow is, however, inherently limited by the number of microphones that comprise the array, since this dictates the maximum SH encoding order [6]. The Ambisonics format is also only truly portable in cases where the channel directivities (i.e. the SHs) are broad-band. However, when linearly encoding real microphone arrays, there are certain frequency-dependent limitations that affect this portability. These limitations are dictated by the array geometry and the placement of the microphones. For instance, there is a maximum frequency beyond which the SH directivities can no longer be obtained. This limit is often referred to as the spatial aliasing frequency [14], which is, in turn, also dependent on the SH order and degree. Furthermore, due to microphone sensor noise, regularisation of the encoding gains is required in practice, especially at lower frequencies and higher SH orders, which further limits the usable bandwidth of operation. Non-uniform arrangements of sensors and/or irregular array geometries also lead to direction-dependent differences in spatial resolution. This latter issue is the main motivation for why spherical microphone arrays (SMAs) with near-uniform sensor arrangements are more widely employed in practice. However, while there do exist commercial SMA offerings capable of capturing up to fourth-order SHs, such arrays are uncommon and are often expensive and/or offer higher-order components for only narrow frequency bandwidths. Therefore, the majority of commercially available SMAs often comprise four sensors arranged in an open tetrahedral fashion, and are thus limited to first-order SH acquisition. Perceptual studies investigating the coupling of lower-order linear encoding with linear Ambisonic decoding have reported the introduction of strong colourations, localisation inaccuracies, and a loss of perceived envelopment and spaciousness [15]-[19].

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
To overcome the perceptual limitations of a signal-independent low-order Ambisonics workflow, several signal-dependent alternatives for the decoding stage have been proposed. These alternatives operate by employing an assumed sound-field model and applying time-frequency domain processing techniques. Their intention is to map the input SH signals to the target playback format in an adaptive, signal-dependent, and often perceptually informed manner, in order to improve the perceived spatial accuracy of the reproduction. Directional Audio Coding (DirAC) [20] was the first proposed parametric decoding method, which operated on first-order SH signals as input. Its sound-field model assumes that the input scene may comprise a single plane-wave and/or an isotropic diffuse component per time-frequency tile. In practice, the method employs intensity-based analysis [21] to determine the plane-wave direction-of-arrival (DoA) and a diffuseness measure. Components that are analysed to be diffuse are routed to all channels of the target setup and subjected to decorrelation operations, whereas non-diffuse components are spatialised directly over the target setup through application of vector-base amplitude panning [22]. The DirAC model was then later extended to higher orders in [23], [24], to resolve multiple simultaneous plane-waves by partitioning the sound-field into directionally constrained sectors [25], [26].
Other parametric Ambisonic decoding methods include the High Angular Resolution Planewave Expansion (HARPEX) [27] approach, which operates on first-order SH signals and assumes a sound-field model comprising two plane-waves for each narrow-band frequency. By comparison, the Sparse-Recovery method [28] aims to resolve as few plane-waves as possible through an optimisation process, while ensuring that the sound scene is sufficiently described despite its sparse representation. The COding and Multi-Parameterisation of Ambisonic Sound Scenes (COMPASS) method [29] aims to resolve a time-variable number of plane-waves per frequency (based on source detection algorithms [30]). Along with extracting and spatialising the source components, the method also employs an additional directional ambient stream based on what remains after the source components are subtracted from the input sound-field. A similar model was also explored in [31], but with the addition of spatial post-filtering to improve the segregation of the source and directional ambient components. A linearly and quadratically constrained least-squares decoding solution was also proposed in [32], [33], which operated in a similar fashion to [24] but without the need for explicitly estimating a diffuseness parameter or requiring signal decorrelation.
It should be highlighted, however, that all of the parametric solutions mentioned thus far are intended to enhance only the decoding stage of the Ambisonics pipeline. Signal-dependent Ambisonic encoding, on the other hand, has seen far fewer developments, with existing proposals primarily focusing on extending SH acquisition beyond the spatial aliasing frequency of SMAs; for example, using a tetrahedral array in [34], and higher-order SMAs in [35]. A general solution was also proposed in [36], which employed a signal model and subsequent spatial filtering to divide the sound-field into its individual source and ambient components. The model is similar to the parametric decoding methods described in [29], [31], except that the intention was instead to enhance the SH signals directly on the capturing side, rather than later relying on a parametric decoding method to render linearly encoded SH signals to the playback setup. The method encoded the decomposed spatial components into SH signals, which then replaced the linearly encoded SH signals for frequency ranges where the linear signal-independent encoding was sub-optimal, as dictated by the objective evaluation metrics described in [37]. These existing signal-dependent encoding methods, however, all still impose the same maximum encoding order that would otherwise be dictated by the number of sensors associated with conventional linear encoding, and also considered only SMAs in their evaluations.
In general, Ambisonic encoding has primarily focused upon the use of SMAs, due to the practicality of mounting microphones on a sphere and its linear signal-independent encoding convenience [6]. However, with the Ambisonics format continuing to gain popularity, owing to its portability and flexibility, there may soon arise a need for Ambisonic recording to be integrated into devices where spatial sound capture is not their primary purpose; for example, in 360° video cameras, mobile phones, head-mounted displays (HMDs) and other wearables related to augmented reality applications [38]-[43]. While linear Ambisonic encoding for arbitrary microphone placements and mounting bodies is possible [44], it may be sub-optimal and limited in terms of its maximum order and usable bandwidth of operation, which would subsequently compromise the reproduction performance on the decoding side. Therefore, in this article, a general parametric encoding method is proposed, which draws influence from the COMPASS method described in [29], and the work of [36]. The primary novelty of the proposed method is in its general formulation, which allows it to cater to arbitrary array geometries and sensor placements, in order to obtain Ambisonic signals of higher order and over a wider frequency bandwidth than would otherwise be possible through a linear solution. The proposed method is also described and evaluated in the context of a case study, through the encoding of an array of seven microphones non-uniformly arranged over the irregular geometry of an HMD worn by a manikin. This particular sensor arrangement and array geometry represents a potential future scenario for Ambisonic recording, which would otherwise be especially limited by conventional linear signal-independent encoding.
This article is arranged as follows: Section II describes how arbitrary microphone arrays may be linearly encoded into SH signals, and how such an encoding may be objectively evaluated. The microphone array employed for this study is then described in Section III. The parametric signal model employed is detailed in Section IV. The spatial analysis and synthesis stages of the proposed method are then described in Section V and Section VI, respectively. Objective metrics and perceptual evaluations are detailed in Section VII, with the results and discussions provided in Section VIII. The article is then concluded in Section IX.

II. CONVENTIONAL LINEAR AMBISONIC ENCODING
It is assumed that the $Q$ input microphone array signals, $\mathbf{x}(t,f) \in \mathbb{C}^{Q \times 1}$, have first been transformed into the time-frequency domain, where $t$ denotes the down-sampled time index and $f$ denotes frequency. The conventional approach of encoding microphone array signals into $N$th-order Ambisonic signals $\mathbf{a}_{\mathrm{lin}} \in \mathbb{C}^{(N+1)^2 \times 1}$ may be described with the following linear signal-independent mapping

$$\mathbf{a}_{\mathrm{lin}}(t,f) = \mathbf{E}(f)\,\mathbf{x}(t,f), \tag{1}$$

where $\mathbf{E} \in \mathbb{C}^{(N+1)^2 \times Q}$ is a frequency-dependent matrix of encoding weights. For SMAs, analytical descriptions of the geometry and sensor directivities may be used to derive $\mathbf{E}$, and more information can be found in e.g. [6], [45]-[47]. However, for irregular geometries, such as the array employed for this present study, a general approach is required. Here, the directional characteristics of the array are described through a dense grid of $V$ array steering vectors, $\mathbf{A} = [\mathbf{a}(\gamma_1), \ldots, \mathbf{a}(\gamma_V)] \in \mathbb{C}^{Q \times V}$, which may be derived from numerical simulations or array measurements, where $\mathbf{a}(\gamma) \in \mathbb{C}^{Q \times 1}$ is the steering vector of the array for direction $\gamma$. The encoding matrix may be computed through a least-squares closed-form solution as [37], [44]

$$\mathbf{E}(f) = \frac{1}{V}\,\mathbf{Y}\mathbf{W}\mathbf{A}^{H}(f)\left[\mathbf{D}(f) + \beta\mathbf{I}_Q\right]^{-1}, \tag{2}$$

where $\mathbf{D}(f) = (1/V)\mathbf{A}(f)\mathbf{W}\mathbf{A}^{H}(f) \in \mathbb{C}^{Q \times Q}$ is the diffuse coherence matrix (DCM) of the array, $\mathbf{W} \in \mathbb{R}^{V \times V}$ is an optional diagonal weighting matrix to account for a non-uniform measurement grid, $\beta$ is a regularisation parameter, $\mathbf{I}_Q \in \mathbb{R}^{Q \times Q}$ denotes an identity matrix, and $\mathbf{Y} \in \mathbb{R}^{(N+1)^2 \times V}$ are the SH weights for all measurement directions. Since this encoding approach may lead to the attenuation of frequencies above the spatial aliasing limit $f_{\mathrm{al}}$, the aliased frequencies may be optionally diffuse-field equalised to retain a flat magnitude response on average, as per (3) and as described in [48] and also recommended in the original sound-field microphone report by Gerzon [49], where $\mathrm{Diag}[\cdot]$ denotes constructing a diagonal matrix based on the diagonal elements of the enclosed square matrix. The spatial aliasing frequency limit of the array may be specified based on analytical formulae in the case of SMAs, or, in the general case, through observation of the encoding performance metrics described in the following subsection.
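As a rough illustration of the least-squares design, the following NumPy sketch computes an encoding matrix for a single frequency bin from a grid of steering vectors. The function name `ls_encoder`, the default value of `beta`, and the exact placement of the regularisation term are assumptions made for illustration only; the steering vectors and SH weights in the usage lines are random placeholders rather than data from a real array.

```python
import numpy as np

def ls_encoder(A, Y, w=None, beta=1e-2):
    """Regularised least-squares encoder for one frequency bin (sketch).

    A    : (Q, V) complex steering vectors for the V grid directions
    Y    : (n_sh, V) real SH weights for the same directions
    w    : (V,) optional integration weights (uniform if None)
    beta : regularisation amount (placeholder value, not from the article)
    """
    Q, V = A.shape
    w = np.ones(V) if w is None else np.asarray(w, dtype=float)
    D = (A * w) @ A.conj().T / V        # diffuse coherence matrix, Q x Q
    rhs = (Y * w) @ A.conj().T / V      # (1/V) Y W A^H, n_sh x Q
    M = D + beta * np.eye(Q)            # regularised DCM
    # Solve E M = rhs for E without forming an explicit inverse.
    return np.linalg.solve(M.T, rhs.T).T

# Usage with placeholder data: 7 sensors, 64 directions, first order (4 SHs).
rng = np.random.default_rng(0)
A = rng.standard_normal((7, 64)) + 1j * rng.standard_normal((7, 64))
Y = rng.standard_normal((4, 64))
E = ls_encoder(A, Y)
a_lin = E @ (A[:, 0] * 1.0)             # encode a unit plane wave from direction 0
print(E.shape, a_lin.shape)
```

In practice the encoder would be computed once per frequency bin and stored, since it does not depend on the input signals.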

A. Objective Evaluation of Conventional Ambisonic Encoders
In order to gain insight into the performance of a linear signal-independent Ambisonic encoder, two well-established objective metrics may be employed, namely the spatial correlation and the diffuse level difference [37], [44]. These metrics are computed through comparison between the microphone array encoded patterns and ideal SH patterns over a dense grid of directions. The spatial correlation is effectively a measure of spatial similarity, with the metric ranging between 0 and 1, and may be computed as described in [37], [44], where $\mathrm{diag}[\cdot]$ denotes constructing a vector from the diagonal elements of the enclosed square matrix, $\odot$ denotes the Hadamard product, and $\mathbf{c}(f) \in \mathbb{R}^{(N+1)^2 \times 1}$ are the resultant spatial correlation values for each SH component. Low spatial correlation values indicate that the encoded patterns have deviated from the ideal patterns, which is typically the case above the spatial aliasing frequency of the array. The upper usable frequency limit for each SH component may therefore be determined as the frequency where this metric begins to trend towards 0. Since higher-order components generally require significant gain amplification at low frequencies, regularisation is often employed in practice. This allows a compromise to be made between minimising sensor noise amplification and the provision of a sufficiently wide operating frequency range of usable SH components. The diffuse level difference metric is therefore useful in the determination of the lower usable frequency bound for each SH component, which may be determined as the frequency where the metric begins to deviate from 0 dB. The level difference metric may likewise be computed as described in [37], [44], where $\boldsymbol{\delta}(f) \in \mathbb{R}^{(N+1)^2 \times 1}$ are the level differences for each SH component.
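To make the two metrics more concrete, the sketch below evaluates them for one frequency bin by comparing the encoded patterns $\mathbf{E}\mathbf{A}$ against the ideal SH patterns $\mathbf{Y}$ over the grid. The specific normalisation used here (a weighted, normalised inner product for the correlation, and a ratio of weighted pattern energies for the level difference) is an assumption chosen to match the verbal description, and the function name `encoding_metrics` is hypothetical; the exact expressions in [37], [44] may differ.

```python
import numpy as np

def encoding_metrics(E, A, Y, w=None):
    """Spatial correlation and diffuse level difference per SH component (sketch).

    E : (n_sh, Q) encoding matrix, A : (Q, V) steering vectors,
    Y : (n_sh, V) real ideal SH patterns, w : (V,) optional grid weights.
    """
    V = A.shape[1]
    w = np.ones(V) if w is None else np.asarray(w, dtype=float)
    P = E @ A                                     # encoded patterns, n_sh x V
    cross = np.abs(np.sum(w * P * Y, axis=1))     # weighted inner product with ideal
    p_enc = np.sum(w * np.abs(P) ** 2, axis=1)    # energy of encoded patterns
    p_ideal = np.sum(w * Y ** 2, axis=1)          # energy of ideal patterns
    c = cross / np.sqrt(p_enc * p_ideal)          # in [0, 1] by Cauchy-Schwarz
    delta_db = 10.0 * np.log10(p_enc / p_ideal)   # 0 dB when the energies match
    return c, delta_db
```

Evaluating these per frequency bin and plotting them against frequency gives curves of the kind used to read off the usable band of each SH component.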

III. THE ARRAY IN QUESTION
While the parametric encoding method proposed in this article is general, and thus applicable to a wide range of microphone arrays of arbitrary geometry, including SMAs, the focus of this work is primarily in regard to encoding arrays of irregular geometry and with non-uniformly distributed sensor placements. Therefore, an array of seven sensors arranged on the surface of an HMD worn by a manikin was first designed and 3D modelled, as depicted in Fig. 1 (left). Five sensors were arranged on the left, right, front, back and top orientations of the HMD, and two more sensors were placed in the forward facing directions in order to obtain a higher degree of frontal spatial resolution. The far-field pressure response of the array was then simulated for 841 directions, following a 28th order Fliege design [50], using the Boundary Element Method (BEM) module of COMSOL Multiphysics. The array was simulated for 128 frequencies (uniformly spaced between 93.75 Hz and 12 kHz) in total, with a meshing resolution of 1/6 of the wavelength of each simulated frequency. The scattered pressure measured along the horizontal plane aligned with the HMD is presented in Fig. 1 (middle) as a directivity pattern for two different incident plane-wave directions, which indicates that the directivity of the scattered field of the array can change according to the DoA of the incident wave. This direction-dependent scattering, which is a product of the asymmetrical design employed, differs from the widely utilised rigid SMA configuration where the baffles produce similar scattered directivities for all incident directions.

Fig. 1. Left: A picture of the microphone array in question, with the sensor positions depicted as red dots. Middle: Directivity of the scattered pressure from the surface of the array for two incident plane-wave directions on the horizontal plane aligned with the frame of the HMD. Right: A depiction of the objective metrics for the least-squares Ambisonic encoder, $\mathbf{E}$, as given by (2), derived using the steering vectors for the array in question. Note that the results with the (eq) superscript are of the diffuse-field equalised encoder, $\mathbf{E}^{(\mathrm{eq})}$, as per (3), with the spatial aliasing frequency of 1 kHz.
Note that this particular array design was chosen as it represents a likely future use case in the context of augmented reality applications. It is also an array that is particularly problematic for the conventional linear Ambisonic encoding approach. The challenges associated with linear signal-independent encoding may be demonstrated by computing the performance metrics described in Section II-A, the results for which are provided in Fig. 1 (right). It can be observed that not all components of the same order are encoded in the same manner, which is something that is distinctly different from SMAs, and thus subsequently translates into a non-uniform spatial resolution for different directions. Furthermore, with SMAs, the components of a lower order typically have a wider operational bandwidth than their higher-order components. However, this is not the case for this irregular array, as the z-axis dipole $Y_{1,0}$ component appears to exhibit adequate encoding performance up to higher frequencies than the omni-directional $Y_{0,0}$ component. Such properties are due to the irregular microphone placement and directionally diverse scattering arising due to the geometry of the HMD and the head of the manikin. The metrics also indicate that SH domain beamformers of first-order directivity cannot be reliably generated above approximately 1 kHz. This is also confirmed when the directivity patterns of beamformers derived from linearly encoded Ambisonics are plotted, as depicted in Fig. 2 (left). In contrast, when the microphone sensors are used directly, beamformers with higher directivity may be employed, which may also be generated beyond the spatial aliasing frequency of a linear encoding, as shown in Fig. 2 (right). This is therefore an early indication that a parametric encoding method based on space-domain beamforming could potentially yield improved spatial resolution, and over a wider frequency bandwidth, when compared to conventional linear encoding.

Fig. 2. Directivity patterns of SH domain beamformers derived from linearly encoded Ambisonics (left) and of space-domain beamformers using the microphone sensors directly (right), for five different frequencies and using the array in question. Note that the SH domain beamformers are hyper-cardioid (maximum directivity) beamformers with diffuse-field equalisation enabled above the spatial aliasing frequency (1 kHz), while the space-domain beamformers are as described in [51].

IV. SIGNAL MODEL
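The narrow-band spatial covariance matrices (SCMs) of the signal vectors are given by

$$\mathbf{C}_x(t,f) = \mathrm{E}\!\left[\mathbf{x}(t,f)\,\mathbf{x}^{H}(t,f)\right],$$

which in practice are computed over a number of temporal frames. Note that the time-frequency indices are omitted henceforth for brevity of notation.
It is assumed that a number $K < Q$ of active signals from sound sources $\mathbf{s} = [s_1, \ldots, s_K]^{T} \in \mathbb{C}^{K \times 1}$ at each time-frequency tile are incident from directions $\Gamma_s = [\gamma_1, \ldots, \gamma_K]$. The array signal vector is therefore described as

$$\mathbf{x} = \mathbf{A}_s\mathbf{s} + \mathbf{d} + \mathbf{n},$$

where $\mathbf{A}_s = [\mathbf{a}(\gamma_1), \ldots, \mathbf{a}(\gamma_K)] \in \mathbb{C}^{Q \times K}$ contains the array steering vectors for the source directions; $\mathbf{d} \in \mathbb{C}^{Q \times 1}$ is the diffuse signal vector, which comprises reverberation and spatially diffuse sounds with no clear directionality; and $\mathbf{n} \in \mathbb{C}^{Q \times 1}$ is the sensor noise signal vector, which is assumed to be uncorrelated between sensors. Assuming uncorrelated source signals, their second-order statistics are given by the diagonal SCM $\mathbf{C}_s = \mathrm{E}[\mathbf{s}\mathbf{s}^{H}] \in \mathbb{C}^{K \times K}$, which has a total source signal power $P_s = \mathrm{tr}[\mathbf{C}_s]$. The array SCM solely arising from these source components is given as $\mathbf{A}_s\mathbf{C}_s\mathbf{A}_s^{H}$. The diffuse array signal vector is then modelled as

$$\mathbf{d} = \mathbf{A}\mathbf{z},$$

where $\mathbf{z} \in \mathbb{C}^{V \times 1}$ are the diffuse signal components incident from all directions in the measurement grid. Assuming uncorrelated diffuse signal components, their SCM is given as $\mathbf{C}_z = \mathrm{E}[\mathbf{z}\mathbf{z}^{H}] \in \mathbb{C}^{V \times V}$, and the total diffuse signal power is therefore $P_d = \mathrm{tr}[\mathbf{C}_z]$. Note that in the case of an isotropic diffuse signal vector, the SCM becomes $\mathbf{C}_z = (P_d/V)\mathbf{I}_V$. The SCM for the diffuse signals, as captured by the array, is then given as

$$\mathbf{C}_d = \mathbf{A}\mathbf{C}_z\mathbf{A}^{H}.$$

The array noise SCM is then

$$\mathbf{C}_n = P_n\mathbf{I}_Q,$$

with equal noise power $P_n$ across all sensors.
The overall array signal SCM, based on this assumed model, is therefore

$$\mathbf{C}_x = \mathbf{A}_s\mathbf{C}_s\mathbf{A}_s^{H} + \mathbf{A}\mathbf{C}_z\mathbf{A}^{H} + P_n\mathbf{I}_Q.$$

V. PARAMETRIC SPATIAL ANALYSIS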

A. Spatial Whitening of the Array SCM
The proposed parametric analysis is based on the subspace principles of array signal processing, from which the number of active sound sources and their directions-of-arrival (DoAs) are estimated. It is noted, however, that the employed subspace techniques assume that the array SCM will exhibit an identity-like structure, with its eigenvalues all being $P_n$, when the sound sources in the scene are inactive. These algorithms are therefore well-suited to the task of estimating the number of sources and their directions in the presence of sensor noise. However, in the present scenario, it is assumed that directional components are instead mixed with both sensor noise and diffuse sounds, with the latter not necessarily conforming to this identity-like structure, as demonstrated by (9). If one is to further assume that sensor noise may be negligible (i.e. $P_d \gg P_n$) for the intended applications of the proposed method, then it may be more beneficial to instead have the array SCMs exhibit an identity-like structure when the array is placed under isotropic diffuse-field conditions. Therefore, prior to estimating the required spatial parameters, a spatial whitening operation is applied. This operation is to ensure that the array SCMs, given an isotropic diffuse-field input, would instead conform to the following

$$\mathbf{T}\mathbf{D}\mathbf{T}^{H} = \mathbf{I}_Q,$$

where $\mathbf{T} \in \mathbb{C}^{Q \times Q}$ is the signal-independent ideal diffuse-field spatial whitening matrix, which is computed as

$$\mathbf{T} = \boldsymbol{\Lambda}^{-1/2}\mathbf{R}^{H},$$

given the eigenvalue decomposition $\mathbf{D} = \mathbf{R}\boldsymbol{\Lambda}\mathbf{R}^{H}$. The subspace decomposition is then applied to the array SCMs after the ideal diffuse-field whitening as

$$\mathbf{T}\mathbf{C}_x\mathbf{T}^{H} = \sum_{i=1}^{K}\lambda_i\mathbf{v}_i\mathbf{v}_i^{H} + \sum_{i=K+1}^{Q}\lambda_i\mathbf{v}_i\mathbf{v}_i^{H},$$

where $K$ refers to the number of sources, $\lambda_i$ are the eigenvalues sorted in descending order, and $\mathbf{v}_i$ are the respective eigenvectors.
With the current assumptions, the largest $K$ eigenvalues should be $\mathrm{diag}[\mathbf{C}_s]$, while the smallest $Q - K$ eigenvalues should all be equal to $P_d$. Examples of eigenvalues for both the whitened and un-whitened array SCMs, for up to three white noise sources in a diffuse field, are presented in Fig. 3 using the array in question.
It is noted that for a diffuse-field input, the eigenvalues are not necessarily all equal in practice. However, the whitened array SCMs do more closely conform to the subspace assumptions for these diffuse-field conditions. This also extends to the cases of source(s) mixed with diffuse sound, where the $Q - K$ smallest eigenvalues (highlighted with a grey background) are notably flatter when the whitening operation is applied in the 1 kHz and 2 kHz examples. However, at higher frequencies, where $\mathbf{D}$ in any case begins to trend towards an identity matrix, the whitening operation may not provide any benefit, as is shown in the 4 kHz example.
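The effect of the whitening step on the model SCM can be checked numerically. The sketch below builds a whitening matrix from the eigendecomposition of the DCM and applies it to an SCM; the function names and the particular (non-unique) choice of whitening matrix are assumptions for illustration, and the steering vectors in the check are random placeholders.

```python
import numpy as np

def whitening_matrix(D):
    """Return T such that T D T^H = I, using D = R diag(lam) R^H (sketch)."""
    lam, R = np.linalg.eigh(D)             # real eigenvalues, unitary R
    return (R / np.sqrt(lam)).conj().T     # diag(lam^-1/2) R^H

def whitened_scm(Cx, T):
    """Apply the diffuse-field whitening to an array SCM."""
    return T @ Cx @ T.conj().T

# Placeholder array: 7 sensors, 80 grid directions, 2 sources in a diffuse field.
rng = np.random.default_rng(0)
A = rng.standard_normal((7, 80)) + 1j * rng.standard_normal((7, 80))
D = A @ A.conj().T / A.shape[1]            # DCM with uniform grid weights
As = A[:, [3, 50]]
Cx = As @ np.diag([4.0, 2.0]) @ As.conj().T + 0.5 * D   # sources + diffuse (P_d = 0.5)
lam = np.linalg.eigvalsh(whitened_scm(Cx, whitening_matrix(D)))
print(np.round(lam, 3))                    # ascending order
```

Under this noiseless model the five smallest whitened eigenvalues coincide with the diffuse power, which is the flat floor that the subspace methods rely on.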

B. Source Signal Detection
The estimation of the number of sound sources, often referred to as detection in sensor array processing literature, may be based on analysis of the SCM eigenvalues and thresholding [52], eigenvalue statistics [30], or operations performed directly on the eigenvectors [53]. Alternative approaches are based upon information theoretic criteria [54]. For this work, the SORTE algorithm is employed, as it has been demonstrated to be a robust detector in [30], and does not require any parameter tuning. The first step relies on determining the differences between the eigenvalues as

$$\nabla\lambda_i = \lambda_i - \lambda_{i+1}, \quad i = 1, \ldots, Q-1.$$

The number of sources is then determined from these differences.
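For reference, a compact sketch of a SORTE-style detector is given below. It follows the commonly published form of the criterion, in which the ratio of the variances of successive sets of eigenvalue gaps is minimised over the candidate source count; the search range of 1 to Q-3 and the handling of a zero denominator are assumptions taken from that general formulation rather than from this article, and the function name `sorte` is hypothetical.

```python
import numpy as np

def sorte(eigvals):
    """Estimate the number of sources from SCM eigenvalues (SORTE-style sketch)."""
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]   # descending order
    Q = lam.size
    if Q < 4:
        raise ValueError("need at least four eigenvalues")
    gaps = lam[:-1] - lam[1:]              # the Q-1 successive differences
    best_k, best_score = 1, np.inf
    for k in range(1, Q - 2):              # candidates k = 1, ..., Q-3
        denom = np.var(gaps[k - 1:])       # variance of gaps k .. Q-1
        # Ratio of the variance with and without the k-th gap; undefined -> +inf
        score = np.var(gaps[k:]) / denom if denom > 0 else np.inf
        if score < best_score:
            best_k, best_score = k, score
    return best_k

print(sorte([10.0, 8.0, 1.0, 1.0, 1.0, 1.0, 1.0]))
```

Note that a detector of this form always returns at least one source, so a separate activity check would be needed if silent or purely diffuse tiles must be identified.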

C. Source Direction Estimation
Once the number of sound sources has been determined, establishing their DoAs can be based on first generating activity-maps based on, for example, scanning the same dense grid of directions $\Gamma = [\gamma_1, \ldots, \gamma_V]$ as used to simulate (or measure) the array. Such activity-maps may be based on computing the energy of conventional beamformers, such as the filter-and-sum [55], or minimum-variance distortionless response (MVDR) [56] beamformers. However, since the subspace principles are employed for the source detection task, a spatial pseudospectrum [57]-[59] represents a practical alternative and often leads to sharper activity-maps than those generated by steered-response power approaches. In this work, the MUltiple-SIgnal Classification (MUSIC) approach [58] is employed, where $\mathbf{V}_n$ refers to the noise subspace, defined as the eigenvectors corresponding to the smallest $Q - K$ eigenvalues. Peak-finding may then be employed to numerically extract the $K$ source DoA estimates from the pseudospectrum.
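A minimal sketch of the pseudospectrum computation is shown below, using the textbook MUSIC form in which each grid steering vector is projected onto the noise subspace. When the whitening of Section V-A is applied, it is assumed here that the whitened SCM and correspondingly whitened steering vectors would be passed in; that detail, the function name, and the use of the K largest grid values as a stand-in for proper peak-finding are all illustrative assumptions.

```python
import numpy as np

def music_pseudospectrum(C, A, K):
    """Textbook MUSIC pseudospectrum over a grid (sketch).

    C : (Q, Q) Hermitian SCM, A : (Q, V) grid steering vectors, K : source count.
    """
    Q = C.shape[0]
    _, vecs = np.linalg.eigh(C)            # eigenvalues in ascending order
    Vn = vecs[:, : Q - K]                  # noise subspace: Q-K smallest eigenvalues
    proj = Vn.conj().T @ A                 # projection of each steering vector
    return 1.0 / np.sum(np.abs(proj) ** 2, axis=0)

# Placeholder example: two sources at grid indices 5 and 40.
rng = np.random.default_rng(0)
A = rng.standard_normal((7, 60)) + 1j * rng.standard_normal((7, 60))
As = A[:, [5, 40]]
C = As @ As.conj().T + 0.01 * np.eye(7)
P = music_pseudospectrum(C, A, K=2)
doa_idx = np.sort(np.argsort(P)[-2:])      # crude stand-in for peak-finding
print(doa_idx)
```

On a real spherical grid, neighbouring directions around a true peak also produce large values, so a proper peak-finder would need to suppress the neighbourhood of each selected peak before picking the next one.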

VI. PARAMETRIC SPATIAL SYNTHESIS

A. Source Rendering
Once the number of sources has been detected and their respective DoAs have been determined, spatial filters may be constructed to obtain estimates of the source signals. The extracted source signals may then be encoded into SH signals as incident plane-waves from the same respective DoAs. Various beamforming designs are possible with their own advantages and disadvantages. In the simplest case, beamformers may be steered towards the $K$ DoAs using a matched filter (MF) approach, and thus the source beamforming matrix $\mathbf{W}_s \in \mathbb{C}^{K \times Q}$ is simply

$$\mathbf{W}_s = \mathrm{Diag}\!\left[\mathbf{A}_s^{H}\mathbf{A}_s\right]^{-1}\mathbf{A}_s^{H},$$

where the matrix of the source steering vectors $\mathbf{A}_s \in \mathbb{C}^{Q \times K}$ is constructed by taking a subset of the dense array response measurements corresponding to the estimated DoAs. The diagonal normalisation matrix ensures that unit gain is achieved in the focusing direction for each beamformer. However, while such a design is numerically robust, it does not offer the highest suppression of the ambient sound and of sources in the other estimated directions when $K > 1$. To improve this aspect, a linearly-constrained minimum power (LCMP) solution [56] may be employed with the constraint $\mathbf{W}_s\mathbf{A}_s = \mathbf{I}_K$, in which a regularisation term $\beta$ is included to avoid any ill-conditioned inversions. Equivalently, and as more commonly formulated in the literature, the beamforming matrix may be expressed as $\mathbf{W}_s = [\mathbf{w}_1, \ldots, \mathbf{w}_K]^{H}$, with the weight vectors required to extract the $k$th source signal obtained by minimising the array output power, $\mathbf{w}_k = \arg\min_{\mathbf{w}}[\mathbf{w}^{H}\mathbf{C}_x\mathbf{w}]$, under the linear constraint $\mathbf{A}_s^{H}\mathbf{w} = \mathbf{c}$, where the vector $\mathbf{c}$ has 1 at the $k$th entry and zeros elsewhere. It is further noted that it is possible for the LCMP solution to become unstable if two or more DoA estimates fall too close together. In such cases, heuristic approaches may be devised to cull or merge the DoA vectors to improve the robustness of the beamforming solution. Alternatively, if such instabilities are identified, then a single-column minimum power distortionless response (MPDR) solution may instead be employed for each source, although this approach may then overestimate the energy of sources in the scene. Note that examples of extracted source signal energies for up to three simultaneous white noise sources in a free-field, when using the array in question and the LCMP beamformer design, are depicted in Fig. 4. It can be observed that at lower frequencies, the beamformers are unable to fully separate the source signals, resulting in them also containing up to 3 dB of signal energy from the other source(s). However, given that practical scenes typically comprise source signals that are sparser across frequency and more intermittent over time, these examples may be considered to represent a worst-case scenario for free-field conditions.
Once the source signals have been extracted, they are then encoded into the Ambisonics format as

$$\mathbf{a}_s = \mathbf{Y}_s\mathbf{W}_s\mathbf{x},$$

where $\mathbf{Y}_s = [\mathbf{y}(\gamma_1), \ldots, \mathbf{y}(\gamma_K)] \in \mathbb{R}^{(N+1)^2 \times K}$ are the encoding SH weights for the respective source directions. Note that, unlike conventional linear signal-independent encoding, there is no maximum order dictated by the number of sensors in the array, and thus the encoding order may be arbitrarily selected by the user.
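The source stream can be sketched in a few lines of NumPy. The matched-filter design follows the description above directly, whereas the LCMP function uses the standard closed-form constrained minimum power solution with diagonal loading of the SCM; the loading scheme, the default `beta`, and all function names are assumptions for illustration and are not necessarily the regularisation used in this work.

```python
import numpy as np

def mf_beamformer(As):
    """Matched filters with unit gain in each look direction; returns (K, Q)."""
    norm = np.sum(np.abs(As) ** 2, axis=0)        # diagonal of As^H As
    return As.conj().T / norm[:, None]

def lcmp_beamformer(As, Cx, beta=1e-3):
    """LCMP weights satisfying Ws As = I_K (sketch with assumed diagonal loading)."""
    Q = Cx.shape[0]
    Cl = Cx + beta * (np.trace(Cx).real / Q) * np.eye(Q)   # loaded SCM
    CiA = np.linalg.solve(Cl, As)                 # Cl^-1 As, shape (Q, K)
    G = As.conj().T @ CiA                         # As^H Cl^-1 As, shape (K, K)
    return np.linalg.solve(G, CiA.conj().T)       # (K, Q)

def encode_sources(Ys, Ws, x):
    """Encode extracted source signals as plane waves: a_s = Ys (Ws x)."""
    return Ys @ (Ws @ x)
```

Because the SH weights in `Ys` are evaluated at the estimated DoAs rather than derived from the array, the number of rows (and hence the Ambisonic order) is a free choice in `encode_sources`.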

B. Ambient Rendering
To encode the residual sound scene component, which encapsulates ambient sound and weakly directional sources, a two-stage strategy is followed. Firstly, the residual array signals are obtained by subtracting the source components from the input sound-field. This source subtraction is conducted via a spatial filtering matrix $\mathbf{W}_d \in \mathbb{C}^{Q \times Q}$, which yields an estimate of the residual array signals. Secondly, a plane-wave decomposition of these residual signals is conducted over a uniformly distributed set of $L \geq (N+1)^2$ directions, which are subsequently re-encoded into Ambisonic signals of the target order. The plane-wave decomposition may be performed using unity-gain beamformers following (20), based on the respective steering response matrix $\mathbf{A}_d \in \mathbb{C}^{Q \times L}$, which yields the plane-wave signals.

It is noted, however, that the beamformer directivity patterns achieved through (20) are inherently frequency-dependent. Therefore, due to the fixed number of plane-wave decomposition directions, some frequencies may be over-represented due to greater overlapping of the beamformer patterns. Conversely, at other frequencies, the beamformer patterns may instead become too narrow to capture the residual sound-field energy without losses. Additionally, if the employed microphone array features an irregular geometry and/or non-uniform sensor placement, then the directivity patterns and the energy captured by the beamformers will also be direction-dependent. Therefore, since it is assumed that the residual signals are mostly made up of diffuse ambient components, energy preservation prior to re-encoding may be deemed more important than the unity response constraint imposed by (20).

To ensure this energy-preserving property of the beamforming matrix, a singular value decomposition is first conducted. This is followed by discarding the matrix containing the singular values $\boldsymbol{\Sigma}_d$ and truncating the $\mathbf{U}_d$ matrix, in order to force the array steering vector matrix to be unitary. Here, the truncated version of $\mathbf{U}_d$ is of size $L \times Q$, whereby only the first $Q$ columns are retained. Note that this energy-preservation constraint is similar to the method proposed in [60], which instead employed broad-band SH vectors. An example of this energy-preserving plane-wave decomposition, when the array in question is under diffuse-field conditions, is depicted in Fig. 5. The figure demonstrates that the energy-preservation constraint leads to a more consistent capture of diffuse energy across both frequency and direction, when compared to using the unity response constraint.
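The energy-preserving construction described above can be illustrated with a minimal NumPy sketch. It assumes that the decomposition is taken of the conjugate-transposed steering matrix (size $L \times Q$), since this is the arrangement for which truncating the left singular vectors to the first $Q$ columns gives an $L \times Q$ result; the function name `energy_preserving_matrix` and the random test data are hypothetical and are not taken from the original implementation.

```python
import numpy as np

def energy_preserving_matrix(A_d):
    """Return an (L, Q) matrix with orthonormal columns derived from A_d (Q, L).

    Assumption: the SVD is applied to A_d^H, the singular values are
    discarded, and only the first Q left singular vectors are retained.
    """
    Q, L = A_d.shape
    if L < Q:
        raise ValueError("need at least as many directions as sensors (L >= Q)")
    # U: (L, L), Vh: (Q, Q); the singular values are deliberately unused
    U, _singular_values, Vh = np.linalg.svd(A_d.conj().T, full_matrices=True)
    U_trunc = U[:, :Q]
    return U_trunc @ Vh

rng = np.random.default_rng(0)
Q, L = 7, 60  # seven sensors, sixty plane-wave directions
A_d = rng.standard_normal((Q, L)) + 1j * rng.standard_normal((Q, L))
A_hat = energy_preserving_matrix(A_d)

# Stand-in residual array signals for one frequency bin (random, for illustration)
x_d = rng.standard_normal((Q, 500)) + 1j * rng.standard_normal((Q, 500))
C_d = x_d @ x_d.conj().T / x_d.shape[1]        # residual array SCM
z_d = A_hat @ x_d                              # plane-wave signals
C_pw = z_d @ z_d.conj().T / z_d.shape[1]       # plane-wave SCM
energy_in = np.trace(C_d).real
energy_out = np.trace(C_pw).real
```

Because the columns of the returned matrix are orthonormal, the total energy of the plane-wave signals equals the total energy of the residual array signals, regardless of frequency or array geometry.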
The plane-wave signal vector is then encoded into Ambisonic signals through $\mathbf{Y}_d \in \mathbb{R}^{(N+1)^2 \times L}$, a matrix of SH weights for the respective plane-wave directions.

C. Overall Rendering
The final parametrically encoded Ambisonic signals are then obtained by summing the source and ambient streams. Naturally, this decoupling of the two streams also allows for the possibility of re-balancing them; for example, to apply more gain to the source stream, which would be akin to de-reverberation, or to emphasise the ambient stream to exaggerate the reverberance of the scene. Other parametric spatial audio effects and/or sound-field modifications are also possible, based upon the manipulation of the estimated spatial parameters prior to synthesis [61]. The parametrically encoded signals may also be substituted by linear signal-independent encoded signals for the frequency bandwidths at which conventional encoding is optimal, as explored in [36], based on the objective metrics depicted in Fig. 1.
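The re-balancing of the two streams can be sketched as a gain-weighted sum. This is only an illustration under simple assumptions: the names `a_src` and `a_amb` for the encoded source and ambient streams, and the decibel gain parameters, are hypothetical, and broad-band gains are used for brevity.

```python
import numpy as np

def rebalance_streams(a_src, a_amb, src_gain_db=0.0, amb_gain_db=0.0):
    """Sum the source and ambient SH streams with optional per-stream gains."""
    g_src = 10.0 ** (src_gain_db / 20.0)
    g_amb = 10.0 ** (amb_gain_db / 20.0)
    return g_src * a_src + g_amb * a_amb

n_sh = (5 + 1) ** 2                      # fifth-order: 36 SH channels
rng = np.random.default_rng(1)
a_src = rng.standard_normal((n_sh, 16))  # placeholder source stream
a_amb = rng.standard_normal((n_sh, 16))  # placeholder ambient stream

a_plain = rebalance_streams(a_src, a_amb)                   # plain sum
a_dry = rebalance_streams(a_src, a_amb, amb_gain_db=-12.0)  # attenuated ambience
```

With both gains at 0 dB the function reduces to the plain sum of the two streams; lowering the ambient gain gives the de-reverberation-like effect mentioned above.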

VII. EVALUATION
The evaluation of the proposed encoding method was approached through both the calculation of objective metrics and formal listening tests. Both evaluations utilised the microphone array described in Section III.

A. Objective Metrics Evaluation
To evaluate the objective performance of the proposed method, synthetic microphone array recordings of different scenarios were created. These were based on uncorrelated white noise source signals of varying number and directions, which were mixed with an isotropic diffuse field. The diffuse field was modelled based on uncorrelated white noise sources in all $V = 841$ measurement directions, accompanied by the appropriate integration weights for the employed spherical grid [50]. The gains for the source signal(s) and the diffuse-field signals were then adjusted to attain specific direct-to-diffuse ratios (DDRs).
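The gain adjustment can be sketched as follows. The sketch assumes that the DDR is the ratio, in decibels, of the total source power to the total diffuse power as captured by the array; this definition, the function names, and the random stand-in signals are assumptions made for illustration.

```python
import numpy as np

def scale_diffuse_to_ddr(x_src, x_diff, ddr_db):
    """Scale the diffuse array signals so that the mixture attains ddr_db."""
    if np.isinf(ddr_db):
        # An infinite DDR corresponds to no diffuse component at all
        return np.zeros_like(x_diff)
    p_src = np.mean(np.abs(x_src) ** 2)
    p_diff = np.mean(np.abs(x_diff) ** 2)
    target_ratio = 10.0 ** (ddr_db / 10.0)
    gain = np.sqrt(p_src / (p_diff * target_ratio))
    return gain * x_diff

def measured_ddr_db(x_src, x_diff):
    """Power ratio between the source and diffuse array signals in dB."""
    return 10.0 * np.log10(np.mean(np.abs(x_src) ** 2)
                           / np.mean(np.abs(x_diff) ** 2))

rng = np.random.default_rng(2)
Q, T = 7, 48000                                # seven sensors, one second at 48 kHz
x_src = rng.standard_normal((Q, T))            # stand-in source array signals
x_diff = 3.0 * rng.standard_normal((Q, T))     # stand-in diffuse array signals

results = {ddr: measured_ddr_db(x_src, scale_diffuse_to_ddr(x_src, x_diff, ddr))
           for ddr in (0.0, 6.0, 12.0)}
x_diff_inf = scale_diffuse_to_ddr(x_src, x_diff, np.inf)
```

Only the diffuse signals are scaled here, which leaves the source level fixed across the DDR conditions; scaling the source signals instead would be equally valid.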
For this study, the following DDRs were targeted: [0, 6, 12, Inf] dB. Note that all objective metrics were based on computing $\mathbf{C}_x$ over one second of input microphone array audio (sampling rate of 48 kHz), given a short-time Fourier transform (STFT) with a window size of 512 samples with no overlap; i.e. averaged over $\lfloor 48000/512 \rfloor = 93$ down-sampled time frames per frequency. The plane-wave decomposition of the ambient signals was based on selecting the $L = 60$ nearest measurements for the directions corresponding to a minimum t-design [62] of degree 10. The decorrelation of the signals $\hat{\mathbf{z}}$, prior to re-encoding them in (27), was conducted by directly randomising their phase uniformly in the range $[-\pi, \pi)$. In cases where two DoA estimates fell within the same $\pi/(2\sqrt{Q})$ angle, one of the DoA estimates was randomly omitted in order to improve the stability of the employed beamforming solution. The beamformers also used $\beta = 0.01\,\mathrm{tr}[\mathbf{C}_x]$ as the regularisation term. Note that all $V = 841$ measurement directions were also used when computing $\mathbf{D}$, and for the grid-scanning conducted by the DoA estimator described in Section V-C.

The first objective metrics of interest relate to the parameter analysis performance, which refers to the method's ability to correctly detect the true number of sources and estimate their true DoAs. This was assessed by computing root-mean-square error (RMSE) values as in (30), where $N_f$ refers to the employed number of frequency bins (up to the 12 kHz simulation limit), $K$ is the true number of sources, $\mathbf{u}$ is the true source direction in Cartesian coordinates of unit length, and $\hat{K}$ and $\hat{\mathbf{u}}$ are the estimated source number and source direction vector, respectively. Note that in cases where more than one DoA estimate was made, the error metric was computed for all combinations between the estimates and ground truths, and the lowest $\min(\hat{K}, K)$ error values were selected, followed by taking the mean to obtain a combined average. In total, 1000 iterations of randomised source directions were simulated, in order to obtain one averaged error value for each source number (up to $K = 3$) and DDR combination.

Perceptually motivated objective metrics were also computed, in order to evaluate how accurately the proposed method synthesises the target SH signals, given a binaural rendering workflow. The metrics were based on first linearly decoding the SH signals to the binaural channels $\mathbf{z}_{\mathrm{bin}} \in \mathbb{C}^{2 \times 1}$ using $\mathbf{D}_{\mathrm{bin}} \in \mathbb{C}^{2 \times (N+1)^2}$, which denotes a frequency-dependent binaural decoding matrix. Note that the magnitude least-squares design proposed in [7] was employed for this task. The binaural SCM is then computed from the decoded signals, from which the following binaural metrics can be derived: $\mathrm{BMS}_{lr}$, the binaural mean spectrum (BMS), which corresponds to the timbral colouration of the encoding and decoding processing; $\mathrm{ILD}_{lr}$, the inter-aural level difference (ILD) between the left and right ears, which relates directly to the inter-channel level differences between the two binaural channels; and $\mathrm{IC}_{lr}$, the inter-aural coherence (IC), which relates directly to the inter-channel coherence. Note that an example of these binaural metrics for one scenario is depicted in Fig. 6.

These binaural metrics were computed based on: the array signals parametrically encoded into fifth-order SH signals using the proposed method; the array signals linearly encoded to first-order SH signals following (2) (with diffuse-field equalisation above the spatial aliasing limit, as described by (3)); and a fifth-order SH reference based on directly encoding the source and diffuse signals used to simulate the array recording. Note that all $V = 841$ measurement directions were used to compute $\mathbf{E}$ (with $\beta = 0.3$). The error values for the three binaural metrics, $\mathrm{RMSE}_{\mathrm{BMS}}$, $\mathrm{RMSE}_{\mathrm{ILD}}$, and $\mathrm{RMSE}_{\mathrm{IC}}$, were then calculated in a similar manner to (30), using the metric values derived from the binaural decoding of the reference fifth-order SH encoding as the true values. The metrics were also computed and averaged over 1000 iterations of random source directions. However, contrary to the parameter analysis evaluation, the metrics were averaged over frequency using the perceptually motivated equivalent rectangular bandwidth (ERB) scale.
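As an illustration of how such binaural metrics may be derived from a binaural SCM, the sketch below uses commonly employed definitions: a level difference in decibels, a normalised cross-channel coherence magnitude, and the mean of the two ear energies in decibels. These particular formulas are assumptions for the purpose of the example and may differ in detail from the expressions used in this study; the function names are hypothetical.

```python
import numpy as np

def binaural_scm(z_bin):
    """Time-averaged 2x2 covariance of binaural signals z_bin (2, frames)."""
    return z_bin @ z_bin.conj().T / z_bin.shape[1]

def binaural_metrics(C_bin):
    """Return (BMS in dB, ILD in dB, IC) for one frequency bin (assumed forms)."""
    e_l = C_bin[0, 0].real                    # left-ear energy
    e_r = C_bin[1, 1].real                    # right-ear energy
    bms_db = 10.0 * np.log10(0.5 * (e_l + e_r))
    ild_db = 10.0 * np.log10(e_l / e_r)
    ic = np.abs(C_bin[0, 1]) / np.sqrt(e_l * e_r)
    return bms_db, ild_db, ic

rng = np.random.default_rng(3)
s = rng.standard_normal(93) + 1j * rng.standard_normal(93)  # 93 time frames
z_bin = np.vstack([2.0 * s, s])  # left ear twice the amplitude of the right
bms_db, ild_db, ic = binaural_metrics(binaural_scm(z_bin))
```

In this toy case the two channels are fully coherent and differ by a factor of two in amplitude, so the coherence is unity and the level difference is approximately 6 dB.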

B. Perceptual Evaluation
A multiple-stimulus binaural listening test was also conducted in order to evaluate the perceptual encoding performance of the proposed method, given a binaural rendering workflow. Note that, contrary to parts of the objective evaluations, these perceptual evaluations were conducted based solely on estimated spatial parameters.

Fig. 6. Binaural metrics for a scene comprising two sources, one directly in front and one directly to the left of the array in question, with a DDR of 6 dB, when using: the proposed method targeting fifth-order (par_o5), linear first-order encoding (lin_o1), and reference fifth-order encoding (ref_o5).

For the implementation of the proposed method³ used for the listening tests, the sampling rate, the $L = 60$ directions for the residual rendering, the employed culling scheme for the DoA estimates, and the beamformer regularisation term were all configured to be the same as in Section VII-A. In contrast, the time-frequency transform, the temporal averaging of $\mathbf{C}_x$, the updating of the spatial parameters and mixing matrices, and the decorrelation approach were altered to better suit the dynamic sound scenes used for the listening test. The employed time-frequency transform was the 90% overlap alias-free STFT design⁴ described in [63], which was configured to use a hop size of 128 samples, with the hybrid filtering feature enabled, thus providing 133 frequency bands in total. The temporal averaging of the array SCM was conducted in blocks, based on combining the current block of 2048 time-domain samples with the previous block of 2048 samples, thereby averaging $\mathbf{C}_x$ over 4096/128 = 32 down-sampled time frames per frequency band. The proposed spatial analysis and synthesis were then updated and applied for each block of 2048 time-domain samples. Signal decorrelation was conducted by assigning random delays per channel and per frequency band, with longer delays employed at lower frequencies and shorter delays at high frequencies, as used previously for similar studies conducted by the present authors [23], [29], [31].
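The block-wise temporal averaging of the array SCM may be sketched as follows, assuming that the time-frequency frames are already available as an array of shape (bands, sensors, frames). The function name and the handling of the very first block, which has no predecessor, are assumptions made for illustration.

```python
import numpy as np

HOP = 128                          # hop size in samples
BLOCK = 2048                       # block length in samples
FRAMES_PER_BLOCK = BLOCK // HOP    # 16 down-sampled frames per block

def blockwise_scm(X):
    """Per-block SCMs averaged over the current and previous block.

    X: complex array of shape (bands, Q, frames).
    Returns a list of arrays of shape (bands, Q, Q), one per block.
    """
    n_frames = X.shape[-1]
    n_blocks = n_frames // FRAMES_PER_BLOCK
    scms = []
    for b in range(n_blocks):
        # The first block has no predecessor, so it is averaged on its own
        start = max(0, (b - 1) * FRAMES_PER_BLOCK)
        stop = (b + 1) * FRAMES_PER_BLOCK
        Xw = X[:, :, start:stop]
        C = np.einsum("bqt,bpt->bqp", Xw, Xw.conj()) / Xw.shape[-1]
        scms.append(C)
    return scms

rng = np.random.default_rng(4)
X = rng.standard_normal((4, 7, 64)) + 1j * rng.standard_normal((4, 7, 64))
scms = blockwise_scm(X)
frames_per_average = 2 * FRAMES_PER_BLOCK
```

Each SCM after the first is therefore averaged over 2 × 16 = 32 frames per frequency band, matching the 4096/128 figure quoted above.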
To create the listening test scenes, three contrasting sets of four source stimuli were first selected: 1) a four-piece funk band, 2) four simultaneous speakers, and 3) a mixed source scenario comprising a piano, speech, a water fountain, and clapping. Since the array in question was simulated up to 12 kHz, all stimuli were low-pass filtered at 12 kHz. These filtered stimuli were then directly convolved with the array measurements corresponding to the fixed directions [0, 0; 90, 0; −90, 0; 45, 50] degrees (azimuth, elevation) and summed, in order to obtain a simulated array recording of the anechoic sound scene. The stimuli were also directly encoded into fifth-order SH signals in these same directions, in order to serve as the anechoic reference case.

To also include a more realistic acoustical environment, a shoe-box room simulator⁵, based on the image-source method, was employed. The wall absorption coefficients were configured in octave bands, to obtain reverberation times (RT60) of [0.5, 0.55, 0.5, 0.35, 0.2, 0.15] s (125 Hz to 4 kHz) for a [10 × 7 × 4] m (Width × Depth × Height) sized room. The receiver position was set to the centre of the room, with the four source positions set in the same directions as with the anechoic case, 1 m away from the receiver. The direct paths and modelled room reflections were then quantised to the employed $V = 841$ measurement grid and directly convolved with the respective array measurements, in order to obtain a simulated array recording of the reverberant scene. The direct path and reflections were also directly encoded into fifth-order SH signals, which served as the reverberant reference test case.
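Quantising the direct-path and reflection directions to a measurement grid amounts to a nearest-neighbour search on the unit sphere. The following sketch assumes that directions are given as (azimuth, elevation) pairs in degrees and uses a hypothetical coarse grid in place of the actual 841-point measurement grid, which is not reproduced here.

```python
import numpy as np

def sph_to_unit(az_deg, el_deg):
    """Convert azimuth/elevation in degrees to Cartesian unit vectors."""
    az = np.radians(az_deg)
    el = np.radians(el_deg)
    return np.stack([np.cos(el) * np.cos(az),
                     np.cos(el) * np.sin(az),
                     np.sin(el)], axis=-1)

def nearest_grid_index(dirs_deg, grid_deg):
    """Index of the grid direction with the smallest angle to each direction."""
    u = sph_to_unit(dirs_deg[:, 0], dirs_deg[:, 1])
    g = sph_to_unit(grid_deg[:, 0], grid_deg[:, 1])
    # The largest dot product corresponds to the smallest angular distance
    return np.argmax(u @ g.T, axis=1)

# Hypothetical 10-degree grid, used only as a stand-in for the measurement grid
az = np.arange(-180, 180, 10)
el = np.arange(-80, 90, 10)
grid_deg = np.array([(a, e) for e in el for a in az], dtype=float)

src_deg = np.array([[0, 0], [90, 0], [-90, 0], [45, 50]], dtype=float)
idx = nearest_grid_index(src_deg, grid_deg)
off_grid_idx = nearest_grid_index(np.array([[47.0, 52.0]]), grid_deg)
```

Each path direction is then replaced by `grid_deg[idx]` before the corresponding array measurement is selected for convolution.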
The simulated array recordings of the aforementioned sound scenes were subsequently encoded into fifth-order SH signals using the proposed parametric (IA_par_o5) method, and also into first-order SH signals using the conventional linear (IA_lin_o1) approach, as described by (2). As an additional control condition, a tetrahedral array of cardioid-pattern sensors with a radius of 2 cm, as commonly employed for Ambisonic recording in practice, was also used to obtain simulated recordings, which were encoded into fifth-order SH signals using the proposed method (tetra_par_o5). Note that this tetrahedral array was simulated based on analytical descriptors [45], [46] for the same $V = 841$ directions, in order to have parity with the grid used to simulate the array in question. This condition was intended to reveal any improvements of the proposed method when using an array type that is commercially and widely available, and often employed for capturing first-order linearly encoded recordings (tetra_lin_o1). Additionally, this SMA may demonstrate differences between the method applied to the irregular array under study, and a more regular array that exhibits a uniform spatial resolution. All encoded SH signals and the reference SH signals were then decoded to the binaural channels using the magnitude least-squares method proposed in [7].
In total, there were six test scenes, as summarised by Table I, and five test cases, as summarised in Table II. The listening test was then conducted in three parts:
• Spatial: where the test cases were frequency-dependently equalised to the reference case. The listening subjects were then instructed to assess the test cases based on their spatial accuracy, and ignore any remaining timbral differences.
• Timbre: where the magnitude response of each test case was imposed onto the reference case, therefore ensuring that all the test cases presented were spatially equivalent. The listening subjects were then instructed to rate the cases based only on timbral differences.
• Overall: where the test cases were simply normalised to the reference based on their average broad-band root-mean-square signal powers. The listening subjects were then asked to rate the cases based on personal preference.

Fourteen subjects participated in the listening test, all of whom were naive as to the hypothesis of the study, reported having normal hearing, and had previous experience participating in perceptual studies. The scale of the listening test was set between 0 and 100, and had the verbal anchors bad, poor, fair, good, and excellent between the respective 20-point intervals. The test subjects were instructed to rate each test case with respect to the reference, and relative to each other, while in consideration of the specific perceptual attribute under test (spatial, timbre, or overall). The average length for completing all three parts of the test was approximately 40 minutes. The tests were conducted in specially built sound-dampened listening booths (background noise level of $L_{A,\mathrm{eq},30\,\mathrm{s}} = 22.0$ dB SPL(A)) located at Aalto University, using Sennheiser HD600 headphones.
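The broad-band normalisation used for the overall part of the test can be sketched as a single gain that matches the root-mean-square level of a test case to that of the reference. The function name and the random two-channel stand-in signals are hypothetical.

```python
import numpy as np

def normalise_to_reference(test, ref):
    """Scale `test` so that its broad-band RMS matches that of `ref`."""
    rms_ref = np.sqrt(np.mean(ref ** 2))
    rms_test = np.sqrt(np.mean(test ** 2))
    return test * (rms_ref / rms_test)

rng = np.random.default_rng(5)
ref = 0.1 * rng.standard_normal((2, 48000))    # stand-in binaural reference
test = 0.7 * rng.standard_normal((2, 48000))   # stand-in binaural test case
test_norm = normalise_to_reference(test, ref)
rms_after = np.sqrt(np.mean(test_norm ** 2))
rms_target = np.sqrt(np.mean(ref ** 2))
```

Since one gain is applied to both channels, the inter-aural level differences and the spectral shape of the test case are left unchanged.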

VIII. RESULTS AND DISCUSSION
The results for the objective parameter analysis evaluation are presented in Fig. 7. It can be observed that, with the exception of the three-source, 0 dB DDR case, the $\mathrm{RMSE}_{\mathrm{DoA}}$ errors remain quite consistent, even as more sound sources are introduced into the simulation. The standard deviations are high, which is likely a product of the irregular array geometry and non-uniform sensor placements, but are otherwise consistent across the different numbers of sources and DDR values. The error and standard deviation for the three-source case at 0 dB DDR, however, are notably larger, although it is highlighted that this represented the most challenging case that was tested. The perceptual ramifications of these estimation errors, however, may be more suitably inferred from the results of the listening tests described below. Regarding the evaluation of $\mathrm{RMSE}_K$, given positive DDR values, the errors were found to be low and the standard deviations narrow, suggesting that the source number estimator is suitable for detecting sources within moderate to low energy diffuse-fields. In the 0 dB DDR case, however, the errors indicate that the employed source number estimator may over-estimate, or is otherwise unable to reliably detect, the true number of sources. This 0 dB DDR issue may have also influenced the following objective binaural metrics results to some degree.

Fig. 7. RMSE and standard deviation results for the objective spatial analysis evaluation, which were averaged over frequency bins between [0, 12] kHz and 1000 iterations of randomly selected source directions.
The results for the binaural metrics evaluations are shown in Fig. 8, using both the analysed parameters (left) and the known/Oracle spatial parameters (right). For both the analysed parameters and Oracle cases, it can be observed that the proposed parametric encoding yields lower RMSE values for all DDR values that are above 0 dB, and for all three binaural metrics, when compared to the linearly encoded baseline. However, for the 0 dB DDR cases, the error is higher, especially for the purely diffuse ($K = 0$) case, when using the estimated spatial parameters. The error for this particular case is significantly lower when using the Oracle parameters, thus suggesting that the aforementioned issues regarding the employed source number estimator may be to the detriment of the overall encoding method for such conditions. Therefore, the proposed method could benefit from the addition of a diffuse-conditions detector, which would allow the source number detector to be bypassed (i.e. force $K = 0$) in cases where the sound-field is analysed to be highly diffuse. A topic of future work could therefore involve investigating the use of such detectors; for example, the estimator described in [64] may be suitable for this task, provided that spatial whitening of the SCM is conducted, as described by (12), and with the selection of an appropriate threshold value.

Fig. 8. RMSE and standard deviation results for the objective binaural metrics evaluation, computed based on the ideal fifth-order reference. Averaged in ERB frequency bands (up to 12 kHz), and 1000 iterations of random source directions. Left: using the parametric analysis; right: with known parameters (Oracle).
The results for the multiple-stimulus listening test are presented in Fig. 9. The parametric rendering was rated notably higher than the linear signal-independent encoding in terms of both the spatial and timbral attributes, and also based on the overall preference of the listeners. The hidden references were consistently assigned scores near 100, whereas the linearly encoded irregular array was likely interpreted as a low-quality anchor and rated near 0. The linearly encoded tetrahedral array fared better than the linearly encoded irregular array, which is likely a result of its uniform arrangement of sensors and smaller radius, which achieves a direction-independent spatial aliasing frequency of approximately 6 kHz, rather than the direction-dependent spatial aliasing limit of approximately 1 kHz exhibited by the irregular array. For the spatial part of the listening test, the proposed parametrically encoded array signals for both arrays performed similarly, and were assigned scores within the good and excellent verbal anchors.

The timbral part of the test indicated that the irregular array introduced noticeable timbral colourations for certain sound scenes, since they were rated lower than the parametrically encoded tetrahedral array; notably, both of the mixture scenes were rated lower. However, it should be highlighted that broad-band transient stimuli (such as clapping) typically require a responsive analysis for an adequate parameterisation and rendering, and such sounds tend to more readily reveal any artefacts arising due to signal decorrelation. In contrast, the broad-band noise source (the water fountain) and the musical source (piano) instead benefit from longer temporal averaging windows. Therefore, this particular sound scene may be considered especially challenging, since there are conflicting configuration requirements for the various contrasting source signals. However, the results for the overall part of the test suggest that the spatial attributes of the proposed encoding approach were more favoured by the test participants compared to the timbral attributes, since the overall scores were more in line with those of the spatial part of the listening test.

IX. CONCLUSION
This article proposes a parametric signal-dependent method for encoding the signals of an array of microphones into Ambisonic signals. The method is highly general by design, and is intended to yield improved performance over conventional linear signal-independent encoding, especially when employing irregular microphone array geometries and/or non-uniform microphone placements. The proposed method conducts a multi-directional parameterisation of the captured sound scene, and employs spatial filtering to divide the scene into its individual source and directional ambient components. The source components are then encoded into the Ambisonics domain at an arbitrary output order. The ambient components are first projected onto a uniform spherical arrangement of points, optionally decorrelated, and then encoded at the same target output order. The output Ambisonic signals are then obtained by summing these two streams.
The proposed method was evaluated in the context of binaurally decoding Ambisonic signals, which were obtained by encoding simulated recordings of a non-uniform arrangement of seven microphones affixed to a head-mounted display worn by a manikin. The evaluation was based on first analysing objective binaural metrics. Here, the objective binaural cues were computed based on first targeting fifth-order Ambisonic output using the proposed parametric method and first-order output using conventional linear signal-independent encoding, followed by decoding them to the binaural channels. The objective binaural cues were then compared against those derived from a fifth-order directly encoded reference case. It was found that the proposed encoding method outperformed conventional linear Ambisonic encoding for all of the tested scenarios in which the direct-to-diffuse ratio was above 0 dB. For the 0 dB case, the improvement in performance of the proposed method, compared to the linear encoding, was less apparent. However, when substituting the processing with known spatial parameters, the computed error values of the proposed method were either similar to, or lower than, the linearly encoded baseline. This therefore suggests that there is room for further improvements in the proposed spatial analysis for such conditions.

The proposed method was then evaluated based on formal listening tests. It was found that the test subjects rated the parametrically encoded fifth-order cases to be perceptually closer to the ideal/reference fifth-order cases, when decoded to the binaural channels and compared against first-order linearly encoded and decoded baseline cases. These improved results hold for both the perceived spatial and timbral attributes for a number of sound scenes, comprising a diverse range of source stimuli for both anechoic and reverberant environments.

Fig. 2. Example directivity patterns of beamformers when using linearly encoded SH domain signals (left) or the microphone signals directly (right), for five different frequencies and using the array in question. Note that the SH domain beamformers are hyper-cardioid (maximum directivity) beamformers with diffuse-field equalisation enabled above the spatial aliasing frequency (1 kHz), while the space-domain beamformers are as described in [51].

Fig. 3. An example of the effect of spatial whitening on the eigenvalues of the array SCM for three frequencies, given up to three (top-bottom) equal-power white noise source signals in a diffuse-field with $\mathrm{tr}[\mathbf{C}_s] = \mathrm{tr}[\mathbf{C}_z]$, and using the array in question. The first, second, and third sources were incrementally introduced in the following directions: [0, 90, −90] degrees azimuth.

Fig. 4. Examples of beamformer energy plotted over frequency for one (top), two (middle), and three (bottom) uncorrelated white noise source signals in a free-field, using the beamforming solution described by (21) ($\beta = 0.01\,\mathrm{tr}[\mathbf{C}_x]$) and the array in question. The first, second, and third sources were incrementally introduced in the following directions: [0, 90, −90] degrees azimuth.

Fig. 5. A depiction of the energy of $\mathbf{z}_d$ plotted over frequency for $L = 60$ directions when the array in question is placed in an isotropic diffuse-field. The top plot employed the energy-preserving steering vectors $\hat{\mathbf{A}}_d$, while the bottom plot used $\mathbf{A}_d^{\mathrm{H}}$. For visual reference, the total energy of the input diffuse-field $\mathrm{tr}[\mathbf{C}_z]$, and the total energy of the diffuse-field as captured by the microphone array $\mathrm{tr}[\mathbf{C}_d]$, are also plotted.

Fig. 9. Means and 95% confidence intervals for the listening test results, based on fourteen participants.