Abstract:
The MPEG-H standard for spatial audio proposes the rendering of multiple auditory objects (up to 16) and ambisonics to create a spatial audio scene. Convolution of these (and their early environmental reflections) with user-specific Head Related Impulse Responses (HRIRs), together with a treatment of the late tail of the room reverberation, is the gold standard for creating a spatial audio scene. However, this is expensive both in the computation time and battery power needed for finite impulse response (FIR) convolution and in the device memory required to store the HRIRs. If quality could be maintained, an implementation with equivalent infinite impulse response (IIR) filters would mitigate these costs. We propose a novel differentiable optimization approach for determining an IIR filter cascade from a given FIR filter. This is done via an application-specific formulation that yields a convex and differentiable cost function for the conversion. We describe our results for spatial audio rendering of HRIR convolution. We compare our work against a recent neural-network-based HRIR estimation method in terms of accuracy and speed. Finally, we implemented our approach in a real-time setting, suitable for implementation on DSP hardware, and conducted a small user study. Results from human participants were positive.
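To make the FIR-to-IIR idea concrete, the following is a minimal sketch that fits a low-order IIR filter to a given FIR impulse response. It does not reproduce the formulation proposed in the paper. It uses the classical Prony (equation-error) least-squares method as a stand-in, since that is also a convex problem, and it returns a single direct-form section, not a cascade. The function names `solve_linear`, `prony_fir_to_iir`, and `impulse_response`, as well as the test filter coefficients, are illustrative assumptions.

```python
# Sketch only: Prony's equation-error method, NOT the paper's optimization.
# All names and coefficients below are hypothetical, chosen for illustration.

def solve_linear(A, y):
    """Solve the square system A x = y by Gaussian elimination with partial pivoting."""
    n = len(y)
    M = [row[:] + [y[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / M[i][i]
    return x

def prony_fir_to_iir(h, num_order, den_order):
    """Fit B(z)/A(z), with a[0] = 1, to the FIR taps h in the equation-error sense."""
    n_taps = len(h)

    def tap(n):
        return h[n] if 0 <= n < n_taps else 0.0

    # For n > num_order the model requires h[n] + sum_k a[k] * h[n - k] = 0.
    # Minimizing the squared residual over a[1..K] is linear least squares;
    # here it is solved through the normal equations G a = g.
    rows = range(num_order + 1, n_taps)
    idx = range(1, den_order + 1)
    G = [[sum(tap(n - i) * tap(n - j) for n in rows) for j in idx] for i in idx]
    g = [-sum(tap(n - i) * tap(n) for n in rows) for i in idx]
    a = [1.0] + solve_linear(G, g)
    # The numerator is the first num_order + 1 samples of the convolution a * h.
    b = [sum(a[k] * tap(n - k) for k in range(den_order + 1))
         for n in range(num_order + 1)]
    return b, a

def impulse_response(b, a, n_taps):
    """First n_taps samples of the impulse response of B(z)/A(z), assuming a[0] = 1."""
    y = [0.0] * n_taps
    for n in range(n_taps):
        acc = b[n] if n < len(b) else 0.0
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * y[n - k]
        y[n] = acc
    return y

# Build a 64-tap "FIR" by truncating the response of a known stable IIR filter,
# then check how well a second-order fit reproduces those taps.
b_true = [1.0, 0.5, 0.25]
a_true = [1.0, -0.9, 0.5]
h = impulse_response(b_true, a_true, 64)
b_fit, a_fit = prony_fir_to_iir(h, num_order=2, den_order=2)
h_fit = impulse_response(b_fit, a_fit, 64)
max_err = max(abs(u - v) for u, v in zip(h, h_fit))
```

In this toy case, 64 stored taps are replaced by five free coefficients, which illustrates the memory and arithmetic savings that motivate the conversion. Measured HRIRs are not exactly low-order rational responses, so a real fit would trade filter order against approximation error, and the fitted poles would need to be checked for stability.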
Published in: ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Date of Conference: 06-11 April 2025
Date Added to IEEE Xplore: 07 March 2025