Introduction
Research on Head-Related Transfer Functions (HRTFs) [1] has been rapidly progressing over the past decade. The availability of relatively low-cost hardware technologies for immersive visualization has shown the need for higher fidelity HRTF-based spatial sound simulation. Although recent advances in fast HRTF acquisition make it possible to capture HRTFs via acoustic measurement in minutes [2], [3], individual HRTFs are still hard to obtain for the general public. As a consequence, most applications have been relying on non-individual, or generic, HRTFs. Generic HRTFs are known to cause systematic localization errors such as front/back reversals, wrong elevation perception, and inside-the-head localization [4], even if the use of real-time head tracking and artificial reverberation is able to significantly reduce these issues [5].
HRTF individualization, i.e., the process of providing the user with an HRTF that matches the temporal and spectral content of their own ears’ responses, aims at solving the above issues. HRTF individualization approaches can be divided into three families: numerical simulation, indirect individualization based on anthropometry, and indirect individualization based on perceptual feedback [6]. The first class of methods, which consists of simulating the propagation of acoustic waves around a 3D scan of the subject's head and torso through computational techniques [7], has recently gained much attention from researchers. Although such methods are becoming increasingly accurate in predicting an individual HRTF from a 3D head mesh [8], a lack of large-scale perceptual evaluation studies prevents further discussion of their effectiveness.
On the other hand, one of the active but still unresolved research topics is to identify the physical mechanisms underlying the generation of the most important spectral HRTF cues. A thorough understanding of such generation mechanisms would allow the development of HRTF models that are easy to tune and computationally efficient [9]. Unfortunately, many available HRTF models focus on spatial rendering limited to the horizontal plane [10], [11], overlooking the relevant elevation cues.
While previous research shows some understanding of the relationship between pinna modes/reflections and HRTFs [12]–[14], more research is still needed to fully understand the main elevation cues and how they depend on the individual. This is partly due to generally low sample sizes in terms of number of measured individuals and the lack of commonly agreed protocols for measuring HRTFs and individual morphology, resulting in very different measurements for the same subject depending on the setup [15]. The lack of large and standardized HRTF datasets together with the relevant anthropometric data also complicates the use of machine learning and deep learning techniques, although some research in that direction is starting to appear [16]–[18].
In this paper we extend our previous research by using a new dataset of highly controlled HRTF measurements on a KEMAR mannequin with several different artificial pinnae, captured in an anechoic room with increased vertical resolution. Furthermore, we propose a simple computational model to predict the most important elevation cues, i.e., pinna notches, from a 3D pinna mesh. The computational model applied to this dataset provides more insight on the generation mechanisms for the first pinna notch, known as N1.
The paper is organized as follows. Section II outlines the related work. Section III reports the data collection procedures, Section IV describes our custom procedure to extract the relevant features from HRTFs, and Section V introduces the computational model. Finally, Section VI presents our results, and Sections VII and VIII report the discussion and conclusions, respectively.
Related Work
Humans are capable of vertical spatial sound localization by parsing specific spectral cues. This is achieved with poorer resolution than for sound sources located in the horizontal plane, where interaural cues play a major role [19]. A number of seminal experiments advanced the understanding of the spectral cues responsible for vertical localization. Notably, Hebrank and Wright [20] established that spectral cues for vertical localization exist between 4 and 16 kHz, and that a sound must occupy this frequency range in order to be localized along the median plane. The pinna is known to be responsible for generating these cues, which come in the form of spectral peaks and notches generated through processes of resonance, reflection, and diffraction [21]. Fig. 1 highlights these spectral features in an example HRTF set collected across the frontal median plane.
Frontal median plane HRTF amplitude spectra for an example subject with the main spectral peaks (P1
Shaw [22] identified six resonant modes of the pinna excited by sounds from different directions. These modes were calculated and averaged among 10 different pinnae and include: one mode appearing for all directions at 4.2 kHz (omnidirectional mode), two modes appearing for directions above the head at 7.1 and 9.6 kHz (vertical modes), and three modes appearing around the horizontal plane at 12.2, 14.4, and 16.7 kHz (horizontal modes). Pinna modes are thought to cause the most prominent peaks in the HRTF [12]—see P1 to P3 in Fig. 1 which correspond to the omnidirectional, vertical, and horizontal modes, respectively. As Hebrank and Wright observed [20], perception above the head is associated with a 7 to 9 kHz peak and frontal perception with increased energy above 13 kHz. Taken together, these results highlight the relevance of pinna modes as elevation cues. The center frequencies of peaks are relatively insensitive to changes in elevation of the sound source [23], and models to estimate them from individual anthropometric parameters have been recently proposed [14], [24].
By contrast, the exact origin of spectral notches is more difficult to trace, although it has been long thought to refer to reflections on the concha wall causing the pinna to behave like a delay-and-add system in the time domain [25]. More recently, it has been hypothesized that notches are due to pressure nodes in the concha induced by interactions between propagating waves and the pressure anti-node with opposite phase forming in the upper pinna cavities [12]. Center frequencies of pinna notches, especially N1, are generally seen to increase with the elevation angle [20], [26], therefore representing salient elevation cues [27]. Furthermore, notches exhibit little variation with changes in azimuth [21] or distance [28], [29] and are deeper for sound images below the horizontal plane [30].
While a number of characteristic peak and notch patterns have previously been identified, their contribution to vertical sound localization is still a topic of inquiry. Iida et al. [31] achieved localization performances similar to using the subjects’ own HRTF for the front and rear portions of the median plane by synthesizing a parametric HRTF composed of only the first peak (P1) and the first two notches (N1, N2). They therefore concluded that N1 and N2 are the most relevant elevation cues, while P1 could be used by the human hearing system as a reference for analyzing notches. In a more recent experiment [32], they further improved localization performances for a larger subset of elevations by including a second peak (P2). Models that mimic the characteristic peak/notch patterns seen in HRTFs have also been proposed throughout the past decades, including Watkins’ double-delay-and-add time-domain model [33], Shaw's physical flange-and-cavity model [34], and the diffraction-reflection model by Lopez-Poveda and Meddis [21]. The main drawback of these models is that it is unclear how they can be customized to fit a particular listener.
As a matter of fact, relating the individual variations of spectral features in HRTFs to pinna anthropometry is a key aspect of HRTF individualization, and has been previously investigated to some extent. In a recent work, Iida et al. [35] synthesized the HRTF of listeners from their individual anthropometric parameters using multiple regression, obtaining similar spectral features as in the respective measured HRTFs. Another promising approach to HRTF individualization involves embedding the HRTF into a compressed representation, which can then be estimated using anthropometric parameters. Common choices for such compressed representation include Principal Component Analysis (PCA) [36], [37], Surface Spherical Harmonics [38], and, more recently, deep autoencoders [11]. The anthropometric parameters can be related to the compressed representation using multiple regression analysis or artificial neural networks [39].
Furthermore, advances in the fields of machine learning and computer vision allow for automatic feature extraction from images of the pinnae [40], [41]. Pinna images are featured in recent work such as that of Lee and Kim [42], where they are used together with anthropometric measurements as input data for an artificial neural network trained to synthesize HRTF spectra. The authors of the present paper previously proposed a structural model of the pinna [13] whose parameters are given by a simple ray-tracing algorithm that converts 2D reflection paths on three distinct pinna edges into notch frequencies. Using the same algorithm together with a subset of standard anthropometric parameters, a linear regression model to estimate N1 frequencies from individual anthropometry was also introduced [43], as well as a marginally improved model based on PCA [44].
With the objective of gaining additional understanding of the mechanisms that generate spectral notches in experimental HRTFs, this paper offers the following key contributions:
a new dataset of highly controlled HRTF measurements of a KEMAR mannequin with different artificial pinnae and corresponding 3D pinna meshes, specifically designed to investigate the effect of a given pinna on the HRTF;
a simple computational model for predicting individual notches from a pinna mesh, which builds on and extends to the 3D case the ray-tracing algorithm previously described in [13];
a performance evaluation of the computational model on the new HRTF dataset and a discussion of its strengths and shortcomings.
Data Collection
In this work we use a new dataset of HRTFs measured on a KEMAR mannequin [45] equipped with custom silicone pinnae modeled on 20 different artificial heads. This set of acoustic measurements is an improved version of the previously released Viking HRTF dataset [46], [47], with a focus on extra median-plane measurements. Moreover, the new measurements were carried out in an anechoic environment. In order to minimize the impact of issues related to HRTF measurements on human subjects and of anatomical components other than the single pinna, we chose to create our own set of acoustic measurements rather than relying on available HRTF datasets.
A. Artificial Pinnae and 3D Meshes
In order to provide a diverse sample of pinna shapes for the KEMAR mannequin, we set up and applied a custom procedure to cast silicone replicas of pinnae from artificial heads. In short, the main steps of the procedure consisted of sequentially creating
a first negative silicone mold out of the artificial head's pinna;
a positive replica with a hard material (Jesmonite). This step was needed to shape the standard rectangular base of the KEMAR, and to allow drilling a hole to accommodate the KEMAR microphone inside the concha;
a second negative silicone mold with rectangular base included;
the final silicone (25 Shore-A hardness) replica.
We applied the above casting procedure to a series of left ears from 20 different artificial heads. These included the KEMAR with standard large anthropometric pinna (GRAS KB5001) and 19 dummy heads made out of plaster, borrowed from the Saga Museum in Reykjavík. The dummy heads, labeled A to S in alphabetical order, were manufactured between 2001 and 2003 by casting the heads of 19 Icelandic humans (7 female), aged between 7 and 77 at the time of manufacturing. The final result is shown in Fig. 2.
3D scans of all the artificial pinnae were then acquired with a Creaform Go!SCAN 20 white-light handheld scanner at 1 mm resolution. Every pinna was scanned on both the front and back (rectangular base) sides and the two scans were then merged using the VXelements software. Each single scan took approximately two minutes. The same software was also used to automatically close occasional holes in the mesh based on the neighboring vertices, and to manually align to a global coordinate system where the x- (width) and y- (height) axes are parallel to the shorter and longer sides of the rectangular base, respectively, and the z- (depth) axis is normal to the back side of the base. An example pinna mesh is shown in Section V. The origin of each mesh was manually selected as the point lying approximately in the center of the microphone hole at a depth that roughly corresponds to the location of the microphone diaphragm when the corresponding artificial pinna is inserted in the KEMAR's left pinna slot. This was made possible thanks to the availability of detailed pictures of the HRTF measurement sessions described in the following Subsection.
B. Acoustic HRTF Measurements
The HRTF measurement system, pictured in Fig. 3, consisted of an aluminum scaffolding hosting a spinning platform for a KEMAR mannequin (45BB-4 configuration), as well as a pivoting arm equipped with a Genelec 8020 C loudspeaker at its farthest extremity. The two degrees of freedom offered by the platform and arm allowed the speaker to travel around a 1-m radius spherical surface centered on the middle point of the mannequin's interaural axis. This distance guarantees the collection of far-field spectral cues with reasonable accuracy [29]. The two rotation axes were driven by independent high-torque stepper motors (JVL MST001 A, 1.2 Nm) with integrated gearboxes, controlled using an Arduino with serial connection. A dedicated RME Fireface 802 audio interface connected both the loudspeaker and the KEMAR half-inch pressure microphones (GRAS 40AO) to the host workstation. Prior to the measurement sessions, a rotary laser level was used to ensure proper alignment of the various components.
The HRTF measurement system: mechanical apparatus, loudspeaker, and KEMAR mannequin.
The HRTF measurements were carried out during the month of November 2019 inside the anechoic chamber recently installed at the University of Iceland. The chamber has a size of
The sweep method was adopted for recording all acoustical responses [48]. Specifically, the excitation signal
sets A to S: the 19 left pinna replicas of the corresponding artificial heads;
set T: the KEMAR anthropometric left pinna replica;
sets X and Y: the original KEMAR anthropometric left pinna in its soft (35 Shore-OO) and stiffer (55 Shore-OO) variants, respectively;
set Z: a flat 25 Shore-A silicone baffle filling the pinna slot flush with the head (so as to simulate a “pinna-less” condition).
KEMAR large anthropometric pinna replica (left), pinna replica of artificial head S (center), and silicone baffle (right) mounted on the KEMAR's left channel.
For each left pinna, sweep measurements were recorded in the frontal half of the median plane in the range of elevations from
On top of the above measurements, free-field reference measurements for the system were collected, allowing for removal of the influence of any element other than the mannequin. This was done by removing the KEMAR, replacing it with a thin wooden pole, and mounting the left KEMAR microphone onto it, so that its position would be roughly at the center of the interaural axis with the head absent, and its orientation would closely match that of the mannequin's left channel. Reference measurements were taken with the same protocol as for the HRTF measurements except that a
Amplitude spectra of the free-field reference measurements. Thin gray lines show the responses for all elevations; the thicker black line shows the log-average.
HRTF Feature Extraction
In order to recover Head-Related Impulse Responses (HRIRs) from sweep responses in accordance with the sweep method [48] and the free-field compensation method [49], a post-processing script was written in MATLAB. This means that each recorded signal
As an example, Fig. 6 displays the HRTFs collected in set A calculated with a DFT size of 2048 points. Well-known spectral effects can be recognized here, such as the shoulder reflection ridge between 1 and 2 kHz followed by striations at higher frequencies [50], the omnidirectional peak around
Amplitude spectra of HRTF set A. The influence of the pinna (as elevation-dependent peaks and notches) and torso (as striations) can be observed.
In order to extract the relevant spectral cues relative to the pinna, it is desirable to remove from the HRTF any other effect that is not caused by the interaction between the sound and the pinna itself. Isolating the response of the pinna alone (PRTF) can be done by using the pinna-less responses. Working under the assumption that the effects due to different anatomical components (head, pinna, and torso) on the HRTF are additive [9], the amplitude response of the PRTF for a given HRTF set
\begin{equation*}
\left|PRTF_i(f,\phi _k)\right| = \frac{\left|HRTF_i(f,\phi _k)\right|}{\left|HRTF_Z(f,\phi _k)\right|}. \tag{1}
\end{equation*}
Although the above assumption does not completely hold true, we found that good results are obtained when the spectral division is preceded by a further windowing of both the HRIR and pinna-less HRIR using a 1-ms falling half-Hann window [52]. This operation again does not fully suppress torso effects, especially at lower elevations where torso reflections occur within a few tenths of ms from the HRIR onset [53] and tend to overlap with pinna reflections. The result for set A is reported in Fig. 8. It can be noticed that nearly all torso reflections below 10 kHz have been removed and notch tracks appear considerably smoother than in the corresponding Fig. 6 HRTFs.
Amplitude spectra of PRTF set A, obtained by spectral division of the windowed set-A HRTFs by the windowed pinna-less HRTFs.
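The windowing and spectral division of Eq. (1) can be illustrated with a minimal Python sketch. It is not the MATLAB post-processing script used for the dataset. It assumes that each HRIR has already been aligned so that its onset sits at sample 0, that the sampling rate `fs` is supplied by the caller, and that a small constant `eps` (our own addition) guards against division by zero. All function names are hypothetical.

```python
import cmath
import math


def falling_half_hann(n):
    """Falling half of a Hann window: 1 at the first sample, decaying to 0."""
    if n == 1:
        return [1.0]
    return [0.5 * (1.0 + math.cos(math.pi * k / (n - 1))) for k in range(n)]


def windowed_magnitude(hrir, fs, n_fft, win_ms=1.0):
    """Magnitude spectrum (bins 0..n_fft/2) of the windowed HRIR.

    Assumes the HRIR onset is at sample 0 (alignment is not handled here).
    """
    n_win = min(len(hrir), int(round(win_ms * 1e-3 * fs)))
    window = falling_half_hann(n_win)
    x = [hrir[k] * window[k] for k in range(n_win)]
    mags = []
    for b in range(n_fft // 2 + 1):
        # Naive DFT, adequate for a window of a few dozen samples.
        acc = sum(x[k] * cmath.exp(-2j * math.pi * b * k / n_fft)
                  for k in range(n_win))
        mags.append(abs(acc))
    return mags


def prtf_magnitude_db(hrir, hrir_pinnaless, fs, n_fft=2048, eps=1e-12):
    """PRTF magnitude in dB: windowed HRTF divided by windowed pinna-less HRTF."""
    num = windowed_magnitude(hrir, fs, n_fft)
    den = windowed_magnitude(hrir_pinnaless, fs, n_fft)
    return [20.0 * math.log10((a + eps) / (b + eps)) for a, b in zip(num, den)]
```

With identical inputs for the pinna and pinna-less responses the sketch returns 0 dB at every bin, which is a convenient sanity check before applying it to measured data.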
Center frequencies of pinna notches can now be directly extracted from each PRTF as the locations of the local minima, i.e., the samples that are smaller than their two neighboring samples in the frequency range between 4 and 16 kHz where the spectral cues generated by the directional filtering of the pinna lie [20]. Selection of the local minima was performed through an inverted basic peak picking algorithm. The depth value of the extracted notches was calculated as the difference between the values (in dB) of the local minima and of the PRTF spectral envelope at the same frequency. The spectral envelope was obtained by piecewise cubic interpolation of the spectral peaks [54], extracted through the same simple peak picking algorithm. Fig. 9 shows an example of notch extraction on a representative PRTF.
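The extraction step described above can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually used: the spectral envelope is obtained here by linear interpolation between neighboring peaks, a simplification that stands in for the piecewise cubic interpolation of [54], and depth is taken as envelope minus minimum so that deeper notches give larger positive values. Function names are hypothetical.

```python
def local_extrema(values, sign):
    """Indices of samples strictly above (sign=+1) or below (sign=-1) both neighbors."""
    return [i for i in range(1, len(values) - 1)
            if sign * values[i] > sign * values[i - 1]
            and sign * values[i] > sign * values[i + 1]]


def envelope_at(freqs, mag_db, peaks, i):
    """Envelope value at bin i from the nearest peaks on either side.

    Linear interpolation is used as a stand-in for piecewise cubic interpolation.
    """
    left = [p for p in peaks if p < i]
    right = [p for p in peaks if p > i]
    if not left and not right:
        return max(mag_db)
    if not left:
        return mag_db[right[0]]
    if not right:
        return mag_db[left[-1]]
    a, b = left[-1], right[0]
    t = (freqs[i] - freqs[a]) / (freqs[b] - freqs[a])
    return mag_db[a] + t * (mag_db[b] - mag_db[a])


def extract_notches(freqs, mag_db, f_lo=4000.0, f_hi=16000.0):
    """Return (center frequency in Hz, depth in dB) for each local minimum in range."""
    peaks = local_extrema(mag_db, +1)
    notches = []
    for i in local_extrema(mag_db, -1):
        if f_lo <= freqs[i] <= f_hi:
            depth = envelope_at(freqs, mag_db, peaks, i) - mag_db[i]
            notches.append((freqs[i], depth))
    return notches
```

No threshold on notch depth is applied, in keeping with the parameter-free character of the extraction described in the text.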
PRTF for set A (elevation
By plotting the extracted notch frequencies for each PRTF set in the elevation-frequency plane, one can easily visualize the evolution of the main notches along the elevation angle, and label them starting from lower elevations by increasing order of center frequency. Fig. 10 again shows the case of set A, where four main notches starting at the lowest elevation angle can be identified (and labeled N1 to N4). Notice that in this case N2 and N3 meet at about 10 kHz around
Scatterplot of pinna notch frequencies for set A. The size of each point is proportional to the depth of the corresponding notch. The first notch, N1, is highlighted.
The Computational Model
Spectral pinna notches have long been assumed to be caused by the destructive interference resulting from the sum of an incident wave and its reflected, time-delayed versions reaching the ear canal [25], [52], [55]. A single reflected component combines with the incident component after a time delay
\begin{equation*}
\tau = \frac{d}{c} \tag{2}
\end{equation*}
In previous work [13] we assumed that a single reflection point corresponding to each pinna notch relates to one of five main pinna surfaces approximated as 2D contours traced on the helix, concha, and crus helicis. No prior assumption was made on the sign of the reflection coefficient. The results of that study, calculated on the CIPIC HRTF database [56], showed that N1 is likely related to a negative reflection from the helix, N2 to a negative reflection from the concha inner wall, and N3 to a negative reflection from the concha border. The study also highlighted how the results for N1 are in agreement with previous studies on its generation mechanisms and on the pinna structures related to it [57]–[59], as well as offering a few speculations on how a negative reflection coefficient could be produced [13], [60].
If we assume a negative reflection coefficient, then destructive interference occurs whenever the reflected wave is delayed by a multiple of a full wavelength, or
\begin{equation*}
d = n\lambda, \quad n = 1,2,\ldots, \tag{3}
\end{equation*}
\begin{equation*}
f_n = \frac{c}{\lambda } = \frac{cn}{d}, \quad n = 1,2,\ldots \tag{4}
\end{equation*}
Similarly to torso reflections, each pinna reflection should translate into a comb filter-like effect in the measured signal spectrum, i.e., a series of periodic notches. However, experimental PRTFs do not typically show periodically related notches—see, for instance, again Fig. 8. Therefore, as in previous works [13], [52], it is assumed that a single reflection path gives rise to a single notch at frequency
\begin{equation*}
f_1 = \frac{c}{d}. \tag{5}
\end{equation*}
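As a quick numerical illustration of Eqs. (4) and (5), the sketch below converts a path difference into the corresponding notch frequencies. The speed of sound of 343 m/s is an assumed value, and the function names and the example path difference are hypothetical.

```python
SPEED_OF_SOUND = 343.0  # m/s, assumed value


def comb_notch_frequencies(d, n_max, c=SPEED_OF_SOUND):
    """Eq. (4): notch frequencies f_n = c * n / d for n = 1..n_max (d in meters)."""
    if d <= 0:
        raise ValueError("path difference must be positive")
    return [c * n / d for n in range(1, n_max + 1)]


def first_notch_frequency(d, c=SPEED_OF_SOUND):
    """Eq. (5): only the first notch of the series is retained."""
    return comb_notch_frequencies(d, 1, c)[0]


# A path difference of 4.9 cm maps to a notch close to 7 kHz.
example_hz = first_notch_frequency(0.049)
```

Under these assumptions, path differences of a few centimeters, which are comparable to pinna dimensions, fall within the 4 to 16 kHz range where pinna cues are expected.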
Note that the analysis presented in [13] does not prove that negative reflections are actual physical phenomena that occur in the pinna. In order to reflect such a fact and to allow for some degree of approximation in the geometrical analysis that follows, we refer to the model we will now develop from Eq. (5) through basic ray tracing as a simple computational model, rather than a reflection model based on physical principles.
Given a 3D pinna mesh, we can find all its points which directly reflect a ray towards the ear canal entrance. Under the assumption that the sound propagates from the source as a planar wave, this is done through an algorithm that sequentially calculates, for every considered elevation angle
the vectors from a 1-m far source at elevation $\phi$ to each vertex with positive z-coordinate (incoming rays);
the vectors from each vertex with positive z-coordinate to the mesh origin (reflected rays);
the vertex normals of the mesh, each defined as the normalized unweighted average of the surface normals of the faces that contain that vertex;
the angles between vertex normals and incoming rays;
the angles between vertex normals and reflected rays.
whose normals subtend an angle smaller than a threshold $\theta_{\max}$ with both the incoming and reflected rays; and
whose reflected rays do not cross any mesh face before joining the mesh origin.
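The geometric steps listed above can be sketched in a few lines of Python. Several points are assumptions of this sketch and not statements about the actual implementation: the mesh is given as a list of vertex coordinates plus triangles of vertex indices, the angle with the incoming ray is measured against the direction pointing from the vertex back to the source, the source position is passed in by the caller, and the occlusion test on reflected rays is omitted for brevity. All names are hypothetical.

```python
import math


def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def unit(a):
    length = math.sqrt(a[0] ** 2 + a[1] ** 2 + a[2] ** 2)
    if length == 0.0:
        return (0.0, 0.0, 0.0)
    return (a[0] / length, a[1] / length, a[2] / length)


def angle_deg(a, b):
    ua, ub = unit(a), unit(b)
    dot = ua[0] * ub[0] + ua[1] * ub[1] + ua[2] * ub[2]
    return math.degrees(math.acos(max(-1.0, min(1.0, dot))))


def vertex_normals(vertices, faces):
    """Normalized unweighted average of the unit normals of adjacent faces."""
    sums = [[0.0, 0.0, 0.0] for _ in vertices]
    for i, j, k in faces:
        n = unit(cross(sub(vertices[j], vertices[i]), sub(vertices[k], vertices[i])))
        for idx in (i, j, k):
            for axis in range(3):
                sums[idx][axis] += n[axis]
    return [unit(tuple(s)) for s in sums]


def select_vertices(vertices, faces, source, theta_max_deg, origin=(0.0, 0.0, 0.0)):
    """Indices of vertices with z > 0 whose normal lies within theta_max of both rays.

    The occlusion test (reflected ray crossing another face) is NOT implemented.
    """
    normals = vertex_normals(vertices, faces)
    selected = []
    for idx, v in enumerate(vertices):
        if v[2] <= 0.0:
            continue
        to_source = sub(source, v)   # reversed incoming ray (sketch convention)
        to_origin = sub(origin, v)   # reflected ray
        if (angle_deg(normals[idx], to_source) < theta_max_deg
                and angle_deg(normals[idx], to_origin) < theta_max_deg):
            selected.append(idx)
    return selected
```

A complete version would add a ray-triangle intersection test for the last selection criterion, for which standard algorithms exist.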
3D mesh of head A's left pinna and selected vertices for
Using Eq. (5) we can predict the frequencies at which destructive interference occurs. We set
\begin{equation*}
d = \Vert \overrightarrow{sv}\Vert + \Vert \overrightarrow{vo}\Vert - 1, \tag{6}
\end{equation*}
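Eq. (6) combined with Eq. (5) can be illustrated as follows. The sketch takes the source position, one reflecting vertex, and the mesh origin in meters and returns the path difference and the corresponding notch frequency. The speed of sound (343 m/s), the histogram bin width, and all function names are assumptions introduced only for this example.

```python
import math
from collections import Counter


def path_difference(source, vertex, origin=(0.0, 0.0, 0.0), source_distance=1.0):
    """Eq. (6): reflected path length minus the 1-m direct path, in meters."""
    return math.dist(source, vertex) + math.dist(vertex, origin) - source_distance


def predicted_notch_hz(source, vertex, origin=(0.0, 0.0, 0.0), c=343.0):
    """Eq. (5) applied to the path difference of a single vertex."""
    d = path_difference(source, vertex, origin)
    if d <= 0:
        raise ValueError("path difference must be positive")
    return c / d


def histogram(freqs_hz, bin_hz):
    """Count predicted frequencies per bin; keys are lower bin edges in Hz.

    The bin width is a free parameter of this sketch.
    """
    counts = Counter(math.floor(f / bin_hz) * bin_hz for f in freqs_hz)
    return dict(sorted(counts.items()))
```

For example, a vertex 3 cm from the origin on the side opposite to the source gives a path difference of about 6 cm and a predicted notch near 5.7 kHz under the assumed speed of sound.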
For each elevation
Distribution of predicted notch frequencies according to the computational model applied to head A's left pinna. The first peak, starting at about
We decided, therefore, to set
Results
A. Robustness and Accuracy of HRTF Measurements
Thanks to the availability of reference HRTF sets X and Y (see Section III-B) measured on the original KEMAR left pinnae in two different variants, and to the constant presence of the same right pinna in all measurements, it is possible to evaluate the robustness of our median-plane measurements as well as their fidelity to measurements taken from a previous dataset.
In particular, we calculated the mean spectral distortion between each pair of HRTF sets
\begin{equation*}
SD(a^{l|r}\!,b^{l|r})\!=\!\frac{1}{N_\phi }\!\sum _i\!\sqrt{\frac{1}{N_f}\!\sum _j\!{\left(\!20\!\log _{10}\!\frac{\left| H_a^{l|r}\!(\phi _i,f_j)\right|}{\left| H_b^{l|r}\!(\phi _i,f_j)\right|}\right)^2}}, \tag{7}
\end{equation*}
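A direct transcription of Eq. (7) into Python is given below as a minimal sketch. It assumes that each HRTF set is stored as a list of rows, one per elevation, each row holding linear magnitudes at the same frequency bins for a single channel. The function name is hypothetical.

```python
import math


def spectral_distortion(mag_a, mag_b):
    """Mean spectral distortion in dB between two HRTF sets (Eq. (7)).

    mag_a[i][j] and mag_b[i][j] are linear magnitudes at elevation i, bin j.
    """
    if len(mag_a) != len(mag_b) or not mag_a:
        raise ValueError("sets must be non-empty and share the same elevations")
    total = 0.0
    for row_a, row_b in zip(mag_a, mag_b):
        squares = [(20.0 * math.log10(a / b)) ** 2 for a, b in zip(row_a, row_b)]
        total += math.sqrt(sum(squares) / len(squares))  # RMS over frequency
    return total / len(mag_a)                            # mean over elevation
```

Because the log ratio is squared, the measure is symmetric in its two arguments, which is why only one triangle of the pairwise matrix needs to be reported.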
The results of this analysis show that while the right channel is affected by measurement noise only (mean
Mean spectral distortion [dB] between each pair of HRTF sets, left channels only. Due to symmetry, values above the diagonal line are not repeated. The low spectral distortion between HRTF sets T, X and Y is highlighted.
Finally, in Fig. 14 we compare the three HRTF sets measured with the different variants of the KEMAR left pinna (sets T, X, and Y) to previous measurements from the CIPIC HRTF database [56], where Subject 021 is the KEMAR with large standardized pinnae. We can qualitatively notice a close agreement among the four sets especially up to about 10 kHz, above which some common pinna cues can still be recognized (e.g., the 15-kHz notch). If we take into account that the external shape of the GRAS KB5001 pinna is identical to that of the standardized KEMAR pinna except for the concha and canal that have been recently modified to closely mimic the properties of a real human ear, overall this result highlights the accuracy of our reference KEMAR measurements.
Amplitude spectra [dB] of HRTF sets T, X, Y, and CIPIC subject 021, left channel. Because of the different resolution in elevation (
B. Performance of the Computational Model
The simple computational model presented in Section V outputs one sequence of histograms for each of the 20 considered left pinnae, allowing us to compare the predicted notch frequencies against those extracted from the corresponding HRTFs with the algorithm outlined in Section IV. Fig. 15 reports such a comparison for four representative pinnae/HRTF sets. In general, while the majority of our sets do not present any clear relationship between higher-order notches (N2, N3, ...) and modeled concha contributions, there appears to be a substantial overlap between N1 and the modeled helix contributions.
Distribution of predicted notch frequencies according to the simple computational model and extracted notch frequencies for four different sets. The extracted and predicted N1 are marked with red and blue points, respectively. The size of each point is proportional to notch depth and maximum bin count, respectively.
However, the latter observation does not hold true for all our pinnae. We can identify the following three cases:
Case 1: Sets G, H, M, O. In these HRTF sets, the first available notch falls above 7 kHz even in the lower elevation range, suggesting that the extracted notch might not be N1. As a matter of fact, the predicted N1 frequency is much lower and is generally associated with a faint cluster of points.
Case 2: Sets C, P, R. While the extracted N1 falls within a plausible frequency range (6 to 7 kHz at lower elevations), there is no corresponding helix cluster. This is because there is no straight path from the helix points to the mesh origin.
Case 3: Sets A, B, D, E, F, I, J, K, L, N, Q, S, T. For all these pinnae/HRTF sets, the extracted and predicted N1 frequencies overlap. In the remainder of this Section, we focus on this case.
For a given pinna/HRTF set
the mean absolute error between predicted ($\hat{f_x}$) and extracted ($f_x$) N1 frequencies,
\begin{equation*}
MAE(x)=\frac{1}{\left|\Phi \right|}\sum _{i\in \Phi }\left|\hat{f_x}(\phi _i)-f_x(\phi _i)\right|; \tag{8}
\end{equation*}
the mean signed error between predicted and extracted N1 frequencies,
\begin{equation*}
MSE(x)=\frac{1}{\left|\Phi \right|}\sum _{i\in \Phi }\left(\hat{f_x}(\phi _i)-f_x(\phi _i)\right); \tag{9}
\end{equation*}
the notch frequency mismatch [13] between predicted and extracted N1 frequencies, i.e., the average percentage ratio between the absolute error and the extracted frequency value,
\begin{equation*}
m(x) = \frac{1}{\left|\Phi \right|}\sum _{i\in \Phi }{\frac{\left|\hat{f_x}(\phi _i)-f_x(\phi _i)\right|}{f_x(\phi _i)}\cdot 100\%}; \tag{10}
\end{equation*}
the sample Pearson correlation coefficient $r_f(x)$ between pairs of predicted and extracted N1 frequencies.
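The four metrics can be computed together as in the following sketch, which follows Eqs. (8) to (10) and the usual definition of the sample Pearson coefficient. It assumes that predicted and extracted N1 frequencies are given as equal-length lists over the common elevations. The function name and the dictionary keys are hypothetical.

```python
import math


def n1_error_metrics(predicted, extracted):
    """MAE (Hz), mean signed error (Hz), mismatch (%), and Pearson r for one set."""
    n = len(predicted)
    if n != len(extracted) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    errors = [p - e for p, e in zip(predicted, extracted)]
    mae = sum(abs(err) for err in errors) / n
    mse = sum(errors) / n  # signed: positive means overestimation on average
    mismatch = 100.0 * sum(abs(err) / e for err, e in zip(errors, extracted)) / n

    mean_p = sum(predicted) / n
    mean_e = sum(extracted) / n
    cov = sum((p - mean_p) * (e - mean_e) for p, e in zip(predicted, extracted))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_e = sum((e - mean_e) ** 2 for e in extracted)
    if var_p == 0.0 or var_e == 0.0:
        raise ValueError("Pearson r is undefined for constant sequences")
    r = cov / math.sqrt(var_p * var_e)
    return {"MAE": mae, "MSE": mse, "mismatch": mismatch, "r": r}
```

Note that MSE denotes here the mean signed error of Eq. (9), not the mean squared error, so errors of opposite sign cancel out and the value indicates estimation bias only.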
Table I reports complete error metrics for the 13 sets. Generally, there is close agreement between extracted and predicted N1 for the common elevations, as shown by the MAE metric as well as the sample Pearson correlation coefficient. In particular, not only are the absolute frequency values similar, but so is the elevation trend of extracted and predicted N1 frequencies, with an initial plateau at lower elevations followed by a frequency rise. Remarkably, in all cases except one (set E), the notch frequency mismatch is well below the above-mentioned 9% threshold. In addition, we observe no particular estimation bias from the MSE metric, with a comparable number of cases either overestimating or underestimating notch frequency on average.
Discussion
The results reported in Section VI-A suggest that the collected acoustical measurements are replicable, robust, and faithful to reference KEMAR HRTFs, providing a level of accuracy for investigating the relation between HRTFs and pinna anthropometry previously not possible. In turn, this gives substance to the results of Section VI-B, which report a general agreement between the ground-truth N1 data and the predictions of our simple computational model. Conversely, the same model does not support any clear relationship between anthropometry and higher-order pinna notches.
It must be acknowledged that there exist at least two sources of error that could have slightly undermined the accuracy of our input data. The first one is related to the 3D pinna meshes. As reported in Section III-A, the origin of each pinna mesh has been selected manually prior to calculation of the reflected rays and, in particular, its placement along the z-axis has been solely based on 2D data from pictures taken during the HRTF measurement sessions. Furthermore, one variable not taken into account here is the possible deformation of the pinna replica once inserted into the KEMAR slot due to slightly off-center microphone holes. Even so, the origin placement procedure has been based on observations of the pinna replica alone without using the HRTF in any way. This represents a more unbiased estimate with respect to [13], where the assumed microphone location had been optimized by minimization of the error between extracted and predicted HRTF notches for every subject.
The second possible error source comes from the entirely automatic notch extraction procedure described in Section IV. This was preferred over the classic signal processing algorithm by Raykar et al. [52] as the latter outputs a number of notches that heavily depends on a user-defined threshold. When such a threshold is relaxed, additional notches appear, which were not visible in the original HRTF amplitude spectra. On the other hand, our algorithm extracts a number of PRTF notches that does not depend on any user-defined parameter, and this often results in shorter N1 tracks than observed in previous literature. However, we believe that a conservative yet objective estimate of notch frequencies represents the best possible scenario for a solid data analysis.
For 13 HRTF sets out of 20, we found a clear correspondence between our model predictions and ground-truth N1 frequencies. The notch frequency mismatch metric scores a median value of 3.3% across the 13 sets, which is less than half the median mismatch of 7.4% previously found across 17 sets in [13]. While the two studies refer to two different HRTF datasets with different representations of individual anthropometric data, this result endorses the extension of our model to the 3D case. The improved representation of pinna structures through 3D data and the robustness of our HRTF measurements are to be seen as additional factors contributing to this improvement. Furthermore, the average MAE in this study is considerably lower than that found in preliminary works that used linear regression models to predict N1 from anthropometric parameters [44] or depth maps of pinnae [62]. This suggests that the computational model is able to extract strong features from 3D data.
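As a rough illustration of how a summary figure of this kind can be obtained, the sketch below computes a per-set relative mismatch between extracted and predicted N1 frequencies and then takes the median across sets. The mismatch definition used here (mean relative absolute deviation along the track), the function names, and the numerical values are assumptions made for illustration; they are not the metric definition or the data of this study.

```python
from statistics import mean, median


def track_mismatch(extracted_hz, predicted_hz):
    """Mean relative absolute deviation (in %) between two notch tracks.

    Both arguments are sequences of notch frequencies in Hz sampled at the
    same elevations. This definition is an assumption for illustration only.
    """
    if len(extracted_hz) != len(predicted_hz) or not extracted_hz:
        raise ValueError("tracks must be non-empty and of equal length")
    return 100.0 * mean(
        abs(p - e) / e for e, p in zip(extracted_hz, predicted_hz)
    )


def median_mismatch(tracks):
    """Median of the per-set mismatch values.

    `tracks` maps a set label to an (extracted, predicted) pair of tracks.
    """
    return median(track_mismatch(e, p) for e, p in tracks.values())


# Made-up example values in Hz (hypothetical, not measured data).
example = {
    "A": ([6000.0, 6500.0, 7000.0], [6120.0, 6500.0, 6860.0]),
    "B": ([7000.0, 8000.0], [7350.0, 8400.0]),
    "C": ([5000.0, 5500.0], [5000.0, 5500.0]),
}
print(f"median mismatch: {median_mismatch(example):.2f}%")
```

The median is used rather than the mean so that a few poorly matched sets do not dominate the summary value.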
On the other hand, the computational model fails for the 7 remaining HRTF sets. In 4 of them (case 1 in Section VI-B) there is strong evidence that N1 cannot be identified in the corresponding HRTFs because it is not generated at all. The issue of missing notches was also seen in a previous study on the CIPIC and ARI databases [63], where it was hypothesized as the cause of poor vertical localization performance. For 3 of these 4 HRTF sets the computational model actually predicts a notch track. This track is, however, generally lower in frequency and overlaps with the omnidirectional 4 kHz resonance, which might explain the lack of prominent notches.
For the remaining 3 HRTF sets (case 2 in Section VI-B), N1 appears in the HRTF but not in the model prediction because there is no straight path from the helix to the mesh origin. For instance, in the specific case of set R, a considerably deep N1 track seen in the HRTF (see Fig. 15) cannot be associated with any ray-traced path in the corresponding pinna. It is therefore plausible to hypothesize that in these few cases the generation mechanism for N1 does not correspond to a contribution from the helix. As previously pointed out by Mokhtari et al. [14], there exists no established method for correctly labeling transfer function peaks (or notches, in our case) depending on frequency and spatial location. Since different physical mechanisms might appear in different pinnae (such as the presence or absence of a cavum-fossa vertical mode), these notches labeled as N1 might share the same generation mechanism as a higher-order notch in a different HRTF.
As for higher-order notches, it has been confirmed that a simple model such as ours is not able to replicate their trend along the elevation angle. This is in line with our previous results [43], [44], which pointed out the inaccuracy of linear regression models in predicting higher-order notches from both global and elevation-dependent anthropometric parameters. Conversely, this seems to contradict the findings in [13], where a median N2 frequency mismatch value as low as 5.3% was found. Nevertheless, that finding might have been biased by the optimization of the assumed microphone location, as previously mentioned. In general, our computational model calculates a discrete number of paths with very different path lengths associated with the concha wall for the same direction, while in practice, as pointed out by Lopez-Poveda and Meddis [21], significant reflections occur on an infinite number of points along the posterior concha wall for all source locations. In this regard, a more detailed physical model of the concha might be necessary to explain how these notches are produced and how they can be adequately approximated.
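To make the sensitivity to path length concrete, the sketch below uses the textbook delay-and-add picture of a single reflection: the direct sound is summed with one delayed copy, and notches fall where the two cancel. This is a generic illustration and not the computational model of this paper; the speed of sound, the function name, the choice of reflection sign, and the example path lengths are all assumptions.

```python
SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature


def reflection_notches(extra_path_m, max_hz=20000.0, negative_reflection=False):
    """Notch frequencies (Hz) of direct sound plus one delayed reflection.

    extra_path_m is the additional distance travelled by the reflected ray
    relative to the direct one. With a positive reflection coefficient the
    notches fall at odd multiples of 1/(2*delay); with a negative one they
    fall at integer multiples of 1/delay (the trivial one at 0 Hz is skipped).
    """
    if extra_path_m <= 0:
        raise ValueError("extra path length must be positive")
    delay = extra_path_m / SPEED_OF_SOUND
    notches = []
    n = 0
    while True:
        if negative_reflection:
            freq = (n + 1) / delay
        else:
            freq = (2 * n + 1) / (2 * delay)
        if freq > max_hz:
            break
        notches.append(freq)
        n += 1
    return notches


# Hypothetical extra path lengths in metres: small differences in length
# move the first notch by several kHz.
for extra_path in (0.020, 0.030, 0.040):
    first = reflection_notches(extra_path, negative_reflection=True)[0]
    print(f"{extra_path * 100:.1f} cm -> first notch near {first:.0f} Hz")
```

Under this simplified picture, a handful of discrete paths of very different lengths maps to widely scattered notch frequencies, whereas a continuum of reflection points along the concha wall would not reduce to a few isolated delays.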
Conclusion
The simple computational model proposed in this paper helped us gain a better understanding of the mechanisms that generate spectral notches in experimental HRTFs. The role of the helix in producing N1, previously hypothesized in [13], is hereby confirmed by notch frequency predictions that closely match experimental data, with a median mismatch of 3.3% across 13 of the 20 HRTF sets. Conversely, higher-order notches, most likely associated with interactions within the concha, would require the development of an alternative model of the concha including diffraction and reflection effects to be fully interpreted.
In its current form, the computational model is able to predict the evolution of N1 along the median plane from individual 3D pinna morphology. While it is not sufficient to infer the individual HRTF, the model can help select the closest non-individual HRTF in a dataset by minimizing the error between non-individual and predicted N1. As a matter of fact, it has been previously found that N1-based HRTF selection improves localization performance with respect to generic HRTFs [63], [64]. Assessing whether this also applies to our computational model through individual localization tests is planned as future work.
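A selection step of this kind could be sketched as follows. The error measure (mean relative absolute error over the elevations where both tracks contain a notch), the use of None for elevations with no notch, the function names, and all numerical values are assumptions for illustration rather than the procedure used in [63], [64].

```python
import math


def n1_error(candidate_hz, predicted_hz):
    """Mean relative absolute error between a candidate and a predicted N1 track.

    Tracks are sequences aligned by elevation; None marks an elevation where
    no notch is available. Only elevations present in both tracks are
    compared. Returns infinity when the tracks do not overlap at all.
    """
    pairs = [
        (c, p)
        for c, p in zip(candidate_hz, predicted_hz)
        if c is not None and p is not None
    ]
    if not pairs:
        return math.inf
    return sum(abs(c - p) / p for c, p in pairs) / len(pairs)


def select_closest_hrtf(dataset_n1, predicted_hz):
    """Return the label of the dataset entry whose N1 track is closest."""
    return min(
        dataset_n1,
        key=lambda label: n1_error(dataset_n1[label], predicted_hz),
    )


# Hypothetical N1 tracks in Hz at four elevations (not measured data).
predicted = [6000.0, 6400.0, 6900.0, None]
dataset = {
    "set_1": [7000.0, 7500.0, 8000.0, 8500.0],
    "set_2": [6100.0, 6300.0, None, 7400.0],
    "set_3": [None, None, None, 9000.0],
}
best = select_closest_hrtf(dataset, predicted)
print(best)
```

Comparing only the overlapping elevations keeps short N1 tracks usable, at the cost of favoring candidates that overlap at only a few well-matched elevations; a minimum-overlap requirement would be a natural refinement.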
We believe that the HRTF dataset presented in this paper can be useful for studying the effect of the pinna on spatial sound perception. Although the amount of time and resources available only allowed us to produce and test a limited number of artificial pinna samples, the casting procedure presented herein can be applied to any artificial head, and preliminary investigations suggest the possibility of extending it to human heads. It is also worth remarking that artificial pinnae can be easily replicated from the available negative molds, allowing modifications to pinna structures (e.g., removing the helix or filling cavities) in order to further understand the physical mechanisms generating spectral cues. Ultimately, a larger sample would enable the application of more advanced machine learning techniques for extracting the relevant anthropometric parameters from 3D pinna models and relating them to HRTF features.