Learning Semantic-Agnostic and Spatial-Aware Representation for Generalizable Visual-Audio Navigation

Visual-audio navigation (VAN) is attracting growing attention from the robotics community due to its broad applications, e.g., household robots and rescue robots. In this task, an embodied agent must search for and navigate to the sound source using egocentric visual and audio observations. However, existing methods are limited in two aspects: 1) poor generalization to unheard sound categories; 2) low sample efficiency in training. Focusing on these two problems, we propose a brain-inspired plug-and-play method to learn a semantic-agnostic and spatial-aware representation for generalizable visual-audio navigation. We design two auxiliary tasks that respectively accelerate the learning of representations with these two characteristics. With these two auxiliary tasks, the agent learns a spatially-correlated representation of visual and audio inputs that can be applied to environments with novel sounds and maps. Experimental results on realistic 3D scenes (Replica and Matterport3D) demonstrate that our method achieves better generalization performance when zero-shot transferred to scenes with unseen maps and unheard sound categories.


I. INTRODUCTION
Embodied agents should be able to navigate to different locations to complete downstream tasks such as goal-specific tidying and delivering items. Most robot navigation is currently limited to pure visual input from scenes [1]-[5]. From a bionic perspective [6]-[8], humans can integrate audio information with visual observations to improve their ability to perceive objects and scenes, such as locating an object that is out of sight [9]. Consequently, it is advisable for an intelligent agent to learn how to perceive and leverage multi-modal information, including vision and audio, to achieve better navigation performance.

With the recent development of the SoundSpaces [10] simulation environment, researchers have begun to study how to leverage both audio and visual information for navigation [10]-[12]. In the visual-audio navigation (VAN) task, testing sets include heard and unheard sound categories: 1) heard sound categories are those that also appear in the training set, and 2) unheard sound categories are those the agent never hears during training. For training sets, in some specific and typical scenes, we can provide almost any kind of sound that might be present in these scenes. For example, in a restaurant, a service robot may only need to learn to listen to service bells and customer greetings. However, in some atypical and complex scenes, we cannot provide all possible sound categories to learn because of the wide range of sounds that the agent will confront; for example, a guard robot should be able to react to odd sounds, activate the guard procedure, and find where the odd sounds occur. Therefore, intelligent agents need to handle unheard sound categories. Even though the state-of-the-art (SOTA) methods attain a ∼90% success rate [10], [11] in Replica environments [13] with heard sound categories, their success rates drop to ∼50% when navigating to unheard sounds. Besides, existing methods train the agent in a simulator with pure reinforcement learning losses (e.g., critic loss and actor loss) and thus need about 3M∼13M steps to converge due to low sample efficiency, which takes several days. It is therefore important to develop an algorithm with high sample efficiency for this task.
Humans are sensitive to sounds, and even infants who know nothing about sound categories can perceive the general orientation of sound [14]. Motivated by this observation, in this paper, we refer to the human auditory processing mechanism. A dual-pathway model of auditory processing exists in the human brain where sound semantic information (what path) and sound spatial information (where path) are segregated into different brain areas [15]-[18]. Semantic information contains sound category and other category-related information, such as the percussive feeling of metal [19]. Spatial information includes the distance and direction of sounds and other location-related information, such as the phase difference between two ears [16], [20]. Semantic information changes with the sound category, leading to difficulties in learning generalizable semantic representations of unheard sound categories. In contrast, spatial information does not change [21]-[23], enabling the potential for generalizing to unheard sound categories. As a result, we opt to maintain different attention levels to different information in the features, i.e., to neglect semantic information and enhance spatial information.
Concretely, based on the human auditory mechanism, we propose a plug-and-play method encouraging agents to learn task-relevant representations from multi-modal inputs. To improve sample efficiency and generalization in the VAN task, we design two auxiliary tasks that provide additional training signals. These two tasks enable the agent to discover the intrinsic spatial correlations between visual and audio inputs, which makes it possible to apply the learned representation to environments with unseen sounds and maps. In one auxiliary task, we use a gradient reversal layer to create an adversarial relationship between an audio encoder and an audio classifier to ignore semantic information. In the other auxiliary task, we use temporal information from visual and auditory inputs to predict the relative direction of a sound, thereby enhancing spatial information. Because our method is plug-and-play, it can be applied to various VAN backbone algorithms using the same settings. In our experiments, we use two SOTA algorithms, AV-Nav [10] and AV-Wan [11], as the backbones. We demonstrate the superiority of our proposed method on two realistic 3D scene datasets, Replica [13] and Matterport3D [24], with strong generalization to scenarios with unheard sound categories and fewer training steps. In summary, our contributions are listed as follows:
1) We observe that paying different attention to semantic and spatial components in sounds can improve the sample efficiency and the generalization of visual-audio navigators on unheard sound categories.
2) We design two auxiliary tasks. One task uses an adversarial mechanism to neglect semantic information, and the other task predicts a relative direction to enhance spatial information.
3) The experiments on two sets of realistic 3D scenes, Replica and Matterport3D, show that our method can achieve better generalization performance in fewer training steps.

II. RELATED WORK
Visual-Audio Navigation. In this task, an agent should navigate to the sound source by utilizing egocentric visual and audio observations. The task is challenging because of the complexity of the room structure itself and its effect on sound propagation, so the agent cannot precisely estimate the loudness and direction of the sound to make decisions. Several existing studies [10]-[12], [25], [26] demonstrate the importance of fusing visual and audio modalities in navigation tasks and show good performance in scenes with heard sound categories. Some works [10]-[12] do not explicitly focus on sound semantics and perform better on heard sound categories than on unheard sound categories. Semantic-aware methods [25], [26] explicitly exploit the sound semantic information and learn the association between semantic information and scene representations to reason about the sound source location, e.g., hearing water dripping means the agent may need to go to the kitchen or bathroom. However, these semantic-aware methods [25], [26] can only deal with heard sound categories, including heard sound instances and unheard sound instances, while our method focuses on the generalization towards unheard sound categories. We argue that neglecting semantic information enhances the navigation generalization on unheard sound categories and does little harm to, or even improves, the performance on heard sound categories.
Auxiliary Task. Training a reinforcement learning (RL) agent with auxiliary tasks is not a new concept. Auxiliary tasks are commonly used to improve the sample efficiency and attempt to build up state representations by predicting supplemental variables about important aspects of RL tasks, such as terminal state prediction [27], agent modeling [28]-[30], return prediction [31], [32], and depth prediction [33]. Designing auxiliary tasks for a specific goal can be challenging, especially when the input contains multiple modalities. It is important to ensure consistency between the auxiliary tasks and the main task; otherwise, the auxiliary tasks will only train the agent to accomplish the auxiliary goals or hinder performance on the main task. Our method introduces two auxiliary tasks for visual-audio navigation by referring to the human auditory mechanism. One is to predict the relative direction between the agent and the sound source location. The other is to force the agent to omit semantic information in sounds by adversarial learning.

III. METHOD
We follow the basic settings in AV-Nav [34] and AV-Wan [11] for the AudioGoal Navigation task. The task initializes an agent in the environment (a scene with single or multiple rooms) without the map of the environment. In each episode, a sound source is set in the environment, continuously emitting sounds that the agent can receive. The agent is required to navigate to the sound source using visual and audio information. All initial settings for the episodes are pre-generated, including the agent's initial position, the location of the target sound, the category of sound, and the room used for navigation, in order to avoid overly simplistic episodes.
In order to improve sample efficiency and make the navigation policy generalizable to unheard sound categories, we focus on extracting the generalizable components of the sounds, referring to the human auditory mechanism. The contents of sounds contain two main components: semantic information and spatial information. When the sound source location and robot position remain constant, the semantic information changes with the sound category, but the spatial information remains the same. Our method is therefore composed of two main tasks for learning the generalizable representation: 1) Semantic-Agnostic Learning (denoted in green in Fig. 2), which learns a semantic-agnostic representation by an adversarial mechanism between the audio encoder and the audio classifier, and 2) Spatial-Aware Learning (denoted in red in Fig. 2), which learns a spatial-aware representation by predicting the angle of the sound relative to the agent from a temporal representation containing visual and auditory information.
Since the initial settings for each episode are pre-generated rather than randomly selected at the beginning of the episode [10], without Semantic-Agnostic Learning, the navigation policy will implicitly memorize the sounds used in each training episode (i.e., over-fit on the training episodes), so its generalization will be weakened. Without Spatial-Aware Learning, Semantic-Agnostic Learning may mistakenly neglect the spatial information, so that the agent also ignores it (in the most extreme case, the audio encoder outputs the same features for any audio input).
The additional processing of representations by these two tasks allows the agent to learn task-relevant features much faster, thus improving sample efficiency.

A. Semantic-Agnostic Learning
When receiving a sound, a human may not know exactly what the sound category is but can estimate the sound source location [35]-[37], and even an infant who knows nothing about the world can roughly localize the sound source [38], [39], which shows that spatial information alone is sufficient for humans to locate sounds. Inspired by the research above, we argue that in AudioGoal navigation tasks for intelligent agents, spatial information of the sound is enough for locating and perceiving the sound. Because semantic information changes with the sound category, it increases the difficulty for agents to learn generalizable semantic representations. Moreover, for some atypical scenes (e.g., guard robots facing odd sounds), sounds and scenes are not closely related. Therefore, learning semantic-agnostic representations should not harm the navigation performance on heard sound categories but could enhance the generalization to unheard sound categories.
Concretely, learning semantic-agnostic representations means that, with the agent fixed in one location and the sound source fixed in another, the method outputs the same representation for sounds with different semantics. To equip the learned representations with the semantic-agnostic property, we design an auxiliary task in which an audio encoder needs to weaken the ability of the audio classifier to distinguish the current sound semantic category, while the audio classifier attempts to distinguish the sound semantic category corresponding to an audio feature. The adversarial training forces the audio encoder to learn semantic-irrelevant representations.
Therefore, we use an adversarial mechanism between an audio encoder parameterized by θ_A and a 4-layer fully connected network audio classifier (AC) parameterized by θ_C. To implement this adversarial mechanism, we employ a gradient reversal layer [40] between the audio classifier and the audio encoder, which multiplies the gradient flow by a factor −λ. Here λ reflects the adversarial intensity and is scheduled from n, the number of currently completed episodes, N, the total number of episodes, and b, the bound of the adversarial intensity. The parameters are optimized as follows:

$$\theta_C \leftarrow \theta_C - \mu \frac{\partial L_C}{\partial \theta_C}, \qquad \theta_A \leftarrow \theta_A - \mu \left( \frac{\partial L_O}{\partial \theta_A} - \lambda \frac{\partial L_C}{\partial \theta_A} \right),$$

where µ denotes the learning rate, L_C denotes the Cross Entropy Loss, and L_O denotes the other losses related to θ_A, such as the Actor and Critic Loss in reinforcement learning.
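As a minimal sketch of the adversarial update described above, the following Python snippet ramps λ with training progress and applies one reversed-gradient step. The sigmoid ramp is the form commonly used with gradient reversal layers [40], here scaled by b; that particular schedule, the constant `gamma`, and all function names are assumptions for illustration and may differ from the exact formula used in this work.

```python
import math


def adversarial_intensity(n, total, bound, gamma=10.0):
    """Assumed schedule: ramp lambda from 0 towards `bound` as n/total goes 0 -> 1."""
    progress = n / total
    return bound * (2.0 / (1.0 + math.exp(-gamma * progress)) - 1.0)


def adversarial_step(theta_a, theta_c, grad_lo_a, grad_lc_a, grad_lc_c, lam, mu):
    """One plain gradient step on parameter lists.

    The classifier (theta_c) descends on its cross-entropy loss L_C.
    The encoder (theta_a) descends on the other losses L_O but receives the
    classifier gradient with its sign flipped and scaled by lam.
    """
    new_c = [c - mu * g for c, g in zip(theta_c, grad_lc_c)]
    new_a = [
        a - mu * (g_o - lam * g_c)
        for a, g_o, g_c in zip(theta_a, grad_lo_a, grad_lc_a)
    ]
    return new_a, new_c
```

With this sign convention, the same classifier gradient that lowers L_C for the classifier raises it for the encoder, which is what pushes the encoder towards features the classifier cannot separate by category.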

B. Spatial-Aware Learning
Semantic-agnostic learning ignores navigation-irrelevant information but does not encourage the agent to learn navigation-relevant representations. Although reinforcement learning provides reward signals to help the agent extract navigation-relevant features, during the initial exploration phase the agent may rarely encounter reward signals, yet it can rapidly learn to neglect the semantic information of the sound from the adversarial audio classifier to minimize the adversarial optimization objective. This rapid learning could lead to the audio encoder incorrectly ignoring spatial information as well, resulting in its output being insensitive to changes in the agent's position. In that case, the agent cannot navigate to the sound source. Predicting sound location as an auxiliary task can effectively provide an additional training signal to help the agent extract spatial information and assist in navigation policy learning.
We use a 4-layer fully connected network as the location predictor (LP), with temporal features generated by a Time-series Model as input, to predict the pitch and yaw angles of the sound source relative to the agent, denoted as β and α in Fig. 1, respectively. In practice, we do not predict the angle directly but predict the sine and cosine of the angle. The sine and cosine predictions avoid the non-uniqueness caused by the periodicity of the angle. We use the Mean Squared Error loss as the auxiliary loss function. The gradients generated by the loss of the LP are used to update the Audio Encoder, the Visual Encoder, and the Time-series Model. These models can thus learn to extract features containing spatial information so that RL's actor and critic can learn the navigation policy better.
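The benefit of regressing sine and cosine instead of the raw angle can be illustrated with a short sketch. The target layout (yaw sine, yaw cosine, pitch sine, pitch cosine) and the function names below are assumptions chosen for illustration, not the layout used in the actual implementation.

```python
import math


def encode_angle(angle_rad):
    """Map an angle to (sin, cos), which is continuous across the +/- pi wrap."""
    return (math.sin(angle_rad), math.cos(angle_rad))


def decode_angle(sin_value, cos_value):
    """Recover an angle in (-pi, pi] from its sine and cosine."""
    return math.atan2(sin_value, cos_value)


def mse(pred, target):
    """Mean squared error between two equally long sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def location_loss(pred, yaw_rad, pitch_rad):
    """Auxiliary loss for a 4-value prediction (assumed target layout)."""
    target = encode_angle(yaw_rad) + encode_angle(pitch_rad)
    return mse(pred, target)
```

Two directions just either side of the wrap-around, such as π − 0.01 and −π + 0.01, differ by almost 2π as raw angles but have nearly identical (sin, cos) targets, so the regression loss stays small for physically close directions.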

C. Training Details
We use SoundSpaces [10] as our simulator, enabling realistic audio rendering. The SoundSpaces simulator discretizes scenes into uniformly distributed navigability graphs so that the agent can only move from one node to a navigable neighboring node in the graph. There are no nodes where there are obstacles. Thus the action space A has only four actions: MoveForward, TurnLeft, TurnRight and Stop. SoundSpaces removes episodes where the distance from the start position to the target position is less than 4 m and episodes where the shortest path is almost a straight line (ratio of geodesic to Euclidean distance less than 1.1).
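The episode filter described above can be sketched as a simple predicate. The function name is hypothetical, and the sketch assumes that the 4 m threshold applies to the geodesic distance, which the text does not state explicitly.

```python
def keep_episode(geodesic_m, euclidean_m, min_distance_m=4.0, min_ratio=1.1):
    """Return True if an episode passes both filters described in the text.

    Assumption: the minimum start-to-goal distance is measured geodesically.
    """
    if geodesic_m < min_distance_m:
        return False  # start and goal are too close
    if euclidean_m > 0 and geodesic_m / euclidean_m < min_ratio:
        return False  # shortest path is almost a straight line
    return True
```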
Since we apply our method to AV-Nav [10] and AV-Wan [11], we follow the design of their reward functions: the agent receives a +10 reward if it executes the action Stop at the sound source location, a +1 reward on AV-Nav or a +0.25 reward on AV-Wan if it reduces the geodesic distance to the sound source location, an equivalent penalty if it increases the geodesic distance, and a −0.01 time penalty.
We train all learnable models jointly with Proximal Policy Optimization (PPO) [41]. Each episode is limited to 150 steps, and the success criterion is met if the agent executes the action Stop at the sound position within 150 steps.
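A minimal sketch of this reward, written directly from the description above, is given below. The function name and signature are hypothetical, and the sketch assumes that the time penalty is applied at every step and that the individual terms are summed.

```python
def step_reward(prev_dist, new_dist, stopped_at_goal,
                progress_reward=1.0, success_reward=10.0, time_penalty=0.01):
    """Reward for one step.

    progress_reward is 1.0 for the AV-Nav setting and 0.25 for AV-Wan.
    Assumption: the time penalty is charged on every step.
    """
    reward = -time_penalty
    if stopped_at_goal:
        reward += success_reward
    if new_dist < prev_dist:
        reward += progress_reward   # moved closer (geodesic distance)
    elif new_dist > prev_dist:
        reward -= progress_reward   # moved away: equivalent penalty
    return reward
```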

IV. EXPERIMENTS
A. Experiment Settings
Environments and Datasets. We use the same audio and visual dataset and train/val/test splits as AV-Nav [10] and AV-Wan [11] to demonstrate the improvement of our method. We use the same simulator, SoundSpaces [10], with two real-world 3D scene datasets, Replica and Matterport3D (MP3D), for training and testing our method, along with train/val/test splits of 73/11/18 sound categories. Replica is a relatively small scene dataset with an average area of 47.24 m² and train/val/test splits of 9/4/5 scenes. Matterport3D has relatively large scenes with an average area of 517.34 m² and train/val/test splits of 57/10/12 scenes. We also follow the basic configuration and hyper-parameters from AV-Nav and AV-Wan and only use depth maps as visual information.
Metrics. We evaluate our method on the following metrics:
1) Success Rate (SR): the fraction of successful episodes.
2) Success Weighted by Path Length (SPL) [42]: we weigh the success by the ratio of the shortest path length to the executed path length.
3) Success Weighted by Number of Actions (SNA) [11]: we weigh the success by the ratio of the minimum number of actions to the number of actions executed.
We use the model with the highest SPL on the validation set for testing and reporting the table results.
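For concreteness, the three metrics can be computed as in the sketch below. It follows the standard SPL definition [42], in which each successful episode contributes the optimal cost divided by the larger of the optimal and actual cost; applying the same form to action counts for SNA is an assumption here, and the function names are illustrative.

```python
def success_rate(successes):
    """Fraction of successful episodes."""
    return sum(1 for s in successes if s) / len(successes)


def success_weighted(successes, optimal, actual):
    """Average of success * optimal / max(actual, optimal) over episodes."""
    total = 0.0
    for ok, best, used in zip(successes, optimal, actual):
        if ok:
            total += best / max(used, best)
    return total / len(successes)


def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by path length."""
    return success_weighted(successes, shortest_lengths, path_lengths)


def sna(successes, min_actions, taken_actions):
    """Success weighted by number of actions (assumed to mirror SPL)."""
    return success_weighted(successes, min_actions, taken_actions)
```

Unlike SPL, SNA also penalizes in-place actions such as turning, which do not lengthen the path but still cost time.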
Baselines. We compare our method with the following baselines:
1) Random: an agent randomly selects an action in action space A. The episode ends when executing Stop.
2) Direction Follower (DF) [11]: this method pretrains a model to predict the direction of arrival (DoA). An agent sets an intermediate goal K meters away in the predicted direction and plans to navigate there. We set K = 2 in Replica and K = 4 in Matterport3D.
3) AV-Nav [10]: a state-of-the-art VAN method that makes decisions using visual-audio fusion features with temporal sequences.
4) AV-Wan [11]: a state-of-the-art VAN method that builds geometric and acoustic maps and uses them to predict an intermediate goal adaptively. AV-Wan uses the Dijkstra [43] shortest path algorithm to compute the path from the current node to the intermediate goal.

B. Quantitative Comparison
We apply our method to AV-Nav [10] and AV-Wan [11] and test the baselines and our method, referred to as Ours+AV-Nav and Ours+AV-Wan, on unheard sound categories in Tab. I.
Random performs poorly on both datasets, showing the difficulty of the task and that the robot must make good use of visual and audio cues. Direction Follower uses only audio information for decision making, while visual information is only used for path planning, so Direction Follower performs worse than the methods that fuse information from both modalities to make decisions. After applying our method, AV-Nav and AV-Wan achieve significant improvements on the Replica and Matterport3D datasets on unheard sound categories, showing that our method works well for different backbone algorithms and datasets. In particular, on Replica, our method gains about 50% SPL improvement over the previous works. The results on AV-Nav and AV-Wan demonstrate the advantages of our method, where we optimize the features and represent them in a more task-specific manner. We also test our method on heard sound categories; the results are reported in Tab. I.
Considering that there exist domain gaps between the real world and the simulator, such as audio and depth noise, we add these two kinds of noise to the environment to simulate the real world and demonstrate the robustness of our method, following the setting of audio noise and depth noise from AV-Wan [11]. We conducted experiments on noise levels ranging from 20 to 50, with intervals of 10. Notice that, while AV-Wan [11] only uses telephone as the target sound in the noise experiments, our work focuses on the generalization ability towards unheard sound categories, so we use all the sound categories in the testing set as target sounds instead. The results are shown in Tab. III. Even with different noise levels, our method still improves the performance of the previous works. With different levels of noise, the performance of our method shows no significant degradation and exhibits strong robustness. This robustness to noise indicates that our method has the potential to be used in the real world.

C. Sample Efficiency and Learning Curve
To demonstrate our method's high sample efficiency, we show the learning curves on the testing set on Replica and Matterport3D with both AV-Nav and AV-Wan as backbones. Fig. 3 shows that our method can achieve higher performance than the final results of the previous works, with fewer samples than the previous works need to converge. We compare the number of samples required by ours and the previous works, using the highest point of the previous works as a benchmark. In Fig. 3 (a), (c), and (d), our method requires fewer samples, and the performance still grows as the samples grow. In Fig. 3 (b), although there is no significant sample difference between ours and the previous works, our method is more stable in the later stages and the performance continues to grow.
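The comparison described above, which finds the first point at which a learning curve reaches the best SPL of the previous work, can be sketched as follows. The function name and the curve format (a list of step and SPL pairs sorted by step) are assumptions for illustration, and the numbers in the usage example are made up, not results from this work.

```python
def first_step_reaching(curve, benchmark):
    """Return the first training step whose SPL is >= benchmark, else None.

    `curve` is a list of (step, spl) pairs sorted by step.
    """
    for step, value in curve:
        if value >= benchmark:
            return step
    return None


# Made-up curves for illustration only (not results from the paper).
baseline_curve = [(1, 0.10), (2, 0.20), (3, 0.30), (4, 0.28)]
ours_curve = [(1, 0.15), (2, 0.31), (3, 0.35), (4, 0.40)]

benchmark = max(value for _, value in baseline_curve)
baseline_step = first_step_reaching(baseline_curve, benchmark)
ours_step = first_step_reaching(ours_curve, benchmark)
```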

D. Trajectory Visualizations
We visualize the trajectories of our method and AV-Nav under four categories of sounds in Fig. 4. We refer to the same agent start position and the same sound source location within the same scene as the same task. To view the trajectory generation process, please watch the attached video. In the first row of Fig. 4, our method produces trajectories that are close to the shortest path and consistent with each other across the different sounds. In the second row, however, AV-Nav either fails to complete the task or produces trajectories that are very complex and inconsistent.
We also visualize the trajectories in different scenes in Fig. 5. Our method generates more efficient trajectories than AV-Nav across different scenes.

Fig. 5. Trajectory Visualization for Different Scenes. We visualized navigation trajectories using our method and AV-Nav in various scenes. The name at the bottom represents the scene. In each episode, the agent needs to navigate from the yellow point to the red point. Agent path fades from dark blue to light blue as time goes by. Green is the shortest geodesic path in continuous space. The first row shows our results, and the second row shows the results from AV-Nav. AV-Nav may fail in some episodes, e.g., the second and third columns, or take a complex route, e.g., the first, third, and fourth columns. Our method finds a good path to the end point in all four episodes.

E. Ablation Studies and Analysis
Tab. II shows the ablation results of the audio classifier and the location predictor components of our method. Removing either the audio classifier or the location predictor leads to a reduction in performance. Notably, removing the location predictor hurts the performance more than removing the audio classifier does in Matterport3D. Compared to Replica, scenes in Matterport3D have bigger areas; thus, spatial information is more helpful in completing tasks in Matterport3D.
In addition, the audio classifier provides an adversarial training mechanism, which implicitly boosts the model's generalization by forcing the model to ignore the semantic information of the audio inputs. Meanwhile, the model can benefit from the auxiliary localization task's additional training signals and directly improve navigation performance.

V. CONCLUSION AND DISCUSSION
This work focuses on the generalization and sample efficiency problem for VAN tasks. The different properties of spatial and semantic information inspired us to reduce the generalization gap between unheard and heard sound categories and learn task-relevant representations fast. Therefore, we propose a plug-and-play method to narrow the performance gap on unheard and heard sound categories by neglecting semantic information while enhancing spatial information. Evaluations on Replica and Matterport3D show that our method significantly outperforms the baseline on the unheard sound categories and slightly improves the heard sound categories. Learning curves show that our method has better sample efficiency than baselines. We also conducted audio and depth noise experiments to demonstrate the robustness of our method to depth image noise and varying levels of audio noise. The results show that our method performs well even with noisy inputs.
In the future, we will further explore methods to enhance the generalization in more challenging visual-audio navigation settings, e.g., real-world development and complex environments. 1) Real-world development (sim2real transfer) involves the challenging task of transferring reinforcement learning models trained in simulated environments to real robots. Due to the significant sim2real gap in both audio and visual modalities, conducting experiments in the real world remains difficult. To overcome this challenge, we must address the discrepancy between simulation and reality and improve the model's generalization ability. One potential solution is to apply bi-directional domain adaptation to align the feature distributions of simulation and reality during training. Additionally, exploring meta-reinforcement learning algorithms may enable the agent to efficiently mitigate domain drift during test time. 2) In complex environments, the agent must handle interference from multiple sound sources and uncertainty from moving sound. To tackle scenarios with multiple sound sources at similar volume levels, we can leverage semantic information and sound source separation algorithms [44], [45] to filter out the target sound source as input to the navigator. Moreover, we can augment the training process with a multi-agent game [46] to automatically generate diverse and challenging distracting or moving sources, further enhancing the robustness of the system.

Fig. 1. Problem Setting. The robot should navigate to the sound source location with the visual-audio observation, no matter what category of sound is being played. In this example, the agent is in the bedroom initially and locates the sound in its front-left direction. α and β are the yaw and pitch angles of the sound source relative to the agent.

Fig. 2. Training Pipeline. At each time step t, our method uses depth images (D_t) and spectrograms (A_t) as inputs for navigation. During the training procedure, an Audio Classifier (AC, parameterized by θ_C) forces the model to neglect semantic information via adversarial training supervised by L_C. Concurrently, the temporal features (O_t) are given to a Location Predictor (LP) to extract the sound source direction (α, β), supervised by L_P. α and β are the yaw and pitch angles of the sound source relative to the agent. Action Selection samples from the probability distribution generated by the Actor to obtain action a_t. After executing a_t in the environment, the environment returns a reward signal r_t. At the end of each RL epoch, we train the Audio Encoder (parameterized by θ_A), the Audio Classifier and the Location Predictor simultaneously.

Fig. 3. Learning Curve on testing sets. We plot the testing results of the previous works and ours during training in both Replica and Matterport3D environments with AV-Nav and AV-Wan as backbones, respectively. We plot a horizontal dashed purple line across the highest SPL value of the previous works as a benchmark. We also draw vertical dashed lines for the previous works and ours in their corresponding colors, to indicate where their SPL values are greater than or equal to the benchmark for the first time. Our method can outperform the previous works with fewer training samples.

Fig. 4. Trajectory Visualization for different sound categories. We visualized agent trajectories using our method and AV-Nav, respectively, with the same set of start and end position episodes in the same scene. In each episode, the agent needs to navigate from the yellow point to the red point. The name at the bottom represents the category of sound, which means that each column has a different sound. Agent path fades from dark blue to light blue as time goes by. Green is the shortest geodesic path in continuous space. We aim to show that our method yields the same trajectory for different sound categories, which shows that the features we learn are indeed semantic-agnostic. The first row shows our results, and the second row shows the results from AV-Nav. AV-Nav may fail in some episodes, e.g., the first three columns, and run quite differently when navigating to different sounds, while our method navigates to the goal in all four episodes and keeps its trajectory consistent in these episodes.

TABLE I
TESTING RESULTS ON HEARD AND UNHEARD SOUND CATEGORIES. WE APPLY OUR METHOD TO AV-NAV AND AV-WAN AND OBTAIN HIGHER QUANTITATIVE RESULTS. THE SPL AND SNA SHOW THAT OUR METHOD IMPROVES THE EFFICIENCY OF THE PREVIOUS WORKS, ALLOWING THE AGENT TO CHOOSE A SHORTER PATH (HIGHER SPL) AND A FASTER PATH (HIGHER SNA) TO REACH THE SOUND LOCATION.

TABLE II
ABLATION STUDY FOR OUR METHOD ON AV-NAV. WE APPLY OUR METHOD TO AV-NAV AND PERFORM AN ABLATION STUDY ON TWO COMPONENTS OF OUR METHOD, AUDIO CLASSIFIER (AC) AND LOCATION PREDICTOR (LP), ON TESTING SETS OF REPLICA AND MATTERPORT3D DATASETS. THREE METRICS ARE COMPARED, INCLUDING SR, SPL, AND SNA.

TABLE III
AUDIO NOISE EXPERIMENTS. WE SHOW THE SPL IN THE EXPERIMENTS WITH DIFFERENT LEVELS OF NOISE (FOLLOWING THE NOISE SETTINGS FROM AV-WAN [11]). OUR METHOD STILL OUTPERFORMS THE PREVIOUS WORKS IN MOST CASES, AND THE PERFORMANCE DOES NOT SHOW A LARGE DEGRADATION COMPARED TO THE NOISE-FREE EXPERIMENTS, SHOWING OUR METHOD'S ROBUSTNESS TO NOISE. WE PRESENT THE AVERAGE SPL ON DIFFERENT NOISE LEVELS IN THE ADDITIONAL COLUMNS.