How Robust are Audio Embeddings for Polyphonic Sound Event Tagging?

Sound classification algorithms are challenged by the natural variability of everyday sounds, particularly for large sound class taxonomies. In order to be applicable in real-life environments, such algorithms must also be able to handle polyphonic scenarios, where simultaneously occurring and overlapping sound events need to be classified. With the rapid progress of deep learning, several deep audio embeddings (DAEs) have been proposed as pre-trained feature representations for sound classification. In this article, we analyze the embedding spaces of two non-trainable audio representations (NTARs) and five DAEs for sound classification in polyphonic scenarios (sound event tagging) and make several contributions. First, we compare general properties like the inter-correlation between feature dimensions and the scattering of sound classes in the embedding spaces. Second, we test the robustness of the embeddings against several audio degradations and propose two sensitivity measures based on a class-agnostic and a class-centric view on the resulting drift in the embedding space. Finally, as a central contribution, we study how a blending between pairs of sounds maps to embedding space trajectories and how the path of these trajectories can cause classification errors due to their proximity to other sound classes. Throughout our analyses, the PANN embeddings have shown the best overall performance for low-polyphony sound event tagging.


I. INTRODUCTION
The ability to recognize sounds is of vital importance for navigating through different everyday environments. Each environment comes with its unique set of sounds whose detection and categorization is an essential part of auditory scene analysis. While sound event detection aims at localizing sound events in time, sound event tagging (SET) focuses solely on classifying all sound classes occurring in a given scene [1].
Sounds from the same class can exhibit large differences in timbre, duration, and loudness. This intrinsic variability already makes the classification of isolated sound events (sound event classification) a challenging task. Real-life environments are often characterized by multiple sound sources, which are audible at the same time. In this article, we focus on the challenge of classifying overlapping sounds in such scenarios, a task which we refer to as sound event tagging (SET). Deep neural networks, which are a core component of state-of-the-art sound classification and tagging algorithms, require large amounts of training data if they are trained in a supervised fashion. In many application scenarios, however, only limited amounts of annotated data are available. Transfer learning has been successfully used for SET [2], [3], [4], [5], [6] to pre-train deep neural networks on large datasets and later fine-tune them for novel (down-stream) tasks with limited amounts of training data. The intermediate layer representations of such networks (embeddings) have been shown to be powerful features for several audio classification tasks [7] and related tasks such as audio source separation [8] and acoustic scene classification [9].

Jakob Abeßer is with the Fraunhofer IDMT, 98693 Ilmenau, Germany, and also with the International Audio Laboratories Erlangen, 91058 Erlangen, Germany (e-mail: jakob.abesser@idmt.fraunhofer.de).

Digital Object Identifier 10.1109/TASLP.2023.3293032
As the main contribution of this article, we compare various (pre-trained) audio embeddings for SET, i. e., sound event classification in polyphonic scenarios. We focus our investigations on lower sound polyphony degrees and study how mixtures of different sound classes are represented in the embedding spaces of two non-trainable audio representations (NTARs) and five deep audio embeddings (DAEs), which are pre-trained and then applied for SET. Second, we test the robustness of the embeddings against several types of audio degradations, which are common in real-life sound monitoring applications. To this end, we propose to measure the resulting drift in the embedding space both from a class-agnostic and from a class-centric view. Finally, as a central contribution, we investigate how overlapping sounds are represented in the embedding spaces. For this, we implement a continuous blending between sound pairs and study the resulting trajectories in the embedding space. We aim to understand how the path of these trajectories can cause sound misclassification due to their proximity to other sound classes. Fig. 1 illustrates how audio degradations (middle) and blending between sounds (bottom) may influence the audio clip's position in the embedding space. In the first example, a car sound is first modified to have a lower volume and then mixed with ambient background sounds. In the second example, an alarm sound is continuously blended with a car sound. As illustrated, the resulting shifts and trajectories in the embedding space can cause confusions with other sound classes (e. g., bird calls).
The remainder of this article is organized as follows. Section II provides an overview of the relevant scientific work. Section III describes the general procedure for extracting embeddings from audio signals. Furthermore, this section introduces the NTARs and DAEs compared in this article. Section IV explains how we generate and augment audio recordings with overlapping sounds to serve as a dataset. As the main part of this article, we discuss methods for exploring different embedding spaces (see Section V) and the robustness of embedding representations with respect to degradations of the audio signal (see Section VI). Furthermore, we study in Section VII the embedding space trajectories of blended sounds. Finally, Section VIII concludes this article.
We publish relevant data and source code alongside this article to enable reproducibility of the experiments (https://github.com/jakobabesser/embedding_robustness_2022).

II. RELATED WORK
The paradigm of transfer learning has been successfully used in various disciplines ranging from computer vision and natural language processing to speech processing [10]. In the field of audio analysis, DAEs are trained either in a supervised or self-supervised fashion [11]. A common self-supervised learning strategy is contrastive learning [12], [13], [14], where embedding representations are learnt to capture similarity relationships between data instances or augmented versions thereof. While most DAEs operate solely in the audio domain, relationships between audio data and other modalities can be modeled by learning joint embedding spaces. Such cross-modal embeddings have been applied for several audio-visual tasks such as identity verification [15], audio-visual stream correspondence [4], scene classification [16], text-based audio retrieval [14], [17], [18], and cross-modal retrieval based on audio, images, and text [13]. Further knowledge and constraints can be integrated during the learning process, for instance, using additional loss terms for regularization [19].
While most deep audio embeddings rely on spectrogram-like feature representations such as the Mel-spectrogram [2], [3], [4], [12], Lopez-Meyer et al. [20] propose a convolutional neural network (CNN) architecture that maps raw audio clips to an embedding representation in an end-to-end fashion. Kong et al. [5] combine both waveform-based and spectrogram-based features in deriving the PANN embeddings. Deep generative models for audio synthesis [21] or music synthesis [22] on a waveform level often use encoder-decoder network architectures to learn suitable embedding representations [23], which can further be regularized to control the perceptual properties of the synthesized audio [24].
DAEs have been applied for a large variety of down-stream tasks. Most audio embeddings are trained on the AudioSet [25], which to date is the largest set of audio files from different domains, including speech, environmental sounds, and music. Previous work has shown that DAEs trained for sound classification perform well for related tasks such as urban sound tagging [26], [27], [28], acoustic scene classification [9], and various other audio classification tasks ranging from music to industrial sounds [7]. Notably, DAEs are also effective for tasks outside of their original training data domain such as audio captioning [29], [30], speech enhancement [31], and for detecting COVID-19 in respiratory-related sounds like breathing, cough, and speech [32]. Furthermore, DAEs have been used for SED in order to inform algorithms for source separation [8], [33] and speech denoising algorithms [34].
In the context of transfer learning, the most common way to evaluate embedding representations is to measure their performance on a set of down-stream tasks [4], [5], [7]. The Holistic Evaluation of Audio Representations (HEAR) benchmark represents the largest effort so far to evaluate embeddings for a large number of down-stream tasks [35].
In addition to such general performance evaluations, embedding spaces have been investigated to better understand the predictions of classification models. For multi-class classification tasks, it is common to visualize embedding spaces after applying dimensionality reduction techniques such as Principal Component Analysis (PCA), t-Distributed Stochastic Neighbor Embedding (t-SNE), or Uniform Manifold Approximation and Projection (UMAP). Such visualizations allow for testing whether class instances form well separated clusters in the embedding space. A common observation is that class separability typically improves in the embedding space from layer to layer, which supports the idea of hierarchical feature learning [36]. We study in detail how sound classes scatter in the embedding spaces of different audio representations in Section V-B.
The analysis of DAEs can provide powerful cues about characteristics of the input data of a neural network. As shown by Stacke et al. in [37], a discrepancy between the embedding space distributions of two datasets can be used as a proxy to quantify domain shift. Similarly, changes in the embedding space have been investigated as an indicator of the robustness of embedding representations towards degradations of the audio signal [38]. While this previous study investigated only the OpenL3 and YamNet embeddings, we follow a similar approach in this article to measure the robustness of seven different audio representations towards audio degradations in Section VI. While most prior work focused on multi-class classification tasks, the study of embedding spaces for multi-label tasks such as SET is a relatively new field of research [39], [40]. To the best of our knowledge, no prior work has investigated embedding spaces for SET.

III. AUDIO REPRESENTATIONS
In this section, we first explain the general procedure of embedding extraction from audio files. Then, we introduce seven audio representations, which we compare in our experiments. These representations include two non-trainable audio representations (NTARs) as discussed in Section III-B and five pre-trained deep audio embeddings (DAEs) as discussed in Section III-C.

A. Embedding Extraction
In the following, we explain how we extract embedding vectors from monaural audio clips x ∈ R^{L·f_s}. We enforce the clip duration to be an integer multiple of seconds by applying zero-padding if necessary. L ∈ N denotes the number of seconds and f_s ∈ N denotes the sample rate in Hz. As visualized in Fig. 2, we first partition the audio clip x into L non-overlapping blocks of one second each and extract a block-level embedding matrix Z_b per block, with E_b ∈ N denoting the embedding size and M ∈ N denoting the feature rate in Hz. Afterwards, we average Z_b over the time frames and obtain an embedding vector z_b ∈ R^{E_b}. Finally, we stack all block-level embedding vectors to a final embedding vector z ∈ R^E with E = L · E_b. In our experiments, we analyze 5 s long audio files, hence L = 5. As an alternative, variable-length input clips could be processed using a shingle-based approach [41], where multiple pre-defined fixed-size embedding matrices are extracted from longer audio clips using an overlap of 50 %.
When analyzing a set of N ∈ N audio clips, we stack their embedding vectors to an embedding matrix Z ∈ R^{N×E}. As a basis for distance calculations in the corresponding embedding space, we apply z-score normalization to Z. As will be discussed in the following sections, most investigated audio representations have different time resolutions. The presented approach of averaging over block-level embeddings leads to the same temporal resolution of one second for each embedding, which we believe is a good compromise that still allows us to capture time-dependent sound characteristics. Table I summarizes all audio representations compared in this article: The embedding dimensionality E_b and feature rate M of the block-level embedding matrices Z_b as well as the embedding dimension E of the stacked embeddings z are provided. The training objective (last column) of the DAEs is either supervised learning (SL) or self-supervised learning (SSL).
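The extraction procedure above can be sketched in a few lines of Python. The helper names (`extract_clip_embedding`, `block_embed_fn`, `zscore`) are ours for illustration and not part of the released code; `block_embed_fn` stands in for any of the seven audio representations, returning a block-level matrix Z_b of shape (E_b, M):

```python
import numpy as np

def extract_clip_embedding(x, fs, block_embed_fn, L=5):
    """Map a mono clip x to a stacked embedding vector z of size L*E_b:
    zero-pad to L seconds, embed each 1 s block, average over time frames,
    and concatenate the block-level vectors."""
    x = np.pad(x, (0, max(0, L * fs - len(x))))      # zero-pad to L seconds
    z_blocks = []
    for b in range(L):                               # L non-overlapping 1 s blocks
        block = x[b * fs:(b + 1) * fs]
        Z_b = block_embed_fn(block)                  # (E_b, M) block-level matrix
        z_blocks.append(Z_b.mean(axis=1))            # average over time frames
    return np.concatenate(z_blocks)                  # z in R^{L*E_b}

def zscore(Z):
    """Column-wise z-score normalization of a stacked matrix Z in R^{N x E}."""
    return (Z - Z.mean(axis=0)) / (Z.std(axis=0) + 1e-12)
```

With a real embedding model plugged in as `block_embed_fn`, stacking the resulting vectors row-wise and applying `zscore` yields the normalized matrix Z used for all distance calculations below.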

B. Non-Trainable Audio Representations
As baseline representations, we use the librosa Python library [42] to compute two NTARs, which characterize the time-frequency energy distribution in the audio clips. Here, we use a sample rate of f_s = 22.05 kHz (as in [43]), a hop size of 1024 samples (46.4 ms), and a block size of 2048 samples (92.9 ms). The first representation is a log-magnitude Mel-spectrogram (MelSpec) using 128 Mel bands. As the second representation, we compute the first 13 Mel-frequency Cepstral Coefficients (MFCC) as a compact representation of the spectral envelope. Both NTARs have a feature rate of M = 22 Hz.

C. Deep Audio Embeddings
In addition to the NTARs introduced in Section III-B, we investigate five pre-trained DAEs, which are based on different CNN and Transformer architectures. All DAEs were trained on the AudioSet dataset [25], which is the largest audio dataset to date covering different audio domains. The AudioSet includes around two million audio clips, which are weakly-labeled with an average of 2.7 labels per file. The dataset covers 527 sound classes. All DAEs except for the PANN embeddings use Mel-spectrogram variants as input features, albeit with different numbers of Mel bands and time resolutions (hop sizes). As shown in the original publications, the performance of the DAEs is greatly influenced by their parameters. We use the best-performing models here and do not include further ablation studies related to model parameters. While this section provides a high-level overview of the applied deep audio embeddings, a detailed list of model parameters is provided on the accompanying website. Since all DAEs learn higher-level features on top of NTAR-like input representations, we hypothesize that DAEs will in general show better performance on the SET task.
Kumar et al. [3] proposed a CNN with 12 convolutional layers with intermediate pooling and a final global pooling operation to make use of the weak labels of the AudioSet. The network processes Mel-spectrograms with 128 Mel bands as input representations. Finally, the layer activations of the penultimate convolutional layer are used as (Kumar) embedding vector with E b = 1024 and a feature rate of M = 1 Hz.
The OpenL3 [4] embeddings are DAEs that are trained in a self-supervised fashion. This approach does not require any labeled data but instead uses audio-visual correspondences as training objective. The underlying L3-Net was initially proposed in [44] and includes two sub-networks for audio and image processing, respectively, and several fusion layers. The audio sub-network processes Mel-spectrograms using a stack of four convolutional layers with intermediate max pooling. Multiple configurations of the OpenL3 embeddings exist, which were trained on different subsets of the AudioSet dataset. We use the "music" configuration with 256 Mel bands and an embedding size of E_b = 512, which has been shown to outperform the "environmental" configuration for various datasets including the ESC-50 dataset [4], [7]. The embeddings have a feature rate of M = 42 Hz.
The Pretrained Audio Neural Network (PANN) embeddings were introduced by Kong et al. [5]. Among several tested network architectures, the "Wavegram-Logmel-CNN" model performed best for the AudioSet (sound) tagging task. The underlying CNN14 model includes a total of 12 convolutional layers combined with two final dense layers. In contrast to the other four DAEs, the PANN embeddings combine as input a learnable waveform-based feature (wavegram) and a non-trainable Mel-spectrogram with 64 Mel bands. Furthermore, a final global pooling operation aggregates the full temporal context of a given audio file by combining max and average pooling. The final dimensionality of the PANN embedding matrix is E_b = 512 with a feature rate of M = 1 Hz.
The VGGish embeddings [2] are based on a modified version of a VGG model [45] that includes five convolutional layers and three final dense layers. The network takes log-magnitude Mel-spectrograms with 64 Mel bands as input. Each VGGish embedding vector has a size of E b = 128 with a feature rate of M = 1 Hz.
As an alternative to the convolutional neural network architectures, we incorporate the Audio Spectrogram Transformer (AST) model [46] as DAE, which takes sequences of Mel-spectrogram patches as input. In particular, we use the PaSST-S model proposed in [6], which was trained using a strategy referred to as structured patchout. The patchout technique involves removing randomly chosen patches from the input sequence. In structured patchout, the removed patches are selected in such a way that they cover the entire frequency range at one particular time window or, vice versa, the entire clip duration at a specific frequency range. This approach is comparable to data augmentation [47]. Each PaSST embedding vector has a size of E_b = 1295 with a feature rate of M = 20 Hz.
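Structured patchout can be illustrated on a small patch grid. The sketch below is our own simplification (the function name and parameters are not from [6]): it drops entire time columns and entire frequency rows of the patch grid before flattening the surviving patches into the Transformer input sequence:

```python
import numpy as np

def structured_patchout(patches, n_time=2, n_freq=2, rng=None):
    """Illustrative structured patchout: given a grid of spectrogram
    patches with shape (F, T, d), drop n_time entire time columns (all
    frequencies at the chosen time indices) and n_freq entire frequency
    rows. Returns the flattened sequence of surviving patches."""
    rng = rng or np.random.default_rng(0)
    F, T, _ = patches.shape
    keep_t = np.setdiff1d(np.arange(T), rng.choice(T, n_time, replace=False))
    keep_f = np.setdiff1d(np.arange(F), rng.choice(F, n_freq, replace=False))
    kept = patches[np.ix_(keep_f, keep_t)]           # (F-n_freq, T-n_time, d)
    return kept.reshape(-1, patches.shape[2])        # sequence for the Transformer
```

Dropping whole rows or columns (rather than isolated patches) is what distinguishes structured patchout from the unstructured variant.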

IV. EXPERIMENTAL DATASET
For our investigations, we use an augmented version of the ESC-50 dataset [48], a freely available sound recognition dataset that has been widely used as a benchmark for SET. It includes 2000 isolated sound recordings of 5 s duration from 50 sound classes. The dataset covers a large variety of sound classes, including domestic and urban sounds, nature and animal sounds, and human non-speech sounds. As all DAEs are pre-trained on the AudioSet dataset, we consider the audio clips of the ESC-50 dataset as previously unseen and hence as suitable data for our experiments.
Our research focus is on embedding space analysis for SET. To this end, we create 1000 random pairs of sounds taken from the ESC-50 dataset. Given our random sound assignment, the large majority of 98.2 % of the sound pairs contain sounds from different classes. From each sound pair, we create six mixtures by blending between the two sounds. In addition to its unprocessed version, we create four degraded versions of each mixture using the methods listed in Table II. These versions are used in Section VI to study the robustness of different audio representations towards audio degradations. We refer to this dataset as "ESC50Mix" in the following. For the m-th random sound pair (x_{m,1}, x_{m,2}) with x_{m,1}, x_{m,2} ∈ R^{L·f_s}, L = 5, and m ∈ [1 : 1000], we create sound mixtures using a mixing coefficient γ_g ∈ {0, 0.2, 0.4, 0.6, 0.8, 1} (indexed by g ∈ [1 : 6]) and apply a degradation function Λ_a (indexed by a ∈ [1 : 5]) as

x_{m,g,a} = Λ_a((1 − γ_g) · x_{m,1} + γ_g · x_{m,2}).    (1)

As detailed in Table II, we use the audiomentations Python library to implement loudness reduction, additive Gaussian noise, as well as low-frequency and high-frequency boost as degradation functions Λ_a. In total, the ESC50Mix dataset includes 30000 audio clips. Given an embedding function f(·), each sound mixture x_{m,g,a} is mapped to its corresponding embedding vector z_{m,g,a} ∈ R^E as

z_{m,g,a} = f(x_{m,g,a}).    (2)

For the experiments conducted in this article, we stack the embedding vectors of all 1000 sound pairs separately for each combination of embedding function and audio degradation method as an embedding matrix Z ∈ R^{1000×E}.
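The mixture construction can be sketched as follows. The blending step follows the procedure described above; the two degradation helpers (`quiet`, `add_noise`) are simplified stand-ins for the audiomentations transforms listed in Table II, with hypothetical gain and SNR values:

```python
import numpy as np

GAMMAS = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]   # mixing coefficients gamma_g

def make_mixture(x1, x2, gamma, degrade=lambda x: x):
    """Blend a sound pair and apply a degradation function Lambda_a,
    i.e. Lambda_a((1 - gamma) * x1 + gamma * x2)."""
    return degrade((1.0 - gamma) * x1 + gamma * x2)

# simplified stand-ins for the audiomentations degradations (illustrative)
def quiet(x, db=-12.0):
    """Loudness reduction by a fixed gain in dB."""
    return x * 10.0 ** (db / 20.0)

def add_noise(x, snr_db=20.0, rng=np.random.default_rng(0)):
    """Additive Gaussian noise scaled to a target signal-to-noise ratio."""
    noise = rng.standard_normal(len(x))
    scale = np.sqrt((x ** 2).mean() / (10 ** (snr_db / 10) * (noise ** 2).mean() + 1e-12))
    return x + scale * noise
```

Iterating `make_mixture` over all 1000 pairs, the six γ values, and the five degradation settings (including the identity) yields the 30000 clips of ESC50Mix.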

V. EMBEDDING SPACE EXPLORATION
In this section, we introduce several measures to explore the embedding space distributions of the audio representations introduced in Section III.

A. Inter-Correlation
We investigate the redundancy of an audio representation by measuring the average pair-wise correlation between its feature dimensions. As correlation measure, we use the sample Pearson correlation coefficient, which is defined for two vectors q, v ∈ R^K with K ∈ N and their respective means μ_q, μ_v ∈ R as

r(q, v) = Σ_{k=1}^{K} (q_k − μ_q)(v_k − μ_v) / ( √(Σ_{k=1}^{K} (q_k − μ_q)²) · √(Σ_{k=1}^{K} (v_k − μ_v)²) ).
Given a stacked embedding matrix Z ∈ R^{N×E} of N row-wise stacked embedding vectors z ∈ R^E, we measure the inter-correlation strength α_i as the mean absolute pair-wise correlation over all pairs of feature dimensions:

α_i = 2 / (E(E − 1)) · Σ_{e=1}^{E−1} Σ_{e'=e+1}^{E} | r(Z_{:,e}, Z_{:,e'}) |,

where Z_{:,e} denotes the e-th column of Z.
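This measure is straightforward to compute with numpy (a sketch; the function name is ours):

```python
import numpy as np

def inter_correlation(Z):
    """Mean absolute pairwise Pearson correlation between the E feature
    dimensions (columns) of an embedding matrix Z in R^{N x E},
    averaged over the off-diagonal entries of the correlation matrix."""
    R = np.corrcoef(Z, rowvar=False)          # (E, E) correlation matrix
    E = R.shape[0]
    mask = ~np.eye(E, dtype=bool)             # exclude self-correlations
    return np.abs(R[mask]).mean()
```

A value close to 1 indicates highly redundant feature dimensions, a value close to 0 a largely decorrelated representation.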

B. Class Scattering
If we consider a multi-class classification scenario, an ideal embedding space has dense and well-separated clusters for each class as shown on the left side of Fig. 4. In contrast, as shown on the right side, partially overlapping classes can cause confusions between adjacent classes and hence complicate the task for a subsequent classification model. In this section, we use the Dunn Index (DI) [49] and the Davies-Bouldin Index (DBI) [50] as two established separation measures to characterize the class scattering in the embedding space for the non-degraded isolated sound recordings (a = 1, g ∈ {1, 6}).
Given a set Z = {z ∈ R^E} of embedding vectors of isolated sounds, let Z_i = {z ∈ Z | c(z) = i} be the subset of all embedding vectors labeled with class c(z) ∈ [1 : C], where C ∈ N denotes the number of classes. The class centroids in the embedding space are computed as

μ_i = 1/|Z_i| · Σ_{z ∈ Z_i} z.

The intracluster distance Δ_i measures the average distance of all class samples to their class centroid as

Δ_i = 1/|Z_i| · Σ_{z ∈ Z_i} d(z, μ_i),

where d(·,·) denotes the Euclidean distance. The intercluster distance δ_{i,j} measures the distance between the centroids of classes i and j as

δ_{i,j} = d(μ_i, μ_j).

Based on these concepts, the Dunn Index α_DI relates the closest pair of clusters to the most spread cluster to derive a separation measure as

α_DI = min_{i≠j} δ_{i,j} / max_k Δ_k.

This measure follows a "pessimistic" view and only considers the most poorly segregated and most widely dispersed classes. As a second measure, the Davies-Bouldin Index first computes pair-wise cluster similarity measures as

S_{i,j} = (Δ_i + Δ_j) / δ_{i,j}.

Then, for each class, the most similar other class is identified, and these similarity values are finally averaged as

α_DBI = 1/C · Σ_{i=1}^{C} max_{j≠i} S_{i,j}.

Since α_DBI decreases with improved class separability, we use the inverted DBI measure (1/α_DBI) as a separation measure of a given clustering. Fig. 5 summarizes the two measures for all audio representations. Both measures show a similar trend: DAEs in general yield a better class scattering in the embedding space. At the same time, there is no clear evidence whether DAEs trained in a supervised or self-supervised fashion show better separability.
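The two separation measures defined above can be implemented directly from their definitions (a compact numpy sketch; the function name is ours):

```python
import numpy as np

def dunn_and_dbi(Z, labels):
    """Dunn Index and Davies-Bouldin Index for embeddings Z in R^{N x E}
    with integer class labels, following the definitions above."""
    classes = np.unique(labels)
    mus = np.array([Z[labels == c].mean(axis=0) for c in classes])
    # intracluster distances Delta_i: mean distance of samples to centroid
    deltas = np.array([np.linalg.norm(Z[labels == c] - mu, axis=1).mean()
                       for c, mu in zip(classes, mus)])
    C = len(classes)
    inter = np.linalg.norm(mus[:, None] - mus[None, :], axis=-1)  # delta_{i,j}
    off = ~np.eye(C, dtype=bool)
    dunn = inter[off].min() / deltas.max()
    S = (deltas[:, None] + deltas[None, :]) / (inter + np.eye(C))  # avoid /0 on diagonal
    dbi = np.array([S[i][off[i]].max() for i in range(C)]).mean()
    return dunn, dbi
```

Higher Dunn Index values and lower DBI values (i. e., higher 1/α_DBI) both indicate better class separation.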

VI. SENSITIVITY TO AUDIO DEGRADATIONS
In this section, we aim to measure the sensitivity of audio representations against different types of audio degradations. Such degradations are caused by acoustic variations such as background noise and loudness variations, which often appear in real-world sound monitoring scenarios. Ideally, an embedding function f (·) is robust against such audio degradations since they do not change the semantics of the sound classes to be recognized.
As illustrated in Fig. 6, we consider two types of sensitivity measures. The class-agnostic measure ψ_a (see left side of Fig. 6) does not take the class membership of embedding vectors into account. Instead, it considers only the distance between the non-degraded embedding vector z and the degraded embedding vector z_d:

ψ_a = d(z, z_d).

The class-centric measure ψ_c (see right side of Fig. 6) measures the "drift" of an embedding vector relative to its corresponding class centroid μ_{c(z)}:

ψ_c = d(z, μ_{c(z)}) − d(z_d, μ_{c(z)}).

Positive values of ψ_c indicate that an embedding vector moves towards its class centroid, whereas negative values indicate a drift away from it.
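Both sensitivity measures are one-liners over Euclidean distances (function names are ours for illustration):

```python
import numpy as np

def psi_agnostic(z, z_d):
    """Class-agnostic sensitivity: how far the embedding moves under degradation."""
    return np.linalg.norm(z - z_d)

def psi_centric(z, z_d, mu_c):
    """Class-centric sensitivity: positive if the degraded embedding z_d
    moved towards its class centroid mu_c, negative if it drifted away."""
    return np.linalg.norm(z - mu_c) - np.linalg.norm(z_d - mu_c)
```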
In our experiment, we analyze the degraded versions of the isolated sound recordings in the ESC50Mix dataset (g ∈ {1, 6}). Fig. 7 illustrates the mean sensitivity values and the corresponding error bars based on the standard deviation computed over all audio recordings. The class-agnostic sensitivity ψ_a shows that the "Noise" degradation has the strongest impact on the embeddings; the class-centric measure further reveals that it generally causes the embeddings to move away from their class centroids, with the exception of the MFCCs, where ψ_c remains almost zero on average. This is to be expected, as the MFCCs provide a decorrelated approximation of the spectral envelope, which naturally suppresses noise.
On the other hand, the "Quiet" degradation, which reduces the loudness, leads to an embedding drift for all representations except for MelSpec. Notably, the MFCCs seem robust to this degradation, as they tend to drift towards their class centroids, which does not introduce a higher risk of misclassification. The two audio degradations "BoostHigh" and "BoostLow", which alter the spectral envelope, have a stronger effect on the DAEs than on the NTARs. However, neither causes a strong drift towards or away from the class centroids. In summary, while this investigation revealed individual strengths and weaknesses of all embeddings, the PaSST as well as the PANN embeddings appear to be overall the most robust representations against the investigated audio degradations.

Fig. 8. Two possible embedding space trajectories that are obtained by blending between two sounds z_0 and z_1. Given a sound mixture z_g of these sounds, the plots show the distances d_0, d_1, and d_* towards the corresponding class centroids μ_0 and μ_1 as well as the closest out-of-class centroid μ_*. The trajectory (black dotted line) indicates a "wrapped" continuous path on a sub-manifold embedded in a high-dimensional feature space.

VII. SOUND BLENDING TRAJECTORIES
In this experiment, we investigate how a blending between two isolated sounds maps to a trajectory between their embedding vectors. We argue that if such a trajectory passes by other classes in the embedding space, this can cause misclassifications. Fig. 8 illustrates this idea: Given the embedding vectors z_0 and z_1 of two isolated sounds, we investigate the trajectory formed by the embedding vectors of the mixtures z_g of both sounds, which according to (1) depend on the mixing coefficient γ_g. In this experiment, we do not apply any audio degradation (a = 1).

A. Embedding Space Distances
Given an example mixture z_g along this trajectory, we measure its embedding space distance to the class centroids μ_0 and μ_1 of both isolated sounds as d_0 = d(z_g, μ_0) and d_1 = d(z_g, μ_1), as well as its distance to the closest out-of-class centroid μ_* as d_* = d(z_g, μ_*). Fig. 8 illustrates two possible trajectories: The first trajectory (left plot) passes by two other classes (purple triangles, green pentagons) and shows potential for sound misclassification. The second trajectory (right plot) remains close to the original classes (blue stars, orange circles), and the mixtures remain further away from out-of-class centroids. Fig. 9 shows the dependence of the three distance values d_0, d_1, and d_* on the mixing parameter g for different audio representations. The plots show the mean values averaged over all 1000 sound pairs to illustrate general trends. We make several observations: First, d_* has a convex shape and is consistently below d_0 and d_1 for the NTARs as well as for OpenL3. This property is disadvantageous for SET, as it indicates that both the isolated sounds and the mixtures tend to remain closer to out-of-class centroids than to their corresponding class centroids. For the supervised DAEs (Kumar, PANN, PaSST, and VGGish), d_* has a concave shape, which indicates that for stronger mixtures (γ_g ∈ {0.4, 0.6}), the potential for confusion with other classes is smaller than for isolated sounds (γ_g ∈ {0, 1}) or soft mixtures (γ_g ∈ {0.2, 0.8}). We further illustrate in Fig. 9 a likelihood measure for class confusion derived as max(0, min(d_0, d_1) − d_*). In particular for the PANN and PaSST embeddings, we have the desired property that min(d_0, d_1) < d_* holds for all mixture coefficients γ_g.
Therefore, we expect the structure of the embedding spaces of both the PANN and PaSST embeddings to be the most suitable for low-polyphony SET among the investigated audio representations, since class confusions due to nearby out-of-class centroids are mostly avoided.
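The per-mixture distance analysis reduces to a few Euclidean distances (the function name and argument names are ours for illustration):

```python
import numpy as np

def trajectory_distances(z_g, mu0, mu1, other_centroids):
    """Distances of a mixture embedding z_g to the two in-class centroids
    (d0, d1) and to the closest out-of-class centroid (d_star), plus the
    confusion-likelihood measure max(0, min(d0, d1) - d_star)."""
    d0 = np.linalg.norm(z_g - mu0)
    d1 = np.linalg.norm(z_g - mu1)
    d_star = min(np.linalg.norm(z_g - mu) for mu in other_centroids)
    confusion = max(0.0, min(d0, d1) - d_star)
    return d0, d1, d_star, confusion
```

A confusion value of zero corresponds to the desired property min(d_0, d_1) < d_*, i. e., the mixture stays closer to its own class centroids than to any other class.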

B. SET Experiment
Complementary to the embedding space distance investigations presented in Section VII-A, we run a SET experiment using the ESC50Mix dataset. We use the first 800 sound pairs as training data and the last 200 sound pairs as test data. Again, we focus on the non-degraded audio clips (a = 1) and consider all mixing coefficients γ_g when compiling the training and test datasets. Consequently, all SET models are trained and evaluated with both single-label and multi-label audio clips. In particular, we obtain single-label annotations for all audio clips with only one sound being audible (γ_g = 0 and γ_g = 1) and multi-label annotations for all sound mixtures.
Inspired by [7], we use a two-layer Multi-Layer Perceptron (MLP) model as a classifier to process the embeddings, which consists of a first layer with 128 neurons and a Rectified Linear Unit (ReLU) activation function and a second layer of 50 neurons with a sigmoid activation function. All models are trained for 150 epochs using binary cross-entropy loss, the Adam optimizer [51] with a learning rate of 10^−3, and a batch size of 32. We randomly use 20 % of the training set as validation set and use early stopping on the validation loss to stop the training. The macro-average mean average precision (mAP) is computed as evaluation metric. Fig. 10 summarizes the mAP values obtained for three types of sound mixtures ranging from isolated sounds (γ_g ∈ {0, 1}) over mixtures of one predominant and one background sound (γ_g ∈ {0.2, 0.8}) to mixtures of two sounds of similar intensity (γ_g ∈ {0.4, 0.6}). Unsurprisingly, we observe a decrease in mAP from single-label audio clips with isolated sounds to multi-label audio clips. Interestingly, the tagging models perform slightly better for the multi-label clips when both sounds have a similar intensity. The NTARs perform significantly worse than the DAEs. Presumably, the shallow MLP model is less expressive using NTARs, which characterize sounds only by the shape of their spectral envelopes. DAEs, in contrast, are trained to capture more complex temporal-spectral patterns. When comparing the different DAEs, the PANN embeddings perform best, followed by the OpenL3 and PaSST embeddings, which perform on par. This confirms our findings from Section VII-A, where the DAEs clearly outperformed the NTARs. As the only exception, the OpenL3 embeddings perform better than expected based on the observed embedding space distance relationships.
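The macro-average mAP metric used above can be computed without external dependencies. The sketch below (function names are ours) uses the standard step-wise average precision and averages it over the C = 50 classes:

```python
import numpy as np

def average_precision(y_true, scores):
    """Average precision for one class: mean of the precision values at
    the ranks of the positive examples (step-wise interpolation)."""
    order = np.argsort(-scores)                    # rank by descending score
    y = y_true[order]
    tp = np.cumsum(y)                              # true positives at each rank
    precision = tp / np.arange(1, len(y) + 1)
    return (precision * y).sum() / max(y.sum(), 1)

def macro_map(Y_true, Y_scores):
    """Macro-average mean average precision over C classes;
    Y_true and Y_scores have shape (N, C)."""
    return np.mean([average_precision(Y_true[:, c], Y_scores[:, c])
                    for c in range(Y_true.shape[1])])
```

For distinct scores, `average_precision` coincides with the usual threshold-based definition of AP; the multi-label ground truth simply contains one or two active classes per clip.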

VIII. CONCLUSION
Motivated by the challenges of deploying machine listening approaches in real-life application scenarios, we study in this article the suitability of two non-trainable audio representations (NTARs) as well as five deep audio embeddings (DAEs) for SET. We first investigated general properties of these embeddings such as the redundancy caused by feature inter-correlations as well as the class separability in the embedding spaces. Then, we assessed the robustness of the embeddings against four types of audio degradations. We proposed two measures based on a class-agnostic and a class-centric view of the resulting drift in the embedding space. We observed that both NTARs and DAEs have individual weaknesses, while the PaSST and PANN embeddings seem to be the most robust representations.
As a main contribution of this article, we blended between random sound pairs to create sound mixtures and studied the resulting embedding space trajectories to assess the risk of sound misclassification. This risk arises if sound mixtures are too close to other sound classes in the embedding space. Again, we found that the embedding space of the PANN embeddings seems to be structured in such a way that sound mixtures generally remain close enough to their original sound classes, which leads to superior SET performance. Our analyses so far are based on a low sound polyphony of two overlapping sounds. As future work, new training approaches should be developed to learn DAEs which better account for sound mixtures of higher polyphony degrees, which are common in real-world soundscapes. Another open question is how the proposed embedding space trajectory analysis can be generalized to higher sound polyphony degrees with a rapidly increasing number of sound permutations.

ACKNOWLEDGMENT
The International Audio Laboratories Erlangen are a joint institution of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) and the Fraunhofer Institute for Integrated Circuits IIS.