Audio Event-Relational Graph Representation Learning for Acoustic Scene Classification

Most deep learning-based acoustic scene classification (ASC) approaches identify scenes based on acoustic features converted from audio clips containing mixed information entangled by polyphonic audio events (AEs). However, these approaches have difficulty explaining what cues they use to identify scenes. This paper conducts the first study on disclosing the relationship between real-life acoustic scenes and semantic embeddings from the most relevant AEs. Specifically, we propose an event-relational graph representation learning (ERGL) framework for ASC to classify scenes and simultaneously answer, clearly and directly, which cues are used in classification. In the event-relational graph, the embedding of each event is treated as a node, while relationship cues derived from each pair of nodes are described by multi-dimensional edge features. Experiments on a real-life ASC dataset show that the proposed ERGL achieves competitive performance on ASC by learning embeddings of only a limited number of AEs. The results show the feasibility of recognizing diverse acoustic scenes based on the audio event-relational graph. Visualizations of graph representations learned by ERGL are available here (https://github.com/Yuanbo2020/ERGL).

Typical deep learning-based ASC methods consist of three steps: 1) Converting a time-domain audio signal to a time-frequency spectrogram, which is used as input acoustic features; 2) Feeding these features to neural networks to obtain high-level representations; 3) Recognizing acoustic scenes based on such high-level representations. For example, Ren et al. [15] utilize a CNN-based model with mel features, where attention-based pooling layers are used to reduce the dimension of representations. The spatial pyramid pooling approach is used by CNN in [16] to provide various resolutions for ASC. Apart from mel-based features, wavelet-based deep scattering spectrum [17] is introduced for ASC. To explore the instance-level information of audio clips, multiple-instance learning [18] is used in ASC. The higher-order temporal information of acoustic features is exploited by convolutional recurrent neural networks (CRNN) with bidirectional recurrent layers [19] and with spatio-temporal attention pooling [20]. In addition, attentional graph convolutional networks are used for audio-visual scene classification [21]. Given the intrinsic relationship between acoustic scenes and audio events (AEs), some studies jointly analyze scenes and events based on multitask learning (MTL) [22] [23]. Relation-guided ASC [24] is proposed to exploit the implicit relations between coarse-grained scenes and fine-grained AEs.
The features used in the methods above often contain polyphonic AE information. These features capture both useful and irrelevant information, as well as noise, all of which are then used for ASC by recognition systems. However, it is difficult to explain what cues in audio clips are used by these approaches to recognize acoustic scenes (ASs). In real life, it is natural for humans to recognize ASs based on the semantically meaningful AEs contained in them, despite variations in relations among the occurring AEs in ASs [25]. This paper proposes audio event-relational graph representation learning (ERGL) to classify scenes and simultaneously clearly answer which cues are used in classifying. Inspired by the multi-dimensional edge learning in graph-based image analysis [26][27], we introduce it into the proposed audio-based ERGL to enhance scene-dependent semantic relationships between non-graph AEs in end-to-end training. These scene-dependent event-relational graphs (ERGs) contain not only the activation of AEs (i.e., nodes in the graph) in the audio clip, but also their relations (represented as edges) that are relevant to ASC tasks. The graph in ERGs is a single graph. Thus, spatial-temporal graph neural networks [28] [29], which aim to capture spatial and temporal dependencies of graph sequences or multigraphs, are unsuitable for modelling ERGs. ERGs are fed to a gated graph convolutional network (Gated GCN) [30] for ASC. Experiments show that graph representations learned by ERGL, from only several explicit audio event semantic embeddings, can facilitate the discrimination between different ASs.

II. AUDIO EVENT-RELATIONAL GRAPH LEARNING
This section discusses the proposed approach for learning the scene-related event-relational graph (ERG) representation from non-graph audio clips in an end-to-end manner. First, we learn a set of AE embeddings, where each embedding ($v_i$) contains the $i$-th audio event-related information, and is treated as the $i$-th node in the ERG. Then, we learn a pair of multi-dimensional edge features ($e_{i,j}$, $e_{j,i}$) to describe the relations between each pair of nodes ($v_i$, $v_j$). Thus, the obtained graph explicitly describes the occurrence of a set of AEs and their relations relevant to the given scene. Finally, the obtained ERG with $n$ nodes and $n \times n$ edges is fed into Gated GCN for ASC.

A. Audio event node embedding learning
We first propose a model derived from PANNs [31] for node embedding generation, where each embedding describes a specific AE. As shown in Fig. 1, the spectrogram of the audio clip is fed into a set of convolutional blocks. Each block contains two convolutional layers with kernels of size 3 × 3, a batch normalization [32], and a ReLU function [33]. Then, a fully-connected (FC) layer with 2048 units generates joint representations for all AEs. Unlike PANNs, we employ $n$ independent FC layers with 64 units to learn $n$ embeddings separately, where each embedding describes a unique pre-defined audio event.
During training, each event embedding $v_i$ is fed into the following event classification layer to predict the corresponding event probability $p_{v_i}$, where mean squared error (MSE) loss is used to measure the distance between the prediction $p_{v_i}$ and the label $y_{v_i}$ (i.e., $L_{event} = \mathrm{MSE}(p_{v_i}, y_{v_i})$). To train the audio event node embedding generation module, we employ PANNs, which contains 527 classes of AEs, to generate pseudo labels of AEs. This produces a 527-dimensional soft pseudo label $y = [y_{v_1}, y_{v_2}, \dots, y_{v_{527}}]$ for each audio clip, describing the occurrence probabilities of 527 classes of AEs. Since real-world acoustic scene datasets rarely have all 527 classes of AEs, i.e., the number of AEs that occur would be much smaller than 527, we rank all AEs by accumulating their probabilities in all training data, and use a set of top-ranked (Top $n$) AEs with the highest overall probability describing each scene. As a result, each graph contains $n$ nodes and $n \times n$ edges.
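As a rough illustration of the independent FC heads described above, the following NumPy sketch maps one shared 2048-dimensional representation to $n$ separate 64-dimensional node embeddings. The function name `node_embeddings`, the random weights, and the choice of n = 25 are assumptions made for illustration; the convolutional front end is omitted.

```python
import numpy as np

def node_embeddings(shared, weights, biases):
    """Apply n independent FC layers to one shared representation.

    shared:  (2048,) joint representation from the shared FC layer
    weights: (n, 2048, 64), one weight matrix per audio event
    biases:  (n, 64)
    returns: (n, 64), one embedding per audio event (graph node)
    """
    # einsum applies the i-th weight matrix to the same shared vector
    return np.einsum("d,ndk->nk", shared, weights) + biases

rng = np.random.default_rng(0)
n = 25  # number of event nodes (illustrative choice)
shared = rng.standard_normal(2048)
weights = rng.standard_normal((n, 2048, 64)) * 0.01  # hypothetical weights
biases = np.zeros((n, 64))
nodes = node_embeddings(shared, weights, biases)
```

Because each head has its own weights, the embedding of one event can change during training without directly altering the embeddings of the others; only the shared representation couples them.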
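The Top-n selection and the event loss can be sketched in a few lines of NumPy. This is a minimal example on toy data with 6 event classes instead of 527; the function names `select_top_events` and `event_mse` are hypothetical.

```python
import numpy as np

def select_top_events(pseudo_labels, n):
    """Rank event classes by accumulated probability and keep the top n.

    pseudo_labels: (num_clips, num_classes) soft labels in [0, 1]
    returns: indices of the n classes with the highest summed probability,
             ordered from highest to lowest
    """
    totals = pseudo_labels.sum(axis=0)
    return np.argsort(-totals, kind="stable")[:n]

def event_mse(pred, target):
    """Mean squared error between predicted and pseudo-label probabilities."""
    return float(np.mean((pred - target) ** 2))

# Toy pseudo labels: 4 clips, 6 event classes
labels = np.array([
    [0.9, 0.1, 0.0, 0.6, 0.0, 0.2],
    [0.8, 0.0, 0.1, 0.7, 0.0, 0.1],
    [0.7, 0.2, 0.0, 0.5, 0.1, 0.0],
    [0.9, 0.1, 0.0, 0.4, 0.0, 0.3],
])
top = select_top_events(labels, n=3)   # classes 0, 3 and 5
targets = labels[:, top]               # soft targets for the selected nodes
```

The selected columns of the pseudo-label matrix then serve as the targets $y_{v_i}$ for the $n$ event classification layers.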

B. Audio event-relational edge feature learning
Once all audio event (node) embeddings are obtained, we propose a multi-dimensional edge feature learning (MEL) module to learn scene-related relations between each pair of AEs. Here, our hypothesis is that the co-occurrence patterns of all event pairs may include key clues for ASC tasks. The proposed MEL module in Fig. 1 consists of two submodules: node-context relation modelling (NCM) and node-node relation modelling (NNM).

NCM. For each node $v_i$, NCM conducts cross-attention [34] between it and the global contextual representation $v_{all}$, which consists of the mean of all node features, where node $v_i$ is used as the query, and $v_{all}$ is employed as the key and value:

$$\mathrm{NCM}(v_i, v_{all}) = \Phi\left(\frac{(v_i W_q)(v_{all} W_k)^{\top}}{\sqrt{d_k}}\right) v_{all} W_v \qquad (1)$$

where $\Phi$ is the softmax function, $W_{\{q,k,v\}}$ are learnable weights, and $d_k$ is a factor equal to the number of channels in $K$. As a result, the obtained representations $S_i$ ($i = 1, 2, \dots, n$) encode scene-aware cues for each audio event:

$$S_i = \mathrm{NCM}(v_i, v_{all}), \quad S_j = \mathrm{NCM}(v_j, v_{all}) \qquad (2)$$

NNM. After extracting all scene-aware event node features, the NNM module models the semantic relationship between nodes by capturing multi-dimensional (m-d) edge features. In particular, NNM consists of a cross-attention [34] layer and a global average pooling (GAP) [19] layer; it takes a pair of scene-aware event features ($S_i$, $S_j$) as input and produces a pair of m-d edge features ($e_{i,j}$, $e_{j,i}$) as output. In detail, NNM first conducts:

$$R_{i,j} = \mathrm{NNM}(S_j, S_i), \quad R_{j,i} = \mathrm{NNM}(S_i, S_j) \qquad (3)$$

where the edge feature $R_{i,j}$ encodes $S_j$-related cues in $S_i$, and correspondingly, $R_{j,i}$ encodes $S_i$-related cues in $S_j$. Next, edge features $R_{i,j}$ and $R_{j,i}$ are fed into the GAP layer to obtain the multi-dimensional edge feature vectors $e_{i,j}$ and $e_{j,i}$:

$$e_{i,j} = \mathrm{GAP}(R_{i,j}), \quad e_{j,i} = \mathrm{GAP}(R_{j,i}) \qquad (4)$$

Consequently, the produced $e_{i,j}$ and $e_{j,i}$ capture multiple ASC-task-specific cues related to both event nodes $v_i$ and $v_j$.
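The NCM, NNM, and GAP steps can be sketched as follows. This is an illustrative NumPy example under several assumptions that are not taken from the paper: cross-attention is taken to be standard scaled dot-product attention, each node feature is laid out as $T$ tokens of $d$ channels so that attention and pooling are non-trivial, the first argument of each call acts as the query, and all weights are random.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, context, Wq, Wk, Wv):
    """Scaled dot-product cross-attention: query attends to context."""
    Q, K, V = query @ Wq, context @ Wk, context @ Wv
    d_k = K.shape[-1]  # number of channels in K
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def mel_edges(nodes, ncm_w, nnm_w):
    """Return edge features e[i, j] for every ordered node pair.

    nodes: (n, T, d); ASSUMED layout, each node is T tokens of d channels
    """
    n, _, d = nodes.shape
    v_all = nodes.mean(axis=0)  # global context: mean of all node features
    # NCM: each node queries the global context, Eq. (2)
    S = np.stack([cross_attention(v, v_all, *ncm_w) for v in nodes])
    edges = np.zeros((n, n, d))
    for i in range(n):
        for j in range(n):
            R_ij = cross_attention(S[j], S[i], *nnm_w)  # NNM, Eq. (3)
            edges[i, j] = R_ij.mean(axis=0)             # GAP, Eq. (4)
    return edges

rng = np.random.default_rng(0)
n, T, d = 4, 8, 16  # toy sizes, not the paper's settings
nodes = rng.standard_normal((n, T, d))
ncm_w = [rng.standard_normal((d, d)) * 0.1 for _ in range(3)]  # Wq, Wk, Wv
nnm_w = [rng.standard_normal((d, d)) * 0.1 for _ in range(3)]
edges = mel_edges(nodes, ncm_w, nnm_w)
```

The result has shape (n, n, d): one d-dimensional feature vector for each ordered pair of nodes, so $e_{i,j}$ and $e_{j,i}$ are stored separately.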

C. Scene-aware event-relational graph
Once the ERG (denoted as $G^0$), which contains $n$ node embeddings and $n \times n$ multi-dimensional edge features, is obtained, we feed $G^0$ to the Gated GCN [35] [30] for ASC.
Since the model contains $U$ GCN layers, its output is $G^U = (v^U, e^U)$, which is a graph with the same topology as $G^0$. The $i$-th node represents the activation state of the $i$-th event in the scene. The latent node features in $G^U$ are concatenated as the scene representation and input to the final scene classification layer. Cross entropy (CE) [36] is used as the loss function in ASC between the prediction $p_s$ and the true scene label $y_s$: $L_{scene} = \mathrm{CE}(p_s, y_s)$. Hence, the final loss of ERGL is $L = \lambda_1 L_{event} + \lambda_2 L_{scene}$, where $\lambda_1$ and $\lambda_2$ default to 1 in this paper. The effect of $U$, which defaults to 2, on the model performance is explored in the experiments.
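The combined objective can be illustrated with a small NumPy example. It is a sketch only: it assumes the scene classifier outputs unnormalized scores (logits) that are passed through a softmax inside the cross entropy, and the function names are hypothetical.

```python
import numpy as np

def mse(pred, target):
    """Mean squared error, used here for the event loss."""
    return float(np.mean((pred - target) ** 2))

def cross_entropy(logits, label):
    """Cross entropy for one clip from unnormalized scene scores."""
    shifted = logits - logits.max()                   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return float(-log_probs[label])

def ergl_loss(p_event, y_event, scene_logits, y_scene, lam1=1.0, lam2=1.0):
    """L = lam1 * L_event + lam2 * L_scene (both weights default to 1)."""
    l_event = mse(p_event, y_event)
    l_scene = cross_entropy(scene_logits, y_scene)
    return lam1 * l_event + lam2 * l_scene

p_event = np.array([0.8, 0.1, 0.4])   # predicted event probabilities
y_event = np.array([1.0, 0.0, 0.5])   # soft pseudo labels
scene_logits = np.zeros(10)           # uniform prediction over 10 scenes
loss = ergl_loss(p_event, y_event, scene_logits, y_scene=3)
```

With a uniform scene prediction the scene term equals $\ln 10$, so the total here is $0.02 + \ln 10$; in this toy case the scene term dominates.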

III. EXPERIMENTS AND RESULTS
A. Dataset, baseline, experimental setup, and metric
This paper uses the TUT Urban Acoustic Scenes 2018 dataset (UAS) [37] with 8640 10-second clips. The UAS contains 10 classes of real-life acoustic scenes, 24 hours in total. However, UAS does not provide labels for audio events. To obtain the event labels used in Sec. III-B, the pre-trained PANNs model is used to annotate audio clips with pseudo labels for 527 classes of AEs. To compare with other methods on the same test set, we follow the setup of [37], where the test, training, and validation sets contain 2518, 5509, and 613 samples, respectively.
Following [31], the log mel spectrogram with 64 bins is used as the acoustic feature, which is extracted by the Short-Time Fourier Transform with a Hamming window of size 1024 and a hop size of 320 samples. A batch size of 64 and the AdamW optimizer [39] with a learning rate of 1e-4 are used to minimize the loss. The systems are trained for 400 epochs. The accuracy (Acc) [1] is used as the performance metric.
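To make the feature dimensions concrete, the short example below computes the number of STFT frames for one clip under the window and hop sizes given above. The sampling rate of 32 kHz is an assumption (it is not stated in the text), as is the centered-padding convention; `num_frames` is a hypothetical helper.

```python
def num_frames(num_samples, hop, win, center=True):
    """Number of STFT frames for a signal of num_samples samples.

    center=True assumes the signal is padded by win // 2 on both sides,
    a common STFT convention; center=False uses no padding.
    """
    if center:
        return num_samples // hop + 1
    return 1 + (num_samples - win) // hop

SAMPLE_RATE = 32000   # ASSUMPTION: not stated in the text
CLIP_SECONDS = 10     # clip length from the dataset description
WIN, HOP, N_MELS = 1024, 320, 64

samples = SAMPLE_RATE * CLIP_SECONDS
feature_shape = (num_frames(samples, HOP, WIN), N_MELS)  # (frames, mel bins)
```

Under these assumptions each 10-second clip yields a log mel spectrogram of 1001 frames by 64 bins; a different sampling rate or padding convention changes the frame count but not the number of mel bins.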

B. Results and analysis
The pseudo labels composed of probability outputs by PANNs [31] are used as supervision information for training the audio event model, as shown in Fig. 1. The accuracy of pseudo labels that do not involve human verification cannot be evaluated. Since our goal is ASC, we will mainly show ASC results, rather than the accuracy of pseudo labels on AEs.
The number n of audio events. We first evaluate the impact of the choice of $n$, i.e., the number of top AEs used in ERGL, on the performance of ERGL. The Acc of ERGL does not increase monotonically as $n$ increases, as shown in Table I. The reason may be that as the number of events $n$ increases, the number of nodes in the graph increases linearly, but the number of edges in the graph grows in the order of $n^2$, which sharply increases the burden of learning the multi-dimensional edge features with the MEL module. The increased number of parameters does not provide more useful information to the model, but may compromise its performance.
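The growth argument can be made concrete with a few lines of arithmetic. The values of n and the edge dimension of 64 used below are illustrative assumptions, not settings reported in the paper.

```python
def graph_size(n, edge_dim):
    """Count nodes, directed edges, and edge feature values for n events."""
    return {
        "nodes": n,                       # grows linearly with n
        "edges": n * n,                   # grows quadratically with n
        "edge_values": n * n * edge_dim,  # scalars the MEL module must produce
    }

# edge_dim=64 is a hypothetical edge feature dimension
sizes = {n: graph_size(n, edge_dim=64) for n in (5, 25, 100)}
```

Going from 25 to 100 events multiplies the number of nodes by 4 but the number of edges by 16 (625 to 10000).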
The ERGL works best when $n$ = 25, indicating that only 25 classes of AEs suffice to describe the 10 classes of scenes in the dataset. In preprocessing the distribution of AEs in each scene, we also found that the 25 classes of AEs automatically selected by the model cover most of the dominant AEs in each scene. Therefore, we set $n$ = 25 in the following experiments.
The number (U) of GCN layers. Table II explores the ERGL performance under different numbers of GCN layers. The results in Table II illustrate that increasing $U$ does not lead to better results. The reason for this may be that the 2-layer Gated GCN already achieves a good balance between model performance and computational efficiency on the graph consisting of semantic embeddings of 25 classes of AEs, and also that adding extra layers would make the model deeper and harder to train. Subsequent experiments set $U$ to 2.
Ablation study of AE-based relations in ERGL. The MEL module in ERGL aids in capturing multi-dimensional AE-based semantic relations between nodes. In MEL, NCM learns the relation between the node and the global contextual representation to represent the node, while NNM uses cross-attention to capture the semantic relations between nodes. To investigate how well NNM captures semantic relations, Table III presents the ablation study to compare the performance of ERGL with (w/) and without (w/o) AE-based relations. Table III shows that ERGL w/ NNM outperforms ERGL w/ NCM. For graph representation learning in ERGL, NNM, which aims to capture semantic relations between nodes, is more valuable than NCM, which focuses on learning relations between the node and the global contextual representation. This shows that NNM based on cross-attention capturing AE-based semantic relations is effective. Fig. 2 shows confusion matrices of ERGL w/o and w/ AE-based relations to explore where the AE-based relational approach works and where it does not. Fig. 2 illustrates that AE-based relational edges effectively help ERGL improve its accuracy in 6 scenes: "airp. (airport), bus, metro, park, squa. (public square), traff. (street traffic)". The accuracy of airp. is improved the most, mainly because the number of samples misclassified between stat. (metro station), pedes. (street pedestrian), and airp. is reduced from 30 and 39 to 8 and 16, respectively. In contrast, introducing MEL increases the misclassified samples of stat. and metro. Even for humans, it is challenging to distinguish these similar scenes relying on audio only. In short, introducing AE-based relational edges can effectively improve the performance of ERGL in 6 scenes, increasing its Acc from 73.35% to 78.08% in Table III.
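For readers less familiar with confusion matrices, the sketch below shows how overall and per-scene accuracy are read from one, using the same axis convention as Fig. 2 (rows are true labels, columns are predicted labels). The 3-class matrix is made-up toy data, not taken from the experiments.

```python
import numpy as np

def accuracies(cm):
    """Per-class and overall accuracy from a confusion matrix.

    cm[i, j] = number of clips with true class i predicted as class j
    """
    cm = np.asarray(cm, dtype=float)
    per_class = np.diag(cm) / cm.sum(axis=1)  # correct / all clips of class
    overall = np.trace(cm) / cm.sum()         # all correct / all clips
    return per_class, overall

# Toy 3-scene confusion matrix (hypothetical counts)
cm = [[8, 1, 1],
      [2, 6, 2],
      [0, 1, 9]]
per_class, overall = accuracies(cm)
```

Off-diagonal entries count clips of one scene assigned to another, which is how the reductions in misclassified samples discussed above are obtained.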
Comparison with non-ensemble ASC methods. Table IV shows the results of models on the same test set. In the fixed mode [40], the parameters of PANNs are not updated in training. The result of fixed-mode PANNs is comparable to Baseline, implying that PANNs with AE knowledge, which is learned from AudioSet [41], have a certain discriminative ability for scenes. In contrast, ERGL using just audio event embeddings improves ASC accuracy even though these event embeddings are learned from pseudo labels without verification. The proposed end-to-end ERGL without data augmentations offers competitive results. This illustrates that ERGL can effectively discriminate different scenes by relying only on the semantic embeddings of several scene-aware events.

| System | Model structure | Acc (%) |
|---|---|---|
| PANNs [31] (Fixed mode) | VGG-like CNN | 56.9 |
| Baseline [37] | CNN | 59.7 |
| NNF-CNNEns [42] | CNN and nearest neighbor filters | 69.3 |
| Spatio-temporal Attention [20] | CRNN | 72.5 |
| Attention-based CNN [15] | Attention-Based CNN | 72.6 |
| PANNs (Fine-tuning mode) | VGG-like CNN | 73.8 |
| Instance-based ASC [18] | CNN | 73.9 |
| Wavelet-based spectrum [17] | CRNN | 76.6 |
| Proposed ERGL | CNN and Graph Learning | 78.1 |

Comparison with scene-event joint methods. The ERGL infers target scenes based on the ERG in the scene, which is scene-event joint analysis. Table V compares ERGL with other scene-event joint methods. The model using the same latent space to classify scenes and events [22] performs the worst. The reason may be that real-life scenes and AEs differ at the semantic level and the feature space. Papers [23][38] use shared base and separated high-level features to identify scenes and AEs. RGASC [24] exploits the scene-event relationship to guide the model to achieve mutually beneficial scene-event classification. The ERGL relies only on semantic embeddings of AEs to achieve one-way event-to-scene inference, and recognizes scenes based on the corresponding explicit semantic ERG. Notably, ERGL, which only needs 25 classes of AEs' information, outperforms RGASC with 527 classes of AEs' information. Overall, the ERGL achieves promising results, demonstrating the feasibility of ASC based on the ERG.
Analysis of confusion. To further explore the reasons for misclassification between scenes in graph representation-based classification, Fig. 3 presents the structure of graph representations of audio clips from different scenes. In Fig. 3

IV. CONCLUSION
To perform ASC and simultaneously clearly answer which cues are used in classifying, we propose a scene-dependent audio event-relational graph representation learning method for ASC, which represents acoustic scenes by a set of scene-aware nodes with explicit AE semantic embeddings, and specifically produces scene-task-relevant multi-dimensional edge features to describe AE-based semantic relations between nodes. Experiments show that ERGL achieves competitive ASC performance by learning ERGs, which are constructed on semantic embeddings of only a limited number of AEs.
The MEL module consists of two submodules, the node-context relation modelling (NCM) and the node-node relation modelling (NNM). NCM first learns the ASC-task-specific relation cues between each node (event) and the global context (scene), generating a scene-aware representation to represent each node. Then, NNM models the semantic relations between nodes in each node pair, to generate the final multi-dimensional edge feature describing the AE-based scene-aware relations between each pair of nodes.

Fig. 2: Confusion matrix of ERGL w/o and w/ AE-based relational edges on the test set. (X-axis: Predicted label; Y-axis: True label.)

Fig. 3: Graph structures of test audio samples from different scenes. Green dots represent 25 classes of AEs used in this paper. A larger dot denotes a higher probability of the event. A thicker line denotes a larger edge value between nodes in the graph representation. (For visualization, a node is considered inactive if its probability is less than 0.1. Edges between two inactive nodes are not displayed. Multi-dimensional features of the edge are represented by their mean value.)
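The display rules in the caption of Fig. 3 can be expressed as a small filtering step. The sketch below is an illustration on toy data, not the plotting code used for the figure; the function name `visible_graph` and the use of NaN to mark hidden edges are assumptions.

```python
import numpy as np

def visible_graph(node_probs, edge_feats, threshold=0.1):
    """Return the active-node mask and the edge weights to draw.

    node_probs: (n,) event probabilities
    edge_feats: (n, n, m) multi-dimensional edge features
    """
    active = node_probs >= threshold        # inactive if probability < 0.1
    weights = edge_feats.mean(axis=-1)      # one scalar per edge (mean value)
    # An edge is hidden only when BOTH of its endpoints are inactive
    show = active[:, None] | active[None, :]
    return active, np.where(show, weights, np.nan)

node_probs = np.array([0.9, 0.05, 0.02])  # toy probabilities for 3 events
edge_feats = np.ones((3, 3, 4))           # toy 4-dimensional edge features
active, drawn = visible_graph(node_probs, edge_feats)
```

Here only the first node is active, so the edge between nodes 1 and 2 is hidden while every edge touching node 0 is kept.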

TABLE I: Acc of ERGL at different n values on the validation set.

TABLE II: Acc of ERGL at different U layers on the validation set.

TABLE III: Ablation study of AE relation-related MEL in ERGL.

TABLE IV: Comparison of non-ensemble systems on the test set.

TABLE V: Acc of scene-event joint analysis methods on test set.