Automated Audio Captioning With Topic Modeling

Automatic audio captioning (AAC) is an important area of research aimed at generating meaningful descriptions for audio clips. Most existing methods use relevant semantic information to improve AAC performance and have demonstrated the feasibility of semantic information extraction. Audio events and keywords are commonly used for this purpose. Unlike previous studies, this study proposes a framework that uses topic modeling to obtain relevant semantic content since topic models explore the main themes of the documents. To this end, we present a framework that integrates audio embeddings with audio topics in a transformer-based encoder-decoder architecture. First, we represent each audio clip with a set of topics using a pre-trained topic model, BERTopic. Then, we design a multilayer perceptron (MLP)-based multi-label classifier to predict the topics of audio clips in the testing phase. Finally, in the proposed framework, we input audio embedding and extracted topics into the transformer model to generate captions. The results show that the proposed model improves performance and competes with the most advanced methods that utilize additional external data for training. We believe that the topic modeling can be used to extract semantic content in the AAC task.


I. INTRODUCTION
Automated audio captioning (AAC) has attracted increasing interest in recent years. The AAC task combines audio and natural language processing to create meaningful natural language sentences [1]. The purpose of audio captioning is different from earlier audio processing tasks such as audio event/scene detection and tagging. Those earlier tasks do not aim to create descriptive natural language sentences, whereas audio captioning aims to capture relations between events, scenes, and objects to generate meaningful sentences. Audio captioning is a challenging audio processing task and has a significant impact on enabling several services, such as helping hard-of-hearing people and building intelligent systems by understanding environmental sounds.
Most existing methods use encoder-decoder models in the early stage of the AAC problem [1], [2]. Then, researchers explored transformer models in AAC task to improve performance with multi-head attention mechanism [3]. However, predicted captions do not include rich semantic information The associate editor coordinating the review of this manuscript and approving it for publication was Fahmi Khalifa . by using only acoustic features. To overcome this problem, the researchers extract semantic information from audio clips and captions by using audio events and keywords from the captions [4], [5], [6], [7].
Recently, researchers have adapted the topic modeling in image captioning task [8], [9] to extract rich semantic information from the images. Inspired by the successful application of topic modeling in image captioning, we propose a new AAC model with topic representations. Alternatively to the audio event and keyword extraction method, we aim to show that topic modeling can also be used as relevant semantic content for AAC task. The difference in extracting topics from previous keyword extraction methods, the keyword extraction process mainly focuses on the words in the captions, but topic models produce more generalized words across the captions by clustering approaches. The main contributions of this article are given as follows: • To the best of our knowledge, this is the first paper that introduces topic modeling in AAC task.
• We compare the results of event, keyword, and topic inclusion to show the applicability of topic modeling in AAC task. We chose the BART model since it is a recent conditional language model that is based on multi-head selfattention architecture and improves AAC performance.
• The results show that the proposed model improves performance and competes with the most advanced methods that utilize additional external data for training, and the topic modeling can be used to extract semantic content in the AAC task. The remainder of this article is organized as follows. First, we introduce our proposed method in Section II. Next, experiments and ablations are shown in Section III. Then, section IV presents the results and discussion. Finally, we conclude our paper in Section V.

II. RELATED WORK
This section describes related work in semantic information usage in image and video captioning tasks, AAC task, and topic modeling.

A. CAPTIONING WITH SEMANTIC INFORMATION
Semantic information extraction has been previously explored in image and video captioning tasks to obtain highlevel attributes from images and video clips. Reference [10] uses a semantic attention method by detecting visual concepts in the images to improve image captioning performance. The extracted regions, objects, and attributes are obtained as visual concepts and given to the Recurrent Neural Network (RNN). A Long Short-Term Memory with Attributes (LSTM-A) model is presented in [11] to integrate attributes with deep learning models. First, they detect attributes observed in images with rich semantic information. Then, these attributes are integrated into Convolutional Neural Networks (CNNs) plus RNNs framework to improve image captioning performance.
Researchers also handle semantic information usage in video captioning task. In [12], a novel deep architecture with transferred semantic attributes is presented. They detect high-level semantic attributes from video frames and inject them into Long Short-Term Memory (LSTM) model. Reference [13] addresses the semantic information usage using LSTM with two semantic guiding layers. These layers are global, object, and verb semantic attributes to guide the language model. The results show that the inclusion of semantic information improves video captioning performance.

B. AUDIO CAPTIONING
AAC is first proposed in [1]. The ProSound Effects [14] is used for their experiments due to the lack of publicly available audio captioning datasets. The Clotho [15] and the AudioCaps [16] datasets are published to fill this gap. The growing presence of publicly available datasets has led to increasing research in the AAC task. Several studies have addressed audio captioning on the Clotho [17], [18], [19] and AudioCaps [18], [20] datasets.
Existing audio captioning models use encoder-decoder and transformer-based encoder-decoder models to handle the sequence-to-sequence nature of the problem. An early attempt based on the encoder-decoder model with an attention mechanism is proposed in [1]. A different encoder-decoder model is presented with gated recurrent units (GRU) using a new Chinese audio captioning dataset [2]. An encoderdecoder model with caption decoder and content word decoder is presented in [19] to solve infrequent class problems in the captions. A transformer model is presented in [3] using temporal and time-frequency information in audio clips. Another transformer-based architecture is proposed in [21] to learn information with a continuously adapting approach.
Due to the data scarcity problem, the use of relevant semantic information has been widely adopted in the task of audio captioning. Recent studies extract audio events from the audio input or keywords from the captions to obtain semantic content. In [22], pre-trained embeddings are used in the encoder stage, and a transformer decoder is used in the decoding stage. They extract audio event tags from similar audio clips by using pre-trained models. Reference [7] uses YAMNet [23] to extract audio event tags with audio embeddings in BART autoencoder and improves audio captioning performance. Narisetty et al. propose a system with audio events based on a conformer encoder and a transformer decoder [24]. A transformer model with keyword estimation is proposed in [4]. Reference [18] improves audio captioning performance by extracting subject-verb keywords from the captions.

C. TOPIC MODELING
Topic models are used to discover the main themes of large documents and organize the documents according to discovered themes [25]. Topic modeling is mainly used to cluster documents in natural language processing (NLP) applications [26]. There exist different topic models in the literature, such as Latent Dirichlet Allocation (LDA) [27], top2vec [28], and BERTopic [29].
LDA is a Bayesian model that describes each collection item with a set of topics and uses a Dirichlet prior distribution. Top2vec is another popular topic model. Unlike LDA, it uses the semantic similarity between documents and word semantic embedding. The BERTopic model is recently introduced. It uses BERT [30] as an embedder and a sentence transformers model. Uniform manifold approximation and projection (UMAP) [31] and hierarchical density-based clustering (HDBSCAN) [32] methods are also used for dimension reduction and clustering documents in the BERTopic model.

III. TOPIC-BASED AUDIO CAPTION MODEL
We present the overall structure of our system in Fig. 1. The caption generation pipeline is given in the following sections.

A. FEATURE EXTRACTOR
Previous studies have shown the performance of the pretrained acoustic embeddings such as VGGish [33] and In the training phase, we input audio features and obtained topics from the topic model to the BART encoder. A linear layer is applied to PANNs features to convert audio features to 768-dimensional BART encoder inputs. In the inference phase, the Topic Predictor component is used to predict a given test audio clip's topics, and the predicted topics and audio features are given to the model to predict the caption. X is the audio feature vector, T is the topic vector from the topic modeling, and P is the predicted topic vector by the Topic Predictor component. pre-trained Audio Neural Networks (PANNs) [34] than other representations such as spectrograms, log Mel energies [18], [22]. Thus, we use PANNs as feature extractor. The PANNs are pre-trained features on the AudioSet dataset [35]. Wavegram-Logmel-CNN14 model is used to extract the PANNs features. In this case, we present PANNs features as

B. TOPIC MODELING WITH BERTopic
We extract topics from the Clotho dataset using the BERTopic [29] since it performs better by embedding method [36] than other standard topic models as LDA and top2vec. BERTopic is a neural topic modeling with a class-based TF-IDF (Term Frequency-Inverse Document Frequency) procedure. Mathematically, it is given by: where tf is the frequency of term t in a class c, A is the average number of words for each class. Here, inversed document frequency is replaced by inversed class frequency, where class c is obtained by concatenating documents in each cluster. BERTopic extracts topics with topic probabilities from the ground truth captions on the Clotho development split. In the training phase, we use DistilBERT base multilingual (cased-v2) [37] for sentence transformer and embedding models for topic modeling with BERTopic. The BERTopic model predicts ten topics for each caption at most. Since the Clotho dataset has five captions for each audio clip, we can obtain up to 50 topics for an audio clip. We have experimented with different numbers of topics (2, 3, 10) for an audio caption using the BERTopic to explore how many topics we should use in the model for each caption. Let k be the number of topics obtained from the topic model for five captions, T = [t 1 , . . . , t k ] is the topic vector with the length of k. When we experiment with two topics for each caption, we obtain k = 10. We obtain k = 50 for an audio clip when we experiment with ten topics for each caption. Since some captions are similar for a given audio clip, some topics are identical; in this case, we remove the duplicated topics while producing the topic vector. For example, when we experiment with ten topics for each caption, k is between 10 and 50 because of the duplicated topics for an audio clip. In our experiments, the best result is obtained using ten topics for each caption.
Some examples of extracted topics by BERTopic are given in Table 1. We present ten topics for the first ground truth captions. For instance, the first example in Table 1 has different topic words with different probabilities representing the captions. ''singing'' is the most probable topic word for the first example. When we analyze the ground truth captions, four captions include the word ''sing'', and it seems to be the most frequent word in the captions. For the second example in 1, the most probable topic word is ''train'' by the BERTopic model, and all of the ground truth captions include the word ''train''. We can see that the other topic words that have lower probabilities are also related to the given captions.
The BERTopic model first generates the main topics on the Clotho dataset, and each of them includes a set of words. However, the representation probabilities of these words are different. Fig. 3 presents the illustration of example topics, a set of words under these topics, and the probabilities of these words. The columns are set of the most probable words that represent a topic. For example, ''truck'', ''road'', and ''driving'' are the set of words that represents a topic in Fig. 3. The most probable word for this topic is ''truck''.
Also, we present the illustration of the similarity between topics in Fig. 4. A heatmap is created based on the cosine similarity matrix between topic embeddings. In the heatmap, the topics are grouped into three words, and the similarity matrix shows these words' similarity scores with another group of words. Fig. 4 (a) presents the similarity between the topic, includes the words ''boat, engine, water'' and ''rain, cars, car'', and (b) shows the similarity between the topic includes the words ''boat, engine, water'' and ''bell, ringing, rung''. The similarity in (a) is higher than (b) since ''boat, engine, water'' and ''rain, cars, car'' are more similar than the words in (b).
We use the extracted topics to train the language model and to create a dataset for the Topic Predictor module.

C. TOPIC PREDICTOR
Since we don't have the topic of the input audio clip during the testing phase, for inference, we predict topics for each audio clip by using a topic predictor module. We implement an explicit module for topic prediction, not in an end-toend manner. For this module, we create a dataset with the audio clips and the topics predicted by topic modeling in the previous section.
Each audio clip a j has captions S = [s 1 , s 2 , .., s z ] where s represents an arbitrary caption in the dataset and the z is set to 5 for the Clotho dataset. Hence, the number of topics extracted for an audio clip is z MULTIPLY k. However, some of the captions for a given audio clip are similar, and the BERTopic predicts similar topics for some captions. Thus, duplicate topics are removed from the topic list. In order to create our audio-topic dataset, we give audio clips' features as input and the obtained topic words as output.
The problem is a multi-label classification task. To solve this problem, we designed a multilayer perceptron (MLP). Let P j = [p j1 , . . . , p jM ] ∈ {0,1} M is topic vector where M = 1695, j is the j th audio clip. M = 1695 is the number of topics obtained by the BERTopic model from the development caption dataset. Each topic vector is obtained as: After this operation, we obtain the topic vector P j of audio clip j. The MLP module contains three hidden layers with 512 dimensions, and we train the MLP module for 100 epochs. We use a Sigmoid function.
Let p j = [p j1 , . . . , p jM ] be the probabilities of topics for j th test audio clip. We determine: where x j is the input features and P j is the predicted topic vector for j th audio clip.

D. LANGUAGE MODEL
In language modeling, our goal is to maximize the probability given by: where C is the caption, X represents the audio features, T represents the topics for a given audio clip. θ is the model parameters.
Recent approaches have shown that BART autoencoder [38] improves the performance in AAC task [7]. It is a transformer model that has a bidirectional encoder and autoregressive decoder. We use the BART-base model with six encoder and six decoder layers. Each encoder and decoder layer is composed of a multi-head self-attention layer with 12 heads. Each layer of the transformations has 768 features and 50265 sub-words in the tokenizer.
Concatenated audio features and topics are used as input to the BART encoder to similarly [7]. Before concatenation, the BART tokenizer is applied to the obtained topics, and a linear layer is applied to PANNs features in order to convert audio embeddings to 768-dimensional BART encoder inputs. After this process, the BART autoencoder generates words autoregressively for given audio features and topics.

IV. EXPERIMENTAL SETTINGS
This section describes the details of the dataset, evaluation metrics, and implementation details.

A. DATASET
We conduct our experiments on the Clotho dataset [15]. Clotho has development, evaluation, validation, and test splits. Test splits can not be obtained since the publishers of Clotho use these splits for scientific challenges. The number of records in the splits is 3839, 1045, and 1045, respectively. All splits have five captions for each audio clip. Each audio file is used five times for these experiments with their corresponding captions similar to [15]. The vocabulary of Clotho contains 4366 different words.

B. EVALUATION METRICS
For evaluations, BLEU-n [39], METEOR [40], ROUGE L [41], CIDEr [42], SPICE [43], and SPIDEr [44] metrics are used. The matching words in the actual and predicted captions are calculated for BLEU-n. It calculates the precision for n-grams. Recall and precision are calculated for METEOR. ROUGE L calculates Longest Common Subsequence. CIDEr presents more semantic results by calculating cosine similarity between the actual and predicted captions. SPICE computes semantic similarity instead of n-gram similarity. SPIDEr calculates the average of CIDEr and SPICE.

C. IMPLEMENTATION DETAILS
The system is implemented using Pytorch HuggingFace framework [45], and the experiments are run on a computer with a GTX1660Ti GPU, Linux Ubuntu 18.04 system. We use Python 3.7 for implementation. We run all experiments for 20 epochs and choose the model with the lowest validation VOLUME 11, 2023  error for the inference. We use the BART-base model with six encoder and six decoder layers for the experiments. We use AdamW [46] for the parameter optimization. The gradient accumulation step is four, and the batch size is 8. The learning rate is 10 −5 . GeLUs activation function is used [47] similarly to [38]. The number of parameters in our proposed model is approximately 141 million. Training takes about 4 hours on the given configuration. The loss function categorical-crossentropy is used and given by: where c t is the predicted word based on previous words, audio features, and topics, X is the audio features, and T is the topics for a given audio clip.

D. ABLATION STUDIES
In order to show the applicability and contribution of topic modeling in AAC task, we have also conducted experiments with audio events and keywords. In addition, we implement a base-transformer model [49] to show the contribution of topic modeling. We present the following ablations: • Extracting events and keywords experiments • Base-Transformer model experiments

1) EXTRACTING EVENTS AND KEYWORDS EXPERIMENTS
In order to extract audio event labels, we use the PANNs features. The last layer of the PANNs gives probability scores of each audio event on the AudioSet dataset. For the event extraction method in Table 2, we obtain the events from audio clips similar to our previous study in [6] since it improves performance. Let E = [e 1 , . . . , e Y ], e y ∈ R 527 , where e y is the probability of each sound class on the AudioSet dataset.
We concatenate E and X as inputs to the transformer model and generate captions.
For keyword extraction, we use our previous keyword extraction method in [18]. We extract subjects and verbs from the dataset captions. We use lemmas of the subjects and verbs and remove duplicates to create a keyword corpus. We create V = [v 1 , . . . , v R ] for each audio clip. If j th audio clip's caption contains v jr , then v jr = 1, otherwise v jr = 0. Then similar to our event extraction method, we concatenate V and X to input the transformer model.

2) BASE-TRANSFORMER MODEL EXPERIMENTS
To explore the contribution of topic modeling to the different architectures in the AAC task, we conduct topic modeling with a base-transformer model introduced in [49] and the BART model. The base-transformer model has six identical layers in the encoder and decoder. Also, the output dimension is used as d model = 512, similar to [49]. The results show  that topic modeling improves AAC performance in the basetransformer and BART models. Table 2 shows that using the topics performs better than the DCASE 2021 baseline encoder-decoder model, event, and keywords results. Firstly, we compare the results of our base transformer model with a recent base encoder-decoder model in [48]. Our base transformer model improves the recent baseline encoder-decoder model results. Then, we add events, VOLUME 11, 2023 keywords, and topics to the transformer model separately. Again, the results in Table 2 show that the inclusion of topics from the topic model has better results than event inclusion.

V. RESULTS AND DISCUSSION
In this section, we present our results and comparisons with the literature.
We compare our proposed method with the recent studies that use event and keyword extraction methods in Table 3. We divide Table 3 into two parts. The first part presents the results of studies that use semantic information in the literature, and the second part presents our proposed method with different types of semantic information inclusion.
When we analyze different types of semantic information in the literature, the study with event keyword extraction [22] performs best in Table 3. Note that, the studies [22], [24] use external data in addition to the Clotho dataset during the training. Our proposed method with topic modeling performs competitive results in the SPIDEr metric, which is known as the most crucial metric in AAC challenges [48], with the studies that use event or keyword extraction methods and data augmentation techniques.
When we compare event, keyword, and topic extraction in our deep architecture, the results show that the model with the ground truth topics performs best. The results with predicted topics with our MLP topic predictor are lower than ground truth results but competitive with event inclusion. When we analyze the topic and keyword inclusion in the model, topic inclusion performs better than keyword inclusion because the topic model groups similar words to create topics, producing more generalized semantic information than keywords. For example, in Example 2 in Table 4, we can see that the extracted keywords are part of the sentences, but the topic model can also extract similar words like ''talking'' and ''speaking''.
We further investigate topic and event inclusion, and they produce similar results, but extracted topics seem more successful than events in Table 4. For example, Example 1 in Table 4 shows that the extracted events mainly focus on different animal types, but the topic model can capture more related words to the ground truth captions. On the other hand, if we analyze the extracted events, keywords, and topics in Table 4, we can see that events are generally based on some types as animal or vehicle varieties. Also, the keywords depend on the ground truth captions and only include words in the caption corpus. Nevertheless, topics are more generalized words related to the ground truth sentences using different words except for the caption corpus. As a result, the predicted captions by different semantic information types in Table 4, the proposed method with topics produces more related words in the examples.
Topic models can produce related semantic content from audio clips by performing better results than baseline methods. These examples demonstrate that topic models can help to create meaningful captions in AAC task.

VI. CONCLUSION
This paper presents a new audio captioning method with topic modeling. Unlike other works, our method uses topic modeling that can be used alternatively for events and keywords that are widely used in AAC task. The results show that the topic model improves the performance of the baseline models. Also, it demonstrates better SPIDEr performance, which is more important than other metrics while using events or keywords compared to the literature. For future work, we will further investigate the possible improvements in topic modeling and prediction models to generate better captions. Since extracting semantic information from audio clips and captions is very important, we believe this article opens new directions for future research in AAC task.
AYŞEGÜL ÖZKAYA EREN received the B.Sc. degree in computer engineering from Çankaya University, Turkey, and the M.Sc. degree in information systems from Middle East Technical University, Turkey. She is currently pursuing the Ph.D. degree in computer engineering with Başkent University, Turkey. She is also a Research Assistant with the Computer Center, Middle East Technical University. Her current research interests include machine learning, deep learning, and audio processing.
MUSTAFA SERT (Senior Member, IEEE) received the M.Sc. and Ph.D. degrees in computer science and engineering from Gazi University, Turkey, in 2001 and 2006, respectively. He is currently an Associate Professor of computer science and engineering and a Senior Lecturer with the Department of Computer Engineering, Başkent University. His research interests include audio signal processing, machine learning, and content modeling for multimedia search and retrieval. In particular, his expertise includes deep learning, computational audio analysis, speech processing, acoustic pattern recognition, audio-video content understanding, and multimodality. He is also a member of the IEEE CTSoc WNT Technical Committee. He is a Senior Member of the IEEE Computer Society. He received the two Best Reviewer Awards from IEEE ICME. He serves in technical reviewing and organization committees for several international conferences, including IEEE ICME, VLDB, FUZZ-IEEE, and IEEE ACM MM. He also serves as a Reviewer for the IEEE/ACM TRANSACTIONS ON