
Audio Captioning Based on Combined Audio and Semantic Embeddings


Abstract:

Audio captioning is a recently proposed task for automatically generating a textual description of a given audio clip. Most existing approaches use the encoder-decoder model without semantic information. In this study, we propose a bi-directional Gated Recurrent Unit (BiGRU) model based on the encoder-decoder architecture that uses both audio and semantic embeddings. To obtain semantic embeddings, we extract subject-verb embeddings from the subjects and verbs of the audio captions. At the testing stage, we use a Multilayer Perceptron classifier to predict the subject-verb embeddings of test audio clips. To extract audio features, in addition to log Mel energies, we use a pretrained audio neural network (PANN) as a feature extractor, applied for the first time to the audio captioning task in order to explore the usability of audio embeddings in this setting. We combine the audio embeddings and semantic embeddings to feed the BiGRU-based encoder-decoder model. We then evaluate our model on two audio captioning datasets: Clotho and AudioCaps. Experimental results show that the proposed BiGRU-based deep model significantly outperforms state-of-the-art results across different evaluation metrics, and that the inclusion of semantic information enhances captioning performance.
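The pipeline described above can be sketched as follows. This is a minimal, hypothetical illustration of the fusion step only: an MLP maps an audio embedding (e.g., a PANN clip embedding) to a predicted subject-verb embedding, and the two are concatenated to form the input fed to the encoder-decoder. All dimensions, weights, and the `mlp_predict` helper are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_predict(x, W1, b1, W2, b2):
    """Hypothetical MLP standing in for the paper's classifier that
    predicts a subject-verb embedding from an audio embedding."""
    h = np.maximum(0.0, x @ W1 + b1)  # ReLU hidden layer
    return h @ W2 + b2

# Illustrative dimensions (not taken from the paper)
audio_dim, hidden_dim, sem_dim = 64, 32, 16
W1, b1 = rng.normal(size=(audio_dim, hidden_dim)), np.zeros(hidden_dim)
W2, b2 = rng.normal(size=(hidden_dim, sem_dim)), np.zeros(sem_dim)

# Batch of 2 clip-level audio embeddings (e.g., PANN features)
audio_emb = rng.normal(size=(2, audio_dim))

# Predicted subject-verb embeddings for the test clips
sem_emb = mlp_predict(audio_emb, W1, b1, W2, b2)

# Combined embedding fed to the BiGRU-based encoder-decoder
fused = np.concatenate([audio_emb, sem_emb], axis=1)
print(fused.shape)  # (2, 80)
```

At training time the subject-verb embeddings would come from the captions themselves, with the MLP trained to reproduce them so that the same fusion can be applied when no caption is available.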
Date of Conference: 02-04 December 2020
Date Added to IEEE Xplore: 22 January 2021
Conference Location: Naples, Italy
