Abstract:
Inspired by the success of deep learning in image classification and speech recognition, deep learning algorithms have been explored for music source separation. Solving this problem would open up a wide range of applications, such as automatic transcription and audio post-production. Most algorithms use the Short-Time Fourier Transform (STFT) as the time-frequency (T-F) input representation, but each deep learning model configures the STFT differently; there is no standard set of STFT parameters for music source separation. This paper explores different STFT parameters and investigates an alternative representation, the Constant-Q Transform (CQT), for separating three individual sound sources. Experimental results show that dilated convolutional layers work well with the STFT, while standard convolutional layers work well with the CQT. The best-performing combination for music source separation is the STFT with dilated CNNs and a soft masking method. Researchers should therefore still tune the parameters of the T-F representation to obtain better performance from their deep learning models.
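As a minimal sketch of the two T-F representations and the soft masking the abstract refers to (not the authors' implementation), the snippet below computes an STFT and a CQT with librosa and applies a ratio-style soft mask; the file path, FFT size, hop length, and the stand-in magnitude estimates are illustrative assumptions.

```python
# Minimal sketch (not the paper's code): STFT and CQT inputs plus a
# soft (ratio) mask. File path and parameters are illustrative.
import numpy as np
import librosa

mixture, sr = librosa.load("mixture.wav", sr=44100, mono=True)  # hypothetical file

# STFT representation; n_fft / hop_length are exactly the kind of
# parameters the paper notes vary from model to model.
S = librosa.stft(mixture, n_fft=2048, hop_length=512)
mag, phase = np.abs(S), np.angle(S)

# CQT representation of the same signal, for comparison.
C = librosa.cqt(mixture, sr=sr, hop_length=512, n_bins=84, bins_per_octave=12)

# Soft masking: given a model's magnitude estimates for each source
# (placeholders below), a ratio mask carves each source out of the mixture.
est_vocals = mag * 0.6  # stand-in for a network's magnitude estimate
est_accomp = mag * 0.4  # stand-in for the remaining sources
denom = est_vocals + est_accomp + 1e-8
vocals_spec = (est_vocals / denom) * mag * np.exp(1j * phase)
vocals = librosa.istft(vocals_spec, hop_length=512)
```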
Date of Conference: 19-21 August 2019
Date Added to IEEE Xplore: 06 January 2020