Introduction
Speech recognition technology, as one of the representatives of the new generation of information technology, has become increasingly mature. Within this field, the accuracy of speech classification and music classification has reached a considerable level, even exceeding the ability of human auditory perception [1], [2]. However, environmental sound classification (ESC), another branch of audio recognition, still faces many difficulties, such as the non-stationary nature of environmental sound and strong interference from ambient noise [3]. At the same time, ESC research contributes to the construction of smart cities [4]. For example, it can be used to automatically identify specific types of sound in the environment, such as children crying [5], animal sounds [6] and sirens [7]. Hence, it has attracted considerable research attention.
In early studies, ESC mainly relied on manually extracted features, such as the Mel-frequency cepstral coefficient (MFCC), linear predictive cepstral coefficient (LPCC), short-term energy and zero-crossing rate [8]. These features were then classified by machine learning methods such as the support vector machine (SVM), k-nearest neighbours (k-NN) and Gaussian mixture model (GMM) [9]–[11].
As deep learning methods have been adopted in more and more fields, they have also been introduced into ESC research. In this paper, a two-stream convolutional neural network (CNN) model is proposed to improve the accuracy of ESC. In this method, both time-domain and frequency-domain features of the audio signal are used as input, and a pre-emphasis module is constructed at the input layer to improve the signal-to-noise ratio (SNR). In addition, in terms of data pre-processing, this paper proposes a random-padding method to patch shorter data sequences.
The structure of this paper is arranged as follows. A brief introduction to related work on CNN-based ESC is given in Section II. The two-stream CNN method and the data pre-processing method called random-padding are elaborated in Section III. The detailed experimental process and results based on the UrbanSound8K dataset are presented in Section IV.
Related Work
Deep learning has been widely used in various fields, and some researchers have also begun to introduce it into ESC research [11]. As a feedforward neural network with convolutional computations and a deep structure, the CNN has long been used in image recognition research [12]. In recent years, CNNs have also been frequently used in ESC research [13]–[15]. These studies can be divided into three categories.
In the first category, the network is trained on the raw audio signal. Dai et al. [16] proposed a 1D-CNN with 34 weight layers. Compared with shallow networks, deeper networks can achieve better results due to the expansion of the receptive field. Abdoli et al. [17] proposed an end-to-end approach for ESC based on a 1D-CNN, whose advantage is that manual feature extraction is no longer needed. However, a 1D-CNN extracts features at the global level without considering the temporal structure and frequency characteristics of environmental sounds [3].

In the second category, the network is trained on features extracted from the raw signal, such as the spectrogram and MFCC. In many studies, the MFCC is used as input data to train the classification model. However, the discrete cosine transform (DCT) adopted to extract the coefficient features in the MFCC discards structural information of the audio signal, so the MFCC does not perform well for deep learning models [15]–[17]. In contrast, the logmel-CNN (LMCNN) model, which adopts the logmel spectrogram feature, performs well. Piczak [18] proposed a CNN model with logmel features extracted from the raw audio signal. Zhang et al. [15] used the Mixup method to combine logmel and gammatone spectrogram features to improve classification performance.

In the third category, the network is trained on multiple types of input data. The advantage of this category is that it can combine the time-domain and frequency-domain features of the audio signal, thereby compensating for the shortcomings of single-input models. Tran and Tsai [7] used the raw waveform and a combined feature formed by the MFCC and logmel spectrogram as input data and proposed SirenNet for siren-sound-based emergency vehicle detection. Li et al. [19] proposed an ensemble model in which RawNet and MelNet are trained individually, and the Dempster-Shafer (DS) method is then adopted to combine the training results. Su et al. [20] proposed a TSCNN-DS model whose performance on the UrbanSound8K dataset is quite good, but its input data are too complex, requiring many features to be combined, and the two-stream network is fused by DS evidence theory. Furthermore, [21] described a multi-stream CNN with temporal attention and decision fusion for ESC. However, the multi-stream CNN not only has a complex structure but also combines the original signal and the short-time Fourier transform, which leads to a large amount of data and demands a high level of hardware. The method proposed in this paper belongs to this third category.
Method
Deep learning models represented by the CNN and the long short-term memory (LSTM) network have been widely used in the field of audio processing [22]–[24]. However, we chose only the CNN to construct the basic ESC model, because the CNN has several advantages over the LSTM in ESC tasks. First, the ESC task emphasizes the types of sounds in the current environment and does not need to pay special attention to sounds from the past, so the most obvious advantage of LSTM technology does not apply here. Second, the CNN can capture the time-frequency characteristics of sound signals by using data such as the acoustic spectrum as input, which is difficult for the LSTM to achieve.
It has been demonstrated that models combining multiple types of features perform better [19]–[21]. Hence, a two-stream CNN method combining an RACNN (a CNN operating on the raw audio) and an LMCNN is proposed in this paper. In this way, both the time-domain and frequency-domain features of the signal are considered. In addition, a random-padding method is proposed to solve the inconsistent sample-length problem of the UrbanSound8K dataset.
A. Selecting the Appropriate Input Data
Compared with speech, an environmental sound event (ESE) is a kind of background sound that is often mixed with various background noises, so it is more difficult to identify. The MFCC is widely used to solve the automatic speech recognition (ASR) problem, and it does perform well for artificial audio recognition [25]. Therefore, we decided to introduce the MFCC to deal with the ESC problem. Further, to solve the compatibility problem between the MFCC and deep learning networks, a conversion step is needed to transform the MFCC into the logmel feature. In addition, a simple comparison experiment was carried out between the logmel feature and other popular audio features; the results are shown in Table 1. The waveform is the raw audio wave saved as a greyscale picture, and the constant-Q transform (CQT) is a time-frequency transform often used for music signals. Clearly, the logmel feature is superior to the other common features.
However, we cannot rely only on the logmel feature for learning; other information is needed to compensate for what it lacks, which may be necessary for further progress on the ESC task. The fast Fourier transform (FFT) is used in the process of extracting logmel features, converting the time-domain signal into a frequency-domain signal. Therefore, the logmel feature inevitably loses important time-domain information. The most direct remedy is to analyse the original signal itself. We therefore use a two-stream CNN with the raw audio signal and the logmel feature as the input data.
B. Random-Padding Method
As a public dataset, UrbanSound8K [26] is commonly used in ESC research. It contains 8732 labelled sound excerpts (≤ 4 s) of urban sounds from 10 classes. Of these, 1798 are shorter than 4 s, accounting for 20.59% of the total. Directly excluding these samples during pre-processing would be a large waste. Several methods, such as cubic spline interpolation and zero-padding, are already available for patching data. However, some samples are shorter than 1 s or even 0.2 s, so cubic spline interpolation is clearly not applicable to them. Hence, a simple and effective data-patching strategy called random-padding is proposed, described as follows (its pseudocode is shown in Algorithm 1):
For a sample of duration t, where 0 < t ≤ 2 s, the entire sample is copied repeatedly until its length reaches 4 seconds; the last copy is truncated at a random point.

For a sample of duration t, where 2 s < t < 4 s, a random data segment of the sample is selected to patch the original sample to 4 seconds at once.
Algorithm 1 Random-Padding
Input: sample x with duration t; target duration T = 4 s
if 0 < t ≤ 2 s then
  repeatedly copy x until the total length reaches T, truncating the last copy at a random point
  return the padded sample
else
  append a randomly located segment of x of duration T − t to x
  return the padded sample
end if
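For concreteness, the following Python sketch implements the random-padding strategy described above, assuming the 4 s target length and the 11025 Hz sample rate used later in this paper. The exact placement of the random cut is our interpretation of Algorithm 1, not a verbatim reproduction of the original implementation.

import numpy as np

TARGET_LEN = 4 * 11025  # 4 s at the 11025 Hz sample rate used in this paper

def random_padding(x, target_len=TARGET_LEN, rng=None):
    # Pad a short 1D clip x to target_len following the random-padding strategy.
    if rng is None:
        rng = np.random.default_rng()
    n = len(x)
    if n >= target_len:                       # already long enough: truncate
        return x[:target_len]
    if n <= target_len // 2:                  # case 0 < t <= 2 s: tile whole copies
        reps = target_len // n
        remainder = target_len - reps * n
        start = rng.integers(0, n - remainder + 1) if remainder else 0
        return np.concatenate([np.tile(x, reps), x[start:start + remainder]])
    pad_len = target_len - n                  # case 2 s < t < 4 s: patch once
    start = rng.integers(0, n - pad_len + 1)  # randomly located segment of x
    return np.concatenate([x, x[start:start + pad_len]])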
In both padding situations, the sound is cut off at a random point, which is why the method is called random-padding. Its advantages are that it retains the 20.59% of samples shorter than 4 s while preserving the temporal order of the completed data. Figure 1 compares the zero-padding and random-padding methods.
The comparison of the input data between zero-padding and random-padding. (a) and (c) are the raw data and logmel graph after zero-padding respectively, and (b) and (d) are the raw data and logmel graph after random-padding, respectively.
C. Pre-Emphasis Module
Pre-emphasis is a widely used method for audio pre-processing [28]. The pre-emphasis filter is represented as y(n) = x(n) − αx(n−1), where x(n) is the input signal, y(n) is the output and α is the pre-emphasis coefficient, typically close to 1 (e.g., 0.97).
Pre-emphasis module. The raw audio data pass through the first two convolution layers, which are initialized with pre-emphasis weights. These two layers jointly constitute the pre-emphasis module and participate in tuning together with the whole network.
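As an illustration, the module can be sketched in Keras as two trainable Conv1D layers whose weights are initialized to a pre-emphasis filter. The coefficient 0.97 below is a typical pre-emphasis value and an assumption on our part; the second layer is the kernel-length-1 convolution with an initial value of 1 described in the experiments below.

import numpy as np
from tensorflow.keras import layers, models

ALPHA = 0.97  # typical pre-emphasis coefficient; the paper's exact value is assumed

def pre_emphasis_module(input_len):
    # Two trainable Conv1D layers initialized as y[n] = x[n] - ALPHA * x[n-1].
    inp = layers.Input(shape=(input_len, 1))
    pre = layers.Conv1D(1, kernel_size=2, padding='same', use_bias=False)(inp)
    out = layers.Conv1D(1, kernel_size=1, use_bias=False)(pre)  # length-1 kernel
    model = models.Model(inp, out)
    model.layers[1].set_weights([np.array([[[-ALPHA]], [[1.0]]])])  # [-ALPHA, 1]
    model.layers[2].set_weights([np.ones((1, 1, 1))])               # initial value 1
    return model

Both layers remain trainable, so the pre-emphasis coefficients are fine-tuned together with the rest of the network, as described above.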
D. Two-Stream CNN
The network structure of the proposed method (Figure 4) is a two-stream CNN combining the RACNN and the LMCNN. In the RACNN stream, the input is the raw audio signal. After the pre-emphasis module, the kernel length of the first convolutional layer is set to 60, corresponding to the logmel dimension of the sound signal. In addition, to increase the receptive field of the RACNN, 8 convolutional layers with kernel size 3 and stride 1 are added. Convolutional and pooling layers are combined in a manner similar to the VGG architecture, for a total of 11 layers in the RACNN. In the LMCNN stream, the input is the logmel matrix; there are 4 convolutional layers, each followed by a pooling layer, with a kernel size of 3×3, a stride of 1 and a sequentially increasing number of kernels. A fully connected layer is used to unify the format of the feature vectors of the two streams. Finally, the model fuses the two feature maps by addition and feeds the result to a softmax classifier to output the classification result. In addition, batch normalization (BN) and global average pooling are used before the fully connected layer of each stream to reduce the number of parameters; these are not marked in the figure.
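The following Keras sketch illustrates this two-stream layout. The kernel sizes, layer counts, fusion by addition and the use of BN and global average pooling follow the description above, while the filter counts, strides and pooling sizes are illustrative assumptions.

from tensorflow.keras import layers, models

NUM_CLASSES = 10
RAW_LEN = 4 * 11025         # 4 s of raw audio at 11025 Hz
LOGMEL_SHAPE = (60, 44, 1)  # 60 mel bands x 44 frames

def build_two_stream_cnn():
    # RACNN stream: raw waveform through pre-emphasis, then VGG-like 1D convs.
    raw_in = layers.Input(shape=(RAW_LEN, 1))
    x = layers.Conv1D(1, 2, padding='same')(raw_in)              # pre-emphasis module
    x = layers.Conv1D(1, 1)(x)                                   # (see sketch above)
    x = layers.Conv1D(32, 60, strides=4, activation='relu')(x)   # kernel length 60
    x = layers.MaxPooling1D(4)(x)
    for i, f in enumerate((32, 32, 64, 64, 128, 128, 256, 256)): # 8 convs, kernel 3
        x = layers.Conv1D(f, 3, strides=1, padding='same', activation='relu')(x)
        if i % 2 == 1:
            x = layers.MaxPooling1D(4)(x)                        # pool every two convs
    x = layers.BatchNormalization()(x)
    x = layers.GlobalAveragePooling1D()(x)
    raw_feat = layers.Dense(128, activation='relu')(x)

    # LMCNN stream: 4 conv layers, each followed by pooling, increasing filters.
    lm_in = layers.Input(shape=LOGMEL_SHAPE)
    y = lm_in
    for f in (32, 64, 128, 256):
        y = layers.Conv2D(f, (3, 3), strides=1, padding='same', activation='relu')(y)
        y = layers.MaxPooling2D((2, 2))(y)
    y = layers.BatchNormalization()(y)
    y = layers.GlobalAveragePooling2D()(y)
    lm_feat = layers.Dense(128, activation='relu')(y)

    # Fuse the two streams by element-wise addition and classify with softmax.
    fused = layers.Add()([raw_feat, lm_feat])
    out = layers.Dense(NUM_CLASSES, activation='softmax')(fused)
    return models.Model([raw_in, lm_in], out)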
All samples of the UrbanSound8K dataset (the light blue bars represent the samples < 4 s; the dark blue bars represent the samples = 4 s).
Experiment
This part analyses the experimental results for the random-padding method, the pre-emphasis module and the two-stream CNN model. Figure 5 shows the overall flowchart of the experiment. Due to hardware limitations, the raw audio data are first down-sampled and then used to generate the corresponding one-dimensional audio data and logmel features via the random-padding strategy. The one-dimensional audio data, combined with the pre-emphasis module, are the input to the RACNN, and the logmel feature is the input to the LMCNN. The fully connected outputs of the two streams are then added to produce the classification result.
A. Data Pre-Processing
The hardware platform in this paper uses an Intel Core i5-9400F CPU, an NVIDIA GTX 1660 GPU and 16 GB of RAM, with Keras 2.2 as the development environment. The UrbanSound8K dataset is divided into 10 sound classes: air conditioner (AC), car horn (CH), children playing (CP), dog bark (DB), drilling (Dr), engine idling (EI), gunshot (GS), jackhammer (Ja), siren (Si) and street music (SM). The Librosa audio processing library was used to read the original samples at a sample rate of 11025 Hz, and the dataset was then converted into 8732 samples of 4 s each (approximately 9.7 h in total) via the random-padding method. The number of mel channels of the logmel spectrogram is 60, the FFT window length is 2048, and the frame shift is 1024. The final extracted logmel matrix size is 60×44.
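With these settings, the feature extraction can be reproduced with Librosa roughly as follows; extract_logmel is a hypothetical helper name, and the clip is assumed to have already been padded to 4 s by random-padding.

import librosa

def extract_logmel(path):
    # Read at 11025 Hz, then compute the 60-channel logmel matrix
    # with a 2048-sample FFT window and a 1024-sample frame shift.
    y, sr = librosa.load(path, sr=11025)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                         hop_length=1024, n_mels=60)
    return librosa.power_to_db(mel)  # shape (60, 44) for a 4 s clip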
B. Experiment and Results
1) Comparison Between Random-Padding and Zero-Padding
To verify the effectiveness of the random-padding method proposed in this paper, it is compared with the most commonly used zero-padding method. In this experiment, the ratio of the training set to the test set is close to 8:2. To further test random-padding, we also used the samples shorter than 4 s as an independent dataset. It should be noted that these samples are extremely unevenly distributed across categories (the sample sizes are shown in Figure 3). To minimize the impact of sample imbalance, we excluded the three categories air conditioner, children playing and street music; the remaining 7 categories contain 1709 samples in total. Table 3 shows the comparison results. The 1D-CNN is a simple 7-layer one-dimensional CNN based on the raw signal, and the 2D-CNN is a 4-layer two-dimensional CNN based on the logmel feature. The accuracy of the random-padding method is improved to different degrees compared with zero-padding. In particular, random-padding yields a large improvement for the 1D-CNN, whereas for the 2D-CNN the improvement is less obvious. The likely reason is that zero-padding destroys the temporal order of the signal, to which the 1D-CNN is sensitive. Random-padding therefore brings an obvious improvement for the 1D-CNN, especially on the less-than-4s dataset, where it is approximately 2.07% better than zero-padding, which demonstrates that our method works.
2) Pre-Emphasis Module Embedded in the CNN
In this section, the pre-emphasis module is compared with the pre-emphasis layer proposed in [27]; the ESC10 dataset is also used in this experiment. Both the ESC10 and UrbanSound8K datasets have 10 classes, but the former is much smaller, with only 400 samples. The division of the training and test sets is the same as in the previous section. The comparative results are shown in Table 4. The pre-emphasis module proposed in this paper is superior to the pre-emphasis layer on both the ESC10 and UrbanSound8K datasets, which shows that our design improves model performance: adding a convolutional layer with a kernel length of 1 and an initial value of 1 regulates the network better than a single pre-emphasis layer, while hardly increasing the number of model parameters. Finally, ablation experiments were conducted on the pre-emphasis module, and the performance of the model improved on both datasets after its introduction (Table 5).
3) Experiment of the Whole Network Model
In this part, the performance of the two-stream CNN is verified. The initial learning rate is set to 0.01, and a learning-rate decay strategy is adopted: the learning rate is reduced to 0.1 times its previous value at the 20th and 80th epochs and to 0.5 times its previous value at the 50th epoch. A total of 110 epochs are conducted. The optimizer is stochastic gradient descent with a momentum of 0.9. In this experiment, 10-fold cross-validation was used: the 8732 samples were first shuffled and then divided into nine folds of 875 samples and one fold of 857 samples. The mean accuracy over the 10 folds is 95.7%, and the accuracy of the optimal model reaches 96.07%. Table 6 compares the two-stream CNN model with other models on the UrbanSound8K dataset. The proposed model is superior to most other models; for example, its recognition accuracy is higher than that of Boddapati et al. [4], who used spectrogram, MFCC and CRP combined features on GoogLeNet. Typical experimental data are analysed below. As seen from Table 7, the gunshot, engine idling and siren sounds perform best, with recognition accuracies over 97.5%, while children playing and street music have lower recognition rates. We can roughly conclude that sounds with larger autocorrelation coefficients and RMS (root mean square) values achieve better recognition accuracy; that is, environmental sound events such as gunshot, siren and engine idling are more recognizable due to their obvious periodicity and large amplitude fluctuations. Figures 6 and 7 show a typical confusion matrix and training curve, respectively. Due to the large learning rate used in the first 20 epochs, the loss oscillates while decreasing rapidly; after the 20th epoch, the learning rate is reduced, and the model becomes stable and seeks the optimal solution.
The training curves of the accuracy and loss of the proposed two-stream CNN on the UrbanSound8K dataset.
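For reference, the learning-rate schedule stated above can be written as a Keras callback. This is a minimal sketch under the stated settings (initial rate 0.01, SGD with momentum 0.9, 110 epochs); the model and data pipeline are assumed.

from tensorflow.keras.callbacks import LearningRateScheduler
from tensorflow.keras.optimizers import SGD

def lr_schedule(epoch, lr):
    # x0.1 at the 20th and 80th epochs, x0.5 at the 50th epoch.
    if epoch in (20, 80):
        return lr * 0.1
    if epoch == 50:
        return lr * 0.5
    return lr

optimizer = SGD(learning_rate=0.01, momentum=0.9)
# model.compile(optimizer=optimizer, loss='categorical_crossentropy',
#               metrics=['accuracy'])
# model.fit(x_train, y_train, epochs=110,
#           callbacks=[LearningRateScheduler(lr_schedule)])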
Conclusion
In this paper, a two-stream CNN model combining the RACNN and LMCNN is proposed. The two streams use the raw audio data and the logmel matrix as input, respectively, so that the time-frequency characteristics of environmental sound signals can be fully extracted. In terms of data pre-processing, this paper proposes a random-padding method to patch uneven data samples, greatly increasing the data available for the experiments; the comparison with the zero-padding method confirms the advantage of random-padding. In terms of network structure, a pre-emphasis module is added to the convolutional part of the RACNN, improving the network through a better SNR of the signal. Finally, according to the experimental results, a high recognition accuracy of 95.7% is achieved on the 10-fold UrbanSound8K dataset. In future work, other newly developed models, e.g., [24] and [29], will be considered to explore better recognition methods in the field of ESC.