A New Deep CNN Model for Environmental Sound Classification

Cognitive prediction in complicated and dynamic environments plays an important role in artificial learning. The classification accuracy of sound events depends strongly on feature extraction. In this paper, deep features are used for the environmental sound classification (ESC) problem. The deep features are extracted from the fully connected layers of a newly developed Convolutional Neural Network (CNN) model, which is trained in an end-to-end fashion on spectrogram images. The feature vector is constituted by concatenating the outputs of the fully connected layers of the proposed CNN model. To test the performance of the proposed method, the feature set is fed as input to a random subspace K-Nearest Neighbor (KNN) ensemble classifier. Experimental studies carried out on the DCASE-2017 ASC and UrbanSound8K datasets show that the proposed CNN model achieves classification accuracies of 96.23% and 86.70%, respectively.


I. INTRODUCTION
Smart sound recognition (SSR) is a modern technique for detecting sound events that occur in real life. SSR is principally based on analyzing the human hearing system and embedding such perception capability in artificial intelligence applications [1]. Environmental sound classification (ESC) is a basic and necessary step of SSR. The key target of ESC is to exactly detect the true category of a perceived sound, such as a doorbell, horn or jackhammer. With the practical applications of SSR in audio surveillance systems, smart device applications and healthcare [2], the ESC problem has attracted considerable interest in recent times. For automatic speech recognition (ASR) [3] and music information recognition (MIR) [7], great improvements have been achieved with advances in machine learning. Because of the highly non-stationary characteristics of environmental sounds, these signals cannot be categorized as speech or music only. In other words, models constituted for ASR and MIR perform poorly when applied to ESC problems. Therefore, it is important to develop efficient machine learning algorithms for ESC problems.
ESC consists of two main parts: audio-based features and classifiers. For feature extraction, audio signals are first divided into frames with a window function, such as a Hamming or Hann window. Then, the set of features extracted from each frame is used in the training or testing process [8]. Features based on Mel filters (Mel Frequency Cepstral Coefficients (MFCC)) are commonly used in ESC with acceptable efficiency, although they were originally developed for ASR [9], [10]. Also, a notable number of studies demonstrate that concatenated features perform better than a single feature set in ESC tasks. However, simply concatenating more conventional features does not necessarily increase the classification performance. Therefore, an appropriate feature concatenation strategy is a vital part of sound classification. Artificial Neural Networks (ANN), Support Vector Machines (SVM), Hidden Markov Models (HMM) and Gaussian Mixture Models (GMM) are widely used classifiers for sound and other categories. However, these conventional classifiers are designed to detect apparent changes and therefore lack time and frequency invariance.
In recent years, deep learning (DL) models have been demonstrated to be more capable than conventional classifiers in resolving complicated classification problems. The convolutional neural network (CNN) is one of the most widely used DL models, which can overcome the aforementioned restrictions by learning parameters that include time and frequency representations [10], [11]. The CNN is designed to process data that come in the form of multiple arrays: 1D signals, such as speech and biomedical signals, and 2D arrays for images or audio spectrograms [12], [13]. The CNN model constituted by Krizhevsky et al. [22] outperformed all the conventional methods in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC-2012). This CNN model, known as AlexNet, has pioneered the other popular CNN models, such as VGGNet and ResNet. The pre-trained CNN models, which share their learned parameters, have shown good performance in almost all classification applications [4], [14], [15]. Moreover, hybrid approaches, which consist of pre-trained CNN models and conventional classifiers, have been used to improve classification performance. In [16], deep features are extracted by using a pre-trained CNN model, and the SVM and KNN algorithms are used for hyperspectral image classification. In [17], pre-trained CNN models such as AlexNet and VGG16 are utilized to extract deep features from EMG signals; the best accuracy is achieved with the SVM classifier. In [18], a new approach is proposed for brain MRI classification: the feature set is constituted by combining the AlexNet and VGG16 models with the hypercolumn technique, and the evaluation is performed by the SVM classifier. In [19], deep features are extracted from the last fully connected layer of the ResNet50 CNN model using videocapsule endoscopy (VCE) images for diagnosing celiac disease. In the classification stage, the SVM, KNN, LDA and softmax classifiers are evaluated on a dataset.
The best accuracy is achieved by the SVM classifier. However, the popular pre-trained CNN models cannot fully represent sound characteristics for feature extraction, as they are trained only with images. In addition, the large input size and the very deep network structure, which are needed for recognition of high-resolution images, may not always be required for ESC problems. In this case, a low computational cost is obtained because the number of learnable parameters decreases.
In this paper, an approach consisting of a deep feature extraction stage and a classification stage is proposed for the ESC problem. To this end, an end-to-end CNN model is constructed and trained with the spectrogram images. Thus, we obtain our own pre-trained CNN model. Then, the softmax and classification layers of the constructed CNN model are discarded, and the fully connected layers are used for feature extraction. Thus, a flexible CNN architecture is obtained in which the sizes and numbers of all layers can be freely changed by the authors. In the classification stage of the proposed study, the random subspace KNN ensemble model is used, which combines the votes of many prediction scores over subspace feature sets. The classification accuracy is used to evaluate the performance of our proposed method. We further compare the performance of the proposed method with other pre-trained CNN models and classifiers. The classification accuracies have been significantly improved by the proposed method compared to other studies on the UrbanSound8K [5] and the DCASE-2017 ASC [6] datasets.
The main contribution of this paper is a new CNN architecture proposed for ESC. The proposed CNN model is not too deep and therefore does not require much training time. In addition, the performance of the proposed new CNN model is comparable with that of the pre-trained CNN models.

II. THE METHODOLOGY
The illustration of the proposed method is shown in Fig. 1. In this method, the input sound signals are initially converted into time-frequency images by using the spectrogram method. The spectrogram parameters, such as the window type, window length and overlap size, are adjusted during the experimental works. Later, the spectrogram images are saved by using the viridis colour map and are resized to fit the input of the proposed CNN model. The proposed CNN model, which is shown in Fig. 2, is constituted of three convolution, three max-pooling and normalization, and three fully connected layers. The softmax and classification layers follow the last fully connected layer. The remaining part of the used datasets is utilized for the feature extraction and testing processes. The feature set is obtained by concatenating the outputs of the first and second fully connected layers of the proposed CNN. Finally, the performance of the proposed method is tested with the random subspace KNN ensemble, which is a robust classification algorithm.

A. SPECTROGRAM IMAGES
The spectrogram method converts a signal into a time-frequency image showing the loudness of the signal over time at the different frequencies present in the waveform. Spectrograms also show how energy levels vary over time. The spectrogram of an input signal can be described as the square of the Short Time Fourier Transform (STFT) magnitude. The STFT formulation is given as follows:

STFT{x}(n, k) = Σ_i x(i) ω(i − n) e^(−j2πki/N)     (1)

where x(i) is the input signal and ω(i − n) is a window function, such as a Hamming or Hann window, generally centered at time n. Then, the spectrogram images are saved via the viridis colour map, which is a homogeneous colour map changing from blue to green to yellow [20], [21].
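The squared-magnitude STFT described above can be sketched in NumPy; the window length, hop size and test tone below are illustrative choices, not the paper's experimental settings:

```python
import numpy as np

def spectrogram(x, win_len=256, hop=64):
    """Square-magnitude STFT with a Hamming window (minimal sketch)."""
    w = np.hamming(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    # Slice overlapping frames, apply the window, take |FFT|^2
    frames = np.stack([x[i*hop : i*hop + win_len] * w for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return spec.T                       # rows: frequency bins, cols: time frames

# One second of a 440 Hz tone sampled at 8 kHz
fs = 8000
t = np.arange(fs) / fs
S = spectrogram(np.sin(2 * np.pi * 440 * t))
```

For display, such a matrix would typically be log-scaled and rendered with the viridis colour map before being resized to the CNN input dimensions.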

B. CNN LAYERS
The CNN is designed to process multidimensional data, e.g., a colour image composed of three 2D arrays containing pixel intensities in the three colour channels. CNNs exploit the properties of natural signals through four key ideas: shared weights, local connections, pooling and the use of many layers [22], [23]. The convolutional layer, ReLU layer and pooling layer are the most commonly used CNN layers. The basic aim of the convolutional layers is to determine local connections of features from the previous layers and map their information to particular feature maps. The convolution of the input I with filter F (F ∈ R^((2a1+1)×(2a2+1))) is given as follows:

(I ∗ F)(x, y) = Σ_{i=−a1}^{a1} Σ_{j=−a2}^{a2} F(i, j) I(x − i, y − j)     (2)
ReLU (g(z) = max(0, z)), which is a non-linear activation function, is applied to the feature maps created by the convolutional layers. The task of the max-pooling layers is to combine similar features conveyed from the previous layer. The max-pooling layers realize a down-sampling operation by calculating the maximum value of the field on the feature map overlapping with the filter [23].
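The convolution, ReLU and max-pooling operations described above can be illustrated with a minimal NumPy sketch. As in most deep learning frameworks, cross-correlation is used in place of flipped-kernel convolution; the image and filter values are toy examples:

```python
import numpy as np

def conv2d_valid(I, F):
    """2-D 'valid' sliding-window product of image I with filter F."""
    h, w = F.shape
    H, W = I.shape
    out = np.zeros((H - h + 1, W - w + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(I[y:y+h, x:x+w] * F)
    return out

def relu(z):
    """Element-wise non-linearity g(z) = max(0, z)."""
    return np.maximum(0, z)

def max_pool(A, k=2, stride=2):
    """Down-sample by taking the maximum of each k x k block."""
    H, W = A.shape
    return np.array([[A[y:y+k, x:x+k].max()
                      for x in range(0, W - k + 1, stride)]
                     for y in range(0, H - k + 1, stride)])

I = np.arange(16, dtype=float).reshape(4, 4)   # toy 4x4 "image"
F = np.array([[-1., 0.], [0., 1.]])            # toy 2x2 diagonal-difference filter
fm = relu(conv2d_valid(I, F))                  # 3x3 feature map
p = max_pool(fm)                               # pooled to 1x1
```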
The CNN structure from the fully connected (fc) layer to the classification layer is in general similar to the multi-layer perceptron neural network (MLP). The task of the fc layers is the same as that of the hidden layers in the MLP. One or more fc layers can be present in a CNN structure. An fc layer connects each neuron in the previous layer to each neuron in the next layer.
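An fc layer reduces to a matrix-vector product plus a bias. The sketch below uses illustrative sizes, not the proposed model's layer dimensions:

```python
import numpy as np

rng = np.random.default_rng(1)

def fc(x, W, b):
    """Fully connected layer: every output neuron sees every input neuron."""
    return W @ x + b

x = rng.normal(size=4)         # flattened feature map (illustrative size)
W = rng.normal(size=(3, 4))    # 3 output neurons, each connected to all 4 inputs
b = np.zeros(3)
out = fc(x, W, b)
```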
The softmax function is generally utilized in CNNs to map the non-normalized values of the previous layer to a probability distribution over the predicted class scores [24]:

σ(x_i) = e^(x_i) / Σ_j e^(x_j)     (3)

where σ(x_i) is the softmax output for each x_i, and x_j represents the values of the input vector.
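A numerically stable form of the softmax function above subtracts the maximum input before exponentiating, which leaves the result unchanged; the input scores below are illustrative:

```python
import numpy as np

def softmax(x):
    """Map raw scores to a probability distribution (stable version)."""
    e = np.exp(x - np.max(x))   # shift by max to avoid overflow
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
p = softmax(scores)
```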
The batch normalization layers are used to decrease the training time of CNNs and the sensitivity to network initialization [27]. Therefore, this layer is chosen for the normalization process in the proposed CNN architecture. The normalized activations, with input (x_i), mini-batch mean (m_b) and mini-batch variance (v_b) variables, are computed as

x̂_i = (x_i − m_b) / √(v_b + ε)     (4)

where ε is a constant that improves numerical stability in case v_b is very small. The m_b and v_b calculations are shown in equations (5) and (6), respectively.

m_b = (1/m) Σ_{i=1}^{m} x_i     (5)

v_b = (1/m) Σ_{i=1}^{m} (x_i − m_b)²     (6)

Finally, the activations in the batch normalization layer are concluded with a scale and shift operation as

y_i = a x̂_i + b     (7)

where a and b are scale and shift factors, respectively. These factors are learnable variables that are updated to the most appropriate values during the training process.
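The normalization, scale and shift steps above can be sketched for a single mini-batch of scalar activations; ε and the toy batch values are illustrative:

```python
import numpy as np

def batch_norm(x, a=1.0, b=0.0, eps=1e-5):
    """Normalize a mini-batch by its mean/variance, then scale (a) and shift (b)."""
    m_b = x.mean()                          # mini-batch mean
    v_b = x.var()                           # mini-batch variance
    x_hat = (x - m_b) / np.sqrt(v_b + eps)  # normalized activations
    return a * x_hat + b                    # learnable scale and shift

batch = np.array([1.0, 2.0, 3.0, 4.0])
y = batch_norm(batch)
```

After normalization the batch has (approximately) zero mean and unit variance, which is what makes training less sensitive to initialization.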

C. DEEP FEATURE EXTRACTION WITH THE PROPOSED CNN MODEL
The feature extraction process using pre-trained CNN models is called deep feature extraction in the literature [16], [25], [26]. For deep feature extraction, the fc layers of the pre-trained CNN models are used. In this paper, instead of pre-trained CNN models such as VGGNet and AlexNet, the fc layers of the proposed CNN are utilized for deep feature extraction. The layer numbers of the proposed CNN, AlexNet, VGG16, VGG19 and ResNet-50 are given in Tab. 1.

D. KNN ENSEMBLES WITH RANDOM SUBSPACE METHOD
The random subspace method uses random subspace ensembles to boost the classification accuracy of k-nearest neighbor (KNN) classifiers. The method is based on a stochastic operation that randomly chooses a number of components of the learning model when creating each classifier [28]. In this method, the training dataset is sub-divided into random subspaces, and distance calculations such as Euclidean and Chebyshev are performed using the test samples on the training set constituted by the random subspaces. According to the number of nearest neighbors (K), the most appropriate subspace class membership is determined by the distance and majority voting [29]. Then, the class memberships coming from each subspace ensemble are assembled in a class vector (C). The classification is achieved with the highest average score in C. The basic random subspace method implements the following steps:
• Step 1: Select, without replacement, a random subset b of size M from the predictors in the training dataset (M < d).
• Step 2: Train a KNN learner using only the selected predictors b.
• Step 3: Repeat Steps 1 and 2 until the ensemble contains L learners.
Here, d is the number of predictors in the training dataset, b is the selected subspace of predictors, M is the length of b, and L is the number of learners in the ensemble. In Fig. 3, a representation of the random subspace ensemble method is shown for the KNN classifier.
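The steps above can be sketched as a pure-NumPy 1-NN ensemble; the toy two-cluster data, subspace size and learner count below are illustrative, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_subspace_knn(X, y, n_learners=10, subspace=2):
    """Each learner keeps the training set restricted to a random feature subset."""
    ensemble = []
    for _ in range(n_learners):
        feats = rng.choice(X.shape[1], size=subspace, replace=False)  # Step 1
        ensemble.append((feats, X[:, feats], y))                      # Step 2 (1-NN is lazy)
    return ensemble

def predict(ensemble, x):
    """1-NN vote in every subspace, then majority vote over learners."""
    votes = []
    for feats, Xs, y in ensemble:
        d = np.linalg.norm(Xs - x[feats], axis=1)  # Euclidean distance
        votes.append(y[np.argmin(d)])
    vals, counts = np.unique(votes, return_counts=True)
    return vals[np.argmax(counts)]

# Two well-separated 4-D clusters (toy data, not the paper's features)
X = np.vstack([rng.normal(0, 0.3, (20, 4)), rng.normal(5, 0.3, (20, 4))])
y = np.array([0] * 20 + [1] * 20)
model = fit_subspace_knn(X, y)
pred = predict(model, np.full(4, 5.0))
```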

III. EXPERIMENTAL WORKS

A. DATASETS
In this work, two popular datasets are considered to evaluate the ESC problem. The UrbanSound8K dataset is organized with ten class labels: air conditioner, car horn, children playing, dog bark, drilling, engine idling, gun shot, jackhammer, siren, and street music. The dataset contains 8732 audio files; the duration of each file is up to 4 seconds, and the audio files are recorded at a 22.05 kHz sampling frequency. Also, the record lengths of the audio files and the number of files in each class are not the same. The DCASE-2017 ASC dataset is constituted of two parts: the development dataset with 4680 audio files and the evaluation dataset with 1620 audio files. The duration of each audio file is 10 seconds. The number of files in each class is balanced, and all audio files are recorded at a 44.1 kHz sampling frequency. The dataset contains fifteen classes, whose labels are beach, bus, cafe/restaurant, car, city center, forest path, grocery store, home, library, metro station, office, park, residential area, train, and tram. The performances in the DCASE-2017 challenge have been ranked according to the classification accuracy on the evaluation data.

B. EVALUATION METHOD AND CRITERIA
The development and evaluation datasets of the DCASE-2017 ASC dataset are used for the proposed CNN training and evaluation processes, respectively. On the other hand, the UrbanSound8K dataset is randomly divided: 90% of the full dataset is used for the proposed CNN training process, and the evaluation process is performed with the remaining part. The classification performance on the UrbanSound8K dataset is tested with 10-fold cross-validation. The evaluation criteria consist of accuracy, specificity, sensitivity, precision, and F-score. These criteria are computed using the confusion matrix values (true positives TP, true negatives TN, false positives FP, and false negatives FN) as given in the following equations:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)
Precision = TP / (TP + FP)
F-score = 2 × (Precision × Sensitivity) / (Precision + Sensitivity)
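These criteria can be computed directly from the confusion-matrix counts; the counts in the example below are illustrative, not results from the paper:

```python
def metrics(tp, tn, fp, fn):
    """Per-class scores from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)   # accuracy
    sens = tp / (tp + fn)                   # sensitivity (recall)
    spec = tn / (tn + fp)                   # specificity
    prec = tp / (tp + fp)                   # precision
    f1 = 2 * prec * sens / (prec + sens)    # F-score
    return acc, sens, spec, prec, f1

acc, sens, spec, prec, f1 = metrics(tp=40, tn=50, fp=5, fn=5)
```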

C. EXPERIMENTAL SETUP AND RESULTS
As mentioned earlier, the spectrogram method was applied to all the audio signals to convert them into time-frequency images. The window size, window type, overlap and FFT size parameters of the spectrogram method were chosen as 1024, Hamming, 256 and 3000, respectively. These spectrogram parameters were selected according to the optimum resolution of the spectrogram images. The dimensions of the spectrogram images were 875 × 656 × 3, and the images were then re-sized to 100 × 100 × 3 for the input of the proposed CNN model. The re-sized spectrogram images were fed into the proposed CNN model. The dimensional parameters of the proposed CNN layers are shown in Fig. 4. For example, the filter size and the filter number in the first convolutional layer were assigned as 3 × 3 and 8, respectively, and the pixel block size and the stride were selected as 2 × 2 and 2 for the max-pooling layers. The nearest neighbor number k and the size of the subspace feature vector f are the most important parameters of the random subspace KNN ensembles. According to the experiments in [28], the best performances are obtained for k = 1 and f = 64.
In Tabs. 5 and 6, the proposed method is compared with the pre-trained CNN models and the other classifiers for both datasets. The obtained results show that the classification accuracy of the proposed method is better than that of the other CNN model-classifier structures.
The average classification accuracies for the DCASE-2017 ASC and UrbanSound8K datasets have been increased by 15% and 9.6%, respectively, compared to the next best CNN model-classifier structure. The other performance criteria, including sensitivity, specificity, precision and F-score, are separately given in Tabs. 8 and 9 for each class of both datasets. The average scores of the sensitivity, specificity, precision, and F-score for the DCASE-2017 ASC dataset are 0.9623, 0.9973, 0.9626, and 0.9623, respectively. The same scores for the UrbanSound8K dataset are 0.8672, 0.9852, 0.8682 and 0.8675, respectively. In Figs. 7 and 8, the TP, TN, FP, and FN values for each class in both datasets are shown on the confusion matrices. In Tabs. 10 and 11, the proposed method is compared with other methods using the same datasets. The first ten works achieving the best classification accuracy in the DCASE-2017 ASC challenge are used for the comparison. The average classification accuracy of the proposed method has been boosted by 12.93% compared to the best challenge score [10]. In addition, the best classification accuracy has been achieved in 13 out of 15 classes, with a significant difference in most. For the UrbanSound8K dataset, the average classification accuracy has been improved by 8.05% compared to the best score [8] among the other methods, and the best classification accuracy has been achieved in 8 out of 10 classes. For the UrbanSound8K dataset, the 5-fold cross-validation test is also applied, and the obtained result is given in Tab. 7. As seen in Tab. 7, when the 5-fold cross-validation test is used in the evaluation of the proposed method, an 84.20% average accuracy score is obtained, whereas this score is 86.70% for 10-fold cross-validation.
From this comparison, it is observed that an increase in the fold number causes an increase in the accuracy score. It is also worth mentioning that a smaller amount of training data leads to lower performance, as is commonly the case in pattern recognition problems.

IV. CONCLUSION
In this paper, a new CNN model was developed and trained in an end-to-end fashion in order to produce deep feature vectors for efficient classification of environmental sounds. The developed CNN model consisted of three convolution, three max-pooling and normalization, and three fully connected layers. The softmax and classification layers followed the last fully connected layer. The proposed new CNN model was quite effective in terms of both classification accuracy and running time. After training the proposed new CNN model, instead of using the softmax and classification layers, we opted to use deep feature extraction. These deep features were then used as input to the random subspace K-Nearest Neighbor (KNN) ensemble classifier. This classifier was chosen due to its robustness across various datasets. The DCASE-2017 ASC and UrbanSound8K datasets were considered in the experimental works, and the classification accuracies were calculated for performance evaluation. The obtained results show that the proposed CNN model and the subsequent deep features were quite successful in the characterization of environmental sounds. The performance of the proposed method was also compared with state-of-the-art results. The comparison results showed that the proposed method outperformed all compared methods.