An Audio Data Representation for Traffic Acoustic Scene Recognition

Acoustic scene recognition (ASR), the task of recognizing an acoustic environment from an audio recording of the scene, has a wide range of applications, e.g. robotic navigation and audio forensics. However, ASR remains challenging, mainly because audio data are difficult to represent. In this article, we focus on traffic acoustic data. Traffic acoustic scene recognition provides information complementary to the visual information of the scene; for example, it can be used to verify the visual perception result. Because acoustic analysis and recognition are simple and convenient, they can effectively enhance a perception system that otherwise relies only on visual information. We propose an audio data representation method to improve traffic acoustic scene recognition accuracy. The proposed method employs the constant Q transform (CQT) and the histogram of oriented gradients (HOG) to transform one-dimensional audio signals into a time-frequency representation. We also propose two data representation mechanisms, called global and local feature selection, to select features that describe the shape of time-frequency structures. Finally, we exploit the least absolute shrinkage and selection operator (LASSO) to further improve the recognition accuracy by selecting the most representative information for recognition. We conducted extensive experiments, and the results show that the proposed method is effective, significantly outperforming the state-of-the-art methods.


I. INTRODUCTION
Traffic acoustic scene recognition (TASR) is a fundamental task that increases awareness of driving circumstances [1]. TASR may serve as a promising complementary technique for designing safe and reliable automatic driving systems [2], especially when visual data are temporarily unavailable at blind spots. Although acoustic scene recognition has received considerable research attention and has a few successful real-world applications, e.g. robotic navigation [3] and audio forensics [4], existing ASR methods suffer from insufficient recognition accuracy [5]. TASR is more challenging still, since road scenarios are complex, highly variable, and uncontrollable.
The associate editor coordinating the review of this manuscript and approving it for publication was Jenny Mahoney.
The limitation of existing methods may stem from audio representations that are not informative enough to yield a high recognition accuracy [6]. Existing studies represent audio data mainly according to two criteria: time and frequency. Time-based methods [4], [7], [8] represent audio data mainly by waveform analysis, linear prediction coefficients, the zero-crossing rate, the spectral centroid, etc. Frequency-based methods represent audio data by integrating the magnitude spectrum or the power spectrum over specified frequency bands [9]-[12]. The resulting coefficients measure the amount of energy present within different sub-bands and can also be expressed as a ratio between the sub-band energy and the total energy to highlight the most prominent frequency regions in the signal. The latest techniques in this line of study analyze nonlinear signals in order to mimic the response of the human perceptual system [13]-[18]. However, how best to represent audio data is not straightforward. Hence, instead of further developing the aforementioned methods, a few recent studies leverage unsupervised learning techniques to represent acoustic data by learning from the data distribution. For example, Moragues et al. [19] adopt the sparse restricted Boltzmann machine (SRBM) to learn an audio representation from mel-frequency cepstral coefficients (MFCCs). An SRBM is a neural network that can learn basic maps from input data, similar to those constructed by visual receptors in the human brain [20]. For acoustic scene recognition, the SRBM can adaptively refine the basic attributes of the signal spectrum, and its activation function can be used to determine the periods in which the important acoustic content is contained.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Although learning audio representations directly from the data is theoretically promising, in practice a method that accurately incorporates domain knowledge into the data representation often yields far higher classification performance, especially for acoustic scene recognition. For example, Rakotomamonjy and Gasso [6] proposed to incorporate domain knowledge by employing the constant Q transform (CQT) and the histogram of oriented gradients (HOG) for audio data representation. However, this representation was motivated by the characteristics of image processing. Their feature extraction algorithm for acoustic scene recognition consists of the following operations. First, a time-frequency representation is computed for the audio signals in the training set, with the frequency axis on a logarithmic scale. Second, by interpolating neighboring time-frequency bins, the constant Q representation is converted to a gray image of 512 × 512 pixels. Finally, features are extracted from the image by computing the matrix of local gradient histograms. This method introduced a new idea and opened new horizons for acoustic scene recognition. Although it has had a substantial impact on the research field, its performance remains to be improved, mainly because it is difficult to apply the HOG algorithm directly to audio data owing to the dimension issue (please refer to [6] for more details). As a result, this method cannot describe slight fluctuations and ignores much information that is necessary for the recognition task.
To overcome this limitation, Ye et al. [21] suggested increasing the dimension of the feature vector. Dessein et al. [22] demonstrated that combining the CQT with nonnegative matrix factorization (NMF) further improves the recognition accuracy over [6]. Similarly, Bisot et al. [3] verified that combining HOG with sub-band power distribution (SPD) techniques is also effective. Although these methods consider the subtle cues in the fluctuations, they also increase the dimension of the feature vectors. For sample processing, Phan et al. [23] proposed a sliding window method to address the high dimensionality of the samples. For a 30 s audio sample, a sliding window with a frame length of 500 ms is moved over the sample with a step length of 250 ms, so that consecutive frames overlap by 50%. This method treats every frame, rather than the entire 30 s segment, as a sample. It has been reported to achieve better results than the original HOG method, but it performs poorly on different audio scenes that share some of the same sound events (such as a cafe and a quiet street, which contain the same silent snippets).
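The framing scheme described above (500 ms frames, 250 ms step, 30 s clip) can be illustrated with a minimal sketch. It is not the implementation of [23]; the function name `frame_bounds` and the choice to discard a trailing partial frame are assumptions made for this example.

```python
def frame_bounds(clip_ms: int, frame_ms: int = 500, hop_ms: int = 250) -> list[tuple[int, int]]:
    """Return (start, end) times in milliseconds of every full frame in a clip.

    Assumption for this sketch: a trailing partial frame is discarded.
    """
    if frame_ms <= 0 or hop_ms <= 0:
        raise ValueError("frame_ms and hop_ms must be positive")
    bounds = []
    start = 0
    while start + frame_ms <= clip_ms:
        bounds.append((start, start + frame_ms))
        start += hop_ms
    return bounds


# A 30 s clip framed with the values quoted in the text.
frames = frame_bounds(30_000)
overlap_ms = frames[0][1] - frames[1][0]  # shared span of two consecutive frames
print(len(frames), frames[:2], overlap_ms)
```

Under these assumptions, one 30 s clip yields 119 frames, each sharing 250 ms (50%) with its successor, which is why per-frame classification multiplies the number of samples while shortening each one.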
In this article, we present a novel and effective method to represent audio data for TASR by capturing the global and local expressions of HOG. The proposed method is able to describe the fluctuations of HOG and to capture more of the information necessary for recognition, both of which significantly enhance the classification accuracy. In particular, the characteristics of data fluctuations are extracted along the timeline, and hence the fused features better represent the HOG descriptor, leading to higher recognition accuracy. Furthermore, we develop a feature selection algorithm based on the least absolute shrinkage and selection operator (LASSO) [24] to select representative characteristics from the high-dimensional original feature space, which effectively addresses the limitation of the seminal method [6]. We conducted extensive experiments, and the results show that the proposed method is effective, significantly outperforming the state-of-the-art methods.
The remainder of this article is organized as follows. We detail the global and local representation of traffic acoustic scenes in Section II. In Section III, we first introduce a dataset that contains a large number of traffic acoustic scenes, and then conduct extensive experiments to evaluate the recognition performance of the acoustic scene representation. Finally, the concluding section draws some general implications and points to what remains.
Fig. 1 illustrates the overview of the proposed framework, which takes an audio signal as input and outputs the recognition result. Given an audio signal, the CQT is first adopted to convert it to a CQT spectrogram. Then, a series of HOG descriptors are defined to describe the variations of cumulative oriented gradients in both dimensions for all local regions of the CQT spectrogram. Both local and global features are then extracted to enrich the spectral-temporal representation. Because this step yields a large number of high-dimensional features, we employ a feature selection algorithm based on the least absolute shrinkage and selection operator (LASSO) [25] to select the most representative features. Finally, the traffic acoustic scene is recognized by integrating the recognition results of different models.

A. FROM AUDIO TO CQT SPECTROGRAM
The use of the constant Q transform (CQT) for traffic acoustic scenes is motivated by the human auditory system. As the output of the transform is effectively amplitude/phase against log frequency, fewer frequency bins are required to cover a given range effectively, which proves useful where frequencies span several octaves. As the range of human hearing covers approximately ten octaves, from 20 Hz to around 20 kHz, this reduction in output data is significant. Since acoustic scenes are difficult, if not impossible, to classify in the time domain, we convert the audio clips from the time domain to the frequency domain with the CQT, which converts a data series into the frequency domain [26]. The CQT is related to the Fourier transform [27] and to the complex Morlet wavelet transform. For acoustic signals in the time domain, we define a quality factor $Q$. The transform can be thought of as a series of logarithmically spaced filters $f_k$. The $k$th filter has a spectral width $\delta f_k$ equal to a multiple of the previous filter's width:
$$\delta f_k = 2^{1/n} \cdot \delta f_{k-1} = \left(2^{1/n}\right)^{k} \cdot \delta f_{\min},$$
where $\delta f_k$ is the bandwidth of the $k$th filter, $\delta f_{\min}$ is the bandwidth of the lowest filter, whose central frequency is $f_{\min}$, and $n$ is the number of filters per octave. The transform exhibits a reduction in frequency resolution at higher frequency bins, which is desirable for traffic acoustic scenes. It mirrors the human auditory system, whereby spectral resolution is better at lower frequencies, whereas temporal resolution improves at higher frequencies. In other words, for low-frequency waves the bandwidth is small, and a higher frequency resolution is used to separate similar notes. For high-frequency waves the bandwidth is larger, and a higher temporal resolution is used to track rapidly changing overtones. In this regard, the CQT is suitable for representing acoustic signals in noisy traffic scenes.
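To make the geometric spacing of the filters concrete, the following minimal sketch lists center frequencies and bandwidths for a constant Q filter bank. It assumes the common definition $Q = f_k / \delta f_k = 1/(2^{1/n} - 1)$; the function name `cqt_filters` and the parameter values (20 Hz to 20 kHz, 12 filters per octave) are illustrative choices, not settings taken from this article.

```python
import math


def cqt_filters(f_min: float, f_max: float, bins_per_octave: int):
    """Return Q and a list of (center_frequency, bandwidth) pairs.

    Neighbouring center frequencies differ by the factor 2**(1/n), and each
    bandwidth is f_k / Q, so the bandwidths grow by the same factor.
    """
    ratio = 2.0 ** (1.0 / bins_per_octave)
    q = 1.0 / (ratio - 1.0)  # assumed definition: Q = f_k / delta_f_k
    n_bins = int(math.floor(bins_per_octave * math.log2(f_max / f_min))) + 1
    filters = []
    for k in range(n_bins):
        f_k = f_min * ratio ** k
        filters.append((f_k, f_k / q))
    return q, filters


# Illustrative values only: the audible range with 12 filters per octave.
q, filters = cqt_filters(20.0, 20_000.0, 12)
print(f"Q = {q:.2f}, {len(filters)} filters")
print("lowest :", filters[0])
print("highest:", filters[-1])
```

With these illustrative values, 120 filters cover roughly ten octaves; the lowest filters are only about 1 Hz wide, while the highest are more than 1 kHz wide, which is the resolution trade-off described above.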

B. CQT SPECTROGRAM AND HOG DESCRIPTOR
After obtaining the CQT spectrogram with the CQT algorithm, we scale it to 512 × 512 pixels. Although the resolution is not required to be the same as that of Rakotomamonjy and Gasso [6], we aim to demonstrate that the performance gain comes from the proposed method rather than from a change in the resolution of the CQT spectrogram. Then, we extract histogram of oriented gradients (HOG) features from the rescaled CQT spectrogram to analyze the direction information of the local HOG descriptors in the time-frequency representation (TFR). HOG was originally used to describe pedestrian characteristics through gradient histograms. First, a gradient histogram is computed in each small local area (cell). Then, the histograms of multiple cells are connected to form a local HOG descriptor. The local HOG descriptors are treated as feature vectors, which are flattened into a final feature vector. The dimension of this final vector depends on the number of bins in the histogram. In this article, interference factors such as illumination are not involved, so normalization of the input image or the color space is omitted. In addition, by adjusting the number of bins, i.e., gradient directions, we can construct a corresponding series of HOG feature descriptors.
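The per-cell histogram computation can be sketched in a few lines of NumPy. This is a simplified illustration, not the VLFeat implementation used in the experiments: it assumes unsigned orientations, hard assignment of each pixel to one bin, and no block normalization, and the function name `cell_hog` is hypothetical.

```python
import numpy as np


def cell_hog(image: np.ndarray, cell: int = 8, n_bins: int = 8) -> np.ndarray:
    """Magnitude-weighted orientation histogram for every cell of a 2-D image.

    Returns an array of shape (rows // cell, cols // cell, n_bins).
    """
    gy, gx = np.gradient(image.astype(float))  # derivatives along rows and columns
    magnitude = np.hypot(gx, gy)
    angle = np.mod(np.arctan2(gy, gx), np.pi)  # unsigned orientation in [0, pi)
    bin_idx = np.minimum((angle / np.pi * n_bins).astype(int), n_bins - 1)

    rows, cols = image.shape[0] // cell, image.shape[1] // cell
    hist = np.zeros((rows, cols, n_bins))
    for r in range(rows):
        for c in range(cols):
            window = (slice(r * cell, (r + 1) * cell), slice(c * cell, (c + 1) * cell))
            hist[r, c] = np.bincount(
                bin_idx[window].ravel(),
                weights=magnitude[window].ravel(),
                minlength=n_bins,
            )
    return hist


# A random 512 x 512 array stands in for a rescaled CQT spectrogram.
spectrogram = np.random.default_rng(0).random((512, 512))
features = cell_hog(spectrogram).ravel()
print(features.shape)  # (32768,)
```

Flattening the 64 × 64 grid of 8-bin histograms gives the 32768-dimensional vector, so changing `cell` or `n_bins` directly changes the final feature dimension.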

C. GLOBAL AND LOCAL FEATURE EXTRACTION
The flowchart for feature extraction is detailed in Fig. 2. In Fig. 2, the first unit is the original CQT spectrogram. The following five parts illustrate the feature extraction procedure. First, we transform the CQT spectrogram into HOG descriptors, i.e., from unit 1 to unit 2 in Fig. 2. Next, we transform the HOG descriptors into a list of statistical histograms. To clarify, we take the cell at the bottom right corner of unit 1 as an example. The cell of size 8 × 8 is transformed into a list of HOG vectors in the bottom row of unit 2, which are magnified in unit 3. We find that the gradient in the same direction is similar within the same class, but varies between classes. In this regard, we take the corresponding … In this way, the maximum gradient along the timeline is well described. Furthermore, we also present a complement that records the direction of the maximum gradient as an extended version of the LFE, named LFEx. Similar to Eq. 2, the LFEx averages the index of the maximum gradient as in Eq. 3. Then, the LFEx feature is a concatenation of g t and g t . (3)
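The following sketch shows one possible reading of the LFE and LFEx descriptions, and should not be taken as the definitions in Eq. 2 and Eq. 3. It assumes that, for each time step, the largest histogram value of every cell is averaged over the frequency axis (LFE), and that the index of that largest value is averaged in the same way and appended (LFEx). The array layout `(freq_cells, time_cells, n_bins)` and the function names are hypothetical.

```python
import numpy as np


def lfe(hog: np.ndarray) -> np.ndarray:
    """Assumed LFE: mean over frequency of each cell's strongest gradient bin.

    hog has shape (freq_cells, time_cells, n_bins); the result has one value
    per time step.
    """
    return hog.max(axis=2).mean(axis=0)


def lfex(hog: np.ndarray) -> np.ndarray:
    """Assumed LFEx: LFE concatenated with the averaged index of the strongest bin."""
    mean_index = hog.argmax(axis=2).mean(axis=0)
    return np.concatenate([lfe(hog), mean_index])


# Toy descriptor: 4 frequency cells, 3 time steps, 8 orientation bins.
toy = np.zeros((4, 3, 8))
toy[:, 0, 2] = 5.0  # at t = 0 the strongest bin is index 2
toy[:, 1, 7] = 1.0  # at t = 1 the strongest bin is index 7
toy[:, 2, 0] = 3.0  # at t = 2 the strongest bin is index 0
print(lfe(toy))
print(lfex(toy))
```

Under this reading, a 64 × 64 grid of cells gives a 64-dimensional LFE vector and a 128-dimensional LFEx vector, one or two values per time step.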

D. FEATURE SELECTION AND CLASSIFICATION
After the HOG is computed on the spectrogram, a representation composed of histograms is constructed for all cells. A higher resolution of the CQT spectrogram retains more subtle cues for subsequent processing. However, if we concatenate all these histograms into a final feature vector, its dimension becomes too large. Furthermore, although the details are well retained, the features extracted from the CQT spectrogram may contain irrelevant features and even noise. For example, with cells of 8 × 8 pixels, the total number of cells is $64^2$, and each cell has 8 gradient orientations. This results in a feature vector of dimension $64^2 \times 8 = 32768$. The dimension increases further if we reduce the cell size or increase the number of orientations when calculating the histogram. Moreover, the noise in the features may degrade the recognition results. It is therefore essential to reduce the dimension of the final feature vector and eliminate the unrelated features. The least absolute shrinkage and selection operator (LASSO) is a popular tool for sparse linear regression; it improves the prediction accuracy and interpretability of statistical models by performing feature selection and regularization simultaneously [24]. The key idea of LASSO is to minimize the residual sum of squares subject to the sum of the absolute values of the coefficients being less than a certain constant. Because of this constraint, LASSO tends to produce some coefficients that are exactly zero, and it retains the good features of both subset selection and ridge regression. The $j$th feature extracted from the $i$th sample is denoted as $x_j^{(i)}$, the corresponding label as $y^{(i)}$, and the weighting parameters as $\beta = (\beta_0, \ldots, \beta_p)$. Then, we employ LASSO to estimate the weighting coefficients $\hat{\beta}$:
$$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{N} \Big( y^{(i)} - \beta_0 - \sum_{j=1}^{p} \beta_j x_j^{(i)} \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le t,$$
where $N$ is the number of samples, and $p$ is the dimension of the feature. The parameter $t > 0$ controls the amount of shrinkage that is applied to the estimate of the weighting parameters.
Thus, $t$ plays a key role in shrinking the weighting coefficients $\beta_j$ by forcing some of the elements of $\beta$ to be 0. Such an operation not only helps to reduce overfitting, but also selects discriminative features. Instead of setting a hard hyperparameter $t$, an $L_1$-penalized formulation simplifies the cost function:
$$\hat{\beta} = \arg\min_{\beta} \Big\{ \sum_{i=1}^{N} \Big( y^{(i)} - \beta_0 - \sum_{j=1}^{p} \beta_j x_j^{(i)} \Big)^2 + \alpha \sum_{j=1}^{p} |\beta_j| \Big\}.$$
By increasing the soft hyperparameter $\alpha$, the regularization strength is increased, and the weights are shrunk.
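The way the $L_1$ penalty drives coefficients to exactly zero can be illustrated with a small coordinate-descent solver. This is a didactic sketch rather than the solver used in the experiments; it scales the squared error by $1/(2N)$, which only rescales $\alpha$, and the names `soft_threshold` and `lasso_cd` as well as the synthetic data are assumptions made for the example.

```python
import numpy as np


def soft_threshold(z: float, a: float) -> float:
    """Shrink z towards zero by a; values with |z| <= a become exactly 0."""
    return float(np.sign(z) * max(abs(z) - a, 0.0))


def lasso_cd(X: np.ndarray, y: np.ndarray, alpha: float, n_iter: int = 200):
    """Minimise (1/(2N)) * ||y - b0 - X b||^2 + alpha * ||b||_1 by coordinate descent."""
    n_samples, n_features = X.shape
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean  # centring handles the intercept
    col_sq = (Xc ** 2).sum(axis=0) / n_samples
    beta = np.zeros(n_features)
    residual = yc.copy()
    for _ in range(n_iter):
        for j in range(n_features):
            if col_sq[j] == 0.0:
                continue
            residual += Xc[:, j] * beta[j]  # remove feature j from the fit
            rho = Xc[:, j] @ residual / n_samples
            beta[j] = soft_threshold(rho, alpha) / col_sq[j]
            residual -= Xc[:, j] * beta[j]  # add the updated contribution back
    return y_mean - x_mean @ beta, beta


# Synthetic data: only features 0 and 3 carry signal.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.standard_normal(200)
intercept, beta = lasso_cd(X, y, alpha=0.1)
selected = np.flatnonzero(beta)
print(selected, np.round(beta[selected], 2))
```

Features whose coefficients remain non-zero are the ones kept for classification; increasing `alpha` shrinks the surviving coefficients further and removes more features.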
To evaluate how discriminative the extracted acoustic features are, we employ classifiers that are as simple as possible. Specifically, we use two simple but frequently used support vector machines for this recognition task: a support vector machine classifier (SVC) with a Gaussian kernel and a linear support vector machine classifier (LinearSVC). The SVC with a Gaussian kernel is implemented as in [28]. The LinearSVC from sklearn is similar to a support vector machine with a linear kernel, but it offers more flexibility in the choice of penalties and loss functions, and it scales better to a large number of samples.
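As a minimal illustration of the two classifiers, the sketch below trains sklearn's `SVC` with an RBF (Gaussian) kernel and `LinearSVC` on synthetic data. The data from `make_classification`, the train/test split, and all hyperparameter values are assumptions for the example; they are not the features or settings used in this article.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC, LinearSVC

# Synthetic stand-in for the extracted acoustic features (3 classes, 20 features).
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=6, n_classes=3,
    n_clusters_per_class=1, class_sep=2.0, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

classifiers = {
    "SVC (Gaussian kernel)": SVC(kernel="rbf", gamma="scale"),
    # LinearSVC exposes the penalty and loss explicitly.
    "LinearSVC": LinearSVC(penalty="l2", loss="squared_hinge", max_iter=10000),
}
accuracy = {
    name: clf.fit(X_train, y_train).score(X_test, y_test)
    for name, clf in classifiers.items()
}
for name, acc in accuracy.items():
    print(f"{name}: {acc:.3f}")
```

Keeping the classifiers this simple means that differences in accuracy mostly reflect the quality of the features rather than the capacity of the model.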

A. DATA DESCRIPTION
We collect 10 classes of traffic acoustic scenes at different locations to construct a traffic acoustic scene dataset for evaluating the performance of the feature extraction method. The data were collected by a recorder and mixer equipped with a Qualcomm Aqstic audio codec (WCD 9335). The raw traffic acoustic recordings range from 1 minute to 5 minutes in length and were sampled at 44.1 kHz. The recorded files were saved in AVI format. We cut the raw recordings into clips of 30 seconds each. The number of samples in each class is listed in Table 1.
For clarity, we show 9 CQT spectrograms from 3 typical acoustic scenes, i.e., bus, car, and motorbike, in Fig. 3. Each subfigure in Fig. 3 is the CQT spectrogram of an acoustic scene. For instance, in the bus CQT spectrograms, the low-frequency line corresponds to the acceleration and deceleration of the buses. In the motorbike scenes, the higher-frequency content shown in a deeper color is the sound of motorbike engines.

B. DESIGN DECISION
We employ a Matlab toolbox for the constant Q transform. Then, the CQT spectrogram is scaled to 512 × 512. The VLFeat toolkit is used for HOG conversion. Two cell sizes are tested on the CQT spectrogram, namely 16 × 16 and 8 × 8. With 8 × 8 cells and 8 gradient directions, the dimension of the feature vector is 64 × 64 × 8 = 32768 for one sample of the traffic acoustic scenes. The HOG features of the CQT spectrogram are illustrated in Fig. 4. The cells in the second and third rows of Fig. 4 are 8 × 8, with 8 gradient directions per cell in the second row and 32 in the third row. The cells in the fourth and fifth rows of Fig. 4 are 16 × 16, with 8 gradient directions per cell in the fourth row and 32 in the fifth row. From Fig. 4, we can see that, compared with the raw CQT spectrogram, the HOG correctly captures the direction of the power spectrum along the high-energy sharp signals. However, it is difficult to choose the right cell size and number of orientations. Through preliminary and contrastive experimental screening, we found that the best results are obtained when the number of gradient directions is 8 in 8 × 8 cells.
Fig. 3: (a)-(c) are the acoustic scenes of buses; (d)-(f) are the acoustic scenes of cars; (g)-(i) are the acoustic scenes of motorcycles. These spectrograms are visually more discriminable after the constant Q transform than the original audio clips.
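The feature dimensions for the four HOG configurations discussed in this subsection follow from simple arithmetic, worked through below. The sketch counts one histogram per cell with one value per gradient direction, as in the text; the helper name `hog_dimension` is hypothetical, and library implementations may append further per-cell components.

```python
def hog_dimension(image_size: int, cell_size: int, n_orientations: int) -> int:
    """Number of values when one n_orientations-bin histogram is kept per cell."""
    cells_per_side = image_size // cell_size
    return cells_per_side * cells_per_side * n_orientations


# The four configurations shown in Fig. 4, for a 512 x 512 spectrogram.
dimensions = {
    (cell, bins): hog_dimension(512, cell, bins)
    for cell in (8, 16)
    for bins in (8, 32)
}
for (cell, bins), dim in dimensions.items():
    print(f"cell {cell:2d} x {cell:<2d}  orientations {bins:2d}  ->  {dim}")
```

Halving the cell side quadruples the dimension, and going from 8 to 32 orientations quadruples it again, so the four settings span 8192 to 131072 values per sample.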

C. FEATURES EXTRACTED BY GFE AND LFE
The aforementioned GFE and LFE methods describe the characteristics of HOG from different points of view, and both improve the recognition accuracy for different acoustic scenes. The results are detailed in Table 2. In Table 2, the method LFE denotes local feature extraction by Eq. 2. The LFE method uses a feature of only 64 dimensions to achieve 73.98% accuracy. The LFE extended (LFEx) method concatenates the features extracted by Eq. 2 and Eq. 3.
The GFE+LFE method concatenates the features extracted by the GFE and LFE methods, whereas the GFE+LFEx method concatenates the features extracted by the GFE and LFEx methods. The experimental results show that the feature extraction method that fuses GFE and LFEx achieves the best performance, improving the accuracy by 4.23% compared with the GFE method.
We also try different pooling methods, i.e., pooling over the time domain and pooling over the frequency domain. The experimental results are shown in Table 3. The numbers in the frequency and time columns indicate the number of histograms along the frequency domain and the time domain after pooling. For example, the first row corresponds to the case in which all histograms have been averaged over the frequency domain. We find that pooling over the time domain achieves better performance.
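Average pooling of the cell histograms along either axis can be sketched as follows. The array layout `(freq_cells, time_cells, n_bins)`, the function name `pool_histograms`, and the restriction to output sizes that divide the input evenly are assumptions made for this illustration.

```python
import numpy as np


def pool_histograms(hog: np.ndarray, n_freq: int, n_time: int) -> np.ndarray:
    """Average-pool cell histograms to n_freq x n_time histograms.

    hog has shape (freq_cells, time_cells, n_bins); both output sizes must
    divide the corresponding input size.
    """
    freq_cells, time_cells, n_bins = hog.shape
    if freq_cells % n_freq or time_cells % n_time:
        raise ValueError("output sizes must divide the number of cells")
    blocks = hog.reshape(
        n_freq, freq_cells // n_freq, n_time, time_cells // n_time, n_bins
    )
    return blocks.mean(axis=(1, 3))


hog = np.random.default_rng(0).random((64, 64, 8))
over_frequency = pool_histograms(hog, n_freq=1, n_time=64)  # one histogram per time step
over_time = pool_histograms(hog, n_freq=64, n_time=1)       # one histogram per frequency band
print(over_frequency.shape, over_time.shape)  # (1, 64, 8) (64, 1, 8)
```

Either choice reduces the 32768 values of the full descriptor to 512, and intermediate settings such as `n_freq=4, n_time=16` trade resolution along one axis against the other.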

D. PERFORMANCE EVALUATION
In this study, we employ LASSO to reduce the dimension of the feature vectors. In our implementation, the LASSO from the sklearn toolkit is used with default settings. Fig. 5 shows the selection of GFE features by LASSO. The results reveal that LASSO is an effective and efficient way to extract the principal features of different acoustic scenes. Table 4 and Table 5 compare the recognition accuracy obtained with the GFE and LFE features, respectively: Table 4 lists the recognition accuracy obtained with the features selected by LASSO from the GFE features, and Table 5 lists that obtained with the features selected by LASSO from the LFE features. In these two tables, the abbreviation cv denotes the number of folds of the cross-validation strategy. We choose different values of cv to demonstrate the advantage of our method under different data distributions and different amounts of data. The hyperparameter λ controls the shrinkage of the weighting parameters. In addition, the shrinkage parameter λ is chosen with the real-time requirement in mind.
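The evaluation protocol described above, LASSO-based selection followed by a simple classifier and scored with different numbers of cross-validation folds, might be set up as in the sketch below. It is an assumption-laden illustration: the data are synthetic, the λ values and fold counts are arbitrary, the shrinkage parameter is passed through sklearn's `alpha` argument, and `SelectFromModel` is one way (not necessarily the authors') of keeping the features with non-zero LASSO coefficients.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Synthetic stand-in for high-dimensional acoustic features (3 classes).
X, y = make_classification(
    n_samples=300, n_features=50, n_informative=8, n_classes=3,
    n_clusters_per_class=1, random_state=0,
)


def evaluate(lam: float, folds: int) -> float:
    """Mean cross-validated accuracy for one shrinkage value and fold count."""
    model = make_pipeline(
        SelectFromModel(Lasso(alpha=lam)),  # keep features with non-zero coefficients
        LinearSVC(max_iter=10000),
    )
    return cross_val_score(model, X, y, cv=folds).mean()


results = {
    (lam, folds): evaluate(lam, folds)
    for lam in (0.01, 0.05)  # illustrative shrinkage values
    for folds in (3, 5)      # illustrative numbers of folds
}
for (lam, folds), acc in results.items():
    print(f"lambda={lam:<5} cv={folds}  accuracy={acc:.3f}")
```

Placing the selector inside the pipeline ensures that features are chosen from the training folds only, so the reported accuracy is not inflated by information from the held-out fold.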
We also compare the recognition accuracy obtained with the fused features extracted by the aforementioned methods, i.e., GFE, LFE, and LASSO. The recognition accuracy obtained with the features extracted by GFE, GFE+LFE, GFE+LASSO, and GFE+LFE+LASSO is illustrated in Fig. 6. In Fig. 6, we find that GFE+LFE+LASSO achieves the best performance for almost all categories. The corresponding recognition accuracy is listed in Table 7.
We find that the fused features achieve better results than the features from a single extraction method.
In addition, the local feature extraction methods are evaluated. The LASSO method makes a significant contribution to the final recognition accuracy. In Table 7, the comparisons of GFE, LFE, and GFE+LFE with different feature dimensions are given. The experimental results show that the LFE features contribute more than the GFE features.
We compare the proposed method with two state-of-the-art methods. The first method, developed by Abidin et al. [29], uses the variable-Q transform (VQT) to generate the time-frequency representation of an acoustic scene. Then, the adjacent evaluation completed local binary pattern (LBP) is adopted to extract time-frequency features. The second is a deep learning method proposed by Yang et al. [30] for the 2018 DCASE challenge. They extract multi-scale log-Mel features of the acoustic signal and develop a modified Xception network to fuse the multi-scale features. Both methods are implemented and tested on our dataset. The first achieves an accuracy of 85.5%, whereas the second achieves 79.8%. The comparison results are listed in Table 7. In Table 7, the three strategies of the proposed method outperform both of these methods, because GFE and LFE effectively capture the acoustic features from the CQT spectrogram, and LASSO selects the discriminative features well and eliminates the disturbance of unrelated features.

E. DISCUSSION
The development of unmanned systems is inseparable from effective awareness of the real-time, dynamic, and highly complex traffic environment. Traffic acoustic scene recognition (TASR) is a complex and challenging task that aims to recognize acoustic environments solely from an audio recording of the scene. For unmanned vehicles, TASR can provide an auxiliary means of perception in addition to visual identification. Because acoustic analysis and recognition are simple and convenient, they can effectively enhance a perception system that otherwise relies only on visual information. The acoustic scenes can be defined according to specific geographical contexts, such as an expressway, sidewalk, or metropolitan railway, and specific means of transportation, such as a car, bus, or tramway. Accurate scene recognition is highly relevant for applications that aim at machine context awareness, which is of critical importance for intelligent transportation and automatic driving. The presented feature extraction method is general and can be extended to other time series analysis tasks, such as traffic flow forecasting [31]-[33], intelligent computing [34]-[36], or medical signal visualization [37], [38].

IV. CONCLUSION
This article presents a new representation for traffic acoustic data, which accelerates and enhances traffic acoustic scene recognition. To achieve this, we transform the audio clip into a CQT spectrogram and extract HOG descriptors from it. Then, a feature extraction method is proposed for both the time domain and the frequency domain on the HOG descriptors. Two local feature extraction methods, which consider the volatility of the time-domain features, are designed to describe the time-domain property. The dimension of the features is reduced by LASSO to eliminate the negative effect of irrelevant features in both the time domain and the frequency domain. Furthermore, we collect a sufficiently large real-world dataset to evaluate the performance of the proposed feature extraction method. The results on this dataset demonstrate that our two local feature extraction methods outperform the state-of-the-art frequency-domain feature extraction method.
Future work will proceed in two directions. First, we plan to use recurrent neural networks and their extensions to extract more discriminative features, and convolutional neural networks for more accurate recognition. Second, we would like to apply such a model to more complex and noisy roadway circumstances.