Statistical Classification of Vehicle Interior Sound Through Upsampling-Based Augmentation and Correction Using 1D CNN and LSTM

To quantitatively classify the results provided by engineers, we built a sound quality (SQ) classification model using neural networks. The data used in this study were wav files recorded from various vehicle specifications while driving under wide open throttle conditions. However, the data lengths were not constant, so the upsampling and interpolation scheme (USIS) was used to achieve constant data lengths. After the USIS was applied to the data, dynamic time warping was used to verify that there was no change in the data characteristics. The verified dataset was transformed into Mel-spectrograms to examine its characteristics, and dimensionality reduction was applied using a high-pass filter. Clarifying the differences between clusters improves the model performance. The 1D convolutional neural network and long short-term memory classification models exhibited training accuracies of about 94.9% (64 or 65 out of 68 classified) and test accuracies of about 87.5% (7 out of 8 classified). In addition, undefined sound quality labels were successfully evaluated quantitatively and classified statistically in the present study. Both neural networks produced effective results that can be used by sound design engineers to quantitatively examine the SQ of vehicle interior noise.

Hanato and Hashimoto [8] proposed a new method based on power summation of the weighted 1/3-octave bands to find the frequency band that affects the quantitative booming index of SQ. Genuit [9] described how to collect SQ data accurately. Additionally, there were studies that classified vehicle interior noise using neural networks. Lee [10] and Wang et al. [11] conducted studies to predict the SQ preferred by a user through artificial neural networks (ANN). Tan et al. [12] created a prediction model for SQ evaluation using four SQ factors, user evaluation, and a back-propagation neural network. Huang et al. [13] performed data denoising using a discrete wavelet transform and predicted user evaluation via a deep belief network. Lee […] the upsampling and interpolation scheme (USIS) was used. To verify whether the data characteristics were maintained, the data before and after conversion were compared along with their frequency components. The data were then converted to a Mel-spectrogram, and the mean-standard deviation distribution before and after the conversion was expressed as a scatter plot and compared with the scatter plot results obtained using the K-means algorithm. In this study, a 1D CNN and LSTM were used to perform stochastic classification of vehicle interior noise. To train a neural network with less data, the number of parameters in the neural network was reduced, as summarized in Fig. 1. The distinct characteristics of this study are as follows:
1) Upsampling and interpolation were used to match different data lengths. When the extracted wav data were digitized, the data length was adjusted by increasing the sampling points per second using USIS.
2) After employing USIS, the retention of data characteristics was verified using Euclidean distance comparison and dynamic time warping, major frequency-domain comparison using cross-correlation, and comparison of the mean-standard deviation distribution of the Mel-spectrogram.
In this paper, Section II describes the data measurement method in detail. Section III describes the USIS and Mel-spectrogram conversion process. Section IV presents the structure of the two neural network models, the method of outputting the results, and the construction of the training and test datasets. Section V details the probabilistic SQ classification results of the neural networks according to the dataset and the related discussion. Section VI highlights our conclusions.

In total, 121 vehicle interior noises produced by various vehicles were measured. These vehicles consisted of sedans with different engines and vehicle grades. For generalization of vehicle interior noise measurement, the experiment was performed under limited driving conditions, such as wide open throttle (WOT). WOT was used because it is a condition in which the noise, vibration, and harshness (NVH) characteristics of the vehicle are revealed. WOT conditions were measured from 1000 rpm or less in a three-gear state until […]

Table 1 shows the classification of 112 vehicles based on the engine cylinder number. […] In Fig. 2(b), if data C is approximately 19.5 s long, it consists of 859,950 sample points. If data A has twice the sample points with minimal correction in length, the same 859,950 sample points can be obtained. Similarly, data B in Fig. 2(c) can also be length-corrected, and therefore data A, B, and C can have the same number of sample points. This process is called upsampling (US). The sample points per second can be increased by zero-padding and interpolation [32]. The procedure for USIS is shown in Fig. 3. First, zero-padding was applied successively between existing sample points to double the length, as shown in Fig. 3(a). The padded points were then derived using interpolation, as shown in Fig. 3(b). Thus, the data can be increased N times; this factor is called the upsampling ratio (UR). Interpolation uses the firwin and convolve functions of the SciPy module, a Python library. The scheme is applied according to the following process.
1. Filter data above the audible frequency using firwin as a band-pass filter. […]
The convolution equation can be expressed using (1) as follows:

y[n] = Σ_k h[k] x[n − k] (1)

First, the difference in signal amplitude was considered.
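The zero-padding and interpolation steps above can be sketched with SciPy's firwin and convolve, which the paper names. The filter design below is an assumption: a low-pass FIR cut off at the original Nyquist frequency is the standard anti-imaging choice after zero-insertion, and the tap count and sample rates are illustrative, not the paper's values.

```python
import numpy as np
from scipy.signal import firwin, convolve

def usis_upsample(x, ur=2, fs=44100, numtaps=101):
    """Illustrative USIS sketch: zero-pad between samples, then
    interpolate the padded points with an FIR filter (firwin + convolve)."""
    x = np.asarray(x, dtype=float)
    # Step 1: insert (ur - 1) zeros between existing sample points
    padded = np.zeros(len(x) * ur)
    padded[::ur] = x * ur          # scale to preserve passband amplitude
    # Step 2: FIR filter; cutoff at the original Nyquist frequency removes
    # the spectral images created by zero-insertion
    h = firwin(numtaps, cutoff=fs / 2, fs=fs * ur)
    # Step 3: convolution realizes the interpolation, y[n] = sum_k h[k] x[n-k]
    return convolve(padded, h, mode="same")
```

Applied to a 1 s clip at 8 kHz with UR = 2, this returns a 16,000-sample clip whose interior samples closely track the densely sampled waveform.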

To observe the Euclidean distance (ED) between two signals, the original and interpolated signals were divided and compared, as shown in Fig. 4(a) [33], [34]. Fig. 4(b) shows the original (red line) and interpolated (blue line) signals superimposed over each other, and it is evident that they are nearly identical. Fig. 4(c) shows the ED between the two signals, which is less than 3% based on the original signal amplitude. Therefore, considering the interpolation error, the two signals are almost identical. The data in Fig. 4(c) are values calculated according to the Euclidean distance formula between the original data and interpolated data, and the ED can be expressed as follows:

ED = sqrt( Σ_i (x_i − y_i)² )

Second, dynamic time warping (DTW) was used to compare the two signals. DTW is a method for measuring the similarity between two time-series sequences [35], [36], [37]. If the original and interpolated signals are similar, there is little difference between them, so the warping path should lie close to the diagonal of the matrix, as shown in Fig. 4(d). The red line is the warping path that represents a perfect match between two signals, whereas the black line is the warping path of the two signals, which follows the red line closely.
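The two similarity checks described above can be sketched as follows. This is a textbook O(N·M) DTW with the usual three-way recurrence, not the paper's exact implementation; for identical signals the warping path is exactly the matrix diagonal, which is the property the authors use.

```python
import numpy as np

def euclidean_distance(x, y):
    # ED between original and interpolated signals (elementwise form)
    return float(np.sqrt(np.sum((np.asarray(x) - np.asarray(y)) ** 2)))

def dtw_path(x, y):
    """Classic dynamic time warping; returns total cost and warping path."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # backtrack from (n, m) to (1, 1), preferring the diagonal step
    path, i, j = [], n, m
    while (i, j) != (1, 1):
        path.append((i - 1, j - 1))
        steps = {(i - 1, j - 1): D[i - 1, j - 1],
                 (i - 1, j): D[i - 1, j],
                 (i, j - 1): D[i, j - 1]}
        i, j = min(steps, key=steps.get)
    path.append((0, 0))
    return D[n, m], path[::-1]
```

Running `dtw_path` on a signal against itself yields a cost of 0 and the diagonal path, matching the perfect-match (red) line in Fig. 4(d).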

Third, the frequency components were compared using cross-correlation. Fig. 4(e) shows the frequency components of the original (red line) and interpolated (blue line) signals. To compare the frequency components, the main region from 0-1200 Hz was enlarged. Fig. 4(f) shows the absolute value of the difference in magnitude between the main frequency components. From the figure, it was confirmed that the error was less than 3% based on the original signal frequency magnitude. An error of less than about 3% is small: when the largest magnitude error is about 400, the error is 2.5%, and Fig. 4(f) shows that most of the component errors are less than 150, which is a 0.94% error. Therefore, the frequency components were almost identical considering the interpolation error.
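The band-limited frequency comparison can be sketched as below. The band edges follow the paper's 0-1200 Hz main region; the zero-lag normalized cross-correlation used as the similarity score is an assumption about how the comparison is scored (the paper only names cross-correlation), and the FFT bin spacing conveniently stays the same before and after upsampling because both duration and rate scale together.

```python
import numpy as np

def band_magnitude(x, fs, f_lo=0.0, f_hi=1200.0):
    """Magnitude spectrum restricted to the main 0-1200 Hz band."""
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    mag = np.abs(np.fft.rfft(x))
    band = (freqs >= f_lo) & (freqs <= f_hi)
    return freqs[band], mag[band]

def spectral_similarity(x, y, fs_x, fs_y):
    """Zero-lag normalized cross-correlation of band-limited magnitudes;
    a value of 1.0 means identical frequency content in the band."""
    _, mx = band_magnitude(x, fs_x)
    _, my = band_magnitude(y, fs_y)
    n = min(len(mx), len(my))
    mx, my = mx[:n], my[:n]
    return float(np.dot(mx, my) / (np.linalg.norm(mx) * np.linalg.norm(my)))
```

For a tone sampled at the original and at the doubled rate over the same duration, the score is essentially 1, reflecting the small (< 3%) magnitude errors reported in Fig. 4(f).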

Next, the mean-standard deviation scatter plots of the scale-converted Mel-spectrograms were compared. If the data characteristics are maintained after conversion, the scatter plots of the Mel-spectrogram mean-standard deviation should be similar. However, because of the interpolation error, the scatter plots do not match perfectly. This third verification is described further using Mel-spectrograms in Section III.C. […] and were adjusted using (6) and (7). [Const] in (7) uses {1, 1, 0.7, 0.54, 0.45, 0.37, 0.34} according to the UR. If the UR exceeds 7, it is recommended that a lower common multiple be found. When the data were converted into a Mel-spectrogram, they were pre-processed for frequency range limitation and frequency resolution adjustment so that the features were well revealed.
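The per-clip (mean, standard deviation) feature used for the scatter plots can be sketched as follows. Note an assumption: the paper uses a Mel-spectrogram, whereas this sketch uses a linear-frequency dB spectrogram from SciPy to stay dependency-light (a Mel filter bank, e.g. via librosa, would slot in at the marked step); the window length is also illustrative.

```python
import numpy as np
from scipy.signal import spectrogram

def mean_std_feature(x, fs, f_lo=20.0, f_hi=1200.0):
    """Per-clip (mean, std) of a dB-scaled spectrogram restricted to the
    paper's 20-1200 Hz band; a Mel filter bank could replace the band mask."""
    f, t, S = spectrogram(x, fs=fs, nperseg=1024)
    band = (f >= f_lo) & (f <= f_hi)       # frequency range limitation
    S_db = 10.0 * np.log10(S[band] + 1e-12)  # power to dB
    return float(S_db.mean()), float(S_db.std())
```

Plotting these two numbers per clip gives the mean-standard deviation scatter plot compared before and after USIS; clips with more broadband energy land at a higher mean.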

The frequency range limitation, as shown in Fig. 6(a), limits the frequency range from 20-20,000 Hz to 20-1200 Hz, considering that the characteristics of the SQ factors (Table 2) […] If the frequency range of 20-1200 Hz were divided into 500 partitions, a span of approximately 2.36 Hz per column (e.g., 20 Hz-22.36 Hz) would be expressed in dB. However, a high frequency resolution is not always appropriate. The reason can be observed in the scatter plot shown in Fig. 7. A mean-standard deviation cluster of the data has to be formed; however, if the frequency resolution is too low or too high, no clear data cluster can be formed.

In this study, verification using the mean-standard deviation clusters was performed by comparison with the classification results obtained from the mean-standard deviation using K-means. In Fig. 8(a) and 8(b), it can be seen that […]

The data were normalized as given in (8):

x' = (x − x_min) / (x_max − x_min) (8)

(x_min: minimum value, x_max: maximum value)
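Equation (8) is standard min-max scaling; a minimal sketch (the divide-by-zero guard for constant input is our addition, not from the paper):

```python
import numpy as np

def min_max_normalize(x):
    """Min-max scaling per (8): x' = (x - x_min) / (x_max - x_min),
    mapping each feature into [0, 1] before clustering/training."""
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()
    if x_max == x_min:          # constant input: avoid division by zero
        return np.zeros_like(x)
    return (x - x_min) / (x_max - x_min)
```

For example, `min_max_normalize([2, 4, 6])` yields `[0.0, 0.5, 1.0]`.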

In this study, a 1D CNN and LSTM were used for recognition, and their outputs were classified probabilistically using the Softmax function.
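The probabilistic output layer referred to here is the Softmax function, which maps the network's raw class scores to probabilities summing to 1; a numerically stable sketch:

```python
import numpy as np

def softmax(logits):
    """Converts raw network scores (logits) into class probabilities.
    Subtracting the max first avoids overflow without changing the result."""
    z = np.asarray(logits, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()
```

For instance, logits of (1, 2, 3) give roughly (0.09, 0.24, 0.67): the largest score gets the largest probability, and the three values sum to 1, which is what Tables 8 and 9 report per class.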

As shown in Table 8 and Table 9, the output is the probability of […]

The data structure of Case #1 is shown in Table 1. Most of the data were biased towards 4-cylinder and 6-cylinder engines, because most vehicles manufactured nowadays have either a 4-cylinder or a 6-cylinder engine. Table 4 and Table 5 show the classification results obtained using the 1D […] high learning performance with the same data and was able to classify the engine cylinder accurately. LSTM had poor classification performance for 8-cylinder in both training and testing. However, the classification performance of LSTM for 4-cylinder and 6-cylinder, for which a large amount of data was available, was 90.91%, which was higher than that of the 1D CNN (72.73%). The following conclusions can be drawn from Case #1: […]

The data structure of Case #2 is shown in Table 2. The composition of the data was distributed evenly among the three classes. Table 6 and Table 7 show the results of […] The conclusions drawn from Case #2 are as follows. […]

The data of Case #3 are listed as Undefined in Table 2. Fig. 10 shows a result in which the undefined data (black) are projected on the scatter plot of the mean-standard deviation of the Mel-spectrogram (Fig. 8(c)). The undefined datasets were classified using the two neural networks, as described in Section V.B. Table 8 and Table 9 show the 1D CNN and LSTM results, respectively. In the 1D CNN results, J was closer to the distribution of Luxury but was classified as Sporty. Additionally, M, S, and T were on the Powerful-Sporty boundary and were classified as Sporty by the K-means algorithm, as shown in Fig. 8(d). In the LSTM results, J was classified as Sporty although the adjacent scatter points were classified as Luxury. M and P were on the classification boundary between Sporty and Powerful, but they were misclassified as Luxury. Furthermore, LSTM classified T as Sporty. Among M, S, and T, which caused confusion in the 1D CNN, M was incorrectly classified as Luxury and S as Powerful, unlike with the 1D CNN.

However, T was classified as Sporty. The conclusions drawn from Case #3 are as follows: […]
2) Although there was a slight difference in learning performance, the two neural networks showed similar performance on the undefined data. In Fig. 10, both neural networks misclassified data only at the class interface. If both had poor classification performance, both neural networks would also have misclassified data far from the classification boundary.

This study provides ASD engineers with a probabilistic evaluation method for SQ classification of vehicle interior noise. […] amount of data (4-cylinder and 6-cylinder). In Case #2, there was no difference in performance between the 1D CNN and LSTM. The numbers of data points for 3-cylinder and 8-cylinder were 13 and 11, respectively, whereas every class in Case #2 exceeded 22; compared to Case #1, Case #2 had about twice as much data. If at least 20 data points per class are used, performance is expected to improve. We believe that verification will be possible if additional training data for 3-cylinder and 8-cylinder engines can be obtained in a future study. Case #3 revealed two things. First, as shown in Tables 8 and 9, the classification was incorrect or unclear near the class boundaries. Even a neural network model that classifies data quantitatively had difficulty classifying SQ data accurately at the class boundaries. Second, LSTM produced a completely opposite classification compared to that of the 1D CNN for cases such as Figs. 11(a) and 11(b). In the case of Figs. 11(c) and (d), the classification was not clear because these cases were at the class boundary. However, in the case of the 1D CNN, the classification was not wrong; there was only a probabilistically unclear result at the class boundary. The 1D CNN classified Fig. 11(a) and S as Sporty, as shown in Table 8, and LSTM classified Figs. 11(a) and (b) as Luxury, as shown in Table 9. Fig. 5 is a representative Mel-spectrogram of Luxury, Powerful, and Sporty. Figs. 11(b), 11(c), and 11(d) show the Mel-spectrograms of Figs. 11(a), 11(b), and 11(e), respectively. All three data points were at the border between Powerful and Sporty. Additionally, it is difficult to distinguish Luxury and Sporty, as shown in Fig. 5.
The Sporty class was characterized by a slightly lighter line than that of the Powerful class, and the 1D CNN seemed to classify Sporty based on this feature. LSTM, which specializes in time-series data, appeared to classify based on the characteristics of changes in frequency components over time. From Figs. 11(b) and 11(c), it can be observed that the frequency components gradually increased with time.

The purpose of this study was to propose a valid data pre-processing and verification method, using USIS, to obtain a common data length for the recorded data and make them suitable for neural network learning. The results of this study show that USIS can be used as a tool for studies related to vehicle interior noise SQ classification. The classification performance of the 1D CNN and LSTM differed depending on the amount of data; LSTM showed better classification performance with a large amount of data. The 1D CNN tends to incline towards Sporty, and LSTM towards Luxury. Therefore, the classification of boundary data should be judged by considering the data characteristics. The overall neural network performance was good, with some exceptions. The learning and classification performances of the 1D CNN were 93.20% and 75% in Case #1 (Engine Cylinder), and 94.12% and 87.50% in Case #2 (Sound Quality), respectively. The learning and classification performances of LSTM were 81.56% and 68.75% in Case #1, and 95.59% and 87.50% in Case #2, respectively. Based on the neural network performance in Case #1 and Case #2, the quantitative evaluation and classification of undefined sound quality labels were successfully performed in the present study.

As a further direction, the classification performance will be investigated in the frequency domain as well as the time domain. Additionally, more active study of sampling techniques to handle the various lengths of sound data is needed in the context of active sound design (ASD) engineering.