Multi-timescale Wavelet Scattering with Genetic Algorithm Feature Selection for Acoustic Scene Classification

In this paper, we apply a genetic algorithm (GA) for feature selection, using the wrapper approach, to wavelet scattering (WS) second-order coefficients in order to reduce their large frequency dimension (>500). The evaluation demonstrates that GA can reduce the dimension by approximately 30% with only a minimal drop in performance. Furthermore, the reduced WS directly shortens the training time of the convolutional neural network by 20% to 32%. The paper then extends its scope to explore GA for feature selection on multiple timescales of WS: 46ms, 92ms, 185ms, and 371ms. Incorporating multiple timescales improves classification performance (by around 2.5%), as an acoustic representation usually contains information at different timescales. However, it also increases the computational cost because of the larger frequency dimension of 1851. With GA feature selection, the frequency dimension is reduced by 50%, saving around 40% of the computational time while increasing the classification performance by 3% compared to a vanilla WS. Lastly, all implementations are evaluated using the Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 dataset, and the proposed multiple-timescale model achieves a classification accuracy of 73.32%.


I. INTRODUCTION
Wavelet Scattering (WS), also known as Deep Scattering Spectrum (DSS), is a signal processing technique developed by Stephane Mallat [1][2][3][4] to recover the information lost in the log mel-spectrogram, especially when the timescale is larger than 25ms. While the works presented in [5,6] have shown a competitive performance advantage over the log mel-spectrogram, the large frequency dimensionality might outweigh that advantage when computational resources are limited. Hence, there is a need to reduce the frequency dimension of WS without impeding the performance. The task of finding an optimal feature subset can be described as feature selection. In a broader sense, feature selection is a dimensionality reduction technique. Dimensionality reduction can be categorized into two methods: feature selection and feature construction (extraction). The difference is that the latter usually compares the features in another feature space, and the final output is a new set of features (e.g., Principal Component Analysis (PCA)). In contrast, feature selection selects a subset of the original features. This paper chooses feature selection as the approach to dimensionality reduction, for three reasons. Firstly, WS is locally translation invariant and stable to deformation [1][2][3]. These properties are paramount to acoustic classification tasks; hence, it is vital to retain the features as they are. Secondly, WS consists of first and second-order coefficients, which differ greatly in magnitude [5,6]. Thus, the translation algorithm might be biased towards the first-order coefficients, as their magnitude is larger. Thirdly, the second order is a sparse representation [1][2][3], which calls for careful selection of the translation algorithm.
Feature selection is broadly studied and has been used in many domains [7][8][9][10][11][12]. A traditional approach such as exhaustive search is practical for a small number of features but not feasible for a large number of features. Exhaustive search requires $O(2^m)$ computational cost, where m is the number of features. Hence, employing a heuristic technique [13] is more effective to find the optimal feature subset without searching through every single

II. WAVELET SCATTERING
Wavelet Scattering is constructed with a cascade of wavelet transform and modulus operators, as illustrated in Fig. 1, and has the desirable feature representation properties of being locally translation invariant and stable to time-warping deformation [1,2]. Subsection A describes the wavelet scattering architecture and algorithm. Subsection B discusses the limitation of the wavelet scattering timescale and, hence, the reason to incorporate multi-timescale wavelet scattering.

A. WAVELET SCATTERING
WS was developed by Stephane Mallat and presented in [1] as an improved time-frequency representation over the log mel-spectrogram. WS leverages the cascading of wavelet transform and modulus operations to recover the fine-scale information lost through averaging, especially when the timescale T is more than 25ms, as observed by [1,2]. This information loss at larger timescales pinpoints the weakness of the log mel-spectrogram and the reason for WS. As this paper further exploits WS's multiscale property, the comparison between log mel-spectrogram and WS will not be elaborated further. A comparative analysis of WS and log mel-spectrogram has already been presented in [5,6].
Given a signal x, WS can be expressed as a cascade of wavelet convolutions and modulus operations:
$$S_m x(t, \lambda_1, \ldots, \lambda_m) = \big|\cdots\big||x \star \psi_{\lambda_1}| \star \psi_{\lambda_2}\big| \cdots \star \psi_{\lambda_m}\big| \star \phi(t) \quad (1)$$
where $S_m x$ denotes the output of WS and m is the number of wavelet scattering operations performed on x, as illustrated in Fig. 1. The symbol m is usually termed the 'order', and t is a time index. $\psi_{\lambda}$ is the set of wavelet filters with the 'Morlet' wavelet as the mother wavelet, and their center frequency is denoted as $\lambda$. $\phi$ is the lowpass averaging filter. The number of wavelets per octave of $\psi_{\lambda}$ is determined by Q.
$$S_1 x(t, \lambda_1) = |x \star \psi_{\lambda_1}| \star \phi(t) \quad (2)$$
$$S_2 x(t, \lambda_1, \lambda_2) = \big||x \star \psi_{\lambda_1}| \star \psi_{\lambda_2}\big| \star \phi(t) \quad (3)$$
For simplicity, $S_1$ derived in (2) denotes the first-order coefficients and $S_2$ derived in (3) denotes the second-order coefficients. Q1 and Q2 are the number of wavelets per octave for the first and second orders, respectively. The rest of the paper uses a setup of Q1 = 9 and Q2 = 1 based on the suggestion of [1][2][3][4][5][6].
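The cascade of wavelet convolution, modulus, and lowpass averaging can be sketched in a few lines of plain Python. This is only an illustrative toy under stated assumptions, not the implementation used in this paper: the Gabor-style `morlet` filter, the Gaussian `lowpass` window, the `cycles` parameter, and the test signal are all hypothetical choices, and second-order paths are computed only where the second center frequency is below the first.

```python
import cmath
import math

def morlet(center_freq, cycles=2.0):
    """Complex Morlet-like (Gabor) filter; center_freq is in cycles per sample."""
    width = cycles / center_freq              # lower frequency -> wider filter
    half = int(3 * width)
    return [cmath.exp(2j * math.pi * center_freq * n) * math.exp(-0.5 * (n / width) ** 2)
            for n in range(-half, half + 1)]

def lowpass(T):
    """Normalized Gaussian averaging window (phi) spanning about T samples."""
    sigma, half = T / 4.0, T // 2
    w = [math.exp(-0.5 * (n / sigma) ** 2) for n in range(-half, half + 1)]
    total = sum(w)
    return [v / total for v in w]

def convolve_same(x, h):
    """Direct 'same'-length convolution with zero padding (h has odd length)."""
    half = len(h) // 2
    out = []
    for t in range(len(x)):
        acc = 0
        for k, hk in enumerate(h):
            i = t - (k - half)
            if 0 <= i < len(x):
                acc += x[i] * hk
        out.append(acc)
    return out

def scattering(x, freqs1, freqs2, T):
    """Return first- and second-order coefficients, keyed by center frequency."""
    phi = lowpass(T)
    s1, s2 = {}, {}
    for f1 in freqs1:
        u1 = [abs(v) for v in convolve_same(x, morlet(f1))]   # |x * psi_1|
        s1[f1] = convolve_same(u1, phi)                        # equation (2)
        for f2 in freqs2:
            if f2 < f1:                                        # assumed path rule
                u2 = [abs(v) for v in convolve_same(u1, morlet(f2))]
                s2[(f1, f2)] = convolve_same(u2, phi)          # equation (3)
    return s1, s2

# Hypothetical test signal: a 0.25 cycles/sample tone with slow amplitude modulation.
signal = [math.sin(2 * math.pi * 0.25 * n) * (1 + math.cos(2 * math.pi * 0.03 * n))
          for n in range(128)]
s1, s2 = scattering(signal, freqs1=[0.25, 0.125], freqs2=[0.03], T=32)
```

Because each path ends with a modulus followed by a non-negative averaging window, every coefficient in `s1` and `s2` is non-negative.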

B. GREATER MULTISCALE/MULTIRESOLUTION
Although WS analyzes the signal with time-varying filters based on multiple scaled 'Morlet' wavelets, as shown in Fig. 2, the averaging function that is required for the representation to be invariant to translation removes fine-scale information from $S_1$, depending on the timescale T. This information is then captured by higher orders, mainly $S_2$. In this paper, we look at combining multiple timescales, with T taking the values {46,92,185,371}ms. The selected timescales are based on the observation of [6], who exhaustively evaluated the classification performance of the timescales {23,46,92,185,371,743,1480}ms on the DCASE 2020 Task1a dataset. Their experimental results show that the effective timescales range from 46ms to 371ms.
It is observed that the larger the T, the more high-frequency information is removed from the first order after applying the Gaussian filter, as illustrated in Fig. 3. The lost information is then captured by using another wavelet transformation. In other words, the larger the T, the more information is captured by the second order. This supports the analogy of energy dispersion from the first order to the second order at different T [1][2][3][4]. The high-frequency information is retained in the second or higher orders because the subsequent wavelet modulus transform is applied to $|x \star \psi_{\lambda_1}|$ to achieve a sparse representation [1][2][3][4]. A sparse representation is a representation with condensed wavelet coefficients with highly concentrated information. Hence, applying an averaging function on higher orders will not lead to further loss of information. The timescale T affects not only the averaging operation but also the number of center frequencies and the bandwidth, or scale, of each wavelet filter used to capture the acoustic signal profile. In summary, the Heisenberg uncertainty principle has some impact on WS. There is a need to carefully combine multiple timescales to avoid losing fine-scale acoustic information while preventing redundancy, as shown in Fig. 4.

1) INITIALIZATION OF POPULATION
The Population is the pool of different selected combinations of features. Each set of features represents an input representation, and each feature represents an attribute of the input representation. Relating GA to biology, a feature is termed a gene, and a set of features is termed a chromosome.
The initial step involves the creation of a pool of chromosomes. The chromosomes are created through a random process of combining various genes. Thus, a population is formed containing variants of the chromosome. The Population will then undergo evolution, which is the continuous process of weeding out the weaker chromosomes while encouraging the reproduction between the fittest chromosomes, with the goal of achieving the chromosome with the 'best' genes. As such, the Population will evolve continuously from the initial Population until it meets an end condition.

2) FITNESS FUNCTION
The fitness function is the evaluation algorithm to determine the strength of the chromosomes. Hence, the product of the fitness function is the fitness value.

3) CROSSOVER
Crossover is the technique of mixing the genes of a pair of parents to produce offspring. Both crossover and mutation are techniques used to create new chromosomes based on the selected group of stronger chromosomes from the Population. This gives rise to the concept of parent and offspring, where the offspring are the newly created chromosomes.

4) MUTATION
Mutation is an alteration in a chromosome. For example, a chromosome can be visualized in binary format as a sequence of bits, [0,0,0,1,1,1], where 0 means that the gene at that position is not selected and 1 means that it is selected. When a mutation occurs, bits in the sequence flip from 0 to 1 or, vice versa, from 1 to 0. For example, [0,0,0,1,1,1] can mutate to [1,0,0,0,1,1]. Mutation is crucial to prevent the GA from getting stuck at a local optimum. However, a high mutation rate might impede the GA from finding an optimal solution.
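The crossover and mutation steps on a binary chromosome can be sketched as below. This is a minimal illustration, assuming single-point crossover and independent per-gene bit flips; the paper does not specify these variants at this point, and the function names `crossover` and `mutate` are hypothetical.

```python
import random

def crossover(parent_a, parent_b, rng):
    """Single-point crossover (an assumed variant): head of A joined to tail of B."""
    point = rng.randrange(1, len(parent_a))   # cut strictly inside the chromosome
    return parent_a[:point] + parent_b[point:]

def mutate(chromosome, rate, rng):
    """Flip each gene (0 <-> 1) independently with probability `rate`."""
    return [1 - gene if rng.random() < rate else gene for gene in chromosome]

rng = random.Random(0)
child = crossover([0, 0, 0, 1, 1, 1], [1, 1, 0, 0, 0, 1], rng)
mutated = mutate(child, rate=0.1, rng=rng)
```

A rate of 0 leaves the chromosome unchanged and a rate of 1 flips every gene, which is why a very high rate erases what the offspring inherited from its parents.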

5) END CONDITION
Lastly, the end condition, or termination criterion, determines when the GA operation stops. This can be set as the number of iterations or generations. One generation consists of step 2 to step 4.
While the application of GA is robust in search optimization [16], we are interested in a particular use case, where GA is used to select the optimal features or for feature selection. The application of GA for feature selection on wavelet scattering is described in subsection A.

A. GENETIC ALGORITHM FOR FEATURE SELECTION
The concept of using GA for feature selection is not entirely new and has been well established [12,14-27], with a history dating back to 1990 [15,21]. GA for feature selection follows the same workflow as the GA described in Section III, but with the objective of maximizing or maintaining classification accuracy while using a subset of the original features. We have a set of features used in a classification model to perform prediction, and this set of features has a large dimension (e.g., 500 to 1000 features). There is a possibility of redundancy or irrelevancy, which might impede the model from learning correctly and add unnecessary computational complexity.
Hence, GA is employed to search for the optimal feature subset, with model performance as the fitness function. The five processes are translated into pseudocode in Table I. GA provides a straightforward mapping to the input representation by encoding each feature with a binary value: 0 means that the feature is not selected and 1 means it is selected.
There are two well-established GA approaches [10], the 'filter approach' and the 'wrapper approach'; subsequent development has branched into 'embedding' and 'hybrid' approaches [9,15,19]. While the hybrid technique utilizes a combination of the filter and wrapper approaches [11,12], the embedding approach performs feature selection as part of the pattern recognition algorithm [32].
The difference between the filter and wrapper approaches lies in the fitness function. The wrapper approach, which is more commonly adopted [15], evaluates the goodness of the selected features based on the classification model, usually the model's classification accuracy. In the GA wrapper approach, the fitness function is therefore tied to the classification algorithm. On the other hand, the filter approach uses an independent and less computationally intensive algorithm to evaluate the feature subset; the classification algorithm is applied only after the optimal feature subset has been found. The objective of the filter approach is to drastically reduce the computational cost of the wrapper approach caused by repeated training and evaluation of the classification model, such as deep CNN models. [13,15,16] provide comparative studies between the filter and wrapper approaches, and their findings suggest that the wrapper approach outperforms the filter approach in classification accuracy. This can be attributed to the core algorithm that evaluates the chromosome's fitness. Filter approach algorithms such as information theory, correlation-based, or distance measure approaches might only capture limited interaction between the features [13]. Furthermore, what is deemed insignificant by the filter approach algorithm might be considered useful by the wrapper approach.
Hence, this paper focuses on the wrapper approach, as we are dealing with WS, whose first and second-order coefficients have very different magnitudes and whose second order is a sparse representation. Furthermore, a wrapper approach provides the necessary framework to integrate WS with CNN, tapping on the discriminative advantage of CNN [6]. Following the framework presented by [21], we propose our implementation of GA for WS, which has a meta-heuristic nature to find the global optimal feature subset. In image classification, features can be flattened into a 1D array and selected pixel-wise; this does not work well for a time-frequency representation such as wavelet scattering, because WS has a feature structure [9]. Hence, we propose that feature selection be performed on the frequency axis. In the WS context, we select a subset of wavelet coefficients, and an example is illustrated in Fig. 5. In signal processing, selecting a subset of wavelet coefficients is equivalent to applying bandpass filters on the signal. However, the application of bandpass filters requires strong expert knowledge of the domain; otherwise, one might filter out important information. In the case of ASC, it is even more challenging to decide which frequency spectrums are important, as a scene consists of multiple sound sources and events with various frequency profiles.
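Selection along the frequency axis amounts to keeping or dropping whole frequency rows of the representation according to the binary chromosome. A minimal sketch, assuming the WS is stored as a nested list indexed as [frequency][time][channel]; the function name `select_frequency_rows` and the toy array are hypothetical.

```python
def select_frequency_rows(ws, chromosome):
    """Keep only the frequency rows whose gene is 1; time and channel axes are untouched."""
    if len(ws) != len(chromosome):
        raise ValueError("chromosome length must equal the number of frequency rows")
    return [row for row, gene in zip(ws, chromosome) if gene == 1]

# Toy representation: 6 frequency rows, 4 time frames, 3 channels.
ws = [[[f * 100 + t * 10 + c for c in range(3)] for t in range(4)] for f in range(6)]
reduced = select_frequency_rows(ws, [0, 0, 0, 1, 1, 1])
```

Here `reduced` has 3 frequency rows instead of 6, while each surviving row keeps its full time and channel content, so the time-frequency structure within a row is preserved.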
This paper proposed two configurations to select the optimal feature subset, and their difference depends on the chosen CNN model. The two CNN models are discussed in Section IV, followed by the two implementations described in Section V.

IV. CONVOLUTIONAL NEURAL NETWORK
As with all wrapper approaches, the effectiveness of GA is highly dependent on the classification algorithm. In this paper, the model's classification accuracy becomes the fitness function. Hence, following the current state-of-the-art framework presented in [33], this paper uses CNN as the classification algorithm and the fitness function. CNN is one of the current leading neural networks that can learn complex structures such as time-frequency representations and images, and it is not limited to 2-dimensional images [34]. CNN is less computationally heavy than the standard neural network or dense network, as the nodes or features are not fully connected. Instead, CNN supports weight sharing by having multiple small filters (e.g., 3 x 3) convolve over a feature map. Hence, each small filter provides a localized understanding of the previous feature map. Coupled with a non-linear activation function and continuous downsampling of the earlier feature maps, CNN can capture low-level abstract features [34], thus learning complex input representations. The following subsections are organized as follows: Subsection A presents the two CNN models, and Subsection B elaborates on the convolution block design. Lastly, Subsection C discusses other hyper-parameters and configurations applied to the CNN model.

A. TWO-STAGE MULTI-ORDERS CNN
This paper proposes two CNN models, one for the fitness function of each configuration. The first CNN model is a direct adaptation from [6]; for ease of reference, the two-stage convolutional neural network is abbreviated as TSCNN. TSCNN is inspired by the work of [35,36], which splits the log mel-spectrogram into frequency bins. The binning process simply divides the log mel-spectrogram into large groups covering the low, mid, and high-frequency spectrum. Then, in the first stage, the CNN processes this set of frequency bins in parallel, as illustrated in Fig. 6. In the context of CNN architecture design, this process is termed 'group convolution' [6,34,35,36]. [6] adopted this concept and applied it to WS, which conveniently splits into the first and second-orders. [6] explained that the split is necessary to tackle the large disparity in magnitude between the first and second-order coefficients. Furthermore, they re-termed group convolution as 'specialized learning', as the purpose of group convolution is to allow the network to learn the first and second-orders independently before learning them together in the second stage, termed 'centralized learning'. The renaming also connects to the second stage, where centralized learning means learning the concatenated features of the first and second-orders after understanding them separately. The first stage is named specialized learning and the second stage centralized learning mainly to describe the configuration of the TSCNN. [6] suggested the need to explore various specialized and centralized learning setups, which is reflected in the difference between Fig. 6 and Fig. 7. The TSCNN architecture is suitable for mixed first and second-orders, as the CNN model handles the first and second-order separately. However, the current TSCNN cannot accommodate more timescales, and modification is required. In this paper, we explore the combination of 4 different timescales, hence $T = \{46, 92, 185, 371\}$ ms.
We extended TSCNN to include two additional second-orders [35], so that the first stage comprises one first-order and three second-orders of different timescales. This extended TSCNN is named TSCNN-2, and the architecture is depicted in Fig. 7. Notably, following the suggestion of [6], this paper also performed an ablation study on the number of convolution blocks stacked for specialized learning and centralized learning. Based on the analysis, we established that additional specialized learning benefits the model, which is reflected in Table II and Fig. 7.

B. CONVOLUTION BLOCK DESIGN
Next, this subsection describes the convolution block (CB) design, which is the fundamental building block of CNN. In other words, a CNN is typically built with stacks of similar convolution blocks, except that the number of channels increases as the network goes deeper. Following the designs of [6,29], the convolution block is assembled from the components below, as shown in Fig. 6:

1) 3-BY-3 CONVOLUTION LAYER
The 3-by-3 convolution layer (3x3Conv) design has revolutionized most current CNN architectures [6,28,29,32,35], which usually stack two 3x3Conv layers. [37] developed 3x3Conv to reduce the computational complexity of the model. They investigated the possibility of reducing the filter size from 7-by-7 or 5-by-5 to 3-by-3 and discovered that a stack of two 3x3Conv layers has the same discriminative capability as a larger counterpart while reducing the computational complexity of the model.

This table presents the proposed TSCNN-2 architecture design. The first row is the input representation which is fed to TSCNN-2. $S_1^{T_1}$, $S_2^{T_2}$, $S_2^{T_3}$, $S_2^{T_4}$ represent the first order with $T_1 = 46$ ms and the second-orders with $T_2 = 92$ ms, $T_3 = 185$ ms, and $T_4 = 371$ ms, respectively. The column "Strides" gives the strides used during the convolution operation, and the column "C" gives the number of channels. Lastly, the column "output feature map" presents the dimension of the feature maps after the convolution block. SCB stands for stacked Convolution Block (CB). Each stack consists of 2 series of CBs with different strides, and each CB is applied to each input feature, $S_1^{T_1}$, $S_2^{T_2}$, $S_2^{T_3}$, $S_2^{T_4}$. Other than SCB1, which has a different stride setup, the SCBs follow a stride of (1,2) for $S_1^{T_1}$ and (2,2) for the rest, $S_2^{T_{2,3,4}}$, in the first series, and a stride of 1 for all features in the second series (refer to SCB2a,2b).
Hence, all CB uses a 3x3Conv as the convolution layer as illustrated in Fig. 6. Zero padding is applied for all 3x3Conv(s).
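The saving from stacking small filters can be checked with simple arithmetic. The sketch below assumes stride-1 convolutions, equal input and output channel counts, and no bias terms; the helper names are hypothetical. It compares the weight count of two stacked 3x3 layers with a single 5x5 layer covering the same receptive field.

```python
def conv_weights(kernel, c_in, c_out):
    """Number of weights in one kernel x kernel convolution layer (bias ignored)."""
    return kernel * kernel * c_in * c_out

def stacked_receptive_field(kernel, layers):
    """Receptive field of `layers` stacked stride-1 convolutions of the same kernel size."""
    return 1 + layers * (kernel - 1)

channels = 64  # hypothetical channel count for illustration
two_3x3 = 2 * conv_weights(3, channels, channels)
one_5x5 = conv_weights(5, channels, channels)
field = stacked_receptive_field(3, 2)
```

Under these assumptions, two 3x3 layers see a 5x5 region with 18 weights per channel pair instead of 25, and three stacked 3x3 layers cover a 7x7 region.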

2) AVERAGE POOLING
Average pooling (AvgP) is mainly used to reduce the dimensionality of the feature map as the network goes deeper. Notably, average pooling does not involve learning or trainable weights. In this paper, we used an average pooling with a filter size of 3-by-3.

3) RESIDUAL LAYER
Another revolutionary design, created by [38], is the residual network. Simply put, in a residual network the previous feature map is propagated forward and added to the new feature map right after each convolution operation. Let x be a feature map and $x'$ the output of BN+ReLU+3x3Conv; the result of a residual network will be (refer to Fig. 7):
$$x + x'$$
By propagating the previous feature map forward, degradation is mitigated even when the network goes extremely deep (e.g., 1000 layers). Similar to 3x3Conv, this residual network architecture is heavily adopted [6,35,36].

4) BATCH NORMALIZATION
[39] introduced batch normalization (BN) to accelerate and smooth the learning process. BN applies a standard normalization on a batch of feature maps using the batch mean and standard deviation. The key difference between a standard normalization and BN is that BN has two learnable parameters: scaling and offset.
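A minimal sketch of the normalization step for a single channel, assuming training-time batch statistics and a small constant `eps` for numerical stability; the names `gamma` (scaling) and `beta` (offset) are conventional labels chosen here, not taken from the paper.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a batch of scalars to zero mean and unit variance, then scale and shift."""
    mean = sum(batch) / len(batch)
    var = sum((v - mean) ** 2 for v in batch) / len(batch)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in batch]

normalized = batch_norm([1.0, 2.0, 3.0, 4.0], gamma=2.0, beta=1.0)
```

After normalization the batch mean equals the offset and the standard deviation is approximately the scaling factor, both of which the network is free to learn.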

5) ACTIVATION FUNCTION
In this paper, following the general selection by most authors [6,34-38], ReLU, a non-linear function given by $f(x) = \max(0, x)$, is selected as the activation function; it maps all negative values to zero.

6) FULLY CONVOLUTIONAL NETWORK
While this is not part of the convolution block design discussed above, the 'fully convolutional network', better known as FCN [40], is the last convolution block that completes the CNN before the softmax classification layer. It earns the name FCN because no fully connected or dense layer, commonly found in older models [34,37,38], is used to complete the CNN. FCN is constructed with a convolution layer with a filter size of 1-by-1, and the channels are compressed so that their number matches the number of classes. This is followed by a global average pooling (GAP) layer that flattens the feature map channel-wise into a vector as the input to the softmax layer. Lastly, by stacking the convolution blocks together and ending them with FCN, this paper presents the TSCNN-2 architecture in Table II. TSCNN is not shown here as it is a direct adaptation of [6].
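The FCN head described above can be sketched as three small steps: a 1-by-1 convolution that maps the channels to the number of classes, global average pooling over the spatial axes, and softmax. This is an illustrative sketch in plain Python, assuming a feature map stored as [height][width][channel] and no bias term; all function names and the toy values are hypothetical.

```python
import math

def conv1x1(fmap, weights):
    """1x1 convolution: weights[c_out][c_in] applied independently at every position."""
    return [[[sum(w * v for w, v in zip(w_row, pixel)) for w_row in weights]
             for pixel in row] for row in fmap]

def global_average_pool(fmap):
    """Average each channel over all spatial positions, giving one value per channel."""
    h, w, c = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [sum(fmap[i][j][k] for i in range(h) for j in range(w)) / (h * w)
            for k in range(c)]

def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]

def class_probabilities(fmap, weights):
    return softmax(global_average_pool(conv1x1(fmap, weights)))

# Toy 2x2 feature map with 3 channels, compressed to 2 classes.
fmap = [[[1, 2, 3], [3, 2, 1]], [[1, 2, 3], [3, 2, 1]]]
probs = class_probabilities(fmap, weights=[[1, 0, 0], [0, 1, 0]])
```

Because the head contains no dense layer, the same weights apply to feature maps of any spatial size.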

C. PARAMETERS SETUP
Other than the CNN block and architecture design, other settings are involved when training the model. This subsection provides that information as follows:

1) LEARNING STRATEGY
Following the learning strategy of [6,35,36], this paper uses stochastic gradient descent (SGD) [41] with warm restarts scheduled at the epoch indexes (3,7,15,31,63,126,254) and a learning rate ranging from 0.1 (max) to 0.00001 (min). Within each cycle, the learning rate gradually decreases at each step following half a cosine curve, called 'cosine annealing'. This provides a high learning rate to quickly approach a local minimum, then a gradual decrease in learning rate to narrow into it. At each of the indexes listed above, the learning rate is restarted to the maximum learning rate to ensure the model does not get stuck at the local minimum. [41] further proposed restarting the learning rate at shorter intervals at the beginning of the training phase to accelerate the learning process.

VOLUME XX, 2017

FIGURE 9. Process Flow of GATSCNN and GATSCNN-2 implementation. The entire system starts with the initialization of the Population described in Fig.5a and Table II. Next, we evaluate the fitness of all the chromosomes, and depending on the configuration, a different CNN model is selected. Subsequently, we check whether the system has already selected a set of parents. This check caters to the first iteration, where no parents have been chosen yet. Hence, all later iterations always flow to combining parents and offspring. Then, based on the classification accuracy, we keep only the top-performing chromosomes, and this batch of chromosomes becomes the next parents. This brings us to the next step of creating the offspring based on Table II, step 2. With the offspring, we verify whether the stopping criterion is satisfied. In this case, our stopping criterion depends on the total number of generations. If the criterion is not met, we continue with another round of evaluation, selection of new parents, and breeding of offspring until we reach our stopping goal. Upon reaching the stopping goal, we have the best feature subset selected by GA. Finally, we train and evaluate with Mix Up and present the classification result in Section VII.
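The learning-rate schedule described in point 1 can be sketched as a function of the epoch index. This is a minimal sketch under two labeled assumptions: each listed index is treated as the first epoch of a new cycle (where the rate resets to the maximum), and the cosine is scaled so that the last epoch before a restart reaches the minimum exactly. The function name `learning_rate` is hypothetical.

```python
import bisect
import math

RESTARTS = (3, 7, 15, 31, 63, 126, 254)  # epoch indexes from the text

def learning_rate(epoch, restarts=RESTARTS, lr_max=0.1, lr_min=1e-5):
    """Cosine-annealed learning rate with warm restarts at the given epoch indexes."""
    bounds = (0,) + tuple(restarts)
    if not 0 <= epoch < bounds[-1]:
        raise ValueError("epoch is outside the scheduled range")
    cycle = bisect.bisect_right(bounds, epoch) - 1      # which cycle the epoch is in
    start, end = bounds[cycle], bounds[cycle + 1]
    span = max(end - start - 1, 1)                      # assumption: hit lr_min at cycle end
    cosine = 0.5 * (1.0 + math.cos(math.pi * (epoch - start) / span))
    return lr_min + (lr_max - lr_min) * cosine

schedule = [learning_rate(e) for e in range(RESTARTS[-1])]
```

Under these assumptions, epoch 62 is the last epoch of the cycle that starts at 31, so it is trained with the minimum learning rate just before the restart at epoch 63.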

2) WEIGHT INITIALIZATION AND REGULARIZATION
'He initialization' [42] is used for weight initialization, and L2 regularization is subsequently applied.

3) BATCH SIZE AND EPOCH
In this paper, a mini-batch size of 64 and a total of 63 epochs are selected for the Genetic Algorithm with TSCNN (GATSCNN) and GATSCNN-2 runs. The decision was based on our observation of training TSCNN and TSCNN-2 (without Mix Up) with 254 epochs for at least 10 folds. Their best result usually occurs right at the 62nd epoch. This phenomenon is due to the effective learning strategy of SGD with warm restart presented in point 1. During the evaluation of the best feature subset, we used a mini-batch size of 32 and 126 epochs, as we applied the data augmentation 'Mix Up' described in point 4.

4) DATA AUGMENTATION
Based on an analysis of DCASE [33] and suggestions from [6,35,36], this paper adopted 'Mix Up' [43] for data augmentation, with an alpha of 0.2 for TSCNN-2 and 0.4 for TSCNN when evaluating the models. The value of alpha is selected based on an exhaustive search in which we evaluated alpha ranging from 0.1 to 1 for each model. 'Mix Up' is not applied during the GA run, as the task of GA is to find the optimal feature subset and the augmentation might impede the search process.
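For reference, 'Mix Up' forms a convex combination of two training samples and their labels, with the mixing weight drawn from a Beta(alpha, alpha) distribution. The sketch below uses flat lists and one-hot labels for illustration; the function name `mixup` and the toy inputs are hypothetical.

```python
import random

def mixup(x_a, y_a, x_b, y_b, alpha, rng):
    """Blend two samples and their one-hot labels with weight lam ~ Beta(alpha, alpha)."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y, lam

rng = random.Random(0)
# alpha = 0.2 is the value reported above for TSCNN-2.
mixed_x, mixed_y, lam = mixup([1.0, 0.0], [1, 0], [0.0, 1.0], [0, 1], alpha=0.2, rng=rng)
```

With a small alpha, most draws of `lam` fall close to 0 or 1, so the mixed sample usually stays close to one of the two originals.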

V. IMPLEMENTATION
The collective combination of WS for feature extraction, GA wrapper approach for feature selection or 'precise' bandpass filtering, and CNN as the classifier, is being proposed and depicted in Fig. 10. Figure 9 shows the process flow of the integrated system. For ease of reference, the first configuration is termed 'GATSCNN', and the second configuration is 'GATSCNN-2'.
While GA has the advantage of achieving the global optimum [14-21,29-31], a careful selection of the mutation rate is needed [19,22]. A mutation operation can be described as changing the binary bit representation from 0 to 1 or vice versa, which means that the features selected based on the parents are flipped. The higher the rate, the more different the offspring are from their parents. Hence, if we apply a high mutation rate to all the offspring, the offspring are more likely to lose the important genes from their parents that made them fit in the first place. In other words, we might nullify the effect of crossover, where offspring are created based on the best-performing parents, thus crippling the heuristic nature of GA. Conversely, a low mutation rate might settle in a local optimum, as the search becomes highly dependent on the initial Population and the population size. This paper therefore proposes a high mutation rate only for the last two offspring to ensure a global search effect.
Our population initialization strategy is relatively straightforward, and the decision is based on observations from our initial experiments. We initially tested the TSCNN models by randomly picking a subset of features, mimicking the initialization process, to hasten the GA process in finding the optimal features. The initial test is equivalent to running the GA once, and the steps are as follows. Firstly, we randomly pick a subset of features while controlling the number of features to be selected. In this test, we selected 25%, 50%, 60% and 70%, sequentially. Based on our GA setup, each selection consists of 20 different combinations of feature subsets. Next, we fed the input representation to TSCNN as depicted in Fig. 6. Our findings led us to pick 70% of the features, as this gives the classification accuracy nearest to that of the full features. For GATSCNN-2, we set the initial feature selection rate to 50%. This is based on our observations when we tried out various combinations while building 'FS1', presented in Section VII, C.

The GA parameters are used for both GATSCNN and GATSCNN-2 implementation. This paper selected 8 parents, or the 8 fittest chromosomes, for the reproduction process. The parents produce 12 offspring, and 2 of the offspring have a mutation rate of 0.35 to reduce the chance of settling in a local optimum while still allowing the search to converge. The number of generations is set at 10. Hence, 200 models are trained for each GA implementation.

Notably, the number of generations is set at 10 due to the resource limitation of evaluating a combination of 1000 TSCNN and TSCNN-2 models. In addition, in a separate experiment with 20 iterations, the GA no longer demonstrated any gain after the 10th iteration, i.e., it reached its 'plateau'. The GA parameters used in this paper are presented in TABLE III.
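The GA loop with the parameters above (8 parents, 12 offspring of which the last 2 use a mutation rate of 0.35, 10 generations, 70% initial selection rate) can be sketched as follows. Several parts are assumptions made only for illustration: `toy_fitness` is a hypothetical stand-in for training a CNN and reading its accuracy, uniform crossover is an assumed operator, and the low mutation rate of 0.05 for the other offspring is not stated in the text.

```python
import random

def init_chromosome(n_features, keep_fraction, rng):
    """Random binary chromosome with a fixed fraction of selected features."""
    ones = set(rng.sample(range(n_features), round(keep_fraction * n_features)))
    return [1 if i in ones else 0 for i in range(n_features)]

def mutate(chromosome, rate, rng):
    return [1 - g if rng.random() < rate else g for g in chromosome]

def toy_fitness(chromosome):
    """Hypothetical stand-in for CNN accuracy: reward every third gene, penalize size."""
    informative = range(0, len(chromosome), 3)
    hits = sum(chromosome[i] for i in informative)
    return hits / len(informative) - 0.1 * sum(chromosome) / len(chromosome)

def run_ga(n_features, fitness, rng, n_parents=8, n_offspring=12, n_high_mut=2,
           high_rate=0.35, low_rate=0.05, generations=10, init_keep=0.7):
    population = [init_chromosome(n_features, init_keep, rng)
                  for _ in range(n_parents + n_offspring)]
    evaluations, history = 0, []
    for _ in range(generations):
        # Evaluate every chromosome (one model training each in the wrapper setting).
        scored = sorted(((fitness(c), c) for c in population),
                        key=lambda pair: pair[0], reverse=True)
        evaluations += len(population)
        history.append(scored[0][0])
        parents = [c for _, c in scored[:n_parents]]      # keep the fittest
        offspring = []
        for i in range(n_offspring):
            a, b = rng.sample(parents, 2)
            child = [ga if rng.random() < 0.5 else gb for ga, gb in zip(a, b)]
            rate = high_rate if i >= n_offspring - n_high_mut else low_rate
            offspring.append(mutate(child, rate, rng))
        population = parents + offspring
    return parents[0], history, evaluations

best, history, evaluations = run_ga(30, toy_fitness, random.Random(0))
```

With 20 chromosomes evaluated in each of the 10 generations, the loop performs 200 fitness evaluations, matching the 200 trained models mentioned above. Because the parents are carried into the next population and the toy fitness is deterministic, the best score never decreases between generations; with a stochastic fitness such as CNN training this would hold only approximately.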

A. DATASET
This paper uses the DCASE 2020 challenge task 1a development dataset [33] to evaluate our proposed GA framework. DCASE 2020 task 1a is an ASC challenge where we need to develop a model(s) to predict the whereabouts of an object or person based on an audio recording. The dataset consists of 10-second audio segments from 10 acoustic scenes: three indoor scenes (airport, shopping mall, metro station), four outdoor scenes (pedestrian street, public square, street traffic, urban park), and three traveling-on-transport scenes (tram, bus, metro). Each audio recording is single-channel (mono) with a 44.1 kHz sampling rate and 24-bit resolution. The development dataset is split into 70% training and 30% testing as per the challenge setup.

B. FEATURE EXTRACTION
The entire 10s raw waveform is then transformed into WS, and following [4][5][6], we further calculate the deltas and delta-deltas. Stacking them together results in a 3-dimensional input representation with a frequency axis, a temporal axis, and a channel axis, where the channel axis comprises the original WS, the deltas of WS, and the delta-deltas.

This is the classification result of an exhaustive search of the mixed first and second-order with TSCNN. $T_1$, $T_2$ denote the timescale for the first and second-order, respectively. The first 4 rows are the normal WS, and the rest of the tests are the combinations of mixed first and second-order. The highlighted group of mixed first and second-orders has the best classification performance. However, we noticed that the classification result plateaus at around 70%. This finding solidifies our hypothesis that an acoustic scene requires even more diversification of timescales, and just mixing two timescales is not enough.
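The stacking of WS with its deltas and delta-deltas described in this subsection can be sketched as below. The delta formula is an assumption: a simple central difference along the time axis with edge replication is used here, whereas the actual pipeline may use a regression-based delta over a wider window. The function names and the toy input are hypothetical.

```python
def delta(rows):
    """Central difference along time for each frequency row (edges replicated)."""
    out = []
    for row in rows:
        padded = [row[0]] + list(row) + [row[-1]]
        out.append([(padded[t + 2] - padded[t]) / 2.0 for t in range(len(row))])
    return out

def stack_with_deltas(ws):
    """Return a [frequency][time][channel] array with channels (WS, delta, delta-delta)."""
    d1 = delta(ws)
    d2 = delta(d1)
    return [[[ws[f][t], d1[f][t], d2[f][t]] for t in range(len(ws[f]))]
            for f in range(len(ws))]

# Toy WS with 2 frequency rows and 4 time frames.
stacked = stack_with_deltas([[0, 1, 2, 3], [5, 5, 5, 5]])
```

The result keeps the frequency and time dimensions of the input and adds a channel axis of size 3, so a constant row such as the second one yields zero deltas and delta-deltas.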

C. HARDWARE SPECIFICATION
The entire system is implemented in Python and run on a Windows desktop with an Intel i7-11700F processor, 32 GB of RAM, and an RTX 3090 GPU.

VII. RESULT
This section discusses the results of three experiments evaluated on the DCASE 2020 Task 1a dataset described in Section VI, A. Each experiment is trained and tested 5 times, and the mean of the 5 runs is reported in the results tables. The first experiment acts as a 'prologue' to our investigation of applying GA to WS: we evaluated all combinations of mixed WS, as discussed in Subsection A. Based on the findings from Subsection A, the second experiment evaluated whether GATSCNN can obtain an optimal feature subset, as discussed in Subsection B. Lastly, taking a step further, we explored the application of GA to obtain an optimal feature subset when four timescales are combined. This result is discussed in Subsection C.

A. STUDY OF MIXED FIRST AND SECOND-ORDER
Prior to conducting the experiments presented in Subsections B and C, we exhaustively searched through combinations of the first and second-order of different timescales, as suggested by [44]. In this experiment, we paired first-order coefficients with timescales T1 = {46, 92, 185, 371} ms with second-order coefficients with timescales T2 taking the same values as T1, subject to the constraint T1 ≠ T2. The best combination of first and second-order is then selected for the GA feature selection. Based on Table IV, the best combination occurs when a small T1 (46 ms) is combined with any T2 > 46 ms. Combining the first and second-order the other way around, where T1 > T2, in most cases results in poorer performance than not combining them (Table IV, rows 1, 2 & 3). This finding is consistent with the theory of energy dispersion from first to second-order as T gets larger, suggested by [1][2][3]. In addition, combinations in which the first order has the smallest T present better results (compare Table IV, rows 11, 14, 15 & 16). Hence, based on the results, we derived a general rule of T1 < T2 when constructing a 'mixed' first and second-order. Another observation is that mixing the first or second-order of the 46 ms timescale with a larger timescale results in a considerable boost of approximately 4% in classification performance, as shown in Table IV, row 1, compared to Table IV, rows 5, 6, 7, 8, 11 & 14. This phenomenon can be attributed to the acoustic profile of the acoustic scene, where important acoustic information is mainly captured by a larger timescale, T > 46 ms. However, the papers [1][2][3] do not indicate the exact values and the optimal combinations presented in this experiment.
Table caption: This is the classification result of GATSCNN; only the second-order is of interest, while the first order is left untouched in all experiments. The first to third rows are the results from Table III, rows 5, 6 & 7, and are used to compare runs with and without GA. Rows 7, 8 & 9 are combinations of models obtained by averaging the final logits. For row 10, we concatenated all the second-order coefficients before feeding them to TSCNN. In rows 4, 5 & 6, under the column "feature subset", the value in brackets is the number of features not selected.
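The search space of this experiment can be enumerated with a short sketch. The variable names are hypothetical, and the ordering of the pairs is not meant to reproduce the row order of Table IV.

```python
from itertools import permutations

TIMESCALES_MS = (46, 92, 185, 371)

# Unmixed WS: first and second order share one timescale.
plain = [(t, t) for t in TIMESCALES_MS]

# Mixed WS: ordered pairs (T1, T2) with T1 != T2.
mixed = list(permutations(TIMESCALES_MS, 2))

# Pairs that follow the rule of thumb derived above, T1 < T2.
recommended = [(t1, t2) for t1, t2 in mixed if t1 < t2]

total_configurations = len(plain) + len(mixed)
```

Four unmixed and twelve mixed configurations give sixteen in total, which agrees with the highest row number (16) cited from Table IV. Only six of the twelve mixed pairs satisfy T1 < T2.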

B. GATSCNN EVALUATION
The findings from Subsection A motivate the evaluation of GATSCNN using the features from Table IV, rows 5, 6 & 7. The results are tabulated in Table V. Notably, our experiments demonstrated that applying GA to the first order did not yield any improvement, so GA for the first order is not evaluated further. Hence, for experiments GATSCNN and GATSCNN-2, GA is applied only to the second-order. The application of GA has shown optimistic
Redundancy is likely to occur in second-order coefficients, as they are a more refined representation of the first-order coefficients (see (3)). Hence, it is possible that a sound source is captured in multiple second-order coefficients. On the other hand, irrelevancy occurs when no important sound sources reside in a frequency spectrum, or when a frequency spectrum shares similar second-order coefficients with the rest of the data. This renders those second-order coefficients not useful for discriminating between acoustic scenes.
Irrelevancy is evident when we combine the models by averaging the final logits from the softmax layer, as shown in Table V
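Combining models by averaging their outputs can be sketched as follows. This is a minimal illustration that assumes the quantity averaged is the per-class softmax output of each model. The function names and the example logits are hypothetical and are not results from the paper.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def average_ensemble(per_model_logits):
    """Average softmax outputs of several models and pick the top class.

    per_model_logits: list of arrays, each of shape (n_clips, n_classes).
    """
    probs = np.mean([softmax(logits) for logits in per_model_logits], axis=0)
    return probs, probs.argmax(axis=-1)


# Two hypothetical models, two clips, three classes (made-up logits).
model_a = np.array([[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
model_b = np.array([[0.0, 0.0, 1.0], [0.0, 3.0, 0.0]])
probs, predictions = average_ensemble([model_a, model_b])
```

In the first clip the two models disagree, and the average favors the class on which one model is most confident; in the second clip they agree.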

C. GATSCNN-2 EVALUATION
In the next experiment, we evaluate the combination of all the timescales selected in this paper: 46 ms, 92 ms, 185 ms, and 371 ms, as mentioned in Section III, B. While evaluating a single second-order coefficient in Subsection B highlighted the existence of redundancy and irrelevancy, the effect is much greater when we combine all the timescales, as shown in Table VI. Prior to evaluating GATSCNN-2, we construct another feature with all the timescales based on the second-order feature subsets from Table V, rows 4, 5 & 6, and term it 'FS1'. FS1 is constructed using a statistical-based approach on each timescale.
Table caption: This table presents the computational time required before and after feature selection, in columns TIME and FS TIME, respectively. The computational time is recorded as the average time to train each epoch with a batch size of 32.
Table caption: ... to the reference of the paper. 'FEAT' stands for feature, and under feature, LMS stands for log mel-spectrogram. The next column, 'DATA AUG', stands for data augmentation, and the column 'EVAL/DEV ACC' stands for classification accuracy on the evaluation dataset and development dataset. Under column 'DATA AUG', 'SpecAug' stands for spectrum augmentation, which involves randomly masking the frequency axis and time axis before training. DRC stands for Dynamic Range Compression. We denote '*Etc' for [46], as they used several data augmentation techniques. Under column "MODEL", we included the result of a single model labelled 'SINGLE'. This is to provide another perspective when comparing against our Genetic Algorithm with Two-stage Convolution Neural Network (GATSCNN-2) model. Notably, most of the models adopted a Two-stage Convolution Neural Network (TSCNN)-like or TSCNN-variant approach, and all of them use log mel-spectrogram. While not stated in the table, all the models use the same learning strategy indicated in Section IV, C, point 1.
Let us use Table V, row 4, as an example. The final product of GATSCNN provides the best solution and a set of parents. This set of parents can be regarded as the 'best solutions' from the GA run, and the best solution as 'the best of the best'. With the best solutions, or parents, we count the number of times each feature, or gene, is selected. Hence, in this context, a highly desired feature will have a frequency count of 8 and the worst case will have a frequency
Next, we compared FS1 with 'FS2'. 'FS2' is the optimal feature subset output from running GATSCNN-2. The difference between FS1 and FS2 is that FS2 has feature interaction between the second-orders of different timescales during the feature selection process. The selection based on the three timescales contributes directly to the classification accuracy. Hence, based on Table VI, FS2 has the best performance for all the outputs, whether it is evaluated on TSCNN or TSCNN-2. Comparing FS1 and FS2 reconfirms the importance of feature interaction [13,17]. The lack of feature interaction is even more apparent in Table VI, row 4.
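The occurrence count over the GA parents can be sketched as below. The count itself follows the description above, with 8 parents so that the maximum count is 8. The thresholding rule in `subset_by_count` is a hypothetical illustration, since the exact statistical rule used to build FS1 is not given in this section, and the parent masks are made-up data.

```python
import numpy as np


def selection_counts(parents):
    """Count how many parents select each feature.

    parents: binary array of shape (n_parents, n_features), 1 = selected.
    """
    return np.asarray(parents).sum(axis=0)


def subset_by_count(parents, min_count):
    """Hypothetical rule: keep features selected by at least min_count parents."""
    return np.flatnonzero(selection_counts(parents) >= min_count)


# Eight hypothetical parents over six features (made-up masks).
parents = np.array([
    [1, 1, 0, 1, 0, 1],
    [1, 0, 0, 1, 0, 1],
    [1, 1, 0, 1, 0, 0],
    [1, 0, 0, 1, 1, 1],
    [1, 1, 0, 0, 0, 1],
    [1, 0, 0, 1, 0, 1],
    [1, 1, 0, 1, 0, 0],
    [1, 0, 0, 1, 0, 1],
])
counts = selection_counts(parents)
kept = subset_by_count(parents, min_count=5)
```

Because each timescale is counted independently in this sketch, it cannot capture interaction between timescales, which is the limitation of FS1 discussed above.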
Indeed, the combination of the four timescales has been shown to produce features with high redundancy and irrelevancy, which resonates with the plot shown in Fig. 4. Thus, the application of GA has removed redundancy and irrelevancy while achieving better performance, as exhibited by

E. COMPARISON WITH DCASE 2020 MODELS
Table VIII tabulates the top 3 models from DCASE 2020 Task 1a. The models are ranked by their classification results on the evaluation dataset. Notably, all the authors [45][46][47] and this paper adopted a two-stage convolutional architecture [6,35,36]. While their best-performing models are all ensemble models, we also provide the results of their single models so that they can be compared with our model on the same footing.
Based on Table VIII, our model has the edge over [47], regardless of whether we compare against its single model or its ensemble model. Compared with [45], our model falls short by 0.4%. The combined comparison between our model, [45], and [47] further establishes the need for higher frequency resolution: [47] uses only 128 frequency bins, while [45] uses double that number (256). Likewise, our experiments established the need for multiple timescales, which has the same effect as increasing the frequency resolution in improving the classification accuracy (compare Table V, row 5, against Table VI, row 5).
In the case of [46], the authors approached the problem with a myriad of data augmentation techniques. Their single-model architecture is quite similar to that of [47]; the only difference is that the number of channels after each convolution is doubled. The application of various data augmentation techniques resulted in a high classification accuracy of 74.6% and 81.6% on the development dataset (single and ensemble models, respectively). However, their accuracy dipped from 81.6% to 76.2% on the evaluation dataset. Comparing the single model of [46] against that of [47], the former is ahead by 2.9% on the development dataset. The estimated gain can range from 0.7% to 1.2%, as no classification accuracy is recorded for the single model of [47]. Estimating from the available ensemble results, the higher classification accuracy is possibly contributed by the multiple data augmentation techniques, although these increase the training time and therefore the computing resources required.
Based on our comparative study, the best way to improve the classification accuracy is to enhance the frequency resolution. However, this comes with a caveat: an increase in frequency resolution translates directly into an increase in computational cost, as shown in Table IX, row 2 compared to row 1, and, for WS, row 4 to row 6. While Table IX presents a computational time comparison between the features, it is still not 'fair' to compare LMS directly against WS, as the CNN architecture contributes significantly to the computational time. Hence, Table IX is mainly used to make two separate comparisons: LMS against higher-resolution LMS, and WS with GA feature selection against WS without it. In addition, it provides a reference for the computational-time competitiveness of our proposed GATSCNN-2.
In summary, we presented a fresh approach that uses GA feature selection to combine multi-timescale WS. It improves the classification accuracy and makes multi-timescale WS competitive in computational time. This approach was not realized in [45][46][47]. Their collectively proposed methods include a direct increase in frequency resolution, improved CNN architecture, improved loss function, incorporating, and an ensemble model. All of these approaches improve the classification accuracy at the price of added computational time.

VIII. CONCLUSION
In this paper, we evaluated the effectiveness of a genetic algorithm (GA) for feature selection on wavelet scattering (WS) using a wrapper approach. Our results suggest that the second-order of WS indeed contains redundant features, and that redundancy is amplified when multiple timescales are combined. We reduced the frequency dimension by at least 27% while maintaining the classification accuracy at around 70%. Feature irrelevancy was observed when we combined the first order with a timescale of 46 ms and the second-order with a timescale of 185 ms. By removing unnecessary features, we achieved an increase in classification accuracy of 1%, and our best Genetic Algorithm with Two-stage Convolution Neural Network (GATSCNN) model has a classification accuracy of 71.12%. Following the successful results of GATSCNN, this paper proposed GATSCNN-2 to combine multiple timescales effectively. GA reduced the number of features from 1851 to 923, a reduction of approximately 50%. It also improved the classification accuracy by 1%, giving this paper's best model a classification accuracy of 73.32%.
In addition, the reduction in frequency dimension translates directly into a saving in computational time of 10% and 40% for GATSCNN and GATSCNN-2, respectively. While we only present the time saved when training the CNN, the reduction should also decrease the time needed to construct WS. However, the disadvantage of all wrapper approaches is the computational cost [7,9,10,11,13,16,17,21,22,23]. Hence, there is a need to explore a more computationally efficient implementation of the GA approach. Another study could branch into multi-objective GA, where minimizing both the feature subset size and the logloss is part of the fitness function.
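One simple way to fold both objectives into a single fitness value is a weighted sum, sketched below. This is an illustrative scalarization only, not a method evaluated in this paper, and it is not a Pareto-based multi-objective GA. The function names, the `weight` parameter, and the example probabilities are hypothetical.

```python
import numpy as np


def log_loss(y_true, probs, eps=1e-12):
    """Mean negative log-likelihood for integer class labels."""
    probs = np.clip(probs, eps, 1.0)
    return float(-np.mean(np.log(probs[np.arange(len(y_true)), y_true])))


def scalarized_fitness(mask, y_true, probs, weight=0.1):
    """Lower is better: logloss plus weight times the fraction of features kept.

    mask: binary feature mask (the GA chromosome).
    probs: class probabilities from a classifier trained on the masked features.
    """
    kept_fraction = float(np.mean(mask))
    return log_loss(y_true, probs) + weight * kept_fraction


# Hypothetical example: two clips, two classes, an uninformative classifier.
y_true = np.array([0, 1])
probs = np.array([[0.5, 0.5], [0.5, 0.5]])
fitness_half = scalarized_fitness(np.array([1, 0, 1, 0]), y_true, probs)
fitness_full = scalarized_fitness(np.array([1, 1, 1, 1]), y_true, probs)
```

At equal logloss, the smaller subset receives the lower (better) score. The choice of `weight` sets the trade-off between the two objectives, which is the part a Pareto-based formulation would avoid fixing in advance.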