An Intelligent Low-Complexity Computing Interleaving Wavelet Scattering Based Mobile Shuffling Network for Acoustic Scene Classification

The key to a low-complexity convolutional neural network model lies in controlling the number of parameters and ensuring that the input representation is not excessively large. Hence, to tackle low complexity for acoustic scene classification (ASC), this paper proposes an enhanced wavelet scattering representation combined with mobile network modules and shuffling modules. While wavelet scattering comprises wavelet transforms at multiple wavelet scales, the averaging operation that makes wavelet scattering invariant to translation limits the maximum timescale. Hence, wavelet scattering is affected by Heisenberg's Uncertainty Principle. However, creating an input representation with multiple timescales does not meet the brief of low-complexity modelling. Hence, we propose a simple mixing of the first and second order with different timescales. The result is an input representation with nearly the same dimension as the usual wavelet scattering but with enhanced multiscale capability. To further leverage the 'interleaved' wavelet scattering, this paper presents sub-spectral shuffling, inspired by shuffling modules that use stochasticity to improve the model's generalization. Unlike channel shuffle, which shuffles channel-wise, and spatial shuffle, which shuffles pixel-wise, sub-spectral shuffle shuffles the feature maps frequency-wise using the concept of binning. Each bin is shuffled, so the high-frequency spectrum can be shuffled to a low-frequency spectrum position. As such, the model learns the general acoustic profile of a scene rather than memorizing what is happening at the low-frequency or high-frequency spectrum, which is erratic for ASC. In addition, this paper also studies temporal shuffling, which shuffles the feature maps temporal-wise, and evaluates sub-spectral shuffling, temporal shuffling, and channel shuffling individually. Our results demonstrate the superiority of sub-spectral shuffling and the modularity of shuffling modules.
We then evaluate various combinations of the three shuffling modules on three acoustic scene classification datasets. Our best model combines the three shuffling modules and achieves 70.6% classification accuracy on the DCASE 2021 Task 1a dataset, 82.15% on the ESC-50 dataset, and 81% on UrbanSound8K, with ~65K parameters and a size of 126.6 KB. In addition, the inclusion of shuffling modules improves model performance. Sub-spectral shuffling is especially useful in improving log loss, a metric used to determine the confidence level of the model.


I. INTRODUCTION
In the recent Acoustic Scene Classification (ASC) challenge hosted by the Detection and Classification of Acoustic Scenes and Events community (DCASE) [1], there is a trend toward low-complexity computing. Acoustic Scene Classification (ASC), or Environmental Sound Classification (ESC), is the task of recognizing the vicinity one is in based on the sound produced by the scene. The use case of low computational complexity models is in line with the advancement and mass adoption of the Internet of Things and robotics, such as in hearing aids [2], [3]. On-node detection is paramount to making near real-time decisions; therefore, a low-latency, low-memory model is desired. Low-complexity convolutional neural network (CNN) models are not new, and a survey on model compression is presented in [4]. Current strategies include parameter pruning [5], knowledge distillation [6], quantization [7], and convolution factorization [8].

The associate editor coordinating the review of this manuscript and approving it for publication was Mohammad Zia Ur Rahman.
VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Parameter pruning [5] deletes the nodes deemed least important by the algorithm. The most popular approaches are based on the magnitude of the nodes' weights or gradients, where the magnitude signifies the importance of the node to the model (a large magnitude means the node is more important).
Knowledge distillation [6] tackles low complexity with a ''teacher'' and ''student'' network concept. The student network is the desired low-complexity model, and the teacher network is a larger network with the desired classification accuracy. In essence, the concept involves the teacher guiding the student: in the neural network context, the loss calculated from the teacher network contributes to the student's final loss function, and the level of influence is usually determined by a weighting factor strictly between 0 and 1.
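As a concrete illustration, the weighted loss described above can be sketched as follows. This is a minimal sketch, not the exact formulation of [6]: the function names, the example logits, and the use of the teacher's soft predictions as the guidance signal are our own assumptions for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(p, q, eps=1e-12):
    # H(p, q) = -sum p * log q
    return -np.sum(p * np.log(q + eps))

def distillation_loss(student_logits, teacher_logits, y_true, alpha=0.3):
    # Weighted blend of the hard-label loss and the teacher-guided loss;
    # alpha (strictly between 0 and 1) sets the teacher's level of influence.
    hard = cross_entropy(y_true, softmax(student_logits))
    soft = cross_entropy(softmax(teacher_logits), softmax(student_logits))
    return (1 - alpha) * hard + alpha * soft
```

With alpha = 0 the student ignores the teacher entirely; with alpha close to 1 it is trained almost entirely on the teacher's soft predictions.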
The quantization [7] technique reduces memory size based on the data type. Commonly, the default data type for training deep networks is 32-bit floating point, which translates to each network parameter having a memory upkeep of 4 bytes. Hence, converting to a smaller-memory data type (e.g., 16-bit floating point, 8-bit integer) drastically reduces the overall memory size of deep networks.
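The memory arithmetic is easy to verify. The sketch below uses a dummy parameter buffer of ~65K values (matching the parameter count reported for our model) purely for illustration:

```python
import numpy as np

# ~65K parameters stored as 32-bit floats: 4 bytes of upkeep per parameter.
params = np.zeros(65_000, dtype=np.float32)
print(params.nbytes)  # 260000 bytes

# Converting to 16-bit floats halves the memory upkeep to 2 bytes each,
# i.e. roughly the 126.6 KB model size reported in the abstract.
params_fp16 = params.astype(np.float16)
print(params_fp16.nbytes)  # 130000 bytes
```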
Next, convolution factorization is the concept of adapting matrix factorization to convolution layer filters [8]. Such designs have been around in the image domain with state-of-the-art models such as InceptionNet [9], MobileNet [10], [11], and ShuffleNet [12]. InceptionNet observed that convolution operations are matrix multiplications between two tensors and share the same properties as matrices; hence, they can be decomposed/factorized. They proposed 'filter-wise' factorization, which breaks down an M × N × C × D convolution layer (we can view this as the dimension of the convolution layer, where M × N denotes the filter size, and C and D represent the number of input and output filters, respectively) into two convolution layers: an M × 1 × C × D convolution layer followed by a 1 × N × C × D convolution layer. This is analogous to factorizing a 3 × 3 matrix into (3 × 1) × (1 × 3). In contrast, the low-complexity convolution layers used in both MobileNet [10], [11] and ShuffleNet [12], [13] are mainly depthwise separable convolution layers [8], converting M × N × C × D into an M × N × 1 × C depthwise layer (one filter per input channel) followed by a 1 × 1 × C × D pointwise layer. In fact, the 1 × 1 × C × D layer is a development of [14] to improve cross-channel relations, initially termed 'mlpconv'. In addition, ShuffleNet integrates low-complexity convolution layers with a stochastic algorithm, channel shuffling, to improve model generalization at zero model-size cost.
While the decomposition techniques mentioned above work with the dimensions of the convolution layers, SqueezeNet [15] and ResNet [16] leverage mlpconv to construct their 'fire module' and 'bottleneck module', respectively. In this context, a convolution module or block is defined as a combination of convolution layers and other algorithms described in Section III. Their proposed ideas lean more towards utilising mlpconv to expand or contract the number of output filters, corresponding to D of 1 × 1 × C × D, without overly increasing the model size as the network gets deeper. In addition, this technique has also been used in CondenseNet [17] and ResNeXt [18]. Moreover, [15], [17], [18] adopted group convolution, which divides the convolution layer operation channel-wise, corresponding to C and D of M × N × C × D. Letting G denote the factor by which the convolution layer is divided, the layer transforms into G groups of M × N × C/G × D/G.
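The parameter savings of the factorizations discussed above can be checked with straightforward arithmetic. The sketch below counts only filter weights (biases and batch-norm parameters are ignored for simplicity):

```python
def standard_params(M, N, C, D):
    # Standard convolution layer: M x N x C x D
    return M * N * C * D

def filterwise_params(M, N, C, D):
    # InceptionNet-style filter-wise factorization: (M x 1) then (1 x N)
    return M * 1 * C * D + 1 * N * C * D

def depthwise_separable_params(M, N, C, D):
    # Depthwise (one M x N filter per input channel) + 1 x 1 pointwise
    return M * N * C + 1 * 1 * C * D

def group_conv_params(M, N, C, D, G):
    # G groups, each operating on C/G input and D/G output channels
    assert C % G == 0 and D % G == 0
    return G * (M * N * (C // G) * (D // G))
```

For example, with M = N = 3 and C = D = 64, a standard layer has 36,864 weights, the filter-wise factorization 24,576, a group convolution with G = 4 has 9,216, and the depthwise separable layer only 4,672 — which is why this paper adopts the depthwise separable design.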
It is challenging to distinguish which method of model compression is superior, as other factors affect the accuracy of the model, such as the CNN architecture design, its application, and its requirements. Parameter pruning is the most popular approach, as it can be directly applied to an existing network and is convenient when transfer learning is highly desired.
Knowledge distillation, on the other hand, allows configuration of both networks (teacher and student). However, it generally achieves less competitive performance compared to pruning and convolution factorization [4]. The other broadly used technique is quantization, which can be easily combined with other compression methods, as shown in Table 4. With prior knowledge of CNN architecture design, one has the flexibility to construct the CNN using convolution factorization. All the model compression techniques discussed above can be combined to maximize model performance.
To tackle acoustic scene classification with a restriction on the size complexity of the model, this paper approaches the problem using quantization and convolution factorization, as strategies like parameter pruning and knowledge distillation are designed with pre-trained models in mind. In particular, this paper heavily adapts MobileNet [10], [11] and ShuffleNet [12], [13], as the depthwise separable convolution layer has the lowest model-size cost and the shuffle modules incur zero model-size cost. This paper uses quantization to convert 32-bit floating points to 16-bit floating points.
Following the current state-of-the-art framework of combining a time-frequency representation with a CNN, this paper proposes using wavelet scattering (WS) [19]–[21] in combination with MobileNet and ShuffleNet to tackle low-complexity computing for ASC. The combination of WS with CNN is already well established [22]–[29].
In the design of the CNN, [22], [24], [29] featured a two-stage CNN architecture. In the first stage, separate networks train on the first- and second-order coefficients. In the second stage, both sets of coefficients are concatenated so that the network can learn the first and second order as a single entity. For brevity, the terms first and second stage relate to the CNN architecture design, while first and second order relate to WS, the input representation of the CNN.
The reason for their setup is the large disparity in magnitude between the first and second orders. Their differences lie in how the second order is tackled: [22] uses a deep neural network (DNN) architecture, while [24], [29] use a CNN architecture. In this context, DNN refers to stacks of fully-connected layers, i.e., a neural network with more than one hidden layer. Notably, the two-stage CNN architecture used by [24], [29] is inspired by the work of [30], [31], who used it to split the log mel-spectrogram into low-frequency and high-frequency bins.
In yet another application of a two-stage CNN architecture, [20] utilized the first stage to train two features: WS and frequency scattering. Frequency scattering was introduced in [20] to make WS invariant to transposition in frequency. Hence, based on the concept of applying the discrete cosine transform (DCT) along the frequency axis of the log mel-frequency spectrogram to produce mel-frequency cepstral coefficients (MFCC), they performed a scattering transform of WS along the frequency axis. Overall, their design is not tailored towards low-complexity modelling and does not address the limitation on the timescale. The reason WS has a timescale limitation, even though the wavelet transform provides multiple scales based on the scaling functions in wavelet theory, is the averaging operation on the wavelet transform coefficients that makes the representation translation invariant. The timescale, which determines the size of the averaging function, limits the maximum scaling factor of the wavelet.
The scaling function is one of the most crucial attributes of wavelets. In [32], the authors designed a 'wavelet-based neural network' that replaces the radial basis function of the neural network with the scaling factor of the wavelet. They exploit the scaling factor as the learning function to improve the network representation towards an orthonormal basis. The study of scaling properties has also been applied in the image domain [33], [34]. The work of [33] exploits the scaling properties of the discrete wavelet transform to decompose synthetic aperture radar sea-ice texture, providing multiscale/multiresolution analysis. Meanwhile, [34] uses the concept of multi-scaling on the convolution filter size (M × N) and the dilation rate to create multiple feature maps with different receptive fields, improving the reconstruction of super-resolution remote-sensing images. The dilation rate controls the dilated convolution filter: when the dilation rate is 1, it behaves like a standard convolution filter; when the dilation rate increases above 1, the size of the convolution filter (M × N) increases, but not the number of weights. Hence, dilation can be understood as the distance between adjacent weights. Likewise, the concept of 'multiple convolution filters' is relatable to the InceptionNet [9] design, where the problem they address is similar to the Heisenberg Uncertainty Principle, but for the convolution filter size.
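The relation between dilation rate and filter extent described above can be made concrete. The following one-liner is our own illustrative formulation (for the one-dimensional extent of an M-tap filter), not taken from [34]:

```python
def effective_filter_size(M, dilation_rate):
    # A dilated filter spans M + (M - 1) * (dilation_rate - 1) positions:
    # the dilation rate is the distance between adjacent weights, while
    # the number of weights (M) stays unchanged.
    return M + (M - 1) * (dilation_rate - 1)
```

A 3-tap filter thus covers 3 positions at dilation rate 1, 5 at rate 2, and 7 at rate 3, all with the same three weights.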
Returning to ASC, [29] proposed multiple timescales by using one first order and three second orders with different timescale settings as the input representation for the first stage of the two-stage CNN. Next, they coupled it with a genetic algorithm to reduce the large frequency dimension of all the second orders, which assisted in reducing the computational time complexity.
However, for a low-complexity model, 'multi-timescaling' is not a great option, as the additional resolution increases model size complexity. Therefore, instead of combining them together as exhibited in [29], this paper investigates a simple mixing of different timescales of the first order and a single second order to enrich the multi-timescale capability of WS. By coupling just the first and second order together, the number of input representations in the first stage remains similar to how WS is usually handled, while providing soft relief from the timescale limitation.
The 'mixed' WS is then coupled with a low-complexity CNN model, which has a similar structure to the two-stage CNN architecture described in [24], but instead of residual convolution blocks, this paper adapts the MobileNet architecture, in addition to shuffling modules. Notably, the major difference between the residual convolution block and the MobileNet convolution block design is the use of the depthwise separable convolution layer instead of the standard M × N × C × D convolution layer. Hence, we present our CNN topology, a two-stage MobileNet architecture with shuffling modules, which we term the two-stage mobile shuffling network (T-MSNet) for easy referencing.
In addition, we argue that the current shuffling modules proposed by [12], [35] are not designed with time-frequency representations in mind. Taking advantage of the large frequency dimension of WS, we propose a new shuffling module called 'sub-spectral shuffle' (SSS) that shuffles the feature maps frequency-wise. Notably, our experimental results show that SSS works best in the proposed setup of 'mixed' WS with a low-complexity CNN.
The remainder of the paper is structured as follows: Section II discusses wavelet scattering and describes the 'mixed' WS. Section III presents our proposed two-stage mobile shuffling network, with further elaboration on our proposed sub-spectral shuffling module. We then establish the experimental setup using three different acoustic scene classification datasets in Section IV. In this paper, we perform our ablation study using the DCASE 2021 Task 1a dataset and, based on the corresponding experimental investigations, evaluate the best 'mixed' WS and the best mobile shuffling networks on the rest of the datasets. The discussion of all the experiments is presented in Sections V, VI, and VII. Lastly, to reiterate our contributions: • We provide an evaluation of the 'mixed' WS and establish a discussion on the effectiveness of this approach.
• Following the current ASC framework, we present our CNN topology, which combines IWS, MobileNet, shuffling modules, and a two-stage CNN for low-complexity computing for ASC.
• Next, we propose new shuffling modules called 'sub-spectral shuffle' (SSS) and 'temporal shuffle' (TS). Unlike channel shuffling (CS) and pixel-wise shuffling proposed by [12] and [35], respectively, our proposed shuffling modules are designed for time-frequency representations. Their differentiation is described in Section III, C.
• We then demonstrate the effectiveness of the shuffling modules by evaluating them on three ASC datasets.

II. WAVELET SCATTERING
In simple terms, wavelet scattering is constructed by cascading wavelet transforms of a given signal [19]–[21]. The wavelet transform is a signal processing method that captures the modulation of a signal x through convolution operations with multiple scaled versions of a 'mother' wavelet, typically the Morlet wavelet. However, the wavelet transform is not invariant to time-shift. It needs to undergo a non-linear mapping through the complex modulus function | · | and an averaging function to make it locally translation invariant. Combining all these operations produces the first-order coefficients [20]:

S_1 x(t, λ_1) = |x ∗ ψ_{λ_1}| ∗ φ(t),   (1)

where S_1 x denotes the first-order coefficients, t is the time index, ψ_{λ_1} is a set of wavelet filters with λ_1 as the centre frequency, and Q_1 is defined as the number of wavelet filters per octave. The set of wavelet filters is illustrated in Fig. 1. Lastly, ∗ denotes the convolution operation, and φ is an averaging function, a Gaussian filter acting as a lowpass filter. While the averaging function ensures that the representation is locally invariant to time-shifts smaller than a given timescale T, it removes fine-scale information such as transient attacks.
Hence, it becomes unstable to time-warping deformations.
To be stable to deformation, the representation needs to satisfy the Lipschitz deformation stability condition [20]; wavelet scattering achieves this by calculating a second wavelet modulus transform on the first wavelet transform, |x ∗ ψ_{λ_1}|, based on [20]:

S_2 x(t, λ_1, λ_2) = ||x ∗ ψ_{λ_1}| ∗ ψ_{λ_2}| ∗ φ(t).   (2)

Similar to the first-order coefficients, S_2 x denotes the second-order coefficients, and the support of ψ_{λ_2} is centred at λ_2 with a frequency bandwidth of λ_2/Q_2 for λ_2 ≥ 2πQ/T and 2π/T for λ_2 < 2πQ/T. Following the suggestion of [19]–[21], this paper uses only S_1 x and S_2 x. While WS does provide multiple timescales up to a maximum scale of T, it is still affected by the Heisenberg Uncertainty Principle: one does not know the range of time supports required to capture all the important spectral dynamics needed to differentiate one acoustic scene from another. In addition, toggling different values of T also affects how the signal energy is distributed between the first and second orders [19]–[21]. A small T (e.g., ∼46 ms) saturates more energy in the first order. In contrast, a larger T disperses more energy to the second order. The cause of the energy dispersion is the application of the Gaussian filter φ(t), with a scale of size T, on the wavelet transform, as expressed in (1) and (2). φ(t), a low-pass filter of the scattering coefficients, has a steeper roll-off when T is larger, as illustrated in Fig. 2. Hence, more energy is filtered out of the first order and absorbed into the second order, where the coefficients are sparse. In light of this, this paper proposes a simple mixing of the first and second orders with different timescales. The idea is to bring diversity to the feature representation, and we term it 'Interleaving' WS (IWS). This technique is effective for low-complexity computing, as the input representation has roughly the same dimension as the usual WS while being further enriched with multiple scales.
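For intuition, the two-step cascade of (1) and (2) can be sketched with plain numpy. This is a deliberately crude toy (our own simplified Morlet-like band-pass and Gaussian lowpass, with arbitrary default parameters), not the filter bank used in this paper, which is computed with Kymatio [37]:

```python
import numpy as np

def gaussian_lowpass(n, T):
    # Averaging function phi: Gaussian of scale ~T samples, unit sum
    t = np.arange(n) - n // 2
    g = np.exp(-0.5 * (t / (T / 2.0)) ** 2)
    return g / g.sum()

def morlet_like(n, xi, sigma):
    # Crude complex band-pass filter centred at (angular) frequency xi
    t = np.arange(n) - n // 2
    return np.exp(-0.5 * (t / sigma) ** 2) * np.exp(1j * xi * t)

def scattering_orders(x, xi1=1.0, xi2=0.3, sigma=4.0, T=64):
    n = len(x)
    # Eq. (1): S1 x = |x * psi_{l1}| * phi
    u1 = np.abs(np.convolve(x, morlet_like(n, xi1, sigma), mode="same"))
    s1 = np.convolve(u1, gaussian_lowpass(n, T), mode="same")
    # Eq. (2): S2 x = ||x * psi_{l1}| * psi_{l2}| * phi
    u2 = np.abs(np.convolve(u1, morlet_like(n, xi2, sigma), mode="same"))
    s2 = np.convolve(u2, gaussian_lowpass(n, T), mode="same")
    return s1, s2
```

The modulus makes each stage non-negative, and the lowpass φ provides the local translation invariance at the cost of removing fine-scale information, which the second stage then recovers at coarser modulation frequencies.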
Relating to (1) and (2), WS and IWS can be expressed as:

WS = {S_1 x(t_1), S_2 x(t_1)},   (3)
IWS = {S_1 x(t_1), S_2 x(t_2)},   (4)

where t_1 is a time index based on timescale T_1 and t_2 is a time index based on timescale T_2, with T_1, T_2 ∈ {92 ms, 185 ms, 371 ms} and T_1 ≠ T_2. In the construction of IWS, we are only interested in the first order at t_1 and the second order at t_2. The construction of WS and IWS is illustrated in Fig. 3 and Fig. 4, respectively. Notably, there is a need to carefully limit the temporal dimension of IWS, as the size of the input representation increases model complexity. Based on the findings of [24], WS with timescales T ∈ {92 ms, 185 ms, 371 ms} achieves the best classification accuracy on the DCASE 2021 dataset. Hence, in Section V, A, we only evaluate combinations of IWS with timescales T ∈ {92 ms, 185 ms, 371 ms}.
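In code, the construction of IWS per (4) amounts to pairing coefficient blocks from two scattering transforms. The sketch below is illustrative only: the dictionary layout and the coefficient shapes (40 first-order and 300 second-order paths with 32 time frames) are hypothetical placeholders, not the actual dimensions used in our experiments.

```python
import numpy as np

def build_iws(ws_t1, ws_t2):
    # IWS keeps the first order of the T1 transform and the second order
    # of the T2 transform (Eq. (4)); each block is fed to its own
    # first-stage branch, just as S1 x and S2 x are for plain WS.
    return {"S1": ws_t1["S1"], "S2": ws_t2["S2"]}

# Hypothetical coefficient shapes for two timescales, T1 != T2.
ws_185ms = {"S1": np.random.rand(40, 32), "S2": np.random.rand(300, 32)}
ws_371ms = {"S1": np.random.rand(40, 32), "S2": np.random.rand(300, 32)}
iws = build_iws(ws_185ms, ws_371ms)
```

Because only existing coefficient blocks are re-paired, the input representation stays roughly the same size as a single WS.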

III. TWO-STAGE MOBILE SHUFFLING NETWORK
Following the ASC framework, the IWS is set as the input representation of the CNN. In this section, we introduce our proposed model, T-MSNet. The construction of the CNN is based on a compilation of MobileNet [10], [11], ShuffleNet [12], [13], and the two-stage CNN [24], which is discussed in subsection A and illustrated in Fig. 5. In addition, subsection A elaborates on the possible configurations of T-MSNet and how it is designed to meet the brief of being a low-complexity model. With our CNN topology established, we investigate the building blocks of T-MSNet. In subsection B, we examine the construction of the MobileNet convolution block, whose convolution layers are designed based on convolution factorization. In addition, we analyse the complexity of various convolution operations and provide the rationale for using MobileNet in this context. In subsection C, we propose new shuffling modules, sub-spectral shuffling (SSS) and temporal shuffling (TS), and explain the intuition behind our approach.

A. PROPOSED TWO-STAGE MOBILE SHUFFLE NETWORK ARCHITECTURE
The collective components of the MobileNet convolution block described in subsection B and the shuffling modules described in subsection C constitute T-MSNet. In addition, T-MSNet features the two-stage architecture design proposed by [24], [30], [31], with [24] explaining that the two-stage approach is crucial in tackling the huge disparity in magnitude between the first- and second-order coefficients. However, we face the challenge of abiding by the strict computational complexity constraint set by DCASE 2021. Hence, the first stage consists of only a single convolution layer to separately process S_1 x and S_2 x, with strides of (1,1) and (2,1), respectively. Following the two-stage CNN architecture suggested by [24], we downsample S_2 x with a stride of 2 along the Freq axis, as aforementioned. This reduces the bias of the model towards S_2 x, which is the larger input representation (the larger the feature map, the more convolution operations are performed on it, resulting in increased learning on that feature map). The second stage concatenates the features together to enable cross-spectral learning, which is further enriched when SSS is applied. The base architecture of T-MSNet after the first convolution layer follows a 'lite' version of MobileNet. The model has 8 stacks of MobileNet convolution blocks (MBloc), and the channel configuration of the stacked convolution blocks is C = {16, 24, 24, 32, 32, 64, 64, 96}. Next, the shuffling modules, comprising a combination of SSS, TS, and CS, or a single one of them, are slotted after the MBloc before each increment of C, except for the first increment. Based on the findings of [35], shuffling modules need to be strategically positioned to unleash their maximum potential. Hence, the locations of the shuffling modules are predetermined at C = {16, 24, 24, S1, 32, 32, S2, 64, 64, S3, 96, S4}, where S1, S2, S3, and S4 are the indices of their locations, as illustrated in Fig. 5 and Appendix, Table 6. While the locations at which shuffling modules may be applied are fixed, this does not mean that shuffling will always be applied. Inspired by [35], this paper analyses the 'activation' of various combinations of shuffling modules at the predetermined locations in Section VI. Finally, T-MSNet is completed with the FCN described in subsection B and Table 1. Apart from the locations of the shuffling modules, the numbers of segments for SSS, TS, and CS are a set of hyperparameters. Based on our preliminary experiments, the numbers of segments for SSS, TS, and CS should be configured differently. For ease of reference, we term the numbers of segments for SSS, TS, and CS as GBins, GTim, and GChan, respectively. For SSS, the number of segments used in S1 and S2 is more effective when it differs from that in S3 and S4.
We also found that when GBins is the same size as Freq, classification results are poorer. Having GBins = Freq means that the entire spectral information is being distorted. During our experimentation, we discovered that GBins is best set at 10 for S1 and S2 and at 5 for S3 and S4, for our configuration. As for TS, we set GTim = Time, as IWS has a small temporal dimension. Lastly, GChan is most effective at 8; the reason might be the number of channels of T-MSNet, which is kept relatively small to ensure the model size is within the given constraint.
The overview of the system, including the preprocessing to IWS and T-MSNet, is illustrated in Fig. 5, and the T-MSNet architecture is described in Appendix, Table 6. With the CNN topology understood, the following subsections deconstruct T-MSNet and discuss the convolution blocks and components.

B. ADAPTING FROM MOBILE NETWORK CONVOLUTION BLOCK
The MobileNet used in this paper is a combination of mobile network components designed by [10], [11], with extremely low computational cost carefully considered. [10], [11] use the concept of convolution factorization [8] to significantly reduce the computational complexity, and the adapted distinctions can be summarized into three key components. The first component, the core convolution layer of MobileNet, uses depthwise separable convolution [8] to drastically reduce the computation in two steps. The first step focuses on capturing spatial relationships using a single filter per channel, called depthwise convolution, depicted in Fig. 6b. The second step employs a pointwise convolution, as depthwise convolution does not capture channel-wise relationships. To better understand how depthwise separable convolution helps reduce model complexity, this paper provides a comparative analysis in Fig. 6 and compiles a list of convolution layers and their computational analysis in Table 1. For ease of reference, similar to Section I, we define a standard convolution layer as a tensor of M × N × C × D, where M × N represents the size of the filters, C is the number of input filters, also called channels, and D is the number of output filters.
The depthwise convolution layer is the first convolution layer of the depthwise separable convolution design [8], illustrated in Fig. 6b. Instead of filter-wise factorization, it breaks the convolution layer down channel-wise, such that M × N × C × D becomes M × N × 1 × D with the condition D = C. This results in a considerable reduction in model size, as C is always larger than M and N, especially as the network goes deeper.
The output of the depthwise convolution layer results from overly focused feature-wise learning and lacks understanding of the cross-channel relationship. Hence, a 1 × 1 × C × D layer is employed to cover this semantic. In a depthwise separable convolution setup, the 1 × 1 convolution (1 × 1 Conv) is termed a pointwise convolution, as it focuses on a single point of the feature map instead of a spatial relationship. As aforementioned, the 1 × 1 Conv was designed by [14] with the intention of enhancing the cross-channel relationship. The design of the 1 × 1 Conv is also depicted in Fig. 6c. Lastly, the comparison between standard convolution, depthwise convolution, and pointwise convolution is depicted in Fig. 6.

FIGURE 7. MobileNet convolution block (MBloc) design adapted from [10], [11]. The design follows an inverted residual and linear bottleneck structure. x is a feature map, and the output after MBloc is x_1. The first convolution layer of MBloc is called the 'expansion layer' and has a similar structure to the 1 × 1 Conv; a hyperparameter termed 'expand' determines its number of output channels, denoted as D. Batch Normalization (BN) followed by the ''ReLU'' activation function is used right after the expansion layer (1 × 1 Conv) and the depthwise convolution layer (3 × 3 DWConv). Based on the study of [11], no activation function is used after the pointwise convolution layer (1 × 1 PWConv), as it is shown to impede learning. If the stride of the MBloc operation is equal to 1, there is a residual connection between x and x_1.
Hence, it is crucial to understand the computational complexity of these convolution operations, as they have a direct impact on the size of the model; the calculation of the number of parameters is reflected in Table 1.
The second component incorporates an inverted residual and linear bottleneck structure [16], which defines the MobileNet convolution block design (MBloc), as illustrated in Fig. 7. The whole convolution block is constructed by stacking an expansion convolution, which is a 1 × 1 Conv, followed by a depthwise separable convolution, then a projection layer, which is also a 1 × 1 Conv. As the name suggests, the expansion convolution is designed to increase the number of feature maps. This elevates the effectiveness of the depthwise separable convolution: the expanded channels provide a higher-dimensional feature space, leading to more expressiveness in the non-linear spatial-wise learning delivered by the depthwise convolution layer. The pointwise convolution condenses this information channel-wise into a sparse representation, maintaining a compact representation for the subsequent convolution operations [10], [11]. The projection layer reflects the actual number of output channels and acts as both a consolidator and a dimensionality reduction layer, producing a sparse set of feature maps.
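The parameter cost of the expand/depthwise/project stack can be tallied directly. The sketch below counts filter weights only (no biases or batch-norm parameters) and uses a hypothetical 32-channel block for illustration:

```python
def mbloc_params(C, D, expand=6, M=3, N=3):
    # Expansion layer (1 x 1 Conv): C -> expand * C channels
    expansion = 1 * 1 * C * (expand * C)
    # Depthwise layer (M x N DWConv): one filter per expanded channel
    depthwise = M * N * (expand * C)
    # Projection layer (1 x 1 PWConv): expand * C -> D channels
    projection = 1 * 1 * (expand * C) * D
    return expansion + depthwise + projection
```

For C = D = 32 with our expansion ratio of 6, the block costs 14,016 weights, far less than a standard 3 × 3 convolution operating at the expanded width of 192 channels (331,776 weights), which is the comparison that motivates the inverted residual design.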
Lastly, to reduce the computational cost of classifying with a fully connected layer, [36] proposed the Fully Convolutional Network (FCN). Before the FCN was developed, a CNN model typically ended with fully-connected (dense) layers before the softmax layer. For these dense layers to be effective, they require a sizable number of hidden nodes, resulting in extremely high computational cost (e.g., D^2). Hence, [36] redesigned this portion by eliminating all dense layers and instead connecting the final convolution layer to a 1 × 1 Conv of shape 1 × 1 × C × E, where E is the number of classes. Global average pooling [14] is then applied before the softmax layer. The result is a significant reduction in computational cost, as shown in Table 1.
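The saving from replacing dense classifier layers with the FCN head can be quantified with simple counts. The feature-map and hidden sizes below (16 × 4 × 96 features, 128 hidden nodes, 10 classes) are hypothetical values chosen only to make the comparison concrete:

```python
def dense_head_params(F, T, C, H, E):
    # Flattened F x T x C features -> dense layer of H nodes -> E classes
    return F * T * C * H + H * E

def fcn_head_params(C, E):
    # 1 x 1 x C x E convolution; global average pooling and softmax
    # add no parameters at all
    return 1 * 1 * C * E
```

With the hypothetical sizes above, the dense head needs 787,712 weights while the FCN head needs only 960, which is why the FCN design dominates low-complexity models.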
Apart from the three components, to provide greater flexibility, MobileNet offers two hyperparameters to throttle the complexity: 'alpha' and the 'expansion ratio'. The alpha value controls the ''width'' of the T-MSNet architecture and, as such, affects the number of channels of every MBloc. On the other hand, the expansion ratio is a hyperparameter set for each MBloc, and it controls only the size of the expansion layer. In this paper, based on our initial empirical study, we set alpha to 0.5 and all expansion ratios to 6.

C. FROM CHANNEL AND SPATIAL SHUFFLE TO SUB-SPECTRAL AND TEMPORAL SHUFFLE FOR TIME-FREQUENCY REPRESENTATION
FIGURE 9. Comparison of the convolution blocks of ShuffleNet [12], ShuffleNetV2 [13], MobileNet [10], [11], and our proposed T-MSNet, which, in simple terms, adds shuffling modules to the MobileNetV2 convolution block design. The shuffling modules are accentuated to illustrate their positions in the different convolution block designs. While ShuffleNet and ShuffleNetV2 use channel shuffle (CS), we propose the inclusion of various shuffling modules: SSS, TS, and CS. 'GConv' denotes group convolution. 'EConv' denotes the expansion layer explained in Fig. 7. 'DWConv' denotes depthwise convolution, and 'str' denotes the stride value of the convolution layers. 'Channel Split' breaks the feature maps down, channel-wise, into two groups.

To further exploit IWS, this paper proposes a sub-spectral shuffling (SSS) module inspired by the channel shuffle (CS) designed by [12] and the spatial shuffle (SS) by [35]. Generally, shuffling is a stochastic approach and tends to improve network performance without increasing the number of model parameters [35]. Hence, it is very applicable when designing a model with size restrictions. Shuffling channel-wise, pixel-wise, or sub-spectral-wise encourages the filters to encompass information from different features. Notably, this paper also explores temporal shuffling (TS), which is shuffling along the time axis. The differences between CS, SS, SSS, and TS are illustrated in Fig. 8. Given an input representation with a dimension of Freq × Time × C, Freq × Time represents the feature map, which is a time-frequency representation, with Freq being the frequency axis and Time being the time axis; C is the number of feature maps, usually termed channels. CS performs shuffling along axis C, allowing cross-learning of feature maps. SS views the feature map, Freq × Time, pixel-wise and randomly scrambles the individual pixels to destroy spatial memorization by the network, thus allowing the network to generalize more effectively.
Although this makes sense for image classification, it has an adverse effect on acoustic scene classification. The poorer performance can be attributed to the time-frequency representation, in which a degree of temporal sequence and frequency locality are essential to the acoustic profile. As such, this paper presents examples and explanations for SSS only; the algorithm is equally applicable to TS and CS.
Let GBins be the number of segments used to segment the feature maps along the frequency axis, such that before SSS is applied to feature maps of Freq × Time × C, the feature maps are grouped into bins of dimension Freq/GBins × Time × C. During the SSS operation, all the bins undergo a random repositioning process, as illustrated in Fig. 8. GBins must also satisfy the condition 1 < GBins ≤ Freq. The combination of Mbloc with shuffling modules is illustrated in Fig. 9d. Fig. 9 also presents a comparison between our convolution block design and those of ShuffleNet and MobileNet.
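A minimal sketch of the SSS operation described above, written in NumPy for clarity (the actual T-MSNet layer is implemented in TensorFlow; the function name and the divisibility assumption Freq mod GBins = 0 are ours). Temporal shuffling is the same operation applied along the Time axis instead.

```python
import numpy as np

def sub_spectral_shuffle(x, g_bins, rng):
    """Shuffle a Freq x Time x C feature map frequency-wise in bins.

    x      : feature maps of shape (Freq, Time, C)
    g_bins : number of frequency bins, 1 < g_bins <= Freq
    rng    : numpy random Generator supplying the stochasticity
    """
    freq = x.shape[0]
    assert 1 < g_bins <= freq and freq % g_bins == 0
    # Group the frequency axis into g_bins bins of Freq // g_bins rows each,
    # then randomly permute the bins (not the rows within a bin).
    bins = x.reshape(g_bins, freq // g_bins, *x.shape[1:])
    order = rng.permutation(g_bins)
    return bins[order].reshape(x.shape)
```

Because whole bins are repositioned, a high-frequency bin can land at a low-frequency position, while the local ordering inside each bin is preserved.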

IV. EXPERIMENTAL SETUP
Kymatio [37] is used to process the raw waveform into IWS, and T-MSNet is implemented in TensorFlow version 2.1 [38]. All models are trained and tested on a Windows desktop with an Intel i7-11700F processor, 32 GB of RAM, and an RTX 3090. The proposed model is evaluated on three acoustic scene datasets to investigate the effectiveness and robustness of IWS and SSS on various ASC tasks. The three datasets are described in subsection A. Subsection B details the hyperparameter settings, subsection C the post-training quantization setup, and subsection D the wavelet scattering implementation used for preprocessing.

A. DATASETS
The experiments are conducted on three widely used audio benchmark datasets: Environmental Sound Classification (ESC-50) [39], the Urban Sound dataset (US8K) [40], and Detection and Classification of Acoustic Scenes and Events (DCASE 2021) [41], presented in Table 2. Notably, these three datasets tackle two different classification tasks: sound event recognition (SER) and ASC. ASC, as aforementioned, is the task of identifying the vicinity we are in based on the sounds produced in that scene; both ambient sounds and event sounds are important here, though some sound events are noise for this task. SER, by contrast, is interested only in the sound events that happen in the scene. DCASE 2021 is an ASC-exclusive dataset, while at the other extreme, US8K is an SER-exclusive dataset. Lastly, ESC-50 provides a mix of both ASC and SER. Hence, evaluating over these three environmental sound classification datasets lets us understand the capability of our proposed model, especially IWS and SSS. In addition, we are also able to evaluate our model's performance on the various dataset sizes presented in Table 2.

B. HYPERPARAMETERS
For all the experiments, this paper uses stochastic gradient descent (SGD) [42] as the learning strategy, with warm restarts scheduled at the epoch indexes (3, 7, 15, 31, 63, 126, 254). The learning rate gradually decreases from 0.1 to a minimum of 0.00001 following half a cosine curve ('cosine annealing') and restarts back to 0.1 at each scheduled epoch index. The maximum epoch is set to 255 so that the schedule experiences 5 restarts with different intervals, and the mini-batch size is 32. Notably, no data augmentation is used in any of the experiments.
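The schedule above can be sketched as follows; this is our own minimal implementation of SGDR-style cosine annealing with warm restarts, assuming each half-cosine segment spans from one restart index to the next (the paper does not spell out this bookkeeping):

```python
import math

LR_MAX, LR_MIN = 0.1, 1e-5
RESTARTS = [3, 7, 15, 31, 63, 126, 254]  # warm-restart epoch indexes
MAX_EPOCH = 255

def learning_rate(epoch):
    """Half-cosine decay from LR_MAX to LR_MIN within each cycle,
    restarting to LR_MAX at every index in RESTARTS."""
    boundaries = [0] + RESTARTS + [MAX_EPOCH]
    start = max(b for b in boundaries if b <= epoch)   # current cycle start
    end = min(b for b in boundaries if b > epoch)      # next restart
    progress = (epoch - start) / (end - start)
    return LR_MIN + 0.5 * (LR_MAX - LR_MIN) * (1 + math.cos(math.pi * progress))
```

At every restart index the rate jumps back to 0.1, and each cycle decays towards (without quite reaching) the 0.00001 floor.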

C. POST-TRAINING QUANTIZATION
DCASE 2021 has an additional requirement: the model must be ≤ 128 KB, which translates to a model with at most 65,536 parameters, since our approach applies post-training quantization [7] to convert T-MSNet from 32-bit to 16-bit floating point.
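The parameter budget follows directly from the size limit; a quick sanity check (assuming 2 bytes per float16 parameter and 1 KB = 1024 bytes), with the typical TensorFlow route to float16 post-training quantization sketched in comments (not necessarily the paper's exact code):

```python
# 128 KB budget at 2 bytes per parameter after float16 quantization.
LIMIT_BYTES = 128 * 1024
BYTES_PER_PARAM_FP16 = 2
max_params = LIMIT_BYTES // BYTES_PER_PARAM_FP16  # maximum parameter count

# Float16 post-training quantization is typically done via the TFLite
# converter (sketch under the stated assumptions):
#   converter = tf.lite.TFLiteConverter.from_keras_model(model)
#   converter.optimizations = [tf.lite.Optimize.DEFAULT]
#   converter.target_spec.supported_types = [tf.float16]
#   tflite_model = converter.convert()
```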

D. WAVELET SCATTERING IMPLEMENTATION
In this paper, we propose the use of WS and IWS to preprocess the raw waveform into a time-frequency representation with different timescales T, as discussed in Section II. It has been observed that a small T < 92 ms is not effective for most acoustic classification tasks [20,22,b17-b22]; hence, the minimum of the range of T is set to 92 ms. In addition, based on the collective empirical studies, T provides the best results when set to approximately 300 ms, as exhibited in Table 3. Therefore, in this paper, T = {92 ms, 185 ms, 371 ms}. For the value of Q1, [20] recommended setting it to 8, as this reflects a representation with attributes similar to a mel-scale spectrogram, which is essential for environmental sound classification. Yet other studies [b19,b22], which exhaustively evaluated a series of Q1 values, demonstrated that Q1 is in fact optimal when set to 9. This led us to a preliminary study choosing between Q1 = 8 and Q1 = 9, based on which we selected Q1 = 9 for both S1x and S2x. As for Q2, [20], [22]-[28] are in unison in setting it to 1; notably, Q2 is only applicable to S2x. [22], [24] also recommended adding the deltas and delta-deltas and concatenating them channel-wise, giving C = 3. Notably, changes in the timescale T affect the size of the Time dimension. For the concatenation operations (refer to the 'plus' sign of Fig. 5) to work, the first- and second-order feature maps need to share the same Time dimension size.
Hence, we set the Time dimension to follow the timescale T = 185 ms, giving a Time dimension size of 56. 'Oversampling', a feature in Kymatio that increases the Time dimension, is used when T = 371 ms to bridge the gap, and we increase the stride of the time axis of the first convolution layer to 2 when T = 92 ms.
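Kymatio parameterizes the averaging scale dyadically as 2^J samples, so the three timescales and their Time dimensions relate as sketched below. The 44.1 kHz sample rate is our assumption (it is not stated here), as are the helper names:

```python
import math

SR = 44_100  # assumed sample rate in Hz

def dyadic_scale(T_ms):
    """Closest dyadic averaging scale J such that 2**J samples ~ T ms."""
    return round(math.log2(T_ms / 1000 * SR))

def time_dim(n_samples, J):
    """Time dimension of the scattering output: one frame per 2**J samples."""
    return math.ceil(n_samples / 2 ** J)

J = {T: dyadic_scale(T) for T in (92, 185, 371)}  # {92: 12, 185: 13, 371: 14}
n = 56 * 2 ** J[185]  # input length giving Time = 56 at T = 185 ms
# T = 371 ms halves the Time dimension to 28, bridged by 'oversampling';
# T = 92 ms doubles it to 112, handled by the stride-2 first convolution.
```

Each doubling of T adds one to J and halves the native Time dimension, which is why the two alignment tricks above are needed.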

V. INVESTIGATION ON IWS
In subsection A, we exhaustively search for the best combination of IWS using the DCASE 2021 dataset. In subsection B, we evaluate the effectiveness of IWS and WS with shuffling modules.

A. COMPARATIVE ANALYSIS OF IWS AGAINST WS
We exhaustively evaluated various combinations of IWS where the timescales T1, T2 ∈ {92 ms, 185 ms, 371 ms}; the results are featured in Table 3. Based on the root mean square (RMS) of the combined accuracy on DCASE 2021, ESC-50, and US8K, mixing the first and second order according to the rule T1 < T2 is shown to improve the classification performance of WS. Hence, our finding is in line with the analysis of the energy dispersion of the first and second order based on timescale discussed by [19], [20], [21], [23].
As the scattering transform is contractive, [19], [20], [21], [23] calculated the energy dispersion as the fraction of the squared Euclidean norm of the coefficient vector of the raw signal x captured by S1x and S2x (e.g., ‖S1x‖² / ‖x‖²).

FIGURE 10. Second-order coefficients presented as log-scalograms of a bird-chirping sound at different timescales. As the timescale increases, more energy surfaces in the second order; notably, the number of second-order coefficients also increases as the timescale increases.
Their calculations reveal that as T gets larger, there is a shift of energy dispersion from the first order to the second order.
Considering this analysis, we plotted the log-scalograms of the first and second order of a bird-chirping sound in Fig. 10 and Fig. 11. Furthermore, in Section II, we plotted the Gaussian filters in Fig. 2 to illustrate the reason for this phenomenon: T affects the scale of the Gaussian filter, which in this context is a low-pass filter.
Hence, building on this concept, we proposed a simple mixing of the first and second order with different timescales, which has proven to be rather effective, especially for ESC-50, which consists of both SER and ASC recordings; slight improvements are observed on the other two datasets. In addition, the rule T1 < T2 increases the number of wavelet transforms, especially at fine-scale resolution. Figuratively, as T increases from 92 ms to 371 ms, the second order gains at least a hundred additional wavelet transforms, as illustrated in Fig. 10. This means greater variation in the wavelet filters, which results in capturing more acoustic characteristics, as explained in Section II: more wavelets with different scales can reduce the effect of Heisenberg's Uncertainty Principle. In contrast, setting T2 < T1 does not work well; typically, this pairing of IWS always results in poorer performance than its WS counterpart, as reflected in Table 3, further backing the energy dispersion theorem.
However, a caveat appears when we start mixing first and second order with different timescales: there is a possible occurrence of feature redundancy or irrelevancy, supported by the findings of [20,29]. The possibility of redundancy is further grounded in the energy dispersion theorem [19,20,21,b17]: since the scattering transform is contractive, there is a possibility of overlap, where the same signal content is captured by both the first and second order. Coupled with a CNN, which thrives on complex representations, this might result in the network building a stronger relationship between the overlapping wavelet scattering coefficients.
Thus, IWS can be a double-edged sword without thorough experimentation. In this paper, the best combination of IWS for our setup is T1 = 185 ms with T2 = 371 ms.
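The interleaving itself reduces to a frequency-wise stacking of orders computed at different timescales; a minimal sketch, assuming the 'plus' in Fig. 5 concatenates coefficients along the frequency axis. Only Time = 56, C = 3, and the total Freq = 724 come from the paper; the split of 724 between the two orders is an illustrative assumption:

```python
import numpy as np

# Placeholder coefficient blocks sharing Time = 56 and C = 3.
s1_t1 = np.random.rand(65, 56, 3)   # first order at T1 = 185 ms (65 is illustrative)
s2_t2 = np.random.rand(659, 56, 3)  # second order at T2 = 371 ms, oversampled to Time = 56

# Interleaved wavelet scattering: stack the two orders frequency-wise.
iws = np.concatenate([s1_t1, s2_t2], axis=0)
```

The result has nearly the same dimensions as ordinary WS at a single timescale, but mixes two timescales across the two orders.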

B. EFFECTIVENESS OF SHUFFLING MODULES ON IWS AND WS
We then evaluated the combination of WS and IWS with the top shuffling modules presented in Table 3; the combinations SSS234 and CS234 were selected. Overall, the inclusion of shuffling modules resulted in better performance for both WS and IWS (with only one exception, where T1 = 371 ms with T2 = 185 ms), as illustrated in Fig. 12 and Table 3. Our intuition of taking advantage of the large frequency dimension of WS and IWS to perform SSS proved fruitful. This finding is based on the consistent improvement in performance when T2 = 371 ms; in this setup, the frequency dimension of WS and IWS is at its largest, Freq = 724, as illustrated in Fig. 11. Based on this preliminary study, the following experiments (Sections VI and VII) use the IWS configuration with T1 = 185 ms and T2 = 371 ms.

VI. ABLATION STUDY OF SHUFFLING MODULES
This section performs an ablation study of SSS, TS, and CS, broken down into two parts. The first part evaluates each type of shuffling module individually; in this experimental setup, we explore how the position and the number of shuffling modules influence the performance of T-MSNet. The second part studies the synergy of collectively combining SSS, TS, and CS. Lastly, we evaluate the robustness of the shuffling modules on all the datasets presented in Section IV, A.

A. ABLATION STUDY OF SSS, TS AND CS
The tabulated results are presented in the Appendix: Table 7 (classification accuracy) and Table 8 (logloss).
VOLUME 10, 2022
Logloss measures the confidence level of the model when making predictions: high confidence means the logit on the 'true' label is near 1, yielding a logloss approaching 0; low confidence means the logit on the 'true' label is far from 1 (for example, 0.5), resulting in a higher logloss. In this experiment, we investigated the effectiveness of each type of shuffling module when 'activated' at different locations of T-MSNet, namely S1, S2, S3, and S4, as termed in Section IV, D. Evaluating individual types of shuffling modules means that SSS, TS, and CS are evaluated separately.
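The logloss described above is the standard cross-entropy of the probability assigned to the true label; a one-function sketch (the clipping constant is our own guard against log(0)):

```python
import math

def logloss(p_true):
    """Cross-entropy for one prediction: -log of the probability assigned
    to the true label, clipped to avoid log(0)."""
    return -math.log(max(p_true, 1e-15))

# A confident, correct prediction (p_true near 1) gives a logloss near 0;
# p_true = 0.5 gives log(2) ~= 0.693, and lower confidence grows without bound.
```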
Overall, SSS has the best performance in classification accuracy and logloss, as illustrated in Fig. 13. This exhibits that SSS is more effective for a time-frequency representation, especially IWS (due to its large frequency dimension). In fact, the first operation of SSS, which splits the feature maps frequency-wise, corresponds to the signal processing technique of applying equal-frequency binning, or linearly scaled filters, to the spectrogram (the more common practice being mel-scale filterbanks). Hence, a slice of the feature maps has characteristics similar to a frequency bin. Next, the shuffling operation mixes the frequency bins around so that the feature maps are no longer ordered by frequency magnitude (e.g., 0-100 Hz, then 101-200 Hz, then 201-300 Hz, ...). This forces the network to learn the general 'shape' of the acoustics rather than the pitch; e.g., it no longer matters to the network whether a voice is female or male, since it generalizes that this is speech. Hence, SSS demonstrates a generally consistent improvement in logloss, which is also reflected in the other datasets (ref. Fig. 13b). However, the effect of improved model generalization is lost when we apply SSS at S4: including SSS at S4 results in a higher logloss. We attribute this observation to the relatively small Freq dimension there, which is 30 (refer to Appendix, Table 6). The poorer performance observed when shuffling at a low dimension size is consistent with the findings of [35], who concurred that channel shuffling is ineffective at earlier layers because of the limited number of channels. Another observation is that classification accuracy dips when SSS is activated at S1; we deduce that the network first needs to learn initial spatial relationships, and that shuffling works better on feature maps with a higher-dimensional space.
Detailed results are tabulated in Table 9 (classification accuracy) and Table 11 (logloss).
The same setup was also applied to TS and CS. However, their classification accuracy is erratic, and they do not improve the logloss; in some cases, classification accuracy even worsens with the inclusion of TS or CS. This observation is again supported by the findings of [35], where shuffling does not work effectively at low dimension sizes.
Our proposed SSS works best in a low-complexity model with IWS, given its large Freq dimension. Moreover, the inclusion of a single SSS at S2 provides the best overall performance.

B. ABLATION STUDY OF COMBINING SSS, TS AND CS
While the study in subsection A provides insight into the individual effectiveness of each shuffling module, this subsection explores the possible synergy of combining all of them. Since it is incredibly time-consuming to evaluate every combination, we first draw some rules from the observations of subsection A to reduce the number of possible tests. The rules are as follows:
• SSS must always be included, as it provides a significant improvement to logloss.
• There should be no combination where SSS is solely activated at S1, S4, or S34 (S34 stands for SSS activated on both S3 and S4), due to poor logloss performance.
Based on these rules, our focus shifts to SSS as the lead shuffling module and to how TS and CS can enhance it. Even with the rules, there are still numerous combinations; hence, we performed a random search (more than 20 combinations; the results are presented in the Appendix, Table 11). As no particular pattern could be observed, this paper presents the top 3 combinations in Table 3. The overall performance of these combinations is relatively close; hence, we evaluated the other datasets to establish consistency and test the robustness of the shuffling modules.

C. ROBUSTNESS OF SHUFFLING MODULES
In this subsection, we evaluated the robustness of the shuffling modules on the additional datasets described in Section IV, A. A slight progressive improvement is observed when we mix the first and second order to form IWS for the US8K and DCASE 2021 datasets, while a significant improvement is observed for ESC-50. This can be attributed to the number of classes to classify: the variety of sounds recorded in ESC-50 is larger than in US8K and DCASE 2021 (the number of classes is 5 times larger). Thus, it has richer acoustic characteristics, and by mixing different timescales, the model can distinguish the different acoustic profiles better. We make yet another observation when comparing the model's performance on ESC-50 against US8K and DCASE 2021: on ESC-50, a significant increase in performance is observed when IWS is applied, and a further increase when the shuffling modules are activated. This can be attributed to the size of the dataset: ESC-50 is a small dataset, and CNNs are known to overfit when the sample size is small. Hence, this finding reaffirms that the shuffling algorithm can improve the generalization of the model, as mentioned by [12], [13], [35]. Another telltale sign of better generalization is the consistent improvement in logloss when shuffling modules are applied, as exhibited in Fig. 12b.
Overall, shuffling modules improve classification accuracy and logloss performance on all the datasets. Generally, the top 3 T-MSNet configurations presented in Table 3 have very close performance ratings. While the combination of SSS, TS, and CS has been shown to provide the best classification accuracy, activating a single SSS offers a less complex model (by complex, we mean that it has fewer hyperparameters to configure). In summary, sub-spectral shuffling has been demonstrated to be the most effective shuffling module for our proposed system setup.

VII. COMPARATIVE ANALYSIS
In this section, we compare our proposed T-MSNet with the SOTA models reported on the various datasets.

A. ESC-50
Our proposed model has an advantage over the human classification accuracy presented by [39]. L3-Net, developed by [43], uses self-supervised learning with an unlabeled video dataset to build audio and image embeddings. In the case of [41], they demonstrated the effectiveness of embeddings on the DCASE 2021 dataset; however, the model size cost of embeddings is too large, and hence they are not recommended for low-complexity modelling. Compared to [44], pretraining the model with a very large dataset, AudioSet [45], improved performance dramatically; however, a lesser gain is observed when the same setup is applied to US8K. This can be attributed to dataset size: transfer learning from a very large dataset is more effective when the target dataset is small, and the effect diminishes when applied to a larger dataset. Although transfer learning from a very large dataset has been shown to be very useful, it requires massive resources and training time to first learn the semantics of the very large dataset.
Lastly, [46] proposed a model that has nearly the same size as ours, using pruning and quantization to drastically bring down the model's parameter count. Our proposed model has a slight edge over this pruning technique in this scenario. Overall, the models used on ESC-50 are not geared towards low-complexity modelling; hence, our best benchmarks are against [39] and [46]. Notably, without the limitation on model size, [54], [55] are current SOTA models that achieve ∼95% and ∼97% classification accuracy, respectively. The common technique used by both models is pre-learning semantics from external dataset(s).
In the work of [54], they adapted the 'Transformer' network and modified it to be suitable for audio classification; ImageNet and AudioSet are used to pretrain the network. As for [55], they incorporated a CLIP framework with a CNN model dedicated to audio classification. In short, the CLIP framework is the joint learning of multiple modalities, typically visual and textual; they extended the modalities to include audio. The output of CLIP is a shared multimodal embedding space. Similar to [43], both [54] and [55] require tremendous resources and training time.

B. US8K
Similar to ESC-50, US8K is not geared towards low-complexity modelling; hence, we remain competitive against the models presented in Table 4. Compared to [43], the inclusion of an additional dataset gives them only a slight advantage over our model. In the work of [47], they use convolution factorization and share the same approach as our model; however, our model has the edge over theirs in terms of both performance and model size. Our differentiation lies in the inclusion of shuffling modules, which improve the generalization of our model. Lastly, [48] reuses the CNN model proposed by [46] and performs a series of pruning and quantization steps; their best result, which pushes the model size to 133.12 KB, stands at only 70.93%. In addition, [55] has also been evaluated on US8K and achieved a SOTA result of 90.07%.

C. DCASE 2021
Lastly, Table 5 shows the comparison of the top 3 models against T-MSNet; the models are ranked based on the DCASE 2021 Task 1a challenge results. The main difference between our proposed model and theirs is their use of a parameter pruning strategy. Parameter pruning removes weights that the employed pruning strategy deems less important; this has proved extremely effective in compressing very large models [49]-[52], hence the popularity of the technique. In addition to pruning, [49], [50] use knowledge distillation to transfer 'learning' from a larger model to the pruned network.
Although our model falls short in terms of logloss, our classification accuracy puts us in third place. Furthermore, we present a different approach, carefully constructing our CNN architecture using convolution factorization and the design patterns suggested by [10]-[13], [24], [35]. Instead of pruning and knowledge distillation, we propose shuffling modules, a zero-model-size-cost algorithm, to improve model generalization. Table 5 shows that pruning, convolution factorization, knowledge distillation, and quantization effectively compress the model. However, pruning and knowledge distillation require a well-constructed CNN to begin with, while convolution factorization requires careful planning for each layer; hence, both have their disadvantages. In addition to the above techniques, this paper proposes shuffling modules, a stochastic method that improves model generalization and favors low-complexity modelling, as it does not add to model size. Generally, our proposed model is competitive.

VIII. CONCLUSION
In this paper, we have proposed a mixing of WS called IWS and two kinds of shuffling: sub-spectral shuffling (SSS) and temporal shuffling (TS). Notably, shuffling modules are a 'zero model size' operation that helps improve the model's ability to generalize. We then demonstrated the modularity of the shuffling modules and how they can be paired with the MobileNet architecture; the final product is our proposed T-MSNet. Based on our experimentation, sub-spectral shuffling is the most effective shuffling module and even comes toe-to-toe with the combined shuffling modules: SSS vs. the top 2 combined shuffling modules gives (80.9% vs. 80.6%, 82.15%), (80.5% vs. 80.3%, 81%), and (70.5% vs. 70.4%, 70.6%) on ESC-50, US8K, and DCASE 2021, respectively. Further research can explore better ways to fine-tune the hyperparameters of T-MSNet, pair the shuffling modules with other convolutional neural network architectures, or pretrain with AudioSet. Apart from tuning the network or applying it to other CNN architectures, we can also extend the study to other audio classification tasks, such as bioacoustic classification or music auto-tagging. In addition, the temporal and sub-spectral shuffling modules can also be applied to other time-frequency representations, such as the spectrogram. Looking internally at the WS algorithm, we can explore other types of 'mother' wavelet, such as the Mexican hat wavelet or the Meyer wavelet.