Environmental Sound Classification With Low-Complexity Convolutional Neural Network Empowered by Sparse Salient Region Pooling

Environmental Sound Classification (ESC) is an important field in a broad range of applications, such as smart cities, audio surveillance, and health care. Recently, Convolutional Neural Networks (CNNs) have taken the lead from traditional approaches and have produced promising results. However, the achieved improvements are often accompanied by increasing depth, complexity, and size of the network, which prevents their usage in many practical applications. In this work, our goal is to empower a small-size low-complexity CNN model to achieve superior performance. To this end, we concentrate on the importance of global pooling technique, which is less investigated in ESC. In most previous works, models utilize global average pooling layer which does not consider regional saliency, and thus weakens the salient time-frequency regions contributions to the classification, and also to the training of convolutional kernels. We propose a novel global pooling method, called Sparse Salient Region Pooling (SSRP), which computes the channel descriptors using a sparse subset of features, and guides the model to effectively learn from the more salient time-frequency regions. Experimental results demonstrate that the proposed model with only 700K parameters yields accuracies of 86.7% on ESC-50 and 94.8% on ESC-10, which are comparable to that of the state-of-the-art methods. Compared to the baseline model, our model achieves absolute improvement of 21.8% in accuracy on ESC-50, with 98% smaller model size. Our visual analyses show that SSRP intensifies the responses of low-energy regions such that they contribute even more than high-energy regions to the classification of specific sound classes.


I. INTRODUCTION
In recent years, the development of Environmental Sound Classification (ESC) methods has been one of the most active topics in the audio classification domain due to its potential use in various application areas such as smart cities [1], audio surveillance systems [2], health care [3], security control systems [4], and Internet of Things (IoT) [5]. For example, ESC can be used to automatically identify different environmental sound events, such as gun shots [6], sirens [7], and bird sounds [8].
The associate editor coordinating the review of this manuscript and approving it for publication was Ali Shariq Imran.
The limited knowledge of temporal and frequency characteristics, the lower signal-to-noise ratio, and less static patterns make ESC more challenging than other audio-related tasks such as music genre classification and speech recognition.
With the remarkable achievements of Deep Learning (DL) methods in different applications, such as image classification [9], [10], speech recognition [11], [12], and music processing tasks [13], [14], different attempts have been made to employ the DL framework for ESC. Various methods have been proposed based on using a one-dimensional Convolutional Neural Network (CNN) to directly process raw audio [15], [16]. Despite the advantage of eliminating the need for manual feature extraction and tuning several hyper-parameters, they cannot benefit from the important frequency clues and they suffer from high computational complexity due to the use of numerous convolutional layers (e.g. up to 34 layers in [15]). The latest methods follow the DL framework proposed by Piczak et al. [17] where a Time-Frequency (T-F) representation is utilized as the input image to a two-dimensional (2-D) CNN. To benefit from complementary features, a multi-stream network was proposed in [18] where each stream receives a different input feature. However, the multi-stream CNN is not only too complex, but also requires a large amount of memory. Several methods focused on increasing the depth of the network to extract more abstract and higher-level features [19], [20] and employed well-known image recognition architectures such as ResNet and DenseNet [20], [21]. However, these methods require a large amount of training data, and the limited number of available samples for several sound classes has hindered their further success. Also, no specific domain knowledge is incorporated in their design, which is necessary to achieve superior performance. In [22], in order to distinguish between different frequency bands, a model consisting of an ensemble of several CNNs was proposed, which processes each frequency band separately. Recently, several works have attempted to combine CNNs with recurrent neural networks, which has improved the CNN performance at the cost of more model parameters and higher complexity.
As mentioned above, most of these methods are accompanied by a large increase in model size and complexity. The heavier computational load and the need for large memory greatly limit their usage in many low-power and low-resource applications such as hearing aids or IoT devices. Rather than increasing network depth or architectural complexity, we, instead, pose this question: how can a simple, small-size and not-so-deep CNN with a single input audio representation achieve superior performance for the ESC task under limited data conditions? To find a solution to this question, we believe that the following characteristics of environmental sounds are necessary to consider:
• Temporal characteristics: Environmental sound signals exhibit complex temporal structures with different levels of local relationship and varied durations. As can be seen in Figure 1, environmental sounds can be transient (e.g. crying baby), continuous (e.g. rain), or intermittent (e.g. clock tick). Moreover, local T-F patterns are highly shift-invariant across the time axis so that temporal translation has little effect on the classification of sound events.
• Spectral characteristics: Compared to other audio signals, environmental sounds have a broader range of frequency information with diverse spectral profiles which are either scattered across frequency bands, concentrated at low, middle or higher frequency bands, or spread across all frequency bands [22], [23]. Also, unlike the time dimension, translation across the frequency dimension can significantly affect the performance of the sound classification [24].
As a result of these characteristics, T-F representations of environmental sounds exhibit diverse energy modulation patterns with considerable intra-class variations. Moreover, the input data may contain many noisy or silent frames with only a few semantically relevant frames (e.g. clock tick). The variation imposed by noises and sound sources unrelated to the target sound event adds to the variety of T-F patterns. Revisiting a regular CNN shows that convolutional layers do not discriminate between different local T-F regions from the whole input representation and treat them equally. Also, the Global Average Pooling (GAP) layer, which is applied to CNN output channels in popular architectures such as ResNet [20], mixes all local features extracted from the whole input representation into a one-value descriptor via similar weighting. Thus, GAP ignores the details and leads to channel descriptors that have similar values. Given the mentioned spectral and temporal characteristics and the degree of variety of T-F patterns, this equal treatment of different regions reduces the model ability to learn a discriminative representation, especially when small training data is available. As a solution, attention mechanisms have been proposed to increase emphasis on temporal frames or channels that are more relevant to the target sound [19], [25]. But these methods rarely exploit the joint T-F information [23], [24], and bring about limited improvements at the cost of adding lots of learnable parameters. In this paper, we propose a novel and efficient approach which empowers a small-size and not-so-deep CNN to learn discriminative representation. To this end, we introduce a new global pooling scheme which is called Sparse Salient Region Pooling (SSRP) and does not include any extra learnable parameters. Figure 2 depicts the overall structure of the low-complexity CNN model equipped with SSRP. 
Among the mentioned diverse input patterns, we believe that there are few patterns that contain the most salient and discriminative information of each sound class, and other less informative patterns could be beneficial only if a large amount of training data is available. Hence, the main idea behind SSRP is restricting the features that contribute to the channel descriptors to a subset of very sparse features which play a key role in the classification of a sound class, and ignoring all other features. This is similar to the mechanism that the human auditory system has developed to tune in to a single voice in a crowded room full of interfering signals or noisy surroundings (also known as the cocktail party effect). Strikingly, studies conducted in [26] found that the auditory cortical representation only reflects the neural responses of the target speaker and all information coming from superfluous speakers was completely discarded. During training, SSRP largely affects the back-propagation process, which forces the model to increase its focus on those frequent patterns that can contribute the most to the classification (i.e. class-dependent salient patterns). To fully utilize the information within different salient patterns, SSRP observes the aforementioned spectral and temporal characteristics by discriminating between local spectro-temporal patterns at different frequency bands and capturing the long-term temporal structures of different sound classes. To the best of our knowledge, in contrast to this work, none of the existing methods for ESC has investigated the impact of the global pooling method. Results of experiments conducted on the ESC-50 and ESC-10 datasets show that our not-so-deep CNN model equipped with SSRP can yield comparable performance to the state-of-the-art methods under much lower computational complexity and a much smaller parameter space.
VOLUME 11, 2023
In summary, the main contributions of this paper are as follows:
• This paper proposes a novel global pooling method, named SSRP, which combines local and global information in a way that guides CNN kernels to efficiently learn from more informative and salient T-F regions of input representation.
• We show that the most salient information can reside in a small percentage of total frames, and thus extracting features only from these informative frames leads to considerable improvements.
• Our proposed model achieves 21.8% and 14.3% absolute improvements over the baselines of ESC-50 and ESC-10 datasets respectively, a notable performance for a lightweight network with total parameters of 0.7 M.
The rest of this paper is organized as follows. We briefly review the existing methods proposed for ESC and provide a description about different pooling methods developed for visual domain in Section II. Section III presents details of the proposed global pooling method and its impact on the backpropagation process. Section IV provides the experimental settings and results on ESC-50 and ESC-10 datasets. Finally, Section V concludes the paper.

II. RELATED WORK
A. SOUND CLASSIFICATION NETWORKS
During the past few years, many attempts have been made to employ CNN models for ESC. As the first work, Piczak [17] fed a CNN with the log mel spectrogram, as a low-level 2-D representation, and achieved significant progress over traditional support vector machine, random forest and K-nearest neighbor classifiers. In [21], using the same 2-D input features, two well-known CNN architectures from image classification (AlexNet and GoogLeNet) were utilized for ESC. In [15], very deep 1-D CNNs were proposed to learn directly from raw audio waveforms. In [16], Gammatone filterbanks were utilized to initialize the first layer of an end-to-end 1-D CNN which could deal with inputs of different lengths. Another end-to-end system was proposed in [27] where two 1-D convolutional layers first work on the raw audio and learn to extract a 2-D representation which is then processed by 2-D CNNs. In [28], a multi-scale time-domain convolutional layer was used as the first layer to provide an improved frequency resolution and build a more discriminating 2-D representation. The 2-D representation is then fused with the log mel spectrogram using a two-phase method. In [19], a concatenation of multiple features consisting of MFCC, Gammatone Frequency Cepstral Coefficients (GFCC), the Constant Q-transform (CQT), and Chromagram was used as the input of a deep 2-D CNN which discriminates between time and frequency domains via spatially separable convolution kernels with different sizes. In [25], a Convolutional Recurrent Neural Network (CRNN) is used where 2-D convolutional layers first process the log mel spectrogram and then recurrent layers model temporal dynamics. Several works [18], [29], [30] have attempted to utilize a temporal attention mechanism to focus on the more informative frames.
In [30], an architecture was designed which takes advantage of CNNs to model spectral information, Long Short-Term Memory (LSTM) layers to model long-term temporal information, and deep neural networks to classify the obtained features. Also, an attention mechanism is used to determine the importance of different time outputs of the LSTM layer. In [18], a multi-stream CNN equipped with temporal attention was proposed which relies on three different input streams consisting of raw audio and spectral features. Increasing the depth and complexity of deep learning models intensifies the need for huge amounts of training data. To deal with the scarcity of labeled training data for ESC, data augmentation techniques have been employed in the literature. For example, in [17], [21], and [31], deformations such as time stretching, pitch shifting and dynamic range compression were applied to the available data to generate new training samples. Several works [19], [25], [32] employed the mixup technique, another widely used data augmentation method which simply mixes different features and their labels with one another [33]. While these approaches improve the performance of CNN-based classifiers, they typically suffer from a large parameter space and high complexity.

B. FEATURE POOLING METHODS
To the best of our knowledge, global pooling layers have not been investigated in the architectures developed for ESC, and simple average pooling is used in most networks [20], [34], [35]. To gain insights into other pooling strategies, here, we review some of the pooling methods proposed in the visual domain. Stochastic pooling [36] randomly picks the activation within each pooling region according to a multinomial distribution. S3Pool [37] takes the maximum of activations of a sub-region which is randomly selected within the pooling region. Detail-preserving pooling [38] computes weighted average pooling where the weights are proportional to the differences of activations. Rank-based pooling methods [39], [40], [41] sort the activations in descending order and compute the average of the top-K values. In rank-based weighted pooling [41], prior to taking the sum of activations, each activation is weighted by a coefficient computed based on its rank. To reflect the advantages of both max and average pooling, the mixed pooling method chooses between the two pooling values, where the selection is done based on a random coefficient [42]. In [43], the stochastic coefficient is replaced with a learnable coefficient in order to benefit from the two pooling methods at the same time. These pooling methods have mostly been used between convolutional layers, and GAP is still used as the global feature pooling at the end of most popular architectures. Some researchers have attempted to improve the averaging mechanism of GAP. To increase the focus on more semantically important regions, entropy pooling [44] assigns entropy-based weights to different features prior to computing the average value. In [45], AlphaMEX was proposed, which uses a log mean exponential function to aggregate the feature maps. Stochastic region pooling [46] randomly selects some regions within each feature map and takes the average value of the selected features. This is done to encourage the responses of detail features to be enhanced during training, and to make the network learn more representative channel descriptors. Global learnable pooling [47], as another weighted average pooling method, assigns a large number of learnable weights to all features within the output feature maps, where the weights learn to highlight the contribution of more distinctive features.
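As a small illustration of how some of the reviewed global descriptors differ, the following sketch compares global average pooling, global max pooling, and a rank-based top-K average on a single toy feature map. The function names and the toy values are assumptions made for illustration only and do not come from the cited works.

```python
import numpy as np


def global_average_pool(fmap):
    """GAP: mean over every position of one feature map."""
    return float(fmap.mean())


def global_max_pool(fmap):
    """Global max pooling: the single largest activation."""
    return float(fmap.max())


def topk_average_pool(fmap, k):
    """Rank-based pooling: average of the k largest activations."""
    ranked = np.sort(fmap.ravel())[::-1]  # descending order
    return float(ranked[:k].mean())


# Toy 6 x 4 map (hypothetical values): two active cells, the rest silent.
fmap = np.zeros((6, 4))
fmap[2, 1] = 8.0
fmap[3, 1] = 4.0

gap_value = global_average_pool(fmap)      # 12 / 24 = 0.5
max_value = global_max_pool(fmap)          # 8.0
topk_value = topk_average_pool(fmap, k=2)  # (8 + 4) / 2 = 6.0
print(gap_value, max_value, topk_value)
```

In this toy case the average is dominated by the silent cells, whereas the rank-based descriptor keeps only the strongest responses.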

III. PROPOSED GLOBAL POOLING METHOD
A. SPARSE SALIENT REGION POOLING (SSRP)
In this section, we present the proposed method which aims at increasing the focus of a lightweight CNN on the most salient input T-F patterns. These spectro-temporal patterns are formed by energy modulations of frequency content across the temporal dimension with horizontal or diagonal directions [48]. Identifying these class-dependent salient patterns in the input representation has many difficulties and requires enough acoustic knowledge of different sound classes to develop low-level saliency clues. Developing a separate subnetwork which learns to identify these patterns, and then utilizing its decisions for the main network, similar to [49], can also be considered as an option, but it would greatly increase the model complexity and the total number of model parameters, and intensify the need for training data. We, instead, introduce a new global pooling scheme which affects the training of the model so that it is gradually biased towards learning the most salient input patterns.
The most commonly used global pooling method, GAP, treats all features of each CNN output channel equally, and mixes them into a single-value channel descriptor by calculating the mean value

$$z_c = \frac{1}{T F} \sum_{t=1}^{T} \sum_{f=1}^{F} m^D_c(t, f)$$

where $m^D_c \in \mathbb{R}^{T \times F}$ is the $c$-th channel of the $D$-th convolutional layer (last layer) containing $T$ temporal frames of $F$ frequency bands. The input Receptive Fields (RFs) of many of these features can correspond to different non-overlapping T-F regions of the input representation, depending on the total pooling size in the previous layers. Thus, the saliency of information embedded in different components of $m^D_c$ may vary a lot because of the varied spectral and temporal characteristics of environmental sounds, as mentioned before. However, GAP does not consider regional saliency and lets all these features make the same contribution to the classification and then to the training of the corresponding convolutional kernels. Thus, regions of low relevance and importance may lessen the effect of regions containing discriminative patterns.
In the proposed SSRP, we restrict the computation of $z_c$ to a subset $\Omega_c$ that contains indices of very sparse features of $m^D_c$ and drop all other features outside $\Omega_c$. This is equivalent to applying a sparse binary mask with a few non-zero components to each output channel $m^D_c$ prior to computing the channel descriptor. By imposing such a regional bottleneck on the flow of information in the forward path, only a small percentage of input T-F regions can contribute to the classification and then to the training of the kernels of the last convolutional layer.¹ At the early steps of training, the model may focus on capturing information within the less salient or even irrelevant patterns and thus the subset would contain no discriminating features, which would not lead to a large reduction of the total loss of the network. This keeps the gradient values large, which would force the model to search for the most informative input T-F patterns. Note that the desired T-F regions contain the most informative and frequent patterns that occur in most of the samples of each target sound class. Thus, we expect further training steps to encourage the responses of salient patterns in the output feature maps ($m^D_c$) to gradually intensify such that the selected sparse subset $\Omega_c$ is enriched with more discriminative features during training. To achieve this, we consider the selection routine of members of $\Omega_c$ as a decisive part of the proposed scheme. For instance, if the sparse subset $\Omega_c$ is formed based on a stochastic selection routine (as in [37] and [46]), the variation of features could be very large between successive training steps and also no discriminative features may be involved in many training steps; attributes that will cause the idea to fail (as will be shown in Figure 7). Hence, we observe temporal and spectral characteristics in the design of SSRP. First, we note that each particular frequency band of $m^D_c$ is of different importance among various classes. Thus, SSRP pools each frequency band separately to prevent discriminative features with lower activation levels from being weakened by more dominant features at other frequency bands. Second, the most informative and salient patterns have strong local correlations which should be preserved under the imposed restriction. Discarding parts of these features or including features of non-salient patterns would result in undesirable increased variations in the channel descriptors. Due to the mentioned diverse temporal structures, these patterns may last within a single local temporal interval of varied length, or may reside within multiple distant intervals, depending on the sound class. Hence, SSRP should also capture temporal dependencies among different temporal frames. Taking these aspects into consideration, we design a two-stage aggregation scheme where the first stage captures local information within different temporal intervals, and the second stage aggregates the first-stage results.

Here, we propose three simple methods to implement SSRP. In the basic form, which we refer to as SSRP-B, we first slide a rectangular window of length $W$ with unit stride over the temporal features of each frequency band of $m^D_c$ and compute the mean of the activations within each window

$$s_c(t, f) = \frac{1}{W} \sum_{i=0}^{W-1} m^D_c(t + i, f), \quad t = 1, \dots, T - W + 1 \tag{1}$$

Then, in the second stage, $s_c(t, f)$ is aggregated into $z_c(f)$ by picking the highest mean of activations

$$z_c(f) = \max_{t} s_c(t, f) \tag{2}$$

This way, the members of $\Omega_c$ correspond to the local features within the temporal windows selected at different frequency bands. Using the long-term max pooling across the time dimension in (2) is in accordance with the fact that patterns are invariant to high amounts of translation in the time domain, and makes SSRP insensitive to the position of salient patterns. Also, the temporal interval of the selected windows can be different among frequency bands, as illustrated in Figure 2. Thus, $\Omega_c$ is not limited to similar temporal intervals for all frequency bands and can contain discriminative features of different frequency bands irrespective of their temporal positions.

¹ As will be discussed in Section III.B, SSRP also disrupts the flow of gradients in the backward path and thus affects the training of convolutional kernels of the previous layers.
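The two stages of SSRP-B described above (a unit-stride moving mean over time, then a max over time for each frequency band) can be sketched in NumPy as follows. This is a minimal illustration under assumed conventions, not the authors' implementation: the `(T, F, C)` array layout, the function name `ssrp_b`, and the toy values are all hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def ssrp_b(fmap, window):
    """Sketch of SSRP-B for a feature map of assumed shape (T, F, C).

    Returns one descriptor per frequency band and channel, shape (F, C).
    """
    # Stage 1: moving mean over time with unit stride -> (T - W + 1, F, C).
    local_means = sliding_window_view(fmap, window, axis=0).mean(axis=-1)
    # Stage 2: keep the highest local mean along time for each band/channel.
    return local_means.max(axis=0)


# Toy input: 10 frames, 2 frequency bands, 1 channel, mostly silent.
fmap = np.zeros((10, 2, 1))
fmap[3:5, 0, 0] = 6.0  # short event in band 0
fmap[7:9, 1, 0] = 2.0  # weaker event, later in time, in band 1

z = ssrp_b(fmap, window=2)    # per-band descriptors: 6.0 and 2.0
gap = fmap.mean(axis=(0, 1))  # GAP mixes everything: 16 / 20 = 0.8
print(z[:, 0], gap)
```

Note how the two bands select windows at different temporal positions, and how the weaker event in band 1 keeps its own descriptor instead of being averaged together with band 0 and the silent frames.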
Note that increasing the size of $W$ in (1) would decrease the level of sparsity (i.e. $(T - W)/T$), and thus increase the diversity of features that contribute to the descriptor. For the special case of $W = T$, no restriction is imposed and SSRP-B degenerates to a frequency-dependent temporal average pooling. Also, the sparsity level would be maximum when $W = 1$, for which SSRP-B would select only the most dominant feature at each frequency band and thus may lose some discriminative information. Thus, the size of $W$ should be selected so that the descriptor can represent the feature map well enough for different sound classes. Figure 3 shows validation accuracy curves of the proposed model during training using two different sparsity levels. Comparing the accuracies achieved for sparsity levels of 92% and 81% shows that at early epochs of training, the model with the smaller sparsity level learns more quickly and achieves accuracies up to 20% higher than that of the model with the larger sparsity level. This indicates that the more severe the regional bottleneck (i.e. the larger the sparsity), the harder it is for the training process to bias convolutional kernels towards detecting and capturing information of salient patterns. However, the performance of the model with the higher sparsity level is improved by further training so that it finally achieves an accuracy of 83.50%, while the smaller sparsity level leads to an accuracy of 81.75%.
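The relation between the window length and the sparsity level can be checked with a few lines of arithmetic. The sketch below assumes a hypothetical feature map of 100 temporal frames, since the number of frames at the pooling layer is not stated here; the helper names are likewise illustrative.

```python
def sparsity_level(num_frames, window):
    """Sparsity (T - W) / T of SSRP-B for T frames and window length W."""
    if not 1 <= window <= num_frames:
        raise ValueError("window must satisfy 1 <= W <= T")
    return (num_frames - window) / num_frames


def window_for_sparsity(num_frames, target):
    """Window length whose sparsity is closest to the requested target."""
    window = round(num_frames * (1.0 - target))
    return max(1, min(num_frames, window))


T = 100  # hypothetical number of frames, chosen only for illustration
w_92 = window_for_sparsity(T, 0.92)  # 8 frames
w_81 = window_for_sparsity(T, 0.81)  # 19 frames
print(w_92, sparsity_level(T, w_92))
print(w_81, sparsity_level(T, w_81))
print(sparsity_level(T, T))          # W = T gives zero sparsity
```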
The basic form of SSRP has one degree of freedom to aggregate temporal local features and is able to capture one single interval of fixed size. Considering the complex temporal structures of environmental sound classes, this would cause some discriminative information to be lost or some features extracted from non-salient patterns to be a member of the sparse subset. Also, it would lose the global temporal dependencies between distant salient patterns. To address these issues, we develop two other forms of SSRP. In the second form, we compute the moving mean of activations using $L$ levels of increasingly coarser windows

$$s^l_c(t, f) = \frac{1}{W_l} \sum_{i=0}^{W_l - 1} m^D_c(t + i, f), \quad l = 0, \dots, L - 1 \tag{3}$$

where $W_l$ is the length of the window at level $l$, and $W_0$ is the size of the finest window. Then, the temporal channel descriptor for each frequency band $z_c(f)$ is computed by taking the average of the $L$ highest means of activation (one max at each level)

$$z_c(f) = \frac{1}{L} \sum_{l=0}^{L-1} \max_{t} s^l_c(t, f) \tag{5}$$

We name this form SSRP-Multi-Scale (MS), as it combines multiple scales of local features; the finest level captures more local features while the coarsest level can preserve patterns of longer temporal correlation.
Moreover, as illustrated in Figure 4, if the selected windows are overlapping, then the total weights used to combine the selected features can be different from the uniform weighting of SSRP-B, depending on the amount of overlaps.
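A minimal NumPy sketch of the multi-scale form is given below. The rule used to grow the window from one level to the next (doubling, $W_l = 2^l W_0$) is an assumption made only so that the example is runnable; the `(T, F, C)` layout and the function name `ssrp_ms` are also hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def ssrp_ms(fmap, base_window, levels):
    """Sketch of SSRP-MS for a feature map of assumed shape (T, F, C).

    ASSUMPTION: the window doubles at each level (W_l = base_window * 2**l).
    """
    num_frames = fmap.shape[0]
    level_maxima = []
    for level in range(levels):
        window = min(base_window * 2 ** level, num_frames)
        # Moving mean at this scale -> (T - W_l + 1, F, C).
        local_means = sliding_window_view(fmap, window, axis=0).mean(axis=-1)
        # One maximum per level, band and channel.
        level_maxima.append(local_means.max(axis=0))
    # Average the per-level maxima -> (F, C).
    return np.mean(level_maxima, axis=0)


# Toy input: 8 frames, 1 band, 1 channel, one event lasting 2 frames.
fmap = np.zeros((8, 1, 1))
fmap[2:4, 0, 0] = 4.0

z_ms = ssrp_ms(fmap, base_window=2, levels=2)
# Level 0 (W = 2): best mean is 4.0. Level 1 (W = 4): best mean is 2.0.
print(z_ms[0, 0])  # (4.0 + 2.0) / 2 = 3.0
```

Because the coarse window contains the fine one in this example, the two event frames are counted at both levels while their neighbours are counted only once, which is the kind of non-uniform weighting that overlapping windows produce.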
The third form we propose, referred to as SSRP-Top (T), also uses multiple intervals of local features but using fixed-size windows. In the first stage of SSRP-T, the moving mean of activations $s_c(t, f)$ is computed via (1), the same as SSRP-B. After sorting these locally aggregated features across the time dimension in descending order, $s_c(\cdot, f) \rightarrow Q = (q_1, \dots, q_n)$ with $q_1 > \dots > q_n$, the temporal descriptor $z_c(f)$ is calculated by taking the average of the top $K$ values of $Q$

$$z_c(f) = \frac{1}{K} \sum_{k=1}^{K} q_k \tag{6}$$

Therefore, SSRP-T selects $K$ different intervals that have the largest mean of activations at each frequency band. This way, the $K$ intervals can be overlapping and close to each other to capture temporal structures of different lengths, or far from each other to capture intermittent temporal structures. Thus, SSRP-T can also assign non-uniform weights to features selected by $\Omega_c$, when the windows are overlapping.
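The top-K form can be sketched in the same way: compute the moving means of (1), sort them along time, and average the K largest. As before, the `(T, F, C)` layout, the function name `ssrp_t`, and the toy values are illustrative assumptions rather than the authors' code.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def ssrp_t(fmap, window, k):
    """Sketch of SSRP-T for a feature map of assumed shape (T, F, C)."""
    # Stage 1: moving mean with unit stride -> (n, F, C), n = T - W + 1.
    local_means = sliding_window_view(fmap, window, axis=0).mean(axis=-1)
    if not 1 <= k <= local_means.shape[0]:
        raise ValueError("k must satisfy 1 <= K <= T - W + 1")
    # Stage 2: sort along time in descending order and average the top K.
    ranked = np.sort(local_means, axis=0)[::-1]
    return ranked[:k].mean(axis=0)


# Toy input: 8 frames, 1 band, 1 channel, a strong and a weak event.
fmap = np.zeros((8, 1, 1))
fmap[1:3, 0, 0] = 6.0
fmap[5:7, 0, 0] = 2.0

# Moving means for W = 2 are [3, 6, 3, 0, 1, 2, 1].
z_top1 = ssrp_t(fmap, window=2, k=1)  # 6.0, identical to SSRP-B
z_top3 = ssrp_t(fmap, window=2, k=3)  # (6 + 3 + 3) / 3 = 4.0
print(z_top1[0, 0], z_top3[0, 0])
```

With K = 1 the sketch reduces to SSRP-B; with K = 3 the three selected windows overlap around the strong event, so its frames receive more weight than their neighbours.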
After computing the frequency-wise descriptors $z_c(f) \in \mathbb{R}^{F \times C}$, the final representation vector is formed by concatenation of $z_c(f)$ along both frequency and channel dimensions, as illustrated in Figure 2. Thus, SSRP not only pools each frequency band separately, but also lets the classifier learn to weight them appropriately. This frequency-awareness enables the classifier to distinguish between features extracted from similar patterns but at different frequency bands. In order to investigate the impact of the mentioned frequency-awareness, we derive another variant of the basic form, referred to as SSRP-No Frequency Awareness (SSRP-NFA), which mixes the information of different frequency bands by taking the average of $m^D_c$ along the frequency dimension before applying (1)-(2). Hence, SSRP-NFA leads to a final representation vector of size $1 \times C$ rather than $F \times C$. In addition, we consider another variant of SSRP, named SSRP-Random (SSRP-R), where the temporal windows forming the sparse subset $\Omega_c$ are selected randomly.
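The effect of frequency-awareness can be illustrated by comparing the concatenated SSRP-B descriptors with the SSRP-NFA variant on two toy inputs that contain the same pattern in different frequency bands. The helper names, the `(T, F, C)` layout and the flattening order are assumptions for this sketch.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def ssrp_b(fmap, window):
    """SSRP-B sketch: (T, F, C) feature map -> (F, C) descriptors."""
    local_means = sliding_window_view(fmap, window, axis=0).mean(axis=-1)
    return local_means.max(axis=0)


def ssrp_vector(fmap, window):
    """Concatenate band-wise descriptors into one vector of length F * C."""
    return ssrp_b(fmap, window).reshape(-1)


def ssrp_nfa_vector(fmap, window):
    """SSRP-NFA sketch: average the bands first, then apply (1)-(2)."""
    band_average = fmap.mean(axis=1, keepdims=True)  # (T, 1, C)
    return ssrp_b(band_average, window).reshape(-1)  # length C


# Same temporal pattern placed in a low band and in a high band.
low = np.zeros((6, 2, 1))
low[2:4, 0, 0] = 4.0
high = np.zeros((6, 2, 1))
high[2:4, 1, 0] = 4.0

print(ssrp_vector(low, 2), ssrp_vector(high, 2))          # [4. 0.] [0. 4.]
print(ssrp_nfa_vector(low, 2), ssrp_nfa_vector(high, 2))  # [2.] [2.]
```

The frequency-aware vectors differ, so a dense layer can weight the two bands differently, whereas the NFA descriptors of the two inputs are indistinguishable.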

B. THE IMPACT OF SSRP ON THE BACK-PROPAGATION PROCESS
The proposed global pooling scheme has a great impact on the training of the kernels of the last convolutional layer, and of the previous layers although to a lesser extent. To illustrate this, let $z_c$, $E$ and $\delta_c = \partial E / \partial z_c$ be the $c$-th channel descriptor calculated by the global pooling layer, the final loss function, and the back-propagated errors from the dense layers, respectively. Based on the convolution operation, the $c$-th output channel is computed by

$$m^D_c(t, f) = \sum_{v=1}^{C_{D-1}} \sum_{a=1}^{K_t} \sum_{b=1}^{K_f} w^D_{a,b,v,c} \, m^{D-1}_v(t + a, f + b) \tag{7}$$

where $w^D_{a,b,v,c} \in \mathbb{R}^{K_t \times K_f \times C_{D-1} \times C_D}$ is the weight of the convolutional kernel of size $K_t \times K_f$ which is convolved with the $v$-th channel of the feature maps of the $(D-1)$-th layer, $m^{D-1}_v$, to produce the $c$-th channel of the $D$-th convolutional layer. Hence, the gradient of the loss w.r.t. $w^D_{a,b,v,c}$ when GAP is used as the global pooling layer would be

$$\frac{\partial E}{\partial w^D_{a,b,v,c}} = \delta_c \, \frac{1}{T F} \sum_{t=1}^{T} \sum_{f=1}^{F} m^{D-1}_v(t + a, f + b) \tag{8}$$

As can be seen, the updating value for each weight of the last convolutional layer ($\partial E / \partial w^D_{a,b,v,c}$) depends on the average of all components of the corresponding feature map (the $v$-th) of its input. The same computation can be performed for the proposed SSRP methods via $z_c$ computed in (2), (5), and (6). In the above operations, zero gradients are assigned for non-maximum values when using the max operator in the forward path. Thus, $\partial z_c(f) / \partial m^D_c(t, f)$ has non-zero value(s) (unit value) only for temporal frame(s) within $\Omega_c$. According to equations (9) and (10), only the input regions indexed by $\Omega_c$ contribute to the updating values of the kernels of the last convolutional layer. Comparing equations (11)-(13) shows that the large restriction imposed on the feature maps of the final convolutional layer has also affected the updating values of the layer before this layer. However, as illustrated in Figure 5, the T-F regions contributing to $\partial E / \partial w^{D-1}_{a,b,v,c}$ are larger than those of the final convolutional layer due to the summation over channels ($c$), which aggregates the locations of non-zero gradients from different channels of the $D$-th layer. Therefore, we infer that the closer a layer is to the SSRP layer, the more restriction is imposed on the training of its convolutional kernels.
This effect is reasonable as the bottom layers, which learn to extract more general features, are trained using larger T-F regions of input and learn from more diverse patterns. But the training of the top layers, which learn to extract more abstract features, is done more selectively using less varied patterns.
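The claim that only the features inside the selected windows influence the descriptor, and therefore receive gradient, can be checked numerically for the basic form. The sketch below builds the binary support mask of SSRP-B for a single channel and perturbs one feature inside and one outside the mask. The `(T, F)` layout, the helper names and the toy values are assumptions for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def ssrp_b_with_starts(fmap, window):
    """SSRP-B on one channel of shape (T, F): descriptors and window starts."""
    local_means = sliding_window_view(fmap, window, axis=0).mean(axis=-1)
    return local_means.max(axis=0), local_means.argmax(axis=0)


def support_mask(fmap, window):
    """Boolean (T, F) mask marking the features inside the selected windows."""
    _, starts = ssrp_b_with_starts(fmap, window)
    mask = np.zeros(fmap.shape, dtype=bool)
    for band, start in enumerate(starts):
        mask[start:start + window, band] = True
    return mask


T, F, W = 10, 2, 2
fmap = np.full((T, F), 0.1)  # low-level background activity
fmap[2:4, 0] = 5.0           # salient event in band 0
fmap[6:8, 1] = 3.0           # salient event in band 1

mask = support_mask(fmap, W)
z, _ = ssrp_b_with_starts(fmap, W)

outside = fmap.copy()
outside[0, 0] += 0.01  # perturb a feature outside the mask
inside = fmap.copy()
inside[2, 0] += 0.01   # perturb a feature inside the mask

z_outside, _ = ssrp_b_with_starts(outside, W)
z_inside, _ = ssrp_b_with_starts(inside, W)

print(mask.mean())              # fraction of features kept: 4 / 20 = 0.2
print(z_outside - z)            # no change: zero gradient outside the mask
print(z_inside[0] - z[0])       # about 0.01 / W = 0.005
```

In this toy case 80% of the features of the channel have no influence on the descriptors, so the corresponding input regions do not contribute to the kernel updates of the last layer.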

A. EXPERIMENT SETUP
We evaluate the proposed global pooling method on two publicly available datasets: ESC-50 and ESC-10. ESC-50 consists of 2000 5-second recordings, with a total duration of 168 minutes, from 5 major categories: animals (e.g. dog barking), natural soundscapes and water sounds (e.g. chirping birds and pouring water), human non-speech sounds (e.g. clapping), domestic sounds (e.g. keyboard typing) and exterior sounds (e.g. car horn), each containing 10 equally balanced classes of sound events [50]. ESC-10 consists of 10 classes selected from ESC-50, with 40 samples for each class and a total duration of 33 minutes.
We use a simple and small-size CNN architecture with three convolutional layers with kernel sizes of 3 × 3. Each convolutional layer is followed by batch normalization [51] and Rectified Linear Unit (ReLU) activation. The number of kernels in each convolutional layer is twice the number in its preceding layer, i.e. 32 → 64 → 128. Similar to [20], average pooling layers with kernel sizes of 2 × 2 are applied to the first two convolutional layers, as average pooling has resulted in higher performance than max pooling [52]. The input of the network is a one-channel log-mel spectrogram of size 431 × 40 × 1 (T × F × 1), which is extracted by using 40 mel filters, with window and hop sizes of 1024 (46.4 ms) and 256 (11.6 ms), respectively. The convolutional section is followed by the global pooling layer and one dense layer containing 128 neurons with ReLU activations. Dropout is applied after the global pooling and the dense layer with a rate of 0.5. Finally, the output probabilities are produced using a softmax layer. The model is trained to optimize the categorical cross-entropy loss using stochastic gradient descent with a batch size of 64, and learning rate and momentum of 0.1 and 0.9, respectively. We use the five-fold cross-validation setup provided in the ESC-10 and ESC-50 datasets, and the mean accuracy of the five folds is reported as the final accuracy. To boost our model's performance, we employ mixup data augmentation [33], which extends the distribution of training data by mixing two random training samples and their one-hot labels via a random mix ratio drawn from a beta distribution with α = 0.2. We employ the Keras library with TensorFlow as backend to implement the proposed model, and Librosa [53] for audio processing and feature extraction. All models are run on a system with 16 GB of RAM and an NVIDIA GTX 1060 GPU.
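The mixup step described above can be sketched in NumPy as follows. Whether one mix ratio is drawn per batch or per sample is not stated here, so this sketch assumes a single ratio per batch; the function name `mixup_batch` and the random toy batch are hypothetical.

```python
import numpy as np


def mixup_batch(x, y, alpha=0.2, rng=None):
    """Mix each sample (and its one-hot label) with a randomly chosen partner.

    ASSUMPTION: one ratio lam ~ Beta(alpha, alpha) is shared by the batch.
    """
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)
    partner = rng.permutation(len(x))
    x_mix = lam * x + (1.0 - lam) * x[partner]
    y_mix = lam * y + (1.0 - lam) * y[partner]
    return x_mix, y_mix, lam


rng = np.random.default_rng(0)
x = rng.random((4, 431, 40, 1))  # toy batch shaped like the log-mel input
y = np.eye(50)[[0, 3, 7, 3]]     # one-hot labels for 50 classes

x_mix, y_mix, lam = mixup_batch(x, y, alpha=0.2, rng=rng)
print(x_mix.shape, y_mix.sum(axis=1))  # labels still sum to 1 per sample
```

With α = 0.2 the beta distribution concentrates near 0 and 1, so most mixed samples stay close to one of the two originals.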
Table 1 compares the performance of our five-layer CNN model equipped with the proposed global pooling method with that of other models reported in previous works. Moreover, we examine the effect of doubling the number of convolutional kernels and dense-layer neurons on the model performance. This model, referred to as CNN-D, still has far fewer parameters than state-of-the-art models.

B. PERFORMANCE COMPARISON WITH STATE-OF-THE-ART
From Table 1, we see that the proposed SSRP layers show remarkable improvements over GAP. SSRP does not include any learnable parameters in its descriptor computations; the slight increase in the size of the model equipped with SSRP, compared to the one equipped with GAP, is due to computing multiple frequency-dependent descriptors from each channel, which increases the number of weights of the dense layer. Both SSRP-MS and SSRP-T improve on SSRP-B, which indicates the effectiveness of using multiple windows. Compared to the pioneering work of Piczak et al., taken as the baseline model, CNN+SSRP-T, with a model size below 0.01 of that of PiczakCNN, shows 19.8% and 14.3% higher accuracies than PiczakCNN on the ESC-50 and ESC-10 datasets, respectively. Also, our model with only 0.2 M parameters yields 3.4% higher accuracy than EnvNet-v2 (with 101 M parameters) on ESC-10.
Results show that the model performance is further enhanced on ESC-50 (up to 2.1% absolute improvement) by doubling the kernels/neurons of the model. As can be seen, the best performance of the proposed CNN-D (with 0.7 M parameters) on ESC-50 is obtained by using SSRP-T, which achieves 86.7%. In addition, CNN-D produces an accuracy of 94.8% on ESC-10. When comparing with the other models of Table 1, it should be noted that [19], [25], [54] benefit from temporal attention mechanisms, and the models proposed in [25] and [54] utilize Recurrent Neural Networks (RNNs) to model the temporal structure of environmental sound signals. Also, [18] and [19] employ multiple input feature channels, and [55] fine-tunes an ensemble of well-known pretrained image classification networks, such as Inception, ResNet50 and ResNet101. In contrast, the proposed CNN-D model equipped with SSRP-MS/T achieves higher or comparable accuracies without using multiple input feature channels, attention mechanisms, recurrent networks, or a large number of learnable parameters. Figure 6 shows the normalized confusion matrix generated by the proposed CNN-D+SSRP-T model for the ESC-50 dataset. Most classes achieve a classification accuracy higher than 80%. In particular, all signals belonging to the pouring water class are classified correctly (100% accuracy). On the other hand, 10% of cat samples and 12% of footsteps samples are misclassified as crying baby and door knock, respectively. Moreover, only 38% of helicopter signals are classified correctly, and the airplane, washing machine and engine classes have attracted 14%, 11% and 10% of helicopter samples, respectively. These misclassifications could be due to similar characteristics between these environmental sounds. Nevertheless, the overall classification results are highly satisfactory.
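For reference, a normalized confusion matrix of the kind discussed above can be computed as in the following minimal sketch, in which each row (true class) is divided by its number of samples so that the diagonal holds per-class accuracies. The function name and the integer label encoding are assumptions for illustration, not part of the original experiments.

```python
import numpy as np

def normalized_confusion_matrix(y_true, y_pred, num_classes):
    """Row-normalized confusion matrix (illustrative sketch).

    y_true, y_pred: integer class labels in the range [0, num_classes).
    Entry (i, j) is the fraction of class-i samples that were predicted as class j.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    counts = np.zeros((num_classes, num_classes), dtype=np.int64)
    # Accumulate one count for every (true, predicted) pair.
    np.add.at(counts, (y_true, y_pred), 1)
    row_sums = counts.sum(axis=1, keepdims=True)
    # Guard against division by zero for classes without any sample.
    return counts / np.maximum(row_sums, 1)
```

Under this normalization, a statement such as "12% of footsteps samples are misclassified as door knock" corresponds to a single off-diagonal entry of the footsteps row.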

C. ANALYSIS OF COMPUTATIONAL COMPLEXITY
In order to further demonstrate the efficiency of the proposed method, we also evaluated the computational complexity of the proposed model on the ESC-50 dataset. Besides the number of model parameters, studied in Table 1, computational complexity is another aspect of significant importance when implementing the model in low-resource and low-power applications. To evaluate the computational complexity, we computed the number of floating point operations (FLOPs) (see footnote 2) and the inference time of the model when processing a sample audio recording file (see footnote 3). Table 2 shows the values computed for the proposed model, the baseline model of ESC-50 [17], and three other convolutional networks [25], [27], [35] (see footnote 4). From Tables 1 and 2, we can see that EnvNet-v2 has improved the accuracy of the baseline model at the cost of a considerable increase in FLOPs and inference time (i.e. 3.19 billion more FLOPs and 440 milliseconds more time). Also, the residual network and CRNN have greatly increased either FLOPs or inference time. However, the inference time of the proposed model is very close to that of the baseline model, and its increase in FLOPs (0.1 billion) is much smaller than that of the other models. Therefore, the comparison indicates that our 5-layer CNN equipped with the proposed SSRP layer efficiently utilizes the network capacity, and results in similar or higher accuracies than the compared models, but with much lower computational complexity.
Footnote 2: A convolutional layer with N output channels of size w × h, kernels of size k_w × k_h, and M input channels performs N × M × w × h × k_w × k_h multiplication operations, and almost the same number of addition operations.
Footnote 3: To estimate the inference time, we computed the average inference time needed to process all 400 samples within one validation fold of ESC-50.
Footnote 4: For the FLOPs presented in Table 2, we reproduced the network architectures according to their descriptions and used the Ptflops library.
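The per-layer multiplication count given in the footnote on convolutional layers can be applied to a stack of layers as in the sketch below. It is only an illustration and does not reproduce the values of Table 2: it assumes that every convolution preserves the spatial size of its input ("same" padding), that 2 × 2 pooling with flooring follows all but the last convolution, and it ignores batch normalization, pooling, the dense layers, and additions. The function names are hypothetical.

```python
def conv_multiplications(n_out, m_in, w, h, k_w, k_h):
    """Multiplications of one convolutional layer: N x M x w x h x k_w x k_h."""
    return n_out * m_in * w * h * k_w * k_h

def conv_stack_multiplications(t, f, channels=(32, 64, 128), kernel=3, pool=2):
    """Total multiplications of a stack of convolutions (illustrative sketch).

    t, f: temporal and frequency size of the one-channel input.
    Assumes size-preserving convolutions and pooling after all but the last layer.
    """
    total = 0
    m_in = 1  # one-channel log-mel spectrogram at the input
    for i, n_out in enumerate(channels):
        total += conv_multiplications(n_out, m_in, t, f, kernel, kernel)
        if i < len(channels) - 1:
            # Pooling shrinks the map seen by the next layer (floor division assumed).
            t, f = t // pool, f // pool
        m_in = n_out
    return total

# Example: the 431 x 40 input and the 32 -> 64 -> 128 layout described in the text.
print(conv_stack_multiplications(431, 40))
```

Because the count scales with the output map size, pooling early in the network reduces the cost of all later layers, which is one reason a shallow three-layer stack stays inexpensive.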

D. IMPACT OF HYPER-PARAMETERS
Referring to Eqs. (1)-(6), we conduct several experiments on ESC-50 to study the impact of three hyper-parameters: the window size W, the maximum scale level L in SSRP-MS, and the number of windows K in SSRP-T.
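To make the role of the window size W concrete, the following NumPy sketch shows one plausible reading of the basic variant (SSRP-B). It assumes that, for each channel and each frequency band, the descriptor is the largest mean response over any W consecutive temporal frames. The exact definition is the one given by Eqs. (1)-(6); the function name ssrp_basic and the (T, F, C) array layout are assumptions made for illustration only.

```python
import numpy as np

def ssrp_basic(features, window):
    """Sketch of a sparse, frequency-aware global pooling (assumed form of SSRP-B).

    features: array of shape (T, F, C) = (time frames, frequency bands, channels).
    window: temporal window size W, with 1 <= W <= T.
    Returns an (F, C) array holding one descriptor per frequency band and channel.
    """
    t = features.shape[0]
    if not 1 <= window <= t:
        raise ValueError("window must satisfy 1 <= W <= T")
    # Cumulative sums along time give all sliding-window sums in a single pass.
    csum = np.cumsum(features, axis=0)
    csum = np.concatenate([np.zeros_like(features[:1]), csum], axis=0)
    window_means = (csum[window:] - csum[:-window]) / window  # shape (T - W + 1, F, C)
    # Keep only the most responsive window of every (frequency band, channel) pair.
    return window_means.max(axis=0)
```

Under this reading, W = T removes the sparsity and reduces the descriptor to a per-band temporal average, while W = 1 reduces it to a per-band temporal maximum; intermediate values of W trade off between the two.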
First, we evaluate the impact of sparsity on the performance of the model equipped with SSRP-B by using different window sizes W. Figure 7 shows the obtained results. When no sparsity is imposed (i.e. W = 431), SSRP-B achieves an accuracy of 79.3%, an 11.6% absolute improvement over GAP. This large gain, achieved under the no-sparsity condition, indicates the considerable importance of the frequency-awareness derived from pooling different frequency bands separately, which prevents their salient information from being mixed. We can see that decreasing W from its maximum value (i.e. T = 431), and thus imposing sparsity, significantly improves the accuracies obtained for both SSRP-NFA and SSRP-B. Strikingly, the best accuracies are obtained for very short temporal windows. For instance, SSRP-B reaches its highest accuracy (83.5%) using W = 6, which is equivalent to about 0.01 of the total frames or 140 ms of the input signal. This verifies that the most salient information may reside in a very small interval of frames and that imposing sparsity could make the model learn to extract the most salient information. Also, a large range of window sizes (6 ≤ W ≤ 35) leads to accuracies at most 1% lower than the maximum accuracy of SSRP-B. This indicates that using a single window of fixed size, and thus applying a similar level of sparsity to all sounds, might not be the appropriate choice given the differences in the temporal structure of different sound classes. Moreover, the accuracies obtained by SSRP-R confirm that random selection of the sparse subset c could notably degrade the model performance. These results were obtained by using no temporal pooling between convolutional layers. This minimizes the input temporal RF of the network down to 7 frames, as illustrated in Figure 8, which means that the CNN output features correspond to input T-F regions with smaller amounts of overlap.
Hence, more temporal resolution is available to SSRP to distinguish salient T-F regions from other less relevant regions. However, the lack of temporal pooling could decrease the CNN's ability to learn T-F patterns, and thus a trade-off must be made. We found that the model performance is maximized at a temporal pool size of 2, and that a further increase in pooling size degrades the model performance.
Figure 9 shows the accuracies obtained for different sets of (W, L) and (W, K) used for SSRP-MS and SSRP-T, respectively. The results clearly indicate that using multiple windows improves the model performance, especially when the window size is small, which lets the model capture both small transient T-F patterns and those with longer temporal structures. Both SSRP-MS and SSRP-T outperform the basic form. The best accuracy (84.9%) is achieved by SSRP-MS using the pair (W = 2, L = 5), and the model performance declines for larger window sizes (e.g. 83.1% for W = 10 and L = 4).
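The 7-frame temporal receptive field (RF) quoted above for the configuration without temporal pooling can be checked with the standard receptive-field recursion, sketched below. The function name and the (kernel, stride) encoding of layers are assumptions for illustration; the second example further assumes that a temporal pooling of size 2 uses a stride of 2.

```python
def temporal_receptive_field(layers):
    """Input receptive field along time of a stack of layers (illustrative sketch).

    layers: sequence of (kernel_size, stride) pairs along the temporal axis,
            listed from the input side to the output side.
    """
    rf, jump = 1, 1  # RF of one output unit, and input-frame spacing between units
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

# Three 3 x 3 convolutions without temporal pooling: RF of 7 input frames.
print(temporal_receptive_field([(3, 1), (3, 1), (3, 1)]))

# Same convolutions with a size-2 temporal pooling (stride 2 assumed) after the first two.
print(temporal_receptive_field([(3, 1), (2, 2), (3, 1), (2, 2), (3, 1)]))
```

The recursion makes the trade-off explicit: each temporal pooling stage widens the input region summarized by one output feature, which increases the overlap between neighboring output features and reduces the temporal resolution available to the global pooling layer.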

E. VISUAL ANALYSIS OF THE PROPOSED SSRP
To better understand the effect of the proposed global pooling method, we study the areas of the input log-mel spectrogram on which the SSRP layer focuses for different sound classes. To this end, we form a mask for each output channel, where the nonzero values correspond to the input RFs of the members of the sparse subset. Also, we use Grad-CAM [61] to visualize the network's responses. Figure 10 shows the resulting masks and Grad-CAM heat-maps produced for three log-mel spectrogram samples and exemplary output channels. The produced masks (Figure 10 (b)) show that each channel focuses on specific time-frequency regions which are different from those of other channels. They also show that the members of the sparse subset for each channel correctly correspond to the essential temporal frames, while the silent frames are ignored. Comparing the Grad-CAM visualizations of GAP and SSRP indicates that SSRP not only focuses on the high-energy T-F regions but also intensifies the responses of low-energy T-F regions. For instance, the focus on low-energy regions at the low frequencies of the clock tick sample (mel indices in the range of 0-10) or at the high frequencies of the crying baby sample (mel indices in the range of 30-40) is increased by SSRP. This suggests that many low-energy T-F regions may contain salient patterns which can contribute to the classification of different sound events more than high-energy T-F regions. This is more apparent in the visualization obtained for the chainsaw sample, where the highest-energy regions exhibit lower contributions to the classification than the low-energy T-F patterns that reside at mel indices in the range of 30-40 in the frames close to the end of the signal.
VOLUME 11, 2023

V. CONCLUSION
In this paper, we proposed a global pooling method, called SSRP, which aims at enhancing the representation ability of a lightweight CNN via reducing the effect of less informative T-F regions on the training of convolutional kernels.
By observing spectral and temporal characteristics of environmental sounds, SSRP restricts the features that contribute to channel descriptors to a sparse subset of features, which makes the model learn from the more salient T-F regions. Experimental results on ESC-10 and ESC-50 demonstrated that our CNN model equipped with SSRP achieves comparable performance to the state-of-the-art methods, with much smaller model size and much lower complexity. Our model makes absolute improvements of 14.3% and 21.8% in accuracy relative to the baseline models of the ESC-10 and ESC-50 datasets, respectively. The obtained results indicated the importance of separately pooling each frequency band, which helps the model distinguish between patterns that occur at different frequency bands. Moreover, our experiments revealed that the most salient information can reside in very short temporal intervals, and thus imposing a high degree of sparsity at each frequency band can significantly boost the model performance. We also performed visual analyses which indicated that the responses of low-energy salient T-F regions are also intensified by SSRP, and can contribute even more than high-energy T-F regions to the classification of specific sound classes. Considering the capability of the proposed model to ignore silent and less informative frames, our plan for future work is to test the robustness of the model against different noise scenarios.