Dynamic Convolutional Neural Networks as Efficient Pre-Trained Audio Models

The introduction of large-scale audio datasets, such as AudioSet, paved the way for Transformers to conquer the audio domain and replace CNNs as the state-of-the-art neural network architecture for many tasks. Audio Spectrogram Transformers are excellent at exploiting large datasets, creating powerful pre-trained models that surpass CNNs when fine-tuned on downstream tasks. However, current popular Audio Spectrogram Transformers are demanding in terms of computational complexity compared to CNNs. Recently, we have shown that, by employing Transformer-to-CNN Knowledge Distillation, efficient CNNs can catch up with and even outperform Transformers on large datasets. In this work, we extend this line of research and increase the capacity of efficient CNNs by introducing dynamic CNN blocks constructed of dynamic convolutions, a dynamic ReLU activation function, and Coordinate Attention. We show that these dynamic CNNs outperform traditional efficient CNNs, such as MobileNets, in terms of the performance–complexity trade-off at the task of audio tagging on the large-scale AudioSet. Our experiments further indicate that the proposed dynamic CNNs achieve competitive performance with Transformer-based models for end-to-end fine-tuning on downstream tasks while being much more computationally efficient.


I. INTRODUCTION
Pre-trained deep neural networks have emerged as a pivotal paradigm in the field of machine learning over the last few years. Leveraging transfer learning techniques, pre-trained models can significantly enhance the performance on downstream tasks, in particular when the training data is insufficient to learn an end-to-end model from scratch. Such models are typically pre-trained in a supervised or self-supervised fashion on large datasets, such as ImageNet [1] in the vision domain, or AudioSet [2] in the audio domain. While convolutional neural networks (CNNs) have been the method of choice during the past years to create pre-trained models in both the audio and the vision domain, Transformers [3] recently surpassed them due to their ability to scale up and exploit large datasets [4], [5]. With superior performance on the large pre-training datasets, Transformers then overtook CNNs also on many downstream tasks with smaller datasets. However, Transformers are computationally costly in training and inference, as the attention operation scales quadratically with respect to the processed sequence length. For this reason, CNNs maintain their dominance on resource-constrained platforms, such as mobile devices. (Code available: https://github.com/fschmid56/EfficientAT)
Recently, it has been shown in the audio domain that efficient CNNs can catch up with and even outperform Transformers on large-scale datasets when they are trained using Knowledge Distillation (KD) [6]-[8] from a Transformer ensemble. Concretely, Schmid et al. [9] train MobileNetV3s (MNs) [10] on AudioSet using offline KD from an ensemble consisting of 9 different Patchout FaSt Spectrogram Transformer (PaSST) [11] models. The resulting efficient pre-trained MNs have been shown to extract high-quality general-purpose audio embeddings that generalize to downstream tasks in various audio domains such as music, speech and environmental sounds [12]. Compared to Transformers, the quality of the extracted audio embeddings is comparable while the computational cost of inference is much lower. Although the MNs achieve excellent performance on AudioSet [9] and serve as high-quality audio embedding extractors [12], they remain to be tested in end-to-end fine-tuning, in which a pre-trained model is directly fine-tuned on a downstream task.
In this work, we extend this line of research and propose computationally efficient pre-trained audio models, obtained by integrating dynamic components into MobileNetV3s (DyMNs). We use the Knowledge Distillation pre-training setup of [9] and show that the proposed DyMNs achieve substantially higher pre-training performance on AudioSet compared to MNs. We train MNs and DyMNs of different complexity levels on the downstream tasks of polyphonic musical instrument recognition (OpenMic dataset [13]), environmental sound classification (ESC50 dataset [14]), acoustic scene classification (TAU Urban Acoustic Scenes 2020 challenge [15]), and sound event tagging (FSD50K [16]). The results show that MNs and DyMNs can attain or even surpass the performance of a single teacher model PaSST [11] on many downstream tasks, while being much more computationally efficient. The proposed DyMNs outperform MNs for the majority of downstream tasks and complexity levels.
The motivation for integrating dynamic components into MNs is threefold. Firstly, instead of scaling CNNs by network width and depth, which increases the computational complexity substantially, a variety of lightweight dynamic components such as dynamic non-linearities [17], [18], convolutions [19], [20] and attention mechanisms [21]-[27] have been proposed. These dynamic components have been shown to increase the performance while only marginally adding to the computational complexity in terms of consumed multiply-accumulate operations (MACs) at inference time. Secondly, the success of Transformers is largely based on self-attention, a highly dynamic operation that adapts its attention weights based on input data. Thirdly, an ablation study conducted in [9] revealed that Squeeze-and-Excitation [21], a component that dynamically computes channel attention weights based on input data, is an integral part of MNs to achieve high performance on AudioSet [2].
Concretely, our proposed DyMN block deviates from the traditional residual inverted bottleneck block [28] in MNs in three ways: we include dynamic convolution [19], replace the non-linearity after the depthwise convolution with dynamic ReLU [17], and make use of Coordinate Attention [22] instead of Squeeze-and-Excitation [21]. Since all these dynamic components perform operations based on the context of the input sample, we compute the context once and share it across all dynamic components in a block.
To summarize our contribution, we extend the work of Schmid et al. [9] and introduce a dynamic CNN block to further boost the pre-training performance of efficient CNNs on AudioSet [2]. We show that the proposed dynamic CNNs outperform traditional efficient CNNs on downstream tasks in an end-to-end fine-tuning setting. The proposed dynamic CNNs attain and even outperform current Audio Spectrogram Transformers on several downstream tasks while being more computationally efficient.
The remainder of this paper is structured as follows. Related work is reviewed in Section II, followed by the introduction of the proposed dynamic model in Section III. The pre-training setup and results are presented in Section IV; Section V then shows the experiments and results on the downstream tasks. A detailed systematic configuration study and an analysis of the dynamic components are performed in Sections VI and VII, respectively. The paper is concluded in Section VIII.

II. RELATED WORK
In this section, the related work is covered. We start with a general recap of efficient CNN architectures and popular dynamic CNN components that were introduced to further increase a CNN's computational efficiency. We then cover the literature on pre-trained audio models to set the stage for introducing our dynamic CNN serving as an efficient, pre-trained audio model.

A. Efficient CNN Architectures
Much effort has been invested in research on designing efficient CNNs, such as the series of MobileNets [10], [28], [29], EfficientNets [30], [31] or ShuffleNets [32], [33]. MobileNetV1 [29] substantially reduces the computational complexity of conventional convolution layers by factorizing them into a depthwise and a 1x1 pointwise convolution. On top of this, MobileNetV2 [28] introduces inverted residual blocks with linear bottlenecks, leading to better accuracy and computational efficiency. MobileNetV3 [10] additionally adds Squeeze-and-Excitation [21] layers after the depthwise filters, upgrades activation functions using swish non-linearity [34] and optimizes the global network structure using platform-aware network architecture search [35]. EfficientNet [30] builds on the MobileNetV2 inverted residual block and introduces compound scaling laws of depth, width and input resolution. Similar to MobileNetV3, EfficientNetV2 [31] performs a neural architecture search to optimize parameter efficiency and training speed. Originally introduced in the vision domain, MobileNets and EfficientNets have been shown to provide a good performance-complexity trade-off also in the audio domain [9], [36]-[38].

B. Dynamic CNN Components
While scaling CNNs by width and depth typically improves performance, it significantly increases the model's complexity. In particular, the computational demand of a CNN scales with the square of the model's width. As an alternative strategy, a lot of research has focused on using a fixed number of channels more efficiently by introducing dynamic components to CNNs. While the majority of work proposes attention mechanisms [21]-[27], others have used dynamic convolutions [19], [20], [39]-[41] and dynamic non-linearities [17], [18].
Attention Mechanisms: The most prominent instance of a CNN attention mechanism is Squeeze-and-Excitation (SE) [21], which is integrated into MobileNetV3 [10] and EfficientNets [30], [31]. As shown in Eq. 1, SE applies a squeeze (F_sq) and an excitation (F_ex) operation on a feature map to obtain channel recalibration weights s. F_sq applies global average pooling (GAP) to collect contextual information, while F_ex captures channel-wise dependencies via a learnable non-linear transformation followed by a sigmoid activation to compute the weights.
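As a concrete illustration, the squeeze and excitation steps can be sketched in a few lines of NumPy. The parameter names (w1, b1, w2, b2 for the excitation bottleneck) are illustrative, not taken from the original paper, and the batch dimension is omitted:

```python
import numpy as np

def squeeze_excite(x, w1, b1, w2, b2):
    """Sketch of Squeeze-and-Excitation channel recalibration.

    x: feature map of shape (C, F, T); w1/b1 and w2/b2 parameterize
    the bottleneck excitation MLP (illustrative names).
    """
    # Squeeze: global average pooling over the spatial dimensions.
    z = x.mean(axis=(1, 2))                       # (C,)
    # Excitation: bottleneck MLP followed by a sigmoid.
    h = np.maximum(0.0, w1 @ z + b1)              # ReLU, shape (C // r,)
    s = 1.0 / (1.0 + np.exp(-(w2 @ h + b2)))      # sigmoid, shape (C,)
    # Recalibrate: scale each channel plane by its attention weight.
    return x * s[:, None, None]
```

Because the sigmoid keeps every weight in (0, 1), the recalibrated feature map never grows in magnitude; channels the excitation deems unimportant are attenuated.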
Other attention mechanisms differ mainly in how they realize F_sq and F_ex. The Style-based Recalibration Module (SRM) [23] extends SE by combining GAP and global standard deviation pooling to realize F_sq, followed by a channel-wise fully connected layer, batch normalization and a sigmoid activation function for F_ex. CBAM [24] additionally computes attention weights for spatial locations using a channel pooling operation to squeeze contextual information. Coordinate Attention (CA) [22] factorizes the attention mechanism into two separate context encoding processes along the spatial dimensions. Features are aggregated by GAP along the spatial dimensions, processed separately and then used as spatially aware recalibration weights. Triplet Attention (TA) [26] reduces the three dimensions of a feature map by average and max pooling, processes the three 2-dimensional slices separately, and recalibrates the feature map with the resulting three sets of recalibration weights. Global Context (GC) blocks [27] differ from the aforementioned attention mechanisms by performing additive instead of multiplicative recalibration. The recently introduced Global Response Normalization (GRN) [42] serves as a particularly lightweight attention module by recalibrating the channels based on their L2-norms, adding only two learnable parameters, a scale and a shift.
Dynamic Convolution: In contrast to standard convolution layers, dynamic convolutions adapt the kernel weights based on global context information extracted from an input. Early approaches in the vision domain have explored generating the kernels directly [43], [44], resulting in a substantial increase in complexity as the number of parameters of convolution kernels is large. A more lightweight approach is to predict coefficients to linearly combine a fixed set of kernels. In this spirit, CondConv [20] computes a linear mixture of K distinct trainable kernel weights. As shown in Eq. 2, only a single convolution (∗) with the mixed kernel has to be performed, as this is equivalent to combining the results of K individual convolutions. The weights α_i per filter are computed dynamically from the input using GAP, a linear transformation, and a sigmoid activation.
DyNet [41] proposes a similar dynamic convolution mechanism, motivated from the perspective of extracting noise-irrelevant features, showing that dynamic filters generate more diverse feature maps with lower cross-correlation. A common global context is shared across CNN blocks, e.g., the global context extracted by GAP from a block input is used to parameterize the three dynamic convolutions in an inverted bottleneck block as used in MobileNetV2 [28]. The dynamic convolution introduced in [19] compresses the kernel space by introducing the constraint Σ_k α_k = 1 on the computed kernel attention weights. As a result, a smaller number of kernels K can be used, saving computations and parameters. The constraint is enforced by applying a softmax instead of a sigmoid activation. Temperature scaling is used before the softmax to ensure near-uniform attention in early epochs.
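The kernel-mixing step shared by these approaches can be sketched as follows. For brevity, the attention head is reduced to a single linear map (w_attn, an illustrative name), and the softmax with temperature follows the constrained formulation of [19]; real implementations insert pooling and a bottleneck before the attention logits:

```python
import numpy as np

def mix_kernels(kernels, context, w_attn, tau=1.0):
    """Sketch of attention-weighted kernel aggregation for dynamic
    convolution: predict K attention weights from a context vector and
    mix K kernels into one, so only a single convolution is performed.

    kernels: (K, ...) stacked kernel tensors; context: context vector;
    w_attn: (K, len(context)) linear attention head; tau: temperature.
    """
    logits = (w_attn @ context) / tau
    logits = logits - logits.max()                 # numerical stability
    alpha = np.exp(logits) / np.exp(logits).sum()  # softmax: sum_k alpha_k = 1
    # Weighted sum over the first (kernel) axis.
    return np.tensordot(alpha, kernels, axes=1)
```

A large temperature tau flattens the softmax toward a uniform mixture, which is exactly why the temperature is annealed from a high value early in training.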
Besides dynamic convolution in the vision domain, some works have explored generating input-aware convolutional filters based on the input sentences in NLP [39], [45], [46]. In the audio domain, temporal dynamic convolutions (TDY) [47] and frequency dynamic convolutions (FDY) [48] have gained popularity recently. TDY dynamically adapts the filters along the time axis to account for the time-varying characteristics of speech; FDY has been shown to improve sound event detection by dynamically adapting the filters along the frequency axis, addressing the fact that the frequency dimension is not shift-invariant. However, both TDY and FDY execute K parallel convolutions before combining the results, which leads to considerable computational overhead during inference.
Dynamic Activation Function: Less explored in comparison to CNN attention mechanisms and dynamic convolutions is the use of dynamic activation functions. Si et al. [18] dynamically adapt the threshold value of ReLU activations in an MLP network. Chen et al. extend this line of research by introducing dynamic ReLU (Dy-ReLU) [17], which works as a dynamic and efficient version of Maxout [49]. A hyper-function, similar to Squeeze-and-Excitation [21], is used to dynamically generate coefficients for M linear mappings.
After normalizing the coefficients (slopes and intercepts) to specific ranges, the element-wise maximum is applied across the M different mappings.Most commonly, M = 2 is used and linear mappings are spatially shared but differ across channels.

C. Pre-trained Audio Models
Models in the audio domain are typically pre-trained in a supervised or self-supervised way on large-scale datasets, such as AudioSet [2]. AudioSet consists of around 2 million weakly labeled 10-second audio snippets downloaded from YouTube. The audio clips were manually annotated with 527 different event classes hierarchically sorted in an ontology.
PANNs [36] introduced a series of AudioSet-pre-trained CNNs of varying complexities and architectures, which are widely used for downstream applications in audio-related domains such as sound event detection [50], automated audio captioning [51], language-based audio retrieval [52], emotion recognition [53], or even optical fiber sensing [54]. The prevalence of PANNs across many different downstream application areas underlines the community's interest in pre-trained audio models for end-to-end fine-tuning. The study presented in [36] includes MobileNetV2 [28], which shows a good performance-complexity trade-off but falls behind CNN14 [36] in terms of performance. Gong et al. [37] improve over PANNs in terms of performance and complexity by using an EfficientNet-B2 [30] pre-trained on ImageNet [1], balanced sampling, and label enhancement. ERANNs [55] improve the performance on AudioSet further while controlling efficiency by temporal downsampling via strided convolutions. However, despite all these improvements, CNNs have been substantially outperformed by supervised [5], [11], [38], [56] and self-supervised [57]-[59] Transformer models.
Recently, it has been shown that the sharp increase in audio tagging performance on AudioSet achieved by Transformers can be exploited and transferred to efficient CNNs using Knowledge Distillation (KD) [6], [8]. In this context, Gong et al. [38] found that Transformers and CNNs are good teachers for each other, improving the performance of both models, and Schmid et al. [9] use Transformer-to-CNN KD to match the performance of a PaSST [11] Transformer model with a MobileNetV3 having only 6% of the parameters and requiring 100 times fewer MACs. These efficient, pre-trained CNNs have been shown to extract high-quality audio representations [12] and have a high potential to be fine-tuned for low-complexity on-device applications. In parallel to our work, which focuses on architectural improvements of efficient CNNs with dynamic components, Dinkel et al. [60] recently improved the KD setup introduced in [9] by using consistent ensemble distillation and an improved teacher model.

III. PROPOSED DYNAMIC MODEL
This section introduces the proposed dynamic model consisting of dynamic ReLU (Dy-ReLU) [17], dynamic convolutions (Dy-Conv) [19], and Coordinate Attention (CA) [22]. This design decision is based on our belief that the three dynamic methods are complementary: Dy-Conv can extract noise-invariant features [41], CA detects important channels, time frames and frequency bins, and Dy-ReLU increases the model's expressiveness by applying a dynamic non-linear function. These dynamic components are integrated into efficient inverted residual blocks (IR blocks) [28], as our focus is on creating efficient pre-trained audio models. In particular, we use the global network design of MobileNetV3-Large (MN) [10] and scale the models by network width using a width multiplier α. MN is optimized towards latency and has been shown to provide an excellent performance-complexity trade-off on AudioSet [9].
In the following, we describe our proposed dynamic IR block in a top-down manner. We start by reviewing the conventional IR block in Section III-A, and introduce the modifications that lead to the dynamic IR block in Section III-B. We then zoom into the four central components of the dynamic IR block: the Context Generation Module (Section III-C), Dy-ReLU (Section III-D), Dy-Conv (Section III-E) and CA (Section III-F). For all these additional components, we will identify the computationally most costly part in terms of multiply-accumulate operations (MACs) and compare it to the cost of convolutions in the conventional IR block. The embedded time and frequency sequences produced by the CGM are used to parameterize Dy-ReLU [17] and Dy-Convs [19] and to compute the channel-time and channel-frequency recalibration weights for CA. The shape of feature maps is denoted in terms of batch size × channels × frequency bands × time frames.

A. Inverted Residual Block
IR blocks [28] are constructed of (1) a pointwise channel expansion convolution projecting the number of channels from C_IN to C_EXP using 1x1 kernels, (2) a depthwise convolution operating on each of the C_EXP channels independently, and (3) a pointwise projection convolution projecting the channels from C_EXP to C_OUT with 1x1 kernels. For most blocks, it holds that C_IN = C_OUT. However, transition blocks increase the number of channels, leading to C_OUT > C_IN. Each convolution is followed by batch normalization and ReLU activation, except for the linear bottleneck after the last convolution. The depthwise convolution can be strided to downsample the spatial dimensions. If the spatial dimensions and the number of channels match, a residual connection from block input to block output is used.
The pointwise convolutions are the computationally most expensive operations in an IR block. Specifically, given a block with C_OUT output channels, C_EXP channels in the expanded channel representation and spatial output dimensions of sizes T_OUT and F_OUT, the final pointwise convolution performs C_OUT · C_EXP · T_OUT · F_OUT multiply-accumulate operations (MACs).
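As a minimal sanity check of these complexity considerations, the MAC counts of the convolutions in an IR block can be computed directly (the helper names are ours, not from the paper):

```python
def pointwise_conv_macs(c_in, c_out, f_out, t_out):
    """MACs of a 1x1 (pointwise) convolution: every output position
    accumulates over all input channels for each output channel."""
    return c_in * c_out * f_out * t_out

def depthwise_conv_macs(c, k, f_out, t_out):
    """MACs of a k x k depthwise convolution: each channel is filtered
    independently, so there is no cross-channel accumulation."""
    return c * k * k * f_out * t_out
```

For typical block dimensions, e.g. C_EXP = 240, C_OUT = 80 and a 3x3 depthwise kernel, the final pointwise convolution (240 · 80 MACs per output position) dominates the depthwise one (240 · 9 MACs per position) by roughly an order of magnitude.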

B. Dynamic Inverted Residual Block
Starting from the conventional IR block, we apply the following modifications to integrate the dynamic components: 1) Conv → Dy-Conv: We replace all three convolutions in the IR block with Dy-Conv [19].
2) ReLU → Dy-ReLU: The ReLU activation function after the depthwise convolution is replaced by Dy-ReLU [17]. As shown in Section VI, using additional Dy-ReLUs after the two pointwise convolutions does not yield further improvements.
3) Squeeze-and-Excitation → CA: Instead of SE [21], CA [22] is used as the attention mechanism. As will be shown in Section VI, this replacement yields substantial performance gains. Fig. 1 shows the overall structure of the dynamic IR block. All three types of dynamic components have in common that they require statistics extracted from the input sample for dynamic parameterization and reweighting. We share the computation of these statistics across all dynamic components in a common Context Generation Module (CGM), which will be explained in detail in Section III-C. The CGM operates on the input of the IR block and outputs an embedded time and frequency sequence, i.e., two separate lists of embeddings for time frames and frequency bins (depicted as green blocks in Fig. 1). The following sections on Dy-ReLU (III-D), Dy-Conv (III-E) and CA (III-F) will explain in detail how these sequences are processed for dynamic parameterization and reweighting.

C. Context Generation Module
The goal of the CGM is to collect informative statistics from an input sample that can be used to parameterize the dynamic components without creating substantial computational overhead. Fig. 2 depicts the details of this process. The CGM transforms the block input feature map into embedded time (S_T) and frequency (S_F) sequences. Inspired by the original CA module [22], introduced in the vision domain, the input feature map is pooled separately across the two spatial dimensions to retain positional information. The resulting time and frequency sequences are then processed by a shared transformation consisting of a linear layer, batch norm [61] and hard-swish activation [10].
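A minimal sketch of this pooling-and-embedding scheme, with the batch dimension and batch norm omitted; w and b are illustrative names for the shared linear transform:

```python
import numpy as np

def context_generation(x, w, b):
    """Sketch of the Context Generation Module: pool the feature map
    separately along time and frequency, then embed both resulting
    sequences with a shared transform followed by hard-swish.

    x: (C, F, T) feature map; w: (H, C), b: (H,) shared linear layer.
    Returns the embedded time sequence S_T (H, T) and frequency
    sequence S_F (H, F).
    """
    freq_seq = x.mean(axis=2)                 # pool over time  -> (C, F)
    time_seq = x.mean(axis=1)                 # pool over freq  -> (C, T)

    def embed(seq):
        h = w @ seq + b[:, None]              # shared linear transform
        return h * np.clip(h + 3.0, 0.0, 6.0) / 6.0   # hard-swish

    return embed(time_seq), embed(freq_seq)   # S_T, S_F
```

Pooling along only one spatial axis at a time is what preserves positional information: S_T still indexes time frames and S_F still indexes frequency bins.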

D. Dy-ReLU
Our specific implementation of dynamic ReLU is the spatially-shared and channel-wise Dy-ReLU-B introduced in [17]. Applying Dy-ReLU-B to a feature map requires predicting M linear mappings for each channel. The output of Dy-ReLU-B is then computed by taking the elementwise maximum across the M linear mappings. Specifically, given x_c, the input plane of the channel with index c, and the predicted coefficients α_c^m (slope) and β_c^m (intercept) for that channel, the output plane is calculated as y_c = max_{1≤m≤M} (α_c^m · x_c + β_c^m). Dy-ReLU-B requires in total 2 · M · C dynamic coefficients, resulting from slope and intercept for M linear mappings for each of the C channels.
We apply the transformation shown in Eq. 4 to predict the dynamic coefficients from the CGM output sequences S_T and S_F. Firstly, the two sequences are concatenated and pooled over the sequence length dimension, resulting in a vector of size H. Secondly, a trainable linear transformation with parameters W ∈ R^{2·M·C_EXP × H} and b ∈ R^{2·M·C_EXP} is used to predict the Dy-ReLU coefficients collected in the vector C_dyrelu. Most commonly, M = 2 linear mappings are used [17], which is also the default value in our setup.
The computational cost of Dy-ReLU is dominated by the calculation of the elementwise linear mappings (α_c^m · x_c + β_c^m) as part of Eq. 3. The total cost of computing M linear mappings is M · C_EXP · T_OUT · F_OUT MACs. Since M = 2 is much smaller than the number of block output channels C_OUT, the cost of Dy-ReLU is insignificant compared to the cost of the final pointwise convolution (C_OUT · C_EXP · T_OUT · F_OUT MACs) as discussed in Section III-A.
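The elementwise maximum over M predicted linear mappings can be sketched as follows, with spatially shared, channel-wise coefficients as in Dy-ReLU-B (the array layout is our convention):

```python
import numpy as np

def dy_relu(x, slopes, intercepts):
    """Sketch of Dy-ReLU-B: the output is the elementwise maximum over
    M predicted linear mappings per channel, shared across space.

    x: (C, F, T) feature map; slopes, intercepts: (M, C) coefficients
    predicted per input sample by the hyper-function.
    """
    # Broadcast each of the M channel-wise linear maps over the
    # spatial dimensions, then take the maximum over the M maps.
    maps = slopes[:, :, None, None] * x[None] + intercepts[:, :, None, None]
    return maps.max(axis=0)
```

With M = 2 and fixed coefficients (1, 0) and (0, 0), this reduces exactly to a standard ReLU; the dynamic coefficients let each input choose its own piecewise-linear activation per channel.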

E. Dy-Conv
Our implementation of Dy-Conv is based on the dynamic convolution introduced by Chen et al. [19]. K different kernels are kept in parallel and are aggregated based on predicted kernel attention weights α_k. The kernel attention weights are normalized to sum to 1 using a softmax function with temperature scaling. The aggregated dynamic kernel W̃ is then constructed as a weighted sum of the K individual kernels: W̃ = Σ_k α_k · W_k. By default, we stick with the settings recommended in [19] and use K = 4 and linearly anneal the temperature τ from 30 to 1 over the first epochs of training.
To obtain predictions for the K kernel attention weights from the CGM output sequences, we apply the transformation shown in Eq. 6. The only difference to the Dy-ReLU coefficient prediction is the shape of the trainable linear transformation and the resulting coefficient vector C_dyconv. Specifically, we use W ∈ R^{K × H} and b ∈ R^K, resulting in a vector C_dyconv of size K.
The computationally most expensive operation is the aggregation of the K kernels, requiring at most K · C_EXP · C_OUT MACs for the final pointwise convolution. Since K = 4 is much smaller than T_OUT · F_OUT, constructing the dynamic kernel is insignificant compared to the convolution itself.

F. Coordinate Attention
The purpose of CA [22] is to emphasize important spatial positions and channels by recalibrating a feature map with channel-time and channel-frequency weights. As depicted in Fig. 3, CA takes as input the embedded time and frequency sequences as produced by the CGM, performs separate transformations per sequence, and outputs the respective attention weights. Eq. 7 shows the transformation from the embedded sequences S_T and S_F to the respective attention weights Ŝ_T and Ŝ_F. To match the dimensions of the feature map, an average pooling operation with kernel size 3 and a stride of 2 is applied in case of a strided depthwise convolution in the IR block. The result is processed by a trainable linear layer with parameters W ∈ R^{C_EXP × H} and b ∈ R^{C_EXP} to match the channel dimension of the feature map. The final sigmoid function converts the resulting sequences into attention weights that are used to recalibrate the feature map via elementwise multiplications.
The computationally most expensive operations in CA are the linear layers, which jointly perform H · C_EXP · (T_OUT + F_OUT) MACs. Compared to the final pointwise convolution in the IR block, which requires C_OUT · C_EXP · T_OUT · F_OUT MACs, the computational demand of CA is negligible, since C_OUT ≈ H and the product of the spatial dimensions is typically much larger than their sum.
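The recalibration step can be sketched as follows. For brevity, this sketch shares one linear layer (w, b, illustrative names) across both sequences and omits the strided-pooling case:

```python
import numpy as np

def coordinate_attention(x, s_t, s_f, w, b):
    """Sketch of Coordinate Attention recalibration: project the
    embedded time and frequency sequences back to the channel
    dimension, apply a sigmoid, and reweight the feature map along
    both axes.

    x: (C, F, T) feature map; s_t: (H, T); s_f: (H, F); w: (C, H), b: (C,).
    """
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    a_t = sigmoid(w @ s_t + b[:, None])   # channel-time weights (C, T)
    a_f = sigmoid(w @ s_f + b[:, None])   # channel-frequency weights (C, F)
    # Elementwise recalibration along both spatial axes.
    return x * a_f[:, :, None] * a_t[:, None, :]
```

Unlike SE, which produces a single weight per channel, this yields a separate weight per (channel, time frame) and per (channel, frequency bin), so the attention is positionally aware.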

IV. PRE-TRAINING ON LARGE-SCALE AUDIO TAGGING
In this section, we report the pre-training results of the introduced dynamic CNNs on the task of large-scale audio tagging on AudioSet [2]. That is, models need to assign one or multiple labels out of 527 classes to 10-second audio clips. Since AudioSet must be downloaded from YouTube, different proportions of the dataset can be successfully downloaded; in this regard, our setup is strictly comparable to the dataset used in [11] and [9]. The proposed DyMN is scaled to three different complexities using width multipliers α ∈ {0.4, 1, 2}.
Adapting the width multiplier changes the number of channels in the IR blocks while keeping the total number of blocks constant. We denote the resulting models as DyMN small (DyMN-S), DyMN medium (DyMN-M) and DyMN large (DyMN-L). By default, we replace all 15 IR blocks with their dynamic counterparts; however, we will also discuss the effect of applying the dynamic IR blocks selectively in Section VI. The results will be compared in terms of parameter and computational efficiency to other models pre-trained on AudioSet. In particular, we are interested in a comparison to the non-dynamic counterparts of our proposed DyMNs, the efficient MNs used in [9].

A. Pre-Training Setup

1) Preprocessing and Augmentation:
To pre-train our models on AudioSet, we match the preprocessing used in [11] and [9]. We use mono audio with a sampling rate of 32 kHz and apply an STFT with a window length of 25 ms and a hop size of 10 ms. Mel spectrograms are computed using a Mel filterbank with 128 frequency bins, and the minimum and maximum frequencies are randomly perturbed within ranges of 10 Hz and 2 kHz, respectively. Mixup [62] with a mixing coefficient of 0.3 is the only spectrogram-level data augmentation used, since we are using offline KD as described in Section IV-B and it has been shown that consistent KD is beneficial [9], [63].
2) Training: Models are trained for a total of 200 epochs, and 100,000 samples are drawn at random without replacement from the full AudioSet in each epoch. We use a learning rate schedule consisting of an exponential warmup phase until epoch 8, followed by a constant peak learning rate phase, 95 epochs of linear rampdown and 25 epochs of fine-tuning with 1% of the peak learning rate. The peak learning rates are set to 2 × 10^-3, 1 × 10^-3 and 5 × 10^-4 for DyMN-S/M/L, respectively. We use the Adam optimizer [64] with a batch size of 120. We adopt the importance sampling strategy based on label frequency from [11] to counter the long tail of infrequent classes. The results presented in this section are achieved by DyMNs pre-trained on ImageNet [1], which has been shown to improve performance substantially [37].
3) Dynamic Component Settings: The temperature τ used in the context of Dy-Convs (see Eq. 2) is linearly annealed from 30 to 1 in the first 30 epochs of training. The sequence embedding dimension H as defined in the CGM (Fig. 2) is set to C_EXP / r with r = 4. However, we additionally restrict its size to the range between 32 and 128 to ensure that the sequences can capture enough information about the feature map in early layers and do not become unnecessarily complex in the final layers. These bounds are scaled accordingly with the model's width multiplier α.
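These two scheduling details can be written down directly; the clamping bounds follow the description above, and the function names are ours:

```python
def dyconv_temperature(epoch, tau_start=30.0, tau_end=1.0, anneal_epochs=30):
    """Linear annealing of the Dy-Conv softmax temperature from
    tau_start to tau_end over the first anneal_epochs epochs."""
    if epoch >= anneal_epochs:
        return tau_end
    frac = epoch / anneal_epochs
    return tau_start + frac * (tau_end - tau_start)

def embedding_dim(c_exp, r=4, lo=32, hi=128, alpha=1.0):
    """Sequence embedding dimension H = C_EXP / r, clamped to the range
    [lo, hi] scaled by the width multiplier alpha."""
    return int(min(max(c_exp // r, lo * alpha), hi * alpha))
```

For example, an early block with C_EXP = 64 would get H = 32 (clamped up from 16), while a late block with C_EXP = 4096 would get H = 128 (clamped down from 1024).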

B. Offline Knowledge Distillation
We adopt the Transformer-to-CNN KD method introduced in [9] to train our DyMNs. Specifically, the DyMNs act as the student in the KD setup and optimize the loss given in Eq. 8. The loss is a weighted sum of label loss L_l and distillation loss L_kd, traded off by a hyperparameter λ; y denotes the AudioSet labels, z_S and z_T are the student and teacher logits, respectively, δ is the sigmoid activation function, and binary cross-entropy is applied for L_l and L_kd.

Fig. 4. The plot compares the performance-computational complexity trade-off across different single (i.e., non-ensemble) models on AudioSet. CNNs (MNs [9], PSLA [37], CNN14 [36], ResNet38 [36], CMKD-CNN [38], ConvNeXt [65]) are shown as circles, Transformer models (PaSST-S [11], BEATs [58] and HTS-AT [56]) are depicted as crosses, the width-scaled MobileNets introduced in [9] are connected by the green line (for better visibility), and our proposed DyMNs are denoted as red stars. The computational complexity is measured in terms of multiply-accumulate operations (MACs) plotted in log scale on the x-axis.
The teacher logits z_T are constructed by ensembling the logits of 9 different Patchout FaSt Spectrogram Transformer (PaSST) models [11], achieving a mean average precision (mAP) of 49.5 on the AudioSet evaluation set. Aligned with [9], we pre-compute the ensemble logits for all recordings in the training set to speed up the training of the DyMN students and use λ = 0.1 to emphasize the distillation loss.
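A sketch of the resulting objective, assuming λ weights the label term (so that λ = 0.1 emphasizes the distillation term); the exact form of Eq. 8 may differ in detail:

```python
import numpy as np

def distillation_loss(z_s, z_t, y, lam=0.1):
    """Sketch of the KD objective: a weighted sum of label BCE and
    distillation BCE on the sigmoid outputs. z_s, z_t: student and
    teacher logits; y: multi-label targets; lam weights the label loss.
    """
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    def bce(p, t, eps=1e-7):
        # Binary cross-entropy with clipping for numerical stability.
        p = np.clip(p, eps, 1.0 - eps)
        return -(t * np.log(p) + (1.0 - t) * np.log(1.0 - p)).mean()

    p_s, p_t = sigmoid(z_s), sigmoid(z_t)
    return lam * bce(p_s, y) + (1.0 - lam) * bce(p_s, p_t)
```

Because the distillation target is the teacher's sigmoid output rather than a hard label, the student is also trained on the teacher's soft confidence for each of the 527 classes.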

C. Results on AudioSet
In the following, we will compare the computational efficiency and the parameter efficiency across different models trained on AudioSet. In particular, the DyMNs are trained in the same setup as the MNs in [9], which permits a fair comparison in assessing the effect of the dynamic components. All results presented throughout this paper are averages over at least 3 independent runs and the last 10 epochs of training.
1) Computational Complexity: Fig. 4 compares the performance-computational complexity trade-off on AudioSet. DyMN-S outperforms the MN with a matching width multiplier (α = 0.4) by almost 2 points in mAP. DyMN-M (α = 1.0) almost matches the performance of the MN with twice the width (α = 2.0), and DyMN-L (α = 2.0) outperforms even the largest MN (α = 4.0) while requiring approximately 4 times fewer MACs. The crosses depict the Transformer models PaSST-S [11], which is used as the teacher for MNs and DyMNs;¹ BEATs [58], which achieves the best performance on AudioSet; and HTS-AT [56], a particularly efficient implementation based on the Swin Transformer [68]. While both DyMNs and MNs can outperform a single teacher model, the PaSST ensemble teacher performance of 49.5 mAP is not reached. However, DyMN-L outperforms even the best-performing Transformer model BEATs [58] while requiring less than 5% of its MACs.
2) Parameter Complexity: The parameter efficiency on AudioSet is compared across different CNNs (circles) and Transformers (crosses) in Fig. 5. Both MNs and DyMNs outperform a variety of Transformers and CNNs in terms of the performance-complexity trade-off. BEATs [58] is the only model that outperforms the best MNs, but DyMN-L achieves a higher mAP with less than half the number of parameters. The introduced DyMNs outperform the MNs in terms of parameter efficiency, although the dynamic convolutions require more than K times as many parameters as conventional convolution layers, and the fully-connected layers predicting the Dy-ReLU activation coefficients are non-negligible in terms of parameters. However, scaling the width of a network by a factor α increases the number of parameters approximately by α². This shows that introducing dynamic IR blocks to MNs, instead of scaling the network width, can be more efficient also in terms of parameters.

Footnote 1: More precisely, we show here one single Transformer model from the teacher ensemble, in order to compare only single models in the plot. The 9-model PaSST ensemble would figure at (6.4 × 10², 49.5) in Fig. 4; in Fig. 5, it would completely distort the plot, with a parameter complexity of 775.3 million.

V. EXPERIMENTS ON DOWNSTREAM TASKS
The previous section has shown that introducing dynamic components to MNs increases the pre-training performance on large-scale AudioSet. However, the main question is whether the performance gain during pre-training carries over to downstream tasks. We fine-tune AudioSet-pre-trained MNs and DyMNs on the tasks of polyphonic musical instrument recognition, environmental sound classification, sound event tagging, and acoustic scene classification, and compare their performance against each other, the pre-training teacher Transformer PaSST [11], and the state of the art on the respective tasks. Furthermore, we share the same fine-tuning pipeline across all downstream tasks and only adapt the learning rate, showing that no extensive hyperparameter tuning is required for high performance on downstream tasks.
A. Tasks

1) Polyphonic Musical Instrument Recognition: The task is to recognize all instruments present in an audio clip. It is based on the OpenMIC dataset [13], which consists of 20,000 10-second audio clips. Each clip is annotated with multiple tags out of 20 different classes. The performance metric is mean average precision (mAP). The state of the art for this task is the Transformer model PaSST [11], which surpassed receptive-field-regularized CNNs [69] as the previous state-of-the-art method.
2) Environmental Sound Classification: The task is to classify 5-second audio recordings into one of 50 different classes. It is based on the ESC50 [14] dataset, consisting of 2,000 environmental sound recordings. The performance metric for this task is accuracy, and we report results averaged over the 5 official folds [14]. The state-of-the-art model on this dataset is the Transformer BEATs [58]; among CNNs, the fine-tuned audio encoder of CLAP [70] has the lead.
3) Sound Event Tagging: The FSD50K dataset [16] consists of 51,197 recordings annotated with 200 event classes taken from the AudioSet [2] ontology. With 100 hours of audio, it is the second-largest publicly available general-purpose sound event tagging dataset after AudioSet. FSD50K is separated into training, validation and evaluation splits. We use the validation split to set up our fine-tuning pipeline, which is shared across all models and downstream tasks. Performance is reported in terms of mAP on the evaluation set. The multi-modal giant-size ONE-PEACE [71] Transformer with 4B parameters achieves state-of-the-art results on this dataset. On the CNN side, the fine-tuned audio encoder of CLAP [70] achieves the highest mAP.
4) Acoustic Scene Classification: The task is to classify 10-second audio recordings into one of ten different acoustic scenes. The TAU Urban Acoustic Scenes 2020 Mobile dataset [15] was used in Task 1 of the DCASE 2020 challenge and consists of 13,965 recordings in the train set and 2,979 in the test set. Performance is measured in terms of accuracy. It is particularly difficult to find models that generalize well on this dataset, since it was recorded with a limited number of microphones, some of which are completely unseen during training and cause a distribution shift at test time. PaSST-S [11] is the state-of-the-art method on this dataset, outperforming the top CNN [72] from the DCASE 2020 challenge [15].

B. Fine-Tuning Setup
For fine-tuning models on downstream tasks, the preprocessing of audio recordings matches the one applied in the pre-training stage, as discussed in Section IV-A. Except for the learning rate, we share the fine-tuning pipeline across all tasks and models. We use the Adam optimizer and train for 80 epochs. The learning rate schedule includes an exponential warmup phase of 10 epochs, followed by a linear rampdown over 65 epochs and 5 final epochs at 1% of the peak learning rate. The temperature τ used to compute the attention weights for Dy-Conv is fixed to 1. As for data augmentation, we randomly roll the waveform over time within a maximum range of ±125 ms. We use two-level mixup [62], applied both on the raw waveforms and on the spectrogram level, and the audio waveform is multiplied to change the gain by up to ±7 dB. The minimum and maximum frequencies of the mel filterbank are randomly perturbed within ranges of 10 Hz and 2 kHz, respectively. Interestingly, a critical performance factor on the downstream tasks is the weight decay, which must be set to 0 to achieve high performance.
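The learning-rate schedule above can be sketched as a per-epoch multiplier. Only the phase lengths and the 1% floor are taken from the text; the exact shape of the exponential warmup curve is our assumption.

```python
import math

def lr_multiplier(epoch, warmup=10, rampdown=65, final=5, floor=0.01):
    """Learning-rate multiplier for a given epoch (0-indexed):
    exponential warmup (10 epochs), linear rampdown (65 epochs),
    then `final` epochs at 1% of the peak learning rate."""
    if epoch < warmup:
        # exponential warmup towards 1.0 (assumed curve shape)
        return math.exp(-5.0 * (1.0 - (epoch + 1) / warmup) ** 2)
    if epoch < warmup + rampdown:
        # linear interpolation from 1.0 down to the floor
        frac = (epoch - warmup) / rampdown
        return 1.0 - frac * (1.0 - floor)
    return floor  # final epochs at 1% of the peak
```

The actual learning rate per epoch is this multiplier times the task-specific peak learning rate, the only hyperparameter adapted per task.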

C. Results
The results on the four downstream tasks are given in Table I. For each task, we specify the baseline performance (Baseline), the global state of the art (SOTA), the state of the art among CNNs (SOTA CNN), and the performance of the AudioSet teacher model PaSST-S [11]. We also compare our proposed DyMNs to the MNs with matching width multipliers and add an MN with an increased width of α = 3.0.
1) DyMNs vs. MNs: The DyMNs outperform the MNs with matching width across all tasks and model sizes, with the exception of α = 1.0 on FSD50K. DyMN-L even outperforms MN (α = 3.0) on OpenMIC, ESC-50 and DCASE20 while being on par on FSD50K. These results underline that dynamic components can increase channel efficiency and generalization performance.
2) DyMNs vs. PaSST: DyMN-L outperforms the pre-training teacher model PaSST-S [11] on all four downstream tasks while requiring less than half of its parameters and less than 3% of its MACs for computing the predictions for a 10-second audio recording. DyMN-M achieves comparable performance on OpenMIC and ESC-50 while being 8 times smaller and requiring less than 1% of the MACs compared to PaSST.

VI. SYSTEMATIC CONFIGURATION STUDY
The purpose of this section is to justify the design decisions that led to the final dynamic IR block presented in Section III, and to show that the proposed variant is beneficial across a variety of other configurations.
In the following, we present configuration studies for the proposed dynamic IR block in Section VI-A. We then delve into the details of the individual dynamic components in Sections VI-B, VI-C and VI-D, followed by a study on different context generation variants in Section VI-E. All experiments are conducted on AudioSet [2] using DyMN-M and MN (α = 1.0) without ImageNet [1] pre-training. In the following tables, the default values in our setup are indicated in bold.

A. Dynamic IR block
In this section, we perform a configuration study of the proposed dynamic IR block. We investigate the effect of the individual dynamic components, the impact of applying the dynamic IR blocks selectively, and the effect of applying Dy-Conv and Dy-ReLU at different positions in the block.
1) Importance of Dynamic Components: Table II presents the results for the proposed DyMN, the MobileNetV3 [10] baseline (MN Baseline) from [9], a fully static MN without Squeeze-and-Excitation [21] (MN Static), and all other combinations of the three dynamic components in DyMNs.
The results show that dynamic, input-dependent processing is important. The proposed DyMN improves performance by 7% over the static MN. While Dy-ReLU and CA are of equal importance, Dy-Conv leads to the smallest improvements. However, all three dynamic components are beneficial for the overall performance and improve over the MN Baseline.

2) Selectively applying the dynamic IR block: The purpose of this experiment, with results summarized in Table III, is to determine at which positions in the model the dynamic blocks have the highest impact. The MN has 15 IR blocks in total, all of which are replaced by dynamic IR blocks in the proposed DyMN. In this study, we replace only the first, middle or last 5 blocks in the MN with dynamic IR blocks and keep conventional IR blocks at the remaining positions. Additionally, the setting Replace SE uses the dynamic IR block only at the positions at which the original MN uses SE, resulting in 8 of the 15 IR blocks being dynamic.
Table III shows that dynamic blocks are beneficial at different positions in the model. Each selective variant outperforms the MN Baseline from [9]. The best choice among the versions with 5 dynamic blocks is to make the last 5 blocks dynamic. Replacing only SE blocks with dynamic blocks comes closest to the fully dynamic model in terms of performance and can be seen as a lightweight alternative.
3) Effects of Dy-Conv and Dy-ReLU positions: The proposed dynamic IR block replaces all convolution layers with Dy-Conv and uses Dy-ReLU only after the depthwise convolution. Table IV shows the results for applying Dy-Conv and Dy-ReLU at alternative positions. Pos. 1, 2 and 3 denote the first, second and third convolution in the dynamic IR block (shown in Fig. 1) and the activation functions that follow them. In the case of Dy-ReLU at Pos. 3, we add an additional Dy-ReLU after the final pointwise convolution.
The results show that replacing all convolution layers with Dy-Conv is beneficial, and the proposed Dy-ReLU variant, with a single Dy-ReLU at Pos. 2, achieves the best performance. In particular, adding additional Dy-ReLUs does not improve results further.

B. Attention Mechanism
This section presents a study on the choice of attention mechanism and the impact of channel-frequency and channel-time recalibration in CA.
1) Choice of Attention Mechanism: Table V shows the results for integrating different popular attention mechanisms (CA [22], TA [26], SRM [23], GRN [42], SE [21], CBAM [24], and GC [27]) into the MN. All attention mechanisms are integrated before the final pointwise convolution into all 15 IR blocks. While a number of different attention methods achieve substantial improvements over the static MN without an attention mechanism, CA leads to the largest improvement and is therefore the attention mechanism of choice for our proposed dynamic IR block.
2) Channel-Time and Channel-Frequency Recalibration in CA: CA recalibrates the feature map with channel-time and channel-frequency attention weights. The results given in Table VI assess the importance of these two recalibration steps. While the channel-frequency and channel-time weights are equally important, using both of them leads to the best results.
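The two recalibration steps amount to a broadcasted elementwise multiplication of the feature map with both weight tensors. A minimal sketch, with shapes and names of our own choosing:

```python
import numpy as np

def ca_recalibrate(x, s_f, s_t):
    """Recalibrate a feature map x of shape (C, F, T) with
    channel-frequency weights s_f of shape (C, F) and
    channel-time weights s_t of shape (C, T) by broadcasting
    each weight tensor over the missing spatial dimension."""
    return x * s_f[:, :, None] * s_t[:, None, :]
```

Dropping either factor recovers the single-recalibration variants compared in Table VI.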

C. Dy-Conv
In this section, we assess the impact of the two hyperparameters of Dy-Conv: the number of kernels K and the temperature τ.
1) Number of dynamic kernels K: The number of kernels K specifies how many different kernels are aggregated in each Dy-Conv layer. Table VII shows the results for K ∈ {2, 4, 6}. The performance improves only marginally from K = 2 to K = 4 kernels and plateaus for larger values of K.
2) Temperature τ: The temperature τ affects the computation of the kernel attention weights, as shown in Eq. 5. Aligned with [19], we use a temperature schedule for τ by default and anneal it from 30 to 1 over the first 30 epochs of training. This ensures near-uniform attention weights in the first epochs so that all kernels are properly updated. Table VIII compares the temperature schedule to using a constant temperature (τ ∈ {1, 10, 30}). The results show that the performance is stable across different constant temperature values in our setup; τ = 10 achieves the same performance as the temperature schedule. However, keeping the temperature constant at non-optimal values (τ = 1 or τ = 30) leads to a slight performance decrease, underlining the advantage of the temperature schedule.
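Kernel aggregation and the annealing schedule can be sketched as follows. This is a simplified NumPy illustration: the temperature-softened softmax follows the form of Eq. 5, but the linear shape of the annealing and all names are our assumptions.

```python
import numpy as np

def kernel_attention(logits, tau):
    # softmax over K kernel logits, softened by temperature tau (cf. Eq. 5)
    e = np.exp((logits - logits.max()) / tau)
    return e / e.sum()

def aggregate_kernels(kernels, logits, tau):
    """Dy-Conv: combine K kernels of shape (K, kh, kw) into a single
    kernel, weighted by the attention vector."""
    alpha = kernel_attention(logits, tau)
    return np.tensordot(alpha, kernels, axes=1)

def tau_schedule(epoch, tau_start=30.0, tau_end=1.0, anneal_epochs=30):
    """Anneal tau from 30 to 1 over the first 30 epochs (linear shape assumed)."""
    if epoch >= anneal_epochs:
        return tau_end
    return tau_start + (epoch / anneal_epochs) * (tau_end - tau_start)
```

A high temperature (τ = 30) yields near-uniform attention weights, so all K kernels receive gradient updates early in training; as τ approaches 1, the aggregation becomes increasingly input-selective.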

D. Dy-ReLU
An important hyperparameter of Dy-ReLU is the number of linear mappings M that the max operation, shown in Eq. 3, acts on. The results for M ∈ {1, 2, 3} are shown in Table IX. M = 1 results in a dynamic linear function, while M = 2 and M = 3 yield dynamic non-linear functions. The non-linear functions outperform the linear function, and, aligned with the findings in [17], M = 2 and M = 3 achieve similar performance.
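The max operation in Eq. 3 amounts to a pointwise maximum over M input-dependent linear functions. A minimal sketch with shared coefficient pairs (in the actual model the coefficients are predicted from the global context; the names here are ours):

```python
import numpy as np

def dy_relu(x, a, b):
    """Dy-ReLU: y = max_m (a_m * x + b_m), with M slope/offset pairs
    in a and b of shape (M,); applied elementwise to a 1-D input x."""
    return np.max(a[:, None] * x[None, :] + b[:, None], axis=0)
```

With a = (1, 0) and b = (0, 0), Dy-ReLU reduces to the conventional static ReLU; predicting the coefficients from the input is what makes the activation dynamic.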

E. Context Generation
In this section, different variants of context generation are studied. In particular, architectural variants are discussed in Section VI-E1, and modifications of the context size H are investigated in Section VI-E2.
1) Different architectural variants for context generation: Table X contains results for the following modifications of the context generation process:

• no shared context: Refers to a setting in which Dy-Conv and Dy-ReLU extract their own context by GAP and a learnable non-linear transformation, as originally proposed in [19] and [17], respectively. In this case, Dy-Conv and Dy-ReLU do not make use of the CGM output sequences. This experiment tests whether the shared CGM is capable of extracting a sufficiently rich global context that can be used to parameterize all dynamic components in a block.

• no shared seq. parameters: Indicates that, in contrast to the CGM setup in Fig. 2, the linear layer and batchnorm parameters are not shared across the sequences. Instead, two sets of parameters are learned to transform the time and frequency sequences. The motivation for this experiment is to decouple transformations involving time and frequency information.

• concat pooled seq.: Describes a setting in which Equations 4 and 6 are modified. Specifically, the sequences S_T and S_F are pooled separately, and the resulting vectors of size H are concatenated into a context vector of size 2 * H. Aligned with the motivation of the previous experiment, this setting avoids mixing time and frequency information.

The results presented in Table X show that all of these modifications lead to a slight decrease in performance, despite all of them increasing the number of parameters. The proposed design of the context generation is based on the findings of these experiments: the CGM follows the feature encoding process used in CA [22], and the dynamic coefficients of Dy-ReLU and Dy-Conv are derived as defined in Eqs. 4 and 6, respectively.

2) Varying the sequence embedding size H: As stated in Section VI-E, the embedding dimension for the time and frequency sequences computed by the CGM is defined as H = C_EXP / r. Additionally, H is clipped between a lower bound (H_MIN = 32) and an upper bound (H_MAX = 128) that are scaled accordingly with the model's width α. Table XI shows the results for different values of r, H_MIN and H_MAX. The results indicate that reducing the sequence embedding dimension H, by either increasing r or decreasing H_MIN and H_MAX, leads to a decrease in mAP. This shows that assigning a certain capacity to the global context is important for the dynamic components to exploit their full potential. However, increasing H by increasing the lower and upper bounds (H_MIN = 64 and H_MAX = 256) does not yield further performance improvements, showing that the performance saturates after a certain context size is reached.
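The clipped embedding size can be computed as a direct transcription of the rule above (integer division and the helper name are our assumptions):

```python
def context_dim(c_exp, r=4, h_min=32, h_max=128):
    """Embedding size H = C_EXP / r, clipped to [H_MIN, H_MAX].
    The bounds h_min and h_max are scaled with the width
    multiplier alpha elsewhere in the model definition."""
    return max(h_min, min(h_max, c_exp // r))
```

For example, narrow early blocks hit the lower bound, while wide late blocks with large expanded channel dimensions are capped by the upper bound.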

VII. INSPECTING THE DYNAMIC COMPONENTS
In this section, we analyze the dynamic components and investigate whether the CA attention weights and the predicted coefficients for Dy-ReLU and Dy-Conv are indeed input-dependent. In contrast to Section VI, the study in this section is performed using a single well-trained model. In particular, context shuffle serves as the main analysis to determine whether the dynamic components perform input-dependent transformations.

A. Coordinate Attention
For context shuffle, the performance decreases by more than 50%, which clearly shows that the dynamic recalibration weights of CA are input-dependent. To assess the importance of channel, time and frequency attention, we randomly shuffle the recalibration weights along (1) the channel dimension (channel shuffle), (2) both spatial dimensions (spatial shuffle), (3) the time dimension (time shuffle), and (4) the frequency dimension (frequency shuffle). The results in Table XII show that the channel attention mechanism is the most important, and attention over frequency bins is more important than attention over time frames. Surprisingly, channel shuffle and spatial shuffle lead to a more severe performance drop than context shuffle, which indicates that, besides dynamic input-dependent processing, CA learns a shared prior across the samples.

B. Dy-Conv
In addition to context shuffle, we probe the input-dependency of Dy-Conv by randomly shuffling the K entries in the attention weight vectors (attention shuffle). Both methods lead to a moderate performance decrease, showing that Dy-Conv is the least input-dependent of the three dynamic components in our setup. We further investigate the diversity of the K learned dynamic kernels by constructing the kernel W using uniform attention weights (W = Σ_k W_k / K) and by selecting the kernel with the maximum attention weight (W = W_{argmax_k α_k}). Since the performance with uniform attention drops only by 2.5 points in mAP, we conclude that the learned kernels are similar to each other. However, when we conduct the same experiment for a model using Dy-Conv as the only dynamic component (the setting "- CA, Dy-ReLU" in Table II), we find that the performance with context shuffle or uniform attention drops to a value close to random guessing. This indicates that Dy-ReLU and CA already cover much of the capabilities of Dy-Conv. This is also aligned with the finding that Dy-Conv leads to the smallest performance improvement among the three dynamic methods, as shown in Table II.
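The three Dy-Conv probes can be sketched as follows (a simplified NumPy illustration of the probing logic; function names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def attention_shuffle(alpha):
    """Probe: randomly permute the K attention weights, breaking
    the link between input and kernel weighting."""
    return rng.permutation(alpha)

def uniform_kernel(kernels):
    """Probe: W = sum_k W_k / K, i.e., uniform attention weights."""
    return kernels.mean(axis=0)

def argmax_kernel(kernels, alpha):
    """Probe: select only the kernel with the maximum attention weight."""
    return kernels[int(np.argmax(alpha))]
```

If a layer's kernels are diverse and selected in an input-dependent way, both the uniform and the shuffled probe should degrade performance sharply; the small drop observed here is what indicates similar kernels.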

C. Dy-ReLU
In the case of Dy-ReLU, context shuffle leads to a substantial performance drop of 15.3 points mAP. However, if we shuffle the predicted coefficients randomly across channels (channel shuffle), the performance drop is much more severe (44.3 points mAP). These results show that Dy-ReLU learns a prior for the channel coefficients that is shared across the samples, in line with the results for CA.
Fig. 6 provides additional insights into the behaviour of Dy-ReLU. It shows the Dy-ReLU input-to-output mapping of several blocks at different depths in a well-trained DyMN-M. The input-to-output mappings are collected from 10,000 randomly drawn samples from the AudioSet evaluation set. The dashed red line indicates the conventional static ReLU function. We verified that the patterns described in the following hold across several trained DyMNs, not only for the model used to create Fig. 6. An interesting pattern can be detected in Block 1: the input values are exclusively within a small range of positive values, and the Dy-ReLU approximates an identity function. In general, Dy-ReLUs in early blocks tend to learn to map points onto lines with specific slopes. For instance, the shape of the mappings in Block 3 resembles the absolute value function. In contrast, Dy-ReLUs in later blocks, such as Blocks 13 and 15, show highly dynamic behaviour, mapping specific input values to a wide range of different output values. In contrast to the conventional ReLU, Dy-ReLU maps many negative input values to positive activations.

VIII. CONCLUSION
In this work, we proposed dynamic convolutional neural networks as efficient pre-trained audio models. We integrated dynamic convolutions, dynamic ReLU, and Coordinate Attention into efficient inverted residual blocks and share the computation of a global context for dynamic parameterization across all dynamic modules in a block. The resulting models, named DyMNs, are pre-trained on AudioSet at three different complexity levels using Transformer-to-CNN Knowledge Distillation. DyMNs show a beneficial performance-complexity trade-off compared to their non-dynamic counterparts and to other Transformers and CNNs. Specifically, DyMN-L achieves a pre-training performance of 49.0 mAP on AudioSet, outperforming current popular Audio Spectrogram Transformers. Experiments on downstream tasks indicate that the proposed DyMNs outperform other CNNs by a large margin and are highly competitive with Audio Spectrogram Transformers while being much more computationally efficient. Furthermore, we show that DyMNs are suitable for simple task-specific fine-tuning by sharing the same fine-tuning pipeline across all downstream tasks. In short, DyMNs are efficient, high-performing, easy-to-fine-tune audio models that can have a large impact on the audio community, especially in the context of resource-critical applications.

IX. ACKNOWLEDGMENT
The computational results presented were achieved in part using the Linz Institute of Technology (LIT) AI Lab Cluster.

Fig. 1 .
Fig. 1. Dynamic inverted residual block: starting from the conventional inverted residual (IR) block, all convolutions are replaced by dynamic convolutions (Dy-Conv) [19]; Coordinate Attention (CA) [22] is used instead of Squeeze-and-Excitation [21]; and dynamic ReLU (Dy-ReLU) [17] replaces the non-linear activation function after the depthwise convolution. The Context Generation Module (CGM) operates on the block input and extracts embedded time and frequency sequences used to parameterize Dy-ReLU, Dy-Convs and CA. Dynamic components are depicted in red, and context generation is shown in green. The shape of the input feature map is denoted in terms of batch size × channels × frequency bands × time frames.

Fig. 2 .
Fig. 2. Context Generation Module (CGM): a zoom into the green parts of Fig. 1. The CGM operates on the input of an IR block and outputs a time and a frequency sequence embedded in a reduced channel dimension of size H. These sequences are used to parameterize Dy-ReLU [17] and Dy-Convs [19] and to compute the channel-time and channel-frequency recalibration weights for CA. The shape of the input feature map is denoted in terms of batch size × channels × frequency bands × time frames.
The linear layer embeds the channel dimension of size C_IN into a reduced space with H dimensions. We set H to a fraction of the expanded channel representation, such that H = C_EXP / r, where r = 4 in our experiments. The computationally most expensive operation in the CGM is the linear layer, which performs C_IN * H * (T_IN + F_IN) MACs. Compared to the first pointwise convolution in the IR block, which requires C_IN * C_EXP * T_IN * F_IN MACs, the computational demand of the CGM is insignificant.
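The MAC comparison can be checked directly with the two formulas above; the concrete channel and spatial sizes in the example are illustrative, not taken from the paper.

```python
def cgm_macs(c_in, h, t_in, f_in):
    """MACs of the CGM linear layer: C_IN * H * (T_IN + F_IN)."""
    return c_in * h * (t_in + f_in)

def pointwise_macs(c_in, c_exp, t_in, f_in):
    """MACs of the first pointwise convolution: C_IN * C_EXP * T_IN * F_IN."""
    return c_in * c_exp * t_in * f_in

# illustrative sizes: 64 input channels expanded to 256, a 32x32 feature map, H = 64
cgm = cgm_macs(64, 64, 32, 32)        # linear layer of the CGM
pw = pointwise_macs(64, 256, 32, 32)  # first pointwise convolution
```

The CGM cost grows with T_IN + F_IN while the pointwise convolution grows with T_IN * F_IN, so the gap widens further for larger feature maps.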

Fig. 3 .
Fig. 3. Coordinate Attention (CA): CA highlights important channels, frequency bins and time frames by recalibrating the feature map via elementwise multiplication with attention weights. It operates on the output sequences of the CGM and transforms them separately into the channel-time S_T and channel-frequency S_F attention weights. Average pooling with a kernel size of 3 and a stride of 2 is used in the case of a strided IR block. The linear transformation upsamples the number of channels from H to C_EXP to match the dimensionality of the feature map after the depthwise convolution.

Fig. 6 .
Fig. 6. The figure shows the input-to-output mapping of Dy-ReLU for blocks at different depths in a well-trained DyMN-M. To generate the plots, we use 10,000 randomly selected samples from the AudioSet evaluation set. Block 1 corresponds to the first block and Block 15 to the final block in DyMN. The dashed red line depicts the conventional ReLU function.

TABLE II
IMPORTANCE OF DY-RELU, DY-CONV AND CA IN THE PROPOSED DYNAMIC IR BLOCK. "-" DENOTES THAT THE RESPECTIVE DYNAMIC COMPONENTS ARE REMOVED FROM THE PROPOSED DYMN.

TABLE IV
EFFECT OF VARYING THE POSITION OF DY-RELU AND DY-CONV IN THE DYNAMIC IR BLOCK.

TABLE VI
EFFECT OF APPLYING CA SELECTIVELY FOR CHANNEL-TIME AND CHANNEL-FREQUENCY RECALIBRATION.

TABLE VIII
TEMPERATURE τ FOR COMPUTING DY-CONV ATTENTION WEIGHTS.

TABLE X
DIFFERENT SETTINGS FOR CONTEXT GENERATION.

The LIT AI Lab is supported by the Federal State of Upper Austria. Gerhard Widmer's work is supported by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme, grant agreement No 101019375 (Whither Music?).