BYOL for Audio: Exploring Pre-trained General-purpose Audio Representations

Pre-trained models are essential as feature extractors in modern machine learning systems in various domains. In this study, we hypothesize that representations effective for general audio tasks should provide multiple aspects of robust features of the input sound. For recognizing sounds regardless of perturbations such as varying pitch or timbre, features should be robust to these perturbations. For serving the diverse needs of tasks such as recognition of emotions or music genres, representations should provide multiple aspects of information, such as local and global features. To implement our principle, we propose a self-supervised learning method: Bootstrap Your Own Latent (BYOL) for Audio (BYOL-A, pronounced "viola"). BYOL-A pre-trains representations of the input sound invariant to audio data augmentations, which makes the learned representations robust to the perturbations of sounds. Meanwhile, the BYOL-A encoder combines local and global features and calculates their statistics so that the representation provides multi-aspect information. As a result, the learned representations should provide robust and multi-aspect information to serve various needs of diverse tasks. We evaluated the general audio task performance of BYOL-A compared to previous state-of-the-art methods, and BYOL-A demonstrated generalizability with the best average result of 72.4% and the best VoxCeleb1 result of 57.6%. Extensive ablation experiments revealed that the BYOL-A encoder architecture accounts for most of the performance, and that the remaining critical portion comes from the BYOL framework and BYOL-A augmentations. Our code is available online at https://github.com/nttcslab/byol-a for future studies.

I. INTRODUCTION
Pre-trained models play a vital role as feature extractors in various domains, e.g., BERT [1] in the natural language processing domain and ImageNet pre-trained models [2]-[4] in the image domain. In the audio domain, pre-trained models (e.g., VGGish [5]) have enabled recent advances in applications such as heart sound classification [6], Alzheimer's disease detection [7], conservation monitoring [8], audio captioning [9], audio retrieval [10], and so forth.
Various audio pre-trained models have been proposed for supervised learning [12]-[15] or unsupervised learning [16]-[21] methods, and they have been evaluated on target tasks such as sound event recognition (SER) [22]-[25], non-semantic speech (NOSS) [19] (e.g., speech command recognition [26], speaker identification [27]), and music tasks (e.g., music genre [28] and instrument [29] classification). However, while the methods claim the state-of-the-art, it is unclear which method generalizes better because their benchmarks are not compatible.
Fig. 1. (a) BYOL [11] for Audio learning scenario: An input audio $x_i$ branches into two directions, or views, $v_i$ and $v_i'$, by mixing audio $x_j$ and $x_k$ and making pitch/time/amplitude random. $v_i$ and $v_i'$ are projected through networks, and then the loss is minimized on the projected $z'_\xi$ and predicted $q_\theta(z_\theta)$. BYOL-A learns representations invariant to the difference between $v_i$ and $v_i'$. Find more details in Section II-E. (b) BYOL-A feature calculation: The input becomes feature maps extracted by Conv. For each time frame, the feature map is flattened as local features by Reshape and turned into global features by MLP; then both features are concatenated into mixed features. Finally, the $T$ frame features are temporally pooled into mean+max statistics, achieving a multi-aspect robust feature.
Our goal is to explore a way to achieve a versatile audio representation that works effectively for various tasks as is, off the shelf, without extra effort such as fine-tuning. A model becomes more widely applicable if it can be used as a frozen feature extractor, because fine-tuning takes non-negligible effort, such as carefully tuning the learning rate so as not to break valuable pre-trained features. Thus, a representation that is versatile enough as is would be the ultimate goal.
However, the needs of various task settings are both common and conflicting, which makes a single one-size-fits-all representation difficult. For example, we recognize words regardless of who speaks; conversely, we identify speakers while ignoring the words spoken. In contrast to these conflicts, we commonly ignore slight differences such as pitch, duration, or timbre when listening to speech. These observations suggest that while providing multiple aspects of information may serve conflicting needs, ignoring slight differences may serve the common needs.
PREPRINT arXiv:2204.07402v2 [eess.AS] 16 Jun 2022
For serving different needs, multiple features available from different layers of a single model, and statistics of these features, can potentially be helpful. A previous study [30] showed that early layers of CNNs, which yield local features, contain relatively "general" filters, whereas deeper layers, which yield global features, are specific to the pre-training. Other studies [31]-[33] even fused the multilayer features. In addition, global pooling for summarizing variable-length audio features offers multiple options, such as temporal average or max pooling [16], [34], [35], or even a combination of them [12].
For serving common needs, we want a representation that ignores slight differences in sounds, and self-supervised learning (SSL) frameworks for the image domain can be a good choice. These methods learn representations invariant to augmentations, and we can use audio data augmentations to make a difference in sounds. Typical choices are contrastive learning methods such as SimCLR [36] or MoCo [37], which learn representations through discriminating augmented positive pairs from augmented negative pairs in an input batch. However, we think Bootstrap Your Own Latent (BYOL) [11] can be a better choice because it learns representations invariant to changes created by augmentations, achieving what we seek directly.
To implement our principle, we combine the aforementioned options to achieve a versatile general-purpose audio representation. For encoding such representation, our network takes multilayer features and calculates statistics for accommodating multi-aspect information, which serves different needs of tasks. For pre-training, our BYOL variant framework with audio data augmentations learns a representation robust to the perturbations created by the augmentations, which serves common needs of tasks. As a result, the learned representations should provide multi-aspect robust features of the input sounds and serve various needs of diverse tasks.
The following summarizes our contributions:
• We propose Bootstrap Your Own Latent (BYOL) for Audio (BYOL-A, pronounced "viola"). BYOL-A learns representations robust to the perturbations of sounds, and its encoder combines statistics of local and global features to provide multiple aspects of information.
• We make a new benchmark that evaluates generalizability on diverse audio tasks.
• We demonstrate the generalizability and effectiveness of our method using the benchmark while comparing ours with eleven representations extracted from the conventional state-of-the-art models.
• We conduct intensive ablation studies to clarify the contributions of the BYOL-A framework, augmentations, and encoder network architecture.
• We make our code available online (https://github.com/nttcslab/byol-a) for reproducibility and to foster progress in future studies.
Fig. 1 describes the BYOL-A representation learning scenario and feature calculation on the encoder. BYOL-A pre-trains the encoder to transform the input into representations robust to data augmentations, and its encoder combines statistics of local and global features to provide multiple aspects of information. As a result, representations should become robust and multi-aspect for serving various needs of audio task settings.

A. Relationship with our previous work
We introduced BYOL-A in our previous work [38]. In this section, we clarify the relationship between the previous and present work.
Our previous work proposed BYOL-A, which extends BYOL to work with audio data augmentations designed to learn the audio representations of specific targets, namely foreground acoustic event sound and the sound texture details. Though it showed the state-of-the-art performance by learning representations for these targets, the comparisons were limited only among unsupervised learning methods, and we did not discuss encoder architecture improvements.
In this paper, we redefined our hypothesis to explore pre-trained general-purpose audio representations with a broader research scope. To achieve this new goal, we extended BYOL-A to learn representations invariant to the perturbations of sound. We also extended the BYOL-A encoder architecture to provide multiple aspects of learned features. Ablation studies in Sections IV-D, IV-E, and IV-F clarify the improvements from the previous one proposed in [38].
Detailed differences are listed as follows:
• We redefine our hypothesis for effective general-purpose audio representations.
• We reinterpret BYOL-A to learn representations invariant to the perturbations of sounds made by data augmentations rather than learn representations of specific sounds.
• We refine data augmentations for improving performance.
• We improve the encoder architecture to combine multiple aspects of information.
• We evaluate our proposals with a wide variety of popular models and tasks under a unified benchmark.
• We conduct intensive ablation studies to analyze contributions of the BYOL-A framework, augmentations, and encoder network architecture.

B. Audio pre-training methods
First, we overview previous pre-training methods and differentiate ours from them. Supervised learning methods learn representations to discriminate labels, relying on labels assigning samples to a predefined class [5], [12]-[15]. Self-supervised learning (SSL) methods using masked prediction, predominant in the speech domain, learn to predict or reconstruct masked portions of the input, relying on the input masking [39]-[41]. SSL methods using contrastive learning learn to discriminate instances among batch samples, relying on comparison in a large batch [19]-[21], [42]. Cross-modal SSL methods also learn to discriminate correspondence across modalities, relying on the co-occurrence of the pair of modalities [16]-[18]. We adopt BYOL and learn representations invariant to input changes, relying on the changes of input audio created by audio data augmentations.
Cross-modal/multi-modal pre-training methods have been proposed. OpenL3 [17] was pre-trained by using audio-visual correspondence as a training signal. The method by Wang et al. [18] was pre-trained by using correspondence between video, spectrograms, and raw waveforms to learn general-purpose audio representations. COALA [16] was pre-trained by aligning the learned latent representations of audio and associated tags and evaluated on SER and music tasks. For speech tasks, a concurrent work SLAM [47] learns speech and language modeling jointly in a multi-task fashion.
These methods showed effectiveness, but their evaluation settings are not compatible with each other, making comparison difficult for future applications to pick a suitable representation.
In concurrent works, BigSSL [48], data2vec [49], and the method by Wang et al. [50] exhibit remarkable performance on various tasks, while SERAB [51] adopts our previous BYOL-A [38] on speech emotion recognition tasks. The data2vec combines masked prediction and learning the latent target representations, similar to BYOL. WavLM [52] learns representations using masked speech prediction and denoising mixed utterances, and the learning from denoising is similar to ours. BYOL-A learns a representation of a sound invariant to the mixed background sound.

C. Benchmarks for pre-trained models
To assess the generalizability and re-usability of pre-trained models across a wide range of tasks, we need a benchmark such as SUPERB [53]. Like the standard linear evaluation protocol [36], [54], SUPERB trains lightweight heads on top of the frozen pre-trained model. While it supports a broad range of tasks, these tasks are limited to the speech domain; no established benchmark exists for non-speech audio tasks. Therefore, we build a benchmark for evaluating general-purpose audio representations in this study.
Concurrent works on benchmarks share a setting similar to that in this study to evaluate the generalizability of frozen pre-trained models across different architectures, pre-training frameworks, and datasets. HARES [50] trains a linear layer head, the same as we do, while HEAR [55] uses a shallow MLP downstream classifier. SERAB [51] also evaluates frozen models in various speech emotion recognition tasks.
Other approaches average frequency first and then summarize time. PANNs' [12] CNN14 model first averages along the frequency axis. Then, it applies a temporal pooling operation, which we refer to as temporal mean+max pooling hereafter, that calculates the sum of temporal mean- and max-pooling as the resulting output. Fonseca et al. [35] also used frequency average pooling first and then applied temporal max pooling. Ford et al. [56] proposed adding an attention module to summarize the output of frequency average pooling.
All these approaches apply flattening, max, or averaging operations, which can impair the information needed for downstream tasks. For example, averaging along frequency hides information about frequency patterns that can be crucial to tasks such as speaker age estimation.

E. Bootstrap Your Own Latent (BYOL)
BYOL [11] is a self-supervised learning algorithm that learns image representations invariant to data augmentations. Although contrastive learning methods such as SimCLR [36] and MoCo [37] also learn representations invariant to data augmentations as BYOL does, we believe that BYOL is appropriate for our purpose because it learns representations from a single input, whereas contrastive learning methods learn by comparison among input batch samples.
As shown in Fig. 2, BYOL consists of two neural networks, referred to as online and target networks. The online network is defined by a set of weights $\theta$, and the target network has the same architecture as the online network but uses a different set of weights, $\xi$. First, BYOL produces two augmented views, $v \triangleq t(x)$ and $v' \triangleq t'(x)$, from an image $x$ by applying image augmentations $t \sim \mathcal{T}$ and $t' \sim \mathcal{T}'$, respectively, where $\mathcal{T}$ and $\mathcal{T}'$ denote the two distributions of the image augmentations. Then, the online network outputs a representation $y_\theta$, a projection $z_\theta$, and a prediction $q_\theta(z_\theta)$ from the first view $v$. On the other hand, the target network outputs $y'_\xi$ and the target projection $z'_\xi$ from the second view $v'$. Finally, the following mean squared error between the L2-normalized predictions $\bar{q}_\theta(z_\theta)$ and target projections $\bar{z}'_\xi$ is calculated:
$$L_{\theta,\xi} \triangleq \left\| \bar{q}_\theta(z_\theta) - \bar{z}'_\xi \right\|_2^2 = 2 - 2 \cdot \frac{\langle q_\theta(z_\theta), z'_\xi \rangle}{\| q_\theta(z_\theta) \|_2 \cdot \| z'_\xi \|_2},$$
where $\langle \cdot, \cdot \rangle$ denotes the inner product. To symmetrize the loss $L_{\theta,\xi}$, $\tilde{L}_{\theta,\xi}$ is computed by feeding $v'$ to the online network and $v$ to the target network. The final loss is defined as $L^{\mathrm{BYOL}}_{\theta,\xi} = L_{\theta,\xi} + \tilde{L}_{\theta,\xi}$. At each training step, BYOL minimizes this loss function with respect to $\theta$ only, but $\xi$ is a slowly moving exponential average of $\theta$:
$$\xi \leftarrow \tau \xi + (1 - \tau) \theta,$$
where $\tau$ is a target decay rate. It has been empirically shown that the combination of adding the predictor to the online network and using the moving average of the online network parameters as the target network encourages encoding more and more information within the online projection and avoids collapsed solutions such as constant representations.
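The loss and the target update described above can be illustrated with a minimal NumPy sketch. It assumes the standard BYOL formulation [11], in which the mean squared error between two L2-normalized vectors equals 2 minus twice their cosine similarity; the function names and the dictionary-of-arrays parameter layout are hypothetical and chosen only for illustration.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row of x to unit L2 norm."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def byol_loss(q_online, z_target):
    """Per-sample squared error between L2-normalized prediction and target.

    Equivalent to 2 - 2 * cosine_similarity(q_online, z_target).
    """
    q = l2_normalize(q_online)
    z = l2_normalize(z_target)
    return 2.0 - 2.0 * np.sum(q * z, axis=-1)

def symmetric_byol_loss(q_v, z_v2, q_v2, z_v):
    """Sum of the loss for (v -> online, v' -> target) and the swapped pair."""
    return byol_loss(q_v, z_v2) + byol_loss(q_v2, z_v)

def ema_update(xi, theta, tau=0.99):
    """Target weights follow the online weights: xi <- tau * xi + (1 - tau) * theta.

    xi and theta are hypothetical dicts mapping parameter names to arrays.
    """
    return {name: tau * xi[name] + (1.0 - tau) * theta[name] for name in xi}
```

In an actual training loop, gradients would flow only through the online branch, and `ema_update` would be called once after every optimizer step.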

III. PROPOSED METHOD
A. BYOL for Audio (BYOL-A)
We expand BYOL to learn representations of the input itself invariant to the perturbations of sounds by replacing the image data augmentations with ones for audio, as shown in Fig. 2. The augmentation blocks are Mixup (Section III-A1), Random Resize Crop (RRC, Section III-A2), and Random Linear Fader (RLF, Section III-A3). Mixup makes random background sound, RRC makes random frequency/time shifts/stretches, and RLF makes random temporal amplitude changes, a simulation of random fade in/out. Fig. 4 shows an example of augmentations. BYOL-A learns to cancel the difference between two augmentation results $v$ and $v'$. As a result, the learned representations should be robust to the perturbation of sounds (e.g., robust to changes in background sound, pitch shift, time shift and stretch, and volume of sound).
The input raw audio samples are preprocessed into time-frequency (TF) features of the log-mel spectrogram, as has been done in previous studies. Then the module applies Pre-Norm, which normalizes input data $x$ to $\tilde{x} = (x - \mu)/\sigma$, where $\mu$ and $\sigma$ are the average and standard deviation of training samples, respectively. This operation stabilizes the computations in the following augmentation blocks. Similarly, after all the augmentations are applied, the module also applies Post-Norm, so that the final outputs of the augmentation module become $\sim N(0, 1)$. Augmentation operations can cause statistical drift in their outputs; the Post-Norm corrects this possible drift.
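As a rough illustration of the two normalization steps, the sketch below applies Pre-Norm with precomputed dataset statistics and Post-Norm with statistics taken over the current batch. Computing the Post-Norm statistics per batch is an assumption made here for illustration; the text only states that the outputs become approximately standard normal.

```python
import numpy as np

def pre_norm(x, mean, std):
    """Pre-Norm: normalize with the mean and standard deviation of the training samples."""
    return (x - mean) / std

def post_norm(batch, eps=1e-8):
    """Post-Norm: re-normalize augmented outputs to zero mean and unit variance.

    Assumption: the statistics are computed over the whole batch of augmented samples.
    """
    return (batch - batch.mean()) / (batch.std() + eps)
```

Any augmentation that shifts or rescales the normalized spectrograms would be placed between these two calls.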
As we describe in Section III-A4, we design the encoder for BYOL-A to encode representations to form multiple aspects of information by combining statistics of local and global features to meet various needs of tasks.
1) Mixup for making background sound perturbation: We modify Mixup [57] or between-class (BC) learning [58] to make slight randomness in background sound. These data augmentation techniques interpolate both features and labels between two data samples to create new data. Using normalized log-mel spectrogram audio as input, our Mixup block randomly picks up a sample from a queue of past inputs and mixes it with the current input audio sample in a small ratio. As a result, mixed random audio becomes a part of the background sound in the current input.
While the original Mixup applies to both audio features and labels, our Mixup applies only to the features because we do not use labels. In addition, as audio is log-scaled, we convert the input to a linear scale before the Mixup calculation and restore it to a log-scale again. In this paper, we refer to these operations as log-mixup-exp, from the analogy to the log-sum-exp [59] calculation. Log-mixup-exp of the $i$th input $x_i$ is
$$\tilde{x}_i = \log\left((1 - \lambda)\exp(x_i) + \lambda\exp(x_k)\right),$$
where $x_k$ is a mixing counterpart, and the mixing ratio $\lambda$ is sampled from the uniform distribution $U(0.0, \alpha)$, like in between-class learning. In addition, $\alpha$ is a mixing ratio hyper-parameter that controls the degree of contrast between the resulting two mixed outputs. In preliminary experiments, we observed that the evaluation result improves with smaller $\alpha$, 0.2 for example, where $\tilde{x}_i$ retains more of the original contents $x_i$ than of its counterpart $x_k$.
$x_k$ is randomly chosen from a FIFO queue storing past inputs. As input is randomly sampled from the training dataset, samples in the queue form random subsets of the dataset. We store 2,048 samples in the queue, which is larger than the batch size and big enough to maintain randomness.
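The Mixup block can be sketched as follows, assuming that log-mixup-exp takes the form log((1 − λ)·exp(x_i) + λ·exp(x_k)) with λ drawn from U(0, α). The class name, the choice to pass the very first input through unchanged, and the choice to store the unmixed input in the queue are assumptions made for this illustration.

```python
import numpy as np
from collections import deque

class MixupBackground:
    """Mix the current log-mel input with a random past input in a small ratio."""

    def __init__(self, alpha=0.2, queue_size=2048, seed=0):
        self.alpha = alpha
        self.queue = deque(maxlen=queue_size)  # FIFO: the oldest sample is dropped first
        self.rng = np.random.default_rng(seed)

    def __call__(self, x):
        if self.queue:
            # Pick a random mixing counterpart x_k from the past inputs.
            x_k = self.queue[int(self.rng.integers(len(self.queue)))]
            lam = self.rng.uniform(0.0, self.alpha)
            # log-mixup-exp: mix on a linear scale, then return to the log scale.
            mixed = np.log((1.0 - lam) * np.exp(x) + lam * np.exp(x_k))
        else:
            mixed = x  # assumption: nothing to mix with yet
        self.queue.append(x)  # assumption: the queue keeps the unmixed input
        return mixed
```

Because λ never exceeds α, a small α such as 0.2 keeps the current input dominant and leaves the counterpart as a faint background.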
2) Random Resize Crop (RRC) for making random frequency/time shift/stretch: RRC is an image augmentation technique that we use as an approximation of pitch shift, time shift, and time stretch of input log-mel spectrograms for learning representations invariant to the perturbations of frequency/time shift/stretch. Fig. 5 shows the random crop procedure. The unit size of the input spectrogram consists of a number of frequency bins, $F$, and a number of time frames, $T$. First, we sample the random crop area from the virtual crop boundary, which has longer time frames than the input, $1.5 \times T$ for example. The size of the crop area is randomly sampled as
$$F_C = \lfloor \min(U(f_1, f_2), 1.0) \times F \rfloor,$$
$$T_C = \lfloor U(t_1, t_2) \times T \rfloor,$$
where $F_C$ and $T_C$ are the number of frequency bins and number of time frames of the random crop size, respectively, $f_1$ and $f_2$ form a frequency bin range $[f_1, f_2]$, $t_1$ and $t_2$ form a time frame range $[t_1, t_2]$, $\lfloor \cdot \rfloor$ is a floor function, and $\min(\cdot, \cdot)$ is a minimum function. Contents in the crop area are then resized to the size of the input by bicubic interpolation. The virtual crop boundary is wider than the input, and we use $[0.6, 1.5]$ for both the frequency bin and time frame ranges in this paper, so the crop area can contain the outside of the input. This area is filled with zeros. Note that we do not crop the outside of the frequency bin, which is restricted by the $\min()$ function in the $F_C$ calculation above.
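The following sketch illustrates the crop procedure under several assumptions: the crop size is sampled as F_C = ⌊min(U(f_1, f_2), 1.0) × F⌋ and T_C = ⌊U(t_1, t_2) × T⌋, the input is centered in time inside a zero-filled virtual canvas, and the crop position is uniform. To stay NumPy-only, separable linear interpolation is swapped in for the bicubic interpolation named in the text, so this is not a faithful reproduction of the resizing step.

```python
import numpy as np

def sample_crop_size(F, T, freq_range=(0.6, 1.5), time_range=(0.6, 1.5), rng=None):
    """Sample the crop size; the frequency size is clipped so it never exceeds F."""
    rng = np.random.default_rng() if rng is None else rng
    F_C = int(np.floor(min(rng.uniform(*freq_range), 1.0) * F))
    T_C = int(np.floor(rng.uniform(*time_range) * T))
    return F_C, T_C

def resize_linear(x, out_f, out_t):
    """Stand-in for bicubic resizing: linear interpolation along each axis in turn."""
    f_pos = np.linspace(0, x.shape[0] - 1, out_f)
    t_pos = np.linspace(0, x.shape[1] - 1, out_t)
    tmp = np.stack([np.interp(f_pos, np.arange(x.shape[0]), x[:, j])
                    for j in range(x.shape[1])], axis=1)
    return np.stack([np.interp(t_pos, np.arange(x.shape[1]), tmp[i])
                     for i in range(out_f)], axis=0)

def random_resize_crop(x, virtual_time_scale=1.5, rng=None):
    """x: spectrogram of shape [F, T]; returns an augmented array of the same shape."""
    rng = np.random.default_rng() if rng is None else rng
    F, T = x.shape
    T_V = int(virtual_time_scale * T)
    canvas = np.zeros((F, T_V), dtype=x.dtype)   # area outside the input stays zero
    offset = (T_V - T) // 2                      # assumption: input centered in time
    canvas[:, offset:offset + T] = x
    F_C, T_C = sample_crop_size(F, T, rng=rng)
    T_C = min(T_C, T_V)
    f0 = int(rng.integers(0, F - F_C + 1))
    t0 = int(rng.integers(0, T_V - T_C + 1))
    crop = canvas[f0:f0 + F_C, t0:t0 + T_C]
    return resize_linear(crop, F, T)
```

A crop narrower than the input in frequency stretches the spectrum when resized, which roughly mimics a pitch shift; a crop longer or shorter in time mimics a time stretch.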
3) Random Linear Fader (RLF) for making random linear amplitude change: RLF randomly applies a linear change in volume to the entire audio segment to simulate the approaching or passing away of a sound source, or fade-in/out. It adds random linear volume changes to the entire segment without losing the patterns of the contained sound events.
Let each element of the input spectrogram $x$ be $x[t, f]$, where $t$ is the time frame and $f$ is the frequency bin. First, we calculate the temporal amplitude change $S[t]$ as follows:
$$S[t] = a + (b - a) \cdot t / T,$$
where $T$ is the number of time frames, start frame gain $a \sim U(-1.0, 1.0)$, and end frame gain $b \sim U(-1.0, 1.0)$. $S[t]$ is the gain for each time frame linearly interpolated from $a$ to $b$. Then, we add $S$ to the input to make a linear amplitude change in the log-mel spectrogram:
$$\hat{x}[t, f] = x[t, f] + S[t]$$
for every frequency bin $f$, where $\hat{x}[t, f]$ is the result of the RLF calculation, and $F$ is the number of frequency bins. For example, if $a = -1$ and $b = 0.5$, the relationship is $a < b$, which is an approximation of fade-in where the volume increases with time; if $a = 0.5$ and $b = -0.5$, it is an approximation of fade-out where the volume decreases with time.
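A minimal sketch of RLF follows, assuming that the gain is interpolated as S[t] = a + (b − a)·t/T for t = 0, …, T − 1 and added to every frequency bin of a spectrogram indexed as x[t, f]. The function names are hypothetical.

```python
import numpy as np

def linear_fade(x, a, b):
    """Add a gain that moves linearly from a toward b along time to x[t, f]."""
    T = x.shape[0]
    S = a + (b - a) * np.arange(T) / T      # S[t] for t = 0 .. T-1
    return x + S[:, None]                   # same gain for every frequency bin

def random_linear_fader(x, gain=1.0, rng=None):
    """Sample the start and end gains from U(-gain, gain) and apply the fade."""
    rng = np.random.default_rng() if rng is None else rng
    a, b = rng.uniform(-gain, gain, size=2)
    return linear_fade(x, a, b)
```

Since the input is on a log scale, adding S[t] corresponds to multiplying the linear-scale amplitude by a smoothly changing factor, which leaves the time-frequency patterns of the sound events intact.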

4) BYOL-A encoder network:
To enable a representation that provides multiple aspects of information, we design the BYOL-A encoder to (i) preserve all available information in global pooling, (ii) optimize the resolution of local features, (iii) combine local and global features, and (iv) combine average and maximum statistics in time.
We use the audio embedding block from [60], which satisfies requirement (i), as a base architecture. We make modifications to this base architecture to realize the remaining requirements (ii)-(iv). Table I shows the architecture, where 3x3 or 2x2 denotes the filter size, and the number after @ indicates the channel size. This CNN takes the input to produce local features in Conv blocks, the number of which is adjusted to make the receptive field smaller for (ii). Then, Reshaping flattens frequency and channel along the time axis, preserving available information to satisfy (i). MLP learns to make useful global features on top of local features, and the following Concat concatenates both features to implement (iii). Finally, Pooling summarizes features into 3,072-d representation vectors using temporal mean+max pooling to meet (iv).
To adapt the base CNN to our purposes, we made three modifications. First, we reduce the number of convolutional blocks to increase the local feature resolution. One Conv block halves the output frequency and time resolution. We set the number of blocks to two, down from three in the base architecture. This adjustment directly changes the receptive field (RF) size, and tuning the RF of CNNs is considered crucial for generalization to unseen testing data [61]. We conduct an ablation study in Section IV-E and further discuss the necessity of RF adjustment in Appendix A.
The second modification adds the Concat block that concatenates features from the earlier Reshaping block, a local feature, and later MLP block, a global feature.
The last modification adds the Pooling block that sums each element from temporal mean pooling and temporal max pooling of Concat output features, the temporal mean+max pooling [12], to accommodate advantages of both average and maximum statistics.
The encoder with these modifications, as a whole, produces representations that combine local and global features as well as the statistics of the features while preserving frequency- and channel-wise information. The total number of encoder parameters is 6,333,376.
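The steps after the Conv blocks (Reshaping, MLP, Concat, and Pooling) can be sketched as below. The two-layer ReLU MLP and all layer sizes are assumptions for illustration; the input is taken to be Conv feature maps of shape [batch, channel, frequency, time].

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def encode(feature_maps, W1, b1, W2, b2):
    """Turn Conv feature maps [B, C, F', T'] into one embedding per sample.

    W1, b1, W2, b2 are the weights of a hypothetical two-layer MLP.
    """
    B, C, Fp, Tp = feature_maps.shape
    # Reshaping: flatten channel and frequency for each time frame (local features).
    local = feature_maps.transpose(0, 3, 1, 2).reshape(B, Tp, C * Fp)
    # MLP: global features computed on top of the local features.
    global_feat = relu(relu(local @ W1 + b1) @ W2 + b2)
    # Concat: keep both local and global features for every frame.
    mixed = np.concatenate([local, global_feat], axis=-1)
    # Pooling: temporal mean+max, i.e., the sum of mean pooling and max pooling.
    return mixed.mean(axis=1) + mixed.max(axis=1)
```

The embedding size equals the local feature size (C × F') plus the MLP output size, independent of the input duration.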

IV. EXPERIMENTS
To assess the generalizability of BYOL-A, we created a new benchmark described in Section IV-A that covers a wide range of tasks, which we used throughout our experiments. We detail the BYOL-A pre-training in Section IV-B. Then, we evaluate BYOL-A and compare it with previous studies in Section IV-C. We further conducted ablation studies: data augmentation block ablations in Section IV-D, network architecture ablations in Section IV-E, global pooling ablations in Section IV-F, and BYOL framework ablations in Section IV-G. Lastly, we summarize the experiments in Section IV-H.

A. Benchmark
We explore general-purpose audio representations. To assess the generalizability of pre-trained models, we perform the standard linear evaluation protocol [36], [54] using frozen pre-trained models across a wide range of SER, NOSS, and music tasks collected from previous studies.
1) Procedure details: The linear evaluation pipeline first converts downstream task samples into feature embeddings using a frozen pre-trained model as a feature extractor and then trains a linear layer with the task labels. It then gets the test results using the trained linear layer. All these results except FSD50K are accuracies; FSD50K results are mean average precision (mAP) and area under the curve (AUC).
All audio samples were randomly cropped to the average duration of the task dataset or zero-padded at the end, and they were resampled to the default sampling rate of each model. For the models that come with a dedicated pre-processor that converts raw audio to TF features, we used the pre-processor. The feature embeddings extracted by the models were standardized prior to training a linear layer.
We used the validation set for early stopping with a patience of 20 epochs and trained the linear layer for up to 200 epochs with the Adam optimizer. For every test, we manually tuned the learning rate between 0.00001 and 0.01 to get the best results. We ran each evaluation three times and averaged the results.
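The linear evaluation procedure can be sketched in NumPy as follows. Several details are assumptions made for illustration: embeddings are standardized with the statistics of the training split, the linear layer is a softmax classifier trained with full-batch updates, and early stopping monitors validation accuracy. Only the Adam optimizer, the patience of 20 epochs, and the limit of 200 epochs come from the text.

```python
import numpy as np

def standardize(train, *others, eps=1e-8):
    """Standardize all splits with the mean and std of the training embeddings."""
    mu, sd = train.mean(axis=0), train.std(axis=0) + eps
    return [(a - mu) / sd for a in (train, *others)]

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def accuracy(W, b, X, y):
    return float(np.mean(np.argmax(X @ W + b, axis=1) == y))

def train_linear(Xtr, ytr, Xva, yva, n_classes, lr=1e-3, max_epochs=200, patience=20):
    """Train one linear layer on frozen embeddings with Adam and early stopping."""
    params = [np.zeros((Xtr.shape[1], n_classes)), np.zeros(n_classes)]
    m = [np.zeros_like(p) for p in params]
    v = [np.zeros_like(p) for p in params]
    onehot = np.eye(n_classes)[ytr]
    best_acc, best, wait = -1.0, None, 0
    for epoch in range(1, max_epochs + 1):
        W, b = params
        err = (softmax(Xtr @ W + b) - onehot) / len(Xtr)   # cross-entropy gradient
        for i, g in enumerate([Xtr.T @ err, err.sum(axis=0)]):
            m[i] = 0.9 * m[i] + 0.1 * g                      # Adam first moment
            v[i] = 0.999 * v[i] + 0.001 * g * g              # Adam second moment
            m_hat = m[i] / (1 - 0.9 ** epoch)
            v_hat = v[i] / (1 - 0.999 ** epoch)
            params[i] = params[i] - lr * m_hat / (np.sqrt(v_hat) + 1e-8)
        acc = accuracy(params[0], params[1], Xva, yva)
        if acc > best_acc:
            best_acc, best, wait = acc, [p.copy() for p in params], 0
        else:
            wait += 1
            if wait >= patience:                             # early stopping
                break
    return best, best_acc
```

The returned weights are those of the best validation epoch; the test score would then be computed once with `accuracy` on the standardized test embeddings, and the whole procedure repeated and averaged over runs.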
2) Downstream tasks: We employed ten downstream tasks widely used in the previous studies as shown in Table II: three sound event recognition (SER) tasks, four non-semantic speech [19] (NOSS) tasks, and three music tasks. All tasks are multi-class single-label classifications except FSD50K, which is a multi-label classification. Therefore, we report FSD50K results separately. The following describes the tasks:
• ESC-50 [23]: a sound classification task with 50 environmental sound classes. We conduct leave-one-out cross-validation (LOOCV) with the official five folds.
• UrbanSound8K [24] (US8K): an urban sound classification task. We conduct LOOCV with the official ten folds.
In addition to the perspective of task diversity, we added to our benchmark a view of the sound event characteristics to gain an understanding of the utility of representations. To do so, we introduced three subsets of FSD50K classes that group together the original classes that have similar characteristics of the sound events they contain. We report mAP results for each subset as well as the usual mAP results for all classes. The following lists the subsets; Appendix B describes the details:

B. Pre-training BYOL-A
We manually tuned hyperparameters for the BYOL framework and conducted an automatic parameter search for audio data augmentations to improve performance. The following describes the details.

1) BYOL framework settings:
We used the same MLPs as in the original BYOL for the projection and prediction in BYOL-A networks, namely, a linear layer with an output size of 4,096 followed by batch normalization (BatchNorm), rectified linear units (ReLU), and a linear layer to output embeddings with 256 dimensions. We trained for 100 epochs with the Adam optimizer with a learning rate of 0.0001, target decay rate parameter $\tau = 0.99$, and batch size of 256. While we tweaked the learning rate for better performance, we found that the default value of $\tau$ and the handy batch size pre-train well. We further discuss this in Section IV-G.
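The projection and prediction MLPs described above (Linear to 4,096, BatchNorm, ReLU, Linear to 256) can be sketched as a NumPy forward pass. The weight initialization and the use of batch statistics only (training-mode BatchNorm without running averages) are assumptions for illustration.

```python
import numpy as np

def init_mlp(in_dim, hidden_dim=4096, out_dim=256, rng=None):
    """Create parameters for Linear -> BatchNorm -> ReLU -> Linear (hypothetical init)."""
    rng = np.random.default_rng() if rng is None else rng
    return {
        "W1": rng.normal(0.0, 1.0 / np.sqrt(in_dim), (in_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "gamma": np.ones(hidden_dim),   # BatchNorm scale
        "beta": np.zeros(hidden_dim),   # BatchNorm shift
        "W2": rng.normal(0.0, 1.0 / np.sqrt(hidden_dim), (hidden_dim, out_dim)),
        "b2": np.zeros(out_dim),
    }

def mlp_forward(x, p, eps=1e-5):
    """Forward pass for a batch x of shape [B, in_dim]."""
    h = x @ p["W1"] + p["b1"]
    # BatchNorm in training mode: normalize each unit over the batch.
    h = (h - h.mean(axis=0)) / np.sqrt(h.var(axis=0) + eps)
    h = np.maximum(h * p["gamma"] + p["beta"], 0.0)   # scale/shift, then ReLU
    return h @ p["W2"] + p["b2"]
```

The projector would take the encoder representation as input, and the predictor, used only in the online network, would take the 256-d projection.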
2) Augmentation block parameters: We conducted an exhaustive parameter search using Optuna [67] to achieve better performance in the pre-training. As a result, we used the mixing ratio α of 0.
3) Dataset details: We pre-trained using the 1,963,807 samples (5,455 h) from the balanced train and unbalanced train segments of AudioSet [22] without labels.
In the ablation studies, we pre-trained using a development set of FSD50K [25] without labels, 40,966 samples (80 h) in total, with increased training epochs of 500.

C. Benchmarking BYOL-A and pre-trained models
To explore audio representations that generalize to a wide range of tasks, we evaluate BYOL-A compared with various audio representations extracted from publicly available pre-trained models implementing previous methods. These methods used different training frameworks, network architectures, and datasets, as described in Section II-B; moreover, they shared benchmark tasks only partially, making comparison difficult. With a unified benchmark, we can compare them and evaluate the generalizability of learned representations of the methods invented with diverse design choices.
1) Representations from previous methods: Table III lists eleven audio representations from the eight pre-trained models we use. We chose diverse state-of-the-art models that have evaluated different task performances.
We prefix the name of representations with labels:  We extracted a single embedding per variable-length audio input. For some models that output embeddings frame by frame for input, we applied temporal mean+max pooling except for Wav2Vec2-C so that we could make fair comparisons with BYOL-A. For Wav2Vec2-C, we temporally averaged their embeddings for making their best performance.

• [S] VGGish and [S] VGGish-4K are from VGGish [5] pre-trained on YouTube-8M [43].
However, in the ESC-50, US8K, and GTZAN, BYOL-A has a performance gap compared to the AudioSet-supervised learning models. We think that AudioSet class supervision can cover similar class labels in these tasks. For the SPCV2, VoxForge, and CREMA-D tasks, Wav2Vec2-C shows the best performance, suggesting that pre-training specialized for speech has advantages in spoken language tasks, while BYOL-A shows closer performance compared to other models.
Unsupervised learning models generally perform well in all tasks, suggesting that they effectively acquire general-purpose representations. While TRILL and Wav2Vec2, pre-trained only on speech data, do not perform well in tasks other than speech, OpenL3-M, pre-trained on music samples, and OpenL3-E, pre-trained on environmental sounds, showed stable and good performance in tasks beyond the training data domain. Fig. 6 shows the Pearson correlation coefficient across the results of representations in Table IV, suggesting several trends. First, the correlations are high among the supervised learning methods and among the unsupervised learning methods, indicating that the label supervision tends to affect the task performance trend. ESResNeXt and AST have particularly close correlation trends; although both are trained on ImageNet and AudioSet, they use different architectures (CNN and Transformer, respectively), suggesting that the performance trend can be more influenced by the learning method and dataset than by the architecture. COALA shows a performance trend similar to that of supervised learning, suggesting that cross-modal training of tags and audio is similar to supervision of labels.
Fig. 6. Pearson correlation coefficients across the results of representations in Table IV, excluding VGGish, OpenL3-E, and Wav2Vec2-C for brevity, providing performance trend similarity across representations. This highlights the different trends between supervised and unsupervised methods.
10 https://tfhub.dev/google/nonsemantic-speech-benchmark/trill/2
11 https://huggingface.co/facebook/wav2vec2-large-960h-lv60
3) Results on the FSD50K:  IV-A3, show that these representations excel in detecting Single-source events, outperforming others by a large margin, whereas the performance gap between representations is smaller for Sequential and Scene events. Considering that OpenL3-E/M and BYOL-A demonstrate better average results in Table IV, the results in Table V indicate that performing better on Single-source events can lead to performing better in general audio tasks.

D. Ablations of audio augmentation blocks
We assess the contribution of augmentation blocks by evaluating various combinations of these blocks in this section.
1) Experimental settings: We tried combinations of Mixup, RRC, RLF, and an extra block, Gaussian, which interpolates the training input with random data points sampled from a normal distribution. We added the Gaussian block for comparison with Mixup: it mixes the input with a random data point sampled from N(0, 0.4) using the log-mixup-exp calculation described in Section III-A1.
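To make the comparison concrete, the following minimal sketch contrasts the two mixing sources. It assumes that log-mixup-exp computes log((1 − λ) exp(x) + λ exp(x_k)) element-wise on log-scaled features, as in Section III-A1. The function names, the fixed mixing ratio λ, and the reading of 0.4 as a standard deviation are assumptions for illustration, not the authors' implementation.

```python
import math
import random

def log_mixup_exp(x, x_k, lam):
    """Mix two log-scale feature vectors in the linear domain, then go back to log scale."""
    return [math.log((1.0 - lam) * math.exp(a) + lam * math.exp(b))
            for a, b in zip(x, x_k)]

def mixup_block(x, other_sample, lam):
    """Mixup: mix the input with another sample from the dataset (background sound)."""
    return log_mixup_exp(x, other_sample, lam)

def gaussian_block(x, lam, rng, std=0.4):
    """Gaussian: mix the input with a random point drawn from a normal distribution.

    Assumption: 0.4 is treated here as the standard deviation.
    """
    noise = [rng.gauss(0.0, std) for _ in x]
    return log_mixup_exp(x, noise, lam)

rng = random.Random(0)
x = [0.5, -1.0, 2.0, 0.0]        # hypothetical log-mel values
other = [1.5, 0.0, -0.5, 0.3]    # hypothetical second dataset sample
print(mixup_block(x, other, lam=0.2))
print(gaussian_block(x, lam=0.2, rng=rng))
```

The only difference between the two blocks in this sketch is the source of the second operand, which is the factor that the comparison between Mixup and Gaussian isolates.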
2) Results and discussions: Table VI shows the contribution of each augmentation and that combining them is essential for achieving the best performance of the BYOL-A. We also compare results with a randomly initialized model which is not pre-trained.
The single-block results (f) to (h) show that RRC is the most impactful, indicating that a representation that ignores slight shifts/stretches along the frequency/time axes is most effective for the downstream tasks. In contrast, the (h) RLF result shows that pre-training with only a weak augmentation can impair the usefulness of a representation, even making the performance worse than that of a random model. The (d) RRC+RLF result improves on (f) RRC, showing that RLF can be useful when combined with other blocks. The final combination (a) improves on (e) RRC+Mixup, our previous work [38], by adding RLF.
Comparison between (a), (b), and (c) shows that Mixup, interpolating within-dataset samples, is more effective than Gaussian interpolating with random samples. The result of (a) Mixup+RRC+RLF is superior to that of (b) and (c), where (b) adds Gaussian on top of (a), and (c) replaces Mixup in (a) with Gaussian. In other words, mixing random noise cannot be as effective as mixing the sounds from the dataset for making background sound perturbations.

E. Ablations of encoder network architecture
We discuss architectural choices for the BYOL-A encoder by varying the number of convolutional blocks and by replacing the entire network with ResNet variants. Table VII shows the results.
1) Convolutional block ablations: We compare three BYOL-A results in Table VII, where we vary the number of convolutional blocks from one to three.
The BYOL-A Conv=1, a single Conv block with primitive output features with a rich resolution, results in the worst performance in most tasks. This result suggests that a single Conv block is insufficient to produce useful features.
The BYOL-A Conv=3, three Conv blocks with a lower resolution, degrades the VC1 and Surge results. For solving the VC1 (speaker identification) and Surge (pitch classification) tasks, frequency-wise information is considered important. Therefore, we think the degradation can be attributed to the lower frequency resolution: 8 (Conv=3) < 16 (BYOL-A, Conv=2).
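The frequency resolutions quoted above are consistent with halving the frequency axis once per convolutional block. A small sketch, assuming 64 input mel bins and one pooling of factor 2 per block (both are assumptions chosen to match the stated resolutions of 16 and 8; the helper name is hypothetical):

```python
def frequency_resolution(n_mels, n_conv_blocks, pool=2):
    """Frequency bins left after each block pools the frequency axis by `pool`."""
    res = n_mels
    for _ in range(n_conv_blocks):
        res //= pool
    return res

for n_blocks in (1, 2, 3):
    print(f"Conv={n_blocks}: frequency resolution {frequency_resolution(64, n_blocks)}")
```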
2) Replacing network with ResNets: We compare the BYOL-A encoder CNN with two modified ResNet variants, which we describe in detail in Appendix A. We used ResNet-18 and -50 as base ResNets. Table VII shows that BYOL-A is on par with the ResNet-50 variant and that it outperforms the ResNet-18 variant. Similar to BYOL-A Conv=3, the ResNet variants show low results on VC1 and Surge while showing on par or better results on other tasks. We think that the performance drop on VC1 and Surge is, as in BYOL-A Conv=3, due to lower frequency resolution; the variants have a resolution of 8, half of BYOL-A's. Making frequency strides smaller can increase the frequency resolution; however, it also increases the feature dimension. Doubling the frequency resolution of the ResNet-50 variant increases the feature dimension from 16,384-d to 32,768-d, making it closer to prohibitive for applications.
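The dimension figures are consistent with flattening the channel and frequency axes for each time frame. A sketch of that arithmetic, assuming the ResNet-50 variant ends with 2,048 channels (the standard ResNet-50 width); the helper name is hypothetical:

```python
def flattened_feature_dim(channels, freq_resolution):
    """Per-frame feature size when the channel and frequency axes are flattened together."""
    return channels * freq_resolution

resnet50_channels = 2048  # assumption: standard ResNet-50 final-stage width

print(flattened_feature_dim(resnet50_channels, 8))    # 16384
print(flattened_feature_dim(resnet50_channels, 16))   # 32768
```

Under this assumption, each doubling of the frequency resolution doubles the embedding size, which is the trade-off the paragraph above describes.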
In summary, considering the trade-offs among the resolution, feature dimension, and model parameter size listed in the table, we choose BYOL-A with two convolutional blocks as the default encoder architecture, which offers a balanced solution and is also an improvement over our previous study [38]. Further modifications to ResNet-50 or more sophisticated architectures such as Transformer variants could be good options. We leave them for future studies.

F. Ablations of encoder global pooling blocks
We conducted an ablation study of the global pooling blocks in the BYOL-A encoder, namely the Reshaping, MLP, Concat, and Pooling blocks. Tables VIII and IX show the configurations and corresponding results, respectively. We also conducted MLP size ablations, found in Table X.
The Reshaping ablation results, (2) and (3) in Table IX, indicate that averaging the frequency or channel axis deteriorates performance on downstream tasks. In these results, we average the frequency or channel axis along time frames. The performance drop of (2) shows that averaging the frequency impairs results, especially on the VC1 and Surge tasks, where frequency-wise information is considered vital. The result of (3) shows that averaging channels significantly degrades overall task performance, suggesting the importance of channel information to downstream tasks. These averaging operations can be found in popular network architectures, even in audio models [12], [14], [19], [21]; however, these results indicate a potential negative performance impact in the previous studies.
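The difference between the Reshaping variants can be sketched on a toy feature map. The example below uses nested lists indexed as [channel][frequency][time]; the function names and the tiny sizes are illustrative assumptions and do not reproduce the authors' code.

```python
def reshape_keep_all(x):
    """[C][F][T] -> [T][C*F]: flatten channel and frequency for each time frame."""
    C, F, T = len(x), len(x[0]), len(x[0][0])
    return [[x[c][f][t] for c in range(C) for f in range(F)] for t in range(T)]

def average_frequency(x):
    """[C][F][T] -> [T][C]: frequency-wise information is averaged away."""
    C, F, T = len(x), len(x[0]), len(x[0][0])
    return [[sum(x[c][f][t] for f in range(F)) / F for c in range(C)] for t in range(T)]

def average_channel(x):
    """[C][F][T] -> [T][F]: channel-wise information is averaged away."""
    C, F, T = len(x), len(x[0]), len(x[0][0])
    return [[sum(x[c][f][t] for c in range(C)) / C for f in range(F)] for t in range(T)]

# Toy feature map with C=2 channels, F=3 frequency bins, T=4 time frames.
feature_map = [[[float(c * 100 + f * 10 + t) for t in range(4)] for f in range(3)]
               for c in range(2)]

for fn in (reshape_keep_all, average_frequency, average_channel):
    out = fn(feature_map)
    print(fn.__name__, "->", len(out), "frames x", len(out[0]), "features")
```

Only the first variant keeps every channel and frequency value for each frame; the other two shrink the per-frame feature by discarding one axis.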
The results (4) and (5) compare the performance difference between local and global features, namely, the MLP input and output features. The results show that the global feature (4) increases accuracy on all tasks except Surge, showing that the MLP learns useful features from the flattened frequency and channel information for each time frame while using most of the network capacity. The result (1), concatenating both features, increases performance over (4), global feature only, in most tasks, indicating that many tasks benefit from both matured global and primitive local features. Our previous version [38] was (4), global feature only, which we improved to (1) with a performance difference of 2.0.
The temporal pooling ablation results of (6) mean and (7) max pooling show that max pooling performs better than mean pooling on five tasks with a margin of 3.0 to 8.7%, indicating that max pooling can be more advantageous in general. While (1), the combination of both statistics, slightly degrades performance on some tasks, it improves the average, showing that tasks benefit from the combination of these statistics.
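A minimal sketch of the Concat and Pooling steps discussed above, operating on per-frame feature vectors. It assumes that the mean and max statistics are combined by element-wise summation so that the feature dimension is unchanged; that choice and all function names are assumptions for illustration.

```python
def concat_local_global(local_frames, global_frames):
    """Concatenate per-frame local (MLP input) and global (MLP output) features."""
    return [l + g for l, g in zip(local_frames, global_frames)]

def temporal_mean(frames):
    """[T][D] -> [D]: average each feature over time."""
    return [sum(col) / len(frames) for col in zip(*frames)]

def temporal_max(frames):
    """[T][D] -> [D]: take the maximum of each feature over time."""
    return [max(col) for col in zip(*frames)]

def temporal_mean_plus_max(frames):
    """Combine both statistics by element-wise sum (assumption), keeping dimension D."""
    return [m + x for m, x in zip(temporal_mean(frames), temporal_max(frames))]

# Toy input: T=2 frames, 2-d local features and 1-d global features.
local_frames = [[1.0, 0.0], [3.0, 2.0]]
global_frames = [[0.5], [1.5]]

frames = concat_local_global(local_frames, global_frames)
print(frames)
print(temporal_mean_plus_max(frames))
```

With this combination rule, the embedding carries both the average level and the peak activation of each feature without increasing its size.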
We also conducted an ablation study of MLP size, which is the output dimension of the FC layer in the MLP block. Table X shows that performance saturates at a size of 2,048, which we set as default in our encoder MLP.

G. Ablations of BYOL framework
To understand the contribution of the BYOL framework, we conducted ablation studies of its hyperparameters, namely, the target decay rate τ, which controls how close the target network weights become to the online network, and the batch size, to which BYOL is reported to be robust. We additionally experimented with removing the prediction q θ (z θ ).
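The role of τ can be sketched with the exponential moving average update used by BYOL, in which the target weights are replaced by τ times the target weights plus (1 − τ) times the online weights. The snippet below applies that rule to plain lists of numbers; the function name and the toy weights are hypothetical.

```python
def update_target(target, online, tau):
    """Exponential moving average: target <- tau * target + (1 - tau) * online."""
    return [tau * t + (1.0 - tau) * o for t, o in zip(target, online)]

target_weights = [0.0, 1.0]   # toy target network weights
online_weights = [1.0, 3.0]   # toy online network weights

for tau in (0.0, 0.5, 0.99, 1.0):
    print(tau, update_target(target_weights, online_weights, tau))
```

At τ = 1.0 the target never moves from its initial weights, and at τ = 0.0 it is overwritten by the online weights at every step; intermediate values make the target follow the online network gradually.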
The target decay rate τ results in Table XI show that BYOL-A learns much more robustly with respect to the target network than the original BYOL. First, we can set the randomly initialized BYOL-A result as a lower bound and the result of pre-training with the default (τ = 0.99) as a near upper bound; the other results should fall in between. As τ diverges from the default of 0.99 (e.g., 0.5, 0.9, and 0.999), the performance degrades, as it does in the original BYOL. However, the degree of degradation is much smaller than in BYOL. We think this is because the lower bound of 61.7% is very high compared to the 1.4% of the original image BYOL, reducing the room for degradation.
The τ = 0.0 and 1.0 results indicate that BYOL-A can learn representations without a moving average target and, even further, with a randomly initialized target. The result of τ = 1.0, which fixes the random target weights, is better than the lower bound. The result of τ = 0.0, which makes the target weights update instantaneously with the online weights, shows a minor degradation of −0.8. These results show that BYOL-A does not heavily rely on the bootstrapping behavior of the BYOL framework to learn representations.
We further examined the necessity of the BYOL network components and confirmed that removing the prediction q θ (z θ ) broke BYOL-A learning, as shown in Table XII. This result indicates that the standard network configuration of BYOL is essential for making BYOL-A viable, though it does not heavily depend on the bootstrapping of the target.
Batch size ablation results in Table XIII show that the performance does not degrade even with a small batch size of 64. The results other than the default of 256 degraded slightly, but we think this can be attributed to the mismatch of the learning rate; we tested with a fixed learning rate that we optimized for the default batch size. The results also empirically show that BYOL-A is more robust to a small batch size than the original BYOL, which showed degradation with a batch size of less than 256, and more robust than COLA, a contrastive learning method that showed degradation with a batch size of less than 1024. We think this could also indicate that BYOL-A is less dependent on the inductive bias of the BYOL framework.

H. Summary of experiments
We demonstrated the generalizability of the BYOL-A compared across major pre-trained models on a benchmark consisting of various tasks in Section IV-C. In addition, intensive ablation studies provided evidence of contributions from different aspects. To gain a holistic understanding of what makes the BYOL-A representation learning happen, we summarize the contribution of the components in Table XIV.
Ablation results clarified that the BYOL-A encoder architecture contributes the most to performance, achieving 61.7% with only random initialization. Pre-training improves performance to 70.3%, showing that the BYOL framework and BYOL-A audio data augmentations contribute the last +8.6.
We think the convolutional blocks are the most significant performance factor for the encoder, followed by feature reshaping along time, MLP, combining local & global features, and temporal mean+max pooling. We estimate the performance contribution of the convolutional blocks can be up to about +46, which is the average accuracy of the whole encoder minus the other factors, i.e., 61.7 − 7.1 − 5.6 − 2.0 − 0.4 ≈ 46.
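The arithmetic behind these estimates can be reproduced directly from the figures quoted in this section. In the sketch below the variable names are hypothetical, and the four gains are kept as an unlabeled list in the order they appear in the text rather than paired with specific components.

```python
random_init_average = 61.7   # average accuracy (%) with a randomly initialized encoder
pretrained_average = 70.3    # average accuracy (%) after pre-training

# Gains attributed to the encoder factors other than the convolutional blocks.
other_factor_gains = [7.1, 5.6, 2.0, 0.4]

conv_block_estimate = random_init_average - sum(other_factor_gains)
pretraining_gain = pretrained_average - random_init_average

print(round(conv_block_estimate, 1))
print(round(pretraining_gain, 1))
```

The subtraction evaluates to 46.6, which the text reports as about +46, and the pre-training gain matches the +8.6 stated above.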
The ablation results related to the framework and augmentations show that both the BYOL framework and the augmentations are crucial to making the performance improvement viable. A simpler framework, such as one without the prediction q θ (z θ ), fails to train. Similarly, using a noticeably weak augmentation, such as RLF only, also fails to train. The BYOL framework and the BYOL-A audio data augmentations work together to achieve the performance of BYOL-A.
In summary, the inductive bias of the BYOL-A encoder network architecture primarily contributes to the performance of BYOL-A, and the pre-training under the BYOL framework with BYOL-A audio data augmentations completes the final critical portion of the performance. As a whole, BYOL-A achieves the best average result among various pre-trained models, demonstrating its generalizability as a general-purpose audio representation.

V. CONCLUSION
In this study, we explored pre-trained audio representations for general audio tasks. We hypothesized that representations effective for general audio tasks should provide multiple aspects of robust features of the input sound. Robust features can help sound applications (e.g., recognition) under perturbations such as varying pitch or timbre. Representations providing multiple aspects of information calculated using these features can serve various purposes. As a result, these representations should meet the diverse needs of tasks.
We proposed a self-supervised learning method called Bootstrap Your Own Latent (BYOL) for Audio (BYOL-A, pronounced "viola") to pre-train audio representations invariant to the slight perturbations of background sound, frequency/time shift/stretch, and temporal amplitude change. To make representations that provide multiple aspects of features, we made the BYOL-A encoder combine statistics of local and global features while preserving frequency- and channel-wise information.
We evaluated the general-purpose task performance among various previous state-of-the-art methods on a benchmark composed of ten SER, NOSS, and music tasks. The BYOL-A demonstrated the generalizability of its representation with the best average performance of 72.4% and the best VoxCeleb1 performance of 57.6%.
Extensive ablation studies clarified the contributions of BYOL-A components. We found that a large portion of the performance comes from the inductive bias of the BYOL-A encoder network architecture, and that the final critical portion resorts to the BYOL framework and BYOL-A audio data augmentations. As a whole, BYOL-A learns to produce effective representations that generalize to various tasks.
We make our code available online and hope it fosters progress in future studies of audio representations.

APPENDIX A
MAKING AN IMAGE-CNN-BASED MODEL PERFORM ON GENERAL AUDIO TASK BENCHMARK
To further elaborate on our encoder design described in Section III-A4, we discuss what makes a CNN model perform well on various tasks in our benchmark using an image-CNN-based architecture as an example. We use a ResNet-18 [3] from image-CNNs, change its input channels from three to one, and remove the FC layer. The modified version, named ResNet-like, accepts batch input with a shape [(B)atch, 1, (F)requency, (T)ime frame] and outputs 512-d embeddings with a shape [B, 512].
We made two improved versions based on the ResNet-like. One is 'ResNet-like (ReGP)', in which the global pooling of the ResNet-like is replaced (Replacement of Global Pooling; ReGP). The other is 'ResNet-like (ReGP + Narrow RF)', which is a ResNet-like (ReGP) modified so that the receptive field (RF) becomes narrower.
ResNet-like has a global average pooling that averages frequency and time axes and outputs 512-d embeddings.
ResNet-like (ReGP) replaces the global average pooling with the Reshaping and Pooling blocks from the BYOL-A encoder, producing 1,024-d embeddings.
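The two embedding sizes can be related through the shape of the final feature map. The sketch below assumes 64 input mel bins and the standard ResNet-18 layout (512 final channels, total downsampling factor of 32 on each axis); the input size and the function name are assumptions for illustration.

```python
def resnet_like_embedding_dims(n_mels, channels=512, total_stride=32):
    """Embedding sizes for global average pooling versus ReGP-style reshaping."""
    freq_bins = n_mels // total_stride           # frequency bins left in the final feature map
    return {
        "global_average_pooling": channels,      # frequency and time are averaged away
        "regp": channels * freq_bins,            # channel x frequency kept, pooled over time only
    }

print(resnet_like_embedding_dims(64))
```

Under these assumptions, two frequency bins survive in the final feature map, so keeping them doubles the embedding from 512-d to 1,024-d, which agrees with the sizes stated above.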
We pre-trained these models in BYOL-A by replacing the encoder with them, using the same setting as in the BYOL-A ablation studies described in Section IV-B. Table XV shows the results of the ResNet-like variants. The base model's performance, ResNet-like, is 61.4%, and it improves with ResNet-like (ReGP) to 63.3%. The performance of ResNet-like (ReGP + Narrow RF) improves to 69.0%, which is comparable to BYOL-A's, 70.3%.
These results show that global pooling in the image-CNN-based architecture needs to be improved, as discussed in Section II-D. Moreover, adjusting the frequency resolution is also crucial for good performance in general audio tasks, as reported in the previous study [61].

APPENDIX B
This appendix describes the assignment of FSD50K classes to the three sound event characteristic subsets defined in Section IV-A3. We conducted the following steps to examine all FSD50K classes and determine the assignment. First, we randomly selected 50 samples from the target class. Then, we manually inspected each sample by listening to it to determine which subset the sample belongs to. After inspecting all 50 samples, we assigned the target class to a subset only if 80% (40 samples) or more fell into that subset. We excluded classes that fell into multiple subsets (e.g., the Liquid class falls into both single-source and sequential events) and classes with vague characteristics (e.g., Mechanisms, Wood, etc.) from the assignment. We repeated these steps and finally assigned 93 of the 200 classes to one of the subsets. Table XVI lists the FSD50K classes assigned to the sound event characteristic subsets.
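The 80% rule in the steps above can be sketched as follows. The listening-based judgment is represented by a list of per-sample subset labels, which is hypothetical input; the function name is also an assumption, and the manual exclusion of classes with vague characteristics is not modeled.

```python
from collections import Counter

def assign_class(sample_labels, threshold_percent=80):
    """Return the subset name if enough inspected samples agree, otherwise None."""
    if not sample_labels:
        return None
    subset, count = Counter(sample_labels).most_common(1)[0]
    # Integer comparison avoids floating-point issues at the exact threshold.
    if count * 100 >= threshold_percent * len(sample_labels):
        return subset
    return None

# Hypothetical inspection results for 50 samples of two classes.
clear_class = ["single-source"] * 42 + ["sequential"] * 8
mixed_class = ["single-source"] * 30 + ["sequential"] * 20

print(assign_class(clear_class))
print(assign_class(mixed_class))
```

A class such as the second one, whose samples split across subsets, is left unassigned, which corresponds to the exclusion of classes that fall into multiple subsets.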