Pruning vs XNOR-Net: A Comprehensive Study of Deep Learning for Audio Classification on Edge-devices

Deep learning has celebrated resounding successes in many application areas of relevance to the Internet of Things (IoT), such as computer vision and machine listening. To fully harness the power of deep learning for the IoT, these technologies must ultimately be brought directly to the edge. The obvious challenge is that deep learning techniques can only be implemented on strictly resource-constrained edge devices if the models are radically downsized. This task relies on different model compression techniques, such as network pruning, quantization, and the recent advancement of XNOR-Net. This study examines the suitability of these techniques for audio classification on microcontrollers. We present an application of XNOR-Net for end-to-end raw audio classification and a comprehensive empirical study comparing this approach with pruning-and-quantization methods. We show that raw audio classification with XNOR yields performance comparable to regular full-precision networks for small numbers of classes while reducing memory requirements 32-fold and computation requirements 58-fold. However, as the number of classes increases significantly, performance degrades, and pruning-and-quantization-based compression techniques take over as the preferred choice: they satisfy the same space constraints, albeit at the cost of approximately 8x more computation. We show that these insights are consistent between raw audio classification and image classification using standard benchmark datasets. To the best of our knowledge, this is the first study to apply XNOR to end-to-end audio classification and to evaluate it in the context of alternative techniques. All code is publicly available on GitHub.


I. INTRODUCTION
Audio classification is a fundamental building block of many smart Internet of Things (IoT) applications such as predictive maintenance [1]-[3], surveillance [4], [5], and ecosystem monitoring [6], [7]. Smart sensors driven by microcontroller units (MCUs) are at the core of these applications. MCUs, installed at the edge of the network, sense data and send them to the cloud for classification and detection. However, the energy requirements for transmitting high volumes of data are a burden for battery-powered MCUs. Furthermore, data transmission increases latency and may lead to privacy concerns. The increased latency makes real-time or near-real-time analytics infeasible. One way to address these challenges is to move the analysis and recognition directly to the edge. In practice, this means that all the processing must take place on resource-impoverished MCUs.
MCUs are low-powered, resource-constrained devices typically based on system-on-a-chip (SoC) hardware with less than a megabyte (1 MB) of RAM and clock speeds below 200 MHz. All the recent state-of-the-art audio classification models, however, are based on deep learning (DL) [8]-[16], requiring very resource-intensive computation. Usually, the memory size of such models varies from several MBs to even gigabytes (GB). Running such models on MCUs requires extreme minimization of the model's size and computation requirements with minimal loss of accuracy. Recent studies have applied model compression techniques such as pruning connections and neurons from fully connected neural networks (FCNNs) [17], filter or channel pruning from convolutional neural networks (CNNs) [18]-[20], knowledge distillation [21], and low-precision quantization [17], [22]. The most recent advancement in extreme downsizing of DL models and their computation requirements is XNOR-Net [23], where a model's activations and inputs are fully binarized.
There are many state-of-the-art XNOR-Net models for different computer vision tasks, such as Rastegari et al. [23] for MNIST, Cong [24] for CIFAR-10, and Bulat et al. [25] for CIFAR-100 and ImageNet datasets. However, for audio classification, the only work we are aware of is presented by Cerutti et al. [26]. This study uses XNOR-Net in combination with spectrogram-based input to effectively perform audio classification via image classification. As the literature on full-precision audio networks shows, this approach does not usually deliver the best results for difficult classification problems with conventional CNN architectures [27], [28] (see Section II-B for further details). To the best of our knowledge, the current literature has not yet considered XNOR-Net for raw audio classification.
Most of the work on XNOR has been performed for computer vision. The leaderboards of benchmark image datasets show considerable differences in the classification accuracy of state-of-the-art full-precision nets and XNOR-Nets (see Table 9 for the CIFAR-100 and ImageNet datasets). Furthermore, models produced by XNOR-Net generally yield up to a 32x reduction in memory requirements and up to a 58x reduction in computation cost compared to their full-precision counterparts [23]. This may not be a sufficient reduction for MCUs. The memory requirements of XNOR-Net-based DL models producing comparable accuracy [23]-[25] typically reach several MBs, while typical MCUs offer only 128 KB to 1 MB of memory.
In this paper, we seek to compare XNOR model minimization with pruning-and-quantization approaches and to assess the potential of XNOR in the context of audio classification. We present XNOR-Nets for raw audio classification, followed by a comprehensive study that compares this approach with traditional model compression techniques (pruning-and-quantization). Our extensive experimental study reveals that XNOR-Net may be preferred for scenarios comprising a small number of classes (e.g., 10 classes) along with extremely tight resource constraints, specifically where computation speed is concerned. In contrast, for complex scenarios with a larger number of classes, pruning-and-quantization-based compression techniques remain the preferred choice because they produce models that are sufficiently small for off-the-shelf MCUs while achieving higher accuracy.
Experiments show that, when two models are generated by pruning-and-quantization and XNOR-Net for the same memory constraint, XNOR-Net requires ∼ 8x less computation while both produce comparable classification accuracy for small problem sizes (number of classes). While XNOR-Nets require extremely little computation compared to their pruning-and-quantization based counterparts, their classification performance on datasets degrades rapidly with an increasing number of classes. The performance loss of pruning-and-quantization based models for the same audio benchmarks is much more graceful, so that this approach still appears to be preferable for problems with larger class numbers.
To harden the above findings, we have conducted a similar study on image classification datasets and analyzed this in the context of current state-of-the-art full precision and XNOR-Net leaderboards for various image benchmarks. The behavior is found to be consistent in both the audio and the image domains.
Thus, the contribution of this paper is twofold: 1) it presents the first application of XNOR-Net for raw audio classification as a benchmark for future research, and 2) it presents the first comprehensive empirical comparison of pruning, quantization, and XNOR-Net-based model compression techniques and derives guidelines for when to use pruning, quantization, and XNOR-Net, respectively.

II. BACKGROUND
A. AUDIO REPRESENTATIONS
Audio can be represented in the time domain as a waveform of amplitude changes over time (Figure 1). If this time series is directly used as an input to the network, we speak of raw audio classification. An alternative is to use spectrograms. For this, the audio is first transformed into the frequency domain using Fourier transforms of short overlapping windows and presented with respect to time, frequency, and amplitude (Figure 2). The spectrogram captures the intensity of the different frequency components of the signal against time (see Figure 3).
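As an illustration, the short-time Fourier transform underlying a spectrogram can be sketched in a few lines of numpy (a minimal, hypothetical example; real pipelines would typically rely on a library routine such as scipy.signal.stft):

```python
import numpy as np

def spectrogram(signal, win_len=1024, hop=512):
    """Magnitude spectrogram via a short-time Fourier transform.

    Slices the signal into overlapping Hann-windowed frames,
    applies an FFT to each frame, and stacks the magnitudes so
    that rows index frequency bins and columns index time frames.
    """
    window = np.hanning(win_len)
    n_frames = 1 + (len(signal) - win_len) // hop
    frames = np.stack([
        signal[i * hop: i * hop + win_len] * window
        for i in range(n_frames)
    ])
    # rfft keeps only the non-negative frequency bins
    return np.abs(np.fft.rfft(frames, axis=1)).T

# one second of a 440 Hz tone sampled at 16 kHz (hypothetical input)
sr = 16000
t = np.arange(sr) / sr
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (freq bins, time frames)
```

With these window settings, the energy concentrates around bin 440 / (16000/1024) ≈ 28, i.e., the 440 Hz component appears as a horizontal line in the spectrogram.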

B. AUDIO FEATURE PROCESSING
Input representation is a fundamental decision when applying deep learning to any problem. Owing to the tremendous success of deep learning in the image processing domain, researchers have used spectrograms directly as the input representations [28], [30]- [32]. Somewhat surprisingly, the results achieved so far have not matched the performance that might have been expected based on the state-of-the-art in visual image processing [27], [28].
Although spectrograms are visual structures, their properties differ from those of natural images. A pixel of a certain color or similar neighboring pixels of an image may often belong to the same visual object. On the other hand, although frequencies move together according to a common relationship of the sound, a particular frequency or several frequencies do not belong to a single sound [27], [28]. Furthermore, in images, both axes carry spatial information, but the axes of spectrograms have different meanings. Sounds are not static two-dimensional objects like images; they genuinely are time series. Thus, the invariances in natural images and spectrograms are fundamentally different. For example, moving a face in any direction does not change the fact that it is the same face. However, moving frequencies upwards may not only change an adult voice to a child's voice or something completely different; it may also change the spatial information of the sound [28].
Hence, this research considers audio classification using raw audio time series, an approach that has proven to be successful with traditional full-precision networks [10].

C. AUDIO CLASSIFICATION AT THE EDGE
The recent state-of-the-art performances for audio classification are all produced by resource-intensive DL models [10], [13]-[16]. These models must be compressed drastically to run on MCUs.
1) Pruning-and-Quantization
Unstructured pruning produces sparse weight matrices that require sparse computation to fully realize their benefits; this is not yet supported by MCUs [33]-[37].
In structured pruning, neurons and channels are pruned to find a tiny version of the baseline network. This is an iterative process in which a single iteration performs a global ranking of filters/neurons (using methods such as the L2 norm, the Taylor criterion [19], and binary index-based ranking [38]), followed by the removal of the lowest-ranked neuron/channel from the network and retraining of the network for one or two epochs to recover the loss of accuracy due to the pruning [19]. This is repeated until the target number of filters/neurons is removed from the network to find a model of the desired size. This process is illustrated in Figure 4. There has been a significant amount of research on structured channel pruning, such as [10], [19], [20], [33], [34], [36], [37]. However, all of this except [10] is proposed for computer vision. The hybrid channel pruning technique proposed in [10] is the only study to date for audio analysis. This method incorporates the Taylor expansion criterion (TE) [19] in the ranking process. At the beginning of an iteration, a forward pass with the entire training dataset takes place, and the gradient is applied to the activations to determine the least affected channels for ranking. The change in gradient can be expressed as:

∆C = δC / δZ_l^i,    (1)

where Z_l^i is the i-th feature map of layer l and ∆C is the change in loss. Now, the gradient is applied to the activation as:

Θ_TE(Z_l^i) = |∆C · Z_l^i|,    (2)

where Θ_TE(Z_l^i) denotes the change determined by TE. For the ranking of the channels, all the feature maps are normalized layer-wise. Thus, the normalization for a layer l of a network can be expressed as:

Ẑ_l = Θ_TE(Z_l) / √(Σ_c Θ_TE(Z_l^c)²),    (3)

where Z_l is the list of activations for all the channels c of a layer l in a CNN.
Now, the ranking of the channels across all layers is performed, and the index of the lowest-ranked channel is determined as follows:

i_lc = κ(Ẑ),    (4)

where Ẑ are the normalized activations of all channels of all layers, and κ is the function that takes Ẑ and returns the information of the channel with the lowest magnitude, i_lc, where i is the index of channel c of layer l.
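The ranking step above can be sketched as follows (a minimal numpy illustration with hypothetical per-layer activation and gradient arrays, not the actual ACDNet implementation):

```python
import numpy as np

def taylor_rank(activations, gradients):
    """Rank channels by the Taylor-expansion criterion.

    activations / gradients: dicts mapping layer name -> array of
    shape (channels, h, w). The per-channel score |dC/dZ * Z| is
    averaged spatially, L2-normalized within each layer, and the
    (layer, channel) pair with the globally lowest score is
    returned as the pruning candidate.
    """
    scores = {}
    for layer, z in activations.items():
        g = gradients[layer]
        theta = np.abs((g * z).mean(axis=(1, 2)))   # Theta_TE per channel
        norm = np.sqrt((theta ** 2).sum()) or 1.0    # layer-wise L2 norm
        scores[layer] = theta / norm
    # lowest-ranked channel across all layers
    return min(((l, int(np.argmin(s))) for l, s in scores.items()),
               key=lambda lc: scores[lc[0]][lc[1]])
```

In an actual pruning loop, the returned channel would be removed and the network fine-tuned before the next iteration.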
Once this iterative process produces the final compressed and fine-tuned model, it is further compressed using low-precision quantization. Quantization is an independent process, and this study uses 8-bit quantization to achieve a further 4x compression.
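A minimal sketch of such post-training 8-bit quantization (an illustrative affine scheme with a per-tensor scale and zero point; deployment frameworks use more refined variants):

```python
import numpy as np

def quantize_8bit(w):
    """Affine 8-bit post-training quantization of a weight tensor.

    Maps float32 weights onto int8 with a per-tensor scale and
    zero point; storing int8 instead of float32 is the source of
    the 4x memory reduction mentioned above.
    """
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / 255.0 or 1.0
    zero_point = np.round(-lo / scale) - 128
    q = np.clip(np.round(w / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover an approximation of the original float weights."""
    return (q.astype(np.float32) - zero_point) * scale
```

The round-trip error of this scheme is bounded by half the quantization step, i.e., scale/2 per weight.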
Knowledge distillation is a very different approach from pruning, in which the knowledge of a large teacher network is gradually transferred to a smaller student network. However, [10] showed that structured pruning produces superior performance to knowledge distillation in audio classification tasks.
According to the current literature, the best pruning-based approach to derive a tiny model from a state-of-the-art deep neural network (DNN) model is structured pruning. Furthermore, the only work that successfully deploys a DNN model for audio classification to MCUs uses this technique followed by 8-bit post-training quantization to compress the state-of-the-art model [10]. In that study, the compressed model still produces classification accuracy close to that of the state of the art. Thus, for DNN models using pruning-based techniques, this study focuses on structured pruning and quantization. For simplicity, we refer to this as "pruning-and-quantization".

2) XNOR-Net
In an XNOR-Net [23], all layers except the first and the last are binary. The input, activations, and the weights of the binary layers are represented using either +1 or -1 and are stored efficiently with single bits. Figure 5 shows the construction of a typical convolution layer and an equivalent binary convolution layer.

FIGURE 5. A typical convolution layer (Convolution, BatchNorm, Activation) and the equivalent binary convolution layer (BatchNorm, BinActivation, BinConvolution).

The convolutions between the matrices (input/activations and weights) are implemented using exclusive-NOR (XNOR) and bit-counting operations. For this, the convolution between two vectors ∈ R^n is approximated by the dot product between two vectors ∈ {−1, +1}^n [23]. Thus, the convolution between the input X and weight W can be written as:

X ∗ W ≈ (sign(X) ∗ sign(W)) ⊙ αβ,    (5)

where α and β denote the scaling factors for all the sub-tensors in input X and weight W, respectively, and ⊙ denotes element-wise multiplication. Due to the binary activations, the dot product between sign(X) and sign(W) can be replaced by XNOR and pop-count operations, which require extremely little computation. Hence, Equation 5 is reduced to:

X ∗ W ≈ (X̃ ⊛ W̃) ⊙ αβ,    (6)

where X̃ = sign(X), W̃ = sign(W) with all zeros replaced by −1, and ⊛ denotes the XNOR and pop-count operations between X̃ and W̃. This process is extremely efficient in terms of memory and energy usage. According to Rastegari et al. [23], XNOR-Net requires 32x less memory and 58x less computation. Figure 6 provides an example of how the dot product between sign(X) and sign(W) is replaced by XNOR and pop-count operations between X̃ and W̃.
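The replacement of the dot product by XNOR and pop-count can be illustrated with bit-packed vectors (a numpy sketch for clarity; on real hardware this maps to native XNOR and popcount instructions):

```python
import numpy as np

def binary_dot(x, w):
    """Dot product of two {-1, +1} vectors via XNOR and popcount.

    Both vectors are bit-packed (+1 -> bit 1, -1 -> bit 0). XNOR
    yields a 1 at every matching position; if m positions match
    out of n, the dot product is m - (n - m) = 2m - n, so no
    multiplications are needed.
    """
    n = len(x)
    xb = np.packbits(x > 0)
    wb = np.packbits(w > 0)
    xnor = np.bitwise_not(np.bitwise_xor(xb, wb))
    # popcount over the packed bits, excluding the padding of the last byte
    matches = int(np.unpackbits(xnor, count=n).sum())
    return 2 * matches - n
```

For any pair of ±1 vectors, `binary_dot` returns exactly the same value as the ordinary floating-point dot product, while operating on 1 bit per element.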

FIGURE 6. XNOR and POPCOUNT in XNOR-Net
Like structured-pruning-based compression work, almost all work based on XNOR-Net targets computer vision. The current state-of-the-art XNOR models for the benchmark image datasets are: [23] for MNIST, [24] for CIFAR-10 and CIFAR-100, and [25] for ImageNet. In contrast, we have found [26] to be the only XNOR-Net-based work for audio classification. However, that work uses spectrograms as input to the model, which makes it similar to image classification using XNOR-Net. According to the discussion presented in Section II-B, using raw audio as input is the appropriate method for audio classification on MCUs. To the best of our knowledge, no such XNOR-Net-based work is present in the current literature.

III. EXPERIMENTAL DETAILS
All experiments are conducted using Python version 3.7.4 and the GPU version of PyTorch 1.8.1. All experimental code is available at: https://github.com/mohaimenz/pruning-vs-xnor.

A. DATASETS
The experiments are conducted on three widely used audio benchmark datasets: the Environmental Sound Classification datasets ESC-50 and ESC-10 [39], and UrbanSound8k [40]. For the image classification experiments, we use the CIFAR-10 and CIFAR-100 benchmark image datasets.
ESC-50 contains 2000 samples that are equally distributed over 50 disjoint classes. The audio samples are 5 s long, recorded at 16 kHz and 44.1 kHz. Furthermore, the dataset is partitioned into five folds for cross-validation to help researchers obtain directly comparable results. ESC-10 is a subset of ESC-50 with 400 audio samples distributed equally over 10 classes. From the ESC datasets, we derive the subsets ESC-10, ESC-20, ..., ESC-50 using Equation 8. The CIFAR-10 dataset contains 60,000 image samples equally distributed over 10 classes: 50,000 of the samples are used for training and 10,000 for testing. CIFAR-100 has 100 classes, and its 60,000 samples are equally distributed across the classes; for each class, there are 500 training samples and 100 test samples. We create the subsets of CIFAR-100 using Equation 8, defined as:

S_x = {c_1, c_2, ..., c_x},    (8)

where c_j denotes the j-th class of the dataset and x ∈ {10, 20, ..., 100}. This gives us the subsets CIFAR-10, CIFAR-20, ..., CIFAR-100.

B. DATA AUGMENTATION
We use data augmentation as described in [15] and [10] for the audio datasets. For the image datasets, we use random cropping, horizontal flipping, and rotation available in the Transforms module of the PyTorch TorchVision library. All implementations of the data augmentation procedures are available in our GitHub repository.

C. MODELS AND HYPERPARAMETERS
The model configuration is available in the GitHub repository, and the hyperparameters for all the experiments conducted on the audio and image datasets are listed in Table 1.

IV. ANALYSIS: PRUNING VS XNOR-NET
In this section, we compare pruning-and-quantization techniques with XNOR-Net for raw audio classification. Our analysis through extensive experiments on different standard benchmark datasets for audio classification shows that XNOR is more effective for relatively simple problems and very tight constraints on the computational resources; however, the higher classification accuracy achieved by pruning-and-quantization outweighs the benefit of XNOR-Net for problems with complex data characteristics, a large number of classes, etc. As our starting point for pruning-and-quantization techniques, we use two recent state-of-the-art DL networks for raw audio classification, ACDNet [10] and AclNet [41], as the baseline networks and build their derivatives using pruning-and-quantization as well as the XNOR-Net technique. We test these models on three standard benchmark sound classification datasets (ESC-10 [39], ESC-50 [39], and UrbanSound8k [40]) and on subsets of them for an in-depth analysis of the effect of increasing the number of classes. We create the subset S_n of a dataset by keeping its first n classes: S_n = {c_1, c_2, ..., c_n}. We measure classification accuracy through 5-fold cross-validation. The experimental details are provided in Section III.
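The subset construction S_n described above can be sketched as follows (an illustrative helper; `make_subset` is a hypothetical name, and labels are assumed to be integer class indices):

```python
def make_subset(samples, labels, n_classes):
    """Build the subset S_n by keeping only the samples whose
    label falls in the first n_classes classes of the dataset
    (e.g. n_classes=30 yields an ESC-30-style subset of ESC-50)."""
    keep = [i for i, y in enumerate(labels) if y < n_classes]
    return [samples[i] for i in keep], [labels[i] for i in keep]

# subsets S_10, S_20, ..., S_50 of a 50-class dataset would be
# subsets = {n: make_subset(X, y, n) for n in range(10, 51, 10)}
```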

A. PRUNING-AND-QUANTIZATION
We have used the hybrid structured pruning technique proposed in [10] to derive Micro-ACDNet by pruning 80% of the channels from ACDNet. The trained model is first sparsified using the ℓ0 norm, zeroing its smallest-magnitude weights:

W[χ(W)] = 0,

where W denotes the weights of all the layers of the network, and χ is the function that sorts W by magnitude and returns the indices of the bottom 95% of the weights. The channels are then ranked, and the lowest-ranked channel is removed from the network using Equations 1, 2, 3, and 4. The resulting model of this iterative pruning and fine-tuning process is called Micro-ACDNet [10]. Micro-ACDNet is quantized using a post-training 8-bit quantization technique. We refer to this quantized model as QMicro-ACDNet. The memory and computation requirements of QMicro-ACDNet make it suitable for current off-the-shelf MCUs (see Table 2). We follow the same procedure to derive the respective derivatives of AclNet. Micro-ACDNet and QMicro-ACDNet require 36x and 144x less memory, respectively, than ACDNet, which requires 18.06 MB to store its 4.74M parameters. Furthermore, both smaller versions require 37x fewer floating-point operations (FLOPs) than the base model. For AclNet, the memory reductions for the same derivatives are 86x and 324x, with 84x fewer FLOPs. We note that the QMicro version requires the same number of FLOPs as the Micro version but only at quarter precision. If a full 32-bit floating-point unit (FPU) is used, this does not make a practical difference, but it allows us to exploit significant speed gains using 16-bit FPUs, 16-bit FPU modes on full-precision FPUs, and software emulations when no hardware FPU is available.
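The sparsification step can be sketched as follows (a minimal numpy illustration of global magnitude-based sparsification, not the exact implementation of [10]):

```python
import numpy as np

def sparsify(weights, fraction=0.95):
    """Zero out the smallest-magnitude weights of a network.

    weights: list of arrays (one per layer). The bottom `fraction`
    of all weights by absolute value -- 95% in the hybrid pruning
    step described above -- are set to zero in place.
    """
    flat = np.concatenate([np.abs(w).ravel() for w in weights])
    threshold = np.quantile(flat, fraction)
    for w in weights:
        w[np.abs(w) <= threshold] = 0.0
    return weights
```

The threshold is computed globally over all layers, so layers whose weights are small overall are sparsified more aggressively than layers with large weights.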
Although the smaller models require fewer resources, they lose classification accuracy because they have less capacity to learn. According to [10], Micro-ACDNet has 80% less capacity than ACDNet, and QMicro-ACDNet is a quarter-precision (8-bit) version of Micro-ACDNet. Table 3 and Figure 7 present a comparison of the accuracy achieved by the three versions of both baseline networks on the ESC-10, ..., ESC-50 datasets (derived using Equation 8; for further details, see Section III-A). The table and the figure show that all three versions of both networks produce state-of-the-art or near state-of-the-art accuracy on ESC-10; however, the accuracy drops continuously with an increase in the number of classes, as may be expected. The base ACDNet achieves an accuracy of 96.75% and 87.05% on ESC-10 and ESC-50, respectively, whereas Micro-ACDNet produces 96.25% and 83.25% on ESC-10 and ESC-50, respectively. The quantized version experiences a larger drop in accuracy as the number of classes increases, starting at 92.75% for ESC-10 and ending at 75.50% for ESC-50. For AclNet and its different smaller versions, the trend is similar if not the same. Figure 7 clearly shows these trends. The scenario is no different when we apply the same networks to the UrbanSound8k dataset. Table 4 shows that QMicro-ACDNet loses almost 13.5% accuracy compared to the base network. For QMicro-AclNet, the loss in accuracy is 11.7% compared to the base network. We note that specialized quantization targeted to a particular model can potentially reduce the loss of accuracy that the models experience during the quantization process. However, improving any technique used to demonstrate the process is beyond the scope of this paper, and while a change in quantization techniques may shift the results, the general trends are expected to remain the same.

B. XNOR-NET
We have applied the XNOR-Net technique to derive the XNOR-Net versions of ACDNet (i.e., XACDNet) and AclNet (i.e., XAclNet). Table 5 lists the memory and computation requirements of the full-precision networks and their XNOR counterparts. The table shows that XACDNet requires 578 KB of memory and XAclNet 1300 KB. These figures are still too large for typical MCUs offering less than 1 MB of RAM, since the parameter spaces alone already leave hardly any room for the actual computations. Hence, we need smaller versions of the original networks whose resource requirements do not exceed the resources available on the MCUs. To compare network performance, memory, and computation requirements with QMicro-ACDNet, we create Mini-ACDNet such that its XNOR-Net version has similar requirements to QMicro-ACDNet (see Tables 2 and 5). To derive Mini-ACDNet, we use the same technique as that used to derive Micro-ACDNet, summarized above and fully detailed in [10]. The same procedure is used to derive Mini-AclNet from AclNet. The sizes of the XNOR-Net versions of ACDNet and AclNet are calculated according to [23]. The number of binary operations required for the networks is calculated according to [42]. For simplicity, we express the computation (binary operations + FLOPs) required for an XNOR network as FLOPs. If a and b are the FLOPs required for the first and last full-precision layers of the XNOR network and x is the total number of FLOPs of the full-precision version of the network, then the FLOPs of an XNOR network can be expressed as FLOPs = a + (x − (a + b))/64 + b. Table 5 also shows that the XMini versions are extremely small (128.5 KB and 133.12 KB) and require considerably less computation. This clearly allows the networks to be deployed on current off-the-shelf MCUs. Furthermore, from Table 6, we can see that the XNOR networks produce reasonable accuracy for a smaller number of classes (ESC-10 with 10 classes).
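This FLOPs calculation can be expressed as a small helper (a direct transcription of the formula above; `xnor_flops` is a hypothetical name, and the 1/64 factor reflects counting 64 binary operations as one equivalent FLOP):

```python
def xnor_flops(full_precision_flops, first_layer_flops, last_layer_flops):
    """Effective FLOPs of an XNOR network: the first and last
    layers stay full precision, while the binary layers in
    between are counted at 1/64 of their full-precision cost."""
    a = first_layer_flops
    b = last_layer_flops
    x = full_precision_flops
    return a + (x - (a + b)) / 64 + b
```

For example, a network with 1000M full-precision FLOPs whose first and last layers cost 100M and 60M FLOPs would be counted as 100 + 840/64 + 60 = 173.125M effective FLOPs after binarization.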
However, as the number of classes increases, the performance of the XNOR versions decreases rapidly. Figure 8 shows the performance graph of the baseline models (i.e., ACDNet and AclNet) and their Mini, pruning-and-quantization, and XNOR counterparts. The line at the bottom (XMini) shows how severely the smallest XNOR-Net is affected when the number of classes increases, while QMicro manages to degrade much more gracefully. For all the XNOR-Net networks (XBaseline and XMini versions of the baselines), the slope is much steeper than that for the other compression methods. Figure 9 shows the performance of the same networks on the UrbanSound8k dataset, another widely used audio benchmark. From the graph, we observe that the results for UrbanSound8k confirm our findings for ESC-10, ..., ESC-50. In summary, we observe that the memory requirements of the QMicro and XMini versions are essentially the same. XMini-ACDNet requires 7.97x fewer FLOPs than QMicro-ACDNet, while XMini-AclNet requires 6.72x fewer FLOPs than QMicro-AclNet. However, the XNOR-based models do not reach comparable classification accuracies when the number of classes is larger. Although the pruning-and-quantization-based QMicro-ACDNet also sees a loss in accuracy, the accuracy is still reasonable for larger problems given its minuscule model size. This observation is consistent for both baseline models and their derivatives.

C. EXTENDED ANALYSIS
1) Comparing with Existing Work
To the best of our knowledge, Cerutti et al. [26] have presented the only XNOR network for audio classification so far, termed BNN-GAP8. They have used a different benchmark, namely, the less commonly used AudioEvent dataset [43]. We cannot test BNN-GAP8 on the most widely used benchmarks, ESC-50 and UrbanSound8k, since BNN-GAP8 has not been described in sufficient detail in [26] to make it reproducible.
To facilitate a direct comparison, we extend our analysis to this dataset, which has also been used in several other studies [43], [44]. Table 7 and Figure 10 show the performance of our base nets on the AudioEvent dataset and its smaller subsets; these results are consistent with what we have seen above for the most widely used standard benchmarks. However, the larger resource requirements of our base networks do not allow a direct and fair comparison to the BNN-GAP8 network presented in [26]. Therefore, we derive two additional smaller networks (QNano-ACDNet and XGAP8-ACDNet) with memory requirements equivalent to BNN-GAP8. QNano-ACDNet is an 8-bit quantized version of the associated full-precision network Nano-ACDNet, which is derived by pruning channels from ACDNet. XGAP8-ACDNet is an XNOR network derived from the full-precision network GAP8-ACDNet, which is also derived by pruning channels from ACDNet. The equivalent derivatives are also derived from AclNet. The new models are constructed using the same procedures as described above. The tests of these networks on the AudioEvent dataset used in [26] are summarized in Table 8. BNN-GAP8 shows good performance, but the quantized networks QNano-ACDNet and QNano-AclNet clearly outperform it for equivalent storage requirements. The computational requirements for BNN-GAP8 are given in MACs rather than FLOPs in [26]. Using the same re-scaling as detailed above, we arrive at 1768M FLOPs for BNN-GAP8. The quantized QNano networks compare very favorably, using only 10.54M FLOPs for the ACDNet version and 11.37M FLOPs for the AclNet version. In a nutshell, the pruned-and-quantized QNano versions deliver better performance than the BNN-GAP8 XNOR-Net with smaller resource requirements.
The performance of BNN-GAP8 on the 28-class problem is much better than that of the XNOR networks derived from ACDNet and AclNet (i.e., XGAP8-ACDNet and XGAP8-AclNet). The results for the latter are consistent with the performance drop established for the slightly larger XMini-ACDNet on ESC-30 (ESC-50 reduced to 30 classes, cf. Table 6).
How can BNN-GAP8 achieve such good performance on a 28-class problem? The likely reason is that BNN-GAP8 benefits from using spectrograms as input to the network. This means that in the BNN-GAP8 implementation, much of the necessary full-precision computation, for which a binary net is not suitable, is encapsulated outside the network in the conversion of raw audio to spectrograms. The other XNOR networks in this comparison are deprived of this possibility because they perform end-to-end classification on raw audio input and thus must perform all required computations within the network.
From a pragmatic perspective, using spectrograms may be a useful approach for handling simple audio classification problems on MCUs that feature a DSP with hardware support for FFT computation, which is the basis of generating spectrograms. However, as has been shown in Table 8 and Figure 11, even where such support exists, forgoing it and using end-to-end classification with a pruned-and-quantized network instead still offers performance and resource benefits over the XNOR network. Furthermore, from the broader literature on DNN audio classification, it is evident that purely spectrogram-based classification does not always allow us to achieve state-of-the-art performance with conventional CNN architectures and that features learned from raw audio or multi-channel features are preferable for more difficult benchmarks [8], [9], [11], [12], [16].

2) Analysis with Image Datasets
Most of the work on XNOR nets so far has taken place in the image domain. A comparison with this body of work is instructive. We summarize the state of the art for the most widely used datasets in Table 9. The accuracy achieved by XNOR nets is impressive (second-to-last column), but it has to be noted that these nets are significantly larger than our target size. None of these models fit on the relevant MCUs, with the single exception of the one for MNIST, which is a very simple dataset with only 10 classes. To verify whether our own results for audio classification are consistent with the image domain, we conducted additional experiments on image classification using XNOR. We have used the XNOR version of ResNet-18 [49] to classify the widely used benchmark image datasets CIFAR-10 and CIFAR-100. We have created variably sized subsets of CIFAR-100 to determine whether the trend of the loss in accuracy is similar to that in audio classification. The experimental details are provided in Section III.

V. CONCLUSION
This paper presents the first study of XNOR-Net for end-to-end classification of raw audio. We emphasize state-of-the-art MCUs as practically important target architectures for realistic Edge-AI applications. For small problem sizes, measured in the number of classes, our comprehensive experimental analysis shows that XNOR-Net can produce models with reasonable performance that are sufficiently small to fit the target architectures. For our test scenarios with small problem sizes, XNOR networks require significantly less computation than comparably sized models generated by pruning-and-quantization alone. For relatively simple problems, the performance of XNOR-Nets on MCUs can be further increased by using spectrograms as input to the net. This enables us to capitalize on special DSP hardware so that the full-precision computations required to handle raw audio can be confined outside of the net in the spectrogram generation.
The picture changes as the problem complexity grows and larger numbers of classes need to be distinguished. Our analysis shows that XNOR-Nets face large drops in classification accuracy for datasets with many classes, and this performance degradation is much more rapid than that of their pruning-and-quantization counterparts. The degradation can be handled by growing the network size. Although growing the network size may still allow the model to be run on CPUs, it takes away the ability to fit the model on MCUs because XNOR-Net can only reduce the memory requirement by a maximum of 32-fold compared to its full-precision version. Hence, for larger models, it is necessary to use other model compression techniques to find a model that is at most 32 times bigger than the target size before XNOR is applied.
In contrast, comparably sized models generated by pruning-and-quantization show significantly better performance while still satisfying the constraints of the target architectures. The advantages of spectrogram-based network input can no longer be exploited with complex data characteristics, as feature learning from raw audio or multi-channel input is required to reach state-of-the-art performance. This renders pruning-and-quantization the preferred approach beyond a certain problem complexity, unless computation speed rather than model size dominates the decision.
Furthermore, to the best of our knowledge, there is no off-the-shelf computation kernel for XNOR-Nets yet. This means that it is difficult to realize the theoretical advantage of XNOR-Net on existing add-multiplication-based hardware and that custom hardware is required to achieve the full benefit of faster computation [50]. However, given the popularity of XNOR-Nets, we hope that such support is not too far away.
Our study provides an experimental analysis of the current state of XNOR-Net and alternative compression methods as a basis to evaluate possible paths towards competitive deep learning architectures in edge-AI applications. The creation of specialized computation kernels for XNOR-Net and theoretical studies of XNOR-Net performance degradation for larger class sizes and how to avoid it are still open research problems and important areas for future research.