PHNNs: Lightweight Neural Networks via Parameterized Hypercomplex Convolutions

Hypercomplex neural networks have proven to reduce the overall number of parameters while ensuring valuable performance by leveraging the properties of Clifford algebras. Recently, hypercomplex linear layers have been further improved by involving efficient parameterized Kronecker products. In this article, we define the parameterization of hypercomplex convolutional layers and introduce the family of parameterized hypercomplex neural networks (PHNNs), which are lightweight and efficient large-scale models. Our method grasps the convolution rules and the filter organization directly from data, without requiring a rigidly predefined domain structure to follow. PHNNs are flexible enough to operate in any user-defined or tuned domain, from 1-D to $n\text{D}$, regardless of whether the algebra rules are preset. Such malleability allows processing multidimensional inputs in their natural domain without annexing further dimensions, as done instead in quaternion neural networks (QNNs) for 3-D inputs like color images. As a result, the proposed family of PHNNs operates with $1/n$ free parameters with respect to its analog in the real domain. We demonstrate the versatility of this approach to multiple domains of application by performing experiments on various image and audio datasets, in which our method outperforms real and quaternion-valued counterparts. Full code is available at: https://github.com/eleGAN23/HyperNets.


I. INTRODUCTION
Recent state-of-the-art convolutional models achieved astonishing results in various fields of application by largely scaling the overall parameter amount [1]-[4]. Simultaneously, hypercomplex algebra applications are gaining increasing attention in diverse spheres of research, such as signal processing [5]-[8] or deep learning [9]-[17]. Indeed, hypercomplex and quaternion neural networks (QNNs) demonstrated to significantly reduce the number of parameters while still obtaining comparable performance [18]-[24]. These models exploit hypercomplex algebra properties, including the Hamilton product, to painstakingly design interactions among the imaginary units, thus involving 1/4 or 1/8 of the free parameters with respect to real-valued models. Furthermore, thanks to the modelled interactions, hypercomplex networks capture internal latent relations in multidimensional inputs and preserve pre-existing correlations among input dimensions [25]-[29]. Therefore, the quaternion domain is particularly appropriate for processing 3D or 4D data, such as color images or (up to) 4-channel signals [30], while the octonion one is suitable for 8D inputs. Unfortunately, most common color image datasets contain RGB images, and some tricks are required to process this data type with QNNs. Among them, the most employed are padding a zero channel to the input in order to encapsulate the image in the four quaternion components, or remodelling the QNN layer with the help of vector maps [31]. Additionally, while quaternion neural operations are widespread and easy to integrate in pre-existing models, very few attempts have been made to extend models to different domain orders. Accordingly, the development of hypercomplex convolutional models for larger multidimensional inputs, such as magnitudes and phases of multichannel audio signals or 16-band satellite images, still remains painful. Moreover, despite the significantly lower number of parameters, these models are often slightly slow with respect to real-valued baselines [32], and ad-hoc algorithms may be necessary to improve efficiency [22], [33].

E. Grassucci and D. Comminiello are with the Dept. of Information Engineering, Electronics and Telecommunications (DIET), Sapienza University of Rome, Italy. A. Zhang is with Amazon Web Services AI, East Palo Alto, CA, USA. Corresponding author's email: eleonora.grassucci@uniroma1.it.
Recently, a novel literature branch aims at compressing neural networks by leveraging Kronecker product decomposition [34], [35], gaining considerable results in terms of model efficiency [36]. Lately, a parameterization of hypercomplex multiplications has been proposed to generalize hypercomplex fully connected layers by sums of Kronecker products [37]. The latter method obtains high performance in various natural language processing tasks while also reducing the number of overall parameters. Other works extended this approach to graph neural networks [38] and transfer learning [39], proving the effectiveness of Kronecker product decomposition for hypercomplex operations. However, no solution exists for convolutional layers yet, which remain the most employed layers when dealing with multidimensional inputs, such as images and audio signals [40], [41].
In this paper, we devise the family of parameterized hypercomplex neural networks (PHNNs), which are lightweight large-scale hypercomplex neural models admitting any multidimensional input, whatever the number of dimensions. At the core of this novel set of models, we propose the parameterized hypercomplex convolutional (PHC) layer. Our method is flexible to operate in domains from 1D to $n$D, where $n$ can be arbitrarily chosen by the user or tuned to let the model performance lead to the most appropriate domain for the given input data. Such malleability comes from the ability of the proposed approach to subsume algebra rules to perform convolution, regardless of whether these regulations are preset or not. Thus, neural models endowed with our approach adopt $1/n$ of the free parameters with respect to their real-valued counterparts, and the amount of parameter reduction is a user choice. This makes PHNNs adaptable to a plethora of applications in which saving storage memory can be a crucial aspect. Additionally, the versatility of PHNNs allows processing multidimensional data in its natural domain by simply setting the dimensional hyperparameter $n$. For instance, color images can be analyzed in their RGB domain by setting $n = 3$ without adding any useless information, contrary to standard processing for quaternion networks with the padded zero channel. Indeed, PHC layers are able to grasp the proper algebra from input data, while capturing internal correlations among the image channels and saving 66% of the free parameters.
Through a thorough empirical evaluation on multiple benchmarks, we demonstrate the flexibility of our method, which can be adopted in different domains of application, from images to audio signals. We devise a set of PHNNs for large-scale image classification and sound event detection tasks, letting them operate in different hypercomplex domains and with various input dimensionalities, with $n$ ranging from 2 to 16.
The contribution of this paper is three-fold.
• We introduce the parameterized hypercomplex convolutional (PHC) layer, which grasps the convolution rules directly from data via backpropagation by exploiting the Kronecker product properties, thus reducing the number of free parameters to $1/n$.
• We devise the family of parameterized hypercomplex neural networks (PHNNs), lightweight and more efficient large-scale hypercomplex models. Thanks to the proposed PHC layer and to the method in [37] for fully connected layers, PHNNs can be employed with any kind of input and pre-existing neural models. To show the latter, we redefine common ResNets, VGGs and Sound Event Detection networks (SEDnets), operating in any user-defined domain just by choosing the hyperparameter $n$, which also drives the number of convolutional filters.
• We show how the proposed approach can be employed with any kind of multidimensional data by easily changing the hyperparameter $n$. Indeed, by setting $n = 3$, a PHNN can process RGB images in their natural domain, while leveraging the properties of hypercomplex algebras, allowing parameter sharing inside the layers and leading to a parameter reduction to $1/3$. To the best of our knowledge, this is the first approach that processes color images with hypercomplex-based neural models without adding any padding channel. Similarly, multichannel audio signals can be analysed by simply considering $n = 4$ for standard first-order ambisonics (which has 4 microphone capsules), $n = 8$ for an array of two ambisonics microphones, or even $n = 16$ if we want to include the information of each channel's phase.
The rest of the paper is organized as follows. In Section II, we introduce concepts of hypercomplex algebra and we recapitulate real and quaternion-valued convolutional layers. Section III rigorously introduces the theoretical aspects of the proposed method. Sections IV and V reveal how the approach can be adopted in different neural models and in two different domains, images and audio, expounding how to process RGB images with $n = 3$ and multichannel audio with $n$ up to 8. The experimental evaluation is presented in Section VI for image classification and in Section VII for sound event detection. Finally, Section VIII reports the ablation studies we conduct, and in Section IX we draw conclusions.

Fig. 1. Example of hypercomplex multiplication tables for $n = 2$, i.e., complex, among others (green line), $n = 4$, i.e., quaternions, tessarines (blue line), and $n = 8$, i.e., octonions, bi-quaternions, and so on (red line). While for these domains algebra rules exist and are predefined, no regulations are set for other domains such as $n = 3, 5, 6, 7$ (dashed grey lines). The parameterized hypercomplex approaches are able to learn these missing algebra rules from data, thus defining hypercomplex multiplication and convolution for any desired domain.

II. HYPERCOMPLEX NEURAL NETWORKS

A. Hypercomplex Algebra
Hypercomplex neural networks rely on a hypercomplex number system based on the set of hypercomplex numbers H and their corresponding algebra rules to shape additions and multiplications [24]. These operations should be carefully modelled due to the interactions among imaginary units, which may not behave as real-valued numbers. For instance, Figure 1 reports an example of a multiplication table for complex (green), quaternion (blue) and octonion (red) numbers. However, this is just a small subset of the hypercomplex domains that exist. Indeed, for $n = 4$ there exist quaternions, tessarines, among others, while for $n = 8$ octonions, dual-quaternions, and so on. Each of these domains has different multiplication rules due to dissimilar imaginary unit interactions. A generic hypercomplex number is defined as $h = h_0 + h_1 \hat{\imath}_1 + \cdots + h_n \hat{\imath}_n$, being $h_0, \ldots, h_n \in \mathbb{R}$ and $\hat{\imath}_1, \ldots, \hat{\imath}_n$ imaginary units. Different subsets of the hypercomplex domain exist, including complex, quaternion, and octonion, among others. They are identified by the number of imaginary units they employ and by the properties of their vector multiplication. The quaternion domain is one of the most popular for neural networks thanks to the Hamilton product properties. This domain has its foundations in the quaternion number $q = q_0 + q_1 \hat{\imath} + q_2 \hat{\jmath} + q_3 \hat{\kappa}$, in which $q_c$, $c \in \{0, 1, 2, 3\}$, are real coefficients and $\hat{\imath}, \hat{\jmath}, \hat{\kappa}$ the imaginary units. A quaternion with its real part $q_0$ equal to 0 is named a pure quaternion. The imaginary units comply with the property $\hat{\imath}^2 = \hat{\jmath}^2 = \hat{\kappa}^2 = -1$ and with the non-commutative products $\hat{\imath}\hat{\jmath} = -\hat{\jmath}\hat{\imath} = \hat{\kappa}$, $\hat{\jmath}\hat{\kappa} = -\hat{\kappa}\hat{\jmath} = \hat{\imath}$, $\hat{\kappa}\hat{\imath} = -\hat{\imath}\hat{\kappa} = \hat{\jmath}$. Due to the non-commutativity of vector multiplication, the Hamilton product has been introduced to properly model the multiplication between two quaternions.

Fig. 2. The quaternion convolution rule can be expressed as a sum of Kronecker products between the matrices $A_i$, which subsume the algebra rules, and the matrices $F_i$, which contain the convolution filters, with $i = 1, 2, 3, 4$. In this example, the parameters of $A_i$ are fixed for visualization purposes, but in PHC layers they are learnable.
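As a quick illustration of the imaginary unit interactions above, the Hamilton product between two quaternions $p$ and $q$ can be sketched in a few lines of plain Python (a minimal example; the function name is ours, not from the paper):

```python
def hamilton_product(p, q):
    """Hamilton product of two quaternions given as (real, i, j, k) tuples."""
    p0, p1, p2, p3 = p
    q0, q1, q2, q3 = q
    return (
        p0 * q0 - p1 * q1 - p2 * q2 - p3 * q3,  # real part
        p0 * q1 + p1 * q0 + p2 * q3 - p3 * q2,  # i coefficient
        p0 * q2 - p1 * q3 + p2 * q0 + p3 * q1,  # j coefficient
        p0 * q3 + p1 * q2 - p2 * q1 + p3 * q0,  # k coefficient
    )
```

The non-commutativity is visible directly: multiplying the units $\hat{\imath}$ and $\hat{\jmath}$ in the two possible orders yields $\hat{\kappa}$ and $-\hat{\kappa}$, respectively.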

B. Real and Quaternion-Valued Convolutional Layers
A generic convolutional layer can be described by

$$\mathbf{y} = \mathbf{W} * \mathbf{x} + \mathbf{b}, \qquad (1)$$

where the input $\mathbf{x} \in \mathbb{R}^{t \times s}$ is convolved ($*$) with the filter tensor $\mathbf{W} \in \mathbb{R}^{s \times d \times k \times k}$ to produce the output $\mathbf{y} \in \mathbb{R}^{d \times t}$, where $s$ is the input channel dimension, $d$ the output one, $k$ is the filter size, and $t$ is the input and output dimension. The bias term $\mathbf{b}$ does not heavily influence the number of parameters, thus the degrees of freedom for this operation are essentially $O(sdk^2)$. Quaternion convolutional layers, instead, build the weight tensor $\mathbf{W} \in \mathbb{R}^{s \times d \times k \times k}$ by following the Hamilton product rule and organize the filters according to it:

$$\mathbf{W} * \mathbf{x} = \begin{bmatrix} \mathbf{W}_0 & -\mathbf{W}_1 & -\mathbf{W}_2 & -\mathbf{W}_3 \\ \mathbf{W}_1 & \mathbf{W}_0 & -\mathbf{W}_3 & \mathbf{W}_2 \\ \mathbf{W}_2 & \mathbf{W}_3 & \mathbf{W}_0 & -\mathbf{W}_1 \\ \mathbf{W}_3 & -\mathbf{W}_2 & \mathbf{W}_1 & \mathbf{W}_0 \end{bmatrix} * \begin{bmatrix} \mathbf{x}_0 \\ \mathbf{x}_1 \\ \mathbf{x}_2 \\ \mathbf{x}_3 \end{bmatrix}, \qquad (3)$$

where $\mathbf{W}_0, \mathbf{W}_1, \mathbf{W}_2, \mathbf{W}_3 \in \mathbb{R}^{\frac{s}{4} \times \frac{d}{4} \times k \times k}$ are the real coefficients of the quaternion weight matrix $\mathbf{W} = \mathbf{W}_0 + \mathbf{W}_1 \hat{\imath} + \mathbf{W}_2 \hat{\jmath} + \mathbf{W}_3 \hat{\kappa}$ and $\mathbf{x}_0, \mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3$ are the coefficients of the quaternion input $\mathbf{x}$ with the same structure.
As done for real-valued layers, the bias can be ignored, and the degrees of freedom of the quaternion convolutional layer can be approximated to $O(sdk^2/4)$. The lower number of parameters with respect to the real-valued operation is due to the reuse of filters performed by the Hamilton product in Eq. 3. Also, sharing the parameter submatrices forces the layer to consider and exploit the correlation between the input components [21], [42], [43].
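The filter reuse behind the $1/4$ saving can be made concrete with a small NumPy sketch of the Hamilton block arrangement (a hedged illustration with a hypothetical function name; the spatial $k \times k$ extent of the filters is omitted for clarity):

```python
import numpy as np

def quaternion_weight(W0, W1, W2, W3):
    """Arrange four (s/4, d/4) real filter blocks following the Hamilton rule."""
    return np.block([
        [W0, -W1, -W2, -W3],
        [W1,  W0, -W3,  W2],
        [W2,  W3,  W0, -W1],
        [W3, -W2,  W1,  W0],
    ])
```

The assembled matrix has 16 blocks but only 4 distinct sets of free parameters, which is where the $O(sdk^2/4)$ degrees of freedom come from.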

III. PARAMETERIZING HYPERCOMPLEX CONVOLUTIONS
In the following, we delineate the formulation for the proposed parameterized hypercomplex convolutional (PHC) layer.We also show that this approach is capable of learning the Hamilton product rule when two quaternions are convolved.

A. Parameterized Hypercomplex Convolutional Layers
The PHC layer is based on the construction, by a sum of Kronecker products, of the weight tensor $\mathbf{H}$, which encapsulates and organizes the filters of the convolution. The proposed method is formally defined as:

$$\mathbf{y} = \mathbf{H} * \mathbf{x} + \mathbf{b}, \qquad (4)$$

whereby $\mathbf{H} \in \mathbb{R}^{s \times d \times k \times k}$ is built by a sum of Kronecker products between two learnable groups of matrices. Here, $s$ is the input dimensionality to the layer, $d$ is the output one, and $k$ is the filter size. More concretely,

$$\mathbf{H} = \sum_{i=1}^{n} \mathbf{A}_i \otimes \mathbf{F}_i, \qquad (5)$$

in which $\mathbf{A}_i \in \mathbb{R}^{n \times n}$, with $i = 1, \ldots, n$, are the matrices that describe the algebra rules, and $\mathbf{F}_i \in \mathbb{R}^{\frac{s}{n} \times \frac{d}{n} \times k \times k}$ represents the $i$-th batch of filters that are arranged by following the algebra rules to compose the final weight matrix. It is worth noting that $\frac{s}{n} \times \frac{d}{n} \times k \times k$ holds for squared kernels, while $\frac{s}{n} \times \frac{d}{n} \times k$ should be considered instead for 1D kernels. The core element of this approach is the Kronecker product [44], which is a generalization of the vector outer product and can be parameterized by $n$. The hyperparameter $n$ can be set by the user who wants to operate in a pre-defined real or hypercomplex domain (e.g., by setting $n = 2$ the PHC layer is defined in the complex domain, or in the quaternion one if $n$ is set equal to 4, as Figure 2 illustrates), or tuned to obtain the best performance from the model. The matrices $\mathbf{A}_i$ and $\mathbf{F}_i$ are learnt during training and their values are reused to build the definitive tensor $\mathbf{H}$.
The degrees of freedom of $\mathbf{A}_i$ and $\mathbf{F}_i$ are $n^3$ and $sdk^2/n$, respectively. Usually, real-world applications employ a large number of filters in layers ($s, d = 256, 512, \ldots$) and small values for $k$. Therefore, frequently $sdk^2/n \gg n^3$ holds. Thus, the degrees of freedom of the PHC weight matrix can be approximated to $O(sdk^2/n)$. Hence, the PHC layer reduces the number of parameters to $1/n$ with respect to a standard convolutional layer in real-world problems.
Moreover, when processing multidimensional data with correlated channels, such as color images, multichannel audio or multisensor signals, PHC layers bring benefits due to the weight sharing among different channels. This allows capturing latent intra-channel relations that standard convolutional networks ignore because of the rigid structure of their weights [20], [45]. The PHC layer is able to subsume hypercomplex convolution rules, and the desired domain is specified by the hyperparameter $n$. Interestingly, by setting $n = 1$, a real-valued convolutional layer can be represented too. Indeed, standard real layers do not involve parameter sharing, therefore the algebra rules are solely described by the single $\mathbf{A}_1 \in \mathbb{R}^{1 \times 1}$ and the complete set of filters is included in $\mathbf{F}_1 \in \mathbb{R}^{s \times d \times k \times k}$. Therefore, the PHC layer fills the gaps left by pre-existing hypercomplex algebras in Fig. 1 and subsumes the missing algebra rules directly from data, i.e., the dashed grey lines in Fig. 1. Thus, a neural model equipped with PHC layers can grasp the filter organization also for $n = 3, 5, 6, 7$, and so on. Moreover, any convolutional model can be endowed with our approach, since PHC layers easily replace standard convolution/transposed convolution operations, and the hyperparameter $n$ gives high flexibility to adapt the layer to any kind of input, such as color images, multichannel audio or multisensor signals.
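The sum of Kronecker products at the heart of the PHC layer can be sketched numerically as follows (a minimal NumPy illustration with a hypothetical function name, omitting the spatial $k \times k$ extent; in an actual PHC layer both groups of matrices would be learnable parameters):

```python
import numpy as np

def phc_weight(A, F):
    """Build H = sum_i A_i (x) F_i from algebra matrices A of shape (n, n, n)
    and filter blocks F of shape (n, s/n, d/n)."""
    return sum(np.kron(A_i, F_i) for A_i, F_i in zip(A, F))
```

For $n = 4$ with the $\mathbf{A}_i$ fixed to the Hamilton matrices, this recovers the quaternion arrangement of Eq. 3; for $n = 1$ it degenerates to a plain scaling of a single full-size filter matrix.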

B. Learning Tests on Toy Examples
We test the ability of the PHC layer to learn algebra rules in two toy problems by building an artificial dataset. We highly encourage the reader to take a look at the tutorials section of the GitHub repository https://github.com/eleGAN23/HyperNets for more insights and results on the toy examples, including the learned matrices $\mathbf{A}_i$. The first task aims at learning the right matrices $\mathbf{A}_i$ to build a quaternion convolutional layer which properly follows the Hamilton rule in Eq. 3. That is, we set $n = 4$ and the objective is to learn the four matrices $\mathbf{A}_i$ as they are in the quaternion product in Fig. 2. We build the dataset by performing a convolution between a matrix of filters $\mathbf{W} \in \mathbb{H}$, arranged following the rule in Eq. 3, and a quaternion input $\mathbf{x} \in \mathbb{H}$. The target is still a quaternion, named $\mathbf{y} \in \mathbb{H}$. As shown in Fig. 3 (right), the MSE loss of the PHC layer converges very fast, meaning that the layer properly learns the matrices $\mathbf{A}_i$ and the Hamilton convolution.
The second toy example is a modification of the previous dataset target. Here, we want to learn the matrices $\mathbf{A}_i$ which describe the convolution between two pure quaternions. Therefore, when setting $n = 4$, the matrix $\mathbf{A}_1$ corresponding to the real part of a pure quaternion should be completely null. Pure quaternions may be, as an example, a zero-padded input RGB image and the weights of a hypercomplex convolutional layer, since the first channel of such an image is zero. Figure 3 (left) displays the convergence of the PHC layer loss during training, proving that the proposed method is able to subsume hypercomplex convolution rules when dealing with pure quaternions too.

C. Demystifying Parameterized Hypercomplex Convolutional Layers
We provide a formal explanation of the PHC layer to better understand the Kronecker product and how it organizes convolution filters to reduce the overall number of parameters to $1/n$. In Eq. 6, we show how the PHC layer generalizes from 1D to $n$D domains:

$$\mathbf{H} = \begin{cases} \mathbf{A} \otimes \mathbf{F}, & n = 1, \\ \mathbf{A}_1 \otimes \mathbf{F}_1 + \mathbf{A}_2 \otimes \mathbf{F}_2, & n = 2, \\ \sum_{i=1}^{n} \mathbf{A}_i \otimes \mathbf{F}_i, & \text{generic } n. \end{cases} \qquad (6)$$

When subsuming real-valued convolutions in the first line of Eq. 6, the Kronecker product is performed between a scalar $\mathbf{A}$ and the filter matrix $\mathbf{F}$, whose dimension is the same as the final weight matrix $\mathbf{H}$, which is $s \times d \times k \times k$.
Considering the complex case with $n = 2$ in the second line of Eq. 6, the algebra is defined in $\mathbf{A}_1$ and $\mathbf{A}_2$, while the filters are contained in $\mathbf{F}_1$ and $\mathbf{F}_2$, each of dimension $1/2$ of the final matrix $\mathbf{H}$. Therefore, while the size of the weight matrix $\mathbf{H}$ remains unchanged, the parameter count is approximately $1/2$ of the real one. In the last line of Eq. 6, we can see the generalization of this process, in which the size of the matrices $\mathbf{F}_i$, $i = 1, \ldots, n$, is reduced proportionally to $n$. It is worth noting that, while the parameter count is reduced for growing values of $n$, the dimension of $\mathbf{H}$ remains the same.
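The scaling described above can be verified with a short back-of-the-envelope computation (a hypothetical helper, not from the paper, counting the $n^3$ algebra parameters plus the $sdk^2/n$ filter parameters):

```python
def phc_params(s, d, k, n):
    """Approximate degrees of freedom of a PHC layer: n algebra matrices
    of n*n entries plus n filter blocks of (s/n) * (d/n) * k * k entries."""
    return n ** 3 + (s * d * k * k) // n

# A real-valued convolutional layer corresponds to n = 1.
real = phc_params(256, 512, 3, 1)
quat = phc_params(256, 512, 3, 4)
```

For typical layer sizes the $n^3$ term is negligible, so the quaternion case ($n = 4$) stays very close to a quarter of the real-valued cost.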

IV. PARAMETERIZED HYPERCOMPLEX NEURAL NETWORKS FOR COLOR IMAGES
In this section, we describe how PHNNs can be applied to process color images in hypercomplex domains without adding any extra information to the input, and we propose examples of parameterized hypercomplex versions of common computer vision models, such as VGGs and ResNets. To be consistent with the literature, we perform each experiment with a real-valued baseline model, then we compare it with its complex and quaternion counterparts and with the proposed PHNN. Furthermore, we assess the malleability of the proposed approach by testing different values of the hyperparameter $n$, therefore defining parameterized hypercomplex models in multiple domains.

A. Processing Color Images with PHC Layers
Different encodings exist to process color images; however, the most common computer vision datasets are comprised of three-channel images in $\mathbb{R}^3$. In the quaternion domain, RGB images are enclosed into a quaternion and processed as single elements [42]. The encapsulation is performed by considering the RGB channels as the real coefficients of the imaginary units and by padding a zero channel as the first real component of the quaternion.
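This encapsulation step can be sketched as follows (a minimal NumPy illustration; the function name is a hypothetical one of ours):

```python
import numpy as np

def rgb_to_quaternion(img):
    """Prepend a zero channel to a (3, H, W) RGB image so the three color
    channels become the imaginary parts of a pure quaternion."""
    zero = np.zeros((1,) + img.shape[1:], dtype=img.dtype)
    return np.concatenate([zero, img], axis=0)
```

The padded first channel carries no information; it only makes the input shape compatible with the four quaternion components.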
Here, we propose to leverage the high malleability of PHC layers to deal with RGB images in hypercomplex domains without embedding useless information into the input. Indeed, the PHC layer can directly operate in $\mathbb{R}^3$ by simply setting $n = 3$, processing RGB images in their natural domain while exploiting hypercomplex network properties such as parameter sharing. Indeed, the great flexibility of PHC layers allows the user to choose whether to process images in $\mathbb{R}^4$ or $\mathbb{R}^3$. On one hand, by setting $n = 4$, the zero channel is added to the input, but the layer saves 75% of the free parameters. On the other hand, by choosing $n = 3$, the network does not handle any useless information, although it reduces the number of parameters by only 66%. This is a trade-off which may depend on the application or on the hardware the user needs. Furthermore, the domain in which to process images can be tuned by letting the performance of the network indicate the best choice for $n$.

B. Parameterized Hypercomplex VGGs
A family of popular methods for image processing is based on the VGG networks [46], which stack several convolutional layers and a closing fully connected classifier. To completely define models in the desired hypercomplex domain, we propose to endow the network with PHC layers as convolution components and with parameterized hypercomplex multiplication (PHM) layers [37] as the linear classifier. The backbone of our PHVGG is then a stack of PHC layers followed by a PHM classifier.

C. Parameterized Hypercomplex ResNets
In recent literature, a copious set of high performances in image classification is obtained with models having a residual structure. ResNets [47] pile up manifold residual blocks composed of convolutional layers and identity mappings. A generic PHResNet residual block is defined by

$$\mathbf{y} = \mathbf{x} + \mathcal{F}\left(\mathbf{x}, \{\mathbf{H}_j\}\right),$$

whereby $\mathbf{H}_j$ are the PHC weights of layer $j = 1, 2$ in the block, and $\mathcal{F}$ is

$$\mathcal{F}\left(\mathbf{x}, \{\mathbf{H}_j\}\right) = \mathbf{H}_2 * \mathrm{ReLU}\left(\mathbf{H}_1 * \mathbf{x}\right),$$

in which we omit batch normalization to simplify notation.
The backward phase of PHNNs reduces to a backpropagation similar to that of quaternion neural networks, which has already been developed in [19], [42], [48].

V. PARAMETERIZED HYPERCOMPLEX NEURAL NETWORKS FOR MULTICHANNEL SIGNALS
In the following, we expound how PHNNs can be employed to deal with multichannel audio signals and we introduce, as an example, the parameterized hypercomplex Sound Event Detection networks (PHSEDnets).

A. Processing Multichannel Audio with PHC Layers
A first-order Ambisonics (FOA) signal is composed of 4 microphone capsules, whose magnitude representations can be enclosed in a quaternion [49], [50]. However, the quaternion algebra may be restrictive if more than one microphone is employed for recording, or when the phase information has to be included too. Indeed, quaternion neural networks fit poorly with multidimensional inputs having more than 4 channels [51].
Conversely, the proposed method can be easily adapted to deal with these additional dimensions by handily setting the hyperparameter $n$, thus completely leveraging each piece of information in the $n$-dimensional input.

Fig. 4. CIFAR10 accuracy against number of network parameters for VGG and ResNet models. The larger the point, the higher the standard deviation over the runs. PHC-based models obtain better accuracies in both families while far reducing the number of parameters. We do not display complex VGGs, as their accuracy is very low with respect to the other models.

B. Parameterized Hypercomplex SEDnets
Sound Event Detection networks (SEDnets) [52] are comprised of a core convolutional component which extracts features from the input spectrogram. The information is then passed to a gated recurrent unit (GRU) module and to a stack of fully connected (FC) layers with a closing sigmoid $\sigma$, which outputs the probability that the sound is in the audio frame. Formally, the PHSEDnet is described by

$$\mathbf{h}_t = \mathrm{PHC}_t\left(\mathbf{h}_{t-1}\right), \quad t = 1, \ldots, j,$$
$$\mathbf{y} = \sigma\left(\mathrm{FC}\left(\mathrm{GRU}\left(\mathbf{h}_j\right)\right)\right),$$

with $\mathbf{h}_0$ the input spectrogram. After the GRU module, we employ standard fully connected layers, which can also be implemented as PHM layers with $n = 1$, since the processed signal loses its original multidimensional structure.

VI. EXPERIMENTAL EVALUATION ON IMAGE CLASSIFICATION
To begin with, we test the PHC layer on RGB images and we show how, by exploiting the correlations among channels, the proposed method saves parameters while ensuring high performance. We perform each experiment with a real-valued baseline model and then we compare it with its complex and quaternion counterparts and with the proposed PHNNs. Furthermore, we assess the malleability of the proposed approach by testing different values of the hyperparameter $n$, therefore defining parameterized hypercomplex models in multiple domains.

A. Experimental Setup
We perform the image classification task with five baseline models. We consider ResNet18, ResNet50 and ResNet152 from the ResNet family, and VGG16 and VGG19 from the VGG one. Each hyperparameter is set according to the original papers [46], [47]. We investigate the performance on four different color image datasets at different scales. We employ SVHN, CIFAR10, CIFAR100, and ImageNet, and no data augmentation is applied to these datasets in order to guarantee a fair comparison.
We modify the number of filters of the ResNets in order to be divisible by 3, and thus have the possibility of testing a configuration with $n = 3$. The modified versions of the ResNets are built with an initial convolutional layer of 60 filters. Then, the subsequent blocks have 60, 120, 240, 516 filters. The number of layers in the blocks depends on the ResNet chosen, whether 18, 50 or 152. Instead, the VGG19 convolution component comprises two 24-, two 72-, four 216-, and eight 648-filter layers, with batch normalization. The classifier is composed of three fully connected layers of 648, 516 and 10, 100 or 1000 units, depending on the number of classes in the dataset. The rest of the hyperparameters are set as suggested in the original papers. The batch size is fixed to 128, and training is performed via the SGD optimizer with momentum equal to 0.9, weight decay $5 \times 10^{-4}$ and a cosine annealing scheduler. For ResNets, the initial learning rate is set to 0.1; for VGGs, it is equal to 0.01. Models on CIFAR10 and CIFAR100 are trained for 200 epochs, whereas on SVHN the networks run for 50 epochs. For the ImageNet dataset, we follow the recipes in [53], so we resize the images for training at $160 \times 160$ while keeping the standard size of $224 \times 224$ for validation and test. We employ a step learning rate decay every 30 epochs with $\gamma = 0.1$, the SGD optimizer and an initial learning rate of 0.1 with weight decay 0.0001. The training is performed for 300k iterations with a batch size of 256, employing four Tesla V100 GPUs.

B. Experimental Results
We execute initial experiments with VGGs against quaternion VGGs and two versions of PHVGGs with $n$ equal to 2 and to 4. Average and standard deviation of the accuracy over three runs are reported for the SVHN and CIFAR10 datasets in Table I. We also run additional experiments, but no significant difference emerges, as the randomness only affects the network initialization. Both the PHVGG16 and PHVGG19 versions clearly outperform their real, complex and quaternion counterparts while being built with at most half the number of parameters of the baseline. Additionally, PH-based models extraordinarily reduce the training and inference time (computed on an NVIDIA Tesla V100) required with respect to the quaternion model, which operates in a hypercomplex domain as well. Furthermore, when scaling up the experiment with VGG19, the proposed methods are more efficient at inference time with respect to the real-valued VGG19. Therefore, PHNNs can be easily adopted in applications with disk memory limitations, due to the reduction of parameters, and for fast inference problems, thanks to the efficiency at testing time. Although the sum of Kronecker products in PHC layers requires additional computations, the increase is insignificant with respect to the FLOPs computed for the whole network, so the overall number of FLOPs is not heavily affected by our method and the count remains almost the same.
Our approach has high malleability: indeed, when dealing with color images, we can choose the domain in which to operate thanks to the hyperparameter $n$. Therefore, we test PHNNs in the complex ($n = 2$), quaternion ($n = 4$) or $\mathbb{H}^3$ ($n = 3$) domain, where in the latter we do not concatenate any zero padding and process the RGB channels of the image in their natural domain.
Table II presents average and standard deviation of the accuracy over three runs with different seeds for ResNet-based models. We perform extensive experiments, and the PH models with $n = 4$ always outperform the quaternion counterpart, gaining a higher accuracy and being more robust. This underlines the effectiveness of the PHC architectural flexibility over the predefined and rigid structure of quaternion layers. Furthermore, our method distinctly exceeds the corresponding real-valued baselines across the experiments while saving from 50% to 75% of the parameters. Focusing on the latter result, the PHResNet with $n = 3$ proves to be the most suitable choice in many cases, proving the validity of processing RGB images in their natural domain while leveraging hypercomplex algebra. However, performances with $n = 3$ and $n = 4$ are comparable, thus the choice of this hyperparameter may depend on the application or on the hardware employed. On one hand, $n = 4$ may sometimes lead to lower performance; nevertheless, it allows saving disk memory, as shown in the third column of Table II, thus it may be more appropriate for edge applications.
On the other hand, processing color images with $n = 3$ may bring higher accuracy, even though it requires more parameters. Therefore, such flexibility makes PHNNs adaptable to a large range of applications. Likewise, PHResNets with $n = 2$ gain considerable accuracy scores with respect to the corresponding real-valued models and, due to the larger number of parameters with respect to the PH model with $n = 3$, sometimes outperform it too. Finally, the PHResNet with $n = 4$ obtains the overall best accuracy in the largest experiment of this set. Indeed, considering a ResNet152 backbone on CIFAR100, our model exceeds the real-valued baseline by more than 4%. This is empirical proof that PHNNs scale well to large real-world problems while notably reducing the overall number of parameters. These results are summarized for ResNet and VGG models on CIFAR10 in Fig. 4. The plot displays model accuracies against model parameters. The PH-based models, either ResNets or VGGs, exceed their real and quaternion-valued baselines while consistently reducing the number of parameters. What is more, in Table II, we also report the memory required to store the model checkpoints for inference. Our method crucially reduces the amount of disk memory demanded with respect to the heavier real-valued model. Further, we perform the image classification task on the ImageNet dataset. We compute the percentage of successes of ResNet-based models in each run, for which we report the average accuracies in Table II. As Fig. 5 shows, the largest percentage of successes is reached by the PHResNet with $n = 3$, which has been demonstrated to be the most valuable choice of $n$ when dealing with RGB images. Therefore, we test the PHResNet with $n = 3$ against the real-valued counterpart. Table III shows that the proposed method achieves comparable, and even slightly superior, performance with respect to the real-valued baseline, while involving 66% fewer parameters. Additionally, in Fig. 6, we provide Grad-CAM visualizations [54] for a sample of predictions by our method on the ImageNet dataset to further prove the correct behavior of the PHResNet50 with $n = 3$ in this scenario. This proves the robustness of the proposed approach, which can be adopted and implemented in models at different scales.

VII. EXPERIMENTAL EVALUATION ON SOUND EVENT DETECTION
Sound event detection (SED) is the task of recognizing the classes of sounds and the temporal instants at which these sounds are active in an audio signal [55]. We prove that the PHC layer is adaptable to n-dimensional input signals and, due to parameter reduction and hypercomplex algebra, performs better in terms of both efficiency and evaluation scores.

A. Experimental Setup
For sound event detection models, we consider the augmented version of the SELDnet [49], [52] that was proposed as the baseline for Task 2 of the L3DAS21 Challenge [56], and we perform our experiments with the corresponding released dataset. We consider as our baselines the SEDnet (without the localization part) and its quaternion counterpart. The L3DAS21 Task 2 dataset contains 15 hours of MSMP B-format Ambisonics audio recordings, divided into 900 1-minute-long data points sampled at a rate of 32 kHz, where up to 3 acoustic events may overlap. The 14 sound classes have been selected from the FSD50K dataset and are representative of office sounds: computer keyboard, drawer open/close, cupboard open/close, finger snapping, keys jangling, knock, laughter, scissors, telephone, writing, chink and clink, printer, female speech, male speech. In this dataset, the volume difference between the sounds ranges between 0 and 20 dB full scale (dBFS). Considering the array of two microphones 1 and 2, the channel order is [W1, Z1, Y1, X1, W2, Z2, Y2, X2], where W, X, Y, Z are the B-format Ambisonics channels when the phase (p) information is not considered. If we also include this information, the order becomes [W1, Z1, Y1, X1, W1p, Z1p, Y1p, X1p, W2, Z2, Y2, X2, W2p, Z2p, Y2p, X2p], for up to 16 channels. In Fig. 7, we show the 8-channel input when considering one microphone and the phase information. Magnitudes and phases are normalized to be centered at 0 with standard deviation 1.
We perform experiments with multiple configurations of this dataset. We first test the recordings from one microphone considering the magnitudes only (4-channel input); then we test the networks with the signals recorded by two microphones, again with magnitudes only (8-channel input). The features extracted by the preprocessing are fed to a four-layer convolutional stack with 64, 128, 256, and 512 filters, with batch normalization, ReLU activation, max pooling, and dropout (probability 0.3), with pooling sizes (8, 2), (8, 2), (2, 2), (1, 1). The bidirectional GRU module has three layers, each with a hidden size of 256. The tail is a four-layer fully connected classifier with 1024 units alternated with ReLUs, with a final dropout and a sigmoid activation function. The initial learning rate is set to 0.00001. To be consistent with pre-existing literature metrics, we define true positives as TP, false positives as FP, and false negatives as FN, computed according to the detection metric [56]. Moreover, in order to compute the error rate (ER), we consider S = min(FN, FP), D = max(0, FN − FP), and I = max(0, FP − FN), as in [52], [55]. Therefore, we consider $\text{ER} = (S + D + I)/N$, whereby N is the total number of active sound event classes in the reference. The SED score is defined by $\text{SED} = (\text{ER} + (1 - \text{F}))/2$, where F is the F score, $\text{F} = 2\,\text{TP}/(2\,\text{TP} + \text{FP} + \text{FN})$. For the ER and SED scores, lower values indicate better performance, while for the F score higher values stand for better accuracy.
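For concreteness, the detection metrics above can be sketched in a few lines of Python (the helper name sed_metrics is ours and is only illustrative; the challenge evaluation code may differ):

```python
def sed_metrics(tp, fp, fn, n_ref):
    """Compute ER, F score, and SED score from detection counts.

    tp/fp/fn: true positives, false positives, false negatives,
    computed according to the detection metric;
    n_ref: total number of active sound event classes in the reference.
    """
    s = min(fn, fp)                    # substitutions
    d = max(0, fn - fp)                # deletions
    i = max(0, fp - fn)                # insertions
    er = (s + d + i) / n_ref           # error rate: lower is better
    f = 2 * tp / (2 * tp + fp + fn)    # F score: higher is better
    sed = (er + (1 - f)) / 2           # SED score: lower is better
    return er, f, sed
```

For instance, with TP = 8, FP = 2, FN = 2 over 10 reference events, the helper yields ER = 0.2 and F = 0.8.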

B. Experimental Results
We investigate PHSEDnets in the complex, quaternion, and octonion domains with n = 2, 4, 8 and train each network for 1000 epochs with a batch size of 16. The proposed parameterized hypercomplex SEDnets distinctly outperform the real and quaternion-valued baselines, as reported in Table IV and Table V. Indeed, the PHSEDnet with n = 2 gains the best results for each score on both the one- and two-microphone datasets, proving that the weight sharing due to the hypercomplex parameterization is able to capture more information despite the lower number of parameters. It is interesting to note that the PHSEDnet with n = 4, which operates in the quaternion domain, achieves improved scores with respect to the quaternion SEDnet that follows the rigid predefined algebra rules. Further, the malleability of PHC layers allows gaining comparable performance with respect to the quaternion baseline while reducing convolutional parameters by 87%, just by setting n = 8. In Section VIII-B, we show additional experimental results of PH models able to save 94% of convolutional parameters while operating in the sedenion domain by setting n = 16.
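The 87% and 94% convolutional-parameter savings quoted above follow from a simple count. The sketch below uses hypothetical layer shapes of our choosing; the n³ term counts the entries of the n learnable algebra matrices A_i:

```python
def conv_params(out_c, in_c, k):
    """Free parameters of a standard real-valued k x k convolution."""
    return out_c * in_c * k * k

def phc_conv_params(out_c, in_c, k, n):
    """Free parameters of a PHC layer: n algebra matrices A_i of size
    n x n, plus n filter blocks F_i of size (out_c/n) x (in_c/n) x k x k,
    i.e., roughly 1/n of the real-valued kernel for realistic widths."""
    assert out_c % n == 0 and in_c % n == 0, "channels must be divisible by n"
    return n ** 3 + n * (out_c // n) * (in_c // n) * k * k

# Hypothetical 3x3 convolution with 512 output and 256 input channels.
real = conv_params(512, 256, 3)
for n in (2, 4, 8, 16):
    saving = 1 - phc_conv_params(512, 256, 3, n) / real
    print(f"n = {n:2d}: {saving:.1%} fewer convolutional parameters")
```

For this layer the loop reports savings close to 1 − 1/n: about 87.5% for n = 8 and about 93.4% for n = 16, consistent with the 87% and 94% figures in the text.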
Furthermore, PHSEDnets are more efficient in terms of the time required for training and inference. Table V also shows that each tested version of the proposed method is faster than both the real SEDnet and the quaternion one, at both training and inference time. Time efficiency is crucial in audio applications, where networks are usually trained for thousands of epochs and datasets are very large and require protracted computations.
Figure 8 summarizes the number of parameters, the metric scores, and the computational time in a radar plot, from which it is clear that the PHSEDnet with n = 2 gains the best scores and a large time saving, at the cost of more parameters with respect to the other configurations.

VIII. ABLATION STUDIES

A. Fewer parameters do not lead to higher generalization
In the following, we demonstrate that the higher accuracies achieved by our method are not caused by the parameter reduction, which may by itself lead to better generalization. To this end, we perform multiple experiments. First, we test lighter ResNets that were originally built for the CIFAR10 dataset [47]: ResNet20, ResNet56, and ResNet110. Second, we also consider the smallest VGG network, that is, the VGG11, which has 14M parameters. Finally, in order to further refute the hypothesis that a smaller number of neural parameters leads to higher generalization capabilities, we perform experiments with real-valued baselines with the number of parameters reduced by 75%. Table VIII shows that reducing the number of filters downgrades the performance; thus, it is not sufficient to improve the generalization capabilities of a model. We do not include standard deviations for the values in the ablation studies, as they are similar to those of the previous experiments, and we aim at favoring paper readability.
B. Push the hyperparameter n up to 16

In the following, we perform additional experiments for the sound event detection task. We conduct a test considering two microphones and the phase information, so as to have an input with 16 channels. For this purpose, we consider the quaternion model as baseline and PHNNs with n = 4, 8, 16, so as to test higher-order domains. The quaternion SEDnet and the PHSEDnet with n = 4 manage the 16 channels by grouping them into four components of 4 channels each: one component containing the magnitudes of the first microphone, one the phases of the same microphone, and so on. Therefore, the details coming from the magnitudes, which are the most important for sound event detection, are grouped together without properly exploiting this information. On the contrary, employing PHC layers allows the model to process information without roughly grouping channels, instead leveraging every channel by simply setting n equal to the number of channels, in this case 16. From Table IX, it is clear that employing a 4-component model, such as the quaternion one or the PHC with n = 4, does not lead to higher performance, despite the higher number of parameters. Indeed, the best scores are obtained with PHC models involving n = 8 and n = 16, which are able to grasp information from each channel.
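The channel-grouping contrast described above can be sketched as follows. The spectrogram shapes and the helper name are our illustration, not the released code; the point is only how the hyperparameter n partitions the channel axis:

```python
import numpy as np

# Hypothetical 16-channel input: 2 microphones x (4 magnitude + 4 phase)
# B-format channels, as a (batch, channels, time, freq) tensor.
x = np.random.randn(1, 16, 128, 128)

def split_components(x, n):
    """Split the channel axis into the n hypercomplex components that a
    PHC layer with hyperparameter n operates on."""
    b, c, t, f = x.shape
    assert c % n == 0, "channels must be divisible by n"
    return x.reshape(b, n, c // n, t, f)

# n = 4 lumps 4 heterogeneous channels (magnitudes and phases mixed)
# into each component; n = 16 gives every channel its own component.
print(split_components(x, 4).shape)   # (1, 4, 4, 128, 128)
print(split_components(x, 16).shape)  # (1, 16, 1, 128, 128)
```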

IX. CONCLUSION
In this article, we introduce the parameterized hypercomplex convolutional (PHC) layer, which grasps the convolution rule directly from data and can operate in any domain, from 1-D to n-D, regardless of whether the algebra rules are preset. The proposed approach reduces the convolution parameters to 1/n with respect to real-valued counterparts and allows capturing internal latent relations thanks to parameter sharing among input dimensions. Employing this method jointly with the one in [37], we devise the family of parameterized hypercomplex neural networks (PHNNs), a set of lightweight and efficient neural models exploiting hypercomplex algebra properties for increased performance and high flexibility. We show that our method is flexible enough to operate in different fields of application by performing experiments with images and audio signals. We also prove the malleability and robustness of our approach in learning convolution rules in any domain by setting different values of the hyperparameter n, from 2 to 16.

CO2 Emission Related to Experiments
Experiments were conducted using a private infrastructure, which has a carbon efficiency of 0.445 kgCO2eq/kWh. A cumulative total of 2000 hours of computation was performed on hardware of type Tesla V100-SXM2-32GB (TDP of 300 W). Total emissions are estimated to be 267 kgCO2eq, of which 0% was directly offset. Estimations were conducted using the Machine Learning Impact calculator presented in [57].
In more detail, considering an experiment for the sound event detection (SED) task, according to Table V the real-valued baseline requires approximately 20 hours for training and validation, with corresponding carbon emissions of 2.71 kgCO2eq. Conversely, the proposed PH model takes approximately 17 hours, with a 16% reduction of carbon emissions, amounting to 2.28 kgCO2eq.
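As a sanity check, the reported total follows directly from hours x power x carbon efficiency:

```python
carbon_efficiency = 0.445  # kgCO2eq per kWh of the private infrastructure
gpu_power_kw = 0.300       # Tesla V100-SXM2-32GB TDP (300 W)
hours = 2000               # cumulative computation time

total_kg = hours * gpu_power_kw * carbon_efficiency
print(round(total_kg, 2))  # 267.0 kgCO2eq, matching the reported estimate
```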
In conclusion, we believe that the improved efficiency of our method with respect to standard models may be a small step towards reducing carbon emissions.

Fig. 2
Fig. 2. The quaternion convolution rule can be expressed as a sum of Kronecker products between the matrices A_i, which subsume the algebra rules, and the matrices F_i, which contain the convolution filters, with i = 1, 2, 3, 4. In this example, the parameters of A_i are fixed for visualization purposes, but in PHC layers they are learnable parameters.
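A minimal NumPy sketch of the Kronecker-product construction in this caption follows. The function name and array shapes are our assumptions for illustration, not the released implementation, which builds the kernel inside the layer itself:

```python
import numpy as np

def phc_weight(A, F):
    """Assemble a PHC kernel as the sum of Kronecker products
    W = sum_i kron(A_i, F_i) over the channel dimensions.

    A: (n, n, n) stack of matrices A_i encoding the algebra rules
       (learnable in PHC layers, fixed in e.g. quaternion layers)
    F: (n, o, c, k, k) stack of filter blocks F_i
    Returns W of shape (n*o, n*c, k, k), usable as a standard
    convolution kernel with n*o output and n*c input channels.
    """
    n, o, c, k, _ = F.shape
    # Channel block (p, q) of W is sum_i A[i, p, q] * F[i].
    W = np.einsum('ipq,iochw->poqchw', A, F)
    return W.reshape(n * o, n * c, k, k)

# With the fixed complex-algebra matrices (n = 2) and 1x1 scalar filters
# x = 3, y = 2, the kernel reproduces the complex multiplication
# structure [[x, -y], [y, x]].
A = np.array([[[1., 0.], [0., 1.]],     # real unit
              [[0., -1.], [1., 0.]]])   # imaginary unit
F = np.array([[[[[3.]]]], [[[[2.]]]]])  # shape (2, 1, 1, 1, 1)
print(phc_weight(A, F)[:, :, 0, 0])     # [[ 3. -2.]
                                        #  [ 2.  3.]]
```

Replacing the fixed matrices A_i with learnable parameters is what lets the layer grasp the convolution rule directly from data.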

Fig. 3 .
Fig. 3. Loss plots for the toy examples. The PHC layer is able to learn the matrix A describing the convolution rule for pure (left) and full quaternions (right).

Fig. 5 .
Fig. 5. Bar plot of the number of successes achieved by the models in Table II in each of the runs. The PHC-based models with n = 3 (red bar) far exceed the other configurations, being the most performing choice for the RGB image classification task.

Fig. 7 .
Fig. 7. Sample spectrograms from the L3DAS21 dataset recorded by one microphone with four capsules. The first four figures represent the magnitudes, while the last four contain the corresponding phase information. The black sections represent silent instants.

Fig. 8 .
Fig. 8. Radar plot for the SEDnet results on the L3DAS21 dataset with two microphones. The larger the area, the better the results. With the same computational time, PHC n = 2 gains better scores with respect to PHC n = 4, at the cost of more parameters. The real-valued SEDnet, despite its decent SED score, has a high computational time demand as well as the largest number of parameters.

TABLE I IMAGE
CLASSIFICATION RESULTS FOR VGG. THE ACCURACY MEAN AND STANDARD DEVIATION OVER THREE RUNS WITH DIFFERENT SEEDS ARE REPORTED. TRAINING (T) TIME AND INFERENCE (I) TIME REQUIRED ON CIFAR10. FOR TRAINING TIME WE REPORT, IN SECONDS PER 100 ITERATIONS, THE MEAN AND STANDARD DEVIATION OVER THE ITERATIONS IN ONE EPOCH, WHILE THE INFERENCE TIME IS THE TIME REQUIRED TO DECODE THE TEST SET. THE PHNN WITH n = 4 OUTPERFORMS THE QUATERNION COUNTERPART BOTH IN TERMS OF ACCURACY AND TIME. THE PHVGG WITH n = 2 FAR EXCEEDS THE REAL-VALUED BASELINE IN THE CONSIDERED DATASETS, WHILE BOTH PHVGG19 VERSIONS WITH n = 2, 4 ARE MORE EFFICIENT THAN THE REAL AND QUATERNION-VALUED BASELINES AT INFERENCE TIME. p-VALUE UNDER THE t-TEST: 0.0002.

TABLE II IMAGE
CLASSIFICATION RESULTS WITH RESNET MODELS. EACH EXPERIMENT IS RUN THREE TIMES WITH DIFFERENT SEEDS AND THE MEAN WITH STANDARD DEVIATION IS REPORTED. THE PROPOSED MODELS FAR EXCEED THE REAL-VALUED AND QUATERNION BASELINES IN ALMOST EVERY EXPERIMENT WE CONDUCT. INTERESTINGLY, THE PHNN OUTPERFORMS THE REAL-VALUED COUNTERPART BY 4% IN THE LARGEST-SCALE EXPERIMENT ON CIFAR100. THE TIMINGS ARE SIMILAR TO THOSE IN TABLE I, SO WE DO NOT REPEAT THEM HERE TO AVOID REDUNDANCY.

TABLE III IMAGENET
CLASSIFICATION WITH THE REAL-VALUED BASELINE AGAINST OUR BEST MODEL, PH n = 3. OUR APPROACH OUTPERFORMS THE BASELINE WHILE SAVING 66% OF THE PARAMETERS.

TABLE IV SEDNETS
RESULTS WITH ONE MICROPHONE (4-CHANNEL INPUT). SCORES ARE COMPUTED OVER THREE RUNS WITH DIFFERENT SEEDS AND WE REPORT THE MEAN. THE PROPOSED METHOD WITH n = 2 FAR EXCEEDS THE BASELINES IN EACH METRIC CONSIDERED.

TABLE V SEDNETS
RESULTS WITH TWO MICROPHONES (8-CHANNEL INPUT). SCORES ARE COMPUTED OVER THREE RUNS WITH DIFFERENT SEEDS AND WE REPORT THE MEAN. THE PHSEDNET n = 2 OUTPERFORMS THE BASELINES. FOR TRAINING TIME (SECONDS/ITERATION), THE MEAN AND STANDARD DEVIATION OVER ONE EPOCH ARE REPORTED; FOR INFERENCE TIME, WE REPORT THE TIME REQUIRED TO PERFORM AN ITERATION ON THE VALIDATION SET. PH-BASED MODELS FAR EXCEED THE BASELINES IN BOTH TRAINING AND INFERENCE TIME.

TABLE VI EXPERIMENTS
ON THE SVHN DATASET WITH THE SMALLEST NETWORKS FROM EACH FAMILY, RESNET20 AND VGG11, THE LATTER WITH A MODIFIED NUMBER OF FILTERS SO AS TO BE DIVISIBLE BY EACH VALUE OF n, AND FC LAYERS IN THE CLOSING CLASSIFIER. WE ALSO TEST THE PHNN WITH n = 1, WHICH REPLICATES THE REAL DOMAIN AND OUTPERFORMS THE REAL-VALUED RESNET20.

TABLE VII THE
FIRST LINES REPORT VGG16 RESULTS WITH A REAL-VALUED CLASSIFIER FOR QUATERNION AND PH NETWORKS, AS AN EXTENSION OF TABLE I. ADDITIONAL EXPERIMENTS WITH RESNET56 AND RESNET110, THE LATTER WITH A MODIFIED NUMBER OF FILTERS SO AS TO BE DIVISIBLE BY EACH VALUE OF n. THE ACCURACY SCORE IS THE MEAN OVER THREE RUNS WITH DIFFERENT SEEDS.

In Table VI, we also test the PHNN with n = 1 to replicate the real-valued model, outperforming it. Experiments with the VGG11 with a modified number of filters, so as to be divisible by each value of n, are also reported in the same table. Finally, in Table VII we report experiments on SVHN and CIFAR10 with ResNet56 and ResNet110, the latter with a modified number of filters. PH models achieve good performance in every test we conduct while reducing the number of free parameters. Indeed, the PHResNet20s reach almost 94% accuracy on the SVHN dataset while involving just 70k parameters.

TABLE VIII REAL
-VALUED RESNETS WITH CONVOLUTIONAL FILTERS REDUCED BY 75%, DENOTED BY (S). FULL MODELS EXCEED THE REDUCED VERSIONS IN EACH EXPERIMENT, PROVING THAT A SMALLER NUMBER OF PARAMETERS DOES NOT LEAD TO HIGHER GENERALIZATION CAPABILITIES.