Parameter Efficient Neural Networks With Singular Value Decomposed Kernels

David Vander Mijnsbrugge, Femke Ongenae, and Sofie Van Hoecke
Abstract-Traditionally, neural networks are viewed from the perspective of connected neuron layers represented as matrix multiplications. We propose to compose these weight matrices from a set of orthogonal basis matrices by approaching them as elements of the vector space of real matrices under addition and scalar multiplication. Making use of the Kronecker product for vectors, this composition is unified with the singular value decomposition (SVD) of the weight matrix. The orthogonal components of this SVD are trained with a descent curve on the Stiefel manifold using the Cayley transform. Next, update equations for the singular values and initialization routines are derived. Finally, acceleration for stochastic gradient descent optimization using this formulation is discussed. Our proposed method allows more parameter-efficient representations of weight matrices in neural networks. These decomposed weight matrices achieve maximal performance in both standard and more complicated neural architectures. Furthermore, the more parameter-efficient decomposed layers are shown to be less dependent on optimization and better conditioned. As a tradeoff, training time is increased up to a factor of 2. These observations are consequently attributed to the properties of the method and choice of optimization over the manifold of orthogonal matrices.

I. INTRODUCTION
Artificial neural networks are based on the mathematical representation (McCulloch-Pitts model) of artificial neurons [1]. Stacking layers of these neurons separated by nonlinear activation functions σ(·) creates a model, called the multilayer perceptron, that allows nonlinear modeling [2]. To obtain optimal weights for the network, these models have to be trained. In supervised learning, an appropriate objective function L is minimized over a set of training examples to achieve this. By using a matrix multiplication to represent the connections between neurons, the full problem statement for a single layer is written as follows:

h = σ(W x + b)    (1)
min_{W, b} L(h, y).    (2)

At the core of this process are the weight matrices or kernels W that transform each layer before activation. This kernel describes a set of linear equations between a hidden state vector and the input of the layer. From a mathematical standpoint, this kernel is contained in the set of all real matrices R^{M×N}, which is a vector space under addition and scalar multiplication [3]. Consequently, each weight matrix can be described as a linear combination of basis matrices {E_i} with span{E_i} = R^{M×N}, as formulated in

W = Σ_i s_i E_i.    (3)

By taking the basis known as either the standard or natural basis [see (4)], in which each basis matrix has a single unit entry and zeros elsewhere, the aforementioned neuron interpretation is recovered. In this natural basis, each basis matrix corresponds to an adjacency matrix for the connection between an input and an output neuron, where each coefficient s_i represents the weight of this connection. From the traditional viewpoint, these basis matrices are fixed and only the coefficients are trained. We propose a new perspective given the basis decomposition, which considers the weight matrix as a representation of the relationship between two entities. These entities are described using embeddings (Definition 1) and are commonly known from the natural language processing (NLP) domain [4], where they translate words or language entities into mathematical vectors.

Definition 1: Embeddings are vector representations of either categorical or continuous variables, often lower dimensional than the original one.
Deep learning embeddings have been observed to obey semantic rules reflecting the linguistic regularities present in the data. These semantic rules can be expressed as relations between entities, as in (5). Due to the linearity of the matrix multiplication, we stipulate that these semantic properties can be a direct consequence of the weight matrix representing the relationship, as portrayed by (6) in Example 1.
Example 1: Given that the embeddings of the words "Girl"/"Woman" and "young" obey the relation in (5), this relation can be captured by a weight matrix decomposed as in (6), where each basis matrix represents a sub-relation and s_i its strength. This interpretation strays away from the connected-neuron interpretation by making both the coefficients and the basis matrices trainable. Now, the transformation represented by the kernel W can be seen as a combination of basis transformations and their respective importance or strength, rather than neuron connections and their individual weights.
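To make the traditional decomposition concrete, the following NumPy sketch (sizes and names are illustrative) rebuilds a small kernel from natural-basis matrices as in (3) and (4); the proposed method replaces the fixed E_i below with trainable rank-one matrices.

    import numpy as np

    # Minimal sketch of the natural-basis view in (3) and (4): a kernel as a
    # linear combination of basis matrices, where each E_i connects exactly one
    # input neuron to one output neuron and s_i is the usual connection weight.
    M, N = 3, 4
    W = np.arange(M * N, dtype=float).reshape(M, N)   # some kernel
    coeffs = W.ravel()                                # the s_i in this basis
    basis = []
    for i in range(M * N):
        E = np.zeros((M, N))
        E[np.unravel_index(i, (M, N))] = 1.0          # natural basis matrix E_i
        basis.append(E)
    W_rebuilt = sum(s * E for s, E in zip(coeffs, basis))
    assert np.allclose(W, W_rebuilt)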
The objective of this article is to introduce this interpretation of the weight matrix and formalize it into a framework fitting with the current state of the art on deep learning. We also show the advantages of this method, i.e., improved parameter efficiency and generalization while maintaining performance, through thorough benchmark evaluations.
The remainder of this article is organized as follows. First, Section II presents the related work. Section III introduces the Kronecker product to make both the basis matrices and the coefficients trainable with a feasible number of parameters; this section also derives the associated training algorithm. Finally, the presented method is evaluated on benchmark datasets and the results are linked to the properties of the method in Section IV. Section V presents the major conclusions and future work.

II. RELATED WORK
The concept of learning sets of basis blocks is fundamental to dictionary learning and is often joined with sparsity constraints [5]. However, to the best of our knowledge, such an approach has not yet been explored for neural network layers. The resulting construction of weight matrices using singular value decomposition (SVD) has been proposed in the context of recurrent neural networks using Householder matrices [6]. Even though that work extends to general matrices, there is no mention of acceleration or initialization. Furthermore, it does not make use of the large body of work regarding optimization on the Stiefel manifold with the Cayley transform [7]-[11].
On the other hand, in [12], the Cayley transform and corresponding descent curve are effectively used to model unitary matrices in recurrent neural networks. In this work, a similar approach is taken for orthogonal matrices, but in the context of optimizing the orthogonal components of an SVD. In parallel, the SVD has been used to investigate the properties of neural networks, such as bounding the singular values for improved training [13] or using the SVD to reduce the latent space to a minimal effective dimension [14]. These works use the SVD retrospectively, while in this work, the decomposition is proactively taken as the starting point.

III. APPROACH
Starting from the basis decomposition in (3), the proposed method treats each layer by storing, initializing, and training the basis components and coefficients independently. In the forward pass, the weights are assembled from the decomposition and used to calculate the output. In the backward pass, each component is updated separately according to specific update equations. Our approach is divided into four parts. First, by using a matrix decomposition for the basis matrices, a specific form for the basis decomposition is proposed. Second, the corresponding update equations for both the basis matrices and the coefficients are derived. Third, initialization routines for each component are constructed. Fourth and finally, stochastic acceleration for each update equation is discussed.

A. Basis Matrix Decomposition
The interpretation from Section I requires making every weight in (3) trainable, resulting in O(R · N · M) parameters. Here, R is the size of the set of basis matrices. To make this number of parameters feasible, a significant reduction of parameters per basis matrix is required. Using the Kronecker product [15] of two vectors, each basis matrix is constructed as

E_i = u_i ⊗ v_i = u_i v_i^T.    (7)

This reduction in parameters allows the whole decomposition to consist of O(R · [N + M + 1]) parameters by substituting (7) into (3). Consequently, if the sets of vectors {u_i} and {v_i}, represented by U and V, are orthonormal, the resulting form corresponds to the SVD of W, as shown in the following equation:

W = Σ_{i=1}^{R} s_i (u_i ⊗ v_i) = U^T S V.    (8)

We can give a straightforward interpretation to this result. The hyperparameter R represents the dimension of the subspaces in which most of the embeddings connected by the relation reside. The orthogonality of U and V ensures that the set of basis matrices is also orthonormal under the Frobenius norm [see (9)], as derived in Appendix A-D. This changes the interpretation from the orthonormal basis sets {u_i} and {v_i} of the embeddings into a single orthonormal basis set for the relation, which is the same change of interpretation provided in Example 1. Furthermore, the hyperparameter R also grants insight into when the parameter reduction will be useful, namely, when the input and output data space can be adequately described by an R-dimensional subspace of R^N and R^M, respectively, or equivalently when W can be adequately described by an R-dimensional decomposition.

Parallel to this interpretation, the approach also addresses a recurring problem in neural networks, i.e., ill-conditioned weight matrices. Ill-conditioned matrices give rise to large fluctuations of the output of the linear system with respect to small fluctuations of the input. This property translates into saddle points in the loss function, which hampers training and leads to less stable networks [16]. The conditioning number κ, the ratio of the largest and smallest singular values of the kernel, quantifies this property

κ = σ_max / σ_min.    (10)

Large values of this conditioning number correspond to ill-conditioned matrices. Regularization of singular values has already been shown to improve training time and performance [13]. Our proposed method inherently contains this regularization, as standard weight regularization of the network acts on the decomposition coefficients. In terms of κ, this means that regularizing the weights in S leads to soft upper and lower bounds on κ and consequently avoids ill-conditioned matrices.
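As a minimal illustration of (8) and (10), assuming the rowwise convention for U and V used in this article, a decomposed kernel can be assembled and its conditioning read directly from S:

    import numpy as np

    # Sketch of (8): W = U^T S V with orthonormal rows in U and V, using
    # O(R(N + M + 1)) parameters instead of O(MN). Sizes are illustrative.
    M, N, R = 64, 32, 8
    rng = np.random.default_rng(0)
    U = np.linalg.qr(rng.standard_normal((M, R)))[0].T   # R x M, orthonormal rows
    V = np.linalg.qr(rng.standard_normal((N, R)))[0].T   # R x N, orthonormal rows
    s = rng.uniform(0.5, 1.5, size=R)                    # singular values
    W = U.T @ np.diag(s) @ V                             # assembled M x N kernel
    kappa = s.max() / s.min()                            # conditioning number (10)
    # The s_i are exactly the nonzero singular values of W:
    assert np.allclose(np.sort(np.linalg.svd(W, compute_uv=False)[:R]), np.sort(s))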

B. Update Equations
To effectively train the decomposed layer, a set of update equations for the weights is needed. These weights are most commonly optimized using stochastic gradient descent (SGD) [17]. At each iteration k, SGD subtracts the gradient of the objective function with respect to the weights at that iteration (∇_{w_k} L), scaled by a learning rate α. The standard form of an SGD weight update step is

w_{k+1} = w_k − α ∇_{w_k} L.    (11)

Next, update equations for U, S, and V in the form of (11) are derived. In light of the general decomposition in (3), the update equations for U and V, corresponding to the basis matrices, and for S, regarding the coefficients, are considered separately. This section starts by defining the necessary tools and then constructs the update equations for the orthogonal matrices in the SVD. From these, the update equations for the singular values are derived.

1) Orthogonal Matrices: By construction, the matrices U and V both describe a set of orthonormal basis vectors in a rowwise fashion. To preserve this property, update equations that generate orthogonal matrices at each iteration are required. Definition 2 describes the manifold on which these matrices lie.
Definition 2: The Stiefel manifold describes the set of all orthogonal matrices

V_{N,R} = {X ∈ R^{R×N} : X X^T = I}    (12)

under the canonical metric (13), where Z_1 and Z_2 are elements of the tangent space of X (T_X V_{N,R}) [18].

Optimization on the Stiefel manifold can be done via the geodesic described by the matrix exponential. However, for computational reasons, it is better to use quasi-geodesic methods [7], [10]-[12]. This quasi-geodesic is based on a Padé approximation of the matrix exponential, called the Cayley transform, described in Theorem 1.

Theorem 1: The Cayley transform of any skew-symmetric matrix A is orthogonal (proof in Appendix A-A)

φ_ν(A) = (I − (ν/2) A)^{−1} (I + (ν/2) A).    (14)

Definition 3 describes a parameterized descent curve on the Stiefel manifold for any objective function L using this transform.

Definition 3: Given an orthogonal matrix X and an objective function L, a descent curve on the Stiefel manifold is Y(ν) = φ_ν(A) X.

This descent curve is directly dependent on the derivative of the specified objective function, ∇_X L, which is used to define the action on X as A = ∇_X L X^T − X (∇_X L)^T. This action is a vector in the tangent space of the Stiefel manifold and, since it is a skew-symmetric matrix, the Cayley transform maps it to an orthogonal matrix. Due to the orthogonality restriction and the metric, this descent curve can be seen as a search over a surface in the direction that minimizes the loss function. This is visually represented in Fig. 1.

Optimization using this descent curve normally uses line search [10], [18] to determine step sizes according to the Armijo-Wolfe conditions, as described in Definition 4 [19], [20].

Definition 4: The Armijo-Wolfe conditions for curvilinear search are

L(Y(ν)) ≤ L(Y(0)) + c_1 ν L'(Y(0))
L'(Y(ν)) ≥ c_2 L'(Y(0))

where L'(Y(ν)) denotes the derivative of the objective along the curve. These conditions are to be interpreted as an upper and a lower bound on the step sizes, respectively, defined by two coefficients c_1 and c_2. As derived in Appendix A-C, they can be evaluated using the chain rule and the directional derivative of the Cayley transform with respect to ν. To cover the most acceptable step sizes, the coefficients are often taken to include a large portion of the interval [0, 1], e.g., c_1 = 0.1 and c_2 = 0.9. Doing a line search for each matrix would lead to 2L line searches at each training step for a network with L layers. Using such a method at each training iteration would thus increase training time by a factor proportional to the number of iterations needed for these searches to converge. Vaswani et al. [21] noted that these methods converge faster at the cost of computational time.
In the following derivation, we omit this line search for the benefit of training time. Note that line search can easily be added when striving for optimal convergence irrespective of training time.
Using the descent curve with a constant step ν avoids this cost by suboptimally choosing a value for ν. Equivalently, this optimizes both U and V by taking equidistant steps on the optimization surface along the descent curve. Equations (21) and (22) give the updates for U and V, respectively,

U_{k+1} = U_k φ_ν(A_U),  with A_U = (∇_{U_k} L)^T U_k − U_k^T ∇_{U_k} L    (21)
V_{k+1} = V_k φ_ν(A_V),  with A_V defined analogously.    (22)

The calculation of (21) and (22) requires the inversion of matrices with dimensions N × N and M × M, respectively. Theorem 2 manipulates this calculation into a term χ_ν that requires only a 2R × 2R matrix inversion.

Theorem 2: The Cayley transform φ_ν can be written as the sum of a unit matrix I and a matrix χ_ν whose computation requires the inversion of a 2R × 2R matrix [18].

With this transformation, the updates (21) and (22) are also put in additive form, corresponding to the weight update step defined in (11)

U_{k+1} = U_k + U_k χ_ν^U    (25)
V_{k+1} = V_k + V_k χ_ν^V.    (26)

2) Singular Values: The update equations for the coefficients, or equivalently the singular values, can be calculated given the two transformations of the orthogonal matrices and the partial derivatives of the objective function with respect to each component. Starting from the regular update equation (11) and the SVD notation in (8) at iteration k + 1, an expression for the singular value update is obtained by equating both and solving for S_{k+1}. The singular value matrix is isolated by using the orthogonality of both U and V, resulting in

S_{k+1} = U_{k+1} (W_k − α ∇_{W_k} L) V_{k+1}^T.    (27)

The gradient with respect to the assembled matrix in terms of the component derivatives is given in Theorem 3.

Theorem 3: The derivative of the objective function with respect to a matrix can be written as a function of the derivatives with respect to its SVD components [22] [see (28)].

Substituting (28) into (27) and using (23) gives the update equation for S_{k+1} only in terms of the components at iteration k and the partial gradients, as represented in (29). Rewriting this as an additive equation gives the final update (30), which uses only the diagonal components of an update matrix.
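The following NumPy sketch performs one constant-step update of the form (21) and (22) with a full-size inversion; the 2R × 2R reduction of Theorem 2 is omitted, and the sign and transpose conventions are our assumptions:

    import numpy as np

    def cayley_step(X, grad, nu=0.01):
        # One descent step on the Stiefel manifold for X with orthonormal rows.
        # The skew-symmetric action is built from the Euclidean gradient and
        # mapped through the Cayley transform (14), which is orthogonal.
        A = grad.T @ X - X.T @ grad                   # skew-symmetric action
        I = np.eye(X.shape[1])
        phi = np.linalg.solve(I - 0.5 * nu * A, I + 0.5 * nu * A)
        return X @ phi                                # rows stay orthonormal

    rng = np.random.default_rng(0)
    X = np.linalg.qr(rng.standard_normal((8, 3)))[0].T   # 3 x 8, orthonormal rows
    X_new = cayley_step(X, rng.standard_normal(X.shape))
    assert np.allclose(X_new @ X_new.T, np.eye(3), atol=1e-10)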

C. Initialization
Given the update equations from Section III-B, each component also needs proper initialization, as convergence is strongly dependent on the initial weights [23]. Neural network weights are initialized by randomly sampling each element from a specific distribution. Two distributions, namely, the Gaussian and the uniform distribution, are most commonly used for initialization. In terms of the components of the proposed decomposition, both U and V are orthogonal matrices. Any set of vectors can be made orthonormal using the Gram-Schmidt procedure [24]. Using this procedure, R random orthonormal vectors for both U and V are constructed. Since these orthogonal matrices do not determine the distribution of the resulting composed matrix, the initialization of the singular value vector is responsible for the weight distribution. As such, a specific initialization of the R singular values is needed for both the Gaussian and the uniform case with an arbitrary variance.

1) Normal: Initialization for the Gaussian case corresponds to sampling the initial weights W_{ij} ∼ N(0, σ²). Let us now consider the distribution of the singular values contained in S. According to Fisher-Hsu-Girshick-Roy [25], the singular values of a random Gaussian matrix have the marginal distribution described in Theorem 4.

Theorem 4: The marginal distribution for the unordered singular values λ of X ∼ N(0, 1) ∈ R^{M×N} is given by (32).

Theorem 4 makes use of the Laguerre polynomials (33). These polynomials are orthogonal with respect to the weighting function e^{−s} s^{r−t}, as portrayed by (34) [26]. This allows us to define a set of orthonormal functions as in (35), while the probability distribution takes the form of (36). This also shows that p_s is a probability distribution since its integral is one, which can be seen directly from the property in (34). To extend this to a general variance with zero mean, a change of variables is done. Initialization can now be performed by sampling values from this distribution and placing them on the diagonal of S.
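A hedged sketch of this routine, in which QR factorization plays the role of Gram-Schmidt and, as a practical stand-in for sampling the analytic marginal of Theorem 4, the singular values are drawn empirically from a reference Gaussian matrix:

    import numpy as np

    def svd_init(M, N, R, std=0.05, rng=None):
        # U and V are made rowwise-orthonormal via QR (Gram-Schmidt equivalent).
        # The singular values are drawn from the empirical singular value
        # distribution of a reference matrix with N(0, std^2) entries.
        rng = rng or np.random.default_rng()
        U = np.linalg.qr(rng.standard_normal((M, R)))[0].T   # R x M
        V = np.linalg.qr(rng.standard_normal((N, R)))[0].T   # R x N
        ref = std * rng.standard_normal((M, N))
        s = rng.choice(np.linalg.svd(ref, compute_uv=False), size=R)
        return U, np.diag(s), V

    U, S, V = svd_init(64, 32, R=16)
    W = U.T @ S @ V            # assembled rank-16 kernel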
2) Uniform: Initialization for the uniform case corresponds to sampling the initial weights W_{ij} ∼ U(−a, a). In a similar fashion to Section III-C1, a statement about the distribution of the singular values of this random matrix is needed. Theorem 5 states the limiting distribution of the singular values {σ_i²} for a uniform distribution given a symmetric interval and general variance.
This theorem, however, is not an exact statement like Theorem 4, but rather a limiting case. To verify the validity of the formula, the χ²-statistic for histogram comparison [27], given in (40), is calculated as an estimate of the error. The χ²-statistic measures the deviation from the expected number of counts per bin

χ² = Σ_i (c_i⁰ − c_i¹)² / (c_i⁰ + c_i¹)    (40)

where c_i⁰ and c_i¹ are the bin counts of two identically binned histograms.
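A small sketch of this comparison, assuming two samples of singular values and one common variant of the statistic in (40):

    import numpy as np

    def chi2_statistic(samples0, samples1, bins=50):
        # Identically binned histograms compared bin by bin, as in (40).
        lo = min(samples0.min(), samples1.min())
        hi = max(samples0.max(), samples1.max())
        c0, _ = np.histogram(samples0, bins=bins, range=(lo, hi))
        c1, _ = np.histogram(samples1, bins=bins, range=(lo, hi))
        mask = (c0 + c1) > 0                      # skip empty bins
        return np.sum((c0[mask] - c1[mask]) ** 2 / (c0[mask] + c1[mask]))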

D. Optimizer Dependence
The update equations (25), (26), and (30) all coincide with regular SGD by construction. Current state-of-the-art neural networks, however, are commonly trained using various acceleration methods for stochastic optimization [28], [29]. In the following, the two main components of acceleration, i.e., momentum and adaptive learning rates, are discussed from the viewpoint of both the basis matrices and the coefficients in the decomposition. Primarily, the effect of acceleration on orthogonality, when optimizing over the Stiefel manifold, is examined. The necessary changes to a naive implementation are made and added to the training routine.
1) Momentum: One way of accelerating optimization algorithms is adding momentum to each weight update, as portrayed by (41) and (42)

m_{k+1} = β m_k + ∇_{w_k} L    (41)
w_{k+1} = w_k − α m_{k+1}.    (42)

Momentum is named after the phenomenon from physics and corresponds to interpreting optimization as a ball rolling down a loss surface, which partially keeps its previous momentum after a change in direction. This translates into partially keeping the gradient of the previous step in mind when optimizing the current iteration.

When momentum is naively applied to (25) and (26), the orthogonality of U and V deteriorates with each iteration. Different implementations, such as the moving-average momentum implemented in the ADAM optimizer, which scales each new gradient term in (42) by a factor of 1 − β, are technically identical and suffer from the same drawback. In light of Theorem 3 and Fig. 1, this problem can be solved by applying the momentum step in (42) to the gradient from which the action is calculated. Due to the linearity of this action, this is equivalent to applying the momentum step in the tangent space of the manifold, as in [7], and projecting afterward. To stay consistent, this same method is applied to the gradients of the singular values.
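A sketch of this scheme, reusing the cayley_step helper from the earlier sketch: the moment accumulates raw Euclidean gradients, and the skew-symmetric action is built from the accelerated gradient, so orthogonality is preserved at every iteration (the gradient here is a random stand-in):

    import numpy as np

    rng = np.random.default_rng(1)
    X = np.linalg.qr(rng.standard_normal((8, 3)))[0].T   # 3 x 8, orthonormal rows
    m, beta = np.zeros_like(X), 0.9
    for _ in range(100):
        grad = rng.standard_normal(X.shape)   # stand-in for a real gradient
        m = beta * m + grad                   # momentum on the raw gradient (41)
        X = cayley_step(X, m, nu=0.01)        # action + Cayley keep X orthogonal
    assert np.allclose(X @ X.T, np.eye(3), atol=1e-8)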
2) Adaptive Learning Rate: Any gradient update as given in (11) consists of a gradient part and a step size or learning rate. By directly changing the latter, learning can be accelerated or slowed down. Formally, this is described by making the learning rate iteration dependent (α → α_k), as in

w_{k+1} = w_k − α_k ∇_{w_k} L.    (43)

Similar to momentum, running averages of the gradient norm are kept that scale the learning rate, as portrayed in (44). In the ADAM optimization algorithm, this variable learning rate scheme takes the form

v_{k+1} = β₂ v_k + (1 − β₂) (∇_{w_k} L)²,  α_k = α / (√v_{k+1} + ε).    (44)

When considering adaptive learning rates, the optimization on the Stiefel manifold runs into a problem: since the calculation of χ_ν requires a scalar value for ν, the adaptive learning rate has to be resolved before the calculation of χ_ν. Looking at (44), this adaptive learning rate is a scalar scaled inversely by a matrix, which is applied elementwise. By applying this elementwise operation to the gradient or moment matrix, as in (41) and (11), respectively, the learning rate returns to being a constant scalar and is compatible with the Cayley transform calculation. This adaptation can be interpreted as a scaling of the gradient or momentum for each matrix entry and is applicable to all adaptive learning rate schedules that allow extraction of a scalar. It is here that the line search omitted in Section III-B1 could also be reintroduced. However, as mentioned before, we did not follow this approach for computational reasons and used standard acceleration schemes from state-of-the-art stochastic optimization. In line with the treatment of the singular value gradients regarding momentum updates, this way of implementing the adaptive learning rate is also applied to the singular value updates.
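As a sketch of this workaround, the elementwise ADAM-style scaling of (44) can be folded into the gradient matrix so that the Cayley transform still receives a single scalar ν (names illustrative):

    import numpy as np

    def adam_scaled_grad(grad, v, beta2=0.999, eps=1e-8):
        # Running second moment as in (44), applied to the gradient itself so
        # the step size passed to the Cayley transform stays a scalar.
        v = beta2 * v + (1.0 - beta2) * grad ** 2
        return grad / (np.sqrt(v) + eps), v       # scaled gradient, new state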

IV. EXPERIMENTS
In this section, all experiments comparing our proposed method to the standard interpretation of weight matrices are described. Everything was performed using two Tesla V100-SXM3-32GB GPUs and eight Intel Xeon Platinum 8168 CPU cores @ 2.70 GHz. Each experiment is a specific implementation of the pseudocode described in Algorithm 1 using Tensorflow 2.2.0. Initialization for all experiments uses either the standard Tensorflow library or the corresponding routines from Section III-C. Implementations can be found in the open-source code.

Algorithm 1 Accelerated SVD Optimization Scheme
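The pseudocode itself is not reproduced here; the following TensorFlow sketch of one training step conveys the scheme under our assumptions (acceleration from Section III-D omitted; model_loss is a hypothetical callback computing the forward pass and loss from the assembled kernel):

    import tensorflow as tf

    def cayley(X, grad, nu):
        # Skew-symmetric action and Cayley transform (14); full-size inversion,
        # without the 2R x 2R reduction of Theorem 2.
        A = tf.transpose(grad) @ X - tf.transpose(X) @ grad
        I = tf.eye(tf.shape(X)[1], dtype=X.dtype)
        return X @ tf.linalg.solve(I - 0.5 * nu * A, I + 0.5 * nu * A)

    def train_step(U, S, V, x, y, model_loss, alpha=1e-2, nu=1e-2):
        with tf.GradientTape() as tape:
            tape.watch([U, S, V])
            W = tf.transpose(U) @ tf.linalg.diag(S) @ V   # assemble kernel (8)
            loss = model_loss(W, x, y)                    # forward pass
        gU, gS, gV = tape.gradient(loss, [U, S, V])
        U_new = cayley(U, gU, nu)          # stays on the Stiefel manifold
        V_new = cayley(V, gV, nu)
        S_new = S - alpha * gS             # plain SGD here; (30) also accounts
                                           # for the rotated U and V
        return U_new, S_new, V_new, loss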
The performed experiments are divided into three parts. First, regular feedforward networks are evaluated for different matrix sizes and ranks using optimization with SGD, SGD with momentum, and ADAM, as described in Section III-D. Second, the proposed method is used to substitute fully connected layers in residual network (ResNet) architectures. Finally, the transformer architecture is considered, in which each weight matrix is substituted for its decomposed variant. This order of experiments is chosen such that the properties of the decomposed weight matrices can be observed first and their performance for relevant architectures evaluated afterward. For all experiments, visual evaluations are included, which contain the learning and accuracy curves during training, a tracker of the orthogonality number (κ_orthogonality) defined in (45), which reflects how orthogonal the U and V components are at every step, and a tracker of the conditioning number (κ_conditioning) as defined in (10). All figures, both those discussed in the following and additional ones, can be found in Appendix B.

A. Feedforward
The first set of experiments compares regular and decomposed weight matrices using feedforward networks. For this purpose, a feature-based dataset is selected: the Cleveland heart-disease dataset [30] with 32 input features. A neural network with a single hidden layer and one sigmoidal output node for binary classification is evaluated. Binary cross entropy is used as the loss function and binary accuracy as the validation metric. A comparison for different ranks and hidden layer sizes is done for each of the three optimization schemes mentioned before. The results on a 20% validation split are given in Tables II-IV for standard SGD, SGD with momentum, and ADAM, respectively. All of these values are averaged over 25 training runs, and for each matrix size, the best performing architecture is highlighted.
Tables V and VI contain the number of parameters and the averaged training times for each model, respectively. Visual performance evaluation is included for three different parameter combinations. First, matrix size 32 with rank 32 is selected for its high performance under regular SGD and for being the only matrix size at which the regular approach outperforms its decomposed variant. Second, for the maximum matrix size of 512, which reaches the maximal performance over all methods, the rank 24 variant is selected since it performs best. Finally, matrix size 128 with rank 16 is added for intermediate comparison.
Comparing the optimizer performances in Tables II-IV shows that, unlike regular weight matrices, the decomposed matrices are far less dependent on the chosen optimization scheme. This observation is attributed to the more directed search granted by optimizing the orthogonal components of the decomposition over the Stiefel manifold. Eventually, both approaches reach maximal performance using the ADAM optimization scheme. In the corresponding results, presented in Table IV, the decomposed weight matrices consistently outperform their regular counterparts except for the square matrix at 32 units. This observation is in line with the assumption that decreasing the rank and directly learning well-conditioned matrices avoid overparameterization. This more efficient parameterization is also reflected in the number of parameters given in Table V. Table VI, however, shows the downside of the method: at each rank, the computational time goes up significantly. Comparing the conditioning number tracker in Fig. 2(a) with those in Fig. 2(b) and (c) confirms the role of conditioning. For the regular square matrix, the conditioning number does not grow out of bounds relative to the decomposed ones. In contrast, for the more rectangular matrices, the conditioning number keeps rising throughout training, while the decomposed variants remain well-conditioned. This ill-conditioned behavior likely negates some of the benefits of the increased hidden dimension. By using the decomposed matrices, the size of the hidden dimension can be enlarged sufficiently without these side effects, leading to increased performance. Both the 256 units/rank 32 and 512 units/rank 24 matrices significantly outperform the regular implementation at all sizes in validation accuracy.

B. Residual Networks
Following the previous results on feedforward networks, the decomposition layer is also introduced in more complicated state-of-the-art architectures, more specifically ResNets for image recognition [31]. Both the MNIST and CIFAR10 benchmark datasets are used for evaluation. For MNIST, a single residual block with 256 filters followed by a single feedforward layer of 256 units is used; for CIFAR10, a triple residual block architecture with 64, 128, and 256 filters followed by a 1000-unit feedforward layer is used. The residual blocks are separated by 2 × 2 average-pooling operations. It is these feedforward layers that are evaluated with the regular and the decomposed method. In line with the findings from Section IV-A, all optimization is done using the ADAM optimization scheme, since this results in the best performance both for the decomposed layer and for all other variables such as the convolutional layers. The initial learning rate is 0.01 and is decreased to 0.001 after 20 epochs. L2 regularization with coefficients 0.001 and 0.01 is used for the convolutional and feedforward layers, respectively. In the case of the decomposed variants, this regularization is imposed on the singular values. Tables VII and VIII summarize the performance comparison between decomposed approaches of multiple ranks and the full-rank regular approach for MNIST and CIFAR10, respectively. The corresponding learning and accuracy curves, orthogonality numbers, and conditioning numbers are given in Appendix B-B. The MNIST evaluation shows observations similar to Section IV-A. At smaller ranks, corresponding to fewer parameters, the same performance as with full-rank matrices is achieved, and at a similar number of parameters, rank 128, the decomposed matrix slightly outperforms the regular weight matrix. Furthermore, at all ranks, the discrepancy between train and test error is slightly smaller. This is indicative of better generalization properties, which are linked to the conditioning number of the feedforward layer, as shown in Fig. 3(a).
Here, it is clearly seen that the regular matrix implementation is much worse conditioned than its decomposed counterparts. In the training times, a growing discrepancy is seen as the rank of the decomposition grows. This is due to the number of matrix multiplications and the inverse needed in the backpropagation step. For CIFAR10, however, no significant improvement in performance is observed. This seems reasonable since the number of convolutional parameters greatly outnumbers the weights in the feedforward layer; it is also reflected in the absence of a significant computational discrepancy. However, as presented in Fig. 3(b), the conditioning of these feedforward layers is still much better for the decomposed layers. Since ill-conditioned matrices lead to large output fluctuations, these better conditioned matrices have the potential of being more robust to noisy input. To verify this assumption, the models are evaluated on a set of adversarial examples generated on the CIFAR10 dataset using the fast gradient sign method (FGSM) [32] with a perturbation strength ε. This method perturbs the original image in the direction of the sign of the gradient of the loss with respect to the image, given the class label and a set of model parameters θ

x_adv = x + ε sign(∇_x L(θ, x, y)).

Fig. 4 shows the results at different values of ε for 1000 adversarial examples created from the CIFAR10 dataset. The rank 128 model consistently outperforms the regular weight matrix and resists the FGSM attacks much better. This is in line with the smaller discrepancy between train and test error for the rank 128 model, as observed in Table VIII. On the other hand, the lower rank models do not perform well on this test. This observation shows another potential downside: when the rank is too small, the decomposed weight matrices do not have enough expressive capacity, seemingly leading to worse performance under perturbation. This reiterates the importance of choosing a correct value of R.
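For reference, a minimal FGSM sketch in TensorFlow, assuming a trained Keras model returning class probabilities and images scaled to [0, 1]:

    import tensorflow as tf

    def fgsm(model, images, labels, epsilon):
        # Perturb each pixel by epsilon in the direction of the loss gradient sign.
        loss_obj = tf.keras.losses.SparseCategoricalCrossentropy()
        with tf.GradientTape() as tape:
            tape.watch(images)
            loss = loss_obj(labels, model(images))
        grad = tape.gradient(loss, images)
        return tf.clip_by_value(images + epsilon * tf.sign(grad), 0.0, 1.0)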

C. Transformer
In the following set of experiments, the transformer architecture introduced by Vaswani et al. [33] is evaluated on the TED-Talks English-to-Portuguese translation dataset from the tensorflow datasets library. Standard accuracy on the predicted words is used as the evaluation metric. The transformer architecture is well suited for the change from regular to decomposed weight matrices since a single layer consists of a set of multihead attention and feedforward layers. The attention mechanism contains four linear weight multiplications and each feedforward layer has one; these can all be converted to decomposed weight matrices, as sketched below. Both the feedforward and attention layers have their own size and rank. For the attention layers, the size is referred to as depth, and for the feedforward layers, as width. The full transformer architecture consists of an encoder and a decoder that both have a set number of these composite layers. Each transformer is trained using the ADAM optimizer as described in Section III-D for optimal performance. Preprocessing is done using a standard word tokenizer with the number of words as a hyperparameter, and the embeddings are trained from scratch for each implementation. We compare a set of transformer architectures with depth 64 and width 256 using a vocabulary size of 1000 words. The architectures are summarized by the number of layers in both decoder and encoder, described by the parameter L, and a tuple of ranks for the depth and width, respectively, when using the decomposed weight matrices. At both rank settings 16/16 and 16/32 for the depth and width, respectively, a double-layer encoder and decoder contain fewer parameters than the full-matrix single-layer transformer. As such, both the single- and double-layer decomposed architectures are compared to the single-layer regular transformer. Table IX summarizes the results, while the corresponding learning and accuracy curves can be found in Appendix B-C.
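As an illustration of such a substitution, a hedged sketch of a decomposed dense layer (initialization routines and the custom Cayley updates for U and V live outside the layer and are omitted; names are illustrative):

    import tensorflow as tf

    class SVDDense(tf.keras.layers.Layer):
        # Dense layer whose kernel is assembled from its SVD components (8)
        # on every forward pass, exposing U, V, and the singular values s.
        def __init__(self, units, rank, **kwargs):
            super().__init__(**kwargs)
            self.units, self.rank = units, rank

        def build(self, input_shape):
            n = int(input_shape[-1])
            self.U = self.add_weight("U", shape=(self.rank, n))
            self.V = self.add_weight("V", shape=(self.rank, self.units))
            self.s = self.add_weight("s", shape=(self.rank,))

        def call(self, x):
            W = tf.transpose(self.U) @ tf.linalg.diag(self.s) @ self.V
            return x @ W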
For each decomposed transformer model, the training accuracy is higher than for the regular model, while on the test set, only the rank 16/32 models outperform it and the rank 16/16 models achieve similar performance. At the same number of parameters, the double-layer decomposed model with rank 16/32 achieves a small but significant improvement over the regular weight matrix transformer, as indicated in bold in Table IX. It seems that, at these ranks, the separate layers retain enough expressive capacity, while the reduction in parameters allows for a second layer, leading to more complex modeling and eventually increased performance. Fig. 5 shows the conditioning of the query weight matrix in the multihead attention and of the following feedforward layer in the first encoder. The conditioning numbers behave similarly to the previous experiments, leading to better conditioned matrices. In terms of training time, a significant increase of 150% for the single-layer and 200% for the double-layer decomposed transformer is seen, confirming the main downside of the decomposed method. This is due to the matrix inversion in (14) being calculated exactly and the large number of matrix multiplications during backpropagation. In [9], the Cayley transform is calculated iteratively, which trades accuracy for computation time. Since the orthogonal matrices are stored instead of the inverse Cayley transform of the skew-symmetric matrix, as proposed in [7], these computations are not present during inference. We accept the computational downside during training in favor of the upsides of increased performance, generalization, parameter efficiency, and fast inference times. Moreover, all training time results for both the ResNets and the transformers are reported without the calculation of the SVD of the regular weight matrix at each iteration, which is needed for the conditioning and orthogonality tracking. When access to the singular values is required, the decomposed formulation has significant advantages.

V. CONCLUSION
In this article, a different way of approaching weight matrices in neural networks is proposed, based on the observation that the set of real matrices is a vector space under addition and scalar multiplication. As such, any matrix can be described as a fitting linear combination of basis matrices. Formulating each basis matrix as a Kronecker product of two vectors makes the number of trainable parameters feasible and unifies this approach with the SVD of the weight matrix. Using the Cayley transform and a corresponding descent curve on the Stiefel manifold, we constructed update equations for the orthogonal matrices of which the basis matrices are composed. This constraint on the possible values of the weights allows for a more directed search, leading to improved performance and generalization. More stable performance over different optimization methods compared to the regular weight matrix approach is observed. A first direct consequence of this formulation is control over the rank of the weight matrix; presumably, each linear transformation can be sufficiently described or approximated by a rank-deficient matrix. A second consequence is that direct access to the singular values allows much better conditioning of the matrices by regularizing the singular values and consequently the conditioning number κ. Both these properties are observed in the experiments, allowing for more parameter-efficient formulations of neural networks without loss of performance. Furthermore, dedicated routines for both uniform and Gaussian initialization are derived, which allow state-of-the-art initialization of each component in the R-dimensional decomposition.
In future work, other possibilities for the construction of basis sets can be considered, keeping in mind the constraint on the number of parameters per layer. Furthermore, the idea that the relation blocks represented by the basis components could be reused in different layers can be entertained, thus leading to a reduction in the number of parameters. Not only the basis components but also the coefficients and the direct access to them open up possibilities. Current neural networks have static architectures and static weights for each sample. By allowing the coefficients to be a direct function of the input or, preferably, of the context of the input, dynamic networks that adapt better to specific inputs could be created.
We are convinced that formulating weight kernels as linear combinations of basis components gives much more flexibility and will open avenues to more data-efficient and more effective learning.