Revisiting Orthogonality Regularization: A Study for Convolutional Neural Networks in Image Classification

Recent research in deep Convolutional Neural Networks (CNNs) faces the challenges of vanishing/exploding gradients, training instability, and feature redundancy. Orthogonality Regularization (OR), which adds a penalty term that encourages the weight matrices of a neural network to be orthogonal, could be a remedy for these challenges but is surprisingly unpopular in the literature. This work revisits OR approaches and empirically answers the question: Which OR technique is the most powerful, even when compared against other regularizations such as weight decay and spectral norm regularization? We begin by reviewing the improvements that various regularization techniques, and OR approaches in particular, bring to a variety of architectures. We then disentangle the benefits of OR relative to other regularization approaches by connecting them to norm preservation and feature redundancy in forward and backward propagation. Our investigation shows that Kernel Orthogonality Regularization (KOR) approaches, which directly penalize the orthogonality of convolutional kernel matrices, consistently outperform other techniques. We propose a simple KOR method that considers both row- and column-orthogonality and whose empirical performance is the most effective in mitigating the aforementioned challenges. We further discuss several circumstances, over recent CNN models and various benchmark datasets, in which KOR becomes even more effective.


I. INTRODUCTION
RECENT Convolutional Neural Networks (CNNs) have evolved into deeper and broader structures to obtain more accurate results [1], [2]. However, stacking many layers causes several training issues such as vanishing/exploding gradients [3] and over-parameterization [4]. A large body of literature addresses these issues, e.g., weight decay [5], Dropout [6], and skip connections [7]. Such refinements of loss functions and optimization procedures play a major role in the solution while barely changing the computational complexity.
Orthogonality Regularization (OR) is one of the most potent techniques contributing to these advancements by enforcing the weight matrices W to be semi-orthogonal (i.e., WW^⊤ = I or W^⊤W = I, where I is the identity matrix). Orthogonality of the linear transformations can preserve energy [8] and reduce the redundancy of the model's filter responses [9]. With an orthogonal initialization scheme, vanilla CNNs can be trained even with thousands of layers [10].
Orthogonality for convolutional layers can be injected in two different ways: (1) Kernel Orthogonality Regularization (KOR), which penalizes the Gram matrix of the convolutional kernel toward the identity, and (2) Convolutional Orthogonality Regularization (COR), which strictly enforces the magnitudes of the input and output features to be the same by using the Doubly Block-Toeplitz (DBT) matrix. Both lines of work generally use the Frobenius norm [8], [9], [11], [12] as the penalty function. In KOR approaches, the spectral norm is also used as the regularizer, although it incurs expensive computational costs [13]. To circumvent this complexity issue, an efficient KOR algorithm called Spectral Restricted Isometry Property (SRIP) was proposed; it applies the power method to a shifted weight matrix [14]. Notably, in the NeurIPS 2019 MicroNet Challenge, team KAIST used a variant of SRIP and ranked 2nd and 3rd in the CIFAR-100 track [15], showing that it facilitates generalization without additional resources such as parameter storage or computational complexity.
Despite their strengths, OR approaches have attracted little attention and have been discussed in only a few papers. In the literature, they generally demonstrate their superiority through accuracy improvements alone, without examining a collection of other regularization methods such as weight decay [5] and spectral norm regularization [16] under the same training settings. Furthermore, how they compare to other regularization techniques with respect to the vanishing/exploding gradient issue, training instability, and feature redundancy has rarely been studied comprehensively.
Contribution In this paper, we examine OR methods along with a collection of regularization techniques that require no additional resources, over various architectures. In particular, we shed light on the impact of OR methods when training deep CNN models. The contributions of this paper are as follows:
• We empirically observe that KOR approaches consistently outperform the other regularization techniques defined on model parameters. In particular, our proposed simple method, referred to as SRIP+, a modified version of SRIP that also considers the row-orthogonality of weight matrices, is slightly better than the other OR methods.
• We demonstrate that SRIP+ implicitly improves the norm preservation effects and mitigates filter redundancy, so that the models gain more from training. We show this positive effect by analyzing the magnitudes of the layer-wise responses [8], the spectral norm of the convolutional kernel [16], the row-orthogonality of the kernel matrix [9], and the weights of the batch normalization layers [17].
• We investigate a bag of benefits of the SRIP+ method regarding parameter initialization, training acceleration, a simple modulation of SRIP+ according to the structure of CNN models, the loss landscape, and representation quality. Especially with such simple modulations, we gain more effectiveness from SRIP+ on recent mobile-friendly architectures such as MobileNets [18], [19] and EfficientNet [20]. The experiments are conducted on the CIFAR10, CIFAR100, TinyImageNet, and ImageNet datasets, where the proposed approach improves accuracy by up to 4.89%, 10.85%, 9.1%, and 0.89%, respectively.
The rest of the paper is organized as follows. After describing the preliminaries for the various regularization methods (Section II), we describe our method and experimental settings (Section III). In Section IV, we discuss our findings and the empirical evidence for each. We describe the related works in Section V. Section VI concludes the paper.

A. USEFUL DEFINITIONS
In this paper, we denote by f_θ a CNN parameterized by θ and by L the objective function. Here f_θ consists of multiple nonlinear functions, and each nonlinear function is usually divided into three parts: a linear transformation part with weight matrix W_l, a non-linearity part with an element-wise activation function ϕ such as ReLU, and a normalization part N_l such as batch normalization (BN). Here l ∈ {1, 2, ..., L} indicates the index of the layer. Under this notation, the learnable parameters are θ = {W_l, N_l | l = 1, 2, ..., L}. Denoting the l-th layer's output by y_l, the forward propagation is derived as follows: y_{l+1} = ϕ(N_{l+1}(Conv(W_{l+1}, y_l))). We denote the set of weight matrices by W = {W_1, . . . , W_L}. We define the remaining notions as follows:
Orthogonality/orthonormality Two non-zero vectors u, v ∈ R^d are orthogonal when u^⊤v = 0 and are orthonormal when u^⊤v = 0 and ∥u∥ = ∥v∥ = 1, where ∥·∥ denotes the Euclidean norm.
Semi-orthogonal We call a weight matrix W ∈ R^{m×n} semi-orthogonal when WW^⊤ = I or W^⊤W = I, i.e., either the column vectors or the row vectors form an orthonormal set.
Matrix norms In this paper, we frequently use two different matrix norms, the Frobenius norm and the spectral norm. For any matrix A ∈ R^{m×n}, they are defined as follows:

∥A∥_F = (Σ_{i,j} a_ij²)^{1/2},   ∥A∥_2 = max_{x ≠ 0} ∥Ax∥ / ∥x∥,

where a_ij is the (i, j) element of A.
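These definitions can be checked numerically. Below is a minimal NumPy sketch (variable names are illustrative) computing both norms and verifying semi-orthogonality of a wide matrix with orthonormal rows:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 6))

# Frobenius norm: square root of the sum of squared entries.
fro = np.sqrt((A ** 2).sum())
assert np.isclose(fro, np.linalg.norm(A, 'fro'))

# Spectral norm: largest singular value, i.e. max ||Ax|| / ||x||.
spec = np.linalg.svd(A, compute_uv=False)[0]
assert np.isclose(spec, np.linalg.norm(A, 2))

# A semi-orthogonal W in R^{m x n} with m < n satisfies W W^T = I_m.
W = np.linalg.qr(rng.standard_normal((6, 4)))[0].T   # 4 x 6, orthonormal rows
assert np.allclose(W @ W.T, np.eye(4))
```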

B. KOR
KOR is a class of regularization methods that push the weight matrices of a model toward semi-orthogonality. Note that, to apply KOR, each convolutional operator is generally reshaped into a matrix W ∈ R^{m×n} with m = C_out and n = S × H × C_in, where S, H, C_in, and C_out are the filter width, filter height, number of input channels, and number of output channels, respectively (Figure 1). The evolution of these methods follows the choice of matrix norm. For instance, [8] adds the following term, referred to as Soft Orthogonality (SO), to the main objective:

L(f_θ(x_i), y_i) + (λ / |W|) Σ_{W ∈ W} ∥W^⊤W − I∥²_F,

where x_i and y_i are the input data and the corresponding label, λ is the regularization coefficient, I is an identity matrix, and |W| is the cardinality of W. On the other hand, SRIP regularization [14] replaces the Frobenius-norm distance with the spectral norm. Since the spectral norm is computationally expensive, it is approximated with the power iteration method using two iterations. The additional loss of SRIP is computed as follows:

λ Σ_{W ∈ W} σ(W^⊤W − I_n),

where σ(W^⊤W − I_n) is the output of the power method (Eq. 3):

u = (W^⊤W − I_n) v,   σ(W^⊤W − I_n) ≈ ∥(W^⊤W − I_n) u∥ / ∥u∥,

where the vector v ∈ R^n is randomly initialized from a normal distribution.
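A minimal NumPy sketch of the two penalties (function names and the λ value are illustrative; SRIP's σ is approximated with a short power iteration as described above):

```python
import numpy as np

def so_penalty(W, lam=1e-2):
    """Soft Orthogonality: lam * ||W^T W - I||_F^2 for one matrix."""
    n = W.shape[1]
    D = W.T @ W - np.eye(n)
    return lam * (D ** 2).sum()

def srip_penalty(W, lam=1e-2, iters=2, seed=0):
    """SRIP: approximate ||W^T W - I||_2 with a short power iteration
    (D is symmetric, so the iteration converges to its top |eigenvalue|)."""
    n = W.shape[1]
    D = W.T @ W - np.eye(n)
    v = np.random.default_rng(seed).standard_normal(n)
    for _ in range(iters):
        Dv = D @ v
        nv = np.linalg.norm(Dv)
        if nv == 0.0:               # D is exactly zero: penalty is zero
            return 0.0
        v = Dv / nv                 # normalize to avoid under/overflow
    return lam * np.linalg.norm(D @ v)

# For a matrix with orthonormal columns, both penalties are (near) zero.
W = np.linalg.qr(np.random.default_rng(1).standard_normal((8, 4)))[0]
assert so_penalty(W) < 1e-10
assert srip_penalty(W) < 1e-6
assert so_penalty(2 * W) > 0.01     # scaling away from orthonormality is penalized
```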
Row-column orthogonality equivalence on SO [9] showed that the row orthogonality and column orthogonality are equivalent in terms of MSE. This implies that regularizing either row or column orthogonality is the same, so we consider only the column orthogonality of SO in this paper.
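This equivalence can be verified numerically: the two Frobenius objectives differ only by a shape-dependent constant, so they induce the same gradients with respect to W (a small NumPy check; the shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 3, 7                       # overcomplete: more columns than rows
W = rng.standard_normal((m, n))

col = np.linalg.norm(W.T @ W - np.eye(n), 'fro') ** 2
row = np.linalg.norm(W @ W.T - np.eye(m), 'fro') ** 2
# The two objectives differ only by the constant n - m,
# so regularizing either one drives W toward the same minimizers.
assert np.isclose(col, row + (n - m))
```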

C. CONVOLUTIONAL ORTHOGONALITY REGULARIZATION (COR)
Recently, instead of KOR, [9], [12] proposed a direct constraint that preserves the norm between the l-th layer output y_l and x_{l+1}, the output of the linear transformation. They enforce orthogonality directly on the Doubly Block-Toeplitz (DBT) matrix K_{l+1}, i.e., the matrix for which x_{l+1} = K_{l+1} y_l (Eq. 4). The DBT matrix K is different from W, which is not the actual linear operator applied during propagation. [12] showed that KOR makes a network only partially isometric, while COR enforces the norm preservation property more strictly in most cases, except when the kernel size and stride of the convolutional filter are both equal to 1.
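As a hedged one-dimensional illustration of the DBT idea (the 2-D case tiles the same construction into doubly block-Toeplitz form), the following sketch builds the Toeplitz matrix of a stride-1 "valid" cross-correlation and checks that it reproduces the sliding-window result; all values are made up:

```python
import numpy as np

# 1-D "valid" cross-correlation, stride 1.
x = np.array([1., 2., 3., 4., 5.])
k = np.array([0.5, -1., 2.])
out_len = len(x) - len(k) + 1

# Toeplitz matrix K: shifted copies of the kernel on its rows,
# so that the convolution is the linear map y = K x.
K = np.zeros((out_len, len(x)))
for i in range(out_len):
    K[i, i:i + len(k)] = k

y_mat = K @ x
y_ref = np.array([np.dot(k, x[i:i + len(k)]) for i in range(out_len)])
assert np.allclose(y_mat, y_ref)
```

Regularizing the orthogonality of K (rather than of the reshaped kernel W) is what distinguishes COR from KOR.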

D. RELATED REGULARIZERS
Both SO and SRIP have counterparts that do not consider orthonormality: these counterparts penalize only the Frobenius norm or the spectral norm of the weight matrices rather than the difference W^⊤W − I.
Weight Decay (WD) WD is a common technique for regularizing the model parameters with the Frobenius norm:

(λ/2) Σ_{W ∈ W} ∥W∥²_F.

Spectral Norm Regularization (SNR) In [16], the authors showed that the generalization error of CNN models decreases when applying SNR instead of the Frobenius norm (WD):

(λ/2) Σ_{W ∈ W} ∥W∥²_2,

where the spectral norm is approximated similarly to SRIP.
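A minimal sketch contrasting the two penalties (function names and λ are illustrative; an exact SVD stands in for the power-iteration approximation used in practice):

```python
import numpy as np

def wd_penalty(W, lam=1e-4):
    """Weight decay: (lam / 2) * ||W||_F^2."""
    return 0.5 * lam * (W ** 2).sum()

def snr_penalty(W, lam=1e-4):
    """Spectral norm regularization: (lam / 2) * ||W||_2^2."""
    sigma = np.linalg.svd(W, compute_uv=False)[0]   # largest singular value
    return 0.5 * lam * sigma ** 2

W = np.diag([3., 1., 0.5])
assert np.isclose(wd_penalty(W), 0.5 * 1e-4 * (9 + 1 + 0.25))
assert np.isclose(snr_penalty(W), 0.5 * 1e-4 * 9)
# SNR only touches the largest singular value, so it is never larger than WD.
assert snr_penalty(W) <= wd_penalty(W)
```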

A. METHOD
Both SO and SRIP aim to make W have orthonormal columns through their respective regularization functions. When the number of rows is smaller than the number of columns, i.e., m < n, these functions may fail to achieve semi-orthogonality, since in that case the rows, rather than the columns, must be orthonormal. Fortunately, SO does not need to distinguish whether m < n: ∥W^⊤W − I∥²_F is minimized, to the value max{0, n − m}, if and only if W is semi-orthogonal [9] (Proposition 1). However, SRIP has to change its regularization function when m < n: every matrix satisfying ∥W∥_2 ≤ √2 minimizes ∥W^⊤W − I∥_2 to 1 when m < n, so SRIP merely acts like SNR, restricting the spectral norm (Proposition 2).
Proposition 1. The row orthogonality and column orthogonality are equivalent in the Frobenius sense [9], i.e.,

∥W^⊤W − I_n∥²_F = ∥WW^⊤ − I_m∥²_F + (n − m).

We thus propose a simple variant of SRIP, referred to as SRIP+, which makes only semi-orthogonal W a solution, as in SO. To this end, we simply consider the row-orthogonality term σ(WW^⊤ − I_m) when m < n instead of the column-orthogonality term. We also raise the power of σ in SRIP from 1 to 2 (Eq. 7) so that the penalty grows more sharply as W moves further away from semi-orthogonality:

(λ / |W|) Σ_{W ∈ W} σ(WW^⊤ − I_m)²   if m < n,   (λ / |W|) Σ_{W ∈ W} σ(W^⊤W − I_n)²   otherwise.
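A sketch of the shape-dependent choice in SRIP+ (illustrative names; exact σ instead of the two-step power iteration), showing that for a wide semi-orthogonal matrix the row-orthogonality term vanishes while the column version is stuck at 1:

```python
import numpy as np

def srip_plus(W, lam=1e-2):
    """SRIP+: penalize sigma(W W^T - I_m)^2 when m < n (wide matrix),
    otherwise sigma(W^T W - I_n)^2."""
    m, n = W.shape
    D = W @ W.T - np.eye(m) if m < n else W.T @ W - np.eye(n)
    sigma = np.linalg.svd(D, compute_uv=False)[0]
    return lam * sigma ** 2

rng = np.random.default_rng(3)
# Wide semi-orthogonal W (orthonormal rows): SRIP+ penalty is ~0 ...
W = np.linalg.qr(rng.standard_normal((6, 3)))[0].T     # 3 x 6
assert srip_plus(W) < 1e-10
# ... whereas the column-orthogonality residual cannot go below 1.
assert np.isclose(np.linalg.norm(W.T @ W - np.eye(6), 2), 1.0)
# A non-orthogonal square matrix is penalized as expected.
assert srip_plus(2 * np.eye(3)) > 0
```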

B. EXPERIMENTAL SETTINGS
CIFAR We train all models using the standard PyTorch SGD optimizer with Nesterov momentum of 0.9 for 600 epochs [21], repeated three times. We use the cosine annealing scheduler for the learning rate, with an initial learning rate of 0.1 and a minimum of 0.0005, and batch size 128 [22]. We use dropout of 0.3 and ℓ2 weight decay of 1e-5. All convolutional layers use batch normalization with an average decay of 0.99. We do not apply any data augmentation. Otherwise, the training settings described in the original papers [18]–[20] are used.
TinyImageNet / ImageNet We train MobileNetV2, MobileNetV3, and EfficientNet-B0 to verify our method. First, for TinyImageNet, we use the same training recipe as the CIFAR setting. Next, for ImageNet, we use the cosine annealing scheduler with an initial learning rate of 0.025, a minimum of 0.00005, and batch size 256. In the first 5 steps, we gradually warm up the learning rate from 0 to 0.025. Finally, all other training settings follow the original papers [18]–[20].

IV. DISENTANGLING THE BENEFITS OF OR

A. COLLECTION OF DIFFERENT REGULARIZATIONS
As a survey, we compare recent regularization algorithms that act on the prior distribution of model parameters (Table 2). Here, we do not apply WD together with the other regularizations, so as to observe their intrinsic effects, although some of them are combined with WD in their original papers. As shown in Table 2, KOR approaches consistently improve model performance more than WD, SNR, and COR. Furthermore, SRIP+ is the best method among the KOR family in this survey, although the margin is small. We observe the same ordering for all evaluated models. Regarding training acceleration, as pointed out in [14], we empirically observe that KOR methods also accelerate the early stages of training (Figure 2) compared to other regularization methods.

[Table: dataset statistics.]
Dataset        Resolution  Train data  Test data  Classes
CIFAR10        32x32       50,000      10,000     10
CIFAR100       32x32       50,000      10,000     100
TinyImageNet*  64x64       100,000     10,000     200
ImageNet*      224x224     1,280,000   50,000     1,000

B. NORM PRESERVATION EFFECTS
In this subsection, we investigate why the performance gains arise when deploying the KOR approaches (Table 2). Here, we attempt to answer the following question: what benefit do such regularizations provide?
At first glance, we investigate how such regularizations enable isometric learning during the forward dynamics [8].
The isometric property means that each layer of the network preserves the inner product in both forward and backward propagation [12]. [10], [12] demonstrated that the isometric property can stabilize the network activations. For the evaluation, we measure the ratio between the magnitudes of the l-th layer output y_l and the output of the linear transformation x_{l+1}, i.e., ∥x_{l+1}∥ / ∥y_l∥. Analogously, [10], [16] clarified that the spectral norm is also related to the network activations. Consider the forward propagation in a layer-wise manner, i.e., x_{l+1} = K_{l+1} y_l. By the definition of the spectral norm,

∥x_{l+1}∥ = ∥K_{l+1} y_l∥ ≤ ∥K_{l+1}∥_2 ∥y_l∥. (Eq. 8)

As Eq. 8 shows, the spectral norm controls how tightly the forward propagation can fluctuate, and the backward propagation is bounded in a similar manner. This bound suggests that the signal during propagation is governed by the spectral norm: if ∥K∥ goes to 0, the vanishing gradient issue occurs; if it goes to ∞, an exploding gradient issue occurs [16]. Figure 3a shows a box plot of how well the norm of the network activations is preserved across all layers. As shown in Figure 3a, all regularizations attempt to keep the ratio close to 1, but the variance differs significantly. A network trained with WD or SNR has a larger variance than the regularizations dealing with orthogonality, and among the latter, COR practices isometric learning best. A similar consistency is found in Figure 3b, which depicts the spectral norm of the DBT matrices: COR and all categories of KOR bound the spectral norm of the DBT matrices near 1. Despite this consistency, the KOR approaches outperform COR in our empirical studies.
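The bound in Eq. 8, and the exact norm preservation achieved by an orthogonal operator, can be checked numerically (an illustrative NumPy sketch with made-up shapes):

```python
import numpy as np

rng = np.random.default_rng(4)
K = rng.standard_normal((5, 5))
y = rng.standard_normal(5)

# ||K y|| <= ||K||_2 * ||y||: the spectral norm bounds how much one
# layer can amplify (or shrink) activations during forward propagation.
spec = np.linalg.norm(K, 2)
assert np.linalg.norm(K @ y) <= spec * np.linalg.norm(y) + 1e-12

# An orthogonal operator preserves the norm exactly (ratio = 1).
Q = np.linalg.qr(rng.standard_normal((5, 5)))[0]
assert np.isclose(np.linalg.norm(Q @ y), np.linalg.norm(y))
```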
We believe this is because COR imposes stricter constraints than the KOR approaches, since it directly penalizes the DBT matrices. In a deep CNN model, most filters appear to benefit from norm preservation, but isometric learning rather obstructs the training of some filters, such as a downsampling layer whose stride is larger than 1, the stem layer, etc. As evidence, when deploying COR and WD together, we can improve the performance to a level similar to SRIP+, while the norm-preserving behavior of COR resembles that of the KOR approaches. Among the KOR methods, the spectral norm performs better than the Frobenius norm; we attribute this to a similar reason: SO strictly limits the parameters to an orthogonal space, while SRIP and SRIP+ allow a wider range of matrices with a spectral norm of 1.

C. FILTER REDUNDANCY
So far we have provided empirical evidence that the benefits of KOR are primarily caused by isometric properties. We now investigate a further effect: how the filters in a layer (the row vectors of W) become as orthogonal as possible [9]. Orthogonal filters imply that the filter responses are much less redundant, so that the model utilizes more of its capacity for feature expressiveness. Here, we calculate the filter similarities using the absolute sum of the cosine similarities between the row vectors of W (Eq. 9):

Σ_{j ≠ i} | W_i^⊤ W_j / (∥W_i∥ ∥W_j∥) |,

where W_i is the i-th row vector of W. Figure 3c shows the box plot of the magnitudes of Eq. 9 for all filters of ResNet-56. As shown in Figure 3c, the KOR approaches and COR remove correlations among filters compared to WD and SNR, and COR leads to a slightly more ideal situation. We also examine the channel importance of each filter through the BN weight γ. [17] showed that, since every γ multiplies a normalized random variable, channel importance becomes comparable across different layers by measuring the magnitude of γ. Looking at the γ values in Figure 3d, SRIP+ appears to keep all channels in each filter important compared to other methods. This means that a model trained with SRIP+ has fewer filters that are considered relatively redundant.
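A sketch of this per-filter redundancy measure (Eq. 9); the helper name is ours:

```python
import numpy as np

def filter_redundancy(W):
    """For each filter (row of W), sum the |cosine similarity| to all
    other filters; 0 means the filters are mutually orthogonal."""
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    C = np.abs(Wn @ Wn.T)
    np.fill_diagonal(C, 0.0)         # exclude self-similarity
    return C.sum(axis=1)             # one redundancy score per filter

rng = np.random.default_rng(5)
orth = np.linalg.qr(rng.standard_normal((8, 8)))[0][:4]   # orthonormal rows
assert np.allclose(filter_redundancy(orth), 0.0)
dup = np.vstack([orth[0], orth[0]])                       # duplicated filter
assert np.allclose(filter_redundancy(dup), 1.0)
```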

D. SRIP + AND ORTHOGONAL INITIALIZATION
We empirically observe that networks with SRIP+ can be trained with the Xavier [23] and Kaiming [1] initialization methods, but it is important to note that, with these initializations, the networks fail to train as λ grows. Because SRIP+ changes the optimization landscape, such initialization methods can no longer provide good initial points.
We demonstrate that, depending on the initialization method, the weight matrices cannot be trained. For example, when the spectral norm of an initialized weight matrix is either very large or very close to zero, the SRIP+ term becomes so large that the network cannot be trained. The methods of [9], [10] can initialize all convolutional kernels to be isometric with Dirac-delta initialization, but this breaks the approximated spectral norm, since the denominator in Eq. 3 always remains zero. To resolve this issue, we construct the weight matrix as a semi-orthogonal matrix at the beginning, which is known as orthogonal initialization [24]. The initialization proceeds as follows:
1) Fill the matrix W from the normal distribution with mean 0 and variance 1.
2) Compute the QR factorization and make Q uniform. For the overcomplete case W ∈ R^{m×n} (m < n), transpose the matrix before computing the QR factorization.
3) Initialize the weight matrix to Q and the bias b = 0.
Figure 3 shows our empirical study of the robustness of SRIP+ with respect to λ when the neural networks are trained with SRIP+. We evaluate this through the training and test accuracy after the first epoch. Orthogonal initialization consistently outperforms the other initialization methods in terms of training and test accuracy after the first training epoch. Furthermore, when λ increases, a network initialized with either the Xavier or the Kaiming method fails to train (denoted as '-' in Figure 3), but orthogonal initialization removes this adverse effect.
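The steps above might be sketched as follows (a hedged NumPy version; "make Q uniform" is realized here by the usual sign correction with the diagonal of R, as in standard orthogonal-initialization implementations):

```python
import numpy as np

def orthogonal_init(m, n, rng=None):
    """QR-based orthogonal initialization: Gaussian fill, QR factorization,
    sign correction so the result is uniformly distributed; the transpose
    handles the overcomplete (wide) case m < n."""
    rng = rng or np.random.default_rng(0)
    A = rng.standard_normal((max(m, n), min(m, n)))
    Q, R = np.linalg.qr(A)
    Q = Q * np.sign(np.diag(R))      # sign-correct columns (diag(R) != 0 a.s.)
    return Q.T if m < n else Q

W = orthogonal_init(3, 5)            # wide: orthonormal rows
assert np.allclose(W @ W.T, np.eye(3))
W = orthogonal_init(5, 3)            # tall: orthonormal columns
assert np.allclose(W.T @ W, np.eye(3))
```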

E. SRIP + MODULATION FOR DIFFERENT LAYERS
In this subsection, we investigate the improvements obtained by applying SRIP+ to different categories of layers, considering the structure of CNNs. We categorize the layers as follows: fully connected (FC) layers, pointwise convolutions, and other convolutional layers. Because pointwise convolutions require no flattening before deploying SRIP+, we suspect that pointwise convolutions inherently possess different characteristics from other convolutional layers. Moreover, we test SRIP+ with a C-EfficientNet [15] model as well as a ResNet model on CIFAR10, because C-EfficientNet consists of depthwise separable convolutions, unlike the ResNet series.
As shown in Table 4, SRIP+ consistently creates positive synergy with all types of layers, but which filters benefit most depends on the model structure. Specifically, pointwise convolution is the most promising in models with a bottleneck structure. Furthermore, for depthwise separable convolutions, enforcing SRIP+ on both the depthwise and pointwise convolutions seems to hinder generalization: we obtain an accuracy of 93.81 ± 0.14 on C-EfficientNet-1 when applying KOR to all layers except the convolutions whose kernel size is greater than 1, slightly outperforming deployment on all layers.
To verify the validity of this simple modulation, we systematically evaluate SRIP+ for various CNN models whose convolutional filters consist of pointwise and depthwise convolutional layers. We deploy the algorithm, denoted mSRIP+, that applies SRIP+ to all layers except the depthwise convolutional layers in the model. WD is not used for any layer. For architectures, we test this strategy on MobileNets [18], [19] and EfficientNet [20]. As shown in Table 5, we consistently improve performance via mSRIP+. Interestingly, in the MobileNet-like structures, SRIP+ significantly facilitates generalization compared to conventional CNN structures such as ResNet and Wide-ResNet (Table 5), even before modulation. This result provides an intuition for how dramatically the performance of lightweight CNN models can be improved when trained with KOR approaches.
Going further, such consistent improvements are also obtained for vision-transformer networks [25], [26] (Table 6). Recently, there has been an increasing demand for designing DNNs with transformer layers for computer vision [25]–[28]. Although our paper mainly targets CNN models, as a pilot we simply apply the strategy of Table 5 to the Feed-Forward Network (FFN) layers of the Transformer. We observe that even applying mSRIP+ only to the FFN layers in the Transformer block can lead to better optima (Table 6).

F. SHARP AND FLAT OPTIMA
We visualize the loss surface along random directions near the optimal parameters, following [29], which uses a filter normalization technique. Figure 4 depicts the 2D loss contours of ResNet-50 models. As shown in Figure 4, the optimum of networks trained with WD has relatively sharp curvature, while that of the KOR family lies in a flatter minimum.
We believe this is also a consequence of Figure 3. Because networks trained with WD have unstable activations during propagation, the network is easily corrupted by weight noise, so the loss landscape becomes sharp. The KOR series, on the other hand, reaches a flatter optimum.

G. SHRINKAGE EFFECTS ON PENULTIMATE LAYER REPRESENTATIONS
In this subsection, we visualize the activations of the penultimate layer with the tools of [30] (Figure 5), to examine how energy stabilization affects the representation distribution. In detail, a linear projection depicts in 2-D how the activations of three classes cluster around the template, i.e., the representative feature vector of the corresponding class, such as the row vector of the fully-connected layer. Here, we show visualization results for the classes "airplane", "automobile", and "bird" on CIFAR10.

[Table 6 caption: Test accuracy on ImageNet. WD is plugged into layers that are not spectrally regularized. Training recipes follow [25], [26].]
[Figure 4 caption: Loss contours following [29]. Each contour line indicates a loss value; the sparser the spacing between lines, the more robust the weights are to noise, indicating a flatter minimum.]

The first column shows examples from the training dataset and the second column examples from the test dataset. The training accuracies of all models used in the visualization are almost 99.99% (Figure 5). As shown in Figure 5, the clusters of all representations are distinct, but the cohesion and separation of each method's clusters clearly differ. We observe that networks trained with SRIP+ exhibit decreased intra-cluster divergence and increased inter-cluster separation; that is, their clusters are denser and better separated. In the visualizations of both the training and test datasets, the clusters of networks trained with SRIP+ form an almost equilateral triangle, whereas the triangular structure is less clear for networks trained with WD.

V. RELATED WORKS

A. KOR
Orthonormality of the linear transformations between the layers of a network is one of the long-standing topics in the optimization field [8], [11], [24]. The technique guarantees that the activation energy is not amplified [16] by spectrally regularizing the weight matrices of a network, and can therefore stabilize the distribution of network activations [8], [11]. Many researchers have attempted to utilize orthogonality for training, for example, by regularizing the singular values of each weight matrix into a narrow band around 1 [13], by penalizing the Gram matrix of each weight matrix toward the identity [8], [31], or by orthogonal initialization [8], [24]. Various orthogonality regularization tools were proposed in [14], and the authors empirically found the best among them.

B. COR

[10] provided conditions for dynamical isometry and a norm-preserving convolution operator, but this method is limited to a subset of the orthogonal space. Recently, [9], [12] proposed the same COR methods by regularizing the orthogonality of the DBT matrix. In [12], the authors showed that the isometric property can train deep vanilla networks without normalization or skip connections. In [9], the performance of existing networks was improved. However, neither work stands alone for training deep networks without WD.

VI. CONCLUSION
In this paper, we disentangle the benefits of OR methods through exhaustive studies. We empirically observe that the OR series based on penalty functions of the convolutional kernel weight matrices, called KOR, facilitates generalization through the norm preservation property during propagation, and that such methods are sufficient for training deep convolutional neural networks even without WD. Among those, our simple method SRIP+ further improves the performance of various models, specifically MobileNet-like models, with a simple modulation of the layers to be regularized. We verify these findings on various benchmark datasets. Furthermore, we stabilize SRIP+ training via robust orthogonal initialization. As a qualitative study, we empirically observe that OR methods lead to flat optima. We believe that such OR approaches, especially SRIP+, should be considered a cornerstone method.

APPENDIX A DETAILS OF THE ORTHOGONALITY EQUIVALENCE ON THE SPECTRAL NORM
Lemma 1. Let W ∈ R^{M×N} be a real matrix, and let P ∈ R^{M×M} and Q ∈ R^{N×N} be orthogonal matrices. Then

∥PWQ∥_2 = ∥W∥_2. (10)

Proof. This follows from the orthogonal invariance of the 2-norm. An orthogonal matrix P preserves the vector 2-norm for any real vector x ∈ R^M:

∥Px∥² = x^⊤P^⊤Px = x^⊤x = ∥x∥². (11)

Therefore, from Eq. 11,

∥PW∥_2 = max_{x ≠ 0} ∥PWx∥ / ∥x∥ = max_{x ≠ 0} ∥Wx∥ / ∥x∥ = ∥W∥_2. (12)

Similarly, ∥WQ∥_2 = ∥W∥_2. ■
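Lemma 1 is easy to check numerically (an illustrative sketch; the orthogonal factors are drawn via QR):

```python
import numpy as np

rng = np.random.default_rng(6)
W = rng.standard_normal((4, 6))
P = np.linalg.qr(rng.standard_normal((4, 4)))[0]   # orthogonal 4x4
Q = np.linalg.qr(rng.standard_normal((6, 6)))[0]   # orthogonal 6x6

# Lemma 1: multiplying by orthogonal matrices leaves the spectral norm unchanged.
assert np.isclose(np.linalg.norm(P @ W @ Q, 2), np.linalg.norm(W, 2))
```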
As a next step, we provide a proof of the spectral-norm equivalence, i.e., ∥WW^⊤ − I_M∥_2 = ∥W^⊤W − I_N∥_2. Recall the RIP condition of W with constant δ_W ≥ 0: for every vector z,

(1 − δ_W)∥z∥² ≤ ∥Wz∥² ≤ (1 + δ_W)∥z∥²,

which is equivalent to ∥W^⊤W − I_N∥_2 ≤ δ_W; the condition for ∥WW^⊤ − I_M∥_2 is derived analogously. Writing the singular value decomposition W = PΣQ^⊤ with orthogonal P and Q, Lemma 1 gives

∥W^⊤W − I_N∥_2 = ∥Σ^⊤Σ − I_N∥_2   and   ∥WW^⊤ − I_M∥_2 = ∥ΣΣ^⊤ − I_M∥_2.

Denote the singular values of W by σ_1(W) ≥ σ_2(W) ≥ · · · ≥ σ_{min{M,N}}(W). The largest diagonal value of either Σ^⊤Σ or ΣΣ^⊤ is σ_1²(W), and hence, by Lemma 2, both ∥WW^⊤ − I_M∥_2 and ∥W^⊤W − I_N∥_2 are equal to δ_W ≥ 0. ■ Analogously, Lemma 3 can be shown for the M < N case.

Proposition 2 implies that the penalty ∥W^⊤W − I_n∥_2 may be flawed for the following reason: a network may not learn the semi-orthogonality of W via backpropagation, since ∥W^⊤W − I_n∥_2 remains equal to 1 for every W with ∥W∥_2 ≤ √2 when m < n. Furthermore, in that case, the column vectors of W cannot be mutually orthonormal, because the rank of W is at most m. This follows from the N-RIP condition,

(1 − δ_W)∥z∥² ≤ ∥Wz∥² ≤ (1 + δ_W)∥z∥²,

where z is an N-sparse vector.

This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022.3185621