A New Pointwise Convolution in Deep Neural Networks Through Extremely Fast and Non Parametric Transforms

Conventional transforms such as the Discrete Walsh-Hadamard Transform (DWHT) and the Discrete Cosine Transform (DCT) have been widely used as feature extractors in image processing but rarely applied in neural networks. However, we found that these conventional transforms can serve as powerful feature extractors in the channel dimension of deep neural networks without any learnable parameters. This paper is the first to propose applying conventional transforms to pointwise convolution, showing that such transforms significantly reduce the computational complexity of neural networks without accuracy degradation on various classification tasks and even on a face detection task. Our comprehensive experiments show that the proposed DWHT-based model gained a 1.49% accuracy increase with 79.4% fewer parameters and 49.4% fewer FLOPs than its baseline model on the CIFAR-100 dataset, while achieving comparable accuracy with 81.4% fewer parameters and 49.4% fewer FLOPs on the SVHN dataset. Additionally, our DWHT-based model showed comparable accuracy with 89.2% fewer parameters and 26.5% fewer FLOPs than the baseline models on the WIDER FACE and FDDB datasets.


I. INTRODUCTION
Large Convolutional Neural Networks (CNNs) [1]-[4], [5] and automatic Neural Architecture Search (NAS) based networks [6]-[8] have evolved to show remarkable accuracy on various tasks such as image classification [9], [10] and object detection [11] by taking advantage of a huge number of learnable parameters and computations. However, this large number of weights and the accompanying high computational cost allow only limited applications on mobile devices, which are constrained by power consumption, memory space, and computation budget [12].
To address these problems, [13]-[16] proposed parameter- and computation-efficient blocks while maintaining almost the same accuracy as heavier CNN models. All of these blocks utilize depthwise separable convolution, which deconstructs the standard convolution of size (3 × 3 × C) into a spatial-information-specific depthwise convolution (3 × 3 × 1) and a channel-information-specific pointwise (1 × 1 × C) convolution. The depthwise separable convolution achieves accuracy comparable to standard spatial convolution with significantly fewer parameters and FLOPs. These reduced resource requirements have made depthwise separable convolution, and with it pointwise convolution (PC), widely used in modern CNN architectures.
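For a concrete sense of this reduction, the following sketch (ours, for illustration; the helper names are not from the paper) counts the weights of a standard 3 × 3 convolution against its depthwise separable factorization:

def conv_params(c_in, c_out, k=3):
    # weights of a standard k x k convolution
    return k * k * c_in * c_out

def depthwise_separable_params(c_in, c_out, k=3):
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # one 1 x 1 x c_in filter per output channel
    return depthwise + pointwise

print(conv_params(256, 256))                 # 589824
print(depthwise_separable_params(256, 256))  # 67840, roughly 8.7x fewer

Note that for C = 256 channels the pointwise term (65,536 of the 67,840 weights) dominates the factorized cost, which is precisely the part our method targets.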
Nevertheless, we point out that the existing PC layer is still computationally expensive and occupies a large proportion of the weight parameters [13]. Although the demand for the PC layer has been, and will keep, growing exponentially in modern neural network architectures, there have been only a few studies on improving its efficiency. [15] proposed a grouped version of the PC layer, which splits a feature map into groups along the channel dimension to reduce the number of learnable parameters. In a similar manner, [17] presented a structured version of the grouped PC layer based on a divide-and-conquer algorithm that recursively halves feature maps in the channel dimension. However, these previous works still require learnable parameters for the PC layer (parametric PC).
In this paper, we propose a new PC layer formulated by non-parametric and extremely fast conventional transforms. While deep neural networks are capable of extracting distinctive feature representations [18]-[23], conventional time-to-frequency transforms have long been used as feature extractors in image processing and video compression due to their ability to reduce the dimensionality of signals by concentrating energy in the low-frequency regions [24], [25] and to extract local features by decomposing image signals into various texture types [26]. In addition, fast transform algorithms exist (e.g., the Cooley-Tukey algorithm [27]) that significantly reduce computational complexity by reusing the output of the previous step in butterfly-like pipeline architectures.
We point out that, although these transform algorithms have shown promising performance in feature representation and dimensionality reduction, they have hardly been incorporated into CNNs [28]. In this paper, we aim to answer the following question: can conventional transforms (e.g., the Discrete Walsh-Hadamard Transform (DWHT) and the Discrete Cosine Transform (DCT)), which have frequently been used as spatial feature extractors [26], [29]-[31], also serve as feature extractors in the channel dimension of deep neural networks?
Through comprehensive experiments, we found that, although neither of these transforms requires learnable parameters, they can effectively capture feature representations in the channel dimension. Specifically, the orthogonality of these conventional transforms helps reduce the feature representational bottleneck (see Section IV-C1). Therefore, the proposed PC layer equipped with these conventional transforms can sufficiently extract feature information in the channel dimension. This non-parametric property also enables our proposed CNN models to be significantly compressed in terms of the number of parameters, allowing CNNs to be applied in low-power and low-complexity applications (e.g., efficient distributed training with less communication between server and clients) [32]. We note that DWHT in particular is a good replacement for the conventional PC layer, as it requires no floating point multiplications but only additions and subtractions (i.e., multiplication with weights binarized to +1/−1), by which the computational overhead of PC layers can be significantly reduced. Furthermore, DWHT has a fast version that reduces the computational complexity of the floating point operations from O(n²) to O(n log n). These non-parametric and low-computation properties yield extremely efficient neural networks in terms of parameters and computation while also enjoying accuracy gains.
Our contributions are summarized as follows:
• We propose a new PC layer formulated with conventional transforms which can significantly reduce computational resources (memory usage, FLOPs).
• We demonstrate the effectiveness of our proposed PC layer compared to the conventional PC layer in terms of accuracy and computational resources on various classification tasks and even on a face detection task.
• We investigate the optimal block structure and network hierarchy position for our proposed PC layer, along with an analysis of orthogonality, which helps reduce the feature representational bottleneck.

II. RELATED WORK
A. DECONSTRUCTION AND DECOMPOSITION OF CONVOLUTIONS
To reduce the computational complexity of existing convolution methods, several approaches that rethink and deconstruct the naive convolution structure have been proposed. [2] factorized a large kernel (e.g., 5 × 5) in a convolution layer into several convolution layers with small (3 × 3) kernels. [33] pointed out the limitation of the fixed receptive field in existing convolutions. Consequently, they introduced learnable spatial displacement parameters, showing the flexibility of dilation in convolution layers. Building on [33], [34] proved that the standard convolution can effectively be deconstructed as a single PC layer with spatially shifted channels. Based on that, they proposed a very efficient convolution layer, namely the active shift layer, which replaces spatial convolutions with shift operations. It is worth noting that the existing PC layer takes up a huge proportion of the computation and weight parameters in modern lightweight CNN models [13], [14], [16]. Specifically, MobileNet-V1 [13] spends 94% of its overall computational cost and 74% of its overall weight parameters on PC layers. Therefore, there have been attempts to reduce the computational complexity of the PC layer. [15] proposed ShuffleNet-V1, where the features are decomposed into several groups over channels and the PC operation is conducted for each group, thus reducing the number of weight parameters and FLOPs by the number of groups G. However, it was shown in [16] that the memory access cost increases as G increases, leading to slower inference speed. Like the aforementioned methods, our work aims to reduce the computational complexity and the number of weight parameters in a convolution layer. However, our objective is more oriented toward mathematically efficient algorithms based on fast divide-and-conquer schemes that exploit the properties of fixed harmonic kernels.

B. QUANTIZATION
In neural networks, quantization has been used to reduce the number of bits in weights and/or activations. [35] applied 8-bit quantization to weight parameters, which enabled considerable speed-up with a small drop in accuracy. [36] applied 16-bit fixed point representation with stochastic rounding. Building on [37], which pruned unimportant weight connections by thresholding weight values, [38] successfully integrated pruning, 8-bit (or lower) quantization, and Huffman encoding. The extreme case of quantized networks evolved from [39], which approximated weights with binary (+1, −1) values. With [39] as the milestone, [40], [41] constructed Binarized Neural Networks (BNNs), which stochastically binarize the real-valued weights and activations during training. These binarized weights and activations lead to significantly faster run-time by replacing floating point multiplications with 1-bit XNOR operations.
Based on BNNs [40], [41], a Local Binary CNN (LBCNN) [42] was proposed that utilizes binarized non-learnable weights in spatial convolution based on conventional local binary patterns [43], thus replacing multiplications with addition/subtraction operations in spatial convolution. Our work shares some similarity with LBCNN [42] in using binary fixed weight values. However, local binary patterns cannot be applied to the PC layers, which account for a much larger portion of the parameters and computation in neural networks [13] than the spatial convolution layers. Also, exploiting favorable mathematical properties to reduce computation (e.g., the harmonic property of DCT/DWHT kernels) was not considered in LBCNN.

C. CONVENTIONAL TRANSFORMS
Several transform techniques have been applied to image processing and compression [44]-[46]. The Discrete Cosine Transform (DCT) has been used as a powerful feature extractor [26]. For an N-point input sequence, the basis kernel of DCT is defined as a list of cosine values as below:

$$C_m = \left[\cos\frac{(2n+1)m\pi}{2N}\right]_{n=0,\dots,N-1} \tag{1}$$

where m is the index of a basis, and DCT captures higher frequency information in the input signal as m increases. This property has led DCT to be widely applied in image/video compression techniques that emphasize the power of image signals in low-frequency regions [47].

The Discrete Walsh-Hadamard Transform (DWHT) is a very fast and efficient transform that uses only +1 and −1 elements in its kernels. These binary elements allow DWHT to be computed without any multiplication, using only addition/subtraction operations. Therefore, DWHT has been widely used for fast feature extraction in many practical applications, such as texture image segmentation [29], face recognition [30], and video shot boundary detection [31].
Furthermore, DWHT admits a structured-wiring-based fast algorithm (Algorithm 1, Figure 1) while also offering very high efficiency in encoding spatial information [48]. The basis kernel matrix of DWHT is defined recursively from the previous kernel matrix as below:

$$H_D = \begin{bmatrix} H_{D-1} & H_{D-1} \\ H_{D-1} & -H_{D-1} \end{bmatrix} \tag{2}$$

where H_0 = 1 and D ≥ 1. In this paper, we denote H_D^m as the m-th row vector of H_D in Eq. 2. Additionally, we adopt a fast DWHT algorithm to reduce the computational complexity of the PC layer in neural networks, resulting in extremely fast and efficient layers.
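As an illustrative sketch (ours, not from the paper; the helper names are our own), both basis kernels can be generated in a few lines, and the row orthogonality used later in Section IV-C can be verified numerically:

import numpy as np

def dct_basis(N):
    # Row m is the DCT-II basis kernel C_m of Eq. 1
    # (normalization omitted, as in the proposed PC layer).
    n = np.arange(N)
    m = np.arange(N).reshape(-1, 1)
    return np.cos((2 * n + 1) * m * np.pi / (2 * N))

def dwht_basis(D):
    # Build H_D by the recursive block construction of Eq. 2.
    H = np.array([[1.0]])
    for _ in range(D):
        H = np.block([[H, H], [H, -H]])
    return H

C, H = dct_basis(8), dwht_basis(3)       # two 8 x 8 kernel matrices
assert set(np.unique(H)) == {-1.0, 1.0}  # DWHT is multiplication-free
for W in (C, H):
    gram = W @ W.T                       # off-diagonal entries vanish:
    assert np.allclose(gram, np.diag(np.diag(gram)))  # orthogonal rows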

III. METHOD
We propose a new PC layer which is computed with conventional transforms. The conventional PC layer can be formulated as follows:

$$Z_{ijm} = W_m^{\top} X_{ij}, \quad 0 \le m < M \tag{3}$$

where (i, j) is the spatial index and m is the output channel index. In Eq. 3, N and M are the numbers of input and output channels, respectively, X_ij ∈ R^N is the vector of the input X at spatial index (i, j), and W_m ∈ R^N is the m-th weight vector of W. For simplicity, the stride is set to 1 and the bias is omitted in Eq. 3. Our proposed method replaces the learnable parameters W_m with the bases of the conventional transforms. For example, replacing W_m with H_D^m in Eq. 3, we can formulate a new multiplication-free PC layer using DWHT. Similarly, the DCT basis kernels C_m in Eq. 1 can substitute for W_m in Eq. 3, formulating another new PC layer using DCT. Note that the normalization factors of the conventional transforms are not applied in the proposed PC layer, because Batch Normalization [49] performs a normalization and a linear transform, which can be viewed as subsuming the normalization of the transforms.
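A minimal PyTorch-style sketch of this substitution (ours; the class name and tensor layout are assumptions, and the naive O(N²) matrix product is used here rather than the fast algorithm below):

import torch
import torch.nn as nn

class TransformPC(nn.Module):
    # Pointwise convolution of Eq. 3 with the learnable W replaced by a
    # fixed transform basis. `basis` is an (M, N) DCT or DWHT kernel matrix.
    def __init__(self, basis):
        super().__init__()
        self.register_buffer("basis", basis)  # non-learnable, moves with .to()

    def forward(self, x):                     # x: (B, N, H, W)
        # Z_ijm = <W_m, X_ij> at every spatial location (i, j)
        return torch.einsum("mn,bnhw->bmhw", self.basis, x)

In practice the layer is followed by Batch Normalization, which, as noted above, absorbs the omitted normalization factors.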
The most important benefit of the proposed method comes from the fact that the fast algorithms of the existing transforms can be applied to the proposed PC layers for further reduction of computation. Directly applying our proposed PC layers yields a computational complexity of O(N²). Adopting the fast algorithms, we can significantly reduce the computational complexity of the PC layer from O(N²) to O(N log N) without any change in the computation results.
We present the pseudo-code of our proposed fast PC layer using DWHT in Algorithm 1, based on the fast DWHT structure shown in Figure 1. In Algorithm 1, for log₂ N iterations, the even-indexed and odd-indexed channels are added and subtracted in an element-wise manner, respectively. The resulting added and subtracted elements are placed in the first N/2 and the last N/2 elements of the input of the next iteration, respectively. Each iteration requires only N addition and subtraction operations. Consequently, Algorithm 1 yields a complexity of O(N log N) using only additions and subtractions. Compared to the existing PC layer, which requires a complexity of O(N²) in multiplications, our method is extremely efficient in terms of computation costs (as shown in Figure 2) and the power consumption of computing devices [50]. Note that, similarly to fast DWHT, DCT can also be computed quickly with a butterfly architecture that recursively decomposes the N-point input sequence into two subproblems of N/2-point DCT [51].

FIGURE 1. Entire architecture of our proposed fast DWHT-based PC layer. x_i denotes each input feature map split along the channel dimension. Feature maps with even and odd channel indices are summed to form the first half of the output feature maps (sky blue shaded part) and subtracted to form the other half (red shaded part). This structured addition and subtraction process is repeated log₂ n times, so the computational complexity reduces to O(n log₂ n) without any multiplication.
Compared to DWHT, DCT has the advantage of more natural cosine-shaped basis kernels, which tend to provide better feature extraction performance by capturing frequency information. However, DCT inevitably needs multiplications for the inner product between the C and X vectors, and a look-up table (LUT) for the cosine kernel bases, which can increase processing time and memory access. On the other hand, as mentioned, the kernels of DWHT consist only of +1 and −1, which allows building a multiplication-free module. Furthermore, no memory access to kernel bases is needed if our structured-wiring-based fast DWHT algorithm (Algorithm 1; Figure 1) is applied. Our comprehensive experiments in Sections III-A and III-B show that DWHT is more efficient than DCT when applied to the PC layer in terms of the trade-off between computation cost and accuracy.
Note that, to obtain a more general formulation of our proposed PC layer, we pad zeros along the channel axis if the number of input channels is less than the number of output channels, and truncate the output channels when the number of output channels is smaller than the number of input channels, as shown in Algorithm 1. Figure 1 shows the architecture of the fast DWHT algorithm described in Algorithm 1. This structured-wiring-based architecture ensures that the receptive field of each output channel is N, meaning that each output channel fully reflects all input channels through the log₂ N iterations. This property allows the proposed PC layer to fully capture the input channel correlations.

Algorithm 1 Pointwise Convolution Using Fast DWHT
Input: Input feature map X ∈ R^{B×N×H×W}
Output: Output feature map Z ∈ R^{B×M×H×W}
1: n ← log₂ N
2: if N < M then
3:   ZeroPad1D(X, axis=1)  ▷ pad zeros along channel axis
4: end if
5: for i ← 1 to n do
6:   E ← even-indexed channels of X
7:   O ← odd-indexed channels of X
8:   first N/2 channels of X ← E + O
9:   last N/2 channels of X ← E − O
10: end for
11: if N > M then
12:   Z ← first M channels of X  ▷ truncate along channel axis
13: end if

FIGURE 2. Comparison of the number of multiplications between our new PC layers and the conventional PC layer. The x axis denotes the logarithm of the number of input channels, which ranges from 2⁰ to 2ⁿ. For simplicity, the number of output channels is set equal to the number of input channels for all PC layers.
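For reference, a compact PyTorch sketch of Algorithm 1 (ours; the function name is an assumption, and N is expected to be a power of two after padding):

import torch

def fast_dwht_pc(x: torch.Tensor, M: int) -> torch.Tensor:
    # x: (B, N, H, W) input feature map; returns (B, M, H, W).
    B, N, H, W = x.shape
    if N < M:                                # lines 2-4: pad channels
        x = torch.cat([x, x.new_zeros(B, M - N, H, W)], dim=1)
        N = M
    for _ in range(N.bit_length() - 1):      # n = log2(N) butterfly stages
        even, odd = x[:, 0::2], x[:, 1::2]   # element-wise add / subtract
        x = torch.cat([even + odd, even - odd], dim=1)
    return x[:, :M]                          # lines 11-13: truncate

Each butterfly stage performs N additions/subtractions per spatial location, giving the O(N log N) multiplication-free complexity discussed above; for N = 4 the loop reproduces H_2 X exactly.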
To successfully fuse the proposed PC layer into neural networks, we explore two themes: i) the optimal block structure for the proposed PC layer; and ii) the optimal insertion strategy for the block found in i), applied hierarchically over the blocks of the networks. We assume that there is an optimal block unit structure and an optimal hierarchy-level position (high-, middle-, low-level) in the neural networks favored by these non-learnable transforms, and we conducted experiments for the two themes accordingly. We evaluated the effectiveness of each of our networks in accuracy relative to the number of learnable weight parameters and FLOPs. For comparison, we counted total FLOPs as the sum of the numbers of multiplications, additions, and subtractions performed during inference. Unless mentioned otherwise, we used the following default experimental setting: batch size 128, 200 training epochs, initial learning rate 0.1 multiplied by 0.94 every 2 epochs, and SGD optimizer with momentum 0.9 and weight decay 5e-4. In all experiments, the model accuracy was obtained by averaging the Top-1 accuracy over three independent training runs.
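For reproducibility, the default setting above corresponds roughly to the following sketch (ours; PyTorch names are assumed, as the paper does not specify a framework):

import torch
import torch.nn as nn

model = nn.Linear(10, 10)  # placeholder for any of the evaluated models
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)
# initial lr 0.1, multiplied by 0.94 every 2 epochs, over 200 epochs;
# call scheduler.step() once per epoch.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.94)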

A. OPTIMAL BLOCK STRUCTURE FOR THE CONVENTIONAL TRANSFORMS
From a microscopic perspective, a block is the basic unit of a neural network, and it determines the efficiency of the weight parameter space and computation costs with respect to accuracy. Accordingly, to find the optimal block structure for our proposed PC layer, we performed comprehensive experiments on blocks containing the proposed layer based on ShuffleNet-V2 [16]. The proposed block and its variant blocks are listed in Figure 3. As shown in (c) and (d) of Table 1, the ReLU [52] activation function significantly harms the accuracy of our neural networks equipped with the conventional transforms. This is because the harmonic kernels in conventional transforms tend to produce symmetric distributions with zero mean, in which much information can be eliminated by rectifying the negative-valued coefficients. Further analysis of this phenomenon is given in Section IV-A. Additionally, we find that the proposed PC layer yields approximately 1.16% higher accuracy than a PC layer with randomly initialized and fixed weights, as shown in Table 1. These results imply that DWHT and DCT kernels extract better feature representations in the channel dimension than kernels that are randomly initialized and non-learnable. Compared to the baseline model in Table 1, the DCT w/o ReLU and DWHT w/o ReLU blocks yield an approximately 2.3% accuracy drop while reducing the learnable weight parameters and FLOPs by 42% and 49.5%, respectively. These results imply that the proposed blocks (i.e., (c) and (d) in Figure 3) are still inefficient in the trade-off between accuracy and computation cost, leading us to further explore the optimal network architecture for the proposed PC layer. In the next subsection, we address this problem by applying the conventional transforms at the optimal hierarchy-level features (see Section III-B). Based on our comprehensive experiments, we set block structure (d) as our default proposed block, which is exploited in all the following experiments.

TABLE 1. Performance of the blocks in Figure 3 on the CIFAR-100 dataset. All the experimented models are based on ShuffleNet-V2 with width hyper-parameter 1.1×, which we customized to make the number of output channels in Stage-2, -3, -4 equal to 128, 256, 512, respectively, for comparison with DWHT having 2^n input channels. We replaced all 13 basic blocks with stride 1 (i.e., the (a) block) in the baseline model with each of the evaluated blocks.

B. OPTIMAL POSITION FOR THE PROPOSED BLOCKS IN HIERARCHY LEVEL
In this section, we search for the optimal position of the proposed blocks in the hierarchy of the network. The optimal hierarchy level is defined such that the proposed networks have the minimal number of learnable weight parameters and FLOPs without an accuracy drop. Note that applying our proposed block at a high-level position in the network reduces the number of parameters and FLOPs much more than applying it at a low-level position, because the channel depth increases exponentially as the layer goes deeper in the network.
In Figure 4, we applied our optimal block (i.e., block (d) in Figure 3) at high-, middle-, and low-level positions, respectively. In our experiments, we evaluate the performance of the networks depending on the number of blocks to which the proposed optimal block is applied. A model under test is denoted as (transform type)-(# of the proposed blocks)-(hierarchy position), e.g., DWHT-3-H. As shown in the first column of Figure 4, the proposed block achieved a much better trade-off between the number of learnable weight parameters (or FLOPs) and accuracy at the high-level position compared to the baseline models. Meanwhile, applying the proposed block to middle- and low-level features yields slightly and severely worse trade-offs between accuracy and the number of parameters (or FLOPs), respectively. This tendency is shown similarly for both DWHT-based and DCT-based models, implying that there is an optimal hierarchy-level position of blocks favored by the conventional transforms. We conjecture that forcing the kernels of low-level layers to be fixed during training impedes the low-level features from playing a principal role in maximally extracting information from the input [53] and prevents rich information from flowing from low-level to high-level layers, thus leading to accuracy degradation due to the information bottleneck.

TABLE 2. Performance results of hierarchically applying our optimal block on the CIFAR-100 dataset. All the models are based on MobileNet-V1 with width hyper-parameter 1×. We replaced both stride-1 and stride-2 blocks in the baseline model with the optimal block, which consists of [3 × 3 depthwise convolution - Batch Normalization - ReLU - CTPC - Batch Normalization] in series.
We also note that our DWHT-based models showed slightly higher or same accuracy with less FLOPs in all the hierarchy level positions compared to our DCT-based models. This is because the fast version of DWHT does not require any multiplication but needs a small amount of addition and subtraction operations compared to the fast version of DCT while it also has the sufficient ability as a feature extractor in channel dimension with the exquisite wiring-based structure ( Figure 1).
To verify the generality of the proposed method, we also applied it to MobileNet-V1. Inspired by the above results showing that the optimal hierarchy-level position for the conventional transforms is found at the high level, we replaced the high-level blocks of the baseline model (MobileNet-V1) to verify the effectiveness of the proposed method. The experimental results are described in Table 2; remarkably, the same tendency holds for MobileNet-V1. We further applied our MobileNet-V1-based models to the SVHN dataset [54] in Table 3. As on CIFAR-100, the tendency that applying the conventional transforms to high-level layers makes the baseline model extremely lightweight and computationally efficient is maintained on SVHN. In particular, the DWHT-6-H model showed comparable accuracy to the baseline model with 81.4% of the parameters and 49.4% of the FLOPs reduced.

IV. EXPERIMENTS AND ANALYSIS
In this section, we analyze the significant accuracy degradation caused by applying ReLU after our proposed PC layer. We also analyze the active utilization of the 3 × 3 depthwise convolution weight kernels, which take an auxiliary role for the non-learnable conventional transforms. Additionally, beyond classification, we demonstrate the task-domain generality of the proposed method on a face detection task with extensive experiments.

A. HINDRANCE OF ReLU IN FEATURE REPRESENTABILITY
As shown in Table 1, applying ReLU after the conventional transforms significantly harmed the accuracy. This is due to two properties of the conventional transform basis kernels: both H_D^m in Eq. 2 and C_m in Eq. 1 have the same number of positive and negative elements in their kernels except for m = 0, and the distributions of the absolute values of the positive and negative elements are almost identical. These properties imply that output channel elements with values below zero must also be considered during the forward pass: when forwarding X_ij in Eq. 3 through the conventional transforms, if some important channel elements in X_ij with larger values than others are combined with negative values of C_m or H_D^m, the important feature information in the output Z_ijm in Eq. 3 can reside in the value range below zero. Figure 5 shows that the activations of both DCT- and DWHT-based PC layers at all hierarchy levels have positive and negative values in almost equal proportion. These negative values may carry important feature information in the channel dimension. Thus, applying ReLU to the activations of the conventional-transform-based PC layers discards crucial feature information contained in the negative values, leading to the significant accuracy drops shown in Table 1. Figure 6 demonstrates this analysis empirically: when the negative-valued coefficients are fully rectified (i.e., g = 0, so F = ReLU), the accuracy is significantly degraded, while fully keeping the negative-valued coefficients (i.e., g = 1) shows the best accuracy. Based on this kernel-value analysis and these experiments, we do not use a non-linear activation function after the proposed PC layer.

FIGURE 6. Ablation study of the negative slope term g in the activation function F, defined as F(x) = max(0, x) + g · min(0, x). The performance of the models was evaluated on {DCT or DWHT}-13-H ShuffleNet-V2 1.1×, where F was applied as the activation function after every DCT- or DWHT-based PC layer and Batch Normalization layer.
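The ablation activation F from Figure 6 can be written directly (our sketch; it coincides with a leaky ReLU with negative slope g):

import torch
import torch.nn.functional as F

def ablation_activation(x, g):
    # F(x) = max(0, x) + g * min(0, x): g = 0 recovers ReLU
    # (negatives fully rectified), g = 1 is the identity
    # (negative coefficients fully kept).
    return torch.clamp(x, min=0) + g * torch.clamp(x, max=0)

x = torch.randn(4)
assert torch.allclose(ablation_activation(x, 0.2), F.leaky_relu(x, 0.2))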

B. ACTIVE 3 × 3 DEPTHWISE CONVOLUTION WEIGHTS
In Figure 7 and Appendix B, it can be observed that the 3 × 3 depthwise convolution weights in the last 3 blocks of DWHT-3-H and DCT-3-H have far fewer near-zero values than those of the baseline model. That is, the number of weight values far from zero is much larger in the DCT-3-H and DWHT-3-H models than in the baseline model. We conjecture that these learnable weights were actively fitted to the domain favored by the conventional transforms. Consequently, these weights are actively and sufficiently utilized to take the auxiliary role for the non-learnable conventional transforms, yielding the accuracy increase over the conventional PC layer shown in Figure 4.

To verify the impact of the activeness of these 3 × 3 depthwise convolution weights in the last 3 blocks, we experimented with regularizing them under varying weight decay values. Higher weight decay values more strongly regularize the scale of the 3 × 3 depthwise convolution weights in the last 3 blocks. A strong constraint on the scale of these weights hinders their active utilization, which results in the accuracy drop shown in Figure 8.

C. ORTHOGONALITY
In this section, we show that the DCT- and DWHT-based PC layers can efficiently regularize deep neural networks and reduce the representational bottleneck through their orthogonality. Formally, the orthogonality of the DCT- and DWHT-based PC layers is given as below:

$$W_i^{\top} W_j = 0 \tag{4}$$

where i ≠ j, 0 ≤ i, j < M, and W is C in Eq. 1 or H_D in Eq. 2.

1) RANK ANALYSIS
Previous works [55]-[59] showed the regularization effect of orthogonal kernel matrices, which enables faster convergence in training and consequently improves accuracy. Moreover, orthogonality ensures that the kernel matrix is full rank, which helps reduce the representational bottleneck in the feature maps [60], [61]. Formally, we can rewrite Eq. 3 as a matrix multiplication:

$$Z = WX \tag{5}$$

where W ∈ R^{M×N}, X ∈ R^{N×whB}, B is the batch size, and w, h are the width and height of the input X. Rank(Z) is upper bounded by min(M, N) (assuming whB ≥ N). The weights of the DCT- and DWHT-based PC layers are full rank owing to their orthogonality, enabling Rank(Z) to attain its upper bound min(M, N), while conventional PC layers are not guaranteed to attain it. Maximizing Rank(Z) helps resolve the representational bottleneck that hinders the discriminative encoding of feature maps in the channel dimension [60]. Without any cost, the DCT- and DWHT-based PC layers naturally have orthogonal filter groups for each output channel (i.e., inter-channel orthogonality [55]). With this inter-channel orthogonality, the DCT- and DWHT-based PC layers not only reduce redundancy and ensure diversity of kernels but also resolve the representational bottleneck in the channel dimension. Consequently, they showed better performance than conventional PC layers, as shown in Tables 2 and 3.
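A quick numerical sketch (ours; scipy's hadamard produces the same H_3 as the recursion of Eq. 2) of the full-rank argument:

import numpy as np
from scipy.linalg import hadamard

W = hadamard(8).astype(float)   # H_3: +1/-1 entries, orthogonal rows
X = np.random.randn(8, 1024)    # X in R^{N x whB}, whB >= N
Z = W @ X
assert np.linalg.matrix_rank(W) == 8             # full rank from orthogonality
assert np.linalg.matrix_rank(Z) == min(W.shape)  # upper bound attained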

2) IMPACT OF ORTHOGONALITY
To verify the effect of the inter-channel orthogonality in the DCT- and DWHT-based PC layers, we deliberately destroyed their inter-channel orthogonality by maximizing the orthogonality regularization term below:

$$R(W) = \lambda \left\| W W^{\top} - I \right\|_F^2 \tag{6}$$

where λ is the regularization coefficient and W is the kernel matrix of the DCT-based (i.e., C in Eq. 1) or DWHT-based (i.e., H_D in Eq. 2) PC layer. λ is decayed to 0.1, 0.01, and 0.0001 at epochs 20, 50, and 70, respectively, following the observation of [55]. As shown in Figure 9, the DCT- and DWHT-based models always suffered from significant accuracy degradation when the orthogonality was destroyed; accuracy was severely degraded even when the orthogonality was only weakly destroyed. Without the inter-channel orthogonality, DCT and DWHT cannot sufficiently extract feature representations in the channel dimension, lacking diverse filters and suffering from the representational bottleneck in the feature maps, and consequently show a severe accuracy loss.
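Under our reading of Eq. 6 (an assumption: we take the soft-orthogonality form used in [59]), the term can be computed as below; subtracting it from the training loss, scaled by λ, maximizes it and pushes W away from orthogonality:

import torch

def soft_orthogonality(W: torch.Tensor) -> torch.Tensor:
    # ||W W^T - I||_F^2 as in Eq. 6 (lambda applied by the caller);
    # e.g., loss = cross_entropy - lam * soft_orthogonality(W).
    M = W.shape[0]
    gram = W @ W.t()
    return ((gram - torch.eye(M, device=W.device)) ** 2).sum()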

D. COMPARISON WITH OTHER ORTHOGONAL KERNELS
As the kernels of DCT and DWHT have inter-channel orthogonality, we compared our DCT and DWHT based PC layers with other orthogonal kernel matrices regularized with Soft Orthogonality [59], [62], [63] (namely, SO) and Spectral Restricted Isometry Property [55] (namely, SRIP).

1) SETUP
We randomly initialized the weights of the corresponding PC layers as in the RCPC layer in Figure 3. These weights were then regularized to be orthogonal at every training iteration with the SO or SRIP method. For a fair comparison with the DCT- and DWHT-based PC layers, gradients of the cross-entropy loss were not used to update these layers.

2) RESULTS
In Table 4, the DCT- and DWHT-based models clearly showed better accuracy than the other orthogonal-regularized kernel matrices, demonstrating that the well-designed orthogonal DCT and DWHT kernels are more powerful feature extractors than other orthogonal filters.

E. FACE DETECTION
To demonstrate the domain generality of the proposed method, we conducted comprehensive experiments applying our proposed PC layers to object detection, specifically the face detection task.

1) SETUP
For face detection schemes such as anchor design, data augmentation, and feature-map resolution design, we followed [64], one of the baseline methods in the face detection field. Note that there is huge demand for real-time face detection algorithms with high detection accuracy, which leads us to apply our PC layers to a lightweight face detection network. Therefore, instead of using VGG16 [2] as the backbone network as in [64], we set MobileNet-V1 0.25× as our baseline backbone model, with extra depthwise separable blocks added for detecting faces at more diverse scales in the images. In this baseline model, we replaced the conventional PC layers within the last 3 or 6 blocks with our DCT/DWHT-based PC layers. We trained all the models on the WIDER FACE [65] train dataset and evaluated them on the WIDER FACE validation dataset and the Face Detection Data Set and Benchmark (FDDB) dataset [66].

2) RESULTS
On the WIDER FACE validation and FDDB datasets, our DCT/DWHT-based models showed detection accuracy comparable to the baseline model; in particular, the DWHT-based model achieved this with 89.2% fewer parameters and 26.5% fewer FLOPs than the baseline.

V. CONCLUSION
We proposed new PC layers based on conventional transforms, which make neural networks efficient in both computational complexity and the number of learnable weight parameters. To successfully fuse our PC layers into deep neural networks, we found the optimal block unit structure and hierarchy-level position for the conventional transforms, showing accuracy increases and strong feature representability in the channel dimension. We further revealed the hindrance of ReLU to feature representation, the activeness of the depthwise convolution weights in the last blocks, and the effect of orthogonality in our proposed networks. Finally, we showed the superiority of our method on another task domain with a low number of parameters and FLOPs.

LIMITATIONS AND FUTURE WORKS
While the scope of our method is currently restricted to small and medium scale datasets, we believe that scaling up to much larger datasets such as ImageNet is a totally new research frontier, where a minimal amount of learnable parameters in the PC layer is necessarily required due to the increased scale and difficulty of the dataset.

APPENDIX A GENERALITY OF APPLYING PROPOSED PC LAYERS IN OTHER NEURAL NETWORKS
In Figure 10, to pinpoint the hierarchy level of blocks favored by our proposed PC layers, we subdivided our middle-level experiment scheme: the DCT/DWHT-3-M-Front model applies the proposed blocks from the beginning of Stage-3 of the baseline, while the DCT/DWHT-3-M-Rear model applies them from the end of Stage-3. The performance curves of all our proposed models in Figure 10 show that applying the proposed optimal block within the first 6 blocks of the network mildly or significantly deteriorates Top-1 accuracy relative to the required computational cost and number of learnable parameters, indicating that there are definite hierarchy-level blocks in the network that are favored or disfavored by our proposed PC layers.

APPENDIX B HISTOGRAM OF 3 × 3 DEPTHWISE CONVOLUTION WEIGHTS IN HIGH-LEVEL BLOCKS
See Fig. 11.