Channel Compression: Rethinking Information Redundancy among Channels in CNN Architecture

Abstract-Model compression and acceleration are attracting increasing attention due to the demand for embedded devices and mobile applications. Research on efficient convolutional neural networks (CNNs) aims at removing feature redundancy by decomposing or optimizing the convolutional calculation. In this work, feature redundancy is assumed to exist among channels in CNN architectures, which provides some leeway to boost calculation efficiency. Aiming at channel compression, a novel convolutional construction named compact convolution is proposed to embrace the progress in spatial convolution, channel grouping and pooling operations. Specifically, the depth-wise separable convolution and the point-wise interchannel operation are utilized to extract features efficiently. Unlike existing channel compression methods, which usually introduce considerable learnable weights, the proposed compact convolution reduces feature redundancy with no extra parameters. With the point-wise interchannel operation, compact convolutions implicitly squeeze the channel dimension of feature maps. To explore the rules for reducing channel redundancy in neural networks, different point-wise interchannel operations are compared. Moreover, compact convolutions are extended to tackle multiple tasks, such as acoustic scene classification, sound event detection and image classification. Extensive experiments demonstrate that our compact convolution not only is highly effective in several multimedia tasks, but also can be implemented efficiently by benefiting from parallel computation.
Index Terms-Acoustic scene classification, convolutional neural networks, image classification, model compression and acceleration, sound event detection.

I. INTRODUCTION
CONVOLUTIONAL neural networks (CNNs) are attracting considerable attention in an increasing array of areas, such as computer vision [1]-[3], computational acoustics [4]-[6] and natural language processing [7]-[9]. The general trend is to design deeper and more complicated network architectures in pursuit of better performance. However, the massive resources such networks require keep CNN-based classifiers from real-time inference in mobile applications. Over the past few decades, various methods have been exploited for model compression and acceleration, including pruning [10]-[13], weight sharing [14], [15], low-rank matrix factorization [16]-[18] and knowledge distillation [19]-[21]. Despite their desirable compression ability, most of these methods suffer from two major drawbacks. First, the original complex model is replaced with an approximate one, which leads to error accumulation; fine-tuning is therefore usually necessary to reach satisfactory performance. Second, these methods require various manually chosen parameters (and often a great deal of empirical engineering that only experts are competent to handle).
To overcome the above drawbacks, several efficient convolution methods have recently been developed that design specific convolutional kernels with fewer parameters and calculations. In 2016, Szegedy et al. [22] proposed an asymmetric convolution in which a standard d×d convolution layer is spatially factorized into a sequence of two layers with d×1 and 1×d convolutions. Howard et al. [23] proposed MobileNet v1, which replaces the standard convolution with the depth-wise separable convolution. Zhang et al. [24] proposed ShuffleNet, which applies group convolution and channel shuffle. Iandola et al. [25] proposed SqueezeNet, in which 1×1 convolutions are used to reduce channel numbers and to replace part of the 3×3 convolutions for fewer parameters. Although studies [24]-[26] have investigated reducing the channel number in the current layer to cut down the subsequent convolutional operations, they simply append a 1×1 convolutional layer, which introduces extra parameters and considerable interchannel calculations.
In this paper, we find that feature redundancy exists among channels in CNN architectures, i.e., a large amount of interchannel information is unimportant or even unnecessary in some cases. Instead of 1×1 convolutions, a novel convolutional construction named compact convolution is proposed to implicitly reduce feature redundancy in a non-learning approach. Specifically, the point-wise operation among channels (the point-wise interchannel operation) is applied to squeeze the channel dimension of input feature maps. The reasons for applying the point-wise operation are threefold. First, the point-wise operation compresses the interchannel information without extra parameters, directly reducing the cost of computation. Second, the derivatives of these point-wise operations are easy to compute, which suits the chain rule and allows end-to-end networks to be trained from scratch. Third, the point-wise operation is well-suited for parallel computation on GPUs or other advanced chips. Depth-wise separable convolution is further introduced to decouple spatial feature extraction from interchannel feature extraction. As in other research on efficient convolutional kernels [23]-[25], [27], useful features can be extracted from feature maps with fewer parameters and operations by simply replacing the standard convolution with our compact convolution. In addition, how different types of point-wise operations affect interchannel feature compression is further investigated. Although sounds and images differ tremendously, our compact convolution yields the desired performance in multiple tasks, such as acoustic scene classification, sound event detection and image classification. To the best of our knowledge, few works have verified the generalization of their models across multiple media.
Extensive experiments show that, compared with general network constructions (such as VGG, ResNet and MobileNets), the network with compact convolutions (hereafter CompactNet) not only greatly reduces computational complexity, but also yields desirable performance. To further illustrate the difference between the linear manner and the non-linear one, three different point-wise operations are compared. Some guidelines are provided for investigating model compression and acceleration.
The contributions of this work are summarized as follows:
1) A novel convolution named compact convolution is proposed to implicitly reduce feature redundancy in a non-learning approach. Unlike existing channel compression methods, which directly utilize 1×1 convolution, the proposed compact convolution adopts the point-wise interchannel operation to squeeze the channel dimension of feature maps with no extra parameters. It turns out that compact convolutions not only cost up to 18 times less computation than standard 3×3 convolutions, but also yield competitive performance.
2) Some guidelines on replacing learnable parameters and complex operations in convolutional layers are summarized. This facilitates further investigation on feature dimension reduction in CNNs.
3) The proposed convolution can be easily applied in general CNN architectures, by replacing the current convolutions with our compact convolutions. Moreover, the compact convolution can extract either audio or visual features to solve multimedia problems.
The remainder of the paper is organized as follows. Section II provides a brief survey of related work. Section III first presents the proposed compact convolution, and then applies it to several popular CNN architectures. Section IV introduces the multimedia tasks and datasets used to assess generalization. In Section V, extensive experiments are conducted to evaluate CompactNets. Finally, several conclusions and possible future works are given in Section VI.

II. RELATED WORK
A. 1×1 convolution
1×1 convolution was first proposed by Lin et al. [28] as a universal function approximator for feature extraction on local patches. They found that 1×1 convolution not only has great capability in modeling various distributions of latent concepts, but also facilitates learnable interactions of cross-channel information. Subsequent work in [29], [30] utilized 1×1 convolution for tuning the number of feature maps in CNN architectures. However, 1×1 convolution involves considerable parameters and operations. In contrast, this work applies the point-wise interchannel operation to reduce the dimension of feature maps.

B. Maxout function
In 2013, Goodfellow et al. [31] proposed a maxout construction that performs max pooling across multiple affine feature maps. It turns out that the maxout construction results in a piecewise linear function which is capable of modeling any convex function. Wu et al. [32] proposed the Max-Feature-Map (MFM) layer as a variation of the maxout activation to suppress low-activation neurons in each layer. Rather than seeking a better function approximator, this paper focuses on efficient approaches for reducing the interchannel redundancy and compressing the dimension of feature maps over a larger range. Moreover, besides max pooling, two more operations are investigated and further integrated into the proposed convolutional layer.

C. Depth-wise separable convolution
Howard et al. [23] proposed MobileNets v1, which adopted the depth-wise separable convolution and achieved favorable results on small models. A depth-wise separable convolution consists of a depth-wise convolution for spatial filtering and a point-wise convolution (1×1 convolution) for exchanging information among channels. By replacing standard convolutions with depth-wise separable convolutions, the optimized network costs about 9 times less computation than the standard one, at the cost of a small reduction in accuracy. Inspired by the depth-wise separable convolution, the compact convolution decouples spatial feature extraction from interchannel feature extraction. Moreover, the point-wise interchannel operation is introduced between the depth-wise convolution and the 1×1 convolution, which further improves the efficiency of the convolution.
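The "about 9 times" figure can be checked with a short calculation. The sketch below counts multiply-adds for a standard K×K convolution and for a depth-wise separable one, following the usual cost model of [23]; the function names and the layer sizes are illustrative assumptions, not values taken from this paper.

```python
def standard_conv_madds(k, c_in, c_out, w_out, h_out):
    """Multiply-adds of a standard k x k convolution."""
    return k * k * c_in * c_out * w_out * h_out


def separable_conv_madds(k, c_in, c_out, w_out, h_out):
    """Multiply-adds of a depth-wise k x k convolution followed by a 1x1 convolution."""
    depthwise = k * k * c_in * w_out * h_out
    pointwise = c_in * c_out * w_out * h_out
    return depthwise + pointwise


# Hypothetical layer: 3x3 kernel, 256 -> 256 channels, 14x14 output.
ratio = standard_conv_madds(3, 256, 256, 14, 14) / separable_conv_madds(3, 256, 256, 14, 14)
print(f"reduction: {ratio:.2f}x")  # reduction: 8.69x
```

The ratio equals 1 / (1/C_out + 1/K²), so for 3×3 kernels it approaches 9 as the number of output channels grows.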

III. PROPOSED METHOD
A. The point-wise interchannel operation
As shown in Fig. 1, the point-wise operation is applied to the feature maps across channels. The input feature maps are first divided into groups of C feature maps. A new feature map is then extracted point by point over the C feature maps in each group. Therefore, the parameter C can be regarded as a hyperparameter for adjusting the ratio of channel compression. As C gets larger, the resulting construction becomes more compact.
The input feature maps and the output feature maps of the point-wise interchannel operation are denoted as $I \in \mathbb{R}^{N \times W \times H}$ and $O \in \mathbb{R}^{N' \times W \times H}$, where $N$ and $N' = N/C$ are the channel numbers of the input and output feature maps, and $W$ and $H$ are the width and height of the feature maps, respectively. Each pixel on the output feature maps is calculated independently from the values at the identical position across channels. Thus, the point-wise interchannel operation at position $(x, y)$ of the $n$-th output feature map is
$$O_n(x, y) = T\big(I_{nC}(x, y), I_{nC+1}(x, y), \ldots, I_{(n+1)C-1}(x, y)\big) \tag{1}$$
Here $T(\cdot)$ represents the point-wise operation across the channels ranging from $nC$ to $(n+1)C-1$. The adopted point-wise operation can be divided into non-linear and linear manners. The non-linear manner, which combines $C$ feature maps and outputs the element-wise maximum, is defined as:
$$O_n(x, y) = \max_{nC \le k \le (n+1)C-1} I_k(x, y) \tag{2}$$
The gradient of Eq. (2) takes the following form:
$$\frac{\partial O_n(x, y)}{\partial I_k(x, y)} = \begin{cases} 1, & I_k(x, y) = O_n(x, y) \\ 0, & \text{otherwise} \end{cases}, \quad nC \le k \le (n+1)C-1 \tag{3}$$
Likewise, the linear manner is defined as:
$$O_n(x, y) = \frac{1}{m} \sum_{k=nC}^{(n+1)C-1} I_k(x, y) \tag{4}$$
Here $m$ is set to 1 when the sum method is applied, and to $C$ otherwise (the average method). The gradient of Eq. (4) can be written as follows:
$$\frac{\partial O_n(x, y)}{\partial I_k(x, y)} = \frac{1}{m}, \quad nC \le k \le (n+1)C-1 \tag{5}$$
Because the point-wise operation can be processed simultaneously in different groups, it is well-suited for parallel computation on modern processors. Compared with 1×1 convolution, which performs a weighted linear recombination across all the input feature maps, each output feature map produced by the point-wise operation is calculated from the local information of the grouped input feature maps with no extra learnable weights. Thus, the point-wise interchannel operation is capable of saving considerable parameters and computational resources.
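The operation and its gradient can be illustrated with a minimal NumPy sketch. It assumes a channels-first array of shape N×W×H with N divisible by C; the function names and the `mode` argument are hypothetical, and ties in the max method are resolved in favor of the first channel, which is one possible convention rather than something the text prescribes.

```python
import numpy as np


def pointwise_interchannel(x, c, mode="max"):
    """Squeeze the channel axis of x (shape N x W x H) by a factor of c.

    Channels are split into N // c consecutive groups of c channels, and each
    group is reduced position by position with max, sum or average.
    """
    x = np.asarray(x, dtype=float)
    n, w, h = x.shape
    if n % c != 0:
        raise ValueError("channel count must be divisible by c")
    groups = x.reshape(n // c, c, w, h)  # group n holds channels nC .. (n+1)C-1
    if mode == "max":
        return groups.max(axis=1)
    if mode == "sum":
        return groups.sum(axis=1)
    if mode == "avg":
        return groups.mean(axis=1)
    raise ValueError(f"unknown mode: {mode}")


def pointwise_interchannel_grad(x, c, mode="max"):
    """Derivative of each output pixel w.r.t. the inputs of its group (same shape as x)."""
    x = np.asarray(x, dtype=float)
    n, w, h = x.shape
    groups = x.reshape(n // c, c, w, h)
    if mode == "max":
        # 1 at the arg-max channel of each position, 0 elsewhere (first channel wins ties).
        winners = groups.argmax(axis=1)
        grad = np.zeros_like(groups)
        np.put_along_axis(grad, winners[:, None], 1.0, axis=1)
    else:
        m = 1.0 if mode == "sum" else float(c)  # constant 1/m for the linear manner
        grad = np.full_like(groups, 1.0 / m)
    return grad.reshape(n, w, h)


x = np.arange(16, dtype=float).reshape(4, 2, 2)  # 4 channels of size 2x2
print(pointwise_interchannel(x, c=2, mode="max").shape)  # (2, 2, 2)
```

Because every group is reduced independently, the reshape-and-reduce form maps directly onto vectorized or parallel hardware, and no learnable weight appears anywhere in the operation.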

B. Compact convolution
Taking advantage of the depth-wise separable convolution and the point-wise interchannel operation, a novel compact convolution layer is proposed for efficient networks. The proposed compact convolution is illustrated in Fig. 2. A depth-wise convolution is applied to each input feature map to extract spatial features. The following point-wise interchannel operation squeezes the channel dimension of the feature maps extracted by the depth-wise convolutions while maintaining their major information. Finally, a 1×1 convolution is applied for the exchange of information among channels. As one can see, there is a bottleneck construction inside the compact convolution. The bottleneck construction leaves the 1×1 layer with smaller input/output dimensions, which reduces the cost of computation. Compared with other bottleneck constructions [30] designed with 1×1 convolution, the proposed compact convolution reduces the channel dimension with less calculation and no extra learnable weights.
A standard convolution layer takes a $W_{in} \times H_{in} \times C_{in}$ feature map F as input, where $W_{in}$ and $H_{in}$ are the spatial width and height of the input feature map and $C_{in}$ is the number of input channels. It produces a $W_{out} \times H_{out} \times C_{out}$ feature map G, where $W_{out}$ and $H_{out}$ are the width and height of the output feature map and $C_{out}$ is the number of output channels. The standard convolutional layer is parameterized by a convolution kernel of size $K \times K \times C_{in} \times C_{out}$, where $K$ is the spatial dimension of the kernel (assumed to be square) and $C_{in}$ and $C_{out}$ are the numbers of input and output channels as defined previously.
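The three stages of the compact convolution described in this subsection (depth-wise convolution, point-wise interchannel operation, 1×1 convolution) can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: it assumes a channels-first layout, stride 1, no padding, no bias, and omits batch normalization and activation; all function and argument names are hypothetical.

```python
import numpy as np


def depthwise_conv(x, kernels):
    """Valid, stride-1 depth-wise convolution (cross-correlation, as in most CNN libraries).

    x: (C_in, H, W); kernels: (C_in, K, K), one spatial filter per channel.
    """
    c_in, h, w = x.shape
    k = kernels.shape[-1]
    out = np.zeros((c_in, h - k + 1, w - k + 1))
    for i in range(k):
        for j in range(k):
            # Accumulate the contribution of kernel tap (i, j) for every channel at once.
            out += kernels[:, i, j, None, None] * x[:, i:i + h - k + 1, j:j + w - k + 1]
    return out


def compact_conv(x, dw_kernels, pw_weights, c, mode="sum"):
    """Depth-wise conv -> parameter-free channel squeeze -> 1x1 conv.

    pw_weights: (C_out, C_in // c), the only learnable inter-channel weights.
    """
    feats = depthwise_conv(x, dw_kernels)
    c_in, h, w = feats.shape
    groups = feats.reshape(c_in // c, c, h, w)
    reduce = {"max": groups.max, "sum": groups.sum, "avg": groups.mean}[mode]
    squeezed = reduce(axis=1)  # (C_in // c, h, w), no parameters involved
    return np.einsum("oc,chw->ohw", pw_weights, squeezed)  # 1x1 convolution


# Hypothetical sizes: 4 input channels, 3 output channels, compact factor C = 2.
x = np.ones((4, 5, 5))
y = compact_conv(x, np.ones((4, 3, 3)), np.ones((3, 2)), c=2)
print(y.shape)  # (3, 3, 3)
```

In this toy configuration the layer holds 4·3·3 + 3·2 = 42 weights, against 3·3·4·3 = 108 for a standard 3×3 convolution with the same input and output channels.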
Based on [33], the complexities of networks are evaluated with FLOPs, i.e., the number of floating-point multiply-add operations. Let $F$ denote the FLOPs of the standard convolution. It can be computed as:
$$F = K \cdot K \cdot C_{in} \cdot C_{out} \cdot W_{out} \cdot H_{out} \tag{6}$$
Likewise, $F'$ represents the FLOPs of the compact convolution. Summing over the depth-wise convolution, the point-wise interchannel operation and the 1×1 convolution, $F'$ is calculated as:
$$F' = K \cdot K \cdot C_{in} \cdot W_{out} \cdot H_{out} + m \cdot C_{in} \cdot W_{out} \cdot H_{out} + \frac{C_{in}}{C} \cdot C_{out} \cdot W_{out} \cdot H_{out} \tag{7}$$
where $m$ is set to 1 when the maximum or the sum method is applied, and to 2 otherwise. Then the compression rate $\alpha$ of $F$ over $F'$ is obtained as:
$$\alpha = \frac{F}{F'} = \frac{K^2 \cdot C_{out}}{K^2 + m + C_{out}/C} \tag{8}$$
Since the standard 3×3 convolution is the most frequently used construction in CNN architectures, the kernel size of the compact convolution is set to 3 in the experiments. It turns out that the FLOPs of the 3×3 compact convolution are between 12 and 18 times fewer than those of the standard one, at the cost of a slight decline in accuracy, as further demonstrated in Sect. V.
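The quoted range of 12 to 18 times can be reproduced numerically under one reading of the cost model. The sketch below assumes that the compact convolution costs a K×K depth-wise convolution, plus m operations per input position for the point-wise interchannel operation, plus a 1×1 convolution over C_in/C channels; this decomposition, the function names and the channel counts are assumptions made for illustration.

```python
def standard_flops(k, c_in, c_out, w_out, h_out):
    """FLOPs of a standard k x k convolution."""
    return k * k * c_in * c_out * w_out * h_out


def compact_flops(k, c_in, c_out, w_out, h_out, c, m=1):
    """FLOPs of a compact convolution under the assumed three-term cost model."""
    depthwise = k * k * c_in * w_out * h_out
    interchannel = m * c_in * w_out * h_out          # m = 1 for max/sum, 2 for average
    pointwise = (c_in // c) * c_out * w_out * h_out  # 1x1 conv on the squeezed channels
    return depthwise + interchannel + pointwise


def compression_rate(k, c_in, c_out, w_out, h_out, c, m=1):
    """Ratio of standard FLOPs to compact FLOPs."""
    return standard_flops(k, c_in, c_out, w_out, h_out) / compact_flops(
        k, c_in, c_out, w_out, h_out, c, m
    )


# Hypothetical 3x3 layers with compact factor C = 2 and a 16x16 output.
for channels in (64, 128, 256, 512):
    rate = compression_rate(3, channels, channels, 16, 16, c=2)
    print(channels, round(rate, 2))
# 64 13.71
# 128 15.57
# 256 16.7
# 512 17.32
```

Under this model the rate depends only on K, m, C and the number of output channels, and it approaches K²·C = 18 for C = 2 as the layer gets wider.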

C. Application in network architectures
Since the compact convolution is a sparse version of the standard convolution, it can be embedded into general network architectures by simply replacing standard convolutions with compact convolutions. In this work, three different networks with compact convolutions are proposed as follows.
VGG-like. Following the design principle of VGG net [34], a block consisting of two 3×3 convolutional layers is used as the basic building block. Considering the limitation of dataset size, an eight-layer stacked convolutional model is adopted in the proposed VGG-like networks. The standard convolution is used for the first two convolutional layers, and compact convolutions are used for the other six convolutional layers. All the convolutional layers are followed by batch normalization [35] and ReLU non-linear activation [36].
ResNet-like. The bottleneck design, which has demonstrated the desired performance in [30], is adopted in our proposed networks. Different from [30], it is unnecessary to append a 1×1 convolution after the 3×3 convolution in the bottleneck block, because our compact convolution itself includes a 1×1 convolution.
MobileNet-like. To build MobileNet-like networks, depth-wise convolutions and point-wise convolutions are replaced with compact convolutions. In [23], the input channel number of a given depth-wise separable convolution with width multiplier α is reduced from $C_{in}$ to $\alpha C_{in}$. Likewise, its output channel number is reduced to $\alpha C_{out}$. Therefore, the model complexity with width multiplier α decreases by roughly $\alpha^2$. Since our compact convolution adjusts the number of channels through the point-wise interchannel operation, the width multiplier of the depth-wise convolution is fixed to 1 so as not to interfere with the experiments.
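The "roughly α²" scaling of the width multiplier can be verified with a small calculation. The sketch below counts multiply-adds of a depth-wise separable convolution whose input and output channels are both thinned by α, as in [23]; the helper name and the layer sizes are illustrative assumptions.

```python
def separable_madds(k, c_in, c_out, w_out, h_out, alpha=1.0):
    """Multiply-adds of a depth-wise separable convolution thinned by width multiplier alpha."""
    a_in, a_out = int(alpha * c_in), int(alpha * c_out)
    depthwise = k * k * a_in * w_out * h_out
    pointwise = a_in * a_out * w_out * h_out
    return depthwise + pointwise


# Hypothetical layer: 3x3 kernel, 256 -> 256 channels, 14x14 output.
full = separable_madds(3, 256, 256, 14, 14)
thin = separable_madds(3, 256, 256, 14, 14, alpha=0.25)
print(f"{thin / full:.4f} vs alpha^2 = {0.25 ** 2:.4f}")  # 0.0689 vs alpha^2 = 0.0625
```

The small gap from α² comes from the depth-wise term, which scales with α rather than α².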

D. Analysis in the training stage
Different types of the point-wise operation have different effects among channels in both the inference and the backpropagation stages. As Eq. (4) and Eq. (5) show, the sum and the average methods process the feature maps among channels in the same way, differing only in the constant weight 1/m. Therefore, the point-wise operation can be divided into linear and non-linear manners according to the interchannel processing. Empirically, the linear manner tends to preserve the major information among local channels, while the non-linear one tends to extract prominent features among local channels. The accuracy and cross-entropy loss of the three different point-wise interchannel operations on the DCASE 2019 dataset are shown in Fig. 3. The max method converges more slowly than the other two methods on both the training dataset and the validation dataset. In addition, the curves produced by the sum and the average methods are similar, because both of them compress the information among channels in the linear manner.

IV. GENERALIZATION IN MULTIMEDIA
To assess their capacity for generalization across media, the proposed networks are applied to three different tasks: acoustic scene classification (ASC), sound event detection (SED) and image classification (IC). ASC and SED take 2-D time-frequency spectrograms as inputs to the CNN classifier, while IC directly takes images as inputs. An acoustic scene here refers to a mixture of background noise and sound events associated with a specific audio scenario. Compared with SED, ASC therefore tends to discriminate with more abstract and global features.
ASC aims at enabling devices to recognize the specific audio environment from a recording or an on-line stream. To solve this problem, the proposed networks are trained and evaluated on the development dataset of TAU Urban Acoustic Scenes 2019 [37] in DCASE 2019 task 1. The dataset contains several acoustic scenes and various locations for each scene. The original recordings, sampled at 44.1 kHz, are segmented into 10-second clips. The dataset consists of 10 scene classes, including airport, shopping mall, metro station, street pedestrian, public square, street traffic, tram, bus, metro and park.
To facilitate training of the proposed models, the raw binaural waveforms are first downmixed to mono. Then log-scaled mel-spectrograms are extracted from each audio waveform with a Hamming window of 1724 samples (corresponding to about 0.04 s), an overlap of 50%, and 128 mel bands. Therefore, a feature map with a size of 128×512 is generated for each audio waveform. The features are finally normalized with z-scores and fed into the proposed models.
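The stated feature size of 128×512 can be checked from the framing parameters. The sketch below assumes a hop of half the window (50% overlap) and centre-padded framing, so that a clip of n samples yields 1 + ⌊n / hop⌋ frames; it also assumes z-score normalization over the whole feature map, which the text does not specify. All names are illustrative.

```python
import numpy as np

SAMPLE_RATE = 44_100   # Hz, from the dataset description
CLIP_SECONDS = 10
WINDOW = 1724          # samples
HOP = WINDOW // 2      # 50% overlap -> 862 samples
N_MELS = 128


def n_frames(n_samples, hop):
    """Frame count under centre-padded framing (an assumption)."""
    return 1 + n_samples // hop


def zscore(features, eps=1e-8):
    """Normalize to zero mean and unit variance over the whole map (assumed global)."""
    features = np.asarray(features, dtype=float)
    return (features - features.mean()) / (features.std() + eps)


frames = n_frames(SAMPLE_RATE * CLIP_SECONDS, HOP)
print((N_MELS, frames))                # (128, 512)
print(round(WINDOW / SAMPLE_RATE, 4))  # 0.0391
```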
SED aims to detect and classify events that occur in different environments. To solve this problem, the proposed networks are trained and evaluated on UrbanSound8K [38]. The dataset contains 8732 labeled sound clips of urban sounds from 10 classes, including air conditioner, car horn, children playing, dog bark, drilling, engine idling, gun shot, jackhammer, siren and street music. Different from DCASE 2019, the length of the clips varies from 0 s to 4 s. The pre-processing for SED training is similar to that for ASC, except that zero padding is adopted to unify the length of the raw waveforms.
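Zero padding to a common length can be sketched as below. The example assumes that padding is appended at the end of each clip and that the target length is 4 s; the sample rate and the helper name are hypothetical, since the text does not state them for this dataset.

```python
import numpy as np


def pad_to_length(wave, target_len):
    """Right-pad a 1-D waveform with zeros (or truncate it) to target_len samples."""
    wave = np.asarray(wave, dtype=float)
    if len(wave) >= target_len:
        return wave[:target_len]
    return np.pad(wave, (0, target_len - len(wave)))  # constant zeros by default


SAMPLE_RATE = 44_100           # assumed for illustration only
TARGET_LEN = 4 * SAMPLE_RATE   # longest clip length mentioned in the text

short_clip = np.ones(SAMPLE_RATE)  # a hypothetical 1-second clip
padded = pad_to_length(short_clip, TARGET_LEN)
print(padded.shape)  # (176400,)
```

Unifying the length in this way lets every clip produce a spectrogram of the same size, so the same network input shape can be used for all samples.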
IC is a classical problem in computer vision. To evaluate the performance of our models on IC, CIFAR-10 is utilized for further experiments in Sect. V. CIFAR-10 contains 60000 32×32 color images from 10 non-overlapping classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship and truck. Apart from normalization, which is applied for better convergence, no pre-processing is used. The proposed networks are trained on 50000 samples and validated on 10000 samples.

A. Experimental setup
The proposed CompactNets with the sum, the max and the average methods are referred to as CompactNet-S, CompactNet-M and CompactNet-A, respectively. Since the compact convolution is applicable to most common network architectures, the proposed CompactNets are built with the same constructions as three different comparison networks: VGG-8, ResNet and MobileNets. In addition, XVGG-8 is obtained from the VGG-like CompactNet by replacing its compact convolutions with depth-wise separable convolutions, and serves as a further baseline for the VGG-like CompactNet. To evaluate the efficiency of our CompactNets, several efficient convolutional neural networks (MobileNet v2, ShuffleNet v1 and ShuffleNet v2) are built for comparison. Since only the convolutional layers change in the following comparison experiments, only the FLOPs of convolutions and of our point-wise interchannel operations are taken into account. The above networks are trained by minimizing the cross-entropy loss with the Adam optimizer. The learning rate and batch size are set to 0.001 and 32, respectively.
All the experiments are implemented in Python and conducted on a computer with an Intel(R) Xeon(R) CPU E5-2650 v4 at 2.20 GHz and an Nvidia RTX 2080Ti GPU. The proposed models are evaluated with TensorFlow.

B. Algorithmic complexity
The parameters, complexities and speeds of different models are listed in Table 1. The FLOPs of compact convolutions with the max, the sum and the average methods are given separately. For easier comparison, the results are grouped by network architecture. Except for the MobileNet-like networks on CPU, CompactNets (C=8) are the fastest on both CPU and GPU among the networks with identical structures. Specifically, the VGG-like CompactNet (C=8) runs 1.95× and 1.23× faster than VGG-8 on CPU and GPU respectively. The ResNet-like CompactNet (C=8) runs 1.57× and 1.14× faster than ResNet on CPU and GPU respectively. It turns out that 0.25 MobileNet v1 is faster than the MobileNet-like CompactNets. This is because the complexity of 0.25 MobileNet v1 is merely half that of the MobileNet-like CompactNet (C=8). The non-linear reduction of parameters and FLOPs is caused by the other, unchanged convolutions in the networks, such as the first two standard convolutions in the VGG-like networks. Similarly, the complexity changes little among the three proposed ResNet-like CompactNets, because only the one 1×1 convolution at the end is compacted while the other 1×1 convolutions are unchanged. Table 2 lists the parameters, computational complexity and speeds of several efficient convolutional neural networks. The MobileNet-like CompactNet (C=2) is the fastest on both CPU and GPU. Calculation complexity vs. speed on the two platforms is shown in Fig. 4. Our proposed CompactNets lie in the top-right region in both cases. It is worth noting that the indirect metric (complexity) is inconsistent with the direct one (speed), e.g., the difference between CompactNet (C=2) and 1.0 MobileNet v2. This result conforms to the finding in [26]: besides FLOPs, memory access cost (MAC) and optimized operations on specific platforms should also be taken into consideration.
Table 3 shows the accuracy of different models on the ASC task on DCASE 2019. The proposed CompactNet-S [...]
[Table I: Comparison of several models over parameters, computational complexity and speed on two platforms and three types of network architectures. The results of our CompactNets with the [...]]
[...] with the decrease of the scale factor. In contrast, when the compact factor C increases, the variation in the accuracy of our CompactNet is small. This indicates that the point-wise interchannel operation can squeeze the channel dimension of feature maps while retaining the useful information in the features. Fig. 6 illustrates the internal feature maps resulting from the three different point-wise interchannel operations. The max method gives clearer detail than the average method. This indicates that the max method can extract the iconic features from inputs, while the average method tends to retain the major information in feature maps. In addition, the distribution of features with the average method is identical to that with the sum method. This phenomenon accords with the analysis in Sect. III A and B.

D. Comparison between linear and non-linear manners
In Fig. 7, the accuracy variations of CompactNets with the three different point-wise interchannel operations are illustrated. As the compact factor C increases, the performance of CompactNet-S remains consistent with the performance of CompactNet-A. Combining the analysis in Sect. III A and D, we can summarize several guidelines:
G1) The average method and the sum method among channels work in the same way. Taking the FLOPs of these two methods into consideration, the average operation can be replaced with the sum operation to squeeze the channel dimension of input feature maps.
G2) The non-linear operation is relatively hard to converge, and it tends to yield desirable performance with small compact factors. The maximum method extracts the maximum value within a group and discards the remaining ones. As the compact factor C gets large, this non-linear mapping loses a large amount of characteristic information, which leads to a rapid deterioration in performance.
G3) The linear operation is relatively easy to converge, and it tends to outperform other methods in the case of large compact factors. In contrast to the maximum method, the average and sum methods preserve most of the information by arithmetic averaging. This facilitates model compression with a large compact factor.
These three guidelines can not only help researchers utilize CompactNets, but also expose the role of different operations in CNNs.

E. Extend to other tasks
Based on G1, only the sum method and the max method, corresponding to the linear manner and the non-linear one respectively, are discussed in this subsection. Table 4 lists the computational complexity and the accuracy in two different tasks. It turns out that our proposed CompactNets produce satisfying results in SED and IC. In SED, our CompactNet-S (C=2) and CompactNet-M (C=2) surpass the competing models among ResNet-like models and MobileNet-like models by 1.58% and 0.96% respectively. Compared with XVGG-8, which consists of separable convolutions, CompactNet-M (C=2) still achieves higher accuracy by 1.92%. In IC, CompactNet-S (C=2) outperforms ResNet by about 1.48%. It is worth noting that XVGG-8 and 1.0 MobileNet v1 yield better results than CompactNets by 1.06% and 1.2%. This is because the number of samples in CIFAR-10 is large, and each 32×32 sample is easy to learn. Therefore, the input feature maps have less leeway to be squeezed.

VI. CONCLUSION
In this paper, a novel convolutional construction was proposed for implicitly reducing feature redundancy, in which the point-wise interchannel operation was adopted to squeeze the number of channels of feature maps. The depth-wise separable convolution and the point-wise interchannel operation were integrated to speed up calculations while retaining satisfactory performance. Unlike traditional methods for dimension reduction in CNNs, which introduce considerable learnable weights, our compact convolution can squeeze the channel dimension of feature maps with no extra parameters. Moreover, we showed its capacity for generalization on three different tasks: acoustic scene classification, sound event detection and image classification. Extensive experimental results demonstrated that the proposed method can not only cut down the run time on CPU and GPU but also produce promising performance.
In the future, we will investigate proper alternatives to the current convolutional construction with less complexity, and applications to other general multimedia tasks.