REAF: Reducing Approximation of Channels by Reducing Feature Reuse Within Convolution

High-level feature maps of Convolutional Neural Networks are computed by reusing their corresponding low-level feature maps, which exploits feature reuse to improve computational efficiency. We refer to this form as feature reuse between convolutional layers. The second form is feature reuse within the convolution, where the channels of the output feature maps are computed by reusing the same channels of the input feature maps, which results in an approximation of the channels of the output feature maps. To compute them accurately, we need specialized input feature maps for every channel of the output feature maps. In this paper, we first discuss the approximation problem introduced by full feature reuse within the convolution and then propose a new feature reuse scheme called Reducing Approximation of channels by Reducing Feature reuse (REAF). We also show that group convolution is a special case of our REAF scheme and analyze the advantage of REAF over group convolution. Moreover, we develop the REAF+ scheme and integrate it with group convolution-based models. Compared with baselines, experiments on image classification demonstrate the effectiveness of our REAF and REAF+ schemes. Under a given computational complexity budget, the Top-1 accuracy of REAF-ResNet50 and REAF+-MobileNetV2 on ImageNet increases by 0.37% and 0.69%, respectively. The code and pre-trained models will be publicly available.


I. INTRODUCTION
Convolutional Neural Networks (CNNs) have achieved a series of breakthroughs on non-trivial visual tasks [1]-[5]. The features of a dataset can be learned by CNNs in an end-to-end manner with minimal human effort and can be transferred to diverse visual tasks [6]. Accordingly, researchers are dedicated to designing better networks for learning representations [7]-[10] instead of handcrafting features.
To enrich the representational power of CNNs, recent work investigates various aspects of CNN architecture [11]-[14]. Constructing deep CNNs by stacking building blocks of the same shape is an effective strategy [11], since higher layers learn more abstract and invariant representations [15], [16]. Building on this, networks with skip connections [12], [17] enable the training of CNNs with extreme depth. Inspired by the Hebbian principle and multi-scale processing, multi-branch CNNs [18] achieve compelling accuracy when the topology of each branch is carefully designed, and the multiple branches are expected to approximate large, dense layers with powerful representational capability. Besides depth, the width of a network is an essential dimension for increasing model capability [19]. Exposing the new dimension of cardinality [20] or deploying new approaches [8], [21], [22] can enlarge the representational ability of a model. To address the spatial locality limitation of convolution, attention mechanisms [21], [23] are used to capture larger-range feature interactions. Using an automated strategy, Neural Architecture Search achieves state-of-the-art accuracy [22] and platform-aware efficiency [24].
When it comes to the efficiency of designing CNNs, feature reuse is key to making it feasible [16], [25]-[27]. More precisely, the features computed by earlier layers are reused by later layers; this is feature reuse between convolutional layers, and it is common in both plain networks [28] and multi-branch networks [18]. Maximizing feature reuse between convolutional layers is an important principle in designing these networks: deeper CNNs encourage more feature reuse, which motivated the design of VGGNet [28] and ResNet [17]. Deep CNNs with identity mappings suffer from diminishing feature reuse [29], which motivated the development of wide ResNet [19], ResNet with stochastic depth [30], and DenseNet [31]. Feature reuse also plays a critical role within the convolution, in addition to between convolutional layers. Specifically, all the channels of the output activations of a convolution are computed by reusing all, and the same, channels of the input activations.
Feature reuse is at the heart of the theoretical advantages of deep learning and explains the power of distributed representations [16]. However, it also limits the representational capability of networks, since the reused features are only an approximation of the accurate features. Reducing or eliminating feature reuse in a given model yields comparable or higher accuracy, which supports this view of the approximation drawback of feature reuse. Some research has investigated eliminating feature reuse between convolutional layers. For example, CondenseNet [25] removes connections between layers to avoid superfluous feature reuse in the network architecture. Besides, the lottery ticket hypothesis [32], selective allocation of channels [33], pruning [34], [35], and group convolution [13], [36] can all be regarded as methods for reducing feature reuse within the convolution, as analyzed in our work, although this was not pointed out in their original presentations. We are the first to point out that feature reuse within the convolution leads to the problem of approximation of the channels. Thus, by reducing feature reuse we can reduce the approximation within the convolution, which makes the calculation of the channels more accurate.
Our main contributions are summarized as follows.
• To our best knowledge, we are the first to point out and analyze the approximation problem introduced by the feature reuse within the convolution. To solve the problem, we propose the Reducing Approximation of channels by Reducing Feature reuse (REAF) scheme, which is a moderate version of feature reuse within the convolution.
• We compare our REAF scheme with group convolution and show that there are more merged channels in our REAF scheme than those in group convolution even though group convolution is a special case of our REAF scheme.
• We develop our REAF+ scheme with BN and ReLU layers as the parameterized operations and integrate it with group convolution-based models.
• We use extensive experiments to demonstrate the effectiveness of our REAF and REAF+ schemes.

II. RELATED WORK
A. MULTI-BRANCH CONVOLUTIONAL NETWORKS
To ease the difficulty of training deep neural networks, an adaptive gating unit is used in Highway networks [29], which evolves into identity mapping in ResNet [17].
By replacing the identity mapping with additional residual branches, shake-shake networks [37] and multi-residual networks [38] extend ResNet to improve accuracy and speed. FractalNets [39] and Multilevel ResNets [40] expand the multiple paths in a fractal and a recursive way, respectively. The Inception series [18] aggregates multifarious multi-scale features with a careful configuration of each branch.

III. METHOD
In this section, we analyze the limitation of the convolution, i.e., the approximation of the channels caused by the full feature reuse within the convolution. To address this problem, we propose our REAF scheme for convolution, which introduces specialization to compute the channels of the output feature maps. We optimize the REAF scheme and compare it with group convolution since group convolution is a special case of our scheme. Finally, we develop the REAF+ scheme to improve the performance of the group convolution-based models.

A. PROBLEM DEFINITION
The output activations $O \in \mathbb{R}^{C_{out} \times H \times W}$ of the convolution are obtained by convolving the input activations $I \in \mathbb{R}^{C_{in} \times H \times W}$ with the weights $W \in \mathbb{R}^{C_{out} \times C_{in} \times h \times w}$, where the batch size $N$ is omitted. The channel-wise representation of the output activations is $O = [O_0, O_1, \ldots, O_{C_{out}-1}]$, where $O_j$ refers to the $j$-th channel of the output activations $O$ and $j = 0, \ldots, C_{out} - 1$.
To study feature reuse for every individual channel, we construct a variable $A \in \mathbb{R}^{C_{out} \times C_{in} \times H \times W}$, where $A_j$ is the input activation used to compute the $j$-th channel $O_j$ of the output activations. The feature reuse within the convolution is shown on the left of Fig. 1, where we reuse all the channels of the input activations $I$ to calculate every individual channel of the output activations $O$. Therefore, $O_j$ is convolved from the input activations $A_j$ and the weights $W_j$ as $O_j = A_j * W_j = \sum_{i=0}^{C_{in}-1} A_{j,i} * W_{j,i}$ with $A_j = I$, where $i = 0, \ldots, C_{in} - 1$ indexes the input channels. The information in a CNN transitions gradually from spatial coding to channel coding through a hierarchy of representations. Regarding feature reuse between convolutional layers, this learned hierarchical representation makes it reasonable to save computational complexity, since the high-level features are composed of the low-level features. However, for feature reuse within the convolution, every channel of the output activations is computed by reusing the same input activations, so the reused input activations are only approximate tensors for all the channels of the output activations. As every channel of a feature map acts as a feature detector [52], the input feature maps used to compute every channel of the output feature maps should be customized and specialized to make the computation more accurate.
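As an illustration, the following minimal PyTorch sketch (with hypothetical layer sizes, not taken from the paper) makes the full feature reuse explicit: every output channel is computed from the same, complete set of input channels.

```python
import torch
import torch.nn.functional as F

# Full feature reuse within a standard convolution: every output channel O_j
# is computed from the SAME input activations A_j = I, i.e. all C_in channels.
C_in, C_out, H, W, k = 8, 16, 32, 32, 3       # hypothetical sizes
I = torch.randn(1, C_in, H, W)                # input activations I (batch size 1)
Wt = torch.randn(C_out, C_in, k, k)           # convolution weights W

O = F.conv2d(I, Wt, padding=k // 2)           # standard convolution

# Channel-wise view: O_j = I * W_j, reusing all of I for every j.
O_ref = torch.cat(
    [F.conv2d(I, Wt[j:j + 1], padding=k // 2) for j in range(C_out)], dim=1)
assert torch.allclose(O, O_ref, atol=1e-5)
```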

B. REDUCING APPROXIMATION OF CHANNELS BY REDUCING FEATURE REUSE
The approximation problem of the channels introduced by feature reuse within the convolution can be expressed as $A_j = A_{j'}$ for all $j, j' = 0, \ldots, C_{out} - 1$. To introduce specialized input feature maps for every channel of the output feature maps, a straightforward scheme is shown in the middle of Fig. 1, where there is no feature reuse within the convolution at all. The input activations are divided into $G_I$ groups, where $C_{in} = G_I \times A_{in}$, $A_{in}$ is the number of channels in every group used to compute one channel of the output activations, and $G_I = C_{out}$. The input activations used to compute the $j$-th channel of the output activations are $A_j = \mathrm{Index}(I, j)$, where $A_j$ is part of the input activations $I$ and $\mathrm{Index}$ is the function that indexes the $j$-th group from $I$.
In this way, we do not reuse any channel of the input activations $I$ when computing the channels of the output activations $O$, and $O_j$ is convolved from the input activations $A_j$ and the weights $W_j$ as $O_j = A_j * W_j$.
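A sketch of this no-reuse extreme (again with hypothetical sizes) shows that it coincides with a grouped convolution whose number of groups equals $C_{out}$:

```python
import torch
import torch.nn.functional as F

# "No feature reuse" extreme: the input is split into G_I = C_out groups of
# A_in channels each, and O_j is convolved only with its own group A_j = Index(I, j).
C_out, A_in, H, W, k = 8, 4, 32, 32, 3        # hypothetical sizes
G_I = C_out
C_in = G_I * A_in                             # C_in = G_I x A_in
I = torch.randn(1, C_in, H, W)
Wt = torch.randn(C_out, A_in, k, k)           # each filter only sees its A_in channels

O = torch.cat([
    F.conv2d(I[:, j * A_in:(j + 1) * A_in], Wt[j:j + 1], padding=k // 2)
    for j in range(C_out)
], dim=1)

# Equivalent to a grouped convolution with groups = G_I = C_out.
assert torch.allclose(O, F.conv2d(I, Wt, padding=k // 2, groups=G_I), atol=1e-5)
```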
Considering computational complexity and distributed representations, it is inadvisable to remove feature reuse totally within the convolution. To keep feature reuse and introduce customized input activations to compute the channels, we introduce our scheme called REAF, which is a moderate version of feature reuse for convolution as shown in the right part of Fig. 1.
All the channels of the input activations $I$ and the output activations $O$ are divided into $G_I$ and $G_O$ groups respectively, and in the illustration in the right part of Fig. 1 there is only one channel in every group. Since it is hard to obtain prior knowledge about how many times each group of the input activations should be reused, we keep the computation of the channels homogeneous: every possible combination of $G_M$ input groups out of the $G_I$ groups is used to compute exactly one output group, so that $\binom{G_I}{G_M} = G_O$. The reused input activations for the $l$-th output group are $A_l = \mathrm{Index}(I, E_l)$, where $\mathrm{Index}$ is the function that indexes the $G_M$ groups of $I$ specified by $E_l$ and concatenates them together.
To compute different groups of the output activations, the reused $G_M$ input groups differ from one output group to another. $O_j$ is convolved from the input activations $A_j$ and the weights $W_j$ as $O_j = A_j * W_j$.
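The following is a minimal sketch of the REAF scheme as we read it from the formulation above; the module name REAFConv2d and the per-group convolutions are our own assumptions, not the authors' released code.

```python
import itertools
import torch
import torch.nn as nn

class REAFConv2d(nn.Module):
    """Sketch of the REAF scheme: the C_in input channels are split into G_I
    groups, each of the G_O output groups is convolved with a distinct
    combination of G_M input groups, and binom(G_I, G_M) must equal G_O."""

    def __init__(self, c_in, c_out, k, g_i, g_m, g_o, stride=1):
        super().__init__()
        combos = list(itertools.combinations(range(g_i), g_m))
        assert len(combos) == g_o, "C(G_I, G_M) must equal G_O"
        assert c_in % g_i == 0 and c_out % g_o == 0
        self.g_i, self.combos = g_i, combos
        in_per_group = c_in // g_i
        out_per_group = c_out // g_o
        # One small convolution per output group, over its G_M reused input groups.
        self.convs = nn.ModuleList([
            nn.Conv2d(g_m * in_per_group, out_per_group, k,
                      stride=stride, padding=k // 2, bias=False)
            for _ in range(g_o)
        ])

    def forward(self, x):
        groups = torch.chunk(x, self.g_i, dim=1)             # split I into G_I groups
        outs = []
        for conv, combo in zip(self.convs, self.combos):
            a_l = torch.cat([groups[i] for i in combo], dim=1)  # A_l = Index(I, E_l)
            outs.append(conv(a_l))
        return torch.cat(outs, dim=1)

# Example configuration with G_I = 4, G_M = 3, G_O = 4 (the 4-3-4 setting used
# later); the width 72 is illustrative.
layer = REAFConv2d(c_in=72, c_out=72, k=3, g_i=4, g_m=3, g_o=4)
print(layer(torch.randn(2, 72, 56, 56)).shape)   # torch.Size([2, 72, 56, 56])
```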
C. OPTIMIZING THE CONFIGURATIONS OF OUR SCHEME
Given a convolution with a computational complexity budget, the configuration $G_I$-$G_M$-$G_O$ can be optimized, since it influences the feature reuse within the convolution and the number of merged channels. When $G_I - G_M$ remains unchanged and $G_O$ increases, the feature reuse within each group decreases, since every group of the output activations is computed by reusing different input activations; on the contrary, the feature reuse between groups increases, since the number of channels that differ between the input activations reused by different output groups decreases. When $G_O$ remains unchanged and $G_I - G_M$ increases, the feature reuse within the convolution decreases, but the number of merged channels also decreases. In this paper, we focus on the extreme case $G_M = G_I - 1$, since $G_M = 1$ (i.e., group convolution) has already been explored in [20].
Given a convolutional neural network, we apply our scheme to its convolutions and optimize the configuration. Taking ResNet50 as an example, the overall REAF-ResNet50 architecture, i.e., ResNet50 with our scheme applied, is listed in Table 1. We keep the topology and computational complexity of the model unchanged for a fair comparison; we adopt the REAF scheme for the middle convolution of the bottleneck and adjust its width.
TABLE 1. ResNet50 and REAF-ResNet50 with a 4-3-4 template using the reformulation of the REAF scheme. Inside the brackets is the shape of a residual block, and outside the brackets is the number of stacked blocks in a stage. "C = 72" refers to the base width of the model. "4-3-4" indicates that the REAF scheme with the configuration $G_I = 4$, $G_M = 3$, $G_O = 4$ is applied to the convolution. The numbers of parameters and FLOPs are comparable between the two models.
The experimental results of REAF-ResNet50 on CIFAR-100 classification are presented in Table 3. $C$ and $C'$ refer to the base numbers of input and output channels of the middle convolution of the bottleneck, and $C' = C$ by default. When the REAF scheme is applied to the middle convolution, the accuracy improves over a wide range of configurations. When $G_I - G_M = 1$, REAF-ResNet50 with $G_I = 4$ achieves the best Top-1 accuracy among all the variants, 1.48% better than the baseline. As the number of groups increases or decreases away from $G_I = 4$, the accuracy degrades, which suggests that $G_I = 4$ is an optimized configuration of the REAF scheme for ResNet50.

D. ADVANTAGE OVER GROUP CONVOLUTION
Group convolution [20] has been widely adopted in the design of CNNs since it exposes a new dimension, i.e., the size of the set of transformations. Meanwhile, group convolution also benefits from reducing feature reuse within the convolution, as explained in our work, which was not pointed out in the original reference [20]. Group convolution is a special case of our REAF scheme with $G_M = 1$. The number of merged channels in our proposed scheme is $G_M$ times that in group convolution, i.e., $G_M - 1$ times larger. Note also that the number of merged channels in the REAF scheme with the optimized configuration is much larger than that in group convolution with the optimized number of groups $G$. Based on the experiments in [20], a larger cardinality is the more effective choice for a given model, and the number of merged channels is 4 in group convolution with the optimized number of groups $G = 32$. In contrast, in our proposed scheme with the optimized configuration 4-3-4, the number of merged channels is 54.
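A back-of-the-envelope check of the merged-channel counts quoted above, assuming "merged channels" denotes the width of the reused input activations per output group, i.e. $(C_{in}/G_I) \cdot G_M$ (our reading), and assuming the 128-channel middle convolution of a ResNeXt50 stage as the group-convolution example:

```python
def merged_channels(c_in, g_i, g_m):
    # Assumed definition: channels of the reused input activations per output group.
    return (c_in // g_i) * g_m

# Group convolution (G_M = 1): a 128-channel middle convolution with G = 32 groups.
print(merged_channels(128, 32, 1))   # -> 4
# REAF with the optimized 4-3-4 configuration and base width C = 72 (Table 1).
print(merged_channels(72, 4, 3))     # -> 54
```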

E. REAF+ SCHEME
We propose the REAF+ scheme, which adopts parameterized or parameter-free operations to enable $A_j \neq A_{j'}$. By applying the REAF+ scheme, the performance of group convolution-based or REAF-based models improves further.
Taking the parameterized operations of BN and ReLU layers as an example, the $j$-th output group is convolved from the output activations $\tilde{O}$ of the previous convolution and the weights $W_j$ as $O_j = \mathrm{ReLU}_j(\mathrm{BN}_j(\tilde{O})) * W_j$, where $g_O$ pairs of BN and ReLU layers are introduced when the output activations are divided into $g_O$ groups.
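The module below is a minimal sketch of this idea as we read it (names and structure are our assumptions, not the authors' code): each output group sees the same previous-layer activations but through its own BN and ReLU pair, so the effective inputs differ across groups.

```python
import torch
import torch.nn as nn

class REAFPlusConv2d(nn.Module):
    """Sketch of REAF+: g_O group-specific BN + ReLU pairs applied to the same
    previous-layer activations, followed by one convolution per output group."""

    def __init__(self, c_in, c_out, k, g_o, stride=1):
        super().__init__()
        assert c_out % g_o == 0
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.BatchNorm2d(c_in),                 # group-specific BN
                nn.ReLU(inplace=True),                # group-specific ReLU
                nn.Conv2d(c_in, c_out // g_o, k,
                          stride=stride, padding=k // 2, bias=False),
            )
            for _ in range(g_o)
        ])

    def forward(self, x):
        # x is the output activations of the previous convolution.
        return torch.cat([branch(x) for branch in self.branches], dim=1)

layer = REAFPlusConv2d(c_in=64, c_out=128, k=1, g_o=2)   # e.g. a pointwise conv
print(layer(torch.randn(2, 64, 28, 28)).shape)           # torch.Size([2, 128, 28, 28])
```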

IV. EXPERIMENTS
A. EXPERIMENTS ON CIFAR-100 CLASSIFICATION
We apply our scheme to ResNet models and conduct experiments on the low-resolution CIFAR-100 dataset. We adopt random crop translation and horizontal flipping for data augmentation. We train for 200 epochs in total, and the learning rate decays at epochs 60, 120, and 160 by a factor of 0.2. As presented in Table 4, the Top-1 accuracy of our REAF-ResNet outperforms that of the ResNet baseline at various depths with comparable computational complexity. In particular, our REAF-ResNet101 achieves a 1.76% Top-1 accuracy improvement over the ResNet101 baseline.

B. EXPERIMENTS ON ImageNet CLASSIFICATION
To show the performance of our proposed scheme on high-resolution images and large datasets, we experiment with ResNet and REAF-ResNet on ImageNet classification.
We train for 100 epochs in total with a batch size of 256. The learning rate starts at 0.1 and decays by a factor of 0.1 every 30 epochs. The weight decay is 1e-4 and the momentum is 0.9. Images are resized for scale augmentation, and a 224×224 crop is randomly sampled from an image or its horizontal flip, with per-pixel normalization. As summarized in Table 5, the Top-1 accuracy of our REAF-ResNet increases by 0.27%, 0.37%, and 0.25% for depths of 18, 50, and 101 layers, respectively, compared to the ResNet baseline.
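The schedule above can be reproduced with standard PyTorch components; in this sketch a plain torchvision ResNet50 stands in for REAF-ResNet50, whose definition is not reproduced in this excerpt.

```python
import torch
from torchvision.models import resnet50

# Stand-in model; the REAF-ResNet50 architecture is defined in Table 1.
model = resnet50(num_classes=1000)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
# Decay the learning rate by a factor of 0.1 every 30 epochs, for 100 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(100):
    # ... one epoch of ImageNet training with batch size 256 goes here ...
    scheduler.step()
```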

C. COMPARISONS WITH GROUP CONVOLUTION
This subsection reports experiments on the CIFAR-100 and ImageNet classification datasets to show the advantage of our proposed scheme over group convolution. We build ResNeXt and REAF-ResNet under a given computational complexity budget, and their architectures can be found in Table 2. The middle convolutions in the building bottlenecks of ResNeXt adopt group convolutions, and, similarly, all the corresponding convolutions in the building bottlenecks of REAF-ResNet adopt our proposed scheme with a configuration of 4-3-4. In this way, the difference between group convolution and our proposed scheme, introduced by the number of merged channels, can be observed clearly. On the CIFAR-100 dataset, the Top-1 accuracy of the REAF-ResNet50 model is 0.74% better than that of ResNeXt50, and the gap is 1.11% for 101 layers, as shown in Table 6. Compared to ResNeXt50, the Top-1 accuracy of REAF-ResNet50 increases by 0.74% on the ImageNet dataset.

D. EXPERIMENTS OF REAF+ SCHEME
As shown in Table 7, we apply the REAF+ scheme to group convolution-based models and conduct classification experiments to show the accuracy improvement. Applying the REAF+ scheme with $g_O = 2$ to the third convolution (i.e., the second pointwise convolution) of the bottleneck, the Top-1 accuracy of MobileNetV2 and ResNeXt50 on CIFAR-100 improves by 0.55% and 0.43%, respectively. The Top-1 accuracy of MobileNetV2 on ImageNet increases by 0.69%.

E. EFFECTS ON LEARNED REPRESENTATION
In this subsection, we compare the class selectivity index of the features in the ResNet baseline and REAF-ResNet to interpret the effect of our proposed scheme on the learned representation. The class selectivity index is computed for every channel of the feature maps as $\mathrm{selectivity} = (u_{\max} - u_{-\max})/(u_{\max} + u_{-\max})$, where $u_{\max}$ is the highest class-conditional mean activity and $u_{-\max}$ is the mean of the class-conditional mean activities over all other classes, computed over a given data distribution. Selectivity indicates the degree to which features are shared across classes, which is a central property of distributed representations, and thus measures the extent of feature reuse. Using the ImageNet validation set, we calculate the selectivity for the features of "layer1", "layer2", "layer3", "layer4", and "avgpool", which are the outputs of stage2, stage3, stage4, stage5, and AvgPool2d, respectively. One trend is that the selectivity of the features in deeper layers is higher than that in earlier layers, which conforms to the observations in [23], [54] and can be attributed to feature reuse between convolutional layers. Another trend identified in our work is that our REAF scheme increases the selectivity of the features, which verifies its reduction of feature reuse within the convolution. The selectivity distributions of ResNet and REAF-ResNet appear closely matched, so we only analyze the layer at which the distributions separate most. Figures 2 and 3 present the selectivity comparison for the features in ResNet and REAF-ResNet with 50 and 101 layers. The most distinct distribution of the class selectivity appears at "layer1" in Figures 2 and 3, and the selectivity of the "layer1" features in REAF-ResNet is higher than that in ResNet. We also compare the selectivity of the features in ResNet50 and ResNeXt50 (i.e., REAF-ResNet50 with the configuration $G_M = 1$) in Figure 4. Since ResNeXt is a special case of our REAF scheme, its learned representation is expected to show increased selectivity. The largest mismatch of the selectivity distributions between ResNet50 and ResNeXt50 falls on "layer4", where ResNeXt50 exhibits more selectivity than ResNet50.
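For reference, the sketch below computes the index above per channel; it assumes the activations have already been spatially averaged to one value per channel and that every class appears at least once in the data. The function name and interface are our own.

```python
import torch

def class_selectivity(activations, labels, num_classes, eps=1e-6):
    """Per-channel class selectivity: (u_max - u_-max) / (u_max + u_-max).

    activations: (N, C) per-channel mean activities for N samples.
    labels:      (N,) integer class labels.
    Returns a (C,) tensor with one selectivity value per channel.
    """
    # Class-conditional mean activity per channel: (num_classes, C).
    class_means = torch.stack([activations[labels == c].mean(dim=0)
                               for c in range(num_classes)])
    u_max, _ = class_means.max(dim=0)                       # highest class mean
    u_minus_max = (class_means.sum(dim=0) - u_max) / (num_classes - 1)
    return (u_max - u_minus_max) / (u_max + u_minus_max + eps)
```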

V. CONCLUSION
In this paper, we analyze the approximation problem introduced by full feature reuse within the convolution, where the channels of the output activations are computed by reusing the same input activations. To make the computation of the channels more accurate, we propose the REAF scheme, a moderate version of feature reuse for convolution, and optimize its configuration for a given CNN. We also clarify the advantage of keeping more merged channels over group convolution. Moreover, we develop the REAF+ scheme and integrate it with group convolution-based models. Finally, we present extensive classification experiments that support our analysis of the REAF and REAF+ schemes.