Compact Neural Network via Stacking Hybrid Units

As an effective tool for network compression, pruning techniques have been widely used to reduce the large number of parameters in deep neural networks (NNs). Nevertheless, unstructured pruning produces sparse and irregular weights that are difficult to handle. By contrast, structured pruning avoids this drawback but requires complex criteria to determine which components to prune. Therefore, this paper presents a new method termed BUnit-Net, which directly constructs compact NNs by stacking designed basic units, without requiring any additional judgement criteria. Given basic units of various architectures, they are combined and stacked systematically to build up compact NNs that involve fewer weight parameters due to the independence among the units. In this way, BUnit-Net can achieve the same compression effect as unstructured pruning while the weight tensors remain regular and dense. We instantiate BUnit-Net in diverse popular backbones and compare it with state-of-the-art pruning methods on different benchmark datasets. Moreover, two new metrics are proposed to evaluate the trade-off of compression performance. Experimental results show that BUnit-Net achieves comparable classification accuracy while saving around 80% of FLOPs and 73% of parameters. Thus, stacking basic units provides a new and promising way for network compression.

For deployment on mobile and edge devices, which are often resource-constrained, it is critical to reduce the number of parameters and floating-point operations (FLOPs) of DNNs [7]. To obtain more efficient models, many techniques have been explored for compression and acceleration, including pruning [8], [9], quantization [10], low-rank decomposition [11], [12], knowledge distillation [13], [14], and compact model design [15], [16]. The first four methods are usually destructive for the original model because its parameters or structure are adjusted, while compact model design is constructive, directly creating efficient models. Among them, network pruning has become a mainstream branch in both academia and industry owing to its advantages of saving resources, reducing latency, and addressing privacy concerns, showing broad prospects in numerous applications [17], [18].
Pruning frameworks aim at eliminating the redundancy of deep models by removing components of lesser importance. According to the type of removed components, there are mainly two kinds of pruning methods: unstructured pruning and structured pruning. Unstructured pruning removes the connections or neurons in weight matrices [8], [19], [20]. A common standard to determine which weights should be pruned is magnitude-based pruning, which compares the weight amplitude with a threshold [19]. First, a predefined quality parameter is multiplied by the standard deviation of the weights to calculate the threshold; then, the weights with magnitude lower than the threshold are set to zero. After all layers are pruned, the model needs to be retrained so that the remaining weights can be adjusted to compensate for the removed ones. However, this kind of low-level pruning is non-structural and may hinder actual acceleration, owing to the irregular memory-access pattern. Special software and hardware such as sparse CNN accelerators based on ASICs [21], [22] and FPGAs [23], [24] can alleviate this problem, but they also bring extra cost to model deployment.
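As a concrete illustration of the magnitude-based standard described above, a minimal NumPy sketch (the function name and the weight matrix shown are hypothetical, not from the paper):

```python
import numpy as np

def magnitude_prune(w, quality=1.0):
    """Zero out weights whose magnitude falls below quality * std(w)."""
    threshold = quality * w.std()            # quality parameter times std of weights
    return np.where(np.abs(w) < threshold, 0.0, w)

# Illustrative 4x4 weight matrix drawn from a normal distribution.
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4))
pruned = magnitude_prune(w)
print((pruned == 0).sum())  # number of removed connections
```

After pruning, the surviving weights would be retrained to compensate for the removed ones, as the text notes.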
To achieve more practical compression and acceleration, well-supported structured pruning methods such as channel and filter pruning [25], [26], [27] have been explored recently. Although structured pruning of convolution kernels and graphs can obtain hardware-friendly networks, it usually needs specific criteria or complex mechanisms to identify the irrelevant subset of components for elimination, which increases the memory requirement and computation cost. For example, reconstruction-based methods try to minimize the reconstruction error of feature maps based on the pretrained model to achieve pruning [28], [29], consuming much memory to store the feature maps of pretrained models. On the other hand, most current methods on compression and acceleration are conducted on pretrained neural networks. For example, pruning methods try to remove the unimportant components of pretrained models and usually require extra iterations to recover the accuracy, while network quantization achieves compression by quantizing parameters such as weights and activations in existing networks.

Fig. 1. Constructing BUnit-Net via stacking designed basic units. (a) The dense neural network as the baseline for comparison; (b) different basic units with various input, hidden, and output nodes; (c) the compact BUnit-Net with stacked units, where each unit is independent of the others. A compact BUnit-Net contains stacked layers and normal layers, where the compression is achieved by the stacked layers. The number of input nodes d in the dense network is the sum of the input nodes of all units selected for stacking.
Under these circumstances, it is interesting to ask whether a human-designed compact network can achieve comparable or even better compression performance than pruned networks learned through training. Since current pruning methods are mainly conducted on pretrained neural networks, the performance of the compressed model is highly affected by the quality of the pretrained model, which becomes a restriction if the pretrained models are not well trained. In some cases, few or no pretrained models are available. Another concern is that network pruning often involves additional processes such as fine-tuning, re-initializing, and rewinding, which may become an obstacle to implementation because the generalizability of a given pruning framework across different architectures is usually unclear. Therefore, simply using a more efficient architecture would be more effective than pruning a suboptimal network in many cases [30].
To this end, we propose to directly construct compact neural networks by independently stacking designed basic units (BUnit-Net), offering a new perspective. We first design the basic unit as a small neural network with at least one hidden layer, and then combine these basic units following a certain stacking strategy to construct an entire network. The whole framework of our proposed method is illustrated in Fig. 1. In the framework, a group of nodes (i.e., neurons) forms a basic function unit, much like the nervous tissues in the human brain with different functions. Since there is no connection between units, BUnit-Net achieves weight compression without additional mechanisms for judging the importance of weights. Unlike the sparse and irregular weight tensors generated by unstructured pruning, the weights of each unit in BUnit-Net are regular and dense, so they can be processed without special devices. Moreover, the dependence on a pretrained model is also sidestepped because BUnit-Net is trained directly from scratch. Besides, BUnit-Net can be built on numerous network backbones such as convolutional neural networks (CNNs) and fully-connected networks by designing different basic units, showing strong generalizability. In the experiments, we first explore the effect of different units by visualizing feature maps. Then, we conduct extensive experiments on multiple datasets (i.e., MNIST [31], CIFAR [32], Tiny ImageNet [33], and ImageNet-2012 [34]) with popular network structures (i.e., MLP, VGG [35], ResNet [2], and MobileNetV2 [36]). Experimental results demonstrate the efficacy of BUnit-Net for network compression on all evaluated datasets. When comparing the performance of different compression methods, we not only report commonly-used metrics such as accuracy and the drop ratios of FLOPs and parameters, but also propose two new metrics to evaluate the trade-off between accuracy and model size (i.e., number of
FLOPs and parameters), which are helpful for selecting the proper method under different requirements in real applications. To the best of our knowledge, this work is the first attempt to construct compact networks with a human-designed stacking structure, showing great potential for network compression.
We summarize our main contributions as follows: 1) We propose BUnit-Net, which directly constructs compact neural networks by combining and stacking well-designed basic units. It achieves compression through the independence of the units, and by using different kinds of basic units, BUnit-Net can be applied to various models. 2) Different basic units and combining strategies are explored through several experiments to verify the flexibility and efficacy of BUnit-Net. 3) We define two novel metrics to measure the trade-off between accuracy and resource consumption, including the number of FLOPs and parameters, which are helpful in decision-making when choosing an appropriate compressed model for different aims.

The rest of this article is organized as follows. Related work is presented in Section II. Section III introduces the details of our proposed BUnit-Net. In Section IV, the two proposed metrics are introduced in detail, and the experimental results, including comparison with different compression methods, are provided. Finally, the limitations of our proposed method and the conclusions of this paper are given in Section V.

II. RELATED WORK

A. Network Pruning
Network pruning has been widely used to reduce the complexity and computation of CNNs because of its effectiveness and simplicity. The earliest pruning works date back to the 1990s, using Hessian approximations to compute the saliency of parameters [37], [38]. Researchers have continued exploring various pruning methods since then. Existing methods can be roughly divided into unstructured pruning and structured pruning according to the type of network components to be pruned.
Unstructured pruning removes weights or neurons, increasing the sparsity of the network by replacing connections or neurons with zeros in the weight matrix. Han et al. [8] described a three-stage deep compression framework including network pruning, quantization, and Huffman coding, where the weights were pruned by comparing their magnitude with a preset threshold. However, in such a pruning technique, a weight remains zero throughout subsequent retraining once it is removed, which may cause large accuracy loss. Guo et al. [39] then proposed a dynamic network pruning framework, introducing a recovery operation to restore connections that have been wrongly pruned. Zhang et al. [40] formulated weight pruning as a constrained non-convex optimization problem.
Structured pruning removes entire channels or filters, breaking the limitation that unstructured pruning heavily relies on specialized hardware or software libraries to achieve compression and speedup with sparse weights. Channel pruning methods have developed different strategies, such as variational techniques [26] and genetic algorithms [41], to choose the unimportant channels. CGNet [25] identifies the salient channels and skips the remaining, less important parts via designed squeeze-excitation modules. However, the skipped regions in CGNet are irregular and require a special toolkit to achieve practical compression. Recent work [42] has introduced channel independence to measure importance. Furthermore, filter-level pruning has also proved efficacious. Along this line, Li et al. [29] first calculated the norms of filters for evaluation. Later, ThiNet [43] utilized the features of the next layer to guide the pruning of the current layer. Further, [44] introduced a group convolution scheme to save storage. Also, He et al. [27] proposed ranking the importance of convolution filters based on the geometric median to overcome the shortcoming of norm-based pruning. Sparsity regularization measurements are explored in [45], [46]. A separate sub-network is introduced in GaterNet [47] to generate gates for adaptively selecting the activated filters. AutoPruner [48] automatically identifies the less important filters in an end-to-end manner. [49] measures the importance of latent representations to identify the filters to prune. Yeom et al. [50] combined interpretability and filter pruning based on layer-wise relevance propagation (LRP) [51].
Since traditional pruning methods often involve hyperparameters that require time-consuming human design, how to prune automatically has become a hot issue in recent years [52], [53]. Besides, Neural Architecture Search (NAS) [54] has also been combined with pruning, as in the Metapruning method [55].

B. Compact Model Design
Compact model design methods focus on changing the basic operations and redesigning the structure to reduce complexity and model size. Constructing a lightweight network with different, cheaper operations is a widely used approach. In the Network in Network (NIN) [56] model, an embedded-network architecture is developed that uses 1 × 1 convolutions to increase capacity while decreasing computational complexity. Besides 1 × 1 convolution, SqueezeNet [15] utilizes group convolution for further speedup. Howard et al. [16] designed MobileNet using a branching strategy, where each branch contains only one channel, called depth-wise convolution. The follow-up work MobileNetV2 [36] introduces residual and linear bottleneck structures. ShuffleNet [57] combines group convolution and a channel shuffle operation that shuffles channels before the next convolution to realize information transmission between multiple groups. Furthermore, EspNetV2 [58] proposes deep expandable and separable convolutions to improve efficiency. Jeon et al. [59] proposed an active shift operation to save memory, and the sparse shift layer (SSL) applied in FE-Net [60] eliminates meaningless shift operations. In addition to manually designed lightweight models, the NAS technique [54], which constructs networks automatically, has also attracted attention recently. EfficientNet [61] first utilizes NAS to search for an efficient backbone, and the backbone is then scaled to deal with different inputs, where the scaling is conducted on three dimensions (network width, depth, and resolution) simultaneously. Based on NAS, [62] and [63] have improved MobileNetV2 with comparable performance. However, these methods still require large resource consumption, which makes them difficult to deploy in practical applications.

C. Other Compression Methods
In addition to pruning and compact models, other methods have been explored to compress networks and speed up inference, including quantization [10], low-rank decomposition [11], [12], and knowledge distillation (KD) [13], [14]. Network quantization achieves compression by quantizing the weights or activations to low-bit representations, as in binary and ternary neural networks [64], [65], [66]. In this way, 32-bit floating-point parameters can be converted to low-bit or even 1-bit representations to reduce resource consumption. Low-rank decomposition approximates large weight matrices by the product of several low-dimensional matrices to accelerate computation. The most commonly used decomposition techniques include Singular Value Decomposition (SVD) [11], [67] and Tucker decomposition [68]. With respect to KD, the framework usually contains two networks, where the knowledge in the larger network (teacher) is utilized to supervise the training of the smaller one (student). The theoretical basis of KD comes from [69], which first proposed training the student with the softmax output of the teacher. Later, the distillation method has been continuously improved from different aspects, such as using intermediate feature maps [13], [14] or multiple teachers [70], [71].

III. PROPOSED METHOD

A. Preliminaries
Before describing our proposed method, we formally give some notations and symbols. In a CNN, the weight of a standard convolutional layer is a four-dimensional tensor W ∈ R^{d×d×c_in×c_out}, where d × d is the kernel size of the convolution filter, and c_in and c_out are the numbers of input and output channels. For a three-dimensional input X_in ∈ R^{w_in×h_in×c_in} with width w_in and height h_in, the convolutional layer transforms it into another three-dimensional tensor X_out ∈ R^{w_out×h_out×c_out}. Let Conv(·) denote the operations in a convolutional layer, including convolution and activation; the transformation can be formulated as

X_out = Conv(W; X_in). (1)

For a given convolutional layer, the weight W contains d × d × c_in × c_out parameters, and the convolution operation requires d × d × c_in × w_out × h_out × c_out FLOPs. Because a CNN model is usually composed of a large number of convolutional layers, the cost of memory and computation is huge, making it a challenge to deploy CNNs on resource-constrained devices.
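These two counts can be checked with a short Python helper; the layer sizes below are chosen purely for illustration:

```python
def conv_cost(d, c_in, c_out, w_out, h_out):
    """Parameter and FLOP counts of a standard d x d convolutional layer.

    params : d * d * c_in * c_out weights (bias ignored)
    flops  : one multiply-accumulate per weight per output position
    """
    params = d * d * c_in * c_out
    flops = d * d * c_in * w_out * h_out * c_out
    return params, flops

# Example: a 3x3 convolution with 64 input and 128 output channels
# producing a 32x32 output feature map (illustrative sizes).
params, flops = conv_cost(3, 64, 128, 32, 32)
print(params, flops)  # 73728 75497472
```

A deep CNN stacks dozens of such layers, so these counts add up quickly, which is the deployment challenge the text describes.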

B. Basic Function Unit and Stacked Layers
As illustrated in Fig. 1, one of the most important components of the proposed BUnit-Net is the designed basic unit. We take a CNN as an example to demonstrate our method.
Without loss of generality, we initially design the unit as a small convolutional neural network with at least one hidden layer, where the nodes can be adjusted flexibly. Suppose the unit contains three layers with node numbers {c_in, c_h, c_out}, respectively. The two weight tensors in a unit can be represented as W_l ∈ R^{d×d×c_in×c_h} and W_r ∈ R^{d×d×c_h×c_out}. Then, for a given input X_in ∈ R^{w_in×h_in×c_in}, the output X_out ∈ R^{w_out×h_out×c_out} of a single unit can be calculated as

X_out = Conv(W_r; Conv(W_l; X_in)). (2)

After designing the basic unit, we can obtain a whole convolutional layer by stacking a certain number of units together. Suppose m units with various input channel numbers are selected to stack, represented as {U_1, U_2, ..., U_m}. The entire input X_in ∈ R^{w_in×h_in×ĉ_in} of the whole layer first needs to be split into m pieces along the channel dimension, where the number of channels of each piece equals the c_in of the corresponding unit, so that each piece can be processed by its unit, which only accepts c_in input channels. Note that the total input channels of the m units should be consistent with ĉ_in. If the m units have the same input channel number c_in, we can equally split X_in into m = ĉ_in / c_in pieces. Let {X^1, X^2, ..., X^m} denote the m smaller input pieces with the corresponding m outputs {X^1_out, X^2_out, ..., X^m_out}. To ensure that the output of the stacked layer can be passed to the next layer, these m small outputs also need to be concatenated to keep the dimensions consistent. Thus, the output of the whole stacked convolutional layer is

X_out = Concat(X^1_out, X^2_out, ..., X^m_out), (3)

where the output of each unit is calculated by (2).
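The split-process-concatenate flow of a stacked layer can be sketched in a few lines of NumPy; the per-unit transform is stubbed out as an arbitrary callable, and the unit sizes are illustrative, not from the paper:

```python
import numpy as np

def stacked_layer(x, units):
    """Apply independent units to channel slices of x and concatenate.

    x     : input of shape (w, h, c_in_total)
    units : list of (c_in_i, f_i) pairs, where f_i maps a (w, h, c_in_i)
            slice to a (w', h', c_out_i) output and sum(c_in_i) equals
            c_in_total.
    """
    outputs, start = [], 0
    for c_in_i, f_i in units:
        piece = x[:, :, start:start + c_in_i]  # split off this unit's channels
        outputs.append(f_i(piece))             # each unit works independently
        start += c_in_i
    return np.concatenate(outputs, axis=2)     # restore one dense tensor

# Toy example: two "units" that merely scale their channel slices.
x = np.ones((4, 4, 6))
y = stacked_layer(x, [(2, lambda p: 2 * p), (4, lambda p: 3 * p)])
print(y.shape)  # (4, 4, 6)
```

Because each slice is handled by its own unit, there are no cross-unit weights, which is exactly where the compression comes from.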
By utilizing this simple stacking strategy, we avoid the sparse weights generated by unstructured pruning. The weight tensors in BUnit-Net remain dense, so compression can be achieved on general-purpose hardware and software, without the support of complex judgment criteria. A detailed analysis of memory and computation cost is presented in Section III-E. In addition to CNNs, BUnit-Net is also applicable to other network backbones by designing diverse basic units, as shown in Fig. 1(b).

C. Network Construction
In BUnit-Net, there are two kinds of convolutional layers, termed Stacked Layer (solid box) and Normal Layer (dashed box), as marked in Fig. 1(c). A stacked layer refers to a layer generated by stacking basic units, while a normal layer is a single standard convolutional layer. Because of the independence among the units in stacked layers, it is necessary to add normal convolutional layers between the stacked layers; otherwise, the entire network would be disconnected, causing the failure of feature-information transmission. Moreover, the normal layer between stacked layers also helps deal with the dimension-matching problem. After obtaining the stacked layers as described in Section III-B, a whole neural network can be constructed using stacked layers and normal layers alternately. Notably, the setting of a normal layer is determined by the two contiguous stacked layers. For instance, a normal layer is placed between two stacked layers in Fig. 1(c). Assuming that the two layers contain m and n units respectively, the number of input nodes (NL_in) of the normal layer is the sum of the output nodes (c_out) of the m units on the left, while the number of output nodes (NL_out) is the total number of input nodes of the n units on the right.
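The dimension-matching rule for a normal layer can be written down directly; the unit lists below are hypothetical examples, not configurations from the paper:

```python
def normal_layer_dims(left_units, right_units):
    """Channel counts of a normal layer between two stacked layers.

    left_units  : list of c_out for each of the m units on the left
    right_units : list of c_in for each of the n units on the right
    """
    nl_in = sum(left_units)    # concatenated outputs of the left stacked layer
    nl_out = sum(right_units)  # concatenated inputs of the right stacked layer
    return nl_in, nl_out

# Example: 3 left units each emitting 2 channels, 2 right units each taking 4.
print(normal_layer_dims([2, 2, 2], [4, 4]))  # (6, 8)
```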
It is straightforward to implement BUnit-Net with stacked layers as classical network architectures like VGG-style and ResNet-style backbones (style meaning a similar architecture). As shown in Fig. 2(a), we can reconfigure the continuous standard convolutional layers in VGG-style networks by replacing the first few layers with stacked layers. With respect to the ResNet-style network in Fig. 2(b), we can also introduce the residual structure in BUnit-Net. Furthermore, the constraint on dimensions between different layers can be flexibly resolved by adjusting the parameters in the stacked layer, such as the convolution stride. In addition to CNNs, BUnit-Net is appropriate for fully-connected networks like the Multilayer Perceptron (MLP) if the unit is designed as a small network with only fully-connected layers. In terms of the special convolution operations in recent lightweight models, such as 3 × 3 depth-wise and 1 × 1 point-wise convolutions, the basic units can also be designed to be equipped with these operations, demonstrating the powerful flexibility of the proposed BUnit-Net.

D. Training Process
BUnit-Net involves two parts of weights: the weights of the basic units in stacked layers {W_l, W_r} and the weights of the normal layers W_n. Because each unit in BUnit-Net is still a neural network, it follows the standard forward and backward propagation algorithms. Given a set of data pairs {(x, y)}, the objective function of training BUnit-Net can be set as

min_{W,b} L(f(C((W, b); x)), y), (4)

where W = {W_l, W_r, W_n} denotes all the weights and b the biases. The function C(·) contains the Conv operation of normal layers in (1) and the Concat operation in stacked layers as described in (3). The function f(·) refers to the computation in other layers such as fully-connected and pooling layers, and L(·) is a selected convex loss function.
When designing the basic function unit (generally designed with three layers), the nodes in each layer {c_in, c_h, c_out} and the number m of units to be stacked need to be preset. Then, the input channels of the m input pieces can be set according to each c_in. In forward propagation, the output of stacked layers is computed by (3), while the output of normal layers uses the standard convolution operation in (1). The parameters are updated using a popular optimizer such as SGD or Adam [72]. The construction and training process of BUnit-Net is summarized in Algorithm 1.

Algorithm 1 Construction and Training of BUnit-Net
1: Design the basic units and construct the network with stacked layers and normal layers.
2: for each training iteration do
3: Get a minibatch of training data {x, y}.
4: Compute the network output ŷ, where the outputs of stacked layers and normal layers are calculated by (3) and (1), respectively.
5: Compute the loss L(y, ŷ).
6: Perform standard backward propagation.
7: Update parameters using any popular optimizer.
8: end for

E. Analysis of Stacked Layers
To analyze the compression effect more intuitively, suppose that m units with the same nodes {c_in, c_h, c_out} are stacked; a group of three convolutional layers is then constructed with {mc_in, mc_h, mc_out} nodes per layer, as in Fig. 1(a). Since the stacked units are independent, meaning there is no connection among them, the memory cost of the stacked layers is

M_stack = m(d × d × c_in × c_h + d × d × c_h × c_out). (5)

If the three convolutional layers were fully connected, the two weight tensors would be W_lf ∈ R^{d×d×mc_in×mc_h} and W_rf ∈ R^{d×d×mc_h×mc_out}, whose memory cost is

M_full = d × d × mc_in × mc_h + d × d × mc_h × mc_out = m × M_stack. (6)

Obviously, the number of parameters of the stacked convolutional layers is much lower than that of the fully-connected layers. In this way, our proposed method can effectively reduce the memory cost to achieve network compression.
In terms of computation cost, the total FLOPs of the m three-layer units are

F_stack = m(d × d × c_in × c_h × w_outl × h_outl + d × d × c_h × c_out × w_outr × h_outr), (7)

where w_outl and w_outr are the widths of the first- and second-layer outputs, respectively, and h_outl and h_outr are the corresponding heights. For fully-connected convolutional layers with the same nodes, the FLOPs are

F_full = d × d × mc_in × mc_h × w_outl × h_outl + d × d × mc_h × mc_out × w_outr × h_outr = m × F_stack. (8)

Similarly, the memory and computation cost in other architectures can be calculated in the same way.
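The factor-of-m saving in parameters can be verified numerically; the sizes below are illustrative:

```python
def stacked_params(m, d, c_in, c_h, c_out):
    # m independent three-layer units, each with two weight tensors
    return m * (d * d * c_in * c_h + d * d * c_h * c_out)

def dense_params(m, d, c_in, c_h, c_out):
    # fully-connected counterpart with m*c_in, m*c_h, m*c_out nodes per layer
    return d * d * (m * c_in) * (m * c_h) + d * d * (m * c_h) * (m * c_out)

m, d, c_in, c_h, c_out = 8, 3, 2, 3, 2
s = stacked_params(m, d, c_in, c_h, c_out)
f = dense_params(m, d, c_in, c_h, c_out)
print(f // s)  # the dense layers store m times as many weights
```

The same computation with the FLOP formulas gives the identical factor of m, since every term just gains the extra output-size multipliers.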
With fewer parameters and FLOPs, the stacked layers in our proposed BUnit-Net also save much storage and computation. Since normal layers are added between the stacked layers and different kinds of units may be used when building the network, the compression and acceleration ratio of the entire BUnit-Net will be smaller than m. The actual compression performance of BUnit-Net is reported in our experiments for comparison.

IV. EXPERIMENTS
To evaluate the performance and efficiency of BUnit-Net, we first design several experiments to explore the effect of basic function units with different structures on both the MNIST [31] and CIFAR-10 [32] datasets. We then conduct empirical experiments for image classification on three other popular datasets: CIFAR-100, Tiny ImageNet [33], and ImageNet-2012 [34]. We first construct BUnit-Net as a fully-connected MLP on MNIST, and then build representative CNNs like VGG [35], ResNet [2], and MobileNet [36]. We mainly compare BUnit-Net with the baselines and the following pruning methods and lightweight models:
- VP [26]: Channel pruning using a variational technique.
- DMCP [73]: A mask-based channel pruning method.
- CHIP [42]: Filter pruning through channel independence.
- FPGM [27]: Utilizes the geometric median to prune filters.
- DualConv [74]: Efficient group convolution for lightweight models.
- DANL [75]: A neural architecture search method without additional candidates.

A. Performance Metrics
The most commonly used metrics to assess the performance of a compressed model are the numbers of parameters and FLOPs. However, when the trade-off between compression and accuracy is considered, these metrics alone make it difficult to compare different models. To our knowledge, there is no unified indicator that considers these factors simultaneously. Thus, we define two performance scores termed TCA and TSA. TCA measures computation performance, i.e., the trade-off between accuracy and FLOPs, while TSA measures storage performance, i.e., the trade-off between accuracy and parameters. Assume the accuracy, FLOPs, and parameters of the baseline are {a_b, f_b, p_b}, and {a_n, f_n, p_n} are the corresponding metrics of the new compressed model; the two scores are then computed from these quantities, where w_1 and w_2 are two scaling factors used to adjust the importance of the different metrics according to the specific demands of different applications. For example, a larger w_2 can be chosen if we care more about accuracy. In turn, if we pay more attention to computation and storage consumption, we can set w_1 greater than w_2. By applying the scaling factors, we can choose the model more flexibly under different constraints.
In fact, the essence of these two scores is the ratio between the accuracy drop and the FLOPs (or parameter) drop after compression. However, the magnitude gap between the accuracy drop and the FLOPs (or parameter) drop is usually very large. For instance, some compression methods are able to reduce more than 80% of FLOPs or parameters while the accuracy loss is less than 1%. Therefore, an exponential function is introduced to narrow the gap. In other words, the role of exponentiation is to automatically magnify the subtle differences in accuracy, or diminish the remarkable gap in FLOPs (or parameters), compared with the baseline, leading to a fair comparison of different compression methods.
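The closed-form definitions of TCA and TSA are not reproduced in this excerpt, so the sketch below only illustrates a score with the properties just described: an exponentially scaled combination of the relative FLOPs drop and the relative accuracy drop. The function name and exact form are assumptions for illustration, not the paper's definition:

```python
import math

def tradeoff_score(a_b, f_b, a_n, f_n, w1=1.0, w2=1.0):
    """Illustrative trade-off score (assumed form, not the paper's exact TCA).

    Rewards a large relative FLOPs drop (scaled by w1) and penalizes a
    large relative accuracy drop (scaled by w2); the exponential narrows
    the magnitude gap between the two drops, as described in the text.
    """
    flops_drop = (f_b - f_n) / f_b
    acc_drop = (a_b - a_n) / a_b
    return math.exp(w1 * flops_drop - w2 * acc_drop)

# Toy numbers: 80% of FLOPs saved at about a 1% relative accuracy loss.
print(tradeoff_score(0.93, 1e9, 0.92, 2e8))
```

Swapping FLOPs for parameter counts gives the storage-side analogue; raising w2 makes the score more sensitive to accuracy loss, matching the selection behavior the text describes.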

B. Experimental Settings

1) BUnit-Net Construction: We build diverse kinds of BUnit-Net under different network architectures to verify its generalizability. Following the reconfiguration in Fig. 2, we replace the layers in the original VGG and ResNet with our stacked layers and treat the original networks as baselines. To simplify the structure, we replace the last fully-connected layers with an average pooling layer. When constructing the stacked layers in BUnit-Net, we consider both single and hybrid kinds of units. On the ResNet-style network, we explore different strategies for deploying stacked layers, such as replacing the intermediate blocks while retaining the layers in the first and last blocks.
2) Training Strategy: For the CIFAR datasets, we train BUnit-Net for 500 epochs with an initial learning rate of 0.1. BUnit-Net on MNIST and Tiny-ImageNet is trained for 200 epochs. Stochastic gradient descent (SGD) with momentum 0.9 is used as the optimizer on both CIFAR and Tiny-ImageNet. For ImageNet, we also train the models for 200 epochs with a learning rate of 0.01, and Adam with 10^-4 weight decay and 0.01 epsilon is applied as the optimizer. For all datasets, the batch size is 128 and the learning rate is adjusted with cosine annealing. The cross-entropy loss is adopted as the criterion for all datasets. In addition, all training procedures are warmed up for 10 epochs and use weight initialization. As for the network backbones, VGG-style BUnit-Net is trained on CIFAR and Tiny-ImageNet, while ResNet-style BUnit-Net is applied on CIFAR and ImageNet. Besides, we also evaluate the performance of the lightweight model MobileNetV2.
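The schedule described above (10 warmup epochs followed by cosine annealing) can be sketched in plain Python; the linear warmup shape is an assumption, since the text does not specify it:

```python
import math

def lr_at(epoch, total_epochs, base_lr=0.1, warmup_epochs=10):
    """Learning rate with linear warmup, then cosine annealing toward zero."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs      # linear warmup
    t = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1 + math.cos(math.pi * t))    # cosine decay

# CIFAR-like setting: base_lr = 0.1 over 500 epochs.
schedule = [lr_at(e, 500) for e in range(500)]
print(schedule[0], schedule[9], schedule[499])
```

The rate ramps up to the full 0.1 by epoch 10 and decays smoothly to near zero by the final epoch.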

C. Exploration on Basic Units

1) Hybrid Units versus Single Unit:
We first apply hybrid kinds of units in the stacked layers to verify the flexibility of BUnit-Net, as illustrated in Fig. 1(c). Specifically, we design three different units {U_1, U_2, U_3} with increasing FLOPs and parameters, where the neurons in {U_1, U_2, U_3} are {'1-2-1', '1-3-1', '1-4-1'}, respectively. The models are then constructed with an architecture similar to LeNet-5 [76]. The performance of BUnit-Net using single and hybrid units is reported in Table I. To avoid randomness, we report the average accuracy over five runs of each experiment. We first build the network separately with each of the three kinds of units. For hybrid units, one of the three units is randomly selected to stack at each unit position in the stacked layers. To exclude fortuity in the random selection, we perform the evaluation with hybrid units three times. Compared with a single unit, hybrid units exhibit the potential to improve accuracy while maintaining the FLOPs and parameters at the same compression level. For example, when we adjust the numbers of the three units stacked in Hybrid 2 to keep the FLOPs and parameters the same as when using the single unit U_2, the model with hybrid units achieves higher accuracy. The results on U_1 and U_2 also indicate that units with fewer FLOPs and parameters can still bring better accuracy. These observations motivate us to explore more effective units and hybrid combinations.
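The random hybrid stacking can be mimicked in a few lines; the unit descriptors are the '1-2-1'-style triples from the text, while the helper itself is hypothetical:

```python
import random

UNITS = ["1-2-1", "1-3-1", "1-4-1"]  # U1, U2, U3 node layouts from the text

def random_hybrid(num_positions, seed=0):
    """Pick one of the three unit types for each stacking position."""
    rng = random.Random(seed)  # fixed seed so a layout can be reproduced
    return [rng.choice(UNITS) for _ in range(num_positions)]

print(random_hybrid(6))
```

Re-running with different seeds corresponds to the repeated hybrid evaluations used to exclude fortuity in the selection.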
2) Exploration of Different Units: For a BUnit-Net with high performance, one of the most important components is the designed basic unit.In this section, we try to explore the effect and ability of various units on extracting features through a series of initial experiments.We first design six kinds of units (i.e., Fig. 3. Six kinds of units with different input, hidden and output nodes.The units in toy models are randomly selected from Unit A-F. Unit A-F) with different numbers of input, hidden and output node as shown in Fig. 3, and build a toy network by randomly selecting and stacking them.Specifically, the toy models are composed of two normal layers and two stacked layers, followed by a fully-connected layer as the classifier.Each stacked layer contains 20 units (numbered from Unit_1 to Unit_20) randomly selected from Unit A-F.In the exploration on basic units, the models are trained on CIFAR-10 dataset for 200 epochs with initial learning rate of 0.001.Other setting is the same as the extensive experiments on CIFAR-10.We conduct ten random experiments and the amount of different units are recorded in Table II.We also report the evaluation of toy model including accuracy, number of FLOPs and parameters for illustration.For instance, Model No.7 obtain the highest accuracy of 72.21% with 93.29 k parameters.However, Model No.10 occupies more FLOPs and parameters while the accuracy is still lower than No.7.The comparison of Model No.3 and No.4 also reflects this phenomenon.It implies that more FLOPs and parameters are helpless to improve the accuracy of the model, or even hurt the performance, which is also verified in previous experiments.The results also indicate that the number of different units has a great impact on the performance of BUnit-Net, which inspires us to Fig. 4. 
Fig. 4. Visualization results of each unit in the first stacked layer. Image 1 and Image 2 are the original input images. "Whole_layer" refers to the output feature map of the entire stacked layer, that is, the concatenation of the outputs of all units. The structure of each unit is also recorded; for example, "Unit_1_F" means that F is selected as the first unit in this stacked layer.

For further exploration of the learning ability of different units, we also visualize the feature maps of each unit in "Stacked-layer 1" of the compact network. We first report the visualization results of Model No.1 using 10 different images belonging to the 10 classes of CIFAR-10. According to the results, it is worth highlighting that some of the units fail to learn useful information regardless of the input image. Taking the bird and ship classes in Fig. 4 as examples, the feature maps of some units such as Unit_4_A, Unit_7_F, Unit_17_F and Unit_20_D are almost a single color, meaning that these units miss valuable information even after training. This prompts us to replace such useless units with more powerful ones to improve the accuracy.
Although some units fail to extract important features in Fig. 4, it is hard to conclude that these units have poor learning ability, because the feature map is also closely related to the input. For example, Unit_6 has the same structure (i.e., D) as Unit_20, yet it manages to learn more texture-orientation and marginal features. Therefore, the position of a unit also plays an essential role in the performance of BUnit-Net. To evaluate the effect of different units, we visualize the feature maps of the first and last units in Stacked-layer 1, whose inputs are consistent with each other; the results of the 10 experiments (Model No.1-No.10) are shown in Fig. 5. For the single-input units (i.e., A-C), the object features obtained by A, which has a single output, are more obvious than those of B and C, such as No.9_A of Unit_1 and No.6_A of Unit_20. For the two-input units (i.e., D and E), however, D with two outputs extracts more useful information than E with only one output in most cases, as in No.3_D of Unit_1 and No.2_D of Unit_20. These findings illustrate that units with the same number of input and output nodes tend to extract valuable features more effectively, providing a path for designing powerful units in the future.

D. MNIST
On MNIST, we choose the MLP with two hidden layers as the baseline, where the numbers of nodes in the layers are "784-500-300-10" as applied in [77]. Note that the MLP only

TABLE III COMPARISON RESULTS ON MNIST USING MLP
contains fully-connected layers, so the basic units also need to be consistent with this. Based on the observations in the initial exploration of basic units, we design the basic units with the same numbers of input and output nodes, that is, c_in = c_out = 2. The number of hidden nodes c_h remains flexible so as to adjust the compression ratio. In Table III, we report the performance of BUnit-Net on MNIST using the MLP. In addition to the widely used metrics such as accuracy and the numbers of FLOPs and parameters, we also report the two proposed scores TSA and TCA (w1 = w2 = 1) for comparison with other methods. It is shown that we achieve 97% FLOPs and 96% parameter reduction compared to the baseline model, with only 0.7% accuracy loss. Besides, we obtain much higher TCA and TSA scores than the other three pruning methods. Thus, the results indicate that the proposed BUnit-Net works well on MLPs.
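Where the roughly 96% parameter saving comes from can be seen with back-of-the-envelope arithmetic. The sketch below counts the baseline "784-500-300-10" MLP and, for contrast, a hypothetical width-500 stacked layer tiled with '2-2-2' units (c_in = c_out = c_h = 2); these widths are illustrative assumptions, not the paper's exact compact configuration.

```python
# Back-of-the-envelope parameter counts for the MNIST MLP experiment.
# The '2-2-2' stacked-layer configuration below is an illustrative
# assumption, not the paper's exact compact architecture.

def dense_params(n_in, n_out):
    """Weights + biases of a fully-connected layer."""
    return n_in * n_out + n_out

# Baseline '784-500-300-10' MLP from [77].
baseline = dense_params(784, 500) + dense_params(500, 300) + dense_params(300, 10)
print(baseline)  # 545810 parameters

def stacked_params(width, c_h, c_in=2, c_out=2):
    """Independent 'c_in - c_h - c_out' units tiled across `width` inputs."""
    per_unit = (c_in * c_h + c_h) + (c_h * c_out + c_out)
    return (width // c_in) * per_unit

# A width-500 stacked layer of '2-2-2' units costs 3000 parameters,
# versus 250500 for a dense 500-to-500 layer.
assert stacked_params(500, 2) == 3000
assert dense_params(500, 500) == 250500
```

Increasing c_h inflates each unit by only a few weights, which is why the hidden-node count is a convenient knob for tuning the compression ratio.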

1) CIFAR-10:
The comparison results on CIFAR-10 are reported in Table IV. For VGG-16, we manage to save 79.87% FLOPs and 72.90% parameters with only 0.43% accuracy loss. Besides, BUnit-Net obtains the highest TCA scores regardless of whether the focus is on accuracy or FLOPs, showing its efficiency in computational consumption. On ResNet-20, the proposed BUnit-Net reduces around half of the FLOPs and parameters, where the accuracy loss of 0.58% is still acceptable. It is encouraging that the two scores are both slightly higher than those of the other competitors, except when focusing more on accuracy, indicating better utilization of computation and storage resources. For the ResNet style, we also replace the layers in ResNet-20 with our stacked layers, apart from the first and last blocks (represented by the symbol 'r'). Under this replacing strategy, we prune 40% of the FLOPs and 32% of the parameters of the baseline, and the accuracy gap can be narrowed to 0.23%.
2) CIFAR-100: In Table V, our BUnit-Net obtains an accuracy of 72.51% with 73.05% FLOPs and 70.26% parameter reduction when applying VGG-16, and the TSA scores are also larger than those of other state-of-the-art methods under different settings of the scaling factors. Although the accuracy drop is more than 1%, it can be reduced by adjusting the hidden nodes in the basic units or by utilizing the same replacing strategy as for ResNet. On ResNet-20, the advantage of BUnit-Net lies in saving computation resources, as it leads to higher TCA scores.
In terms of latency on CPU and GPU, we conduct a comparison between the proposed method and the baseline models; the results are presented in Table VII. Although the actual speedup ratio is lower than the theoretical value, due to the computation and concatenation of each unit output in Algorithm 1, the proposed method still outperforms the baseline models with lower latency. This finding implies that the proposed method can help improve efficiency in real-life applications.
On the whole, the results on the CIFAR datasets verify the flexibility and effectiveness of the proposed BUnit-Net for network compression, demonstrating that BUnit-Net can produce a more compressed model with comparable accuracy.
Lightweight Models: In addition to classical CNNs such as VGG and ResNet, we also evaluate BUnit-Net on lightweight models that have been popular in recent years, such as MobileNetV2, ShuffleNetV2 and EfficientNet. In MobileNet, there are special 1 × 1 point-wise and 3 × 3 depth-wise convolution operations, where the group number in the depth-wise convolution equals the number of input channels. BUnit-Net can easily handle these special operations by adjusting the parameters in the basic units, such as the kernel size and group number. The results are summarized in Tables IV and V. It can be seen that BUnit-Net can also be implemented successfully on lightweight models. For instance, we obtain an accuracy of 91.58% on CIFAR-10; although this is a 0.62% loss compared to the baseline, it is encouraging that the FLOPs and parameters are reduced by 38% and 36%, respectively. Moreover, the TCA score is somewhat higher than those of the other competitors. On CIFAR-100, the accuracy gap between BUnit-Net and the baseline is relatively large. However, it manages to save 62% FLOPs and 60% parameters, and the TCA and TSA scores are still larger when setting w1 = 1 and w2 = 4. With respect to other efficient architectures, we also apply our stacking strategy to ShuffleNetV2 and EfficientNet-B0. Specifically, we replace parts of the convolutional layers in the basic block of ShuffleNetV2 and in the SE modules of EfficientNet-B0. The results are provided in Table VIII, showing that the proposed stacking strategy can achieve further compression with comparable accuracy. Overall, these results demonstrate that BUnit-Net works efficiently on lightweight models.
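The group-number adjustment mentioned above can be verified by simple parameter counting: a depth-wise convolution (groups equal to the input channels) plus a point-wise convolution costs far less than one standard convolution, which is what a basic unit mimics when its group parameter is set accordingly. The counting function below is a generic sketch of grouped-convolution accounting, not the paper's implementation.

```python
# Parameter counting for grouped 2-D convolutions (a generic sketch,
# assuming bias terms; not the paper's implementation).

def conv_params(c_in, c_out, k, groups=1, bias=True):
    """Weights (+ optional biases) of a k x k convolution with grouping."""
    weights = (c_in // groups) * c_out * k * k
    return weights + (c_out if bias else 0)

c = 64
standard = conv_params(c, c, 3)             # ordinary 3x3 convolution
depthwise = conv_params(c, c, 3, groups=c)  # 3x3 depth-wise (MobileNet-style)
pointwise = conv_params(c, c, 1)            # 1x1 point-wise

print(standard, depthwise, pointwise)
# The depth-wise + point-wise pair is much cheaper than one standard conv.
assert depthwise + pointwise < standard
```

Because the per-unit cost divides by the group count, exposing `groups` and `k` as unit hyperparameters lets BUnit-Net drop into these architectures without changing the stacking scheme.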
Convergence: On the CIFAR datasets, we investigate the convergence of BUnit-Net, as illustrated in Fig. 6. Fig. 6(a) and (b) display the convergence curves on CIFAR-100 using ResNet-18. It is worth noting that BUnit-Net can almost reach an optimal status after a certain number of training epochs. Besides, the training process of BUnit-Net is more stable, although the maximum accuracy is a bit lower than that of the baseline. On CIFAR-10 using VGG-16 in Fig. 6(c) and (d), BUnit-Net converges in fewer epochs than the original VGG network, suggesting the potential for acceleration.

F. Tiny-ImageNet and ImageNet

TABLE VII COMPARISON OF INFERENCE TIME FOR 32×32 IMAGES
2) ImageNet: On the large-scale ImageNet, we select ResNet-34 and ResNet-50 as the baseline models and compare the performance of BUnit-Net with state-of-the-art pruning methods in Table VI(b). On ResNet-34, it is inspiring that our method achieves 71.51% top-1 accuracy and 35% parameter reduction, with higher TCA and TSA scores than the other methods under most settings. We also reduce the number of FLOPs by around 42%, although the TCA score is slightly smaller than that of FPGM [27] when the accuracy performance is emphasized. On ResNet-50, our model removes more parameters and obtains a higher TSA score than the other methods while maintaining the accuracy, resulting in a 43.67% reduction in the number of parameters. These results demonstrate that the proposed BUnit-Net efficiently reduces computation and memory consumption on complex tasks while maintaining comparable accuracy.

G. Ablation Study
Varying FLOPs and Parameters: To study the performance under different compression ratios, we apply different units to adjust the FLOPs and parameters of compact BUnit-Net with the ResNet-18 backbone on CIFAR-10 and CIFAR-100. The results in Fig. 7 help to comprehensively understand the trade-off of the proposed BUnit-Net. The four curves reveal the same trend: BUnit-Net performs worse as the numbers of FLOPs and parameters decrease. This trend is consistent with the common experience that a more compact model sacrifices more accuracy. On CIFAR-100, when we tune the remaining FLOPs from 21% to 30% and the corresponding parameters from 10% to 30%, the compression ratio drops a little but the accuracy increases drastically from 72.87% to 74.51%. The same observation is also verified on CIFAR-10. It is reasonable that continuing to compress results in a sharp drop in accuracy after the model has been compressed to a certain extent. Thus, to achieve a better trade-off between compression and accuracy, we can empirically keep the remaining FLOPs and parameters above 20% when applying BUnit-Net for network compression, except for simple tasks such as MNIST.

V. CONCLUSION
Limitations: Although our method has proved effective for compressing networks, there are still some limitations. First, the basic units are currently designed and stacked randomly; it is therefore desirable to find the optimal structures and combinations that can further enhance the accuracy. Second, the independence of the units in a stacked layer may hinder the communication of information, which also influences the performance. In the future, we will explore further and combine our method with other techniques, such as NAS and channel shuffle, to obtain better performance.
Conclusions: This paper is the first attempt to construct human-designed compact networks, called BUnit-Net, which combine a number of designed basic function units to generate stacked layers. Compared with standard dense layers, the stacked layers contain far fewer parameters and FLOPs due to the independence of each unit, so that BUnit-Net achieves compression. We have analyzed the memory and computation cost of BUnit-Net in detail. Several initial experiments have also explored powerful basic units. For comparison with different compression methods, we have introduced two unified indicators to measure the trade-off between accuracy and compression ratio. The comparative studies have shown that the proposed BUnit-Net leads to better compression with comparable accuracy compared with state-of-the-art pruned and lightweight models. This implies the flexibility and efficacy of BUnit-Net, providing a new promising way for network compression.

Fig. 2. Network construction. (a) VGG-style network; (b) ResNet-style network. In general, the basic units are designed with a three-layer architecture. To construct a VGG-style BUnit-Net, the most direct way is to use one stacked layer and one normal layer alternately. A residual structure can also be added between the layers to build a ResNet-style BUnit-Net.

Fig. 5. Visualization results of the first and last unit outputs in the first stacked layer. The top row shows the feature maps of the first unit (Unit_1) and the bottom row those of the last unit (Unit_20). The structure of each unit in the ten experiments is recorded; for example, "No.1_F" means that the first unit in the first stacked layer of Model No.1 is F.

Fig. 7. Trade-off curves between test error and FLOPs drop ratio of BUnit-Net on (a) CIFAR-10 and (b) CIFAR-100. It can be noted that our method performs better under larger compression ratios.
After obtaining the final output of BUnit-Net in forward propagation, we can obtain the gradients by performing backward propagation and update the parameters.

Algorithm 1: BUnit-Net Construction and Training.
Construction
1: Design several basic function units {U_1, U_2, ..., U_n}.
2: Determine the number of stacked layers J and the number of units in each stacked layer, that is, M = {m_1, m_2, ..., m_J}.
3: for i in M do
4:     Randomly select i units from {U_1, U_2, ..., U_n} and stack them to build the stacked layer. Skip the selection step if there is only a single kind of unit.
5: end for
6: Calculate the input and output nodes of the J stacked layers.
7: Add a normal layer between stacked layers according to the obtained node numbers.
8: Add other layers such as pooling and fully-connected layers to construct the entire BUnit-Net.
Training
Input: Training data pairs {X_train, y_train}.
Output: Compact neural network.
1: Initialize the weights and biases in the stacked layers and normal layers.
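The construction phase of Algorithm 1 can be sketched in a few lines of plain Python. The unit architectures below are written as (c_in, c_h, c_out) tuples with illustrative node counts (the actual Unit A-F architectures are those of Fig. 3), and the normal-layer wiring is reduced to recording node totals.

```python
import random

# A minimal sketch of the construction phase of Algorithm 1.
# Node counts for Unit A-F are illustrative placeholders, not the
# architectures from Fig. 3 of the paper.
UNITS = {"A": (1, 2, 1), "B": (1, 3, 2), "C": (1, 4, 3),
         "D": (2, 3, 2), "E": (2, 4, 1), "F": (2, 2, 2)}

def build_stacked_layers(unit_counts, seed=0):
    """Randomly select m_j units for each of the J stacked layers (steps 2-6)."""
    rng = random.Random(seed)
    layers = []
    for m_j in unit_counts:                                   # step 3: for i in M
        chosen = [rng.choice(sorted(UNITS)) for _ in range(m_j)]  # step 4
        c_in = sum(UNITS[u][0] for u in chosen)               # step 6: input nodes
        c_out = sum(UNITS[u][2] for u in chosen)              # step 6: output nodes
        layers.append({"units": chosen, "in": c_in, "out": c_out})
    return layers

# Two stacked layers of 20 units each, as in the toy models.
layers = build_stacked_layers([20, 20])
assert len(layers) == 2 and len(layers[0]["units"]) == 20
```

The per-layer node totals computed in step 6 are exactly what determines the widths of the normal layers inserted in step 7, so a real implementation would pass `layers[j]["out"]` and `layers[j+1]["in"]` to whatever dense layer sits between stacked layers.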

TABLE I RESULTS ON MNIST USING SINGLE AND HYBRID KINDS OF UNITS

TABLE II RESULTS OF TEN INITIAL EXPERIMENTS WITH DIFFERENT COMBINATIONS OF BASIC FUNCTION UNITS

E. CIFAR
The experiments on the CIFAR and ImageNet datasets are conducted on one NVIDIA Tesla P40 GPU and two NVIDIA Tesla V100S GPUs, respectively.

1) Tiny-ImageNet: We apply VGG-19 on Tiny-ImageNet and the results are recorded in Table VI(a). Compared with the baseline, the accuracy drop of BUnit-Net is relatively large, but it saves more than 63% FLOPs and 54% parameters. As analysed in the previous section, we can improve the accuracy by adjusting the hyperparameters to add nodes in the stacked layers or by applying a different replacing strategy. Compared with other methods, it is reasonable to select DMCP [73], with its higher TSA and lower accuracy drop, if we pay more attention to accuracy (i.e., w1 = 1, w2 = 4). However, BUnit-Net still obtains relatively larger TCA and TSA scores when focusing on resource consumption (i.e., w1 = 4, w2 = 1).

TABLE VIII COMPARISON WITH SHUFFLENET AND EFFICIENTNET