Roulette: A Pruning Framework to Train a Sparse Neural Network From Scratch

Due to space and inference-time restrictions, finding an efficient and sparse sub-network within a dense and over-parameterized network is critical for deploying neural networks on edge devices. Recent efforts obtain a sparse sub-network by performing network pruning during the training procedure to reduce training costs, such as memory and floating-point operations (FLOPs). However, these works take more than 1.4× the total number of iterations and must try all possible pruning parameters manually to obtain sparse sub-networks. In this paper, we present Roulette, a pruning framework to train a sparse network from scratch. First, we propose a novel method to train a sparse network by Pruning through the lens of the Loss Landscape iteratively and automatically (PLL). We provide a theoretical analysis showing that the curvature of the loss function is higher in the initial phase of training, which guides us in deciding when to start network pruning. According to our results on the CIFAR-10/100 and ImageNet datasets, PLL saves up to 4× training FLOPs compared with prior works while maintaining comparable or even better accuracy. Then we design push and pull operations to synchronize the pruned weights on different GPUs during training, scaling PLL to multiple GPUs linearly. To our knowledge, Roulette is the first network pruning framework that scales linearly to multiple GPUs.


I. INTRODUCTION
Deep learning has achieved great success in image classification [1]-[3], natural language processing [4]-[6], and speech recognition [7]. Recently, due to space and inference-time restrictions for network deployment on edge devices, more and more studies focus on model compression [8]-[13] to reduce the floating-point operations (FLOPs) and parameters of over-parameterized networks. Network pruning [13] is a progressive and iterative model compression method that removes individual connections or grouped structures. However, current pruning methods focus on reducing the parameter count and FLOPs during inference, ignoring the training cost of obtaining a sparse network.
As shown in Figure 1, these pruning approaches follow the common three-stage network pruning pipeline [8]: 1) training a large, over-parameterized model; 2) pruning the trained large model; 3) fine-tuning the pruned model to regain lost performance. The second and third steps are then repeated iteratively until the expected sparsity is achieved [14]-[17]. However, the sparse models trained by this pipeline require the same or even higher cost [18] than training dense models in terms of memory and FLOPs. As described in Table 1, it takes more than 667 hours to obtain an extremely sparse ResNet50.
Some recent efforts empirically explore efficient training methods to obtain sparse networks. The lottery ticket hypothesis [19] indicates that a winning ticket exists within a densely connected network. The winning ticket is a sub-network that contains fewer parameters and matches the test accuracy of the original network when trained alone. The pipeline for finding winning tickets is detailed in Fig. 1b. However, it is costly to find winning tickets from scratch in a randomly initialized network, as [19] suggests that the learning rate has to be less than 0.01 and that one can only search for the winning ticket after fully training the network.
In fact, the key connections are structured in the early phase of training [20]. There are three main challenges in searching for and finding these effective sub-networks in the early phase of training: 1) choosing a proper iteration to start pruning. Since it is hard to find out when the key connections are structured, one has to explore network pruning at every possible iteration. You et al. [21] calculate the Hamming distance between the pruned sub-network mask and the full dense network mask and prune the network once the distance falls below a threshold. However, there are weights with low magnitude in the early phase, which distorts the Hamming distance and misses the best pruning opportunity. 2) Searching for an appropriate pruning interval. One performs iterative pruning every pruning interval to train an extremely sparse network. As described in [22], the final test accuracy decreases if the pruning interval is too large or too small, yet an appropriate interval can only be found through extensive experiments. 3) The long time needed to train a sparse network. As shown in Table 1, it takes 500 hours to train a final ResNet50 model with a single K80 GPU. Previous works [14]-[17] adopt a mask matrix to prune unimportant connections, but it is hard to synchronize the pruned weights across GPUs when pruning on multiple GPUs.
We provide some insights into the training procedure: standard training has two phases, and the curvature of the loss landscape is higher in the initial phase. This curvature guides us in deciding when to start network pruning, and we then perform iterative pruning automatically during the initial epochs. In this paper, we focus on finding an effective sub-network in the early epochs and training it from scratch, reducing training FLOPs. Our contributions can be summarized as follows:
• We propose a novel method to train a sparse network from scratch by Pruning through the curvature of the Loss Landscape (PLL). PLL determines the pruning interval automatically.
• We design push and pull operations to synchronize the pruned weights on different GPUs, scaling PLL to multiple GPUs linearly.
• We design and implement a pruning framework Roulette to train a sparse network on multiple GPUs from scratch. To our knowledge, Roulette is the first pruning framework to prune an over-parameterized network for multiple GPUs.
• We perform a comprehensive evaluation of different networks on CIFAR10/100 and ImageNet. Experiments show that PLL saves up to 4× training FLOPs compared with existing pruning technologies while maintaining comparable or even better performance.
The rest of the article is organized as follows. We review the related work in Section II. Section III presents our approach, including the PLL pruning method and how to scale PLL to multiple GPUs. We report the experiments and results in Section IV. Finally, Section V summarizes our contributions and concludes.

II. RELATED WORK
Efficient networks are critical for many applications [23] in terms of latency and memory. We aim to design an efficient pruning method to train sparse and efficient networks from scratch. In this section, we will discuss the related work including network pruning, training efficient networks from scratch, and efficient training methods. We will explain the importance and necessity of our Roulette pruning framework.

A. NETWORK PRUNING
Network pruning is a promising method to reduce floating-point operations (FLOPs) and parameters in over-parameterized networks when we deploy these networks on edge devices [24]- [26]. There are two kinds of network pruning: unstructured pruning and structured pruning.
For unstructured pruning, existing studies [8], [13], [27] suggest that individual weights with low magnitude can be pruned. They can prune even 90% of the weights with negligible performance degradation. However, such networks cannot be accelerated without dedicated devices or libraries [28], [29]: the sparse networks are represented in CSR [30] or CSC format and computed with cuSparse [31] or MKL [32].
For structured pruning, we prune unused sub-blocks of the weights, such as channels, filters, or sub-graphs. Wen et al. [33] remove unimportant channels and filters in a convolutional network through lasso regularization. Liu et al. [34] propose a simple method to prune the scale factors in batch normalization. Luo et al. [35] leverage statistics computed from the next layer in a network and formulate filter pruning as an optimization problem. He et al. [16] design an iterative two-step algorithm that prunes each layer with LASSO-regression-based channel selection and least-squares reconstruction. Gao et al. [15] exploit the fact that the importance of the features computed by convolutional layers is input-dependent, amplifying salient convolutional channels and skipping unimportant ones at runtime.
As described in Fig. 1, the above methods train a sparse network from a pre-trained network and remove unimportant weights or connections by performing a pruning-retraining procedure. In this way, one first needs days or weeks to train a vast network and then runs different pruning algorithms to recover the performance degraded by pruning. In this paper, we focus on training a sparse network from scratch during training to save FLOPs and time.

B. AutoML FOR PRUNING
We perform unstructured and structured pruning to reduce memory and computation consumption during inference. There are many pruning schedules [36], including different pruning ratios, different pruning intervals, and layer-wise pruning. It is labor-intensive to enumerate all possible pruning schedules. He et al. [36] leverage reinforcement learning to sample the pruning design space and perform network pruning automatically. However, it is time-consuming to obtain the final compact neural network.
Prevailing pruning algorithms pre-define the width and depth of the pruned networks. Gordon et al. [37] propose a heuristic strategy to find a suitable network width by alternating between shrinking and expanding. Dong and Yang [38] apply neural architecture search to search directly for a network with flexible channel and layer sizes. These methods learn the width and depth of networks at the cost of substantial additional computation and memory.
The above methods compress networks automatically but introduce more training cost, such as computation and memory. In most cases, this additional training cost requires more GPUs, which are not always available to most practitioners and researchers.

C. TRAINING EFFICIENT NETWORKS FROM SCRATCH
In fact, sub-networks can be found in dense and over-parameterized networks during standard training. As described in [19], there exist winning tickets, sub-networks that contain fewer parameters and can be trained alone without compromising accuracy. However, there are no efficient ways to identify and train these winning-ticket sub-networks. Frankle et al. [39] restore the remaining weights of the pruned network to a checkpoint from the t-th iteration. Zhou et al. [40] propose and design supermasks to identify these winning tickets. These works focus on weight initialization after pruning but search for the winning-ticket sub-networks only after full training, which requires a lot of FLOPs and memory.
As pointed out in [20], the key connections of neural networks are structured in the early epochs of training. Frankle et al. [41] observe the changes that neural networks undergo during the early phase of training and elucidate that the network changes during this pivotal initial period of learning are obvious and important. Based on the above observations, You et al. [21] argue that the key patterns in a network are structured early and draw an Early-Bird ticket in this early phase of training. We give a theoretical analysis of the training procedure and propose the PLL pruning algorithm to identify the winning tickets in the very early epochs. Experimental results show that we achieve better accuracy and higher sparsity than [21] while requiring fewer FLOPs.
Other works design and train compact networks from scratch, such as MobileNet [42], MobileNetV2 [43], Pelee [44], and ShuffleNet [45]. These compact networks are designed for deployment on edge devices with limited resources. Our proposed PLL pruning method can work together with these networks to generate even more efficient networks.

D. EFFICIENT TRAINING METHODS
The pruned weights in a network are indicated by a mask matrix (consisting of 0-1 values, where 0 indicates the weight is pruned) during training [8], [13], [19], [21], [34]. The mask matrix is updated during the training procedure. As shown in Table 1, a deep and complex neural network involves so many FLOPs that it takes days or weeks to obtain a trained model.
There are two popular methods to accelerate the training procedure [46]: data parallelism and model parallelism. For data parallelism [47], given a batch of data, the network is computed separately on each GPU and updated with the synchronized gradients. You et al. [48] propose the Layer-wise Adaptive Rate Scaling (LARS) training algorithm to scale ResNet50 to a batch size of 32k without loss of accuracy. Ben-Nun and Hoefler [49] show that AllReduce is adopted to reduce the network communication among GPUs. For model parallelism, a network expressed as a sequence of layers is partitioned across different GPUs, where the sub-networks are data-dependent. Huang et al. [50] introduce TensorPipe, dividing different sub-sequences of layers across different accelerators.
The above-mentioned methods synchronize the whole set of weights across different processors or accelerators. In this paper, we use a mask matrix to indicate the pruned weights; to accelerate training, we must synchronize the pruned weights across different devices or accelerators.

III. OUR APPROACH
There are two phases during standard training [20]. We show that the curvature of the loss landscape is higher in the first phase than in the later phase, and we argue that the efficient sub-networks in over-parameterized networks emerge in this first phase. We therefore train them based on the curvature of the loss landscape. In this section, we present our approach to training sparse models from scratch automatically and to scaling our pruning approach to multiple GPUs.

A. PROBLEM DEFINITION
Let $\theta \in \mathbb{R}^d$ be the weights of a network. Consider a randomly initialized network $f_\theta(x)$: when trained with stochastic gradient descent on a dataset, $f$ reaches a minimum validation loss $f^i_{loss}$ at the $i$-th iteration with a test accuracy $f^i_{acc}$. In addition, consider a sub-network $f^m_\theta(x)$ with a mask $m \in \{0, 1\}^d$ that indicates whether each connection is pruned or not. We define $f^t_{maskacc}$ as the test accuracy when training $f^m_\theta(x)$ at iteration $t$. When trained with SGD, $f^m_\theta(x)$ reaches a minimum validation loss $f^j_{maskloss}$ at the $j$-th iteration with a test accuracy $f^j_{maskacc}$. Network pruning technologies are used to find a sub-network $f^m_\theta(x)$ with mask $m$ within $j$ iterations such that $f^j_{maskacc} \approx f^i_{acc}$. In most cases, reaching the same test accuracy requires $j \gg i$. The lottery ticket hypothesis indicates that one can find such a sub-network with mask $m$ within $j$ iterations satisfying $j \leq i$, $f^j_{maskacc} \geq f^i_{acc}$, and $\|m\|_1 \leq d$ (the total number of parameters of the dense network). In other words, theoretically, one can find the necessary structure of the original over-parameterized network without additional training iterations.
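For illustration, a minimal PyTorch sketch of how a binary mask $m$ encodes the sub-network $f^m_\theta$ (the helper name masked_weights is ours, not part of Roulette):

    import torch

    def masked_weights(theta: torch.Tensor, m: torch.Tensor) -> torch.Tensor:
        # m in {0, 1}^d: an entry of 0 means the corresponding connection is pruned.
        # The sub-network f_theta^m(x) is simply f evaluated with the weights m * theta.
        assert m.shape == theta.shape
        return m * theta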

B. LOSS LANDSCAPE
Consider a network encoding the approximate posterior distribution $p_\theta(y|x)$, parameterized by the weights $\theta$, of the variable $y$ given an input $x$. Given a perturbation $\theta' = \theta + \delta\theta$ of the weights, the discrepancy between $p_\theta(y|x)$ and the perturbed network output $p_{\theta'}(y|x)$ can be measured by their Kullback-Leibler divergence, which, to second-order approximation, is given by
$$\mathbb{E}_{x}\,\mathrm{KL}\big(p_{\theta'}(y|x)\,\|\,p_{\theta}(y|x)\big) = \tfrac{1}{2}\,\delta\theta^{\top} F\, \delta\theta + o(\|\delta\theta\|^{2}),$$
where the expectation over $x$ is computed using the empirical data distribution $\hat{Q}(x)$ given by the dataset, and
$$F = \mathbb{E}_{x\sim\hat{Q}(x)}\,\mathbb{E}_{y\sim p_\theta(y|x)}\big[\nabla_\theta \log p_\theta(y|x)\,\nabla_\theta \log p_\theta(y|x)^{\top}\big]$$
is the Fisher Information Matrix (FIM); equivalently, $F$ is the expectation of the Hessian $H_{p_\theta(y|x)}$ of $-\log p_\theta(y|x)$ with respect to $\theta$. In particular, weights with low Fisher information can be changed or pruned with little effect on the performance of the network [51]. We define the posterior distribution $p_\theta(y|x)$ to be the composition of a neural network $f_\theta(x)$ and an ''output'' conditional distribution $r(y|z)$, so that $p_\theta(y|x) = r(y \mid f_\theta(x))$. We then rewrite the loss function as $L(y, z) = -\log r(y|z)$. Therefore,
$$F = \mathbb{E}_{x,y}\big[\nabla_\theta f_\theta(x)^{\top}\,\nabla^{2}_{z} L(y, z)\,\nabla_\theta f_\theta(x)\big]$$
is a semi-definite approximation of the Hessian of the loss function [52], and hence of the curvature of the loss landscape at a particular point $\theta$ during training, providing an elegant connection between the FIM and the optimization procedure. However, the full FIM is too large to compute, and the eigenvalues of the Hessian [53], which characterize the local curvature of the loss landscape, determine how fast a model can be optimized via first-order methods. Instead, we use the loss change $\Delta f_{loss} = |f^{t+1}_{loss} - f^{t}_{loss}|$ as a metric measuring the effective connectivity of a DNN, since the FIM approximates the Hessian of the loss function and $\Delta f_{loss}$ increases when the maximum eigenvalue of the Hessian increases.

C. TWO PHASES OF LEARNING
Achille et al. [20] indicate that there are two phases of learning during network training: 1) At the initial phase, the network acquires information about the training data, which results in a large increase in the FIM and the strength of network connections; 2) When the performance of a network begins to plateau, the network starts decreasing the FIM and the overall strength of its connections while the performance keeps slowly improving.
According to the relation between $\Delta f_{loss}$, the FIM, and the Hessian above, there is a positive correlation between $\Delta f_{loss}$ and the FIM across these two phases of learning: 1) in the initial phase, the loss change $\Delta f_{loss}$ increases quickly while the strength of connections also increases; 2) afterwards, although the performance keeps improving, the loss change $\Delta f_{loss}$ is small. This means that we can eliminate the redundant connections once the loss change $\Delta f_{loss}$ becomes small.
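For illustration, a minimal sketch of this criterion (the class name and the default threshold are ours; PLL's actual implementation may differ):

    class LossChangeTracker:
        """Track the epoch-level loss change and flag when the initial
        high-curvature phase appears to be over (Delta f_loss <= eps)."""

        def __init__(self, eps: float = 0.03):
            self.eps = eps
            self.prev_loss = None

        def should_start_pruning(self, epoch_loss: float) -> bool:
            if self.prev_loss is None:
                self.prev_loss = epoch_loss
                return False
            delta = abs(epoch_loss - self.prev_loss)
            self.prev_loss = epoch_loss
            return delta <= self.eps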

D. ABLATION STUDY FOR LOSS FUNCTION
Now we perform an empirical analysis of the training procedure. We trained PreResNet101 on CIFAR10 for 160 epochs, ResNet50 on ImageNet for 90 epochs, and ResNet50-RPN on COCO2017 for 12 epochs. From Figure 2 we see that the loss change (calculated as the absolute value of the loss difference between consecutive epochs) is larger in the first few epochs and smaller in the later epochs until the learning rate is updated. As shown in Figure 3, for PreResNet101 on CIFAR10, the strong connections emerge in the early epochs, and the loss change for PreResNet101 is steeper in the early phase than in the later epochs. We prune the unimportant connections globally at the 8k-th iteration for VGG13, VGG16, ResNet18, and PreResNet101, when the loss change $\Delta f_{loss} \leq 0.03$. The training settings are as follows: the training batch size is 128; the initial learning rate is 0.1 and is multiplied by 0.1 at the 80th and 120th epoch; the training data is augmented with a padding of 4. We use momentum SGD with a weight decay of 0.0001 and a total of 160 epochs. The results are summarized in Table 3, from which we learn that we can prune a network in the early phase with negligible accuracy degradation.

E. TRAINING SPARSE NETWORKS FROM SCRATCH
Our method, PLL, is driven by the curvature of the loss landscape and is detailed in Algorithm 1. Given a dense network, PLL searches for the pruning step interval automatically and removes unimportant connections, channels, or filters based on magnitude, training a sparse network from scratch. We determine the pruning step interval $T$ during the first $N_{pll}$ iterations and complete network pruning before the learning rate of the multi-step schedule is updated.
An intuitive view of the PLL training procedure is given in Figure 4. We determine the pruning step interval $T$ based on the loss change and choose an end iteration $T_{end}$ by which network pruning should be completed. $T_{end}$ should be smaller than the iteration at which the learning rate is first updated, because the efficient sub-network emerges during early training. The main parts of PLL are explained below.

1) GLOBAL PRUNING
We perform global pruning rather than layer-wise pruning. In deeper and more complex neural networks, the deeper layers contain more parameters; pruning each layer with the same sparsity is unfair to the lower layers and hurts network performance. According to our experiments, the deeper layers are sparser in the final sparse network. We update the weight mask $m_t$ using the indices $\mathrm{ArgTopK}(-|m_{t-1} \odot \theta_{t-1}|, p)$.
Currently, we set the pruning ratio $p$ manually during the global pruning procedure.
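For illustration, a minimal PyTorch sketch of this global magnitude criterion, assuming $p$ is the cumulative fraction of all weights to prune (the helper name is ours):

    import torch

    def update_global_mask(weights: dict, masks: dict, p: float) -> dict:
        # Collect the magnitudes of the (currently unpruned) weights across all layers.
        scores = torch.cat([(w * masks[n]).abs().flatten() for n, w in weights.items()])
        k = int(p * scores.numel())
        if k == 0:
            return masks
        # Global threshold: the k-th smallest magnitude over the whole network.
        threshold = torch.kthvalue(scores, k).values
        # Keep a weight only if it survived so far and exceeds the global threshold.
        return {n: masks[n] * (w.abs() > threshold).float() for n, w in weights.items()}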

2) PRUNING SCHEDULE
Given the network prune ratio set $P$, we search for the efficient sub-networks automatically. We record the loss information and start to prune the network at the $t$-th iteration when $|f^t_{maskloss} - f^{t-1}_{maskloss}| \leq \epsilon$. We set the pruning step interval $T$ to $t$ because the efficient sub-networks are structured during the first $t$ iterations, and according to our experiments the pruned networks can recover the degraded performance within every $T$ iterations. PLL achieves a sparser network through iterative pruning compared with the one-shot pruning in [21], because important connections would be mis-pruned if a large sparsity were required at once. We perform pruning until iteration $T_{end}$, where $T_{end} \ll N$.
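For illustration, a minimal sketch of the resulting schedule, assuming pruning starts at iteration $t$ and repeats every $T = t$ iterations until $T_{end}$ (the helper name and example values are ours):

    def pruning_schedule(t: int, t_end: int, ratios: list) -> list:
        """Return (iteration, cumulative prune ratio) pairs, e.g.
        pruning_schedule(4000, 16000, [0.3, 0.5, 0.7])
        -> [(4000, 0.3), (8000, 0.5), (12000, 0.7)]."""
        steps = []
        it = t
        for p in ratios:
            if it > t_end:
                break
            steps.append((it, p))
            it += t
        return steps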

3) WEIGHTS STATE
We can keep the remaining weights or restore them to their initialized state after pruning. Frankle and Carbin [19] suggest that the initialized state is important for extremely sparse networks. Based on our results, both choices work well for the final sparse sub-networks.

4) PRUNED WEIGHTS
We can prune individual connections or grouped structures (channels, filters, sub-graphs). To run the pruned networks on existing devices, we perform structured pruning and remove the unused scale factors in batch normalization.
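For illustration, a minimal PyTorch sketch of pruning batch-normalization scale factors by global magnitude (the helper name is ours; Roulette's implementation may differ):

    import torch
    import torch.nn as nn

    def prune_bn_scale_factors(model: nn.Module, ratio: float) -> None:
        # Gather all BN scale factors (gamma) and find a global magnitude threshold.
        gammas = torch.cat([m.weight.data.abs().flatten()
                            for m in model.modules()
                            if isinstance(m, nn.BatchNorm2d)])
        k = max(1, int(ratio * gammas.numel()))
        threshold = torch.kthvalue(gammas, k).values
        # Zero out the channels whose scale factor falls below the threshold.
        for m in model.modules():
            if isinstance(m, nn.BatchNorm2d):
                mask = (m.weight.data.abs() > threshold).float()
                m.weight.data.mul_(mask)   # pruned scale factors
                m.bias.data.mul_(mask)     # and their shifts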

F. PRUNING ON MULTIPLE GPUs
As described in Algorithm 1, we can train a sparse network from scratch with the PLL method. However, it still takes much time to obtain these sub-networks from dense networks such as ResNet50. Therefore, we scale the PLL pruning method to multiple GPUs. The main challenge of running PLL on multiple GPUs is synchronizing the pruned weights across GPUs.
In this paper, we design push and pull operations to synchronize the weights among GPUs: 1) we pull the latest pruned weights from the GPUs to the local memory of the master GPU; 2) we push the latest pruned weights from the local memory of the master GPU to the memory of the other GPUs. As shown in Algorithm 1, the real weights are represented by $\theta_{t-1} \odot m_{t-1}$. Among the list of GPUs, we choose the GPU with rank 0 as the master GPU and pull the pruned weights $\theta_{t-1} \odot m_{t-1}$ in preparation for updating the mask matrix $m$. The details of the pull operation are presented in Algorithm 2.
The push operation is used to broadcast the latest pruned weights from the master GPU to the other GPUs; Algorithm 3 describes its procedure. Instead of transferring the whole weight tensor $\theta_t \odot m_t$, we divide the weights into multiple blocks and, as shown in Figure 6a, transfer one block of pruned weights at a time.

Algorithm 1 PLL to Train a Sparse Network
Result: An efficient sub-network $f(x; m \odot \theta)$
Input: Prune ratio set $P$; dense network $f(x; \theta)$; Reinit option; $T_{end}$
Initialize the weights $\theta$ of $f(x; \theta)$ to $\theta_0$ and the mask $m$;
Set the pruning step interval $T$ to $-1$;
for each remaining iteration $t$ do
    if $T = -1$ and $|f^t_{loss} - f^{t-1}_{loss}| \leq \epsilon$ then
        Set $T \leftarrow t$;
    if $t \bmod T = 0$ and $P$ is not empty then
        Pick the corresponding $p$ from $P$;
        Update the mask $m_t$ by $I_{drop}$;
        if Reinit then
            Restore the remaining weights to $\theta_0$;
        else
            Keep the remaining weights;
    Train with SGD;

Algorithm 2 Pull the Latest Weights
Input: A list of GPUs; the latest weights at the current iteration $t$
Result: The latest weights reside in the master memory
Pick a local GPU and copy the weights $m_{t-1} \odot \theta_{t-1}$ from that GPU to the master memory;
Prepare the weights $m_{t-1} \odot \theta_{t-1}$ for PLL;

Algorithm 3 Push the Pruned Weights
Input: A list of GPUs; $m_t$, $\theta_t$ at the current iteration $t$
Result: The pruned weights are consistent among GPUs
Calculate the pruned weights by the element-wise product $m_t \odot \theta_t$ on the master CPU;
Copy the pruned weights $m_t \odot \theta_t$ to a local GPU;
Treat the GPUs as a ring;
Split the weights into blocks;
Transfer the blocks among GPUs;

To avoid increasing the memory bandwidth requirement, we treat all of the GPUs as a ring and broadcast the blocks in a pipelined manner. The communication pattern for the pruned weights $\theta_t \odot m_t$ is depicted in Figure 6b. Assume that $N$ is the number of bytes of weights to push, $B$ is the communication bandwidth, and $k$ is the number of GPUs. We split the weights into $S$ blocks and transfer $N/S$ bytes to the next GPU in each step. The total time is $N(S + k - 2)/(S \cdot B)$, which approaches $N/B$ when $S \gg k$. Therefore, we can scale to multiple GPUs effectively without being limited by the number of GPUs.
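For illustration, a minimal torch.distributed sketch of the block-wise push, assuming one process per GPU with rank 0 as the master (the helper name and block size are ours):

    import torch
    import torch.distributed as dist

    def push_pruned_weights(masked_weights: list, block_elems: int = 1 << 20) -> None:
        # Flatten the pruned weights (theta * m) into one buffer on each rank.
        flat = torch.cat([w.flatten() for w in masked_weights])
        # Broadcast block by block from the master GPU (rank 0) so that a
        # single huge transfer does not dominate memory bandwidth.
        for start in range(0, flat.numel(), block_elems):
            dist.broadcast(flat[start:start + block_elems], src=0)
        # Scatter the synchronized values back into the original tensors.
        offset = 0
        for w in masked_weights:
            w.copy_(flat[offset:offset + w.numel()].view_as(w))
            offset += w.numel()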

G. ROULETTE PRUNING FRAMEWORK
We design and implement the Roulette pruning framework to train a sparse network from scratch on multiple GPUs. The training procedure of Roulette is illustrated in Figure 5: we push the pruned weights before calculating the gradients of the network, and we pull the pruned weights to calculate and update the mask matrix before performing PLL. According to our experiments, network pruning is completed in the initial phase of training by the PLL pruning method, and the pruned network is trained with SGD in the remaining iterations. In this way, we save many FLOPs in obtaining a sparse network.
Roulette provides a YAML file to set up the pruning schedule, pruning rate, and initialization method after pruning. As shown in Listing 1, a network can be pruned by adding just two lines of code in Roulette: after each epoch, we record the loss change and decide whether or not to prune the network.
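For illustration, a hypothetical configuration of this kind might look as follows (the field names are ours, not Roulette's actual schema):

    # roulette.yaml (illustrative field names, not the actual schema)
    pruning:
      schedule: loss_change      # determine the interval T from the loss change
      epsilon: 0.03              # start pruning when |delta loss| <= epsilon
      ratios: [0.3, 0.5, 0.7]    # iterative prune ratios P
      end_iteration: 16000       # T_end
      reinit: false              # keep the remaining weights after pruning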

IV. EXPERIMENTS
In this section, we evaluate our PLL on different networks and datasets and check whether the PLL can scale to multiple GPUs linearly or not.

A. EXPERIMENT SETTINGS
We design and implement an efficient training method to find a sparse network. We evaluate this method on different datasets, including CIFAR10/100 [58] and ImageNet [59], to check whether we obtain sparse networks or not.
For CIFAR10/100, we evaluate VGG16, VGG19, and PreResNet101, whose architectures follow [13] and [21]. We set the initial learning rate to 0.1 and decay it by 0.1 at the 80th and 120th epochs. We update the weights with the SGD optimizer with a momentum of 0.9 and a weight decay of 0.0001. We train and evaluate these sparse networks with a batch size of 128 for a total of 160 epochs.
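For illustration, these settings correspond to the following standard PyTorch configuration (the torchvision model is only a stand-in for the architectures from [13] and [21]):

    import torch
    import torchvision

    model = torchvision.models.vgg16(num_classes=10)   # stand-in architecture
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                                momentum=0.9, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                     milestones=[80, 120], gamma=0.1)
    # batch size 128, 160 epochs in total; scheduler.step() is called once per epoch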
For ImageNet, we train sparse networks from dense ResNet18 and ResNet50. We find these sparse networks with an initial learning rate of 0.1, multiplied by 0.1 at the 30th and 60th epochs. We also adopt the SGD optimizer with a momentum of 0.9 and a weight decay of 0.0001.

B. RESULTS ON CIFAR10/100
First, we evaluate the computation savings when pruning individual weights of VGG16 and VGG19, whose architectures are described in [13]. We set the threshold $\epsilon$ to 0.03 and perform iterative pruning at the 4k-th, 8k-th, and 12k-th iterations. We compare our PLL with SNIP [54], LT (one-shot) [19], and LWC [13].
It is hard to calculate the total training FLOPs when pruning individual weights. Therefore, we count the total number of training iterations as the training cost metric for obtaining a sparse model from a dense model. The results in Table 4 show that we can prune 95% of the weights in VGG19 and train a sparse network with comparable performance. At the same time, we achieve about a 2× saving in training iterations over LWC. Compared with the official LT method, we reduce the training iterations by about 3×.
Our PLL can train a sparse network from scratch effectively. We save about 2.3× training iterations compared with SNIP, because SNIP prunes networks at initialization before training and needs more iterations to recover the degraded accuracy.
Then we prune the scale factors in batch normalization to obtain group-sparse networks and reduce the training FLOPs; here, we count the training FLOPs as the training cost metric. We find the effective sub-networks from VGG16 and PreResNet101, whose architectures are borrowed from [21]. According to our experiments, we prune the scale factors at the 4k-th, 8k-th, and 12k-th iterations. We compare the accuracy and training FLOPs of PLL, LT (one-shot) [19], SNIP [54], NS [34], ThiNet [55], and EB [21]. The results are summarized in Table 5, which demonstrates that our PLL reduces training FLOPs by 5×-7× compared with the lottery tickets [19]. We also achieve an accuracy 2.82% higher than the lottery tickets for PreResNet101 on CIFAR100 when pruning 50% of the scale factors. We can find the efficient sub-networks earlier than the EB training method: for example, we train a sparse network from PreResNet101 on CIFAR100 with 66% of the training FLOPs of EB, saving about 2× training FLOPs to achieve a sub-network with the same accuracy. For VGG16 on CIFAR10, we prune 70% of the scale factors and obtain accuracy comparable to EB Train and NS while reducing training FLOPs by 1.57× and 3.8×, respectively. We can even reduce training FLOPs by 2.47× compared with EB at the cost of 2% accuracy degradation.
From the above analysis of results on CIFAR10/100, PLL can find efficient sub-networks in the initial phase of standard training, reducing many training FLOPs.

C. RESULTS ON ImageNet
Now we evaluate PLL on a large dataset, ImageNet. We perform pruning on ResNet networks because ResNet involves many FLOPs, and we can save much time if we can prune these networks with negligible performance degradation.
The total number of training epochs is 90. We set the threshold $\epsilon$ to 0.5 for the ImageNet dataset and prune the scale factors, comparing the resulting accuracy and training FLOPs with those of NS [34], SFP [56], EB [21], LCCL [57], TAS [38], and ThiNet [55]. Compared with TAS [38], PLL reaches comparable accuracy with 3× fewer training FLOPs. When the pruning ratio is greater than 70%, the network suffers a serious accuracy drop; therefore, we perform pruning with ratios of 30% and 50%. According to Algorithm 1, we start to prune the network at the 3rd epoch. The experimental results are described in Table 6. PLL outperforms EB by 1.1×-1.6× in terms of training FLOPs. At the same time, PLL achieves top-1 accuracy higher by 0.19% and 1.34% for ResNet18 and ResNet50, respectively. These results demonstrate that PLL can find efficient sub-networks on a large dataset.

D. ABLATION STUDY FOR PLL
We analyze the sparsity of the batch normalization scale factors of VGG16 on CIFAR10 after pruning. The per-layer sparsity is presented in Table 7. The Sparsity field in Table 7 indicates the percentage of zeros among the parameters contained in the BN layers of the network, and the Saved FLOPs field indicates how many FLOPs of the convolution operation can be saved given the sparsity of the corresponding BN layer.
From the results in Table 7, we learn that: 1) we prune more of the parameters contained in the deeper layers than in the lower layers; for example, more than 90% of the scale factors in the deeper layers are pruned, versus less than 10% of the parameters in the lower layers; 2) we should perform global pruning rather than layer-wise pruning because more parameters exist in the deeper layers.
Then we verify the structure of the final sparse models. In Figure 7, we compare the scale factors of the final dense network and the pruned network: more than 70% of the positions of the large-magnitude scale factors overlap between the two networks. We also plot the loss curves trained by PLL and by the normal method; Figure 7(c) demonstrates that PLL converges.

E. SCALING TO MULTIPLE GPUs
We design push and pull operations to synchronize the pruned weights across different GPUs, and we evaluate whether PLL scales to multiple GPUs linearly. We benchmark the throughput of VGG16, VGG19, PreResNet101, DenseNet121, and GoogleNet when scaling to multiple GPUs; these networks contain residual connections, skip connections, and direct connections. The machine used for benchmarking is equipped with 128 GB of memory, two E5-2640 CPUs, and four K80 GPUs.
As shown in Figure 8, given a batch of image data (each image of size 32*32*3), the throughput for VGG16 is 1163.64 imgs/s on a single GPU and 3938.46 imgs/s on four GPUs, i.e., the throughput increases to 3.4× with four GPUs. Here, the mini-batch size for each GPU is 128. The mini-batch does not make full use of the GPU when training VGG16, so the performance gain is near-linear when scaling to multiple GPUs. For GoogleNet, the throughput increases to 3.7× with four GPUs. For other networks, such as PreResNet101, DenseNet121, and VGG19, the throughput improves to 3.5×-3.8× when scaling to four GPUs.

V. CONCLUSION AND DISCUSSION
In this paper, we design and implement the Roulette pruning framework to train a sparse network from scratch. First, we propose an efficient pruning method, PLL, which automatically determines the pruning interval and finds sparse networks in the early stage of model training at a lower cost. We provide comprehensive experiments on several classical datasets and demonstrate that our approach significantly outperforms the competitive baselines. Then we design push and pull operations to scale the PLL pruning method to multiple GPUs linearly. To our knowledge, Roulette is the first pruning framework that supports network pruning on multiple GPUs with linear scaling. Currently, we try each possible pruning ratio to prune the redundant weights. In future work, we will design a method to find an optimal pruning ratio for a neural network automatically.
QIAOLING ZHONG is currently pursuing the Ph.D. degree with the CAS Key Laboratory of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences. He has published three papers in top conferences and journals. His major research interests include model compression, inference acceleration, and neural network compilers.
ZHIBIN ZHANG (Member, IEEE) is currently an Associate Professor with the Institute of Computing Technology, Chinese Academy of Sciences. He leads a research group working on systems research. He has published more than 20 papers in top conferences and journals, including INFOCOM, ICNP, DEX, and JSAC. Since 2014, he has been focusing on big data systems and leading a high-performance graph computing system, SQLGraph, and a neural network compiler, MLCC. His research interests include graph systems, streaming processing systems, and machine learning systems. He received the Best Paper Award of ICNP 2012 and DEX 2019.