ShakeDrop Regularization for Deep Residual Learning

Overfitting is a crucial problem in deep neural networks, even in the latest network architectures. In this paper, to relieve the overfitting effect of ResNet and its improvements (i.e., Wide ResNet, PyramidNet, and ResNeXt), we propose a new regularization method called ShakeDrop regularization. ShakeDrop is inspired by Shake-Shake, which is an effective regularization method, but can be applied to ResNeXt only. ShakeDrop is more effective than Shake-Shake and can be applied not only to ResNeXt but also ResNet, Wide ResNet, and PyramidNet. An important key is to achieve stability of training. Because effective regularization often causes unstable training, we introduce a training stabilizer, which is an unusual use of an existing regularizer. Through experiments under various conditions, we demonstrate the conditions under which ShakeDrop works well.


Introduction
Recent advances in generic object recognition have been brought by deep neural networks. After ResNet [11] opened the door to very deep CNNs of over a hundred layers by introducing the building block, its improvements such as Wide ResNet [29], PyramidNet [8,9] and ResNeXt [27] have broken the records of lowest error rates. The development of such base network architectures, however, is not enough to reduce the generalization error (i.e., the difference between the training and test errors) sufficiently. It is thought that some regularization methods are required to reduce the generalization error [30]. Widely used regularization methods include data augmentation [16], stochastic gradient descent (SGD) [30], weight decay [18], batch normalization (BN) [14], label smoothing [23], adversarial training [6], mixup [31,24,25] and dropout [21,26].
Recently, an effective regularization method named Shake-Shake regularization [5,4] was proposed. It is an interesting method which, during training, disturbs the calculation of the forward pass by using a random variable, and also that of the backward pass by using a different random variable. Its effectiveness is demonstrated by an experiment in which ResNeXt with Shake-Shake applied (hereafter denoted by "ResNeXt + Shake-Shake") achieved the lowest error rate on the CIFAR-10/100 datasets [15]. Shake-Shake, however, has the following two drawbacks: (1) it can be applied only to ResNeXt, and (2) the reason it is effective has not yet been revealed.
The current paper addresses these problems. For problem (1), we propose a novel and powerful regularization method, ShakeDrop regularization, which is more effective than Shake-Shake. Its big advantage is that it can be applied not only to ResNeXt (hereafter, 3-branch architectures) but also to ResNet, Wide ResNet and PyramidNet (2-branch architectures). For problem (2), in the process of deriving ShakeDrop, we give an intuitive interpretation of Shake-Shake. Through experiments using various base network architectures, we reveal the conditions under which the proposed ShakeDrop works successfully.

Regularization methods for ResNet family
In this section, we present two regularization methods for the ResNet family, both of which are used to derive the proposed method. Shake-Shake regularization [5,4] is an effective regularization method for ResNeXt. It is illustrated in Fig. 1(a). The basic ResNeXt building block, which has a 3-branch architecture, is given as

$$G(x) = x + F_1(x) + F_2(x), \tag{1}$$

where x and G(x) are the input and output of the building block, and F_1(x) and F_2(x) are the outputs of the two residual branches.
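Letting α and β be independent random coefficients uniformly drawn from the range [0, 1], Shake-Shake is given as

$$G(x) = \begin{cases} x + \alpha F_1(x) + (1 - \alpha) F_2(x), & \text{in train-fwd} \\ x + \beta F_1(x) + (1 - \beta) F_2(x), & \text{in train-bwd} \\ x + E[\alpha] F_1(x) + E[1 - \alpha] F_2(x), & \text{in test} \end{cases} \tag{2}$$

where train-fwd and train-bwd mean the forward and backward passes of training, respectively. The expected values are E[α] = E[1 − α] = 0.5. Eqn. (2) means that the calculation of the forward pass is multiplied by a random coefficient α and that of the backward pass by another random coefficient β. The values of α and β are drawn for every image or batch. The paper suggests training longer than usual (more precisely, six times as long as usual).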
In the training of neural networks, if the output of a residual branch is multiplied by a coefficient α in the forward pass, it is natural to multiply the gradient by the same coefficient in the backward pass. Hence, Shake-Shake regularization makes the gradient β/α times as large as the correctly calculated gradient on one branch and (1 − β)/(1 − α) times on the other branch. It seems that the disturbance prevents the network parameters from being captured in local minima. However, the reason why such disturbance is effective has not been sufficiently revealed.
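As a minimal sketch of the forward and backward coefficients of Shake-Shake and of the resulting gradient ratios, the following Python snippet treats the branch outputs as plain numbers. The function names are hypothetical and chosen for illustration; they are not taken from the Shake-Shake implementation.

```python
def shake_shake_forward(x, f1, f2, alpha):
    """Training forward pass: x + alpha*F1(x) + (1 - alpha)*F2(x)."""
    return x + alpha * f1 + (1.0 - alpha) * f2


def shake_shake_backward(grad_out, beta):
    """Training backward pass: the gradients sent to the two residual
    branches are weighted by beta and (1 - beta), not alpha and (1 - alpha)."""
    return beta * grad_out, (1.0 - beta) * grad_out


def gradient_ratios(alpha, beta):
    """Ratio of the Shake-Shake gradient to the correctly calculated
    gradient on each branch: beta/alpha and (1 - beta)/(1 - alpha)."""
    return beta / alpha, (1.0 - beta) / (1.0 - alpha)


# Illustrative values only (assumed, not from the paper).
out = shake_shake_forward(x=1.0, f1=2.0, f2=4.0, alpha=0.25)
g1, g2 = shake_shake_backward(grad_out=1.0, beta=0.75)
r1, r2 = gradient_ratios(alpha=0.25, beta=0.75)
```

With α = 0.25 and β = 0.75, the first branch receives a gradient three times larger than the correctly calculated one, while the second branch receives one three times smaller.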
RandomDrop regularization (a.k.a. Stochastic Depth and ResDrop) [13] is a regularization method originally proposed for ResNet and also applied to PyramidNet [28]. It is illustrated in Fig. 1(b). The basic ResNet building block, which has a 2-branch architecture, is given as

$$G(x) = x + F(x), \tag{3}$$

where F(x) is the output of the residual branch. RandomDrop makes the network apparently shallow during learning by dropping some stochastically selected building blocks. The l-th building block from the input layer is given as

$$G(x) = \begin{cases} x + b_l F(x), & \text{in train-fwd} \\ x + b_l F(x), & \text{in train-bwd} \\ x + E[b_l] F(x), & \text{in test} \end{cases} \tag{4}$$

where b_l ∈ {0, 1} is a Bernoulli random variable with the probability P(b_l = 1) = E[b_l] = p_l. The paper recommends the linear decay rule to determine p_l, which is given as

$$p_l = 1 - \frac{l}{L}(1 - p_L), \tag{5}$$

where L is the number of all building blocks and p_L is the initial parameter. The paper suggests using p_L = 0.5.
RandomDrop can be regarded as a simplified version of dropout [21]. The main difference is that RandomDrop drops layers, whereas dropout drops elements.
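The linear decay rule and the behaviour of a RandomDrop block can be sketched as follows. This is an illustrative sketch on scalars, assuming the linear decay rule takes the form p_l = 1 − (l/L)(1 − p_L) as in the Stochastic Depth paper; the function names are hypothetical.

```python
import random


def linear_decay_probs(num_blocks, p_last):
    """Probability p_l = 1 - (l / L) * (1 - p_L) for l = 1, ..., L."""
    return [1.0 - (l / num_blocks) * (1.0 - p_last)
            for l in range(1, num_blocks + 1)]


def random_drop_block(x, f, p_l, training, rng=random):
    """RandomDrop on scalars: keep F(x) with probability p_l during
    training, and scale it by E[b_l] = p_l at test time."""
    if training:
        b = 1.0 if rng.random() < p_l else 0.0
        return x + b * f
    return x + p_l * f


# Toy network with 4 building blocks and p_L = 0.5.
probs = linear_decay_probs(num_blocks=4, p_last=0.5)
```

Blocks near the input are kept almost always, while the last block is kept with probability p_L.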

ShakeDrop Regularization
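The proposed ShakeDrop, illustrated in Fig. 1(d), is given as

$$G(x) = \begin{cases} x + (b_l + \alpha - b_l \alpha) F(x), & \text{in train-fwd} \\ x + (b_l + \beta - b_l \beta) F(x), & \text{in train-bwd} \\ x + E[b_l + \alpha - b_l \alpha] F(x), & \text{in test} \end{cases} \tag{6}$$

where b_l is a Bernoulli random variable with the probability P(b_l = 1) = E[b_l] = p_l given by the linear decay rule (Eqn. (5)), and α and β are independent uniform random variables. The most representative ranges of α and β are (1) α = 0, β ∈ [0, 1] and (2) α ∈ [−1, 1], β ∈ [0, 1]. Further detail about the parameters is presented in Sec. 4.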
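In the training phase, b_l controls the behaviour of ShakeDrop. If b_l = 1, Eqn. (6) reduces to

$$G(x) = x + F(x) \quad \text{in train-fwd and train-bwd}.$$

That is, ShakeDrop is equivalent to the original network (e.g., ResNet). If b_l = 0, Eqn. (6) reduces to

$$G(x) = \begin{cases} x + \alpha F(x), & \text{in train-fwd} \\ x + \beta F(x), & \text{in train-bwd} \end{cases}$$

That is, the calculation of F(x) is perturbed by α and β.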
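The two cases above can be checked with a small sketch. It assumes that the training coefficient on F(x) has the form b_l + α − b_l α in the forward pass (and the same with β in the backward pass), and that b_l and α are independent when the test-time expectation is computed. The function names and default ranges are illustrative, not the authors' implementation.

```python
import random


def shakedrop_coefficients(b, alpha, beta):
    """Return the (forward, backward) multipliers applied to F(x) in training."""
    forward = b + alpha - b * alpha
    backward = b + beta - b * beta
    return forward, backward


def sample_shakedrop(p_l, alpha_range=(-1.0, 1.0), beta_range=(0.0, 1.0), rng=random):
    """Draw b_l ~ Bernoulli(p_l) and alpha, beta ~ Uniform(range)."""
    b = 1.0 if rng.random() < p_l else 0.0
    alpha = rng.uniform(*alpha_range)
    beta = rng.uniform(*beta_range)
    return shakedrop_coefficients(b, alpha, beta)


def expected_test_coefficient(p_l, alpha_range=(-1.0, 1.0)):
    """E[b_l + alpha - b_l*alpha] = p_l + E[alpha] * (1 - p_l),
    assuming b_l and alpha are independent (an assumption of this sketch)."""
    mean_alpha = 0.5 * (alpha_range[0] + alpha_range[1])
    return p_l + mean_alpha * (1.0 - p_l)


kept = shakedrop_coefficients(1.0, -0.5, 0.25)       # b_l = 1: original network
perturbed = shakedrop_coefficients(0.0, -0.5, 0.25)  # b_l = 0: (alpha, beta)
```

Under these assumptions, both α = 0 and α ∈ [−1, 1] have E[α] = 0, so the test-time multiplier equals p_l, the same scaling that RandomDrop uses.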
In the following sections, we present how ShakeDrop is derived (Sec. 3.2) and discuss its relationship with existing regularization methods (Sec. 3.3).

Interpretation of Shake-Shake regularization
We give an intuitive interpretation of the forward pass of Shake-Shake. To the best of our knowledge, such an interpretation has not been given yet. As shown in Eqn. (2) (and also in Fig. 1(a)), in the forward pass, Shake-Shake interpolates the outputs of the two residual branches (i.e., F_1(x) and F_2(x)) with a random weight α. Since DeVries & Taylor [2] demonstrated that interpolation of two data in the feature space can synthesize reasonable augmented data, the interpolation of the two residual branches of Shake-Shake in the forward pass can be interpreted as synthesizing data. The use of a random weight α enables us to generate many different augmented data. On the other hand, in the backward pass, a different random weight β is used to disturb learning so that the network remains trainable for a long time.

Single-branch Shake Regularization
The regularization mechanism of Shake-Shake relies on two or more residual branches, so that it can be applied only to 3-branch network architectures (i.e., ResNeXt). In order to realize a regularization similar to Shake-Shake on 2-branch architectures (i.e., ResNet, Wide ResNet and PyramidNet), we need a mechanism different from interpolation in the forward pass that can synthesize augmented data in the feature space. Actually, DeVries & Taylor [2] demonstrated that not only interpolation but also noise addition in the feature space generates reasonable augmented data. Hence, following Shake-Shake, we apply random perturbation to the output of a residual branch (i.e., F(x) of Eqn. (3)). That is, it is given as

$$G(x) = \begin{cases} x + \alpha F(x), & \text{in train-fwd} \\ x + \beta F(x), & \text{in train-bwd} \\ x + E[\alpha] F(x), & \text{in test} \end{cases}$$

We call this regularization method Single-branch Shake. Single-branch Shake is expected to be effective like Shake-Shake. However, it does not work well in practice. For example, in our preliminary experiments, when we applied it to the 110-layer PyramidNet with α ∈ [0, 1] and β ∈ [0, 1] following Shake-Shake, the result on the CIFAR-100 dataset was hopelessly bad (i.e., an error rate of 77.99%).

Stabilization of training
What caused the failure of Single-branch Shake? A natural guess is that Shake-Shake has a stabilizing mechanism that Single-branch Shake does not have. It must be the "two residual branches." Let us examine whether this makes sense. As presented in Sec. 2, during training, Shake-Shake makes the gradients of the two branches β/α times and (1 − β)/(1 − α) times as large as the correctly calculated gradients. Thus, when α is close to 0 or 1, it can spoil training because it can make a gradient prohibitively large. However, the two residual branches of Shake-Shake work as a fail-safe system. That is, even if the coefficient on one branch (let us assume β/α) is large, the other (i.e., (1 − β)/(1 − α)) is kept small. In other words, training on at least one branch is not spoiled. Single-branch Shake, however, does not have such a fail-safe system.
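The fail-safe argument can be illustrated numerically. Since β > α implies 1 − β < 1 − α, at most one of the two ratios β/α and (1 − β)/(1 − α) can exceed 1. The following sketch (with a hypothetical helper name and an arbitrary sample size) draws random coefficients and records the largest single ratio and the largest value taken by the smaller of the two ratios.

```python
import random


def branch_ratios(alpha, beta):
    """Gradient scale factors beta/alpha and (1 - beta)/(1 - alpha)."""
    return beta / alpha, (1.0 - beta) / (1.0 - alpha)


rng = random.Random(0)
worst_single = 0.0  # largest beta/alpha seen (all that a single branch has)
worst_paired = 0.0  # largest value of the smaller of the two ratios
for _ in range(10000):
    # Keep alpha away from 0 and 1 to avoid division by zero in this toy check.
    alpha = rng.uniform(0.001, 0.999)
    beta = rng.random()
    r1, r2 = branch_ratios(alpha, beta)
    worst_single = max(worst_single, r1)
    worst_paired = max(worst_paired, min(r1, r2))
```

In this sketch `worst_paired` never exceeds 1, whereas `worst_single` grows large whenever α happens to be close to 0, which is the situation Single-branch Shake cannot escape.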
From the discussion above, the failure of Single-branch Shake is caused by too strong a perturbation and the lack of a stabilizing mechanism. However, weakening the perturbation would also weaken the effect of regularization. Thus, we need a trick to stabilize learning under strong perturbation.
Table 1. Regularization methods generating new data. "Sample-wise generation" means that data is generated using a single sample.

| Regularization method | Data augmentation in (input) data space | Data augmentation in feature space | Data augmentation in label space | Sample-wise generation |
|---|---|---|---|---|
| Data augmentation [16] | ✓ | | | ✓ |
| Adversarial training [6] | ✓ | | | ✓ |
| Label smoothing [23] | | | ✓ | ✓ |
| Mixup [31,24] | ✓ | | ✓ | |
| Manifold mixup [25] | ✓ | ✓ | ✓ | |
| Shake-Shake [5,4] | | ✓ | | ✓ |
| ShakeDrop | | ✓ | | ✓ |

Our idea is to use the mechanism of RandomDrop to solve this issue. In our situation, however, the original usage of RandomDrop does not help, because a shallower version of a strongly perturbed network (e.g., a shallow version of "PyramidNet + Single-branch Shake") would also suffer from strong perturbation. Thus, we use the mechanism of RandomDrop as a probabilistic switch between the following two network architectures: the original network (e.g., PyramidNet) and the network with strong perturbation (e.g., "PyramidNet + Single-branch Shake").
By mixing them up, it is expected that (1) when the original network is selected, learning is correctly promoted, and (2) when the network with strong perturbation is selected, learning is disturbed. To achieve good performance, the two networks should be well balanced, which is controlled by the parameter p_L. We will discuss this issue in Secs. 4 and 5.

Relationship with existing regularization methods
In this section, we discuss the relationship between ShakeDrop and existing regularization methods. Among them, SGD and weight decay are commonly used techniques in the training of deep neural networks. Though they are not designed for regularization, it has been pointed out that they have generalization effects [30,18]. BN [14] is a strong regularization technique which is widely used in recent network architectures. The proposed ShakeDrop is used in addition to these regularization methods.
ShakeDrop differs from RandomDrop [13] and dropout [21,26] in the following two points. One is that they do not explicitly generate new data. The other is that they do not update network parameters based on noisy gradients. Some methods regularize by generating new data. They are summarized in Table 1. Data augmentation [16] and adversarial training [6] synthesize data in the (input) data space. They differ in how they generate data. While the former uses manually designed means such as random crop and horizontal flip, the latter automatically generates data that should be used for training to improve the generalization performance. Label smoothing [23] generates (changes) labels for existing data. The methods mentioned above generate new data using a single sample. On the other hand, some methods require multiple samples to generate new data. Mixup and BC learning [31,24] generate new data and their corresponding class labels by interpolating two data. Though they generate new data in the data space, manifold mixup [25] does so also in the feature space. Compared with ShakeDrop, which generates data in the feature space using a single sample, none of these regularization methods is in the same category except Shake-Shake.

Preliminary Experiments
The proposed ShakeDrop regularization has three parameters: α, β and p_L. In addition, there are four possible update rules of α and β. In this section, we search for the best parameters, except p_L, and the best update rule on the CIFAR-100 dataset. Following RandomDrop regularization [13], we used p_L = 0.5 as the default.

Ranges of α and β
In this section, the best parameter ranges of α and β are experimentally explored. We applied ShakeDrop to three network architectures: ResNet, ResNet (EraseReLU version) and PyramidNet. EraseReLU erases the Rectified Linear Unit (ReLU) at the bottom of the building blocks to improve the generalization performance [3]. Note that EraseReLU does not affect PyramidNet because PyramidNet does not have ReLU at the bottom of the building blocks.
Table 2 shows the representative parameter ranges of α and β that we tested and their results. Cases A and B correspond to the vanilla network (i.e., without regularization) and RandomDrop, respectively. On all three network architectures, case B was better than case A. Let us see the results of the three network architectures one by one. Through experiments using various base network architectures shown in Sec. 5, we found that case O is effective on "EraseReLU"ed architectures. On the other hand, case G is effective on non-"EraseReLU"ed architectures.

Update rule of α and β
The best update rule of α and β is sought among the batch, image, channel and pixel levels. They mean that the same α and β are used for "all the images in the mini-batch," "each image," "each channel," and "each element" for each building block, respectively.
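One way to picture the four update rules is through the shape of the coefficient tensor that is broadcast against the output of a residual branch. The sketch below assumes an (N, C, H, W) tensor layout, which the text does not specify; the function name is hypothetical.

```python
def coefficient_shape(level, n, c, h, w):
    """Shape of the alpha (or beta) tensor to broadcast against an
    (n, c, h, w) residual-branch output for each update rule."""
    shapes = {
        "batch": (1, 1, 1, 1),    # one value shared by the whole mini-batch
        "image": (n, 1, 1, 1),    # one value per image
        "channel": (n, c, 1, 1),  # one value per channel of each image
        "pixel": (n, c, h, w),    # one value per element
    }
    return shapes[level]


# Example: a mini-batch of 128 feature maps with 16 channels of 32 x 32.
pixel_shape = coefficient_shape("pixel", 128, 16, 32, 32)
```

Moving from the batch level to the pixel level increases the number of independent draws per building block, from a single value to one value per element.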

Combinations of (α, β) for analyzing ShakeDrop behaviour
Though we have successfully found effective ranges of α and β and their update rule, we still do not grasp what mechanism contributes to improving the generalization performance of ShakeDrop. One reason is that α and β are random variables. Because of this, what we obtain at the end of training is a network that was trained using various observed values of α and β, which makes it more difficult to understand the mechanism. Hence, in this section, we explore effective combinations of (α, β).
Table 4. [Combinations of (α, β)] Top-1 errors (%) of "PyramidNet + ShakeDrop" at the final (300th) epoch on the CIFAR-100 dataset in the batch-level update rule. Results with * are quoted from Table 2.

The combinations of (α, β) are defined as follows. From the best ranges of α and β for PyramidNet, which are α ∈ [−1, 1] and β ∈ [0, 1], by taking both ends of the ranges, we obtain four pairs: (−1, 0), (−1, 1), (1, 0) and (1, 1). Then, we examine all their combinations, which are shown in Table 4. Intuitively, instead of drawing α and β from the ranges, the values of α and β are picked from the pool of (α, β) pairs with equal probability. Table 4 shows the combinations of (α, β) and their results. First of all, compared with the best result in Table 2 (i.e., case O; 16.22%), the results in Table 4 were almost comparable. In particular, the best result in Table 4 (i.e., case i; 16.24%) was almost equivalent.
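The construction of the pools can be sketched as follows. The sketch reads "all combinations" as every non-empty subset of the four end-point pairs, which is an interpretation rather than something the text states explicitly; the variable names are illustrative.

```python
from itertools import combinations, product

# Both ends of alpha in [-1, 1] and of beta in [0, 1].
pairs = list(product((-1, 1), (0, 1)))

# Every non-empty pool of pairs; during training, one pair would be
# drawn from the chosen pool with equal probability.
pools = [pool
         for size in range(1, len(pairs) + 1)
         for pool in combinations(pairs, size)]

# Pools with two elements, one of which is the normal state (1, 1).
two_with_normal = [pool for pool in pools if len(pool) == 2 and (1, 1) in pool]
```

Under this reading there are three two-element pools containing (1, 1), which matches the number of cases (i, j and k) discussed next.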
Then, we focus on cases i, j and k. They all have two marked elements, one of which is (1, 1). Actually, (1, 1) is the normal state of neural networks: with α = β = 1, Eqn. (6) reduces to the original network. Hence, considering their difference means considering the difference among (−1, 0), (−1, 1) and (1, 0). What does (α, β) = (1, 0) do? Let us consider the case of β = 0. In such a case, the network parameters of the layers selected to be perturbed (i.e., the layers with b_l = 0) are not updated. However, in the forward pass, their succeeding layers are also perturbed, and in the backward pass, the network parameters of those succeeding layers are updated reflecting the perturbation.
What does (α, β) = (−1, 1) do? Let us consider the effect of α = −1. In the perturbed layers, the calculation of the forward pass is perturbed by α = −1. Then, the effect of the perturbation is propagated to the succeeding layers. Hence, not only the perturbed layers but also their succeeding layers are perturbed. In the backward pass, when α is negative, the network parameters of the perturbed layers can be updated in the direction opposite to the normal one. Because of this, the network parameters of the perturbed layers are strongly affected by negative α.
What does (α, β) = (−1, 0) do? A combined effect of (1, 0) and (−1, 1) would occur. First, the calculation of the forward pass is perturbed. Its effect is propagated to the succeeding layers. However, in the backward pass, only the network parameters of layers other than the perturbed ones are updated. Those of the perturbed layers are not updated. This can avoid the dangerous updates by a large ratio mentioned in Sec. 3.2.3 because β/α becomes 0.
In addition, we observed that p_L matters to the error rates. As mentioned above, (1, 1) is the normal state. Hence, the group of cases i, j and k and the group of cases l, m and n differ only in p_L: since the pools of the former group have two elements, one of which is (1, 1), while those of the latter group have a single element, p_L of the former group works as p_L/2. Comparing the error rates of the two groups shows that p_L greatly affects the error rates.

Comparison on CIFAR-100
The proposed ShakeDrop is compared with the vanilla network (without regularization), RandomDrop and Shake-Shake in different network architectures including ResNet, Wide ResNet, ResNeXt and PyramidNet.
Table 5 shows the experimental results on the CIFAR-100 dataset [15]. In the table, method names are followed by the components of their building blocks. The table shows that the proposed ShakeDrop can be applied not only to PyramidNet but also to ResNet, Wide ResNet and ResNeXt. However, to apply it successfully, we found that the residual branches have to end with BN. In ResNet and ResNeXt, in addition to the original form, EraseReLU versions (ones without the ReLU unit at the end of the building blocks) were examined. In Wide ResNet, versions with BN added at the end of the residual branches were examined so that the residual branches end with BN. In ResNeXt, we examined two ways, referred to as "Type A" and "Type B," to apply RandomDrop and ShakeDrop. Type A applies a single regularization unit to the sum of the residual branches, whereas Type B applies an individual regularization unit to each residual branch before the addition. On the forward pass of the training phase, Type A is given as

$$G(x) = x + D(F_1(x) + F_2(x)),$$

where D(·) is the perturbation of RandomDrop or ShakeDrop. Type B is given as

$$G(x) = x + D_1(F_1(x)) + D_2(F_2(x)),$$

where D_1(·) and D_2(·) are individual perturbations of RandomDrop or ShakeDrop. As far as we examined, Type B was almost always better than Type A. We also confirmed that in most cases, ShakeDrop outperformed Shake-Shake in ResNeXt.
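The difference between the two types can be sketched on scalars. The sketch reads Type A as one perturbation D applied to the summed residual branches and Type B as individual perturbations D_1 and D_2 applied to each branch; the helper `scale_by` is a hypothetical stand-in for a RandomDrop or ShakeDrop unit with a fixed coefficient.

```python
def type_a(x, f1, f2, d):
    """One perturbation applied to the sum of the residual branches."""
    return x + d(f1 + f2)


def type_b(x, f1, f2, d1, d2):
    """An individual perturbation applied to each residual branch."""
    return x + d1(f1) + d2(f2)


def scale_by(coefficient):
    """Toy perturbation: multiply the branch output by a fixed coefficient."""
    return lambda value: coefficient * value


# Illustrative values only.
out_a = type_a(1.0, 2.0, 4.0, scale_by(0.5))
out_b = type_b(1.0, 2.0, 4.0, scale_by(0.5), scale_by(-1.0))
```

In Type B the two branches can receive different coefficients in the same iteration, which Type A cannot express.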

Comparison on ImageNet
We also conducted experiments on the ImageNet dataset [17] using ResNet, ResNeXt and PyramidNet. All of the network architectures used in these experiments had 50 layers (please refer to Sec. A for detailed experimental conditions). We used small p_L = 0.025, 0.05. We recommend a smaller p_L for smaller network architectures. The same observation was obtained in the experimental study about the relationship between p_L of RandomDrop and generalization performance [13]. Note that while [13] reports that RandomDrop with p_L = 0.5 did not improve the accuracy on a very complex and large dataset such as ImageNet, we found that RandomDrop with a small p_L such as p_L = 0.025, 0.05 improved the generalization performance.
Table B2 shows the experimental results. [7] reports the standard deviation of the error on the ImageNet dataset, which was 0.12%. Taking this into account, ShakeDrop clearly outperformed RandomDrop on ResNet and ResNeXt, and ShakeDrop was comparable to RandomDrop on PyramidNet. Comparing the original and EraseReLU architectures, the original was better in all cases.

Conclusion
We proposed a new stochastic regularization method ShakeDrop which, in principle, can be successfully applied to ResNet family as long as the building blocks end with batch normalization.Through the experiments on CIFAR-100 and ImageNet datasets, we confirmed that in most cases, ShakeDrop outperformed existing regularization methods of the same category, which are Shake-Shake and RandomDrop.

A. Experimental Conditions
All networks were trained using back-propagation by stochastic gradient descent with Nesterov accelerated gradient [19] and the momentum method [20]. Four GPUs (on CIFAR-10/100) and eight GPUs (on ImageNet) were used for learning acceleration; due to parallel processing, different observations of b_l, α and β are obtained on each GPU. For example, it can happen that on one GPU the l-th layer is perturbed, while the layer is not perturbed on the other GPUs (l is an arbitrary number). In addition, even if the layer is perturbed on multiple GPUs, different observations of α and β are used depending on the GPU. Experimental conditions depending on the datasets are described below.

CIFAR-10/100 dataset
Implementations used in the experiments on the CIFAR-10/100 datasets [15] were based on the publicly available code of fb.resnet.torch. Input images of the CIFAR-10/100 datasets were processed in the following manner. An original image of 32 × 32 pixels was color-normalized and then horizontally flipped with a 50% probability. Then, it was zero-padded to 40 × 40 pixels and randomly cropped to an image of 32 × 32 pixels.
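The flip, zero-padding and random-crop steps can be sketched in plain Python as follows. This is an illustrative sketch for a single-channel image stored as a list of rows; color normalization is omitted, and the function name and signature are hypothetical rather than taken from fb.resnet.torch.

```python
import random


def pad_crop_flip(image, rng=random, pad=4, crop=32):
    """Flip horizontally with probability 0.5, zero-pad by `pad` pixels on
    every side, then take a random `crop` x `crop` window."""
    if rng.random() < 0.5:
        image = [row[::-1] for row in image]
    padded_width = len(image[0]) + 2 * pad
    zero_row = [0.0] * padded_width
    padded = ([zero_row[:] for _ in range(pad)]
              + [[0.0] * pad + list(row) + [0.0] * pad for row in image]
              + [zero_row[:] for _ in range(pad)])
    # randint is inclusive on both ends, so offsets range over 0..(size - crop).
    top = rng.randint(0, len(padded) - crop)
    left = rng.randint(0, padded_width - crop)
    return [row[left:left + crop] for row in padded[top:top + crop]]


# A 32 x 32 image of ones is padded to 40 x 40 and cropped back to 32 x 32.
augmented = pad_crop_flip([[1.0] * 32 for _ in range(32)], rng=random.Random(0))
```

Because the crop window can start anywhere within the padded image, the result is the original image shifted by up to four pixels in each direction, with zeros filling the exposed border.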
On PyramidNet, the initial learning rate was set to 0.5 on both CIFAR-10/100 datasets, following version 2 of the PyramidNet paper [9]; from version 3 onward, that paper uses 0.1 on the CIFAR-10 dataset and 0.5 on the CIFAR-100 dataset.
For networks other than PyramidNet, the initial learning rate was set to 0.1. The learning rate was decayed by a factor of 0.1 at 150 epochs and at 225 epochs of the entire learning process (300 epochs). As the initializer of the filter parameters, "MSRA" [10] was used. In addition, a weight decay of 0.0001, a momentum of 0.9, and a batch size of 128 were used on 4 GPUs. The linear decay parameter p_L = 0.5 was used following the RandomDrop paper [12]. ShakeDrop used the parameters α = 0, β ∈ [0, 1] (Original) and α ∈ [−1, 1], β ∈ [0, 1] (EraseReLU on ResNet and ResNeXt, Wide ResNet with batch normalization (BN), and PyramidNet) with the pixel-level update rule.
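The multi-step schedule described above can be written as a small function. This is a sketch with a hypothetical function name; it assumes 0-indexed epochs and that the decay takes effect at the start of epochs 150 and 225.

```python
def multistep_lr(epoch, base_lr=0.1, milestones=(150, 225), gamma=0.1):
    """Learning rate at a given (0-indexed) epoch: multiplied by `gamma`
    once for every milestone that has been reached."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed


# Learning rates at the start, after the first decay and after the second decay.
schedule = [multistep_lr(e) for e in (0, 150, 225)]
```

For PyramidNet, the same function would be called with `base_lr=0.5`.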

ImageNet dataset
Implementations used in the experiments on the ImageNet dataset [1] were based on the publicly available code of Chainer. Input images of ImageNet were processed in the following manner. An original image was distorted with a random aspect ratio [22] and randomly cropped to an image of 224 × 224 pixels. After that, the image was horizontally flipped with a 50% probability, and standard color noise [17] was added. The initial learning rate was set to 0.1.

B. Experiments on Longer Training
Shake-Shake introduced longer training, which can be applied to many methods in the learning process, as described in Sec. 2. While most methods related to ResNet use 300-epoch scheduling for learning on the CIFAR-10/100 datasets, Shake-Shake uses 1800-epoch cosine annealing, in which the initial learning rate is annealed using a cosine function without restart [4]. As the longer training may contribute to improving accuracy, we also applied longer training to the proposed method.
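Cosine annealing without restart can be sketched as below. The sketch uses the common half-cosine form and assumes a final learning rate of 0 and an initial learning rate of 0.1, neither of which is stated in the text; the function name is hypothetical.

```python
import math


def cosine_annealed_lr(epoch, total_epochs=1800, base_lr=0.1, min_lr=0.0):
    """Cosine annealing without restart: decays from base_lr at epoch 0
    to min_lr at epoch `total_epochs`."""
    progress = epoch / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Learning rate at the start, the midpoint and the end of training.
start, middle, end = (cosine_annealed_lr(e) for e in (0, 900, 1800))
```

Unlike the multi-step schedule, the learning rate changes at every epoch and reaches half of its initial value at the midpoint of training.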

B.1. CIFAR-10 and CIFAR-100
ShakeDrop was compared with Shake-Shake on the CIFAR-10/100 datasets under longer training conditions. The ResNeXt experimental conditions follow [4]. The ResNet, ResNet (EraseReLU version), and PyramidNet experimental conditions are the same as the CIFAR-100 conditions shown in Sec. A. We used 300-epoch multi-step learning as shown in Sec. A, and 1800-epoch cosine annealing as longer training [4].
Table B1 shows the error rates. "ResNeXt" and "ResNeXt + Shake-Shake" combined with the longer training improved the error rates over the 300-epoch multi-step learning. In addition, "ResNet + ShakeDrop", "ResNet (EraseReLU) + ShakeDrop", and "PyramidNet + ShakeDrop" combined with the longer training improved the error rates. Hence, the longer training is an important factor in the Shake-Shake experiments, and the longer training is also effective for the ShakeDrop regularization.

B.2. ImageNet
ShakeDrop was compared with the vanilla network (without regularization) and RandomDrop on the ImageNet classification dataset under longer training conditions. The experimental conditions follow Sec. A. We used 180-epoch multi-step learning as longer training. The learning rate was decayed by a factor of 0.1 at 60, 120 and 160 epochs of the entire learning process (180 epochs).
Table B2 shows the error rates. Almost all networks improved their error rates with the longer training, and ShakeDrop tended to outperform the others on the longer training.

Figure 1. Regularization methods for the ResNet family. Conv represents a convolution layer, E[x] the expected value of x, and α, β and b_l random coefficients which are changed on every iteration. (a) and (b) are existing methods. (c) is Single-branch Shake, an intermediate regularization method on the way to the proposed method. (d) is the proposed method.
• PyramidNet achieved the lowest error rates among the three network architectures. Only cases N and O outperformed case B. Among them, case O was the best.
• ResNet had a different tendency from PyramidNet. That is, case O, which was the best on PyramidNet, resulted in the chance rate (99.00%). Only case G outperformed case B.
• ResNet (EraseReLU version) had the characteristics of both PyramidNet and ResNet. That is, both cases O and G outperformed case B, and case G was the best.

Table 5. Top-1 errors (%) on the CIFAR-100 dataset. This table shows the results of the original networks (left) and modified networks (right). The modified networks mean the "EraseReLU"ed versions in (a) and (c) and the ones in which BN was inserted at the end of the residual branches in (b). In ShakeDrop, α = 0, β ∈ [0, 1] were used in the original networks and α ∈ [−1, 1], β ∈ [0, 1] in the modified networks. In both cases, p_L = 0.5 and the pixel-level update rule were used. "×" means learning did not converge. * indicates that the result is quoted from the literature. + indicates the average result of 4 runs.

Table 6. Top-1 errors (%) on ImageNet at the 90th epoch. This table shows the results of the original networks (left) and modified networks in which BN is at the end of the residual block (right). In ShakeDrop, α = 0, β ∈ [0, 1] were used in the original networks and α ∈ [−1, 1], β ∈ [0, 1] in the modified networks. In both cases, p_L = 0.5 and the pixel-level update rule were used.