PACL: Piecewise Arc Cotangent Decay Learning Rate for Deep Neural Network Training

Deep neural networks (DNNs) are currently the best-performing method for many classification problems. For training DNNs, the learning rate is the most important hyper-parameter, and its choice greatly affects the performance of the model. In recent years, several learning rate schedulers, such as HTD, CLR, and SGDR, have been proposed. Some of these methods use a cycling mechanism to improve the convergence speed and accuracy of DNNs but suffer performance degradation during convergence; others achieve good accuracy but converge too slowly. This paper proposes a new learning rate schedule, called piecewise arc cotangent decay learning rate (PACL), which not only improves the convergence speed and accuracy of DNNs but also significantly reduces the performance degradation zone caused by the cycling mechanism. It is easy to implement and incurs almost no extra computing expense. Finally, we demonstrate the effectiveness of PACL by training ResNet, DenseNet, WRN, SEResNet, and MobileNet on CIFAR-10, CIFAR-100, and Tiny ImageNet.


I. INTRODUCTION
Deep learning is an active field of machine learning. Its purpose is to build a special kind of deep neural network (DNN) [1]. DNNs have demonstrated good performance in classification tasks [2]. However, their performance is greatly affected by the choice of learning rate [3]. At present, deep learning uses gradient descent methods [4] to optimize model parameters. Though many adaptive optimization algorithms have been proposed in recent years [5]-[8], they are, in essence, improvements of the gradient descent method [9].
The learning rate [3], the step size of the gradient descent method in a search process [10], is an important hyper-parameter in the training of deep learning models [11]. If the learning rate is set too small, the convergence rate will be very slow and the model may fall into a local minimum. If the learning rate is set too large, the model may oscillate between output results [12]. As such, the final result of the model is greatly influenced by the learning rate [13].
Piecewise decay method may help to get an ideal result in theory, but the process of tuning the learning rate is tedious and time-consuming [14]. Although adaptive methods can help to adjust the learning rate of each iteration by itself, the final result is usually worse than piecewise decay [15].
Many different learning rate schedulers have been proposed in the past [16]-[18]. In particular, the cyclical learning rate (CLR) [16] and stochastic gradient descent with warm restarts (SGDR) [17] demonstrated that, compared with monotonically decreasing the learning rate, letting the learning rate cyclically change between reasonable boundaries can achieve better results.
In this paper, we design a new learning rate scheduler, called piecewise arc cotangent decay learning rate (PACL), which resets the learning rate and applies piecewise decay in each cycle. Compared with traditional learning rate schedules, such as exponential and piecewise decay, PACL can greatly improve the convergence speed of networks. Compared with SGDR and CLR, PACL has a larger proportion of small learning rates, so better accuracy and a more stable system can be achieved. In addition, it requires almost no extra computing expense. The contributions of this paper are:
1. A new learning rate scheduler is proposed, which can serve as an alternative to existing schemes. The scheduler has a warm restart feature, re-initializing the learning rate every few epochs or iterations. It decays the learning rate with a piecewise arc cotangent function, keeps the proportion of large learning rates small, and decays the learning rate rapidly in each cycle.
2. Some learning rate schedulers with cycling mechanisms have a large performance degradation zone in the convergence process. PACL significantly reduces the performance degradation area caused by the cycling mechanism; in each cycle, the degradation area of PACL occupies only one-third of the cycle.
3. PACL improves the convergence speed of the network and its convergence capability in the training process. DNNs trained with PACL have a faster convergence rate and higher classification accuracy. In addition, compared with adaptive algorithms, PACL is easy to implement and incurs almost no extra computing expense.
The structure of the paper is as follows. Section II reviews some optimizers and learning rate schedulers proposed in the past. Section III describes the proposed PACL scheduler. Section IV shows the experiment results of PACL against other learning rate schedulers on different networks and datasets. Section V concludes the contributions of this paper and discusses some possible future works.

II. RELATED WORKS AND MOTIVATIONS
In this section, we review some optimizers like stochastic gradient descent (SGD) [19], and SGD with momentum [20]. Then we review some common learning rate schedulers proposed in recent years, such as the stochastic gradient descent with warm restarts (SGDR) [17], and cyclical learning rate (CLR) [16].

A. OPTIMIZERS
Training a DNN is usually considered a non-convex optimization problem [14], in which a loss function is first defined and then minimized by an optimization algorithm.
Gradient descent, originally proposed by Cauchy in 1847 [21], [22], is an iterative optimization algorithm for finding the minimum of a function. To find such a minimum using gradient descent, one takes steps proportional to the negative of the gradient of the function at the current point. The excellent performance of deep learning is attributable to the gradient descent optimization algorithm.
Stochastic gradient descent (SGD) [19] is an extension of gradient descent. It originated from the stochastic approximation proposed by Robbins and Monro [4] in 1951 and was initially applied to pattern recognition [23] and neural networks [24]. In recent years, with the rapid rise of deep learning, SGD has become a mainstream and very effective method for solving machine learning optimization problems. The parameters θ of a deep neural network are updated by SGD as follows:

θ_{t+1} = θ_t − α_t ∇f_i(θ_t),   (1)

where α_t denotes the learning rate, which is used to adjust the amplitude of the parameter update, and f_i(·) is the loss function of the i-th sample with respect to θ_t. The updating process of SGD is simple and efficient, and the iteration cost is independent of the total sample size. But there is inevitably noise in real data, so it is difficult for SGD to approach the minimum in the best direction. The classical momentum algorithm (CM) [25] adds a momentum term to SGD: historical parameter changes are integrated to speed up the optimization process. Momentum is designed to accelerate DNN training, but CM has a problem: it keeps accumulating speed and may miss the optimal solution.
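The two updates above can be sketched in a few lines. The quadratic loss f(θ) = 0.5‖θ‖² used below is a stand-in example (its gradient is simply θ), not part of the paper's experiments.

```python
# Minimal sketch of the SGD and classical momentum (CM) updates described
# above, on a toy quadratic loss whose gradient is theta itself.
import numpy as np

def sgd_step(theta, grad, lr):
    """Plain SGD: theta_{t+1} = theta_t - lr * grad."""
    return theta - lr * grad

def momentum_step(theta, v, grad, lr, mu=0.9):
    """Classical momentum: accumulate a velocity term, then move along it."""
    v = mu * v - lr * grad
    return theta + v, v

theta = np.array([1.0, -2.0])
v = np.zeros_like(theta)
for _ in range(200):
    theta, v = momentum_step(theta, v, theta, lr=0.1)

# Both components end up close to the minimum at the origin.
print(np.allclose(theta, 0.0, atol=1e-2))
```

Note the CM drawback mentioned above is visible in this dynamic: the velocity term can overshoot the minimum before the iterates settle.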

B. LEARNING RATE SCHEDULERS
The learning rate is an important hyper-parameter in deep learning. Therefore, how to choose the learning rate has become the most important issue. Common learning rate schedules include time-based decay [26], piecewise decay [15], and exponential decay [15].
Piecewise decay drops the learning rate by a factor every few epochs; generally, the learning rate is reduced to half or one-tenth every 10 epochs. Another commonly used schedule is exponential decay, which updates the learning rate as follows [15]:

lr_t = lr_0 · e^{−kt},   (2)

where lr_0 denotes the initial learning rate, k is the decay rate, and t is the iteration number.

Adjusting the learning rate manually is an expensive process, and it is difficult to quickly find the best learning rate for the current model. Therefore, many adaptive methods have been proposed in recent years, such as Adagrad [5], Adadelta [6], RMSProp [7], and Adam [8]. Adagrad is a sub-gradient method that incorporates the gradient information of earlier iterations. The update rule for Adagrad is as follows:

θ_{t+1} = θ_t − (α / √(G_t + ε)) ⊙ g_t,   (3)

where α is a global learning rate shared by all dimensions, g_t is the gradient at iteration t, G_t is the sum of the squares of the past gradients over all parameters θ, and ε is a smoothing term that avoids division by zero (usually on the order of 10^{−8}). Adagrad removes the trouble of manually adjusting the learning rate, but its optimization efficiency in the later stage of training is very low, because the accumulated historical gradients make the learning rate too small. To solve this problem, Adadelta, an improved version of Adagrad, lets the gradient history decay exponentially over time to avoid the continuous reduction of the learning rate. In Adadelta, no default learning rate needs to be set, because the ratio of the running average of the previous time step to the current gradient is used. Adam, an efficient algorithm for gradient-based optimization of stochastic objective functions, combines the advantages of Adagrad and RMSProp and is suitable for large datasets and high-dimensional spaces.
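The per-parameter shrinkage in equation (3) can be illustrated directly; on the same toy quadratic loss as before (a stand-in, not the paper's setup), the effective step sizes shrink monotonically as squared gradients accumulate:

```python
# Sketch of the Adagrad update of equation (3): each parameter's effective
# step size shrinks as squared gradients accumulate in G_t.
import numpy as np

def adagrad_step(theta, G, grad, alpha=0.1, eps=1e-8):
    G = G + grad ** 2                            # accumulate squared gradients
    theta = theta - alpha * grad / np.sqrt(G + eps)
    return theta, G

theta = np.array([1.0, -1.0])
G = np.zeros_like(theta)
steps = []
for _ in range(3):
    grad = theta                                 # gradient of 0.5 * ||theta||^2
    new_theta, G = adagrad_step(theta, G, grad)
    steps.append(abs(theta[0] - new_theta[0]))   # size of the first-coord step
    theta = new_theta

# Effective step sizes shrink monotonically as history accumulates --
# exactly the late-stage slowdown criticized in the text.
print(steps[0] > steps[1] > steps[2])
```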
These adaptive algorithms have been successfully applied to various practical problems; Adam in particular has become one of the most popular algorithms for neural network training. But some studies have pointed out that the generalization capability of these adaptive algorithms is worse than that of SGD in many applications [27], [28]. After adaptive learning rates, SGDR [17] and CLR [16] were proposed, which have better generalization capability than adaptive algorithms. Besides, experiments with adaptive learning rates are computationally expensive, whereas CLR is not.
Stochastic gradient descent with warm restarts (SGDR) [17] improves the performance of SGD. SGDR uses a warm restart mechanism to re-initialize the learning rate every few epochs or iterations, and it decays the learning rate with cosine annealing for each batch. SGD with warm restarts requires 2 to 4 times fewer epochs than common learning rate schedules to achieve comparable or even better results.
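The cosine annealing with warm restarts described above can be sketched as follows; the boundary values lr_min, lr_max, and the cycle length T_fin are illustrative, not SGDR's recommended settings:

```python
# Sketch of SGDR-style cosine annealing with warm restarts: within each cycle
# of length T_fin, the learning rate falls from lr_max to lr_min along a
# cosine, then resets to lr_max at the start of the next cycle.
import math

def sgdr_lr(t, lr_min=0.001, lr_max=0.5, T_fin=5):
    t_cur = t % T_fin          # position within the current cycle
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t_cur / T_fin))

print(sgdr_lr(0))   # cycle start -> lr_max
print(sgdr_lr(5))   # warm restart -> back to lr_max
```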
The cyclical learning rate (CLR) [16] is similar to SGDR. Instead of monotonically decreasing the learning rate, CLR lets the learning rate cyclically change between reasonable boundaries. Allowing the learning rate to rise and fall during training has a temporary negative impact on the network, but it is beneficial overall.

C. MOTIVATIONS
Exponential and piecewise decay are widely used in the training of state-of-the-art DNN architectures. The idea of both is to set an initial value for the learning rate and let it decay according to some rule. The discrete change of the learning rate makes the change of learning performance discrete and sudden, which suggests that learning performance can be improved steadily by changing the learning rate continuously [18].
Intuitively, as training iterations increase, we should keep the learning rate decreasing to reach convergence. However, it may be more useful to use a learning rate that changes periodically within a given range, because a periodically high learning rate can make the model jump out of local minima and saddle points during training.
Dauphin et al. [29] pointed out that saddle points are more difficult to escape than local minima. If the saddle point lies at a delicate equilibrium, a small learning rate usually does not produce a gradient change large enough to skip over it. This is the advantage of a periodically high learning rate, which can make the model skip saddle points faster.
The results of SGDR and CLR demonstrated that, instead of monotonically decreasing the learning rate, letting the learning rate cyclically rise and fall during training improves classification accuracy and the rate of convergence.
Motivated by the aforementioned methods, which use piecewise decay and cyclical learning rates, we designed a new learning rate scheduler that implements a cycling mechanism and piecewise decay according to the arc cotangent function.

III. PIECEWISE ARC COTANGENT DECAY LEARNING RATE (PACL)
This section introduces a new scheduling method, named piecewise arc cotangent decay learning rate (PACL). Fig.1 shows the decay model of PACL, which controls the learning rate according to equation (4), where Lr_min and Lr_max are the bounds of the learning rate, T_fin represents the total number of epochs or iterations in a cycle, and T_i denotes how many epochs or iterations have been performed in the current cycle. Lr = Lr_max when T_i = 0, and Lr = Lr_min when T_i = T_fin. The arc cotangent function introduced in equation (4) makes the learning rate scheduler more effective: due to its characteristics, PACL has a larger proportion of small learning rates, so a more stable system can be achieved. Compared with conventional learning rate schedulers, such as exponential and piecewise decay, PACL has a periodic mechanism, which enables the model to skip saddle points faster during the later stages of training to achieve better performance. In addition, PACL makes the learning rate decrease rapidly within each period, which allows a large initial learning rate to be set to improve the convergence speed of the model in the early stage.

A. THE PROPOSED PACL
Compared with SGDR and CLR, PACL reduces the proportion of large learning rates and decays the learning rate rapidly in each cycle, which is more beneficial for optimizing the neural network. Moreover, setting a minimum learning rate keeps the learning rate away from zero, which is more helpful in the early training of the network, because when the learning rate is close to zero, noise dominates the update of DNN weights [18].
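A hedged sketch of a PACL-style schedule is given below. The paper's exact equation (4) is not reproduced here; this sketch uses a normalized arc cotangent curve chosen only to satisfy the boundary conditions stated in the text (Lr = Lr_max at T_i = 0 and Lr = Lr_min at T_i = T_fin), and the steepness constant k is an illustrative assumption:

```python
# Hedged PACL-style sketch: a normalized arc cotangent decay satisfying
# Lr(0) = lr_max and Lr(T_fin) = lr_min exactly. The constant k controls how
# quickly the curve drops early in the cycle; its value here is an assumption,
# not taken from the paper's equation (4).
import math

def arccot(x):
    return math.pi / 2 - math.atan(x)

def pacl_lr(T_i, lr_min=0.001, lr_max=0.5, T_fin=5, k=10.0):
    """Arc cotangent decay, normalized so the boundary conditions hold."""
    top = arccot(k * T_i / T_fin) - arccot(k)
    bottom = arccot(0.0) - arccot(k)
    return lr_min + (lr_max - lr_min) * top / bottom

# The curve drops quickly at first, then flattens near lr_min, so large
# learning rates occupy only a small fraction of each cycle.
for T_i in range(6):
    print(round(pacl_lr(T_i), 4))
```

With k = 10, the learning rate already falls to roughly a quarter of its range after one-fifth of the cycle, matching the text's claim of a small proportion of large learning rates.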

B. ESTIMATE MAXIMUM AND MINIMUM BOUNDARY
We use the "LR range test", first introduced by Smith [16], to estimate reasonable maximum and minimum learning rate boundaries. Fig.2 shows an example of running the LR range test on the CIFAR-10 dataset. We set the initial learning rate to a very low value, such as 10^{−5}, and the final learning rate to a high value, such as 1. Then we run the model for one epoch while letting the learning rate increase from the lowest to the highest value. As the learning rate increases, it eventually becomes too large, which causes the test loss to increase. A typical LR range test curve can be seen in Fig.2, where the test loss has a distinct trough and peak. Generally, Lr_max is set where the loss starts to rise, and Lr_min is set where the gradient of the loss is at its minimum.
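The sweep part of the procedure above can be sketched as follows; the exponential sweep shape and the 390-step epoch length (one CIFAR-10 epoch at batch size 128) are the only assumptions, and the model and loss recording are omitted:

```python
# Sketch of the learning-rate sweep in the "LR range test": exponentially
# increase the learning rate from a very low to a high value over one epoch,
# recording the loss at each step (loss recording omitted here).
import math

def range_test_lr(step, total_steps, lr_start=1e-5, lr_end=1.0):
    """Exponential sweep from lr_start to lr_end across total_steps."""
    ratio = step / max(1, total_steps - 1)
    return lr_start * (lr_end / lr_start) ** ratio

total = 390   # e.g. one CIFAR-10 epoch at mini-batch size 128
lrs = [range_test_lr(s, total) for s in range(total)]

print(math.isclose(lrs[0], 1e-5))   # starts at the low boundary
print(math.isclose(lrs[-1], 1.0))   # ends at the high boundary
```

One would then plot test loss against these learning rates and read off Lr_max and Lr_min from the trough and the rise, as described above.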

C. ESTIMATE FREQUENCY OF LEARNING RATE TRANSFORMATION
In the practical application of PACL, Div is introduced to represent the update frequency of the learning rate within each epoch. For example, the learning rate is updated twice in each epoch when Div = 2. The introduction of Div allows the learning rate to be updated piecewise or nearly continuously, which makes the change of the learning rate more flexible within the cycle.
We compare PACL with different values of T_fin and Div on the CIFAR-10 dataset. Fig.3 shows the learning rate initialized to Lr_max and decayed to Lr_min by PACL with different parameters in each cycle. Fig.4 shows the accuracy of PACL with different parameters on the CIFAR-10 dataset. The results for T_fin = 5 and Div = 1 show the best performance, so we use T_fin = 5 and Div = 1 in our later experiments.

IV. EXPERIMENTS AND ANALYSIS
In this section, we demonstrate the effectiveness of PACL for training different networks. In the subsections below, our algorithm (PACL) is used for training on the CIFAR-10, CIFAR-100, and Tiny ImageNet datasets, and PACL is compared with six other schedulers: exponential decay, piecewise decay, fixed learning rate, CLR, SGDR, and HTD.

A. EXPERIMENTAL PLATFORM
The experiments in Section IV-C were executed on a computer with the Windows 10 operating system, an Intel(R) Core(TM) i5-3470M CPU, a GeForce RTX 2080 GPU, and 32 GB RAM, programmed in Python. We used HUAWEI's ModelArts servers for all the remaining experiments. Each server contains one NVIDIA Tesla P100 GPU, an Intel E5-2690 V4 CPU, and 64 GB RAM. All models are run on the GPU using the PyTorch deep learning framework.

B. DATASET
The CIFAR-10 dataset [30] consists of 60000 color images of 32×32 pixels, divided into 10 categories with 6000 images each. There are 50000 training images and 10000 test images. The CIFAR-100 dataset [30] is just like CIFAR-10 but has 100 classes instead of 10, with 600 images per class: 500 training images and 100 test images.
Tiny ImageNet [31] is similar to ImageNet [32] but has only 200 categories. Each category has 500 images for training, 50 for validation, and 50 for testing. The images are 64×64 pixels.

C. EXPERIMENT ON CIFAR-10
We train ResNet-32 [33] with the seven schedulers mentioned earlier on the CIFAR-10 dataset. The networks are trained by SGD with a momentum of 0.9 and a mini-batch size of 128. We use L2 regularization with a coefficient of 0.001, since a small amount of L2 regularization helps prevent overfitting the training data.
In this part of the experiments, we do not use any image preprocessing. For exponential and piecewise decay, we use an initial learning rate of 0.1; the piecewise decay is reduced by a factor of 10 every ten epochs, while the exponential decay is multiplied by 0.9 after every epoch. For CLR and SGDR, we set Lr_max = 0.5, Lr_min = 0.001, and for HTD, we set Lr_max = 0.5, Lr_min = 0. For our algorithm (PACL), we set the initial parameters as Lr_max = 0.5, Lr_min = 0.001, T_fin = 5, Div = 1. All tests are trained for 100 epochs, with 390 iterations in each epoch.

Fig.5 and Fig.6 provide a comparison among exponential decay, piecewise decay, fixed learning rate, HTD, CLR, SGDR, and PACL on the CIFAR-10 dataset. As can be seen from Fig.5, although PACL (green curve) shows temporary falls in performance during training compared with other algorithms, it makes the DNN converge faster and reach a higher final accuracy. PACL not only reaches an accuracy of 85.87% after only 5,450 iterations, but its final accuracy of 88.96% is also significantly higher than that of the algorithms without a cycling mechanism. In addition, compared with CLR (orange curve) and SGDR (brown curve), PACL has less performance loss during training; its final accuracy is also slightly higher than that of CLR and SGDR, by 0.23% and 0.35%, respectively. We define performance degradation as accuracy falling below 90% of the highest accuracy in the stable period. The performance degradation area of PACL covers only one-third of each cycle, on average, in the stable period. This phenomenon can be seen clearly in Fig.5. The same results are also reflected in the test on the Tiny ImageNet dataset.
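The piecewise decay baseline in this comparison, as described above (initial learning rate 0.1, reduced by a factor of 10 every ten epochs over a 100-epoch run), can be sketched as:

```python
# Sketch of the piecewise decay baseline used in this comparison: lr0 = 0.1,
# divided by 10 once every ten epochs, over the 100-epoch schedule.
EPOCHS = 100

def piecewise_decay_lr(epoch, lr0=0.1, drop=10.0, every=10):
    """Divide lr0 by `drop` once every `every` epochs."""
    return lr0 / (drop ** (epoch // every))

schedule = [piecewise_decay_lr(e) for e in range(EPOCHS)]

# The learning rate is constant within each ten-epoch block, then drops.
print(schedule[0] == schedule[9])
print(schedule[10] == schedule[0] / 10.0)
```

The sudden drops in this baseline are exactly the "discrete and sudden" changes in learning performance that Section II-C argues a continuous schedule like PACL can avoid.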

D. EXPERIMENT ON DIFFERENT NETWORK
Shortcut (or short path) is a very effective structure in the development of the CNN model. Neural network models with shortcut structures, such as ResNet [33], WRN [34], and DenseNet [35], have excellent performance in computer vision tasks. In addition, SENet [36] and MobileNet [37] also have good performance in image recognition due to their unique structure. In this part, we provide comparisons between CLR, SGDR, HTD, and PACL based on the network mentioned earlier.
For image preprocessing, we normalize the input data using the channel means and standard deviations. For data augmentation, we pad the images by 4 pixels on each side and then perform random cropping and horizontal flipping. In the result tables, the third column gives the learning rate update method, and the other two columns show the average accuracy from three runs.
For ResNet, the original test accuracy of 93.57% on CIFAR-10 can be improved to 94.73%, with an accuracy of 75.84% on CIFAR-100. WRN trained with PACL achieves 96.24% and 81.11% on CIFAR-10 and CIFAR-100, respectively. Performance improvements are also seen for SEResNet and MobileNet. PACL outperforms the current leading methods except on DenseNet-BC-100-12, where its accuracy is similar to that of HTD: 95.49% and 77.80%, respectively.

E. EXPERIMENT ON TINY IMAGENET
In this part, we provide comparisons between piecewise decay, CLR, HTD, and PACL on the Tiny ImageNet dataset.
We trained ResNet-50 and MobileNet V2 on the Tiny ImageNet dataset using settings similar to the experiments on the CIFAR datasets: SGD with a momentum of 0.9, an L2 regularization coefficient of 0.001, and a mini-batch size of 128. For piecewise decay, we used an initial learning rate of 0.1, decayed by a factor of 0.1 at 80 and 150 epochs. For PACL and CLR, the learning rate followed TABLE 3, with hyper-parameters T_fin = 10 and Div = 1. We set Lr_max = 0.1, Lr_min = 0 for HTD. TABLE 4 compares the accuracy obtained when training the networks with PACL and the other methods. For ResNet, the top-1 accuracy of PACL is slightly higher than that of the other methods, while for MobileNet it is markedly higher. Fig.7 compares the results of running piecewise decay, CLR, HTD, and PACL for ResNet and MobileNet V2. As can be seen from Fig.7(a) and Fig.7(b), the convergence rate and final accuracy of PACL (blue curve) are better than those of the other algorithms.
Especially, compared with CLR (orange curve), PACL greatly reduces the performance degradation zone in each cycle. The performance degradation area of PACL in each cycle is only one-third of the cycle.

V. CONCLUSION
In this paper, we propose a new scheduling method, named piecewise arc cotangent decay learning rate (PACL), to improve the performance of DNNs. PACL combines the advantages of piecewise decay and CLR: it adopts the cyclic learning rate mechanism and decays the learning rate piecewise within each cycle. Compared with other learning rate schedulers with cyclic mechanisms, PACL significantly reduces the performance degradation zone caused by the cycling mechanism. Besides, setting a minimum learning rate keeps the learning rate away from zero, which improves the effectiveness of cyclic learning rate schedulers in the early stage of network training. Training DNNs with PACL improves not only the accuracy of the network but also its convergence speed. Finally, we demonstrate the effectiveness of PACL on CIFAR-10, CIFAR-100, and Tiny ImageNet, training with ResNet, DenseNet, WRN, SEResNet, and MobileNet. Future work should consider the application of PACL in popular adaptive optimization algorithms such as Adam.
JIHONG LIU received the Ph.D. degree in pattern recognition and intelligent system from Northeastern University, China, in 2003. She is currently an Associate Professor with the School of Information Science and Engineering, Northeastern University. Her research interests include computational cardiology, intelligent information processing, and biomedical signal acquisition.
HONGWEI SUN received the B.S. degree in electrical information engineering from Shijiazhuang Tiedao University, China, in 2014. He is currently pursuing the M.S. degree with the College of Information Science and Engineering, Northeastern University, China. His research interests include deep learning, natural language processing, and signal processing of sEMG.

He is a Professor of biological physics. He has published more than 400 scientific articles, among them over 200 articles in prestigious peer-reviewed journals in his field. Related works have attracted wide public interest and have been covered by many prestigious media outlets, such as the BBC. He has been elected as a Fellow of world-renowned societies in recognition of his distinctions.