Adadb: Adaptive Diff-Batch Optimization Technique for Gradient Descent

Gradient descent is the workhorse of deep neural networks, but it suffers from slow convergence. The most common remedy is momentum, which effectively increases the learning factor of gradient descent. Recently, many approaches have been proposed to control the momentum for better optimization towards global minima, such as Adam, diffGrad, and AdaBelief. Adam reduces the momentum by dividing it by the square root of the moving average of squared past gradients, i.e., the second moment. A sudden decrease in the second moment often causes the gradient to overshoot a minimum and then settle at the closest minimum. DiffGrad mitigates this problem by adding to Adam a friction constant based on the difference between the current gradient and the immediately preceding gradient. This friction constant, however, further decreases the momentum and results in slow convergence. AdaBelief adapts the step size according to the belief in the current gradient direction. Another well-known way of achieving fast convergence is to increase the batch size adaptively. This paper proposes a new optimization technique, named adaptive diff-batch or adadb, that removes the problem of overshooting in Adam and of slow convergence in diffGrad, and combines these methods with an adaptive batch size to further increase the convergence rate. The proposed technique uses a friction constant based on the past three gradient differences, rather than one as in diffGrad, together with a condition that decides when the friction constant is applied. The proposed technique outperforms the Adam, diffGrad, and AdaBelief optimizers on synthetic complex non-convex functions and real-world datasets.


I. INTRODUCTION
In recent times, neural network-based algorithms are gaining popularity due to the availability of big data and large computing power in the form of GPUs. With the help of these two factors, neural networks have achieved high accuracy in solving problems in various fields, such as computer vision, signal processing, human activity recognition, and natural language processing.
Despite recent popularity and achievements, neural networks still depend on a gradient descent algorithm that was developed in 1847 [1]. All the variants of gradient descent (batch, mini-batch, and stochastic) inherit the disadvantage of slow convergence towards the global minimum. Recently, many attempts have been made to optimize the convergence of gradient descent to get the true benefits of big data and large computing power with neural networks.
The most famous way of increasing the convergence rate of gradient descent is the use of momentum. It is reported that momentum effectively increases the convergence rate by a factor of 2 [2]. However, fast convergence often forces gradient descent to settle at a local minimum instead of the global minimum.
There are two renowned methods of controlling the convergence rate: reducing the momentum and increasing the batch size. The Adam [3], diffGrad [4], and AdaBelief [5] optimization techniques reduce the momentum, whereas the adabatch technique [6] increases the batch size for better and faster optimization towards the global minimum. The Adam and AdaBelief optimizers often overshoot the global minimum, whereas diffGrad suffers from slow convergence. Similarly, adabatch depends on the convergence technique used alongside it. In this paper, we use both methods: control of the convergence rate and an increase in batch size. We remove the problem of slow convergence in diffGrad by replacing the friction constant with one based on the sum of the past three gradient differences rather than a single one, and we combine the approach with adaptive batch size increments. Combining an adaptive batch size with a convergence technique is based on our previous work, which has shown success in improving the convergence rate [7]. The list of contributions of this paper is as follows:
• A modified compact friction constant is defined that prevents the solution from overshooting the global minimum
• A conditional constraint is defined to select an appropriate friction constant that can increase or decrease the momentum rather than only decreasing it
• A batch-size increment technique is proposed to increase the convergence of the optimizer
The rest of the paper is organized as follows: Section 2 describes the basics of gradient descent. Related work is presented in Section 3. The proposed technique is explained in Section 4 and its convergence analysis is covered in Section 5. Section 6 covers the empirical analysis and the experimental setup is explained in Section 7. Results and discussion are covered in Section 8. The paper is concluded in Section 9.

II. BASICS OF GRADIENT DESCENT
Gradient descent is the most basic approach for optimizing the parameters of a neural network. Initially, random values are assigned to the parameters, which are used with the input data to calculate the predicted values. A loss function defined for a specific problem is used to calculate the difference between the predicted output values and the original ones. Then, the gradient with respect to each parameter is computed and used to update that parameter. The updated parameters are used to calculate new predicted values. The procedure is repeated until convergence or for a certain number of epochs. Stochastic gradient descent applies the above procedure to each data sample. The direction of the gradient oscillates between iterative steps and therefore slows the convergence. Batch gradient descent uses the average over all data samples at once and performs only one parameter update in each epoch. Averaging over the whole batch reduces the oscillatory behavior but requires a large number of epochs for optimization. In practice, the mini-batch gradient descent algorithm is used, which reduces the path oscillation and creates less overhead compared to the batch algorithm. However, mini-batch gradient descent still inherits the problem of slow convergence due to the different behavior of the large number of parameters used in neural networks.
In gradient descent, all parameters of the model are updated with the same learning rate $\alpha_i$ in the $i$th iteration, such as:
$\theta_{i+1,j} = \theta_{i,j} - \alpha_i \, g_{i,j}$  (1)
where $\theta_{i+1,j}$ and $\theta_{i,j}$ are the updated and previous values of the $j$th parameter, with $j = 1, 2, 3, \ldots, J$. Here, $J$ represents the total number of parameters. The $g_{i,j}$ is the gradient of the loss function $L$ with respect to the parameter $\theta_{i,j}$. The mathematical representation of the gradient $g_{i,j}$ is
$g_{i,j} = \dfrac{\partial L_{i,\theta}}{\partial \theta_{i,j}}$  (2)
In Eqn 2, $L_{i,\theta}$ is the loss function of the parameters $\theta$ in the $i$th iteration. For image processing applications, this loss function is the cross-entropy loss that is defined as:
$L_{i,\theta} = \dfrac{1}{K_B}\sum_{K=1}^{K_B} L_{i,\theta,K} + \sigma R_{i,\theta}$  (3)
where $K_B$ is the number of images in the batch $B$, $L_{i,\theta,K}$ is the cross-entropy data loss for the $K$th training image in the $i$th iteration, $R_{i,\theta}$ is the regularization loss for the $i$th iteration, and the hyper-parameter of the regularization loss is denoted by $\sigma$. The $L_{i,\theta,K}$ for the $K$th training sample is computed as:
$L_{i,\theta,K} = -\log\!\left(\dfrac{e^{S_{o_K}}}{\sum_{n=1}^{N_c} e^{S_n}}\right)$  (4)
In Eqn 4, the number of classes in the dataset is denoted by $N_c$, $o_K$ is the ground-truth class for the $K$th training image, and the $n$th class score computed for the $K$th training image is denoted by $S_n$. Moreover, the regularization loss function $R_{i,\theta}$ is computed as the sum of squares of the parameters (the standard $\ell_2$ penalty):
$R_{i,\theta} = \sum_{j=1}^{J} \theta_{i,j}^{2}$  (5)
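To make the update rule concrete, the following minimal Python sketch implements the plain parameter update of Eqn. 1 and the per-sample cross-entropy loss of Eqn. 4; the function names and the use of NumPy are illustrative choices, not part of the paper's implementation.

```python
import numpy as np

def cross_entropy_loss(scores, true_class):
    """Per-sample cross-entropy data loss (Eqn. 4): -log of the softmax score of the true class."""
    shifted = scores - scores.max()                  # subtract max for numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum()
    return -np.log(probs[true_class])

def gradient_descent_step(theta, grad, lr):
    """Plain gradient-descent update (Eqn. 1): theta_{i+1,j} = theta_{i,j} - alpha_i * g_{i,j}."""
    return theta - lr * grad

# Example: one update of a 3-parameter model with an illustrative gradient.
theta = np.zeros(3)
theta = gradient_descent_step(theta, grad=np.array([0.2, -0.1, 0.05]), lr=0.1)
```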

III. RELATED WORKS
SGD with momentum is an advancement over the basic gradient descent algorithm [8]. The concept of a moment in each dimension (parameter) is included in stochastic gradient descent to enhance the momentum of the parameters having a consistent gradient. The moment is computed as:
$m_{i,j} = \gamma \, m_{i-1,j} + \alpha_i \, g_{i,j}$  (6)
where $m_{i,j}$ is the gained moment at the $i$th iteration for the $j$th parameter $\theta_{i,j}$, with $m_{i,j} = 0$ for $i = 0$, and the moment is controlled by the hyper-parameter $\gamma$. Therefore, Eqn 1 is modified as:
$\theta_{i+1,j} = \theta_{i,j} - m_{i,j}$  (7)
In another technique, AdaGrad [9], the learning rate of gradient descent is normalized as:
$\theta_{i+1,j} = \theta_{i,j} - \dfrac{\alpha}{\sqrt{G_{i,j}} + \epsilon}\, g_{i,j}$  (8)
In Eqn 8, $\epsilon$ is a very small value close to zero (mostly selected as $10^{-8}$) added to avoid division by zero, and $G_{i,j}$ is the sum of squares of the gradients over the past steps for the $j$th parameter, computed as:
$G_{i,j} = \sum_{t=1}^{i} g_{t,j}^{2}$  (9)
where $g_{i,j}$ is computed as in Eqn 2. However, the issue with the denominator of Eqn 8 is that the sum of squares of the gradients may become very large over the steps due to the positive accumulation of the squares of the gradients. AdaDelta [10] and RMSProp [11] address this issue by reducing the impact of the accumulated squares of gradients using a decay rate parameter $\beta$. In RMSProp, Eqn 9 is modified as:
$G_{i,j} = \beta \, G_{i-1,j} + (1-\beta)\, g_{i,j}^{2}$  (10)
The value of $G_{i,j} = 0$ for $i = 1$. The most renowned gradient descent optimization technique is Adam [3]. In Adam, at each step, the learning rate is computed by using the 1st and 2nd order moments, also known as the mean and variance, respectively. Both moments are defined recursively using the gradient and the square of the gradient, respectively, such as:
$m_{i,j} = \beta_1 \, m_{i-1,j} + (1-\beta_1)\, g_{i,j}$  (11)
$v_{i,j} = \beta_2 \, v_{i-1,j} + (1-\beta_2)\, g_{i,j}^{2}$  (12)
where $\beta_1$ and $\beta_2$ are the decay rates of the mean and variance, respectively. The $m_{i-1,j}$ and $v_{i-1,j}$ are the mean and variance of the previous step, respectively, and are initialized with 0 for the 1st iteration. The observation noted here is that initially the 1st moment is small and the 2nd moment is very small, which leads to a very large step size. This issue is resolved by introducing a bias correction in both moments, such as:
$\hat{m}_{i,j} = \dfrac{m_{i,j}}{1-\beta_1^{\,i}}, \qquad \hat{v}_{i,j} = \dfrac{v_{i,j}}{1-\beta_2^{\,i}}$  (13)
In Eqn. 13, $\beta_1^{\,i}$ is $\beta_1$ raised to the power $i$, $\beta_2^{\,i}$ is $\beta_2$ raised to the power $i$, and $\hat{m}_{i,j}$ and $\hat{v}_{i,j}$ are the bias-corrected mean and variance, respectively. Good starting choices for the values of $\beta_1$ and $\beta_2$ are 0.9 and 0.999, respectively. The learning rate $\alpha$ is selected in the range $\alpha \in [10^{-4}, 10^{-2}]$. The modified parameter update equation (Eqn. 1) for Adam is defined as:
$\theta_{i+1,j} = \theta_{i,j} - \dfrac{\alpha \, \hat{m}_{i,j}}{\sqrt{\hat{v}_{i,j}} + \epsilon}$  (14)
However, a problem arises in Adam because the variance decreases fast. Therefore, the friction in the optimization landscape decreases with a small value of the 2nd moment, which can lead to a point where the updating process overshoots the optimized solution because of the very high learning rate, and the solution diverges. AMSGrad [12] tried to solve this problem. AMSGrad picks the largest value of the 2nd moment from the current and previous iterations. Basically, AMSGrad normalizes the learning rate $\alpha_i$ with the maximum value $\hat{v}^{\max}_{i,j}$ among all previous and current variances instead of $\hat{v}_{i,j}$. AMSGrad stores the previous maximum value of the 2nd moment and gives priority to the maximum value of the 2nd moment. The $\hat{v}^{\max}_{i,j}$ is computed as:
$\hat{v}^{\max}_{i,j} = \max\!\left(\hat{v}^{\max}_{i-1,j},\, \hat{v}_{i,j}\right)$  (15)
where $\hat{v}^{\max}_{i,j} = 0$ for $i = 1$. The modified parameter update equation (Eqn. 14) for AMSGrad is defined as:
$\theta_{i+1,j} = \theta_{i,j} - \dfrac{\alpha \, \hat{m}_{i,j}}{\sqrt{\hat{v}^{\max}_{i,j}} + \epsilon}$  (16)
Both Adam and AMSGrad face the problem of auto-adjustment of the learning rate. The actual issue is to control the friction of the 1st moment to avoid slipping past the optimum solution.
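For clarity, the following minimal NumPy sketch implements the Adam update of Eqns. 11–14; the variable names mirror the notation above and are illustrative only.

```python
import numpy as np

def adam_step(theta, grad, m, v, i, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (Eqns. 11-14) for a parameter vector theta at iteration i (1-based)."""
    m = beta1 * m + (1 - beta1) * grad                 # 1st moment (Eqn. 11)
    v = beta2 * v + (1 - beta2) * grad ** 2            # 2nd moment (Eqn. 12)
    m_hat = m / (1 - beta1 ** i)                       # bias correction (Eqn. 13)
    v_hat = v / (1 - beta2 ** i)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)  # update (Eqn. 14)
    return theta, m, v
```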
The AdaBelief [5] algorithm addresses the issue by taking the difference between the gradient and the 1st order moment, instead of the gradient itself, in Eqn. 12. The ratio $1/\sqrt{\hat{v}}$ then serves as a belief in the system (Eqn. 14) with the newly added difference. The step size increases with high belief if the gradient is close to the moment; otherwise the step size decreases. Alternatively, the diffGrad [4] algorithm addresses the issue of controlling the 1st moment present in Adam by adding a friction constant. DiffGrad is built on short-term gradient changes to regulate the learning rate dynamically. The base concept of diffGrad is that the parameter update must be small in regions of low gradient change and large in regions of high gradient change. DiffGrad computes the 1st and 2nd moments as in Adam (Eqn. 13). However, a new diffGrad Friction Coefficient (DFC) is added to the numerator of Eqn. 16 to control the learning rate based on short-term gradient behavior. The DFC is denoted by $\xi$ and defined as:
$\xi_{i,j} = AbsSig(\Delta g_{i,j})$  (17)
where $AbsSig$ is the non-linear sigmoid function ($Sig$) applied to the absolute value of its argument, which squashes every value between 0.5 and 1.
The mathematical expression is written as:
$AbsSig(\Delta g_{i,j}) = \dfrac{1}{1 + e^{-|\Delta g_{i,j}|}}$  (18)
The $\Delta g_{i,j}$ is the change in the gradient between the current and previous iterations, such as:
$\Delta g_{i,j} = g_{i-1,j} - g_{i,j}$  (19)
where $g_{i,j}$ is computed using Eqn. 2. The diffGrad paper reports that the DFC provides more friction when the gradient changes slowly and vice versa. The modified diffGrad parameter update equation (Eqn. 14) for the $j$th parameter in the $i$th iteration is defined as:
$\theta_{i+1,j} = \theta_{i,j} - \dfrac{\alpha \, \xi_{i,j} \, \hat{m}_{i,j}}{\sqrt{\hat{v}_{i,j}} + \epsilon}$  (20)
In diffGrad, the DFC is used to control the gradient oscillation/frequency of fluctuation near the optimum. The difference of gradients taken in Eqn. 19 reduces the learning rate by controlling the moving average near the optimal point. The diffGrad algorithm claims that Adam ignores the impact of the 1st moment in controlling the learning rate over the entire optimization landscape, and that the inclusion of the DFC provides a high learning rate in areas of large gradient change and reduces the learning rate in areas of low gradient change.
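As a concrete illustration, the following sketch computes the DFC of Eqns. 17–19 and applies it in the Adam-style update of Eqn. 20; it reuses the same state variables as the Adam sketch above and all names are illustrative.

```python
import numpy as np

def diffgrad_step(theta, grad, prev_grad, m, v, i, lr=1e-3,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    """One diffGrad update: Adam moments scaled by the friction coefficient (Eqns. 17-20)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** i)
    v_hat = v / (1 - beta2 ** i)
    delta_g = prev_grad - grad                          # gradient change (Eqn. 19)
    xi = 1.0 / (1.0 + np.exp(-np.abs(delta_g)))         # DFC in [0.5, 1] (Eqns. 17-18)
    theta = theta - lr * xi * m_hat / (np.sqrt(v_hat) + eps)  # update (Eqn. 20)
    return theta, m, v
```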

IV. PROPOSED METHODOLOGY
DiffGrad only takes advantage of the gradient difference between the current and the previous iteration. We observe that including the sum of the last three gradient differences, weighted by a decaying factor, provides extra friction to control the learning rate faster. Therefore, the change in the gradient $\Delta g_{i,j}$ written in Eqn. 19 is modified to the decayed sum of the last three gradient differences:
$\Delta g_{i,j} = (g_{i-1,j} - g_{i,j}) + \beta_3\,(g_{i-2,j} - g_{i-1,j}) + \beta_3^{2}\,(g_{i-3,j} - g_{i-2,j})$  (21)
where $\beta_3 = 0.999$ is the decaying factor. Moreover, for a faster updating process in both areas of large gradient change and areas of low gradient change, a condition is introduced to decide the DFC factor. The signum function is used to observe the sign change in the gradient, such as:
$sgn(g_{i,j}) = +1$ if $g_{i,j} > 0$, $\;0$ if $g_{i,j} = 0$, $\;-1$ if $g_{i,j} < 0$  (22)
The condition is defined as:
1) If the sign of the gradient changes between two consecutive iterations, then the DFC ($\xi_{i,j}$) of Eqn. 17 is computed using the $\Delta g_{i,j}$ of Eqn. 21, such as:
$\xi_{i,j} = AbsSig(\Delta g_{i,j})$  (23)
2) If the sign of the gradient remains the same (positive or negative) in two consecutive iterations, then a new DFC that slightly exceeds one is computed (Eqn. 24), using a percentage gradient difference $r_i = 0.1$ applied to the same $\Delta g_{i,j}$ computed by Eqn. 21. Eqn. 24 makes the parameter updating process faster in areas of large gradient change, since the DFC slightly increases the learning rate rather than always decreasing it. Therefore, the problem of slow convergence in diffGrad is resolved by the appropriate selection of the DFC.
Furthermore, researchers often ignore the choice of batch size for the optimization training process and select a static batch size for every iteration of training. A small batch size is required to produce convergence in fewer epochs. Conversely, large batch sizes produce data parallelism that can improve computational time and scalability. In diffGrad and Adam, the batch size is considered static during the training process. Therefore, to further improve the convergence time and accuracy, we introduce a trade-off between small and large batch sizes by periodically updating the batch size during training. Training starts with a small batch size to speed up convergence in the early epochs, and the batch size is then periodically increased to reduce the convergence time.
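The following sketch illustrates one way the adadb update could be realized. It assumes the three-difference sum of Eqn. 21 is weighted by powers of $\beta_3$ and that the "increase" branch of Eqn. 24 scales the DFC slightly above one via the factor $r_i$; both points are our reading of the description above rather than the authors' reference code, and all names are illustrative.

```python
import numpy as np

def adadb_step(theta, grads, m, v, i, lr=1e-3, beta1=0.9, beta2=0.999,
               beta3=0.999, r=0.1, eps=1e-8):
    """One adadb-style update. `grads` holds the last four gradients
    [g_{i-3}, g_{i-2}, g_{i-1}, g_i]; missing history can be zero-padded."""
    g3, g2, g1, g = grads
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** i)
    v_hat = v / (1 - beta2 ** i)
    # Decayed sum of the last three gradient differences (Eqn. 21, assumed weighting).
    delta_g = (g1 - g) + beta3 * (g2 - g1) + beta3 ** 2 * (g3 - g2)
    abs_sig = 1.0 / (1.0 + np.exp(-np.abs(delta_g)))
    # Condition on the gradient sign in two consecutive iterations (Eqn. 22).
    sign_changed = np.sign(g) != np.sign(g1)
    # Assumed form of the "increase" branch (Eqn. 24): DFC slightly above one.
    xi = np.where(sign_changed, abs_sig, 1.0 + r * abs_sig)
    theta = theta - lr * xi * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```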

V. CONVERGENCE ANALYSIS
The convergence analysis of Adam and diffGrad is carried out using the online learning framework proposed in [13]. We follow similar steps to observe the convergence of the proposed adadb. Let $f_1(\theta), f_2(\theta), \ldots, f_N(\theta)$ be an unknown sequence of convex cost functions. The target is to predict the optimal value of the parameter $\theta_i$ in the $i$th iteration to evaluate the function $f_i(\theta)$. In such a case, where the nature of the sequence is unknown, the regret bound is used to evaluate the optimization algorithm. The regret is defined as the sum of the differences between the past guesses $f_i(\theta_i)$ and the best fixed-point parameter $f_i(\theta^*)$ over the feasible set of all prior iterations [13]:
$R(N) = \sum_{i=1}^{N}\left[f_i(\theta_i) - f_i(\theta^*)\right]$  (25)
where $N$ is the number of iterations over which the regret is accumulated. The proof of convergence proceeds as in diffGrad and is comparable to known convex online learning methods. The definitions are as follows: $g_{i,j}$ represents the gradient of the $j$th parameter in the $i$th iteration, and $g_{1:i,j} = [g_{1,j}, g_{2,j}, \ldots, g_{i,j}] \in \mathbb{R}^{i}$ represents the gradient vector in the $j$th dimension over all iterations up to $i$. For the first sign condition on $sgn(g_{i,j})$ and $sgn(g_{i-1,j})$, and with the initialization $g_{0,j} = g_{1,j} = g_{2,j} = 0$, the proposed adadb satisfies, for all $N \geq 1$, a regret guarantee of the same form as that of diffGrad. For the second sign condition, the proposed adadb optimizer likewise satisfies the corresponding guarantee for all $N \geq 1$. It is observable that the sum of the terms over the dimension $J$ can be very small compared to its upper bound,
where $E_\infty$ is the upper bound over the exponential function with $E_\infty \geq \sum_{j=1}^{J} e^{-|g_{1,j}|}$. Therefore, considering these bounds and bounded gradients, i.e., $\|g_{i,\theta}\|_2 \leq G$ and $\|g_{i,\theta}\|_\infty \leq G_\infty$ for all $\theta \in \mathbb{R}^{J}$, and assuming that adadb generates a bounded distance between any pair of iterates, i.e., $\|\theta_n - \theta_m\|_2 \leq D$ and $\|\theta_n - \theta_m\|_\infty \leq D_\infty$ for any arbitrary $n, m \in \{1, 2, \ldots, N\}$, the proposed adadb optimization algorithm satisfies $\lim_{N \to \infty} R(N)/N = 0$, i.e., the average regret vanishes.

VI. EMPIRICAL ANALYSIS
We perform an empirical analysis to justify the importance of the proposed methodology. We compare the proposed methodology with the Adam, diffGrad, and AdaBelief optimization techniques over three different complex non-convex functions previously used in [4].
Here, $F_1$, $F_2$, and $F_3$ are the non-convex functions (defined as in [4]) and $x$ is the input to the functions with $-\infty < x < +\infty$. The function $F_1$ has one global and one local minimum (Figure 1), whereas the other two functions $F_2$ and $F_3$ have two global minima and one global maximum (Figure 2 and Figure 3). The following hyper-parameters are used in the experiments for adadb, diffGrad, AdaBelief, and Adam: decay rate for the 1st moment $\beta_1 = 0.95$, decay rate for the 2nd moment $\beta_2 = 0.999$, and learning rate $\alpha = 0.5$. The decay rate $\beta_3$ of adadb is set to 0.999. The parameter $\theta$ (i.e., $x$ in these equations) is initialized to $-1$ to demonstrate the effectiveness of the adadb optimizing technique over Adam, AdaBelief, and diffGrad. In these experiments, we have not used the batch size increment during the operation of the adadb technique. The results of the empirical analysis are presented in Figure 4, Figure 5, and Figure 6. Figure 4 demonstrates the slow momentum problem of diffGrad as compared to Adam, AdaBelief, and adadb. Adadb, AdaBelief, and Adam have faster momentum than diffGrad and therefore managed to move out of the local minimum present at the value 0.2 after touching it. Adadb converges to the global minimum earlier than both Adam and AdaBelief. The same scenario is presented in Figure 5, where diffGrad is stuck at the local minimum at value 0.0. At slow learning rates, diffGrad gets stuck at the nearest minimum, whereas Adam and adadb can leave the nearest minimum but do not get the time to come back if the next minimum is not the global one, as demonstrated by the authors in [4]. However, there is no guarantee in practical scenarios that the nearest minimum will always be the global one. AdaBelief fluctuates twice over the different minima before behaving like Adam and adadb.
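As an illustration of how such a 1-D empirical comparison can be run, the sketch below optimizes a placeholder scalar function with the stated hyper-parameters ($\beta_1 = 0.95$, $\beta_2 = 0.999$, $\alpha = 0.5$, $x$ initialized to $-1$). The function `toy_loss` is a hypothetical stand-in, not one of $F_1$–$F_3$ from [4], and `adadb_step` refers to the illustrative routine sketched in Section IV.

```python
import numpy as np

def toy_loss(x):
    """Hypothetical non-convex scalar function (stand-in for F1-F3)."""
    return x * np.sin(3.0 * x) + 0.1 * x ** 2

def numerical_grad(f, x, h=1e-5):
    """Central-difference gradient of a scalar function."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

# Trajectory of the parameter under the adadb-style update sketched earlier.
x = np.array([-1.0])
m = v = np.zeros_like(x)
grads = [np.zeros_like(x)] * 4          # gradient history g_{i-3} .. g_i, zero-padded
for i in range(1, 301):
    g = numerical_grad(toy_loss, x)
    grads = grads[1:] + [g]
    x, m, v = adadb_step(x, grads, m, v, i, lr=0.5, beta1=0.95, beta2=0.999)
```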
The problem with the fast convergence of Adam and AdaBelief is presented in Figure 6. After visiting all the minima, Adam quickly settles at one minimum (a local one in this case, instead of the global one), and AdaBelief settles between the global and local minimum values. Alternatively, adadb and diffGrad keep fluctuating between the global and local minima. After around 230 epochs, adadb settles near the global minimum.
The empirical analysis has demonstrated that adadb performs better than the slow diffGrad and manages to find the global minimum better than the fast Adam and the intermediate AdaBelief. In order to make adadb faster than Adam, we increase the batch size at different epochs, as discussed in the next section.

VII. EXPERIMENTAL SETUP
For image categorization experiments, we use the CIFAR10 and CIFAR100 datasets. The CIFAR10 dataset consists of 50K images for training and 10K images for testing. The CIFAR100 dataset is similar to the CIFAR10 dataset; however, it has 100 classes containing 600 images each, with 500 training images and 100 testing images per class. We use the optimization techniques (adadb, Adam, and diffGrad) in the PyTorch implementation of the ResNet18 architecture. The experiments were performed on Google Colab. The following hyper-parameters are used in the experiments for adadb, diffGrad, and Adam: decay rate for the 1st moment $\beta_1 = 0.9$, decay rate for the 2nd moment $\beta_2 = 0.999$, and learning rate $\alpha = 0.001$. The decay rate $\beta_3$ of adadb is set to 0.999. The maximum number of epochs is set to 100. The target is to find the technique with the highest accuracy in the first 100 epochs. The batch size is kept at 128 for all the techniques. Two different strategies are used for the comparison of techniques. In Strategy 1, the learning rate is kept constant throughout the 100 epochs. Strategy 2 is a first-search-then-optimize strategy: the first 80 epochs are executed at a learning rate of 0.001 (search) and the last 20 epochs are executed at 0.0001 (optimize). Each experiment was conducted 5 times and the results are presented as the best of the 5 trials.
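The sketch below shows how such a setup could be wired in PyTorch, using torchvision's CIFAR10 loader and a ResNet18, with torch.optim.Adam standing in for the optimizer under comparison (adadb is not a packaged optimizer) and the Strategy-2 learning-rate drop at epoch 80; paths and minor details are illustrative.

```python
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as T
from torchvision.models import resnet18

device = "cuda" if torch.cuda.is_available() else "cpu"

train_set = torchvision.datasets.CIFAR10(root="./data", train=True, download=True,
                                          transform=T.ToTensor())
train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

model = resnet18(num_classes=10).to(device)
criterion = nn.CrossEntropyLoss()
# Stand-in optimizer; the paper compares adadb, Adam, diffGrad, and AdaBelief.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))

for epoch in range(100):
    if epoch == 80:                 # Strategy 2: lower learning rate for the last 20 epochs
        for group in optimizer.param_groups:
            group["lr"] = 1e-4
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```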

VIII. RESULTS AND DISCUSSION
The results of the experiments are presented in Table 1 and Table 2. In these experiments, adadb is also used without the batch size increase and is denoted adadb/wo.
The results show that all the optimization techniques improve with Strategy 2. However, adadb remains the fastest among the compared techniques and provides the highest accuracy.
To make the optimization faster in the proposed technique, we increment the batch size by five percent after every 5 epochs. Adam and diffGrad use a static batch size; therefore, this procedure is not applied to these two methods. Table 3 and Table 4 present the results with different starting batch sizes.
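A minimal sketch of this schedule is shown below: the DataLoader is rebuilt with a batch size grown by five percent every 5 epochs; `train_set` and the per-epoch training loop body are assumed to be as in the setup sketch above.

```python
from torch.utils.data import DataLoader

batch_size = 32                     # small starting batch size
train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
for epoch in range(100):
    if epoch > 0 and epoch % 5 == 0:
        batch_size = int(round(batch_size * 1.05))   # grow by 5% every 5 epochs
        train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    # ... one training epoch over train_loader (as in the previous sketch) ...
```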
Adadb is an upgrade of our previous technique, adadiffgrad [7], which behaves exactly like diffGrad in the first 100 epochs and improves over diffGrad and Adam after around 250 epochs. Adadiffgrad is a combination of diffGrad with batch size adaptation. Adadiffgrad doubles the batch size and decreases the step size at the same time after the first 100 epochs, and repeats this step every 50 epochs. However, if this schedule is applied earlier, the method tries to settle at the nearest available minimum too quickly. For example, if the schedule is applied on the CIFAR10 dataset every 25 epochs with a starting batch size of 32, the highest accuracy achieved in 100 epochs is 85.39. Alternatively, if the schedule is applied every 50 epochs, the highest accuracy achieved in 100 epochs is 86.88. Therefore, adadiffgrad is not useful if we want high accuracy in the first 100 epochs. The proposed adadb uses the batch size increment along with an improved friction constant, resulting in faster convergence than adadiffgrad.
One important observation made in this experiment is that, with a larger starting batch size, the testing accuracy does not improve considerably. However, with a small starting batch size, the testing accuracy increases. Using a starting batch size of 32, adadb achieves better accuracy than Adam, AdaBelief, and diffGrad. The reason for the high accuracy with a small starting batch size is the high oscillation in the earlier epochs. This high oscillation helps the optimizing technique search a large number of minima, and later, as the batch size grows, the oscillation decreases and the technique settles at the best available minimum. Figure 7 and Figure 8 present this phenomenon. Figure 7 shows the training loss per epoch, whereas Figure 8 shows the testing accuracy per epoch for all the techniques on the CIFAR10 dataset. Adadb shows the highest oscillation in training loss between epoch 1 and epoch 20; however, this oscillation gradually decreases as the number of epochs increases. The high oscillation helps in finding the best minimum quickly, and adadb shows the best performance in testing accuracy.

IX. CONCLUSION
This paper presents a new optimization technique, named adaptive diff-batch or adadb, for optimizing the gradient descent algorithm. The proposed technique removes the problem of slow convergence in the diffGrad algorithm by replacing the friction constant with one based on the three past gradient differences rather than a single one. Moreover, the proposed technique introduces a condition that decides when to use the friction constant. The technique converges faster than diffGrad and finds better solutions than the Adam and AdaBelief optimization techniques. The proposed technique is combined with an adaptive batch size to further increase the convergence rate. The proposed technique has outperformed the Adam, AdaBelief, and diffGrad optimizers on synthetic complex non-convex functions and the CIFAR10 and CIFAR100 datasets.