Adaptive Regularization via Residual Smoothing in Deep Learning Optimization

We present an adaptive regularization algorithm that can be effectively applied to the optimization problem in the deep learning framework. Our regularization algorithm takes into account the fitness of the data to the current state of the model in determining the degree of regularity, in order to achieve better generalization. The degree of regularization at each element in the target space of the neural network architecture is determined adaptively based on the residual at each optimization iteration. The algorithm applies a diffusion process driven by the heat equation with spatially varying diffusivity that depends on a probability density function following a certain distribution of the residual. The data-driven regularity is imposed by adaptively smoothing a simplified objective function in which the explicit regularization term is omitted, alternating between the evaluation of the residual and the determination of the degree of its regularity. The effectiveness of our algorithm is empirically demonstrated by numerical experiments on image classification problems, indicating that our algorithm outperforms other commonly used optimization algorithms in terms of generalization using popular deep learning models and benchmark datasets.


Introduction
Deep neural networks have made significant progress in a variety of applications across a number of domains such as image understanding [8,29,39], sound recognition [2,37,42,48], motion planning [9,14,16,56], and other decision support [1,11,49,66]. In particular, the successful application of convolutional neural networks (CNNs) [35] to computer vision problems has driven advanced performance in a variety of applications such as recognition [25,57,62], segmentation [3,10,53], motion estimation [27,50,72], and reconstruction [13,64,71], owing to their effective characteristic power and generalization capabilities. This has led to large-scale optimization problems in which the numbers of both model parameters and training data are often huge.
Optimization in deep learning applications often involves the stochastic estimation of gradients using stochastic gradient descent [7,30,52] in order to improve computational efficiency with a large number of training data. Although the choices of mini-batch size and learning rate are implicitly related to the generalization of the model [4,36], it is generally required to introduce an explicit regularization term in the objective function to avoid overparameterization or over-fitting. The objective function mainly consists of a data fidelity term that measures the discrepancy between estimation and observation and a regularization term that imposes a smoothness constraint on the solution space; their relative significance is usually determined by a constant based on the ratio of variances between the likelihood and prior distributions. However, the computation of those distributions is intractable, which leads to a grid search for the control parameter that balances data fidelity and regularization. In addition, the choice of a static control parameter implies that the underlying likelihood and prior probabilities follow single model distributions, which is often undesirable for representing complicated models.
In this work, we propose a simple, yet effective regularization scheme that is designed to impose adaptive regularity depending on both the spatial and temporal domains of optimization. We consider the residual, which is indicative of the fitness between the data and the current state of the model, in the determination of regularization, so that regularization is applied adaptively in both space and time for better generalization. We develop an implicit regularization scheme based on a simplified objective function where the regularization term is omitted and a diffusion process is applied to the data fidelity term. The diffusivity of the diffusion process driven by the heat equation is determined at each residual element, in the course of optimization, based on a probability density function following a certain distribution of the residual. In applying our approach to deep learning, we present a neural network architecture incorporating our adaptive regularization, which is efficiently implemented by an additional smoothing layer with deterministic smoothing kernels. We demonstrate the effectiveness of the proposed algorithm for model generalization on image classification problems with popular network models and commonly used benchmark datasets; the algorithm can also be naturally integrated with other network architectures, such as autoencoders for image segmentation or motion estimation.
In the remainder of this paper, we relate our method to prior work in Sec. 2 and present the conventional optimization algorithm in Sec. 3, followed by our proposed algorithm in Sec. 4. The implementation of our adaptive regularization algorithm in the deep neural network framework is provided in Sec. 5, the results of numerical experiments are presented in Sec. 6, and the conclusion follows in Sec. 7.

Related Work
There have been a variety of regularization techniques in machine learning applications. One can categorize the techniques into two classes, namely, explicit regularization and implicit regularization. We provide a number of algorithms for explicit regularization and implicit regularization in Sec. 2.1 and Sec. 2.2, respectively. Then, in Sec. 2.3, we discuss in more detail the works most closely related to our algorithm, in which a smoothing technique is used to impose regularity on the solution space.

Explicit Regularization
Weight Decay: The objective function is assumed to include a regularization term that penalizes a perturbation of the unknown parameters in terms of the squared $L_2$ norm [34]. The gradient descent of the regularization term yields the decay of weights in a recursive manner with a given rate parameter and a learning rate [23]. It is considered one of the most practical regularization algorithms due to its computational convenience, yet it often blurs the solution.

Sparsity Constraint: Sparsity has emerged as a way to impose $L_1$ regularization on objective functions. The essential motivation of the sparsity assumption on the solution space stems from the modeling of the residual distribution with a sharp peak, which is known to be more realistic in most real-world problems. Sparsity constraints suppress undesirable perturbations while preserving discontinuities in order to avoid over-fitting [54]. There have been structural approaches to the sparsity constraint [12,55], where structures such as filters and layers in the neural network are considered in the application of a sparsity constraint.

Entropy Minimization: In the application of a sparsity constraint to the probability distribution of the solution, an entropy term to be minimized has been introduced into the objective function in [19]. The minimization of entropy enforces a low-rank projection of the objective function, leading to trace-norm regularization [46]. Entropy minimization has been shown to improve exploration ability, and thus can regularize the objective functions in reinforcement learning tasks [20].

In contrast to the above explicit regularization techniques, our algorithm is based on an objective function that omits the regularization term and instead applies a simple, yet effective diffusion process to the data fidelity term.

Implicit Regularization
Noise Injection: In the estimation of gradients using stochastic gradient descent, stochastic noise is involved and its variance is related to the size of the mini-batch. The injection of stochastic noise into the neural network can be used as a way to impose regularization in order to arrive at a better local minimum [32,65]. It has also been shown that the variance of the injected noise is related to the amount of smoothness imposed on the solution [47], where tighter lower bounds of the objective function can be achieved by adding noise in a stochastic gradient descent iteration. In addition to the manipulation of noise, smoothing of the ground truth labels has been proposed in [63] to make the model less confident regarding its trained weights, thus improving generality; there, the probability of each label is arbitrarily perturbed depending on a random distribution. Similarly, there is a regularization algorithm that replaces one of the ground truth labels with an arbitrary label chosen uniformly at random [69].

Dropout: One of the implicit implementations of sparsity constraints that suppress the value of weights to zero is Dropout [61], which randomly eliminates units of the neural network with a uniform probability during training and thus prevents units from excessive co-adaptation. There have been a variety of Dropout techniques, including the maxout network [17], which proposes a new activation function to leverage dropout; stochastic pooling [70], which replaces deterministic pooling with a stochastic procedure by randomly choosing the activation from a multinomial distribution; and fractional max-pooling [18], which constructs pooling regions in a stochastic manner such that the ratio between input and output size is non-integer.

Learning Rate Decay: Stochastic gradient descent often yields better training results with a learning rate annealing scheme that schedules a temporal series of learning rates over epochs [51], where a decreasing schedule is generally applied to improve convergence. In contrast, a decreasing annealing pattern has been repeated as a warm start to overcome undesirable sharp local minima in [40].

Batch Size Scheduling: Instead of decaying the learning rate, a similar regularization performance can be achieved by increasing the mini-batch size with a fixed learning rate in the training phase [59]. A dynamic mini-batch size has been introduced to decrease the stochastic variance in the estimation of gradients [5]. The selection of the optimal mini-batch size has been developed in a Bayesian framework with a fixed learning rate to improve the validation error [60].

Model Ensemble: A regularization technique has been developed by combining differently trained neural networks and introducing regularization effects imposed by different network architectures [58]. The random ensemble of prediction functions is known to provide better training behavior [38], and a structural dropout, called BranchOut [22], has been developed by randomly choosing a subset of branches in convolutional neural networks.

Batch Normalization: Batch normalization [28] has been proposed for resolving the internal covariate shift, in which the distribution of the inputs of each layer changes during the training process, by normalizing layer inputs. Batch normalization has also been shown to improve the regularization performance in training neural networks [41]. In addition, batch normalization enables training with larger learning rates, which induces faster convergence and better generalization [6].

While the aforementioned implicit methods mainly impose global regularity on the solution space, our method uses the residual, which varies in energy space and optimization time, and thus imposes spatially and temporally varying regularization depending on the residual.

Regularization via Energy Smoothing
A different perspective on imposing regularity is to modify the geometric property of the energy landscape in such a way that undesirable insignificant local minima are eliminated by smoothing the energy [44], where the objective function is convolved with Gaussian kernels. The approximated solution to the specific evolutionary partial differential equation (PDE) leads to convex envelopes of the objective function, but the approximation is assumed to be a solution of the PDE with small perturbations, which is often not the case. The energy smoothing approach has been applied to the recurrent neural network [43], where the objective function is highly non-convex. Thus, a modified network has been proposed in [21], where the loss function is differentiable, smooth and computationally stable.
Unlike the conventional smoothing approaches for regularization in deep learning optimization, our algorithm considers diffusion of residual with spatially and temporally varying diffusivity leading to adaptive regularization that is more suited for complex models in a variety of deep learning applications.

Preliminary
We consider a minimization problem in a supervised learning framework. Let $\chi = \{(x_i, y_i)\}_{i=1}^{n}$ be a set of training data, where the dimension of the feature space is $m$. The objective of the supervised learning problem is to find optimal parameters $w^*$ that are typically obtained by minimizing the empirical loss $L(w)$ defined on the training data $\chi$:
$$L(w) = \frac{1}{n} \sum_{i=1}^{n} f_i(w) + \lambda\, \gamma(w), \tag{1}$$
where we denote by $f_i(w)$ a data fidelity term for a pair of data $(x_i, y_i)$ and by $\gamma(w)$ a regularization term, and $\lambda > 0$ is a control parameter for the balance between the two terms.
The data fidelity $f_i(w)$ incurred by a set of parameters $w$ with a sample $(x_i, y_i)$ is designed to measure the discrepancy between the prediction $h_w(x_i)$ with input $x_i$ and its desired output $y_i$. The regularization $\gamma(w)$ aims to impose a smoothness condition on the prediction function $h_w(x_i)$, and thus to avoid over-fitting of the model. The control parameter $\lambda$ is determined based on the relation between the underlying distribution of the data and the prior distribution of the model. We consider a first-order optimization algorithm to minimize the objective function, which is assumed to be differentiable, leading to the following gradient descent step at iteration $t$:
$$w^{t+1} = w^t - \eta^t \left( \frac{1}{n} \sum_{i=1}^{n} \nabla f_i(w^t) + \lambda\, \nabla \gamma(w^t) \right), \tag{2}$$
where we denote by $\nabla f_i(w^t)$ the gradient of $f_i$ with respect to $w$ at iteration $t$, and by $\eta^t$ the learning rate. The computation of the above full gradient over the entire training data is often intractable due to the large number of data, which leads to the use of a stochastic gradient that is computed using a subset selected uniformly at random from the training data. The iterative step of the stochastic gradient descent algorithm at iteration $t$ reads:
$$w^{t+1} = w^t - \eta^t \left( \frac{1}{|\beta^t|} \sum_{i \in \beta^t} \nabla f_i(w^t) + \lambda\, \nabla \gamma(w^t) \right), \tag{3}$$
where $\beta^t$ denotes a mini-batch, that is, the index set of a subset selected uniformly at random from the training data. The size of the mini-batch $|\beta^t|$ is related to the variance of the gradient norms, and thus to the regularization of the model. A small mini-batch yields stochastic gradients with higher variance due to the noise involved in the stochastic process, leading to stronger regularization.
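The stochastic gradient step described above can be illustrated with a minimal sketch. Everything concrete in it is an assumption made for illustration and is not taken from the paper: a scalar linear model $h_w(x) = w x$, the data fidelity $f_i(w) = \frac{1}{2}(w x_i - y_i)^2$, the regularizer $\gamma(w) = \frac{1}{2} w^2$, the toy data, and the function name `sgd_step`.

```python
import random

def sgd_step(w, data, batch_size, lr, lam, rng):
    """One mini-batch SGD step for the illustrative choices
    f_i(w) = 0.5 * (w * x_i - y_i)**2 and gamma(w) = 0.5 * w**2."""
    # Index set of the mini-batch, drawn uniformly at random without replacement.
    batch = rng.sample(range(len(data)), batch_size)
    # Stochastic gradient of the data fidelity term, averaged over the mini-batch.
    grad_f = sum((w * data[i][0] - data[i][1]) * data[i][0] for i in batch) / batch_size
    # Gradient of the (assumed) regularizer.
    grad_gamma = w
    return w - lr * (grad_f + lam * grad_gamma)

def run(lam, steps=200, seed=0):
    rng = random.Random(seed)
    # Noise-free toy data generated by y = 2 * x.
    data = [(x, 2.0 * x) for x in (0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0)]
    w = 0.0
    for _ in range(steps):
        w = sgd_step(w, data, batch_size=4, lr=0.05, lam=lam, rng=rng)
    return w

w_plain = run(lam=0.0)  # converges to the generating slope 2
w_decay = run(lam=1.0)  # the regularizer pulls the estimate toward zero
```

With `lam=0.0` the iterate approaches the generating slope, while a positive `lam` keeps it strictly below that value, which is the trade-off the control parameter governs.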

Regularization via Residual Smoothing
The optimization of interest aims to minimize the objective function that consists of a data fidelity term, a regularization term, and a control parameter for their relative weight. The selection of the control parameter is often critical to obtain a better solution and is determined by the ratio between the underlying distributions of the residual and the prior smoothness, both of which are mostly assumed to follow unimodal distributions. Thus, the control parameter is chosen to be constant, and it is generally required to apply a grid search over a range of values to choose an optimal one. However, modeling the distribution of the data fidelity with a unimodal probability density function, and determining the ratio of its variance to the variance of the prior distribution for a smooth solution on that basis, is often ineffective, and it leads to a static control parameter for the trade-off between data fitting and smoothness. Thus, we propose an adaptive regularization scheme that considers the residual in the determination of regularity at each point of the residual domain.

Adaptive Regularization based on Residual
The computation of empirical stochastic gradient involves the noise process following a certain distribution with zero mean, and its variance is inversely proportional to the size of mini-batch.
In addition to the stability, the noise process is also related to the regularization, thus the size of the mini-batch can be used in determining regularity in an implicit way. On the other hand, the control parameter $\lambda$ in the objective function in (1) can be made variable, with a fixed mini-batch size, for each sample $(x_i, y_i)$, leading to the following modified objective function:
$$L(w) = \frac{1}{n} \sum_{i=1}^{n} \left( f_i(w) + \lambda_i\, \gamma(w) \right), \tag{4}$$
where $\lambda_i \in \mathbb{R}$ denotes a weighting parameter for the regularization term and it is designed to be associated with each sample $(x_i, y_i)$. We assume that the degree of regularity follows a distribution of the residual leading to the following data-driven regularity: where $\nu$ is a parameter for the variance of the residual. The degree of regularity is designed to be proportional to the magnitude of the residual for each sample. In addition to the adaptive application of regularity with respect to the sample, we consider the temporal state of the solution in the course of optimization, leading to the update of model parameters based on the stochastic gradients incorporating data-driven regularity as follows:
$$w^{t+1} = w^t - \eta^t\, \frac{1}{|\beta^t|} \sum_{i \in \beta^t} \left( \nabla f_i(w^t) + \lambda_i^t\, \nabla \gamma(w^t) \right), \tag{6}$$
where $\lambda_i^t$ is variable with respect to both the optimization iteration $t$ and the sample index $i$. The intrinsic motivation of the temporally adaptive regularization stems from the limitation of the existing static scheme, which imposes the same degree of regularity even though the residual decays over the optimization steps. However, it is computationally expensive to construct the distribution from which the control parameter for regularization is determined while computing the gradients of both the data fidelity and regularization terms. In addition to the computational efficiency, it is desirable to consider the relative magnitude of the residual in its spatial domain. Thus, we propose a simple, yet effective regularization scheme that is designed to impose adaptive regularity depending on both the spatial and temporal domains of optimization. This is achieved by smoothing the residual with a spatially and temporally varying degree, without explicit computation of the gradient of the regularization term, which is omitted from the objective function, as presented in the following section.

Regularization via Adaptive Diffusion in Space and Time
We propose a regularization algorithm that is based on smoothing the residual, which measures the discrepancy between the model and the sample data, without taking into account an explicit regularization term. We modify the objective function in Eq. (4) by omitting the regularization term and replacing the original data fidelity term $f_i(w)$ with $g_i(w)$ as follows:
$$L(w) = \frac{1}{n} \sum_{i=1}^{n} g_i(w), \tag{7}$$
where $g_i(w)$ is the squared $L_2$ norm of the diffused residual $u_i(w)$ for each sample $(x_i, y_i)$, and $u_i(w)$ is obtained by the diffusion process using the heat equation as follows:
$$\frac{\partial u_i(w; \tau)}{\partial \tau} = \kappa\, \Delta u_i(w; \tau), \tag{8}$$
where $\kappa$ denotes a diffusion coefficient, $\Delta$ the Laplace operator, and $\tau$ an auxiliary variable for the diffusion time. The Neumann boundary condition is imposed and the initial condition $u_i(w; 0)$ is given by the residual defined by the magnitude of the discrepancy between the prediction and the desired output as follows:
$$u_i(w; 0) = d_i(w) = |h_w(x_i) - y_i|, \tag{9}$$
where $d_i(w) \in \mathbb{R}^M$. In the diffusion equation, the coefficient $\kappa$ is normally set to be constant, but we consider a diffusivity map $\kappa : \mathbb{R}^M \to \mathbb{R}^M$ that is employed to impose spatially varying regularity depending on the residual. The diffusivity map is designed to apply regularity following a distribution of the residual based on the sigmoid function $S(x; s, \alpha)$ defined by: where $s, \alpha \in \mathbb{R}^+$ are parameters that determine the vertical scale and the steepness of the transition in function value, respectively. The graphical illustration of the sigmoid function with varying parameters is presented in Fig. 1, where the functions with varying $s$ and fixed $\alpha = 1$ are shown in (a), and the functions with varying $\alpha$ and fixed $s = 1$ are shown in (b). The update of parameters using stochastic gradient descent based on the mini-batch $\beta^t$ at each iteration $t$ reads:
$$w^{t+1} = w^t - \eta^t\, \frac{1}{|\beta^t|} \sum_{i \in \beta^t} \nabla g_i(w^t), \tag{11}$$
where the computation of the stochastic gradient $\nabla g_i(w^t)$ involves the diffusion $u_i(w^t)$ of the residual $d_i(w^t)$. The diffusivity of the heat equation applied to the residual is determined based on a distribution formed by the sigmoid function, and its associated parameters, the scale $s$ and the steepness $\alpha$, are chosen by global and local properties of the residual in the neural network architecture, as presented in the following section.
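To make the diffusion of the residual concrete, the following sketch applies one explicit finite-difference step of the heat equation with a residual-dependent diffusivity. Several ingredients are assumptions that the text here does not fix: the sigmoid is taken to be $S(x; s, \alpha) = s / (1 + e^{-\alpha x})$, which is one form consistent with the described roles of $s$ (vertical scale) and $\alpha$ (steepness); the residual lives on a 1-D grid with a nearest-neighbour Laplacian; and the step size `dt` and the sample values are illustrative.

```python
import math

def sigmoid(x, s=1.0, alpha=1.0):
    """Assumed form S(x; s, alpha) = s / (1 + exp(-alpha * x))."""
    return s / (1.0 + math.exp(-alpha * x))

def diffuse_step(u, kappa, dt=0.25):
    """One explicit Euler step of du/dtau = kappa * Laplacian(u) on a 1-D grid.

    The Neumann (zero-flux) boundary condition is imposed by mirroring the
    end values, so the missing neighbour equals the boundary element itself.
    """
    n = len(u)
    out = []
    for j in range(n):
        left = u[j - 1] if j > 0 else u[j]
        right = u[j + 1] if j < n - 1 else u[j]
        out.append(u[j] + dt * kappa[j] * (left - 2.0 * u[j] + right))
    return out

# Residual magnitudes for one sample (illustrative values).
residual = [0.05, 0.10, 0.90, 0.10, 0.05]
# Larger residual elements receive larger diffusivity.
kappa = [sigmoid(d, s=1.0, alpha=4.0) for d in residual]
smoothed = diffuse_step(residual, kappa)
# Objective on the diffused residual: its squared L2 norm.
loss = sum(v * v for v in smoothed)
```

Under the assumed sigmoid, setting `alpha=0` gives the constant value `s / 2` for every element, i.e. isotropic diffusion, while a larger `alpha` concentrates the smoothing on the elements with large residual.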

Annealing of Adaptive Diffusion
The proposed algorithm aims to impose adaptive regularization depending on the magnitude of the residual by applying spatially varying diffusion to the residual. The diffusivity of the heat equation applied to the residual is determined based on the magnitude of the residual following the sigmoid function as follows:
$$\kappa_i^t = S(d_i(w^t);\, s^t, \alpha), \tag{12}$$
where the diffusivity map $\kappa_i^t \in \mathbb{R}^M$ of the heat equation is determined by the sigmoid function of the residual $d_i(w^t) \in \mathbb{R}^M$ at iteration $t$. We consider the temporal residual $d_i(w^t)$ given by the current state of the solution $w^t$ in determining the degree of temporal diffusion $\kappa_i^t$ for each sample $(x_i, y_i)$. We also consider the scale parameter $s^t$, which is variable in optimization time, and present its annealing scheme in the following section.

Global Adaptivity
The scale parameter $s^t \in \mathbb{R}^+$ of $S(d_i(w^t); s^t, \alpha)$ at time $t$ in Eq. (12) determines the degree of diffusion that is applied to the entire domain of the residual in an isotropic way with the fixed steepness parameter $\alpha = 0$; thus it is a global parameter that depends on time $t$. The motivation for introducing a time-varying scale parameter is to account for the temporal decay of the residual, which results in a decrease of the diffusion, and hence of the regularity. However, it is often necessary to allow larger stochastic noise in order to avoid undesirable sharp local minima, in particular at the early stage of optimization. Thus, we propose to employ annealing schemes for the scale parameter $s$ using the probability density function of either the Logistic distribution $y_1(x; \mu, b)$ or the Laplace distribution $y_2(x; \mu, b)$, defined by:
$$y_1(x; \mu, b) = \frac{\exp\left(-\frac{x - \mu}{b}\right)}{b \left(1 + \exp\left(-\frac{x - \mu}{b}\right)\right)^2}, \tag{13}$$
$$y_2(x; \mu, b) = \frac{1}{2b} \exp\left(-\frac{|x - \mu|}{b}\right), \tag{14}$$
where $\mu$ and $b$ denote the mean and the scale, respectively. The graphical illustrations of the scaled probability density functions $y_1$ and $y_2$ with varying scale parameters $b$ are presented in Fig. 2, where each probability density function is scaled to have the maximum value 1 and the associated distributions are (a) Logistic and (b) Laplace. The global diffusivity map $\kappa_i^t$ is then defined by the sigmoid function with $s^t$ and fixed $\alpha = 0$:
$$\kappa_i^t = S(d_i(w^t);\, s^t, 0), \qquad s^t = y(t; \mu, b),$$
where $y$ can be either $y_1$ in Eq. (13) or $y_2$ in Eq. (14), $\mu$ is chosen for the peak location, and $b$ is a scale parameter for the sharpness of the distribution centered at $\mu$. The degree of regularization driven by the diffusion process based on the sigmoid function with the annealing of its scale parameter gradually increases up to the peak at the mean of the annealing distribution and decreases afterwards, arriving at the original objective function without diffusion.
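The annealing of the scale parameter can be sketched with the standard Logistic and Laplace density functions, each divided by its peak value so that the maximum of the schedule is 1. The normalization of optimization time to the interval $[0, 1]$, the helper name `annealed_scale`, and the values $\mu = 0.75$ and $b = 0.1$ are assumptions chosen for illustration.

```python
import math

def logistic_pdf(x, mu, b):
    """Density of the Logistic distribution with mean mu and scale b."""
    z = math.exp(-(x - mu) / b)
    return z / (b * (1.0 + z) ** 2)

def laplace_pdf(x, mu, b):
    """Density of the Laplace distribution with mean mu and scale b."""
    return math.exp(-abs(x - mu) / b) / (2.0 * b)

def annealed_scale(t, mu, b, pdf):
    """Scale at time t: the density at t divided by its peak value at mu."""
    return pdf(t, mu, b) / pdf(mu, mu, b)

epochs = 100
mu = 0.75  # peak at 75% of training, with time normalized to [0, 1]
schedule_logistic = [annealed_scale(e / epochs, mu, 0.1, logistic_pdf)
                     for e in range(epochs + 1)]
schedule_laplace = [annealed_scale(e / epochs, mu, 0.1, laplace_pdf)
                    for e in range(epochs + 1)]
```

Both schedules rise toward the peak at $\mu$ and fall afterwards; the Laplace schedule has a sharper peak, and a smaller `b` narrows either one.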

Local Adaptivity
In the adaptive application of regularization in the domain of the residual, we consider the relative magnitude of the residuals so that a different degree of regularization is applied to each residual element in its domain. The residual is first normalized to have mean 0 and standard deviation 1 at each iteration in order to consider the relative significance among the residual elements. The diffusivity map with the local adaptive scheme is defined by:
$$\kappa_i^t = S(\tilde{d}_i^t;\, s, \alpha), \qquad \tilde{d}_i^t = \frac{d_i^t - \mu_i^t}{\sigma_i^t},$$
where the parameters $s, \alpha \in \mathbb{R}^+$ are chosen to be constant, and $\tilde{d}_i^t$ is the normalized residual of $d_i^t$ with mean $\mu_i^t$ and standard deviation $\sigma_i^t$ for each sample $(x_i, y_i)$ at time $t$.

Combination of Global and Local Adaptivity
Our final choice of the annealing scheme for adaptive regularization incorporates both the global and local approaches, considering the global decay of the residual and its relative weight in the residual domain at each iteration, which leads to the full adaptive scheme. The proposed diffusivity map for our algorithm integrates the global annealing of the scale parameter and the relative weight of the residual, leading to:
$$\kappa_i^t = S\left(\frac{d_i(w^t) - \mu_i^t}{\sigma_i^t};\, s^t, \alpha\right), \qquad s^t = y(t; \mu, b), \tag{19}$$
where $\mu_i^t$ and $\sigma_i^t$ denote the mean and the standard deviation of the temporal residual $d_i(w^t)$ for sample $(x_i, y_i)$, respectively, and $\mu$ and $b$ denote the mean and the scale of the probability density function of the annealing distribution, respectively.
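A small sketch of the combined scheme follows: the residual of one sample is standardized, and the sigmoid is evaluated with a time-dependent scale. The sigmoid form $s / (1 + e^{-\alpha z})$ is an assumption, as are the function names, the use of the population standard deviation, and the sample values; the scale `s_t` is passed in directly and stands for the value produced by the annealing distribution at the current iteration.

```python
import math

def standardize(values):
    """Shift and scale values to mean 0 and standard deviation 1.

    Assumes the values are not all identical (standard deviation > 0).
    """
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

def diffusivity_map(residual, s_t, alpha):
    """kappa_j = S(normalized residual_j; s_t, alpha) with the assumed sigmoid."""
    return [s_t / (1.0 + math.exp(-alpha * z)) for z in standardize(residual)]

# Residual of a single sample with one dominant element (illustrative values).
residual = [0.02, 0.03, 0.85, 0.05, 0.05]
kappa_peak = diffusivity_map(residual, s_t=1.0, alpha=1.0)  # at the annealing peak
kappa_late = diffusivity_map(residual, s_t=0.1, alpha=1.0)  # after the peak
```

The relative ordering of the diffusivity values depends only on the normalized residual (local adaptivity), whereas their overall level follows `s_t` (global adaptivity).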

Network Architecture incorporating Adaptive Regularity
The neural network architecture with our proposed regularization algorithm is constructed from a primary network that yields the prediction for the problem of interest and computes the associated residual, which is subsequently fed into a series of smoothing layers, leading to the objective function based on the smoothed residual. The schematic illustration of the network architecture is presented in Fig. 3, where the target of the primary network is represented by a one-hot encoding for the image classification problem. Our regularization algorithm applies a diffusion process to the residual depending on its magnitude, using the heat equation based on the Laplace operator with spatially varying diffusivity. However, the Laplace operator is not suited to the residual domain of the image classification problem, in which the spatial relation among neighboring elements is not locally related to the regularity of the solution; it is applicable only to residual domains where local affinity implies regularity in the solution space, for example, in autoencoder architectures. We therefore employ an extended Laplace operator that results in a global interpolation of all the elements in the residual domain, blurring the one-hot encoding representation by means of a fully connected layer with the following weights: where $w_{jk}$ denotes the element of the $M \times M$ weight matrix of a fully connected layer that connects the $k$-th node of the residual to the $j$-th node of the successive diffused residual layer, $M$ is the dimension of the residual, and $\kappa_j$ denotes the diffusivity value obtained by Eq. (19) for the $j$-th element of the residual. The number of smoothing layers is related to the diffusion time $\tau$ and the diffusivity $\kappa$ in the heat equation in Eq. (8); we set the number of smoothing layers to one and vary the scale factor of the diffusivity instead, for numerical stability and computational efficiency.
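The idea of a smoothing layer that interpolates each residual element with all the others can be sketched as a single matrix-vector product. The weights used below are a hypothetical choice, not necessarily those of the paper: the diagonal entry is $1 - \kappa_j$ and each off-diagonal entry in row $j$ is $\kappa_j / (M - 1)$, so that every row sums to one and $\kappa_j = 0$ leaves element $j$ unchanged. The function names and sample values are likewise illustrative.

```python
def smoothing_weights(kappa):
    """Hypothetical M x M weights: row j blends element j with the mean of
    the other elements, with blending strength kappa[j]."""
    M = len(kappa)
    return [[1.0 - kappa[j] if j == k else kappa[j] / (M - 1) for k in range(M)]
            for j in range(M)]

def smoothing_layer(residual, kappa):
    """Apply the fully connected smoothing layer (no bias, no activation)."""
    W = smoothing_weights(kappa)
    M = len(residual)
    return [sum(W[j][k] * residual[k] for k in range(M)) for j in range(M)]

# Residual of a confident but wrong one-hot prediction (illustrative values).
residual = [0.0, 0.0, 1.0, 0.0]
kappa = [0.1, 0.1, 0.6, 0.1]
blurred = smoothing_layer(residual, kappa)
```

Because the weights are deterministic functions of the diffusivity map, the layer adds no trainable parameters under this construction.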

Experimental Results
In the experiments, we provide an empirical quantitative evaluation of our algorithm by a comparative analysis using an image classification task. A detailed description of the experimental setup is presented in the following.
Datasets: We use four commonly used benchmark datasets: CIFAR-10, CIFAR-100 [33], Street View House Numbers (SVHN) [45], and Fashion-MNIST [68]. CIFAR-10 consists of 50K training and 10K testing images of size 32×32×3 in 10 categories. CIFAR-100 is the same as CIFAR-10 except that it has 100 classes. We apply conventional image augmentation with padding, random cropping and flipping to CIFAR-10 and CIFAR-100 as pre-processing steps. SVHN is a dataset of house numbers in street images.
Neural Network Models: We consider neural network architectures ranging from shallow to deep models, including ResNet20, ResNet56 [25] and DenseNet-BC with 100 layers (k = 12) [26].
Optimization and Hyperparameters: We use the stochastic gradient descent method, and the objective function is the mean squared error of the residual that measures the difference between the prediction and the desired output. We use the following common hyperparameters across all the experiments: momentum is 0.9, mini-batch size is 128, the number of epochs is 160 for CIFAR-10 and CIFAR-100, 100 for SVHN and 48 for Fashion-MNIST, weight decay is 0.001, and the learning rate is set to 0.1 for the first 75 percent of epochs and 0.001 for the rest. The unknown weights are initialized by the algorithm proposed in [24].
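The hyperparameters and the two-level learning rate schedule listed above can be written down as a short configuration sketch. The dictionary layout, the function name `learning_rate`, and the convention that epochs are 0-indexed are assumptions; the numerical values are those stated in the text.

```python
# Hyperparameters as listed in the text; the dictionary layout is illustrative.
HYPERPARAMS = {
    "momentum": 0.9,
    "batch_size": 128,
    "weight_decay": 0.001,
    "epochs": {"CIFAR-10": 160, "CIFAR-100": 160, "SVHN": 100, "Fashion-MNIST": 48},
}

def learning_rate(epoch, num_epochs, high=0.1, low=0.001, switch=0.75):
    """Step schedule: `high` for the first 75% of epochs, `low` afterwards.

    `epoch` is assumed to be 0-indexed.
    """
    return high if epoch < switch * num_epochs else low

cifar_epochs = HYPERPARAMS["epochs"]["CIFAR-10"]
schedule = [learning_rate(e, cifar_epochs) for e in range(cifar_epochs)]
```

For the 160-epoch runs this places the drop at epoch 120, and for the 48-epoch Fashion-MNIST runs at epoch 36.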
Quantitative Evaluation: We compute the learning curves that include training loss, training accuracy and validation accuracy. We perform 5 independent trials for each set of experiments; the maximum validation accuracy is taken across all the epochs, and these maxima are averaged over the 5 trials. We also compute the average validation accuracy over the last 10% of epochs and average it over the 5 trials.

Ablation Analysis on the Adaptive Regularization Parameters
We analyze the effect of the global annealing scheme and of its combination with the local annealing scheme for adaptive regularization, based on ResNet20 using the Fashion-MNIST dataset. We compare the performance of our algorithm to the baseline, stochastic gradient descent (SGD), to demonstrate that our algorithm outperforms SGD with a grid search over the regularization parameter. We apply SGD with weight decay values 1e−2, 1e−3, 1e−4 and 1e−5, and the validation accuracy is presented in Table 1, where the results of our algorithm based on global adaptive annealing following Laplace and Logistic distributions are presented in the middle block, and the results based on the combination of global and local adaptive annealing following Laplace and Logistic distributions are presented in the right block. For the global adaptivity, the steepness parameter α = 0 in Eq. (12) is used and the scale parameter b of the distribution varies from 0.1 to 0.9 with step size 0.2, while the mean µ of the distributions is set to the 75% point of the epochs and the maximum of the distributions is scaled to 1. In the application of the full adaptive schemes integrating the global and local schemes, the same parameters as in the global adaptive scheme are used, except that the steepness parameter takes the values α = 0.25, 0.5, 1, 2, 4. We apply a grid search in the selection of the parameters associated with our algorithms. It is shown that our algorithm outperforms SGD regardless of the weight decay value associated with SGD, and that a further performance gain is achieved with the local adaptivity in addition to the global adaptivity.

Effect on Generalization based on Partial Training Data
We empirically demonstrate the effect of our adaptive regularization algorithm based on ResNet20 using the Fashion-MNIST dataset. To highlight the effectiveness of our algorithm in generalization, we select a partial subset of the training data uniformly at random for the training phase, with ratios 1/2, 1/4 and 1/8. The validation accuracy of SGD is computed for weight decay values 1e−2, 1e−3, 1e−4 and 1e−5, and its maximum and average are computed over 5 independent trials, as shown in the left block of Table 2. The maximum and average validation accuracy obtained by our algorithm with fully adaptive regularization incorporating the global and local schemes are presented in the right block of Table 2, where Laplace and Logistic distributions are used for the global annealing of adaptive regularization. The steepness parameter α of the sigmoid function for the local adaptive regularization is fixed to 1, whereas the scale parameter b of the distribution for the global adaptive regularization is selected by a grid search over values from 0.1 to 0.9 with a step size of 0.2, except for the Laplace adaptivity scheme with 1/8 partial data, where the grid search for b is done over a range from 0.9 to 1.7 with a step size of 0.2.
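Selecting a partial training set uniformly at random can be sketched as sampling indices without replacement. The function name, the fixed seed, and the use of 60,000 as the size of the Fashion-MNIST training set are assumptions of this sketch; in particular, the text does not say whether sampling is stratified by class, and none is done here.

```python
import random

def partial_indices(num_samples, denominator, seed=0):
    """Indices of a subset of size num_samples // denominator, drawn
    uniformly at random without replacement."""
    rng = random.Random(seed)
    return rng.sample(range(num_samples), num_samples // denominator)

NUM_TRAIN = 60000  # size of the Fashion-MNIST training set
# One index set per ratio 1/2, 1/4 and 1/8.
subsets = {d: partial_indices(NUM_TRAIN, d) for d in (2, 4, 8)}
```

Using a different seed per trial would give independent subsets across the 5 trials; whether the original experiments redraw the subset for each trial is not stated.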
The maximum validation accuracy obtained by our algorithm with different annealing distribution is presented in Figure 4 where the accuracy with Laplace and Logistic annealing distributions is shown in blue and red, respectively along with the baseline in black for each ratio of partial training set, (a) 1/2, (b) 1/4 and (c) 1/8.It is shown that our algorithm outperforms the baseline across all the scale parameters for both annealing distributions, indicating that our algorithm achieves better generalization.

Comparative Analysis with other Optimization Algorithms
We compare our algorithm with the commonly used optimization algorithms including Adam [31] and AdaGrad [15].In our comparative analysis, we use deeper networks including ResNet56 and DenseNet-BC with 100 layers (k = 12) for the benchmark datasets that are CIFAR-10, CIFAR-100 and SVHN.We use 0.0001 for the weight decay of SGD as recom-mended in [25,26], and the associated parameters with our algorithm are used by b = 0.2, 0.5 for Laplace, b = 0.25, 0.5 for Logistic distribution respectively and α = 1.The maximum and mean validation accuracy are presented in Table 3 where the results using ResNet56 (top block) and DenseNet-BC (bottom block) with SGD, Adam, AdaGrad, and our algorithms with Laplace and Logistic distributions are shown from left to right.
Our algorithm outperforms the other algorithms under comparison with their recommended parameters. Note that adaptive optimization methods such as Adam and AdaGrad often generalize worse than SGD for image classification tasks [67]. Our algorithm could potentially be improved further with a wider grid search over the parameters α and b. The learning curves obtained by (a) SGD and our algorithm with (b) Laplace and (c) Logistic distributions using CIFAR-10 (top), CIFAR-100 (middle) and SVHN (bottom) are presented in Figure 5 and Figure 6, where the learning curves of our algorithm indicate better generalization in terms of validation accuracy. The learning rate is scheduled to drop from 0.1 to 0.001 at 75% of the epochs, and the global adaptive annealing reaches the peak of the associated distribution at 75% of the epochs, which leads to an abrupt change in the learning curves.
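The two schedules mentioned above, the learning-rate drop and the global annealing that peaks at 75% of the epochs, could be sketched as follows. This is a hedged illustration under stated assumptions: training progress is normalized to t in [0, 1], the location of each distribution is placed at t = 0.75, and each density is scaled to have maximum value 1 as in Figure 2. The function names and the total of 200 epochs are hypothetical, and how the annealed value enters the sigmoid scale parameter s is not reproduced here.

```python
import math


def step_learning_rate(epoch, total_epochs, initial=0.1, final=0.001, drop_at=0.75):
    """Learning rate that drops from `initial` to `final` at 75% of the epochs."""
    return initial if epoch < drop_at * total_epochs else final


def scaled_laplace(t, mu, b):
    """Laplace density divided by its peak value 1/(2b), so the maximum is 1."""
    return math.exp(-abs(t - mu) / b)


def scaled_logistic(t, mu, b):
    """Logistic density divided by its peak value 1/(4b), so the maximum is 1."""
    z = abs(t - mu) / b  # the density is symmetric about mu, so |t - mu| is safe
    e = math.exp(-z)
    return 4.0 * e / (1.0 + e) ** 2


TOTAL_EPOCHS = 200  # hypothetical; the epoch count is not stated in this section
PEAK = 0.75         # assumed location: annealing peaks at 75% of training

for epoch in (0, 100, 149, 150, 199):
    t = epoch / TOTAL_EPOCHS  # normalized training progress (an assumption)
    lr = step_learning_rate(epoch, TOTAL_EPOCHS)
    laplace_value = scaled_laplace(t, PEAK, b=0.2)
    logistic_value = scaled_logistic(t, PEAK, b=0.25)
    print(f"epoch {epoch:3d}  lr {lr:.3f}  "
          f"laplace {laplace_value:.3f}  logistic {logistic_value:.3f}")
```

Under these assumptions both schedules change character at the same epoch, which is consistent with the abrupt change in the learning curves noted above.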

Conclusion and Discussion
In this paper, we have investigated data-driven adaptive regularization that smooths the residual of a neural network for the image classification problem. The residual is defined as the discrepancy between the output of the neural network and the desired output. The regularization is imposed by diffusing the residual according to a probability density function following either the Laplace or the Logistic distribution, where the degree of regularization is proportional to the magnitude of each residual element. We have presented a combination of local and global annealing schemes, designed to take the residual into account in determining the degree of diffusion, that yields spatially and temporally varying regularization. The effectiveness of the proposed algorithm has been demonstrated by experimental results, indicating its potential to be easily integrated into a variety of problems in deep learning applications.

Figure 1: Graphical illustration of sigmoid function with varying (a) scale parameter s with fixed α and (b) steepness parameter α with fixed s.

Figure 2: Graphical illustration of the scaled probability density function associated with different distributions, (a) Logistic distribution and (b) Laplace distribution, with varying scale b for the annealing of the scale parameter s in the sigmoid function. Each probability density function is scaled to have the maximum value 1.

Figure 3: Schematic illustration of the network architecture incorporating our regularization algorithm for the image classification problem. The target of the primary network is represented by a one-hot encoding, and the residual is subsequently fed into a series of smoothing layers leading to the objective function based on the smoothed residual.

Figure 4: Validation accuracy (y-axis) with varying scale parameter b (x-axis) of the annealing distribution using the partial Fashion-MNIST dataset. The training is performed based on ResNet20 by SGD and our global+local adaptive regularization schemes. The partial ratios of training data used are (a) 1/2, (b) 1/4 and (c) 1/8.

Figure 5: Learning curves obtained based on the ResNet56 model using CIFAR-10 (top), CIFAR-100 (middle), and SVHN (bottom) datasets. The validation accuracy, training loss, and testing loss are presented in red, blue, and green, respectively. The learning performance of our regularization scheme based on (b) Laplace and (c) Logistic distributions is compared with (a) the SGD algorithm.

Figure 6:

The SVHN dataset consists of 73257 training and 26032 testing images of size 32×32×3 for 10 categories.

Table 1: Validation accuracy based on ResNet20 using the Fashion-MNIST dataset by SGD (left) with varying weight decay (wd) parameters from larger to smaller, our algorithm with the global annealing scheme (middle), and the combination of global and local annealing schemes (right). The annealing of the adaptive regularization parameter follows Laplace (left) and Logistic (right) distributions, where the associated parameters are chosen by grid search.

Table 2: Validation accuracy based on ResNet20 using the partial (1/2, 1/4 and 1/8) Fashion-MNIST dataset by SGD (left) with varying weight decay (wd) parameters from larger to smaller, and our algorithm with the global+local adaptive scheme (right). The annealing of the adaptive regularization parameter follows Laplace (left) and Logistic (right) distributions, where the associated parameters are chosen by grid search.

Table 3: Comparison of validation accuracy obtained by SGD, Adam, AdaGrad, and our fully adaptive algorithm with Laplace and Logistic distributions, from left to right. The training is performed based on the models ResNet56 (top block) and DenseNet-BC (bottom block).