Selective Path Automatic Differentiation: Beyond Uniform Distribution on Backpropagation Dropout

This paper introduces Selective Path Automatic Differentiation (SPAD), a novel approach to reducing memory consumption and mitigating overfitting in gradient-based models for embedded artificial intelligence. SPAD extends Randomized Automatic Differentiation, proposed by Oktay et al., which draws random paths through the backpropagation graph via matrix injection, by enabling alternative probability distributions on the backpropagation graph, thereby enhancing learning performance and memory management. At a given iteration, SPAD evaluates and ranks multiple paths within the backpropagation graph; over subsequent iterations, it preferentially follows the higher-ranked paths. This work also presents a compilation-based technique allowing model-agnostic access to random paths, ensuring generalizability across various model architectures, not restricted to deep models. Experimental evaluations conducted across various optimization functions demonstrate improved minimization performance when employing SPAD. Additionally, deep learning experiments with SPAD notably mitigate overfitting, offering benefits akin to those of traditional dropout methods, but with a concomitant decrease in memory usage. We conclude by discussing the unique stochasticity implications of our work and the potential for it to augment other stochastic techniques in the field.


I. INTRODUCTION
Artificial intelligence (AI) based on deep learning architectures is being increasingly employed in various industries and in everyday life [4], [18]. The next step in this direction is to embed AI on small devices, such as smartphones, with limited computing resources [2]. However, one of the major challenges in this context is the memory required to train deep architectures based on automatic differentiation [1]. This process is resource-intensive, particularly during the reverse mode of automatic differentiation [22], which is essential for training these types of deep architectures. This issue has been inherent to neural networks since the origins of the field. For example, Stochastic Gradient Descent (SGD) [16] is currently the only method suitable for working on huge datasets. Indeed, it was a major breakthrough in theoretically addressing the splitting of large datasets into small batches while guaranteeing convergence. Another way to reduce memory usage is checkpointing [21], introduced as a trade-off between execution speed and memory consumption, without affecting gradient computations or updates. Beyond these seminal techniques, current work [15] on embedded AI systems still tries to limit memory consumption, with the additional objective of not compromising the accuracy of the model.
In addition to the resource consumption issue, overfitting of the training data is a common unwanted behavior [17] that reduces the generalization power of the model. It occurs when the model memorizes the training set and fails to generalize to new data. To mitigate this issue, dropout was introduced for neural networks in [7]. Dropout involves temporarily turning off certain neurons of the model. By doing so, it prevents the network from relying too heavily on any group of neurons and improves the ability of the model to generalize well. Although dropout reduces overfitting, current implementations do not reduce memory consumption. Dropout is applicable to deep learning models whose parameters do not convey a specific meaning and can be randomly deactivated. It is therefore not suitable for small white-box models, where every parameter has a specific purpose and meaning. In such models, turning off crucial parameters is unfeasible without breaking the model, even though the resulting regularization is a desirable feature.
An intermediate solution would be to apply dropout not during the forward pass, but during the backpropagation of the gradient. As an added benefit beyond generalization, dropping the computation of some parameters' gradients limits the resource consumption of backpropagation. Recently, this was theoretically formulated in Randomized Automatic Differentiation (RAD) [12], which introduced a novel gradient estimator that constructs unbiased estimates by randomly drawing segments of the gradient code with uniform probability, thereby applying a dropped-out backpropagation. As a proper way to represent the code of a model, RAD uses its Linearized Computational Graph (LCG), in which the nodes represent intermediate variables and the edges represent the mathematical operations. In an LCG, each operation depends only on the output of the previous operation. This simplification allows for efficient calculations, as the operations can be computed one after the other, without the need to store all the intermediate values. An example is given in Figure 1.
The representation of the gradient code as a graph enables drawing paths along the edges, thus granting a new form of gradient stochasticity. This stochasticity is based on the decomposition of the gradient into the contribution of each LCG path from the parameter θ to the output node z. This decomposition as a sum is formalized in Equation 1, with f_θ the function to minimize with respect to θ:

∂f_θ/∂θ = Σ_{paths θ → ... → z} ∏_{(z_k → z_l) ∈ path} ∂z_l/∂z_k     (1)

where z_k → z_l represents a directed edge connecting two nodes, and z is the output node that represents f_θ. The total gradient is the sum of all the path contributions. RAD uses a uniform probability distribution to draw these paths. This is a valid starting point, but it is only one possibility: other distributions may lead to better learning results. However, the notion of an optimal distribution over the backpropagation paths depends on the stage of the learning process. Since the goal is to limit resource consumption, an optimization scheme searching for the temporally best distribution would be counterproductive. This is why the focus is on heuristics that emphasize the most important paths for gradient propagation.

Selective Path Automatic Differentiation (SPAD) extends RAD by going beyond the uniform distribution across various tasks without increasing memory usage. During each iteration of SGD, SPAD evaluates multiple paths within the backpropagation graph and ranks them according to their contribution to the overall gradient. It then prioritizes the higher-ranked paths in subsequent iterations. Moreover, while RAD proposes a technique for generating random paths within neural networks through random matrix injection, this work generalizes beyond deep neural models to enable the modification of the gradient estimator for any model architecture. The approach is based on compilation choices made during the automatic differentiation process, ensuring that each parameter is used only once, even if this necessitates duplicating variables. Automatic differentiation, the main computational technique for calculating function derivatives, has significantly facilitated the emergence of deep learning by automatically computing gradients for custom-designed models, much as SGD enables gradient descent on large-scale datasets. Applying automatic differentiation to this model representation directly decomposes the gradient as a sum of path contributions.
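To make Equation 1 concrete, the following minimal sketch enumerates the two paths of a small toy LCG (chosen here purely for illustration, not taken from the paper) and checks that the sum of the per-path products of edge partials matches the analytic derivative.

```python
import math

# Toy LCG for f(theta) = sin(theta^2) + cos(theta^2):
#   theta -> a -> b -> z   and   theta -> a -> c -> z
# Equation 1: the derivative is the sum, over paths, of the product of edge partials.
def path_decomposition(theta):
    a = theta ** 2
    da_dtheta = 2.0 * theta   # edge theta -> a
    db_da = math.cos(a)       # edge a -> b, with b = sin(a)
    dc_da = -math.sin(a)      # edge a -> c, with c = cos(a)
    dz_db = 1.0               # edge b -> z, with z = b + c
    dz_dc = 1.0               # edge c -> z
    path_1 = da_dtheta * db_da * dz_db
    path_2 = da_dtheta * dc_da * dz_dc
    return path_1, path_2

theta = 0.7
p1, p2 = path_decomposition(theta)
analytic = 2.0 * theta * (math.cos(theta ** 2) - math.sin(theta ** 2))
assert abs((p1 + p2) - analytic) < 1e-12
```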
In summary, this paper presents two contributions. Firstly, RAD is generalized into SPAD by allowing alternative probability distributions on the LCG, which reduces overfitting and can be interpreted as a form of dropout during backpropagation. Secondly, a compilation-based technique is introduced that enables automatic access to the random paths without requiring any prior knowledge of the model architecture. The remainder of the paper is structured as follows. In the next section, the novel gradient estimator, Selective Path Automatic Differentiation (SPAD), is introduced; it extends RAD beyond the uniform distribution, and a technique for applying it to general models without any prior knowledge of their structure is presented. The third section presents the results of experiments conducted on various objective functions. The paper concludes with a discussion of the novel direction of stochasticity prompted by this work and of its similarities with dropout.

II. BEYOND UNIFORM DISTRIBUTION AND MATRIX INJECTION
A. BEYOND UNIFORM DISTRIBUTION: DISTRIBUTION PROBABILITY GENERALIZATION
Thanks to Equation 1, the gradient is decomposed as a sum of all the path contributions. This decomposition can be generalized as Equation 2, regardless of the provenance of each term of the sum:

∇_θ f_θ = Σ_{i=1}^{N} g_{θ,i}     (2)
This formulation, where each g_{θ,i} is related to a specific backpropagation path, will be adhered to throughout, even though everything that follows also applies to optimization problems whose objective function is expressed as a sum, given the linearity of the derivative operator.
In addition to reducing memory consumption, this decomposition might help the optimization process by avoiding local minima. In gradient descent, a local minimum gives a zero gradient that might slow down or even stall the minimization of the objective function. However, the gradient being equal to zero does not mean that all the g_{θ,i} are zero; using one of them might help avoid this unwanted scenario. An example is given on a toy function. Example 1: Let us consider the function f_1 : R → R. This function has an infinite number of local minima, but its general trend follows x².
This function is chosen because it presents multiple local minima. The decomposition of the derivative of f_1 following the backpropagation paths of its LCG is given in Equation 3.
If one employs the true gradient of f_1 and applies gradient descent with standard optimizers, it will undoubtedly become trapped in a local minimum. However, if one opts to use only the first component of the gradient, the descent reaches the global minimum of f_1 at x = 0. This claim is supported by Figure 3.
This example underscores the usefulness of approaches based on code stochasticity. Although the impact is minimal on a function such as f_1, it is evident that computing only a fraction of the terms of the gradient sum reduces resource consumption, particularly memory.
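As an illustration of this behavior, here is a minimal sketch in Python. Since the exact definition of f_1 is not restated above, the sketch uses a hypothetical stand-in with the same qualitative shape (general trend x², infinitely many local minima) and plain gradient descent instead of Adam; the starting point x = 8 and learning rate 0.01 are illustrative choices, not the paper's settings.

```python
import math

def f(x):
    # Hypothetical stand-in for f_1: general trend x^2, infinitely many local minima.
    return x * x * (1.0 + math.sin(x) ** 2)

def path_gradients(x):
    # Two backpropagation paths because x is used twice (in x^2 and in sin(x)^2).
    g1 = 2.0 * x * (1.0 + math.sin(x) ** 2)   # path through the x^2 factor
    g2 = x * x * math.sin(2.0 * x)            # path through the sin(x)^2 factor
    return g1, g2

def descend(select, x0=8.0, lr=0.01, steps=2000):
    x = x0
    for _ in range(steps):
        x -= lr * select(*path_gradients(x))
    return x

x_full = descend(lambda g1, g2: g1 + g2)   # exact gradient
x_path = descend(lambda g1, g2: g1)        # single path contribution
print(f"full gradient -> x = {x_full:.3f}, f = {f(x_full):.3f}")
print(f"single path   -> x = {x_path:.3f}, f = {f(x_path):.3f}")
```

With these illustrative settings, the full gradient settles in a local minimum around x ≈ 6, whereas the first path contribution alone, whose sign always points toward the origin, drives the iterate to the global minimum at x = 0.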
The RAD approach uses a uniform distribution across all possible paths. However, it is suggested that not all gradient paths are equally important at any given time. Therefore, a non-uniform distribution with varying probabilities of drawing a g_{θ,i} is to be used during gradient descent. Let I_t ∼ (p^t_{θ,1}, ..., p^t_{θ,N}) denote the distribution from which the index of the g_{θ,i} used for the gradient descent step is drawn, defined over the T ∈ N epochs. For notational simplicity, θ is omitted, which gives:

I_t ∼ (p^t_1, ..., p^t_N)

Intuition says that, locally, there is an optimal probability distribution that would decrease the objective function f_θ faster, and there is no reason for this distribution to be uniform. Supporting this intuition, it is argued that certain g_{θ_t,i} may be negligible compared to others at a specific stage of the optimization process, i.e., at a specific iteration t. Drawing such a g_{θ_t,i} would have an almost negligible impact on minimizing the objective function. As a result, resources would be better used by computing the g_{θ_t,i} that significantly reduces the target function. However, the magnitude of the g_{θ_t,i} depends on the position of the parameter θ_t in the search space; therefore, the probability distribution should be updated along the iterations.

FIGURE 3. Minimization of f_1 through SGD with Adam [9] and its default values as optimizer. The starting point is x = 5. The blue curve represents the use of the full gradient from Equation 3, which gets trapped in a local minimum. In contrast, the red curve represents the random selection of gradient terms during the iterations, which allows local minima to be avoided and leads to a decrease of the target function. The green and black dashed curves correspond to the static selection of one component of the gradient, g_{x,1} and g_{x,2} respectively.
One consequence of such a non-uniform distribution over the g_{θ,i} is that the resulting gradient estimator may be biased. This is problematic, as many convergence guarantees [5] rely on the unbiasedness of the gradient estimator. To address this issue, two key points are presented. First, the probability distribution I_t varies during the iterations of the learning process. The similarity between a g_{θ,i} and the exact gradient is not constant over the search space of θ; therefore, the objective is to continuously update the probabilities associated with the terms of the gradient sum. Using the uniform distribution gives an unbiased estimator, which provides convergence guarantees, so a proper update rule will smooth the probability associated with a backpropagation path g_{θ,i} over the iterations, i.e., (1/T) Σ_{t=1}^{T} p^t_i will tend toward 1/N. In that case, the estimator becomes unbiased over the iterations.
Secondly, and of greater significance, a modification to the computed gradient is proposed to guarantee the unbiasedness of the new estimator, irrespective of the evolution of I t .
Definition 1 (Normalization trick): Let g_{I_t} be the stochastic gradient estimator relative to I_t ∼ (p^t_1, ..., p^t_N), defined as

g_{I_t} = g_{θ,I_t} / (N p^t_{I_t}).

The corrective term 1/(N p^t_{I_t}) is introduced in order to preserve the unbiasedness of the gradient estimator, which is necessary to rely on convergence guarantees [5]. g_{I_t} is unbiased as long as none of the p^t_i values are equal to zero:

E[g_{I_t}] = Σ_{i=1}^{N} p^t_i · g_{θ,i} / (N p^t_i) = (1/N) Σ_{i=1}^{N} g_{θ,i},

which is the same expectation as the one obtained with the uniform distribution. This normalization trick removes all possible issues regarding the unbiasedness of a non-uniform distribution over the backpropagation paths. Recall that the objective is to constrain memory usage and prevent overfitting without excessively lengthening training time. Consequently, seeking the optimal term g_{θ,i} of the gradient at every iteration is infeasible. Inspired by multi-armed bandits [19], a heuristic is introduced that balances the exploration of the best probability distribution with the use of the distribution established during exploration, also known as exploitation.
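The following minimal Python sketch illustrates the normalization trick under the corrective term 1/(N p_i) assumed above: one path index is drawn from a non-uniform distribution, the correction is applied, and averaging many draws empirically recovers the same expectation as a uniform draw. The names and toy values are illustrative only.

```python
import random

def normalized_path_estimator(path_grads, probs):
    """Draw a path index i ~ probs and return g_i / (N * p_i)  (normalization trick sketch)."""
    n = len(path_grads)
    i = random.choices(range(n), weights=probs, k=1)[0]
    return path_grads[i] / (n * probs[i])

# Toy per-path gradient contributions and a strongly non-uniform distribution.
grads = [3.0, -1.0, 0.5]
probs = [0.7, 0.2, 0.1]
draws = [normalized_path_estimator(grads, probs) for _ in range(100_000)]
print(sum(draws) / len(draws), sum(grads) / len(grads))  # both close to the mean contribution
```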

B. SELECTIVE PATH AUTOMATIC DIFFERENTIATION
The search for a good path is computationally demanding, as finding the exact best path implies evaluating all possible paths. An approximation is to draw and evaluate a subset of paths and to choose the best path within this subset. But even in this case, repeating the procedure at every iteration would be costly. Notice that if a particular g_{θ,i} has the highest contribution to the gradient magnitude at a specific point θ_t, then it will also have the highest contribution in the surrounding parameter space, as SGD is an iterative method. Hence, using this g_{θ,i} for a few iterations seems a reasonable approximation. This approximation is even more reasonable if one assumes that the differences between the g_{θ,i} values are independent of the batch being used; in other words, the more representative of the dataset the observation batch is, the better the approximation.
SPAD is a new gradient estimator dealing with this trade-off: choosing the best component at a given time and keeping it for the following iterations. The set of LCG paths is denoted P. m random paths of the backpropagation graph, denoted P_m ⊂ P, are sampled, and the induced gradients g_{θ,i} (with i ∈ [1..m] without loss of generality) restricted to these paths are computed. Among these m paths, and for the next k_max iterations, the one yielding the largest gradient, i_max, is given a probability close to one, while an ε > 0 fraction of exploration is kept for all the other paths (not restricted to the m sampled ones).
The Almost-Dirac notation D^ε_i(j) in Equation 5 below is introduced to conveniently represent SPAD:

D^ε_i(j) = 1 − (N − 1)ε if j = i, and D^ε_i(j) = ε otherwise.     (5)

For a given i ≤ N, D^ε_i can be used as a probability distribution over [1..N] since Σ_j D^ε_i(j) = 1. The probability distribution of SPAD described above is then formalized by taking I_t ∼ D^ε_{i_max} during the k_max iterations that follow the selection of i_max (Equation 6). SPAD pseudocode is presented in Algorithm 1; the method is particularly appealing as it avoids the need for a complete evaluation of the gradient, which is a resource-intensive process. Additionally, it does not require memory beyond storing the m random paths and their associated gradients. Notice that checkpointing, being an implementation technique, does not influence the gradient estimation itself, but only the manner in which it is obtained; consequently, all variations of checkpointing are compatible with SPAD. By choosing the largest gradient among the sampled paths, this approach has the potential to enhance the learning process, as the target loss is expected to decrease more rapidly than with a uniform selection of the path. This heuristic introduces two new parameters, namely m and k_max. There is a trade-off to be made, as increasing m may lead to a better gradient estimation but also slows down the learning process. With m and k_max both set to 1, SPAD reduces to RAD.
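As a rough illustration of the structure of Algorithm 1 (whose pseudocode is not reproduced here), the following Python sketch samples m candidate paths, keeps the one with the largest restricted-gradient norm, and reuses it for k_max iterations. The helpers sample_path, grad_on_path, and step are hypothetical stand-ins for the model-specific machinery, and the ε-exploration of the Almost-Dirac is omitted for brevity.

```python
import numpy as np

def spad_pick(sample_path, grad_on_path, m):
    """Sample m candidate paths and return the one with the largest restricted gradient.
    Only the current candidate and the running best are kept in memory."""
    best_path, best_norm = None, -1.0
    for _ in range(m):
        p = sample_path()
        norm = float(np.linalg.norm(grad_on_path(p)))
        if norm > best_norm:
            best_path, best_norm = p, norm
    return best_path

def spad_descent(theta, sample_path, grad_on_path, step, m=4, k_max=5, epochs=100):
    t = 0
    while t < epochs:
        # Exploration phase: rank m random paths at the current theta.
        best = spad_pick(sample_path, lambda p: grad_on_path(theta, p), m)
        # Exploitation phase: follow the selected path for the next k_max iterations.
        for _ in range(min(k_max, epochs - t)):
            theta = step(theta, grad_on_path(theta, best))
            t += 1
    return theta
```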
The parameter m represents the number of gradient paths sampled to determine the one with the largest contribution. A large value of m is desirable to ensure that the strongest contribution is well estimated. A maximum estimated from a sample always underestimates the true maximum, but this has little impact on the experiments thanks to the fairly large values of m used. The parameter k_max determines the number of consecutive iterations for which the chosen gradient path is used. A large value of k_max can be used if the chosen path is a good one. During the k_max iterations, the parameters corresponding to the unchosen paths are frozen; if k_max is set to a large value, the method becomes similar to the layer freezing presented in [6]. If k_max is large, a large value of m is preferable, to ensure that the chosen random path is carefully selected since it is kept for multiple iterations. Conversely, if k_max is small, a smaller m can be tolerated, since the path selection affects only a limited number of iterations.
The rationale behind using different probability distributions in SPAD is to find the optimal distribution that emphasizes the most important paths for gradient propagation at each iteration. Locally, certain paths may have a higher impact on minimizing the objective function, and prioritizing these paths can lead to faster convergence. By allowing alternative probability distributions on the backpropagation graph, SPAD adapts the distribution to the current stage of the learning process, which improves learning performance. Additionally, by selectively computing the most significant paths, SPAD reduces memory consumption compared to evaluating all paths.
SPAD is an intermediate solution between RAD, which does not try to determine the optimal choice of distribution, and optimization schemes that would need to duplicate the memory for the parameters, which would eliminate the benefits of SPAD. The next section demonstrates how SPAD can be implemented through code stochasticity based on automatic differentiation, irrespective of the shape of the model to optimize. This contrasts with the RAD implementation which, being based on matrix injections, is only compatible with neural networks.

C. FROM COMPILATION TO RANDOM PATHS: IMPLEMENTATION GENERALIZATION
The implementation of SPAD requires a computational method to obtain the terms of the gradient of Equation 1 written as a sum. This translates into identifying the backpropagation paths within the graph. Given the orientation of the LCG, identifying forward paths or backpropagation paths amounts to the same problem. In a general context, without making any assumption about the form of the LCG, an alternative method for executing this task is proposed for any language suited to automatic differentiation that satisfies the Static Single Assignment (SSA) and Single Access (SA) properties.

Definition 2 (SSA): Static Single Assignment (SSA) form is a property of a lower-level representation of a program that mandates each variable to be assigned exactly once, with every variable being defined before its use.
Definition 3 (SA): Single Access (SA) form is a property of a lower-level representation of a program which mandates that every variable is read at most once.
An example of such a programming language, crafted for automatic differentiation and satisfying both of these properties, can be found in [14]. Moreover, it is a simple operation to turn an SSA differentiable language such as [11] or [20] into an SSA-SA one.
Due to the SA property, the LCG of a program will contain tupling nodes, as highlighted in Example 2. They make it possible to construct programs that use a variable multiple times, by duplicating it. With the exception of these tupling nodes, only one edge exits each node, which is a strict translation of the SA property onto the LCG. Consequently, choosing a contribution to the gradient of Equation 1 amounts to following the path from a parameter node to the output node and selecting one of the edges emanating from each encountered tupling node.
Example 2: Let us consider the function f_2(x, y) = e^x × (x + y). To satisfy the Single Access (SA) property, since x is used twice in the program, its node is tupled, resulting in the LCG (and the corresponding program) depicted in Figure 4 (Figure 5, respectively). The tupling of variables in order to fulfill the SA property results in the gradient being expressed as a sum, as shown in Equation 7: duplicating x into x_1 (used in e^{x_1}) and x_2 (used in x_2 + y) gives

∂f_2/∂x = ∂f_2/∂x_1 + ∂f_2/∂x_2 = e^{x_1}(x_2 + y) + e^{x_1}.     (7)

This is a key aspect of reverse mode automatic differentiation, also known as backpropagation. More generally, letting x be a parameter of f tupled into N variables {x_i}_{i=1..N}, Equation 1 turns into

∂f/∂x = Σ_{i=1}^{N} ∂f/∂x_i.

The chain rule of differentiation applied in reverse yields x̄ = ∂f/∂x, called the adjoint of x, which is the desired output. Figure 6 highlights how the SSA-SA property directly gives the gradient as a sum. As previously framed, SPAD can be conceptualized as a form of dropout during backpropagation. By implementing SPAD independently of any specific model architecture, a generalized form of dropout can be introduced for a wider range of machine learning models. While dropout is a viable technique for deep learning models comprising numerous parameters without distinct significance for each individual one, it may not be suitable for smaller models.
The two approaches for obtaining the gradient expressed as a sum, matrix injection or differentiation of SSA-SA languages, both rely on the multiple uses of the parameters in the model implementation. If there is one and only one path from the parameter to the output node of the LCG, SPAD is pointless. Fortunately, this does not happen in many cases, as presented in the experiments of Section III.
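As a concrete illustration of Example 2, the following minimal Python sketch duplicates x into x1 and x2 as the SSA-SA transformation would, computes the adjoint contribution of each duplicate in reverse mode, and checks that their sum equals the analytic derivative of f_2 with respect to x.

```python
import math

def f2(x, y):
    return math.exp(x) * (x + y)

def f2_path_contributions(x, y):
    """SSA-SA view: x is duplicated into x1 (used in exp) and x2 (used in the sum),
    giving one adjoint contribution per duplicate."""
    x1, x2 = x, x          # tupling node duplicating x
    a = math.exp(x1)       # a = e^{x1}
    b = x2 + y             # b = x2 + y
    # z = a * b; reverse-mode adjoints of the two duplicates:
    g_x1 = b * a           # dz/dx1 = b * e^{x1}
    g_x2 = a               # dz/dx2 = a
    return g_x1, g_x2

x, y = 0.3, 1.2
g1, g2 = f2_path_contributions(x, y)
full = math.exp(x) * (x + y) + math.exp(x)   # analytic d f2 / d x
assert abs((g1 + g2) - full) < 1e-12
print(g1, g2, full)
```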

III. EXPERIMENTS
Experiments were conducted on two different types of tasks. Firstly, the novel gradient estimator was applied to a set of functions commonly used for evaluating optimization algorithms. These functions are not well suited to gradient descent because of their numerous local minima; however, SPAD may overcome this issue by following gradient estimations rather than relying on the exact gradient. Evaluating SPAD on these functions provides further validation of its usefulness beyond the domain of neural networks. For these functions, the emphasis lies on the terminal point of the optimization and on whether it is close to the globally optimal solution, which is known beforehand. Secondly, the estimator was tested on the MNIST and CIFAR10 datasets using standard deep architectures in order to compare the method to existing ones. To assess the different approaches, the considerations are the accuracy achieved on the test data as well as the maximum memory utilization observed during training. These experiments vary significantly in several respects. Firstly, the data varies greatly, as the first search space is 2-dimensional, while the dense architecture used for MNIST, presented in IV-D, has over 410k parameters. Additionally, the minimum of the optimization functions is known, which is not the case for neural networks. The diversity of these tasks provides a deeper understanding of the implications of the proposed method.
Remember that the theoretical probability distribution given by SPAD is an Almost-Dirac on the largest gradient contribution. In practice, the exact estimator g_I is not used. Instead, the largest gradient contribution is selected for k_max iterations, resulting in the simplified estimator g_{θ,I}. This removes the need for a random draw at each iteration to choose the backpropagation path. Consequently, a proper implementation of this version requires the storage of only two random paths, since the goal is not to sort the gradient norms but rather to find the arg max; therefore, the memory usage is independent of the value of m. This alternative version of SPAD is depicted in Algorithm 2.

A. OPTIMIZATION FUNCTIONS
The performance of the methods is evaluated on four optimization functions, employing the ε-success rate from Definition 4. This metric quantifies the ratio of optimizations, over varying initializations, that terminate at a value within ε of the known global minimum of the specified function.
Definition 4 (ε-success): X_T ∈ Z is an ε-success for the minimization of f if and only if f(X_T) − min_{x ∈ Z} f(x) ≤ ε.

Experiments are conducted using three different setups. The baseline method, SGD with the classical full gradient estimator, is compared against RAD and SPAD. The functions used for evaluation, described in IV-B, have a confirmed minimum, thus enabling the definition of the ε-success. These functions, which cannot be represented as neural networks, are run on Envision, Lokad's domain-specific language, where the random paths are extracted from the differentiation of the SSA-SA form of the language. Given that dropping out one of the few parameters of these functions is deemed insignificant, 1000 experiments are performed for each configuration using Adam [9] as the optimizer with its default values (learning rate of 0.01, β_1 = 0.9, β_2 = 0.999) over T = 2000 epochs. Tests were carried out for k_max = 5 and k_max = 50, and the ε-success rate is reported in Table 1 (Table 2, respectively) for ε = 0.05 (ε = 0.01, respectively). Multiple values of m were not tested, as this parameter depends on the function in use: a single branch from each tupling node of the backpropagation graph execution was selected. For instance, for the function f_2 from Example 2, where the input x is used twice, m is set to 2. Furthermore, owing to the nature of the minimization objective, the construction of a validation set is infeasible.

TABLE 1. ε-success table with ε = 0.05. In bold, the method with the higher ε-success rate for the corresponding function.

TABLE 2. ε-success table with ε = 0.01. In bold, the method with the higher ε-success rate for the corresponding function. All the ε-success rates are lower than in Table 1, as ε is smaller.
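For concreteness, a minimal sketch of the ε-success rate computation from Definition 4 (illustrative Python; the run values are hypothetical):

```python
def epsilon_success_rate(final_values, f_min, eps):
    """Fraction of runs whose final objective value lies within eps of the known global minimum."""
    return sum(1 for v in final_values if v - f_min <= eps) / len(final_values)

# Hypothetical final objective values from runs with different initializations.
runs = [0.01, 0.40, 0.03, 2.10, 0.00]
print(epsilon_success_rate(runs, f_min=0.0, eps=0.05))  # -> 0.6
```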
Although the methods are not particularly suited to these functions, better results are observed with SPAD when the gradient expression lends itself well to the form of a sum, as with the Beale function. In more challenging cases, such as the Levi function, the baseline never manages to find a minimum, whereas SPAD allows the minimum to be found, albeit in a limited number of cases.
The ϵ-success rate for varying values of k max on the Beale and Levi functions is presented in Figure 7. On these examples the proposed method SPAD (with the appropriate k max ) outperforms the baseline and RAD, which is very promising.
It also demonstrates that there is no universally optimal value of k_max: the performance seems to increase with k_max on the Beale function but decreases on the Levi function.
The choices made in these experiments are motivated by two observations. Firstly, the functions in this section are chosen because they use their input variables several times, which, as highlighted in Section II-C, is necessary for SPAD to be applicable. Secondly, although SPAD is promoted as a way to reduce overfitting, this concept is not relevant for optimization problems in which the goal is simply to find the parameters that optimize the objective function, without considering factors such as the model's generalization capabilities.

B. DEEP LEARNING
Experiments were conducted on the MNIST and CIFAR10 datasets, comparing SPAD with the standard stochastic gradient estimator (baseline), RAD, and the dropout technique. The experimental framework described in [12] was employed; it does not involve early stopping, in order to avoid increased memory requirements. However, this framework may result in overfitting, which is precisely what SPAD is intended to mitigate. This is why dropout runs were also made, so as to compare with a method that is explicitly designed to mitigate overfitting.
The objective of our approach is to preserve learning quality while reducing the memory peak in comparison to traditional SGD.
Due to the substantial number of parameters in the networks used, drawing a single path in the backpropagation graph would lead to negligible updates. Instead, considering the fraction of the network covered by the drawn paths, experiments were conducted in which 10% of the network was updated at each iteration. To elaborate, m sets of random paths were drawn, with each set covering 10% of the network, whereas the theoretical implementation of SPAD generates m random paths, each covering only a 1/N fraction of the model. The same proportion (i.e., 10%) was applied during the dropout runs. Our practical implementation of SPAD is described in Appendix IV-C.
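A minimal sketch of this practical variant is given below (numpy; the function names and the 10% fraction are illustrative). Note that a real implementation, as described in Appendix IV-C, would restrict backpropagation to the selected sub-graph so that the memory saving is actually realized; the sketch only shows which parameters receive an update.

```python
import numpy as np

def masked_spad_update(params, grad_fn, lr, fraction=0.10, rng=np.random.default_rng(0)):
    """Update only a random fraction of the parameters; the rest stay frozen this iteration."""
    n = params.size
    active = rng.choice(n, size=max(1, int(fraction * n)), replace=False)
    grad = grad_fn(params)                 # sketch only: the full gradient is computed here,
    params[active] -= lr * grad[active]    # whereas SPAD restricts backpropagation to 'active'
    return params
```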
The two metrics to be optimized, namely the final accuracy on the testing dataset and the peak memory (in percentage) required during training, are depicted in Figure 8. The objective is to obtain the highest accuracy on the test set with the lowest memory consumption, i.e., to end up in the green zone.
On these examples, the various variants of SPAD are competitive with the baseline and RAD, and they achieve strictly superior results on CIFAR10. In this context, searching for hyperparameters unrelated to SPAD is irrelevant; this explains why a validation set is dispensable and why assessing the methods on the test dataset alone suffices. Note that all the runs share the same neural architectures, namely a fully connected network on MNIST and a convolutional one on CIFAR10. More details are given in IV-D.
Concerning overfitting, detailed results on the CIFAR10 dataset are presented in Figure 9, while more details on the MNIST dataset are given in the appendices. They tend to confirm that our method effectively reduces overfitting compared to the baseline. While the training loss of the baseline quickly decreases during the first iterations, its test loss quickly increases. On the contrary, SPAD decreases its training loss slowly, and its test loss increases slowly compared to the baseline. This observation highlights the similarities between the process of randomly drawing paths during backpropagation and the dropout technique: turning off a fraction of the network, on the forward pass for dropout and on the backpropagation for SPAD, tends to reduce overfitting.

FIGURE 7. ε-success as a function of k_max. On these graphs, the higher the better. Both experiments show an impact of k_max on the ε-success of the gradient descent. In Figure 7a, on the Beale function, a larger k_max improves the optimization, whereas the opposite holds in Figure 7b on the Levi function. In both cases, the best results are obtained with a version of SPAD that outperforms the baseline and RAD.

FIGURE 8. Accuracy on test versus memory peak trade-off. The displayed memory is a fraction of the largest memory peak of the baseline; the same reference is used for every run. The objective is to achieve the highest accuracy while minimizing memory. A run is considered strictly better than another if it reaches higher accuracy with less memory; otherwise the two runs cannot be ranked. The superior results are located in the upper left quadrant of the graph, indicated by the green color. With regard to the MNIST dataset, as shown in Figure 8a, none of the methods outperform the baseline, although the differences are minimal, as every model achieves over 97% accuracy. The least accurate results occur when k_max = 1000. This outcome is reasonable, since the selected paths may be used for an excessive number of iterations and might lose relevance at a specific stage. On Figure 8b, which concerns the CIFAR10 dataset, some versions of SPAD, such as (k_max = m = 10), are strictly better than the baseline.
The dropout runs attain the highest test accuracy, at the cost of significant memory consumption. This approach effectively mitigates overfitting, as the test loss increases at a much slower rate than with the other heuristics in Figures 11 and 12 of the appendices. Incorporating random matrix injections could prove highly beneficial for such learning techniques.
The primary objective of SPAD is to minimize memory consumption. From the perspective of a fixed memory budget, employing SPAD frees resources that can be reallocated, for instance to increase the batch size. This hypothesis was evaluated by employing the SPAD method with a batch size twice as large as in the other experiments, denoted big batch in the legend of Figure 8b. While this approach led to increased memory consumption, it also resulted in heightened overfitting, exhibiting behavior akin to the baseline on both the training and test datasets. This observation is consistent with the findings of [8], which assert that large-batch training methods are more prone to overfitting than the same network trained with smaller batch sizes.

IV. CONCLUSION AND PERSPECTIVES
From the perspective of deep learning, SPAD can be regarded as a combination of dropout and layer freezing within a neural network. By drawing backpropagation paths, SPAD offers a technique similar to dropout for any gradient-based model. It is based on reverse mode automatic differentiation of SSA-SA languages.
Currently, the most significant limitation is the lack of an efficient implementation of the method for reducing memory consumption in deep learning applications. Further engineering efforts are needed to enable an efficient implementation in real-world environments and to maximize the benefits in terms of memory reduction for embedded artificial intelligence models. Another limitation is the lack of an appropriate heuristic or algorithm to determine the optimal parameter values for SPAD; finding parameter values that strike a balance between memory reduction and learning performance remains a challenge. Additionally, the effectiveness of SPAD may vary depending on the specific model architectures and datasets used, so its generalizability across different scenarios still needs to be investigated.
Overall, the main idea is to draw more frequently the examples that have a bigger impact on the loss minimization. Concerning this code stochasticity, the results show the advantages of a non-uniform probability distribution. This is aligned with multiple works [3], [10] that use non-uniform distributions over the observations and outperform the uniform one.
Table 3 summarizes the construction of gradient stochasticity based on the chosen source of stochasticity. The sampling process can be conducted at the observation level or at the code level, with a uniform or non-uniform distribution. An interesting direction for future work concerns non-uniform distributions over both the observations and the code, which could hopefully yield better learning results without increasing the training memory needs. Such a direction would ease parameter updates in embedded artificial intelligence, opening many industrial applications, such as embedded machine learning on devices with constrained computational resources.

APPENDIX A
SPAD IMPLEMENTATION
A. SPAD IN PRACTICE
See Algorithm 2.

B. OTHER HEURISTICS
We have tested other probability distribution constructions over the g_{θ,i}.

Data augmentation was not used, but the images were centered.
For each experiment reported in the main text, we tuned the initial learning rate and weight decay parameters for the feedforward networks.We generated 20 pairs of weight decay and learning rate values from specific distribution ranges.
To ensure consistent results, each experiment was trained five times using separate bootstrapped resamplings of the full training dataset (50,000 images for CIFAR-10 and 60,000 for MNIST). The models were evaluated on the test dataset (10,000 images for both). The repetition of these experiments was used to create the memory versus test accuracy plots.
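A minimal sketch of such a bootstrapped resampling (numpy; the seeds are illustrative):

```python
import numpy as np

def bootstrap_indices(n_train, seed):
    """Draw a bootstrapped resample of the full training set (sampling with replacement)."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, n_train, size=n_train)

# e.g. five resamplings of the 50,000 CIFAR-10 training images
resamples = [bootstrap_indices(50_000, seed) for seed in range(5)]
```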
On the MNIST dataset, the dropout backpropagation techniques perform slightly worse than the baseline and dropout. We do not observe any overfitting in this task, as shown in Figure 11, where the test accuracy does not decrease over the iterations even though this data is never used for updating the parameters. In contrast, on the CIFAR10 dataset (Figure 12), we observe that while the training accuracies consistently increase, the test accuracies tend to decrease at some point for many techniques. This is especially true for the baseline, but not for the dropout technique. Dropout backpropagation techniques help mitigate overfitting, as notably highlighted by the evolution of the test loss. The different versions of SPAD provide an interesting range between the baseline and dropout performance.

FIGURE 1. Example of the LCG of an objective function f_θ from the parameter θ to the output node z.

FIGURE 5. SSA-SA version of the program relative to f_2.

FIGURE 6. (Top) Generic non-SA LCG, with x being used three different times. The dashed lines represent backpropagation. g, h and k are arbitrary differentiable functions. (Left) SSA version of the derivative program. (Right) SSA-SA version of the derivative program.

FIGURE 9. Learning curves of the SmallConvNet on CIFAR10. The baseline model exhibits rapid performance improvement on the training dataset, while its performance on the testing dataset deteriorates just as quickly. This behavior is characteristic of overfitting, whereas the various versions of SPAD effectively mitigate this undesired decrease in generalization. The upper right plot suggests a continuum ranging from mild overfitting (dropout) to severe overfitting (baseline and SPAD with large batches).
s_t = arg min_{p ∈ P_m} ‖∇f_θ − g_{θ,p}‖

FIGURE 10. ε-success as a function of k_max; see Figure 8 for more context.

FIGURE 11. Full train/test curves and memory consumption per iteration on MNIST.

FIGURE 12. Full curves and memory consumption per iteration on CIFAR10.

TABLE 3. Small review of the stochasticity origin of gradient estimators.