Mathematical Models of Overparameterized Neural Networks

Deep learning has achieved considerable empirical success in recent years. However, while practitioners have discovered many ad hoc tricks, until recently there has been little theoretical understanding of the tricks invented in the deep learning literature. Practitioners have long known that overparameterized neural networks (NNs) are easy to learn, and in the past few years there have been important theoretical developments in the analysis of overparameterized NNs. In particular, it was shown that such systems behave like convex systems under various restricted settings, such as for two-layer NNs, and when learning is restricted locally to the so-called neural tangent kernel space around specialized initializations. This article discusses some of these recent advances, which have led to a significantly better understanding of NNs. We focus on the analysis of two-layer NNs and explain the key mathematical models, along with their algorithmic implications. We then discuss challenges in understanding deep NNs and some current research directions.


I. INTRODUCTION
Neural networks (NNs) are computational models composed of one or more feature representation layers and a final linear learner. In recent years, deep NNs have substantially improved state-of-the-art performance in numerous real applications, such as image classification [1], [2], speech recognition [3], and natural language processing [4]. However, theoretical understanding of these empirical successes is still limited. One main conceptual difficulty is the high non-convexity of these models, which means that first-order algorithms such as gradient descent (GD) or stochastic gradient descent (SGD) may converge to bad local stationary points.
However, it is observed in practice that with the help of a number of tricks such as dropout [5] and batch normalization [6], deep neural networks (DNNs) can be reliably trained from random initialization with reproducible results. The solutions obtained by proper training procedures behave well and consistently. In other words, two different random initializations (using the same initialization and training strategy) generally lead to models that give similar predictions on test data. Thus, we may conclude that proper neural network training leads to similar solutions. This behavior resembles that of convex optimization, rather than that of generic non-convex optimization problems, which tend to get stuck in suboptimal local stationary solutions. Because solutions from different random initializations are similar and reproducible, it can also be conjectured that with proper training, deep neural networks can reach solutions that are near globally optimal. These empirical observations appear rather mysterious, and understanding them requires the development of new mathematical models for neural networks that can bridge the gap between non-convex and convex models.
In addition to the above empirical observations, it is also known by practitioners that overparameterized neural networks with many hidden units are easy to learn [7]. They achieve better and more consistent performance. Related to this empirical observation, it was noticed in the 1990s that neural networks with infinitely many hidden units are easier to model and analyze theoretically [8], [9]. In the past few years, there have been many significant theoretical developments in the analysis of overparameterized NNs in which the number of hidden units approaches infinity. In particular, it was shown that such systems behave like convex systems under various restricted settings. This provides a theoretical justification for the empirically observed reproducibility of neural network training.
In this paper, we review some recently developed mathematical models for overparameterized NNs, with a focus on the neural tangent kernel (NTK) view and the mean field (MF) view. Section II introduces the basic formulation for two-layer NNs, and Section III introduces a closely related learning model, random kitchen sinks [10]. In Section IV, we examine the NTK view for two-layer NNs, which shows that a two-layer NN can be written equivalently as a linear model in the tangent space under some specialized conditions. Section V considers the MF view for two-layer NNs. In this view, a continuous two-layer NN is regarded as a learned distribution over the weights, which leads to a more realistic mathematical model for analyzing the practical behavior of NNs. In Section VI, we compare the three models from the feature learning perspective. Section VII considers possible extensions of NTK and MF to deep NNs. In Section VIII, we introduce some basic complexity results for NTK. Section IX reviews some other mathematical models. Finally, we conclude the paper and outline active research directions in Section X.

II. TWO-LAYER NEURAL NETWORKS
Two-layer NNs have a history dating back to the 1940s [11]. A discrete two-layer neural network can be viewed as a $k$-dimensional vector-valued function of a $d$-dimensional input vector $x$, which has the following form:

$$f(u, \theta; x) = \frac{\alpha}{m} \sum_{j=1}^{m} u_j \, h(\theta_j, x), \qquad (1)$$

where $x \in \mathbb{R}^d$, $\theta_j \in \mathbb{R}^d$, and $u_j \in \mathbb{R}^k$. The model parameters are $\{[u_j, \theta_j] : j = 1, \ldots, m\}$, which will be learned via training. Here $\alpha > 0$ is a real-valued scaling parameter that is not learned. It is included to distinguish between two different regimes of overparameterized neural networks. In this system, there are $m$ hidden units (or neurons), and each hidden unit corresponds to a function $h(\theta_j, x)$ of the input $x$. It maps the original input feature $x$ to a new feature $h(\theta_j, x)$, with a parameter $\theta_j$ that is learned. The function $h(\theta, x)$ is real-valued. In applications, it often takes the form $h(\theta, x) = h_0(\theta^\top x)$, where $h_0(\cdot)$ is called an activation function. Standard choices include the rectified linear unit (ReLU) $h_0(z) = \max(0, z)$ and the sigmoid $h_0(z) = e^z/(1 + e^z)$.

arXiv:2012.13982v1 [cs.LG] 27 Dec 2020. © 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
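In order to learn the NN parameters, we consider the following optimization problem:

$$\min_{u, \theta} \; \phi(u, \theta) = \frac{1}{n} \sum_{i=1}^{n} L(f(u, \theta; x_i), y_i) + R(u, \theta). \qquad (2)$$

Here $\{(x_i, y_i)\}_{i=1}^{n}$ are the training data, $L(v, y)$ is the loss function, and $R(u, \theta)$ is a regularizer. In this paper, we consider the situation in which the regularizer $R(u, \theta)$ is convex in $(u, \theta)$ and the loss function $L(v, y)$ is convex in $v$.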
In general, we consider random initialization, and specifically random Gaussians:

$$u_j \sim N(0, \sigma^2), \qquad \theta_j \sim N(0, \sigma^2),$$

with different scalings of $\alpha$. Training is often performed via SGD, where we randomly pick a training datum $i$ (or a mini-batch of data points) and update the parameters as:

$$[u_j, \theta_j] \leftarrow [u_j, \theta_j] - \eta \, \nabla_{[u_j, \theta_j]} \big[ L(f(u, \theta; x_i), y_i) + R(u, \theta) \big].$$

The parameter $\eta$ is referred to as the learning rate. It is known in optimization that if we let $\eta \to 0$ at an appropriate rate, then the procedure converges to a point $(u_\infty, \theta_\infty)$ such that $\nabla \phi(u_\infty, \theta_\infty) = 0$. If the loss function $L(v, y)$ is convex in $v$, then $\phi(u, \theta)$ is convex in $u$ but not convex in $\theta$. Therefore, in general, $\nabla \phi(u_\infty, \theta_\infty) = 0$ does not imply that $(u_\infty, \theta_\infty)$ is a globally optimal solution. Mathematical models have been developed in recent years to explain the empirical observation that $(u_\infty, \theta_\infty)$ behaves like a solution that is close to globally optimal, especially when $m$ is large.
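The model and the SGD update described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the model is taken to be $f(u, \theta; x) = (\alpha/m) \sum_j u_j h_0(\theta_j^\top x)$ with ReLU activation, the loss is the squared loss, and no regularizer is used. The function names (`forward`, `sgd_step`, `run_demo`), the synthetic data, the choice $\alpha = \sqrt{m}$, and all hyperparameters are hypothetical choices made for this sketch, not taken from the paper.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(u, theta, X, alpha):
    """Assumed model: f(u, theta; x) = (alpha / m) * sum_j u_j * relu(theta_j^T x).

    u: (m, k) top-layer weights, theta: (m, d) hidden weights, X: (n, d) inputs.
    Returns an (n, k) array of predictions.
    """
    m = u.shape[0]
    return (alpha / m) * relu(X @ theta.T) @ u

def sgd_step(u, theta, x_i, y_i, alpha, eta):
    """One SGD step on the squared loss 0.5 * ||f - y||^2 for one example."""
    m = u.shape[0]
    pre = theta @ x_i                                   # (m,) pre-activations
    hidden = relu(pre)                                  # (m,) features h(theta_j, x_i)
    residual = (alpha / m) * hidden @ u - y_i           # (k,) gradient of loss w.r.t. f
    grad_u = (alpha / m) * np.outer(hidden, residual)   # (m, k)
    # The ReLU derivative is the indicator 1[pre > 0].
    coeff = (u @ residual) * (pre > 0)                  # (m,)
    grad_theta = (alpha / m) * coeff[:, None] * x_i[None, :]   # (m, d)
    return u - eta * grad_u, theta - eta * grad_theta

def mean_loss(u, theta, X, Y, alpha):
    return 0.5 * np.mean(np.sum((forward(u, theta, X, alpha) - Y) ** 2, axis=1))

def run_demo(m=200, n=50, d=5, eta=0.1, epochs=30, seed=0):
    """Train on a small synthetic regression task; return (loss before, loss after)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)       # unit-norm inputs
    Y = np.sin(3.0 * X[:, :1])                          # (n, 1) synthetic targets
    alpha = np.sqrt(m)                                  # assumed scaling for this demo
    u = rng.normal(size=(m, 1))                         # Gaussian initialization
    theta = rng.normal(size=(m, d))
    before = mean_loss(u, theta, X, Y, alpha)
    for _ in range(epochs):
        for i in rng.permutation(n):
            u, theta = sgd_step(u, theta, X[i], Y[i], alpha, eta)
    return before, mean_loss(u, theta, X, Y, alpha)

before, after = run_demo()
print(f"mean loss before: {before:.4f}, after: {after:.4f}")
```

The gradient with respect to $u_j$ is linear in the features, while the gradient with respect to $\theta_j$ passes through the activation derivative, which is where the non-convexity in $\theta$ enters.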

III. RANDOM FEATURES
Before developing mathematical models for two-layer NNs, we first consider a closely related machine learning model that employs random features. We note that in a two-layer NN, the hidden units are feature functions $h(\theta_j, x)$ that contain parameters $\theta_j$ to be learned during neural network training. The random feature approach (denoted by RF in this paper) is also referred to as random kitchen sinks [10], [12], [13].
In RF, we still consider the function (1), but assume that each $\theta_j$ is fixed at a value $\tilde\theta_j$ generated from a random distribution, typically a Gaussian distribution:

$$\tilde\theta_j \sim N(0, \sigma^2),$$

and is not learned during training. Only the parameters $\{u_j : j = 1, \ldots, m\}$ are learned. In this case, we may take any fixed scaling $\alpha$, such as $\alpha = 1$, because the scaling is not important for the random feature approach.
In RF, the model $f(\cdot\,; x)$ becomes linear with respect to the model parameters $\{u_j\}$. Therefore, if $L(v, y)$ is convex in $v$, then the objective function (2) is convex, and thus the convergence of SGD is easy to analyze.
Since the parameters $\{\tilde\theta_j\}$ do not change during training, we apply SGD to learn only $\{u_j\}$ in the training process that optimizes (2):

$$u_j \leftarrow u_j - \eta \, \nabla_{u_j} \big[ L(f(u, \tilde\theta; x_i), y_i) + R(u, \tilde\theta) \big].$$

We are interested in the situation in which the number of hidden units $m \to \infty$. It can be shown that in this case, the function learned by the random feature method converges to a well-defined limit [10]. To understand its behavior, it is useful to consider the kernel view for RF.
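Because the RF objective is convex in $u$, for some losses the minimizer can be written down directly. The sketch below assumes a scalar output, the squared loss, an $L_2$ regularizer $(\lambda/2)\|u\|^2$, ReLU features, and $\alpha = 1$. These choices, the value of $\lambda$, the synthetic data, and the helper names (`random_features`, `rf_solve`, and so on) are illustrative assumptions rather than the paper's setup. It solves the resulting ridge problem in closed form and checks that the gradient vanishes there.

```python
import numpy as np

def random_features(X, theta_fixed, alpha=1.0):
    """Feature matrix with entries (alpha / m) * relu(theta_j^T x_i); theta is not trained."""
    m = theta_fixed.shape[0]
    return (alpha / m) * np.maximum(X @ theta_fixed.T, 0.0)    # (n, m)

def rf_objective(u, Phi, y, lam):
    """Assumed objective: (1 / 2n) * ||Phi u - y||^2 + (lam / 2) * ||u||^2."""
    resid = Phi @ u - y
    return 0.5 * np.mean(resid ** 2) + 0.5 * lam * np.sum(u ** 2)

def rf_gradient(u, Phi, y, lam):
    return Phi.T @ (Phi @ u - y) / len(y) + lam * u

def rf_solve(Phi, y, lam):
    """Closed-form minimizer of the convex objective (ridge regression in u)."""
    n, m = Phi.shape
    return np.linalg.solve(Phi.T @ Phi / n + lam * np.eye(m), Phi.T @ y / n)

rng = np.random.default_rng(0)
n, d, m, lam = 100, 5, 200, 1e-6
X = rng.normal(size=(n, d))
y = np.sin(X[:, 0])                       # synthetic targets
theta_fixed = rng.normal(size=(m, d))     # drawn once from N(0, 1), then frozen
Phi = random_features(X, theta_fixed)

u_star = rf_solve(Phi, y, lam)
print("gradient norm at solution:", np.linalg.norm(rf_gradient(u_star, Phi, y, lam)))
print("objective at solution:", rf_objective(u_star, Phi, y, lam))
```

With a general convex loss there is no closed form, but the same convexity is what makes SGD on $u$ easy to analyze.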
Note that if we let then the random feature method corresponds to the linear model We consider the $L_2$ regularization where $\|\cdot\|_F$ is the Frobenius norm of a matrix. Then with fixed $\theta_j$, the objective function $\phi(u, \theta)$ of (2) becomes which is convex in $u$.
In order to obtain the kernel representation, we note that the first-order optimality condition of (4) at the optimal solution can be written as: where and This leads to the kernel representation [14], where the kernel is defined as the inner product of the feature vectors $h(x)$ and $h(x')$ for two input variables $x \in \mathbb{R}^d$ and $x' \in \mathbb{R}^d$: Consider a function represented using this kernel with parameters Then we can see from equation (5) that That is, the original linear function has a kernel representation. Moreover, the regularizer also has a kernel representation: Using the kernel representation, the solution of RF, which minimizes the objective function (4) over $u$, is equivalent to the solution of the following kernel optimization problem: Letting $\beta$ be the solution of (6), we may obtain the solution of (4) using the relationship between $\beta$ and $u$ in (5).
The kernel formulation (6) is particularly useful for analyzing the behavior of overparameterized RF in the limiting case of $m \to \infty$. This is because when $m \to \infty$, we have which is a well-defined kernel, where $\rho_0(\tilde\theta)$ is the random distribution of $\tilde\theta$, such as the Gaussian distribution $N(0, \sigma^2)$ in our case. It follows that as $m \to \infty$, the kernel function Moreover the corresponding optimization problem of (6) becomes which is also well-defined. Note that in the kernel formulation, the number of model parameters $\{\beta_i\}$ is $n$, which remains finite when $m \to \infty$. Therefore the limit of $\{\beta_i\}$ is well-defined.
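The convergence of the random feature kernel as $m \to \infty$ can be checked numerically. The sketch below makes two assumptions that are not taken from the text: the finite-$m$ kernel is normalized as the average $\frac{1}{m}\sum_j h(\tilde\theta_j, x) h(\tilde\theta_j, x')$, and the activation is ReLU. For ReLU features with $\tilde\theta \sim N(0, \sigma^2 I)$, the expectation has a known closed form (the degree-1 arc-cosine kernel), $\frac{\sigma^2 \|x\| \|x'\|}{2\pi}\,(\sin\varphi + (\pi - \varphi)\cos\varphi)$, where $\varphi$ is the angle between $x$ and $x'$. The function names are hypothetical.

```python
import numpy as np

def empirical_kernel(x, xp, m, sigma, rng):
    """Average of relu(theta^T x) * relu(theta^T x') over m random draws of theta."""
    theta = rng.normal(0.0, sigma, size=(m, x.shape[0]))
    return np.mean(np.maximum(theta @ x, 0.0) * np.maximum(theta @ xp, 0.0))

def limit_kernel(x, xp, sigma):
    """Closed-form expectation for ReLU features with theta ~ N(0, sigma^2 I)."""
    nx, nxp = np.linalg.norm(x), np.linalg.norm(xp)
    cos_phi = np.clip(x @ xp / (nx * nxp), -1.0, 1.0)
    phi = np.arccos(cos_phi)
    return sigma ** 2 * nx * nxp * (np.sin(phi) + (np.pi - phi) * cos_phi) / (2 * np.pi)

rng = np.random.default_rng(0)
x = np.array([1.0, 0.0, 0.0])
xp = np.array([0.6, 0.8, 0.0])
target = limit_kernel(x, xp, sigma=1.0)
for m in (100, 10_000, 200_000):
    approx = empirical_kernel(x, xp, m, 1.0, rng)
    print(f"m = {m:>7d}: empirical = {approx:.4f}, limit = {target:.4f}")
```

The Monte Carlo error shrinks at the usual $O(1/\sqrt{m})$ rate, which is the sense in which the overparameterized RF kernel is well-defined in the limit.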
If we consider the original random feature function (3) in the case of $m \to \infty$, the number of parameters $\{u_j\}$ will also approach infinity. In this case, we may consider $u$ as a function of $\tilde\theta$, and write (3) as in the limit of $m \to \infty$, and write the two-norm regularizer as With this notation for $m = \infty$, we have the relationship where $\beta$ is the solution of the kernel formulation (7).
It is known that RF works well for certain problems [10], [12]. However, for many real-world applications such as image classification, RF is inferior to NNs, which learn better feature representations than random ones. We will discuss the feature learning perspective in Section VI.
One interesting extension of the RF theory that can be used to analyze the behavior of NNs is presented in [15] and [16]. A kernel called the conjugate kernel was introduced, and it was shown that the gradient descent (GD) process for NNs under some special initializations stays in this kernel space. Moreover, any function in this kernel space can be approximated by changing the weights of the last layer. Building on these findings, the authors proved the global convergence of GD for NNs. However, in their analysis, only the GD updates on the last layer contribute to the global convergence, and weight updates in the lower layers are ignored. In the next section, we introduce the neural tangent kernel view, which develops theoretical results showing that weight updates in the bottom layer of two-layer NNs can also contribute to the convergence of GD.

IV. NEURAL TANGENT KERNELS
In practice, the random feature approach is often inferior to two-layer neural networks because the parameters $\{\theta_j\}$ are not trained. As we have seen, the overparameterized case with $m \to \infty$ corresponds to kernel learning with a well-defined kernel. One natural question is whether this point of view can be generalized to handle two-layer neural networks, where the parameters $\{\theta_j\}$ are trained together with $\{u_j\}$. Such a generalization leads to neural tangent kernels (NTK) [17], which we describe in this section. We note that the connection between infinitely wide NNs and kernel methods (Gaussian processes) was already known in the 1990s [8], [9]. However, the more rigorous theory of NTK has only appeared very recently, e.g., [17], [18].
In NTK, we consider a specialized scaling and random initialization of parameters. The special scaling makes it possible to consider NN parameters in a small region around the initial value when $m \to \infty$. The resulting NN, with parameters restricted to this region, can be well approximated by a linear model fitted with random features. As with RF in Section III, this linearization induces a kernel in the tangent space around the initialization, which is called the neural tangent kernel [17], since the weights stay near their initial values during the training of the neural network. This phenomenon is also referred to as the "lazy training" regime in [19]. In this regime, the system becomes linear, and the dynamics of GD (or SGD) within this region can be tracked via properties of the associated NTK.
To derive NTK under the assumption of $m \to \infty$, we consider the case in which $h(\theta, x)$ is differentiable with respect to $\theta$, as with the sigmoid or tanh activation function. Note that the non-differentiable ReLU activation function can be handled similarly, although many works assume differentiability for simplicity.
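We now consider a random initialization at $[\tilde u, \tilde\theta]$, around which we can linearly approximate the neural network as:

$$f(u, \theta; x) = \frac{\alpha}{m} \sum_{j=1}^{m} \Big[ \tilde u_j h(\tilde\theta_j, x) + (u_j - \tilde u_j) h(\tilde\theta_j, x) + \tilde u_j (\theta_j - \tilde\theta_j)^\top \nabla_\theta h(\tilde\theta_j, x) \Big] + \text{high order terms}, \qquad (8)$$

where we assume that both $u - \tilde u$ and $\theta - \tilde\theta$ are small. Note that the theory of NTK requires $\alpha = O(\sqrt{m})$, so that the term $(\alpha/m) \sum_{j=1}^{m} \tilde u_j h(\tilde\theta_j, x)$ has a bounded variance. A large scaling $\alpha$ ($\alpha \to \infty$ as $m \to \infty$) is important for this linearization, because if we fix the coefficient $\alpha \tilde u_j (\theta_j - \tilde\theta_j)$ of the random feature $\nabla_\theta h(\tilde\theta_j, x)$, then when the scaling $\alpha \to \infty$ (as $m \to \infty$), one can show that a small change of $\theta$ leads to a big change of the output function values. This means that in the continuous limit, we should let $(\theta_j - \tilde\theta_j) \to 0$. A similar claim holds for $u_j - \tilde u_j$. In this case, the higher order terms are $o(\alpha \|u_j - \tilde u_j\| + \alpha \|\theta_j - \tilde\theta_j\|)$ and approach zero, so the linear approximation of (8) is accurate.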
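The claim that the remainder of the first-order expansion is of higher order can be verified numerically. The sketch below assumes a scalar-output network $f(u, \theta; x) = (\alpha/m)\sum_j u_j \tanh(\theta_j^\top x)$ with $\alpha = \sqrt{m}$, and perturbs the initialization along a fixed random direction scaled by `eps`. If the linearization is correct, the error should be second order, so halving `eps` should reduce it by roughly a factor of four. All names and sizes are hypothetical choices for this sketch.

```python
import numpy as np

def f(u, theta, X, alpha):
    """Assumed model with scalar output and tanh activation; returns shape (n,)."""
    return (alpha / len(u)) * np.tanh(X @ theta.T) @ u

def f_linearized(u, theta, u0, theta0, X, alpha):
    """First-order expansion of f around the initialization (u0, theta0)."""
    hid = np.tanh(X @ theta0.T)                     # (n, m): h at initialization
    dhid = 1.0 - hid ** 2                           # derivative of tanh
    zeroth = hid @ u0                               # value at initialization
    first_u = hid @ (u - u0)                        # term linear in u - u0
    first_theta = (dhid * (X @ (theta - theta0).T)) @ u0   # term linear in theta - theta0
    return (alpha / len(u0)) * (zeroth + first_u + first_theta)

def linearization_error(eps, m=500, d=5, n=10, seed=0):
    """Norm of f - f_linearized after a perturbation of size eps (fixed direction)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    u0, theta0 = rng.normal(size=m), rng.normal(size=(m, d))
    du, dtheta = rng.normal(size=m), rng.normal(size=(m, d))
    alpha = np.sqrt(m)
    u, theta = u0 + eps * du, theta0 + eps * dtheta
    return np.linalg.norm(f(u, theta, X, alpha) - f_linearized(u, theta, u0, theta0, X, alpha))

err_full = linearization_error(1e-3)
err_half = linearization_error(5e-4)
print("error ratio when the perturbation is halved:", err_full / err_half)
```

A ratio close to four indicates a quadratic remainder, consistent with the linear approximation becoming accurate as the parameters stay closer to their initial values.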
In order to motivate the NTK kernel, we consider the following representation, similar to (5) for RF: Here we have both $\beta^u_i \in \mathbb{R}^k$ and $\beta^\theta_i \in \mathbb{R}^k$. Using this representation, the linear approximation of (8) becomes Using the relationship of kernel as the inner product of features for linear models, this linear approximation of the two-layer NN corresponds to the following kernel function representation where which corresponds to the features of the linear coefficients $u$ (also used by the RF model of Section III), and which is a $k \times k$ matrix corresponding to the features of the linear coefficients $\theta$ (which was not used by the RF model of Section III). This kernel representation is referred to as NTK.
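The two components of the tangent kernel described above can be illustrated for a finite network. The sketch assumes a scalar output ($k = 1$), tanh activation, and the model $f(u, \theta; x) = (\alpha/m)\sum_j u_j \tanh(\theta_j^\top x)$. The normalization of the kernels here simply follows from taking inner products of the gradient features of this assumed model, and may differ from the paper's convention by constant factors. The names `tangent_features` and `tangent_kernel` are hypothetical.

```python
import numpy as np

def tangent_features(u, theta, X, alpha):
    """Gradient of f with respect to all parameters, one row per input: (n, m + m*d)."""
    m = u.shape[0]
    hid = np.tanh(X @ theta.T)                                   # (n, m)
    feat_u = (alpha / m) * hid                                   # d f / d u_j
    feat_theta = (alpha / m) * (u * (1.0 - hid ** 2))[:, :, None] * X[:, None, :]  # d f / d theta_j
    return np.concatenate([feat_u, feat_theta.reshape(X.shape[0], -1)], axis=1)

def tangent_kernel(u, theta, X, alpha):
    """Return the two kernel components: the u-part (as in RF) and the theta-part."""
    m = u.shape[0]
    hid = np.tanh(X @ theta.T)
    dhid = 1.0 - hid ** 2
    scale = (alpha / m) ** 2
    k_u = scale * hid @ hid.T                                    # features h(theta_j, x)
    k_theta = scale * ((u ** 2 * dhid) @ dhid.T) * (X @ X.T)     # features u_j * grad_theta h
    return k_u, k_theta

rng = np.random.default_rng(0)
m, d, n = 50, 3, 6
u, theta = rng.normal(size=m), rng.normal(size=(m, d))
X = rng.normal(size=(n, d))
alpha = np.sqrt(m)

F = tangent_features(u, theta, X, alpha)
k_u, k_theta = tangent_kernel(u, theta, X, alpha)
print("max |F F^T - (k_u + k_theta)|:", np.abs(F @ F.T - (k_u + k_theta)).max())
```

The first component uses only the random features $h(\tilde\theta_j, x)$, as in RF, while the second uses the additional features $\tilde u_j \nabla_\theta h(\tilde\theta_j, x)$ that come from training the bottom layer.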
In the NTK representation, from the relationship between $[u, \theta]$ and $[\beta^u, \beta^\theta]$ in (9), we can see that as $\alpha \to \infty$, $\theta \to \tilde\theta$ and $u \to \tilde u$, which means that the linear approximation becomes more accurate when $\alpha$ is large.
In the infinite-width limit of $m \to \infty$, the kernel becomes the infinite-width NTK kernel, which is well-defined: where $\rho_0(\tilde u, \tilde\theta)$ is the random initialization distribution for $[\tilde u, \tilde\theta]$, and in our case, it is chosen as $N(0, \sigma^2) \times N(0, \sigma^2)$. Moreover, $\rho_0(\tilde\theta)$ is the marginal random initialization distribution for $\tilde\theta$, which in our case is $N(0, \sigma^2)$. With this choice, we note that $\int \tilde u \tilde u^\top \, d\rho_0(\tilde u \mid \tilde\theta) = \sigma^2 I_{k \times k}$ is proportional to a diagonal matrix. Therefore, we may also replace the $k \times k$ matrix kernel $k^\theta_\infty(x, x')$ by the following scalar kernel: We may view the infinite-width NN as a kernel method in a small neighborhood around the initialization as: This function gives an equivalent representation of the linear approximation of the two-layer NN in (8) with $m \to \infty$ and $\alpha \to \infty$. Therefore in this case, the optimization in $[u, \theta]$ can be equivalently represented using optimization in the kernel representation. The corresponding optimization problem using the kernel representation can be written as: It is worth mentioning that in the NTK view of neural networks, we usually do not employ the regularization $R(u, \theta)$. This is because the solution of an NN with a nontrivial regularization will not lie in a small region around $[\tilde u, \tilde\theta]$.
In the NTK view, under appropriate conditions, it is possible to show that the optimization problem (10) in the kernel space is equivalent to the solution by SGD in the original representation (1), when $\alpha$ is large. In this case, the general SGD for the two-layer NN in the original parameter space can be expressed as: If we consider using the same learning rate for the rescaled parameter $\alpha u/m$ and $\theta$, as often done in practice, then we shall set where $\eta$ is a small learning rate. We note that in this case, the learning rate $\eta_u = O(\eta m/\alpha)$ will be large compared to $\eta_\theta$, if we set $\alpha = O(\sqrt{m})$ as required by NTK. The large learning rate $\eta_u$ will move $u$ far away from the initialization $\tilde u$, violating the standard NTK requirement of $u \approx \tilde u$. Nevertheless, we note that the requirement of $\theta \approx \tilde\theta$ is more important in NTK for linearly approximating the nonlinear function $h(\theta, x)$. With additional complexity, it is thus possible to extend the NTK analysis to handle the situation in which $\theta \approx \tilde\theta$ but $u$ may not be close to $\tilde u$.
Because of the above-mentioned complexity, for the theoretical analysis of NTK, one often assumes a smaller learning rate for $u$ [41], such as or as For the learning rates in (12), since $\eta_u$ is much smaller than $\eta_\theta$, we know that $u$ moves very little compared to $\theta$. Consequently, we can see from (9) that $\beta^u$ moves very little compared to $\beta^\theta$. The learning rate (13) does not suffer from this problem. From (9), it can be seen that the corresponding modifications of $\beta^u$ and $\beta^\theta$ are of the same order. Therefore in this case, both kernels are effective.
It can be shown that with learning rates set as either (12) or (13), when $m \to \infty$ and $\alpha \to \infty$, the final solution of (2) without regularization can reach zero error within a very small neighborhood of the initialization, with radius approaching zero as $\alpha \to \infty$ (e.g., [18]). This phenomenon is illustrated in Figure 1. This regime is called the NTK regime, where the two-layer NN can be linearized as a kernel method, and the optimization process stays inside a small neighborhood around the initialization. This property can also be validated empirically in the actual NN optimization process. In this paper, we use the MNIST handwritten digits dataset available from http://yann.lecun.com/exdb/mnist/ [42] to demonstrate various aspects of the different frameworks. This dataset has a training set of 60,000 examples and a test set of 10,000 examples, and is one of the standard datasets for image classification. The theory of NTK implies that as the scaling parameter $\alpha$ increases, the NN training process stays in a smaller and smaller neighborhood of the initialization, and the NTK approximation becomes more and more accurate. This phenomenon is shown in Figure 2, where the average distances between $\theta$ (and $u$) and the initialization are plotted for different values of $\alpha$.
Another factor that determines the accuracy of the NTK approximation is the number of hidden units $m$, which measures the degree of overparameterization. Figure 3 shows that as $m$ increases, the solution of the neural network becomes closer to the initialization, and this phenomenon occurs both with the practical learning rate $\eta_u/\eta_\theta = m/\alpha$ and with the NTK theoretical learning rate $\eta_u/\eta_\theta = 1$. From this experiment, it is reasonable to speculate that as $m$ approaches infinity, the first-order approximation of the NN (NTK) will become more reliable.
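The qualitative effect of the width $m$ can be reproduced on a toy problem. The following sketch is not the MNIST experiment behind the figures. It assumes a scalar-output tanh network with $\alpha = \sqrt{m}$, full-batch GD with a single learning rate for $u$ and $\theta$, a small synthetic regression task, and no regularization. It reports how far the parameters move relative to the norm of the initialization. The function name, data, and hyperparameters are hypothetical.

```python
import numpy as np

def relative_distance_after_training(m, steps=300, eta=0.2, seed=0):
    """Train with full-batch GD and return ||params - init|| / ||init||."""
    rng = np.random.default_rng(seed)
    n, d = 20, 5
    X = rng.normal(size=(n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)      # unit-norm inputs
    y = np.sin(3.0 * X[:, 0])                          # synthetic targets
    alpha = np.sqrt(m)                                 # assumed NTK-style scaling
    u0, theta0 = rng.normal(size=m), rng.normal(size=(m, d))
    u, theta = u0.copy(), theta0.copy()
    for _ in range(steps):
        hid = np.tanh(X @ theta.T)                     # (n, m)
        resid = (alpha / m) * hid @ u - y              # (n,) residuals of squared loss
        grad_u = (alpha / m) * hid.T @ resid / n       # (m,)
        weighted = resid[:, None] * (1.0 - hid ** 2) * u[None, :]   # (n, m)
        grad_theta = (alpha / m) * weighted.T @ X / n  # (m, d)
        u -= eta * grad_u
        theta -= eta * grad_theta
    moved = np.sqrt(np.sum((u - u0) ** 2) + np.sum((theta - theta0) ** 2))
    init_norm = np.sqrt(np.sum(u0 ** 2) + np.sum(theta0 ** 2))
    return moved / init_norm

for m in (20, 200, 2000):
    print(f"m = {m:>4d}: relative distance from initialization = "
          f"{relative_distance_after_training(m):.4f}")
```

On this toy task the relative movement is expected to shrink as the network gets wider, which is the behavior that makes the first-order approximation around the initialization more reliable for large $m$.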
It is also useful to point out that the NTK approximation works better on simple datasets, and fails more easily on complex datasets. Figure 4 compares the objective functions of the NN (on both training and test data) to those of its linearized NTK approximation during the training process. We compare the linearization error both on the MNIST dataset and on a simpler 100-dimensional synthetic dataset generated by `make_classification` in sklearn. For an NN with $m = 1000$ and $\eta = 0.1$, the linearization of the NN on the simpler dataset is almost perfect, which means that the entire training process is well approximated by NTK. However, there is a noticeable discrepancy between NTK and the actual NN on the MNIST data, which means that NTK does not fully explain the actual NN training process in this case. This experiment implies that for relatively complex datasets, the NTK approximation requires a much wider NN.
In summary, when $\alpha \to \infty$ and $m \to \infty$, and the formulation does not contain regularization, we have the so-called NTK regime, with the following properties:
• The NN is initialized with a certain scaling
• The network is sufficiently large
• The formulation does not contain regularization
• The learning rate is sufficiently small
• The NN solution path remains close to the initialization, and can be linearized around the initialization
• The linearization induces a kernel (NTK), and the resulting problem is convex
The implications of the NTK view are as follows:
• The solution is in a tiny neighborhood of the initial point
• The problem becomes convex and can be solved efficiently
• The solution path is reproducible in the kernel representation
The theory of NTK applies to a specialized regime with a special initialization. It also assumes a small learning rate so that the learned parameters do not escape from the NTK region. Although this is a nice mathematical model, it has several problems that make it unsuitable as a general theory of neural networks.
Note that RF can be considered as a two-layer NN where the bottom layer is fixed. The model is linear with respect to the top layer, and one can incorporate regularization to improve generalization. In contrast, in the NTK regime, we perform GD to optimize the NN weights in all layers, with random features generated in the tangent space around the initialization. In order to ensure that the solution of (2) can be approximated by (10), the resulting optimization problem (10) of NTK cannot include non-trivial regularizers, which would pull the training process out of the tangent space. It is nevertheless natural to extend (10) by adding regularization. This leads to the regularized NTK method (14), which may be regarded as a new learning algorithm motivated by NTK, although it may no longer be a good approximation of two-layer NNs. It differs from (10) only in the addition of the regularization term.
A major problem of the NTK view is that the practical performance of the NTK solution from (14) is often inferior to that of fully trained NNs, despite the equivalence that can be proved under certain theoretical assumptions. Even an infinitely wide NTK cannot achieve state-of-the-art performance. In the following, we explain why this happens in practice and what can break the NTK view for neural network learning. We show that the NTK regime is broken by standard tricks in NN learning.
Although random parameter initialization with a large scaling $\alpha$ is used by practitioners, and this choice is consistent with what the NTK view requires, practitioners use a large initial learning rate, whereas a small learning rate is required by the theoretical analysis. Consequently, the NN optimized by practical SGD procedures leaves the NTK regime. Because of this, the NTK linear approximation fails.
Moreover, in practice, neural networks are not infinitely wide, and thus the large-$m$ theory does not exactly match the practical behavior of neural network learning. Another theoretical condition for the NTK view is that no nontrivial regularization is imposed (see footnote 1), because regularization automatically pushes the solution away from the initialization, which violates the NTK regime. However, regularization (or weight decay) is frequently used by practitioners. In addition, there is currently no analysis of NNs that can incorporate batch normalization in the NTK regime.
One of the key reasons that NTK does not perform well in practice is that the NTK method is very similar to RF, in that it also employs random features. Compared to RF of Section III, NTK contains an additional kernel corresponding to the random features $\tilde u \nabla_\theta h(\tilde\theta, x)$. In fact, the mathematical theory of NTK relies mostly on the modification of $\theta$ associated with the random features $\tilde u \nabla_\theta h(\tilde\theta, x)$ to reduce the training loss, while the mathematical theory of RF relies on the modification of $u$ associated with the random features $h(\tilde\theta, x)$ to reduce the training loss. This is why the theoretical analysis of NTK relies on a small learning rate $\eta_u$. Nevertheless, the additional random features used by NTK provide extra information over RF.
Since NTK is still a random-feature-based method, it does not learn feature representations. In contrast, it is well known by practitioners that a key benefit of NN learning is the ability to learn useful feature representations. The theory of NTK completely fails to explain the benefit of feature learning by neural networks. We will investigate this issue further in Section VI.
The NTK view is also inconsistent with many technical tricks that are used in practical NN training and that benefit learning performance, such as a large initial learning rate and momentum, which may push the model parameters out of the neighborhood of the initialization. Here we show some practical cases that demonstrate these phenomena.

Footnote 1: We note that the NTK regime can still incorporate very small regularizers. For example, one can add a regularizer of the form $\mu(\|\theta\|^2 + \|u\|^2)$, where $\mu$ is much smaller than $1/m$. In this case, there is still a solution in the neighborhood of the initialization. Moreover, we may add regularizers centered at the initialization $\tilde\theta$, such as $\|\theta - \tilde\theta\|_2^2$, although they are not used by practitioners.
As shown in Figure 5, when we use a large learning rate or employ momentum, the difference between the initial parameters and the final solution increases. In such situations, the first-order approximation (8) used by NTK fails to capture the dynamics of the NN. The gap between NN and NTK is significant, and the performance of the NN is better with these tricks. A few other works have also observed that a large initial learning rate leads to a better NN solution [43], [44], which cannot be explained well by the NTK view.
Note that most of our NTK experiments employ the random initialization $\tilde\theta_j \sim N(0, 1)$. In the analysis, the variance $\sigma^2$ is fixed as a constant, which does not influence the asymptotic behavior significantly. However, the actual value of $\sigma^2$ plays a noticeable role when $m$ is not large enough compared to the input dimension $d$. Figure 5 shows that when we use $\tilde\theta_j \sim N(0, 1/d)$ (a standard initialization technique with good practical performance, referred to as He initialization [45]), the NTK regime can be broken more easily. Therefore, the gap between the first-order approximation of NTK and actual NN training cannot be ignored in practice.

V. MEAN FIELD VIEW
In order to overcome the limitations of the NTK view explained above, other theoretical models have been developed to investigate overparameterized two-layer neural networks. In this section, we introduce another line of research, which applies the mathematical tools of mean-field analysis from statistical physics to study two-layer neural networks [46], [47], [48], [49], [50], [51], [52], [53], [54].
In order to motivate the mean field analysis for overparameterized NNs, it is instructive to first investigate the continuous dynamics of infinitely wide NNs, known as the mean field limit, and then consider finite-width neural networks as its approximation. In this paper, we call the corresponding analysis the mean field view (MF). The idea of studying the mean field limit comes from statistical physics [55], which suggests that the mathematical model of a large number of interacting neurons can be simplified using a probability distribution that represents the average effect.
Unlike the NTK view, which requires $\alpha \to \infty$ as $m \to \infty$, in the mean field view [46] we may fix the scaling at a constant such as $\alpha = 1$ while letting $m \to \infty$. In this case, the solution is allowed to move far from the initialization, which remedies the main limitations of NTK. Therefore one may argue that this approach gives a more realistic mathematical model for the practical behavior of neural networks.
In the limit of $m \to \infty$ with fixed $\alpha$, we may consider the continuous limit of the two-layer neural network in (1) as

$$f(\rho; x) = \alpha \int u \, h(\theta, x) \, d\rho(u, \theta), \qquad (15)$$

where $\rho(u, \theta)$ is a probability distribution over $[u, \theta]$. In this continuous formulation, we can regard the probability measure $\rho$ as the model parameter. The original model parameters $\{[u_j, \theta_j]\}$ in the discrete NN formulation can then be viewed as a discrete probability distribution on the model parameter space $[u, \theta] \in \mathbb{R}^k \times \mathbb{R}^d$, which puts a mass of $1/m$ at each point $[u_j, \theta_j]$ ($j = 1, \ldots, m$). In the continuous limit of $m \to \infty$, this discrete probability distribution naturally converges to the distribution parameter $\rho$ in the continuous NN formulation (15). It is easy to see that the function represented by the two-layer NN becomes linear in $\rho$.
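The statement that the network output is linear in $\rho$ can be made concrete with empirical measures. The sketch below assumes $\alpha = 1$ and tanh activation, and represents a distribution by a finite set of equally weighted particles $[u_j, \theta_j]$. The equal-weight mixture of two such particle sets is the union of the particles, and the output of the union equals the average of the two outputs. The function name `nn_output` and the sizes are hypothetical.

```python
import numpy as np

def nn_output(u, theta, x):
    """Output of the discrete NN with alpha = 1: the average of u_j * h(theta_j, x)
    over particles, i.e. the integral of u * h(theta, x) under the empirical measure."""
    return np.mean(u * np.tanh(theta @ x)[:, None], axis=0)   # shape (k,)

rng = np.random.default_rng(0)
m, d, k = 100, 4, 2
x = rng.normal(size=d)

# Two empirical measures rho_A and rho_B, each with m particles of mass 1/m.
u_a, theta_a = rng.normal(size=(m, k)), rng.normal(size=(m, d))
u_b, theta_b = rng.normal(size=(m, k)) + 1.0, rng.normal(size=(m, d))

# The mixture 0.5 * rho_A + 0.5 * rho_B is the union of particles, each with mass 1/(2m).
u_mix = np.concatenate([u_a, u_b])
theta_mix = np.concatenate([theta_a, theta_b])

f_mix = nn_output(u_mix, theta_mix, x)
f_avg = 0.5 * nn_output(u_a, theta_a, x) + 0.5 * nn_output(u_b, theta_b, x)
print("f(mixture):          ", f_mix)
print("average of f(A), f(B):", f_avg)
```

The output is also invariant to permuting the particles, which is why the distribution, rather than the ordered list of hidden units, is the natural parameter in this view.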
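In the continuous limit, the training objective (2) for the discrete NN becomes

$$\phi(\rho) = \frac{1}{n} \sum_{i=1}^{n} L(f(\rho; x_i), y_i) + R(\rho), \qquad R(\rho) = \int r(u, \theta) \, d\rho(u, \theta), \qquad (16)$$

for the continuous NN, where $r(u, \theta)$ is a regularizer of $[u, \theta]$, such as the $L_2$ regularization. It follows that the training objective (16) is convex with respect to $\rho$ if both the loss function $L(\cdot)$ and the regularizer $R(\cdot)$ are convex. In fact, the globally optimal solution $\rho$ satisfies the first-order optimality condition: for all probability measures $\rho'(u, \theta)$,

$$\int g(\rho, u, \theta) \, d\rho'(u, \theta) \ge \int g(\rho, u, \theta) \, d\rho(u, \theta), \qquad (17)$$

where

$$g(\rho, u, \theta) = \frac{\alpha}{n} \sum_{i=1}^{n} L_1(f(\rho; x_i), y_i)^\top u \, h(\theta, x_i) + r(u, \theta) \qquad (18)$$

is the derivative of $\phi(\rho)$ with respect to the component $\rho(u, \theta)$, obtained by regarding the distribution $\rho$ as an infinite-dimensional vector $\rho = \{\rho(u, \theta)\}$. Here $L_1(v, y) = \nabla_v L(v, y)$. Note that if we can find $\rho$ such that $g(\rho, u, \theta) = c$ for a constant $c$, then (17) is satisfied. This is because in this case, we have for all $\rho'$: $\int g(\rho, u, \theta) \, d\rho'(u, \theta) = c = \int g(\rho, u, \theta) \, d\rho(u, \theta)$.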
In the mean field view, the discrete NN and the continuous NN are connected as follows when $m$ is large. The hidden units $[u_j, \theta_j]$ of the discrete NN (1) can be viewed as $m$ particles sampled from the distribution $\rho(u, \theta)$. In the training process, we move each particle $[u_j, \theta_j]$ using SGD, which follows the gradient of the objective function with respect to that particle. In the continuous limit, we have infinitely many particles, and each particle $[u, \theta]$ also moves according to the gradient of the objective function with respect to its parameter. In the literature, such dynamics are often referred to as gradient flow [56], which characterizes the learning dynamics of the continuous formulation. In the following, we present a more mathematical description.
In the continuous formulation, a hidden unit can be regarded as a particle indexed by a parameter $z_0$ sampled from a distribution $\nu_0(z_0)$. Here $z_0$ only plays the role of the discrete index $j$ in the discrete formulation, and its own value is of no significance. The initial distribution $\nu_0(z_0)$ is introduced for convenience so that we can sample over the index $z_0$. In the discrete setting, it is simply the uniform distribution over $j = 1$ to $j = m$.
Each particle indexed by $z_0$ also has a parameter $[u, \theta]$ that will be trained. We assume that at time $t$, we move each particle during the training process, so that the model parameter becomes $[u(t, z_0), \theta(t, z_0)]$. If we take $z_0 \in \mathbb{R}^{k+d}$, with $\nu_0(z_0)$ a Gaussian distribution, then we may simply initialize $[u, \theta]$ as $[u(0, z_0), \theta(0, z_0)] = z_0$. Since $z_0$ is sampled from $\nu_0(z_0)$, the particles $[u(t, z_0), \theta(t, z_0)]$ induce a probability measure $\rho_t$, defined for measurable sets $A$ by

$$\rho_t(A) = \nu_0\big(\{ z_0 : [u(t, z_0), \theta(t, z_0)] \in A \}\big). \qquad (19)$$

Here the time-dependent parameters $u(t, z_0)$ and $\theta(t, z_0)$ are obtained via training over time.
Using the above terminology, the optimization of ρ(u, θ) in (15) leads to a distribution ρ_t(u, θ) at training time t, which is obtained by moving the hidden unit parameters [u, θ] via gradient flow with respect to the objective function φ(ρ_t). More precisely, the corresponding particle movement in the continuous limit obeys the gradient flow equation [47]:
\[ \frac{d}{dt}\begin{bmatrix} u(t, z_0) \\ \theta(t, z_0) \end{bmatrix} = -\nabla g\big(\rho_t, u(t, z_0), \theta(t, z_0)\big), \tag{20} \]
where ∇ denotes the gradient with respect to [u, θ], and the particle gradient g(ρ, u, θ) for a particle [u, θ] is defined in (18). The gradient flow direction of a particle [u, θ] is the gradient of g(ρ, u, θ), which is equivalent to the gradient of the objective with respect to each particle parameter [u_j, θ_j] in the discrete NN formulation. Therefore (20) is the continuous version of the gradient descent method with respect to the model parameter [u, θ] associated with the hidden units, and this continuous version of gradient descent tries to minimize the objective function (16).
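The particle view of gradient flow can be made concrete with a small numerical sketch. The code below is only an illustration under assumptions that are not taken from the text: scalar output (k = 1), the model f(x) = (1/m) Σ_j u_j tanh(θ_j · x) with no α scaling, squared loss, and an L2 regularizer with a hypothetical weight `lam`. Each particle takes an explicit Euler step along the negative gradient of its particle potential, which for this model equals m times the gradient of the discrete objective with respect to that particle.

```python
import numpy as np

def particle_gradient_step(u, theta, X, y, lam=1e-3, eta=0.05):
    """One explicit Euler step of the particle gradient flow (illustrative model).

    u: (m,) output weights, theta: (m, d) hidden weights,
    X: (n, d) inputs, y: (n,) targets.
    Assumed model: f(x) = mean_j u_j * tanh(theta_j . x), loss (f - y)^2 / 2.
    """
    n = X.shape[0]
    act = np.tanh(X @ theta.T)              # (n, m): h(theta_j, x_i)
    resid = act @ u / u.size - y            # (n,): derivative of the loss in f
    # Gradient of the particle potential with respect to u_j and theta_j.
    grad_u = act.T @ resid / n + lam * u
    grad_theta = (resid[:, None] * (1.0 - act**2) * u[None, :]).T @ X / n + lam * theta
    return u - eta * grad_u, theta - eta * grad_theta

def objective(u, theta, X, y, lam=1e-3):
    """Discrete objective: mean squared loss plus averaged L2 regularizer."""
    f = np.tanh(X @ theta.T) @ u / u.size
    reg = 0.5 * lam * (np.mean(u**2) + np.mean(np.sum(theta**2, axis=1)))
    return 0.5 * np.mean((f - y) ** 2) + reg
```

Calling `particle_gradient_step` repeatedly with a small `eta` moves all m particles simultaneously; the empirical distribution of the rows of `[u, theta]` then plays the role of ρ_t.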
The gradient flow equation (20) implies a partial differential equation for the probability measure ρ_t,
\[ \frac{\partial \rho_t(u, \theta)}{\partial t} = \nabla \cdot \big[\rho_t(u, \theta)\, \nabla g(\rho_t, u, \theta)\big], \tag{21} \]
with its solution interpreted in the weak sense. This equation describes the dynamics of the objective function parameter ρ of (16) under the continuous gradient descent method of (20).
Here we use the simplified notation
\[ \nabla \cdot [\rho\, \nabla g] = \nabla_u \cdot [\rho\, \nabla_u g] + \nabla_\theta \cdot [\rho\, \nabla_\theta g]. \]
The differential equation of ρ_t in (21) characterizes the dynamics of ρ_t in the neural network training process, and it can be shown that the objective value decreases according to the following ordinary differential equation:
\[ \begin{aligned} \frac{d\varphi(\rho_t)}{dt} &= \int \frac{\delta \varphi(\rho_t)}{\delta \rho_t(u, \theta)}\, \frac{\partial \rho_t(u, \theta)}{\partial t}\, d\theta\, du \\ &= \int g(\rho_t, u, \theta)\, \nabla \cdot \big[\rho_t(u, \theta)\, \nabla g(\rho_t, u, \theta)\big]\, d\theta\, du \\ &= -\int \big\|\nabla g(\rho_t, u, \theta)\big\|_2^2\, d\rho_t(u, \theta). \end{aligned} \tag{22} \]
The derivation of the first equation uses the calculus of variations, where the variational derivative may be considered as the functional gradient of φ with respect to ρ_t by treating ρ_t as an infinite dimensional vector indexed by (θ, u). The functional gradient is given by g(ρ_t, u, θ), which leads to the second equation. In the third equation, we have used integration by parts, and with a slight abuse of notation, we have written dρ_t(u, θ) = ρ_t(u, θ)dθdu, which does not distinguish the measure ρ_t from its density representation. This equation is the key to proving global convergence in the MF approach. It shows that the gradient descent method of (20) reduces the objective function of (16), and the result is stated using the probability measure ρ_t, which is what we want to learn in the continuous NN formulation.
Since the objective function is bounded from below, from (22) we can obtain that as t → ∞, we must have dφ(ρ_t)/dt → 0. It follows that
\[ \int \big\|\nabla g(\rho_t, u, \theta)\big\|_2^2\, d\rho_t(u, \theta) \to 0. \tag{23} \]
However, this does not ensure that the objective function reaches the global minimum unless additional conditions are imposed. We first present an intuitive explanation, and then describe more rigorous results. From (23), if we can show dρ_t(u, θ) ≠ 0 for all [u, θ], then we have ‖∇g(ρ_t, u, θ)‖₂² → 0 for all [u, θ]. In this case, from ∇g(ρ_t, u, θ) → 0, we obtain g(ρ_t, u, θ) → c for a constant c, which implies the first order condition (19). This result implies that GD training converges to the global optimal solution of (16) in the continuous setting.
A more rigorous treatment of the above reasoning was presented in [46], which considered a formulation with an additional entropy regularization term in R(ρ). This entropy regularizer ensures that dρ_t(u, θ) ≠ 0 for all [u, θ]. In fact, with this regularization, the measure ρ(u, θ) always has a density: dρ(u, θ) = p(u, θ)dudθ, and we can write the regularizer as R(ρ) = λ_p ∫ p(u, θ) log p(u, θ) dudθ + ∫ r(u, θ) p(u, θ) dudθ, which modifies the regularizer in (16) by adding an extra entropy term. Using this regularizer, it can be shown that there is a unique global solution that satisfies (19). Moreover, under mild conditions, ρ_t(u, θ) converges (weakly) to a distribution ρ_∞(u, θ) that can be lower bounded by a normal distribution [46]. Then by using the Poincaré inequality for Gaussian random variables (which states that if X is a standard normal random variable and f(X) is a real-valued function, then Var(f(X)) ≤ E‖∇f(X)‖₂²), we may obtain from (23) that g(ρ_∞, u, θ) = c almost everywhere for some constant c. This implies that the first order condition (19) holds. It follows that as t → ∞, the solution converges to the unique global optimal solution.
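The role of the Gaussian Poincaré inequality in this argument is that a vanishing gradient energy forces a vanishing variance, so the function must be constant almost everywhere. A quick Monte Carlo check of the inequality itself can be sketched as follows; the helper name `poincare_check` and the test function are hypothetical choices for illustration and are unrelated to the specific g of the text.

```python
import numpy as np

def poincare_check(f, grad_f, dim, n_samples=200_000, seed=0):
    """Estimate Var(f(X)) and E||grad f(X)||^2 for X ~ N(0, I_dim).

    f maps an (n, dim) array to (n,), grad_f maps it to (n, dim).
    Returns the pair (variance, gradient energy) as Monte Carlo estimates.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_samples, dim))
    var_f = float(np.var(f(X)))
    grad_energy = float(np.mean(np.sum(grad_f(X) ** 2, axis=1)))
    return var_f, grad_energy

# Illustrative function f(x) = sin(x_1) + x_2^2 with its analytic gradient.
f = lambda X: np.sin(X[:, 0]) + X[:, 1] ** 2
grad_f = lambda X: np.stack([np.cos(X[:, 0]), 2.0 * X[:, 1]], axis=1)
var_f, grad_energy = poincare_check(f, grad_f, dim=2)
```

For this choice the exact values are Var(f) = (1 − e⁻²)/2 + 2 and E‖∇f‖² = (1 + e⁻²)/2 + 4, so the inequality holds with a visible gap; equality is attained only by affine functions.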
In a practical implementation of the gradient descent rule in (20) with entropy regularization, we need to compute the gradient ∇ log p(u, θ) in ∇g(ρ, u, θ). It can be shown that an equivalent implementation is to add random noise, and the corresponding gradient flow equation of (20) becomes a stochastic differential equation (SDE) with t ≥ 0:
\[ d\begin{bmatrix} u_t \\ \theta_t \end{bmatrix} = -\nabla g(\rho_t, u_t, \theta_t)\, dt + \sqrt{2\lambda_p}\, dB(t), \]
where {B(t)}_{t≥0} is the standard Brownian motion in R^{k+d}, and g(·) is defined in (18).
In the GD (or SGD) implementation of the Brownian motion component of this SDE, we simply add Gaussian noise N(0, 2λ_p η(t)) to each GD update step with learning rate η(t). This method is referred to as noisy gradient descent in the literature. With the help of entropy regularization, it can be shown that noisy gradient descent for the continuous NN converges to the unique global optimal solution, and an overparameterized discrete NN with a sufficiently large m approximately reaches this solution. This result can be used to explain why training of overparameterized NNs is easier in practice, and why the (idealized) two-layer neural network training process can reach a good solution with consistent performance.
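The noisy gradient descent update can be sketched in a few lines. To keep the example checkable, it is run on a fixed quadratic potential ‖z‖²/2 rather than on the ρ-dependent particle gradient of the text; this is an assumption made purely for illustration, as are the function names and default values. For a fixed potential, the continuous-time dynamics have stationary density proportional to exp(−‖z‖²/(2λ_p)), so the particle variance should settle near λ_p.

```python
import numpy as np

def noisy_gd_step(z, grad, eta, lam_p, rng):
    """One noisy GD update: z - eta * grad + noise with variance 2 * lam_p * eta."""
    noise = rng.normal(0.0, np.sqrt(2.0 * lam_p * eta), size=z.shape)
    return z - eta * grad + noise

def run_on_quadratic(n_particles=10_000, dim=2, eta=0.01, lam_p=0.5,
                     steps=1500, seed=0):
    """Run noisy GD on the potential |z|^2 / 2 (whose gradient is z)."""
    rng = np.random.default_rng(seed)
    z = 3.0 * rng.standard_normal((n_particles, dim))  # deliberately far from stationarity
    for _ in range(steps):
        z = noisy_gd_step(z, grad=z, eta=eta, lam_p=lam_p, rng=rng)
    return z

particles = run_on_quadratic()
```

With a finite learning rate the stationary variance of this linear recursion is λ_p/(1 − η/2), which approaches λ_p as η → 0.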
The benefits of entropy regularization are three-fold: (i) When we supplement the loss function with entropy regularization, the overall learning problem becomes strictly convex. Thus a unique global minimum can be guaranteed. (ii) The implementation of the entropy regularization is a very simple addition of noise. In practice, one can often observe that adding noise helps to find a better solution. (iii) After injecting noise, dρ_t ≠ 0 for all [u, θ]. This fact, combined with (23), implies the point-wise vanishing of ∇g(ρ_t, u, θ), which implies that the global minimum is achievable.
On the other hand, without the extra entropy regularization, the objective function of (16) may not have a unique global optimal solution. That is, there can be more than one solution that satisfies the first order condition (17). However, we can still achieve global convergence to the optimal objective function value of (16), although the final solution to which the training process converges may not be unique. To analyze this situation (without the extra entropy regularization), the authors in [47] considered a different assumption with homogeneous activation functions (such as ReLU) and homogeneous regularizers. Under such assumptions, it is possible to show that the solution ρ_t(u, θ) converges to a global optimal solution (which may not be unique) that satisfies (17) as t → ∞.
In the mean field approach, learning the distribution ρ can be viewed as learning effective feature representations. The ability of NNs to learn feature representations is consistent with empirical observations. This perspective also explains why a fully trained NN is better than RF and NTK, both of which employ random feature representations that are not learned. We will further discuss this aspect in Section VI.
To visualize the process of learning ρ, we conduct an experiment to reproduce an m = 4 sigmoid-activated NN (u_i = 1, i = 1, 2, 3, 4, denoted as F_4) by NNs with different widths m. Note that the process of learning the target function can be recognized as the process of learning the target optimal ρ* = (1/4) Σ_{i=1}^{4} δ_{θ_i}, where δ_x is the Dirac delta function at point x and θ_i is the weight of the NN to be reproduced. Although we know that the target function can be represented by 4 neurons, Figure 6 shows that using a larger m leads to better learning. This is consistent with the theory of overparameterization. In Figure 6 [...] noise ∼ N(0, 0.1²) (the reason to use large-scale input is to increase the reconstruction difficulty). It can be seen that with m = 4, training gets stuck at a local minimum and cannot learn the correct target function. As m increases, the target function is learned more and more accurately. Figure 6 (b) and (c) show the distributions of θ at initialization and at the solution reached when training converges. They differ significantly, and thus in this case NN training leaves the NTK regime. In the end, the neurons are scattered, with a large number of them aligned with the targets {θ_i} represented by the four red dots. Since the target function does not have a unique representation, we cannot recover the parameters {θ_i} themselves; we only recover the function value represented by each θ_i, using multiple θ parameters distributed over the lines of the targets. Therefore we can learn the target function reliably when m is large, although we do not necessarily learn the four target parameters {θ_i}. This is consistent with the analysis in [47], where NN training reaches the minimal training error but the minimizer is not necessarily unique. Figure 6 (d) shows that many particles have a very small u, which means they are "wasted neurons" that do not affect the function value. If we remove these wasted neurons with small u, then we can display the effective neurons in (e), which are well aligned with the target neuron directions and can approximately recover the functions represented by the four target neurons. The phenomenon of wasted neurons in (c) arises because the objective function is not strongly convex in ρ. Therefore there can be many solutions that achieve the global optimum. The analysis in [47] demonstrates that under suitable conditions, the training process converges to a global optimum, although the solution may not be unique. However, if we add the entropy regularization as in [29], then the global solution becomes unique. Since the entropy regularization can be implemented using noisy gradient, we show in (f) the effect of using noisy gradient on this problem. With this regularization, there is a significant reduction of wasted neurons, and the final solution is nearly aligned with the directions of the four target neurons. Because the function is not uniquely represented by the four neurons, we still do not recover the four target neurons. Instead, the final solution converges to a unique optimal distribution ρ*, which is a smooth distribution around the directions of {θ_i} in the continuous limit. In real finite-width NNs, some neurons may still get stuck in the low density regions. We can thus observe a small portion of wasted neurons, which decreases with wider NNs or larger noise.
We may summarize some key points of the MF approach as follows.
• The method learns a distribution ρ, which behaves like learning effective feature representations.
• GD or noisy GD (SGD) over model parameters defines gradient flows whose dynamics are characterized by partial differential equations (PDEs).
• Under appropriate assumptions, the solution of the underlying PDE converges to an optimal solution ρ that satisfies the first order condition (19).
• The optimal solution can be far from the initial parameter, leading to a more realistic model for neural network learning than NTK.
It can be shown that as α → ∞, the dynamics of MF become similar to those of NTK under suitable conditions, and the solution becomes closer and closer to the initialization [29], [57]. When we reduce α, the final solution moves farther from the initialization, so that training migrates from the NTK regime to the MF regime. This phenomenon is illustrated in Figure 7. We also summarize the relationship between NTK and MF in Table I.
While MF for two-layer NNs is well understood, it is significantly more difficult than for NTK to generalize MF to handle deep neural network structures. It is also more difficult to obtain concrete complexity results using MF, which requires studying both the discretized differential equations and the convergence rate as m → ∞.

VI. THE IMPORTANCE OF FEATURE LEARNING
In the previous sections, we presented three mathematical models closely related to two-layer neural networks: RF, NTK, and MF. A summary of the pros and cons of the three models is shown in Table II. The first two approaches, RF and NTK, employ simplified mathematical models by treating two-layer NNs as linear models with random features. The MF view, on the other hand, directly models the feature learning dynamics of NNs. It was argued by [58] that a theoretical understanding of feature learning is the key to explaining the success of NNs. Following the argument of [58], this section compares the three models empirically from the feature learning perspective.
It was pointed out in [58] that when m is large, the hidden units of a discrete NN in (1) can be regarded as m (nearly) independent samples from a distribution ρ, which is the distribution of the corresponding continuous NN in (15). If we treat the function value f(ρ, x) of the continuous NN as the target, then the error of the discretized NN is caused by the variance of sampling m hidden units from ρ, as characterized in (24), where f([u, θ], x) represents the discrete NN of (1), with each hidden unit j sampled (independently) from ρ, and f(ρ, x) is the corresponding continuous NN of (15). Since under suitable conditions the continuous representation f(ρ, x) can reach a global optimal solution via training, it can be regarded as the target function we try to learn with the discrete NN. A good feature representation of the target is thus a feature distribution ρ such that its continuous NN can be well approximated by the corresponding discrete NN via (24). This means that the variance on the right hand side of (24) should be small. If we consider using the L2 regularization for u, and assume that h(θ, x) is batch-normalized for all θ, then it is shown in [58] that when fully optimized, ‖u‖₂ is nearly a constant with respect to the distribution ρ(u, θ). This implies that the variance of (24) achieved by NN training is nearly minimized among all ρ′ such that f(ρ′, x) = f(ρ, x). Therefore for a fixed m, we achieve the smallest error with the discrete NN and the learned probability measure ρ(u, θ). We thus conclude from this result that after NN training, the discrete NN can efficiently represent the target function by learning an effective feature representation characterized by the feature distribution ρ(u, θ).
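The claim that the discretization error is a sampling variance can be illustrated with a short Monte Carlo experiment. The sketch below assumes a scalar-output model f(x) = (1/m) Σ_j u_j σ(θ_j · x) with a sigmoid σ, and a hypothetical distribution ρ under which θ ∼ N(0, I) and u ∼ N(1, 0.25) independently; none of these choices come from the text. Under this ρ the continuous NN value is exactly 0.5 by symmetry, so the mean squared error of the m-unit network can be measured directly.

```python
import numpy as np

def discretization_mse(m, x, trials=2000, seed=0):
    """Mean squared gap between an m-unit sampled NN and its continuous limit.

    Hidden units are drawn i.i.d. from an assumed rho: theta ~ N(0, I),
    u ~ N(1, 0.5^2), independent of theta.
    """
    rng = np.random.default_rng(seed)
    d = x.shape[0]
    theta = rng.standard_normal((trials, m, d))
    u = 1.0 + 0.5 * rng.standard_normal((trials, m))
    h = 1.0 / (1.0 + np.exp(-(theta @ x)))      # (trials, m) sigmoid features
    f_discrete = np.mean(u * h, axis=1)         # one sampled network per trial
    f_continuous = 0.5                          # E[u] * E[sigmoid(theta . x)] = 1 * 0.5
    return float(np.mean((f_discrete - f_continuous) ** 2))

x = np.array([1.0, -2.0, 0.5])
mse_small = discretization_mse(25, x, seed=0)
mse_large = discretization_mse(400, x, seed=1)
```

Because the units are independent, the error scales like 1/m, so `mse_small / mse_large` should be close to 400/25 = 16. A distribution ρ with a smaller per-unit variance would lower both numbers, which is the sense in which a learned feature distribution yields a more efficient discrete representation.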
If we compare this learned feature representation to the random feature approaches (RF or NTK), the feature representation learned by the NN leads to a more efficient discrete representation by sampling from the distribution. This efficiency explains the superiority of NNs over the random feature approach. A consequence of the optimal feature representation point of view in [58] is the possibility of using a generative model to learn such a distribution ρ, and then using this generative model to replace the initial random features (i.e., random Gaussian distributions) in RF and NTK to generate hidden units of the neural network. If we consider random features sampled from this learned distribution, instead of random features at the initialization, more effective RF and NTK models can be obtained. This was illustrated in [58], [59], which we present here as well.
For Convolutional Neural Networks (CNNs), the phenomenon of learning features is a consensus among practitioners [60], [59]. A visualization of this phenomenon is shown in Figure 8. When we use a Variational Auto-Encoder (VAE) [61] to learn the optimized ρ distribution from samples of pre-trained models, we observe meaningful patterns not found at the initialization. In particular, to obtain samples from ρ*, we prepare 1,000 pre-trained two-layer NNs (m = 100) with different initializations. Note that the weights of the pre-trained NNs can be regarded as samples {θ_i} from ρ*. We can then use the generative model (VAE) to learn the transform from the standard normal distribution to the target distribution ρ* with these samples {θ_i}.
Figure 9 illustrates the repopulation phenomenon in neural networks. In the repopulation process, we use a VAE to learn the feature distribution ρ and then sample weights from the learned generative model. We then fix the generated features and learn the parameter u only as in (3), just like the RF method. From this experiment, we can see that the repopulated features outperform the initial random features. This means the random features learned by the NN are superior to the Gaussian random features at initialization. This is consistent with the theory of [58].
Another approach to examining the effectiveness of ρ is to compare the tangent spaces at the initial and the final solutions using the linear approximation (8). This scenario has also been investigated in [62]. If the representation power of NTK matched that of NNs in practice, then the performance using the tangent space at the initialization should be similar to that of the learned distribution ρ. We compare random weights and generated ones in Figure 10 on the MNIST dataset. Note that the generated weights are learned by a VAE at the final solution. In training, both approaches achieve very small errors. However, the generalization ability differs significantly: the learned ρ provides a more robust model in the testing stage. Many analyses of NTK investigated the training loss, which can become almost zero due to the effectiveness of the tangent space. However, the restricted space cannot perform as well as the full NN in terms of generalization ability.
As indicated by Figure 6 (d), many neurons of an NN can be "wasted", and they can be identified by u with proper regularization. Therefore it is possible to perform importance sampling to select effective weights from a very wide NN, which can also be regarded as a form of pruning. We train a large NN (m = 10000) with regularization 10⁻³, which leads to many "wasted" neurons. We want to prune the NN by choosing only 10 effective neurons, and fine-tune the weight u. There are two strategies to select the neurons: (1) uniform sampling, which does not distinguish the importance of neurons; (2) importance sampling, which takes the corresponding u as the importance of each neuron. After selecting neurons, we fix the first layer θ and train u. Note that the optimization of u is a convex problem. Figure 11 shows that importance sampling outperforms uniform sampling significantly. This also confirms that in a wide network, some neurons may get stuck due to the lack of strong convexity of the formulation.
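The two selection strategies can be compared on synthetic data with a short sketch. Everything specific here is an assumption for illustration rather than the experimental setup of Figure 11: a tanh activation, sampling probabilities proportional to |u_j|, a least-squares refit of u with θ fixed, and a synthetic wide network in which only 8 of 2000 neurons carry non-negligible weight.

```python
import numpy as np

def prune_and_refit(u, theta, X, y, keep, rng, importance=True):
    """Select `keep` neurons, freeze their theta, and refit u by least squares.

    importance=True samples neurons with probability proportional to |u_j|;
    importance=False samples uniformly. Returns (indices, training MSE).
    """
    m = u.size
    p = np.abs(u) / np.abs(u).sum() if importance else np.full(m, 1.0 / m)
    idx = rng.choice(m, size=keep, replace=False, p=p)
    H = np.tanh(X @ theta[idx].T)                    # fixed first-layer features
    u_new, *_ = np.linalg.lstsq(H, y, rcond=None)    # convex refit of u
    return idx, float(np.mean((H @ u_new - y) ** 2))

# Synthetic wide network: 8 effective neurons, the rest nearly "wasted".
rng = np.random.default_rng(0)
m, d, n = 2000, 5, 500
theta = rng.standard_normal((m, d))
u = 1e-5 * rng.standard_normal(m)
u[:8] = 1.0
X = rng.standard_normal((n, d))
y = np.tanh(X @ theta.T) @ u

_, err_importance = prune_and_refit(u, theta, X, y, 10, np.random.default_rng(1), True)
_, err_uniform = prune_and_refit(u, theta, X, y, 10, np.random.default_rng(2), False)
```

Since the refit over u is a linear least-squares problem, any difference between `err_importance` and `err_uniform` comes purely from which neurons were kept.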
Since the random features of NTK are not learned during training, there have been several works that tried to investigate the difference between the lazy training condition in NTK and the actual training process of NNs [52], [58], [63], [62], [64], [65]. Notably, [63] showed that random features cannot be used to learn even a single ReLU neuron unless the number of hidden units is exponentially large in d. The authors in [62] considered the quadratic activation function and showed that gradient descent achieves a lower prediction risk in the actual training process when the number of neurons is small. As discussed earlier, [58] showed that with appropriate regularization, NNs can learn optimal feature representations that are superior to random features.
Because MF outperforms NTK from the feature learning perspective, in many cases better generalization bounds can be obtained for MF than for NTK. In particular, as shown by [58] and [52], learning a two-layer NN with an ℓ2-norm regularizer on the weights is equivalent to solving an ℓ1-norm regularized problem in the feature space. This is consistent with the empirical observation that MF learns meaningful features, because ℓ1 regularization has a strong capability for feature selection and sparse representation learning. In contrast, kernel methods typically consider an ℓ2-norm regularizer. Moreover, in [52] a simple d-dimensional distribution was constructed for which MF needs O(d) samples to learn, whereas kernel methods (including NTK) require at least Ω(d²) samples, which demonstrates the superiority of MF in terms of generalization. Recently, [66] obtained an interesting result which shows that even without a regularizer, gradient descent can implicitly converge to the ℓ1-norm regularized solution in the mean-field limit.
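The link between ℓ2 regularization on the weights and ℓ1 regularization in feature space can be seen from a standard rescaling argument. The following is a sketch under assumptions not stated in the text (scalar u_j, and an activation that is positively homogeneous of degree one, such as ReLU); it is not necessarily the exact argument used in [58] or [52].

```latex
% Assumptions: scalar u_j, and h(c\theta, x) = c\,h(\theta, x) for all c > 0 (e.g. ReLU).
% Step 1: rescaling a hidden unit leaves the function unchanged.
u_j\,h(\theta_j, x) = \frac{u_j}{c_j}\,h(c_j\theta_j, x) \qquad \text{for every } c_j > 0 .
% Step 2: minimize the l2 penalty over the free scale (AM-GM, attained at c_j^2 = |u_j| / \|\theta_j\|_2).
\min_{c_j > 0}\ \frac{1}{2}\left(\frac{u_j^2}{c_j^2} + c_j^2\,\|\theta_j\|_2^2\right) = |u_j|\,\|\theta_j\|_2 .
% Step 3: with w_j = u_j\|\theta_j\|_2 and \bar\theta_j = \theta_j/\|\theta_j\|_2,
% the penalty becomes an l1 norm on the coefficients of normalized features.
\min_{\{c_j\}}\ \frac{\lambda}{2}\sum_{j=1}^{m}\left(\frac{u_j^2}{c_j^2} + c_j^2\,\|\theta_j\|_2^2\right)
  = \lambda\sum_{j=1}^{m} |w_j| ,
\qquad
\sum_{j=1}^{m} u_j\,h(\theta_j, x) = \sum_{j=1}^{m} w_j\,h(\bar\theta_j, x) .
```

In words, once each unit's scale is chosen optimally, the ℓ2 penalty on (u_j, θ_j) equals an ℓ1 penalty on the coefficients w_j of unit-norm features, which is why sparse, selected features are favored.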

VII. OVERPARAMETERIZED DEEP NEURAL NETWORKS
We have explained the concepts of NTK and MF using two-layer neural networks. A number of papers have considered extensions of these models to deep neural networks.

A. NTK
In general, NTK can be generalized to DNNs without much difficulty, e.g., [24], [18], [22], and the technique can also be generalized to handle more complex topological structures, such as Recurrent NNs [21] and Residual NNs [18]. In these approaches, with proper initialization, we can linearize the nonlinear NN models at the initialization, similar to what we have done for two-layer NNs. By showing that the training process with small learning rates leads to zero training error within a small neighborhood of the initialization, the entire NN training lies in the so-called NTK (or lazy-learning) regime, and the linear approximation is effective throughout training. Similar to the situation of two-layer NNs, this requires specialized initialization and specialized learning rates, which are often different from those used by practitioners.
One difficulty with the NTK approach for deep neural networks is that it cannot satisfactorily explain the benefit of using deeper structures. This is because the NTK view essentially corresponds to a linear model using an infinite dimensional random feature representation that defines the underlying NTK. Although deeper structures add more and more random features, these features are not learned, similar to the situation of two-layer NNs.
If we want to apply NTK to real problems, efficient computation of the NTK is necessary, which may require special design. For example, an efficient exact algorithm to compute the Convolutional NTK was proposed in [30]. In practice, kernel methods have a quadratic complexity with respect to the number of training data, and the computational cost can be prohibitive for big data applications. Various algorithms have been investigated to alleviate this problem in the traditional kernel learning literature. We refer the readers to [14] and references therein.
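The quadratic cost mentioned above comes from the n × n Gram matrix that a kernel method must form and solve against. A minimal sketch follows; it assumes the closed-form infinite-width kernel for a two-layer ReLU network with unit-norm inputs and only the first layer trained, K(x, x′) = x·x′ (π − arccos(x·x′))/(2π). This specific kernel and the function names are illustrative assumptions, and it is not the Convolutional NTK of [30].

```python
import numpy as np

def relu_ntk_gram(X):
    """Gram matrix K_ij = g_ij * (pi - arccos(g_ij)) / (2 pi), g_ij = x_i . x_j.

    Rows of X are normalized to unit length first. Storage is O(n^2).
    """
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    G = np.clip(Xn @ Xn.T, -1.0, 1.0)       # clip guards arccos against round-off
    return G * (np.pi - np.arccos(G)) / (2.0 * np.pi)

def kernel_ridge_fit(K, y, ridge=1e-6):
    """Solve (K + ridge * I) alpha = y; a dense solve costs O(n^3) time."""
    n = K.shape[0]
    return np.linalg.solve(K + ridge * np.eye(n), y)

rng = np.random.default_rng(0)
X_train = rng.standard_normal((30, 10))
y_train = rng.standard_normal(30)
K_train = relu_ntk_gram(X_train)
alpha = kernel_ridge_fit(K_train, y_train)
```

Doubling n quadruples the memory for `K_train` and multiplies the dense solve time by roughly eight, which is what motivates the approximation schemes from the kernel learning literature.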

B. MF
Unlike NTK, it is nontrivial to generalize MF to deep neural networks. There have been a number of recent works that attempted to generalize MF [67], [68], [69], [70], [57], [71], [66]. This is still an active research area which has not matured. We will thus describe some of the challenges and the latest results.
First, it is not easy to formulate the continuous limit of DNNs. Consider a three-layer NN as an example. The hidden units of the upper layer are functions of the hidden units of the lower layer. However, if we allow the number of hidden units of the lower layer to go to infinity (as we do in the two-layer NN), then there are infinitely many features for every hidden unit of the upper hidden layer. If we let the number of hidden units of the upper layer go to infinity as well, then there are infinitely many such functions, each with infinitely many features (each feature corresponds to a hidden unit of the lower layer). It is nontrivial to model these functions mathematically. One of the attempted approaches is to model DNNs with nested measures (also known as multi-layer measures [72], [73]). However, as mentioned in [67], the mathematical limit may not be well-defined. Another approach considered the continuous limit of DNNs under special conditions. For example, [69], [70] investigated the continuous limit of DNNs under the initialization that all weights are i.i.d. realizations of a fixed distribution (with finite variance) independent of the number of hidden units. Unfortunately, in such a setting, all neurons in a middle layer have the same output value at initialization, and this property holds during the entire training process. It is clearly not an appropriate mathematical model for general DNNs. In real applications, initialization strategies, e.g., [74], [45], sample the NN weights from N(0, O(m)), with variance approaching infinity as the number of hidden nodes m goes to ∞. More recently, [71] designed a new mean-field framework for DNNs, in which a DNN is represented by probability measures and functions over the outputs of the hidden units instead of the neural network parameters. This new representation overcomes the degenerate situation that existed in some earlier attempts, where each middle layer essentially has only one meaningful hidden unit.
A second difficulty is that a DNN cannot be regarded as a linear model with respect to the distribution of the parameters. Unlike the case of two-layer NNs, which are convex with respect to a reparameterization of the model using the corresponding feature distribution, it is much harder to derive a convex formulation of DNNs with an appropriate reparameterization. Therefore, the global minimum is hard to identify, and gradient descent potentially leads to sub-optimal solutions. Recently, [70] and [71] showed that gradient descent can find a global minimal solution for three-layer and multi-layer DNNs, respectively. Notably, they assumed that no regularization is imposed and that the activation function can achieve universal approximation. Under such conditions, the global minimum can be identified as 0. Another remarkable work is [68] and the closely related study [75], in which the authors introduced a new technique, called neural feature repopulation (NFR), to reparameterize DNNs. Using the NFR technique, one can decouple the distributions of the features from the loss function, and their impact can be integrated into the regularizer. Surprisingly, with suitable regularizers, it can be shown that the overall objective function under the special reparameterization is convex, which is analogous to the case of two-layer NNs. Moreover, they proposed a new optimization process to find the global minimal solution under such regularizers. It remains an open theoretical question to show that gradient descent type algorithms can find a global optimal solution for the associated convex formulation.

VIII. COMPLEXITY ANALYSIS FOR OVERPARAMETERIZED NNS
The theoretical properties of the linearized system in the NTK view are much easier to analyze. Therefore it is possible to prove rigorous convergence and statistical complexity bounds under the NTK regime, and polynomial convergence rates can be obtained under various conditions.
For two-layer NNs, for example, by assuming that the minimum eigenvalue of the kernel matrix for the training data, denoted as λ_0, is positive, it was shown in [41] that when the number of hidden units is greater than n^6 λ_0^{-4} δ^{-2}, with a learning rate of η = O(λ_0 n^{-2}), then with probability at least 1 − δ, the gradient descent method finds an ε-global minimum in O(η^{-1} λ_0^{-1} log(ε^{-1})) steps. Before [41], [20] studied a different data assumption. They showed that a polynomial convergence rate can be achieved under appropriate separability conditions of the data.
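To get a feel for the magnitudes involved, the quantities in this bound can be evaluated on a concrete data set. The sketch below makes several assumptions that are not from the text: the kernel matrix is taken to be the infinite-width two-layer ReLU Gram matrix for unit-norm inputs, H_ij = x_i·x_j (π − arccos(x_i·x_j))/(2π), which may differ from the kernel used in [41]; all constants hidden by the O(·) notation are set to 1; and the function name is hypothetical.

```python
import math
import numpy as np

def ntk_two_layer_budget(X, delta=0.1, eps=1e-3):
    """Evaluate n^6 / (lam0^4 delta^2), eta = lam0 / n^2, and the step count.

    Constants hidden by O(.) are ignored (set to 1), so the numbers are
    order-of-magnitude indications only.
    """
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    G = np.clip(Xn @ Xn.T, -1.0, 1.0)
    H = G * (np.pi - np.arccos(G)) / (2.0 * np.pi)   # assumed kernel matrix
    lam0 = float(np.linalg.eigvalsh(H)[0])           # smallest eigenvalue
    n = X.shape[0]
    eta = lam0 / n**2
    return {
        "lambda_0": lam0,
        "width": n**6 / (lam0**4 * delta**2),
        "eta": eta,
        "steps": math.log(1.0 / eps) / (eta * lam0),
    }

rng = np.random.default_rng(0)
budget = ntk_two_layer_budget(rng.standard_normal((20, 10)))
```

Even for n = 20 training points the required width is far larger than n⁶, which illustrates why these polynomial bounds are regarded as theoretical guarantees rather than practical prescriptions.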
The above results can be generalized to DNNs. For example, in [18], the authors showed that as long as the number of hidden units is larger than Ω̃(poly(λ_0, n) 2^L), gradient descent finds a global minimal solution in Õ(poly(λ_0, n) 2^L) steps for standard L-layer DNNs, where Ω̃ and Õ hide poly-logarithmic terms. Moreover, for Residual NNs, the exponential dependency on L can be reduced to a polynomial dependency. Similarly, the authors in [24], [22] adopted the data assumption in [20] and achieved polynomial complexities. Some other researchers, e.g., [21], [76], [77], [78], have tried to model NNs beyond the linear approximation of NTK, typically with a second-order approximation. For example, [21], [77] and [79] proposed a training procedure with randomization techniques to extract the second-order approximation, sharpening complexity bounds. In general, the second-order approximation satisfies the so-called strict saddle property [80], and is thus efficiently solvable by saddle-escaping algorithms, e.g., [81], [82], [83]. Specifically, [77] showed that complexity bounds for learning polynomials on uniform distributions are lower than those of NTK by a factor of O(d).

IX. OTHER MATHEMATICAL MODELS OF NNS
A number of recent works have considered approximation properties of neural networks, leading to a better understanding of why deep neural networks are superior to shallow networks in terms of function approximation. It is well known that two-layer neural networks are universal approximators [84], [85]. However, for certain functions that can be represented by deep neural networks with a small number of nodes, an exponential number of nodes is needed to represent them with shallow neural networks [86], [87]. Related results show that deep neural networks can represent any function with a constant number of nodes per layer [88], [89], which suggests a trade-off between depth and width in terms of universal approximation. More generally, in order to represent a complex function, we can increase either a network's width or its depth. It was observed in practice that it is beneficial to increase both depth and width simultaneously to balance the trade-off [90].
Before the development of recent mathematical models of overparameterized NN such as NTK and MF, which tried to formulate the NN optimization procedure as convex optimization, there were developments in the machine learning research community that focused on the non-convex optimization aspect of NNs.
In order to understand the NN training process, a number of earlier works studied the loss landscape of NNs. For example, several researchers observed that an NN's generalization ability is related to the sharpness/flatness of the local minimal solution resulting from training, and discussed different methods to characterize flatness [91], [92], [93]. There are also works, e.g., [94], [95], which attempted to understand the Hessian matrices of neural networks. With the help of restrictive assumptions, or for specialized models, a number of earlier works, e.g., [96], [97], [98], [99], [100], [101], [102], [103], studied theoretical characterizations of NN landscapes, for example under the assumption that the input follows a Gaussian distribution or that the activation function is linear or quadratic. These results, in general, showed that for any NN that satisfies the strict saddle property, standard saddle-escaping algorithms can converge to a global minimal solution. We also refer the readers to the review [104] and the references therein for the global landscape of NNs.
Related studies of neural network training were conducted from the generic non-convex optimization point of view, where a main issue was the complexity of stochastic optimization algorithms such as SGD to escape saddle points and converge to local minimal solutions [80], [81]. This question was resolved satisfactorily for general non-convex problems, where both the convergence rate of SGD and that of the optimal stochastic algorithm were known [82], [83], [105].
The kernel representation in the RF/NTK view has a natural connection to Gaussian processes, which have a Bayesian statistics interpretation. The earliest study of overparameterized infinite-width NNs was motivated by this Bayesian interpretation [8], [9], [106], where the relationship between infinite-width NNs and Gaussian processes was investigated based only on random features. More recently, some papers have also investigated the kernel form of the NTK regime to perform Bayesian inference [107], which has a larger function class than RF. Moreover, the Bayesian interpretation can be used to derive uncertainty estimation for neural networks. For example, it was argued in [108] using the Gaussian process point of view that dropout [5] can be used to obtain uncertainty estimation for neural networks.

X. CONCLUSION AND FURTHER DIRECTIONS
Neural networks have become an essential tool in machine learning and artificial intelligence, with a wide range of applications. Although there has been significant empirical progress, theoretical understanding is rather limited, due to the complexity caused by non-convexity in NN modeling. It has been noted by practitioners that overparameterized NNs are easier to optimize, and that the solutions are often reproducible with good performance, which is difficult to explain from a non-convex optimization point of view. To explain this mystery, numerous works in recent years have developed mathematical models for overparameterized neural networks. Due to these efforts, we are beginning to understand how neural networks work, especially in the continuous limit of overparameterized NNs. Surprisingly, under these models, overparameterized neural networks behave more like convex systems, which can explain why they lead to the reproducible results observed in practice. There are many research activities in developing better mathematical theories of DNNs. We outline some of the current directions that we feel are particularly promising.
• For two-layer NNs, NTK can achieve a polynomial computational cost, but only a relatively weak generalization result. In comparison, MF achieves better generalization, but lacks quantitative computational results under general conditions. Therefore, we need a more sophisticated analysis of two-layer NNs that shows better generalization behavior than NTK (especially in terms of feature learning) with polynomial computational complexity.
• The understanding of deep NNs is still quite limited. Although it can be shown that GD converges globally in the NTK regime, in the existing analysis the weight updates can be ignored except for the second-to-last layer. This is clearly inconsistent with practice. Moreover, because NTK is effectively a linear (single-layer) model with respect to random features, existing results on DNN approximation imply immediately that representations requiring deep structures cannot be learned efficiently by NTK. As an example, the random features cannot even represent a single ReLU neuron unless there is an exponential number of hidden units [63]. Although this gap can potentially be addressed by the MF view, its analysis is even further behind, in that we still do not have a satisfactory theory of MF for DNNs. Even if a satisfactory theory of MF is developed, quantitative results on generalization and optimization remain open. Therefore, for DNNs, both optimization and generalization analysis require further study. As one interesting work in this direction, [109] established a principle called "backward feature correction" and showed that gradient descent can learn hierarchical features when the activation function is quadratic. We expect to see more results of this type that can truly illustrate the benefits of deep structures beyond the inherently shallow structure of NTK.
• The success of NNs has been largely attributed to their ability to learn discriminative features. Related topics such as transfer learning, pre-training, and semi-supervised learning have been actively studied by practitioners with great success. We expect more and more theoretical investigations of these topics in the future, which will inevitably lead to a better understanding of NNs' ability to learn feature representations. This may inspire the development of more robust algorithms for representation learning.
• In recent years, practitioners have designed many specialized NN architectures and components that are effective in a variety of tasks. For example, architectures such as ResNet, CNNs, and Transformers, and components such as attention and batch normalization, have become widely used. We are starting to see theoretical analysis of such architectures and components. For example, in [110], the authors provided theoretical justification for ResNet, and showed that ResNets generalize better than DNNs by comparing the neural tangent kernels. We expect that more sophisticated theoretical analysis of special NN structures can lead to better understanding, and eventually to more effective neural architecture design.
• Practitioners have made many interesting empirical observations of NNs that remain to be explained theoretically. For example, [111] showed that NN learning exhibits a double descent phenomenon: as we increase the model size or the number of training epochs, the test performance first deteriorates and then improves. In another work, [112] observed the so-called local elasticity phenomenon, where the prediction at a datum x is not significantly affected by a stochastic gradient descent update at a datum x′ that is not close to x. As another example, [113] observed a phenomenon called Neural Collapse, which states that predicted class means collapse to the vertices of a Simplex Equiangular Tight Frame at the final training stage. Developing theoretical explanations of such practical phenomena can lead to a better understanding of how NNs work.
Finally, we conclude that the study of overparameterized NNs is still in its infancy. We expect that the deep mathematical insights obtained from these theoretical investigations will help us develop solid theoretical foundations for NNs, and motivate effective algorithms and architectures in the coming years.
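To make the Simplex Equiangular Tight Frame geometry mentioned in the Neural Collapse item above concrete, the following minimal sketch builds the standard k-vertex construction sqrt(k/(k-1)) (I - (1/k) 1 1^T) and prints its Gram matrix. It only illustrates the target geometry and is not the experimental code of [113]; the function names are hypothetical.

```python
import math


def simplex_etf(k):
    """Return k vectors in R^k forming a simplex ETF.

    Row i is sqrt(k / (k - 1)) * (e_i - (1/k) * ones), the standard construction.
    """
    scale = math.sqrt(k / (k - 1))
    return [
        [scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
        for i in range(k)
    ]


def gram(vectors):
    """Matrix of pairwise inner products."""
    return [[sum(a * b for a, b in zip(u, v)) for v in vectors] for u in vectors]


if __name__ == "__main__":
    k = 4
    for row in gram(simplex_etf(k)):
        # Diagonal entries are 1; off-diagonal entries are -1 / (k - 1).
        print(["%.3f" % value for value in row])
```

All vertices have unit norm and every pair has the same inner product -1/(k-1), which is the most negative value that k equal-norm vectors can share pairwise. This is the sense in which the class means are maximally and equally separated.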

XI. CODE AVAILABILITY
Code for illustration is available on GitHub (footnote 2).
Tong Zhang received the BA degree in mathematics and computer science from Cornell University, and the PhD degree in computer science from Stanford University. He is a chair professor of Computer Science and Mathematics at the Hong Kong University of Science and Technology. He is a fellow of IEEE, the American Statistical Association, and the Institute of Mathematical Statistics. His research interests are machine learning, big data, and their applications.
C. Fang is with the University of Pennsylvania, USA. H. Dong and T. Zhang are with the Hong Kong University of Science and Technology.

i) : i = 1, ..., n} are the training data, $L$ is a loss function, such as the soft-max loss for a $k$-class classification problem with $v \in \mathbb{R}^k$ and $y \in \{1, \ldots, k\}$: $L(v, y) = -\log \frac{\exp(v_y)}{\sum_{j=1}^{k} \exp(v_j)}$, and $R(u, \theta)$ is a regularizer (also called weight decay in the neural network literature), such as the $L_2$ regularization: $\lambda_u \|u_j\|_2^2 + \lambda_\theta \|\theta_j\|_2^2$.

Fig. 6. Neural network optimization from a mean field perspective.

TABLE I. COMPARISON OF THE NTK VIEW AND THE MF VIEW

TABLE II. PROS AND CONS OF RF, NTK, AND MF. NOTE THAT THOUGH RF AND NTK CAN BE APPLIED TO DNNS, THEY ARE STILL LINEAR (ONE-LAYER) MODELS AND DO NOT FULLY EXPLORE THE HIERARCHICAL ARCHITECTURE. SEE THE DISCUSSION ON FURTHER DIRECTIONS IN SECTION X.