Convex Formulation of Overparameterized Deep Neural Networks

Analysis of over-parameterized neural networks has drawn significant attention in recent years. It was shown that such systems behave like convex systems under various restricted settings, such as for two-level neural networks, and when learning is restricted locally to the so-called neural tangent kernel space around specialized initializations. However, there are no theoretical techniques that can analyze fully trained deep neural networks encountered in practice. This paper solves this fundamental problem by investigating such overparameterized deep neural networks when fully trained. We generalize a new technique called neural feature repopulation, originally introduced in (Fang et al., 2019a) for two-level neural networks, to analyze deep neural networks. It is shown that, under suitable representations, overparameterized deep neural networks are inherently convex, and when optimized, the system can learn effective features for the underlying learning task under mild conditions. This new analysis is consistent with empirical observations that deep neural networks are capable of learning efficient feature representations. Therefore, the highly unexpected result of this paper can satisfactorily explain the practical success of deep neural networks. Empirical studies confirm that the predictions of our theory are consistent with results observed in practice.


Introduction
Deep Neural Networks (DNNs) have achieved great successes in numerous real applications, such as image classification (Krizhevsky et al., 2012; Simonyan & Zisserman, 2014; He et al., 2016), face recognition (Sun et al., 2014), video understanding (Yue-Hei Ng et al., 2015), natural language processing (Bahdanau et al., 2014; Luong et al., 2015), etc. However, compared to the empirical successes, the theoretical understanding of DNNs falls far behind. Part of the reason may be the general perception that DNNs are highly non-convex learning models.
In recent years, there have been significant breakthroughs (Mei et al., 2018; Chizat & Bach, 2018; Du et al., 2019a; Allen-Zhu et al., 2019) in analyzing over-parameterized Neural Networks (NNs), which are NNs with a massive number of neurons in the hidden layer(s). It is observed from empirical studies that such NNs are easy to train (Zhang et al., 2016). And it was noted that under some restrictive settings, such as two-level NNs (Mei et al., 2018; Chizat & Bach, 2018) and when learning is restricted locally to the neural tangent kernel space around certain initializations (Du et al., 2019a; Allen-Zhu et al., 2019), NNs behave like convex systems when the number of hidden neurons goes to infinity. Unfortunately, to the best of our knowledge, existing studies fail to analyze fully trained DNNs encountered in practice. In particular, the existing analyses cannot explain how DNNs learn discriminative features specific to the underlying learning task, as observed in real applications (Zeiler & Fergus, 2014).
To remedy the gap between existing theories and practical observations, this paper develops a new theory that applies to fully trained DNNs. Following an argument similar to the analysis of two-level NNs in Fang et al. (2019a), we introduce a new theoretical framework, called neural feature repopulation (NFR), to reformulate over-parameterized DNNs. Our results show that under suitable conditions, in the limit of an infinite number of hidden neurons, DNNs are infinite-dimensional convex learning models under appropriate re-parameterization. In our framework, given a DNN, the hidden layers are regarded as features and the model output is given by a simple linear model using the features of the top hidden layer. The output of a DNN, in the limit of an infinite number of hidden neurons, depends on the distributions of the features and the final linear model. We show that using the NFR technique, it is possible to decouple the distributions of the features from the loss function so that their impact is integrated into the regularizer. This largely simplifies the objective function. Under our framework, the feature learning process is characterized by the regularizer. When suitable regularizers are imposed, the overall objective function under the special re-parameterization is convex, and it guarantees that the DNN learns useful feature representations under mild conditions. Unlike the Neural Tangent Kernel approach, our theoretical framework for DNNs does not require the variables to be confined in an infinitesimal region. Therefore it can explain the ability of fully trained DNNs to learn target feature representations, which matches the empirical observations. More concretely, the paper is organized as follows. Section 2 discusses the relationship of this paper to earlier studies, especially recent works on the analysis of over-parameterized NNs.
Section 3 introduces the definition of discrete DNNs, and we introduce an importance weighting formulation which eventually motivates our NFR formulation. Section 4 describes the continuous DNN when the number of hidden nodes goes to infinity in the discrete DNN. In this formulation, each hidden layer is represented by a distribution over its hidden nodes that represent functions of the input. We can further interpret a discrete DNN as a random sample of hidden nodes from a continuous DNN at each layer, and then study the variance of such random discretization. The variance formula motivates the study of a class of regularization conditions for DNNs. Using the connection between discrete and continuous (over-parameterized) DNNs, we introduce the process of NFR in Section 5. In this process, an over-parameterized DNN can be reformulated as a convex system that learns effective feature representations for the underlying task. Experiments are presented in Section 6 to demonstrate that our theory is consistent with empirical observations. In Section 6.1, we consider a new optimization procedure inspired by the NFR view to verify its effectiveness. Final remarks are given in Section 7.
The main contributions of this work can be described as follows.
• We propose a new framework for analyzing overparameterized deep NNs, called neural feature repopulation (NFR). It can be used to remove the effect of the learned feature distributions over hidden nodes from the loss function, and confine their effect to the regularizer alone. This significantly simplifies the objective function.
• We study a class of regularizers. With these regularizers, the over-parameterized DNN can be reformulated as a convex system using NFR under certain conditions. The global solution of such a convex model guarantees useful feature representations for the underlying learning task.
• Our theory matches empirical findings, and hence this theory can satisfactorily explain the success of fully trained overparameterized deep NNs.
We shall also mention that the paper focuses on presenting the intuitions and consequences of the new framework. In order to make the underlying ideas easier to understand, some of the analyses are not stated with complete mathematical rigor. A more formal treatment of the results will be left to future works.
Notations. For a vector x ∈ R^d, we denote ‖x‖_1 and ‖x‖_2 to be its ℓ1 and ℓ2 norms, respectively. We let x^⊤ be the transpose of x and let x_k be the value of the k-th dimension of x with k = 1, . . . , d. Let [m] = {1, 2, . . . , m} for a positive integer m. For a function f(x) : R^d → R, we denote ∇_x f to be the gradient of f with respect to x. For two real numbers a and b, we denote a ∨ b to be max(a, b). If µ and ν are two measures on the same measurable space, we write µ ≪ ν if µ is absolutely continuous with respect to ν, and µ ∼ ν if µ ≪ ν and ν ≪ µ.

Related Work
In recent years, there have been a number of significant developments to obtain better theoretical understandings of NNs. We review works that are most related to this paper. The main challenge for developing such theoretical analysis is the non-convexity of the NN model, which implies that first-order algorithms such as (Stochastic) Gradient Descent may converge to local stationary points.
In some earlier works, a number of researchers (Hardt & Ma, 2016; Freeman & Bruna, 2016; Brutzkus & Globerson, 2017; Boob & Lan, 2017; Ge et al., 2017; Bakshi et al., 2018; Ge et al., 2018) studied NNs under special conditions either on the input data or on the NN architectures. By carefully characterizing the geometric landscape of the objective function, these early works showed that some special NNs satisfy the so-called strict saddle property (Ge et al., 2015). One can then use recent results in nonconvex optimization (Jin et al., 2017; Fang et al., 2018, 2019b) to show that first-order algorithms for such NNs can efficiently escape saddle points and converge to local minima. A number of more recent theoretical analyses focused on over-parameterized NNs, which are NNs that contain a large number of hidden nodes. The motivation for overparameterization comes from the empirical observation that over-parameterized DNNs are much easier to train and often achieve better performance (Zhang et al., 2016). When the number of hidden units goes to infinity, the network naturally becomes a continuous NN. In the continuous limit, the resulting networks are found to behave like convex systems under appropriate conditions. Our work follows this line of research. In the following, we review the three existing points of view, i.e., the mean field view, the neural tangent kernel view, and the NFR view. Due to space limitations, Table 1 only summarizes some of the representative studies on these views; additional studies are discussed in the text below.

Mean Field View for Over-parameterized Two-level NNs
The Mean Field approach has recently been introduced to analyze two-level NNs. This point of view models the continuous NN as a distribution over the NN's parameters, and it studies the evolution of the distribution as a Wasserstein gradient flow during the training process (Mei et al., 2018; Chizat & Bach, 2018; Sirignano & Spiliopoulos, 2019; Rotskoff & Vanden-Eijnden, 2018; Mei et al., 2019; Dou & Liang, 2019; Wei et al., 2018). The process can be represented by a partial differential equation, which can be further studied mathematically. For two-level continuous NNs, it is known that the objective function with respect to the distribution of parameters is convex in the continuous limit. And it can be shown that (noisy) Gradient Descent can find globally optimal solutions under certain conditions.
The benefit of the Mean Field view is that it mathematically characterizes the entire training process of an NN. However, it relies on the special observation that a two-level continuous NN is naturally a linear model with respect to the distribution of NN parameters, and this observation does not carry over to multi-level architectures. Consequently, it is difficult to generalize the Mean Field view to analyze DNNs without losing convexity. In fact, in a recent attempt, Nguyen (2019) applied the Mean Field technique to DNNs, but could only obtain the evolution dynamics of Gradient Descent. Since a DNN is no longer a linear model with respect to the distribution of the parameters, Nguyen (2019) was unable to show that the system is convex. Therefore, similar to the situation of Stochastic Gradient Descent for discrete DNNs, Gradient Descent for continuous DNNs can still lead to suboptimal solutions.
Neural Tangent Kernel View for Over-parameterized NNs

The Neural Tangent Kernel (NTK) view (Du et al., 2019a; Allen-Zhu et al., 2019) studies learning that is restricted locally to an infinitesimal region around certain initializations, where the NN behaves like a convex system. However, this view cannot explain the feature learning ability of NNs. This is because NTK essentially approximates a nonlinear NN by a linear model over the infinite-dimensional random features associated with the NTK, and these features are not learned from the underlying task. In contrast, it is well known empirically that the success of NNs largely relies on their ability to learn discriminative feature representations (LeCun et al., 2015). Therefore the NTK view does not match the empirical observations.
A recent attempt to justify NTK is given by Arora et al. (2019a), who proposed an efficient exact algorithm to compute the Convolutional Neural Tangent Kernel. However, the classification accuracy of 77% on the CIFAR-10 dataset obtained by kernel regression using NTK is 5% less than that of the corresponding fully trained Convolutional NNs, and is at least 15% less than the accuracies achieved by modern NNs such as ResNet (He et al., 2016).

Neural Feature Repopulation for Over-parameterized NNs
More recently, Fang et al. (2019a) proposed the NFR view for analyzing two-level NNs. While the Mean Field view does not have the concept of "layer" in its analysis of two-level NNs, the NFR view treats the top-layer linear model and the bottom-layer feature learning separately. The dynamics of feature learning in an NN is modeled by a "repopulation" process in NFR. It was shown in (Fang et al., 2019a) that under certain conditions, two-level NNs can learn a near-optimal distribution over the features in terms of efficient representation with respect to the underlying task.
Our current work can be regarded as a non-trivial generalization of the NFR view from two-level NNs to deep NNs. Specifically, we employ the NFR technique to simplify the objective function of DNN training by showing that it is possible to reparameterize a DNN as a linear model over learned features. Moreover, the feature learning process is decoupled from the loss function, and is determined by the regularizer. This reparameterization significantly simplifies the overall objective function. We introduce a class of regularizers that guarantee the convexity of the overall objective function, and we show that efficient distributions over features can be obtained as the result of training. Compared to NTK (Du et al., 2019a; Allen-Zhu et al., 2018, 2019), NFR is more consistent with empirical observations, because the NN parameters are no longer restricted to an infinitesimal region. This implies that meaningful features can be learned from training.

Discrete Deep Neural Networks
In this section, we introduce discrete fully connected deep neural networks. We also introduce an importance sampling scheme for the discrete DNN, which motivates the NFR technique later in the paper.

Standard DNN
In machine learning, we are interested in prediction problems, where we are given an input vector x = [x 1 , . . . , x d ] ∈ R d , and we want to predict its corresponding output y.
In general, we want to learn a function f̂(x) ∈ R^K that can be used for prediction. The quality of prediction is measured by a loss function φ(·, ·), which is typically convex in the first argument. For regression, where y ∈ R^K, we often use the least squares loss

$\phi(\hat f(x), y) = \|\hat f(x) - y\|_2^2.$

For classification, where y ∈ [K], we often use the logistic loss, that is

$\phi(\hat f(x), y) = -\log \frac{\exp(\hat f_y(x))}{\sum_{k=1}^{K} \exp(\hat f_k(x))}.$

In this paper, we consider a deep neural network f̂(x) with L hidden layers, which can be defined recursively as a function of x. First, we let f̂^{(0)}_j(x) = x_j for j ∈ [d], with m^{(0)} = d. For ℓ ∈ [L], we define the nodes in the ℓ-th layer as

$\hat f^{(\ell)}_j(x) = h^{(\ell)}\Big(\frac{1}{m^{(\ell-1)}} \sum_{k=1}^{m^{(\ell-1)}} w^{(\ell)}_{j,k}\, \hat f^{(\ell-1)}_k(x)\Big), \quad j \in [m^{(\ell)}], \qquad (3.1)$

where h^{(ℓ)} is the activation function and w^{(ℓ)} ∈ R^{m^{(ℓ)} × m^{(ℓ-1)}} is the weight matrix of the ℓ-th layer comprised of m^{(ℓ)} rows w^{(ℓ)}_j, j ∈ [m^{(ℓ)}]. At the top layer, we define the output f̂(x) as

$\hat f(x) = \frac{1}{m^{(L)}} \sum_{j=1}^{m^{(L)}} u_j\, \hat f^{(L)}_j(x), \qquad (3.2)$

where u_j ∈ R^K is a vector. This defines an (L + 1)-level fully connected deep neural network with m^{(ℓ)} nodes in layer ℓ. To train a discrete deep neural network f̂, we minimize the following objective function:

$\hat Q(w, u) = J(\hat f) + \hat R(w, u),$

where w = {w^{(1)}, . . . , w^{(L)}}, J(f̂) = E_{x,y} φ(f̂(x), y) with φ(·, ·) being the loss function, and R̂(w, u) is the regularizer, which takes the form

$\hat R(w, u) = \sum_{\ell=1}^{L} \lambda^{(\ell)} \hat R^{(\ell)}(w^{(\ell)}) + \lambda^{(u)} \hat R^{(u)}(u).$

The parameters λ^{(1)}, . . . , λ^{(L)} and λ^{(u)} are non-negative hyper-parameters for the regularizer. In this paper, we are particularly interested in the following form of regularizer:

$\hat R^{(\ell)}(w^{(\ell)}) = \frac{1}{m^{(\ell)}} \sum_{j=1}^{m^{(\ell)}} r_2\Big(\frac{1}{m^{(\ell-1)}} \sum_{k=1}^{m^{(\ell-1)}} r_1\big(w^{(\ell)}_{j,k}\big)\Big), \qquad \hat R^{(u)}(u) = \frac{1}{m^{(L)}} \sum_{j=1}^{m^{(L)}} r^{(u)}(u_j). \qquad (3.3)$

This class of regularizers will be analyzed in Sections 4.4 and 5.3.
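The discrete forward pass and regularizer above can be sketched in a few lines of NumPy. This is an illustrative implementation only: the tanh activation, the layer widths, and the concrete choice r1(w) = |w|, r2(w) = w^2, r^(u)(u) = ‖u‖^2 (the ℓ1,2 regularizer studied later) are assumptions made for the example, and the 1/m mean-field scaling follows the recursion written above.

```python
import numpy as np

def forward(x, ws, u):
    """Discrete DNN output; ws[l] has shape (m_l, m_{l-1}), u has shape (m_L, K)."""
    f = x  # 0-th layer: the d input coordinates
    for w in ws:
        # f^{(l)}_j = h( (1/m^{(l-1)}) * sum_k w^{(l)}_{j,k} f^{(l-1)}_k )
        f = np.tanh(w @ f / f.shape[0])
    # top layer: f(x) = (1/m^{(L)}) * sum_j u_j f^{(L)}_j(x)
    return u.T @ f / f.shape[0]

def l12_regularizer(ws, u, lams, lam_u):
    """r1(w)=|w|, r2(w)=w^2, r_u(u)=||u||^2: row-wise squared l1 averages."""
    r = sum(lam * np.mean(np.mean(np.abs(w), axis=1) ** 2)
            for lam, w in zip(lams, ws))
    return r + lam_u * np.mean(np.sum(u ** 2, axis=1))

rng = np.random.default_rng(0)
d, m1, m2, K = 4, 8, 6, 3              # hypothetical small sizes
ws = [rng.normal(size=(m1, d)), rng.normal(size=(m2, m1))]
u = rng.normal(size=(m2, K))
x = rng.normal(size=d)
out = forward(x, ws, u)                 # K-dimensional prediction
reg = l12_regularizer(ws, u, [0.1, 0.1], 0.01)
```

The full training objective would then be the empirical average of φ(forward(x), y) plus the regularizer value.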

Importance Weighted DNN
In our framework, the hidden units of a discrete NN can be regarded as samples from a continuous distribution (refer to Section 4). Instead of uniform sampling, as in (3.1) and (3.2), we may also consider importance sampling to construct the hidden nodes f̂^{(ℓ)}_j(x) with j ∈ [m^{(ℓ)}] and the final function f̂(x). Specifically, we assign the k-th hidden node at layer ℓ an importance weighting p̂^{(ℓ)}_k > 0 with $\sum_{k=1}^{m^{(\ell)}} \hat p^{(\ell)}_k = m^{(\ell)}$, and we let the hidden nodes follow a non-uniform distribution whose probability mass at index k of layer ℓ is p̂^{(ℓ)}_k / m^{(ℓ)}. Then we can rewrite (3.1) and (3.2) as

$\hat f^{(\ell)}_j(x) = h^{(\ell)}\Big(\frac{1}{m^{(\ell-1)}} \sum_{k=1}^{m^{(\ell-1)}} \hat p^{(\ell-1)}_k\, \tilde w^{(\ell)}_{j,k}\, \hat f^{(\ell-1)}_k(x)\Big), \qquad \hat f(x) = \frac{1}{m^{(L)}} \sum_{j=1}^{m^{(L)}} \hat p^{(L)}_j\, \tilde u_j\, \hat f^{(L)}_j(x),$

where we have replaced the weights w^{(ℓ)}_{j,k} and u_j with $\tilde w^{(\ell)}_{j,k} = w^{(\ell)}_{j,k} / \hat p^{(\ell-1)}_k$ and $\tilde u_j = u_j / \hat p^{(L)}_j$, respectively. We can verify that under this transformation, the functions f̂^{(ℓ)}_j and f̂ remain unchanged. Similarly, for the regularizers, by replacing the weights w^{(ℓ)}_{j,k} and u_j with $\tilde w^{(\ell)}_{j,k}$ and $\tilde u_j$, respectively, and by replacing the uniform distribution over the hidden units with the corresponding non-uniform distribution, we obtain the importance weighted regularizers expressed in terms of p̂, w̃, and ũ.
An important observation is that under the importance weighting transformation, the function values on all hidden nodes and the final output value remain unchanged, while the regularization value changes. This means that the importance weighting parameters p̂ only appear in the regularization term. This observation eventually leads to the NFR technique for reparameterizing continuous DNNs.
The discrete importance weighting formula presented here provides intuition for our NFR method for continuous DNNs; the detailed analysis is provided in Section 5.1.
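The invariance just described is easy to verify numerically. The sketch below (a hypothetical two-level instance with tanh activation) rescales each top-layer node by an importance weight p_j while dividing u_j by the same p_j, and checks that the output is unchanged to machine precision:

```python
import numpy as np

rng = np.random.default_rng(1)
d, m1, K = 5, 16, 2
w1 = rng.normal(size=(m1, d))
u = rng.normal(size=(m1, K))
x = rng.normal(size=d)

def net(w1, u):
    # uniform-sampling network: (1/m1) * sum_j u_j f1_j
    f1 = np.tanh(w1 @ x / d)
    return u.T @ f1 / m1

# importance weights with (1/m1) * sum_j p_j = 1
p = rng.uniform(0.5, 2.0, size=m1)
p *= m1 / p.sum()

# reweighted network: node j carries mass p_j/m1 and weight u_j / p_j
f1 = np.tanh(w1 @ x / d)
out_tilde = (u / p[:, None]).T @ (f1 * p) / m1

assert np.allclose(net(w1, u), out_tilde)   # output is invariant
```

Only the regularizer value would differ between the two parameterizations, since it averages u_j / p_j under the new node distribution.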

Continuous Deep Neural Networks
When m^{(ℓ)} → ∞ for all ℓ ∈ [L], we can define a continuous DNN according to the definition of the discrete DNN in Section 3. In a continuous DNN, each hidden node in the ℓ-th layer is associated with a real-valued function of the input x. It can be characterized by the weights connecting it to the hidden nodes in the (ℓ−1)-th layer, and can therefore be represented as a real-valued function defined on those nodes. The space of the hidden nodes at layer ℓ, i.e. all such real-valued functions, is denoted as Z^{(ℓ)} in this paper, and can be regarded as an infinite-dimensional feature (representation) of the input data x. A continuous DNN is obtained by defining a probability measure ρ^{(ℓ)} on the hidden nodes of each hidden layer ℓ, which is equivalent to a probability measure on real-valued functions, or features, of the input x. A discrete DNN can be obtained by sampling m^{(ℓ)} hidden nodes, i.e. elements of Z^{(ℓ)}, from ρ^{(ℓ)} at each layer ℓ. We denote ρ = {ρ^{(0)}, . . . , ρ^{(L)}} and present the details below.

Continuous DNN Formulation
By convention, we let the 0-th layer be the input layer and denote Z^{(0)} = [d] to be its node space, corresponding to the d components of the input x. We let ρ^{(0)} be a probability measure over Z^{(0)}, and for each node z^{(0)} ∈ Z^{(0)}, we let

$f^{(0)}(\rho, z^{(0)}; x) = x_{z^{(0)}}.$

Now consider the ℓ-th layer with ℓ ∈ [L]. For conceptual simplicity, in this paper we let Z^{(ℓ)} be the class of measurable real-valued functions on Z^{(ℓ-1)}. Given z^{(ℓ)} ∈ Z^{(ℓ)} and z^{(ℓ-1)} ∈ Z^{(ℓ-1)}, we define

$w(z^{(\ell)}, z^{(\ell-1)}) = z^{(\ell)}(z^{(\ell-1)}).$

Because z^{(ℓ)} can be regarded as a hidden node in layer ℓ, and z^{(ℓ-1)} as a hidden node in the (ℓ−1)-th layer, w(z^{(ℓ)}, z^{(ℓ-1)}) is the analogue of w^{(ℓ)}_{ij} in the discrete DNN, i.e. the weight connecting nodes i and j in layers ℓ and ℓ−1, respectively. Using this notation, we define the function associated with the node z^{(ℓ)} in the ℓ-th layer of the continuous NN as follows:

$f^{(\ell)}(\rho, z^{(\ell)}; x) = h^{(\ell)}\Big(\int w(z^{(\ell)}, z^{(\ell-1)})\, f^{(\ell-1)}(\rho, z^{(\ell-1)}; x)\, \mathrm{d}\rho^{(\ell-1)}(z^{(\ell-1)})\Big),$

where h^{(ℓ)}(·) is the activation function of the ℓ-th layer. Moreover, we let ρ^{(ℓ)} be a probability measure over Z^{(ℓ)}. Finally, for the output layer, let u(·) : Z^{(L)} → R^K be a K-dimensional vector-valued function on Z^{(L)}; then we can define the final output of the continuous DNN as

$f(\rho, u; x) = \int u(z^{(L)})\, f^{(L)}(\rho, z^{(L)}; x)\, \mathrm{d}\rho^{(L)}(z^{(L)}).$

The objective function for the continuous NN takes the form

$Q(\rho, u) = J(f(\rho, u; \cdot)) + R(\rho, u), \qquad R(\rho, u) = \sum_{\ell=1}^{L} \lambda^{(\ell)} \int r_2\Big(\int r_1\big(w(z^{(\ell)}, z^{(\ell-1)})\big)\, \mathrm{d}\rho^{(\ell-1)}\Big)\, \mathrm{d}\rho^{(\ell)} + \lambda^{(u)} \int r^{(u)}\big(u(z^{(L)})\big)\, \mathrm{d}\rho^{(L)}, \qquad (4.5)$

which is the continuous analogue of (3.3). Remark 1. In this paper, unless otherwise specified, for any probability measure sequence ρ = {ρ^{(0)}, . . . , ρ^{(L)}}, we always let ρ^{(0)} be the uniform distribution on Z^{(0)}.
The above process defines a continuous DNN. In the following, we establish the relationship between the continuous and discrete DNNs.

Assumptions
Before presenting our analysis, we first specify the necessary assumptions. We note that these assumptions are rather mild and easy to satisfy.
Assumption 1 (Bounded Gradient). We assume each activation function is differentiable and its derivative is bounded. That is, there exists a constant c_0 > 0 such that

$|\nabla h^{(\ell)}(x)| \le c_0 \quad \text{for all } \ell \in [L] \text{ and } x \in \mathbb{R}.$

Assumption 2 (Continuous Gradient). We further assume that there exist two constants α > 0 and c_1 > 0 such that

$|\nabla h^{(\ell)}(x) - \nabla h^{(\ell)}(y)| \le c_1 |x - y|^{\alpha} \quad \text{for all } \ell \in [L] \text{ and } x, y \in \mathbb{R}.$

Assumption 2 is a special type of modulus of continuity for ∇h^{(ℓ)}(x). When α = 1, Assumption 2 is the standard L-smooth condition for the activation function h^{(ℓ)}(x). When α < 1, it holds more generally in the local region. When proving Theorem 2, our moment condition in Assumption 3 depends on α. We also note that commonly-used activation functions, e.g. sigmoid, tanh, and smooth relu, satisfy this assumption for all 0 < α ≤ 1.
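As a quick numerical sanity check (illustrative, not a proof), tanh meets Assumptions 1 and 2 with α = 1: its derivative 1 − tanh(x)^2 is bounded by c_0 = 1, and the derivative is Lipschitz with a constant well below, e.g., c_1 = 2 (the true bound is max|tanh''| = 4/(3√3) ≈ 0.77):

```python
import numpy as np

# Evaluate the derivative of tanh on a fine grid.
xs = np.linspace(-10.0, 10.0, 100_001)
dh = 1.0 - np.tanh(xs) ** 2

# Assumption 1: |h'(x)| <= c0 with c0 = 1.
assert dh.max() <= 1.0

# Assumption 2 (alpha = 1): finite-difference slopes of h' stay below c1 = 2.
diffs = np.abs(np.diff(dh)) / np.diff(xs)
assert diffs.max() <= 2.0
```

Similar checks apply to the sigmoid and smoothed relu activations mentioned in the text.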
Assumption 3 ((q_0, q_1)-Bounded Moment Condition). We assume that for all ℓ ∈ {2, . . . , L}, the q_0-th moment of the layer weights w(z^{(ℓ)}, z^{(ℓ-1)}) under ρ is bounded by a constant c_M. Moreover, we assume that the q_1-th moment of ‖u(z^{(L)})‖ under ρ^{(L)} is bounded by a constant c_{M_1}. The constants q_0 and q_1 in Assumption 3 will be specified later based on our theorem statements.

Relationship between Discrete and Continuous DNNs
We can now investigate the relationship between the discrete and continuous DNNs. Similar to the method used in (Fang et al., 2019a) for two-level NNs, a discrete DNN can be constructed from a continuous one by sampling hidden nodes from the probability measure sequence ρ. The detailed procedure is as follows:

1. Keep the input layer of the discrete DNN identical to that of the continuous DNN.

2. For each hidden layer ℓ ∈ [L], sample m^{(ℓ)} hidden nodes $\{z^{(\ell)}_1, \ldots, z^{(\ell)}_{m^{(\ell)}}\}$, denoted as Ẑ^{(ℓ)}, from ρ^{(ℓ)} of the continuous DNN, and set the weights $\hat w^{(\ell)}_{j,k} = w(z^{(\ell)}_j, z^{(\ell-1)}_k)$.

3. For the top layer, set $\hat u_j = u(z^{(L)}_j)$.

The following result shows that when m^{(ℓ)} → ∞ for all ℓ ∈ [L], the final output converges to that of the continuous DNN in L_1. All the proofs in this paper are left to the Appendices.

Theorem 1 (Consistency of Discretization). Consider a continuous NN, and suppose a discrete NN is constructed from it using the procedure above. Under Assumptions 1 and 3 with q_0 = q_1 = 1 + c_ε for any c_ε > 0, for any input x: (i) for each k ∈ {2, . . . , L}, the hidden-node functions f̂^{(k)}_j(x) converge to f^{(k)}(ρ, z^{(k)}_j; x) in L_1 as m^{(ℓ)} → ∞ for ℓ = 1, . . . , k−1; and (ii) the output f̂(x) converges to f(ρ, u; x) in L_1 as m^{(ℓ)} → ∞ for all ℓ ∈ [L]. Part (i) of the theorem does not cover the case of k = 1, since it is trivial to show that f̂^{(1)}_j(x) = f^{(1)}(ρ, z^{(1)}_j; x).

Variance of Discrete Approximation
While Theorem 1 shows the convergence of the discrete NN to the continuous NN under random sampling, it is possible to estimate the variance of this approximation under slightly stronger conditions, as shown below.
Theorem 2 (Variance of Discrete Approximation). Denote by ∂f(ρ, u; x)/∂z^{(ℓ)} the derivative of the output with respect to the hidden node z^{(ℓ)}. Then, under Assumptions 1, 2, and 3 with q_0 = 2(1 + α)^L and q_1 = 2, and treating c_0, c_1, α, c_M, c_{M_1}, and L as constants, the variance of the discrete approximation can be bounded by a sum of moment terms of the layer weights, with the term for layer ℓ decaying in the widths m^{(1)}, . . . , m^{(ℓ)} (the explicit bound is given in the Appendices). In Theorem 2, for ℓ ∈ [L−1], we can choose a_ℓ = O(1/L^{1+ν}) with ν ≥ 0; thus the assumption only requires a bounded (2 + O(1/L^ν))-th moment. It was argued by Fang et al. (2019a) that for two-level NNs, the discretization variance is small when the underlying feature representation learned by the continuous NN is good. Similarly, from Theorem 2, we can argue that if a continuous DNN learns good feature representations, then the variance of the corresponding discrete approximation is small. We can impose an appropriate regularization condition to achieve this effect. Specifically, if we assume that both |f^{(ℓ)}(ρ, z^{(ℓ)}; x) ∂f(ρ, u; x)/∂z^{(ℓ+1)}| and |f^{(L)}(ρ, z^{(L)}; x)| are bounded, then in order to minimize the variance, Theorem 2 implies that we can minimize the following regularizer:

$R(\rho, u) = \sum_{\ell=1}^{L} \lambda^{(\ell)} \int \Big(\int \big|w(z^{(\ell)}, z^{(\ell-1)})\big|\, \mathrm{d}\rho^{(\ell-1)}\Big)^2 \mathrm{d}\rho^{(\ell)} + \lambda^{(u)} \int \big\|u(z^{(L)})\big\|^2\, \mathrm{d}\rho^{(L)},$

which corresponds to the choices of r_1(w) = |w|, r_2(w) = w^2, and r^{(u)}(u) = ‖u‖^2 in (3.3) (see the proofs of (4.9) and (4.10) in Appendix C.1). We make further remarks below regarding the obtained regularizer.
• From the modeling perspective, the regularizer derived in this paper controls how efficiently the learned feature distributions represent the target function under random sampling. If the regularization value is small, then the variance is small, and f can be efficiently represented by a discrete DNN with a small number of hidden neurons randomly sampled from the feature distributions. It is well known that two-level NNs can achieve universal approximation (Cybenko, 1989), but DNNs have stronger representation power than shallow NNs, especially for targets with high-frequency components (Andoni et al., 2014). That is, a much smaller number of hidden units is needed to represent such target functions using DNNs. We validate empirically in Section 6 that for such targets, the variance of deeper NNs becomes smaller after training.
• From the computational perspective, our unexpected result in Section 5.3 shows that with this regularizer, the objective function is convex under suitable re-parameterization. We also note that in the discrete formulation, the regularizer is the simple ℓ1,2 norm regularizer if we write w(z^{(ℓ)}, z^{(ℓ-1)}) as a matrix with the (j, i)-th entry being w(z^{(ℓ)}_j, z^{(ℓ-1)}_i). For such a regularizer, Proximal Gradient Descent (Parikh et al., 2014) can be applied to solve the optimization problem efficiently.

Neural Feature Repopulation
From (4.5), we know that the continuous DNN can be fully characterized by (ρ, u), where ρ denotes the sequence $\rho = \{\rho^{(\ell)}\}_{\ell \in [L]}$ (with ρ^{(0)} fixed as in Remark 1), and ρ^{(ℓ)} is the probability measure on the node space Z^{(ℓ)}. Recall that Section 3.2 introduced the importance weighting technique for the discrete NN. In Section 5.1, we adapt it to the continuous DNN; it motivates the NFR technique for reformulating the continuous DNN, whose details are formally presented in Section 5.2. Finally, Section 5.3 discusses some consequences of NFR when we specify the regularizers. In particular, we will show that for the class of ℓ1,2 norm regularizers obtained in Section 4.4, the entire objective function is convex under our re-parameterization. This generalizes a similar analysis for two-level NNs in (Fang et al., 2019a).
(4) Finally, we can reformulate the top layer using a transformed top-layer function ũ ∈ U, with U being the class of vector-valued functions from Z^{(L)} to R^K.
Therefore, by importance weighting, for a fixed basic continuous DNN f(ρ, u; x), a given probability measure sequence ρ̃ induces a different but equivalent continuous DNN f(ρ̃, ũ; x), which keeps the function values on all hidden nodes and the final loss value unchanged. The discretization of this process is the same as the discrete importance weighting in Section 3.2.
The reverse of the above equivalence also holds. That is, given a continuous DNN f(ρ̃, ũ; x), we can transform it into an equivalent basic continuous DNN f(ρ, u; x). This process defines a specific importance weighting characterized by ρ̃, and uses the inverse mappings of τ̃^{(ℓ)}(ρ, ρ̃; ·) and τ̃^{(u)}(ρ, ρ̃; ·). We refer to this process as NFR; it fundamentally simplifies the objective and will be discussed in the next subsection, where we give the formal definitions of τ̃^{(ℓ)}(ρ, ρ̃; ·), τ̃^{(u)}(ρ, ρ̃; ·), and their inverses.

The Formulation of Neural Feature Repopulation
This section proposes NFR, which is inspired by our reformulation approach for the importance weighted continuous DNN in Section 5.1. Given a continuous NN characterized by (ρ, u), we show that it is possible to transform it into a standard NN characterized by (ρ_0, ũ), and that under this transformation the objective function depends on the probability measure sequence ρ only through the regularizer.
In this paper, we assume ρ^{(ℓ)} ∼ ρ_0^{(ℓ)} for all ℓ ∈ [L], i.e., the two measures are mutually absolutely continuous. In the finite-dimensional case, Gaussian distributions satisfy this condition.

Example on Three-Level NN
We first give an example of a three-level continuous DNN to illustrate how to perform NFR. We use (ρ^{(1)}, ρ^{(2)}, u) to represent the continuous NN for simplicity, and our goal is to transform an NN denoted by (ρ^{(1)}, ρ^{(2)}, u) into the standard one denoted by (ρ_0, ũ). We note that performing NFR layer-wise is not fundamentally necessary; in fact, we can directly transform (ρ^{(1)}, ρ^{(2)}, u) to the final standard form (refer to Appendix D.1). However, the layer-wise procedure presented here is more intuitive and thus easier to understand.
The above procedure illustrates how to transform an arbitrary three-level NN with feature distributions ρ into a standardized NN with feature distributions ρ_0. For NNs with more levels, one can perform Step 1 recursively. The details are left to Appendix D.1.
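The key invariance behind the repopulation step can be checked numerically on a discrete three-level analogue (a hypothetical small network with tanh activations; sizes are illustrative). Repopulating the first hidden layer assigns node k a relative mass p_k and divides every weight leaving node k by p_k; the second layer's pre-activations, and hence the output, are unchanged:

```python
import numpy as np

rng = np.random.default_rng(3)
d, m1, m2, K = 4, 12, 9, 2
w1 = rng.normal(size=(m1, d))
w2 = rng.normal(size=(m2, m1))
u = rng.normal(size=(m2, K))
x = rng.normal(size=d)

def out(w1, w2, u, weights1=None):
    p = np.ones(m1) if weights1 is None else weights1
    f1 = np.tanh(w1 @ x / d)
    # layer-1 average taken under the repopulated node distribution p/m1
    f2 = np.tanh(w2 @ (f1 * p) / m1)
    return u.T @ f2 / m2

# a repopulated distribution over layer-1 nodes, (1/m1) * sum_k p_k = 1
p = rng.uniform(0.25, 4.0, size=m1)
p *= m1 / p.sum()

# dividing outgoing weights by p yields an equivalent network
assert np.allclose(out(w1, w2, u), out(w1, w2 / p[None, :], u, weights1=p))
```

Iterating this step layer by layer, as in the procedure above, moves any feature distribution sequence ρ to the standard ρ_0 while preserving the function.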

Formal Results of Neural Feature Repopulation
This subsection presents the formal results for NFR. We consider general (deep) continuous NNs and further take the transformations of regularizers into account. The theorems are presented below.
Based on the feature-repopulated formulation of the continuous DNN in Theorem 3 and the equivalence between (ρ, u) and (ρ̃, ũ) shown in Theorem 4, we know that learning a continuous DNN by optimizing over (ρ, u) is equivalent to minimizing the following feature-repopulated objective function Q̃(ρ̃, ũ) over (ρ̃, ũ):

$\tilde Q(\tilde\rho, \tilde u) = J\big(f(\rho_0, \tilde u; \cdot)\big) + \tilde R(\tilde\rho, \tilde u). \qquad (5.10)$

Here, we should keep in mind that ρ_0 is fixed and known. It follows that the continuous DNN f(ρ, u; x) is equivalent to a linear system f(ρ_0, ũ; x) parameterized by ũ. Thus, (5.10) demonstrates that we can decouple the probability measure sequence ρ from the loss function J: after reparameterizing the NN by (ρ̃, ũ), the effect of ρ shows up in the objective function only through the regularizer R̃. This reparameterization significantly simplifies the objective function.
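The claim that f(ρ_0, ũ; x) is linear in ũ, so that a convex loss composed with it stays convex, is easy to confirm on a discrete analogue (illustrative two-level case with the features frozen, playing the role of ρ_0):

```python
import numpy as np

rng = np.random.default_rng(4)
d, m, K = 3, 20, 2
w = rng.normal(size=(m, d))
x = rng.normal(size=d)
feats = np.tanh(w @ x / d)       # fixed features, standing in for rho_0

def f(u):
    # the reparameterized model: a plain linear map of u
    return u.T @ feats / m

u1, u2 = rng.normal(size=(m, K)), rng.normal(size=(m, K))
a, b = 0.3, -1.7
# linearity in u: f(a*u1 + b*u2) == a*f(u1) + b*f(u2)
assert np.allclose(f(a * u1 + b * u2), a * f(u1) + b * f(u2))
```

With the loss convex in its first argument and f linear in ũ, the data term J in (5.10) is convex in ũ; convexity of the full objective then hinges on the regularizer, as discussed next.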
Based on the new formulation (5.10), given ũ, the quality of the feature distributions ρ depends on the regularizer. In the next section, we discuss the properties of the continuous DNN under specific regularizers. In particular, we will show in Section 5.3.1 that the ℓ1,2 norm regularization leads to efficient distributions over features in terms of representing a given target function.
Moreover, our NFR view implies a process to obtain improved feature representations starting from any (ρ 0 ,ũ). See Algorithms 1 and 2 in the experiment section for details.

Properties of Continuous DNN with Specific Regularizers
In the following, we show some consequences of NFR by specifying the regularizers. We study the class of ℓ1,2 norm regularizers proposed in Section 4.4 and the standard ℓp,1 norm regularizers (p ≥ 1) commonly used in practice. We show that for the ℓ1,2 norm regularizers, the overall objective under NFR is convex, and that for the class of ℓp,1 (p ≥ 1) norm regularizers, the minimization problem for ρ̃ with ũ fixed is also "nearly" convex. Moreover, ℓ1,2 norm regularizers guarantee learning efficient feature representations for the underlying learning tasks.
Although Theorem 5 is stated for a fixed ρ 0 , we can pick ρ 0 to be an arbitrary probability measure sequence. In particular, if we take ρ 0 = ρ at the current solution, then we can use NFR to study the local behavior of the objective function around ρ = ρ 0 . Since the NFR reparameterization has one-to-one correspondence with the original parameterization locally, we may conclude that a local solution of NN in the original parameterization at ρ = ρ 0 is also a local solution with respect to the NFR reparameterization. Since the objective function is still convex with the NFR reparameterization for this ρ 0 , we conclude that a local solution of NN in the original parameterization is a global solution. Note that the argument is also used in the proof of Corollary 7 to derive the KKT conditions of such a local solution. We summarize the result informally as follows.
Theorem 5 shows that a continuous NN can be reformulated as a convex model under the NFR re-parameterization. This result is quite unexpected, and it can be used to explain mysterious empirical observations in DNN. For example, it is known that overparameterized DNNs are easier to optimize. This can be explained by Corollary 6.
Compared to the NTK view, our theory is also more consistent with practical observations. First, in the NTK view, convexity holds only when the variables are restricted to an infinitesimal region, whereas our result applies globally. In addition, the NTK view essentially treats an NN as a linear model on an infinite-dimensional space of random features, and these random features are not learned from the underlying task. In contrast, our results explain how NNs learn useful features for the underlying task when they are fully trained. In fact, by using convexity and the NFR technique, we can establish specific properties satisfied by the optimal solutions of DNNs.
If (ρ*, u*) is an optimal solution of the DNN equipped with ℓ1,2 norm regularizers, then there exists a real number sequence {Λ^(ℓ)}_{ℓ∈[L]}, i.e. Λ^(ℓ) ∈ ℝ for all ℓ ∈ [L], such that the equations stated in Corollary 7 hold. These equations will be validated in our experiments; they imply that the consequences of the NFR theory are consistent with empirical observations. Corollary 7 shows that the optimal feature distribution sequence ρ* relies on u*, where f(ρ*, u*; x) represents the target function, and it can be rewritten as f(ρ0, ũ; ·) with a fixed ρ0. In fact, given the desired target function f(ρ0, ũ; ·) = f(ρ*, u*; ·), there can be many equivalent representations f(ρ, u; ·) indexed by ρ under NFR (refer to Section 5.1). The optimal ρ* achieves the minimum ℓ1,2 norm regularization value within this equivalence class of representations that produce the same outputs as f(ρ0, ũ; ·). Since the ℓ1,2 norm regularizer upper-bounds the variance of the discrete approximation of the continuous DNN in Theorem 2, a small ℓ1,2 norm implies that only a small number of hidden units is needed to represent f(ρ*, u*; ·) in the randomly sampled discrete DNN. In other words, ℓ1,2 norm regularization leads to efficient feature representations. This result generalizes a corresponding result for two-level NNs in (Fang et al., 2019a).

ℓp,1 Norm Regularizers
We also present some results for the commonly used ℓp,1 (p ≥ 1) norm regularizers. This type of regularizer can be written in the form (4.5) by picking r_1^{(ℓ)}(ω) = |ω|^{q^{(ℓ)}}, r_2(ω) = |ω|, and r^{(u)}(u) = ‖u‖^{q^{(u)}}. We have the following property. Theorem 8 shows that, given ũ, minimization of Q̃ over ρ̃ behaves like "convex optimization", in the sense that any local solution ρ̃* is a global solution that achieves the minimum value of Q̃(·, ũ). Further remarks are given below: 1. From Theorem 8, we know that given ũ, solving for ρ̃ is relatively simple. This means that, given a target output function f(ρ0, ũ; ·), it is efficient to learn the desired distributions over features under ℓp,1 norm regularization.
2. We note that in the objective Q̃, the loss term J(f(ρ0, ũ; ·)) is convex. In real applications, the loss value usually dominates the regularizers, because one needs to choose small regularization parameters λ^(ℓ) ≈ 0. In such cases, the objective function is nearly convex, and therefore all local minima have loss values close to that of the global minimum. This explains the empirical observation that overparameterized NNs have no "bad" local minima when the networks are fully trained until convergence.
3. Theorem 8 also indicates that the optimization problem of a DNN equipped with ℓp,1 norm regularizers involves special structure. Therefore solving this class of nonconvex optimization problems is potentially much easier than minimizing a general nonconvex function. A more careful analysis of this observation is left as a future research direction.

Experiments
The experiments are designed to qualitatively verify the following.
1. Optimality condition: We demonstrate that fully trained overparameterized DNNs are consistent with the NFR theory by verifying the optimality condition in Corollary 7. Here we consider the relationship between two quantities for each neuron j in layer ℓ ∈ [L], which are the empirical estimates of the two sides of the equations in Corollary 7.
2. Deep versus shallow networks: We show that by increasing L, the number of hidden layers, a fully connected NN can learn hierarchical feature representations that reduce the approximation variance described in Theorem 2. This verifies the benefit of using deeper networks for certain problems.

3. Compactness: We show that, compared with other regularizers, the proposed regularizer learns better (more compact) feature representations.
4. NFR process: We show that a discrete neural feature repopulation algorithm motivated by our theory can effectively reduce the training loss, and especially the regularizer. This leads to faster convergence to better feature representations.
Note that, similar to (Fang et al., 2019a), we use the variance of the discrete approximation, V(w, u), to measure the effectiveness of the feature representation, based on the theoretical findings of Theorem 2.

Neural Feature Repopulation Algorithm
We propose a new optimization process inspired by our NFR view to verify its effectiveness. This process is complementary to the standard SGD procedure and can be used to accelerate the learning of feature distributions.
We first present our procedure for the continuous DNN in Algorithm 1, in which we alternately fix either ρ̃ or ũ and update the other to minimize the objective function. Due to the feature repopulation procedure, the loss J(f(ρ0, ũ; ·)) remains constant while ũ is fixed. Therefore, we only need to minimize the regularizer R̃ when we update ρ̃ (see line 4). This process explicitly improves the quality of features in terms of efficient representation. Algorithm 2 is the discrete version of Algorithm 1; we combine it with SGD in line 3.
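To make the alternation in Algorithms 1 and 2 concrete, here is a minimal numpy sketch. The loss J, the regularizer R, and the closed-form repopulate step are hypothetical toy stand-ins, not the paper's actual objective: the point is only the structure of the loop, in which updating the feature weights p leaves the loss unchanged (so only the regularizer is minimized, as in line 4 of Algorithm 1), while u is updated by gradient steps.

```python
import numpy as np

# Toy stand-ins (hypothetical): J is a convex loss in u alone, and R is a
# regularizer coupling the feature weights p (a probability vector) with u.
def J(u):
    return np.sum((u - 3.0) ** 2)

def grad_J(u):
    return 2.0 * (u - 3.0)

def R(p, u):
    return np.sum(np.abs(u) / np.maximum(p, 1e-12))

def repopulate(u):
    # Exact minimizer of R(p, u) over the simplex {p >= 0, sum(p) = 1}:
    # by Cauchy-Schwarz, the optimum is p_j proportional to sqrt(|u_j|).
    w = np.sqrt(np.abs(u)) + 1e-12
    return w / w.sum()

u = np.array([1.0, -2.0, 0.5])
p = np.ones(3) / 3.0
for _ in range(200):
    p = repopulate(u)          # "line 4": minimize the regularizer, J unchanged
    u -= 0.05 * grad_J(u)      # "line 3": gradient step on u with p fixed
```

Because J does not depend on p here, the repopulation step can only decrease the overall objective; this mirrors why the continuous algorithm only needs to minimize R̃ when updating ρ̃.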

Synthetic 1-D regression task
We begin by empirically validating our claims on a synthetic 1-D regression task. Since the feature representation f^(ℓ)_j(x) corresponding to each neuron in each layer ℓ ∈ [L] is a single-variable function, it can be easily visualized.
Here we consider the function f(x) = 2(2cos²(x) − 1)² − 1 introduced by Mhaskar et al. (2017). We draw 60k training samples and 60k test samples with x uniform on [−2π, 2π] and set y = f(x). We use a fully connected NN with m^(ℓ) = 1000 × 2^{L−ℓ} hidden units in hidden layer ℓ to learn this target function. We take L ∈ {1, . . . , 4}, use the Adam optimizer with an initial learning rate of 1e-4, and let the activation function be σ(x) = tanh(x). For a fair comparison, we tune the regularization weight so that for each L the NN reaches a training RMSE of 1e-4 at convergence. This controls the representation power of the NN.
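The setup above can be sketched as follows (the random seed and array handling are our own illustrative choices; the actual training code is not reproduced here):

```python
import numpy as np

def target(x):
    # f(x) = 2(2 cos^2(x) - 1)^2 - 1 from Mhaskar et al. (2017)
    return 2.0 * (2.0 * np.cos(x) ** 2 - 1.0) ** 2 - 1.0

rng = np.random.default_rng(0)
x_train = rng.uniform(-2 * np.pi, 2 * np.pi, size=60_000)
y_train = target(x_train)

# Hidden widths m^(l) = 1000 * 2^(L - l) for a network with L hidden layers
L = 4
widths = [1000 * 2 ** (L - l) for l in range(1, L + 1)]  # [8000, 4000, 2000, 1000]
```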
We first validate that a fully trained overparameterized NN satisfies the optimality condition of Corollary 7. Here we consider the case L = 4; the top row of Fig 1 plots the two estimated quantities against each other. We can see that they are approximately linearly correlated, as predicted by Corollary 7.

[Algorithm 2, line 8: duplicate the weights connected to each node before sampling to form the updated weights after sampling.]
To compare the performance of shallow versus deep networks, Fig 2(a) reports how the approximation variance changes as L increases. It demonstrates that the approximation variance decreases as L increases. Moreover, the approximation variance gap between L = 2 and L = 3 is very large, while that between L = 3 and L = 4 is small. This is consistent with the fact that the hierarchical composition of the target function f(x) has depth 3, i.e. f(x) = h(h(cos x)) where h(t) = 2t² − 1. We reach the following conclusion from the visualizations for different L: a DNN is able to learn hierarchical feature representations when we take the optimization process into consideration. More specifically, the layer next to the input layer tends to learn low-frequency signals, while the upper layers combine these lower-frequency signals to form higher-frequency signals.
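The depth-3 composition can be verified numerically. The inner map h(t) = 2t² − 1 is inferred from the double-angle identity cos 2x = 2cos²x − 1; with this (assumed) choice, f(x) = h(h(cos x)) holds exactly:

```python
import numpy as np

def f(x):
    return 2.0 * (2.0 * np.cos(x) ** 2 - 1.0) ** 2 - 1.0

def h(t):
    # inferred inner map: h(cos x) = cos 2x, so h(h(cos x)) = f(x)
    return 2.0 * t ** 2 - 1.0

xs = np.linspace(-2 * np.pi, 2 * np.pi, 1001)
assert np.allclose(f(xs), h(h(np.cos(xs))))
```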
We further compare the compactness of different regularizers. Here we use the notation ℓ_{a,b} to represent a regularizer of the form r_1(x) = |x|^a and r_2(x) = |x|^b. The ℓ1,2 regularizer is the proposed regularizer, which is an upper bound of the approximation variance; the ℓ2,1 regularizer is the traditional L2 regularizer (i.e. ℓ2 weight decay). From Fig 4, we find that the proposed regularizer leads to sparser weights, and thus a more compact representation: the red curve is lower-bounded by the blue one, which suggests that our proposed regularizer results in sparser representations than the traditional regularizer. We also tried the ℓ1/2,4 regularizer and found that its sparsity is not significantly better than that of the proposed ℓ1,2 regularizer. This verifies the effectiveness of the proposed regularizer in obtaining sparse weights.
Finally, we show that the proposed discrete feature repopulation (DFR) process can reduce the training loss, and especially the regularizer loss (see Fig 6(b)). This implies that it leads to a better feature representation. In our implementation, we first use proximal gradient descent to optimize the objective (3.7) with respect to the repopulation variables {p}; once these are calculated, we use the algorithm described in Section 6.1 to discretely re-sample useful features from the top layer to the bottom layer. Fig 5 compares the feature functions learned by vanilla SGD and by SGD with the DFR process described above. The feature functions learned by vanilla SGD contain some useless feature functions whose variance with respect to the input x is near zero, while the DFR process is able to remove these bad feature functions since their importance weights are relatively low. The difference is most visible in Layer 3.
Figure 6: Comparison of the optimization process between SGD and SGD with DFR for a 3-hidden-layer NN.
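The paper's objective (3.7) and its proximal step are not reproduced here; as a generic illustration of proximal gradient descent with an ℓ1-type penalty, the sketch below solves a toy lasso problem, for which the proximal operator is soft-thresholding:

```python
import numpy as np

def soft_threshold(v, tau):
    # proximal operator of tau * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def prox_grad(A, b, lam, lr, steps=500):
    # minimize 0.5 * ||A x - b||^2 + lam * ||x||_1
    # (lr should be at most 1 / lambda_max(A^T A) for convergence)
    x = np.zeros(A.shape[1])
    for _ in range(steps):
        x = soft_threshold(x - lr * A.T @ (A @ x - b), lr * lam)
    return x

# With A = I the solution is soft_threshold(b, lam) in closed form.
x = prox_grad(np.eye(3), np.array([3.0, -0.5, 1.0]), lam=1.0, lr=1.0)
```

Note how the prox step zeroes out coordinates whose gradient update falls below the threshold, which is the mechanism that produces sparse (compact) solutions.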

Mini-Imagenet classification task
We have also performed experiments on real data. The Mini-Imagenet dataset is a simplified version of the ILSVRC'12 dataset (Russakovsky et al., 2015), consisting of 100 classes with 600 images of size 84×84×3 each. Here we consider the data split introduced by Ravi & Larochelle (2016), whose 64 classes and 38.4k images form our full dataset. We divide the dataset into train/valid/test splits in the ratio 7:1:2.
Since fully connected NNs do not have the capacity to deal with such image data, we first train a base CNN embedding network with a four-block architecture as in (Vinyals et al., 2016). We then take the 1600-dimensional output of the embedding layer and feed it to an L-layer NN for classification. The training configuration and network architectures are the same as in the synthetic 1-D experiment, except that we tune the regularization parameters to achieve the best validation accuracy. Since the feature functions of this task are hard to visualize, we only consider the optimality condition, shallow versus deep networks, and compactness. Similar to the results of the synthetic 1-D experiment, the sub-figures in the bottom row of Fig 1 show that the two quantities of interest are also linearly correlated in each layer, which is consistent with our theory. Fig 7 reports how the approximation variance, test RMSE, and train RMSE change during training. We can see that the approximation variance decreases as L increases, and the gap between L = 1 and L = 2 is very large. This demonstrates the great advantage of deeper networks. Moreover, the generalization performance also improves as L increases.
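The classification head described above (a frozen 1600-dimensional CNN embedding fed into an L-layer tanh network with 64 output classes) can be sketched with random, untrained weights; the initialization scale and the batch are illustrative assumptions only:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_forward(x, weights):
    # tanh fully connected layers on top of the embedding, linear output layer
    h = x
    for W in weights[:-1]:
        h = np.tanh(h @ W)
    return h @ weights[-1]  # logits for the 64 classes

L = 2  # number of hidden layers, widths m^(l) = 1000 * 2^(L - l) as in the 1-D setup
dims = [1600] + [1000 * 2 ** (L - l) for l in range(1, L + 1)] + [64]
weights = [rng.normal(0.0, 1.0 / np.sqrt(d_in), size=(d_in, d_out))
           for d_in, d_out in zip(dims[:-1], dims[1:])]

x = rng.normal(size=(5, 1600))       # a batch of 5 embedding vectors
logits = mlp_forward(x, weights)     # shape (5, 64)
```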

Conclusion
This paper introduced the NFR technique to analyze over-parameterized DNNs and showed that it is possible to reformulate overparameterized DNNs as convex systems. Moreover, when fully trained, DNNs learn effective feature representations suitable for the underlying learning task via regularization. Our analysis is consistent with empirical observations. Similar to the analysis of two-level NNs in (Fang et al., 2019a), the newly introduced NFR method paves the way for establishing global convergence results of standard optimization algorithms, such as (noisy) gradient descent, for overparameterized DNNs. We leave such a study as future work.

A Preliminary
This section provides some useful known inequalities that will be used later in our proofs.

A.1 Jensen's Inequality
In our proofs, we will frequently use Jensen's inequality, which relates the value of a convex function of an integral to the integral of the convex function. In probability theory, it states that if φ is a convex function, then for a random variable x we have
    φ(E[x]) ≤ E[φ(x)].
We are particularly interested in the case φ(x) = |x|^p with p ≥ 1, for which we obtain
    |E[x]|^p ≤ E[|x|^p].
In its finite form, Jensen's inequality reads
    φ((1/n) Σ_{i=1}^n x_i) ≤ (1/n) Σ_{i=1}^n φ(x_i).    (A.1)
If we further assume that the x_i with i ∈ [n] are independent random variables following the same underlying distribution, then taking expectations of (A.1) with φ(x) = |x|^p gives
    E[|(1/n) Σ_{i=1}^n x_i|^p] ≤ E[|x_1|^p].
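As a quick numerical sanity check of the |x|^p case (the exponential distribution below is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(1.0, size=100_000)
p = 3

# |E[x]|^p <= E[|x|^p]  (for Exp(1): E[x] = 1 while E[x^3] = 6)
assert abs(x.mean()) ** p <= (np.abs(x) ** p).mean()

# finite form: phi of the average <= average of phi
vals = np.array([0.5, 1.5, 2.0, 4.0])
assert abs(vals.mean()) ** p <= (np.abs(vals) ** p).mean()
```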

A.2 Rosenthal Inequality
We will also use the Rosenthal inequality to relate the moments of a sum of independent random variables to the moments of its summands. It is stated as follows. Lemma 1 (Rosenthal inequality (Ibragimov & Sharakhmetov, 1999)). Let ξ_i with i ∈ [n] be independent with E[ξ_i] = 0 and E[|ξ_i|^t] < ∞ for some t ≥ 2. Then we have
    E[|Σ_{i=1}^n ξ_i|^t] ≤ C_t^1 Σ_{i=1}^n E[|ξ_i|^t] + C_t^2 (Σ_{i=1}^n E[ξ_i^2])^{t/2},
where C_t^1 can be taken as (ct)^t, C_t^2 = (c√t)^t, and c is an absolute constant.
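For mean-zero Rademacher variables the t = 4 case can be checked exactly: E[S⁴] = 3n² − 2n for S the sum of n independent signs. The constant choice c = 1 below is purely illustrative (the lemma only guarantees some absolute constant c), but it suffices for this distribution:

```python
import itertools
import numpy as np

def fourth_moment(n):
    # exact E[(xi_1 + ... + xi_n)^4] for iid Rademacher signs,
    # by brute force over all 2^n sign patterns
    return np.mean([sum(s) ** 4 for s in itertools.product([-1, 1], repeat=n)])

t = 4
for n in range(1, 9):
    exact = fourth_moment(n)
    assert exact == 3 * n ** 2 - 2 * n
    # Rosenthal bound: C1 * sum E|xi|^t + C2 * (sum E[xi^2])^(t/2),
    # with C1 = (c*t)^t, C2 = (c*sqrt(t))^t and the illustrative choice c = 1
    assert exact <= (1 * t) ** t * n + (1 * t ** 0.5) ** t * n ** 2
```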

B Proof of Theorem 1: Consistency of Discretization
In this section, we prove Theorem 1, which shows that the discrete DNN converges to the corresponding continuous one in L_1. Before giving the detailed proof, we introduce the following definitions.
Proof. We denote: In fact, for all ℓ ∈ {2, . . . , L} and j ∈ [m^{(ℓ)}], we have: It is also true that for all j ∈ [L + 1]: In the following, we denote: Note that ξ^{(L+1)}_k with k ∈ [m^{(L)}] are K-dimensional vectors. Because real numbers can also be treated as one-dimensional vectors, below we treat ξ^{(j)}_k as a vector for the sake of simplicity. We have, for all j ∈ {2, . . . , L + 1}: where in a=, we use 1 to denote the indicator function and obtain the result by the triangle inequality, and b≤ uses Jensen's inequality. For the first term on the right hand side of (B.10), we have: where the summands are independent given z^{(j)}, . . . , z^{(L)}, and we obtain the result by: where in a≤, we set c̃_ε = cε/(1 + cε) and obtain the result by Young's inequality, that is: (B.13) where in a=, we set M = (m^{(j−1)})^{1/(2(1+cε))}. Then, letting m^{(j−1)} → ∞, from (B.8) and (B.9) we obtain (B.6) and (B.7), respectively.

C Proof of Theorem 2: Variance of Discrete Approximation
In this section, we prove Theorem 2, which estimates the variance of the discrete approximation. The proof is more involved than that of Theorem 1, because we consider the first-order approximation of f̃^{(ℓ)}_j(x) with ℓ ∈ [L]. Before giving the detailed proof, we further define the following. Note that f̃^{(ℓ)} and g̃^{(ℓ)} depend on the random variables Ẑ^{(1)}, . . . , Ẑ^{(ℓ−1)}; for simplicity, we do not show this dependence explicitly in the definitions.
We give the detailed proof of Theorem 2 below. Step 1. We prove: Proof. We prove the claims by induction. When ℓ = 1, the statement is true. Suppose the statement is true at ℓ − 1; we consider the case ℓ. For all j ∈ [m^{(ℓ)}], we have: Step 2. In this step, we compute: Step 2(i): We prove that: Proof. Clearly, we have f̃^{(1)}_j(x) = f^{(1)}(ρ, z_j; x) for all j ∈ [m^{(1)}]. Then we can verify that for all j ∈ [m^{(2)}]: We then prove that for all ℓ ∈ {3, . . . , L} and j ∈ [m^{(ℓ)}], we have: Suppose the above statement is true at ℓ − 1; then from (C.1), we have: where in a=, we use (C.7) at ℓ − 1. So we obtain (C.7) at ℓ. Then, plugging (C.7) with ℓ = L into (C.2), we obtain (C.5). Also, by the same technique as in Step 1, we can obtain: Step 2(ii): First, we expand
Step 2(iv): We simplify the terms on the right hand side of (C.11).
Proof. For any ℓ ∈ {2, . . . , L} and j ∈ [m^{(ℓ)}], from Lemma 2 at the end of this section, we have: where a≤ uses Assumption 1 and Ψ_s(z^{(ℓ)}_j) ≥ 1. Furthermore, for the first term on the right hand side of (C.24), we have: From Lemma 3 at the end of this section, we have: where we simply set q_0 = 3/4.
In all, we have: where in a≤ we use (C.27), and in b= we use the fact that Φ(t + 1, ℓ) attains the lowest order 1/(m_J^{(ℓ)})^{1+α} only when t = 0 and j = 0; here γ, C_0, B_T, e_L, and e_0 depend only on L, α, c_0, c_1, c_M, and c_{M_1}, and can be treated as constants.

D Neural Feature Repopulation
In this section, we give the proofs of Theorems 3 and 4, which present the NFR procedure and its inverse procedure, respectively.
This proves the desired result.
The remainder of the proof is identical to that of Theorem 3.

E Properties of Continuous DNN with Specific Regularizers
Below, we give the proofs of Theorem 5, Corollary 7, and Theorem 8, which present the results of NFR with specific regularizers.