A Formal Characterization of Activation Functions in Deep Neural Networks

In this article, a mathematical formulation for describing and designing activation functions in deep neural networks is provided. The methodology is based on a precise characterization of the desired activation functions, which must satisfy particular criteria, including circumventing vanishing or exploding gradients during training. The problem of finding desired activation functions is formulated as an infinite-dimensional optimization problem, which is later relaxed to solving a partial differential equation. Furthermore, bounds that guarantee the optimality of the designed activation function are provided. Examples involving state-of-the-art activation functions illustrate the methodology.


I. INTRODUCTION
Deep neural networks are connected sequences of layers such that each layer is solely defined by its parameters (that is, weights and biases) and particular activation functions. The former, along with the underlying network topology, defines the neural network architecture. The latter is a nonlinear function that provides the learning capabilities of the network and plays the role of an interface between adjacent layers. While considerable research efforts have been focused on enhancing the network topology, studies improving activation functions have received much less attention. This work aims at reducing this discrepancy, given the influence of activation functions on the overall network performance. For instance, ReLU [1], leaky ReLU [2], and parametric ReLU [3] are proven to be better choices than the standard sigmoid functions (logistic or hyperbolic tangent functions).
The recent success of deep learning in a variety of tasks, particularly in computer vision and natural language processing, is mainly due to novel neural architectures. Most, if not all, of the state-of-the-art architectures are manually designed by humans, resulting in longer development times and no guarantees of any kind of optimality, which can further limit their large-scale applications. Furthermore, nonoptimized architectures can lead to the usage of more expensive hardware that can be energy-consuming. As a natural solution to that, new tools and methods are being developed to automate the design of neural architectures, which led to a new field of research known as AutoML [4] that includes, for example, neural architecture search (NAS) [5]. This approach is based on defining a search strategy to explore the space of possible architectures. Selecting an optimal architecture is based on an a priori defined performance of the induced neural network. The approach seems promising, yet it is interesting to note that none of the methods in AutoML considers activation function optimization, although it is critical in improving the neural network performance during training and inference. This limitation is mainly due to the lack of a formal and efficient characterization of the set of admissible activation functions. One can, for example, approach the design of a desired activation function as a combinatorial optimization problem where the search space is a set of admissible functions formed by a countable and finite number of primitive functions and all their combinations [6]. However, in this setup, it is clear that the set of functions that can be generated is limited by the span of the primitive functions.
In this article, we introduce a novel method of characterizing and designing activation functions that is analytical rather than combinatorial. The organization of this article is given as follows. First, we pose the problem of characterizing an optimal activation function that satisfies certain desired learning properties as an infinite-dimensional optimization problem. Then, we show that such a formulation is not ill-posed, that is, there exists at least one admissible minimizing function. Second, we propose a semiexplicit solution to the infinite-dimensional optimization problem such that the optimal function is defined as the solution to a set of differential equations. Finally, we show that such optimal functions are indeed valid activation functions, that is, the universal approximation property of the induced neural networks is guaranteed. These are also the main contributions of this work. This article is organized accordingly, with the longer proofs placed in the appendices so as not to clutter the presentation.

II. MOTIVATION
In this section, we show how the evolution of neural networks during training is directly related to the activation function and how it impacts the training limitations of deep neural networks. Furthermore, we identify the main terms that need to be controlled and, hence, used in our characterization in Section III.
Let $f_T(\cdot,\Theta)$ be a deep neural network defined as
$$f_T(\cdot,\Theta)=\sigma_T(\cdot,\Theta_T,T)\circ\sigma_{T-1}(\cdot,\Theta_{T-1},T-1)\circ\cdots\circ\sigma_1(\cdot,\Theta_1,1) \tag{1}$$
where $\circ$ refers to the composition operator, $\Theta$ represents the concatenated weights, and $\sigma_k\triangleq[\sigma_{k1},\ldots,\sigma_{kn}]^T$ is a vector of functions such that each element $\sigma_{ki}:\mathbb{R}^n\to\mathbb{R}$ is an activation function for neuron $i\in\{1,\ldots,n\}$ and satisfies $\partial\sigma_{ki}(z)/\partial z_j=0$ if $i\neq j$. Note that we are considering a very general case where each neuron may have its own activation function.
Furthermore, define $y_k$ to be the output of each layer of the network such that $y_{k+1}=\sigma_k(y_k,\Theta_k,k)$, and let $L:\mathcal{Y}\times\mathcal{Y}\to\mathbb{R}$ be a loss function such that the objective is to learn some unknown target function using some dataset.
Then, the evolution of the weights during training via gradient descent is described by the following differential equation:
$$\frac{\partial\Theta_l}{\partial s}=-\eta\,\frac{\partial L}{\partial\Theta_l} \tag{2}$$
where $s\in\mathbb{R}_+$ represents the training steps and $\eta$ is an arbitrary learning rate. Now, developing the right-hand side via the chain rule yields
$$\frac{\partial L}{\partial\Theta_l}=\frac{\partial L}{\partial y_T}\left(\prod_{k=l}^{T-1}\operatorname{diag}\bigl(\nabla\odot\sigma_k(z)\bigr)\,W_k\right)\frac{\partial y_l}{\partial\Theta_l}$$
where $\nabla\triangleq[\partial/\partial z_1,\ldots,\partial/\partial z_n]^T$, $\odot$ is the Hadamard product such that $\nabla\odot\sigma_k(z)\triangleq[\partial\sigma_{k1}(z)/\partial z_1,\ldots,\partial\sigma_{kn}(z)/\partial z_n]^T$, and $z\in\mathbb{R}^n$ is the preactivation vector. In this case, the diagonal matrix $\operatorname{diag}(\nabla\odot\sigma_k)\in\mathbb{R}^{n\times n}$ is equal to the Jacobian matrix of $\sigma_k$. Hence,
$$\left\|\frac{\partial\Theta_l}{\partial s}\right\|\le\eta\left\|\frac{\partial L}{\partial y_T}\right\|\,\left\|\frac{\partial y_l}{\partial\Theta_l}\right\|\prod_{k=l}^{T-1}\bigl\|\operatorname{diag}(\nabla\odot\sigma_k)\bigr\|\,\|W_k\|$$
where $\|\cdot\|$ is a proper vector norm. It is clear that the information that can backpropagate during training is exclusively controlled by $W_k$ and $\nabla\odot\sigma_k$ for $k=1,\ldots,T$. Regarding the $W_k$'s, when the length $T-l$ is large, we will have two dominant cases: 1) either $\|W_k\|>1$, in which case the gradient may blow up and the learning algorithm will stop, or 2) $\|W_k\|<1$, that is, $\lim_{T\to\infty}\|W_k\|^{T}=0$, which leads to a vanishing gradient and, hence, $\partial\Theta_l/\partial s\to0$, which means that the weights of the initial layers will not be updated.
Fortunately, there are effective remedies for that issue, as shown in [7] and [8]. Thus, one can, for example, restrict the set of all possible weights to the unitary group $U(n)$ such that $W_k\in U(n)$ for all $k=1,\ldots,T$. Recall that, for any $M\in U(n)$, we have $\|M\|=1$. Therefore, the main remaining term that directly influences the learning behavior via gradient descent and needs to be controlled is the quantity $\prod_{k=1}^{T}\|\nabla\odot\sigma_k\|$, which will be the focus of this article. In addition, one can show that the evolution of the neural network $f_t(\cdot,\Theta)$ during training is directly correlated with $\nabla\odot\sigma_k(\cdot)$ for $k=1,\ldots,T$. To do that and without loss of generality, assume that the network is shallow, that is, $T=1$. In this case, the deep neural network map $f_1(\cdot,\Theta)$ becomes $x\mapsto\sigma(x)=\sigma(Wx+b)$, and its dynamics during training are described by a first-order differential equation in the training step $s$, driven by the learning rate $\eta$ defined beforehand and a bilinear form $\kappa(\cdot,\cdot)$ (computed in Appendix A) that represents the evolution of the preactivation vector $y$ during training.
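To make the role of the product $\prod_{k}\|\operatorname{diag}(\nabla\odot\sigma_k)\|\,\|W_k\|$ concrete, the following minimal numerical sketch (with illustrative width, depth, and weight choices that are not taken from the article) builds the end-to-end Jacobian of a deep stack with orthogonal weights, so that $\|W_k\|=1$ and any decay or growth is attributable to the activation derivatives alone.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 64, 100   # layer width and depth (illustrative values, not taken from the article)

def random_orthogonal(n):
    # QR projection of a Gaussian matrix: an orthogonal W_k, so that ||W_k||_2 = 1.
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q

def end_to_end_jacobian_norm(act, act_deriv):
    """Spectral norm of prod_k diag(sigma'(z_k)) W_k after T layers: the factor that
    controls how much gradient can backpropagate from the last layer to the first."""
    x = rng.standard_normal(n)
    J = np.eye(n)
    for _ in range(T):
        W = random_orthogonal(n)
        z = W @ x                           # preactivation of this layer
        J = np.diag(act_deriv(z)) @ W @ J   # backpropagation factor of this layer
        x = act(z)
    return np.linalg.norm(J, 2)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
dsigmoid = lambda z: sigmoid(z) * (1.0 - sigmoid(z))   # derivative bounded by 1/4
relu = lambda z: np.maximum(z, 0.0)
drelu = lambda z: (z > 0).astype(float)                # diagonal entries in {0, 1}

print("logistic sigmoid:", end_to_end_jacobian_norm(sigmoid, dsigmoid))
print("ReLU            :", end_to_end_jacobian_norm(relu, drelu))
```

With the logistic sigmoid, whose derivative never exceeds $1/4$, the printed norm collapses toward numerical zero, whereas the ReLU factors keep many unit entries on the diagonal and decay far more slowly; this is exactly the kind of behavior the characterization in Section III is designed to control.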
Generally, the loss $L$ is a smooth and convex functional with respect to $f$; therefore, $\partial L/\partial f$ is smooth along the training trajectories. Thus, the smoothness of the network evolution during the learning phase and the behavior of the parameter updates via gradient descent are directly related to the term $\nabla\odot\sigma$.
Note that $\nabla\odot$ is not an elegant operator to use in characterizing desirable activation functions. Fortunately, it can be rewritten as the gradient of a certain scalar function. In order to do so, let us define $u:\mathbb{R}^n\times\mathbb{R}\to\mathbb{R}$ to be a scalar function such that
$$u(x,k)\triangleq\sum_{i=1}^{n}\sigma_{ki}(x) \tag{3}$$
where $\sigma_k=[\sigma_{k1},\ldots,\sigma_{kn}]^T$ is the vector of activation functions at layer $k$, as defined previously, such that $\partial\sigma_{ki}/\partial x_j=0$ if $i\neq j$. Hence,
$$\nabla\odot\sigma_k(x)=\nabla u(x,k)$$
where $\nabla u\triangleq[\partial u/\partial x_1,\ldots,\partial u/\partial x_n]^T$ is the gradient of the scalar function $u$.
In conclusion, by controlling $\bigodot_{k=0}^{T}\nabla u(x(k),k)$, where $\odot$ represents the elementwise vector product and $x(k)$ is the preactivation output of the $k$th layer, we can circumvent the training problems of deep neural networks arising in (2). Furthermore, the smoothness and robustness of the neural network and its evolution during training are controlled by $u$ and $\nabla u$, respectively. Therefore, the three terms $u$, $\nabla u$, and $\bigodot_{k=0}^{T}\nabla u(x(k),k)$ will be used, in Section III, to characterize the set of desirable activation functions.

III. MAIN RESULTS
In order to provide the precise formulation of our main results, we need the following definition first.
As discussed in Section II, the term $\bigodot_{k=0}^{t}\nabla u(x_k,k)$ needs to be controlled to avoid undesired behaviors during training, and, in order to do so, we propose that $z_i(t)\triangleq\prod_{k=0}^{t}|\nabla_i u(x_k,k)|$ should not vanish or blow up for any $t\ge0$ and all $i=1,\ldots,n$. This is motivated by the fact that, for any vector norm, the norm of the elementwise product $\bigodot_{k=0}^{t}\nabla u(x_k,k)$ is controlled by the componentwise products $z_i(t)$. Furthermore, the latter two conditions, namely that $z_i(t)$ should not vanish or explode for any $t\ge0$, can be concatenated into one boundedness problem by applying a logarithmic transformation, that is,
$$\log\bigl(z_i(t)\bigr)=\sum_{k=0}^{t}\log\bigl(|\nabla_i u(x_k,k)|\bigr).$$
In order to bound $\log(z_i(t))$ for $t>0$ and all $i$, it suffices that the absolute sums are bounded, that is,
$$\bar z_i(t)\triangleq\sum_{k=0}^{t}\bigl|\log\bigl(|\nabla_i u(x_k,k)|\bigr)\bigr|<\infty. \tag{4}$$
Therefore, a way to circumvent the vanishing or exploding gradient problems in deep neural networks is to minimize $\bar z_i(t)$ for all $i\in\{1,\ldots,n\}$.
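The short numerical sketch below (with synthetic per-layer factors, used only for illustration) shows why the logarithmic transformation is convenient: keeping the sum of absolute logarithms in (4) bounded simultaneously rules out a vanishing and an exploding product.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 200   # depth, illustrative

# Synthetic per-layer factors |grad_i u(x_k, k)|, chosen only to illustrate the
# transform from a product constraint to a bounded-sum-of-logs constraint.
balanced  = np.exp(rng.normal(0.00, 0.05, T))   # log-factors centred at 0
shrinking = np.exp(rng.normal(-0.05, 0.05, T))  # log-factors biased slightly negative

for name, z in (("balanced", balanced), ("shrinking", shrinking)):
    product     = np.prod(z)                 # quantity that must neither vanish nor blow up
    log_sum     = np.sum(np.log(z))          # log of that product
    abs_log_sum = np.sum(np.abs(np.log(z)))  # the sum bounded in (4); it upper-bounds |log product|
    print(f"{name:9s}  prod={product:.3e}  |sum log|={abs(log_sum):6.2f}  sum|log|={abs_log_sum:6.2f}")
```

When the sum of absolute logs stays moderate, the product remains of order one; when the logs drift negative, the same depth already drives the product toward zero.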
The logarithmic function is smooth and concave on $\mathbb{R}_{>0}$; however, it diverges at the origin, which is not the desired behavior in optimization. To overcome this issue, we need another transformation such that the new terms are smooth and finite on the whole domain $\mathbb{R}$ and are only infinite at infinity. This leads to the following proposition.
Proposition 2: Let $\{g_i:\mathbb{R}\to\mathbb{R}_{\ge0}\mid i=1,\ldots,n\}$ be a set of continuous functions such that, for all $i$, the following conditions hold.
1) For any sequence $(x_j)_{j\ge0}$ in $\mathbb{R}$ with $|x_j|\to\infty$, we have $g_i(x_j)\to\infty$.
2) The roots of $g_i$ and of its first derivative are trivial, that is, $g_i(z)=0\iff z=0$ and $dg_i(z)/dz=0\iff z=0$.
If, for any $i\in\{1,\ldots,n\}$ and all $t\ge0$, the quantity
$$\bar\mu_i(t)\triangleq\sum_{k=0}^{t}g_i\bigl(\log\bigl(|\nabla_i u(x_k,k)|\bigr)\bigr)$$
is bounded, then the term $\bar z_i(t)$ in (4) is bounded for any $t\ge0$ and all $i$.
Proof: For each $i\in\{1,\ldots,n\}$, note that, by property 2), $dg_i(z)/dz$ changes sign only once, at $z=0$; hence, the function $g_i$ is monotonic on $(-\infty,0]$ and on $[0,\infty)$, respectively. Therefore, by property 1), $g_i$ must be nonincreasing on $(-\infty,0]$ and nondecreasing on $[0,\infty)$. This implies that there exists a constant $m_i>0$ such that $g_i(z)\ge m_i|z|$. Now, since $\bar\mu_i(t)$ is bounded, there exists another constant
$M>0$ such that $\bar\mu_i(t)\le M$ for any $t\ge0$. Thus, $\bar z_i(t)\le\bar\mu_i(t)/m_i\le M/m_i$ for all $t\ge0$. Consequently, looking for activation functions that minimize $\bar\mu_i(T)$, for all $i$, is a valid characterization to circumvent the gradient vanishing and exploding problems during training of a deep neural network with $T$ layers. In order to fully use the calculus of variations to find optimal activation functions, we will formulate that minimization as an infinite-dimensional optimization problem; to achieve that, we first look at the continuum behavior of $\bar\mu_i(T)$, as shown in Section III-A.

A. Continuum Limit
The functions $\{u(\cdot,k)\mid k=1,\ldots,T\}$ can be seen as a uniform discretization of a certain continuous function $u(\cdot,t)$. Using this analogy, let $\Delta t=1$ such that
$$\bar z_i(T)=\sum_{k=0}^{T}\bigl|\log\bigl(|\nabla_i u(x_k,k)|\bigr)\bigr|\,\Delta t$$
for all $i$. Now, suppose that $\nabla_i u$ is Lebesgue measurable too [9]. Hence, for any partitioning such that $\Delta t\to0$ in a measure sense, the following new quantity of interest:
$$\int_0^{t}\bigl|\log\bigl(|\nabla_i u(x(\tau),\tau)|\bigr)\bigr|\,d\tau$$
which is the continuum limit of the sum (4), is well defined.
In order to bound the integral above, it suffices to bound the following integral:
$$\int_0^{t}g_i\bigl(\log\bigl(|\nabla_i u(x(\tau),\tau)|\bigr)\bigr)\,d\tau.$$
Motivations similar to those that led to Proposition 2 justify the following proposition.
Proposition 3: Let $\{g_i:\mathbb{R}\to\mathbb{R}_{\ge0}\mid i=1,\ldots,n\}$ be a set of continuous functions such that, for all $i$, the following conditions hold.
1) For any sequence $(x_j)_{j\ge0}$ in $\mathbb{R}$ with $|x_j|\to\infty$, we have $g_i(x_j)\to\infty$.
2) The roots of $g_i$ and of its first derivative are trivial, that is, $g_i(z)=0\iff z=0$ and $dg_i(z)/dz=0\iff z=0$.
Define $\{\mu_i(\cdot)\mid i=1,\ldots,n\}$ to be the set of functions
$$\mu_i(t)\triangleq\int_0^{t}g_i\bigl(\log\bigl(|\nabla_i u(x(\tau),\tau)|\bigr)\bigr)\,d\tau.$$
If $\mu_i(t)$ is bounded for any $t\ge0$ and all $i$, then the integral $\int_0^{t}|\log(|\nabla_i u(x(\tau),\tau)|)|\,d\tau$ is bounded for any $t\ge0$ and all $i$.
The proof of Proposition 3 can be constructed with arguments similar to those of Proposition 2. Furthermore, note that $g_i(\cdot)$ is a continuous function, and so is $z\mapsto g_i(\log(|z|))$ away from the origin; hence, $\mu_i$ is well defined. Therefore, if we look for activation functions that minimize $\mu_i(T)$ for all $i$, where $T\in\mathbb{R}_{\ge0}\cup\{\infty\}$ is the number of layers, then, by Propositions 1 and 3, we are guaranteed that the vanishing or exploding gradient problems cannot occur when $T$ is very large.
Remark 1: Any set of activation functions generated by minimizing $\mu_i(T)$ in the continuum limit circumvents the gradient problems in the training of a discrete (regular) deep neural network. This is guaranteed by Proposition 3 and by noting that the sums in Proposition 2 are uniform discretizations of the integrals $\mu_i(T)$. Next, we will use the maps $\mu_i(\cdot)$ along with other terms to define a functional whose minimizer is the desired activation function.

B. Characterization of Admissible and Desired Activation Functions as Functional Minimization
In this section, we will construct a functional such that its minimizer will define a set of desired activation functions that guarantee the following properties.
1) No gradient vanishing or exploding problem during training.
2) The activation functions are at least Lebesgue measurable, which is important to guarantee the universal approximation property of the induced neural networks.
3) The gradients of the activation functions are measurable.
However, before that, we will recall some important concepts in functional analysis that will be used throughout the rest of this article.
Let $\Omega\subseteq\mathbb{R}^n$ be some open set. Define $L^2(\Omega,\mathbb{R})$ as the space of measurable and square-integrable functions in $\Omega$, that is,
$$L^2(\Omega,\mathbb{R})\triangleq\Bigl\{u:\Omega\to\mathbb{R}\ \Big|\ u\ \text{measurable},\ \int_\Omega|u(x)|^2\,dx<\infty\Bigr\}.$$
Along with the inner product defined as
$$\langle u,v\rangle_{L^2(\Omega)}\triangleq\int_\Omega u(x)\,v(x)\,dx$$
$L^2(\Omega,\mathbb{R})$ is a Hilbert space with the induced norm $\|u\|_{L^2(\Omega)}=(\int_\Omega|u(x)|^2\,dx)^{1/2}$. To simplify the notation, we will omit the image space wherever it is $\mathbb{R}$, that is, $L^2(\Omega)\triangleq L^2(\Omega,\mathbb{R})$.

$$F(u)=\cdots \tag{6}$$
More generally, $L^p(\Omega)$, for any $1\le p\le\infty$, is the space of measurable functions whose absolute value raised to the $p$th power is integrable. Recall that $L^p(\Omega)$ cannot be a Hilbert space for any $p\neq2$; however, it is a complete normed space (Banach space) for the norm $\|u\|_{L^p(\Omega)}=(\int_\Omega|u(x)|^p\,dx)^{1/p}$. Furthermore, let $H^1(\Omega)$ be the Hilbert space defined as
$$H^1(\Omega)\triangleq\bigl\{u\in L^2(\Omega)\mid\nabla u\in L^2(\Omega,\mathbb{R}^n)\bigr\}$$
along with the inner product defined as
$$\langle u,v\rangle_{H^1(\Omega)}\triangleq\langle u,v\rangle_{L^2(\Omega)}+\langle\nabla u,\nabla v\rangle_{L^2(\Omega)}$$
and the induced norm $\|u\|_{H^1(\Omega)}=(\|u\|_{L^2(\Omega)}^2+\|\nabla u\|_{L^2(\Omega)}^2)^{1/2}$. The main idea that we propose in this article is to consider the problem of finding the set of admissible activation functions that satisfy the properties stated at the beginning of this section as solutions to the following minimization problem:
$$\min_{u\in\mathcal{K}}F(u) \tag{5}$$
where the functional $F(\cdot)$ is defined in (6) above and $\mathcal{K}$ is some functional space that will be defined later.

Remark 2:
The first term in (6) guarantees that $\int_0^T\!\!\int_\Omega g_i(\log(|\nabla_i u(x,t)|))\,dx\,dt$ is bounded for all $i$. This holds by noting that each of these integrals is bounded by the value of the functional at its minimizer, which is true by the fact that $g_i$ is always nonnegative. The second and third terms ensure that the activation functions and their gradients are smooth and $L^2$-bounded. Furthermore, note that the second term also plays the role of a regularization term for any $\eta>0$.
Guaranteeing the existence of a solution to (5) is very important because it is not trivial that this infinite-dimensional optimization problem is well defined for any $g_i$ and $\mathcal{K}$, and the lack of a minimizing function would render our characterization vacuous.
Theorem 1: Let $\Omega\subset\mathbb{R}^n$ be a bounded open set. Fix $T\in\mathbb{R}_{\ge0}$, and define $L^p([0,T],X)$ as the normed space of functions $v:[0,T]\to X$ such that
$$\|v\|_{L^p([0,T],X)}\triangleq\Bigl(\int_0^T\|v(t)\|_X^p\,dt\Bigr)^{1/p}<\infty$$
where $X$ can be identified as $L^q(\Omega)$ or $H^1(\Omega)$, and $p,q\ge1$.
Furthermore, suppose that, for each function $g_i(\cdot)$, the following conditions hold.
1) $g_i$ is continuous and at least twice continuously differentiable almost everywhere.
2) For any $\omega\in\mathcal{K}$ that does not vanish almost everywhere in $\Omega\times[0,T]$, the integral $\int_0^T\!\!\int_\Omega g_i(\log(|\nabla_i\omega(x,t)|))\,dx\,dt$ is finite.
3) For any sequence $(x_j)_{j\ge0}$ in $\mathbb{R}$ with $|x_j|\to\infty$, we have $g_i(x_j)\to\infty$.
4) The roots of $g_i$ and of its first derivative are trivial, that is, $g_i(z)=0\iff z=0$ and $g_i'(z)=0\iff z=0$.
Note that $g_i'$ refers to the derivative of $g_i$, that is, $g_i'=dg_i(z)/dz$, and $g_i^{-1}$ and $(g_i')^{-1}$ are the inverse functions of $g_i$ and $g_i'$, respectively.
5) There exist three constants $\gamma_1,\gamma_2,\gamma_3\ge0$ such that, if we define $\Phi_i(z)=g_i'(z)-g_i(z)$, then
$$\Phi_i(z)\ge\gamma_1\,z\,e^{\gamma_2 z}-\gamma_3|z| \quad\text{and}\quad \Phi_i'(z)\ge0$$
for almost all $z\in\mathbb{R}$.
Then, there exists a unique minimizer to the optimization problem defined by (5), where $\mathcal{K}=L^2([0,T],H_0^1(\Omega))$ and $F$ is as defined in (6) with $f\in L^2([0,T],L^2(\Omega))$.
Proof: See Appendix B.
Remark 3: In the functional $F$ as defined in (6), the functions $g_i$ and $f$ are arbitrary. Hence, they are design parameters left to be defined by the user, and well-posedness is guaranteed as long as the conditions stated in Theorem 1 are satisfied.
Remark 4: Condition 2) guarantees that the functional (6) is well defined, while conditions 1) and 3) ensure that Proposition 3 holds, at least in a pointwise manner. It is very important to note that the conditions imposed in Theorem 1 on the functions $g_i(\cdot)$ are mild, and it is not hard to find such functions. For example, if $\Omega$ and $T$ are bounded, then $g(z)=(e^z/2)(z^2+2)-(z+1)$ satisfies all these conditions. In this case, $\Phi(z)=g'(z)-g(z)=ze^z+z\ge ze^z-|z|$, where $\Phi(0)=0$, and $\Phi'(z)=g''(z)-g'(z)=(z+1)e^z+1\ge0$. Furthermore, for any $\omega\in\mathcal{K}$ that is nonzero almost everywhere in $\Omega$, the integrand $g(\log(|\nabla_i\omega|))$ is integrable over $\Omega\times[0,T]$: using Hölder's inequality and the fact that $\Omega$ is bounded and $T<\infty$, condition 2) is satisfied as well. Theorem 1 shows that our characterization is not ill-posed: the solution of the optimization problem (5) yields a certain activation function such that, if it is used in a deep neural network with $T$ layers, the evolution of the network during training is smooth almost everywhere and we are guaranteed not to encounter a vanishing or exploding gradient problem. However, this characterization is still implicit, and solving such infinite-dimensional optimization problems is not trivial. The following theorem proposes a better characterization.
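As a concrete check of Remark 4, the sketch below numerically verifies the stated properties of the example $g(z)=(e^z/2)(z^2+2)-(z+1)$: nonnegativity, trivial roots of $g$ and $g'$, the identity $\Phi(z)=g'(z)-g(z)=ze^z+z$, and $\Phi'(z)=(z+1)e^z+1\ge0$.

```python
import numpy as np

def g(z):
    # Candidate from Remark 4: g(z) = (e^z / 2)(z^2 + 2) - (z + 1)
    return 0.5 * np.exp(z) * (z**2 + 2) - (z + 1)

def dg(z):
    # g'(z) = (e^z / 2)(z^2 + 2z + 2) - 1
    return 0.5 * np.exp(z) * (z**2 + 2*z + 2) - 1

z = np.linspace(-10.0, 10.0, 200001)
phi = dg(z) - g(z)                    # Phi(z) = g'(z) - g(z)
dphi = (z + 1) * np.exp(z) + 1        # Phi'(z) as given in Remark 4

print("g is nonnegative:        ", bool(np.all(g(z) >= -1e-12)))
print("g(0) = 0 and g'(0) = 0:  ", np.isclose(g(0.0), 0.0) and np.isclose(dg(0.0), 0.0))
print("Phi(z) equals z e^z + z: ", bool(np.allclose(phi, z * np.exp(z) + z)))
print("Phi'(z) is nonnegative:  ", bool(np.all(dphi >= 0.0)))
```

All four checks pass on the sampled grid, confirming that such $g$'s are indeed easy to construct.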
Theorem 2: Let $\Omega\times[0,T]$ be a bounded subset of $\mathbb{R}^{n+1}$. Consider the optimization problem (5), and suppose that all the conditions in Theorem 1 are satisfied.
If $u\in\mathcal{K}$ solves the minimization problem (5) such that $u(x,0)=u_0(x)\in L^2(\Omega)$, then $u$ is a solution to a partial differential equation (PDE), denoted (7) in the following, whose terms are determined by the design parameters $g_i$, $f$, and $u_0$.
Proof: See Appendix C.
Remark 5: In the PDE (7), the terms $g_i$, $f$, and $u_0$ are controlling parameters that can be tuned arbitrarily to generate any family of activation functions. In Section IV, we will present an example of how to define these parameters in order to generate a new family of activation functions from the ReLU activation.
Recall from Proposition 3 that bounding the terms involving the $g_i$'s guarantees vanishing/exploding-free gradient descent training. It is clear that, if $u$ is a solution to (7) and, hence, minimizes (6), then the first term of (6) is bounded, which means that $g_i(\log(|\nabla_i u(x,t)|))$ must be bounded almost everywhere for all $i$. However, we still lack an explicit estimate of how the bound on this first term relates to the design parameters $u_0$ and $f$. The following corollary closes the loop by presenting an explicit bound of the first term of (6) as a function of $f$ and $u_0$.
Proof: See Appendix D.
Remark 6: Corollary 1 is a very important result since it implies that the first term of (6) is bounded in terms of $f$ and $u_0$ for all $i$, which, by Propositions 1 and 3, means that $f$ and $u_0$ control the gradient flow during training. Furthermore, since $f\in L^1([0,T],L^2(\Omega))$ and $u_0(x)\in L^2(\Omega)$, we are guaranteed that activation functions generated by PDE (7) will never encounter vanishing or exploding gradient problems.
The following theorem guarantees that, if $u(\cdot,\cdot)$ is a solution to PDE (7), then, for any $t\ge0$, if $u(\cdot,t)$ is used as an activation function, the induced single-hidden-layer feedforward neural network is a universal approximator.
Theorem 3: Let $\Omega$ be a compact subset of $\mathbb{R}$ such that $0\in\Omega$, and let $u(\cdot,\cdot)$ be a solution to PDE (7). For a compact set $D\subset\mathbb{R}^d$, define the set
$$\Sigma^d\triangleq\Bigl\{h:D\to\mathbb{R}\ \Big|\ h(x)=\sum_{i=1}^{N}c_i\,u\bigl(w_i^{T}x+\theta_i,\,t\bigr),\ N\in\mathbb{N},\ c_i,\theta_i\in\mathbb{R},\ w_i\in\mathbb{R}^{d}\Bigr\}.$$
If $u(\cdot,t)$ is not an algebraic polynomial almost everywhere in $\Omega$, then the set $\Sigma^d$ is dense in $C(D)$ in the $L^2$-norm.
Proof: See Appendix E.
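A small random-features experiment can illustrate the density claim of Theorem 3 numerically; this is only an informal sketch (not the construction used in the proof), using a GELU-type activation of the kind derived in Section IV as a stand-in for $u(\cdot,t)$, with random inner weights and a least-squares fit of the outer coefficients.

```python
import numpy as np
from scipy.special import erf

rng = np.random.default_rng(0)
# Stand-in for u(., t): the GELU-type activation (assumed form, see Section IV).
act = lambda z: 0.5 * z * (1.0 + erf(z / np.sqrt(2.0)))

x = np.linspace(-np.pi, np.pi, 400)[:, None]   # compact domain D
target = np.sin(3 * x).ravel()                 # smooth target function on D

for width in (5, 20, 100):
    w = rng.normal(0.0, 2.0, (1, width))       # random inner weights w_i
    b = rng.uniform(-np.pi, np.pi, width)      # random biases theta_i
    features = act(x @ w + b)                  # columns: u(w_i x + theta_i)
    c, *_ = np.linalg.lstsq(features, target, rcond=None)   # outer coefficients c_i
    err = np.max(np.abs(features @ c - target))
    print(f"width={width:4d}: sup-norm error = {err:.3f}")
```

The approximation error shrinks as the number of terms grows, in line with the density statement, although the random-features fit is of course far from the best achievable combination.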

IV. EXAMPLES
In this section, some examples relevant to machine learning are provided to show how our approach can derive activation functions that satisfy conditions in Section III-B.

A. Example 1 (ReLU Like Activations)
The rectified linear unit (ReLU) function was the main breakthrough that led to the current development of the state of the art in deep networks. Its simplicity and capability to circumvent the major drawbacks that sigmoid functions suffered from during training via gradient descent allowed its large-scale utilization in supervised learning. However, by design, ReLU activates only when the input is nonnegative. To circumvent that issue, several other activation functions that are derived from ReLU and, hence, share almost the same properties were later proposed in the literature. These functions are either handcrafted or computer designed based
on the use of some search techniques that automate the discovery of new activation functions. One of the interesting hand-designed ReLU substitutes is the Gaussian error linear unit (GELU) [10], defined as $x\mapsto(x/2)[1+\operatorname{erf}(x/\sqrt{2})]$, where $\operatorname{erf}(\cdot)$ is the error function. Another interesting example, designed by an automated search technique, was proposed by a Google Brain team, which called it the Swish function [6]; it is defined as $x\mapsto x\cdot\mathrm{sigmoid}(\rho x)$. As shown in [6] and [10], the Swish function (as well as GELU) outperforms ReLU and many other functions in several learning tasks. Nevertheless, its construction required computationally complex simulations, and, moreover, there is no rigorous mathematical justification behind it, which can be inconvenient if we want to automate the process of designing optimal neural networks. In this example, we will show mathematically, using our characterization, that we can analytically derive a family of functions similar to Swish and GELU from ReLU.
To start, fix $T\ge0$, and let $\Omega\subset\mathbb{R}^n$ be a bounded set. Suppose, without loss of generality, a neural network where the activation functions of neurons of the same layer are identical, which implies from (3) that $\bar u(x,t)=\sum_{i=1}^{n}\sigma(x_i,t)$, where $\sigma(\cdot,t):\mathbb{R}\to\mathbb{R}$ is the activation function at layer $t$, and $\nabla_i\bar u(x,t)=\partial\sigma(x_i,t)/\partial x_i$. Then, PDE (7) can be written componentwise, which we refer to as (9). Assume that $f(x,t)=\sum_{i=1}^{n}f(x_i,t)$, and define $u:\mathbb{R}\times\mathbb{R}\to\mathbb{R}$ such that $u(x,t)=\sigma(x,t)$ for all $x,t\in\mathbb{R}$. Then, solving (9) reduces to solving a scalar PDE, (10), posed on $[-b,b]\times[0,T]$, where $b>0$ is an arbitrary bounded real and $\beta(\cdot)$ denotes the nonlinear coefficient induced by $g$. Now, let us design the function $g$ as the solution to an ODE parameterized by two constants $\epsilon>0$ and $\alpha\ge0$; one can easily check that the resulting $g$ satisfies all the conditions of Theorem 1 in $[-b,b]\times[0,T]$. Plugging $g(\cdot)$ back into $\beta$ yields $\beta(\nabla u)=-\alpha\,(\|\nabla u\|^2+\epsilon)/\|\nabla u\|^2$. Now, let $\epsilon\ll1$ such that the PDE (10) reduces to
$$\frac{\partial u(x,t)}{\partial t}=(\alpha+\eta)\,\frac{\partial^2 u(x,t)}{\partial x^2}+f(x,t). \tag{11}$$
Note that, by Corollary 1, Remark 6, and the properties of $g\circ\log$, if $u$ is a solution to (10), then there always exists an $\epsilon$ such that (11) holds. In what follows, we will derive the GELU and Swish activation functions by choosing an adequate initial condition $u_0$ and tuning the function $f$.
1) GELU Activation: Choose $u_0$ to be a ReLU function on $[-b,b]$, and choose $f$ to be a Gaussian-shaped source scaled by a constant $\mu\in\mathbb{R}$ to be determined later. One can note that the function $f$, as defined above, has the same shape as a normal distribution with zero mean and variance proportional to $t$.
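Before the analytic Fourier-transform solution below, a simple finite-difference integration gives a numerical sanity check of the reduced problem; the sketch assumes (11) in the heat-equation form given above, with illustrative constants, a ReLU initial condition, and $f\equiv0$ for simplicity.

```python
import numpy as np

# Minimal explicit finite-difference sketch of the assumed reduced PDE
# u_t = (alpha + eta) u_xx + f(x, t) on [-b, b]; all constants are illustrative.
alpha, eta, b = 0.1, 0.4, 10.0
x = np.linspace(-b, b, 2001)
dx = x[1] - x[0]
dt = 0.4 * dx**2 / (alpha + eta)          # respects the explicit-scheme stability limit

def solve(u0, f, t_final):
    u, t = u0.copy(), 0.0
    while t < t_final:
        uxx = np.zeros_like(u)
        uxx[1:-1] = (u[2:] - 2.0 * u[1:-1] + u[:-2]) / dx**2
        u = u + dt * ((alpha + eta) * uxx + f(x, t))
        t += dt
    return u

u0 = np.maximum(x, 0.0)                               # ReLU initial condition u(x, 0)
u_final = solve(u0, lambda x, t: 0.0, t_final=1.0)    # homogeneous case f == 0 for simplicity
print("u(0, 1) =", u_final[len(x) // 2])              # the ReLU kink is smoothed by the flow
```

The diffusion term smooths the kink of the ReLU initial condition; with the Gaussian-shaped source $f$ described above, the solution can then be steered toward the GELU-type family derived next.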
In order to solve PDE (11) along with the nonhomogeneous term $f$ and the initial condition $u(x,0)=u_0(x)$ defined above, let us take the Fourier transform with respect to $x$ such that
$$\frac{\partial\hat u(\omega,t)}{\partial t}=-(\alpha+\eta)\,\omega^2\,\hat u(\omega,t)+\hat f(\omega,t)$$
where $\hat u$ is the Fourier transform of $u$. Solving the ODE above yields
$$\hat u(\omega,t)=\underbrace{e^{-(\alpha+\eta)\omega^2 t}\,\hat u_0(\omega)}_{A}+\int_0^{t}e^{-(\alpha+\eta)\omega^2(t-s)}\,\hat f(\omega,s)\,ds$$
where $\hat u_0$ is the Fourier transform of $u_0$. Note that the inverse Fourier transform of $e^{-(\alpha+\eta)\omega^2 t}$ is a Gaussian kernel with variance proportional to $(\alpha+\eta)t$. Now, part A is a convolution (of this Gaussian kernel with $u_0$) that can be computed explicitly.
We can already observe that some parts of $u(x,t)$ look like the GELU activation. To make this similarity clearer, recall that $\eta>0$ and $\alpha\ge0$ are small constants that can be arbitrarily chosen; thus, the number of layers $t\le1/(4(\alpha+\eta))$ can be arbitrarily large. Furthermore, if we define $\mu=-2(\alpha+\eta)$, we obtain an expression, (12), that has the same form as the GELU activation function.
In fact, one can note that the GELU activation function introduced in [10] is a special case of (12) when $(\alpha+\eta)t=1/2$. Furthermore, if we fix $\alpha+\eta$ small, then, for any deep neural network of depth not larger than $1/(4(\alpha+\eta))$, the use of our general GELU activation function $u(\cdot,t)$ defined in (12) at each layer $t$ guarantees all the conditions defined in Section III-B. Moreover, recall that vanishing- and exploding-free gradient descent learning is guaranteed by Corollary 1 and Remark 6.
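The following sketch assumes that (12) takes the closed form $u(x,t)=(x/2)[1+\operatorname{erf}(x/(2\sqrt{(\alpha+\eta)t}))]$, an assumption consistent with the stated special case at $(\alpha+\eta)t=1/2$; it checks that special case numerically and shows that early layers stay close to ReLU.

```python
import numpy as np
from scipy.special import erf

def general_gelu(x, t, alpha_plus_eta):
    # Assumed closed form of the family (12): heat-kernel-smoothed ReLU,
    # u(x, t) = (x/2)[1 + erf(x / (2*sqrt((alpha+eta)*t)))].
    return 0.5 * x * (1.0 + erf(x / (2.0 * np.sqrt(alpha_plus_eta * t))))

def gelu(x):
    # Standard GELU from [10]: (x/2)[1 + erf(x / sqrt(2))].
    return 0.5 * x * (1.0 + erf(x / np.sqrt(2.0)))

x = np.linspace(-6.0, 6.0, 1201)
alpha_plus_eta = 0.01              # small constant, chosen for illustration
t_gelu = 0.5 / alpha_plus_eta      # layer index where (alpha + eta) * t = 1/2

print("max |u(., t) - GELU| at (alpha+eta)t = 1/2:",
      np.max(np.abs(general_gelu(x, t_gelu, alpha_plus_eta) - gelu(x))))
for t in (1, 10, 50):
    dev = np.max(np.abs(general_gelu(x, t, alpha_plus_eta) - np.maximum(x, 0.0)))
    print(f"t={t:3d}: max |u(., t) - ReLU| = {dev:.3f}")   # early layers remain ReLU-like
```

Under this assumed form, the family interpolates smoothly between ReLU at shallow layers and increasingly smoothed GELU-like profiles at deeper ones.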

B. Example 2 (Sigmoid Like Activations)
As shown in Section II, training deep neural networks of the form (1) with sigmoid activation functions suffers from vanishing or exploding gradients [11], [12]. However, as proved in Section III-B, if we restrict the set of trainable weights to the unitary group [7], [8] and define the activation functions of each layer as a solution to the PDE (7), then, by Corollary 1 and Remark 6, we are guaranteed that such exploding or vanishing gradient problems will not arise. In this example, we derive such sigmoid functions by solving PDE (11) for $f\equiv0$ with an appropriate initial condition. Following the same steps as in the previous example to solve PDE (11), we obtain a solution $u(\cdot,t)$ whose evolution at different layers $t$ is shown in Fig. 2. We observe that, if the depth of the neural network is not larger than $t=20$, then the solution $u(\cdot,t)$ for $t\in[0,20]$ can be approximated by the hyperbolic tangent $\tanh(\cdot)$. Hence, if we use the hyperbolic tangent activation function for neural networks with depth not larger than 20, then, by Corollary 1 and Remark 6, we are guaranteed not to encounter exploding or vanishing gradient problems during training. Now, if the depth is larger than 20, different activation functions need to be used in the subsequent layers to avoid a vanishing or exploding gradient during training, as shown in Fig. 2.
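For a numerical illustration of this example, the sketch below evolves a tanh-shaped initial condition under the assumed heat-equation form of (11) with $f\equiv0$; the diffusion constant and the initial condition are illustrative assumptions, not the exact choices of the article. It shows the solution staying close to $\tanh$ for shallow depths and drifting away as the depth grows.

```python
import numpy as np

# Illustrative reconstruction of Example 2 under stated assumptions: f == 0,
# a tanh-shaped initial condition, and the heat-equation form of (11).
kappa = 0.002                                 # alpha + eta, chosen small for illustration
x = np.linspace(-8.0, 8.0, 1601)
dx = x[1] - x[0]
dt = 0.4 * dx**2 / kappa                      # explicit-scheme stability limit respected

def evolve(u0, t_final):
    u, t = u0.copy(), 0.0
    while t < t_final:
        u[1:-1] += dt * kappa * (u[2:] - 2.0 * u[1:-1] + u[:-2]) / dx**2
        t += dt
    return u

u0 = np.tanh(x)
for depth in (5, 20, 100, 500):
    gap = np.max(np.abs(evolve(u0, depth) - np.tanh(x)))
    print(f"layer t={depth:4d}: max |u(., t) - tanh| = {gap:.3f}")
```

With a small diffusion constant, the deviation from $\tanh$ stays modest up to moderate depths and grows steadily beyond them, mirroring the behavior described around Fig. 2.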

V. CONCLUSION
In this article, a novel methodology for designing activation functions in deep neural networks is provided. The methodology is based on the formal characterization of the desired activation functions so that they satisfy particular properties, including avoidance of exploding and/or vanishing gradients during training. Hence, this methodology allows circumventing some of the main issues encountered in training deep neural networks. Furthermore, it can be used to formalize the design of efficient and easily trainable deep neural networks. In order to illustrate, validate, and motivate the methodology, some representative and analytical examples were provided.

APPENDIX A NEURAL NETWORK EVOLUTION DURING TRAINING: COMPUTATION OF κ(•, •)
First, $\partial L/\partial W$ is an $n\times n$ matrix that can be computed via the chain rule. Expanding it in terms of the dyadic product $\otimes$ of the involved vectors yields the bilinear form $\kappa(\cdot,\cdot)$ used in Section II.

APPENDIX B PROOF OF THEOREM 1
Proof: This proof is done in three steps, as follows.
Step 1: Let us show that, for any $\omega\in\mathcal{K}$, the functional $\omega\mapsto\int_0^T\!\!\int_\Omega g_i(\log(|\omega|))\,dt\,dx$ is convex. In order to do so, it suffices to show that the maps $\tilde g_i:z\mapsto g_i(\log(|z|))$, for any $z\in\mathbb{R}$ and all $i$, are convex. We have
$$\frac{\partial^2\tilde g_i}{\partial z^2}(z)=\frac{g_i''(\log(|z|))-g_i'(\log(|z|))}{z^2}.$$
Hence, by condition 5) of the theorem, we get $\partial^2\tilde g_i/\partial z^2(z)\ge0$ for all $z$, which shows that $\tilde g_i$ is convex. Therefore, for any $u,\omega\in\mathcal{K}$ and all $i$, the corresponding pointwise convexity inequality, (13), holds almost everywhere in $\Omega\times[0,T]$.
Step 2: Prove that $\mathcal{K}=L^2([0,T],H_0^1(\Omega))$ is a Hilbert space with the following inner product:
$$\langle u,v\rangle_{\mathcal{K}}\triangleq\int_0^T\langle u(\cdot,t),v(\cdot,t)\rangle_{H^1(\Omega)}\,dt. \tag{14}$$
To motivate the choice of this inner product, note that $\mathcal{K}$ is a subspace of $L^2([0,T],H^1(\Omega))$. Moreover, the norm induced by the inner product (14) is denoted (15). Let us show that $\mathcal{K}$ is a Hilbert space. First, it is clear that (14) satisfies all the properties of an inner product. Thus, to complete the proof, it suffices to show that $\mathcal{K}$ is complete for the norm (15). Recall that a metric space $S$ is called complete if every Cauchy sequence in $S$ converges in $S$. In order to prove that $\mathcal{K}$ is complete, let $(u_i(\cdot,\cdot))_{i\ge0}$ be a Cauchy sequence in $\mathcal{K}$, and recall that $H_0^1(\Omega)$ is a closed subspace of $H^1(\Omega)$; thus, $H_0^1(\Omega)$ is also a Hilbert space (hence complete) [13], [14]. Note that, for almost every fixed $t\in\mathbb{R}_{\ge0}$, the sequence $(u_i(\cdot,t))_{i\ge0}$ is a Cauchy sequence in $H_0^1(\Omega)$ and, by completeness of $H_0^1(\Omega)$, it converges to some limit $u(\cdot,t)\in H_0^1(\Omega)$; the resulting function has finite $\mathcal{K}$-norm since $u(\cdot,t)\in H_0^1(\Omega)$ almost everywhere in $[0,T]$ and $T$ is bounded. Hence, $u\in\mathcal{K}$, which concludes the completeness and the proof of this step.
Step 3 (Proof of Existence of a Minimum): First, note that, by the Cauchy–Schwarz inequality, there exists a constant $\delta\in\mathbb{R}$ such that, for all $u\in\mathcal{K}$, $F(u)\ge\delta$, which means that $\inf_{u\in\mathcal{K}}F(u)$ is bounded. By definition of the infimum, there exists a minimizing sequence $(u_i)_{i\ge0}$ in $\mathcal{K}$ such that $F(u_i)\to\inf_{v\in\mathcal{K}}F(v)$. Using property 3) of the theorem, we can obtain the coercivity of $F$, which means that, for any sequence $(f_j)_{j\ge0}$ in $\mathcal{K}$, $\|f_j\|_{\mathcal{K}}\to\infty$ implies $F(f_j)\to\infty$. Thus, our minimizing sequence $(u_i)_{i\ge0}$ is bounded in $\mathcal{K}$. Now, let us show that it converges in $\mathcal{K}$ by showing that it is a Cauchy sequence. First, note that, if $u,v\in\mathcal{K}$, then the triangle inequality holds for the corresponding norms. Hence, by using (13) and these triangle inequalities, we obtain a bound on $\|\nabla u_i-\nabla u_j\|$ in terms of $F(u_i)+F(u_j)-2\inf_{v\in\mathcal{K}}F(v)$. Now, since $\Omega$ is bounded and $u_i-u_j\in\mathcal{K}$, then, by using the Poincaré inequality, there exists a constant $C$ (dependent on $\Omega$) such that $\|u_i-u_j\|_{\mathcal{K}}$ is controlled by the same quantity. Since $(u_i)_{i\ge0}$ is a minimizing sequence, this quantity tends to $0$ when $i,j\to\infty$. Thus, we conclude that $(u_i)_{i\ge0}$ is a Cauchy sequence, which, by completeness of Hilbert spaces, must converge to some fixed point $\tilde u\in\mathcal{K}$ such that $F(\tilde u)=\inf_{v\in\mathcal{K}}F(v)$, which completes the proof of the existence of a minimum. In order to show uniqueness, let us proceed by contradiction. Suppose that there exist two different minimizers $u,v\in\mathcal{K}$, that is, $F(u)=F(v)=\inf_{\omega\in\mathcal{K}}F(\omega)$. By convexity of $\mathcal{K}$, we have $(u+v)/2\in\mathcal{K}$. Using inequalities (16) and (18), we get $F((u+v)/2)<F(u)$ whenever $u\neq v$. Since $F((u+v)/2)\ge F(u)$, we conclude that $u=v$, which is a contradiction (since we supposed that $u\neq v$).

APPENDIX C PROOF OF THEOREM 2
Proof: The proof is based on the calculus of variations. We start by noting that, since $F$ is smooth and has a unique minimizer, its optimum is achieved when $\delta F(u)/\delta u=0$, where $\delta\cdot/\delta u$ refers to the Gâteaux differential, defined as
$$\frac{\delta F(u)}{\delta u}(\omega)\triangleq\lim_{\epsilon\to0}\frac{F(u+\epsilon\omega)-F(u)}{\epsilon}=\frac{dF(u+\epsilon\omega)}{d\epsilon}\bigg|_{\epsilon=0}.$$
Note that the considered perturbation $\omega$ is independent of $t$; hence, the resulting variational formulation will be a first-order ordinary differential equation in $t$ [13]. In order to compute
$dF(u+\epsilon\omega)/d\epsilon|_{\epsilon=0}$, we proceed term by term. Taking the limit in the first term yields its contribution to the variation. Recall that integration by parts (Green's formula) gives
$$\int_\Omega v\,\Delta w\,dx=-\int_\Omega\nabla v\cdot\nabla w\,dx+\int_{\partial\Omega}v\,\frac{\partial w}{\partial n}\,dS.$$
By using that formula and the fact that functions in $H_0^1(\Omega)$ vanish at the boundary $\partial\Omega$, the boundary terms drop. Similarly, we compute the next term, $\int_0^T\!\!\int_\Omega(\nabla u+\epsilon\nabla\omega)\cdot\nabla\omega\,dx\,dt$.
By taking the limit, we obtain its contribution as well, and the last term is computed in the same way. Gathering all the terms gives the Gâteaux differential of $F$. Now, by reordering the terms and recalling that the minimizing function is characterized by $\delta F(u)/\delta u=0$, and since the resulting equality should hold for any $\omega\in H_0^1(\Omega)$ and all $t\in[0,T]$, we conclude that, if $u\in\mathcal{K}$ is a solution to the optimization problem (5), then it satisfies the associated Euler–Lagrange equation, (21), almost everywhere in $\Omega$ and for almost all $t\in[0,T]$. Furthermore, differentiating (21) in time yields the PDE (7).

APPENDIX D PROOF OF COROLLARY 1
Proof: Consider PDE (7). Let us multiply both sides of the PDE by $u$, integrate over $\Omega$, and compute the resulting integral term by term, using Green's formula where needed.

Note that properties 3) and 4) of Theorem 1 imply that the functions $g_i$ are nonnegative. Furthermore, property 5) yields an inequality involving the constants $\gamma_1$, $\gamma_2$, and $\gamma_3$. Moreover, since the function $z\mapsto z\log(z)$ is lower bounded on $\mathbb{R}_{\ge0}$, there exists a constant $\gamma\ge0$, a function of $\gamma_1$ and $\gamma_2$, such that $\int_\Omega\gamma_1|\nabla_i u|^{\gamma_2}\log(|\nabla_i u|^{\gamma_2})\,dx\ge-\gamma/n$. Now, recall from the proof of Proposition 2 that there exists a constant $m_i$ such that $g_i(z)\ge m_i|z|$ for all $z\in\mathbb{R}$. Furthermore, let the constants $\gamma_3$ and $m_i$ be designed such that $\gamma_3/m_i<1$, and define $c_i=1-\gamma_3/m_i>0$. Gathering all the terms together and using the Cauchy–Schwarz inequality yields inequality (22). Hence, from inequality (22), noting that $u(x,0)=u_0(x)$, and by concavity of the square-root function, we obtain a differential inequality for $\rho$ driven by $\gamma$ and $\|f(t)\|_{L^2(\Omega)}$.

Now, let us define $\Psi(V)=\int_0^V(1/\omega(x))\,dx$ such that $\partial\Psi/\partial V=1/\omega(V)$. Therefore, from (25), the differential inequality becomes separable, and integrating both sides yields (26). The next step is to compute $\Psi(\cdot)$ explicitly. Substituting it into (26), noting that $V(0)=\rho_0$, and replacing $\rho$ with the expression in (23) yields the result of Corollary 1 and concludes the proof.

APPENDIX E PROOF OF THEOREM 3
Proof: We start by recalling that, if $u$ is a solution to PDE (7), then $u\in L^2([0,T],H_0^1(\Omega))\cap L^\infty([0,T],H_0^1(\Omega))$. Thus, for each fixed $t\in\mathbb{R}_{\ge0}$, if $\psi(\cdot)\equiv u(\cdot,t)$, then $\psi\in H_0^1(\Omega)$. It is well known that $H_0^1(\Omega)$ is the closure of the space of smooth and compactly supported functions $C_0^\infty(\Omega)$ [13], [14]; equivalently, $C_0^\infty(\Omega)$ is dense in $H_0^1(\Omega)$. Thus, there exists a sequence $\psi_n\in C_0^\infty(\Omega)$ that can be arbitrarily close to $\psi$, that is, for any $\epsilon>0$, there exists $N\in\mathbb{N}$ such that $\|\psi_n-\psi\|_{H^1(\Omega)}\le\epsilon/(2K_1)$ for all $n>N$ and some arbitrary constant $K_1$. Next, we show that there exists a constant $M\in\mathbb{N}$ such that, for any $n>M$, the set $\bar\Sigma_n^d$ built from $\psi_n$ is dense in $C(D)$. Since $\psi_n$ is a smooth function, $\psi_n\in L^\infty(\Omega)$; this holds because any continuous function on a compact domain is bounded. Moreover, since $\psi$ is not an algebraic polynomial, there exists a constant $M\in\mathbb{N}$ such that, for all $n>M$, $\psi_n$ is not an algebraic polynomial. Therefore, by [16, Th. 1], we conclude that, for any $n>M$, the set $\bar\Sigma_n^d$ is dense in $C(D)$ under the uniform norm $\|\cdot\|_{L^\infty(D)}$, that is, for any $\epsilon>0$ and any $f\in C(D)$, there exists a collection of coefficients $c_i,\theta_i\in\mathbb{R}$ and $w_i\in\mathbb{R}^d$ such that the corresponding element of $\bar\Sigma_n^d$ is within $\epsilon/K_2$ of $f$, where $K_2>0$ is some arbitrary constant. Thus, for any $n>\max\{M,N\}$, the triangle inequality bounds the distance between $f$ and the corresponding element of $\Sigma^d$ by a constant $C(D)$ (depending on $D$) times the sum of the two approximation errors. Thus, if we pick $K_1\ge d\max_i\|c_i\|_2$ and $K_2\ge C(\Omega)$, then the total error is less than $\epsilon$; that is, $\Sigma^d$ is dense in $C(D)$ in the $L^2$-norm, which concludes the proof.

Fig. 1 shows numerically how the solutions of PDE (11) for different parameters $k$, such that $k=4(\alpha+\eta)t$, coincide with the Swish function for different $\rho$'s. This similarity indicates that we are able to generate functions such as Swish analytically, with the advantage of having mathematical bounds, as shown in Corollary 1.
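To illustrate the coincidence with Swish noted in the Fig. 1 caption, the sketch below compares the assumed closed-form family $u(x,k)=(x/2)[1+\operatorname{erf}(x/\sqrt{k})]$, with $k=4(\alpha+\eta)t$, against $x\cdot\mathrm{sigmoid}(\rho x)$ over a simple grid search in $\rho$; all numerical values are illustrative assumptions.

```python
import numpy as np
from scipy.special import erf

def u(x, k):
    # Assumed closed form of the PDE-(11) solution family written in terms of
    # k = 4*(alpha + eta)*t, consistent with the GELU special case at k = 2.
    return 0.5 * x * (1.0 + erf(x / np.sqrt(k)))

def swish(x, rho):
    return x / (1.0 + np.exp(-rho * x))

x = np.linspace(-6.0, 6.0, 1201)
rhos = np.linspace(0.2, 6.0, 581)                  # simple grid search over rho
for k in (0.5, 1.0, 2.0, 4.0):
    best_rho = min(rhos, key=lambda r: np.max(np.abs(u(x, k) - swish(x, r))))
    gap = np.max(np.abs(u(x, k) - swish(x, best_rho)))
    print(f"k={k:.1f}: closest Swish has rho ~ {best_rho:.2f}, max gap = {gap:.3f}")
```

For each $k$, a suitable $\rho$ keeps the pointwise gap small over the plotted range, which is the numerical coincidence the figure reports.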