Using Neural Networks for Fast Numerical Integration and Optimization

We present a novel numerical integration technique, Neural Network Integration (NNI), in which a shallow neural network is designed to approximate an integrand function within a bounded set. The function approximation is constructed such that a closed-form solution exists for its definite integral across any generalized polyhedron within the network's domain. This closed-form solution allows for fast integral evaluation of the function across different bounds, following the initial training of the network. In other words, it becomes possible to "pre-compute" the numerical integration problem, allowing for rapid evaluation later. Experimental tests are performed using the Genz integration test functions. These experiments show NNI to be a viable integration method, working best on predictable integrand functions but giving worse results on singular and non-smooth functions. NNI is proposed as a solution to problems where numerical integrations in higher dimensions must be performed over different domains frequently or rapidly and with low memory requirements, such as in real-time or embedded engineering applications. The application of the method to the optimization of integral functions is also discussed.


I. INTRODUCTION
The need to integrate across multiple dimensions frequently arises in many engineering, statistics, and financial problems [1]. In many cases, the integration does not have a closed-form solution, meaning that approximate numerical methods must be used instead. Currently, multidimensional numerical integration can be done through a variety of methods. The classical one-dimensional approaches of trapezoidal rule or Simpson's rule can be expanded to N dimensions through ''product rules,'' by taking Cartesian products of each dimension [1], [2]. Unfortunately, the error bound of these methods, as well as other piecewise quadrature methods, degrades with increasing dimension. This degradation is known as the ''curse of dimensionality,'' whereby the computing cost grows exponentially with dimension [3]. Sparse grid methods offer slightly better error convergence, but still struggle in high dimensions, and only work for highly regular functions [4].
The associate editor coordinating the review of this manuscript and approving it for publication was Mauro Gaggero .
The Monte Carlo integration method can perform well in very high dimensional space, as its complexity is independent of the dimension of the problem [3]. However, it is a stochastic method and thus only provides a probabilistic bound on integration error. Furthermore, the error convergence of the Monte Carlo method is slow, as it does not benefit from any regularity the integrand function may offer. Quasi-Monte Carlo integration methods use well-chosen deterministic grids to ensure uniform spacing of the sample points, and can achieve deterministic error bounds. These methods can be shown to have slightly better error convergence [1], but still not comparable to that of quadrature techniques. Adaptive methods sample the integrand function iteratively in an attempt to gain the most information about the function in the regions where the highest variance occurs. These methods can be applied to both quadrature methods [5], [6] and quasi-Monte Carlo methods [7], [8].
A common trait of all the above methods is that there is little that can be done to ''pre-compute'' the integration problem. Accordingly, in applications where integrations of a function must be carried out many times over different regions, the computations involve repetitive calculations at high cost. If severe restrictions are placed on the computation power available, such as in real-time applications, then the problem can become intractable. This issue is further compounded if the integrand function itself is numerically expensive to compute. This paper introduces Neural Network Integration (NNI), a novel numerical integration methodology, as a means of solving these types of problems in a tractable way. NNI is a numerical integration algorithm that uses a shallow neural network as the interpolation function. The main advantage of NNI is that it allows the bulk of computational processing to be performed a priori, allowing for rapid computation later. In NNI, a shallow neural network model is created to approximate the integrand function within a bounded integration region. Neural networks are a convenient means of representing functions since they are universal approximators [9] and generalize well into multiple dimensions when compared to splines [10]. While network training is computationally expensive, it only needs to be done once and can be performed on specialized hardware. Once training is complete, however, the network design is such that a closed-form solution to the integral exists across the network inputs. As such, subsequent evaluations of the integral for different sets of bounds can be computed rapidly.
The characteristics of NNI make it ideal for problems where repeated integrations of the same function must be performed across different bounds with high accuracy, little processing time, and low memory requirements. These characteristics occur in many engineering problems, in particular in real-time or embedded applications where computation power is limited or where lookup-table style solutions are precluded due to a lack of memory. We also show the application of NNI to the optimization of integral functions, where many integral evaluations are required. In optimization, it will be shown that NNI has the added advantage of allowing for easy estimation of the optimization gradient, facilitating the use of efficient gradient-based methods. A possible application of NNI in optimization is in the pose optimization of a robotic manipulator, where it is necessary to integrate over the six-dimensional space of positions and orientations of the end-effector in order to place the robot's task in an optimal region of its workspace.
The idea of using shallow neural network function approximators in numerical integration has been researched briefly before. Zhao and Hut [11] outlined a method for numerical integration with a cosine-based neural-network architecture. However, the cosine activation function does not allow for the approximation of all classes of functions and can be difficult to train due to being non-monotonic [12]. Furthermore, the approach only allowed for 1-dimensional function integration, which is more quickly accomplished through other quadrature methods. The proposed NNI method uses a much more flexible, adaptable, and trainable network architecture, which still allows for analytical integration. Furthermore, NNI can be applied to an arbitrary level of dimensionality, where the integration problems become less tractable, and the need for pre-computation is greater.
The organization of this paper is as follows. Section II formally presents the problem and the proposed Neural Network Integration theory, methods, and implementational notes. Section III presents numerical testing results for the proposed method in terms of both accuracy and evaluation time. Finally, Section IV discusses possible applications of the method, in particular in the solution to optimization problems of high-dimensional integral functions.

II. NEURAL NETWORK INTEGRATION (NNI)
Let O be a compact subset of ℝⁿ, over which a continuous real function f : ℝⁿ → ℝ is defined. Let S be a subset of O. The problem investigated in this paper is to evaluate the numerical integral of f across S in the form
$$I(f) = \int_S f(\mathbf{x})\, d\mathbf{x}. \qquad (1)$$
This paper will consider only the case where S is a convex, bounded set in O which is characterized by a finite number of linear inequalities or equalities. Thus, S can be defined compactly as
$$S = \{\, \mathbf{x} \in O : A_1\mathbf{x} \preceq b_1,\; A_2\mathbf{x} = b_2 \,\}, \qquad (2)$$
where $\preceq$ denotes a component-wise vector inequality. A₁x ⪯ b₁ denotes the intersection of a set of halfspaces, and A₂x = b₂ denotes the intersection of a set of hyperplanes. Geometrically, (2) is the formula for a generalized polyhedron of dimension d in ℝⁿ, where d = n − rank(A₂). This definition includes the common case of hypercube integration in ℝᵈ, but can also be extended to much more complex domains. This paper proposes solving (1) by creating a function estimator for f(x), denoted by f̂(x), using an artificial neural network architecture. Figure 1 shows a graphical representation of the proposed neural network function f̂(x) with n inputs, a single hidden layer with k nodes, and a single summing output node. Mathematically, f̂(x) has the form
$$\hat{f}(\mathbf{x}) = \sum_{j=1}^{k} w^{(2)}_{j}\, \varphi\!\left( \sum_{i=1}^{n} w^{(1)}_{ji} x_i + b^{(1)}_j \right) + b^{(2)}, \qquad (3)$$
where
• w⁽¹⁾ⱼᵢ and w⁽²⁾ⱼ are the elements of W⁽¹⁾ ∈ ℝᵏˣⁿ and W⁽²⁾ ∈ ℝᵏˣ¹, and represent the weights of the first and second layers, respectively,
• b⁽¹⁾ⱼ and b⁽²⁾ are the elements of b⁽¹⁾ ∈ ℝᵏ and b⁽²⁾ ∈ ℝ, and represent the biases of the first and second layers, respectively,
• xᵢ are the elements of x ∈ ℝⁿ, and
• φ is the logistic sigmoid function, defined as
$$\varphi(z) = \frac{1}{1 + e^{-z}}. \qquad (4)$$
To show the feasibility of integrating f̂(x) in the place of f(x), we show two results. First, it will be shown that it is possible to make the integration error $|I(f) - \hat{I}(\hat{f})|$ arbitrarily small, where
$$\hat{I}(\hat{f}) = \int_S \hat{f}(\mathbf{x})\, d\mathbf{x}. \qquad (5)$$
Secondly, it is demonstrated that a closed-form formulation of the definite integral Î(f̂) over any S ⊆ O is possible, and can be computed rapidly.
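For concreteness, the forward pass in (3) can be sketched in a few lines of Python. The weights below are illustrative hand-picked values, not a trained network:

```python
import math

def nn_forward(x, W1, b1, W2, b2):
    """Shallow network f_hat(x) of (3): k logistic hidden nodes,
    one linear summing output node."""
    out = b2
    for j in range(len(W1)):
        z = sum(W1[j][i] * x[i] for i in range(len(x))) + b1[j]  # first layer
        out += W2[j] / (1.0 + math.exp(-z))                      # w_j^(2) * phi(z)
    return out

# toy 2-input, 2-node network with hand-picked (untrained) weights
W1, b1 = [[1.0, -0.5], [0.3, 2.0]], [0.1, -0.2]
W2, b2 = [0.7, -1.1], 0.05
y = nn_forward([0.4, 0.6], W1, b1, W2, b2)
```

The same linear-combination-of-sigmoids structure is what makes the closed-form integration of Section II possible.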
Finally, examples will be given to show how these theorems can be applied practically.
Theorem 1 (Universal Approximation Theorem [9], [13], [14]): For a given f(x) and any ε > 0, one can find a network size k and weights W⁽¹⁾, W⁽²⁾, b⁽¹⁾ and b⁽²⁾, such that the approximation error of f̂(x) has an upper bound
$$|f(\mathbf{x}) - \hat{f}(\mathbf{x})| \leq \varepsilon \qquad (6)$$
for all x ∈ O.
The proof of this result is covered in detail in the cited references. The only observation required is that the proposed architecture of f̂(x) is congruent with the proofs therein, namely that 1) the chosen logistic activation function is bounded, continuous and non-constant; 2) the integrand function f(x) is continuous; and 3) the approximation domain S is bounded and compact.
Corollary 1: For a given f(x) and any ζ > 0, one can find a network size k and weights W⁽¹⁾, W⁽²⁾, b⁽¹⁾ and b⁽²⁾ such that the integration error of f̂(x) over S has an upper bound
$$|I(f) - \hat{I}(\hat{f})| \leq \zeta \qquad (7)$$
for any finite S ⊆ O.
Proof: From Theorem 1, f̂(x) can be chosen such that there exists an upper bound ε to the absolute error of the network for all x ∈ O. Then, it is simple to show that the error on Î(f̂) is also bounded,
$$|I(f) - \hat{I}(\hat{f})| \leq \int_S |f(\mathbf{x}) - \hat{f}(\mathbf{x})|\, d\mathbf{x} \leq \varepsilon\, V(S), \qquad (8)$$
where V(S) is the n-dimensional volume of S. S has finite volume, as it is a subset of the compact set O; it follows that the number V(S) is also finite. Accordingly, there exists a network configuration such that the integration error is bounded by ζ = εV(S), obtained by selecting the neural network such that ε ≤ ζ/V(S).
If a network f̂(x) is fully trained to approximate f(x) over O, then the network parameters concisely contain the entirety of the training data, and no further evaluations of the original function f(x) need to be performed. In cases where the integrand function itself is computationally expensive, f̂(x) already provides a convenient way of computing the function value quickly within O, or of interpolating between the existing function evaluations.
To establish the theory of NNI, we first note that the logistic sigmoid function φ, used in the neural network architecture, is of the same form as Li₀, the polylogarithm (or Jonquière's function) of order 0,
$$\operatorname{Li}_0(z) = \frac{z}{1-z}. \qquad (9)$$
As such, φ can be written, for an arbitrary argument κ(z), as
$$\varphi(\kappa(z)) = \frac{1}{1 + e^{-\kappa(z)}} = -\operatorname{Li}_0\!\left(-e^{\kappa(z)}\right). \qquad (10)$$
Lemma 1: If κ(z) is linear in z with constant derivative ∂κ/∂z = c, then
$$\int \operatorname{Li}_s\!\left(-e^{\kappa(z)}\right) dz = \frac{1}{c} \operatorname{Li}_{s+1}\!\left(-e^{\kappa(z)}\right), \qquad (11)$$
where hereon integration constants are ignored as the current work involves definite integrals only. This formulation is possible so long as ∂κ(z)/∂z is a constant, or, equivalently, κ(z) is linear in z.
Lemma 2: If κ(z₁, . . . , zₙ) is linear in z₁, . . . , zₙ, and φ(κ(z₁, . . . , zₙ)) is integrated in n dimensions over z₁, . . . , zₙ between some set of limits α₁, . . . , αₙ and β₁, . . . , βₙ, then a closed-form solution to this integral in terms of polylogarithm functions is possible if the integration limits are also linear in z₁, . . . , zₙ.
Proof: From (10), this statement is analogous to evaluating
$$\int_{\alpha_n}^{\beta_n} \cdots \int_{\alpha_1(z_2, \ldots, z_n)}^{\beta_1(z_2, \ldots, z_n)} -\operatorname{Li}_0\!\left(-e^{\kappa(z_1, \ldots, z_n)}\right) dz_1 \cdots dz_n, \qquad (12)$$
where all but the final integration limits αₙ and βₙ can be functions of the remaining integration variables. By Lemma 1, each integration in zᵢ of this formulation will be possible so long as the argument of the exponent is linear in zᵢ. Since it is assumed that κ(z₁, . . . , zₙ) is linear in z₁, . . . , zₙ, all integrations will be possible so long as the substitutions of the integration limits into κ(z₁, . . . , zₙ) do not break this linearity. This condition holds if each of the integration limits is itself linear in the remaining integration variables.
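The identities (10) and (11) are easy to check numerically. The sketch below verifies the sigmoid/polylogarithm correspondence, and the s = 0 case of the antiderivative against a midpoint-rule quadrature, using the elementary forms Li₀(z) = z/(1 − z) and Li₁(z) = −ln(1 − z); the constants are arbitrary:

```python
import math

def phi(kappa):                      # logistic sigmoid
    return 1.0 / (1.0 + math.exp(-kappa))

def li0(z):                          # polylogarithm of order 0
    return z / (1.0 - z)

def li1(z):                          # polylogarithm of order 1
    return -math.log(1.0 - z)

# (10): phi(kappa) == -Li_0(-e^kappa)
kappa = 0.8
lhs = phi(kappa)
rhs = -li0(-math.exp(kappa))

# (11) with s = 0: an antiderivative of phi(c*z + b) in z is
# -(1/c) * Li_1(-e^(c*z + b)); check against midpoint quadrature
c, b, lo, hi = 2.0, -0.3, 0.0, 1.0
closed = -(1.0 / c) * (li1(-math.exp(c * hi + b)) - li1(-math.exp(c * lo + b)))
n = 20000
h = (hi - lo) / n
numeric = sum(phi(c * (lo + (i + 0.5) * h) + b) for i in range(n)) * h
```

For higher orders s ≥ 2, no elementary form exists and a library evaluation (e.g. mpmath's `polylog`) would be used instead.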

Theorem 2 (Neural Network Integration): If a single-layer neural network of the form of f̂(x) is integrated over a closed generalized polyhedron S̄ in dimension n,
$$\hat{I}(\hat{f}) = \int_{\bar{S}} \hat{f}(\mathbf{x})\, d\mathbf{x}, \qquad \bar{S} = \{\, \mathbf{x} \in O : A_1\mathbf{x} \preceq b_1 \,\}, \qquad (13)$$
then a closed-form solution is possible in terms of polylogarithm functions.
Proof: We note that the network forward propagation function f̂(x) is a linear combination of the logistic sigmoid function and constant terms. Furthermore, the argument of the sigmoid function in (3) is linear in all the input variables. Therefore, it is possible to apply Lemma 2 so long as the integration limits are linear, which is enforced by the predicate A₁x ⪯ b₁ on S̄. As such, a closed-form solution to the n-th-order integral across the inputs of the network exists.
By Theorems 1 and 2, any continuous real function f can be approximated arbitrarily well by a single-layer neural network with a logistic sigmoid activation function, and such a network can be analytically integrated across any domain as in (13). It is also possible that Theorem 2 could be extended to include similar results for other commonly used activation functions. Such extensions would be an interesting area for future work in the field.
The statement of Theorem 2 is admittedly quite general and does not provide any guidance on how such a closedform solution may be found. We now provide three examples to illustrate NNI's application to some common cases.

Example 1 (1-Dimensional Integration):
Let I(f) be a 1-dimensional integral of the form
$$I(f) = \int_{\alpha_1}^{\beta_1} f(x)\, dx, \qquad (14)$$
where f is any continuous and real function, and α₁, β₁ are finite. Then, by Theorem 1, it is possible to train a neural network f̂(x) which approximates f arbitrarily well.
Through substitution with (10) and (11), the approximated integral Î(f̂) of the neural network function is
$$\hat{I}(\hat{f}) = \sum_{j=1}^{k} w^{(2)}_j \Lambda_j + b^{(2)} (\beta_1 - \alpha_1), \qquad (15)$$
with Λⱼ as the sigmoid integral term of the j-th neuron, defined by
$$\Lambda_j = -\frac{1}{w^{(1)}_{j1}} \left[ \operatorname{Li}_1\!\left(-e^{w^{(1)}_{j1}\beta_1 + b^{(1)}_j}\right) - \operatorname{Li}_1\!\left(-e^{w^{(1)}_{j1}\alpha_1 + b^{(1)}_j}\right) \right]. \qquad (16)$$

Example 2 (Rectangular Integration): Example 1 can be extended to the 2-dimensional case as
$$I(f) = \int_{\alpha_2}^{\beta_2}\! \int_{\alpha_1}^{\beta_1} f(x_1, x_2)\, dx_1\, dx_2. \qquad (17)$$
Applying Lemma 1 twice, the corresponding approximated integral consists of Li₂ terms evaluated at the four corners of the rectangle, with alternating signs.

Example 3 (Hyperrectangular Integration): Let f̂(x) be defined over a hyperrectangle [α₁, β₁] × · · · × [αₙ, βₙ] as in (1). Then, the definite integral of this function across all inputs can be calculated as
$$\hat{I}(\hat{f}) = \sum_{j=1}^{k} \frac{-w^{(2)}_j}{\prod_{i=1}^{n} w^{(1)}_{ji}} \sum_{r=1}^{2^n} \xi_r \operatorname{Li}_n\!\left(-e^{\sum_{i=1}^{n} w^{(1)}_{ji}\Psi_{i,r} + b^{(1)}_j}\right) + b^{(2)} \prod_{i=1}^{n} (\beta_i - \alpha_i), \qquad (18)$$
where ξᵣ is the sign of the r-th sigmoid integration term of the node, and Ψᵢ,ᵣ is the appropriate integration limit for the i-th dimension and r-th summation element. Indexing the 2ⁿ corners of the hyperrectangle by the binary digits cᵢ,ᵣ ∈ {0, 1} of r − 1, the quantities ξᵣ and Ψᵢ,ᵣ can be computed directly using the formulae
$$\xi_r = (-1)^{\,n - \sum_{i=1}^{n} c_{i,r}}, \qquad \Psi_{i,r} = \begin{cases} \alpha_i, & c_{i,r} = 0 \\ \beta_i, & c_{i,r} = 1 \end{cases}. \qquad (19)$$
The above results show that integration across the proposed neural network can be done analytically in terms of polylogarithm function evaluations. The polylogarithm function itself can be evaluated to arbitrary precision relatively quickly [15], and many efficient implementations exist, for example in the mpmath library [16] in Python. After training the network, subsequent integral evaluations of the function (over any S̄ ⊆ O) can be performed rapidly.
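A minimal sketch of Example 1 in Python, using the elementary form Li₁(z) = −ln(1 − z) so that no special-function library is needed. The network weights are hand-set for illustration, not trained, and the result is cross-checked by quadrature of the forward pass:

```python
import math

def li1(z):                           # polylogarithm of order 1
    return -math.log(1.0 - z)

def phi(z):                           # logistic sigmoid
    return 1.0 / (1.0 + math.exp(-z))

def nni_integral_1d(W1, b1, W2, b2, a, b):
    """Closed-form integral of a 1-D shallow sigmoid network over [a, b]:
    a sum of per-node sigmoid integral terms plus the bias contribution."""
    total = b2 * (b - a)
    for j in range(len(W1)):
        w, bias = W1[j], b1[j]
        lam = -(1.0 / w) * (li1(-math.exp(w * b + bias))
                            - li1(-math.exp(w * a + bias)))
        total += W2[j] * lam
    return total

# small hand-set network (weights are illustrative, not trained)
W1, b1 = [1.5, -2.0, 0.5], [0.0, 1.0, -0.5]
W2, b2 = [0.8, 1.2, -0.4], 0.1
a, b = 0.0, 2.0
closed = nni_integral_1d(W1, b1, W2, b2, a, b)

# numeric check: midpoint rule applied to the network forward pass
n = 20000
h = (b - a) / n
fwd = lambda x: b2 + sum(W2[j] * phi(W1[j] * x + b1[j]) for j in range(3))
numeric = sum(fwd(a + (i + 0.5) * h) for i in range(n)) * h
```

Once the network is trained, only the closed-form expression needs to be evaluated for each new set of bounds.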
The following corollaries extend Theorem 2 to two additional cases: the case when the integration space is of lower dimensionality than the function space, as in (2), and the case when, to simplify the integration, a change of variables is required.
Corollary 2 (NNI Over a Lower-Dimensional Polyhedron): If f̂(x) is integrated over a closed d-dimensional generalized polyhedron S defined as in (2), with d < n, then a closed-form solution to this integration is possible.
Proof: If S is a d-dimensional polyhedron defined in n-dimensional space, then it exists in the hyperplane defined by n − d affine constraints, as enforced by the predicate of (2). From these linear equalities, it is possible to choose d free dimensions to integrate along and define n − d dependent dimensions which are fixed by a linear function, writing $\mathbf{x} = [\mathbf{x}_f^T,\; \mathbf{h}(\mathbf{x}_f)^T]^T$, where $\mathbf{x}_f$ is a vector of the chosen free variables in x. Note that while the dependent variables h(x_f) are shown above as the last elements of x, this ordering is arbitrary and any positions can be taken to be dependent. For a linear set of constraints, a possible formulation of h is
$$\mathbf{h}(\mathbf{x}_f) = A_{2c}^{\dagger}\left(b_2 - A_{2f}\, \mathbf{x}_f\right), \qquad (20)$$
where A₂f and A₂c denote matrices formed by taking the columns of A₂ corresponding to the free and constrained variables in x, respectively, and the † superscript represents the Moore-Penrose pseudoinverse.
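The free/constrained split of (20) can be sketched with NumPy. The single equality constraint below is a made-up example in ℝ³, with x₃ chosen as the dependent variable:

```python
import numpy as np

# Corollary 2 setup: split x into free and constrained parts under
# A2 @ x = b2, with the dependent block recovered as in (20):
#   h(x_f) = pinv(A2c) @ (b2 - A2f @ x_f)
# The constraint below is a made-up example (one hyperplane in R^3).
A2 = np.array([[1.0, 1.0, 2.0]])
b2 = np.array([4.0])
free_idx, con_idx = [0, 1], [2]       # integrate along x1, x2; x3 dependent

A2f, A2c = A2[:, free_idx], A2[:, con_idx]
h = lambda xf: np.linalg.pinv(A2c) @ (b2 - A2f @ xf)

xf = np.array([0.5, 1.0])
x = np.empty(3)
x[free_idx] = xf
x[con_idx] = h(xf)                    # x3 = (4 - 1.5) / 2 = 1.25
```

Any point constructed this way satisfies the equality constraints exactly, so the integration can proceed over the free variables alone.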
Because the equality constraints are linear, h must also be linear. To perform the surface integration over these constraints, the integral Î(f̂) takes the form of a surface integral over the free variables [17],
$$\hat{I}(\hat{f}) = K \int_{S_f} \hat{f}\!\left(\mathbf{x}_f, \mathbf{h}(\mathbf{x}_f)\right) d\mathbf{x}_f, \qquad (21)$$
where S_f is the projection of S onto the free variables and K is a scale factor determined by the partial derivatives ∂h/∂x_f. Since the substitutions into f̂ are linear, the linearity of the arguments in the integrand is not broken. Furthermore, because the constraints defined by h are affine, the partial derivative terms are constants; thus, the norm of the vector they define is also a constant. The integration of f̂(x) is then simply scaled by a real number K, and is still in the same form as required for Theorem 2.
Corollary 3 (NNI With Change of Variables): Let f̂(x) be integrated over a closed d-dimensional generalized polyhedron S in ℝⁿ. If g : S′ → S, with S′ ⊂ ℝᵈ, is an affine, continuous and one-to-one function, then an analytical formulation of this integral is possible after a change of variables defined by g.
Proof: If Î(f̂) is integrated using an affine change of coordinates g : S′ → S, such that x = g(y), then the integration in Î(f̂) becomes
$$\hat{I}(\hat{f}) = \int_{S'} \hat{f}\!\left(g(\mathbf{y})\right) \left|\det(J_g)\right| d\mathbf{y}, \qquad (22)$$
where det(J_g) is the determinant of the Jacobian of the transformation, defined as
$$J_g = \frac{\partial g(\mathbf{y})}{\partial \mathbf{y}}. \qquad (23)$$
Theorem 2 requires the argument to the sigmoid function in (3) to be linear in the integration variable. If g is affine, then the mapping will not break the linearity of this argument. Additionally, the linearity of g implies that det(J_g) is a constant, and that if S satisfies (2), then S′ will too. Finally, since g is one-to-one and of class C¹, the Jacobian exists and is non-singular. It follows that det(J_g) is defined and constant. The integration is then still in the same form as required for Theorem 2.
These corollaries show that NNI can be applied to more complex domains than the hyperrectangles of Example 3. To illustrate how they may be used, we provide a further example in which Corollaries 2 and 3 are both necessary.
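The |det(J_g)| factor in the change of variables can be checked numerically for a simple diagonal affine map, where the image of the unit square is a rectangle with a closed-form Gaussian integral. The map and integrand here are illustrative, not taken from the paper:

```python
import math
import numpy as np

# x = g(y) = M y + t with diagonal M, so the image of the unit square
# [0,1]^2 is the rectangle [1, 3] x [-1, 0.5]. f is a separable
# Gaussian-like integrand with an exact value via erf.
M = np.diag([2.0, 1.5])
t = np.array([1.0, -1.0])
f = lambda x1, x2: np.exp(-0.3 * x1**2) * np.exp(-0.2 * x2**2)

# change-of-variables side: |det M| * integral over [0,1]^2 (midpoint grid)
n = 400
u = (np.arange(n) + 0.5) / n
Y1, Y2 = np.meshgrid(u, u, indexing="ij")
cov = abs(np.linalg.det(M)) * f(M[0, 0] * Y1 + t[0], M[1, 1] * Y2 + t[1]).mean()

# exact integral over the image rectangle, via the error function
def gauss_int(a, lo, hi):            # integral of exp(-a x^2) over [lo, hi]
    s = math.sqrt(a)
    return math.sqrt(math.pi / a) / 2.0 * (math.erf(s * hi) - math.erf(s * lo))

exact = gauss_int(0.3, 1.0, 3.0) * gauss_int(0.2, -1.0, 0.5)
```

Because the map is affine, the Jacobian determinant (here 2.0 × 1.5 = 3) is constant, exactly as the corollary requires.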

Example 4 (d-Dimensional Rotated Hyperrectangle):
Consider the case where S is a d-dimensional hyperrectangle in ℝⁿ, defined in the hyperplane A₂x = b₂, and constrained by a set of d upper and lower integration bounds which have been rotated by an arbitrary orthonormal rotation matrix R ∈ ℝᵈˣᵈ. For this example, we define the integration limits as acting on the first d dimensions of O. In other words, we seek to integrate over the hyperrectangle which lies in the hyperplane A₂x = b₂ and corresponds to the cubic region defined by [α₁, β₁] × [α₂, β₂] × · · · × [α_d, β_d] after a rotation R. A visualization of this domain with n = 3 is given in Figure 2, where O = ℝ³, S ⊂ ℝ² is defined on the plane given by a single linear equality constraint A₂x = b₂, and R is a counterclockwise rotation about the z-axis.
Define the free variables x_f = [x₁, . . . , x_d]ᵀ to be the first d elements of x, and let A₂f, A₂c be defined as in (20). Then we can write
$$\mathbf{x} = \begin{bmatrix} \mathbf{x}_f \\ C\mathbf{x}_f + \mathbf{d} \end{bmatrix}, \qquad C = -A_{2c}^{\dagger} A_{2f}, \qquad \mathbf{d} = A_{2c}^{\dagger} b_2. \qquad (24)$$
With α = [α₁, . . . , α_d]ᵀ and β = [β₁, . . . , β_d]ᵀ, S can then be written as
$$S = \{\, \mathbf{x} \in O : \boldsymbol{\alpha} \preceq R\mathbf{x}_f \preceq \boldsymbol{\beta},\; A_2\mathbf{x} = b_2 \,\}. \qquad (25)$$
The integral I(f) is then rewritten using (21) from Corollary 2 as
$$I(f) = K \int_{S_f} f\!\left(\mathbf{x}_f,\, C\mathbf{x}_f + \mathbf{d}\right) d\mathbf{x}_f. \qquad (26)$$
Finally, the integration bounds can be explicitly written by a change of variables y_f = Rx_f using (22) from Corollary 3 as
$$I(f) = K \int_{\alpha_d}^{\beta_d} \cdots \int_{\alpha_1}^{\beta_1} f\!\left(R^T\mathbf{y}_f,\, CR^T\mathbf{y}_f + \mathbf{d}\right) \left|\det(R^T)\right| dy_1 \cdots dy_d, \qquad (27)$$
where |det(Rᵀ)| = 1 since R is orthonormal. Since I(f) in (27) is of the same form as in Example 3 with the substitution x = [(Rᵀy_f)ᵀ, (CRᵀy_f + d)ᵀ]ᵀ, the integral Î(f̂) can be computed directly from (18) by replacing αᵢ and βᵢ with the elements of the vectors α and β respectively, then finally scaling the result by the constant K.

A. IMPLEMENTATION NOTES
We will now discuss the implementational concerns of how NNI can be practically used. The main difficulties arise in the sizing and training of the neural network approximation itself. These points are addressed briefly in the following sections.

1) FUNCTION SAMPLING
How the function is sampled affects the performance of NNI, as with any point-evaluation-based numerical integration routine. Areas of the function with high variance cannot be well approximated if they are not appropriately sampled. The current study uses standard uniform grid sampling of the function for all experiments. However, it is proposed that future work on NNI should investigate variance reduction techniques and adaptive sampling methods as a means of potentially increasing its accuracy.

2) NETWORK SIZING
Network sizing is a recurring question in neural network design, and correct sizing is crucial to allow the network to learn and generalize properly without overfitting the data.
There is no exact answer to the network sizing question [18], and though many heuristics have been proposed, none are universally accepted. The ideal number of nodes k depends on the complexity of the function being evaluated and the number of function samples available. If the network is sized too large but trained with very few points, the resulting network will likely have high variance and overfit the data, leading to poor interpolation between the sampled points and correspondingly poor integration results. Conversely, if the network size is too small, the network will be unable to capture the complexity of the function and will also give poor results. Analytical results do exist; however, they are of little practical help. With the logistic sigmoid activation functions in the proposed network, the number of nodes required for negligible error on a training set of N samples can be as high as N [19]. Baum and Haussler proposed that the number of weights should be around ten times smaller than the number of training samples [20]. For the proposed network structure, which has approximately k(n + 2) parameters, this guideline would suggest k = 0.1N/(n + 2). However, in testing, this formula was found to give poor interpolatory performance in NNI, as k was often too high, particularly with large N. In the current study, the heuristic (30) is proposed for determining the network hidden layer size, where K₁ and K₂ are tunable parameters. This formula modifies the guideline of Baum and Haussler by adding a decaying logarithmic term that prevents k from growing too large at larger values of N. This modified form was seen to give good general performance; however, we offer no analytical justification for its use.
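As a sketch, the unmodified Baum-Haussler baseline for this architecture can be written as follows. The parameter count k(n + 2) is an approximation, the function name is ours, and the paper's final heuristic (30) additionally applies the logarithmic damping term described above, which is not reproduced here:

```python
def baum_haussler_k(N, n):
    """Baseline hidden-layer size from the guideline that the number of
    weights should be about one tenth of the training sample count.
    The proposed architecture has roughly k*(n + 2) parameters
    (k*n first-layer weights + k first-layer biases + k output weights,
    plus one output bias), so k ~ 0.1 * N / (n + 2)."""
    return max(1, int(0.1 * N / (n + 2)))

k = baum_haussler_k(N=10000, n=6)   # -> 125
```

As the text notes, this baseline tends to overestimate k at large N, which motivates the damped form.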

3) NETWORK TRAINING
Choosing the weights and biases such that the approximation is sufficiently accurate over O from sample points is a supervised learning problem. Network training is a subject that has been covered extensively in the literature [10], [21]-[24]. The training of such a network is generally done using gradient-descent methods or variations thereof. In the current study, network evaluation is performed using the mean-squared error metric, and network training is performed using the Levenberg-Marquardt backpropagation algorithm [23]. This algorithm is found to be one of the fastest training methods for small to medium-sized networks, and generally achieves the best convergence for a given training period [25], [26].

4) NETWORK NORMALIZATION
One common form of preprocessing that was found to be necessary for NNI was a linear scaling of the input and output variables into the same range as the output of the activation functions. This mapping helps reduce the training time of gradient-descent methods, and also helps to keep the range of the activation function inputs near the origin. This scaling is particularly important in NNI since, during integration, inputs to the function become the arguments of the polylogarithm functions. In general, the polylogarithm function is not bounded, and evaluating it far from the origin can introduce numerical imprecision into the results. The strategy used in this paper is a simple linear min-max mapping of the inputs and output of the function into the range [y_min, y_max] using
$$x'_i = \frac{(y_{max} - y_{min})(x_i - x_{i,min})}{x_{i,max} - x_{i,min}} + y_{min}, \qquad f' = \frac{(y_{max} - y_{min})(f - f_{min})}{f_{max} - f_{min}} + y_{min}, \qquad (31)$$
where x_{i,min} and x_{i,max} are the minimum and maximum, respectively, of the i-th input x_i, and f_min and f_max are the minimum and maximum of the sampled function data. The network is then trained to fit the function which maps the scaled inputs to the scaled output. When the normalized network is integrated, some additional details must be taken into account. First, the integration limits must also be scaled using the same factors as the inputs in (31). Then, a scaled integral Î′(f̂) can be computed from (18). The true integral is obtained by rescaling Î′(f̂), as
$$\hat{I}(f) = \frac{V(S)}{V(S')} \left[ \frac{f_{max} - f_{min}}{y_{max} - y_{min}} \left( \hat{I}'(\hat{f}) - y_{min}\, V(S') \right) + f_{min}\, V(S') \right], \qquad (32)$$
where V(S) and V(S′) are the original and scaled hypervolumes of the integration, respectively. In the case of a hyperrectangle integration as in Example 3, these volumes are computed as
$$V(S) = \prod_{i=1}^{n} (\beta_i - \alpha_i), \qquad V(S') = \prod_{i=1}^{n} (\beta'_i - \alpha'_i). \qquad (33)$$
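A 1-D sketch of the scaling in (31) and the rescaling in (32), checked on a toy integrand. All names and values are illustrative, and the scaled integral is computed by midpoint quadrature here rather than through the closed form (18):

```python
# 1-D check of min-max scaling (31) and integral rescaling (32)
# on f(x) = x^2 over [0, 2], whose exact integral is 8/3.
y_min, y_max = -1.0, 1.0
x_min, x_max = 0.0, 2.0
f = lambda x: x * x
f_min, f_max = 0.0, 4.0                  # range of f on [x_min, x_max]

def to_scaled(x):                        # input mapping of (31)
    return (y_max - y_min) * (x - x_min) / (x_max - x_min) + y_min

def from_scaled(xp):                     # inverse input mapping
    return (xp - y_min) * (x_max - x_min) / (y_max - y_min) + x_min

def f_scaled(xp):                        # output mapping of (31)
    return (y_max - y_min) * (f(from_scaled(xp)) - f_min) / (f_max - f_min) + y_min

# integrate the scaled function over the scaled bounds
a_s, b_s = to_scaled(0.0), to_scaled(2.0)          # -> -1, 1
n = 20000
h = (b_s - a_s) / n
I_scaled = sum(f_scaled(a_s + (i + 0.5) * h) for i in range(n)) * h

# rescale back per (32): undo the output scaling, then the volume scaling
V, V_s = x_max - x_min, b_s - a_s
I = (V / V_s) * ((f_max - f_min) / (y_max - y_min)
                 * (I_scaled - y_min * V_s) + f_min * V_s)
```

The recovered value matches the exact integral 8/3, confirming that (32) correctly inverts the normalization.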

III. TEST RESULTS
This section demonstrates the feasibility of NNI using experimental validation on test integrals. We stress that these tests are done to show the feasibility, not the superiority, of NNI in terms of accuracy. The advantages of NNI lie not in absolute numerical accuracy, but rather in its ability to produce results with an acceptable level of accuracy, reduced execution time, and reduced memory. Testing is done using the numerical integration testing package by Genz [27], [28], as further refined by Joe and Sloan [2]. The package consists of six families of integrands, f⁽¹⁾, . . . , f⁽⁶⁾, defined over the unit cube [0, 1]ⁿ:
Oscillatory: $f^{(1)}(\mathbf{x}) = \cos\!\left(2\pi u_1 + \textstyle\sum_{i=1}^{n} c_i x_i\right)$ (36a)
Product Peak: $f^{(2)}(\mathbf{x}) = \prod_{i=1}^{n} \left(c_i^{-2} + (x_i - u_i)^2\right)^{-1}$ (36b)
Corner Peak: $f^{(3)}(\mathbf{x}) = \left(1 + \textstyle\sum_{i=1}^{n} c_i x_i\right)^{-(n+1)}$ (36c)
Gaussian: $f^{(4)}(\mathbf{x}) = \exp\!\left(-\textstyle\sum_{i=1}^{n} c_i^2 (x_i - u_i)^2\right)$ (36d)
Continuous: $f^{(5)}(\mathbf{x}) = \exp\!\left(-\textstyle\sum_{i=1}^{n} c_i |x_i - u_i|\right)$ (36e)
Discontinuous: $f^{(6)}(\mathbf{x}) = \begin{cases} 0, & x_1 > u_1 \text{ or } x_2 > u_2 \\ \exp\!\left(\textstyle\sum_{i=1}^{n} c_i x_i\right), & \text{otherwise} \end{cases}$ (36f)
Each function is parameterized by two n-vectors, u and c. The variable u is a shift parameter with u_i randomly distributed in [0, 1], and does not significantly affect the difficulty of integration. The variable c controls the difficulty of integration and is determined for each integrand family f⁽ʲ⁾ as
$$\mathbf{c} = \frac{h_j}{n^{e_j}} \cdot \frac{\tilde{\mathbf{c}}}{\sum_{i=1}^{n} \tilde{c}_i}, \qquad (37)$$
where c̃ is an n-vector with its components randomly distributed in [0, 1], and the values h_j, e_j are fixed for each integration family and dimension [2], [27], [28].
We use this testing package to quantify the integration accuracy and evaluation time of NNI, and provide three algorithms for comparison. Testing of NNI is performed using a uniform sampling distribution. The number of hidden nodes k is selected as per (30), with K₁ = 4.33 and K₂ = 16. Network training is done using 90% of the available points, and the remaining 10% of points are used for network validation. In cases where the validation MSE of the network is more than two orders of magnitude larger than the training MSE, the network is considered overfit and is re-trained. Network inputs and outputs are mapped into the range [−1, 1] using the linear min-max mapping of Section II-A.4. Network training is performed for 5000 epochs.
Once training is completed, the estimated integral is evaluated directly using (18).
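The oscillatory family and the difficulty scaling of c can be sketched as follows. The normalization making the components of c sum to h_j / n^{e_j} follows the common Genz/Joe-Sloan convention and should be treated as an assumption here; the u, h, and e values are illustrative:

```python
import math, random

def genz_oscillatory(x, u, c):
    """Genz 'oscillatory' family f(1): cos(2*pi*u_1 + sum_i c_i * x_i)."""
    return math.cos(2.0 * math.pi * u[0]
                    + sum(ci * xi for ci, xi in zip(c, x)))

def scale_c(c_raw, h, e, n):
    """Scale a random c so that sum(c) = h / n**e, the usual
    Genz/Joe-Sloan difficulty normalization (assumed here)."""
    s = sum(c_raw)
    return [ci * h / (n ** e * s) for ci in c_raw]

random.seed(1)
n = 2
u = [random.random() for _ in range(n)]
c = scale_c([random.random() for _ in range(n)], h=1.5, e=2.0, n=n)

# crude Monte Carlo estimate of the integral over [0, 1]^2
N = 20000
est = sum(genz_oscillatory([random.random() for _ in range(n)], u, c)
          for _ in range(N)) / N
```

Larger h (or smaller e) makes c larger and the integrand harder; the same scaling pattern applies to the other five families.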
Two non-adaptive sampling algorithms are provided for comparison: TRAPZ and MC. TRAPZ is an n-dimensional product-rule implementation of the standard trapezoidal rule (linear interpolation). MC is a Monte Carlo integration routine, based on random point sampling. Additionally, we provide an adaptive sampling algorithm, ADAPT, for comparison. ADAPT is an adaptive quadrature method developed by Van Dooren and De Ridder [6] with modifications by Genz and Malik [5]. However, we emphasize that ADAPT has a significant advantage over the uniform sampling methods, NNI included, as it is able to gather more information from areas of high variance in the function. We provide it as a comparison, and note that NNI and the other uniform sampling methods could be similarly extended with adaptive sampling, with correspondingly improved performance to be expected.

A. INTEGRATION ACCURACY
Testing was performed for each integrand family (36a-f), in dimensions 2 and 6, for a logarithmically increasing allowable number of integrand calls N. The results are shown for each integrand family in Figure 3. In each plot, the line type denotes the integration method used, and the marker type denotes the dimension of integration. We take the number $-\log_{10} \left| \left(\hat{I}(\hat{f}) - I(f)\right) / I(f) \right|$ to be the measure of performance, which can be interpreted as the number of correct digits of the integral estimate. For each point, a sample of 20 evaluations was taken, and the set median is shown in the figure. The integrand parameters e_j and h_j used are listed in Table 1. These parameters were selected similarly to Joe and Sloan [2], with the goal of keeping the integration difficulty of each family approximately equal.
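The correct-digits measure is a one-liner; the sketch below is only for clarity about how the plots should be read:

```python
import math

def correct_digits(estimate, exact):
    """Accuracy measure used in the tests: -log10 of the relative error,
    read as the number of correct significant digits in the estimate."""
    rel = abs((estimate - exact) / exact)
    return -math.log10(rel) if rel > 0 else float("inf")

d = correct_digits(3.14159, math.pi)   # about 6 correct digits
```

A value of 6 on the vertical axis of Figure 3 therefore corresponds to roughly six correct significant digits.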
Note that both TRAPZ and ADAPT require a specific structure to their sampling points; as such, the number of integrand calls actually made by these methods was sometimes lower than N. For these algorithms, N represents the maximum permissible number of integrand calls.
The testing shows promising results. NNI, in general, had good accuracy among the non-adaptive algorithms, and performed best on functions with higher regularity and in the tests at higher dimensions. The ADAPT algorithm, in most cases, had the best overall performance. In particular, for the function families with large local variability, the adaptive algorithm is expected to perform much better than uniform sampling methods, as it can gain more information in the areas where the function changes the most. This advantage is particularly visible with the integrand families which have large local complexity, such as the f(3): corner peak and f(4): Gaussian families, or non-regularities, such as the f(5): continuous and f(6): discontinuous families. The non-adaptive algorithms do not have sufficient information in the high-variance regions to interpolate accurately.
Conversely, in the more predictable families such as f(1): oscillatory and f(2): product peak, we see that NNI performs very well, even when compared to ADAPT. This result is logical, since these functions have a more constant regularity, which means that adaptive sampling provides little benefit. In particular, the oscillatory family offers almost no benefit to adaptive routines, since its variance is constant everywhere. In some cases for this family, we see that NNI in fact outperforms the adaptive routine. These results lead us to believe that extensions to the current study, such as integrating NNI with a variance-based adaptive sampling algorithm, or intelligently weighting the error terms of the neural network during training, could lead to higher accuracy on integrand functions with discontinuities, singularities, or otherwise varying regularity.
It is also noted that the performance of TRAPZ and ADAPT (relative to the other methods) is significantly better for n = 2 than for n = 6. ADAPT still performs relatively well at n = 6, although it generally gave results approximately equal to, or even worse than, the uniformly-sampled NNI. TRAPZ at this dimension performed very poorly, and gave less accurate results than even the non-interpolatory MC algorithm. This trend of deteriorating performance with increasing dimension is expected of product-based methods and should continue as dimensionality increases. Such methods become completely intractable as the dimension n becomes even moderately large (n = 10-20) [2]. MC, being a Monte Carlo method, is entirely dimension-independent, and its performance depends only on the difficulty of the function being integrated [1]. NNI appears to also be somewhat agnostic to dimension, performing relatively as well as MC in different dimensionalities. This behavior is expected of NNI, as neural networks, in general, are known to extend well into large dimensionalities [10]. The ability of NNI to interpolate well in high dimensions with a small number of sample points is a significant strength when compared to the product methods tested.
(Notes to Figure 3: * NNI points shown with solid marker fill use a smaller sample size of 5, due to the longer training times at large N and the computation restrictions of the study. † Plots are similarly scaled to allow for comparison; however, some data is clipped as a result. ‡ The discontinuous family is particularly difficult to integrate, resulting in somewhat stochastic results at lower N. Larger sample sizes would be required to eliminate this behavior, but were not possible due to computational restrictions of the current study.)
We also point out that NNI can operate under a broader set of conditions than other quadrature algorithms, as it does not require the function sample points to fit any specific pattern. This flexibility contrasts with the ADAPT and TRAPZ algorithms, as well as many other quadrature algorithms, which require integrand evaluations at pre-specified points in the integration region. This characteristic suggests that NNI can be particularly suitable when the integrand function can only be sampled sparsely and at arbitrary points (such as with data collected experimentally). However, the performance of NNI in these cases would depend on proper network sizing and bias-variance balancing; thus, we do not generalize further on this point.
The discontinuous family of functions is of particular difficulty to all methods. We note that this family is not of class C^0 and, as such, does not fall into the allowable form for Theorem 1, so NNI is not guaranteed to give good results on such functions. ADAPT also has difficulties with this family, as evinced by its somewhat stochastic performance at larger N.
Finally, the reader may note a general decrease in NNI performance at higher point levels in some cases. Unfortunately, this trend is an artifact of the computational limits of our study: larger point sets used a larger network size, which did not train sufficiently within the allocated 5000 epochs. Indeed, especially for larger N, we noted a strong correlation between training time and integration performance. In practice, NNI training time would not need to be limited in the same way (see Section IV), and training could be done over long periods on specialized hardware.

B. INTEGRATION TIME
Additional testing was done to characterize the integration time of the compared methods. For this test, we use only the oscillatory integrand family f^(1). Ten samples were used for each algorithm (NNI, TRAPZ, MC, and ADAPT), for dimensions 2 and 6, for increasing values of N. In this test, the network size used in NNI is kept constant at k = 100. Testing was done using interpreted MATLAB code on a Linux system with 128 GB RAM and a 1.4 GHz CPU. No parallel computing was leveraged. For the four algorithms and two dimensions, Figure 4 plots the mean integration time as a function of maximum integral evaluations. Here, due to its pre-computation, the NNI algorithm has a consistent integration time, regardless of the number of function evaluations. Conversely, the integration time of the other algorithms increases as the maximum number of evaluations increases.
These results are as expected, but highlight the time complexity of the algorithms. TRAPZ, MC, and ADAPT scale approximately as O(N), whereas the evaluation of NNI is completely decoupled from the original number of evaluation points. Instead, the time complexity of NNI is related to the number of nodes k and the dimensionality d as O(2^d k), which can be deduced from (18). The point at which NNI becomes more efficient than the other methods depends on the cost of a single integrand evaluation, C_f, relative to that of a single polylogarithm evaluation, C_Li: NNI is preferable roughly when N C_f exceeds 2^d k C_Li. To make TRAPZ, MC, or ADAPT more viable, it is, of course, possible to reduce the computational cost C_f through a similar pre-computation of the values of the function over O, e.g., a lookup table. However, this approach requires large amounts of memory, which precludes embedded-hardware applications. Additionally, if the goal is to integrate f(x) repeatedly over varying S ⊆ O, it presents problems for algorithms that require integrand evaluations at specific points, such as quadrature methods: function evaluations at those points would then require n-dimensional interpolation, which reduces the integration accuracy and further increases the complexity.
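This trade-off can be sketched as a simple cost model (all constants here are hypothetical and purely illustrative; the real break-even point depends on the actual costs of the integrand and polylogarithm evaluations):

```python
import math

# Illustrative cost model for the trade-off described above: direct methods
# (TRAPZ, MC, ADAPT) cost roughly N * C_f, while evaluating the NNI closed
# form costs roughly 2^d * k * C_Li, independent of N.
def direct_cost(n_evals, c_f):
    return n_evals * c_f

def nni_cost(d, k, c_li):
    return (2 ** d) * k * c_li

def breakeven_n(d, k, c_f, c_li):
    """Smallest N at which NNI's fixed evaluation cost undercuts direct evaluation."""
    return math.ceil(nni_cost(d, k, c_li) / c_f)

# e.g. d = 6 dimensions, k = 100 nodes, integrand 10x costlier than a polylog:
print(breakeven_n(6, 100, c_f=10.0, c_li=1.0))  # -> 640
```

Beyond this break-even N, each additional integral re-evaluation favors NNI further, since its cost does not grow with N.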
So, if C f is high, and the integration must nonetheless be evaluated quickly, or on basic hardware, NNI provides a more attractive alternative.
Finally, we note that for this testing, the polylogarithm functions were evaluated to double floating-point accuracy (a 53-bit significand, or approximately 16 digits), and no parallel processing was leveraged. Additional or fewer digits can be computed as needed, with a corresponding increase or decrease in C_Li, the computational cost of a polylogarithm evaluation. So, if less accuracy is required (for example, if N is low), the accuracy of these evaluations could be reduced, which would decrease the evaluation time of NNI. Furthermore, the network structure of the NNI integration formulae is also in a convenient form for parallel processing. Leveraging this structure would allow for decreased processing time on multi-core systems, GPUs, FPGAs, or other specialized hardware.

IV. APPLICATIONS
For most numerical integration problems, NNI may not be the correct algorithm. The computational effort required to train a neural network, in most cases, outweighs the cost of standard algorithms. However, NNI is ideal in problems where its pre-computability can be leveraged, e.g., when an integrand function must be repeatedly integrated across different spaces, or when these integral evaluations must be done very quickly and with low memory usage. Thus, we propose NNI as a suitable integration algorithm for real-time or embedded applications, optimization, and other repetitive integration tasks, especially when the time complexity of the integrand function itself, C_f, is high. In particular, the application of NNI to the optimization of integral functions is explored in the following section.

A. APPLICATIONS TO OPTIMIZATION OF INTEGRAL FUNCTIONS
The current section briefly illustrates some of the foundational aspects of how NNI can be applied to certain optimization problems. We show how the notation from Section II can be extended for optimization problems, and demonstrate how an NNI problem formulation can give access to the optimization gradient of the problem, allowing for the use of more efficient optimization algorithms. Finally, we illustrate the process with a simple example.
Consider the following constrained optimization problem, building on the notation of Section II, where the integration region S is now parameterized by an optimization vector θ, as

S(θ) = {x ∈ R^n : A_1(θ) x ≤ b_1(θ), A_2(θ) x ≥ b_2(θ)},   (41)

and the objective function I(f, θ) is defined as the definite integral of f over S(θ), as

I(f, θ) = ∫_{S(θ)} f(x) dx.   (42)

In other words, the goal is to find the θ which parameterizes a set S(θ) that maximizes the definite integral of f over its volume. If the nature of f(x) requires the integration in (42) to be performed numerically and the dimensionality of S(θ) is large, then the problem can become intractable using standard numerical integration techniques. However, this problem fits well into the scope of NNI, as it requires a large number of integral evaluations across various S(θ) ⊆ O. By Theorem 1, we can find an approximation f̂(x) of f and define the approximate objective

Î(f, θ) = ∫_{S(θ)} f̂(x) dx.   (43)

If f̂(x) is trained to approximate f within O, then each optimization iteration can be performed tractably using NNI, even if the dimensionality of S(θ) is large. Therefore, even in multi-objective optimization problems where a Pareto front is sought, the large number of integration calls required does not make the problem significantly less tractable. Additionally, standard numerical integration methods do not lend themselves well to optimization, as their discrete nature does not give easy access to the optimization gradient. Without this gradient, optimization solvers are often less efficient and less robust [26]. With NNI, however, one has the added advantage that the optimization gradient is readily approximated from the network itself.
Corollary 4: For Î(f, θ) as defined in (43), a closed form of an approximate optimization gradient ∇_θ Î(f, θ) exists, so long as the gradients of the elements of A_1, b_1, A_2, and b_2 can also be determined or estimated.
Proof: The closed form of Î(f, θ), after expanding the integrations through Theorem 2 and Corollaries 2 and 3, is a sum of products of constants K_1, ..., K_3, functions κ_i(θ), λ_i(θ), and η(θ), and polylogarithm terms, where κ_i(θ), λ_i(θ), and η(θ) consist of linear combinations of the elements A_1ij, b_1i, A_2ij, and b_2i of the constraint matrices A_1, b_1, A_2, and b_2. If the gradients of these matrix elements with respect to θ, namely ∇_θ A_1ij, ∇_θ b_1i, ∇_θ A_2ij, and ∇_θ b_2i, can be analytically determined or easily estimated, then the gradients ∇_θ κ_i(θ), ∇_θ λ_i(θ), and ∇_θ η(θ) can be determined by the linearity of the gradient operator. Furthermore, the gradient of each polylogarithm term can be determined through the chain rule, using the derivative identity d/dz Li_s(z) = Li_{s-1}(z)/z, as

∇_θ Li_s(λ_i(θ)) = (Li_{s-1}(λ_i(θ)) / λ_i(θ)) ∇_θ λ_i(θ).

Thus, the entire expression consists of sums of products of functions whose gradients exist in closed form. Then, by the product rule, a closed form of the entire gradient expression exists.
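The polylogarithm derivative identity invoked in the proof can be verified numerically (a standalone sanity check, not part of the paper; Li_s is evaluated here via its defining series, valid for |z| < 1):

```python
import math

# Check d/dz Li_s(z) = Li_{s-1}(z) / z at s = 2, z = 0.5, using the
# defining series Li_s(z) = sum_{n>=1} z^n / n^s.
def polylog(s, z, terms=200):
    return sum(z ** n / n ** s for n in range(1, terms + 1))

z, h = 0.5, 1e-6
numeric = (polylog(2, z + h) - polylog(2, z - h)) / (2 * h)  # central difference
analytic = polylog(1, z) / z  # note Li_1(z) = -ln(1 - z)
print(abs(numeric - analytic) < 1e-6)  # -> True
```

The same identity lowers the polylogarithm order by one on each differentiation, which is what makes the gradient (and, below, the Hessian) remain in closed form.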
Since the gradient of the objective function can be estimated, gradient-based methods can be applied to the problem. Additionally, the logic of Corollary 4 applies equally well to the gradient ∇_θ Î(f, θ) as it does to the integral Î(f, θ) itself. Therefore, the optimization Hessian can also be estimated, if desired, giving easy access to second-order optimization methods.
Example 5 (Positioning of an n-Dimensional Hyperrectangle): Recall the integration of an n-dimensional hyperrectangle in Example 3. Suppose we have the optimization problem of selecting the coordinates of the center θ ∈ R^n of a hyperrectangle with fixed side lengths d ∈ R^n, such that I(f, θ) is maximized. Mathematically, the problem can be formulated as maximizing I(f, θ) over θ, with

A_1 = A_2 = I,   b_1(θ) = θ + d/2,   b_2(θ) = θ - d/2,

where I is the n × n identity matrix. The integrated form of Î(f, θ) is the same as in Example 3. Through multiple applications of the chain and product rules, the optimization gradient ∇_θ Î(f, θ) can be written in closed form in terms of the gradients of the integration limits. The integration limits α_i and β_i are defined by the i-th elements of (θ + d/2) and (θ - d/2), respectively. Since d is constant with respect to θ, the gradients of these limits are simply ∇_θ α_i = ∇_θ β_i = e_i, i = 1, ..., n, where e_i ∈ R^n is the unit vector aligned with the i-th axis of the coordinate system.
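A minimal numerical sketch of this hyperrectangle-positioning problem follows. As an assumption for illustration, the trained surrogate f̂ is replaced by the toy integrand f(x) = exp(-|x|^2), whose integral over a box has a closed form via the error function, so the example stays self-contained; the gradient-ascent loop mirrors how the closed-form NNI gradient would be used:

```python
import math

# Closed-form integral of exp(-|x|^2) over the box [theta - d/2, theta + d/2]
# (separable, so it is a product of 1-D erf terms).
def box_integral(theta, d):
    val = 1.0
    for t, di in zip(theta, d):
        a, b = t - di / 2, t + di / 2
        val *= math.sqrt(math.pi) / 2 * (math.erf(b) - math.erf(a))
    return val

# Gradient w.r.t. the box centre theta. By the Leibniz rule, each component is
# (integrand at upper limit - integrand at lower limit) times the remaining
# 1-D factors, matching grad_theta(alpha_i) = grad_theta(beta_i) = e_i above.
def grad(theta, d):
    g = []
    for i, (t, di) in enumerate(zip(theta, d)):
        a, b = t - di / 2, t + di / 2
        rest = 1.0
        for j, (tj, dj) in enumerate(zip(theta, d)):
            if j != i:
                aj, bj = tj - dj / 2, tj + dj / 2
                rest *= math.sqrt(math.pi) / 2 * (math.erf(bj) - math.erf(aj))
        g.append((math.exp(-b * b) - math.exp(-a * a)) * rest)
    return g

# Gradient ascent on theta; by symmetry the optimum is theta = 0.
theta, d = [0.8, -0.5], [1.0, 1.0]
for _ in range(200):
    theta = [t + 0.5 * gi for t, gi in zip(theta, grad(theta, d))]
print(theta)  # converges near [0, 0]
```

With an NNI surrogate, `box_integral` and `grad` would instead be the polylogarithm-based closed forms of Î(f, θ) and ∇_θ Î(f, θ), but the optimization loop is unchanged.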

V. CONCLUSION
We present a novel numerical integration technique, Neural Network Integration, or NNI. A shallow neural network design was used to approximate the integrand function within a bounded set. This network was designed in such a way as to allow for a closed-form solution to the definite integral across any generalized polyhedron in the domain of f̂. Thus, after training the neural network, the definite integral across varying spaces can be computed analytically, essentially allowing for a ''pre-computation'' of the integration problem. The proposed method was tested for accuracy and evaluation time using the Genz [27] function families and compared against the uniform-sampling numerical integration algorithms TRAPZ (trapezoidal integration) and MC (Monte Carlo integration), as well as the adaptive quadrature routine ADAPT [5]. NNI was shown to be a viable method of integration, giving accuracy comparable to the other tested uniform-sampling methods. Its accuracy lagged behind the adaptive routine ADAPT on functions with high variability over the integration domain; however, it gave good results on functions with constant variability. These results suggest that further research into adaptive sampling of the integrand function, or intelligent weighting of the sample points during network training, could further improve the accuracy of NNI integral estimates. Finally, NNI was also found to perform particularly well compared to the other algorithms in high dimensions, when the number of integrand evaluations was limited.
The integration time of each method was also compared, confirming that NNI results can be computed quickly, regardless of the number of sampling points. A comparison of the time complexity of NNI and standard numerical integration algorithms gave a guideline for determining whether NNI or a standard algorithm is best suited to a given application.
The findings show NNI to be a viable method for the rapid computation of numerical integrals in situations where the definite integral of a function across different bounds must be evaluated many times. It is proposed for real-time or embedded applications, particularly when the time complexity of the integrand function itself, C_f, is high. Furthermore, we explored the application of NNI to the optimization of integral functions, where it was shown that NNI has the benefit of allowing easy estimation of the optimization gradient. Thus, NNI is presented as an algorithm allowing for fast or real-time evaluation, and even optimization, of difficult, high-dimensional integrals.