An Interpretation of Long Short-Term Memory Recurrent Neural Network for Approximating Roots of Polynomials

This paper presents a flexible method for interpreting the Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) in terms of the relational structure between the roots and the coefficients of a polynomial. A database is first developed from randomly selected inputs based on the degree of a univariate polynomial, and is then used to approximate the polynomial roots through the proposed LSTM-RNN model. Furthermore, an adaptive learning optimization algorithm is used to update the network weights iteratively on the training data. The method thereby exploits adaptive learning rate strategies, which find an individual learning rate for each variable, to effectively prevent the weights from fluctuating over a wide range. Finally, several experiments are performed, which show that the proposed LSTM-RNN model can be used as an alternative approach for computing an approximation of each root of a given polynomial. The results are compared with a conventional feedforward neural network (FNN) based artificial neural network model and clearly demonstrate the superiority of the proposed LSTM-RNN model for root approximation in terms of accuracy, mean square error and convergence speed.


I. INTRODUCTION
Finding the roots (zeros) of a polynomial is a key problem in a variety of scientific and technical fields. There are several traditional iterative methods for determining the roots of a polynomial, e.g., Newton's method, the Bisection method, the Durand-Kerner (D-K) method and Laguerre's method [1], [2]. However, some common problems associated with these conventional methods are: (1) root leaping may occur, resulting in failure to reach the desired root, for instance near an inflection point; (2) even an initial estimate chosen close to the root may require many iterations, leading to slow convergence; and (3) the calculation of first- or second-order derivatives is computationally intensive and not always possible.
To overcome the limitations of the conventional approaches, a two-layer feedforward neural network (FNN) was proposed to factorize polynomials with two or more variables [3], [4]. It has been shown that NNs can adequately resolve the computational problems related to polynomial root searching. Nevertheless, the classical backpropagation algorithm (BPA) used in FNNs relies on gradient descent and shows slow convergence [5], [6]; thus, its use in NN computing is drastically limited. The BPA needs a long time to converge unless the initial network synapse weights corresponding to the roots are selected properly. In 1997, Perantonis et al. initiated the use of constrained learning (CL) methodology to train BP networks and factorize two-dimensional polynomials by incorporating prior information from the problem into the network's structure-learning parameters [7]. In 2001, Huang, building on the Σ–Π structured NN [8], [9], proposed an artificial neural network (ANN) that includes a priori information about the connections between the roots and the coefficients to discover the real and complex roots of a particular polynomial [10]. In [11], Freitas and colleagues proposed an FNN-based ANN technique for approximating the roots of a given polynomial. Although the suggested ANN approach does not surpass the accuracy of the traditional iterative methods, the results are encouraging and show the effectiveness of NN techniques for approximating polynomial roots (complex or real). Furthermore, it is well known that by using flexible parallel architectures in NNs, all roots may be obtained simultaneously and in parallel, especially when the computations are performed on a parallel-computing machine. Most nonlinear numerical algorithms, on the other hand, can only identify one root at a time, so increasing the number of processors will not speed up the process. As a result, in terms of speed, numerical methods are significantly slower than NN methods [10].
As the above survey shows, although the NN techniques slightly compromise accuracy compared with the traditional iterative methods, which need larger processing time, they have demonstrated their potential and can be an alternative way of finding the roots of polynomials with less effort. Therefore, this paper proposes a more advanced deep neural network (DNN) technique, namely the long short-term memory recurrent neural network (LSTM-RNN), for approximating the roots of a univariate polynomial. The purpose is to tackle the limitations of conventional NN techniques such as FNN-based models, which do not make use of the interpretation capacity of NNs for approximating the roots of a given polynomial. Two common drawbacks of FNN-based NNs are: (1) falling into local minima; and (2) slow convergence, which makes the fully connected FNN inefficient to train and prone to overfitting [12], [13]. To tackle such problems, the LSTM-RNN, first developed by Hochreiter and Schmidhuber in 1997 [14], has become a very active research topic over the past few years. The LSTM-RNN model has been successfully applied in areas such as neural computation and time series forecasting, filter design, unauthorized broadcasting identification, and quality of transmission estimation [15]-[20].
The LSTM-RNN based DNN technique can impose a confined relationship between the roots and the coefficients of a given polynomial, and the methodology can effectively train and validate the NN model. How the datasets of coefficients and roots of a given nth-order polynomial can be generated, and how the roots are then approximated and validated through the LSTM-RNN model, are the research focus of this paper. The LSTM-RNN structure consists of layers built from sets of recurrently connected blocks, known as memory blocks. Each block is a differentiable unit containing one or more recurrently connected memory cells and three multiplicative units: (1) the input gate; (2) the output gate; and (3) the forget gate. These gates provide the cells with continuous analogs of writing, reading and resetting operations. In addition, an error cost function is coupled with a constrained condition to alleviate weight fluctuations over a wide range [21]. Besides, momentum is added to the learning algorithm to speed up convergence [22].
The rest of the paper is organized as follows. Section 2 presents the fundamental concept of univariate polynomial roots, discusses the LSTM-RNN model including the error cost function, and describes the optimizer based on the adaptive moment estimation (ADAM) algorithm and parameter learning. Section 3 presents the numerical experimental results with discussions. Finally, Section 4 sets out several concluding remarks and directions for future research.

II. METHODOLOGY
A. nth-Order Arbitrary Polynomial
The main focus of this study is to compute approximations of the roots of an nth-degree univariate polynomial based on its coefficients a_i (i = 1, 2, ..., n). Hence, without loss of generality, four cases of n = 5, 10, 15 and 20 are considered in this study. A given nth-order polynomial f(z) can be described as [6]:

f(z) = z^n + a_1 z^{n-1} + \cdots + a_{n-1} z + a_n = \prod_{i=1}^{n} (z - w_i)     (1)

where w_i (i = 1, 2, ..., n) are the roots of f(z). The following section discusses the interpretation of the LSTM-RNN model for obtaining the approximate roots w_i of the nth-order polynomial f(z) = 0 with real coefficients.
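To illustrate the coefficient-to-root mapping that the network must learn, the following minimal numpy round trip expands a monic polynomial from its roots and recovers them again (the example roots are our own illustration, not values from the paper's dataset):

```python
import numpy as np

# Illustrative degree-5 example: expand (z - w_1)...(z - w_5) into
# monic coefficients, then recover the roots from the coefficients.
true_roots = np.array([-2.0, -0.5, 0.25, 1.0, 3.0])

coeffs = np.poly(true_roots)              # [1, a_1, ..., a_n]
recovered = np.sort(np.roots(coeffs).real)

assert np.allclose(recovered, np.sort(true_roots))
```

The network in this paper learns the inverse direction of `np.poly`: given the coefficients a_i, predict the w_i.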

B. LSTM-RNN Network

a: The fundamental concepts behind LSTM-RNN
Deep learning has received significant attention in many applications because it performs well in comparison with other NN techniques, in part because it can avoid the vanishing gradient problem in deep networks. The plain RNN, however, suffers from a related problem: it retains information only over a relatively short period, i.e., data from the recent past can be reproduced, but once many inputs have been weighted, earlier data get lost. A popular way of solving this issue is to use a particular kind of RNN, the LSTM-RNN [14].
The LSTM retains a significant gradient over many time steps, which means that the network can be trained on extended sequences. An LSTM unit is made up of four major components: a memory cell and three logistic gates. The memory cell is in charge of storing data, while the write, read and forget gates define the data flow inside the LSTM network. The write gate manages writing data into the memory cell, whereas the read gate controls reading data from the memory cell and returning it to the recurrent network. The forget gate decides whether to keep or erase data from the memory cell, or, in other words, how much old data to forget.
In short, these gates are the LSTM operations that perform some function on a linear combination of the network's inputs, hidden state, and prior output. In addition, LSTM is observed to be more efficient in sequence prediction than other deep learning NN. Hence, the objective of the proposed network is to interpret the roots which will factorize the polynomial into more sub-factors and then use these factors for concurrent execution in the hidden layer of the network.
The key element of the LSTM-RNN is the cell state; the state is similar to a conveyor belt with only a few insignificant linear interactions. It runs straight through the entire chain, making it easy for data to flow along it relatively unchanged and to influence the response across it. The LSTM-RNN can eliminate or add information to the cell state, strictly regulated by its gates, which provide an alternative way of maintaining the data.
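For reference, the gate and state updates described above can be written in the standard LSTM formulation [14]; the weight matrices W, U and biases b are generic notation (not taken from the paper), with u_t the input, h_{t-1} the previous output and c_t the cell state:

```latex
\begin{aligned}
f_t &= \sigma\!\left(W_f u_t + U_f h_{t-1} + b_f\right) && \text{(forget gate)} \\
i_t &= \sigma\!\left(W_i u_t + U_i h_{t-1} + b_i\right) && \text{(input/write gate)} \\
o_t &= \sigma\!\left(W_o u_t + U_o h_{t-1} + b_o\right) && \text{(output/read gate)} \\
\tilde{c}_t &= \tanh\!\left(W_c u_t + U_c h_{t-1} + b_c\right) && \text{(candidate state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state update)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(unit output)}
\end{aligned}
```

The multiplicative form of the cell-state update is what lets the gradient propagate over many time steps, as used in the BPTT derivation below.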

b: LSTM-RNN Model Structure
For an nth-order polynomial, a multi-layered LSTM-RNN containing hidden layers is designed to approximate the roots of the given polynomial. As with the RNN, the parameters of the LSTM-RNN model are fitted by backpropagation through time (BPTT). The hyperparameters of our network model include the number of layers and the learning rate, which are the most essential ones for training and evaluation. We design our model structure with multiple hidden layers: the first hidden layer is the LSTM-RNN layer with 200 neurons and the second layer is a fully connected dense layer with 100 nodes. The block diagram of the LSTM-RNN structure is shown in Fig. 1. Besides, the hyperbolic tangent sigmoid (tansig) is used as the activation function for the learning parameters [23]; it is also faster to compute and less prone to saturating (near-zero) gradients in the network.
The input layer is fed with a vector of the coefficients of a polynomial, and the output layer gives the predicted roots of the given input polynomial. A limitation of the proposed model is meeting computational efficiency requirements when dropout is applied between many layers; overfitting, however, is tackled by increasing the number of LSTM layers. Furthermore, supplementary hidden layers are added during the simulations while keeping certain learning parameters unchanged.
A test dataset can be used in a confirmatory way to verify that a given set of inputs to a given function produces the expected results. Training over multiple epochs passes completely through the datasets, and performance on the test datasets is evaluated at each epoch to determine when to stop. In general, stacking LSTM layers can improve the efficiency and estimation of the network model. Furthermore, an ADAM optimizer [24]-[26], based on RMSProp [27] and momentum as an update controller, is used to improve the learning process. Finally, to normalize the results with better prediction and performance, the mean square error (MSE) and mean absolute error (MAE) [28], [29] are used, with an initial learning rate of 0.001 and a decay rate of 1×10^-8.
In mathematical terminology, the output of the ith hidden neuron in the network is computed as

h_i(z) = z - w_i     (2)

where the w_i (i = 1, 2, ..., n), the network weights of the input-to-hidden layer, i.e., the roots of the polynomial, are to be predicted. The output of the LSTM-RNN, formed by multiplying the hidden-layer outputs, can be characterized as:

\hat{y}(z) = \prod_{i=1}^{n} (z - w_i)     (3)
The output \hat{y}(z) of the network is trained by external supervision, which allows data to be collected or produced from previous experience signals. In addition, if (3) is taken after an absolute-value operation, the following logarithmic transformation can be extracted:

\ln|\hat{y}(z)| = \sum_{i=1}^{n} \ln|z - w_i|     (4)

where z is the argument of the given polynomial.

c: LSTM-RNN Error Cost Function
The efficiency of the proposed network model is validated by a low error cost function (ECF) value, which signifies accurate prediction of the experimental data. The ECF is therefore used as a criterion in the training phase to measure the efficiency of the LSTM-RNN architecture and, further, to evaluate the network output for higher-degree polynomials. Thus, the ECF can be defined as:

E = \frac{1}{N} \sum_{t=1}^{N} \left(u_{t+1} - y_N\right)^2

where w_i (i = 1, 2, ..., n) is the set of all the neural network weights linking the coefficients to the polynomial roots, which corresponds to the network factorization model; y_N is the predicted output value, u_{t+1} is the actual target value, and N is the number of training patterns. In addition, the hidden neurons compute linear differences that are transformed by a logarithmic activation function in the nonlinear hidden neurons, while the output neuron computes a linear summation rather than a multiplication. This logarithmically activated design is comparable to the Σ–Π architecture described in [8], [9]. Therefore, we need to evaluate

y_N = F(u_{t-n+1}, \ldots, u_{t-1}, u_t) + \epsilon

where (u_1, u_2, u_3, ..., u_{t+i}) with i = 1, 2, ..., n is the series of actual vector information in the model and \epsilon is an irreducible error, so that the data from the previous n time steps are used to calculate the next predicted output vector y_N. The actual vectors form the input matrix of size [(t-n+1) × n] and the output vector of size [(t-n+1) × 1] given below:

\begin{bmatrix} u_1 & u_2 & \cdots & u_n \\ u_2 & u_3 & \cdots & u_{n+1} \\ \vdots & & & \vdots \\ u_{t-n+1} & u_{t-n+2} & \cdots & u_t \end{bmatrix}, \qquad \begin{bmatrix} u_{n+1} \\ u_{n+2} \\ \vdots \\ u_{t+1} \end{bmatrix}     (9)

Moreover, without loss of generality, the ECF objective function L(u_t, θ) can be defined as:

L(u_t, \theta) = \frac{1}{N} \sum_{t=1}^{N} \left(u_{t+1} - F(u_t; \theta)\right)^2

where θ collects all the weights across the input and hidden layers, with biases ignored.
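The assembly of this [(t-n+1) × n] input matrix and [(t-n+1) × 1] output vector can be sketched as follows (a minimal illustration; the function name and the use of numpy are our own, following the indexing above):

```python
import numpy as np

def make_windows(u, n):
    """Build the [(t-n+1) x n] input matrix and [(t-n+1) x 1] output vector
    from a series u_1, ..., u_{t+1}: each row holds n consecutive values,
    and the target is the value that follows them."""
    t = len(u) - 1                                       # the series has t+1 entries
    X = np.array([u[k:k + n] for k in range(t - n + 1)])
    y = np.array([u[k + n] for k in range(t - n + 1)]).reshape(-1, 1)
    return X, y

u = np.arange(1.0, 11.0)       # u_1..u_10, so t = 9
X, y = make_windows(u, 4)      # X has shape (6, 4), y has shape (6, 1)
```

The first row of X is (u_1, ..., u_4) with target u_5, and the last row is (u_6, ..., u_9) with target u_10, matching the matrix form of (9).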

d: LSTM-RNN with ADAM Gradient Evaluation
In this section, detailed information on how the gradient is driven through the ADAM optimizer in the LSTM-RNN model is discussed. ADAM is a first-order gradient-based algorithm for the optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The approach is straightforward to implement, computationally efficient, has low memory requirements, is invariant to diagonal rescaling of the gradients, and is well suited to problems that are large in terms of data and/or parameters. ADAM combines the heuristics of both Momentum and RMSProp; a general implementation of the ADAM optimizer is given in [24]-[27]. In the LSTM-RNN model, the ADAM optimizer is embedded with a connection between the weights and the polynomial coefficients, which involves a combination of these two gradient descent methodologies. The momentum parameter accelerates the gradient descent algorithm by taking an exponentially weighted average of the gradients while training on the polynomial roots. As shown in Fig. 3, the LSTM-RNN depends on the memory cell c_t, which has the inputs u_t and u_{t+1} and the output y_N, and has additional gating units to control the flow of information. Therefore, at time step t, we can take the partial derivative \partial L(t)/\partial c_t to update the gradient; similarly, at time step t-1, we can take the derivative \partial L(t-1)/\partial c_{t-1}. Fig. 3 shows the unfolded memory unit of the LSTM, because the LSTM-RNN model is fitted by BPTT. According to Fig. 3, the error is backpropagated not only via L(t-1) but also from c_t. Thus, the final gradient with respect to c_{t-1} is defined as [30]:

\frac{\partial L}{\partial c_{t-1}} = \frac{\partial L(t-1)}{\partial c_{t-1}} + \frac{\partial L(t)}{\partial c_t} \frac{\partial c_t}{\partial c_{t-1}}     (15)

Since going from c_{t-1} to c_t involves only elementwise multiplication by the forget gate f_t, by the chain rule (15) can be written as:

\frac{\partial L}{\partial c_{t-1}} = \frac{\partial L(t-1)}{\partial c_{t-1}} + \frac{\partial L(t)}{\partial c_t} \odot f_t     (16)

In a similar manner, (15) can be derived at any time step.
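As a concrete illustration of the ADAM update rule discussed above, the following minimal numpy sketch minimizes a stand-in quadratic loss; the function name, parameter names and the toy loss are our own, not the paper's implementation:

```python
import numpy as np

def adam_step(theta, grad, m, v, k, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One ADAM update: a momentum-style first moment (m) and an
    RMSProp-style second moment (v), both bias-corrected by step count k."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** k)                  # bias-corrected first moment
    v_hat = v / (1 - b2 ** k)                  # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize the stand-in loss L(theta) = (theta - 3)^2
theta, m, v = 10.0, 0.0, 0.0
for k in range(1, 20001):
    grad = 2.0 * (theta - 3.0)
    theta, m, v = adam_step(theta, grad, m, v, k, lr=0.01)
# theta has converged close to the minimizer 3.0
```

The per-parameter scaling by sqrt(v_hat) is what gives each weight its individual effective learning rate, as described in the abstract.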

C. Evaluation Criterion
According to empirical results, the ADAM optimizer performs proficiently compared with other stochastic optimization methods [31]. The well-known mean squared error (MSE) and mean absolute error (MAE) functions are applied to evaluate the model error and to analyze the theoretical convergence properties of the network [28], [29]. If the change in the function value becomes extremely small, it no longer contributes to the learning process; consequently, the convergence rate has a regret bound comparable to the best-known results in the convex optimization domain. Finally, to evaluate the efficiency of the LSTM-RNN model, the MSE and MAE functions are as follows:

MSE = \frac{1}{N} \sum_{i=1}^{N} \left(y_i - \hat{y}_i\right)^2, \qquad MAE = \frac{1}{N} \sum_{i=1}^{N} \left|y_i - \hat{y}_i\right|     (17)

where y_i denotes the actual ith root (i = 1, 2, ..., n) in the test dataset, and \hat{y}_i is the corresponding approximation obtained with the proposed LSTM-RNN approach.
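A minimal numpy sketch of the two criteria in (17); the sample values are illustrative, not results from the paper:

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error between actual and predicted values."""
    return float(np.mean((y - y_hat) ** 2))

def mae(y, y_hat):
    """Mean absolute error between actual and predicted values."""
    return float(np.mean(np.abs(y - y_hat)))

y_true = np.array([1.0, -2.0, 0.5])   # actual roots (illustrative)
y_pred = np.array([1.1, -1.8, 0.4])   # network approximations (illustrative)
# mse(y_true, y_pred) = 0.02 and mae(y_true, y_pred) = 0.1333...
```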

D. Data Set Normalization with Generation
Normalization is the process of restructuring data so that it satisfies two basic requirements: (1) there is no data redundancy; and (2) data dependencies are logical (all related data objects are stored together). Therefore, in this study, the datasets are scaled to the range [-1, 1] to improve the convergence characteristics of the training algorithm. Using MATLAB, we generate datasets of polynomial coefficients for each degree n by choosing uniformly distributed random numbers, with 10,000 examples for each degree. Hence, for polynomials of degree n = 5, 10, 15 and 20, the generated datasets contain 50,000, 100,000, 150,000 and 200,000 coefficient values, respectively. Meanwhile, from the coefficient datasets, the exact (real or complex) roots are calculated using a symbolic computation package; thus, the LSTM-RNN does not know a priori which roots are real or complex. It is important to note that double-precision values are used to generate these datasets, although coefficients and roots are stored with only four decimal places. The coefficient dataset of a particular polynomial degree n is used as the input, with the polynomial roots as the output, which is then processed by the LSTM-RNN model to produce an approximation of the roots of a real polynomial of degree n. Tables 1 and 2 respectively show the head of the datasets used with the LSTM-RNN to train and compute approximations of the (real and complex) roots for n = 5 as an example. The real and imaginary parts of the corresponding roots are represented in Table 2 by the odd and even columns, respectively, i.e., {Re(w_i), Im(w_i)}, i = 1, ..., n. Furthermore, in this study, 80% of the datasets are used for the training set, while the remaining 20% are used to test and validate the model. The heads of the datasets for polynomials of degrees 10, 15 and 20 are not tabulated due to the large number of data values, and hence only the simulation results are shown.
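The dataset generation described above (done in MATLAB in the paper) can be sketched in Python as follows; the sampling range [-1, 1], the seed and the reduced number of examples are our assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, examples = 5, 100   # the paper uses 10,000 examples per degree; 100 here for brevity

# Uniformly distributed random coefficients for monic degree-n polynomials
coeffs = rng.uniform(-1.0, 1.0, size=(examples, n))

# Exact roots (real or complex), with real and imaginary parts interleaved
# so that odd columns hold Re(w_i) and even columns hold Im(w_i)
roots = np.empty((examples, 2 * n))
for j in range(examples):
    r = np.roots(np.concatenate(([1.0], coeffs[j])))
    roots[j, 0::2] = r.real
    roots[j, 1::2] = r.imag

roots = np.round(roots, 4)   # stored with four decimal places, as in the paper
```

An 80/20 split of `coeffs` (inputs) and `roots` (targets) then yields the training and test/validation sets.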

III. RESULTS AND DISCUSSION
In this section, the obtained datasets and approximate roots based on the proposed network methodology are discussed.
A. Datasets Verification
The proposed technique has been tested in the following way: the datasets are first generated using MATLAB. Then the confusion matrix [32] is used to verify the effectiveness of the generated datasets. Finally, the LSTM-RNN model is tested and validated using Python for computing the roots of polynomials of a given degree n. In this study, simulations have been performed on an Intel Core i7 with a CPU clock of 1.8 GHz and 8 GB of RAM for f(z) with n = 5, 10, 15 and 20. Figure 4 shows the confusion matrix based on the datasets for n = 5. The confusion matrix is developed using the MATLAB Classification Learner. The matrix helps identify the areas where the classifier model has performed poorly. Fig. 4(a) shows the percentage accuracy of the coefficient datasets for n = 5 only; however, the observations are similar for the other degrees. The rows represent the true class and the columns the predicted class, with the diagonal percentage values displaying the best approximation and accuracy of the datasets, as shown in Fig. 4(b). In this case, the percentage accuracy for the polynomial with n = 5 is found to be 99.4%. Similarly, Fig. 5 shows the analysis for n = 10, with a percentage accuracy of 99.2%. Correspondingly, the percentage accuracies for polynomials of degrees 15 and 20 are also found to be over 98%. Therefore, the results confirm that the generated datasets for the different polynomial degrees are effective and valid. Table 3 summarizes the results, with the classifier learning settings used to predict the dataset generation accuracy for the different cases of polynomial degree n.

B. LSTM-RNN Model Verification
In order to compute the polynomial roots based on the datasets, Python 3.7 is used to implement the proposed LSTM-RNN model with the ADAM optimizer. Validating the model is also necessary in order to rely on the model based on its evaluation with valid datasets. Hence, it is essential to evaluate the model on the validation dataset rather than on the training dataset. There are two ways of achieving this: (1) taking the validation dataset from the training dataset; and (2) keeping a separate validation set when splitting the main datasets. Such approaches are used by many algorithms, including the well-known Random Forest algorithm [33]. In our network model, the head of the datasets consists of both the input data and the desired (target) output data. Figs. 8 and 9 show the cases for polynomials of degrees 15 and 20, respectively. From the aforementioned analysis, it is observed that as the number of epochs increases, the validation accuracy also increases. Therefore, for n = 15 and 20, the analysis is only performed for the highest-epoch case, i.e., 6000 epochs. The validation accuracies for n = 15 and 20 are 95.2% and 93.6%, respectively. Hence, the proposed methodology can predict roots for polynomials of up to the 20th degree. Moreover, the validation accuracy for higher degrees can be further improved by increasing the number of epochs, at the cost, however, of increased execution time and system resources. Table 4 summarizes the results for the different polynomial degrees and numbers of epochs. Table 5 shows the comparative simulation analysis of root approximation for polynomials of degrees 5, 10, 15 and 20. In this study, the comparative analysis is performed against the conventional FNN-based ANN model. As mentioned in Section I, the most commonly used ANN technique for polynomial root approximation in previous research is the FNN-ANN technique.
Therefore, the same ANN technique is employed for comparison with the proposed methodology. The performance indices for comparison are the mean square error (MSE), the execution time and the percentage accuracy under a fixed number of epochs, i.e., 6000. From Table 5, the comparative results clearly demonstrate that the proposed LSTM-RNN model surpasses the FNN-ANN model for root approximation in terms of accuracy and lower MSE, at the cost of slightly higher execution time. In fact, as stated in Section II, due to the memory cells in the LSTM-RNN model, the execution time is naturally higher than that of the FNN-based ANN model. There is therefore a trade-off between accuracy and execution time; however, the execution time could be reduced with better system resources.