Fisher Information Matrix and Its Application for Bipolar Activation Function Based Multilayer Perceptrons With General Gaussian Input

For the widely used multilayer perceptrons (MLPs), there exist subspaces of the parameter space, called singularities, on which the Fisher information matrix (FIM) degenerates. These singularities seriously influence the learning dynamics of MLPs and have attracted much attention from researchers. As the FIM plays a key role in investigating the singular learning dynamics of MLPs, it is very important to obtain its analytical form. In this paper, for bipolar-activation-function-based MLPs with general Gaussian input, by choosing the bipolar error function as the activation function, the analytical form of the FIM is obtained. The validity of the obtained results is then verified by two experiments.


I. INTRODUCTION
As one of the most important subjects in computer science, artificial intelligence has developed rapidly in recent years and has been successfully applied in various areas [1], [2], such as pattern recognition, computer vision and intelligent control [3], [4], [5]. Within artificial intelligence, artificial neural networks play a key role in achieving such outstanding performance [6], [7]. Multilayer perceptrons (MLPs), which are typical feedforward neural networks, have also been widely applied in artificial intelligence [8], [9]. The main advantages of multilayer perceptrons are that they are easy to handle and can approximate any continuous function arbitrarily well.
However, unlike regular learning machines, when researchers applied MLPs to different applications, they found some strange behaviours in the learning process [10]. For example, there are many local minima, the learning process may become very slow, and the so-called plateau phenomenon can often be observed (an example is shown in Fig. 1) [11]. In view of the wide applications of MLPs, the reasons why the training processes often suffer from such difficulties have attracted much attention from researchers. Research results indicate that these singular behaviours are due to the network structure of feedforward neural networks, which have hidden layers. Because of the hidden layers, there exist subspaces in the parameter space of feedforward neural networks on which the Fisher information matrix (FIM) is singular [12], [13]. These subspaces are the main cause of the singular learning behaviours above, so we call them singularities.
As the FIM degenerates on the singularities, the parameter space becomes a Riemannian manifold rather than the Euclidean space of regular learning machines, which leads to three problems [11], [14]: 1) the classical Cramér-Rao paradigm is no longer valid; 2) appropriate network structure cannot be reliably determined. For example, the commonly used model selection criteria, such as the Akaike information criterion (AIC), the Bayesian information criterion (BIC) and minimum description length (MDL), often fail to determine an appropriate network structure; 3) the standard gradient descent method is not Fisher-efficient. Instead of the ordinary gradient direction, the Riemannian gradient (natural gradient) direction becomes the steepest descent direction [15], so training neural networks with the standard gradient descent method faces many difficulties near the singularities. Therefore, it is well worth investigating the learning dynamics near the singularities of MLPs.
Given that the FIM plays a fundamental and vital role in investigating the singular learning dynamics of MLPs, obtaining its analytical form has two important implications: 1) it allows a detailed analysis of the mechanism of singular learning dynamics; 2) it makes it easier to design better learning algorithms that overcome the serious influence of singularities. Thus the main contribution of this paper is to obtain the analytical form of the FIM for bipolar-error-function-based MLPs with general Gaussian input. We also show the potential of the analytical form for designing better algorithms.
The rest of this paper is organized as follows. A brief review of related work is presented in Section II. In Section III, the analytical form of the FIM is obtained. In Section IV, we verify the validity of the obtained results through simulation studies. Section V states conclusions and discussions.

II. RELATED WORK
In this section, we provide a brief overview of previous work on the mechanism of singular learning dynamics.
By investigating the geometric structure of MLPs, [16] proved that the global minimum of a smaller model can be a local minimum or a saddle point of the larger model, and illustrated various singularities in detail. For layered networks, through a general mathematical analysis, [17] obtained universal learning trajectories near the overlap singularity. Researchers then aimed at a more detailed theoretical analysis of the learning dynamics near singularities. However, the widely used activation functions, such as the log-sigmoid function 1/(1 + e^{−λx}) and the hyperbolic tangent function tanh(x), cannot be integrated analytically, which prevents a quantitative analysis of the learning dynamics. To overcome this problem, the error function was chosen as the activation function of MLPs in the unipolar and bipolar cases, respectively [11], [18]. Different cases of MLPs with different types of activation functions, including the toy model case [19], the regular case [20], [21] and the unrealizable case [22], have since been investigated and diverse results have been obtained. [23] obtained the analytical form of the FIM in RBF networks and investigated to what extent RBF networks are influenced by singularities.
Since the Riemannian gradient (natural gradient) descent direction becomes the steepest descent direction near the singularities, the natural gradient method was proposed to overcome their serious influence [24]. As it is very hard to obtain the explicit form of the FIM and its inverse, researchers proposed adaptive natural gradient algorithms, in which the inverse FIM is calculated directly by an approximation formula [25], [26], [27], and applied the natural gradient method in big-data fields and deep neural networks [28], [29], [30].
Because the hyperbolic tangent function cannot be integrated analytically, the analytical form of the FIM cannot be obtained for tanh-based MLPs. In this paper, we choose the bipolar error function φ(x) = √(2/π) ∫_0^x exp(−t²/2) dt as the activation function of MLPs, and obtain the analytical form of the FIM.
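For reference, the chosen activation has a closed form in terms of the standard error function, φ(x) = erf(x/√2), and its derivative is a Gaussian kernel. A minimal sketch (the function names are ours, not from the paper):

```python
import math

def phi(x):
    # Bipolar error function: sqrt(2/pi) * integral_0^x exp(-t^2/2) dt,
    # which equals erf(x / sqrt(2)) after the substitution t = sqrt(2) u.
    return math.erf(x / math.sqrt(2.0))

def phi_prime(x):
    # Derivative of phi: a Gaussian kernel, sqrt(2/pi) * exp(-x^2/2).
    return math.sqrt(2.0 / math.pi) * math.exp(-0.5 * x * x)
```

Unlike tanh, Gaussian expectations of products of φ and φ′ admit closed forms, which is what makes the FIM analytically tractable.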

III. ANALYTICAL FORM OF FISHER INFORMATION MATRIX
In this section, the learning paradigm of MLPs is introduced first and then the analytical form of the FIM is obtained. The bipolar-activation-function-based multilayer perceptron with one hidden layer is defined as follows:

f(x, θ) = Σ_{i=1}^{k} w_i φ(J_i^T x),  (1)

where x is the input, k is the number of hidden nodes, and J_i and w_i are the weight from the input layer to hidden node i and the weight from hidden node i to the output layer, respectively. φ(·) is a bipolar activation function. Then θ = {J_1, · · · , J_k, w_1, · · · , w_k} represents all the parameters of the model. In order to obtain the analytical form of the FIM and overcome the non-integrability of the hyperbolic tangent function, in this paper we choose the bipolar error function as the activation function, namely φ(x) = √(2/π) ∫_0^x exp(−t²/2) dt. For the regression task, an unknown teacher function needs to be approximated:

y = f_0(x) + ε,  (2)

which generates the observed data (x_1, y_1), · · · , (x_t, y_t). The additive noise ε is usually assumed to follow a Gaussian distribution with mean 0 and variance σ_0². The input x is generally assumed to be Gaussian; in this paper we investigate the general Gaussian input case, i.e. the probability density function of x is

p(x) = (2π)^{−n/2} |Σ|^{−1/2} exp(−(1/2)(x − μ)^T Σ^{−1} (x − μ)),

where n is the input dimension, μ is the mean vector and Σ is the covariance matrix. We choose the square loss function to measure the error, L(θ) = (1/2)(y − f(x, θ))², and use the gradient descent method to minimize the loss, θ_{t+1} = θ_t − η ∂L(θ_t)/∂θ, where η is the learning rate. The FIM is defined as follows [11]:

F(θ) = ⟨ (∂ log p(y, x; θ)/∂θ)(∂ log p(y, x; θ)/∂θ)^T ⟩,  (6)

where ⟨·⟩ denotes the expectation with respect to the teacher distribution, which is given by p(y, x; θ_0) = p(x) (2πσ_0²)^{−1/2} exp(−(y − f_0(x))²/(2σ_0²)). Then we introduce the types of singularities.
As shown in [11], besides the overlap singularity and the elimination singularity in the parameter space of unipolar-activation-function-based MLPs, there also exists an opposite singularity for the bipolar-activation-function-based MLPs (1); thus there are three types of singularities in total: (1) opposite singularity: J_i = −J_j; (2) overlap singularity: J_i = J_j; (3) elimination singularity: w_i = 0. Now we aim to obtain the explicit expression of the FIM. For the Gaussian input case, the covariance matrix plays a central role, while the value of μ does not essentially influence the analysis; without loss of generality, μ is taken as 0 in this paper.
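To make these objects concrete, the following sketch (our own illustrative code, not the paper's derivation) implements the student model of Eq. (1) and a Monte-Carlo estimate of the FIM for Gaussian input. On an overlap singularity J_i = J_j, the per-sample gradients are linearly dependent, so the estimated FIM is rank-deficient, exactly as the theory predicts:

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def phi(x):
    # Bipolar error function applied elementwise: erf(x / sqrt(2)).
    return _erf(np.asarray(x) / math.sqrt(2.0))

def mlp(x, J, w):
    # Student model of Eq. (1): f(x, theta) = sum_i w_i * phi(J_i . x).
    return float(w @ phi(J @ x))

def monte_carlo_fim(J, w, Sigma, sigma0_sq=0.05, n_samples=4000, seed=0):
    # Estimate F(theta) = (1/sigma0^2) * E_x[(df/dtheta)(df/dtheta)^T]
    # for Gaussian regression noise and input x ~ N(0, Sigma).
    rng = np.random.default_rng(seed)
    k, n = J.shape
    F = np.zeros((k * n + k, k * n + k))
    for _ in range(n_samples):
        x = rng.multivariate_normal(np.zeros(n), Sigma)
        u = J @ x
        dphi = np.sqrt(2.0 / np.pi) * np.exp(-0.5 * u ** 2)
        grad_J = (w * dphi)[:, None] * x[None, :]   # df/dJ_i = w_i phi'(J_i.x) x
        g = np.concatenate([grad_J.ravel(), phi(u)])  # df/dw_i = phi(J_i.x)
        F += np.outer(g, g)
    return F / (n_samples * sigma0_sq)
```

With two identical rows in J, the smallest eigenvalues of the estimate are numerically zero, reflecting the degeneracy of the FIM on the overlap singularity.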
Before giving the analytical form of the FIM, we first obtain the explicit expressions of three Gaussian expectations, denoted Q_1(J_i, J_j), Q_2(J_i, J_j) and Q_3(J_i, J_j), which play a key role in deriving the FIM. Then in Lemma 1 we give the explicit expressions of Eqs. (11)-(13).
Lemma 1: The explicit expressions are given as follows: where: Proof: The detailed calculation is presented in the Appendix.
Now we can give the analytical form of the FIM in Theorem 1. Theorem 1: The analytical form of the FIM F(θ) is given by: where: Proof: First we define: then from Eq. (1) and Eq. (6), we have For Eqs. (25)-(28), using the results in Lemma 1, we obtain: This completes the derivation of the analytical form of the FIM.

IV. SIMULATION EXPERIMENTS
In this section, we take three experiments to illustrate the validity and importance of the obtained results. From Eq. (21), we can see that only the student parameters are needed to compute the FIM during the training process; thus the type of teacher model does not play a significant role. For convenience and without loss of generality, we investigate the case where the teacher model also has the form of an MLP, i.e. Eq. (2) can be rewritten as:

y = Σ_{m=1}^{M} v_m φ(t_m^T x) + ε,

where M is the number of hidden units and θ_0 = {t_1, · · · , t_M, v_1, · · · , v_M} represents all the teacher parameters. This assumption is based on the universal approximation ability of MLPs and is therefore reasonable. Now we introduce three indexes that are important for presenting the experiment results:
1) Inverse condition value of the FIM. This index is used to judge whether the FIM is singular. When the matrix is nearly singular, the condition number becomes very large, i.e. the inverse of the condition number becomes close to 0.
2) h_1(J_i, J_j). This index is used to judge whether two hidden units J_i and J_j overlap. If the MLP has been affected by the overlap singularity, J_i = J_j, then h_1(J_i, J_j) = 0.
3) h_2(J_i, J_j). This index is used to judge whether the MLP has been affected by the opposite singularity. If the MLP has been affected by the opposite singularity, J_i = −J_j, then h_2(J_i, J_j) = 0.
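The three indexes can be computed as follows. Since the explicit definitions of h_1 and h_2 are not spelled out here, the Euclidean norms below are one plausible choice satisfying the stated properties (h_1 = 0 iff J_i = J_j, h_2 = 0 iff J_i = −J_j):

```python
import numpy as np

def inv_cond(F):
    # Inverse condition number of the FIM: near 0 means F is nearly singular.
    return 1.0 / np.linalg.cond(F)

def h1(Ji, Jj):
    # Overlap index (our choice): vanishes exactly when J_i = J_j.
    return np.linalg.norm(np.asarray(Ji) - np.asarray(Jj))

def h2(Ji, Jj):
    # Opposite index (our choice): vanishes exactly when J_i = -J_j.
    return np.linalg.norm(np.asarray(Ji) + np.asarray(Jj))
```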
Then we take two experiments to visually present the learning dynamics of MLPs, which verify the correctness of Theorem 1 and illustrate the potential for designing better algorithms based on the obtained analytical form of the FIM.
For given teacher parameters, after choosing the initial student parameters, we use the gradient descent method to carry out the training processes. In the following figures of experiment results, '•' and '×' represent the initial state and the final state, respectively.

A. LEARNING TRAJECTORIES IN ERROR FUNCTION BASED MLPs
This experiment is conducted to verify the correctness of the obtained analytical form of the FIM, i.e. on a singularity the FIM is singular, and otherwise the FIM is regular. We choose the teacher and student models to both have 6 hidden units, i.e. M = 6 and k = 6. The additive noise is ε ∼ N(0, 0.05) and the covariance matrix of the Gaussian input is

Σ = [0.8  0.3]
    [0.3  0.6].
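The input and noise setup of this experiment can be reproduced as follows (the sample size and seed are our own choices; N(0, 0.05) is read as variance 0.05, hence the square root in the standard deviation):

```python
import numpy as np

rng = np.random.default_rng(42)
Sigma = np.array([[0.8, 0.3],
                  [0.3, 0.6]])  # covariance of the 2-D Gaussian input

# Draw training inputs x ~ N(0, Sigma) and additive noise eps ~ N(0, 0.05).
X = rng.multivariate_normal(np.zeros(2), Sigma, size=100_000)
eps = rng.normal(0.0, np.sqrt(0.05), size=100_000)
```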
The learning rate is chosen as η = 0.002. We then present the cases of learning dynamics that are affected by singularities and the regular case, respectively. Case 1 (Opposite Singularity): the learning process is affected by the opposite singularity.
In this case, the learning process is affected by the opposite singularity. We choose the teacher parameters as: The final student parameters are: The experiment results are shown in Fig. 2, which presents the trajectories of the log-scale inverse condition number of the FIM, the training error, the output weights w, and h_2(5, 6), respectively.
From Fig. 2(d), it can be seen that h_2(5, 6) quickly becomes nearly 0 once the training process starts. When training finishes, as shown in Eq. (39), which gives the final state of the student parameters, hidden units J_5 and J_6 are nearly opposite: the learning process is affected by the opposite singularity. Meanwhile, as can be seen in Fig. 2(a), the inverse condition value of the FIM is smaller than 10^{-15} until the end of the training process, which implies that the FIM becomes nearly singular. This is in accordance with the theoretical analysis.
Case 2 (Overlap Singularity): the learning process is affected by the overlap singularity.
For this case, two hidden units overlap during the learning process and the learning dynamics are trapped in the overlap singularity. We choose the teacher parameters as: The experiment results are shown in Fig. 3, which presents the trajectories of the log-scale inverse condition number of the FIM, the training error, the output weights w, and h_1(1, 5), respectively. From Fig. 3(d) and the final states of the student parameters, we can see that J_1 and J_5 nearly overlap, which implies that the learning process is affected by the overlap singularity. As can also be seen in Fig. 3(a), the inverse condition value of the FIM decreases quickly to nearly 0 and is finally smaller than 10^{-15}; thus the FIM is nearly singular by the end, once the learning process has been affected by the overlap singularity.
Remark 1: It can be seen that the log scale of the inverse condition value fluctuates noticeably at the end of the learning process (Fig. 3(a)). We believe this is mainly because the value is extremely small (below 10^{-15}), so even a slight change of the parameters causes an obvious fluctuation of the condition number of the Fisher information matrix due to the limited numerical precision of the computer.
Case 3 (Elimination Singularity): the learning process is affected by the elimination singularity. For this case, one output weight crosses 0 during the learning process and a plateau phenomenon can be clearly observed. We choose the teacher parameters as: The initial student parameters are: The experiment results are shown in Fig. 4, which presents the trajectories of the inverse condition number of the FIM, the training error, and the output weights w, respectively.
From Fig. 4(c), we can see that w_6 crosses 0 during learning, so the learning process is affected by the elimination singularity. While w_6 crosses 0, the plateau phenomenon is clearly observed in Fig. 4(b), and the FIM also degenerates at this stage (Fig. 4(a)). The student parameters then escape the influence of the elimination singularity and finally converge to the global minimum, as can be seen from the final state of the student parameters in Eqs. (51)-(52); meanwhile, the FIM becomes regular again in the late stage (Fig. 4(a)), as the learning dynamics are no longer influenced by the elimination singularity.
Case 4 (Fast Convergence): the learning process does not suffer from the influence of singularities. For this case, the learning dynamics are not influenced by any singularity and converge quickly to the optimal value. We choose the teacher parameters as: The experiment results are shown in Fig. 5, which presents the trajectories of the inverse condition number of the FIM, the training error and the output weights w, respectively.
As can be seen from Fig. 5(b) and the final student parameters, the learning dynamics quickly converge to the global minimum and are not affected by any singularity. The FIM also remains regular during the entire training process.
In the above four cases, we have shown learning dynamics belonging to the singular cases and to the regular case, respectively. The FIM degenerates when the learning dynamics are affected by singularities and remains regular otherwise, which verifies the correctness of the results obtained in Theorem 1.
Remark 2: Compared with the bipolar error function, the hyperbolic tangent function tanh(x) = (e^x − e^{−x})/(e^x + e^{−x}) is the most widely used bipolar activation function in MLPs. Although the theoretical results in Theorem 1 are obtained for the bipolar error function, we take another experiment to illustrate that the results are also valid for hyperbolic-tangent-based MLPs. The experiment setup is the same as in Section IV-A: we choose the same teacher parameters and initial student parameters; the only difference is that the hyperbolic tangent function replaces the bipolar error function as the activation function in the teacher and student models. The experiment results are essentially the same as those shown in Section IV-A. Thus the analytical form of the FIM based on the bipolar error function can also be applied to hyperbolic-tangent-based MLPs.

B. FIM BASED NATURAL GRADIENT DESCENT ALGORITHM
As the natural gradient descent direction becomes the steepest descent direction, researchers proposed the natural gradient method to overcome the influence of singularities; the parameter update formula is as follows:

θ_{t+1} = θ_t − η F^{−1}(θ_t) ∂L(θ_t)/∂θ,  (59)

where η is the learning rate and F(θ_t) is the FIM at iteration t. Compared to the standard gradient descent method, the natural gradient method adds the inverse FIM term to the parameter update. From (59), we can see that computing the inverse FIM plays a key role in the natural gradient descent method. Unfortunately, it is very hard to obtain the analytical form of the inverse FIM, and directly computing it also requires an enormous computational cost. This limits the application of the natural gradient descent method. Researchers therefore proposed the adaptive natural gradient descent method, which uses an iterative formula to approximate the inverse FIM instead of computing it directly. Although inverting a high-dimensional matrix still faces many difficulties, the analytical form of the FIM can help us investigate better approximation formulas for the inverse FIM, which will lead to a significant improvement of adaptive natural gradient descent algorithms.
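The update of Eq. (59) can be sketched as follows; the small damping term is our addition to keep the linear solve well-posed when F(θ) is nearly singular:

```python
import numpy as np

def ngd_step(theta, grad, fim, lr=0.002, damping=1e-8):
    # Natural gradient descent update of Eq. (59):
    # theta_{t+1} = theta_t - lr * F(theta_t)^{-1} * grad.
    # Solving the linear system avoids explicitly forming the inverse FIM.
    F = fim + damping * np.eye(len(theta))
    return theta - lr * np.linalg.solve(F, grad)

def sgd_step(theta, grad, lr=0.002):
    # Standard gradient descent update for comparison.
    return theta - lr * grad
```

When the FIM is the identity, the natural gradient step reduces (up to the damping term) to the ordinary gradient step.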
In this experiment, we aim to present the performance of the natural gradient method by directly computing the inverse FIM based on the obtained analytical form. We choose the teacher model and the student model to both have 2 hidden nodes with input dimension 1, i.e. M = 2, k = 2 and n = 1. The experiment results compare the natural gradient descent (NGD) algorithm with the standard gradient descent (SGD) algorithm for three singular cases: the opposite singularity case, the overlap singularity case and the elimination singularity case. Due to the difficulty of calculating the inverse FIM and the precision limitation of the computer, we set part of the student parameters to their optimal values initially, and only the remaining student parameters need to be modified. Case 1 (Opposite Singularity): The experiment results are shown in Fig. 6, which presents the trajectories of the inverse condition number of the FIM, the training error, and h_2(1, 2), respectively.
Case 2 (Overlap Singularity): For this case, w_1 and w_2 are set to the optimal values and we only modify the student parameters J_1 and J_2. We choose the teacher parameters as: t_1 = 0.45, t_2 = 1.38, v_1 = −0.56, and v_2 = 0.37. The initial student parameters are: The MLP is trained for 200 epochs using the NGD algorithm and 500 epochs using the SGD algorithm, respectively. The experiment results are shown in Fig. 7, which presents the trajectories of the inverse condition number of the FIM, the training error, and h_1(1, 2), respectively.
Case 3 (Elimination Singularity): For this case, J_2 and w_2 remain fixed during training, i.e. J_2 = t_2 and w_2 = v_2; only the student parameters J_1 and w_1 need to be modified. We choose the teacher parameters as: t_1 = 0.40, t_2 = 0.89, v_1 = −0.32, and v_2 = −0.90. The initial student parameters are: J_1 = 0.20. The MLP is trained for 300 epochs using the NGD algorithm and 1000 epochs using the SGD algorithm, respectively. The experiment results are shown in Fig. 8, which presents the trajectories of the inverse condition number of the FIM, the training error, and w_1, respectively.
For Cases 1 and 2, as can be seen from Fig. 6 and Fig. 7, when the SGD algorithm is used to train the MLPs, the student parameters remain trapped in the opposite or overlap singularity until the end. In sharp contrast, when the NGD algorithm is used, the learning dynamics easily escape the influence of the opposite and overlap singularities and converge to the global minimum. For Case 3, from Fig. 8, we can see that the learning process is affected by the elimination singularity: a plateau phenomenon is observed in Fig. 8(b) for the SGD algorithm, while the natural gradient algorithm significantly reduces the influence of the elimination singularity.
All the experiment results of the above three cases illustrate the efficiency of the FIM-based natural gradient method in overcoming the influence of singularities. Since the analytical form of the FIM has been obtained in Theorem 1, it is important to derive better approximation formulas for the inverse FIM based on Theorem 1 in the future, which will facilitate the application of the natural gradient method to high-dimensional systems.

V. CONCLUSION AND DISCUSSIONS
Multilayer perceptrons have been widely used in many fields; however, the singularities in the parameter space often seriously influence the learning dynamics. As the Fisher information matrix degenerates on the singularities, the FIM plays a significant role in investigating the singular learning dynamics. In this paper, for MLPs with general Gaussian input, by choosing the bipolar error function as the activation function, we obtained the analytical form of the FIM. In the experiment part, we verified the correctness of the analytical form and showed the efficiency of the FIM-based NGD algorithm in comparison with the SGD algorithm. In the future, based on the obtained analytical form of the FIM, we aim to derive better approximation formulas for the inverse FIM that can be applied to high-dimensional systems.

APPENDIX
Q_1(J_i, J_j), Q_2(J_i, J_j) and Q_3(J_i, J_j) can be rewritten as: We denote A(J_i, J_j)^{−1} = Σ^{−1} + J_i J_i^T + J_j J_j^T and B(J_i)^{−1} = Σ^{−1} + J_i J_i^T; then we have: