Unified Model Selection Approach Based on Minimum Description Length Principle in Granger Causality Analysis

Granger causality analysis (GCA) provides a powerful tool for uncovering the patterns of brain connectivity mechanism using neuroimaging techniques. In this paper, distinct from conventional two-stage GCA, we present a unified model selection approach based on the minimum description length (MDL) principle for GCA in the context of the general regression model paradigm. In comparison with conventional methods, our approach emphasizes that model selection should follow a single mathematical theory during the GCA process. Under this framework, all candidate models within the model space might be compared freely in the context of the code length, without the need for an intermediate model. We illustrated its advantages over conventional two-stage GCA approach in a 3-node network and a 5-node network synthetic experiments. The unified model selection approach was capable of identifying the actual connectivity while avoiding the false influences of noise. More importantly, the proposed approach obtained more consistent results in a challenging fMRI dataset, in which visual/auditory stimulus with the same presentation design gives identical neural correlates of mental calculation, allowing one to evaluate the performance of different GCA methods. Moreover, the proposed approach has potential to accommodate other Granger causality representations in other function space. The comparison between different GC representations in different function spaces can also be naturally deal with in the framework.


Introduction
Since the 21st century, the European Union (EU) and the United States (US) have launched costly brain research projects, the Human Brain Project (HBP) of the EU and the BRAIN Initiative (BRAIN) of the US.Guided by the achievements of neuroscience research and with the help of computers, HBP will build a unified multi-scale brain model to simulate the brain at all levels of brain structure and function, so that neuroscientists can contact genes, molecules, cells with cognitive behaviors.Further understanding of the brain, it will open up some new ways for the treatment of nervous system diseases and the development of information technology [1].BRAIN is considered to be comparable to the Human Genome Project.Its goal is to study all neurons in brain activity, draw detailed maps of neural circuits on mesoscopic scale, and explore the relationship between neurons, neural circuits and brain function [2].Subsequently, a worldwide upsurge of brain science research was launched.Until now, major countries and organizations around the world have launched their own Brain Plan [3,4,5,6], and Neuron, a professional journal of neuroscience, set up a special issue to introduce the Global Brain Plan.Despite their different emphasis, all these projects indicate that neuroscience has increasingly focused on the functional integration at different spatial scales, that is, functional specialization has transformed to functional integration in complex neural networks [7,8,9,10,11,12].Identically, as we gradually understand the complexity of neural network in human brain, and combining the prior knowledge of laws of thermodynamics, isolated functional specialized small-networks in human brain are unlikely to exist [13,14].That means subnetworks must be connected to other network to work inside the complex network.In a word, the establishment of a neural circuit between networks is very crucial for current neuroscience, and exploring the directionality of causal networks need to be more effectively [15,16].
Since the concept of Granger causality was originally introduced by Wiener and Granger [17,18], the GCA methods have become widely applied in neuroscience to diverse sources of data, including electroencephalography (EEG) [19,20,21,22], local field potentials (LFP) [23,24,25,22], magnetoencephalography (MEG) [19,26,27], and functional magnetic resonance imaging (fMRI) [28,29,30,31].In these studies, the time series data are interpreted to reflect neural activity from a source, and GCA is used to characterize the directionality, directness and dynamic of influence between sources.Duo to the data-driven and simple mathematical form, the GCA methods may have some advantages to explore the directionality in complex neural networks, and the resulting causal connection network will be clarified more conveniently and simply.At the same time, GCA integrates the historical information and does not require a priori specification of the network model.These properties also lead to be more capable of identifying causal connectivities in following large-scale network study.
However, causal connection is complicated, relatively immature.In the current GCA research, the determination of the model orders of its historical information in model space is based on the AIC/BIC theory, and causal effect of the external information is quantified by the F-statistic [32,33,34,35].There may be inconsistencies in the selection criteria of different mathematical theories, the subjectivity of the significance level in F-statistics, and the complexity of the operation.In a generalized perspective, these two-stage scheme can be considered as a process that maps different information into the selected model spaces in GCA for quantitative comparison.Thus in conventional GCA, whether it combines with AIC/BIC to regress historical information or it regresses exogenous information of external variables by F-statistics, the two-stage scheme is a regression or model selection problem.Therefore, we need a consistent model selection method for GCA in the context of general regression model paradigm.
To keep consistency in model selection for GCA, we proposed a code length guided framework based on MDL principle, which Rissanen first proposed to quantify parsimony of a model [36], to map the two-stage scheme into the same model space.
MDL has intellectual roots in the algorithmic or descriptive complexity theory of Kolmogorov, Chaitin, and Solomonoff [37].Only considering the probability distribution as a basis for generating descriptions, Rissanen endowed MDL with a rich informationtheoretic interpretation [38,39,40,41,42,43].Due to these characteristics, causality analyzed with help of MDL principle may be more in line with physiological models.
And model selection by a single mathematical principle throughout GCA, which could be easily ignored, would determine the final result working pattern of human brain.
Thus the causal connectivity obtained by conventional GCA could be misleading, and main reason may be splitting model selection into a two-stage scheme.Our proposal incorporated the endogenous and the exogenous information into a unified model selection process, which the information will be quantified by converting into code length to obtain causality.The two-stage scheme based on the two different theories is unified under the single mathematical framework, MDL principle, which guarantees the only benchmark in the GCA methodology research.Above all, model selection for endogenous information and exogenous information by our code length guided framework is meaningful to understand the further causal connection in human brain.
At the same time, we explored the expression of MDL principle applicable to different function spaces.Generally, we pick models to describe causal connectivities in time domain, which the result causal connection seems to be complicated.And causal connectivities in the frequency domain could reveal the working pattern essentially.
Oscillation is a ubiquitous feature of neurophysiological systems and neuroscience data.With help of Geweke's work [44], the Granger Geweke method seems to provide neuroscientists with exactly what they want, that is, an assessment of direct, frequency dependent, functional influence between time series [23,45,46,47,48,49].Following our initial thought, here we prepare to analyze causality in frequency domain with the help of our code length guided framework, which may be more consistent with physiological model of human brain.Causal connections at specific frequencies between regions could be helpful for future clinical medical applications.

Problem Statement
Let variables X and Y be two stochastic and stationary time series.Now consider the following pair of (restricted and unrestricted) regression models (1) where p and q are the model orders (the numbers of time lags) in X and Y , respectively, β x and β y are the model coefficients, and u and v are the residual of the models.The order of history predictor p is usually determined with the AIC [50] or the BIC [51].
The conventional GCA requires statistical significance to determine whether the unrestricted model provides a better prediction than the restricted model.The hierarchical F -statistics, based on the extra sum-of-squares principle [52], can be used to evaluate significant predictability, given as where RSS r and RSS u represent the sum of squared residuals of restricted model and unrestricted model, respectively, n is the total number of observations to estimate the unrestricted model.The F -statistics approximately follows an F -distribution with (q, n − p − q − 1) degrees of freedom.
If F >F 0 (q, n − p − q − 1), the variability of the residual of unrestricted model is significantly less than the variability of the residual of restricted model, then there is an improvement in the prediction of X due to Y , and we refer to this as causal influence from Y to X.But in the current bivariate model, spurious connections will emerge frequently.
In order to remove spurious connections caused by indirect causalities between nodes, GCA also provides a measure of conditional causal connection by introducing another variable Z into Eqs.( 1) and (2): Then, the causal influence from Y to X, conditional on Z, is defined as Consider three model spaces A, B, and C, where each space comprises three models, and let p i and q i , i = 1, 2, 3, denote the model orders regarding its endogenous information and the exogenous information from other variables, respectively, Where We further assume that a i , b i , and c i , i = 1, 2, 3, have the same residual variance.For space A, the models only specify the regression model orders of the endogenous information by AIC/BIC, and then specify the causal effect of endogenous information by F-statistics.The models in space B specify the regression model orders of the exogenous effect by AIC/BIC, then specify the causal effect of exogenous information by F-statistics.But the models C specify the regression model orders of endogenous information and exogenous information separately, and specify the causal effect between c 1 , c 2 and c 3 by F-statistics.In this situation, the model selection in conventional GCA is split into a two-stage scheme, the model orders is determining by AIC/BIC and then the causal effect is quantified by F-statistics sequentially.It is clear that the final inference might differ with rules applied, even though three classes are completely equivalent from a model description standpoint.
As we stated above, specifying the effects of endogenous and exogenous information in conventional GCA is split into a two-stage scheme.Specifying the regression model orders of historical information, which contains the regression of endogenous information in Eq. ( 1) and the regression of both endogenous and exogenous information in Eq. ( 2), is mainly based on AIC/BIC theory separately, and then specifying the causal effect of exogenous information by F-statistics.However, specifying the two effects is the same kind of problem of model selection, which is equivalent from the mathematical perspective.Different theories might generated different benchmarks in two stages for model selection, therefore degrade the performance of GCA.
Aside from theoretical considerations, pairwise F -statistics also arouses several potential problems that might lead to misleading or unreasonable inferences in connectivity analysis.Firstly, the model comparison with F -statistics is performed under a specific significance level.The assignment of significance level is a subjective matter in conventional GCA process [53,54].A significance level that is too low could cause the false connection noise originated, whereas a significant level that is too high could erase the actual connection.When there is no rule to justify the assignment of the significance level, F -statistics will lead to different connectivity results depending on the significance level chosen.
Secondly, the selection results by pairwise F -statistics are heavily dependent on the initially selected model and the search path in model space.Consider the collection of three models M = {A, B, C} and let S denote the residual variance of each model with S A = S, S B = S + ∆S, and S C = S − ∆S.We further assume I 2 ≤ ∆S < I, where I is the interval satisfying statistical significance.The aim of F -statistics is to find the model appropriate to the observations from the model space, which is evidently model C in this case.The search path B → A → C will arrive at optimal model C, whereas if we start at A and follow the search path A → C → B, we will reach suboptimal model A, that is, optimal model C is not considered.The distinct results obtained by using F -statistics for model selection along with different search Comparing transverses within the model space can only partially reduce the risk of model misspecification, and is not always available due to the nested relation between comparable models in F -statistics.Moreover, this strategy also leads to concern about the computational complexity.
paths and initial models are given explicitly in Fig. 1.In fact, F -statistics can not discriminate the models by residual variance within a specific range relative to the chosen significance level and the noise level.On the other hand, F -statistics uses the extra sum of squares principle to identify a better model.This means that only models with a nested relationship can be compared.Then the competitive models perform pairwise comparison in an indirect way, through intermediate model, namely unrestricted model in F -statistics.Therefore, the comparison between any two models is not available in practical applications, and the search path relies on the structure of all candidate models.In such situation the optimal model is not always guaranteed.
The third potential problem relates to computational complexity.Consider the simplest case where conditional causal connections are not taken into account (q = 1 and r = 0 in Eq. ( 3) and ( 4)).Suppose that there are m candidate models for any one directional causality in the network with n nodes, the number of F -tests that needs to be performed is m(m−1)(n−1), and the total number of comparisons is mn(m−1)(n−1).
The problem on computational complexity is compounded by conditional GCA, but it still will be intractable while investigating large network [55].
Although different approaches might be applied for model selection in GCA [56,57], the two stages are kept unchanged.They are generally based on two different mathematical theories in most GCA applications.However, as mentioned above, determining the model orders by AIC/BIC or quantifying the causal effects by F-statistics can essentially be considered as a generalized model selection problem from a modeling standpoint.Since all the issues we discuss can be attributed to model selection problems, then a more practical model selection method need to be enabled.
The MDL is an information criterion that provides a generic solution for the model selection problem [58].As a broad principle, the MDL represents a completely different approach for model selection relative to traditional statistical approaches, such as F -statistics and the AIC or BIC.Compared with the AIC/BIC, the use of the MDL does not require any assumptions about the data generation mechanism.In particular, a prior probability distribution does not need to be assigned to each model class.The objective of model selection in the MDL is not to estimate an assumed but unknown distribution, but to find models that more realistically represent the data [59].
In general, the MDL principle strikes a balance between model complexity and fitting error of a model, and discriminates between competing model classes based on the complexity of each description [60,61].This proposal involves constructing a code length guided framework, then the model selection process in GCA can be convert into comparing the code length of each model.The endogenous information and the exogenous information will be uniformly encoded to be the code length, and MDL will select the model with the minimum total code length as optimal model.To solve the above potential problems in conventional GCA, we took the MDL principle as the single mathematical framework to select the generalized model for GCA, which to ensure the consistency, objectivity and the parsimony.

The minimum description length principle
We all know that the principle of parsimony is the soul of model selection.To implement the parsimony principle, one must quantify parsimony of a model relative to the available data.With help of the work of Kolmogorov [62,63], Wallace and Boulton [64], Rissenan formulated MDL as a broad principle governing statistical modeling in general.At beginning of modeling, all we have is only the data.Luckily, With the help of the algorithmic or description complexity theory of Kolmogorov, Chaitin, and Solomonoff, MDL regards a probability distribution as a descriptive standpoint.
And MDL has some connections with AIC and BIC, sometimes it behaves similarly to AIC and BIC [65,61].The difference is that MDL fixes attention on the length function rather than the code system.Therefore, many kinds of probability distribution can be compared in terms of their descriptive power [66].Our code length guided framework can be used in generalized model selection as long as there is a probability distribution in the model.

Probability and idealized code length
In order to describe the MDL principle explicitly, we deduced the formula of MDL in different cases.In model selection process of MDL, we need to compare the code length obtained by its probability distribution, so it is essential to understand the relationship between probability and the code length [61,67].
A code on a set A is simply a mapping from A to a set of codewords.Let A be a finite binary set and let Q denote a probability distribution of the any element a in A. The code length of a is that − log 2 Q, the negative logarithm of Q.For example, the Huffman code is one of the algorithms that constructed this relationship between probability and idealized code length [66].Suppose that elements of A are generated according a known distribution P .Given a code on A with length function L, the expected code length of with respect to P is defined as As is well known, if is prefix code, the code length for some distribution Q on A. There was given an arbitrary code, if no codeword is the prefix of any other, the unique decodability is guaranteed.Any code satisfying this codeword condition is called a prefix code.By Shannon's Source Coding Theorem, for any prefix code on A with length function L, the expected L is bounded below by H(P ), the entropy of P .That is where equality holds if and only if L = − log 2 P , in other words, the expected code length is minimized when Q = P .

Crude two-part code MDL
In our view, modeling is a process that find the regularity of data and compress it.
In The other is to describe the data set with the help of chosen probability model in part one.Consequently, here we firstly introduce the most common implementation of the idea -the two part code version of MDL [67,61].
Suppose there is the data D ∈ X n where X = {0, 1}.Then there is a probability P ∈ M, and minimize Here, it will select a reasonable model for D to make good predictions of future data coming from the same source, which therefore models the data using the class B of all Markov chains [67].
The first part.To get a better feel for the code L 1 , we prepare to consider two examples.First, let P θ ∈ B (1) be some Bernoulli distribution with P θ (x = 1) = θ, and let ) and θ is equal to the frequency of 1 in D, the first part L 1 (D|P ) is given as where n j denotes the number of occurrences of symbol j in D. Let k = 2 γ , the γth-order Markov chain model is denoted by B (k) , it's defined as Where and Here γ = log k is the number of bits needed to encode the first γ outcomes in D.
The maximum likelihood parameters η[1|y] are equal to the conditional frequencies of 1 prefixed by y.
The second part.In order to describe a P ∈ B, we have to describe a pair (k, θ).
We encode the all parameter in the distribution model by firstly encoding k, which will use some prefix code C 2a , and then code θ with the help of the prefix code C 2a .
The resulting code C 2 (the code length of the first part) is then defined by Since Θ (k) = [0, 1] k is an uncountable infinite set, we would have to discretize it firstly.More precisely, we will restrict Θ (k) to some finite set Θ(k) d of parameters that can be described using a finite precision of d bits per parameter.Now that the number of elements of At the end, the crude two-part code MDL for Markov chain hypothesis selection is given by,

Code length guided model selection in causality analysis
As stated above, conventional GCA split the whole process into two stages, which were actually modeling endogenous information and exogenous information.Since MDL had a close relationship with information theory, causality analysis with MDL here will also be more convincing and suitable.Further the regression of endogenous information can be converted to code length, and relative effect of exogenous variables can be also quantified by the code length guided framework, which the whole model selection process for GCA is unified into same model space.
The following is that MDL principle guided model selection in causality analysis [36], and variable X N is given, And the parameter vector consists of data where σ 2 = ξ 0 is the variance-parameter of zero-mean Gaussian distribution model for t .And t = 1, • • •, m, which m can be anyone more than n to keep the solution determined, N is the data length.In order to describe x t , turn to Gaussian distribution for t , we arrive at Where RSS denotes the residual sum of squares corresponding to the estimation in the model.Clearly there is a Gaussian distribution model, applying the two part form of MDL, the total code length is given as Where δ is the precision, and it's optimal to choose 1/ √ N [39,68,61].Specially, |ξi| δ < 1 should be ignored.

Time-domain formulation
Combining with the above formula, we can proceed to causality analysis with the code length guided framework in the time domain.There are two time-series X N and Y N , then we take two different models A and B (Eq. ( 16) and Eq. ( 18)) to describe x t .
We have the autoregressive representation Bivariate regressive representations are given, and their contemporaneous covariance matrix is Finally, 1t and 2t have Gaussian distribution with mean 0 and unknown variance σ 2 , which denote the noise of time-series are fitting residual, so do η 1t and η 2t .Therefore, the distribution of residual terms t can be a standpoint to describe the model within MDL.At the same time, the code length of model A and B we obtained can be compared to identify the causal influence between x t and y t .In the whole process of model selection for GCA, we map the causality analysis into the unified code length guided framework.According to Eq.( 15), the code length of model A and model B can be given respectively.By the definition of Granger causality, the influence from Y to X is defined by our code length guided framework, Where L X denotes the code length of optimal model in Eq.( 16), and L X+Y denotes the code length of optimal model in Eq.( 18).If F Y →X > 0, it means causal influence from Y to X existed.Otherwise, there is no causal influence existed from Y to X.
As causality represent above, our proposal can unify two-stage scheme into the code length guided framework, which can avoid inconsistency of two different mathematical theories or the subjectivity of F-statistics in model selection of GCA.
To compare conditional GCA, we consider the influence from Y to X while controlling for the effect from conditional node Z to X. Firstly, the joint autoregressive representation is given The noise covariance matrix for model can be represented as Then the autoregressive representation involving three variables can be writen as follow and the contemporaneous covariance matrix is .
By the definition of conditional GCA, if Same as above, if Clearly, in our code length guided framework, code lengths of different models did not require repeated comparison.Different from the traditional method, the causal connection is obtained by repeated pairwise comparison between models, our method can map all selected models into the same model space without the repeating comparison and to obtain the conditional causal influence directly.Which is, if both F Y →X > 0 and F Z→X > 0 existed, Here if F Y,Z→X > 0 existed, it means that both Y and Z have direct influence on X.
But if F Y,Z→X is less than 0, there will be two cases.One is F Y,Z→X = (L X+Y − L X+Y +Z ) < 0 existed, it means only Y has direct influence on X.The other is In the unified model space, multiple selected models can be directly compared by code length, which can release the complexity of the algorithm.In this way, our proposal is more in line with Occam's razor, or the principle of parsimony.

Frequency-domain formulation
With help of Geweke's work [44], the total interdependence between two time series X t and Y t can be decomposed into three components: two directional causal influences due to their interaction patterns, and the instantaneous causality due to factors possibly exogenous to the (X, Y ) system (e.g. a common driving input) [46,33].Here we need other forms regressive representations, We first rewrite Eq.( 18) and Eq.( 19) where Performing Fourier transform on both sides of Eq.( 30) leads to Then we left-multiply on both sides of Eq.( 32) and rewrite the result equation, the normalized equations yield where H 2 (ω) = H 2 (ω) − Υ2 Σ2 E 2 (ω).The spectral matrix is given where * denotes complex conjugate and matrix transpose.The spectrum of X t is found to be where The first term in Eq.( 35) is represented as the intrinsic influence and the second term as the causal influence of X t due to Y t at frequency ω.Based on this transformation we had the causal influence from Y t to X t at frequency ω as The model orders of historical information are determined by AIC or BIC in frequency domain for conventional GCA.Distinct from the conventional GCA, we obtained the causal connectivities between nodes at frequency ω by our code length guided framework.3. Result

Synthetic data experiment
To verify the performance of MDL principle in synthetic data experiment, we have generated two data sets contained 3 nodes and 5 nodes, the relationships have shown in Fig. 2 and Fig. 6.There were 1000 data points in time series of each node, the initial value of each data set was 1 and was the noise terms to ensure the fidelity of data, thus the 3-node network and 5-node network were generated by Same as 3-node network, 5-node network was given, here it's mentional that the first two time lag of each node was generated randomly, Where the noise variance in Eq. ( 36)(Eq.( 37)) varied from 0.15(0.15) to 0.3(0.35).At the same time, to keep the stationarity of synthetic data, we removed the first 700 data points and only selected the last 300 data points for each node.

Time domain case
Firstly, we considered the causal connection between 3-node network in Fig.  to Eq. ( 15), then we have shown the causal connectivities obtained by conventional GCA in Fig. ??.We found that causal influences from node 1 to node 2 and node 3 were identified both in GCA and our proposal.In other four causal influences, our proposal almost guaranteed 99% accuracy.But conventional conditional GCA only guaranteed 95% accuracy.At the same time, we also verified the accuracy of causal connection in each node and the overall true model, the comparison between conventional GCA and our proposal showed in Table .1. Clearly, at different noise levels, our proposal showed its robustness.Whether it's identifying a single node connection or an overall network connection, our proposal guaranteed an accuracy of over 96.5%.On the other hand, the conventional GCA were very sensitive due to the different confidence intervals, especially in the identification of the overall network.But conventional GCA also had a stable performance at different noise level.Comparing the accuracy of the single node and the overall network at the same noise level and the same confidence interval, it's unevenly distributed in conventional GCA.In generally, our proposal showed a relatively good performance whether it's identification in the whole or a single edge.
It was worth noting that the results obtained by conventional GCA when α = 0.01 in F-statistics were close to the results of our proposal.The results were also in line with our expectations, that was because our proposal consider the complexity of the model more thoughtfully.As we emphasized above, there is only one mathematical principle guiding the model selection throughout the GCA process, thus our proposal is a more The variance of the low noise level data ranges from 1.5 to 3.5, and the moderate(high) level data ranges from 2.5 to 4.5(3.5-5.5). 1 − α denotes the significance level of F-test in GCA.The accuracy of node i(i=1,2,3) only measures the causal connectivities with node i, not including the connections between the other two nodes.And the total accuracy represents the accuracy that the true model is found, it means only quantifies the causal influence from node 1 to node 2 and node 3.
rigorous approach or a more robust approach.Alternatively we ranged the variance of noise from (1.5-0.35) to (0.35-0.55), the result causal networks were not changed, just same as Fig. 5.
To further verify the validity and robustness, our proposal analyzed causal connection between 5-node network and compared with conventional GCA in different values of α.The connections between nodes in Fig. 6 are more complicated.For example, node 4 and node 5 has direct influence on each other, and node 1 has indirect influence on node 3, 4 and 5, there is no influence on the contrary, and so on.More complicated network is a challenge for the robustness of methods.Same as Eq. ( 36), noise terms in Eq. ( 37) were given Guassian distribution with a mean of 0. To confirm the robustness, and just stated above, the noise variance in our simulation ranged from 0.15 to 0.5.The results of 20 relationships between 5 nodes have shown in Fig. 7(b), causal influence analyzed by code length was largely consistent with relationships in Fig. 6.Simultaneously, causalities analyzed by conventional GCA between 5 nodes have shown in Fig. 7(a).Same as results in 3-node network, our proposal showed 100% consistency with relationships in Fig. 6 between directly causal related nodes, seen in Table .2. And in other relationships in 5-node network, the accuracy of our proposal also was not below 98.4%.Whereas, conventional GCA did not showed the same robustness of our proposal.Causal connectivities between direct related nodes was not well identified, for example only 507 samples were identified as causal influences from node 3 to node 4 in 1000 samples and causal influence from node 5 to node 3 was identified at 92.9% accuracy.And in other relationships where there were no direct causal connectivities in 5-node network, the accuracy of conventional GCA was more poor than our proposal.
Meanwhile, in conditional causality analysis, our approach reduced the complexity of the algorithm, which did not need repeated pairwise comparison between models, and ensured the accuracy of the results.Obviously, when the target network was more complicated in the simulation, our proposal showed a more desirable property in time

Frequency domain case
The oscillations in neurophysiological systems and neuroscience data are thought to constrain and organize neural activity within and between functional networks across a wide range of temporal and spatial scales.Geweke-Granger causality demonstrated that the oscillations at specific frequencies had been associated with the activation or inactivation of different encephalic region.But for conventional GCA in frequency domain, model selection of history information was determined by AIC or BIC.For our code length guided framework, the selected models for history information will regress into the true model space automatically.Due to its intellectual roots in descriptive complexity and close tie with information theory, our proposal may be more capable to identified causality in frequency domain.Same as time domain, causal connectivities at frequency ω between 3-node network obtained by our proposal showed in Fig. 8.In particular, causal connectivities in frequency domain were obtained within two nodes, which meant that we did not introduce conditional GCA in the frequency domain.This is mainly because there seem to be still some obfuscation with the method of using conditional GCA concept in the frequency domain.Therefore, as shown in Fig. 8, some non-existent connections between nodes were often misjudged, except for the causal influence 1 → 2 and 1 → 3.But we found that causal connectivities between node 2 and node 3 had a bigger chance to be misjudged at low frequency(0-10Hz).Actually, comparing results obtained by conventional GCA in the time domain, causalities between two nodes were more legible in frequency domain, which meant only direct causalities showed more consistent results in our analysis.For example, there are only stable and significant causal influence existed in 1 → 2 and 1 → 3.
Subsequently, we have identified the causal connectivities among 5-node network in frequency domain by our proposal, seen in Fig. 9. Same as the 3-node network, causal connection networks have been identified without introducing conditional GCA, and the causal connection network had regular characteristics.The causal influence whether it was direct or indirect existed in 5-node network was more stable to be identified in the frequency domain.Similarly, since conditional GCA was not considered, other non-existent causal connectivities had a chance to be identified.And in 5-node network, the possibility of being misjudged was even greater.Therefore, it is necessary to introduce conditional GCA to distinguish direct from indirect influences between system components in the frequency domain.And at the same time, we found that removing the noise frequency component is an obstacle to causality analysis, main reason is that we have no prior knowledge about which one is noise or others in row data.
Thus noise term could be hard to eliminate in frequency domain.

Data preparing
In the study we let ten subjects perform simple one-digit(consisting of 1-10) serial addition (SSA) and complex two-digit(consisting of 1-5) serial addition (CSA) by visual stimulus and simultaneously measured their brain activities with fMRI.Additionally, we asked the subjects to perform same tasks by auditory stimulus, and then compared the results of two modalities.Nine right-handed healthy subjects (four female, 24±1.5 years old) and one left-handed healthy female subject (24 years old) participated.One of the subjects(a right hand male) data was deleted due to excessive head motion.All subjects volunteered to participate in this study with informal written consent by themselves.
The causal connection network of subjects performing same task should be same or similar, which is also the basic assumption in group analysis.Therefore, for mental calculation under different stimuli, causal connection network of same subject may be different in the input node of stimulus, but we have reason to assume that the causal connection should be same inside the mental calculation network, at least it should be similar.Thus, we will compare the similarity of causal connection network within the same mental calculation task under different stimuli.More directly, in order to quantify which method is more robust, we will compare the similarity of the result network within auditory and visual stimulus on each subject.To obtain the similarity between networks, we defined a measure method to quantify the similarity, Where A and B are the connected matrix of mental calculation networks.The numerator represents the sum of intersections of the same connected edges, and the denominator is the sum of the unions of the connected edges.

Mental calculation network
We already have verified the validity of MDL in simulation data, but behavior in real data will determine the truly robustness of the methods.Firstly, the activation regions of mental calculation under different stimuli showed in Fig. 10, and as we assumed above, the activation regions between two modes were identical.The similarities in 4-node connection network and 6-node full connection network have shown respectively, seen in Fig. 11.Firstly, in 4-node network, we found that causal connection networks of mental calculation between visual stimulus and auditory stimulus were very similar for most subjects, except for subject 6.There are seven of nine subjects the similarities were above 0.6.Turned to conventional GCA, we found that only three subjects of causal connection networks between the two stimuli have a similarity above 0.6.Meanwhile, only in subject 1 and subject 9, we found the similarity of causal networks between two modes was above our proposal.Even in subject 1 and subject 9, the similarities in our proposal were close to conventional GCA, especially in subject 1.Further the similarities in 6-node full connection network also have shown in Fig. 11.Duo to the difference between the input node of stimulus, the similarities in two methods were not above 0.5 mostly, but there were still 3 subjects the similarities in our proposal were mostly above 0.5, especially for subject 7. Clearly, whether inside connection network of mental calculation or among the 6-node full network, the similarities in most of our subjects identified by our proposal were more than conventional GCA.
Then, causal connection networks of two subjects in mental calculation have shown in Fig. 12 and Fig. 13 respectively.As stated above, the input node of the stimulus should not be included into the network to be compared, we removed the causal connection between stimuli nodes and other four nodes of mental calculation network.
Obviously, our proposal also had a desirable robustness in fMRI data.As seen in Fig. 12 and Fig. 13, for subject 2 and subject 7, the mental calculation networks under the two different stimuli were almost identical.And causal connection networks of two subjects above were identified by conventional conditional GCA, which showed in Fig. 12 and Fig. 13.Comparing the similarity between causal networks identified by our proposal, the result obtained by conventional GCA was more irregular.As consistent in the simulation, our proposal was more robust in identifying the causal connection network, especially in complex networks.

Discussion
As we have stated above, in conventional GCA, the AIC or BIC theory determines the regression model orders of historical information, while the causal effect of exoge- .We applied this method duo to its simplicity, and a more self-contained metric space is necessary.

Conclusion
The feasibility and efficacy of the proposed strategy has been demonstrated with simulated and measured fMRI data, compared with conventional GCA.Regardless of simple or complex network, our proposal can always identify the causal connection between nodes, which was more robust.However, our proposal seems need to be introduced the conditional GCA in frequency domain, that's because our current method cannot distinguish the direct connection from the indirect or the fake connections.Even though our proposal only had a high consistency with the true connection existed in network.Luckily, our proposal showed a good robustness in real data, except for misjudged connectivities in few sample.Noise term is one of the reason for the misjudgment.More we considered that the regions of interest(ROI) we picked in mental calculation are regions in our cerebrum, rather than nodes in our simulation.
In other words, regions contained a larger range, and it seemed more likely that there were hidden nodes between regions.But in the whole, our method showed a superior performance for identifying causal connectivities, and would be desirable to determine neural circles.
At most statistical work, modeling and estimation are about extracting information from the often chaotic looking data to learn what it is that makes the data behave the way they do.And the different information extracted from different models is the key to describe data set.So does the model selection process in GCA, thus we picked the MDL principle to guide model selection for GCA, which the MDL principle can provide a unified coed length guided framework to identify causal connection.
Unlike model selection in conventional GCA, we emphasized that a single mathematical theory should hold throughout GCA.Fundamentally, MDL has root in descriptive complexity theory of Kolmogorov and developed heavily with the help of Shannon's information theory.In general, our proposal can guarantee the consistency of mathematical theory, the objectivity of criteria and the parsimony of operation, which means it may be more likely to conform to the physiological model of brain work.Indeed, it's worth noting the effect of inconsistency of mathematical theories on experimental results in the process of data analysis and statistical modeling.

Figure 1 :
Figure 1: Consider a conceivable case involving a collection of three candidate models M = {A, B, C} with residual variances of S A = S, S B = S + ∆S, and S C = S − ∆S.Suppose I 2 ≤ ∆S < I, where I is the interval satisfying statistical significance.It is clear that model C, with the minimum variance, is the optimal model in this case.Following search path B → A → C (a) we can arrive at optimal model C, while if we start with A and follow the search path A → C → B (b), we reach suboptimal model A. The results using F -statistics for model selection rely on both the search path and the initial model.

Figure 2 :
Figure 2: The 3-node network have shown above, the direction of arrows represents the flow of information.

Figure 3 :
Figure 3: The fitting Guassian distribution of the Residual.(a): comparing the distribution of the data to the normal distribution.(b): a histogram of values in data and fits a normal density function.

Figure 4 :
Figure 4: Code length obtained by our proposal, changed with different time lag in node 1.
2. To ensure the rigor of our proposal, we verified the distribution of residual between the selected model and the observed data.It was almost corresponding Gaussian distribution with mean 0, seen in Fig.3(a) and 3(b), which guaranteed the validity for our initial description standpoint.And code length changed with different time lag in autoregressive model had shown in Fig.4, it dropped down to the minimum when time lag was 2, which corresponded with generated model in Eq.(36).Then, we validated the effectiveness of our proposal in a simple network, 3-node network in Fig.2.In order to ensure the rigor of the simulation, we verified simulation by 1000 samples.And the causal connectivities identified by our proposal have shown in Fig.??, which obtained the causality by comparing different code lengths according

Figure 5 :
Figure 5: The above showed results of nodes synthesized by low variance noise(0.15-0.35), the numbers on the arrow indicated the accuracy that the causal influence was identified in 1000 samples.Specially, the accuracy was only represent the probability that the single connection direction was identified by two methods.(a):causalnetwork obtained by conventional GCA.(b):causal network obtained by our proposal.

Figure 6 :Figure 7 :
Figure 6: The relationships of simulation data sets in 5-node network have shown above.

Figure 8 :
Figure 8: Causal connectivities between 3-node network obtained in frequency domain and in time domain.In frequency domain, 6 causal connectivities identified by our proposal have shown in (a).For time domain, causal connectivities obtained by conventional GCA showed in (b).Here we identified the causal influence at frequency ω range from 1∼ 30,and at 50,100Hz

Figure 9 :
Figure 9: Causal connectivities obtained by differential code length in frequency domain.20 causal connectivities in our 5-node network have shown above.Same as 3-node network, we identified the causal influence at frequency ω range from 1∼ 30,and at 50,100Hz

Figure 10 :
Figure 10: Mental calculation of CSA-control state under the two modes(visual stimulus and auditory stimulus), the activation regions were processed by SPM12, the control state meant that the sample was in rest state without mental calculation.(a): CSA-control state under visual stimulus.(b): CSA-control state under auditory stimulus.(P<0.0001, uncorrected)

Figure 11 :
Figure 11: The left panel showed the similarity in 4-node mental calculation network and 6-node full network respectively.The right panel showed the distribution of similarity in two models.

Figure 12 :
Figure 12: Causal connectivities of subject 2 in mental calculation network under two modes obtained by our proposal and conventional GCA respectively.Node 1 and 2 represented visual stimulus and auditory stimulus node respectively.Mental calculation related nodes are node 3, 4, 5 and 6, which constituted the mental calculation network stated in the text.

Figure 13 :
Figure 13: Causal connectivities of subject 8 in mental calculation network under two modes obtained by our proposal and conventional GCA respectively.Node 1 and 2 represented visual stimulus and auditory stimulus node respectively.Node 3, 4, 5 and 6 was mental calculation related nodes, which constituted the mental calculation network stated in the text.
nous information is established by estimating the residual between the two models in the model space to be selected by the statistical F-test.There may be several potential problems: (1)the BIC/AIC theory and the statistics F-test belong to different mathematical theories.There are different selection criteria under different mathematical frameworks, so they may select completely inconsistent model results under these two different benchmark.(2)The establishment of confidence intervals in statistical tests is highly subjective.In theory, the discovery and verification of scientific laws should choose different confidence intervals.For the different research field, the criteria of selecting confidence intervals are not the same.Unfortunately, there is a lack of similar criteria in neuroimaging studies, which are more random, and the choices of different confidence intervals are also very different.(3)Optimal model is selected by repeated pairwise comparison between models, and a larger candidate global model space need to be searched to avoid the dependence on the initial selection model and the search be different from the network similarity = 2 4 and may call d the precision used to encode a parameter.Letting dϑ be the smallest d such that θ ∈ d , this gives

Table 1 :
Comparison between conventional GCA and MDL method in 3-node network