Causal Structure Learning With One-Dimensional Convolutional Neural Networks

Causal structure discovery plays an important guiding role in explainable artificial intelligence. In order to discover causal relationships from observed data and recover causal structure graphs, we propose Directed Acyclic Graph structure learning with Causal Convolutional Neural Networks (DAG-CCNN). First, we adopt a nonlinear structural causal model (SCM) generation mechanism and propose an integrated neural network model that combines a fully-connected (FC) layer and a one-dimensional convolutional (1D-Conv) layer. Then, a new algebraic representation of acyclicity is proposed, together with a theorem and its proof. We use the eigenvalues of the weighted adjacency matrix, instead of the Hadamard product of the adjacency matrix, to represent the acyclicity of the graph, which avoids the computational complexity associated with matrix multiplication. Finally, compared with DAG-GNN, NOTEARS and GraN-DAG, the experimental results show that DAG-CCNN performs favorably on both synthetic data sets and the real Sachs data set, and that the recovered causal structure graphs are more satisfactory.


I. INTRODUCTION
Causal structure learning is an emerging topic in machine learning and artificial intelligence, and it has been argued that the study of causality plays a crucial role in reasoning about the world. Causal discovery algorithms aim to recover the true directed acyclic graph from observational data and thereby infer causal relationships between variables, which is of great importance in the medical field [4], the field of business optimization decisions [3], the biological field [10], etc. In recent years, causal discovery algorithms that exploit the powerful nonlinear representation capabilities of deep learning have been proposed, launching a wave of improvements to causal discovery algorithms in academia [25].
The earliest traditional method was the randomized controlled trial, which relies on interventional data and is therefore difficult to conduct due to high costs and ethical problems. Judea Pearl then proposed a theoretical basis for discovering causal relationships from observational data [7]. Subsequently, algorithms in three general directions have been proposed: constraint-based methods, causal function model methods, and search-and-score methods. Constraint-based methods, represented by the Peter-Clark (PC) algorithm and the Inductive Causation (IC) algorithm, determine the causal structure graph by means of conditional independence tests and orientation rules, but the graph obtained is only one member of a Markov equivalence class, so the unique directed acyclic graph (DAG) cannot be determined. Methods based on causal function models start from the causal mechanism of data generation and aim to determine the causal direction. This approach is mainly represented by the linear non-Gaussian acyclic model (LiNGAM) [20], the additive noise model (ANM) [6] and the post-nonlinear model (PNL) [11]. However, these models have strong limitations and are not universal, as they determine the causal direction based on the directional asymmetry of a particular model. The last family is search-and-score methods, mainly represented by the greedy equivalence search (GES) algorithm [2]. By defining a scoring function, one can search for the highest-scoring causal directed acyclic graph over the entire variable space. Obviously, the complexity of the search space grows super-exponentially as the number of nodes increases, making the problem NP-hard, so such methods can only be used to recover low-dimensional causal structure graphs [13]. These challenges affect the discovery and estimation of causal relationships.
In recent years, a leap forward has been achieved by a causal discovery algorithm named NOTEARS [23]. It gives a continuous representation of the acyclicity constraint and successfully converts the combinatorial optimization problem into a continuous constrained optimization problem, cleverly avoiding the super-exponential complexity faced by traditional algorithms. Although this is a great breakthrough, the algorithm can only be applied to linear SCMs. Recently, new machine-learning-based algorithms have been proposed to extend from linear SCMs to nonlinear SCMs, such as DAG-GNN [26], GraN-DAG [18], DAG-VAE [1], DAG-RL [21], and so on. These algorithms face the difficulty of training deep artificial neural networks and high equipment requirements, and they satisfy the training requirements only by adding mandatory assumptions. Moreover, training deep neural networks requires a sufficiently large training set to achieve reasonable generalization capability. In addition, the computation of the Hadamard product of the weighted adjacency matrix is onerous in these algorithms. Therefore, the above methods still face difficulties in estimating causal structure graphs for high-dimensional sample data.
In this paper, we propose Directed Acyclic Graph structure learning with Causal Convolutional Neural Networks (DAG-CCNN). The method first builds an integrated neural network model combining a one-dimensional convolutional (1D-Conv) layer [22] and a fully-connected (FC) layer [5]; then a new representation of acyclicity is given. In our model, the 1D-Conv layer performs only linear one-dimensional convolutions (two operations: scalar multiplication and addition) and automatically extracts features, which makes a real-time and low-cost hardware implementation feasible. In addition, we improve the acyclicity characterization: a new characterization based on the eigenvalues of the adjacency matrix is constructed, which greatly reduces the complexity caused by the matrix multiplications required when building the acyclicity constraint from Hadamard products of the adjacency matrix, as in previous methods. Finally, experiments show that the proposed method is more effective on both the data sets synthesized in this paper and the real Sachs data set.
The remainder of the paper is organized as follows. Section II introduces background knowledge about causal discovery. Section III details the model and the methods that constitute our proposal. Section IV optimizes the parameters of the proposed model using the augmented Lagrange multiplier method. The experimental results are described in Section V. Finally, conclusions are drawn in Section VI.

II. BACKGROUND

A. STRUCTURAL CAUSAL MODEL (SCM)
If all edges are directed and there are no cycles, we have the well-known class of DAGs. The basic DAG structure learning problem can be summarized through the algebraic representation of an SCM, which is formulated as follows. Let X = (X_1, X_2, . . . , X_d) be a matrix of random vectors and E = (E_1, E_2, . . . , E_d) be a noise set following the distribution P(ε). If the causal mechanism f_i is linear, the noise obeys a non-Gaussian distribution (e.g., a uniform or Laplace distribution); if the causal mechanism f_i is nonlinear, the noise obeys a Gaussian distribution. Let G = (X, E) be a DAG. Then the SCM is defined as

X_i = f_i(X_pa_i, E_i), E_i ⊥ E_j (i ≠ j), i = 1, . . . , d,

where ''⊥'' represents that the variables are independent of each other, X_pa_i ⊂ (X_1, X_2, . . . , X_d) is the set of parent nodes of X_i, and f_i is a causal mechanism. In addition, X_j ∈ X_pa_i represents that there exists a direct causal relation from X_j to X_i, written X_j → X_i, i.e., there exists a directed edge from X_j to X_i. Fig. 1 shows an SCM based on a DAG.

FIGURE 1. Take nodes X_1, X_2, X_3 as an example. The DAG formed by these three nodes is shown on the left, and the corresponding SCM on the right, where f_1 is used instead of f_X_1 to facilitate memory.

FIGURE 2. A simple example of topological sorting. In the first step, 2 has no parent, so we cut off 2. In the second step, 1 has no parent, so we cut off 1. In the third step, 0 has no parent, so we cut off 0, and finally we get 3 and 4, neither of which has a parent, so we get two sorts: ''2, 1, 0, 3, 4'' or ''2, 1, 0, 4, 3.''
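As a concrete illustration, an SCM like the one in Fig. 1 can be simulated directly. The sketch below uses a three-node chain X_1 → X_2 → X_3; the specific mechanisms (tanh, a square) are hypothetical choices for illustration, not the ones used later in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Hypothetical nonlinear mechanisms for the chain X1 -> X2 -> X3 with
# independent Gaussian noise (the tanh and square are illustrative only).
E = rng.normal(size=(n, 3))           # noise set E = (E1, E2, E3)
X1 = E[:, 0]                          # root node: X1 = f1(E1)
X2 = np.tanh(X1) + E[:, 1]            # X2 = f2(X_pa2, E2), pa(2) = {1}
X3 = 0.5 * X2 ** 2 + E[:, 2]          # X3 = f3(X_pa3, E3), pa(3) = {2}
X = np.column_stack([X1, X2, X3])     # observed data: one row per sample
print(X.shape)                        # (1000, 3)
```

A causal discovery algorithm sees only the matrix X and must recover the graph X_1 → X_2 → X_3.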

B. CAUSAL MARKOV ASSUMPTION AND TOPOLOGICAL SORTING
Let X = (X_1, X_2, . . . , X_d) denote a set of continuous random variables with joint distribution P(X), and let G be a DAG. We assume that P can be factorized along G as

P(X) = ∏_{i=1}^{d} P(X_i | X_pa_i),

in which case the model is said to obey the causal Markov assumption. A topological sorting of a DAG is a linear ordering of the vertices such that for every directed edge X_j → X_i, vertex X_j comes before X_i in the ordering. The first vertex in a topological sorting is always a vertex with in-degree 0 (a vertex with no incoming edges). Fig. 2 gives a simple illustration, where a true topological sorting is ''2, 1, 0, 3, 4'' or ''2, 1, 0, 4, 3''. It can be seen that the topological sorting is not unique.
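The cut-off procedure described for Fig. 2 is Kahn's algorithm: repeatedly remove a vertex with in-degree 0. A minimal sketch (the edge set below is one graph consistent with the description of Fig. 2, assumed for illustration):

```python
from collections import deque

def topological_sort(num_nodes, edges):
    """Kahn's algorithm: repeatedly remove a vertex with in-degree 0."""
    indeg = [0] * num_nodes
    children = {v: [] for v in range(num_nodes)}
    for u, v in edges:                 # directed edge u -> v
        children[u].append(v)
        indeg[v] += 1
    queue = deque(v for v in range(num_nodes) if indeg[v] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in children[u]:          # "cut off" u: decrement children's in-degree
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    if len(order) != num_nodes:        # a directed cycle blocks some vertices
        raise ValueError("graph contains a directed cycle")
    return order

# Assumed edges consistent with Fig. 2: 2->1, 1->0, 0->3, 0->4.
print(topological_sort(5, [(2, 1), (1, 0), (0, 3), (0, 4)]))  # -> [2, 1, 0, 3, 4]
```

The other valid sorting ''2, 1, 0, 4, 3'' would be produced if node 4 were enqueued before node 3, confirming that topological sortings are not unique.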

C. CAUSAL IDENTIFIABILITY ASSUMPTION
When the generation mechanism f_j is a nonlinear function, the joint distribution function P(X) no longer uniquely satisfies the faithfulness hypothesis. That is, only a set of Markov-equivalent directed acyclic graphs can be obtained, and no unique causal structure graph can be determined. To date, several special mechanisms have been proposed that satisfy the identifiability hypothesis, for example, the linear Gaussian case with equal error variances [8], the linear non-Gaussian ANM [14] and the nonlinear Gaussian ANM [9]. In the ANM, structural assignments have the form

X_j = f_j(X_pa_j) + E_j, j = 1, . . . , d,

where f_j is a nonlinear function and E_j is additive Gaussian noise. It has been shown that the model has full structure identifiability under this mechanism. In this paper, we consider the nonlinear Gaussian case; that is, we construct special neural networks to fit the nonlinearity requirement on f_j and the Gaussianity requirement on the fitted noise E_j.

III. DAG STRUCTURE LEARNING (DAG-CCNN)
In this section, we define our method DAG-CCNN. In A, an integrated neural network model based on a 1D-Conv layer and an FC layer is proposed. In B, the basic framework of the 1D-Conv layer is given. In C, the theorem constructing acyclicity from eigenvalues is stated and proved.

A. AN INTEGRATED NETWORK MODEL BASED ON 1D-CONV LAYER AND FC LAYER
The DAG-CCNN method adopts the nonlinear SCM generation mechanism and learns the causal structure encoded in the weighted adjacency matrix by using a 1D-Conv layer together with an FC layer. The overall model can be expressed as

X_j = g_j(f_j(X_{-j}, E_j)), j = 1, . . . , d, (1)

where each f_j is the function fitted by the FC model and each g_j is the function fitted by the 1D-Conv layer. f_j is mainly used for learning the structure of the DAG, whereas g_j is mainly used for feature extraction and for further enhancing the nonlinear fitting capability. X_{-j} represents the d-dimensional input vector in which variable X_j is given weight 0, and Conv1d_{j,H} represents the 1D-Conv layer used to fit variable X_j. FC_j : R^d → R^{d×m_1} is an affine linear map. In this paper, (W_j, b_j) denotes the set of weight matrices and bias vectors of the causal mechanism f_j in the neural network model, so that f_j can be converted into the parameter vector θ_j for the optimization solution below [24].
In the FC-layer model above, the mechanism f_j is used to fit X̂_j = f_j(X_{-j}, E_j) and outputs a 3-dimensional tensor X̂_j. Next, a 1D-Conv layer based on a CNN is constructed: we use X̂_j ∈ R^{(n,d,m_1)} as the input of the 1D-Conv layer and output a d-dimensional vector. Fig. 3 shows the model flow chart; the 1D filter kernels have sizes 1, 2, . . . , d.

B. 1D-CONV LAYER
In the literature, although many studies have developed detection algorithms for causal structure graphs, their results are usually limited to relatively small training/testing data sets. Compared with causal discovery algorithms based on artificial neural networks or graph neural networks, the DAG-CCNN approach proposes an integrated model of an FC layer and a 1D-Conv layer, where the main idea of the 1D-Conv layer is to encapsulate feature extraction and feature selection/classification into separate modules and to feed the vector data consisting of each feature column of the original adjacency matrix directly as input, so that the best features can be learned efficiently through proper training.
As a classic structure of deep learning, CNN has made great developments and extensive applications in image recognition, object detection, face recognition, natural language processing, etc. The 1D-Conv layer is often used to process text and sequence data. The convolution operation of 1D-Conv is shown in Fig. 4.
In Fig. 4, we use a 1D-Conv with three layers. The kth neuron in the l-th hidden layer first performs a sequence of convolution operations; the sum is then activated by the activation function σ, and subsampling operations are subsequently performed. In this paper, the 1D-Conv takes as input the tensor output by the FC layer shown in Fig. 3 and thus learns to extract the features that serve the regression task at hand. Both feature extraction and regression are therefore fused into a single process that can be optimized to maximize regression performance. Moreover, since the only expensive operation is the one-dimensional convolution sequence, which is nothing more than a linearly weighted sum of two one-dimensional arrays, the main advantage of one-dimensional networks is preserved: fitting has low computational complexity. Such linear operations can be performed efficiently during both forward and backward propagation.
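The "linearly weighted sum of two one-dimensional arrays" can be made concrete with a minimal NumPy sketch of one valid-mode 1D convolution pass (as in CNN practice, this is cross-correlation, i.e., the kernel is not flipped):

```python
import numpy as np

def conv1d_valid(x, kernel):
    """One 'valid' 1-D convolution pass: every output element is just a
    linearly weighted sum, i.e., scalar multiplications and additions."""
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel)
                     for i in range(len(x) - k + 1)])

x = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([0.5, -1.0])
y = conv1d_valid(x, w)
print(y)   # [-1.5 -2.  -2.5], e.g. 1*0.5 + 2*(-1) = -1.5
```

In the actual layer this sum would be passed through the activation σ and then subsampled; the raw operation itself involves no multiplication more expensive than a scalar product.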
The main advantages of the 1D-Conv based approach are 1) its compact architectural configuration (instead of a complex deep architecture), which includes only one-dimensional convolutions, making it suitable for real-time directed-and-acyclic detection; 2) its cost-effective and practical real-time hardware implementation; 3) its ability to work without any pre-determined transformation, hand-crafted feature extraction or feature selection; and 4) its ability to provide effective training with a limited training data set and a limited number of BP iterations.

FIGURE 3. The integrated function model of the causal mechanism X_j = g_j(f_j(X_{-j}, E_j)).

FIGURE 4. Three consecutive hidden CNN layers of a 1D-Conv with X_j = conv1d_j(X_j) [19].

C. A NEW ACYCLIC REPRESENTATION BASED ON EIGENVALUES
Constructing a directed acyclic graph involves two requirements: the graph must be directed and it must be acyclic, i.e., contain no directed loops. How to obtain a suitable representation of acyclicity has long been a concern of scholars working on causal discovery. Zheng et al. [23] were probably the first to represent the acyclicity of directed graphs through the trace of a matrix exponential, which transforms a discrete combinatorial optimization problem into an equivalent continuous optimization problem that can be solved with numerical methods. However, products between matrices should be avoided in numerical algorithms. In NOTEARS, the Hadamard product of adjacency matrices leads to a computational complexity as high as O(d^3).
In this paper, we use the exponential form of the sum of all eigenvalues of the adjacency matrix to represent the acyclicity of the directed graph, in place of the trace of the matrix exponential of the Hadamard product of the adjacency matrix. This avoids the matrix product operation; the new acyclicity representation is closer to O(d) and trains between 5 and 15 times faster than NOTEARS. The method significantly reduces the computational complexity from O(d^3) to O(d), where d is the number of nodes of the graph.
Proposition 1: Let A ∈ R^{d×d} be the (possibly negatively) weighted adjacency matrix of a directed graph with eigenvalues λ_1, λ_2, . . . , λ_d. Then the following equation holds:

tr(e^A) = Σ_{i=1}^{d} e^{λ_i}.

Theorem 1: Let A ∈ R^{d×d} be the (possibly negatively) weighted adjacency matrix of a directed graph with eigenvalues λ_1, λ_2, . . . , λ_d. Then the graph is acyclic if and only if

Σ_{i=1}^{d} e^{λ_i} = d.

Proof: The argument boils down to the fact that tr(A^k) counts the (weighted) number of closed walks of length k in the directed graph. Clearly, an acyclic graph has tr(A^k) = 0 for all k = 1, . . . , ∞. Then, using the Taylor expansion of the matrix exponential,

tr(e^A) = Σ_{k=0}^{∞} tr(A^k)/k! = tr(I) = d,

which by Proposition 1 equals Σ_{i=1}^{d} e^{λ_i}. Conversely, if the graph contains a directed cycle, some tr(A^k) is nonzero and the sum deviates from d.
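Theorem 1 is easy to check numerically: for a DAG all eigenvalues of the weighted adjacency matrix are zero (the matrix is nilpotent), so the sum of their exponentials is exactly d. A minimal sketch with an assumed 3-node example:

```python
import numpy as np

def h_eig(A):
    """Acyclicity measure from Theorem 1: sum(exp(eigenvalues)) - d.
    Zero iff the weighted directed graph of A is acyclic. With possibly
    negative weights the deviation can have either sign, so check |h|."""
    eigvals = np.linalg.eigvals(A)
    return float(np.sum(np.exp(eigvals)).real) - A.shape[0]

# Strictly upper-triangular matrix: the graph 0 -> 1 -> 2 is a DAG.
acyclic = np.array([[0.0, 1.5, 0.0],
                    [0.0, 0.0, 2.0],
                    [0.0, 0.0, 0.0]])
cyclic = acyclic.copy()
cyclic[2, 0] = 0.7                     # adds the directed cycle 0 -> 1 -> 2 -> 0

print(h_eig(acyclic))                  # ~ 0.0
print(h_eig(cyclic))                   # clearly nonzero
```

Note that computing eigenvalues replaces the repeated Hadamard/matrix products of the trace-based characterization with a single eigendecomposition of A.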

IV. OPTIMIZATION WITH AUGMENTED LAGRANGE MULTIPLIER METHOD
Theorem 1 establishes a smooth, algebraic, computable characterization of acyclicity. With this acyclicity representation of the directed graph in hand, we solve a continuous optimization problem so as to train the parameters of the integrated FC-layer and 1D-Conv-layer model. The constrained optimization problem with a least-squares loss function can be formulated as

min_f (1/2n) Σ_{j=1}^{d} ||X_j − g_j(f_j(X_{-j}))||_2^2 subject to G(f) ∈ DAGs,

where g_j is fitted by the 1D-Conv layer and each f_j lies in the Sobolev space of square-integrable functions whose derivatives are also square integrable. Let the partial derivative of f_j with respect to X_k be denoted by ∂f_j/∂X_k. It is easy to show that f_j is independent of X_k if and only if ||∂f_j/∂X_k||_{L^2} = 0. Let ||·||_p denote the L_p-norm of a vector, ||·||_{L^p} the L^p-norm of a function, and ||·||_{p,q} the (p, q)-norm of a matrix. We then define the weighted adjacency matrix W(f) ∈ R^{d×d} with entries

[W(f)]_{kj} = ||∂f_j/∂X_k||_{L^2},

where [W(f)]_{kj} = 0 is equivalent to the kth column of the first-layer weight matrix of f_j being zero. The corresponding non-parametric acyclicity constraint is therefore

h(W(θ)) = Σ_{i=1}^{d} e^{λ_i(W(θ))} − d = 0.

Next, L_1 regularization is used to constrain the adjacency matrix of the directed graph; that is, a sparse regular term with penalty weight μ_1 ≥ 0 is added to the scoring function. To prevent overfitting, L_2 regularization with weight μ_2 ≥ 0, i.e., the term μ_2 ||θ||_{2,2}^2, is added to all parameters in the 1D-Conv learning process, and it is observed that adding the L_2 regularization yields a better solution. The constrained optimization problem then becomes

min_θ (1/2n) Σ_j ||X_j − g_j(f_j(X_{-j}))||_2^2 + μ_1 ||θ||_1 + μ_2 ||θ||_{2,2}^2 subject to h(W(θ)) = 0.

As in Zheng et al. [23], the standard machinery of the augmented Lagrangian can be applied, resulting in a series of unconstrained problems

F_ρ(θ, α) = (1/2n) Σ_j ||X_j − g_j(f_j(X_{-j}))||_2^2 + μ_1 ||θ||_1 + μ_2 ||θ||_{2,2}^2 + (ρ/2) h(W(θ))^2 + α h(W(θ)),

where θ = (θ_1, . . . , θ_d), ρ is a penalty parameter and α is the Lagrange multiplier. We choose the L-BFGS-B algorithm [17]; to handle the non-smooth L_1 term, we let f(·) = f^+(·) − f^−(·), so θ = θ^+ − θ^− can replace θ in (8), where ξ_1 and ξ_2 are auxiliary parameters introduced during training.
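The augmented-Lagrangian machinery above can be illustrated on a small toy problem. The sketch below replaces the paper's loss and acyclicity constraint with an assumed quadratic objective and a linear constraint, and stands in plain gradient descent for L-BFGS-B, purely to show the alternation between minimizing F_ρ and updating the multiplier and penalty:

```python
import numpy as np

# Toy problem (NOT the paper's loss):
#   minimize f(x) = (x0 - 1)^2 + (x1 - 2)^2
#   subject to h(x) = x0 + x1 - 1 = 0.
f_grad = lambda x: 2.0 * (x - np.array([1.0, 2.0]))
h = lambda x: x[0] + x[1] - 1.0
h_grad = np.array([1.0, 1.0])

x, alpha, rho = np.zeros(2), 0.0, 1.0
for _ in range(20):                        # outer augmented-Lagrangian iterations
    lr = 1.0 / (2.0 + 2.0 * rho)           # safe step size for this quadratic
    for _ in range(500):                   # inner minimization of F_rho(x, alpha)
        g = f_grad(x) + (rho * h(x) + alpha) * h_grad
        x = x - lr * g                     # gradient descent stands in for L-BFGS-B
    alpha += rho * h(x)                    # dual (multiplier) update
    rho *= 1.5                             # gradually increase the penalty

print(x)   # converges to the constrained optimum (0, 1)
```

As ρ grows and α approaches the optimal multiplier, the constraint violation h(x) is driven to zero; in DAG-CCNN the same loop drives the acyclicity measure h(W(θ)) to zero.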
The update rules are as follows:

θ_{t+1}, ξ_{1,t+1}, ξ_{2,t+1} = arg min_θ F_{ρ_t}(θ_t, ξ_{1,t}, ξ_{2,t}, α_t),
α_{t+1} = α_t + ρ_t h(W(θ_{t+1})),

where t denotes the t-th iteration.

TABLE 2. Comparison of different methods on nonlinear SCMs generated from Gaussian processes (GP) with unit independent Gaussian noise. The lower the better for SHD and the higher the better for TPR (n = 1000).

TABLE 3.
Comparison on nonlinear SCMs generated from ANM with FC and Gaussian processes (GP) with unit independent Gaussian noise. The lower the better for SHD and the higher the better for TPR (n = 1000, d = 100).

V. EXPERIMENT
In this section, we present a comprehensive set of experiments to demonstrate the validity of the proposed DAG-CCNN method. In Section A, we describe the baselines and evaluation metrics used to compare the methods. In Section B, synthetic data sets are generated by two nonlinear data generation mechanisms, namely the ANM with FC mechanism and the nonlinear Gaussian ANM mechanism, and DAG-CCNN is compared with DAG-GNN [26], NOTEARS [23] and GraN-DAG [18] on these synthetic data sets. To further illustrate the usefulness of the proposed approach, in Section C we apply DAG-CCNN to the Sachs protein data set to discover the consensus protein signaling network. Our implementation is based on PyCharm. In extracting the DAG, we used a threshold of 0.3, as suggested by Zheng et al. [23]. As mentioned above, DAG-CCNN includes one FC layer and two 1D-Conv layers.

A. BASELINES AND METRICS
The proposed DAG-CCNN method is compared with three recently proposed deep-learning-based DAG structure learning methods: DAG-GNN [26], NOTEARS [23], and GraN-DAG [18]. True Positive Rate (TPR) and Structural Hamming Distance (SHD) are commonly used criteria for evaluating the difference between the learned graph and the true graph. SHD is the minimum number of edge additions (E), deletions (D), and reversals (R) required to convert the estimated graph into the true DAG; a lower SHD indicates a better estimate. TPR represents the rate of correctly identified edges, involving the true positive edges (TP) and the true edges (T), so a higher TPR indicates a better estimate. It is expressed by the formula

TPR = TP / T.
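Both metrics can be computed directly from binary adjacency matrices. A minimal sketch (the SHD variant below counts a reversed edge as one operation, consistent with the definition above):

```python
import numpy as np

def tpr(est, true):
    """TPR = TP / T: correctly directed estimated edges over true edges."""
    tp = np.sum((est == 1) & (true == 1))
    return tp / np.sum(true == 1)

def shd(est, true):
    """Minimum additions, deletions and reversals turning `est` into `true`;
    a reversed edge flips two entries but counts as one operation."""
    d = est.shape[0]
    count = 0
    for i in range(d):
        for j in range(i + 1, d):
            # compare the edge status of the unordered pair {i, j}
            if (est[i, j], est[j, i]) != (true[i, j], true[j, i]):
                count += 1
    return count

# True graph: 0->1, 1->2, 0->2.  Estimate: 0->1 (correct), 2->1 (reversed),
# and the edge 0->2 is missing.
T = np.array([[0, 1, 1], [0, 0, 1], [0, 0, 0]])
G = np.array([[0, 1, 0], [0, 0, 0], [0, 1, 0]])
print(tpr(G, T))   # 1/3: one of three true edges has the correct direction
print(shd(G, T))   # 2: one reversal plus one deletion
```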

B. SYNTHETIC DATA SETS
We consider graphs with 10, 40 and 100 nodes and sample sizes n ∈ {200, 1000}. Two graph sampling schemes are used: Erdos-Renyi (ER) and Scale-Free (SF). The expected node degrees for both ER and SF graphs are 1, 2 and 4, and ERx denotes an ER graph with expected node degree x; the expected number of generated edges is then x · d, where d is the number of nodes. The same applies to SFx. We then synthesize the data sets according to two data generation schemes. The first data generation scheme is the ANM with FC (ANM-FC) [15], where the data are sampled as

X_j = W_{2j} σ(W_{1j} X_pa_j) + E_j,

where σ represents the sigmoid activation function used to encode nonlinear expressions, X_pa_j represents the input variables of the neural network, W_{1j} is the weight of the first hidden layer, and W_{2j} is the weight of the second hidden layer. The second data generation scheme is the nonlinear Gaussian ANM (Gauss-ANM) [12], where the data are sampled as

X_j = f_j(X_pa_j) + E_j,

where each function f_j is sampled from a Gaussian process (GP) with a radial basis function (RBF) kernel of bandwidth 1, the noise variables E_j are independently and normally distributed with uniformly chosen variance, and X_pa_j corresponds to the direct parents of X_j. Both the ANM with FC scheme and the nonlinear Gaussian ANM scheme are identifiable and thus sufficient for recovering the causal graph structure. Table 1 and Table 2 show summaries for n = 1000 and d ∈ {10, 40}; Table 3 shows summaries for n = 1000 and d = 100.
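The ER sampling and the ANM-FC mechanism can be sketched as follows; the weight ranges and hidden width are assumptions for illustration, not the paper's exact settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def er_dag(d, x):
    """ERx scheme: about x*d edges in expectation, oriented into a DAG by
    keeping only the upper triangle (node order 0..d-1 is then topological)."""
    p = 2.0 * x / (d - 1)                    # so E[#edges] = p * d(d-1)/2 = x*d
    return np.triu(rng.random((d, d)) < p, k=1).astype(int)

def sample_anm_fc(A, n, hidden=16):
    """Sketch of ANM-FC: X_j = W2_j . sigmoid(W1_j X_paj) + E_j."""
    d = A.shape[0]
    X = np.zeros((n, d))
    for j in range(d):                       # columns visited in topological order
        pa = np.flatnonzero(A[:, j])         # direct parents of X_j
        e = rng.normal(size=n)               # unit Gaussian additive noise
        if pa.size == 0:
            X[:, j] = e
        else:
            W1 = rng.uniform(0.5, 2.0, size=(pa.size, hidden))
            W2 = rng.uniform(0.5, 2.0, size=hidden)
            X[:, j] = (1.0 / (1.0 + np.exp(-X[:, pa] @ W1))) @ W2 + e
    return X

A = er_dag(10, 2)                            # an ER2 graph with 10 nodes
X = sample_anm_fc(A, n=1000)
print(A.sum(), X.shape)                      # ~20 edges; data of shape (1000, 10)
```

A real benchmark would additionally relabel the nodes with a random permutation so the topological order is not given away by the column index; relabeling does not affect acyclicity.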
Taking SHD as the criterion, DAG-CCNN is the best. As we can see from Table 1, for ''ER2 with 10 nodes'', ''SF2 with 10 nodes'', ''ER2 with 40 nodes'', and ''SF2 with 40 nodes'', DAG-CCNN obtains the smallest SHD values. As we can see from Table 2, for ''SF2 with 10 nodes'', ''ER2 with 40 nodes'', and ''SF2 with 40 nodes'', DAG-CCNN also obtains the smallest SHD values. As we can see from Table 3, the SHD for ''ER2 with 100 nodes'' and ''SF2 with 100 nodes'' is larger than for 10 and 40 nodes, which is inevitable: as the node dimension grows, the causal structure graph becomes more complex and the combinatorial space larger, making the SHD larger. However, in the overall trend DAG-CCNN is still the best performer, with the smallest SHD among the four methods, indicating that the remaining three methods are less effective at recovering causal structure graphs composed of high-dimensional nodes.
Taking TPR as the criterion, DAG-CCNN is also competitive. As we can see from Table 1, for ''ER2 with 40 nodes'' and ''SF2 with 40 nodes'', DAG-CCNN obtains the largest TPR values, and for ''ER2 with 10 nodes'' and ''SF2 with 10 nodes'', the TPR of DAG-CCNN is close to the largest one. As we can see from Table 2, for ''ER2 with 10 nodes'', ''SF2 with 10 nodes'', and ''SF2 with 40 nodes'', DAG-CCNN obtains the largest TPR values. As we can see from Table 3, as the node dimension grows, the TPR for ''ER2 with 100 nodes'' and ''SF2 with 100 nodes'' is not as high as for 10 and 40 nodes; nevertheless, in the overall trend DAG-CCNN still performs the best, with the largest TPR among the four methods, indicating that our method is effective in dealing with high-dimensional causal structure graphs.
Overall, the proposed DAG-CCNN method obtained the best SHD (the lower the better) and TPR (the higher the better) on the simulated data sets generated by the two data generation mechanisms described above. With 10 nodes, the GraN-DAG method also has a low SHD and high TPR, but as the number of nodes grows to 40 and 100, it clearly incurs a correspondingly higher SHD, indicating that training gradient-based neural networks becomes more challenging as the number of nodes grows. Apart from that, the NOTEARS method does not perform well, which can be partly explained by the fact that it applies only to linear generation mechanisms, while the synthetic data sets in this paper are modeled with nonlinear mechanisms. Similarly, the poor performance of the DAG-GNN method is perhaps attributable to its use of a strong form of parameter sharing in the neural network training of the f_j functions of the model, which is not justified under the two specific sampling schemes considered. Fig. 5 shows the average values of SHD and TPR for a sample size of n = 200. To make the results more intuitive, we use box plots from the Seaborn library to show the average of all results under each mechanism. Not surprisingly, the performance of DAG-CCNN remains stable across different sample sizes n (ranging from 200 to 1000), as it performs well even when the sample size is n = 200. DAG-CCNN achieves better TPR and lower SHD under both the ANM with FC and the nonlinear Gaussian ANM settings. The detailed experimental results are given in Tables 5, 6 and 7 in the Appendix.

TABLE 5. Comparison of different methods on nonlinear SCMs generated from the ANM with FC. The lower the better for SHD and the higher the better for TPR (n = 200).

TABLE 6. Comparison of different methods on nonlinear SCMs generated from Gaussian processes (GP) with unit independent Gaussian noise. The lower the better for SHD and the higher the better for TPR (n = 200).

TABLE 7. Comparison on nonlinear SCMs generated from ANM with FC and Gaussian processes (GP) with unit independent Gaussian noise. The lower the better for SHD and the higher the better for TPR (n = 200, d = 100).

C. REAL DATA SET
We compared DAG-CCNN with the three other methods on the real data set provided by Sachs et al. [10]. This data set comprises continuous measurements of protein and phospholipid expression levels in human immune system cells (n = 7466, d = 11). It has been widely used in the study of graphical models, and its experimental annotations are widely accepted by the biological research community. The true causal graph, with 11 nodes and 17 edges, is as given by Sachs et al. [10].
On this data set, DAG-CCNN achieved the best SHD of 12, whereas GraN-DAG had an SHD of 13. DAG-GNN and NOTEARS were close to GraN-DAG in terms of TPR, but they had higher SHDs: the SHDs of DAG-GNN and NOTEARS are 16 and 19, respectively. 10-fold cross-validation was performed for each algorithm; the average results are reported in Table 4, and the causal protein network obtained by DAG-CCNN is shown in Fig. 6.

VI. CONCLUSION
Maximizing the recovery of causal structure graphs and finding causal relationships between variables will help solve many difficult problems in real life, which requires us to continuously improve and innovate the algorithms. In this study, a causal structure learning method based on an integrated network model of an FC layer and a 1D-Conv layer is proposed. First, the adjacency matrix of the directed graph is obtained by training the FC model; then the learning capability is further enhanced by the 1D-Conv layer to obtain the directed graph. In addition, the method gives a new characterization of the acyclicity of the directed graph using the exponential sum of the eigenvalues of the adjacency matrix, thus avoiding the computational complexity associated with products of adjacency matrices.
In future studies, we will explore causal structure learning methods for high-dimensional data samples. On the one hand, we will improve the accuracy of identifying causal directions and further weaken the causality assumptions; on the other hand, we will reduce the computational complexity of the neural networks, so as to achieve accurate discovery of causal structure graphs for high-dimensional data.

APPENDIX
Following the experimental section above, the results obtained under the different causal generation mechanisms for sample size n = 200 are detailed in Tables 5, 6 and 7. With n = 200, the results are consistent with the overall trend at n = 1000. The difference is that with a smaller sample size the experimental results vary more and there is some chance bias. What remains the same is that, as the number of nodes increases, all four methods recover the causal structure graph less effectively than with low-dimensional sample nodes. However, DAG-CCNN clearly performs the best among these methods, with a higher TPR and a lower SHD in comparison.