Bayesian Network Structure Learning Approach Based on Searching Local Structure of Strongly Connected Components

Learning the structure of Bayesian networks is a challenging problem because it is a NP-Hard problem. As an excellent search & score based method, the K2 algorithm strongly depends on the input of global order of all nodes to ensure the result is a directed acyclic graph. And the K2 algorithm searches parents for each node from the nodes before it in the global order. Incorrect node order is likely to result in a wrong structure. In this paper, we propose a new method to avoid the global order by Local structure Searching for Strongly Connected Components with hill climbing method Twice. Firstly, we search the best parent nodes for each node from all the remaining nodes except itself, and form a global directed graph by concatenating the arcs between nodes and their parents. Then, the directed acyclic structure of all the strongly connected components are determined by hill climbing algorithm twice continuously. Finally, we adopt local search method further to get the final result by taking the previous result as a start point. The proposed algorithm is evaluated on several standard benchmark networks with sampled data. Experimental results show that our algorithm outperforms the four compared algorithms in terms of structural Hamming distance, Bayesian information criterion score and their average ranking.


I. INTRODUCTION
As one of the important methods of uncertainty reasoning in artificial intelligence, Bayesian network (BN) qualitatively represents the dependence between random variables through directed acyclic graph (DAG), and quantifies this dependencies with conditional probability distribution, which has strong interpretability. Bayesian networks are widely used in various fields, including knowledge representation and reasoning [1], clinical decision-making [2], financial modeling & analysis [3], risk prediction [4], gene analysis [5], industrial application [6], etc.
A Bayesian network, also known as belief network, is a graphical model that consists of two parts: network structure and network parameters. The structure G = {V , E} is a DAG, in which nodes V = {X 1 , X 2 , . . . , X n } represent The associate editor coordinating the review of this manuscript and approving it for publication was Bo Jin . random variables in the problem domain, and directed arcs correspond to the qualitative characterization of the relationship between nodes. Each node has a conditional probability distribution, namely network parameters, associated with it, which quantitatively depicts the dependence between random variable nodes and their parent nodes. The dependencies satisfy the first-order Markov property. And the following expression holds: where P(X i | Pa(X i )) is the conditional probability distribution between node X i and its parent node set Pa(X i ). Bayesian network structure learning is a NP hard problem [7] and remains a challenging problem, due to the exponential growth of the space of valid networks with respect to the number of nodes. This paper studies the structure learning method of discrete BNs under complete data. The data sets involved have no missing values. The existing approximate algorithms for BN structure learning from complete data include conditional independence (CI) test based method, score & search based method and the hybrid method.
K2 algorithm [8] and hill climbing (HC) algorithm [9] are classical and efficient score & search based methods. HC algorithm is simple and easy to implement, but it is easy to fall into local optimum. K2 algorithm is an efficient algorithm based on greedy strategy, which has very excellent learning performance. However, K2 algorithm requires the maximum number of parent nodes and depends heavily on the order of nodes. Incorrect node order will result in a very poor result. Moreover, the impact of node order on result structure is much greater than the maximum number of parent nodes. A large number of literatures [10]- [12] have been devoted to improve this drawback and achieved certain effect. In 2020, Behjati and Beigy [10] proposed a competitive method with good result in structure score, which regards strongly connected component (SCC) as a contracted node, learns the local order of SCC nodes by exact learning method with recursive strategy, and then forms the global order, which is used as the input of K2 algorithm. In essence, the aim of this approach is to build a better order of nodes for K2 algorithm. However, it has the following two disadvantages. Firstly, in extreme situation, if the global directed graph contains only one SCC with all the nodes in dataset, then the recursive call of the algorithm is invalid and will be an endless loop. Secondly, the topological order of DAG is not unique, and the number of topological orders increases with the number of nodes. However, only one order of nodes is selected in the algorithm.
The main contribution of this paper is to provide a Bayesian network structure learning approach from data based on maximum weighted spanning tree, bidirectional greedy search and SCC local structure search. First, we estimate the maximum number of parent nodes roughly and construct the initial network structure with directed graph. Then, we extract all the SCCs and search the local DAG structure of each SCC with HC algorithm twice. A global DAG can be formed by concatenating all the local DAGs. Finally, this global DAG, as a starting point, is passed to a local search method to learn the structure of Bayesian network. The algorithm proposed can avoid the global order of nodes.
The rest of the paper is organized as follows. Section 2 presents a brief review of Bayesian network structure learning methods. Section 3 devotes for introducing some preliminaries about the proposed approach. Then, in Section 4, the proposed algorithm is described in detail. And Section 5 is dedicated to the experimental evaluation. Finally, we conclude in Section 6.

II. BAYESIAN NETWORK STRUCTURE LEARNING
Bayesian network structure learning refers to combining the prior knowledge as much as possible to obtain the optimal topological structure, including exact algorithms and approximate algorithms. The existing approximate learning methods can be divided into three categories: CI test based, score & search based and hybrid methods. Exact learning algorithms can guarantee the global optimal solution in theory, but it is complex and inefficient. Approximate learning algorithms trade the loss of accuracy for computational efficiency, and aim to obtain a suboptimal network structure in an acceptable computing time.

A. EXACT METHODS
Exact methods contain integer linear programming (ILP) methods [13], [14], A* algorithm [15], [16] and dynamic programming (DP) methods [17], [18]. Exact methods have excellent learning performance, but its efficiency is very low, so it is difficult to be applied to networks with a large number of nodes.

1) CI TEST BASED METHODS
The CI test based methods [19] regard BN as a network model representing the relationship of independent variables. Usually, statistical or information theory methods are used to analyze the dependency relationship between variables by calculating the mutual information and conditional independence between nodes, and finally a network structure is got in line with them. Typical representatives of such methods are Shrink-Grow-Shrink (SGS) algorithm [20] and Peter-Clark (PC) algorithm [9]. This kind of method is more intuitive and effective, but it requires a large number of samples to determine the independence, resulting in unreliable CI test results. Moreover, the number of CI test times grows rapidly in nodes count.

2) SCORE & SEARCH BASED METHODS
The score & search based methods regard structure learning as an optimization problem, whose goal is to search the network structure with highest score. A score & search based algorithm can be formulated as follows: given a dataset D = d 1 , d 2 , . . . , d m of cases, find a DAG G * such that where is the space of DAGs defined on node set V = {X 1 , X 2 , . . . , X n }. From the perspective of algorithm, it is necessary to determine the appropriate search strategy and the scoring function to measure the fitting degree to dataset. The number of DAGs grows super-exponentially with the number of nodes. It has been proved that this is a NP-Hard problem [7]. Heuristic search methods are usually used to solve this problem which is easy to fall into the local optimum. The search for the network structure can be carried out in the DAG space, the equivalence class space, or the ordering space.
Search on the DAG space is the most direct mode. Each candidate DAG is assigned a network score reflecting its VOLUME 10, 2022 goodness of fit, which the algorithm then attempts to maximize. Some examples are heuristics, such as hill climbing, genetic algorithms [21], whale optimization strategy [22], differential evolution algorithm [23] and Particle Swarm Optimization (PSO) [24]. K2 algorithm is a classical one of this kind with outstanding performance but needing pre-provided order of nodes.
In the search method based on equivalence class space, the points in the search space are no longer a single BN structure, but a class of structures with equivalence relationship. Two examples of this category are Greedy Equivalent Search (GES) [25] and Iterated Local Greedy Equivalent Search (ILGES) [26].
Ordering based search methods [27], [28] search in the space of node orders and construct BN structure based on the obtained result order. The fundamental observation of this kind of methods is that, finding the highest score structure is not NP-Hard given an ordering on the variables in the network.

3) HYBRID METHODS
Hybrid methods [29], [30] integrate the advantages of CI test based methods and score & search based methods for solving the structure learning problem, which reduce the size of structure space by CI test, and then adopt score & search methods further to obtain the optimal result.

III. PRELIMINARIES
In this section, we introduce several fundamental concepts about the proposed approach.

A. MUTUAL INFORMATION
l Given random variables X and Y , the mutual information between them is defined as: where p(x, y) is the joint probability of x and y, p(x) and p(y) are the marginal probability distribution of X and Y respectively, H (X ) is the entropy of X , and H (X | Y ) is the conditional entropy of X given Y . Mutual information is a metric of the degree of interdependence between two random variables.

B. MAXIMUM WEIGHTED SPANNING TREE (MWST)
The spanning tree of a connected graph with N nodes is the minimal connected subgraph of the original graph, which contains all the N nodes, and has the least edges to keep the graph connected. The weight of a tree refers to the sum of the weights of each edge in the tree. The maximum weighted spanning tree is the spanning tree with maximal weight.

C. STRONGLY CONNECTED COMPONENT
In a directed graph, two vertices X and Y are said to be strongly connected, if there exists a path from X to Y , and exists a path from Y to X meanwhile. The directed graph is strongly connected if each pair of vertices in the directed graph is strongly connected. A strongly connected component of a directed graph G = {V , E} is a maximal set of vertices C ⊆ V such that for every pair of vertices u and v in C, vertices u and v are strongly connected.

D. HILL CLIMBING ALGORITHM
Starting from an initial network structure, the hill climbing algorithm modifies the current network structure through three search operators (add one edge, delete one edge and reverse one edge) to obtain neighborhood points, and selects the network structure from the neighborhood with the highest score as the starting point of the next iteration until the local optimal network structure is obtained. The reason why hill climbing algorithm is easy to fall into local optimum lies in the selection of initial network structure. A good initial structure will lead to a better result.

IV. THE PROPOSED APPROACH
As mentioned above in section 1, the method proposed in [10], denoted as SCC-Order-K2, has two drawbacks. Firstly, in extreme situation, if the global directed graph contains only one SCC with all the nodes in dataset, then the recursive calling of the algorithm is invalid and will be an endless loop. Secondly, the topological order of a DAG is not unique, and the number of topological orders increases with the number of nodes. However, only one order of nodes is selected in the algorithm. In this section, to overcome them, we propose a novel approach based on local structure search for strongly connected components. In the following, we describe how the method proposed in this paper works.

A. METHOD DESCRIPTION
The method proposed is mainly based on the following ideas: 1) If a good initial value is provided for HC algorithm, the global optimal network structure can be obtained.
In this paper, we are committed to finding a good initial value of HC algorithm to obtain the (approximate) optimal solution of Bayesian network structure learning problem.
2) It is difficult to learn the global Bayesian network structure directly, for which the divide and conquer strategy can be adopted. We can learn the local structures first, then fuse and revise them, and finally get the global network structure. Similar to the calculation of local scores, each node and its parent nodes form a local network structure. Greedy search method is an intuitive and feasible method to search the set of parent nodes.
3) The fused global network structure may not meet the characteristic of directed acyclic, that is, there exists one or more SCCs. The global network structure can be revised step by step. We can first revise each SCC to a directed acyclic graph. Here, we use the HC algorithm twice to make full use of the results of the previous steps. Local structure modification is also an optimization problem, but the scale of it is smaller. 4) The maximum number of parent nodes is an important information for many structure learning algorithms such as K2. If it is known, the amount of computation can be greatly reduced, and a more accurate structure with fewer wrong directed edges can be obtained. MWST is an undirected graph representation of the association between nodes which has the minimum connection weight to keep the graph connected. The maximum degree of nodes in MWST can be used as a rough estimate of the maximum number of parent nodes, although this estimate is not optimal. Mutual information is a good metric for the dependent relation of random variables. Taking mutual information as the weight of edge, we can construct a maximum weighted spanning tree T for all the nodes in V , which is an undirected graph. For each node in T , there must be at least one node connected to it. The number of nodes connected to node X , is defined as the degree of it. And the maximum degree of a nodes in T can be used as the upper limit of the number of parent nodes. As shown in Fig. 1(a), we give an example of estimating the maximal number of parent nodes with a 4 nodes network. Fig. 1(b) shows the mutual information of each pair of nodes and Fig. 1(b) is its corresponding maximum weighted spanning tree. From Fig. 1(b) we can get the following information: degree(A) = 3, degree(B) = degree(C) = degree(D) = 1. Therefore, the rough estimate value of maximum number of parents is max ([3, 1, 1, 1 In K2 algorithm, the parents of each node are selected from the previous nodes of it in the order through greedy search. To relax this restriction of global order, we can use greedy strategy to select the best parent node set from all the other nodes. And then, a global directed graph G 0 is formed by concatenating all the edges. However, it is likely to be no longer a DAG which, therefore, must be revised to meet the limitation of directed acyclic.
Following the approach in SCC-Order-K2 algorithm, we extract all the SCCs in the global directed graph G 0 . If each SCC is regarded as a contracted node, the global directed graph G 0 is a DAG in a broad sense, called SCC graph. A directed graph is acyclic if and only if it has no strongly connected subgraphs with more than one vertex. Therefore, we only need to revise the local structure of each SCC to get a local directed acyclic subgraph. Instead of learning the order of nodes in each SCC, we learn the local directed acyclic structure of SCC directly by HC algorithm twice, because the edges in SCC contain some information which can be fully utilized. If we convert each directed edge in SCC to undirected one, the resulting undirected graph is a skeleton of SCC. Therefore, when using HC algorithm for the first time, this skeleton is used as the white list. And its result can be deemed as a point close to the global optimum for local structure. Then, we take it as the start DAG and adopt HC algorithm once again.
If we revise each SCC to a local DAG, and then expand the previous corresponding contracted node with the revision result, then the whole network will be a DAG. After local structures of all SCCs are determined, a global DAG G can be obtained, which is a point close to the global optimum for global structure. Then, we employ a local search method further such as HC algorithm to get the final result of BN structure with G as the initial point.
The proposed structure learning method for Bayesian networks, denoted as SCC-LST (SCC Local structure Searching with hill climbing method Twice), includes the following steps.
1) Compute the mutual information matrix.
2) Take mutual information as weight of edges, create the maximum weighted spanning tree, and compute the maximal number of parent nodes, denoted as µ.

3) Compute parents for each node with bidirectional
greedy search under the limitation of µ, and form a global directed graph G 0 . 4) Search all the strongly connected components of G 0 and construct G SCC by keeping the directed edge between SCCs and its outside. 5) Determine the local directed acyclic structure for each SCC by HC algorithm twice, and form the global DAG G. 6) Adopt a local search method such as HC algorithm by taking G as the initial point further to get the final result. The differences between our method SCC-LST and the SCC-Order-K2 in [10] are as follows. (1) We employ the bidirectional greedy search to construct the initial directed graph network with the maximal number of parents obtained through maximum weighted spanning tree, while the Sparse Parent Graph Algorithm [16] is adopted in SCC-Order-K2 algorithm. (2) We learn the local directed acyclic structure of SCC by using HC algorithm twice, while the SCC-Order-K2 algorithm only learns the nodes ordering in SCC by exact learning algorithm recursively. (3) The SCC-Order-K2 algorithm regards SCC as a contracted node to form a DAG in a broad sense and obtain the global node order. However, the topological order of a DAG is not unique. SCC-Order-K2 algorithm does not select the best order among all the orders, which fails to make full use of the results of the previous steps. Our method directly constructs the local directed acyclic structure of SCC to avoid this problem. (4) In SCC-Order-K2 algorithm, the final result is obtained by K2 algorithm taking the global order of nodes as an input value. While, in SCC-LST method, we get the final result by local search algorithm taking the global DAG structure as input instead of global order. (5) In extreme situation, there is only one SCC that contains all nodes. In this case, our method degenerates to a simple method using HC algorithm twice, while SCC-Order-K2 algorithm is invalid and falls into endless loop. Fig. 2 shows the flow chart of our method. And the pseudo-code of the proposed algorithm is shown in Algorithm 1.

Algorithm 1 SCC-LST Algorithm
Input: Dataset D with m cases on nodes set V = {V 1 , V 2 , . . . , V n }. Output: a Bayesian network structure G * . 1. compute the mutual information matrix. 2. compute the maximum weighted spanning tree. 3. µ ← max 1≤i≤n deg(X i ) /* maximal number of parents */ 4. initialize variable edges as an empty set. 5. for i ← 1 to n do 6. Pa update edges with directed edges (X j , X i ) for all X j ∈ Pa(X i ) 8. end 9. construct a global directed graph G 0 from edges. 10. search strongly connected components of G 0 : C 1 , C 2 , . . . , C T . 11. construct G SCC from G 0 . 12 Based on the Bayesian network CHILD and its sampled data, we give a demonstration example of our method. For the sake of space, we put it in the Appendix. In the SCC-LST method proposed in this paper, the two steps of ''finding candidate parents (sub-net construction)'' and ''SCC revision'' are naturally suitable for parallel processing. In addition, on the one hand, due to the previous estimation of the maximum number of parent nodes, it provides very favorable information for the sub-net construction step, which can be utilized to save a lot of computing time. On the other hand, the number of nodes contained in each SCC will not exceed the number of nodes in the whole network, so its revision time is relatively less. Moreover, HC algorithm itself is a very fast method. Therefore, it can be anticipated that our method can be scaled to massive networks with hundreds and even thousands of nodes.

B. TIME AND SPACE COMPLEXITY
In this subsection, we briefly analyse the time and space complexity of our method.
The mutual information matrix is symmetric, and the HC algorithm is a local search method, which can find a reasonable solution in a certain space with constant spatial complexity. For matrix related operations, the spatial complexity is O(n 2 ). Therefore, the space complexity of our  algorithm SCC-LST is a polynomial expression about the number of nodes.

V. EXPERIMENT EVALUATION
In this section, we aim to empirically evaluate the proposed method and compare it with 5 of the most popular algorithms: HC algorithm, PC-HC algorithm, K2 algorithm with random order (K2-RND), MWST-K2 algorithm and the OBS algorithm. PC-HC algorithm first obtains the skeleton of network via PC algorithm which is an undirected graph, and then orients the undirected edges with HC algorithm to obtain the final DAG structure. MWST-K2 algorithm adopts K2 algorithm with the order got from MWST. While the OBS algorithm is an ordering based search method by hill climbing search with tabu lists and random restart [27].
Our experiment is executed in the environment of PyCharm 2021.2.2 (Community Edition) with Python 3.9.7, on the platform of win 10 (64 bits) with Intel Core i9-10900 CPU @ 2.80GHz and 64G RAM.

A. DATASETS
The benchmark datasets include 9 Bayesian networks taken from the Bayesian network repository taken from the bnlearn R package [31]. We selected 3 small size networks (no more than 20 nodes), 3 medium size networks ( 20 -50 nodes) and 3 large size networks (50 -100 nodes). Table 1 shows the detailed information upon each network including the number of nodes, the number of directed edges (i.e. arcs), the number of maximal parent nodes, and the type of network. And for each network, we independently sample 10 datasets with 500, 5000 and 10000 cases respectively.

B. PERFORMANCE MEASURES
We evaluate the performance of our SCC-LST method in terms of structural Hamming distance (SHD) and the Bayesian information criterion (BIC) score of the result structure on the dataset. The SHD is sum of number of the correct edges, the additional edges, the missing edges and the VOLUME 10, 2022  reverse edges compared with the original network structure. The smaller the SHD, the better, indicating that the obtained network structure is closer to the original network structure. On the other hand, the larger the BIC the better, which shows that the learned network structure can fit the data set better. In addition, we give the average ranking of the algorithms on SHD and BIC score.

C. EXPERIMENTAL RESULTS
In the following, we present the experimental results and compare them. Tables 2, 3 and 4 give the average SHD and BIC score of 10 datasets on each network with sample size 500, 5000 and 10000 respectively. The results highlighted in bold are the best one for the corresponding network. By a quick look at the results reported in the 3 tables, we can draw the conclusion that the proposed SCC-LST method outperforms the other algorithms, especially with respect to BIC score. Table 2 shows the comparison result of the 6 methods with 500 samples. Of the nine networks, our algorithm has the best SHD on six of them, especially in the 3 medium networks. The SHD of our method in the three remaining networks is also very close to the best comparison algorithm. In BIC score, our algorithm has more advantages. It is only slightly inferior to HC algorithm on ALARM and WIN95PTS networks.   Table 3 gives the comparison result about 5000 sampled data. Also, it shows the good performance of our algorithm. With respect to SHD, our algorithm is slightly better than PC-HC algorithm, but it is obviously better than the other 4 algorithms. Moreover, in terms of BIC score, our algorithm is not as good as HC algorithm only on the ASIA network. And we get the best BIC score in all the other 8 networks.
From the comparison result of 10000 samples in Table 4, we can get that our algorithm does not have much advantage in SHD, and its performance is equivalent to that of the PC-HC algorithm. However, our algorithm is still significantly better than the other 5 algorithms in BIC score.
In order to further illustrate the performance of our algorithm, we give the average ranking of six algorithms on SHD and BIC in Table 5. And the results show that the average ranking of our SCC-LST method is always the highest.

VI. CONCLUSION
In this paper, based on searching local structure of strongly connected compo-nents, we proposed a simple algorithm SCC-LST for learning Bayesian network structure from data that can avoid the global order of nodes. The proposed algorithm estimates the maximal number of parents through maximum weighted spanning tree firstly. And then the initial global directed graph is constructed by searching parents for each node from the remaining nodes except itself. After that, we search the local directed acyclic structure of strongly connected components by hill climbing search twice. Finally, a further local search process is used to get the final DAG as the structure of Bayesian network, taking the previous result as input.
We evaluated the proposed method on 9 benchmark networks by sampled datasets, in terms of SHD and BIC score and its average ranking. Experimental results showed that the proposed method outperforms the five compared algorithms.
However, two research issues remain. The quality of the parent nodes searched by bidirectional greedy search has a great impact on the performance of the proposed method. Also, when determining the local directed structure of SCC, only the internal nodes and directed edges of SCC are used, and its dependence on the outside is not fully utilized. In future, we will develop new approaches to overcome these drawbacks.
Bayesian network is one of the important methods of knowledge representation and reasoning in uncertainty    problem, which has a wide range of applications, especially in the medical field. In the near future, we intend to apply Bayesian network to the actual scene of medical and clinical diagnosis.

APPENDIX DEMONSTRATION EXAMPLE OF OUR METHOD
Based on the Bayesian network CHILD, we give a demonstration example to further illustrate the procedure of our method SCC-LST, as shown in Figs. 3 to 11. The CHILD network is a medium Bayesian network with 20 nodes and 25 directed edges. The maximum number of parent nodes is 2. In this demonstration example, we learn the Bayesian network structure from 5000 sampled data of the original network. Fig. 3 shows the original true structure of the CHILD network. Fig. 4 is the maximum weighted spanning tree based on mutual information, from which we can get that the maximum number of parent nodes is 6 (pay attention to node: Disease). Fig. 5 gives the fusion result graph G 0 of each node and its parents. There exist three SCCs in Fig. 5: SCC-CO2, SCC-Sick and SCC-Disease. SCC-CO2 contains 2 node: CO2 and CO2Report. SCC-Sick contains 8 nodes: Sick, Age, Grunting, GruntingReport, LungParench, LungFlow, ChestXray and XrayReport. While SCC-Disease contains 9 nodes: RUQO2, HypoxiaInO2, CardiacMixing, HypDistrib, LowerBodyO2, DuctFlow, Disease, LVH and LVHReport. The SCC graph G SCC is shown in Fig. 6. There is no intersection between any two of the three SCCs. And the node BirthAsphyxia is not in any SCC. The three SCCs and their revised directed acyclic structures are shown in Figs. 7, 8 and 9 respectively. The global DAG G after all the SCCs are revised is presented in Fig. 10. And Fig. 11 gives the final learned structure of CHILD, in which the black arrows represent the completely correct directed edges while the bold red ones are the reversed edges. There are only two reversed directed edges: (Disease, BirthAsphyxia) and (CardiacMixing, Disease). In this demonstration example, there is no missing and additional edge.