Rapidly Learning Bayesian Networks for Complex System Diagnosis: A Reinforcement Learning Directed Greedy Search Approach

Bayesian networks are a popular diagnosis method, whose structures are usually defined by human experts and whose parameters are learned from data. With the increasing complexity of modern systems, building their structures based on physical behaviors is becoming a difficult task. However, improvements in data collection techniques motivate learning their structures from data, where greedy search is a typical iterative method. In each iteration, it generates multiple structure candidates by modifying one edge, evaluates these structures with a data-based score, and selects the best structure for the next iteration. This method is costly because too many structures have to be evaluated. To solve this problem, we frame the traditional greedy search as a Markov decision process and propose an efficient Bayesian network learning approach by integrating reinforcement learning into it. In our approach, a convolutional neural network is employed as the value function to approximate scores. Before structures are evaluated using data, the neural network is used to predict their scores, and structure candidates with a low predicted score are discarded. By avoiding unnecessary computation, the cooperation of reinforcement learning and greedy search effectively improves the learning efficiency. Two systems, a 10-tank system with 21 monitored variables and the classic Tennessee Eastman process with 52 variables, are employed to demonstrate our approach. The experimental results indicate that our method reduced the computation cost by 30%~50% while keeping almost the same diagnosis accuracy.


I. INTRODUCTION
With the development of sensor and data storage techniques, data-driven diagnosis (DDD) [1]-[6] has received much attention recently. As one of the most popular data-driven methods, Bayesian networks (BNs) are widely applied to fault diagnosis [7]-[12].
Cai et al. [7] employed BNs to fuse multi-source information, increasing the diagnosis accuracy of a ground-source heat pump (GSHP) system. They established the BN structure according to the cause-and-effect sequence of faults and symptoms, and studied the parameters using the Noisy-OR and Noisy-MAX models. By combining two proposed BNs, their multi-source information fusion based fault diagnosis model greatly increased the diagnostic accuracy.
Mack et al. [8] learned BN structures to augment an aircraft diagnostic reference model, which was a bipartite graph between condition indicators (CIs) and faults. They aimed to augment the original reference model by analyzing historical data. Tree augmented naïve Bayesian networks (TANs) were learned to derive diagnostic relations over CIs in the system. Then the reference model was updated by analyzing the learned TAN. Their method is useful not only for finding additional relations and monitors to augment subsystem reference models, but also for providing significant indicators to knowledge engineers and system experts.
The associate editor coordinating the review of this manuscript and approving it for publication was Youqing Wang.
Codetta-Raiteri and Portinale [9] exploited the modeling features and inference capabilities of dynamic BNs (DBNs) in fault detection, identification, and recovery (FDIR) for autonomous spacecraft. An FDIR cycle, which includes the tasks of diagnosis, prognosis and recovery, was introduced and characterized through a DBN model. Simulated problems showed that the DBN method was able to properly detect and deal with faults.
Zhao et al. [10] proposed an intelligent fault detection and diagnosis (FDD) strategy based on BNs for chillers. Their BN was composed of three layers: causal factors in layer 1, faults in layer 2 and fault symptoms in layer 3. In the diagnosis of refrigerant flow air conditioning systems, Hu et al. [12] developed a three-layer BN model as well: faults in the first layer, features in the second layer, and additional information in the third layer. In both studies, posterior probabilities of faults given the measurements were calculated for fault diagnosis. Compared with other fault diagnosis approaches, the two methods are able to utilize both system information and expert knowledge. A broader review can be found in the work of Cai et al. [11].
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/
In most applications, including the approaches discussed above, the structures of BNs are established by human experts and the parameters are learned from data. However, building structures for complex systems is intractable because the number of possible structures grows exponentially with the number of variables [13].
TAN is one of the methods that learn both the structure and the parameters. However, it only gives a tree structure, which is oversimplified in complex cases. Learning Bayesian networks in a more general way is necessary for complex system diagnosis. Typically, Bayesian structure learning approaches can be classified into two categories: constraint-based and score-based methods [13]. The former builds the network skeleton based on independence tests and then orients the skeleton. The latter, greedy search (GS) for instance [13], employs a score to evaluate networks based on data and utilizes optimization algorithms to achieve a good structure. Because testing all possible conditional independences is intractable, most current research focuses on score-based approaches.
Mohammadi and Wit [14] proposed a birth-death Markov chain Monte Carlo (MCMC) method in Bayesian structure learning for sparse Gaussian graphical models. O'Gorman et al. [15] adopted quantum annealing in the search process. Haffa et al. [16] used regular vines to search sparse models. Tsamardinos et al. [17] combined constraint-based and score-based methods, where the skeleton was first obtained by a constraint-based method and further optimized by a score-based method. Instead of a traditional search method, Bartlett and Cussens [18] utilized integer linear programming (ILP) to solve the NP-hard optimization problem.
All the discussed methods suffer from efficiency problems. Most of them can only make use of the current best solution(s). For instance, GS only modifies one edge of the current best solution to generate structure candidates in each iteration [13]. However, all evaluations in the search/optimization process contain structure information about the system. Extracting local structure information from the evaluation history can help to focus on the most likely structures. In this work, we implement this idea by reinforcement learning (RL) [19]-[21]. The score of a structure is defined as the total reward in RL, which is approximated by a convolutional neural network (CNN). Before the scores of new structures are computed from data in each iteration, the value function is employed to estimate the scores first. Only the structures with the top estimated scores are further evaluated by data. In turn, the evaluated values are fed back to train the CNN. The combination of GS and RL significantly improves the learning efficiency.
The rest of the paper is organized as follows. Section 2 gives some basic concepts on BNs, the greedy search based BN learning algorithm, and reinforcement learning. Our method is elaborated in Section 3, including how to frame GS as a Markov decision process (MDP), the design of the value function and the details of the structure learning. We demonstrate it with a 10-tank system in Section 4. Section 5 analyzes and discusses our approach. Conclusions are drawn in the final section.

II. BACKGROUND
As probabilistic graphical models (PGMs), BNs are used to structurally represent complex probability distributions. This section introduces Bayesian network diagnosers, the greedy search based learning method and reinforcement learning.

A. BAYESIAN NETWORK DIAGNOSERS
Definition 1: Given a variable set V, a Bayesian network is composed of two parts (G, P), where G is the network structure represented by a directed acyclic graph (DAG) over all the variables in V, and P is the set of parameters of the conditional probability distributions (CPDs). In most cases, the CPDs are represented by conditional probability tables (CPTs). Continuous variables, if present, have to be discretized [22].

B. GREEDY SEARCH BASED BAYESIAN NETWORK LEARNING
Score-based BN learning methods need a score to evaluate BNs. This paper adopts the Bayesian information criterion (BIC) [13], [23], as shown in (1).
$$\mathrm{score}_{BIC}(G : D) = ll(\hat{\theta} : D) - \frac{\log M}{2}\,\mathrm{Dim}[G] \tag{1}$$
where ll(θ̂ : D) is the log-likelihood of G with respect to the historical data D, Dim[G] represents the structural complexity and M is the size of the historical data. The log-likelihood is given by (2), and Dim[G], which is the number of independent parameters in G, is obtained by (3) and (4) [23].
$$ll(\hat{\theta} : D) = M \sum_{i=1}^{n} I_P(v_i \mid Pa_i) - M \sum_{i=1}^{n} H_P(v_i) \tag{2}$$
$$\mathrm{Dim}[G] = \sum_{i=1}^{n} \mathrm{Dim}[v_i] \tag{3}$$
$$\mathrm{Dim}[v_i] = (|v_i| - 1) \prod_{v_j \in Pa(v_i)} |v_j| \tag{4}$$
where n is the number of variables, Pa_i = Pa(v_i) is the parent set of v_i, I_P(v_i | Pa_i) is the mutual information between v_i and Pa_i, H_P(v_i) is the entropy of v_i, and |v_i| is the number of values of variable v_i.
The pseudo code of greedy search is given in Algorithm 1. Starting from an initial structure G_0, which is usually an empty graph, the algorithm iterates k times to find a good solution. In each iteration, the algorithm checks every pair of nodes. If there is no edge between the two nodes, an edge is added in each of the two directions to obtain new structures. If there is an edge, it is deleted or reversed to obtain new structures. All the new structures are evaluated, and the best one is selected for the next iteration. The best structure in the final iteration is returned by the algorithm.
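Algorithm 1 Greedy Search
Input: initial structure G_0, historical data D, number of iterations k
repeat k times:
    G_set ← {G_0}
    for any node pair (i, j): add, delete or reverse edge i-j in G_0, insert new acyclic structures into G_set
    compute BIC scores for all G in G_set
    G_0 ← G with the maximal score in G_set
return G_0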
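The BIC score in (1) can be computed directly from counts. The following is a minimal sketch, assuming fully observed discrete data, maximum-likelihood CPTs, natural logarithms, and variable cardinalities taken from the values that actually occur in the data. The function name `bic_score` and the `parents` layout are illustrative choices, not the authors' implementation.

```python
import math
from collections import Counter


def bic_score(data, parents):
    """BIC of a discrete Bayesian network.

    data:    list of tuples, one discrete value per variable
    parents: parents[i] is a tuple with the parent indices of variable i
    """
    M = len(data)
    n = len(parents)
    # |v_i|: number of distinct values observed for variable i (assumption).
    card = [len({row[i] for row in data}) for i in range(n)]
    loglik = 0.0
    dim = 0
    for i, pa in enumerate(parents):
        # Counts of (parent configuration, value) and of the parent configuration.
        joint = Counter((tuple(row[j] for j in pa), row[i]) for row in data)
        marginal = Counter(tuple(row[j] for j in pa) for row in data)
        # Maximum-likelihood log-likelihood: sum of N_ijk * log(N_ijk / N_ij).
        loglik += sum(c * math.log(c / marginal[cfg]) for (cfg, _), c in joint.items())
        # Independent parameters of the CPT of variable i.
        dim += (card[i] - 1) * math.prod(card[j] for j in pa)
    return loglik - 0.5 * math.log(M) * dim


# Toy data in which the second variable always copies the first one.
data = [(0, 0)] * 4 + [(1, 1)] * 4
score_empty = bic_score(data, [(), ()])
score_edge = bic_score(data, [(), (0,)])
```

On this toy data the structure with the edge 0 → 1 scores higher than the empty graph, because the gain in likelihood outweighs the penalty for the extra parameter.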

C. REINFORCEMENT LEARNING
Reinforcement learning [19]-[21] is the area of machine learning that deals with sequential decision-making. As shown in Figure 1, at time-step t the agent makes a decision a_t ∈ A based on observation ω_t. The environment state transitions from s_t to s_{t+1}, and the environment outputs observation ω_{t+1} and reward r_t to the agent. RL aims to maximize the total future reward. The value of a state is the maximum reward an agent can expect from the state in the future, as shown by (5).
$$V^{*}(s) = \max_{\pi} \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \,\middle|\, s_t = s, \pi\right] \tag{5}$$
where γ is the discount factor that weights the future rewards. The value function (v-function) with γ = 1 is employed in this paper.

III. METHODOLOGY
This section elaborates our approach. First, the greedy search based Bayesian network learning method is viewed as a Markov decision process, which is the basis of RL. This part clarifies the motivation for combining GS and RL. Following that, the framework of our method is given, and each component is explained one by one.

A. GREEDY SEARCH: A MARKOV DECISION PROCESS
The idea to combine greedy search and reinforcement learning is inspired by their implicit common point: the Markov decision process (MDP) [20], [21].
Definition 2: A discrete-time stochastic control process is Markovian (i.e., it has the Markov property) if
$$P(\omega_{t+1} \mid \omega_t, a_t) = P(\omega_{t+1} \mid \omega_t, a_t, \ldots, \omega_0, a_0)$$
and
$$P(r_t \mid \omega_t, a_t) = P(r_t \mid \omega_t, a_t, \ldots, \omega_0, a_0).$$
The Markov property means that the future of the process only depends on the current observation, and the agent has no interest in looking at the full history.
Definition 3: A Markov decision process is a 5-tuple defined by (6):
$$(S, A, T, R, \gamma) \tag{6}$$
where S is the state space, A is the action space, T is the transition function, R is the reward function and γ is the discount factor. If the observation in a system is the same as the state, ω_t = s_t, the MDP is fully observable. As shown in Figure 2, in a fully observable MDP, an action is offered by a policy, the state transitions into a new state based on the action, where the distribution of the new state is given by the transition function, and the reward is given by the reward function. In our greedy search based BN learning, if we define BN structures as states, the iteration process is Markovian because at time-step t only state s_t is used to find a better solution. The actions defined by GS are to delete, add or reverse an edge. Its policy is to try all possible actions, obtain and evaluate all new states/structures, and finally select the best one. The transition is deterministic, and the reward is the increment of the score, with discount factor γ = 1.
Equations (7), (8) and (9) formally give the policy, reward function and value function of greedy search, respectively, where s^a_{t+1} is the structure at time-step t+1 if action a is taken. Because the policy in equation (7) needs to evaluate all available structures based on data, it is inefficient.
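As an illustration, the greedy policy and the score-increment reward described above can be written as follows. This is a sketch under assumptions: score(·) stands for the BIC score, and the symbol 𝒜(s_t) for the set of edge operations that keep the graph acyclic is introduced here for convenience, so the notation may differ from the authors' equations.

```latex
\begin{align*}
\pi(s_t) &= \arg\max_{a \in \mathcal{A}(s_t)} \mathrm{score}\left(s^{a}_{t+1}\right), \\
r(s_t, a) &= \mathrm{score}\left(s^{a}_{t+1}\right) - \mathrm{score}(s_t), \\
\sum_{\tau=0}^{t-1} r(s_\tau, a_\tau) &= \mathrm{score}(s_t) - \mathrm{score}(s_0) \qquad (\gamma = 1).
\end{align*}
```

The last line is a telescoping sum: with γ = 1 and the reward defined as the score increment, the reward accumulated along a search path equals the score of the reached structure up to the constant score(s_0). This is why a function that predicts scores can play the role of a value function.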
Value-function based reinforcement learning methods solve an MDP in a more elegant way: they use a function to approximate the value function and adopt the best estimated action in most cases. The discussion above indicates that the key difference between greedy search and value-function based reinforcement learning is how the action is chosen: greedy search tries all possible actions and selects the best one, while value-function based reinforcement learning just selects the action that the value function estimates to be best.
The key shortcoming of greedy search is the evaluation of all available new states, which can be expensive and unnecessary. We remedy this deficiency with the value function from reinforcement learning, which allows structures with a low estimated score to be ignored.

B. REINFORCEMENT LEARNING DIRECTED GREEDY SEARCH
Our basic idea is to use RL to avoid evaluating unlikely BN structures, as demonstrated in Figure 3. In the greedy search module, available structures are generated from the current best one. If the value function in the reinforcement learning module is accurate enough, it estimates the scores of these structures, and only the likely structures remain. Otherwise, all available structures are treated as likely. The likely structures are further evaluated by data to generate the new best structure, and the evaluated scores are fed back to the reinforcement learning module to train the value function. The rest of this subsection elaborates the details of the framework.

1) THE REPRESENTATION OF BN STRUCTURES
As defined in Section II, a BN is composed of two parts: the structure and the parameters. This paper employs an adjacency matrix to represent the structure of a BN. The structure is represented by a matrix G with entries G_ij ∈ {0, 1} (i ≠ j). G_ij = 1 means that edge i → j exists and G_ij = 0 means that i → j does not exist. Since two nodes cannot be connected in both directions, G_ij × G_ji ≠ 1. In our diagnosis problem, in addition to the monitored variables, we employ another variable, m, to represent the health state/mode of the system, as illustrated in Figure 4.
In the search process, for both the classic greedy search and the proposed one, we start from a naïve Bayesian network G_0 and assume that the edges from the mode variable to all other variables always exist and that no edge is directed to the mode variable. The learning algorithms only concern the edges between different monitored variables.
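The representation and its constraints can be made concrete with a short sketch. It assumes that the mode variable is stored at the last index of the matrix, which is a layout chosen here for illustration; the function names are hypothetical.

```python
def naive_bayes_structure(n):
    """Adjacency matrix for n monitored variables plus the mode variable m.

    The mode variable is placed at index n (assumption); G[i][j] == 1 means i -> j.
    """
    size = n + 1
    G = [[0] * size for _ in range(size)]
    for j in range(n):
        G[n][j] = 1  # edge m -> v_j always exists
    return G


def is_acyclic(G):
    """Kahn's algorithm: repeatedly remove nodes that have no incoming edge."""
    size = len(G)
    indegree = [sum(G[i][j] for i in range(size)) for j in range(size)]
    ready = [j for j in range(size) if indegree[j] == 0]
    removed = 0
    while ready:
        i = ready.pop()
        removed += 1
        for j in range(size):
            if G[i][j]:
                indegree[j] -= 1
                if indegree[j] == 0:
                    ready.append(j)
    return removed == size


def is_valid(G):
    """No self-loops, no pair connected in both directions, and no cycle."""
    size = len(G)
    for i in range(size):
        if G[i][i]:
            return False
        for j in range(i + 1, size):
            if G[i][j] and G[j][i]:
                return False
    return is_acyclic(G)


G = naive_bayes_structure(3)
G[0][1] = 1  # v_0 -> v_1
G[1][2] = 1  # v_1 -> v_2
chain_is_valid = is_valid(G)
G[2][0] = 1  # v_2 -> v_0 closes a cycle
cycle_is_valid = is_valid(G)
```

Keeping the mode row fixed and searching only over the monitored-variable block is what restricts the learning to edges between monitored variables.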

2) CNN-BASED VALUE FUNCTIONS
There are multiple alternatives for approximating the value function, such as linear regression and polynomial functions [24]. This work employs CNNs, which have proven to be powerful in image recognition [25], [26]. Owing to their strong ability in feature extraction, CNNs are widely used in many fields, including fault diagnosis [27]-[33].
A CNN for image recognition is usually composed of multiple stacked layers. These layers can typically be categorized as convolutional layers, pooling layers and SoftMax layers [27].
Convolutional layers are employed to extract features from data and generate high-level features. Convolutional kernels slide over data matrices to extract local features, as illustrated in Figure 5 [34]. A kernel acts like a filter that is sensitive to a specific feature.
The learned features are down-sampled by pooling layers so that we do not have to handle too many features. Finally, the highest-level features are merged by SoftMax layers to predict labels. (Sometimes, fully-connected layers are used before the SoftMax layers as well.)
The inputs of CNNs, images, are expressed as matrices. As BN structures are expressed as matrices as well [13], we can easily use CNNs for structure estimation. In practice, most monitored variables in fault diagnosis are only related to a subset of the variables, so local features exist in BN structures. Considering that CNNs extract local features [34] from images, it is feasible to utilize CNNs as the value function for structure estimation. When the node order changes, the local features change as well, but a fixed order yields specific local features that can always be identified by the kernels. Because we need to predict a real value instead of a label, the SoftMax layer is not used.
Our value function approximators accept one matrix and output a real number, where the matrix plays the role of a channel in image recognition [26] and the real number is the estimated value. Figure 6 illustrates the structure of a value function approximator. The next subsection elaborates how to integrate GS and RL for BN learning. For brevity, g is used to denote our CNN.

3) REINFORCEMENT LEARNING DIRECTED GREEDY SEARCH
At the beginning of the greedy search, the value function cannot be used to discard unlikely structures because it is not accurate enough, which is similar to the "exploration & exploitation" problem in RL. We solve it by introducing a piecewise function, as shown in Figure 7, where the horizontal axis is the iteration number and the vertical axis represents the number of new likely structures (to be evaluated). Counting the current best structure, (N + 1) structures are evaluated by data in each iteration.
In the first several iterations, the value function approximator is not accurate, so it is advisable to evaluate as many structures as possible to collect training samples for the approximator. When the training loss of the value function is small enough, unlikely structures can be effectively recognized by it. Then, we reduce the number of likely structures linearly down to a minimal number. The curve in Figure 7 is formally described by equations (10) and (11), where k is the iteration index, N is the number of structures to be evaluated in each iteration, N = N_0 in the first k_0 iterations, N = N_1 after the search has iterated k_1 times, and N decreases linearly, kept as an integer, between k_0 and k_1.
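The schedule can be sketched as a small function. This is an illustration under assumptions: the linear part is rounded to the nearest integer (the text only states that N is kept an integer), and the example value k_0 = 10 is hypothetical, since k_0 is determined by the training loss during the search.

```python
def num_to_evaluate(k, k0, k1, n0, n1):
    """Number N of candidate structures to evaluate at iteration k.

    N = n0 up to iteration k0, N = n1 from iteration k1 on, and a linear
    interpolation rounded to an integer in between (rounding rule assumed).
    """
    if k <= k0:
        return n0
    if k >= k1:
        return n1
    fraction = (k - k0) / (k1 - k0)
    return int(round(n0 - fraction * (n0 - n1)))


# N_0 = 420 and N_1 = 64 as in the 10-tank case; k_0 = 10 is a made-up value.
schedule = [num_to_evaluate(k, 10, 14, 420, 64) for k in (1, 10, 11, 12, 13, 14, 20)]
```

A shorter ramp (smaller k_1 − k_0) prunes candidates sooner and saves more evaluations, at the price of trusting the value function earlier.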
A system with n monitored variables has n(n−1)/2 unordered variable pairs i − j (i > j). For a pair i − j in a BN structure, exactly one of the following three cases holds: no edge between i and j, edge i → j, or edge j → i. We can 1) add edge i → j or j → i in the first case, and 2) delete or reverse the edge in the other two cases. So, by applying the add, delete and reverse operations once to any BN structure, we obtain at most n(n−1) new structures. Accordingly, we set N_0 = n(n−1) in our experiments.
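The counting argument can be checked with a few lines of code. The sketch below enumerates the single-edge operations on the monitored-variable block of the adjacency matrix before any acyclicity filtering; the function name and the (operation, source, target) tuples are illustrative.

```python
from itertools import combinations


def candidate_operations(G):
    """List (operation, source, target) for every single-edge change of G.

    Each unordered pair contributes exactly two operations, so the list always
    has n(n-1) entries; cyclic results still have to be filtered out afterwards.
    """
    n = len(G)
    operations = []
    for i, j in combinations(range(n), 2):
        if G[i][j]:
            operations += [("delete", i, j), ("reverse", i, j)]
        elif G[j][i]:
            operations += [("delete", j, i), ("reverse", j, i)]
        else:
            operations += [("add", i, j), ("add", j, i)]
    return operations


n = 21  # monitored variables of the 10-tank system
G = [[0] * n for _ in range(n)]
count_empty = len(candidate_operations(G))
G[0][1] = 1
G[5][3] = 1
operations = candidate_operations(G)
```

Because every pair always offers two operations, n(n−1) is reached before filtering; the number of valid candidates only drops below it when some operations would create a cycle.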

Algorithm 2 Reinforcement Learning Directed Greedy Search
Input: initial structure G_0, historical data D, untrained value function approximator g, number of iterations k
score_set ← ∅
repeat k times:
    N ← obtain the number of structures to be evaluated
    G_set ← obtain all possible new structures by applying operations once to G_0
    G_set ← the top N structures in G_set according to the scores estimated by g, together with G_0
    D_batch ← sample a data batch from D
    evaluate structures in G_set based on D_batch, and add structure-score pairs into score_set
    G_0 ← the best structure in the last step
    optimize value function approximator g using score_set
return G_0

Now, all the introduced ingredients can be put together to elaborate the proposed method. Given an initial network structure G_0 (in which only the edges directed from the mode variable to the monitored variables exist), historical data D and an untrained CNN value function approximator g, the reinforcement learning directed greedy search (RLDGS) algorithm iterates k times. In each iteration, the number of structures to be evaluated is obtained by equations (10) and (11), and then all possible new structures are obtained by applying one operation (add, delete or reverse) to G_0 (invalid structures with cycles are ignored). After their scores are estimated by g, only the top N structures and G_0 are evaluated on a sampled data batch D_batch. G_0 is replaced by the best structure in the evaluation, and the evaluated scores are added to score_set to optimize the value function approximator g.
The optimization of g can be implemented by gradient-based methods [35]-[40]. Because score_set is not a big data set, the whole score_set is used without sampling batches so that g converges fast enough. Besides, g can also be optimized multiple times in each outer iteration. Finally, G_0 is returned as the learned structure. Algorithm 2 gives the pseudo code, where the parameter learning is not explicitly given because the CPT parameters can easily be estimated from frequency counts [13].
Another thing that the pseudo code does not mention is that not all operations yield a valid structure. If there is no edge between two nodes, the reverse operation is obviously invalid. And if an operation results in a cyclic structure, the structure is ignored as well.
We will see that, compared with the original greedy search method, the proposed method costs distinctly less time after the value function converges.
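The overall loop can be sketched in plain Python. This is a simplified illustration, not the authors' implementation: a hypothetical linear model `LinearValue` stands in for the CNN g, a caller-supplied `score_fn` stands in for BIC evaluation on a data batch, and the parameter `ramp` plays the role of k_1 − k_0.

```python
def is_acyclic(G):
    """Kahn's algorithm on an adjacency matrix (G[i][j] == 1 means i -> j)."""
    n = len(G)
    indeg = [sum(G[i][j] for i in range(n)) for j in range(n)]
    ready = [j for j in range(n) if indeg[j] == 0]
    removed = 0
    while ready:
        i = ready.pop()
        removed += 1
        for j in range(n):
            if G[i][j]:
                indeg[j] -= 1
                if indeg[j] == 0:
                    ready.append(j)
    return removed == n


def neighbors(G):
    """Acyclic structures reachable from G by one add, delete or reverse."""
    n = len(G)
    found = []
    for i in range(n):
        for j in range(i + 1, n):
            if G[i][j] or G[j][i]:
                a, b = (i, j) if G[i][j] else (j, i)
                edits = [[(a, b, 0)], [(a, b, 0), (b, a, 1)]]  # delete, reverse
            else:
                edits = [[(i, j, 1)], [(j, i, 1)]]  # add i -> j, add j -> i
            for edit in edits:
                H = [list(row) for row in G]
                for p, q, value in edit:
                    H[p][q] = value
                if is_acyclic(H):
                    found.append(tuple(map(tuple, H)))
    return found


class LinearValue:
    """Hypothetical stand-in for the CNN g: a bias plus one weight per edge."""

    def __init__(self, n, lr=0.05):
        self.w = [[0.0] * n for _ in range(n)]
        self.b = 0.0
        self.lr = lr

    def predict(self, G):
        return self.b + sum(self.w[i][j] for i, row in enumerate(G)
                            for j, v in enumerate(row) if v)

    def fit(self, score_set, epochs=20):
        # Plain SGD on the squared error over the whole score_set.
        for _ in range(epochs):
            for G, y in score_set.items():
                err = self.predict(G) - y
                self.b -= self.lr * err
                for i, row in enumerate(G):
                    for j, v in enumerate(row):
                        if v:
                            self.w[i][j] -= self.lr * err

    def loss(self, score_set):
        errs = [self.predict(G) - y for G, y in score_set.items()]
        return sum(e * e for e in errs) / len(errs)


def rldgs(G0, score_fn, k, n1, loss_threshold, ramp):
    """Greedy search whose candidate list is pruned by a learned value function."""
    G = tuple(map(tuple, G0))
    g = LinearValue(len(G))
    score_set, k0, evaluations = {}, None, 0
    for it in range(1, k + 1):
        candidates = neighbors(G)
        if k0 is not None:
            # After g is judged accurate, shrink the list linearly down to n1.
            frac = min(1.0, (it - k0) / ramp)
            keep = max(n1, round(len(candidates) - frac * (len(candidates) - n1)))
            candidates = sorted(candidates, key=g.predict, reverse=True)[:keep]
        for H in candidates + [G]:
            score_set[H] = score_fn(H)  # the expensive, data-based evaluation
            evaluations += 1
        G = max(candidates + [G], key=score_set.__getitem__)
        g.fit(score_set)
        if k0 is None and g.loss(score_set) < loss_threshold:
            k0 = it
    return G, evaluations


# Toy usage: the score is minus the Hamming distance to a hidden chain 0->1->2->3.
target = ((0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1), (0, 0, 0, 0))
def toy_score(G):
    return -sum(abs(G[i][j] - target[i][j]) for i in range(4) for j in range(4))
empty = [[0] * 4 for _ in range(4)]
best, evaluations = rldgs(empty, toy_score, k=6, n1=3, loss_threshold=0.5, ramp=2)
```

Because the current structure is always re-evaluated together with the kept candidates, the score of the returned structure never falls below that of the initial one, whatever the quality of the stand-in value function.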

IV. CASE STUDY
Two systems, a 10-tank system and the classic Tennessee Eastman process (TEP) [41], are employed to demonstrate our method. We build Bayesian network diagnosers over V = {m} ∪ V_m, where m is the variable representing the system health mode and V_m is the set of monitored variables. The mode with the maximum a posteriori probability, which can be obtained by (12), is the estimated mode.
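When every monitored variable is observed, the posterior of the mode is proportional to the joint probability, so the maximum a posteriori mode can be found by comparing P(m) multiplied by the CPT entries of the observed values. The following sketch assumes this fully observed setting; the data layout, the floor probability for unseen configurations, and the toy numbers are all hypothetical.

```python
import math


def map_mode(prior, cpts, parents, evidence, floor=1e-9):
    """Return the mode with the largest posterior given fully observed evidence.

    prior:    {mode: P(mode)}
    parents:  parents[i] lists the monitored parents of variable i (the mode
              is an implicit parent of every monitored variable)
    cpts:     cpts[i][(mode, parent_values, value)] = P(v_i = value | mode, parents)
    evidence: tuple with one observed (discretized) value per monitored variable
    """
    log_joint = {}
    for mode, p_mode in prior.items():
        total = math.log(p_mode)
        for i, pa in enumerate(parents):
            key = (mode, tuple(evidence[j] for j in pa), evidence[i])
            # Unseen configurations get a small floor probability (assumption).
            total += math.log(cpts[i].get(key, floor))
        log_joint[mode] = total  # log P(mode, evidence); the normalizer is shared
    return max(log_joint, key=log_joint.get)


# Toy diagnoser with two modes and two binary monitored variables (v_0 -> v_1).
prior = {"normal": 0.9, "leak": 0.1}
parents = [(), (0,)]
cpts = [
    {("normal", (), 0): 0.9, ("normal", (), 1): 0.1,
     ("leak", (), 0): 0.1, ("leak", (), 1): 0.9},
    {("normal", (0,), 0): 0.9, ("normal", (0,), 1): 0.1,
     ("normal", (1,), 0): 0.8, ("normal", (1,), 1): 0.2,
     ("leak", (0,), 0): 0.5, ("leak", (0,), 1): 0.5,
     ("leak", (1,), 0): 0.1, ("leak", (1,), 1): 0.9},
]
mode_a = map_mode(prior, cpts, parents, (1, 1))
mode_b = map_mode(prior, cpts, parents, (0, 0))
```

Working in log space avoids numerical underflow when the product runs over many monitored variables.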
We only employ the BIC score because it is widely accepted in Bayesian network learning [13], [23] (compared with the likelihood score, the BIC score effectively reduces overfitting), and because this work focuses on improving the learning process by reinforcement learning.

A. THE 10-TANK SYSTEM
This subsection employs a 10-tank system, as illustrated in Figure 8, to demonstrate the proposed method. There are 10 tanks, 1 input pipe, 1 output pipe and 9 connection pipes in the system. The 9 connection pipes connect the three tanks at the bottom. The input pipe feeds water to the first tank. Water flows from tank i to tank i+1 and finally leaves through the output pipe. Initially, all tanks are empty, and the input pipe is open. After some time, the water height in the first tank increases to h and the input pipe is closed. As water flows away, the height in the first tank decreases to l and the input pipe is opened again. This switching repeats so that the height in the first tank stays between l and h.
Equations (13)∼(16) describe the behaviors of the system, where A is the cross-sectional area of the tanks and S_P is the cross-sectional area of the pipes. There are 20 faults considered in this system, 10 stuck faults and 10 leakage faults, where Stuck and Leakage represent the respective fault parameters.
The 10-tank system is a typical system in the process industry. Sometimes it is difficult to establish an accurate Bayesian network diagnoser because of practical factors. Learning such a diagnoser is challenging due to the number of monitored variables. Including the input flow q_0, there are 21 monitored variables in the system. Accordingly, there are at most 2^21 ≈ 2.1×10^6 possible structures for BN diagnosers. The original greedy search based learning method evaluates 21×20+1 = 421 of them in each iteration. Although learning a good Bayesian network diagnoser is costly for this system, our method considerably reduces the learning time.
The input flow, the heights of the 10 tanks, and the flows of the last 10 pipes, all with 20 dB noise, are used for diagnosis. All the monitored values were discretized into 10 discrete values by the unsupervised k-means approach offered by scikit-learn [42].
In our experiments, the value function is a CNN with four convolutional layers and four max-pooling layers. All convolutional layers padded the input data with zeros so that each output channel had the same size as the input channel. The first three convolutional layers were used to extract structure features; they had 32, 16 and 8 channels, respectively, and their kernel size was 3. The last convolutional layer had only one channel, and its kernel size was 1. All channels were merged by the last convolutional layer. The kernel size of all pooling layers was 2. The average of the outputs of the last pooling layer is the final output of the value function.
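A short calculation shows how small this network is. The sketch assumes a single input channel, one bias per output channel, an alternating convolution/pooling order, and pooling with stride 2 that rounds the size down; none of these details are stated explicitly in the text.

```python
# (input channels, output channels, kernel size) of the four convolutional layers.
LAYERS = [(1, 32, 3), (32, 16, 3), (16, 8, 3), (8, 1, 1)]


def conv_params(c_in, c_out, kernel):
    """Weights plus one bias per output channel of a 2-D convolution."""
    return c_out * (c_in * kernel * kernel + 1)


def spatial_sizes(size):
    """Side length of the feature maps after each convolution + pooling stage."""
    sizes = [size]
    for _ in LAYERS:
        # The zero-padded convolution keeps the size; 2x2 max-pooling with
        # stride 2 (assumed) halves it, rounding down.
        size //= 2
        sizes.append(size)
    return sizes


total_params = sum(conv_params(*layer) for layer in LAYERS)
sizes_21 = spatial_sizes(21)  # 21 x 21 input, one row and column per monitored variable
sizes_52 = spatial_sizes(52)  # 52 x 52 input for the Tennessee Eastman process
```

Under these assumptions the approximator has only a few thousand trainable parameters, and the final averaging step handles whatever map size is left after the last pooling layer.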
The experiment was conducted on a personal computer with an Intel i7-4790 CPU and 12 GB of memory. The greedy search was implemented in Python and the value function was implemented in PyTorch [43]. In the search process, N_0 was set to the maximal number (21 × 20 = 420) of possible new structures, the same as in the greedy search, and N_1 was 64. k_0 was set to the first iteration after the training loss of the value function fell below 3, and k_1 was set to k_0 + 4 or k_0 + 8. All the experiments were repeated 5 times to get robust results.
To demonstrate our method, a learned Bayesian network is given, and two factors are compared: learning time and diagnosis accuracy. The former shows the efficiency improvement and the latter shows how much diagnosis accuracy is lost because of the reduced number of evaluations.
Figure 9 gives the BN learned by RLDGS with k_1 − k_0 = 8 in one experiment. In the BN, most connected nodes had similar indexes. In fact, monitored variables that were close to each other in the system had similar indexes. The figure indicates that the proposed approach is capable of finding the relationships between different variables and offering a good BN for the system. The original greedy search method and RLDGS with k_1 − k_0 = 4 had similar results, but their BNs are not shown in this work to avoid redundancy.

2) THE TRAINING PROCESS
The scores and losses in our experiments are illustrated in Figure 10, where the horizontal axis represents the training time (in minutes), the left vertical axis is the score evaluated by (1) (normalized by M) and the right vertical axis represents the loss of the value function. There are five curves in this figure: three of them are the scores of the original greedy search method, RLDGS with k_1 − k_0 = 8 and RLDGS with k_1 − k_0 = 4, and the other two curves give the losses of the two RLDGS methods. The bars in the figure indicate the 95% confidence interval obtained from the 5 repetitions. In the first 20 min, the three methods had a similar increasing trend and the losses converged to a small number. After that, the score curves started to separate and converged at different speeds. According to the three score curves, the greedy search method converged very slowly (≈65 min) and the RLDGS methods converged much faster (≈35 min for k_1 − k_0 = 4, and ≈45 min for k_1 − k_0 = 8). Although the greedy search method achieved the best score, the scores of the other two methods were still good enough.
The training time is illustrated more clearly in Figure 11, where the horizontal axis is the iteration number and the vertical axis represents the time. As the iterations proceeded, the time cost of the greedy search increased faster because the structures became more complex and were harder to evaluate. However, the training time of RLDGS increased much more slowly because fewer structures were evaluated.
The numbers of evaluated structures are illustrated in Figure 12. As we can see, the number of evaluated likely structures decreased. That is because, as the current best structures became more complex, an operation (delete, add or reverse) was more likely to cause a cycle. Finally, the GS algorithm steadily evaluated about 330 likely structures. The two RLDGS algorithms initially evaluated almost the same number of structures as GS; however, the number decreased quickly (with different slopes) once the value function converged, and they finally evaluated 65 structures. After 15 iterations, about 265 structures were eliminated by the value function.
The three figures indicate that the proposed method is able to learn a good BN in much less time.

3) THE DIAGNOSIS ACCURACY
To evaluate the diagnosis performance of the learned BNs, we computed the receiver operating characteristic (ROC) curve of each BN; Figure 13 gives one ROC curve of the greedy search method. The area under the ROC curve (AUC) [44] is adopted as the quantitative criterion. The AUCs of the three methods are compared in Figure 14, where the bars indicate the 95% confidence interval. The AUCs were very close to each other, which means that the BNs learned by the three methods have very similar diagnosis performance. The bars show that the AUC of the greedy search method was more stable, because the score predicted by the value function is not exactly the same as the real value.

4) THE INSTINCT OF THE CNN VALUE FUNCTION
To analyze how the CNN value function estimates scores, we visualize an input matrix and the outputs of some kernels in a CNN value function by the heatmaps in Figure 15. Dark blocks represent high outputs and light blocks represent low outputs. Subfigure a) is the input matrix, b) and c) are the outputs of two kernels in the first convolutional layer, and d) is the output of a kernel in the second layer. The outputs of the kernels in the first layer had the same shape as the input matrix. In subfigures b) and c), the dark blocks are located around the dark blocks in a) but with different patterns. This relationship means that different kernels in the first layer used distinct criteria to evaluate local structures. The output of the kernel in the second layer had a reduced shape; however, the distribution of dark blocks was similar to that in a). The similarity indicates that the kernels in the second layer merged the information from the first layer, and that a block in d) represents a bigger local structure in a). In this way, the CNN value function merged and mapped local structure features into global structure features layer by layer and predicted the scores.

B. THE TENNESSEE EASTMAN PROCESS
The Tennessee Eastman process is a typical industrial process which is mainly composed of five process units: a two-phase reactor, a separator, a stripper, a compressor and a mixer. Martin-Villalba et al. [41] introduced how to implement it in Modelica. There are 21 faults in the system and 52 variables are monitored.
We conducted the experiments on a laptop with an Intel i9-8950K CPU and 32 GB of memory. Greedy search failed to build a Bayesian network diagnoser because it ran overtime. So, we only employed RLDGS with k_1 − k_0 = 4. All continuous variables were discretized into 8 discrete values.
The learned Bayesian network is not given because it is too complex. The loss of the v-function and the scores of the structures are given in Figure 16, and the ROC curve (AUC = 0.9021) is shown in Figure 17. The characteristics of the figures are similar to those in the former case.
The two experiments show that the proposed reinforcement learning directed greedy search is capable of rapidly learning a good Bayesian network diagnoser.

V. ANALYSIS AND DISCUSSION
Bayesian networks are a powerful tool to represent complex probability distributions. However, building network structures is intractable because there are at most 2^n possible structures.
Traditional score-based structure learning algorithms try to solve this problem by searching, but the huge search space is always a difficult obstacle. This work views the traditional greedy search as a Markov decision process and integrates reinforcement learning into it. The classic greedy search policy is first-order Markovian, where "first-order" means that the memory length is 1. No information is retained from the evaluations before the last time-step. Reinforcement learning, in contrast, makes full use of all past experience through a value function approximator.
The validity of the proposed method depends on the cost of collecting sufficient statistics from data and the cost of training the value function. If the training is much cheaper than collecting statistics, our method theoretically outperforms the original one. Fortunately, compared with the historical data, the set of evaluated scores is much smaller, so we do not need a complex neural network. In addition, CNNs are easy to train in parallel [34], so training the value function approximator is not difficult, as demonstrated in the experiment.
Quantitatively, assume that there are n monitored variables, a is the cost of evaluating a structure by data and b is the cost of training the value function in each iteration. The greedy search evaluates O(n^2) structures in each iteration, and its complexity over the k iterations is O(akn^2). The proposed method evaluates different numbers of structures in different phases: N_0 for the first k_0 iterations, between N_0 and N_1 for iterations k_0 to k_1, and N_1 for the remaining iterations. So, the whole complexity is O(a(k_0 N_0 + (k_1 − k_0)(N_0 + N_1)/2 + (k − k_1) N_1) + kb). When k ≫ k_1, N_1 ≪ n^2, and a ≫ b, the cost of our method reduces to O(akN_1), which is significantly less than that of the original greedy search method.
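To get a feel for these expressions, the evaluation counts can be plugged in for the 10-tank setting. The sketch below only counts structure evaluations (the factor multiplied by a) and ignores the training cost kb; the values k = 60 and k_0 = 10 are hypothetical, and the result is an illustration of the formula rather than a reproduction of the reported timings.

```python
def evaluations_greedy(k, n0):
    """Structure evaluations of plain greedy search: n0 candidates per iteration."""
    return k * n0


def evaluations_rldgs(k, k0, k1, n0, n1):
    """Structure evaluations of the three phases: full, linear ramp, pruned."""
    return k0 * n0 + (k1 - k0) * (n0 + n1) / 2 + (k - k1) * n1


# N_0 = 420 and N_1 = 64 follow the 10-tank case; k and k_0 are made-up values.
greedy = evaluations_greedy(60, 420)
pruned = evaluations_rldgs(60, 10, 14, 420, 64)
ratio = pruned / greedy
```

With these numbers the pruned search needs roughly a third of the evaluations of the plain search. In practice the gap is smaller than this count suggests, because cyclic candidates reduce the number of structures that greedy search actually evaluates and because the value function has to be trained.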
Compared with the original method, our method is improved in two aspects.
• All information in the search process is fully utilized. The traditional method only focuses on the current best structure, whereas our method compiles all the information into the value function.
• Training cost is significantly reduced by omitting unlikely structures. Based on the learned value function, structures with a low estimated score (given by the value function) are not evaluated by data, which avoids unnecessary computation.

VI. CONCLUSION
In this work, we propose an efficient Bayesian network learning approach. This approach combines reinforcement learning and greedy search, where a CNN is adopted as the value function. The initial learning process is almost the same as the traditional greedy search method, but a value function is trained at the same time. When the training loss of the value function falls below a threshold, the function is employed to reduce the number of structures to be evaluated. Because the evaluation is more expensive than the training of the value function, the proposed method costs less time and improves the efficiency. A 10-tank system and the classic Tennessee Eastman process are employed to demonstrate the proposed approach. In the first experiment, our method converged much faster than the original one (time reduced by 30%∼50%) with little score loss. In addition, the diagnosis accuracy shows that the Bayesian networks learned by the new method were as good as the ones from the greedy search. In the second case, the traditional greedy search failed because it ran overtime, while the proposed one worked well. In conclusion, the proposed method is capable of rapidly learning accurate Bayesian network diagnosers for complex systems.
In the future, we will use our approach to learn dynamic Bayesian networks and try to combine reinforcement learning with other basic structure learning methods, such as genetic methods, to achieve better performance.