Optimizing the Topology of Bayesian Network Classifiers by Applying Conditional Entropy to Mine Causal Relationships Between Attributes

Owing to their excellent classification performance and expressivity, Bayesian network classifiers (BNCs) have attracted great attention ever since the success of Naive Bayes (NB). Information theory has provided a mathematical basis for the rapid development of BNCs. In this paper we define the entropy function <inline-formula> <tex-math notation="LaTeX">$H_{\mathcal {B}}(D)$ </tex-math></inline-formula>, which corresponds to the optimal number of bits encoded in the network structure <inline-formula> <tex-math notation="LaTeX">$\mathcal {B}$ </tex-math></inline-formula> and can roughly measure the amount of information implicit in the training data <inline-formula> <tex-math notation="LaTeX">$D$ </tex-math></inline-formula>. Each factor in <inline-formula> <tex-math notation="LaTeX">$H_{\mathcal {B}}(D)$ </tex-math></inline-formula> explicitly represents statements about causal relationships. An efficient heuristic search strategy is introduced to minimize <inline-formula> <tex-math notation="LaTeX">$H_{\mathcal {B}}(D)$ </tex-math></inline-formula> and explore the optimal topology of the BNC. Our extensive experimental evaluation on 40 datasets reveals that this out-of-core algorithm achieves competitive classification performance compared to state-of-the-art learners such as tree augmented Naive Bayes, the <inline-formula> <tex-math notation="LaTeX">$k$ </tex-math></inline-formula>-dependence Bayesian classifier, support vector machine, logistic regression and neural network.


I. INTRODUCTION
Machine learning has recently received tremendous attention from both the statistics and computer science communities. Its algorithms learn from observed data and can be used for prediction. Information theory, which was first introduced and developed by Shannon [1] to explain the principles behind point-to-point communication and data storage, has been widely applied in machine learning. Many criteria, e.g., joint mutual information [2] and maximum-relevance minimum-redundancy [3], among others [4]-[6], have been proposed based on entropy, mutual information, conditional mutual information and their variations.
Classification is one of the most active research areas in both machine learning and data mining. The goal is to "train" a model on the training data so that it generalizes well to unlabeled testing instances. The success of Naive Bayes (NB) has considerably strengthened the interest and enthusiasm for learning Bayesian network classifiers (BNCs) from data. For BNC learning, information theory plays an important role and many of its applications consider information-theoretic quantities. Mutual information is commonly applied for attribute sorting or selection [8]-[10], and conditional mutual information for measuring conditional dependence between predictive attributes [11]-[15]. Researchers have also proposed alternative measures (e.g., information gain [16], chi-squared score [17] and others [18], [19]) that can be used to measure relevance.
The associate editor coordinating the review of this manuscript and approving it for publication was Chun-Hao Chen.
BNCs, e.g., the state-of-the-art tree augmented Naive Bayes (TAN) and k-dependence Bayesian classifier (KDB), provide a graphical representation of the probability distribution of the training data. However, because conditional mutual information is symmetric, it can neither quantify the amount of information implicit in a BNC nor describe the direction of causal relationships. In this paper, we explore the relationship between the entropy function and the joint probability distribution in terms of log likelihood. On this basis, we justify applying conditional entropy to identify significant causal relationships, and then propose a heuristic search approach to find a network structure that may "best fit" the training data.
The contributions of this paper are as follows:
• We introduce the definition of the entropy $H_{\mathcal{B}}$ to measure the amount of information implicit in the topology of BNC $\mathcal{B}$. We then prove the correctness of applying the conditional entropy $H(X_i|C, \Pi_i^{\mathcal{B}})$, where $\Pi_i^{\mathcal{B}}$ denotes the parents of $X_i$ in $\mathcal{B}$, to represent causal relationships between attributes. The resulting highly scalable algorithm combines the computational efficiency of classical generative learning with the low bias of discriminative learning.
• We compare the performance of our algorithm, i.e., FKDB, with other state-of-the-art classifiers on 40 datasets, ranging in size from 24 instances to 165 thousand instances and from 4 to 166 attributes. We show that FKDB achieves comparable or lower error than a range of state-of-the-art BNCs (e.g., NB, TAN, KDB and AODE) and non-BNC classifiers (e.g., support vector machine, logistic regression and neural network).
The remainder of this paper is organized as follows: Section II reviews some state-of-the-art BNCs and investigates the difference between NB and these BNCs in terms of log likelihood. Section III presents a heuristic search strategy to build a high-order maximum weighted spanning tree by applying conditional entropy rather than conditional mutual information. Section IV compares our proposed algorithm with other approaches, including logistic regression, support vector machine and neural network. Finally, Section V presents the main conclusions and future work.

II. BAYESIAN NETWORK CLASSIFIERS
The supervised classification problem consists of inducing a classification model that is able to assign a label $c \in C$, from a set of $|C|$ labels of the class variable $C$, to an unlabeled instance $\mathbf{x} = (x_1, \ldots, x_n)$, with values for the $n$ attributes $\mathbf{X} = \{X_1, \cdots, X_n\}$. Because the prior probability $P(\mathbf{x})$ is irrelevant to classification, a BNC assigns the most probable a posteriori (MAP) class to $\mathbf{x}$ by applying the following classification rule based on Bayes' theorem:
$$c^* = \arg\max_{c \in C} P(c|\mathbf{x}) = \arg\max_{c \in C} P(\mathbf{x}, c). \tag{1}$$
BNCs approximate the estimate of $P(\mathbf{x}, c)$ with a factorization according to the network topology [7]. The topology is a directed acyclic graph (DAG) in which vertices correspond to the random variables $\{X_1, \ldots, X_n, C\}$ and arcs encode the probabilistic dependencies among triplets of variables. For a restricted BNC $\mathcal{B}$, the class variable is the root node, thus $P(\mathbf{x}, c)$ can be decomposed as Eq. (2) shows:
$$P(\mathbf{x}, c) = P(c)\prod_{i=1}^{n} P(x_i|c, \pi_i^{\mathcal{B}}). \tag{2}$$
Each factor in $P(\mathbf{x}, c)$, i.e., $P(x_i|c, \pi_i^{\mathcal{B}})$, is a categorical distribution, where $\pi_i^{\mathcal{B}}$ denotes the values of $\Pi_i^{\mathcal{B}}$ and $\Pi_i^{\mathcal{B}}$ is the set of parents of $X_i$ in $\mathcal{B}$.
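The MAP rule and the factorization of $P(\mathbf{x}, c)$ described above can be illustrated with a small sketch. It is not the authors' implementation: the function names `fit`, `log_joint` and `predict`, the tuple-based encoding of parent sets, and the use of Laplace smoothing are all assumptions made for this example.

```python
from collections import Counter
from math import log

def fit(rows, labels, parents):
    """Count the frequencies behind P(c) and each factor P(x_i | c, parents of X_i)."""
    class_counts = Counter(labels)
    family = Counter()   # (i, x_i, c, parent values) -> count
    context = Counter()  # (i, c, parent values) -> count
    values = [set() for _ in parents]  # observed values of each attribute
    for x, c in zip(rows, labels):
        for i, pa in enumerate(parents):
            pv = tuple(x[j] for j in pa)
            family[(i, x[i], c, pv)] += 1
            context[(i, c, pv)] += 1
            values[i].add(x[i])
    return class_counts, family, context, values

def log_joint(x, c, model, parents):
    """log P(x, c) = log P(c) + sum_i log P(x_i | c, parent values of X_i)."""
    class_counts, family, context, values = model
    score = log(class_counts[c] / sum(class_counts.values()))
    for i, pa in enumerate(parents):
        pv = tuple(x[j] for j in pa)
        # Laplace smoothing (an assumption of this sketch) avoids log(0).
        num = family[(i, x[i], c, pv)] + 1
        den = context[(i, c, pv)] + len(values[i])
        score += log(num / den)
    return score

def predict(x, model, parents):
    """MAP rule: pick the class with the largest joint probability P(x, c)."""
    return max(model[0], key=lambda c: log_joint(x, c, model, parents))

rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = ["a", "a", "b", "b"]
nb_parents = [(), ()]      # Naive Bayes: the class is the only parent
tan_parents = [(), (0,)]   # the second attribute also depends on the first
model = fit(rows, labels, nb_parents)
print(predict((0, 1), model, nb_parents))
```

Changing only the `parents` argument switches between NB-like and TAN-like topologies, which mirrors how a restricted BNC differs from NB only in the parent sets of its attributes.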
For a specific instance $d_i$, the negative log likelihood $-\log P_{\mathcal{B}}(d_i)$ measures how many bits are needed to describe $d_i$ based on the probability distribution $P_{\mathcal{B}}$. From the statistical viewpoint, the higher the log likelihood is, the better $\mathcal{B}$ models the probability distribution of $d_i$. Suppose that, for $d_i$, there exist $N_i$ instances that have the same attribute values. Let $\hat{P}(d_i)$ be the empirical distribution defined by the frequency of event $d_i$, namely, $\hat{P}(d_i) = N_i/N$. Thus $-N_i \log P_{\mathcal{B}}(d_i)$, or equivalently $-N\hat{P}(d_i) \log P_{\mathcal{B}}(d_i)$, measures how many bits are needed to describe $d_i$ when duplicated instances exist. Given enough training data, the estimate $\hat{P}(d_i)$ will approximate the joint probability distribution $P(c, \mathbf{x})$. The entropy function $H_{\mathcal{B}}(D)$, which is defined as follows, corresponds to the optimal number of bits needed to roughly measure the amount of information represented by $\mathcal{B}$:
$$H_{\mathcal{B}}(D) = -\sum_{d_i} \hat{P}(d_i)\log P_{\mathcal{B}}(d_i), \tag{3}$$
where the sum runs over the distinct instances in $D$. Eq. (3) implies that, to best describe $D$, we need to find a network structure that maximizes the log likelihood or, equivalently, minimizes the entropy function $H_{\mathcal{B}}(D)$. Because $P_{\mathcal{B}}$ factorizes according to Eq. (2), $H_{\mathcal{B}}(D)$ decomposes as
$$H_{\mathcal{B}}(D) = H(C) + \sum_{i=1}^{n} H(X_i|C, \Pi_i^{\mathcal{B}}). \tag{4}$$
As shown in Fig. 1(a), NB [20]-[22] is the simplest BNC and assumes that every attribute is conditionally independent of the other attributes given the class variable. This independence assumption can be described as
$$P(\mathbf{x}|c) = \prod_{i=1}^{n} P(x_i|c). \tag{5}$$
For NB, Eq. (4) becomes
$$H_{NB}(D) = H(C) + \sum_{i=1}^{n} H(X_i|C). \tag{6}$$
However, this independence assumption rarely holds in practice. One approach to avoiding the resulting negative effects is to approximate the independence assumption and improve the probability estimates made by NB. Weighted Naive Bayes [23], [24] adjusts the naive Bayesian probabilities, which may significantly improve the classification accuracy. Semi-naive Bayes models [25] introduce new attributes obtained as the Cartesian product of two or more original predictor variables.
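The link between the entropy function and the log likelihood can be checked numerically. The sketch below assumes that $H_{\mathcal{B}}(D)$ is the empirical cross-entropy $-\sum_{d} \hat{P}(d)\log_2 P_{\mathcal{B}}(d)$ with maximum-likelihood parameters, in which case it equals $H(C) + \sum_i H(X_i|C, \Pi_i)$. The helper names and the toy dataset are hypothetical.

```python
from collections import Counter
from math import log2

def cond_entropy(data, target, given):
    """Empirical H(target | given) in bits; target and given are column indices."""
    n = len(data)
    joint = Counter(tuple(r[j] for j in (target, *given)) for r in data)
    ctx = Counter(tuple(r[j] for j in given) for r in data)
    return -sum(cnt / n * log2(cnt / ctx[key[1:]]) for key, cnt in joint.items())

def entropy_of_structure(data, class_col, parents):
    """H(C) + sum_i H(X_i | C, parents of X_i) for a restricted BNC."""
    h = cond_entropy(data, class_col, ())
    for attr, pa in parents.items():
        h += cond_entropy(data, attr, (class_col, *pa))
    return h

def cross_entropy(data, class_col, parents):
    """-sum_d P_hat(d) * log2 P_B(d), with P_B built from relative frequencies."""
    n = len(data)
    families = [(class_col, ())] + [(a, (class_col, *pa)) for a, pa in parents.items()]
    tables = []
    for target, given in families:
        joint = Counter(tuple(r[j] for j in (target, *given)) for r in data)
        ctx = Counter(tuple(r[j] for j in given) for r in data)
        tables.append((target, given, joint, ctx))
    total = 0.0
    for row, n_d in Counter(map(tuple, data)).items():
        log_p = 0.0
        for target, given, joint, ctx in tables:
            key = tuple(row[j] for j in (target, *given))
            log_p += log2(joint[key] / ctx[key[1:]])
        total -= n_d / n * log_p
    return total

# Toy data with columns (C, X1, X2); purely illustrative.
data = [(0, 0, 0), (0, 0, 0), (0, 1, 1), (1, 1, 1),
        (1, 1, 0), (1, 0, 1), (0, 1, 1), (1, 0, 0)]
nb = {1: (), 2: ()}       # NB: no attribute parents
tan = {1: (), 2: (1,)}    # X2 additionally conditioned on X1
print(entropy_of_structure(data, 0, nb), entropy_of_structure(data, 0, tan))
```

On the training data, adding a parent can never increase the empirical conditional entropy, which is why structure complexity has to be restricted (for example by $k$) rather than chosen by $H_{\mathcal{B}}(D)$ alone.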
Another approach is to augment the network structure of NB with edges between attributes. The most cited and studied augmenting technique for NB so far is TAN (see Fig. 1(b)), which was proposed by Friedman et al. [12]. TAN maintains the basic structure of NB and models 1-dependence relationships among the attributes. Much prior work has explored approaches to improving its accuracy further [26], [27]. For TAN, Eq. (4) becomes
$$H_{TAN}(D) = H(C) + \sum_{i=1}^{n} H(X_i|C, \Pi_i^{TAN}), \tag{7}$$
where $\Pi_i^{TAN}$ denotes the parents of $X_i$ in TAN and $|\Pi_i^{TAN}| \le 1$.
Sahami [15] proposed to extend NB to allow for the modeling of arbitrarily complex dependencies, and the final BNC, KDB (see Fig. 1(c)), allows every attribute to be conditioned on the class variable and at most $k$ other attributes. For KDB, Eq. (4) becomes
$$H_{KDB}(D) = H(C) + \sum_{i=1}^{n} H(X_i|C, \Pi_i^{KDB}), \tag{8}$$
where $\Pi_i^{KDB}$ denotes the parents of attribute $X_i$ in KDB and $|\Pi_i^{KDB}| \le k$.
There have also been some important refinements that improve KDB's performance. Bouckaert [28] proposed averaging all possible network structures for a fixed value of $k$ (including lower orders). Rubio and Gámez [29] presented a variant of KDB that employs a hill-climbing search. Blanco et al. [9] suggested performing attribute subset selection within KDB using filter and wrapper approaches. Xiao et al. [30] argued that the k-graph used as the predictor subgraph is also the result of a kind of evolutionary computation method, which is inspired by the so-called group method of data handling (GMDH) [31].

III. KDB AND FKDB
Although different restricted BNCs, e.g., TAN and KDB, may apply different strategies to build their own structures, they all use the network structure of NB as the basic framework and can be regarded as different extensions of NB. Comparing any restricted BNC $\mathcal{B}$ with NB in terms of log likelihood, from Eq. (4) and Eq. (6) we have
$$H_{NB}(D) - H_{\mathcal{B}}(D) = \sum_{i=1}^{n} \left[ H(X_i|C) - H(X_i|C, \Pi_i^{\mathcal{B}}) \right] = \sum_{i=1}^{n} I(X_i; \Pi_i^{\mathcal{B}}|C), \tag{9}$$
where $\Pi_i^{\mathcal{B}}$ denotes the parents of $X_i$ in $\mathcal{B}$. Obviously, maximizing the sum of the information quantities $I(X_i; \Pi_i^{\mathcal{B}}|C)$ helps, although indirectly, to encode more bits in the network structure $\mathcal{B}$ and to represent more significant conditional dependencies between attributes. That clarifies why TAN or KDB uses conditional mutual information to measure conditional dependence. From Eq. (9) we can see that the attribute order is not an important issue, but how to identify the conditional dependencies is. Thus it is reasonable for TAN to build a maximum weighted spanning tree to identify the significant 1-dependence relationships without any prior attribute order. We argue that conditional dependencies of higher degree will fit the training data more closely and can help BNCs achieve better generalization performance. In contrast to TAN, KDB provides an effective way to construct classifiers at arbitrary points (values of $k$) along the attribute dependence spectrum. The attribute order, which is determined by comparing mutual information, seems unnecessary as discussed above. The learning procedure of KDB is described in Algorithm 1.

Algorithm 1: Learning Process of KDB
1 For each attribute $X_i$, calculate the mutual information $I(X_i; C)$.
2 Calculate the conditional mutual information $I(X_i; X_j|C)$ for each pairwise combination of attributes ($i \neq j$).
3 Let the selected attribute list, L, be empty.
4 Initialize the Bayesian network, BN, with a single class node, $C$.
5 repeat
6   Select the attribute $X_{max}$, which is not in L and has the maximum value $I(X_{max}; C)$;
7   Add a node to BN representing $X_{max}$;
8   Add an arc from $C$ to $X_{max}$ in BN;
9   Add $m = \min(|L|, k)$ arcs from $m$ distinct attributes $X_j$ in L with the highest value for $I(X_{max}; X_j|C)$.
10   Add $X_{max}$ to L.
11 until L includes all attributes;
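The steps of Algorithm 1 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the Weka implementation used in the experiments: the data are rows of discrete values, the function names are hypothetical, and the arc from the class to every attribute is left implicit.

```python
from collections import Counter
from math import log2

def entropy(data, cols):
    """Empirical joint entropy (bits) of the given columns."""
    n = len(data)
    counts = Counter(tuple(r[j] for j in cols) for r in data)
    return -sum(c / n * log2(c / n) for c in counts.values())

def mutual_info(data, a, c):
    """I(A; C) = H(A) + H(C) - H(A, C)."""
    return entropy(data, (a,)) + entropy(data, (c,)) - entropy(data, (a, c))

def cond_mutual_info(data, a, b, c):
    """I(A; B | C) = H(A, C) + H(B, C) - H(A, B, C) - H(C)."""
    return (entropy(data, (a, c)) + entropy(data, (b, c))
            - entropy(data, (a, b, c)) - entropy(data, (c,)))

def learn_kdb(data, attrs, class_col, k):
    """Return the attribute order and the attribute parents of each attribute."""
    # Steps 5-6: repeatedly taking the unselected attribute with maximal I(X; C)
    # is the same as sorting once in descending order.
    order = sorted(attrs, key=lambda a: mutual_info(data, a, class_col), reverse=True)
    selected, parents = [], {}
    for x in order:
        # Step 9: keep the min(|L|, k) selected attributes with highest I(X; X_j | C).
        ranked = sorted(selected,
                        key=lambda p: cond_mutual_info(data, x, p, class_col),
                        reverse=True)
        parents[x] = ranked[:min(len(selected), k)]
        selected.append(x)  # step 10
    return order, parents

# Toy data with columns (C, X1, X2, X3); purely illustrative.
data = [(0, 0, 0, 0), (0, 0, 0, 1), (0, 0, 0, 0), (0, 0, 1, 1),
        (1, 1, 1, 0), (1, 1, 1, 1), (1, 1, 1, 0), (1, 1, 0, 1)]
order, parents = learn_kdb(data, attrs=[1, 2, 3], class_col=0, k=1)
print(order, parents)
```

Note that the parent choice only ranks candidates by pairwise conditional mutual information; it never checks whether the discarded candidates are actually conditionally independent of the attribute.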
Conditional mutual information is often used to measure conditional dependence, whereas it is rarely used to identify implicit conditional independence. For example, given the attribute order $\{X_1, \cdots, X_n\}$, KDB selects $k$ attributes as parents for attribute $X_i$ from its candidate parents $\Pi_i = \{X_1, \cdots, X_{i-1}\}$ when $i > k$. That is, KDB implicitly assumes that $X_i$ is conditionally independent of its other candidate parents, so these attributes are regarded as redundant and removed from $\Pi_i$. KDB does not check whether this assumption really holds. The classification performance of KDB will be biased to some extent due to this passive learning strategy. From the definition of $I(X_j; X_i|C)$ we cannot tell which one is the cause and which one is the effect. For example, TAN builds a maximum weighted spanning tree, then transforms this undirected tree into a directed one by choosing a root variable and setting the direction of all edges to be outward from it. For different root variables, the arc between $X_j$ and $X_i$ may be $X_j \to X_i$ or $X_j \leftarrow X_i$. Obviously, the arc $X_j \to X_i$ in TAN just implies that $X_j$ and $X_i$ are conditionally dependent, not that $X_j$ is the cause and $X_i$ is the effect. A Bayesian network is also called a causal network, but state-of-the-art BNCs, e.g., TAN or KDB, can only represent conditional dependence rather than causality.
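The symmetry argument can be made concrete with a toy example in which one attribute is a deterministic function of another. The dataset and helper names below are illustrative assumptions; the point is only that $I(X_1; X_2|C)$ is the same in both directions while $H(X_2|C, X_1)$ and $H(X_1|C, X_2)$ differ.

```python
from collections import Counter
from math import log2

def cond_entropy(data, target, given):
    """Empirical H(target | given) in bits; target and given are column indices."""
    n = len(data)
    joint = Counter(tuple(r[j] for j in (target, *given)) for r in data)
    ctx = Counter(tuple(r[j] for j in given) for r in data)
    return -sum(cnt / n * log2(cnt / ctx[key[1:]]) for key, cnt in joint.items())

def cond_mutual_info(data, a, b, c):
    """I(A; B | C) = H(A | C) - H(A | C, B)."""
    return cond_entropy(data, a, (c,)) - cond_entropy(data, a, (c, b))

# Columns (C, X1, X2) with X2 = X1 // 2, so X1 determines X2 but not vice versa.
data = [(c, x1, x1 // 2) for c in (0, 1) for x1 in range(4)]

print(cond_mutual_info(data, 1, 2, 0), cond_mutual_info(data, 2, 1, 0))  # symmetric
print(cond_entropy(data, 2, (0, 1)))  # H(X2 | C, X1): X2 is fully determined
print(cond_entropy(data, 1, (0, 2)))  # H(X1 | C, X2): uncertainty remains
```

Conditional entropy therefore carries directional information that conditional mutual information cannot, although a low value of $H(X_2|C, X_1)$ indicates functional determination in the data, which is weaker than a demonstrated causal mechanism.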
In contrast, FKDB takes an active learning strategy and uses conditional entropy to learn significant causal relationships from data. From Fig. 2 we can see that the distributions of $I(X_i; X_j|C)$ and $H(X_i|C, X_j)$ may differ greatly. Thus, using these two measures to identify the parent attributes $\Pi_i$ for attribute $X_i$ may give different results and lead to different network topologies. Subsumption resolution assumes that, if $P(x_i|x_j) = 1$, then the attribute value $x_i$ is determined by $x_j$, and $x_i$ can be regarded as redundant if $x_j$ exists, i.e., $P(x_k|c, x_i, x_j) = P(x_k|c, x_j)$ [25]; this reflects the causal relationship $x_j \to x_i$ to some extent. Similarly, FKDB sorts attributes by comparing $H(X_i|C, \Pi_i)$. The lower the value of $H(X_i|C, \Pi_i)$ is, the more likely it is that the values of $X_i$ are determined by the values of $\{C, \Pi_i\}$. FKDB selects the first attribute with the minimum of $H(X_i|C)$, the second attribute with the minimum of $H(X_j|C, X_i)$, the third attribute with the minimum of $H(X_k|C, X_i, X_j)$, and so on. Correspondingly, we will have a set of weak or strong causalities including $C \to X_i$, $\{C, X_i\} \to X_j$ and $\{C, X_i, X_j\} \to X_k$. After all attributes have been sorted in this way, each candidate parent in $\Pi_i = \{X_1, \cdots, X_{i-1}\}$ for attribute $X_i$ will show redundancy of a different level. When the structure complexity is restricted, we can select the subset of $\Pi_i$ that minimizes the conditional entropy of attribute $X_i$. For example, when attribute $X_k$ takes $X_j$ and $\{X_i, C\}$ as its parents, $X_j$ may be regarded as redundant to a certain extent and then removed from $\Pi_k$.
An ideal network topology should correspond to the minimum of the entropy function $H_{\mathcal{B}}$. Based on the discussion above, we propose a heuristic search strategy to identify the conditional dependence by comparing the high-order conditional entropy $H(X_i|C, \Pi_i^{\mathcal{B}})$, which is one of the terms in $H_{\mathcal{B}}$ (see Eq. (4)), and then build a high-order maximum weighted spanning tree. The learning procedure of the proposed algorithm, called flexible KDB (FKDB), is described in Algorithm 2.

Algorithm 2
...
5 Add a node to BN representing $X_i$;
6 Add arcs from the attributes in $\Pi_i$ to $X_i$ in BN;
7 Add $X_i$ to L.
8 until L includes all attributes;
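One way to read the search strategy described in this section is sketched below. Because only the final steps of Algorithm 2 are available here, everything in the sketch is an assumption: at each iteration it gives every unselected attribute the subset of at most $k$ already selected attributes that minimizes $H(X|C, \text{subset})$ (found by exhaustive enumeration), and then selects the attribute with the lowest resulting conditional entropy. The names `best_parents` and `learn_fkdb` are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log2

def cond_entropy(data, target, given):
    """Empirical H(target | given) in bits; target and given are column indices."""
    n = len(data)
    joint = Counter(tuple(r[j] for j in (target, *given)) for r in data)
    ctx = Counter(tuple(r[j] for j in given) for r in data)
    return -sum(cnt / n * log2(cnt / ctx[key[1:]]) for key, cnt in joint.items())

def best_parents(data, x, candidates, class_col, k):
    """Subset of min(k, |candidates|) candidates minimizing H(x | C, subset)."""
    size = min(k, len(candidates))
    return min(combinations(candidates, size),
               key=lambda pa: cond_entropy(data, x, (class_col, *pa)))

def learn_fkdb(data, attrs, class_col, k):
    """Greedy ordering by conditional entropy (assumed reading of FKDB)."""
    order, parents = [], {}
    remaining = list(attrs)
    while remaining:
        scored = []
        for x in remaining:
            pa = best_parents(data, x, order, class_col, k)
            scored.append((cond_entropy(data, x, (class_col, *pa)), x, pa))
        _, x, pa = min(scored)   # attribute best explained by C and its parents
        parents[x] = pa          # arcs from the chosen parents to x
        order.append(x)          # add x to the selected list
        remaining.remove(x)
    return order, parents

# Toy data with columns (C, X1, X2, X3); purely illustrative.
data = [(0, 0, 0, 0), (0, 0, 0, 1), (0, 0, 0, 0), (0, 0, 1, 1),
        (1, 1, 1, 0), (1, 1, 1, 1), (1, 1, 1, 0), (1, 1, 0, 1)]
order, parents = learn_fkdb(data, attrs=[1, 2, 3], class_col=0, k=1)
print(order, parents)
```

Exhaustive enumeration of parent subsets is affordable only for small $k$; with many attributes, the high-order conditional entropies are also estimated from few instances per context, which is the bias the experimental section warns about.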

IV. EXPERIMENTS
In this section, we compare the results of our proposed FKDB algorithm with those of other BNCs and some state-of-the-art algorithms, such as support vector machine, logistic regression and neural network. Each algorithm is tested on each dataset using 10 rounds of 10-fold cross validation. Probability estimates are smoothed using m-estimation with m = 1 [32].
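For readers unfamiliar with m-estimation, a common form is sketched below. The text does not state which prior is used, so the uniform prior $1/v$ over the $v$ values of an attribute is an assumption, as is the function name.

```python
def m_estimate(count, total, n_values, m=1.0):
    """Smoothed estimate (count + m * prior) / (total + m) with a uniform prior."""
    prior = 1.0 / n_values
    return (count + m * prior) / (total + m)

# An attribute value never seen with a class still gets a small probability.
print(m_estimate(count=0, total=10, n_values=2))
```

With m = 1 the prior acts like a single virtual instance spread evenly over the attribute values, so the smoothed estimates over all values still sum to one.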
To allow the proposed algorithm to be compared with Weka's algorithms, missing values for qualitative attributes are replaced with modes and those for quantitative attributes are replaced with means from the training data. The performance is analyzed in terms of zero-one loss, bias and variance on 40 datasets from the UCI (University of California at Irvine) machine learning repository [34]. We employ the Win/Draw/Loss (WDL) record to interpret the results: when two algorithms are compared, we count the number of datasets on which one algorithm performs better than, equally well as, or worse than the other on a given measure. We assess a difference as significant if the outcome of a one-tailed binomial sign test is less than 0.05. Table 1 describes the details of each dataset used, including the number of instances, attributes and classes. These datasets are categorized by size: datasets with fewer than 1000 instances, with at least 1000 but fewer than 10000 instances, and with at least 10000 instances are denoted as small, medium and large, respectively. The experiments are conducted in the Weka work-branch (version 3.5.7) and the following algorithms are compared:
· NB, Naive Bayes.
· TAN, tree augmented Naive Bayes.
· AODE, averaged one-dependence estimators.
· K$_2$DB, k-dependence Bayesian classifier with k = 2.
· FK$_2$DB, flexible k-dependence Bayesian classifier with k = 2.
· SVM, support vector machine with default parameters.
· LR, logistic regression with default parameters.
· NN, neural network with default parameters.
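The WDL record and the one-tailed binomial sign test can be computed as follows. This is a sketch under assumptions: draws are taken to be exact ties and are excluded from the sign test, and the function names are hypothetical.

```python
from math import comb

def wdl(loss_a, loss_b):
    """Win/draw/loss counts of A against B, where a lower loss is better."""
    wins = sum(a < b for a, b in zip(loss_a, loss_b))
    losses = sum(a > b for a, b in zip(loss_a, loss_b))
    return wins, len(loss_a) - wins - losses, losses

def sign_test_p(wins, losses):
    """One-tailed p-value: P(at least `wins` successes in wins + losses fair coin flips)."""
    n = wins + losses
    return sum(comb(n, i) for i in range(wins, n + 1)) / 2 ** n

# Hypothetical per-dataset zero-one losses for two classifiers.
record = wdl([0.10, 0.20, 0.30], [0.20, 0.20, 0.10])
print(record, sign_test_p(record[0], record[2]))
```

A difference would then be called significant when `sign_test_p` returns a value below 0.05.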

A. ZERO-ONE LOSS
Zero-one loss is one of the most commonly used loss functions for measuring classification performance. Table 2 presents the zero-one loss results of the above algorithms on different datasets. Table 3 presents the corresponding WDL records. From Table 3 we can see that NB performs better than SVM and LR, but poorer than NN (13 wins and 24 losses).
As the structure complexity increases, the augmented edges help relax the independence assumption of NB. TAN and K$_2$DB show competitive generalization performance when compared with NN, and they still retain the advantage over SVM and LR. In contrast, FK$_2$DB shows a significant advantage over these classifiers: it beats SVM, NN and LR on 31, 20 and 27 datasets, respectively.
To clarify the effectiveness of the measurement of conditional entropy, we compare the classification performance of FK$_2$DB with that of K$_2$DB in terms of goal difference (GD) [35]. Given two classifiers $A$ and $B$, the value of GD can be computed as follows:
$$GD(A; B|T) = |win| - |loss|, \tag{10}$$
where $T$ denotes the set of datasets for comparison, and $|win|$ and $|loss|$ represent the number of datasets on which $A$ performs better or worse than $B$, respectively. Fig. 3 shows the fitting curve of $GD(\text{FK}_2\text{DB}; \text{K}_2\text{DB}|S_k)$ in terms of zero-one loss. The X-axis represents the indexes of different datasets, referred to as $k$, which correspond to the indexes described in Table 1, and the Y-axis represents the value of $GD(\text{FK}_2\text{DB}; \text{K}_2\text{DB}|S_k)$, where $S_k = \{D_j | j < k\}$ and $D_j$ denotes the dataset with index $j$. From Fig. 3 we can see that, across datasets of different sizes, FK$_2$DB enjoys a significant advantage over K$_2$DB in terms of zero-one loss. A notable case is Tic-tac-toe, on which the zero-one loss for K$_2$DB is 0.2035 while that for FK$_2$DB is 0.0689.
Conditional entropy is a more reasonable measure for learning BNCs, but it is not always better than conditional mutual information. Both can yield locally optimal but globally non-optimal solutions. Given $n$ attributes, there are $n!$ alternative orders, thus before learning the network topology BNCs often impose an ordering on the attributes implicitly (like TAN) or explicitly (like KDB). However, because the attributes are sorted one by one, the $m$ attributes that rank first in the order may not be the best $m$ attributes. Besides, the estimate of high-order conditional entropy may be biased due to the limited number of training instances. From the experimental results we can see that, among all 40 datasets, FK$_2$DB performs poorer than K$_2$DB on 4 datasets, and 3 of them (i.e., Contact-lenses, Promoters and Horse-colic) have fewer than 400 training instances.
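The goal-difference curve described above amounts to a running count of wins minus losses. The following sketch assumes that GD is exactly $|win| - |loss|$ and that datasets are supplied in Table 1 order; the loss values in the example are made up for illustration.

```python
def gd_curve(loss_a, loss_b):
    """curve[k] is the goal difference of A over B on the first k datasets."""
    curve, gd = [0], 0
    for a, b in zip(loss_a, loss_b):
        gd += (a < b) - (a > b)   # +1 for a win of A, -1 for a loss, 0 for a draw
        curve.append(gd)
    return curve

# Hypothetical zero-one losses of two classifiers on five datasets.
loss_a = [0.10, 0.25, 0.07, 0.30, 0.12]
loss_b = [0.12, 0.20, 0.09, 0.30, 0.15]
print(gd_curve(loss_a, loss_b))
```

A curve that keeps rising indicates that classifier A accumulates wins steadily rather than on one group of datasets only.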

B. BIAS AND VARIANCE
Bias-variance decomposition of zero-one loss was presented by Kohavi and Wolpert [33], from sampling theory statistics, for analyzing different learning scenarios. Bias denotes the systematic component of error, which describes how closely the learner is able to describe the decision surfaces for a domain. Variance describes the component of error that stems from sampling, which reflects the sensitivity of the learner to variations in the training sample. Tables 4 and 5 respectively report the bias and variance results, and corresponding WDL results are shown in Table 6.
134276 VOLUME 7, 2019
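A simplified version of the Kohavi-Wolpert estimates can be written as follows. The sketch assumes noise-free labels (so the intrinsic noise term is zero) and omits any finite-sample correction; the experimental protocol that produced Tables 4 and 5 is not specified here, so the input layout and function name are assumptions.

```python
from collections import Counter

def kw_bias_variance(true_labels, predictions_per_run):
    """predictions_per_run[r][i] is the label predicted for test instance i
    by the model learned from training sample r."""
    runs = len(predictions_per_run)
    bias2 = variance = 0.0
    for i, y in enumerate(true_labels):
        votes = Counter(run[i] for run in predictions_per_run)
        p = {label: cnt / runs for label, cnt in votes.items()}
        sum_sq = sum(v * v for v in p.values())
        p_true = p.get(y, 0.0)
        # bias^2 = 1/2 * sum_y (P_target(y) - P_learner(y))^2, target is a point mass on y
        bias2 += 0.5 * ((1.0 - p_true) ** 2 + sum_sq - p_true ** 2)
        # variance = 1/2 * (1 - sum_y P_learner(y)^2)
        variance += 0.5 * (1.0 - sum_sq)
    n = len(true_labels)
    return bias2 / n, variance / n

truth = [0, 1]
runs = [[0, 1], [0, 0], [1, 1], [0, 1]]   # four training samples, two test instances
bias2, variance = kw_bias_variance(truth, runs)
print(bias2, variance)
```

Under the noise-free assumption the two terms add up to the average zero-one loss, which gives a quick sanity check for any implementation.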
The nature of BNC ensembles (e.g., AODE) lends itself to scalable parallelization and overcomes the limitations of single-model BNCs in two prevalent directions, i.e., to diversely generate BNC components, and to sparsely combine multiple BNCs. High-dependence BNCs (e.g., K$_2$DB) can represent more conditional dependencies than low-dependence ones. Thus both K$_2$DB and AODE perform better than NB and TAN in terms of bias. When compared to other classifiers, they perform better than SVM and LR, but poorer than NN. NN performs the best among all classifiers in terms of bias, although its advantage over FK$_2$DB is not significant (17 wins and 19 losses).
In terms of variance, the network topologies of NB and AODE are fixed regardless of variation in the training data, so NB and AODE perform the best among all BNCs. A limited number of training instances may lead to biased estimates of high-order probability distributions, thus a relatively complex network topology increases the possibility of overfitting, which may result in high variance. TAN performs better than FK$_2$DB and K$_2$DB. When compared to other classifiers, FK$_2$DB performs better than NN and LR, but poorer than SVM. By comparing FK$_2$DB and K$_2$DB in terms of bias and variance, we can clearly see that the heuristic search strategy and the measurement of conditional entropy help FK$_2$DB achieve a tradeoff between bias and variance: FK$_2$DB enjoys a significant advantage over K$_2$DB in terms of bias (16 wins and 8 losses) and variance (17 wins and 8 losses).

C. ANALYSIS OF CLASSIFICATION AND TRAINING TIME
The results of average classification and training time for all the compared algorithms are shown in Fig. 4, where the X-axis represents the algorithms, and the Y-axis corresponds to the time of the algorithm.
All experiments are conducted on a desktop computer with a 64-bit Intel(R) Core(TM) i5-8300H CPU @ 2.3 GHz and 8 GB of memory. Each bar represents the sum of time on 40 datasets in the 10-fold cross validation. Fig. 4 indicates that NB and AODE need negligible time for training, and that, as structure complexity increases, high-dependence BNCs (e.g., K$_2$DB) need more time than low-dependence BNCs (e.g., TAN). FK$_2$DB takes more time for training because it needs to build a high-order maximum weighted spanning tree in terms of high-order conditional entropy. The ensemble learning strategy imposes a high classification time on AODE. In contrast to BNCs, NN, SVM and LR take far more time for training.

V. CONCLUSION AND FUTURE WORK
In many real-world applications, although the application of conditional mutual information for learning BNCs presents strong asymptotic guarantees, it does not necessarily optimize the classification performance. Our analysis suggests that using conditional entropy to measure the causal relationships between attributes may be more helpful for learning the network topology. In the experiments, the final BNC achieves a trade-off between classification performance and structure complexity. To reduce the search space, the attributes are sorted one by one by comparing conditional entropy, which may result in locally optimal but globally non-optimal solutions. It remains a direction for future research to explore techniques for sorting attribute values in each instance. Such a variant of FKDB would show excellent flexibility in representing the diversity of conditional dependencies in different situations.