Markov Boundary Learning With Streaming Data for Supervised Classification

In this paper, we study the problem of Markov boundary (MB) learning with streaming data. The MB is a crucial concept in Bayesian networks (BNs) and plays an important role in BN structure learning. In addition, in the supervised learning setting, the MB of a class attribute is the optimal feature set for classification. Almost all existing MB learning algorithms focus on static data, and few efforts have been made to learn MBs with streaming data. In this paper, by linking dynamic AD-trees with streaming data, we propose a new SDMB (streaming data-based MB) algorithm for learning MBs with streaming data. Specifically, given a target variable, SDMB employs a dynamic AD-tree to summarize the historical data, then sequentially learns the MB of the target from all available data by calculating independence tests on the dynamic AD-tree. In experiments on synthetic and real-world data sets, we evaluate the SDMB algorithm against state-of-the-art online feature selection algorithms and data stream mining methods, and the results validate its effectiveness.


I. INTRODUCTION
The notion of a Markov boundary (MB) was coined by Pearl in the context of Bayesian networks (BNs) [21]. If the distribution generating a data set can be faithfully represented by a BN, the MB of a node (in this paper, we use feature, variable, and node interchangeably) is unique and consists of the parents (direct causes), children (direct effects), and spouses (i.e., the other parents of the node's children) of the node in the BN. Figure 1 gives an example of an MB in the lung cancer BN [14]. The MB of Lung cancer consists of Smoking and Genetics (parents), Coughing and Fatigue (children), and Allergy (spouse).
As can be seen in Figure 1, the MB of a node provides a complete picture of the local causal relationships between the node and the nodes in its MB. Thus learning MBs plays an essential role in local causal discovery [9], [38] and large-scale Bayesian network learning [8], [31]. For example, we can use MB learning algorithms to find the MB of each variable in a data set, construct a skeleton of a BN structure (i.e., an undirected graph) to reduce the BN structure search space, and then orient the edges in the skeleton.
The associate editor coordinating the review of this manuscript and approving it for publication was Jihwan P. Choi.

In addition, conditioning on the MB of a node, the node is independent of all other nodes. For example, given the MB of lung cancer in Figure 1, the lung cancer node is conditionally independent of all nodes outside its MB. Thus the MB is the theoretically optimal solution to feature selection [28], [37], and learning the MB of a class variable is in fact a feature selection procedure [36], [39]. Accordingly, over the past decade, many algorithms have been proposed for learning the MB of a given feature without learning an entire BN structure. However, almost all MB learning algorithms were designed for static data (i.e., they work in a batch fashion).
Streaming data is ubiquitous in many real-world applications, such as log files in web applications, clickstream records in e-commerce, stock data streams, and information from social networks. Streaming data is continuously generated over time and may thus be unbounded in volume. Due to limited memory capacity, streaming data must be processed incrementally without access to all of the data, so it is challenging to summarize historical streaming data for incremental MB learning in a single scan. Moreover, as new instances continuously arrive, the distribution of streaming data may change over time, so a feature may be a member of the MB of a target feature at some times but not at others.
To tackle these problems, we make the following contributions in this paper.
• We present a dynamic AD-tree to summarize historical streaming data. Then by linking the dynamic AD-tree with contingency tables, we describe the MakeCT algorithm to calculate dependence/independence relations between variables.
• Based on the dynamic AD-tree and the MakeCT algorithm, we propose a new SDMB (streaming data-based MB) algorithm to learn the MB of a given target variable from data streams. Specifically, as new data samples arrive, SDMB first summarizes them into the dynamic AD-tree; then, by linking the updated dynamic AD-tree with contingency tables, SDMB calculates independence tests via the MakeCT algorithm to learn the MB of the given target variable.
• In the experiments, we validate SDMB for standard MB learning with synthetic streaming data generated from benchmark Bayesian networks, and for feature selection for classification against several state-of-the-art online feature selection algorithms and data stream classification methods (without feature selection) using synthetic and real-world data sets.

The paper is organized as follows. Section II reviews related work. Section III introduces the related notations and definitions. Section IV proposes the SDMB algorithm, and Section V gives experimental results. Section VI concludes the paper and discusses ongoing work.

II. RELATED WORK
In the past decade, many algorithms have been proposed for learning the MB of a variable of interest from data. Margaritis and Thrun [17] proposed the GSMB algorithm, the first provably correct MB learning algorithm under the faithfulness assumption.
Several variants were proposed to improve the reliability of GSMB, such as IAMB [28], Inter-IAMB [30], and Fast-IAMB [35]. These algorithms learn the PC (parents and children) and spouses of the variable simultaneously and do not distinguish PC from spouses. They require a number of instances exponential in the size of the MB, so the quality of the MBs they find degrades greatly in practical settings with finite samples.
To tackle these drawbacks, HITON-MB [1], MMMB [29], PCMB [22], STMB [10], BAMB [15] and EEMB [32] were proposed, which take two separate steps to find the MB of a target variable: (1) discovering the PC of the target and then (2) finding its spouses. More details of existing MB learning algorithms can be found in the recent survey paper [36] and references therein.
However, all of those algorithms need to know all data samples in advance and cannot directly tackle streaming data. Thus in this paper, we will address the problem of learning MBs with streaming data.
Mining data streams has been well studied in the past decades [7], [23]. Many data stream mining algorithms have been designed for classification, and these algorithms can be roughly grouped into several categories: instance-based learning methods, decision tree learning methods, ensemble learning methods and clustering methods.
Each of these categories contains a few notable algorithms. For example, SAM [16], ML-SAM-kNN [24] and MLSAMPkNN [25] are typical instance-based learning methods. For ensemble learning, the most common approaches are Bagging and Boosting, first adapted to data streams by Oza and Russell [20], as in OBA [19] and LB [4]. In many data mining settings, ensemble learning methods achieve the best performance, such as ARF [12], [13], RED-POS [11] and KUE [6]. ERulesD2S [5] is a competitive rule-based classification method. In addition, Massive Online Analysis (MOA) is a popular platform for streaming data mining that contains a variety of stream mining algorithms [2], [3].

III. NOTATIONS AND DEFINITIONS
In this section, we will introduce some basic definitions and notations frequently used in this paper.

A. MARKOV BOUNDARY AND BAYESIAN NETWORK
Let V = {V_1, ..., V_N} contain N observed features (nodes), P be a discrete joint probability distribution over V, and G represent a directed acyclic graph (DAG). We call the triplet <V, G, P> a Bayesian network if it satisfies the Markov condition: every variable is independent of any subset of its non-descendant variables conditioned on its parents in G [21]. In a Bayesian network <V, G, P>, we assume that the conditional independence relationships (or causal relationships) between variables can be represented by a DAG G containing only directed edges (→). In the following, V_i ⊥⊥ V_j | Z denotes that V_i and V_j are statistically independent given Z ⊆ V \ {V_i, V_j}, while V_i ⊥̸⊥ V_j | Z denotes that V_i and V_j are dependent given Z. The notation V \ {V_i, V_j} means all variables in V excluding both V_i and V_j.
Definition 1 (Faithfulness [21]): Given a BN <V, G, P(V)>, G is faithful to P(V) if and only if the conditional independencies present in P(V) are exactly those entailed by G under the Markov condition. P(V) is faithful if and only if there exists a DAG G that is faithful to P(V).
The faithfulness assumption establishes a relation between a probability distribution P and its underlying DAG G. Under this assumption, we can use conditional independence tests, instead of d-separation in the graph [21], to find all dependencies or independencies entailed by a Bayesian network.
Definition 2 (Markov Boundary, MB [21]): Under the faithfulness assumption, in a Bayesian network, the MB of a variable V_i, denoted MB(V_i), is unique and consists of the parents, children, and spouses (other parents of the children) of V_i.
Given the MB of a node, the node is probabilistically independent of all other nodes outside its MB, and thus the MB is the theoretically optimal solution to feature selection [28], [37].

B. AD-TREE
For every instance x_t ∈ R^d, let D denote the set {x_t | x_t ∈ R^d, t, d ∈ N+}, which represents an entire streaming data set. We assume the streaming data set D arrives one block at a time, that is, D = D_1 ∪ D_2 ∪ ···, where D_i is the i-th data block at time i and has R rows and M attributes {V_1, ..., V_M}.
An assignment of values to a subset of the attributes, e.g., (v_1 = 1, v_2 = 2), is called a query, and the count of a query, C(·), denotes the number of instances in a data set that match the query. An all-dimensions tree (AD-tree) is a data structure that can efficiently store all the count information of a large data set [18]. An AD-tree consists of two types of nodes:
• AD-node: it represents a possible query of the data set and stores the count of that query.
• Vary-node: it specializes an attribute V_i; each of its child AD-nodes fixes a value of V_i.
In Figure 2, the rectangle nodes represent AD-nodes and the oval nodes denote Vary-nodes. In D_1, V_1 and V_2 have 2 and 3 distinct values respectively, i.e., n_1 = 2 and n_2 = 3.
The AD-tree structure in Figure 2 saves all the data information of D_1 in the form of queries. In the structure, the root node is an AD-node representing the query (v_1 = *, v_2 = *) (i.e., all instances), with count 8. Starting from the root node, rectangle AD-nodes and oval Vary-nodes unfold alternately layer by layer. For example, in Figure 2, when expanding with the variable V_1, since n_1 = 2, the Vary-node V_1 has two child AD-nodes with constraints v_1 = 1 and v_1 = 2.
In an AD-tree, a Vary-node may have multiple child AD-nodes, each representing a query. The most common value (MCV) node is the child AD-node whose query has the maximum count; for example, in Figure 2, the Vary-node V_1 has two child AD-nodes, and the one with the larger count is the MCV node. To save memory space, a NULL node replaces an AD-tree node or a subtree in two cases. One is an AD-node whose query count is 0. The other is the subtree of an MCV node; the counts of the queries in the subtree of an MCV node can be recovered exactly, as introduced in the next section. In Figure 2, the nodes in the dotted box, and the subtrees of these nodes, are set to NULL in the AD-tree structure.
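As an illustration, the AD-tree structure described above can be sketched in Python. This is our own minimal sketch, not the paper's implementation: every node stores the count of its query and, per attribute, a Vary-node whose MCV child is pruned to None, as in the NULL-node convention above.

```python
def build_adnode(rows, attrs):
    """Build an AD-node: store the count of the current query and, for each
    remaining attribute, a Vary-node mapping values to child AD-nodes.
    The most-common-value (MCV) subtree is pruned to None; its counts are
    recoverable later by subtraction from the parent's counts."""
    node = {"count": len(rows), "vary": {}}
    for i, attr in enumerate(attrs):
        groups = {}
        for row in rows:
            groups.setdefault(row[attr], []).append(row)
        mcv = max(groups, key=lambda v: len(groups[v]))  # most common value
        children = {}
        for value, sub in groups.items():
            # NULL (None) replaces the MCV subtree to save memory
            children[value] = None if value == mcv else build_adnode(sub, attrs[i + 1:])
        node["vary"][attr] = {"mcv": mcv, "children": children}
    return node

# toy data block with two attributes, loosely mirroring the paper's D1 example
D1 = [{"V1": 1, "V2": 1}, {"V1": 1, "V2": 2}, {"V1": 2, "V2": 3},
      {"V1": 1, "V2": 1}, {"V1": 2, "V2": 2}, {"V1": 1, "V2": 3},
      {"V1": 1, "V2": 2}, {"V1": 2, "V2": 1}]
tree = build_adnode(D1, ["V1", "V2"])
print(tree["count"])              # 8: the root query (v1 = *, v2 = *) matches all rows
print(tree["vary"]["V1"]["mcv"])  # 1: the most common value of V1 (count 5)
```

Note that zero-count queries never appear at all, matching the first NULL-node case, while the MCV subtree is explicitly pruned, matching the second.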
To deal with dynamic data, based on the AD-tree introduced above, a dynamic AD-tree [26] was proposed that is able to update the count of each query as new data instances arrive. We link MB learning with streaming data via the dynamic AD-tree in the next section.

IV. PROPOSED SDMB ALGORITHM
Almost all existing MB learning algorithms focus on static data and cannot tackle streaming data. In this section, using the dynamic AD-tree proposed in [26] to bridge the gap between streaming data and MB learning, we propose the SDMB (Streaming Data-based MB) algorithm to learn the MB of a target variable of interest from streaming data.

A. OVERVIEW OF SDMB
Given a streaming data set D = {D_1 ∪ ··· ∪ D_{i−1} ∪ D_i ∪ ···} and assuming that MB(T)_{i−1} is the MB of the target T ∈ V obtained at time i − 1, Figure 3 gives the flow chart of the SDMB algorithm, which summarizes its main idea. When a data block D_i arrives, SDMB first updates the counts of the queries in the current dynamic AD-tree, then calculates MB(T)_i using the updated AD-tree. In the MB updating step, to calculate MB(T)_i, SDMB first adds new features to MB(T)_{i−1}, then removes outdated variables from it in light of the newly arrived data block.

Algorithm 1 The SDMB Algorithm
Require: data blocks D_i, the target variable T
Ensure: MB(T) of the target T
1: j ← 0; MB(T)_0 ← ∅; ADT_0 ← an empty AD-tree
2: repeat
3:    j ← j + 1
4:    /*Phase 1: summarize D_j into the dynamic AD-tree*/
5:    for each instance x ∈ D_j do
6:       update the counts of the queries matched by x
7:    end for
8:    restore NULL nodes whose counts become nonzero
9:    update the MCV nodes
10:   ADT_j ← the updated AD-tree
11:   /*Phase 2: update the MB using ADT_j*/
12:   MB(T)_j ← MB(T)_{j−1}
13:   while MB(T)_j changes do
14:      S ← V \ (MB(T)_j ∪ {T}); MakeCT(S ∪ {T}, root-node)
15:      for each attribute X ∈ S do
16:         calculate dep(X, T | MB(T)_j)
17:      end for
18:      Y ← argmax_{X ∈ S} dep(X, T | MB(T)_j)
19:      if Y ⊥̸⊥ T | MB(T)_j then MB(T)_j ← MB(T)_j ∪ {Y}
20:      end if
21:      /*Phase 2.2: remove false positives*/
22:      MakeCT(MB(T)_j ∪ {T}, root-node)
23:      for each attribute X ∈ MB(T)_j do
24:         if X ⊥⊥ T | MB(T)_j \ {X} then
25:            MB(T)_j ← MB(T)_j \ {X}
26:         end if
27:      end for
28:   end while
29:   return MB(T)_j
30: until no more D_i are coming

Based on the flow chart in Figure 3, the pseudo-code of the SDMB algorithm is described in Algorithm 1. SDMB mainly includes two phases. Phase 1 (Lines 4 to 10) summarizes the newly arrived data blocks into the dynamic AD-tree, and Phase 2 (Lines 11 to 29) updates the MB using the AD-tree updated in Phase 1.
In Phase 1, before the data block D_1 comes, both MB(T)_0 and the AD-tree ADT_0 are empty. When D_1 arrives, SDMB first constructs an initial AD-tree ADT_1 from D_1 and calculates MB(T)_1 using ADT_1. When D_i arrives, SDMB summarizes D_i into ADT_{i−1} to obtain ADT_i. Due to the newly arrived data blocks, the query counts in ADT_i change. In Phase 2, to update MB(T)_{i−1}, SDMB first sets MB(T)_i = MB(T)_{i−1}, then checks whether new features can be added to MB(T)_i (Lines 13 to 21) and whether old features can be removed from MB(T)_i (Lines 22 to 27). Phase 1 and Phase 2 alternate until no new data blocks come. Sections IV-B and IV-C below give detailed descriptions of the two phases.
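The add/remove logic of Phase 2 can be sketched as a grow-then-shrink loop. The sketch below is ours, with all names hypothetical: `is_independent(X, T, Z)` stands in for the G^2 test that SDMB computes from the dynamic AD-tree.

```python
def update_mb(all_vars, target, mb_prev, is_independent):
    """One round of SDMB's Phase 2 (sketch): grow the previous MB with
    variables that are now dependent on the target, then shrink away
    variables that have become conditionally independent.
    is_independent(X, T, Z) is an assumed conditional-independence oracle."""
    mb = set(mb_prev)
    # Phase 2.1: add candidate members dependent on the target given the current MB
    for x in all_vars:
        if x == target or x in mb:
            continue
        if not is_independent(x, target, frozenset(mb)):
            mb.add(x)
    # Phase 2.2: remove false positives that are independent given the rest of the MB
    for x in list(mb):
        if is_independent(x, target, frozenset(mb - {x})):
            mb.discard(x)
    return mb

# toy oracle: T depends on A and B only, regardless of the conditioning set
dep = {"A", "B"}
oracle = lambda x, t, z: x not in dep
print(sorted(update_mb(["A", "B", "C", "T"], "T", {"C"}, oracle)))  # ['A', 'B']
```

In the toy run, the stale member C (selected from earlier blocks) is dropped and the now-dependent A and B are added, mirroring how the MB tracks distribution changes across data blocks.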

B. SUMMARIZING NEWLY ARRIVED DATA BLOCKS
In Phase 1 (Lines 4 to 10), we use the dynamic AD-tree proposed in [26] to sequentially summarize historical data blocks in one scan. Initially, the AD-tree ADT_1 is an empty tree. When the first data block D_1 comes, SDMB builds ADT_1 by summarizing all queries in D_1. After that, the AD-tree is updated each time a new data block D_i arrives. Based on Figure 2, Figure 4 gives an example of how to update an AD-tree when a new data block D_2 comes. The AD-tree in Figure 2 (called ADT_1) is built from the data block D_1. When the data block D_2 comes, ADT_1 is updated to the AD-tree shown in Figure 4 (called ADT_2). The AD-tree update focuses on AD-nodes. For example, the AD-nodes of the leftmost subtree of ADT_2 are updated as follows.
First, arrow A shows that the count value of the root node is updated to 16, as 8 records newly arrive in D_2.
In depth-first order, we first update the child AD-nodes of the Vary-node V_1 (arrow B): for these two AD-nodes, C(v_1 = 1, v_2 = *) = 5 and C(v_1 = 2, v_2 = *) = 3 are updated to 6 and 7, respectively. A new AD-node is also built because a new value of V_1 (i.e., v_1 = 3) appears in D_2. Meanwhile, the MCV node under the Vary-node V_1 changes to the AD-node with C(v_1 = 2, v_2 = *) = 7; for the technical details of how to update the MCV node, see [26].
Finally, at arrow C, the nodes under the AD-node (v_1 = 1, v_2 = *) are restored from NULL, and as the new data block D_2 arrives, the child AD-nodes of the Vary-node V_2 are updated. If several AD-nodes have the same maximum count, we set the leftmost one as the MCV node.
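The incremental update above can be sketched with a flat query-count table; the real dynamic AD-tree stores the same counts in pruned tree form, and the function and variable names here are ours.

```python
from collections import Counter
from itertools import combinations

def update_counts(counts, block, attrs):
    """Incrementally fold a new data block into the running query counts.
    Every subset of attribute-value constraints matched by an instance is a
    query; its count grows as blocks arrive, mirroring how a dynamic AD-tree
    updates its AD-nodes in one scan of the stream."""
    for row in block:
        for r in range(len(attrs) + 1):
            for subset in combinations(attrs, r):
                query = tuple((a, row[a]) for a in subset)
                counts[query] += 1
    return counts

counts = Counter()
D1 = [{"V1": 1, "V2": 1}, {"V1": 2, "V2": 2}]
D2 = [{"V1": 1, "V2": 2}]
update_counts(counts, D1, ["V1", "V2"])
update_counts(counts, D2, ["V1", "V2"])   # counts now cover D1 ∪ D2
print(counts[()])                         # 3: the root query matches all rows so far
print(counts[(("V1", 1),)])               # 2: instances with v1 = 1 across both blocks
```

The flat table makes the one-scan property explicit: each block is touched once, and historical blocks never need to be revisited to answer a count query.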
Since the dynamic AD-tree stores all historical streaming data, it provides the basis for incremental MB learning from streaming data. In the next section, we discuss how to learn MBs using dynamic AD-trees.

C. UPDATE MB
When a new data block D_j arrives, the current dynamic AD-tree is updated accordingly, e.g., the counts of the queries and the newly emerging state value of V_1 as shown in Figure 4. As a result, the MB of T found from the previous streaming data may change. Assume that MB(T)_{j−1} is the MB of T selected at time j − 1. At Line 13 and Line 22, to calculate the independence/dependence of T and the other variables in V, the G^2 test [27] is employed to calculate dep(X, T | MB(T)_j) (i.e., to decide X ⊥⊥ T | MB(T)_j or X ⊥̸⊥ T | MB(T)_j). The statistic of the independence test of V_i and V_j conditioning on V_k using the G^2 test is written as follows:

G^2 = 2 Σ_{a,b,c} S^abc_ijk ln( (S^abc_ijk · S^c_k) / (S^ac_ik · S^bc_jk) )    (1)
In Eq.(1), S^a_i represents the number of instances with v_i = a in a data set, S^ab_ij represents the number of instances with v_i = a and v_j = b (a ∈ {1, ..., n_i}, b ∈ {1, ..., n_j}), and S^abc_ijk represents the number of instances matching the query v_i = a, v_j = b and v_k = c; the marginals S^c_k, S^ac_ik and S^bc_jk are defined analogously. The G^2 statistic is asymptotically distributed as chi-square (χ^2) with the appropriate degrees of freedom f, calculated as f = (n_i − 1)(n_j − 1)n_k, where n_i is the number of values that V_i takes.
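The G^2 computation in Eq.(1) can be sketched directly from the query counts; the function below is our own illustration, with the marginals read off the joint counts just as they would be read off contingency tables.

```python
import math

def g2_statistic(counts):
    """G^2 statistic for testing V_i independent of V_j given V_k.
    `counts` maps (a, b, c) -> S_ijk^abc, the count of the query
    v_i = a, v_j = b, v_k = c; marginals are summed out below."""
    S_ac, S_bc, S_c = {}, {}, {}
    for (a, b, c), n in counts.items():
        S_ac[(a, c)] = S_ac.get((a, c), 0) + n
        S_bc[(b, c)] = S_bc.get((b, c), 0) + n
        S_c[c] = S_c.get(c, 0) + n
    g2 = 0.0
    for (a, b, c), n in counts.items():
        if n > 0:  # zero-count cells contribute nothing
            g2 += 2.0 * n * math.log(n * S_c[c] / (S_ac[(a, c)] * S_bc[(b, c)]))
    return g2

# perfect conditional independence: P(a, b | c) = P(a | c) P(b | c), so G^2 = 0
counts = {(a, b, 0): 25 for a in (0, 1) for b in (0, 1)}
print(round(g2_statistic(counts), 6))  # 0.0
```

The statistic would then be compared against a χ^2 critical value with f = (n_i − 1)(n_j − 1)n_k degrees of freedom to decide dependence.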
Assume S ⊆ V and let S = {V_{i(1)}, ..., V_{i(n)}}, which has n features, with V_{i(j)} the j-th attribute in S. S has a contingency table denoted by ct(V_{i(1)} ··· V_{i(n)}). For example, a contingency table for {V_1} is shown in Figure 5; it records the counts of all queries about V_1 (i.e., the values that V_1 takes in D_3). The four contingency tables in Figure 5 contain all possible queries in D_3 and the count of each query. Thus, if we calculate the G^2 statistic in Eq.(1) using D_3, values such as S^a_i and S^ab_ij can be read directly from the contingency tables shown in Figure 5. However, for a static data set, getting the contingency tables for the G^2 statistic requires scanning the entire data set to obtain the count of each query. For streaming data, all queries over the historical stream are stored in a dynamic AD-tree and can be retrieved from it. To calculate the G^2 test efficiently for streaming data, we link the dynamic AD-tree with the G^2 test through contingency tables and describe the MakeCT function in Algorithm 2.
MakeCT traverses the attributes {V_{i(1)}, ..., V_{i(n)}} in the tree structure; the traversal starts at the AD-node (ADN) given as the root of the (sub)tree, and then descends following one constraint after another, from V_{i(1)} to V_{i(n)} in sequence.
At Lines 2 to 3, if the attribute list is empty, MakeCT returns a one-element contingency table containing the count associated with the current AD-node. At Lines 5 to 14, the Vary-node of V_{i(1)}, a child of ADN, has n_{i(1)} distinct values; in the iteration over k from 1 to n_{i(1)}, the count of the MCV child cannot be read directly, so it is skipped temporarily (Lines 8 to 9), while for the other AD-nodes under V_{i(1)}, MakeCT({V_{i(2)}, ..., V_{i(n)}}) is called recursively (Lines 11 to 12). At Line 15, the contingency table CT_MCV is calculated by the row-wise subtraction in Eq.(2) [18]:

CT_MCV = CT_ADN − Σ_{k ≠ MCV} CT_k    (2)
As an example of Eq.(2), consider the MCV node (v_1 = 1, v_2 = *) in Figure 4, whose child nodes are NULL. The count of a query such as (v_1 = 1, v_2 = 1) can be calculated by subtracting the corresponding counts of the stored sibling subtrees from those of the parent AD-node.
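The row-wise subtraction of Eq.(2) can be sketched as follows; the counts below are made up for illustration and are not the actual counts of Figure 4.

```python
def mcv_counts(parent_counts, sibling_counts):
    """Recover the contingency table of a pruned MCV node by row-wise
    subtraction (Eq.(2)): CT_MCV = CT_parent - sum of the stored sibling
    contingency tables."""
    return {row: parent_counts[row] - sum(ct.get(row, 0) for ct in sibling_counts)
            for row in parent_counts}

# parent AD-node (v1 = *): counts of V2's values over all instances (illustrative)
parent = {1: 6, 2: 5, 3: 5}
# stored (non-MCV) sibling subtrees: v1 = 2 and v1 = 3
siblings = [{1: 2, 2: 1, 3: 2}, {1: 1, 2: 2, 3: 1}]
# pruned MCV node v1 = 1: its V2 counts come back exactly by subtraction
print(mcv_counts(parent, siblings))  # {1: 3, 2: 2, 3: 2}
```

Because the subtraction is exact, pruning the MCV subtree loses no information, which is what allows the dynamic AD-tree to summarize the stream without loss.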

D. COMPLEXITY ANALYSIS
In this section, we analyze the time complexity of the proposed SDMB algorithm. Since SDMB consists of two phases, we give the time complexity of each phase in turn.

Phase 1: making/updating the AD-tree. Each node must go through the corresponding instances when the AD-tree is made or updated, so the time complexity of Phase 1 is measured by the number of instances that need to be traversed. Without loss of generality, assume that all attributes are binary; then the memory cost of the sparse AD-tree shown above is O(2^M). For a data set with R rows, the AD-tree has at most log_2 R levels, and an AD-node at level k has to traverse about R·2^{−k} instances [18]. When the AD-tree is updated to absorb an arriving data block D_i with R rows, the MCVs may change, and the corresponding sub-AD-trees may need to be rebuilt. In the worst case the MCV of every node changes, so updating the AD-tree costs as much as rebuilding it from all the data seen so far.

Phase 2: learning the MB. Almost all MB learning algorithms calculate the dependence/independence between two variables using conditional independence tests (CI tests) [32], so we use the number of CI tests to represent the time complexity of Phase 2. For finding candidate members (Phase 2.1), at most |M| features are added to MB(T), and each addition performs |M| CI tests, so in the worst case the entire addition phase takes O(|M|^2) CI tests. Similarly, removing false positives (Phase 2.2) takes at most O(|M|) CI tests. Thus, the total time complexity of Phase 2 is O(|M|^2).

The total time complexity of the SDMB algorithm is the sum of the two phases. It can be seen that the building time is shorter than the updating time, and the time of Phase 2 is generally much shorter than that of Phase 1. The specific time records can be found in Section V-D below.

E. TRACING SDMB
In this section, we use the lung-cancer network in Figure 1 to trace the execution of SDMB, as shown in Figure 7. Lung-Cancer (in black) is the target feature and MB(L) = {A, C, S, F, G} (in blue).
As D_i comes, we update the current AD-tree and set MB(L)_i = MB(L)_{i−1}. Then we add/remove features to/from MB(L)_i. The implementation details are described as follows.

VOLUME 8, 2020

V. EXPERIMENTS
In this section, we evaluate the performance of the SDMB algorithm using synthetic and real-world data sets and report the experimental results in Sections V-A to V-D. We compare SDMB with four online feature selection algorithms, OFS [33], SOFS [34], B-AMD and B-ARDA [40], and four data stream classification methods, OBA [19], LB [4], ARF [12] and KUE [6]. We evaluate the MB selected by SDMB against these comparison algorithms using prediction accuracy.

A. EXPERIMENTS OF SDMB WITH SYNTHETIC STREAMING DATA
In the benchmark Bayesian networks we know the MB of each variable, so we can observe how the performance of the SDMB algorithm changes as new data blocks come. We generate four data sets from the corresponding benchmark Bayesian networks, as shown in Table 1. Each data set has 100,000 data samples; we split each into 100 data blocks of 1,000 samples each. Assuming that the data blocks come one by one, we sequentially evaluate the SDMB algorithm using the following metrics. For each network, we find the MBs of all variables and report the averages of the metrics over all variables. Since no algorithms have been proposed to learn MBs with streaming data so far, to validate SDMB with synthetic data we also run SDMB in a batch fashion, called SDMB-b. Whereas SDMB learns MBs upon each data block arriving, SDMB-b assumes that all data blocks are known in advance and calculates MBs using all of these data blocks at once. Figure 8 shows that, as data blocks sequentially arrive, the Precision, Recall and F1 values of SDMB fluctuate over time on the four Bayesian networks. These fluctuations illustrate that as new data blocks come, the distribution of the previously seen data may change over time. Thus, as a new data block comes, SDMB may add new features to, or remove old features from, the currently selected MB set.
In Table 2, using the Alarm network, we compare SDMB with SDMB-b. We do not report the results on the other three networks since the performance of SDMB is likewise identical to that of SDMB-b. In Table 2, in the row of 30,000 data records, the Precision, Recall and F1 results of SDMB are achieved by assuming that the data blocks sequentially arrive with 10,000, 10,000, and 10,000 data records, respectively, while those of SDMB-b are obtained using all 30,000 data records at once. The remaining rows of Table 2 are read in the same way. From Table 2, we can see that the results of SDMB-b are exactly the same as those of SDMB. This shows that SDMB can deal with streaming data without degrading performance compared to its batch version, and that the dynamic AD-tree summarizes all historical data information exactly, without any loss. The findings in Table 2 also suggest a new way to deal with data sets with very large numbers of samples: treat the large-scale data set as streaming data and use dynamic AD-trees to summarize the data information.

B. COMPARISON WITH FOUR ONLINE LEARNING ALGORITHMS
In this section, we compare SDMB with four state-of-the-art online feature selection algorithms, OFS, SOFS, B-AMD and B-ARDA, for classification using four real-world data sets (Chess, Mushroom, Connect-4 and Poker) and four synthetic data sets (Alarm, Child, Insurance, and Water), as shown in Table 3. In Alarm, Child, Insurance, and Water, we selected the 4th, 20th, 25th and 31st nodes, respectively, as the class attribute. Each data set has a roughly uniform ratio of positive and negative classes and no missing values. OFS, SOFS, B-AMD and B-ARDA need the number of selected features to be specified in advance, so we vary the number of selected features from 1 to 15 and report the corresponding classification accuracy, as shown in Figures 9 to 15. We select the best classification accuracy of each algorithm as its accuracy for comparison with SDMB. SDMB does not need the number of selected features to be specified in advance. We use three classifiers, KNN, NB, and SVM, to evaluate the MB selected by SDMB.
In the experiments, to mimic continuously arriving streaming data, we fixed the training and testing samples as shown in Table 3. For the Mushroom and Chess data sets, we divide the training data into data blocks of 500 samples each. For the remaining six data sets, we partition the training data into blocks of 1,000 samples each.
In Table 4, the best results are highlighted in bold face. We can see that SDMB is significantly better than the other four online feature selection algorithms on both the synthetic and the real-world data sets. This indicates that SDMB can find the most predictive features even under distribution shifts. From Figures 9 to 15, we can see that the accuracies of OFS, SOFS and B-AMD do not change much as the number of selected features increases, while B-ARDA fluctuates more than OFS, SOFS, and B-AMD. A possible explanation is that OFS, SOFS, and B-AMD are not able to tackle distribution shifts, while B-ARDA handles them better. Thus, among OFS, SOFS, B-AMD and B-ARDA, B-ARDA is the best.
Moreover, unlike OFS, SOFS, B-AMD and B-ARDA, SDMB does not need the number of selected features determined in advance. From Figures 9 to 15, we can see that it is not easy to determine the number of selected features for B-ARDA, especially with streaming data.
For the synthetic data sets generated from Alarm, Insurance and Water, we found that the MB learnt by SDMB is completely correct, so SDMB achieves the highest classification accuracy. For Child, there are 4 nodes in the true MB; SDMB found them together with 4 non-MB nodes, and the prediction accuracy on this data set is relatively low. SDMB and B-ARDA achieve the best prediction accuracy (86.20%) when selecting only one feature on the Water data set, as shown in Figure 15.
SDMB also performs well on the real-world data sets, especially Mushroom, where it selects 3 features (odor, spore-print-color and population) that work very well. On Chess, the prediction accuracy of the four comparison algorithms (except for B-ARDA) increases and then stabilizes as the number of selected features grows. For Connect-4, the performances of all the algorithms are similar, not high, and tend to stabilize once the number of features reaches four.

C. EXPERIMENTS ON DATA STREAM MINING
In this section, we evaluate the SDMB algorithm on data stream mining and report the experimental results in the following two subsections. We first compare SDMB with four state-of-the-art data stream classification methods; then, we conduct experiments on concept drift to explicitly evaluate the adaptation of SDMB to various types of concept drift in the stream.

1) COMPARISON WITH FOUR DATA STREAM MINING CLASSIFICATION METHODS
In this subsection, we compare SDMB with four state-of-the-art data stream classification methods: OBA, LB, ARF and KUE. We compare them by classification accuracy on real-world and synthetic data sets; the details of these data sets are shown in Table 3. For the four synthetic data sets, we selected the 4th, 20th, 25th, and 31st nodes as the class attribute, respectively.
We execute SDMB as in Section V-B, and execute the OBA, LB, ARF and KUE classification methods within the MOA framework with default classifier parameters. For the Mushroom and Chess data sets, training is performed every 500 data samples; for the other data sets, every 1,000.
In Table 5, the best results are highlighted in bold. We can observe that SDMB is generally superior to the other four data stream classification methods on the synthetic data sets, which are mainly used as benchmark data sets for MB learning. For Child, SDMB learns the 4 true MB nodes together with 4 non-MB nodes, so the prediction accuracy on this data set is relatively low. The classification performance of SDMB on Chess and Mushroom is slightly better than the data stream classification methods, but it is worse on Connect-4 and Poker. A possible explanation is that SDMB is not able to tackle distribution drift as well as OBA, LB, ARF and KUE. Among OBA, LB, ARF and KUE, KUE is the best.

2) EXPERIMENTS OF SDMB ON CONCEPT DRIFT
In this subsection, we execute SDMB on three data sets and record the MBs of the target node. These three data sets are generated by RandomTreeGenerator in MOA and contain a no-drift, a gradual-drift and an abrupt-drift data stream with 10,000 instances each. For the gradual drift, there is a drift centered around instance 5,000 with a window of 2,000 instances; for the abrupt drift, there is a drift centered around instance 5,000 with a window of 1 instance.
In Table 6, we report the MBs learned by SDMB on the three types of concept drift, where each D_i sequentially arrives with 1,000 data instances. In Table 6, the number 1 indicates that the 1st attribute is a member of the MB learned by SDMB, and the remaining numbers are read in the same way.
In Table 6, the MBs that SDMB learned differently across the three data sets for each D_i are highlighted in bold. It can be seen that only on D_6, that is, after the arrival of the 6,000th to 7,000th instances, are the MBs learned on the data sets with concept drift different from those on the data set with no drift. Since the drift is centered around instance 5,000, this indicates that SDMB can reflect the concept drift, but is not very sensitive to it. On the gradual-drift and abrupt-drift data sets, the MBs learned by SDMB are exactly the same, which means that although SDMB cannot distinguish the types of concept drift, it can adapt to them. Because the AD-tree saves all the data information, each MB is learned from all historical data; thus SDMB can absorb the impact of concept drift, and accordingly it is not very sensitive to drift.

D. RUNNING TIME
In this section, we report the running time of SDMB and its comparison algorithms on the real-world and synthetic data sets. In Table 7, we show the running time of SDMB and its rivals. For SDMB, we recorded the total time of Phase 1, making/updating the AD-tree (denoted MK), and the total time of Phase 2, MB learning (denoted MB). Since the four online feature selection algorithms need the number of selected features specified in advance, they have multiple running times; we therefore report the time of the run that achieved the best classification accuracy. For the four data stream classification methods, we record the elapsed time displayed on the MOA platform.
The fastest times are highlighted in bold face in Table 7. We can see that the total time of SDMB is much longer than that of its rivals on each data set. SDMB spends the majority of its time on making/updating the AD-tree, while its MB learning time is roughly similar to that of the other algorithms. The dynamic AD-tree structure can completely save the streaming data information in a relatively fixed amount of memory, but at the cost of more time.
OFS is the fastest of the four online feature selection algorithms on most data sets, and KUE performs best in terms of time among the four data stream mining methods. Moreover, KUE and OFS are comparable in running speed.
In Table 8, we show the specific running time of the two phases of SDMB. We select the Alarm and Child data sets as examples and record the time of the two phases (MK/MB) when each D_i sequentially arrives with 10,000 data records.
The MK-time of D_1 is less than each subsequent MK-time, and the MK-times after D_1 are similar. This means that building the AD-tree is faster than updating it, and each update takes about the same time. Moreover, the time for MB learning is much less than that for making/updating the AD-tree, which again indicates that the time of SDMB is mainly spent on making/updating the AD-tree. This also verifies the time complexity analysis above.

VI. CONCLUSION AND ONGOING WORK
In this work, to learn the MB of a given target with streaming data, we proposed the SDMB algorithm. SDMB employs dynamic AD-trees to process and summarize streaming data, and sequentially learns the MB of the target as new data samples arrive using the dynamic AD-trees. The experimental results show that SDMB can tackle MB learning with streaming data: it outperforms state-of-the-art online feature selection methods for feature selection, and performs as well as state-of-the-art data stream classification methods on real-world data sets. Since SDMB needs to reconsider the discarded features each time new data samples arrive, it must summarize the information of these discarded variables in the dynamic AD-trees. As a result, the dynamic AD-trees are very memory-expensive and can store no more than about 50 variables, so SDMB is not able to deal with high-dimensional streaming data. Our ongoing work is developing new algorithms with new data structures to tackle streaming data with a large number of features, for both local causal structure learning and online feature selection.