Incremental Attribute Reduction Method Based on Chi-Square Statistics and Information Entropy

Attributes in datasets are usually not equally significant; some attributes are unnecessary or redundant. Attribute reduction, an important research issue in rough set theory, finds minimal subsets of attributes that preserve the classification ability of the whole dataset by removing unnecessary or redundant attributes. We use Chi-square statistics to evaluate the significance of condition attributes, which reduces the search space of attribute reduction and improves its speed. The conditional entropy of the relevant attributes is adopted as a heuristic function, and two decision table reduction algorithms, forward selection and backward deletion, are proposed to approach the optimal solution. On this basis, an efficient incremental attribute reduction method for dynamically changing datasets is obtained by preserving an intermediate variable: the observed frequency matrix of the joint events of each condition attribute and the decision attribute. Experimental results show that the proposed algorithms improve performance in terms of processing time.


I. INTRODUCTION
With the development of the Internet and the popularization of computer applications, datasets are growing rapidly in both the number of objects and the number of attributes. Processing all the data is difficult because of memory limitations, so it is necessary to compress datasets without affecting the mining results.
At present, there are two main approaches to reducing data dimensionality. One is to compress the original data by aggregating it and projecting it into a smaller space, such as principal component analysis [1]-[3] and singular value decomposition [4]-[6]. The other is to reduce dimensionality by deleting unnecessary or redundant attributes, for example attribute reduction based on rough sets, Chi-square statistics, clustering, neural networks, information entropy, support vector machines and so on [7]-[9].
Attributes in a dataset are usually not equally significant; some attributes are unnecessary or redundant. We expect to use as few attributes as possible without affecting the quality of the mining results. The minimum reduction is the smallest set of attributes that can distinguish the equivalence classes. A dataset generally has multiple attribute reductions, and the intersection of all reductions is called the attribute core. However, a dataset with n attributes has 2^n attribute subsets, so it is impossible to find the best subset of attributes by exhaustive search.
(The associate editor coordinating the review of this manuscript and approving it for publication was Mengchu Zhou.)
At present, attribute reduction methods can be divided into three classes: attribute reduction algorithms based on information entropy [10], [11], attribute reduction algorithms based on the discernibility matrix [12], [13], and attribute reduction algorithms based on the positive region [14], [15]. Algorithms based on the discernibility matrix achieve high accuracy, but their computational cost is high.
Attribute reduction methods remove redundant attributes and find a subset of attributes that maintains the same classification capability as the original data [16]-[18]. Methods based on equivalence relations and equivalence classes are only suitable for datasets with discrete attributes, whereas real datasets contain many continuous numerical attributes. Hu presents a reduction method for numerical attributes [19]: a granulation model of the universe is formed by the neighborhood of each object, which realizes granular computing over continuous numerical spaces. The work [29] enhances the precision of fault diagnosis for all fault classes by using Chi-square statistics to reduce attribute dimensionality. A cost-sensitive embedded feature selection method is proposed in [30] to solve the class imbalance problem. The work [31] proposes a heuristic attribute reduction method that constructs attribute significance measures based on stripped quotient sets. The work [32] proposes a feature selection method for multi-label classification that uses a neighborhood relationship preserving score as the feature evaluation criterion. These methods are suitable for static datasets; however, real datasets are usually dynamic.
To deal with new samples in time, many incremental attribute reduction methods have been proposed. An incremental reduction algorithm based on the positive region is proposed in [20] to deal with inconsistent decision tables, but it cannot obtain the minimum reduction. The work [21] determines whether the decision table needs to be updated by analyzing the impact of new samples on the positive region; only the subset of samples inconsistent with the new samples is selected for reduction. An incremental attribute reduction method that constructs three new types of discernibility matrices is proposed in [22]. The work [23] presents a unified incremental reduction algorithm for three types of dynamic objects based on the discernibility matrix. In [24], an incremental attribute reduction algorithm named IARAIV is proposed, in which new samples and their inconsistent neighborhoods are reduced so as to update the reduction quickly.
The remainder of this paper is organized as follows. Section II discusses related work. Section III proposes two decision table reduction algorithms, forward selection and backward deletion. An incremental attribute reduction algorithm is presented in Section IV. In Section V, our methods are evaluated experimentally. Section VI concludes the paper.

II. RELATED WORK
A. ROUGH SET THEORY
Rough set theory was put forward by Professor Pawlak, a Polish scholar, in 1982 [25]. It is a mathematical theory that deals with incomplete knowledge such as inaccuracy, incompleteness and inconsistency. Rough sets have been widely used in attribute reduction, knowledge classification, rule mining and other fields. Attribute reduction is an important research issue in rough set theory: it deletes unnecessary or redundant attributes and reduces the data size so as to improve the efficiency of data processing and mining. Attribute reduction based on rough sets can use information entropy to measure attribute significance, and can delete some attributes while keeping the data classification unchanged.

1) RELEVANT DEFINITIONS OF ROUGH SET THEORY
Rough set theory associates classification with knowledge and understands knowledge as the division of specific space. The following are some related concepts [25].
Definition 1: An information system DS = <U, R, V, f> is a 4-tuple, where U = {u_1, u_2, ..., u_n} is a non-empty set of objects called the universe and R is a finite non-empty attribute set. V = ∪_{a∈R} V_a, where V_a is the domain of attribute a. f : U × R → V is an information function that specifies the value of each attribute for each object x in U. If R = C ∪ {d} and C ∩ {d} = ∅, where C is the condition attribute set and {d} is the decision attribute, then DS = <U, C ∪ {d}, V, f> is also called a decision table.
Definition 2: Let DS = <U, C∪{d}, V, f> be a decision table. For any B ⊆ C∪{d}, the indiscernibility relation is defined as IND(B) = {(x, y) ∈ U × U : f(x, a) = f(y, a) for all a ∈ B}. IND(B) partitions U into equivalence classes, and the resulting partition is denoted U/B.
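As an illustration, the partition U/B induced by IND(B) can be computed by grouping objects on their attribute-value tuples. The following is a minimal sketch, assuming a decision table represented as a list of dicts (a representation chosen here for illustration, not prescribed by the paper):

```python
from collections import defaultdict

def partition(objects, attrs):
    """Group objects into the equivalence classes of IND(attrs):
    two objects fall in the same block iff they agree on every attribute in attrs."""
    blocks = defaultdict(list)
    for obj in objects:
        key = tuple(obj[a] for a in attrs)
        blocks[key].append(obj)
    return list(blocks.values())

# Toy decision table: condition attributes 'a', 'b'; decision attribute 'd'.
U = [
    {"a": 1, "b": 0, "d": "yes"},
    {"a": 1, "b": 0, "d": "yes"},
    {"a": 1, "b": 1, "d": "no"},
    {"a": 0, "b": 1, "d": "no"},
]

print(len(partition(U, ["a"])))       # 2 equivalence classes: a=1 and a=0
print(len(partition(U, ["a", "b"])))  # 3 equivalence classes
```

Adding attributes can only refine the partition, which is why reducts aim to keep the partition's classification ability with as few attributes as possible.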

2) DESCRIPTION OF INFORMATION VIEW OF ROUGH SET THEORY
If U is a universe, any attribute set of U can be regarded as a random variable on the σ-algebra defined over the subsets of U. Its probability distribution can be determined by the following method.
Definition 3: Let P and Q be attribute sets deriving the partitions U/P = {X_1, X_2, ..., X_n} and U/Q = {Y_1, Y_2, ..., Y_m} on U. The probability distributions defined by P and Q on the σ-algebra of subsets of U are given by p(X_i) = |X_i|/|U|, i = 1, 2, ..., n, and p(Y_j) = |Y_j|/|U|, j = 1, 2, ..., m.
Definition 4: The entropy of attribute set P is defined as H(P) = -∑_{i=1}^{n} p(X_i) log p(X_i).
Definition 5: The conditional entropy H(Q|P) of attribute set Q relative to attribute set P is defined as H(Q|P) = -∑_{i=1}^{n} p(X_i) ∑_{j=1}^{m} p(Y_j|X_i) log p(Y_j|X_i), where p(Y_j|X_i) = |Y_j ∩ X_i|/|X_i|.
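Definitions 4 and 5 can be sketched directly in Python. The snippet below uses base-2 logarithms (an assumption; the definitions leave the base unspecified) and represents each attribute set by the list of its values over U:

```python
import math
from collections import Counter, defaultdict

def entropy(values):
    """H(P): entropy of the partition induced by one column of attribute values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def conditional_entropy(p_vals, q_vals):
    """H(Q|P) = sum_i p(X_i) * H(Q restricted to block X_i)."""
    n = len(p_vals)
    blocks = defaultdict(list)
    for p, q in zip(p_vals, q_vals):
        blocks[p].append(q)
    return sum(len(block) / n * entropy(block) for block in blocks.values())

P = ["x", "x", "y", "y"]
Q = ["a", "b", "a", "a"]
print(entropy(P))                  # 1.0: two equally likely blocks
print(conditional_entropy(P, Q))   # 0.5: block "x" is mixed, block "y" is pure
```

H(Q|P) = 0 would mean P fully determines Q, which is exactly the stopping condition the reduction algorithms below exploit.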

3) ATTRIBUTE REDUCTION METHODS OF PROGRESSIVE FORWARD SELECTION AND PROGRESSIVE BACKWARD DELETION
According to the direction of the reduction process, attribute reduction methods can be divided into progressive forward selection and progressive backward deletion. Greedy algorithms are generally adopted: a locally optimal selection strategy is used to approach the globally optimal solution when searching the attribute subset space.
Progressive forward selection usually starts with an empty set or the core attribute set; the most significant attribute is then added from the remaining set to the reduction set until the termination condition is satisfied. Many evaluation measures can judge whether an attribute is significant, such as information gain and attribute frequency.
A numerical attribute reduction method based on neighborhood granulation and rough approximation is proposed in [19], which can process datasets with continuous attribute values. The reduction process starts with an empty set and uses a dependency function as heuristic information. The most significant attribute is selected from the remaining attribute set and added to the reduction set one by one until the significance of every remaining attribute is zero.
Progressive backward deletion starts with the whole set of attributes, then deletes the least significant attribute in the set one by one until a reduction is obtained.
The work [26] presents CEBARKNC, an attribute reduction algorithm based on information entropy. The initial set is the whole set of condition attributes, and the least significant attribute, measured by conditional entropy, is deleted one by one. The time complexity of the algorithm is O(|R||U|^2) + O(|U|^3). In [27], the attributes in the difference set are deleted gradually by reverse deletion; whenever a single-attribute element appears, it is added to the reduction set, and this process is repeated until the difference set is empty. Based on a simplified discernibility matrix, the work [28] proposes an attribute deletion method for attribute reduction that can delete one or more qualifying attributes at once and thus reduce the number of iterations.
In this paper, we use Chi-square statistics to evaluate the significance of condition attributes and delete all unnecessary attributes. Using the conditional entropy of the relevant attributes as a heuristic function, two decision table reduction algorithms, forward selection and backward deletion, are proposed to approach the optimal solution. On this basis, an incremental attribute reduction method for dynamic datasets is proposed.

B. ATTRIBUTE CORRELATION MEASUREMENT BASED ON CHI-SQUARE STATISTIC
The Chi-square statistic is a hypothesis testing method mainly used for statistical inference on categorical data. It measures the degree of deviation between the actually observed frequencies and the theoretically expected frequencies. The larger the Chi-square value, the higher the correlation between the two attributes; otherwise, the less relevant the two attributes are.
Definition 8: The correlation between attribute A and attribute B can be measured by the Chi-square statistic K. Let A have c different values a_1, a_2, ..., a_c and B have l different values b_1, b_2, ..., b_l. Then K = ∑_{i=1}^{c} ∑_{j=1}^{l} (o_ij − e_ij)^2 / e_ij, where o_ij is the observed frequency of the joint event (A = a_i, B = b_j) and e_ij is its expected frequency, defined as e_ij = count(A = a_i) × count(B = b_j) / N, where N is the number of tuples and count(A = a_i) is the number of tuples with A = a_i. Attribute A is related to attribute B if the value of K is greater than the critical value, i.e., the corresponding value in the Chi-square table with (l − 1) × (c − 1) degrees of freedom. The larger the K value, the stronger the correlation between A and B; otherwise, attribute A is not related to attribute B.
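The statistic in Definition 8 can be computed from two value columns without materializing the contingency table explicitly; a minimal sketch:

```python
from collections import Counter

def chi_square(a_vals, b_vals):
    """K = sum_ij (o_ij - e_ij)^2 / e_ij over the contingency table of A and B."""
    n = len(a_vals)
    obs = Counter(zip(a_vals, b_vals))            # observed frequencies o_ij
    count_a, count_b = Counter(a_vals), Counter(b_vals)
    return sum((obs.get((ai, bj), 0) - count_a[ai] * count_b[bj] / n) ** 2
               / (count_a[ai] * count_b[bj] / n)  # expected frequency e_ij
               for ai in count_a for bj in count_b)

# A fully determined by B gives a large K; independence gives K = 0.
print(chi_square([1, 1, 0, 0], ["y", "y", "n", "n"]))  # 4.0
print(chi_square([1, 1, 0, 0], ["y", "n", "y", "n"]))  # 0.0
```

K is then compared against the critical value for (c − 1) × (l − 1) degrees of freedom to decide whether the attribute enters the relevant set F.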
Definition 9: Relevant attribute set F: F is the set of condition attributes related to the decision attribute {d} as measured by Chi-square statistics, that is, the attributes in C whose Chi-square statistics are not less than the critical value. F is sorted in descending order of the Chi-square statistic.
The following example illustrates the correlation of two attributes measured by Chi-square statistics.
An institution randomly surveyed the income of 5000 people and analyzed the relationship among income, gender and education. According to certain standards, income was divided into three classes: high, middle and low. Gender and income statistics are shown in Table 1, and education and income statistics are shown in Table 2. Assume that income does not depend on gender. The expected frequency of each cell is calculated from the marginal totals: each expected value is the product of its row total and column total divided by the grand total. The expected frequencies are shown in Table 3 and the Chi-square statistics in Table 4. The degrees of freedom of the table are (2 − 1) × (3 − 1) = 2 and the significance level is set to 0.05, giving a critical value of 5.9915 (obtained with the CHIINV(0.05, 2) function). Because the total Chi-square statistic in Table 4 is 1.6369, which is below the critical value, the correlation between income and gender is not significant. The same method is used to calculate the Chi-square statistic of income and education. As shown in Table 5, the total Chi-square statistic is 761.307, far greater than the critical value of 5.9915, which shows a strong correlation between income and education.
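The critical value used in this example can be reproduced without a Chi-square table. For the special case of 2 degrees of freedom the chi-square CDF has the closed form 1 − exp(−x/2), so the inverse is elementary; for general degrees of freedom one would instead use a statistics library such as scipy.stats.chi2.ppf(1 − alpha, df):

```python
import math

# For 2 degrees of freedom the chi-square CDF is F(x) = 1 - exp(-x/2),
# so the critical value at significance level alpha solves exp(-x/2) = alpha.
alpha = 0.05
critical = -2.0 * math.log(alpha)
print(round(critical, 4))  # 5.9915, matching the CHIINV(0.05, 2) value in the text
```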
In this paper, we use Chi-square statistics to evaluate the direct correlation between each condition attribute and the decision attribute: the values of the condition attribute index the rows of the contingency table and the values of the decision attribute index the columns. If a condition attribute is highly correlated with the decision attribute, it is considered a candidate element of the reduction.

III. ATTRIBUTE REDUCTION ALGORITHMS CFAR AND CBAR
A. ALGORITHM DESIGN
Attribute reduction methods usually evaluate the significance of the remaining attributes one by one during the adding or deleting process. If the dataset contains many unnecessary attributes, the computation increases greatly.
We propose an attribute reduction method based on a progressive forward selection strategy. The initial set contains the attribute with the largest K value among all condition attributes. The attribute that changes the conditional entropy most is then selected from the relevant attribute set F one by one, until the conditional entropy increment of every remaining attribute is 0.
In each iteration of the selection process, the attributes whose conditional entropy increment is 0 are deleted, as shown in Algorithm 1. The input of Algorithm 1 is the decision table DS = <U, R, V, f>, where R = F ∪ {d}, F is the relevant attribute set and {d} is the decision attribute. The initial reduction set is {r_max}, where r_max is the attribute with the largest K value among all condition attributes.

Algorithm 1 Forward Selection Attribute Reduction Algorithm Based on Chi-Square Statistics (CFAR)
Input: decision table DS = <U, F ∪ {d}, V, f>. Output: reduction set.

An attribute reduction method based on a gradual backward deletion strategy is also proposed. Its initial set is the relevant attribute set F; attributes whose deletion leaves the conditional entropy of the set unchanged are removed from the initial set one by one, as long as such attributes exist. The steps are shown in Algorithm 2. Conditional entropy is used as a heuristic function to select significant attributes or remove unnecessary ones and thereby obtain the reduction set with respect to the decision attribute; the data analysis results on the reduction set and on the original dataset are the same or similar. Chi-square statistics can evaluate the correlation between each condition attribute and the decision attribute. However, not all condition attributes highly relevant to the decision attribute should be included in the reduction set.
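The forward selection idea can be sketched as a greedy loop over conditional entropy. This is an illustrative simplification, not the paper's exact Algorithm 1: it omits the Chi-square prefiltering and the seeding with the largest-K attribute, and uses a list-of-dicts decision table:

```python
import math
from collections import Counter, defaultdict

def cond_entropy(table, attrs, d="d"):
    """H(d | attrs) for a decision table given as a list of dicts."""
    blocks = defaultdict(list)
    for row in table:
        blocks[tuple(row[a] for a in attrs)].append(row[d])
    h, n = 0.0, len(table)
    for labels in blocks.values():
        for c in Counter(labels).values():
            h -= (len(labels) / n) * (c / len(labels)) * math.log2(c / len(labels))
    return h

def forward_select(table, F, d="d"):
    """Greedy forward selection: keep adding the attribute that lowers
    H(d|P) the most, and stop once no attribute lowers it further."""
    P, remaining = [], list(F)
    while remaining:
        best = min(remaining, key=lambda a: cond_entropy(table, P + [a], d))
        if P and cond_entropy(table, P + [best], d) >= cond_entropy(table, P, d):
            break  # zero conditional-entropy increment for every attribute
        P.append(best)
        remaining.remove(best)
    return P

table = [{"a": 0, "b": 0, "d": 0}, {"a": 0, "b": 1, "d": 1},
         {"a": 1, "b": 0, "d": 0}, {"a": 1, "b": 1, "d": 1}]
print(forward_select(table, ["a", "b"]))  # ['b'] -- 'b' alone determines d
```

Backward deletion (CBAR) would run the same loop in reverse: start from F and drop any attribute whose removal leaves H(d|P) unchanged.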
For example, a shopping record table contains 75 samples with three attributes: {occupation}, {age} and {whether to buy product A}. {Occupation} and {age} are condition attributes and {whether to buy product A} is the decision attribute. The statistics of the two condition attributes against the decision attribute are shown in Table 6 and Table 7 respectively, and the corresponding expected frequencies in Table 8 and Table 9. Table 10 shows the Chi-square statistic of {occupation} and {whether to buy product A}; the calculated value is 27.6467, while the critical value with (3 − 1) × (2 − 1) = 2 degrees of freedom is 5.9915. Table 11 shows the Chi-square statistic of {age} and {whether to buy product A}; the calculated value is 26.9838, while the critical value with (4 − 1) × (2 − 1) = 3 degrees of freedom is 7.8147. Both statistics are larger than their critical values, which indicates that both {occupation} and {age} are related to the decision attribute {whether to buy product A}. However, almost all samples who buy product A are students, and the age distribution of students is under 15 or 15-18 years old. Therefore the {age} attribute is selected into the reduction set and the {occupation} attribute need not be added; in this case {occupation} is a redundant attribute although it is relevant.
Therefore, the reduction set cannot be found only by Chi-square statistical detection. Information entropy should be used as the heuristic function of reduction.

C. ALGORITHM ANALYSIS
Assume that the number of samples is |U| and the number of condition attributes is |R|. The algorithms CFAR and CBAR proposed in this paper require the Chi-square statistic of all condition attributes, which costs O(|R||U|^2). After deleting all unnecessary attributes, the remaining attributes are added or deleted one by one, with a worst-case complexity of O(|F||U|^2). In comparison, the time complexity of the CEBARKNC algorithm is O(|R||U|^2) + O(|U|^3).

IV. INCREMENTAL ATTRIBUTE REDUCTION ALGORITHM ICAR
A. ALGORITHM DESIGN
In many practical application scenarios, large amounts of new data are generated every moment. If a static reduction algorithm is used to reduce the attributes of the updated dataset, it consumes a lot of time and space. To process new samples efficiently and accurately, we propose an incremental attribute reduction algorithm. A multidimensional array M is used to store the observed frequency of all joint events and to calculate conditional entropy. M has n dimensions, where n = |C| + 1. M_{v1,v2,...,vn} represents the observed frequency of the joint event (C_1 = v_1, C_2 = v_2, ..., D = v_n), where D is the decision attribute {d} and C_i ∈ C, i = 1, 2, ..., |C|.
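A sparse dictionary of counters is one natural way to realize M, since most value combinations never occur; the class below is an illustrative sketch (the names are ours, not the paper's), showing how a new batch of samples only touches counters instead of forcing a rescan of old data:

```python
from collections import Counter

class JointFrequencyTable:
    """Sparse stand-in for the array M: counts every joint event
    (C_1 = v_1, ..., C_|C| = v_{n-1}, d = v_n) seen so far."""
    def __init__(self, cond_attrs, d="d"):
        self.attrs = list(cond_attrs) + [d]
        self.m = Counter()

    def add(self, rows):
        """Incrementally fold a batch of new samples into the counts."""
        for row in rows:
            self.m[tuple(row[a] for a in self.attrs)] += 1

    def marginal(self, attr):
        """Observed frequency of each value of a single attribute,
        as needed for Chi-square and entropy computations."""
        i = self.attrs.index(attr)
        out = Counter()
        for key, cnt in self.m.items():
            out[key[i]] += cnt
        return out

M = JointFrequencyTable(["a", "b"])
M.add([{"a": 1, "b": 0, "d": "y"}, {"a": 1, "b": 1, "d": "n"}])
M.add([{"a": 0, "b": 0, "d": "y"}])   # new batch: only counters change
print(M.marginal("d"))                # Counter({'y': 2, 'n': 1})
```

Both the Chi-square statistics and the conditional entropies used by the algorithm can be recomputed from these counts alone, which is what makes the incremental update cheap.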
For a newly added sample set Z = {z_1, z_2, ..., z_k}, the observed frequency matrix is updated accordingly, and the Chi-square statistic of each condition attribute r_m is recomputed from it. Let K denote the statistic before the update and K' the statistic afterwards; two cases affect the reduction set P:
(1) If K' is greater than or equal to the critical value while K was below it, the conditional entropy H(d|P ∪ {r_m}) of the decision attribute relative to the extended set is calculated. If H(d|P ∪ {r_m}) ≠ H(d|P), i.e., r_m carries additional information, the condition attribute r_m is added to the reduction set; otherwise the reduction set remains unchanged.
(2) If K' is less than the critical value while K was at or above it, the conditional entropy H(d|P − {r_m}) is calculated. If H(d|P − {r_m}) = H(d|P), the attribute r_m is deleted from the reduction set; otherwise r_m is not deleted.
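The two cases can be condensed into a small update routine. This is a hedged sketch of one plausible reading of the step, not the paper's Algorithm 3: k_old/k_new, critical and the entropy callback h are hypothetical interfaces introduced here for illustration:

```python
def update_reduct(P, F, k_old, k_new, critical, h):
    """One incremental update step: k_old/k_new map each condition attribute
    to its Chi-square statistic before/after the new samples arrive, and
    h(attrs) returns the conditional entropy H(d | attrs) from current counts."""
    P = list(P)
    for r in F:
        if k_new[r] >= critical > k_old[r] and r not in P:
            if h(P + [r]) < h(P):       # r became relevant and adds information
                P.append(r)
        elif k_new[r] < critical <= k_old[r] and r in P:
            rest = [a for a in P if a != r]
            if h(rest) == h(P):         # r lost relevance and is now redundant
                P = rest
    return P

# Toy heuristic: pretend only attribute "b" determines the decision.
h = lambda attrs: 0.0 if "b" in attrs else 1.0
print(update_reduct([], ["a", "b"], {"a": 1.0, "b": 1.0},
                    {"a": 9.0, "b": 9.0}, 5.99, h))  # ['b']
```

Attributes whose statistic stays on the same side of the critical value leave the reduction set untouched, which is why only the boundary-crossing cases are enumerated above.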
Based on the above description, the steps of the incremental attribute reduction method are shown as Algorithm 3.

V. EXPERIMENTS AND EVALUATION
A. EXPERIMENTS ON ATTRIBUTE REDUCTION ALGORITHMS FOR STATIC DATASETS
The forward selection attribute reduction algorithm CFAR, the backward deletion attribute reduction algorithm CBAR, the plain Chi-square statistical method and the conditional information entropy reduction algorithm CEBARKNC (without Chi-square statistics) are compared. The experimental results are shown in Table 12. The experiments show that CFAR, CBAR and CEBARKNC all obtain a correct attribute reduction set, while the plain Chi-square statistical method does not obtain the minimum reduction set. The reason is that if two attributes are strongly correlated with each other and both are strongly correlated with the decision attribute, one of them may still be redundant.
The experiments also show that the average running time of CFAR and CBAR is lower than that of CEBARKNC. This is because CFAR and CBAR compute the Chi-square statistics before the conditional information entropy and delete the attributes unrelated to the decision attribute, which reduces the computational complexity.
The training and test sets of Spect are pre-divided, with 80 and 187 records respectively. For the other datasets, we randomly select 70% of the samples as the training set and the remaining 30% as the test set. To compare the classification performance of the reduced datasets with that of the raw datasets, we model them with the random forest algorithm.
Precision and recall are used as evaluation criteria: precision = TP / (TP + FP) and recall = TP / (TP + FN), where TP, FP and FN denote the numbers of true positives, false positives and false negatives. The experimental results are shown in Fig. 1, Fig. 2 and Fig. 3. They show that the classification performance of the reduced datasets differs little from that of the original datasets, and is even better on some datasets such as Zoo and Heart. Because of the small amount of training data in the Spect dataset, its performance is relatively poor.

B. EXPERIMENTS ON INCREMENTAL ATTRIBUTE REDUCTION ALGORITHMS FOR DYNAMIC DATASETS
We use the Post, Zoo and other decision table datasets from the UCI repository to test the incremental attribute reduction algorithm. 70% of each dataset is selected as the initial data, and 10% of the samples are added each time. The ICAR and IARAIV algorithms are used to reduce the attributes. The experimental results are shown in Table 13: the numbers of reduced attributes found by the two algorithms after adding new samples are essentially identical.
To verify the efficiency of the algorithms, we select 70% of each dataset as initial data and add 5% of the samples each time. Table 13 also shows the processing time of the two algorithms after adding new samples. The processing time of ICAR is less than that of IARAIV, and the gap widens as the number of samples grows, because the time complexity of ICAR is lower than that of IARAIV.
VOLUME 8, 2020

VI. CONCLUSION
Attribute reduction is a challenging problem in areas such as pattern recognition, machine learning and data mining, especially in big data scenarios. It aims to preserve the discriminative power of the original dataset. Building on existing attribute reduction methods, we propose an attribute reduction method that uses Chi-square statistics and conditional information entropy as a heuristic function. Since static attribute reduction methods are not suitable for dynamic datasets, we also design a method that realizes attribute reduction on dynamic datasets quickly. Experiments show that the methods perform well.
For future research, we will study how to update the attribute reduction when objects and attributes in large-scale decision systems vary simultaneously over time, by combining other excellent approaches such as knowledge granularity [33], [34], decision table compaction [22], the positive region [35], ensemble structure analysis [36] and so on.