Feature Selection for Interval-Valued Data Based on D-S Evidence Theory

Feature selection is a basic and critical technology for data mining, especially in the current "big data era". Rough set theory (RST) is sensitive to noise in feature selection because of the strict condition of the equivalence relation, whereas D-S evidence theory measures the uncertainty of information more flexibly. This paper introduces the robust feature evaluation metrics "belief function" and "plausibility function" into feature selection algorithms to avoid the defect that the classification effect is degraded by noise. First, the similarity between information values in an interval-valued information system (IVIS) is given, and a variable parameter to control the similarity of samples is introduced. Then, the θ-lower approximation and θ-upper approximation in an IVIS are put forward. Next, the belief function and plausibility function based on the θ-lower and θ-upper approximations are constructed. Finally, four feature selection algorithms in an IVIS based on D-S evidence theory are proposed. The experimental results on four real interval-valued datasets show that the proposed metrics are robust to noise and that the proposed algorithms are more effective than the existing algorithms.


I. INTRODUCTION
A. RESEARCH BACKGROUND AND RELATED WORKS
Rough set theory (RST) was put forward by Pawlak [24], [25]. It is a significant method for dealing with imprecision, fuzziness and uncertainty, and it requires no prior information beyond the data set under consideration [17], [37]. RST is based on a classification mechanism: it regards classification as an equivalence relation on a specific space, and the equivalence relation constitutes a partition of that space. The main idea of RST is to use a known knowledge base to describe imprecise or uncertain knowledge. RST can effectively deal with the uncertainty of an information system (IS). In recent years, RST has attracted many researchers' attention, and its applications are mostly related to ISs [19]-[21], [44].
Another important method for dealing with uncertainty in an IS is D-S evidence theory. It originated from Dempster's concepts of lower and upper probabilities [3] and was extended into a theory by Shafer [33]. The theory is nowadays widely used for objective and subjective uncertainty analysis [11], [12], [35]. The use of D-S evidence theory in risk analysis has many advantages over the conventional probabilistic approach. It provides a convenient and comprehensive way to handle uncertain problems, including imprecisely specified distributions, poorly known or unknown correlations between variables, modeling uncertainty, small sample sizes, and the measurement of uncertainty. The basic representational structure in this theory is a belief structure, and the primitive numeric measurements derived from it are a dual pair of belief and plausibility functions. Belief and plausibility functions can successfully distinguish "uncertainty" from "don't know", and they satisfy weaker conditions than those of probability theory.
Feature selection is an effective way to eliminate the negative effects caused by redundant features. In the framework of RST, feature selection is also called attribute reduction; it aims to find a minimal feature subset that provides the same discriminating information as the whole set of features. Specifically, some features in an IS are redundant, and we want to find a reduct that has the fewest features. Feature selection in an IS means deleting redundant features while keeping the classification ability. The core step of feature selection is to construct a feature evaluation function, which can be used to select key representative features from high-dimensional data. Thus, feature selection can simplify data and reduce the computational complexity of machine learning. At present, this technology is widely applied in data mining, pattern recognition, and other real-life fields [36].
Belief and plausibility functions in D-S evidence theory can be used in feature selection. For example, Wu et al. [42] discussed feature selection in complete decision systems based on D-S evidence theory. Moreover, Wu [38] studied knowledge reduction in incomplete ISs without decision attributes using evidence theory. Zhang et al. [49] proposed the concepts of belief reduction and plausibility reduction in complete ISs without decision attributes.
An interval-valued information system (IVIS) is an IS whose information values are interval numbers. In order to study interval-valued data, some researchers have extended RST and established generalized models of the single-valued IS. Dai et al. [9] introduced uncertainty measurement for an IVIS based on α-weak similarity. Zhang et al. [48] presented incremental updating of rough approximations in an IVIS under attribute generalization. Leung et al. [18] proposed a minimal attribute reduction method for an IVIS and obtained all classification rules hidden in an IS through a knowledge induction process. Xie et al. [44] considered new measurements of uncertainty and information structures for an IVIS. Yang et al. [46] investigated the dominance-based rough set in an IVIS, which contains both incomplete and imprecise evaluations of objects. Sakai et al. [29] developed a rule generation prototype system for incomplete information databases that can process IVISs.
In this paper, we introduce the robust feature evaluation metrics "belief function" and "plausibility function" into feature selection algorithms to avoid the defect that the classification effect is affected by noise in an IVIS. First, the similarity between information values in an IVIS is established, and then the θ-lower and θ-upper approximations in an IVIS are presented. Second, feature selection algorithms in an IVIS based on D-S evidence theory are proposed. Finally, experiments are performed to evaluate the robustness of the proposed metrics and the performance of the algorithms. The experimental results show that the proposed metrics are robust to noise and that the proposed feature selection algorithms are effective. This paper is organized as follows. In Section 2, we review binary relations, IVISs and D-S evidence theory. In Section 3, we study feature selection in an IVIS based on D-S evidence theory and give the corresponding algorithms. In Section 4, we perform numerical experiments and analyze the effectiveness of the proposed metrics. In Section 5, we evaluate the performance of the given algorithms. Section 6 concludes the paper.
The work flow of this paper is displayed in FIGURE 1.

II. PRELIMINARIES
The binary relation, IVIS and D-S evidence theory are briefly introduced in this section. In this paper, U = {u_1, u_2, ..., u_n} and A = {a_1, a_2, ..., a_m} denote two non-empty finite sets, 2^U denotes the family of all subsets of U, |X| denotes the cardinality of X ∈ 2^U, δ = U × U denotes the universal relation, and Δ = {(u, u) : u ∈ U} denotes the identity relation.
A. BINARY RELATION
Let R ⊆ U × U. Then R is called a binary relation on U; for (x, y) ∈ R, we also denote it by xRy.
Let R be a binary relation on U. Then R is called:
(1) Reflexive, if xRx for any x ∈ U.
(2) Symmetric, if xRy implies yRx for any x, y ∈ U.
(3) Transitive, if xRy and yRz imply xRz for any x, y, z ∈ U.
R is called an equivalence relation on U if it is reflexive, symmetric and transitive, and a tolerance relation on U if it is reflexive and symmetric. Moreover, R is called a universal relation on U if R = δ, and an identity relation on U if R = Δ.
B. INTERVAL-VALUED NUMBER
For any r ∈ ℝ, express r = [r, r]; in this way, every real number is regarded as a degenerate interval number.

C. INTERVAL-VALUED INFORMATION SYSTEM
Definition 3 ([25]): Let U be a finite set of objects and A a finite set of attributes. Then the ordered pair (U, A) is referred to as an IS if each a ∈ A determines a function a : U → V_a, where V_a is the set of values of a. If every information value a(u) is an interval number, then (U, A) is called an interval-valued information system (IVIS).
Let (U, A) be an IVIS, B ⊆ A and θ ∈ [0, 1]. Then we can define
R^θ_B = {(u, v) ∈ U × U : s_B(u, v) ≥ θ},
where s_B(u, v) denotes the similarity degree between the information values of u and v on B. Then R^θ_B is the tolerance relation induced by the subsystem (U, B) with respect to θ, and R^θ_B(u) = {v ∈ U : (u, v) ∈ R^θ_B} is the θ-tolerance class of u. Based on the approximation space (U, R^θ_B), we define a pair of operators as follows: for any X ∈ 2^U,
R^θ_B(X) = {u ∈ U : R^θ_B(u) ⊆ X},  R̄^θ_B(X) = {u ∈ U : R^θ_B(u) ∩ X ≠ ∅}.
Then R^θ_B(X) and R̄^θ_B(X) are called the θ-lower and θ-upper approximations of X, respectively. In general, they are named the θ-rough approximations of X.
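As an illustration of these definitions, here is a minimal Python sketch. The per-attribute similarity and its min-aggregation are assumptions chosen for the example (the paper defines its own similarity measure for interval values); the class and approximation functions follow the formulas above.

```python
def tolerance_classes(U, B, sim, theta):
    """Theta-tolerance classes: R_B^theta(u) = {v : sim_B(u, v) >= theta}.
    Aggregation by min over attributes is an illustrative assumption."""
    return {u: {v for v in U if all(sim(a, u, v) >= theta for a in B)}
            for u in U}

def lower_approx(classes, X):
    """Theta-lower approximation: u whose tolerance class is inside X."""
    return {u for u, c in classes.items() if c <= X}

def upper_approx(classes, X):
    """Theta-upper approximation: u whose tolerance class meets X."""
    return {u for u, c in classes.items() if c & X}

# Toy IVIS: one attribute 'a1' whose values are intervals (lo, hi).
data = {0: {'a1': (0.0, 1.0)}, 1: {'a1': (0.2, 1.1)}, 2: {'a1': (3.0, 4.0)}}

def sim(a, u, v):
    # Illustrative interval similarity: overlap length / union length
    # (the paper's own similarity measure may differ).
    lo1, hi1 = data[u][a]
    lo2, hi2 = data[v][a]
    inter = max(0.0, min(hi1, hi2) - max(lo1, lo2))
    union = max(hi1, hi2) - min(lo1, lo2)
    return inter / union if union else 1.0

classes = tolerance_classes(set(data), ['a1'], sim, theta=0.5)
print(lower_approx(classes, {0, 1}), upper_approx(classes, {0, 1}))
```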
Theorem 1: Let (U, A) be an IVIS, B ⊆ A, θ ∈ [0, 1] and X ∈ 2^U. Then the following properties hold:
(1) R^θ_B(X) ⊆ X ⊆ R̄^θ_B(X);
(2) R^θ_B(X) = U − R̄^θ_B(U − X).

D. D-S EVIDENCE THEORY
D-S evidence theory, also called "evidence theory" or "belief function theory", is treated as a promising method for dealing with uncertainty in an IS. The basic representational structure in D-S evidence theory is a belief structure [40].
Definition 8 ([40]): Let U be a non-empty finite set. A set function m : 2^U → I (I = [0, 1]) is referred to as a crisp basic probability assignment if it satisfies the following conditions:
(1) m(∅) = 0;
(2) Σ_{X∈2^U} m(X) = 1.
A set X ∈ 2^U with nonzero basic probability assignment is referred to as a focal element. We denote the family of all focal elements of m by M. The pair (M, m) is called a belief structure on U. A set function Bel : 2^U → I is referred to as a belief function if
Bel(X) = Σ_{Y⊆X} m(Y) for all X ∈ 2^U,
and the set function Pl : 2^U → I given by
Pl(X) = Σ_{Y∩X≠∅} m(Y) for all X ∈ 2^U
is referred to as the corresponding plausibility function.
A belief function Bel : 2^U → I can also be equivalently defined by axioms. That is, Bel is a belief function if it satisfies the following axioms:
(1) Bel(∅) = 0;
(2) Bel(U) = 1;
(3) for any X_1, ..., X_k ∈ 2^U (k ≥ 1),
Bel(X_1 ∪ ... ∪ X_k) ≥ Σ_{∅≠J⊆{1,...,k}} (−1)^{|J|+1} Bel(∩_{j∈J} X_j).
Belief and plausibility functions based on the same belief structure are connected by the dual property
Pl(X) = 1 − Bel(U − X) for all X ∈ 2^U.
A basic probability assignment m can also be recovered from its belief function Bel by using the following Möbius transform:
m(X) = Σ_{Y⊆X} (−1)^{|X−Y|} Bel(Y) for all X ∈ 2^U.
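To make Definition 8 concrete, the following is a minimal Python sketch (not from the paper) that computes belief and plausibility from a basic probability assignment over frozensets and checks the dual property; the mass function used is illustrative only.

```python
from itertools import chain, combinations

def powerset(universe):
    """All subsets of a finite universe, as frozensets."""
    s = list(universe)
    return (frozenset(c) for c in chain.from_iterable(
        combinations(s, r) for r in range(len(s) + 1)))

def belief(m, X):
    """Bel(X) = sum of masses of all focal elements contained in X."""
    return sum(mass for focal, mass in m.items() if focal <= X)

def plausibility(m, X):
    """Pl(X) = sum of masses of all focal elements intersecting X."""
    return sum(mass for focal, mass in m.items() if focal & X)

# Illustrative basic probability assignment on U = {1, 2, 3}
# (masses sum to 1 and m(empty set) = 0, as Definition 8 requires).
U = frozenset({1, 2, 3})
m = {frozenset({1}): 0.4, frozenset({1, 2}): 0.3, U: 0.3}

for X in powerset(U):
    bel, pl = belief(m, X), plausibility(m, X)
    # Dual property: Pl(X) = 1 - Bel(U - X).
    assert abs(pl - (1.0 - belief(m, U - X))) < 1e-12
    assert bel <= pl + 1e-12  # Bel never exceeds Pl.
```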

III. FEATURE SELECTION IN AN IVIS BASED ON D-S EVIDENCE THEORY
In this section, we study feature selection in an IVIS based on D-S evidence theory.

A. BELIEF REDUCTION AND PLAUSIBILITY REDUCTION
We first define θ-belief and θ-plausibility functions in an IVIS. Let (U, A) be an IVIS, B ⊆ A and θ ∈ [0, 1]. For any X ∈ 2^U, define
Bel^θ_B(X) = |R^θ_B(X)|/|U| and Pl^θ_B(X) = |R̄^θ_B(X)|/|U|.
Then Bel^θ_B and Pl^θ_B are θ-belief and θ-plausibility functions on U, respectively, and the corresponding basic probability assignment is
m^θ_B(Y) = |{u ∈ U : R^θ_B(u) = Y}|/|U| for all Y ∈ 2^U.
Since R^θ_B is reflexive, every tolerance class R^θ_B(u) is non-empty, so m^θ_B(∅) = 0 and Σ_{Y∈2^U} m^θ_B(Y) = |U|/|U| = 1. So m^θ_B is a crisp basic probability assignment according to Definition 8. Moreover, u ∈ R^θ_B(X) if and only if R^θ_B(u) ⊆ X. Then we obtain, for all X ∈ 2^U,
Bel^θ_B(X) = |R^θ_B(X)|/|U| = Σ_{Y⊆X} m^θ_B(Y).
Hence, by Definition 8, Bel^θ_B is a belief function.
Then we obtain, for all X ∈ 2^U,
Pl^θ_B(X) = |R̄^θ_B(X)|/|U| = Σ_{Y∩X≠∅} m^θ_B(Y).
By Definition 8, Pl^θ_B is a plausibility function.
Proposition 3: Let (U, A) be an IVIS. If C ⊆ B ⊆ A, then for all X ∈ 2^U and θ ∈ [0, 1] the following inequalities hold:
Bel^θ_C(X) ≤ Bel^θ_B(X) ≤ Pl^θ_B(X) ≤ Pl^θ_C(X).
In this paper, for B ⊆ A and θ ∈ [0, 1], we denote:
(1) B is referred to as a classical θ-consistent set of A if R^θ_B = R^θ_A;
(2) B is referred to as a θ-belief consistent set of A if Bel^θ_B(R^θ_A(u)) = Bel^θ_A(R^θ_A(u)) for all u ∈ U;
(3) B is referred to as a θ-plausibility consistent set of A if Pl^θ_B(R^θ_A(u)) = Pl^θ_A(R^θ_A(u)) for all u ∈ U.
B is θ-independent (resp. θ-belief independent, θ-plausibility independent) if no proper subset of B is θ-consistent (resp. θ-belief consistent, θ-plausibility consistent). On this basis:
(1) B is called a θ-reduct of A if B is both θ-consistent and θ-independent.
(2) B is called a θ-belief reduct of A if B is both θ-belief consistent and θ-belief independent.
(3) B is called a θ-plausibility reduct of A if B is both θ-plausibility consistent and θ-plausibility independent.
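To connect the θ-tolerance classes with the belief structure of this subsection, a short sketch (with toy classes as an assumption) checks numerically that |θ-lower approximation|/|U| coincides with the sum of masses of focal elements inside X:

```python
from collections import Counter

def bpa_from_classes(classes):
    """m(Y) = |{u : R(u) = Y}| / |U| over the theta-tolerance classes."""
    n = len(classes)
    counts = Counter(frozenset(c) for c in classes.values())
    return {focal: k / n for focal, k in counts.items()}

def bel_from_lower(classes, X):
    """Bel(X) = |theta-lower approximation of X| / |U|."""
    return sum(1 for c in classes.values() if frozenset(c) <= X) / len(classes)

# Toy tolerance classes on U = {0, 1, 2, 3} (illustrative only).
classes = {0: {0, 1}, 1: {0, 1}, 2: {2}, 3: {2, 3}}
m = bpa_from_classes(classes)
X = frozenset({0, 1, 2})
bel_direct = sum(mass for focal, mass in m.items() if focal <= X)
assert abs(bel_direct - bel_from_lower(classes, X)) < 1e-12
```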

B. FEATURE SELECTION ALGORITHMS IN AN IVIS
Feature selection algorithms based on θ -belief and θ -plausibility functions are given as follows.
Algorithm 1 uses the belief function to select the feature that is added to the current candidate feature subset in each loop. The algorithm terminates when the addition of any remaining feature does not increase the evaluation function. For an attribute set A, the worst-case search for a reduct requires |A|(|A|+1)/2 evaluations of the evaluation function, so the overall time complexity of Algorithm 1 is O(|A|^2).
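As a minimal sketch of this control flow (not the paper's pseudocode), the following Python function abstracts the evaluation metric as a callable; for Algorithm 1, `evaluate` would be the θ-belief measure of the subsystem induced by the candidate subset.

```python
def forward_selection(features, evaluate):
    """Greedy forward search: in each loop, add the feature that most
    increases evaluate(selected); terminate when no remaining feature
    increases the metric. Worst case |A|(|A|+1)/2 evaluations, O(|A|^2)."""
    selected, best = [], evaluate([])
    remaining = list(features)
    while remaining:
        score, f = max(((evaluate(selected + [f]), f) for f in remaining),
                       key=lambda t: t[0])
        if score <= best:  # adding any remaining feature does not help
            break
        selected.append(f)
        remaining.remove(f)
        best = score
    return selected
```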
Algorithm 2 applies the plausibility function to determine the feature that is removed from the current candidate feature subset in each loop. The algorithm terminates when removing any remaining feature would increase the evaluation function. For an attribute set A, the worst-case search for a reduct likewise requires |A|(|A|+1)/2 evaluations of the evaluation function, so the overall time complexity of Algorithm 2 is also O(|A|^2).
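A matching sketch of the backward strategy, again with the metric abstracted as a callable; for Algorithm 2, `evaluate` would be the θ-plausibility measure, and a feature is dropped only when its removal does not increase the metric.

```python
def backward_elimination(features, evaluate):
    """Greedy backward search: in each loop, drop one feature whose removal
    does not increase evaluate(selected); terminate when removing any
    remaining feature would increase the metric."""
    selected = list(features)
    base = evaluate(selected)
    changed = True
    while changed and len(selected) > 1:
        changed = False
        for f in list(selected):
            trial = [g for g in selected if g != f]
            if evaluate(trial) <= base:  # removal does not hurt the metric
                selected, base, changed = trial, evaluate(trial), True
                break
    return selected
```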
We can give another two feature selection algorithms based on θ -belief significance and θ -plausibility significance in an IVIS (see Algorithms 3 and 4).
For Algorithms 3 and 4, the worst-case search for a reduct requires |A|(|A|+1)(|A|+2)/6 evaluations of the evaluation function, so the time complexity of Algorithms 3 and 4 is O(|A|^3).

IV. NUMERICAL EXPERIMENTS AND EFFECTIVENESS ANALYSIS OF THE PROPOSED METRICS
A. MONOTONICITY ANALYSIS
In order to test the proposed belief and plausibility functions, we apply them to the four real-life interval-valued datasets shown in TABLE 1. For the dataset Car [2], let C_i = {a_1, ..., a_i} (i = 1, ..., 7), and let Bel^θ(C_i) and Pl^θ(C_i) denote the belief measurement and plausibility measurement of the subsystem (U, C_i), respectively; the two measurement sets on Car are {Bel^θ(C_i) : i = 1, ..., 7} and {Pl^θ(C_i) : i = 1, ..., 7}. Analogously, for the dataset Face [1] we take F_i = {a_1, ..., a_i} (i = 1, ..., 6), for the dataset Fish [16] we take Y_i = {a_1, ..., a_i} (i = 1, ..., 13), and for the dataset Water [26] we take W_i = {a_1, ..., a_i} (i = 1, ..., 48); the corresponding measurement sets are defined in the same way. From FIGUREs 2-5, the following conclusions can be obtained:
(1) Belief measurement increases monotonically as the number of features increases.
(2) Plausibility measurement decreases monotonically as the number of features increases.
(3) The uncertainty of an IVIS decreases as the number of features increases.
(4) Both the belief function and the plausibility function can be used to measure the uncertainty of an IVIS.

B. DISPERSION ANALYSIS
Standard deviation is mainly used to measure the dispersion of numerical data: the larger the standard deviation, the higher the dispersion of the data; conversely, a smaller standard deviation indicates lower dispersion. In this paper, the standard deviation coefficient is applied to analyze the effectiveness of the proposed measurements.
Let X = {x_1, ..., x_n} be a data set. Then the arithmetic mean, standard deviation and standard deviation coefficient of X are denoted by x̄, σ(X) and CV(X), respectively. They are defined as follows:
x̄ = (1/n) Σ_{i=1}^{n} x_i,  σ(X) = √((1/n) Σ_{i=1}^{n} (x_i − x̄)²),  CV(X) = σ(X)/x̄.
According to these formulas, we calculate the CV-values of belief measurement and plausibility measurement on the four interval-valued datasets. The results are shown in FIGURE 6. FIGURE 6 shows that the CV-values of Pl^θ(B) are much larger than those of Bel^θ(B) on the datasets Fish and Water, while there is little difference between them on the datasets Car and Face. That is to say, the dispersion of Bel^θ(B) is small, so Bel^θ(B) performs better than Pl^θ(B) in measuring the uncertainty of an IVIS.
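A minimal sketch of these dispersion statistics, assuming the population form (divisor n) used in the formulas above:

```python
import math

def cv(xs):
    """Standard deviation coefficient CV(X) = sigma(X) / mean(X)."""
    n = len(xs)
    mean = sum(xs) / n
    sigma = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return sigma / mean
```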

C. CORRELATION ANALYSIS
In statistics, the Pearson correlation coefficient is used to measure the linear correlation between two data sets.
Suppose that X = {x_1, ..., x_n} and Y = {y_1, ..., y_n} are two data sets. The Pearson correlation coefficient between X and Y, denoted by r(X, Y), is defined as
r(X, Y) = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / (√(Σ_{i=1}^{n} (x_i − x̄)²) √(Σ_{i=1}^{n} (y_i − ȳ)²)).
Obviously, −1 ≤ r(X, Y) ≤ 1. The correlation degree between X and Y can be graded according to TABLE 2. The r-values between belief measurement and plausibility measurement on each of the four interval-valued datasets are computed according to this equation. The results are shown in TABLEs 3-6.
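A minimal sketch of the Pearson coefficient defined above:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient r(X, Y) between two data sets."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```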
TABLEs 3-6 show that the correlation between belief measurement and plausibility measurement is a moderate negative correlation (MNC) on the dataset Water, while it is a high negative correlation (HNC) on the datasets Car, Face and Fish.

V. PERFORMANCE ANALYSIS OF FEATURE SELECTION ALGORITHMS
In order to verify the performance of the proposed feature selection algorithms, they are applied to the four real interval-valued datasets described in TABLE 1. Interval-valued data can be regarded as a kind of data with noise. Since there are very few feature selection algorithms for interval-valued data, we compare our algorithms only with the algorithm in reference [7]. We compare the classification accuracy of all feature selection results; all reported accuracies are obtained by 10-fold cross-validation.
Classical classifiers cannot directly classify interval-valued data. In this paper, we adapt KNN (K-Nearest Neighbors) to interval-valued datasets by redefining the distance between two samples.
Definition 14: Let X = (x_1, x_2, ..., x_t) and Y = (y_1, y_2, ..., y_t) be two samples of an interval-valued dataset, where x_i and y_i are interval numbers on the i-th feature. Then the distance between X and Y is defined as follows (see the sketch below).
If the features of a dataset are reordered before each search, the algorithms may produce different reduction results. This paper shows only the results of one search with a randomly arranged feature order on each of the four datasets using Algorithm 1. Good performance on all four datasets under such a random feature order further supports the superiority of our algorithm.
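The displayed formula of Definition 14 is not reproduced above; as an illustrative stand-in, the sketch below assumes a common Euclidean-type distance on interval endpoints, together with a small KNN predictor that uses it. Both the distance form and the helper names are assumptions, not necessarily the paper's exact definitions.

```python
import math

def interval_distance(X, Y):
    """Assumed distance between interval-valued samples X and Y, where each
    feature value is a pair (lo, hi); Euclidean form on interval endpoints.
    This is an illustrative stand-in for Definition 14."""
    return math.sqrt(sum((xl - yl) ** 2 + (xh - yh) ** 2
                         for (xl, xh), (yl, yh) in zip(X, Y)))

def knn_predict(train, labels, query, k=3):
    """KNN for interval-valued data using the redefined distance."""
    ranked = sorted(range(len(train)),
                    key=lambda i: interval_distance(train[i], query))
    votes = [labels[i] for i in ranked[:k]]
    return max(set(votes), key=votes.count)
```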
We use our algorithm and the algorithm in reference [7] to select features from the four interval-valued datasets in TABLE 1. The value of the parameter θ affects the result of feature selection; in this paper it is set to 0.2. The results of feature selection and the corresponding classification accuracies are listed in TABLEs 7 and 8.
As shown in TABLEs 7 and 8, the proposed algorithm not only achieves higher classification accuracy but also selects fewer features than the algorithm in reference [7]. This confirms that the proposed algorithm is robust to noise.

VI. CONCLUSION AND FUTURE WORK
Interval-valued data is difficult to deal with and can be regarded as a kind of data with noise. Based on D-S evidence theory, this paper has proposed two new metrics to measure the uncertainty of an IVIS and has applied them to feature selection for interval-valued data to address the noise sensitivity of RST. The experimental results on four real interval-valued datasets have shown that the proposed metrics are robust to noise and that the proposed feature selection algorithms are more efficient and accurate than other algorithms based on RST. Our findings provide a new idea for feature selection on interval-valued data. In the future, we will study the optimization of the parameter θ in the proposed feature selection algorithms.