Fuzzy rule-based interpolative reasoning supported by attribute ranking

Abstract: Using fuzzy rule interpolation (FRI), interpolative reasoning can be effectively performed with a sparse rule base where a given system observation does not match any fuzzy rules. Whilst offering a potentially powerful inference mechanism, the typical representation of fuzzy rules in the current FRI literature assumes that all attributes in the rules are of equal significance in deriving the consequents. This is a strong assumption in practical applications, and it often leads to less accurate interpolated results. To address this challenging problem, this work employs feature selection (FS) techniques to adjudge the relative significance of individual attributes and, therefore, to differentiate the contributions of the rule antecedents and their impact upon FRI. This is feasible because FS provides a readily adaptable mechanism for evaluating and ranking attributes, being capable of selecting more informative features. Without requiring any acquisition of real observations, the individual scores are computed using a set of training samples that are artificially created from the originally given sparse rule base through an innovative reverse engineering procedure. The attribute scores are integrated within the popular scale and move transformation-based FRI algorithm (while other FRI approaches may be similarly extended following the same idea), forming a novel method for attribute ranking-supported fuzzy interpolative reasoning. The efficacy and robustness of the proposed approach are verified through systematic experimental examinations in comparison with the original FRI technique, over a range of benchmark classification problems while utilising different FS methods.
A speciﬁc and important outcome is that supported by attribute ranking, only two (i.e., the least number of) nearest adjacent rules are required to perform accurate interpolative reasoning, avoiding the need of searching for and computing with multiple rules beyond the immediate neighbourhood of a given observation.


I. INTRODUCTION
Fuzzy rule interpolation (FRI) enables fuzzy rule-based reasoning systems to perform inference with a sparse rule base [1], [2]. It addresses the key limitation of conventional fuzzy rule-based systems that work using the Compositional Rule of Inference [3], where no conclusion may be drawn if none of the rules in the given rule base matches a new observation. Resolving real-world problems frequently involves the use of such sparse rule bases, where the given rules cannot fully cover the entire problem domain. FRI techniques play a significant role in such situations, where an estimation can be made by computing an interpolated consequent for an observation that matches no rules.

F. Li and J. Yang are with School of Computer Science and Engineering, Northwestern Polytechnical University, China and Department of Computer Science, Institute of Mathematics, Physics and Computer Science, Aberystwyth University, UK. E-mail: {fal2, jiy6}@aber.ac.uk
C. Shang and Q. Shen are with Department of Computer Science, Institute of Maths, Physics and Computer Science, Aberystwyth University, UK. E-mail: {cns, qqs}@aber.ac.uk
Y. Li is with School of Computer Science and Engineering, Northwestern Polytechnical University, China. E-mail: lybyp@nwpu.edu.cn
A number of FRI methods have been proposed and improved in the literature (e.g., [4]- [19]).However, conventional FRI approaches assume that the domain attributes appearing in the rule antecedents are of equal significance in the implementation of interpolation.This can lead to inaccurate or incorrect interpolated results.FRI methods that exploit rules with weighted antecedents have therefore, been introduced to remedy the adverse side-effects of this equal significance assumption [20]- [23].For example, Genetic Algorithms (GA) have been applied to learn the weights of rule antecedents in support of FRI [24], but this incurs a substantial increase in computation overheads and requires the setting of many additional GA parameters.Also, in [25], a weighted fuzzy interpolative reasoning method is proposed by employing weighted increment and weighted ratio transformations, entailing automatic tuning of the optimal weights of the antecedent attributes.Similar methods to this are reported in [25], [26], all of which follow a "wrapper" scheme, where the attribute weighting procedure is enabled by firing the underlying FRI for the given training samples.A different approach is given in [27] by exploring piecewise fuzzy entropies of the fuzzy sets, with the weights assigned differently to each antecedent fuzzy set involved in different fuzzy rules, thereby working at the expense of significant computation.An alternative work is to subjectively predefine the weights on the antecedents of the rules by experts, but this may restrict the adaptivity of the rules and therefore, the flexibility of the resulting fuzzy reasoning system [21], [28].
A common issue shared by most of the aforementioned weighting schemes is to aggregate the weights computed for individual antecedent attributes, in an effort to assign an overall weight to each rule prior to its use in interpolation.Yet, the resultant weights are utilised in rather different ways dependent upon which underlying FRI technique is used.In these techniques, the weights are not organically integrated with the internal working of the FRI method.In terms of their typical applications, these weighted FRI techniques (e.g., [24], [27]) are typically tailored to problems such as multivariate regression and prediction.Little work has been done in developing weighted FRI to perform classification tasks (which this paper is focussed on).
Feature selection (FS) [29], [30] aims to discover a minimal subset of features that are most predictive of a given outcome. It generally follows a four-step procedure: generation, evaluation, termination and validation. Feature subsets are generated via a certain search procedure amongst the family of subsets of the original feature set. These feature subsets are then evaluated individually with regard to a given quality measure. The process of searching for a reduced feature subset is terminated if the measured quality degree reaches a satisfactory level. Finally, a selected feature subset is validated with respect to the application problem at hand. In developing effective FS mechanisms, much work has been carried out regarding the second step that evaluates the quality of a candidate feature subset [31]-[34], including those directly assessing and ranking individual features [35]-[37]. For any reasoning system (be it fuzzy or Boolean), different ranking scores of features or domain attributes imply different contributions of them to the inference outcome. Inspired by this observation, a novel weighted FRI approach is proposed here, consolidating upon the initial ideas presented in [38], where a feature evaluation method is integrated within the FRI procedure to score the significance of individual rule antecedents. This is different from existing techniques for rule interpolation that involve weights (e.g., [39]-[41]), which construct an interpolated result by weighted aggregation of rule consequents, where rule importances are ranked using the Euclidean distance between rule antecedents and a given observation.
In developing this new approach to fuzzy rule-based interpolative reasoning, an innovative reverse engineering process is introduced to artificially convert a given sparse rule base into a set of training samples. This is accomplished for the sake of computing the required ranking scores without the need of acquiring any real observations. This results in an attribute ranking-guided FRI method, implemented on the basis of the popular scale and move transformation-based FRI (T-FRI) [5] (although the same idea appears to be applicable to other FRI techniques). To ensure that the proposed approach does not rely on one specific FS technique, the work is systematically evaluated using five different feature ranking algorithms. Comparative studies demonstrate that this work helps minimise the adverse impact of the equal significance assumption made in the conventional FRI techniques, significantly improving the accuracy of the results of fuzzy interpolative reasoning. The work also shows that, supported by attribute ranking, only two (i.e., the least number of) nearest adjacent rules are required to perform accurate interpolative reasoning. This helps increase computational efficiency, without the need of searching for and operating on multiple rules beyond the immediate neighbourhood of a given observation.
The remainder of this paper is organised as follows. Section II outlines the relevant background of transformation-based FRI and reviews five popular FS approaches, each of which may be adopted for attribute ranking. Section III presents the proposed fuzzy rule interpolation method that is guided with attribute rankings. Section IV shows the results of a systematic, comparative experimental evaluation. Finally, Section V concludes the paper and points out interesting issues for further research.

II. BACKGROUND
This section presents the relevant background work, including an outline of fuzzy rule interpolation based on scale and move transformations and a brief description of selected feature selection methods to be used for attribute ranking.

A. Transformation-based FRI (T-FRI)
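A fuzzy rule-based system essentially contains two key elements, $R$ and $Y$, in describing a given problem: a non-empty finite set of domain attributes $Y = A \cup \{z\}$, where $A = \{a_j \mid j = 1, 2, \ldots, m\}$ represents the set of antecedent attributes and $z$ stands for the consequent, and a non-empty finite set of fuzzy rules $R = \{r^1, r^2, \ldots, r^N\}$. In conventional FRI, a given rule $r^i \in R$ and an observation $o^*$ are generally expressed as follows:

$$\begin{aligned} r^i &: \text{if } a_1 \text{ is } A_1^i \text{ and } a_2 \text{ is } A_2^i \text{ and } \cdots \text{ and } a_m \text{ is } A_m^i, \text{ then } z \text{ is } z^i \\ o^* &: a_1 \text{ is } A_1^* \text{ and } a_2 \text{ is } A_2^* \text{ and } \cdots \text{ and } a_m \text{ is } A_m^* \end{aligned} \tag{1}$$

where $A_j^i$ represents the (fuzzy) value of the antecedent attribute $a_j$ in the rule $r^i$, and $z^i$ denotes the value of the consequent attribute $z$ in $r^i$.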
Given a sparse rule base R and an observation o * , T-FRI works by running a computational process as highlighted in Fig. 1, involving four core procedures as summarised below.
1) Selection of Closest Rules: This procedure is required as the basis upon which to perform FRI when $o^*$ does not match any of the rules in the rule base. It searches for a certain number of rules that are closest to the observation. The distance between an observation $o^*$ and a rule $r^q$, or the distance between any two rules $r^p, r^q \in R$, is determined by computing the aggregated distances between all the corresponding values of the shared attributes between them:

$$d(v, r^q) = \sqrt{\sum_{j=1}^{m} d(A_j^v, A_j^q)^2} \tag{2}$$

where $v$ is $o^*$ or $r^p$ (so $A_j^v$ is $A_j^*$ or $A_j^p$), depending on whether the distance is between an observation and a rule or between two rules. So, the $n$ closest rules to $o^*$ are those rules leading to the $n$ smallest values of this distance measurement.
In the above definition,

$$d(A_j^v, A_j^q) = \frac{|Rep(A_j^v) - Rep(A_j^q)|}{max_{A_j} - min_{A_j}} \tag{3}$$

represents the normalised result of the otherwise absolute distance, where $max_{A_j}$ and $min_{A_j}$ denote the maximal and minimal value of the attribute $a_j$, respectively. This normalisation ensures that all distance measures are compatible with each other over different attribute domains. The notation $Rep(A_j)$ regarding a fuzzy set $A_j$ in this formula represents an important concept in T-FRI, termed the representative value of the fuzzy set. It reflects the key information on the overall location of $A_j$ in its domain and also its geometric shape. For instance, given an arbitrary polygonal fuzzy set $A = (a_1, a_2, \ldots, a_n)$, where $a_i$, $i = 1, 2, \ldots, n$, denote the vertices of the polygon, its representative value $Rep(A)$ is defined by:

$$Rep(A) = \sum_{i=1}^{n} w_i a_i$$

where $w_i$ is the weight assigned to the vertex $a_i$ for each $i$. For computational simplicity, many fuzzy rule-based systems (including the present work) have adopted triangular membership functions to define fuzzy sets while representing attribute values. A triangular membership function is denoted in the form of $A_j = (a_{j1}, a_{j2}, a_{j3})$, with $a_{j1}$ and $a_{j3}$ denoting the left and right extremity of the support and $a_{j2}$ the normal point of the fuzzy set. That is, the membership values of $a_{j1}$ and $a_{j3}$ are equal to 0 and the membership value of $a_{j2}$ equals 1. For such a fuzzy set $A_j$, $Rep(A_j)$ is simply defined as follows (though its centre of gravity may also be used as an alternative if preferred):

$$Rep(A_j) = \frac{a_{j1} + a_{j2} + a_{j3}}{3}$$

The definition of representative values for more complex membership functions can be found in [6].
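The rule selection procedure can be illustrated with a minimal Python sketch. It assumes triangular fuzzy sets stored as 3-tuples, the mean of the three points as the representative value, and a Euclidean aggregation of the normalised per-attribute distances. The function names, the dictionary layout of a rule and the example values are hypothetical and are introduced here purely for illustration.

```python
import math

def rep(tri):
    """Representative value of a triangular fuzzy set (a1, a2, a3): mean of its points."""
    return sum(tri) / 3.0

def attribute_distance(a, b, lo, hi):
    """Distance between two fuzzy values of one attribute, normalised by its domain [lo, hi]."""
    return abs(rep(a) - rep(b)) / (hi - lo)

def rule_distance(antecedents, observation, domains):
    """Euclidean aggregation of the per-attribute normalised distances."""
    return math.sqrt(sum(
        attribute_distance(a, o, lo, hi) ** 2
        for a, o, (lo, hi) in zip(antecedents, observation, domains)
    ))

def closest_rules(rules, observation, domains, n=2):
    """Return the n rules with the smallest distance to the observation."""
    return sorted(
        rules,
        key=lambda r: rule_distance(r["antecedents"], observation, domains),
    )[:n]

# Hypothetical sparse rule base over two attributes, both with domain [0, 10].
domains = [(0.0, 10.0), (0.0, 10.0)]
rules = [
    {"name": "r1", "antecedents": [(0, 1, 2), (0, 1, 2)], "consequent": (0, 1, 2)},
    {"name": "r2", "antecedents": [(4, 5, 6), (5, 6, 7)], "consequent": (4, 5, 6)},
    {"name": "r3", "antecedents": [(8, 9, 10), (8, 9, 10)], "consequent": (8, 9, 10)},
]
observation = [(3, 4, 5), (3, 4, 5)]
nearest = closest_rules(rules, observation, domains, n=2)
print([r["name"] for r in nearest])
```

Sorting the whole rule base is adequate for a sparse rule base; a partial selection would only matter for much larger rule sets.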
2) Construction of Intermediate Fuzzy Rule: From the preceding procedure, n closest rules to a given observation can be chosen which have the minimal distances amongst all the rules to the observation.From this, an intermediate fuzzy rule r can be constructed, forming the starting point of the transformation process in T-FRI.In most applications of T-FRI, n is taken to be 2 purely for computational efficiency, but often at the expense of interpolative accuracy.
The construction procedure computes the antecedent fuzzy sets $A'_j$, $j = 1, \ldots, m$, and the corresponding consequent fuzzy set $Z'$ of the intermediate rule

$$r': \text{if } a_1 \text{ is } A'_1 \text{ and } a_2 \text{ is } A'_2 \text{ and } \cdots \text{ and } a_m \text{ is } A'_m, \text{ then } z \text{ is } Z'$$

which is a weighted aggregation of the $n$ closest rules. Let $w_j^i$, $i \in \{1, \ldots, n\}$, denote the weight to which the $j$th antecedent of the $i$th fuzzy rule contributes to the construction of the $j$th antecedent $A'_j$ of the intermediate fuzzy rule:

$$w_j^i = \frac{1}{1 + d(A_j^i, A_j^*)}$$

where $d(A_j^i, A_j^*)$ is calculated as per (3). Then,

$$A'_j = A''_j + \delta_{A_j}(max_{A_j} - min_{A_j}), \qquad A''_j = \sum_{i=1}^{n} \hat{w}_j^i A_j^i$$

where $\hat{w}_j^i$ is the normalised weight and $\delta_{A_j}$ is a constant (termed the shift factor of $A_j$), defined respectively by

$$\hat{w}_j^i = \frac{w_j^i}{\sum_{t=1}^{n} w_j^t}, \qquad \delta_{A_j} = \frac{Rep(A_j^*) - Rep(A''_j)}{max_{A_j} - min_{A_j}}$$

The consequent value of the intermediate rule is constructed in the same manner as above:

$$Z' = Z'' + \delta_z(max_z - min_z) \tag{11}$$

where $Z''$ is the weighted aggregation of the consequent values $Z^i$, $i = 1, \ldots, n$, of the $n$ closest rules:

$$Z'' = \sum_{i=1}^{n} \hat{w}_z^i Z^i$$

with $\hat{w}_z^i$ being the mean of the normalised weights $\hat{w}_j^i$ associated with the antecedents in each rule:

$$\hat{w}_z^i = \frac{1}{m} \sum_{j=1}^{m} \hat{w}_j^i$$

Here, $max_z$ and $min_z$ in (11) are the maximal and minimal values of the consequent attribute, and the shift factor $\delta_z$ of the consequent is the mean of $\delta_{A_j}$, $j = 1, \ldots, m$:

$$\delta_z = \frac{1}{m} \sum_{j=1}^{m} \delta_{A_j}$$

3) Computation of Scale and Move Factors: The goal of a transformation process $T$ in T-FRI is to scale and move an intermediate fuzzy set $A'_j$, such that the transformed shape and representative value coincide with those of the observed value $A_j^*$. This process is implemented in two stages: (i) a scale operation from $A'_j$ to $\hat{A}_j$ (denoting the scaled intermediate fuzzy set), and (ii) a move operation from $\hat{A}_j$ to $A_j^*$. For this purpose, the required scale rate $s_{A_j}$ and move ratio $m_{A_j}$ are determined in this step. It computes and records all such scale rates and move ratios for use in the subsequent and final procedure to obtain the required consequent value. Unfortunately, it is difficult to have a generic, closed-form representation of these transformation factors as they are dependent upon the fuzzy membership functions used.
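The construction of the intermediate rule described above can be sketched in Python as follows. This is a minimal illustration only: it assumes triangular fuzzy sets stored as 3-tuples, the mean of the three points as the representative value, and an inverse-distance weight of the form 1 / (1 + d), which is an assumption of this sketch. All function names and example values are hypothetical.

```python
def rep(tri):
    """Representative value of a triangular fuzzy set: mean of its three points."""
    return sum(tri) / 3.0

def intermediate_antecedent(neighbours, observed, lo, hi):
    """Build the intermediate fuzzy set for one attribute from the n closest rules.

    neighbours: the attribute's fuzzy values in the n closest rules
    observed:   the attribute's fuzzy value in the observation
    Returns (shifted set, normalised weights, shift factor).
    """
    span = hi - lo
    # Assumed weight form: closer rule values receive larger weights.
    raw = [1.0 / (1.0 + abs(rep(a) - rep(observed)) / span) for a in neighbours]
    total = sum(raw)
    weights = [w / total for w in raw]
    # Weighted aggregation, computed point by point.
    aggregated = tuple(
        sum(w * a[k] for w, a in zip(weights, neighbours)) for k in range(3)
    )
    # Shift so that the representative value matches that of the observation.
    delta = (rep(observed) - rep(aggregated)) / span
    shifted = tuple(x + delta * span for x in aggregated)
    return shifted, weights, delta

def intermediate_consequent(consequents, weights_per_attr, deltas, lo, hi):
    """Aggregate the consequents using the mean weights and mean shift factor."""
    m = len(weights_per_attr)
    n = len(consequents)
    rule_weights = [sum(w[i] for w in weights_per_attr) / m for i in range(n)]
    aggregated = tuple(
        sum(w * z[k] for w, z in zip(rule_weights, consequents)) for k in range(3)
    )
    delta_z = sum(deltas) / m
    return tuple(x + delta_z * (hi - lo) for x in aggregated)

# Hypothetical single-attribute example on the domain [0, 10].
shifted, weights, delta = intermediate_antecedent(
    [(0, 1, 2), (8, 9, 10)], (3, 4, 5), 0.0, 10.0
)
z_prime = intermediate_consequent(
    [(0, 1, 2), (8, 9, 10)], [weights], [delta], 0.0, 10.0
)
print(shifted, weights, z_prime)
```

By construction, the shifted intermediate set has the same representative value as the observation, which is the property that the subsequent scale and move operations rely on.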
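For this work, triangular fuzzy sets are used throughout. Given such an intermediate fuzzy set $A'_j = (a'_{j1}, a'_{j2}, a'_{j3})$ and the observed value $A_j^* = (a^*_{j1}, a^*_{j2}, a^*_{j3})$, the scale rate $s_{A_j}$ is:

$$s_{A_j} = \frac{a^*_{j3} - a^*_{j1}}{a'_{j3} - a'_{j1}}$$

which essentially expands or contracts the support length of $A'_j$, namely $a'_{j3} - a'_{j1}$, so that it becomes the same as that of $A_j^*$. The scaled intermediate fuzzy set $\hat{A}_j = (\hat{a}_{j1}, \hat{a}_{j2}, \hat{a}_{j3})$, which has the same representative value as $A'_j$, is then obtained such that

$$\begin{aligned} \hat{a}_{j1} &= \frac{(1 + 2s_{A_j})a'_{j1} + (1 - s_{A_j})a'_{j2} + (1 - s_{A_j})a'_{j3}}{3} \\ \hat{a}_{j2} &= \frac{(1 - s_{A_j})a'_{j1} + (1 + 2s_{A_j})a'_{j2} + (1 - s_{A_j})a'_{j3}}{3} \\ \hat{a}_{j3} &= \frac{(1 - s_{A_j})a'_{j1} + (1 - s_{A_j})a'_{j2} + (1 + 2s_{A_j})a'_{j3}}{3} \end{aligned}$$

Similarly, while dealing with triangular fuzzy sets, the move operation shifts the position of $\hat{A}_j$ so that it becomes the same as that of $A_j^*$, while maintaining its representative value $Rep(\hat{A}_j)$. This is achieved using the move ratio $m_{A_j}$:

$$m_{A_j} = \begin{cases} \dfrac{3(a^*_{j1} - \hat{a}_{j1})}{\hat{a}_{j2} - \hat{a}_{j1}}, & \text{if } a^*_{j1} \geq \hat{a}_{j1} \\[2ex] \dfrac{3(a^*_{j1} - \hat{a}_{j1})}{\hat{a}_{j3} - \hat{a}_{j2}}, & \text{otherwise} \end{cases}$$

4) Scale and Move Transformation: After calculating the necessary scale and move factors (i.e., $s_{A_j}$ and $m_{A_j}$, $j = 1, \ldots, m$), this procedure completes the T-FRI process, deriving the required consequent $Z^*$. This follows the intuition of similar observations leading to similar consequents, a heuristic fundamental to analogical approximate reasoning. For this, the transformation factors on the antecedent attributes are aggregated. In the conventional T-FRI, this is implemented by averaging them:

$$s_z = \frac{1}{m} \sum_{j=1}^{m} s_{A_j}, \qquad m_z = \frac{1}{m} \sum_{j=1}^{m} m_{A_j}$$

This entails the computation of the scaled $\hat{Z} = (\hat{z}_1, \hat{z}_2, \hat{z}_3)$:

$$\begin{aligned} \hat{z}_1 &= \frac{(1 + 2s_z)z'_1 + (1 - s_z)z'_2 + (1 - s_z)z'_3}{3} \\ \hat{z}_2 &= \frac{(1 - s_z)z'_1 + (1 + 2s_z)z'_2 + (1 - s_z)z'_3}{3} \\ \hat{z}_3 &= \frac{(1 - s_z)z'_1 + (1 - s_z)z'_2 + (1 + 2s_z)z'_3}{3} \end{aligned}$$

where $Z' = (z'_1, z'_2, z'_3)$ is the fuzzy value of the intermediate consequent previously computed. From this, again by analogy to the transformation required for the antecedent to match the observation, the move transformation is applied, resulting in the final, required interpolated consequent $Z^* = (z^*_1, z^*_2, z^*_3)$. If $m_z \geq 0$,

$$z^*_1 = \hat{z}_1 + m_z \frac{\hat{z}_2 - \hat{z}_1}{3}, \qquad z^*_2 = \hat{z}_2 - 2m_z \frac{\hat{z}_2 - \hat{z}_1}{3}, \qquad z^*_3 = \hat{z}_3 + m_z \frac{\hat{z}_2 - \hat{z}_1}{3}$$

and otherwise,

$$z^*_1 = \hat{z}_1 + m_z \frac{\hat{z}_3 - \hat{z}_2}{3}, \qquad z^*_2 = \hat{z}_2 - 2m_z \frac{\hat{z}_3 - \hat{z}_2}{3}, \qquad z^*_3 = \hat{z}_3 + m_z \frac{\hat{z}_3 - \hat{z}_2}{3}$$

The entire scale and move transformation process can be graphically illustrated as shown in Fig. 2. For conciseness, such a process can be collectively represented by $Z^* = T(Z', s_z, m_z)$, emphasising the significance of both scale and move transformations.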
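The scale and move operations for triangular fuzzy sets can be sketched in Python as below. The sketch assumes the commonly used triangular forms of these operations, in which scaling changes the support length and moving shifts the three points, both while preserving the representative value (taken as the mean of the three points). The function names and the numeric example are hypothetical.

```python
def rep(tri):
    """Representative value of a triangular fuzzy set: mean of its three points."""
    return sum(tri) / 3.0

def scale_rate(intermediate, observed):
    """Ratio of the observed support length to the intermediate support length."""
    return (observed[2] - observed[0]) / (intermediate[2] - intermediate[0])

def scale(tri, s):
    """Scale the support by s while keeping the representative value unchanged."""
    a1, a2, a3 = tri
    return (
        ((1 + 2 * s) * a1 + (1 - s) * a2 + (1 - s) * a3) / 3.0,
        ((1 - s) * a1 + (1 + 2 * s) * a2 + (1 - s) * a3) / 3.0,
        ((1 - s) * a1 + (1 - s) * a2 + (1 + 2 * s) * a3) / 3.0,
    )

def move_ratio(scaled, observed):
    """Move ratio that takes the scaled set onto the observed set."""
    if observed[0] >= scaled[0]:
        base = scaled[1] - scaled[0]
    else:
        base = scaled[2] - scaled[1]
    return 3.0 * (observed[0] - scaled[0]) / base

def move(tri, m):
    """Shift the points by move ratio m while keeping the representative value."""
    a1, a2, a3 = tri
    step = (a2 - a1) / 3.0 if m >= 0 else (a3 - a2) / 3.0
    return (a1 + m * step, a2 - 2 * m * step, a3 + m * step)

def transform_consequent(z, scale_rates, move_ratios):
    """Apply the averaged antecedent factors to the intermediate consequent."""
    s_z = sum(scale_rates) / len(scale_rates)
    m_z = sum(move_ratios) / len(move_ratios)
    return move(scale(z, s_z), m_z)

# Hypothetical example: both sets share the representative value 5.
intermediate = (3.0, 5.0, 7.0)
observed = (4.0, 4.5, 6.5)
s = scale_rate(intermediate, observed)
scaled = scale(intermediate, s)
m = move_ratio(scaled, observed)
recovered = move(scaled, m)
z_star = transform_consequent((2.0, 4.0, 6.0), [s], [m])
print(s, m, recovered, z_star)
```

In this example the recovered set coincides with the observation, which is a quick sanity check that the two factors jointly describe the transformation from the intermediate set to the observed one.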

B. Attribute Evaluation within Feature Selection
Feature selection (FS) aims to choose a minimal subset of domain attributes that are the most relevant to the target concept or decision.It preserves the original meaning of the selected attributes while reducing their overall dimensionality.In FS, an evaluation function is used to measure how good a subset of attributes are regarding the potential solution to the problem at hand, if they are utilised.This offers a natural way to evaluate the relative significance of an attribute.If systematically carried out across all domain attributes, the use of such a function will enable the ranking of the attributes with regard to the underlying quality criteria.Existing evaluation functions in the literature can be generally grouped into categories that reflect the criteria adopted to judge attribute quality, including those based on measures over distance, information, dependence, consistency, etc [29].The following presents a brief introduction to five of these that are popularly used and readily available, which will be adopted to implement the attribute ranking task in the subsequent development.
1) Information Gain: Information gain (IG) is defined via Shannon entropy in information theory to measure the expected reduction in uncertainty caused by partitioning the values of an attribute [35], [42]. Given a collection of examples $U = \{O, A\}$, $o_i \in O$ is an object which is represented with a group of attributes $A = \{a_1, \ldots, a_l\}$ and a certain class label $i$. The Shannon entropy of $O$ is defined by

$$Entropy(O) = -\sum_{i} p_i \log_2 p_i$$

where $p_i$ is the proportion of $O$ whose elements are each labelled as the class $i$. IG upon a particular attribute $a_k$, $k \in \{1, \ldots, l\}$, is then defined by

$$IG(O, a_k) = Entropy(O) - \sum_{v \in Value(a_k)} \frac{|O_v|}{|O|} Entropy(O_v)$$

where $Value(a_k)$ is the set of all possible values for the attribute $a_k$, $O_v$ is the subset of $O$ where the value of $a_k$ is equal to $v$, and $|\cdot|$ denotes the cardinality of a set. A quality attribute should therefore lead to a high IG value.
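As an illustration of how IG can score a single attribute, the following Python sketch computes the entropy of the class labels and the information gain of one discrete attribute. The function names and the small data set are hypothetical; they are not taken from the experiments reported in this paper.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )

def information_gain(values, labels):
    """Expected entropy reduction obtained by partitioning on one attribute."""
    total = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [label for x, label in zip(values, labels) if x == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

# Hypothetical samples: attribute x separates the classes, attribute y does not.
labels = ["C1", "C1", "C2", "C2"]
x_values = ["A1", "A1", "A2", "A2"]
y_values = ["B1", "B2", "B1", "B2"]
print(information_gain(x_values, labels), information_gain(y_values, labels))
```

An attribute that perfectly separates the classes attains the full entropy of the labels as its gain, whereas an attribute that is independent of the class attains a gain of zero.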
2) Relief-F: Relief-F [36] works by exploiting distance measures.Each individual attribute is assigned a cumulative weight computed over a predefined number of sample data selected from a given training data set.Attributes with a weight above a certain threshold become selected elements of the attribute subset sought.A weight is assigned on the basis of the following intuition: Instances that belong to a similar class should be closer together than those in a different class.Suppose that near hit represents an instance that is closest to a certain training instance x under consideration, with both belonging to the same class, and that near miss represents an instance that is closest to x but in a different class.The cumulative weight associated with a given attribute is then computed by where w 0 = 0, I stands for the number of training iterations, and d(., .) is typically implemented with Euclidean metric.
3) Laplacian Score: Laplacian score (LS) [37] is another distance measure-based evaluation function. It is calculated for each individual attribute to reflect its locality-preserving capability. The definition of LS is inspired by the observation that data points related to the same topic should be close to each other. Let $LS_k$ denote the LS measure of a certain attribute $a_k$. Then it is computed by

$$LS_k = \frac{\sum_{i,j} (f_{ki} - f_{kj})^2 S_{ij}}{Var(f_k)}$$

where $f_{ki}$ and $f_{kj}$ denote the value of $a_k$ within the instance $x_i$ and that within $x_j$ respectively, $Var(f_k)$ is the estimated variance of $a_k$, and $S_{ij}$ represents the neighbourhood relationship between the instances $x_i$ and $x_j$, such that

$$S_{ij} = \begin{cases} e^{-\|x_i - x_j\|^2 / t}, & \text{if } x_i \text{ and } x_j \text{ are nearest neighbours} \\ 0, & \text{otherwise} \end{cases} \tag{25}$$

with $t$ being a suitable constant. A quality attribute should be of a small Laplacian score.
4) Local Learning-based Clustering for FS (LLCFS): LLCFS [34] performs attribute selection within the framework of the Local Learning-based Clustering (LLC) algorithm [43]. It computes a weight and assigns it to each attribute while performing a clustering task. Typically, the weights are thinly distributed if the dataset contains much redundancy, with a weight of zero indicating that the corresponding attribute is dispensable; only those attributes associated with a weight of a significant magnitude are selected. Incidentally, such an FS approach is termed wrapper-based in the literature, as opposed to the other techniques outlined herein, which follow the so-called filter-based approach to FS [30]. LLCFS works by iteratively executing the following two steps until convergence: (i) estimating the weights for the attributes using the intermediate clustering result, and (ii) updating the clustering given the weighted attributes. As such, the weights are estimated iteratively during the clustering process.
5) Rough Set-based FS: As a dependence measure-based attribute reduction method, rough set-based FS (RSFS) [44] discovers the dependencies between attributes using the granularity structure inherent in data. Given the attribute subsets $P$ and $Q$, the dependency degree of $Q$ on $P$ is defined as:

$$\gamma_P(Q) = \frac{|POS_P(Q)|}{|U|} \tag{26}$$

where $U$ is a non-empty finite set of objects and $POS_P(Q)$ is termed the positive region, which is defined by

$$POS_P(Q) = \bigcup_{X \in U/IND(Q)} P_*(X)$$

where $X \in U/IND(Q)$ represents one of the equivalence classes partitioning $U$ through the $Q$-indiscernibility relation:

$$IND(Q) = \{(x, y) \in U \times U \mid \forall a \in Q,\ a(x) = a(y)\}$$

$P_*(X)$ determines the $P$-lower approximation of $X$ in rough set theory, which is the union of the equivalence classes of the $P$-indiscernibility relation that are completely included in $X$.
The positive region so defined contains all objects of U that can be classified as the classes of U/IN D(Q) using only the information conveyed by those attributes in P .
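The dependency degree can be computed directly from a decision table, as the following Python sketch suggests. It assumes discrete attribute values stored in dictionaries, one per object; the function names and the five example rows are hypothetical.

```python
from collections import defaultdict

def partition(rows, attrs):
    """Equivalence classes of row indices that agree on all attributes in attrs."""
    blocks = defaultdict(set)
    for idx, row in enumerate(rows):
        blocks[tuple(row[a] for a in attrs)].add(idx)
    return list(blocks.values())

def dependency_degree(rows, p_attrs, q_attrs):
    """Fraction of objects in the positive region of Q with respect to P."""
    q_blocks = partition(rows, q_attrs)
    positive = set()
    for p_block in partition(rows, p_attrs):
        # A P-class belongs to the positive region if it lies wholly within a Q-class.
        if any(p_block <= q_block for q_block in q_blocks):
            positive |= p_block
    return len(positive) / len(rows)

# Hypothetical decision table with antecedents x, y and consequent z.
rows = [
    {"x": "A1", "y": "B1", "z": "C1"},
    {"x": "A1", "y": "B2", "z": "C1"},
    {"x": "A1", "y": "B3", "z": "C1"},
    {"x": "A1", "y": "B2", "z": "C2"},
    {"x": "A2", "y": "B2", "z": "C2"},
]
print(dependency_degree(rows, ["x"], ["z"]))
print(dependency_degree(rows, ["y"], ["z"]))
print(dependency_degree(rows, ["x", "y"], ["z"]))
```

Calling the function with a single antecedent attribute in `p_attrs` gives a per-attribute score, whereas passing several attributes evaluates a subset as a whole.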
As can be seen from the above, FS methods using IG, Relief-F, LS and LLCFS directly weigh and hence, rank features (and they may follow a filter or wrapper based approach).However, RSFS takes a different scheme where the quality of an attribute subset is evaluated at a time, instead of that of an individual attribute.These different styles of FS mechanism are all considered here in order to demonstrate the generality of the proposed work, as illustrated below.

III. FUZZY RULE INTERPOLATION GUIDED BY ATTRIBUTE RANKING
This section presents a novel approach for FRI that is guided by attribute ranking techniques, the framework of which is illustrated in Fig. 3. Note that any of the five different FS methods outlined in Section II-B can be employed to perform the ranking, in order to obtain the relative significance of individual attributes.

A. Reverse Engineering for Sparseness Reduction
In conventional T-FRI algorithms, the first key stage is the selection of n closest fuzzy rules when an observation is presented with no direct matching rules available in the rule base.In such work, all rule antecedents are assumed to be of equal significance while searching for the subset of closest rules; there is no assessment regarding the relative importance or ranking of these antecedents.This may reflect a seemingly important practical issue in that typically, the fuzzy rules that are provided by domain experts or learned from historical data (which constitute the rule base) are of the form as shown in Eqn.(1).That is, there is no information available on the relative significance of individual antecedent attributes.This is a premier reason that existing approaches to T-FRI commonly assume the use of this format of knowledge representation.
Fortunately, the evaluation functions embedded in the FS techniques offer an effective ranking mechanism to address this problem.However, while utilising the evaluation function of a certain FS method to differentiate the significance of attributes, data is required to act as the training instances for computing the ranking scores.Yet, in general, at the stage of performing FRI, no sufficient example data are available for use to facilitate such computation.Nevertheless, every T-FRI system has a sparse rule base as indicated in Section II-A.This set of rules can be translated into a man-made decision table, forming a set of artificially created training samples, where each row represents either a rule in the given rule base or an artificial rule generated from a given rule.Note that in data-driven learning, rules are learned from data samples.The work here is done through a reverse engineering process of data-driven learning, translating rules back to data.
1) Reverse Engineering Procedure: The question is how to create such artificial rules.In general, a fuzzy reasoning system with a sparse rule-base may involve rules that employ different antecedent attributes and a different number of antecedent attributes in different rules.To be able to systematically implement the reverse engineering procedure to obtain a training decision table, all rules are reformulated into a common representation using the following procedure: First, all possible antecedent attributes that appear in any given rule are identified, together with the value domains of these attributes.Then, each given rule is expanded iteratively into one which involves all domain attributes.The expansion is implemented such that if a certain antecedent attribute is not originally involved in a given rule, then that rule is replaced by q artificial rules, with q being the cardinality of the value domain of that attribute.In so doing, each expanded rule involves all domain attributes and each attribute in the rule takes one and only one possible value from its domain.
This reverse engineering procedure can be logically justified: For a given rule in the sparse rule base, if an attribute is missing from the rule antecedent, then the rule will have the same consequent value independent of what fuzzy value that attribute may take, provided that all those attributes appearing in the rule are satisfied regarding their respectively specified value.The presumption of the value domains being finite and discrete is also justifiable given that only fuzzy rules are considered here, where each attribute takes values from a (normally small) collection of fuzzy sets.In particular, the proposed reverse engineering procedure works with a sparse rule base, which typically involves a much smaller number of rules than the usual fuzzy rule-based systems.Besides, only those missing antecedents are to be filled with the possible fuzzy sets taken from their value domains.These factors jointly help restrain the adverse impact of the curse of dimensionality possibly caused by converting individual rules in the sparse rule base into artificial training samples.Fundamentally, it is recognised however, that in so doing, the underlying problem may be significantly reduced but not completely removed and therefore, work remains to develop a more efficient mechanism in implementing this approach.
2) Illustration of Reverse Engineering: A simple example may help illustrate the idea of this procedure. Suppose that the sparse rule base consists of the following two rules only, each involving one different antecedent attribute, $x$ or $y$, and the common consequent attribute $z$:

$$\begin{aligned} r^1 &: \text{if } x \text{ is } A_1, \text{ then } z \text{ is } C_1 \\ r^2 &: \text{if } y \text{ is } B_2, \text{ then } z \text{ is } C_2 \end{aligned}$$

where $x$ takes values from $\{A_1, A_2\}$ and $y$ takes values from $\{B_1, B_2, B_3\}$. Following the two-step reverse engineering procedure, first, all possible antecedent attributes involved in the problem are identified; these are $x$ and $y$, together with their value domains as indicated above. Then, the artificial decision table shown in Table I can be constructed. This is because there are two antecedent attributes in question, of which $x$ has two possible values ($A_1$ and $A_2$) and $y$ has three alternatives ($B_1$, $B_2$ and $B_3$). Without losing generality, suppose that the first given rule is used to construct part of the emerging artificial decision table first. As $y$ is missing in $r^1$, if $x$ is satisfied (with the value $A_1$), this rule is satisfied and hence the consequent attribute $z$ will have the value $C_1$ no matter which value $y$ takes. That is, $r^1$ can be expanded into three artificial rules, resulting in $r'^1$, $r'^2$ and $r'^3$ in Table I, for each of which $y$ takes one of its three possible values. Similarly, $r'^4$ and $r'^5$ can be constructed to expand the original rule $r^2$.

TABLE I
              Variables
Rule      x       y       z
r'1       A1      B1      C1
r'2       A1      B2      C1
r'3       A1      B3      C1
r'4       A1      B2      C2
r'5       A2      B2      C2

3) Inconsistency in Artificial Decision Table: When the reverse engineering procedure is applied to a given (sparse) rule base, the resultant, artificially constructed decision table may include logically inconsistent rules, where certain rules have the same antecedent but different consequents. For instance, in the above illustrative example, r'2 and r'4 in Table I appear to be inconsistent. This does not matter, as the eventual rule-based inference, including rule interpolation, does not use these artificially generated rules but the original sparse rule base. They are created just to help assess the relative significance of individual variables through the estimation of their respective ranking scores. It is because there are attributes which may lead to potentially inconsistent implications in a given problem that it is possible to distinguish their relative importance to the problem, or their potential power in influencing the derivation of the consequent.
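The reverse engineering procedure and the resulting inconsistency can be sketched in Python as follows. The sketch assumes that each rule is a dictionary mapping the attributes it mentions to fuzzy labels, and that the value domains are given explicitly. The function names, and the concrete labels used for the two-rule example (in particular the consequent label C2 of the second rule), are assumptions made for illustration.

```python
from itertools import product

def expand_rules(rules, domains):
    """Expand each rule so that every antecedent attribute takes exactly one value.

    rules:   list of dicts mapping the attributes a rule mentions to labels
    domains: dict mapping each antecedent attribute to its list of labels
    """
    table = []
    for rule in rules:
        missing = [a for a in domains if a not in rule]
        # One artificial row per combination of values of the missing antecedents.
        for values in product(*(domains[a] for a in missing)):
            row = dict(rule)
            row.update(zip(missing, values))
            table.append(row)
    return table

def inconsistent_pairs(table, antecedents, consequent):
    """Index pairs of rows with identical antecedents but different consequents."""
    pairs = []
    for i in range(len(table)):
        for j in range(i + 1, len(table)):
            same_if = all(table[i][a] == table[j][a] for a in antecedents)
            if same_if and table[i][consequent] != table[j][consequent]:
                pairs.append((i, j))
    return pairs

# Hypothetical two-rule sparse rule base mirroring the illustration.
domains = {"x": ["A1", "A2"], "y": ["B1", "B2", "B3"]}
rules = [{"x": "A1", "z": "C1"}, {"y": "B2", "z": "C2"}]
table = expand_rules(rules, domains)
for row in table:
    print(row["x"], row["y"], row["z"])
print(inconsistent_pairs(table, ["x", "y"], "z"))
```

The number of artificial rows grows with the product of the domain sizes of the missing antecedents, so the expansion stays manageable only when rules omit few attributes or the domains are small.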

B. Scoring of Individual Attributes
Suppose that an artificial decision table has been derived from a given sparse rule base via reverse engineering.Then, any of the five feature ranking methods reviewed in Section II-B may be applied to evaluate the relative significance of individual antecedent attributes.
1) Scoring Methods: As indicated previously, four of those (namely, IG, Relief-F, LS and LLCFS) can be directly applied to assess individual attributes, each resulting in a vector of weighting scores associated with those attributes.For easy referencing, these score vectors are denoted as Score IG , Score Relief −F , Score LS and Score LLCF S , respectively.Note that the LS-based FS method seeks those attributes of the smallest Laplacian score(s) for selection.Thus, the ranking score of LS for a rule antecedent attribute a i , i = 1, . . ., m, can be defined by Also as indicated earlier, the RSFS method conducts feature selection based on evaluating attribute subsets, instead of individual attributes.To obtain individual attribute scores using such a technique the evaluation procedure needs to be modified, which is done in this work as follows.It is known that the dependency degree γ P (Q) captures the dependence of an attribute subset Q on another subset P .Suppose that the subset Q contains the single consequent attribute z and the subset P contains just one certain antecedent attribute a i , i = 1, . . ., m of a rule in the sparse rule base.As such, the general form of the dependency degree γ {ai} ({z}) between two subsets of attributes as per Eqn.(26) degenerates to one that assesses the importance degree of each individual antecedent attribute upon which the consequent depends: This is of course, what RSFS exactly does in the first round during its iterative process of adding attributes to the emerging selected feature subset (starting from an empty set), determining which attribute is individually speaking, the best to be selected.It means that to obtain attribute scoring vector using the evaluation function of RSFS, only one iteration of the FS algorithm is needed to be run.
2) Attribute Weighting: Having computed the scores of individual attributes using any of the aforementioned five scoring methods, a normalised relative weighting scheme can be readily introduced. Thus, all antecedent attributes employed in the rules of a given sparse rule base can be ranked, each (say, the attribute $a_i$) being associated with a weight:

$$W_i = \frac{Score_*(a_i)}{\sum_{t=1}^{m} Score_*(a_t)}$$

where $Score_*$ denotes any of the five types of score (namely, one of the following: $Score_{IG}$, $Score_{Relief-F}$, $Score_{LS}$, $Score_{LLCFS}$ and $Score_{RSFS}$).
Given their underlying definition, the resulting normalised values have a natural appeal to be interpreted as the relative significance degrees of the individual rule antecedent attributes, in the determination of the corresponding rule consequent.Therefore, they can be used to act as the weights associated with each individual antecedent attribute in the original sparse rule base.Of course, for any implementation in modifying conventional non-weighted T-FRI, one and just one of the five types of the weight is required.From this viewpoint, this work presents a range of choices regarding the weighting methods that may be utilised to support and refine fuzzy interpolative reasoning, as described below.
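A minimal sketch of such a normalised weighting is given below. It assumes that each weight is the attribute's score divided by the sum of all scores, so that the weights add up to one; this is one natural reading of "normalised relative weighting" rather than a confirmed reproduction of the paper's formula, and the function name and score values are hypothetical.

```python
def normalised_weights(scores):
    """Turn non-negative attribute scores into weights that sum to 1.

    Assumption: sum-normalisation, W_i = score_i / sum of all scores.
    """
    total = sum(scores)
    if total <= 0:
        raise ValueError("at least one score must be positive")
    return [score / total for score in scores]

# Hypothetical scores for three antecedent attributes
weights = normalised_weights([2.0, 1.0, 1.0])
```

Under this assumption, equal scores yield equal weights of 1/m, which is the case in which a weighted scheme should coincide with the non-weighted one.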

C. Weighted T-FRI
From the above, weights can be computed and associated with rule antecedent attributes to indicate their relative significance in deriving the consequent. From this, T-FRI can be modified as shown in Fig. 3, involving the following three key computational stages.
1) Weight-guided Selection of n Closest Rules: Suppose that an observation is present which does not entail a direct match with any rule in the sparse rule base. Thus, a neighbourhood of $n$ ($n \geq 2$) closest rules of the observation is required to be chosen in order to perform rule interpolation. The conventional approach to making this choice is by exploiting the Euclidean distance measured through aggregating the distances between individual antecedent attributes of a given rule and the corresponding attribute values in the observation, as per Eqn. (2). Now that the weights of individual attributes have been obtained with a scoring mechanism (derived from the use of the evaluation function in an FS method), the distance $d(r^p, o^*)$ between a given rule $p$ and the observation $o^*$ needs to be updated accordingly, such that each attribute-wise distance $d(A_j^p, A_j^*)$ is scaled by $(1 - W_j)$ before aggregation. Here, $d(A_j^p, A_j^*)$ is calculated according to Eqn. (3), and $m$ is the total number of rule antecedents in the rule base, with $m \geq 2$ (since there is no need to assign any weight if all rules in the rule base involve just the same single antecedent attribute). The term introduced in the first part of the weighted distance formula for local weight normalisation is cancelled out in the overall equation. In so doing, those $n$ closest rules whose antecedent attributes are deemed more significant (than the rest) will be selected with priority. This is because such attributes will make less contribution (i.e., $(1 - W_j)\,d(A_j^p, A_j^*)$, $j = 1, \ldots, m$) to the aggregated distance $d(r^p, o^*)$ given their relatively larger weight values.
The computation of the distance $d(r^p, o^*)$ is carried out to measure the relative closeness of the rules to the observation. Under the condition where there is no rule matching the given observation, the attribute ranking-supported T-FRI is triggered. Hence, the aggregated distance is calculated in terms of Eqn. (31) between the individual elements of the observation and each rule antecedent, respectively. From this, those $n$ rules that have resulted in the $n$ smallest distance values are selected.
Note that the normalisation term in the above is a constant and, therefore, can be omitted in the process of executing fuzzy rule interpolation. This is because selecting the closest rules only requires information on the relative distance measures.
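The weight-guided selection can be sketched as follows. The sketch assumes a Euclidean aggregation in which each attribute-wise distance is scaled by (1 − W_j), and it drops the constant normalisation term since only the relative ordering matters. The attribute-wise distances are taken as already computed (Eqn. (3) is not reproduced here), and all names and numbers are hypothetical.

```python
import math

def weighted_distance(attribute_distances, weights):
    """Aggregate attribute-wise distances, each scaled by (1 - W_j)."""
    return math.sqrt(sum(((1.0 - w) * d) ** 2
                         for d, w in zip(attribute_distances, weights)))

def closest_rules(distances_per_rule, weights, n=2):
    """Indices of the n rules with the smallest weighted distance."""
    ranked = sorted(
        range(len(distances_per_rule)),
        key=lambda p: weighted_distance(distances_per_rule[p], weights),
    )
    return ranked[:n]

# Hypothetical attribute-wise distances d(A_j^p, A_j^*) for three rules
distances_per_rule = [
    [0.1, 0.8],  # rule 0
    [0.8, 0.1],  # rule 1
    [0.5, 0.5],  # rule 2
]
weights = [0.9, 0.1]  # first attribute carries the larger weight
selected = closest_rules(distances_per_rule, weights, n=2)
```

With these numbers the distance along the heavily weighted first attribute is scaled by 0.1 and that along the second by 0.9, so the ordering of the rules is dominated by the second attribute; this mirrors the statement above that highly weighted attributes contribute less to the aggregated distance.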
2) Weighted Construction of Intermediate Rule: With the weighting method introduced previously, all antecedent attributes can be ranked with respect to their estimated relative significance, in terms of their potential implication upon the derivation of the (interpolated) consequent. This allows for the development of a computational method to implement an improved version of T-FRI, where weights are integrated in all calculations during the transformation process, including the initial construction of the intermediate rule. Without unnecessarily detailing the entire construction process of the weighted intermediate rule, which is similar to that of the conventional approach (see Section II-A2), only the weighting on the consequent and the shift factor during the modified process are considered here. Obviously, these will degenerate to those computed as per Eqn. (13) and Eqn. (14) when all attributes are equally regarded in terms of their significance.
3) Weighted Transformation: Given the above method for constructing the weighted intermediate rule, the scale and move factors originally provided in Eqn. (18) are now computed with the attribute weights incorporated. From this, if an observation that does not match any rule in the sparse rule base is presented, an interpolated fuzzy value $Z^*$ for the consequent attribute can be obtained by computing the transformation $T(Z, s_z, m_z)$, in exactly the same way as given in Section II-A4. Importantly, when all antecedent attributes are assumed to be of equal significance, namely when all weights are equal, the above modified fuzzy rule-based interpolative process degenerates to the conventional T-FRI. Mathematical proof for this is straightforward but is omitted here to save space.
Note that in the above description, no specification of which attribute ranking mechanism to use is made. Indeed, the proposed technique is independent of the FS method to be adopted for attribute scoring. Any of the attribute ranking methods available may be taken to assess the relative significance of individual antecedent attributes. Thus, the proposed ranking-guided FRI offers flexibility in its implementation. Section II-B has outlined five effective attribute evaluation functions (that are used in popular FS systems), each of which may be adopted to implement the ranking mechanism.
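The degeneration property can be illustrated with a small sketch. It rests on an assumption not confirmed by the text as extracted: that the per-attribute factors (for example, scale factors) are combined into a consequent factor by a weighted sum with weights adding up to one, in place of the arithmetic mean of the conventional method. The function name and the values are hypothetical.

```python
def aggregate(factors, weights=None):
    """Combine per-attribute factors into a single consequent factor.

    Assumption: a weighted sum with weights summing to 1; with no
    weights given, every attribute receives the equal weight 1/m.
    """
    if weights is None:
        weights = [1.0 / len(factors)] * len(factors)
    return sum(w * f for w, f in zip(weights, factors))

# Hypothetical per-attribute scale factors for four antecedents
scale_factors = [1.0, 2.0, 4.0, 1.0]

plain_mean = sum(scale_factors) / len(scale_factors)
equal_weighted = aggregate(scale_factors, [0.25, 0.25, 0.25, 0.25])
ranked_weighted = aggregate(scale_factors, [0.5, 0.25, 0.125, 0.125])
```

With equal weights the weighted sum coincides with the arithmetic mean, whereas unequal weights pull the aggregated factor towards the values of the more heavily weighted attributes.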

IV. EXPERIMENTAL EVALUATION
This section presents a systematic experimental evaluation of the proposed approach for T-FRI supported by attribute ranking. It first reports on the results of performing pattern classification over ten benchmark datasets. Classification results are compared with those obtained by: (i) the state-of-the-art T-FRI; and (ii) the standard rule-based reasoning via the application of the Compositional Rule of Inference (CRI) [3], without involving rule interpolation but directly firing those (fully or partially) matched rules. Then, the robustness and effectiveness of the new approach are also empirically demonstrated by examining: (i) the confusion matrices obtained for a specified case study, (ii) the classification accuracy in relation to the number of the closest rules selected for interpolation, and (iii) the consistency and efficiency of utilising different FS methods in supporting T-FRI. Finally, the improvement of the classification performance following the weighted approach is further illustrated by fine-tuning the experimental settings.
A. General Experimental Set-up
1) Datasets: Ten benchmark datasets (for classification problems) are taken from the UCI machine learning [45] and KEEL (Knowledge Extraction based on Evolutionary Learning) [46] dataset repositories. The details of these are summarised in Table II.
2) Experimental Methodology: As indicated previously, for simplicity, triangular membership functions are employed here unless otherwise stated. They are used to represent the fuzzy values of the antecedent attributes. For each problem, the consequent attribute is designed to take a singleton fuzzy set (which is equivalent to a discrete crisp value), representing a certain class label. Whilst different antecedent attributes have their own underlying value domains, these domains are normalised to be within the common range of 0 to 1, each consisting of three qualitatively distinct fuzzy values, as shown in Fig. 4. Such a simple fuzzification is used in the main body of the experiments for simplicity as well as for fair comparison, with no optimisation of the value domain carried out. Of course, if fine-tuned membership functions are available and used, the classification performance can be expected to further improve (as illustrated later). Experimental results are obtained by averaging the outcomes of 10 times 10-fold cross validation per dataset. The rules used to perform both CRI and interpolative reasoning are learned from the raw data by the use of the classical method of [47], after fuzzification. This rule induction technique is employed herein to form a common ground for fair comparison. However, if preferred, more advanced rule induction mechanisms (e.g., those implemented with evolutionary or memetic algorithms [48]) may be exploited to produce a more compact rule base (but this is beyond the scope of this paper).
On average, 20% of the learned rules are purposefully removed at random, in order to make the resultant rule base sparser and hence to validate the effectiveness of rule interpolation. Attribute weights are derived using each of the five ranking methods introduced previously, from the artificial training data generated from the sparse rule base via the reverse engineering procedure. Classification performance is assessed in terms of accuracy over the testing data. In each test, a testing sample is first checked against the rules within the rule base. If there is no rule matching the observation, fuzzy rule interpolation is applied to make inference, using both the conventional T-FRI and the attribute ranking-supported T-FRI to facilitate comparison.
The main body of this experimental study is based on the use of $n = 2$ closest rules to perform rule interpolation. However, a series of experiments is also carried out by varying the number of the closest rules selected for interpolation (see Section IV-B3). In particular, 10 times 10-fold cross validation is adopted for each of the 5 different cases where the number of the closest rules selected is set to 2, 3, 4, 5 and 6, respectively.
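The evaluation protocol described above can be outlined with a short sketch. It only generates the data splits and a sparsified rule base; rule induction and interpolation themselves are not shown. The function names are hypothetical, and the sketch removes a fixed 20% of the rules, whereas the text states that 20% is removed on average.

```python
import random

def repeated_kfold(num_samples, folds=10, repeats=10, seed=0):
    """Yield (train_indices, test_indices) for repeats x folds splits."""
    rng = random.Random(seed)
    for _ in range(repeats):
        order = list(range(num_samples))
        rng.shuffle(order)  # a fresh random partition for every repeat
        for k in range(folds):
            test = order[k::folds]
            test_set = set(test)
            train = [i for i in order if i not in test_set]
            yield train, test

def sparsify(rules, removal_rate=0.2, seed=0):
    """Randomly drop a fraction of the learned rules."""
    rng = random.Random(seed)
    keep = len(rules) - round(removal_rate * len(rules))
    return rng.sample(rules, keep)

splits = list(repeated_kfold(num_samples=50))
sparse_rule_base = sparsify(list(range(20)))  # integers stand in for rules
```

Each of the resulting splits would be used to learn a rule base from the training part, sparsify it, and classify the test part, with the accuracies averaged over all splits.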

B. Results and Discussions
1) Classification Accuracy: Table III shows the average classification accuracies and standard deviations (SD), calculated over the 10 times 10-fold cross validation, for each of the seven compared approaches. In this table, CRI is the column showing the accuracies achievable using CRI based on the sparse rule base; Ori lists the accuracies obtained using the conventional T-FRI; the remaining column names are self-explanatory (e.g., IG stands for the accuracies achieved by the proposed approach with the antecedent attributes in the rules weighted by their corresponding information gains); and AVG Guided presents the accuracies obtained by averaging the performances of the five attribute ranking-guided T-FRI methods.
The comparison with CRI is included herein to demonstrate the power of FRI in general, and that of ranking-guided FRI in particular, in performing fuzzy reasoning, both of which significantly outperform the use of CRI in all the problems that involve a sparse rule base. This may be expected since a fuzzy system implemented with CRI alone cannot draw any conclusion when an observation does not match any of the rules in the rule base. As already indicated, no attempt is made to optimise the fuzzification of any attribute domains. Thus, the classification rates are generally not very impressive. However, this is not the point of this experimental investigation. The point is to compare the relative performances of different approaches, with a common ground for fair comparison. The improvement achievable by employing learned membership functions (from training samples) will be shown later.
The use of any of the five attribute ranking-guided methods has been shown to enable the corresponding fuzzy reasoning system to outperform the system using the conventional T-FRI. This indicates that individual rule antecedent attributes do make different contributions to the classification, and that the ranking scores obtained by FS techniques offer a positive means for discovering such differences. Interestingly, the narrow-banded SD values (those numbers following the classification accuracy) given in Table III further demonstrate that the performance of the proposed work is robust.
Examining more closely, those methods based on directly assessing individual attributes (namely, IG, Relief-F, LLCFS and LS) achieve more significant improvements, with the best average accuracy being obtained by IG-guided T-FRI (having an average improvement of 9.44% over that of Ori across all ten datasets). The remaining one, RSFS, adopts the technique of (attribute) subset selection. As shown in Section III-B, ranking attributes with such a technique requires modification of the underlying FS algorithm. Nevertheless, the RSFS-based FRI has a comparable improvement over the conventional T-FRI to the average performance of the other four, again indicating the robustness of the innovative approach proposed in this work. Collectively, these results also show the generality of the attribute ranking-guided approach in that the use of a very different FS method retains the improved performance.
As can also be seen from Table III, both FRI approaches (the original and the attribute ranking-guided) significantly outperform the standard fuzzy reasoning based on CRI, and the results are more stable with relatively lower SD values. Of course, such an obviously poorer classification accuracy obtained by the use of CRI can be expected, as it fires matched or partially matched rules only while facing the problem of sparseness of the rule base. This strongly demonstrates the effectiveness of fuzzy interpolative reasoning, especially for the proposed approach owing to its further enhanced performance over the conventional T-FRI.
2) Analysis on Confusion Matrices: Apart from the overall classification accuracies, it is practically interesting to investigate the statistical properties of the classification performance in terms of true positive (TP), true negative (TN), false positive (FP) and false negative (FN). To keep the discussion focused, the Haberman dataset is taken as an example to run such an investigation. Tables IV-X show the confusion matrices computed by the use of the seven comparative approaches (averaged over 10 times 10-fold cross validation), respectively. Table XI lists the averaged performance of the five different implementations of the attribute ranking-guided method. Despite the fact that this dataset contains samples that are distributed in an imbalanced manner (which increases the difficulties in performing accurate classification), these tables clearly show the superior performances achieved by the proposed approach over the original T-FRI, let alone CRI.
Importantly, these tables both individually and collectively reveal that the classification accuracy achieved by the use of the attribute ranking-guided T-FRI is driven by the significant reduction of false negatives and the substantial increase in true positives. These results form a sharp contrast with those obtainable by the use of the original T-FRI and, more remarkably, with those by CRI. This is of practical significance because, for many real-world applications, not only should the overall classification rates be high, but false negatives should also be minimised while true positives are maximised. This is of particular importance for medical applications, as with the situation of this dataset (which summarises the cases on the survival of patients who had undergone surgery for breast cancer: if a patient died within 5 years of the surgery then the case is regarded as positive, or if the patient survived for 5 years or longer then it is a negative case). For such problems, false negatives can be extraordinarily damaging.
Fortunately, the implementations with the proposed approach all lead to much reduced false negatives (with an averaged rate of 4.49% over the range of 3.58% to 4.88%, as compared to 8.79% returned by the conventional T-FRI and 26.40% by CRI). This is in addition to the remarkable improvements over the true positive rates (an average of 73.29% over the range of 72.32% to 75.90%, as opposed to 68.40% by the original T-FRI and a mere 50.16% by CRI).
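For reference, the four confusion-matrix entries can be computed as in the following minimal sketch. It assumes that each entry is expressed as a percentage of all test samples, which appears consistent with the figures quoted above but is not stated explicitly; the function name and the labels are hypothetical and do not reproduce any experimental data.

```python
def confusion_rates(actual, predicted, positive):
    """Return TP, TN, FP and FN as percentages of all samples."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if a == positive:
            counts["TP" if p == positive else "FN"] += 1
        else:
            counts["FP" if p == positive else "TN"] += 1
    total = len(actual)
    return {name: 100.0 * count / total for name, count in counts.items()}

# Hypothetical labels for eight test samples
actual = ["pos", "pos", "neg", "neg", "neg", "pos", "neg", "neg"]
predicted = ["pos", "neg", "neg", "pos", "neg", "pos", "neg", "neg"]
rates = confusion_rates(actual, predicted, positive="pos")
```

Under this convention the four entries sum to 100%, and the overall accuracy is the sum of the TP and TN entries.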
3) Number of Closest Rules: Up till now, all experimental results reported in the existing literature regarding the use of T-FRI have been based on the use of two closest rules (i.e., $n = 2$) to perform interpolation. The choice of using two rules is for computational simplicity. It has been hypothesised previously that a larger neighbourhood (i.e., more than 2 closest rules) may lead to generally more accurate interpolated outcomes. It is, therefore, interesting to investigate the level of change in classification accuracy with regard to varying the number of the closest rules selected for fuzzy rule-based interpolative reasoning.
Considering the computational effort required for such an experimental investigation, only a subset of the previously listed 10 benchmark datasets (namely, BUPA, Hayes-Roth, Appendicitis and Phoneme) is randomly chosen to conduct this study. Table XII presents the experimental results, with the summary of these plotted in Fig. 5. Again, the accuracies shown in this table are calculated by averaging the results obtained in 10 times 10-fold cross validation.
Over the range of $n \in \{2, \ldots, 6\}$ that is examined, running both the conventional T-FRI and the attribute ranking-supported T-FRI always results in a substantial improvement (in terms of the classification accuracy) over the performance achievable by running CRI with direct rule-firing (which is shown in Table III and is independent of $n$). Importantly, each of the proposed five attribute-guided T-FRI methods consistently outperforms the conventional T-FRI for almost all datasets and all settings of $n$. The results in Table XII further demonstrate the robustness of the proposed work, given that the standard deviation (SD) values of the classification accuracy across all $n$ values are rather small. Surprisingly (and very positively in support of the present approach), as a larger $n$ is assumed, little improvement can be gained for any of the five attribute ranking-supported methods. In fact, the performance may even deteriorate as $n$ increases. The best performance is actually achieved when the number of selected closest rules is the smallest (i.e., 2). This indicates that the weighting scheme facilitates the determination of the best neighbouring rules to be taken at the earliest opportunity. This result empirically negates the hypothesis commonly made about T-FRI that more rules used for interpolation would lead to significantly better results. It also helps avoid the use of a larger $n$ in applications of the weighted T-FRI, thereby reducing the computational complexity that would otherwise be increased due to the requirement of searching for and running with more rules for interpolation.

4) Consistency and Efficiency of Ranking Methods: There is one exception in the above results regarding the Phoneme dataset, where the classification accuracy achieved using LS-guided T-FRI eventually increases as the number of closest rules goes up, though this variation is not significant. Therefore, a further investigation has been conducted to forensically examine the ranking scores which are obtained by the use of the five different evaluation functions. The results are presented in Table XIII.
As can be seen from this table, the first four attribute ranking methods consistently agree that the fourth antecedent attribute plays the most significant role in deciding on the consequent, with much higher ranking scores obtained. Three out of these (IG, Relief-F and RSFS) put the first antecedent attribute in second place, with LLCFS ranking it in third place. The only method that is out of tune is LS, which ranks the first antecedent attribute at the bottom, with a zero score signifying its relative lack of relevance in this rule base. This is a very different result from the great majority, implying that the LS algorithm may underperform in deriving an appropriate ranking for this particular dataset. As such, it may explain why the FRI guided with LS achieves a relatively poor performance when the number of closest rules is 2, as well as the overall different trend of the classification accuracy while varying $n$ for this dataset, as shown in Fig. 5.
The introduction of ranking scores of antecedent attributes in support of weighted rule interpolation may lead to additional computational overheads overall (albeit ensuring that only the smallest number of closest rules is needed). Table XIV shows the corresponding average testing time recorded for classification over testing samples when the number of closest rules is increasing, together with the SD value over $n$. In this table, the column of Max Increase lists the maximum increase of the testing time observed while increasing the number of closest rules $n$. Generally, there is a slight increase in time consumption when involving more closest rules in the implementation of rule interpolation for all T-FRI methods. However, whilst the attribute ranking-guided T-FRI employs the weights in all of the key stages of interpolation (including the selection of the closest rules, the construction of the intermediate rule, the calculation of weighted transformation factors and the execution of weighted transformations), there is no significant increase in the time consumed by the weighted T-FRI as compared to that by the original T-FRI. This, together with the above observed general consistency amongst the use of the attribute ranking schemes, once again shows the efficacy of the proposed approach.
5) Use of Learned Membership Functions: As indicated previously, the classification performance in terms of accuracy is not very impressive, even though the proposed work improves it significantly over the conventional approaches.
However, this is expected as the quantity space used to depict the value domains of all the attributes across all datasets is so simplistic (recall Fig. 4), without any optimisation (which is purposefully designed so as to enable systematic investigations over a wide range of experimental settings). Such unbiased specification of the domain values allows fair comparison to be made between different fuzzy reasoning techniques. Besides, an average of 20% of the learned rules are deliberately removed at random, in order to have a rule base that is rather sparse. This makes the domain knowledge, represented in terms of fuzzy rules, rather incomplete, which in turn makes the classification task a challenge for any learning classifier and hence leads to less accurate classification. Nevertheless, it is interesting to empirically verify what happens if an (at least partially) optimised quantity space is utilised.
Fuzzy C-Means (FCM) [49] is one of the most widely used fuzzy clustering algorithms. It works by assigning a membership degree to each data sample corresponding to a certain cluster centre, based on the relative distance between the cluster centre and that sample. The closer a sample is to the cluster centre, the higher the membership degree to which the sample is deemed to belong to the corresponding cluster. Thus, the clustering outcome on a given dataset reveals the distribution of the membership functions for the underlying attributes. Owing to its popularity, FCM is herein adopted to perform fuzzification, learning the membership functions for the antecedent attributes. However, any optimisation of the membership functions is directly influenced by the dataset itself. Without overly complicating the experimental investigation, only the simple Iris dataset is used in this specific study (on the effect of using learned fuzzy sets). Fig. 6 shows the membership functions generated using FCM. The optimal number of clusters for each antecedent attribute is selected by the method of [50], resulting in 4 clusters for the first antecedent attribute, 2 for the third and 3 for each of the remaining two.
Table XV presents the classification results using the FCM-returned membership functions. For comparison, it also lists those that are obtained by the use of evenly distributed fuzzification based on the entries given in Table III. As expected, a better fuzzification leads to a better classification. Individually speaking, each weighted method that uses FCM-learned membership functions beats its counterpart (that employs just the simple quantity space of Fig. 4 for each antecedent attribute). Collectively, this leads to an averaged enhancement of 1.87% (= 93.07% − 91.20%) for the FS-supported T-FRI methods. Importantly, this is on top of the already achieved substantial improvement of the FS-supported T-FRI over the conventional T-FRI and CRI-based classification methods, as also highlighted in this table.
It may be recognised that the improved classification rate is still not as high as the highest reported in the literature regarding this simple dataset [51], where a fully trained learning classifier is adopted with the fuzzy sets involved having been comprehensively optimised. However, it must be noticed that the present high accuracy is attained with an average of 20% of the rules having been randomly taken out of the learned rule base. This demonstrates the great potential of the proposed FRI approach in dealing with real-world problems where typically only partial and imprecise knowledge is available.
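The distance-based membership assignment at the core of FCM can be sketched as follows for a single attribute. The sketch uses the standard FCM membership formula with a fuzzifier of 2 and takes the cluster centres as given; the iterative update of the centres and the cluster-number selection of [50] are not shown. The function name, the centres and the sample value are hypothetical.

```python
def fcm_memberships(value, centres, fuzzifier=2.0):
    """Membership degrees of a 1-D sample to each cluster centre.

    Standard FCM rule: u_k = 1 / sum_j (d_k / d_j) ** (2 / (fuzzifier - 1)),
    where d_k is the distance from the sample to centre k.
    """
    # A tiny floor avoids division by zero when the sample sits on a centre.
    distances = [max(abs(value - c), 1e-12) for c in centres]
    power = 2.0 / (fuzzifier - 1.0)
    return [1.0 / sum((d_k / d_j) ** power for d_j in distances)
            for d_k in distances]

# Hypothetical normalised attribute value and two cluster centres
memberships = fcm_memberships(0.25, centres=[0.0, 1.0])
```

The memberships of a sample always sum to one, and the centre nearest to the sample receives the largest degree, in line with the description above.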

V. CONCLUSION
This paper has presented a novel fuzzy rule interpolation approach that significantly reinforces the power of fuzzy interpolative reasoning, by exploiting attribute ranking techniques to help determine the relative importance of rule antecedent attributes involved in a sparse rule base. The approach is general since it allows for any established ranking method to be utilised to score the attributes, leading to a flexible weighting scheme for FRI. The paper has provided five different attribute ranking methods for attribute weighting, based on popular feature selection techniques in the relevant literature. This paper has also proposed an innovative reverse engineering procedure, through which the ranking scores can be calculated from an artificial decision table derived from the original rules, without requiring additional observations to be made. The proposed work has been systematically evaluated on ten benchmark classification tasks.
Collectively, the experimental results presented have clearly demonstrated the efficacy and robustness of the attribute ranking-supported approach to fuzzy rule interpolative reasoning. In particular, the weighted interpolative methods have been shown to entail remarkably improved classification accuracy over both conventional transformation-based FRI (T-FRI) and compositional rule of inference-based fuzzy reasoning techniques. This has been achieved using a very simple fuzzification mechanism. The experimental investigations have also confirmed that any of the existing popular FS techniques may be employed to evaluate and score the antecedent attributes, without adversely affecting the classification outcome or considerably increasing the computational time complexity. The results have further illustrated that better performance can be obtained by fine-tuning the membership functions that define the antecedent attributes given a particular practical problem.
Further to the aforementioned advantages over conventional T-FRI techniques, the attribute ranking-supported T-FRI methods have systematically proven to require only the least number of the closest rules to carry out interpolation (with respect to a given observation that does not match any existing rule in the sparse rule base). Overall, as the most appropriate closest rules are selected in terms of the relative significance of domain attributes, better results are obtained using the fewest rules possible, thereby minimising the complexity in both rule searching and rule firing.
Whilst the work in this paper is developed on the basis of the popular T-FRI algorithm, the proposed approach appears to permit other FRI techniques to be integrated with the attribute ranking and reverse engineering methods in a similar manner. For instance, popular FRI algorithms such as those reported in [10], [39], [52] also involve multidimensional input spaces, and may be combined with the proposed attribute ranking scheme, thereby creating potentially more effective multidimensional FRI methods. This hypothesis remains to be tested, with an expectation that a generalised weighting scheme for fuzzy interpolative reasoning can be developed, making fuzzy rule-based inference more flexible. Also, the present work assumes that a sparse rule base is given. It would be interesting to investigate which of the many data-driven rule induction techniques available may be employed to generate an improved rule base, and to study how such a learning mechanism may be blended with the reverse engineering procedure to provide a stronger attribute ranking scheme. Additionally, how the reverse engineering procedure may be efficiently implemented to minimise the adverse impact of the curse of dimensionality forms another piece of further research. Finally, the current approach presumes the use of a fixed (sparse) rule base. The most recent mechanism for dynamic fuzzy rule interpolation [53] should be integrated with it to allow the collection and refinement of any intermediate fuzzy rules and interpolated results, in order to enrich the rule base and avoid unnecessary interpolation on the fly.

Fig. 3. Attribute ranking-supported T-FRI, with attribute rankings used in all key procedures of conventional T-FRI.

TABLE III. AVERAGE CLASSIFICATION ACCURACIES (%) WITH STANDARD DEVIATION IN 10-TIMES 10-FOLD CROSS VALIDATION

TABLE XI. CONFUSION MATRIX OF AVG GUIDED T-FRI

TABLE XIV. AVERAGE TESTING TIME (SEC) VS. NUMBER OF CLOSEST RULES

TABLE XV. ACCURACIES (%) VS. SPECIFICATION OF MEMBERSHIP FUNCTIONS FOR IRIS DATASET