Stability Assessment of Feature Selection Algorithms on Homogeneous Datasets: A Study for Sensor Array Optimization Problem

A feature selection algorithm (FSA) is used to eliminate redundant and irrelevant features, which reduces both the dimensionality and the complexity of the original problem. In real-world applications, however, the stability of the FSA output becomes a major issue. Stability refers to the consistency of an FSA's feature preference under perturbation of the data samples. In sensor array optimization, an FSA is used to find the best sensor combination in a sensor array. Typically, the main objectives of sensor array optimization are to reduce data dimensionality, electrical consumption, production cost, and computational and traffic overhead. Stable FSA outputs across several observations are therefore necessary to draw a firm conclusion about the selected sensors. The contribution of this research is to investigate the stability of FSAs on twelve homogeneous datasets in relation to the sensor array optimization problem. In this study, the stability of seventeen filter-based FSAs is compared across twelve homogeneous datasets. These datasets are generated by an electronic nose (e-nose) used to monitor twelve types of beef cuts. In this case, the gas sensor array must generalize well enough to differentiate all beef types. The experimental results show that a single FSA cannot guarantee a stable sensor recommendation in sensor array optimization. This is a caution to researchers and practitioners to use a proper approach when performing sensor array optimization.


I. INTRODUCTION
The gas sensor array is the main component of the e-nose; it detects and collects volatile information from a particular object [1]. It is assembled from several gas sensors with different selectivity. The sensor combination in the array depends on the application, that is, on the sample the e-nose must detect, because different samples have different volatiles that act as biomarkers. For example, e-noses have been applied to blood glucose level detection [2], halal authentication [3], [4], meat quality detection [5]-[10], classification of vegetable oils and animal fats [11], tea classification [12], [13], and monitoring of tempeh fermentation [14]. In e-nose applications, a non-optimal sensor array is susceptible to overlapping selectivity, which means that more than one sensor has selectivity similar to that of others. Hence, sensor array optimization is performed to reduce this problem. Moreover, sensor array optimization offers several advantages, such as improved performance, reduced data and communication overhead, lower electrical consumption, and lower production costs [15]. In a sensor array optimization problem, the sensor combination must not only be optimal but also generalize well. Typically, an FSA is employed to determine the best sensor combination in a sensor array. Furthermore, the generalization of the sensor array can only be guaranteed by using an FSA that is stable across several datasets.
The associate editor coordinating the review of this manuscript and approving it for publication was Chao Tan.
FSAs play an important role in reducing high-dimensional data for classification and regression tasks. Reducing the data dimension brings several advantages, such as computational efficiency, lower data storage, a simpler data model, and shorter training time. Several approaches have been proposed for feature selection, including filter, wrapper, and embedded methods. Filter-based FSAs typically rank or select features by considering the relation or correlation between the features and the class label. They are fast, independent of the classifier, computationally inexpensive, robust against overfitting, and generalize better. However, because filter-based FSAs are independent of learning algorithms, they can fail to select the best feature combination for classification and regression tasks [16]. In contrast, a wrapper interacts intensively with a classifier, which allows it to find the best feature combination for a particular learning algorithm. However, it is computationally intensive and susceptible to overfitting. An embedded FSA also relies on the classifier to determine the best feature subset, although it has lower computational complexity than a wrapper FSA [17].
Stability and generalization have become major issues in FSA studies. The stability of an FSA indicates the robustness of its feature preference under perturbation of the data samples [18]. Instability of the selected features makes it difficult to draw a conclusion and leads to a lack of generalization of the feature set. There are several causes of FSA instability, including small sample size, sample order dependency, and data partitioning [19]. In this study, the stability of FSAs is assessed on homogeneous datasets that are generated by an e-nose in beef quality monitoring. The objective of this experiment is to solve the sensor array optimization problem with an emphasis on the robustness of FSAs, so that the sensor combination is not only optimal but also generalizes well enough to detect various types of beef samples. This study has the following motivations:
1) In existing studies, sensor array optimization in e-noses commonly employs a single FSA over a limited dataset. However, to the best of our knowledge, no study addresses the stability of FSAs in sensor array optimization across several datasets.
2) Several studies investigate FSA stability on several types of data, including high-dimensional spaces [20], high-dimensional and correlated data [21], and high-dimensional data with small samples [19], [22]. However, the majority of these studies use artificial data. In contrast, this study uses homogeneous datasets that are generated by an e-nose in a real case of beef quality monitoring. Homogeneous datasets are datasets obtained from different observations in a similar environment. Furthermore, the stability of the FSA becomes a major issue in sensor array optimization. Thus, this study can be considered new and important in this area.
Hence, the main contribution of this research is to investigate the stability of FSA in twelve homogeneous datasets in relation to the sensor array optimization problem.
The rest of this paper is organized as follows: Section II discusses related works including why sensor array optimization is important, the existing studies of sensor array optimization, and FSA stability issues. Section III is the problem formulation that explains the experimental setup, feature selection algorithms, weighting, and stability metric. Section IV discusses the result of the stability assessment. Finally, Section V draws a conclusion.

II. RELATED WORKS
The gas sensor array is a combination of several gas sensors with different gas selectivity. Each sensor works individually and simultaneously to convert chemical information from various gases into measurable signals. This condition causes overlapping selectivity, in which one or more gases can be detected by more than one gas sensor. The main problem arises in selecting the best and most effective gas sensor combination. Gas sensors have high power requirements, especially for wireless communication rather than for sensing [23]. The gas sensor technology that consumes the most power is the optical gas sensor platform; therefore, it is not suitable for implementation in a Wireless Sensor Network (WSN). In contrast, catalytic and semiconductor-based sensors provide a balanced trade-off among three parameters: power consumption, safety, and performance [23]. Besides the choice of gas sensor technology, other ways to optimize power consumption are power management, sensing circuits, and the measurement procedure.
Sensor array optimization is one of the effective mechanisms for dealing with power consumption in gas sensor nodes. In 2007, sensor array optimization for detecting wheat quality using ANOVA followed by multiple comparisons (Tukey) was proposed [24]. This method reduced the number of gas sensors from 10 to 5, and the percentage of correct classifications increased by about 8% compared with the unoptimized array. Two years later, the smallest sub-arrays for detecting 11 variants of gases were proposed [25]. Several gas sensors were clustered to fill each level of smaller sub-arrays, and the number of gas sensors was reduced by about 50%, from 6 to 3. In 2010, a combination of cluster analysis (CA) for the preliminary step and genetic algorithms (GA) for the decision step was proposed [26]. The datasets came from three different sources. Two of them were obtained from real experiments using 5 sensors with 7 vapors and 10 sensors with 3 vapors. The optimum number of sensors in both datasets was reduced to 2. Moreover, for the simulated dataset using 10 sensors with 3 vapors, the number of sensors was reduced to 4. The general resolution factor (GRF) increased by between 0.3 and 8.5%. One year later, the decision method was improved using genetic algorithms for multi-objective optimization [27]. Furthermore, the number of datasets increased from 3 to 4 with the addition of a simulated dataset that uses 10 sensors and 5 vapors, and the number of vapors from the previous experiment was redefined. The optimum numbers of sensors for datasets 1, 2, 3, and 4 were 3, 3, 2, and 2, respectively. Moreover, the GRF of the selected sensors was better than that of all sensors or of a random partial set of sensors. In the same year, neural network sensitivity analysis for volatile organic compound (VOC) mixtures was proposed [28]. The number of optimum sensors was reduced from 6 to 4.
VOLUME 8, 2020
In addition, the experimental results show that the number of sensors needed to detect the gas mixture is the lowest. In another case, a rough set-based approach for classifying the quality of black tea was proposed [29]. The number of optimum sensors was reduced from 8 to 4. Furthermore, the separability index (a measure of the fraction of data points that have the same label as their closest neighbor) and the level of accuracy increased by about 3% and 11%, respectively. In 2012, the same case was classified using the t-score, Fisher's criterion, and minimum redundancy maximum relevance (MRMR) [30]. The number of optimum sensors was reduced from 8 to 3, and the level of accuracy increased by 6 to 10%. Two years later, the integration of a genetic algorithm (GA) and quantum-behaved particle swarm optimization (QPSO) for detecting wound infection was proposed [31]. The optimization uses a weighting between 0 and 1, but the actual number of sensors, 15, does not change. The advantage of this method is that the level of accuracy increases by about 7.5%. In the same year, 2014, kernel principal component analysis (KPCA) based linear discriminant analysis (LDA) for detecting indoor air contaminants was proposed [32]. This method can remove one of the 4 sensors, so that air contaminants can be detected with high performance and low implementation cost using a combination of 3 gas sensors. In 2016, binary quantum-behaved particle swarm optimization for detecting wound infection was proposed [33]. This method reduced the number of sensors from 20 to 6, and the level of accuracy reached 97.6%. In the same year, a wavelet transform and filter-based feature selection approach for classifying beef quality was proposed [15]. However, that study used only a single FSA and a single dataset. One year later, an optimized feature matrix using mean, variation coefficient, cluster, and correlation analysis for detecting Chinese pecan quality was proposed [34].
The number of sensors was reduced by 5 from 13, and the data dimension was reduced by 19 from 30. The result was shown in a principal component analysis (PCA) score plot and a regression model. In the same year, 2017, the integration of traditional ANOVA including loading analysis methods, the Wilks statistic method, and sensors sensitive to aroma compounds for detecting a variety of apple juices was proposed [35]. The results were tested using PCA, K-means clustering, and SVM. The original number of sensors was 10. The optimization using ANOVA including loading analysis methods, the Wilks statistic method, and sensors sensitive to aroma compounds reduced the number of sensors to 7, 4, and 7, respectively. SVM combined with the Wilks statistic method reached 100% testing accuracy. Furthermore, the lowest number of sensors increases efficiency and flexibility for a large number of apple juice samples. One year later, in 2018, random forest complemented with a new measure, namely Gini importance, was proposed [36]. The number of sensors was reduced from 6 to 2 based on their best accuracy, which was predicted beforehand using Gini importance. In 2019, the response surface method (RSM) was proposed to optimize the number of sensors for detecting the freshness of strawberries [37]. The number of sensors was reduced from 8 to 5. The methods used to analyze the result were PCA, LDA, and SVM. The accuracy of LDA classification was 86.4%, while the data variance explained by LDA was 84%. The validation accuracy of the two types of SVM was 50.6% for C-SVM and 55.6% for Nu-SVM.
Typically, in sensor array optimization, an FSA is employed to select the best feature set, which corresponds to a gas sensor combination. However, none of the previous studies discusses the robustness and stability of the FSA in sensor array optimization. In practice, an FSA can generate unstable outputs on homogeneous datasets. This problem leads to a lack of generalization of the sensor array and makes it difficult to determine the best gas sensor combination. Meanwhile, several studies address FSA stability issues. The stability of five FSAs is assessed on a set of proteomics datasets [20]. In particular, the stability of a wrapper FSA is measured over four datasets with k-Nearest Neighbor as the classifier [38]. That study mentions that different partitionings of the training data lead to different selected feature sets. Data heterogeneity also affects the ranking and the stability of FSAs on high-dimensional correlated data [21]. Another study defines properties of stability measures, including fully defined, bounds, maximum, and correction for chance, to assess several stability measures [39]. An assessment of FSA stability is also performed on high-dimensional data with small samples [22]. That study concludes that a small sample size strongly affects FSA stability. FSA stability is also an issue in software quality prediction datasets [40]. The result shows that a filter-based FSA (ReliefF) is more stable than a wrapper-based FSA.

III. PROBLEM FORMULATION
In this section, the experimental setup used to generate the datasets is discussed. Moreover, the problem is formulated mathematically and the metrics of FSA stability are explained.

A. EXPERIMENTAL SETUP
In this experiment, twelve kinds of beef cuts were observed: round (shank), top sirloin, tenderloin, flap meat (flank), striploin (shortloin), brisket, clod/chuck, skirt meat (plate), inside/outside, rib eye, shin, and fat. Each sample weighed 125 grams. Table 1 shows all of the gas sensors that were used in this experiment. Furthermore, the prototype of the e-nose sensor box equipped with a wireless communication module is shown in Fig. 1. E-nose signals were sent from the sensor box to the computer server every minute over the wireless network. For each experiment cycle, data were recorded continuously for 2220 minutes, from fresh beef until it spoiled. After one experiment cycle was complete, the temperature control box and sensor box were flushed using a high-speed fan. After that, they were rested for 3-6 hours to neutralize the remaining odor from previous experiments. This procedure was repeated for all cuts of beef. In one experiment cycle, we obtained 2220 measurement points from a beef cut. Thus, we have a total of 26640 measurement points from twelve datasets corresponding to twelve pieces of beef. For labeling, the total number of bacteria was used as the main standard of beef quality. A spectrophotometer with 1000x dilution was used to quantify optical density. Afterwards, a hemocytometer was used to determine the microbial population in a beef sample. The procedure follows a combination of the classical and two-hour methods [41]. The beef quality is divided into four sensory classes according to total viable count (TVC). This classification complies with the standard issued by the Agricultural and Resource Management Council of Australia and New Zealand, as shown in Table 2 [42]. Given these characteristics, the datasets can be considered homogeneous because they have nearly similar patterns apart from the noise contamination produced by fluctuating humidity levels. Besides the common cause of small sample size, stability is also highly dependent on the type of feature selection algorithm in use [19].
In this experiment, we used datasets generated by an e-nose in beef quality monitoring. The main characteristics of these datasets are that they are homogeneous, noisy, and of relatively low dimension. Homogeneous datasets are datasets obtained from different observations in a similar environment. The noise is caused by fluctuating relative humidity in the sample chamber. The issue of noise has been tackled by our proposed noise filtering framework; the interested reader can refer to [8], [9]. The number of sensors used determines the data dimension. In this experiment, eleven gas sensors produce eleven features, which is a relatively low dimension. However, in the sensor array optimization problem, the number of sensors in the array is a sensitive issue because more sensors lead to higher electrical consumption, production cost, and data traffic and storage. Overall, the datasets contain a total of 26640 measurement points from twelve datasets corresponding to twelve pieces of beef. We argue that this size is sufficient to avoid the small dataset issue. Moreover, it is not a good idea to use a different sensor combination for each beef type, because this leads to a large number of sensors being needed to build the sensor array. Thus, a stability assessment of FSAs is necessary for the sensor array optimization problem.

B. FEATURE SELECTION ALGORITHMS AND WEIGHTING
In this study, we investigate the stability of 17 filter-based FSAs for sensor array optimization in an e-nose. A brief explanation of the FSAs, including their selection criteria, is given in Table 3. In this experiment, we used the FEAST toolbox for information-theoretic FSAs [47] and scikit-feature for similarity-based and statistical-based FSAs [49]. Every FSA has a different way of selecting the best features. The two most important aspects of feature selection are maximizing relevancy and minimizing redundancy. Relevancy means that the selected features must be able to predict the class labels. Redundancy, on the other hand, concerns the requirement that the selected features should not be strongly correlated with each other, which implies that several strongly correlated features can be represented by only one feature. Hence, in this subsection, the mechanisms of the FSAs are briefly explained for a better understanding of the stability assessment in the next section.
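As an illustration of the relevancy criterion, the following minimal sketch ranks features by their empirical mutual information with the class label, which is the idea behind MIM. It is not the FEAST or scikit-feature implementation; the function names `mutual_information` and `mim_ranking`, the sensor names, and the toy data are hypothetical, and the sensor readings are assumed to have been discretized beforehand.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information I(X; Y) in bits for two discrete sequences."""
    n = len(xs)
    px = Counter(xs)            # marginal counts of X
    py = Counter(ys)            # marginal counts of Y
    pxy = Counter(zip(xs, ys))  # joint counts of (X, Y)
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * log2(c * n / (px[x] * py[y]))
    return mi


def mim_ranking(features, labels):
    """Rank feature names by relevance I(X_k; Y), highest first.

    `features` maps a (hypothetical) sensor name to its discretized readings.
    No redundancy term is used, so correlated sensors are not penalized.
    """
    scores = {name: mutual_information(col, labels) for name, col in features.items()}
    ranking = sorted(scores, key=scores.get, reverse=True)
    return ranking, scores


# Toy data (assumed, not from the experiment): 4 measurement points, 3 sensors.
labels = [0, 0, 1, 1]
features = {
    "s1": [0, 0, 1, 1],  # identical to the label
    "s2": [0, 1, 0, 1],  # independent of the label
    "s3": [0, 0, 0, 1],  # partially informative
}
ranking, scores = mim_ranking(features, labels)
```

Because each feature is scored on its own, two sensors with overlapping selectivity can both appear at the top of such a ranking; criteria with a redundancy term are designed to avoid this.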
Consider a dataset $DS_i = \{(x_j, c_j),\ j = 1, \dots, n\}$, $i = 1, \dots, m$, where the $m$ series and $n$ instances correspond to the number of beef samples and measurement points, respectively. Instance $x_j$ denotes a $k$-dimensional vector $x_j = (x_{j1}, x_{j2}, \dots, x_{jk})$ labeled as $c_j$, where each component $x_{jl}$ represents the value of a feature. Furthermore, consider an FSA whose output is a vector $y = (y_1, y_2, y_3, \dots, y_k)$ that denotes the selected feature subset. The selected feature set is determined by the weighted appearance $w_{DS_i, x_{jl}}$ of a particular feature $x_{jl}$ in dataset $DS_i$, which is computed from $rank_{DS_i, x_{jl}}$, the rank of feature $x_{jl}$ in dataset $DS_i$. After the feature output is sorted into the top-$k$ features, the selected feature set is determined by comparing the weight $w_{y_i}$ of feature $y_i$ with the average weight $\bar{w}$ of all features, where 0 and 1 indicate that the feature is not selected and selected, respectively. Furthermore, Fig. 2 illustrates the flow of the experiment, in which the FSAs produce a feature ranking from each dataset. The top-$k$ features according to a particular FSA in every dataset are determined by the weighted appearance. Hence, we have several sets of top-$k$ features based on the combination of FSA and dataset. Thus, the stability of an FSA is measured according to the difference in its output between the datasets (cross-stability).

C. STABILITY METRIC
In the literature, several stability metrics have been proposed, for example the Jaccard [20] and Hamming [37] distances. However, they are susceptible to subset-size bias, which makes their consistency across different settings questionable [60]. To deal with this problem, Kuncheva proposed a stability index that is expressed by

$$S_K(y, \hat{y}) = \frac{r - \frac{k^2}{n}}{k - \frac{k^2}{n}} = \frac{rn - k^2}{k(n - k)}$$

where $y$, $\hat{y}$, $r$, $k$, $n$ denote the selected feature set in the first observation, the selected feature set in the second observation, the size of the intersection between the two selected feature sets, the number of selected features, and the total number of features, respectively. This index requires $y \subset x$ and $\hat{y} \subset x$ with $|y| = |\hat{y}| = k$, where $0 < k < |x| = n$. However, several FSAs can produce outputs of different cardinality ($k_y \neq k_{\hat{y}}$), which means that this metric cannot be used directly. Nogueira proposed an extension of this metric that handles different cardinalities of the selected feature sets [60]. This metric can be expressed as follows:

$$S_N(y, \hat{y}) = \frac{r - \frac{k_y k_{\hat{y}}}{n}}{\max\left[-\max(0,\ k_y + k_{\hat{y}} - n) + \frac{k_y k_{\hat{y}}}{n},\ \min(k_y, k_{\hat{y}}) - \frac{k_y k_{\hat{y}}}{n}\right]}$$

Similar to Kuncheva's index, this measure takes values in $[-1, 1]$. It reaches its maximum value when the two selected feature subsets are identical, and low values when they are dissimilar. In this experiment, 12 homogeneous datasets are used to investigate the stability of 17 FSAs. For every FSA, the cross-stability of the 12 datasets is calculated. Table 4 shows the matrix of stability among the 12 datasets, where $S_N^{a,b}$ is the stability value in the $a$-th row and $b$-th column, which correspond to two different datasets.
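The two indices can be sketched in a few lines of Python. This is an illustrative sketch rather than the code used in the experiment: the function names are hypothetical, Kuncheva's index is written as $(rn - k^2) / (k(n - k))$, and the extended index follows our reading of the chance-corrected form in [60], with expected overlap $k_y k_{\hat{y}} / n$ and a denominator equal to the largest possible deviation from that expectation. Both should be checked against the cited references before reuse.

```python
def kuncheva_index(a, b, n):
    """Kuncheva's index for two feature subsets of equal size k out of n features."""
    a, b = set(a), set(b)
    k = len(a)
    if len(b) != k:
        raise ValueError("Kuncheva's index requires subsets of equal cardinality")
    if not 0 < k < n:
        raise ValueError("requires 0 < k < n")
    r = len(a & b)  # number of features shared by both subsets
    return (r * n - k * k) / (k * (n - k))


def extended_index(a, b, n):
    """Extension for subsets of different cardinality (assumed form, see lead-in)."""
    a, b = set(a), set(b)
    ka, kb = len(a), len(b)
    r = len(a & b)
    expected = ka * kb / n  # overlap expected by chance
    # Largest deviation from the expectation, either below or above it.
    denom = max(expected - max(0, ka + kb - n), min(ka, kb) - expected)
    if denom == 0:
        raise ValueError("index undefined for empty or full subsets")
    return (r - expected) / denom


def cross_stability(subsets, n):
    """Matrix of pairwise extended-index values between per-dataset subsets."""
    m = len(subsets)
    return [[extended_index(subsets[a], subsets[b], n) for b in range(m)]
            for a in range(m)]


# Toy usage with n = 11 sensors; the subsets are made up for illustration.
n_sensors = 11
same_size = extended_index({1, 2, 3, 4}, {1, 2, 5, 6}, n_sensors)
disjoint = extended_index({1, 2, 3}, {4, 5, 6, 7}, n_sensors)
matrix = cross_stability([{1, 2, 3, 4}, {1, 2, 5, 6}, {1, 2, 3}], n_sensors)
```

For subsets of equal size the two functions return the same value, so the extended index can be used throughout, including for FSAs whose output cardinality varies between datasets.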

IV. STABILITY ASSESSMENT BASED ON THE EXTENSION OF KUNCHEVA'S INDEX
In this experiment, we investigate the stability of the 17 FSAs mentioned above. They fall into three major groups: information-theoretic (MIM, MRMR, MIFS, CMIM, JMI, DISR, CIFE, ICAP, CONDRED, CMI, FCBF), similarity-based (Fisher score, ReliefF, trace ratio), and statistical-based (chi-square, F-score, Gini index). Their outputs can have different cardinality depending on the dataset, which occurs for several FSAs such as CMI and FCBF. Hence, the use of the extension of Kuncheva's index is reasonable. Fig. 3 shows heat maps of the cross-dataset stability on the 12 homogeneous datasets. A brighter box indicates that the feature subset selected on a particular dataset has members similar to those selected on another dataset, whereas a darker box means that the feature subset selected on one dataset is highly different from that selected on another. Thus, a heat map with many dark boxes implies that an FSA is less stable. According to this figure, MIM, ReliefF, and chi-square have many pale boxes, which implies that they have the most stable outputs on the 12 datasets. However, none of them produces a consistent feature subset on every dataset.
Moreover, Fig. 4 shows the stability box plot of the FSAs. CMIM, JMI, DISR, CIFE, and ICAP show nearly the same stability in their feature subset recommendations. MIM, ReliefF, and chi-square produce the best stability among the information-theoretic, similarity-based, and statistical-based FSAs, respectively. Furthermore, chi-square is the most stable FSA according to the average value of the stability index, because it has several high outliers ($S_N = 1$) but no low outlier. However, none of the FSAs produces output consistent enough to give a satisfactory generalization of the sensor combination. In addition, several FSAs produce negative outliers, which implies that their outputs are almost completely different on different datasets. Hence, it is too risky to use a single algorithm to decide the best sensor combination in sensor array optimization. Table 5 summarizes the performance in terms of computational complexity, redundancy term, and average stability, which practitioners can take into consideration when choosing an FSA. In this study, computational complexity is observed based on the library implementation of each FSA. The existence of a redundancy term implies that an FSA has a mechanism for dealing with feature redundancy. Moreover, average stability is obtained from the cross-stability among the 12 homogeneous datasets. The majority of information-theoretic FSAs consider both feature relevancy and feature redundancy. According to Table 3, the redundancy term is represented by $\sum_{X_j \in S} I(X_k; X_j)$. Based on the experimental results, FSAs with a redundancy term have relatively low stability, including MIFS, MRMR, CIFE, CMIM, DISR, ICAP, CMI, and FCBF. In contrast, MIM does not have a redundancy term and has relatively low computational complexity ($O(n^2)$). It achieves the highest output stability in the information-theoretic group. CONDRED also achieves the third highest stability value because it only considers conditional redundancy.
On the other hand, JMI achieves relatively high stability (0.21898) even though it has a redundancy term. This result supports the experiment performed by Brown et al., in which JMI has the best trade-off because it balances the relevancy and redundancy terms and includes the conditional redundancy [47]. Unlike information-theoretic FSAs, similarity-based and statistical-based FSAs assess the importance of each feature individually without handling feature redundancy. Hence, they only emphasize finding the features most relevant to the class label. Typically, a similarity-based FSA is simple and straightforward. The focus of the computation is to build an affinity matrix for score calculation. Thus, the computational complexity is no more than $O(n^3)$. ReliefF obtains the highest stability in this group (0.32593). Fisher score is less stable because it relies heavily on statistical measures such as mean and variance, whose values can vary even on homogeneous datasets. Furthermore, statistical-based FSAs also evaluate the importance of features individually, so they cannot handle feature redundancy. Most of them depend on predefined statistical measures to eliminate irrelevant features. Chi-square has the highest stability across all groups (0.37685). It uses a simple scoring scheme and has low computational complexity ($O(n^2)$). F-score employs the mean value to characterize between-group and within-group variance. Like Fisher score, it produces very low stability (0.08657).
In addition, Gini index achieves a stability score of 0.275 with a computational complexity of $O(n^4)$. Its computational complexity is relatively high because it uses probability as the basis for its selection criterion. Overall, the results show that a single FSA cannot produce stable outputs, which implies that it cannot guarantee the generalization of the sensor array.
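The summary statistics used in this section can be illustrated with a short sketch. It assumes that the average stability of an FSA is the mean over all distinct dataset pairs of its cross-stability matrix (the diagonal is excluded); this averaging convention, the function names, and the toy 3 x 3 matrix are assumptions made for illustration and are not taken from Table 4 or Table 5.

```python
from statistics import mean


def pairwise_values(matrix):
    """Upper-triangle entries of a symmetric cross-stability matrix."""
    m = len(matrix)
    return [matrix[a][b] for a in range(m) for b in range(a + 1, m)]


def average_cross_stability(matrix):
    """Mean stability over all distinct dataset pairs (diagonal excluded)."""
    return mean(pairwise_values(matrix))


def negative_pairs(matrix):
    """Dataset pairs whose selected subsets overlap less than chance level."""
    m = len(matrix)
    return [(a, b) for a in range(m) for b in range(a + 1, m) if matrix[a][b] < 0]


# Made-up matrix for one FSA on three datasets (not experimental data).
toy = [
    [1.0, 0.5, -0.1],
    [0.5, 1.0, 0.2],
    [-0.1, 0.2, 1.0],
]
avg = average_cross_stability(toy)
risky = negative_pairs(toy)
```

With 12 datasets the same functions would average over 66 dataset pairs per FSA, and the list returned by `negative_pairs` corresponds to the negative outliers discussed for the box plot.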

V. CONCLUSION
The stability of FSAs is a major concern in many real-world applications. In this experiment, 12 homogeneous datasets were generated from an e-nose. An FSA is typically used to deal with the sensor array optimization problem. The investigation was performed using 17 filter-based FSAs, and the cross-stability was measured among the datasets. The experimental results show that a single FSA generates different selected feature sets on homogeneous datasets. The majority of information-theoretic FSAs produce low stability. The most stable FSA in this group is MIM, which does not consider feature redundancy and has low computational complexity. It can be applied as a first step to select features before a more advanced FSA is performed. JMI is the most stable FSA among those that consider both relevancy and redundancy. ReliefF and chi-square are the most stable among the similarity-based and statistical-based FSAs, respectively. They have low computational complexity but evaluate feature relevancy individually without considering redundancy among the selected features. According to the stability index, none of the FSAs presents satisfactory performance, which means that they have unstable outputs. In e-nose applications, this is a serious problem because a different gas sensor combination would have to be used to classify each beef type. Hence, the sensor array lacks generalization. It is not a wise choice to use a different combination of sensors to detect each sample. Another option is to use as many sensors as possible to detect samples. However, this is also not a good solution because it raises other problems, such as high electrical consumption, high-dimensional data, computational overhead, and high data traffic. The main objective of sensor array optimization is to determine the best sensor combination, which not only contains the fewest possible sensors but also generalizes well.
Researchers and practitioners should therefore take care to optimize sensor arrays in a way that guarantees generalization. Hence, it is necessary to develop a proper approach to deal with this problem. In future work, a hybrid FSA will be developed to improve on the stability of a single FSA.