Exploiting Feature Selection in Human Activity Recognition: Methodological Insights and Empirical Results using Mobile Sensor Data

Human Activity Recognition (HAR) using mobile sensor data has gained increasing attention over the last few years, with a fast-growing number of reported applications. The central role of machine learning in this field has been discussed in a large body of research, with several strategies proposed for processing raw data, extracting suitable features, and inducing predictive models capable of recognizing multiple types of daily activities. Since many HAR systems are implemented on resource-constrained mobile devices, the efficiency of the induced models is a crucial aspect to consider. This paper highlights the importance of exploiting dimensionality reduction techniques that can simplify the model and increase efficiency by identifying and retaining only the most informative and predictive features for activity recognition. In more detail, a large experimental study is presented that encompasses different feature selection algorithms as well as multiple HAR benchmarks containing mobile sensor data. This comparative evaluation relies on a methodological framework that is meant to assess not only the extent to which each selection method is effective in identifying the most predictive features but also the overall stability of the selection process, i.e., its robustness to changes in the input data. Although often neglected, the stability of the selected feature sets is important for a wider exploitability of the induced models. Our experimental results give insight into which selection algorithms may be best suited to the HAR domain, complementing and significantly extending the studies currently available in this field.


I. INTRODUCTION
Sensor-based Human Activity Recognition (HAR) is a growing research field that deals with automatically identifying the activities a user is performing based on the analysis of data collected from a variety of sensors [1]. HAR systems have several important applications in different areas including ambient assisted living [2], physical training [3], [4], activity monitoring for health assessment [5], assistance for child and elderly care [6] as well as assistance for people with cognitive disorders [7].
In particular, sensors fitted in mobile devices (accelerometers, gyroscopes, etc.) have now reached widespread adoption and make it possible to envisage intelligent monitoring systems that can seamlessly track daily activities, in a non-intrusive way, and help users make better decisions about their future actions [8]-[10]. However, such an automatic decision support capability is still limited, despite the increasing capacity of processing data from smart devices. Furthermore, the exploitation of HAR systems in real-world scenarios not only demands high recognition accuracy but also poses multiple challenges in terms of power consumption and robustness with respect to different context conditions [11], [12], calling for further research efforts in this field.
In recent years, several machine learning approaches have been explored for the automatic classification of daily living activities based on mobile sensing [13]- [18]. The overall process involves different steps, including the cleansing of raw data to remove noise and artifacts, and the extraction of high-level features that can be useful to discriminate among the considered activities [19]. A common methodology for feature extraction relies on segmenting the sensor data, e.g., the tri-axial acceleration signals, into time windows that are subsequently mapped, through proper functions, into a set of meaningful features. Different types of mapping approaches have been explored based on the application scenario [20], resulting in time-domain, frequency-domain, or other types of features that can be finally fed to a classifier to induce the HAR model. Automatic feature extraction based on deep learning methods has also been recently investigated [11], [21]. Both the features' definition and the choice of the classifier may significantly affect the performance of the recognition system, as witnessed by a vast amount of literature in the field [5], [8], [20].
The computational efficiency of the HAR system is another important aspect to consider when dealing with mobile devices that may have limitations in terms of processing capability as well as energy consumption. From this point of view, it may be convenient to contain the dimensionality of the feature vectors used at the classification stage, as discussed in some recent studies [22]- [24]. The feature extraction process can indeed result in a large number of features, especially in a multi-sensor system where the feature sets generated from the single sensors can be fused to create feature vectors of high dimensionality [25]. In such a scenario, it can be useful to apply automatic techniques to identify and select the most important features for prediction, in order to simplify the activity recognition models, decrease overfitting, and reduce computations. Indeed, it has been observed that not all the extracted features have the same importance in the classification of physical activities [19], [26]. Some features may be redundant, weakly relevant, or even irrelevant/noisy for a specific task, suggesting that feature selection techniques could be effectively employed to reduce the data dimensionality and improve the HAR system's efficiency without compromising the final prediction accuracy.
However, little research has so far explored the potential of feature selection in this field. Most emphasis has been given to investigating which classification methods and strategies may be best suited for inducing the HAR models [5], [20], [27], [28], while few studies have compared the impact of different feature selection algorithms on such models [29] to gain insight into which heuristics may be most effective in selecting reduced subsets of discriminative features. To make a contribution in this direction, we present here an extensive comparative study that encompasses several selection approaches and investigates their behavior on typical HAR datasets extracted from mobile devices, like smartphones and smartwatches, where recognition performance and computational efficiency need to be jointly optimized.
In more detail, extending our previous research in this field [30], this work provides a two-fold contribution. First, a general methodological framework is presented that allows evaluating the effectiveness of the feature selection process along two directions: i) the capacity of the algorithm to identify the most important features for activity prediction, and ii) the stability of the selected feature subsets, i.e., their robustness to perturbations in the input data. This way we can better understand the suitability of a given selection method for the considered application scenario. Indeed, identifying feature subsets that are both highly predictive and stable is important for a wider exploitability of the induced models, as highlighted in recent literature [31], [32].
Leveraging such a framework, we carried out an experimental evaluation for different levels of dimensionality reduction, i.e., for different percentages of selected features, in order to find an optimal trade-off between the number of features used for prediction and the resulting classification performance. Specifically, our study includes selection methods that are representative of different heuristics: univariate approaches, which assess every single feature independently of the others; multivariate approaches, which capture the inter-dependencies among the features; filter approaches, which carry out the selection process without interacting with the learning algorithm; embedded approaches, which exploit the features' weights derived by a proper classifier to assess the relevance of the features. Such a comprehensive evaluation has been performed in conjunction with learning algorithms that have proven highly effective in this domain, such as Support Vector Machines and Random Forest.
The experimental study has been conducted on five HAR datasets extracted from mobile sensor data [33]-[37], using either a single sensor type (accelerometer) or different types of sensors (accelerometer, gyroscope, magnetometer). Further, the considered benchmarks present different levels of dimensionality, ranging from about 150 features to over a thousand features, which has allowed us to explore the behavior of the considered selection algorithms across different feature spaces.
Overall, the results of our experiments show that the feature selection process can reduce the original dimensionality to a great extent without any degradation in the final recognition performance, which confirms the importance of introducing an automatic dimensionality reduction step into any mobile sensing processing pipeline. By jointly evaluating the predictive performance and the selection stability, we also obtained insights into which methods may be best suited to the considered domain. To the best of our knowledge, this is the first work that evaluates several feature selection approaches in this field based on different real-world benchmarks. Moreover, in this work we investigate the stability of selection methods, which is neglected by most of the existing studies.
The remainder of this paper is structured as follows. Section II gives some background concepts on feature selection and discusses its applications in the HAR field. Section III describes the adopted methodological framework and the specific selection algorithms chosen for the experimental study. The characteristics of the considered datasets are illustrated in Section IV. Section V presents the experimental analysis and Section VI further discusses the main findings. Finally, Section VII gives some concluding remarks as well as directions for future work.

II. BACKGROUND CONCEPTS AND RELATED WORK
In the last two decades, several research efforts have focused on devising proper methods for handling high dimensional datasets [38]. Feature selection plays an important role in such a context as it can discard irrelevant and redundant information, as well as noisy factors, with significant benefits in terms of computational efficiency, model interpretability and data understanding [39]. As summarized below, a wide variety of feature selection methods can be found in the literature, with some promising applications also in the HAR field [22]-[24], [29], [40]-[51], [53]-[55].

A. BACKGROUND ON FEATURE SELECTION
In the context of supervised learning tasks, like those considered in this paper, the available selection techniques can be broadly distinguished into three categories [39], [56]:
• Filter methods, which conduct the selection process as a pre-processing step, without interacting with the learning algorithm used at the model induction stage, thus leading to a classifier-independent selection outcome. This can involve the individual evaluation of every single feature, based on its correlation with the class attribute, or the evaluation of subsets of features, within which the reciprocal correlation among the features is also considered to minimize redundancy (but at an increased computational cost).
• Wrapper methods, which search for the feature subset that can optimize the predictive performance of a given classifier. In this case, the learning algorithm itself is employed as an evaluation function to assess each candidate subset of features. The computational cost of the selection process is hence dependent on the classifier's intrinsic efficiency, as well as on the search strategy used to build the candidate subsets (e.g., an evolutionary search or a greedy stepwise search), with an overall burden generally higher than that of filter methods.
• Embedded methods, which leverage the intrinsic capacity of some classification algorithms (e.g., Support Vector Machine classifiers) to assign weights to the features, without requiring a systematic search through different candidate subsets, as in the case of wrappers. In terms of computational cost, this approach often provides a reasonable trade-off between filters and wrappers, with results that have proven quite satisfactory in multiple scenarios.
Several studies have investigated the potential and the drawbacks of the different selection methods proposed so far [57], [58].
Due to their reduced computational requirements, filter and embedded methods have found wider adoption in high-dimensional problems, with a variety of algorithms available that support both univariate and multivariate selection processes, as discussed in more detail in Section III. Hybrid and ensemble techniques, which integrate different selection methods, have also been studied with promising results in recent years [39], [59], [60].

B. FEATURE SELECTION IN THE HAR FIELD
Feature selection methods falling in the filter category have been generally preferred in the HAR field. For example, [40] and [41] have investigated the use of MRMR and CFS filters [38], which are designed to search for subsets of features that are highly correlated with the target class but not correlated with each other. Similarly, the inter-dependencies among the features are considered in [42], where a game theory-based feature selection (GTFS) method is proposed to optimize the size of the selected feature subsets. Such a subset-oriented evaluation tries to reduce redundancy and can be an effective and viable solution when the original dimensionality is not too high.
A more efficient ranking-based filter is exploited in [43], where every single feature is weighted according to an information theoretical criterion, known as Information Gain, with an overall ordering of features based on the resulting weights. Such an approach allows discarding the features that are less useful in discriminating the target class and has proven well suited even in the presence of very high dimensionalities. Ranking-based filters have also been applied in [44], [45], where the Chi-Squared, Fisher score, and ReliefF methods are employed to weight the features and arrange them in decreasing order of relevance. A comparison between ranking-based and subset-oriented filters is presented in [23], which emphasizes that the filter-selected, classifier-independent, feature subsets have potentially broad exploitability in smartphone-based HAR.
Fewer applications can be found in this field for the embedded methods and the wrapper methods [46]-[48]. Specifically, given the higher computational cost of wrappers, they have been mainly applied after preliminarily reducing the data dimensionality by means of a filter, as in [24] and [49]. Some direct comparisons between filter and wrapper approaches are presented in [22], [50], [51]; in such studies, however, the dimensionality of the original feature space is relatively low, which can make the higher computation time required by wrappers acceptable.
An interesting line of research recently explored in the HAR field relies on the use of nature-inspired and swarm intelligence algorithms [52] for optimizing the feature subset selection process within a wrapper model approach. In particular, [53] proposes a new method that involves the hybridization of two algorithms, namely the gradient-based optimizer (GBO) and the grey wolf optimizer (GWO). A binary firefly algorithm (BFA) is used in [54], in combination with a GTFS filter that preselects potentially interesting features. The integration of bee swarm optimization (BSO) and reinforcement learning is explored in [55], with a comparison with other swarm-based methods.
Despite an increasing amount of research pointing out the benefits of feature selection in HAR applications, there is a lack of comparative studies that extensively evaluate the strengths and weaknesses of the different selection approaches in this field. At the time of writing, the largest experimental comparison is presented in [29], where ten different selection algorithms (seven filters, two wrappers and one embedded method) are evaluated on a single sensor dataset involving more than 200 attributes; interestingly, the feature subsets selected using efficient filter methods are found to outperform those produced by wrappers that, although potentially capable of yielding superior results, are more prone to the problem of overfitting.
Finally, to the best of our knowledge, only the impact of feature selection on the performance of HAR models has been considered so far, without investigating the stability of the selection process, i.e., its sensitivity to changes in the input data, which may critically affect the robustness of the induced models. Taking such an aspect into account, this work presents a wide comparative analysis that complements the studies available in this field, encompassing several selection methods and several benchmarks, as detailed in the following sections.

III. METHODOLOGICAL FRAMEWORK
Our methodological framework is meant to be general enough to be implemented with different selection methods as well as different learning algorithms. Further, as anticipated above, it involves evaluating both the predictive power and the stability of the selected feature subsets, which is important to understand the extent to which these subsets can be truly relevant for the task at hand, regardless of the specific composition of the training data. All the steps of the adopted methodology are outlined in what follows, along with a description of the specific methods and settings chosen for the comparative analysis.

A. RANKING-BASED FEATURE SELECTION AND PERFORMANCE EVALUATION
As a general framework for a wide comparison, we chose a ranking-based selection approach [39] that is flexible enough to encompass the use of filter methods, that assign weights to the features based on their degree of correlation with the class (i.e., the activity to be predicted), and embedded methods, that rely on the features' weights derived by a proper learning algorithm.
The assigned weights, regardless of how they are computed, can be used to obtain a ranked list in which the N features of the data at hand (D) appear in decreasing order of relevance, i.e., from the most important (rank 1) to the least important (rank N), as schematized in Figure 1. Such a list can then be cut at a suitable threshold point (n) to select a subset of highly relevant features, i.e., the n top-ranked ones.
Considering only these features, an activity recognition model can be induced from D by training any suitable classifier. The performance of the resulting model is evaluated on a separate set of test records that are structured to contain only the features previously selected from D (to avoid any selection bias, in fact, the test records must not be used in the feature selection process).
The best level of dimensionality reduction, i.e., the optimal value of the cut-off threshold in Figure 1, may vary depending on the specific task at hand. The effect of modifying such a threshold is explored experimentally in our study, which encompasses different values of n and evaluates their impact on the final recognition performance. This approach allows discarding the unnecessary features in a cost-effective way (especially when the data dimensionality makes impractical the direct adoption of wrapper-based search strategies).
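The ranking-and-cut procedure described in this subsection can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the study: the feature names, the weight values and the helper functions (rank_features, select_top_n, project) are hypothetical, and the weights are assumed to have been computed on the training data only.

```python
from typing import Dict, List, Sequence


def rank_features(weights: Dict[str, float]) -> List[str]:
    """Return feature names sorted from most relevant (rank 1) to least (rank N)."""
    # Ties are broken by name only to make the ordering deterministic.
    return sorted(weights, key=lambda name: (-weights[name], name))


def select_top_n(weights: Dict[str, float], n: int) -> List[str]:
    """Cut the ranked list at threshold n and keep the n top-ranked features."""
    if not 0 < n <= len(weights):
        raise ValueError("n must be between 1 and the number of features")
    return rank_features(weights)[:n]


def project(record: Dict[str, float], selected: Sequence[str]) -> List[float]:
    """Restrict a record (training or test) to the selected features only."""
    return [record[name] for name in selected]


# Hypothetical weights, assumed to be computed on the TRAINING data only.
weights = {"acc_x_mean": 0.42, "acc_y_std": 0.91, "gyro_z_max": 0.10, "acc_z_kurt": 0.67}
selected = select_top_n(weights, n=2)

# A test record is reduced to the same features; it plays no role in the ranking.
test_record = {"acc_x_mean": 0.3, "acc_y_std": 1.2, "gyro_z_max": -0.5, "acc_z_kurt": 2.9}
reduced = project(test_record, selected)
```

Repeating the cut for several values of n, and retraining the classifier each time, gives the performance-versus-dimensionality curves that the study explores.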

B. STABILITY EVALUATION
For a stable selection method, we expect to obtain (almost) the same outcome when the original set of training instances is somewhat perturbed (e.g., randomly removing a given percentage of records) [61].
Evaluating stability essentially involves two aspects [62]: (i) a procedure to create multiple sample sets from the available data, and (ii) a consistency index to quantify the sensitivity of the selection process to sample variation. In more detail, given a dataset D with R instances and N features, a number K of reduced datasets $D_i$ (i = 1, 2, ..., K) are drawn, each containing a fraction f of the original instances. The chosen selection algorithm is then applied to each $D_i$, obtaining an output $O_i$ (i = 1, 2, ..., K) that may depend on the specific composition of $D_i$. A proper similarity measure is finally used to assess the pairwise similarity between the outputs $O_i$: the higher their average similarity, the more stable the selection method.
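The sampling step of this procedure can be sketched as follows. The sketch assumes sampling without replacement and uses a deliberately simple placeholder scorer (absolute difference of class means on binary labels) in place of a real selection algorithm; the function names, the toy data and the values of K, f and n are all hypothetical.

```python
import random
from typing import Callable, List, Sequence, Set, Tuple

Instance = Tuple[Sequence[float], int]            # (feature vector, class label)
Scorer = Callable[[List[Instance]], List[float]]  # returns one weight per feature


def draw_subsamples(data: List[Instance], K: int, f: float, seed: int = 0) -> List[List[Instance]]:
    """Draw K reduced datasets, each holding a fraction f of the instances."""
    rng = random.Random(seed)
    size = max(1, round(f * len(data)))
    return [rng.sample(data, size) for _ in range(K)]


def top_n_subset(scores: Sequence[float], n: int) -> Set[int]:
    """Indices of the n features with the highest scores."""
    order = sorted(range(len(scores)), key=lambda j: (-scores[j], j))
    return set(order[:n])


def selection_outputs(data: List[Instance], scorer: Scorer, K: int, f: float,
                      n: int, seed: int = 0) -> List[Set[int]]:
    """Apply the scorer to each reduced dataset and collect the selected subsets."""
    return [top_n_subset(scorer(d), n) for d in draw_subsamples(data, K, f, seed)]


def toy_scorer(sample: List[Instance]) -> List[float]:
    """Placeholder scorer: |mean(class 1) - mean(class 0)| for each feature."""
    scores = []
    for j in range(len(sample[0][0])):
        by_class = {0: [], 1: []}
        for x, y in sample:
            by_class[y].append(x[j])
        if not by_class[0] or not by_class[1]:
            scores.append(0.0)
        else:
            m0 = sum(by_class[0]) / len(by_class[0])
            m1 = sum(by_class[1]) / len(by_class[1])
            scores.append(abs(m1 - m0))
    return scores


# Toy data: feature 0 equals the label, feature 1 is constant, feature 2 is noise.
data = [([float(i % 2), 1.0, (i % 3) / 10.0], i % 2) for i in range(20)]
outputs = selection_outputs(data, toy_scorer, K=5, f=0.8, n=1)
```

On this toy data every reduced dataset selects the same feature, which is the behavior expected from a perfectly stable selector.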
In our framework, each output $O_i$ takes the form of a feature subset $S_i$ containing n of the original features (selected according to the approach explained previously), for a total of K subsets of the same size. To evaluate how similar these subsets are to each other, we rely on a consistency index known as the Kuncheva measure [63], which has proved to be suitable in the context of high dimensional problems. Specifically, given a pair of subsets $S_i$ and $S_j$, their similarity is measured as follows:
$$ sim_{ij} = \frac{|S_i \cap S_j| - n^2/N}{n - n^2/N} $$
where $|S_i \cap S_j|$ is the number of features that are common to $S_i$ and $S_j$. The similarity $sim_{ij}$ essentially measures the degree of overlap between the two subsets, with a proper correction reflecting the probability that a feature is included in both subsets simply by chance [31]. The resulting similarity values are then averaged across all pair-wise comparisons to assess the overall degree of consistency among the K subsets:
$$ sim_{avg} = \frac{2}{K(K-1)} \sum_{i=1}^{K-1} \sum_{j=i+1}^{K} sim_{ij} $$
This average similarity can be taken as a measure of the stability level of the selection process. Since $sim_{ij}$ and $sim_{avg}$ may vary depending on the size n of the selected subsets, our experimental study investigates the stability trend for feature subsets of increasing size, as shown in Section V.
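As a minimal sketch, the pairwise similarity and its average can be computed as shown below. The code assumes the usual form of the Kuncheva index, in which the overlap between two subsets of size n is corrected by the chance term n²/N; the function names and the example subsets are hypothetical.

```python
from itertools import combinations
from typing import List, Set


def kuncheva(s_i: Set[int], s_j: Set[int], N: int) -> float:
    """Kuncheva consistency index between two feature subsets of equal size n."""
    n = len(s_i)
    if len(s_j) != n:
        raise ValueError("both subsets must have the same size")
    if not 0 < n < N:
        raise ValueError("the index is undefined for n = 0 or n = N")
    expected = n * n / N  # overlap expected by chance
    return (len(s_i & s_j) - expected) / (n - expected)


def average_similarity(subsets: List[Set[int]], N: int) -> float:
    """Mean Kuncheva index over all K(K-1)/2 pairs of subsets."""
    pairs = list(combinations(subsets, 2))
    return sum(kuncheva(a, b, N) for a, b in pairs) / len(pairs)


# Three hypothetical subsets of n = 4 features drawn from N = 10 features.
subsets = [{0, 1, 2, 3}, {0, 1, 2, 4}, {0, 1, 5, 6}]
stability = average_similarity(subsets, N=10)
```

Identical subsets give an index of 1, while an overlap no larger than the one expected by chance gives a value close to 0 or below.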

C. METHODS AND SETTINGS
For implementing the methodological framework presented above, we considered some popular selection algorithms that are representative of different feature weighting paradigms.
In particular, we employed five univariate methods, which weigh every single feature independently of the others, and five multivariate methods, which are able to capture the interdependencies among the features. Specifically, among the univariate techniques, we chose:
• Chi Squared (χ²), which leverages the well-known chi-squared statistic to evaluate how relevant a feature is with respect to the class [57]. Specifically, once a feature has been discretized into I intervals, its χ² value is obtained as:
$$ \chi^2 = \sum_{i=1}^{I} \sum_{j=1}^{C} \frac{\left(A_{ij} - \frac{R_i B_j}{R}\right)^2}{\frac{R_i B_j}{R}} $$
where R is the total number of instances, C the number of classes, $R_i$ the number of instances in the ith interval, $B_j$ the number of instances in the jth class, and $A_{ij}$ the number of instances in the ith interval and jth class.
• Information Gain (IG), which measures the extent to which the class entropy decreases when the value of a given feature is known: the greater the decrease in entropy, the more discriminative the feature [64]. Namely, by denoting as H the entropy function, we can derive the IG value for a feature X as:
$$ IG(X) = H(Y) - H(Y|X) $$
• Symmetrical Uncertainty (SU) and Gain Ratio (GR), which in turn rely on the IG measure but include suitable correction factors that try to compensate for the IG's bias toward features with more values [66]. Specifically, after computing the IG value for a feature X, the corresponding SU and GR values are obtained as follows:
$$ SU(X) = \frac{2 \cdot IG(X)}{H(X) + H(Y)}, \qquad GR(X) = \frac{IG(X)}{H(X)} $$
where H(X) and H(Y) denote the entropy of, respectively, the feature X and the class Y.
• OneR (OR), which weights each feature based on the accuracy of a simple classification rule built on that feature, according to the approach originally proposed in [67]. In more detail, for each of the available features, the algorithm creates one rule by determining the most frequent class for each feature's value (a rule is simply a set of attribute values bound to their majority class). The prediction accuracy of each rule is then computed, and the features are ranked according to the quality of the corresponding rules.
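To make the univariate criteria concrete, the sketch below computes χ², IG, SU and GR for a single, already discretized feature. It assumes the standard textbook definitions of these measures (entropies in bits, expected counts $R_i B_j / R$ for χ²) and is not the WEKA code used in the experiments; the function names and the toy data are hypothetical.

```python
from collections import Counter
from math import log2
from typing import Hashable, Sequence


def entropy(values: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of a sequence of discrete values."""
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in Counter(values).values())


def conditional_entropy(y: Sequence[Hashable], x: Sequence[Hashable]) -> float:
    """H(Y|X): class entropy left once the value of feature X is known."""
    total = len(x)
    h = 0.0
    for value, count in Counter(x).items():
        subset = [yi for xi, yi in zip(x, y) if xi == value]
        h += (count / total) * entropy(subset)
    return h


def information_gain(x, y) -> float:
    return entropy(y) - conditional_entropy(y, x)


def symmetrical_uncertainty(x, y) -> float:
    denom = entropy(x) + entropy(y)
    return 2.0 * information_gain(x, y) / denom if denom > 0 else 0.0


def gain_ratio(x, y) -> float:
    hx = entropy(x)
    return information_gain(x, y) / hx if hx > 0 else 0.0


def chi_squared(x, y) -> float:
    """Chi-squared statistic between a discretized feature x and the class y."""
    R = len(x)
    R_i, B_j, A = Counter(x), Counter(y), Counter(zip(x, y))
    stat = 0.0
    for i in R_i:
        for j in B_j:
            expected = R_i[i] * B_j[j] / R
            stat += (A[(i, j)] - expected) ** 2 / expected
    return stat


# Toy data: 'informative' matches the class exactly, 'useless' is independent of it.
labels = ["sit", "sit", "walk", "walk"]
informative = ["low", "low", "high", "high"]
useless = ["a", "b", "a", "b"]
```

On this toy data the informative feature reaches the maximum value of IG, SU and GR (1 bit, 1 and 1) while the independent feature scores 0 on every criterion.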
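Among the multivariate techniques, we considered two Relief-based selectors [68] and three SVM-based selectors [69]. In more detail, we chose:
• ReliefF (RF), which weights the features according to their ability to discriminate between neighboring instances of different classes. In the basic two-class formulation, for each randomly drawn instance $R_i$, the algorithm finds the nearest instance of the same class (nearest hit, H) and the nearest instance of the other class (nearest miss, M); a weight W(X) is then computed for each feature X, starting from an initial value W(X) = 0:
$$ W(X) = W(X) - \frac{diff(X, R_i, H)}{r} + \frac{diff(X, R_i, M)}{r} $$
where r is the number of randomly drawn instances and diff is a function that computes the difference between the value of X for $R_i$ and H as well as for $R_i$ and M. The underlying assumption is that a "good" feature should have the same value for data points of the same class and different values for data points of different classes. Such a binary formulation can be easily extended to deal with multi-class problems [68]. In turn, ReliefF-weighted (RFW) adopts a similar strategy but weights the neighbors by their distance.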
• SVM-AW, which relies on a linear SVM classifier, which has an embedded capability of assigning a weight to each feature based on how it contributes to the hyperplane decision function induced by the classifier. This function can indeed be written as follows:
$$ f(X) = W \cdot X + b $$
where X is the N-dimensional vector of input features, W is a weight vector, and b is a bias constant. The weight $W_j$ assigned to the jth feature can be interpreted as a measure of the strength of the feature; specifically, the SVM-AW algorithm considers the absolute value of this weight (AW) [69].
• SVM-RFE, which, in turn, exploits a linear SVM but adopts a recursive feature elimination strategy that iteratively removes the features with the lowest weights and repeats the weighting process on the remaining features. In our study, two versions of this approach are evaluated: RFE10, where the percentage p of features removed at each iteration is set to 10%, and RFE50, where this percentage is 50% (in the special case where p = 100%, SVM-RFE reduces to SVM-AW as no iteration occurs).
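The recursive elimination strategy can be sketched independently of the underlying classifier. In the sketch below, weight_fn is a hypothetical callback that stands in for retraining a linear SVM on the surviving features and returning their weights; the rounding rule for the number of features dropped per iteration is also an assumption, since the text does not specify it.

```python
from math import ceil
from typing import Callable, List, Sequence

# Given the indices of the surviving features, returns one weight per index.
# In the setting described above this would come from a linear SVM retrained
# on those features; here it is just a hypothetical callback.
WeightFn = Callable[[List[int]], Sequence[float]]


def rfe_ranking(n_features: int, weight_fn: WeightFn, p: float) -> List[int]:
    """Rank features (best first) by eliminating a fraction p of them per iteration."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    remaining = list(range(n_features))
    eliminated: List[int] = []  # filled from worst to best
    while remaining:
        weights = weight_fn(remaining)
        # Positions in 'remaining', sorted by increasing absolute weight.
        order = sorted(range(len(remaining)), key=lambda k: abs(weights[k]))
        n_drop = min(len(remaining), max(1, ceil(p * len(remaining))))
        dropped = [remaining[k] for k in order[:n_drop]]
        eliminated.extend(dropped)
        dropped_set = set(dropped)
        remaining = [j for j in remaining if j not in dropped_set]
    return eliminated[::-1]


# Toy callback with fixed weights, so every elimination rate yields the same ranking.
fixed = [0.5, -2.0, 0.1, 1.5, -0.3]


def toy_weights(indices: List[int]) -> List[float]:
    return [fixed[j] for j in indices]


ranking_aw = rfe_ranking(5, toy_weights, p=1.0)     # single pass, as in SVM-AW
ranking_rfe50 = rfe_ranking(5, toy_weights, p=0.5)  # 50% removed per iteration
ranking_rfe10 = rfe_ranking(5, toy_weights, p=0.1)  # 10% removed per iteration
```

With a real classifier the weights change after each retraining, so the three rankings would generally differ; the fixed toy weights only serve to check the elimination logic.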
As regards the computational complexity of the above techniques, it depends on both the number of features (N) and the number of instances (R). In particular, it can be shown that the number of operations is of the order of N · R for the univariate approaches [70], while the multivariate approaches have a higher computational cost. Indeed, in the worst case, the number of operations is of the order of N · R² for the Relief-based methods while it is of the order of max(N, R) · R² for SVM-RFE. However, efficient implementations exist that optimize the nearest-neighbor calculations involved in ReliefF as well as the kernel-matrix calculations involved in SVM-RFE [71].
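The quadratic dependence on R for the Relief-based methods comes from the nearest-neighbor search performed for every drawn instance. The sketch below illustrates this with the basic two-class Relief update (a single nearest hit and a single nearest miss), not the full ReliefF algorithm; the range-normalized diff function, the Manhattan-style distance and the toy data are assumptions made for illustration.

```python
import random
from typing import List, Sequence


def relief_weights(X: Sequence[Sequence[float]], y: Sequence[int],
                   r: int, seed: int = 0) -> List[float]:
    """Basic two-class Relief: one nearest hit and one nearest miss per draw."""
    rng = random.Random(seed)
    n_features = len(X[0])
    # Feature ranges, used to normalise diff() into [0, 1].
    spans = []
    for j in range(n_features):
        column = [row[j] for row in X]
        spans.append((max(column) - min(column)) or 1.0)

    def diff(j: int, a: Sequence[float], b: Sequence[float]) -> float:
        return abs(a[j] - b[j]) / spans[j]

    def distance(a: Sequence[float], b: Sequence[float]) -> float:
        return sum(diff(j, a, b) for j in range(n_features))

    W = [0.0] * n_features
    for _ in range(r):
        i = rng.randrange(len(X))
        # Linear scan over all R instances, each costing N operations:
        # with r = R draws this gives the N * R^2 worst case.
        hits = [k for k in range(len(X)) if k != i and y[k] == y[i]]
        misses = [k for k in range(len(X)) if y[k] != y[i]]
        H = X[min(hits, key=lambda k: distance(X[i], X[k]))]
        M = X[min(misses, key=lambda k: distance(X[i], X[k]))]
        for j in range(n_features):
            W[j] += (diff(j, X[i], M) - diff(j, X[i], H)) / r
    return W


# Toy data: feature 0 separates the two classes, feature 1 does not.
X = [[0.0, 0.1], [0.1, 0.9], [0.2, 0.5], [0.9, 0.2], [1.0, 0.8], [0.8, 0.4]]
y = [0, 0, 0, 1, 1, 1]
W = relief_weights(X, y, r=6)
```

The discriminative feature receives a positive weight and the other a negative one, in line with the assumption that a good feature takes similar values within a class and different values across classes.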
Furthermore, note that some of the above techniques (χ², IG, GR, SU, RF, and RFW), which only rely on the data's intrinsic characteristics, fall in the category of filter methods, while others (OR, SVM-AW, RFE10, and RFE50) leverage the features' weights derived by a suitable classifier and can thus be categorized as embedded methods. Irrespective of the specific algorithm used to derive the features' weights, the final selection is carried out according to the approach shown in Figure 1, i.e., by retaining a number n of top-ranked features; this number is varied in our experiments to cover different percentages of selected features, from 5% to 90%.
As learning algorithms to induce the activity recognition models, we chose, after a series of preliminary experiments, an SVM classifier with a polynomial kernel of degree 2 and a Random Forest classifier, which proved to be suitable options for the considered benchmarks (described in Section IV). For both classifiers, as well as for the different selection methods, we leveraged the implementations provided by the WEKA machine learning library [72].
In more detail, we trained the SVM classifier using the well-known Sequential Minimal Optimization (SMO) algorithm [73]. For the Random Forest classifier, we relied on 100 unpruned trees, each built using log₂(n) + 1 random features at the splitting stage, according to commonly adopted settings [74]. The WEKA ChiSquaredAttributeEval, InfoGainAttributeEval, SymmetricalUncertAttributeEval, GainRatioAttributeEval, and OneRAttributeEval were used to implement the univariate methods χ², IG, SU, GR, and OR, respectively. For the Relief-based methods, we employed the ReliefFAttributeEval function, with and without instance weighting (for RFW and RF respectively). Finally, for the SVM-based selection methods, i.e., SVM-AW and SVM-RFE, we exploited the SVMAttributeEval function, properly setting the percentage of features to be removed at each iteration. Each of these feature weighting functions was used in conjunction with the Ranker search method that allows selecting a specified number of top-ranked features.

IV. DATASETS
For our comparative study, we used five datasets meant for mobile human activity recognition. Specifically, one of these datasets contains data acquired from a fitness watch and a mobile phone, three of them contain smartphone sensor data, and the last one contains body-worn sensor data. In each dataset, data instances are labeled with the corresponding activities, which are primarily sport/fitness activities and activities of daily living. The high-level characteristics of these experimental benchmarks are summarized in Table 1.
The COSAR dataset [33], [75] was collected in experiments concerning the recognition of 10 different activities: brushing teeth, climbing up and down, riding a bicycle, standing still, jogging and strolling, walking downstairs and upstairs, writing on a blackboard. These activities were carried out by 6 volunteers wearing two accelerometers: the first one located inside the left pocket and the second one on the right wrist. In addition, a GPS receiver in the left pocket tracked the person's location. Overall, the dataset consists of 5 hours of activity data, sampled at a frequency of 16 Hz, and each activity instance has a time extension of 1 second, for a total of 18000 instances (divided into 13500 training records and 4500 test records, as reported in Table 1).
The second dataset (HAR) [34] was collected from 30 volunteers wearing a smartphone on the waist, as described in [76]. Inertial data were acquired from the smartphone's 3-axial accelerometer and 3-axial gyroscope at 50 Hz. Each subject carried out six activities: walking, walking upstairs and downstairs, sitting, standing and laying. Raw data were filtered to remove noise and segmented using a 50% overlapping sliding window of 2.56 seconds, resulting in a total of 10299 instances (partitioned into 70% training data and 30% test data).
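The segmentation of a 50 Hz signal into 2.56-second windows with 50% overlap corresponds to windows of 128 samples that advance by 64 samples. A minimal sketch of such a segmentation is given below; the function name and the synthetic signal are hypothetical, and the noise filtering applied in the original study is not reproduced.

```python
from typing import List, Sequence


def sliding_windows(signal: Sequence[float], sampling_hz: float,
                    window_s: float, overlap: float) -> List[List[float]]:
    """Split a 1-D signal into fixed-length windows with fractional overlap."""
    size = round(window_s * sampling_hz)   # samples per window
    step = round(size * (1.0 - overlap))   # samples between window starts
    if size <= 0 or step <= 0:
        raise ValueError("window size and step must be positive")
    # Trailing samples that do not fill a whole window are discarded.
    return [list(signal[s:s + size]) for s in range(0, len(signal) - size + 1, step)]


# Ten seconds of a synthetic single-axis signal sampled at 50 Hz.
signal = [float(t) for t in range(500)]
windows = sliding_windows(signal, sampling_hz=50, window_s=2.56, overlap=0.5)
```

Each window then becomes one instance once the feature functions are applied to it.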
The third dataset (HAR_ALL) [35] is an extension of the HAR dataset described above; indeed, it was obtained by carrying out similar experiments and contains the same activities, as described in [77]. Thirty subjects participated in the experiments, resulting in a collection of 5744 records.
In turn, the HAPT dataset [36] is an updated version of the HAR dataset that contains an extended set of activities, including postural transitions: walking, walking upstairs and downstairs, sitting, standing and laying, stand-to-sit and stand-to-lie, sit-to-stand and sit-to-lie, lie-to-sit and lie-to-stand. This benchmark, described in detail in [14], is considered especially challenging due to the highly imbalanced activity distribution (given the lower frequency of postural transitions).
The last dataset we considered (DSA) [37] has previously been used in various research works including [78] and [79]. It contains 19 daily and sport activities: sitting, standing and lying on back and on right side, ascending and descending stairs, standing and moving around in an elevator, walking in a parking lot and on a treadmill (in flat and 15° inclined positions), running on a treadmill, exercising on a stepper and on a cross trainer, cycling on an exercise bike in horizontal and vertical positions, rowing, jumping and playing basketball. The activities were carried out by 8 subjects and the total duration of each activity was 5 minutes per subject. Data were acquired at a sampling rate of 25 Hz, using five body-worn orientation trackers, each containing a 3-axial accelerometer, a 3-axial gyroscope and a 3-axial magnetometer. The sensor signals were divided into 5-second segments, yielding a total of 9120 instances (480 per activity).
As we can see in Table 1, the five considered benchmarks have quite different dimensionalities, due to the different numbers of sensors involved as well as to the different sets of high-level features computed for each sensor signal (after segmenting it into time windows). Specifically, for each window, feature vectors were computed in the time and frequency domains (the Fast Fourier Transform was applied to the sensor signals to obtain their frequency-domain representation). Several statistical measures were computed for each window: some of them, e.g., mean, standard deviation, kurtosis, and min/max values, were extracted for all five datasets, while others were used only in some of them. Note that we maintained the feature definitions employed in the original studies in which the datasets were published, in order to avoid introducing any bias and to make the experiments fully repeatable. See Appendix A for more details on the extracted features.
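As a rough illustration of the windowing and time-domain feature extraction described above, the following sketch segments a one-dimensional signal into overlapping windows and computes the measures shared by all five datasets. It is only a sketch under stated assumptions: the function names are hypothetical, and the population (divide-by-M) forms of the standard deviation and kurtosis are assumed here, whereas the exact definitions used in the original studies are those reported in Table 4.

```python
def sliding_windows(signal, size, overlap=0.5):
    """Yield fixed-size windows with the given fractional overlap."""
    step = max(1, int(size * (1 - overlap)))
    for start in range(0, len(signal) - size + 1, step):
        yield signal[start:start + size]

def window_features(window):
    """Mean, standard deviation, kurtosis, min and max of one window."""
    m = len(window)
    mean = sum(window) / m
    var = sum((v - mean) ** 2 for v in window) / m  # assumed: population variance
    std = var ** 0.5
    # Assumed: non-excess kurtosis (fourth standardized moment); 0.0 if the window is flat.
    kurt = (sum((v - mean) ** 4 for v in window) / m) / var ** 2 if var > 0 else 0.0
    return {"mean": mean, "std": std, "kurtosis": kurt,
            "min": min(window), "max": max(window)}

# Toy signal; a HAR-like setting would use size=128 (2.56 s at 50 Hz), overlap=0.5.
toy_signal = list(range(8))
features = [window_features(w) for w in sliding_windows(toy_signal, size=4)]
```

Each sensor axis would be processed in this way, and the resulting per-window values concatenated into one feature vector per instance.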

V. EXPERIMENTAL ANALYSIS
Leveraging the activity recognition benchmarks described in the previous section, we conducted an extensive experimental study aimed at investigating the extent to which the original feature space can be reduced without degrading the final predictive performance. The overall analysis, in terms of both capacity of discriminating the classes and selection stability, has been carried out for different levels of dimensionality reduction, according to the methodological framework detailed in Section III. The main experimental results are summarized in Figures 2, 3 and 4, each containing ten charts that compare the behavior of the considered selection methods.
Specifically, Figure 2 and Figure 3 show a comparison in terms of F-score, which is a performance metric widely employed in activity recognition tasks. Defined as the harmonic mean between the model sensitivity (i.e., the fraction of positive instances classified correctly) and the model precision (i.e., the fraction of correct predictions among all the instances assigned to the positive class), the F-score takes both the false positives and the false negatives into account, providing a reliable estimate of the model's ability to recognize a given class (considered as positive). By measuring the average F-score across the different classes, we obtained an overall evaluation of the recognition performance of the induced models. For model induction, as anticipated in Section III-C, we employed both the SMO (Figure 2) and Random Forest (Figure 3) classifiers, in conjunction with ten different selection methods.
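The definition above can be made concrete with a few lines of code. The sketch below computes the per-class F-score from true and predicted labels and averages it over the classes; it assumes an unweighted (macro) average, which may differ from the exact averaging scheme used to produce the figures.

```python
def macro_f_score(y_true, y_pred):
    """Average the per-class F-score over all classes present in y_true."""
    classes = sorted(set(y_true))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        sensitivity = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + sensitivity
        # Harmonic mean of precision and sensitivity for class c.
        scores.append(2 * precision * sensitivity / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy example with two activities (labels are illustrative).
y_true = ["walk", "walk", "sit", "sit"]
y_pred = ["walk", "sit", "sit", "sit"]
score = macro_f_score(y_true, y_pred)
```

In this toy example the "walk" class has precision 1 and sensitivity 0.5, while "sit" has precision 2/3 and sensitivity 1, so the two per-class F-scores are 2/3 and 4/5.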
More in detail, for both the univariate (χ², IG, SU, GR, OR) and the multivariate (RF, RFW, RFE10, RFE50, SVM-AW) selection techniques, Figures 2 and 3 show the F-score trend for different percentages of selected features (the abscissa 100 corresponds to the model induced on the whole dataset, without any dimensionality reduction). In absolute terms, the COSAR dataset is the one where both SMO and Random Forest achieve the lowest recognition performance, due to the intrinsic difficulty of discriminating multiple activities using features extracted only from accelerometer signals. In the other four datasets, where we have a multi-sensor scenario (as detailed in Table 1), the average F-score is generally better, sometimes above 0.9, with quite similar trends for the two classifiers.
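To illustrate how a ranking-based method yields feature subsets of different sizes, the sketch below ranks features by Information Gain (IG) and retains a given percentage of them. It is a minimal illustration, not the implementation used in the experiments: it assumes that feature values are already discrete (continuous sensor features would first need to be discretized), and the function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(column, labels):
    """IG of a discrete feature: H(class) - H(class | feature)."""
    n = len(labels)
    groups = {}
    for v, y in zip(column, labels):
        groups.setdefault(v, []).append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

def top_percent(rows, labels, percent):
    """Indices of the top `percent`% features, ranked by decreasing IG."""
    n_features = len(rows[0])
    scores = [information_gain([r[j] for r in rows], labels)
              for j in range(n_features)]
    ranking = sorted(range(n_features), key=lambda j: -scores[j])
    keep = max(1, round(n_features * percent / 100))
    return ranking[:keep]

# Toy data: feature 0 determines the class, feature 2 is partly informative.
rows = [(0, 5, 0, 0), (0, 5, 0, 1), (1, 5, 0, 0), (1, 5, 1, 1)]
labels = [0, 0, 1, 1]
selected = top_percent(rows, labels, 50)
```

Because the features are ranked only once, subsets for any percentage can be obtained by simply cutting the same ranking at different thresholds.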
But the most interesting observation, for the purpose of our study, is that no significant degradation in performance is observed when the original dimensionality is reduced, regardless of the specific characteristics of the dataset at hand. In particular, 10-20% (or even less) of the original features may be sufficient, for some selection methods, to obtain recognition performances comparable to those achieved using the whole dataset. This reveals the appropriateness of introducing a suitable feature selection step into any activity recognition protocol in order to simplify the final models and make them more efficient.
When comparing the different selection techniques, the multivariate approaches, which can capture the interdependencies among the features, turn out to be overall more effective in terms of F-score. In particular, among the SVM-based methods, RFE10 and RFE50 sometimes have slightly superior performances but at a higher computational cost than the simpler SVM-AW, whose behavior is still satisfactory. As regards the Relief-based multivariate approaches (RF and RFW), quite good performance can be obtained with at most 20% of the features. Among the univariate methods, on the other hand, SU and GR overall exhibit the worst behavior, with an unsatisfactory performance for some datasets, especially when small feature subsets are selected. For the other univariate approaches, i.e., χ², IG and OR, feature subsets containing (at most) 20% of the features turn out to be sufficient to obtain quite good F-score values.
Overall, the strong potential of feature selection in this domain is witnessed by both Figures 2 and 3, despite some small differences between the curves reported for the SMO and Random Forest classifiers. However, as recognized by recent literature in the feature selection field, e.g., [31], [39], [61], [62], a good selection method should not only be effective (in terms of final predictive performance) but also as stable as possible, so that the selected subsets do not depend too heavily on the specific composition of the training data, which would make them less useful in future applications of the model.
The results of the stability analysis we have performed on the five considered benchmarks (COSAR, HAR, HAR_AAL, HAPT, DSA) are shown in Figure 4, where the stability trend is reported for different percentages of selected features, according to the methodology detailed in Section III-B; specifically, the methodology was implemented here with the settings K = 20 and f = 0.80, which have proven suitable in similar studies [31], [39].
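A stability estimate of this kind can be sketched as follows: the selection method is applied to K reduced versions of the training set, each containing a fraction f of the original instances, and the resulting feature subsets are compared in pairs. The code below is only an illustration under explicit assumptions: the Jaccard index is used as the similarity measure and the subsamples are drawn without replacement, which may not coincide with the exact choices of Section III-B; all names are hypothetical.

```python
import random
from itertools import combinations

def jaccard(a, b):
    """Overlap between two feature subsets (1.0 = identical)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def selection_stability(rows, labels, select, k=20, f=0.80, seed=0):
    """Average pairwise similarity of the subsets selected on k subsamples.

    `select(rows, labels)` is any function returning selected feature indices.
    """
    rng = random.Random(seed)
    n = len(rows)
    size = max(1, int(f * n))
    subsets = []
    for _ in range(k):
        idx = rng.sample(range(n), size)  # assumed: sampling without replacement
        subsets.append(select([rows[i] for i in idx], [labels[i] for i in idx]))
    pairs = list(combinations(subsets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Toy usage: a selector that ignores the data is perfectly stable.
rows = [(i, i % 2) for i in range(10)]
labels = [i % 2 for i in range(10)]
stability = selection_stability(rows, labels, lambda r, y: [0, 1])
```

Values close to 1 indicate that almost the same features are selected regardless of which instances are left out, whereas values close to 0 indicate a strong dependence on the composition of the training data.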
A first point to highlight regarding Figure 4 is that the differences observed in terms of stability are generally higher than those observed in terms of F-score (Figure 2 and Figure 3), revealing that some selection methods systematically exhibit a more stable behavior across the different datasets. In particular, χ² and IG turn out to be the most stable of the univariate approaches, followed by OR, while SU and GR have shown significantly lower stability in some datasets; GR, in particular, appears to be the least robust method in the univariate group. Among the multivariate approaches, on the other hand, the Relief-based methods (i.e., RF and RFW) have proven to be very stable, with similar trends for all levels of dimensionality reduction. Conversely, the SVM-based methods appear to be less robust, especially RFE10 and RFE50, whose stability is always lower than that of the simpler SVM-AW.
Overall, the univariate χ², IG and even OR show a quite good trade-off between F-score and stability and may therefore be an option to consider when inducing activity recognition models from mobile sensor datasets like those considered in our study. Likewise, the multivariate Relief-based approaches exhibit a good behavior when jointly considering both predictive performance and stability, at least for subsets containing more than 10% of the original features. It is worth mentioning, nevertheless, that the least stable methods could be made significantly more robust when implemented in a bootstrap-based ensemble version [39], which leaves room for improvement for some selection techniques (but at the expense of a higher computational cost of the feature selection process).
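The idea of a bootstrap-based ensemble can be sketched as follows: the same ranker is run on several bootstrap samples of the training data, and the resulting rankings are merged into a single one. The sketch assumes that rankings are aggregated by mean position, which is one common choice and not necessarily the scheme adopted in [39]; the function name and its parameters are hypothetical.

```python
import random

def ensemble_ranking(rows, labels, rank, n_bootstraps=50, seed=0):
    """Aggregate the rankings obtained on bootstrap samples by mean position.

    `rank(rows, labels)` returns feature indices ordered from best to worst.
    """
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    position_sum = [0] * n_features
    for _ in range(n_bootstraps):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        ranking = rank([rows[i] for i in idx], [labels[i] for i in idx])
        for position, feature in enumerate(ranking):
            position_sum[feature] += position
    # A lower summed position means the feature is ranked higher on average.
    return sorted(range(n_features), key=lambda j: position_sum[j])

# Toy usage with a ranker that always returns the same order.
rows = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
labels = [0, 1, 0]
merged = ensemble_ranking(rows, labels, lambda r, y: [2, 0, 1])
```

The base ranker is executed once per bootstrap sample, which is where the additional computational cost mentioned above comes from.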
Based on a wide set of experiments, the above results consolidate the findings of our preliminary study in this field [30], showing that feature selection can be effectively exploited to reduce the dimensionality of mobile sensor datasets, leading to robust and more efficient recognition models.

VI. DISCUSSION
The experimental analysis here presented complements and extends the findings of recent comparative studies conducted on similar activity recognition benchmarks [23], [29], [44], [45], providing stronger evidence of the suitability of the filter selection paradigm in this field.
Specifically, [23] discusses the advantages of leveraging classifier-independent selection approaches that can identify feature subsets conveying useful information for targeted populations and applications, regardless of the chosen classifier and the specific implementation settings. Compared to our work, the dataset considered in their study is relatively low-dimensional, with only seventy-six features, which allows efficiently using subset-oriented filters (CFS, FCBF), along with a multivariate ranker (ReliefF), while the ranking-based approach is usually preferred in the presence of higher dimensionalities [44], [45]. Both subset-oriented filters (CFS, MRMR) and ranking-based filters (Information Gain, Gain Ratio, Symmetrical Uncertainty, ReliefF), along with wrapper and embedded methods, are also evaluated in [29], using a single sensor dataset involving 206 attributes (both time-domain and frequency-domain features), with empirical evidence of the effectiveness of the simpler univariate rankers in yielding good feature subsets for HAR models.
Through wider experiments on five high-dimensional benchmarks, our study has presented a more in-depth evaluation of the ranking-based selection approach, with both univariate and multivariate techniques, showing that they can be effectively employed in conjunction with different classifiers (see Figures 2 and 3). Actually, besides filters that inherently perform a classifier-independent selection, our experiments also encompass ranking methods that leverage some learning algorithm to compute the features' weights, as in the case of the SVM-based selectors. The final subsets produced by these methods, of the embedded category, are often used like those produced by filters, i.e., as a reduced feature space to train any potentially suitable classifier. However, as shown in Figure 4, this kind of ranker has proved to be overall less stable.
Our study has also shown that the adopted ranking-based approach makes it possible to control the level of dimensionality reduction in a fine-grained way, in order to find the optimal trade-off between the number of selected features and the resulting classification performance. Indeed, as discussed in Section V, the original dimensionality can be reduced to a very large extent, with a final recognition performance similar to that achieved with the full feature set.
To provide concrete examples of such a dimensionality reduction, Tables 2 and 3 show the SMO classifier's F-score (averaged across the different classes, as in Figures 2 and 3) for the two multi-sensor benchmarks with the highest numbers of features/classes. Specifically, besides the full feature set, three feature subsets with smaller cardinality are considered, as selected by one of the univariate rankers that have shown the best trade-off between predictive performance and stability, namely χ² (which also has a reduced computational cost compared to the multivariate ranking methods). As we can see, in the HAPT dataset (Table 2), only 15% of the original features are sufficient to obtain a final performance comparable to that achieved using the whole feature set, while 10% of the features turn out to be as predictive as the whole feature set in the DSA dataset (Table 3). Furthermore, interesting insight can be gained into the extent to which each sensor type contributes to the optimal set of features, with clear evidence of the high discriminative power of the features extracted from the accelerometer signals (which, however, are not sufficient on their own to obtain the highest F-score values). This kind of analysis can leverage different ranking methods, as previously observed in Figures 2 and 3, and allows the design of fast-response recognition systems where only a reduced set of highly discriminative features needs to be computed at operation time.
Overall, our results clearly show how beneficial feature selection can be in sensor-based activity recognition. However, some limitations exist in the current study that will be addressed in further investigations. Indeed, the explored ranking-based selection approach, although effective and generally more efficient than subset-oriented selection methods, may be sub-optimal in the presence of some degree of redundancy among the features. Hence, the potential impact of feature redundancy in this field should be better analyzed, for example by exploring hybrid selection strategies that first reduce the data dimensionality through a simple and efficient ranker and then further refine the resulting subset through a more sophisticated search strategy that can remove highly correlated features [38]. But this would increase the computational cost of the selection process, requiring careful cost-benefit analysis. Although some recent works have applied a hybrid selection approach in the HAR field [24], [49], [54], there is a lack of comparative studies that investigate the effects of different hybrid strategies in terms of final recognition performance as well as selection stability and computational efficiency.
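A two-stage hybrid strategy of the kind outlined above could look like the following sketch, where a precomputed ranking is first cut at a given size and the surviving features are then filtered for redundancy. This is a deliberately simple stand-in for the more sophisticated search strategies discussed in [38]: it assumes a greedy filter based on the absolute Pearson correlation, and the threshold value and function names are illustrative.

```python
def pearson(x, y):
    """Pearson correlation between two equally long numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0

def hybrid_select(rows, ranking, top_k, max_corr=0.9):
    """Keep the top_k ranked features, then drop those highly correlated
    with a better-ranked feature that has already been retained."""
    candidates = ranking[:top_k]
    columns = {j: [r[j] for r in rows] for j in candidates}
    kept = []
    for j in candidates:
        if all(abs(pearson(columns[j], columns[i])) < max_corr for i in kept):
            kept.append(j)
    return kept

# Toy data: feature 1 is an exact multiple of feature 0, hence redundant.
rows = [(1, 2, 5), (2, 4, 1), (3, 6, 4), (4, 8, 2)]
kept = hybrid_select(rows, ranking=[0, 1, 2], top_k=3)
```

The second stage requires pairwise comparisons among the retained features, so its cost grows with the size of the subset produced by the ranker, which is one aspect that a cost-benefit analysis would need to quantify.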

VII. CONCLUDING REMARKS
Recent advances in mobile sensor technology have opened up growing opportunities for HAR applications in a variety of fields, from personal wellness to health monitoring. However, as data collection becomes easier and easier, fully exploiting the available data to build reliable models for activity recognition remains challenging. Further, when dealing with wearable devices, the issues of limited resources and energy consumption cannot be neglected, posing additional requirements in terms of efficiency of the induced models. In such a perspective, feature selection may have an important role since it can significantly reduce the data dimensionality by removing irrelevant and noisy features.
In this paper, we have explored the potential of feature selection in the HAR field, leveraging five public mobile sensor datasets with heterogeneous characteristics and relying on a methodological framework that considers both the predictive power and the stability of the selected feature subsets (i.e., their robustness to perturbations in the input data). Ten different selection methods have been comparatively evaluated, revealing the suitability of univariate ranking approaches, such as Chi Squared and Information Gain, as well as of some multivariate approaches, such as ReliefF, which exhibit a quite good trade-off between recognition performance and selection stability. Encompassing different levels of dimensionality reduction, our analysis has shown that the original number of features can be reduced to a very large extent in all the considered benchmarks, without any significant worsening of performance.
Such a large comparative study gives evidence of the potential benefits of systematically exploiting feature selection when inducing HAR models, also providing methodological insights into how to evaluate the robustness of the selection outcome and choose the optimal level of dimensionality reduction. This line of research is worthy of further investigation, given the lack of comparative studies of this kind in the field.
As future work, our study will be enlarged to better investigate the extent to which the behavior and the stability pattern of a given selection algorithm may depend on the underlying feature extraction strategy (i.e., the type and number of features extracted for each sensor). Furthermore, the ranking-based selection framework here adopted will be compared with different and more sophisticated methodological approaches, including hybrid techniques that exploit different heuristics at different stages of the selection process and ensemble techniques that suitably combine the outcome of different selectors.

APPENDIX A FEATURE DESCRIPTION
In this appendix, our aim is to give an overview of the statistical measures that were computed to build the feature vectors of the considered datasets (i.e., COSAR, HAR, HAR_AAL, HAPT, DSA).
All the measures are listed in Table 4, with the corresponding formula. Note that the third column of the table cites the datasets that use the measure and, in brackets, the number of times it was calculated.
In each formula of the table, the variable x represents the signal. Depending on the dataset used and the feature calculated, this signal x can be understood as an acceleration, an inclination, an angular velocity, a magnitude signal, or a Jerk signal (that is the first time derivative of the acceleration).
As mentioned in Section IV, the signals were segmented into sliding windows for feature extraction; specifically, all the measures reported in Table 4 are calculated on windows containing a number of observations denoted by M .
It should also be considered that the sensors used in the experiments (accelerometers, magnetometers, and gyroscopes) acquire the three spatial components of each signal; thus, in some measures, x can represent the component on the X, Y or Z axes of the considered signal. Other features, such as correlation and covariance, are calculated using two or more components on the axes; therefore, in the formulas it has been necessary to differentiate these components, indicating with x₁, x₂ and x₃ the components on the X, Y and Z axes respectively. Some other features, such as median and percentiles, are calculated using the window signal values sorted in ascending order, so we have indicated the ordered values with a distinct symbol derived from x.
As already pointed out in Section IV, the feature vectors were computed in both the time domain and the frequency domain; some measures, e.g., entropy, were calculated in the frequency domain; consequently, in Table 4, the frequency signal is indicated with the capital letter X, to distinguish it from the signal x in the time domain.
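To give an idea of what a frequency-domain measure involves, the sketch below transforms a window x into its frequency representation X and computes an entropy value from it. The formula is an assumption made only for illustration (Shannon entropy of the normalized magnitude spectrum); the definition actually used for each dataset is the one listed in Table 4. A naive discrete Fourier transform is used instead of an FFT to keep the example dependency-free.

```python
import cmath
import math

def dft_magnitudes(x):
    """Magnitudes |X[k]| of the discrete Fourier transform (naive O(M^2))."""
    m = len(x)
    return [abs(sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / m)
                    for n in range(m)))
            for k in range(m)]

def spectral_entropy(x):
    """Assumed definition: Shannon entropy (bits) of the normalized spectrum."""
    mags = dft_magnitudes(x)
    total = sum(mags)
    if total == 0:
        return 0.0
    probs = [v / total for v in mags]
    return -sum(p * math.log2(p) for p in probs if p > 0)

# An impulse spreads its energy evenly over all M = 4 frequency bins.
impulse_entropy = spectral_entropy([1, 0, 0, 0])
# A constant window concentrates (almost) everything in the zero-frequency bin.
constant_entropy = spectral_entropy([1, 1, 1, 1])
```

Under this assumed definition, a window whose spectrum is spread evenly over the M bins reaches the maximum value log2(M), while a window dominated by a single frequency yields a value close to zero.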
MARCO MANOLO MANCA obtained a Master's degree in Mathematics in 2019 from the University of Cagliari (Italy). He is currently pursuing the Ph.D. degree in Computer Science at the University of Cagliari and cooperating with the CRS4 research center.
His current research interest is machine learning applied to signal analysis and smart energy systems.
BARBARA PES (Member, IEEE) is a Researcher (permanent position) at the Department of Mathematics and Computer Science of the University of Cagliari (Italy), where she has taught Foundations of Computer Science, Database and Data Mining courses.
She has participated in several research projects on web-based information systems, service-oriented architectures, data integration, high-dimensional data analysis, and bioinformatics. Currently, her main research interests are in the fields of data mining and machine learning, classification of high-dimensional data, and advanced feature selection methods. She is the author/co-author of more than 80 papers published in international conferences, books, and journals.
DANIELE RIBONI is associate professor of Computer Science at the University of Cagliari. His research interests are mainly focused on context-awareness, activity recognition, knowledge management and privacy issues in pervasive and mobile computing. He served as TPC chair and TPC vice chair for different conferences and workshops in the field, including IEEE PerCom and the International Conference on Intelligent Environments (IE). His contributions appear in major conferences and journals.