A Comprehensive Survey on the Process, Methods, Evaluation, and Challenges of Feature Selection

Feature selection is employed to reduce feature dimensionality and computational complexity by eliminating irrelevant and redundant features. The ever-increasing volume of data and its processing generate many feature sets, which the feature selection process reduces to improve the performance of classification, regression, and clustering models. This study performs a detailed analysis of the motivation for feature selection and concentrates on its fundamental architecture. It aims to establish a structured view of the popular method families, namely filter, wrapper, and embedded methods, together with search strategies, evaluation criteria, and learning methods. The benefits and drawbacks of the different methods are compared, followed by common classification algorithms and standard validation measures. The diversity of applications across multiple domains, such as data retrieval, prediction analysis, and medical, intrusion-detection, and industrial applications, is highlighted. This study also covers additional feature selection methods for handling big data. New challenges have surfaced in the analysis of such data, and these are addressed as well. Reflecting on commonly encountered challenges and clarifying how to choose the most suitable feature selection method are the principal contributions of this study.

accuracy. The most commonly used dimensionality reduction processes are feature extraction (FE) and feature selection (FS). A feature is a unique quantified characteristic of the observation process. Not all features are required to extract relevant information from datasets. Several features may be redundant or irrelevant for various machine learning, deep learning, and data science approaches. Some may mislead clustering results, thus decreasing the quality of the model. In such instances, selecting a subset of the original features will almost always result in improved performance. Feature selection algorithms in supervised learning optimize some function of predicted accuracy. Unsupervised learning, on the other hand, lacks class labels and runs the risk of retaining all or only a subset of significant attributes. Limiting the number of features also improves readability.

Whether Class 1 remains separable from Classes 2 and 3 from the perspective of Feature 1 alone is debatable. In two-dimensional space, the combination reveals that each of the three classes is easily distinguishable. Combining the three features allows for even easier differentiation of the classes in the three-dimensional space depicted in the lower-left section of Figure 2. However, the three-dimensional space is not required: all three classes are already separable in two-dimensional space when a single dimension is insufficient. Using two features rather than three is thus an example of both dimensionality reduction and feature selection. Furthermore, the motivations and goals of feature selection are purposefully made more visible here. A brief review of feature selection's goals points to reducing computational complexity and, as a result, improving system performance parameters such as accuracy. It also aims to reduce large dimensionality, in which some dimensions of some instances interfere with each other and degrade performance, to extract meaningful rules from the classifier, and to remove redundant features to reduce complexity. Furthermore, in some cases, these feature reduction challenges or activities can be framed as classification, clustering, regression, or search-strategy problems. These activities have developed very recently with an increasing number of feature selection studies. However, they started with a regression problem that marks the beginning of FS history. In 1924, R. A. Fisher introduced a trial of variable selection for regression while discussing an article [1] presented by A. J. Miller to the Royal Statistical Society. Later, in the 1940s, with limited computing power available, the trial saw some advancements. A study on the rationale for variable selection by Hotelling [2] illuminated previous approaches to this problem. Advancements in computing power in the early 1960s provided significant impetus for research in this area. The majority of early research was conducted by statisticians and focused on linear regression, such as Hocking [3], who conducted a literature review on variable selection for linear regression. Variable selection research has since expanded to include classification and clustering issues as well.

• Dependent criteria: A dependent criterion assumes that the features are dependent on one another, and a specific mining algorithm is used to evaluate the criterion.
The performance of the mining algorithm determines the quality of the feature subset. For a predefined mining algorithm, the dependent criterion typically outperforms an independent criterion. However, the selected feature subset may not be suitable for other mining techniques, and the computational cost is high. The forecasting accuracy on unseen instances is commonly used to identify a feature subset that yields high testing accuracy for classification problems [6].
• Stopping criteria: After the previous phase, the FS process requires a stopping criterion [4]. A suitable stopping criterion reduces the time taken to locate the best feature subset and eliminates over-fitting. The decisions made in the preceding steps influence the selection of the stopping criterion. The following are among the most regularly used stopping criteria.

• Optimal feature set: A subset of a specified feature set is the optimal feature set. The optimal subset minimizes a user-defined cost function (information- or performance-related, depending on the application). The optimal feature set reduces the number of inadvertently selected features by half while maintaining constant true-positive rates. It is more efficient at selecting appropriate variables, resulting in a model that is more straightforward, understandable, and accurate.
• Result validation: The results must be unambiguously validated. Experimenting with the entire feature set, rather than just a subset, is a common strategy. To validate the results, the efficiency before and after the feature selection trials is compared. Cross-validation [7], [8], the confusion matrix [9], Jaccard similarity-based measures [10], the Rand index [11], and other validation methods have been widely used.
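For illustration, a minimal Python sketch of this before-and-after comparison is given below; the breast-cancer dataset, the logistic regression classifier, and the choice of ten mutual-information-ranked features are illustrative assumptions, not prescriptions from the survey.

```python
# Minimal sketch: compare cross-validated accuracy before and after
# feature selection. Dataset, classifier, and k=10 are illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

full_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))
selected_model = make_pipeline(
    StandardScaler(),
    SelectKBest(mutual_info_classif, k=10),  # keep the 10 highest-scoring features
    LogisticRegression(max_iter=5000),
)

# 5-fold cross-validation on the full and the reduced feature sets
acc_full = cross_val_score(full_model, X, y, cv=5).mean()
acc_selected = cross_val_score(selected_model, X, y, cv=5).mean()
print(f"all 30 features: {acc_full:.3f}, 10 selected features: {acc_selected:.3f}")
```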
• Weighted: In weighted operators, the search space is continuous. All features are present in the solution to some extent, and a successor has a different weight from that of its parent state. This is typically accomplished by selecting from the available set of iterative instances.

• Random: The feature subset is constructed through a random search process, which involves repeatedly adding and removing features.

A search strategy can be implemented once the search direction is determined. Figure 4 depicts several search strategies, which can be classified into three categories: exponential algorithms [13], sequential algorithms [14], and randomized algorithms [15].

A. EXPONENTIAL ALGORITHM
Exponential algorithms evaluate a number of subsets that grows exponentially with the dimensionality of the search space; this is also known as complete search. The most widely utilized and representative algorithms in this category are discussed below.

1) EXHAUSTIVE ALGORITHMS
Exhaustive searches are NP-hard [16], and sub-optimal methods such as forward selection [17] start small and make additions to improve performance. The other frequently utilized method is backward selection [18], which starts with all features and removes them to improve performance. An exhaustive search, such as the forward selection method, begins by obtaining the best one-component subset of the input features. It then continues to search for the best two-component feature subset that can be composed of any combination of input features. It is also called a greedy algorithm because it tries every possible feature combination and chooses the best. Figure 5 illustrates the exhaustive search.

2) COMPLETE SEARCH
A complete search is a strategy to find a solution to a problem by traversing the entire search space. It ensures that an optimal result is obtained based on the evaluation criteria employed. The exhaustive search part of the exponential search is regarded as complete; however, the fact that a search is complete does not imply that it is exhaustive. Various heuristic functions can be used to narrow the search space without decreasing the probability of obtaining the best solution. Consequently, even though the order of the search space is O(2^N), fewer subsets are explored [19]. Two examples are branch and bound [13] and beam search [20].
• Branch and Bound (BnB): Branch and bound solves discrete and combinatorial optimization issues and mathematical optimization problems [21]. The algorithm investigates the components of the tree that are subsets of the optimal solution and is applied to determine the best solution [22]. Several studies [23], [24], [25], [26] used the BnB algorithm in their work.
• Beam Search: Beam search is a heuristic search strategy that expands the most promising node in a restricted collection to explore a graph. It is an optimization of breadth-first search [27].

B. SEQUENTIAL ALGORITHM
Sequential algorithms add or remove features sequentially. These algorithms tend to become trapped in local minima. Several sequential algorithms have been utilized for decades; some are discussed in the following sections.

1) SEQUENTIAL FORWARD SELECTION (SFS)
Sequential forward selection (SFS) is a technique in which features are sequentially added to an initially empty candidate set until adding further features no longer improves the criterion [28].
Sequential feature selection techniques, used to reduce an initial d-dimensional feature space to a k-dimensional feature subspace, belong to a family of greedy search algorithms. The goal is to select the subset of features most relevant to the task, resulting in optimal computational performance while reducing overfitting by removing irrelevant information. SFS performs best when the optimal subset has a small number of features. SFS has been utilized in several articles [29], [30], [31], [32].
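As a hedged illustration of SFS, the sketch below uses scikit-learn's SequentialFeatureSelector; the KNN estimator and the target of five features are arbitrary choices for demonstration, not part of the original method descriptions.

```python
# Illustrative sketch of sequential forward selection (SFS).
from sklearn.datasets import load_wine
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=3)

# Start from an empty candidate set and greedily add the feature that
# most improves cross-validated accuracy, stopping at 5 features.
sfs = SequentialFeatureSelector(knn, n_features_to_select=5,
                                direction="forward", cv=5)
sfs.fit(X, y)
print("selected feature indices:", sfs.get_support(indices=True))
```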

2) SEQUENTIAL BACKWARD SELECTION (SBS)
The sequential backward selection approach intends to reduce the dimensionality of the initial feature space from N to K features with a minimal reduction in system performance [33]. This improves computational efficiency and reduces overfitting. The main goal is to eliminate features from the provided list of N features one by one until the list of K features is reached. At each stage of the process, the feature that causes the least performance loss is removed. The approach is based on a combinatorial search, in which a subset of features from a combination is chosen; the score for the subset is calculated and compared with those of other subsets. Several studies have been conducted using the SBS algorithm [32], [34], [35], [36].

FIGURE 6. The search space identified by sequential forward searching; the thick lines narrow as the algorithm approaches the full feature set.

Differential evolution (DE) is a population-based metaheuristic algorithm that iteratively improves a proposed solution through an evolutionary process. The parameters of the procedure are stored as floating-point variables that change when an elementary mathematical operation is performed. During the mutation process, the modified parameter values are merged with the actual population vectors through a crossover procedure. These algorithms make few assumptions regarding the underlying optimization problem and can quickly explore enormous design spaces. The primary feature of standard DE is that it has three control parameters that must be adjusted. The trial vector generation scheme and the choice of control parameters significantly impact the effectiveness of DE in a specific optimization task [101]. To achieve good optimization results, a trial vector generation strategy is selected and the control parameters of the optimization process are tuned. Choosing appropriate control parameters is not always easy and can be time-consuming and difficult, especially in implementation. A flowchart illustrating differential evolution is shown in Figure 9, and a summary is presented in Table 1.
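To make the DE loop concrete, the following is a minimal sketch of differential evolution applied to feature selection. Thresholding the floating-point vectors at 0.5 to obtain a feature mask, the cross-validated KNN fitness, and the control-parameter values (population size 20, F = 0.8, CR = 0.9) are all illustrative assumptions rather than settings from the cited works.

```python
# Minimal DE/rand/1 sketch for feature selection.
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)
rng = np.random.default_rng(0)
n_pop, n_dim, F, CR, n_gen = 20, X.shape[1], 0.8, 0.9, 30

def fitness(vec):
    mask = vec > 0.5                      # decode float vector into a feature mask
    if not mask.any():
        return 0.0
    clf = KNeighborsClassifier(n_neighbors=3)
    return cross_val_score(clf, X[:, mask], y, cv=3).mean()

pop = rng.random((n_pop, n_dim))
scores = np.array([fitness(p) for p in pop])

for _ in range(n_gen):
    for i in range(n_pop):
        # Mutation: combine three distinct population vectors (DE/rand/1)
        a, b, c = pop[rng.choice([j for j in range(n_pop) if j != i], 3, replace=False)]
        mutant = np.clip(a + F * (b - c), 0.0, 1.0)
        # Crossover: mix mutant and parent coordinates with probability CR
        cross = rng.random(n_dim) < CR
        cross[rng.integers(n_dim)] = True  # guarantee at least one mutant gene
        trial = np.where(cross, mutant, pop[i])
        # Selection: keep the trial vector if it scores at least as well
        s = fitness(trial)
        if s >= scores[i]:
            pop[i], scores[i] = trial, s

best = pop[scores.argmax()] > 0.5
print("selected features:", np.flatnonzero(best), "accuracy:", scores.max().round(3))
```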

An evaluation criterion is a process that aims to find the optimal feature subset.

Filter methods are commonly employed as independent preprocessing methods. Features are selected based on their correlation scores with the outcome variable under various statistical tests; the term ''correlation'' is used here in a broad, not strictly statistical, sense. Furthermore, the classification algorithm does not influence the evaluation of the subsets. To score features, several measures, such as correlation, gain ratio, and Euclidean distance, are utilized. These measures are discussed in the following section, and the structure of the filter method is illustrated in Figure 11.

1) MUTUAL INFORMATION (MI)
The mutual information between two discrete random variables A and B is

I(A; B) = \sum_{a \in A} \sum_{b \in B} p(a, b) \log \frac{p(a, b)}{p(a)\, p(b)}

where p(a, b) is the joint probability function of A and B, and p(a) and p(b) are the marginal probability distribution functions of A and B, respectively. For continuous random variables, the summation is replaced by a double integral.
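For illustration, the per-feature MI score can be estimated as sketched below; scikit-learn's mutual_info_classif uses nearest-neighbor entropy estimates rather than the discrete double summation, and the iris dataset is an arbitrary choice.

```python
# Hedged sketch: rank features by estimated mutual information with the class.
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif

X, y = load_iris(return_X_y=True)
mi_scores = mutual_info_classif(X, y, random_state=0)
for idx, score in enumerate(mi_scores):
    print(f"feature {idx}: MI = {score:.3f}")
```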
Statistical measures are used to assign a score to each feature in the filter technique. The features are sorted in descending order of their scores, and a subset is selected based on threshold values. Using the filter approach to select the best features requires little computational time. However, because the connections among independent variables are not considered when selecting features, redundant features may be chosen. Recent studies have utilized MI techniques in their research [114], [115], [116].

2) PEARSON'S CORRELATION (PC) [118]

The features that show redundancy are dealt with using correlation-based feature selection [119]. The correlation coefficient is used to select features that are highly related to the target variable but have minimal inter-correlation among themselves [120]. The correlation of each set of features determines the highest correlation coefficient value and immediately selects a feature [121].

3) INFORMATION GAIN (IG)
Information gain (IG) is a filter feature selection method utilized to determine essential attributes from a group of features. When the value of a feature is unknown, IG reduces the risks associated with selecting a class attribute [122]. It is grounded in information theory and is used to rank and select top features before the learning process begins, thereby reducing the feature size. The entropy value of the distribution is calculated during ranking to estimate the uncertainty of each feature based on its significance in defining the separate classes [123]; related entropy measures such as sample entropy have also been used [124]. The information gain about X provided by Y is calculated as

IG(X \mid Y) = H(X) - H(X \mid Y)

where H(X) is the entropy of variable X and H(X \mid Y) is the entropy of X after observing another variable Y.
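A small worked example of this formula, assuming a balanced binary class X (8 positive and 8 negative samples) and a binary feature Y that splits the data into branches of (7+, 1−) and (1+, 7−):

```latex
% Toy IG computation under the assumptions stated above.
\begin{align*}
H(X)          &= -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1 \text{ bit},\\
H(X \mid Y)   &= \tfrac{1}{2}\,H\!\left(\tfrac{7}{8},\tfrac{1}{8}\right)
               + \tfrac{1}{2}\,H\!\left(\tfrac{1}{8},\tfrac{7}{8}\right) \approx 0.544 \text{ bits},\\
IG(X \mid Y)  &= H(X) - H(X \mid Y) \approx 0.456 \text{ bits}.
\end{align*}
```

That is, observing this feature removes roughly 0.46 of the 1 bit of class uncertainty, which is what the ranking step rewards.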

4) GAIN RATIO (GR)
The gain ratio is required to correct the IG's bias towards features with highly diverse values [125]. The gain ratio is high when the data are evenly distributed and low if all data are directed to only one branch of the attribute.

The gain ratio is an attribute-wise measure determined by the number and length of the branches. It attempts to correct IG by taking intrinsic information into consideration [126]. The entropy distribution of the instance values can be used to estimate the intrinsic information of a specific feature.

The Laplacian score (L_r) of the r-th feature is expressed as

L_r = \frac{\tilde{f}_r^{T} L \tilde{f}_r}{\tilde{f}_r^{T} D \tilde{f}_r}

where the diagonal matrix is denoted by D, the Laplacian matrix is defined as L = D − S, and \tilde{f}_r is determined as follows:

\tilde{f}_r = f_r - \frac{f_r^{T} D \mathbf{1}}{\mathbf{1}^{T} D \mathbf{1}} \mathbf{1}

The Fisher score is a popular supervised method for selecting features that computes individual Fisher scores over the data space [129]. Fisher's criterion does not recognize combined effects or handle similar features, but it provides optimal predictors [130] under certain orthogonality assumptions. The fundamental premise of the Fisher score is to increase the distances between data samples of different classes while decreasing the distances within the same class. Several recent studies have utilized the Fisher score filter method for feature selection [131], [132], [133].
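The common single-feature form of the Fisher score, the ratio of between-class to within-class scatter for each feature, can be sketched in NumPy as below; the survey's exact variant may differ, and the iris dataset is an arbitrary choice.

```python
# Hedged NumPy sketch of per-feature Fisher scores.
import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
overall_mean = X.mean(axis=0)
num, den = np.zeros(X.shape[1]), np.zeros(X.shape[1])

for c in np.unique(y):
    Xc = X[y == c]
    n_c = Xc.shape[0]
    num += n_c * (Xc.mean(axis=0) - overall_mean) ** 2   # between-class scatter
    den += n_c * Xc.var(axis=0)                          # within-class scatter

fisher = num / den
print("features ranked by Fisher score:", np.argsort(fisher)[::-1])
```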
The chi-squared (X²) statistic evaluates the independence of two variables by calculating a score that indicates how independent they are. In feature selection, X² measures the independence of a feature from the class. Before the score is calculated, X² relies on the null hypothesis that the feature and the class are independent [134]. A substantial score value indicates a highly dependent connection.
For a feature r and class c_i, the score can be computed as

\chi^2(r, c_i) = \frac{N \left[ P(r, c_i)\, P(\bar{r}, \bar{c}_i) - P(r, \bar{c}_i)\, P(\bar{r}, c_i) \right]^2}{P(r)\, P(\bar{r})\, P(c_i)\, P(\bar{c}_i)}

where N signifies the complete dataset, r indicates the presence of a feature (\bar{r} its absence), and c_i refers to the class. P(r, c_i) is the probability that feature r occurs in class c_i, and P(r) is the likelihood of the feature across the dataset. Some researchers have used the chi-squared filter method for feature selection [135], [136].

The fast correlation-based filter (FCBF) begins with a comprehensive set of characteristics. It computes feature dependency by employing symmetrical uncertainty and eliminates superfluous features using the backward selection approach [137]. The technique includes an internal criterion that prevents relevant features from being removed, and it is faster than many other feature selection approaches. The FCBF algorithm was developed in [124].

A mutual-information-based criterion trades the relevance of each candidate feature to the class against its redundancy with respect to the already selected characteristics, which is specified as follows:

\max_{F_j \in F \setminus S} \left[ I(F_j; C_k) - \frac{1}{|S|} \sum_{F_i \in S} I(F_j; F_i) \right]

where I(F_j; C_k) is the mutual correlation between feature F_j and class C_k, I(F_j; F_i) is the correlation between features F_j and F_i, and S denotes the set of already selected features. Based on this criterion, a feature subset is selected. These approaches successfully identify and remove unnecessary features. However, they cannot remove redundant features because they do not account for possible feature dependencies.

Alternatively, wrapper methods evaluate the relative utility of feature sets based on the prediction performance of a learning machine. Classification error-rate estimation and theoretical performance constraints are frequently used to evaluate a model's performance: the lower the error rate of a feature subset, the better the result. An exhaustive search can be conducted when the number of features is small; however, examining all subsets is NP-hard and subject to overfitting. Sequential forward selection or backward elimination, best-first, branch-and-bound, simulated annealing, and genetic algorithms are just a few of the greedy search strategies that can be implemented [162]. Several of these are common sequential search strategies covered in Section II. The structure of the wrapper method is shown in Figure 12. Particular wrapper methods are discussed in the following section.

Recursive feature elimination (RFE) is a well-known feature selection algorithm. It is popular because it is simple to set up and use, and it is good at identifying the features in a training dataset that are most relevant for predicting the target variable [163]. It is a recursive procedure that sorts the features by importance according to an underlying model, such as a random forest classifier. When using RFE, there are primarily two configuration options: the number of features to select and the algorithm used to assist in feature selection.

Both of these hyper-parameters can be investigated, although their exact configuration does not have a significant effect on the performance of the method. This method has been used in several recent studies [164], [165], [166], [167].
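A minimal sketch of these two configuration options follows; the random forest estimator, the ten-feature target, and the breast-cancer dataset are illustrative assumptions.

```python
# Illustrative RFE sketch: the estimator supplies importance scores,
# and one feature is dropped per elimination round.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

X, y = load_breast_cancer(return_X_y=True)
rfe = RFE(
    estimator=RandomForestClassifier(n_estimators=100, random_state=0),
    n_features_to_select=10,   # how many features to keep
    step=1,                    # drop one feature per round
)
rfe.fit(X, y)
print("kept feature indices:", rfe.get_support(indices=True))
```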
The Boruta algorithm is a wrapper for the random forest classification algorithm in the random forest R package [168].

The random forest classification process is fast and can typically be used without extensive parameter tuning. Table 3 presents different studies using the wrapper method to select features.

The least absolute shrinkage and selection operator (LASSO) penalizes the regression variable coefficients, shrinking some of them to zero via a procedure known as L1 regularization. Variables that retain non-zero coefficients after shrinkage are selected as part of the model during the FS stage. The goal of this approach is to minimize prediction error as much as possible [190]. LASSO can produce a highly accurate forecast while reducing variance without considerably increasing bias by shrinking and deleting coefficients. It is useful when the number of instances is limited and the variety of features is wide. Furthermore, LASSO reduces overfitting by removing extraneous variables that are not associated with the response variable, thereby improving model interpretability [191]. Table 4 presents different studies using the embedded method to select features. Table 5 presents different studies using the hybrid method to select features.
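Returning to the LASSO approach described above, a minimal sketch of embedded selection via L1 shrinkage is shown below; the diabetes dataset and the regularization strength alpha = 5.0 are illustrative assumptions, and the set of surviving variables depends on that strength.

```python
# Hedged sketch of LASSO-based embedded selection: fit an L1-penalized
# linear model and keep the variables whose coefficients are non-zero.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)   # L1 penalties are scale-sensitive

lasso = Lasso(alpha=5.0).fit(X, y)
selected = np.flatnonzero(lasso.coef_)  # non-zero coefficients survive shrinkage
print("selected variables:", selected, "of", X.shape[1])
```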

In machine learning, feature selection is also known as variable or attribute selection.

In the context of reproducing kernel Hilbert spaces (RKHS) [214], [215], an independence criterion called the Hilbert-Schmidt norm of the cross-covariance operator was proposed. Different applications, including independent component analysis [216], sorting/matching [217], supervised dictionary learning [218], and multiview learning [219], have employed this measure, known as the Hilbert-Schmidt independence criterion (HSIC). According to HSIC, two random variables x and y are independent if every bounded continuous function of the two random variables is uncorrelated. HSIC is a criterion for detecting non-linear connections that requires neither generalized eigenvalue problems nor regularization parameters [220], [221].
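A minimal NumPy sketch of the empirical (biased) HSIC estimator, HSIC = tr(KHLH)/(n−1)², is given below, assuming RBF kernels with median-heuristic widths; these choices are illustrative rather than prescribed by the cited works.

```python
# Hedged sketch of the empirical HSIC estimator with RBF kernels.
import numpy as np

def rbf_gram(v, sigma):
    d2 = (v[:, None] - v[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma ** 2))

def median_sigma(v):
    d = np.abs(v[:, None] - v[None, :])
    m = np.median(d[d > 0])
    return m if m > 0 else 1.0

def hsic(x, y):
    n = len(x)
    K = rbf_gram(x, median_sigma(x))
    L = rbf_gram(y, median_sigma(y))
    H = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

rng = np.random.default_rng(0)
x = rng.normal(size=200)
print("independent:", hsic(x, rng.normal(size=200)).round(4))
print("non-linear dependence:", hsic(x, x ** 2).round(4))  # larger value
```

Note that x and x² are uncorrelated in the linear sense, yet HSIC still flags the dependence, which is the property the text emphasizes.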

Unsupervised feature selection (UFS) approaches are extensively used to analyze high-dimensional data. These techniques use unlabeled data owing to the scarcity of readily available labels. The majority of existing UFS techniques concentrate on the importance of features in preserving the data structure while ignoring feature redundancy [222].

Wrapper approaches use the results of specific clustering algorithms to evaluate feature subsets; a minimal sketch of this wrapper-style evaluation appears after the list below. These methods are distinguished by how the feature subsets are discovered. In this manner, the quality of the results of the clustering algorithm used for selection is improved.
1) Sequential methods: In these methods, the features are sequentially added or removed. [223], [224], [225] are profound works on this topic.

2) Bio-inspired methods: Bio-inspired methods attempt to introduce unpredictability into the search process in order to avoid local optima. Some studies on these methods are presented in [226] and [227].

3) Iterative methods: Iterative approaches resolve the UFS issue and reduce the need for combinatorial search by redefining it as an evaluation problem. [228], [229], [230] are some studies on these methods.

Hybrid-based methods attempt to use the strengths of both filter and wrapper approaches to achieve a suitable balance between computational efficiency and productivity in the associated objective task when the selected features are used. Hybrid-based methods include a filter stage in which features are ordered or chosen using a measure based on the inherent attributes of the data.
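As an illustration of the wrapper-style evaluation referenced in the list above, the sketch below scores candidate feature subsets by the silhouette of the k-means clustering they induce; enumerating only two-feature subsets and fixing k = 3 are simplifying assumptions.

```python
# Hedged sketch of a wrapper-style unsupervised evaluation: each subset
# is scored by the quality of the clustering it produces.
from itertools import combinations
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

X, _ = load_iris(return_X_y=True)

best_score, best_subset = -1.0, None
for subset in combinations(range(X.shape[1]), 2):   # all 2-feature subsets
    Xs = X[:, subset]
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Xs)
    score = silhouette_score(Xs, labels)            # clustering quality as fitness
    if score > best_score:
        best_score, best_subset = score, subset

print("best subset:", best_subset, "silhouette:", round(best_score, 3))
```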

• Based on Support Vector Machines: Support vector machine-based methods choose features by optimizing the classification margin between classes while utilizing the local data structure. Many strategies, such as manifold regularization, recursive feature elimination, merging the L1-norm with the L2-norm, and replacing the L2-norm with the L1-norm, can be used for SVM-based models [254].
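As a hedged sketch of the ''replacing the L2-norm with the L1-norm'' strategy, the example below fits an L1-penalized linear SVM whose zeroed weights drop the corresponding features; the regularization strength C = 0.05 is an illustrative choice.

```python
# Hedged sketch: L1-penalized linear SVM drives many weights to zero,
# and SelectFromModel keeps the features with non-zero weights.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectFromModel
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

svm = LinearSVC(penalty="l1", dual=False, C=0.05, max_iter=10000).fit(X, y)
selector = SelectFromModel(svm, prefit=True)
print("features kept:", selector.get_support(indices=True))
```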

The advantages and disadvantages of the supervised, unsupervised, and semi-supervised learning methods are listed concisely in Table 7.

Another learning method, known as ensemble learning, utilizes a combination of several learning models. The ensemble learning method is described in the following section.

Ensemble learning is a powerful machine learning technique. The basic concept is to improve learning outcomes by combining several learning models [255]. Ensemble learning methods outperform single machine learning models across a variety of machine learning techniques, and the rapid growth of ensemble feature selection in recent decades has been based on this concept. Unlike other feature selection techniques, in which only one optimal feature subset is selected, the goal of ensemble feature selection is to obtain several optimal feature subsets. The learning outcomes are then set based on these optimal feature subsets [262], [263], [264], [265].

For the unweighted KNN algorithm, the kernel parameter must be set to rectangular. Several studies have utilized KNN classifiers for model validation [270], [271], [272].
As a decision boundary, support vector machines use a hyperplane in the optimal feature space according to the maximum-margin concept. Kernel functions change the shape of the hyperplane from linear to non-linear [273]. Support vector machines are frequently used with the RBF kernel, whose two hyperparameters are the regularization parameter C and the kernel width parameter. SVM classifiers have been used in recent studies [274], [275], [276].

The naive Bayes classifier is a simple and efficient classification method that enables fast model training and rapid predictions. It is a probabilistic classifier that generates forecasts based on the probability of an entity. A naive Bayes classifier assumes that the presence of one feature in a class is independent of the presence of any other feature. The probabilities for each element in the naive Bayes algorithm are determined separately from the training dataset, and a search technique is used to assess the efficacy of combining the probabilities of several attributes when forecasting the output variable. There is no built-in method for determining the relevance of features in naive Bayes classifiers. Naive Bayes algorithms determine the conditional and unconditional probabilities associated with the features and forecast the class with the highest probability. This can be used to solve multi-class prediction problems. If the assumption of feature independence holds, naive Bayes can outperform other models while using significantly less training data, and it performs better with categorical input variables than with numerical ones.

A random forest comprises a massive set of discrete decision trees that work together as an ensemble. The numerous trees in the random forest individually output a class prediction, and the class with the most votes becomes the model's prediction. Random forests employ bagging and feature randomization to create a forest of relatively uncorrelated trees whose aggregate prediction is more accurate than that of any single tree. The underlying premise is that many relatively uncorrelated models (trees) acting as a committee will outperform any of the individual models.

• F-score: This is a single score derived from a combination of the recall and precision measurements [302], [304]. The F-score is the harmonic mean of the recall and precision metrics:

F = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}

• Accuracy (ACC): for clustering results, accuracy is computed as

ACC = \frac{\sum_{i=1}^{n} \delta(y_i, \text{map}(l_i))}{n}

where l_i and y_i are x_i's cluster and true class labels, respectively, and n is the total number of data points. \delta(x, y) is the delta function that equals 1 if x = y and 0 otherwise, and \text{map}(l_i) is the permutation mapping function that maps each cluster label l_i to an equivalent label from the data set.

The time required to test a trained classifier may vary owing to differences in the operating, training, and test times of classifiers using different FS methods. We chose these three times to demonstrate efficiency from various perspectives, and these times are cost-dependent.

Big data are defined as ''a dataset whose size exceeds the capability of typical dataset management systems in gathering, storing, processing, and analyzing.'' They usually have three characteristics: huge volume, wide variety, and rapid change [1], [2], [3]. The challenges posed by these 3V characteristics, namely volume, variety, and velocity, have become the focus of learning methods when dealing with extensive data. Furthermore, duplication and relatedness, which must be handled in massive datasets to avoid losing valuable content, frequently make the mining procedure more critical.

SGMI is a distributed and scalable global MI-based feature selection framework that develops a similarity matrix in a single pass and in a scalable manner. Subsequently, based on the similarity matrix, it employs a feature-ranking algorithm to discover a globally optimal solution. The similarity matrix indicates the dependency among features, and it can be computed using MI or CMI, the former having less complexity than the latter. The SGMI framework employs three optimization approaches: the first employs an MI similarity matrix, whereas the others use a CMI similarity matrix. In this study, three techniques were developed: SGMI-QP, SGMI-SR, and SGMI-TP. Consequently, these methods establish a feature ranking that places informative characteristics at the top.

The bag-of-words model is a typical method for encoding a document in text mining [307]. The purpose is to model each text based on the number of times each word appears in it. Typically, feature vectors are built to indicate the count of a single word; another option is to record only the presence or absence of a word without providing a count. A lexicon is a collection of words whose occurrences have been tracked. When a dataset requires representation, words from the documents can be combined to form a vocabulary, which is then reduced by feature selection. During feature selection, it is possible to perform some preprocessing, such as removing rare words with very few instances, removing excessively familiar terms (e.g., ''a,'' ''the,'' ''and,'' and similar), and merging the various inflected forms of an expression (lemmatization, stemming) [308].
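As a hedged sketch of this bag-of-words pipeline, the example below builds count vectors, drops rare and overly common words, and reduces the vocabulary with a chi-squared filter; the 20-newsgroups corpus, the two chosen categories, and k = 1000 are illustrative assumptions.

```python
# Hedged sketch: bag-of-words encoding followed by filter-based
# vocabulary reduction with chi-squared scores.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

news = fetch_20newsgroups(subset="train",
                          categories=["sci.med", "rec.autos"],
                          remove=("headers", "footers", "quotes"))

# Build count vectors, dropping rare words and common English stop words
vectorizer = CountVectorizer(min_df=5, stop_words="english")
X = vectorizer.fit_transform(news.data)

selector = SelectKBest(chi2, k=1000).fit(X, news.target)
kept = selector.get_support(indices=True)
vocab = vectorizer.get_feature_names_out()
print("vocabulary reduced from", X.shape[1], "to", len(kept), "terms")
print("sample kept terms:", [vocab[i] for i in kept[:10]])
```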
Automatic image annotation can also benefit from feature selection; two weighted feature selection techniques [315], [316] have been presented to address this task [314].

Mass spectrometry (MS) has established itself as a new and appealing framework for diagnosing diseases and for protein-based biomarker analysis [331]. A mass spectrum has thousands of possible mass/charge (m/z) ratios on the x-axis, each with its signal intensity value on the y-axis. A typical MALDI-TOF low-resolution proteomic profile can contain up to 15,500 data points in the 500-20000 m/z range; with higher-resolution equipment, the number of points can be increased even further. For data mining and bioinformatics purposes, each m/z ratio can be regarded as a separate variable whose value is the intensity.

Filenames, authors, sizes, dates, track lengths, and genres are frequently used to categorize and recall audio materials. Categorization based on these data alone is impossible; hence, the feature selection process is needed. Feature selection in genre classification refers to the process of converting an audio segment into compact numeric values [335]. Owing to the increased dimensionality of the feature sets, feature selection is used as a preprocessing step before classification to reduce the data dimensionality.

[359] and Unsupervised Heterogeneous Anomaly-Based IDS [360] are several anomaly-based IDSs. For the specification-based detection approach, a professional manually builds the required pattern, which consists of a sequence of rules that describe the different valid behaviors of a device. If the specifications are sufficiently precise, the pattern may be able to detect illegal patterns of activity. The finite state machine (FSM) methodology appears to be appropriate for modeling network protocols [361]. Hybrid detection exploits the strengths of each intrusion detection method while minimizing its flaws, constructing a solid schema to detect intrusions. A key feature of hybrid detection is the use of a signature-based detection system in conjunction with an additional anomaly-based model.

When model interpretability is crucial, FS is the preferred dimensionality reduction strategy [399], [400], because a model is only as good as its features, and FS methods will continue to play an important role in model interpretation. Users can choose between two directions for the FS and model-creation processes. One is more interactive model visualization, which changes the input parameters in response to model challenges and visualizes future events. The other is a more interactive feature selection process in which users are encouraged to iterate using interactive visualizations. The goal is to make the results more interpretable through user-friendly visualization. The complexities of big data applications highlight the importance of minimizing visual complexity. Although most studies have focused on FS and visualization separately, the display of data features may play an important role in real-world high-dimensionality contexts. While visualization tools are constantly used to analyze and make complex data understandable, the quality of the corresponding decision-making is frequently compromised because the tools fail to acknowledge the role of heuristics, biases, and other factors in human-computer interaction; interactive tools such as those suggested by Krause et al. [401] are therefore intriguing research topics.

Feature selection is a dimensionality reduction strategy that separates important feature subsets from irrelevant and redundant ones. The importance of FS for data processing has grown significantly with the increase in the number of available FS methods. In addition to well-known FS approaches, this study presents a strategic categorization. Different search strategies and standard learning methods for improving learning performance are discussed, and a good representation of a wide range of algorithms based on the evaluation criteria is also presented. These FS approaches have gained usability but still have untapped potential. This potential is presented systematically, together with some remaining challenges in retrieving optimal feature subsets.

University of Business and Technology. His research experience, within both academia and industry, has resulted in over 80 journal and conference publications. For more than ten years, he has supervised master's and undergraduate students in their thesis work. His research interests include artificial intelligence (AI), machine learning, natural language processing (NLP), and big data analysis. He has served as a program committee member for several international conferences/workshops and as an associate editor of several journals.

AKIBUR RAHMAN PRODEEP is currently pursuing a degree in computer science and engineering at the Bangladesh University of Business and Technology. He is also working as a Research Assistant with the Advanced Machine Learning Laboratory. He is an optimistic, energetic, enthusiastic, and devoted individual seeking out challenging situations in which to apply his accumulated knowledge of artificial intelligence. He has experience working with Python, TensorFlow, Keras, Matplotlib, NumPy, and Pandas. His research interests include deep learning, image processing, computer vision, and natural language processing. He is also working on lung nodule and cancer recognition, feature selection, acne and rosacea (skin disease) detection, and plant disease identification.