Wrapper and Hybrid Feature Selection Methods Using Metaheuristic Algorithms for English Text Classification: A Systematic Review

Feature selection (FS) comprises the processes used to decide which relevant features/attributes to include in, and which irrelevant features to exclude from, predictive modeling. It is a crucial task that helps machine learning classifiers reduce error rates, computation time, and overfitting while improving classification accuracy. It has demonstrated its efficacy in many domains, including text classification (TC), text mining, and image recognition. While there are many traditional FS methods, recent research efforts have been devoted to applying metaheuristic algorithms as FS techniques for the TC task. However, few literature reviews cover metaheuristic-based FS for TC. Therefore, this paper presents a systematic review of the available studies on the different metaheuristic algorithms used for FS to improve TC. This paper will contribute to the body of existing knowledge by answering four research questions (RQs): 1) What are the different approaches of FS that apply metaheuristic algorithms to improve TC? 2) Does applying metaheuristic algorithms for TC lead to better accuracy than the typical FS methods? 3) How effective is the modification or hybridization of a metaheuristic algorithm on the text FS problem? and 4) What are the gaps in the current studies and their future directions? These RQs led to a study of recent works on metaheuristic-based FS methods, their contributions, and their limitations. A final list of thirty-seven (37) related articles was extracted and investigated against our RQs to generate new knowledge in the domain of study. Most of the reviewed papers addressed TC with metaheuristic algorithms based on the wrapper and hybrid FS approaches. Future research should focus on the hybrid-based FS approach, as it intuitively handles complex optimization problems and can potentially provide new research opportunities in this rapidly developing field.


I. INTRODUCTION
A huge quantity of digital data is generated on the Internet every second from sources such as emails, social media sites, and libraries. The generated data can take the form of numbers, text, audio, video, graphs, and others. Knowledge of different sorts from several domains, such as the financial, medical, statistical, and logical domains, can be extracted from such data to gain insights and make predictions. A significant part of the data available today is stored as text [1], [2]; that is, textual data constitute a large portion of the generated data. For example, around a billion messages and tweets are posted on Facebook and Twitter pages monthly. Moreover, more than a million articles were edited on Wikipedia in 2020.
The mining of textual data employs several techniques (from the field of statistics to artificial intelligence) to categorize texts (e.g., news filtering, topic identification, and document routing) [3], [4] according to the domain that characterizes the problem to be solved. Classifying textual data is also known as topic classification, text categorization, or text classification (TC). Many businesses, institutions, and people hold text data that are highly unstructured, which makes it extremely difficult for them to analyze and understand the data, reduce its complexity, access information easily and quickly, and sort it on a large scale. To solve this problem, they leverage TC, for which many techniques and methods (such as machine learning approaches) have been proposed to classify text documents automatically [5], because of its scalability and its real-time analysis of unstructured text. This, in turn, saves time, automates business processes, and usually helps in making informed business decisions.
VOLUME 4, 2016
Therefore, there is a need to focus on the problem of organizing and managing the phenomenal growth of unstructured textual data. TC is a supervised learning task whose primary goal is to systematically organize the given documents into their relevant categories/classes (e.g., topics) [5], [6]. For example, a document with the words "viruses", "pandemic", "lockdown", "death", and "vaccination" is assigned to the "Covid-19" class label. The common approach for TC consists of several essential steps:
• Preprocessing the textual data using several techniques such as tokenization, stopword removal/filtering, stemming, and cleansing, amongst others [7].
• Dimensionality reduction, such as feature selection (FS), to reduce the high dimensionality of the feature space [5].
• Data mining/pattern discovery, which involves building, developing, and training the models using machine or deep learning algorithms [5].
Among the above essential steps, feature extraction is the main focus of this work. It is conducted using different text representation formats such as the bag of words (BOW) model, in which a document is represented as a feature vector of words/terms (features) with their associated frequencies. However, a feature vector may comprise tens or hundreds of thousands of features, and this high dimensionality strongly degrades TC accuracy [10], [11]. To resolve this problem, FS methods (in the dimensionality reduction step) are advised to select the significant/valuable features. This, in turn, yields a decent text representation, minimizes computational time, and reduces the overfitting and the error rate of the classifier to achieve a precise classification [12].
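As an illustration of the BOW representation described above, the following minimal Python sketch builds one term-frequency vector per document. The function name `bag_of_words`, the whitespace tokenization, and the two toy documents are assumptions introduced here for illustration; they are not taken from the reviewed studies.

```python
from collections import Counter


def bag_of_words(documents):
    """Return (vocabulary, vectors) with one term-frequency vector per document."""
    # Whitespace tokenization on lowercased text; a real pipeline would also
    # apply stopword removal and stemming as listed in the preprocessing step.
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors


# Hypothetical toy corpus, for illustration only.
docs = ["pandemic lockdown vaccination", "lockdown lockdown economy"]
vocabulary, vectors = bag_of_words(docs)
print(vocabulary)  # ['economy', 'lockdown', 'pandemic', 'vaccination']
print(vectors)     # [[0, 1, 1, 1], [1, 2, 0, 0]]
```

Every distinct term adds one dimension to each vector, which is why realistic corpora quickly reach the feature-vector sizes mentioned above.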
Three primary FS methods are used for TC: the filter-based approach, the wrapper-based approach, and the embedded-based approach [13]-[15]. The filter-based approach performs a statistical analysis over the feature space by ranking each feature of the dataset using a univariate method (e.g., Information Gain (IG)) or a multivariate method (e.g., Correlation-based Feature Selection (CFS)), then selecting the top-N highest-ranking features [16], [17]. Yet, the filter-based methods are subject to some limitations [18], [19]. In contrast, the wrapper-based approach evaluates the usefulness of features based on the performance of the classifier used. It is computationally more expensive than the filter approach because of the repeated learning steps and cross-validation [16]. Despite this cost, it creates an interaction between the feature subset search and the classification algorithm, leading to better feature subset selection. Moreover, using an optimization algorithm as the FS method in a wrapper-based approach could lead to better results than the filter-based approach, because the optimization algorithm uses an objective function to evaluate the selected features that takes into account both the classification accuracy and the number of selected features. Examples of the wrapper-based approach are recursive feature elimination, sequential (forward/backward) feature selection (SFS) algorithms, and the genetic algorithm (GA) [20], [21]. Both the wrapper- and embedded-based approaches use the classification method in their mechanism to measure the performance of TC; the wrapper-based approach takes feature dependencies into account, whereas the embedded-based approach has a lower computational cost. Furthermore, the embedded-based approach incorporates the FS method into the classifier's training (learning) process without using a search algorithm such as a metaheuristic algorithm.
Decision trees such as Classification and Regression Trees (CART) have a built-in (embedded-based) mechanism to implement variable selection [13], [15], [22], [23]. More recently, researchers have proposed a two-stage approach, called the hybrid-based FS approach, that combines a less expensive filter-based approach to rank the features with a more expensive wrapper-based approach to eliminate further irrelevant features. It differs from the other approaches by taking advantage of both filter and wrapper. First, it applies the filter method to select a feature subset with the highest-ranking scores; then it applies the wrapper method to further optimize the selected subset using the optimization algorithm [24], [25].
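The univariate filter ranking described earlier in this section (score every feature, then keep the top-N) can be sketched as follows. This is a minimal illustration using Information Gain on binary term-presence features; the helper names and the toy data are assumptions made here, not material from the cited works.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(column, labels):
    """IG(class; feature) = H(class) - H(class | feature)."""
    conditional = 0.0
    for value in set(column):
        subset = [y for x, y in zip(column, labels) if x == value]
        conditional += (len(subset) / len(labels)) * entropy(subset)
    return entropy(labels) - conditional


def top_n_features(rows, labels, n):
    """Score each feature independently and return the indices of the n best."""
    n_features = len(rows[0])
    scores = [information_gain([row[j] for row in rows], labels) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranked[:n], scores


# Hypothetical toy data: 4 documents x 3 binary term-presence features.
rows = [[1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 0]]
labels = ["sport", "sport", "politics", "politics"]
selected, scores = top_n_features(rows, labels, n=1)
print(selected, scores)
```

In this toy data only the first term separates the two classes, so it receives an IG of 1 bit and the other two terms receive 0.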
Most of the traditional approaches, such as the filter-based approach, have several flaws, such as their failure to provide adequate FS performance because they neglect feature interactions. Some features, for example, may appear redundant or irrelevant on their own but be extremely useful when combined with others, and the top-scoring features may be redundant with one another. Most filter methods, in fact, analyze features independently and apply their methods directly to the original high-dimensional datasets [26], [27]. These challenges have prompted researchers to seek other methods of obtaining better performance and results during the FS process. This research pursuit has led to the introduction of metaheuristic-based FS methods for TC.
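A small worked example may make the feature-interaction problem concrete. In the sketch below, which uses a hypothetical XOR-style toy dataset and an assumed helper `lookup_accuracy`, each feature on its own predicts the class no better than chance, while the pair of features predicts it perfectly. A method that scores features independently would therefore discard both.

```python
from collections import Counter, defaultdict


def lookup_accuracy(keys, labels):
    """Training accuracy of predicting the majority label for each distinct key."""
    groups = defaultdict(list)
    for key, label in zip(keys, labels):
        groups[key].append(label)
    correct = sum(Counter(group).most_common(1)[0][1] for group in groups.values())
    return correct / len(labels)


# Hypothetical data: the class is the XOR of two binary features.
x1 = [0, 0, 1, 1]
x2 = [0, 1, 0, 1]
labels = [a ^ b for a, b in zip(x1, x2)]

acc_x1 = lookup_accuracy(x1, labels)                  # feature 1 alone
acc_x2 = lookup_accuracy(x2, labels)                  # feature 2 alone
acc_pair = lookup_accuracy(list(zip(x1, x2)), labels)  # both features together
print(acc_x1, acc_x2, acc_pair)  # 0.5 0.5 1.0
```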
Metaheuristic refers to general ideas, techniques, or approaches that are applicable to many problems [28]. Metaheuristics are approximate algorithms, each with a different historical background (for instance, evolutionary-based, swarm-based, physics-based, and human-based techniques). They can also be seen as a set of algorithmic concepts for defining heuristic techniques that can be applied to diverse optimization problems with only minor modifications to adapt them to specific problems [29].
Since FS problems are known to be non-deterministic polynomial-time hard (NP-hard) [30], metaheuristic algorithms have been successfully used to solve them. These algorithms find an approximately optimal solution by relying on a core factor: the information and experience obtained throughout the search process are kept as memory to guide the iterative generation of candidate solutions across large solution spaces, which in turn improves the results of the TC problem [31].
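To give a sense of why exhaustive search is impractical and approximate search is used instead, the short sketch below counts the non-empty feature subsets of an N-feature space, which is 2^N - 1. The function name is an assumption introduced here for illustration.

```python
def count_candidate_subsets(n_features):
    """Number of non-empty feature subsets of an n-feature space (2^n - 1)."""
    return 2 ** n_features - 1


for n in (10, 20, 30):
    print(n, count_candidate_subsets(n))
# 10 1023
# 20 1048575
# 30 1073741823

# For a vocabulary of 10,000 terms the count has 3011 decimal digits.
print(len(str(count_candidate_subsets(10_000))))  # 3011
```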
Several review studies that provide useful information on FS in TC have been published in the literature [5], [6], [14], [32]. However, none of the accessible review studies focuses on FS utilizing metaheuristic techniques in TC; the available ones are either cursory analyses or cover a small collection of work in the field. In contrast, this review formulates research questions to direct researchers to particular research gaps, methods, and challenges that have not been addressed in the current field of study.
This systematic literature review (SLR) provides a complete analysis and synthesis of 6 years of research on FS utilizing optimization methods, which contributes to developing a solid foundation for future studies. Hence, the contribution of this SLR is to:
• Examine the exploited FS approaches (wrapper- and hybrid-based) that are based on metaheuristics to improve the English TC.
• Assess the effect of the metaheuristic algorithms as an FS method on the TC accuracy compared to the typical FS methods.
• Assess the effect of the modification or hybridization of the metaheuristic algorithms on the text FS problem.
• Identify the gaps in the current studies and their future directions.
• Serve as a hands-on guide for discovering the appropriate modelling technique for English TC problems.
Consequently, this study is meant to answer the research questions (RQs) in the following section and act as a referential guide. The structure of the paper is as follows: the introduction in Section I is followed by the review methodology in Section II, which details how the papers in the study were selected. Then, the findings from the reporting stage are discussed in Section III to identify the existing literature on metaheuristic-based methods that answers the research questions. Section IV presents the conclusion of the study and its limitations.

II. REVIEW METHODOLOGY
A systematic literature review (SLR) was carried out to identify the current literature relevant to feature selection (FS) based on metaheuristic optimization methods for text classification (TC). The review followed the procedures for carrying out an SLR in computer science and software engineering research [33]-[35]. The SLR involves three major phases: planning, performing, and reporting, as presented in Figure 1.

A. PHASE 1: PLANNING
The activities of the planning phase were identifying the necessity for the review, stating the RQs, identifying the search strategy, and designing the review protocol, which specifies the RQs and the review methodology used to conduct the review [34]. This SLR answered four research questions:
RQ1: What are the different FS approaches applied to metaheuristic algorithms to improve TC?
RQ2: Does applying metaheuristic algorithms for TC lead to better accuracy than typical FS methods?
RQ3: How effective is the modification or hybridization of a metaheuristic algorithm on the text FS problem?
RQ4: What are the gaps in the current studies and their future directions?
The search strategy is the first mapping step and helps determine the right direction for the research. It begins by identifying important key terms together with their alternatives and synonyms; choosing key terms that are closely related to the work helps retrieve the relevant research papers [36]. Therefore, the SLR was conducted using predefined search strategies that focus on identifying the studies relevant to the SLR research questions. The strategy aims to identify the resources and search key terms used to find the primary studies in the SLR, as shown in phase 2. The most common search strategy is to break down the research question into individual terms and then draw up a list of abbreviations, synonyms, and alternative spellings. More terms can be acquired by considering the subject headings used in journals and databases. The performing phase highlights the search keywords and study sources used in this SLR.

B. PHASE 2: PERFORMING
The performing phase comprises conducting the search strategy on the selected sources, selecting the primary studies, assessing study quality, and extracting, monitoring, and synthesizing the data. In this review, the SLR method was applied to evaluate all the available research associated with the research questions predefined in the planning phase.

1) Source
Seven (7) database sources were used as the primary sources to identify the available studies relevant to the highlighted RQs: IEEE, Scopus, ResearchGate, Springer, Google Scholar, ScienceDirect, and Taylor & Francis. These databases enable the discovery of published materials in the form of journals, bulletins, conference proceedings, book chapters, symposiums, gray literature, and workshops. We chose these seven sources for their originality and reputability. Furthermore, we established that other databases refer to these databases as the main sources for the existing studies on FS.

2) Search terms identification and conducting the search process
The search targeted studies on FS for TC using metaheuristic algorithms and their related concepts, and the relevant studies were then selected using the inclusion and exclusion criteria. The following search terms were derived from the pre-identified RQs: "feature selection", "text classification", "metaheuristic", "optimization", "swarm intelligence", "evolutionary algorithms", "trajectory", "filter", "wrapper", and "hybrid". For the advanced search, this study adopted the general approach of breaking down the RQs into individual terms and then executing advanced search strings using the Boolean operators "OR" and "AND" [29], [30], as follows: ("feature selection" OR "text feature selection") AND ("text classification" OR "text categorization" OR "document categorization" OR "document classification") AND ("metaheuristic" OR "metaheuristic" OR "optimization" OR "swarm intelligence" OR "evolutionary algorithms" OR "trajectory").
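The construction of the advanced search string (terms joined by OR within a concept, concepts joined by AND) can be sketched as below. The function `build_query` is a hypothetical helper introduced here; the term groups are copied as printed in the search string above, including the repeated "metaheuristic" entry, and individual databases may require their own query syntax.

```python
def build_query(term_groups):
    """OR the quoted terms within each group, then AND the groups together."""
    clauses = []
    for group in term_groups:
        quoted = " OR ".join(f'"{term}"' for term in group)
        clauses.append(f"({quoted})")
    return " AND ".join(clauses)


# Term groups reproduced as printed in the search string of this review.
term_groups = [
    ["feature selection", "text feature selection"],
    ["text classification", "text categorization",
     "document categorization", "document classification"],
    ["metaheuristic", "metaheuristic", "optimization",
     "swarm intelligence", "evolutionary algorithms", "trajectory"],
]

query = build_query(term_groups)
print(query)
```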

3) Inclusion and exclusion criteria
To ensure that the selected studies are relevant and within the scope of the study objective, inclusion and exclusion criteria are essential for an SLR [33], [34]. The goal of this SLR is to highlight the current FS methods that rely on metaheuristic-based methods (i.e., adaptation, modification, and hybridization) for TC. The inclusion and exclusion criteria applied to choose the relevant studies are shown in Table 1.

4) Quality Assessment
Quality assessment is a critical step in assessing the quality of the selected literature, and its questions were used to validate the criteria. It consists of questions aimed at assessing the extent to which the reviewed articles have addressed bias and internal and external validity [35]. Each of the five (5) quality assessment questions was answered with one of three options: Yes = 1, Partially = 0.5, and No = 0, as presented in Table 2.
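The scoring scheme above amounts to a simple sum over the five answers. A minimal sketch follows; the function name and the example answers are hypothetical and do not correspond to any reviewed paper.

```python
# Answer weights as stated in the quality assessment scheme.
SCORES = {"Yes": 1.0, "Partially": 0.5, "No": 0.0}


def quality_score(answers):
    """Total quality score of one paper from its answers to the five questions."""
    if len(answers) != 5:
        raise ValueError("expected one answer per quality question (Q1 to Q5)")
    return sum(SCORES[answer] for answer in answers)


# Hypothetical answers for a single paper, for illustration only.
example_answers = ["Yes", "Yes", "Partially", "Yes", "No"]
print(quality_score(example_answers))  # 3.5
```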

C. PHASE 3: REPORTING
In the reporting phase, the findings were stated in the following results section.

A. FINDINGS OF THE SELECTED STUDIES
The findings of this paper answer the specific research questions that guided this SLR. The studies were selected by searching the study sources and then screening and filtering the results in four iterations (T). In the first iteration, 513 relevant papers were extracted from the digital search as possible sources (n = 513). In the second iteration, duplicate articles were excluded using the Mendeley software (n = 337). In the third iteration, the studies were scanned and filtered by title, abstract, and conclusion, and the articles unrelated to the scope of our domain were excluded (n = 86). In the fourth and final iteration, the articles were scanned by reading the full text, applying the exclusion criteria, and filtering on the results of the quality assessment stage; only 37 articles were accepted and identified as final sources for the data synthesis (n = 37). The search process, its results, and the paper selection process are shown in Figure 2.
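The arithmetic of the four screening iterations can be checked with a few lines of Python. The stage labels are shortened paraphrases introduced here; the counts are those reported above.

```python
# (stage, number of papers remaining after that stage)
stages = [
    ("digital search", 513),
    ("duplicate removal", 337),
    ("title/abstract/conclusion screening", 86),
    ("full-text reading and quality assessment", 37),
]

removed = []
for (_, before), (name, after) in zip(stages, stages[1:]):
    removed.append((name, before - after))
    print(f"{name}: {before} -> {after} (excluded {before - after})")

retained_share = stages[-1][1] / stages[0][1]
print(f"retained: {retained_share:.1%}")  # retained: 7.2%
```

That is, 176 papers were excluded as duplicates, 251 during the title, abstract, and conclusion screening, and 49 during the full-text stage.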
The number of selected papers published per year has increased considerably. Generally, the average quality score appears to increase from 2016 to 2018. The growth in 2019 and 2020 indicates that more researchers developed an interest in the topic, but the decrease in 2021 indicates that TC using metaheuristic algorithms as an FS method still needs in-depth research. This paper strives to encourage the continued use of metaheuristic algorithms in future research, as it highlights the current research gaps and future directions in the findings section. The distribution of publications over the years 2016-2021 is presented in Figure 3.
Inclusion criteria
· Published between the years 2016 and 2021
· Articles are written in the English language
· FS methods solely used metaheuristic algorithms for TC
· Benchmark datasets that are in English language only
· Accessible in full form
· Papers indexed in SCOPUS and ISI only

Exclusion criteria
· Not related to the above-stated research questions in this SLR
· Articles that are conceptual (theoretical) have no experiments and results
· Studies that used traditional FS methods (filter-based method) without using metaheuristic methods (wrapper-based method) for TC
· Repetitive studies

Thirty-seven (37) articles were selected as high-quality that could be used to answer the research questions in this SLR. As shown in Table 3 below, twenty-eight (28) articles were rated (76%) as very good quality and nine (9) articles (24%) were rated as good. Other poor-quality articles were not considered in the results since they might not be making any impact.
The quality assessment questions (Q1 to Q5) were predefined in the review methodology. The summary of the quality assessment for the 37 papers (P1 to P37) selected for review in this SLR is presented in Table 4.
In Table 4, "P" represents "the reviewed paper." The review shows that FS is of considerable importance in data analysis, pattern classification, data mining, and machine learning applications. Thus, in many pattern classification problems, a good FS technique can reduce the cost of feature measurement, which automatically increases the classifier efficiency and classification accuracy. Hence, FS can be a preprocessing tool of great significance before solving the categorization problems.

B. DISCUSSION OF FINDINGS BASED ON THE RESEARCH QUESTIONS
This subsection discusses the answers to the stated research questions:

RQ1: What are the different FS approaches applied to metaheuristic algorithms to improve TC?
Answer 1: Various investigations were conducted to identify the appropriate FS approaches that used metaheuristic optimization methods. Generally, an FS approach follows one of these paradigms: filter, wrapper, embedded, or hybrid. To meet the search terms of this SLR, two paradigms were found in the reviewed studies: the wrapper approach and the hybrid approach (also known as filter-wrapper), distributed as shown in Figure 4. Each of these FS approaches either adapted, modified, or hybridized metaheuristic algorithm(s) in its implementation, as described below:
• Adapt: to adapt a new metaheuristic algorithm by enabling an intelligent behavior to solve sophisticated high-dimensional text data problems. The reason for adapting metaheuristic algorithms is to tackle the text FS problem and reduce high-dimensional datasets that cannot plausibly be handled by traditional FS methods, helping the classifier obtain better results.
• Modify/improve: to mitigate some of the core issues of metaheuristic algorithms, such as multiplicity, becoming stuck in local optima, and other issues. For example, faster convergence and robust global search efficiency can be achieved by balancing exploitation and exploration in the algorithm's search space, which in turn improves the classification performance [1], [48], [50], [66].
• Hybridize: to combine the best operators from two metaheuristic algorithms to create a new, improved one. The aim is a better algorithm that optimizes the solution (feature subsets) and enhances the quality of the initial candidate solutions using a local search strategy. Thus, the improved algorithm helps avoid local optima trapping and premature convergence, explores the search space efficiently and effectively, and makes excellent decisions [10], [51], [52].
It is important to note the difference between the terms hybrid and hybridize: the hybrid-based approach is an FS method, while hybridize refers to the hybridization of two metaheuristic algorithms.
The approaches and their techniques mentioned above are used to find the optimal feature subsets and reduce the dimensionality of the text data by selecting the smallest subset of features to improve the performance of the classifier, which is to yield a high classification accuracy. Consequently, the error rate will be decreased. The following subsections provide a thorough explanation of the approaches and their techniques.
• Adapted metaheuristic methods
A suitable metaheuristic method was adapted in fifteen (15) studies to address the FS problem in TC, relying on its mechanism of finding a new subset of relevant features to tackle the massive number of features in the original text data. Bidi and Elberrichi [37] and Kumbhar et al. [20] used the Genetic Algorithm (GA) as an FS method to improve text classification. In [37], text representation methods (i.e., bag of words (BOW), N-gram, stemming, and conceptual representation) were used prior to selecting the optimal feature subsets; in contrast, no text representation methods were used in [20]. Both studies applied a classification method to the feature subsets optimally selected by the GA. In [37], Naive Bayes (NB), K-Nearest Neighbors (KNN), and Support Vector Machines (SVMs) were used as classifiers to evaluate and classify the candidate feature subsets, whereas in [20] the text classification performance increased when the optimally selected features were used in tandem with a fuzzy classifier.
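To make the wrapper idea concrete, the following is a minimal sketch of a GA searching over binary feature masks, with a classifier inside the fitness function. It is a generic illustration and not the implementation of [37] or [20]: the leave-one-out 1-nearest-neighbour evaluator, the weight `alpha`, the tournament selection, the GA parameters, and the toy dataset are all assumptions made here.

```python
import random


def loo_1nn_accuracy(rows, labels, mask):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier on the masked features."""
    selected = [j for j, bit in enumerate(mask) if bit]
    if not selected:
        return 0.0  # an empty subset cannot classify anything
    correct = 0
    for i, row in enumerate(rows):
        others = [k for k in range(len(rows)) if k != i]
        nearest = min(others, key=lambda k: sum((row[j] - rows[k][j]) ** 2 for j in selected))
        correct += labels[nearest] == labels[i]
    return correct / len(rows)


def fitness(rows, labels, mask, alpha=0.9):
    """Weighted sum of accuracy and feature reduction (alpha is an assumed weight)."""
    reduction = 1 - sum(mask) / len(mask)
    return alpha * loo_1nn_accuracy(rows, labels, mask) + (1 - alpha) * reduction


def genetic_feature_selection(rows, labels, pop_size=20, generations=30,
                              mutation_rate=0.1, seed=0):
    rng = random.Random(seed)
    n = len(rows[0])

    def score(mask):
        return fitness(rows, labels, mask)

    population = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    best = max(population, key=score)
    for _ in range(generations):
        children = [best[:]]  # elitism: the best mask always survives
        while len(children) < pop_size:
            parent_a = max(rng.sample(population, 3), key=score)  # tournament selection
            parent_b = max(rng.sample(population, 3), key=score)
            cut = rng.randint(1, n - 1)  # one-point crossover
            child = parent_a[:cut] + parent_b[cut:]
            # Bit-flip mutation: each gene flips with probability mutation_rate.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = children
        best = max(population, key=score)
    return best, score(best)


# Hypothetical toy data: only feature 0 separates class "A" from class "B".
rows = [
    [1, 0, 1, 0, 1, 0], [1, 1, 0, 0, 1, 1], [1, 0, 0, 1, 0, 1], [1, 1, 1, 1, 0, 0],
    [0, 0, 1, 0, 1, 1], [0, 1, 0, 0, 1, 0], [0, 0, 0, 1, 0, 0], [0, 1, 1, 1, 0, 1],
]
labels = ["A", "A", "A", "A", "B", "B", "B", "B"]
best_mask, best_fitness = genetic_feature_selection(rows, labels)
print(best_mask, round(best_fitness, 3))
```

Every fitness evaluation re-runs the classifier, which is the source of the computational cost attributed to wrapper methods.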
Other studies [2], [1], and [38] investigated the use of metaheuristic algorithms on subsets of text data from three benchmark datasets. In [2], Invasive Weed Optimization (IWO) was presented as an FS method and compared with Particle Swarm Optimization (PSO) and GA for optimally selecting significant feature subsets. The feature subsets selected by the different metaheuristic methods were evaluated by the NB classifier to test the accuracy of TC under different configurations. In [1], different Term Frequency (TF) methods (i.e., TF, NORMTF, LOGTF, ITF, SPARCK) were used to weight all significant features in all vectors. The Flower Pollination Algorithm (FPA) was then employed on the weighted features to further select the optimal feature subsets, which were evaluated and classified using the Ada-boost algorithm. In [38], the Crow Search Algorithm (CSA) was proposed as an FS method, and the KNN classifier was used to evaluate the selected feature subsets and perform TC. Likewise, the Artificial Bee Colony (ABC) was adapted as an FS method in [39] and [40], and several classification methods (e.g., SVM) were tested for the evaluation and classification.
In contrast to the method presented in [39], which used eight classes, the TC techniques presented in [41] and [42] were validated using ten classes from the Reuters dataset, with Ant Colony Optimization (ACO) and the Firefly Algorithm (FFA) as the FS methods, respectively. In [41], ACO combined with an Artificial Neural Network (ANN) (ACO-ANN) was able to converge quickly because of its effective search ability in the problem search space, which allows the minimal feature subset to be determined efficiently. The ANN was used to create a practical model that predicts the optimal output from a given set of new inputs, that is, it classifies the best feature subset among all feature subsets and predicts the solution. In [42], FFA and KNN were implemented using the same number of classes (ten). In [43], PSO was used in conjunction with feature weighting and a parameter-tuned KNN classifier (different K values). Others replicated the standard Binary Gray Wolf Optimizer (BGWO) and KNN classifier on newly extracted data for TC, and the model showed better accuracy using BGWO and a selected subset of features [44].
Ensemble approaches, on the other hand, merge many machine learning algorithms into a single predictive model to reduce variance (bagging), reduce bias (boosting), or enhance prediction accuracy (stacking). As a result, the accuracy can be improved by integrating the outputs of different weak learning classifiers. Accordingly, Khurana et al. [45] proposed a novel approach (BBO-bagging). They used a combination of optimization algorithms (Biogeography-Based Optimization (BBO), GA, and PSO) as an FS technique with an ensemble classifier to obtain optimal TC performance. They used ten text datasets and one real-time dataset to train and test the retrieved features on six classifiers: NB, KNN, SVM, Random Forest (RF), Decision Tree (DT), and an ensemble (Bagging).
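The bagging idea mentioned above (train base learners on bootstrap samples and combine them by majority vote) can be sketched as follows. This is a generic illustration, not the BBO-bagging method of [45]; the 1-nearest-neighbour base learner, the function names, and the toy data are assumptions introduced here.

```python
import random
from collections import Counter


def bootstrap_sample(rows, labels, rng):
    """Draw len(rows) training examples with replacement."""
    indices = [rng.randrange(len(rows)) for _ in rows]
    return [rows[i] for i in indices], [labels[i] for i in indices]


def nearest_neighbour_predict(train_rows, train_labels, query):
    """Base learner: label of the closest training example (squared Euclidean distance)."""
    distances = [sum((a - b) ** 2 for a, b in zip(row, query)) for row in train_rows]
    return train_labels[distances.index(min(distances))]


def majority_vote(predictions):
    """Most frequent label among the base learners' predictions."""
    return Counter(predictions).most_common(1)[0][0]


def bagging_predict(rows, labels, query, n_estimators=11, seed=0):
    rng = random.Random(seed)
    votes = []
    for _ in range(n_estimators):
        sample_rows, sample_labels = bootstrap_sample(rows, labels, rng)
        votes.append(nearest_neighbour_predict(sample_rows, sample_labels, query))
    return majority_vote(votes)


# Hypothetical two-feature toy data.
rows = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
labels = ["neg", "neg", "neg", "pos", "pos", "pos"]
prediction = bagging_predict(rows, labels, query=[5.5, 5.5])
print(prediction)
```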
To optimize two or more conflicting objective functions simultaneously, researchers have suggested multi-objective optimization methods (an example of conflicting objectives is a high-quality feature subset versus a reasonable running time). In [46], a multi-objective optimization method named Multi-Objective Relative Discriminative Criterion (MORDC) was proposed: the first objective employs RDC to compute the relevancy of the features to the target class, whereas the other objective computes the redundancy of a feature with the other selected features in the solution by applying the GA. The DT, Multinomial Naive Bayes (MNB), and Multilayer Perceptron (MLP) classification methods were used to assess the performance of the proposed method.
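When two objectives conflict, candidate solutions are usually compared by Pareto dominance rather than by a single score. The sketch below illustrates this for a relevance objective (to be maximized) and a redundancy objective (to be minimized); it is not the MORDC procedure of [46], and the candidate scores are hypothetical.

```python
def dominates(a, b):
    """True if a is at least as good as b on both objectives and strictly better on one.

    Each candidate is a (relevance, redundancy) pair: higher relevance is
    better, lower redundancy is better.
    """
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return no_worse and strictly_better


def pareto_front(candidates):
    """Candidates that no other candidate dominates."""
    return [c for c in candidates if not any(dominates(other, c) for other in candidates)]


# Hypothetical (relevance, redundancy) scores of four candidate feature subsets.
candidates = [(0.9, 0.6), (0.8, 0.2), (0.7, 0.3), (0.5, 0.1)]
front = pareto_front(candidates)
print(front)  # [(0.9, 0.6), (0.8, 0.2), (0.5, 0.1)]
```

The subset scored (0.7, 0.3) is dropped because (0.8, 0.2) is both more relevant and less redundant; the remaining three represent different trade-offs.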
• Modified/Improved metaheuristic methods
Advanced techniques to modify/improve the working mechanism of metaheuristic methods continue to be proposed to address the FS problem from different perspectives such as convergence, optimal solution, and algorithm efficiency. Janani and Vijayarani [47] proposed a greedy search algorithm to modify the global optimum in ABC in order to filter out the irrelevant features and select the optimal feature subsets; the method is called the Optimization Technique for Feature Selection (OTFS). For the selected features, a machine learning method was developed based on the Probabilistic Neural Network (PNN). The PNN was further modified to improve the model's hidden layer (pattern layer) by introducing an orthogonal matrix that chose the demonstrative features from the training documents. The resulting machine learning method, named the automatic text classification (MLearn-ATC) model, classifies different benchmark and real datasets based on the relevant features selected using OTFS. Thiyagarajan and Shanthi [48] proposed a crossover mechanism alongside the Artificial Fish Swarm Algorithm (AFSA) to alleviate the problem of multiplicity, which resulted in a new FS method called the Modified Artificial Fish Swarm Algorithm (MAFSA). Ada-boost, SVM, and NB classifiers were used to evaluate the feature subsets selected using AFSA and perform the classification. In [49], the standard PSO algorithm was adapted and modified by adding a weighting mechanism, yielding Self-Inertia Weight Adaptive Particle Swarm Optimization (SIW-APSO), to enhance the performance of TC. SIW-APSO converges quickly, which yielded high search competency and a better selection of features; the KNN technique was used to classify the text. Mahmoudi and Gharehchopogh [50] proposed a new mechanism to solve the problem of the Shuffled Frog-Leaping Algorithm (SFLA) becoming stuck in local optima when applied to FS: the best and worst solutions in the search space are combined to escape the local optima. The DT classifier was used to classify and evaluate the selected feature subsets.
• Hybridized metaheuristic methods (combining two metaheuristic algorithms)
Three studies proposed hybridizing metaheuristic algorithms to obtain the relevant and optimum feature subset from the original dataset for the FS problem. First, Maruthupandi and Devi [10] hybridized ABC and Bacterial Foraging Optimization (BFO), known as ABC-BFO, to select the most significant feature subset for the prediction; the ANN was used to perform a multi-label classification. Srilakshmi et al. [51] proposed a classification method based on three processing steps: VSM for feature extraction, followed by an FS process performed using a hybridization method, and then a classifier for the classification. The proposed FS method hybridizes the Grasshopper Optimization Algorithm and the Crow Search Algorithm (GCOA). The feature subset optimally selected using GCOA is evaluated, and TC is achieved, using a Deep Belief Network (DBN). In [52], the metaheuristic algorithms ACO and GA are hybridized, under the name ACOGA, for use as an FS method based on a KNN classifier. A comparison of related studies that have employed wrapper-based FS using metaheuristic methods for TC is given in Table 5.

2) Hybrid-based FS approaches
Optimization algorithms used in wrapper-based FS approaches demand high computational resources when identifying the optimal feature subset (due to their randomized mechanism). To resolve this issue, many researchers have merged intelligent optimization algorithms with traditional FS methods into a hybrid method. First, a filter method prunes the high-dimensional data and creates a subset of features using a traditional FS method (e.g., IG). Second, a wrapper method refines the selected subset using a metaheuristic method (e.g., GA). Seventeen of the thirty-seven studies proposed a hybrid approach, and their distribution is shown in Figure 6.
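The two-stage idea described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the method of any reviewed study: the filter score (difference in document frequency between two classes) stands in for IG, a simple bit-flip hill climber stands in for a metaheuristic such as GA, and the function names and toy data are hypothetical.

```python
import random

def filter_top_k(X, y, k):
    """Stage 1 (filter): rank binary term features by the absolute difference
    in document frequency between the two classes and keep the top k.
    (Assumption: this simple score stands in for IG.)"""
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]
    scores = []
    for j in range(len(X[0])):
        df_pos = sum(row[j] for row in pos) / len(pos)
        df_neg = sum(row[j] for row in neg) / len(neg)
        scores.append((abs(df_pos - df_neg), j))
    scores.sort(key=lambda s: (-s[0], s[1]))
    return sorted(j for _, j in scores[:k])

def accuracy(X, y, subset):
    """Leave-one-out accuracy of a 1-NN classifier (Hamming distance)
    restricted to the features in `subset`."""
    if not subset:
        return 0.0
    correct = 0
    for i, (xi, yi) in enumerate(zip(X, y)):
        _, nearest = min(
            (sum(xi[j] != xk[j] for j in subset), k)
            for k, xk in enumerate(X) if k != i
        )
        correct += y[nearest] == yi
    return correct / len(X)

def wrapper_search(X, y, candidates, iterations=200, seed=0):
    """Stage 2 (wrapper): stochastic bit-flip search over the pre-selected
    candidates, preferring higher accuracy and then fewer features.
    (Assumption: a hill climber stands in for a metaheuristic.)"""
    rng = random.Random(seed)
    mask = [True] * len(candidates)

    def score(m):
        subset = [c for c, keep in zip(candidates, m) if keep]
        return accuracy(X, y, subset), -len(subset)

    best = score(mask)
    for _ in range(iterations):
        trial = mask[:]
        i = rng.randrange(len(trial))
        trial[i] = not trial[i]
        s = score(trial)
        if s >= best:
            mask, best = trial, s
    return [c for c, keep in zip(candidates, mask) if keep], best[0]

# Hypothetical toy corpus: rows are documents, columns are binary term features.
X = [
    [1, 1, 0, 0, 1, 0],
    [1, 0, 0, 1, 1, 1],
    [1, 1, 0, 0, 0, 1],
    [0, 0, 1, 1, 1, 0],
    [0, 1, 1, 0, 0, 1],
    [0, 0, 1, 1, 0, 0],
]
y = [1, 1, 1, 0, 0, 0]

candidates = filter_top_k(X, y, k=3)              # filter stage shrinks the search space
selected, acc = wrapper_search(X, y, candidates)  # wrapper stage refines the subset
```

The filter stage is cheap because it scores each feature once, whereas the wrapper stage calls the classifier on every candidate subset; restricting the wrapper to the pre-selected candidates is what reduces the computational cost.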
• Adapted metaheuristic methods In [53], the traditional FS methods IG and CHI were first used to pre-select feature subsets. Then, a small world optimization algorithm (SWA) was used to refine the selected features further and produce the most effective feature subsets. Filtering out the unwanted features limited the search space of SWA, which saved problem-solving time. KNN and SVM were used for TC. In [27], a two-stage FS method was presented in which traditional FS methods (Correlation (CO), Information Gain (IG), Gain Ratio (GR), and Symmetrical Uncertainty (SU)) act as the filter and a PSO algorithm acts as the wrapper. The feature subset selected by the two stages was then evaluated, and text documents were classified, using the NB classifier. Building on the FS method presented in [27], T. Londt improved the PSO performance using a multi-objective function [54]. Likewise, Thirumoorthy and Muneeswaran [11] presented a new hybrid FS method, NDM-BJO, which uses the Normalized Difference Measure (NDM) as a filter-based method and a Binary Jaya Optimization (BJO) algorithm as a wrapper-based method to reduce the high dimensionality of the feature space. The NB and SVM classifiers were used to evaluate the nominated optimal feature subsets for the TC problem. Similar to [27] and [11], a two-stage FS method was presented in [13]. The first stage constituted a filter-based local FS method utilizing four univariate methods (i.e., Chi-Square (CHI), Deviation From Poisson Distribution (DP), Discriminative Features Selection (DFSS), and Relative Discrimination Criterion (RDC)).
Contrary to [27] and [11], this method utilized three feature set construction methods that employ globalization policies (i.e., Feature Set Construction Using Maximum Globalization Policy (MAX), Feature Set Construction By Using Weighted Averaging Globalization Policy (AVG), and Feature Set Construction By Including Equal Number of Features For Each Class (EQ)). It also used two dimension reduction methods as feature transformation methods in the second stage (i.e., Principal Component Analysis (PCA) and Latent Semantic Indexing (LSI)). Then, GA was used as a wrapper-based FS method. Finally, the learning models were built using SVM to evaluate and classify each feature subset.
Kyaw and Limsiroratana presented several models to solve the multi-dimensional FS problem on news document classification (BBC news) [55]-[59]. Their main goal was to improve news categorization performance while maintaining a fair level of complexity by reducing the number of features selected from a multi-dimensional text feature set. First, in [55], the filter approach (PCA and Best First Search (BFS)) was used to pick out text features during the feature reduction phase. Then, in the wrapper approach, the Wolf Intelligence-based Optimization of Multi-dimensional Feature Selection system (WI-OMFS) optimized the feature subset selected by PCA and BFS to obtain the optimal subset of features. Finally, the learning models were built using NB, SVM, and J48 to evaluate and classify each feature subset. In [56], a comparative study compared the performance of different filter-based methods (CFS, BFS, and IG) for pre-selecting the feature subsets that serve as input to the ACO- and ABC-based wrapper FS methods. Then, the J48 decision tree was applied to evaluate each subset selected by the proposed method and classify the text documents. Similarly, in [57], CFS was proposed as a filter-based approach to select a feature subset, and then an Evolutionary Algorithm (EA) and GA were used to optimize the feature subsets initially selected by CFS. The proposed methods were evaluated and tested using NB and SVM classifiers. In [58], CFS and PCA were implemented as filter-based methods to pre-select the feature subset, and three nature-inspired metaheuristic algorithms (Cuckoo Optimization (CO), Firefly Optimization (FFA), and Bat Optimization (BO)) were exploited as the wrapper FS approach to provide an optimal feature subset to be fed into two classifiers (J48 and SVM).
Further, the filter-based method (CFS) presented in [58] was replicated in [59] with PCA using different search policies (ACO, ABC, EA, Flower Optimization algorithm (FO), Rhinoceros Optimization algorithm (RO), and Wolf Optimization Algorithm (WO)). The feature subsets selected using the wrapper-based methods were evaluated using the J48 classifier. Finally, the researchers in [60] and [61] implemented IG as a filter-based method to select the top-ranking features. In [60], two hundred (200) feature subsets selected by IG are fed into the Gray Wolf Optimizer (GWO) to select the optimal subset of features, while in [61] the top-ranking features are fed into the Imperialist Competitive Algorithm (ICA); NB and KNN are used as the classification methods, respectively. In [63], a hybrid method using global Point-wise Mutual Information (PMI) is proposed in two phases. In the first phase, a ranking-based filter approach is implemented for FS by applying the IG method. The second phase comprises two subphases: the global PMI-based FS method selects a subset of features based on a class-dependent assumption for computing the correlation between pairs of features, and then the Gravitational Search Algorithm (BGSA) is used as a wrapper method to find the best subset of features. Finally, the 1-NN classifier is used to assess the performance of the proposed method. The majority of the hybridized FS approaches start the FS task with a filter-based method followed by a wrapper-based method. In contrast, [64] proposes a memetic FS method that hybridizes an evolutionary feature wrapper followed by a filter for the multi-label TC problem. First, a promising region in the search space is located using an evolutionary estimation of distribution algorithm (EDA) as a wrapper; then, an effective score function called Label Frequency Difference (LFD) is devised as a filter for feature selection. The same classifier was used as in [60].
Similarly, in [62], a unique hybridization mechanism is proposed in three stages: a filter-based method, a wrapper-based method, and a second filter-based method. In the first stage, Document-frequency Term-frequency (DFTF) is proposed as a filter-based method to select a significant subset of features. In the second stage, a wrapper-based method named the binary Poor and Rich optimization algorithm (HPRO) takes the DFTF output of the first stage and selects the optimal feature subset using the NB classifier. In the last stage, a second filter-based method, Term frequency reordering of document level (TRDL), is used to select the significant features.
• Modified/Improved metaheuristic methods Wang et al. [65] introduced a novel method for the TC task based on the Open Directory Project (ODP) to consider the semantic relation of features and their relevancy to the text document classes. First, redundant information is filtered out using the concept of an equivalence word set (EWS), which is developed from the rich semantic data in ODP. Second, Comprehensive Measurement Feature Selection (CMFS) is proposed to select the Optimal Feature Subset (OFS) and decrease the execution time, followed by the ABC-based FS (ABFS) to refine the selection of optimal features. Lastly, the proposed model was verified using two classifiers: Fuzzy Support Vector Machine (FSVM) and NB.
Belazzoug et al. [66] presented an improved Sine Cosine Algorithm called ISCA, which allows the exploration of more of the search space for FS. A filter-based method (IG) was used to obtain the top-ranked (informative) features, thus reducing the high dimensionality and decreasing the execution time. The selected informative features were used as input to the wrapper algorithm (ISCA). To validate the efficiency of their work, the NB algorithm was applied for the classification task.
To conclude, modified/improved metaheuristic methods used alongside the FS approach have a promising effect on TC accuracy because they tackle the weaknesses of traditional FS methods through the robustness of metaheuristic algorithms, which solve complex optimization problems. Table 6 summarizes the steps conducted for the TC problem based on hybrid approaches and the datasets used for evaluation purposes.
RQ2: Does applying metaheuristic algorithms for TC lead to better accuracy than the typical FS methods?
Answer 2 : From the related studies, it has been observed that the objective function of metaheuristic methods in TC achieves three things: 1) maximizes the accuracy, 2) minimizes the error rate, and 3) improves time efficiency in the classifiers, thus producing high performance in terms of TC accuracy.
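One common way to express such an objective in a wrapper is a weighted sum of the classifier's error rate and the fraction of features kept. The sketch below is an assumption for illustration only: the weight `alpha`, the function name, and the example figures are hypothetical and are not taken from any of the reviewed studies.

```python
def wrapper_fitness(error_rate, n_selected, n_total, alpha=0.9):
    """Fitness to be MINIMISED by the metaheuristic: a weighted sum of the
    classification error and the proportion of features retained.
    `alpha` (hypothetical default) trades accuracy against subset size."""
    if n_selected == 0:
        return float("inf")  # an empty subset cannot be classified on
    return alpha * error_rate + (1 - alpha) * (n_selected / n_total)

# Two hypothetical candidate subsets drawn from 1000 features:
small = wrapper_fitness(error_rate=0.10, n_selected=50, n_total=1000)
large = wrapper_fitness(error_rate=0.09, n_selected=600, n_total=1000)
```

With these illustrative numbers the smaller subset wins despite its slightly higher error, which shows how the weight steers the search towards compact feature sets.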
Using GA as an FS method [37], [20] positively affected text classification performance. In [37], increasing the percentage of training data increased the accuracy, and the investigated Conceptual Representation method produced the highest accuracy among the representation methods with NB, KNN, and SVM. Meanwhile, the SVM classifier had the best classification performance both with the FS method (i.e., GA) and with no FS method. The conceptual approach obtains comparable or better results than the others in many situations. However, the proposed method was not compared with traditional FS methods. In [20], TC performance increased as the number of generations (iterations) increased, and the results indicate that GA-based FS with the Fuzzy classifier gives promising results compared to the traditional Principal Component Analysis (PCA). In [2], the use of IWO as an FS method reduced the high dimensionality of the feature space. It therefore helped the classifier achieve better accuracy and a lower error rate than GA and PSO, because IWO omits the less important features, increases the calculation speed, and yields the optimum answer in a shorter time. In [1], because text data is inherently noisy, a weighting method (i.e., TF) is used to assign weight values to the significant features, and, based on the weighted features, an FPA algorithm further reduces the number of features by optimally selecting feature subsets. This procedure gave promising TC accuracy using the Ada-boost algorithm. Moreover, classification using the ITF and NORMTF models is the most accurate compared to other models (LOGTF, ITF, SPARCK) [1]. In [38], CSA as an FS method with KNN outperformed the standard KNN for TC and performed better than a previously presented traditional FS method (IG), IG-GA, and IG-PCA using KNN and C4.5 classifiers.
Moreover, the proposed mechanism also contributed to identifying the weights of features in neighboring documents, and the CSA has a good memory in that all crows (the operators in CSA) preserve the significant features.
The adaptation of ABC for FS has shown different TC accuracy depending on the number of classes in the dataset used. In both [39] and [40], the ABC method was applied to select the optimal number of features. In [40], three widely used classifiers, namely SVM, NB, and KNN, were compared with ABC as the FS method. The experimental results show that SVM with ABC outperformed the other two classifiers. However, the proposed method was not compared with traditional FS methods such as IG to validate the achieved results. On the other hand, in [39], the SVM and the Improved SVM (ISVM) algorithms were used in the classification phase. Based on the performance measures compared with various FS techniques, namely IG, CHI, and the original SVM algorithm with ABC, the proposed ABC with the improved SVM classifier offered better results. It reduced the high dimensionality of the dataset by selecting the important features.
Although the number of classes used (in the dataset) was the same in both [41] and [42], there is a significant difference in TC accuracy because the ACO-ANN method was used in [41], whereas FFA-KNN was used in [42]. Moreover, in [41], the performance of the presented mechanism was compared with another metaheuristic method (GA) and with the traditional FS methods IG and chi-square (CHI). In [42], FFA-KNN outperformed the CHI and IG methods.
In [43], the experimental results suggest that using a feature weighting strategy in conjunction with a parameter-tuned classifier increases the performance of the classification model. The weighted uncertainty operator in PSO and the tuned parameter in KNN achieved better accuracy than the standard technique. In [49], adapting the weighting scheme (Self-Inertia Weight) into PSO overcame the problem of premature convergence in standard PSO. Furthermore, the experimental results showed that the proposed Self-Inertia Weight scheme has better text classification accuracy than the typical methods (IG, CHI) and the standard metaheuristic methods (GA, PSO).
Besides, Khurana and Verma [45] showed that BBO-bagging as an FS method is superior to GA, PSO, and BBO, and that an ensemble classifier performs better than the individual classifiers used in previous studies. However, the proposed model did not produce the best performance on all datasets with all selected classifiers. Some classifiers, such as KNN or NB, work better on the high-dimensional datasets, while SVM performed better on low-dimensional datasets using BBO-bagging. Likewise, in [60], IG with GWO is better than MFO, SCA, ACO, and GA on most of the benchmark datasets, because the behavior of GWO during the search shows that the top three solutions help to explore and exploit the search space effectively by improving the average population fitness over multiple iterations. Others showed that IG combined with ICA gives better results than IG or Mutual Information (MI) alone [61]. F. Zarisfi Kermani et al. [63] introduced the global PMI-based FS method to improve the quality of the feature subset selected based on a correlation criterion. The method considers the mutual correlation between term and class using a class-dependent assumption, and uses a class-independent assumption to measure the correlation between pairs of terms. The proposed model surpassed four single-objective FS methods.
For the multi-label TC approach presented in [64], the proposed memetic search mechanism outperforms four typical filter FS methods ("maximum discrimination (MD), relevance popularity (RP), variable global feature selection (VGFSS), and normalized difference measure (NDM)") and three wrapper-based methods ("asynchronously improved particle swarm optimization (AIPSO), enhanced genetic algorithm (EGA), and EGA with class discriminating measure (EGA+CDM)"). However, although the proposed memetic FS method was promising in this case study, it might not be efficient for other text data for two reasons. First, it is biased toward solving this particular problem; second, the evolutionary wrapper and the feature filter were designed separately. Therefore, a new memetic FS method can be designed based on compact filter and wrapper methods.
The newly proposed hybrid HPRO method in [62] has shown better accuracy compared to various filter-based methods (DFS, NDM, TRDL, and DFTF) and wrapper-based methods such as GA and FFY in terms of classification accuracy using NB classifier.
Modifying ABC with the SFS method improved the FS process by ignoring the less significant features. The modified FS method was validated by comparing its performance against widely used optimization methods (ABC, FFA, ACO, and PSO). It is important to mention that TC accuracy also improved when the standard PNN algorithm was improved. Furthermore, the modified PNN outperformed the NB, KNN, and SVM classification methods [47].
In [48], the experimental results show that MAFSA as an FS method outperformed the original AFSA while selecting the most significant subset of features. Several classifiers (Ada-boost, SVM, and NB) were adopted for TC.
A few approaches have been proposed to hybridize metaheuristic algorithms for the FS problem. In [10], the hybridized ABC-BFO approach with ANN outperforms GA with the KNN classifier on the top ten classes of the Reuters news dataset. However, ANN was used to train and evaluate the proposed ABC-BFO, while ANN was not used when evaluating the performance of GA. Similarly, Srilakshmi et al. [51] proposed a hybridized GOA and CSA with a DBN classifier based on a hybrid weight bounding model. The performance of the newly developed model was tested against several classification methods (NB, KNN, SVM, Deep Convolutional Neural Network (DCNN), and Stochastic Gradient-CAViaR+ DCNN). The GCOA-DBN has better classification accuracy and time efficiency.
The multi-objective methods proposed in [46] have shown better results than the univariate filter algorithms (IG, GI, GR, FS, LS, and RDC), multivariate filter algorithms (mRMR, UFSACO, RRFS), and multi-objective algorithms (MOMI and MECY FS). The improved performance is attributed to the development of an evolutionary optimization technique that considers the relevancy and redundancy of features as objective functions.
On the other hand, a hybrid-based approach was proposed to integrate the filter method IG and CHI to the wrapper method SWA [53]. The approach produced better accuracy than traditional FS methods while using both KNN and SVM.
In detail, the features pre-selected using IG and CHI and fed into the SWA with different numbers of features (FN) outperformed approaches that used IG and CHI alone on both the KNN and SVM classifiers. Their approach thus concentrated on reducing the dimension of the feature vector, decreasing model complexity, and improving the performance of TC.
The two-stage filter and wrapper method in [27] was tested against traditional FS methods (CO, IG, GR, and SU) and achieved higher accuracy. Note that the feature subset used in the proposed two-stage method is smaller than that used in the tested traditional FS methods. In [11], a deep analysis was conducted to compare the proposed NDM-BJO method with the original NDM method and the existing filter-based methods (IG, MI, Document Frequency (DF), Distinguishing Feature Selector (DFS)) alongside GA and FFA; NDM-BJO outperformed all of the stated methods using the SVM and NB classifiers. In [13], the experimental results showed that the DFSS, CHI, or RDC feature selection techniques, rather than DP, are the settings that give the best results for two-stage feature selection methods. The AVG and EQ feature set construction approaches appear to be superior to the MAX feature set construction method in most cases. PCA feature transformation produced most of the best outcomes for two-stage feature selection approaches. The performance of GA is often inferior to that of the PCA and LSI approaches, and in many cases where the transformation-based approaches were successful, GA did not increase performance.
The models presented in [55]-[59] are compared in terms of computational time and TC accuracy. In [55], the computation time of the presented WI-OMFS-Filter FS method fluctuated according to two parameters: the number of iterations and the population size. The best computational time was achieved using WI-OMFS + J48, followed by NB, while SVM was the worst. At the same time, WI-OMFS-Filter + J48 achieved the best TC accuracy, followed by SVM, while NB was the worst as the number of iterations and the population size increased. In contrast, the study in [56] is a comparative study that investigates the performance of ACO and ABC as wrapper-based methods according to the number of selected features and for both the iteration and population parameters. In the experiment, the classification accuracy of the ACO-based FS method improved, with fewer chosen features (NF), as the population size (PS) and iteration number (IN) grew. However, as PS and IN grew, so did the computing and hardware costs, as shown in Table 7.
In [57], the experimental findings showed that both EA and GA with the CFS filter were better than the CSE wrapper at selecting the optimal feature subset, and NB was more promising for TC than SVM. Besides, EA outperformed GA for both the CFS and CSE-FS approaches. Similarly, in [58], among the three metaheuristic algorithms assessed as wrapper-based approaches (CO, BO, and FFA) and the classifiers SVM and J48, CO with J48 achieved the best accuracy for different PS values. Note that the best accuracy achieved using the proposed system (CO + J48) is better than that of the traditional methods (SFS and RS) in terms of the number of selected global feature subsets. In [59], WO with J48 has the best accuracy among the search policies (ACO, ABC, EA, FO, and RO). Besides, the traditional FS methods GS and CFS selected fewer features than the search policies.
In [65], the modified ABC (ABCFS) algorithm, which incorporates semantic knowledge from ODP, achieved better results than several typical FS techniques (IG, MI, CMFS, Multi-label FS based on max-dependency and min-redundancy (MDMR), Cumulate conditional mutual information minimization criterion FS (CCM), and Global MI-based FS algorithm (GMFS)) using two classifiers (FSVM and NB) and different percentages of selected features.
Finally, in [66], the improved ISCA statistically outperformed the standard SCA, other SCA variants (i.e., OBL (SCA), Levy (SCA), and Weighted (SCA)), the Moth-Flame Optimization algorithm (MFO), GA, and ACO. Besides, ISCA dramatically outperforms the traditional FS methods in terms of TC accuracy, time complexity, and the number of selected features. To recap, despite the improvements to the FS methods, there are certain weaknesses that can be addressed, as shown in Table 7.
RQ3: How effective are the modified, hybridized metaheuristic algorithms for text feature selection problems?
Answer3 : Janani and Vijayarani [47] mitigated the global optimum problem in the ABC method by incorporating the SFS method, which effectively improves the selection of the optimal feature subsets. SFS starts from an empty set and adds the optimal features to it sequentially by considering the global objective function. Additionally, they claimed that the proposed algorithm uses the least amount of time and memory to complete the task.
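The sequential forward selection idea mentioned above can be sketched as a plain greedy loop. This is a generic illustration, not the OTFS procedure of [47]; the `evaluate` callback, the term names, and their gains are hypothetical.

```python
def sequential_forward_selection(features, evaluate, max_features=None):
    """Greedy SFS: start from an empty set and repeatedly add the single
    feature that most improves `evaluate(subset)`; stop when no candidate
    improves the score or when `max_features` is reached."""
    selected = []
    remaining = list(features)
    best_score = float("-inf")
    limit = len(remaining) if max_features is None else max_features
    while remaining and len(selected) < limit:
        # Score every one-feature extension of the current subset.
        score, candidate = max((evaluate(selected + [f]), f) for f in remaining)
        if score <= best_score:
            break  # no extension helps any more
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = score
    return selected, best_score

# Hypothetical objective: each term contributes a fixed gain, and every
# selected term costs a small penalty (a stand-in for classifier feedback).
gain = {"goal": 0.4, "match": 0.3, "the": 0.0}

def evaluate(subset):
    return sum(gain[f] for f in subset) - 0.05 * len(subset)

chosen, score = sequential_forward_selection(gain, evaluate)
```

In this toy run the uninformative term is never added, because its penalty outweighs its zero gain once the two informative terms are already selected.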
Thiyagarajan and Shanthi [48] modified the original AFSA by adding a crossover operation into the Artificial Fish (AF) vector space, so that the blind search for food (features) by the AF is directed toward generally better directions, which improves the time taken for convergence. The results obtained showed that adding crossover to the AFSA algorithm improves the selection of an optimal feature subset and reduces the unwanted features.
M. Asif et al. [49] effectively improved the feature selection mechanism based on the proposed weighting scheme, i.e., SIW-APSO. In practice, a group of arbitrary particles is initialized to explore the search space. The initialized particles change their positions by communicating with each other, and better local and global positions are acquired.
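The role of an inertia weight in particle movement can be illustrated with a standard binary PSO in which the weight decreases linearly over the iterations. This is a textbook-style sketch under stated assumptions, not the SIW-APSO scheme of [49]: the parameter defaults, the linear schedule, and the toy fitness function are all hypothetical.

```python
import math
import random

def binary_pso(fitness, n_features, n_particles=10, iterations=30,
               w_max=0.9, w_min=0.4, c1=2.0, c2=2.0, seed=0):
    """Binary PSO maximising `fitness(mask)`, where mask is a list of booleans.
    The inertia weight w shrinks linearly from w_max to w_min (an assumption;
    SIW-APSO adapts its weight differently)."""
    rng = random.Random(seed)
    pos = [[rng.random() < 0.5 for _ in range(n_features)] for _ in range(n_particles)]
    vel = [[0.0] * n_features for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]
    for t in range(iterations):
        w = w_max - (w_max - w_min) * t / max(1, iterations - 1)
        for i in range(n_particles):
            for j in range(n_features):
                r1, r2 = rng.random(), rng.random()
                # Inertia term + pull towards personal best + pull towards global best.
                v = (w * vel[i][j]
                     + c1 * r1 * (pbest[i][j] - pos[i][j])
                     + c2 * r2 * (gbest[j] - pos[i][j]))
                vel[i][j] = max(-6.0, min(6.0, v))  # clamp to keep the sigmoid stable
                # The sigmoid of the velocity gives the probability of selecting feature j.
                pos[i][j] = rng.random() < 1.0 / (1.0 + math.exp(-vel[i][j]))
            f = fitness(pos[i])
            if f > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], f
                if f > gbest_fit:
                    gbest, gbest_fit = pos[i][:], f
    return gbest, gbest_fit

# Hypothetical fitness: number of positions that agree with a hidden "ideal" mask.
target = [True, False, True, True, False, False, True, False]

def matches(mask):
    return sum(m == t for m, t in zip(mask, target))

best_mask, best_fit = binary_pso(matches, n_features=len(target))
```

A large weight early on keeps particles moving and exploring, while a small weight later lets the attraction terms dominate so the swarm settles around good subsets. In a real wrapper, `fitness` would train and score a classifier on the selected features.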
M. Mahmoudi and F. S. Gharehchopogh [50] proposed a balancing equation that finds the optimal solution in the search space by approximating the best and worst solutions, which prevents the SFLA from becoming stuck in a local optimum. Maruthupandi and Devi [10] hybridized ABC with BFO (ABC-BFO). Srilakshmi et al. [51] suggested GCOA by merging the GOA and CSA algorithms. CSA is inspired by crows' clever behavior in seeking prey and locating it using memory. Furthermore, the method successfully balances the diversification and intensification stages, and its convergence rate is quite high with relatively little computing time. GOA addresses optimization problems based on the swarming behavior of grasshoppers. GOA is capable of finding the optimum answer to optimization problems while balancing exploitation and exploration, yet it suffers from a low convergence rate. Therefore, CSA is integrated with GOA to resolve the shortcomings of the GOA algorithm. The developed GCOA was utilized to optimize the DBN's weights for better classification.
A. Singh and A. Kumar [52] hybridized the ants from ACO and the population in GA for the selection of an optimal feature subset. The selection is made based on an evaluation measure. In particular, all ants update their pheromone, after which the best ant deposits additional pheromone on the nodes of the best path, and the best solution may be produced by either GA or ACO. Wang et al. [65] presented a modified ABC algorithm by integrating semantic information into the FS procedure. Semantic information and filter-based methods are used to maintain high TC accuracy and adequate execution time. The semantic information between two words (features) in a text document is calculated using a conditional probability distribution and ODP (knowledge-based), which together construct the Equivalent Word Set (EWS). Then, the constructed EWS was integrated into the enhanced Artificial Bee Colony Memory (ABCM), an advanced metaheuristic algorithm that imitates the artificial bee's memory to benefit from the previous successful experience of the employed bee (onlooker bee).
Finally, Belazzoug et al. [66] developed a new ISCA algorithm to overcome the SCA's search space issue. SCA sometimes becomes stuck in a sub-optimal region due to limitations in its exploration of the search space. Accordingly, ISCA uses a dynamic search that takes a random search into account when finding the solution instead of relying only on the optimal solution. Therefore, improving the performance and mechanisms of metaheuristics, such as improving the convergence speed, balancing global and local search, and considering the past experience of the algorithm (e.g., employed bees in ABC), helps the algorithm to select an optimal feature subset and assists the classifier in its classification.
RQ4: What are the gaps in the current studies and their future directions?
Answer4 : As shown in Figure 2, over the last few years, several studies indicate increasing attention among researchers on investigating TC using the optimization of the FS problem, especially during these three years (2018, 2019, and 2020). Therefore, this section discusses the gaps in the current FS methods after analyzing the information from the selected studies, and recommendations are presented for future research.
• Selective Classes: Some of the studies appear selective in the classes (class labels or categories) chosen from the benchmark datasets to assess the performance of their proposed methods for the TC problem [10], [19], [40]-[42]. Although these studies achieved relatively better accuracy using a selected number of classes, they are unable to select features from a very large search space because the wrapper approaches they use do not scale to many classes. Hence, the hybrid approach can be used to resolve this issue of wrapper approaches by relying on both filter and wrapper methods.
• Text representation: Newer text representations are important for overcoming the limitations of traditional text representation models such as BOW, as the latter relies on word frequency in its analysis and ignores the semantics of the words [37], [42]. The experiments in [37] show that the conceptual representation model is a better text representation than BOW, because conceptual representations consider the semantics of the words using a lexical knowledge-based dictionary such as WordNet. In [65], the performance of the IABCM algorithm was improved using a semantic representation from the ODP knowledge-based method. Another direction of study is to consider feature weights before applying the FS methods, as this has been shown to improve accuracy [1], [43], [47], [51].
• Lack of comparison with baseline methods: Across the two paradigms of FS methods (wrapper and hybrid), some of the proposed optimization algorithms were not compared with standard optimization algorithms or with the typical FS methods. Comparing metaheuristic methods with typical methods is important because modern metaheuristic methods are meant to resolve the problems of the typical FS methods, e.g., ignoring the interaction between features. Others examined the proposed model using different classification methods without comparing it with different FS methods. Also, traditional or advanced classifiers achieved high performance using feature subsets from optimization or typical FS methods, yet some of the proposed models were not validated against different kinds of classifiers to prove the reliability of the FS methods across multiple classifiers. In [45], it has been shown that different classifiers result in different classification accuracy.
• Wrapper-based approaches: Most of the metaheuristic methods are adapted to resolve the curse of dimensionality in the data. Yet these adapted methods still suffer from the problem of parameter tuning, which leads to high computational time [2], [20], [37], [40], [45]. Moreover, the adapted methods can become stuck in local optima of the search space, so that the features are not appropriately selected. Therefore, the modified [47], [48] and hybridized [10], [50] methods are proposed to address the problem of being stuck in local optima by balancing exploration and exploitation. Yet these methods (modified and hybridized) still suffer from high time complexity. The hybrid-based approaches are therefore recommended to resolve the problem of computation time by ranking and selecting the significant features using the filter-based method. This helps to reduce the dimensionality of the data and limits the search space of the optimization algorithm in the wrapper-based method, which addresses the problem of being trapped in local optima. This also enhances the execution speed.
• Hybrid filter-wrapper approaches: The filter-based methods can be univariate (e.g., IG) or multivariate (e.g., CFS). Either kind is used as an FS method that relies on different factors and characteristics (e.g., probability distribution). Applying one filter-based method (univariate or multivariate) to a specific dataset will produce a different subset of features than other filter-based methods applied to the same dataset. Therefore, the classification method applied to each of the obtained feature subsets leads to a different performance. To cope with this issue, the filter-based approach can be modified by hybridizing it with an ensemble of filter approaches to achieve a better selection of the feature subset. The adapted optimization algorithms in hybrid-based approaches focus on resolving the high computation cost of the wrapper-based method, and the least effort is given to improving the mechanism of the optimization algorithms [11], [26], [52], [55]-[59]. The exceptions are [65], which developed a unique optimization method (IABCM) that keeps track of the optimal solution using memory, and [66], which introduced a dynamic search in ISCA to avoid being trapped in local optima. To handle these difficulties, more work can be done to improve or hybridize the metaheuristic-based methods to obtain better performance by overcoming premature convergence and local search issues (exploitation and exploration) and so attain optimal or near-optimal solutions. Meanwhile, improving the statistical FS methods to remove redundant features is another research direction.
• Single and multi-objective optimization problem: The majority of the metaheuristic algorithms applied to the TC problem rely on a single objective function. Using a single quality function forces the evolving population toward one particular feature set and does not reconcile two or more conflicting objectives. Therefore, a new direction of research has considered multi-objective functions, such as MORDC [46] and multi-objective PSO [54]. Future research may consider multi-objective functions that combine minimizing redundancy, maximizing relevance, minimizing the number of features, maximizing classifier accuracy, and improving time efficiency.
• Metaheuristic algorithms: Most of the metaheuristic optimization algorithms in the reviewed studies originate from swarm intelligence-based algorithms (SI) [1], [10], [27], [38]-[44], [47]-[51], [54]-[56], [58]-[60], [65] and evolutionary-based algorithms (EA) [2], [13], [20], [37], [45], [46], [57], [64], while some other studies are based on physics-based algorithms [63], [66] and human behavior-related algorithms [11], [53], [61], [62]. SI and EA are nature-inspired algorithms that share the behaviour of population-based (multiple-solution) stochastic optimization techniques. These algorithms begin the optimization process by generating a population of solutions, which is then updated over a number of generations (i.e., iterations). SI seeks to design intelligent multi-agent systems inspired by the collective actions of social insects, such as ants, termites, wasps, and bees, and of other animal communities, such as flocks of birds or schools of fish, that compete for food. The agents cooperate through an indirect communication medium and act in the decision space. EA, in contrast, is grounded on the survival of the fittest candidate in a particular environment: it starts with a set of solutions that strive to thrive in a setting defined by a fitness evaluation. The parent population passes its adaptive traits on to its offspring through evolutionary operators such as genetic crossover and mutation. This process continues for many generations until the most acceptable solutions for the environment are obtained. Accordingly, maintaining multiple solutions helps to avoid local optima, as the candidate solutions work cooperatively and explore the search space more widely than physics-based and human-based algorithms do. For the TC problem, new algorithms can be adapted, or hybridization between the existing algorithms (e.g., SI) and physics-based or human-based algorithms can be considered as future directions.
• Unfair assessment: The majority of the reviewed studies used different evaluation metrics to assess the performance of their models [2], [41], [48], [53], [56], amongst others. However, evaluation metrics need to be standardized, or at least applied consistently across the proposed models, to provide a fair assessment.
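
The hybrid filter-wrapper recommendation in the items above can be illustrated with a minimal sketch. It is not taken from any of the reviewed studies: the toy data, the class-frequency-difference filter score (a stand-in for a filter such as IG), the nearest-centroid classifier, and the bit-flip hill climber (a stand-in for a metaheuristic such as PSO or GA) are all assumptions chosen to keep the example self-contained.

```python
import random

def make_toy(n_docs=40, n_feat=12, seed=1):
    """Hypothetical binary term-presence data; feature 0 tracks the class exactly."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_docs):
        label = i % 2
        row = [int(rng.random() < 0.5) for _ in range(n_feat)]  # noise terms
        row[0] = label                                           # perfect term
        row[1] = label if rng.random() < 0.8 else 1 - label      # noisy term
        X.append(row)
        y.append(label)
    return X, y

def filter_rank(X, y, k):
    """Filter stage: keep the k features whose frequency differs most between classes."""
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]
    def score(j):
        return abs(sum(r[j] for r in pos) / len(pos) - sum(r[j] for r in neg) / len(neg))
    return sorted(range(len(X[0])), key=score, reverse=True)[:k]

def accuracy(X, y, subset):
    """Leave-one-out accuracy of a nearest-centroid classifier on the chosen features."""
    if not subset:
        return 0.0
    correct = 0
    for i in range(len(X)):
        dist = {}
        for c in (0, 1):
            rows = [X[t] for t in range(len(X)) if t != i and y[t] == c]
            centroid = [sum(r[j] for r in rows) / len(rows) for j in subset]
            dist[c] = sum((X[i][j] - m) ** 2 for j, m in zip(subset, centroid))
        correct += min(dist, key=dist.get) == y[i]
    return correct / len(X)

def fitness(X, y, subset, n_candidates, alpha=0.9):
    """Weighted sum of accuracy and feature reduction (alpha is an assumed weight)."""
    return alpha * accuracy(X, y, subset) + (1 - alpha) * (1 - len(subset) / n_candidates)

def wrapper_search(X, y, candidates, iters=100, seed=0):
    """Wrapper stage: bit-flip hill climbing restricted to the filtered candidates."""
    rng = random.Random(seed)
    def fit(mask):
        subset = [f for f, keep in zip(candidates, mask) if keep]
        return fitness(X, y, subset, len(candidates))
    best = [True] * len(candidates)      # start from the full filter output
    best_fit = fit(best)
    for _ in range(iters):
        trial = best[:]
        i = rng.randrange(len(trial))
        trial[i] = not trial[i]          # flip one feature in or out
        trial_fit = fit(trial)
        if trial_fit >= best_fit:        # accept moves that are no worse
            best, best_fit = trial, trial_fit
    return [f for f, keep in zip(candidates, best) if keep], best_fit

if __name__ == "__main__":
    X, y = make_toy()
    candidates = filter_rank(X, y, k=4)  # 12 features reduced to 4 before the search
    print(wrapper_search(X, y, candidates))
```

The point of the sketch is the ordering: the filter stage shrinks the search space from 12 features to 4 before any classifier is trained, so the wrapper stage evaluates far fewer candidate subsets than it would on the full feature set.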

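The contrast between single- and multi-objective selection noted in the list above can also be sketched briefly. The candidate values below are made-up illustrative pairs of (classification error, number of features), not results from any reviewed study, and the weight `w` and the normalization by the largest feature count are assumptions.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and better on at least one.
    All objectives are minimized."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    """Keep every candidate that no other candidate dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

def weighted_sum_choice(points, w=0.9):
    """Single-objective view: collapse both objectives into one score and pick one subset."""
    max_feats = max(n for _, n in points)
    return min(points, key=lambda p: w * p[0] + (1 - w) * p[1] / max_feats)

# Hypothetical candidate feature subsets as (error rate, number of features)
candidates = [(0.10, 50), (0.08, 120), (0.10, 80), (0.15, 20), (0.08, 150)]

if __name__ == "__main__":
    print(pareto_front(candidates))         # the set of trade-off solutions
    print(weighted_sum_choice(candidates))  # one solution fixed by the weight w
```

A single weighted objective commits to one trade-off in advance, whereas the Pareto view returns every non-dominated subset and leaves the choice between accuracy and compactness open.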
IV. CONCLUSION AND LIMITATIONS
Text classification (TC) is widely used to partition data into known (labeled) categories in applications such as email filtering, e-news filtering, topic identification, document routing, and so forth. TC approaches play a key role in tagging text documents into categories/classes based on their content. Feature selection (FS) using metaheuristic algorithms can improve classification accuracy while reducing computation and storage demands, and thus it has been applied increasingly in TC. This article presents a comprehensive analysis of the current metaheuristic-based FS methods for TC problems. A total of 37 studies published from 2016 to 2021 were selected. This systematic literature review (SLR) contributes to TC by shedding light on five years of the existing body of knowledge, to comprehensively understand the metaheuristic-based FS methods and their pros and cons in reducing the computational resources needed to perform TC. Besides, this SLR reports the statistical analysis that was carried out to select the relevant studies critically. The selected studies were further analyzed and categorized into two paradigms (wrapper and hybrid approaches). The wrapper approaches comprise three kinds of techniques (adapted, modified, and hybridized), each with its own characteristics. Wrapper-based approaches require high computational resources and can yield low classification performance because of the huge FS search space. Thus, the hybrid-based approaches have emerged in two techniques (adapted and modified) to cope with time inefficiency and the huge search space. Critical research questions were formulated to justify the usage of each approach, its effect on TC accuracy, the comparison with typical and existing models, and finally, the gaps and future directions of each model.
Beyond what has been investigated in ongoing research, the hybrid-based approaches deserve more attention, as they are not well explored. To this end, several key points for future models are presented as future work. However, this SLR focuses on the metaheuristic-based FS methods for TC in English-language text, and there is also a need for an SLR that explores TC models for other languages, such as Arabic, Chinese, and Spanish.