Hybrid Feature Selection Based on Principal Component Analysis and Grey Wolf Optimizer Algorithm for Arabic News Article Classification

The rapid growth of electronic documents has resulted from the expansion and development of internet technologies. Text document classification is a key task in natural language processing that converts unstructured data into a structured form from which knowledge can be extracted. This conversion generates high dimensional data that needs further analysis using data mining techniques such as feature extraction, feature selection, and classification to derive meaningful insights. Feature selection is a technique for reducing dimensionality: it prunes the feature space and, as a result, lowers the computational cost and enhances classification accuracy. This work presents a hybrid filter-wrapper method that uses Principal Component Analysis (PCA) as a filter approach to select an appropriate and informative subset of features and the Grey Wolf Optimizer (GWO) as a wrapper approach (PCA-GWO) to select further informative features. Logistic Regression (LR) is used as an evaluator to test the classification accuracy of candidate feature subsets produced by GWO. Three Arabic datasets, namely Alkhaleej, Akhbarona, and Arabiya, are used to assess the efficiency of the proposed method. The experimental results confirm that the proposed PCA-GWO method outperforms the baseline classifiers with/without feature selection, as well as other feature selection approaches, in terms of classification accuracy.


I. INTRODUCTION
The global spread and rapid development of internet technologies has led to a massive amount of natural language text documents that are accessible through different repositories such as the World Wide Web, digital libraries, and electronic publications. However, these documents are presented in a scattered manner; organizing them for any form of user interaction is therefore an impractical and very time-consuming process. Text classification is the process of allocating documents from a large-scale corpus or repository into predefined labels or categories [1], [2]. Text classification has a substantial influence on different applications such as web page classification [3], sentiment analysis [4], [5], bioinformatics [6], [7], [8], author identification [9], dialect detection [10], spam e-mail filtering [11], SMS spam filtering [12], and topic detection [13].
In the literature, most of the research conducted on text classification has targeted English and Chinese text document corpora, with minimal effort put into Arabic language research. The Arabic language holds great importance, since it is the fourth most used language on the internet and the sixth official language worldwide according to the United Nations [14]. The reasons that may account for the limited research work on the Arabic language are the lack of high-quality, large Arabic corpora prepared for complex classification tasks, the rich morphology and complex orthography of the language, and the fact that the available datasets cannot be freely downloaded [15].
Recently, a huge effort has been exerted in constructing new corpora by collecting news from popular sources [16], namely SANAD (Single-label Arabic News Articles Dataset) and NADIA (multi-label News Articles Dataset in Arabic). Similar to English corpora, Arabic corpora still need document processing tasks. In practice, each text document is converted into a term-frequency vector, where each word's frequency is considered a feature in the vector space [17], [18]. Such a representation generates a high dimensional feature space that negatively affects text classification due to the existence of irrelevant and redundant features and the increased computation time. Therefore, a process of reducing the number of features is needed to improve the efficiency of text classification tasks and to optimize CPU time and memory consumption [19], [20].
Conventionally, feature selection techniques are divided into two categories: filter-based and wrapper-based approaches. The filter approach evaluates the features within a short amount of time because it performs its calculations based on intrinsic characteristics of the training data, without using any machine learning algorithms. Examples of filter approaches are Chi-square [30], Kullback-Leibler [31], ReliefF [32], Minimum Redundancy Maximum Relevance (MRMR) [33], Robust MRMR (rMRMR) [31], and Principal Component Analysis [34]. Wrapper-based approaches treat feature selection as an optimization search problem [35], employing search techniques to produce candidate feature subsets. Each candidate subset is then assessed by reducing the dataset to the selected features and applying a machine learning method to the reduced dataset to obtain the classification accuracy. Wrapper approaches yield higher classification accuracy than filter approaches, but they suffer from expensive processing costs. A hybrid approach integrates filter and wrapper approaches and gains the benefits of both. Hybrid approaches have been shown to be more practical and efficient for high dimensional data such as text data [36], [37], image data [38], [39], EEG data [40], and microarray data [41].
As aforementioned, in text classification the massive presence of irrelevant features is intractable because the number of candidate feature subsets grows exponentially with the number of features. In the wrapper feature selection approach, many researchers adopt metaheuristic methods to find a near-optimal feature subset that yields efficient and accurate automatic text classification. Examples of feature selection techniques that use a filter, a wrapper, or a combination of both are: document frequency and term frequency with the hybrid binary poor and rich optimization algorithm (DFTF-HBPRO) [36], Chi-square [42], the Firefly algorithm [43], binary particle swarm optimization with KNN (BPSO-KNN) [19], and information gain and principal component analysis with a genetic algorithm (IG-PCAGA) [34]. However, most of these methods still suffer from local optima stagnation. Therefore, a powerful search method for finding the most informative features/terms is required to provide a more robust and accurate automatic text classification.
The grey wolf optimizer (GWO) is a widespread swarm-based optimization method inspired by the life cycle of grey wolves and their behavior in searching for prey (i.e., their hunting strategy). The optimization process of GWO consists of three main phases. First, cooperative searching for the prey zone is performed, which reflects an exploratory search mode. Encircling the prey zone and then attacking the prey are the second and third phases, respectively; this process is interpreted as an exploitative search mode. GWO is widely used due to its simple adaptation to many types of optimization problems, its ease of use for novice optimizers, its nearly parameter-free nature, and its high flexibility. GWO has gained popularity and attracted the attention of researchers as a robust and effective solution for diverse optimization problems in different fields, with applications in engineering [44], [45], [46], machine learning [47], image processing [48], scheduling [49], [50], electroencephalography [51], networking [52], and security [53]. Due to the excellent results and interesting merits of GWO, this research is motivated to use GWO as a feature selection technique for Arabic text classification.
In this paper, a new hybrid filter-wrapper feature selection method for Arabic text classification is proposed. The proposed method adopts PCA as the filter-based approach and GWO as the search method for feature subset generation in the wrapper approach. In the classification process, Decision Tree (DT) [54], Random Forest (RF) [55], Support Vector Machine with the popular Radial Basis Function kernel (SVM-RBF), which is frequently used for nonlinear mapping in SVM [56], Logistic Regression (LR) [57], and AdaBoost (AB) [58] classifiers are applied to three news datasets: Alkhaleej, Akhbarona, and Arabiya [16]. The performance of the machine learning methods is compared with and without PCA feature selection. The best classifier is assigned as the evaluator for candidate feature subsets generated by GWO. Results show that the GWO-LR method yields the best classification accuracy when compared against the machine learning baselines and machine learning with the PCA feature selection technique. The main contributions of this work are summarized as follows: • A new hybrid filter-wrapper feature selection method based on PCA as a filter and GWO as a wrapper.
• GWO is converted to a binary version using the sigmoid function.
• The performance of the proposed method is tested on a real-world Arabic text data collected from popular Arabic news portals.
• GWO performs better than or similarly to other optimization-based feature selection algorithms on all experimented datasets. The remainder of the paper is organized as follows: Sect. II presents the related work, and Sect. III describes GWO's research background. The proposed method, which illustrates how GWO is adapted for feature selection, and the datasets used in this research are provided in Sect. IV. The experiment settings and results are presented in Sect. V. The paper is concluded, and suggestions for future work are given, in Sect. VI.

II. RELATED WORK
Text classification is not a new problem; it has been studied extensively in the natural language processing literature. Most research works are applied to English text documents. Despite the importance of Arabic, there has been little research into applying and enhancing existing natural language algorithms for Arabic text classification. In [59], two popular classifiers, support vector machine (SVM) and the C5.0 decision tree, were used to classify Arabic text documents and were evaluated on seven Arabic datasets. The results demonstrated that C5.0 managed to surpass SVM by achieving 78.42% average classification accuracy. In another study [49], eleven machine learning algorithms, including Logistic Regression (LR), Multinomial NB (MNB), DT, SVM, XGBoost (XGB), Multilayer Perceptron (MLP), KNN, Nearest Centroid (NC), AB, and an Ensemble/Voting Classifier (VC), were utilized to classify Arabic text data. In that study, two large datasets were extracted from different Arab newspapers, where the articles cover diverse domains (including Sports, Technology, Business, and the Middle East). The results demonstrated that SVM and XGBoost yielded the highest classification accuracy on the first and second datasets, respectively. Three classifiers, namely distance-based, KNN, and Naive Bayes, were investigated for classifying Arabic text in [60]. The classifiers were evaluated on in-house Arabic corpora, and the results showed that Naive Bayes outperformed the other classifiers. Harrag et al. [61] investigated multiple classifiers, including decision trees, Naive Bayes, and maximum entropy, on data extracted from the Arabian scientific encyclopedia. The results demonstrated that decision trees yielded the highest classification accuracy, at 93%.
The early work of [62] performed Arabic text classification, where a Document Frequency threshold (DF) was used in the preprocessing stage, and KNN and SVM were used in the classification stage. The experiments, conducted on five Arabic newspaper text collections (Al-Dostor, Al-Ahram, Al-Nahar, Al-Jazeera, and Al-Hayat), concluded that KNN outperformed SVM by achieving precision results higher by 0.95%. In [63], a comparative study was conducted on three classifiers, SVM, decision tree (C4.5), and Naive Bayes (NB), for Arabic text classification. The Arabic text documents were extracted from different sources such as Islamic topics, poems, etc. The results demonstrated that the highest classification accuracy was obtained by SVM, followed by C4.5 and NB. An efficient feature selection method based on information gain and document frequency for Arabic text classification was introduced in [64]. In that study, Rocchio was employed as the classifier, and the text data used in the experiments was extracted from Egyptian newspapers, including El-Gomhoria, El-Akhbar, and El-Ahram. The proposed method was evaluated using three measurements: recall, precision, and classification accuracy. The results revealed that Rocchio produced better classification accuracy than KNN. In [65], the authors suggested an Arabic text classification system using Ant Colony Optimization (ACO) as the feature selection technique and SVM to perform classification. The performance of the proposed method was evaluated based on macro-averaged F1, precision, and macro-averaged recall. The ACO feature selection technique exhibited better performance when compared against six state-of-the-art feature selection methods. Zahran and Kanaan [66] proposed an intelligent feature selection method for Arabic text classification using Particle Swarm Optimization (PSO) and a Radial Basis Function neural network.
The performance of the proposed method was evaluated on Arabic corpora extracted from Arabic newspaper websites (including Al-Dostor, Al-Ahram, Al-Jazeera, and Al-Hayat) using three measurements: precision, recall, and F-score. The results demonstrated the efficiency of the proposed method when compared against Chi-square, TF-IDF, and document frequency algorithms. The authors in [19] implemented an intelligent method where PSO was used in the feature selection stage and KNN in the classification stage. The classification results showed the applicability of PSO-KNN for Arabic text classification. The authors of [67] employed Polynomial Neural Networks to classify Arabic text data after performing feature selection using Chi-square; the proposed method achieved a precision of 0.94. In [68], a comparative analysis was carried out on two feature selection techniques (Chi-square and Information Gain) combined with a number of classifiers, including KNN, multinomial Naive Bayes, Naive Bayes, and decision tree. The results revealed that combining Chi-square and Information Gain with the classifiers provided good results except for KNN. In [69], the authors proposed an efficient hybrid feature selection approach based on several filters, including F-measure of training text features (FM), Odds Ratio (OR), Class Discriminating Measure (CDM), GSS, IG, and TF-IDF, and an enhanced Genetic Algorithm (EGA). In EGA, the crossover operator was applied to the chromosome (feature subset) derived from term and document frequencies, while the mutation operator considered two factors: feature importance and the classification performance of the original parents. In that study, NB was used for classification on three datasets collected from well-known Arabic news websites (Akhbar Al-Khaleej, Al-waten, and Al-Jazeerah).
The results showed that EGA outperformed GA as well as six well-known filters (i.e., OR, CDM, GSS, IG, TF-IDF, and FM). In [70], the authors combined Chi-square and Artificial Bee Colony (CHI-ABC) as a feature selection technique to classify Arabic text data, with NB used to perform classification. The proposed method was evaluated on the BBC Arabic dataset, and the results demonstrated that CHI-ABC outperformed CHI and ABC run individually. A hybrid filter-wrapper feature selection method for Arabic text classification was proposed in [71]. The method embedded IG as the filtering approach and then passed the top-ranked features to a wrapper approach guided by a modified version of the Sine Cosine Algorithm. The method was evaluated on three new Hadith datasets. The results showed that it provided a good compromise between classification accuracy and the number of reduced features.
In [72], the authors proposed neural networks (NN) for Arabic text categorization utilizing Self-Organizing Maps (SOM) and Learning Vector Quantization (LVQ). Satisfactory results were produced on small-sized datasets. Similarly, the authors in [73] confirmed the superiority of NN over SVM after reducing the feature space. Recently, deep-learning based approaches have been used for Arabic text classification and have yielded outstanding results. In [15], the authors implemented nine deep learning models for Arabic text classification and used word2vec embedding models to boost the classification performance. The results showed that all deep learning models yielded very promising classification accuracy; moreover, the use of word embeddings enhanced the overall classification performance.

III. RESEARCH BACKGROUND
This section presents the widely used optimization algorithm adopted in this work, namely the Grey Wolf Optimizer.

A. GREY WOLF OPTIMIZER (GWO)
The GWO is a well-known swarm-based metaheuristic algorithm that simulates the life cycle and hunting mechanism of grey wolves in nature. It was introduced by Mirjalili [74], who also derived its mathematical formulation.
The grey wolf pack divides into four hierarchical levels: alpha (α), beta (β), delta (δ), and omega (ω) wolves, which are dispersed on the basis of their levels of domination, with α being at the highest level and ω at the lowest level of the pack, as presented in Figure 1.
In the pack, the α wolf is the wisest. It is highly efficient in managing the pack as well as in making decisions regarding its control and the appropriate hunting style; it is also excellent at selecting a habitat. The α wolf is succeeded by the β wolves in the hierarchy of domination. Normally, β wolves follow the α wolf wherever it goes, supporting it in management and pack control.
The third domination level of the hierarchy is made up of the δ wolves. Wolves at this level are responsible for providing help and support, while also guarding the weak and old members of the territory. The remaining wolves in the pack are the ω wolves. This stratification is based on the lifestyle of the wolves in the pack, their roles in pack management, their hunting strategies, and their overall day-to-day activities. The main advantage of this hierarchy is that it assists in the leadership of the pack during prey hunting. As soon as prey is found, the α wolf orders the pack members to encircle it while leading the β and δ wolves in the attack.

B. GREY WOLF OPTIMIZER ALGORITHM
The inspiration of GWO is based on two main elements: the hierarchy of grey wolves and their domination levels. Individual wolves serve as candidate solutions to the optimization problem at hand. Each of the first three hierarchy levels (i.e., α, β, and δ) contains a single solution: the δ level contains a good solution, the β level holds a better one, and the α level has the best solution. The remaining solutions are held in the ω level. The members of the ω level are obliged to help the members of the α, β, and δ levels to encircle and hunt the prey by means of the following formulation:
D_p = |b · X_p(e) − X(e)| (1)
X(e + 1) = X_p(e) − a · D_p (2)
a = 2c · d_1 − c (3)
b = 2 · d_2 (4)
c = 2 − 2e/I (5)

where X_p(e) represents the location of the prey at the e-th iteration, X(e) and X(e + 1) denote the location of the wolf at the e-th and (e + 1)-th iterations, respectively, d_1 and d_2 stand for two arbitrary values in [0, 1], a and b are two coefficient vectors, and I represents the total number of iterations. The vectors a and b mainly aim to balance exploitation and exploration and to escape from local optima. Through randomly altering the value of b, GWO is capable of staying away from stagnation in local optima, and it exploits or explores a given search space when |a| < 1 or |a| > 1, respectively. The solutions in the ω level are updated after every iteration based on the solutions in the α, β, and δ levels through the following formulas:

D_α = |b_1 · X_α − X|, D_β = |b_2 · X_β − X|, D_δ = |b_3 · X_δ − X| (6)-(8)
X_1 = X_α − a_1 · D_α, X_2 = X_β − a_2 · D_β, X_3 = X_δ − a_3 · D_δ (9)-(11)
X(e + 1) = (X_1 + X_2 + X_3) / 3 (12)
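The update scheme above can be sketched in a few lines of Python. This is a minimal illustrative implementation of the standard continuous GWO on a toy sphere-minimization problem, not the authors' code; the function name, population size, and bounds are our own assumptions.

```python
import numpy as np

def gwo_minimize(f, dim, n_wolves=10, iters=50, lb=-5.0, ub=5.0, seed=0):
    """Minimal continuous GWO: each wolf moves toward the mean of the
    alpha-, beta-, and delta-guided positions (Eqs. 1-12)."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(lb, ub, (n_wolves, dim))          # initial pack
    for e in range(iters):
        fit = np.array([f(x) for x in X])
        alpha, beta, delta = X[np.argsort(fit)[:3]]   # social hierarchy
        c = 2.0 - 2.0 * e / iters                     # decreases linearly from 2 to 0
        for j in range(n_wolves):
            cand = []
            for leader in (alpha, beta, delta):
                a = 2.0 * c * rng.random(dim) - c     # Eq. 3
                b = 2.0 * rng.random(dim)             # Eq. 4
                D = np.abs(b * leader - X[j])         # Eqs. 6-8
                cand.append(leader - a * D)           # Eqs. 9-11
            X[j] = np.clip(np.mean(cand, axis=0), lb, ub)  # Eq. 12
    fit = np.array([f(x) for x in X])
    return X[np.argmin(fit)], fit.min()

best, val = gwo_minimize(lambda x: np.sum(x * x), dim=5)
```

Note how randomizing b at every step perturbs the distance term, which is what lets GWO escape local optima, while the shrinking coefficient c shifts the pack from exploration to exploitation.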

IV. THE PROPOSED TEXT CLASSIFICATION APPROACH
In this section, an intelligent hybrid feature selection approach is proposed for the classification of Arabic texts.
In the proposed method, PCA is used as a filtering approach; its main task is to search over the term/feature space of all features extracted from the raw Arabic text datasets and find the best subset of relevant and informative features. To further seek informative features and better classification accuracy, GWO is coupled with the Logistic Regression (LR) classifier: GWO optimizes the feature subset produced by PCA and assesses each candidate feature subset by running LR on the selected features and measuring the classification accuracy. The proposed system consists of several stages, including preprocessing, feature selection, and classification, as depicted in Figure 2. The stages are elaborated in the subsequent sections.

A. PREPROCESSING
The raw Arabic textual data needs to be converted into an appropriate format that automatic text analysis can process. Since there are various ways of representing text in the Arabic language, the documents were fed into the preprocessing stage according to the following steps: • Use UTF-8 encoding.
• Drop stop words that appear in the raw text like prepositions and pronouns.
• Words that appear fewer than five times are ignored.
• The Vector Space Model is adopted in this stage to formulate the Arabic text data into a proper representation, and TFIDF (term frequency-inverse document frequency) is employed for weighting the terms. TFIDF is a popular scheme used for term weighting in the field of text classification and has been proven to be a practical statistical approach for assigning weights to terms [75]. The TF scheme stands for the weight of the feature/term F_i in the feature space, computed by counting the number of times F_i is found in text document d_j [76]. Document frequency is calculated at the level of the corpus, where the feature F_i is assigned a weight based on the number of text documents in the corpus that contain F_i at least once. The inverse document frequency associated with the feature F_i can be computed as [76]:

IDF_i = log(D / DF_i)

where D stands for the number of text documents and DF_i for the number of documents containing F_i. The weight of the term F_i in the document d_j by means of TFIDF is estimated as follows:

TFIDF(F_i, d_j) = TF_ij × log(D / DF_i)
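The TFIDF weighting described above can be sketched directly from the equations. This is an illustrative toy example with English stand-in tokens (a real pipeline would operate on the UTF-8 Arabic terms after stop-word and low-frequency filtering); the function and corpus are hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Plain TFIDF per the equations above: w(F_i, d_j) = TF_ij * log(D / DF_i)."""
    D = len(docs)
    tokenized = [doc.split() for doc in docs]
    df = Counter()                       # DF_i: documents containing each term
    for toks in tokenized:
        for term in set(toks):
            df[term] += 1
    weights = []
    for toks in tokenized:
        tf = Counter(toks)               # TF_ij: raw count of term in document
        weights.append({t: tf[t] * math.log(D / df[t]) for t in tf})
    return weights

docs = ["oil prices fall", "oil prices rise", "football match won"]
w = tfidf(docs)
```

A term that occurs in every document gets weight 0 (log(D/D) = 0), which is exactly the behavior that makes TFIDF discount uninformative, corpus-wide terms.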

B. FEATURE SELECTION
In this stage, the feature selection process is carried out on the Arabic news datasets to reduce the dimension of the data and to find a set of informative features that may represent the Arabic datasets better than the full feature set. In this research, a filter feature selection approach, PCA, is initially applied to the dataset to produce a strong and relevant subset of features; thereafter, a pruning process is applied to this subset to seek further informative and discriminative features using a wrapper approach guided by the GWO algorithm. PCA and the proposed PCA-GWO method are thoroughly discussed in the subsequent sections.

1) PRINCIPAL COMPONENT ANALYSIS (PCA)
Arabic text data is highly dimensional, which deteriorates the classification performance of machine learning algorithms. Therefore, in this work, PCA is employed to produce a subset of the most relevant features. To mathematically formulate dimension reduction, the number of features is denoted as N and the features as a vector x. The features in the raw data are represented as x = (x_1, x_2, . . . , x_N), and the output from applying PCA to the raw data is y = (y_1, y_2, . . . , y_D), where D is less than N. PCA is a linear transformation model that changes several possibly correlated variables into a smaller number of uncorrelated variables referred to as principal components (PCs) [77]. The PCs are formed from linear combinations of the original variables, measured by the degree to which they explain the variance along a given orthogonal dimension. The principal components are ordered by the variability they capture: larger variance is found in the lower-order PCs, whereas the higher-order PCs have lower variance. The feature selection module therefore eliminates the higher-order PCs while keeping the lower-order PCs. The authors of [77] suggest that PCA determines the correlations and dependencies among the extracted features (i.e., x_1, x_2, . . . , x_N) through the construction of a covariance matrix U of dimension N × N, in which N denotes the number of extracted features:

U = (1 / (n − 1)) Σ_{i=1}^{n} (x_i − x̄)(x_i − x̄)^T

where n is the number of samples and x̄ is the mean feature vector. Eigenvectors (v_1, v_2, . . . , v_N) and the corresponding eigenvalues (λ_1, λ_2, . . . , λ_N) are derived from the covariance matrix U to identify the PCs. The eigenvalues are arranged in descending order, and the K eigenvectors that correspond to the K largest eigenvalues are selected to construct a reduced matrix U_K, in which K represents the number of dimensions of the new feature subspace (K ≤ N).
A projection matrix W is created from the chosen K eigenvectors, and the transpose of the reduced matrix is multiplied by the original set of extracted features X, so that the newly formulated PCs replace the original data axes. This is expressed as:

Y = W^T X = U_K^T X
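The covariance-eigendecomposition-projection pipeline above can be sketched as follows. This is an illustrative implementation on hypothetical random data, not the paper's code; a production pipeline would typically use a library PCA.

```python
import numpy as np

def pca_reduce(X, K):
    """PCA as described: covariance matrix -> eigendecomposition -> keep the K
    eigenvectors with the largest eigenvalues -> project onto them."""
    Xc = X - X.mean(axis=0)                    # center the features
    U = np.cov(Xc, rowvar=False)               # N x N covariance matrix
    eigvals, eigvecs = np.linalg.eigh(U)       # eigh: U is symmetric
    order = np.argsort(eigvals)[::-1]          # sort eigenvalues descending
    W = eigvecs[:, order[:K]]                  # N x K projection matrix U_K
    return Xc @ W                              # K-dimensional representation Y

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))                 # 100 samples, 10 features
Y = pca_reduce(X, K=3)
```

By construction the columns of Y are uncorrelated, and their variances are the top K eigenvalues in decreasing order.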

2) PCA-GWO IMPLEMENTATION PROCESS
In this section, the hybrid feature selection method PCA-GWO is thoroughly illustrated. The proposed method utilizes the PCA filter approach as a pruning process on the raw features of the Arabic textual data. The output of the PCA filtering process is a feature subset, which is further optimized by GWO to produce a discriminative and informative set of features. The PCA-GWO process is divided into four basic steps, which are explained in detail below. The PCA-GWO flow chart is shown in Figure 2, and the pseudocode is given in Algorithm 1.

Step1: Initialization.
This step involves the initialization of the number of iterations. Each wolf represents a standalone solution for the feature selection problem, in which every solution serves as a binary vector of size D as expressed in Equation (17). This implies that the solution's decision variables accept either 0 or 1, and this vector is termed a position in GWO. The GWO search process is launched by generating n wolves as random binary vectors:

x_i = (x_i^1, x_i^2, . . . , x_i^D), x_i^j ∈ {0, 1} (17)

where x_i^j refers to the j-th decision variable of solution (wolf) x_i.

Algorithm 1 The Proposed PCA-GWO
1: Apply PCA to the raw feature space and keep the top-ranked components
2: Initialize n wolves as random binary vectors of size D (Eq. 17)
3: e = 1
4: while (e < Max_e) do
5:   for each solution (j) do
6:     Step 2: Evaluation. Train LR on the selected features and compute the fitness (Eq. 18)
7:   end for
8:   Update X_α, X_β, and X_δ (Algorithm 2)
9:   Step 3: Update the GWO population
10:  for each solution (j) do
11:    Update a_1 and b_1 (Eqs. 3, 4); calculate X_1 (Eqs. 6, 9)
12:    Update a_2 and b_2 (Eqs. 3, 4); calculate X_2 (Eqs. 7, 10)
13:    Update a_3 and b_3 (Eqs. 3, 4); calculate X_3 (Eqs. 8, 11)
14:    X(e + 1) = mean of X_1, X_2, X_3 (Eq. 12); convert to binary (Eqs. 19, 20)
15:  end for
16:  Step 4: Check the stop criterion
17:  if the maximum number of iterations is not met then
18:    e = e + 1
19:  end if
20: end while
21: Return X_α

Step2: Evaluation.
In this step, wolves are assessed based on their position vectors, where each position in the vector is either 1 or 0. The positions that have the value 1 indicate the features/terms that form the new reduced data. The reduced data is then divided into training and testing sets, where the LR classifier is learned from the training data and assessed on the testing data. The LR model is evaluated using the classification accuracy metric. The objective function utilized to evaluate the classification performance of each grey wolf position vector is formulated below:

Acc = (TP + TN) / (TP + TN + FP + FN) (18)

where Acc denotes the objective function (accuracy rate), and TP, TN, FP, and FN represent the true positives, true negatives, false positives, and false negatives, respectively. The solutions with the top three fitness values are X_α, X_β, and X_δ, respectively. This solution hierarchy is inspired by the social hierarchy of wolves, as explained in Algorithm 2.

Algorithm 2 Social Hierarchy Component of the Proposed GWO
while (e < Max_e) do
  for each solution (j) do
    Calculate the fitness of the solution
    X_α = the fittest solution
    X_β = the second-best solution
    X_δ = the third-best solution
  end for
  e = e + 1
end while
Return X_α, X_β, X_δ
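The evaluation of a single wolf (Eq. 18) can be sketched as follows. This is an illustrative fitness function, not the authors' implementation: it uses scikit-learn's LogisticRegression on a hypothetical synthetic dataset with a single hold-out split, whereas the paper evaluates on the Arabic news datasets.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def fitness(position, X, y, seed=0):
    """Evaluate one wolf: keep the columns whose bit is 1, train LR,
    and return the test accuracy (Eq. 18). An all-zero wolf scores 0."""
    mask = position.astype(bool)
    if not mask.any():
        return 0.0
    X_tr, X_te, y_tr, y_te = train_test_split(
        X[:, mask], y, test_size=0.3, random_state=seed)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(int)        # only 2 of 20 features matter
good = np.zeros(20, dtype=int)
good[:2] = 1                                   # wolf selecting the informative pair
acc = fitness(good, X, y)
```

The accuracy returned here is exactly the value the pack hierarchy is sorted by: the three wolves with the highest fitness become X_α, X_β, and X_δ.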

Step3: Update GWO population
Here, GWO applies its main operators: seeking prey (exploration), and encircling, hunting, and attacking prey (exploitation). These are applied while navigating the search space of the feature selection problem and updating the GWO population. Equations 1 to 12, explained earlier in Section III-A, are used to update the solutions in the GWO population. The mechanism operates by assessing the distance between each solution in the population and the social hierarchy-based solutions (i.e., X_α, X_β, and X_δ). The current wolf/solution gets its position or decision variables updated based on X_α, X_β, and X_δ. This results in the generation of a new solution X_1 based on X_α using Eqs. 6, 3, 4, and 9. The steps are repeated to derive two further solutions, X_2 and X_3, in which X_2 is obtained on the basis of X_β by applying Eqs. 7, 3, 4, and 10, and X_3 is generated on the basis of X_δ by applying Eqs. 8, 3, 4, and 11. At last, the solutions X_1, X_2, and X_3 are aggregated by taking their mean to obtain a new solution X(e + 1). However, it is necessary to note that the positions of X(e + 1) have continuous values; they are converted to a binary vector through the use of Eqs. 19 and 20:

S(x) = 1 / (1 + e^(−x)) (19)

x_binary = 1 if S(x) > U(0, 1), and 0 otherwise (20)
where U(0, 1) is a uniform random number between 0 and 1. The new solution X(e + 1) generated at each iteration is assessed using the fitness function, which relies on classification accuracy.
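The binary machinery, i.e. the random binary initialization (Eq. 17) and the sigmoid-based conversion of a continuous position back to bits (Eqs. 19 and 20), can be sketched as follows; dimensions and names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(42)

def binarize(x_continuous, rng):
    """Eqs. 19-20: squash each continuous position through the sigmoid,
    then set the bit to 1 when S(x) exceeds a uniform random threshold."""
    s = 1.0 / (1.0 + np.exp(-x_continuous))                  # Eq. 19: S(x) in (0, 1)
    return (s > rng.random(x_continuous.shape)).astype(int)  # Eq. 20

# Initial pack: n random binary position vectors of dimension D (Eq. 17)
n, D = 5, 8
pack = rng.integers(0, 2, size=(n, D))

# After a continuous GWO update (mean of X_1, X_2, X_3), positions are
# real-valued again, so they must be re-binarized:
x_new = rng.normal(size=D)
b = binarize(x_new, rng)
```

The stochastic threshold in Eq. 20 (rather than a fixed cutoff at 0.5) keeps the search probabilistic: a strongly positive coordinate is very likely, but not certain, to select its feature.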

Step4: Check the stop criterion
Step 3 is an iterative process that refines the search around the best solution. This iterative process is controlled by the stopping condition (which commonly sets the maximum number of iterations). Once the stopping condition is met, the best solution, carrying the most distinctive features/terms for the Arabic text classification problem, is returned.

C. CLASSIFICATION
The most well-known machine learning algorithms that are widely applied in pattern recognition (in particular, the text classification field), including DT, LR, SVM, AdaBoost, and RF, are discussed and implemented in this research.

1) DECISION TREE (DT)
The decision tree is a machine learning methodology widely recognized for automating the induction of classification trees from training data [78]. A typical decision tree training algorithm comprises two phases. The first is the tree-growing phase, which involves building the tree through greedy splitting of the respective tree nodes. The second phase involves the removal of overfitted tree branches, as the branches of the tree are capable of overfitting the training data [79]. C4.5 is a univariate decision tree algorithm: only one of the attributes of the instances at a given node can be adopted for decision making. Details of C4.5 can be found in Fuhr and Buckley [80].

2) SUPPORT VECTOR MACHINE (SVM)
The SVM approach involves the linear kernel, which is known to perform significantly well in text categorization due to the linearly separable nature of the task [81]. The important merits of this classifier include high generalization ability, success in resolving the problem of overfitting, and global optimization capabilities [82]. Moreover, this classifier performs satisfactorily in large-scale feature spaces and has the ability to manage datasets of any distribution [82]. It is, however, not well suited to massive datasets, as it needs feature scaling to operate adequately, and the tasks of training and tuning the classifier tend to be exhaustive and memory intensive [82].

3) LOGISTIC REGRESSION (LR)
Logistic regression is a well-known classifier with a simple implementation and high reliability [57], [83]. It is an effective candidate for polarity (binary) classification tasks. It relies on the sigmoid function to produce probabilities for the predicted labels. Maximum likelihood estimation, typically carried out via gradient descent, maximizes the likelihood of correctly classifying an arbitrary set of input features. Multi-class problems can be predicted by reformulating them as a set of polarity classifications (one-versus-the-rest); alternatively, a multi-class loss function (i.e., cross-entropy) can be minimized directly.
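The two points above, the sigmoid mapping a linear score to a probability and the one-versus-the-rest reduction for multi-class data, can be sketched as follows (toy data, not the paper's setup):

```python
# Sigmoid converts a linear score into a probability; one-vs-rest wraps a
# binary logistic regression for multi-class prediction.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))  # a score of 0 maps to probability 0.5

X, y = load_iris(return_X_y=True)  # 3-class problem
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(round(ovr.score(X, y), 3))
```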

4) ADA BOOST (AB)
AdaBoost is a machine learning meta-algorithm introduced by Yoav Freund and Robert Schapire [58]. It is useful for enhancing performance as well as handling unbalanced categories alongside similar algorithms. At each step, the classification gives extra weight to samples that were wrongly classified in previous steps. It does not tolerate useless or noisy data; however, its operation is simpler than that of other classifiers. The performance of AdaBoost improves over numerous iterations: in each round, a weak classifier is added and sample weights are adjusted according to sample importance. The weights of wrongly classified samples increase over the rounds, while those of correctly classified samples decrease, so that each new weak classifier focuses on the examples that are hardest to learn.
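The round-by-round boosting effect can be seen in a small sketch (illustrative synthetic data): a single decision stump is weak on its own, while accumulating fifty reweighted stumps typically fits the training data much better.

```python
# AdaBoost adds weak learners (depth-1 trees) round by round; accuracy on the
# training data typically improves as boosting rounds accumulate.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
weak = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                          n_estimators=1, random_state=0).fit(X, y)
boosted = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                             n_estimators=50, random_state=0).fit(X, y)
print(round(weak.score(X, y), 3), round(boosted.score(X, y), 3))
```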

5) RANDOM FOREST
Random forest is a supervised machine learning algorithm and one of the strongest for classification and regression problems. A random forest is a set of decision trees, each grown on randomly selected training data by the random forest classifier. The final class of a test object is chosen by combining the votes of the individual decision trees. Because the model combines several decision trees, it is more accurate, yields more reliable results, and is less sensitive to noise. Its major demerits include model complexity, prolonged training time, and slow prediction, which makes it less effective for real-time predictions due to the high number of trees [55].
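The voting effect can be sketched on synthetic data (illustrative only): an ensemble of bootstrapped trees usually generalizes better than any single tree.

```python
# A random forest aggregates votes from many decision trees, which typically
# beats a single tree on cross-validated accuracy.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=20, n_informative=6,
                           random_state=0)
tree_acc = cross_val_score(DecisionTreeClassifier(random_state=0),
                           X, y, cv=5).mean()
forest_acc = cross_val_score(RandomForestClassifier(n_estimators=100,
                                                    random_state=0),
                             X, y, cv=5).mean()
print(round(tree_acc, 3), round(forest_acc, 3))
```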

D. ARABIC DATASETS
Three different datasets constructed by [16] are used in this study. These datasets were collected by means of web scraping (Python Selenium, Requests, and BeautifulSoup, or PowerShell) from three well-known news websites (alarabiya.net, alkhaleej.ae, and akhbarona.com), and are grouped into one corpus referred to as SANAD. All datasets cover the categories [Tech, Sports, Religion, Politics, Medical, Finance, and Culture], except alarabiya.net, which lacks the Culture and Religion categories. Dialect is not an issue because the datasets are collected from news sites, where all articles are written in Modern Standard Arabic. Moreover, samples of the selected features extracted from the Arabic text data are presented in Table 1.

1) ALARABIYA.NET
For this dataset, the articles in the primary domain and sub-domains (i.e., Aswaq and Ahadath) were carefully examined. The articles were then categorized into seven classes, two of which (Iran News and Culture) possess inadequate data compared to the others. We merged the ''Iran News'' and ''Politics'' categories to develop an effective dataset; consequently, once the ''Culture'' category was dropped, the number of categories was reduced to five. The compiled articles are up to date to early 2018. These five categories are described in Figure 3.

2) ALKHALEEJ.AE
About 1.2M articles (4GB) were collected by examining the articles of this website over ten years (2008-2018). The categorization on this website was found to be somewhat incomplete and ambiguous in several respects. Therefore, a portion of the articles had to be categorized manually into the seven categories mentioned earlier, giving a total of over 46000 articles. Some articles also had to be assigned to categories they did not originally belong to, which makes this dataset less reliable than the other two (i.e., alarabiya.net and akhbarona.com). Figure 3 illustrates the seven categories of this dataset, which are balanced in distribution. The manual categorization of the Khaleej dataset involves mapping the tags obtained from the website onto one of the seven categories. For instance, articles with the tags 'Technology', 'Digital Life', 'Computer Internet', and 'GITEX' are classified under a generic category termed 'Technology'.

3) AKHBARONA.COM
All articles in the required categories were collected. One of those categories (Religion) had only 50% of the articles possessed by the other categories, so the remaining 50% was sourced from a newspaper website of relevant interest (Alanba.com). The seven categories of this dataset are distributed and plotted in Figure 3.

A. EXPERIMENTAL SETUP
In this study, all algorithms were implemented using Python and RapidMiner software. Two metrics were used to assess the performance of the algorithms: classification accuracy and the number of selected features. Moreover, to validate the performance and effectiveness of these algorithms, three benchmark Arabic datasets from [16] were employed. The three datasets were compiled from Arabic news portals, namely alarabiya.net, alkhaleej.ae, and akhbarona.com. The numbers of articles in the Alarabiya, Alkhaleej, and Akhbarona datasets are 1207, 1408, and 1404, respectively. The Alkhaleej and Akhbarona datasets have seven categories, while the Alarabiya dataset has six; each category has about 200 articles. In the following experiments, each dataset is divided into training and test sets with a 90:10 ratio. In the Alarabiya dataset, the numbers of training and testing samples are 1086 and 121, respectively; in the Alkhaleej dataset, 1267 and 141; and in the Akhbarona dataset, 1263 and 141. Three experiments were carried out in the current study. In the first experiment, five different classifiers were implemented and compared on the three datasets in order to pick the best classifier for the following two experiments. In the second experiment, to confirm the selection of the best classifier, PCA feature selection was combined with each classifier and the resulting accuracies were compared. In the third experiment, the results of the proposed hybrid filter-wrapper feature selection approach (PCA-GWO) were compared against several popular optimization algorithms adapted as feature selection approaches, LR alone, and LR with PCA feature selection.
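The 90:10 partition described above can be reproduced with a standard split utility. The arrays below are stand-ins for the Alarabiya-sized dataset (1207 articles, 6 labels), used only to show that a 10% test fraction yields the reported 1086/121 partition.

```python
# 90:10 train/test split on a dataset of 1207 samples (Alarabiya size).
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1207).reshape(-1, 1)  # stand-in for 1207 article feature vectors
y = np.arange(1207) % 6             # stand-in for 6 category labels
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.10, random_state=0)
print(len(X_tr), len(X_te))         # 1086 training and 121 test samples
```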

B. EXPERIMENTAL RESULTS OF CLASSIFIERS ONLY
The first experiment was conducted without feature selection, applying DT, RF, SVM-RBF, LR, or AB on the full feature set. Each of these classifiers was applied to the full feature set without any reduction method, i.e., without PCA filter feature selection or a wrapper method based on an optimization algorithm. Table 2 shows that the best classification accuracy results on all three datasets were achieved using the LR classifier, which confirms its superiority over the other classifiers used. However, the results suggest that further improvement is possible by applying feature selection methods. Thus, in the next two experiments, our task is to apply PCA and wrapper feature selection to select the most relevant features and enhance the classification performance.

C. EXPERIMENTAL RESULTS OF CLASSIFIERS WITH PCA
In this experiment, PCA feature selection was applied with all classifiers (i.e., DT, RF, SVM-RBF, LR, and AB). PCA is a feature subset algorithm: it produces multiple feature subsets and chooses the subset that best represents the entire dataset. The results of integrating PCA with all classifiers are presented in Table 3 and Figure 4. On the Khaleej dataset, the classification accuracy achieved using PCA (with 1010 features) with each classifier is better than using the classifier alone without feature selection, except for the AB classifier. It should be noted that the PCA feature selection results were achieved with less than 10% of the total number of original features in the Khaleej dataset. On the Akhbarona dataset, PCA with less than 12% of the total number of original features achieved higher classification accuracy when integrated with DT and SVM-RBF, and PCA-LR and LR have similar classification accuracy. On the other hand, the remaining baseline classifiers (i.e., RF and AB) achieve better classification accuracy without PCA feature selection. On the Arabiya dataset, PCA reduced the dimensionality of the feature space to less than 13% of its original size, and in this reduced space the classification accuracy of PCA with every classifier is higher than that of the corresponding baseline classifier without feature selection. These results confirm the significance of applying PCA as a feature selection technique with most of the classifiers on all datasets. However, to obtain a more accurate automatic Arabic text classification system, PCA is hybridized with a wrapper approach guided by GWO to seek a further refined and robust feature subset.
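The PCA-then-classify pattern evaluated in this experiment can be sketched as below. This is an illustrative pipeline on synthetic data, not the paper's datasets; it keeps a number of components equal to 10% of the feature count, mirroring the reduction ratios reported above.

```python
# Reduce dimensionality with PCA (10% of the features as components), then
# classify with LR, evaluated by cross-validation.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=300, n_features=100, n_informative=8,
                           random_state=0)
pca_lr = make_pipeline(PCA(n_components=10),           # 10% of 100 features
                       LogisticRegression(max_iter=1000))
acc = cross_val_score(pca_lr, X, y, cv=5).mean()
print(round(acc, 3))
```

Strictly speaking, PCA projects the data onto new component axes rather than picking a subset of the original terms; the paper treats the retained components as the reduced feature set fed to the classifier.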

D. EXPERIMENTAL RESULTS OF PCA-GWO AND OTHER APPROACHES
In this experiment, the effectiveness of the proposed method is validated by comparing it with several well-known optimization algorithms, including the Bat-inspired Algorithm (BAT) [84], Firefly Algorithm (FFA) [85], Particle Swarm Optimization (PSO) [86], White Shark Optimizer (WSO) [87], Marine Predators Algorithm (MPA) [88], and Slime Mould Algorithm (SMA) [25]. The BAT, PSO, FFA, WSO, MPA, and SMA algorithms are used as wrapper feature selection approaches, with LR performing the classification. Table 4 summarizes the results of the proposed method and the other optimization-based feature selection approaches in terms of average classification accuracy along with standard deviation, expressed in the form (average ± standard deviation); the best results are highlighted in bold. It can be observed that PCA-GWO managed to yield the best classification results for two out of three datasets (i.e., Khaleej and Akhbarona), while for the Arabiya dataset the best result was achieved by PCA-PSO. With respect to the average number of selected features, PCA-SMA identifies the lowest number of features on all datasets; however, it does not achieve the highest classification accuracy.
Furthermore, the best classification accuracy results and the corresponding numbers of selected features are reported in Table 5. It can be inferred that PCA-GWO yields the best results for the Khaleej and Akhbarona datasets, while for the Arabiya dataset the best result was achieved by PCA-MPA. With respect to the number of selected features, PCA-SMA achieved the best result, successfully identifying the fewest features on all datasets. However, the classification accuracy obtained by PCA-GWO is higher than that of PCA-SMA on all datasets.
To further validate the results yielded by PCA-GWO and the other optimization algorithms, the Wilcoxon signed-rank statistical test [89] is used in this study to determine whether there is a statistically significant difference between these algorithms. In Table 6, the Z-value stands for the standardized test statistic, and the P-value stands for the statistical significance. A P-value < 0.05 implies that there is a statistically significant difference between the compared algorithms; otherwise, there is no statistically significant difference. From Table 6, it can be inferred that PCA-GWO obtained statistically significant results on most of the datasets when compared with the other algorithms.
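A minimal example of this test is given below. The two paired accuracy samples are hypothetical (they are not the paper's per-run results) and only show how a P-value below 0.05 signals a significant difference between two algorithms.

```python
# Wilcoxon signed-rank test on paired per-run accuracies of two algorithms.
from scipy.stats import wilcoxon

acc_a = [0.950, 0.962, 0.971, 0.955, 0.968, 0.973, 0.949, 0.960, 0.966, 0.958]
acc_b = [0.935, 0.941, 0.944, 0.922, 0.930, 0.929, 0.900, 0.905, 0.906, 0.893]
stat, p = wilcoxon(acc_a, acc_b)  # paired, non-parametric test
print(stat, p)                    # p < 0.05 -> significant difference
```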
Additionally, the execution time of the proposed PCA-GWO method is compared with LR (without feature selection) and PCA-LR, as shown in Figure 5. The results demonstrate that PCA-GWO has the lowest computational time. The proposed PCA-GWO method thus managed to increase the classification accuracy while reducing the computation time. In summary, PCA-GWO provides superior and competitive results compared to the other feature selection approaches. This fruitful result is owed to the robust search operators in GWO, namely searching for prey (exploration), encircling prey, hunting, and attacking prey (exploitation), which allow the feature space of Arabic textual data to be searched effectively.

E. EXPERIMENTAL RESULTS OF PCA-GWO AND OTHER DEEP LEARNING MODELS
As for deep learning (DL) models, we compare the proposed method with nine DL models that were proven to produce top results [15]. Table 7 shows the accuracy results. On the Khaleej dataset, GWO-LR produced an accuracy score (96.86%) comparable to that of the CGRU DL model (96.86%); the performance of our proposed method is therefore at least comparable, if not better. However, the top-performing DL models differ across the datasets. Table 8 provides further analysis comparing the DL models and GWO-LR. It is clear that our proposed method is favoured over the DL models with respect to the size of the dataset used (less than 1% of the number of samples in the original dataset), the number of features (10% of the original feature set), and accuracy.

VI. CONCLUSION AND FURTHER DIRECTIONS
In this paper, we present a hybrid filter-wrapper feature selection method for categorizing Arabic documents that combines PCA (a filter approach) and GWO (a wrapper approach). PCA is used to determine a robust feature subset that represents the Arabic textual data better than the full feature set. GWO then searches within the PCA feature subset to select further informative features. The LR classifier is used to perform classification for each feature subset produced by GWO. Three Arabic datasets, Alkhaleej, Akhbarona, and Arabiya, are used to test the performance of the proposed PCA-GWO. The results obtained by PCA-GWO are superior to those produced by the baseline classifiers with and without the PCA feature selection method. We also compared GWO with other optimization-based feature selection algorithms, namely BAT, FFA, PSO, WSO, MPA, and SMA.
PCA-GWO thus confirmed its superiority as a feature selection method for the Arabic text classification task. However, like most metaheuristic algorithms, GWO suffers from premature convergence and falling into local optima. As future work, the wrapper approach can be further empowered via different strategies, such as i) hybridizing GWO with local search approaches; or ii) modifying its optimization framework by adding extra efficient and robust search operators, to provide more accurate results for the Arabic text classification task.