Modified Genetic Algorithm for Feature Selection and Hyper Parameter Optimization: Case of XGBoost in Spam Prediction

Recently, spam on online social networks has attracted attention in the research and business world. Twitter has become the preferred medium to spread spam content. Many research efforts have attempted to counter social network spam. Twitter brings extra challenges, represented by the size of the feature space and imbalanced data distributions. Usually, the related research works focus on only part of these main challenges or produce black-box models. In this paper, we propose a modified genetic algorithm for simultaneous dimensionality reduction and hyper parameter optimization over imbalanced datasets. The algorithm initializes an eXtreme Gradient Boosting classifier and reduces the feature space of a tweets dataset to generate a spam prediction model. The model is validated using a 50 times repeated 10-fold stratified cross-validation, and analyzed using nonparametric statistical tests. The resulting prediction model attains on average 82.32\% and 92.67\% in terms of geometric mean and accuracy respectively, utilizing less than 10\% of the total feature space. The empirical results show that the modified genetic algorithm outperforms the $\chi^2$ and PCA feature selection methods. In addition, eXtreme Gradient Boosting outperforms many machine learning algorithms, including a BERT-based deep learning model, in spam prediction. Furthermore, the proposed approach is applied to SMS spam modeling and compared to related works.


I. INTRODUCTION
SPAM remains one of the long-lasting security threats. E-mail spam represented a true challenge for mail service providers in the early stages of the Internet. Web spam exploits social engineering to lure a privileged user into logging in to a deceptive service. As Internet users developed awareness and became more competent at distinguishing fake web content from legitimate content, attackers began to exploit the pervasiveness of social networks and the corresponding media to launch the latest generation of spam, namely social spam. In addition to the opportunity to target a larger number of victims, social networks create an environment of ever-evolving avenues for spammers. Social spam goes beyond traditional activities that compromise individuals, such as monetary fraud, towards large-scale campaigns. A recently released Twitter dataset distinguishes more than five types of Twitter spam, including, but not limited to, profanity, insults, hate speech, malicious links, and fraudulent reviews [1]. Similarly, recent research efforts considered similar spamming approaches against other online social networks and the short message service (SMS) [2], [3]. It is not surprising that Twitter reviews its spam policy periodically [4].
As social spam campaigns emerged as a contemporary challenge for users, companies, and even governments, countermeasures evolved in an arms-race fashion. Earlier solutions were limited to rule-based and regular expression matching. However, as spammers gained experience in evading such detectors, information security practitioners turned to content-based characteristics only. Contemporary mature solutions utilize both content-based and account-based characteristics. In most cases, the ultimate goal is to find the shortest list of characteristics or features that indicate spamming behavior [5], [6]. Some studies go a step further to identify the spammers themselves [7], [8]. Machine learning techniques are leveraged in many ways to develop detection models. Earlier models utilized straightforward classification and categorization algorithms such as Support Vector Machine (SVM), Naïve Bayes (NB), K-Nearest Neighbor (K-NN), and Decision Trees (DT) [3], [5], [6]. More advanced solutions explore the improvements that result from utilizing deep learning (DL) techniques [2], [9].
Deep learning based solutions have been shown to outperform prediction models based on conventional machine learning. For social spam, such behavior of deep learning models is justifiable, as they perform well in identifying local patterns [9]. However, this performance comes at the expense of model complexity. Additionally, the Artificial Neural Network (ANN) models of deep learning are hard to interpret. Scalability and interpretability remain two conflicting desired characteristics of any social spam detector. In order to tackle this issue, we propose a novel dimension reduction solution. As parameter tuning is an unavoidable task regardless of the nature of the underlying prediction model, the proposed solution leverages a genetic algorithm to tune the parameters of the prediction model and select the most descriptive features simultaneously. The generated prediction models remain interpretable through the final set of retained features. Further, the proposed architecture allows developers to choose among a wide range of granularity depending on their targets and underlying computation capabilities.
A wide range of optimization techniques has been proposed in the literature. Alatas and Bingol categorized intelligent optimization techniques according to their scientific basis [10]. Further, they compared their performance to that of light-based intelligent optimizers [11]. The genetic algorithm (GA), a biology-based optimizer, is the most popular type of evolutionary algorithm (EA) for parameter optimization. It demonstrates strong performance on a wide range of problems. Genetic algorithms retain the merits of both metaheuristic search algorithms and stochastic optimization techniques [12]. This combination enables genetic algorithms to reach a global optimum within relatively fewer generations compared to other evolutionary algorithms.
One of the major issues in spam text research is the limited availability of high-quality labeled text datasets [13], [14]. For example, well-known benchmark datasets are few, and many researchers use tools to collect domain-specific datasets. As a result, many of the available text datasets have a limited number of attributes, unverified class labels, a single specific language, imbalanced class distributions, or biased data. Furthermore, social media currently facilitate sharing multimedia content (e.g., audio, video, text, images, etc.), but incorporating such a mix of content in model building seems to be one of the future challenges. Table 1 shows a summary of the related research works in tweet modeling and points to some research gaps.
In order to evaluate the proposed solution, a real-world Twitter dataset is utilized and a large number of experiments were conducted. Most experiments ended with satisfactory performance due to the incorporation of the feature selection process. Some experiments provided results that outperform baseline solutions, even deep learning solutions. The experiments reveal the appropriateness of the proposed approach for the social spam detection problem, providing a trade-off between prediction performance and computation capabilities. Furthermore, the proposed approach is applicable to a wide range of data mining problems. Below are the key contributions of this research:
1) Proposing a content-based social spam detection approach that considers a wide variety of contemporary spamming methods.
2) Developing a novel genetic algorithm to initialize a powerful classifier and perform feature selection.
3) Validating the proposed approach against a publicly released real-world Twitter dataset.
The rest of this paper is organized as follows. Section II investigates the literature and related works. The proposed social spam detector and the corresponding genetic algorithm feature extraction approach are elaborated in Section III. Results are presented and discussed in Sections IV and V, respectively. Finally, conclusions and future directions are drawn in Section VI.

II. RELATED WORKS
Recently, there has been significant interest in detecting Twitter spam. Compared to traditional mail spam and web spam, Twitter spam goes beyond phishing, fraud, and scams. It creates new avenues for profanity, insults, spreading hate speech, and bullying [1]- [3], [13], [24]. Researchers have investigated a wide range of approaches to accommodate this diversity. Two streams of countermeasures have been proposed. The first approach considers feature extraction. The second approach considers graph-based solutions. Feature-based solutions investigate content-based features, account-based features, or both. Graph-based solutions investigate the communication graphs of spam spreading, focusing on the identification of spammers.
In recent times, feature selection has been one of the key research topics in machine learning, image retrieval, text mining, intrusion detection, etc. According to the literature, different algorithms have been developed and employed for feature selection. For example, greedy search based sequential forward selection (SFS) [25] and sequential backward selection (SBS) [26] have been applied for feature selection. However, these approaches suffer from a range of problems, such as getting stuck in local optima and high computational cost. In order to address these problems, new algorithms for feature selection have been proposed [27]-[29], such as Particle Swarm Optimization (PSO) [30], Ant Colony Optimization (ACO) [31], and Genetic Algorithm (GA) [32]. Furthermore, a novel filter feature selection method named the Proportional Rough Feature Selector (PRFS) has been proposed in [33]. The method addresses a high-dimensional matrix in a short text classification problem. It makes a regional distinction using a set of terms in order to differentiate documents that exactly belong to a class from documents that possibly belong to a class. In [34], the authors presented a comparative study of eight filter methods that employ mutual information, using 33 datasets. Furthermore, in [35], 12 feature selection methods are compared on a text classification problem.
The genetic algorithm is known to be a very efficient and useful approach for feature selection, as described in [36]- [40]. This is because of its ability to change the functional configuration in order to improve the performance results. In [41], the authors applied a Genetic Algorithm in order to reduce the number of features extracted from the Flavia image dataset. The authors of [42], [43] proposed a hybrid Genetic Algorithm for feature selection based on machine learning techniques. They investigated the performance of their algorithm using different datasets, such as the Wine dataset and synthetic datasets. In [15], an approach for enhancing the classification performance of natural crisis-related Twitter messages has been proposed. In this approach, a Genetic Algorithm is utilized for feature selection.
Another study has been proposed in [44]. The study employs a Genetic Algorithm for feature selection in order to increase the classification accuracy for breast cancer diagnosis.
Different feature selection approaches have been applied to many real-world applications, such as text categorization [45], image retrieval [46], and intrusion detection [47]. Several feature selection approaches have also been applied to tweet classification. For example, the work of [16] presented a method for sentiment analysis of airline tweets. It employs a mutual information method for feature selection. Furthermore, the work of [15] implemented an improved Genetic Algorithm for disaster preparedness and response in the Philippines. The algorithm aims to select the most important features from a large number of features for the classification of disaster-related tweets. In [17], the authors considered Chi-Squared, Mutual Information, the Kolmogorov-Smirnov statistic, the area under the Precision-Recall curve, and the area under the Receiver Operating Characteristic curve for feature selection on a large high-dimensional dataset of collected tweets. Each tweet is labeled with a positive or a negative sentiment. The results demonstrated that employing these feature selection techniques in a sentiment classification process can have a great impact on the performance of a classifier. The work of [18] applied classification techniques to tweets related to renewable energy. Correlation-based Feature Selection (CFS) Subset Evaluation and Information Gain feature selection were used to reduce the number of features.
The literature shows that the number of selected features used for tweet classification greatly affects the performance of the employed classifier. However, only a few works have discussed how many features should be selected, and how to select them, to achieve the best classification performance [13], [19]. The approach of [17] has shown that using between 75 and 200 features enhances the tweet classification results over using the full feature set. In [20], the authors investigated the use of between 42 and 34,855 features to represent 1000 instances from the Stanford Twitter Corpus. They found that using more than 500 features does not significantly improve the performance of a classifier. The work of [21] studied the effect of applying two-stage feature selection on Twitter sentiment analysis performance. A filter feature selection based on information gain was used, and 3 feature sets of 500, 1000, and 1500 features were produced.

III. THE PROPOSED APPROACH
The computer-based Genetic Algorithm (GA) [48] is a search heuristic inspired by Darwin's theory of natural evolution. For decades, GA has been actively used by researchers to address many challenges in different domains such as malware detection [49], energy optimization [50], cancer classification [51], and so on.
Recently, GA has been used as a search strategy for dimensionality reduction of a relatively large feature space [52]. Such an approach evades the limitations of exhaustive search strategies. GA can be used to optimize the parameters of machine learning algorithms and reduce the dimensionality of the problem space. One approach in text modeling is to convert the input text into a set of features, such as TF-iDF modeling. Usually, the number of features is extremely large, and so the probability of overfitting is high. On the other hand, real-world classification datasets usually have an imbalanced distribution of class labels. Imbalanced datasets impose an additional challenge in avoiding classification bias and overfitting.

A. EXTREME GRADIENT BOOSTING
Chen and Guestrin [53] introduced a powerful tree boosting algorithm named eXtreme Gradient Boosting (XGBoost). The algorithm is claimed to be scalable and sparsity-aware, to take data compression and sharding into consideration, and to use cache-aware access. Figure 1 illustrates the XGBoost algorithm architecture. Each tree is trained on the residual error of the previous trees, which improves the performance of the constructed ensemble. The sum of all trees' predictions constitutes the final prediction.
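The residual-fitting idea described above can be illustrated with a minimal sketch. It is plain gradient boosting with squared error and one-split regression stumps on a single feature, not XGBoost itself (there is no regularization, sparsity handling, or second-order information). The names `fit_stump`, `boost`, `predict`, and the `learning_rate` default are illustrative assumptions.

```python
from statistics import mean

def fit_stump(xs, residuals):
    """Fit a one-split regression stump: (threshold, left value, right value)."""
    best = None
    for t in sorted(set(xs))[:-1]:  # the largest value would leave the right side empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv, rv = mean(left), mean(right)
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    if best is None:  # constant feature: a single leaf predicting the mean residual
        m = mean(residuals)
        return (float("inf"), m, m)
    return best[1:]

def stump_predict(stump, x):
    t, lv, rv = stump
    return lv if x <= t else rv

def boost(xs, ys, n_trees=20, learning_rate=0.5):
    """Each new stump is trained on the residual error left by the previous ones."""
    base = mean(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, x) for p, x in zip(preds, xs)]
    return base, learning_rate, stumps

def predict(model, x):
    """The final prediction is the base value plus the (scaled) sum of all stumps."""
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)
```

On a toy step-shaped target, a single stump leaves a visible error, while twenty stumps bring the prediction close to the target, which is the behavior the paragraph describes.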
The characteristics of XGBoost enable it to outperform other machine learning algorithms while requiring fewer system resources. Theoretical and empirical evidence supporting these claims is given in [53]- [57]. In [54], XGBoost produced the best performing models among 11 machine learning algorithms in text-based spam classification. However, tree-based algorithms in general tend to perform well with a relatively small number of features compared to artificial neural networks. Therefore, this research aims at leveraging the benefits of the XGBoost algorithm by reducing the number of text features in the prediction model building process.

The major challenge in building XGBoost-based models is proper parameter tuning [53]- [56]. This research proposes a novel GA variation that optimizes the parameters of a classifier (i.e., eXtreme Gradient Boosting) and reduces the feature space simultaneously.

B. DATASET DESCRIPTION
The main tweets dataset used in this research was introduced in [22]. It has 5096 tweets. About 17% of the tweets are labeled as "Spam" and the rest as "Ham". Tweets are labeled manually by considering and examining each one separately [22]. If the content of a tweet is considered unacceptable by the community or harmful, then the tweet is labeled as spam. Otherwise, it is labeled as ham (i.e., a normal tweet). Figure 2 summarizes the number of text characters in all instances. The number of stored characters of a tweet may exceed the original tweet length because some special characters and emoticons are stored as a set of representative Unicode characters.
It is apparent that the average length of a tweet is about 100 characters in both classes, and that there is no significant difference in variance between the distributions of the two classes. Therefore, tweet length does not help to separate the classes, which adds to the challenges of building a robust classification model.

C. THE MODIFIED GA
The modified GA aims at directing the stochastic selection towards a fine subset of features and, at the same time, at finding the best possible parameters of the classification algorithm. In other words, it searches for the best combination of features and parameters simultaneously. Usually, GA is used either to initialize the classification algorithm parameters or for feature selection. The proposed modifications leverage the capabilities of GA in defining the optimal combination of feature subset and parameters. Moreover, particular modifications of some methods limit the absolute randomness of the GA phases, for example by ensuring that there are no duplicate genes in each chromosome. The modified GA and its phases are presented in the following subsections.
Many recent research studies in different domains [58]- [61] illustrated the power of GA in optimizing the parameters of XGBoost to achieve better prediction performance. Algorithm 1 represents the initial configuration of the modified GA that is used to optimize the parameters of XGBoost and select the most appropriate feature subset. Initially, a number of GA parameters set the maximum percentage of features to be selected, the parents' crossover ratio in each population, the maximum number of generations, and the number of classifier parameters to be optimized. The result is a chromosome that holds an optimized set of XGBoost parameters and the selected feature subset. The chromosome structure is illustrated in Figure 5. The input dataset is split into 70% training and 30% testing partitions for the GA-based XGBoost model building and validation. Table 4 describes the GA parameters.

2) Initial population
Creating the initial population of the GA is challenging, as it is not easy to select a representative subset of the whole population, neither when selecting the initial set of classifier parameters nor when selecting the subset of the feature set. Redundancy of gene values is also one of the issues to consider at this phase. To limit the absolute randomness of the GA, the XGBoost boosting parameters are generated using a uniform random number generator within a recommended value range (Table ). The feature subset, which is part of the chromosome, is created by a custom procedure that randomly selects a subset of the whole feature set, i.e., a subset of the whole TF-iDF vector. The procedure ensures that the chromosome has no duplicate features and that features are selected from the full feature vector. In practice, the list of selected features is the set of feature indices in the TF-iDF vector. Initializing the population forms the parent chromosomes according to the preset number of parents in Algorithm 1.
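A minimal sketch of this initialization step is given below. It assumes a chromosome represented as a dictionary with parameter genes and feature-index genes. The entries of `PARAM_RANGES` are hypothetical placeholder ranges chosen for illustration only, not the recommended ranges from the paper's table, and the function names are likewise assumptions.

```python
import random

# Hypothetical ranges for three XGBoost parameters (placeholders, not the paper's values).
PARAM_RANGES = {
    "learning_rate": (0.01, 0.3),
    "max_depth": (3, 10),
    "subsample": (0.5, 1.0),
}
INTEGER_PARAMS = {"max_depth"}

def random_chromosome(n_total_features, feature_fraction, rng):
    """Build one chromosome: uniformly drawn parameters plus unique feature indices."""
    params = {}
    for name, (lo, hi) in PARAM_RANGES.items():
        if name in INTEGER_PARAMS:
            params[name] = rng.randint(lo, hi)   # inclusive on both ends
        else:
            params[name] = rng.uniform(lo, hi)
    n_selected = max(1, int(n_total_features * feature_fraction))
    # Sampling without replacement guarantees that no feature gene is duplicated.
    features = rng.sample(range(n_total_features), n_selected)
    return {"params": params, "features": features}

def initial_population(n_parents, n_total_features, feature_fraction, seed=0):
    """Create the parent chromosomes of the first generation."""
    rng = random.Random(seed)
    return [random_chromosome(n_total_features, feature_fraction, rng)
            for _ in range(n_parents)]
```

For instance, `initial_population(10, 14343, 0.01)` would produce 10 parents, each holding 143 distinct indices into a 14343-dimensional TF-iDF vector.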

3) Fitness function
The bias imposed by imbalanced class distributions generally favors the majority class, which in most cases does not represent the class of interest. Therefore, metrics based on a single class can dramatically mislead the selection of the best model by the objective function. The Geometric Mean (GMean), on the contrary, considers both the positive and the negative class [62]- [64]. The GA and the validation of the selected models in this research utilize the GMean as the objective function. In addition, it is used as the main metric for performance comparison.
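The fitness evaluation described above can be sketched as follows, assuming binary labels with 1 for the positive (Spam) class and 0 for the negative (Ham) class. The callable `train_and_predict` is a hypothetical stand-in for training the classifier with a chromosome's parameters and features and returning its predictions on the validation partition.

```python
import math

def gmean(y_true, y_pred):
    """Geometric mean of the recall of the positive class and of the negative class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(tpr * tnr)

def population_fitness(population, train_and_predict, y_true):
    """Return one GMean fitness value per parent chromosome."""
    return [gmean(y_true, train_and_predict(chromosome)) for chromosome in population]
```

Because the two recalls are multiplied, a chromosome whose model ignores one class entirely receives a fitness of zero regardless of how well it handles the other class.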

4) Selection and Crossover
Crossover is an essential phase in GA that generates a number of new children from the parents. A child's genes are a combination of the genes of two parent chromosomes, so the children are expected to have better genes than the parents. To achieve this, a uniform crossover is performed using almost half of each parent's genes. The crossover phase ensures that each generated child has no duplicate genes. Each chromosome undergoes two crossovers: one for the XGBoost parameters and the other for the selected feature set. Algorithm 4 selects the best parents for crossover according to the GMean objective function and generates a number of new children for the next generation.
Because of the stochastic nature of GA, some genes may be overlooked in the initial population or in the generated children. To increase the chance of fair inclusion of missed genes, the mutation tries to include new genes in the children. One parameter gene and one feature gene are selected randomly in each child chromosome and replaced with a new value. Specifically for the selected features, the mutated gene value is selected from the full feature vector such that it is not one of the parents' genes. Mutation is illustrated by Algorithm 5.
The modified GA runs for a preset number of generations, aiming at maximizing the GMean value of the generated models. The spam and ham text features are the TF-iDF vectors generated by the pre-processing step presented in Section III-D1. The best chromosome selected in the last GA generation contains the best XGBoost parameters and the accompanying selected subset of spam features. This chromosome is subsequently used to initialize an XGBoost algorithm that generates a spam prediction model in the Model Building phase illustrated in Figure 3 and Figure 4.
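The selection, crossover, and mutation steps of this subsection can be sketched as below. This is one possible reading of the description, not a transcription of Algorithms 4 and 5: the per-position coin flip, the back-fill rule used to keep children duplicate-free, and all function names are assumptions, and parameter genes are treated as floats for brevity.

```python
import random

def select_parents(population, fitness, n_crossover):
    """Keep the n_crossover chromosomes with the highest fitness (e.g., GMean)."""
    ranked = sorted(zip(fitness, range(len(population))), reverse=True)
    return [population[i] for _, i in ranked[:n_crossover]]

def crossover_params(a, b, rng):
    """Uniform crossover: each parameter gene comes from either parent."""
    return {name: (a[name] if rng.random() < 0.5 else b[name]) for name in a}

def crossover_features(a, b, rng):
    """Uniform crossover over feature-index genes that keeps the child duplicate-free."""
    child, seen = [], set()
    for gene_a, gene_b in zip(a, b):
        first, second = (gene_a, gene_b) if rng.random() < 0.5 else (gene_b, gene_a)
        for gene in (first, second):
            if gene not in seen:
                child.append(gene)
                seen.add(gene)
                break
    # Back-fill positions where both candidate genes were already in the child.
    pool = [g for g in a + b if g not in seen]
    rng.shuffle(pool)
    while len(child) < len(a) and pool:
        gene = pool.pop()
        if gene not in seen:
            child.append(gene)
            seen.add(gene)
    return child

def mutate(child, parents_features, n_total_features, param_ranges, rng):
    """Replace one parameter gene and one feature gene with new values."""
    name = rng.choice(sorted(child["params"]))
    lo, hi = param_ranges[name]
    child["params"][name] = rng.uniform(lo, hi)
    # The new feature gene must not be a gene of the parents (or of the child).
    excluded = set(parents_features) | set(child["features"])
    candidates = [i for i in range(n_total_features) if i not in excluded]
    if candidates:
        position = rng.randrange(len(child["features"]))
        child["features"][position] = rng.choice(candidates)
    return child
```

Drawing the mutated feature from outside the parents' genes is what lets features that were missed by the initial population enter later generations.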

D. PROPOSED METHODOLOGY
The proposed methodology is divided into five main phases: (1) Dataset pre-processing, (2) Hyper parameter optimization and feature selection, (3) Sensitivity analysis, (4) Model building and validation, and (5) Classification performance analysis. Figure 3 is an abstract view of the proposed research methodology. Figure 4

1) Dataset Pre-processing
Each data instance contains the raw tweet text and a label (i.e., "Ham" or "Spam"). Each tweet text and its label are pre-processed to be cleaned and converted into features through a number of steps: (1) Tokenize the tweet and remove extra spaces and special characters. (2) Stem each tokenized word using the "Porter Stemmer" [65], which reduces the tokenized word to its root, stem, or base. (3) Give each stem a weight using a vectorizer based on the term frequency-inverse document frequency (TF-iDF) [66], [67], so that each tweet is converted into a representative TF-iDF vector (i.e., a set of features). (4) Encode the class labels into 0's (i.e., the "Ham" class) and 1's (i.e., the "Spam" class) to satisfy the requirements of the classification algorithm.
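The pre-processing steps above can be approximated with the following standard-library sketch. It is a simplified illustration under stated assumptions: stemming is omitted because the Python standard library has no Porter stemmer, the weight uses the plain $tf \times \ln(N/df)$ form, which may differ from the vectorizer configuration used in the experiments, and the function names are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep alphanumeric tokens only (drops special characters)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def tfidf_vectors(docs):
    """Convert each document into a TF-iDF vector over a shared, sorted vocabulary."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for tokens in tokenized for t in tokens})
    index = {t: i for i, t in enumerate(vocab)}
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(t for tokens in tokenized for t in set(tokens))
    vectors = []
    for tokens in tokenized:
        vec = [0.0] * len(vocab)
        for term, count in Counter(tokens).items():
            tf = count / len(tokens)
            vec[index[term]] = tf * math.log(n_docs / df[term])
        vectors.append(vec)
    return vocab, vectors

def encode_labels(labels):
    """Encode "Ham" as 0 and "Spam" as 1."""
    return [1 if label == "Spam" else 0 for label in labels]
```

Each position of a vector corresponds to one vocabulary term, so a feature gene in the GA chromosome can simply be an index into this vector.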

2) Hyper parameter optimization and feature selection
In this step, a modified GA tunes the parameters of the classification algorithm such that it improves the prediction rates.
It is divided into two main parts: (a) GA feature selection and (b) GA hyper parameter optimization. Each chromosome in this step is designed to hold two types of genes, such that the genes at the beginning are the parameters to be optimized and the rest of the genes are the selected features. Figure 5 shows the detailed structure of the chromosomes, the chromosomes after GA crossover, and an illustration of the genes after the mutation process. The initial population consists of parent chromosomes holding randomly selected parameters within a recommended and pre-defined range based on the literature [53], [68], [69], and randomly selected unique features within each chromosome. It is the responsibility of the initialization algorithm to ensure that features are chosen without any duplicates in each chromosome.

3) Sensitivity analysis
The main aim of the sensitivity analysis step is to find the best possible combination of XGBoost parameters and feature subset [70]. The GA performs several hyper parameter optimizations and feature selections in order to examine the behavior of the classification algorithm. Consequently, the results of different optimizations and feature selections under different configurations lead to an understanding of the behavior of the algorithm.
Several GA configurations are examined in this step by mainly specifying: (1) the initial population size, crossover percentage, and number of generations; and (2) the desired percentage of features to retain (i.e., the number of feature genes in the chromosomes). The effect of the configurations on the objective function is examined to determine the candidate classifier parameters and the feature subset.

4) Model building and validation
The optimized classifier parameters and the selected feature subset, which maximized the objective function, are used to build a robust classification model. 10-fold stratified cross-validation (10 CV) is used to avoid bias in the model building process. This model building process is repeated 50 times (50 × 10 CV) to assess the classifier stability.
The major configurations of the steps of the proposed approach, which are described in Section III-D, are listed as follows:

A. DATASET PRE-PROCESSING
The TF-iDF vectorizer parameters used in the tweet text pre-processing are listed in Table 2. The maximum number of possible features is extracted in the pre-processing step.

B. GA CONFIGURATION
Tables 3 and 4 show the GA parameter setup. The letters $F$, $P$, $C$, and $G$ are used to summarize the description of each GA configuration (i.e., metadata elements). $F$ is the percentage of the feature subset to be selected from the complete feature set, $P$ is the number of randomly selected parents in the first GA generation, $C$ is the number of parents to crossover, and $G$ is the number of GA generations. This standard file naming convention makes it easier to sort and interpret some effects of parameter tuning. The best results of GA feature selection and hyper parameter optimization, which consist of the selected features and the optimized XGBoost parameters, are selected based on the fitness function (i.e., GMean) and are listed in Table 5.

C. CLASSIFICATION PERFORMANCE ANALYSIS
Analyzing the performance of the classifier to demonstrate its learning capability in model development is an essential part of assessing the modeling phase. Therefore, several metrics illustrate the performance of the developed model in detecting a potential spam tweet (i.e., classifying the tweets into Spam and Ham).
A visual summary of the classification results is given by a confusion matrix [71], a two-dimensional table that aggregates the counts of the tweets labeled by the developed classification model into correct (True) and incorrect (False) labels. The aggregated counts are denoted True Positive ($TP$), False Positive ($FP$), True Negative ($TN$), and False Negative ($FN$). In this work, the positive class (i.e., the class of interest) is the Spam tweet, and the negative class is the Ham tweet. Consequently, the four aggregated counts in the confusion matrix are interpreted as follows: the $TP$ count is the number of actual Spam tweets that are classified correctly as Spam, the $TN$ count is the number of actual Ham tweets that are classified correctly as Ham, the $FP$ count is the number of actual Ham tweets that are classified incorrectly as Spam, and the $FN$ count is the number of actual Spam tweets that are classified incorrectly as Ham. $TN$ and $TP$ represent the goodness of the classification model in correctly predicting the class label, while $FP$ and $FN$ show the level of possible confusion a prediction model may have. Table 6 represents the confusion matrix that is used to derive a number of Spam classifier performance evaluation metrics. Some of the derived evaluation metrics are:
1) True Positive Rate ($TPR$): the ratio of correctly classified Spam tweets (i.e., tweets predicted as Spam that are actually Spam) to all actual Spam tweets [71]. It is alternatively named recall or sensitivity: $TPR = \frac{TP}{TP + FN} \quad (1)$
2) True Negative Rate ($TNR$): the ratio of correctly classified Ham tweets (i.e., tweets predicted as Ham that are actually Ham) to all actual Ham tweets [71]. It is alternatively named specificity: $TNR = \frac{TN}{TN + FP} \quad (2)$
3) Positive Predictive Value ($PPV$). It is alternatively named precision [71]: $PPV = \frac{TP}{TP + FP} \quad (3)$
4) False Positive Rate ($FPR$): the probability of a false alarm, also named fall-out: $FPR = \frac{FP}{FP + TN} \quad (4)$
5) Negative Predictive Value ($NPV$) [71]: $NPV = \frac{TN}{TN + FN} \quad (5)$
6) F-Score ($F1$) [71]: $F1 = \frac{2 \times PPV \times TPR}{PPV + TPR} \quad (6)$
7) Total Accuracy ($Accuracy$): It is traditionally derived from the confusion matrix and represents the count of correctly classified instances divided by the total number of instances. Alternatively, accuracy is also referred to as the success rate (i.e., the ratio of correctly classified instances). Equation 7 illustrates the accuracy metric.

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \quad (7)$$
There are many concerns about using the total accuracy as a performance metric, particularly on imbalanced datasets [72]- [74]. Usually the negative class is dominant and more frequent in real life. Consequently, the model building phase has a higher tendency to model the patterns of the negative class better. This tendency reduces the prediction power for the positive class, which is usually the class of interest. The same issue arises when considering Spam tweets: while $TNR$ tends to rise, $TPR$ tends to decline. Therefore, further evaluation metrics are advised here. The Geometric Mean ($GMean$) and the Area Under the Curve ($AUC$) are commonly used in evaluating classifiers on imbalanced class distributions. GMean and AUC take the minority class into consideration and seek a balance between the classes in illustrating the model accuracy (i.e., they are class-independent metrics). The GMean is calculated according to Equation 8, i.e., the square root of the product of the recall of the positive class and the recall of the negative class: $$GMean = \sqrt{TPR \times TNR} \quad (8)$$
The calculation of the GMean ensures unbiased behavior of the metric both in objective function evaluation and in performance evaluation. A higher GMean value indicates better performance of the classifier [62]- [64].
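The contrast between accuracy and GMean on imbalanced data can be made concrete with a short sketch that derives the metrics of this section from the four confusion matrix counts. The function name `classification_metrics` is an assumption, and undefined ratios (zero denominators) are reported as 0.0 by convention here.

```python
import math

def safe_div(numerator, denominator):
    """Return the ratio, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

def classification_metrics(tp, fp, tn, fn):
    """Derive the evaluation metrics from the confusion matrix counts."""
    tpr = safe_div(tp, tp + fn)          # recall / sensitivity
    tnr = safe_div(tn, tn + fp)          # specificity
    ppv = safe_div(tp, tp + fp)          # precision
    fpr = safe_div(fp, fp + tn)          # fall-out
    npv = safe_div(tn, tn + fn)
    f1 = safe_div(2 * ppv * tpr, ppv + tpr)
    accuracy = safe_div(tp + tn, tp + tn + fp + fn)
    gmean = math.sqrt(tpr * tnr)
    return {"TPR": tpr, "TNR": tnr, "PPV": ppv, "FPR": fpr,
            "NPV": npv, "F1": f1, "Accuracy": accuracy, "GMean": gmean}
```

As a hypothetical illustration with 17 Spam and 83 Ham instances, a model that labels everything as Ham (`tp=0, fp=0, tn=83, fn=17`) still reaches an accuracy of 0.83, while its GMean is 0 because no Spam tweet is recovered.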
The Receiver Operating Characteristic (ROC) curve is obtained by sweeping the classification threshold over different values and plotting the corresponding rates against each other (i.e., $TPR$ against $FPR$). The $AUC$ is generated by calculating the area under the ROC curve. A higher $AUC$ value therefore indicates a better diagnostic ability of the classification model [75].
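The threshold sweep behind the ROC curve and the area computation can be sketched as follows. This is a generic illustration using trapezoidal integration, with 1 denoting the positive class; the function name `roc_auc` is an assumption, and tied scores are grouped so that they move the curve in a single step.

```python
def roc_auc(y_true, scores):
    """Return the ROC points (FPR, TPR) and the area under the curve."""
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present to build a ROC curve")
    # Lowering the threshold step by step is the same as visiting scores in descending order.
    pairs = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    tp = fp = 0
    points = [(0.0, 0.0)]
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    # Trapezoidal rule over consecutive ROC points.
    auc = sum((x1 - x0) * (y0 + y1) / 2
              for (x0, y0), (x1, y1) in zip(points, points[1:]))
    return points, auc
```

A classifier that ranks every Spam tweet above every Ham tweet obtains an area of 1, whereas scores that carry no ranking information give an area of 0.5.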

V. RESULTS AND DISCUSSION
It takes a considerable amount of computation time to obtain a relatively robust tweet spam prediction model using GA. The proposed methodology is quite complex and strives to find an optimal subset of tweet features and classifier parameters simultaneously. In the tweet pre-processing step, the TF-iDF vectorizer generates the maximum number of possible features per tweet along with their TF-iDF values (i.e., a total of 14343 features extracted from all tweets).
Due to the time complexity of each GA search process, an initial relatively small subset of features (i.e., 1%) is examined in order to study the performance behavior of the classifier and select the most appropriate algorithm configurations (i.e., sensitivity analysis). The performance analysis of the first 10 experiments relies on 1% of the feature space, 10 initial population parents, and 100 generations. Next, the number of generations is fixed at 50 and the crossover ratio at 60% of the parents. Finally, the effect of different feature subset sizes is examined (i.e., F = 1, 5, 10, 20, 30, and 40).
A number of selected features and parameters are used to build prediction models that are validated with 50 times repeated 10-fold cross-validation. The performance metrics (GMean in particular) indicate promising capabilities of the modified GA in finding a subset of the feature space and optimizing the parameters of the classifier accordingly. The outcomes of this research are compared to related work in terms of dimensionality reduction.
It is worth noting that the performance metrics in this research are presented such that further comparison with existing or future research is possible. The following subsections show and discuss the findings in more detail.

A. SENSITIVITY ANALYSIS
Sensitivity analysis results show the effect of the GA crossover ratio on classification performance and convergence. This is an important step to predict the behavior of the GA, and to define the crossover ratio and the number of GA iterations. Figure 6 represents the fitness curve of several crossover ratios ranging from 2% to 10%, while the remaining three parameters have been fixed at 1% of the total number of features, 10 parents, and 100 generations.
FIGURE 6. Sensitivity analysis: Number of iterations and crossover ratio by setting the parameters described in Table 3.
An almost similar convergence pattern of the algorithm is present for all settings, but a significant difference in fitness value groups the results into three main levels (i.e., low, moderate, and high fitness value groups that correspond to different crossover ratios). Extreme values of the crossover rate lead to very poor fitness values and a lower prediction rate. Low values of the crossover rate lead to relatively moderate fitness values. Moderate crossover values lead to relatively high fitness values, thereby improving the prediction power of the spam classification models. It is apparent that a 60% crossover rate is the most appropriate diversity factor for the next GA population. Experiments reveal that the fitness reaches a relative local maximum within the early iterations (i.e., GA generations) and increases only slightly in subsequent generations. Consequently, the number of generations is lowered to 50, keeping in mind the effect of larger initial populations and numbers of features.
Several GA optimization and feature selection experiments (Table 7) aim at maximizing the fitness value. Some of the experiments fix the crossover ratio (C) at 60% and the number of generations (G) at 50, and examine several percentages of features to select (F). Furthermore, Table 8 lists some of the most frequently selected features in all the experiments and the presence of these top features in the best performing models.
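The experiment codes used in this section (e.g., "F10-P400-C240-G50") can be decoded with a short helper. The sketch below assumes that F is the percentage of features to select, P the number of initial population parents, C the number of parents taking part in crossover, and G the number of generations. Reading C as a count rather than a percentage is an assumption, chosen because 240 out of 400 parents matches the 60% crossover ratio stated above; the function name is illustrative.

```python
import re

def parse_experiment_code(code):
    """Split a code such as 'F10-P400-C240-G50' into its four fields."""
    match = re.fullmatch(r"F(\d+)-P(\d+)-C(\d+)-G(\d+)", code)
    if match is None:
        raise ValueError(f"unrecognized experiment code: {code!r}")
    f, p, c, g = (int(x) for x in match.groups())
    return {
        "feature_percent": f,      # percentage of the feature space to select
        "population": p,           # initial population parents
        "crossover": c,            # assumed: number of parents crossed over
        "generations": g,          # number of GA generations
        "crossover_ratio": c / p,  # derived ratio under the assumption above
    }

cfg = parse_experiment_code("F10-P400-C240-G50")
print(cfg)
```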

B. MODEL VALIDATION
The selected features and the optimized parameters (i.e., the configurations that attained high fitness values) in some experiments are used to model spam prediction using XGBoost. The model robustness is validated using 10-fold cross-validation repeated 50 times. Table 9 summarizes the absolute difference between the GA fitness value and the validated XGBoost model performance. The best optimization and feature selection experiment, "F10-P400-C240-G50", attained a fitness value (i.e., Geometric Mean) of 84.85%. XGBoost is initialized with the optimized parameters in Table 10 and trained on the dataset restricted to the selected features; the resulting spam classification model is validated using 10-fold stratified cross-validation repeated 50 times. Figure 7 is a box plot that summarizes the performance metrics, and Figure 8 depicts the performance metrics of each fold. The detailed performance metrics of the 10x50CV are presented in Table 11. There are two main observations worth mentioning here: (a) the effectiveness of the GA feature selection and parameter optimization is acceptable, and (b) the XGBoost classification model is barely affected by the algorithm randomness. The absolute difference between the GA fitness and the validated model is bound to an average of approximately 2.8, and the standard deviation of all the evaluation measures is less than 0.04 after 50 cross-validation runs.
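The validation scheme (10-fold stratified cross-validation repeated 50 times) can be sketched in plain Python as follows. This is only an illustration of how the 500 train/test splits arise and how stratification keeps the class ratio in every fold; the labels are hypothetical, the function names are illustrative, and the source does not state which implementation was used in the experiments.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k, rng):
    """Assign every sample index to one of k folds, class by class."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices round-robin so each fold gets a similar share.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds

def repeated_stratified_cv(labels, k=10, repeats=50, seed=0):
    """Yield (repeat, fold, train_indices, test_indices) for k x repeats splits."""
    rng = random.Random(seed)
    for r in range(repeats):
        folds = stratified_folds(labels, k, rng)
        for f, test in enumerate(folds):
            train = [i for j, fold in enumerate(folds) if j != f for i in fold]
            yield r, f, train, test

labels = [1] * 20 + [0] * 180            # hypothetical data with 10% spam
splits = list(repeated_stratified_cv(labels))
print(len(splits))                       # number of train/test splits
```

Libraries such as scikit-learn provide an equivalent scheme (`RepeatedStratifiedKFold`), which would be the more practical choice when the model itself is trained with that ecosystem.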

C. STATISTICAL ANALYSIS
The effect of the stochastic nature of GA and the imposed randomness of the classification algorithms on the experiments is described using statistical tests. The whole process of GA-based parameter optimization and feature selection followed by a 10x50CV of XGBoost is repeated seven times. Runs denoted by the sequence "R01" to "R07" follow
The p-values of the Wilcoxon statistical test between run pairs in Table 13 indicate a possible similarity between the runs "R00" and "R07", considering α = 0.05. The differences against the rest of the runs are possibly because the run "R00" is controlled in terms of the random number generation method. According to the Wilcoxon test, the runs "R01", "R02", "R03", "R05", and "R06" possibly have similar distributions, while the distribution of "R04" is different from the others. However, the GMean value of the runs does not differ significantly, as illustrated in Table 12. The Kruskal statistical test applies to combinations of three or more runs. Therefore, all run combinations are tested and the top p-values of the nonparametric Kruskal test are reported in Table 14. Assuming α = 0.05, the runs "R01", "R02", and "R05" are expected to have the highest similarity of GMean distributions, and "R01", "R02", "R03", "R05", and "R06" most probably have similar distributions.
The overall performance and robustness of the proposed spam classification model exceed those of the work in [22]. The authors in [22] claim a high performing LSTM (Long Short Term Memory) model; however, the robustness of their approach is not well justified. Their performance metrics are based on "Ham" as the positive class, which is the majority class label. Swapping the minority class with the majority class leads to significantly higher performance metric values, which is misleading when interpreting some metrics such as TPR. There is no evidence of using cross-validation in assessing the model robustness. Features are reduced by selecting the most frequent words in the corpus, and the configuration of the basic classifiers used in the performance comparison is not presented in the paper. For the sake of fair comparison with the results of our research, the performance metrics are re-calculated to consider "Ham" as the positive class label. Table 15 shows the equivalence equations used to re-calculate the results of our experiments to be comparable with the results in [22]. Table 16 compares our results with the performance metrics of [22] in spam classification (having "Ham" as the positive class and "Spam" as the negative class), and Table 17 contrasts the effect of feature reduction. The modified GA approach in this study outperforms the approach used in [22] in terms of feature selection: the number of features selected in the proposed approach (i.e., 1355 features) is much lower than the number of features selected in [22] (i.e., 5000 features). In addition, the maximum accuracy obtained using GA in our approach was 95.88% compared to 95.09% in [22].
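The re-calculation with "Ham" as the positive class amounts to exchanging the roles of the two classes in the confusion matrix, so that TPR and TNR swap and PPV and NPV swap. The following sketch illustrates this with hypothetical counts; the function names are illustrative, and the numbers are not results of this research.

```python
def metrics_from_counts(tp, fn, tn, fp):
    """Per-class metrics with 'Spam' taken as the positive class."""
    return {
        "TPR": tp / (tp + fn),
        "TNR": tn / (tn + fp),
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }

def swap_positive_class(m):
    """Re-express the same results as if 'Ham' were the positive class."""
    return {
        "TPR": m["TNR"],
        "TNR": m["TPR"],
        "PPV": m["NPV"],
        "NPV": m["PPV"],
        "accuracy": m["accuracy"],  # total accuracy does not depend on the choice
    }

def f1(tpr, ppv):
    """Harmonic mean of recall and precision."""
    return 2 * tpr * ppv / (tpr + ppv)

# Hypothetical counts: 100 spam and 900 ham messages.
spam_view = metrics_from_counts(tp=50, fn=50, tn=890, fp=10)
ham_view = swap_positive_class(spam_view)
f1_spam = f1(spam_view["TPR"], spam_view["PPV"])
f1_ham = f1(ham_view["TPR"], ham_view["PPV"])
print(f"F1 (Spam positive) = {f1_spam:.3f}, F1 (Ham positive) = {f1_ham:.3f}")
```

With these counts the F1 score rises from 0.625 to about 0.967 merely by declaring the majority class positive, which is why metrics reported on different positive classes cannot be compared directly.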
2) Comparison with Chi² Feature Selection
The Chi² [79], [80] statistical test has been used in text feature selection based on the statistical significance of features. We selected the top 1355 features using the Chi² method to compare the results with our best findings. Different machine learning algorithms are validated by 10x50CV in building spam prediction models using the 1355 features selected by Chi². The algorithms, used without parameter tuning, are XGBoost, Multinomial Naive Bayes (MNNB), K-Nearest Neighbors (KNN), Logistic Regression (LR), Adaptive Boosting (AdaBoost), and Decision Trees (DT). The p-values of the Wilcoxon statistical test between the run "F10-P400-C240-G50" and the Chi²-based models are presented in Table 18, and the descriptive statistics of the performance metrics are shown in Table 19. The p-values of the Wilcoxon test indicate distributions of GMean values that differ from those of the run "F10-P400-C240-G50", except for the DT model. The similar distributions are justified by the fact that XGBoost is an evolution of the decision tree algorithm; they share similar characteristics that could lead to similar behavior. However, the "F10-P400-C240-G50" model outperforms the DT model as indicated by the majority of the performance metrics.
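For readers unfamiliar with Chi²-based term ranking, the sketch below scores each term with the chi-square statistic of its 2x2 term/class contingency table and keeps the k best terms. It is one common formulation based on term presence; it is an assumption that this matches the variant used in the comparison, which operated on TF-iDF features. The toy documents and function names are hypothetical.

```python
def chi2_score(a, b, c, d):
    """Chi-square statistic of a 2x2 term/class contingency table.

    a: spam docs containing the term    b: ham docs containing the term
    c: spam docs without the term       d: ham docs without the term
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # the term or the class is constant, so there is no signal
    return n * (a * d - b * c) ** 2 / denom

def top_k_terms(docs, labels, k):
    """Rank terms by chi-square against the binary label and keep the best k."""
    vocab = sorted({t for doc in docs for t in doc})
    n_spam = sum(labels)
    n_ham = len(labels) - n_spam
    scored = []
    for term in vocab:
        a = sum(1 for doc, y in zip(docs, labels) if y == 1 and term in doc)
        b = sum(1 for doc, y in zip(docs, labels) if y == 0 and term in doc)
        scored.append((chi2_score(a, b, n_spam - a, n_ham - b), term))
    # Highest score first; ties are broken alphabetically for reproducibility.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [term for _, term in scored[:k]]

# Toy corpus: each document is a set of tokens; label 1 = spam, 0 = ham.
docs = [{"free", "win"}, {"free", "prize"}, {"hello", "meeting"}, {"hello", "lunch"}]
labels = [1, 1, 0, 0]
print(top_k_terms(docs, labels, k=2))
```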

3) Comparison with PCA Feature Selection
Reducing the dimension of a relatively large feature space while preserving most of the information is possible using Principal Component Analysis (PCA) [81]. The feature set is transformed into a number of principal components based on their covariance matrix; then a relatively small number of the principal components is selected to represent the full feature set. We trained XGBoost models using different numbers of principal components (i.e., from 1 to 20) and validated each model using 50x10CV. An illustration of the model accuracy in relation to the number of principal components is shown in Figure 9. The accuracy metrics and the standard deviation are presented in Table 20. It is apparent that 20 principal components enable attaining 91.53% total accuracy in spam prediction, compared to 92.67% using our modified GA. Moreover, although PCA significantly reduces the feature space, it makes the model interpretation much harder. It is apparent that the PCA-based model converges after 15 PCA components. According to Figure 9 and Table 20, the improvement in accuracy was less than 0.5 absolute percentage points within the last nine components (i.e., PCA components 12-20).
Therefore, PCA-based XGBoost models under-perform the modified GA-based XGBoost models.

4) Comparison with BERT and Deep Learning
Recent advancements in natural language processing research introduced pre-trained word embedding models that are coupled with Deep Learning (DL) algorithms [14], [23], [82]. BERT word embedding is used with DL to build spam classification models. The major issue of interest in DL is the computational complexity and extensive resource use. Despite such limitations, we were able to build a spam prediction model in this research and validate it with a percentage split of the tweets dataset. BERT is used for text pre-processing and encoding, and the model is built with a "sigmoid" activation function using Tensorflow [83] and Keras [84]. Table 21 summarizes the major performance metrics of the generated model over 20 epochs. The TPR of the class of interest ("Spam") is relatively low (52%), which indicates a very low prediction power of the generated model in spam prediction. For the sake of comparison with recent advancements in text classification, we implemented the BERT-DL model to assess its feasibility in spam prediction. In our case, the limited computational resources were the main barrier to tuning and seeking better prediction performance. However, the experiment shows that our modified GA approach outperforms the DL approach. Moreover, the resulting DL models are not intuitive to interpret and demand substantial computational resources.

5) Experimenting with SMS Dataset
The modified GA is applied to a public imbalanced SMS dataset [85]; about 13% of the messages are "Spam". Hyper parameter optimization and feature selection results are listed in Table 22. The GA reduced the selected features to 9.52% of the total dataset features (i.e., 706 out of 7419 features) and attained a GMean value of 97.29%. The outcomes in Table 22 initialized an XGBoost classifier to model SMS spam. The model is validated using a 50 times repeated run of 10-fold stratified cross-validation. Table 23 shows the performance metrics. In comparison to the best results in [86]-[88], the modified GA shows a competitive performance. The authors in [86], [87], and [88] attained a maximum total accuracy of 96%, 96.8%, and 98.74%, respectively. It is worth mentioning that Random Forest and SVM algorithms attained 99% accuracy in [86], but with TF-iDF features and oversampling. In essence, the maximum accuracy attained by the modified GA was 99.1%, which makes it outperform the majority of the related works utilizing the same SMS dataset.
Our proposed approach, the modified GA, reduces the features of the tweets to 9.45% of the original feature space (i.e., from 14343 to 1355 features) and maintains a competitive performance in comparison to the related studies. Therefore, the proposed approach in this research is expected to reduce the dimensionality by automating the process of feature selection and tuning the prediction model parameters simultaneously. The results of this work could be extended to list the features as words (i.e., specific words in the tweets) for further feature analysis. Furthermore, the XGBoost tree models have a higher level of interpretability compared to ANN-based models, which makes it much easier to deeply analyze the models for the sake of spam understanding and modeling.

E. IMPLICATIONS AND LIMITATIONS
The reported experiments and outcomes of this research establish a basis for spam modeling. In essence, the proposed approach outperformed many related works in SMS spam modeling. Future research may build on the outcomes to enhance the understanding of spam behavior. Further, this research could be considered for generalization in other domains such as software engineering, construction engineering, internet of things, and smart cities. In contrast to black-box models, tree-based classifiers enable straightforward implementation to detect spam tweets. The tremendous growth of Online Social Networks (OSN) calls for efficient real-time spam detectors. The state-of-the-art solutions recommend Deep Learning based solutions; however, deep learning is resource consuming and overlooks unseen spam behaviors. Our proposed approach reduces learning time significantly compared to deep learning based solutions.
Usually, GA finds outstanding solutions once its parameters (i.e., initial population, mutation and crossover ratios, number of generations, ...) are well tuned. In this research, tuning the GA using Grid Search is time consuming. Therefore, the sensitivity analysis described in section III-D3 has been used to find the best GA parameters. In the near future, we expect the reliability of the public Twitter spam dataset to raise concerns due to subjective interpretations by different communities. Multi-class labeling of spam text in sentiment analysis is not considered in this research.
The large number of experiment runs and the comprehensive set of performance metrics would direct further research activities.

VI. CONCLUSIONS AND FUTURE DIRECTIONS
Spam modeling is a challenging task due to many issues such as the high dimensionality of the feature space, the imbalanced class distributions, the bias of classification algorithms towards the majority class, and natural language processing issues. Many of the related research works lack solid validation of the generated models and usually report positive class-based performance metrics. In this paper, a modified genetic algorithm is designed in order to perform two main tasks: (1) an effective dimensionality reduction of an imbalanced tweets dataset and (2) hyper parameter optimization of the XGBoost classification algorithm. Intensive validation of the generated prediction model illustrates the robustness of the modified algorithm and its competitive performance compared to other approaches. This research reports a comprehensive set of performance metrics and nonparametric statistical significance tests, which makes it easier to understand the outcomes and provides a basis for comparisons with related works. In tweets spam modeling, the proposed approach selected less than 10% of the features to attain on average 92.67% and 82.32% total accuracy and geometric mean, respectively. It outperformed Chi² and PCA based approaches in feature selection. In addition, it showed competitive performance compared to recent machine learning algorithms, including word embedding and deep learning based models.
The stochastic aspects of genetic algorithms and parameter optimization are among the research limitations. Genetic algorithm based solutions usually require a large initial population or a large number of generations to find an outperforming solution. The large number of experiment runs and the comprehensive set of performance metrics would direct further research activities. Many issues remain unexplored by this research, including parallel processing to reduce the time complexity of the approach, the effect of natural language processing on improving the accuracy, incorporating user account features in spam modeling, and experimenting with multi-language spam modeling. Further research may build on the modified genetic algorithm to tackle different problems or domains such as sentiment analysis and multi-class modeling.

FIGURE 2. The character length distribution of the text (emoticons are stored as a series of Unicode characters).

Algorithm 2 Initializing
    ] = rand.uniform(0.01, 10.0)
    gammaValue[p] = rand.uniform(0.01, 10.0)
    subSample[p] = rand.uniform(0.01, 1.0)
    colSampleByTree[p] = rand.uniform(0.01, 1.0)
    features[p] = select text features    ▷ select text features ensures no duplicate genes are present in each chromosome
  end for
  concatenate parameters and features into chromosomes
  population = all generated chromosomes
  return population
end procedure

different classification models. Algorithm 3 and Equation 8 illustrate the GMean calculation. It is the square root of the True Positive Rate (TPR or Recall) multiplied by the True Negative Rate (TNR or Specificity). The TPR is a positive class based metric and the TNR is a negative class based metric; a metric derived from both TPR and TNR therefore represents the accuracy of both classes (i.e., Spam and Ham).

FIGURE 5.The chromosomes structure, crossover, and one gene mutation.

TABLE 5. XGBoost boosting parameters to be optimized by the modified GA

XGB Parameter        Value Range    Description
learning_rate (eta)  0.01 - 1       Algorithm learning rate; lower is better but requires more iterations to find the optimal solution.
n_estimators         10 - 1500      Maximum number of estimators.
max_depth            1 - 10         Maximum tree depth, to control overfitting (e.g., a high depth will bias the algorithm towards a specific sample).
min_child_weight     0.01 - 10.0    Minimum sum of observations in a child.
gamma                0.01 - 10      Minimum reduction of loss when splitting.
subsample            0.01 - 1.0     Random subset of observations for each tree.
colsample_bytree     0.01 - 1.0     Subset of columns to be sampled in the trees.
seed                 Fixed at 723   Used in parameter tuning and to have reproducible results.
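The parameter ranges of Table 5 and the initialization steps of Algorithm 2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function and variable names are hypothetical, the integer-valued parameters (n_estimators and max_depth) are assumed to be drawn as integers, the random seed is arbitrary, and each chromosome is kept as a dictionary rather than a concatenated gene vector for readability.

```python
import random

# Ranges copied from Table 5.
FLOAT_RANGES = {
    "learning_rate": (0.01, 1.0),
    "min_child_weight": (0.01, 10.0),
    "gamma": (0.01, 10.0),
    "subsample": (0.01, 1.0),
    "colsample_bytree": (0.01, 1.0),
}
INT_RANGES = {"n_estimators": (10, 1500), "max_depth": (1, 10)}

def init_population(n_parents, n_total_features, n_selected, seed=0):
    """Create n_parents random chromosomes (parameters plus feature indices)."""
    rng = random.Random(seed)
    population = []
    for _ in range(n_parents):
        params = {k: rng.uniform(lo, hi) for k, (lo, hi) in FLOAT_RANGES.items()}
        params.update({k: rng.randint(lo, hi) for k, (lo, hi) in INT_RANGES.items()})
        # Sampling without replacement guarantees no duplicate feature genes.
        features = sorted(rng.sample(range(n_total_features), n_selected))
        population.append({"params": params, "features": features})
    return population

# 10 parents, each selecting about 1% of the 14343 tweet features.
population = init_population(n_parents=10, n_total_features=14343, n_selected=143)
print(len(population), len(population[0]["features"]))
```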

F1: $2 \times \frac{TPR \times PPV}{TPR + PPV}$; $2 \times \frac{TNR \times NPV}{TNR + NPV}$; $2 \times \frac{TNR \times NPV}{TNR + NPV}$

TABLE 1. Summary of related research works

FIGURE 1. eXtreme Gradient Boosting (XGBoost).

is a more detailed view of the methodology, and the parameter values of each step are listed in Section IV.

FIGURE 4. The detailed research methodology. Flowchart blocks: Tweets and Labels; Tokenizer; Stemmer; Vectorizer; Label Encoder; Features Vectors; Hyper Parameter Optimization and Feature Selection; Parent Chromosomes; Parameters and Features Selection; Model Building and Fitness Evaluation; Best Selected Features; Optimized Parameters; Model Building and Validation.

Google™ Colaboratory (a.k.a. Colab, https://colab.research.google.com/) environment is one of the Google™ Research products. Colab offers a browser-based development environment for machine learning projects that supports Python™ (https://www.python.org/) code run over different modern processing architectures. The experiments of this research are implemented as Python™ projects and conducted over a CPU-based Colab environment.

TABLE 3. GA file-name/experiment code description.

TABLE 9. Comparison between selected best fitness obtained by GA optimization and the results of 10-fold cross-validation repeated 50 times.

TABLE 7. Fitness value obtained by several GA feature selection and optimization experiments

TABLE 8. Most frequent selected features and their presence in some experiments.

TABLE 13. p-Value of Wilcoxon statistical test between run pairs

TABLE 14. Kruskal statistical test of run combinations, showing only top combinations.

TABLE 15. Metrics re-calculation equations based on the selection of the positive class, for the sake of comparison.

TABLE 16. Comparison of the performance metrics in [22] with the best results of our experiments after re-calculating the performance metrics (i.e., considering "Ham" as the positive class).

TABLE 17. Effect of feature reduction in [22] compared to some experiments in our research.

TABLE 20. Accuracy of PCA feature selection and XGBoost (each is validated by 50x10CV).

TABLE 22. Optimized XGBoost parameters and the number of selected features obtained by the "SMS Spam Dataset" experiment using modified GA and XGBoost.

TABLE 23. Results of "SMS Spam Dataset" using modified GA and XGBoost; repeated 50 times with 10-fold cross-validation per run. (Best fitness obtained by GA was GMean = 97.29%)