Detection of Social Network Spam Based on Improved Extreme Learning Machine

With the rapid growth of online social networks, social media such as Twitter have become increasingly important in daily life and a prime target of spammers. Twitter spam detection is a complex task because it involves a wide range of characteristics and because spam and non-spam are distributed unevenly, which yields unbalanced data. To address these problems, this study analyzes Twitter spam characteristics in terms of user attribute, content, activity and relationship, and designs a novel spam detection algorithm based on the regularized extreme learning machine, called the Improved Incremental Fuzzy-kernel-regularized Extreme Learning Machine (I2FELM), to detect Twitter spam accurately. The experimental validation results show that the proposed I2FELM can efficiently identify spam in both balanced and unbalanced datasets. Moreover, I2FELM still detects spam effectively when only a few characteristics are used, which supports the effectiveness of the algorithm.


I. INTRODUCTION
Over the past few years, the Internet has advanced rapidly and intelligent terminals have become increasingly popular. Against this background, Online Social Networks (OSNs) have become a critical channel for people to acquire and disseminate information, make friends and be entertained. Because of the complexity of the online social network structure, the large scale of the user groups, and the massive, rapid and hard-to-trace generation of information, user adoption, content creation, group interaction and information dissemination on online social networks strongly affect social stability, organizational management models, and people's daily work and life [1], [2]. Taking Twitter as an example, the detection of Twitter spam can facilitate the analysis, guidance and monitoring of social network events, as well as the management of networks.
At present, the research challenges of Twitter spam detection lie in feature selection, detection algorithm selection, and data imbalance. The details are as follows: 1) In feature selection, previous research often selects a single type of characteristics (e.g., content-based or user profile-based characteristics) for detection. However, abnormal users of social networks differ from normal users in many types of characteristics, so a single type is not enough to express the state of the data accurately. 2) In algorithm selection, researchers primarily use supervised machine learning algorithms to deal with spam detection in social networks. Based on the idea of classification, the researchers have designed characteristics in numerical form to identify spam users. The supervised machine learning algorithms can be split into single classification algorithms and integrated classification algorithms (e.g., Support Vector Machine (SVM) [3], [8]- [11], [13], [14], meta-classifiers (Decorate, Logit Boost) [4], Naive Bayesian (NB) [6], [9], [11], Back Propagation Neural Network (BP) [16], Radial Basis Function (RBF) [18], Extreme Learning Machine (ELM) [8], [22], K-nearest Neighbor (KNN) [9], [19], Decision Tree (DT) [9], [20], Random Forest (RF) [5], [7]- [9], [23]- [26] and eXtreme Gradient Boosting (XGBoost) [31], [32]). 3) Real social network datasets exhibit a long tail effect, i.e., they are unbalanced, with the number of non-spam samples far exceeding the number of spam samples. When these supervised machine learning algorithms are applied to unbalanced datasets, their performance declines. Accordingly, an algorithm is needed that can effectively exploit multi-dimensional characteristics and remain feasible in the face of imbalanced datasets.
The associate editor coordinating the review of this manuscript and approving it for publication was Inês Domingues.
Building on the research achievements of predecessors, this study proposes four categories of characteristics to express Twitter datasets accurately and improves a supervised machine learning algorithm to handle unbalanced datasets, so that Twitter spam can be detected effectively. The contributions are as follows: 1) Selecting characteristics from all categories and paying attention to the correlation between the characteristics of a social network account helps enhance the accuracy of identifying spam users. This study therefore considers Twitter spam attributes composed of user attribute, content, activity and relationship to express the user characteristics and detect spam accurately. 2) This study proposes a novel incremental Twitter spam assessment algorithm, termed the Improved Incremental Fuzzy-kernel-regularized Extreme Learning Machine (I2FELM), to enhance the accuracy in dealing with unbalanced data. 3) I2FELM enhances performance by using Cholesky factorization without square root and a composite kernel function. Besides, it can automatically determine the optimal number of hidden layer nodes by adding new hidden nodes one by one. 4) I2FELM introduces the fuzzy weight as a method to address the unbalanced problem; the weight applies to each input and facilitates the learning of the output weights. 5) On the public dataset and the collected dataset, a range of metrics and experimental verification methods are adopted to ascertain the performance of I2FELM, and spam detection is assessed under data imbalance and with few characteristics.
The article structure is arranged as follows. Section II presents the relevant work. Section III illustrates the novel Twitter spam detection model. Section IV discusses the experimental procedure, and Section V draws the conclusion of the study.

II. RELATED WORK
Social spam detection has been studied extensively, and several approaches have been proposed with respect to spam characteristics and assessment algorithms.
Benevenuto et al. [3] considered two attribute sets, namely content attributes and user attributes, to distinguish one user class from the other, and used these characteristics as the attributes of an SVM to classify users as either spam or non-spam. Lee et al. [4] conducted a statistical analysis of the properties of spam profiles to create spam classifiers that actively filter out existing and novel spam. Based on these profile characteristics, the authors developed meta-classifiers (Decorate, Logit Boost, etc.) to identify previously unknown spam. Stringhini et al. [5] first created a set of honeynet accounts (honey-profiles) on Twitter and then identified multiple characteristics that allowed them to detect spam. Lastly, an RF model was built to detect spam and applied to a Twitter dataset. Wang [6] developed novel content-based and graph-based characteristics to facilitate spam detection; in addition, a Bayesian classification algorithm was adopted to distinguish suspicious behaviors from normal ones. Chu et al. [7] took a collective perspective and focused on identifying spam campaigns that manipulate multiple accounts to spread spam on Twitter. An automatic classification system was designed based on RF and a variety of characteristics at the individual tweet and account levels to classify spam campaigns. In Meda et al.'s work [8], a standard Principal Component Analysis (PCA) algorithm was used to reduce the dimensionality of the 62 characteristics to 20, 10, and 5 characteristics, and then three different machine learning algorithms (SVM, ELM, RF) were adopted to support spam detection in Twitter. Wang et al. [9] studied the suitability of five classification algorithms (Bayesian, KNN, SVM, DT, and RF) at the detection stage; they applied four different feature sets (user characteristics, content characteristics, n-grams, and sentiment characteristics) to the social spam detection task. Zheng et al. [10] extracted a set of content-based and user-based characteristics and applied them in an SVM-based spam detection algorithm. Chen et al. [11] built a hybrid model that uses SVM and NB to distinguish suspect users from normal ones based on user-based and content-based characteristics. During the assessment, the authors evaluated the impact of different factors on spam detection performance, covering discretization of functionality, size of learning data, and data related to time. Chen et al. [12] proposed an Lfun approach to deal with the "Spam Drift" problem in statistical-feature-based Twitter spam detection. They compared Lfun with four traditional machine learning algorithms and evaluated its performance in terms of overall accuracy, F-measure and detection rate. He et al. [13] proposed an analysis approach based on information entropy and incremental learning to study how various features affect the performance of an RBF-based SVM spam detector; through this effort, they attempted to increase the awareness of spam by sensing its features. Teng et al. [14] built a self-adaptive and collaborative intrusion detection model by applying the Environments-classes, agents, roles, groups, and objects (E-CARGO) model. Wu et al. [15] found that most current spam detection techniques are based on feature selection and machine learning classification (e.g., DT, RF and NB). Liu et al. [16] reviewed the schemes and systems proposed to deal with an increasing number of cyber security threats. These systems extract information from data sources and apply analytics or algorithms (e.g., machine learning) to make a decision.
Sun et al. [17] presented an overview and research outlook of the emerging field of cybersecurity incident prediction. They also extracted and summarized the research methodology at the critical phases of predicting cybersecurity incidents. Coulter et al. [18] demonstrated a new research methodology, data-driven cyber security (DDCS), and studied its application in social and Internet traffic analysis. Their review of key recent works in Twitter spam detection and IP traffic classification shows the strong link between data, model, and methodology in DDCS.
Dayani et al. [19] applied KNN to user-based characteristics and NB to the word cloud acquired in the pre-processing step to detect tweets spreading rumors, which demonstrated how appropriate preprocessing substantially improves rumor detection. Sheu et al. [20] proposed an efficient spam filtering mechanism based on a simple decision tree data mining algorithm that finds association rules about spam from the training e-mails. Liu et al. [21] proposed an embedded feature selection method based on their weighted Gini index (WGI), which uses a decision tree splitting criterion as a feature selection method.
Zheng et al. [22] first built a labeled dataset by crawling Sina Weibo data and manually classifying the corresponding users into spam and non-spam categories. Subsequently, a set of characteristics was extracted from message content and user behavior and then used in an ELM-based spam classification algorithm. In Meda et al.'s research [23], the system randomly took, as relevant, 1/6 of the originally available 54 characteristics to simulate a real case study in which the intrinsic correlation of different characteristics is not easily understandable, causing an ineffective configuration of the probability distribution by the analyst. Next, a variant of the Random Forests algorithm was used to identify spam inside Twitter traffic. Lin et al. [24] compared the spam detection performance of 9 mainstream algorithms to identify the optimal algorithms on datasets of account-based characteristics and tweet content-based characteristics. Through experimental verification, RF achieved the optimal performance under a range of conditions.
Liu et al. [25], [26] used twelve characteristics to express Twitter spam and developed an ensemble learning approach that learns more accurate single classifiers from imbalanced data in three steps. In the classification step, RF exhibited better performance in dealing with different ratios of imbalanced data. In the final step, a majority voting scheme was introduced to combine the results of the classification models from the second step. Wang [27] improved the precision of liquid steel temperature prediction in a ladle furnace by applying the random forest method to the large sample set accumulated from the production process. Tang et al. [28] analyzed the characteristics of spammers in Weibo and proposed fuzzy-logic-based oversampling together with a cost-sensitive support vector machine at the algorithmic level. Wang et al. [29] presented a drifted Twitter spam classification method that uses the multiscale drift detection test (MDDT) on K-L divergence.
He et al. [30] proposed another form of deep learning, a linguistic attribute hierarchy, embedded with linguistic decision trees, for spam detection. Such approach can improve the performance of spam detection when the semantic attributes are constructed to a proper hierarchy, while efficiently overcoming 'curse of dimensionality' in spam detection with massive attributes.
Saini et al. [31] extracted different textual characteristics from text reviews and used XGBoost to build the classifier. In Xu et al.'s study [32], feature extraction was performed on existing normal and malicious requests, and the XGBoost classification algorithm was adopted to identify abnormal requests. By experimental comparison, XGBoost was found to recognize abnormal HTTP requests better than random forest and support vector machine.
The previous research results show that researchers are paying attention to the problem of abnormal users in social networks. In the process of detecting spam, researchers have extracted various features to describe the characteristics of spam. Describing spam features is a complex task that can draw on the user's personal information, published content and favorite features. In addition, supervised machine learning algorithms have been widely used in spam detection because of their superior performance, high prediction accuracy, and strong generalization performance.

III. MODEL
At present, Twitter spam attributes primarily focus on tweet-based and user-based characteristics. The assessment algorithms are general machine learning methods built on the relationship between spam characteristics and detection, primarily SVM, DT, RF, BP, RBF, ELM, and XGBoost. In the face of multi-dimensional characteristics and imbalanced datasets, the performance of these algorithms requires enhancement.
This study proposes Twitter spam attributes consisting of user attribute, content, activity and relationship characteristics to detect spam accurately; each feature can be captured from Twitter, which ensures its integrity and reliability. Besides, I2FELM is designed to address the multi-dimension and imbalance problems and to achieve high assessment accuracy.
The procedure of Twitter spam detection is illustrated in Fig. 1. First, a dataset is collected from Twitter to form the user attribute, content, activity and relationship characteristic sets. Second, the collected dataset is preprocessed to label each user as spam or non-spam. Third, the proposed I2FELM is adopted to tackle the unbalanced problem. An optimal function of I2FELM is formed in the training and testing phases. Based on this optimal function, I2FELM can effectively assess Twitter spam in new datasets in the classification phase.

A. FEATURE SET
In this study, the feature set is composed of user attribute, content, activity and relationship characteristics in the online social network, and the details are listed in Table 1. To be specific, the user attribute features are the age of the account, the number of registered locations, the number of lists the user has been added to, and the number of tweets sent by the user. The content features cover the numbers of retweets and favorites the tweet received, the numbers of hashtags and URLs the tweet includes, the numbers of characters and digits in the tweet, the mentioned time of the tweet, the spam words in the tweet, and the content similarity score. The activity features describe the user's behavior when creating the tweet, covering the time at which the tweet was created, the time interval between two tweets, the number of tweets created each day, the time at which the tweet was mentioned, the location and source from which the tweet was sent, the number of replies of the user, the number of repetitions of the tweet, the number of uses of URLs, and the hashtags and mentions the user modified. The relationship features describe the user's interaction with other people (e.g., the number of followers, the number of followings and the clustering coefficient).

B. DETECTION ALGORITHM
In an online social network, the number of non-spam samples is significantly greater than that of spam samples, which leads to the problem of unbalanced data. Accordingly, the I2FELM is proposed to solve this problem based on RELM. The proposed algorithm uses fuzzy membership to improve accuracy by optimizing the learning of the output weights for unbalanced datasets, and it increases operation efficiency by using Cholesky factorization without square root and a composite kernel function.

1) RELM
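Extreme learning machine (ELM) was proposed for training single hidden layer feedforward neural networks (SLFNs), and it can act as an efficient learning solution for regression problems [33]. The essence of ELM is that, unlike the common understanding of learning, the hidden layer of SLFNs need not be tuned. Consider $N$ training data $\{(x_i, t_i)\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^n$ is the input and $t_i \in \mathbb{R}^m$ is the target. If an SLFN with $L$ hidden nodes can approximate these $N$ samples with zero error, there exist $\beta_j$, $w_j$ and $b_j$ such that
$$\sum_{j=1}^{L} \beta_j\, G(w_j, b_j, x_i) = t_i, \quad i = 1, \ldots, N \tag{1}$$
where $\beta_j = [\beta_{j1}, \ldots, \beta_{jm}]^T$ denotes the vector of the output weights between the $j$th hidden node and the output layer, $w_j = [w_{j1}, \ldots, w_{jn}]^T$ is the vector of input weights connecting the input nodes with the $j$th hidden node, $b_j$ represents the threshold of the $j$th hidden node, and $G(w_j, b_j, x_i)$ is an activation function satisfying the ELM universal approximation capability theorems.
To enhance the generalization ability of the traditional SLFNs based on ELM, Huang et al. [34] proposed the equality constrained optimization-based ELM. In their approach, structural risk is introduced as the regularization term. The resulting RELM regulates the proportion of structural risk and empirical risk using the parameter $C$. The constrained optimization can be formulated as
$$\min_{\beta,\xi}\; L_{RELM} = \frac{1}{2}\|\beta\|^2 + \frac{C}{2}\sum_{i=1}^{N}\|\xi_i\|^2 \quad \text{s.t.}\quad h(x_i)\beta = t_i^T - \xi_i^T,\; i = 1, \ldots, N \tag{2}$$
where $h(x_i) = [G(w_1, b_1, x_i), \ldots, G(w_L, b_L, x_i)]$ is the hidden-layer output row vector for $x_i$, $\beta = [\beta_1, \ldots, \beta_L]^T$ is the output weight matrix, $\xi_i$ denotes the slack variable of the training sample $x_i$, and $C$ controls the tradeoff between the output weights and the errors. Eq. (2) is similar to the classical optimization problem of SVM but has simpler constraints, and it is valid for regression, binary, and multiclass cases [35]. Thus, following [34], the output weights and the output function of RELM are
$$\beta = H^T\left(\frac{I}{C} + HH^T\right)^{-1} T \tag{3}$$
$$f(x) = h(x)\beta = h(x)H^T\left(\frac{I}{C} + HH^T\right)^{-1} T \tag{4}$$
where $H = [h(x_1)^T, \ldots, h(x_N)^T]^T$ is the hidden-layer output matrix and $T = [t_1, \ldots, t_N]^T$ is the target matrix.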
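The regularized solution can be illustrated with a small sketch. It assumes the closed form from [34], $\beta = H^T(I/C + HH^T)^{-1}T$, a sigmoid activation, and uniformly random hidden-node parameters; the function names (`hidden_layer`, `solve`, `train_relm`, `predict`) and the default values of `n_hidden`, `C` and `seed` are hypothetical choices made for illustration, not settings reported in this study.

```python
import math
import random

def hidden_layer(X, W, b):
    """Hidden-layer output matrix H (N x L) using a sigmoid activation (assumed)."""
    return [[1.0 / (1.0 + math.exp(-(sum(w * x for w, x in zip(W[j], xi)) + b[j])))
             for j in range(len(W))] for xi in X]

def solve(A, B):
    """Solve A X = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(A[i]) + list(B[i]) for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0.0:
                f = M[r][col]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[col])]
    return [row[n:] for row in M]

def train_relm(X, T, n_hidden=20, C=100.0, seed=0):
    """Random hidden layer, then beta = H^T (I/C + H H^T)^(-1) T."""
    rng = random.Random(seed)
    W = [[rng.uniform(-1.0, 1.0) for _ in X[0]] for _ in range(n_hidden)]
    b = [rng.uniform(-1.0, 1.0) for _ in range(n_hidden)]
    H = hidden_layer(X, W, b)
    N = len(X)
    # A = I/C + H H^T is an N x N symmetric positive definite matrix.
    A = [[sum(hi * hj for hi, hj in zip(H[i], H[j])) + (1.0 / C if i == j else 0.0)
          for j in range(N)] for i in range(N)]
    alpha = solve(A, T)
    # beta = H^T alpha has shape L x m.
    beta = [[sum(H[i][j] * alpha[i][c] for i in range(N)) for c in range(len(T[0]))]
            for j in range(n_hidden)]
    return W, b, beta

def predict(X, W, b, beta):
    """Network output f(x) = h(x) beta for every row of X."""
    return [[sum(h[j] * beta[j][c] for j in range(len(beta))) for c in range(len(beta[0]))]
            for h in hidden_layer(X, W, b)]
```

Only the output weights are learned here; the hidden-node parameters stay at their random values, which is the property that makes the solution a single linear solve. A larger `C` puts more weight on the training error relative to the norm of the output weights.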

2) I2FELM
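The I2FELM is proposed for training SLFNs based on RELM. As in ELM, the hidden layer of the generalized SLFNs need not be tuned, so I2FELM can be applied directly to regression and multiclass classification. I2FELM consists of three layers of nodes: an input layer, a hidden layer and an output layer. In the input layer, the input data are given different weights to handle the imbalance problem [36]: each input $x_i$ is provided with a fuzzy weight $s_i$, $\delta \le s_i \le 1$ with $\delta > 0$, whose value is assigned according to the ratio of abnormal users to normal users in the dataset. The constrained-optimization-based prediction problem of the improved incremental fuzzy-kernel-regularized extreme learning machine can therefore be formulated as
$$\min_{\beta,\xi}\; L_{I2FELM} = \frac{1}{2}\|\beta\|^2 + \frac{C}{2}\sum_{i=1}^{N} s_i\|\xi_i\|^2 \quad \text{s.t.}\quad h(x_i)\beta = t_i^T - \xi_i^T,\; i = 1, \ldots, N \tag{5}$$
where $\beta$, $\xi_i$, $x_i$, $C$, $h(x_i)$ and $t_i$ have the same meaning as in RELM.
Based on the Karush-Kuhn-Tucker (KKT) theorem, the Lagrange function corresponding to the I2FELM optimization (5) is
$$L = \frac{1}{2}\|\beta\|^2 + \frac{C}{2}\sum_{i=1}^{N} s_i\|\xi_i\|^2 - \sum_{i=1}^{N}\alpha_i\left(h(x_i)\beta - t_i^T + \xi_i^T\right)^T \tag{6}$$
where $\alpha_i = [\alpha_{i1}, \ldots, \alpha_{im}]$ is the row vector of Lagrange multipliers of the $i$th sample. The corresponding KKT optimality conditions are as follows:
$$\frac{\partial L}{\partial \beta} = 0 \;\Rightarrow\; \beta = H^T\alpha \tag{7a}$$
$$\frac{\partial L}{\partial \xi_i} = 0 \;\Rightarrow\; \alpha_i = C s_i \xi_i^T, \quad i = 1, \ldots, N \tag{7b}$$
$$\frac{\partial L}{\partial \alpha_i} = 0 \;\Rightarrow\; h(x_i)\beta - t_i^T + \xi_i^T = 0, \quad i = 1, \ldots, N \tag{7c}$$
where $H = [h(x_1)^T, \ldots, h(x_N)^T]^T$ is the hidden-layer output matrix and $\alpha = [\alpha_1^T, \ldots, \alpha_N^T]^T$. By substituting (7a) and (7b) into (7c), the equations can be equivalently written as
$$h(x_i)H^T\alpha + \frac{1}{C s_i}\alpha_i = t_i^T, \quad i = 1, \ldots, N \tag{8}$$
$$\left(\frac{1}{C}S^{-1} + HH^T\right)\alpha = T \tag{9}$$
$$\beta = H^T\left(\frac{1}{C}S^{-1} + HH^T\right)^{-1} T \tag{10}$$
where $S = \mathrm{diag}(s_1, \ldots, s_N)$ is the fuzzy weight matrix and $T = [t_1, \ldots, t_N]^T$. Thus, the weight matrix of the input dataset contributes directly to the learning of the output weights $\beta$ for imbalanced datasets. Then, the output function of I2FELM is
$$f(x) = h(x)\beta \tag{11}$$
To improve the operation efficiency of I2FELM and reduce the run time, Cholesky decomposition without square root [37] is used to calculate the value of $\beta$.
With $\beta$ taken from (10), the quantity to be computed is
$$f(x) = h(x)H^T\left(\frac{1}{C}S^{-1} + HH^T\right)^{-1} T \tag{12a}$$
In (12a), kernel methods [38] that satisfy Mercer's condition can be used to calculate the inner products $h(x)H^T$ and $HH^T$, which reduces the complexity of the algorithm. Thus (12a) can be written as
$$f(x) = \left[K(x, x_1), \ldots, K(x, x_N)\right]\left(\frac{1}{C}S^{-1} + \Omega\right)^{-1} T, \quad \Omega_{ij} = K(x_i, x_j) \tag{12b}$$
where the kernel function $K$ is calculated with the composite kernel function. From (12a), the matrix to be inverted is
$$A = \frac{1}{C}S^{-1} + HH^T \tag{13}$$
with $\Omega$ in place of $HH^T$ in the kernel form (12b). For any nonzero vector $V = [v_1, \ldots, v_N]^T$, the quadratic form of $A$ can be expressed as
$$V^T A V = \frac{1}{C}\sum_{i=1}^{N}\frac{v_i^2}{s_i} + \left\|H^T V\right\|^2 > 0 \tag{14}$$
From (14), $A$ is a positive definite matrix; the same holds in the kernel form because the kernel matrix of a Mercer kernel is positive semidefinite. Thus, the Cholesky decomposition of the matrix $A$ without square root can be obtained as
$$A = LDL^T \tag{15}$$
where $L$ is a lower triangular matrix whose diagonal elements are all equal to 1, $L^T$ is the transpose of $L$, $D$ is a diagonal matrix with positive diagonal elements, and
$$D = \mathrm{diag}(d_{11}, \ldots, d_{NN}) \tag{16}$$
$$l_{ij} = \frac{1}{d_{jj}}\left(a_{ij} - \sum_{n=1}^{j-1} l_{in}\, l_{jn}\, d_{nn}\right), \quad i > j \tag{17}$$
$$d_{ii} = a_{ii} - \sum_{n=1}^{i-1} l_{in}^2\, d_{nn} \tag{18}$$
Substituting (15) into (9) leads to
$$LDL^T\alpha = T \tag{19}$$
Denote
$$F = DL^T\alpha \tag{20}$$
Substituting (20) into (19) gives
$$LF = T \tag{21}$$
Using (17) and (21), the elements $f_{ij}$ of $F$ follow by forward substitution from the elements $t_{ij}$ of $T$:
$$f_{ij} = \begin{cases} t_{ij}, & i = 1 \\ t_{ij} - \sum_{n=1}^{i-1} l_{in} f_{nj}, & i > 1 \end{cases} \tag{22}$$
According to (18), (20) and (22), the multipliers are obtained by backward substitution:
$$\alpha_{ij} = \begin{cases} \dfrac{f_{ij}}{d_{ii}}, & i = N \\ \dfrac{f_{ij}}{d_{ii}} - \sum_{n=i+1}^{N} l_{ni}\,\alpha_{nj}, & i < N \end{cases} \tag{23}$$
and $\beta = H^T\alpha$ by (7a), or $f(x) = [K(x, x_1), \ldots, K(x, x_N)]\alpha$ in the kernel form. Thus, $\beta$ is calculated using only the four basic arithmetic operations, which accelerates the learning speed of I2FELM.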
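The square-root-free factorization can be sketched as follows. This is an illustrative reading, not the authors' implementation: it assumes the weighted kernel system $(S^{-1}/C + \Omega)\alpha = T$ with $S = \mathrm{diag}(s_i)$, solves it with an $LDL^T$ factorization whose $L$ has a unit diagonal, and assumes a composite kernel formed as a convex combination of an RBF kernel and a polynomial kernel. The function names and the parameters `rho`, `gamma` and `degree` are hypothetical, and the incremental addition of hidden nodes is not modeled.

```python
import math

def rbf(x, y, gamma=0.5):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def poly(x, y, degree=2):
    return (1.0 + sum(a * b for a, b in zip(x, y))) ** degree

def composite_kernel(x, y, rho=0.7):
    """Assumed composite kernel: a convex combination of two Mercer kernels."""
    return rho * rbf(x, y) + (1.0 - rho) * poly(x, y)

def ldl_decompose(A):
    """Square-root-free Cholesky: A = L D L^T with unit lower triangular L."""
    n = len(A)
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    d = [0.0] * n
    for j in range(n):
        d[j] = A[j][j] - sum(L[j][k] ** 2 * d[k] for k in range(j))
        for i in range(j + 1, n):
            L[i][j] = (A[i][j] - sum(L[i][k] * L[j][k] * d[k] for k in range(j))) / d[j]
    return L, d

def ldl_solve(L, d, T):
    """Solve L D L^T alpha = T column by column."""
    n, m = len(L), len(T[0])
    alpha = [[0.0] * m for _ in range(n)]
    for c in range(m):
        f = [0.0] * n
        for i in range(n):  # forward substitution: L f = t
            f[i] = T[i][c] - sum(L[i][k] * f[k] for k in range(i))
        for i in reversed(range(n)):  # backward substitution: L^T alpha = D^(-1) f
            alpha[i][c] = f[i] / d[i] - sum(L[k][i] * alpha[k][c] for k in range(i + 1, n))
    return alpha

def train_weighted_kernel_elm(X, T, s, C=100.0):
    """Build A = S^(-1)/C + Omega and solve A alpha = T without square roots."""
    n = len(X)
    A = [[composite_kernel(X[i], X[j]) + (1.0 / (C * s[i]) if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    L, d = ldl_decompose(A)
    return ldl_solve(L, d, T)

def predict(x, X, alpha):
    """Kernel-form output f(x) = [K(x, x_1), ..., K(x, x_N)] alpha."""
    k = [composite_kernel(x, xi) for xi in X]
    return [sum(k[i] * alpha[i][c] for i in range(len(X))) for c in range(len(alpha[0]))]
```

A smaller weight `s[i]` enlarges the diagonal term `1 / (C * s[i])`, so the fit to that sample is relaxed; this is how the fuzzy weights shift the emphasis between classes. The factorization and both substitutions use only addition, subtraction, multiplication and division.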
To summarize, I2FELM scales better and learns much faster, and it works with a wide range of feature mappings and with less human intervention. The procedure is summarized in Algorithm 1.

IV. EXPERIMENT

A. DATASETS
In this experiment, two datasets are used to compare the experimental results.
The first dataset is a public dataset, namely the Apontador dataset [39]. This dataset was collected from a well-known Brazilian location-based social network and covers both normal users and spam users; each record contains 59 characteristics and one of 2 class labels.
The second dataset was harvested in this study using the Twitter API and the Twitter4J library. It covers 43 million tweets posted by around 16 million accounts and contains the daily popular trends in June 2017. The method of [40] is employed to label the Twitter spam and non-spam accounts in this self-collected dataset during the pre-processing phase for I2FELM. The method of [40] is a hybrid technique that combines a blacklist with algorithms fitted to social networks in order to identify spam and malicious tweets. To be specific, it concludes from the collected data that blacklisting, in conjunction with other analytical tools, can effectively identify malicious tweets. Accordingly, the spam can be blocked by blacklists. In the graphical approach, a set of users involved in a round-robin scheme yields a bipartite clique in the graph. Hence, bipartite cliques in such a graph are very suspicious, because the probability of real users behaving this way in the normal course of events is extraordinarily small. The blacklist is augmented with a clique-discovery approach that can also effectively identify spam. Finally, 0.81 million accounts have been identified and labeled as spam or non-spam, where each record contains 62 characteristics.

B. EXPERIMENTAL SETUP
The experiment is run in the MATLAB R2012b environment on a computer with 8 GB of RAM and a 2.40 GHz CPU. For each dataset, 60% of the samples are randomly selected as the training set and the remaining 40% as the testing set. The fuzzy weight of each data sample is determined by the imbalance ratio of the training set.
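The split and the weight assignment can be sketched as below. The 60/40 proportion comes from the setup above, but the weighting rule is only an assumption for illustration: the minority class receives weight 1 and the majority class receives the minority-to-majority ratio, floored at a small `delta`. The exact rule used in this study is not stated, and the names `split_train_test` and `fuzzy_weights` are hypothetical.

```python
import random

def split_train_test(samples, train_frac=0.6, seed=0):
    """Randomly split samples into a training part and a testing part."""
    rng = random.Random(seed)
    idx = list(range(len(samples)))
    rng.shuffle(idx)
    cut = int(round(train_frac * len(samples)))
    return [samples[i] for i in idx[:cut]], [samples[i] for i in idx[cut:]]

def fuzzy_weights(labels, delta=0.01):
    """Assumed rule: minority class -> 1, majority class -> max(delta, imbalance ratio)."""
    n_spam = sum(1 for y in labels if y == 1)
    n_ham = len(labels) - n_spam
    if n_spam == 0 or n_ham == 0:
        return [1.0] * len(labels)
    ratio = min(n_spam, n_ham) / max(n_spam, n_ham)
    minority = 1 if n_spam <= n_ham else 0
    return [1.0 if y == minority else max(delta, ratio) for y in labels]
```

With one spam sample for every four non-spam samples, for example, this rule gives each spam sample weight 1 and each non-spam sample weight 0.25, so every weight stays within the interval from `delta` to 1.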
The proposed I2FELM algorithm is a supervised machine learning algorithm. For this reason, and given previous research results, SVM, DT, RF, BP, RBF, ELM, and XGBoost are used as comparison algorithms, and the performance of I2FELM is assessed using the accuracy, true positive rate (TPR), precision and F-measure [41], [42]. The details are as follows: (1) An assessment matrix [43] is given in Table 2 as an effective way to assess the experimental results. In this matrix, the true positive (TP) counts the spams that are correctly classified, the false negative (FN) counts the spams that are misclassified as non-spams, the false positive (FP) counts the non-spams that are misclassified as spams, and the true negative (TN) counts the non-spams that are classified correctly.
(2) Under the assessment matrix, the accuracy, TPR, precision, and F-measure are used as a set of metrics to assess the effectiveness of SVM, DT, RF, BP, RBF, ELM, XGBoost, and I2FELM.
The accuracy is the percentage of correctly identified examples in the total number of examined examples, as defined by Eq. (34).
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{34}$$
The TPR is the ratio of correctly classified spams to the total number of actual spams, as defined by Eq. (35).
$$\text{TPR} = \frac{TP}{TP + FN} \tag{35}$$
The precision is the proportion of correctly classified spams in the total number of tweets that are classified as spams, as expressed in Eq. (36).
$$\text{Precision} = \frac{TP}{TP + FP} \tag{36}$$
The F-measure combines the precision and the TPR, as calculated by Eq. (37).
$$\text{F-measure} = \frac{2 \times \text{Precision} \times \text{TPR}}{\text{Precision} + \text{TPR}} \tag{37}$$
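The four metrics can be computed directly from the counts of the assessment matrix. The sketch below follows the verbal definitions given above and treats spam as the positive class; the function names and the convention of returning 0.0 for an empty denominator are assumptions made for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FN, FP and TN, treating `positive` (spam) as the positive class."""
    tp = fn = fp = tn = 0
    for t, p in zip(y_true, y_pred):
        if t == positive and p == positive:
            tp += 1
        elif t == positive:
            fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn

def metrics(tp, fn, fp, tn):
    """Accuracy, TPR, precision and F-measure from the assessment matrix counts."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total if total else 0.0
    tpr = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + tpr
    f_measure = 2 * precision * tpr / denom if denom else 0.0
    return {"accuracy": accuracy, "tpr": tpr, "precision": precision, "f_measure": f_measure}
```

On unbalanced data the accuracy can stay high even when most spam is missed, which is why the TPR, precision and F-measure are reported alongside it.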

C. EXPERIMENTAL RESULT AND COMPARISON
In the present section, three experiments are performed on the first and second datasets. All assessment results are obtained by repeating the identical experiment 10 times and averaging the values, which avoids accidental results. The specific experimental steps are shown in Fig. 2.

1) CLASSIFICATION RESULTS FOR BALANCED DATASETS
In the first experiment, balanced datasets are established from the first and second datasets to verify the performance of the eight algorithms.
As shown in Fig. 3 and Fig. 4, the eight algorithms are assessed on the first and second datasets using the accuracy, TPR, precision, and F-measure as metrics. On the balanced datasets, all eight algorithms perform well on every metric, and I2FELM achieves the best performance.

2) CLASSIFICATION RESULTS FOR UNBALANCED DATASETS
In the second experiment, the unbalanced datasets are constructed to verify the performance of the eight algorithms.
From each of the first and second datasets, unbalanced datasets are built in three proportions of spam to non-spam: 1:10, 1:20, and 1:50. The accuracy, TPR, precision, and F-measure are again used to evaluate the eight algorithms. The experimental results are shown in Fig. 5, 6, and 7.
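One possible way to construct such subsets is sketched below. The study does not describe how the unbalanced datasets were sampled, so the procedure here is an assumption: both classes are subsampled without replacement so that exactly `ratio` non-spam records are kept per spam record. The function name `make_unbalanced` and its arguments are hypothetical.

```python
import random

def make_unbalanced(spam, non_spam, ratio, seed=0):
    """Return (records, labels) with one spam record per `ratio` non-spam records."""
    rng = random.Random(seed)
    n_spam = min(len(spam), len(non_spam) // ratio)
    n_non = n_spam * ratio
    chosen_spam = rng.sample(spam, n_spam)
    chosen_non = rng.sample(non_spam, n_non)
    records = chosen_spam + chosen_non
    labels = [1] * n_spam + [0] * n_non
    order = list(range(len(records)))
    rng.shuffle(order)  # mix the two classes before splitting or training
    return [records[i] for i in order], [labels[i] for i in order]
```

Calling it with `ratio` set to 10, 20 and 50 on the same pools yields the three proportions, and fixing `seed` keeps the subsets reproducible across the repeated runs.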
As shown in Fig. 5, 6, and 7, the assessment accuracy of the SVM, DT, RF, BP, RBF, ELM, and XGBoost algorithms tends to decline as the imbalance rate of the dataset increases. The metric values of SVM and BP vary most noticeably. For instance, in the first dataset the accuracy of SVM and BP drops from about 0.7692 and 0.7731 at an imbalance rate of 10 to 0.4886 and 0.5135 at an imbalance rate of 50. DT, RF and RBF are the next most affected, and their rates of change are relatively close. For instance, the TPR of DT, RF and RBF decreases by 23%, 25% and 24%, respectively, and in the second dataset their accuracy falls from 0.8085, 0.8366 and 0.8168 at an imbalance rate of 10 to 0.5492, 0.5618 and 0.5594 at an imbalance rate of 50. The metrics of ELM and XGBoost also change with the imbalance rate. For instance, their accuracy drops by 30% and 26% as the imbalance rate rises from 10 to 50 in the first dataset, and the precision of XGBoost decreases from 0.8831 in the first dataset to 0.3985 in the second dataset. The proposed I2FELM follows the same trend, but its metric values change less. These results show that the assessment performance of the SVM, DT, RF, BP, RBF, ELM, and XGBoost algorithms on unbalanced datasets requires enhancement, whereas I2FELM exhibits better detection performance on unbalanced datasets because it introduces the fuzzy weight to address the unbalanced problem, which applies to each input and contributes to the learning of the output weights.

3) FEATURE SELECTION DETECTION RESULTS
In this experiment, the tweet feature set in Table 1 is taken to train the classification model.
The SVM, DT, RF, BP, RBF, ELM, XGBoost, and I2FELM algorithms are tested using the top N (N=10 or 20) characteristics and each type of characteristics in Table 1. The information entropy method [44] is adopted to calculate the top 10 and top 20 characteristics in Table 1. The results of 5 repeated experiments are averaged, and the classification results are listed in Table 3 and Table 4.
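An entropy-based ranking of characteristics can be sketched as follows. The details of the method in [44] are not given here, so this sketch substitutes a common variant as an assumption: each numeric feature is discretized into equal-width bins and features are ranked by information gain with respect to the class label. The function names and the default `bins=4` are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels, bins=4):
    """Information gain of one numeric feature after equal-width binning."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # a constant feature falls into a single bin
    binned = [min(int((v - lo) / width), bins - 1) for v in values]
    n = len(labels)
    conditional = 0.0
    for b in set(binned):
        subset = [y for v, y in zip(binned, labels) if v == b]
        conditional += len(subset) / n * entropy(subset)
    return entropy(labels) - conditional

def top_n_features(X, y, n):
    """Indices of the n features with the highest information gain."""
    gains = [(information_gain([row[j] for row in X], y), j) for j in range(len(X[0]))]
    gains.sort(key=lambda g: (-g[0], g[1]))  # ties are broken by feature index
    return [j for _, j in gains[:n]]
```

Selecting the top 10 or top 20 characteristics then amounts to calling `top_n_features(X, y, 10)` or `top_n_features(X, y, 20)` on the training data and keeping only those columns.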
The experiments show that a subset of the characteristics is sufficient to achieve good classification results. For instance, by selecting only 20 characteristics, the I2FELM method achieves an accuracy of more than 89% on both datasets, which is close to the classification result achieved with all characteristics. With only the 10 most important characteristics, it still achieves an accuracy of more than 81%, which is higher than the accuracy obtained with any single type of feature set. This indicates that, when identifying spam users in social networks, selecting characteristics comprehensively across types achieves more effective assessment results than selecting a single type of characteristics alone, and it also supports the effectiveness of feature selection with I2FELM.

V. CONCLUSION AND FUTURE WORK
This study presents a novel Twitter spam detection method in which the feature set consists of user attribute, content, activity and relationship characteristics in the online social network for identifying real spam. The spam assessment algorithm is I2FELM, which uses fuzzy weights to handle unbalanced data and thereby enhance accuracy. Furthermore, Cholesky factorization without square root and a composite kernel function are employed to enhance performance, and a reasonable number of hidden nodes can be determined automatically. The experimental validation shows that the proposed I2FELM can be applied to multi-dimensional balanced or unbalanced datasets and achieves high performance in assessing spam in the online social network.
In subsequent studies, the emphasis will be placed on the following research directions. First, more factors (e.g., semantic analysis and emotion analysis) will be considered to identify spam precisely. Second, we plan to use feature selection methods and oversampling [21], [28], [29] to select proper feature sets and improve model adaptation. Third, to address the lack of labeled data in social networks, semi-supervised learning will be incorporated into the I2FELM model to detect Twitter spam automatically based on a small amount of labeled data.