A Content-Based Method for Sybil Detection in Online Social Networks via Deep Learning

Online social networks (OSNs) are generally susceptible to Sybil attacks, which cause a series of cybersecurity problems and privacy violations. Malicious attackers can create massive numbers of Sybils and use those fake identities to launch various attacks. Therefore, Sybil detection in OSNs has become an urgent security research problem for both academia and industry. Existing content-based methods for detecting Sybils are based on a combination of manually designed features and machine learning algorithms, which requires considerable professional experience and human effort. These methods divide the Sybil detection problem into two separate sub-problems, which prevents them from reaching the optimal solution. In this work, we propose a novel content-based method to detect Sybils. The proposed method is an end-to-end classification model that extracts features directly from the input data and then outputs the classification results in a unified framework. The proposed method includes three main parts: first, a self-normalizing convolutional neural network (CNN) is adopted to extract lower features from the multi-dimensional input data; second, a bidirectional self-normalizing LSTM network (bi-SN-LSTM) is developed to extract higher features from the compressed feature map sequence; third, a dense layer and a softmax classifier are stacked to output the classification results. Unlike the traditional bidirectional long short-term memory network (bi-LSTM), the proposed bi-SN-LSTM network uses SELU as the activation function of its recurrent step, which provides unbounded changes to the state value. In a case study on a real-world dataset, comparison experiments demonstrate that our method significantly outperforms other state-of-the-art methods.

Sybil detection in OSNs is an urgent problem that needs to be addressed, and it has already become a focus for both academia and industry.
Existing Sybil detection methods fall into two main categories: structure-based methods and content-based methods. Content-based methods represent each user with multi-dimensional features, which can be collected from user profiles and the structure of each user's local subgraph. Then, the feature vectors corresponding to labeled users are concatenated to form the final training dataset. Finally, a binary classifier is trained to predict the labels of users who are not used for model training. In short, the core idea of content-based methods is to treat Sybil detection as a binary classification problem solved with machine learning classifiers. The general procedure of content-based methods decomposes into two sub-problems: feature engineering and model training. The purpose of feature engineering is to manually design features and select effective ones. To improve the performance of the classifier, this sub-problem requires not only professional experience to design high-quality features but also human effort to verify the contribution of the selected features. Model training, in turn, obtains the classification model (e.g., a random forest classifier) from the selected features, which also requires human effort to carefully select a classification model that outputs convincing results. Both sub-problems directly affect the classification performance of content-based methods. Consequently, traditional content-based methods depend heavily on human intervention, which leads to an increase in classification errors and a high consumption of human effort.
To address these problems, we propose an end-to-end classification model, which is a novel content-based method for Sybil detection in OSNs. The purpose of designing an end-to-end model is to combine feature engineering and model training into a unified model. Our model automatically extracts and learns features directly from the raw input data. In other words, the proposed model significantly reduces the human effort spent on manually designing, selecting, and verifying features. The architecture of the proposed model is composed of three main parts. First, our model automatically extracts lower features of the input data (i.e., the multi-dimensional input feature information) with a self-normalizing CNN. The extracted lower features form a feature map sequence, where each feature map corresponds to a different user in the original dataset. Second, because RNN-based methods are well suited to the analysis of sequence data, we utilize the bidirectional SN-LSTM (bi-SN-LSTM) network to further extract hidden features. The proposed bi-SN-LSTM network summarizes the feature map sequence in both directions (i.e., forward and backward) to extract higher features, e.g., the correlation information between users. By using SELU as the activation function for the recurrent step, the proposed bi-SN-LSTM outperforms other commonly utilized RNN-based methods, as demonstrated in Section V, which also demonstrates the effectiveness of the proposed method. Finally, the extracted feature information is fed into a dense layer and a softmax classifier to obtain the classification results.
In summary, the features for classification can be extracted and learned automatically from the raw input data by the end-to-end model. As demonstrated in Section V, the lower features and higher features jointly improve the classification performance. The hierarchical architecture not only provides our model with excellent feature extraction capability but also improves the generalization of the classification model. To the best of our knowledge, our work is the first attempt to apply a hierarchical deep learning (DL)-based method to Sybil detection in OSNs. Moreover, it is worth noting that we utilize the combination of SELU and alpha dropout to preserve the self-normalizing property throughout the training process. With the same effect as normalization techniques (e.g., batch normalization, weight normalization, and layer normalization), the self-normalizing property enables our model to robustly learn many layers without adding any normalization layers, which simplifies the model structure and enables our model to reach promising classification results when fed with large datasets from real-world OSNs.
The main contributions of this paper are summarized as follows:
1. We propose a content-based end-to-end classification model to detect Sybils in OSNs. The proposed model utilizes a self-normalizing CNN and a bidirectional SN-LSTM (bi-SN-LSTM) to automatically extract lower and higher features from the input data, respectively. The end-to-end architecture of our model significantly reduces human effort and improves classification results.
2. A bidirectional SN-LSTM network (bi-SN-LSTM) is proposed to extract and generate higher features from the feature map sequence. The experimental results show that bi-SN-LSTM is more effective than other commonly utilized RNN-based methods such as bi-GRU and bi-LSTM.
3. The proposed method is evaluated on the MIB dataset, which is a real-world dataset for Sybil detection in OSNs. Experimental results show that our model outperforms several state-of-the-art content-based methods.
The rest of the paper is organized as follows. Section II discusses related work. Section III describes the problem definition and other preliminaries. Section IV formally describes the architecture of the proposed model in detail. Section V presents a general introduction to the dataset, the details of the comparison experiments, the experimental results, an ablation analysis, and further analysis of the proposed model. Finally, Section VI presents the conclusion and future work.
38754 VOLUME 8, 2020

II. RELATED WORK
Research on Sybil detection in OSNs has been extensive. In this section, we introduce the structure-based methods and content-based methods in this field.

A. STRUCTURE-BASED METHODS
Structure-based methods are mainly based on the structure of the social graph. Those methods distinguish Sybils from human users by analyzing the edges and links of the graph. Yu et al. proposed a decentralized protocol, SybilGuard [7], to identify Sybil nodes and limit the influence of Sybil attacks by utilizing random routes. On the basis of SybilGuard, Yu et al. proposed a decentralized protocol that uses the same insight as SybilGuard but with different random walk-based (RW-based) methods in work [8]. Danezis and Mittal proposed a centralized Sybil detection algorithm, which obtains each node's degree of certainty by calculating its probability of being a Sybil with a Bayesian inference approach [9]. Tran designed Sybil-resilient to improve SybilLimit under the assumption that large-scale social networks can be considered random expander graphs [10]. Yang proposed a new tool in work [11] that relies on social graph properties to rank users based on their perceived likelihood of being Sybils. Furthermore, work [12] further thwarts Sybils by reducing the social edges on users that have received negative feedback, and it reached better detection accuracy than work [11]. Wei et al. proposed a mechanism that uses network topologies and RW-based methods to defend against Sybil attacks in OSNs [13].
Xue et al. proposed a system that further leverages user interaction information [14]. They evaluated whether a user is a Sybil by utilizing trust-based vote assignment and global vote aggregation, and the system outperformed multiple existing ranking systems in a real-world OSN environment. Yang et al. further utilized user-level activities in work [15]. They proposed a voting-based Sybil detection method to identify Sybils and to detect Sybil communities in order to find more related Sybils. Boshmaf et al. designed a scalable defense system that makes better use of each node's features to detect automated Sybils in work [16], achieving a precision (95%) more than twice that of SybilRank (43%). To lower the running time complexity and reduce the dependency on the proper choice of known trusted nodes, Misra et al. proposed a Sybil community detection algorithm based on the communities' apparent likelihood of being Sybil [17]. In work [19], Bansal et al. (2016) modified multiple structure-based methods by making use of the trust value of each user, which showed a decrease of around 14% in the false-positive rate compared with [7] and [18]. Wang et al. [20] utilized some advantages of both loopy belief propagation-based (LBP-based) and RW-based methods to make their method orders of magnitude more scalable than a semi-supervised learning method [21]. Zhang et al. utilized users' activities and detected Sybils by coupling three RW-based algorithms in work [22]. The structure-based model proposed in work [2] takes advantage of both RW-based and LBP-based methods to outperform some existing methods, i.e., SybilRank [11] and SybilBelief [21].

B. CONTENT-BASED METHODS
Content-based mechanisms mainly rely on machine learning (ML) methods. ML has been successfully applied in many frontier research areas, and researchers focus on using ML classifiers to better utilize the feature information of users in OSNs. There have been several attempts to apply ML to the Sybil detection problem in OSNs. Wang et al. proposed a server-side clickstream model that groups user clickstreams that are close to each other into behavioral clusters [23]. However, the false-negative rate rose when their model was tested with an unbalanced training dataset in which the number of Sybils was reduced. Alsaleh et al. built several classification models based on user features from Twitter, utilizing four different ML algorithms: decision tree (C4.5), random forest, SVM, and multilayer neural network [24]. Kang et al. combined the features of different users with the network reliability [25] and constructed a user discriminant formula to identify Sybils in OSNs, with the false-positive rate fluctuating between 3.74% and 14.96%. Xia et al. proposed an attribute credibility-based Sybil detection approach in work [26]. They calculated a user's credibility by using the Euclidean distance between the center of the Sybils' attributes and the user's attributes, and then utilized the credibility as a key parameter to classify Sybils. After analyzing human user trajectories and the behavior of Sybil attacks, Xu et al. designed a Bloom filter-based Sybil detection method, which makes better use of user location data [27]. In work [28], Mulamba et al. leveraged multi-dimensional feature information summarized from the structure of the OSN graph to build ML classifiers, e.g., AdaBoost and KNN.
Al-Qurishi et al. further utilized the feature information of users and built a prediction model with a deep neural network, which was also the first time that a deep learning (DL) method was applied to the field of Sybil detection [29]. Since OSNs are growing rapidly, the lack of robustness of traditional methods allows Sybils to bypass detection by mimicking human users. In this paper, we attempt to leverage a hierarchical DL network structure to better utilize the multi-dimensional feature information of users and minimize detection errors for Sybils. The proposed method takes advantage of well-designed feature extractors to improve the accuracy of Sybil detection.

III. PRELIMINARIES
In this section, we first give the definition of the Sybil detection problem in OSNs. Then, the convolutional neural network, the vanilla long short-term memory (LSTM) network, the scaled exponential linear unit (SELU), and alpha dropout are introduced briefly.

A. PROBLEM DEFINITION
The Sybil detection problem in this paper is to accurately detect the Sybils in OSNs with an end-to-end binary classifier based on deep learning. Every user is represented by data composed of multi-dimensional features. Assume that there are a total of $M$ users and that each user has $N$ different features. We utilize different feature vectors to represent different users and let the $i$-th user be $P_i = [P_i^1, P_i^2, \ldots, P_i^N]$, where $P_i^p$ is the value of the $p$-th feature of $P_i$. The real label of $P_i$ is denoted as $T_i$, with $T_i = 1$ when $P_i$ is a Sybil and $T_i = 0$ when $P_i$ is a human (i.e., a benign user). To solve this classification problem, we build an end-to-end hierarchical model $\tau(P_i)$ to predict a label $\hat{T}_i$ that matches the real label $T_i$.

B. CNN MODEL
The Convolutional Neural Network (CNN) is one of the most popular neural networks and has been successfully applied in many different research areas, such as image data augmentation [30], object detection [31], and text classification [32]. Features are critical components of any machine learning model, especially of deep learning methods, and the performance of a CNN model is greatly affected by the quality of the extracted feature information. Therefore, the CNN model adopts multiple strategies (e.g., convolution and max pooling) to extract features and to enhance computing efficiency. In the remainder of Section III-B, we summarize the commonly utilized 2-dimensional (2D) convolution operation and max pooling, respectively.
To obtain an output feature map with the convolution operation, the learnable kernels of the current layer are convolved with the previous layer's feature maps, an additive bias is added, and the result is passed through the activation function. Moreover, each output feature map may result from the sum of distinct kernels convolved with multiple input feature maps. The convolution operation for each feature map is defined as follows,
$$m_t = f\left(\sum m_{i,j} * k + b\right), \quad (1)$$
where $m_{i,j}$ represents the sub-region of the input feature map with the size of $i \times j$, $m_t$ denotes the $t$-th output feature map of the corresponding input, $f(\cdot)$ denotes the output activation function, and $k$ and $b$ denote the trainable kernel and bias, respectively. Max pooling down-samples the input feature maps to accelerate convergence and to reduce the computational complexity. Each input feature map from the previous layer is spatially divided into multiple sub-regions by fixed-size max pooling windows. Then, the pooling operation takes the maximum value of each sub-region to form an output feature map with lower dimensionality, as shown in Eq. (2),
$$m^n_{i,j} = \max_{0 \le k < H,\ 0 \le l < W} m^n_{i+k,\,j+l}, \quad (2)$$
where $m^n_{i+k,j+l}$ on the right-hand side represents the value at position $(i+k, j+l)$ in the $n$-th input 2D feature map, $m^n_{i,j}$ on the left-hand side denotes the value at position $(i, j)$ in the $n$-th output 2D feature map, and $H$ and $W$ denote the height and width of the max pooling windows, respectively.
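The two operations summarized above can be illustrated with a minimal pure-Python sketch. It is not the implementation used in this paper: the function names `conv2d_valid` and `max_pool2d` are hypothetical, a single input feature map and a single kernel are assumed, no padding is applied, and the pooling windows are assumed not to overlap.

```python
def conv2d_valid(feature_map, kernel, bias=0.0, activation=lambda v: v):
    """Slide `kernel` over `feature_map` (stride 1, no padding), add `bias`,
    and apply `activation`. As in most deep learning libraries, the kernel
    is not flipped (i.e., this is a cross-correlation)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(feature_map) - kh + 1
    out_w = len(feature_map[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            s = sum(feature_map[i + k][j + l] * kernel[k][l]
                    for k in range(kh) for l in range(kw))
            row.append(activation(s + bias))
        out.append(row)
    return out


def max_pool2d(feature_map, pool_h=2, pool_w=2):
    """Keep the largest value of each non-overlapping pool_h x pool_w window."""
    out = []
    for i in range(0, len(feature_map) - pool_h + 1, pool_h):
        row = []
        for j in range(0, len(feature_map[0]) - pool_w + 1, pool_w):
            row.append(max(feature_map[i + k][j + l]
                           for k in range(pool_h) for l in range(pool_w)))
        out.append(row)
    return out


fm = [[1, 2, 3, 4],
      [5, 6, 7, 8],
      [9, 10, 11, 12],
      [13, 14, 15, 16]]
pooled = max_pool2d(fm)                          # [[6, 8], [14, 16]]
convolved = conv2d_valid(fm, [[1, 0], [0, 1]])   # 3 x 3 output
```

A (2, 2) window halves both spatial dimensions, which is the source of the reduction in computation mentioned above.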

C. VANILLA LSTM MODEL
The vanilla Long Short-Term Memory network (vanilla LSTM) is one of the most popular variants of the Recurrent Neural Network (RNN) for modeling sequences. It mitigates the vanishing gradient problem by extracting and utilizing long-term dependencies. A single unit of the vanilla LSTM network contains three gates (i.e., input gate, output gate, and forget gate) and a memory cell to control the flow of information. The forget gate enables the vanilla LSTM to reset its state as follows,
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f),$$
where $W_f$ is the trainable input weight, $U_f$ represents the recurrent weight, $b_f$ is the bias weight to be learned, $x_t$ denotes the input at time step $t$, and $h_{t-1}$ is the hidden state at the previous time step. $\sigma(x) = 1/(1+\exp(-x))$ is the sigmoid function used as the gate activation function, with output in $[0,1]$. The input gate determines which information should enter the long-term memory, and its output at time step $t$ is computed by
$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i),$$
where $W_i$, $U_i$, and $b_i$ denote trainable variables. Next, the tanh function is applied to create a vector that can be added to the cell state as follows,
$$l_t = \tanh(W_l x_t + U_l h_{t-1} + b_l),$$
where $W_l$, $U_l$, and $b_l$ are also learnable parameters, and $\tanh(x) = (e^{x} - e^{-x})/(e^{x} + e^{-x})$ is the hyperbolic tangent function, with output in $[-1,1]$. The update of the previous cell state $C_{t-1}$ can be computed from the three outputs above, that is:
$$C_t = f_t \odot C_{t-1} + i_t \odot l_t,$$
where $C_t$ denotes the updated cell state and $\odot$ represents the element-wise product. Hence, the hidden state can be updated as follows,
$$h_t = o_t \odot \tanh(C_t),$$
where $o_t$ is the output of the output gate at time step $t$, given by
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o),$$
where $W_o$, $U_o$, and $b_o$ are also trainable variables. Therefore, we describe the update process briefly as:
$$h_t = \mathrm{LSTM}(x_t, h_{t-1}),$$
where $h_t$ is the output of the LSTM cell at time step $t$ and $\mathrm{LSTM}(\cdot)$ summarizes the above equations.
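One time step of the vanilla LSTM described above can be sketched in pure Python as follows. This is an illustrative sketch rather than the authors' code: the names `lstm_step`, `affine`, and the `params` dictionary layout (one `(W, U, b)` triple per gate, keyed by `"f"`, `"i"`, `"o"`, `"l"`) are assumptions made for the example.

```python
import math


def sigmoid(v):
    """Gate activation with output in [0, 1]."""
    return 1.0 / (1.0 + math.exp(-v))


def affine(W, U, b, x, h_prev):
    """Compute W x + U h_prev + b for list-of-lists matrices."""
    return [sum(w * xi for w, xi in zip(W_row, x))
            + sum(u * hi for u, hi in zip(U_row, h_prev)) + b_i
            for W_row, U_row, b_i in zip(W, U, b)]


def lstm_step(params, x, h_prev, c_prev):
    """One vanilla LSTM step; returns the new hidden state and cell state."""
    f = [sigmoid(v) for v in affine(*params["f"], x, h_prev)]    # forget gate
    i = [sigmoid(v) for v in affine(*params["i"], x, h_prev)]    # input gate
    o = [sigmoid(v) for v in affine(*params["o"], x, h_prev)]    # output gate
    l = [math.tanh(v) for v in affine(*params["l"], x, h_prev)]  # candidate
    c = [ft * cp + it * lt for ft, cp, it, lt in zip(f, c_prev, i, l)]
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c


# Toy usage with one input feature and one hidden unit; all weights are zero,
# so every gate outputs 0.5 and the candidate value is 0.
zero_params = {g: ([[0.0]], [[0.0]], [0.0]) for g in "fiol"}
h, c = lstm_step(zero_params, x=[1.0], h_prev=[0.0], c_prev=[2.0])
```

With all gates at 0.5 and a zero candidate, the cell state is simply halved, which makes the role of the forget gate easy to see.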

D. SELU
The Scaled Exponential Linear Unit (SELU) [33] is a continuous curve that has both positive and negative values for controlling the mean, a slope larger than one to increase the variance when it is too small, and saturation regions to reduce the variance when it is too large. When SELU is applied as the activation function of a hidden layer, it can automatically normalize the output of the corresponding layer to a fixed mean and variance. The SELU activation function is expressed as:
$$\mathrm{SELU}(x) = \lambda \begin{cases} x, & x > 0 \\ \alpha e^{x} - \alpha, & x \le 0 \end{cases}$$
where the hyperparameters $\alpha = 1.6732632$ and $\lambda = 1.0507009$ are derived, not trained, to keep the mean and variance of the output of the hidden layers at 0 and 1, respectively.
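As a small illustration of the definition above, the following sketch implements SELU with the stated constants and checks empirically that standard-normal inputs keep a mean near 0 and a variance near 1 after activation. The helper name `selu_moments`, the sample size, and the random seed are arbitrary choices for this example.

```python
import math
import random

ALPHA = 1.6732632
LAMBDA = 1.0507009


def selu(x):
    """Scaled exponential linear unit with the constants from [33]."""
    return LAMBDA * x if x > 0 else LAMBDA * ALPHA * (math.exp(x) - 1.0)


def selu_moments(n_samples=200_000, seed=0):
    """Empirical mean and variance of SELU(z) for z drawn from N(0, 1)."""
    rng = random.Random(seed)
    ys = [selu(rng.gauss(0.0, 1.0)) for _ in range(n_samples)]
    mean = sum(ys) / n_samples
    var = sum((y - mean) ** 2 for y in ys) / n_samples
    return mean, var


mean, var = selu_moments()
```

For large negative inputs the function saturates at $-\lambda\alpha \approx -1.758$, which is the value that alpha dropout later uses in place of zero.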

E. ALPHA DROPOUT
Alpha dropout [33] is a variant of dropout used to prevent over-fitting while training neural networks. It is a regularization technique that keeps the mean and variance of the inputs' distribution at their original values. With the dropout variable $h$ following a binomial distribution $B(1, q)$, the mean and variance after alpha dropout are expressed as Eq. (11) and Eq. (12), respectively, where $a$ and $b$ are two parameters that depend only on the dropout rate $1 - q$ and the negative saturation value $\alpha'$.
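The mechanism can be sketched as follows, assuming inputs with zero mean and unit variance and using the closed-form affine parameters $a$ and $b$ given for that case in [33]. The function name `alpha_dropout` and the keep probability of 0.9 are illustrative assumptions, not the settings of this paper.

```python
import math
import random

ALPHA = 1.6732632
LAMBDA = 1.0507009
ALPHA_PRIME = -LAMBDA * ALPHA  # negative saturation value of SELU


def alpha_dropout(values, keep_prob, rng):
    """Replace each value by ALPHA_PRIME with probability 1 - keep_prob,
    then apply a*x + b so that zero-mean, unit-variance inputs keep
    mean 0 and variance 1 (affine parameters as given in [33])."""
    a = (keep_prob + ALPHA_PRIME ** 2 * keep_prob * (1.0 - keep_prob)) ** -0.5
    b = -a * (1.0 - keep_prob) * ALPHA_PRIME
    return [a * (v if rng.random() < keep_prob else ALPHA_PRIME) + b
            for v in values]


rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(200_000)]
ys = alpha_dropout(xs, keep_prob=0.9, rng=rng)
mean = sum(ys) / len(ys)
var = sum((y - mean) ** 2 for y in ys) / len(ys)
```

Unlike standard dropout, dropped units are set to the saturation value rather than to zero, and the affine correction restores the original mean and variance.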

IV. PROPOSED METHOD
Our proposed method is composed of data preprocessing and the end-to-end classification model as illustrated in Fig. 1.
The raw input data is preprocessed and then fed into our proposed end-to-end model. The proposed model is a hierarchical model that consists of multiple layers to extract features and to output the classification results. The feature extraction part is composed of the self-normalizing CNN and the proposed bi-SN-LSTM, which extract lower features and higher features, respectively. The output features of the bi-SN-LSTM network are further fed into the stacked dense layer to seek a higher-level representation. Finally, the softmax classifier is adopted to obtain the classification result of the corresponding user. In the remainder of this section, we present the details of our proposed method.

A. DATA PREPROCESSING
In real-world OSNs, the multi-dimensional data of users have inputs of different scales. Such unequally weighted input features may lead to an unstable model with a higher generalization error. Applying a proper feature scaling method transfers the raw data to a common scale and avoids this problem. Moreover, it can also speed up the learning and convergence of the proposed model. To enhance the self-normalizing property, we adopt z-score normalization [34] in this paper so that the resulting distribution has a mean of 0 and a standard deviation of 1, expressed as below:
$$\hat{x}^{(m,n)} = \frac{x^{(m,n)} - \mu_n}{\sigma_n}, \quad (13)$$
where $n$ denotes the $n$-th input feature, $m$ denotes the $m$-th user of the input data, $x^{(m,n)}$ denotes the raw input data, $\hat{x}^{(m,n)}$ represents the normalized input data, and $\mu_n$ and $\sigma_n$ denote the mean and standard deviation of the $n$-th feature, respectively. To better extract features from the normalized input data with 2-dimensional (2D) convolution, we transform each user's input data from 1D to 2D, as expressed in Eq. (14) and Eq. (15), where Eq. (14) defines the function composition, Eq. (15) denotes the transformation step, $H \times W$ is equal to the total number of input features, and $m_i$ represents the input feature map corresponding to the $i$-th user.
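The preprocessing steps above can be sketched as follows. The helper names `zscore_columns` and `to_2d` are hypothetical, the row-major layout of the 2D feature map is an assumption (the text only requires that $H \times W$ equal the number of features), and the guard for constant features is added for the sketch only.

```python
import math


def zscore_columns(rows):
    """Standardize each feature (column) to mean 0 and standard deviation 1.
    `rows` holds one feature vector per user."""
    n_users, n_feats = len(rows), len(rows[0])
    means = [sum(r[n] for r in rows) / n_users for n in range(n_feats)]
    stds = [math.sqrt(sum((r[n] - means[n]) ** 2 for r in rows) / n_users)
            for n in range(n_feats)]
    # Constant features (std == 0) are mapped to 0.0; this guard is an
    # assumption of the sketch, not something stated in the text.
    return [[(r[n] - means[n]) / stds[n] if stds[n] > 0 else 0.0
             for n in range(n_feats)] for r in rows]


def to_2d(vector, height, width):
    """Reshape a 1D feature vector into a height x width map (row-major)."""
    if height * width != len(vector):
        raise ValueError("height * width must equal the number of features")
    return [vector[i * width:(i + 1) * width] for i in range(height)]


raw = [[1.0, 10.0, 5.0, 0.0],
       [3.0, 30.0, 5.0, 2.0]]
normalized = zscore_columns(raw)
feature_map = to_2d(normalized[0], height=2, width=2)
```

After this step every feature contributes on the same scale, and each user is represented by a small 2D map that a 2D convolution can process.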

B. PROPOSED END-TO-END MODEL
The proposed end-to-end architecture is shown in Fig. 1. Our model is composed of three parts as follows:
1. Utilizing the self-normalizing CNN to automatically extract lower features from the normalized input data. The self-normalizing property of our model is implemented by adopting the combination of SELU and alpha dropout throughout the training process (i.e., keeping the mean and variance of the corresponding hidden layer's output at 0 and 1, respectively).
2. Leveraging the proposed bi-SN-LSTM network to summarize the feature map sequence in two directions (i.e., forward and backward) to extract higher features (e.g., the correlation between users).
3. Stacking a dense layer to learn the representation extracted in step 2 and then leveraging the softmax classifier to output the classification results.
The last dense layer functions as the classification layer of our model, and we utilize the softmax classifier to generate the classification of different users. The overall performance of our model is improved by taking advantage of both lower feature information (i.e., the multi-dimensional input features of different users) and higher feature information (i.e., the correlation among different users that is hidden in the input data). The necessity and efficiency of utilizing both lower and higher features will be demonstrated in Section V. We adopt the convolutional layer and the max pooling layer to automatically extract the feature information of every single user. The CNN structure is a suitable feature extractor for lower features because it can compress the multi-dimensional input data and form the feature map sequence, which can be further fed into the bi-SN-LSTM network to summarize higher features.
A large number of trainable parameters enables our model to output the desired classification results, but at the same time, it increases the risk of over-fitting. To address this problem, we utilize alpha dropout as the regularization method of the proposed model instead of the standard dropout [35]. The standard dropout method prevents the corresponding layer in our model from keeping both the mean and the variance at the desired values. It also does not fit well with SELU, because for SELU the default, low-variance value is $\lim_{x \to -\infty} \mathrm{SELU}(x) = -\lambda\alpha = \alpha'$ [33] rather than zero. In contrast, alpha dropout fits well with SELU by randomly setting inputs to $\alpha'$, which preserves the self-normalizing property of the corresponding layer in our classification model.
In the first part of our proposed model, the inputs (i.e., $P = [m_1, m_2, m_3, \ldots, m_M]$) are fed into the CNN structure to automatically extract each user's feature information by utilizing the combination of the convolutional layer and the max pooling layer. To keep the mean and variance of the input feature maps unchanged, we adopt SELU as the activation function in the convolutional layer to output the feature map, as shown in Eq. (16). Moreover, SELU can efficiently stabilize the training process of the CNN without additional layers (e.g., Batch Normalization). Then, the max pooling layer is leveraged to summarize the features in different sub-regions of the output feature maps, which improves the convergence speed of our model and also reduces the computational cost.
The output of the CNN structure constitutes a sequence of feature maps, each of which represents the lower feature extraction result of the corresponding user. Then, we utilize an LSTM-based network to summarize higher features from the sequence of feature maps. The commonly utilized vanilla LSTM leverages tanh as the activation function in the recurrent step to determine the candidate value that is added to the cell state. Its output is bounded in $[-1,1]$, so it can apply both positive and negative changes to the state value. Compared with tanh, the ReLU activation function is less prone to the vanishing gradient problem, is faster to execute, and can also induce sparseness. However, ReLU has an unbounded positive output that may let the hidden state grow exponentially large, and it cannot decrease the state because it cannot output strictly negative values. Inspired by work [36], which opened up possibilities for replacing the tanh activation function with an unbounded activation function in the LSTM-based network, we leverage SELU as the activation function of the recurrent step in the proposed SN-LSTM model. It not only retains the advantages of ReLU but also provides a negative output to decrease the cell state. The unbounded changes to the state value provided by SELU improve the classification performance, which will be further demonstrated in Section V. The proposed unit architecture of SN-LSTM is shown in Fig. 2. It can be observed that the SN-LSTM unit is composed of three gates and a memory cell. The hidden state $h_t$ of the SN-LSTM at time step $t$ is updated by the gate and state equations of the unit, in which $W_l$, $W_i$, $W_f$, $W_o$ and $b_l$, $b_i$, $b_f$, $b_o$ are input weights and bias weights, respectively, $U_l$, $U_i$, $U_f$, $U_o$ are recurrent weights, and SELU and $\sigma$ are two element-wise nonlinear activation functions. $i_t$, $f_t$, $C_t$, $o_t$, and $h_t$ represent the input gate, the forget gate, the cell state, the output gate, and the hidden state at time step $t$, respectively. $l_t$ denotes the intermediate state that is used to update the cell state.
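A hedged sketch of one SN-LSTM step is given below. It follows the description above in using only $\sigma$ and SELU as nonlinearities, but the exact placement is an assumption of this sketch: SELU is used both for the intermediate state $l_t$ and for the cell-state output, mirroring the positions of tanh in the vanilla LSTM. The names `sn_lstm_step`, `affine`, and the `params` layout are hypothetical.

```python
import math

ALPHA = 1.6732632
LAMBDA = 1.0507009


def selu(x):
    return LAMBDA * x if x > 0 else LAMBDA * ALPHA * (math.exp(x) - 1.0)


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def affine(W, U, b, x, h_prev):
    """Compute W x + U h_prev + b for list-of-lists matrices."""
    return [sum(w * xi for w, xi in zip(W_row, x))
            + sum(u * hi for u, hi in zip(U_row, h_prev)) + b_i
            for W_row, U_row, b_i in zip(W, U, b)]


def sn_lstm_step(params, x, h_prev, c_prev):
    """One SN-LSTM step under the assumption that SELU replaces tanh
    wherever the vanilla LSTM would use it."""
    i = [sigmoid(v) for v in affine(*params["i"], x, h_prev)]  # input gate
    f = [sigmoid(v) for v in affine(*params["f"], x, h_prev)]  # forget gate
    o = [sigmoid(v) for v in affine(*params["o"], x, h_prev)]  # output gate
    l = [selu(v) for v in affine(*params["l"], x, h_prev)]     # intermediate state
    c = [ft * cp + it * lt for ft, cp, it, lt in zip(f, c_prev, i, l)]
    h = [ot * selu(ct) for ot, ct in zip(o, c)]
    return h, c


# Toy usage: zero gate weights (all gates output 0.5) and a candidate bias of
# 5.0, so the intermediate state exceeds the [-1, 1] range that tanh allows.
params = {g: ([[0.0]], [[0.0]], [0.0]) for g in "fio"}
params["l"] = ([[0.0]], [[0.0]], [5.0])
h, c = sn_lstm_step(params, x=[1.0], h_prev=[0.0], c_prev=[0.0])
```

In this toy run the cell state grows beyond 1 in a single step, which a tanh candidate could not produce; a negative pre-activation would instead push the state down, which ReLU could not do.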
All users in the OSNs are divided into two groups: Sybils and humans. In real-world OSNs, many correlations between users belonging to the same group and between users belonging to different groups are hidden in the input data. To fully extract and utilize these correlations to improve the performance of the classification model, we utilize two separate hidden layers of the bidirectional SN-LSTM (bi-SN-LSTM) to summarize the sequence of feature maps from both directions (i.e., forward and backward). We leverage bi-SN-LSTM instead of the single-direction SN-LSTM because only the bidirectional network can extract the feature information of correlations between a user and all other users. In other words, if we only apply the forward (backward) SN-LSTM, then we can only summarize the feature information between a user and the other users in front of (behind) it in the input sequence. The lack of some higher feature information leads to insufficient feature extraction and has an adverse impact on the classification results, which will be further demonstrated in Section V. We therefore feed the sequence of feature maps $[m_1, m_2, m_3, \ldots, m_T]$ (i.e., the output of the self-normalizing CNN structure) into the bi-SN-LSTM to extract higher features, as shown in Fig. 3. The bi-SN-LSTM contains a forward SN-LSTM network and a backward SN-LSTM network, and we describe the forward and backward processes at time step $t$ as follows,
$$\overrightarrow{h}_t = \overrightarrow{\mathrm{SN\text{-}LSTM}}(m_t, \overrightarrow{h}_{t-1}), \qquad \overleftarrow{h}_t = \overleftarrow{\mathrm{SN\text{-}LSTM}}(m_t, \overleftarrow{h}_{t+1}),$$
where $\rightarrow$ and $\leftarrow$ denote the forward and backward layers of the proposed bi-SN-LSTM network, respectively. The outputs of the CNN structure are fed into the bi-SN-LSTM network to obtain the forward hidden state $\overrightarrow{h}_t$ and the backward hidden state $\overleftarrow{h}_t$, respectively. Then, the concatenation of $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$ is adopted to obtain the complete hidden state $h_t$ of the proposed bi-SN-LSTM network at time step $t$, shown as below,
$$h_t = [\overrightarrow{h}_t, \overleftarrow{h}_t],$$
where $t = 1, 2, \ldots, T$, and $h_t$ contains higher feature information from both directions. To output the extraction result of higher features, the hidden states from both directions are collected into a matrix as below,
$$H_o = [h_1; h_2; \ldots; h_T],$$
where each row of $H_o$ represents the higher feature extraction result of the corresponding user. Then, $H_o$ is fed into the dense layer to seek a higher-level representation of the feature extraction results of each sample. In a dense layer, each neuron is connected to, and receives input from, all the neurons of the previous layer. The input vector passes through the dense layer to summarize the features extracted by the previous layers of our model. This process can be expressed as:
$$o_i = \mathrm{SELU}(W_i H_o^i + b_i),$$
where $H_o^i$ denotes the $i$-th row of $H_o$ (i.e., the feature vector that contains both the lower and higher feature extraction results of the $i$-th user), $o_i$ represents the output vector of the dense layer, the trainable weight matrix and bias vector are denoted as $W_i$ and $b_i$, respectively, and $\mathrm{SELU}(\cdot)$ indicates that SELU is used as the activation function of the dense layer. Then, another layer of alpha dropout is utilized to reduce the risk of over-fitting and to work with SELU to preserve the self-normalizing property of the proposed model.
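The bidirectional summarization described above can be sketched independently of the recurrent unit. In the sketch below, `bidirectional` accepts any step function with the signature `step(x, h, c) -> (h, c)`; the toy running-sum step, the zero initial states, and all names are assumptions made for illustration, not the SN-LSTM unit itself.

```python
def bidirectional(step, inputs, hidden_size):
    """Run `step` over `inputs` in both directions and concatenate the
    forward and backward hidden states at every position."""
    def run(seq):
        h = [0.0] * hidden_size  # zero initial states (an assumption)
        c = [0.0] * hidden_size
        states = []
        for x in seq:
            h, c = step(x, h, c)
            states.append(h)
        return states

    forward = run(inputs)
    backward = run(inputs[::-1])[::-1]  # re-align with the original order
    return [f + b for f, b in zip(forward, backward)]


def toy_step(x, h, c):
    """Stand-in recurrent unit: the state is the running sum of the inputs."""
    new_c = [c[0] + x[0]]
    return new_c, new_c


H_o = bidirectional(toy_step, [[1.0], [2.0], [3.0]], hidden_size=1)
# Each row holds [forward state, backward state] for one position.
```

With the running-sum stand-in, the forward half of each row depends only on the users up to that position and the backward half only on the users from that position onward, which is why both directions are needed to relate a user to all others.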
Finally, the output features of the dense layer are fed into the softmax classifier, as expressed in Eq. (23), to obtain the distribution over the two categories and then output the final classification result. The output probabilities of the softmax classifier add up to 1, and each probability is always in the range $[0,1]$. We take the category with the maximum probability as the predicted label $\hat{T}_i$.
$$P(y = j \mid x_i) = \frac{\exp(W_j x_i + b_j)}{\sum_{k=1}^{K} \exp(W_k x_i + b_k)}, \quad (23)$$
where $K$ denotes the number of categories, $W$ and $b$ are the trainable parameters of the classification layer, and $P(y = j \mid x_i)$ represents the probability of the input $x_i$ being predicted as category $j$. Our model is optimized by minimizing the cross-entropy loss between the real label and the predicted label.
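The classification step can be illustrated with a short sketch that takes the logits $W_j x_i + b_j$ as given. The function names are hypothetical, and subtracting the maximum logit before exponentiation is a standard numerical-stability choice rather than something stated in the text.

```python
import math


def softmax(logits):
    """Turn logits into probabilities that are non-negative and sum to 1."""
    m = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, true_class):
    """Cross-entropy loss for one sample with a one-hot true label."""
    return -math.log(probs[true_class])


def predict(probs):
    """Index of the category with the maximum probability."""
    return max(range(len(probs)), key=probs.__getitem__)


# Two categories: index 0 = human, index 1 = Sybil (matching T_i = 0 or 1).
probs = softmax([2.0, -1.0])
label = predict(probs)
loss = cross_entropy(probs, true_class=0)
```

The loss is small when the probability assigned to the real label is close to 1 and grows without bound as that probability approaches 0.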

V. EVALUATION
In this section, we present the datasets, experimental setup, comparison models, evaluation metrics, experimental results, ablation analysis, and further analysis of the proposed model.

A. GENERAL INTRODUCTION TO THE DATASET
The dataset we use in this paper to test and verify the performance of our proposed model was generated by My Information Bubble (MIB) [37]. The statistical information of the dataset is shown in Table 1. The MIB dataset contains a total of five sub-datasets: TFP, E13, FSF, INT, and TWT. After performing multiple comparison experiments, work [37] showed that a balanced distribution of Sybils and humans enables the trained classifier to achieve the best classification results. Therefore, the entire set of Sybils is randomly under-sampled to match the size of the set of verified human users.
In conclusion, the MIB dataset consists of 1950 Sybils and 1950 human users with a total of 3900 Twitter accounts. Each user in the dataset has multi-dimensional feature information, which will be fed into the proposed model to extract lower features (i.e., the feature information of the input data) and higher features (i.e., the feature information of the correlation between different users hidden in the feature map sequence) by using self-normalizing CNN and the proposed bi-SN-LSTM network, respectively.
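Random under-sampling of the majority class can be sketched as follows. The pool of 5000 Sybil accounts is a hypothetical figure used only to make the example concrete (the text above states only the balanced result of 1950 per class), and the helper name is illustrative.

```python
import numpy as np


def undersample_to_balance(labels, rng):
    """Return sorted indices that keep every minority-class sample and an
    equally sized random subset of the majority class (binary labels 0/1)."""
    idx0 = np.flatnonzero(labels == 0)
    idx1 = np.flatnonzero(labels == 1)
    n = min(len(idx0), len(idx1))
    keep0 = rng.choice(idx0, size=n, replace=False)
    keep1 = rng.choice(idx1, size=n, replace=False)
    return np.sort(np.concatenate([keep0, keep1]))


rng = np.random.default_rng(42)
# 1950 humans (label 0) and a hypothetical pool of 5000 Sybils (label 1).
labels = np.concatenate([np.zeros(1950, dtype=int), np.ones(5000, dtype=int)])
balanced_idx = undersample_to_balance(labels, rng)
print(len(balanced_idx), np.bincount(labels[balanced_idx]))
```

Sampling without replacement guarantees that no account appears twice in the balanced set; fixing the random seed makes the selected subset reproducible across experiments.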

B. EXPERIMENTAL SETUP
The z-score normalization method is applied to rescale the raw input data before the training process. The filter size of the first two convolutional layers and of the last two convolutional layers is set to 32 and 64, respectively. The size of the max pooling windows is set to (2,2). Moreover, the dimension of the hidden state of the bi-SN-LSTM network is set to 16. The dense layer, with a 256-dimensional output space, is then used to seek a higher-level representation of the extracted features. We apply alpha dropout to prevent our model from overfitting and to preserve the self-normalizing property of the proposed model during the training process. The gradient-based optimizer Adam [38], with a learning rate of 0.001, is used to minimize the categorical cross-entropy loss between the predicted and true distributions. Note that the remaining hyperparameters (i.e., batch size, training epochs, dropout rate, etc.) are chosen by grid search. The hardware environment is a Tesla K80 graphics card with 12 GB of memory. Our model is implemented with TensorFlow.

C. COMPARISON MODELS
The experimental results of the proposed model are compared with those of other state-of-the-art models:
1) Random Forest (RF) [28]: this model builds several decision trees and passes each input data point to every tree. After obtaining each tree's classification result, it takes the majority vote over all trees as the final output. The number of trees is set to 100.
2) K-Nearest Neighbors (KNN) [28]: a non-parametric classification algorithm that has achieved good results in many supervised classification problems. The number of neighbors is set to 2.
3) Adaptive Boosting (Ada) [28]: this method is the first stepping stone of the boosting family of algorithms; it combines many weak classifiers into a strong classifier to obtain accurate results. The number of estimators is set to 100.
4) XGBoost (Xgb): an ensemble ML algorithm that uses a gradient boosting framework. We include it in the comparison experiments to compare the proposed model with a state-of-the-art boosting algorithm. The key parameter n_estimators (i.e., the number of estimators) is set to the same value as for Ada (i.e., 100).

D. EVALUATION METRICS
Many performance metrics have been introduced to evaluate binary classification. These metrics are calculated from the entries of the confusion matrix shown in Table 2. We consider four commonly used metrics in this work: precision (also referred to as purity), recall (also referred to as accuracy), f1-score (the harmonic mean of precision and recall), and AUC (area under the ROC curve). The first three metrics are expressed in Eq. (24) to (26), respectively. The value of AUC is obtained by computing the area under the ROC (i.e., receiver operating characteristic) curve, which plots the true-positive rate against the false-positive rate (i.e., TPR and FPR, both of which can be calculated from the confusion matrix in Table 2).
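For concreteness, the four metrics can be computed as in the sketch below, which uses the standard textbook definitions (precision = TP/(TP+FP), recall = TP/(TP+FN), f1 as their harmonic mean) and evaluates AUC through its rank interpretation, i.e., the probability that a randomly chosen Sybil receives a higher score than a randomly chosen human. The labels, scores, and the 0.5 decision threshold are toy assumptions.

```python
import numpy as np


def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) with class 1 (Sybil) as the positive class."""
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    return tp, fp, fn, tn


def precision_recall_f1(y_true, y_pred):
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly;
    ties count as one half. Equivalent to the area under the ROC curve."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))


y_true = np.array([1, 1, 1, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.4, 0.6, 0.3, 0.1])  # toy Sybil probabilities
y_pred = (scores >= 0.5).astype(int)

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(precision, recall, f1, auc)
```

The pairwise AUC computation is quadratic in the number of samples, which is acceptable for a dataset of a few thousand accounts; larger evaluations would normally use a sort-based implementation.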

E. EXPERIMENTAL RESULTS
As described in the experimental setup (i.e., Section V-B), the proposed model is fed with the normalized input data. To quantify the influence of feature scaling during data preprocessing, and to make a comprehensive comparison of each model, we perform two sets of experiments, one with raw input data and the other with normalized data. The classification performance of the proposed model and the other methods is shown in Table 3. The proposed model with normalized input data significantly outperforms the RF, KNN, Ada, and Xgb models. The main reason is that our model extracts lower features (i.e., the feature information directly extracted from the normalized input data) and long-term dependencies (i.e., the hidden features in the feature map sequence) by means of the self-normalizing CNN and the bi-SN-LSTM, respectively. Although the Xgb model with raw input data yields a precision similar to that of the proposed model, our model achieves better f1-score and AUC, which shows that it strikes a more balanced trade-off between the accuracy and purity of the classification results and delivers better overall performance. The highly hierarchical architecture of the proposed model provides stronger feature extraction capabilities than the comparison methods. Moreover, the above comparison results demonstrate the necessity of applying feature scaling to the input data before feeding them into the proposed model.
Compared with the proposed model with raw input data, the proposed model with normalized data achieves better results. The main reason is that, after z-score normalization, the rescaled data has a mean of 0 and a standard deviation of 1, which provides the model with the self-normalizing property from the beginning of the training process.

F. ABLATION ANALYSIS
As described in Section IV, the self-normalizing CNN and the bi-SN-LSTM are used to automatically extract lower and higher features of the normalized input data, respectively. In this section, we conduct ablation studies to quantify the influence of these two feature extraction components in the proposed model. Note that two stacked dense layers are applied in each of the following models to seek a higher-level representation of the extracted features and to output the classification results.
"Lower-only": This model extracts features by using the self-normalizing CNN structure only. In other words, the bi-SN-LSTM in Fig. 1 is not applied in this model, and the higher-level representation (i.e., the output of the dense layer) is determined only by the lower feature extraction result.
"Higher-only": This model uses only the bi-SN-LSTM to automatically extract higher features hidden in the input data. The classification result is obtained without leveraging the lower features.
"Uni-higher": This model is similar to the "Higher-only" model but uses the unidirectional SN-LSTM instead of the bi-SN-LSTM to extract higher features, which causes part of the higher feature information to be missing.
"Self/Uni": The architecture of this model is the same as that of the proposed model, except that the unidirectional SN-LSTM replaces the bi-SN-LSTM network.
The comparison results between the proposed model and the ablation models are shown in Table 4. We use four metrics (i.e., precision, recall, f1-score, and AUC) to evaluate the classification results and conduct further analysis. The proposed model significantly outperforms "Lower-only", which demonstrates the necessity of leveraging higher features to accurately classify Sybils. Moreover, the performance of "Higher-only" drops compared with the proposed model, which illustrates the positive impact of lower features on the classification results. Compared with the proposed model, the performance drops of "Uni-higher" and "Self/Uni" not only further demonstrate the effectiveness of the two feature extraction components in the proposed model, but also show that the bi-SN-LSTM network has better feature extraction ability than the unidirectional SN-LSTM network. The "Self/Uni" model uses a single SN-LSTM layer to summarize the higher features from the feature map sequence, which causes a partial loss of the higher features. To address this problem, the proposed model uses the bi-SN-LSTM to extract the higher features between a user and all users that precede and follow it in the same sequence (i.e., it summarizes the feature map sequence from both directions). In conclusion, the proposed model significantly outperforms all four ablation models by using the self-normalizing CNN and the bi-SN-LSTM to fully extract lower and higher features of the input data, respectively. The effectiveness of the bi-SN-LSTM will be further demonstrated in Section V-G.
We use several RNN-based methods (i.e., RNN, GRU, LSTM, and their bidirectional variants) to replace the bi-SN-LSTM in Fig. 1 to extract higher features. Note that these methods have network hyperparameters and parameters similar to those of the proposed bi-SN-LSTM model.
The comparison results are shown in Table 5. The proposed model achieves the best classification performance among all models. Compared with the bi-LSTM, the proposed model (i.e., the bi-SN-LSTM network) significantly improves the Sybil classification performance, owing to the exponentially grown hidden state and the sparseness property provided by the recurrent step of each SN-LSTM unit with SELU as its activation function. A similar conclusion can be drawn by comparing the performance of the proposed model with that of the bi-RNN and bi-GRU in Table 5.
According to the "LSTM" row of Table 5, the performance drops compared with the bi-LSTM, which shows that the model performs better when a bidirectional RNN-based network is used for higher feature extraction. The same phenomenon can be observed when comparing the RNN with the bi-RNN, and the GRU with the bi-GRU. These comparisons, together with the experimental results in Section V-F (i.e., the comparison between "Self/Uni" and the proposed model), jointly demonstrate the effectiveness of leveraging the bidirectional SN-LSTM (i.e., bi-SN-LSTM) as the higher feature extractor of the proposed model. Moreover, the comparison between the proposed model and the bi-LSTM network shows that SELU is a more appropriate activation function for the recurrent step of the LSTM network when dealing with the Sybil detection problem.
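The difference between a unidirectional and a bidirectional pass can be made concrete with a small NumPy sketch. It is a rough illustration rather than the bi-SN-LSTM defined in Section IV: the placement of SELU (replacing tanh in the candidate state and in the output), the concatenation of the two directions, the initialization, and all names and shapes are assumptions made for this example.

```python
import numpy as np

SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805


def selu(x):
    return SELU_SCALE * np.where(x > 0, x, SELU_ALPHA * np.expm1(np.minimum(x, 0.0)))


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_params(input_dim, hidden_dim, rng):
    """One stacked weight matrix for the four gates (assumed initialization)."""
    fan_in = input_dim + hidden_dim
    W = rng.normal(0.0, np.sqrt(1.0 / fan_in), size=(4 * hidden_dim, fan_in))
    return W, np.zeros(4 * hidden_dim)


def sn_lstm_pass(X, W, b, hidden_dim):
    """Run one direction over X of shape (T, input_dim); return every h_t."""
    h = np.zeros(hidden_dim)
    c = np.zeros(hidden_dim)
    outputs = []
    for x_t in X:
        z = W @ np.concatenate([x_t, h]) + b
        i, f, o, g = np.split(z, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        c = f * c + i * selu(g)   # assumption: SELU in place of tanh
        h = o * selu(c)           # assumption: SELU in place of tanh
        outputs.append(h)
    return np.stack(outputs)


def bi_sn_lstm(X, fwd, bwd, hidden_dim):
    """Concatenate a left-to-right pass with a right-to-left pass."""
    h_fwd = sn_lstm_pass(X, fwd[0], fwd[1], hidden_dim)
    h_bwd = sn_lstm_pass(X[::-1], bwd[0], bwd[1], hidden_dim)[::-1]
    return np.concatenate([h_fwd, h_bwd], axis=1)


rng = np.random.default_rng(0)
hidden_dim = 16
X = rng.standard_normal((10, 8))   # hypothetical: 10 sequence steps, 8 features
fwd = init_params(8, hidden_dim, rng)
bwd = init_params(8, hidden_dim, rng)
H = bi_sn_lstm(X, fwd, bwd, hidden_dim)
print(H.shape)
```

In this sketch, the first 16 columns of each row depend only on the current and preceding steps, while the last 16 columns depend on the current and following steps, which is the extra information a unidirectional variant lacks.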
According to the "Time" column of Table 5, the vanilla RNN method trains faster than the other methods because it has the simplest underlying mathematical model. Note that the time of each comparison model is measured after training for 100 epochs. However, the classification performance of the vanilla RNN method is the worst among all the methods, which means that it is not suitable for real-world Sybil detection problems.
Compared with their bidirectional variants, the vanilla RNN, GRU, and LSTM train more efficiently, but their classification performance is lower. This result reveals a trade-off between training efficiency and classification performance when RNN-based methods are used for Sybil classification in OSNs. As shown in the "bi-SN-LSTM" row of Table 5, the training time of our proposed model increases because of the exponentially grown hidden state of the proposed bi-SN-LSTM network. However, our model achieves the best classification results while keeping its training efficiency close to that of the RNN-based methods with more sophisticated mathematical models (i.e., bi-GRU and bi-LSTM). In conclusion, the proposed model not only outperforms all other RNN-based methods but also achieves a more balanced trade-off between training efficiency and classification performance. Moreover, when dealing with complex real-world Sybil detection problems, a graphics processing unit can be used to increase the computational efficiency of the proposed model based on the bi-SN-LSTM network.

2) EFFECTS OF THE NUMBER OF CONVOLUTIONAL LAYERS AND THE FEATURE SCALING METHOD
This section reports a comparison experiment that quantifies the influence of the feature scaling (i.e., data normalization) method and the number of convolutional layers in the proposed model. To better utilize the lower features of the input data, we vary the number of convolutional layers in our comparison experiments. Moreover, applying an appropriate feature scaling method has a significant influence on the classification results. Therefore, three commonly used feature scaling methods are considered in our comparison experiments:
• Standardization (i.e., z-score normalization method): transform the data such that the resulting distribution has a standard deviation of 1 and a mean of 0, expressed as:
$$x'_{(m,n)} = \frac{x_{(m,n)} - \mu_n}{\sigma_n} \tag{28}$$
• Mean Normalization: transform the data such that they can be described as a normal distribution (i.e., bell curve) and all the values are within the range of [0,1].
• Min-max scaling: rescale each feature by using its minimum and maximum values.
Here, m indexes the m-th user of the input data, n denotes the n-th input feature, and $x^{\max}_n$ and $x^{\min}_n$ are the maximum and minimum values of the n-th feature, respectively. $x_{(m,n)}$ denotes the raw input data, $x'_{(m,n)}$ represents the scaled input data, and $\mu_n$ and $\sigma_n$ denote the mean and standard deviation of the n-th feature, respectively.
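The three scaling methods can be sketched per feature column as follows. The z-score function follows Eq. (28); for mean normalization and min-max scaling, the sketch assumes the common textbook definitions, (x − μ)/(x_max − x_min) and (x − x_min)/(x_max − x_min), which may differ in detail from the formulas used in the experiments. The guard against constant columns and the toy matrix are additions for illustration.

```python
import numpy as np


def _safe(span):
    """Replace zero ranges or deviations by 1 so constant columns map to 0."""
    return np.where(span == 0, 1.0, span)


def z_score(X):
    """Standardization, Eq. (28): zero mean and unit standard deviation."""
    return (X - X.mean(axis=0)) / _safe(X.std(axis=0))


def mean_normalization(X):
    """Assumed textbook form: centre on the mean, divide by the range."""
    return (X - X.mean(axis=0)) / _safe(X.max(axis=0) - X.min(axis=0))


def min_max_scaling(X):
    """Assumed textbook form: map each feature onto [0, 1]."""
    x_min = X.min(axis=0)
    return (X - x_min) / _safe(X.max(axis=0) - x_min)


# Toy input: 3 users (rows) and 2 features (columns), hypothetical values.
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
Z = z_score(X)
M = mean_normalization(X)
R = min_max_scaling(X)
print(Z, M, R, sep="\n")
```

In practice, the statistics (mean, standard deviation, minimum, maximum) would be estimated on the training split only and then reused for the test split, so that no information leaks from the test data.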
The comparison results in Fig. 4 show that the model obtains the best precision, f1-score, and AUC when the number of convolutional layers is set to four and the z-score normalization method is applied. Although the best recall is obtained with five convolutional layers and mean normalization, this configuration yields worse results on the other three metrics (i.e., precision, f1-score, and AUC). It can be concluded from Fig. 4 that the z-score normalization method significantly outperforms the other two methods (i.e., mean normalization and min-max scaling), which shows that standardization is the most appropriate method for scaling the raw input data before feeding them into the proposed model. The higher f1-score obtained with four convolutional layers indicates that the model reaches a more balanced trade-off between precision and recall, which makes the classification model more robust, and the higher AUC indicates better overall performance. In addition, the precision increases significantly when z-score normalization and four convolutional layers are applied, which means that few human users are mistakenly classified as Sybils.
Each convolutional layer contains a certain number of filters that create feature maps, from which the model extracts and compresses the lower features of the input data. Comparing the model with four convolutional layers against variants with more convolutional layers shows that the classification results do not improve as the number of convolutional layers increases, which indicates that over-compressed feature information degrades the performance of the classification model. The above results demonstrate the effectiveness of using z-score normalization as the feature scaling method and four convolutional layers as the lower feature extractor in the proposed model.

3) INFLUENCES OF THE DIMENSION OF HIDDEN STATES
The hidden states of the proposed bi-SN-LSTM extract and store the higher feature information from the feature map sequence. The dimension of the hidden states directly affects the classification performance of the proposed model. We conduct experiments to quantify the influence of the dimension of the hidden states over a range of values (i.e., 2, 4, 8, 16, 32, 64, 128, 256). Note that the other hyperparameters remain unchanged. The experimental results are shown in Fig. 5 and Fig. 6. The proposed model with 16-dimensional hidden states significantly outperforms the other variants. Moreover, when the dimension of the hidden states increases from 2 to 8, the precision rises while the recall, f1-score, and AUC decrease. The performance drops when the dimension of the hidden states is larger than 16 (e.g., 64).
The main reason is that the model becomes too complicated for the dataset and suffers from over-fitting. When the dimension of the hidden states is too large, our model runs a high risk of over-fitting even though alpha dropout is applied. Conversely, the feature extraction capability may be insufficient when the dimension of the hidden states is too small. In conclusion, we choose 16 as the dimension of the hidden states in this paper so that our model achieves better classification results.

VI. CONCLUSION
In this work, we propose a content-based end-to-end model to achieve more accurate Sybil detection in online social networks. The hierarchical architecture enables the proposed model to better utilize the feature information. Moreover, our model saves the human effort otherwise needed to design features. The performance of the proposed model is greatly improved by using the self-normalizing CNN and the bi-SN-LSTM to extract lower and higher features of the input data, respectively. We use the combination of SELU and alpha dropout to preserve the self-normalizing property and to relieve the overfitting problem during the training process. The bi-SN-LSTM is developed to summarize the feature map sequences (i.e., the output of the CNN structure) from both directions in order to extract hidden feature information (e.g., the correlation between users) for classification, which is more effective than other commonly used RNN-based methods. Through the case study of the MIB dataset, the experimental results demonstrate that our model outperforms the state-of-the-art methods and achieves excellent classification performance. Additionally, we perform an ablation analysis and further analysis of the proposed model to confirm the effectiveness of the classification model in this work. In our future work, graph neural network-based methods will be considered to classify Sybils with structure-based features.