Seeing and Believing: Evaluating the Trustworthiness of Twitter Users

Social networking and micro-blogging services, such as Twitter, play an important role in sharing digital information. Despite the popularity and usefulness of social media, there have been many instances where corrupted users found ways to abuse it, for instance by artificially raising or lowering a user's credibility. As a result, while social media facilitates an unprecedented ease of access to information, it also introduces a new challenge: ascertaining the credibility of shared information. Currently, there is no automated way of determining which news or users are credible and which are not. Hence, establishing a system that can measure a social media user's credibility has become an issue of great importance. Assigning a credibility score to a user has piqued the interest of not only the research community but also most of the big players on both sides, such as Facebook on the industry side and political parties on the societal one. In this work, we created a model which, we hope, will ultimately facilitate and support the increase of trust in social network communities. Our model collected data and analysed the behaviour of 50,000 politicians on Twitter. An influence score, based on several chosen features, was assigned to each evaluated user. Further, we classified the political Twitter users as either trusted or untrusted using random forest, multilayer perceptron, and support vector machine classifiers. An active learning model was used to classify any unlabelled ambiguous records from our dataset. Finally, to measure the performance of the proposed model, we used precision, recall, F1 score, and accuracy as the main evaluation metrics.


I. INTRODUCTION
An ever-increasing usage and popularity of social media platforms has become the sign of our times: close to half of the world's population is connected through social media platforms. The dynamics of communication in all spheres of life have changed. Social media provide a platform through which users can freely share information simultaneously with a significantly larger audience than traditional media.
As social media became ubiquitous in our daily lives, both its positive and negative impacts have become more pronounced. Successive studies have shown that extensive distribution of misinformation can play a significant role in the success or failure of an important event or a cause [1], [2]. Beyond the dissemination and circulation of misleading information, social networks also provide the mechanisms for corrupted users to perform an extensive range of illegitimate actions such as spam and political astroturfing [3], [4]. As a result, measuring the credibility of both the user and the text itself has become a major issue. In this work, we assign a credibility score to each Twitter user based on certain extracted features.
Twitter is currently one of the most popular social media platforms, with an average of 10,000 tweets per second [5]. Twitter-enabled analytics not only constitute a valuable source of information but also provide an uncomplicated extraction and dissemination of subject-specific information for government agencies, businesses, political parties, financial institutions, fundraisers and many others.
In a recent study [6], 10 million tweets from 700,000 Twitter accounts were examined. The collected accounts were linked to 600 fake news and conspiracy sites. Surprisingly, the authors found that clusters of Twitter accounts repeatedly linked back to these sites in a coordinated and automated manner. A similar study [7] showed that 6.6 million fake news tweets were distributed prior to the 2016 US elections.
Globally, a number of social and political events in the last three years have been marred by an ever-growing presence of misleading information, provoking an increasing concern about its impact on society. This concern translated into an immediate need for the design, implementation, and adoption of new systems and algorithms that will have the ability to measure the credibility of a source or a piece of news. Nevertheless, the seemingly unencumbered growth of social media users is continuing 1. Coupled with the growth in user numbers, the generated content is growing exponentially, thus producing a body of information in which it is becoming increasingly difficult to identify fabricated stories [9]. Consequently, we are facing a situation where a considerable number of unverified pieces of information could be misconstrued and ultimately misused. The research in the field is therefore currently focusing on defining the credibility of tweets and/or assigning scores to users based on the information they have been sharing [10]-[17].

A. OUR CONTRIBUTION AND DIFFERENCES WITH PREVIOUS WORKS
We would like to draw attention to the areas in which this work builds on our previous one [18] and where, we believe, it expands on it and offers new insights. In this work we used additional ML models, such as Multi-Layer Perceptron (MLP) and Logistic Regression (LR). Since the MLP model outperformed the LR, we only present the findings for the MLP model. For MLP, we performed the experiments for the Tanh, ReLU and Logistic activation functions. Moreover, unlike [18], where just one evaluation metric, "Accuracy", was used to evaluate the model's performance, in this work we measure the model's performance using four evaluation metrics: "Precision", "Recall", "F1" score, and "Accuracy" (see Table 5). Furthermore, we provide the descriptive statistics of the features (see Table 4) as well as their correlation with the target (see Figure 3), and compare our work with other similar works such as SybilTrap [19] (see Table 2). Finally, we conduct a comparative review of the user characteristics primarily used in the literature so far and the ones used in our model, and provide supplementary information to help with stratifying trusted and untrusted users (see Table 3).
Our main contribution can be summarized as follows:
• First, we gathered a dataset of 50,000 Twitter users where, for each user, we built a unique profile with 19 features (discussed in Section III). Our dataset included only users whose tweets are public and who have non-zero friends and followers. Furthermore, each Twitter user account was classified as either trusted or untrusted by attaching a trusted or untrusted flag based on different features. These features are discussed in detail in Section IV.
• We measured the social reputation score (Section III-C), a sentiment score (Section III-C), an h-index score (Section III-C), tweets credibility (Section III-C) and the influence score (Section III-D) for each of the analyzed Twitter users.
• To classify a large pool of unlabelled data, we used an active learning model, a technique best suited to the situation where unlabelled data is abundant but manual labelling is expensive [20], [21]. In addition, we evaluated the performance of various ML classifiers.

B. ARTIFACTS
As a way to support open science and reproducible research, and to allow other researchers to use, test and hopefully extend or enhance our models, we make both our datasets and the code for our models available through the Zenodo 2 research artifacts portal. This does not violate Twitter's developer terms. We hope that this work will inspire others to further research this problem and simultaneously kick-start a period of greater trust in social media.

C. ORGANIZATION
The rest of the paper is organized as follows: Related work is discussed in Section II, followed by a detailed discussion of our proposed approach in Section III. In Section IV, the active learning method and the types of classifier used are discussed. The data collection and experimental results are presented in Section V. Finally, in Section VI, we conclude the paper.

II. RELATED WORK
Twitter is one of the most popular Online Social Networks (OSNs). As a data aggregator, it provides data that can be used in research on both historical and current events. Twitter, in relation to other popular OSNs, attracts significant attention in the research community due to its open policy on data sharing and its distinctive features [22]. Although openness and vulnerability do not necessarily go hand in hand, on multiple occasions malicious users have misused Twitter's openness and exploited the service (e.g. political astroturfing, spammers sending unsolicited messages, posting malicious links, etc.).
Despite mounting evidence of the negative impact of fake news dissemination, only a few techniques for identifying fake news in social media have been proposed so far [3], [4], [22]-[24].
Among the most popular and promising ones is evaluating Twitter users and assigning them a reputation score. The authors in [3] explored the posting of duplicate tweets and pointed out that this behaviour, usually not followed by a legitimate user, affects the reputation score. Posting the same tweet several times has a negative effect on the user's overall reputation score. The authors presented research that supports the above by calculating the edit distance to detect duplications between two tweets posted from the same account.
Furthermore, users have exploited the immense volume of messages and information exchanged on Twitter to hijack trending topics [25] and send unsolicited messages to legitimate users. Additionally, there are Twitter accounts whose only purpose is to artificially boost the popularity of a specific hashtag, eventually making the underlying topic a trend. The BBC investigated an instance where £150 was paid to Twitter users to increase the popularity of a hashtag and promote it into a trend 3.
In an attempt to address these problems, researchers have used several ways to detect the trustworthiness of tweets and assign an overall rank to users [24]. Castillo et al. [26] measured the credibility of tweets based on Twitter features by using an automated classification technique. Alex Hai Wang [3] used the followers and friends features to calculate the reputation score. Additionally, Saito and Masuda [27] considered the same metrics while assigning a rank to Twitter users. In [28], the authors analysed the tweets relevant to the Mumbai attacks 4. Their analysis showed that most of the information providers were unknown while the reputation of the remaining ones was very low. In another study [29] that examined the same event, the information retrieval technique and ML algorithm used found that a mere 17% of the tweets were credibly related to the underlying attacks.
According to Gilani et al. [30], when compared to normal users, bots and fake accounts use a large number of external links in their tweets. Hence, analysing other Twitter features such as URLs is crucial for correctly evaluating the overall credibility of a user. Although Twitter has included tools to filter out such URLs, several masking techniques can effectively bypass Twitter's existing safeguards.
In this work, we evaluate the users' trustworthiness and credibility [31], [32] by analysing a wide range of features (see Table 1). In comparison to similar works in the field, our model explores a number of factors that could be signs of possible malicious behaviours and makes honest, fair, and precise judgements about the users' credibility.

III. METHODOLOGY
In this section, we discuss the model and main algorithms we used to calculate the user's influence score. Our first goal is to enable users to identify certain attributes and assess a political Twitter user by considering the influence score that is the outcome of a proper run of our algorithms. Figure 1 illustrates the main features we used to calculate users' influence score. We also compare our work with state-of-the-art work in this domain (see Table 2). Secondly, the political Twitter users are classified as either trusted or untrusted based on features such as social reputation, the credibility of tweets, sentiment score, the h-index score, influence score, etc. Accounts containing abusive and/or harassing tweets, a low social reputation and h-index score, and a low influence score are grouped into untrusted users. The trusted users category comprises the more reputable users: those with a high h-index score, more credible tweets and a high influence score. We will discuss this in more detail in Section IV.
In addition, we also present the approach used to calculate the Twitter users' influence score based on both their context and content features. For the user evaluation we took into consideration only the Twitter features that can be extracted through the Twitter API. We used the outcome of that evaluation and derived more features to help us provide a better-rounded and fair evaluation (Section III-C). The features, as well as the relevant notation used throughout the paper, are given in Table 1.

A. FEATURES SELECTION AND COMPARISON WITH PREVIOUS MODELS
The features used for calculating the influence score were based on an extensive study of the existing literature. The selected features were used for detection purposes [33]-[35], assigning a score [24] or classification purposes [36]. We used the features given in Table 1 to assign an influence score to a $u_i$. Table 2 provides a comparative overview of existing models based on feature selection.

B. TWITTER FEATURES EXTRACTION
The pivotal step in the process of assigning a score to a Twitter user is to extract the features linked to their accounts.
The features can be either user account specific, such as the number of followers, friends, etc., or user tweet specific, such as the number of likes, retweets, URLs, etc. In our model, we considered both and used them to calculate some additional features. We then combined them all to assign an influence score to a Twitter user. Below we provide more detailed information on the features used in our model.

Number of Friends
Friend is a user account feature indicating that a Twitter user ($u_i$) has subscribed to the updates of another user [37]. Following users who are not part of one's interpersonal ties yields a lot of novel information. One of the important indicators for calculating $Inf(u_i)$ is the follower/following ratio, which compares the number of $u_i$'s subscribers to the number of users $u_i$ is following. Users are more interested in updates if the follower/following ratio is high [38]. The ideal follower/following ratio is 1 or close to 1. In our model, we use the Number of Friends $N_{fri}(u_i)$ as one of the indicators for assigning the User's Social Reputation $R_s(u_i)$.
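As a minimal sketch of the follower/following ratio described above, the function below divides the follower count by the friend count. The function name and the zero-friends guard are illustrative assumptions, not part of the original model; the guard reflects that the dataset only includes users with non-zero friends and followers.

```python
def follower_following_ratio(n_followers: int, n_friends: int) -> float:
    """Followers of a user divided by the accounts that user follows.

    Hypothetical helper: the name and error handling are assumptions.
    """
    if n_friends <= 0:
        # The dataset excludes users with zero friends, so this is an error here.
        raise ValueError("n_friends must be positive")
    return n_followers / n_friends


# A user with 1200 followers who follows 300 accounts.
print(follower_following_ratio(1200, 300))
```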

Number of Followers
$N_{fol}(u_i)$ is another user account feature showing the number of people interested in the specific $u_i$'s tweets. As discussed in [39], $N_{fol}(u_i)$ is one of the most important parameters for measuring $u_i$'s influence. The more followers a $u_i$ has, the more influence they exert [40]. Preussler et al. [41] correlate $N_{fol}(u_i)$ with the reputation of a $u_i$. According to their study, the credibility of a $u_i$ increases as $N_{fol}(u_i)$ increases. Based on the above, we consider $N_{fol}(u_i)$ an important parameter and use it as input to calculate $R_s(u_i)$.

Number of Retweets
A tweet is considered important when it receives many positive reactions from other accounts. The reactions may take the form of likes or retweets. Retweets act as a form of endorsement, allowing $u_i$ to forward the content generated by other users, thus raising the content's visibility. Retweeting is a way of promoting a topic and is associated with the reputation of the $u_i$ [42]. Since retweeting is linked to popular topics and directly affects the $u_i$'s reputation, it is a key parameter for identifying possible fake account holders. As described in [30], bots or fake accounts depend more on retweets of existing content than on posting new content. In our model, we consider $N_{ret}$ as one of the main parameters for assigning $Inf(u_i)$. We calculate $R_{ret}(u_i)$ (used by Twitter grader) for each tweet by dividing $N_{ret}$ by $N_T(u_i)$, as given in equation 1:
$$R_{ret}(u_i) = \frac{N_{ret}}{N_T(u_i)} \quad (1)$$

Number of Likes
$N_{lik}$ is considered a reasonable proxy for evaluating the quality of a tweet. The authors in [36] showed that humans receive more likes per tweet when compared to bots. In [43], the authors used likes as one of the metrics to classify Twitter accounts as a human user or an automated agent. As mentioned in [5], if a specific tweet receives a large $N_{lik}$, it can be safely concluded that other $u_i$'s are interested in the tweets of the underlying $u_i$. Based on this observation, we calculate $R_{lik}(u_i)$ by taking $N_{lik}$ for each tweet and dividing it by $N_T(u_i)$, as shown in equation 2:
$$R_{lik}(u_i) = \frac{N_{lik}}{N_T(u_i)} \quad (2)$$

URLs
URL is a content-level feature some $u_i$'s include in their tweets [44]. As tweets are limited to a maximum of 280 characters, it is common that $u_i$'s cannot include all relevant information in their tweets. To overcome this issue, $u_i$'s often populate tweets with URLs pointing to a source where more information can be found. In our model, we consider the URL as an independent variable for the engagement measurements [45]. We count the tweets that include a URL and calculate $R_{url}(u_i)$ by taking $U_R(u_i)$ over $N_T(u_i)$, as given in equation 3:
$$R_{url}(u_i) = \frac{U_R(u_i)}{N_T(u_i)} \quad (3)$$
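The three per-tweet ratios discussed above (retweets, likes and URLs, each divided by the number of tweets) can be sketched as follows. This is only an illustration under stated assumptions: the `Tweet` record and the `engagement_ratios` function are hypothetical names, and the sketch assumes that the retweet and like counts are summed over all collected tweets of a user before dividing by $N_T(u_i)$.

```python
from dataclasses import dataclass


@dataclass
class Tweet:
    """Hypothetical per-tweet record; field names are assumptions."""
    retweets: int   # times this tweet was retweeted
    likes: int      # times this tweet was liked
    has_url: bool   # whether the tweet contains at least one URL


def engagement_ratios(tweets):
    """Return R_ret, R_lik and R_url for one user's collected tweets."""
    n_t = len(tweets)  # N_T(u_i): number of tweets considered
    if n_t == 0:
        return {"R_ret": 0.0, "R_lik": 0.0, "R_url": 0.0}
    return {
        "R_ret": sum(t.retweets for t in tweets) / n_t,
        "R_lik": sum(t.likes for t in tweets) / n_t,
        "R_url": sum(1 for t in tweets if t.has_url) / n_t,
    }


sample = [Tweet(retweets=4, likes=10, has_url=True),
          Tweet(retweets=0, likes=2, has_url=False)]
print(engagement_ratios(sample))
```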

Listed Count
In Twitter, a $u_i$ has the option to form several groups by creating lists of different $u_i$'s (e.g. competitors, followers etc.). Twitter lists are mostly used to keep track of the most influential people 5. The simplest way to measure the $u_i$'s influence is by checking the $L(u_i)$ that the $u_i$ is placed on.
Being present in a large number of lists is an indicator that the $u_i$ is considered important by others. Based on this assumption, we also considered the number of lists that each $u_i$ belongs to.

Statuses Count
Compared to the other popular OSNs, Twitter is considered a service that is less social 6. This is mainly due to the large number of inactive $u_i$'s or users who show low motivation in participating in an online discussion. Twitter announced a new feature, "Status availability", that checks the $N_T(u_i)$ 7.
The status count is an important feature closely related to reporting credibility. If a user is active on Twitter for a longer period, the likelihood of producing more tweets increases, which in turn may affect the author's credibility [46], [47].
To this end, for the calculation of $Inf(u_i)$, we also took into account how active users are by measuring how often a $u_i$ performs a new activity 8.

Original Content Ratio
It has been observed that instead of posting original content, most $u_i$'s retweet posts by others [38]. As a result, Twitter is changing into a pool of constantly updating information streams. For $u_i$'s with high influence in the network, the best strategy is to use the 30/30/30 rule: 30% retweets, 30% original content, and 30% engagement [48]. Having this in mind, in our model, we look for $u_i$'s original tweets and add them to their corresponding influence score. We calculate $R_{ori}(u_i)$ by excluding the posts retweeted from others from the total tweets of $u_i$, as given in equation 4.
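One possible reading of the original content ratio is sketched below. It assumes that $R_{ori}(u_i)$ is the share of a user's tweets that are not retweets of other users' posts, that is, (total tweets minus retweets) divided by total tweets. The exact form of equation 4 is not reproduced here, so this formula and the function name are assumptions.

```python
def original_content_ratio(is_retweet_flags):
    """Share of tweets that are original (not retweets of others).

    `is_retweet_flags` holds one boolean per collected tweet: True if the
    tweet is a retweet of another user's post. Assumed reading of R_ori.
    """
    n_t = len(is_retweet_flags)
    if n_t == 0:
        return 0.0
    originals = sum(1 for flag in is_retweet_flags if not flag)
    return originals / n_t


# Two retweets and two original tweets.
print(original_content_ratio([True, False, False, True]))
```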

C. DERIVED FEATURES FOR TWITTER USERS
Following the considerations for the selection of the basic features for calculating $Inf(u_i)$, in this section we elaborate on the extraction of the extra ones. Additionally, we discuss the sentiment analysis technique used to analyse $u_i$'s tweets. By using the basic features described earlier, we calculated the following features for each $u_i$:
• Social reputation of a user;
• Retweet h-index score and liked h-index score;
• Sentiment score of a user;
• Credibility of Tweets;
• Influence score of a user.

User's Social Reputation
The main factor for calculating $R_s(u_i)$ is the number of users interested in $u_i$'s updates. In equation 5 we utilized the logarithm to make the distribution smoother and minimize the impact of outliers. In addition, since $\log 0$ is undefined, we added 1 wherever the logarithm appears in equation 5. In equation 5, $R_s(u_i)$ is directly proportional to $N_{fol}(u_i)$ and $N_T(u_i)$. Based on several studies [3], [5], [38], $R_s(u_i)$ is more dependent on $N_{fol}(u_i)$; hence we give more importance to $N_{fol}(u_i)$ in comparison to $N_T(u_i)$ and $N_{fri}(u_i)$. If a $u_i$ has a large $N_{fol}(u_i)$, then the $u_i$ is more reputable. In addition, if a $u_i$ is more active in updating their $N_T(u_i)$, there are more chances that $u_i$'s tweets receive more likes and get retweeted. As $N_{fol}(u_i)$ and $N_T(u_i)$ increase, $R_s(u_i)$ also increases, and vice versa. Conversely, if a $u_i$ has a smaller $N_{fol}(u_i)$ than $N_{fri}(u_i)$, then $R_s(u_i)$ is smaller. As can be seen from equation 5, there is an inverse relation between $R_s(u_i)$ and $N_{fri}(u_i)$.

h-Index Score
The $h_{ind}$ score is most commonly used to measure the productivity and impact of a scholar or scientist in the research community. It is based on the number of publications as well as the number of citations for each publication [49]. In our work, we use the $h_{ind}$ score for a more accurate calculation of $Inf(u_i)$. The $h_{ind}$ of a $u_i$ is calculated considering $N_{lik}$ and $N_{ret}$ for each tweet. To find the $h_{ind}$ 9, we sort the tweets based on $N_{lik}$ and $N_{ret}$ (in decreasing order).
Algorithm 1 describes the main steps for calculating the $h_{ind}$ of a $u_i$ based on $N_{ret}$. The same algorithm is used for calculating the $h_{ind}$ of a $u_i$ based on $N_{lik}$ by replacing $N_{ret}$ with $N_{lik}$.
Algorithm 1: Calculating h-index score based on retweets
$R_{hind}(u_i)$ and $L_{hind}(u_i)$ are novel features used for measuring the relative importance of a $u_i$. A tweet that has been retweeted many times and liked by many users is considered attractive for the readers [5], [50]. For this reason, we use $R_{hind}(u_i)$ and $L_{hind}(u_i)$ for measuring $Inf(u_i)$. The higher the $R_{hind}(u_i)$ and $L_{hind}(u_i)$ scores of a $u_i$, the higher $Inf(u_i)$ will be.
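The h-index computation described above can be sketched with the standard definition: after sorting per-tweet counts in decreasing order, the h-index is the largest rank h such that the h-th tweet has at least h retweets (or likes). This is a generic implementation of that definition, not a transcription of the paper's Algorithm 1; the function name is an assumption.

```python
def h_index(counts):
    """Largest h such that at least h tweets have a count >= h.

    `counts` is a list of per-tweet retweet counts (for the retweet
    h-index) or like counts (for the liked h-index).
    """
    ranked = sorted(counts, reverse=True)  # decreasing order, as in the text
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break  # counts only get smaller from here on
    return h


retweets_per_tweet = [10, 8, 5, 4, 3]
print(h_index(retweets_per_tweet))
```

Passing like counts instead of retweet counts to the same function gives the liked h-index.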

Twitter User Credibility
Credibility is essentially believability [26], that is, providing reasonable grounds for being believed. The credibility of a $u_i$ can be assessed by using the information available on the Twitter platform. In our approach, we use both $Sen_s(u_i)$ and $Twt_{cr}(u_i)$ to find a credible $u_i$.
Sentiment Score: It has been observed that OSNs are a breeding ground for the distribution of fake news. In many cases even a single Twitter post significantly impacted [51] and affected the outcome of an event.
Having this in mind, we used sentiment analysis and the TextBlob [52] library to analyze tweets, with the main aim of identifying certain patterns that could facilitate the identification of credible news. The sentiment analysis returns a score using polarity values ranging from -1 to 1 and helps in tweet classification. We classified the collected tweets as (1) Positive, (2) Neutral, and (3) Negative based on the number of positive, neutral and negative words in a tweet. According to Morozov et al. [53], the least credible tweets have more negative sentiment words and opinions and are associated with negative social events, while credible tweets have more positive ones. Hence we classified positive tweets as being the most credible, followed by the neutral, and finally the least credible negative tweets.
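A minimal sketch of the three-way labelling is given below. It assumes that a tweet is labelled from the sign of its polarity value (above zero positive, below zero negative, exactly zero neutral); the thresholds and both function names are illustrative assumptions rather than the settings used in the experiments. `TextBlob(text).sentiment.polarity` is the TextBlob property that returns a polarity in the range -1 to 1.

```python
def polarity_label(polarity: float) -> str:
    """Map a polarity in [-1, 1] to a sentiment class (assumed thresholds)."""
    if not -1.0 <= polarity <= 1.0:
        raise ValueError("polarity must lie in [-1, 1]")
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"


def tweet_polarity(text: str) -> float:
    """Polarity of a tweet via TextBlob (third-party; imported lazily)."""
    from textblob import TextBlob
    return TextBlob(text).sentiment.polarity


print(polarity_label(0.4), polarity_label(0.0), polarity_label(-0.7))
```

With TextBlob installed, `polarity_label(tweet_polarity(text))` labels a single tweet.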
Following the tweets classification, we assign a $Sen_s(u_i)$ to each $u_i$ [5] (equation 6).
Tweets Credibility: Donovan [54] focused on finding the most suitable indicators for credibility. According to their findings, prime indicators for a tweet's credibility are mentions, URLs, tweet length and retweets. Gupta et al. [29] ranked tweets based on tweets credibility. The parameters used as an input for the ranking algorithm were: tweets, retweets, total unique users, trending topics, tweets with URLs, and start and end date. Based on the existing literature, we compute $Twt_{cr}(u_i)$ by considering $R_{ret}(u_i)$, $R_{lik}(u_i)$, $R_{has}(u_i)$, $R_{url}(u_i)$ and $R_{ori}(u_i)$ (see equation 7):
$$Twt_{cr}(u_i) = \frac{R_{ret}(u_i) + R_{lik}(u_i) + R_{has}(u_i) + R_{url}(u_i)}{4} \times R_{ori}(u_i) \quad (7)$$
To begin, we consider the original tweets $R_{ori}(u_i)$ by a $u_i$, and for each of them we collect $R_{ret}(u_i)$, $R_{lik}(u_i)$, $R_{has}(u_i)$ and $R_{url}(u_i)$. These four features are linked with $R_{ori}(u_i)$: $R_{ret}(u_i)$ and $R_{lik}(u_i)$ specify the number of times the original tweets have been retweeted and liked, while $R_{has}(u_i)$ and $R_{url}(u_i)$ cover only the original tweets containing hashtags and URLs, respectively. Hence, to calculate the credibility of tweets, we first calculate the average of these four parameters and then multiply it by $R_{ori}(u_i)$.
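The tweets credibility computation described above (the average of the four ratios, multiplied by the original content ratio) can be sketched as follows. The function name and argument order are assumptions made for illustration; the input values in the example are made up.

```python
def tweets_credibility(r_ret, r_lik, r_has, r_url, r_ori):
    """Average of the four tweet-level ratios, scaled by R_ori.

    Follows the verbal description: mean of R_ret, R_lik, R_has and
    R_url, then multiplied by the original content ratio R_ori.
    """
    mean_of_ratios = (r_ret + r_lik + r_has + r_url) / 4
    return mean_of_ratios * r_ori


# Illustrative (made-up) ratios for one user.
print(tweets_credibility(r_ret=2.0, r_lik=6.0, r_has=0.5, r_url=0.5, r_ori=0.5))
```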

D. INFLUENCE SCORE
$Inf(u_i)$ is calculated based on the evaluation of both content and context features. More precisely, we consider the following features described earlier: $R_s(u_i)$, $Sen_s(u_i)$, $Twt_{cr}(u_i)$ and $h_{ind}(u_i)$. After calculating the values of all of these features, we use them as input to Algorithm 2 (line 7), which calculates $Inf(u_i)$.
Equation Formulation: In order to ascertain how influential a $u_i$ is, researchers have taken into consideration one, two or more of the following characteristics:
• Social reputation [55] and the weightage of their tweets [5];
• Tweets credibility [5], [54];
• Their ability to formulate new ideas, as well as their active participation in follow-up events and discussions [56].
An influential $u_i$ must be highly active (have ideas that impact others' behaviours, be able to start new discussions, etc.). Additionally, the tweets must be relevant, credible and highly influential (retweeted and liked by a large number of other $u_i$'s). If the tweets of highly influential $u_i$'s are credible and the polarity of their tweets' content is positive, they are considered highly acknowledged and recognized by the community. In short, for a $u_i$ to be considered influential, we combine the efforts of [5], [54]-[56] and calculate $Inf(u_i)$ using equation 8.

IV. ACTIVE LEARNING AND ML MODELS
In line with the existing literature, the classification of a $u_i$ is performed on a manually annotated dataset. The manually annotated dataset gives a ground truth; however, manual labelling is an expensive and time-consuming task. In our approach, we used active learning, a semi-supervised ML model that helps in classification when the amount of available labelled data is small. In this model, the classifier is trained using a small amount of training data (labelled instances). Next, the points ambiguous to the classifier in the large pool of unlabelled instances are labelled and added to the training set [21]. This process is repeated until all the ambiguous instances are queried or the model performance does not improve above a certain threshold. The basic flow of the active learning approach (https://github.com/modAL-python/modAL) is shown in Figure 2. Based on the proposed model, we first trained our classifier on a small dataset of human-annotated data. The classifier was then used to classify a large pool of unlabelled instances efficiently and accurately. The steps in our active learning process were as follows:
• Data Gathering: We gathered unlabelled data for 50,000 $u_i$'s. The unlabelled data was then split into a seed (a small dataset consisting of 1,000 manually annotated records) and a large pool of unlabelled data. The seed was then used to train the classifier just like a normal ML model. Using the seed dataset we classified each political $u_i$ as either trusted or untrusted.
• Classification of Twitter Users: Two manual annotators in the field classified 1,000 $u_i$'s as trusted or untrusted based on certain features. Out of the 1,000 $u_i$'s, 582 were classified as trusted and the remaining 418 as untrusted.
For feature selection, we employed the feature engineering technique and selected the most important features among those presented in Table 1. Based on the existing literature [57]-[60] and the correlation among features, certain features were considered the most discriminatory for $u_i$'s classification. We did not include the discriminatory features because they serve as outliers and are biased. In addition, certain features were distributed almost equally between the trusted and untrusted users, as shown in Table 3. We discarded both as they do not add any value to classification. However, certain features were good candidates for differentiating trusted and untrusted users, such as high $R_{hind}(u_i)$ and $L_{hind}(u_i)$, among others. In Table 3, the features marked with * were used for classification in the existing literature [3], [58], [61], while the features marked with ∩ were based on the correlation among the features. The impact of each individual feature is shown in Figure 3. The figure indicates that among the features, $L_{hind}(u_i)$ and $N_{fol}(u_i)$ are very relevant for assessing $Inf(u_i)$. In addition, all the features except $R_{ret}(u_i)$ and $R_{has}(u_i)$ have a positive impact on the user's $Inf(u_i)$ (see Figure 3).
• Choosing Unlabelled Instances: Pool-based sampling with a batch size of 100 was used, in which 100 ambiguous instances from the unlabelled dataset were labelled and added to the labelled dataset. Different sampling techniques were employed to select the instances from the unlabelled dataset. With the newly labelled dataset, the classifier was re-trained and then the next batch of ambiguous unlabelled instances to be labelled was selected. The process was repeated until the model performance did not improve above a certain threshold.
Table 3: Feature Engineering: All values greater than or equal to 0.5 are considered high, whereas those below 0.5 are considered low.
• Uncertainty Sampling: This is the most common method; it calculates the difference between the most confident prediction and 100% confidence:
$$U(x) = 1 - P(\hat{x} \mid x)$$
where $\hat{x}$ is the most likely prediction and $x$ is the instance to be predicted. This sampling technique selects the sample with the greatest uncertainty.
• Margin Sampling: In margin sampling, the probability difference between the first and second most likely predictions is calculated. Margin sampling is calculated using the equation:
$$M(x) = P(\hat{x}_1 \mid x) - P(\hat{x}_2 \mid x)$$
where $\hat{x}_1$ and $\hat{x}_2$ are the first and second most likely predictions. As the decision is unsure for smaller margins, in this sampling technique the instance with the smallest margin is selected.
• Entropy Sampling: This is the measure of entropy, defined by the equation:
$$H(x) = -\sum_{k} p_k \log(p_k)$$
where $p_k$ is the probability of a sample belonging to class $k$. Entropy sampling measures the difference between all the predictions.
11 https://modal-python.readthedocs.io/en/latest/content/query_strategies/uncertainty_sampling.html
Details of the three classifiers we used and their performance characteristics are given below:
• Random Forest Classifier (RFC): An ensemble tree-based learning algorithm [62] that aggregates the votes from various decision trees to determine the output class of the instance. RFC runs efficiently on large datasets and is capable of handling thousands of input variables. In addition, RFC measures the relative importance of each feature and produces a highly accurate classifier.
• Support Vector Machine (SVM): SVM models are commonly used in classification tasks as they achieve high accuracy with less computation power. The SVM finds a hyperplane in $N$-dimensional space ($N$ represents the number of features) to classify an instance [63]. The goal of SVM is to improve classification accuracy by locating the hyperplane that separates the two classes.
• Multilayer Perceptron (MLP): A supervised ML algorithm that learns a nonlinear function by training on a dataset. The MLP network is divided into an input layer, hidden layer(s), and an output layer [64]. Each layer consists of interconnected neurons transferring information to each other. In our proposed model the MLP consisted of one input and output layer and 50 hidden layers. In addition, the activation functions used in the MLP are Tanh, ReLU and Logistic. We do not provide the plots for the ReLU activation function as its performance is not as good as that of Tanh and Logistic (see Table 5).
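The three query strategies and the batch selection step described above can be sketched in plain Python as follows. This is an illustrative implementation of the standard uncertainty, margin and entropy measures on a list of class probabilities, not the modAL code used in the experiments; all function names are assumptions, and the experiments use a batch size of 100 where this toy example uses 2.

```python
import math


def uncertainty(probs):
    """1 minus the probability of the most likely class."""
    return 1.0 - max(probs)


def margin(probs):
    """Gap between the most likely and second most likely class."""
    top = sorted(probs, reverse=True)
    return top[0] - top[1]


def entropy(probs):
    """Shannon entropy of the predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def most_ambiguous(pool_probs, batch_size, strategy="margin"):
    """Indices of the `batch_size` most ambiguous pool instances."""
    if strategy == "margin":
        key = lambda i: margin(pool_probs[i])        # smallest margin first
    elif strategy == "uncertainty":
        key = lambda i: -uncertainty(pool_probs[i])  # largest uncertainty first
    elif strategy == "entropy":
        key = lambda i: -entropy(pool_probs[i])      # largest entropy first
    else:
        raise ValueError("unknown strategy: " + strategy)
    return sorted(range(len(pool_probs)), key=key)[:batch_size]


# Predicted [trusted, untrusted] probabilities for three unlabelled users.
pool = [[0.9, 0.1], [0.55, 0.45], [0.7, 0.3]]
print(most_ambiguous(pool, batch_size=2))
```

In a full loop, the selected instances would be labelled by the annotators, appended to the training set, and the classifier re-trained before the next batch is queried.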

V. EXPERIMENTAL RESULTS AND MODEL EVALUATION
Experimental Setup: We used Python 3.5 for feature extraction and dataset generation. The Python script was executed locally on a machine with the following configuration: Intel Core i7, 2.80 GHz, 32 GB, Ubuntu 16.04 LTS 64-bit. Google Colab was used for training and evaluating the ML models. In addition, the modAL framework [65], an active learning framework for Python, was used for manually labelling the Twitter users. It is a scikit-learn based platform that is modular, flexible and extensible. We used the pool-based sampling technique for the learner to query the labels of instances, and different sampling techniques for the query strategy. For classification purposes, we used different classifiers, implemented using the scikit-learn library.

A. DATASET AND DATA COLLECTION
We used Tweepy, a Python library for accessing the Twitter API, to collect $u_i$'s tweets and features. Tweepy has certain limitations, as it only allows the collection of a certain number of features. Additionally, a rate limit is in place, which prevents the collection of information above a certain threshold. Our main concern was to select a sufficient number of users for our dataset. In our dataset, we analysed the Twitter accounts belonging to 50,000 politicians. This dataset was generated in 2020. The main reason for choosing to evaluate politicians' profiles is their intrinsic potential to influence public opinion. The content of such tweets originates and exists in the sphere of political life which is, unfortunately, often surrounded by controversial events and outcomes. During the selection, we only considered politicians with a public profile. Users that seemed to be inactive (e.g. with a limited number of followers and activities) were omitted. In addition, because duplicate data might influence model accuracy, we used the max_id parameter to exclude duplicates from the dataset. First, we requested the most recent tweets from each user (200 tweets at a time) and kept the smallest ID (i.e. the ID of the oldest tweet). Next, we set max_id to the ID of the oldest tweet minus one. Each subsequent request then returned only tweets with an ID less than or equal to max_id, that is, tweets strictly older than those already collected. We used the max_id parameter in all subsequent requests to avoid tweet duplication.
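The pagination scheme described above can be sketched as follows. To keep the example runnable without API credentials, the request is abstracted behind a `fetch_page(count, max_id)` callable; this callable, the helper names and the in-memory stand-in timeline are assumptions made for illustration. In practice, `fetch_page` would wrap a timeline request that accepts `count` and `max_id` arguments.

```python
def collect_timeline(fetch_page, page_size=200):
    """Collect a user's tweets page by page without duplicates.

    fetch_page(count, max_id) is a hypothetical callable returning up to
    `count` tweets (dicts with an "id" key) whose IDs are <= max_id,
    or the most recent tweets when max_id is None.
    """
    tweets = []
    max_id = None  # first request: most recent tweets
    while True:
        page = fetch_page(count=page_size, max_id=max_id)
        if not page:
            break  # no older tweets left
        tweets.extend(page)
        oldest_id = min(tweet["id"] for tweet in page)
        # The next request starts strictly below the oldest ID seen so far,
        # so no tweet is returned twice.
        max_id = oldest_id - 1
    return tweets

def make_fake_timeline(n_tweets):
    """In-memory stand-in for the API, used only to exercise the loop."""
    all_ids = list(range(n_tweets, 0, -1))  # newest (largest ID) first

    def fetch_page(count, max_id=None):
        ids = [i for i in all_ids if max_id is None or i <= max_id]
        return [{"id": i} for i in ids[:count]]

    return fetch_page

collected = collect_timeline(make_fake_timeline(450))
```

Subtracting one from the oldest ID matters because the `max_id` bound is inclusive: without it, the oldest tweet of each page would be returned again at the top of the next page.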
For each $u_i$, we extracted all the features required by our model. Using the extracted features and tweets, we calculated $Inf(u_i)$. In total, the collected data comprised 19 features, including the influence score, for each of the 50,000 $u_i$'s. Table 4 summarizes the statistics of some of the features examined in the dataset. For features which have no defined upper bound and may contain outlier values, such as the number of followers, likes, etc., we used a percentile clip. We then normalized our features using min-max normalization, with 0 being the smallest and 1 being the largest value.
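A minimal sketch of this preprocessing step is given below. The percentile bounds are not stated in the text, so the 1st and 99th percentiles used here are assumptions, as are the function name and the toy follower counts.

```python
from statistics import quantiles

def clip_and_scale(values, lower_pct=1, upper_pct=99):
    """Clip values to the [lower_pct, upper_pct] percentile range, then min-max scale to [0, 1].

    The default bounds (1st and 99th percentile) are illustrative assumptions.
    """
    # quantiles(..., n=100) returns the 99 percentile cut points (1st to 99th).
    cuts = quantiles(values, n=100, method="inclusive")
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    clipped = [min(max(v, low), high) for v in values]

    smallest, largest = min(clipped), max(clipped)
    if largest == smallest:
        return [0.0 for _ in clipped]  # constant feature: nothing to scale
    return [(v - smallest) / (largest - smallest) for v in clipped]

# Toy follower counts with one extreme outlier (hypothetical values).
followers = [10, 20, 30, 40, 1_000_000]
scaled = clip_and_scale(followers)
```

Clipping before scaling keeps a single extreme account from compressing all other users into a narrow band near 0.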

B. PERFORMANCE MEASUREMENTS OF MACHINE LEARNING AND NEURAL NETWORK MODELS
We gathered 50,000 unlabelled instances of $u_i$'s and divided our dataset into three subsets: training, testing, and unlabelled data pools. For the training and testing cohorts, we had 1000 manually annotated data instances. The rest of the data (49,000 instances) was unlabelled. The model was trained on the labelled training dataset, while the performance of the model was measured on the testing dataset.
For the classification, we used different classifiers (all classifiers were trained on the labelled dataset and predictions are reported using 10-fold cross-validation). Precision, recall, F1 score and accuracy were used as the main evaluation metrics for the model performance. Precision is the ratio of true positives to all instances predicted as positive, while recall is the ratio of true positives to all instances that are actually positive. The F1 score is the harmonic mean of precision and recall, while accuracy measures the percentage of correctly classified instances. Precision, recall and F1 score are based on the numbers of true positives, true negatives, false positives and false negatives. To define these terms, we first considered the trusted users as positive (labelled as 1) and the untrusted users as negative (labelled as 0). When the model predicts the actual label, we categorize the prediction as a true positive or a true negative; otherwise, it is a false positive or a false negative. If the model predicts that the user is trusted but the user is in fact untrusted, it is a false positive, and if the model predicts that the user is untrusted but the user is in fact trusted, it is a false negative. The performance of the model (precision, recall, and F1 score) was calculated on the testing dataset. To improve the model accuracy, the active learner selected ambiguous data instances from the unlabelled data pool using three different sampling techniques. These ambiguous data instances were then manually labelled by human annotators, and the annotated data was added to the labelled dataset. In our model, the human annotators labelled the 100 most ambiguous instances from the unlabelled dataset returned by the active learner. The respective sampling techniques and the accuracy obtained for the top three classifiers (RFC, SVM and MLP) are discussed below.
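The four metrics can be made concrete with a short sketch that computes them from the confusion counts, taking trusted users (label 1) as the positive class. The function name and the toy labels are assumptions for illustration and are not taken from our dataset.

```python
def classification_metrics(y_true, y_pred):
    """Compute precision, recall, F1 and accuracy for binary labels (1 = trusted, 0 = untrusted)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # trusted, predicted trusted
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # untrusted, predicted untrusted
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # untrusted, predicted trusted
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # trusted, predicted untrusted

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / len(pairs)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Toy example (hypothetical labels): 4 trusted and 6 untrusted users.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
```

Swapping the roles of the two labels gives the corresponding scores for the untrusted class.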

Uncertainty Sampling
In uncertainty sampling, the instance for which the model is least confident is the most likely to be selected. In this type of sampling method, only the most probable label is considered and the rest are discarded. The RFC obtained an accuracy of 96% (Figure 4a), the SVM obtained an accuracy of 90.8% (Figure 4b), while the MLP obtained an accuracy of 90% for Tanh (Figure 4c) and 84% for Logistic (Figure 4d).

Margin Sampling
In margin sampling, instances with the smallest difference between the first and second most probable labels were considered. The accuracy obtained using margin sampling was 96% for RFC, 91.2% for SVM, 87% for MLP (Tanh) and 88.4% for MLP (Logistic), as shown in Figure 5a, Figure 5b, Figure 5c and Figure 5d respectively.

Entropy Sampling
Lastly, the entropy sampling method obtained an accuracy of 95% for RFC, 88% for SVM, almost 90% for MLP (Tanh) and 90% for MLP (Logistic). The results obtained for the RFC, SVM and MLP are shown in Figure 6a, 6b, 6c and 6d.
A comparison of the performance of our models and the different sampling techniques used can be found in Table 5. Precision, recall, F1 score, and accuracy were used to evaluate the results. Trusted users are represented by 1, while untrusted users are represented by 0 (see Table 5). RFC outperforms the other models in uncertainty sampling, with an F1 score of 96% for both trusted and untrusted users. Similarly, for margin sampling, RFC achieved an F1 score of 95% for untrusted users and 97% for trusted users, and again outperformed the other models. Finally, RFC also outperforms the other models in entropy sampling, obtaining an F1 score of 95% for both trusted and untrusted users. Overall, RFC was the best performing algorithm, while MLP (ReLU) had the worst performance. We attribute the results obtained by RFC to its superior accuracy and its strong track record on low-dimensional datasets. Similarly, the improved performance in the case of margin sampling can be attributed to the fact that, unlike the other sampling methods, it considers the probabilities of the two most probable labels.

VI. CONCLUSION
Contemplating the momentous impact unreliable information has on our lives and the intrinsic issue of trust in OSNs, our work focused on finding ways to identify this kind of information and notifying users of the possibility that a specific Twitter user is not credible.
To do so, we designed a model that analyses Twitter users and assigns each a calculated score based on their social profiles, tweets credibility, sentiment score, and h-indexing score. Users with a higher score are considered not only more influential but also more credible. To test our approach, we first generated a dataset of 50,000 Twitter users along with a set of 19 features for each user. Then, we classified the Twitter users as trusted or untrusted using three different classifiers. Further, we employed the active learning approach to label the ambiguous unlabelled instances. During the evaluation of our model, we conducted extensive experiments using three sampling methods. The best results were achieved by using RFC with margin sampling.

Table 1: Features Considered to Calculate the Influence Score
...$(u_i)$: Retweet ratio of the user
$N_{lik}$: Number of likes for a tweet
$R_{lik}(u_i)$: Liked ratio of the user
$U_R(u_i)$: Tweet of the user containing URLs
$R_{url}(u_i)$: URLs ratio of the user
$L(u_i)$: List count of the user
$N_T(u_i)$: Total number of tweets or Status of the user
$R_{ori}(u_i)$: Original content ratio of the user
$R_s(u_i)$: Social reputation score of the user
$h_{ind}(u_i)$: h-index of the user
$R_{hind}(u_i)$: Retweet h-index of the user
$L_{hind}(u_i)$: Liked h-index of the user
$Twt_{cr}(u_i)$: Tweets credibility of the user
$Sens(u_i)$: Sentiment score of the user
$N_{neu}(u_i)$: Neutral tweets
$N_{pos}(u_i)$: Positive tweets
$N_{neg}(u_i)$: Negative tweets
$R_{has}(u_i)$: Hashtag ratio of the user
$Inf(u_i)$: Influence score
$I_t$: Tweet Index

Table 2: Models Comparison using Features

Table 4: Dataset Descriptive Statistics of only Four Features