HyVADRF: Hybrid VADER–Random Forest and GWO for Bitcoin Tweet Sentiment Analysis

In recent years, Bitcoin and other cryptocurrencies have increasingly been considered investment options in emerging markets. However, Bitcoin's erratic behavior has discouraged some potential investors. To gain insight into its behavior and price fluctuations, past studies have established a correlation between Twitter sentiment and Bitcoin behavior, but most focused exclusively on that relationship rather than on the Twitter sentiment analysis itself. Finding the most suitable classification algorithm for this kind of data is challenging. Supervised sentiment analysis has been shown to outperform unsupervised approaches, yet manually labeling Twitter's enormous volume of data is time-consuming and expensive. We therefore propose the HyVADRF (hybrid valence aware dictionary and sentiment reasoner (VADER)-random forest) and gray wolf optimizer (GWO) model. The semantic, rule-based VADER was used to calculate polarity scores and label sentiments, overcoming the weakness of manual labeling, while a random forest served as the supervised classifier. Furthermore, considering Twitter's massive size, we collected over 3.6 million tweets and analyzed various dataset sizes, as these affect the model's learning process. Lastly, GWO parameter tuning was conducted to optimize the classifier's performance. The results show that 1) the HyVADRF model achieved an accuracy of 75.29%, precision of 70.22%, recall of 87.70%, and F1-score of 78%; 2) the most suitable dataset size is 90% of the total collected tweets (n = 1,249,060); and 3) the standard deviations are 0.0008 for accuracy and F1-score and 0.0011 for precision and recall. Hence, the HyVADRF model consistently delivers stable results.


The associate editor coordinating the review of this manuscript and approving it for publication was Rosalia Maglietta.

As one of the most interesting topics in the present world, cryptocurrency has changed the way people think about money. It is a digital currency governed by a cryptographic protocol that uses blockchain technology [1]. Its continuous adoption and widespread usage have added substantial value to its real-world applications. The first cryptocurrency is Bitcoin, which was developed in 2009 [2]. It is a type of electronic cash without central governance and can be used as a medium for online transactions between any two parties. Bitcoin is a very volatile currency, and its price is influenced by socially constructed opinions. Past studies discovered that some of the extreme price increases and decreases in Bitcoin coincided with dramatic events in China [3]. The rise of Internet technology has played an unprecedented role in increasing the number of users' opinions and emotions shared on social media and e-commerce platforms, either as text or multimedia data [4], [5], [6], [7]. This phenomenon has resulted in the generation of a large variety of data, which can be analyzed to assess sentiments. The analysis of sentiments is beneficial for individuals and [...]

The correlation between Twitter and the price prediction of cryptocurrency has been validated in previous studies [16], [17].

In recent years, hybrid sentiment analysis combining a semantic lexicon and supervised machine learning has been increasingly studied [18], [19], [20]. One of the most popular lexical semantic approaches for calculating sentiment polarity scores is VADER. Introduced in 2014, VADER is a lexicon and rule-based sentiment analysis model that calculates the polarities (positive/negative) and intensity (strength) of emotions to obtain the sentiment score.
The advantages of VADER include the following: (i) it is an open-source tool; (ii) it is a human-centric approach; and (iii) it is specifically designed for social media content [21]. Furthermore, supervised machine learning algorithms, such as support vector machine (SVM) and naive Bayes (NB), are the most frequently used algorithms for sentiment analysis, either in combination with VADER or on their own. Supervised learning has been found to provide more accurate sentiment analysis than unsupervised learning, such as sentiment lexicons [22]. Saif et al. [12] showed that Twitter data are sparser than other types of data (e.g., movie review data) due to the large number of infrequent words present within tweets. This sparseness can be attributed to spelling mistakes and the use of slang. Furthermore, Twitter contains a large amount of noisy data, such as URLs, punctuation, and special symbols. Thus, irrelevant words and data, which are present merely by coincidence or do not influence the current text, may affect the average polarity or entropy of the text, as they are outliers to the text in focus. The automated identification of relevant information from these data is imperative due to the immense volume of raw data, which has prompted many researchers [23], [24], [25], [26] to explore various feature selection methods and classifier models. Due to its simplicity and computational efficiency, a very popular structured text representation method is the bag-of-words model, in which documents or sentences are represented as a list of words using a document-term matrix (DTM) [27]. The association of words in the matrix is formed based on the distances between them. This approach has been successfully applied to text classification, text clustering, and information retrieval.
Most DTMs tend to be high dimensional and sparse [28] because any given document contains only a subset of the unique terms that appear throughout the corpus. Consequently, any corresponding document row has zeros for the terms that were not used in that specific document. Therefore, an approach to reduce dimensionality is needed. TF-IDF is a popular method of evaluating the weight of a word in a collection of documents [29], [30]. It represents the distribution of each word in a document across the entire document collection or corpus. Each word is assigned a TF-IDF score by multiplying the word's TF by its IDF. The steps to obtain a TF-IDF score are 1) calculate the TF value with (1), 2) calculate the IDF value with (2), and 3) calculate the TF-IDF weight value with (3), where N is the total number of documents in the corpus D:

tf(t, d) = f(t, d)    (1)
idf(t, D) = log(N / |{d in D : t in d}|)    (2)
tfidf(t, d, D) = tf(t, d) x idf(t, D)    (3)
where tf(t, d) represents the number of times that a word t appears in document d. [...] the RemoveSparseTerm() function of R. Given this advantage, we decided to use the TF-IDF approach for this study.
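As a concrete illustration of the three steps, the following minimal Python sketch computes raw term frequency, natural-log IDF, and their product. The toy corpus and token lists are invented for illustration, and the paper's R implementation may use a different TF normalization or log base:

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    # Step 1: raw term frequency, the number of times the term occurs in the document.
    return Counter(doc_tokens)[term]

def idf(term, corpus):
    # Step 2: log of (total documents / documents containing the term).
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df)

def tfidf(term, doc_tokens, corpus):
    # Step 3: multiply TF by IDF to get the word's weight.
    return tf(term, doc_tokens) * idf(term, corpus)

corpus = [
    ["bitcoin", "price", "rises"],
    ["bitcoin", "falls"],
    ["stocks", "price", "falls"],
]
# "bitcoin" occurs once in the first document and appears in 2 of 3 documents,
# so its weight is 1 * ln(3/2).
weight = tfidf("bitcoin", corpus[0], corpus)
```

A rare term that appears in only one document receives a larger IDF and thus a larger weight, which is exactly why TF-IDF down-weights the near-ubiquitous terms that dominate a sparse DTM.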

As previously mentioned, this study also aims to analyze which dataset size gives the optimum performance for the developed model.

In a recent study [31], the importance of obtaining adequately [...]

In this study, we used GWO. Introduced in 2014, this algorithm is inspired by the leadership hierarchy and hunting mechanism of gray wolves in nature. There are four types of wolves in the gray wolf hierarchy. The oldest and the leader of the pack is the alpha (α), whose main responsibility is making decisions for the pack. The next rank is the beta (β), which acts as an advisor to the alpha and a discipliner of the pack. The lowest rank in the hierarchy is the omega (ω), which is required to yield to the other, dominant wolves. The delta (δ) wolf dominates the omega and reports to the alpha and beta. According to Kayhomayoon et al. [39], this algorithm uses the following steps: 1) a wolf calculates its distance from α, β, and δ using Equations 4-9, and 2) it updates its position with Equation 10, where X_α, X_β, and X_δ are the positions of α, β, and δ, respectively, and D_α, D_β, and D_δ represent the distances between wolf i and the α, β, and δ wolves. Over the iterations, a decreases linearly from 2 to 0, and r1 and r2 are two random numbers drawn uniformly from [0, 1]. Fig. 1 depicts the flow chart of GWO.
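The position update in Equations 4-10 can be sketched as follows, assuming the standard GWO coefficient definitions (A = 2a·r1 − a and C = 2·r2, with r1 and r2 drawn uniformly from [0, 1]); the wolf and leader vectors below are invented for illustration:

```python
import random

def gwo_step(wolf, alpha, beta, delta, a):
    # One GWO position update for a single wolf (Eqs. 4-10): estimate a candidate
    # position from each leader, then average the three estimates.
    new_pos = []
    for j in range(len(wolf)):
        estimates = []
        for leader in (alpha, beta, delta):
            r1, r2 = random.random(), random.random()
            A = 2 * a * r1 - a                   # |A| shrinks as a decays from 2 to 0
            C = 2 * r2
            D = abs(C * leader[j] - wolf[j])     # distance to this leader (Eqs. 4-6)
            estimates.append(leader[j] - A * D)  # leader-guided candidate (Eqs. 7-9)
        new_pos.append(sum(estimates) / 3)       # Eq. 10: average the three candidates
    return new_pos

# With a = 0, A = 0 for every leader, so the update collapses exactly onto the
# elementwise average of the three leaders' positions.
updated = gwo_step([0.0, 0.0], [1.0, 2.0], [3.0, 4.0], [5.0, 6.0], a=0.0)
```

Because a decays linearly from 2 to 0 across iterations, early updates take large exploratory steps (large |A|), while late updates exploit the region around the α, β, and δ leaders.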

In their study on email detection, Batra et al. [40] found that k-NN classification combined with GWO achieved 100% recall and the least computational time among the Bayesian information criterion algorithms.

In this section, we propose the HyVADRF and GWO model as the research framework for Bitcoin tweet sentiment analysis, owing to its benefits. First, the framework uses the VADER algorithm to calculate a compound polarity score for labeling the raw data, which is cheaper, less error prone, and faster than manual labeling. Second, as supervised machine learning is known to outperform unsupervised approaches, we decided to use RF, NB, L2-SVM, and DT as the machine learning algorithms. Third, the GWO algorithm and tuneRanger were used to tune the parameters for machine learning optimization. Fig. 2 presents the proposed framework for the sentiment analysis of Bitcoin-related tweets.

The tweet dataset does not include a labeled output. Tags of positive or negative are needed to train a supervised classifier. Thus, VADER, a rule-based lexicon method, was applied to label the dataset. Before VADER was applied to the tweets, ''noise'' removal was performed on the raw data. Manual cleaning of the raw data and the use of regular expressions (RegEx) in natural language processing (i.e., removal of URL links, hashtag symbols, and irrelevant tweets) were applied very carefully to avoid decreased accuracy.

Tweets were preprocessed before the machine learning algorithms were applied. Neutral-value comments were discarded. Only tweets with positive and negative labels were preprocessed and used for the machine learning algorithms, following a prior study [18].
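The labeling step maps VADER's compound polarity score to a sentiment tag. Below is a minimal sketch of that mapping, using a toy stand-in lexicon and scorer rather than the real VADER; the ±0.05 compound thresholds are the conventional VADER defaults and are an assumption here, not confirmed from the paper:

```python
import math

# Toy stand-in valences; the real VADER lexicon has thousands of rated tokens, and
# a real pipeline would call vaderSentiment's SentimentIntensityAnalyzer instead.
TOY_LEXICON = {"great": 3.1, "good": 1.9, "bad": -2.5, "crash": -2.0}

def toy_compound(text):
    # Sum valences, then squash into [-1, 1] using VADER's normalization
    # score / sqrt(score^2 + alpha) with alpha = 15.
    s = sum(TOY_LEXICON.get(w, 0.0) for w in text.lower().split())
    return s / math.sqrt(s * s + 15)

def label(text, pos_th=0.05, neg_th=-0.05):
    c = toy_compound(text)
    if c >= pos_th:
        return "positive"
    if c <= neg_th:
        return "negative"
    return "neutral"   # neutral tweets are discarded before training, as above

labels = [label(t) for t in ["bitcoin looks great", "bad crash today", "price is flat"]]
```

The point of the sketch is the thresholding: only tweets whose compound score clears the positive or negative cutoff receive a training label, which is how the unlabeled raw data become a supervised training set.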

The preprocessing steps started with creating corpus documents for this dataset. Then, ''noise'' removal steps, such as eliminating punctuation and numbers, were performed. The next step was removing English stop words (e.g., ''are,'' ''as,'' ''is,'' ''of,'' and ''the''), which are unnecessary for classifying the documents. Afterward, stemming was performed, which is the process of transforming different forms of a word to its root form (e.g., fishing, fish, and fisher to fish). This step avoids unwanted computation over word variants and therefore reduces the time the algorithm spends training on all the forms of a word. Unnecessary white spaces were also removed. A DTM using the TF-IDF feature extraction method was applied to convert the documents into feature (i.e., term) vectors, which can easily be understood by a machine learning algorithm. Representing a document or a sentence in this way is an important part of text data classification. [...] LiblineaR for L2-SVM [43], fastNaiveBayes for NB [44], and caret for DT [45]. [...] [48]. Table 1 summarizes the tuned hyperparameters, their definitions, and their tuning ranges. The population was set to 30 with a maximum of 100 iterations in the GWO, as in a past study [49].
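The cleaning, stop-word removal, and stemming steps described above can be sketched as a small pipeline. The stop-word list is truncated to the examples given earlier, the suffix-stripping stemmer is a naive stand-in for a real stemmer such as Porter's, and the regex patterns are illustrative choices:

```python
import re

STOP_WORDS = {"are", "as", "is", "of", "the"}   # small sample; real lists are larger

def stem(word):
    # Naive suffix stripping as a stand-in for a real stemmer (e.g., Porter).
    for suffix in ("ing", "er", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(tweet):
    text = re.sub(r"https?://\S+", " ", tweet.lower())   # strip URLs
    text = re.sub(r"[#@]\w+", " ", text)                 # strip hashtags and mentions
    text = re.sub(r"[^a-z\s]", " ", text)                # strip punctuation and numbers
    tokens = [w for w in text.split() if w not in STOP_WORDS]
    return [stem(w) for w in tokens]

tokens = preprocess("The price of #Bitcoin is rising!!! https://t.co/xyz 2022")
```

Each cleaned token list would then feed the DTM construction, with TF-IDF supplying the term weights.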

During the hyperparameter tuning, the training performance from the fivefold CV was used as the fitness function of the GWO. Each candidate hyperparameter set was represented by a wolf in the GWO. With each iteration of the GWO, the wolf positions were updated to maximize the fitness value, and the hyperparameters were optimized accordingly. The pseudo-code is depicted in Algorithm 3.
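A compressed, pure-Python sketch of that tuning loop is shown below. The two tuned quantities, their bounds, and the quadratic fitness function are all invented stand-ins; in the actual model, the fitness would be the fivefold-CV performance of an RF trained with the candidate hyperparameters:

```python
import random

def fitness(params):
    # Toy stand-in for the fivefold-CV accuracy of an RF trained with `params`;
    # it peaks at mtry = 8, node_size = 5 (both values invented for illustration).
    mtry, node_size = params
    return -((mtry - 8) ** 2 + (node_size - 5) ** 2)

BOUNDS = [(1, 20), (1, 10)]   # illustrative search ranges, one per hyperparameter

def tune(n_wolves=30, max_iter=100):
    random.seed(0)   # fixed seed so the sketch is reproducible
    pack = [[random.uniform(lo, hi) for lo, hi in BOUNDS] for _ in range(n_wolves)]
    for t in range(max_iter):
        pack.sort(key=fitness, reverse=True)           # alpha, beta, delta = top three
        alpha, beta, delta = [list(p) for p in pack[:3]]   # freeze leaders this round
        a = 2 - 2 * t / max_iter                       # a decays linearly from 2 to 0
        for w in pack:
            for j, (lo, hi) in enumerate(BOUNDS):
                estimates = []
                for leader in (alpha, beta, delta):
                    r1, r2 = random.random(), random.random()
                    A, C = 2 * a * r1 - a, 2 * r2
                    estimates.append(leader[j] - A * abs(C * leader[j] - w[j]))
                w[j] = min(max(sum(estimates) / 3, lo), hi)  # clamp into the range
    return max(pack, key=fitness)

best = tune()
```

The population of 30 and the 100-iteration cap mirror the settings quoted from the past study; each wolf is one full hyperparameter vector, and the pack converges toward the settings with the best cross-validated fitness.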

Another hyperparameter tuning method, tuneRanger, allows simultaneous tuning of RF parameters using an automatic model-based optimization process [47]. Arguments for this method were set to their defaults based on the example provided in the literature [47]. Finally, to explore the effect of hyperparameter tuning, we compared the performance of the standard RF and the tuned RF (GWO-tuned RF and tuneRanger-tuned RF) using the dataset size that gave the highest performance metrics for the standard RF. The standard RF used the default hyperparameter values specified in the ranger R package.

The evaluation of the performance of the machine learning algorithms is shown in Fig. 4. We gradually increased the percentage of the dataset used for the training and test data. We performed a baseline random inference implementation by re-shuffling, re-sampling, and running each algorithm five times with different seeds, and we used the average accuracy, precision, recall, and F1-score. For all dataset size percentages, RF gave the best results, with accuracies in the range of 72%-75%, precisions of 68%-70%, and F1-scores of 75%-77%. DT achieved the highest recall scores, with values above 98%. However, these came at the cost of low precision scores of around 55% and F1-scores of around 70%, which made DT an unsuitable algorithm for these data. Meanwhile, RF did not have the highest recall scores, but they were within the range of 83%-86%. Thus, RF is the most suitable algorithm for this dataset.
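The four reported metrics follow the standard confusion-matrix definitions. The sketch below computes them for invented counts chosen to mimic the DT-like pattern just described (very high recall, low precision); the counts are not from the paper:

```python
def metrics(tp, fp, fn, tn):
    # Standard binary-classification metrics from confusion-matrix counts.
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Invented counts: catching nearly every positive (high recall) at the cost of
# many false positives (low precision) caps the F1-score, the harmonic mean.
acc, prec, rec, f1 = metrics(tp=98, fp=80, fn=2, tn=20)
```

Because F1 is the harmonic mean of precision and recall, a classifier cannot compensate for ~55% precision with 98% recall, which is why DT's F1 stays near 70% despite its near-perfect recall.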

In general, the hybrid RF-GWO and hybrid RF-tuneRanger slightly outperformed the single RF model. This slight improvement is not surprising, as the gains from tuning tend to be less obvious where RF already performs satisfactorily [52]. Furthermore, the impact of RF tuning is much smaller than that for other machine learning algorithms, such as SVM [53].

To obtain more representative results, we also compared the standard deviation (SD) of each model. In terms of accuracy, the SD decreased from 0.0015 to 0.0008 (RF-GWO) and 0.0014 (RF-tuneRanger). The SD of precision was reduced from 0.0020 to 0.0011 (RF-GWO) and 0.0016 (RF-tuneRanger). Moreover, the SD of recall increased from 0.0007 to 0.0011 (RF-GWO) and 0.0019 (RF-tuneRanger). The SD of the F1-score decreased from 0.0011 to 0.0008 for RF-GWO but increased to 0.0015 for RF-tuneRanger. These results confirm that RF-GWO is more stable than either a single RF or RF-tuneRanger. In addition, they show the feasibility of using GWO to improve the classifier model.

Although our hybrid VADER RF-GWO model has a lower accuracy (75.29%) than those proposed in similar past studies [18], [20], the dataset we used was much larger than theirs. In their study evaluating the performance of Indonesian politicians based on YouTube comments using a hybrid lexicon and SVM, Tanseba et al. [20] achieved an accuracy of 84%, precision of 91%, and recall of 80%. However, their dataset was limited to 1,000 comments. Similarly, Chaitra [18] used 2,586 comments to analyze opinions toward mobile phone use using hybrid VADER and naive Bayes, resulting in an accuracy of 79.78% and an F1-score of 83.72%. In our case, we used 1,124,154 tweets with a 70% training set and 30% test set. The hybrid VADER RF-GWO model on these data gave low SDs for accuracy, precision, recall, and F1-score. This result supports the finding of a prior study that large training sets appear to be the most accurate and consistently deliver robust results.

To some extent, past studies lack comparisons of the behaviors and performances of machine learning algorithms across different dataset sizes and hyperparameter tuning methods. This condition is regrettable given the importance of the dataset size for massive quantities of data, such as social media data. From a theoretical perspective, this study contributes to the existing literature by exploring the role of

VOLUME 10, 2022