CE-BERT: Concise and Efficient BERT-Based Model for Detecting Rumors on Twitter

Detecting rumours on social media requires careful consideration of both content and context. Graph-based neural network techniques have been used to explore the contextual features of tweets. However, reliable contextual feature extraction from Twitter is challenging due to its rules and restrictions. BERT-based models extract features directly from tweet content but can be computationally expensive, limiting their practicality. We propose CE-BERT, a concise and efficient model that detects rumours on Twitter using only the source text. By reducing the number of BERT parameters, we improved processing speed without sacrificing performance. Our experiments show that CE-BERT outperformed BERT-BASE and RoBERTa and achieved results comparable to leading graph-based models. Because it relies only on source text, CE-BERT is also more practical for real-world scenarios given Twitter's rules and restrictions. Our results indicate that CE-BERT is faster, more concise, and more efficient than other advanced models. We hope our research aids the development of practical and effective techniques for detecting rumours on social media.


I. INTRODUCTION
Social media platforms, such as Twitter, offer users access to a wide range of information at minimal cost. However, these platforms also enable rumours to spread rapidly, causing confusion, spreading misinformation, eroding public trust in government, and threatening public safety. Additionally, it is difficult for ordinary users to distinguish between rumours and facts. As a result, automated detection technology to identify false information on social media has become critical to ensuring social security.
While social media platforms like Facebook, Instagram, and WhatsApp are also susceptible to the spread of rumours, this study focuses specifically on Twitter due to its unique characteristics and large user base. One significant reason is its real-time, short-text nature, which makes it an ideal platform for rapidly disseminating information, including rumours. Moreover, Twitter's high degree of connectivity allows users to follow and interact with other users quickly, creating a network effect that enables rumours to spread rapidly through retweets and shares. Additionally, Twitter has a large presence of journalists, politicians, and other influential people, making it easier for rumours shared by these individuals to gain credibility and reach a wider audience.
Furthermore, we propose to use Twitter as a dataset for short-text classification tasks for several reasons. Firstly, Twitter provides a vast amount of easily accessible data through its API, making it convenient for researchers to collect and analyse large volumes of text data. Secondly, Twitter data often contains real-time, dynamic information relevant to current events, making it a valuable resource for researchers studying topics such as sentiment analysis or event detection. Thirdly, Twitter data often contains noisy and informal language, such as slang and abbreviations, making it a challenging and interesting task to develop models that can accurately classify its text. Lastly, the technique proposed in this article is not restricted to Twitter and can be applied to other short-text social media platforms.
The term 'rumour' is commonly used to describe information circulating among people without supporting evidence, making it potentially accurate, partially true, false, or unconfirmed [1], [2]. Rumours often appear credible but are typically unverified and spread with questionable credibility, causing fear and illogical behaviour among readers [3], [4]. To verify the credibility of social media information, various natural language processing (NLP) techniques have been developed and implemented as binary [5], [6], [7], three-way [7], [8], or four-way [7], [9], [10], [11], [12] classification tasks. This study proposes a novel rumour detection model using a four-way classification task. The proposed model is computationally efficient compared to state-of-the-art models without compromising performance.
Context-based features include information related to tweets, such as user and network information. On the other hand, content-based features involve extracting features from a tweet's text, including linguistic elements and semantic information. Most approaches, whether context- or content-based, extract features from tweets and represent them numerically. Afterwards, they are classified using conventional machine learning models [15], [16], [17], [18] or deep neural network-based models [5], [19], [20], [21].
Recent studies have achieved state-of-the-art results in detecting Twitter rumours by exploring content- and context-based features using graph-based neural network approaches. They extract high-level representations from propagation pathways, trees, or networks to identify rumours [9], [10], [12], [22], [23]. However, mining the network information of tweets on Twitter is not trivial and is time-consuming. Moreover, Twitter's rules restrict dataset publishers from distributing anything more than tweet IDs. Therefore, these techniques must use the Twitter application programming interface (API) to retrieve all the attributes of a tweet, including its text, user ID, and number of retweets. This process requires mining multiple layers of data to obtain detailed information. For example, the first mining layer retrieves the tweet text, user ID, and number of retweets. It is then necessary to dig into a second layer to obtain the poster information or user attributes, and into subsequent layers to extract the retweet text or any image or video accompanying a tweet. Thus, extracting all of this information is a time-consuming process.
In addition, there are limitations when downloading data from Twitter. Twitter's policy restricts data streaming once the number of retrieved tweets exceeds a limit, and it rate-limits an application's data flow within each 15-minute window. Moreover, past tweets are only available for a maximum of three weeks unless a user registers for an enterprise API or applies for full-archive search access, which makes this solution very costly.
All the existing published datasets contain basic context and content information about their tweets. To retrieve more information, APIs are required to extract additional data from Twitter. However, Twitter users can delete their posts at any time, making it highly likely that many tweets have been deleted since the datasets were published. As a result, retrieving all the information about those deleted tweets is impossible. Consequently, the mined network and propagation attributes do not accurately reflect the conditions under which a rumour spreads and becomes breaking news. Furthermore, many Twitter users lock their privacy settings, which prevents access to their user information. Given these limitations, many researchers focus on content-based rumour detection models that rely only on the source text [7], [21], [24], [25], [26].
Previous studies have analysed the textual content of tweets using text representation methods at both the word and sentence levels. The emergence of word embeddings, such as Word2Vec [27], GloVe [28], and FastText [29], marked a turning point for contemporary language models. Word embeddings encode text as numerical vectors, allowing words with similar meanings to have similar numerical representations. Another significant breakthrough in NLP is the transformer [30], which requires training on much larger and more diverse corpora. Currently, BERT (Bidirectional Encoder Representations from Transformers)-based models are reported as state-of-the-art for text classification and misinformation verification tasks [7], [26]. They can capture the contextual meaning of a word by considering both its left and right context, which results in a better representation of the text [31].
Despite the advantages and capabilities of BERT-based models, there are several challenges related to their training, memory usage, and processing, primarily due to the massive amount of training data they require. For instance, BERT-BASE consists of 12 layers and 110 million parameters, and normally requires at least 8 GB of GPU memory. Therefore, various studies have suggested reducing the size of BERT through techniques such as knowledge distillation [32] and pruning [33], [34]. Furthermore, a few studies have demonstrated that not all parameters are necessary and that most attention heads can be eliminated or relocated without compromising performance [35], [36].
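To give a concrete sense of this scale, the short sketch below (a minimal illustration, assuming the Huggingface transformers library and its public bert-base-uncased checkpoint) loads the pre-trained model and counts its parameters, arriving at roughly the 110 million figure quoted above:

```python
from transformers import BertModel

# Load the 12-layer pre-trained BERT-BASE checkpoint and count its parameters.
model = BertModel.from_pretrained("bert-base-uncased")
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")  # approximately 109M for the bare encoder
```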
Motivated by the challenges and limitations of extracting context-based information from tweets, and inspired by the capabilities of BERT-based models and their size constraints, we propose the Concise and Efficient BERT (CE-BERT) model, which retains only a limited set of effective layers, making it efficient while accurately detecting rumours on Twitter using only the source text. We evaluated our approach by removing several encoder layers from BERT in various scenarios and comparing their performance in classifying rumours on Twitter. We experimented with models of twelve, six, four, and two layers, selected based on their functionality. Additionally, we present an advanced fine-tuning technique to optimise the performance of the limited layers of CE-BERT and evaluate the model's effectiveness on three distinct datasets: PHEME, Twitter 15, and Twitter 16. Overall, we summarise our contributions as follows:
• We demonstrate practical strategies to reduce the number of layers in BERT without sacrificing its performance.
• We propose a strategy for selecting which layers to preserve, combining a skip-jump layer strategy with a blend of small- and large-contribution layers. Our proposed technique outperforms the other strategies in terms of performance.
• We propose the CE-BERT model, which performs well in classifying rumours on Twitter by utilising only the text of tweets, and compare its performance with several state-of-the-art models that use more complex architectures. The results show that our proposed model performs well and is more efficient in terms of memory and computational resource requirements.
The remainder of the paper is organised as follows: Sections II and III discuss related works on rumour detection and the details of the proposed model for detecting rumour, respectively. Section IV discusses the experiments used to investigate the model's performance. Finally, Section V concludes with some comments.

II. RELATED WORK
This study connects two bodies of literature: rumour detection in the application domain and natural language processing (NLP) models in the methodological area. We selected Twitter in the application domain and BERT as the NLP model. Our literature review begins with rumour detection, followed by NLP models.
A. RUMOUR DETECTION ON TWITTER
In the problem of four-way classification for detecting rumours on Twitter, a number of recent models have garnered attention for their state-of-the-art performance. These models employ a neural network approach based on graph theory and leverage both content-based and context-based features to construct tweet propagation graphs. For example, Huang et al. [10] proposed a heterogeneous graph attention network built from a graph that includes tweets, words, and users, and achieved accuracies of 0.91 and 0.92 on the Twitter 15 and 16 datasets, respectively. Similarly, Wei et al. [12] obtained accuracies of 0.901, 0.908, and 0.694 on the Twitter 15, Twitter 16, and PHEME datasets, respectively. They proposed fuzzy graph convolutional networks (FGCNs) to model uncertain interactions in the information network by capturing the contextual semantic association representation of source tweets during propagation. In another study, Ran et al. [23] used multi-channel graph attention networks and achieved accuracies of 0.908 and 0.916 on the Twitter 15 and 16 datasets, respectively. They constructed three sub-graphs to model the propagation structures: source tweets and their responses, source tweets and their words, and the relationship between those source tweets and their related users. Additionally, Liu et al. [22] introduced a new method called DA-GCN, which utilised a dual attention mechanism and a graph convolutional network to create an event propagation graph and incorporate insights from user comments and retweets. Using this model, they obtained high accuracies of 0.905 and 0.902 on the Twitter 15 and 16 datasets, respectively.
While leveraging the social network or dissemination information of a tweet can yield high-performance results in detecting rumours on Twitter, the process of extracting all relevant information is often hindered by Twitter's rules and restrictions, as mentioned earlier. Furthermore, since users can delete posts at any time, many comments or retweets may disappear, which can distort the accuracy of the propagation information captured when the rumour first emerged. These limitations may render the performance improvement insignificant compared to the effort invested in mining the data.
Alternatively, several studies have adopted a content-based approach that employs BERT-based models to extract text features from tweets, yielding outstanding results in four-way rumour classification. Luo et al. [46] proposed a novel method of embedding the propagation tree into vectors while preserving its temporal-structural information by combining CNN with BERT and RoBERTa. The BERT-CNN models achieved accuracies of 0.852 and 0.851, while the RoBERTa-CNN models obtained accuracies of 0.862 and 0.896 on the Twitter 15 and 16 datasets, respectively. In another study, Pelrine et al. [7] achieved impressive results by focusing solely on tweet content and utilising CT-BERT, RoBERTa, and BERT to extract tweet features. Using RoBERTa, which has 24 layers and 355 million parameters, their approach reported promising F1 scores of 0.825, 0.818, and 0.848 on the PHEME, Twitter 15, and Twitter 16 datasets, respectively.
Considering the limitations in obtaining user and network information on Twitter, and the promising results obtained with BERT-based models, our study focuses on exploring tweet content using BERT-based sentence embeddings to identify rumours on Twitter. In addition, we focus on making the training process efficient and reducing the memory consumption and computational requirements of BERT-based models without compromising performance.

B. BERT-BASED MODELS
The ability of BERT to achieve state-of-the-art performance in NLP tasks by extracting contextual features from text has prompted researchers to develop and improve BERT for specific purposes. Examples of these BERT variants include RoBERTa (Robustly Optimized BERT Approach) [47], a pre-trained language model with 24 hidden layers and 355 million parameters based on the BERT-LARGE architecture but trained on a more extensive 160 GB dataset with larger batch sizes and more iterations. Another example is BERTweet [48], a pre-trained model that accounts for unique Twitter characteristics, such as hashtags, mentions, and URLs, to understand the semantics of tweets and the specific linguistic features unique to Twitter. Additionally, CT-BERT (COVID Twitter BERT) [49] is a pre-trained language model specifically trained on a large corpus of COVID-19-related tweets.
Current BERT-based models that achieve the best performance consist of numerous layers and millions of parameters. Although deeper and wider models perform better, they also require significant memory and computational resources, which limits their usability in scenarios with restricted computational resources [36], [50], [51], [52], [53]. The increasing prevalence of larger models raises multiple concerns. Firstly, there is the ecological impact resulting from the exponential increase in computational demands associated with these models. According to one study, training a single large language model is estimated to have a carbon footprint equivalent to approximately 300,000 kg of carbon dioxide emissions, roughly the emissions produced by 125 round-trip flights between New York and Beijing, a comparison that helps laypersons grasp the scale of the impact [54]. Secondly, while the possibility of running these models on devices in real time opens up new and fascinating language processing applications, their escalating computational and memory requirements might hinder widespread adoption. In particular, they might create barriers for academics, students, and researchers seeking to engage in NLP research. Therefore, some researchers emphasise the importance of developing a simple, easy-to-compute efficiency metric that could help make AI research greener, more inclusive, and perhaps more cognitively plausible [54], [55], [56].
To address the computational cost issue, researchers have proposed different methods to reduce the size of BERT models. For example, some studies have used a knowledge distillation strategy to transfer the knowledge learned by larger BERT-based models to smaller ones, such as DistilBERT [32], TinyBERT [50], ALBERT [51], and MobileBERT [57]. Other researchers applied quantisation techniques that decrease the numerical precision of the model's weights to a smaller number of bits, such as Q-BERT [58] and GOBO [59]. Another approach is pruning, which removes unused or less-critical parts of the network, as in SchuBERT [60], SNIP [61], and RPP [62]. Ganesh et al. reported in their survey that various BERT compression methods can accelerate model inference and reduce model size without a significant loss in accuracy across datasets such as Multi-Genre Natural Language Inference (MNLI), Quora Question Pairs (QQP), the Stanford Sentiment Treebank (SST-2), and the Stanford Question Answering Dataset (SQuAD). However, it was noted that none of these approaches outperformed the base model [53].
In the context of the rumour detection problem on Twitter, various BERT-based models have been investigated for their performance on binary classification tasks. Anggrainingsih et al. [26] compared RoBERTa (representing a large BERT model), BERT-BASE (representing a medium BERT model), and DistilBERT (representing a small BERT model) on the PHEME, Twitter 15, and Twitter 16 datasets. They reported that, in general, RoBERTa achieved better accuracy, although the difference was not significant compared to BERT-BASE and DistilBERT, which have fewer parameters. Additionally, Pelrine et al. [7] compared the performance of different BERT models, including CT-BERT, BERT-BASE, RoBERTa, BERTweet, ALBERT, and BERT-Tiny, on binary, three-way, and four-way classification tasks using similar datasets. They demonstrated that RoBERTa, as the largest model, outperformed the other models.
In this study, we utilised BERT-BASE as the base model, despite previous reports suggesting that RoBERTa achieves slightly better performance than other pre-trained BERT-based models [7], [26]. This decision was based on our objective of developing a lightweight and efficient BERT-based model, given that BERT-BASE has far fewer parameters than RoBERTa. We observed that the benefit of RoBERTa's improved performance did not outweigh its additional complexity in our specific use case of detecting rumours on Twitter. Moreover, a lightweight and efficient model like BERT-BASE is more practical in real-world scenarios where processing speed is critical for effective rumour detection.

III. THE PROPOSED METHOD AND VALIDATION STRATEGY
Our proposed method to obtain CE-BERT involves three main steps: selecting effective layers, efficient fine-tuning, and classification, as shown in Figure 1.
In the first step, we selected and removed some BERT-BASE (hereafter referred to simply as BERT) encoder layers to reduce the number of model parameters. In the second step, we fine-tuned the newly reduced model and saved the tweet embeddings as the model output. In the third and last step, we classified the tweet representations using a four-layer perceptron (4MLP) and evaluated the classification performance. Overall, we compared the performance of several reduced BERT models with varying numbers of layers, including six, four, and two layers.

In addition, we improved the fine-tuning process by applying 50 warm-up steps to the learning rate to mitigate the impact of early training examples. A warm-up step is a technique that helps the model converge faster and better by giving it more time to explore the parameter space, through a low learning rate, during the first phase of training before settling on a stable solution. Once this warm-up period is over, the model is trained with the normal learning rate until convergence is reached [63].

A. SELECTING EFFECTIVE LAYERS
Figure 2 depicts the different strategies for dropping layers of the BERT model to develop a concise and efficient BERT that can perform the same task as BERT without compromising accuracy. The details of the strategies and their justifications are presented below:
• For the basic model, we utilised the original BERT model with all of its twelve layers [31] and fine-tuned it by training all encoder layers, as depicted in Figure 2(A).
• For reducing the layers to half, i.e. six layers, we experimented with two variants: keeping only the odd-numbered layers [1,3,5,7,9,11] or only the even-numbered layers [2,4,6,8,10,12], as depicted in Figure 2(B).
• For reducing the layers to one-third, i.e. four layers, we experimented with three different types of four-layer BERT models. Firstly, we used the last four layers of BERT [9,10,11,12], as depicted in Figure 2(C1), which is reported to yield the best results when fine-tuning BERT [31]. Secondly, we implemented a contribution-based dropping approach, as suggested in [36]. We evaluated the contribution of each layer based on the difference between the encoder's input and output, assuming that an encoder layer makes a significant contribution when its output differs from its input. We then calculated and averaged the cosine similarity scores for each BERT layer (as demonstrated in Table 1) and selected the four layers whose inputs and outputs differed most, which were layers [1,9,11,12] of BERT, as shown in Figure 2(C2). Thirdly, we combined layers with small and large contributions based on the similarity ratings in Table 1 and applied skip-jumps of two and three layers starting from the fourth layer, inspired by a previous study which found that, during fine-tuning, the majority of modifications start as early as the fourth layer [64]. As a result, we preserved encoder layers [4,6,8,11] and discarded the remaining eight layers, as shown in Figure 2(C3). (A code sketch of the contribution scoring and layer dropping follows this list.)
• For reducing the layers to a minimum of two, we preserved the first and last hidden layers [1,12] and removed the other encoder layers, as shown in Figure 2(D). This is inspired by [64], which indicated that the early layers learn generic language patterns while the final layer learns task-specific patterns.
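To make the contribution scoring and layer dropping described above concrete, the following minimal sketch shows one possible implementation, assuming the Huggingface transformers library, in which BertModel exposes its encoder blocks as the nn.ModuleList model.encoder.layer. The sample texts are placeholders (the exact scoring corpus is not specified here), and note that Huggingface indexes layers from zero, so the paper's layers [4,6,8,11] correspond to indices [3,5,7,10].

```python
import torch
import torch.nn.functional as F
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

def layer_contributions(texts):
    """Score each encoder layer as 1 - cos(input, output), averaged over texts.
    A higher score means the layer changes its input more (larger contribution)."""
    scores = torch.zeros(model.config.num_hidden_layers)
    with torch.no_grad():
        for text in texts:
            inputs = tokenizer(text, return_tensors="pt", truncation=True)
            # hidden_states holds 13 tensors: the embeddings plus 12 layer outputs.
            hidden = model(**inputs).hidden_states
            for i in range(model.config.num_hidden_layers):
                sim = F.cosine_similarity(hidden[i].flatten(1),
                                          hidden[i + 1].flatten(1)).mean()
                scores[i] += 1.0 - sim
    return scores / len(texts)

def keep_layers(bert, keep):
    """Keep only the listed (1-based) encoder layers and drop the rest."""
    idx = {k - 1 for k in keep}
    bert.encoder.layer = torch.nn.ModuleList(
        layer for i, layer in enumerate(bert.encoder.layer) if i in idx)
    bert.config.num_hidden_layers = len(keep)
    return bert

print(layer_contributions(["a placeholder tweet", "another placeholder tweet"]))
ce_bert = keep_layers(model, keep=[4, 6, 8, 11])  # the CE-BERT configuration
print(f"{sum(p.numel() for p in ce_bert.parameters()) / 1e6:.0f}M parameters")
```

With four encoder layers retained, the printed parameter count lands near the 53 million figure reported for CE-BERT in Section IV.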

B. EXPERIMENTAL SETUP
To train our models, we used an RTX 2080 GPU and the Huggingface library [65]. For all of our experiments, we used bert-base-uncased, a 12-layer pre-trained BERT-BASE model, from the Huggingface library. We trained for 20 epochs using the Adam optimiser, with a batch size of eight and a learning rate of 5e-5.
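The sketch below illustrates this setup end to end, reusing the ce_bert model and tokenizer from the earlier sketch and adding the warm-up schedule from Section III. The toy texts and labels, the layer widths of the 4MLP head, and the choice of Huggingface's linear warm-up scheduler are illustrative assumptions; the paper fixes only the Adam optimiser, batch size of eight, learning rate of 5e-5, 20 epochs, and 50 warm-up steps.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from transformers import get_linear_schedule_with_warmup

# Hypothetical four-layer perceptron head for the four rumour classes;
# the paper names its classifier "4MLP" but does not give the layer widths.
mlp_head = nn.Sequential(
    nn.Linear(768, 256), nn.ReLU(),
    nn.Linear(256, 64), nn.ReLU(),
    nn.Linear(64, 16), nn.ReLU(),
    nn.Linear(16, 4),
)
criterion = nn.CrossEntropyLoss()

# Toy stand-ins for the tokenised tweets and their four-way labels.
texts = ["a placeholder rumour tweet", "a placeholder non-rumour tweet"]
labels = torch.tensor([0, 1])
enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
loader = DataLoader(TensorDataset(enc["input_ids"], enc["attention_mask"], labels),
                    batch_size=8, shuffle=True)

optimizer = torch.optim.Adam(
    list(ce_bert.parameters()) + list(mlp_head.parameters()), lr=5e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=50,                   # low-learning-rate warm-up phase
    num_training_steps=20 * len(loader))   # 20 epochs of batches

for epoch in range(20):
    for input_ids, attention_mask, y in loader:
        optimizer.zero_grad()
        out = ce_bert(input_ids=input_ids, attention_mask=attention_mask)
        logits = mlp_head(out.pooler_output)  # pooled [CLS] tweet embedding
        loss = criterion(logits, y)
        loss.backward()
        optimizer.step()
        scheduler.step()  # warms the learning rate up, then decays it linearly
```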
To validate the accuracy of our results, we repeated our experiments three times for each model depicted in Figure 2 and report the averages and standard deviations. We deemed three runs sufficient, as the results were consistent across runs. We evaluated each model by assessing its accuracy, precision, recall, and F1 score.
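A minimal sketch of this evaluation step, assuming scikit-learn and macro averaging over the four classes (the averaging scheme is our assumption, as it is not stated above):

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# y_true and y_pred are placeholder label lists for the four rumour classes.
y_true = [0, 1, 2, 3, 1]
y_pred = [0, 1, 2, 3, 2]
accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
print(f"acc={accuracy:.3f} p={precision:.3f} r={recall:.3f} f1={f1:.3f}")
```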

C. STATE-OF-THE-ART MODELS
We reviewed several recent studies, published within the past three years, that have achieved the highest performance on each dataset and are considered the current state-of-the-art techniques. Table 3 presents the state-of-the-art models for each dataset on the four-way rumour detection task. We compared the results of our proposed model against all of these state-of-the-art models to demonstrate how well it performs in the same area. However, we did not compare our proposed model against other BERT-based models, since previous studies have already conducted such comparisons for rumour detection on Twitter [7], [26].

IV. RESULT AND DISCUSSION
This section compares and discusses the experimental results obtained by the different strategies for making BERT concise and efficient, followed by a comparison with the state-of-the-art models.

A. BERT LAYERS REDUCTION STRATEGIES
Table 4 presents the results for the BERT model and compares them with the reduced models obtained using the strategies discussed in the earlier section. The results show that, despite having the same number of parameters, the 6-odd-layer and 6-even-layer models produce different performance and computational results on the three selected datasets. By removing six layers, we achieved a significant reduction in fine-tuning time across all datasets, cutting the fine-tuning time of the 12-layer model almost in half. This was mainly due to the reduction of the number of parameters by nearly 50%. However, on the Twitter 15 dataset, we observed that, despite having a comparable number of parameters, the 6-odd-layer model required a longer fine-tuning process (523.87 seconds) than the 6-even-layer model (383.2 seconds).

Moreover, the reduction to six layers resulted in an improvement in accuracy on both the PHEME and Twitter 15 datasets. The six-layer models on these datasets achieved accuracy increases of approximately 0.5% and 2.6%, respectively, compared to the 12-layer model. On the other hand, the accuracies of the 6-odd-layer and 6-even-layer models decreased by 7.4% and 0.8%, respectively, on the Twitter 16 dataset compared to the 12-layer model.
For the two-layer model, the proposed approach reduced the parameters by approximately 65%, and the fine-tuning process was accelerated more than fivefold, as indicated in Table 4. Interestingly, compared to the 12-layer model, the 2-layer model's accuracy decreased by only 3.4%, 2.6%, and 2.9% on the PHEME, Twitter 15, and Twitter 16 datasets, respectively. Specifically, the 2-layer model achieved accuracies of 0.813, 0.820, and 0.835 on the PHEME, Twitter 15, and Twitter 16 datasets, respectively.
At this stage, we can confidently suggest that removing six, eight, or even ten of the BERT encoder layers does not significantly impact the model's performance and is worth considering to enhance processing speed. However, we have not yet arrived at a definitive conclusion regarding which layer-dropping strategy yields the best performance. In this study, we found that combining the least and most contributing layers performed best. However, we believe the success of each strategy is significantly influenced by the characteristics of the datasets, the data-splitting strategy, and the parameter settings.

B. ENHANCING FINE-TUNING PROCESS
Our experiments showed that the warm-up step suggested in [63] had varying effects on the performance of the different models across the datasets, as presented in Table 5, compared to fine-tuning without a warm-up step. For instance, the six-layer models exhibited slight decreases in performance of 0.3% to 3% across all datasets. A notable exception was the 6-odd-layer model on the Twitter 16 dataset, whose accuracy and F1 score increased by approximately 3%.
When the warm-up technique is applied to the four-layer models, their performance varies across datasets within a range of 0.6% to 1.5%. Interestingly, the 4-layer [4,6,8,11] model consistently shows improved performance on all datasets, with accuracy increases of 1%, 2.2%, and 3.3% on the PHEME, Twitter 15, and Twitter 16 datasets, respectively, compared to models trained without the warm-up step. On the other hand, the two-layer [1,12] model exhibited a slight decrease in accuracy of 0.9% on the PHEME dataset but demonstrated some improvement, with accuracy increases of approximately 1% on the Twitter 15 dataset and around 2% on the Twitter 16 dataset.
Overall, the experimental results clearly indicate that incorporating a warm-up step during the fine-tuning process improves the model's performance, although the extent of the improvement depends on the dataset's characteristics. In our case, the 4-layer [4,6,8,11] model outperformed all other models.

C. COMPARISON WITH SOTA MODELS
Based on the experimental results presented in Table 5, we selected the 4-layer [4,6,8,11] model with the warm-up step as the best proposed model and named it Concise and Efficient BERT (CE-BERT). Comparisons of CE-BERT's performance with state-of-the-art models on the PHEME, Twitter 15, and Twitter 16 datasets are presented in Tables 6, 7, and 8, respectively. Unfortunately, we were unable to provide the number of parameters of the graph-based neural network (GNN) models in Tables 6, 7, and 8, as the studies did not report them. It has been reported that the number of parameters in a GNN can vary widely, from a few thousand to tens of millions, due to factors such as the number of nodes and layers involved [66]. Nevertheless, we can observe from the results that our proposed CE-BERT model has the fewest parameters of all the BERT-based state-of-the-art models, and we can safely call it the most computationally efficient among all the compared models.
It can be observed from Table 3 that GNN models have shown high performance in detecting rumours on the Twitter 15 and Twitter 16 datasets by leveraging text content and propagation information, making them the current state-of-the-art (SOTA) models. However, for the PHEME dataset, the current SOTA model, to our knowledge, is a content-based model that relies solely on tweet text and uses BERT-based models such as BERT, CT-BERT, and RoBERTa as feature extractors. Table 6 shows that, on the PHEME dataset, our proposed CE-BERT model, which has only 53 million parameters, outperforms other SOTA models with significantly more parameters, such as BERT with 110 million parameters and CT-BERT with 340 million parameters. Additionally, our proposed CE-BERT model, which uses only four BERT encoder layers, achieved results comparable to the RoBERTa model [7], which consists of 24 encoder layers and 355 million parameters: our model achieved an F1-score of 0.8 with a standard deviation of 0.004, while RoBERTa achieved an F1-score of 0.825 with a standard deviation of 3.3.
On the Twitter 15 and 16 datasets, we considered RoBERTa [7], FGCNs [12], MGAT-ESM [23], and DA-GCN [22] (see Table 3) as the current SOTA models to compare with our proposed CE-BERT model. Since both our model and RoBERTa rely solely on the text content of tweets, we considered RoBERTa to be the fairest comparison. As shown in Tables 7 and 8, CE-BERT achieved significantly higher performance on both datasets, with F1-scores 5.6% and 2.3% higher than RoBERTa on the Twitter 15 and Twitter 16 datasets, respectively. Moreover, we can clearly observe that our proposed CE-BERT model is the most efficient in terms of computational resource requirements.
To the best of our knowledge, GNN-based models have achieved the best results on the Twitter 15 and Twitter 16 datasets, but not on the PHEME dataset. Our proposed CE-BERT model did not outperform the SOTA GNN-based models on the Twitter 15 and Twitter 16 datasets, but it came very close, and it outperformed the other SOTA models on the PHEME dataset.
Overall, we believe that CE-BERT is promising and valuable because it is concise, efficient, and more practical to implement on Twitter in real-world scenarios, given Twitter's rules and nature, compared to a network-information-based approach. The results indicate that our model has potential for practical use and can provide valuable insights into analysing Twitter data.

V. CONCLUSION
In conclusion, we present CE-BERT, a concise and efficient BERT-based model for detecting rumours on Twitter using only the source text. By eliminating several encoder layers, we significantly improved processing speed without sacrificing the model's performance. Our experiments show that removing six or eight layers does not affect the model's performance, while still achieving results almost comparable to state-of-the-art models that utilise both the content and the social context of the source text through graph-based neural network methods. Specifically, our model achieved the best performance on the PHEME dataset and performance only 1% lower than the current state-of-the-art models on the Twitter 15 and 16 datasets. Although CE-BERT did not outperform models that utilise graph-based neural network approaches on the Twitter 15 and 16 datasets, it remains a highly efficient and accurate option for detecting rumours on Twitter, given its ability to operate using only the source text. We believe that CE-BERT has promising applications in real-world scenarios, especially considering Twitter's rules and restrictions.
Overall, our findings suggest that CE-BERT is a more efficient, concise, and faster model than other state-of-the-art models for rumour detection on Twitter. We hope that our work will contribute to the development of more effective and practical approaches for detecting rumours on social media platforms.
RINI ANGGRAININGSIH received the bachelor's degree from Diponegoro University and the master's degree from Gadjahmada University, Indonesia. She is currently pursuing the Ph.D. degree with The University of Western Australia. She is an Academic Staff with Sebelas Maret University. Her current research interests include Twitter data credibility analysis and other natural language processing studies using deep learning and pre-trained transformers-based model approaches.