Rumors Detection Based on Lifelong Machine Learning

Training data in the field of Weibo rumor detection is scarce, and online news changes constantly, but existing rumor detection models lack the ability to learn continuously: they cannot accumulate and update knowledge, and they usually need a large number of training examples to perform well. In contrast, the Lifelong Machine Learning (LML) paradigm learns continuously, retaining the knowledge learned in the past and using it to help future learning tasks; after each task is learned, some of this knowledge is updated. As knowledge grows and is refined, the performance on each task improves. Hence, we use this paradigm to build a Weibo rumor detection model. We first extract three types of features from Weibo events, based on content, user, and propagation, and propose three new propagation features together with Bidirectional Encoder Representations from Transformers (BERT) semantic features of the source message for rumor detection. We then use Simulated Annealing (SA) to improve the Genetic Algorithm (GA); the resulting method, called GA-SA, searches for the globally best minimum feature subset to improve the classification effect of the Efficient Lifelong Learning Algorithm (ELLA) on rumors. In the continuous learning process, ELLA transfers knowledge to learn new tasks and refines knowledge over time to maximize performance across all tasks. The proposed model is called GA-SA-ELLA. The experimental results show that our model achieves superior detection results even with little training data for each task.


I. INTRODUCTION
Weibo is one of the main platforms on which netizens publish and obtain information. Users publish hundreds of millions of messages on it every day, among which are many rumors. If rumors are not stopped, they have negative impacts on the economy and society. To prevent the spread of rumors on network platforms in time, reduce the harm rumors cause to the public, and create a healthy network environment, rumor detection has great research value.
The existing rumor detection methods mainly include traditional machine learning methods and deep learning methods. The general process of rumor detection based on traditional machine learning is: first, relevant features are extracted from news content, comments, and user information; then a machine learning algorithm is selected to build a classification model that judges whether the news is a rumor. For example, based on user, content, and propagation features, Support Vector Machine (SVM) [1], Decision Tree (DT) [2], [3], and Naive Bayes (NB) [2] classifiers have been constructed for rumor detection. For these methods, selecting and extracting features that are effective for the classification algorithm is crucial. Subsequently, deep learning methods became more common. For example, Ma et al. [4] proposed Recurrent Neural Networks (RNNs) to learn hidden representations from the content of relevant posts; the model was applied to Sina Weibo and Twitter rumor detection. Yu et al. [5] proposed the CAMI model for Weibo and Twitter rumor detection, which used Convolutional Neural Networks (CNNs) to obtain features and their high-level interactions from the content of the relevant posts. Yuan et al. [6] proposed the Global-Local Attention Network (GLAN) model for Twitter and Weibo rumor detection, which integrates local semantic information and global structural information: the model first generates local semantic information from the source news and related posts, and then models the heterogeneous graph of source posts and retweets to capture global structural information. Wang et al.
[7] considered the dynamic evolution of diffusion structures and proposed Dynamic Propagation Structures (NM-DPS), which divides the structure into several segments according to the posting time of each tweet and encodes each segment into a vector; the vectors are fed into a Bi-directional Gated Recurrent Unit (BiGRU) with an attention network to learn a structure representation, which is fused with the text representation for Twitter and Weibo rumor detection. Some researchers combined manual features with neural networks to detect rumors; for example, Lv et al. [8] used a CNN-LSTM model, and comment sentiment, as an important feature, was added to the model for Weibo rumor detection. In recent years, with the success of Graph Neural Networks (GNNs), rumor detection algorithms based on GNNs have become popular. Bian et al. [9] exploited both the top-down and bottom-up propagation of rumors and proposed Bi-Directional Graph Convolutional Networks (Bi-GCN) to identify Weibo and Twitter rumors. Lin et al. [10] proposed the VAE-GCN model, the first study to employ a GCN as an encoder to learn propagation information; an AutoEncoder (AE) then learns the overall structure information, and the detector uses the output of the AE to classify Twitter and Weibo news events as fake or not. Wang et al. [11] proposed the Knowledge-driven Multimodal Graph Convolutional Network (KMGCN) model, which jointly models textual information, knowledge concepts, and visual information for Twitter and Weibo fake news detection; the knowledge concepts are retrieved from Knowledge Graphs (KGs), the visual information is obtained with object detection techniques, and the textual information is processed with a GCN. Bai et al. [12] first built a Source-Replies relationship graph (SR-graph) for each conversation, using nodes to represent tweets and edges to represent interactions between them.
Based on SR-graphs, an integrated graph convolutional neural network with a node proportion allocation mechanism (EGCN) was proposed for Twitter rumor detection. Choi et al. [13] captured the dynamic diffusion information of rumors using GCNs and constructed a Dynamic Graph Convolutional Network (DGCN) for Twitter and Weibo rumor detection. To improve the effectiveness of rumor detection, some researchers have used other related information. For example, Zeng et al. [14] proposed a Hybrid Stance Attention Mechanism (RDM-HSAM) model that uses Weibo stance information to improve the classification effect of Weibo rumor detection tasks. Deep-learning-based rumor detection can effectively avoid the problems caused by the manual design of features in traditional machine-learning-based methods. However, it has a large demand for training data: when there are few training instances, the trained classifier suffers from classification bias [15].
Beyond English and Chinese rumors, many researchers are now focusing on other languages. For example, Zoleikha et al. [16] proposed the BERT-SAWS semi-supervised learning model for the early verification of Persian rumors on Twitter and Telegram; the model extracts contextual word embeddings (CWE), speech act features, and hand-crafted writing style (WS) features, and then fuses these three views to identify Persian rumors. Ahmed et al. [17] used a set of features such as Term Frequency-Inverse Document Frequency (TF-IDF) extracted from tweet contents and trained a set of machine learning classifiers to verify fake news surrounding COVID-19 in Arabic tweets. However, without the ability to accumulate knowledge, these models need a large volume of training examples to learn effectively, while training examples in the field of Weibo rumor detection are scarce, and collecting more training data manually is time-consuming and labor-intensive. To make matters worse, the online language on Weibo and other social media changes constantly, so manual collection would have to be continuous. In contrast, Lifelong Machine Learning (LML) retains the knowledge gained from previously learned tasks and uses it to help new tasks; after learning, some of this knowledge is updated and refined. The use of historical knowledge to help learn a new task is called forward knowledge transfer. As more knowledge is retained and refined over time, the model becomes more effective at learning new tasks [18]. Thus, a new rumor detection task does not need a large volume of training examples to perform well. At the same time, LML can cope with the constant change of Weibo language through its continuous learning ability.
The Efficient Lifelong Learning Algorithm (ELLA) has an LML mechanism [19]. Compared with other LML algorithms (such as CL-cbsSVM [20]), ELLA maintains knowledge shared among all tasks: it can not only perform forward knowledge transfer to learn new tasks but also perform backward knowledge transfer, using new knowledge to continuously improve the performance of previously learned tasks. In addition, the algorithm improves the learning effect by repeatedly refining knowledge rather than merely growing it. Therefore, we construct a novel rumor detection model based on the ELLA algorithm to continuously improve the learning effect of all rumor detection tasks. However, we extract many features for rumor detection, and the original ELLA algorithm can neither avoid overfitting to high-dimensional feature vectors during training nor further improve its effect by optimizing these features, so we must improve the algorithm. Commonly used feature selection algorithms are the Genetic Algorithm (GA) and Simulated Annealing (SA). GA imitates the process of natural selection, in which the individuals best adapted to environmental changes survive as the optimal solution. GA has a strong global search capability, and its optimization results do not depend on the initial conditions. The SA algorithm iterates under a variable temperature parameter that imitates the annealing process of metals. It accepts inferior individuals with a certain probability during iteration, which increases the diversity of the population; therefore, the algorithm can avoid falling into a local minimum in the early iterations. But each has its own disadvantages: (1) GA easily falls into a local optimum because each crossover always preserves the higher-fitness individuals of the parents [21].
(2) The convergence speed of SA is slow, and its search capability depends on the initial temperature and other parameters [22]. Given the complementary advantages of GA and SA, and to improve both the convergence speed and the global optimization ability, we use SA to improve GA: when the offspring produced by GA reproduction are inferior solutions, they are accepted according to a probability given by SA. This avoids falling into a local optimum by increasing population diversity, and converges faster than SA. Finally, we obtain the globally optimal minimum feature subset. We name this method GA-SA. Based on the obtained best minimum feature subset, the ELLA algorithm is used to continuously learn Weibo rumors and gradually improve the detection effect. The proposed model is called GA-SA-ELLA. In the experiments, we first verify that the newly proposed features and the BERT semantic feature are effective for rumor detection; we then examine the effects of different features, the improvement obtained by feature optimization with the GA-SA algorithm, and, last but not least, the influence of knowledge refinement on the task detection effect; finally, we show that our proposed GA-SA-ELLA model is comparable to or even better than current state-of-the-art models.
Our main contributions can be summarized as follows: 1) Three new propagation features were proposed, and BERT semantic features are used for rumor detection. 2) We modified the calculation method of sentiment tendency in the literature [23] and proved its validity. 3) We built a "Question and Correction" vocabulary corpus for extracting features. 4) We used SA to improve GA, which improves the convergence speed and feature optimization effect compared to GA and SA alone. 5) To the best of our knowledge, this is the first study to employ LML in rumor detection on social media, and we propose a novel continuous learning model called GA-SA-ELLA. GA-SA-ELLA outperforms several state-of-the-art approaches, and with knowledge refining, it achieves higher effectiveness.

II. LIFELONG MACHINE LEARNING
LML (or lifelong learning) is an advanced machine learning paradigm that learns continuously, accumulates the knowledge learned in previous tasks, and uses it to help future learning. Its learning process is shown in Figure 1. Suppose it has learned n − 1 tasks, T_1, T_2, ..., T_{n−1}.

A. COMPONENTS OF LML AND RELATED WORK
These tasks are called learned tasks or old tasks, and each task T_j has its own corresponding data set D_j. These tasks can be in the same field or different fields. When the i-th task T_i appears (called the current task or the new task) with its data set D_i, LML transfers the historical knowledge in the KB to help the Knowledge-Based Learner (KBL) learn the new task T_i. The role of the KB is to store and update historical knowledge, and the role of the KBL is to learn the current task and generate knowledge.
From the above description, LML has three key characteristics: (1) continuous learning; (2) retention of previously learned knowledge in the KB; (3) use of historical knowledge to help future task learning. Therefore, LML can continuously learn a series of tasks in the same or similar fields, extract and store useful knowledge in the KB, and transfer this knowledge to help future tasks. The knowledge is updated and refined after new tasks are learned; as more knowledge is stored and refined, the learning effect of the model improves.
From the learning process of LML in Figure 1, we can also see that LML is similar to Multi-Task Learning (MTL). MTL has been used for rumor detection: Ma et al. proposed an MTL framework that trains two highly related tasks, i.e., rumor detection and stance classification; the framework trains both tasks jointly with weight sharing to extract common, task-invariant features while each task can still learn its task-specific features [24]. Both MTL and LML can use shared information to help task learning, which can mitigate the lack of training data in rumor detection by learning other tasks. However, MTL cannot easily continue learning or accumulate knowledge. Although MTL can optimize all tasks by retraining when new tasks arrive, once the number of tasks grows large, retraining consumes a lot of time and resources, whereas LML can keep learning multiple new tasks with the help of historical knowledge [18].
LML has been used in many fields, including image recognition, NLP, and network security. For example, Zhou et al. [25] built a lifelong learning model for object detection, which overcomes the catastrophic forgetting problem of other methods because the LML model contains the KB component. Oak et al. [26] built a lifelong anomaly detection model, aiming to improve performance through knowledge growth. Wang et al. [27] constructed a lifelong relationship extraction model, aiming to discover new data and relationships. Hong et al. [28] proposed an approach based on the LML paradigm, taking sentiment classification as an example to show how it adapts to different domains while remaining efficient. These experimental results show that LML can achieve very good results in both image processing and NLP tasks. In this paper, LML is applied to Weibo rumor detection; compared with existing rumor detection models, it achieves higher rumor recognition accuracy and F1 value. LML thus not only has a wide range of applications but also exhibits powerful capabilities.
We apply LML to Weibo rumor detection as follows: first, the Task Manager assigns a new rumor task to the KBL; then, the KBL generates a classification model for the current task with the help of the past knowledge stored in the KB, and the KB retains the new knowledge for use in future tasks. With knowledge accumulation and refinement, the model becomes more and more capable, and the effectiveness of subsequent Weibo rumor detection tasks improves, so a new rumor detection task does not require a large number of training samples. This addresses the shortage of training examples in the field of rumor detection; at the same time, the continuous learning ability of LML copes with the ever-changing Weibo online language.

B. ELLA
The ELLA algorithm with the LML mechanism has a global knowledge-transfer structure: it maintains knowledge shared among all tasks, and the knowledge in the KB is transferred to and refined by every task, thereby improving the learning effect of all tasks. Therefore, ELLA can guarantee higher performance than LML algorithms that only perform forward knowledge transfer. In addition, this process is computationally efficient thanks to optimization strategies such as the second-order Taylor expansion and updating only the most influential parameters [19]. The detailed learning process of the algorithm is as follows.

ELLA uses a parametric model. Given T tasks, the prediction function on the current task t is shown in Equation (1):

f^(t)(x) = θ^(t)T x    (1)

where θ^(t) is the parameter vector of the t-th task, a linear combination of the weight vector s^(t) and the sparse shared matrix L, with the relationship shown in Equation (2):

θ^(t) = L s^(t)    (2)

The algorithm is optimized by reducing the average prediction loss over all tasks. Its objective function is shown in Equation (3):

e_T(L) = (1/T) Σ_{t=1..T} min_{s^(t)} { (1/n_t) Σ_{i=1..n_t} ℓ( f(x_i^(t); L s^(t)), y_i^(t) ) + μ ∥s^(t)∥_1 } + λ ∥L∥_F²    (3)

where ∥·∥_1 and ∥·∥_F correspond to the L1 norm and the Frobenius (L2) norm, respectively. To reduce the computational cost of task t during training, the inner loss is approximated by a second-order Taylor expansion around the single-task optimum θ^(t). After this optimization, the final objective function is shown in Equation (4):

ê_T(L) = (1/T) Σ_{t=1..T} min_{s^(t)} { ∥θ^(t) − L s^(t)∥²_{D^(t)} + μ ∥s^(t)∥_1 } + λ ∥L∥_F²    (4)

where D^(t) is (one half of) the Hessian matrix of the loss function evaluated at θ^(t), and ∥v∥²_A = vᵀAv. For the extracted feature samples {(X_i, y_i)} of the current task t, the ELLA optimization proceeds as follows. First, a base learner (logistic regression or linear regression) computes the parameter vector θ^(t) and the Hessian matrix D^(t) of the current task t. Since our task is classification, we choose logistic regression.
Their calculation methods are shown in Equations (5) and (6) (superscript T denotes matrix transposition):

θ^(t) = argmin_θ (1/n_t) Σ_{i=1..n_t} log(1 + exp(−y_i θᵀ X_i))    (5)

D^(t) = (1/(2 n_t)) Σ_{i=1..n_t} σ(θ^(t)T X_i)(1 − σ(θ^(t)T X_i)) X_i X_iᵀ    (6)

where σ(·) is the sigmoid function. After that, Equation (7) is used to compute s^(t):

s^(t) = argmin_s ∥θ^(t) − L_m s∥²_{D^(t)} + μ ∥s∥_1    (7)

where L_m refers to the value of the latent components at the m-th iteration. Then Equations (8), (9), and (10) are used together to adjust the basis latent model components L:

A ← A + (s^(t) s^(t)T) ⊗ D^(t)    (8)

b ← b + vec(s^(t)T ⊗ (θ^(t)T D^(t)))    (9)

L ← mat((A/T + λ I)⁻¹ b/T)    (10)

Finally, θ^(t) is reset using Equation (2), and the predicted label is calculated according to Equation (1).
The s^(t) and L obtained when training the current task, together with the training data of each task, are stored in the KB as knowledge. To reduce the time and space overhead of computation, when the model is trained on new tasks, the s^(t) of previously learned tasks are not updated; only L is updated. Therefore, the sparse shared matrix L is the aforementioned shared knowledge, and s^(t) is the knowledge specific to the t-th task. When new task data arrives, L can be continuously optimized. Because the ELLA algorithm is capable of both forward and backward transfer learning, it uses the sparse shared matrix L to perform transfer learning across all tasks. Therefore, as L is refined over time, the classification error on Weibo events (a Weibo original message together with its comments and reposts) decreases in all tasks, and the classification accuracy of all rumor detection tasks continues to rise. In addition, in some lifelong learning scenarios the model is updated incrementally when new task data arrives, which can make it forget what was learned in the past; this is referred to as catastrophic forgetting [29]. One way to handle catastrophic forgetting is to save all past training data and use it in each model update. In this work, ELLA retains the s^(t) and training data of each task in the KB, so even after new task data arrives, the classification model of task t, i.e., f^(t)(x), can be recomputed. This ensures that our model overcomes catastrophic forgetting.
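The per-task computation described above can be sketched as follows. This is a minimal illustration, not the optimized ELLA implementation of [19]: it fits a logistic-regression parameter vector θ^(t) by plain gradient descent, forms the halved loss Hessian D^(t), and solves the L1-regularized sparse coding step of Equation (7) with naive coordinate descent; the function names and hyperparameters are our own.

```python
import numpy as np

def fit_theta_and_hessian(X, y, lam=1e-3, iters=300, lr=0.1):
    """Base learner for one task: logistic-regression theta (Eq. (5))
    and half the Hessian of the mean logistic loss at theta (Eq. (6))."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ theta))          # sigmoid(theta^T x_i)
        theta -= lr * (X.T @ (p - y) / n + lam * theta)
    p = 1.0 / (1.0 + np.exp(-X @ theta))
    D = 0.5 * (X.T * (p * (1.0 - p))) @ X / n          # (1/2) * Hessian
    return theta, D

def sparse_code(L, theta, D, mu=0.01, sweeps=100):
    """Eq. (7): s = argmin_s ||theta - L s||_D^2 + mu ||s||_1,
    solved here by coordinate descent with soft-thresholding."""
    k = L.shape[1]
    s = np.zeros(k)
    for _ in range(sweeps):
        for j in range(k):
            r = theta - L @ s + L[:, j] * s[j]         # residual excluding j
            a = L[:, j] @ D @ L[:, j]
            c = L[:, j] @ D @ r
            s[j] = np.sign(c) * max(abs(c) - mu / 2.0, 0.0) / (a + 1e-12)
    return s
```

When a new task arrives, only the shared basis L would then be refreshed via Equations (8)-(10); the stored s^(t) of old tasks stay fixed, which is what keeps the update cheap.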
The difference between ELLA and other LML algorithms mainly lies in the scope of knowledge sharing. According to this scope, knowledge can be divided into local knowledge and global knowledge [18]. LML methods based on local knowledge usually focus on optimizing the performance of the current task with the help of past knowledge, choosing whatever pieces of past knowledge are useful to the new task; examples include CL-cbsSVM [20], Lifelong-R [30], AER (Aspect Extraction based on Recommendation) [31], improved Faster R-CNN [25], and UNLEARN [26]. LML methods based on global knowledge instead approximate optimality over all tasks, both previous and current; examples include ELLA, LASEM (Lifelong Architecture Search via EM) [32], and BBSC-KT [28]. To reduce the computational cost and improve the performance of all tasks, we choose the ELLA algorithm for lifelong learning.

III. THE PROPOSED METHODS
We first clean the relevant text content and then extract content features from the original post and comments, user features from the users' information, and propagation features by constructing a spreading tree according to the transmission relationships between users; next, we concatenate the BERT semantic feature extracted from the original post. This yields a high-dimensional feature vector. To prevent high-dimensional feature vectors from causing overfitting during training and to reduce redundant features, we use GA-SA to search for the best minimum feature subset. In the feature optimization process, the proposed algorithm not only keeps GA from falling into a local optimum but also improves on the convergence speed of the SA algorithm. Based on the obtained best minimum feature subset, the LML architecture, namely the ELLA algorithm, is used to construct the classification model GA-SA-ELLA, which achieves continuous learning of Weibo rumor news and gradually improves through knowledge refinement. The output is 0 or 1, where 0 represents non-rumors (TR) and 1 represents rumors (FR). The processing flow of the GA-SA-ELLA model is shown in Figure 2: the model consists of two parts, feature extraction and continuous rumor detection using the GA-SA-ELLA model.

A. FEATURES EXTRACTION
There are big differences between rumors and true news. These differences are reflected in the publisher's user attributes, the news' propagation attributes, and the content [21]. In terms of user attributes, rumor publishers are generally users who have not been verified and do not attract much attention. In terms of content, the correlation between the comments and the content of the source message is low for rumors, and rumor news generally carries inflammatory negative sentiment, so the sentiment of the comments under rumor news is generally negative. In terms of propagation attributes, the credibility of the news determines the number of reposts and the structure of the propagation network. Based on the above analysis, we selected a total of 23 features, divided into four categories: content features, publisher's user features, propagation features, and semantic features. Table 1 gives a brief introduction to these features. As far as we know, the BERT semantic features, avg_replytime, firstlevel_ratio, and gini_index are applied to rumor detection research for the first time. The newly proposed features are marked with an '*'.
The following will give a detailed introduction to the extraction methods and calculation methods of these features.

1) SENTIMENT CHARACTERISTIC
Compared with true news, rumors can easily arouse public anger and anxiety and thus trigger comments with negative sentiment tendencies [8]. To identify the sentiment tendency of the Weibo original news content and its comments, we compute a depth-weighted sentiment score. According to the literature [33], hype news is generally accompanied by a large number of reposts with small transmission paths, so we assign more weight to comments with large transmission paths; the weight is proportional to the depth of the comment, which reduces the influence of hype behavior to a certain extent. The sentiment tendency is calculated by Equation (11):

sentiment = (1/N) Σ_{i=1..N} sentiment_{reply_i} × (depth_{reply_i} / max_depth)    (11)

where N is the total number of comments on the Weibo event, reply_i represents the i-th comment, sentiment_{reply_i} is the sentiment polarity of the i-th comment (computed in the same way as for the original message content), depth_{reply_i} is the depth of the i-th comment, and max_depth is the maximum depth over all comments and reposts. It can be seen from Equation (11) that the lower the value, the stronger the negative emotion in the comment area.
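A minimal sketch of Equation (11), assuming each comment carries a precomputed sentiment polarity (the paper does not pin down the polarity tool here) and a depth in the propagation tree; the weighting shown is our reading of the depth-proportional scheme described above:

```python
def weighted_sentiment(comments):
    """Equation (11): depth-weighted mean sentiment over all N comments.
    `comments` is a list of (sentiment_polarity, depth) pairs; comments
    deeper in the propagation tree receive proportionally more weight."""
    if not comments:
        return 0.0
    max_depth = max(depth for _, depth in comments)
    return sum(pol * depth / max_depth for pol, depth in comments) / len(comments)
```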

2) CONTENT RELEVANCE
Content relevance refers to the degree of relevance between the comment content and the source Weibo content. Compared with true news, the comments on rumor news have a weaker correlation with the content of the source Weibo message [34]. We use Latent Dirichlet Allocation (LDA) to map the source Weibo message and each comment into the topic space via Gibbs sampling, and then use the Jensen-Shannon distance to calculate the similarity between their contents. Finally, the average similarity over all comments is used as the feature for rumor detection.
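The similarity step can be sketched as below. The LDA inference itself (e.g., via gensim) is assumed to have already produced a topic distribution for the source post and for each comment, and mapping distance to similarity as 1 − JS distance is our own choice for illustration:

```python
import math

def js_distance(p, q):
    """Jensen-Shannon distance between two topic distributions
    (square root of the JS divergence, natural log base)."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    def kl(a, b):
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return math.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))

def content_relevance(source_topics, comment_topic_dists):
    """Average topic-space similarity between the source post and its
    comments, with similarity taken as 1 - JS distance."""
    if not comment_topic_dists:
        return 0.0
    sims = [1.0 - js_distance(source_topics, c) for c in comment_topic_dists]
    return sum(sims) / len(sims)
```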

3) COMMENT CHALLENGE CORRECTION RATE
Compared with true news, rumor news is more often questioned and corrected by netizens [35]. On Twitter, researchers often use regular expressions to identify these signals, but this method does not achieve a high recognition rate on Weibo, because people expressing doubts or corrections in Chinese may use direct expressions such as "真的假的" (really?), "胡扯" (baloney), "瞎编" (trump up), and "造谣" (cook up a story and spread it around), or idioms such as "危言耸听" (say frightening things just to raise an alarm) and "捕风捉影" (make groundless accusations). To identify these complex questioning and correcting signals, we use a self-built questioning and correcting vocabulary corpus. The detailed steps are as follows: 1) Select words with questioning or correcting meanings from dictionaries or the Internet and add them to the initial vocabulary corpus; 2) Segment all comments in the dataset; 3) Use a pre-trained word2vec model in the gensim library to compute word similarities, find the top 100 segmented words most similar to each word in the initial corpus, and add them to the questioning and correcting vocabulary corpus; 4) Manually screen the corpus to remove words that do not carry questioning or correcting meanings, as well as duplicates; 5) Repeat steps 3 and 4 until no new words can be found. After the above steps, the established vocabulary corpus contains a total of 394 related words. If a Weibo comment contains a word in the corpus, the comment is marked as a questioning and correcting comment. We count the number of such comments under each Weibo event and use Equation (12) to calculate the questioned correction rate:

quest_ratio = n_quest / N    (12)

where n_quest is the number of questioning and correcting comments, and N is the number of comments under the Weibo event.
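Once the vocabulary corpus is built, computing the feature of Equation (12) reduces to substring matching (Chinese text needs no word boundaries); the corpus contents below are illustrative only:

```python
def quest_ratio(comments, quest_corpus):
    """Equation (12): fraction of comments flagged as questioning/correcting.
    A comment is flagged if it contains any word from the corpus."""
    if not comments:
        return 0.0
    n_quest = sum(1 for c in comments if any(w in c for w in quest_corpus))
    return n_quest / len(comments)
```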

4) USER ATTENTION RATE
Accounts that post rumors usually have fewer followers (fans) than friends (accounts they follow), while official accounts are the opposite [2]. This shows that the attention received by normal users is generally higher than that received by rumor-spreading users. To distinguish between normal users and rumor-spreading users, we use the user attention rate as one indicator of the credibility of Weibo news. It is calculated by Equation (13):

attention_rate = followers / friends    (13)

where followers is the number of followers of the publisher, and friends is the number of friends (accounts followed) of the publisher.

5) THREE NEW PROPAGATION FEATURES
Compared with true news, rumors attract the attention of disseminators: users are more active in the comment area of rumors, and the dissemination network of rumors is more concentrated [36]. Visual propagation diagrams of a rumor and a non-rumor are shown in Figure 3; (a) is the spread chart of a non-rumor (news id "3435462423700876"), and (b) is the spread chart of a rumor (news id "3456465409077567"). Moreover, rumors contain a lot of wrong and unconfirmed information, which easily causes doubt and denial during spreading; compared with normal news events, rumors are more controversial [35]. In the literature [37], many features were proposed for controversy detection, but some of them had not yet been used for rumor detection (i.e., avg_replytime, firstlevel_ratio, and gini_index). We therefore adopt these three as new features for rumor detection. 1) avg_replytime and firstlevel_ratio: avg_replytime is the average logged parent-child reply time over pairs of comments, and firstlevel_ratio is the proportion of comments that are top-level (i.e., made in direct reply to the original post). Because users are more active in the comment area of rumors, the avg_replytime and firstlevel_ratio values of rumor news are generally greater than those of normal news. 2) gini_index: this feature is the Gini coefficient of the replies to top-level comments and measures the degree of aggregation of discussions in the comment area. To calculate it, first sort the top-level comments by their number of replies from small to large, and then use Equation (14):

gini_index = ( Σ_{i=1..n} (2i − n − 1) y_i ) / ( n Σ_{i=1..n} y_i )    (14)

where i stands for the i-th top-level comment (in sorted order), y_i represents the number of comments that responded to the i-th top-level comment, and n represents the total number of top-level comments. For example, the Gini coefficient of Figure 3 (a) is 0.094, and that of Figure 3 (b) is 0.745.
Generally speaking, the larger the value, the more concentrated the discussion.
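A sketch of Equation (14), using the standard discrete Gini formula over ascending reply counts (our reading of the sort-then-sum procedure described above):

```python
def gini_index(reply_counts):
    """Equation (14): Gini coefficient of the replies to top-level comments.
    reply_counts[i] is the number of replies under the i-th top-level comment."""
    y = sorted(reply_counts)                  # sort from small to large
    n, total = len(y), sum(y)
    if n == 0 or total == 0:
        return 0.0
    return sum((2 * i - n - 1) * yi for i, yi in enumerate(y, 1)) / (n * total)
```

An even spread of replies gives a value near 0, while replies piled under a single top-level comment push the value toward 1, matching the concentrated discussion pattern attributed to rumors.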

BERT SEMANTIC FEATURES
Compared with traditional word embedding models such as Word2vec, GloVe, and ELMo, BERT can more accurately learn the semantic and contextual information of a text, so we use the BERT pre-trained model to extract semantic features from the source news for rumor detection.

6) OTHER FEATURES
In addition to the features in Table 1 described in the previous subsections, the rest of the features in Table 1 can also effectively distinguish between rumors and non-rumors.
(1) Other propagation features: the more trustworthy the news, the farther it spreads, that is, the greater the avg_depth value. (2) Other content features: rumors often contain external url links, and their proportion is larger than in normal news. (3) Other user features: rumor publishers generally use unverified accounts to post false news on network platforms, in other words, the value of verified is usually 0 (1 represents verified); rumor publishers are generally lower than ordinary users in the number of posts (status_count), the number of fans (followers_count), and the number of mutual followers (bifollowers_count), but higher in the number of followings (friends_count). These features are obtained directly from the corresponding fields in the dataset.

B. RUMOR DETECTION ALGORITHM BASED ON GA-SA-ELLA
As shown in Figure 2, the detailed steps of the GA-SA-ELLA rumor detection algorithm are as follows: 1) Set the initial temperature, end temperature, and cooling coefficient. The cooling procedure of SA corresponds to the search for the best solution. If the initial temperature is set too low, the optimal solution may be missed; if it is set too high, the search for the optimal solution takes so long that the algorithm becomes very inefficient.
2) Population coding. We use binary encoding for the population chromosomes. The feature dimension determines the chromosome length, and each feature corresponds to one gene on the chromosome: a gene value of "1" means the feature is selected, and "0" means it is not. The initial chromosome sets every gene to "1", meaning all features are selected.
3) Individual fitness calculation. The fitness is the average prediction accuracy of ELLA on the test sets of all tasks: the greater the average prediction accuracy, the greater the fitness. The fitness of an individual x is computed as in Equation (15):

$$ fitness(x) = \frac{1}{T} \sum_{t=1}^{T} Acc_t(x) \qquad (15) $$

where $Acc_t(x)$ is the prediction accuracy of ELLA on task $t$ using the feature subset encoded by $x$.

4) Offspring acceptance. The acceptance probability of an offspring is determined according to the SA algorithm, as shown in Equation (16):

$$ P = \begin{cases} 1, & f(x') \ge f(x) \\ \exp\left(\dfrac{f(x') - f(x)}{T_i}\right), & f(x') < f(x) \end{cases} \qquad (16) $$

where $f(x')$ is the fitness of the offspring, $f(x)$ is the fitness of the parent, $T_i$ is the temperature at iteration $i$, and $T_{i-1}$ is the temperature of the previous iteration. The determination of $T_i$, i.e., the cooling strategy, is given in Equation (17):

$$ T_i = \alpha \, T_{i-1} \qquad (17) $$

where $\alpha$ is the cooling coefficient. Analyzing Equations (16) and (17): when the fitness of the offspring is lower than that of the parent, the offspring still has a certain probability of being accepted. The cooling process is the process of global optimization; the higher the initial temperature, the greater the probability of finding the global optimum, which prevents GA from falling into a local optimum.
5) Population reproduction. The population produces the next generation through crossover and mutation; multi-point crossover is used as the crossover mode, and bit mutation as the mutation operator.

6) Get the best minimum feature subset. After the end temperature is reached, the chromosome is decoded, and the features corresponding to genes with value "1" constitute the best minimum feature subset. Based on this subset, the rumor recognition accuracy achieved by ELLA's classification should also be globally optimal.
The algorithm description of our model GA-SA-ELLA is given in Algorithm 2.
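The six steps above can be sketched in Python as follows. The `fitness` callback is a hypothetical stand-in for ELLA's average task accuracy (Equation (15)), which cannot be reproduced in a short sketch, and the default parameter values are illustrative rather than the paper's tuned settings:

```python
import math
import random

def ga_sa_search(n_features, fitness, t0=1000.0, t_end=0.1, alpha=0.9,
                 pop_size=20, crossover_points=2, mutation_rate=0.01):
    """Sketch of the GA-SA feature-selection loop (steps 1 to 6)."""
    # 2) Binary chromosome: gene "1" keeps the feature; start with all ones.
    pop = [[1] * n_features for _ in range(pop_size)]
    best = pop[0][:]
    t = t0                                   # 1) initial temperature
    while t > t_end:                         # cooling loop
        for i in range(pop_size):
            parent = pop[i]
            # 5) Multi-point crossover with a random mate, then bit mutation.
            mate = random.choice(pop)
            child = parent[:]
            for _ in range(crossover_points):
                cut = random.randrange(n_features)
                child[cut:] = mate[cut:]
            child = [g ^ 1 if random.random() < mutation_rate else g
                     for g in child]
            # 3)+4) Metropolis acceptance (Eq. 16): a worse child may survive.
            df = fitness(child) - fitness(parent)
            if df >= 0 or random.random() < math.exp(df / t):
                pop[i] = child
            if fitness(pop[i]) > fitness(best):
                best = pop[i][:]
        t *= alpha                           # cooling strategy (Eq. 17)
    # 6) Decode: indices of genes still set to "1" form the feature subset.
    return [j for j, g in enumerate(best) if g == 1]
```

In the real model, `fitness` would train ELLA on all tasks with the masked feature set and return the average test accuracy.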

A. EXPERIMENTAL DATA
In our study, we used the set of Weibo rumor events made publicly available by Ma et al. [4]. The data was collected from the Sina Weibo Community Management Center and the Weibo API. There are a total of 4664 Weibo events, of which 2313 are rumor events and 2351 are non-rumor events. Table 2 shows the statistics of the dataset. Each Weibo event is stored in a JSON file containing the number of reposts, comments, and likes; the timestamp at which the original news was released; the original post content with its comments and reposts; the user information of the publisher, forwarders, and commenters; the tags of the original news; etc. To reduce noise in the extracted features, the experimental data is cleaned first: numbers, symbols, and English characters irrelevant to the content are removed, and only Chinese characters are kept (except for the statistical URL-related features). Comments containing advertisements or sensitive words are regarded as spam and filtered out; we only extract features from the remaining non-spam comments. For the filtering, we use a public algorithm (https://github.com/JansonKong/spam_filtering).

B. FEATURE NORMALIZATION AND EVALUATION INDEX
Apart from the BERT semantic features, the value ranges of the other three types of statistical features extracted from the dataset differ too much to be used directly for classification, so these features must first be normalized. We use Z-score normalization, and the processed feature values fall in the range [-1, 1], which accommodates negative sentiment feature values.
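A minimal sketch of the normalization step follows. Since plain z-scores are unbounded, we assume an additional per-column max-abs rescaling to obtain the stated [-1, 1] range; that extra step is our assumption, not spelled out in the text:

```python
import numpy as np

def normalize(features):
    """Z-score each feature column, then rescale to [-1, 1].

    The max-abs rescaling after z-scoring is an assumption: raw z-scores
    are unbounded, so an extra step is needed to reach [-1, 1].
    """
    z = (features - features.mean(axis=0)) / features.std(axis=0)
    return z / np.abs(z).max(axis=0)
```

Negative sentiment values survive this transform with their sign intact, which is the property the text emphasizes.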
To measure the detection effect of the model, the accuracy (Acc) and F1 score are selected as evaluation indicators. Acc and F1 are calculated as follows:

$$ Acc = \frac{TP + TN}{TP + TN + FP + FN} $$

$$ F1 = \frac{2 \times P \times R}{P + R}, \quad P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN} $$

where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives, respectively.
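The two indicators can be computed directly from the confusion-matrix counts:

```python
def accuracy_f1(tp, tn, fp, fn):
    """Accuracy and F1 score from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return acc, f1

# Example: 40 true positives, 40 true negatives, 10 of each error type
# gives accuracy and F1 of roughly 0.8 each.
print(accuracy_f1(40, 40, 10, 10))
```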

C. EXPERIMENTAL PLATFORM AND ENVIRONMENT
Our research uses PyCharm as the programming IDE, and the programming language is Python 3.6. A computer running macOS with 8 GB of memory and an Apple M1 CPU is used as the experimental hardware environment.

1) DATASET RELATED PARAMETERS
There are a total of 4664 Weibo events in the dataset. To reflect the performance improvement of our model on future rumor detection tasks, we first sort all events by release time and then divide them into T tasks, where the number of Weibo events contained in each task t is n_t. To have enough training data for every task, we set the number of events in each task to 1166, so T is 4. The earliest and latest release times of the news in each task are shown in Table 3. The data samples of each task are divided into a training set and a test set at a ratio of 5:5; we use 5-fold cross-validation and take the average of the five runs as the experimental result to reduce its contingency.

2) INITIAL TEMPERATURE SETTING
To find a suitable initial temperature, we set the end temperature to 0.1°C, the cooling coefficient to 0.9, and the ELLA regularization parameter to e^{-5}. The settings of the other ELLA and GA parameters are shown in Table 5. Based on fitness and optimization time, a traversal search over initial temperatures is performed; the detailed results are shown in Table 4.
It can be seen from Table 4 that when the temperature is 500°C, the fitness is 0.958. As the temperature increases, the fitness increases, but the fitness of individuals above 1000°C differs little. To reduce the optimization time and improve the efficiency of the algorithm, the initial temperature is chosen as 1000°C.

3) MODEL PARAMETER SETTING
In the ELLA algorithm, since k ∈ [1, min(10, T/4)] and the total number of tasks T is 4, the parameter k is set to 1. The value of the parameter d depends on the feature dimension produced by GA-SA during global feature optimization. For the regularization parameter, we traverse its value range according to fitness to obtain the optimal value. The specific parameter settings of the GA-SA-ELLA model are shown in Table 5.
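For illustration, the role of k and d in ELLA's factored model can be sketched as below. Each task's parameter vector is theta_t = L s_t, where L is the shared d x k basis refined across tasks and s_t is the task-specific sparse code; the logistic prediction form and the function name are our assumptions for the binary rumor/non-rumor case:

```python
import numpy as np

def ella_predict(X, L, s_t):
    """Predict rumor probability for task t under ELLA's factorization.

    L   : (d, k) shared latent basis, refined as new tasks arrive.
    s_t : (k,) sparse task-specific code. With T = 4 tasks the paper
          uses k = 1, so every task combines a single basis column.
    """
    theta = L @ s_t                        # task-specific model weights
    return 1.0 / (1.0 + np.exp(-(X @ theta)))
```

Knowledge transfer happens through L: refining L with a new task can improve the predictions of every task that reuses its columns.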

1) NEW FEATURES AND BERT FEATURE EFFECT ANALYSIS
For all propagation features, including the three new ones (avg_replytime, firstlevel_ratio, and gini_index), we use the Python library treelib (https://github.com/caesar0301/treelib) to build the propagation tree from the reply relationships between users, then use the library's methods to obtain the relevant information and substitute it into the corresponding formulas. The statistics of the three extracted new features are displayed as box plots in Figure 4. As can be seen from Figure 4, the statistics are consistent with the theoretical analysis above.

For the BERT semantic features, we use the Chinese BERT pre-training model to extract features from the short Weibo events. Since the length of a Weibo post does not exceed 140 characters, we set the maximum length parameter of the BERT pre-training model to 140 and adopt max pooling as the pooling strategy, finally obtaining a 768-dimensional sentence vector for each event. To avoid the curse of dimensionality and reduce the complexity and resource consumption of the algorithm while preserving the original information, we use Principal Component Analysis (PCA) to reduce the dimensionality of the BERT semantic features to 100 dimensions.
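The dimensionality-reduction step can be sketched with a plain SVD-based PCA; the random matrix below stands in for the real BERT sentence vectors, which would require loading the Chinese BERT model:

```python
import numpy as np

def pca_reduce(embeddings, n_components=100):
    """Project sentence vectors (n x 768) onto the top principal components.

    In the paper, the input would be BERT embeddings (max length 140,
    max pooling); here we compute PCA via SVD of the centered data.
    """
    X = embeddings - embeddings.mean(axis=0)
    U, S, Vt = np.linalg.svd(X, full_matrices=False)  # rows of Vt = PCs
    return X @ Vt[:n_components].T

rng = np.random.default_rng(0)
fake_bert = rng.normal(size=(200, 768))   # stand-in for real embeddings
reduced = pca_reduce(fake_bert)
print(reduced.shape)                      # (200, 100)
```

Because SVD returns singular values in descending order, the retained components are those carrying the most variance, which is why the 100-dimensional projection loses little information.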
For the three proposed propagation features and the BERT semantic features, we use the unimproved ELLA algorithm to verify their effect, and use the average classification accuracy (Avg_Acc) over all tasks to show whether these new features can effectively identify Weibo rumors. The results before and after adding the new features are shown in Table 6.

2) FEATURE INFLUENCE ANALYSIS
To analyze the influence of each feature on rumor recognition, the ELLA algorithm is used with the regularization parameter set to e^{-5}, and the average classification accuracy over all tasks (Avg_Acc) serves as the benchmark for comparison. The experimental results are shown in Figure 5.
It can be seen from Figure 5 that the features used in the experiment can distinguish rumors from non-rumors to a certain extent, but their influence differs.
Among them, the BERT semantic features have the greatest influence on classification, because their dimension is higher than that of the other features and they identify rumors from the news content itself. comment_sentiment uses a new calculation method but still shows a good classification effect, indicating that the calculation method is sound. For the question_ratio feature, our self-built corpus of questioning and correction words contains only 394 words, yet the classification accuracy achieved on the dataset based on this feature reaches 0.885.

3) THE INFLUENCE OF FEATURE SELECTION ON MODEL EFFECT
To verify that the best minimal feature subset obtained by our GA-SA-ELLA model is closer to the global optimum than that obtained by GA-ELLA without the SA improvement, and to show that feature selection improves classification, we compare the ELLA algorithm without feature selection against GA-ELLA, SA-ELLA, and GA-SA-ELLA. The comparison is shown in Table 7. As can be seen from Table 7, the GA-ELLA, SA-ELLA, and GA-SA-ELLA algorithms, which perform feature selection, detect rumors better than the ELLA algorithm without feature selection; in particular, GA-SA-ELLA is 0.7% higher than ELLA. This is because feature selection removes irrelevant and redundant features, which not only reduces the feature dimension but also improves the classification effect; however, the degree of improvement differs among the three models.
It can be seen from Figure 6 that, thanks to the strong search capability of GA, the optimal fitness in the population increased from 0.9 to 0.943 in fewer than 10 iterations, but the search capability decreased later and the algorithm converged at the 62nd iteration. Although the early-stage search speed of GA-SA is not as fast as that of GA, GA-SA converges quickly, and the individual fitness obtained at convergence is higher, which shows that GA falls into a local optimum when it converges. It can be seen from Figure 7 that the search speed of SA is slow, with many plateaus during the cooling process, because SA conducts a local search near the current solution. When SA cooled to the termination temperature, its fitness was 0.956, lower than that obtained by GA-SA. Raising the initial temperature or the cooling coefficient could improve this, but would make the algorithm inefficient.
Examining the best minimal feature subsets, both GA-ELLA and SA-ELLA remove text_len, friends_count, and some columns of the BERT semantic features, while GA-SA-ELLA additionally removes repost_count and avg_createtime along with partial columns of the BERT semantic features.

4) CONTINUOUS LEARNING EFFECT EVALUATION
A model has continuous learning ability if it can improve the learning of new tasks with the help of past knowledge while its performance on old tasks does not decrease after learning new ones. To demonstrate this, Figures 8 and 9 show how the detection effect on learned tasks and on new tasks, respectively, changes through continuous learning. In both figures, KB_i denotes the knowledge base (KB) after training on the i-th (i ∈ [1, 4]) task. As can be seen from Figures 8 and 9, as the shared knowledge, i.e., the sparse shared matrix L, is refined (KB_1 → KB_4), the classification accuracy on both learned and new tasks rises. For example, when the KB was updated from KB_2 to KB_4, the accuracy on the already-learned first task improved first to 0.873 and finally to 0.895; with the help of historical knowledge, the accuracy on new tasks gradually increased from 0.86 to 0.979. In other words, as shared knowledge is refined through continuous learning, the learning effect on new tasks gradually improves through forward knowledge transfer, while through backward knowledge transfer old tasks not only avoid catastrophic forgetting but further improve.
In addition, although each task has only 1166 Weibo events, the detection accuracy of the model has reached 0.96 since the second task and has slowly improved since then. Therefore, we believe that the GA-SA-ELLA model based on the LML paradigm can achieve an ideal rumor detection effect through knowledge refining, even with limited training data per task.

To further verify the effect of our model, we selected representative traditional machine learning and deep learning rumor detection algorithms for comparison: SVM [1], GLAN [6], NM-DPS [7], RDM-HSAM [14], Bi-GCN [9], DGCN [13], VAE-GCN [10], and KMGCN [11]. The detailed comparison results are shown in Table 8; a model is marked with '*' if its results are taken from the corresponding paper.
A. Traditional machine learning method
(1) SVM. SVM is a commonly used classifier in traditional machine-learning-based rumor detection. It does not perform feature selection. Experiments on the feature set extracted in this paper show that it is inferior to our model.
B. Deep learning methods
(1) GLAN. This model considers not only the semantic information of the text but also the global structural information, and uses an attention mechanism to enhance the feature representation.
(2) NM-DPS. This study observes that existing works utilize only limited static propagation structures rather than dynamic information; the model therefore extracts features of dynamic propagation structures and integrates structural and textual information for rumor detection.
(3) RDM-HSAM. The model considers not only content, user, and temporal features but also information related to rumor detection, namely the news stance, thus improving the effectiveness of the model.
(4) Bi-GCN. This is the first study to employ GCN for rumor detection. The model considers not only the top-down propagation chains but also the bottom-up propagation features.
(5) DGCN.
This study argues that Bi-GCN is limited by representing rumor propagation as a static graph, and instead captures the dynamics of rumor propagation using sequential and temporal snapshots.
(6) VAE-GCN. The model obtains textual, propagation, and structural information for rumor detection: it uses GCN to learn textual and propagation information, and is the first to employ a VAE to obtain structural information.
(7) KMGCN. The model not only obtains semantic-level features for each post based on GCN, as other researchers do, but also retrieves relevant knowledge from a knowledge graph (KG) to help rumor detection, and attends to the visual information in news.
The above deep learning models basically use more than 64% of the data as training examples, that is, more than 2984 news items, and the accuracy of the best baseline model reaches 0.949. In contrast, by the time our model has learned the second task, with only 583 training instances per task, its average accuracy has already reached 0.96. Therefore, compared with these deep learning models, our model is better suited to the problem of scarce training data.
As can be seen from Table 8, our proposed GA-SA-ELLA model is significantly better than the other models on the classification indices. The GA-SA-ELLA model uses the GA-SA algorithm to obtain the globally best minimum feature subset; by eliminating redundant features, it not only reduces the feature dimension but also improves the classification effect. Moreover, our model learns continuously: whenever new task data arrives, it refines the shared knowledge in the KB, so the classification accuracy keeps rising. By contrast, neither traditional machine learning models nor deep learning models can accumulate or reuse past knowledge; when encountering new rumor data they can only learn from scratch, and in the complex and ever-changing online language environment, such re-learning wastes time and resources.

V. CONCLUSIONS AND OUTLOOK
We extracted three types of features (content, user, and propagation) from Weibo source content, comments, and reposts, and merged the BERT semantic features of the original message to form the feature vectors. On this basis, we used the GA-SA-ELLA model to detect Weibo rumors. The experimental results show that the three new propagation features and the BERT semantic features effectively improve rumor recognition. The best minimal feature subset obtained with GA-SA is better than the subsets obtained with GA or SA alone. Meanwhile, with its knowledge refined over time, the GA-SA-ELLA model can learn tasks effectively even when each task has little training data. However, there is still considerable room to improve the rumor detection method. Future work can proceed as follows: (1) Early detection of rumors. Only by discovering rumors as early as possible can the harm caused by their propagation be reduced, so our next step is to study early detection, for example, extracting content features from rumor text and storing them in a KG as knowledge that can be retrieved for early detection. (2) Cross-language and cross-platform rumor detection, such as for Chinese minority languages like Uyghur. (3) Using other tasks to assist rumor detection. LML can learn not only tasks in the same field but also across fields, so it can be used for continuous learning over rumor detection and related tasks, continuously improving the rumor detection effect.