Identifying COVID-19 Personal Health Mentions from Tweets Using Masked Attention Model

Twitter has been an important platform for people to discuss and share health-related information. It provides a massive amount of data for real-time monitoring of infectious diseases (such as COVID-19) and frees disease-prevention organizations from the tedious labor involved in public health surveillance. Personal health mention (PHM) detection is one of the critical methods to keep up-to-date on an epidemic's condition; it attempts to identify a person's health condition based on online text information. This paper explores PHM identification for COVID-19 through Twitter. We built a COVID-19 PHM data set containing tweets annotated with four types of COVID-19-related health conditions. A masked attention model was devised to classify the tweets as self-mention, other-mention, awareness, and non-health. We obtained promising results on the PHM identification task. The classification results facilitate timely health monitoring and surveillance for digital epidemiology. We also evaluate how the attention mechanism and training method affect the model's predictive performance.


I. INTRODUCTION
Disease outbreaks like the COVID-19 pandemic have been occurring frequently worldwide in recent years. When faced with unexpected public health threats, it is critical to provide warnings as early as possible to raise the alert and to prevent harm from spreading across countries [1]. Thus, public health surveillance has gained much attention in healthcare research. Public health surveillance consists of activities aimed at continuously and systematically collecting health-related data as well as identifying and interpreting patterns found in the data. However, traditional surveillance methods are costly and time-consuming. Health-related data are usually collected from patients and reported to a public health department for professional analysis [2], but the entire procedure lacks timeliness. When a society faces a rapidly spreading infectious disease, traditional surveillance falls short in monitoring, evaluating, and predicting the trajectory of the disease. This delays the response to the pandemic and could cause serious consequences.
With the popularity of the internet, large amounts of health-related data can be found on social media, blogs, online forums, and other platforms [3]. The number of social media users has been growing rapidly in the past decade. People discuss and share information and opinions on social media platforms [4]. It has been reported that two-thirds of American adults use social media to post their status, opinions, and other information on a regular basis. This provides the opportunity for public health departments and researchers to monitor the status of public health in real time at minimal cost [5]. Research on disease surveillance has begun to leverage social media data as a means to acquire early warning of epidemics or infectious disease outbreaks. The results of the analysis aid public health departments in providing timely medical attention and quicker health services to communities. For example, a vast number of tweets with the hashtag "COVID-19" or related keywords have appeared in recent months. Many organizations, such as The Atlantic, have launched COVID tracking projects to analyze and monitor the status of COVID based on tweets (https://covidtracking.com/). This may provide greater support for public health departments to intervene in advance of the spread of epidemics. The World Health Organization (WHO) even states that early detection can be found through social media data for more than 60% of epidemics [3]. In previous research and applications, tweets have been used for the early detection of infectious diseases such as Ebola [3], E. coli [5], cholera [40], and seasonal conjunctivitis [41]. Thus, health surveillance based on social media data is highly important for communities and societies globally.
A massive amount of data is generated on social media continuously, with reports that around 3.5 million English tweets related to COVID-19 are posted every day [14], but the majority of them are not informative or are irrelevant for the downstream tasks in public health surveillance. Manual identification of useful tweets is costly and time-consuming. Thus, a critical step in online health surveillance is to develop an automatic way to identify personal health mentions (PHMs) in tweets [6]. The task involves detecting whether a particular text contains a PHM. Specifically, PHM detection attempts to classify each post into one of four categories: non-health, awareness, self-mention, and other-mention. For example, the tweet "had my COVID-19 nasal swab Saturday. Got a call last night from CDC, test was positive!" should be categorized as a self-mention, as it mentions that the person who posted the tweet has a disease. The tweet "COVID-19 can be prevented. Watch for these signs and symptoms" should be categorized as awareness, as the post provides disease-related information but does not mention a specific patient or reference the person who posted it. "Need some corona to cure my hangover" should be classified as non-health, as the word "corona" here is the beer brand, not the virus.
PHM detection is crucial for health surveillance. It can screen posts to filter out those irrelevant to health and keep those relevant to health for subsequent public health data analysis and downstream public health applications. However, studying PHM detection based on tweets has several challenges. First, tweets are usually short texts in free form; users have no predefined template to express their opinions or health-related information, and tweets are characterized by informal and creative language, including emojis and idiomatic and ambiguous expressions. Consider, for example, "SRSLY the @LushLtd crashed at the stroke of midnight ." For this tweet, it is necessary to contend with linguistic variations and effectively extract the semantic information from the free-form text. Second, though the amount of tweet data is immense, the amount of annotated data is usually limited due to the high cost of manual annotation; this, in turn, limits the application of tools in the task of PHM identification. Because of these challenges, researchers have acknowledged that text processing and mining methods perform worse on social media text than on normal and standard texts [7]. This paper utilizes a novel deep neural network structure and model-training strategy to address these challenges. We built and annotated a COVID-19 PHM tweets corpus. We encoded tweets using word embeddings. The embeddings are fed into a bidirectional gated recurrent unit (BiGRU) layer, which sorts and extracts semantic information from short texts. The BiGRU layer is followed by a masked attention layer, which fully leverages the tweets' keywords to address the issue of informal and ambiguous expressions in tweets. The output of this layer is then fed into the SoftMax classifier to identify the corresponding PHM. In addition, we developed a novel epoch-wise moving-average-based training method to improve the efficiency of model training.
This paper makes the following contributions. First, we built the first COVID-19 tweet corpus for PHM identification research. We collected and annotated more than 11,000 tweets with the four types of health mentions. Second, we modeled PHM identification as a text classification task and proposed a masked attention model to classify each tweet into the four categories. The mask handles the differing lengths of tweets. The attention mechanism fully utilizes the keywords in tweets and thus mitigates the challenge of informal, idiomatic, and ambiguous expressions in tweets. Third, we proposed a novel model-training method based on an epoch-wise moving average of the model parameters. The newly proposed method fully utilizes the information obtained at different training stages and has achieved better results than traditional model training. To summarize, this paper is the first to address the research question of COVID-19 PHM identification from tweets and to develop baseline methods. The code and data are released at https://github.com/yw57721/PHM_COVID19_MaskedAtten to promote research in this direction.

A. PHM IDENTIFICATION BASED ON SOCIAL MEDIA DATA
PHM detection in social media data is a relatively newly defined research topic, although similar topics, such as public health surveillance, have been studied previously. Most existing work has used traditional machine-learning methods and has combined domain knowledge or external resources other than social media texts to identify PHMs. However, the knowledge and resources may be disease-specific, and these developed methods may be difficult to generalize to other diseases. Lamb et al. combined linguistic features in detection [8]. Word classes, parts of speech patterns, and stylometry have been incorporated in Twitter texts to detect influenza. Other researchers have investigated various features. For example, Yin et al. applied stylistic features to Twitter, such as emoji hashtags, to train a scalable classifier to detect PHMs [9]. Paul and Dredze proposed an ailment topic aspect model (ATAM) to identify ailment-related tweets [10]. This model organizes symptoms and the corresponding treatments of ailments into different topics with different levels of granularities; the combination of keywords and associated topics is then applied to identify the ailment. Gesualdo et al. leveraged Twitter data to detect influenza-like illnesses (ILI) [11]. They applied the case definition from the European Center for Disease Prevention and Control, which includes technical jargon related to the disease. However, most of the disease mentions are in layman's terms. Thus, they identified all the layman expressions related to the symptoms or to the disease and the corresponding technical terms or jargon, and they trained a model based on the jargon-layman terms data pairs to detect ILI cases mentioned on the internet. Coppersmith et al. applied basic natural-language processing techniques to detect four possible mental health conditions on Twitter [12]; they found that language-model-based methods significantly outperformed traditional methods.
Karisani and Agichtein combined four types of features in their neural network structure: lexical features, syntactic features, word-embedding-based features, and context features [6]. They found that incorporating the extra knowledge could improve performance. Iyer et al. observed the use of figurative expressions in tweets and combined a figurative-speech detection module with a PHM detection module to augment PHM detection [13].

B. ANALYSIS OF COVID-19 BASED ON MINING SOCIAL MEDIA DATA
Recently, much attention has been drawn to mining social media data, such as tweets, to analyze the status and public response to COVID-19. For example, the Workshop on Noisy User-generated Text in EMNLP 2020 organized one shared task on the identification of informative COVID-19 text from English tweets [14]. The participating teams were provided with a corpus containing 10k annotated tweets with two labels, "informative" and "uninformative." Informative tweets included the mention of suspected cases, confirmed cases, deaths, number of tests performed, etc. In this shared task, most of the top-ranked teams applied a pretrained language model such as BERT and its variants for this binary classification task [15]. The techniques of adversarial training and ensembling were widely used. The top 10 teams applied model-ensembling to leverage the power of different models [16]. It should be noted that the informative tweets in this shared task covered three classes in the PHM identification task studied in this paper, namely self-mention, other-mention, and awareness. The uninformative tweets corresponded to the non-health class in this paper. Thus, the research question in this paper is finer in granularity. We argue that PHM identification is more useful for downstream applications of health monitoring, as the tweets in the awareness class cannot provide much information on the latest status of the disease.
Researchers have also sought to understand the tweets' content related to COVID-19. Some work has been done to characterize self-reported symptoms, experiences with testing, and other activities related to COVID-19 from social media. Alanazi et al. manually identified self-reports of COVID-19 symptoms in tweets. They conducted an offline interview with the posters to rank the appearance of the first three symptoms and then identified the most common ones [17]. However, their work mostly deals with descriptive statistics and cannot identify the symptom automatically. Mackey et al. applied a biterm topic model to identify tweets about self-reported experiences and symptoms of COVID-19 [19]. The tweets were then clustered into five main categories, such as the report of symptoms, discussion of recovery, and confirmation of negative COVID-19 diagnoses.
Besides the characterization of symptoms and experiences, some papers have also mined public opinion from tweets. Feldman trained GPT-based models for prompt-based queries on public opinion toward COVID-19 [39]. Hosseini et al. combined both manual annotation and topic-modeling tools to identify the frequent topics [18]. They also used the framework to track public responses to the pandemic and its evolution over time.
COVID-19 has had an unprecedented impact on human beings, not only physically but also mentally. Thus, researchers have also analyzed the sentiment of tweets related to COVID-19. Nemes and Kiss used an RNN to classify tweets into four emotions: weakly positive, weakly negative, strongly positive, and strongly negative [35]. They also used this model to determine which emotional manifestations (such as hashtags) appeared on a specific topic during a given time period. Researchers have also recognized that emojis play a critical role in representing emotional content. A BERT-based model was presented to predict emojis in multilingual tweets [20]. Xue et al. used latent Dirichlet allocation (LDA) to detect topics of sentiments, popular unigrams, and bigrams in tweets [36]. They clustered the topics into five categories and found that the feeling of fear is significant in the discussion of COVID-19 cases. Similarly, Jang et al. used topic modeling to mine tweets and identify the COVID-19 topics that are most relevant to public health [38]. They also applied aspect-based sentiment analysis to interpret public sentiment on COVID-19-related issues. Kruspe used word2vec, ELMo, and BERT to encode tweets and map the sentiment score into a range of [0, 1] using the sigmoid function [44]. They found that the sentiment started out negative and became positive over time, but it remained below the average sentiment in most countries during the studied period. To summarize, this stream of research is not well defined. The classes of sentiments and emotions vary across papers, and there is no common data set for researchers to examine.

A. PROBLEM DEFINITION
We model the detection of personal health mentions from tweets as a text-classification task [21]. Let $\mathcal{T}$ denote the set of tweets and $\mathcal{C}$ the set of labels. Given annotated pairs $(T, c)$, such as ("... cough caused by coronavirus could stop for like five minutes", self-mention), where $T \in \mathcal{T}$ and $c \in \mathcal{C}$, we sought to develop a model $f: \mathcal{T} \to \mathcal{C}$ to map each tweet to a label. In the testing or application stage, we can use the learned model to detect which label should be assigned to a new tweet $T'$ for a disease by $f(T')$.

B. NEURAL NETWORK STRUCTURE
Deep-learning methods have been considered an efficient method for extracting the semantic information from texts [34]. Their performance is state-of-the-art in almost all natural-language processing tasks. Thus, we aim to adopt a deep-learning-based method to perform the COVID-19 PHM identification task. Tweets are short texts; thus, complicated deep-learning units may not function well on tweet texts. We use a relatively simple deep-learning unit, the gated recurrent unit (GRU), to extract the contextual information. GRU is a type of recurrent neural network (RNN) unit that extracts information from text. Compared with the popular long short-term memory (LSTM) recurrent neural network, it has a relatively simple structure without much loss of effectiveness in handling various tasks. In addition, tweets contain informal, idiomatic, and ambiguous expressions; it is harder for a computer to understand these complicated language usages than formal text. We use an attention mechanism to emphasize the keywords in the tweets so the semantic information in the tweets can be better extracted and represented [31].

1) ENCODING OF TWEETS USING PRETRAINED WORD EMBEDDINGS
In this paper, we use a 300-dimension GloVe word embedding developed by Stanford University to encode each word [22]. For the tweet $T = (w_1, w_2, \ldots, w_{|T|})$, the corresponding embedding of the tweet is $X = (x_1, x_2, \ldots, x_{|T|})$, where $x_t$ is the embedding of the $t$-th word in the text sequence based on GloVe encoding.
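As a minimal illustration of this encoding step, the sketch below maps a tokenized tweet to a sequence of vectors. The tiny 3-dimensional table `toy_glove`, the zero-vector fallback for unknown words, and the function name `encode_tweet` are assumptions made for illustration; the paper uses 300-dimension GloVe vectors, and its handling of out-of-vocabulary words is not stated in this section.

```python
from typing import Dict, List

# Hypothetical toy embedding table (real GloVe vectors have 300 dimensions).
toy_glove: Dict[str, List[float]] = {
    "tested": [0.1, 0.3, -0.2],
    "positive": [0.5, -0.1, 0.4],
    "covid": [0.9, 0.2, 0.0],
}
EMBED_DIM = 3


def encode_tweet(tokens: List[str], table: Dict[str, List[float]]) -> List[List[float]]:
    """Return one embedding per token.

    Tokens are lowercased before lookup; unknown tokens map to a zero
    vector, which is an assumption of this sketch.
    """
    zero = [0.0] * EMBED_DIM
    return [list(table.get(tok.lower(), zero)) for tok in tokens]


# X plays the role of (x_1, ..., x_|T|) for a four-word tweet.
X = encode_tweet(["Tested", "positive", "for", "covid"], toy_glove)
```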

2) GRU-BASED TEXT ENCODER
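A GRU unit uses a reset gate $r_t$ and an update gate $z_t$ to control how the contextual information is updated [24]. A GRU unit $h_t$ is updated as

$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t,$$

where $\odot$ is the Hadamard product and $\tilde{h}_t$ is the candidate state of $h_t$ calculated as

$$\tilde{h}_t = \tanh\left(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h\right).$$

The update gate controls the amount of past information that can be kept and the amount of new information added to the current state $t$. The reset gate determines the past states' contribution to the candidate state. These are updated as

$$z_t = \sigma\left(W_z x_t + U_z h_{t-1} + b_z\right),$$
$$r_t = \sigma\left(W_r x_t + U_r h_{t-1} + b_r\right),$$

where $\sigma$ is the sigmoid function.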
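The gate mechanics can be made concrete with a single GRU step in plain Python. This is a sketch of one common GRU formulation, assumed here for illustration: the new state is $(1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$ and the reset gate is applied to $h_{t-1}$ before multiplying by $U_h$. The parameter names (`W_z`, `U_z`, `b_z`, and so on) and the helper functions are hypothetical and do not come from the released code.

```python
import math
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply matrix m (a list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def affine(W: Matrix, x: Vector, U: Matrix, h: Vector, b: Vector) -> Vector:
    """Compute W x + U h + b element by element."""
    return [wx + uh + bi for wx, uh, bi in zip(matvec(W, x), matvec(U, h), b)]


def sigmoid(v: Vector) -> Vector:
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


def gru_step(x: Vector, h_prev: Vector, p: Dict[str, list]) -> Vector:
    """One GRU update under the formulation assumed in the lead-in."""
    z = sigmoid(affine(p["W_z"], x, p["U_z"], h_prev, p["b_z"]))  # update gate
    r = sigmoid(affine(p["W_r"], x, p["U_r"], h_prev, p["b_r"]))  # reset gate
    gated = [ri * hi for ri, hi in zip(r, h_prev)]                # r_t ⊙ h_{t-1}
    h_cand = [math.tanh(a) for a in affine(p["W_h"], x, p["U_h"], gated, p["b_h"])]
    # Interpolate between the previous state and the candidate state.
    return [(1.0 - zi) * hp + zi * hc for zi, hp, hc in zip(z, h_prev, h_cand)]


def zeros(rows: int, cols: int) -> Matrix:
    return [[0.0] * cols for _ in range(rows)]


# With all-zero parameters both gates equal 0.5 and the candidate state is 0,
# so the step simply halves the previous hidden state.
params = {
    "W_z": zeros(2, 2), "U_z": zeros(2, 2), "b_z": [0.0, 0.0],
    "W_r": zeros(2, 2), "U_r": zeros(2, 2), "b_r": [0.0, 0.0],
    "W_h": zeros(2, 2), "U_h": zeros(2, 2), "b_h": [0.0, 0.0],
}
h_next = gru_step([1.0, -1.0], [2.0, -4.0], params)
```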

3) BIDIRECTIONAL GRU (BIGRU)
Each word in a text is dependent on its previous and future words; thus, an effective approach should capture the relevant information from both past and future directions. To achieve this goal, GRU can be generalized to bidirectional GRU [25]. A BiGRU network consists of two parallel layers, one propagating forward and one propagating backward. Thus, the past and future information in the text sequence can be encoded in the network. The forward layer is denoted as $\overrightarrow{h}_t$, which reads the tweet from the first word to the last; the backward layer, denoted as $\overleftarrow{h}_t$, reads it from the last word to the first.
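The two-direction pass can be sketched independently of the recurrent cell. In the sketch below, `toy_step` is a hypothetical stand-in for a GRU cell (a running sum, chosen so the result is easy to check by hand), and combining the two directions by concatenation at each position is an assumption; it is the usual choice for bidirectional recurrent networks, but this section does not state it.

```python
from typing import Callable, List

Vector = List[float]
Step = Callable[[Vector, Vector], Vector]


def run_direction(xs: List[Vector], step: Step, hidden_size: int) -> List[Vector]:
    """Apply a recurrent step over xs in order, starting from a zero state."""
    h = [0.0] * hidden_size
    outputs = []
    for x in xs:
        h = step(x, h)
        outputs.append(h)
    return outputs


def bidirectional(xs: List[Vector], fwd: Step, bwd: Step, hidden_size: int) -> List[Vector]:
    """Concatenate forward and backward states position by position."""
    forward = run_direction(xs, fwd, hidden_size)
    # Read the sequence last-to-first, then restore the original order.
    backward = run_direction(xs[::-1], bwd, hidden_size)[::-1]
    return [f + b for f, b in zip(forward, backward)]


def toy_step(x: Vector, h_prev: Vector) -> Vector:
    """Hypothetical stand-in for a GRU cell: a running sum of the inputs."""
    return [hp + xi for hp, xi in zip(h_prev, x)]


# Position t sees everything up to t (forward) and everything from t on (backward).
H = bidirectional([[1.0], [2.0], [3.0]], toy_step, toy_step, hidden_size=1)
```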

4) MASKED ATTENTION LAYER
Keywords are the critical elements that help people understand the meaning of a text quickly. In tweet comprehension, keywords play an even more critical role, as tweets are usually written in informal and ambiguous language. Thus, we deploy the attention mechanism to quantify the degree of relevance or importance of each word to the meaning of the tweet [26]. The intuitive idea behind the attention mechanism is to reward the keywords by assigning a bigger weight to them. In the traditional attention mechanism, given the output of the BiGRU, a weight is computed for every position in the sequence, including any padding. It should be noted that the text lengths of the tweets in the corpus vary, so tweets of different lengths must be handled by an appropriate method. The solution is to add padding symbols to the short texts so that all the texts have the same length. The padding is meaningless and contains no semantic information; thus, the weight assigned to the padding should be 0 so that the weights for the real informative words are not diluted. We therefore combine masking with the attention method: the attention-based weights are normalized exponentials of the word scores computed over the real words only, and every padding position receives a weight of 0. The encoding of the tweet is the weighted sum of the BiGRU outputs. To summarize, the attention-based model is used to encode the tweets. The attention layer can give different levels of "attention" to words with different degrees of relevance to the text.
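The masking idea can be sketched as a softmax restricted to real tokens. The sketch assumes that a scalar score per position is already available; how the paper computes these scores is not recoverable from this section, so `scores` is simply an input here. The function name `masked_attention` and the 0/1 `mask` convention (1 for a real word, 0 for padding) are likewise assumptions.

```python
import math
from typing import List, Tuple


def masked_attention(
    hidden: List[List[float]], scores: List[float], mask: List[int]
) -> Tuple[List[float], List[float]]:
    """Return (weights, context) where padding positions get weight 0.

    weights[t] = mask[t] * exp(scores[t]) / sum_k mask[k] * exp(scores[k])
    context    = sum_t weights[t] * hidden[t]
    """
    real_scores = [s for s, m in zip(scores, mask) if m]
    if not real_scores:
        raise ValueError("at least one unmasked position is required")
    top = max(real_scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) if m else 0.0 for s, m in zip(scores, mask)]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(hidden[0])
    context = [sum(w * h[d] for w, h in zip(weights, hidden)) for d in range(dim)]
    return weights, context


# Two real words followed by one padding position with a large (ignored) score.
weights, context = masked_attention(
    hidden=[[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]],
    scores=[0.0, 0.0, 5.0],
    mask=[1, 1, 0],
)
```

Without the mask, the padding position in this example would receive most of the weight because of its large score; with the mask, the two real words share the weight equally.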

5) SOFTMAX LAYER
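The output of the attention layer is fed into a SoftMax function to perform the classification. Specifically, we can calculate the probability that a tweet belongs to class $j$ as

$$P(y = j \mid s) = \frac{\exp(w_j^\top s + b_j)}{\sum_{c=1}^{C} \exp(w_c^\top s + b_c)},$$

where $s$ is the output of the attention layer, $w_j$ and $b_j$ are the weight vector and bias of class $j$, and $C$ is the number of classes of the text. In the COVID-19 PHM identification task, $C = 4$, as there are four classes. A new tweet will be assigned to the class with the highest probability.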
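A small sketch of this final step: a linear layer followed by SoftMax over the four classes, with the predicted class taken as the arg max. The weight matrix, bias, and input vector below are toy values chosen for illustration, and the linear-layer form of the classifier is an assumption of this sketch.

```python
import math
from typing import List, Tuple

LABELS = ["self-mention", "other-mention", "awareness", "non-health"]


def softmax(logits: List[float]) -> List[float]:
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(s: List[float], W: List[List[float]], b: List[float]) -> Tuple[str, List[float]]:
    """Return the most probable label and the full probability vector."""
    logits = [sum(wi * si for wi, si in zip(row, s)) + bj for row, bj in zip(W, b)]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs


# Toy parameters: only the "awareness" row responds to the first feature.
label, probs = classify(
    s=[1.0, 0.0],
    W=[[0.0, 0.0], [0.0, 0.0], [2.0, 0.0], [0.0, 0.0]],
    b=[0.0, 0.0, 0.0, 0.0],
)
```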

C. MODEL TRAINING USING EPOCH-WISE MOVING AVERAGE OF THE PARAMETERS
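All the parameters in the BiGRU units and the attention layers (i.e., the $W$, $U$, and $b$ in equations (1) to (7)) are estimated using the annotated tweets. The whole tweet corpus is randomly divided into training, validation, and testing sets (with 80%, 10%, and 10% of the tweets, respectively, in this paper). The model is trained on the training set (i.e., the estimation of the parameters) by minimizing the cross-entropy loss between the true label and the predicted label distributions, i.e.,

$$L = -\sum_{d} \sum_{j=1}^{C} y_{dj} \log \hat{y}_{dj},$$

where $y_{dj}$ is 1 if tweet $d$ has true label $j$ and 0 otherwise, and $\hat{y}_{dj}$ is the predicted probability of label $j$ for tweet $d$.

Rather than using the parameters from a single epoch, we take an epoch-wise moving average of the parameters indexed by $i \in \{h, z, r, w\}$, where $t$ denotes the training epoch; for $t \le 2$, we simply take the average of all the previous epochs' parameters as the current epoch's parameter value.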
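The averaging step can be sketched as follows. The window length of 3 epochs is a hypothetical value chosen so that the early-epoch rule in the text (average everything available while fewer epochs exist) falls out naturally; the actual window is not recoverable from this section. Storing one raw parameter snapshot per epoch and averaging those snapshots is also an assumption, as are the names `epoch_wise_average` and `history`.

```python
from typing import Dict, List

Params = Dict[str, List[float]]


def epoch_wise_average(history: List[Params], window: int = 3) -> Params:
    """Average each parameter over the most recent `window` epoch snapshots.

    When fewer than `window` snapshots exist, all available snapshots are
    averaged. `window=3` is a hypothetical choice for this sketch.
    """
    recent = history[-window:]
    averaged: Params = {}
    for name in recent[0]:
        columns = zip(*(snapshot[name] for snapshot in recent))
        averaged[name] = [sum(col) / len(recent) for col in columns]
    return averaged


# One flattened parameter vector per epoch (toy values).
history = [
    {"W_h": [1.0, 2.0]},
    {"W_h": [3.0, 4.0]},
]
early = epoch_wise_average(history)      # averages both available epochs

history += [{"W_h": [5.0, 6.0]}, {"W_h": [7.0, 8.0]}]
later = epoch_wise_average(history)      # averages the last three epochs
```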

A. DATA PREPARATION
We built a COVID-19 tweet corpus for PHM identification containing 11,231 tweets posted from February to May 2020. The tweets were collected using hashtags such as "COVID," "SARS," "coronavirus," "corona," "pandemic," and "quarantine." We further processed the tweets to remove the mentions, hashtags, and links and only keep the relevant textual content. Two annotators with relevant background knowledge in medicine and public health independently annotated the 11,231 tweets. We used Fleiss's kappa to evaluate the inter-annotator agreement between the annotators. The Fleiss's kappa value was 0.76, suggesting a substantial agreement of the annotation [32]. For the tweets annotated with different labels, the two annotators and one consolidator with expertise in public health conducted further discussions to resolve the annotation disagreement. The four labels are as follows:
1 (self-mention): The tweet mentions a disease or health condition for the person who posted it.
2 (other-mention): The tweet mentions a disease or health condition for a person other than the person who posted it.
3 (awareness): The tweet contains the name of the disease but is not related to any specific people being sick.
4 (non-health): The tweet may contain the name of the disease but is not related to health.
Table I shows examples of these four types of tweets. The four labels are in descending order of importance from a health-monitoring perspective. The first two reflect the status of the disease and the other two do not. Thus, if a tweet can be assigned multiple labels, we keep only the more important label. For example, "We got our coronavirus test results. I am positive. A is also positive. The basement quarantine continues" qualifies for both the self-mention and other-mention labels, but we assign only the more important self-mention label to the tweet. The data size and the distribution of the data among the four classes are shown in Table II. The data distribution is highly imbalanced. The data sizes for classes 1 and 2 are small, and class 3 has a large majority of the data. Although the annotated tweets are not on a massive scale due to time and cost constraints, the statistics in Table II roughly indicate the proportion of different types of COVID-19-related tweets on Twitter. It is reasonable that significantly fewer patients will post tweets and that most of the tweets reflect people's awareness of the disease.
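The rule for tweets that qualify for more than one label can be sketched in a few lines. The numeric priorities mirror the label numbers in the text (1 is the most important); the dictionary `PRIORITY` and the function name `resolve_label` are hypothetical names introduced for this sketch.

```python
from typing import Iterable

# Lower number = more important from a health-monitoring perspective.
PRIORITY = {
    "self-mention": 1,
    "other-mention": 2,
    "awareness": 3,
    "non-health": 4,
}


def resolve_label(candidates: Iterable[str]) -> str:
    """Keep only the most important label among the candidates."""
    return min(candidates, key=PRIORITY.__getitem__)


# The example tweet mentions both the poster and another person as positive.
final_label = resolve_label(["other-mention", "self-mention"])
```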

B. MODEL TRAINING
The entire data set was randomly divided into training, validation, and testing sets, with 80%, 10%, and 10% of the tweets, respectively. We used stratified splitting to make sure that the proportion of these four labels was roughly the same in the training, validation, and testing sets. Adam optimization was used, with the learning rate of 0.005. The embedding dropout rate was set at 0.3, and the hidden dropout rate was set at 0.5. The hyperparameters were primarily set by trial based on the performance. We lowercased all the words in the tweets and used twikenizer to tokenize the tweets.
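Stratified splitting can be sketched with the standard library alone: indices are grouped by label, shuffled within each group, and each group is cut 80/10/10. The function name `stratified_split`, the fixed seed, and the rounding rule for group sizes are assumptions of this sketch, not details taken from the paper.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def stratified_split(
    labels: Sequence[str],
    fractions: Tuple[float, float, float] = (0.8, 0.1, 0.1),
    seed: int = 0,
) -> Tuple[List[int], List[int], List[int]]:
    """Return (train, validation, test) index lists with similar label mixes."""
    by_label: Dict[str, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    train: List[int] = []
    val: List[int] = []
    test: List[int] = []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_val = round(len(indices) * fractions[1])
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


# Toy imbalanced corpus: 80 awareness tweets and 20 self-mention tweets.
toy_labels = ["awareness"] * 80 + ["self-mention"] * 20
train_idx, val_idx, test_idx = stratified_split(toy_labels)
```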

C. PERFORMANCE MEASURE
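We use accuracy, precision, recall, and F1 as performance measures, as they are the major metrics for classification tasks [27]. COVID-19 PHM identification is modeled as a multi-class classification problem. Precision, recall, and F1 for each class $c$ can be defined as

$$\text{Precision}_c = \frac{TP_c}{TP_c + FP_c}, \quad \text{Recall}_c = \frac{TP_c}{TP_c + FN_c}, \quad \text{F1}_c = \frac{2 \cdot \text{Precision}_c \cdot \text{Recall}_c}{\text{Precision}_c + \text{Recall}_c},$$

where $TP_c$, $FP_c$, and $FN_c$ are the numbers of true positives, false positives, and false negatives for class $c$. To evaluate the overall performance across multiple classes, weighted precision, recall, and F1 score are used. The overall accuracy is defined as the ratio between the total number of true positives and the size of the testing set.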
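These measures can be computed directly from the true and predicted labels. The sketch below assumes that "weighted" means weighting each class by its share of the true labels (its support), which is the usual convention but is not spelled out in the text; the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence, Tuple


def per_class_prf(y_true: Sequence[Hashable], y_pred: Sequence[Hashable],
                  label: Hashable) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for one class (0.0 when undefined)."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def weighted_scores(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> Dict[str, float]:
    """Support-weighted precision/recall/F1 plus overall accuracy."""
    n = len(y_true)
    totals = {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    for label, count in Counter(y_true).items():
        p, r, f = per_class_prf(y_true, y_pred, label)
        weight = count / n  # class share of the true labels (assumed weighting)
        totals["precision"] += weight * p
        totals["recall"] += weight * r
        totals["f1"] += weight * f
    totals["accuracy"] = sum(t == p for t, p in zip(y_true, y_pred)) / n
    return totals


scores = weighted_scores(y_true=[1, 1, 2, 2], y_pred=[1, 2, 2, 2])
```

Under support weighting, the weighted recall always equals the overall accuracy, which is a quick sanity check on any reported table.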

A. OVERALL PERFORMANCE
To evaluate the effectiveness of the proposed approach, we use four popular methods as the baseline for performance comparison. These methods are:
• fastText [23]: fastText was developed by Facebook in 2017 for text classification purposes. It has become the de facto approach for text classification, due to its simplicity and effectiveness. The embeddings of each word in the text and the corresponding n-gram features are fed into a hierarchical SoftMax to get the corresponding predicted labels. For example, the 4-grams of the word "cough" are "<cou", "coug", "ough", and "ugh>", where "<" and ">" indicate the start and end of a word. A 300-dimension GloVe is used for the embeddings.
• Convolutional neural network (CNN) [28]: CNN takes the embeddings of the text (in the format of an embedding matrix) as the input. A set of convolutional kernels is applied to the input to extract the characteristics of the text. After certain operations, such as max-pooling and dropout, the features are fed into a SoftMax function to classify the text. CNN has been acknowledged to effectively extract the hierarchical and ordering information in the text [42]. In our experiment, the word embeddings dimension was 64; the kernel sizes were set at 2, 3, and 4; the dropout rate was 0.5, and the learning rate of Adam was 0.005. All hyperparameters were optimized by trial.
• Bidirectional long short-term memory network (BiLSTM) [29]: LSTM is a type of RNN unit that is capable of capturing the long-range contextual dependency in a text. Each LSTM unit contains a forget gate, an input gate, and an output gate to control the information flow and update in the network. BiLSTM is a parallel LSTM structure that fully leverages the forward and backward contextual information [37]. Similar to CNN, in our experiment, 64-dimension embeddings and a dropout rate of 0.5 were used, and the learning rate of the Adam optimizer was 0.005.
• Bidirectional Encoder Representations from Transformers (BERT) [33]: BERT is a language model trained on BooksCorpus and Wikipedia, which together contain more than three billion words. BERT and its variants have shown an extremely good capability of extracting the semantic features from texts and have achieved state-of-the-art performance in many NLP tasks [43]. This paper uses BERT-Base to encode the tweets. The representation of the tweets was fed to a SoftMax function to perform the classification task. The batch size was 32 and the learning rate was 0.00005. The BERT-Base model was fine-tuned on the training set and tested on the testing set.
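The character n-gram example given for fastText can be reproduced with a short function. This sketch only illustrates the boundary-marker convention described in the text; the function name `char_ngrams` is hypothetical and the code is not taken from the fastText library.

```python
from typing import List


def char_ngrams(word: str, n: int) -> List[str]:
    """Return the character n-grams of a word wrapped in '<' and '>' markers."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


grams = char_ngrams("cough", 4)
```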
The accuracy, precision, recall, and F1 scores of all the methods used for COVID-19 PHM identification are shown in Table III. The best performances are highlighted in bold. The attention model with the masking operation achieves the best performance in terms of accuracy, recall, and the F1 score. For precision, the Masked-Attn model is outperformed only by BERT. Thus, the Masked-Attn model achieves the best overall performance due to its efficiency at extracting information from the text, even the short texts used in tweets.
It should be noted that the powerful BERT model does not achieve a satisfactory overall result, although it has the highest precision. The reason is that the BERT model is very large and has been pre-trained on the general BooksCorpus and Wikipedia. There may be a great semantic difference between the PHM corpus and the corpus used to train BERT. The PHM corpus built in this paper is on a small-to-medium scale. Fine-tuning BERT on this limited PHM data may not adapt BERT to the PHM domain well. However, given a massive amount of annotated PHM data, BERT would have great potential to outperform existing methods.
Tweets are relatively short compared with ordinary long texts. Research in natural language processing has observed that the same technique usually performs worse for short texts in some tasks (such as named entity recognition) due to the lack of sufficient contextual information in short texts [8]. Thus, the power of other deep-learning-based methods to extract semantic information from texts cannot be fully exploited. Some methods' performances are even worse than that of the simple fastText.

B. THE EFFECT OF ATTENTION AND MASKING OPERATION ON PERFORMANCE
The attention mechanism emphasizes the role of keywords in understanding the text. To show its effectiveness, we conducted the experiment using the following two network structures.
• BiGRU: The details of BiGRU can be found in section III.B.3 of this paper. The output of the BiGRU layer was fed directly to the SoftMax function without passing through the attention layer. In the experiment, 128-dimension embeddings and a dropout rate of 0.3 were used, and the learning rate of the Adam optimizer was 0.0005. The training batch size was 32 and the number of training epochs was 10.
• BiGRU-Attn: To show the effect of the masking operation on the performance, we conducted an experiment using the vanilla attention architecture, i.e., a BiGRU network followed by an attention layer. The model architecture was the same as Fig. 1, except all the masks were removed. The hyperparameter setting was the same as for BiGRU.
The experiment results in Table IV also show that the attention mechanism can significantly improve the performance, as Masked-Attn outperformed BiGRU by a significant margin (p-value=0.031 for the one-tailed two-sample p-test on the overall accuracy). We also noticed that masking did improve the performance significantly, as shown in Table IV (p-value=0.001 for the two-sample p-test on the overall accuracy between BiGRU-Attn and Masked-Attn). Thus, to calculate the attention among tokens in the text, it is better to mask out the padding tokens so they will not decrease the contribution of real tokens in the classification task.
According to the attention values calculated in the model, COVID-19-related keywords tended to have bigger attention weights. These keywords improved the performance of PHM identification. The visualization of tokens' attention in typical tweets is shown in Fig. 2. Darker shades indicate bigger values. For the first tweet, the tokens "tested," "positive," "for," "corona," "covid19," and "me" have relatively big attention values. They are either directly related to COVID-19 or pronouns ("me"). The word "for" also has big attention. This is because its adjacent words ("tested," "positive," and "corona") all have big attention values. Thus, the attention of "for" is affected. For the second tweet, similar findings can be observed. The words "i" (the second one), "tested," "negative," and "test" have relatively big attention values. Based on this observation, we know the COVID-19-related keywords contribute more in the masked attention model and thus lead to better performance.

C. THE EFFECT OF AVERAGING-BASED MODEL TRAINING ON PERFORMANCE
We leveraged the model parameters trained in different training epochs to update the final model parameters. To show the effectiveness of this training strategy, we conducted another experiment using the same network structure as shown in Fig. 1, but trained without the epoch-wise average. As shown in Table V, the method tends to favor the class with the most training data, label 3 (awareness), as the F1 score of class 3 is the highest. Thus, with the training method without the epoch-wise average, model parameters are largely determined by the result of the awareness label; the performance of the other labels' classification is not fully considered in the training. The epoch-wise, average-based training could compensate for such class imbalance. The advantage of the epoch-wise average can be attributed to its generalization capability. During model training, the parameters obtained in each epoch are distributed on the periphery of a region of the parameter space. The center of this region usually has a higher generalization capability. Epoch-wise averaging moves the parameters closer to the center of the region and thus exhibits better generalization and PHM identification performance.

D. THE GENERALIZABILITY OF THE PROPOSED METHOD
Twitter provides a promising data source for more effectively identifying PHMs for public health monitoring. In addition, tweets are continuously produced and updated, so they can quickly capture the latest trends in the COVID-19 public health condition of a region. New textual and semantic information related to COVID-19 may therefore appear as new tweets emerge. For example, the discussion on the omicron variant of COVID-19 emerged in 2022. In this section, we explore the generalization capability of the proposed method. Specifically, we investigate whether the model learned from old data can accurately detect PHMs in new tweets.
To conduct the analysis, we collected 200 tweets posted from January to June 2021 (denoted as testing set II; testing set I was the original test set used in Table III) and 200 tweets posted from September 2021 to March 2022 (denoted as testing set III). In addition to the hashtags used to build the COVID-19 corpus of 11,231 tweets, we used further hashtags such as "vaccine," "lockdown," "socialdistancing," and "omicron." The newly collected tweets were cleaned to remove non-textual content, such as mentions, hashtags, and links. They were annotated by two annotators independently in the same manner as before. Annotation disagreements were resolved by a group comprising the two annotators and two more senior experts in public health and social media mining. The percentages of each label in the newly annotated tweets were 5.5% (self-mention), 18% (other-mention), 67% (awareness), and 9.5% (non-health) for testing set II and 6.5% (self-mention), 19.5% (other-mention), 64% (awareness), and 10% (non-health) for testing set III. The class imbalance issue still existed but was not as serious as in the original tweets corpus shown in Table II. The masked attention model trained on the original COVID-19 tweets corpus was used to identify the PHMs in the two newly collected data sets. The performance comparison on the various testing sets can be found in Table VI. Testing set I contains the tweets from the original corpus, with postings dated from February to May 2020. The accuracy, precision, recall, and F1 score on testing sets II and III were worse than the results for testing set I, but the differences were not statistically significant (p-value = 0.647 for the test of the difference in accuracy between testing sets I and II; p-value = 0.357 for the test of the difference in accuracy between testing sets I and III). Thus, the experimental results show that the model trained on the original tweets corpus has a very satisfactory generalization capability.
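The text does not state which statistical test produced the reported p-values. As an illustration only, one common way to compare the accuracies of a classifier on two independent test sets is a two-sided two-proportion z-test, sketched below. The function name `two_proportion_z_test` and the counts in the usage example are hypothetical; they are not the results in Table VI and are not intended to reproduce the reported p-values.

```python
import math


def two_proportion_z_test(correct_a, n_a, correct_b, n_b):
    """Two-sided z-test for the difference between two accuracies.

    correct_a, correct_b: numbers of correctly classified tweets.
    n_a, n_b: sizes of the two independent test sets.
    Returns (z statistic, two-sided p-value).
    """
    p_a = correct_a / n_a
    p_b = correct_b / n_b
    # Pooled accuracy under the null hypothesis of equal accuracies.
    pooled = (correct_a + correct_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    if se == 0.0:
        raise ValueError("standard error is zero; the test is undefined")
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical counts: 160 of 200 correct versus 156 of 200 correct.
z, p = two_proportion_z_test(160, 200, 156, 200)
print(f"z = {z:.3f}, p-value = {p:.3f}")
```

A p-value above the chosen significance level (for example, 0.05) would indicate that the observed drop in accuracy could plausibly be due to sampling variation, which is the sense in which the differences above are described as not significant.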

VI. CONCLUSION
This paper aimed to automate COVID-19 personal health mention (PHM) detection using deep-learning and natural-language-processing techniques. We constructed a COVID-19 tweets corpus containing 11,231 annotated tweets. Each tweet was annotated as non-health, awareness, self-mention, or other-mention. COVID-19 PHM identification was modeled as a text classification task.
An attention-based model was trained to classify each tweet according to the four classes. Promising results were achieved in terms of the overall F1 score. Additional experiments were conducted to study the effect of training data size on performance. The methods tended to favor the classes with larger numbers of training samples, with classes holding more data yielding greater reliability and higher classification performance. Through extensive experiments, we have also shown that the proposed method has a good generalization capability. Thus, the model developed using the old data set can be applied to the new tweets data sets with satisfactory performance. However, this paper has several limitations. The methods leverage only Twitter data, which are short texts. Incorporating domain knowledge in public health and medicine is expected to improve the performance of short-text classification. In the fields of public health and medicine, domain knowledge and resources are especially useful, as many professional terms and instances of jargon appear in the data set. In our future work, we will combine information from a knowledge base for the COVID-19 PHM identification task. In addition, new investigations will be conducted to compensate for classes with low sample sizes. Data resampling and even text-style transfer techniques will be attempted to mitigate the data imbalance issue. Furthermore, the regions from which the posters are tweeting are not considered in this paper. Thus, the developed method is a general method and can be applied to any region in the world. However, people in different countries may have their own tweeting styles and language syntax. Another future study would be to incorporate regional information into the model so that it adapts better to different regions.